DFlash × DDTree × Luce DFlash: How a Block Diffusion Drafter Gets Qwen3.5-27B to 207 tok/s on an RTX 3090

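On 5 February 2026, z-lab posted a paper called DFlash on arXiv. It uses a block diffusion drafter for speculative decoding and reports a 6× lossless speedup, 2.5× ahead of EAGLE-3. Two months later, Liran Ringel and Yaniv Romano of the Technion released DDTree, which asks: DFlash already produces a per-position distribution in a single pass, so why throw it away? Build a tree from it and verify every branch at once. The result is an improvement in all 60 of 60 settings, with Qwen3-30B-MoE reaching 8.22× on HumanEval.
The most recent piece arrived a few weeks ago: a team called Luce-Org wrote ~2000 lines of C++/CUDA spanning GGUF, ggml, and 3 custom CUDA kernels, which lets a single 24 GB RTX 3090 run Qwen3.5-27B at a peak of 207 tok/s and a HumanEval mean of 129.5 tok/s, a 3.43× speedup that stays 100% lossless.
Unlike a typical paper review that only covers the theory, this post works through three layers: why a diffusion drafter works in DFlash, why the marginals from one pass are enough for DDTree to build a tree, and how Luce DFlash fits a 27B model onto a 24 GB consumer GPU. It also covers an architectural detail that is easy to miss: Qwen3.5 is not a dense Transformer but a Gated DeltaNet hybrid.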

TL;DR

The three pieces, one line each:

- ⚡ DFlash (arXiv 2602.06036): uses a block diffusion model as the drafter, generating 16 candidate tokens in a single forward pass and outperforming EAGLE-3 (6× lossless speedup)
- 🌳 DDTree (arXiv 2604.12989): DFlash's single pass already yields per-position marginal distributions, so DDTree builds a tree from them (best-first heap, fixed budget) and verifies it in one go; all 60 of 60 settings improve
- 📦 Luce DFlash (Luce-Org/lucebox-hub): the first GGUF port; runs Qwen3.5-27B on a 24 GB RTX 3090 at 129.5 tok/s mean / 207 tok/s peak, with 128K context, a 3.43× speedup, and output bit-identical to autoregressive decoding
- 🎯 Core insight: block diffusion naturally outputs a per-position distribution, which is free tree material; chain-style drafters such as EAGLE have no equivalent
- 🧱 Hidden catch: Qwen3.5-27B is not a dense Transformer. It is a 64-layer hybrid in which only every 4th layer is full softmax attention and the rest are Gated DeltaNet; that recurrent (SSM-style) state is the hardest part of rollback

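1. Background: What Problem Does Speculative Decoding Solve?

The root cause of slow LLM inference is well known: autoregressive decoding is sequential. Every generated token requires reading the full model weights once, so memory bandwidth directly bounds throughput.

Take Qwen3.5-27B Q4_K_M (~16 GB of weights) on an RTX 3090:

- VRAM bandwidth: 936 GB/s
- Theoretical max throughput: 936 / 16 ≈ 58 tok/s
- Measured autoregressive: 37.7 tok/s (hitting the memory wall)

Whether you use vLLM, SGLang, or llama.cpp, plain autoregressive decoding cannot break through this ceiling, because it is a physical hardware limit, not a software bug.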
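
The ceiling above is plain arithmetic, and a few lines make it easy to redo for other cards or quantizations. This is a minimal sketch; the function name is made up for illustration, and it assumes every decoded token reads all weights exactly once and nothing else.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when every token must read all weights once."""
    return bandwidth_gb_s / weights_gb

ceiling = bandwidth_ceiling_tok_s(936.0, 16.0)   # figures from the list above
measured = 37.7                                  # measured autoregressive tok/s
utilization = measured / ceiling
print(f"ceiling: {ceiling:.1f} tok/s, measured: {measured} tok/s ({utilization:.0%} of ceiling)")
```

The measured number sits at roughly two thirds of the bound, which is why the rest of the post looks at committing more tokens per weight read rather than reading weights faster.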

The basic idea of speculative decoding

The framework proposed by Leviathan et al. (2023): use a smaller draft model to predict the next several tokens, then have the target model verify them all in one pass.

```python
# Standard autoregressive: 1 token per target forward
for i in range(N):
    token = target.forward(prev_tokens)  # 1 forward → 1 token
    prev_tokens.append(token)

# Speculative decoding: K candidates per target forward
while not done:
    draft_tokens = draft.forward(prev_tokens, k=K)  # cheap, K tokens
    accepted = target.verify(prev_tokens, draft_tokens)  # 1 forward → up to K tokens
    prev_tokens.extend(accepted)
```

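Key insight: one target forward pass produces K sets of logits (one per position), so while decoding is memory-bandwidth-bound, verifying K candidate tokens costs about the same as verifying one. If each draft token is accepted 80% of the time, every target forward commits 4-5 tokens on average, so throughput rises by up to 4-5× before draft overhead.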
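
Where does "4-5 tokens" come from? A small sketch, assuming each draft token is accepted independently with the same probability alpha (a simplification; real acceptance varies by position) and counting the one token the target always contributes per round. The function name is illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward with k drafted tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for k in (4, 8, 16):
    print(k, round(expected_tokens_per_round(0.8, k), 2))
```

At alpha = 0.8 this gives about 3.4, 4.3 and 4.9 tokens for k = 4, 8 and 16, so the gain flattens quickly as the draft gets longer.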

Why do chain drafters (EAGLE) plateau?

EAGLE / EAGLE-3 was the previous state of the art. Its draft model is a small Transformer, but drafting is still sequential:

```text
Draft step 1: predict token_1
Draft step 2: predict token_2 | token_1
Draft step 3: predict token_3 | token_1, token_2
...
```

If one draft step takes 1.5 ms, drafting 8 tokens takes 12 ms. Even though the draft model is much smaller, the accumulated draft phase ends up dominating.

DFlash's idea: if the drafter can emit all 8 tokens in parallel, the draft phase shrinks to a single forward pass.

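2. DFlash: Block Diffusion as the Drafter

⚡ DFlash in one sentence: the draft model is a 5-layer non-causal Transformer whose input is [last_target_token, MASK × 15] plus hidden states borrowed from the target model; it denoises the masked positions in a single pass and emits logits for the whole 16-position block.

2.1 Why diffusion?

The essence of a diffusion model is denoising the whole output in parallel rather than one element at a time. Image diffusion refines every pixel of the image together; DFlash's block diffusion turns a block of mask tokens into a block of tokens in a single pass.

But text is not like an image: tokens have strong sequential dependencies. After "The cat sat on the" comes "mat", not "banana". If each mask were predicted purely independently, the acceptance rate would be terrible.

DFlash's core trick: deep KV injection.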

```text
Target model (Qwen3.5-27B)
├── Layer 0
├── Layer 16  ← capture hidden state
├── Layer 32  ← capture hidden state
├── Layer 48  ← capture hidden state
└── Layer 64  ← capture hidden state

Draft model (5 layers)
├── Layer 0 ← inject Layer 16 hidden state via cross-attention
├── Layer 1 ← inject Layer 32 hidden state
├── Layer 2 ← inject Layer 48 hidden state
├── Layer 3 ← inject Layer 64 hidden state
└── Layer 4 → output 16 token logits
```

The draft model does not predict from scratch; it reads the target model's deep representations directly. In effect: the target has already done its thinking up to layer 64, and the draft model only has to decode tokens from that representation.
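
To make "inject via cross-attention" concrete, here is a toy single-head version in NumPy. It is a sketch under stated assumptions, not DFlash's actual layer: the dimensions are tiny, the projection matrices are random, and the names `cross_attention`, `block`, and `target_hidden` are invented for illustration. The point is only that every position in the draft block attends, non-causally, to hidden states captured from the target.

```python
import numpy as np

def cross_attention(block, target_hidden, w_q, w_k, w_v):
    """Single-head, non-causal cross-attention: every block position
    attends to every captured target position."""
    q = block @ w_q                      # (block_len, d)
    k = target_hidden @ w_k              # (ctx_len, d)
    v = target_hidden @ w_v              # (ctx_len, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_draft, d_target, block_len, ctx_len = 8, 20, 16, 32   # toy sizes, not the real ones
block = rng.normal(size=(block_len, d_draft))           # stands in for [last token, MASK x 15]
target_hidden = rng.normal(size=(ctx_len, d_target))    # one captured target layer
w_q = rng.normal(size=(d_draft, d_draft))
w_k = rng.normal(size=(d_target, d_draft))
w_v = rng.normal(size=(d_target, d_draft))
out, weights = cross_attention(block, target_hidden, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

There is no causal mask anywhere, so all 16 block positions are updated in the same pass.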

Five layers of non-causal denoising are enough to capture the dependencies between tokens. Acceptance length (AL, the average number of tokens accepted per round):

| Drafter | AL | Mechanism |
|---|---|---|
| EAGLE-3 chain | ~3 | Sequential autoregressive draft |
| DFlash block diffusion | ~8 | One-pass parallel denoising with deep KV injection |

Draft latency drops from 8 × 1.5 ms = 12 ms to 1 × 2 ms = 2 ms. Effective throughput: 6× over autoregressive.

2.2 Architecture: why 5 layers are enough

The size of the DFlash draft model:

```python
# Size of z-lab/Qwen3.5-27B-DFlash
draft_layers = 5            # target has 64 layers
draft_hidden = 2048         # target hidden size is 5120
block_size = 16             # 16 tokens per pass
kv_injection_layers = 4     # cross-attention to target hidden states
```

The draft model is about 3.46 GB in BF16, with roughly 1/40 of the target's compute. With deep target features injected, draft quality approaches that of a standalone small LLM.

2.3 Staying lossless: rejection sampling

The verify rule of speculative decoding (proved by Leviathan et al., 2023) guarantees that the output token distribution equals what the target model would generate on its own:

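```python
import numpy as np

def verify_chain(p, q, draft_tokens, rng):
    """p, q: (K, vocab) target / draft probabilities; returns the committed tokens."""
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p/q).
        if rng.random() < min(1.0, p[i, tok] / q[i, tok]):
            out.append(int(tok))
            continue
        # Reject: resample from the normalized residual max(0, p - q), then stop.
        residual = np.clip(p[i] - q[i], 0.0, None)
        out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
        break
    return out
```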

Verification stops at the first rejection, but every committed token is mathematically equivalent to a sample from the target model. That makes it 100% lossless, unlike lossy acceleration methods such as Medusa.
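
The lossless claim can be checked numerically for a single position. The sketch below uses made-up distributions p and q over a 4-token vocabulary and a vectorized accept/resample step (the helper name is illustrative); if the rule is right, the empirical distribution of committed tokens should match p, not q.

```python
import numpy as np

def speculative_step(p, q, n, rng):
    """Draw n tokens: propose from q, accept with min(1, p/q), else resample from the residual."""
    vocab = len(p)
    proposed = rng.choice(vocab, size=n, p=q)
    accept = rng.random(n) < np.minimum(1.0, p[proposed] / q[proposed])
    residual = np.clip(p - q, 0.0, None)
    residual /= residual.sum()
    fallback = rng.choice(vocab, size=n, p=residual)
    return np.where(accept, proposed, fallback)

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.15, 0.05])    # target distribution (made-up numbers)
q = np.array([0.25, 0.25, 0.25, 0.25])  # draft distribution (made-up numbers)
tokens = speculative_step(p, q, 200_000, rng)
empirical = np.bincount(tokens, minlength=len(p)) / len(tokens)
print(np.round(empirical, 3), "target:", p)
```

A poor drafter lowers the acceptance rate and therefore the speedup, but it does not change what gets sampled.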

3. DDTree: Tree Verification on top of DFlash

🌳 DDTree in one sentence: a single DFlash pass already outputs distributions for 16 positions, so taking only the top-1 path wastes most of that information. Build a tree under a fixed node budget and verify it in one pass, and the expected acceptance length rises from ~5 to ~8.

3.1 Observation: the per-position marginals are already free

DFlash's output:

```text
position 0: [a: 0.6, b: 0.2, c: 0.1, ...]
position 1: [d: 0.5, e: 0.3, f: 0.1, ...]
position 2: [g: 0.4, h: 0.3, i: 0.2, ...]
...
position 15: [...]
```

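Vanilla DFlash takes the argmax at every position, forms the chain a → d → g → ..., and verifies it. If position 1 is wrong (the target wanted e rather than d), verification stops immediately and the predictions for positions 2-15 are wasted.

Yet that same DFlash pass already provided the probability of e. Vanilla DFlash simply discards it.

3.2 Tree construction: best-first under a budget

DDTree assembles these marginals into a tree: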

```text
root (last bonus token)
├── a (0.6)
│   ├── d (0.5)
│   │   ├── g (0.4)
│   │   └── h (0.3)
│   └── e (0.3)
│       ├── g (0.4)
│       └── h (0.3)
├── b (0.2)
│   └── d (0.5)
│       └── g (0.4)
└── c (0.1)
```

node_budget = 22 means the whole tree has at most 22 nodes. Construction uses a best-first heap: at each step, expand the candidate with the highest estimated acceptance probability.

Objective: maximize

$$E[L_{accept}] = \sum_{path} P(path) \cdot L(path)$$

where $P(path)$ is the product of the per-position marginals along the path (a factorized assumption) and $L(path)$ is the path length. In practice DDTree does not use dynamic programming; it uses a best-first heap: push each candidate node with score P(parent_path) × P_marginal(node), and pop the top 22.
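
The best-first construction can be sketched with Python's `heapq`. This is an illustration of the idea as described above, not the paper's or Luce DFlash's code: the function name, the node tuple layout, and the lazy "next sibling / first child" expansion are assumptions made here to keep the heap small.

```python
import heapq
import numpy as np

def build_tree(marginals, budget):
    """marginals: (positions, vocab) per-position probabilities.
    Returns nodes as (parent_index, position, token, path_prob); parent -1 is the root."""
    order = np.argsort(-marginals, axis=1)   # tokens sorted by probability, per position
    nodes = []
    # Heap entries: (-path_prob, parent_index, position, rank, parent_path_prob)
    heap = [(-marginals[0, order[0, 0]], -1, 0, 0, 1.0)]
    while heap and len(nodes) < budget:
        neg_p, parent, pos, rank, parent_prob = heapq.heappop(heap)
        prob = -neg_p
        nodes.append((parent, pos, int(order[pos, rank]), prob))
        me = len(nodes) - 1
        # Next sibling: same parent, next-best token at this position.
        if rank + 1 < marginals.shape[1]:
            sib = order[pos, rank + 1]
            heapq.heappush(heap, (-parent_prob * marginals[pos, sib], parent, pos, rank + 1, parent_prob))
        # First child: best token at the next position.
        if pos + 1 < marginals.shape[0]:
            child = order[pos + 1, 0]
            heapq.heappush(heap, (-prob * marginals[pos + 1, child], me, pos + 1, 0, prob))
    return nodes

marginals = np.array([
    [0.6, 0.2, 0.1, 0.06, 0.04],   # position 0: a, b, c, ...
    [0.5, 0.3, 0.1, 0.06, 0.04],   # position 1: d, e, f, ...
    [0.4, 0.3, 0.2, 0.06, 0.04],   # position 2: g, h, i, ...
])
nodes = build_tree(marginals, budget=5)
for parent, pos, token, prob in nodes:
    print(f"parent={parent:2d} position={pos} token={token} path_prob={prob:.3f}")
```

With a budget of 5 the tree picks a, a→d, b, a→e, a→d→g in that order, i.e. it spends nodes on a second option at position 1 before going deeper, which is exactly what the argmax chain cannot do.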

3.3 Tree verification: one forward pass

The key trick is an ancestor-only attention mask. The target model processes the whole tree (22 tokens) at once, but each token may attend only to itself and its own ancestors, never to its siblings.

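Rows are the attended-to tokens, columns are the attending tokens; a 1 means the column token may attend to the row token (g' and h' are the children of e):

```text
         a   d   g   h   e   g'  h'  ...
root:    1   1   1   1   1   1   1   ...
a:       1   1   1   1   1   1   1
d:       .   1   1   1   .   .   .
g:       .   .   1   .   .   .   .
h:       .   .   .   1   .   .   .
e:       .   .   .   .   1   1   1
...
```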

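This way a single forward pass over the 22 tokens yields 22 sets of logits, each being the next-token distribution for that token's own path. Verification costs about one target forward pass, the same as verifying a single chain, but covers a whole tree of candidates instead of one path.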
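
Building such a mask from a parent array takes only a short loop. The sketch below follows the layout of the matrix above (row = attended-to token, column = attending token) and includes the root as node 0; the function name and the parent-array encoding are assumptions for illustration.

```python
import numpy as np

def ancestor_mask(parents):
    """mask[key, query] is True when `key` is `query` itself or one of its ancestors.
    parents[i] is the parent index of node i; the root has parent -1."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for query in range(n):
        key = query
        while key != -1:
            mask[key, query] = True
            key = parents[key]
    return mask

#          root a  d  g  h  e  g' h'
parents = [-1,  0, 1, 2, 2, 1, 5, 5]
mask = ancestor_mask(parents)
print(mask.astype(int))
```

Each column has as many ones as the node's depth plus one, so siblings such as d and e never see each other even though they sit in the same batch.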

3.4 Walk + bonus token

After verification:

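```python
def walk_tree(root, target_logits, sample):
    """Follow the target's choices down the tree; return (accepted nodes, bonus token)."""
    node, accepted = root, []
    while True:
        # The target's own next token after the path ending at `node`.
        target_token = sample(target_logits[node.index])
        child = node.children.get(target_token)  # children: dict keyed by token id
        if child is None:
            # First mismatch (or a leaf): the target's token becomes the bonus.
            return accepted, target_token
        accepted.append(child)
        node = child
```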

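The bonus token becomes the root of the next round. So even if the whole tree is rejected, at least one token (the target's own sample) is committed, which guarantees progress.

3.5 Results: budget vs speedup trade-off

Speedups reported in the DDTree paper (temperature 0, relative to autoregressive):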

Qwen3-8B

| Benchmark | DFlash | DDTree | Gain |
|---|---|---|---|
| MATH-500 | 5.56× | 7.52× | +1.96× |
| HumanEval | 4.84× | 6.90× | +2.06× |
| GSM8K | 4.78× | 6.75× | +1.97× |

Qwen3-30B-MoE

| Benchmark | DFlash | DDTree | Gain |
|---|---|---|---|
| HumanEval | 6.09× | 8.22× | +2.13× |
| MBPP | ~7.0× | 7.7× | +0.7× |
| MATH-500 | ~5.5× | 6.2× | +0.7× |

DDTree wins in all 60 (model × dataset × temperature) settings, with the largest gains on reasoning and code.

⚠️ A bigger budget is not always better: going from a node budget of 16 → 22 → 28, acceptance length rises, but verifier latency rises faster. The sweet spot is around 22, which is Luce DFlash's default.

4. DFlash + DDTree: The Full Pipeline

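Each round commits ~8 tokens on average (AL = 8.31 on HumanEval). At a total round latency of ~12 ms that would correspond to 8 / 0.012 ≈ 667 tok/s, but in practice each round is limited by GPU memory bandwidth and KV cache traffic: the measured mean is ~130 tok/s, which at AL 8.31 works out to roughly 64 ms per round.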

5. Luce DFlash: Running It on a Single RTX 3090

📦 Luce DFlash in one sentence: z-lab's reference implementation only runs in BF16 on a B200 (the weights alone need 54+ GB of VRAM). Luce-Org wrote ~2000 lines of C++/CUDA on top of ggml to produce the world's first GGUF port, so a 24 GB RTX 3090 can run Qwen3.5-27B + DFlash + DDTree.

5.1 The gap: why no one shipped it on consumer GPUs

To run a 27B target plus the draft model and tree state on a 24 GB GPU, the stack has to look like this:

| Component | Size requirement |
|---|---|
| Target weights (27B) | ≤ 16 GB → must be quantized to Q4_K_M |
| Draft model (BF16) | 3.46 GB |
| KV cache (~32K context) | ~2-4 GB |
| DDTree tree state | ~0.5 GB |
| Total | ~22-24 GB |
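
The budget is tight enough that it is worth adding up explicitly. A minimal sketch using only the figures from the table (the dictionary keys are just labels), plus the BF16 weight size for comparison at 2 bytes per parameter:

```python
# VRAM budget in GB, (low, high) per component, taken from the table above.
budget_gb = {
    "target_weights_q4_k_m": (16.0, 16.0),
    "draft_model_bf16": (3.46, 3.46),
    "kv_cache_32k": (2.0, 4.0),
    "ddtree_state": (0.5, 0.5),
}
low = sum(lo for lo, _ in budget_gb.values())
high = sum(hi for _, hi in budget_gb.values())
bf16_weights_gb = 27e9 * 2 / 1e9   # 27B parameters at 2 bytes each
print(f"total: {low:.2f}-{high:.2f} GB of 24 GB; BF16 weights alone: {bf16_weights_gb:.0f} GB")
```

At the high end there is almost nothing left of the 24 GB, which is why the KV cache format and the hidden-state buffer matter so much in the long-context section.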

The problems:

- vLLM / SGLang: no GGUF path; AWQ INT4 works but has no DFlash integration
- z-lab reference: BF16 only, with 54 GB of target weights, so it cannot fit in 24 GB
- llama.cpp: has a GGUF Q4_K_M loader but no DFlash or DDTree
- Nobody had a tree-mode kernel for Gated DeltaNet (which makes up three quarters of Qwen3.5's layers)

So Luce-Org's choice was clear: fork llama.cpp and write 3 custom CUDA kernels.

5.2 Architectural catch: Qwen3.5 is not a dense Transformer

```text
Qwen3.5-27B layer stack (64 layers total):
├── Layer 0:  Gated DeltaNet (linear attention + recurrent state)
├── Layer 1:  Gated DeltaNet
├── Layer 2:  Gated DeltaNet
├── Layer 3:  Full softmax attention  ← every 4th
├── Layer 4:  Gated DeltaNet
├── ...
└── Layer 63: Full softmax attention
```

- 48 of 64 layers are Gated DeltaNet (linear attention with a learned recurrence; similar to Mamba's SSM but using a delta-rule update)
- 16 of 64 layers are full softmax attention
- M-RoPE, dimension sections [11, 11, 10, 0]
- 24 query heads, 4 KV heads, key/value length 256

Gated DeltaNet carries a recurrent state. After tree verification, when some path is rejected, both the SSM intermediate state and the conv window have to be rolled back to the state of the committed prefix. The stock ggml_gated_delta_net op in llama.cpp cannot roll back.

5.3 Three custom CUDA kernels

Luce DFlash adds three ggml ops in Luce-Org/llama.cpp@luce-dflash:

| Kernel | Function | What it solves |
|---|---|---|
| `ggml_ssm_conv_tree` | Tree-aware conv state gather | Each sibling reads its own K-1 window along its DDTree parent chain rather than in DFS order |
| `ggml_gated_delta_net_tree` | DeltaNet recurrence under tree mask | Ancestor-only recurrent update over the tree |
| `ggml_gated_delta_net_tree_persist` | Direct-write SSM intermediate to persistent buffer | Saves the 9 ms `ggml_cpy` per step |

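Without these three kernels, either the tree cannot be verified (the state gets scrambled) or every step pays a 9 ms copy that drags throughput down. Luce DFlash's RESULTS show that without the persist kernel, throughput falls from 130 back to ~85 tok/s.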
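
The "tree-aware conv state gather" is easier to picture in plain NumPy. The sketch below is only a CPU illustration of the idea, not the `ggml_ssm_conv_tree` kernel: the function name, the parent-array encoding, and the choice to return a window that includes the node itself are all assumptions. Each node collects its most recent inputs by walking up its own parent chain and falls back to the committed prefix when the chain is short.

```python
import numpy as np

def gather_conv_windows(parents, node_feats, prefix_tail, width):
    """For every tree node, collect the `width` most recent inputs along its
    own ancestor chain, falling back to the committed prefix when the chain is short.

    parents[i]  : parent index of node i, or -1 if its parent is the committed prefix
    node_feats  : (n_nodes, d) per-node conv inputs
    prefix_tail : (width, d) last `width` inputs of the committed prefix, oldest first
    """
    n, d = node_feats.shape
    windows = np.empty((n, width, d), dtype=node_feats.dtype)
    for i in range(n):
        chain = []
        j = i
        while j != -1 and len(chain) < width:
            chain.append(node_feats[j])      # newest first
            j = parents[j]
        missing = width - len(chain)
        older = list(prefix_tail[width - missing:]) if missing else []
        windows[i] = np.stack(older + chain[::-1])   # oldest first
    return windows

parents = [-1, 0, 0, 1]                                   # nodes 1 and 2 are siblings
node_feats = np.array([[1.0], [2.0], [3.0], [4.0]])
prefix_tail = np.array([[-3.0], [-2.0], [-1.0]])
windows = gather_conv_windows(parents, node_feats, prefix_tail, width=3)
print(windows[:, :, 0])
```

Nodes 1 and 2 are siblings, and each gets a window ending in its own input (2 or 3) with the shared parent before it; reading the batch in flat DFS order would instead leak one sibling's input into the other's window.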

5.4 Per-step rollback: snapshot + restore

```python
def one_round(target, draft_logits, deltanet_state, conv_state, kv_cache):
    # Snapshot the recurrent state before verification
    state_snapshot = {
        "ssm_intermediate": deltanet_state.clone(),
        "conv_window": conv_state.clone(),
        "kv_cache_pos": kv_cache.cur_pos,
    }

    # Build the tree and verify it in one target forward
    tree = build_ddtree(draft_logits, budget=22)
    target_logits = target.forward_tree(tree)  # uses ggml_*_tree kernels

    # Walk the tree and pick the committed path
    accepted_path, bonus = walk_tree(tree, target_logits)

    # Restore state to the committed prefix only
    restore_state(state_snapshot, len(accepted_path))
    kv_cache.cur_pos = state_snapshot["kv_cache_pos"] + len(accepted_path)

    return accepted_path + [bonus]
```

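The benefit: no replay forward. Conventional speculative decoding on a recurrent model would have to re-run the target once after committing to obtain the correct state; here the state is restored directly from the partial states produced by the tree forward.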
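
Why does keeping intermediates avoid the replay? A toy recurrence makes the point. The update below is a simplified gated delta-rule step written for illustration (scalar gate `alpha`, scalar write strength `beta`); it is an assumption, not Qwen3.5's exact layer, and it runs along a single chain rather than a tree. If the state after every drafted token is kept, rolling back to the accepted prefix is just an index lookup.

```python
import numpy as np

def delta_step(state, k, v, alpha, beta):
    """One simplified gated delta-rule update of a (d_v, d_k) state matrix."""
    state = alpha * state
    pred = state @ k                       # what the state currently recalls for key k
    return state + beta * np.outer(v - pred, k)

def run_with_intermediates(state, keys, values, alpha=0.9, beta=0.5):
    """Run the recurrence and keep the state after every token."""
    states = [state]
    for k, v in zip(keys, values):
        states.append(delta_step(states[-1], k, v, alpha, beta))
    return states   # states[n] is the state after n tokens

rng = np.random.default_rng(0)
d_k, d_v, drafted, accepted = 4, 3, 6, 2
keys = rng.normal(size=(drafted, d_k))
values = rng.normal(size=(drafted, d_v))
start = np.zeros((d_v, d_k))

states = run_with_intermediates(start, keys, values)
rolled_back = states[accepted]             # rollback = pick a saved intermediate
replayed = run_with_intermediates(start, keys[:accepted], values[:accepted])[-1]
print(np.allclose(rolled_back, replayed))
```

The saved intermediate equals what a replay over the accepted tokens would produce, so the second forward pass is unnecessary; the cost moves to storing one intermediate per drafted token, which is where a direct-write persist kernel helps.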

5.5 Long context: TQ3_0 KV cache + sliding ring

Qwen3.5-27B has a native 32K context. How does Luce DFlash stretch that to 256K in 24 GB?

KV cache quantization

| Format | Bits per value (bpv) | Memory savings vs F16 |
|---|---|---|
| F16 | 16 | 1× baseline |
| Q4_0 | 4.5 | ~3.5× |
| TQ3_0 | 3.5 | ~4.6× |

Environment variable: DFLASH27B_KV_TQ3=1 (default).
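
To see what the bits-per-value figures mean in gigabytes, here is a rough estimate. It is a sketch that assumes only the 16 full-attention layers keep a KV cache and uses the head counts listed in section 5.2 (4 KV heads, length 256); the function name is made up, and block-format overheads beyond the stated bpv are ignored.

```python
def kv_cache_gib(context_len, bits_per_value,
                 attn_layers=16, kv_heads=4, head_dim=256):
    """KV cache size in GiB, assuming only the full-attention layers keep a KV cache."""
    values_per_token = attn_layers * kv_heads * head_dim * 2   # K and V
    return context_len * values_per_token * bits_per_value / 8 / 2**30

for name, bpv in (("F16", 16), ("Q4_0", 4.5), ("TQ3_0", 3.5)):
    print(f"{name:6s} {kv_cache_gib(128 * 1024, bpv):.2f} GiB at 128K")
```

Under these assumptions a 128K cache drops from 8 GiB in F16 to under 2 GiB in TQ3_0, which is the difference between not fitting and fitting next to 16 GB of weights.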

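Sliding target_feat ring

Deep KV injection needs the target's hidden states from layers 16/32/48/64. For a 128K prompt, that is 5120 dims × 128K positions × 4 layers × 2 bytes ≈ 5.4 GB of hidden states alone, which would blow the VRAM budget.

Luce DFlash uses a 4096-slot ring buffer: position pos is written to slot pos % 4096, overwriting whatever was there. Hidden states more than 4K positions behind the current position are dropped, but since the drafter only looks at recent context the impact is small. Measured in 128K mode, HumanEval still reaches 134.78 tok/s at AL 8.33.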
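
A ring buffer of this kind is a few lines of code. The class below is an illustrative NumPy version with invented names, not Luce DFlash's data structure; it shows the `pos % slots` indexing and what the buffer costs at 4096 slots × 4 layers × 5120 dims in 16-bit floats.

```python
import numpy as np

class TargetFeatRing:
    """Fixed-size ring of captured hidden states; slot = pos % slots."""
    def __init__(self, slots, layers, hidden, dtype=np.float16):
        self.slots = slots
        self.buf = np.zeros((slots, layers, hidden), dtype=dtype)
        self.newest = -1                      # highest position written so far

    def write(self, pos, feats):
        self.buf[pos % self.slots] = feats    # overwrites the entry from pos - slots
        self.newest = max(self.newest, pos)

    def read(self, pos):
        if pos > self.newest or pos <= self.newest - self.slots:
            raise KeyError(f"position {pos} is not in the ring")
        return self.buf[pos % self.slots]

ring = TargetFeatRing(slots=4096, layers=4, hidden=5120)
ring_mib = ring.buf.nbytes / 2**20
print(f"ring buffer: {ring_mib:.0f} MiB regardless of prompt length")
```

The buffer stays at 160 MiB however long the prompt is, instead of growing into the gigabytes; the price is that reads older than 4096 positions fail, which the drafter is assumed not to need.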

5.6 Benchmarks

Qwen3.5-27B Q4_K_M, RTX 3090 24 GB, n_gen=256, 10 prompts/dataset:

| Task | AR tok/s | DFlash+DDTree tok/s | AL | Speedup |
|---|---|---|---|---|
| HumanEval | 37.78 | 129.52 | 8.31 | 3.43× |
| Math500 | 37.71 | 110.51 | 7.04 | 2.93× |
| GSM8K | 37.65 | 96.15 | 6.14 | 2.55× |

Compared with other stacks on the same hardware:

| Stack | HumanEval tok/s | Notes |
|---|---|---|
| llama.cpp Q4_K_M (AR) | 37.78 | No speculative decoding |
| SGLang AWQ INT4 (AR) | 46.6 | No DFlash, still autoregressive |
| Luce DFlash Q4_K_M | 129.52 | 27B model, 100% lossless, 128K context |

That is 2.8× faster than SGLang AWQ and 3.43× faster than llama.cpp. The demo run peaks at 207.6 tok/s (about 5.5× autoregressive).
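
The benchmark table also lets us back out the time per round. A small sketch, assuming AL counts the tokens committed per draft+verify round (the post does not say whether the bonus token is included, so treat the result as approximate); the helper name is illustrative.

```python
# (task, AR tok/s, DFlash+DDTree tok/s, AL) copied from the table above
rows = [
    ("HumanEval", 37.78, 129.52, 8.31),
    ("Math500", 37.71, 110.51, 7.04),
    ("GSM8K", 37.65, 96.15, 6.14),
]

def summarize(ar_tok_s, spec_tok_s, al):
    speedup = spec_tok_s / ar_tok_s
    round_ms = al / spec_tok_s * 1000.0     # time per draft+verify round
    ar_step_ms = 1000.0 / ar_tok_s          # time per plain autoregressive token
    return speedup, round_ms, ar_step_ms

for task, ar, spec, al in rows:
    speedup, round_ms, ar_step_ms = summarize(ar, spec, al)
    print(f"{task:10s} {speedup:.2f}x  {round_ms:.1f} ms/round  {ar_step_ms:.1f} ms/AR step")
```

Under that assumption a round takes roughly 64 ms on all three tasks, about 2.4 plain autoregressive steps, so the differences in speedup between tasks come almost entirely from acceptance length rather than from per-round cost.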

5.7 Quick start

```bash
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

# Build
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

# Models
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
    Qwen3.5-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download z-lab/Qwen3.5-27B-DFlash \
    model.safetensors --local-dir models/draft/

# Run
python3 scripts/run.py --prompt "def fibonacci(n):"

# Long-context mode (up to 256K; this example sets --max-ctx to 131072, i.e. 128K)
DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=16 \
  build/test_dflash models/Qwen3.5-27B-Q4_K_M.gguf \
  models/draft/model.safetensors prompt.bin 256 out.bin \
  --fast-rollback --ddtree --ddtree-budget=22 --max-ctx=131072
```

Requirements:

- NVIDIA sm_86+ (3090, A10, A40, 4090) or sm_110 Jetson AGX Thor
- CUDA 12+ (13+ for Thor)
- 24 GB VRAM
- ~80 GB disk for both models

6. Why Did All Three Arrive Together?

This is the back-to-back inference acceleration trilogy of early 2026.

Division of labor: academia vs engineering

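- **z-lab (DFlash)**: showed that a block diffusion drafter can beat chain drafters, and open-sourced the weights
- **Ringel & Romano (DDTree)**: noticed that the marginals are already free, and added tree verification to squeeze all the information out of DFlash
- **Luce-Org (Luce DFlash)**: ~2000 lines of C++/CUDA + 3 kernels, taking the paper from "runs on a B200" to "runs on a 3090"

Honestly, the third piece is what makes this usable by the community. However elegant the paper, the cost of BF16 on a B200 puts it out of reach for indie hackers and individual researchers. The Luce DFlash README puts it bluntly: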

Consumer GPUs can run 27B models at chat-grade speed without multi-GPU, without batching, without quantization compromises. The bottleneck was never hardware. It was the decoding algorithm.

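7. Limitations and Caveats

DFlash limitations

- Block size is hard-coded to 16: the drafter emits 16 tokens per pass and this cannot be adjusted dynamically
- Deep KV injection has to capture target hidden states: 4 layers are captured on every forward, adding ~5% to prefill cost
- Every target model needs its own trained drafter: z-lab has so far released drafters for Qwen3.5/3.6, Kimi-K2.5, gpt-oss-20b/120b, and Llama-3.1-8B; other models have to wait for community training

DDTree limitations

- Marginal independence assumption: the tree uses the factorized $P(path) = \prod P_i$, but real tokens are correlated, so acceptance length saturates once the budget gets too large
- Verifier cost grows quadratically with budget: the tree attention mask is $O(N^2)$ in tree size

Luce DFlash limitations

- Batch size 1: single-user scenario, no KV paging
- Greedy only: temperature/top_p are accepted but ignored (rejection sampling is not implemented yet)
- One model pair: only the Qwen3.5-27B Q4_K_M target with z-lab's DFlash BF16 draft; switching to another model means rewriting the graph builder
- Q4_K_M loses about 30% of acceptance length relative to BF16: ~8.31 AL on Q4_K_M vs the paper's ~12 AL on BF16; Q5_K_M / Q6_K could recover some of it but do not fit in 24 GB
- CUDA only: no Metal / ROCm; it is a separate code base from dflash-mlx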

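8. What This Means for the Inference Ecosystem

8.1 The future of drafters: parallel everywhere

DFlash shows that sequential drafting is not a necessity: block diffusion emits 16 tokens in one pass and still beats sequential drafters on quality. Possible next steps:

- Hierarchical block diffusion: one pass to sketch a 64-token outline, then one pass to refine it (a similar idea, SSD or Speculative Speculative Decoding, has already appeared at ICLR 2026)
- Prompt-conditioned drafters: different drafters for different tasks (code / math / chat)

8.2 Trees are already standard

Medusa, SpecInfer, and EAGLE-2 have all used tree verification. What sets DDTree apart is that its tree material is completely free: the drafter has already output the marginals, so no extra forward pass is needed. EAGLE-2 needs multiple drafter forwards to build its tree, whereas DDTree is purely a best-first heap over existing data.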

8.3 GGUF + custom kernel = consumer-grade SOTA

Luce DFlash shows that llama.cpp's GGUF stack has caught up to near-SOTA and is no longer a second-class citizen whose research support lags vLLM by half a year. With quantization-aware speculative decoding and hand-tuned kernels, a consumer GPU can host a 27B model at chat-grade speed (>100 tok/s).

In the second half of 2026 we will very likely see:

- llama.cpp merging the DFlash kernels upstream
- Ollama / LM Studio enabling DFlash by default for supported models
- The community publishing GGUF DFlash drafters for more model pairs (Llama-4, Mistral-Next, ...)

9. Want to Try It Yourself?

If you have a 24 GB+ NVIDIA GPU:

- Run the Luce DFlash demo: follow the quick start above and see Qwen3.5-27B at 100+ tok/s within 5 minutes
- Sweep the DDTree budget: try --ddtree-budget=8/16/22/28 to find the sweet spot for your GPU
- Test long-context mode: DFLASH27B_KV_TQ3=1 with a 128K prompt, and watch how throughput changes

If you want to train your own draft model:

- Wait for z-lab's training recipe: they have promised to open-source it
- Adapt the EAGLE-3 architecture to block diffusion: a 5-layer Transformer + cross-attention to target hidden states
- Distill on UltraChat / SlimPajama: BF16 training takes roughly 24 GPU-hours on 8× H100

If you want to contribute to Luce DFlash:

- Temperature / top-k sampling: rejection sampling in the verify path
- Full llama.cpp integration: llama-speculative-dflash.cpp + llama-cli / llama-server wiring
- New model pairs: this is the hard one, since it means rewriting the graph builder

10. Summary

The DFlash → DDTree → Luce DFlash chain shows the full lifecycle of inference acceleration research:

| Stage | Key point | Impact |
|---|---|---|
| Algorithmic insight (DFlash) | Block diffusion drafter breaks the sequential drafting bottleneck | 6× speedup, but BF16 only |
| Free lunch on top (DDTree) | The marginal distributions are already free; building a tree squeezes out another 30-40% | Wins in all 60 of 60 settings |
| Engineering port (Luce DFlash) | GGUF + 3 custom CUDA kernels + sliding KV ring | A 24 GB consumer GPU runs a 27B model at 130 tok/s |

Core takeaways:

- Do not underestimate diffusion's role in LLMs: it is not only for image generation, and as a drafter it outright beats chain methods
- Marginals are free tree material: any parallel drafter should be paired with tree verification
- The gap between engineering and research keeps narrowing: an indie team can take a paper from a B200 down to a 3090 in two weeks
- The lossless guarantee is non-negotiable: all of these speedups are mathematically equivalent to sampling from the target, with no accuracy trade-off of the kind lossy quantization requires

Next time someone says LLM inference is already too slow to get any faster, show them Qwen3.5-27B at 207 tok/s on a 5-year-old GPU.
