Billy Tse

© 2026 Billy Tse

April 29, 2026•26 min read

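DFlash × DDTree × Luce DFlash: How a block diffusion drafter gets an RTX 3090 to run Qwen3.5-27B at 207 tok/s

In February 2026, z-lab released DFlash, a block diffusion drafter that reports 6× lossless speculative decoding. In April, Liran Ringel and Yaniv Romano released DDTree, which adds tree-structured verification on top of DFlash and improves all 60 of 60 tested settings. Luce-Org then combined GGUF, ggml, and 3 custom CUDA kernels so that a single 24 GB RTX 3090 runs Qwen3.5-27B at a 207 tok/s peak and a 3.43× mean speedup on HumanEval. This post breaks the stack down in three layers: the paper's principles, the tree verification math, and deployment on a consumer GPU.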

Tags: AI, NLP, Inference Optimization, Hardware Acceleration, GPU, Diffusion

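On February 5, 2026, z-lab posted a paper called DFlash on arXiv. It uses a block diffusion drafter for speculative decoding and reports a 6× lossless speedup, up to 2.5× ahead of EAGLE-3. Two months later, Liran Ringel and Yaniv Romano of the Technion released DDTree, which asks: DFlash already produces a per-position distribution in one pass, so why throw it away? Build a tree from it and verify every branch at once. The result is an improvement in all 60 of 60 settings, with Qwen3-30B-MoE reaching 8.22× on HumanEval.
Most recently, a few weeks ago, a team called Luce-Org wrote about 2,000 lines of C++/CUDA, combining GGUF, ggml, and 3 custom CUDA kernels, so that a 24 GB RTX 3090 runs Qwen3.5-27B at a peak of 207 tok/s and a HumanEval mean of 129.5 tok/s: a 3.43× speedup that stays 100% lossless.
Unlike my usual paper reviews, which stop at the principles, this one works through three layers: why DFlash's diffusion drafter works at all, why the marginals from a single pass are enough for DDTree to build a tree, and how Luce DFlash fits a 27B model onto one 24 GB consumer GPU. At the end I also cover a hidden architectural detail: Qwen3.5 is not a dense Transformer, it is a Gated DeltaNet hybrid.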

TL;DR

The three pieces, one sentence each:

  • ⚡ DFlash (arXiv 2602.06036): uses a block diffusion model as the drafter, generating 16 token candidates in a single forward pass and outperforming EAGLE-3 (6× lossless speedup)
  • 🌳 DDTree (arXiv 2604.12989): that single DFlash pass already provides per-position marginal distributions, so build a tree from them (best-first heap, fixed budget) and verify it in one go; improves all 60 of 60 settings
  • 📦 Luce DFlash (Luce-Org/lucebox-hub): the first GGUF port; a 24 GB RTX 3090 runs Qwen3.5-27B at 129.5 tok/s mean / 207 tok/s peak with 128K context, a 3.43× speedup, and output bit-identical to AR
  • 🎯 Core insight: block diffusion natively emits per-position distributions, which are free tree material; chain-style drafters (EAGLE) have no equivalent
  • 🧱 Hidden catch: Qwen3.5-27B is not a dense Transformer. It is a 64-layer hybrid in which only every 4th layer uses full softmax attention and the rest are Gated DeltaNet; that SSM state is the hardest part of rollback


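1. Background: What problem does speculative decoding solve?

The root cause of slow LLM inference is well known: autoregressive decoding is sequential. Generating each token requires loading the entire model's weights once, so memory bandwidth directly determines throughput.

Take an RTX 3090 running Qwen3.5-27B Q4_K_M (~16 GB of weights) as an example:

  • VRAM bandwidth: 936 GB/s
  • Theoretical max throughput: 936 / 16 ≈ 58 tok/s
  • Measured autoregressive: 37.7 tok/s (hitting the memory wall)

Whether you use vLLM, SGLang, or llama.cpp, plain autoregressive decoding cannot break through this ceiling, because it is a physical hardware limit, not a software bug.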
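The ceiling above is simple arithmetic, and it can be sketched as follows. This is a back-of-envelope model that assumes every generated token streams the full weight file from VRAM exactly once and ignores KV cache and activation traffic; the function name is made up for illustration.

```python
def bandwidth_bound_tok_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on batch-1 decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

ceiling = bandwidth_bound_tok_per_s(936.0, 16.0)  # RTX 3090, ~16 GB Q4_K_M weights
measured = 37.7                                   # AR tok/s quoted in the list above

print(f"ceiling: {ceiling:.1f} tok/s")
print(f"measured AR reaches {measured / ceiling:.0%} of the ceiling")
```

The gap between the bound and the measured number is expected: real kernels never reach peak bandwidth, and the model reads more than just weights per step.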

The basic idea of speculative decoding

The framework proposed by Leviathan et al. (2023): use a smaller draft model to predict the next few tokens, then have the target model verify them all at once.

```python
# Standard autoregressive decoding: 1 token per target forward
for _ in range(N):
    token = target.forward(prev_tokens)                  # 1 forward -> 1 token
    prev_tokens.append(token)

# Speculative decoding: K candidates per target forward
while not done:
    draft_tokens = draft.forward(prev_tokens, k=K)       # cheap, K tokens
    accepted = target.verify(prev_tokens, draft_tokens)  # 1 forward -> up to K tokens
    prev_tokens.extend(accepted)
```

Key insight: one target forward pass computes K logits (one per position), so in the memory-bound regime verifying K candidate tokens costs about the same as verifying 1. If the draft is right about 80% of the time, each target forward commits 4-5 tokens on average, and throughput rises by up to 4-5× before draft overhead.
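The "4-5 tokens at 80%" figure follows from the closed form in Leviathan et al.: if each draft token is accepted independently with probability α and K tokens are drafted, the expected number of tokens committed per target forward (including the one the target contributes itself) is (1 − α^(K+1)) / (1 − α). A minimal sketch, where the i.i.d. acceptance assumption is a simplification:

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward with i.i.d. acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for k in (4, 8, 16):
    print(k, round(expected_tokens_per_round(0.8, k), 2))
```

Note how the value saturates near 1 / (1 − α) = 5 as K grows: drafting more tokens stops paying off unless the acceptance rate itself improves.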

Why chain drafters (EAGLE) hit a ceiling

EAGLE / EAGLE-3 was the previous SOTA. Its draft model is a small transformer, but drafting is still sequential:

```text
Draft step 1: predict token_1
Draft step 2: predict token_2 | token_1
Draft step 3: predict token_3 | token_1, token_2
...
```

If one draft step takes 1.5 ms, drafting 8 tokens takes 12 ms. Even though the draft model is much smaller, the accumulated draft phase ends up dominating.

DFlash's idea: if the draft could produce all 8 tokens in parallel, the draft phase would shrink to a single forward pass.

2. DFlash: Block diffusion as the drafter

⚡ DFlash in one sentence: the draft model is a 5-layer non-causal Transformer whose input is [last_target_token, MASK × 15] plus hidden states borrowed from the target model; it denoises the whole block in one pass and outputs 16 candidate tokens.

2.1 Why diffusion?

The essence of a diffusion model is denoising every position in parallel. Image diffusion refines a whole image from pure noise; DFlash's block diffusion fills in a whole block of mask tokens in one pass.

But LLMs are not like images: tokens have strong sequential dependencies. After "The cat sat on the" comes "mat", not "banana". If each mask were predicted purely independently, the acceptance rate would be terrible.

DFlash's core trick: deep KV injection.

```text
Target model (Qwen3.5-27B)
├── Layer 0
├── Layer 16  ← capture hidden state
├── Layer 32  ← capture hidden state
├── Layer 48  ← capture hidden state
└── Layer 64  ← capture hidden state

Draft model (5 layers)
├── Layer 0  ← inject Layer 16 hidden state via cross-attention
├── Layer 1  ← inject Layer 32 hidden state
├── Layer 2  ← inject Layer 48 hidden state
├── Layer 3  ← inject Layer 64 hidden state
└── Layer 4  → output 16 token logits
```

The draft model does not predict from scratch; it reads the target model's deep representations directly. In effect, the target model has already thought its way to layer 64, and the draft model only has to decode tokens from that representation.

Five non-causal denoising layers are enough to capture the dependencies between tokens. Acceptance length (AL, the average number of tokens accepted per round):

| Drafter | AL | Mechanism |
|---|---|---|
| EAGLE-3 chain | ~3 | Sequential autoregressive draft |
| DFlash block diffusion | ~8 | One-pass parallel denoising with deep KV injection |

Draft latency drops from 8 × 1.5 ms = 12 ms to 1 × 2 ms = 2 ms. Effective throughput: 6× over autoregressive.

2.2 Architecture: Why 5 layers are enough

The size of the DFlash draft model:

```python
# Size of z-lab/Qwen3.5-27B-DFlash
draft_layers = 5          # vs. 64 layers in the target
draft_hidden = 2048       # vs. 5120 in the target
block_size = 16           # 16 tokens per pass
kv_injection_layers = 4   # cross-attention to target hidden states
```

The draft model is about 3.46 GB in BF16, with roughly 1/40 of the target's compute. With deep target features injected, draft quality approaches that of a standalone small LLM.

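2.3 Staying lossless: rejection sampling

The verify rule of speculative decoding (proved in Leviathan et al. 2023) guarantees that the output token distribution equals what the target model would generate on its own:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def verify_chain(draft_tokens, draft_logits, target_logits, rng):
    """Accept draft tokens left to right; stop at the first rejection."""
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = softmax(target_logits[i]), softmax(draft_logits[i])
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                       # accept the draft token
            continue
        residual = np.clip(p - q, 0.0, None)      # reject: resample from norm(max(0, p - q))
        out.append(int(rng.choice(len(p), p=residual / residual.sum())))
        break
    return out
```

Verification stops at the first mismatch, but every committed token is mathematically equivalent to a sample from the target model. That makes it 100% lossless, unlike lossy acceleration schemes such as Medusa.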
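At temperature 0 the target distribution is one-hot, so the acceptance test reduces to an exact match against the target's argmax: a matching draft token is always accepted, and a mismatch is always replaced by the argmax token. A minimal sketch of that greedy special case, assuming `target_logits` has one row per draft position plus one extra row for the position after the last draft token (the function name and layout are illustrative, not taken from any of the codebases discussed here):

```python
import numpy as np

def greedy_verify(draft_tokens, target_logits):
    """Return (accepted_tokens, bonus_token) for temperature-0 verification.

    target_logits[i] is the target's logits after consuming draft_tokens[:i].
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        best = int(np.argmax(target_logits[i]))
        if best != tok:
            return accepted, best          # first mismatch: target's token is the bonus
        accepted.append(tok)
    # Every draft token matched: the extra row supplies the bonus token.
    return accepted, int(np.argmax(target_logits[len(draft_tokens)]))

# Toy logits whose argmax sequence is [5, 7, 2, 9]
logits = np.zeros((4, 10))
for row, tok in enumerate([5, 7, 2, 9]):
    logits[row, tok] = 1.0

print(greedy_verify([5, 7, 3], logits))
print(greedy_verify([5, 7, 2], logits))
```

Either way the round commits at least one token, which is what guarantees forward progress.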

3. DDTree: Tree verification on top of DFlash

🌳 DDTree in one sentence: one DFlash pass already outputs a distribution for each of the 16 positions, so keeping only the top-1 path wastes the information at the other 15 positions. Building a tree (with a fixed node budget) and verifying it in one go raises the expected acceptance length from ~5 to ~8.

3.1 Observation: the per-position marginals are already free

DFlash's output:

```text
position 0:  [a: 0.6, b: 0.2, c: 0.1, ...]
position 1:  [d: 0.5, e: 0.3, f: 0.1, ...]
position 2:  [g: 0.4, h: 0.3, i: 0.2, ...]
...
position 15: [...]
```

Vanilla DFlash takes the argmax at each position, forms the chain a → d → g → ..., and verifies it. If position 1 is wrong (the target wants e, not d), verification stops immediately and the predictions for positions 2-15 are wasted.

But that same DFlash pass already provided a probability for e. Vanilla DFlash simply throws it away.

3.2 Tree construction: Best-first under a budget

DDTree assembles these marginals into a tree:

```text
root (last bonus token)
├── a (0.6)
│   ├── d (0.5)
│   │   ├── g (0.4)
│   │   └── h (0.3)
│   └── e (0.3)
│       ├── g (0.4)
│       └── h (0.3)
├── b (0.2)
│   └── d (0.5)
│       └── g (0.4)
└── c (0.1)
```

node_budget = 22 means the whole tree has at most 22 nodes (the sketch above shows fewer for readability). A best-first heap expands, at each step, the candidate with the highest expected acceptance probability.

Objective: maximize

$$
\mathbb{E}[L_{\text{accept}}] = \sum_{\text{path}} P(\text{path}) \cdot L(\text{path})
$$

where $P(\text{path})$ is the product of the per-position marginals along the path (a factorized assumption) and $L(\text{path})$ is the path length. In practice DDTree does not use dynamic programming; it uses a best-first heap: each candidate node is pushed with score P(parent_path) × P_marginal(node), and the top 22 are popped.
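The heap procedure described above can be sketched in a few lines. This is an illustrative reconstruction from the description in this post, not the DDTree reference code: the function name, the node tuple layout, and the choice to exclude the root from the budget are all assumptions. Each popped node pushes two follow-ups, its next-best sibling and its best child, which is enough to enumerate nodes in descending path probability under the factorized assumption.

```python
import heapq
import numpy as np

def build_tree(marginals: np.ndarray, budget: int):
    """marginals: [depth, vocab] per-position probabilities from one draft pass.

    Returns a list of (parent_index, depth, token, path_prob); parent -1 is the root.
    """
    order = np.argsort(-marginals, axis=1)   # tokens sorted by probability per position
    depth_max, vocab = marginals.shape
    nodes, counter = [], 1
    # Heap entries: (-path_prob, tie_breaker, parent_index, parent_prob, depth, rank)
    heap = [(-float(marginals[0, order[0, 0]]), 0, -1, 1.0, 0, 0)]
    while heap and len(nodes) < budget:
        neg_p, _, parent, parent_prob, depth, rank = heapq.heappop(heap)
        prob = -neg_p
        nodes.append((parent, depth, int(order[depth, rank]), prob))
        me = len(nodes) - 1
        if rank + 1 < vocab:                 # next-best sibling under the same parent
            p_sib = parent_prob * float(marginals[depth, order[depth, rank + 1]])
            heapq.heappush(heap, (-p_sib, counter, parent, parent_prob, depth, rank + 1))
            counter += 1
        if depth + 1 < depth_max:            # best child at the next position
            p_child = prob * float(marginals[depth + 1, order[depth + 1, 0]])
            heapq.heappush(heap, (-p_child, counter, me, prob, depth + 1, 0))
            counter += 1
    return nodes

# The three positions from the example above, truncated to 3 tokens each.
m = np.array([[0.6, 0.2, 0.1],
              [0.5, 0.3, 0.1],
              [0.4, 0.3, 0.2]])
tree = build_tree(m, budget=5)
for parent, depth, token, prob in tree:
    print(parent, depth, token, round(prob, 3))
```

With a budget of 5 the selected nodes are a, a→d, b, a→e, and a→d→g, in that order, which matches the top of the tree drawn above.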

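3.3 Tree verification: One forward pass

The key trick is an ancestor-only attention mask. The target model processes the whole tree (22 tokens) at once, but each token may attend only to itself and its own ancestors, never to siblings or other branches. With rows as queries and columns as keys (g' and h' are the copies of g and h under e):

```text
query \ key   root  a  d  g  h  e  g' h'
root           1    .  .  .  .  .  .  .
a              1    1  .  .  .  .  .  .
d              1    1  1  .  .  .  .  .
g              1    1  1  1  .  .  .  .
h              1    1  1  .  1  .  .  .
e              1    1  .  .  .  1  .  .
g'             1    1  .  .  .  1  1  .
h'             1    1  .  .  .  1  .  1
```

This way one forward pass over 22 tokens yields 22 sets of logits, each token giving the next-token distribution for its own path. Verification costs roughly the same as verifying a single chain of the same token count, while covering many more candidate paths.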
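A mask like this can be derived mechanically from the tree's parent pointers. The sketch below assumes nodes are stored as a flat list with `parents[i]` giving the index of node i's parent and -1 meaning "child of the root"; the root and the prompt's KV cache are visible to every node and are left out of the matrix. The function name is illustrative.

```python
import numpy as np

def ancestor_mask(parents: list[int]) -> np.ndarray:
    """mask[q, k] is True iff node k is node q itself or one of q's ancestors."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        k = q
        while k != -1:          # walk up the parent chain until we leave the tree
            mask[q, k] = True
            k = parents[k]
    return mask

# Nodes in order: a, d, g, h, e, g', h'  (d and e under a; g, h under d; g', h' under e)
parents = [-1, 0, 1, 1, 0, 4, 4]
mask = ancestor_mask(parents)
print(mask.astype(int))
```

Row 2 (g) sees a, d, and itself but not its sibling h; row 5 (g') sees a, e, and itself but nothing from the d branch.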

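3.4 Walk + bonus token

After verification:

```python
def walk_tree(root, target_logits, sample, find_child):
    """Follow the target's own choices down the tree; return (accepted, bonus)."""
    node, accepted = root, []
    while True:
        target_token = sample(target_logits[node])
        matched_child = find_child(node, target_token)
        if matched_child is None:
            # First mismatch (or a leaf): the target's token becomes the bonus
            return accepted, target_token
        accepted.append(matched_child)   # commit the matching child
        node = matched_child
```

The bonus token is the root of the next round. Even if the entire tree is rejected, at least 1 token (the target's own sample) is committed, which guarantees progress.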

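3.5 Results: Budget vs. speedup tradeoff

Speedups reported in the DDTree paper (temperature 0, relative to autoregressive):

Qwen3-8B

| Benchmark | DFlash | DDTree | Gain |
|---|---|---|---|
| MATH-500 | 5.56× | 7.52× | +1.96× |
| HumanEval | 4.84× | 6.90× | +2.06× |
| GSM8K | 4.78× | 6.75× | +1.97× |

Qwen3-30B-MoE

| Benchmark | DFlash | DDTree | Gain |
|---|---|---|---|
| HumanEval | 6.09× | 8.22× | +2.13× |
| MBPP | ~7.0× | 7.7× | +0.7× |
| MATH-500 | ~5.5× | 6.2× | +0.7× |

DDTree wins in all 60 (model × dataset × temperature) settings, with the largest gains concentrated in reasoning and code.

⚠️ A bigger budget is not always better: going from node budget 16 → 22 → 28 raises acceptance length, but verifier latency rises faster. The sweet spot is around 22, which is the Luce DFlash default.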

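4. DFlash + DDTree: The full pipeline

Each round commits about 8 tokens on average (AL = 8.31 on HumanEval). If a full round really took only ~12 ms, throughput would be 8 / 0.012 ≈ 667 tok/s; that figure is best read as an idealized draft-side bound. In practice each round also pays for a target verify pass that is limited by GPU memory bandwidth and KV cache traffic, so the measured mean lands at about 130 tok/s.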
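Working backwards from the measured numbers gives a feel for what a round actually costs. The sketch below uses the HumanEval figures from the benchmark table in section 5.6 (37.78 tok/s AR, 129.52 tok/s with DFlash + DDTree, AL 8.31) and assumes throughput is simply tokens per round divided by round time; the variable names are illustrative.

```python
ar_tok_s = 37.78      # plain autoregressive, HumanEval
spec_tok_s = 129.52   # DFlash + DDTree, HumanEval
accept_len = 8.31     # tokens committed per round

round_ms = 1000.0 * accept_len / spec_tok_s   # implied time per draft + verify round
ar_step_ms = 1000.0 / ar_tok_s                # time per plain AR forward

print(f"implied round time: {round_ms:.1f} ms")
print(f"one AR forward:     {ar_step_ms:.1f} ms")
print(f"a round costs about {round_ms / ar_step_ms:.1f} AR forwards")
```

Under these assumptions a round takes roughly 64 ms, about 2.4 plain forwards, and commits about 8.3 tokens, which is where the 3.43× comes from.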

5. Luce DFlash: How it runs on a single RTX 3090

📦 Luce DFlash in one sentence: z-lab's reference implementation only runs in BF16 on B200-class hardware (54+ GB of VRAM). Luce-Org wrote about 2,000 lines of C++/CUDA on top of ggml to produce the first GGUF port, so a 24 GB RTX 3090 can run Qwen3.5-27B + DFlash + DDTree.

5.1 The gap: Why no one shipped it on consumer GPUs

To run a 27B target plus draft plus tree state on a 24 GB GPU, the stack has to look like this:

| Component | Size requirement |
|---|---|
| Target weights (27B) | ≤ 16 GB, so it must be quantized to Q4_K_M |
| Draft model (BF16) | 3.46 GB |
| KV cache (~32K context) | ~2-4 GB |
| DDTree tree state | ~0.5 GB |
| Total | ~22-24 GB |

The problems:

  • vLLM / SGLang: no GGUF path; AWQ INT4 works but has no DFlash integration
  • z-lab reference: BF16 only, 54 GB of target weights, cannot fit in 24 GB
  • llama.cpp: has a GGUF Q4_K_M loader but no DFlash/DDTree
  • Nobody had a tree-mode kernel for Gated DeltaNet (which makes up three quarters of Qwen3.5's layers)

So Luce-Org's choice was clear: fork llama.cpp and write 3 custom CUDA kernels.

5.2 Architecture catch: Qwen3.5 is not a dense Transformer

```text
Qwen3.5-27B layer stack (64 layers total):
├── Layer 0: Gated DeltaNet (linear attention + recurrent state)
├── Layer 1: Gated DeltaNet
├── Layer 2: Gated DeltaNet
├── Layer 3: Full softmax attention  ← every 4th
├── Layer 4: Gated DeltaNet
├── ...
└── Layer 63: Full softmax attention
```

  • 48/64 layers are Gated DeltaNet (linear attention with a learned recurrence; similar to Mamba's SSM but with a delta-rule update)
  • 16/64 layers are full softmax attention
  • M-RoPE, dimension sections [11, 11, 10, 0]
  • 24 query heads, 4 KV heads, key/value length 256

Gated DeltaNet carries a recurrent state. After tree verification rejects a path, both the SSM intermediate state and the conv window have to roll back to the state at the committed prefix. The stock ggml_gated_delta_net op in llama.cpp cannot roll back.

5.3 Three custom CUDA kernels

Luce DFlash adds three ggml ops in Luce-Org/llama.cpp@luce-dflash:

| Kernel | Function | What it solves |
|---|---|---|
| ggml_ssm_conv_tree | Tree-aware conv state gather | Each sibling reads its own K-1 window along its DDTree parent chain instead of in DFS order |
| ggml_gated_delta_net_tree | DeltaNet recurrence under tree mask | Ancestor-only recurrent update over the tree |
| ggml_gated_delta_net_tree_persist | Direct-write SSM intermediate to persistent buffer | Saves the 9 ms ggml_cpy on every step |

Without these three kernels, either the tree cannot be verified (the states get mixed up) or every step pays a 9 ms copy that drags throughput down. Luce DFlash's RESULTS show that without the persist kernel, throughput falls from 130 back to ~85 tok/s.
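Why a recurrent layer needs tree-aware kernels can be seen in a toy model. The sketch below uses a simplified gated delta-rule update, S ← α·S·(I − β·k·kᵀ) + β·v·kᵀ, with fixed scalar gates; this is an illustrative stand-in, not the actual Gated DeltaNet parameterization or the ggml kernels. The point it makes is structural: each tree node's state must be computed from its parent's state, and a plain sequential scan over the flattened tree would leak sibling updates into each other.

```python
import numpy as np

def delta_step(S, k, v, alpha=0.9, beta=0.5):
    """One simplified gated delta-rule update of a [d_v, d_k] state matrix."""
    k = k / np.linalg.norm(k)
    return alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)

def tree_states(S_root, parents, keys, values):
    """State after each node, computed from its parent's state (parents listed first)."""
    states = []
    for i, p in enumerate(parents):
        prev = S_root if p == -1 else states[p]
        states.append(delta_step(prev, keys[i], values[i]))
    return states

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
parents = [-1, 0, 1, 1, 0]                 # nodes a, d, g, h, e
keys = rng.normal(size=(5, d_k))
values = rng.normal(size=(5, d_v))
S0 = np.zeros((d_v, d_k))
states = tree_states(S0, parents, keys, values)

# If the accepted path is a -> d -> h, the committed state is just states[3]:
# no replay of the target is needed, and g's update never touched it.
committed = states[3]
print(committed.shape)
```

Keeping one state per node is what makes rollback a lookup rather than a recomputation; the memory cost of doing so is presumably why writing those states straight into a persistent buffer matters.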

5.4 Per-step rollback: snapshot + restore

```python
def one_round(target, draft_logits, deltanet_state, conv_state, kv_cache):
    # Snapshot recurrent state before verify
    state_snapshot = {
        "ssm_intermediate": deltanet_state.clone(),
        "conv_window": conv_state.clone(),
        "kv_cache_pos": kv_cache.cur_pos,
    }

    # Build tree + verify
    tree = build_ddtree(draft_logits, budget=22)
    target_logits = target.forward_tree(tree)   # uses ggml_*_tree kernels

    # Walk + commit
    accepted_path, bonus = walk_tree(tree, target_logits)

    # Restore to the committed prefix only
    restore_state(state_snapshot, len(accepted_path))
    kv_cache.cur_pos = state_snapshot["kv_cache_pos"] + len(accepted_path)
    return accepted_path + [bonus]
```

The benefit: no replay forward. Traditional spec decode has to re-run the target once after committing to get a correct KV state; here the state is restored directly from the partial state of the tree forward.

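5.5 Long context: TQ3_0 KV cache + sliding ring

Qwen3.5-27B has a native 32K context. How does Luce DFlash stretch that to 256K in 24 GB?

KV cache quantization

| Format | Bits per value (bpv) | Memory savings vs. F16 |
|---|---|---|
| F16 | 16 | 1× (baseline) |
| Q4_0 | 4.5 | ~3.5× |
| TQ3_0 | 3.5 | ~4.6× |

Environment variable: DFLASH27B_KV_TQ3=1 (default).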
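To see what these formats mean in gigabytes, here is a rough estimator. It assumes that only the 16 full softmax attention layers keep a KV cache that grows with context (the Gated DeltaNet layers hold a fixed-size state), and it takes the 4 KV heads and key/value length of 256 from section 5.2. The function name is illustrative, and real allocations will add padding and bookkeeping on top.

```python
def kv_cache_gib(tokens: int, bits_per_value: float,
                 attn_layers: int = 16, kv_heads: int = 4, head_dim: int = 256) -> float:
    """Approximate KV cache size in GiB for a given context length and format."""
    values_per_token = attn_layers * kv_heads * head_dim * 2   # K and V
    return tokens * values_per_token * bits_per_value / 8 / 2**30

for name, bpv in [("F16", 16.0), ("Q4_0", 4.5), ("TQ3_0", 3.5)]:
    print(f"{name:6s} 32K: {kv_cache_gib(32 * 1024, bpv):5.2f} GiB   "
          f"128K: {kv_cache_gib(128 * 1024, bpv):5.2f} GiB")
```

Under these assumptions an F16 cache at 128K tokens would need 8 GiB, which does not fit next to 16 GB of weights and a 3.46 GB draft, while the 3.5 bpv format brings it down to 1.75 GiB.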

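Sliding target_feat ring

Deep KV injection has to capture the target's hidden states at layers 16/32/48/64. For a 128K prompt, 5120 dims × 128K × 4 layers × 2 bytes ≈ 5.4 GB of pure hidden state, which blows the VRAM budget outright.

Luce DFlash uses a 4096-slot ring buffer: slot pos % 4096 is overwritten in place. Hidden states more than 4K positions behind the current one are dropped, but because the draft only looks at recent context, the impact is small. Measured in 128K mode, HumanEval still reaches 134.78 tok/s at AL 8.33.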
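The ring itself is a small data structure. A minimal sketch, assuming one slot holds the captured features for all injected layers at a single position; the class name, method names, and NumPy storage are illustrative and unrelated to the actual C++/CUDA implementation:

```python
import numpy as np

class FeatureRing:
    """Keeps the most recent `slots` positions of target hidden states."""

    def __init__(self, slots=4096, layers=4, dim=5120, dtype=np.float16):
        self.slots = slots
        self.buf = np.zeros((slots, layers, dim), dtype=dtype)
        self.next_pos = 0

    def write(self, pos, feats):
        self.buf[pos % self.slots] = feats          # overwrite the oldest slot in place
        self.next_pos = max(self.next_pos, pos + 1)

    def read(self, pos):
        if pos >= self.next_pos or pos < self.next_pos - self.slots:
            raise KeyError(f"position {pos} is not in the ring")
        return self.buf[pos % self.slots]

# Small demo ring: 4 slots, then write 6 positions so the first two are evicted.
ring = FeatureRing(slots=4, layers=1, dim=2)
for pos in range(6):
    ring.write(pos, np.full((1, 2), pos))

full_size_mib = 4096 * 4 * 5120 * 2 / 2**20   # footprint at the sizes quoted above
print(ring.read(5), full_size_mib)
```

At the sizes quoted above the ring is a fixed 160 MiB regardless of prompt length, which is the whole point of the design.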

5.6 Benchmarks

Qwen3.5-27B Q4_K_M, RTX 3090 24 GB, n_gen=256, 10 prompts/dataset:

| Task | AR tok/s | DFlash+DDTree tok/s | AL | Speedup |
|---|---|---|---|---|
| HumanEval | 37.78 | 129.52 | 8.31 | 3.43× |
| Math500 | 37.71 | 110.51 | 7.04 | 2.93× |
| GSM8K | 37.65 | 96.15 | 6.14 | 2.55× |

Compared with other stacks on the same hardware:

| Stack | HumanEval tok/s | Notes |
|---|---|---|
| llama.cpp Q4_K_M (AR) | 37.78 | No spec decode |
| SGLang AWQ INT4 (AR) | 46.6 | No DFlash, still AR |
| Luce DFlash Q4_K_M | 129.52 | 27B model, 100% lossless, 128K context |

That is 2.8× faster than SGLang AWQ and 3.43× faster than llama.cpp. The demo run peaks at 207.6 tok/s (5.46× AR).

5.7 Quick start

```bash
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

# Build
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

# Models
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  Qwen3.5-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download z-lab/Qwen3.5-27B-DFlash \
  model.safetensors --local-dir models/draft/

# Run
python3 scripts/run.py --prompt "def fibonacci(n):"

# Long-context mode (up to 256K)
DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=16 \
  build/test_dflash models/Qwen3.5-27B-Q4_K_M.gguf \
  models/draft/model.safetensors prompt.bin 256 out.bin \
  --fast-rollback --ddtree --ddtree-budget=22 --max-ctx=131072
```

Requirements:

  • NVIDIA sm_86+ (3090, A10, A40, 4090) or sm_110 Jetson AGX Thor
  • CUDA 12+ (13+ for Thor)
  • 24 GB VRAM
  • ~80 GB disk for both models

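6. Why did these three pieces land together so neatly?

This is the back-to-back inference acceleration trilogy of early 2026.

Division of labor: research vs. engineering

  • z-lab (DFlash): showed that a block diffusion drafter can beat chain drafters, and open-sourced the weights
  • Ringel & Romano (DDTree): noticed that the marginals are already free, and added tree verification to squeeze all the information out of DFlash
  • Luce-Org (Luce DFlash): ~2,000 lines of C++/CUDA plus 3 kernels, taking the paper from "runs on a B200" to "runs on a 3090"

Honestly, the third piece is what makes this usable by the community. However elegant the paper, BF16 on a B200 is out of reach for indie hackers and individual researchers. The Luce DFlash README puts it directly: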

Consumer GPUs can run 27B models at chat-grade speed without multi-GPU, without batching, without quantization compromises. The bottleneck was never hardware. It was the decoding algorithm.

7. Limitations and caveats

DFlash limitations

  • Block size is hard-coded to 16: the drafter emits 16 tokens per pass and this cannot be adjusted dynamically
  • Deep KV injection has to capture target hidden states: each forward captures 4 layers, adding ~5% to prefill cost
  • Every target model needs its own trained draft: z-lab has so far released drafts for Qwen3.5/3.6, Kimi-K2.5, gpt-oss-20b/120b, and Llama-3.1-8B; other models have to wait for the community to train them

DDTree limitations

  • Marginal independence assumption: the tree scores paths with the factorized $P(\text{path}) = \prod_i P_i$, but real tokens are correlated, so acceptance length saturates once the budget gets large
  • Verifier cost grows quadratically with budget: the tree attention mask is $O(N^2)$ in tree size $N$

Luce DFlash limitations

  • Batch size 1: single-user scenario, no KV paging
  • Greedy only: temperature / top_p are accepted but ignored (rejection sampling is not implemented yet)
  • One model pair: only a Qwen3.5-27B Q4_K_M target with the z-lab DFlash BF16 draft; switching to another model means rewriting the graph builder
  • Q4_K_M gives up about 30% of the acceptance length compared with BF16: ~8.31 AL on Q4_K_M vs. the paper's ~12 AL on BF16; Q5_K_M / Q6_K could win some of that back but do not fit in 24 GB
  • CUDA only: no Metal / ROCm; this is a separate effort from dflash-mlx

8. What this means for the inference ecosystem

8.1 The future of drafters: parallel everywhere

DFlash shows that sequential drafting is not a necessity. Block diffusion emits 16 tokens in one pass at a quality that still beats sequential drafting. Possible next steps:

  • Hierarchical block diffusion: one pass to produce a 64-token outline, then one more pass to refine it (cf. SSD, Speculative Speculative Decoding; a similar idea has already appeared at ICLR 2026)
  • Prompt-conditioned drafter: different drafters for different tasks (code / math / chat)

8.2 Trees are already standard

Medusa, SpecInfer, and EAGLE-2 have all used tree verification. What sets DDTree apart is that the tree material is completely free: the drafter has already produced the marginals, so no extra forward pass is needed. EAGLE-2 needs multiple drafter forwards to build its tree, whereas DDTree is purely a best-first heap over existing data.

8.3 GGUF + custom kernel = consumer-grade SOTA

The arrival of Luce DFlash tells us that llama.cpp's GGUF stack has caught up to near SOTA and is no longer a second-class citizen trailing vLLM research by half a year. With quantization-aware spec decoding plus hand-tuned kernels, a consumer GPU can host a 27B model at chat-grade speed (>100 tok/s).

In the second half of 2026 we will most likely see:

  • llama.cpp upstream merging the DFlash kernels
  • Ollama / LM Studio enabling DFlash by default for supported models
  • Third-party community GGUF DFlash drafts for more model pairs (Llama-4, Mistral-Next, ...)

9. Want to try it yourself? What to do

If you have an NVIDIA GPU with 24 GB or more:

  1. Run the Luce DFlash demo: follow the quick start above and see Qwen3.5-27B at 100+ tok/s within 5 minutes
  2. Try a DDTree budget sweep: --ddtree-budget=8/16/22/28, and find the sweet spot for your GPU
  3. Test long-context mode: DFLASH27B_KV_TQ3=1 plus a 128K prompt, and watch how throughput changes

If you want to train your own draft model:

  1. Wait for the z-lab training recipe: they have promised to open-source it
  2. Adapt the EAGLE-3 architecture to block diffusion: a 5-layer transformer with cross-attention to target hidden states
  3. Distill on UltraChat / SlimPajama: BF16 training takes roughly 24 GPU-hours on 8× H100

If you want to contribute to Luce DFlash:

  • Temperature / top-k sampling: rejection sampling in the verify path
  • Full llama.cpp integration: llama-speculative-dflash.cpp + llama-cli / llama-server wiring
  • New model pairs: this is the hard one, since it means rewriting the graph builder

10. Summary

The chain DFlash → DDTree → Luce DFlash shows the full lifecycle of inference acceleration research:

| Stage | Key point | Impact |
|---|---|---|
| Algorithmic insight (DFlash) | Block diffusion drafter breaks the sequential drafting bottleneck | 6× speedup, but BF16 only |
| Free lunch on top (DDTree) | The marginal distributions are already free; building a tree squeezes out another 30-40% | Wins in all 60 of 60 settings |
| Engineering port (Luce DFlash) | GGUF + 3 custom CUDA kernels + sliding KV ring | A 24 GB consumer GPU runs a 27B model at 130 tok/s |

Core takeaways:

  1. Do not underestimate the role of diffusion in LLMs: it is not just for image generation, and as a drafter it beats chain methods outright
  2. Marginals are free tree material: any parallel drafter should be paired with tree verification
  3. The gap between engineering and research keeps narrowing: an indie team took a paper from a B200 down to a 3090 in two weeks
  4. The lossless guarantee is non-negotiable: all of these speedups are mathematically equivalent to sampling from the target being served, unlike lossy quantization, which trades away accuracy

Next time someone says LLM inference cannot get any faster, show them Qwen3.5-27B at 207 tok/s on a 5-year-old GPU.

Related resources

Papers

  • DFlash: arxiv 2602.06036 — Chen, Liang, Liu (z-lab, 2026)
  • DDTree: arxiv 2604.12989 — Ringel, Romano (Technion, 2026)
  • EAGLE-3 (background): arxiv 2503.01840
  • Speculative Speculative Decoding (next step): ICLR 2026 OpenReview
  • Original Speculative Decoding: arxiv 2211.17192 — Leviathan et al.

Code

  • z-lab/dflash: github.com/z-lab/dflash — reference vLLM/SGLang/Transformers/MLX impl
  • liranringel/ddtree: github.com/liranringel/ddtree — DDTree reference impl
  • Luce-Org/lucebox-hub: github.com/Luce-Org/lucebox-hub — GGUF port for RTX 3090
  • Luce-Org/llama.cpp: github.com/Luce-Org/llama.cpp/tree/luce-dflash — fork with tree-mode kernels
  • Aryagm/dflash-mlx: github.com/Aryagm/dflash-mlx — Apple Silicon impl

Demos & writeups

  • DDTree project page: liranringel.github.io/ddtree (interactive visualization)
  • Luce DFlash blog: lucebox.com/blog/dflash27b
  • The Salt #116: substack.com/p/8x-faster-inference-dflash — paper review
  • Hacker News thread: news.ycombinator.com/item?id=47838788 — 207 tok/s discussion

Models

  • z-lab/Qwen3.5-27B-DFlash: HuggingFace (draft for the main target used here)
  • z-lab DFlash collection: HuggingFace collection (drafts for the whole model family)