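On 5 February 2026, z-lab posted a paper called DFlash on arxiv, describing how a block diffusion drafter for speculative decoding reaches a 6× lossless speedup and beats EAGLE-3 by 2.5×. Two months later, Liran Ringel and Yaniv Romano of Technion released DDTree, which asks: DFlash already produces a per-position distribution in one pass, so why throw it away? Build a tree from it and verify all branches at once. The result is an improvement in all 60 of 60 settings, with Qwen3-30B-MoE reaching 8.22× on HumanEval.
Most recently, a few weeks ago, a team called Luce-Org wrote ~2000 lines of C++/CUDA spanning GGUF, ggml, and 3 custom CUDA kernels, letting a 24 GB RTX 3090 run Qwen3.5-27B at a peak of 207 tok/s and a HumanEval mean of 129.5 tok/s: a 3.43× speedup that stays 100% lossless.
Unlike our usual paper reviews, which cover only the theory, this one works through three layers: why a diffusion drafter works in DFlash, why the marginals from a single pass are enough for DDTree to build a tree, and how Luce DFlash gets a 27B model running on one 24 GB consumer GPU. At the end we also dig out the hidden architectural catch: Qwen3.5 is not a dense Transformer but a Gated DeltaNet hybrid.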
TL;DR
Three things, one sentence each:
- ⚡ DFlash (arxiv 2602.06036): uses a block diffusion model as the drafter, generating 16 candidate tokens in a single forward pass and outperforming EAGLE-3 (6× lossless speedup)
- 🌳 DDTree (arxiv 2604.12989): that single DFlash pass already provides per-position marginal distributions, so build a tree from them (best-first heap, fixed budget) and verify it in one go; all 60 of 60 settings improve
- 📦 Luce DFlash (Luce-Org/lucebox-hub): the first GGUF port, running Qwen3.5-27B on a 24 GB RTX 3090 at 129.5 tok/s mean / 207 tok/s peak, with 128K context, a 3.43× speedup, and output bit-identical to AR decoding
- 🎯 Core insight: block diffusion emits per-position distributions by construction, which is free tree material; a chain-style drafter (EAGLE) cannot do the same
- 🧱 Hidden catch: Qwen3.5-27B is not a dense Transformer. It is a 64-layer hybrid in which only every 4th layer is full softmax attention and the rest are Gated DeltaNet; that SSM state is the hardest part of rollback
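1. Background: what problem does speculative decoding solve?
The root cause of slow LLM inference is well known: autoregressive decoding is sequential. Every generated token requires loading the full model weights once, so memory bandwidth directly determines throughput.
Take Qwen3.5-27B Q4_K_M (~16 GB of weights) on an RTX 3090 as an example:
- VRAM bandwidth: 936 GB/s
- Theoretical max throughput: 936 / 16 ≈ 58 tok/s
- Measured autoregressive: 37.7 tok/s (hits the memory wall)
Whether you use vLLM, SGLang, or llama.cpp, you cannot break through this ceiling, because it is a physical hardware limit, not a software bug.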
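The ceiling above is a one-line roofline estimate. Here is a minimal sketch of it, assuming every decoded token streams all weights from VRAM exactly once and ignoring KV cache reads and compute time; the function name is made up for illustration.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when each token must read all weights once."""
    return bandwidth_gb_s / weights_gb

ceiling = bandwidth_bound_tok_s(936.0, 16.0)  # RTX 3090, ~16 GB of Q4_K_M weights
measured = 37.7                               # autoregressive tok/s quoted above

print(f"ceiling  : {ceiling:.1f} tok/s")
print(f"achieved : {measured / ceiling:.0%} of the ceiling")
```

Under these assumptions the measured autoregressive speed sits at roughly two thirds of the bandwidth bound, which is why faster kernels alone cannot deliver a multi-fold gain.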
The basic idea of speculative decoding
The framework proposed by Leviathan et al. (2023): use a smaller draft model to predict the next run of tokens, then have the target model verify them all at once.
```python
# Pseudocode: `target`, `draft`, `prev_tokens`, `N`, `K` and `done` are placeholders.

# Standard autoregressive: 1 token per target forward
for i in range(N):
    token = target.forward(prev_tokens)  # 1 forward → 1 token
    prev_tokens.append(token)

# Speculative decoding: K candidates per target forward
while not done:
    draft_tokens = draft.forward(prev_tokens, k=K)       # cheap, K tokens
    accepted = target.verify(prev_tokens, draft_tokens)  # 1 forward → up to K tokens
    prev_tokens.extend(accepted)
```
Key insight: one target forward pass computes K logits (one per position), so verifying K candidate tokens costs about the same as verifying one. If the draft is right 80% of the time, each target forward commits 4-5 tokens on average, and throughput rises by roughly 4-5×.
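The "4-5 tokens" figure can be checked with the standard expected-length formula from Leviathan et al. This is a simplified sketch: it assumes each drafted token is accepted independently with the same probability alpha, which real drafters only approximate.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward: accepted prefix plus one target token.

    Assumes each of the k drafted tokens is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for k in (4, 8, 16):
    print(k, round(expected_tokens_per_verify(0.8, k), 2))
# 4 3.36
# 8 4.33
# 16 4.89
```

With alpha = 0.8 the expectation saturates just under 5 no matter how long the draft is, which is why raising the acceptance rate matters more than drafting further ahead.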
Why do chain drafters (EAGLE) hit a ceiling?
EAGLE / EAGLE-3 was the previous SOTA. Its draft model is a small transformer, but the drafting process is still sequential:
```text
Draft step 1: predict token_1
Draft step 2: predict token_2 | token_1
Draft step 3: predict token_3 | token_1, token_2
...
```
If one draft step takes 1.5 ms, drafting 8 tokens takes 12 ms. The draft model is much smaller, but the steps add up and the draft phase ends up dominating.
DFlash's idea: if the drafter could emit all 8 tokens in parallel, the draft phase would shrink to a single forward pass.
2. DFlash: block diffusion as the drafter
⚡ DFlash in one sentence: the draft model is a 5-layer non-causal Transformer whose input is `[last_target_token, MASK × 15]` plus hidden states borrowed from the target model; it denoises the whole 16-position block in one pass and outputs 16 candidate tokens.
2.1 Why diffusion?
The essence of a diffusion model is denoising many positions at once. Image diffusion denoises a whole image per step; DFlash's block diffusion goes from pure mask tokens to a full block of tokens in one pass.
But LLMs are not images: tokens have strong sequential dependencies. After "The cat sat on the" comes "mat", not "banana". If you predict each mask purely independently, the acceptance rate is terrible.
DFlash's core trick: deep KV injection.
```text
Target model (Qwen3.5-27B)
├── Layer 0
├── Layer 16  ← capture hidden state
├── Layer 32  ← capture hidden state
├── Layer 48  ← capture hidden state
└── Layer 64  ← capture hidden state

Draft model (5 layers)
├── Layer 0  ← inject Layer 16 hidden state via cross-attention
├── Layer 1  ← inject Layer 32 hidden state
├── Layer 2  ← inject Layer 48 hidden state
├── Layer 3  ← inject Layer 64 hidden state
└── Layer 4  → output 16 token logits
```
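The draft model does not predict from scratch; it reads the target model's deep representations directly. In effect: "the target model has already thought its way up to layer 64; the draft model only has to decode tokens out of that representation."
Five layers of non-causal denoising are enough to express the dependencies between tokens. Acceptance length (AL, the average number of tokens accepted per round):
| Drafter | AL | Mechanism |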
|---|---|---|
| EAGLE-3 chain | ~3 | Sequential autoregressive draft |
| DFlash block diffusion | ~8 | One-pass parallel denoising with deep KV injection |
Draft latency drops from "8 × 1.5 ms = 12 ms" to "1 × 2 ms = 2 ms". Effective throughput: 6× over autoregressive.
2.2 Architecture: why 5 layers are enough
The size of the DFlash draft model:
```python
# Size of z-lab/Qwen3.5-27B-DFlash
draft_layers = 5          # vs. 64 layers in the target
draft_hidden = 2048       # vs. 5120 in the target
block_size = 16           # 16 tokens per pass
kv_injection_layers = 4   # cross-attention to target hidden states
```
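The draft model is about 3.46 GB in BF16, with roughly 1/40 of the target's compute. With deep target features injected, draft quality comes close to that of a standalone small LLM.
2.3 Staying lossless: rejection sampling
The verify rule of speculative decoding (proved by Leviathan et al., 2023) guarantees that the output token distribution equals what the target model would generate on its own: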
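```python
import torch

def verify_chain(target_logits, draft_logits, draft_tokens):
    """Accept or reject K drafted tokens so that the output follows the target distribution."""
    committed = []
    for i, tok in enumerate(draft_tokens):
        p = torch.softmax(target_logits[i], dim=-1)  # target distribution at position i
        q = torch.softmax(draft_logits[i], dim=-1)   # draft distribution at position i
        if torch.rand(()) < torch.clamp(p[tok] / q[tok], max=1.0):
            committed.append(tok)                    # accept the drafted token
        else:
            # Reject: resample from the normalized residual max(0, p - q), then stop.
            residual = (p - q).clamp(min=0)
            committed.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return committed
```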
The loop breaks after the first mismatch, but every committed token is mathematically equivalent to a sample from the target model. That makes it 100% lossless, unlike lossy acceleration methods such as Medusa.
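To see the lossless claim in action, here is a small standard-library experiment for a single position. The two distributions are made-up toy numbers, not taken from any model; the point is only that the accept/resample rule reproduces the target distribution p even though every draft comes from a quite different q.

```python
import random

def speculative_sample(p, q, rng):
    """Return one token distributed exactly as p, using a draft sampled from q."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft                                        # accept the drafted token
    # Reject: resample from the residual max(0, p - q); choices() normalizes the weights.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]

p = [0.5, 0.3, 0.2]   # toy target distribution (assumed for the demo)
q = [0.2, 0.5, 0.3]   # toy draft distribution (assumed for the demo)

rng = random.Random(0)
n = 20_000
counts = [0] * len(p)
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1

freqs = [c / n for c in counts]
print([round(f, 3) for f in freqs])   # should land close to p
```

In this toy setup the draft is accepted 70% of the time (0.2 + 0.5 × 0.6 + 0.3 × 2/3), and the remaining 30% of rejections are what pull the output back onto p.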
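3. DDTree: tree verification on top of DFlash
🌳 DDTree in one sentence: one DFlash pass already outputs a distribution for each of the 16 positions, so taking only the top-1 path wastes the information held in 15 of them. Build a tree (with a fixed node budget) and verify it in one go, and the expected acceptance length goes from ~5 to ~8.
3.1 Observation: the per-position marginals are already free
DFlash's output: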
```text
position 0: [a: 0.6, b: 0.2, c: 0.1, ...]
position 1: [d: 0.5, e: 0.3, f: 0.1, ...]
position 2: [g: 0.4, h: 0.3, i: 0.2, ...]
...
position 15: [...]
```
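Vanilla DFlash takes the argmax at each position, forms the chain a → d → g → ..., and verifies it. If position 1 is wrong (the target wants e, not d), verification stops immediately and the predictions for positions 2-15 are wasted.
But that same DFlash pass already gave a probability for e! Vanilla DFlash simply throws it away.
3.2 Tree construction: best-first under a budget
DDTree assembles these marginals into a tree: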
```text
root (last bonus token)
├── a (0.6)
│   ├── d (0.5)
│   │   ├── g (0.4)
│   │   └── h (0.3)
│   └── e (0.3)
│       ├── g (0.4)
│       └── h (0.3)
├── b (0.2)
│   └── d (0.5)
│       └── g (0.4)
└── c (0.1)
```
Here node_budget = 22 means the whole tree has at most 22 nodes. Construction uses a best-first heap: at each step, expand the leaf with the highest expected acceptance probability.
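Objective: maximize the expected acceptance length $\sum_{v \in T} P(\text{path}_v)$ over trees $T$ with at most 22 nodes,
where $P(\text{path}_v) = \prod_{i=1}^{\ell_v} q_i(x_i)$ is the product of the per-position marginals along the path to node $v$ (the factorized assumption) and $\ell_v$ is the path length. In practice DDTree does not use dynamic programming but a best-first heap: each candidate node is pushed with key P(parent_path) × P_marginal(node), and the top 22 are popped.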
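A best-first construction of this kind can be sketched with Python's `heapq`. This is an illustrative version, not the DDTree reference code: the function name, the `(token, prob)` input layout, and the lazy "next sibling / first child" expansion are assumptions made for the example. Because a child or a later sibling can never be more probable than the node that generated it, popping in heap order yields the most probable prefixes, and every popped node's parent has already been popped, so the result is always a valid tree.

```python
import heapq

def build_tree(marginals, budget):
    """Pick the `budget` most probable prefixes under a factorized draft distribution.

    marginals[d] is a list of (token, prob) pairs for depth d, sorted by prob descending.
    Returns a list of (path_prob, path) in the order the nodes were selected.
    """
    nodes = []
    first_tok, first_p = marginals[0][0]
    # Heap entries: (-path_prob, path, rank of last token at its depth, parent path_prob).
    heap = [(-first_p, (first_tok,), 0, 1.0)]
    while heap and len(nodes) < budget:
        neg_p, path, rank, parent_p = heapq.heappop(heap)
        nodes.append((-neg_p, path))
        depth = len(path) - 1
        # Next-best sibling: same parent, next token at this depth.
        if rank + 1 < len(marginals[depth]):
            tok, p = marginals[depth][rank + 1]
            heapq.heappush(heap, (-parent_p * p, path[:-1] + (tok,), rank + 1, parent_p))
        # Best child: extend this path with the top token of the next depth.
        if depth + 1 < len(marginals):
            tok, p = marginals[depth + 1][0]
            heapq.heappush(heap, (neg_p * p, path + (tok,), 0, -neg_p))
    return nodes

marginals = [
    [("a", 0.6), ("b", 0.2), ("c", 0.1)],
    [("d", 0.5), ("e", 0.3), ("f", 0.1)],
    [("g", 0.4), ("h", 0.3), ("i", 0.2)],
]
tree = build_tree(marginals, budget=5)
for prob, path in tree:
    print(f"{prob:.3f}  {' -> '.join(path)}")
print("expected accepted length:", round(sum(prob for prob, _ in tree), 2))
```

With the toy marginals from section 3.1 and a budget of 5, the selected nodes are a, a→d, b, a→e and a→d→g. Their path probabilities sum to 1.4, against 1.02 for the plain top-1 chain a→d→g.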
3.3 Tree verification: one forward pass
The key trick is an ancestor-only attention mask. The target model processes the whole tree (22 tokens) at once, but each token may attend only to its own ancestors, not to its siblings.
```text
# rows: token being attended to; columns: querying token (1 = visible; a token also sees itself)
       a  d  g  h  e  g  h  ...
root:  1  1  1  1  1  1  1  ...
a:     1  1  1  1  1  1  1
d:     .  1  1  1  .  .  .
g:     .  .  1  .  .  .  .
h:     .  .  .  1  .  .  .
e:     .  .  .  .  1  1  1
...
```
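That way a single forward over 22 tokens yields 22 sets of logits (each token's logits are the next-token distribution for its own path). Verification costs about the same as verifying one chain, but covers 22 candidates instead of a single path.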
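An ancestor-only mask is easy to derive from a parent array. The sketch below is a plain-Python illustration: the flattened node order and the names are assumptions for the example, and rows are queries here (row i lists what node i may attend to, itself included).

```python
def ancestor_mask(parents):
    """mask[i][j] is True when node i may attend to node j (j is i or an ancestor of i).

    parents[i] is the index of node i's parent, or -1 for the root.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking every node on the way
            mask[i][j] = True
            j = parents[j]
    return mask

# root -> a -> { d -> {g, h},  e -> {g2, h2} }
names   = ["root", "a", "d", "g", "h", "e", "g2", "h2"]
parents = [-1,      0,   1,   2,   2,   1,   5,    5]

mask = ancestor_mask(parents)
for name, row in zip(names, mask):
    print(f"{name:>4}: " + " ".join("1" if visible else "." for visible in row))
```

The two g nodes carry the same token but sit on different paths, so they get different rows: one sees d, the other sees e. That is what lets a single forward pass score both continuations without them contaminating each other.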
3.4 Walk + bonus token
After verification:
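```python
def walk_tree(root, target_logits):
    """Follow the target's own samples down the verified tree.

    `sample` and `find_child` are helpers assumed to exist elsewhere.
    """
    accepted, node = [], root
    while True:
        target_token = sample(target_logits[node])   # what the target wants after `node`
        matched_child = find_child(node, target_token)
        if matched_child is None:
            # First mismatch (or a leaf): the target's token becomes the bonus token.
            return accepted, target_token
        accepted.append(matched_child)               # commit the matching child
        node = matched_child
```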
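The bonus token becomes the root of the next round. Even if the entire tree is rejected, at least 1 token (the target's own sample) is committed, which guarantees progress.
3.5 Results: budget vs. speedup tradeoff
Speedups reported in the DDTree paper (temperature 0, relative to autoregressive):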
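For a concrete feel of the walk, here is a self-contained greedy (temperature 0) version on a tiny hand-built tree. The dictionaries standing in for the tree and for the target's argmax predictions are hypothetical data structures chosen for the example.

```python
def walk(children, target_next, root=0):
    """Greedy walk: follow the target's argmax until the drafted tree runs out.

    children[node]    : dict mapping token -> child node id (the drafted continuations)
    target_next[node] : token the target picks after the path ending at `node`
    Returns (accepted_tokens, bonus_token).
    """
    accepted, node = [], root
    while True:
        tok = target_next[node]
        child = children.get(node, {}).get(tok)
        if child is None:
            return accepted, tok      # mismatch or leaf: the target's token is the bonus
        accepted.append(tok)
        node = child

# Node ids: 0 = root, 1 = a, 2 = b, 3 = a->d, 4 = a->e, 5 = a->e->g
children = {0: {"a": 1, "b": 2}, 1: {"d": 3, "e": 4}, 4: {"g": 5}}
target_next = {0: "a", 1: "e", 2: "d", 3: "g", 4: "h", 5: "x"}

print(walk(children, target_next))   # (['a', 'e'], 'h')
```

A plain top-1 chain a → d would have stopped after "a"; the tree also holds the e branch, so this round commits a, e and the bonus token h.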
Qwen3-8B
| Benchmark | DFlash | DDTree | Gain |
|---|---|---|---|
| MATH-500 | 5.56× | 7.52× | +1.96× |
| HumanEval | 4.84× | 6.90× | +2.06× |
| GSM8K | 4.78× | 6.75× | +1.97× |
Qwen3-30B-MoE
| Benchmark | DFlash | DDTree | Gain |
|---|---|---|---|
| HumanEval | 6.09× | 8.22× | +2.13× |
| MBPP | ~7.0× | 7.7× | +0.7× |
| MATH-500 | ~5.5× | 6.2× | +0.7× |
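All 60 (model × dataset × temperature) settings improve, with the largest gains concentrated in reasoning and code.
⚠️ A bigger budget is not always better: going from node budget 16 → 22 → 28, acceptance length rises but verifier latency rises even faster. The sweet spot is around 22, and Luce DFlash defaults to budget=22.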
4. DFlash + DDTree: the full pipeline
Each round commits ~8 tokens on average (AL = 8.31 on HumanEval) with a total latency of ~12 ms. The equivalent throughput is 8 / 0.012 ≈ 667 tok/s of effective drafting, but in practice GPU memory bandwidth and the KV cache limit it, and it settles at ~130 tok/s mean.
5. Luce DFlash: how it runs on a single RTX 3090
📦 Luce DFlash in one sentence: z-lab's reference implementation only runs BF16 on a B200 (54+ GB VRAM). Luce-Org wrote ~2000 lines of C++/CUDA on top of ggml to produce the world's first GGUF port, so a 24 GB RTX 3090 can run Qwen3.5-27B + DFlash + DDTree.
5.1 The gap: why no one shipped it on consumer GPUs
To run a 27B model + draft + tree state on a 24 GB GPU, the stack has to look like this:
| Component | Size requirement |
|---|---|
| Target weights (27B) | ≤ 16 GB → must be quantized to Q4_K_M |
| Draft model (BF16) | 3.46 GB |
| KV cache (~32K context) | ~2-4 GB |
| DDTree tree state | ~0.5 GB |
| Total | ~22-24 GB |
The problems:
- vLLM / SGLang: no GGUF path; AWQ INT4 works but has no DFlash integration
- z-lab reference: supports BF16 only; 54 GB of target weights cannot fit in 24 GB
- llama.cpp: has a GGUF Q4_K_M loader but no DFlash/DDTree
- Nobody has a tree-mode kernel for Gated DeltaNet (which makes up three quarters of Qwen3.5's layers)
So Luce-Org's choice was clear: fork llama.cpp and write 3 custom CUDA kernels themselves.
5.2 Architecture catch: Qwen3.5 is not a dense Transformer
```text
Qwen3.5-27B layer stack (64 layers total):
├── Layer 0:  Gated DeltaNet (linear attention + recurrent state)
├── Layer 1:  Gated DeltaNet
├── Layer 2:  Gated DeltaNet
├── Layer 3:  Full softmax attention  ← every 4th
├── Layer 4:  Gated DeltaNet
├── ...
└── Layer 63: Full softmax attention
```
- 48 of 64 layers are Gated DeltaNet (linear attention with learned recurrence; similar to Mamba-style SSMs but with a delta-rule update)
- 16 of 64 layers are full softmax attention
- M-RoPE, dimension sections `[11, 11, 10, 0]`
- 24 query heads, 4 KV heads, key/value length 256
Gated DeltaNet carries a recurrent state: after verifying the tree and rejecting some paths, both the SSM intermediate state and the conv window must be rolled back to the state of the committed prefix. The standard ggml_gated_delta_net in llama.cpp cannot roll back.
5.3 Three custom CUDA kernels
Luce DFlash adds three ggml ops in Luce-Org/llama.cpp@luce-dflash:
| Kernel | Function | What it solves |
|---|---|---|
| `ggml_ssm_conv_tree` | Tree-aware conv state gather | Each sibling reads its own K-1 window along the DDTree parent chain rather than in DFS order |
| `ggml_gated_delta_net_tree` | DeltaNet recurrence under tree mask | Performs ancestor-only recurrent updates on the tree |
| `ggml_gated_delta_net_tree_persist` | Direct-write SSM intermediate to persistent buffer | Saves a 9 ms ggml_cpy on every step |
Without these three kernels, either the tree cannot be verified (the state gets scrambled) or every step pays a 9 ms copy that drags throughput down. Luce DFlash's RESULTS show that without the persist kernel, throughput falls from 130 back to ~85 tok/s.
5.4 Per-step rollback: snapshot + restore
```python
# Pseudocode: `target`, `kv_cache`, `deltanet_state`, `conv_state` and the helper
# functions are placeholders for the engine's internals.
def one_round(draft_logits):
    # Snapshot before verify
    state_snapshot = {
        "ssm_intermediate": deltanet_state.clone(),
        "conv_window": conv_state.clone(),
        "kv_cache_pos": kv_cache.cur_pos,
    }
    # Build tree + verify
    tree = build_ddtree(draft_logits, budget=22)
    target_logits = target.forward_tree(tree)  # uses ggml_*_tree kernels
    # Walk + commit
    accepted_path, bonus = walk_tree(tree, target_logits)
    # Restore to committed prefix only
    restore_state(state_snapshot, len(accepted_path))
    kv_cache.cur_pos = state_snapshot["kv_cache_pos"] + len(accepted_path)
    return accepted_path + [bonus]
```
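The benefit: no replay forward. Traditional speculative decoding has to re-run the target once after committing to get the correct KV state; here the state is restored directly from the partial state produced by the tree forward.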
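The "no replay" idea can be illustrated with a toy recurrence. This is a deliberately simplified sketch: a scalar state and a single chain instead of the matrix-valued Gated DeltaNet state and a tree, with made-up inputs. It only shows why keeping the per-step intermediates from the verify pass makes rollback a lookup rather than a second forward.

```python
def verify_and_keep_intermediates(state, inputs, decay=0.9):
    """Run a toy recurrence s <- decay * s + x and keep the state after every step."""
    intermediates = [state]                  # index n = state after n tokens
    for x in inputs:
        state = decay * state + x
        intermediates.append(state)
    return intermediates

def replay(state, inputs, decay=0.9):
    """Reference: recompute the state by running the recurrence again."""
    for x in inputs:
        state = decay * state + x
    return state

committed_state = 1.0
drafted = [0.5, -0.2, 0.7, 0.1]      # toy per-token inputs for 4 drafted tokens
states = verify_and_keep_intermediates(committed_state, drafted)

n_accepted = 2                       # suppose the walk accepted the first two tokens
restored = states[n_accepted]        # rollback is just picking the saved intermediate

# Same result as re-running the recurrence over the accepted prefix, without the re-run.
assert restored == replay(committed_state, drafted[:n_accepted])
print(restored)
```

The trade is memory for time: one saved state per drafted position, in exchange for skipping the second pass. Writing those intermediates cheaply is the job the persist kernel in section 5.3 is described as doing.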
5.5 Long context: TQ3_0 KV cache + sliding ring
Qwen3.5-27B has a native 32K context. How does Luce DFlash stretch to 256K on 24 GB?
KV cache quantization
| Format | bpv | Memory savings |
|---|---|---|
| F16 | 16 | 1× baseline |
| Q4_0 | 4.5 | ~3.5× |
| TQ3_0 | 3.5 | ~4.6× |
Environment variable: DFLASH27B_KV_TQ3=1 (default).
Sliding target_feat ring
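Deep KV injection needs to capture the target's hidden states at layers 16/32/48/64. With a 128K prompt, that is 5120 dims × 128K × 4 layers × 2 bytes ≈ 5.4 GB of hidden states alone, which blows straight through VRAM.
Luce DFlash uses a 4096-slot ring buffer: pos % 4096 overwrites in place. Hidden states more than 4K positions behind the current position are dropped, but because the draft only looks at recent context the impact is small. Measured in 128K mode, HumanEval still reaches 134.78 tok/s @ AL 8.33.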
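The ring buffer itself is only a few lines. The sketch below is a plain-Python stand-in (the class and function names are invented for the example, and real code would hold GPU tensors, not Python objects); it shows the `pos % capacity` overwrite and how small the ring is next to a full-length buffer.

```python
def target_feat_bytes(hidden_dim, n_positions, n_layers, bytes_per_value=2):
    """Bytes needed to keep captured hidden states for n_positions tokens."""
    return hidden_dim * n_positions * n_layers * bytes_per_value

ring_bytes = target_feat_bytes(5120, 4096, 4)   # 4096 slots, 4 captured layers, BF16
shrink = (128 * 1024) / 4096                     # vs. keeping all 128K positions
print(f"ring buffer: {ring_bytes / 1e6:.0f} MB, {shrink:.0f}x smaller than keeping every position")

class FeatureRing:
    """Fixed-size buffer: the features for position `pos` live in slot `pos % capacity`."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.positions = [-1] * capacity     # which position each slot currently holds

    def put(self, pos, feat):
        slot = pos % self.capacity
        self.slots[slot] = feat              # overwrites whatever was 1 capacity earlier
        self.positions[slot] = pos

    def get(self, pos):
        slot = pos % self.capacity
        return self.slots[slot] if self.positions[slot] == pos else None

ring = FeatureRing(capacity=4)
for pos in range(6):
    ring.put(pos, f"feat{pos}")
print(ring.get(5), ring.get(1))   # feat5 None (position 1 was overwritten by 5)
```

Tracking which position each slot holds is an addition made for this example, so that a stale read returns None instead of the wrong features. Whether the real implementation guards reads this way is not stated in the source.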
5.6 Benchmarks
Qwen3.5-27B Q4_K_M, RTX 3090 24 GB, n_gen=256, 10 prompts/dataset:
| Task | AR tok/s | DFlash+DDTree tok/s | AL | Speedup |
|---|---|---|---|---|
| HumanEval | 37.78 | 129.52 | 8.31 | 3.43× |
| Math500 | 37.71 | 110.51 | 7.04 | 2.93× |
| GSM8K | 37.65 | 96.15 | 6.14 | 2.55× |
Compared with other stacks on the same hardware:
| Stack | HumanEval tok/s | Limitation |
|---|---|---|
| llama.cpp Q4_K_M (AR) | 37.78 | No spec decode |
| SGLang AWQ INT4 (AR) | 46.6 | No DFlash, still AR |
| Luce DFlash Q4_K_M | 129.52 | 27B model, 100% lossless, 128K context |
2.8× faster than SGLang AWQ and 3.43× faster than llama.cpp. The demo run peaks at 207.6 tok/s (5.46× AR).
5.7 Quick start
```bash
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

# Build
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

# Models
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  Qwen3.5-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download z-lab/Qwen3.5-27B-DFlash \
  model.safetensors --local-dir models/draft/

# Run
python3 scripts/run.py --prompt "def fibonacci(n):"

# Long-context mode (up to 256K)
DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=16 \
  build/test_dflash models/Qwen3.5-27B-Q4_K_M.gguf \
  models/draft/model.safetensors prompt.bin 256 out.bin \
  --fast-rollback --ddtree --ddtree-budget=22 --max-ctx=131072
```
Requirements:
- NVIDIA sm_86+ (3090, A10, A40, 4090) or sm_110 Jetson AGX Thor
- CUDA 12+ (13+ for Thor)
- 24 GB VRAM
- ~80 GB disk for both models
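6. Why did these three land together so neatly?
This is the back-to-back inference acceleration trilogy of the first half of 2026.
Academia vs. engineering: who did what
- **z-lab (DFlash)**: proved that a block diffusion drafter can beat a chain drafter, and open-sourced the weights
- **Ringel & Romano (DDTree)**: realized that "the marginals are already free" and added tree verification to squeeze every bit of information out of DFlash
- **Luce-Org (Luce DFlash)**: ~2000 lines of C++/CUDA + 3 kernels, taking the paper from "runs on a B200" to "runs on a 3090"
Honestly, the third piece is what makes it usable by the community. However well a paper is written, the cost of BF16 on a B200 puts it out of reach for indie hackers and individual researchers. The Luce DFlash README says it directly: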
Consumer GPUs can run 27B models at chat-grade speed without multi-GPU, without batching, without quantization compromises. The bottleneck was never hardware. It was the decoding algorithm.
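7. Limitations and caveats
Limitations of DFlash
- Block size is hard-coded to 16: the drafter emits 16 tokens per pass and this cannot be adjusted dynamically
- Deep KV injection must capture target hidden states: each forward captures 4 layers, adding ~5% to prefill cost
- Every target model needs its own trained draft: z-lab has so far released drafts for Qwen3.5/3.6, Kimi-K2.5, gpt-oss-20b/120b, and Llama-3.1-8B; other models have to wait for the community to train them
Limitations of DDTree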
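- Marginal independence assumption: the tree uses the factorized $P(\text{path}) = \prod_i q_i(x_i)$, but real tokens are correlated; with too large a budget, acceptance length saturates
- Verifier cost grows quadratically with budget: the tree attention mask is $O(n^2)$ in tree size $n$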
Limitations of Luce DFlash
- Batch size 1: single-user scenario, no KV paging
- Greedy only: temperature/top_p are accepted but ignored (rejection sampling is not implemented yet)
- One model pair: only the Qwen3.5-27B Q4_K_M target + z-lab DFlash BF16 draft; switching models means rewriting the graph builder
- Q4_K_M gives up about 30% of acceptance length relative to BF16: ~8.31 AL on Q4_K_M vs. the paper's ~12 AL on BF16; Q5_K_M / Q6_K could recover it but do not fit in 24 GB
- CUDA only: no Metal / ROCm; a separate codebase from dflash-mlx
8. Implications for the inference ecosystem
8.1 The future of drafters: parallel everywhere
DFlash shows that sequential drafting is not necessary: block diffusion emits 16 tokens in one pass and still beats sequential drafters on quality. Possible next steps:
- Hierarchical block diffusion: one pass to produce a 64-token outline, then one pass to refine it (that is, SSD, Speculative Speculative Decoding; a similar idea already appeared at ICLR 2026)
- Prompt-conditioned drafter: different drafters for different tasks (code / math / chat)
8.2 Trees are already standard
Medusa, SpecInfer, and EAGLE-2 have all used tree verification. What sets DDTree apart is that the tree material is completely free: the drafter has already output the marginals, so no extra forward is needed. EAGLE-2 needs multiple drafter forwards to build its tree, whereas DDTree is just a best-first heap over existing data.
8.3 GGUF + custom kernel = consumer-grade SOTA
Luce DFlash shows that llama.cpp's GGUF stack has caught up to near SOTA and is no longer the second-class citizen whose "research lags vLLM by half a year". With quantization-aware speculative decoding and hand-tuned kernels, a consumer GPU can host a 27B model at chat-grade speed (>100 tok/s).
In the second half of 2026 we will most likely see:
- llama.cpp upstream merging the DFlash kernels
- Ollama / LM Studio enabling DFlash by default for supported models
- Third-party community GGUF DFlash drafts for more model pairs (Llama-4, Mistral-Next, ...)
9. Want to try it yourself? What to do
If you have a 24 GB+ NVIDIA GPU:
- Run the Luce DFlash demo: follow the quick start above and see Qwen3.5-27B at 100+ tok/s within 5 minutes
- Try a DDTree budget sweep: `--ddtree-budget=8/16/22/28`, and find the sweet spot for your GPU
- Test long-context mode: `DFLASH27B_KV_TQ3=1` + a 128K prompt, and watch how throughput changes
If you want to train your own draft model:
- Wait for z-lab's training recipe: they have promised to open-source it
- Adapt the EAGLE-3 architecture to block diffusion: a 5-layer transformer + cross-attention to target hidden states
- Distill on UltraChat / SlimPajama: BF16 training takes roughly 24 GPU-hours on 8× H100
If you want to contribute to Luce DFlash:
- Temperature / top-k sampling: rejection sampling in the verify path
- Full llama.cpp integration: `llama-speculative-dflash.cpp` + `llama-cli` / `llama-server` wiring
- New model pairs: this is the hard one, since it means rewriting the graph builder
10. Summary
The chain DFlash → DDTree → Luce DFlash shows the full lifecycle of inference acceleration research:
| Stage | Focus | Impact |
|---|---|---|
| Algorithmic insight (DFlash) | Block diffusion drafter breaks the sequential drafting bottleneck | 6× speedup, but BF16 only |
| Free lunch on top (DDTree) | The marginal distributions are already free; building a tree squeezes out another 30-40% | All 60 of 60 settings improve |
| Engineering port (Luce DFlash) | GGUF + 3 custom CUDA kernels + sliding KV ring | A 24 GB consumer GPU runs a 27B model at 130 tok/s |
Core takeaways:
- Do not underestimate the role of diffusion in LLMs: it is not only for image generation, and as a drafter it flatly beats the chain methods
- Marginals are free tree material: any parallel drafter should be paired with tree verification
- The gap between engineering and research keeps narrowing: an indie team can take a paper from a B200 down to a 3090 in two weeks
- The lossless guarantee is non-negotiable: all of these speedups are mathematically equivalent to sampling from the target, unlike lossy quantization, which trades away accuracy
Next time someone says "LLM inference has hit its limit and cannot get any faster", show them Qwen3.5-27B at 207 tok/s on a 5-year-old GPU.
Related resources
Papers
- DFlash: arxiv 2602.06036 — Chen, Liang, Liu (z-lab, 2026)
- DDTree: arxiv 2604.12989 — Ringel, Romano (Technion, 2026)
- EAGLE-3 (background): arxiv 2503.01840
- Speculative Speculative Decoding (next step): ICLR 2026 OpenReview
- Original Speculative Decoding: arxiv 2211.17192 — Leviathan et al.
Code
- z-lab/dflash: github.com/z-lab/dflash — reference vLLM/SGLang/Transformers/MLX impl
- liranringel/ddtree: github.com/liranringel/ddtree — DDTree reference impl
- Luce-Org/lucebox-hub: github.com/Luce-Org/lucebox-hub — GGUF port for RTX 3090
- Luce-Org/llama.cpp: github.com/Luce-Org/llama.cpp/tree/luce-dflash — fork with tree-mode kernels
- Aryagm/dflash-mlx: github.com/Aryagm/dflash-mlx — Apple Silicon impl
Demos & writeups
- DDTree project page: liranringel.github.io/ddtree (interactive visualization)
- Luce DFlash blog: lucebox.com/blog/dflash27b
- The Salt #116: substack.com/p/8x-faster-inference-dflash — paper review
- Hacker News thread: news.ycombinator.com/item?id=47838788 — 207 tok/s discussion
Models
- z-lab/Qwen3.5-27B-DFlash: HuggingFace (draft for the main target model)
- z-lab DFlash collection: HuggingFace collection (models for the whole family)