📄 DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (DeepSeek-AI, 2026) 🤗 Open Weights: deepseek-ai/DeepSeek-V4-Pro, deepseek-ai/DeepSeek-V4-Flash 📊 Tech Report: DeepSeek_V4.pdf 📅 Release Date: April 24, 2026 (preview)
TL;DR
DeepSeek released the V4 series on April 24, 2026. It consists of two MoE models, both open-weight under the MIT license:
| Model | Total / Active params | Context | Pricing (input / output per 1M tokens) | Positioning |
|---|---|---|---|---|
| V4-Pro | 1.6T / 49B | 1M tokens | $1.74 / $3.48 | Frontier tier, positioned against Claude Opus 4.6 / Gemini 3.1 Pro |
| V4-Flash | 284B / 13B | 1M tokens | $0.14 / $0.28 | Fast, economical variant; a lightweight first choice for agentic tasks |
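But the real main course of this release is not the benchmarks. It is architectural efficiency:
⚡ Core numbers, at a 1M-token context:
V4-Pro needs only 27% of V3.2's single-token inference FLOPs
V4-Pro needs only 10% of V3.2's KV cache
V4-Flash goes further: 10% of the FLOPs and 7% of the KV cache
Compared with a standard GQA-8 + bf16 cache, V4 needs only about 2% of the KV cache size
These efficiency gains come from four architecture-level tricks, which this post takes apart one by one.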
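Why care about V4? Because it targets the real pain points of agents
If you have run a long agentic workflow on a frontier model (SWE-bench, multi-step browsing, terminal sessions), you have probably seen these failure modes:
💀 Typical ways a long-running agent dies
Context budget blows up: every tool-call result is appended to the context, and after a few hundred tool round trips the context hits the wall
KV cache eats the GPU: the cache grows linearly with sequence length until GPU memory runs out (OOM)
Inference keeps getting slower: every new token attends to all previous tokens, so per-token FLOPs grow linearly with sequence length
Multi-turn reasoning state goes missing: many models flush earlier reasoning traces when a new user turn arrives
1M context is not new (Gemini and GPT already offer it), but a genuinely affordable 1M context did not exist until V4. Its core contribution is cutting the cost of long-context inference to a level that is actually deployable.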
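Architecture overview: how does V4 differ from V3?
🏗️ V4-Pro structure
61 layers total
Layers 0–1: pure HCA (a warm-up that lets the model build a global view first)
Layers 2–60: alternating CSA / HCA (fine-grained selection and wide-area context complement each other)
The final MTP (Multi-Token Prediction) block: pure sliding-window attention
All FFNs use DeepSeekMoE (72 routed + 2 shared experts)
All residual connections are replaced with mHC (manifold-constrained hyper-connections)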
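The layer layout above can be written down as a small schedule. This is only a sketch: the post does not say whether the alternation at layer 2 starts with CSA or HCA, so the `first` argument (like the function name `v4_pro_layer_schedule`) is an assumption made for illustration.

```python
def v4_pro_layer_schedule(n_layers=61, n_warmup=2, first="CSA"):
    """Return the attention type per layer: HCA warm-up, then alternating CSA/HCA.

    `first` (which type opens the alternation) is an assumption, not taken
    from the tech report.
    """
    other = "HCA" if first == "CSA" else "CSA"
    schedule = ["HCA"] * n_warmup
    for i in range(n_layers - n_warmup):
        schedule.append(first if i % 2 == 0 else other)
    return schedule


schedule = v4_pro_layer_schedule()
print(schedule[:6])
print(schedule.count("CSA"), schedule.count("HCA"))
```

Under these assumptions the 59 alternating layers split almost evenly, so roughly half of the stack does selective retrieval and half does dense global attention.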
V3 → V4 changes at a glance
| Dimension | DeepSeek V3 | DeepSeek V4-Pro |
|---|---|---|
| Total params | 671B | 1.6T (2.4× larger) |
| Active params | 37B | 49B (1.3× more) |
| Context window | 128K | 1M (8× longer) |
| Attention | Standard MLA (Multi-head Latent Attention) | Hybrid CSA + HCA |
| Residual | Standard residual | mHC (Birkhoff polytope + Sinkhorn) |
| Optimizer | AdamW | Muon (embeddings still use AdamW) |
| Quantization | FP8 mixed-precision training | FP4 QAT on MoE experts |
| Training data | 14.8T tokens | 33T tokens (2.2×) |
| 1M-ctx FLOPs | baseline (V3.2) | 27% of V3.2 |
| 1M-ctx KV cache | baseline (V3.2) | 10% of V3.2 |
| License | Modified OpenRAIL | MIT |
💡 The most counterintuitive point: V4-Pro has 2.4× the total params and 1.3× the active params, yet its inference cost at 1M context is lower. In other words, V4's architectural efficiency gains more than offset the cost of scaling up.
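Innovation #1: Hybrid Attention (CSA + HCA)
This is the soul of V4. Every other efficiency improvement revolves around this mechanism.
Motivation: why doesn't conventional attention scale to 1M?
For a sequence of length $n$ with head dimension $d$, the complexity of standard multi-head attention is:
$$\text{compute} = O(n^2 d), \qquad \text{KV cache} = O(n d)$$
At $n = 10^6$:
- Compute is on the order of $10^{12}$ token pairs
- The KV cache for a 1.6T model can eat hundreds of GB of GPU memory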
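To make "hundreds of GB" concrete, here is a back-of-the-envelope calculation for an uncompressed GQA-8 + bf16 cache at 1M tokens. The layer count (61) comes from this post; the head dimension of 128 is an assumption picked for illustration, not a published V4 number.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Size of a plain KV cache: keys and values for every layer, head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


n = 1_000_000
token_pairs = n * n  # query-key pairs for full attention over the whole sequence

# head_dim=128 is an illustrative assumption; bf16 stores 2 bytes per value.
baseline = kv_cache_bytes(n, n_layers=61, n_kv_heads=8, head_dim=128, bytes_per_value=2)

print(f"token pairs: {token_pairs:.1e}")
print(f"GQA-8 bf16 KV cache: {baseline / 1e9:.0f} GB")
print(f"2% of that: {0.02 * baseline / 1e9:.1f} GB")
```

With these numbers the baseline cache alone is around 250 GB per sequence, which is why a reduction to a few percent of that matters so much for serving.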
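Earlier solutions fall into two camps:
- Linear attention / SSM camp (Mamba, RWKV, Gated DeltaNet): turns attention from $O(n^2)$ into $O(n)$, but usually with a quality trade-off
- Sparse attention camp (Longformer, BigBird, DSA in V3.2): picks a subset of tokens to attend to, but token-level sparse operations get poor hardware utilization
V4 takes a third route: run attention over a compressed token stream.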
CSA: Compressed Sparse Attention
Three key components:
- Compressor: softmax-gated pooling compresses every 4 raw tokens into 1 "compressed KV entry". The compressor weights are learnable, with an added positional bias.
- Lightning Indexer: for each incoming query, the indexer uses a ReLU-scored multi-head dot product to pick the top-1024 most relevant compressed blocks. The indexer runs in FP4, so the top-K selection is very cheap.
- Sliding Window: the most recent 128 tokens are not compressed and are attended to directly, which preserves fine-grained local context.
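The three components can be sketched in a few lines of dependency-free Python. This is a toy under stated assumptions, not the V4 kernel: the positional bias and per-head indexer weights are left out, the gate logits are passed in rather than learned, and all function names are made up for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress(keys, gate_logits, ratio=4):
    """Softmax-gated pooling: each group of `ratio` vectors becomes one weighted average."""
    blocks = []
    for start in range(0, len(keys) - ratio + 1, ratio):
        weights = softmax(gate_logits[start:start + ratio])
        group = keys[start:start + ratio]
        dim = len(group[0])
        blocks.append([sum(w * vec[d] for w, vec in zip(weights, group)) for d in range(dim)])
    return blocks


def indexer_scores(query_heads, blocks):
    """ReLU-scored multi-head dot product: sum over heads of relu(q_h . block)."""
    return [
        sum(max(0.0, sum(q * k for q, k in zip(head, block))) for head in query_heads)
        for block in blocks
    ]


def select_top_k(scores, k):
    """Indices of the k highest-scoring blocks, returned in positional order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# 8 raw tokens -> 2 compressed blocks; the last `window` raw tokens stay uncompressed.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
blocks = compress(keys, gate_logits=[0.0] * 8, ratio=4)
scores = indexer_scores([[0.0, 1.0], [-1.0, 0.5]], blocks)
selected = select_top_k(scores, k=1)
window = list(range(len(keys) - 2, len(keys)))  # toy window of 2 instead of 128

print(blocks, scores, selected, window)
```

The query ends up attending to the selected compressed blocks plus the raw tokens in the window, which is the "selective but fine-grained" behaviour described above.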
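🔍 CSA vs DSA (DeepSeek Sparse Attention in V3.2)
DSA picks the top-K at the raw-token level (search space of size $n$). CSA picks the top-K at the level of 4×-compressed blocks (search space of size $n/4$). The indexer's workload shrinks by a factor of 4.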
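HCA: Heavily Compressed Attention
CSA is cheap, but it is still sparse (top-1024), so it can miss important distant tokens. HCA fills that gap.
Core idea: after 128× compression, 1M tokens become roughly 8K blocks. Dense 8K × 8K attention is very cheap: $128^2 \approx 1.6 \times 10^4$ times cheaper than 1M × 1M.
So HCA's trade-off is coarse granularity in exchange for a global view: every query sees the entire 1M context, but only a "smoothed" version of it.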
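The arithmetic behind that ratio is easy to check. The snippet below just counts pairs the same way the paragraph above does (block × block versus token × token); it ignores the cost of the compressor itself.

```python
n_tokens = 1_000_000
ratio = 128

n_blocks = n_tokens // ratio      # compressed entries visible to a query
dense_full = n_tokens * n_tokens  # pairs for dense attention on raw tokens
dense_hca = n_blocks * n_blocks   # pairs for dense attention on compressed blocks
speedup = dense_full / dense_hca

print(n_blocks)
print(round(speedup))
```

The saving is quadratic in the compression ratio, which is why a 128× compression buys four orders of magnitude rather than two.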
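🎚️ How CSA and HCA complement each other
CSA: selective + 4× compression + top-1024. Fine-grained but narrow (it only sees the selected regions).
HCA: dense + 128× compression. Coarse but broad (it sees the whole context, without fine detail).
One handles "focused retrieval", the other "global awareness". Alternating them is like switching between a close-up zoom and a distant overview while reading a long document.
Why not use only CSA or only HCA?
The paper ran an ablation:
- Pure CSA: distant tokens missed by the top-1024 selection stay invisible forever. Long-document understanding degrades.
- Pure HCA: 128× compression wipes out too much fine detail, and code/math tasks degrade badly.
- Alternating CSA + HCA: across layers, each query gets both selective fine-grained access (CSA) and a dense global view (HCA). Best of both worlds.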
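Numerical precision: the FP8 + FP4 combination
| Component | Precision |
|---|---|
| The vast majority of KV entries | FP8 |
| RoPE dimensions | BF16 (for numerical stability) |
| Lightning Indexer | FP4 |
| MoE expert weights | FP4 QAT (FP4 is already used during pretraining) |
This precision design, multiplied with the compression ratios, is what produces the striking "KV cache = 2% of baseline" figure.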
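Innovation #2: Manifold-Constrained Hyper-Connections (mHC)
This may be V4's most underrated contribution: a redesign of the residual connection.
Motivation: the old problem with standard residuals
Since ResNet in 2015, almost every deep network has used:
$$x_{l+1} = x_l + F_l(x_l)$$
But at trillion scale with 60+ layers, this simple update rule has two problems:
- Signal explosion: every layer adds $F_l(x_l)$, the norm slowly inflates, and by the deep layers it becomes astronomical
- Signal collapse: at the other extreme, the norm of $F_l(x_l)$ becomes small, and updates in the deep layers have less and less effect
The standard fix is LayerNorm plus residual scaling, but these are post-hoc patches, not an architecture-level guarantee.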
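The idea behind hyper-connections
Hyper-connections generalize the residual from a "scalar add" to a "multi-channel weighted mix":
$$x_{l+1}^{(i)} = \sum_{j=1}^{n} H_l^{(i,j)}\, x_l^{(j)} + F_l\big(x_l^{(1)}\big), \qquad i = 1, \dots, n$$
where:
- each layer maintains $n$ parallel "residual streams" $x_l^{(1)}, \dots, x_l^{(n)}$
- $H_l \in \mathbb{R}^{n \times n}$ is a learnable mixing matrix
- $F_l$ (attention or FFN) takes only one of the streams as input
This generalizes the standard residual ($n = 1$, $H_l = 1$). But the raw mixing matrix is unconstrained, so signal explosion can still occur.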
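The mHC trick: Birkhoff polytope + Sinkhorn-Knopp
V4's innovation is to force the mixing matrix $H_l$ to be a doubly stochastic matrix:
$$H_l^{(i,j)} \ge 0, \qquad \sum_{j} H_l^{(i,j)} = 1 \;\; \forall i, \qquad \sum_{i} H_l^{(i,j)} = 1 \;\; \forall j$$
Matrices of this kind are exactly the elements of the Birkhoff polytope. They have one very useful property:
📐 Key property: the spectral norm of a doubly stochastic matrix
$$\|H_l\|_2 = 1$$
So $\|H_l x\|_2 \le \|x\|_2$: the mixing never amplifies signal magnitude. With the $F_l$ update added on top, the whole thing stays controlled: stability by construction.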
How do you get a doubly stochastic matrix? Sinkhorn-Knopp
Directly parameterizing a doubly stochastic matrix is not easy, so V4 uses the Sinkhorn-Knopp algorithm:
```python
import torch


def sinkhorn_knopp(M, n_iters=10):
    """
    Map an arbitrary real square matrix to an (approximately) doubly stochastic one.
    """
    M = torch.exp(M)  # make every entry strictly positive
    for _ in range(n_iters):
        M = M / M.sum(dim=0, keepdim=True)  # column normalize
        M = M / M.sum(dim=1, keepdim=True)  # row normalize
    return M
```
On every forward pass, the learnable parameter goes through exp and a few rounds of row/column normalization, which yields an (approximately) doubly stochastic matrix.
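The norm property is easy to verify numerically. Below is a dependency-free sketch that redoes the same normalization on plain Python lists and then checks that mixing one scalar per stream does not increase the L2 norm. The logits and the stream values are arbitrary test inputs, and `mix` / `l2` are helper names invented for this example.

```python
import math


def sinkhorn_knopp(logits, n_iters=50):
    """Alternate column/row normalization of exp(logits), on plain lists."""
    M = [[math.exp(v) for v in row] for row in logits]
    n = len(M)
    for _ in range(n_iters):
        col_sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
        M = [[M[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        row_sums = [sum(row) for row in M]
        M = [[v / row_sums[i] for v in M[i]] for i in range(n)]
    return M


def mix(H, x):
    """Apply the mixing matrix to one scalar per stream: y = H x."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]


def l2(x):
    return math.sqrt(sum(v * v for v in x))


H = sinkhorn_knopp([[0.3, -1.2, 2.0], [1.5, 0.1, -0.7], [-0.4, 0.9, 0.2]])
x = [3.0, -1.0, 2.0]
y = mix(H, x)

row_sums = [sum(row) for row in H]
col_sums = [sum(H[i][j] for i in range(3)) for j in range(3)]
print([round(s, 6) for s in row_sums], [round(s, 6) for s in col_sums])
print(l2(y) <= l2(x))
```

Only the component of `x` along the all-ones direction passes through unchanged; everything else is contracted, which is the "never amplifies" guarantee in action.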
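🔬 Where Sinkhorn-Knopp comes from
The algorithm was published by Sinkhorn and Knopp in 1967 as a way to scale a positive matrix into doubly stochastic form; it later became a standard solver for entropy-regularized optimal transport. In recent years it has made a comeback in ML, for example:
Soft-assignment variants of Hungarian-style matching (the matching problem used in DETR)
Optimal transport-based losses
Self-supervised learning (SwAV)
V4 moving it into the residual connection marks the first time this trick is used in a frontier LLM.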
Practical benefits of mHC
| Issue | Standard Residual | mHC |
|---|---|---|
| Signal magnitude control | Needs LayerNorm as a post-hoc patch | Bounded by construction |
| Trillion-scale stability | Needs many init / scaling tricks | Almost no hyperparameter tuning |
| Multi-stream representation | A single stream | Multiple parallel streams |
| Expressivity | Standard | Richer (different streams learn different patterns) |
The paper says V4 training showed no noticeable loss spikes from start to finish. Compared with the spikes visible in the GPT-3 / Llama 3 training logs, that is a significant win.
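Innovation #3: Muon Optimizer
V4 switches most parameters from AdamW to the Muon optimizer.
What is Muon?
Muon was proposed by Keller Jordan et al. in 2024. The core idea: orthogonalize each weight-matrix update, so that the applied update is (approximately) an orthogonal matrix.
```python
import torch

# Hyperparameter defaults below are illustrative placeholders.

# AdamW update (simplified: no bias correction, no weight decay)
def adamw_step(W, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    W = W - lr * m / (torch.sqrt(v) + eps)
    return W, m, v

# Muon update (simplified)
def muon_step(W, grad, momentum_buffer, lr=2e-4, beta=0.95):
    momentum_buffer = beta * momentum_buffer + grad
    # Key step: replace the update with a nearby (semi-)orthogonal matrix.
    # newton_schulz_iteration is a helper that is not defined in this snippet.
    O = newton_schulz_iteration(momentum_buffer, n_iters=5)
    W = W - lr * O
    return W, momentum_buffer
```
Newton-Schulz iteration is a fast way to approximate the nearest (semi-)orthogonal matrix to an arbitrary matrix, without an SVD.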
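For intuition, here is a minimal sketch of the orthogonalization step using the classic cubic Newton-Schulz iteration on plain Python lists. It is a stand-in chosen for readability: Muon implementations typically use a tuned higher-order polynomial and only a handful of iterations, and the `matmul` / `transpose` helpers exist only for this example.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz_iteration(G, n_iters=15):
    """Classic cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X.

    After Frobenius normalization all singular values lie in (0, 1], where the
    iteration pushes each of them towards 1 and leaves the singular vectors
    unchanged. Exact orthogonality needs a full-rank input.
    """
    fro = math.sqrt(sum(v * v for row in G for v in row))
    X = [[v / fro for v in row] for row in G]
    for _ in range(n_iters):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X


O = newton_schulz_iteration([[2.0, 1.0], [0.0, 1.0]])
gram = matmul(O, transpose(O))  # close to the identity if O is orthogonal
print([[round(v, 6) for v in row] for row in gram])
```

Only matrix multiplications are involved, which is what makes this attractive on accelerators compared with an explicit SVD.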
Why does Muon work?
Intuition: a standard SGD/Adam update imposes no geometric structure on the update direction, so some directions may be updated too much and others too little. Muon forces the update to be orthogonal, which normalizes the update along every direction to unit scale and makes the effective learning rate more uniform.
⚡ V4's Muon configuration
Most parameters: Muon
Embeddings: still AdamW (the update statistics of embeddings differ from those of weight matrices)
Prediction head: AdamW
RMSNorm weights: AdamW
Peak LR: 2.0e-4 for Pro, cosine decay
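In practice a split like this is usually implemented as a routing rule over parameter names and shapes. The sketch below shows the idea; the substring checks (`"embed"`, `"lm_head"`, `"norm"`) and the example parameter names are hypothetical naming conventions, and sending all non-matrix parameters to AdamW is an assumption based on Muon only being defined for matrices.

```python
def pick_optimizer(param_name, ndim):
    """Route a parameter to "muon" or "adamw" following the split listed above.

    The substring checks are hypothetical naming conventions; real checkpoints
    may name these modules differently.
    """
    if "embed" in param_name or "lm_head" in param_name or "norm" in param_name:
        return "adamw"
    if ndim < 2:
        return "adamw"  # orthogonalization only makes sense for matrices
    return "muon"


# Hypothetical parameter names mapped to their number of dimensions.
params = {
    "model.embed_tokens.weight": 2,
    "model.layers.0.mlp.experts.0.up_proj.weight": 2,
    "model.layers.0.input_norm.weight": 1,
    "lm_head.weight": 2,
}
routing = {name: pick_optimizer(name, ndim) for name, ndim in params.items()}
print(routing)
```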
DeepSeek reports that Muon brings ~30% faster convergence and a more stable loss curve. At the scale of 33T training tokens, the compute saved is enormous.
Innovation #4: FP4 Quantization-Aware Training (QAT)
V4 trains the MoE expert weights and the indexer QK path in FP4 during pretraining itself, instead of quantizing after pretraining is done.
Post-training Quantization vs QAT
| | Post-training Quant (PTQ) | Quantization-Aware Training (QAT) |
|---|---|---|
| Workflow | Pretrain (FP16/BF16) → quantize down to INT8/FP4 | Quantization noise is simulated throughout pretraining |
| Quality drop | FP4 typically loses 1–3% on benchmarks | Almost no drop (the model learns to adapt to quantization noise) |
| Compute cost | Pretraining in FP16/BF16, expensive | FP4 forward + backward → cheaper |
| Deploy cost | FP4 inference | FP4 inference |
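"Simulating quantization noise" usually means a quantize-dequantize (fake-quant) step on the weights in the forward pass. The sketch below shows that step for a 4-bit float grid. It assumes the E2M1 value set and a single per-tensor scale, both chosen for illustration; the post does not state which FP4 format or scaling granularity V4 uses, and QAT setups commonly pair this with a straight-through estimator in the backward pass.

```python
import math

# Magnitudes representable in the E2M1 4-bit float format (assumed format).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quant_fp4(weights):
    """Quantize-dequantize: scale into the FP4 range, snap to the grid, scale back."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / FP4_E2M1_GRID[-1]  # per-tensor scale (an assumption)
    out = []
    for w in weights:
        magnitude = min(FP4_E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(math.copysign(magnitude * scale, w))
    return out


weights = [3.0, -1.2, 0.4, 0.1]
quantized = fake_quant_fp4(weights)
print(quantized)
```

Training against the snapped values lets the model adapt to the coarse grid, which is the mechanism behind the "almost no drop" row in the table.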
Using FP4 QAT on the expert weights means:
- Pretraining saves compute (an FP4 forward pass is much faster than BF16)
- Inference uses the pretrained weights directly, with no quality loss
- Smaller memory footprint: 1.6T params @ FP4 ≈ 800GB (vs 3.2TB in BF16)
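The footprint figures follow directly from bits per parameter. Like the estimate above, this sketch treats every parameter as FP4 and ignores quantization scale factors, so it is an approximation rather than an exact checkpoint size.

```python
def weight_bytes(n_params, bits_per_param):
    """Storage needed for the weights alone, in bytes."""
    return n_params * bits_per_param / 8


total_params = 1.6e12
fp4_bytes = weight_bytes(total_params, 4)
bf16_bytes = weight_bytes(total_params, 16)

print(f"FP4:  {fp4_bytes / 1e9:.0f} GB")
print(f"BF16: {bf16_bytes / 1e12:.1f} TB")
```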
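Agent-side post-training innovations
With the architecture covered, the next question is how V4 invests in post-training and infrastructure specifically for agent workflows.
Interleaved Thinking Across Tool Calls
V3.2 behaviour: whenever a new user message arrives, earlier reasoning traces are discarded. That is fine for a single-turn agent, but a multi-turn agent (where the user inserts a follow-up midway) loses context.
V4's fix: when a conversation contains tool calls, the full reasoning history is preserved across user turns.
For agents this is a qualitative change: reasoning becomes cumulative instead of resetting every time.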
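The difference between the two behaviours comes down to a context-building policy. Here is a minimal sketch of that policy; the message schema (a `"reasoning"` field next to `"role"` and `"content"`) and the function name `build_context` are hypothetical, and the old behaviour is simplified to "always drop reasoning".

```python
def build_context(messages, keep_reasoning_with_tools=True):
    """Decide which earlier reasoning traces survive into the next prompt.

    Each message is a dict with "role", "content" and an optional "reasoning"
    field; this schema is hypothetical.
    """
    has_tool_calls = any(m["role"] == "tool" for m in messages)
    keep = keep_reasoning_with_tools and has_tool_calls
    context = []
    for m in messages:
        m = dict(m)  # copy so the caller's history is not modified
        if not keep:
            m.pop("reasoning", None)
        context.append(m)
    return context


history = [
    {"role": "user", "content": "Fix the failing test."},
    {"role": "assistant", "content": "Running tests.", "reasoning": "Check test_io first."},
    {"role": "tool", "content": "1 failed: test_io"},
    {"role": "assistant", "content": "Patched io.py.", "reasoning": "Off-by-one in read()."},
    {"role": "user", "content": "Also update the changelog."},
]

v4_style = build_context(history)
old_style = build_context(history, keep_reasoning_with_tools=False)
print(sum("reasoning" in m for m in v4_style), sum("reasoning" in m for m in old_style))
```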
Tool-call schema: |DSML| + XML
JSON-in-string tool-call formats have a long-standing problem: escape-character hell. For example, the model wants to emit:
```json
{"query": "He said \"hello\" to me"}
```
But models often emit:
```json
{"query": "He said "hello" to me"}
```
V4 introduces a special token, |DSML|, plus an XML format:
```xml
<tool_call name="search">
  <param name="query" string="true">He said "hello" to me</param>
  <param name="limit" string="false">10</param>
</tool_call>
```
Key design points:
- string="true": the parameter is treated as a plain string (no JSON escaping needed)
- string="false": the parameter is a JSON value (number, bool, dict)
This separation directly eliminates a large class of parsing failures.
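On the consuming side, the `string` flag makes parsing almost mechanical. The sketch below treats the tool call as ordinary XML and uses only the Python standard library; how the |DSML| token frames the block in the real output is not covered here, and `parse_tool_call` is a name made up for this example.

```python
import json
import xml.etree.ElementTree as ET


def parse_tool_call(xml_text):
    """Parse one <tool_call> element into (name, arguments)."""
    root = ET.fromstring(xml_text)
    args = {}
    for param in root.findall("param"):
        raw = param.text or ""
        if param.get("string") == "true":
            args[param.get("name")] = raw              # taken verbatim, no JSON escaping
        else:
            args[param.get("name")] = json.loads(raw)  # number, bool, dict, ...
    return root.get("name"), args


call = """<tool_call name="search">
<param name="query" string="true">He said "hello" to me</param>
<param name="limit" string="false">10</param>
</tool_call>"""

name, args = parse_tool_call(call)
print(name, args)

# The unescaped JSON variant, by contrast, fails to parse at all:
try:
    json.loads('{"query": "He said "hello" to me"}')
    json_failed = False
except json.JSONDecodeError:
    json_failed = True
print(json_failed)
```

Quotes inside the string parameter need no escaping at all, though characters that are special in XML (such as `<` and `&`) would still need handling in a real parser.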
DSec: DeepSeek Elastic Compute (RL sandbox)
V4's agent capability comes from RL training against real tool environments. The question is: how do you run hundreds of thousands of sandboxes at once?
DeepSeek built its own Rust platform called DSec:
| Layer | Use case |
|---|---|
| Function calls | Pure Python function execution (fastest) |
| Containers | Docker-style isolation |
| microVMs (Firecracker) | Strong isolation + fast startup |
| Full VMs (QEMU) | Tasks that need root / kernel access |
A single cluster runs hundreds of thousands of concurrent sandboxes. Three key features:
- Layered 3FS storage: image loading is very fast, so RL rollouts do not wait on container startup
- Preemption-safe trajectory replay: interrupted training resumes seamlessly without re-running tool calls
- Uniform Python SDK: the same training harness can target a function call or a full VM
🏭 The strategic significance of infrastructure
DSec is not the paper's main course, but it is why many labs cannot match V4's agent performance: without infrastructure that can run RL rollouts at scale, you cannot train an agent of this quality. Infrastructure is the moat.
Benchmarks: how does V4-Pro-Max compare with the frontier?
V4-Pro-Max is V4-Pro with a longer reasoning-token budget (similar to OpenAI's o3 relative to GPT-4o).
Coding benchmarks (V4's strength)
| Benchmark | V4-Pro-Max | Claude Opus 4.6 | GPT-5.4 xHigh | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.6% | 80.8% | — | 80.6% |
| LiveCodeBench Pass@1 | 93.5 ⭐ | 88.8 | — | 91.7 |
| Codeforces Rating | 3206 ⭐ | — | 3168 | 3052 |
| Terminal Bench 2.0 | 67.9 | — | 75.1 ⭐ | 68.5 |
Reasoning / Knowledge Benchmarks
| Benchmark | V4-Pro-Max | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| MMLU-Pro | 87.5% | — | 91.0% ⭐ |
| GPQA Diamond | 90.1% | — | 94.3% ⭐ |
| HLE | 37.7% | 40.0% ⭐ | — |
| HMMT 2026 | 95.2% | 96.2% ⭐ | — |
| Putnam 2025 | 120/120 🏆 | — | — |
Agentic Tasks
| Benchmark | V4-Pro-Max | Notes |
|---|---|---|
| MCPAtlas Public | 73.6 | Second, just behind Opus 4.6 (73.8) |
| Toolathlon | 51.8 ⭐ | Beat K2.6 (50.0), GLM-5.1 (40.7), Gemini 3.1 Pro (48.8) |
| Internal R&D coding (30 tasks) | 67% pass | vs Sonnet 4.5 (47%), Opus 4.5 (70%) |
Long-Context Retrieval
On the MRCR 8-needle benchmark:
- 256K tokens: accuracy ≥ 0.82
- 1M tokens: accuracy = 0.59
Many models already drop sharply beyond 200K tokens, so holding 0.59 at 1M is impressive.
Pricing: where V4 really rewrites the rules
| Model | Input / 1M | Output / 1M | vs V4-Pro |
|---|---|---|---|
| V4-Flash | $0.14 | $0.28 | 12× cheaper than Pro |
| V4-Pro | $1.74 | $3.48 | baseline |
| Claude Opus 4.6 | $15 | $75 | 21× more expensive output |
| GPT-5.4 | ~$15 | ~$60 | 17× more expensive output |
| Gemini 3.1 Pro | ~$3.50 | ~$10.50 | 3× more expensive output |
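💰 Cost per task (agentic coding workload)
Assume a typical SWE-bench task uses 50K input + 10K output tokens, at 20 tasks per day:
V4-Flash: ~$0.01/task (~$6/month)
V4-Pro: ~$0.12/task (~$73/month)
Claude Opus 4.6: ~$1.50/task (~$900/month)
**V4-Pro-Max is at SWE-bench parity with Opus 4.6 at roughly 1/21 of the output-token price (about 1/12 of the total cost on the workload above).** For teams that run large volumes of agentic tasks, this directly changes which tasks are economically feasible.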
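The per-task figures can be reproduced from the pricing table with a few lines of Python. The 30-day month is an assumption; the token counts and task volume are the ones stated above.

```python
PRICES = {  # (input, output) in USD per 1M tokens, from the pricing table
    "V4-Flash": (0.14, 0.28),
    "V4-Pro": (1.74, 3.48),
    "Claude Opus 4.6": (15.0, 75.0),
}


def cost_per_task(model, input_tokens=50_000, output_tokens=10_000):
    """USD cost of one task with the given token usage."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def monthly_cost(model, tasks_per_day=20, days=30):
    return cost_per_task(model) * tasks_per_day * days


for model in PRICES:
    print(f"{model}: ${cost_per_task(model):.4f}/task, ${monthly_cost(model):.2f}/month")
```

Because input tokens dominate this workload, the blended gap between V4-Pro and Opus 4.6 is smaller than the headline output-price ratio.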
V4-Pro or V4-Flash: how to choose?
| Use Case | Recommendation | Why |
|---|---|---|
| General coding | V4-Flash | A 2-3 point gap, but 12× cheaper |
| Agentic / multi-step coding | V4-Pro / Pro-Max | V4-Flash trails by 7-10 points on SWE-Pro / Terminal |
| Cost-sensitive batch | V4-Flash | $0.14/M input is among the cheapest in the frontier tier |
| Maximum coding accuracy | V4-Pro-Max | 93.5 LiveCodeBench, 80.6% SWE-bench |
| Long-context retrieval | Either works | Same 1M context, same CSA+HCA |
How do you run V4 locally?
With HuggingFace Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# V4-Flash is the smaller model, so it is the one you can
# realistically run on your own GPUs.
model_id = "deepseek-ai/DeepSeek-V4-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across all visible GPUs
    trust_remote_code=True,
)

# Non-think mode (fast)
prompt = "Write a Python function to compute fibonacci numbers."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # without sampling, temperature / top_p are ignored
    temperature=1.0,
    top_p=1.0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Three reasoning modes
🧠 Switching V4's reasoning mode
Non-think: fast, no chain of thought (suited to chat / simple tasks)
Think High: standard reasoning; the output contains a <think>...</think> block
Think Max: deepest reasoning; needs a context window of at least 384K
Recommended sampling for all modes:
temperature=1.0, top_p=1.0
API usage
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_API_KEY",
)

# V4-Pro thinking mode
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "user", "content": "Solve this SWE-bench task: ..."}
    ],
    extra_body={"thinking": True},
    temperature=1.0,
    top_p=1.0,
)
print(response.choices[0].message.content)
```
Note: the two legacy model names deepseek-chat and deepseek-reasoner will be retired on 2026-07-24; for now they alias to the non-thinking / thinking modes of deepseek-v4-flash.
V4 takeaways and open questions
Key insights from V4
🎯 Five takeaways from this release
Long-context inference cost can scale sub-linearly, through hybrid attention (CSA + HCA) rather than linear attention
The residual connection is not sacred: mHC shows there is a better option at trillion scale
The optimizer question is not settled: the Muon vs AdamW convergence gap is meaningful at very large scale
FP4 QAT is production-ready: quantize during pretraining, with no post-hoc patching needed
Infrastructure (DSec) is the hidden moat behind agent quality
Open questions
| Question | Why |
|---|---|
| Is hybrid attention the final answer? | CSA + HCA is still attention-based. If SSM / linear attention matures further (e.g. Mamba-3, Gated DeltaNet), the whole paradigm could shift again |
| Where is the expressivity ceiling of mHC? | The Birkhoff polytope is a strict subset of all matrices. Could the constraint be too tight for some tasks? |
| How does Muon scale to 10T+? | V4 shows it works at 1.6T, but the scaling laws have not yet been tested at GPT-4-class scale |
| The root of the agentic gap | V4-Flash trails by 7-10 points on agentic tasks. Is it too few active params, or not enough reasoning depth? |
| `\|DSML\|` tool-call format | … |
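Conclusion: why does V4 matter so much?
V4's benchmarks are not SOTA, but benchmarks are not what it set out to solve.
🔮 V4's real contribution
V4 is the first to cut the per-token inference cost of a frontier-quality 1M-context model to a level that is actually deployable. Consider:
SaaS products running long agentic workflows
IDEs that process an entire codebase
RAG-less workflows that analyze a whole book / a whole contract
Multi-agent systems that need countless tool calls
These scenarios have long been "possible in theory but not economically viable". V4-Pro brings the output-token price of 1M context down to about 1/21 of Claude Opus, making this frontier mass-deployable for the first time.
Together with the MIT license and open weights, V4 puts frontier capability directly into the hands of any lab or startup with GPUs. The next breakthrough in agents may well happen on top of someone's fine-tune of the V4-Pro base.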
Related resources
Official release
- 📄 Tech Report: DeepSeek_V4.pdf
- 🤗 DeepSeek-V4-Pro (1.6T): deepseek-ai/DeepSeek-V4-Pro
- 🤗 DeepSeek-V4-Flash (284B): deepseek-ai/DeepSeek-V4-Flash
- 🤗 Base models: V4-Pro-Base, V4-Flash-Base
- 📰 API Announcement: api-docs.deepseek.com/news/news260424
- 💬 Try it: chat.deepseek.com
In-depth analysis
- 📝 HuggingFace Blog: DeepSeek-V4: a million-token context that agents can actually use
- 📝 Reddit discussion (deep dive on the architecture): r/LocalLLaMA, V4 architecture takeaways
- 🚀 NVIDIA Technical Blog: Build with DeepSeek V4 Using NVIDIA Blackwell
Prerequisite papers (V4's building blocks)
- 📄 Engram (Conditional Memory): arXiv:2601.07372
- 📄 Original mHC paper: arXiv:2512.24880 (Xie et al., 2025)
- 📄 DeepSeek Sparse Attention (V3.2): DeepSeek-AI internal release
- 📄 Muon Optimizer: Jordan et al., 2024
- 📄 DeepSeekMoE: Dai et al., 2024
- 📄 MLA (Multi-head Latent Attention, V2): DeepSeek-AI, 2024
On April 24, 2026, DeepSeek released two 1M-context MoE models at once and brought long-context inference cost into the sub-linear era. Hybrid attention, mHC, Muon, FP4 QAT: with these four architecture-level tricks combined, V4 shows that frontier capability does not have to come at frontier cost. The next trillion-scale model will probably follow this blueprint. 🐋⚡