DeepSeek-V4: 1.6T MoE × 1M context × Hybrid Attention. How did DeepSeek cut long-context inference to 27% of V3.2's FLOPs and 10% of its KV cache?

📄 DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (DeepSeek-AI, 2026)
🤗 Open Weights: deepseek-ai/DeepSeek-V4-Pro, deepseek-ai/DeepSeek-V4-Flash
📊 Tech Report: DeepSeek_V4.pdf
📅 Release Date: April 24, 2026 (preview)

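TL;DR

On April 24, 2026, DeepSeek released the V4 series: two MoE models, both with open weights under the MIT license.

Model	Total / Active params	Context	Pricing (input / output per 1M tokens)	Positioning
V4-Pro	1.6T / 49B	1M tokens	$1.74 / $3.48	Frontier model, positioned against Claude Opus 4.6 / Gemini 3.1 Pro
V4-Flash	284B / 13B	1M tokens	$0.14 / $0.28	Fast, economical option; the lightweight first choice for agentic tasks

The real headline of this release is not the benchmarks, though. It is architectural efficiency:

⚡ Key numbers at a 1M-token context

V4-Pro needs only 27% of V3.2's single-token inference FLOPs

V4-Pro needs only 10% of V3.2's KV cache

V4-Flash goes further: 10% of the FLOPs and 7% of the KV cache

Compared with a standard GQA-8 + bf16 cache, V4 needs only about 2% of the KV cache size

These gains come from five architecture-level tricks, which this post takes apart one by one.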

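Why care about V4? Because it targets real pain points for agents

If you have run long agentic workflows on a frontier model (SWE-bench, multi-step browsing, terminal sessions), you have probably seen these failure modes:

💀 Typical ways a long-running agent dies

Context budget blows up: every tool-call result is appended to the context, and after a few hundred tool round trips the context hits the wall

KV cache eats the GPU: the cache grows linearly with sequence length until GPU memory runs out (OOM)

Inference keeps slowing down: every new token attends to the entire history, so per-token FLOPs grow linearly with sequence length

Multi-turn reasoning state goes missing: many models flush the previous reasoning trace at each new user turn

1M context is nothing new (Gemini and GPT already offer it), but a truly affordable 1M context did not exist until V4. Its core contribution is cutting the cost of long-context inference to a level that is actually deployable.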

Architecture overview: how does V4 differ from V3?


🏗️ V4-Pro structure

61 layers total

Layers 0–1: pure HCA (a warm-up that lets the model build a global view first)

Layers 2–60: alternating CSA / HCA (fine-grained selection and wide-area context complement each other)

The final MTP (Multi-Token Prediction) block: pure sliding-window attention

All FFNs use DeepSeekMoE (72 routed + 2 shared experts)

All residual connections are replaced with mHC (manifold-constrained hyper-connections)
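The layer layout above can be sketched as a small schedule builder. This is only an illustration: the function name `layer_schedule` is hypothetical, and the source does not say whether the alternation at layer 2 starts with CSA or HCA, so the `first` argument is an assumption.

```python
def layer_schedule(n_layers=61, n_warmup=2, first="CSA"):
    """Return the attention type per layer: HCA warm-up, then alternating CSA / HCA."""
    # Assumption: the source does not state which type layer 2 uses.
    second = "HCA" if first == "CSA" else "CSA"
    schedule = ["HCA"] * n_warmup
    for i in range(n_layers - n_warmup):
        schedule.append(first if i % 2 == 0 else second)
    return schedule


schedule = layer_schedule()
print(schedule[:6])
print({kind: schedule.count(kind) for kind in ("CSA", "HCA")})
```

Under this assumed ordering, the 59 alternating layers split almost evenly between the two attention types, with the two warm-up layers tipping the count toward HCA.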

V3 → V4 at a glance

Dimension	DeepSeek V3	DeepSeek V4-Pro
Total params	671B	1.6T (2.4× larger)
Active params	37B	49B (1.3× more)
Context window	128K	1M (8× longer)
Attention	Standard MLA (Multi-head Latent Attention)	Hybrid CSA + HCA
Residual	Standard residual	mHC (Birkhoff polytope + Sinkhorn)
Optimizer	AdamW	Muon (embeddings still use AdamW)
Quantization	FP8 mixed-precision training	FP4 QAT on MoE experts
Training data	14.8T tokens	33T tokens (2.2×)
1M-ctx FLOPs	baseline (V3.2)	27% of V3.2
1M-ctx KV cache	baseline (V3.2)	10% of V3.2
License	Modified OpenRAIL	MIT

💡 The most counterintuitive part: V4-Pro has 2.4× the total parameters and 1.3× the active parameters, yet its inference cost at 1M context is lower. In other words, V4's architectural efficiency gains more than offset the cost of scaling up.

Innovation #1: Hybrid Attention (CSA + HCA)

This is the heart of V4. Every other efficiency gain is built around this mechanism.

Motivation: why doesn't conventional attention scale to 1M?

Standard multi-head attention has complexity:

$$\text{Compute} = O(L^2 \cdot d), \quad \text{KV cache} = O(L \cdot d \cdot N_{\text{layers}})$$

At $L = 1\text{M}$:

Compute involves on the order of $10^{12}$ token pairs
For a 1.6T model, the KV cache can take up hundreds of GB of GPU memory
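To make the "hundreds of GB" figure concrete, here is a back-of-the-envelope sketch for a standard GQA-8 + bf16 cache at 1M tokens. The helper name `kv_cache_gb` is hypothetical, and the 61 layers and head dimension of 128 are assumptions chosen for illustration, not figures confirmed for any specific baseline.

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Size of a plain KV cache in GB (10^9 bytes) for one sequence."""
    # Each token stores one key and one value vector per KV head, per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return seq_len * bytes_per_token / 1e9


# Assumed shape: 61 layers, 8 KV heads (GQA-8), head_dim 128, bf16 (2 bytes).
baseline_gb = kv_cache_gb(1_000_000, n_layers=61, n_kv_heads=8, head_dim=128, bytes_per_value=2)
print(f"GQA-8 + bf16 at 1M tokens: {baseline_gb:.1f} GB")
print(f"2% of that: {0.02 * baseline_gb:.1f} GB")
```

With these assumed dimensions, a single 1M-token sequence needs roughly 250 GB of cache, which is why a 2% footprint changes what fits on one node.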

Earlier approaches fall into two camps:

Linear attention / SSM camp (Mamba, RWKV, Gated DeltaNet): reduces attention from $O(L^2)$ to $O(L)$, usually with a quality trade-off
Sparse attention camp (Longformer, BigBird, DSA in V3.2): attends to a chosen subset of tokens, but token-level sparse operations get poor hardware utilization

V4 takes a third route: run attention over a compressed token stream.

CSA: Compressed Sparse Attention


Three key components:

Compressor: softmax-gated pooling merges every 4 raw tokens into 1 "compressed KV entry". The compressor weights are learnable, with an added positional bias.
Lightning Indexer: for each incoming query, the indexer uses a ReLU-scored multi-head dot product to pick the top-1024 most relevant compressed blocks. The indexer runs in FP4, so top-K selection is very cheap.
Sliding Window: the most recent 128 tokens are left uncompressed and attended to directly, which preserves fine-grained local context.

🔍 CSA vs DSA (DeepSeek Sparse Attention in V3.2)
DSA selects top-K at the raw token level (search space $L$). CSA selects top-K at the level of 4×-compressed blocks (search space $L/4$), so the indexer does 4× less work.
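The three components can be put together in a toy, single-head NumPy sketch. This is not the paper's implementation: the function names are hypothetical, keys and values share one tensor, the gate logits stand in for the learned compressor (no positional bias), and the indexer scores with the attention query itself rather than a separate FP4 path.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress(kv, gate_logits, ratio):
    """Softmax-gated pooling: each group of `ratio` tokens becomes one entry."""
    n_blocks = kv.shape[0] // ratio
    kv = kv[: n_blocks * ratio].reshape(n_blocks, ratio, -1)
    gates = softmax(gate_logits[: n_blocks * ratio].reshape(n_blocks, ratio), axis=1)
    return (gates[..., None] * kv).sum(axis=1)


def csa_attend(query, kv, gate_logits, ratio=4, top_k=1024, window=128):
    """Toy CSA step: compress, pick top-k blocks, add the recent raw window."""
    blocks = compress(kv, gate_logits, ratio)
    # Indexer: ReLU-scored dot product, keep the top_k highest-scoring blocks.
    scores = np.maximum(blocks @ query, 0.0)
    k = min(top_k, blocks.shape[0])
    picked = np.argsort(-scores)[:k]
    # Attend over the selected blocks plus the uncompressed recent window.
    context = np.concatenate([blocks[picked], kv[-window:]], axis=0)
    weights = softmax(context @ query / np.sqrt(query.shape[0]))
    return weights @ context, context.shape[0]


rng = np.random.default_rng(0)
L, d = 8192, 16
kv = rng.standard_normal((L, d))
gate_logits = rng.standard_normal(L)
query = rng.standard_normal(d)
out, n_attended = csa_attend(query, kv, gate_logits)
print(out.shape, n_attended)
```

The point of the sketch is the budget: however long the history gets, each query attends to at most `top_k + window` entries, and the indexer only scores `L / ratio` blocks.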

HCA: Heavily Compressed Attention

CSA is cheap, but it is still sparse (top-1024), so it can miss important distant tokens. HCA fills that gap.


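Core idea: after 128× compression, 1M tokens become about 8K blocks. Dense 8K × 8K attention is very cheap: roughly $16{,}000\times$ cheaper than 1M × 1M, since $128^2 = 16{,}384$.

So HCA's trade-off is coarse granularity in exchange for a global view: every query sees the whole 1M context, only as a "smoothed" version.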
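The compression arithmetic is easy to check directly. The helper below is a hypothetical sketch that only counts blocks and query-key pairs; it ignores the compressor's own cost and any sliding window.

```python
def hca_costs(seq_len=1_000_000, ratio=128):
    """Return (number of compressed blocks, pair-count saving vs full attention)."""
    blocks = seq_len // ratio
    full_pairs = seq_len * seq_len
    compressed_pairs = blocks * blocks
    return blocks, full_pairs / compressed_pairs


blocks, saving = hca_costs()
print(f"{blocks} blocks, about {saving:,.0f}x fewer pairs")
```

The saving is essentially `ratio ** 2`, so the benefit of heavy compression grows quadratically with the compression ratio while the resolution of what each query sees drops linearly.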

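🎚️ How CSA and HCA complement each other

CSA: selective, 4× compression, top-1024. Fine-grained but narrow (it only sees the selected regions).

HCA: dense, 128× compression. Coarse but broad (it sees the whole context, without fine detail).

One handles "focused retrieval", the other "global awareness". Alternating them is like switching between a close-up zoom and a distant overview of a long document.

Why not use only CSA or only HCA?

The paper reports an ablation:

Pure CSA: distant tokens missed by the top-1024 selection stay invisible, and long-document understanding degrades.
Pure HCA: 128× compression wipes out too much fine detail, and code/math tasks degrade badly.
Alternating CSA + HCA: across layers, each query gets both selective fine-grained access (CSA) and a dense global view (HCA). Best of both worlds.

Numerical precision: the FP8 + FP4 combination

Component	Precision
Most KV entries	FP8
RoPE dimensions	BF16 (for numerical stability)
Lightning Indexer	FP4
MoE expert weights	FP4 QAT (FP4 is used from pretraining onward)

This precision scheme, multiplied by the compression ratios, is what produces the striking "KV cache = 2% of baseline" figure.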

Innovation #2: Manifold-Constrained Hyper-Connections (mHC)

This may be V4's most underrated contribution: a redesign of the residual connection.

Motivation: the old problem with standard residuals

Since ResNet in 2015, almost every deep network has used:

$$x_{l+1} = x_l + f_l(x_l)$$

At trillion scale with 60+ layers, this simple update rule has two problems:

Signal explosion: each layer adds $f_l(x_l)$, so the norm keeps growing and becomes enormous in deep layers
Signal collapse: at the other extreme, the norm of $f_l(x_l)$ shrinks and updates in deep layers have less and less effect

The standard fixes are LayerNorm and residual scaling, but these are post-hoc patches, not an architecture-level guarantee.

The idea behind Hyper-Connections

Hyper-connections generalize the residual from a "scalar add" to a "multi-channel weighted mix":

$$x_{l+1}^{(i)} = \sum_{j} M_{ij}\, x_l^{(j)} + f_l(x_l^{(c)})$$

where:

Each layer maintains $n$ parallel "residual streams" $x^{(1)}, \ldots, x^{(n)}$
$M \in \mathbb{R}^{n \times n}$ is a learnable mixing matrix
$f_l$ (attention or FFN) takes only one stream, $x^{(c)}$, as input

This generalizes the standard residual ($n=1, M=I$). But a raw mixing matrix $M$ is unconstrained, so signal explosion can still occur.
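The update rule can be written out in a few lines of NumPy. This is a minimal sketch of the formula as stated here, with a hypothetical function name and `np.tanh` standing in for the attention or FFN block; it is not the paper's implementation.

```python
import numpy as np


def hyper_connection_step(streams, M, f, c=0):
    """One hyper-connection update.

    streams: array of shape (n, d), one row per residual stream.
    M: (n, n) mixing matrix. f: the layer function. c: index of the stream fed to f.
    """
    mixed = M @ streams      # sum_j M_ij * x^(j) for every output stream i
    update = f(streams[c])   # f_l sees only stream c
    return mixed + update    # the same update is added to every stream


f = np.tanh  # stand-in for an attention / FFN block

# With n = 1 and M = I this reduces to the standard residual x + f(x).
x = np.array([[0.5, -1.0, 2.0]])
single = hyper_connection_step(x, np.eye(1), f)

# Two streams mixed by a small (here doubly stochastic) matrix.
streams = np.arange(6, dtype=float).reshape(2, 3)
M = np.array([[0.7, 0.3], [0.3, 0.7]])
mixed = hyper_connection_step(streams, M, f)
print(single)
print(mixed)
```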

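The mHC trick: Birkhoff polytope + Sinkhorn-Knopp

V4's innovation is to force $M$ to be a doubly stochastic matrix:

$$M_{ij} \geq 0, \quad \sum_i M_{ij} = 1, \quad \sum_j M_{ij} = 1$$

The set of such matrices is the Birkhoff polytope, and it has a very useful property:

📐 Key property: a doubly stochastic matrix has spectral norm $\|M\|_2 \leq 1$
Hence $\|Mx\|_2 \leq \|x\|_2$: the mixing step never amplifies signal magnitude. (By the Birkhoff–von Neumann theorem, $M$ is a convex combination of permutation matrices, each of which has norm 1.)

Combined with the $f_l$ update, the whole thing stays under control: stability by construction.

How to get a doubly stochastic matrix? Sinkhorn-Knopp

Parameterizing a doubly stochastic matrix directly is not easy, so V4 uses the Sinkhorn-Knopp algorithm: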

```python
import torch


def sinkhorn_knopp(M, n_iters=10):
    """
    Map an arbitrary real matrix to an (approximately) doubly stochastic one.
    """
    M = torch.exp(M)  # make every entry strictly positive
    for _ in range(n_iters):
        M = M / M.sum(dim=0, keepdim=True)  # column normalize
        M = M / M.sum(dim=1, keepdim=True)  # row normalize
    return M
```

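On every forward pass, the learnable parameter $\hat{M}$ goes through exp plus a few rounds of row/column normalization, yielding a doubly stochastic matrix (up to the small error left after a finite number of iterations).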
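A quick numerical check makes the norm guarantee tangible. The sketch below is a NumPy re-implementation (hypothetical name `sinkhorn_knopp_np`, arbitrary example logits) that normalizes a small matrix and then verifies the row sums, column sums, and the non-expansion property.

```python
import numpy as np


def sinkhorn_knopp_np(logits, n_iters=50):
    """Alternate column and row normalization on exp(logits)."""
    M = np.exp(logits - logits.max())  # positive entries; the shift avoids overflow
    for _ in range(n_iters):
        M = M / M.sum(axis=0, keepdims=True)  # column normalize
        M = M / M.sum(axis=1, keepdims=True)  # row normalize
    return M


# Arbitrary example logits, standing in for the learnable parameter.
logits = np.array([[0.0, 1.0, -1.0],
                   [0.5, 0.0, 1.0],
                   [-0.5, 1.0, 0.0]])
M = sinkhorn_knopp_np(logits)
x = np.array([3.0, -2.0, 1.0])

print("row sums:", M.sum(axis=1))
print("col sums:", M.sum(axis=0))
print("spectral norm:", np.linalg.norm(M, 2))
print("||Mx|| vs ||x||:", np.linalg.norm(M @ x), np.linalg.norm(x))
```

Because the all-ones vector is mapped to itself, the spectral norm of an exactly doubly stochastic matrix is 1 rather than strictly less: the mixing can preserve magnitude but never increase it.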

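🔬 Where Sinkhorn-Knopp comes from
The algorithm was proposed by Sinkhorn and Knopp in 1967 for scaling a positive matrix to doubly stochastic form, and it later became a standard solver for entropy-regularized optimal transport. It has made a comeback in ML in recent years, for example:

Variants of DETR's Hungarian matching

Optimal transport-based losses

Self-supervised learning (SwAV)

V4 moving it into the residual connection is the first use of this trick in a frontier LLM.

Practical benefits of mHC

Issue	Standard Residual	mHC
Signal magnitude control	Needs post-hoc fixes such as LayerNorm	Bounded by construction
Trillion-scale stability	Needs many init / scaling tricks	Almost no hyperparameter tuning
Multi-stream representation	A single stream	Multiple parallel streams
Expressivity	Standard	Richer (different streams learn different patterns)

The paper says V4 training showed no noticeable loss spikes from start to finish. Compared with the spikes visible in GPT-3 / Llama 3 training logs, that is a significant win.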

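Innovation #3: Muon Optimizer

V4 switches most parameters from AdamW to the Muon optimizer.

What is Muon?

Muon was proposed by Keller Jordan et al. in 2024. The core idea: constrain each update to the orthogonal manifold, that is, orthogonalize the update matrix before applying it.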

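```python
import torch


# AdamW update (simplified: bias correction omitted)
def adamw_step(W, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    W = W * (1 - lr * wd) - lr * m / (v.sqrt() + eps)  # decoupled weight decay
    return W, m, v


def newton_schulz_iteration(G, n_iters=5, eps=1e-7):
    # Textbook cubic Newton-Schulz; Muon itself uses a tuned quintic polynomial.
    X = G / (G.norm() + eps)  # Frobenius normalization keeps singular values <= 1
    for _ in range(n_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


# Muon update (simplified)
def muon_step(W, grad, momentum_buffer, lr=0.02, beta=0.95):
    momentum_buffer = beta * momentum_buffer + grad
    # Key step: push the update toward the space of orthogonal matrices
    O = newton_schulz_iteration(momentum_buffer, n_iters=5)
    W = W - lr * O
    return W, momentum_buffer
```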

Newton-Schulz iteration is a fast way to approximate the nearest (semi-)orthogonal matrix to an arbitrary matrix, with no SVD required.
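To see that the iteration really lands on the orthogonal factor, it can be compared against an explicit SVD on a small matrix. This NumPy sketch uses the textbook cubic Newton-Schulz update and a generous iteration count for a clean comparison; Muon's production variant uses a tuned quintic polynomial and only a handful of steps, so treat this as an illustration of the principle rather than Muon's exact kernel.

```python
import numpy as np


def newton_schulz(G, n_iters=30):
    """Cubic Newton-Schulz: drives every singular value of G toward 1."""
    X = G / np.linalg.norm(G)  # Frobenius norm, so all singular values are <= 1
    for _ in range(n_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


# Arbitrary well-conditioned example matrix (an assumption for the demo).
G = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
O = newton_schulz(G)

# Reference: the orthogonal polar factor U @ V^T from an explicit SVD.
U, _, Vt = np.linalg.svd(G)
print("orthogonal:", np.allclose(O @ O.T, np.eye(3), atol=1e-6))
print("matches U @ Vt:", np.allclose(O, U @ Vt, atol=1e-6))
```

The iteration only touches the singular values, mapping each `s` to `1.5 * s - 0.5 * s**3`, which is why it needs nothing beyond matrix multiplications.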

Why does Muon work?

Intuition: standard SGD/Adam updates have no matrix-level geometric structure, so some directions may be updated too much and others too little. Muon forces the update to be orthogonal, which normalizes every direction of the update to unit scale and makes the effective learning rate more uniform.

⚡ V4's Muon configuration

Most parameters: Muon

Embeddings: still AdamW (the update statistics of embeddings differ from those of weight matrices)

Prediction head: AdamW

RMSNorm weights: AdamW

Peak LR: 2.0e-4 for Pro, cosine decay

DeepSeek reports that Muon brings ~30% faster convergence and a more stable loss curve. At 33T training tokens, the compute saved is enormous.

Innovation #4: FP4 Quantization-Aware Training (QAT)

V4 trains the MoE expert weights and the indexer QK path in FP4 during pretraining itself, rather than quantizing after pretraining.

Post-training Quantization vs QAT

	Post-training Quant (PTQ)	Quantization-Aware Training (QAT)
Workflow	Pretrain (FP16/BF16) → quantize down to INT8/FP4	Quantization noise is simulated throughout pretraining
Quality drop	FP4 typically loses 1–3% on benchmarks	Almost none (the model learns to adapt to quantization noise)
Compute cost	Pretraining runs in FP16/BF16, expensive	FP4 forward + backward → cheaper
Deploy cost	FP4 inference	FP4 inference

Using FP4 QAT on the expert weights means:

Pretraining saves compute (FP4 forward passes are much faster than BF16)
Inference uses the pretrained weights directly, with no quality loss from a later quantization step
Small memory footprint: 1.6T params @ FP4 ≈ 800GB (vs 3.2TB in BF16)
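The sketch below illustrates what "simulating quantization noise" can look like, plus the footprint arithmetic from the list above. It assumes the common E2M1 FP4 value grid and a single per-tensor scale; the source does not state V4's exact FP4 format or scaling granularity, and both function names are hypothetical. In QAT the forward pass typically uses such rounded weights while gradients update a higher-precision copy through a straight-through estimator, which is not shown here.

```python
import numpy as np

# Assumption: E2M1 FP4 magnitudes; V4's exact FP4 format is not stated in this post.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(w):
    """Round weights to the nearest FP4 grid point under one per-tensor scale."""
    max_abs = np.abs(w).max()
    if max_abs == 0:
        return w.copy()
    scale = max_abs / FP4_GRID[-1]
    idx = np.abs(np.abs(w)[..., None] / scale - FP4_GRID).argmin(axis=-1)
    return np.sign(w) * FP4_GRID[idx] * scale


def weight_footprint_gb(n_params, bits_per_param):
    """Weight memory in GB (10^9 bytes), treating all params at one precision."""
    return n_params * bits_per_param / 8 / 1e9


w = np.array([0.35, -0.6, 1.1, -2.4])
w_q = fake_quant_fp4(w)
print("original: ", w)
print("quantized:", w_q)
print("FP4 :", weight_footprint_gb(1.6e12, 4), "GB")
print("BF16:", weight_footprint_gb(1.6e12, 16), "GB")
```

The footprint helper reproduces the 800GB vs 3.2TB comparison; it treats every parameter as FP4, whereas in V4 only the expert weights (and the indexer QK path) are described as FP4.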

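Agent-side post-training innovations

With the architecture covered, here is how V4 invests in post-training and infrastructure specifically for agent workflows.

Interleaved Thinking Across Tool Calls

V3.2's behavior: each new user message discards the previous reasoning trace. That is fine for a single-turn agent, but a multi-turn agent (where the user injects a follow-up mid-task) loses context.

V4's fix: when the conversation contains tool calls, the full reasoning history is preserved across user turns.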


For agents this is a qualitative change: reasoning becomes cumulative instead of resetting every turn.
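The difference between the two policies can be shown with a toy context builder. Everything here is hypothetical: the message fields (`reasoning`, `tool_calls`) and the function name are made up for illustration and are not DeepSeek's chat template or API schema.

```python
def build_model_context(history, preserve_with_tools=True):
    """Decide which past reasoning traces are sent back to the model."""
    uses_tools = any(m["role"] == "tool" or m.get("tool_calls") for m in history)
    keep_reasoning = preserve_with_tools and uses_tools
    context = []
    for message in history:
        message = dict(message)  # shallow copy; leave the caller's history untouched
        if message["role"] == "assistant" and not keep_reasoning:
            message.pop("reasoning", None)
        context.append(message)
    return context


history = [
    {"role": "user", "content": "Why is the test failing?"},
    {"role": "assistant", "reasoning": "Need to read the log first.",
     "tool_calls": [{"name": "read_file", "arguments": {"path": "test.log"}}]},
    {"role": "tool", "content": "AssertionError: expected 3, got 2"},
    {"role": "assistant", "reasoning": "Off-by-one in the loop bound.",
     "content": "Fix the range."},
    {"role": "user", "content": "Can you also add a regression test?"},
]

v4_style = build_model_context(history, preserve_with_tools=True)
v32_style = build_model_context(history, preserve_with_tools=False)
print([("reasoning" in m) for m in v4_style])
print([("reasoning" in m) for m in v32_style])
```

With the preserving policy, the follow-up request arrives with the earlier diagnosis still in context, so the model does not have to rediscover why the test failed.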

Tool-call Schema: `|DSML|` + XML

JSON-in-string tool-call formats have long had one problem: escape-character hell. For example, the model wants to emit:

```javascript
{"query": "He said \"hello\" to me"}
```

But it often emits:

```javascript
{"query": "He said "hello" to me"}
```

V4 introduces a special token `|DSML|` plus an XML format:

```xml
<tool_call name="search">
  <param name="query" string="true">He said "hello" to me</param>
  <param name="limit" string="false">10</param>
</tool_call>
```

Key design points:

string="true": the parameter is treated as a plain string (no JSON escaping needed)
string="false": the parameter is a JSON value (number, bool, dict)

This separation eliminates a whole class of parsing failures.
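On the consuming side, the `string` attribute tells a parser exactly when to run a JSON decoder. The following is a minimal sketch using Python's standard library; the function name is hypothetical, it handles only the XML fragment shown above, and how the `|DSML|` token wraps that fragment is not covered.

```python
import json
import xml.etree.ElementTree as ET


def parse_tool_call(xml_text):
    """Parse a <tool_call> element into (tool name, argument dict)."""
    root = ET.fromstring(xml_text.strip())
    args = {}
    for param in root.findall("param"):
        raw = param.text or ""
        if param.get("string") == "true":
            args[param.get("name")] = raw              # plain string, used as-is
        else:
            args[param.get("name")] = json.loads(raw)  # number, bool, dict, ...
    return root.get("name"), args


xml_text = """
<tool_call name="search">
  <param name="query" string="true">He said "hello" to me</param>
  <param name="limit" string="false">10</param>
</tool_call>
"""
name, args = parse_tool_call(xml_text)
print(name, args)
```

Double quotes need no escaping inside XML element text, which is what removes the failure shown earlier. Characters such as `<` and `&` are still special in XML, so a real format would need its own rule for those.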

DSec: DeepSeek Elastic Compute (RL Sandbox)

V4's agent capability comes from RL training against real tool environments. The question is how to run hundreds of thousands of sandboxes at once.

DeepSeek built its own Rust platform, DSec:

Layer	Use case
Function calls	Pure Python function execution (fastest)
Containers	Docker-style isolation
microVMs (Firecracker)	Strong isolation + fast startup
Full VMs (QEMU)	Tasks that need root / kernel access

A single cluster runs hundreds of thousands of concurrent sandboxes. Three key features:

Layered 3FS storage: image loading is very fast, so RL rollouts do not wait on container startup
Preemption-safe trajectory replay: interrupted training resumes seamlessly without re-running tool calls
Uniform Python SDK: the same training harness can target a function call or a full VM

🏭 Why the infrastructure matters strategically
DSec is not the paper's main course, but it is why many labs cannot match V4's agent performance: without infrastructure for large-scale RL rollouts, you cannot train an agent of this quality. Infrastructure is the moat.

Benchmark: how does V4-Pro-Max compare with the frontier?

V4-Pro-Max is V4-Pro with a longer reasoning-token budget (similar to OpenAI's o3 relative to GPT-4o).

Coding Benchmarks (V4's strength)

Benchmark	V4-Pro-Max	Claude Opus 4.6	GPT-5.4 xHigh	Gemini 3.1 Pro
SWE-bench Verified	80.6%	80.8%	—	80.6%
LiveCodeBench Pass@1	93.5 ⭐	88.8	—	91.7
Codeforces Rating	3206 ⭐	—	3168	3052
Terminal Bench 2.0	67.9	—	75.1 ⭐	68.5

Reasoning / Knowledge Benchmarks

Benchmark	V4-Pro-Max	Claude Opus 4.6	Gemini 3.1 Pro
MMLU-Pro	87.5%	—	91.0% ⭐
GPQA Diamond	90.1%	—	94.3% ⭐
HLE	37.7%	40.0% ⭐	—
HMMT 2026	95.2%	96.2% ⭐	—
Putnam 2025	120/120 🏆	—	—

Agentic Tasks

Benchmark	V4-Pro-Max	Notes
MCPAtlas Public	73.6	Second, just behind Opus 4.6 (73.8)
Toolathlon	51.8 ⭐	Beat K2.6 (50.0), GLM-5.1 (40.7), Gemini 3.1 Pro (48.8)
Internal R&D coding (30 tasks)	67% pass	vs Sonnet 4.5 (47%), Opus 4.5 (70%)

Long-Context Retrieval

On the MRCR 8-needle benchmark:

256K tokens: accuracy ≥ 0.82
1M tokens: accuracy = 0.59

Many models already drop off sharply beyond 200K, so holding 0.59 at 1M is impressive.

Pricing: where V4 really rewrites the rules

Model	Input / 1M	Output / 1M	vs V4-Pro
V4-Flash	$0.14	$0.28	12× cheaper than Pro
V4-Pro	$1.74	$3.48	baseline
Claude Opus 4.6	$15	$75	21× more expensive output
GPT-5.4	~$15	~$60	17× more expensive output
Gemini 3.1 Pro	~$3.50	~$10.50	3× more expensive output

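💰 Cost-per-task comparison (agentic coding workload)
Assume a typical SWE-bench task uses 50K input + 10K output tokens, at 20 tasks per day:

V4-Flash: ~$0.20/day ($6/month)

V4-Pro: ~$2.44/day ($73/month)

Claude Opus 4.6: ~$30/day ($900/month)

**V4-Pro-Max reaches SWE-bench parity with Opus 4.6 at a 21× lower output-token price, which works out to roughly 12× lower cost on this token mix.** For teams running large volumes of agentic tasks, this directly changes which tasks are economically feasible.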
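The daily figures follow directly from the pricing table. Here is a small sketch of the arithmetic; the helper name and the 30-day month are assumptions, and the prices are the ones listed in the table above.

```python
# (input, output) price in USD per 1M tokens, taken from the pricing table.
PRICES_PER_1M = {
    "V4-Flash": (0.14, 0.28),
    "V4-Pro": (1.74, 3.48),
    "Claude Opus 4.6": (15.0, 75.0),
}


def daily_cost(model, input_tokens=50_000, output_tokens=10_000, tasks_per_day=20):
    """Cost in USD per day for a fixed per-task token mix."""
    in_price, out_price = PRICES_PER_1M[model]
    per_task = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_task * tasks_per_day


for model in PRICES_PER_1M:
    day = daily_cost(model)
    print(f"{model}: ${day:.2f}/day, ${day * 30:.0f}/month")

ratio = daily_cost("Claude Opus 4.6") / daily_cost("V4-Pro")
print(f"Opus 4.6 vs V4-Pro on this mix: {ratio:.1f}x")
```

Note that the 21× figure is the ratio of output-token prices. With this input-heavy mix the overall ratio comes out near 12×, and it moves toward 21× as the share of output tokens grows.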

V4-Pro or V4-Flash: how to choose?

Use Case	Recommendation	Why
General coding	V4-Flash	2-3 point gap, but 12× cheaper
Agentic / multi-step coding	V4-Pro / Pro-Max	V4-Flash trails by 7-10 points on SWE-Pro / Terminal
Cost-sensitive batch	V4-Flash	$0.14/M input is among the cheapest in the frontier tier
Maximum coding accuracy	V4-Pro-Max	93.5 LiveCodeBench, 80.6% SWE-bench
Long-context retrieval	Either works	Same 1M context, same CSA+HCA

How to run V4 locally?

With HuggingFace Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# V4-Flash is the smaller model and the more realistic one to self-host.
# At 284B total params it still needs a multi-GPU node (or offloading).
model_id = "deepseek-ai/DeepSeek-V4-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Non-think mode (fast)
prompt = "Write a Python function to compute fibonacci numbers."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # sampling must be on for temperature / top_p to take effect
    temperature=1.0,
    top_p=1.0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Three reasoning modes

🧠 Switching V4's reasoning modes

Non-think: fast, no chain of thought (good for chat / simple tasks)

Think High: standard reasoning; the output includes a <think>...</think> block

Think Max: deepest reasoning; requires a context window of at least 384K

Recommended sampling for all modes: temperature=1.0, top_p=1.0

API usage

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_API_KEY",
)

# V4-Pro thinking mode
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "user", "content": "Solve this SWE-bench task: ..."}
    ],
    extra_body={"thinking": True},
    temperature=1.0,
    top_p=1.0,
)
print(response.choices[0].message.content)
```

Note: the two legacy model names deepseek-chat and deepseek-reasoner will be retired on 2026-07-24; for now they alias to the non-thinking / thinking modes of deepseek-v4-flash.

V4 takeaways and open questions

Key insights from V4

🎯 Five takeaways from this release

Long-context inference cost can scale sub-linearly, via hybrid attention (CSA + HCA) rather than linear attention

Residual connections are not sacred: mHC shows there are better options at trillion scale

The optimizer question is not settled: the Muon vs AdamW convergence gap is meaningful at very large scale

FP4 QAT is production-ready: quantize during pretraining and skip the post-hoc fix-ups

Infrastructure (DSec) is the hidden moat behind agent quality

Open questions

Question	Why
Is hybrid attention the final answer?	CSA + HCA is still attention-based. If SSM / linear attention matures further (e.g. Mamba-3, Gated DeltaNet), the whole paradigm could shift again
Where is the ceiling on mHC's expressivity?	The Birkhoff polytope is a strict subset of all matrices. Could this constraint be too tight for some tasks?
How does Muon scale to 10T+?	V4 shows it works at 1.6T, but scaling laws have not yet been tested at GPT-4-class scale
Where does the agentic gap come from?	V4-Flash trails by 7-10 points on agentic tasks. Is it too few active params, or not enough reasoning depth?
**`	DSML

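Conclusion: why does V4 matter so much?

V4's benchmarks are not SOTA, but benchmarks are not the problem it solves.

🔮 V4's real contribution
V4 is the first to cut the single-token inference cost of a frontier-quality 1M-context model to a truly deployable level. Consider:

SaaS products running long agentic workflows

IDEs that work over an entire codebase

RAG-less workflows that analyze a whole book or contract

Multi-agent systems that need countless tool calls

These scenarios have long been "possible in theory but not economically viable". V4-Pro brings the price of 1M-context inference down to about 1/21 of Claude Opus 4.6's output-token price, which for the first time makes this frontier mass deployable.

With the MIT license and open weights, V4 puts frontier capability directly in the hands of any lab or startup with GPUs. The next breakthrough in agents may well happen on top of someone's fine-tune of the V4-Pro base.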

