Billy Tse

© 2026 Billy Tse

April 25, 2026•35 min read

DeepSeek-V4: 1.6T MoE × 1M Context × Hybrid Attention. How did DeepSeek cut long-context inference to 27% of V3.2's FLOPs and 10% of its KV cache?

DeepSeek released the V4 series on April 24, 2026 (V4-Pro: 1.6T total / 49B active; V4-Flash: 284B total / 13B active), both supporting a 1M-token context. This post breaks down five core architectural innovations: CSA (4× compression plus a lightning indexer that selects the top-1024 blocks), HCA (128× compression plus dense attention), manifold-constrained hyper-connections (mHC, which replace the residual connection using the Birkhoff polytope and Sinkhorn-Knopp), the Muon optimizer, and FP4 QAT. It also covers the agent-side work: interleaved thinking, |DSML| XML tool calls, and the DSec RL sandbox.

AI · Transformer · NLP · Inference Optimization · Attention Mechanisms

📄 DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (DeepSeek-AI, 2026)
🤗 Open weights: deepseek-ai/DeepSeek-V4-Pro, deepseek-ai/DeepSeek-V4-Flash
📊 Tech report: DeepSeek_V4.pdf
📅 Release date: April 24, 2026 (preview)

TL;DR

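DeepSeek released the V4 series on April 24, 2026. It consists of two MoE models, both with open weights under the MIT license:

| Model | Total / active params | Context | Pricing (input / output per 1M tokens) | Positioning |
|---|---|---|---|---|
| V4-Pro | 1.6T / 49B | 1M tokens | $1.74 / $3.48 | Frontier model, positioned against Claude Opus 4.6 / Gemini 3.1 Pro |
| V4-Flash | 284B / 13B | 1M tokens | $0.14 / $0.28 | Fast, economical version; the lightweight first choice for agentic tasks |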

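But the real main course of this release is not the benchmarks. It is architectural efficiency:

⚡ Core numbers, at a 1M-token context:

  • V4-Pro needs only 27% of V3.2's single-token inference FLOPs

  • V4-Pro needs only 10% of V3.2's KV cache

  • V4-Flash goes further: 10% of the FLOPs and 7% of the KV cache

  • Compared with a standard GQA-8 + bf16 cache, V4 needs only about 2% of the KV cache size

These efficiency gains come from five architecture-level tricks (CSA, HCA, mHC, Muon, and FP4 QAT), and this post takes them apart one by one.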


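Why care about V4? Because it solves real pain points for agents

If you have run a long agentic workflow on a frontier model (for example SWE-bench, multi-step browsing, or a terminal session), you have probably seen these failure modes:

💀 Typical ways an agent dies on a long task

  1. Context budget blows up: every tool call result is appended to the context, and after a few hundred tool round trips the context hits the wall

  2. KV cache eats the GPU: the cache grows linearly with sequence length until GPU memory runs out (OOM)

  3. Inference keeps getting slower: every new token has to attend to the full history, so per-token FLOPs grow linearly with sequence length

  4. Multi-turn reasoning state goes missing: many models flush the previous reasoning trace when a new user turn arrives

A 1M context is nothing new (Gemini and GPT both offer it), but a truly affordable 1M context did not exist until V4. Its core contribution is cutting the cost of long-context inference down to a level that can actually be deployed.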

Architecture overview: how does V4 differ from V3?


🏗️ V4-Pro structure

  • 61 layers in total

  • Layers 0–1: pure HCA (a warm-up that lets the model build a global view first)

  • Layers 2–60: alternating CSA / HCA (fine-grained selection and wide-area context complement each other)

  • The final MTP (Multi-Token Prediction) block: pure sliding-window attention

  • Every FFN uses DeepSeekMoE (72 routed + 2 shared experts)

  • Every residual connection is replaced by mHC (manifold-constrained hyper-connections)
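The layer layout above can be written down as a small schedule. This is only a sketch: the source says layers 2–60 alternate CSA / HCA but not which type comes first, so starting with CSA at layer 2 is an assumption, and `build_layer_schedule` is a hypothetical helper name.

```python
from collections import Counter

def build_layer_schedule(n_layers=61, n_warmup_hca=2):
    """Return the attention type of each layer for the layout described above.

    Assumption: the alternating region starts with CSA at the first
    non-warm-up layer. The MTP block is not part of this schedule.
    """
    schedule = []
    for layer in range(n_layers):
        if layer < n_warmup_hca:
            schedule.append("HCA")  # warm-up layers: pure HCA
        else:
            offset = layer - n_warmup_hca
            schedule.append("CSA" if offset % 2 == 0 else "HCA")
    return schedule

schedule = build_layer_schedule()
counts = Counter(schedule)
print(schedule[:6])
print(dict(counts))
```

Under that assumption the 61 layers split into 30 CSA layers and 31 HCA layers; flipping the starting type would give 29 and 32.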

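V3 → V4 at a glance

| Dimension | DeepSeek V3 | DeepSeek V4-Pro |
|---|---|---|
| Total params | 671B | 1.6T (2.4× larger) |
| Active params | 37B | 49B (1.3× more) |
| Context window | 128K | 1M (8× longer) |
| Attention | Standard MLA (Multi-head Latent Attention) | Hybrid CSA + HCA |
| Residual | Standard residual | mHC (Birkhoff polytope + Sinkhorn) |
| Optimizer | AdamW | Muon (embeddings still use AdamW) |
| Quantization | FP8 (mixed-precision training) | FP4 QAT on MoE experts |
| Training data | 14.8T tokens | 33T tokens (2.2×) |
| 1M-ctx FLOPs | baseline | 27% of V3.2 |
| 1M-ctx KV cache | baseline | 10% of V3.2 |
| License | Modified OpenRAIL | MIT |

💡 The most counterintuitive point: V4-Pro has 2.4× the total parameters and 1.3× the active parameters, yet its inference cost at 1M context is lower. In other words, V4's architectural efficiency gains more than offset the cost of scaling up.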

Innovation #1: Hybrid Attention (CSA + HCA)

This is the soul of V4. Every other efficiency gain is built around this mechanism.

Motivation: why doesn't conventional attention scale to 1M?

The complexity of standard multi-head attention is:

$$\text{Compute} \propto O(L^2 \cdot d), \quad \text{KV cache} \propto O(L \cdot d \cdot N_{\text{layers}})$$

At $L = 1\text{M}$:

  • Compute is on the order of $10^{12}$ token pairs
  • For a 1.6T model, the KV cache can take up hundreds of GB of GPU memory

Earlier solutions fall into two camps:

  • Linear attention / SSM (Mamba, RWKV, Gated DeltaNet): reduce attention from $O(L^2)$ to $O(L)$, usually with some trade-off in quality
  • Sparse attention (Longformer, BigBird, DSA in V3.2): attend to a selected subset of tokens, but token-level sparse operations make poor use of the hardware

V4 takes a third route: run attention over a compressed token stream.

CSA: Compressed Sparse Attention


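Three key components:

  1. Compressor: softmax-gated pooling compresses every 4 raw tokens into one "compressed KV entry". The compressor weights are learnable, with an added positional bias.
  2. Lightning Indexer: for each incoming query, the indexer uses a ReLU-scored multi-head dot product to pick the top-1024 most relevant compressed blocks. The indexer runs in FP4, so the top-K selection is very cheap.
  3. Sliding Window: the most recent 128 tokens are left uncompressed and attended to directly, which preserves fine-grained local context.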
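The three components can be sketched for a single query in NumPy. This is an illustrative toy under stated assumptions, not the model's actual kernels: keys and values share one tensor, the positional bias and FP4 arithmetic are omitted, overlap between the recent window and the compressed blocks is ignored, and every function name and shape here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress(kv, gate_w, ratio=4):
    """Softmax-gated pooling: every `ratio` tokens become one compressed entry."""
    L, d = kv.shape
    blocks = kv[: L - L % ratio].reshape(-1, ratio, d)   # (n_blocks, ratio, d)
    gates = softmax(blocks @ gate_w, axis=1)             # (n_blocks, ratio)
    return (gates[..., None] * blocks).sum(axis=1)       # (n_blocks, d)

def indexer_topk(q_idx, k_idx, head_w, k):
    """ReLU-scored multi-head dot product, then keep the k best blocks."""
    scores = np.maximum(k_idx @ q_idx.T, 0.0) @ head_w   # (n_blocks,)
    return np.argsort(-scores)[: min(k, scores.shape[0])]

def csa_single_query(q, kv, gate_w, q_idx, idx_proj, head_w,
                     top_k=1024, window=128, ratio=4):
    comp = compress(kv, gate_w, ratio)
    chosen = indexer_topk(q_idx, comp @ idx_proj, head_w, top_k)
    # Attend over the selected compressed blocks plus the raw recent window
    keys = np.concatenate([comp[chosen], kv[-window:]], axis=0)
    weights = softmax(keys @ q / np.sqrt(q.shape[0]))
    return weights @ keys, len(chosen)

rng = np.random.default_rng(0)
L, d, d_i, H = 8192, 64, 16, 4
kv = rng.standard_normal((L, d))
out, n_selected = csa_single_query(
    q=rng.standard_normal(d), kv=kv, gate_w=rng.standard_normal(d),
    q_idx=rng.standard_normal((H, d_i)), idx_proj=rng.standard_normal((d, d_i)),
    head_w=np.ones(H),
)
print(out.shape, n_selected)
```

With 8192 tokens the query attends to 1024 compressed blocks plus 128 recent tokens instead of all 8192 positions, which is where the saving comes from.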

🔍 CSA vs DSA (V3.2's DeepSeek Sparse Attention)
DSA selects the top-K at the raw token level (search space $L$). CSA selects the top-K at the level of 4×-compressed blocks (search space $L/4$), so the indexer's workload shrinks by a factor of 4.

HCA: Heavily Compressed Attention

CSA is cheap, but it is still sparse (top-1024), so it may miss an important token far away. HCA plugs that hole.


Core idea: after 128× compression, 1M tokens become about 8K blocks. Dense attention over 8K × 8K is very cheap: about 16,000 times cheaper than 1M × 1M (128² = 16,384).
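The arithmetic behind those numbers is easy to check. The sketch below also shows the per-query view: if queries stay at token resolution during decoding (an assumption, the source does not spell this out), a single new token sees about 128× fewer keys rather than 16,000× fewer.

```python
import math

CONTEXT = 1_000_000   # 1M tokens
RATIO = 128           # HCA compression ratio from the text

blocks = math.ceil(CONTEXT / RATIO)      # compressed KV entries
pairs_full = CONTEXT * CONTEXT           # token-token pairs for dense 1M x 1M
pairs_block = blocks * blocks            # block-block pairs, the "8K x 8K" case
pair_reduction = pairs_full / pairs_block
per_query_reduction = CONTEXT / blocks   # keys seen by one decoded token

print(blocks)                            # 7813
print(round(pair_reduction))
print(round(per_query_reduction))
```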

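So HCA's trade-off is coarse granularity in exchange for a global view: every query sees the whole 1M context, just in a "smoothed" version.

🎚️ How CSA and HCA complement each other

  • CSA: selective + 4× compression + top-1024. Fine-grained but narrow (it only sees the selected places).

  • HCA: dense + 128× compression. Coarse but broad (it sees the whole context, but not the fine detail).

One handles "focused retrieval", the other handles "global awareness". Alternating them is like reading a long document by switching between a close-up zoom and a distant overview.

Why not use only CSA or only HCA?

The paper ran an ablation:

  • Pure CSA: distant tokens missed by the top-1024 selection stay invisible forever. Long-document understanding drops.
  • Pure HCA: 128× compression wipes out too much fine detail, and code/math tasks degrade badly.
  • Alternating CSA + HCA: across layers, each query gets both selective fine-grained access (CSA) and a dense global view (HCA). Best of both worlds.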

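Numerical precision: the FP8 + FP4 combination

| Component | Precision |
|---|---|
| The vast majority of KV entries | FP8 |
| RoPE dimensions | BF16 (for numerical stability) |
| Lightning Indexer | FP4 |
| MoE expert weights | FP4 QAT (FP4 is used from pretraining onward) |

This precision scheme, multiplied by the compression ratios, is what yields the striking "KV cache = 2% of baseline" figure.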

Innovation #2: Manifold-Constrained Hyper-Connections (mHC)

This may be V4's most underrated contribution: a redesign of the residual connection.

Motivation: the old problem with standard residuals

Since ResNet in 2015, almost every deep network has used:

$$x_{l+1} = x_l + f_l(x_l)$$

At trillion-parameter scale with 60+ layers, this simple update rule has two problems:

  1. Signal explosion: every layer adds $f_l(x_l)$, so the norm creeps upward and becomes enormous in the deep layers
  2. Signal collapse: at the other extreme, the norm of $f_l(x_l)$ shrinks, so updates in the deep layers have less and less effect

The standard remedies are LayerNorm and residual scaling, but these are post-hoc patches, not an architecture-level guarantee.

The hyper-connections idea

Hyper-connections generalize the residual from a single identity add to a weighted mix across multiple channels:

$$x_{l+1}^{(i)} = \sum_{j} M_{ij}\, x_l^{(j)} + f_l\big(x_l^{(c)}\big)$$

where:

  • each layer maintains $n$ parallel "residual streams" $x^{(1)}, \ldots, x^{(n)}$
  • $M \in \mathbb{R}^{n \times n}$ is a learnable mixing matrix
  • $f_l$ (attention or FFN) takes only one stream, $x_l^{(c)}$, as input

This generalizes the standard residual ($n=1$, $M=I$). But the raw mixing matrix $M$ is unconstrained, so signal explosion can still occur.
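A minimal NumPy sketch of that update rule, taken literally from the formula: the sublayer output is added to every stream, and `np.tanh` stands in for an attention or FFN block. The function name and the choice of stream `c=0` are assumptions for illustration.

```python
import numpy as np

def hyper_connection_step(streams, M, f, c=0):
    """One layer update. streams: (n, d) residual streams, M: (n, n) mixing matrix."""
    mixed = M @ streams        # sum_j M_ij x^(j) for every stream i
    update = f(streams[c])     # the sublayer reads a single stream c
    return mixed + update      # the update is broadcast to all n streams

f = np.tanh  # stand-in sublayer

# n = 1 with M = I recovers the standard residual x + f(x)
x = np.array([[0.5, -1.0, 2.0]])
out_single = hyper_connection_step(x, np.eye(1), f)

# n = 4 streams mixed by a uniform averaging matrix
streams = np.ones((4, 3))
out_multi = hyper_connection_step(streams, np.full((4, 4), 0.25), f)
print(out_single, out_multi.shape)
```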

mHC's trick: Birkhoff polytope + Sinkhorn-Knopp

V4's innovation is to force $M$ to be a doubly stochastic matrix:

$$M_{ij} \geq 0, \quad \sum_i M_{ij} = 1, \quad \sum_j M_{ij} = 1$$

Such matrices are exactly the elements of the Birkhoff polytope. They have a very useful property:

📐 Key property: a doubly stochastic matrix has spectral norm $\|M\|_2 \leq 1$
So $\|Mx\|_2 \leq \|x\|_2$: the mixing step can never amplify the signal magnitude.

The mixing itself contributes no growth; only the $f_l$ update adds magnitude, which is the sense in which stability holds by construction.

How do you get a doubly stochastic matrix? Sinkhorn-Knopp

Parameterizing a doubly stochastic matrix directly is not easy, so V4 uses the Sinkhorn-Knopp algorithm:

```python
import torch

def sinkhorn_knopp(M, n_iters=10):
    """Map an unconstrained square matrix to an (approximately) doubly stochastic one."""
    M = torch.exp(M)  # make every entry strictly positive
    for _ in range(n_iters):
        M = M / M.sum(dim=0, keepdim=True)  # column normalize
        M = M / M.sum(dim=1, keepdim=True)  # row normalize
    return M
```

On every forward pass, the learnable parameter $\hat{M}$ goes through exp plus a few rounds of row/column normalization, producing a doubly stochastic matrix (up to the numerical tolerance of the iteration).
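The norm guarantee can be checked numerically. The sketch below is a NumPy port written for this post (not DeepSeek's code) with more iterations than the default above so the column sums converge tightly; the matrix size and seed are arbitrary.

```python
import numpy as np

def sinkhorn_knopp_np(logits, n_iters=200):
    M = np.exp(logits - logits.max())          # positive entries; the shift is for safety
    for _ in range(n_iters):
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
    return M

rng = np.random.default_rng(0)
M = sinkhorn_knopp_np(rng.standard_normal((4, 4)))
x = rng.standard_normal(4)

row_err = np.abs(M.sum(axis=1) - 1).max()
col_err = np.abs(M.sum(axis=0) - 1).max()
spectral_norm = np.linalg.norm(M, 2)           # largest singular value

print(row_err, col_err)
print(spectral_norm)
print(np.linalg.norm(M @ x) <= np.linalg.norm(x) + 1e-9)
```

The largest singular value sits at 1 (the all-ones vector is always mapped to itself), so mixing can preserve a signal's norm but never grow it.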

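🔬 Where Sinkhorn-Knopp comes from
The algorithm was published by Sinkhorn and Knopp in 1967 as a matrix-scaling method that turns a positive matrix into a doubly stochastic one. It later became a standard tool for entropy-regularized optimal transport, and in recent years it has resurfaced in ML, for example in:

  • Differentiable relaxations of assignment problems (soft alternatives to the Hungarian matching used in DETR)

  • Optimal-transport-based losses

  • Self-supervised learning (SwAV)

As far as I know, V4 is the first frontier LLM to apply this trick to the residual connection.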

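Practical benefits of mHC

| Issue | Standard residual | mHC |
|---|---|---|
| Signal magnitude control | Needs LayerNorm as a post-hoc patch | Bounded by construction |
| Trillion-scale stability | Needs many init / scaling tricks | Almost no hyperparameter tuning |
| Multi-stream representation | A single stream | Multiple parallel streams |
| Expressivity | Standard | Richer (different streams learn different patterns) |

The paper says V4's training showed no noticeable loss spike from start to finish. Compared with the spike plots in the GPT-3 / Llama 3 training logs, that is a significant win.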

Innovation #3: the Muon optimizer

V4 switches most parameters from AdamW to the Muon optimizer.

What is Muon?

Muon was proposed by Keller Jordan et al. in 2024. The core idea: orthogonalize each weight matrix's momentum update before applying it.

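```python
import torch

# AdamW update (simplified: no bias correction, no weight decay)
def adamw_step(W, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    W = W - lr * m / (torch.sqrt(v) + eps)
    return W, m, v

def newton_schulz_iteration(G, n_iters=5, eps=1e-7):
    # Quintic iteration with the coefficients used in the public Muon code
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    for _ in range(n_iters):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

# Muon update (simplified; lr and beta are illustrative defaults)
def muon_step(W, grad, momentum_buffer, lr=0.02, beta=0.95):
    momentum_buffer = beta * momentum_buffer + grad
    # Key step: replace the update with an approximately orthogonal matrix
    O = newton_schulz_iteration(momentum_buffer, n_iters=5)
    W = W - lr * O
    return W, momentum_buffer
```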

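Newton-Schulz iteration is a fast way to approximate the nearest (semi-)orthogonal matrix to a given matrix, without an SVD.

Why does Muon work?

The intuition: a standard SGD/Adam update has no geometric structure across the matrix, so some directions can get updates that are too large and others too small. Muon forces the update to be approximately orthogonal, which normalizes every singular direction to roughly unit scale and makes the effective learning rate more uniform.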
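To see the "every direction gets unit scale" effect concretely, here is a small NumPy check. Note the swap: it uses the classical cubic Newton-Schulz iteration, which converges to the exact orthogonal factor, rather than the tuned quintic variant used in Muon's public implementation, which only lands near it. The matrix `G` is an arbitrary stand-in for a momentum buffer.

```python
import numpy as np

def newton_schulz_orthogonalize(G, n_iters=25):
    X = G / np.linalg.norm(G)            # Frobenius norm: singular values now in (0, 1]
    for _ in range(n_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # maps each singular value s to 1.5*s - 0.5*s**3
    return X

G = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.5, 0.3],
              [0.4, 0.0, 1.0]])
O = newton_schulz_orthogonalize(G)

U, S, Vt = np.linalg.svd(G)
print(np.round(S, 3))                       # uneven singular values of the raw update
print(np.round(np.linalg.svd(O)[1], 3))     # singular values after orthogonalization
print(np.allclose(O, U @ Vt, atol=1e-6))    # agrees with the SVD-based polar factor
```

The iteration only uses matrix multiplications, which is why it is attractive on accelerators compared with computing an SVD at every step.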

⚡ V4's Muon configuration

  • Most parameters: Muon

  • Embeddings: still AdamW (the update statistics of embeddings differ from those of weight matrices)

  • Prediction head: AdamW

  • RMSNorm weights: AdamW

  • Peak LR: 2.0e-4 for Pro, cosine decay

DeepSeek reports about 30% faster convergence and a more stable loss curve with Muon. At the scale of 33T training tokens, the compute saved is enormous.

Innovation #4: FP4 Quantization-Aware Training (QAT)

V4 trains the MoE expert weights and the indexer's QK path in FP4 during pretraining itself, rather than quantizing after pretraining is done.

Post-training quantization vs QAT

| | Post-training quantization (PTQ) | Quantization-aware training (QAT) |
|---|---|---|
| Workflow | Pretrain (FP16/BF16) → quantize down to INT8/FP4 | Quantization noise is simulated throughout pretraining |
| Quality drop | FP4 typically loses 1–3% on benchmarks | Almost no drop (the model learns to adapt to quantization noise) |
| Compute cost | Pretraining in FP16/BF16 is expensive | FP4 forward + backward is cheaper |
| Deploy cost | FP4 inference | FP4 inference |
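"Simulating quantization noise" usually means a quantize-then-dequantize step in the forward pass, with gradients passed straight through it in the backward pass. The sketch below shows only the forward part in NumPy. The E2M1 value grid and the block size of 32 are assumptions made for illustration; the source does not state which FP4 format or scaling scheme V4 uses.

```python
import numpy as np

# Magnitudes representable in the E2M1 FP4 format (assumed format)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w, block=32):
    """Quantize-dequantize `w` with one scale per block of `block` values.

    `w.size` must be divisible by `block`.
    """
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)       # avoid dividing by zero
    scaled = flat / scale
    # Snap each magnitude to the nearest grid point and keep the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64))
w_q = fake_quant_fp4(w)
print(np.abs(w - w_q).max())   # the quantization noise the model trains against
```

In a QAT setup the layer would compute with `w_q` while the optimizer keeps updating a higher-precision copy of `w`.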

Using FP4 QAT on the expert weights means:

  1. Pretraining saves compute (an FP4 forward pass is much faster than BF16)
  2. Inference uses the pretrained weights directly, with no quality loss
  3. A small memory footprint: 1.6T params @ FP4 ≈ 800GB (vs 3.2TB in BF16)
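The footprint figures follow directly from bits per parameter. This back-of-the-envelope sketch treats every parameter as stored at the same width and ignores quantization scale factors, so it reproduces the rounded numbers above rather than an exact checkpoint size (the source applies FP4 only to the expert weights and the indexer path).

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weight storage in GB (10**9 bytes), ignoring scales and metadata."""
    return n_params * bits_per_param / 8 / 1e9

fp4_gb = weight_memory_gb(1.6e12, 4)     # V4-Pro at 4 bits per weight
bf16_gb = weight_memory_gb(1.6e12, 16)   # V4-Pro at 16 bits per weight
flash_bf16_gb = weight_memory_gb(284e9, 16)

print(fp4_gb, bf16_gb, flash_bf16_gb)
```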

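Agent-side post-training innovations

With the architecture covered, the next question is how V4 invests in post-training and infrastructure specifically for agent workflows.

Interleaved Thinking Across Tool Calls

V3.2's behavior: whenever a new user message arrives, the previous reasoning trace is discarded. That is fine for a single-turn agent, but a multi-turn agent (where the user injects a follow-up midway) loses context.

V4's fix: when the conversation contains tool calls, the full reasoning history is preserved across user turns.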


For agents this is a qualitative change: reasoning becomes cumulative instead of being reset to zero each time.
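The policy difference can be expressed as a small context-building rule. Everything here is hypothetical: the message schema with `reasoning` and `tool_calls` keys and the `build_context` helper are invented for illustration, and the real behavior lives in the model's chat template.

```python
def build_context(messages, preserve_reasoning_with_tools=True):
    """Return the messages to feed back to the model on the next turn.

    Hypothetical schema: dicts with "role", "content", and optional
    "reasoning" / "tool_calls" keys.
    """
    has_tools = any(m.get("tool_calls") or m["role"] == "tool" for m in messages)
    keep = preserve_reasoning_with_tools and has_tools
    context = []
    for m in messages:
        m = dict(m)                      # copy so the caller's history is untouched
        if m["role"] == "assistant" and not keep:
            m.pop("reasoning", None)     # old behavior: earlier reasoning is flushed
        context.append(m)
    return context

history = [
    {"role": "user", "content": "Fix the failing test"},
    {"role": "assistant", "content": "", "reasoning": "The trace points at utils.py",
     "tool_calls": [{"name": "read_file"}]},
    {"role": "tool", "content": "...file contents..."},
    {"role": "user", "content": "Also update the docs"},
]
kept = build_context(history)                                          # V4-style
dropped = build_context(history, preserve_reasoning_with_tools=False)  # V3.2-style
print("reasoning" in kept[1], "reasoning" in dropped[1])
```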

Tool-call schema: |DSML| + XML

JSON-in-string tool call formats have a long-standing problem: escape character hell. For example, the model wants to emit:

```javascript
{"query": "He said \"hello\" to me"}
```

But it often emits:

```javascript
{"query": "He said "hello" to me"}
```

V4 introduces the special token |DSML| together with an XML format:

```xml
<tool_call name="search">
  <param name="query" string="true">He said "hello" to me</param>
  <param name="limit" string="false">10</param>
</tool_call>
```

Key design points:

  • string="true": the parameter is treated as a plain string (no JSON escaping needed)
  • string="false": the parameter is a JSON value (number, bool, dict)

This separation eliminates a whole class of parsing failures.
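A consumer of this format could look roughly like the sketch below. It parses only the XML part shown above with Python's standard library; how the |DSML| token frames the call is not described in the source, so that is left out, and `parse_tool_call` is a hypothetical helper rather than an official SDK function.

```python
import json
import xml.etree.ElementTree as ET

def parse_tool_call(xml_text):
    root = ET.fromstring(xml_text)
    args = {}
    for param in root.findall("param"):
        raw = param.text or ""
        if param.get("string") == "true":
            args[param.get("name")] = raw               # literal string, no JSON escaping
        else:
            args[param.get("name")] = json.loads(raw)   # number, bool, list, dict
    return root.get("name"), args

example = """<tool_call name="search">
  <param name="query" string="true">He said "hello" to me</param>
  <param name="limit" string="false">10</param>
</tool_call>"""
name, args = parse_tool_call(example)
print(name, args)

# The unescaped JSON variant from above fails to parse at all
try:
    json.loads('{"query": "He said "hello" to me"}')
    json_ok = True
except json.JSONDecodeError:
    json_ok = False
print(json_ok)
```

XML still has its own special characters (`<` and `&`), so string parameters are not entirely escape-free, but double quotes and backslashes, the usual culprits in JSON, pass through untouched.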

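DSec: DeepSeek Elastic Compute (RL sandbox)

V4's agent capability comes from RL training against real tool environments. The problem: how do you run hundreds of thousands of sandboxes at the same time?

DeepSeek built its own platform in Rust, called DSec:

| Layer | Use case |
|---|---|
| Function calls | Pure Python function execution (fastest) |
| Containers | Docker-style isolation |
| microVMs (Firecracker) | Strong isolation + fast startup |
| Full VMs (QEMU) | Tasks that need root / kernel access |

A single cluster runs hundreds of thousands of concurrent sandboxes. Three key features:

  1. Layered 3FS storage: image loading is very fast, so RL rollouts do not wait on container startup
  2. Preemption-safe trajectory replay: interrupted training resumes seamlessly without re-running tool calls
  3. Uniform Python SDK: the same training harness can target anything from a function call to a full VM

🏭 The strategic significance of infrastructure
DSec is not the paper's main course, but it is the reason many labs cannot match V4's agent performance: without infrastructure that can run large-scale RL rollouts, you cannot train an agent of this quality. Infrastructure is the moat.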

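Benchmarks: how does V4-Pro-Max compare with the frontier?

V4-Pro-Max is V4-Pro with a longer reasoning-token budget (similar to OpenAI's o3 relative to GPT-4o).

Coding benchmarks (V4's strength)

| Benchmark | V4-Pro-Max | Claude Opus 4.6 | GPT-5.4 xHigh | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.6% | 80.8% | n/a | 80.6% |
| LiveCodeBench Pass@1 | 93.5 ⭐ | 88.8 | n/a | 91.7 |
| Codeforces Rating | 3206 ⭐ | n/a | 3168 | 3052 |
| Terminal Bench 2.0 | 67.9 | n/a | 75.1 ⭐ | 68.5 |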

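Reasoning / knowledge benchmarks

| Benchmark | V4-Pro-Max | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| MMLU-Pro | 87.5% | n/a | 91.0% ⭐ |
| GPQA Diamond | 90.1% | n/a | 94.3% ⭐ |
| HLE | 37.7% | 40.0% ⭐ | n/a |
| HMMT 2026 | 95.2% | 96.2% ⭐ | n/a |
| Putnam 2025 | 120/120 🏆 | n/a | n/a |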

Agentic tasks

| Benchmark | V4-Pro-Max | Notes |
|---|---|---|
| MCPAtlas Public | 73.6 | Second, just behind Opus 4.6 (73.8) |
| Toolathlon | 51.8 ⭐ | Beats K2.6 (50.0), GLM-5.1 (40.7), Gemini 3.1 Pro (48.8) |
| Internal R&D coding (30 tasks) | 67% pass | vs Sonnet 4.5 (47%), Opus 4.5 (70%) |

Long-context retrieval

On the MRCR 8-needle benchmark:

  • 256K tokens: accuracy ≥ 0.82
  • 1M tokens: accuracy = 0.59

Many models already fall off sharply past 200K, so V4 holding at 0.59 at 1M is impressive.

Pricing: where V4 really rewrites the rules

| Model | Input / 1M | Output / 1M | vs V4-Pro |
|---|---|---|---|
| V4-Flash | $0.14 | $0.28 | 12× cheaper than Pro |
| V4-Pro | $1.74 | $3.48 | baseline |
| Claude Opus 4.6 | $15 | $75 | 21× more expensive output |
| GPT-5.4 | ~$15 | ~$60 | 17× more expensive output |
| Gemini 3.1 Pro | ~$3.50 | ~$10.50 | 3× more expensive output |

💰 Cost-per-task comparison (agentic coding workload)
Assume a typical SWE-bench task uses 50K input + 10K output tokens, with 20 tasks per day:

  • V4-Flash: ~$0.20/day (~$6/month)

  • V4-Pro: ~$2.44/day (~$73/month)

  • Claude Opus 4.6: ~$30/day (~$900/month)

V4-Pro-Max is at SWE-bench parity with Opus 4.6 while being about 21× cheaper on output price (about 12× cheaper on the blended workload above). For teams that need to run large volumes of agentic tasks, this directly changes which tasks are economically feasible.
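The daily and monthly figures can be reproduced from the price table. The helper below is a throwaway calculator, with a 30-day month assumed and the workload defaults (50K input, 10K output, 20 tasks per day) taken from the scenario above.

```python
PRICES = {  # USD per 1M tokens: (input, output), from the pricing table
    "V4-Flash": (0.14, 0.28),
    "V4-Pro": (1.74, 3.48),
    "Claude Opus 4.6": (15.0, 75.0),
}

def daily_cost(model, input_tokens=50_000, output_tokens=10_000, tasks_per_day=20):
    p_in, p_out = PRICES[model]
    per_task = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
    return per_task * tasks_per_day

for model in PRICES:
    day = daily_cost(model)
    print(f"{model}: ${day:.2f}/day, ${day * 30:.0f}/month")

blended_ratio = daily_cost("Claude Opus 4.6") / daily_cost("V4-Pro")
print(f"Opus 4.6 vs V4-Pro on this workload: {blended_ratio:.1f}x")
```

Because this workload is input-heavy (5:1), the blended gap is smaller than the 21× gap in output price alone.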

V4-Pro or V4-Flash: how to choose?

| Use case | Recommendation | Why |
|---|---|---|
| General coding | V4-Flash | A 2-3 point gap, but 12× cheaper |
| Agentic / multi-step coding | V4-Pro / Pro-Max | V4-Flash trails by 7-10 points on SWE-Pro / Terminal |
| Cost-sensitive batch | V4-Flash | $0.14/M input is among the cheapest in the frontier tier |
| Maximum coding accuracy | V4-Pro-Max | 93.5 LiveCodeBench, 80.6% SWE-bench |
| Long-context retrieval | Both work | Same 1M context, same CSA + HCA |

How to run V4 locally

With HuggingFace Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# V4-Flash is the smaller model; at 284B parameters its bf16 weights still
# need roughly 568 GB, so device_map="auto" shards them across available GPUs
model_id = "deepseek-ai/DeepSeek-V4-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Non-think mode (fast)
prompt = "Write a Python function to compute fibonacci numbers."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature / top_p to take effect
    temperature=1.0,
    top_p=1.0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Three reasoning modes

🧠 Switching V4's reasoning mode

  • Non-think: fast, no chain of thought (suited to chat and simple tasks)

  • Think High: standard reasoning; the output contains a <think>...</think> block

  • Think Max: the deepest reasoning; needs a context window of at least 384K

Recommended sampling for all modes: temperature=1.0, top_p=1.0

Calling the API

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_API_KEY",
)

# V4-Pro thinking mode
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "user", "content": "Solve this SWE-bench task: ..."}
    ],
    extra_body={"thinking": True},
    temperature=1.0,
    top_p=1.0,
)
print(response.choices[0].message.content)
```

Note: the two legacy model names deepseek-chat and deepseek-reasoner will be retired on 2026-07-24. For now they alias to the non-thinking and thinking modes of deepseek-v4-flash, respectively.

V4's takeaways and open questions

Key insights from V4

🎯 Five takeaways from this release

  1. Long-context inference cost can scale sub-linearly, via hybrid attention (CSA + HCA) rather than linear attention

  2. The residual connection is not sacred: mHC shows there are better options at trillion scale

  3. The optimizer question is still not settled: the convergence gap between Muon and AdamW is meaningful at very large scale

  4. FP4 QAT is production-ready: quantize during pretraining and skip the post-hoc patching

  5. Infrastructure (DSec) is the hidden moat behind agent quality

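Open questions

| Question | Why |
|---|---|
| Is hybrid attention the final answer? | CSA + HCA is still attention-based. If SSMs / linear attention mature further (for example Mamba-3, Gated DeltaNet), the whole paradigm may shift again |
| Where is the ceiling on mHC's expressivity? | The Birkhoff polytope is a strict subset of all matrices. Could the constraint be too tight for some tasks? |
| How does Muon scale to 10T+? | V4 shows it works at 1.6T, but its scaling behavior has not been tested beyond that |
| The root cause of the agentic gap | V4-Flash trails by 7-10 points on agentic tasks. Is it too few active params, or not enough reasoning depth? |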
**`DSML

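Conclusion: why does V4 matter so much?

V4's benchmarks are not SOTA, but benchmarks are not the problem it solves.

🔮 V4's real contribution
V4 is the first to cut the single-token inference cost of a frontier-quality 1M-context model down to a truly deployable level. Consider:

  • SaaS products running long agentic workflows

  • IDEs that process an entire codebase

  • RAG-less workflows that analyze a whole book or a whole contract

  • Multi-agent systems that need countless tool calls

These scenarios have always been "possible in theory but not viable economically". V4-Pro brings the output-token price of a 1M-context model down to about 1/21 of Claude Opus 4.6's, making this frontier mass-deployable for the first time.

Combined with the MIT license and open weights, V4 puts frontier capability directly into the hands of any lab or startup with GPUs. The next breakthrough in agents may well happen on top of someone's fine-tune of the V4-Pro base.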

Related resources

Official release

  • 📄 Tech report: DeepSeek_V4.pdf
  • 🤗 DeepSeek-V4-Pro (1.6T): deepseek-ai/DeepSeek-V4-Pro
  • 🤗 DeepSeek-V4-Flash (284B): deepseek-ai/DeepSeek-V4-Flash
  • 🤗 Base models: V4-Pro-Base, V4-Flash-Base
  • 📰 API announcement: api-docs.deepseek.com/news/news260424
  • 💬 Try it: chat.deepseek.com

In-depth analysis

  • 📝 HuggingFace Blog: DeepSeek-V4: a million-token context that agents can actually use
  • 📝 Reddit discussion (architecture deep dive): r/LocalLLaMA, "V4 architecture takeaways"
  • 🚀 NVIDIA Technical Blog: Build with DeepSeek V4 Using NVIDIA Blackwell

Prerequisite papers (V4's building blocks)

  • 📄 Engram (Conditional Memory): arXiv:2601.07372
  • 📄 Original mHC paper: arXiv:2512.24880 (Xie et al., 2025)
  • 📄 DeepSeek Sparse Attention (V3.2): DeepSeek-AI internal release
  • 📄 Muon Optimizer: Jordan et al., 2024
  • 📄 DeepSeekMoE: Dai et al., 2024
  • 📄 MLA (Multi-head Latent Attention, V2): DeepSeek-AI, 2024

On April 24, 2026, DeepSeek released two 1M-context MoE models at once and brought long-context inference cost into the sub-linear era. Hybrid attention, mHC, Muon, FP4 QAT: together these four architecture-level tricks show that frontier capability does not have to come at frontier cost. The next trillion-scale model will probably follow this blueprint. 🐋⚡
