Parcae × Loop, Think & Generalize: Two New RDT Papers Land in the Same Week. How Do Recurrent-Depth Transformers Let Models Think in Latent Space?

Two new Recurrent-Depth Transformer (RDT) papers out this week:
📄 Parcae: Scaling Laws For Stable Looped Language Models (Prairie et al., 2026), arXiv:2604.12946
📄 Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers (Kohli et al., 2026), arXiv:2604.07822

TL;DR

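In one week of April 2026, the RDT field got two major papers that happen to be complementary puzzle pieces:

Paper	Team	Core contribution	One-line summary
Parcae	UCSD + Together AI (Sandy Research)	Uses an LTI dynamical-systems analysis plus a negative-diagonal parameterization to address residual explosion and loss spikes in looped models, and establishes the first RDT scaling laws	"How do you train a looped LLM stably?"
Loop, Think, & Generalize	OSU NLP Group	First systematic evidence that RDTs really perform implicit multi-hop reasoning and can extrapolate from 5-hop training to 10-hop tests. Also exposes an overthinking problem	"Once trained, does an RDT really reason in latent space?"

🔗 Why read the two together?
Parcae tackles the training side (how to scale up a stable looped model); Loop-Think-Generalize tackles the inference side (does looping actually produce reasoning, and does it generalize?). Landing in the same week, the two together give RDTs a complete "training + reasoning" chain of evidence.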

Background: Eight Years of RDT Evolution, All to Solve One Problem

Before the main course, five minutes of background, so that readers new to RDTs can see why these two papers matter.

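The depth dilemma of the standard Transformer

In a classic Transformer (GPT, Llama, etc.) every layer has its own parameters. A 32-layer model has 32 different sets of $\{W_Q, W_K, W_V, W_O, W_{\text{FFN}}\}$. The problem:

⚠️ Fixed depth = fixed compute

A 32-layer model performs exactly 32 steps of computation on every token

Easy question ("1+1=?"): 32 steps → wasted

Hard question ("prove Fermat's Last Theorem"): still only 32 steps → not enough

Deeper → more parameters → more memory → more training time 😭

The theory is harsher still: a fixed-depth Transformer running a single forward pass is not Turing-complete, so algorithms that need an unbounded number of iterations are simply out of reach.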

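The core idea of recurrent depth

The RDT idea is simple: apply the same set of parameters $W$ repeatedly, $L$ times.

Advantage: the parameter count $|W|$ is fixed, while FLOPs $\approx |W| \times L$ can be adjusted on the fly: adding loops at test time adds reasoning compute.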
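To make the parameter/compute split concrete, here is a toy sketch in which a single scalar weight stands in for a whole Transformer block. The functions `run_standard` and `run_recurrent` are illustrative names, not anything from the papers; the only point is that depth is a runtime argument in the recurrent case.

```python
# Toy illustration: a scalar "layer" stands in for a full Transformer block.

def run_standard(x, layer_weights):
    """One distinct weight per layer: depth is fixed by the parameter list."""
    for w in layer_weights:
        x = x + w * x  # residual update with that layer's own weight
    return x

def run_recurrent(x, shared_weight, num_loops):
    """One shared weight applied num_loops times: depth is chosen at call time."""
    for _ in range(num_loops):
        x = x + shared_weight * x
    return x

standard_weights = [0.01] * 32            # 32 parameters buy exactly 32 steps
print(len(standard_weights), run_standard(1.0, standard_weights))
print(1, run_recurrent(1.0, 0.01, 4))     # 1 parameter, 4 steps
print(1, run_recurrent(1.0, 0.01, 32))    # same 1 parameter, 32 steps
```

With identical weights the two runs coincide; the difference is that the recurrent version can pick 4, 32, or any other depth without adding parameters.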

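Three earlier milestones (quick recap)

Year	Paper	What it did	What it left open
2018	Universal Transformers (Dehghani et al., Google Brain)	First to introduce recurrence in depth + ACT halting; argued Turing-completeness	ACT training highly unstable, breaks down on long sequences, no scaling laws
2025 Feb	Huginn-3.5B (ELLIS Tübingen + UMD)	First large-scale RDT proof of concept: 3.5B params / 800B tokens; adding loops at test time monotonically improves reasoning	Still a black box; training needs many tricks; no theoretical guarantees; no scaling laws
2025 Feb	Reasoning with Latent Thoughts (Saunshi et al., Google Research, ICLR 2025)	Shows theoretically that looping acts as CoT in a continuous latent space; claims "depth matters more than parameters"	Mostly theory; training in practice still collapses; no scaling laws

For eight years the biggest bottleneck has been the same: RDT training is extremely unstable. Loss spikes, exploding gradients, divergence on long sequences... so RDTs stayed in the "academically interesting, industrially useless" stage.

This week, Parcae finally broke through that bottleneck.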

Paper 1: Parcae, Finally Keeping RDT Training Stable

Paper background

📄 Parcae: Scaling Laws For Stable Looped Language Models

Authors: Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, Dan Fu

Affiliation: UC San Diego + Together AI (Sandy Research Lab)

Published: April 14, 2026 (arXiv v1)

Paper: arXiv:2604.12946

Blog: sandyresearch.github.io/parcae/

Code: reference implementation available via the official Together AI blog

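Motivation: why is Huginn so hard to train?

The authors open with a familiar failure mode:

💀 Typical symptoms of Huginn-style training

Residual state explosion: at some training step, the hidden-state norm suddenly jumps from $\sim 10$ to $10^5$

Loss spikes: validation loss normally sits at 2.3, then a single batch spikes to 50+

Gradient NaN: occasionally the whole training run simply crashes

It takes constant tuning of gradient clipping, warmup, LR, and init just to get a run through

This is not an engineering issue; it is a problem with the architecture itself.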

Key insight: treat the loop as a dynamical system

Parcae's most elegant contribution is a change of perspective:

If the recurrent block's update is linearized, the loop is a linear time-invariant (LTI) dynamical system.

Concretely, write the recurrent block's update as:

h^{t+1} = A h^t + B u

where:

$h^t \in \mathbb{R}^{d}$: latent state at loop $t$
$u$: prelude output (re-injected at every loop)
$A \in \mathbb{R}^{d \times d}$: state-to-state transition
$B \in \mathbb{R}^{d \times d}$: input injection matrix

Control theory tells us that this discrete-time system is stable ⇔ $\rho(A) < 1$ ($\rho$ is the spectral radius, i.e. $|\lambda|_{\max}$).

🔑 Diagnosis: Huginn's $A$ is unconstrained. During training the eigenvalues of $A$ can drift to $|\lambda| > 1$, the system becomes an unstable dynamical system, and $h^t$ grows exponentially. That is the essence of residual explosion.
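The stability condition is easy to see numerically. The sketch below iterates $h^{t+1} = A h^t + B u$ for a diagonal $A$ and $B$ (a simplifying assumption so that plain Python lists suffice); the specific values 0.9, 1.1, and 0.5 are arbitrary and chosen only for illustration.

```python
def iterate_lti(a_diag, b_diag, u, steps):
    """Iterate h <- A h + B u for diagonal A and B, starting from h = 0."""
    h = [0.0] * len(a_diag)
    for _ in range(steps):
        h = [a * hi + b * ui for a, hi, b, ui in zip(a_diag, h, b_diag, u)]
    return h

def max_abs(v):
    return max(abs(x) for x in v)

u = [1.0, 1.0]
b = [1.0, 1.0]

stable = iterate_lti([0.9, 0.5], b, u, steps=200)     # rho(A) = 0.9 < 1
unstable = iterate_lti([1.1, 0.5], b, u, steps=200)   # rho(A) = 1.1 > 1

print(max_abs(stable))    # settles near the fixed point b*u / (1 - a) = 10
print(max_abs(unstable))  # keeps growing geometrically with the loop count
```

A single eigenvalue above 1 is enough to blow up the state, even though the other coordinate is perfectly well behaved.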

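The fix: negative diagonal parameterization

Parcae's trick is neat: force $A$ to be negative diagonal, then apply ZOH discretization.

Step 1: negative diagonal in continuous time

A := \text{Diag}\big(-\exp(\log_A)\big), \quad \log_A \in \mathbb{R}^{d}

$\exp(\log_A) > 0$ (always positive)
Add the minus sign → the diagonal entries of $A$ are always negative
The eigenvalues of a diagonal matrix are its diagonal entries → every $\lambda < 0$

Step 2: Zero-Order Hold (ZOH) discretization

Following Mamba / S4:

\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B

Because $A < 0$ and the step size $\Delta > 0$:

\bar{A} = \exp(\Delta A) \in (0, 1)

i.e. $\rho(\bar{A}) < 1$ by construction ✅

💡 Analogy: picture someone looking for a stable point in a valley. Huginn lets them roam freely: they may find the valley floor, or they may climb up to the peak (explosion). Parcae ties a bungee cord to the valley floor, so wherever they wander they get pulled back to the stable point.
This trick is a signature move of State Space Models (SSMs): the S4 / S5 / Mamba line of work uses variants of it. Parcae's contribution is to bring SSM stability tricks into Transformer loops.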
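The two steps above can be checked for a single diagonal entry with nothing but `math.exp`. This is a scalar sketch of the formulas as written in this post, not the paper's code; the helper name `discretize` and the sample values of `log_a` are made up for the demonstration.

```python
import math

def discretize(log_a, delta, b):
    """ZOH discretization for one diagonal entry.

    A = -exp(log_a) is negative for every real log_a, so with delta > 0
    A_bar = exp(delta * A) always lands in (0, 1).
    """
    a = -math.exp(log_a)
    a_bar = math.exp(delta * a)
    # Scalar form of (delta*A)^{-1} (exp(delta*A) - 1) * delta*B
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

for log_a in (-5.0, 0.0, 3.0):
    a_bar, b_bar = discretize(log_a, delta=0.1, b=1.0)
    print(f"log_a={log_a:+.1f}  A_bar={a_bar:.4f}  B_bar={b_bar:.4f}")
```

Whatever value the optimizer pushes `log_a` to, `A_bar` stays strictly between 0 and 1 as long as `delta` is positive, which is why the constraint needs no extra regularizer. For small `delta * A`, `B_bar` is close to `delta * b`, the first-order approximation often used in practice.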

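Architecture: the Parcae recurrent block

Key design: the LTI part handles stability, the Transformer part handles expressiveness. Summing the two keeps the state from exploding while retaining attention's nonlinear learning capacity.

Another key piece: per-sequence depth sampling

Huginn-style training samples a single loop count $r$ per micro-batch. Parcae finds this problematic: different sequences in the same batch have different optimal depths, and forcing one shared $r$, the paper argues, biases the gradient estimator.

Parcae's change: sample $r$ independently for each sequence.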

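import random

# `compute(seq, r)` is a placeholder: a forward pass with r loops that returns a loss.

def batch_level_loss(batch, r_max, compute):
    # Huginn-style (sub-optimal): one loop count shared by the whole batch
    r = random.randint(1, r_max)              # batch-level, bounds inclusive
    return sum(compute(seq, r) for seq in batch)

def sequence_level_loss(batch, r_max, compute):
    # Parcae-style: every sequence draws its own loop count
    loss = 0.0
    for seq in batch:
        r_seq = random.randint(1, r_max)      # sequence-level
        loss += compute(seq, r_seq)
    return loss

This small change noticeably reduces the frequency of loss spikes.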

Results: the first time an RDT decisively beats a Transformer


Scaling laws: Chinchilla for RDTs

Parcae establishes three key scaling laws (in the spirit of what Chinchilla did for Transformers):

Loops and data should scale together: under a fixed FLOPs budget, the optimal amount of data grows as the loop count grows
Test-time scaling is a saturating exponential: gain ∝ $1 - e^{-k \cdot L}$; early loops pay off most, later loops give diminishing returns
The params vs. loops trade-off has a sweet spot: too few params with too many loops underfits; too many params with too few loops wastes capacity
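The saturating-exponential shape of the second law is worth seeing in numbers. The rate constant `k = 0.5` below is a hypothetical value picked purely for illustration; the paper's fitted constants are not reproduced here.

```python
import math

def relative_gain(num_loops, k):
    """Fraction of the asymptotic gain captured after num_loops loops."""
    return 1.0 - math.exp(-k * num_loops)

k = 0.5  # hypothetical rate constant, for illustration only
for loops in (1, 2, 4, 8, 16):
    total = relative_gain(loops, k)
    marginal = total - relative_gain(loops - 1, k)
    print(f"L={loops:2d}  gain={total:.3f}  marginal gain of last loop={marginal:.3f}")
```

Each extra loop captures a fixed fraction of whatever gain remains, so the first few loops do most of the work and the curve flattens quickly.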

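🚀 Why Parcae matters for engineering
Parcae is arguably the first RDT architecture that is realistically deployable:

Stable training (no spikes)

Parameter efficiency (about 60% of the params match the baseline: 770M vs. a 1.3B Transformer)

Dynamic test-time compute (a dream for on-device, memory-constrained settings)

A potential game changer for edge / mobile LLMs.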

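What Parcae leaves open

Sequence-level compute: every token loops the same number of times. But "this" and "quantum physics" clearly need different thinking depth → adaptive halting is still unsolved
Interpretability: it only solves training; it does not explain what happens inside the latent state
Compositional reasoning unverified: Parcae does standard language modeling and does not specifically test multi-hop reasoning → exactly the subject of the next paper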

Paper 2: Loop, Think, & Generalize, Verifying That RDTs Really "Think"

Paper background

📄 Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Authors: Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, Yuekun Yao

Affiliation: Ohio State University NLP Group

Published: April 9, 2026 (arXiv v1)

Paper: arXiv:2604.07822

Code: github.com/OSU-NLP-Group/Loop-Think-Generalize

If Parcae answers "how do you train a looped model", Loop-Think-Generalize (LTG below) takes on the more fundamental question: do these looped models actually reason?

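Core question: implicit multi-hop reasoning

What is implicit reasoning?

Imagine asking an LLM: "What is the capital of the state where Barack Obama's wife's mother was born?"

Explicit (CoT): the model writes it out: "Obama's wife is Michelle Obama → her mother is Marian Robinson → Marian was born in Chicago, Illinois → the capital of Illinois is Springfield"
Implicit: the model completes the chain internally in a single forward pass and outputs Springfield directly

⚠️ The pain point for standard Transformers
Many mechanistic interpretability studies find that a standard Transformer stores these atomic facts but cannot compose them: it knows each fact, yet cannot assemble the reasoning chain within one forward pass. This is known as a compositional generalization failure.

LTG's hypothesis: give a Transformer recurrent depth and it can compose.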

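Two key challenges

LTG formalizes two concrete problems in implicit reasoning:

Challenge	Definition	Why it is hard
Systematic Generalization	Training covers single-hop facts $A \to B, B \to C, \ldots$ but never the composition $A \to C$. At test time the model must handle unseen compositions	The model has to truly learn the composition rule rather than memorize
Depth / Recursion Extrapolation	Training covers at most 5-hop reasoning; inference asks for 10-hop (a depth never seen in training)	The model must generalize to reasoning depths it has never seen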

Experimental design: a synthetic multi-hop dataset

LTG builds a fully controlled synthetic setup:

// Entity graph: one long chain
Facts:  A → B → C → D → E → F → G → H → I → J → K
        ↑  each edge is one learned fact

Training:
  - Single-hop facts: A→B, B→C, C→D, ...
  - Multi-hop queries: 2-hop, 3-hop, 4-hop, 5-hop composed queries
  - NEVER: 6-hop or beyond

Test (unseen depth):
  - 6-hop, 7-hop, 8-hop, 9-hop, 10-hop queries

If an RDT answers a 10-hop query correctly, we learn that:

It really is doing multi-step reasoning internally (not memorizing)
It can extrapolate the reasoning depth seen in training to unseen depths
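A chain dataset of this kind takes only a few lines to generate. The sketch below is an assumption about the general shape of such a setup (one chain, queries defined by a start entity and a hop count); the entity naming and the helpers `make_chain`, `answer`, and `make_queries` are hypothetical and not taken from the LTG code.

```python
import random

def make_chain(num_entities, seed=0):
    """Return single-hop facts as a dict: entity -> next entity in the chain."""
    rng = random.Random(seed)
    entities = [f"e{i}" for i in range(num_entities)]
    rng.shuffle(entities)  # chain order is unrelated to the entity names
    return {src: dst for src, dst in zip(entities, entities[1:])}

def answer(facts, start, hops):
    """Follow `hops` edges from `start`; None if the chain ends early."""
    node = start
    for _ in range(hops):
        node = facts.get(node)
        if node is None:
            return None
    return node

def make_queries(facts, hop_counts):
    """All (start, hops, answer) triples that stay inside the chain."""
    queries = []
    for start in facts:
        for hops in hop_counts:
            target = answer(facts, start, hops)
            if target is not None:
                queries.append((start, hops, target))
    return queries

facts = make_chain(num_entities=11)          # 10 edges, like A -> ... -> K
train = make_queries(facts, range(1, 6))     # 1- to 5-hop queries
test = make_queries(facts, range(6, 11))     # 6- to 10-hop, held out from training
print(len(train), len(test))
```

Because the split is by hop count, any test query the model answers correctly requires a composition depth it never saw during training.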

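Three training strategies compared

LTG's core experiment tests three training-time loop-count schedules:

Strategy	Description	5-hop → 10-hop extrapolation
Fixed recurrence	Always train with a fixed $r = 5$	❌ Fails completely: 10-hop test accuracy is near random
Curriculum recurrence	Increase $r$ from small to large (1→2→3→4→5)	⚠️ Moderate: 7-hop still OK, clear degradation at 8+ hops
Dynamic recurrence	Sample a random $r \in [1, 5]$ for each batch	✅ Best: accuracy stays high on the 10-hop test

💡 Why does dynamic recurrence work?
When $r$ is random during training, the model is forced to learn a depth-agnostic reasoning block: the same block has to handle both shallow and deep reasoning. That inductive bias lets it generalize when test time pushes to 10 hops.

This finding lines up with Parcae's per-sequence depth sampling: randomizing the loop count is key to RDT generalization.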
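The three schedules differ only in how the loop count is chosen at each training step, which the following sketch makes explicit. The linear ramp used for the curriculum is an assumption; the paper's exact curriculum is not specified in this post.

```python
import random

def fixed_schedule(step, total_steps, r_max, rng):
    return r_max                                    # always the same depth

def curriculum_schedule(step, total_steps, r_max, rng):
    # Assumed linear ramp from 1 to r_max over the course of training.
    return min(r_max, 1 + (step * r_max) // total_steps)

def dynamic_schedule(step, total_steps, r_max, rng):
    return rng.randint(1, r_max)                    # uniform, bounds inclusive

rng = random.Random(0)
schedules = [("fixed", fixed_schedule),
             ("curriculum", curriculum_schedule),
             ("dynamic", dynamic_schedule)]
for name, schedule in schedules:
    print(name, [schedule(step, 10, 5, rng) for step in range(10)])
```

Only the dynamic schedule exposes the block to every depth throughout training, which matches the depth-agnostic intuition given above.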

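Evidence: latent chain-of-thought really exists

LTG's most interesting finding: probing the latent state after each loop reveals a clear "reasoning trajectory".

Each loop's latent really does activate the representation of the intermediate entity. This is the first empirical support for Saunshi et al.'s theoretical claim that looping is CoT in latent space.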

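But... overthinking!

LTG exposes an unexpected problem:

⚠️ Overthinking phenomenon
Once the number of test-time loops exceeds some threshold (for example 15 loops), accuracy goes down instead of up.

Intuition: after many loops the latent state drifts off the meaningful manifold into a "semantic void" region, and the Coda decodes it incorrectly.

The phenomenon was first observed by Bansal et al. (2022) on algorithmic tasks; LTG is the first to document it formally in a reasoning context.

This takes the community straight back to the problem Universal Transformers left open eight years ago: adaptive halting. How do you know when to stop?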

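What did LTG solve, and what did it leave open?

✅ Solved	❌ Left open
First empirical evidence that RDTs perform implicit multi-hop reasoning	Adaptive halting is still an open problem (back to UT's ACT)
Shows depth extrapolation is feasible (5 → 10 hops)	How to avoid overthinking?
Dynamic recurrence is key to systematic generalization and extrapolation	Do insights from these synthetic datasets transfer to real LLMs?
Uses probing to verify that latent CoT exists	Fine-grained interpretability of the latent state is still lacking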

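How do the two papers complement each other?

Only together do these two concurrent papers give the full picture.

	Parcae	Loop, Think, & Generalize
Focus	Training-side stability	Inference-side reasoning
Experimental scale	1.3B params / 104B tokens (real LLM scale)	Controlled synthetic multi-hop dataset
Core claim	Looped models can be trained stably and beat iso-param Transformers	RDTs perform implicit multi-hop reasoning and extrapolate in depth (5 → 10 hops)
Core trick	Negative diagonal $A$ + ZOH + per-sequence depth sampling	Dynamic recurrence (random loop count during training)
Shared insight	Randomizing the loop count is the common answer for both stable training and good generalization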

🧬 If you do RDT research
The best combination of the two papers is:

Use Parcae's architecture as the foundation (negative diagonal + ZOH + LTI stability)

Use LTG's dynamic recurrence as the training recipe

Use LTG's probing methodology for interpretability analysis

That gives you, more or less, a blueprint for a production-ready looped LLM.

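Case study: OpenMythos, an open-source implementation that puts both papers into practice

Around the same time there was a third event: on April 19, 2026, Kye Gomez (swarms) open-sourced OpenMythos, which claims to be a theoretical reconstruction of Anthropic's undisclosed "Claude Mythos" architecture. The repo combines the insights of Parcae + LTG with other SOTA techniques (DeepSeek MoE, Multi-head Latent Attention) into a runnable PyTorch model, and is the closest open-source attempt at a "production-ready RDT" seen so far.

📦 OpenMythos

Author: Kye Gomez (swarms)

Released: April 19, 2026

GitHub: github.com/kyegomez/OpenMythos

Coverage: in-depth write-up on MarkTechPost

Claim: 770M parameters can match a 1.3B Transformer (citing Parcae's scaling laws)

Core hypothesis: Claude Mythos may be an RDT

Anthropic has never published technical details of Mythos. Building on circulating rumors plus the recent RDT literature, Kye boldly proposes:

Mythos's reasoning ability comes from a Recurrent-Depth Transformer: the weights are applied repeatedly $T$ times ($T \leq 16$), and the whole reasoning process happens in continuous latent space without emitting any intermediate tokens.

The hypothesis is falsifiable (experiments could refute it), and Kye has implemented it fully in PyTorch, giving the community a concrete artifact to play with.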

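Three-stage architecture: Prelude → Recurrent Block → Coda

🎼 Where the Prelude / Coda names come from
The three names are borrowed from musical terms (prelude = opening, coda = closing section) and map to three distinct jobs:

Prelude: encodes raw tokens into the latent thinking space, moving text onto the "plane where thinking happens". A few ordinary Transformer layers, run once.

Recurrent Block (main theme): the actual reasoning engine, looping $T$ times in latent space. All the thinking happens here, and so does the weight reuse.

Coda: decodes the latent state after the loops back into output tokens, translating "what was thought" back into text. Again a few ordinary Transformer layers, run once.

Analogy: Prelude "reads the question", Recurrent "thinks it over", Coda "writes the answer". The Recurrent parameters are reused over and over; the Prelude / Coda parameters are not, since they are just ordinary Transformer layers.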

Update rule (per loop):

h_{t+1} = A \cdot h_t + B \cdot e + \text{Transformer}(h_t, e)

where:

$h_t$: hidden state at loop $t$
$e$: Prelude output (encoded input), re-injected at every loop (to prevent latent drift)
$A, B$: learned transition / injection matrices
Transformer: the block's internal attention + MoE FFN

🔄 Why re-injection matters
Without the $B \cdot e$ term, $h_t$ would drift after many loops until it no longer resembles the input. Re-injection "reminds" the model of the original input at every loop, the same design used by Huginn and Parcae.

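Design Choice #1: borrowing stability from Parcae

OpenMythos directly adopts Parcae's LTI stability constraint:

🔑 LTI-Stable Recurrent Injection
$\rho(A) < 1$ is enforced by construction, using Parcae's negative diagonal + ZOH trick. In the repo Kye makes this a first-class training primitive, meaning the stability is not an optional regularizer but part of the architecture itself.

Practical impact: learning rate, gradient noise, and init are far less likely to make the model explode during training, which cuts hyperparameter tuning time substantially.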

Design Choice #2: MoE FFN instead of a standard FFN

OpenMythos does not use a standard FFN; it follows the DeepSeekMoE design:

Shared experts: always active, learn patterns common across domains
Routed experts: sparse top-K, each token activates only a few
Critical twist: the router picks different expert subsets at each loop depth

💡 Why is this design clever?
An ordinary RDT runs exactly the same computation at every loop. OpenMythos's MoE router makes the effective computation differ per loop: loop 1 might pick experts {3, 17, 42} and loop 2 might pick {8, 23, 71}, even though the same set of base weights is used.

The result: MoE provides domain breadth, looping provides reasoning depth, two orthogonal axes.
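One simple way the expert subset can change between loops is that the router scores depend on the hidden state, and the hidden state moves from loop to loop. The sketch below assumes exactly that (a dot-product router with top-k selection); it is not taken from the OpenMythos code, and the router weights and hidden states are made-up toy values.

```python
import math

def top_k_route(hidden, router_weights, k):
    """Score each expert by a dot product with the hidden state, keep the
    top k, and renormalize their softmax weights so they sum to 1."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - top) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# 4 toy experts over a 2-dimensional hidden state (hypothetical values).
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
h_loop1 = [0.9, 0.2]     # hidden state at loop 1
h_loop2 = [-0.3, 0.8]    # the state has moved by loop 2

print(sorted(top_k_route(h_loop1, router, k=2)))   # experts picked at loop 1
print(sorted(top_k_route(h_loop2, router, k=2)))   # a different subset at loop 2
```

The router weights are shared across loops here; only the input to the router changes, which is enough to activate different experts at different depths.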

Design Choice #3: Multi-head Latent Attention (MLA)

First, a recap: why is standard multi-head attention so painful inside an RDT?

At inference, standard MHA has to cache the K and V tensors for every layer and every token so that later tokens can attend to them (the well-known KV cache). The cache grows with every new token:

\text{KV Cache Size} \approx 2 \cdot L_{\text{layers}} \cdot L_{\text{seq}} \cdot d_{\text{model}}

For long contexts the cache can fill GPU memory. In an RDT the problem is amplified: the same weights loop $T$ times and each loop runs attention once, so memory bandwidth becomes the bottleneck.

MLA's trick: compress K and V into a low-rank latent

DeepSeek-V2's Multi-head Latent Attention does not cache the full K and V; it caches a compressed latent vector $c \in \mathbb{R}^{d_c}$ ($d_c \ll d_{\text{model}}$, typically 10–20× smaller):

c_t = W_{DKV}\, h_t \quad \text{(down-projection, compress)}

When attention is needed, K and V are up-projected on the fly:

K_t = W_{UK}\, c_t, \quad V_t = W_{UV}\, c_t \quad \text{(decompress)}

📦 MLA vs MHA in one sentence
MHA caches the full K and V tensors (heavy); MLA caches only a low-rank summary $c$ (light) and decompresses it into K and V with $W_{UK}, W_{UV}$ when needed. The DeepSeek-V2 paper reports a KV cache reduction of roughly an order of magnitude (93.3% versus DeepSeek 67B) with little loss in quality.

This matters especially for RDTs: the more loops, the more times attention has to recompute or re-read the cache, and MLA keeps the memory footprint of $T=16$ loops manageable.
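The cache formula can be turned into a quick back-of-the-envelope comparison. The sizes below are hypothetical, and the MLA count ignores the small decoupled positional key that DeepSeek-V2 also caches, so treat the ratio as an upper-level estimate rather than a measured number.

```python
def mha_cache_elems(n_layers, seq_len, d_model):
    """Cached elements for standard MHA: K and V per layer per token."""
    return 2 * n_layers * seq_len * d_model

def mla_cache_elems(n_layers, seq_len, d_latent):
    """Cached elements for MLA: one shared latent c per layer per token."""
    return n_layers * seq_len * d_latent

# Hypothetical sizes, for illustration only.
n_layers, seq_len, d_model, d_latent = 8, 4096, 2048, 256

full = mha_cache_elems(n_layers, seq_len, d_model)
compressed = mla_cache_elems(n_layers, seq_len, d_latent)
print(full, compressed, full / compressed)
```

With a latent 8× narrower than the model width, the cache shrinks 16×, because a single latent replaces both K and V.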

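Design Choice #4: depth-wise LoRA

A quick recap of standard LoRA (skip if you know it)

LoRA's core idea: to fine-tune a large matrix $W \in \mathbb{R}^{d \times d}$, do not modify $W$ directly (that means changing $d^2$ parameters); freeze it and add a low-rank update:

W' = W + A B, \quad A \in \mathbb{R}^{d \times r},\; B \in \mathbb{R}^{r \times d},\; r \ll d

Only $A, B$ are trained (the parameter count drops from $d^2$ to $2 d r$) while $W$ stays fixed. These $A, B$ are the LoRA factors, not the LTI matrices from earlier. LoRA is commonly used to fine-tune large models: an adapter of a few MB is enough to customize a 7B model.

Depth-wise LoRA: one adapter per loop depth

OpenMythos's twist is to promote LoRA from a fine-tuning tool to an architectural primitive: the base weights $W_{\text{base}}$ are shared across all loops, but each loop depth $t$ has its own LoRA pair $(A_t, B_t)$:

W^{(t)} = W_{\text{base}} + A_t B_t

Loop $t$	Weights actually used
t = 1	$W_{\text{base}} + A_1 B_1$
t = 2	$W_{\text{base}} + A_2 B_2$
t = T	$W_{\text{base}} + A_T B_T$

Effect: the loops do not all run exactly the same computation (avoiding the expressiveness bottleneck of pure weight tying), yet they are not fully independent either (avoiding a fallback to a standard Transformer): a shared trunk plus per-depth fine adjustment.

🎚️ Standard LoRA vs depth-wise LoRA

Standard LoRA: one frozen base + one adapter $(A, B)$, used at fine-tuning time (a patch applied after pre-training).

Depth-wise LoRA: one base + $T$ adapters $\{(A_t, B_t)\}_{t=1}^{T}$, used inside the architecture to distinguish loop depths (active in both pre-training and inference).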

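Parameter efficiency: total parameters ≈ $d^2 + T \cdot 2 d r$. With $r=8, d=512, T=16$, the $T$ adapters together add $16 \cdot 2 \cdot 512 \cdot 8 = 131{,}072$ parameters, i.e. 50% on top of the $512^2 = 262{,}144$ base, and the total is only about 9% of the $T \cdot d^2$ needed for fully independent per-depth weights.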
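Working the parameter count through for $r=8, d=512, T=16$ shows where the savings actually come from. The helper name `depthwise_lora_params` is illustrative only.

```python
def depthwise_lora_params(d, r, T):
    """Parameter counts for one shared d x d base plus T rank-r adapters."""
    base = d * d
    adapters = T * 2 * d * r      # each adapter holds A_t (d x r) and B_t (r x d)
    return base, adapters

d, r, T = 512, 8, 16
base, adapters = depthwise_lora_params(d, r, T)
untied = T * d * d                # T fully independent d x d matrices

print(base, adapters)
print(adapters / base)            # adapters relative to the single shared base
print((base + adapters) / untied) # shared base + adapters relative to untied weights
```

The adapters are not negligible next to a single base matrix (half its size at these settings); the saving shows up against the fully untied alternative, where the shared design needs under a tenth of the parameters.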

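Why OpenMythos does this

This is OpenMythos's own addition on top of the two papers, meant to resolve a tension in RDTs:

Extreme 1	Extreme 2	OpenMythos's middle ground
Pure weight tying (every loop uses exactly the same weights)	Fully independent weights per layer (back to a standard Transformer)	Depth-wise LoRA: base weights are shared and each loop depth adds a small rank-$r$ adapter

Concretely:

W^{(t)} = W_{\text{base}} + A_t B_t, \quad A_t \in \mathbb{R}^{d \times r}, \; B_t \in \mathbb{R}^{r \times d}

where $W_{\text{base}}$ is shared across all loops and $\{A_t, B_t\}_{t=1}^{T}$ differs per loop depth. Since $r \ll d$, few parameters are added, yet each loop depth still behaves distinctly, so most of the RDT's parameter efficiency survives without fully tying the weights.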

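Design Choice #5: ACT halting against overthinking

First: what exactly is ACT?

Adaptive Computation Time (ACT) was proposed by Alex Graves (2016) and brought into the Transformer world by Universal Transformers in 2018. The motivation in one sentence:

Different tokens need different thinking depth. "the" may be done in 1 step; "prove" may need 10. Let the model learn for itself when to stop looping.

Mechanism (applied independently at each token position):

At each loop step $t$, a small linear head computes a halting probability $p_t \in (0, 1)$ (typically via a sigmoid)
Accumulate: $P_t = \sum_{i=1}^{t} p_i$
As soon as $P_t \geq 1$, that token position stops looping and its current latent state is handed to the Coda for decoding (a simplification: Graves's ACT halts at $1 - \epsilon$ and outputs a halting-weighted mean of the states)

Effect: easy tokens exit early (saving compute), hard tokens run up to the maximum loop count. Per-token compute is learned and adaptive, which makes this inductive bias a candidate remedy for the overthinking that LTG exposes.

How OpenMythos wires ACT into an RDT

To address the overthinking problem that LTG exposes, OpenMythos revives the Adaptive Computation Time (ACT) of Universal Transformers: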

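# Sketch of per-position ACT halting (simplified: no remainder weighting).
# prelude, recurrent_block, coda and halt_head are placeholder modules.
import torch

def act_forward(x, prelude, recurrent_block, coda, halt_head, T_max=16):
    e = prelude(x)                                    # (B, L, d), re-injected every loop
    h = e
    cumulative = torch.zeros(h.shape[:-1], device=h.device)   # (B, L) halting mass
    running = torch.ones_like(cumulative, dtype=torch.bool)   # positions still looping
    for _ in range(T_max):
        p = torch.sigmoid(halt_head(h)).squeeze(-1)   # learned halting head, (B, L)
        cumulative = cumulative + p * running
        running = running & (cumulative < 1.0)        # halt once the mass reaches 1
        if not running.any():
            break                                     # every position has halted
        h_new = recurrent_block(h, e)
        h = torch.where(running.unsqueeze(-1), h_new, h)  # halted positions keep their state
    return coda(h)

Each token position has its own halting scalar: hard tokens get more compute and easy tokens exit early. This is one candidate answer to the adaptive halting problem that Parcae leaves open.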

How OpenMythos puts the two papers into practice

Paper contribution	How OpenMythos reflects it
Parcae: LTI stability	✅ Negative diagonal $A$ + ZOH as a first-class primitive
Parcae: per-sequence depth sampling	✅ Implemented in the training loop
Parcae: 770M ≈ 1.3B Transformer	✅ Explicitly claimed, adopting the scaling laws
LTG: dynamic recurrence	✅ Random $T$ sampled during training
LTG: overthinking problem	✅ Targeted with ACT halting
LTG: latent CoT reasoning	✅ Re-injection + continuous latent space design
OpenMythos's own extras	⭐ Depth-conditional MoE routing + Multi-head Latent Attention + depth-wise LoRA

🎯 The real value of OpenMythos
OpenMythos may not be the actual architecture of Claude Mythos (only Anthropic knows). But it is the most complete production-style RDT reference implementation seen so far, bringing together Parcae's stability, LTG's dynamic recurrence, DeepSeek's MoE + MLA, and UT's ACT halting.

If you want to try RDT research yourself, there is no need to start from scratch: fork OpenMythos, swap in your dataset, and start experimenting.

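Caveats to keep in mind

⚠️ Honest disclaimer

OpenMythos is a theoretical reconstruction with no official confirmation

The code is mainly an architectural blueprint; actually pretraining at 1.3B scale needs significant compute (an estimated 100K+ GPU-hours)

The current release emphasizes architectural completeness over training artifacts: you have to pretrain yourself, and the repo ships no weights

The interaction between MoE routing and the LTI constraints is not yet fully characterized, so there may be hidden failure modes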

OpenMythos vs. the "minimal implementation"

The minimal implementation in the next section is for teaching: under 200 lines of code focused on the smallest viable RDT. OpenMythos is the "industrial-grade" version:

	Minimal implementation (below)	OpenMythos
Code size	Under 200 lines	Full repo, thousands of lines
Parcae stability	✅ Negative diagonal + ZOH	✅ Same
Dynamic recurrence	✅ Random $T$	✅ Same
FFN	Standard MLP	✅ DeepSeek MoE
Attention	Standard MultiheadAttention	✅ Multi-head Latent Attention
Depth differentiation	Pure weight tying	✅ Depth-wise LoRA
Halting	Fixed loops	✅ ACT halting
Purpose	Teaching + quick prototyping	Research baseline

Want to do real RDT research? Read the minimal implementation first to understand the mechanics, then move to OpenMythos for the fuller design.

Implementation guide: a PyTorch skeleton combining both papers

Below is a minimal RDT implementation (under 200 lines) that combines Parcae's stability with LTG's dynamic recurrence:

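import torch
import torch.nn as nn
import torch.nn.functional as F

class ParcaeRecurrentBlock(nn.Module):
    """
    Parcae-style stable recurrent block.

    Update rule:
        h^{t+1} = Ā ⊙ h^t + B̄ u + Transformer(h^t)

    where:
        A = -exp(log_A)          <- negative diagonal (guarantees λ < 0)
        Ā = exp(Δ * A)           <- ZOH discretization (in (0, 1) when Δ > 0)
        ρ(Ā) < 1 by construction <- stability guarantee for the linear path
    """
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d = d_model

        # Parcae stability mechanism: negative diagonal A + ZOH
        self.log_A = nn.Parameter(torch.zeros(d_model))
        # Δ is stored in log space so Δ = exp(log_delta) stays positive
        # (initial value 0.1); an unconstrained Δ could go negative and push Ā above 1
        self.log_delta = nn.Parameter(torch.full((d_model,), 0.1).log())
        self.B = nn.Parameter(torch.randn(d_model, d_model) * 0.02)

        # Standard Transformer components (shared across loops)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def get_discretized_AB(self):
        """Discretize A and B; Ā stays in (0, 1)."""
        A = -torch.exp(self.log_A)              # always negative
        delta = torch.exp(self.log_delta)       # always positive
        A_bar = torch.exp(delta * A)            # in (0, 1), so ρ < 1
        # First-order approximation B̄ ≈ ΔB of the ZOH formula (accurate for small ΔA)
        B_bar = delta.unsqueeze(-1) * self.B
        return A_bar, B_bar

    def forward(self, h, u):
        """
        h: current latent state (B, L, d)
        u: prelude injection (B, L, d)
        """
        A_bar, B_bar = self.get_discretized_AB()

        # LTI stable path
        linear_update = h * A_bar + u @ B_bar.T

        # Transformer expressive path, with a causal mask so that position i
        # cannot attend to later positions (needed for next-token training)
        L = h.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=h.device), diagonal=1)
        attn_out = self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        h_attn = self.norm1(h + attn_out)
        h_ffn = self.norm2(h_attn + self.ffn(h_attn))

        return linear_update + h_ffn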


class RecurrentDepthTransformer(nn.Module):
    """
    Full RDT: Prelude → Recurrent Core (reused) → Coda
    """
    def __init__(self, vocab_size, d_model=512, n_heads=8,
                 n_prelude=2, n_coda=2, max_loops=8):
        super().__init__()
        self.d = d_model
        self.max_loops = max_loops

        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(2048, d_model)  # supports sequences up to 2048 tokens

        # Prelude: embed into latent space.
        # For brevity the prelude and coda reuse the block class, each layer with
        # its own (non-shared) weights; plain Transformer layers would work as well.
        self.prelude = nn.ModuleList([
            ParcaeRecurrentBlock(d_model, n_heads)
            for _ in range(n_prelude)
        ])

        # THE recurrent block: there is only one, and it is looped repeatedly
        self.recurrent = ParcaeRecurrentBlock(d_model, n_heads)

        # Coda: decode back to output
        self.coda = nn.ModuleList([
            ParcaeRecurrentBlock(d_model, n_heads)
            for _ in range(n_coda)
        ])

        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, x, num_loops=None):
        """
        x: token ids (B, L)
        num_loops: loop count (training: random; inference: depends on the task)
        """
        B, L = x.shape
        pos = torch.arange(L, device=x.device).unsqueeze(0).expand(B, L)
        h = self.embed(x) + self.pos_embed(pos)

        # Step 1: Prelude
        u = h
        for block in self.prelude:
            h = block(h, u)
        prelude_output = h  # injected into every loop below

        # Step 2: Recurrent Core
        if num_loops is None:
            if self.training:
                # LTG-style dynamic recurrence
                num_loops = torch.randint(1, self.max_loops + 1, (1,)).item()
            else:
                num_loops = self.max_loops

        for _ in range(num_loops):
            h = self.recurrent(h, prelude_output)  # same parameters, applied repeatedly

        # Step 3: Coda
        for block in self.coda:
            h = block(h, prelude_output)

        return self.head(h)


# ================================
# Training loop (combining Parcae + LTG practices)
# ================================

def train_step(model, batch, optimizer):
    x, y = batch  # (B, L), (B, L)
    B = x.size(0)

    # Parcae: per-sequence depth sampling (not per-batch).
    # Each sequence gets its own forward pass here, which is simple but slow;
    # grouping sequences by sampled loop count would batch this more efficiently.
    num_loops_per_seq = torch.randint(1, model.max_loops + 1, (B,))

    total_loss = 0
    for i, r in enumerate(num_loops_per_seq):
        logits = model(x[i:i+1], num_loops=r.item())
        total_loss += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            y[i:i+1].reshape(-1),
        )
    loss = total_loss / B

    optimizer.zero_grad()
    loss.backward()

    # Gradient clipping as a safety net against rare spikes (even with a stable architecture)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
    return loss.item()


# ================================
# Inference: test-time compute scaling
# ================================

@torch.no_grad()
def answer_question(model, question_ids, difficulty='medium'):
    """
    Test-time scaling: pick the loop count according to difficulty.
    Beware of overthinking: do not set it too high.
    The loop counts below are illustrative, not tuned values.
    """
    loop_config = {
        'easy': 2,        # simple factual QA
        'medium': 6,      # general reasoning
        'hard': 12,       # multi-step math / compositional
        'very_hard': 15,  # deep reasoning (going higher risks overthinking)
    }
    num_loops = loop_config[difficulty]

    logits = model(question_ids, num_loops=num_loops)
    return logits.argmax(-1)  # predicted next-token id at every position

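Key implementation points

🎯 Summary

Negative diagonal parameterization (Parcae): A = -exp(log_A) + ZOH → $\rho(\bar{A}) < 1$

Dynamic recurrence (LTG): randomize $r$ during training; do not fix it

Per-sequence depth sampling (Parcae): each sequence samples its own loop count

Injection signal: inject the prelude output at every loop to prevent latent drift

The heart of the loop: a single self.recurrent block, reused repeatedly

Test-time scaling: tune the loop count to difficulty, but stay below the overthinking threshold (around 15 loops in the example above)

Gradient clipping: even with a stable architecture, clipping remains a safety net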

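Open questions from combining the two papers

Together, these two same-week papers leave several very interesting next steps:

Question	Where Parcae gets to	Where LTG gets to	Still open
Adaptive halting	Not addressed (fixed loops)	Exposes overthinking but does not solve it	Dynamic halting at token level / sequence level?
Interpretability	Black box	Probing verifies latent CoT	Sparse autoencoders across loops? Circuit analysis?
Scale up	1.3B	Synthetic only	Combine both and scale to 70B+?
Real multi-hop	Not tested	Works on synthetic data	Does it work on real knowledge graphs (Wikidata, Freebase)?
Combine with CoT	Not discussed	Not discussed	Hybrid implicit + explicit (Kahneman System 1 / 2)?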

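A bold speculation

🔮 RDTs may become the new standard for reasoning LLMs
Today's thinking models such as o1 / o3 / DeepSeek-R1 all follow the explicit CoT route: emit lots of thinking tokens in exchange for quality. The marginal cost of that route is very high (every thinking token needs a full forward pass).

RDTs offer implicit thinking: reasoning is completed in latent space by looping over shared parameters, potentially at a much lower cost per step than generating a CoT token.

With the Parcae + LTG combination arriving in 2026, next-generation reasoning LLMs may mix:

Latent loops for "fast thinking" (Parcae-stable core)

Explicit CoT for "deliberate thinking" (when latent reasoning fails)

Adaptive halting to switch between the two modes (future work)

This is essentially a neural version of Kahneman's System 1 / System 2 dichotomy.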

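Summary: in this week of April 2026, RDTs got a mature blueprint

Paper	What it unlocked
Parcae	✅ Training stability (no spikes)
✅ Scaling laws
✅ Shown to work at 1.3B scale
✅ 770M params match a 1.3B Transformer
Loop, Think, & Generalize	✅ First empirical evidence of implicit multi-hop reasoning
✅ Depth extrapolation (5 → 10 hops)
✅ Latent CoT is observable
✅ Dynamic recurrence is the training recipe
⚠️ Exposes overthinking

One-sentence conclusion

🧠 Parcae turns RDTs from a "lab toy" into an "industrially usable architecture".
Loop, Think, & Generalize turns RDTs from an "architectural novelty" into "a model that really reasons".

That both papers dropped in the same week is no coincidence: RDTs are ready for prime time, and the next step is simply for someone to scale them to a 70B+ frontier model.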

背景：RDT 嘅八年進化，只為解決一個問題

喺深入主菜之前，花 5 分鐘先過一下背景。唔熟悉 RDT 嘅讀者可以了解下點解呢兩篇論文嘅 contribution 咁重要。

標準 Transformer 嘅深度困境

經典 Transformer（GPT、Llama 等）嘅每一層都有獨立參數。32-layer 模型有 32 套唔同嘅 $\{W_Q, W_K, W_V, W_O, W_{\text{FFN}}\}$ 。問題係：

⚠️ 固定深度 = 固定計算量

32-layer 處理任何 token，都只做正好 32 步計算

簡單問題（「1+1=?」）做 32 步 → 浪費

複雜問題（「證明費馬大定理」）都只做 32 步 → 不足

加深 → 加參數 → 加 memory → 加訓練時間 😭

數學上仲更慘：標準 Transformer 唔係 Turing-complete，某啲需要無限步迭代嘅算法根本做唔到。

Recurrent Depth 嘅核心思想

RDT 嘅想法好簡單：同一組參數 $W$ 反覆應用 $L$ 次。

Loading diagram...

優勢：參數量 $|W|$ 固定，但 FLOPs $= |W| \times L$ 可以動態調整——test time 加 loop 就加 reasoning compute。

前三個重要嘅里程碑（快速回顧）

年份	論文	做咗咩	留低咗咩問題
2018	Universal Transformers (Dehghani et al., Google Brain)	首次引入 recurrence in depth + ACT halting；證明 Turing-completeness	ACT 訓練極不穩定、長 sequence 崩潰、冇 scaling laws
2025 Feb	Huginn-3.5B (ELLIS Tübingen + UMD)	首個大規模 RDT proof of concept：3.5B params / 800B tokens，test-time 加 loop 可以單調提升 reasoning	仍然黑盒；訓練要好多 tricks；冇理論保證；冇 scaling laws
2025 Feb	Reasoning with Latent Thoughts (Saunshi et al., DeepMind, ICLR 2025)	理論證明 loop = 連續 latent space 嘅 CoT；claim「depth 比 parameters 更重要」	純理論；實際訓練仍然崩潰；冇 scaling laws

呢個星期，Parcae 終於打破咗呢個瓶頸。

第一篇：Parcae — 終於令 RDT 喺訓練時穩得住

論文背景

📄 Parcae: Scaling Laws For Stable Looped Language Models

作者：Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, Dan Fu

機構：UC San Diego + Together AI (Sandy Research Lab)

發表：2026 年 4 月 14 日（arXiv v1）

論文：arXiv:2604.12946

Blog：sandyresearch.github.io/parcae/

代碼：Together AI 官方 blog 有 reference impl

Motivation：點解 Huginn 咁難訓？

作者先開門見山，展示咗一個好熟悉嘅 failure mode：

💀 Huginn 式訓練嘅典型症狀

Residual state explosion：訓練到某個 step，hidden state 嘅 norm 突然由 $\sim 10$ 爆到 $10^5$

Loss spikes：validation loss 平時係 2.3，突然一個 batch 彈到 50+

Gradient NaN：偶然整個 training run 直接 crash

作者需要不停 tune gradient clipping、warmup、LR、init 先可以勉強訓練得到

呢個唔係 engineering issue，而係架構本身嘅問題。

關鍵 Insight：將 Loop 視為 Dynamical System

Parcae 最美麗嘅 contribution 係一個視角嘅轉換：

如果將 recurrent block 嘅更新線性化，loop 就係一個線性非時變 (LTI) 動態系統。

具體講，如果 recurrent block 嘅更新寫成：

h^{t+1} = A h^t + B u

其中：

$h^t \in \mathbb{R}^{d}$ ：第 $t$ 次 loop 嘅 latent state
$u$ ：prelude output（每 loop 都 inject 返去）
$A \in \mathbb{R}^{d \times d}$ ：state-to-state transition
$B \in \mathbb{R}^{d \times d}$ ：input injection matrix

控制理論話俾我哋知：呢個系統穩定 ⇔ $\rho(A) < 1$ （ $\rho$ 係 spectral radius，即 $|\lambda|_{\max}$ ）。

🔑 診斷：Huginn 嘅 $A$ 冇任何約束。訓練過程中 $A$ 嘅 eigenvalues 會漂移到 $|\lambda| > 1$ ，系統變成 unstable dynamical system， $h^t$ 呈指數增長——就係 residual explosion 嘅本質。

解法：Negative Diagonal Parameterization

Parcae 嘅 trick 極度精巧：強制 $A$ 係負對角 + ZOH 離散化。

Step 1：連續時間負對角

A := \text{Diag}\big(-\exp(\log_A)\big), \quad \log_A \in \mathbb{R}^{d_h}

$\exp(\log_A) > 0$ （恆正）
加個負號 → $A$ 對角元素恆負
對角矩陣嘅 eigenvalues 就係對角元素本身 → 所有 $\lambda < 0$

Step 2：Zero-Order Hold (ZOH) 離散化

仿照 Mamba / S4：

\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B

因為 $A < 0$ ：

\bar{A} = \exp(\Delta A) \in (0, 1)

即 $\rho(\bar{A}) < 1$ by construction ✅

💡 類比：想像一個人喺山谷入面搵穩定點。Huginn 俾人自由走動——可能搵到谷底，但都可能爬上山頂（explosion）。Parcae 就好似綁咗條橡筋繩喺谷底——無論點走，都會被拉返落穩定點。
呢個 trick 其實係State Space Model (SSM) 嘅 signature move。Mamba、S4、S5 全部用緊。Parcae 嘅貢獻係第一次將 SSM stability tricks 引入 Transformer loops。

架構：Parcae Recurrent Block

Loading diagram...

關鍵設計：LTI 部分負責穩定性，Transformer 部分負責表達能力。兩者相加，既唔會 explode，又保留 attention 嘅非線性學習能力。

另一個關鍵：Per-Sequence Depth Sampling

Parcae 嘅改進：每個 sequence 獨立 sample $r$ 。

python# Huginn 式（sub-optimal）
r = random.choice([1, ..., r_max])   # batch-level
for seq in batch:
    loss += compute(seq, r)          # 所有 seq 用同一個 r

# Parcae 式（unbiased）
for seq in batch:
    r_seq = random.choice([1, ..., r_max])  # sequence-level
    loss += compute(seq, r_seq)

呢個小改動顯著減少 loss spikes 嘅 frequency。

實驗結果：RDT 第一次贏硬 Transformer

New database

Scaling Laws：Chinchilla for RDT

Parcae 建立咗三條關鍵 scaling laws（類似 Chinchilla 之於 Transformer）：

Loops 同 data 要同步 scale：fix FLOPs 嘅情況下，增加 loop 次數嘅 optimal data 量隨之增加
Test-time scaling 係 saturating exponential：gain ∝ $1 - e^{-k \cdot L}$ ——early loops 收益大，後期邊際收益遞減
Params vs. Loops 嘅 trade-off 有 sweet spot：太細 params + 太多 loops 會 underfit；太大 params + 太少 loops 浪費 capacity

🚀 Parcae 嘅工程意義
Parcae 係 RDT 史上第一次真正可以 deploy 嘅架構：

訓練穩定（冇 spike）

參數效率（一半 params 匹配 baseline）

Test-time 動態 compute（on-device、memory-constrained 場景嘅 dream）

對 edge LLM / mobile LLM 係 game changer。

Parcae 留低嘅問題

Sequence-level compute：所有 tokens loop 一樣多次。但「呢個」同「量子物理」顯然需要唔同 thinking depth → adaptive halting 仍未解
Interpretability 未解：只係解決 training，冇解 latent state 發生緊乜嘢
Compositional reasoning 未驗證：Parcae 做嘅係標準 language modeling，冇 specifically 測 multi-hop reasoning → 呢個正正係下一篇論文嘅主角

第二篇：Loop, Think, & Generalize — 驗證 RDT 真係會「諗嘢」

論文背景

📄 Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

作者：Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, Yuekun Yao

機構：Ohio State University NLP Group

發表：2026 年 4 月 9 日（arXiv v1）

論文：arXiv:2604.07822

代碼：github.com/OSU-NLP-Group/Loop-Think-Generalize

核心問題：Implicit Multi-Hop Reasoning

咩係 implicit reasoning？

想像你問一個 LLM：「Barack Obama 嘅太太嘅媽咪嘅出生地嘅 capital 係邊？」

Explicit (CoT)：模型寫出：「Michelle Obama 係 Obama 太太 → 佢媽咪 Marian Robinson → Marian 喺 Chicago 出生 → Illinois capital 係 Springfield」
Implicit：模型一次 forward pass 內部完成，直接俾答案 Springfield

⚠️ 標準 Transformer 嘅痛點
好多 mechanistic interpretability 研究發現，標準 Transformer 雖然儲存咗呢啲 atomic facts，但組合唔到——佢哋知道每個 fact，但一次 forward pass 砌唔出推理鏈。呢個叫 compositional generalization failure。

LTG 嘅 hypothesis：俾 recurrent depth，Transformer 就組合得到。

兩大關鍵挑戰

LTG formalize 咗 implicit reasoning 嘅兩個具體問題：

挑戰	定義	難點
Systematic Generalization	Training 時見過 $A \to B, B \to C, \ldots$ 嘅 single-hop facts，但從未見過 $A \to C$ 嘅組合。Test 時要求模型做 unseen compositions	要求模型真正 learn 到「組合律」而唔係 memorize
Depth / Recursion Extrapolation	Training 最多見 5-hop 推理；inference 要求做 10-hop（training 從未見過呢個深度）	模型要能 generalize 到未見過嘅推理深度

實驗設計：Synthetic Multi-Hop Dataset

LTG 構造咗一個完美 controlled 嘅 synthetic setup：

javascript// Entity graph: 一條長鏈
Facts:  A → B → C → D → E → F → G → H → I → J → K
        ↑  每條 edge 代表一個 learned fact

Training:
  - Single-hop facts: A→B, B→C, C→D, ...
  - Multi-hop queries: 2-hop, 3-hop, 4-hop, 5-hop composed queries
  - NEVER: 6-hop or beyond

Test (unseen depth):
  - 6-hop, 7-hop, 8-hop, 9-hop, 10-hop queries

如果一個 RDT 可以做對 10-hop 嘅 query，我哋就知：

佢真係 internally 做緊多步 reasoning（唔係 memorize）
佢可以將 training 嘅 reasoning depth extrapolate 去 unseen depth

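Three training strategies compared

LTG's core experiment tests three training-time loop-count schedules:

Strategy	Description	5-hop → 10-hop Extrapolation
Fixed recurrence	Every training step uses a fixed $r = 5$	❌ Fails completely: 10-hop test accuracy is near random
Curriculum recurrence	Increase $r$ step by step (1→2→3→4→5)	⚠️ Moderate: 7-hop is still acceptable, 8+ hops degrade clearly
Dynamic recurrence	Each batch samples a random $r \in [1, 5]$	✅ Best: accuracy stays high on the 10-hop test

💡 Why does dynamic recurrence work?
When $r$ is random during training, the model is forced to learn a depth-agnostic reasoning block: the same block must handle both shallow and deep reasoning. This inductive bias is what lets it generalize when test-time loops are pushed up to 10-hop queries.

This finding lines up with Parcae's per-sequence depth sampling: randomization is key to RDT generalization.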
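
To make the three schedules concrete, here is a minimal sketch of how each one might pick the loop count `r` for a training step. The function names and the equal-length curriculum stages are assumptions for illustration, not details taken from the paper.

```python
import random


def fixed_schedule(step, total_steps, r_max=5):
    """Fixed recurrence: always use the maximum loop count."""
    return r_max


def curriculum_schedule(step, total_steps, r_max=5):
    """Curriculum recurrence: grow r from 1 to r_max in equal-length stages."""
    stage = step * r_max // total_steps
    return min(stage + 1, r_max)


def dynamic_schedule(step, total_steps, r_max=5, rng=random):
    """Dynamic recurrence: sample r uniformly from 1..r_max for every batch."""
    return rng.randint(1, r_max)


rng = random.Random(0)
dynamic_samples = [dynamic_schedule(s, 100, rng=rng) for s in range(100)]
```

Only the dynamic schedule exposes the block to every depth throughout training, which is the property the paragraph above credits for extrapolation.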

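Evidence: latent Chain-of-Thought really exists

LTG's most interesting finding: probing the latent state after each loop reveals a clear "reasoning trajectory".

The latent at each loop really does activate the representation of an intermediate entity. This is the first empirical confirmation of Saunshi's theoretical claim that looping is CoT in latent space.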

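But then: overthinking!

LTG exposes an unexpected new problem:

⚠️ Overthinking Phenomenon
Once the number of test-time loops passes some threshold (for example, around 15 loops), accuracy goes down instead of up.

Intuition: after many loops the latent state drifts off the meaningful manifold into a "semantic void", and the Coda decodes it incorrectly.

The phenomenon was first observed by Bansal et al. 2022 on algorithmic tasks; LTG is the first to document it formally in a reasoning context.

This brings the community straight back to the problem Universal Transformers left open eight years ago: adaptive halting. How does the model know when to stop?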

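What did LTG solve, and what did it leave open?

✅ Solved	❌ Left open
First empirical evidence that RDTs perform implicit multi-hop reasoning	Adaptive halting is still an open problem (back to UT's ACT)
Shows depth extrapolation is feasible (5 → 10 hops)	How to avoid overthinking?
Dynamic recurrence is key to systematic generalization and extrapolation	Do insights from synthetic datasets transfer to real LLMs?
Probing confirms that latent CoT exists	Fine-grained interpretability of the latent state is still lacking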

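How do the two papers complement each other?

These two papers only give the full picture when read together.

	Parcae	Loop, Think, & Generalize
Focus	Training-side stability	Inference-side reasoning and generalization
Experimental scale	1.3B params / 104B tokens (real LLM scale)	Controlled synthetic multi-hop dataset
Core claim	Looped models can be trained stably and beat an iso-param Transformer	RDTs perform implicit multi-hop reasoning and extrapolate from 5 to 10 hops
Core trick	Negative diagonal $A$ + ZOH + per-sequence depth sampling	Dynamic recurrence (random $r$ per batch)
Shared insight	Randomizing the loop count is the common answer for both stable training and good generalization

🧬 If you do RDT research
The best combination of the two is:

Use Parcae's architecture as the foundation (negative diagonal + ZOH + LTI stability)

Use LTG's dynamic recurrence as the training recipe

Use LTG's probing methodology for interpretability analysis

Together these give a fairly complete blueprint for a looped LLM.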

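Case study: OpenMythos, an open-source implementation that puts both papers into practice

📦 OpenMythos

Author: Kye Gomez (swarms)

Released: 19 April 2026

GitHub: github.com/kyegomez/OpenMythos

Coverage: in-depth write-up on MarkTechPost

Claim: 770M parameters can match a 1.3B Transformer (citing Parcae's scaling laws)

Core hypothesis: Claude Mythos might be an RDT

Anthropic has never published technical details about Mythos. Based on circulating rumors and the recent RDT literature, Kye puts forward a bold hypothesis:

Mythos's reasoning ability comes from a Recurrent-Depth Transformer: the weights are applied repeatedly $T$ times ( $T \leq 16$ ), and the whole reasoning process happens in continuous latent space without emitting any intermediate tokens.

The hypothesis is falsifiable (an experiment can refute it), and Kye has implemented it in full in PyTorch, giving the community a concrete artifact to play with.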

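Three-stage architecture: Prelude → Recurrent Block → Coda

🎼 Where the names Prelude / Coda come from
The three names are borrowed from music (prelude = opening, coda = closing) and map onto three distinct roles:

Prelude (opening): encodes raw tokens into the latent thinking space, moving the text onto the "plane where thinking happens". It is a few ordinary Transformer layers, run once.

Recurrent Block (main theme): the actual reasoning engine, looped $T$ times in latent space. All the thinking happens here, and so does the weight reuse.

Coda (closing): decodes the latent state after the final loop back into output tokens, translating "what was thought" into text. Again a few ordinary Transformer layers, run once.

Analogy: the Prelude "reads the question", the Recurrent Block "thinks about it", and the Coda "writes the answer". The Recurrent Block's parameters are reused on every loop, but the Prelude / Coda parameters are not: they are just ordinary Transformer layers.

Update rule (per loop):

$$h_{t+1} = A \cdot h_t + B \cdot e + \text{Transformer}(h_t, e)$$

where:

$h_t$ : hidden state at loop $t$
$e$ : Prelude output (encoded input), re-injected at every loop to prevent latent drift
$A, B$ : learned transition / injection matrices
Transformer: the block's internal attention + MoE FFN

🔄 Why re-injection matters
Without the $B \cdot e$ term, $h_t$ drifts after many loops until it no longer resembles the input at all. Re-injection "reminds" the model of the original input on every loop, the same design used by Huginn and Parcae.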
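
The effect of re-injection can be seen in a scalar toy version of the linear part of the update rule. This sketch assumes a one-dimensional state and drops the Transformer term entirely, so it only illustrates the $A \cdot h_t + B \cdot e$ dynamics; `iterate_linear` is a hypothetical helper.

```python
def iterate_linear(a, b, e, h0, steps):
    """Iterate h <- a*h + b*e and return the whole trajectory."""
    h = h0
    trajectory = [h]
    for _ in range(steps):
        h = a * h + b * e
        trajectory.append(h)
    return trajectory


# |a| < 1 with injection: the state converges to the fixed point b*e / (1 - a).
with_injection = iterate_linear(a=0.9, b=0.1, e=2.0, h0=0.0, steps=200)

# |a| < 1 without injection: the state forgets its start and decays to zero.
without_injection = iterate_linear(a=0.9, b=0.0, e=2.0, h0=5.0, steps=200)

# |a| > 1: the same recurrence blows up, which is the residual explosion case.
unstable = iterate_linear(a=1.1, b=0.1, e=2.0, h0=0.0, steps=200)
```

With injection the state settles at a value determined by the input `e`; without it, a stable recurrence simply loses the input, and an unstable one explodes.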

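Design Choice #1: borrowing stability from Parcae

OpenMythos directly adopts Parcae's LTI stability constraint:

🔑 LTI-Stable Recurrent Injection
It enforces $\rho(\bar{A}) < 1$ by construction, where $\bar{A}$ is the discretized transition matrix, using Parcae's negative diagonal + ZOH trick. In the repo Kye makes this a first-class training primitive, meaning stability is not an optional regularizer but part of the architecture itself.

Practical effect: the linear recurrence cannot blow up regardless of learning rate, gradient noise, or init, which cuts down the time spent on hyperparameter tuning.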
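
A quick numerical check shows why the negative diagonal + ZOH construction keeps every eigenvalue inside the unit interval. This is a scalar sketch; passing the step size through softplus to keep it positive is an assumption of this example, not something the post states about Parcae or OpenMythos.

```python
import math


def softplus(x):
    """Smooth positive function log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def discretized_a(log_a, raw_delta):
    """Return the discrete eigenvalue exp(delta * a) for one diagonal entry."""
    a = -math.exp(log_a)          # continuous-time eigenvalue, always negative
    delta = softplus(raw_delta)   # step size, always positive (assumption)
    return math.exp(delta * a)    # ZOH discretization, lands in (0, 1)


# Sweep unconstrained parameter values: the result never leaves (0, 1).
grid = [discretized_a(la, rd) for la in range(-3, 4) for rd in range(-3, 4)]
```

Whatever values the optimizer assigns to `log_a` and `raw_delta`, the product `delta * a` stays negative, so the discrete eigenvalue cannot reach 1.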

Design Choice #2: MoE FFN instead of a standard FFN

OpenMythos does not use a standard FFN; it follows the DeepSeekMoE design:

Shared experts: always active, learn cross-domain common patterns
Routed experts: sparse top-K, each token activates only a few
Critical twist: the router picks different expert subsets at each loop depth

💡 Why is this design clever?
In a plain RDT every loop performs exactly the same computation. OpenMythos's MoE router makes the effective computation differ per loop: loop 1 might pick experts {3, 17, 42} and loop 2 might pick {8, 23, 71}, even though the underlying base weights are the same.

The result: MoE provides domain breadth and looping provides reasoning depth, two orthogonal axes.
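
One way a shared router can end up selecting different experts at different loop depths is simply that its input, the hidden state, changes from loop to loop. The sketch below illustrates that mechanism with a tiny hand-written router; it is an assumption about how depth-dependent routing could arise, and OpenMythos may condition on depth differently. `router_logits` and `top_k_experts` are hypothetical names.

```python
def router_logits(w_router, h):
    """One logit per expert: dot product of each router row with the state h."""
    return [sum(w * x for w, x in zip(row, h)) for row in w_router]


def top_k_experts(logits, k):
    """Indices of the k highest-scoring experts, in ascending index order."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


# Four routed experts, a 2-dimensional hidden state, shared router weights.
w_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

h_loop1 = [2.0, 1.0]    # hypothetical hidden state after loop 1
h_loop2 = [-1.0, 3.0]   # hypothetical hidden state after loop 2

experts_loop1 = top_k_experts(router_logits(w_router, h_loop1), k=2)
experts_loop2 = top_k_experts(router_logits(w_router, h_loop2), k=2)
```

The router weights are identical in both calls; only the hidden state differs, yet the selected expert sets are different.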

Design Choice #3: Multi-Latent Attention (MLA)

First, a recap: why is standard Multi-Head Attention so painful inside an RDT?

At inference time, standard MHA caches the K and V tensors for every layer and every token so later tokens can attend to them (the well-known KV cache). The cache grows with every new token:

$$\text{KV Cache Size} \approx 2 \cdot L_{\text{layers}} \cdot L_{\text{seq}} \cdot d_{\text{model}}$$

The MLA trick: compress K and V into one low-rank latent

MLA comes from DeepSeek-V2 (where it is called Multi-head Latent Attention). Instead of caching full K and V, it caches a compressed latent vector $c \in \mathbb{R}^{d_c}$ ( $d_c \ll d_{\text{model}}$ , typically 10–20× smaller):

$$c_t = W_{DKV}\, h_t \quad \text{(down-project, compress)}$$

When attention is computed, K and V are up-projected on the fly:

$$K_t = W_{UK}\, c_t, \quad V_t = W_{UV}\, c_t \quad \text{(decompress)}$$

📦 MLA vs MHA in one sentence
MHA caches the full K and V tensors (heavy); MLA caches only a low-rank summary $c$ (light) and uses $W_{UK}, W_{UV}$ to decompress it into K and V when needed. The DeepSeek-V2 paper reports a 93.3% smaller KV cache than DeepSeek 67B, with performance close to lossless.

This matters especially in the RDT context: the more loops there are, the more often attention has to be recomputed or read from cache, and MLA keeps the memory footprint of $T=16$ loops manageable.
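
The cache formula above is easy to turn into a back-of-the-envelope comparison. The sketch assumes each loop keeps its own cache (so the effective layer count is block layers × loops) and ignores the small decoupled positional key that real MLA also caches; the concrete sizes are made up for illustration.

```python
def mha_cache_elems(n_layers, seq_len, d_model):
    """Cached elements for standard MHA: full K and V per layer and token."""
    return 2 * n_layers * seq_len * d_model


def mla_cache_elems(n_layers, seq_len, d_latent):
    """Cached elements for MLA: one compressed latent per layer and token."""
    return n_layers * seq_len * d_latent


# Hypothetical RDT: 1 recurrent layer looped 16 times, 4096-token context.
effective_layers = 1 * 16
mha = mha_cache_elems(effective_layers, seq_len=4096, d_model=2048)
mla = mla_cache_elems(effective_layers, seq_len=4096, d_latent=128)
ratio = mha / mla
```

With a latent 16× narrower than `d_model`, the cache shrinks by 32× in this toy setting, because a single latent replaces both K and V.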

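Design Choice #4: Depth-Wise LoRA

A quick recap of standard LoRA (skip if you know it)

The core idea of LoRA: to fine-tune a large matrix $W \in \mathbb{R}^{d \times d}$ , do not modify $W$ directly (that means changing $d^2$ parameters). Freeze it and add a low-rank update:

$$W' = W + A B, \quad A \in \mathbb{R}^{d \times r},\; B \in \mathbb{R}^{r \times d},\; r \ll d$$

Only $A, B$ are trained (parameters drop from $d^2$ to $2 d r$ ) and $W$ stays unchanged. This is commonly used to fine-tune large models: an adapter of a few MB is enough to customize a 7B model.

Depth-Wise LoRA: one adapter per loop depth

$$W^{(t)} = W_{\text{base}} + A_t B_t$$

Loop $t$	Weights actually used
t = 1	$W_{\text{base}} + A_1 B_1$
t = 2	$W_{\text{base}} + A_2 B_2$
⋮	⋮
t = T	$W_{\text{base}} + A_T B_T$

🎚️ Standard LoRA vs Depth-Wise LoRA

Standard LoRA: one frozen base + one adapter $(A, B)$ , used at fine-tuning time (a patch applied after pre-training).

Depth-Wise LoRA: one base + T adapters $\{(A_t, B_t)\}_{t=1}^{T}$ , used inside the architecture to differentiate loop depths (active during both pre-training and inference).

Parameter efficiency: total parameters ≈ $d^2 + T \cdot 2 d r$ , so the adapters add a fraction $2rT/d$ on top of the base. With $r=8, d=512, T=16$ that is 50% of the base; the overhead only drops to single digits at larger widths (about 6% at $d=4096$ ). Either way it is far cheaper than fully independent weights per loop, which would cost $T \cdot d^2$ .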
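
The parameter-efficiency formula is worth checking with concrete numbers, since the ratio of adapter parameters to base parameters simplifies to $2rT/d$ and therefore depends strongly on the model width. `adapter_overhead` is a hypothetical helper for this arithmetic.

```python
def adapter_overhead(d, r, T):
    """Adapter parameters (T adapters of size 2*d*r) as a fraction of d*d."""
    base = d * d
    adapters = T * 2 * d * r
    return adapters / base


small_width = adapter_overhead(d=512, r=8, T=16)    # narrow model
large_width = adapter_overhead(d=4096, r=8, T=16)   # wide model
independent = 16 * 512 * 512 / (512 * 512)          # T fully separate matrices
```

At `d=512` the adapters add half as many parameters as the base matrix, while at `d=4096` they add about 6%; both are small next to the 16× cost of untied weights.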

Why OpenMythos does this

This is a contribution specific to OpenMythos, meant to resolve a tension in RDTs:

Extreme 1	Extreme 2	OpenMythos's middle ground
Pure weight-tying (every loop uses exactly the same weights)	Fully independent weights per layer (which degenerates into a standard Transformer)	Depth-Wise LoRA: base weights are shared, and each loop depth adds a small rank- $r$ adapter

Concretely:

$$W^{(t)} = W_{\text{base}} + A_t B_t, \quad A_t \in \mathbb{R}^{d \times r}, \; B_t \in \mathbb{R}^{r \times d}$$
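
A minimal PyTorch sketch of a depth-wise LoRA linear layer follows. The class name `DepthWiseLoRALinear`, the zero initialization of $B_t$, and the row-vector convention `x @ W` are assumptions made for illustration; this is not the OpenMythos implementation.

```python
import torch
import torch.nn as nn


class DepthWiseLoRALinear(nn.Module):
    """Shared base projection plus one low-rank adapter per loop depth."""

    def __init__(self, d_model, rank, num_loops):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)
        # One (A_t, B_t) pair per loop depth, stacked along dim 0.
        self.A = nn.Parameter(torch.randn(num_loops, d_model, rank) * 0.01)
        # B starts at zero so every depth initially equals the shared base.
        self.B = nn.Parameter(torch.zeros(num_loops, rank, d_model))

    def forward(self, x, t):
        # Equivalent to x @ (W_base + A_t B_t) without forming the full matrix.
        return self.base(x) + (x @ self.A[t]) @ self.B[t]


layer = DepthWiseLoRALinear(d_model=16, rank=4, num_loops=3)
x = torch.randn(2, 5, 16)
out_depth0 = layer(x, t=0)
adapter_params = layer.A.numel() + layer.B.numel()
```

Applying the low-rank factors one after the other keeps the extra cost per loop at $O(dr)$ per token instead of materializing a new $d \times d$ matrix for every depth.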

Design Choice #5: ACT halting against overthinking

First: what exactly is ACT?

Adaptive Computation Time (ACT) was proposed by Alex Graves (2016) and brought into the Transformer world by Universal Transformers in 2018. The motivation in one sentence:

Different tokens need different thinking depth. "the" might be done in 1 step; "prove" might take 10. Let the model learn for itself when to stop looping.

Mechanism (applied independently at each token position):

At each loop step $t$ , a small linear head produces a halting probability $p_t \in (0, 1)$ (typically via a sigmoid)
Halting is accumulated: $P_t = \sum_{i=1}^{t} p_i$
As soon as $P_t \geq 1$ , that token position stops looping and its current latent state is handed to the Coda for decoding
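
The three steps above amount to a running sum with a threshold. Here is a small sketch for a single token position, assuming the halting probabilities are already given; `halting_step` is a hypothetical helper and it omits the remainder weighting used in Graves's full ACT formulation.

```python
def halting_step(halting_probs, threshold=1.0):
    """Return the 1-based loop step at which cumulative halting reaches threshold.

    Falls back to the last available step if the threshold is never reached.
    """
    cumulative = 0.0
    for t, p in enumerate(halting_probs, start=1):
        cumulative += p
        if cumulative >= threshold:
            return t
    return len(halting_probs)


easy_token = halting_step([0.6, 0.5, 0.2])   # confident early
steady_token = halting_step([0.25] * 16)     # halts once the sum hits 1.0
hard_token = halting_step([0.05] * 8)        # never halts within the budget
```

A position that emits large halting probabilities stops after a couple of loops, while one that stays uncertain uses the full loop budget.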

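How OpenMythos wires ACT into the RDT

To address the overthinking problem that LTG exposed, OpenMythos revives the Adaptive Computation Time (ACT) mechanism from Universal Transformers:

```python
import torch

def act_forward(x, prelude, recurrent_block, coda, halting_head, T_max=16):
    """Simplified per-position ACT loop (no remainder weighting)."""
    h = prelude(x)                       # (B, L, d)
    e = h                                # re-injected at every loop
    cumulative = torch.zeros(h.shape[:-1], device=h.device)  # (B, L)
    for _ in range(T_max):
        # Learned halting head: one probability per token position.
        p = torch.sigmoid(halting_head(h)).squeeze(-1)
        cumulative = cumulative + p * (cumulative < 1.0)
        active = cumulative < 1.0        # positions that keep looping
        if not active.any():
            break                        # every position has halted
        # Halted positions keep their state; active ones take another step.
        h = torch.where(active.unsqueeze(-1), recurrent_block(h, e), h)
    return coda(h)
```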

How OpenMythos puts the two papers into practice

Paper contribution	How OpenMythos reflects it
Parcae: LTI stability	✅ Negative diagonal $A$ + ZOH as a first-class primitive
Parcae: per-sequence depth sampling	✅ Implemented in the training loop
Parcae: 770M ≈ 1.3B Transformer	✅ Explicitly claimed, adopting the scaling laws
LTG: Dynamic recurrence	✅ Random $T$ sampled during training
LTG: Overthinking problem	✅ Addressed with ACT halting
LTG: Latent CoT reasoning	✅ Re-injection + continuous latent space design
OpenMythos's own additions	⭐ Depth-conditional MoE router + Multi-Latent Attention + Depth-Wise LoRA

🎯 The real value of OpenMythos
OpenMythos may not be the actual architecture of Claude Mythos (only Anthropic knows). But it is the most complete production-style RDT reference implementation seen so far, integrating Parcae's stability, LTG's dynamic recurrence, DeepSeek's MoE + MLA, and UT's ACT halting.

If you want to do your own RDT research, there is no need to start from scratch: fork OpenMythos, swap in your dataset, and start experimenting.

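Caveats to keep in mind

⚠️ Honest disclaimer

OpenMythos is a theoretical reconstruction with no official confirmation

The code is mainly an architectural blueprint; actually pre-training at 1.3B scale needs significant compute (estimated at 100K+ GPU-hours)

The current release emphasizes architectural completeness over training artifacts: you have to pre-train it yourself, and the repo does not ship weights

The interaction between MoE routing and the LTI constraints is not yet fully characterised and may hide failure modes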

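OpenMythos vs the "minimal implementation"

The minimal implementation in the next section is for teaching: a compact single-file example focused on the smallest workable RDT. OpenMythos is the "industrial-grade" version:

	Minimal implementation (below)	OpenMythos
Code size	Single short file	Full repo, thousands of lines
Parcae stability	✅ Negative diagonal + ZOH	✅ Same
Dynamic recurrence	✅ Random $T$	✅ Same
FFN	Standard MLP	✅ DeepSeek MoE
Attention	Standard MultiheadAttention	✅ Multi-Latent Attention
Depth differentiation	Pure weight-tying	✅ Depth-Wise LoRA
Halting	Fixed loops	✅ ACT halting
Purpose	Teaching + quick prototyping	Real research baseline

Want to do real RDT research? Read the minimal implementation first to understand the mechanics, then move on to OpenMythos for the fuller production-style design.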

Implementation guide: a PyTorch skeleton combining both papers

Below is a compact minimal RDT implementation that combines Parcae's stability with LTG's dynamic recurrence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParcaeRecurrentBlock(nn.Module):
    """
    Parcae-style stable recurrent block.

    Update rule:
        h^{t+1} = Ā ⊙ h^t + B̄ u + Transformer(h^t)

    where:
        A = -exp(log_A)            negative diagonal (λ < 0)
        Δ = softplus(raw_delta)    positive step size
        Ā = exp(Δ * A)             ZOH discretization, in (0, 1)
        ρ(Ā) < 1 by construction   stability of the linear path
    """
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d = d_model

        # Parcae stability mechanism: negative diagonal A + ZOH.
        self.log_A = nn.Parameter(torch.zeros(d_model))
        # Stored pre-softplus so the step size stays positive while training;
        # softplus(-2.25) ≈ 0.1, matching the original initial step size.
        self.raw_delta = nn.Parameter(torch.full((d_model,), -2.25))
        self.B = nn.Parameter(torch.randn(d_model, d_model) * 0.02)

        # Standard Transformer components (shared across loops).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def get_discretized_AB(self):
        """ZOH discretization; keeps the linear path stable."""
        A = -torch.exp(self.log_A)              # always negative
        delta = F.softplus(self.raw_delta)      # always positive
        A_bar = torch.exp(delta * A)            # in (0, 1), so ρ < 1
        B_bar = delta.unsqueeze(-1) * self.B    # first-order approximation
        return A_bar, B_bar

    def forward(self, h, u):
        """
        h: current latent state (B, L, d)
        u: prelude injection (B, L, d)
        """
        A_bar, B_bar = self.get_discretized_AB()

        # LTI stable path.
        linear_update = h * A_bar + u @ B_bar.T

        # Transformer expressive path, with a causal mask so position i
        # cannot attend to later positions during language modeling.
        L = h.size(1)
        causal = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=h.device), diagonal=1)
        attn_out = self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        h_attn = self.norm1(h + attn_out)
        h_ffn = self.norm2(h_attn + self.ffn(h_attn))

        return linear_update + h_ffn


class RecurrentDepthTransformer(nn.Module):
    """
    Full RDT: Prelude → Recurrent Core (reused) → Coda
    """
    def __init__(self, vocab_size, d_model=512, n_heads=8,
                 n_prelude=2, n_coda=2, max_loops=8):
        super().__init__()
        self.d = d_model
        self.max_loops = max_loops

        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(2048, d_model)  # supports L <= 2048

        # Prelude: embed into latent space. For brevity it reuses the same
        # block class; these layers are not looped.
        self.prelude = nn.ModuleList([
            ParcaeRecurrentBlock(d_model, n_heads)
            for _ in range(n_prelude)
        ])

        # THE recurrent block: there is only one, and it is looped.
        self.recurrent = ParcaeRecurrentBlock(d_model, n_heads)

        # Coda: decode back to output.
        self.coda = nn.ModuleList([
            ParcaeRecurrentBlock(d_model, n_heads)
            for _ in range(n_coda)
        ])

        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, x, num_loops=None):
        """
        x: token ids (B, L)
        num_loops: loop count (training: random; inference: task dependent)
        """
        B, L = x.shape
        pos = torch.arange(L, device=x.device).unsqueeze(0).expand(B, L)
        h = self.embed(x) + self.pos_embed(pos)

        # Step 1: Prelude
        u = h
        for block in self.prelude:
            h = block(h, u)
        prelude_output = h  # injected into every loop below

        # Step 2: Recurrent Core
        if num_loops is None:
            if self.training:
                # LTG-style dynamic recurrence
                num_loops = torch.randint(1, self.max_loops + 1, (1,)).item()
            else:
                num_loops = self.max_loops

        for _ in range(num_loops):
            h = self.recurrent(h, prelude_output)  # same parameters each time

        # Step 3: Coda
        for block in self.coda:
            h = block(h, prelude_output)

        return self.head(h)


# ================================
# Training loop (Parcae + LTG best practices)
# ================================

def train_step(model, batch, optimizer):
    x, y = batch  # (B, L), (B, L)
    B = x.size(0)

    # Parcae: per-sequence depth sampling (not per-batch).
    # Looping over sequences is the simplest way to do this exactly; it is
    # slower than one batched pass with a single batch-level loop count.
    num_loops_per_seq = torch.randint(1, model.max_loops + 1, (B,))

    total_loss = 0
    for i, r in enumerate(num_loops_per_seq):
        logits = model(x[i:i+1], num_loops=int(r))
        total_loss += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            y[i:i+1].reshape(-1),
        )
    loss = total_loss / B

    optimizer.zero_grad()
    loss.backward()

    # Gradient clipping guards against rare spikes even with a stable
    # architecture.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
    return loss.item()


# ================================
# Inference: test-time compute scaling
# ================================

@torch.no_grad()
def answer_question(model, question_ids, difficulty='medium'):
    """
    Test-time scaling: adjust the loop count to the difficulty.
    Beware of overthinking: do not set it too high. The values below are
    illustrative and should be tuned on validation data.
    """
    loop_config = {
        'easy': 2,        # simple factual QA
        'medium': 6,      # general reasoning
        'hard': 12,       # multi-step math / compositional
        'very_hard': 15,  # deep reasoning (more loops risk overthinking)
    }
    num_loops = loop_config[difficulty]

    logits = model(question_ids, num_loops=num_loops)
    return logits.argmax(-1)
```

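Key implementation points

🎯 Summary

Negative diagonal parameterization (Parcae): A = -exp(log_A) + ZOH → $\rho(\bar{A}) < 1$

Dynamic recurrence (LTG): sample $r$ randomly during training instead of fixing it

Per-sequence depth sampling (Parcae): each sequence samples its own loop count

Injection signal: inject the prelude output on every loop to prevent latent drift

The heart of the loop: a single self.recurrent block, reused over and over

Test-time scaling: adjust the loop count to the difficulty, but stay below the overthinking threshold (around 15 loops in the example above)

Gradient clipping: even with a stable architecture, clipping remains a safety net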

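Open problems across the two papers

Taken together, the two papers leave several very interesting next steps:

Problem	How far Parcae goes	How far LTG goes	Still open
Adaptive halting	Not addressed (fixed loops)	Reveals overthinking but does not solve it	Token-level / sequence-level dynamic halting?
Interpretability	Black box	Probing confirms latent CoT	Sparse autoencoders across loops? Circuit analysis?
Scale up	1.3B	Synthetic only	Combine the two and scale to 70B+?
Real multi-hop	Not tested	Works on synthetic data	Does it work on real knowledge graphs (Wikidata, FreeBase)?
Combine with CoT	Not discussed	Not discussed	Hybrid implicit + explicit (Kahneman System 1 / 2)?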

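A bold speculation

🔮 RDT could become the new standard for reasoning LLMs
Today's thinking models such as o1 / o3 / DeepSeek-R1 all follow the explicit CoT route: emit large numbers of thinking tokens in exchange for quality. The marginal cost of that route is very high (every thinking token needs a full forward pass).

RDT offers implicit thinking instead: reasoning is completed in latent space by looping shared parameters. Each loop reruns only the shared block rather than decoding a new token, which should cost considerably less than generating one CoT token.

Now that the Parcae + LTG combination has arrived in 2026, the next generation of reasoning LLMs may mix:

Latent loops for "fast thinking" (Parcae-stable core)

Explicit CoT for "deliberate thinking" (when latent reasoning fails)

Adaptive halting to switch between the two modes (future work)

This is essentially a neural version of Kahneman's System 1 / System 2 dichotomy.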

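Summary: in this week of April 2026, RDT got a mature blueprint

Paper	What it unlocks
Parcae	✅ Training stability (no spikes); ✅ Scaling laws; ✅ Shown to work at 1.3B scale; ✅ 770M params match a 1.3B Transformer
Loop, Think, & Generalize	✅ First empirical evidence of implicit multi-hop reasoning; ✅ Depth extrapolation (5 → 10 hops); ✅ Latent CoT is observable; ✅ Dynamic recurrence is the training recipe; ⚠️ Reveals overthinking

One-sentence conclusion

🧠 Parcae turns RDT from a "lab toy" into an "industrially usable architecture".
Loop, Think, & Generalize turns RDT from an "architectural novelty" into a "model that actually reasons".

Two such papers landing in the same week suggests RDT is ready for prime time; the remaining step is for someone to actually scale it to a 70B+ frontier model.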

Related resources

This week's two main papers

📄 Parcae: Prairie et al., Scaling Laws For Stable Looped Language Models, arXiv:2604.12946
📄 LTG: Kohli et al., Loop, Think, & Generalize, arXiv:2604.07822
💻 LTG Code: OSU-NLP-Group/Loop-Think-Generalize
📊 Parcae Blog: Sandy Research Parcae
📰 Together AI official introduction: together.ai/blog/parcae

Historical background (suggested further reading)

📄 Universal Transformers (2018): Dehghani et al., arXiv:1807.03819
📄 Huginn-3.5B (2025): Geiping et al., arXiv:2502.05171
📄 Reasoning with Latent Thoughts (2025): Saunshi et al., arXiv:2502.17416
🤗 Huginn Model: huggingface.co/tomg-group-umd/huginn-0125
