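LLM Wiki vs RAG: How Did Karpathy Replace a Vector DB with 100 Markdown Files? The Paradigm Clash Between "Compiled" and "Interpreted" Knowledge

On April 3, 2026, Andrej Karpathy posted an "idea file" called LLM Wiki as a GitHub Gist. The title was modest, but AI Twitter blew up immediately. VentureBeat even ran a dramatic headline: "Karpathy bypasses RAG". But if you actually sit down and read the Gist to the end, you will find that he wrote in the closing section himself: "This document is intentionally abstract... pick what's useful, ignore what isn't."
So this post won't help you pick a side. It will help you see that LLM Wiki and RAG are not competing for the same slot at all. One is a compile-time knowledge object; the other is a query-time retrieval pipeline. Once you can draw that line, you know when to use which.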

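TL;DR

Key points:

🎯 Paradigm difference: RAG is interpreted knowledge (every query re-runs embed → retrieve → synthesize over raw chunks); LLM Wiki is compiled knowledge (the LLM already wrote the summary pages at ingest time, so a query just reads the finished result)
📚 Karpathy's three-layer layout: raw/ (source material) + wiki/ (markdown articles compiled by the LLM) + index.md (the table of contents for the whole wiki, which must fit in the context window)
⚖️ Stateful vs stateless: RAG is stateless and starts from zero every time; LLM Wiki is stateful, so knowledge compounds and the system gets "smarter" the more you use it
📐 Scale is the hard line: below roughly 50K–100K tokens of knowledge, LLM Wiki saves money and effort; beyond that, index.md itself stops fitting in context and a retrieval layer has to come back
❌ Three overblown claims: "100% recall, zero hallucination", "Markdown beats vector DBs", "bypass RAG". Karpathy made none of them
🧩 The real contribution: LLM Wiki tackles the 80-year-old question left behind by Vannevar Bush's 1945 Memex: "who does the maintenance?" The answer: an LLM does not get bored and give up on updating cross-references
🤝 Not either/or: the most common production setup is hybrid, with the wiki as the curated, high-confidence layer and RAG as the large-scale retrieval layer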

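Why is everyone suddenly talking about LLM Wiki?
What does traditional RAG actually do?
How does LLM Wiki work?
Compile-time vs query-time: the line that matters most
A worked example: one question, two systems
Head-to-head comparison table
How to build an LLM Wiki yourself
Three badly overblown claims
Hybrid: using both together
Decision framework: which one should you use?
Summary
Related resources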

Why is everyone suddenly talking about LLM Wiki? {#why-now}

In early April 2026, Karpathy wrote this on X:

"a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge."

In plain terms: more and more of the tokens he spends on LLMs go into organizing knowledge rather than writing code.

He then published the Gist (karpathy/442a6bf555914893e9891c11519de94f), describing how he uses an LLM to build a personal wiki with no vector DB and no embeddings, just markdown plus an index.md.

What followed was a solid week of coverage:

VentureBeat ran a headline saying he "bypasses RAG"
Mehul Gupta wrote "Bye Bye RAG" on Medium
Bailing Zhang wrote "LLM Wiki Isn't a Better RAG. It's a Different Kind of Object Entirely" on LinkedIn
Atlan, MindStudio, Letta, and Mem0 all published "wiki vs RAG" blog posts

But much of that coverage framed it as a "who wins" contest, which misses the point entirely. Karpathy's Gist says so right at the top: "This is an idea file... your agent will build out the specifics in collaboration with you." And it repeats the message at the end: "Everything mentioned above is optional and modular — pick what's useful, ignore what isn't."

In other words, what he proposed is not a spec but a paradigm. To understand that paradigm, first look at what RAG is actually doing.

What does traditional RAG actually do? {#what-is-rag}

If you have written LangChain or LlamaIndex code, or opened a Pinecone account, you have already used RAG. The pipeline looks roughly like this:

# Pseudocode: chunker, embedder, vector_db, llm and PROMPT are placeholders for your own stack

# Ingest time
for doc in documents:
    chunks = chunker.split(doc, chunk_size=512, overlap=64)
    for chunk in chunks:
        embedding = embedder.encode(chunk.text)  # e.g. text-embedding-3-small
        vector_db.add(id=chunk.id, vector=embedding, metadata={...})

# Query time
query_vec = embedder.encode(user_query)
top_k = vector_db.search(query_vec, k=8)         # cosine similarity
context = "\n\n".join([chunk.text for chunk in top_k])
answer = llm.generate(prompt=PROMPT.format(context=context, query=user_query))

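The four steps in summary:

Chunk: split documents into pieces, typically 256–1024 tokens each
Embed: turn each piece into a vector (768, 1536, and 3072 dimensions are all common)
Index: store the vectors in Pinecone / Weaviate / Chroma / pgvector
Retrieve + Generate: embed the query too, find the top-K most similar chunks, and put them in the LLM's context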
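
The four steps above can be made concrete with a toy, dependency-free sketch. Everything in it is an illustrative assumption rather than a real stack: a bag-of-words counter stands in for the embedding model, a plain Python list stands in for the vector DB, chunks are measured in words instead of tokens, and the function names (`chunk`, `embed`, `cosine`, `build_index`, `retrieve`) are hypothetical. The generate step is left out because it needs an LLM call.

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping windows of `size` words (toy stand-in for token chunking)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words counts, NOT a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Ingest time: chunk every document and store (chunk_text, vector) pairs."""
    return [(c, embed(c)) for doc in docs for c in chunk(doc)]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Query time: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

docs = [
    "CATL leads global LFP battery production and is expanding overseas plants",
    "Tesla is ramping its 4680 cell production at the Texas Gigafactory",
]
index = build_index(docs)
top = retrieve(index, "Who leads LFP battery production?", k=1)
print(top[0])
```

Note that nothing is written back after `retrieve` returns: the index is the same before and after the query, which is the statelessness discussed in the rest of this post.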

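🎯 Core characteristic
All of RAG's "intelligence" happens at query time. Embedding is done at ingest, but synthesis, multi-hop reasoning, and cross-chunk integration are all done on the spot by the LLM when the query arrives. Every query starts from zero: the reasoning done last time is not remembered this time.

Where RAG does well

Large corpora: millions of documents, enterprise-grade search
Dynamic content: documents change all the time, and newly added files are searchable right away
Heterogeneous sources: PDF, HTML, Notion, Slack, and Jira all thrown into the same index
Open-ended queries: you cannot predict what users will ask, and semantic search copes with ambiguity

RAG's blind spots

But RAG has several structural problems that no amount of prompt polishing will fix:

Chunking destroys structure: cut a 100-page policy document into 200 chunks of 512 tokens and the section headings, the row–column relationships in tables, and the order of sequential steps are all gone
Synthesis always starts from scratch: ask "which EV company will be most dominant?" across 50 papers and the LLM has to re-synthesize an answer on the spot from whatever chunks were retrieved, then do it all over again the next time you ask
No writable layer: good reasoning from a previous answer has nowhere to land. Chat history vanishes once the session closes
Embedding drift: upgrade or swap the embedding model and everything has to be re-indexed
Lost-in-the-middle: at top-K = 8, 10, or 20, the LLM's attention to the chunks in the middle collapses

These are exactly the problems LLM Wiki sets out to address.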

How does LLM Wiki work? {#what-is-llm-wiki}

The core of Karpathy's Gist is a three-layer folder structure:

my-llm-wiki/
├── raw/                    ← source material (PDFs, screenshots, blog dumps)
│   ├── tesla_2024_10K.pdf
│   ├── battery_review.md
│   └── ev_market_report.html
│
├── wiki/                   ← summary pages compiled by the LLM
│   ├── tesla.md
│   ├── battery_technology.md
│   ├── ev_market_trends.md
│   └── future_battery_leaders.md
│
└── index.md                ← master map of every wiki article (must fit in the context window)

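Workflow

Karpathy describes this as a compile + query cycle. The key steps:

Ingest: you drop a new source (paper, blog post, meeting notes) into raw/
Compile: you ask the agent to run a compile pass, in which the LLM (a) reads the new source in full, (b) extracts the core concepts, (c) writes or updates the corresponding markdown pages in wiki/, (d) updates cross-references, and (e) adds entries to index.md
Lint / Health Check: periodically run a pass that looks for stale, contradictory, or orphaned pages
Query: when you ask a question, the LLM reads index.md first, loads a few relevant wiki pages into context, and then synthesizes the answer
Compound: the answer itself can be written back as a new wiki page (for example, "Future Leaders in EV Batteries")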

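The key design constraint: `index.md` must fit in context

This is the linchpin of the whole architecture. The sweet spot Karpathy describes is ~100 articles, ~400,000 words. For long-context models (Claude 3.5 Sonnet at 200K tokens, GPT-4o at 128K, Gemini 1.5 Pro at 1M or more), an index of that size takes up around 1% of the context or less, so it fits comfortably.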

<!-- index.md -->
# My LLM Wiki Index

## EV / Battery
- **tesla.md** : Tesla company history, product line, battery strategy
- **battery_technology.md** : lithium-ion, solid-state, LFP, emerging chemistries
- **ev_market_trends.md** : global EV adoption, policy, charging infrastructure
- **future_battery_leaders.md** : derived page, five potential winners as of 2026

## AI / Inference
- **kv_cache.md** : KV cache mechanics, quantization, TurboQuant
- **rdt.md** : Recurrent-Depth Transformers, latent reasoning

As soon as the LLM sees this index, it knows which pages are available to choose from, with no cosine search required.

💡 Core insight
The LLM does not retrieve knowledge; it navigates it. Your index provides human-readable structure, not a high-dimensional embedding. The LLM picks pages directly using its language understanding instead of hoping for a cosine-similarity hit.
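
To make the "index must fit in context" constraint checkable, here is a minimal sketch that parses index entries in the `- **file.md** <separator> description` shape shown above and estimates how much of a context window the index would occupy. The function names, the accepted separators, and the 4-characters-per-token heuristic are all assumptions for illustration; a real tokenizer will give different numbers, especially for non-English text.

```python
import re

# Accepts a dash-like separator or a colon between the filename and its description.
ENTRY = re.compile(r"^- \*\*(?P<file>[^*]+\.md)\*\*\s*(?:\u2014|-|:)\s*(?P<desc>.+)$")

def parse_index(index_text: str) -> dict[str, str]:
    """Map each listed filename to its one-line description; headings and blank lines are skipped."""
    entries = {}
    for line in index_text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            entries[m.group("file")] = m.group("desc").strip()
    return entries

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough size estimate (assumed heuristic, not a tokenizer)."""
    return int(len(text) / chars_per_token)

def index_share(index_text: str, context_window: int = 200_000) -> float:
    """Fraction of the context window the index would occupy."""
    return estimate_tokens(index_text) / context_window

sample = """# My LLM Wiki Index

## EV / Battery
- **tesla.md** : company history, product line, battery strategy
- **battery_technology.md** : lithium-ion, solid-state, LFP

## AI / Inference
- **kv_cache.md** : KV cache mechanics, quantization
"""
entries = parse_index(sample)
share = index_share(sample)
print(list(entries), f"{share:.4%}")
```

Tracking `index_share` over time gives an early warning before the index outgrows the window and a retrieval layer becomes necessary.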

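Compile-time vs query-time: the line that matters most {#compile-vs-query}

Bailing Zhang's LinkedIn post puts it most clearly: this is not a difference in tooling but a paradigm shift from interpreted to compiled.

Anyone who has written code knows the distinction. Python and JavaScript are, roughly speaking, interpreted: the source is processed again on every run. C++ and Rust are compiled: parsing and type-checking were finished long ago, and at run time you simply execute an already-digested binary.

Now swap "program" for "knowledge base":

Dimension	RAG (Interpreted)	LLM Wiki (Compiled)
When reasoning happens	Redone on every query	Done once at ingest
State	Stateless (from zero each time)	Stateful (knowledge compounds)
Where synthesis happens	Query time, on the spot	Ingest time, written into wiki pages
What is stored	Raw chunks + embeddings	Pre-digested markdown articles
Error mode	Any query may retrieve the wrong chunks	One mistake affects all downstream queries
Cost trajectory	High per query, amortized at large scale	Expensive at ingest, cheap at query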
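
The cost row can be turned into back-of-envelope arithmetic. The sketch below is a simplified model under stated assumptions, not a benchmark: it counts only LLM tokens, ignores embedding and infrastructure costs, and every number in the usage lines is hypothetical. The function name `break_even_queries` is also made up for this example.

```python
import math
from typing import Optional

def break_even_queries(compile_tokens: int, wiki_query_tokens: int, rag_query_tokens: int) -> Optional[int]:
    """Number of queries after which the one-off compile cost has been paid back.

    Returns None when a wiki query is not cheaper than a RAG query, because in
    that case the compile cost is never recovered on token count alone.
    """
    saving_per_query = rag_query_tokens - wiki_query_tokens
    if saving_per_query <= 0:
        return None
    return math.ceil(compile_tokens / saving_per_query)

# Hypothetical numbers: 500K tokens spent compiling, 9K tokens per wiki query,
# 12K tokens per RAG query (for example a large top-K plus reranking prompts).
print(break_even_queries(500_000, 9_000, 12_000))

# If RAG only loads ~4K tokens per query, the wiki never wins on tokens alone.
print(break_even_queries(500_000, 9_000, 4_000))
```

The second case is a useful sanity check: when per-query token cost favors RAG, the argument for a compiled wiki has to rest on answer quality and compounding rather than on cost.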

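The power of stateful: knowledge compounds

This is the point people overlook most easily.

In RAG, every query is a stateless event: once you have asked, the answer sits in chat history, but the knowledge base itself has not changed. The next time the same question comes in, the LLM has to piece the answer together from raw chunks all over again.

LLM Wiki is different. Each time:

A new source comes in → the wiki gains or corrects some pages
You finish querying a complex question → the answer settles into the wiki as a page
A new paper comes out a few days later → the next compile pass updates the affected pages and adds the new cross-references

After a month of use, your wiki is no longer at the starting line. It sits on top of a small hill: an organized understanding that keeps getting denser and more accurate. RAG structurally cannot do this, because it has no writable layer.

In Mehul Gupta's analogy:

RAG: Search → Answer → Reset
LLM Wiki: Read → Organize → Link → Improve → Reuse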
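
The "Reuse" end of that loop is just a file write. As a minimal sketch of the write-back step, assuming the `wiki/` plus `index.md` layout described earlier: the function name `file_answer`, the two frontmatter fields, and the choice to append new entries at the end of the index are all illustrative assumptions, and a real setup would more likely let the LLM place the entry under the right heading.

```python
import tempfile
from datetime import date
from pathlib import Path

def file_answer(root: Path, filename: str, title: str, body: str, summary: str) -> Path:
    """Persist an answer as wiki/<filename> and register it in index.md if it is not listed yet."""
    page = root / "wiki" / filename
    page.parent.mkdir(parents=True, exist_ok=True)
    page.write_text(
        f"---\ntitle: {title}\nlast_updated: {date.today().isoformat()}\n---\n\n# {title}\n\n{body}\n",
        encoding="utf-8",
    )
    index = root / "index.md"
    current = index.read_text(encoding="utf-8") if index.exists() else "# Wiki Index\n"
    if f"**{filename}**" not in current:
        # Append a new entry; an existing entry is left untouched.
        current = current.rstrip("\n") + f"\n- **{filename}** : {summary}\n"
        index.write_text(current, encoding="utf-8")
    return page

# Demo in a throwaway directory.
root = Path(tempfile.mkdtemp())
file_answer(root, "future_battery_leaders.md", "Future Leaders in EV Batteries",
            "Draft analysis goes here.", "derived page, candidate winners")
# A second call overwrites the page but does not duplicate the index entry.
file_answer(root, "future_battery_leaders.md", "Future Leaders in EV Batteries",
            "Revised analysis.", "derived page, candidate winners")
index_text = (root / "index.md").read_text(encoding="utf-8")
page_text = (root / "wiki" / "future_battery_leaders.md").read_text(encoding="utf-8")
print(index_text)
```

Because the result is plain files, committing after each write-back gives you a reviewable diff of everything the LLM has added.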

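A worked example: one question, two systems {#worked-example}

Suppose you are researching EVs and batteries, have dropped 50 papers and blog posts into the system, and ask:

"Which company is most likely to dominate the EV battery market in 2026–2030?"

How RAG answers

# Pseudocode: embedder, vector_db and llm are placeholders
query = "Which company will dominate EV battery market 2026-2030?"
query_vec = embedder.encode(query)
top_8 = vector_db.search(query_vec, k=8)

# top_8 would include:
# - the passage on 4680 cells from Tesla's 2024 10-K
# - a passage from a CATL Q3 earnings call
# - a chunk from some Substack post on solid-state batteries
# - a forecast table fragment from a BloombergNEF report
# - two near-duplicate chunks about LFP being the cheap lithium chemistry (noise)
# - ...

answer = llm.generate(context=top_8, query=query)

The result will typically be:

✅ Accurate citations (because the original chunks are in context)
❌ But fragmented: CATL and BYD live in different chunks, and the LLM struggles to cross-reference them
❌ Weak multi-hop reasoning: a three-hop inference such as "LFP cost curve × China's EV subsidy phase-out × the US IRA tax credit" tends to fall apart
❌ Ask a similar question next time and the same work is done all over again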

How LLM Wiki answers

Your wiki already contains:

wiki/tesla.md: covers Tesla's 4680 cells, Gigafactories, and battery strategy (already compiled from 5 raw sources)
wiki/catl.md: covers CATL's LFP dominance, overseas expansion, and technology roadmap
wiki/byd.md: Blade battery, vertical integration, Southeast Asia / Latin America strategy
wiki/solid_state_battery.md: progress at Toyota / Samsung SDI / QuantumScape
wiki/ev_battery_market_dynamics.md: cost curves, policy, demand forecasts

# Step 1: the LLM reads index.md
# Step 2: the LLM picks 5 wiki pages itself and loads them all into context (perhaps 8K tokens in total)
# Step 3: synthesize

The result will typically be:

✅ Coherent narrative: every wiki page is already pre-digested
✅ Strong multi-hop reasoning: cross-references were built at ingest time
✅ Easy to persist: write this analysis back as wiki/future_battery_leaders.md and load it directly next time
⚠️ However: if the LLM's understanding was skewed when it compiled wiki/catl.md, that bias shows up in every later answer

⚠️ An important trade-off
RAG re-reads the original text every time, so its errors are transient. LLM Wiki bakes the LLM's interpretation into a markdown page, so its errors are persistent and can even amplify. That is why a lint pass and human-in-the-loop review (for example, auditing in Obsidian) are not optional but mandatory.

Head-to-head comparison table {#head-to-head}

Dimension	LLM Wiki	Traditional RAG
Setup complexity	✅ Low: write markdown + one system prompt	⚠️ High: chunking, embedder, vector DB, retrieval tuning
Infrastructure	✅ Zero: filesystem + Git	⚠️ Vector DB + embedding pipeline
Best knowledge size	~50K–100K tokens (≈ 100–200 articles)	✅ Millions of documents
Retrieval method	Structural / intent-based	Semantic similarity (cosine)
Retrieval reliability	✅ High once routed (whole pages are loaded into context; routing can still pick the wrong page)	⚠️ Variable: depends on chunking and embedding quality
Update workflow	✅ Edit a markdown file (Git diff)	⚠️ Re-chunk, re-embed, re-index
Token cost / query	Fixed (a whole wiki section)	Variable (top-K chunks)
Latency	✅ Low retrieval overhead (file reads)	⚠️ Higher (embed + search + rerank)
Source attribution	⚠️ Harder (synthesized away)	✅ Native (chunk URL)
Compounding	✅ Knowledge gains value with every interaction	❌ Stateless, redone every time
Multi-user / concurrency	❌ Race conditions, write conflicts	✅ Concurrent-safe by design
Access control	❌ File system permissions only	⚠️ Depends on the retrieval layer design
Failure mode	Baked-in compilation hallucinations, context overflow	Bad chunking, embedding drift, lost-in-the-middle
Best for	Personal research, bounded domains, stable corpora	Enterprise scale, dynamic content, heterogeneous sources

How to build an LLM Wiki yourself {#how-to-build}

If you want to try it, the bare minimum is about 50 lines of Python, a wiki/ folder, and an index.md.

Step 1: Structure your markdown

Each file covers one coherent topic, the filename should be descriptive (cancellation-policy.md, not doc-14.md), and the top of the file carries frontmatter:

---
title: KV Cache Compression
category: inference-optimization
last_updated: 2026-04-28
related:
  - turboquant.md
  - flash_attention.md
  - quantization.md
---

# KV Cache Compression

## Background
...

## Mainstream approaches
...

## Why TurboQuant works
See [[turboquant.md]] for full math derivation.
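
Later steps (linting, index generation) need to read this frontmatter back. Below is a small sketch of a parser that handles only the subset used in the example above, namely `key: value` pairs and indented `- item` lists. It is deliberately not a YAML parser, and `parse_frontmatter` is a hypothetical helper name; for anything richer, a real YAML library is the safer choice.

```python
def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a page into (metadata, body).

    Supports `key: value` lines and `- item` lists under a key. If the page has
    no opening or closing `---` delimiter, the whole text is returned as body.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict = {}
    key = None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the body.
            return meta, "\n".join(lines[i + 1:]).lstrip("\n")
        if line.lstrip().startswith("- ") and key is not None:
            # List item belonging to the most recent key.
            if not isinstance(meta[key], list):
                meta[key] = []
            meta[key].append(line.lstrip()[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            meta[key] = value.strip()
    return {}, text  # no closing delimiter found

page = """---
title: KV Cache Compression
category: inference-optimization
last_updated: 2026-04-28
related:
  - turboquant.md
  - quantization.md
---

# KV Cache Compression
"""
meta, body = parse_frontmatter(page)
print(meta["title"], meta["related"])
```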

Step 2: Write an `index.md`

# Wiki Index

## Inference Optimization
- **kv_cache.md** : KV cache mechanics, overview of compression methods
- **turboquant.md** : Google 3-bit quantization, near-optimal
- **flash_attention.md** : IO-aware attention, FA1-4
- **quantization.md** : INT8 / INT4 / FP8 / 1-bit overview

## Architecture
- **rdt.md** : Recurrent-Depth Transformers
- **moe.md** : Mixture-of-Experts (DeepSeek V3/V4)
- **mamba.md** : State-Space Models

Step 3: A minimal agent loop

import json
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-20241022"

def load_index() -> str:
    # index.md lives at the repo root, next to wiki/ (see the folder layout above)
    return Path("index.md").read_text(encoding="utf-8")

def load_pages(filenames):
    # Concatenate the chosen pages; names that do not exist under wiki/ are skipped
    return "\n\n".join(
        f"## File: {fn}\n\n{(Path('wiki') / fn).read_text(encoding='utf-8')}"
        for fn in filenames
        if (Path("wiki") / fn).is_file()
    )

def ask(query: str):
    # Step one: hand over the index and ask the LLM which files to open
    routing = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Wiki index:

{load_index()}

User query: {query}

Return ONLY a JSON list of filenames (max 4) you'd open to answer.
Format: ["a.md", "b.md"]"""
        }]
    )
    # Parse as JSON instead of eval(): never execute model output.
    # Demo only; production code should also validate the result.
    files = json.loads(routing.content[0].text)

    # Step two: load those pages into context and answer for real
    answer = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Wiki pages:

{load_pages(files)}

Query: {query}

Answer the query using the pages above. Cite filenames in [[...]] format."""
        }]
    )
    return answer.content[0].text, files

answer, files = ask("How does TurboQuant differ from INT4 quantization?")
print(files)
print(answer)

Roughly 50 lines, with no Pinecone, no embeddings, and no chunking strategy.
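
The routing reply is model output, so it is worth parsing defensively before it touches the filesystem. The sketch below shows one way to do that; `parse_routing` and its `allowed` parameter (the set of filenames that actually appear in the index) are hypothetical names introduced here, not part of Karpathy's Gist or of any SDK.

```python
import json
import re

def parse_routing(reply: str, allowed: set[str], max_files: int = 4) -> list[str]:
    """Extract a JSON list of filenames from a model reply and keep only known pages.

    Tolerates extra prose or code fences around the list, drops duplicates and
    anything not in `allowed` (which also blocks paths like "../secrets.md").
    """
    match = re.search(r"\[.*?\]", reply, flags=re.DOTALL)
    if not match:
        return []
    try:
        candidates = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(candidates, list):
        return []
    picked: list[str] = []
    for name in candidates:
        if isinstance(name, str) and name in allowed and name not in picked:
            picked.append(name)
    return picked[:max_files]

allowed = {"kv_cache.md", "turboquant.md", "quantization.md"}
reply = 'Sure! ["kv_cache.md", "../secrets.md", "kv_cache.md", "turboquant.md"]'
print(parse_routing(reply, allowed))
print(parse_routing("I am not sure which pages apply.", allowed))
```

An empty list is a meaningful result here: it signals that the wiki has nothing relevant, which is the natural point to fall back to another layer or to tell the user.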

Step 4: Compile / lint pass

When a new source lands in raw/:

def compile_source(raw_path: str):
    # Assumes a plain-text source; PDFs and HTML need text extraction first
    with open(raw_path, encoding="utf-8") as f:
        raw = f.read()
    index = load_index()

    instruction = f"""You are a knowledge compiler. Below is a new source and the current wiki index.

New source ({raw_path}):
{raw}

Current index:
{index}

Do the following:
1. Identify which existing wiki pages should be updated.
2. For each, output the FULL new markdown (with frontmatter).
3. Identify if any NEW page should be created. Output its full markdown.
4. Update index.md to include any new page.
5. Add cross-references using [[other_page.md]] syntax.
"""
    # ... send `instruction` to the LLM, parse its output, and write the pages back into wiki/
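
The elided last step, turning the model's reply into files, depends on an output format the prompt would have to request. As one possible sketch, assume the prompt additionally asks the model to precede each file with a line of the form `=== FILE: name.md ===`. That delimiter, and the helper names `split_compiled_output` and `write_pages`, are assumptions made for this example only.

```python
import re
import tempfile
from pathlib import Path

# Filenames are restricted to word characters and hyphens, so a reply cannot
# smuggle in path separators.
FILE_HEADER = re.compile(r"^=== FILE: (?P<name>[\w\-]+\.md) ===$", re.MULTILINE)

def split_compiled_output(output: str) -> dict[str, str]:
    """Split a delimited reply into {filename: markdown}. Text before the first header is ignored."""
    parts = FILE_HEADER.split(output)
    # re.split with one capturing group yields [preamble, name1, body1, name2, body2, ...]
    return {name: body.strip() + "\n" for name, body in zip(parts[1::2], parts[2::2])}

def write_pages(pages: dict[str, str], root: Path) -> list[Path]:
    """Write index.md to the root and every other page under wiki/."""
    written = []
    for name, content in pages.items():
        target = root / name if name == "index.md" else root / "wiki" / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written

fake_output = """Here are the updates.
=== FILE: catl.md ===
# CATL
LFP notes.
=== FILE: index.md ===
# Wiki Index
- **catl.md** : CATL overview
"""
pages = split_compiled_output(fake_output)
root = Path(tempfile.mkdtemp())
written = write_pages(pages, root)
print(sorted(p.name for p in written))
```

Writing into a Git working tree and reviewing the diff before committing keeps a human in the loop for every compile pass.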

The lint pass looks roughly like this:

def lint():
    """Find stale / contradictory / orphan pages"""
    # 1. Check each page's last_updated (flag anything older than 3 months)
    # 2. Hand random pairs of pages to the LLM to compare, looking for contradictions
    # 3. Find orphan pages that have no backlinks
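
Two of those three checks need no LLM at all. The sketch below implements the stale and orphan checks, assuming pages carry a `last_updated: YYYY-MM-DD` frontmatter line and link to each other with `[[name.md]]`, as in the examples earlier. The function names and the 90-day default are illustrative assumptions, and links from index.md are deliberately not counted as backlinks. The contradiction check is omitted because it requires a model call.

```python
import re
import tempfile
from datetime import date, timedelta
from pathlib import Path

LINK = re.compile(r"\[\[([^\]]+\.md)\]\]")
UPDATED = re.compile(r"^last_updated:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)

def find_orphans(wiki_dir: Path) -> set[str]:
    """Pages that no other page links to via [[name.md]] (self-links do not count)."""
    pages = {p.name: p.read_text(encoding="utf-8") for p in wiki_dir.glob("*.md")}
    linked: set[str] = set()
    for name, text in pages.items():
        linked |= {target for target in LINK.findall(text) if target != name}
    return set(pages) - linked

def find_stale(wiki_dir: Path, today: date, max_age_days: int = 90) -> set[str]:
    """Pages whose last_updated is older than max_age_days, or missing entirely."""
    stale = set()
    for p in wiki_dir.glob("*.md"):
        m = UPDATED.search(p.read_text(encoding="utf-8"))
        if m is None or today - date.fromisoformat(m.group(1)) > timedelta(days=max_age_days):
            stale.add(p.name)
    return stale

# Demo wiki in a throwaway directory.
wiki = Path(tempfile.mkdtemp())
(wiki / "kv_cache.md").write_text("---\nlast_updated: 2026-04-28\n---\nSee [[turboquant.md]].\n", encoding="utf-8")
(wiki / "turboquant.md").write_text("---\nlast_updated: 2025-11-01\n---\nSee [[kv_cache.md]].\n", encoding="utf-8")
(wiki / "mamba.md").write_text("---\nlast_updated: 2026-04-01\n---\nNo links here.\n", encoding="utf-8")

orphans = find_orphans(wiki)
stale = find_stale(wiki, today=date(2026, 5, 1))
print("orphans:", orphans, "stale:", stale)
```

Passing `today` explicitly keeps the check deterministic, which makes it easy to run in CI alongside the Git history of the wiki.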

Karpathy himself audits the wiki through Obsidian's graph view, and the vault can be pushed straight to Git.

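Three badly overblown claims {#overclaims}

Bailing Zhang's LinkedIn post has a passage that I find refreshingly blunt and worth reproducing here:

Overclaim #1: "100% recall, zero hallucination"

I have seen at least four summaries, in Chinese and English, that say this. It is not true.

Stuffing 400K words into a context window does not equal perfect recall. You will hit lost-in-the-middle: the LLM attends to the start and end of its context far more than to the middle
The compile step is itself a generative act, so the LLM can hallucinate while writing a wiki page, and those hallucinations are then amplified across every downstream query
That is why Karpathy himself insists on linting and human-in-the-loop review

Overclaim #2: "Markdown beats vector DBs"

This one is a category error.

The problem a vector DB solves: searching millions of documents at enterprise scale
The problem a markdown wiki solves: letting one person or a small team maintain a compounding knowledge object without it falling apart

The two operate at different scales, serve different users, and fail in different ways. Asking which one wins is like asking whether a bicycle beats a container ship.

Overclaim #3: "Bypass RAG"

This framing comes from VentureBeat's headline, not from Karpathy's own words.

Karpathy is clear in the Gist that the pattern is scoped to an individual researcher with a bounded, stable corpus of ~100–200 articles. Past that scale, index.md itself no longer fits in the context window and a retrieval layer has to come back. He even suggests pairing it with hybrid BM25 + vector search.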

Hybrid: using both together {#hybrid}

The most common pattern for production agents is not either/or but a layered architecture.

Division of labor

Layer	Responsible for	Examples
LLM Wiki	"What we know for sure": concepts, definitions, internal frameworks, stable policies	Company brand guidelines, engineering principles, product specs, top-50 customer FAQs
RAG	"What is in the corpus right now": real-time, broad, long tail	Historical tickets, full-text research papers, Slack conversations, PR diffs
Agent Memory	"Who this user is": preferences, past sessions	Mem0, Letta, MemPalace

Vishal Mysore puts it well on Medium:

"LLM Wiki ← Domain knowledge, compiled at ingest time
Agent Memory ← User knowledge, written at conversation time, read at query time
RAG ← Document retrieval, stateless by default, stateful by design"

The three axes are orthogonal; they do not replace one another.
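
As a rough illustration of the layering, here is a minimal sketch of a wiki-first router: it asks the curated layer first and only falls back to corpus retrieval when no wiki page applies. The function `hybrid_answer`, the keyword-based stub router, and the fake search function are all hypothetical stand-ins; in a real system the router would be the index-reading LLM call and the fallback would be an actual retrieval pipeline.

```python
from typing import Callable

def hybrid_answer(
    query: str,
    route_wiki: Callable[[str], list[str]],
    search_corpus: Callable[[str], list[str]],
) -> dict:
    """Prefer curated wiki pages; fall back to corpus retrieval when routing finds nothing."""
    pages = route_wiki(query)
    if pages:
        return {"layer": "wiki", "context": pages}
    return {"layer": "rag", "context": search_corpus(query)}

def keyword_router(query: str) -> list[str]:
    """Stub for the index-based routing step: match topic keywords to page names."""
    topics = {"refund": "refund_policy.md", "brand": "brand_guidelines.md"}
    return [page for word, page in topics.items() if word in query.lower()]

def fake_search(query: str) -> list[str]:
    """Stub for a retrieval pipeline: returns a placeholder chunk."""
    return [f"chunk matching: {query}"]

curated = hybrid_answer("What is our refund window?", keyword_router, fake_search)
long_tail = hybrid_answer("What did the customer say in last week's ticket?", keyword_router, fake_search)
print(curated["layer"], long_tail["layer"])
```

Injecting the two layers as callables keeps the routing policy testable without any model or vector DB behind it.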

Decision framework: which one should you use? {#decision}

Decide with six questions:

How big is the knowledge base?
- < 100 structured documents → LLM Wiki
- 100–1,000 → either works, leaning LLM Wiki
- 1,000+ → RAG
How stable is the content?
- Monthly / quarterly updates → LLM Wiki
- Multiple updates per day → RAG (or hybrid)
How structured does it need to be?
- Tables, procedures, policies, FAQs → LLM Wiki
- Long-form prose, transcripts, research → RAG
How precise must retrieval be?
- Exact answers required (legal, compliance, pricing) → LLM Wiki
- "A semantic match is good enough" → RAG
Engineering capacity?
- Solo or small team → LLM Wiki
- Dedicated ML engineering team → either is feasible
Need Git audit / version control?
- Yes → LLM Wiki (every LLM edit is a commit, so you can review the diff)
- Not important → either works
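
The six questions can be encoded as a small function, which is handy if you want to sanity-check a project profile. The thresholds come from the list above, but how the answers are combined is my own assumption: 1,000+ documents is treated as decisive for RAG, the remaining five questions each cast one vote, and a weak wiki majority falls back to "hybrid". Treat it as a conversation starter, not a rule.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    num_docs: int
    updates_per_day: float      # average content updates per day
    structured: bool            # tables, procedures, policies, FAQs
    needs_exact_answers: bool   # legal, compliance, pricing
    has_ml_team: bool           # dedicated ML engineering capacity
    needs_git_audit: bool       # reviewable diffs / version control

def recommend(p: Profile) -> str:
    """Return 'rag', 'llm-wiki', or 'hybrid' using the assumed voting scheme described above."""
    if p.num_docs >= 1000:
        return "rag"
    wiki_votes = sum([
        p.updates_per_day < 1,
        p.structured,
        p.needs_exact_answers,
        not p.has_ml_team,
        p.needs_git_audit,
    ])
    return "llm-wiki" if wiki_votes >= 3 else "hybrid"

solo_researcher = Profile(80, 0.1, True, True, False, True)
enterprise_search = Profile(50_000, 20, False, False, True, False)
mid_size_team = Profile(300, 5, False, False, True, False)
print(recommend(solo_researcher), recommend(enterprise_search), recommend(mid_size_team))
```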

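💡 Personal recommendation
If you are an independent researcher, doing personal knowledge management, or building a vertical agent (for example medical, legal, or internal company SOPs), start with LLM Wiki. Setup takes an afternoon, and within two weeks you will feel the compounding effect.
Once your wiki passes ~150 pages, index.md starts to outgrow the context window, or a whole team needs to write at the same time, that is the moment to think about migrating to hybrid.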

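Summary {#conclusion}

Key takeaways

Not a replacement: LLM Wiki is not "a better RAG" but a different kind of thing altogether. RAG solves retrieval at scale; LLM Wiki solves maintainable knowledge compounding
Compile vs interpret: this is the line that matters most. RAG re-parses on every query; LLM Wiki finished parsing long ago and reads the binary at query time
The power of stateful: knowledge compounds and the system improves with use, which a stateless system structurally cannot do
Scale is the hard line: below roughly 50K–100K tokens LLM Wiki works very well; beyond that the index no longer fits and retrieval is required
Compilation hallucination: the main structural risk, which is why a lint pass and human-in-the-loop review are not optional
Memex's 80-year-old problem: the Memex Vannevar Bush described in 1945 foundered on the question "who maintains it?" An LLM does not get bored and give up on updating cross-references, and that is LLM Wiki's real contribution
Production = hybrid: Wiki + RAG + Agent Memory are three complementary axes, not substitutes for one another

Next steps

If you want to try it:

Create a new folder with raw/ + wiki/ + index.md
Drop 5–10 blog posts or papers you have been reading into raw/
Write a 50-line agent loop (template above)
Run it for a week, dropping each new item into raw/ and running a compile pass
Come back to index.md after two weeks. If your wiki really is compounding, you will see a mini brain that you built yourself

If it works well, you will understand why Karpathy says his recent tokens are going into "manipulating knowledge". Once this has worked for you, it is hard to go back to raw RAG.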

