RTMO: How to Fit Coordinate Classification into YOLO for One-Stage, Real-Time Multi-Person Pose Estimation

Paper: Tsinghua SIGS / Shanghai AI Laboratory / Nanyang Technological University
arXiv: 2312.07526 (CVPR 2024)
Code: github.com/open-mmlab/mmpose/tree/main/projects/rtmo
Same family: RTMPose (the top-down sibling)

TL;DR

The previous blog series followed Person ReID from OSNet to DINOv2, which is human-centric vision at the identity level. This time we move to the keypoint level: multi-person pose estimation.

RTMO was presented at CVPR 2024 and asks a simple question: top-down is too slow, bottom-up and one-stage are not accurate enough, so can we get both?

Key points:

🎯 One-stage + Coordinate Classification: the first time dual 1-D heatmap classification (the SimCC / RTMPose trick) is fitted into YOLO-based dense prediction
🧩 Dynamic Coordinate Classifier (DCC): Dynamic Bin Allocation (bins follow the bbox size) + Dynamic Bin Encoding (sine PE + FC generate a per-bin representation)
📐 MLE Loss with Learnable Variance: Maximum Likelihood Estimation replaces KLD, and each sample learns its own uncertainty, with large variance for hard samples and small variance for easy ones
⚡ Real-time, with no slowdown as people are added: RTMO-l reaches 74.8% AP on COCO val2017 at 141 FPS (V100), and scenes with 10 or more people add only 0.1 ms of latency
🏆 CrowdPose SOTA: RTMO-l with extra data reaches 83.8% AP on CrowdPose, far ahead of ED-Pose (Swin-L, 218M params)

Background: The Three Routes to Multi-Person Pose Estimation

Before RTMO, real-time multi-person pose estimation (MPPE) fell into three camps:

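Paradigm	Accuracy	Latency vs. number of people	Pipeline complexity
Top-down	⭐⭐⭐⭐⭐	Linear in N (more people, slower)	Needs detector + cropper + pose net
Bottom-up	⭐⭐⭐	Constant, but grouping is slow	Needs keypoint grouping (PAF / heatmap matching)
One-stage	⭐⭐⭐ (before RTMO)	Constant	A single forward pass

💡 The theoretical advantage of one-stage methods is clear: latency does not depend on the number of people, and the pipeline is simple. Before RTMO, however, the real-time one-stage methods (YOLO-Pose, KAPAO, YOLOX-Pose) all used a fully connected layer to regress keypoint coordinates directly. That is equivalent to assuming the keypoint location follows a Dirac delta distribution, which ignores the inherent uncertainty of the annotation. The result: fast, but with AP that kept trailing top-down methods.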

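Coordinate Classification: The Secret Weapon of Top-Down

SimCC (2022) and RTMPose (2023) both use a technique called the dual 1-D heatmap, which works extremely well in the top-down setting:

Split the x axis into $B_x$ bins and the y axis into $B_y$ bins
Each keypoint outputs two 1-D probability heatmaps (one for x, one for y)
Sub-pixel bins give high spatial resolution without a high-res 2D heatmap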
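The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the SimCC or RTMPose implementation: the helper names, the toy logits, and the bin-center convention `(index + 0.5) * bin_width` are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_axis(logits, axis_length):
    """Pick the most likely bin and map its center back to pixel units."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    bin_width = axis_length / len(probs)   # below 1 px when bins outnumber pixels
    return (best + 0.5) * bin_width, probs[best]

# Toy example: a 64 px wide crop split into 128 bins (0.5 px per bin).
logits_x = [-abs(i - 40) for i in range(128)]   # 1-D heatmap peaking at bin 40
logits_y = [-abs(i - 90) for i in range(128)]   # 1-D heatmap peaking at bin 90
x, conf_x = decode_axis(logits_x, axis_length=64.0)
y, conf_y = decode_axis(logits_y, axis_length=64.0)
print(x, y)
```

Two 1-D softmaxes cost `B_x + B_y` outputs per keypoint, whereas a 2D heatmap at the same resolution would need `B_x * B_y`.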

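Here is the catch: moving this technique straight into one-stage (dense prediction) breaks in three ways.

Bin wastage: bins are spread across the whole image, but each person occupies only a few grids, so most bins are useless for any given instance
DFL [21] does not fully fix this: bins sit in a fixed range around the anchor, so large people overflow the range and small people suffer heavy quantization error
KLD loss treats every sample alike: in dense prediction the difficulty of each grid varies widely (position, size, pose), and KLD cannot tell them apart

🎯 RTMO's thesis: if these three incompatibilities can be resolved, the accuracy of coordinate classification can be carried over to the speed of one-stage inference, giving "top-down accuracy at one-stage speed".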

RTMO Architecture Overview

RTMO keeps YOLO's dense grid prediction overall, but the keypoint branch of the head adds a Dynamic Coordinate Classifier.

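Three Key Design Decisions

Decision	Choice	Reason
Backbone	CSPDarknet (the YOLOX one)	Aligns with the YOLO ecosystem and eases deployment
Neck	Hybrid Encoder (RT-DETR)	Combines self-attention (global) with FPN (local)
Feature Levels	P4 + P5 only, P3 dropped	P3 consumes 78.5% of head FLOPs but contributes only 10.7% of correct detections

🔑 The P3 story: the usual feature-pyramid intuition is that shallow features catch small people, but RTMO's ablation (Table 4) finds that P4 + P5 already cover small people, so the FLOPs spent on P3 are largely wasted. Going from 3 feature levels to 2 cuts RTMO-l's CPU latency from 186 ms to 125 ms while AP drops by only 0.2%. Lesson: more pyramid levels are not automatically better.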
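A rough way to see why P3 dominates the head cost is to count dense-prediction cells per level. The sketch below assumes a 640×640 input and the usual strides of 8, 16 and 32 for P3, P4 and P5; the cell share is only a proxy and is not the paper's FLOPs measurement.

```python
def grid_counts(input_size, strides):
    """Number of dense-prediction grid cells per feature level."""
    return {s: (input_size // s) ** 2 for s in strides}

counts = grid_counts(640, strides=(8, 16, 32))   # P3, P4, P5
total = sum(counts.values())
p3_share = counts[8] / total                     # fraction of all cells that sit on P3
print(counts, round(p3_share, 3))
```

Under these assumptions P3 holds roughly three quarters of all cells, which is in the same range as the 78.5% head-FLOPs figure quoted above.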

Core Innovation 1: Dynamic Coordinate Classifier (DCC)

DCC is the technical core of RTMO. It adapts SimCC's dual 1-D heatmap to dense prediction:

A. Dynamic Bin Allocation (DBA)

Traditional approach (SimCC / RTMPose): bins cover the whole input image. In top-down the image is already cropped, so this is fine.

DFL approach: bins sit in a predefined range around the anchor. Large people do not fit, and small people suffer heavy quantization error.

RTMO approach: each grid first regresses a bounding box, the box is expanded by 1.25× (to absorb inaccurate predictions), and bins are then divided uniformly inside the expanded box.

$$x_i = x_l + (x_r - x_l) \cdot \frac{i-1}{B_x - 1}, \quad i = 1, \ldots, B_x$$

Here $x_l, x_r$ are the left and right edges of the expanded bounding box. All models use $B_x = 192, B_y = 256$.

🎯 What makes DBA elegant: the spatial location of the bins becomes input-dependent. Small people get fine-grained bins and large people get wider bins, so quantization error scales automatically with instance size.

B. Dynamic Bin Encoding (DBE)

A new problem appears: the bin coordinates now differ for every grid, so a single set of shared learnable embeddings can no longer represent each bin (which is what SimCC / RTMPose do).

RTMO uses sine positional encoding plus a learnable FC to generate a representation for each bin on the fly:

$$[\boldsymbol{PE}(x_i)]_c = \begin{cases} \sin(x_i / t^{c/C}), & c \text{ even} \\ \cos(x_i / t^{(c-1)/C}), & c \text{ odd} \end{cases}$$

A fully connected layer $\phi$ then refines it:

$$\hat{p}_k(x_i) = \frac{\exp(\boldsymbol{f}_k \cdot \boldsymbol{\phi}(\boldsymbol{PE}(x_i)))}{\sum_{j=1}^{B_x} \exp(\boldsymbol{f}_k \cdot \boldsymbol{\phi}(\boldsymbol{PE}(x_j)))}$$

Here $\boldsymbol{f}_k$ is the feature of the $k$-th keypoint (refined by a GAU module, as in RTMPose).

💡 An analogy: this resembles the query × key dot product in a Transformer. The keypoint feature is the query, the bin's PE is the key, and the softmax output plays the role of attention weights, that is, the probability that the keypoint falls in that bin. Applying the Transformer's positional encoding to spatial bins is a clever piece of cross-pollination.
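The query-key view can be tried out in plain Python. This is a toy sketch under stated assumptions: the learned FC $\phi$ is replaced by the identity, the keypoint feature is faked as the encoding of a known position (180 px), and the channel count of 16 is arbitrary. None of these come from the RTMO code.

```python
import math

def sine_pe(x, channels=16, temperature=10000.0):
    """Sine positional encoding of a scalar coordinate (even: sin, odd: cos)."""
    pe = []
    for c in range(channels):
        freq = temperature ** ((c - c % 2) / channels)   # shared by each sin/cos pair
        pe.append(math.sin(x / freq) if c % 2 == 0 else math.cos(x / freq))
    return pe

def bin_probabilities(f_k, bin_coords, phi=lambda v: v):
    """Softmax over bins of the dot product f_k . phi(PE(x_i)); phi is a stand-in."""
    logits = [sum(a * b for a, b in zip(f_k, phi(sine_pe(x)))) for x in bin_coords]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# 192 bins spread over an expanded box from 75 px to 325 px.
bins = [75 + 250 * i / 191 for i in range(192)]
f_k = sine_pe(180.0)   # hypothetical keypoint feature aligned with x = 180
probs = bin_probabilities(f_k, bins)
best = max(range(len(bins)), key=probs.__getitem__)
print(best, round(bins[best], 2))
```

Because each sin/cos pair contributes `cos((x - x') / freq)` to the dot product, the score is highest for the bin closest to the encoded position, which is what lets a feature vector "point at" a coordinate even when the bin grid changes per instance.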

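C. Working Through Concrete Numbers

Suppose an image contains one person whose predicted bbox is $(x_l, x_r) = (100, 300)$ (width 200 px). After the 1.25× expansion:

The new edges are $x_l = 75, x_r = 325$ (width = 250)
$B_x = 192$, so adjacent bins are about 1.3 px apart (250 / 191 ≈ 1.31)
Bin 1 sits at 75 px, Bin 96 at about 199.3 px (right next to the person's center at 200 px), and Bin 192 at 325 px

Compared with SimCC: for a 640×640 input, SimCC splits the whole 640 px into, say, 384 bins (1.67 px per bin), but a 200 px wide person can use only 120 of them, so 264 bins are wasted.

Compared with DFL: with a fixed range of, for example, ±32 px around the anchor, a 200 px wide person does not fit at all.

RTMO: all 192 bins cover this person's 250 px expanded box, so every bin is useful and the quantization error is about 0.65 px.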
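The numbers in this walkthrough can be reproduced with a short sketch of the bin-allocation formula. The function name `dynamic_bins` is illustrative; the 1.25× expansion, 192 bins, and the 384-bin static comparison are the values used in the text.

```python
def dynamic_bins(x_l, x_r, num_bins=192, expand=1.25):
    """Evenly spaced bin coordinates over the expanded box (0-indexed list)."""
    center = (x_l + x_r) / 2
    half = (x_r - x_l) * expand / 2
    lo, hi = center - half, center + half
    return [lo + (hi - lo) * i / (num_bins - 1) for i in range(num_bins)]

bins = dynamic_bins(100, 300)
spacing = bins[1] - bins[0]
# bins[95] is "Bin 96" in the 1-indexed notation of the formula.
print(bins[0], round(bins[95], 2), bins[-1], round(spacing, 3))

# Static whole-image bins for comparison: 384 bins over 640 px.
static_width = 640 / 384
bins_on_person = round(200 / static_width)
print(round(static_width, 2), bins_on_person, 384 - bins_on_person)
```

Half of the bin spacing (about 0.65 px) is the worst-case rounding error when a coordinate is snapped to its nearest bin.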

Core Innovation 2: MLE Loss with Learnable Variance

This is the other elegant contribution. Standard coordinate classification uses Gaussian label smoothing with a KLD loss:

$$p_k(x_i | \mu_x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x_i - \mu_x)^2}{2\sigma^2}\right)$$

RTMO builds on one key observation: a Gaussian is symmetric about its mean, so $p_k(x_i|\mu_x) = p_k(\mu_x|x_i)$. The "target distribution" can therefore be reinterpreted as the "annotation likelihood under a Gaussian error model".

If the prediction $\hat{p}_k(x_i)$ is treated as a prior over $x_i$, the marginal likelihood of the annotation $\mu_x$ is:

$$P(\mu_x) = \sum_{i=1}^{B_x} P(\mu_x | x_i) P(x_i) = \sum_{i=1}^{B_x} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i - \mu_x)^2}{2\sigma^2}} \hat{p}_k(x_i)$$

Maximizing this likelihood fits the model to the true distribution of the annotation. In practice RTMO uses a Laplace distribution and the negative log-likelihood:

$$\mathcal{L}_{\text{mle}}^{(x)} = -\log\left[\sum_{i=1}^{B_x} \frac{1}{\hat{\sigma}} \exp\left(-\frac{|x_i - \mu_x|}{2\hat{\sigma} s}\right) \hat{p}_k(x_i)\right]$$

Here $s$ is the instance size (it normalizes the error), and $\hat{\sigma}$ is the scale predicted by the model (the "learnable variance", i.e. its uncertainty).
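A direct transcription of the Laplace form of the loss for a single keypoint and a single axis, in plain Python. The toy bins, the one-hot predictions, and the clamp constant `1e-12` are assumptions for illustration only.

```python
import math

def mle_loss_1d(probs, bin_coords, target, sigma, size):
    """Negative log of the Laplace-weighted marginal likelihood for one axis."""
    likelihood = sum(
        p * math.exp(-abs(x - target) / (2.0 * sigma * size)) / sigma
        for p, x in zip(probs, bin_coords)
    )
    return -math.log(max(likelihood, 1e-12))   # clamp to avoid log(0)

bins = [float(i) for i in range(11)]                  # toy bins at 0, 1, ..., 10
sharp = [1.0 if i == 5 else 0.0 for i in range(11)]   # all mass on the true bin
off = [1.0 if i == 9 else 0.0 for i in range(11)]     # all mass 4 units away

loss_sharp = mle_loss_1d(sharp, bins, target=5.0, sigma=1.0, size=1.0)
loss_off = mle_loss_1d(off, bins, target=5.0, sigma=1.0, size=1.0)
print(loss_sharp, loss_off)
```

With the same `sigma`, probability mass placed far from the annotation is discounted exponentially, so the misplaced prediction pays a much larger loss than the sharp one.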

Why Does Learnable Variance Matter So Much?

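🔑 The fundamental difference between MLE and KLD:

KLD + learnable σ: the model gets lazy and predicts a large σ for every sample, which flattens the target distribution, keeps the loss small regardless of the prediction, and makes training collapse

MLE + learnable σ: $\hat{\sigma}$ enters the likelihood both through the $1/\hat{\sigma}$ factor and through the exponent, so the model cannot cheat. Predicting a large σ for a hard sample really does reduce the loss, but predicting a large σ for an easy sample is penalized. This amounts to self-paced learning for free.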
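The trade-off can be made explicit with a small derivation. It relies on a simplifying assumption that is not from the paper: the predicted distribution puts all of its mass on a single bin at distance $d$ from the annotation. The MLE loss then reduces to a function of $\hat{\sigma}$ alone:

```latex
\mathcal{L}(\hat{\sigma}) = \log\hat{\sigma} + \frac{d}{2\hat{\sigma}s},
\qquad
\frac{\partial\mathcal{L}}{\partial\hat{\sigma}}
  = \frac{1}{\hat{\sigma}} - \frac{d}{2\hat{\sigma}^{2}s} = 0
\;\Longrightarrow\;
\hat{\sigma}^{*} = \frac{d}{2s}
```

Under this assumption the best $\hat{\sigma}$ grows linearly with the error $d$: an easy sample (small $d$) is served by a small $\hat{\sigma}$, where the $\log\hat{\sigma}$ term rewards confidence, while a hard sample (large $d$) is served by a large one.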

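Ablation Results (RTMO-s, COCO val2017)

Decoding	Loss	COCO AP	CrowdPose AP
Regression	OKS	65.6	66.1
CC + DBA + DBE	KLD	64.4	62.5
CC (static bins)	MLE	66.7	65.8
CC + DBA	MLE	65.6	65.2
CC + DBA + DBE	MLE	67.6	67.2

🎯 Three takeaways:

MLE > KLD: with the same decoder (CC + DBA + DBE), MLE is 3.2 AP higher on COCO (67.6 vs 64.4), which suggests the learnable variance does address the hard/easy imbalance

DBA alone hurts: adding DBA without DBE lowers AP (66.7 → 65.6 on COCO), because the bin positions move while their representations stay fixed, so the bin semantics break

DBA + DBE only work together: the two components depend on each other, since once the bin positions change, the representations have to change with them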

Training Pipeline: YOLOX Tricks + Two-Stage Training

RTMO's training borrows mature techniques from the YOLO family and adds a few pose-specific adjustments:

Label Assignment: Extended SimOTA

YOLOX uses SimOTA to assign positive grids dynamically. RTMO changes the assignment cost from "bbox quality" alone to a combination of "bbox + pose quality". The branches are trained as follows:

Score branch: Varifocal Loss, with the OKS (Object Keypoint Similarity) between the predicted pose and the GT pose as the target
BBox branch: IoU loss
Keypoint branch (DCC): MLE loss
Visibility branch: BCE loss

Proxy Regression: Avoiding OOM

DCC computes over every grid × every keypoint × $(B_x + B_y)$ bins, which is too expensive to run across all grids in dense prediction. RTMO's trick:

Add a lightweight pointwise conv that regresses keypoints (kpt_reg)
SimOTA uses kpt_reg to select positive grids (so DCC does not have to run on every grid)
DCC runs only on positive grids and outputs decoded keypoints (kpt_dec)
Proxy loss: $\mathcal{L}_{\text{proxy}} = 1 - \text{OKS}(\text{kpt}_{\text{reg}}, \text{kpt}_{\text{dec}})$, which keeps the proxy consistent with the DCC output
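The proxy loss in the last step can be sketched with a COCO-style OKS in plain Python. This is a minimal illustration, not the MMPose `OKSLoss`: the function names, the two-keypoint toy pose, and the per-keypoint constants in `sigmas` are placeholder assumptions.

```python
import math

def oks(pred, target, area, kpt_sigmas, visible=None):
    """Object Keypoint Similarity between two poses (COCO-style definition)."""
    if visible is None:
        visible = [True] * len(pred)
    sims = []
    for (px, py), (tx, ty), sig, vis in zip(pred, target, kpt_sigmas, visible):
        if not vis:
            continue
        d2 = (px - tx) ** 2 + (py - ty) ** 2
        k = 2.0 * sig                          # COCO uses k_i = 2 * sigma_i
        sims.append(math.exp(-d2 / (2.0 * area * k * k)))
    return sum(sims) / len(sims) if sims else 0.0

def proxy_loss(kpt_reg, kpt_dec, area, kpt_sigmas):
    """1 - OKS between the cheap regression output and the DCC-decoded pose."""
    return 1.0 - oks(kpt_reg, kpt_dec, area, kpt_sigmas)

kpt_dec = [(100.0, 50.0), (120.0, 80.0)]   # decoded by DCC (toy values)
kpt_reg = [(100.0, 50.0), (123.0, 84.0)]   # proxy regression (toy values)
sigmas = [0.05, 0.05]                      # placeholder per-keypoint constants
loss_same = proxy_loss(kpt_dec, kpt_dec, area=10000.0, kpt_sigmas=sigmas)
loss_off = proxy_loss(kpt_reg, kpt_dec, area=10000.0, kpt_sigmas=sigmas)
print(loss_same, round(loss_off, 4))
```

Because OKS divides the squared distance by the instance area, the same pixel offset costs more on a small person than on a large one.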

Total Loss

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{bbox}} + \lambda_2 \mathcal{L}_{\text{mle}} + \lambda_3 \mathcal{L}_{\text{proxy}} + \lambda_4 \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{vis}}$$

where $\lambda_1 = \lambda_2 = 5, \lambda_3 = 10, \lambda_4 = 2$.

Two-Stage Training Schedule

Stage	Proxy Target	Learning Rate	Purpose
Stage 1	GT pose annotations	4e-3	Bootstrap: proxy and DCC both learn from GT
Stage 2	DCC decoded pose	5e-4 → 2e-4 (cosine)	Refine: proxy follows DCC, DCC learns finer detail

💡 Stage 2 has a self-distillation flavor: the proxy switches from "learning GT" to "learning the DCC output", because the DCC output is already more informative than the GT (it carries uncertainty). This echoes the knowledge-distillation insight that soft labels beat hard labels.
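The Stage 2 learning-rate decay can be sketched as a standard cosine annealing curve between the two values in the table. The helper name `cosine_lr` and the stage length of 100 epochs are hypothetical; the source only gives the start and end learning rates.

```python
import math

def cosine_lr(step, total_steps, lr_start=5e-4, lr_end=2e-4):
    """Cosine annealing from lr_start (step 0) to lr_end (step total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

stage2_epochs = 100   # hypothetical length, not from the source
schedule = [cosine_lr(e, stage2_epochs) for e in range(stage2_epochs + 1)]
print(schedule[0], schedule[stage2_epochs // 2], schedule[-1])
```

The curve is flat near both ends and steepest in the middle, so the model spends more epochs close to the starting and final rates than a linear ramp would give it.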

Experimental Results: Beating Everyone in the Real-Time Class

COCO test-dev: King of One-Stage

Method	Backbone	Params	Time (ms)	AP
YOLO-Pose-s	CSPDarknet	10.8M	7.9	63.2
YOLO-Pose-l	CSPDarknet	61.3M	20.5	70.2
KAPAO-l	CSPNet	77.0M	50.2	70.3
PETR	Swin-L	213.8M	133	70.5
ED-Pose	Swin-L	218.0M	265.6	72.7
RTMO-s	CSPDarknet	9.9M	8.9	66.9
RTMO-m	CSPDarknet	22.6M	12.4	70.1
RTMO-l	CSPDarknet	44.8M	19.1	71.6
RTMO-l †	CSPDarknet	44.8M	19.1	73.3

🚀 Key observations:

RTMO-s is about as fast as YOLO-Pose-s (8.9 vs 7.9 ms) but 3.7 AP higher

RTMO-l is about 14× faster than ED-Pose-Swin-L (19.1 vs 265.6 ms) and only 1.1 AP behind

RTMO-l is about 7× faster than PETR-Swin-L and 1.1 AP higher

With extra training data (†), RTMO-l reaches 73.3% AP, ahead of every one-stage method in the table
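The ratios quoted in these observations can be rechecked directly from the table. The dictionary below copies latency and AP from the table rows; the helper name `compare` is illustrative.

```python
# (latency in ms, AP) copied from the COCO test-dev table
results = {
    "YOLO-Pose-s": (7.9, 63.2),
    "PETR": (133.0, 70.5),
    "ED-Pose": (265.6, 72.7),
    "RTMO-s": (8.9, 66.9),
    "RTMO-l": (19.1, 71.6),
}

def compare(ours, baseline):
    """Return (speedup of `ours` over `baseline`, AP gap ours - baseline)."""
    t_ours, ap_ours = results[ours]
    t_base, ap_base = results[baseline]
    return t_base / t_ours, ap_ours - ap_base

speed_ed, gap_ed = compare("RTMO-l", "ED-Pose")
speed_petr, gap_petr = compare("RTMO-l", "PETR")
_, gap_yolo = compare("RTMO-s", "YOLO-Pose-s")
print(round(speed_ed, 1), round(gap_ed, 1))
print(round(speed_petr, 1), round(gap_petr, 1))
print(round(gap_yolo, 1))
```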

Multi-Person Scenes: The Noose Around Top-Down

The comparison here is RTMO against the combination of RTMPose (top-down) and RTMDet-nano (detector).

⚡ With 10 or more people, RTMO-l's GPU latency is only 0.1 ms higher than with 1 person (0.5% of the total). This is the constant-time advantage of one-stage inference. For surveillance, sports analytics, AR and similar scenarios, this property matters more than having the lowest possible single-person latency.

CrowdPose: Dominating Crowded Scenes

Method	Params	AP	AP_Easy	AP_Medium	AP_Hard
HRNet (top-down)	28.5M	71.3	80.5	71.4	62.5
DEKR (bottom-up)	65.7M	67.3	74.6	68.1	58.7
ED-Pose Swin-L	218.0M	73.1	80.5	73.8	63.8
RTMO-l	44.8M	73.2	79.2	74.1	65.3
RTMO-l †	44.8M	83.8	88.8	84.7	77.2

🏆 RTMO-l † reaches 77.2% on CrowdPose AP_Hard, the split with the most crowding and the heaviest occlusion. That is 13.4 points above ED-Pose-Swin-L with a model about 5× smaller (note that † uses extra training data). It is strong evidence for the robustness of DCC + MLE on hard samples.

In-Depth Comparison with Its Cousin RTMPose

RTMO and RTMPose come from the same team (OpenMMLab / Shanghai AI Lab) and share many design ideas, but they solve different problems:

Feature	RTMPose	RTMO
Paradigm	Top-down	One-stage
Needs a detector?	✅ Required	❌ Not needed
Coordinate Classification	SimCC (static bins)	DCC (dynamic bins + encoding)
Loss	KLD with Gaussian smoothing	MLE with learnable variance
Feature refinement	GAU	GAU
Latency vs people count	Linear (more people, slower)	Constant
1-person speed	Fastest	Slightly slower
Multi-person speed	Degrades	Stable
Best use case	Single person / few people / mobile	Surveillance / sports / concerts

🎯 Which one to pick? A simple rule of thumb:

≤ 2 people per image → RTMPose (lowest single-person latency, and the detector overhead is small)

≥ 4 people per image → RTMO (the constant-latency advantage starts to show)

Crowded scenes (occlusion + crowding) → RTMO, whose dense grid and MLE robustness win clearly

Hands-On Guide: Running RTMO with MMPose

Option 1: Off-the-shelf Inference

```python
# Install (run these in a shell):
#   pip install mmcv mmengine
#   pip install "mmpose>=1.3.0"   # quoted so the shell does not treat > as a redirect
#   pip install mmdet             # no detector is needed, but the demo script uses it

from mmpose.apis import MMPoseInferencer

# Load RTMO-l in one call
inferencer = MMPoseInferencer(
    pose2d='rtmo-l_16xb16-600e_body7-640x640',
    device='cuda:0'
)

# Inference (the call returns a generator of per-image results)
results = inferencer(
    'crowd.jpg',
    show=False,
    out_dir='outputs/',
    radius=4,
    thickness=2
)

for result in results:
    poses = result['predictions'][0]   # all instances found in this image
    print(f"Detected {len(poses)} people")
    for p in poses:
        keypoints = p['keypoints']      # (17, 2)
        scores = p['keypoint_scores']   # (17,)
        bbox = p['bbox']
```
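A small follow-up sketch for post-processing, assuming each instance is a dict with the `keypoints` and `keypoint_scores` fields used in the loop above. The helper name `confident_keypoints`, the 0.5 threshold, and the hand-written `fake_instance` are illustrative and not part of MMPose.

```python
def confident_keypoints(instance, threshold=0.5):
    """Return (index, x, y, score) for keypoints scoring at least `threshold`."""
    pairs = zip(instance['keypoints'], instance['keypoint_scores'])
    return [
        (i, x, y, s)
        for i, ((x, y), s) in enumerate(pairs)
        if s >= threshold
    ]

# Hand-written stand-in for one entry of result['predictions'][0]
fake_instance = {
    'keypoints': [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]],
    'keypoint_scores': [0.9, 0.2, 0.7],
}
kept = confident_keypoints(fake_instance)
print(kept)
```

Keeping the keypoint index alongside each coordinate preserves the skeleton topology after low-confidence joints are dropped.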

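Option 2: Train It Yourself (COCO)

```bash
# Clone MMPose
git clone https://github.com/open-mmlab/mmpose.git
cd mmpose/projects/rtmo

# Download the pretrained backbone (YOLOX-l on COCO det),
# then launch training directly.
# Paths are kept as given; adjust them to the layout of your checkout.
bash tools/dist_train.sh \
    configs/rtmo-l_16xb16-600e_coco-640x640.py \
    8  # 8 GPUs
```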

Key config excerpt:

```python
# Core part of rtmo-l_16xb16-600e_coco-640x640.py
model = dict(
    type='BottomupPoseEstimator',
    backbone=dict(type='CSPDarknet', deepen_factor=1.0, widen_factor=1.0),
    neck=dict(type='HybridEncoder', encoder_cfg=dict(...)),
    head=dict(
        type='RTMOHead',
        num_keypoints=17,
        featmap_strides=(16, 32),  # P4 + P5 only
        head_module_cfg=dict(
            num_classes=1,
            in_channels=256,
            cls_feat_channels=256,
            channels_per_group=36,
            pose_vec_channels=512,
        ),
        dcc_cfg=dict(
            in_channels=256,
            feat_channels=128,
            num_bins=(192, 256),  # Bx, By
            spe_channels=128,      # sine PE dimension
        ),
        loss_mle=dict(type='MLECCLoss', use_target_weight=True, loss_weight=5.0),
        loss_bbox=dict(type='IoULoss', loss_weight=5.0),
        loss_oks=dict(type='OKSLoss', loss_weight=10.0),  # proxy loss
        loss_vis=dict(type='BCELoss', loss_weight=1.0),
        loss_cls=dict(type='VariFocalLoss', loss_weight=2.0),
    ),
)
```

Option 3: Export to ONNX (Deployment)

```bash
# Export with mmdeploy
python tools/deploy.py \
    configs/mmpose/pose-detection_rtmo_onnxruntime_dynamic-640x640.py \
    projects/rtmo/configs/rtmo-l_16xb16-600e_coco-640x640.py \
    checkpoints/rtmo-l_coco-640x640.pth \
    demo/resources/human-pose.jpg \
    --work-dir mmdeploy_models/rtmo-l \
    --device cuda \
    --dump-info
```

💡 Deployment advice:

ONNXRuntime + FP32: x86 servers / desktops

TensorRT + FP16: NVIDIA Jetson / data-center GPUs (RTMO-l can reach about 141 FPS on a V100 with TRT FP16)

ONNXRuntime + OpenVINO: Intel CPUs

Avoid serving the raw PyTorch model in production: ED-Pose in the paper shows what happens when ONNX export fails (1.5 s per frame)

Reading the Core Code: A PyTorch Skeleton of DCC

Below is a simplified Dynamic Coordinate Classifier that keeps the essential logic:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SinePositionalEncoding(nn.Module):
    """Generate a sine PE for bin coordinates."""
    def __init__(self, num_channels=128, temperature=10000):
        super().__init__()
        self.num_channels = num_channels   # must be even
        self.temperature = temperature

    def forward(self, coords):
        # coords: (N, num_bins), the bin coordinates of each grid
        dim_t = torch.arange(self.num_channels, device=coords.device)
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_channels)
        pos = coords.unsqueeze(-1) / dim_t   # (N, num_bins, C)
        # Interleave sin (even channels) and cos (odd channels) without
        # in-place writes, which can break autograd when coords needs grad.
        pe = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
        return pe.flatten(-2)                # (N, num_bins, C)


class DynamicCoordinateClassifier(nn.Module):
    def __init__(self, in_channels=256, feat_channels=128,
                 num_keypoints=17, num_bins=(192, 256), spe_channels=128):
        super().__init__()
        self.K = num_keypoints
        self.Bx, self.By = num_bins

        # Split the pose feature into one feature vector per keypoint
        self.kpt_feat_proj = nn.Linear(in_channels, num_keypoints * feat_channels)

        # Sine PE + learnable FC (φ)
        self.spe = SinePositionalEncoding(spe_channels)
        self.phi_x = nn.Linear(spe_channels, feat_channels)
        self.phi_y = nn.Linear(spe_channels, feat_channels)

        # Variance head (predicts σ̂ for x and y of every keypoint)
        self.sigma_head = nn.Linear(in_channels, num_keypoints * 2)

    def get_bin_coords(self, bboxes, expand=1.25):
        """Dynamic Bin Allocation: bins follow the expanded bbox."""
        # bboxes: (N, 4) in xyxy format
        cx = (bboxes[:, 0] + bboxes[:, 2]) / 2
        cy = (bboxes[:, 1] + bboxes[:, 3]) / 2
        w = (bboxes[:, 2] - bboxes[:, 0]) * expand
        h = (bboxes[:, 3] - bboxes[:, 1]) * expand
        xl, xr = cx - w / 2, cx + w / 2
        yt, yb = cy - h / 2, cy + h / 2

        # Divide the expanded box into evenly spaced bins
        steps_x = torch.linspace(0, 1, self.Bx, device=bboxes.device)
        steps_y = torch.linspace(0, 1, self.By, device=bboxes.device)
        x_bins = xl[:, None] + (xr - xl)[:, None] * steps_x[None, :]   # (N, Bx)
        y_bins = yt[:, None] + (yb - yt)[:, None] * steps_y[None, :]   # (N, By)
        return x_bins, y_bins

    def forward(self, pose_feat, bboxes):
        # pose_feat: (N, C), the pose feature of each positive grid
        # bboxes:    (N, 4), the bbox predicted by that grid
        N = pose_feat.size(0)

        # ---- Step 1: Keypoint features ----
        f_k = self.kpt_feat_proj(pose_feat).view(N, self.K, -1)  # (N, K, D)

        # ---- Step 2: Dynamic bin coordinates ----
        x_bins, y_bins = self.get_bin_coords(bboxes)             # (N, Bx), (N, By)

        # ---- Step 3: Dynamic bin encoding ----
        # Bin coordinates go through sine PE, then the FC.
        # Every sample has different bin coordinates, so this is computed on the fly.
        x_emb = self.phi_x(self.spe(x_bins))    # (N, Bx, D)
        y_emb = self.phi_y(self.spe(y_bins))    # (N, By, D)

        # ---- Step 4: Bin-keypoint similarity (softmax over bins) ----
        logits_x = torch.einsum('nkd,nid->nki', f_k, x_emb)     # (N, K, Bx)
        logits_y = torch.einsum('nkd,nid->nki', f_k, y_emb)     # (N, K, By)
        prob_x = F.softmax(logits_x, dim=-1)
        prob_y = F.softmax(logits_y, dim=-1)

        # ---- Step 5: Integral decoding of the coordinates ----
        kpt_x = (prob_x * x_bins.unsqueeze(1)).sum(-1)           # (N, K)
        kpt_y = (prob_y * y_bins.unsqueeze(1)).sum(-1)           # (N, K)
        keypoints = torch.stack([kpt_x, kpt_y], dim=-1)          # (N, K, 2)

        # ---- Step 6: Per-sample variance (used by the MLE loss) ----
        sigma = self.sigma_head(pose_feat).view(N, self.K, 2).exp()  # keeps σ > 0

        return {
            'keypoints': keypoints,
            'prob_x': prob_x, 'prob_y': prob_y,
            'x_bins': x_bins, 'y_bins': y_bins,
            'sigma': sigma,
        }


def mle_loss(prob, bins, target, sigma, instance_size, eps=1e-9):
    """
    prob:   (N, K, B)   predicted probability over bins
    bins:   (N, B)      bin coordinates
    target: (N, K)      ground-truth keypoint coordinate
    sigma:  (N, K)      predicted variance for this axis, e.g. sigma[..., 0] for x
    instance_size: (N,) bbox diagonal or another normalizing factor
    """
    diff = (bins.unsqueeze(1) - target.unsqueeze(-1)).abs()           # (N, K, B)
    s = instance_size.view(-1, 1, 1)
    laplace = torch.exp(-diff / (2 * sigma.unsqueeze(-1) * s)) / sigma.unsqueeze(-1)
    likelihood = (prob * laplace).sum(-1).clamp_min(eps)              # (N, K)
    return -likelihood.log().mean()
```

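🔑 Three implementation details to note:

phi_x and phi_y are separate FC layers (the x and y axes have different semantics and different bin counts)

sigma_head uses .exp() to guarantee σ > 0 (learning in log-space is more stable)

instance_size in the MLE loss makes the loss scale invariant to the person's size, since the same pixel error means something different for a large person and a small one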

Limitations and Future Directions

Current Limitations

1. Bin counts are hard-coded

$B_x = 192, B_y = 256$ suit COCO's 17 keypoints, but whole-body (133 keypoints, Halpe / COCO-Wholebody) or hand pose may need retuning.

2. P3 is cut, so small or distant people degrade

P3 was dropped for speed, so scenes with very small people (< 32 px) degrade noticeably. If the use case includes long-range surveillance, P3 may need to be reactivated.

3. Hybrid Encoder attention is memory-hungry

With high-resolution input (e.g. 800×800), the self-attention in the Hybrid Encoder remains a memory bottleneck. The 480-800 multi-scale training used in the RTMO paper already reflects this trade-off.

4. Single-frame, no temporal smoothing

Video applications will show jitter and need an external temporal smoother (a One-Euro filter, for example).
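For the jitter issue, a One-Euro filter is short enough to sketch here. This is a common simplified variant applied to a single scalar coordinate, not something shipped with RTMO; the parameter values (30 FPS, `min_cutoff=1.0`, `beta=0.0`) and the synthetic jitter signal are assumptions for illustration.

```python
import math

class OneEuroFilter:
    """Adaptive low-pass filter: heavy smoothing when slow, light when fast."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.freq = freq              # sampling rate in Hz (video FPS)
        self.min_cutoff = min_cutoff  # lower values smooth more at rest
        self.beta = beta              # higher values reduce lag during fast motion
        self.d_cutoff = d_cutoff      # cutoff for the derivative estimate
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.x_prev is None:       # first sample passes through unchanged
            self.x_prev = x
            return x
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat

# Synthetic jitter: a keypoint x-coordinate flickering by ±1 px around 100.
raw = [100.0 + (1.0 if i % 2 else -1.0) for i in range(60)]
smoother = OneEuroFilter(freq=30.0, min_cutoff=1.0, beta=0.0)
smoothed = [smoother(v) for v in raw]
raw_spread = max(raw[30:]) - min(raw[30:])
smooth_spread = max(smoothed[30:]) - min(smoothed[30:])
print(raw_spread, round(smooth_spread, 3))
```

In practice one filter instance would be kept per keypoint coordinate per tracked person, which also means some form of identity tracking across frames is needed.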

Future Directions

3D Multi-Person Pose: extend DCC to 3D bins (dual 1-D heatmap → triple 1-D heatmap)
Whole-Body Variant: the 133 keypoints of COCO-Wholebody need a more efficient DCC implementation
Sparse DCC: DCC currently runs on every positive grid, and token routing could reduce FLOPs further
Distillation with RTMPose: use RTMO's dense features to supervise RTMPose's top-down branch (or the other way round)

Technical Insights

1. Hybrid Paradigm Wins

At its core, RTMO succeeds by grafting a top-down representation (the dual 1-D heatmap) onto a one-stage architecture. Single-paradigm models tend to get stuck on one trade-off frontier, while hybridizing the representation and inference logic of two paradigms can often push past the old Pareto frontier.

2. Dynamic > Static

This echoes the earlier OSNet takeaway: input-dependent mechanisms tend to beat fixed ones. DBA, DBE, and the learnable variance in MLE are all designs that "change with the input". In pose estimation, object detection, attention and beyond, the trend in AI has long been from static to dynamic.

3. The Likelihood Framework Gives Self-Paced Learning for Free

MLE with learnable variance auto-balances hard and easy samples without any hand-crafted curriculum. The framework works in face recognition (Chang 2020), object detection (He 2019), and pose (RTMO), which makes it a generalizable design pattern.

4. An Unexpected Use of Positional Encoding

Sine PE started with sequence positions in the Transformer, moved on to patch positions in ViT, and now serves as the spatial bin representation in RTMO. The same mathematical tool works in three quite different contexts, which shows the power of a fundamental tool.

Summary

RTMO uses an elegant recipe to remove the long-standing accuracy bottleneck of one-stage MPPE:

Dynamic Coordinate Classifier: makes SimCC's dual 1-D heatmap input-adaptive, resolving the tension between bin wastage and coverage in dense prediction
MLE Loss with Learnable Variance: a likelihood-based formulation replaces KLD and balances hard and easy samples at no extra cost
YOLOX-style Training + Two-Stage Schedule: a mature detection framework plus proxy refinement with a self-distillation flavor

Practical Value

🎯 74.8% AP @ 141 FPS (V100, TRT FP16): the strongest real-time MPPE result at the time of publication
⚡ Stable multi-person latency: scenes with 10 or more people differ from 1 person by 0.5%
🏆 77.2% AP on the CrowdPose hard split: SOTA for crowded scenes
🛠️ Official MMPose support: production-ready, with working ONNX / TRT export

Which One to Pick?

Scenario	Recommendation	Reason
Single person / mobile AR coaching	RTMPose-s	Fastest single-person speed
Sports analytics (5-20 people)	RTMO-m	Constant latency + mid-range accuracy
Concert / rally surveillance	RTMO-l †	Crowded-scene SOTA
Whole-body / hand needed	RTMPose-Wholebody	RTMO has no wholebody variant yet
Edge device (Jetson Nano)	RTMO-s	9.9M params + TRT FP16

TL;DR

上一系列 blog 我哋睇咗 Person ReID 點樣由 OSNet 走到 DINOv2，係 identity 層面 嘅 human-centric vision。今次轉去 keypoint 層面——multi-person pose estimation。

RTMO 喺 CVPR 2024 提出，核心問題好簡單：Top-down 太慢、bottom-up / one-stage 又唔夠準，可唔可以兩者兼得？

核心重點：

🎯 One-stage + Coordinate Classification：第一次將 dual 1-D heatmap classification（SimCC / RTMPose 嘅招）塞入 YOLO-based dense prediction
🧩 Dynamic Coordinate Classifier (DCC)：Dynamic Bin Allocation（bins 跟 bbox 大細走）+ Dynamic Bin Encoding（sine PE + FC 生成 per-bin representation）
📐 MLE Loss with Learnable Variance：用 Maximum Likelihood Estimation 取代 KLD，每個 sample 自己學 uncertainty，hard sample 大 variance、easy sample 細 variance
⚡ Real-time + 多人不衰減：RTMO-l 喺 COCO val2017 攞到 74.8% AP @ 141 FPS（V100），10 人以上場景 latency 只多 0.1ms
🏆 CrowdPose SOTA：RTMO-l + extra data 喺 CrowdPose 上 83.8% AP，遠拋 ED-Pose（Swin-L, 218M params）

背景：Multi-Person Pose Estimation 嘅三條路

喺 RTMO 出現之前，real-time multi-person pose estimation（MPPE）基本上分成三派：

Loading diagram...

Paradigm	精度	速度同人數關係	Pipeline 複雜度
Top-down	⭐⭐⭐⭐⭐	Linear in N（人多就慢）	需要 detector + cropper + pose net
Bottom-up	⭐⭐⭐	Constant，但 grouping 慢	需要 keypoint grouping（PAF / heatmap matching）
One-stage	⭐⭐⭐（之前）	Constant	一個 forward 搞掂

💡 One-stage 嘅理論優勢好明顯：latency 唔受人數影響，pipeline 簡單。但喺 RTMO 之前，所有 real-time one-stage 方法（YOLO-Pose、KAPAO、YOLOX-Pose）都係用一個 fully connected layer 直接 regress 出 keypoint 坐標——呢個做法等同假設 keypoint 位置係 Dirac delta 分佈，完全忽略咗 annotation 嘅 inherent uncertainty。結果就係：快係快，但 AP 永遠追唔上 top-down。

Coordinate Classification：Top-down 嘅秘密武器

2022 年 SimCC、2023 年 RTMPose 都用咗一個叫 dual 1-D heatmap 嘅技巧，喺 top-down setting 大殺四方：

將 x 軸切成 $B_x$ 個 bin、y 軸切成 $B_y$ 個 bin
每個 keypoint output 兩個 1-D probability heatmap（一條 x、一條 y）
透過 sub-pixel bin 達到高空間解析度，唔需要 high-res 2D heatmap

問題嚟啦：呢個技巧直接搬入 one-stage（dense prediction）就出事——

Bin wastage：bins 散佈成張圖，但每個人只佔幾個 grid，大部分 bin 對某個 instance 嚟講都係廢嘅
DFL [21] 嘅 fix 又唔夠：bins 固定喺 anchor 附近一個 fixed range，大人會超出、細人會量化誤差爆炸
KLD loss 一視同仁：dense prediction 入面每個 grid 嘅難度差好遠（位置、size、姿態），KLD 冇辦法區分

🎯 RTMO 嘅 thesis：如果可以解決呢三個 incompatibility，就可以將 coordinate classification 嘅準度搬入 one-stage 嘅速度——拎到「top-down 級精度 + one-stage 級速度」。

RTMO Architecture 全景

RTMO 整體沿用 YOLO 嘅 dense grid prediction，但 head 嘅 keypoint 分支特別嗌咗一個 Dynamic Coordinate Classifier。

Loading diagram...

三個關鍵設計決定

決定	做法	原因
Backbone	CSPDarknet（YOLOX 嗰個）	同 YOLO 生態 align，方便 deployment
Neck	Hybrid Encoder（RT-DETR）	同時有 self-attention（global）+ FPN（local）
Feature Levels	只用 P4 + P5，丟咗 P3	P3 食咗 head 78.5% FLOPs，但只貢獻 10.7% correct detections

🔑 P3 嘅故事：傳統 feature pyramid 嘅諗法係「淺層 feature 揸細人」，但 RTMO 嘅 ablation（Table 4）發現 P4 + P5 已經夠 cover 細人——P3 嘅 FLOPs 純粹浪費。RTMO-l 由 3 級 features → 2 級，CPU latency 由 186ms 降到 125ms，AP 只跌 0.2%。Lesson：feature pyramid 唔係越深越多越好。

核心創新 1：Dynamic Coordinate Classifier (DCC)

DCC 係 RTMO 嘅技術核心。佢將 SimCC 嘅 dual 1-D heatmap 改造成適合 dense prediction：

A. Dynamic Bin Allocation (DBA)

傳統做法（SimCC / RTMPose）：Bins 鋪滿成張 input image。Top-down 入面已經 crop 過，所以 OK。

DFL 嘅做法：Bins 固定喺 anchor 附近一個 predefined range。大人裝唔晒、細人 quantization error 爆炸。

RTMO 嘅做法：每個 grid 先 regress 出 bounding box，將 box expand 1.25 倍（cover 預測唔準嘅情況），再喺 expanded box 內部均勻 divide bins。

x_i = x_l + (x_r - x_l) \cdot \frac{i-1}{B_x - 1}, \quad i = 1, \ldots, B_x

其中 $x_l, x_r$ 係 expanded bounding box 嘅左右邊界。所有 model 都用 $B_x = 192, B_y = 256$ 。

Loading diagram...

🎯 DBA 嘅 elegant 之處：bins 嘅 spatial location 變成 input-dependent——細人有細人嘅 fine-grained bins，大人有大人嘅 wider bins。Quantization error 自動同 instance size scale。

B. Dynamic Bin Encoding (DBE)

問題嚟啦：而家每個 grid 嘅 bin 坐標都唔同，唔可以再用一組 shared learnable embeddings 代表每個 bin（SimCC / RTMPose 嘅做法）。

RTMO 用 sine positional encoding + learnable FC 嚟為每個 bin on-the-fly 生成 representation：

[\boldsymbol{PE}(x_i)]_c = \begin{cases} \sin(x_i / t^{c/C}), & c \text{ even} \\ \cos(x_i / t^{(c-1)/C}), & c \text{ odd} \end{cases}

再經一個 fully connected layer $\phi$ refine：

\hat{p}_k(x_i) = \frac{\exp(\boldsymbol{f}_k \cdot \boldsymbol{\phi}(\boldsymbol{PE}(x_i)))}{\sum_{j=1}^{B_x} \exp(\boldsymbol{f}_k \cdot \boldsymbol{\phi}(\boldsymbol{PE}(x_j)))}

其中 $\boldsymbol{f}_k$ 係第 $k$ 個 keypoint 嘅 feature（經 GAU module refine 過，跟 RTMPose 一樣）。

💡 諗法類比：呢個就好似 Transformer 入面 query × key 嘅 dot product——keypoint feature 係 query，bin 嘅 PE 係 key，softmax 之後就係 attention weights，即 keypoint 喺呢個 bin 嘅 probability。RTMO 將 Transformer 嘅 positional encoding 用到 spatial bins 上面，係一個好聰明嘅 cross-pollination。

C. 用具體數字諗一次

假設一張圖入面有一個人，predicted bbox 係 $(x_l, x_r) = (100, 300)$ （width 200px）。Expand 1.25× 之後：

新 $x_l = 75, x_r = 325$ （width = 250）
$B_x = 192$ ，所以每個 bin 大約 1.3 px wide
Bin 1 中心喺 75px，Bin 96 喺 200px（人嘅中心），Bin 192 喺 325px

對比 SimCC：如果 input image 係 640×640，SimCC 將整個 640px 分成例如 384 個 bin（每 bin 1.67px），但 200px 寬嘅人只用得到 120 個 bin → 浪費 264 個 bin。

對比 DFL：固定 range 例如 ±32px 圍住 anchor → 一個 200px 嘅人完全裝唔入！

RTMO：192 個 bin 全部用嚟覆蓋呢個人嘅 250px expanded box → 每個 bin 都有用，quantization error ≈ 0.65px。

核心創新 2：MLE Loss with Learnable Variance

呢個係另一個 elegant 嘅 contribution。Standard 嘅 coordinate classification 用 Gaussian label smoothing + KLD loss：

p_k(x_i | \mu_x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x_i - \mu_x)^2}{2\sigma^2}\right)

如果將預測 $\hat{p}_k(x_i)$ 視為 $x_i$ 嘅 prior，annotation $\mu_x$ 嘅 marginal likelihood 就係：

P(\mu_x) = \sum_{i=1}^{B_x} P(\mu_x | x_i) P(x_i) = \sum_{i=1}^{B_x} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i - \mu_x)^2}{2\sigma^2}} \hat{p}_k(x_i)

Maximize 呢個就 model 緊 annotation 嘅真實分佈。RTMO 實際用 Laplace 分佈 同 negative log-likelihood：

\mathcal{L}_{\text{mle}}^{(x)} = -\log\left[\sum_{i=1}^{B_x} \frac{1}{\hat{\sigma}} \exp\left(-\frac{|x_i - \mu_x|}{2\hat{\sigma} s}\right) \hat{p}_k(x_i)\right]

其中 $s$ 係 instance size（normalize error）， $\hat{\sigma}$ 係 model 預測嘅 variance（即 uncertainty）。

點解 Learnable Variance 咁重要？

Loading diagram...

🔑 MLE vs KLD 嘅根本分別：

KLD + learnable σ：Model 會懶到全部 sample 都 predict 一個大 σ → flatten target distribution → loss 永遠細 → 訓練崩潰

MLE + learnable σ：σ 喺 likelihood 入面 同時影響 numerator 同 denominator，model 唔可以 cheat。Hard sample predict 大 σ 真係會減少 loss，但 easy sample 預測大 σ 反而會 penalize。呢個係 self-paced learning 嘅天然 implementation。

Ablation 數據（RTMO-s, COCO val2017）

Decoding	Loss	COCO AP	CrowdPose AP
Regression	OKS	65.6	66.1
CC + DBA + DBE	KLD	64.4	62.5
CC（static bins）	MLE	66.7	65.8
CC + DBA	MLE	65.6	65.2
CC + DBA + DBE	MLE	67.6	67.2

🎯 三個 takeaway：

MLE > KLD：差 3.2% AP——learnable variance 真係解決咗 hard/easy 不均衡嘅問題

DBA alone 會跌：只加 DBA 唔加 DBE 反而衰咗，因為 bin 位變咗但 representation 冇變，semantics broken

DBA + DBE 一齊先 work：兩個 component 互相依賴——bin 位變、representation 就要跟住變

Training Pipeline：YOLOX 嘅 Trick + Two-Stage Training

RTMO 嘅 training 借用咗 YOLO 系列嘅成熟技術，再加幾個 pose-specific 嘅調整：

Label Assignment：擴展版 SimOTA

YOLOX 用 SimOTA 嚟動態 assign positive grids。RTMO 將 score 嘅 cost 由「bbox 質素」改成「bbox + pose 質素」嘅組合：

Score branch：用 Varifocal Loss，target 係預測 pose 同 GT pose 嘅 OKS（Object Keypoint Similarity）
BBox branch：IoU loss
Keypoint branch（DCC）：MLE loss
Visibility branch：BCE loss

Proxy Regression：避免 OOM

DCC 嘅 computation 喺 每個 grid × 每個 keypoint × Bx × By 上面行——dense prediction 入面太貴。RTMO 嘅 trick：

額外加一個 lightweight pointwise conv 做 keypoint regression（kpt_reg）
SimOTA 用 kpt_reg 嚟揀 positive grids（避免要 DCC 跑全部 grid）
DCC 只跑 positive grids，輸出 decoded keypoints（kpt_dec）
Proxy loss： $\mathcal{L}_{\text{proxy}} = 1 - \text{OKS}(\text{kpt}_{\text{reg}}, \text{kpt}_{\text{dec}})$ ——令 proxy 同 DCC 嘅 output 一致

Total Loss

\mathcal{L} = \lambda_1 \mathcal{L}_{\text{bbox}} + \lambda_2 \mathcal{L}_{\text{mle}} + \lambda_3 \mathcal{L}_{\text{proxy}} + \lambda_4 \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{vis}}

其中 $\lambda_1 = \lambda_2 = 5, \lambda_3 = 10, \lambda_4 = 2$ 。

Two-Stage Training Schedule

Stage	Proxy Target	Learning Rate	目的
Stage 1	GT pose annotations	4e-3	Bootstrap：proxy + DCC 同時學 GT
Stage 2	DCC decoded pose	5e-4 → 2e-4（cosine）	Refine：proxy follow DCC，DCC 學更 fine-grained

💡 Stage 2 嘅 self-distillation 味道：Proxy 由「學 GT」變「學 DCC 嘅 output」——因為 DCC 嘅輸出本身已經比 GT 更 informative（包含 uncertainty）。呢個類似 knowledge distillation 入面 soft label > hard label 嘅 insight。

實驗結果：Real-Time 入面打贏所有人

COCO test-dev：One-Stage 之王

Method	Backbone	Params	Time (ms)	AP
YOLO-Pose-s	CSPDarknet	10.8M	7.9	63.2
YOLO-Pose-l	CSPDarknet	61.3M	20.5	70.2
KAPAO-l	CSPNet	77.0M	50.2	70.3
PETR	Swin-L	213.8M	133	70.5
ED-Pose	Swin-L	218.0M	265.6	72.7
RTMO-s	CSPDarknet	9.9M	8.9	66.9
RTMO-m	CSPDarknet	22.6M	12.4	70.1
RTMO-l	CSPDarknet	44.8M	19.1	71.6
RTMO-l †	CSPDarknet	44.8M	19.1	73.3

🚀 重點觀察：

RTMO-s 同 YOLO-Pose-s 速度差唔多（8.9 vs 7.9 ms），但 AP 高 3.7%

RTMO-l 比 ED-Pose-Swin-L 快 14×（19.1 vs 265.6 ms），AP 只差 1.1%

RTMO-l 比 PETR-Swin-L 快 7×、AP 高 1.1%

加 extra training data（†）後，RTMO-l 攞到 73.3% AP，已經超越所有 one-stage 方法

多人場景：Top-down 嘅 noose

RTMO 對比 RTMPose（top-down）+ RTMDet-nano（detector）嘅組合：

Loading diagram...

⚡ RTMO-l 喺 10+ 人嘅 GPU latency 只比 1 人多 0.1ms（佔 total 0.5%）——呢個就係 one-stage 嘅 constant-time 優勢。對於 surveillance、體育分析、AR 等場景，呢個 property 比絕對最快嘅 single-person latency 更重要。

CrowdPose：擠擁場景嘅大殺四方

Method	Params	AP	AP_Easy	AP_Medium	AP_Hard
HRNet（top-down）	28.5M	71.3	80.5	71.4	62.5
DEKR（bottom-up）	65.7M	67.3	74.6	68.1	58.7
ED-Pose Swin-L	218.0M	73.1	80.5	73.8	63.8
RTMO-l	44.8M	73.2	79.2	74.1	65.3
RTMO-l †	44.8M	83.8	88.8	84.7	77.2

🏆 RTMO-l † 喺 CrowdPose AP_Hard 攞到 77.2%——hard split 係最擠擁、遮擋最嚴重嘅場景。比 ED-Pose-Swin-L 高 13.4%，model size 仲細 5×。呢個係 DCC + MLE 對 hard sample 嘅 robustness 嘅最佳證明。

同 Cousin RTMPose 嘅深度對比

RTMO 同 RTMPose 出自同一個團隊（OpenMMLab / Shanghai AI Lab），共享好多 design idea，但解決唔同問題：

特性	RTMPose	RTMO
Paradigm	Top-down	One-stage
需要 detector？	✅ 必須	❌ 唔需要
Coordinate Classification	SimCC（static bins）	DCC（dynamic bins + encoding）
Loss	KLD with Gaussian smoothing	MLE with learnable variance
Feature refinement	GAU	GAU
Latency vs people count	Linear（人多就慢）	Constant
1-person speed	最快	略慢
Multi-person speed	退化	穩定
Best use case	單人 / 少人 / mobile	Surveillance / 體育 / 演唱會

🎯 揀邊個？簡單 rule of thumb：

每張圖 ≤ 2 人 → RTMPose（單人 latency 最低，detector 嘅 overhead 唔大）

每張圖 ≥ 4 人 → RTMO（constant latency 嘅優勢開始顯現）

crowded scene（遮擋 + 擠擁） → RTMO 嘅 dense grid + MLE robustness 完勝

實作指南：用 MMPose 跑 RTMO

方法一：Off-the-shelf Inference

python# 安裝
# pip install mmcv mmengine
# pip install mmpose>=1.3.0
# pip install mmdet  # 唔需要 detector，但 demo script 會用到

from mmpose.apis import MMPoseInferencer

# 一行加載 RTMO-l
inferencer = MMPoseInferencer(
    pose2d='rtmo-l_16xb16-600e_body7-640x640',
    device='cuda:0'
)

# Inference
results = inferencer(
    'crowd.jpg',
    show=False,
    out_dir='outputs/',
    radius=4,
    thickness=2
)

for result in results:
    poses = result['predictions'][0]
    print(f"Detected {len(poses)} people")
    for p in poses:
        keypoints = p['keypoints']      # (17, 2)
        scores = p['keypoint_scores']   # (17,)
        bbox = p['bbox']

方法二：自己 Train（COCO）

bash# Clone MMPose
git clone https://github.com/open-mmlab/mmpose.git
cd mmpose/projects/rtmo

# 下載 pretrained backbone（YOLOX-l on COCO det）
# 然後直接跑
bash tools/dist_train.sh \
    configs/rtmo-l_16xb16-600e_coco-640x640.py \
    8  # 8 GPUs

關鍵 config 片段：

python# rtmo-l_16xb16-600e_coco-640x640.py 嘅核心部分
model = dict(
    type='BottomupPoseEstimator',
    backbone=dict(type='CSPDarknet', deepen_factor=1.0, widen_factor=1.0),
    neck=dict(type='HybridEncoder', encoder_cfg=dict(...)),
    head=dict(
        type='RTMOHead',
        num_keypoints=17,
        featmap_strides=(16, 32),  # 只用 P4 + P5
        head_module_cfg=dict(
            num_classes=1,
            in_channels=256,
            cls_feat_channels=256,
            channels_per_group=36,
            pose_vec_channels=512,
        ),
        dcc_cfg=dict(
            in_channels=256,
            feat_channels=128,
            num_bins=(192, 256),  # Bx, By
            spe_channels=128,      # sine PE dimension
        ),
        loss_mle=dict(type='MLECCLoss', use_target_weight=True, loss_weight=5.0),
        loss_bbox=dict(type='IoULoss', loss_weight=5.0),
        loss_oks=dict(type='OKSLoss', loss_weight=10.0),  # proxy loss
        loss_vis=dict(type='BCELoss', loss_weight=1.0),
        loss_cls=dict(type='VariFocalLoss', loss_weight=2.0),
    ),
)

Method 3: Export to ONNX (deployment)

# Export with mmdeploy
python tools/deploy.py \
    configs/mmpose/pose-detection_rtmo_onnxruntime_dynamic-640x640.py \
    projects/rtmo/configs/rtmo-l_16xb16-600e_coco-640x640.py \
    checkpoints/rtmo-l_coco-640x640.pth \
    demo/resources/human-pose.jpg \
    --work-dir mmdeploy_models/rtmo-l \
    --device cuda \
    --dump-info

💡 Deployment recommendations:

ONNXRuntime + FP32: x86 server / desktop

TensorRT + FP16: NVIDIA Jetson / data-center GPU (RTMO-l reaches ~141 FPS on a V100 with TRT FP16)

ONNXRuntime + OpenVINO: Intel CPU

Avoid running the raw PyTorch model in production: in the paper, ED-Pose shows what a failed ONNX export costs (1.5 s / frame)

Core code walkthrough: a PyTorch skeleton of DCC

Below is a simplified Dynamic Coordinate Classifier that keeps all the essential logic:

import torch
import torch.nn as nn
import torch.nn.functional as F


class SinePositionalEncoding(nn.Module):
    """Generate a sine positional encoding for bin coordinates."""
    def __init__(self, num_channels=128, temperature=10000):
        super().__init__()
        self.num_channels = num_channels
        self.temperature = temperature

    def forward(self, coords):
        # coords: (N, num_bins), bin coordinates of each grid (assumed already normalized)
        dim_t = torch.arange(self.num_channels, device=coords.device)
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_channels)
        pos = coords.unsqueeze(-1) / dim_t   # (N, num_bins, C)
        # Build sin/cos out of place: in-place slice writes would break autograd
        # whenever coords carries gradients.
        pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
        return pos.flatten(-2)               # interleaved sin/cos, (N, num_bins, C)


class DynamicCoordinateClassifier(nn.Module):
    def __init__(self, in_channels=256, feat_channels=128,
                 num_keypoints=17, num_bins=(192, 256), spe_channels=128):
        super().__init__()
        self.K = num_keypoints
        self.Bx, self.By = num_bins

        # Split the pose feature into one feature vector per keypoint
        self.kpt_feat_proj = nn.Linear(in_channels, num_keypoints * feat_channels)

        # Sine PE + learnable FC (φ)
        self.spe = SinePositionalEncoding(spe_channels)
        self.phi_x = nn.Linear(spe_channels, feat_channels)
        self.phi_y = nn.Linear(spe_channels, feat_channels)

        # Variance head (predicts σ̂)
        self.sigma_head = nn.Linear(in_channels, num_keypoints * 2)

    def get_bin_coords(self, bboxes, expand=1.25):
        """Dynamic Bin Allocation: bins follow the expanded bbox."""
        # bboxes: (N, 4) in xyxy format
        cx = (bboxes[:, 0] + bboxes[:, 2]) / 2
        cy = (bboxes[:, 1] + bboxes[:, 3]) / 2
        w = (bboxes[:, 2] - bboxes[:, 0]) * expand
        h = (bboxes[:, 3] - bboxes[:, 1]) * expand
        xl, xr = cx - w / 2, cx + w / 2
        yt, yb = cy - h / 2, cy + h / 2

        # Divide the expanded bbox into evenly spaced bins
        steps_x = torch.linspace(0, 1, self.Bx, device=bboxes.device)
        steps_y = torch.linspace(0, 1, self.By, device=bboxes.device)
        x_bins = xl[:, None] + (xr - xl)[:, None] * steps_x[None, :]   # (N, Bx)
        y_bins = yt[:, None] + (yb - yt)[:, None] * steps_y[None, :]   # (N, By)
        return x_bins, y_bins

    def forward(self, pose_feat, bboxes):
        # pose_feat: (N, C), pose feature of each positive grid
        # bboxes:    (N, 4), bbox predicted by that grid
        N = pose_feat.size(0)

        # ---- Step 1: Keypoint features ----
        f_k = self.kpt_feat_proj(pose_feat).view(N, self.K, -1)  # (N, K, D)

        # ---- Step 2: Dynamic bin coordinates ----
        x_bins, y_bins = self.get_bin_coords(bboxes)             # (N, Bx), (N, By)

        # ---- Step 3: Dynamic bin encoding ----
        # Pass the bin coordinates through sine PE, then an FC layer.
        # Note: bin coordinates differ per sample, so this is computed on the fly.
        x_emb = self.phi_x(self.spe(x_bins))    # (N, Bx, D)
        y_emb = self.phi_y(self.spe(y_bins))    # (N, By, D)

        # ---- Step 4: Bin-keypoint similarity (softmax over bins) ----
        logits_x = torch.einsum('nkd,nid->nki', f_k, x_emb)     # (N, K, Bx)
        logits_y = torch.einsum('nkd,nid->nki', f_k, y_emb)     # (N, K, By)
        prob_x = F.softmax(logits_x, dim=-1)
        prob_y = F.softmax(logits_y, dim=-1)

        # ---- Step 5: Decode coordinates as the expectation over bins ----
        kpt_x = (prob_x * x_bins.unsqueeze(1)).sum(-1)           # (N, K)
        kpt_y = (prob_y * y_bins.unsqueeze(1)).sum(-1)           # (N, K)
        keypoints = torch.stack([kpt_x, kpt_y], dim=-1)          # (N, K, 2)

        # ---- Step 6: Predict per-sample variance (needed by the MLE loss) ----
        sigma = self.sigma_head(pose_feat).view(N, self.K, 2).exp()  # guarantees σ > 0

        return {
            'keypoints': keypoints,
            'prob_x': prob_x, 'prob_y': prob_y,
            'x_bins': x_bins, 'y_bins': y_bins,
            'sigma': sigma,
        }


def mle_loss(prob, bins, target, sigma, instance_size, eps=1e-9):
    """
    prob:   (N, K, B)   — predicted probability over bins
    bins:   (N, B)      — bin coordinates
    target: (N, K)      — ground-truth keypoint coordinate
    sigma:  (N, K)      — predicted variance
    instance_size: (N,) — bbox diagonal or another normalizing factor
    """
    diff = (bins.unsqueeze(1) - target.unsqueeze(-1)).abs()           # (N, K, B)
    s = instance_size.view(-1, 1, 1)
    laplace = torch.exp(-diff / (2 * sigma.unsqueeze(-1) * s)) / sigma.unsqueeze(-1)
    likelihood = (prob * laplace).sum(-1).clamp_min(eps)              # (N, K)
    return -likelihood.log().mean()
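
To see what Dynamic Bin Allocation buys in practice, compare the spacing between neighbouring bins when they span the whole image against the spacing when they span only the expanded bbox. This is a back-of-the-envelope sketch, assuming the `torch.linspace` layout used in `get_bin_coords` above, the 1.25 expansion factor, and a 640-pixel-wide input. The helper `bin_spacing_px` is hypothetical:

```python
def bin_spacing_px(extent_px, num_bins):
    """Distance between neighbouring bins when `num_bins` points span `extent_px`."""
    return extent_px / (num_bins - 1)


BX = 192        # number of x-bins
EXPAND = 1.25   # bbox expansion factor
IMAGE_W = 640   # input width in pixels

static = bin_spacing_px(IMAGE_W, BX)               # bins spread over the whole image
for bbox_w in (50, 100, 400):
    dynamic = bin_spacing_px(bbox_w * EXPAND, BX)  # bins spread over the expanded bbox
    print(f"bbox width {bbox_w:>3}px: dynamic {dynamic:.2f}px vs static {static:.2f}px")
```

The smaller the person, the finer the effective resolution, which is exactly where a static, image-wide bin layout wastes most of its bins.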

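🔑 Three implementation details to note:

phi_x and phi_y are separate FC layers (the x and y axes have different semantics and different numbers of bins)

sigma_head applies .exp() to guarantee σ > 0 (learning in log-space is more stable)

In the MLE loss, instance_size makes the loss scale invariant to person size: the same pixel error means something different for a large person than for a small one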
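
Why does a learnable σ end up large on hard samples? As a simplified illustration (an assumption made for clarity, not the full loss), suppose all predicted probability sits on a single bin at distance $d$ from the target. The negative log-likelihood from `mle_loss` then reduces to $\log\sigma + d / (2\sigma s)$, which is minimized at $\sigma^* = d / (2s)$, so the best σ grows linearly with the error. The sketch below checks this numerically; `nll` and `best_sigma` are hypothetical helper names:

```python
import math


def nll(d, sigma, s):
    """Negative log-likelihood when all probability sits on one bin at distance d."""
    return math.log(sigma) + d / (2.0 * sigma * s)


def best_sigma(d, s, grid=None):
    """Grid-search the sigma that minimizes the simplified NLL."""
    grid = grid or [0.01 * i for i in range(1, 2001)]   # sigma in (0, 20]
    return min(grid, key=lambda sig: nll(d, sig, s))


S = 1.0  # instance size, fixed for this illustration
for d in (0.5, 2.0, 8.0):
    print(f"error {d}: numeric sigma* = {best_sigma(d, S):.2f}, analytic d/(2s) = {d / (2 * S):.2f}")
```

Easy samples (small $d$) are pushed toward a small σ and a sharp penalty, while hard samples can lower their loss by raising σ, which is the self-balancing behaviour described above.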

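Limitations and future directions

Current limitations

1. The number of bins is hard-coded

$B_x = 192, B_y = 256$ works for COCO's 17 keypoints, but whole-body (133 keypoints, Halpe / COCO-Wholebody) or hand pose may need re-tuning.

2. P3 is dropped → degradation on small / distant people

P3 was dropped for speed, so accuracy degrades noticeably on very small people (< 32px). If your use case includes long-range surveillance, you may need to re-enable P3.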
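
The cost of re-enabling P3 is easy to estimate by counting grid cells. This is a rough sketch, assuming a square 640×640 input and that each feature level has `input_size // stride` cells per side; `num_grids` is a hypothetical helper:

```python
def num_grids(input_size, strides):
    """Total number of grid cells across the given feature-map strides."""
    return sum((input_size // s) ** 2 for s in strides)


print(num_grids(640, (16, 32)))     # P4 + P5        -> 2000
print(num_grids(640, (8, 16, 32)))  # P3 + P4 + P5   -> 8400
```

Adding the stride-8 level roughly quadruples the number of candidate grids the head has to process, which is why it was the first thing cut for speed.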

3. The Hybrid Encoder's attention is memory-hungry

With high-resolution input (e.g. 800×800), self-attention in the Hybrid Encoder remains the memory bottleneck. The 480-800 multi-scale training used in the RTMO paper already reflects this trade-off.

4. Single-frame, no temporal smoothing

Video applications will show jitter and need an external temporal smoother (e.g. a One-Euro filter).
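
For reference, here is a minimal One-Euro filter for a single scalar signal. It is a generic sketch of the published filter, not part of RTMO or MMPose. The default parameter values are illustrative assumptions, timestamps must be strictly increasing, and you would run one instance per keypoint coordinate:

```python
import math


class OneEuroFilter:
    """One-Euro filter for a scalar signal (one instance per keypoint coordinate)."""

    def __init__(self, min_cutoff=1.0, beta=0.01, d_cutoff=1.0):
        self.min_cutoff = min_cutoff   # lower -> smoother when the signal moves slowly
        self.beta = beta               # higher -> less lag when the signal moves fast
        self.d_cutoff = d_cutoff       # cutoff frequency for the derivative estimate
        self.x_prev = None
        self.dx_prev = 0.0
        self.t_prev = None

    @staticmethod
    def _alpha(cutoff, dt):
        # Smoothing factor of a first-order low-pass filter
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x, t):
        if self.x_prev is None:        # first sample: nothing to smooth yet
            self.x_prev, self.t_prev = x, t
            return x
        dt = t - self.t_prev
        dx = (x - self.x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Speed-adaptive cutoff: fast motion raises the cutoff and reduces lag
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev, self.t_prev = x_hat, dx_hat, t
        return x_hat


# Synthetic jittery x-coordinate of one keypoint at 30 FPS
raw = [100.0 + (1.5 if i % 2 else -1.5) for i in range(30)]
f = OneEuroFilter(min_cutoff=1.0, beta=0.0)
smooth = [f(x, i / 30.0) for i, x in enumerate(raw)]
print(max(raw) - min(raw), max(smooth[10:]) - min(smooth[10:]))
```

In a multi-person setting the filter state has to be keyed by a track ID, so some form of cross-frame association is needed before smoothing.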

Future directions

3D Multi-Person Pose: extend DCC to 3D bins (dual 1-D heatmap → triple 1-D heatmap)
Whole-Body Variant: the 133 keypoints of COCO-Wholebody call for a more efficient DCC implementation
Sparse DCC: DCC currently runs on every positive grid; token routing could cut FLOPs further
Distillation with RTMPose: use RTMO's dense features to supervise RTMPose's top-down branch (or the reverse)

Technical takeaways

1. Hybrid Paradigm Wins

2. Dynamic > Static

3. The Likelihood Framework Gives Self-Paced Learning for Free

4. An Unexpected Use of Positional Encoding

Summary

RTMO uses an elegant recipe to remove the long-standing accuracy bottleneck of one-stage MPPE:

Dynamic Coordinate Classifier: makes SimCC's dual 1-D heatmap input-adaptive, resolving the tension between bin wastage and coverage in dense prediction
MLE Loss with Learnable Variance: replaces KLD with a likelihood-based formulation and gets hard / easy sample balancing for free
YOLOX-style Training + Two-Stage Schedule: a mature detection framework plus proxy refinement with a self-distillation flavor

Practical value

🎯 74.8% AP @ 141 FPS (V100, TRT FP16): currently the strongest real-time MPPE model
⚡ Stable multi-person latency: scenes with 10+ people differ from single-person scenes by 0.5%
🏆 77.2% AP on the CrowdPose hard split: SOTA for crowded scenes
🛠️ Official MMPose support: production-ready, with working ONNX / TRT export

Which one should you pick?

Scenario	Recommendation	Reason
Single person / mobile AR coaching	RTMPose-s	Fastest single-person speed
Sports analytics (5-20 people)	RTMO-m	Constant latency + mid-range accuracy
Concert / rally surveillance	RTMO-l †	Crowded-scene SOTA
Whole-body / hand required	RTMPose-Wholebody	RTMO has no whole-body variant yet
Edge device (Jetson Nano)	RTMO-s	9.9M params + TRT FP16
