Billy Tse
May 24, 2026 • 28 min read

RTMO: How to Fit Coordinate Classification into YOLO for One-Stage, Real-Time Multi-Person Pose Estimation

A deep dive into RTMO (CVPR 2024): the first one-stage, real-time multi-person pose estimation framework to integrate dual 1-D heatmap coordinate classification into YOLO. With a Dynamic Coordinate Classifier (DBA + DBE), an MLE loss with learnable variance, and YOLOX-style training, it reaches 74.8% AP at 141 FPS (V100) on COCO val2017.

Computer Vision · AI · Image Processing

Paper source: Tsinghua SIGS / Shanghai AI Laboratory / Nanyang Technological University
arXiv:2312.07526 (CVPR 2024)
Code: github.com/open-mmlab/mmpose/tree/main/projects/rtmo
Same series: RTMPose (the top-down sibling)

TL;DR

In the previous blog series we looked at how Person ReID went from OSNet to DINOv2, which is human-centric vision at the identity level. This time we move to the keypoint level: multi-person pose estimation.

RTMO was presented at CVPR 2024, and its core question is simple: top-down is too slow, while bottom-up / one-stage is not accurate enough. Can we have both?

Key points:

  • 🎯 One-stage + Coordinate Classification: the first time dual 1-D heatmap classification (the SimCC / RTMPose trick) is built into YOLO-based dense prediction
  • 🧩 Dynamic Coordinate Classifier (DCC): Dynamic Bin Allocation (bins follow the bbox size) + Dynamic Bin Encoding (sine PE + FC generate a per-bin representation)
  • 📐 MLE Loss with Learnable Variance: Maximum Likelihood Estimation replaces KLD, and each sample learns its own uncertainty, with large variance for hard samples and small variance for easy ones
  • ⚡ Real-time, with no slowdown as people are added: RTMO-l reaches 74.8% AP @ 141 FPS (V100) on COCO val2017, and in scenes with 10+ people latency rises by only 0.1 ms
  • 🏆 CrowdPose SOTA: RTMO-l + extra data reaches 83.8% AP on CrowdPose, far ahead of ED-Pose (Swin-L, 218M params)


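Background: Three Routes to Multi-Person Pose Estimation

Before RTMO, real-time multi-person pose estimation (MPPE) fell into three camps:

| Paradigm | Accuracy | Speed vs. number of people | Pipeline complexity |
| --- | --- | --- | --- |
| Top-down | ⭐⭐⭐⭐⭐ | Linear in N (more people, slower) | Needs detector + cropper + pose net |
| Bottom-up | ⭐⭐⭐ | Constant, but grouping is slow | Needs keypoint grouping (PAF / heatmap matching) |
| One-stage | ⭐⭐⭐ (before RTMO) | Constant | A single forward pass |

💡 The theoretical advantage of one-stage is clear: latency does not depend on the number of people, and the pipeline is simple. But before RTMO, every real-time one-stage method (YOLO-Pose, KAPAO, YOLOX-Pose) regressed keypoint coordinates directly with a fully connected layer. That amounts to assuming the keypoint location follows a Dirac delta distribution, which ignores the inherent uncertainty of the annotations. The result: fast, but the AP never caught up with top-down.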

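Coordinate Classification: Top-down's Secret Weapon

SimCC (2022) and RTMPose (2023) both use a technique called the dual 1-D heatmap, which has been very successful in the top-down setting:

  • Split the x axis into $B_x$ bins and the y axis into $B_y$ bins
  • For each keypoint, output two 1-D probability heatmaps (one for x, one for y)
  • Sub-pixel bins give high spatial resolution without needing a high-res 2D heatmap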
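To make the dual 1-D heatmap idea concrete, here is a minimal sketch in plain Python. It is not SimCC's or RTMPose's actual code: the function names, the 192×256 crop size, the choice of 2 bins per pixel, and the Gaussian width `sigma_bins` are all assumptions made for illustration.

```python
import math

def gaussian_label(coord, num_bins, extent, sigma_bins=2.0):
    """Soft 1-D target: a Gaussian over bin centers, normalized to sum to 1."""
    bin_width = extent / num_bins
    centers = [(i + 0.5) * bin_width for i in range(num_bins)]
    sigma = sigma_bins * bin_width
    weights = [math.exp(-((c - coord) ** 2) / (2 * sigma ** 2)) for c in centers]
    total = sum(weights)
    return centers, [w / total for w in weights]

def decode_argmax(centers, probs):
    """Decode a coordinate as the center of the most likely bin."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return centers[best]

# Hypothetical 192x256 crop with 2 bins per pixel on each axis (sub-pixel bins).
x_centers, x_probs = gaussian_label(coord=100.3, num_bins=384, extent=192.0)
y_centers, y_probs = gaussian_label(coord=57.8, num_bins=512, extent=256.0)

# One keypoint = two 1-D distributions, decoded independently per axis.
keypoint = (decode_argmax(x_centers, x_probs), decode_argmax(y_centers, y_probs))
print(keypoint)
```

In this toy setup each keypoint needs 384 + 512 = 896 values, whereas a 2D heatmap at the same resolution would need 384 × 512 values, which is why the 1-D factorization can afford sub-pixel bins.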

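Here is the catch: dropping this technique straight into one-stage (dense prediction) breaks in three ways:

  1. Bin wastage: bins are spread over the whole image, but each person occupies only a few grid cells, so most bins are useless for any given instance
  2. The DFL [21] fix is not enough: bins are pinned to a fixed range around the anchor, so large people overflow it and small people suffer large quantization error
  3. KLD loss treats every sample alike: in dense prediction the difficulty of each grid varies widely (position, size, pose), and KLD has no way to tell them apart

🎯 RTMO's thesis: if these three incompatibilities can be resolved, the accuracy of coordinate classification can be combined with the speed of one-stage, giving top-down-level accuracy at one-stage-level speed.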

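RTMO Architecture Overview

RTMO keeps YOLO's dense grid prediction overall, but the keypoint branch of its head adds a Dynamic Coordinate Classifier.

Three Key Design Decisions

| Decision | Choice | Reason |
| --- | --- | --- |
| Backbone | CSPDarknet (the YOLOX one) | Aligns with the YOLO ecosystem, which eases deployment |
| Neck | Hybrid Encoder (RT-DETR) | Combines self-attention (global) with FPN (local) |
| Feature levels | P4 + P5 only, P3 dropped | P3 takes 78.5% of head FLOPs but contributes only 10.7% of correct detections |

🔑 The P3 story: the conventional feature-pyramid intuition is that shallow features catch small people, but RTMO's ablation (Table 4) found that P4 + P5 already cover small people, so the FLOPs spent on P3 are largely wasted. Going from 3 feature levels to 2, RTMO-l's CPU latency drops from 186 ms to 125 ms while AP falls by only 0.2%. Lesson: more feature-pyramid levels are not always better.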

Core Innovation 1: Dynamic Coordinate Classifier (DCC)

DCC is the technical core of RTMO. It adapts SimCC's dual 1-D heatmap to dense prediction:

A. Dynamic Bin Allocation (DBA)

The traditional approach (SimCC / RTMPose): bins cover the whole input image. In top-down the input is already cropped, so this is fine.

The DFL approach: bins are pinned to a predefined range around the anchor. Large people do not fit, and small people suffer large quantization error.

The RTMO approach: each grid first regresses a bounding box, the box is expanded by 1.25× (to cover inaccurate predictions), and the bins are then divided uniformly inside the expanded box.

$$x_i = x_l + (x_r - x_l) \cdot \frac{i-1}{B_x - 1}, \quad i = 1, \ldots, B_x$$

where $x_l, x_r$ are the left and right edges of the expanded bounding box. All models use $B_x = 192, B_y = 256$.

🎯 What makes DBA elegant: the spatial location of the bins becomes input-dependent. Small people get fine-grained bins and large people get wider bins, so the quantization error scales automatically with instance size.

B. Dynamic Bin Encoding (DBE)

A new problem appears: the bin coordinates now differ for every grid, so a single set of shared learnable embeddings can no longer represent each bin (which is what SimCC / RTMPose do).

RTMO uses a sine positional encoding plus a learnable FC to generate each bin's representation on the fly:

$$[\boldsymbol{PE}(x_i)]_c = \begin{cases} \sin(x_i / t^{c/C}), & c \text{ even} \\ \cos(x_i / t^{(c-1)/C}), & c \text{ odd} \end{cases}$$

which is then refined by a fully connected layer $\phi$:

$$\hat{p}_k(x_i) = \frac{\exp(\boldsymbol{f}_k \cdot \boldsymbol{\phi}(\boldsymbol{PE}(x_i)))}{\sum_{j=1}^{B_x} \exp(\boldsymbol{f}_k \cdot \boldsymbol{\phi}(\boldsymbol{PE}(x_j)))}$$

where $\boldsymbol{f}_k$ is the feature of the $k$-th keypoint (refined by a GAU module, as in RTMPose).

💡 An analogy: this resembles the query × key dot product in a Transformer. The keypoint feature is the query, the bin's PE is the key, and the softmax output plays the role of attention weights, that is, the probability that the keypoint falls in that bin. RTMO applies the Transformer's positional encoding to spatial bins, which is a neat piece of cross-pollination.
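The two formulas above can be sketched in a few lines of plain Python. This is an illustration only: `identity_phi` stands in for the learned FC layer $\phi$, the 8-channel encoding and the normalized bin coordinates are toy choices, and the keypoint feature `f_k` is hand-built as the encoding of position 0.4 so that the expected peak is known in advance.

```python
import math

def sine_pe(x, channels=8, temperature=10000.0):
    """PE(x)_c = sin(x / t^(c/C)) for even c, cos(x / t^((c-1)/C)) for odd c."""
    out = []
    for c in range(channels):
        exponent = (c if c % 2 == 0 else c - 1) / channels
        angle = x / (temperature ** exponent)
        out.append(math.sin(angle) if c % 2 == 0 else math.cos(angle))
    return out

def bin_probabilities(f_k, bin_coords, phi):
    """Softmax over bins of the dot product f_k . phi(PE(x_i))."""
    logits = []
    for x in bin_coords:
        emb = phi(sine_pe(x, channels=len(f_k)))
        logits.append(sum(f * e for f, e in zip(f_k, emb)))
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def identity_phi(vec):
    """Hypothetical stand-in for the learned FC layer (no learning here)."""
    return vec

bins = [i / 10 for i in range(11)]   # toy normalized bin coordinates 0.0 .. 1.0
f_k = sine_pe(0.4, channels=8)       # toy keypoint feature "pointing at" x = 0.4
probs = bin_probabilities(f_k, bins, identity_phi)
peak = max(range(len(probs)), key=probs.__getitem__)
print(peak, bins[peak])
```

Because the dot product of two sine encodings is a sum of cosines of their coordinate difference, it is largest when the bin coordinate matches the encoded position, so the probability peaks at the bin for 0.4. With an untrained identity `phi` the distribution is quite flat; in the real model the FC layer and the keypoint feature are learned.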

C. Working Through Concrete Numbers

Suppose the image contains one person whose predicted bbox is $(x_l, x_r) = (100, 300)$ (width 200 px). After the 1.25× expansion:

  • The new edges are $x_l = 75, x_r = 325$ (width = 250)
  • $B_x = 192$, so adjacent bins are about 1.3 px apart
  • Bin 1 sits at 75 px, bin 96 at about 199.3 px (just left of the person's center at 200 px), and bin 192 at 325 px

Compared with SimCC: if the input image is 640×640 and SimCC splits the full 640 px into, say, 384 bins (1.67 px per bin), a 200 px wide person can use only 120 bins, so 264 bins are wasted.

Compared with DFL: with a fixed range of, say, ±32 px around the anchor, a 200 px person does not fit at all.

RTMO: all 192 bins are spent on this person's 250 px expanded box, so every bin is useful and the quantization error is about 0.65 px.
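The arithmetic in this worked example can be checked with a short script. This is a sketch that follows the DBA formula given earlier; the helper name `allocate_bins` is made up for illustration.

```python
def allocate_bins(lo, hi, num_bins, expand=1.25):
    """Spread num_bins coordinates uniformly over the box [lo, hi] expanded about its center."""
    center = (lo + hi) / 2
    half = (hi - lo) * expand / 2
    left, right = center - half, center + half
    step = (right - left) / (num_bins - 1)
    return [left + step * i for i in range(num_bins)], step

bins, step = allocate_bins(100.0, 300.0, num_bins=192)

print(bins[0])                 # left edge of the expanded box: 75.0
print(round(step, 3))          # spacing between adjacent bins: 1.309
print(round(bins[95], 2))      # bin 96 (1-indexed): 199.35
print(round(step / 2, 2))      # worst-case rounding error to the nearest bin: 0.65
```

The last bin lands on the right edge at 325 px (up to floating-point rounding), and half the bin spacing gives the roughly 0.65 px quantization error quoted in the text.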

Core Innovation 2: MLE Loss with Learnable Variance

This is the other elegant contribution. Standard coordinate classification uses Gaussian label smoothing with a KLD loss:

$$p_k(x_i \mid \mu_x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x_i - \mu_x)^2}{2\sigma^2}\right)$$

RTMO builds on one key observation: a Gaussian is symmetric about its mean, so $p_k(x_i \mid \mu_x) = p_k(\mu_x \mid x_i)$. The "target distribution" can therefore be reinterpreted as the "annotation likelihood under a Gaussian error model".

If the prediction $\hat{p}_k(x_i)$ is treated as a prior over $x_i$, the marginal likelihood of the annotation $\mu_x$ is:

$$P(\mu_x) = \sum_{i=1}^{B_x} P(\mu_x \mid x_i) P(x_i) = \sum_{i=1}^{B_x} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i - \mu_x)^2}{2\sigma^2}} \hat{p}_k(x_i)$$

Maximizing this likelihood fits the model to the true distribution of the annotations. In practice RTMO uses a Laplace distribution and the negative log-likelihood:

$$\mathcal{L}_{\text{mle}}^{(x)} = -\log\left[\sum_{i=1}^{B_x} \frac{1}{\hat{\sigma}} \exp\left(-\frac{|x_i - \mu_x|}{2\hat{\sigma} s}\right) \hat{p}_k(x_i)\right]$$

where $s$ is the instance size (it normalizes the error) and $\hat{\sigma}$ is the variance predicted by the model (its uncertainty).
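As a sanity check on the Laplace form of the loss, here is a minimal scalar sketch for a single keypoint along one axis. It mirrors the formula term by term; the function name, the `eps` clamp, and the toy bins are assumptions for illustration, not the paper's implementation.

```python
import math

def mle_loss_1d(probs, bin_coords, target, sigma, size, eps=1e-9):
    """-log sum_i (1/sigma) * exp(-|x_i - mu| / (2 * sigma * s)) * p_i for one keypoint axis."""
    likelihood = sum(
        p * math.exp(-abs(x - target) / (2.0 * sigma * size)) / sigma
        for p, x in zip(probs, bin_coords)
    )
    return -math.log(max(likelihood, eps))  # clamp to avoid log(0)

bins = [float(i) for i in range(11)]   # toy bin coordinates 0 .. 10
target = 5.0                           # annotated coordinate mu_x

on_target = [1.0 if x == 5.0 else 0.0 for x in bins]   # all mass on the right bin
off_target = [1.0 if x == 8.0 else 0.0 for x in bins]  # all mass 3 units away

loss_good = mle_loss_1d(on_target, bins, target, sigma=1.0, size=1.0)
loss_bad = mle_loss_1d(off_target, bins, target, sigma=1.0, size=1.0)
print(loss_good < loss_bad)
```

With `sigma = 1` and `size = 1`, the on-target prediction has likelihood 1 (loss 0), while the prediction 3 units away has likelihood exp(-1.5), so its loss is 1.5.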

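Why Does Learnable Variance Matter So Much?

🔑 The fundamental difference between MLE and KLD:

  • KLD + learnable σ: the model can get lazy and predict a large σ for every sample, which flattens the target distribution, keeps the loss small, and makes training collapse

  • MLE + learnable σ: σ̂ appears both in the 1/σ̂ normalizer and inside the exponent of the likelihood, so the model cannot cheat. Predicting a large σ̂ on a hard sample really does reduce the loss, but predicting a large σ̂ on an easy sample is penalized. This is a natural implementation of self-paced learning.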
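One way to see this trade-off is a short derivation under a simplifying assumption that is mine, not the paper's: suppose the prediction puts all its mass on a single bin $x_j$ at distance $d = |x_j - \mu_x|$ from the annotation. The Laplace negative log-likelihood defined earlier then reduces to a function of $\hat{\sigma}$ alone:

$$\mathcal{L}(\hat{\sigma}) = \log \hat{\sigma} + \frac{d}{2 \hat{\sigma} s}, \qquad \frac{\partial \mathcal{L}}{\partial \hat{\sigma}} = \frac{1}{\hat{\sigma}} - \frac{d}{2 \hat{\sigma}^2 s} = 0 \;\Rightarrow\; \hat{\sigma}^{*} = \frac{d}{2s}$$

In this idealized case the loss-minimizing $\hat{\sigma}$ grows linearly with the size-normalized error $d / s$. For a hard sample (large $d$), raising $\hat{\sigma}$ toward $d / (2s)$ lowers the loss; for an easy sample ($d \to 0$) the loss is approximately $\log \hat{\sigma}$, so inflating $\hat{\sigma}$ only adds a penalty.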

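Ablation Data (RTMO-s, COCO val2017)

| Decoding | Loss | COCO AP | CrowdPose AP |
| --- | --- | --- | --- |
| Regression | OKS | 65.6 | 66.1 |
| CC + DBA + DBE | KLD | 64.4 | 62.5 |
| CC (static bins) | MLE | 66.7 | 65.8 |
| CC + DBA | MLE | 65.6 | 65.2 |
| CC + DBA + DBE | MLE | 67.6 | 67.2 |

🎯 Three takeaways:

  1. MLE > KLD: a 3.2% AP gap on COCO (67.6 vs. 64.4), which suggests learnable variance really does address the hard/easy imbalance

  2. DBA alone hurts: adding DBA without DBE makes things worse (66.7 → 65.6 COCO AP), because the bin positions change while their representations do not, which breaks the semantics

  3. DBA + DBE only work together: the two components depend on each other, so when the bin positions move, the representations have to move with them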

Training Pipeline: YOLOX Tricks + Two-Stage Training

RTMO's training borrows mature techniques from the YOLO family and adds a few pose-specific adjustments:

Label Assignment: Extended SimOTA

YOLOX uses SimOTA to assign positive grids dynamically. RTMO changes the cost behind the score from bbox quality alone to a combination of bbox and pose quality:

  • Score branch: Varifocal Loss, where the target is the OKS (Object Keypoint Similarity) between the predicted pose and the GT pose
  • BBox branch: IoU loss
  • Keypoint branch (DCC): MLE loss
  • Visibility branch: BCE loss

Proxy Regression: Avoiding OOM

DCC's computation scales with every grid × every keypoint × ($B_x + B_y$) bins, which is too expensive to run across a full dense prediction. RTMO's trick:

  1. Add a lightweight pointwise conv that does keypoint regression (kpt_reg)
  2. SimOTA uses kpt_reg to pick positive grids (so DCC does not have to run on every grid)
  3. DCC runs only on positive grids and outputs decoded keypoints (kpt_dec)
  4. Proxy loss: $\mathcal{L}_{\text{proxy}} = 1 - \text{OKS}(\text{kpt}_{\text{reg}}, \text{kpt}_{\text{dec}})$, which keeps the proxy consistent with the DCC output
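The proxy loss in step 4 relies on OKS. A minimal sketch of the standard COCO-style OKS and the resulting proxy loss is below; the two-keypoint instance, its area, and the function names are illustrative assumptions, and the per-keypoint constants are twice the COCO sigmas for the nose (0.026) and an eye (0.025).

```python
import math

def oks(pred, target, area, k, visible=None):
    """COCO-style OKS: mean over labelled keypoints of exp(-d^2 / (2 * area * k_i^2))."""
    if visible is None:
        visible = [True] * len(pred)
    terms = [
        math.exp(-((px - tx) ** 2 + (py - ty) ** 2) / (2.0 * area * ki ** 2))
        for (px, py), (tx, ty), ki, v in zip(pred, target, k, visible)
        if v
    ]
    return sum(terms) / len(terms) if terms else 0.0

def proxy_loss(kpt_reg, kpt_dec, area, k):
    """L_proxy = 1 - OKS(kpt_reg, kpt_dec): pulls the cheap regression toward the DCC output."""
    return 1.0 - oks(kpt_reg, kpt_dec, area, k)

# Toy instance with two keypoints (nose, eye); k_i = 2 * COCO sigma_i.
k = [0.052, 0.050]
area = 200.0 * 400.0                       # hypothetical instance area in px^2
kpt_reg = [(100.0, 100.0), (110.0, 95.0)]  # output of the lightweight regression branch
kpt_dec = [(101.0, 100.0), (110.0, 97.0)]  # keypoints decoded by DCC

loss = proxy_loss(kpt_reg, kpt_dec, area, k)
print(loss)
```

OKS is 1 when the two poses coincide and decays toward 0 as they drift apart relative to the instance size, so the proxy loss stays in [0, 1].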

Total Loss

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{bbox}} + \lambda_2 \mathcal{L}_{\text{mle}} + \lambda_3 \mathcal{L}_{\text{proxy}} + \lambda_4 \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{vis}}$$

where $\lambda_1 = \lambda_2 = 5, \lambda_3 = 10, \lambda_4 = 2$.
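As a small illustration of how the weights combine, the sketch below applies them to made-up per-branch loss values. The dictionary keys and the numbers in `example_terms` are placeholders, not measurements from the paper.

```python
# Weights from the total-loss formula: lambda_1 = lambda_2 = 5, lambda_3 = 10, lambda_4 = 2,
# and the visibility loss enters with weight 1.
LOSS_WEIGHTS = {"bbox": 5.0, "mle": 5.0, "proxy": 10.0, "cls": 2.0, "vis": 1.0}

def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the per-branch losses; raises if a branch is missing."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)

# Placeholder values purely for illustration.
example_terms = {"bbox": 0.5, "mle": 1.2, "proxy": 0.3, "cls": 0.4, "vis": 0.2}
combined = total_loss(example_terms)
print(combined)
```

With these placeholder values the weighted terms are 2.5, 6.0, 3.0, 0.8 and 0.2, which shows how strongly the MLE and proxy terms dominate the sum at equal raw magnitudes.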

Two-Stage Training Schedule

| Stage | Proxy target | Learning rate | Purpose |
| --- | --- | --- | --- |
| Stage 1 | GT pose annotations | 4e-3 | Bootstrap: proxy and DCC both learn from GT |
| Stage 2 | DCC decoded pose | 5e-4 → 2e-4 (cosine) | Refine: proxy follows DCC, and DCC learns finer detail |
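The Stage 2 row can be sketched as a cosine decay plus a switch of the proxy target. This is a generic cosine-annealing formula, not RTMO's training code: `total_steps`, the function names, and the string placeholders for poses are hypothetical, and the source does not say how many epochs each stage lasts.

```python
import math

def cosine_lr(step, total_steps, lr_start=5e-4, lr_end=2e-4):
    """Cosine decay from lr_start (step 0) to lr_end (step == total_steps)."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * t))

def proxy_target(stage, gt_pose, dcc_pose):
    """Stage 1 supervises the proxy with GT; Stage 2 switches to the DCC-decoded pose."""
    return gt_pose if stage == 1 else dcc_pose

total_steps = 1000  # hypothetical length of Stage 2
schedule = [cosine_lr(s, total_steps) for s in (0, 500, 1000)]
print(schedule)
print(proxy_target(2, gt_pose="gt", dcc_pose="dcc"))
```

The learning rate starts at 5e-4, passes through the midpoint 3.5e-4 halfway through, and ends at 2e-4.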

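💡 Stage 2 has a self-distillation flavor: the proxy switches from learning GT to learning the DCC output, on the argument that the DCC output is already more informative than GT because it carries uncertainty. This is similar to the knowledge-distillation insight that soft labels beat hard labels.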

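Results: Beating Everyone in the Real-Time Regime

COCO test-dev: King of One-Stage

| Method | Backbone | Params | Time (ms) | AP |
| --- | --- | --- | --- | --- |
| YOLO-Pose-s | CSPDarknet | 10.8M | 7.9 | 63.2 |
| YOLO-Pose-l | CSPDarknet | 61.3M | 20.5 | 70.2 |
| KAPAO-l | CSPNet | 77.0M | 50.2 | 70.3 |
| PETR | Swin-L | 213.8M | 133 | 70.5 |
| ED-Pose | Swin-L | 218.0M | 265.6 | 72.7 |
| RTMO-s | CSPDarknet | 9.9M | 8.9 | 66.9 |
| RTMO-m | CSPDarknet | 22.6M | 12.4 | 70.1 |
| RTMO-l | CSPDarknet | 44.8M | 19.1 | 71.6 |
| RTMO-l † | CSPDarknet | 44.8M | 19.1 | 73.3 |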

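🚀 Key observations:

  • RTMO-s is about as fast as YOLO-Pose-s (8.9 vs. 7.9 ms) but its AP is 3.7% higher

  • RTMO-l is 14× faster than ED-Pose-Swin-L (19.1 vs. 265.6 ms), with AP only 1.1% lower

  • RTMO-l is 7× faster than PETR-Swin-L, with AP 1.1% higher

  • With extra training data (†), RTMO-l reaches 73.3% AP, ahead of every other one-stage method in the table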
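The ratios in these observations follow directly from the table; a short check is below. The numbers are copied from the COCO test-dev table, and the helper names are made up for this sketch.

```python
# (latency in ms, AP) copied from the COCO test-dev table.
results = {
    "YOLO-Pose-s": (7.9, 63.2),
    "PETR": (133.0, 70.5),
    "ED-Pose": (265.6, 72.7),
    "RTMO-s": (8.9, 66.9),
    "RTMO-l": (19.1, 71.6),
}

def speedup(fast, slow):
    """How many times lower the latency of `fast` is than that of `slow`."""
    return results[slow][0] / results[fast][0]

def ap_gap(a, b):
    """AP of `a` minus AP of `b`, in percentage points."""
    return results[a][1] - results[b][1]

print(round(speedup("RTMO-l", "ED-Pose"), 1))     # 13.9
print(round(speedup("RTMO-l", "PETR"), 1))        # 7.0
print(round(ap_gap("RTMO-s", "YOLO-Pose-s"), 1))  # 3.7
print(round(ap_gap("RTMO-l", "ED-Pose"), 1))      # -1.1
print(round(ap_gap("RTMO-l", "PETR"), 1))         # 1.1
```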

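Multi-Person Scenes: The Noose Around Top-down

RTMO compared against the combination of RTMPose (top-down) + RTMDet-nano (detector):

⚡ With 10+ people, RTMO-l's GPU latency is only 0.1 ms higher than with 1 person (0.5% of the total). This is the constant-time advantage of one-stage. For surveillance, sports analytics, AR and similar scenarios, this property matters more than having the lowest possible single-person latency.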

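CrowdPose: Dominating Crowded Scenes

| Method | Params | AP | AP_Easy | AP_Medium | AP_Hard |
| --- | --- | --- | --- | --- | --- |
| HRNet (top-down) | 28.5M | 71.3 | 80.5 | 71.4 | 62.5 |
| DEKR (bottom-up) | 65.7M | 67.3 | 74.6 | 68.1 | 58.7 |
| ED-Pose Swin-L | 218.0M | 73.1 | 80.5 | 73.8 | 63.8 |
| RTMO-l | 44.8M | 73.2 | 79.2 | 74.1 | 65.3 |
| RTMO-l † | 44.8M | 83.8 | 88.8 | 84.7 | 77.2 |

🏆 RTMO-l † reaches 77.2% on CrowdPose AP_Hard. The hard split contains the most crowded, most heavily occluded scenes. That is 13.4% above ED-Pose-Swin-L with a model about 5× smaller, although the † model also benefits from extra training data. It is strong evidence for the robustness of DCC + MLE on hard samples.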

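An In-Depth Comparison with Its Cousin, RTMPose

RTMO and RTMPose come from the same team (OpenMMLab / Shanghai AI Lab) and share many design ideas, but they solve different problems:

| Property | RTMPose | RTMO |
| --- | --- | --- |
| Paradigm | Top-down | One-stage |
| Needs a detector? | ✅ Required | ❌ Not needed |
| Coordinate classification | SimCC (static bins) | DCC (dynamic bins + encoding) |
| Loss | KLD with Gaussian smoothing | MLE with learnable variance |
| Feature refinement | GAU | GAU |
| Latency vs. people count | Linear (more people, slower) | Constant |
| 1-person speed | Fastest | Slightly slower |
| Multi-person speed | Degrades | Stable |
| Best use case | Single person / few people / mobile | Surveillance / sports / concerts |

🎯 Which one to pick? A simple rule of thumb:

  • ≤ 2 people per image → RTMPose (lowest single-person latency, and the detector overhead is small)

  • ≥ 4 people per image → RTMO (the constant-latency advantage starts to show)

  • Crowded scenes (occlusion + crowding) → RTMO, whose dense grid + MLE robustness wins clearly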

Hands-on Guide: Running RTMO with MMPose

Method 1: Off-the-shelf Inference

```python
# Install
# pip install mmcv mmengine
# pip install "mmpose>=1.3.0"   # quoted so the shell does not treat > as a redirect
# pip install mmdet             # no detector is needed, but the demo script uses it

from mmpose.apis import MMPoseInferencer

# Load RTMO-l in one line
inferencer = MMPoseInferencer(
    pose2d='rtmo-l_16xb16-600e_body7-640x640',
    device='cuda:0'
)

# Inference
results = inferencer(
    'crowd.jpg',
    show=False,
    out_dir='outputs/',
    radius=4,
    thickness=2
)

for result in results:
    poses = result['predictions'][0]
    print(f"Detected {len(poses)} people")
    for p in poses:
        keypoints = p['keypoints']        # (17, 2)
        scores = p['keypoint_scores']     # (17,)
        bbox = p['bbox']
```

Method 2: Train It Yourself (COCO)

```bash
# Clone MMPose
git clone https://github.com/open-mmlab/mmpose.git
cd mmpose/projects/rtmo

# Download the pretrained backbone (YOLOX-l on COCO det)
# then launch training directly
bash tools/dist_train.sh \
    configs/rtmo-l_16xb16-600e_coco-640x640.py \
    8  # 8 GPUs
```

Key config excerpt:

```python
# Core part of rtmo-l_16xb16-600e_coco-640x640.py
model = dict(
    type='BottomupPoseEstimator',
    backbone=dict(type='CSPDarknet', deepen_factor=1.0, widen_factor=1.0),
    neck=dict(type='HybridEncoder', encoder_cfg=dict(...)),
    head=dict(
        type='RTMOHead',
        num_keypoints=17,
        featmap_strides=(16, 32),  # P4 + P5 only
        head_module_cfg=dict(
            num_classes=1,
            in_channels=256,
            cls_feat_channels=256,
            channels_per_group=36,
            pose_vec_channels=512,
        ),
        dcc_cfg=dict(
            in_channels=256,
            feat_channels=128,
            num_bins=(192, 256),  # Bx, By
            spe_channels=128,     # sine PE dimension
        ),
        loss_mle=dict(type='MLECCLoss', use_target_weight=True, loss_weight=5.0),
        loss_bbox=dict(type='IoULoss', loss_weight=5.0),
        loss_oks=dict(type='OKSLoss', loss_weight=10.0),  # proxy loss
        loss_vis=dict(type='BCELoss', loss_weight=1.0),
        loss_cls=dict(type='VariFocalLoss', loss_weight=2.0),
    ),
)
```

Method 3: Export to ONNX (Deployment)

```bash
# Export with mmdeploy
python tools/deploy.py \
    configs/mmpose/pose-detection_rtmo_onnxruntime_dynamic-640x640.py \
    projects/rtmo/configs/rtmo-l_16xb16-600e_coco-640x640.py \
    checkpoints/rtmo-l_coco-640x640.pth \
    demo/resources/human-pose.jpg \
    --work-dir mmdeploy_models/rtmo-l \
    --device cuda \
    --dump-info
```

💡 Deployment suggestions:

  • ONNXRuntime + FP32: x86 server / desktop

  • TensorRT + FP16: NVIDIA Jetson / data-center GPU (RTMO-l on a V100 with TRT FP16 can reach ~141 FPS)

  • ONNXRuntime + OpenVINO: Intel CPU

  • Do not ship the raw PyTorch model to production: ED-Pose in the paper shows what a failed ONNX conversion looks like (1.5 s / frame)

Core Code Walkthrough: A PyTorch Skeleton of DCC

Below is a simplified Dynamic Coordinate Classifier that keeps all the essential logic:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SinePositionalEncoding(nn.Module):
    """Generate a sine PE for bin coordinates."""

    def __init__(self, num_channels=128, temperature=10000):
        super().__init__()
        self.num_channels = num_channels
        self.temperature = temperature

    def forward(self, coords):
        # coords: (N, num_bins), the (already normalized) bin coordinates of each grid
        dim_t = torch.arange(self.num_channels, device=coords.device)
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_channels)
        pos = coords.unsqueeze(-1) / dim_t  # (N, num_bins, C)
        # Even channels take sin, odd channels take cos. torch.where avoids
        # in-place slice writes, which can break autograd when coords needs grad.
        even = torch.arange(self.num_channels, device=coords.device) % 2 == 0
        return torch.where(even, pos.sin(), pos.cos())


class DynamicCoordinateClassifier(nn.Module):
    def __init__(self, in_channels=256, feat_channels=128,
                 num_keypoints=17, num_bins=(192, 256), spe_channels=128):
        super().__init__()
        self.K = num_keypoints
        self.Bx, self.By = num_bins

        # Split the pose feature into one feature vector per keypoint
        self.kpt_feat_proj = nn.Linear(in_channels, num_keypoints * feat_channels)

        # Sine PE + learnable FC (phi)
        self.spe = SinePositionalEncoding(spe_channels)
        self.phi_x = nn.Linear(spe_channels, feat_channels)
        self.phi_y = nn.Linear(spe_channels, feat_channels)

        # Variance head (predicts sigma-hat)
        self.sigma_head = nn.Linear(in_channels, num_keypoints * 2)

    def get_bin_coords(self, bboxes, expand=1.25):
        """Dynamic Bin Allocation: bins follow the expanded bbox."""
        # bboxes: (N, 4) in xyxy format
        cx = (bboxes[:, 0] + bboxes[:, 2]) / 2
        cy = (bboxes[:, 1] + bboxes[:, 3]) / 2
        w = (bboxes[:, 2] - bboxes[:, 0]) * expand
        h = (bboxes[:, 3] - bboxes[:, 1]) * expand

        xl, xr = cx - w / 2, cx + w / 2
        yt, yb = cy - h / 2, cy + h / 2

        # Divide the bins uniformly
        steps_x = torch.linspace(0, 1, self.Bx, device=bboxes.device)
        steps_y = torch.linspace(0, 1, self.By, device=bboxes.device)
        x_bins = xl[:, None] + (xr - xl)[:, None] * steps_x[None, :]  # (N, Bx)
        y_bins = yt[:, None] + (yb - yt)[:, None] * steps_y[None, :]  # (N, By)
        return x_bins, y_bins

    def forward(self, pose_feat, bboxes):
        # pose_feat: (N, C), the pose feature of each positive grid
        # bboxes:    (N, 4), the bbox predicted by that grid
        N = pose_feat.size(0)

        # ---- Step 1: Keypoint features ----
        f_k = self.kpt_feat_proj(pose_feat).view(N, self.K, -1)  # (N, K, D)

        # ---- Step 2: Dynamic bin coordinates ----
        x_bins, y_bins = self.get_bin_coords(bboxes)  # (N, Bx), (N, By)

        # ---- Step 3: Dynamic bin encoding ----
        # Bin coords go through sine PE, then FC.
        # Every sample has different bin coordinates, so this is computed on the fly.
        x_emb = self.phi_x(self.spe(x_bins))  # (N, Bx, D)
        y_emb = self.phi_y(self.spe(y_bins))  # (N, By, D)

        # ---- Step 4: Bin-keypoint similarity (softmax over bins) ----
        logits_x = torch.einsum('nkd,nid->nki', f_k, x_emb)  # (N, K, Bx)
        logits_y = torch.einsum('nkd,nid->nki', f_k, y_emb)  # (N, K, By)
        prob_x = F.softmax(logits_x, dim=-1)
        prob_y = F.softmax(logits_y, dim=-1)

        # ---- Step 5: Integral decode to coordinates ----
        kpt_x = (prob_x * x_bins.unsqueeze(1)).sum(-1)  # (N, K)
        kpt_y = (prob_y * y_bins.unsqueeze(1)).sum(-1)  # (N, K)
        keypoints = torch.stack([kpt_x, kpt_y], dim=-1)  # (N, K, 2)

        # ---- Step 6: Predict per-sample variance (needed by the MLE loss) ----
        sigma = self.sigma_head(pose_feat).view(N, self.K, 2).exp()  # ensures sigma > 0

        return {
            'keypoints': keypoints,
            'prob_x': prob_x, 'prob_y': prob_y,
            'x_bins': x_bins, 'y_bins': y_bins,
            'sigma': sigma,
        }


def mle_loss(prob, bins, target, sigma, instance_size, eps=1e-9):
    """
    prob:          (N, K, B)  predicted probability over bins
    bins:          (N, B)     bin coordinates
    target:        (N, K)     ground-truth keypoint coordinate
    sigma:         (N, K)     predicted variance for this axis
    instance_size: (N,)       bbox diagonal or another normalizing factor
    """
    diff = (bins.unsqueeze(1) - target.unsqueeze(-1)).abs()  # (N, K, B)
    s = instance_size.view(-1, 1, 1)
    laplace = torch.exp(-diff / (2 * sigma.unsqueeze(-1) * s)) / sigma.unsqueeze(-1)
    likelihood = (prob * laplace).sum(-1).clamp_min(eps)  # (N, K)
    return -likelihood.log().mean()
```

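🔑 Three implementation details to note:

  1. phi_x and phi_y are separate FC layers (the x and y axes have different semantics and different bin counts)

  2. sigma_head uses .exp() to guarantee σ > 0 (learning in log-space is more stable)

  3. In the MLE loss, instance_size makes the loss scale invariant to the person's size, since the same pixel error means different things for a large person and a small one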

Limitations and Future Directions

Current Limitations

1. The bin count is hard-coded

$B_x = 192, B_y = 256$ works for COCO's 17 keypoints, but whole-body (133 keypoints, Halpe / COCO-Wholebody) or hand pose may need re-tuning.

2. P3 is cut → degradation on small people / distant scenes

P3 was removed for speed, so scenes with very small people (< 32 px) degrade noticeably. If the use case includes long-range surveillance, P3 may need to be reactivated.

3. The Hybrid Encoder's attention is memory-hungry

With high-resolution input (e.g. 800×800), the Hybrid Encoder's self-attention is still the memory bottleneck. The 480-800 multi-scale training used in the RTMO paper already reflects this trade-off.

4. Single-frame, no temporal smoothing

Video applications will show jitter and need an external temporal smoother (One-Euro filter or similar).
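For reference, a One-Euro filter is small enough to sketch in plain Python. This follows the commonly published formulation of the filter and is not part of RTMO or MMPose; the parameter values (`min_cutoff`, `beta`, `d_cutoff`), the 30 FPS timestamps, and the jittery sample sequence are placeholders that would need tuning per application. One filter instance would be kept per keypoint coordinate.

```python
import math

def smoothing_factor(dt, cutoff):
    """Exponential-smoothing alpha for a low-pass filter with the given cutoff (Hz)."""
    r = 2.0 * math.pi * cutoff * dt
    return r / (r + 1.0)

class OneEuroFilter:
    """Speed-adaptive low-pass filter for one scalar signal (e.g. one keypoint's x)."""

    def __init__(self, t0, x0, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.min_cutoff, self.beta, self.d_cutoff = min_cutoff, beta, d_cutoff
        self.t_prev, self.x_prev, self.dx_prev = t0, x0, 0.0

    def __call__(self, t, x):
        dt = t - self.t_prev
        # Low-pass the derivative to estimate how fast the signal is moving.
        a_d = smoothing_factor(dt, self.d_cutoff)
        dx = (x - self.x_prev) / dt
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Faster motion raises the cutoff: less smoothing, less lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = smoothing_factor(dt, cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.t_prev, self.x_prev, self.dx_prev = t, x_hat, dx_hat
        return x_hat

# Hypothetical jittery x coordinate of one keypoint, sampled at 30 FPS.
noisy = [100.0, 103.0, 98.0, 102.0, 99.0, 101.0]
one_euro = OneEuroFilter(t0=0.0, x0=noisy[0], min_cutoff=1.0, beta=0.01)
smoothed = [noisy[0]] + [one_euro(i / 30.0, x) for i, x in enumerate(noisy[1:], start=1)]
print([round(v, 2) for v in smoothed])
```

A low `min_cutoff` removes more jitter when the subject is still, while a higher `beta` reduces lag during fast motion; the two are usually tuned together.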

Future Directions

  • 3D Multi-Person Pose: extend DCC to 3D bins (dual 1-D heatmap → triple 1-D heatmap)
  • Whole-Body Variant: COCO-Wholebody's 133 keypoints call for a more efficient DCC implementation
  • Sparse DCC: DCC currently runs on every positive grid, and token routing could cut FLOPs further
  • Distillation with RTMPose: use RTMO's dense features to supervise RTMPose's top-down branch (or the other way round)

Technical Takeaways

1. Hybrid Paradigm Wins

At its core, RTMO succeeds by grafting a top-down representation (the dual 1-D heatmap) onto a one-stage architecture. Single-paradigm models tend to get stuck on one trade-off frontier. Hybridizing the representation and inference logic of two paradigms can often break through the old Pareto frontier.

2. Dynamic > Static

This echoes the earlier OSNet takeaway: input-dependent mechanisms tend to beat fixed ones. DBA, DBE, and the learnable variance in MLE are all designs that adapt to the input. Pose estimation, object detection, attention: the trend in AI has consistently been from static to dynamic.

3. The Likelihood Framework Gives Self-Paced Learning for Free

MLE with learnable variance can auto-balance hard and easy samples without any hand-crafted curriculum. The framework works in face recognition (Chang 2020), object detection (He 2019), and pose (RTMO), which makes it a generalizable design pattern.

4. An Unexpected Use of Positional Encoding

Sine PE started with sequence positions in the Transformer, moved to patch positions in ViT, and now represents spatial bins in RTMO. The same mathematical tool works in three quite different contexts, which shows the power of a fundamental tool.

Summary

RTMO uses an elegant recipe to remove the long-standing accuracy bottleneck of one-stage MPPE:

  1. Dynamic Coordinate Classifier: makes SimCC's dual 1-D heatmap input-adaptive, resolving the tension between bin wastage and coverage in dense prediction
  2. MLE Loss with Learnable Variance: replaces KLD with a likelihood-based formulation and gets hard / easy sample balancing for free
  3. YOLOX-style Training + Two-Stage Schedule: a mature detection framework plus proxy refinement with a self-distillation flavor

Practical Value

  • 🎯 74.8% AP @ 141 FPS (V100, TRT FP16): the strongest real-time MPPE result in the comparisons covered here
  • ⚡ Stable multi-person latency: scenes with 10+ people differ from 1 person by 0.5%
  • 🏆 77.2% AP on the CrowdPose hard split: SOTA for crowded scenes
  • 🛠️ Official MMPose support: production-ready, with working ONNX / TRT export

Which One to Pick?

| Scenario | Recommendation | Reason |
| --- | --- | --- |
| Single person / mobile AR coaching | RTMPose-s | Fastest single-person speed |
| Sports analytics (5-20 people) | RTMO-m | Constant latency + mid-range accuracy |
| Concert / rally surveillance | RTMO-l † | Crowded-scene SOTA |
| Whole-body / hand needed | RTMPose-Wholebody | RTMO has no whole-body variant yet |
| Edge device (Jetson Nano) | RTMO-s | 9.9M params + TRT FP16 |

Related Resources

  • 📄 Paper: arXiv:2312.07526 (CVPR 2024)
  • 💻 Code: github.com/open-mmlab/mmpose/tree/main/projects/rtmo
  • 📦 Pretrained Models: RTMO Model Zoo
  • 📚 MMPose Docs: mmpose.readthedocs.io
  • 🔗 Sibling: RTMPose: arXiv:2303.07399 (the top-down version)
  • 🔗 Foundation: SimCC: arXiv:2107.03332 (ECCV 2022, the origin of coordinate classification)
  • 🔗 YOLOX: arXiv:2107.08430 (backbone + SimOTA)
  • 🔗 DFL: arXiv:2006.04388 (the earlier work with static bin allocation)
  • 🔗 Further reading: The Evolution of Person ReID: From TransReID to SOLIDER to DINOv2, How Did Transformers Come to Dominate Person Re-identification?

The RTMO story is a reminder that a paradigm's bottleneck (here, one-stage) is often not a problem with the paradigm itself but with the representation it borrowed. By moving top-down's dual 1-D heatmap into dense prediction and adding dynamic bins plus a likelihood-based loss, one-stage goes from "fast but inaccurate" to "fast and accurate". The next paradigm to be unlocked may simply be waiting for the right representation to borrow. 🤸✨
