Paper: Tsinghua SIGS / Shanghai AI Laboratory / Nanyang Technological University, arXiv:2312.07526 (CVPR 2024)
Code: github.com/open-mmlab/mmpose/tree/main/projects/rtmo
Same family: RTMPose (the top-down sibling)
TL;DR
The previous blog series traced Person ReID from OSNet to DINOv2, which is human-centric vision at the identity level. This time we move to the keypoint level: multi-person pose estimation.
RTMO was presented at CVPR 2024, and its core question is simple: top-down is too slow, bottom-up / one-stage is not accurate enough, so can we get both?
Key points:
- 🎯 One-stage + Coordinate Classification: the first work to bring dual 1-D heatmap classification (the trick from SimCC / RTMPose) into YOLO-based dense prediction
- 🧩 Dynamic Coordinate Classifier (DCC): Dynamic Bin Allocation (bins follow the bbox size) + Dynamic Bin Encoding (sine PE + FC generate a per-bin representation)
- 📐 MLE Loss with Learnable Variance: Maximum Likelihood Estimation replaces KLD, and each sample learns its own uncertainty, with large variance for hard samples and small variance for easy ones
- ⚡ Real-time, with no slowdown as people increase: RTMO-l reaches 74.8% AP on COCO val2017 at 141 FPS (V100), and scenes with 10+ people add only 0.1 ms of latency
- 🏆 CrowdPose SOTA: RTMO-l with extra data reaches 83.8% AP on CrowdPose, far ahead of ED-Pose (Swin-L, 218M params)
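Background: Three Routes to Multi-Person Pose Estimation
Before RTMO, real-time multi-person pose estimation (MPPE) fell into three camps:
| Paradigm | Accuracy | Speed vs. number of people | Pipeline complexity |
|---|---|---|---|
| Top-down | ⭐⭐⭐⭐⭐ | Linear in N (slower with more people) | Needs detector + cropper + pose net |
| Bottom-up | ⭐⭐⭐ | Constant, but grouping is slow | Needs keypoint grouping (PAF / heatmap matching) |
| One-stage | ⭐⭐⭐ (before RTMO) | Constant | A single forward pass |
💡 One-stage's theoretical advantage is obvious: latency does not depend on the number of people, and the pipeline is simple. But before RTMO, the real-time one-stage methods (YOLO-Pose, KAPAO, YOLOX-Pose) all used a fully connected layer to regress keypoint coordinates directly. That is equivalent to assuming the keypoint location follows a Dirac delta distribution, which ignores the inherent uncertainty of the annotation. The result: fast, yes, but AP never caught up with top-down.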
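Coordinate Classification: Top-down's Secret Weapon
SimCC (2022) and RTMPose (2023) both use a technique called the dual 1-D heatmap, which works very well in the top-down setting:
- Split the x-axis into $B_x$ bins and the y-axis into $B_y$ bins
- For each keypoint, output two 1-D probability heatmaps (one along x, one along y)
- Sub-pixel bins give high spatial resolution without a high-res 2-D heatmap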
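To make the dual 1-D heatmap idea concrete, here is a minimal pure-Python sketch of encoding one coordinate as a Gaussian-smoothed distribution over bins and decoding it back with an argmax. The helper names `encode_1d` and `decode_1d`, the 256 px crop, the 512 bins, and the smoothing width are all illustrative assumptions, not values taken from SimCC or RTMPose.

```python
import math

def encode_1d(coord, length, num_bins, sigma=2.0):
    """Gaussian-smoothed 1-D classification target (sigma is in units of bins)."""
    bin_width = length / num_bins
    centers = [(i + 0.5) * bin_width for i in range(num_bins)]
    weights = [math.exp(-((c - coord) / bin_width) ** 2 / (2 * sigma ** 2))
               for c in centers]
    total = sum(weights)
    return [w / total for w in weights], centers

def decode_1d(probs, centers):
    """Return the centre of the most likely bin."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return centers[best]

# Hypothetical 256-px crop with 512 bins per axis, i.e. 0.5 px per bin
px, cx = encode_1d(100.3, length=256, num_bins=512)
py, cy = encode_1d(42.8, length=256, num_bins=512)
x_hat, y_hat = decode_1d(px, cx), decode_1d(py, cy)

# Two 1-D vectors replace one 2-D heatmap of the same resolution
values_dual_1d = len(px) + len(py)   # 1024 values
values_2d = len(px) * len(py)        # 262144 values
```

The decoded coordinate lands within half a bin of the input, while storage grows with $B_x + B_y$ rather than $B_x \times B_y$.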
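Here is the problem: moving this technique straight into one-stage (dense prediction) breaks in three ways:
- Bin wastage: bins are spread over the whole image, but each person covers only a few grids, so most bins are useless for any given instance
- The DFL [21] fix is not enough: bins are pinned to a fixed range around the anchor, so large people overflow it while small people suffer large quantization error
- KLD loss treats every sample the same: in dense prediction, difficulty varies a lot between grids (position, size, pose), and KLD cannot tell them apart
🎯 RTMO's thesis: if these three incompatibilities can be resolved, the accuracy of coordinate classification can be carried into the speed of one-stage, giving "top-down-level accuracy + one-stage-level speed".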
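RTMO Architecture Overview
RTMO keeps YOLO's dense grid prediction overall, but the keypoint branch of the head uses a purpose-built Dynamic Coordinate Classifier.
Three key design decisions
| Decision | Approach | Reason |
|---|---|---|
| Backbone | CSPDarknet (the YOLOX one) | Aligned with the YOLO ecosystem, easy to deploy |
| Neck | Hybrid Encoder (RT-DETR) | Combines self-attention (global) with FPN-style fusion (local) |
| Feature Levels | P4 + P5 only, P3 dropped | P3 consumed 78.5% of head FLOPs but contributed only 10.7% of correct detections |
🔑 The P3 story: the conventional feature-pyramid wisdom is that shallow features catch small people, but RTMO's ablation (Table 4) found that P4 + P5 already cover small people well enough, so the FLOPs spent on P3 buy very little. Going from 3 feature levels to 2, RTMO-l's CPU latency drops from 186 ms to 125 ms while AP falls by only 0.2%. Lesson: more pyramid levels are not automatically better.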
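A quick back-of-the-envelope check shows why P3 dominates the head cost. The sketch below counts grid cells per feature level for a 640×640 input, assuming the usual YOLOX strides of 8 / 16 / 32 for P3 / P4 / P5 (the stride of 8 for P3 is an assumption here; the post only lists strides 16 and 32 for the levels RTMO keeps). The helper name `grid_cells` is hypothetical.

```python
def grid_cells(input_size, strides):
    """Number of prediction grid cells per feature level for a square input."""
    return {s: (input_size // s) ** 2 for s in strides}

three_level = grid_cells(640, (8, 16, 32))   # P3 + P4 + P5
two_level = grid_cells(640, (16, 32))        # P4 + P5 only, as in RTMO

total_three = sum(three_level.values())      # 6400 + 1600 + 400 = 8400
total_two = sum(two_level.values())          # 1600 + 400 = 2000
p3_share = three_level[8] / total_three      # about 0.76

print(total_three, total_two, round(p3_share, 3))
```

Since the head runs per grid cell, P3 holding roughly three quarters of the cells is in line with the 78.5% FLOPs figure quoted above.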
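Core Innovation 1: Dynamic Coordinate Classifier (DCC)
DCC is RTMO's technical core. It adapts SimCC's dual 1-D heatmap to dense prediction:
A. Dynamic Bin Allocation (DBA)
Conventional approach (SimCC / RTMPose): bins cover the whole input image. In top-down the input is already cropped to one person, so this is fine.
DFL's approach: bins are pinned to a predefined range around the anchor. Large people do not fit, and small people suffer large quantization error.
RTMO's approach: each grid first regresses a bounding box, expands it by 1.25× (to cover inaccurate box predictions), then divides bins uniformly inside the expanded box:
$$x_i = x_l + (x_r - x_l)\,\frac{i - 1}{B_x - 1}, \quad i = 1, \dots, B_x$$
where $x_l$ and $x_r$ are the left and right boundaries of the expanded bounding box. All models use $B_x = 192$ and $B_y = 256$.
🎯 What makes DBA elegant: the spatial location of the bins becomes input-dependent. Small people get fine-grained bins, large people get wider bins, and the quantization error scales with instance size automatically.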
B. Dynamic Bin Encoding (DBE)
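Here is the catch: each grid now has different bin coordinates, so a single set of shared learnable embeddings can no longer represent the bins (which is what SimCC / RTMPose do).
RTMO uses a sine positional encoding plus a learnable FC layer to generate each bin's representation on the fly:
$$[\mathrm{PE}(x_i)]_c = \begin{cases} \sin\left(x_i / t^{c/C}\right), & c \text{ even} \\ \cos\left(x_i / t^{(c-1)/C}\right), & c \text{ odd} \end{cases}$$
Here $t$ is the temperature and $C$ is the number of encoding channels. The encoding is refined by a fully connected layer $\phi$, and the bin probability is a softmax over its similarity with the keypoint feature:
$$\hat{p}_k(x_i) = \frac{\exp\left(f_k \cdot \phi(\mathrm{PE}(x_i))\right)}{\sum_{j=1}^{B_x} \exp\left(f_k \cdot \phi(\mathrm{PE}(x_j))\right)}$$
where $f_k$ is the feature of the $k$-th keypoint (refined by a GAU module, as in RTMPose).
💡 An analogy: this works like the query × key dot product in a Transformer. The keypoint feature is the query, the bin's PE is the key, and the softmax output is the attention weight, i.e. the probability that the keypoint falls in that bin. RTMO applies the Transformer's positional encoding to spatial bins, which is a clever piece of cross-pollination.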
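The query × key analogy can be tried out in a few lines of pure Python. This is a toy sketch under strong simplifying assumptions: only 8 encoding channels, the FC layer φ replaced by the identity, and the keypoint feature replaced by the encoding of the true location (in RTMO both are learned). All function names are hypothetical.

```python
import math

def sine_pe(x, num_channels=8, temperature=10000.0):
    """Interleaved sin/cos encoding: even channels use sin, odd channels use cos."""
    pe = []
    for c in range(num_channels):
        freq = temperature ** (2 * (c // 2) / num_channels)
        pe.append(math.sin(x / freq) if c % 2 == 0 else math.cos(x / freq))
    return pe

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# 192 bins spread uniformly over an expanded box spanning 75..325 px
bins = [75.0 + 250.0 * i / 191 for i in range(192)]

# Stand-in for the keypoint feature f_k: the encoding of the true location
query = sine_pe(180.0)
probs = softmax([dot(query, sine_pe(b)) for b in bins])
peak = bins[max(range(len(bins)), key=probs.__getitem__)]
```

Because sin/cos pairs share a frequency, the dot product of two encodings depends only on the distance between the two positions, so the softmax peaks at the bin closest to 180 px.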
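C. Working through concrete numbers
Suppose an image contains one person whose predicted bbox spans $x \in [100, 300]$ (width 200 px). After the 1.25× expansion:
- The new range is $x \in [75, 325]$ (width = 250)
- $B_x = 192$, so adjacent bins are about 1.3 px apart
- Bin 1 sits at 75 px, bin 96 at about 200 px (the person's centre), and bin 192 at 325 px
Compared with SimCC: for a 640×640 input, SimCC divides the whole 640 px into, say, 384 bins (1.67 px per bin), but a 200-px-wide person uses only 120 of them, so 264 bins are wasted.
Compared with DFL: a fixed range of, say, ±32 px around the anchor cannot contain a 200 px person at all.
RTMO: all 192 bins cover this person's 250 px expanded box, so every bin is useful and the quantization error is about 0.65 px.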
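The numbers in this worked example can be reproduced with a short pure-Python sketch. The helper `dynamic_bins` is a hypothetical name; it follows the uniform-division rule described in the DBA section, with the 1.25× expansion and 192 bins quoted in the post.

```python
def dynamic_bins(left, right, num_bins=192, expand=1.25):
    """Uniformly spaced bin coordinates inside the expanded box along one axis."""
    center = (left + right) / 2
    half = (right - left) * expand / 2
    lo, hi = center - half, center + half
    step = (hi - lo) / (num_bins - 1)
    return [lo + i * step for i in range(num_bins)]

bins = dynamic_bins(100.0, 300.0)
step = bins[1] - bins[0]           # about 1.31 px between adjacent bins
worst_case_error = step / 2        # about 0.65 px quantization error

# SimCC-style static bins over a 640 px image, for comparison
simcc_bin_width = 640 / 384                        # about 1.67 px
simcc_bins_on_person = 200 / simcc_bin_width       # about 120 bins
simcc_bins_wasted = 384 - simcc_bins_on_person     # about 264 bins
```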
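Core Innovation 2: MLE Loss with Learnable Variance
This is the other elegant contribution. Standard coordinate classification uses Gaussian label smoothing plus a KLD loss, where $\mu_x$ is the annotated coordinate:
$$p_k(x_i \mid \mu_x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i - \mu_x)^2}{2\sigma^2}\right), \qquad \mathcal{L}_{kld} = \sum_{i=1}^{B_x} p_k(x_i \mid \mu_x)\,\log\frac{p_k(x_i \mid \mu_x)}{\hat{p}_k(x_i)}$$
RTMO builds on one key observation: a Gaussian is symmetric in its mean, so $p_k(x_i \mid \mu_x) = p_k(\mu_x \mid x_i)$. The "target distribution" can therefore be reinterpreted as the "annotation likelihood under a Gaussian error model".
If the prediction $\hat{p}_k(x_i)$ is treated as the prior of $x_i$, the marginal likelihood of the annotation is:
$$P(\mu_x) = \sum_{i=1}^{B_x} P(\mu_x \mid x_i)\,\hat{p}_k(x_i)$$
Maximizing this models the true distribution of the annotation. In practice RTMO uses a Laplace distribution for $P(\mu_x \mid x_i)$ and takes the negative log-likelihood as the loss:
$$\mathcal{L}_{mle}^{(x)} = -\log\left[\sum_{i=1}^{B_x} \frac{1}{\hat{\sigma}}\exp\left(-\frac{|x_i - \mu_x|}{2\hat{\sigma}s}\right)\hat{p}_k(x_i)\right]$$
where $s$ is the instance size (it normalizes the error) and $\hat{\sigma}$ is the variance predicted by the model (its uncertainty).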
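Why does the learnable variance matter so much?
🔑 The fundamental difference between MLE and KLD:
- KLD + learnable σ: the model can take the lazy route and predict a large σ for every sample, which flattens the target distribution and keeps the loss small, so training collapses
- MLE + learnable σ: σ appears in the likelihood both as the $1/\hat{\sigma}$ normalizer and inside the exponent, so the model cannot cheat. Predicting a large σ on a hard sample really does lower the loss, but predicting a large σ on an easy sample is penalized. This is a natural implementation of self-paced learning.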
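The self-paced behaviour can be checked numerically. The sketch below evaluates the Laplace negative log-likelihood for a single keypoint along one axis, using a one-hot predicted distribution for clarity. The function name `mle_nll`, the 0..100 px bin grid, and the instance size of 10 are illustrative assumptions.

```python
import math

def mle_nll(probs, bins, target, sigma, size):
    """Negative log-likelihood of the annotation under a Laplace error model."""
    likelihood = sum(
        p * math.exp(-abs(x - target) / (2 * sigma * size)) / sigma
        for p, x in zip(probs, bins)
    )
    return -math.log(likelihood)

bins = [float(i) for i in range(101)]   # toy bins at 0, 1, ..., 100 px
target, size = 50.0, 10.0

def one_hot(index):
    return [1.0 if i == index else 0.0 for i in range(len(bins))]

easy = one_hot(50)   # prediction sits exactly on the annotation
hard = one_hot(70)   # prediction is 20 px away from the annotation

easy_small_sigma = mle_nll(easy, bins, target, sigma=0.5, size=size)
easy_large_sigma = mle_nll(easy, bins, target, sigma=2.0, size=size)
hard_small_sigma = mle_nll(hard, bins, target, sigma=0.1, size=size)
hard_large_sigma = mle_nll(hard, bins, target, sigma=1.0, size=size)
```

For the easy sample the loss reduces to $\log\hat{\sigma}$, so a larger σ costs more. For the hard sample it is $\log\hat{\sigma} + d / (2\hat{\sigma}s)$, which is minimized at $\hat{\sigma} = d / (2s)$, so a too-small σ is heavily penalized.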
Ablation data (RTMO-s, COCO val2017)
| Decoding | Loss | COCO AP | CrowdPose AP |
|---|---|---|---|
| Regression | OKS | 65.6 | 66.1 |
| CC + DBA + DBE | KLD | 64.4 | 62.5 |
| CC (static bins) | MLE | 66.7 | 65.8 |
| CC + DBA | MLE | 65.6 | 65.2 |
| CC + DBA + DBE | MLE | 67.6 | 67.2 |
🎯 Three takeaways:
1. MLE > KLD: a gap of 3.2 AP points on COCO (67.6 vs. 64.4), so the learnable variance does address the hard / easy imbalance
2. DBA alone hurts: adding DBA without DBE lowers COCO AP (66.7 → 65.6), because the bin positions change while their representations do not, which breaks the bin semantics
3. DBA + DBE only work together: the two components depend on each other, so when bin positions move, the representations must follow
Training Pipeline: YOLOX Tricks + Two-Stage Training
RTMO's training borrows mature techniques from the YOLO family and adds a few pose-specific adjustments:
Label Assignment: Extended SimOTA
YOLOX uses SimOTA to assign positive grids dynamically. RTMO changes the assignment cost from "bbox quality" to a combination of "bbox + pose quality":
- Score branch: Varifocal Loss, with the OKS (Object Keypoint Similarity) between the predicted pose and the GT pose as target
- BBox branch: IoU loss
- Keypoint branch (DCC): MLE loss
- Visibility branch: BCE loss
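Since OKS serves as the score target, it helps to see how it is computed. Below is a minimal pure-Python sketch of the COCO-style OKS formula, $\exp(-d^2 / (2\,\text{area}\,(2\sigma_i)^2))$ averaged over visible keypoints. The three-keypoint pose, the area, and the per-keypoint sigmas are illustrative values (the sigmas mirror the commonly published COCO constants for nose, shoulder, and ankle, but treat them as assumptions here).

```python
import math

def oks(pred, gt, visible, area, sigmas):
    """Object Keypoint Similarity averaged over visible ground-truth keypoints."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), vis, sigma in zip(pred, gt, visible, sigmas):
        if not vis:
            continue  # unlabeled keypoints do not contribute
        dist_sq = (px - gx) ** 2 + (py - gy) ** 2
        k = 2 * sigma
        total += math.exp(-dist_sq / (2 * area * k ** 2))
        count += 1
    return total / count if count else 0.0

# Toy three-keypoint pose: nose, shoulder, ankle (sigmas are illustrative)
gt = [(100.0, 50.0), (90.0, 90.0), (95.0, 250.0)]
sigmas = [0.026, 0.079, 0.089]
area = 200.0 * 80.0

perfect = oks(gt, gt, [1, 1, 1], area, sigmas)
shifted = oks([(x + 5.0, y) for x, y in gt], gt, [1, 1, 1], area, sigmas)
```

A 5 px shift hurts the nose term far more than the ankle term, which is how OKS encodes per-keypoint tolerance.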
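Proxy Regression: Avoiding OOM
DCC's computation runs over every grid × every keypoint × ($B_x$ + $B_y$) bins, which is too expensive for dense prediction. RTMO's trick:
- Add a lightweight pointwise conv for keypoint regression (kpt_reg)
- SimOTA uses kpt_reg to pick positive grids (so DCC does not have to run on every grid)
- DCC runs only on the positive grids and outputs decoded keypoints (kpt_dec)
- Proxy loss: $\mathcal{L}_{proxy} = 1 - \mathrm{OKS}(kpt_{reg}, kpt_{dec})$, which keeps the proxy consistent with DCC's output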
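Total Loss
$$\mathcal{L} = \lambda_1\mathcal{L}_{bbox} + \lambda_2\mathcal{L}_{mle} + \lambda_3\mathcal{L}_{proxy} + \lambda_4\mathcal{L}_{cls} + \mathcal{L}_{vis}$$
where $\lambda_1 = \lambda_2 = 5$, $\lambda_3 = 10$, and $\lambda_4 = 2$.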
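As a small illustration of how the five terms combine, here is a sketch that applies the loss weights listed in the config excerpt later in this post (bbox 5, MLE 5, proxy/OKS 10, cls 2, vis 1). The helper `total_loss` and the per-term values are made up purely for the arithmetic.

```python
# Weights as listed in this post's config excerpt
LOSS_WEIGHTS = {"bbox": 5.0, "mle": 5.0, "proxy": 10.0, "cls": 2.0, "vis": 1.0}

def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss terms."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)

# Made-up per-term values, only to show the weighting
example_terms = {"bbox": 0.4, "mle": 0.2, "proxy": 0.1, "cls": 0.5, "vis": 0.3}
total = total_loss(example_terms)   # 2.0 + 1.0 + 1.0 + 1.0 + 0.3
```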
Two-Stage Training Schedule
| Stage | Proxy Target | Learning Rate | Purpose |
|---|---|---|---|
| Stage 1 | GT pose annotations | 4e-3 | Bootstrap: proxy and DCC both learn from GT |
| Stage 2 | DCC decoded pose | 5e-4 → 2e-4 (cosine) | Refine: proxy follows DCC, DCC learns finer detail |
💡 Stage 2 has a self-distillation flavour: the proxy switches from learning GT to learning DCC's output, the idea being that DCC's output carries more information than the raw GT (it includes uncertainty). This resembles the "soft label > hard label" insight from knowledge distillation.
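The schedule in the table can be sketched as two small helpers: one that picks the proxy target per stage, and one that anneals the stage 2 learning rate along a cosine curve. Both function names are hypothetical, and the exact annealing formula is an assumption (a standard half-cosine from 5e-4 down to 2e-4); the post only states the endpoints and "cosine".

```python
import math

def proxy_target(stage, gt_pose, dcc_pose):
    """Stage 1 supervises the proxy with GT; stage 2 uses DCC's decoded pose."""
    if stage == 1:
        return gt_pose
    if stage == 2:
        # In a real training loop this target would be detached from the graph
        return dcc_pose
    raise ValueError(f"unknown stage: {stage}")

def cosine_lr(step, total_steps, lr_start=5e-4, lr_end=2e-4):
    """Half-cosine anneal from lr_start to lr_end over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * progress))

lr_first = cosine_lr(0, 100)      # starts at 5e-4
lr_middle = cosine_lr(50, 100)    # halfway point, 3.5e-4
lr_last = cosine_lr(100, 100)     # ends at 2e-4
```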
Experimental Results: Beating the Real-Time Field
COCO test-dev: King of One-Stage
| Method | Backbone | Params | Time (ms) | AP |
|---|---|---|---|---|
| YOLO-Pose-s | CSPDarknet | 10.8M | 7.9 | 63.2 |
| YOLO-Pose-l | CSPDarknet | 61.3M | 20.5 | 70.2 |
| KAPAO-l | CSPNet | 77.0M | 50.2 | 70.3 |
| PETR | Swin-L | 213.8M | 133 | 70.5 |
| ED-Pose | Swin-L | 218.0M | 265.6 | 72.7 |
| RTMO-s | CSPDarknet | 9.9M | 8.9 | 66.9 |
| RTMO-m | CSPDarknet | 22.6M | 12.4 | 70.1 |
| RTMO-l | CSPDarknet | 44.8M | 19.1 | 71.6 |
| RTMO-l † | CSPDarknet | 44.8M | 19.1 | 73.3 |
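🚀 Key observations:
- RTMO-s runs at about the same speed as YOLO-Pose-s (8.9 vs. 7.9 ms) but scores 3.7 AP points higher
- RTMO-l is about 14× faster than ED-Pose-Swin-L (19.1 vs. 265.6 ms) while trailing by only 1.1 AP
- RTMO-l is about 7× faster than PETR-Swin-L and 1.1 AP higher
- With extra training data (†), RTMO-l reaches 73.3% AP, ahead of every one-stage method in the table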
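Multi-Person Scenes: Top-down's Noose
RTMO is compared against the RTMPose (top-down) + RTMDet-nano (detector) pipeline.
⚡ With 10+ people, RTMO-l's GPU latency is only 0.1 ms higher than with 1 person (about 0.5% of the total). This is the constant-time advantage of one-stage. For surveillance, sports analytics, AR, and similar scenarios, this property matters more than having the absolute lowest single-person latency.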
CrowdPose: Dominating Crowded Scenes
| Method | Params | AP | AP_Easy | AP_Medium | AP_Hard |
|---|---|---|---|---|---|
| HRNet (top-down) | 28.5M | 71.3 | 80.5 | 71.4 | 62.5 |
| DEKR (bottom-up) | 65.7M | 67.3 | 74.6 | 68.1 | 58.7 |
| ED-Pose Swin-L | 218.0M | 73.1 | 80.5 | 73.8 | 63.8 |
| RTMO-l | 44.8M | 73.2 | 79.2 | 74.1 | 65.3 |
| RTMO-l † | 44.8M | 83.8 | 88.8 | 84.7 | 77.2 |
🏆 RTMO-l † reaches 77.2% on CrowdPose AP_Hard, the split with the most crowding and the heaviest occlusion. That is 13.4 points above ED-Pose-Swin-L with a model about 5× smaller. It is strong evidence of how robust DCC + MLE are on hard samples, keeping in mind that the † model also benefits from extra training data.
A Closer Comparison with Cousin RTMPose
RTMO and RTMPose come from the same team (OpenMMLab / Shanghai AI Lab) and share many design ideas, but they solve different problems:
| Feature | RTMPose | RTMO |
|---|---|---|
| Paradigm | Top-down | One-stage |
| Needs a detector? | ✅ Required | ❌ Not needed |
| Coordinate Classification | SimCC (static bins) | DCC (dynamic bins + encoding) |
| Loss | KLD with Gaussian smoothing | MLE with learnable variance |
| Feature refinement | GAU | GAU |
| Latency vs people count | Linear (slower with more people) | Constant |
| 1-person speed | Fastest | Slightly slower |
| Multi-person speed | Degrades | Stable |
| Best use case | Single person / few people / mobile | Surveillance / sports / concerts |
🎯 Which one to pick? A simple rule of thumb:
- ≤ 2 people per image → RTMPose (lowest single-person latency, and the detector overhead is small)
- ≥ 4 people per image → RTMO (the constant-latency advantage starts to show)
- Crowded scenes (occlusion + crowding) → RTMO, whose dense grid + MLE robustness wins clearly
Implementation Guide: Running RTMO with MMPose
Method 1: Off-the-shelf Inference
```python
# Install
# pip install mmcv mmengine
# pip install "mmpose>=1.3.0"
# pip install mmdet  # RTMO needs no detector, but the demo scripts use it

from mmpose.apis import MMPoseInferencer

# Load RTMO-l in one call
inferencer = MMPoseInferencer(
    pose2d='rtmo-l_16xb16-600e_body7-640x640',
    device='cuda:0'
)

# Inference (the call returns a generator; iterating it runs the model)
results = inferencer(
    'crowd.jpg',
    show=False,
    out_dir='outputs/',
    radius=4,
    thickness=2
)

for result in results:
    poses = result['predictions'][0]
    print(f"Detected {len(poses)} people")
    for p in poses:
        keypoints = p['keypoints']      # 17 (x, y) pairs
        scores = p['keypoint_scores']   # 17 confidence values
        bbox = p['bbox']
```
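Downstream code usually wants only the keypoints the model is confident about. The helper below is a small sketch that filters one person's prediction by score; it assumes the dictionary layout used in the loop above (`keypoints` as (x, y) pairs and `keypoint_scores` as a parallel list). The function name `confident_keypoints`, the 0.5 threshold, and the hand-written `person` dictionary are illustrative stand-ins, not MMPose API.

```python
def confident_keypoints(person, threshold=0.5):
    """Return (index, x, y, score) for keypoints scoring at or above threshold."""
    return [
        (idx, x, y, score)
        for idx, ((x, y), score) in enumerate(
            zip(person['keypoints'], person['keypoint_scores']))
        if score >= threshold
    ]

# Hand-written stand-in for one entry of result['predictions'][0]
person = {
    'keypoints': [[120.0, 80.0], [130.0, 75.0], [0.0, 0.0]],
    'keypoint_scores': [0.92, 0.61, 0.08],
}
kept = confident_keypoints(person)
```

Keeping the keypoint index in each tuple lets later code map the survivors back to joint names.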
Method 2: Train It Yourself (COCO)
```bash
# Clone MMPose
git clone https://github.com/open-mmlab/mmpose.git
cd mmpose

# Download the pretrained backbone (YOLOX-l trained on COCO detection),
# then launch training from the repository root, where tools/ lives
bash tools/dist_train.sh \
    configs/body_2d_keypoint/rtmo/coco/rtmo-l_16xb16-600e_coco-640x640.py \
    8  # 8 GPUs
```
Key config excerpt:
```python
# Core part of rtmo-l_16xb16-600e_coco-640x640.py
model = dict(
    type='BottomupPoseEstimator',
    backbone=dict(type='CSPDarknet', deepen_factor=1.0, widen_factor=1.0),
    neck=dict(type='HybridEncoder', encoder_cfg=dict(...)),
    head=dict(
        type='RTMOHead',
        num_keypoints=17,
        featmap_strides=(16, 32),  # P4 + P5 only
        head_module_cfg=dict(
            num_classes=1,
            in_channels=256,
            cls_feat_channels=256,
            channels_per_group=36,
            pose_vec_channels=512,
        ),
        dcc_cfg=dict(
            in_channels=256,
            feat_channels=128,
            num_bins=(192, 256),  # Bx, By
            spe_channels=128,     # sine PE dimension
        ),
        loss_mle=dict(type='MLECCLoss', use_target_weight=True, loss_weight=5.0),
        loss_bbox=dict(type='IoULoss', loss_weight=5.0),
        loss_oks=dict(type='OKSLoss', loss_weight=10.0),  # proxy loss
        loss_vis=dict(type='BCELoss', loss_weight=1.0),
        loss_cls=dict(type='VariFocalLoss', loss_weight=2.0),
    ),
)
```
Method 3: Export to ONNX (deployment)
```bash
# Export with mmdeploy
python tools/deploy.py \
    configs/mmpose/pose-detection_rtmo_onnxruntime_dynamic-640x640.py \
    projects/rtmo/configs/rtmo-l_16xb16-600e_coco-640x640.py \
    checkpoints/rtmo-l_coco-640x640.pth \
    demo/resources/human-pose.jpg \
    --work-dir mmdeploy_models/rtmo-l \
    --device cuda \
    --dump-info
```
💡 Deployment suggestions:
- ONNXRuntime + FP32: x86 server / desktop
- TensorRT + FP16: NVIDIA Jetson / data-centre GPU (RTMO-l reaches ~141 FPS on a V100 with TRT FP16)
- ONNXRuntime + OpenVINO: Intel CPU
- Avoid serving the raw PyTorch model in production: ED-Pose in the paper shows what happens when ONNX export fails (1.5 s per frame)
Core Code Walkthrough: A PyTorch Skeleton of DCC
Below is a simplified Dynamic Coordinate Classifier that keeps all the essential logic:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SinePositionalEncoding(nn.Module):
    """Generate a sine positional encoding for bin coordinates."""

    def __init__(self, num_channels=128, temperature=10000):
        super().__init__()
        self.num_channels = num_channels
        self.temperature = temperature

    def forward(self, coords):
        # coords: (N, num_bins), bin coordinates of each grid
        # (this sketch passes raw pixel coordinates)
        dim_t = torch.arange(self.num_channels, device=coords.device)
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_channels)
        pos = coords.unsqueeze(-1) / dim_t  # (N, num_bins, C)
        # Interleave sin (even channels) and cos (odd channels) without
        # in-place writes, so autograd still works if coords carry gradients
        pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
        return pos.flatten(-2)  # (N, num_bins, C)


class DynamicCoordinateClassifier(nn.Module):
    def __init__(self, in_channels=256, feat_channels=128,
                 num_keypoints=17, num_bins=(192, 256), spe_channels=128):
        super().__init__()
        self.K = num_keypoints
        self.Bx, self.By = num_bins

        # Split the pose feature into one feature vector per keypoint
        self.kpt_feat_proj = nn.Linear(in_channels, num_keypoints * feat_channels)

        # Sine PE + learnable FC (φ)
        self.spe = SinePositionalEncoding(spe_channels)
        self.phi_x = nn.Linear(spe_channels, feat_channels)
        self.phi_y = nn.Linear(spe_channels, feat_channels)

        # Variance head (predicts σ̂)
        self.sigma_head = nn.Linear(in_channels, num_keypoints * 2)

    def get_bin_coords(self, bboxes, expand=1.25):
        """Dynamic Bin Allocation: bins follow the expanded bbox."""
        # bboxes: (N, 4) in xyxy format
        cx = (bboxes[:, 0] + bboxes[:, 2]) / 2
        cy = (bboxes[:, 1] + bboxes[:, 3]) / 2
        w = (bboxes[:, 2] - bboxes[:, 0]) * expand
        h = (bboxes[:, 3] - bboxes[:, 1]) * expand

        xl, xr = cx - w / 2, cx + w / 2
        yt, yb = cy - h / 2, cy + h / 2

        # Divide the expanded box uniformly into bins
        steps_x = torch.linspace(0, 1, self.Bx, device=bboxes.device)
        steps_y = torch.linspace(0, 1, self.By, device=bboxes.device)
        x_bins = xl[:, None] + (xr - xl)[:, None] * steps_x[None, :]  # (N, Bx)
        y_bins = yt[:, None] + (yb - yt)[:, None] * steps_y[None, :]  # (N, By)
        return x_bins, y_bins

    def forward(self, pose_feat, bboxes):
        # pose_feat: (N, C), pose feature of each positive grid
        # bboxes:    (N, 4), bbox predicted by that grid
        N = pose_feat.size(0)

        # ---- Step 1: Keypoint features ----
        f_k = self.kpt_feat_proj(pose_feat).view(N, self.K, -1)  # (N, K, D)

        # ---- Step 2: Dynamic bin coordinates ----
        x_bins, y_bins = self.get_bin_coords(bboxes)  # (N, Bx), (N, By)

        # ---- Step 3: Dynamic bin encoding ----
        # Bin coordinates go through sine PE, then FC.
        # Every sample has different bin coordinates, so this is computed on the fly.
        x_emb = self.phi_x(self.spe(x_bins))  # (N, Bx, D)
        y_emb = self.phi_y(self.spe(y_bins))  # (N, By, D)

        # ---- Step 4: Bin-keypoint similarity (softmax over bins) ----
        logits_x = torch.einsum('nkd,nid->nki', f_k, x_emb)  # (N, K, Bx)
        logits_y = torch.einsum('nkd,nid->nki', f_k, y_emb)  # (N, K, By)
        prob_x = F.softmax(logits_x, dim=-1)
        prob_y = F.softmax(logits_y, dim=-1)

        # ---- Step 5: Integral decoding of the coordinates ----
        kpt_x = (prob_x * x_bins.unsqueeze(1)).sum(-1)  # (N, K)
        kpt_y = (prob_y * y_bins.unsqueeze(1)).sum(-1)  # (N, K)
        keypoints = torch.stack([kpt_x, kpt_y], dim=-1)  # (N, K, 2)

        # ---- Step 6: Per-sample variance (needed by the MLE loss) ----
        sigma = self.sigma_head(pose_feat).view(N, self.K, 2).exp()  # ensures σ > 0

        return {
            'keypoints': keypoints,
            'prob_x': prob_x, 'prob_y': prob_y,
            'x_bins': x_bins, 'y_bins': y_bins,
            'sigma': sigma,
        }


def mle_loss(prob, bins, target, sigma, instance_size, eps=1e-9):
    """
    prob:          (N, K, B), predicted probability over bins
    bins:          (N, B), bin coordinates
    target:        (N, K), ground-truth keypoint coordinate
    sigma:         (N, K), predicted variance
    instance_size: (N,), bbox diagonal or another normalizing factor
    """
    diff = (bins.unsqueeze(1) - target.unsqueeze(-1)).abs()  # (N, K, B)
    s = instance_size.view(-1, 1, 1)
    laplace = torch.exp(-diff / (2 * sigma.unsqueeze(-1) * s)) / sigma.unsqueeze(-1)
    likelihood = (prob * laplace).sum(-1).clamp_min(eps)  # (N, K)
    return -likelihood.log().mean()
```
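🔑 Three implementation details to note:
- phi_x and phi_y are separate FC layers (the x and y axes have different semantics and different bin counts)
- sigma_head uses .exp() to guarantee σ > 0 (learning in log-space is more stable)
- In the MLE loss, instance_size makes the loss scale invariant to person size, since the same pixel error means something different for a large person and a small one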
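Limitations and Future Directions
Current limitations
1. Bin counts are hard-coded
They suit COCO's 17 keypoints, but whole-body pose (133 keypoints in COCO-WholeBody, 136 in Halpe) or hand pose may need re-tuning.
2. P3 removed → possible degradation on tiny / distant people
P3 was dropped for speed, so very small people (< 32 px) may degrade. If your use case includes long-range surveillance, you may need to re-enable P3.
3. Hybrid Encoder attention is memory-hungry
With high-resolution input (e.g. 800×800), the Hybrid Encoder's self-attention remains the memory bottleneck. The RTMO paper's 480-800 multi-scale training already reflects this trade-off.
4. Single-frame, no temporal smoothing
Video applications will show jitter, so an external temporal smoother (One-Euro filter, etc.) is needed.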
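For reference, a One-Euro filter fits in a few lines. The sketch below is a generic pure-Python version of the published algorithm for one scalar signal, not code from MMPose or RTMO; in practice you would keep one filter per keypoint coordinate. The parameter defaults and the 30 FPS synthetic jitter signal are illustrative assumptions.

```python
import math

def _alpha(dt, cutoff):
    """Smoothing factor of a first-order low-pass filter."""
    r = 2 * math.pi * cutoff * dt
    return r / (r + 1)

class OneEuroFilter:
    """Adaptive low-pass filter: smooths slow motion, tracks fast motion."""

    def __init__(self, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.min_cutoff = min_cutoff
        self.beta = beta
        self.d_cutoff = d_cutoff
        self.t_prev = None
        self.x_prev = None
        self.dx_prev = 0.0

    def __call__(self, t, x):
        if self.t_prev is None:          # first sample passes through
            self.t_prev, self.x_prev = t, x
            return x
        dt = t - self.t_prev
        # Smoothed estimate of the signal's speed
        dx = (x - self.x_prev) / dt
        a_d = _alpha(dt, self.d_cutoff)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        # Faster motion raises the cutoff, which reduces lag
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = _alpha(dt, cutoff)
        x_hat = a * x + (1 - a) * self.x_prev
        self.t_prev, self.x_prev, self.dx_prev = t, x_hat, dx_hat
        return x_hat

# Synthetic jitter: a keypoint x-coordinate flickering by ±1 px at 30 FPS
smoother = OneEuroFilter()
raw = [100.0 + (1.0 if i % 2 == 0 else -1.0) for i in range(90)]
smooth = [smoother(i / 30.0, x) for i, x in enumerate(raw)]
jitter_before = max(raw[-10:]) - min(raw[-10:])
jitter_after = max(smooth[-10:]) - min(smooth[-10:])
```

Raising `beta` makes the filter follow fast motion more closely at the cost of letting more jitter through; `min_cutoff` controls how much smoothing is applied when the keypoint is nearly still.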
Future directions
- 3D Multi-Person Pose: extend DCC to 3D bins (dual 1-D heatmap → triple 1-D heatmap)
- Whole-Body Variant: the 133 keypoints of COCO-WholeBody need a more efficient DCC implementation
- Sparse DCC: DCC currently runs on every positive grid, and token routing could cut FLOPs further
- Distillation with RTMPose: use RTMO's dense features to supervise RTMPose's top-down branch (or the other way round)
Technical Takeaways
1. Hybrid Paradigm Wins
RTMO's success comes down to grafting a top-down representation (dual 1-D heatmap) onto a one-stage architecture. Single-paradigm models tend to get stuck on one trade-off frontier; hybridizing the representation and inference logic of two paradigms can often push past the old Pareto frontier.
2. Dynamic > Static
This echoes the earlier takeaway from OSNet: input-dependent mechanisms tend to beat fixed ones. DBA, DBE, and MLE's learnable variance are all designs that adapt to the input. Pose estimation, object detection, attention: the trend in AI has been from static to dynamic.
3. The Likelihood Framework Gives Self-Paced Learning for Free
MLE with learnable variance balances hard / easy samples automatically, without any hand-crafted curriculum. The framework works in face recognition (Chang 2020), object detection (He 2019), and pose (RTMO), which makes it a generalizable design pattern.
4. An Unexpected Use of Positional Encoding
Sine PE started with sequence positions in the Transformer, moved on to 2-D image positions in vision Transformers (DETR uses sine PE, while ViT defaults to learnable embeddings), and now represents spatial bins in RTMO. The same mathematical tool works in three quite different contexts, which is the power of a fundamental tool.
Summary
RTMO uses an elegant recipe to remove the long-standing accuracy bottleneck of one-stage MPPE:
- Dynamic Coordinate Classifier: makes SimCC's dual 1-D heatmap input-adaptive, resolving the bin wastage / coverage conflict in dense prediction
- MLE Loss with Learnable Variance: replaces KLD with a likelihood-based formulation and gets hard / easy sample balancing for free
- YOLOX-style Training + Two-Stage Schedule: a mature detection framework plus proxy refinement with a self-distillation flavour
Practical value
- 🎯 74.8% AP @ 141 FPS (V100, TRT FP16): among the strongest real-time MPPE models at publication
- ⚡ Stable multi-person latency: scenes with 10+ people cost about 0.5% more than 1 person
- 🏆 77.2% AP on the CrowdPose hard split (with extra data): SOTA for crowded scenes
- 🛠️ Official MMPose support: production-ready, with working ONNX / TRT export
Which one to pick?
| Scenario | Recommendation | Reason |
|---|---|---|
| Single person / mobile AR coaching | RTMPose-s | Fastest single-person speed |
| Sports analytics (5-20 people) | RTMO-m | Constant latency + mid-range accuracy |
| Concert / rally surveillance | RTMO-l † | Crowded-scene SOTA |
| Whole-body / hand needed | RTMPose-Wholebody | RTMO has no wholebody variant yet |
| Edge device (Jetson Nano) | RTMO-s | 9.9M params + TRT FP16 |
Resources
- 📄 Paper: arXiv:2312.07526 (CVPR 2024)
- 💻 Code: github.com/open-mmlab/mmpose/tree/main/projects/rtmo
- 📦 Pretrained Models: RTMO Model Zoo
- 📚 MMPose Docs: mmpose.readthedocs.io
- 🔗 Sibling: RTMPose: arXiv:2303.07399 (the top-down version)
- 🔗 Foundation: SimCC: arXiv:2107.03332 (ECCV 2022, the origin of coordinate classification)
- 🔗 YOLOX: arXiv:2107.08430 (backbone + SimOTA)
- 🔗 DFL: arXiv:2006.04388 (the earlier work on static bin allocation)
- 🔗 Further reading: The Evolution of Person ReID: From TransReID to SOLIDER to DINOv2, How Did Transformers Come to Dominate Person Re-Identification?
The RTMO story is a reminder that a paradigm's bottleneck (here, one-stage) is often not a flaw of the paradigm itself but of the representation it borrowed. Bring the top-down dual 1-D heatmap into dense prediction, add dynamic bins and a likelihood-based loss, and one-stage goes from "fast but inaccurate" to "fast and accurate". The next paradigm to be unlocked may just be waiting for the right representation to be borrowed. 🤸✨