This post covers three core papers: TransReID — arXiv:2102.04378 (ICCV 2021) | GitHub; SOLIDER — arXiv:2303.17602 (CVPR 2023) | GitHub; DINOv2 — arXiv:2304.07193 (Meta AI, 2023) | GitHub
TL;DR
In the previous post we covered how OSNet, a 2.2M-parameter CNN, beat ResNet50 at Person ReID. But after 2021, Transformers completely rewrote the rules of the ReID game.
This post walks through three Transformer-era milestones in ReID:
Key points:
- 🏗️ TransReID (ICCV 2021): the first pure-Transformer ReID framework, introducing two ReID-specific modules, JPM + SIE
- 🧠 SOLIDER (CVPR 2023): semantically controllable self-supervised pre-training; one model handles 6 human-centric tasks
- 🦕 DINOv2 (Meta AI 2023): a universal vision foundation model whose frozen features already work for ReID
- 📊 Evolution trend: "hand-designed backbones" → "ReID-specific Transformers" → "general self-supervised pre-training" → "foundation models"
Background: Three Eras of ReID Evolution
Before diving into each model, here is a panorama of how Person ReID technology has evolved:
| Era | Representative | Core idea | Pre-training Data | MSMT17 mAP |
|---|---|---|---|---|
| CNN | OSNet (ICCV 2019) | Task-specific CNN design | ImageNet-1K | 52.9 |
| Transformer | TransReID (ICCV 2021) | Pure ViT + ReID modules | ImageNet-21K | 67.4 |
| Pre-training | SOLIDER (CVPR 2023) | Human-centric SSL | LUPerson (4M) | 77.1 |
| Foundation | DINOv2 (Meta 2023) | Universal visual features | LVD-142M | ~70+ (frozen) |
🎯 Core trend: from "architectures designed specifically for ReID" toward "better pre-training that yields more universal features". Models keep getting more general, yet performance keeps improving: that is the magic of the foundation-model era.
Part 1: TransReID — The First Pure Transformer for ReID
Why Use a Transformer for ReID?
CNNs have a fundamental limitation: the local receptive field. Even though OSNet widens its receptive field with multi-scale streams, each convolutional layer can still only "see" nearby pixels.
Transformer self-attention solves this by design: every token can directly attend to any position in the image.
| Property | CNN (OSNet) | Transformer (TransReID) |
|---|---|---|
| Receptive field | Local, grows layer by layer | Global from layer 1 |
| Long-range dependencies | Require deep stacking | Present in every layer |
| Position encoding | Implicit (via conv position) | Explicit (side info can be added) |
| Part-level features | Need an extra branch | Patch tokens naturally map to body parts |
💡 TransReID's insight: ViT cuts the image into patches, and each patch naturally corresponds to a different body region (head, torso, legs). This is a perfect match for ReID's need for part-level features!
TransReID's Architecture
TransReID's baseline is a standard ViT-B/16, plus two ReID-specific modules:
Innovation 1: Side Information Embedding (SIE)
This is TransReID's most practical innovation. ReID has a problem the CNN era never properly solved: camera bias.
The problem: the same person can look very different under Camera A (indoor, warm light) and Camera B (outdoor, cool light). Models easily learn shortcuts like "people under Camera A look yellowish, people under Camera B look bluish".
SIE's solution: encode the camera ID and viewpoint as learnable embeddings and add them directly to the patch embeddings:

z = [patch embeddings] + [position embeddings] + λ_C · E_C(cam) + λ_V · E_V(view)

where:
- E_C(cam) = camera ID embedding (each camera has its own learnable vector)
- E_V(view) = viewpoint embedding (front / back / side)
- λ_C, λ_V = learnable scaling factors
🎯 Why does SIE work?
- It tells the model "which camera took this photo", so the model can learn to compensate for camera-specific bias
- It is analogous to segment embeddings in NLP (BERT uses them to distinguish sentence A from sentence B)
- It needs no extra annotation: camera IDs are already present in ReID datasets
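A minimal numpy sketch of the side-information injection described above. The variable names and the λ values are illustrative stand-ins, not TransReID's actual code; in the real model the lookup tables are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 128, 768          # ViT-B/16 on a 256x128 person crop
num_cameras, num_views = 6, 3

# Learnable lookup tables in the real model; random stand-ins here
cam_embed  = rng.standard_normal((num_cameras, dim))
view_embed = rng.standard_normal((num_views, dim))
lambda_c, lambda_v = 3.0, 3.0        # scaling factors (learned/tuned in practice)

tokens    = rng.standard_normal((num_patches + 1, dim))  # [CLS] + 128 patches
pos_embed = rng.standard_normal((num_patches + 1, dim))

def add_side_info(tokens, cam_id, view_id):
    # The same side-info vector is broadcast-added to every token,
    # alongside the usual position embedding
    side = lambda_c * cam_embed[cam_id] + lambda_v * view_embed[view_id]
    return tokens + pos_embed + side

z = add_side_info(tokens, cam_id=2, view_id=0)
print(z.shape)  # (129, 768)
```

Because the same embedding is added to every token, the very first attention layer already "knows" which camera produced the image and can compensate for its bias.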
Innovation 2: Jigsaw Patch Module (JPM)
JPM tackles a different problem: ViT's patch tokens are too regular. Each patch is pinned to a fixed image position, so the model easily overfits to specific spatial patterns.
How JPM works:
- Shift: circularly shift the patch embeddings by k positions
- Patch Shuffle: randomly split the shifted patches into K groups
- Independent classification per group: every group must still identify who the person is
💡 Why does shuffling help? Three benefits:
- Robustness: the model is forced to recognize a person from "scrambled" local features, making it more robust against occlusion
- Diversity: each group contains different body parts, so each branch learns more diverse features
- Regularization: it prevents the model from overfitting to the pattern "head always on top, feet always at the bottom"
Analogy: imagine having to recognize a friend while being shown only a few random parts of their body each time; you must learn the characteristics of every part to succeed.
JPM with Concrete Numbers
Suppose ViT-B/16 cuts a 256×128 image into 16×8 = 128 patches, plus 1 CLS token = 129 tokens.
Step 1: Shift
With shift = 5:
- Original token order: [CLS, P1, P2, ..., P128]
- After the shift: [CLS, P6, P7, ..., P128, P1, P2, P3, P4, P5]
Step 2: Split into K=4 groups
- Group 1: P6, P10, P14, ... (every 4th token)
- Group 2: P7, P11, P15, ...
- Group 3: P8, P12, P16, ...
- Group 4: P9, P13, P17, ...
Each group contains patches from different regions of the image, giving heterogeneous spatial coverage.
Step 3: Classification per group
- Each group gets its own learnable token (like CLS) that produces one feature vector
- Together with the global CLS token, that makes K+1 feature vectors
- Each must correctly identify the person, yielding multiple complementary features
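The shift-and-group steps can be reproduced with plain list arithmetic. Note the official JPM applies a random shuffle before grouping; the deterministic round-robin split here just mirrors the worked example:

```python
# 128 patch tokens; the CLS token is handled separately
patches = [f"P{i}" for i in range(1, 129)]

shift, K = 5, 4
shifted = patches[shift:] + patches[:shift]   # circular shift: now starts at P6

# Round-robin split into K groups, matching the example above
groups = [shifted[g::K] for g in range(K)]

print(shifted[:3])    # ['P6', 'P7', 'P8']
print(groups[0][:3])  # ['P6', 'P10', 'P14']
print([len(g) for g in groups])  # [32, 32, 32, 32]
```

Each of the 4 groups ends up with 32 patches drawn from all over the image, which is exactly the heterogeneous coverage described above.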
A Subtle but Important TransReID Improvement: Overlapping Patches
Standard ViT uses stride = patch_size (i.e. 16), but TransReID found that stride = 12 (overlapping patches) significantly improves performance:
| Setting | Stride | Market1501 R1/mAP | MSMT17 R1/mAP |
|---|---|---|---|
| ViT-B/16 | 16 (no overlap) | 94.6 / 87.1 | 81.8 / 61.0 |
| ViT-B/16_s=12 | 12 (overlap) | 95.0 / 88.2 | 83.4 / 64.5 |
🔑 Why does overlap help? Because non-overlapping patch boundaries can cut through important features (e.g. a logo sliced in half). Overlap ensures each patch partially overlaps its neighbors, preserving more boundary information.
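The token-count arithmetic behind this is the standard sliding-window formula; a quick check for a 256×128 input with 16-pixel patches:

```python
def num_patches(h, w, patch=16, stride=16):
    # Patches per axis = floor((size - patch) / stride) + 1
    nh = (h - patch) // stride + 1
    nw = (w - patch) // stride + 1
    return nh, nw, nh * nw

print(num_patches(256, 128, stride=16))  # (16, 8, 128)
print(num_patches(256, 128, stride=12))  # (21, 10, 210)
```

So besides preserving boundary information, the overlapping stride also gives the Transformer more tokens to attend over (210 vs 128), at extra compute cost.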
Full TransReID Results
| Method | Backbone | Market1501 R1/mAP | MSMT17 R1/mAP |
|---|---|---|---|
| OSNet (ICCV 2019) | OSNet | 94.8 / 84.9 | 78.7 / 52.9 |
| BoT (CVPRW 2019) | ResNet50 | 94.5 / 85.9 | - / - |
| ABDNet (ICCV 2019) | ResNet50 | 95.6 / 88.3 | 82.3 / 60.8 |
| TransReID Baseline | ViT-B/16 | 94.6 / 87.1 | 81.8 / 61.0 |
| TransReID | ViT-B/16_s=12 | 95.2 / 89.5 | 85.3 / 67.4 |
🚀 Key observations:
- On MSMT17 (the largest, hardest dataset), TransReID reaches mAP = 67.4, 14.5 points above OSNet
- The ImageNet-21K pre-trained ViT-B/16 backbone is already strong on its own
- JPM + SIE add another 3-6 mAP on top of the baseline
- This was the first proof that a pure Transformer can dominate ReID
Part 2: SOLIDER — Semantically Controllable Self-Supervised Pre-training
From "Architecture Design" to "Pre-training"
TransReID proved the potential of Transformers for ReID, but it still relies on ImageNet pre-training, a dataset built for object classification.
Here lies the problem: ImageNet contains very few person images, so the domain gap between pre-training data and the downstream task is large.
SOLIDER's core question: would self-supervised pre-training on massive amounts of unlabeled person images work much better?
The answer: much better.
SOLIDER's Core Design
SOLIDER = Semantic cOntrollable seLf-supervIseD lEaRning
It has three core innovations:
Innovation 1: Pseudo Semantic Labels from Human Prior Knowledge
Generic self-supervised learning (e.g. DINO, MAE) learns pure appearance features: which pixels resemble which. But person images have a clear semantic structure: head, upper body, lower body, shoes.
SOLIDER exploits this prior knowledge:
- Run an off-the-shelf human parsing model (e.g. SCHP) over the training images to generate pseudo semantic labels
- Group pixels into semantic groups (head, upper body, lower body, etc.)
- During contrastive learning, features within the same semantic group should be more similar
💡 Why not use ground-truth labels? Because LUPerson has 4 million unlabeled images; no one can annotate that many by hand. Pseudo labels are the only viable option, and the parsing model's pseudo labels are accurate enough.
Innovation 2: Semantic Controller — One Model for Every Task
This is SOLIDER's most unique design. Different downstream tasks need different kinds of features:
| Task | Features Needed | Semantic vs Appearance |
|---|---|---|
| Person ReID | Clothing color, texture, logos | Appearance-leaning (λ ≈ 0.2) |
| Human Parsing | Body-part boundaries | Semantic-leaning (λ ≈ 0.8) |
| Pedestrian Detection | Overall human shape | Balanced (λ ≈ 0.5) |
| Attribute Recognition | Clothing type, color names | Semantic-leaning (λ ≈ 0.6) |
SOLIDER introduces a semantic controller, a conditional network that takes λ as input:
- λ = 0: outputs pure appearance features (good for ReID)
- λ = 1: outputs pure semantic features (good for parsing)
- λ = 0.5: balanced features (good for detection)
🎯 Why is this design elegant?
- During pre-training, conditioning on λ teaches the model to encode appearance and semantic information simultaneously
- At fine-tuning time, the user just sets a single λ value to adjust the features; no re-training needed
- One model covers all human-centric tasks, which is hugely practical
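Conceptually, λ interpolates between two kinds of information. Here is a toy numpy sketch of that idea; SOLIDER's actual controller is a small conditional network inside the backbone, not a post-hoc blend, and every name here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256

# Hypothetical stand-ins for the two feature types the backbone encodes
appearance_feat = rng.standard_normal(dim)
semantic_feat   = rng.standard_normal(dim)

def controlled_feature(lam):
    # lam=0 -> pure appearance, lam=1 -> pure semantic
    return (1 - lam) * appearance_feat + lam * semantic_feat

reid_feat    = controlled_feature(0.2)  # ReID: appearance-leaning
parsing_feat = controlled_feature(0.8)  # parsing: semantic-leaning
print(reid_feat.shape)  # (256,)
```

The practical point is the continuum: one pre-trained model, and a single scalar knob per downstream task.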
Innovation 3: Swin Transformer Backbone
SOLIDER picked Swin Transformer over ViT as its backbone, for these reasons:
| Property | ViT | Swin Transformer |
|---|---|---|
| Attention scope | Global (all tokens) | Shifted windows (local → global) |
| Resolution | Fixed patch size | Hierarchical (multi-scale feature maps) |
| Dense prediction | Needs a decoder | Native support (FPN-friendly) |
| Downstream flexibility | Mostly classification | Detection and segmentation both work |
Since SOLIDER must support 6 downstream tasks (including detection and parsing), Swin's hierarchical design is essential.
SOLIDER Results
| Method | Pre-training | Backbone | Market1501 mAP/R1 | MSMT17 mAP/R1 |
|---|---|---|---|---|
| TransReID | ImageNet-21K | ViT-B/16 | 89.5 / 95.2 | 67.4 / 85.3 |
| TransReID-SSL | LUPerson (SSL) | ViT-B/16 | 90.0 / 95.6 | 68.7 / 86.1 |
| PASS (ECCV 2022) | LUPerson (part-aware SSL) | ViT-B/16 | 90.3 / 95.8 | 70.0 / 86.8 |
| SOLIDER (Swin-T) | LUPerson (semantic SSL) | Swin-Tiny | 91.6 / 96.1 | 67.4 / 85.9 |
| SOLIDER (Swin-S) | LUPerson (semantic SSL) | Swin-Small | 93.3 / 96.6 | 76.9 / 90.8 |
| SOLIDER (Swin-B) | LUPerson (semantic SSL) | Swin-Base | 93.9 / 96.9 | 77.1 / 90.7 |
🚀 Key observations:
- SOLIDER Swin-B reaches MSMT17 mAP = 77.1, a 9.7-point jump over TransReID!
- Even the smallest Swin-Tiny already hits 91.6 mAP on Market1501
- With re-ranking, MSMT17 mAP reaches 86.5
- The same pre-trained model also handles detection, parsing, pose estimation, attribute recognition, and person search
SOLIDER Across Tasks
| Task | Dataset | Metric | ImageNet Pre-train | SOLIDER Swin-B |
|---|---|---|---|---|
| Person ReID | Market1501 | mAP | ~88 | 93.9 |
| Person ReID | MSMT17 | mAP | ~62 | 77.1 |
| Pedestrian Detection | CityPersons | MR-2 (↓) | ~11 | 9.7 |
| Human Parsing | LIP | mIoU | ~56 | 60.5 |
| Attribute Recognition | PA100K | mA | ~82 | 86.4 |
| Pose Estimation | COCO | AP | ~74 | 76.6 |
🎯 One model, six tasks, all SOTA: that is the power of human-centric pre-training.
Part 3: DINOv2 — Universal Features from a Foundation Model
From Human-Specific to Universal
SOLIDER pre-trains on human images, so it excels at human-centric tasks. But what if one model performed well on any vision task?
That model is DINOv2.
What Is DINOv2?
DINOv2 is a self-supervised vision foundation model released by Meta AI in 2023, trained on 142M diverse images (the LVD-142M dataset).
Key properties:
- 🦕 Self-supervised: trained with self-distillation (teacher-student), no labels at all
- 🌍 Universal features: the same frozen backbone handles classification, segmentation, depth estimation, retrieval...
- 📐 ViT backbones: ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14
- ❄️ Frozen features work: no fine-tuning needed; a linear probe or kNN already performs well
How DINOv2 Is Trained
DINOv2 combines several self-supervised objectives:
- DINO loss (self-distillation): the student's local-view output must match the teacher's global-view output
- iBOT loss (masked image modeling): mask some patches and have the model predict the masked features
- KoLeo regularizer: encourages uniformity of the feature space
💡 Why are DINOv2's features so good? Three reasons:
- Data scale: 142M images spanning natural scenes, cities, indoor settings, people, and more
- Data curation: the training data is carefully filtered via self-supervised retrieval (not random web scraping)
- Training recipe: combining multiple SSL objectives gives features both local and global discriminability
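To make the self-distillation idea concrete, here is a stripped-down numpy version of a DINO-style loss: the student's distribution is pulled toward a sharpened (low-temperature), centered teacher distribution. The temperatures and the centering term are illustrative defaults, not DINOv2's exact recipe:

```python
import numpy as np

def softmax(x, temp):
    x = x / temp
    x = x - x.max()           # numerical stability
    e = np.exp(x)
    return e / e.sum()

def dino_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04, center=0.0):
    # Teacher: centered + sharpened; treated as a fixed target (no gradient)
    p_t = softmax(teacher_logits - center, t_teacher)
    p_s = softmax(student_logits, t_student)
    return float(-(p_t * np.log(p_s + 1e-8)).sum())  # cross-entropy H(p_t, p_s)

rng = np.random.default_rng(0)
s = rng.standard_normal(64)

loss_same  = dino_loss(s, s)                        # two views that agree
loss_other = dino_loss(s, rng.standard_normal(64))  # unrelated views
print(loss_same, loss_other)
```

Minimizing this cross-entropy rewards the student for agreeing with the teacher across views, which is the mechanism behind "local view must match global view".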
DINOv2 for ReID: The Power of Frozen Features
Recent work has begun exploring DINOv2 as a ReID backbone. The biggest difference from traditional approaches: you do not fine-tune the model.
How it is used:
- Extract frozen features with DINOv2 ViT-L/14
- Add a lightweight head (a linear layer or a simple MLP)
- Train only the head
Advantages:
- Extremely low training cost (only a few layers are trained)
- No need for large amounts of ReID-specific training data
- Especially strong cross-domain generalization (the features are already universal)
| Method | Backbone | Training | Market1501 R1 | Cross-domain |
|---|---|---|---|---|
| OSNet | OSNet (2.2M) | Full fine-tune | 94.8 | ⚠️ Needs AIN |
| TransReID | ViT-B/16 (86M) | Full fine-tune | 95.2 | ⚠️ Needs SIE |
| SOLIDER | Swin-B (88M) | Fine-tune | 96.9 | ✅ Good |
| DINOv2 | ViT-L/14 (300M) | Frozen + linear probe | ~93-95 | ✅✅ Best |
🎯 DINOv2's trade-offs:
- ✅ Strongest cross-domain generalization: the features are not fine-tuned for any particular dataset
- ✅ Zero-shot / few-shot ability: almost no target-domain data needed
- ⚠️ Same-domain accuracy is not the highest: fine-tuned SOLIDER still leads on same-domain benchmarks
- ⚠️ Large model: ViT-L has 300M params, unsuitable for edge deployment
DINOv2's Attention Maps: Why Is It a Natural Fit for ReID?
DINOv2's learned attention maps have a striking property: they automatically segment objects, including different human body parts.
This ability emerges from pure self-supervision, with no segmentation labels whatsoever. For ReID this means:
- Automatic part-level attention: no JPM or external pose detector needed
- Semantic understanding: it knows which patches belong to the same body part
- Background suppression: background clutter is naturally ignored
Deep Comparison: How the Four Methods Fundamentally Differ
Architecture Comparison
| Property | OSNet | TransReID | SOLIDER | DINOv2 |
|---|---|---|---|---|
| Architecture | Custom CNN | ViT-B/16 | Swin-T/S/B | ViT-S/B/L/g |
| Pre-training data | ImageNet-1K | ImageNet-21K | LUPerson (4M people) | LVD-142M (general) |
| Pre-training method | Supervised | Supervised | Self-supervised + semantic | Self-supervised |
| ReID-specific design | AG + multi-scale | JPM + SIE | Semantic controller | None (general purpose) |
| Params | 2.2M | ~86M | ~88M | 300M+ |
| Edge-friendly? | ✅✅ | ⚠️ | ⚠️ | ❌ |
| Cross-domain | Needs AIN | SIE helps | Good | Best |
Which One Should You Pick? A Decision Framework
Implementation Guide
Method 1: TransReID (ViT-based ReID)
```python
# ===== Installation =====
# git clone https://github.com/damo-cv/TransReID.git
# pip install yacs timm

# ===== Training =====
# Download ViT-B/16 pretrained weights (ImageNet-21K)
# wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p16_224-80ecf9dd.pth

import torch
from config import cfg
from model import make_model
from processor import do_train

# Configuration
cfg.merge_from_file("configs/Market/vit_transreid.yml")
cfg.MODEL.PRETRAIN_PATH = "jx_vit_base_p16_224-80ecf9dd.pth"
cfg.MODEL.SIE_CAMERA = True       # Enable camera SIE
cfg.MODEL.SIE_VIEW = False        # Person ReID datasets usually have no viewpoint labels
cfg.MODEL.JPM = True              # Enable Jigsaw Patch Module
cfg.MODEL.STRIDE_SIZE = [12, 12]  # Overlapping patches

# Build the model (Market1501: 751 training identities, 6 cameras)
model = make_model(cfg, num_class=751, camera_num=6, view_num=0)
model = model.cuda()
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
# → Parameters: 86.5M
```

```bash
# Full training command
python train.py --config_file configs/Market/vit_transreid.yml \
    MODEL.DEVICE_ID "('0')" \
    MODEL.PRETRAIN_PATH 'jx_vit_base_p16_224-80ecf9dd.pth' \
    OUTPUT_DIR './logs/market_transreid'
```
Method 2: SOLIDER Pre-trained Model (Most Recommended)
```python
# ===== ReID with SOLIDER pretrained weights =====
# git clone https://github.com/tinyvision/SOLIDER-REID.git
# Download the SOLIDER Swin-Base weights

import torch
from model import make_model
from config import cfg

# SOLIDER's semantic controller:
# λ ≈ 0.2 for ReID (appearance-leaning)
cfg.merge_from_file("configs/market/swin_base.yml")
cfg.MODEL.PRETRAIN_PATH = "solider_swin_base.pth"
cfg.MODEL.SEMANTIC_WEIGHT = 0.2  # Semantic controller λ

model = make_model(cfg, num_class=751)
model = model.cuda()

# ===== Quick demo: SOLIDER's semantic controller =====
def demo_semantic_control():
    """Show the effect of λ."""
    import torch
    from solider import build_model

    model = build_model("swin_base", pretrained="solider_swin_base.pth")
    model.eval()
    dummy_input = torch.randn(1, 3, 256, 128)

    # λ=0.0 → pure appearance (best for ReID)
    feat_appearance = model(dummy_input, semantic_weight=0.0)
    # λ=0.5 → balanced (good for detection)
    feat_balanced = model(dummy_input, semantic_weight=0.5)
    # λ=1.0 → pure semantic (best for parsing)
    feat_semantic = model(dummy_input, semantic_weight=1.0)

    print(f"Appearance feat: {feat_appearance.shape}")
    print(f"Balanced feat:   {feat_balanced.shape}")
    print(f"Semantic feat:   {feat_semantic.shape}")
```

```bash
# Training SOLIDER-REID
python train.py --config_file configs/market/swin_base.yml \
    MODEL.PRETRAIN_PATH 'solider_swin_base.pth' \
    MODEL.SEMANTIC_WEIGHT 0.2 \
    OUTPUT_DIR './logs/market_solider'
```
💡 SOLIDER fine-tuning tips:
- For ReID, use SEMANTIC_WEIGHT = 0.2 (appearance-leaning)
- If the dataset is very small, try 0.3-0.4 (extra semantic info helps prevent overfitting)
- For cross-domain evaluation, try 0.4-0.5 (more generalizable)
- A cosine learning rate scheduler with warmup works best
Method 3: DINOv2 Frozen Features (Simplest)
```python
# ===== ReID with DINOv2: frozen backbone + linear probe =====
import torch
import torch.nn as nn
from torchvision import transforms

# Step 1: load DINOv2
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2 = dinov2.cuda().eval()

# Freeze all backbone parameters
for param in dinov2.parameters():
    param.requires_grad = False

# Step 2: define the ReID head
class DINOv2ReID(nn.Module):
    def __init__(self, backbone, feat_dim=1024, num_classes=751):
        super().__init__()
        self.backbone = backbone
        self.bn = nn.BatchNorm1d(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            features = self.backbone(x)  # CLS token features
        features = self.bn(features)
        if self.training:
            logits = self.classifier(features)
            return features, logits
        return features

model = DINOv2ReID(dinov2, feat_dim=1024, num_classes=751)
model = model.cuda()

# Step 3: train only the BN + classifier
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad],
    lr=3.5e-4, weight_decay=5e-4
)

# Step 4: data preprocessing (DINOv2 works best at 518×518 input)
transform = transforms.Compose([
    transforms.Resize((518, 518)),  # DINOv2 optimal resolution
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.1f}M")
print(f"Total params: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
# → Trainable params: ~0.8M (BatchNorm1d(1024) + Linear(1024, 751))
# → Total params: ~305M
```
🚀 Benefits of DINOv2 ReID:
- Extremely fast training: only ~0.8M trainable params (vs 86M for a full TransReID fine-tune)
- Low data requirements: the frozen features are already highly discriminative
- Strongest cross-domain transfer: no target-domain adaptation needed
- The cost: the backbone is still 300M params at inference time, so a GPU is required
Technical Deep Dive: How Do the Three Models' Feature Spaces Differ?
Differences in Attention Patterns
TransReID: learns person-specific attention patterns through self-attention
- The CLS token attends to the whole body
- Attention shifts from local (texture) to global (silhouette) across layers
- JPM makes the attention more diverse (no over-focusing on one body part)
SOLIDER: Swin's shifted-window attention plus semantic conditioning
- The semantic controller adjusts the level that attention focuses on
- Low λ: attention leans toward texture/color (appearance)
- High λ: attention leans toward body-part boundaries (semantic)
DINOv2: object-centric patterns emerge naturally from self-supervised attention
- It can segment body parts without any labels
- Its attention maps are highly consistent (stable across domains)
- But it may miss ReID-specific fine-grained details (it was never trained for ReID)
Impact of Pre-training Data
| Pre-training | Data | Domain Match | Feature Quality | Generalization |
|---|---|---|---|---|
| ImageNet-1K supervised | 1.2M images, 1K classes | ❌ Few people | ⭐⭐ | ⭐⭐ |
| ImageNet-21K supervised | 14M images, 21K classes | ❌ Few people | ⭐⭐⭐ | ⭐⭐⭐ |
| LUPerson SSL (SOLIDER) | 4M person images | ✅✅ All people | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| LVD-142M SSL (DINOv2) | 142M diverse images | ⚠️ Some people | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
💡 Insight: SOLIDER and DINOv2 represent two different scaling strategies:
- SOLIDER: domain-specific data + domain-specific SSL → strongest same-domain performance
- DINOv2: massive diverse data + general SSL → strongest generalization
The best recipe may ultimately combine both: DINOv2 for general pre-training, followed by SOLIDER-style human-centric fine-tuning.
Limitations and Future Directions
TransReID's limitations
- Depends on ImageNet pre-training: ViT trained from scratch on small datasets performs poorly
- SIE needs camera metadata: without camera IDs it cannot be used
- JPM raises inference cost: extra branches add computation
SOLIDER's limitations
- The LUPerson dataset is not public: you must request it from the authors
- Pseudo-label quality: parsing-model errors propagate into pre-training
- Swin is no longer the newest backbone: stronger backbones have emerged since 2024
DINOv2's limitations
- The model is large: ViT-L has 300M params, unsuitable for edge devices
- Not ReID-specific: same-domain accuracy is lower than SOLIDER's
- Resolution sensitivity: the optimal input size is 518×518, while ReID typically uses 256×128
Future directions
- DINOv2 + human-centric fine-tuning: combine DINOv2's universal features with SOLIDER's semantic control
- Efficient ViT for ReID: distill large models down to edge-friendly sizes
- Multi-modal ReID: incorporate text descriptions (CLIP-ReID) or LLM scene graphs
- Video-based foundation models: extend DINOv2 into the temporal domain
Technical Takeaways
1. Pre-training Data > Architecture Design
SOLIDER beat TransReID's carefully designed JPM + SIE using an "ordinary" Swin Transformer plus the "right" pre-training data. This matches the NLP experience: data + scale > clever architecture.
2. The Power of Self-Supervised Learning
Neither DINOv2 nor SOLIDER uses labels for pre-training, yet their features far surpass supervised ImageNet features. The trend suggests labels matter less and less; future computer vision may rely ever less on manual annotation.
3. The "General vs Specific" Spectrum
ReID's evolution shows that the most specific model (OSNet) is edge-friendly but has a low performance ceiling, while the most general model (DINOv2) has a high ceiling but is resource-heavy. Real deployments must find the right point on this spectrum.
4. The Importance of Camera-Aware Learning
TransReID's SIE proves that exploiting metadata (camera ID, viewpoint) significantly boosts ReID performance. This insight is especially valuable in production systems, because you usually know which camera took each image.
Summary
Core Takeaways
- TransReID (ICCV 2021): proved pure Transformers can dominate ReID; JPM provides robust part features, SIE fixes camera bias
- SOLIDER (CVPR 2023): human-centric SSL pre-training + semantic controller = one model for 6 tasks, same-domain SOTA
- DINOv2 (Meta 2023): universal visual features; frozen backbone + linear probe already achieves competitive ReID, with the strongest cross-domain generalization
How to Choose?
| Scenario | Recommendation | Why |
|---|---|---|
| Edge / on-camera deployment | OSNet x0.25 | 0.2M params, runs on an STM32 |
| Plenty of labeled data | SOLIDER Swin-B + fine-tune | Highest same-domain performance |
| No / little labeled data | DINOv2 ViT-L frozen + probe | No target-domain labels needed |
| Multiple human-centric tasks | SOLIDER | One pre-trained model covers them all |
| Balanced (GPU available) | TransReID + ImageNet-21K | Good accuracy, well documented |
Resources
TransReID
- 📄 Paper: arXiv:2102.04378 (ICCV 2021, 1500+ citations)
- 💻 Code: github.com/damo-cv/TransReID (MIT License)
- 📊 Results: Market1501 95.2/89.5, MSMT17 85.3/67.4
SOLIDER
- 📄 Paper: arXiv:2303.17602 (CVPR 2023, 200+ citations)
- 💻 Code: github.com/tinyvision/SOLIDER (1.5K stars, Apache 2.0)
- 💻 ReID downstream: github.com/tinyvision/SOLIDER-REID
- 📊 Results: Market1501 93.9/96.9, MSMT17 77.1/90.7
DINOv2
- 📄 Paper: arXiv:2304.07193 (Meta AI 2023)
- 💻 Code: github.com/facebookresearch/dinov2 (Apache 2.0)
- 🔗 Demo: dinov2.metademolab.com
- 📊 Models: ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14
Further Reading
- 📄 PASS (ECCV 2022): Part-Aware Self-Supervised Pre-Training — arXiv:2203.03931
- 📄 CLIP-ReID (AAAI 2023): ReID via CLIP's text-image alignment
- 📄 OSNet (ICCV 2019): previous post — OSNet: how does a lightweight 2.2M-parameter CNN beat ResNet50 at Person Re-ID?
The evolution of Person ReID teaches a profound lesson: in AI, "the right pre-training" often matters more than "a clever architecture". From OSNet's hand-crafted multi-scale CNN, to TransReID's Transformer adaptation, to SOLIDER's human-centric SSL, to DINOv2's universal foundation model: every step has been a victory of data and scale. Future ReID may not need any ReID-specific design at all, only better foundation models. 🔄✨