System Design: Architecture
Module Pipeline: Evaluate → Pretrain → Surgery → Head-Only
flowchart LR
subgraph CH9 ["Ch 9 · Evaluate (DONE)"]
direction TB
m04["m04 · VLM tag"] --> m05["m05 · V-JEPA embed"]
m04d["m04d · 23-D motion feat\n(Phase 0 ✅)"] --> probe_action["probe_action\n8-class motion top-1"]
m05 --> probe_action
probe_action --> probe_plot["probe_plot\ntop-1 · loss · confusion"]
end
subgraph CH10 ["Ch 10 · Pretrain (POC done)"]
direction TB
m09a1["m09a1_pretrain_encoder\nJEPA + motion_aux head\n+ L2 anchor"]
end
subgraph CH11 ["Ch 11 · Surgery (iter14 R1 winner)"]
direction TB
m10["m10 · SAM3"] --> m11["m11 · factor\nD_L · D_A · D_I"] --> m09c1["m09c1_surgery_encoder\nrecipe-v3 (SALT + LP-FT +\nSPD + CLEAR + MGMAE)"]
end
subgraph ITER15 ["iter15 · Head-Only (in progress)"]
direction TB
m09a2["m09a2_pretrain_head\n(frozen encoder + predictor,\n432K motion_aux head only)"]
m09c2["m09c2_surgery_head\n(same freeze + factor data)"]
end
m05 -->|"frozen"| m09a1
m09a1 -->|"adapted ckpt"| probe_action
m09c1 -->|"surgical ckpt"| probe_action
m09a2 -->|"head-only ckpt"| probe_action
m09c2 -->|"head-only surgery ckpt"| probe_action
style m04 fill:#00acc1,color:#fff,font-weight:bold
style m05 fill:#43a047,color:#fff,font-weight:bold
style m04d fill:#00897b,color:#fff,font-weight:bold
style probe_action fill:#e53935,color:#fff,font-weight:bold
style probe_plot fill:#6d4c41,color:#fff,font-weight:bold
style m09a1 fill:#7b1fa2,color:#fff,font-weight:bold
style m10 fill:#1565c0,color:#fff,font-weight:bold
style m11 fill:#0d47a1,color:#fff,font-weight:bold
style m09c1 fill:#0d47a1,color:#fff,font-weight:bold
style m09a2 fill:#bf360c,color:#fff,font-weight:bold
style m09c2 fill:#b71c1c,color:#fff,font-weight:bold
Full pipeline from motion-class evaluation through domain adaptation. Every adapted encoder (vanilla pretrain, surgery, iter15 head-only) flows back into probe_action for the same 8-class motion top-1 measurement. Because the metric stays fixed across iterations, the Δs between checkpoints are directly comparable.
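The fixed-metric comparison can be sketched as below. This is a minimal illustration, assuming each checkpoint's probe produces one predicted class index per clip over a shared label list. The function names `top1_accuracy` and `delta_vs_baseline` and the checkpoint keys are hypothetical and are not taken from the pipeline code.

```python
from typing import Dict, Sequence


def top1_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of clips whose predicted motion class matches the label."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def delta_vs_baseline(
    preds_by_ckpt: Dict[str, Sequence[int]],
    labels: Sequence[int],
    baseline: str,
) -> Dict[str, float]:
    """Top-1 difference of every checkpoint against the baseline checkpoint.

    All checkpoints are scored on the same labels, so the differences
    are comparable with each other.
    """
    base = top1_accuracy(preds_by_ckpt[baseline], labels)
    return {
        name: top1_accuracy(preds, labels) - base
        for name, preds in preds_by_ckpt.items()
        if name != baseline
    }
```

Keeping the probe, label set, and metric constant means any change in Δ can be attributed to the encoder checkpoint rather than to the evaluation setup.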
Ch10: JEPA Training Step (Single Iteration)
flowchart LR
subgraph DATA ["1 · Data"]
CLIP["Indian clip\n10s · 16 frames"] --> AUG["Augment\nRRC 384×384\nsame crop\nall frames"]
end
subgraph MASKING ["2 · Mask"]
AUG --> TOKENS["Patchify\n24×24×8\n=4608 tokens"]
TOKENS --> MGEN["MaskCollator\n8 small (15%)\n2 large (70%)"]
MGEN --> VIS["visible\n~500-1100"]
MGEN --> MASKED["masked\n~3500-4100"]
end
subgraph ENCODE ["3 · Encode"]
direction TB
VIS --> S_ENC["STUDENT\nViT-g · 1B\nvisible only\ntrainable"]
AUG2["same clip"] --> T_ENC["TEACHER\nViT-g · 1B\nALL tokens\nno grad"]
end
subgraph PREDICT ["4 · Predict + Loss"]
S_ENC --> PREDICTOR["PREDICTOR\n12 blocks\n384-dim"]
T_ENC --> SELECT["select masked\npositions"]
PREDICTOR --> L1["L1 LOSS\n|pred − target|"]
SELECT --> L1
end
subgraph UPDATE ["5 · Update"]
L1 --> ADAM["AdamW\n+ grad clip 1.0\n+ drift control"]
ADAM --> EMA_UP["EMA teacher\nτ = 0.99925"]
end
style CLIP fill:#5e35b1,color:#fff,font-weight:bold
style AUG fill:#00897b,color:#fff,font-weight:bold
style TOKENS fill:#795548,color:#fff,font-weight:bold
style MGEN fill:#e65100,color:#fff,font-weight:bold
style VIS fill:#2e7d32,color:#fff,font-weight:bold
style MASKED fill:#c62828,color:#fff,font-weight:bold
style S_ENC fill:#1565c0,color:#fff,font-weight:bold
style T_ENC fill:#546e7a,color:#fff,font-weight:bold
style AUG2 fill:#546e7a,color:#fff,font-weight:bold
style PREDICTOR fill:#6a1b9a,color:#fff,font-weight:bold
style SELECT fill:#37474f,color:#fff,font-weight:bold
style L1 fill:#c62828,color:#fff,font-weight:bold
style ADAM fill:#1a237e,color:#fff,font-weight:bold
style EMA_UP fill:#9c27b0,color:#fff,font-weight:bold
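One training iteration of V-JEPA continual pretraining. The student encodes only the visible tokens (~500-1100 of 4608, roughly 11-24%); the teacher encodes all tokens without gradients. The L1 loss is computed at masked positions only, after which the teacher is updated as an EMA of the student (τ = 0.99925).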
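The arithmetic behind the diagram can be sketched in plain Python. This is an illustrative sketch, not the training code: the patch size of 16 and tubelet size of 2 are assumptions chosen because they reproduce the 24×24×8 grid from a 384×384 crop and 16 frames, and the helper names `token_count`, `masked_l1`, and `ema_update` are hypothetical. Lists of floats stand in for tensors.

```python
from typing import List, Sequence


def token_count(crop: int = 384, patch: int = 16,
                frames: int = 16, tubelet: int = 2) -> int:
    """Tokens per clip: (crop / patch)^2 spatial patches x (frames / tubelet).

    ASSUMPTION: patch=16 and tubelet=2 are inferred from the 24x24x8 grid.
    """
    side = crop // patch
    return side * side * (frames // tubelet)


def masked_l1(pred: Sequence[Sequence[float]],
              target: Sequence[Sequence[float]],
              masked_idx: Sequence[int]) -> float:
    """Mean |pred - target| over masked positions only.

    pred holds one vector per masked position (predictor output).
    target holds one vector per token (teacher output over ALL tokens);
    masked_idx selects which teacher vectors serve as targets.
    """
    total, n = 0.0, 0
    for p_vec, idx in zip(pred, masked_idx):
        for p, t in zip(p_vec, target[idx]):
            total += abs(p - t)
            n += 1
    return total / n


def ema_update(teacher: Sequence[float], student: Sequence[float],
               tau: float = 0.99925) -> List[float]:
    """Elementwise teacher <- tau * teacher + (1 - tau) * student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    n_tokens = token_count()
    print(n_tokens)                        # 24 * 24 * 8
    print(500 / n_tokens, 1100 / n_tokens)  # visible fraction range
```

With τ this close to 1, each step moves the teacher only 0.075% of the way toward the student, so the prediction targets change slowly relative to the student weights.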
Ch11: Progressive Prefix Unfreezing with Factor Datasets
flowchart TB
subgraph SAM ["SAM3 Segmentation (m10)"]
direction LR
SEG["SAM3\ninstance masks"] --> TRACK["Greedy IoU\ntracklets"] --> AGENT["Agent vs Layout\nmotion filter"]
end
subgraph FACTOR ["Factor Datasets (m11)"]
direction LR
DL["D_L\nlayout-only\nblur agents"]
DA["D_A\nagent-only\nsuppress BG"]
DI["D_I\ninteraction\ntube crops"]
end
AGENT --> DL
AGENT --> DA
AGENT --> DI
subgraph STAGES ["3-Stage Surgery (m12)"]
direction LR
S1["Stage 1\nlayers 0→10\n100% D_L\nroads · wires"] --> S2["Stage 2\nlayers 0→20\n90% D_A\nrickshaws · cows"] --> S3["Stage 3\nlayers 0→30\n85% D_I\nagent interactions"]
end
FACTOR --> STAGES
S3 --> SURGICAL["V-JEPA\n(surgical)"]
style SEG fill:#0277bd,color:#fff,font-weight:bold
style TRACK fill:#01579b,color:#fff,font-weight:bold
style AGENT fill:#004d40,color:#fff,font-weight:bold
style DL fill:#1b5e20,color:#fff,font-weight:bold
style DA fill:#33691e,color:#fff,font-weight:bold
style DI fill:#827717,color:#fff,font-weight:bold
style S1 fill:#1565c0,color:#fff,font-weight:bold
style S2 fill:#1565c0,color:#fff,font-weight:bold
style S3 fill:#0d47a1,color:#fff,font-weight:bold
style SURGICAL fill:#4a148c,color:#fff,font-weight:bold
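Ch11 surgery: SAM3 segments each frame into instance masks, which are linked into tracklets by greedy IoU matching and split into agents and layout by a motion filter. The three factor datasets (D_L, D_A, D_I) then train progressively deeper layer prefixes (0→10, 0→20, 0→30) with replay mixing.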
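One way to express the stage schedule is as a small config plus a sampler, sketched below. The primary weights (100%, 90%, 85%) and prefix ends (10, 20, 30) come from the diagram. Everything else is an assumption made for illustration: that the remaining weight is replay split evenly over earlier factor datasets, that the prefix is inclusive, and that the encoder has 40 blocks. The names `Stage`, `STAGES`, `trainable_layers`, and `sample_sources` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    name: str
    last_layer: int         # end of the unfrozen prefix ("layers 0→N")
    mix: Dict[str, float]   # sampling weight per factor dataset


# Primary weights are from the diagram.
# ASSUMPTION: the remainder is replay, split evenly over earlier factors.
STAGES: List[Stage] = [
    Stage("stage1", 10, {"D_L": 1.00}),
    Stage("stage2", 20, {"D_A": 0.90, "D_L": 0.10}),
    Stage("stage3", 30, {"D_I": 0.85, "D_L": 0.075, "D_A": 0.075}),
]


def trainable_layers(stage: Stage, num_layers: int = 40) -> List[bool]:
    """Per-layer trainable flags for one stage.

    ASSUMPTION: the prefix is inclusive (layers 0..last_layer train,
    deeper layers stay frozen) and the encoder has num_layers blocks.
    """
    return [i <= stage.last_layer for i in range(num_layers)]


def sample_sources(stage: Stage, n: int, seed: int = 0) -> List[str]:
    """Draw n factor-dataset names according to the stage's mixing weights."""
    rng = random.Random(seed)
    names = list(stage.mix)
    weights = [stage.mix[k] for k in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    for stage in STAGES:
        n_train = sum(trainable_layers(stage))
        print(stage.name, n_train, sample_sources(stage, 8))
```

Keeping the schedule as data makes the replay split easy to change if the actual recipe distributes the remainder differently.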