Does V-JEPA 2 Understand Indian Streets? Benchmarking Video Foundation Models on DenseWorld-200K

Wanaskar, Kapil; Jena, Gaytri; Jain, Vinija; Chadha, Aman; Das, Amitava

Kapil Wanaskar¹ Gaytri Jena⁴ Vinija Jain² Aman Chadha³ Amitava Das⁴

¹Canva Research, USA ²Google, USA ³Apple, USA ⁴Pragya Lab, BITS Pilani Goa, India

115,687 clips • 714 videos • 22 cities • 276 hours • 121 GB

Tier 1 Cities — 6 metros, 68,614 clips

Kolkata

Chennai

Bangalore

Mumbai

Delhi

Hyderabad

Tier 2 Cities — 15 cities, 40,743 clips

Jaipur

Varanasi

Lucknow

Ahmedabad

Pune

Kochi

Chandigarh

Indore

Bhopal

Coimbatore

Nagpur

Visakhapatnam

Surat

Trivandrum

Mysuru

Full Taxonomy Coverage — 15 fields, 65+ values, v3 structured tags

Scene Type

market

residential

commercial

promenade

transit

highway

heritage

junction

flyover

beach

ghat

bazaar

skyline

Time of Day

day

night

Weather

clear

cloudy

rain

fog

overcast

Crowd

high

medium

low

Traffic

high

medium

low

Traffic Mix

mixed motor.

pedestrian

motorized

mixed all

Ped-Veh Sep.

separated

partial

shared space

Road Layout

intersection

narrow lane

wide road

sidewalk

bridge

Road Surface

asphalt

paved

wet

dirt

concrete

cobblestone

mixed

unpaved

Infrastructure

good

moderate

poor

Encroachment

clear

partial

heavy

Objects

auto rickshaw

sacred cow

street vendor

bus

cycle rickshaw

Vegetation

dense

moderate

sparse

none

Lighting

natural

artificial

mixed

Video Quality

clean

200K+ clips across 714 videos from 25+ Indian cities. Every clip is tagged with 16 structured attributes by Qwen3-VL-8B. Shown above: all 15 taggable fields and their 65 values.

DenseWorld-200K Dataset

115,687 video clips • 714 source videos • 22 Indian cities • 276h of video • 121 GB total size • 16 taxonomy fields

Metric	Value
Clip duration	4–12 seconds (scene-aware cuts)
Taxonomy	16 fields (v3): 13 single + 2 multi
VLM tagger	Qwen3-VL-8B (0.919 bake-off score)
Format	WebDataset TARs on HuggingFace
POC subset	10K clips (video-level uniform, seed=42)
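
One possible reading of "video-level uniform, seed=42" is that the POC subset takes roughly the same number of clips from every source video instead of sampling clips globally. The sketch below illustrates that reading only; the function name `sample_poc_subset` and the clip-to-video mapping are hypothetical and not taken from the project code.

```python
import random
from collections import defaultdict

def sample_poc_subset(clip_to_video, target=10_000, seed=42):
    """Pick up to `target` clips, spreading the quota evenly over videos.

    clip_to_video: dict mapping clip id -> source video id (hypothetical layout).
    """
    rng = random.Random(seed)
    by_video = defaultdict(list)
    for clip_id, video_id in sorted(clip_to_video.items()):
        by_video[video_id].append(clip_id)
    # Shuffle inside each video so the round-robin picks are random per video.
    for clips in by_video.values():
        rng.shuffle(clips)
    chosen = []
    videos = sorted(by_video)
    # Round-robin: one clip per video per pass until the quota is met
    # or every clip has been used.
    while len(chosen) < target and videos:
        for v in videos:
            if len(chosen) == target:
                break
            chosen.append(by_video[v].pop())
        videos = [v for v in videos if by_video[v]]
    return chosen
```

Round-robin selection keeps long videos from dominating the subset, which is the property "video-level uniform" appears to be after.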

Browse on HuggingFace

Full taxonomy distribution (v3, 16 fields) across 10,000 POC clips tagged by Qwen3-VL-8B.

Clips Per City — Full Breakdown drive · walk · drone · rain

Tier 1 Cities — 6 metros

Tier 2 Cities — 15 cities

Goa · Monuments — 4 destinations

Grand Total: 115,687 clips | 275.8 hours | 121.2 GB

M11 · Factor Decomposition · Per-Clip Verification

What the surgery actually sees.

Streaming-pass outputs from the m11 factor extractor — layout (_DL), agent (_DA), and interaction (_DI) — the three masked views that feed the recipe-v3 surgery loss. Eight clips, eight Indian cities, eight conditions.

tier 1

Bangalore drive

tier 1

Delhi walking

tier 1

Kolkata walking

tier 2 · drone

Coimbatore aerial

tier 2 · rain

Ahmedabad drive · rain

tier 2

Varanasi walking

tier 2

Nagpur drive

tier 2 · rain

Bhopal drive · rain

Per-Factor Train Loss

m09c surgery 3-stage DI encoder — Stage 1 D_L → Stage 2 D_A → Stage 3 D_I.

Per-stage training loss curves across the three factors D_L, D_A, D_I

Customized Loss Function

$$ \mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{JEPA}}^{\text{salt}} \;+\; \alpha \,\mathcal{L}_{\text{CE}}^{\text{8-cls}} \;+\; \beta \,\mathcal{L}_{\text{MSE}}^{\text{23-D}} \;+\; \lambda \,\mathcal{L}_{\text{anchor}}^{L_2} $$

$$ \mathcal{L}_{\text{JEPA}}^{\text{salt}} \;=\; \sum_{p \,\in\, \mathcal{P}_{\text{mask}}} w(p) \cdot \bigl\lVert \mathrm{student}(p) - \mathrm{teacher}_{\text{salt}}(p) \bigr\rVert_{1} $$

$\mathcal{L}_{\text{JEPA}}^{\text{salt}}$: saliency-weighted L1 against frozen SALT teacher latents (fill in the masked patches)
$\mathcal{L}_{\text{CE}}^{\text{8-cls}}$: motion-class cross-entropy over 8 classes (name the motion class)
$\mathcal{L}_{\text{MSE}}^{\text{23-D}}$: RAFT motion-vector regression (predict the optical flow)
$\mathcal{L}_{\text{anchor}}^{L_2}$: L2 drift anchor to the Meta initialization (don't forget the prior brain)
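
To make the four terms concrete, here is a minimal NumPy sketch of how they could be combined for a single clip. It is an illustration under stated assumptions, not the project's training code: the function names, the default weights `alpha`, `beta`, `lam`, and the use of a squared L2 norm for the anchor are all assumptions.

```python
import numpy as np

def jepa_salt_loss(student, teacher, weights):
    """Saliency-weighted L1 over masked patches: sum_p w(p) * ||s(p) - t(p)||_1."""
    per_patch_l1 = np.abs(student - teacher).sum(axis=-1)  # shape: (num_masked,)
    return float((weights * per_patch_l1).sum())

def cross_entropy(logits, label):
    """Cross-entropy of one sample over the motion classes (stable log-softmax)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def total_loss(student, teacher, weights, logits, label,
               motion_pred, motion_target, params, init_params,
               alpha=1.0, beta=1.0, lam=1.0):
    """L_total = L_JEPA + alpha * L_CE + beta * L_MSE + lam * L_anchor."""
    l_jepa = jepa_salt_loss(student, teacher, weights)
    l_ce = cross_entropy(logits, label)
    l_mse = float(np.mean((motion_pred - motion_target) ** 2))
    # Assumption: the anchor is the squared L2 distance to the initial weights.
    l_anchor = float(np.sum((params - init_params) ** 2))
    return l_jepa + alpha * l_ce + beta * l_mse + lam * l_anchor
```

When the student matches the teacher, the motion prediction is exact, and the weights have not moved, only the cross-entropy term remains, which is a quick sanity check for any implementation of this sum.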

Per-component loss decomposition over training steps — JEPA / CE / MSE / anchor

Loss decomposition. 4 components stacked per training step.

Probe-trio validation trajectory: top-1, motion-cos, future-L1 across training steps

Validation trajectory. Probe-trio (top-1 / motion-cos / future-L1) correlates with the decomposed train loss on the left.

Per-Block Weight Drift

Tradeoff axis: no movement ↔ catastrophic forgetting. Stage-gated backward keeps frozen blocks pinned at 0 drift while top-of-stack blocks adapt within budget.
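
A per-block drift curve of this kind can be computed by comparing each block's current weights with its initial weights. The sketch below assumes a relative L2 norm as the drift measure and a plain dictionary of flattened weight arrays per block; both are illustrative choices, not details confirmed by the source.

```python
import numpy as np

def per_block_drift(init_blocks, current_blocks):
    """Relative L2 drift ||W - W0|| / ||W0|| for each named block."""
    drift = {}
    for name, w0 in init_blocks.items():
        w = current_blocks[name]
        drift[name] = float(np.linalg.norm(w - w0) / (np.linalg.norm(w0) + 1e-12))
    return drift
```

A block that stayed frozen reports exactly 0, while an adapted block reports a positive value that can be checked against whatever drift budget is in use.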

Motion-Class Spectrum

8 motion classes from 23-D camera-subtracted optical-flow features. Each class shows 5 representative clips (closest to class centroid in 23-D space). Top row of each cell: original mp4. Bottom row: RAFT optical flow rendered via the Middlebury color wheel — hue = direction, saturation = magnitude.
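
Selecting the representative clips amounts to a nearest-to-centroid lookup per class. A minimal sketch, assuming Euclidean distance in the 23-D feature space and integer class labels (the function name `representative_clips` is hypothetical):

```python
import numpy as np

def representative_clips(features, labels, k=5):
    """Return, for each class, the indices of the k clips nearest its centroid."""
    reps = {}
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)          # clips belonging to this class
        centroid = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - centroid, axis=1)
        reps[int(cls)] = idx[np.argsort(dist)[:k]].tolist()
    return reps
```

With a `(9297, 23)` feature matrix and 8 classes this would return 8 lists of 5 indices each, matching the 8 × 5 gallery layout.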

🎬

Motion-Class Spectrum Gallery (8 classes × 5 clips + RAFT color-wheel viz)

pending: src/utils/motion_spectrum_gallery.py (planCODE_html.md Stage G — ~250 LoC). Inputs: data/eval_10k_local/motion_features.npy (9297, 23) + outputs/full/probe_action/action_labels.json (8 surviving classes). Outputs: 40 clip mp4s + 40 RAFT flow mp4s (~30 min GPU).

Within each row, all 5 clips will share a CONSISTENT color signature in the RAFT viz column — e.g., fast__rightward tints red, still__downward stays light with faint cyan. That visual constancy is the visceral proof that the 23-D feature actually captures direction + magnitude, even when the scenes differ (market vs drive vs walking).
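
The "hue = direction, saturation = magnitude" mapping can be illustrated with a plain HSV conversion. This is a simplified stand-in for the Middlebury color wheel, not a reproduction of it: the exact hues of the real wheel differ, and `max_mag` is an assumed normalization constant.

```python
import colorsys
import math

def flow_to_rgb(dx, dy, max_mag=1.0):
    """Map one flow vector to RGB: hue from direction, saturation from magnitude."""
    angle = math.atan2(dy, dx) % (2 * math.pi)   # 0 radians = rightward motion
    hue = angle / (2 * math.pi)
    sat = min(math.hypot(dx, dy) / max_mag, 1.0)
    # Value is fixed at 1.0, so zero flow renders as white.
    return colorsys.hsv_to_rgb(hue, sat, 1.0)
```

Under this mapping a fast rightward vector comes out saturated red and a still pixel stays white, which is the kind of per-class color signature the gallery is meant to expose.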

Ablation Results — Backbone Champion & Factor Surgery vs Continual SSL

10K-clip POC on DenseWorld · every cell carries a BCa 10K-bootstrap 95% CI · paired duels are compute-matched (same data, steps, optimizer — only the technique differs).
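
For readers unfamiliar with BCa intervals, the sketch below shows one way to compute a bias-corrected and accelerated bootstrap CI for a paired per-clip difference. It is a generic textbook construction written for illustration; the function name `bca_ci_paired` and its defaults are assumptions, and the project's own evaluation code may differ.

```python
import numpy as np
from statistics import NormalDist

def bca_ci_paired(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """BCa bootstrap CI for the mean per-clip difference a - b."""
    rng = np.random.default_rng(seed)
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    theta = d.mean()
    # Resampling the differences keeps each clip's pair together.
    boot = d[rng.integers(0, n, size=(n_boot, n))].mean(axis=1)
    nd = NormalDist()
    # Bias correction: share of bootstrap means below the observed mean.
    frac = np.clip((boot < theta).mean(), 1 / n_boot, 1 - 1 / n_boot)
    z0 = nd.inv_cdf(float(frac))
    # Acceleration from jackknife (leave-one-out) means.
    jack = (d.sum() - d) / (n - 1)
    diff = jack.mean() - jack
    denom = 6.0 * (diff ** 2).sum() ** 1.5
    accel = (diff ** 3).sum() / denom if denom > 0 else 0.0

    def adjusted(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))

    lo, hi = np.quantile(boot, [adjusted(alpha / 2), adjusted(1 - alpha / 2)])
    return float(theta), float(lo), float(hi)
```

A duel cell would then count as a win when the interval lies entirely on one side of zero and as a tie when it crosses zero. The index matrix grows as `n_boot × n`, so large test sets may need the resampling done in chunks.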

F1 · Frozen video encoders are nearly motion-blind

Across 10 frozen backbones (V-JEPA 1/2.0/2.1, I-JEPA, DINOv2, LeJEPA), motion-cosine separation is ≤ 0.019 — essentially zero linearly-readable motion signal, while action top-1 reaches 44%. Semantics are abundant; motion is the missing factor.

F2 · Surgery beats matched continual SSL on motion/temporal

On V-JEPA 2.1 ViT-G (2B): surgery wins 4 metrics, vanilla wins 0, 5 ties — all 4 wins are motion/temporal (motion-cos, future-frame, causal, mask-ratio). Crucially, the ties include action top-1 and taxonomy-F1: no semantic forgetting.

F3 · The effect is representation-dependent

The same surgery recipe on the older V-JEPA 2.0 ViT-g flips to 0 wins vs 5 — factor surgery exploits structure that only the newer V-JEPA 2.1 representation exposes. This selects V-JEPA 2.1 ViT-G (2B) as the paper’s base model.

Scorecard table — 10 frozen encoders × 3 shared metrics with BCa 95% CIs; V-JEPA 2.1 leads action top-1, DINOv2 leads taxonomy-F1, all motion-cos values near zero

🏆 Backbone selection (frozen, no Δ). 10 frozen encoders × 3 shared head metrics; blue box = best per metric. V-JEPA 2.1 tops action top-1 (44.4 ± 2.3), DINOv2 tops taxonomy-F1 (0.816), and the entire column of motion-cos sits at 0.004–0.019 — the quantitative motivation for factor surgery (finding F1).

Paired-difference heatmap — surgery minus vanilla continual SSL across 9 metrics × 3 backbones; green = surgery significant, red = vanilla, grey = tie

⚔️ Champion duel: surgery − vanilla continual SSL (paired per-clip Δ, 95% BCa CI; green = surgery significantly better, red = vanilla, grey = CI crosses 0). Column 1 (V-JEPA 2.1 ViT-G 2B): surgery 4–0 with 5 ties — wins on motion-cos (+0.010), future-frame (+0.025), causal (+0.019), mask-ratio (+0.009); ties on both semantic metrics. Column 2 (V-JEPA 2.1 ViT-g 1B): surgery 5–2. Column 3 (V-JEPA 2.0 ViT-g 1B): 0–5 — the reversal behind finding F3.

Raw-value grids — per-backbone tables of all encoders × all 10 metrics with CIs (champion V-JEPA 2.1 ViT-G panel on top)

📋 Raw values behind the duel (one panel per backbone; champion 2B panel on top) — every encoder × every metric with its CI, for the paper’s appendix tables. Click to open full-resolution.

Pipeline schema — raw clips feed a shared continual-SSL init; seven adaptation arms (Full-FT, LP-FT, LoRA/DoRA, Auto-RGN, CaSSLe+EWC, Surgery, Surgery-RAW control) export encoders into one paired evaluation

🗺️ The fine-tuning-technique ablation, at a glance. One shared continual-SSL init fans out into every standard adaptation family (Full-FT, LP-FT, LoRA→DoRA, Auto-RGN — the must-beat namesake in red — CaSSLe+EWC) on RAW clips, plus Surgery (ours, green) on the factor curriculum and Surgery-RAW (dashed) — the same method on raw clips, isolating the factor-curriculum effect from the method effect. Every arm exports the same encoder format into one paired evaluation (9 metrics, N=1825 test clips, BCa 95% CI).

🔭 In progress — fine-tuning-technique ablation (13 arms)

On the champion backbone, factor surgery is now being compared against the standard adaptation arsenal — full fine-tune, LP-FT, LoRA, DoRA, AutoRGN, CaSSLe, EWC, raw-data surgery control and head-only variants — same data, same checkpoint-selection rule (held-out future-frame L1), same 451-clip held-out validation pool. Results land here as the hero grid + paired-Δ CIs.

Where Does V-JEPA Fit?

A visual guide to machine learning paradigms and how V-JEPA 2's training relates to them.

Machine Learning Taxonomy

mindmap
    root((Machine Learning))
        **Supervised**
            Classification
                cat vs dog
                spam detection
            Regression
                price prediction
                forecasting
        **Unsupervised**
            Clustering
                k-means
                DBSCAN
            Dim. Reduction
                PCA
                UMAP
            Association Rules
                market basket
        **Self-Supervised: V-JEPA 2 lives here** 🎯
            Contrastive
                SimCLR
                MoCo
            Masked Prediction
                Pixel-space
                    MAE
                    VideoMAE
                :::highlight
                Latent-space
                :::highlight
                    **V-JEPA 2** ✦
            Distillation
                DINO / BYOL
            Generative
                GPT pretraining
        **Reinforcement**
            RL from Human Feedback
                PPO / DPO / GRPO
            Robotics
                Q-learning / SAC

How Self-Supervised Learning Methods Compare

Method	What It Predicts	Needs Labels?	Needs Negatives?	Used By
Contrastive	Pull similar views together, push different apart	No	Yes (hard to scale)	SimCLR, MoCo
Pixel Masked	Reconstruct missing pixels	No	No	MAE, VideoMAE
Latent Masked ✦	Predict missing features (not pixels)	No	No	V-JEPA, V-JEPA 2
Distillation	Student matches teacher's output	No	No	DINO, BYOL
Generative	Predict next token (GPT) or masked token (BERT)	No	No	GPT, BERT

V-JEPA 2 in One Picture

🎬

1. Take a video clip

10s Indian street scene, 16 frames

🧩

2. Mask 80% of patches

Student only sees 20% of the video

🎯

3. Predict what's hidden

In feature space, not pixels — the key insight

🔄

4. Teacher provides targets

EMA copy of student, sees ALL patches

No labels. No rewards. No human feedback.
The supervision comes from the video itself.

If You Know LLMs...

	LLMs (GPT, LLaMA)	V-JEPA 2
Input	Text tokens	Video patch tokens (spatiotemporal)
Training task	Predict next token	Predict masked patch features
Architecture	Decoder-only transformer	Encoder (student) + predictor + teacher (EMA)
Fine-tuning	Supervised Fine-Tuning → Reinforcement Learning from Human Feedback (PPO/DPO/GRPO)	Continual Self-Supervised Learning → Surgery (prefix unfreezing)
Output	Generated text	Fixed-dim embedding per video clip

System Design: Architecture

Module Pipeline: Evaluate → Pretrain → Surgery → Head-Only

flowchart LR
    subgraph CH9 ["Ch 9 · Evaluate (DONE)"]
        direction TB
        m04["m04 · VLM tag"] --> m05["m05 · V-JEPA embed"]
        m04d["m04d · 23-D motion feat\n(Phase 0 ✅)"] --> probe_action["probe_action\n8-class motion top-1"]
        m05 --> probe_action
        probe_action --> probe_plot["probe_plot\ntop-1 · loss · confusion"]
    end
    subgraph CH10 ["Ch 10 · Pretrain (POC done)"]
        direction TB
        m09a1["m09a1_pretrain_encoder\nJEPA + motion_aux head\n+ L2 anchor"]
    end
    subgraph CH11 ["Ch 11 · Surgery (iter14 R1 winner)"]
        direction TB
        m10["m10 · SAM3"] --> m11["m11 · factor\nD_L · D_A · D_I"] --> m09c1["m09c1_surgery_encoder\nrecipe-v3 (SALT + LP-FT +\nSPD + CLEAR + MGMAE)"]
    end
    subgraph ITER15 ["iter15 · Head-Only (in progress)"]
        direction TB
        m09a2["m09a2_pretrain_head\n(frozen encoder + predictor,\n432K motion_aux head only)"]
        m09c2["m09c2_surgery_head\n(same freeze + factor data)"]
    end
    m05 -->|"frozen"| m09a1
    m09a1 -->|"adapted ckpt"| probe_action
    m09c1 -->|"surgical ckpt"| probe_action
    m09a2 -->|"head-only ckpt"| probe_action
    m09c2 -->|"head-only surgery ckpt"| probe_action
    style m04 fill:#00acc1,color:#fff,font-weight:bold
    style m05 fill:#43a047,color:#fff,font-weight:bold
    style m04d fill:#00897b,color:#fff,font-weight:bold
    style probe_action fill:#e53935,color:#fff,font-weight:bold
    style probe_plot fill:#6d4c41,color:#fff,font-weight:bold
    style m09a1 fill:#7b1fa2,color:#fff,font-weight:bold
    style m10 fill:#1565c0,color:#fff,font-weight:bold
    style m11 fill:#0d47a1,color:#fff,font-weight:bold
    style m09c1 fill:#0d47a1,color:#fff,font-weight:bold
    style m09a2 fill:#bf360c,color:#fff,font-weight:bold
    style m09c2 fill:#b71c1c,color:#fff,font-weight:bold

Full pipeline from motion-class evaluation through domain adaptation. Every adapted encoder (vanilla pretrain, surgery, iter15 head-only) flows back into probe_action for the same 8-class motion top-1 measurement — metric stays fixed across iterations, so Δs are directly comparable.

Ch10: JEPA Training Step (Single Iteration)

flowchart LR
    subgraph DATA ["1 · Data"]
        CLIP["Indian clip\n10s · 16 frames"] --> AUG["Augment\nRRC 384×384\nsame crop\nall frames"]
    end
    subgraph MASKING ["2 · Mask"]
        AUG --> TOKENS["Patchify\n24×24×8\n=4608 tokens"]
        TOKENS --> MGEN["MaskCollator\n8 small (15%)\n2 large (70%)"]
        MGEN --> VIS["visible\n~500-1100"]
        MGEN --> MASKED["masked\n~3500-4100"]
    end
    subgraph ENCODE ["3 · Encode"]
        direction TB
        VIS --> S_ENC["STUDENT\nViT-g · 1B\nvisible only\ntrainable"]
        AUG2["same clip"] --> T_ENC["TEACHER\nViT-g · 1B\nALL tokens\nno grad"]
    end
    subgraph PREDICT ["4 · Predict + Loss"]
        S_ENC --> PREDICTOR["PREDICTOR\n12 blocks\n384-dim"]
        T_ENC --> SELECT["select masked\npositions"]
        PREDICTOR --> L1["L1 LOSS\n|pred − target|"]
        SELECT --> L1
    end
    subgraph UPDATE ["5 · Update"]
        L1 --> ADAM["AdamW\n+ grad clip 1.0\n+ drift control"]
        ADAM --> EMA_UP["EMA teacher\nτ = 0.99925"]
    end
    style CLIP fill:#5e35b1,color:#fff,font-weight:bold
    style AUG fill:#00897b,color:#fff,font-weight:bold
    style TOKENS fill:#795548,color:#fff,font-weight:bold
    style MGEN fill:#e65100,color:#fff,font-weight:bold
    style VIS fill:#2e7d32,color:#fff,font-weight:bold
    style MASKED fill:#c62828,color:#fff,font-weight:bold
    style S_ENC fill:#1565c0,color:#fff,font-weight:bold
    style T_ENC fill:#546e7a,color:#fff,font-weight:bold
    style AUG2 fill:#546e7a,color:#fff,font-weight:bold
    style PREDICTOR fill:#6a1b9a,color:#fff,font-weight:bold
    style SELECT fill:#37474f,color:#fff,font-weight:bold
    style L1 fill:#c62828,color:#fff,font-weight:bold
    style ADAM fill:#1a237e,color:#fff,font-weight:bold
    style EMA_UP fill:#9c27b0,color:#fff,font-weight:bold

One training iteration of V-JEPA continual pretraining. Student sees ~20% visible tokens; teacher sees all. L1 loss at masked positions only.
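
The numbers in the diagram can be checked with a few lines of arithmetic, together with the two update rules it names. In this sketch the patch size of 16 and tubelet size of 2 are assumptions chosen because they reproduce the 24×24×8 grid; the helper names are hypothetical and the parameter dictionaries stand in for real model weights.

```python
import numpy as np

# Token count for a 384x384 crop with 16 frames.
# PATCH and TUBELET are assumed values that reproduce the 24x24x8 grid.
HEIGHT = WIDTH = 384
FRAMES = 16
PATCH = 16
TUBELET = 2
grid = (HEIGHT // PATCH, WIDTH // PATCH, FRAMES // TUBELET)
num_tokens = grid[0] * grid[1] * grid[2]  # 24 * 24 * 8

def masked_l1(pred, target, masked_idx):
    """Mean |pred - target| over masked token positions only."""
    return float(np.abs(pred - target[masked_idx]).mean())

def ema_update(teacher, student, tau=0.99925):
    """teacher <- tau * teacher + (1 - tau) * student, per parameter."""
    return {k: tau * teacher[k] + (1.0 - tau) * student[k] for k in teacher}
```

With τ = 0.99925 the teacher moves only 0.075% of the way toward the student per step, so its targets stay stable while the student trains.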

Ch11: Progressive Prefix Unfreezing with Factor Datasets

flowchart TB
    subgraph SAM ["SAM3 Segmentation (m10)"]
        direction LR
        SEG["SAM3\ninstance masks"] --> TRACK["Greedy IoU\ntracklets"] --> AGENT["Agent vs Layout\nmotion filter"]
    end
    subgraph FACTOR ["Factor Datasets (m11)"]
        direction LR
        DL["D_L\nlayout-only\nblur agents"]
        DA["D_A\nagent-only\nsuppress BG"]
        DI["D_I\ninteraction\ntube crops"]
    end
    AGENT --> DL
    AGENT --> DA
    AGENT --> DI
    subgraph STAGES ["3-Stage Surgery (m12)"]
        direction LR
        S1["Stage 1\nlayers 0→10\n100% D_L\nroads · wires"] --> S2["Stage 2\nlayers 0→20\n90% D_A\nrickshaws · cows"] --> S3["Stage 3\nlayers 0→30\n85% D_I\nagent interactions"]
    end
    FACTOR --> STAGES
    S3 --> SURGICAL["V-JEPA\n(surgical)"]
    style SEG fill:#0277bd,color:#fff,font-weight:bold
    style TRACK fill:#01579b,color:#fff,font-weight:bold
    style AGENT fill:#004d40,color:#fff,font-weight:bold
    style DL fill:#1b5e20,color:#fff,font-weight:bold
    style DA fill:#33691e,color:#fff,font-weight:bold
    style DI fill:#827717,color:#fff,font-weight:bold
    style S1 fill:#1565c0,color:#fff,font-weight:bold
    style S2 fill:#1565c0,color:#fff,font-weight:bold
    style S3 fill:#0d47a1,color:#fff,font-weight:bold
    style SURGICAL fill:#4a148c,color:#fff,font-weight:bold

Ch11 surgery: SAM3 segments each frame into agents and layout. Three factor datasets train progressively deeper layers with replay mixing.
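
The stage schedule in the diagram can be written down as a small config plus a sampler. This is a sketch under assumptions: the layer bounds are treated as inclusive, the non-primary share of each stage is assumed to be replay drawn uniformly from earlier factors, and `num_layers=40` is a placeholder depth. None of these details are confirmed by the source.

```python
import random

# Stage schedule read off the diagram; replay handling is an assumption.
STAGES = [
    {"name": "stage1", "unfreeze_upto": 10, "primary": "D_L", "primary_frac": 1.00},
    {"name": "stage2", "unfreeze_upto": 20, "primary": "D_A", "primary_frac": 0.90},
    {"name": "stage3", "unfreeze_upto": 30, "primary": "D_I", "primary_frac": 0.85},
]
ORDER = ["D_L", "D_A", "D_I"]

def trainable_layers(stage, num_layers=40):
    """True for layers 0..unfreeze_upto (inclusive), False for deeper layers."""
    return [i <= stage["unfreeze_upto"] for i in range(num_layers)]

def sample_factor(stage, rng):
    """Pick the factor dataset for one training sample in this stage."""
    earlier = ORDER[:ORDER.index(stage["primary"])]
    if not earlier or rng.random() < stage["primary_frac"]:
        return stage["primary"]
    # Replay: revisit a factor from an earlier stage.
    return rng.choice(earlier)
```

Calling `sample_factor(STAGES[2], random.Random(0))` repeatedly yields D_I about 85% of the time and replays D_L or D_A otherwise, which is the mixing the caption describes.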

BibTeX