System Design: Next Steps
Module Pipeline: Evaluate → Pretrain → Surgery
flowchart LR
subgraph CH9 ["Ch 9 · Evaluate (DONE)"]
direction TB
m04["m04 · VLM tag"] --> m05["m05 · V-JEPA embed"]
m05 --> m05b["m05b · baselines"]
m05 --> m05c["m05c · true overlap"]
m04d["m04d · motion feat"] --> m06b["m06b · temporal corr"]
m05 --> m06["m06 · FAISS metrics"]
m06 --> m07["m07 · UMAP"]
m07 --> m08["m08 · plots"]
m06 --> m08b["m08b · compare"]
end
subgraph CH10 ["Ch 10 · Pretrain (NEXT)"]
direction TB
m09["m09_pretrain.py\nJEPA loss + EMA\n+ drift control"]
end
subgraph CH11 ["Ch 11 · Surgery (FUTURE)"]
direction TB
m10["m10 · SAM3"] --> m11["m11 · factor\ndatasets"] --> m12["m12 · surgery"]
end
subgraph REEVAL ["Re-eval · 3-way"]
direction TB
re05["m05 re-embed"] --> re06["m06 metrics"] --> re08["m08b compare\nfrozen · adapted\n· surgical"]
end
m05 -->|"frozen"| m09
m09 -->|"adapted ckpt"| REEVAL
m09 -->|"init"| m12
m12 -->|"surgical ckpt"| REEVAL
style m04 fill:#00acc1,color:#fff,font-weight:bold
style m05 fill:#43a047,color:#fff,font-weight:bold
style m05b fill:#546e7a,color:#fff,font-weight:bold
style m05c fill:#2e7d32,color:#fff,font-weight:bold
style m04d fill:#00897b,color:#fff,font-weight:bold
style m06 fill:#e53935,color:#fff,font-weight:bold
style m06b fill:#d81b60,color:#fff,font-weight:bold
style m07 fill:#795548,color:#fff,font-weight:bold
style m08 fill:#6d4c41,color:#fff,font-weight:bold
style m08b fill:#4e342e,color:#fff,font-weight:bold
style m09 fill:#7b1fa2,color:#fff,font-weight:bold
style m10 fill:#1565c0,color:#fff,font-weight:bold
style m11 fill:#0d47a1,color:#fff,font-weight:bold
style m12 fill:#0d47a1,color:#fff,font-weight:bold
style re05 fill:#bf360c,color:#fff,font-weight:bold
style re06 fill:#d84315,color:#fff,font-weight:bold
style re08 fill:#b71c1c,color:#fff,font-weight:bold
Full pipeline from evaluation through domain adaptation. Each chapter reuses the same m05 → m06 → m08 evaluation pipeline, so the frozen, adapted, and surgical checkpoints are scored on identical metrics.
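The re-evaluation stage can be sketched as a single loop over checkpoints. The function names below (`embed_clips`, `compute_metrics`) are hypothetical stand-ins for m05 and m06, not the project's actual APIs; the point is only that one unchanged evaluation path is applied to all three models.

```python
def embed_clips(checkpoint: str):
    """Stub for m05: load `checkpoint` and embed every clip."""
    return [f"{checkpoint}-embedding"]        # placeholder embeddings

def compute_metrics(embeddings):
    """Stub for m06: retrieval metrics (FAISS in the real pipeline)."""
    return {"n_embeddings": len(embeddings)}

def three_way_compare(checkpoints=("frozen", "adapted", "surgical")):
    """m08b: the identical m05 -> m06 pipeline, run per checkpoint."""
    return {c: compute_metrics(embed_clips(c)) for c in checkpoints}
```

Because the pipeline is fixed, any metric difference between the three result dicts is attributable to the checkpoint alone.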
Ch10: JEPA Training Step (Single Iteration)
flowchart LR
subgraph DATA ["1 · Data"]
CLIP["Indian clip\n10s · 16 frames"] --> AUG["Augment\nRRC 384×384\nsame crop\nall frames"]
end
subgraph MASKING ["2 · Mask"]
AUG --> TOKENS["Patchify\n24×24×8\n=4608 tokens"]
TOKENS --> MGEN["MaskCollator\n8 small (15%)\n2 large (70%)"]
MGEN --> VIS["visible\n~500-1100"]
MGEN --> MASKED["masked\n~3500-4100"]
end
subgraph ENCODE ["3 · Encode"]
direction TB
VIS --> S_ENC["STUDENT\nViT-g · 1B\nvisible only\ntrainable"]
AUG2["same clip"] --> T_ENC["TEACHER\nViT-g · 1B\nALL tokens\nno grad"]
end
subgraph PREDICT ["4 · Predict + Loss"]
S_ENC --> PREDICTOR["PREDICTOR\n12 blocks\n384-dim"]
T_ENC --> SELECT["select masked\npositions"]
PREDICTOR --> L1["L1 LOSS\n|pred − target|"]
SELECT --> L1
end
subgraph UPDATE ["5 · Update"]
L1 --> ADAM["AdamW\n+ grad clip 1.0\n+ drift control"]
ADAM --> EMA_UP["EMA teacher\nτ = 0.99925"]
end
style CLIP fill:#5e35b1,color:#fff,font-weight:bold
style AUG fill:#00897b,color:#fff,font-weight:bold
style TOKENS fill:#795548,color:#fff,font-weight:bold
style MGEN fill:#e65100,color:#fff,font-weight:bold
style VIS fill:#2e7d32,color:#fff,font-weight:bold
style MASKED fill:#c62828,color:#fff,font-weight:bold
style S_ENC fill:#1565c0,color:#fff,font-weight:bold
style T_ENC fill:#546e7a,color:#fff,font-weight:bold
style AUG2 fill:#546e7a,color:#fff,font-weight:bold
style PREDICTOR fill:#6a1b9a,color:#fff,font-weight:bold
style SELECT fill:#37474f,color:#fff,font-weight:bold
style L1 fill:#c62828,color:#fff,font-weight:bold
style ADAM fill:#1a237e,color:#fff,font-weight:bold
style EMA_UP fill:#9c27b0,color:#fff,font-weight:bold
One training iteration of V-JEPA continual pretraining. The student encodes only the ~20% of tokens left visible by the mask; the EMA teacher encodes all tokens, and the L1 loss is computed only at the masked positions. This is the same JEPA objective as Meta's original pretraining.
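The core mechanics of the step above can be sketched in NumPy. The token count follows the diagram (384×384 frames, patch 16, 16 frames, tubelet depth 2 → 24×24×8 = 4608 tokens); the encoders are random linear maps standing in for ViT-g, and the predictor is a zero stub, so only the masking, the L1-at-masked-positions loss, and the EMA teacher update are real.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 24 * 24 * 8, 32            # 4608 tokens, toy embed dim

# The large (~70%) masks dominate: keep roughly 20% of tokens visible.
masked = rng.choice(n_tokens, size=int(0.8 * n_tokens), replace=False)
visible = np.setdiff1d(np.arange(n_tokens), masked)

tokens = rng.standard_normal((n_tokens, dim))       # patchified clip
W_student = rng.standard_normal((dim, dim)) * 0.1   # trainable weights
W_teacher = W_student.copy()                        # EMA copy, no grad

# Student encodes visible tokens only; teacher encodes ALL tokens.
z_student = tokens[visible] @ W_student
z_teacher = tokens @ W_teacher

# Predictor (stubbed here) regresses the masked targets; the L1 loss
# touches only the masked positions of the teacher's output.
pred = np.zeros((len(masked), dim))
loss = np.abs(pred - z_teacher[masked]).mean()

# EMA teacher update with the momentum from the diagram.
tau = 0.99925
W_teacher = tau * W_teacher + (1 - tau) * W_student
```

With an 80% mask this yields 922 visible tokens, inside the ~500–1100 band the MaskCollator produces in practice.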
Ch11: Progressive Prefix Unfreezing with Factor Datasets
flowchart TB
subgraph SAM ["SAM3 Segmentation (m10)"]
direction LR
SEG["SAM3\ninstance masks"] --> TRACK["Greedy IoU\ntracklets"] --> AGENT["Agent vs Layout\nmotion filter"]
end
subgraph FACTOR ["Factor Datasets (m11)"]
direction LR
DL["D_L\nlayout-only\nblur agents"]
DA["D_A\nagent-only\nsuppress BG"]
DI["D_I\ninteraction\ntube crops"]
end
AGENT --> DL
AGENT --> DA
AGENT --> DI
subgraph STAGES ["3-Stage Surgery (m12)"]
direction LR
S1["Stage 1\nlayers 0→10\n100% D_L\nroads · wires"] --> S2["Stage 2\nlayers 0→20\n90% D_A\nrickshaws · cows"] --> S3["Stage 3\nlayers 0→30\n85% D_I\nagent interactions"]
end
FACTOR --> STAGES
S3 --> SURGICAL["V-JEPA\n(surgical)"]
style SEG fill:#0277bd,color:#fff,font-weight:bold
style TRACK fill:#01579b,color:#fff,font-weight:bold
style AGENT fill:#004d40,color:#fff,font-weight:bold
style DL fill:#1b5e20,color:#fff,font-weight:bold
style DA fill:#33691e,color:#fff,font-weight:bold
style DI fill:#827717,color:#fff,font-weight:bold
style S1 fill:#1565c0,color:#fff,font-weight:bold
style S2 fill:#1565c0,color:#fff,font-weight:bold
style S3 fill:#0d47a1,color:#fff,font-weight:bold
style SURGICAL fill:#4a148c,color:#fff,font-weight:bold
Ch11 surgery: SAM3 segments each frame into moving agents and static layout. Three factor datasets train progressively deeper layer prefixes (layout first, then agents, then interactions), with replay mixing in the later stages to prevent catastrophic forgetting.
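The unfreezing schedule above can be written down directly. Layer prefixes and mix ratios mirror the stage boxes ("layers 0→10" read as inclusive, i.e. 11 blocks); the 40-block depth for ViT-g and all names here are illustrative assumptions, not the project's actual config.

```python
STAGES = [
    # (trainable layer prefix, factor dataset, factor-data fraction)
    {"layers": range(0, 11), "dataset": "D_L", "factor_frac": 1.00},
    {"layers": range(0, 21), "dataset": "D_A", "factor_frac": 0.90},
    {"layers": range(0, 31), "dataset": "D_I", "factor_frac": 0.85},
]

def trainable_mask(stage: dict, n_layers: int = 40):
    """True for layers updated in this stage; the rest stay frozen."""
    unfrozen = set(stage["layers"])
    return [i in unfrozen for i in range(n_layers)]

def batch_mix(stage: dict, batch_size: int = 64):
    """Split each batch between factor-dataset clips and replay clips."""
    n_factor = round(stage["factor_frac"] * batch_size)
    return {"factor": n_factor, "replay": batch_size - n_factor}
```

Stage 1 uses no replay (100% D_L); the replay share then grows with depth, so the deeper, more general layers always see a mix of old and new data.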