Does V-JEPA 2 Understand Indian Streets?

Benchmarking Video Foundation Models on DenseWorld-200K

Kapil Wanaskar¹   Gaytri Jena⁴   Vinija Jain²   Aman Chadha³   Amitava Das⁴
¹Canva Research, USA    ²Google, USA    ³Apple, USA    ⁴Pragya Lab, BITS Pilani Goa, India
115,687 clips • 714 videos • 22 cities • 276 hours • 121 GB

Tier 1 Cities — 6 metros, 68,614 clips

Kolkata
Chennai
Bangalore
Mumbai
Delhi
Hyderabad

Tier 2 Cities — 15 cities, 40,743 clips

Jaipur
Varanasi
Lucknow
Ahmedabad
Pune
Kochi
Chandigarh
Indore
Bhopal
Coimbatore
Nagpur
Visakhapatnam
Surat
Thiruvananthapuram (Trivandrum)
Mysuru

Full Taxonomy Coverage — 15 fields, 65+ values, v3 structured tags

Scene Type: market, residential, commercial, promenade, transit, highway, heritage, junction, flyover, beach, ghat, bazaar, skyline
Time of Day: day, night
Weather: clear, cloudy, rain, fog, overcast
Crowd: high, medium, low
Traffic: high, medium, low
Traffic Mix: mixed motor., pedestrian, motorized, mixed all
Ped-Veh Sep.: separated, partial, shared space
Road Layout: intersection, narrow lane, wide road, sidewalk, bridge
Road Surface: asphalt, paved, wet, dirt, concrete, cobblestone, mixed, unpaved
Infrastructure: good, moderate, poor
Encroachment: clear, partial, heavy
Objects: auto rickshaw, sacred cow, street vendor, bus, cycle rickshaw
Vegetation: dense, moderate, sparse, none
Lighting: natural, artificial, mixed
Video Quality: clean

115,687 clips across 714 videos from 22 Indian cities. Every clip is tagged with 16 structured attributes by Qwen3-VL-8B. Shown above are all 15 taggable fields and their 65 values.
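To make the tag structure concrete, here is a minimal sketch of how one clip's tags could be checked against the values listed above. It covers only a subset of the fields. The snake_case key names, the `validate_tags` helper, and the choice of `objects` as a multi-valued field are assumptions for illustration; the page does not state the dataset's actual key names or which two fields are multi-valued.

```python
# Subset of the fields and values listed above. The snake_case keys are
# illustrative; the dataset's actual key names are not given on this page.
TAXONOMY = {
    "time_of_day": {"day", "night"},
    "weather": {"clear", "cloudy", "rain", "fog", "overcast"},
    "crowd": {"high", "medium", "low"},
    "traffic": {"high", "medium", "low"},
    "vegetation": {"dense", "moderate", "sparse", "none"},
    "lighting": {"natural", "artificial", "mixed"},
    "objects": {"auto rickshaw", "sacred cow", "street vendor", "bus",
                "cycle rickshaw"},
}

# Assumption: "objects" is one of the two multi-valued fields.
MULTI_VALUED = {"objects"}


def validate_tags(tags):
    """Return a list of problems found in one tag record (empty if valid)."""
    problems = []
    for field, allowed in TAXONOMY.items():
        if field not in tags:
            problems.append(f"missing field: {field}")
            continue
        value = tags[field]
        if field in MULTI_VALUED:
            if not isinstance(value, list):
                problems.append(f"{field}: expected a list of values")
                continue
            values = value
        else:
            values = [value]
        for v in values:
            # Non-string values are rejected before the set lookup.
            if not isinstance(v, str) or v not in allowed:
                problems.append(f"{field}: unexpected value {v!r}")
    for field in tags:
        if field not in TAXONOMY:
            problems.append(f"unknown field: {field}")
    return problems


record = {
    "time_of_day": "day",
    "weather": "overcast",
    "crowd": "high",
    "traffic": "medium",
    "vegetation": "sparse",
    "lighting": "natural",
    "objects": ["auto rickshaw", "street vendor"],
}
print(validate_tags(record))  # prints []
```

A check like this is mainly useful for catching free-form VLM output that drifts outside the closed vocabulary, for example `"weather": "snow"`.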

DenseWorld-200K Dataset

115,687 video clips
714 source videos
22 Indian cities
276h of video
121 GB total size
16 taxonomy fields
Metric         Value
Clip duration  4–12 seconds (scene-aware cuts)
Taxonomy       16 fields (v3): 13 single + 2 multi
VLM tagger     Qwen3-VL-8B (0.919 bake-off score)
Format         WebDataset TARs on HuggingFace
POC subset     10K clips (video-level uniform, seed=42)
Full taxonomy distribution (v3, 16 fields) across 10,000 POC clips tagged by Qwen3-VL-8B.
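The POC subset is described only as "video-level uniform, seed=42". One plausible reading is that clips are drawn so that each source video contributes roughly equally, rather than in proportion to its length. The sketch below illustrates that reading with a seeded round-robin draw; the function name `sample_video_uniform` and the `(video_id, clip_id)` input shape are hypothetical, and the authors' actual procedure may differ.

```python
import random
from collections import defaultdict


def sample_video_uniform(clips, n, seed=42):
    """Pick up to n clips, cycling over videos so each contributes evenly.

    clips: iterable of (video_id, clip_id) pairs.
    """
    rng = random.Random(seed)
    by_video = defaultdict(list)
    for video_id, clip_id in clips:
        by_video[video_id].append(clip_id)

    # Sort before shuffling so the result does not depend on input order.
    pools = []
    for video_id in sorted(by_video):
        pool = sorted(by_video[video_id])
        rng.shuffle(pool)
        pools.append((video_id, pool))
    rng.shuffle(pools)

    selected = []
    while len(selected) < n and pools:
        # One round: take a single clip from every video that still has clips.
        for video_id, pool in pools:
            if len(selected) == n:
                break
            selected.append((video_id, pool.pop()))
        pools = [(v, p) for v, p in pools if p]
    return selected


# Toy example: one long video and two short ones.
clips = ([("a", i) for i in range(100)]
         + [("b", i) for i in range(3)]
         + [("c", i) for i in range(3)])
picked = sample_video_uniform(clips, 6)
print(sorted(v for v, _ in picked))  # prints ['a', 'a', 'b', 'b', 'c', 'c']
```

Under this scheme a long video cannot dominate the subset: it only contributes extra clips once shorter videos are exhausted.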

Clips Per City: Full Breakdown

Tier 1 Cities — 6 metros, 68,614 clips

City             Drive    Walk   Drone   Total   Hours    GB
Kolkata          3,790  20,633     655  25,078    58.0  27.1
Mumbai           6,412   4,167     181  10,760    25.8  12.1
Delhi            4,114   6,159     199  10,472    24.5  11.5
Chennai          1,984   5,870     426   8,280    19.5   9.0
Hyderabad        2,210   5,037     217   7,464    18.0   6.8
Bangalore        3,453   2,335     772   6,560    15.7   7.2
Tier 1 Total    21,963  44,201   2,450  68,614   161.4  73.7

Goa — 5,835 clips

City     Walk   Total   Hours    GB
Goa     5,835   5,835    14.2   6.1

Tier 2 Cities — 15 cities, 40,743 clips

City                   Drive    Walk   Drone    Rain   Total   Hours    GB
Ahmedabad                476   4,875     208     308   5,867    14.0   6.8
Varanasi               1,087   2,761     233     236   4,317    10.7   4.5
Jaipur                   676   2,206     214   1,074   4,170    10.0   4.2
Kochi                  1,021   1,222     865     988   4,096    10.1   4.2
Pune                     707   1,795     326     328   3,156     7.4   3.2
Coimbatore             1,705     554     145     529   2,933     7.2   2.9
Lucknow                  812   1,412     189     366   2,779     6.9   2.6
Thiruvananthapuram       727     664     197   1,062   2,650     6.4   2.8
Chandigarh               524     923     160     684   2,291     5.5   2.4
Nagpur                   376     929     131     434   1,870     4.4   1.6
Indore                   645     614      90     345   1,694     4.2   1.4
Visakhapatnam            256     811     304     137   1,508     3.6   1.2
Mysuru                   489     418     174     271   1,352     3.3   1.2
Surat                    472     303     142     185   1,102     2.7   1.0
Bhopal                   249     322     119     268     958     2.3   0.8
Tier 2 Total          10,222  19,809   3,497   7,215  40,743    98.8  40.9

Monuments — 495 clips

Monument                    Clips   Hours    GB
Gateway of India, Mumbai      266     0.7   0.3
Red Fort, Delhi               214     0.6   0.2
Mysore Palace                  15     0.0   0.0
Monuments Total               495     1.3   0.5
Grand Total: 115,687 clips  |  275.8 hours  |  121.2 GB
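As a quick consistency check, the four group totals above can be summed and compared with the grand total. The figures below are copied from the tables; the dictionary layout and variable names are only illustrative.

```python
# Group totals copied from the tables above.
groups = {
    "Tier 1":    {"cities": 6,  "clips": 68_614, "hours": 161.4, "gb": 73.7},
    "Goa":       {"cities": 1,  "clips": 5_835,  "hours": 14.2,  "gb": 6.1},
    "Tier 2":    {"cities": 15, "clips": 40_743, "hours": 98.8,  "gb": 40.9},
    "Monuments": {"cities": 0,  "clips": 495,    "hours": 1.3,   "gb": 0.5},
}

total_cities = sum(g["cities"] for g in groups.values())
total_clips = sum(g["clips"] for g in groups.values())
total_hours = round(sum(g["hours"] for g in groups.values()), 1)
total_gb = round(sum(g["gb"] for g in groups.values()), 1)

print(total_cities)  # prints 22
print(total_clips)   # prints 115687
print(total_hours)   # prints 275.7
print(total_gb)      # prints 121.2
```

Clips, cities, and gigabytes match the reported totals exactly. The summed hours come to 275.7 against the reported 275.8, a difference that is consistent with each group's hours having been rounded to one decimal before summing.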

BibTeX

@article{wanaskar2026factorjepa,
  title={Does V-JEPA 2 Understand Indian Streets? Benchmarking Video Foundation Models on DenseWorld-200K},
  author={Wanaskar, Kapil and Jena, Gaytri and Jain, Vinija and Chadha, Aman and Das, Amitava},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}