
In recent years, there has been a growing trend toward training data-centric, large-scale foundation models that reduce reliance on structural priors. However, is simply scaling up Transformers truly the ultimate solution for computer vision? In this workshop, we aim to reintroduce structural priors and explore how they can further push the boundaries of foundation models.
Our workshop provides an interdisciplinary space for sharing ideas across domains. For example, scene-aware 2D perception can improve 3D modeling and robotic manipulation, while geometric reasoning can strengthen the visual grounding of 2D perception and multimodal models. Through these interactions, we aim to better define the role of priors in vision foundation models.
Topics include, but are not limited to:
- Scene-aware vision models for images and videos.
- Geometry and equivariance for 3D vision.
- Temporal and motion priors for videos.
- Behavioral priors for robotics and egocentric views.
- Physics priors for world models and interactions.

Georgia Tech & NVIDIA

Google DeepMind

Stanford

UT Austin

NYU

MIT
Opening Remarks and Welcome | 08:50-09:00 |
Keynote Talk: Speaker TBD (Title TBD) | 09:00-09:40 |
Keynote Talk: Speaker TBD (Title TBD) | 09:40-10:20 |
Coffee Break | 10:20-10:40 |
Keynote Talk: Speaker TBD (Title TBD) | 10:40-11:20 |
Spotlight Talk (Title TBD) | 11:20-11:35 |
Spotlight Talk (Title TBD) | 11:35-11:50 |
Lunch Break | 11:50-12:30 |
Accepted Paper Poster Session | 12:30-13:30 |
Keynote Talk: Speaker TBD (Title TBD) | 13:30-14:10 |
Keynote Talk: Speaker TBD (Title TBD) | 14:10-14:50 |
Coffee Break | 14:50-15:10 |
Keynote Talk: Speaker TBD (Title TBD) | 15:10-15:50 |
Spotlight Talk (Title TBD) | 15:50-16:05 |
Spotlight Talk (Title TBD) | 16:05-16:20 |
Closing Remarks | 16:20-16:30 |
Accepted Paper Poster Session | 16:30-17:30 |

UMich

Stanford

Tel Aviv

Google DeepMind

NVIDIA

Stanford

UMich