In recent years, there has been a growing trend toward training data-centric, large-scale foundation models that reduce reliance on structural priors. However, is simply scaling up Transformers truly the ultimate solution for computer vision? In this workshop, we aim to reintroduce structural priors and explore how they can further push the boundaries of foundation models.
Our workshop provides an interdisciplinary space for sharing ideas across domains. For example, scene-aware 2D perception can enhance 3D modeling and robotic manipulation, while geometric reasoning can strengthen the visual grounding of 2D perception and multimodal models. Through these interactions, we aim to better define the role of priors in vision foundation models.
Our topics include but are not limited to:
- Scene-aware vision models for images and videos.
- Geometry and equivariance for 3D vision.
- Temporal and motion priors for videos.
- Behavioral priors for robotics and egocentric views.
- Physics priors for world models and interactions.
Keynote Speakers:
- Bill Freeman (MIT & Google DeepMind)
- Kristen Grauman (UT Austin)
- Qixing Huang (UT Austin)
- Danfei Xu (Georgia Tech & NVIDIA)
- Jiajun Wu (Stanford)
- Saining Xie (NYU)
| Session | Time |
|---|---|
| Opening Remarks and Welcome | 08:50-09:00 |
| Keynote Talk: Danfei Xu, "Human Experience as a Foundation for Robot Learning" | 09:00-09:45 |
| Keynote Talk: Kristen Grauman, "Persistent Scene Models for 4D Human Activity" | 09:45-10:30 |
| Coffee Break | 10:30-10:45 |
| Keynote Talk: Jiajun Wu, "Understanding Visual Intelligence Through Physical Intrinsics" | 10:45-11:30 |
| Spotlight Talk: Sandeep Mishra, "VidMP3: Video Editing by Representing Motion with Pose and Position Priors" | 11:30-11:45 |
| Spotlight Talk: Simon Coessens, "MultiViewPano: A Generalist Approach to 360-degree Panorama Generation" | 11:45-12:00 |
| Lunch Break & Accepted Paper Poster Session | 12:00-13:30 |
| Keynote Talk: Saining Xie, "From Structural Priors to Representation Priors" | 13:30-14:15 |
| Keynote Talk: Bill Freeman, "Exploiting Sensor Independence to Reduce the Reliance on the Prior" | 14:15-15:00 |
| Coffee Break | 15:00-15:15 |
| Keynote Talk: Qixing Huang, "Enforcing 3D Inductive Bias via Network Designs and Regularisation Losses" | 15:15-16:00 |
| Spotlight Talk: Manling Li, "Spatial Mental Modeling from Limited Views" | 16:00-16:15 |
| Spotlight Talk: Tien Duc Nguyen, "The Diashow Paradox: Stronger 3D-Aware Representations Emerge from Image Sets, Not Videos" | 16:15-16:30 |
| Closing Remarks | 16:30-16:40 |
| Accepted Paper Poster Session | 16:40-17:30 |
Accepted Papers:
- Ground-Displacement Forecasting from Satellite Image Time Series via a Koopman-Prior Autoencoder
- (Spotlight) Spatial Mental Modeling from Limited Views
- SEAL-Pose: Enhancing 3D Human Pose Estimation through Trainable Loss Function
- StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation
- (Spotlight) VidMP3: Video Editing by Representing Motion with Pose and Position Priors
- (Spotlight) The Diashow Paradox: Stronger 3D-Aware Representations Emerge from Image Sets, Not Videos
- Identity-Motion Trade-offs in Text-to-Video via Query-Guided Attention Priors
- Axis-level Symmetry Detection with Group-Equivariant Representation
- Combinative Matching for Geometric Shape Assembly
- Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
- Generic Event Boundary Detection via Denoising Diffusion
- SHED Light on Segmentation for Depth Estimation
- Few-Shot Pattern Detection via Template Matching and Regression
- (Spotlight) MultiViewPano: A Generalist Approach to 360-degree Panorama Generation
- LACONIC: A 3D Layout Adapter for Controllable Image Creation
- SuperDec: 3D Scene Decomposition with Superquadric Primitives
- Injecting Geometric Scene Priors into Vision Transformers for Improved 2D-3D Understanding