SP4V

In recent years, there has been a growing trend toward training data-centric, large-scale foundation models that reduce reliance on structural priors. However, is simply scaling up Transformers truly the ultimate solution for computer vision? In this workshop, we aim to reintroduce structural priors and explore how they can further push the boundaries of foundation models.

Our workshop provides an interdisciplinary space for sharing ideas across domains. For example, scene-aware 2D perception can enhance 3D modeling and robotic manipulation, while geometric reasoning can enhance the visual grounding of 2D perception and multimodal models. Through these interactions, we aim to better define the role of priors in vision foundation models.

Our topics include but are not limited to:

Scene-aware vision models for images and videos.
Geometry and equivariance for 3D vision.
Temporal and motion priors for videos.
Behavioral priors for robotics and egocentric views.
Physics priors for world models and interactions.

Keynote Speakers

Bill Freeman
MIT & Google DeepMind

Kristen Grauman
UT Austin

Qixing Huang
UT Austin

Danfei Xu
Georgia Tech & NVIDIA

Jiajun Wu
Stanford

Saining Xie
NYU

Schedule

Opening Remarks and Welcome	08:50-09:00
Keynote Talk: Danfei Xu Human Experience as a Foundation for Robot Learning	09:00-09:45
Keynote Talk: Kristen Grauman Persistent Scene Models for 4D Human Activity	09:45-10:30
Coffee Break	10:30-10:45
Keynote Talk: Jiajun Wu Understanding Visual Intelligence Through Physical Intrinsics	10:45-11:30
Spotlight Talk: Sandeep Mishra VidMP3: Video Editing by Representing Motion with Pose and Position Priors	11:30-11:45
Spotlight Talk: Simon Coessens MultiViewPano: A Generalist Approach to 360-degree Panorama Generation	11:45-12:00
Lunch Break & Accepted Paper Poster Session	12:00-13:30
Keynote Talk: Saining Xie From Structural Priors to Representation Priors	13:30-14:15
Keynote Talk: Bill Freeman Exploiting Sensor Independence to Reduce the Reliance on the Prior	14:15-15:00
Coffee Break	15:00-15:15
Keynote Talk: Qixing Huang Enforcing 3D Inductive Bias via Network Designs and Regularisation Losses	15:15-16:00
Spotlight Talk: Manling Li Spatial Mental Modeling from Limited Views	16:00-16:15
Spotlight Talk: Tien Duc Nguyen The Diashow Paradox: Stronger 3D-Aware Representations Emerge from Image Sets, Not Videos	16:15-16:30
Closing Remarks	16:30-16:40
Accepted Paper Poster Session	16:40-17:30

Accepted Papers

Ground-Displacement Forecasting from Satellite Image Time Series via a Koopman-Prior Autoencoder Takayuki Shinohara
(Spotlight) Spatial Mental Modeling from Limited Views Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
SEAL-Pose: Enhancing 3D Human Pose Estimation through Trainable Loss Function Junggeun Do, Jay-Yoon Lee
StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation Haodong Li, Chen Wang, Jiahui Lei, Kostas Daniilidis, Lingjie Liu
(Spotlight) VidMP3: Video Editing by Representing Motion with Pose and Position Priors Sandeep Mishra, Oindrila Saha, Alan Bovik
(Spotlight) The Diashow Paradox: Stronger 3D-Aware Representations Emerge from Image Sets, Not Videos Nguyen Tien Duc, Anna Sonnweber, Mark Weber, Nikita Araslanov, Daniel Cremers
Identity-Motion Trade-offs in Text-to-Video via Query-Guided Attention Priors Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, Gal Chechik
Axis-level Symmetry Detection with Group-Equivariant Representation Wongyun Yu, Ahyun Seo, Minsu Cho
Combinative Matching for Geometric Shape Assembly Nahyuk Lee, Juhong Min, Junhong Lee, Chunghyun Park, Minsu Cho
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung
Generic Event Boundary Detection via Denoising Diffusion Jaejun Hwang, Dayoung Gong, Manjin Kim, Minsu Cho
SHED Light on Segmentation for Depth Estimation Seung Hyun Lee, Sangwoo Mo, Stella X. Yu
Few-Shot Pattern Detection via Template Matching and Regression Eunchan Jo, Dahyun Kang, Sanghyun Kim, Yunseon Choi, Minsu Cho
(Spotlight) MultiViewPano: A Generalist Approach to 360-degree Panorama Generation Simon Coessens, Akash Malhotra, Nacera Seghouani
LACONIC: A 3D Layout Adapter for Controllable Image Creation Léopold Maillard, Tom Durand, Adrien Ramanana Rahary, Maks Ovsjanikov
SuperDec: 3D Scene Decomposition with Superquadric Primitives Elisabetta Fedele, Boyang Sun, Leonidas Guibas, Marc Pollefeys, Francis Engelmann
Injecting Geometric Scene Priors into Vision Transformers for Improved 2D-3D Understanding Laura Tran-Dubois