SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

1Stability AI, 2Northeastern University
*Equal contribution ^Equal advising

SV4D 2.0 takes a reference video as input and generates novel-view videos and 4D models.

Abstract

We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces much higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this through key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention; 2) data: enhancing the quality of the training data; 3) training strategy: adopting progressive 3D-4D training for better generalization; and 4) 4D optimization: handling 3D inconsistency and large motion via a two-stage refinement and progressive frame sampling. Our extensive experiments demonstrate significant performance gains by SV4D 2.0 both visually and quantitatively, achieving better detail and 4D consistency in both novel-view video synthesis (-14% LPIPS, -44% FV4D) and 4D optimization (-12% LPIPS, -24% FV4D) compared to SV4D.
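To make the attention-blending idea in point 1 concrete, here is a minimal sketch of running 3D (cross-view) attention and frame (temporal) attention over the same latent and mixing the two branches. Everything below — the module names, the (batch, views, frames, tokens, channels) latent layout, and the learnable sigmoid blend weight — is an illustrative assumption, not the actual SV4D 2.0 architecture.

# A minimal sketch of blended 3D (cross-view) and frame (temporal) attention.
# Latent layout, names, and the scalar blend weight are assumptions for
# illustration only, not the SV4D 2.0 implementation.
import torch
import torch.nn as nn


class BlendedSpatioTemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # 3D attention mixes tokens across views at a fixed frame;
        # frame attention mixes tokens across frames at a fixed view.
        self.view_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.frame_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Learnable scalar controlling how the two branches are blended.
        self.blend_logit = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, frames, tokens, channels)
        b, v, f, t, c = x.shape

        # 3D (cross-view) attention: sequence axis = views.
        xv = x.permute(0, 2, 3, 1, 4).reshape(b * f * t, v, c)
        xv, _ = self.view_attn(xv, xv, xv)
        xv = xv.reshape(b, f, t, v, c).permute(0, 3, 1, 2, 4)

        # Frame (temporal) attention: sequence axis = frames.
        xf = x.permute(0, 1, 3, 2, 4).reshape(b * v * t, f, c)
        xf, _ = self.frame_attn(xf, xf, xf)
        xf = xf.reshape(b, v, t, f, c).permute(0, 1, 3, 2, 4)

        # Blend the two branches with a learned sigmoid weight (residual).
        w = torch.sigmoid(self.blend_logit)
        return x + w * xv + (1.0 - w) * xf


# Smoke test: 1 sample, 4 views, 5 frames, 16 tokens, 64 channels.
if __name__ == "__main__":
    attn = BlendedSpatioTemporalAttention(channels=64)
    out = attn(torch.randn(1, 4, 5, 16, 64))
    print(out.shape)  # torch.Size([1, 4, 5, 16, 64])

The design point this sketch captures is that both branches attend over the same latent along different axes, so a single learned scalar can trade off multi-view (3D) consistency against temporal consistency.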

Summary Video

Visualization Results

Visual Comparisons

Novel View Video Synthesis

Comparing our results with baselines on the Objaverse dataset.



Comparing our results with baselines on the DAVIS dataset.





4D Optimization



Comparison to L4GM

On Real-World Data

L4GM does not generalize well to real-world data (it lacks a video prior like ours) and struggles with videos captured at non-zero elevations (its training data is primarily at 0° elevation).



On Input Videos with Non-Zero Elevations



4D Optimization Results with Continuous View and Time Changes

SV4D 2.0 with DyNeRF vs. 4D Gaussians

In our sparse-view setting:
1. 4D Gaussians suffer from temporal flickering and floater artifacts due to their discrete nature.
2. DyNeRF interpolates better across sparse views and fast motion (see the sketch below).
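To make point 2 concrete, here is a minimal sketch of a time-conditioned radiance field in the spirit of DyNeRF; the layer sizes and names are illustrative assumptions, not the actual DyNeRF or SV4D 2.0 code. Because the field takes a continuous timestamp, rendering between training frames is just another forward pass, whereas a per-frame set of Gaussians has no built-in notion of in-between times.

# A minimal sketch of a DyNeRF-style time-conditioned field.
# Layer sizes and names are illustrative assumptions only.
import torch
import torch.nn as nn


class TimeConditionedField(nn.Module):
    def __init__(self, hidden: int = 128, time_dim: int = 16):
        super().__init__()
        # Continuous time embedding: the field is defined for any t in [0, 1],
        # so querying between training frames needs no extra machinery.
        self.time_embed = nn.Linear(1, time_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + time_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density
        )

    def forward(self, xyz: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) sample positions; t: (N, 1) continuous timestamps.
        h = torch.cat([xyz, self.time_embed(t)], dim=-1)
        return self.mlp(h)


if __name__ == "__main__":
    field = TimeConditionedField()
    pts = torch.rand(1024, 3)
    # Query at a timestamp halfway between two training frames: the MLP
    # interpolates smoothly, whereas per-frame Gaussians would need explicit
    # correspondences to avoid flicker.
    out = field(pts, torch.full((1024, 1), 0.5))
    print(out.shape)  # torch.Size([1024, 4])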

Acknowledgement

We sincerely thank the creators of the 3D models (Camera, Arrow, Monitor, Pedestal) used in our teaser and pipeline illustrations.

BibTeX


@article{yao2024sv4d2,
    title={{SV4D 2.0}: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation},
    author={Chun-Han Yao and Yiming Xie and Vikram Voleti and Huaizu Jiang and Varun Jampani},
    journal={arXiv preprint arXiv:2503.16396},
    year={2025},
}