











Comparison of our results with baselines on the Objaverse dataset.
Comparison of our results with baselines on the DAVIS dataset.
L4GM does not generalize well to real-world data (it has no video prior like ours) and struggles with videos at non-zero elevation (its training data is primarily at 0° elevation).
In our sparse-view setting:
1. 4D Gaussians suffer from temporal flickering and floater artifacts due to their discrete nature.
2. DyNeRF interpolates better across sparse views and fast motion.
@article{yao2024sv4d2,
title={{SV4D2.0}: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation},
author={Chun-Han Yao and Yiming Xie and Vikram Voleti and Huaizu Jiang and Varun Jampani},
journal={arXiv preprint arXiv:2503.16396},
year={2025},
}