¹Reality Labs Research, Meta   ²Zhejiang University
*Work done during internship at Meta.
We propose 4DGT, a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular posed videos. Using 4D Gaussians as an inductive bias, 4DGT unifies static and dynamic components, enabling the modeling of complex, time-varying environments with varying object lifespans. We propose a novel density control strategy during training that enables 4DGT to handle longer space-time inputs while remaining efficient to render at runtime. Our model processes 64 consecutive posed frames in a rolling-window fashion, predicting consistent 4D Gaussians in the scene. Unlike optimization-based methods, 4DGT performs purely feed-forward inference, reducing reconstruction time from hours to seconds and scaling effectively to long video sequences. Trained only on large-scale monocular posed video datasets, 4DGT significantly outperforms prior Gaussian-based networks on real-world videos and achieves accuracy on par with optimization-based methods on cross-domain videos.
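The rolling-window, feed-forward inference described above can be sketched as follows. This is a minimal illustration only: the FourDGTModel interface, window stride, and output parameterization are assumptions made for the example, not the released implementation.

# Minimal sketch of rolling-window feed-forward 4D Gaussian prediction.
# The model class, stride, and output fields below are illustrative assumptions.
import torch

WINDOW = 64   # consecutive posed frames per forward pass (from the paper)
STRIDE = 32   # overlap between consecutive windows (assumed)

class FourDGTModel(torch.nn.Module):
    """Placeholder standing in for the 4DGT network (hypothetical interface)."""
    def forward(self, images, poses):
        # images: (T, 3, H, W), poses: (T, 4, 4) camera-to-world matrices
        n = images.shape[0] * 1024  # pretend: pixel-aligned Gaussians per frame
        return {
            "xyz": torch.zeros(n, 3),       # 3D centers
            "time": torch.zeros(n, 2),      # temporal center and lifespan
            "scale": torch.ones(n, 3),
            "rotation": torch.zeros(n, 4),
            "opacity": torch.ones(n, 1),
            "color": torch.zeros(n, 3),
        }

def reconstruct(images, poses, model):
    """Predict 4D Gaussians over a long video in a rolling-window fashion."""
    gaussians = []
    for start in range(0, images.shape[0] - WINDOW + 1, STRIDE):
        with torch.no_grad():
            gaussians.append(model(images[start:start + WINDOW],
                                   poses[start:start + WINDOW]))
    return gaussians

if __name__ == "__main__":
    model = FourDGTModel()
    video = torch.zeros(128, 3, 256, 256)   # dummy monocular frames
    cams = torch.eye(4).repeat(128, 1, 1)   # dummy camera-to-world poses
    chunks = reconstruct(video, cams, model)
    print(len(chunks), "windows reconstructed")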
4DGT takes a series of monocular frames with poses as input. During training, we subsample the temporal frames at different granularities and use all images for supervision. In stage one, we train 4DGT to predict pixel-aligned Gaussians at coarse resolution. In stage two, we prune the majority of non-activated Gaussians according to histograms of per-patch activation channels, and densify the Gaussian prediction by increasing the input token samples in both space and time. At inference time, we run the 4DGT network trained after stage two, which supports dense video frame input at high resolution.
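The histogram-based pruning step of the stage-two density control can be illustrated with the sketch below. The prune_by_activation_histogram helper, the keep ratio, the bin count, and the tensor shapes are all hypothetical choices for illustration, not the paper's exact procedure.

# Hedged sketch of stage-two density control: prune Gaussians whose per-patch
# activation channel falls in the low bins of the histogram. Names, shapes,
# and thresholds are assumptions for illustration only.
import torch

def prune_by_activation_histogram(activations, keep_ratio=0.25, bins=256):
    """Keep the most-activated fraction of per-patch Gaussian channels.

    activations: (num_patches, gaussians_per_patch) values in [0, 1].
    Returns a boolean mask of the same shape marking Gaussians to keep.
    """
    flat = activations.flatten()
    hist = torch.histc(flat, bins=bins, min=0.0, max=1.0)
    # Walk the histogram from the top bin downward until `keep_ratio` of the
    # total mass is covered; everything below that activation level is pruned.
    cumulative = torch.cumsum(hist.flip(0), dim=0)
    target = keep_ratio * flat.numel()
    top_bins = int((cumulative < target).sum()) + 1
    threshold = 1.0 - top_bins / bins
    return activations >= threshold

if __name__ == "__main__":
    acts = torch.rand(1024, 16)  # dummy per-patch activation channels
    mask = prune_by_activation_histogram(acts)
    print(f"kept {mask.float().mean():.2%} of Gaussians")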
@article{xu20254dgt,
  title   = {4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos},
  author  = {Xu, Zhen and Li, Zhengqin and Dong, Zhao and Zhou, Xiaowei and Newcombe, Richard and Lv, Zhaoyang},
  journal = {arXiv preprint arXiv:2506.08015},
  year    = {2025}
}