Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

The University of Tokyo
*Co-first authors.
Teaser

Abstract

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

WildCity Dataset

Our WildCity dataset is designed to train feed-forward 3DGS models for 3D reconstruction from unconstrained photo collections. The dataset creation process consists of four main stages.

  1. We collect 3D assets from SceneCity, a Blender add-on, and from Sketchfab.
  2. We generate cities using these assets.
  3. We render images from multiple viewpoints and under multiple lighting conditions using HDRI maps.
  4. We add transient objects to the rendered images using Gemini.
Dataset Pipeline
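
As a rough illustration of the rendering stage, the sketch below enumerates one render job per combination of scene, viewpoint, and HDRI map. The function name, the identifiers, and the full cross-product sampling are assumptions made for illustration; the actual pipeline may sample viewpoints and lighting conditions differently.

```python
from itertools import product


def enumerate_render_jobs(scene_ids, camera_ids, hdri_ids):
    """Return one (scene, camera, hdri) tuple per image to render.

    Hypothetical helper: assumes every viewpoint of every scene is
    rendered under every lighting condition.
    """
    return list(product(scene_ids, camera_ids, hdri_ids))


# Toy identifiers, not taken from the dataset.
jobs = enumerate_render_jobs(
    ["city_000", "city_001"],   # scenes
    [0, 1, 2],                  # camera viewpoints
    ["sunset", "overcast"],     # HDRI maps
)
print(len(jobs))  # 2 * 3 * 2 = 12
```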

Examples from the WildCity Dataset

WildCity comprises multi-view, multi-lighting images of various 3D assets, covering 200 scenes, 170 HDRI maps, and diverse transient objects. We also provide corresponding camera parameters, depth maps, and masks indicating the sky regions. The dataset is available here.
WildCity Dataset Example
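
To show how the provided camera parameters and depth maps fit together, here is a minimal sketch that back-projects a pixel into a camera-space 3D point. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a depth value measured along the optical axis (z-depth); the dataset's actual conventions may differ, so treat the names and the convention as hypothetical.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with z-depth `depth` to a camera-space point.

    Assumes a pinhole model: u = fx * x / z + cx, v = fy * y / z + cy.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# A pixel at the principal point lies on the optical axis.
print(backproject_pixel(320.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
# (0.0, 0.0, 2.0)
```

Pixels flagged by the sky mask have no meaningful depth and would normally be skipped before back-projection.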

Wild3R

We aim to reconstruct 3D scenes from appearance-varying input views without per-scene optimization. We propose Wild3R, a feed-forward 3DGS model trained to enforce appearance consistency and transient-free geometry across views. Our network builds upon a camera-free feed-forward 3DGS model and is fine-tuned on our WildCity dataset. Specifically, we use the first frame as an appearance reference view to condition the scene reconstruction, and we supervise the predicted depth maps using transient-free ground-truth depth maps. This minimal extension requires no structural modifications to the base model, thereby preserving its fast inference speed and architectural simplicity.
Method
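
The depth supervision described above can be sketched as a masked L1 loss between a predicted depth map and a transient-free ground-truth depth map. This is only an illustrative sketch: the choice of L1, the exclusion of sky pixels via the sky mask, and the function name are assumptions, not the loss confirmed by the paper.

```python
def masked_depth_l1(pred, gt, sky_mask):
    """Mean absolute depth error over non-sky pixels.

    pred, gt: 2D lists of depth values with the same shape.
    sky_mask: 2D list of booleans, True where the pixel is sky
              (assumed to be excluded from supervision).
    """
    total = 0.0
    count = 0
    for p_row, g_row, m_row in zip(pred, gt, sky_mask):
        for p, g, is_sky in zip(p_row, g_row, m_row):
            if is_sky:
                continue  # sky has no valid depth target
            total += abs(p - g)
            count += 1
    return total / count if count else 0.0


pred = [[1.0, 2.0], [3.0, 10.0]]
gt = [[1.5, 2.0], [2.0, 0.0]]
sky = [[False, False], [False, True]]
print(masked_depth_l1(pred, gt, sky))  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```

Because the target depth is rendered without transient objects, a loss of this kind pushes the predicted geometry toward the static scene even when the input images contain transients.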

Comparison with Previous Methods

Brandenburg Gate

Sacre Coeur

Trevi Fountain

BibTeX

@article{furutani2026wild3r,
  title   = {Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection},
  author  = {Furutani, Yuto and Otonari, Takashi and Shiohara, Kaede and Yamasaki, Toshihiko},
  journal = {arXiv preprint arXiv:2606.11894},
  year    = {2026}
}