One View Is Enough! Monocular Training for In-the-Wild Novel View Generation
Abstract
OVIE enables monocular novel-view synthesis from single images using pseudo-target views generated via geometric scaffolding, achieving superior performance with faster inference compared to previous methods.
Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
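The geometric scaffolding described in the abstract (lift the source image to 3D with a monocular depth map, apply a sampled camera transform, project to get a pseudo-target view, and mask disoccluded pixels) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the nearest-pixel forward splatting, and the lack of z-buffering are simplifying assumptions.

```python
import numpy as np

def warp_to_pseudo_target(image, depth, K, R, t):
    """Sketch of geometric scaffolding: unproject source pixels with a
    (predicted) depth map, apply a rigid camera transform (R, t), and
    reproject to form a pseudo-target view plus a validity mask that
    marks pixels receiving content (False = disocclusion hole)."""
    H, W = depth.shape
    # Source pixel grid in homogeneous coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Unproject to 3D points in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    # Apply the sampled camera transformation.
    pts_tgt = R @ pts + t.reshape(3, 1)
    # Project into the target view.
    proj = K @ pts_tgt
    z = proj[2]
    uu = np.round(proj[0] / np.where(z != 0, z, 1.0)).astype(int)
    vv = np.round(proj[1] / np.where(z != 0, z, 1.0)).astype(int)
    # Forward-splat source colors; a real implementation would resolve
    # collisions with a z-buffer, here later points simply overwrite.
    target = np.zeros_like(image)
    valid = np.zeros((H, W), dtype=bool)
    in_bounds = (z > 0) & (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
    src_v, src_u = np.unravel_index(np.where(in_bounds)[0], (H, W))
    target[vv[in_bounds], uu[in_bounds]] = image[src_v, src_u]
    valid[vv[in_bounds], uu[in_bounds]] = True
    return target, valid
```

In the masked training formulation, the returned `valid` mask would restrict the geometric, perceptual, and textural losses to pixels actually covered by the warp, so the model is never penalized on disoccluded regions.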
Community
OVIE aims to learn camera-conditioned novel view synthesis, training only on monocular images.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm (2026)
- Repurposing Geometric Foundation Models for Multi-view Diffusion (2026)
- Real-Time Human Frontal View Synthesis from a Single Image (2026)
- One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image (2026)
- Projected Representation Conditioning for High-fidelity Novel View Synthesis (2026)
- PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery (2026)
- OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder (2026)