Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Abstract
LivingSwap enhances video face swapping by using keyframes and reference guidance to maintain identity and fidelity over long sequences, reducing manual effort and achieving state-of-the-art results.
Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face-swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and reverse the data pairs to obtain reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap
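The abstract names two mechanisms: temporal stitching under keyframe conditioning, and reversed data pairs for ground-truth supervision. The Python sketch below illustrates one plausible reading of each; `swap_chunk`, `stitch_long_video`, and `build_reversed_pair` are hypothetical names, and the actual LivingSwap architecture, chunking scheme, and data format are not specified on this page.

```python
import numpy as np

def swap_chunk(chunk: np.ndarray, keyframes: np.ndarray) -> np.ndarray:
    """Stand-in for the face-swapping network: edits a short clip
    conditioned on target-identity keyframes. Hypothetical placeholder,
    not the paper's actual model."""
    return chunk.astype(np.float32)  # identity pass-through for illustration

def stitch_long_video(frames: np.ndarray, keyframes: np.ndarray,
                      chunk_len: int = 16, overlap: int = 4) -> np.ndarray:
    """Edit a long video (T, H, W, C) in overlapping chunks and
    cross-fade the overlaps -- one plausible form of "temporal
    stitching" for keeping identity stable over long sequences."""
    T = frames.shape[0]
    out = np.zeros(frames.shape, dtype=np.float32)
    weight = np.zeros(T, dtype=np.float32)
    step = max(chunk_len - overlap, 1)
    for start in range(0, T, step):
        end = min(start + chunk_len, T)
        edited = swap_chunk(frames[start:end], keyframes)
        n = end - start
        # Triangular weights taper chunk ends so overlapping frames blend.
        w = np.minimum(np.arange(1, n + 1), np.arange(n, 0, -1)).astype(np.float32)
        out[start:end] += edited * w[:, None, None, None]
        weight[start:end] += w
        if end == T:
            break
    return (out / weight[:, None, None, None]).astype(frames.dtype)

def build_reversed_pair(real_clip, swapped_clip, real_identity_keyframe):
    """Pair reversal for reliable supervision: use the synthetically
    swapped clip as the *input* and the original real clip as the
    *ground truth*, conditioned on a keyframe of the real identity.
    The dictionary layout is an assumption for illustration."""
    return {"source_video": swapped_clip,
            "identity_keyframes": real_identity_keyframe,
            "ground_truth": real_clip}
```

A real implementation would presumably replace `swap_chunk` with the trained generative video model and might blend in latent rather than pixel space, but the overlap-and-cross-fade pattern is a common way to extend a fixed-length video model to long sequences.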
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DiffSwap++: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping (2025)
- ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation (2025)
- MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control (2025)
- ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation (2025)
- FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement (2025)
- LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization (2025)
- Video4Edit: Viewing Image Editing as a Degenerate Temporal Process (2025)