PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
Abstract
Pyramidal diffusion models reduce computational cost through hierarchical, multi-resolution processing; a pretrained model can be converted into a pyramidal one via low-cost fine-tuning, preserving output quality while enabling efficient inference.
Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at different resolutions: inputs with higher noise levels are handled at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost fine-tuning, without degrading output video quality. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further improve inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.
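To make the staged, coarse-to-fine denoising concrete, the snippet below is a minimal sketch of a pyramidal sampling loop. It reflects editorial assumptions rather than the paper's actual pipeline: `denoiser(x, sigma)` is a hypothetical wrapper around a pretrained video diffusion model that predicts the clean latent, and the stage resolutions, step counts, and re-noising rule at stage transitions are illustrative placeholders.

```python
# Minimal sketch of pyramidal sampling (assumptions, not the paper's pipeline).
# `denoiser(x, sigma)` is a hypothetical wrapper that predicts the clean latent.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pyramidal_sample(denoiser, channels=4, frames=16,
                     stage_resolutions=(32, 64, 128), steps_per_stage=10,
                     device="cpu"):
    """Run the noisiest steps at low resolution, then upsample the running
    estimate and keep denoising at progressively higher resolutions."""
    n_stages = len(stage_resolutions)
    # One monotone noise schedule, split evenly across the resolution stages.
    sigmas = torch.linspace(1.0, 0.0, n_stages * steps_per_stage + 1, device=device)

    res0 = stage_resolutions[0]
    x = torch.randn(1, channels, frames, res0, res0, device=device)  # pure noise

    for stage, res in enumerate(stage_resolutions):
        if stage > 0:
            # Upsample the latest clean estimate spatially and re-noise it to
            # the level at which this stage begins. The exact transition rule
            # is a design choice; this is only one plausible variant.
            x0 = F.interpolate(x0, size=(frames, res, res),
                               mode="trilinear", align_corners=False)
            sigma_start = sigmas[stage * steps_per_stage]
            x = x0 + sigma_start * torch.randn_like(x0)

        for k in range(steps_per_stage):
            i = stage * steps_per_stage + k
            sigma, sigma_next = sigmas[i], sigmas[i + 1]
            x0 = denoiser(x, sigma)                      # predicted clean latent
            d = (x - x0) / torch.clamp(sigma, min=1e-8)  # Euler step direction
            x = x + (sigma_next - sigma) * d
    return x
```

With a dummy denoiser such as `lambda x, sigma: torch.zeros_like(x)`, the loop runs end to end and shows where the savings come from: under the placeholder schedule above, two thirds of the denoising steps operate on latents that are 4x or 16x smaller than the final resolution.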
Community
We tackle the challenge of quadratic complexity in video generation with a novel Recurrent Hybrid Attention mechanism. By combining the fidelity of softmax attention for local dependencies with the efficiency of linear attention globally, we enable high-quality modeling with linear scaling.
- Constant Memory Usage: Our chunk-wise recurrent reformulation allows for the generation of arbitrarily long videos.
- Massive Training Efficiency: Using a two-stage distillation pipeline, we reduced training costs by two orders of magnitude to just ~160 GPU hours.
- SOTA Performance: Validated on VBench and VBench-2.0, ReHyAt achieves state-of-the-art quality while unlocking practical on-device video generation.
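As a rough illustration of the chunk-wise recurrent hybrid attention described in this comment, the sketch below combines exact softmax attention inside each chunk with a linear-attention summary of all previous chunks kept in a constant-size state. It is an editorial reconstruction under stated assumptions (single head, `elu + 1` feature map, additive fusion of the two branches, global branch restricted to past chunks), not the ReHyAt implementation.

```python
# Sketch of chunk-wise recurrent hybrid attention (assumptions, not ReHyAt's code):
# exact softmax attention within each chunk plus linear attention over a running
# summary of all previous chunks, so memory stays constant in sequence length.
import torch
import torch.nn.functional as F

def phi(x):
    # Positive feature map commonly used in linear attention.
    return F.elu(x) + 1.0

def hybrid_attention(q, k, v, chunk=64):
    """q, k, v: (batch, seq, dim) single-head tensors. Returns (batch, seq, dim)."""
    b, n, d = q.shape
    state_kv = torch.zeros(b, d, d, device=q.device)  # running sum of phi(k)^T v
    state_k = torch.zeros(b, d, device=q.device)      # running sum of phi(k)
    outputs = []
    for s in range(0, n, chunk):
        qc, kc, vc = q[:, s:s + chunk], k[:, s:s + chunk], v[:, s:s + chunk]
        # Local branch: exact softmax attention inside the current chunk.
        local = F.scaled_dot_product_attention(qc, kc, vc)
        # Global branch: linear attention against the summary of past chunks.
        qf = phi(qc)
        num = torch.einsum("bld,bde->ble", qf, state_kv)
        den = torch.einsum("bld,bd->bl", qf, state_k).clamp(min=1e-6).unsqueeze(-1)
        outputs.append(local + num / den)              # additive fusion (assumption)
        # Fold the current chunk into the constant-size recurrent state.
        kf = phi(kc)
        state_kv = state_kv + torch.einsum("bld,ble->bde", kf, vc)
        state_k = state_k + kf.sum(dim=1)
    return torch.cat(outputs, dim=1)
```

Because the past is summarized in the fixed-size `state_kv` and `state_k` tensors, memory does not grow with sequence length, which is what would allow arbitrarily long videos to be generated chunk by chunk.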
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- VDOT: Efficient Unified Video Creation via Optimal Transport Distillation (2025)
- AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path (2025)
- USV: Unified Sparsification for Accelerating Video Diffusion Models (2025)
- InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior (2025)
- Glance: Accelerating Diffusion Models with 1 Sample (2025)
- Video Generation Models Are Good Latent Reward Models (2025)
- Guiding Token-Sparse Diffusion Models (2026)