arXiv:2601.04792

PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference

Published on Jan 8 · Submitted by Amir Habibian on Jan 9
Authors:

Abstract

AI-generated summary: Pyramidal diffusion models reduce computational cost by processing across a hierarchy of resolutions; a pretrained model can be converted into a pyramidal one via low-cost fine-tuning, preserving output quality while enabling efficient inference.

Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost fine-tuning, achieving this transformation without degrading the quality of the output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.
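To make the stage-wise idea in the abstract concrete, the sketch below runs a toy coarse-to-fine sampling loop: the early, high-noise denoising steps operate on a spatially downscaled latent, which is then upsampled and lightly re-noised before the next, finer stage. The scale factors, step counts, noise schedule, re-noising coefficient, and the model(x, sigma, cond) signature are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def pyramidal_sample(model, cond, latent_shape,
                     scales=(0.25, 0.5, 1.0), steps=(10, 10, 10)):
    """Toy pyramidal sampling loop (illustration only, not the paper's code).

    High-noise steps run on a downscaled latent; the partially denoised latent
    is upsampled and re-noised before the next, finer stage continues.
    """
    b, c, t, h, w = latent_shape
    sigmas = torch.linspace(1.0, 0.0, sum(steps) + 1)          # simple linear noise schedule
    x = torch.randn(b, c, t, int(h * scales[0]), int(w * scales[0]))  # start coarse

    k = 0
    for i, (scale, n) in enumerate(zip(scales, steps)):
        for _ in range(n):
            sigma, sigma_next = sigmas[k], sigmas[k + 1]
            denoised = model(x, sigma, cond)                    # assumed: predicts the clean latent
            x = x + (sigma_next - sigma) * (x - denoised) / max(float(sigma), 1e-4)  # Euler step
            k += 1
        if i + 1 < len(scales):
            r = scales[i + 1] / scale
            # Upsample spatially (temporal axis unchanged) and re-inject noise so the
            # latent roughly matches the noise level expected by the next stage.
            x = F.interpolate(x, scale_factor=(1.0, r, r),
                              mode="trilinear", align_corners=False)
            x = x + 0.5 * sigmas[k] * torch.randn_like(x)       # re-noise factor is a guess
    return x
```

Since most denoising steps run on smaller latents, the total compute drops relative to running every step at full resolution, which is where the inference savings come from.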

Community

Paper author · Paper submitter

We tackle the challenge of quadratic complexity in video generation with a novel Recurrent Hybrid Attention mechanism. By combining the fidelity of softmax attention for local dependencies with the efficiency of linear attention globally, we enable high-quality modeling with linear scaling.

Constant Memory Usage: Our chunk-wise recurrent reformulation allows for the generation of arbitrarily long videos.
Massive Training Efficiency: Using a two-stage distillation pipeline, we reduced training costs by two orders of magnitude to just ~160 GPU hours.
SOTA Performance: Validated on VBench and VBench-2.0, ReHyAt achieves state-of-the-art quality while unlocking practical on-device video generation.
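For readers curious how the hybrid scheme described in this comment could be laid out, here is a minimal, self-contained sketch (not the ReHyAt implementation): exact softmax attention inside each chunk for local fidelity, plus a linear-attention state carried recurrently across chunks for global context. The chunk size, gating weight, and feature map are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def hybrid_chunk_attention(q, k, v, chunk=256, gate=0.5):
    """Illustrative chunk-wise recurrent hybrid attention (assumed layout).

    q, k, v: (batch, heads, seq_len, dim)
    Local dependencies: exact softmax attention within each chunk.
    Global context: a linear-attention state (running phi(k)^T v) carried
    across chunks, giving O(n) cost in sequence length.
    """
    b, h, n, d = q.shape
    phi = lambda x: F.elu(x) + 1.0                                  # positive feature map

    state = torch.zeros(b, h, d, d, device=q.device, dtype=q.dtype)  # running sum of phi(k)^T v
    norm = torch.zeros(b, h, d, device=q.device, dtype=q.dtype)      # running sum of phi(k)
    outs = []
    for s in range(0, n, chunk):
        qi, ki, vi = q[:, :, s:s+chunk], k[:, :, s:s+chunk], v[:, :, s:s+chunk]

        # Local branch: exact softmax attention inside the chunk.
        local = F.scaled_dot_product_attention(qi, ki, vi)

        # Global branch: linear attention against the state from earlier chunks.
        qf = phi(qi)
        glob = torch.einsum("bhnd,bhde->bhne", qf, state)
        denom = torch.einsum("bhnd,bhd->bhn", qf, norm).clamp(min=1e-6).unsqueeze(-1)
        glob = glob / denom

        outs.append(gate * local + (1.0 - gate) * glob)             # simple fixed gate (assumption)

        # Recurrent state update: fixed-size, independent of sequence length.
        kf = phi(ki)
        state = state + torch.einsum("bhnd,bhne->bhde", kf, vi)
        norm = norm + kf.sum(dim=2)
    return torch.cat(outs, dim=2)
```

Because the cross-chunk state is a fixed-size per-head matrix rather than a growing key/value cache, memory stays constant as the sequence grows, which is what makes arbitrarily long generation feasible in principle.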

