Allegro: Open the Black Box of Commercial-Level Video Generation Model
Paper: arXiv 2410.15458
Gallery · GitHub · Blog · Paper · Discord
For more demos and corresponding prompts, see the Allegro Gallery.
| Model | Allegro-TI2V | Allegro |
|---|---|---|
| Description | Text-Image-to-Video Generation Model | Text-to-Video Generation Model |
| Download | Hugging Face | Hugging Face |
| Parameter | VAE: 175M<br>DiT: 2.8B | VAE: 175M<br>DiT: 2.8B |
| Inference Precision | VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)<br>DiT/T5: BF16/FP32/TF32 | VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)<br>DiT/T5: BF16/FP32/TF32 |
| Context Length | 79.2K | 79.2K |
| Resolution | 720 x 1280 | 720 x 1280 |
| Frames | 88 | 88 |
| Video Length | 6 seconds @ 15 FPS | 6 seconds @ 15 FPS |
| Single GPU Memory Usage | 9.3 GB BF16 (with cpu_offload) | 9.3 GB BF16 (with cpu_offload) |
| Inference Time | 20 min (single H100) / 3 min (8x H100) | 20 min (single H100) / 3 min (8x H100) |
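The 79.2K context length in the table can be reproduced from the video shape. This sketch assumes (it is not stated here) a VAE with 8x spatial and 4x temporal compression plus a 2x2 spatial patchify in the DiT, which is typical for this class of model:

```python
# Sketch: deriving the DiT context length from the video shape.
# ASSUMPTIONS (not stated in this card): 8x spatial / 4x temporal VAE
# compression and a 2x2 spatial patchify in the DiT.

def dit_context_length(frames, height, width,
                       t_stride=4, s_stride=8, patch=2):
    latent_frames = frames // t_stride                 # 88 -> 22 latent frames
    latent_h = height // s_stride                      # 720 -> 90
    latent_w = width // s_stride                       # 1280 -> 160
    tokens_per_frame = (latent_h // patch) * (latent_w // patch)  # 45 * 80
    return latent_frames * tokens_per_frame

print(dit_context_length(88, 720, 1280))  # 79200, i.e. the 79.2K in the table
```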
1. Download the Allegro GitHub code.
2. Install the necessary requirements.
3. Download the Allegro-TI2V model weights.
4. Run inference:
```bash
python single_inference_ti2v.py \
    --user_prompt 'The car drives along the road.' \
    --first_frame your/path/to/first_frame_image.png \
    --vae your/path/to/vae \
    --dit your/path/to/transformer \
    --text_encoder your/path/to/text_encoder \
    --tokenizer your/path/to/tokenizer \
    --guidance_scale 8 \
    --num_sampling_steps 100 \
    --seed 1427329220
```
The output video resolution is fixed at 720 × 1280. Input images with different resolutions will be automatically cropped and resized to fit.
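To see what the automatic crop-and-resize implies, here is a minimal sketch of a center crop to the 720 x 1280 aspect ratio. The `center_crop_box` helper is hypothetical (not part of the repo); the returned box could be fed to an image library's crop and resize calls:

```python
# Hypothetical preprocessing sketch (not from the repo): compute the
# largest centered crop matching the 720x1280 (H x W) aspect ratio.
# Pure arithmetic; in a real pipeline the box would go to PIL's
# Image.crop followed by Image.resize((1280, 720)).

def center_crop_box(src_w, src_h, target_w=1280, target_h=720):
    """Return (left, top, right, bottom) of the centered crop."""
    target_ratio = target_w / target_h
    if src_w / src_h > target_ratio:
        # Source is wider than 16:9: trim the sides.
        new_w = int(src_h * target_ratio)
        left = (src_w - new_w) // 2
        return (left, 0, left + new_w, src_h)
    # Source is taller than 16:9: trim top and bottom.
    new_h = int(src_w / target_ratio)
    top = (src_h - new_h) // 2
    return (0, top, src_w, top + new_h)

print(center_crop_box(1920, 1080))  # (0, 0, 1920, 1080): already 16:9
print(center_crop_box(1080, 1920))  # portrait input: top/bottom trimmed
```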
| Argument | Description |
|---|---|
| `--user_prompt` | [Required] Text prompt for image-to-video generation. |
| `--first_frame` | [Required] First-frame image input for image-to-video generation. |
| `--last_frame` | [Optional] If provided, the model generates the intermediate video content conditioned on the specified first- and last-frame images. |
| `--enable_cpu_offload` | [Optional] Offloads the model to CPU for lower GPU memory usage (about 9.3 GB, versus 27.5 GB without offload), at the cost of significantly longer inference time. |
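To illustrate how the optional flags above combine with the required arguments, here is a hypothetical helper (not part of the repo) that assembles the `single_inference_ti2v.py` command line; the prompt and file paths are placeholders:

```python
# Hypothetical helper (not from the repo): build the command line for
# single_inference_ti2v.py, combining required and optional arguments
# from the table above. Pass the result to subprocess.run to execute.

def build_ti2v_command(prompt, first_frame, last_frame=None,
                       cpu_offload=False, guidance_scale=8, steps=100):
    cmd = ["python", "single_inference_ti2v.py",
           "--user_prompt", prompt,
           "--first_frame", first_frame,
           "--guidance_scale", str(guidance_scale),
           "--num_sampling_steps", str(steps)]
    if last_frame is not None:       # first+last frame conditioning
        cmd += ["--last_frame", last_frame]
    if cpu_offload:                  # ~9.3 GB GPU memory instead of ~27.5 GB
        cmd.append("--enable_cpu_offload")
    return cmd

print(build_ti2v_command("The car drives along the road.",
                         "first.png", last_frame="last.png",
                         cpu_offload=True))
```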
(Optional) Interpolate the video to 30 FPS
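The idea behind the 30 FPS interpolation step is to synthesize one extra frame between every pair of 15 FPS frames. A real pipeline would use a learned video frame interpolator; the naive averaging below (with numbers standing in for frames) only illustrates the frame-count arithmetic:

```python
# Illustration only: doubling the frame rate by inserting one synthetic
# frame between each adjacent pair. Numbers stand in for image frames;
# a real pipeline would use a learned video frame interpolator instead
# of midpoint averaging.

def interpolate_2x(frames):
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append((a + b) / 2)   # midpoint stand-in for a synthesized frame
    out.append(frames[-1])        # last frame has no successor to blend with
    return out

clip = list(range(88))            # 88 frames = 6 s @ 15 FPS
doubled = interpolate_2x(clip)
print(len(doubled))               # 175 frames, ~6 s @ 30 FPS
```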
This repo is released under the Apache 2.0 License.