CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
Abstract
CoF-T2I integrates Chain-of-Frame reasoning into text-to-image generation through progressive visual refinement, treating intermediate frames as explicit reasoning steps and achieving competitive results on GenEval and Imagine-Bench.
Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored, because the T2I generation process lacks a clearly defined visual reasoning starting point and interpretable intermediate states. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame serves as the output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we encode each frame independently. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.
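To make the described pipeline concrete, below is a minimal Python sketch of CoF-T2I inference as we read it from the abstract: a video backbone denoises a short stack of latent frames as a coarse-to-fine reasoning trajectory, each frame is decoded independently (rather than through a temporally coupled decoder), and the final frame is returned as the image. The objects `video_model` and `vae` and their `denoise`/`decode` methods are hypothetical placeholders standing in for the authors' actual components, not a documented API.

```python
import torch

@torch.no_grad()
def cof_t2i_generate(video_model, vae, prompt: str,
                     num_frames: int = 8,
                     latent_shape=(4, 64, 64)) -> torch.Tensor:
    """Sketch of CoF-T2I inference (assumed interfaces, not the paper's code)."""
    # One latent per "reasoning" frame; the CoF trajectory is expected to
    # move from coarse semantics (early frames) to refined aesthetics
    # (late frames), mirroring the CoF-Evol-Instruct training data.
    latents = torch.randn(num_frames, *latent_shape)
    latents = video_model.denoise(latents, prompt=prompt)  # hypothetical call

    # Decode every frame independently with an image VAE, with no temporal
    # coupling across frames; this reflects the paper's per-frame encoding,
    # intended to keep motion artifacts out of individual frames.
    frames = torch.stack([vae.decode(z.unsqueeze(0)).squeeze(0)
                          for z in latents])

    # Intermediate frames are the explicit reasoning steps; the final
    # frame is taken as the T2I output.
    return frames[-1]
```

Under this reading, the intermediate frames remain inspectable as an interpretable reasoning trace, while only the last frame is scored against T2I benchmarks.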
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models (2025)
- Loom: Diffusion-Transformer for Interleaved Generation (2025)
- Unified Video Editing with Temporal Reasoner (2025)
- ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning (2025)
- Plan-X: Instruct Video Generation via Semantic Planning (2025)
- MIRA: Multimodal Iterative Reasoning Agent for Image Editing (2025)
- AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models (2025)