Learning Long-term Motion Embeddings for Efficient Kinematics Generation
Abstract
Compressed motion embeddings and a conditional flow-matching model enable efficient generation of realistic long-term motions from text prompts or spatial inputs.
Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by operating directly on a long-term motion embedding learned from large-scale trajectory data produced by tracker models. This enables the efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64×. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
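For a concrete picture of the second stage, here is a minimal PyTorch sketch of conditional flow matching on compressed motion latents. The `VelocityNet` architecture, the conditioning interface, and all shapes and hyperparameters are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of stage 2: conditional flow matching on motion latents.
# All module names, shapes, and sizes below are assumptions for illustration.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow velocity for a motion latent given time t and a condition embedding."""
    def __init__(self, latent_dim=256, cond_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x_t, t, cond):
        # Concatenate the noisy latent, the scalar time, and the condition embedding.
        return self.net(torch.cat([x_t, t[:, None], cond], dim=-1))

def flow_matching_loss(model, z1, cond):
    """One flow-matching step: regress the straight-line velocity from
    noise z0 to the data latent z1 at a random interpolation time t."""
    z0 = torch.randn_like(z1)                     # noise endpoint
    t = torch.rand(z1.shape[0], device=z1.device)
    zt = (1 - t[:, None]) * z0 + t[:, None] * z1  # linear interpolant
    v_target = z1 - z0                            # constant target velocity
    return nn.functional.mse_loss(model(zt, t, cond), v_target)

# Usage (shapes assumed): z1 would come from the frozen stage-1 encoder,
# cond from a text encoder over the task description.
model = VelocityNet()
z1 = torch.randn(8, 256)    # stand-in for compressed motion latents
cond = torch.randn(8, 512)  # stand-in for text-prompt embeddings
loss = flow_matching_loss(model, z1, cond)
loss.backward()
```

At sampling time, one would integrate the learned velocity field from noise to a latent (e.g., with a simple Euler loop) and decode it back to trajectories with the stage-1 decoder.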
Community
Are you already working with trajectories to model motion more efficiently? Consider using our first stage to embed them!
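To illustrate what such a first-stage embedding call could look like, here is a hedged sketch. The `TrajectoryEncoder` class, the (x, y) track layout, and the strided-convolution design are hypothetical; only the 64× temporal compression factor comes from the abstract.

```python
# Hypothetical sketch of embedding tracker trajectories with a stage-1 encoder.
# Six stride-2 convolutions give the 64x temporal compression (2**6 = 64).
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Compresses per-point (x, y) tracks 64x along time with strided 1D convolutions."""
    def __init__(self, in_dim=2, latent_dim=256):
        super().__init__()
        blocks, dim = [], in_dim
        for _ in range(6):  # each stage halves the temporal resolution
            blocks += [nn.Conv1d(dim, latent_dim, 4, stride=2, padding=1), nn.GELU()]
            dim = latent_dim
        self.net = nn.Sequential(*blocks)

    def forward(self, tracks):   # tracks: (num_points, 2, num_frames)
        return self.net(tracks)  # -> (num_points, latent_dim, num_frames // 64)

# Example: 128 tracked points over 256 frames -> latents of temporal length 4.
tracks = torch.randn(128, 2, 256)
latents = TrajectoryEncoder()(tracks)
print(latents.shape)  # torch.Size([128, 256, 4])
```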
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GigaWorld-Policy: An Efficient Action-Centered World-Action Model (2026)
- MotionCFG: Boosting Motion Dynamics via Stochastic Concept Perturbation (2026)
- PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition (2026)
- ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding (2026)
- Toward Physically Consistent Driving Video World Models under Challenging Trajectories (2026)
- Envisioning the Future, One Step at a Time (2026)
- Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics (2026)