MMDuet

This is the model checkpoint of MMDuet, a VideoLLM that you can interact with in real time as the video plays.

Model Details

Related Resources

Paper: VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
GitHub: MMDuet
Video Demo: on YouTube and on Bilibili
Data: MMDuetIT

Citation

If you use this work in your research, please consider citing:

@misc{wang2024mmduet,
      title={VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format}, 
      author={Yueqian Wang and Xiaojun Meng and Yuxuan Wang and Jianxin Liang and Jiansheng Wei and Huishuai Zhang and Dongyan Zhao},
      year={2024},
      eprint={2411.17991},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17991}, 
}

Model tree for wangyueqian/MMDuet

Base model: lmms-lab/llava-onevision-qwen2-7b-ov (this model is listed as an adapter of it)


Paper for wangyueqian/MMDuet: arXiv 2411.17991, published Nov 27, 2024