Moshi: a speech-text foundation model for real-time dialogue Paper • 2410.00037 • Published Sep 17, 2024
Vision-Speech Models: Teaching Speech Models to Converse about Images Paper • 2503.15633 • Published Mar 19, 2025
ARC-Encoder: learning compressed text representations for large language models Paper • 2510.20535 • Published Oct 23, 2025
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion Paper • 2512.19535 • Published 6 days ago