arxiv:2601.10547

HeartMuLa: A Family of Open Sourced Music Foundation Models

Published on Jan 15 · Submitted by Dongchao Yang on Jan 16
Abstract

AI-generated summary

A suite of open-source music foundation models is introduced, featuring components for audio-text alignment, lyric recognition, music codec tokenization, and large language model-based song generation with controllable attributes and scalable parameterization.

We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; and (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, it provides two specialized modes: (i) fine-grained musical attribute control, which allows users to specify the style of different song sections (e.g., intro, verse, chorus) using natural language prompts; and (ii) short, engaging music generation suitable as background music for short videos. Lastly, HeartMuLa improves significantly when scaled to 7B parameters. For the first time, we show that a Suno-level, commercial-grade system can be reproduced using academic-scale data and GPU resources. We expect these foundation models to serve as strong baselines for future research and to facilitate practical applications in multimodal content production.
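This page includes no code, so the snippet below is only a minimal, hypothetical sketch of why a 12.5 Hz codec tokenizer makes LLM-style autoregressive song generation tractable. Only the 12.5 Hz frame rate and the conditioning inputs named in the abstract (style text, lyrics, section-level prompts, reference audio) come from the paper; every class and function name (`SongRequest`, `token_budget`, `generate_song`) is an illustrative assumption, not the released HeartMuLa API.

```python
# Hedged sketch: token budget and generation loop for a low-frame-rate codec.
# Names are hypothetical; only the 12.5 Hz rate and conditioning inputs come
# from the abstract. No real model is loaded or invoked here.

from dataclasses import dataclass, field
from typing import List, Optional

FRAME_RATE_HZ = 12.5  # HeartCodec's stated frame rate


def token_budget(duration_s: float, codebooks: int = 1) -> int:
    """Codec frames (times parallel codebooks) needed for a clip of given length.

    At 12.5 Hz, a 3-minute song is 12.5 * 180 = 2250 frames, which is why a
    low frame rate keeps autoregressive sequence lengths manageable.
    """
    return int(round(FRAME_RATE_HZ * duration_s)) * codebooks


@dataclass
class SongRequest:
    """Conditioning signals named in the abstract (names here are illustrative)."""
    style_prompt: str                      # global textual style description
    lyrics: str                            # lyrics, optionally with section tags
    section_styles: dict = field(default_factory=dict)  # e.g. {"chorus": "anthemic, wide synths"}
    reference_audio: Optional[str] = None  # path to a reference clip, if any


def generate_song(request: SongRequest, duration_s: float = 180.0) -> List[int]:
    """Illustrative autoregressive loop over codec tokens (model is stubbed out)."""
    n_frames = token_budget(duration_s)
    tokens: List[int] = []
    for _ in range(n_frames):
        # A real system would sample the next codec token from an LLM
        # conditioned on the request and on previously generated tokens.
        next_token = 0  # placeholder only
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    req = SongRequest(
        style_prompt="melancholic indie pop, 90 bpm, female vocal",
        lyrics="[verse] ...\n[chorus] ...",
        section_styles={"intro": "sparse piano", "chorus": "full band, bright"},
    )
    print("frames for a 3-minute song:", token_budget(180.0))   # -> 2250
    print("generated token count:", len(generate_song(req)))
```

Under these assumptions, a full 3-minute song is only about 2,250 codec frames per codebook, a sequence length well within standard LLM context windows, which is the practical sense in which the 12.5 Hz tokenizer "enables efficient autoregressive modeling."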


