🎙️ VoxMind

✨ Overview

Recent end-to-end spoken dialogue models have made natural voice interaction increasingly practical. However, as user requests become more complex and task-oriented, conversational ability alone is often not enough. To address real-world spoken tasks, these models must be equipped with agentic capabilities such as structured reasoning, tool use, and dynamic access to external functions.

VoxMind is an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Built around a Think-before-Speak paradigm, VoxMind enables the model to internalize structured reasoning before response generation, which improves planning, tool selection, and spoken answer quality. In addition, to alleviate the latency bottleneck introduced by large-scale tool integration, VoxMind includes a Multi-Agent Dynamic Tool Management architecture that asynchronously delegates tool retrieval to an auxiliary agent aligned with the main model’s reasoning trajectory.
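The asynchronous delegation idea can be illustrated with a small, self-contained sketch. This is not VoxMind's implementation: the tool catalogue, the keyword-overlap ranking, and the functions `retrieve_tools`, `think`, and `respond` are all hypothetical stand-ins, and the sleeps only simulate latency. It shows one way tool retrieval might overlap with the main model's reasoning instead of blocking it.

```python
import asyncio

# Hypothetical tool catalogue; names and descriptions are illustrative only.
TOOL_CATALOGUE = {
    "get_weather": "Look up the current weather forecast for a city",
    "book_flight": "Book a flight between two cities",
    "set_alarm": "Set an alarm at a given time",
}


async def retrieve_tools(query: str, top_k: int = 1) -> list[str]:
    """Auxiliary-agent stand-in: rank tools by keyword overlap with the query."""
    await asyncio.sleep(0.05)  # simulated retrieval latency
    words = set(query.lower().split())
    ranked = sorted(
        TOOL_CATALOGUE,
        key=lambda name: len(words & set(TOOL_CATALOGUE[name].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


async def think(query: str) -> str:
    """Main-model stand-in: produce a reasoning trace before answering."""
    await asyncio.sleep(0.05)  # simulated reasoning latency
    return f"The user asks: {query!r}. A tool may be needed."


async def respond(query: str) -> dict:
    # Start retrieval in the background so it overlaps with reasoning.
    retrieval = asyncio.create_task(retrieve_tools(query))
    thought = await think(query)
    tools = await retrieval  # usually finished by the time reasoning ends
    return {"think": thought, "tools": tools}


result = asyncio.run(respond("what is the weather forecast for Beijing"))
print(result)
```

Because the retrieval task is created before the reasoning step is awaited, the two waits run concurrently rather than back to back, which is the latency saving the architecture description points to.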

🧪 Minimal Usage Example

from runtime import DEFAULT_SYSTEM_PROMPT, VoxMind

model = VoxMind("/path/to/VoxMind")

tools = []
system_prompt = model.build_system_prompt(
    DEFAULT_SYSTEM_PROMPT,
    tools,
    extra_context={"current_city": "Beijing", "user_language": "en"},
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather like in Beijing today?"},
]

response = model.generate(
    messages,
    post_think_prefix="After careful reasoning, here is my detailed answer:\n",
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)

print(response.think)
print(response.answer)
print(model.parse_tool_calls(response.answer))
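The return format of `parse_tool_calls` is not documented in this card. As a rough sketch, assuming it yields a list of dictionaries with `name` and `arguments` keys (an assumption, not a confirmed schema), the parsed calls could be routed to local Python functions as follows. `get_weather`, `TOOL_REGISTRY`, and `dispatch` are hypothetical names introduced for illustration.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"(placeholder) weather report for {city}"


# Registry mapping tool names to Python callables (illustrative only).
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(tool_calls: list[dict]) -> list[dict]:
    """Run each call and collect results; unknown tools yield an error entry."""
    results = []
    for call in tool_calls:
        name = call.get("name")
        args = call.get("arguments", {})
        if isinstance(args, str):  # arguments may arrive as a JSON string
            args = json.loads(args)
        func = TOOL_REGISTRY.get(name)
        if func is None:
            results.append({"name": name, "error": "unknown tool"})
            continue
        results.append({"name": name, "result": func(**args)})
    return results


# Assumed shape of parsed tool calls; not confirmed by the model card.
example_calls = [{"name": "get_weather", "arguments": {"city": "Beijing"}}]
outputs = dispatch(example_calls)
print(outputs)
```

Keeping the registry separate from the dispatch loop means unknown tool names produce an error entry instead of raising, so a single malformed call does not abort the whole turn.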

📚 Citation

If this repository or its workflow design is helpful to your research, please cite the paper using the BibTeX entry below.

@misc{liang2026voxmindendtoendagenticspoken,
      title={VoxMind: An End-to-End Agentic Spoken Dialogue System}, 
      author={Tianle Liang and Yifu Chen and Shengpeng Ji and Yijun Chen and Zhiyang Jia and Jingyu Lu and Fan Zhuo and Xueyi Pu and Yangzhuo Li and Zhou Zhao},
      year={2026},
      eprint={2604.15710},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2604.15710}, 
}


Safetensors checkpoint · Model size: 8B params · Tensor type: BF16
