What is MinMo?
MinMo is a multimodal large language model (approximately 8B parameters) built for seamless voice interaction: it unifies speech and text processing and supports full-duplex conversation and instruction-following voice generation.
When was MinMo released?
The research paper was posted to arXiv on January 10, 2025 and updated on January 14, 2025; the authors plan to open-source the code and models.
Is MinMo free to use?
Yes. The authors plan to release the code and model weights as open source, so the model itself will cost nothing once available; the exact license has not yet been confirmed, and running it still requires your own compute resources.
What are MinMo’s key capabilities?
Full-duplex voice conversation, state-of-the-art speech comprehension and generation, low-latency processing, instruction-based control over emotion, dialect, speaking rate, and voice mimicry, all while retaining strong text-only LLM performance.
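To make the instruction-control capability concrete, here is a minimal, hypothetical sketch of what a style-controlled request could look like. MinMo's code is not yet released, so the field names below are illustrative assumptions, not the actual API; only the pattern of steering voice style through a plain natural-language instruction comes from the paper.

```python
# Hypothetical request shape -- MinMo's code is not yet released,
# so these field names are illustrative assumptions. The pattern
# (steering emotion, dialect, and speaking rate through a plain
# natural-language instruction) is what the paper describes.
style_instruction = (
    "Respond in a cheerful tone, in a Sichuan dialect, "
    "at a slightly faster speaking rate."
)

request = {
    "system": style_instruction,     # voice style controlled by instruction
    "audio_input": "question.wav",   # user's spoken query
    "mode": "speech-to-speech",      # end-to-end voice interaction
}
print(request)
```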
What latency does MinMo achieve?
Speech-to-text latency is around 100 ms. In full-duplex mode, response latency is about 600 ms in theory and around 800 ms in practice, enabling near-real-time interaction.
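As a back-of-the-envelope check, the sketch below sums an assumed latency budget to the reported totals. The paper reports only the aggregates (about 100 ms speech-to-text, about 600 ms theoretical / 800 ms practical full-duplex), so the per-component values here are assumptions for illustration.

```python
# Assumed latency budget: only the totals come from the paper;
# the component split is an illustrative assumption.
components_ms = {
    "speech understanding (speech-to-text)": 100,  # reported figure
    "turn-taking / duplex prediction": 200,        # assumed
    "LLM first-token generation": 150,             # assumed
    "voice decoder first audio packet": 150,       # assumed
}
theoretical_ms = sum(components_ms.values())       # ~600 ms
practical_ms = theoretical_ms + 200                # assumed system overhead
print(f"theoretical ~{theoretical_ms} ms, practical ~{practical_ms} ms")
```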
How was MinMo trained?
Through multi-stage alignment training on roughly 1.4 million hours of speech data, progressing through four stages: speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment.
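The ordering of those stages can be written out as a simple pipeline. The stage names and sequence come from the answer above; the one-line goal strings are paraphrased assumptions, since the training recipe is not yet public.

```python
# Stage names and order follow the paper; the "goal" strings are
# paraphrased assumptions, as the training recipe is not yet public.
ALIGNMENT_STAGES = [
    ("speech-to-text",     "map incoming audio into the text LLM's space"),
    ("text-to-speech",     "teach the voice decoder to speak LLM outputs"),
    ("speech-to-speech",   "train end-to-end spoken query to spoken reply"),
    ("duplex interaction", "learn when to listen, speak, or yield the turn"),
]
for i, (stage, goal) in enumerate(ALIGNMENT_STAGES, start=1):
    print(f"stage {i}: {stage} -- {goal}")
```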
Where can I find MinMo’s project page?
The official project page is at funaudiollm.github.io/minmo, with further details, potential demos, and release updates.
What makes MinMo stand out?
It combines top performance on voice benchmarks, full-duplex support, expressive control via natural-language instructions, and a novel, simple voice decoder in a single balanced multimodal LLM.