What is MinMo?
MinMo is a multimodal large language model (approximately 8B parameters) for seamless voice interaction, combining speech and text processing with full-duplex conversation and instruction-following capabilities.
When was MinMo released?
The research paper was submitted to arXiv on January 10, 2025 (arXiv:2501.06282); code and models are planned for open-source release soon after.
Is MinMo free to use?
Yes. The authors plan to release code and model weights as open source at no cost (license details are pending); running the model locally will still require GPU compute.
What are MinMo’s key capabilities?
Full-duplex voice conversation, state-of-the-art speech comprehension/generation, low-latency processing, instruction control for emotions/dialects/rates/mimicry, and strong text LLM performance.
What latency does MinMo achieve?
Speech-to-text latency is around 100ms; full-duplex latency is about 600ms in theory and around 800ms in practice, enabling near-real-time interaction.
How was MinMo trained?
Through multi-stage alignment on 1.4 million hours of speech data: speech-to-text, text-to-speech, speech-to-speech, and duplex interaction stages.
Where can I find MinMo’s project page?
The official project page is at funaudiollm.github.io/minmo, with further details, potential demos, and release updates.
What makes MinMo stand out?
It combines top performance on voice benchmarks with full-duplex support, expressive control via instructions, and a novel, simple voice decoder, all in a balanced multimodal LLM.

MinMo

About This AI
MinMo is a multimodal large language model with approximately 8 billion parameters, designed for seamless voice interaction by integrating speech and text processing in a unified framework.
It achieves state-of-the-art performance in voice comprehension and generation while enabling full-duplex conversation (simultaneous two-way communication) and instruction-following capabilities for controlling speech nuances.
It is trained through multi-stage alignment on 1.4 million hours of diverse speech data, progressing through speech-to-text, text-to-speech, speech-to-speech, and duplex interaction stages.
The model retains strong text LLM capabilities while excelling at speech tasks, including automatic speech recognition (ASR), speech-to-text translation (S2TT), spoken question answering (SQA), vocal sound classification (VSC), speech emotion recognition (SER), language identification (LID), age/gender detection, and expressive generation.
A novel simple voice decoder enhances voice generation quality, outperforming prior models.
Key highlights include low latency (speech-to-text around 100ms, full-duplex theoretical 600ms/practical 800ms) and support for emotions, dialects, speaking rates, and voice mimicry via instructions.
Released as a research paper on arXiv on January 10, 2025 (arXiv:2501.06282), with code and models planned for open-source release soon after.
Project page at funaudiollm.github.io/minmo provides further details, though full weights and deployment are pending as of early 2026.
Ideal for real-time voice AI assistants, conversational agents, emotion-aware speech systems, and applications requiring natural, controllable spoken interaction.
Key Features
- Full-duplex conversation support: Simultaneous two-way voice interaction with low latency
- Multimodal speech-text integration: Unified processing of audio input/output and text
- Instruction-following for speech: Control emotions, dialects, speaking rates, and voice mimicry
- State-of-the-art voice benchmarks: Tops ASR, S2TT, SQA, SER, LID, VSC, and generation tasks
- Novel voice decoder: Simple yet high-quality speech synthesis outperforming prior methods
- Low-latency pipeline: Speech-to-text around 100ms, full-duplex around 600-800ms
- Multi-stage training alignment: Speech-to-text, text-to-speech, speech-to-speech, duplex stages
- Maintains text LLM strength: No catastrophic forgetting of language capabilities
- Expressive generation: Supports nuanced voice control via natural language instructions
- Extensive speech data training: 1.4 million hours across diverse tasks and languages
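The staged alignment curriculum listed above can be illustrated with a short sketch. The stage names follow the paper, but `run_stage` and the sequencing helper are placeholders for illustration, not the actual training recipe:

```python
# Illustrative sketch of MinMo's staged alignment curriculum.
# Stage names come from the paper; run_stage and align are
# hypothetical placeholders, not the real training code.

STAGES = [
    "speech-to-text",      # align audio encoder outputs with the text LLM
    "text-to-speech",      # teach the LLM to drive the voice decoder
    "speech-to-speech",    # end-to-end spoken input -> spoken output
    "duplex-interaction",  # turn-taking and barge-in behavior
]

def run_stage(name: str) -> str:
    """Placeholder for one alignment stage; a real run would fine-tune
    the model on that stage's portion of the 1.4M hours of speech."""
    return f"completed: {name}"

def align(stages=STAGES) -> list:
    # Stages run sequentially; each builds on the previous checkpoint.
    return [run_stage(s) for s in stages]

print(align()[-1])  # duplex interaction is the final stage
```

The key point the sketch captures is ordering: duplex interaction training comes last, after the model can already map between speech and text in both directions.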
Price Plans
- Free ($0): Planned open-source release with code and model weights under permissive license (details pending); no cost for research or personal use once available
- Potential Enterprise (Custom): Future hosted or API versions possible but not specified
Pros
- Leading voice interaction quality: SOTA on multiple speech benchmarks with natural full-duplex
- Highly controllable output: Instruction-based customization of emotion, dialect, rate, mimicry
- Low latency for real-time use: Practical full-duplex around 800ms, suitable for conversations
- Balanced multimodal performance: Strong speech without sacrificing text LLM abilities
- Innovative simple decoder: Achieves superior voice generation with elegant architecture
- Research impact potential: Open-source planned, enabling community extensions
- Extensive training scale: 1.4M hours data for robust generalization
Cons
- Not yet fully released: Code and models promised but pending as of early 2026
- Requires heavy compute: 8B parameters need powerful GPUs for inference
- No hosted demo mentioned: Project page available but no public interactive try-out
- Limited public benchmarks: Performance claims strong but full independent verification pending
- Research-focused: No production-ready deployment guide is available yet
- Latency still noticeable: Full-duplex practical 800ms may feel slightly delayed for some uses
- Language coverage: Strong multilingual but specifics not detailed in paper summary
Use Cases
- Voice AI assistants: Natural full-duplex conversations with emotional and stylistic control
- Real-time translation agents: Spoken Q&A and translation with low latency
- Emotion-aware chatbots: Detect and respond with appropriate tone/dialect
- Voice mimicry applications: Clone specific voices for personalized audio
- Spoken educational tools: Interactive tutoring with expressive speech
- Accessibility aids: Real-time speech comprehension for hearing-impaired users
- Multimodal research: Extend for combined audio-text-vision systems
Target Audience
- AI researchers in speech/multimodal: Studying voice LLMs and full-duplex systems
- Voice AI developers: Building conversational agents with controllable output
- Accessibility and edtech creators: Needing natural, expressive speech interfaces
- Multilingual application builders: Leveraging strong ASR/S2TT capabilities
- Open-source enthusiasts: Waiting for weights to experiment and fine-tune
- Enterprise voice teams: Potential for production once deployed
How To Use
- Wait for release: Monitor project page funaudiollm.github.io/minmo for code/weights availability
- Download once out: Get model from Hugging Face or GitHub repo when published
- Install dependencies: Set up PyTorch and speech libraries per instructions
- Run inference: Load model for speech input/output; use streaming for low-latency
- Provide audio/text: Input microphone stream or file for comprehension/generation
- Control via prompts: Add instructions like 'speak happily in British accent'
- Integrate duplex: Use streaming API for full-duplex conversation loops
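Since MinMo's code is not yet released, its real interface is unknown; the sketch below only shows the shape a full-duplex turn might take, with a hypothetical stub class standing in for the model:

```python
# Hypothetical full-duplex loop; MinMo's real API is unreleased,
# so DuplexModel is a stub that describes what it received.

class DuplexModel:
    """Stand-in for a streaming speech model (not the real MinMo API)."""

    def respond(self, audio_chunk: bytes, instruction: str) -> str:
        # A real model would stream synthesized audio back while the
        # user can still interrupt; here we just return a text summary.
        return f"[{instruction}] heard {len(audio_chunk)} bytes"

def conversation_turn(model, audio_chunk, instruction="speak calmly"):
    """One turn of a duplex loop: pass the latest audio chunk plus a
    style instruction; in practice this runs while audio keeps arriving."""
    return model.respond(audio_chunk, instruction)

model = DuplexModel()
print(conversation_turn(model, b"\x00" * 3200))
# 3200 bytes = 100 ms of 16 kHz 16-bit mono audio (an assumed format)
```

Once real weights ship, the stub would be replaced by the published loading and streaming calls; the instruction argument mirrors the prompt-based voice control described above.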
How we rated MinMo
- Performance: 4.8/5
- Accuracy: 4.7/5
- Features: 4.9/5
- Cost-Efficiency: 5.0/5
- Ease of Use: 4.3/5
- Customization: 4.8/5
- Data Privacy: 5.0/5
- Support: 4.2/5
- Integration: 4.5/5
- Overall Score: 4.7/5
MinMo integration with other tools
- Hugging Face (Upcoming): Model weights and inference pipelines expected on Hugging Face for easy loading
- Project Web Page: funaudiollm.github.io/minmo for demos, updates, and documentation
- Streaming Frameworks: Compatible with real-time audio libraries like PyAudio or WebRTC for duplex apps
- Voice SDKs: Potential integration with tools like Mozilla TTS, Coqui, or Whisper for extended pipelines
- Local Hardware: Runs on GPUs via PyTorch; no cloud dependency required once released
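For the streaming integrations above, audio is typically fed to a model in fixed-size frames. Assuming 16 kHz 16-bit mono input (a common choice for speech models, not confirmed for MinMo), a 100 ms frame is 1,600 samples, which a small generator can carve out of a raw PCM buffer:

```python
# Split a raw PCM buffer into fixed 100 ms frames for streaming input.
# 16 kHz 16-bit mono is an assumed format, not confirmed for MinMo.

SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit PCM
FRAME_MS = 100

FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 3200

def frames(pcm: bytes):
    """Yield successive 100 ms frames; the final partial frame is
    dropped, since streaming encoders usually expect fixed-size input."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[start:start + FRAME_BYTES]

one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)  # 1 s of silence
print(sum(1 for _ in frames(one_second)))  # 10 frames per second
```

The same framing works whether the bytes come from PyAudio, a WebRTC track, or a file, which is why the frame size is defined independently of the capture library.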
Best prompts optimised for MinMo
- Respond to this spoken query in a calm, empathetic tone with a slight Southern US dialect, keeping answers concise:
- Translate and reply to the user's spoken English question in natural Mandarin Chinese with a friendly, enthusiastic voice:
- Mimic the voice style of a famous narrator while explaining this concept slowly and clearly: [text prompt + audio style reference]
- Generate a response with excited emotion and faster speaking rate for this kids' educational question:
- Answer this spoken technical query using a professional, neutral tone in formal British English:
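Prompts like these pair a style instruction with the user's actual query. A tiny helper can keep that pairing consistent; the format here is hypothetical, since MinMo's prompt schema is unpublished:

```python
# Hypothetical prompt packaging; MinMo's actual input schema is
# unpublished, so this format is only an illustration.

def build_prompt(instruction: str, query: str) -> str:
    """Join a voice-style instruction with the user's query text,
    normalizing any trailing colon on the instruction."""
    return f"{instruction.rstrip(':').rstrip()}: {query}"

print(build_prompt(
    "Respond to this spoken query in a calm, empathetic tone:",
    "What is full-duplex conversation?",
))
```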
