What is MinMo?
MinMo is a multimodal large language model (approximately 8B parameters) for seamless voice interaction, combining speech and text processing with full-duplex conversation and instruction-following capabilities.
When was MinMo released?
The research paper was submitted to arXiv on January 10, 2025 (arXiv:2501.06282); code and models are planned for open-source release soon after.
Is MinMo free to use?
Yes. The authors plan to release code and model weights as open source at no cost (license details are pending); running the model locally will still require GPU compute.
What are MinMo’s key capabilities?
Full-duplex voice conversation, state-of-the-art speech comprehension/generation, low-latency processing, instruction control for emotions/dialects/rates/mimicry, and strong text LLM performance.
What latency does MinMo achieve?
Speech-to-text latency is around 100ms; full-duplex latency is about 600ms in theory and around 800ms in practice, enabling near-real-time interaction.
How was MinMo trained?
Through multi-stage alignment on 1.4 million hours of speech data: speech-to-text, text-to-speech, speech-to-speech, and duplex interaction stages.
Where can I find MinMo’s project page?
The official project page is at funaudiollm.github.io/minmo, with further details, potential demos, and release updates.
What makes MinMo stand out?
It combines top performance on voice benchmarks with full-duplex support, expressive control via instructions, and a novel, simple voice decoder, all in a balanced multimodal LLM.

MinMo

About This AI
MinMo is a multimodal large language model with approximately 8 billion parameters, designed for seamless voice interaction by integrating speech and text processing in a unified framework.
It achieves state-of-the-art performance in voice comprehension and generation while enabling full-duplex conversation (simultaneous two-way communication) and instruction-following capabilities for controlling speech nuances.
It is trained through multi-stage alignment on 1.4 million hours of diverse speech data, progressing through speech-to-text, text-to-speech, speech-to-speech, and duplex interaction stages.
The model retains strong text LLM capabilities while excelling at speech tasks, including automatic speech recognition (ASR), speech-to-text translation (S2TT), spoken question answering (SQA), vocal sound classification (VSC), speech emotion recognition (SER), language identification (LID), age/gender detection, and expressive generation.
A novel simple voice decoder enhances voice generation quality, outperforming prior models.
Key highlights include low latency (speech-to-text around 100ms, full-duplex theoretical 600ms/practical 800ms) and support for emotions, dialects, speaking rates, and voice mimicry via instructions.
Released as a research paper on arXiv on January 10, 2025 (arXiv:2501.06282), with code and models planned for open-source release soon after.
Project page at funaudiollm.github.io/minmo provides further details, though full weights and deployment are pending as of early 2026.
Ideal for real-time voice AI assistants, conversational agents, emotion-aware speech systems, and applications requiring natural, controllable spoken interaction.
Key Features
- Full-duplex conversation support: Simultaneous two-way voice interaction with low latency
- Multimodal speech-text integration: Unified processing of audio input/output and text
- Instruction-following for speech: Control emotions, dialects, speaking rates, and voice mimicry
- State-of-the-art voice benchmarks: Tops ASR, S2TT, SQA, SER, LID, VSC, and generation tasks
- Novel voice decoder: Simple yet high-quality speech synthesis outperforming prior methods
- Low-latency pipeline: Speech-to-text around 100ms, full-duplex around 600-800ms
- Multi-stage training alignment: Speech-to-text, text-to-speech, speech-to-speech, duplex stages
- Maintains text LLM strength: No catastrophic forgetting of language capabilities
- Expressive generation: Supports nuanced voice control via natural language instructions
- Extensive speech data training: 1.4 million hours across diverse tasks and languages
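The staged alignment curriculum listed above can be illustrated with a short sketch. The stage names follow the paper, but `run_stage` and the sequencing helper are placeholders for illustration, not the actual training recipe:

```python
# Illustrative sketch of MinMo's staged alignment curriculum.
# Stage names come from the paper; run_stage and align are
# hypothetical placeholders, not the real training code.

STAGES = [
    "speech-to-text",      # align audio encoder outputs with the text LLM
    "text-to-speech",      # teach the LLM to drive the voice decoder
    "speech-to-speech",    # end-to-end spoken input -> spoken output
    "duplex-interaction",  # turn-taking and barge-in behavior
]

def run_stage(name: str) -> str:
    """Placeholder for one alignment stage; a real run would fine-tune
    the model on that stage's portion of the 1.4M hours of speech."""
    return f"completed: {name}"

def align(stages=STAGES) -> list:
    # Stages run sequentially; each builds on the previous checkpoint.
    return [run_stage(s) for s in stages]

print(align()[-1])  # duplex interaction is the final stage
```

The key point the sketch captures is ordering: duplex interaction training comes last, after the model can already map between speech and text in both directions.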
Price Plans
- Free ($0): Planned open-source release with code and model weights under permissive license (details pending); no cost for research or personal use once available
- Potential Enterprise (Custom): Future hosted or API versions possible but not specified
Pros
- Leading voice interaction quality: SOTA on multiple speech benchmarks with natural full-duplex
- Highly controllable output: Instruction-based customization of emotion, dialect, rate, mimicry
- Low latency for real-time use: Practical full-duplex around 800ms, suitable for conversations
- Balanced multimodal performance: Strong speech without sacrificing text LLM abilities
- Innovative simple decoder: Achieves superior voice generation with elegant architecture
- Research impact potential: Open-source planned, enabling community extensions
- Extensive training scale: 1.4M hours data for robust generalization
Cons
- Not yet fully released: Code and models promised but pending as of early 2026
- Requires heavy compute: 8B parameters need powerful GPUs for inference
- No hosted demo mentioned: Project page available but no public interactive try-out
- Limited public benchmarks: Performance claims strong but full independent verification pending
- Research-focused: No production-ready deployment guide is available yet
- Latency still noticeable: Full-duplex practical 800ms may feel slightly delayed for some uses
- Language coverage: Strong multilingual but specifics not detailed in paper summary
Use Cases
- Voice AI assistants: Natural full-duplex conversations with emotional and stylistic control
- Real-time translation agents: Spoken Q&A and translation with low latency
- Emotion-aware chatbots: Detect and respond with appropriate tone/dialect
- Voice mimicry applications: Clone specific voices for personalized audio
- Spoken educational tools: Interactive tutoring with expressive speech
- Accessibility aids: Real-time speech comprehension for hearing-impaired users
- Multimodal research: Extend for combined audio-text-vision systems
Target Audience
- AI researchers in speech/multimodal: Studying voice LLMs and full-duplex systems
- Voice AI developers: Building conversational agents with controllable output
- Accessibility and edtech creators: Needing natural, expressive speech interfaces
- Multilingual application builders: Leveraging strong ASR/S2TT capabilities
- Open-source enthusiasts: Waiting for weights to experiment and fine-tune
- Enterprise voice teams: Potential for production once deployed
How To Use
- Wait for release: Monitor project page funaudiollm.github.io/minmo for code/weights availability
- Download once out: Get model from Hugging Face or GitHub repo when published
- Install dependencies: Set up PyTorch and speech libraries per instructions
- Run inference: Load model for speech input/output; use streaming for low-latency
- Provide audio/text: Input microphone stream or file for comprehension/generation
- Control via prompts: Add instructions like 'speak happily in British accent'
- Integrate duplex: Use streaming API for full-duplex conversation loops
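Since MinMo's code is not yet released, its real interface is unknown; the sketch below only shows the shape a full-duplex turn might take, with a hypothetical stub class standing in for the model:

```python
# Hypothetical full-duplex loop; MinMo's real API is unreleased,
# so DuplexModel is a stub that describes what it received.

class DuplexModel:
    """Stand-in for a streaming speech model (not the real MinMo API)."""

    def respond(self, audio_chunk: bytes, instruction: str) -> str:
        # A real model would stream synthesized audio back while the
        # user can still interrupt; here we just return a text summary.
        return f"[{instruction}] heard {len(audio_chunk)} bytes"

def conversation_turn(model, audio_chunk, instruction="speak calmly"):
    """One turn of a duplex loop: pass the latest audio chunk plus a
    style instruction; in practice this runs while audio keeps arriving."""
    return model.respond(audio_chunk, instruction)

model = DuplexModel()
print(conversation_turn(model, b"\x00" * 3200))
# 3200 bytes = 100 ms of 16 kHz 16-bit mono audio (an assumed format)
```

Once real weights ship, the stub would be replaced by the published loading and streaming calls; the instruction argument mirrors the prompt-based voice control described above.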
How we rated MinMo
- Performance: 4.8/5
- Accuracy: 4.7/5
- Features: 4.9/5
- Cost-Efficiency: 5.0/5
- Ease of Use: 4.3/5
- Customization: 4.8/5
- Data Privacy: 5.0/5
- Support: 4.2/5
- Integration: 4.5/5
- Overall Score: 4.7/5
MinMo integration with other tools
- Hugging Face (Upcoming): Model weights and inference pipelines expected on Hugging Face for easy loading
- Project Web Page: funaudiollm.github.io/minmo for demos, updates, and documentation
- Streaming Frameworks: Compatible with real-time audio libraries like PyAudio or WebRTC for duplex apps
- Voice SDKs: Potential integration with tools like Mozilla TTS, Coqui, or Whisper for extended pipelines
- Local Hardware: Runs on GPUs via PyTorch; no cloud dependency required once released
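For the streaming integrations above, audio is typically fed to a model in fixed-size frames. Assuming 16 kHz 16-bit mono input (a common choice for speech models, not confirmed for MinMo), a 100 ms frame is 1,600 samples, which a small generator can carve out of a raw PCM buffer:

```python
# Split a raw PCM buffer into fixed 100 ms frames for streaming input.
# 16 kHz 16-bit mono is an assumed format, not confirmed for MinMo.

SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit PCM
FRAME_MS = 100

FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 3200

def frames(pcm: bytes):
    """Yield successive 100 ms frames; the final partial frame is
    dropped, since streaming encoders usually expect fixed-size input."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[start:start + FRAME_BYTES]

one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)  # 1 s of silence
print(sum(1 for _ in frames(one_second)))  # 10 frames per second
```

The same framing works whether the bytes come from PyAudio, a WebRTC track, or a file, which is why the frame size is defined independently of the capture library.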
Best prompts optimised for MinMo
- Respond to this spoken query in a calm, empathetic tone with a slight Southern US dialect, keeping answers concise:
- Translate and reply to the user's spoken English question in natural Mandarin Chinese with a friendly, enthusiastic voice:
- Mimic the voice style of a famous narrator while explaining this concept slowly and clearly: [text prompt + audio style reference]
- Generate a response with excited emotion and faster speaking rate for this kids' educational question:
- Answer this spoken technical query using a professional, neutral tone in formal British English:
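Prompts like these pair a style instruction with the user's actual query. A tiny helper can keep that pairing consistent; the format here is hypothetical, since MinMo's prompt schema is unpublished:

```python
# Hypothetical prompt packaging; MinMo's actual input schema is
# unpublished, so this format is only an illustration.

def build_prompt(instruction: str, query: str) -> str:
    """Join a voice-style instruction with the user's query text,
    normalizing any trailing colon on the instruction."""
    return f"{instruction.rstrip(':').rstrip()}: {query}"

print(build_prompt(
    "Respond to this spoken query in a calm, empathetic tone:",
    "What is full-duplex conversation?",
))
```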
