MinMo

Multimodal Large Language Model for Seamless Voice Interaction – Full-Duplex Speech with Instruction-Following and Low Latency
Last Updated: January 3, 2026
By Zelili AI

About This AI

MinMo is a multimodal large language model with approximately 8 billion parameters, designed for seamless voice interaction by integrating speech and text processing in a unified framework.

It achieves state-of-the-art performance in voice comprehension and generation while enabling full-duplex conversation (simultaneous two-way communication) and instruction-following capabilities for controlling speech nuances.

It was trained through multi-stage alignment on 1.4 million hours of diverse speech data, progressing through speech-to-text, text-to-speech, speech-to-speech, and duplex interaction stages.

The model maintains strong text LLM capabilities alongside superior performance on speech tasks, including automatic speech recognition (ASR), speech-to-text translation (S2TT), spoken question answering (SQA), vocal sound classification (VSC), speech emotion recognition (SER), language identification (LID), age/gender detection, and expressive generation.

A novel simple voice decoder enhances voice generation quality, outperforming prior models.

Key highlights include low latency (around 100 ms for speech-to-text; roughly 600 ms theoretical and 800 ms practical for full-duplex turn-taking) and instruction-based control of emotions, dialects, speaking rates, and voice mimicry.

Released as a research paper on arXiv on January 10, 2025 (arXiv:2501.06282), with code and models planned for open-source release soon after.

Project page at funaudiollm.github.io/minmo provides further details, though full weights and deployment are pending as of early 2026.

Ideal for real-time voice AI assistants, conversational agents, emotion-aware speech systems, and applications requiring natural, controllable spoken interaction.

Key Features

  1. Full-duplex conversation support: Simultaneous two-way voice interaction with low latency
  2. Multimodal speech-text integration: Unified processing of audio input/output and text
  3. Instruction-following for speech: Control emotions, dialects, speaking rates, and voice mimicry
  4. State-of-the-art voice benchmarks: Tops ASR, S2TT, SQA, SER, LID, VSC, and generation tasks
  5. Novel voice decoder: Simple yet high-quality speech synthesis outperforming prior methods
  6. Low-latency pipeline: Speech-to-text around 100ms, full-duplex around 600-800ms
  7. Multi-stage training alignment: Speech-to-text, text-to-speech, speech-to-speech, duplex stages
  8. Maintains text LLM strength: No catastrophic forgetting of language capabilities
  9. Expressive generation: Supports nuanced voice control via natural language instructions
  10. Extensive speech data training: 1.4 million hours across diverse tasks and languages

Price Plans

  1. Free ($0): Planned open-source release with code and model weights under a permissive license (details pending); no cost for research or personal use once available
  2. Potential Enterprise (Custom): Future hosted or API versions possible but not specified

Pros

  1. Leading voice interaction quality: SOTA on multiple speech benchmarks with natural full-duplex
  2. Highly controllable output: Instruction-based customization of emotion, dialect, rate, mimicry
  3. Low latency for real-time use: Practical full-duplex around 800ms, suitable for conversations
  4. Balanced multimodal performance: Strong speech without sacrificing text LLM abilities
  5. Innovative simple decoder: Achieves superior voice generation with elegant architecture
  6. Research impact potential: Open-source planned, enabling community extensions
  7. Extensive training scale: 1.4M hours data for robust generalization

Cons

  1. Not yet fully released: Code and models promised but pending as of early 2026
  2. Requires heavy compute: 8B parameters need powerful GPUs for inference
  3. No hosted demo mentioned: Project page available but no public interactive demo
  4. Limited public benchmarks: Performance claims strong but full independent verification pending
  5. Research-focused: No production-ready deployment guide yet
  6. Latency still noticeable: Full-duplex practical 800ms may feel slightly delayed for some uses
  7. Language coverage: Multilingual support is strong, but per-language specifics are not detailed in the paper summary

Use Cases

  1. Voice AI assistants: Natural full-duplex conversations with emotional and stylistic control
  2. Real-time translation agents: Spoken Q&A and translation with low latency
  3. Emotion-aware chatbots: Detect and respond with appropriate tone/dialect
  4. Voice mimicry applications: Clone specific voices for personalized audio
  5. Spoken educational tools: Interactive tutoring with expressive speech
  6. Accessibility aids: Real-time speech comprehension for hearing-impaired users
  7. Multimodal research: Extend for combined audio-text-vision systems

Target Audience

  1. AI researchers in speech/multimodal: Studying voice LLMs and full-duplex systems
  2. Voice AI developers: Building conversational agents with controllable output
  3. Accessibility and edtech creators: Needing natural, expressive speech interfaces
  4. Multilingual application builders: Leveraging strong ASR/S2TT capabilities
  5. Open-source enthusiasts: Waiting for weights to experiment and fine-tune
  6. Enterprise voice teams: Potential for production once deployed

How To Use

  1. Wait for release: Monitor project page funaudiollm.github.io/minmo for code/weights availability
  2. Download once out: Get model from Hugging Face or GitHub repo when published
  3. Install dependencies: Set up PyTorch and speech libraries per instructions
  4. Run inference: Load the model for speech input/output; use streaming mode for low latency
  5. Provide audio/text: Input a microphone stream or file for comprehension/generation
  6. Control via prompts: Add instructions like 'speak happily in a British accent'
  7. Integrate duplex: Use streaming API for full-duplex conversation loops
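
Since the model is not yet released, the interaction flow in the steps above can only be sketched against a hypothetical API. The `MinMoStub` class below is a placeholder for the unreleased interface (every class and method name is an assumption); it just makes the comprehension-then-controlled-generation loop concrete.

```python
# Hypothetical sketch of the interaction loop described above.
# MinMoStub is a stand-in: the real API is unreleased, so every
# name here is an assumption, not the actual interface.

class MinMoStub:
    """Stand-in for a speech-to-speech model with instruction control."""

    def transcribe(self, audio_chunk: bytes) -> str:
        # Real model: streaming ASR with ~100 ms latency.
        return "what's the weather like"

    def respond(self, text: str, instruction: str = "") -> str:
        # Real model: generates speech; here we return a text description.
        style = f" [{instruction}]" if instruction else ""
        return f"Spoken reply to '{text}'{style}"


def duplex_turn(model: MinMoStub, audio_chunk: bytes, instruction: str) -> str:
    """One turn of the comprehension -> controlled generation loop."""
    text = model.transcribe(audio_chunk)       # step 5: audio in
    return model.respond(text, instruction)    # step 6: prompt-controlled out


reply = duplex_turn(MinMoStub(), b"\x00" * 3200,
                    "speak happily in a British accent")
print(reply)
```

In a real duplex deployment this turn function would run inside a streaming loop, with the model allowed to start replying before the user finishes speaking.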

How we rated MinMo

  • Performance: 4.8/5
  • Accuracy: 4.7/5
  • Features: 4.9/5
  • Cost-Efficiency: 5.0/5
  • Ease of Use: 4.3/5
  • Customization: 4.8/5
  • Data Privacy: 5.0/5
  • Support: 4.2/5
  • Integration: 4.5/5
  • Overall Score: 4.7/5

MinMo integration with other tools

  1. Hugging Face (Upcoming): Model weights and inference pipelines expected on Hugging Face for easy loading
  2. Project Web Page: funaudiollm.github.io/minmo for demos, updates, and documentation
  3. Streaming Frameworks: Compatible with real-time audio libraries like PyAudio or WebRTC for duplex apps
  4. Voice SDKs: Potential integration with tools like Mozilla TTS, Coqui, or Whisper for extended pipelines
  5. Local Hardware: Runs on GPUs via PyTorch; no cloud dependency required once released
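
Whatever capture library is used (PyAudio, a WebRTC track, or a file reader), a streaming speech model consumes audio in small fixed-size frames. The helper below is a generic, pure-Python sketch of that framing step; the 16 kHz / 16-bit / 20 ms parameters are common streaming-ASR defaults, not values specified by MinMo.

```python
# Framing raw PCM into fixed 20 ms chunks, the kind of unit a real-time
# capture library (e.g. PyAudio or a WebRTC track) would hand to a
# streaming speech model. Pure Python; no audio hardware required.

def frame_pcm(pcm: bytes, sample_rate: int = 16000,
              sample_width: int = 2, frame_ms: int = 20) -> list[bytes]:
    """Split 16-bit mono PCM into frame_ms-sized chunks; drop any tail."""
    frame_bytes = sample_rate * sample_width * frame_ms // 1000
    return [pcm[i:i + frame_bytes]
            for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]


# One second of silence at 16 kHz / 16-bit mono -> fifty 20 ms frames.
one_second = b"\x00\x00" * 16000
frames = frame_pcm(one_second)
print(len(frames), len(frames[0]))  # 50 640
```

Feeding frames of this size as they arrive, rather than waiting for a full utterance, is what makes the ~100 ms speech-to-text latency figure plausible.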

Best prompts optimised for MinMo

  1. Respond to this spoken query in a calm, empathetic tone with a slight Southern US dialect, keeping answers concise:
  2. Translate and reply to the user's spoken English question in natural Mandarin Chinese with a friendly, enthusiastic voice:
  3. Mimic the voice style of a famous narrator while explaining this concept slowly and clearly: [text prompt + audio style reference]
  4. Generate a response with excited emotion and faster speaking rate for this kids' educational question:
  5. Answer this spoken technical query using a professional, neutral tone in formal British English:
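
The prompts above share a pattern: a task, plus optional tone, dialect, and pacing controls. A small helper can compose them consistently; the template below is a guess at a useful convention, not an official MinMo prompt format.

```python
# Composes style-control prompts like the examples above. The template
# is an assumed convention, not an official MinMo prompt format.

def style_prompt(task: str, tone: str = "", dialect: str = "",
                 rate: str = "") -> str:
    """Build an instruction prompt from optional style controls."""
    parts = []
    if tone:
        parts.append(f"in a {tone} tone")
    if dialect:
        parts.append(f"with a {dialect} dialect")
    if rate:
        parts.append(f"at a {rate} speaking rate")
    style = ", ".join(parts)
    return f"{task} {style}:" if style else f"{task}:"


prompt = style_prompt("Respond to this spoken query",
                      tone="calm, empathetic",
                      dialect="slight Southern US")
print(prompt)
# Respond to this spoken query in a calm, empathetic tone, with a slight Southern US dialect:
```
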

MinMo represents a strong advancement in multimodal voice LLMs, delivering SOTA speech comprehension/generation with full-duplex support, low latency, and expressive instruction control. Once open-sourced, it promises great potential for natural voice agents. Currently research-stage with a pending release, but its balanced design and scale make it highly anticipated for conversational AI.

FAQs

  • What is MinMo?

    MinMo is a multimodal large language model (approximately 8B parameters) for seamless voice interaction, combining speech and text processing with full-duplex conversation and instruction-following capabilities.

  • When was MinMo released?

    The research paper was submitted to arXiv on January 10, 2025 (arXiv:2501.06282); code and models are planned for open-source release soon after.

  • Is MinMo free to use?

    Yes, it will be open-source with code and weights released freely (likely permissive license); no cost once available, though running requires compute resources.

  • What are MinMo’s key capabilities?

    Full-duplex voice conversation, state-of-the-art speech comprehension/generation, low-latency processing, instruction control for emotions/dialects/rates/mimicry, and strong text LLM performance.

  • What latency does MinMo achieve?

    Speech-to-text latency is around 100 ms; full-duplex latency is roughly 600 ms in theory and around 800 ms in practice, enabling near-real-time interactions.

  • How was MinMo trained?

    Through multi-stage alignment on 1.4 million hours of speech data: speech-to-text, text-to-speech, speech-to-speech, and duplex interaction stages.

  • Where can I find MinMo’s project page?

    The official project page is at funaudiollm.github.io/minmo, with further details, potential demos, and release updates.

  • What makes MinMo stand out?

    It combines top performance on voice benchmarks, full-duplex support, expressive control via instructions, and a novel simple voice decoder in a balanced multimodal LLM.


About Author

Hi guys! We are a group of ML engineers by profession, with years of experience exploring and building AI tools, LLMs, and generative technologies. We analyze new tools not just as users, but as engineers who understand their technical depth and real-world value. We know how overwhelming these tools can be for most people, so we break down complex AI concepts into simple, practical insights. Our goal is to help you discover the AI tools that actually save your time and make everyday work smarter, not harder. "We don't just write about AI: we build, test, and simplify it for you."