FAQs
What is VibeVoice?
VibeVoice is Microsoft’s open-source TTS framework for generating expressive, long-form, multi-speaker conversational audio such as podcasts, with emotions, singing, and durations of up to 90 minutes.
When was VibeVoice released?
The TTS model was open-sourced in August 2025, with an ASR variant added in January 2026; the repository was temporarily disabled due to misuse concerns.
Is VibeVoice free to use?
Yes. It is a fully open-source research framework with weights on Hugging Face; there is no cost to download and run it locally (subject to responsible-use guidelines).
What makes VibeVoice special?
It supports 90-minute multi-speaker audio, spontaneous emotions, singing with background music, cross-lingual expression, and efficient long-sequence processing via 7.5 Hz tokenizers.
Does VibeVoice support voice cloning?
It can mimic a voice from a short reference sample for expressive synthesis, but its usage guidelines explicitly prohibit unauthorized cloning, satire, deepfakes, and real-time voice conversion.
What languages does VibeVoice support?
English and Mandarin are the primary demonstrated languages, with cross-lingual capabilities that preserve emotional expression across the two.
How many speakers can VibeVoice handle?
Up to 4 distinct speakers in long-form conversations with natural turn-taking and consistency.
Where can I download VibeVoice?
Model weights are on Hugging Face (microsoft/VibeVoice-1.5B); check the GitHub repository for code (access may be limited while the repo is disabled).
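For local use, the weights can be fetched with the standard Hugging Face client; a minimal sketch, assuming huggingface_hub is installed and the repo is currently accessible:

    # Fetch the VibeVoice-1.5B weights with the standard Hugging Face client.
    # Assumes "pip install huggingface_hub" and that the repo is not disabled.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="microsoft/VibeVoice-1.5B")
    print("Weights downloaded to:", local_dir)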

VibeVoice

About This AI
VibeVoice is a novel open-source framework from Microsoft for generating expressive, long-form, multi-speaker conversational audio such as podcasts from text.
It supports up to 90 minutes of continuous speech with up to 4 distinct speakers, natural turn-taking, spontaneous emotions, singing, and background music integration.
The core innovation is a pair of continuous speech tokenizers (acoustic and semantic) running at an ultra-low 7.5 Hz frame rate for efficiency over long sequences, combined with a next-token diffusion framework: an LLM models textual context and dialogue flow, while a diffusion head renders high-fidelity acoustics.
It excels at context-aware expression including unscripted emotional nuances, cross-lingual capabilities (English and Mandarin demonstrated), and realistic prosody.
The model family includes VibeVoice-TTS for synthesis and, later, VibeVoice-ASR for long-form transcription with structured outputs (speaker, timestamps, content; sketched after this overview).
Released in August 2025 with weights on Hugging Face (e.g., microsoft/VibeVoice-1.5B), it emphasizes responsible use; the repository was temporarily disabled over misuse concerns, but the framework remains a notable advance in speech synthesis research.
Applications include podcast production, multi-speaker dialogues, emotional voiceovers, singing generation, and cross-lingual audio.
As an open-source research framework, it promotes collaboration in TTS while prohibiting out-of-scope uses like unauthorized voice cloning or real-time deepfakes.
Demos showcase spontaneous arguments, singing lyrics, tech podcasts with background music, sports debates, and climate discussions, highlighting its expressive and long-form strengths.
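To make the ASR variant's structured output concrete, here is a plausible shape for one transcript, built only from the fields named above (speaker, timestamps, content); the actual field names and schema may differ, so check the repo's documentation:

    # Hypothetical VibeVoice-ASR transcript structure, using only the fields
    # named above (speaker, timestamps, content); the real schema may differ.
    transcript = [
        {"speaker": "Speaker 1", "start": 0.0, "end": 4.2, "content": "Welcome back to the show."},
        {"speaker": "Speaker 2", "start": 4.2, "end": 9.8, "content": "Thanks, great to be here."},
    ]
    for turn in transcript:
        print(f"[{turn['start']:5.1f}s-{turn['end']:5.1f}s] {turn['speaker']}: {turn['content']}")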
Key Features
- Long-form multi-speaker synthesis: Generates up to 90 minutes of coherent audio with up to 4 distinct speakers and natural turn-taking
- Expressive and emotional speech: Captures spontaneous emotions, nuances, prosody, and unscripted dynamics
- Singing and music integration: Supports singing lyrics with background music in generated audio
- Cross-lingual capabilities: Demonstrated English-Mandarin translation and expression preservation
- Ultra-low frame rate tokenizers: Acoustic/Semantic tokenizers at 7.5 Hz for efficient long-sequence processing (quantified after this list)
- Next-token diffusion framework: LLM for context/dialogue + diffusion head for high-fidelity acoustics
- Context-aware generation: Understands dialogue flow, speaker roles, and emotional cues from text
- Open-source research framework: Weights and code for TTS (and ASR variant) to advance speech synthesis
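The 7.5 Hz efficiency claim is easy to quantify: the quick calculation below counts the frames a 90-minute generation requires, with 50 Hz as an illustrative comparison rate (the comparison rate is an assumption, not a figure from Microsoft):

    # Frames needed to represent 90 minutes of audio at different tokenizer rates.
    # 7.5 Hz is VibeVoice's stated rate; 50 Hz is only an illustrative comparison.
    duration_s = 90 * 60  # 90 minutes in seconds
    for rate_hz in (7.5, 50.0):
        print(f"{rate_hz:>5} Hz -> {duration_s * rate_hz:,.0f} frames")
    # 7.5 Hz -> 40,500 frames; 50 Hz -> 270,000 frames (about 6.7x more)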
Price Plans
- Free ($0): Fully open-source research framework with model weights available on Hugging Face; no usage fees
- Commercial/Enterprise (N/A): Not specified; intended for research, not production deployment without review
Pros
- Breakthrough long-form stability: Handles extended conversations far beyond typical 1-2 speaker limits
- Highly expressive output: Realistic emotions, singing, and spontaneous nuances for lifelike audio
- Efficient architecture: Low frame rate enables processing of very long sequences without collapse
- Open-source accessibility: Weights on Hugging Face for research and development use
- Multi-speaker naturalness: Strong turn-taking and speaker distinction in dialogues
- Cross-lingual potential: Preserves expression across English and Mandarin
- Responsible AI focus: Guidelines against misuse like unauthorized cloning or deepfakes
Cons
- Repo temporarily disabled: Access limited due to misuse concerns (as of late 2025)
- Requires powerful hardware: Diffusion-based model demands GPU for inference
- Setup for local use: Needs technical knowledge to run from Hugging Face weights
- Limited languages demonstrated: Primarily English/Mandarin; broader support unclear
- No real-time low-latency focus: Optimized for offline long-form rather than streaming
- Responsible use restrictions: Prohibits voice impersonation without consent or deepfake apps
- Early research stage: May have artifacts in edge cases or complex emotions
Use Cases
- Podcast production: Generate full episodes with multiple hosts, guests, emotions, and background music
- Conversational audio creation: Synthesize dialogues, debates, interviews, or storytelling with natural flow
- Expressive voiceovers: Add emotional depth to narrations, audiobooks, or character voices
- Singing and music demos: Create sung lyrics or musical segments from text
- Cross-lingual content: Produce audio translations preserving original expression
- Research in TTS: Extend or benchmark expressive multi-speaker synthesis
- Educational audio: Generate engaging lectures or discussions with varied speakers
Target Audience
- AI speech researchers: Advancing TTS with expressive, long-form capabilities
- Content creators: Podcasters, audiobook producers needing synthetic multi-speaker audio
- Developers and experimenters: Running open-source models locally for custom applications
- Multimedia artists: Incorporating emotional/singing voices in projects
- Language tech enthusiasts: Exploring cross-lingual expressive synthesis
- Microsoft ecosystem users: Interested in frontier voice AI research
How To Use
- Access repo (when available): Visit microsoft.github.io/VibeVoice or Hugging Face microsoft/VibeVoice-1.5B
- Download model weights: Get from Hugging Face for local inference
- Install dependencies: Set up environment with required libraries (PyTorch, etc.)
- Prepare input: Provide text script with speaker tags and optional emotion cues
- Run generation: Use the provided inference scripts for TTS synthesis (see the sketch after this list)
- Listen and iterate: Generate audio samples; refine prompts for better expression
- Follow guidelines: Adhere to responsible use policy against misuse
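Put together, a local run might look like the sketch below. The script name, CLI flags, and speaker-tag syntax are placeholders for illustration, not VibeVoice's documented interface; use whatever entry point and arguments the repo's README specifies.

    # Illustrative end-to-end sketch; the file name and CLI flags below are
    # placeholders, not VibeVoice's documented interface -- check the repo.
    import subprocess

    # Prepare input: a text script with speaker tags (tag syntax assumed here).
    dialogue = (
        "Speaker 1: Welcome to the show! Today we're talking long-form TTS.\n"
        "Speaker 2: Thanks for having me. I'm excited to dig in.\n"
    )
    with open("script.txt", "w", encoding="utf-8") as f:
        f.write(dialogue)

    # Run generation via the repo's inference script (hypothetical name/flags).
    subprocess.run(
        ["python", "inference.py",
         "--model_path", "microsoft/VibeVoice-1.5B",
         "--txt_path", "script.txt",
         "--output", "episode.wav"],
        check=True,
    )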
How we rated VibeVoice
- Performance: 4.6/5
- Accuracy: 4.5/5
- Features: 4.8/5
- Cost-Efficiency: 5.0/5
- Ease of Use: 4.0/5
- Customization: 4.7/5
- Data Privacy: 4.9/5
- Support: 4.2/5
- Integration: 4.4/5
- Overall Score: 4.6/5
VibeVoice integration with other tools
- Hugging Face: Model weights and inference examples hosted for easy download and community use
- GitHub Repository: Codebase (when enabled) for local setup, extensions, and contributions
- Audio Production Tools: Export generated audio (WAV/MP3) for import into DAWs like Audacity, Adobe Audition, or Reaper (export snippet after this list)
- Research Frameworks: Compatible with PyTorch ecosystems for fine-tuning or integration in TTS pipelines
- Local Deployment: Runs on personal GPUs; no cloud required for core synthesis
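Getting generated audio into a DAW only requires writing the waveform to disk; a minimal sketch, assuming the model returns a 1-D NumPy float array (the 24 kHz sample rate is an assumption; use whatever rate the model actually outputs):

    # Write a generated waveform to WAV for import into Audacity, Audition, Reaper, etc.
    # The silent array stands in for real model output; 24 kHz is an assumed rate.
    import numpy as np
    import soundfile as sf

    sample_rate = 24_000
    audio = np.zeros(sample_rate, dtype=np.float32)  # 1 second of silence as a stand-in
    sf.write("episode.wav", audio, sample_rate)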
Best prompts optimised for VibeVoice
- Generate a heated spontaneous argument between two friends about a broken promise, with rising emotion, interruptions, and natural turn-taking: Speaker1: I can't believe you did it again. Speaker2: Wait, let me explain...
- Create a podcast episode discussing the latest AI advancements with two hosts and one guest expert, including background music fades, enthusiastic tones, and laughter: Host1 welcomes Guest, discusses GPT-5 launch...
- Synthesize a singer performing 'See You Again' with emotional delivery, slight vocal cracks for realism, and soft instrumental background: [lyrics here]
- Produce a cross-lingual conversation: Speaker in Mandarin expresses frustration, then switches to English with preserved emotional tone: Ni wei shen me zhe me zuo? Why did you do this?
- Generate a 10-minute tech podcast segment on climate change impacts with three speakers debating solutions, natural pauses, agreements, and background ambient music