What is VibeVoice?
VibeVoice is Microsoft’s open-source text-to-speech (TTS) framework for generating expressive, long-form, multi-speaker conversational audio such as podcasts. It can convey emotion, sing, and produce audio up to 90 minutes long.
When was VibeVoice released?
The TTS model was open-sourced in August 2025, and an ASR variant was added in January 2026. The repository was temporarily disabled due to misuse concerns.
Is VibeVoice free to use?
Yes, it is a fully open-source research framework with weights on Hugging Face. Downloading it and running it locally costs nothing, subject to the project's responsible-use guidelines.
What makes VibeVoice special?
It supports 90-minute multi-speaker audio, spontaneous emotion, singing with music, and cross-lingual expression, and it processes long sequences efficiently by using tokenizers that run at 7.5 Hz.
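A rough calculation shows why the 7.5 Hz rate matters for long audio. The sketch below assumes one frame per tokenizer step and ignores text tokens and any other overhead. The 50 Hz comparison rate is a hypothetical figure chosen only for contrast, not a number taken from VibeVoice.

```python
def frames_for_audio(minutes: float, frame_rate_hz: float) -> int:
    """Return the number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


VIBEVOICE_RATE_HZ = 7.5     # tokenizer rate stated in the answer above
COMPARISON_RATE_HZ = 50.0   # hypothetical higher rate, for contrast only

low = frames_for_audio(90, VIBEVOICE_RATE_HZ)
high = frames_for_audio(90, COMPARISON_RATE_HZ)

print(low)   # 40500
print(high)  # 270000
```

Under these assumptions, a full 90-minute session comes to about 40,500 frames, which is far easier to fit in a model's context than the hundreds of thousands of frames a higher-rate tokenizer would need.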
Does VibeVoice support voice cloning?
It can imitate a voice from a short sample for expressive synthesis, but the project explicitly prohibits unauthorized cloning, satire, deepfakes, and real-time voice conversion.
What languages does VibeVoice support?
English and Mandarin are the main languages demonstrated, with cross-lingual capabilities that preserve emotional expression across them.
How many speakers can VibeVoice handle?
It handles up to four distinct speakers in long-form conversations, with natural turn-taking and consistent voices.
Where can I download VibeVoice?
Model weights are on Hugging Face (microsoft/VibeVoice-1.5B). Check the GitHub page for the code, though availability may be limited because of the temporary disablement.
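One way to fetch the weights is with the `huggingface_hub` Python package. This is a minimal sketch, assuming the package is installed (`pip install huggingface_hub`) and the repository is still publicly available. The helper name `download_weights` and the default folder name are hypothetical, chosen only for illustration.

```python
REPO_ID = "microsoft/VibeVoice-1.5B"  # repository name given in the answer above


def download_weights(local_dir: str = "vibevoice-1.5b") -> str:
    """Download the model repository and return the local folder path.

    `local_dir` is an illustrative default, not a required location.
    """
    # Imported here so this module still loads when huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_weights()` starts the download and returns the folder containing the files. Running the model still requires the inference code from the GitHub repository.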




