
Alibaba’s Qwen team has released Qwen3 TTS, a family of advanced text to speech models that push the boundaries of open source AI audio generation.
Designed for high fidelity, low latency, and versatile control, these models enable rapid voice cloning, free form voice design, and natural speech output across multiple languages.
With parameter sizes ranging from 0.6 billion to 1.7 billion, Qwen3 TTS caters to both efficiency focused applications and high performance needs, making it accessible for developers building real time apps, virtual assistants, and content creation tools.
This release addresses key challenges in TTS technology, such as maintaining emotional expressiveness, handling dialects, and minimizing delays in streaming scenarios.
> Qwen3-TTS is officially live. We’ve open-sourced the full family—VoiceDesign, CustomVoice, and Base—bringing high quality to the open community.
> – 5 models (0.6B & 1.8B)
> – Free-form voice design & cloning
> – Support for 10 languages
> – SOTA 12Hz tokenizer for high compression
>
> — Qwen (@Alibaba_Qwen) January 22, 2026
By open sourcing the full suite, Alibaba aims to democratize state of the art voice AI, allowing researchers and creators to fine tune and innovate without proprietary barriers.
Core Features and Innovations
Qwen3 TTS stands out with several cutting edge capabilities:
- Ultra Low Latency Streaming: Achieves end to end synthesis in as little as 97 milliseconds, emitting the first audio packet after just one character of input.
- Free Form Voice Design: Generate unique timbres using natural language descriptions, specifying attributes like gender, age, emotion, accent, and pitch.
- Rapid Voice Cloning: Clone any voice from a mere 3 second audio sample, supporting cross lingual applications.
- Intelligent Control: Adjust prosody, tone, and style via instructions, with robustness to noisy or complex text inputs.
- High Compression Tokenizer: The Qwen3 TTS Tokenizer 12Hz enables efficient acoustic representation, preserving paralinguistic details for realistic output.
These features make Qwen3 TTS ideal for interactive systems, audiobooks, gaming, and accessibility tools.
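The streaming behavior described above can be illustrated as a generator pattern. This is a conceptual sketch only; `stream_tts`, `chars_per_packet`, and the packet format are hypothetical and do not reflect the real Qwen3 TTS API:

```python
def stream_tts(text, chars_per_packet=1):
    """Illustrative incremental-synthesis pattern: emit an audio packet as
    soon as each input chunk is processed, rather than waiting for the full
    sentence. All names here are hypothetical, not the Qwen3 TTS API."""
    for i in range(0, len(text), chars_per_packet):
        chunk = text[i:i + chars_per_packet]
        # A real engine would synthesize waveform samples for this chunk.
        yield ("audio_packet", chunk)

# The first packet is available after processing only the first character,
# which is the property that enables sub-100 ms first-packet latency.
first = next(stream_tts("Hello"))
print(first)
```

The point of the pattern is that playback can begin while the rest of the sentence is still being synthesized, which is what makes the 97 millisecond first-packet figure possible in an interactive system.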
Models and Architecture
The family includes five models across two sizes:
| Model Name | Size | Focus | Key Supports |
|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Voice creation from descriptions | Instruction control, streaming |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | Style control with preset timbres | 9 premium voices, multilingual |
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | General cloning and generation | Fine tuning, 3 second cloning |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 0.6B | Efficient style control | Preset voices, streaming |
| Qwen3-TTS-12Hz-0.6B-Base | 0.6B | Lightweight cloning | Fine tuning, rapid deployment |
At its core, Qwen3 TTS employs a discrete multi codebook language model architecture for end to end speech modeling, avoiding traditional bottlenecks.
The dual track hybrid streaming mechanism combines efficiency with quality, while the non diffusion transformer design ensures fast reconstruction.
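To get a rough sense of what a 12 Hz multi codebook tokenizer implies for sequence length, here is a back-of-the-envelope calculation. The 12 Hz frame rate comes from the article; the codebook count is an assumption for illustration only:

```python
FRAME_RATE_HZ = 12   # stated: the Qwen3 TTS Tokenizer runs at 12 Hz
NUM_CODEBOOKS = 4    # assumed for illustration; not stated in the article

def token_count(seconds, frame_rate=FRAME_RATE_HZ, codebooks=NUM_CODEBOOKS):
    """Discrete tokens the language model must emit for `seconds` of audio."""
    return seconds * frame_rate * codebooks

# 10 s of speech -> 120 frames; at 4 codebooks, 480 tokens. A 50 Hz
# tokenizer with the same codebook count would need 2000 tokens, so the
# lower frame rate directly shortens the autoregressive sequence.
print(token_count(10))                  # 480
print(token_count(10, frame_rate=50))   # 2000
```

Shorter token sequences mean fewer autoregressive steps per second of audio, which is one reason a high compression tokenizer helps both latency and throughput.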
Supported Languages and Dialects
Qwen3 TTS excels in multilingual scenarios, covering 10 major languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
It also handles various Chinese dialects such as Beijing and Sichuan, enabling a single speaker's voice to generalize across languages with an average word error rate of 2.34 percent.
Performance Benchmarks
Evaluations highlight Qwen3 TTS’s superiority:
| Category | Metric | Qwen3-TTS Score | Comparison Models |
|---|---|---|---|
| Voice Clone (Multilingual Avg, 10 lang, content) | WER ↓ | 1.835 | 1.906 (Qwen3-Omni-30B-A3B), 2.442 (MiniMax-Speech) |
| Voice Clone (Multilingual Avg, 10 lang, similarity) | Similarity ↑ | 0.789 | 0.753 (Qwen3-Omni-30B-A3B), 0.748 (MiniMax-Speech) |
| Cross Lingual Avg (12 lang) | WER ↓ | 4.418 | 4.623 (Qwen3-Omni-30B-A3B), 5.548 (CosyVoice3) |
| Voice Design (InstructTTS Eval APS/PSD) | Score ↑ | 84.1/81.8 | 82.3/81.6 (MiniMax-Voice-Design) |
| Custom Voice (Multilingual Avg, 10 lang) | WER ↓ | 2.34 | 2.47 (Qwen3-Omni-30B-A3B), 1.82 (MiniMax-Speech-02-HD) |
| Custom Voice (InstructTTS Eval) | Score ↑ | 75.4 | – |
| Long Speech Eval (Zh/En) | WER ↓ | 2.36/2.81 | 4.84/4.7 (VoxCPM) |
Lower WER indicates better accuracy; higher similarity and scores mean superior performance.
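For reference, word error rate is the word-level edit distance between a reference transcript and a transcript of the synthesized audio, divided by the reference length. A minimal implementation of the standard metric:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance between the
    reference transcript and the hypothesis, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1] / len(ref)

# One dropped word out of six: WER = 1/6 ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 1.835 in the table above therefore means fewer than two word errors per hundred reference words, as measured by an ASR model transcribing the generated speech.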
How to Access and Use Qwen3 TTS
Developers can download models from GitHub or Hugging Face repositories. Integration is straightforward via provided code and APIs, with demos available for quick testing. Fine tuning supports customization, and the Qwen API offers cloud based access for scalable deployment.
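As a sketch, the checkpoints can be fetched with the `huggingface_hub` client. The repo ids below assume the model names from the table map directly to repositories under the `Qwen` organization on Hugging Face, which should be verified against the official pages before use:

```python
# The five released models, taken from the table above.
MODEL_NAMES = [
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "Qwen3-TTS-12Hz-1.7B-Base",
    "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    "Qwen3-TTS-12Hz-0.6B-Base",
]

def repo_id(name):
    # Assumed mapping from the table's model names to Hugging Face repo ids.
    return f"Qwen/{name}"

def download(name):
    # Requires `pip install huggingface_hub`; imported lazily so the pure
    # helpers above work without the dependency installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id(name))
```

For example, `download("Qwen3-TTS-12Hz-0.6B-Base")` would fetch the lightweight base checkpoint, which the article identifies as the one aimed at fine tuning and rapid deployment.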
Broader Implications for AI Audio
Qwen3 TTS sets a new standard for open source TTS, outperforming rivals in stability, expressiveness, and efficiency. It empowers creators in education, entertainment, and enterprise, potentially transforming podcasts, virtual reality, and assistive tech.
As AI voice tools evolve, this release fosters innovation while raising ethical considerations around voice cloning misuse and deepfake prevention, alongside its clear benefits for accessibility.



