Alibaba Unveils Qwen3 TTS: Open Source Breakthrough in Multilingual Voice Synthesis

Alibaba’s Qwen team has released Qwen3 TTS, a family of advanced text-to-speech models that push the boundaries of open-source AI audio generation.

Designed for high fidelity, low latency, and versatile control, these models enable rapid voice cloning, custom voice design, and natural speech output across multiple languages.

With parameter sizes ranging from 0.6 billion to 1.7 billion, Qwen3 TTS caters to both efficiency-focused applications and high-performance needs, making it accessible for developers building real-time apps, virtual assistants, and content creation tools.

This release addresses key challenges in TTS technology, such as maintaining emotional expressiveness, handling dialects, and minimizing delays in streaming scenarios.

By open-sourcing the full suite, Alibaba aims to democratize state-of-the-art voice AI, allowing researchers and creators to fine-tune and innovate without proprietary barriers.

Core Features and Innovations

Qwen3 TTS stands out with several cutting-edge capabilities:

  • Ultra-Low-Latency Streaming: Achieves end-to-end synthesis in as little as 97 milliseconds, outputting the first audio packet after just one character of input.
  • Free-Form Voice Design: Generates unique timbres from natural language descriptions, specifying attributes like gender, age, emotion, accent, and pitch.
  • Rapid Voice Cloning: Clones any voice from a mere 3-second audio sample, supporting cross-lingual applications.
  • Intelligent Control: Adjusts prosody, tone, and style via instructions, with robustness to noisy or complex text inputs.
  • High-Compression Tokenizer: The Qwen3 TTS Tokenizer 12Hz enables efficient acoustic representation, preserving paralinguistic details for realistic output.

These features make Qwen3 TTS ideal for interactive systems, audiobooks, gaming, and accessibility tools.
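
As a quick sanity check on the latency and tokenizer figures above, the short sketch below works out how much audio one token frame covers if the 12Hz in the tokenizer's name is read as 12 token frames per second (an assumption based on the name alone), and sets that against the quoted 97 millisecond first-packet latency.

```python
# Back-of-the-envelope latency math; assumes "12Hz" means 12 acoustic token
# frames per second, which is inferred from the tokenizer's name only.
FRAME_RATE_HZ = 12            # assumed token frames per second
FIRST_PACKET_LATENCY_MS = 97  # figure quoted for Qwen3 TTS streaming

audio_per_frame_ms = 1000 / FRAME_RATE_HZ
print(f"Audio covered by one token frame: {audio_per_frame_ms:.1f} ms")
print(f"Quoted first-packet latency:      {FIRST_PACKET_LATENCY_MS} ms")
# Under this assumption, the first packet arrives in roughly the time span a
# single frame represents, which is what makes character-level streaming viable.
```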

Models and Architecture

The family includes five models across two sizes:

| Model Name | Size | Focus | Key Supports |
| --- | --- | --- | --- |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Voice creation from descriptions | Instruction control, streaming |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | Style control with preset timbres | 9 premium voices, multilingual |
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | General cloning and generation | Fine-tuning, 3-second cloning |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 0.6B | Efficient style control | Preset voices, streaming |
| Qwen3-TTS-12Hz-0.6B-Base | 0.6B | Lightweight cloning | Fine-tuning, rapid deployment |

At its core, Qwen3 TTS employs a discrete multi-codebook language-model architecture for end-to-end speech modeling, sidestepping the bottlenecks of traditional cascaded TTS pipelines.

A dual-track hybrid streaming mechanism combines efficiency with quality, while the non-diffusion transformer design ensures fast waveform reconstruction.
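
The article does not detail the internals, but the general idea behind a discrete multi-codebook head can be sketched in a few lines: for each acoustic frame, the backbone's hidden state is projected into several codebooks, and the resulting discrete codes are what the audio decoder turns back into a waveform. The sizes below (number of codebooks, codebook size, hidden width) are illustrative assumptions, not Qwen3 TTS's actual configuration.

```python
# Toy multi-codebook acoustic head; conceptual only, not the Qwen3 TTS code.
import torch
import torch.nn as nn

NUM_CODEBOOKS = 4     # assumption for illustration
CODEBOOK_SIZE = 1024  # assumption for illustration
HIDDEN = 512          # assumption for illustration

class MultiCodebookHead(nn.Module):
    """Predicts one discrete token per codebook for every acoustic frame."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(HIDDEN, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)]
        )

    def forward(self, frame_hidden: torch.Tensor) -> torch.Tensor:
        # frame_hidden: (batch, frames, HIDDEN) from the language-model backbone
        logits = torch.stack([head(frame_hidden) for head in self.heads], dim=2)
        # logits: (batch, frames, NUM_CODEBOOKS, CODEBOOK_SIZE)
        return logits.argmax(dim=-1)  # discrete codes for the audio decoder

codes = MultiCodebookHead()(torch.randn(1, 24, HIDDEN))  # ~2 s of audio at 12 Hz
print(codes.shape)  # torch.Size([1, 24, 4])
```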

Supported Languages and Dialects

Qwen3 TTS excels in multilingual scenarios, covering 10 major languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

It also handles various Chinese dialects, such as Beijing and Sichuan, enabling single-speaker generalization across languages with an average word error rate of 2.34 percent.

Performance Benchmarks

Evaluations show Qwen3 TTS leading or closely matching strong competitors across key benchmarks:

| Category | Metric | Qwen3-TTS Score | Comparison Models |
| --- | --- | --- | --- |
| Voice Clone (Multilingual Avg, 10 lang, content) | WER ↓ | 1.835 | 1.906 (Qwen3-Omni-30B-A3B), 2.442 (MiniMax-Speech) |
| Voice Clone (Multilingual Avg, 10 lang, similarity) | Similarity ↑ | 0.789 | 0.753 (Qwen3-Omni-30B-A3B), 0.748 (MiniMax-Speech) |
| Cross Lingual Avg (12 lang) | Similarity ↑ | 4.418 | 4.623 (Qwen3-Omni-30B-A3B), 5.548 (CosyVoice3) |
| Voice Design (InstructTTS Eval APS/PSD) | Score ↑ | 84.1/81.8 | 82.3/81.6 (MiniMax-Voice-Design) |
| Custom Voice (Multilingual Avg, 10 lang) | WER ↓ | 2.34 | 2.47 (Qwen3-Omni-30B-A3B), 1.82 (MiniMax-Speech-02-HD) |
| Custom Voice (InstructTTS Eval) | Score ↑ | 75.4 | |
| Long Speech Eval (Zh/En) | WER ↓ | 2.36/2.81 | 4.84/4.7 (Voxcpm) |

Lower WER indicates better accuracy; higher similarity and scores mean superior performance.
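
For readers unfamiliar with the metric, word error rate is the word-level edit distance between the recognized transcript of the synthesized audio and the intended text, divided by the reference length. A minimal, self-contained implementation of the standard metric (not code from the Qwen release) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words: substitutions + insertions + deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25, i.e. 25 percent
```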

How to Access and Use Qwen3 TTS

Developers can download the models from the GitHub or Hugging Face repositories. Integration is straightforward via the provided code and APIs, with demos available for quick testing. Fine-tuning supports customization, and the Qwen API offers cloud-based access for scalable deployment.
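
One low-friction way to fetch the weights locally is the huggingface_hub client, sketched below. The repository id is inferred from the model names in the table above and should be checked against the official Hugging Face listing; the synthesis and fine-tuning entry points themselves live in the official release code rather than this snippet.

```python
# Sketch of downloading the weights; the repo id is an assumption inferred from
# the model names above and may differ from the official Hugging Face listing.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Qwen/Qwen3-TTS-12Hz-1.7B-Base")  # assumed id
print("Model files downloaded to:", local_dir)
```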

Broader Implications for AI Audio

Qwen3 TTS sets a new standard for open-source TTS, outperforming rivals in stability, expressiveness, and efficiency. It empowers creators in education, entertainment, and enterprise, potentially transforming podcasts, virtual reality, and assistive tech.

As AI voice tools evolve, this release fosters innovation while raising considerations around ethical use, from deepfake prevention to accessibility.