What is Qwen3-ASR?
Qwen3-ASR is an open-source automatic speech recognition model family from Alibaba’s Qwen team, with the 1.7B version offering state-of-the-art multilingual transcription, language identification for 52 languages/dialects, streaming support, and strong performance on singing and noisy audio.
When was Qwen3-ASR released?
Qwen3-ASR was officially open-sourced and announced on January 28, 2026, with the technical report published on arXiv around the same time.
How many languages does Qwen3-ASR support?
It supports language identification and transcription for 52 languages and dialects, including 30 global languages, 22 Chinese dialects, and various English accents.
Is Qwen3-ASR free to use?
Yes, the models are completely free and open-source under Apache 2.0 license, with full weights and inference code available on Hugging Face.
Does Qwen3-ASR support real-time streaming transcription?
Yes, it provides unified streaming and offline inference with very low latency (as low as 92ms TTFT on the 0.6B version) using the vLLM backend.
Can Qwen3-ASR transcribe songs or singing?
Yes, it has strong capabilities for singing voice and song transcription even with background music, achieving competitive WER on specialized benchmarks.
How accurate is Qwen3-ASR compared to other models?
The 1.7B version achieves state-of-the-art results among open-source ASR models and is competitive with top proprietary APIs like GPT-4o or Gemini on internal and public benchmarks.
What is included with Qwen3-ASR for timestamps?
It pairs with Qwen3-ForcedAligner-0.6B, a non-autoregressive model that provides highly accurate word/character-level timestamps for 11 languages, outperforming baselines like WhisperX.




