Qwen3-ASR

State-of-the-Art Open-Source Multilingual Speech Recognition with Language ID and Streaming Support

About This AI

Qwen3-ASR is an advanced open-source automatic speech recognition (ASR) model family from Alibaba’s Qwen team, featuring the flagship Qwen3-ASR-1.7B model and a lighter Qwen3-ASR-0.6B variant.

Built on the audio understanding capabilities of the Qwen3-Omni foundation model, it provides all-in-one language identification and high-accuracy transcription across 52 languages and dialects.

That coverage spans 30 global languages, 22 Chinese dialects, and a range of regional English accents.

The model excels in challenging real-world scenarios such as noisy environments, singing voice and song transcription with background music, long-form audio, and complex acoustic conditions.

It supports unified streaming and offline inference with a single model, enabling low-latency real-time transcription and efficient batch processing.

Key highlights include very low time-to-first-token latency (as low as 92 ms on the 0.6B variant) and exceptional throughput (transcribing 2,000 seconds of speech in 1 second at high concurrency).

Paired with Qwen3-ForcedAligner-0.6B, it provides precise timestamp prediction for forced alignment in 11 languages at word or character level.

Released under the Apache 2.0 license with full weights and a comprehensive inference toolkit (supporting transformers as well as vLLM for fast batch, async, and streaming inference), it achieves state-of-the-art results among open-source ASR models and competes closely with top proprietary APIs on both public and internal benchmarks.

Ideal for developers, researchers, and applications needing robust, multilingual, streaming-capable speech-to-text with minimal dependencies.
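
To make the all-in-one claim concrete, here is a minimal quickstart sketch using the Python calls shown in the How To Use section below. It assumes a CUDA GPU, and the return type of transcribe() is an assumption, so treat the final line as illustrative.

    # Minimal quickstart, mirroring the calls in the How To Use section below.
    # Assumes a CUDA GPU; the return type of transcribe() is an assumption.
    import torch
    from qwen_asr import Qwen3ASRModel

    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-1.7B", dtype=torch.bfloat16, device_map="cuda:0"
    )
    # Leaving language unset/None triggers automatic language identification.
    print(model.transcribe("path/to/audio.wav"))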

Key Features

  1. All-in-one language identification and ASR: Automatically detects and transcribes speech in 52 languages and dialects including 30 global languages and 22 Chinese dialects
  2. Multilingual and dialect support: Covers major languages like English (various accents), Chinese Mandarin/dialects, Arabic, French, German, Spanish, Japanese, Korean, and more
  3. Robust real-world performance: Handles noisy environments, accented speech, singing voice, songs with background music, elderly and child speech, and complex text patterns
  4. Streaming and offline unified inference: Single model supports real-time streaming transcription and offline batch processing
  5. Long audio transcription: Efficiently processes extended audio files without segmentation issues
  6. Low-latency TTFT: Achieves as low as 92ms time-to-first-token on the 0.6B variant for fast response
  7. High throughput: Transcribes 2,000 seconds of speech in 1 second at a concurrency of 128 in async mode
  8. Timestamp prediction via ForcedAligner: Non-autoregressive LLM-based aligner provides accurate word/character-level timestamps in 11 languages
  9. Comprehensive inference toolkit: Supports vLLM batch/async/streaming, transformers backend, and easy integration
  10. Singing and music-aware transcription: Strong performance on song lyrics and vocal content
  11. Open-source under Apache 2.0: Full weights, code, and inference framework available for free use and modification

Price Plans

  1. Free ($0): Full open-source access to model weights, inference code, and toolkit under Apache 2.0 license with no usage fees
  2. Cloud API (Paid via DashScope): Optional Alibaba Cloud API access for hosted real-time/file transcription with tiered pricing
  3. Enterprise (Custom): Potential premium support or scaled deployment through Alibaba Cloud services

Pros

  1. Leading open-source ASR accuracy: Achieves SOTA among open models and rivals top proprietary APIs on diverse benchmarks
  2. Exceptional multilingual coverage: 52 languages/dialects with high LID accuracy (97.9% average) and low WER across scenarios
  3. Fast and efficient: Extremely low latency and massive throughput make it suitable for real-time and high-volume applications
  4. Robust to real-world challenges: Excellent handling of noise, accents, singing, long audio, and complex acoustics
  5. Streaming support built-in: True streaming inference without separate models for online use
  6. Accurate timestamp alignment: The ForcedAligner outperforms baselines in both accuracy and efficiency across its 11 supported languages
  7. Fully open and accessible: Apache 2.0 license with easy pip install and demos (Gradio, Flask, Hugging Face Spaces)
  8. Strong community foundation: Backed by Qwen series popularity and large-scale training data

Cons

  1. Requires a GPU for best performance: Optimal speed needs CUDA and the vLLM backend; CPU inference is much slower, especially for the 1.7B model
  2. Limited to supported languages: While broad, does not cover every low-resource language or dialect
  3. Setup complexity for advanced use: vLLM backend for streaming/high-throughput requires additional installation
  4. No native mobile/edge optimization mentioned: The toolkit is primarily server-focused, so on-device deployment may require quantization
  5. Recent release: Community integrations and fine-tuning examples still emerging
  6. Potential VRAM requirements: 1.7B model needs sufficient GPU memory for batch or long audio
  7. Alignment limited to 11 languages: ForcedAligner supports fewer languages than main ASR

Use Cases

  1. Real-time captioning and transcription: Live subtitling for meetings, streams, calls, or broadcasts
  2. Multilingual audio processing: Transcribe podcasts, videos, or interviews in diverse languages/dialects
  3. Song and music transcription: Extract lyrics from songs including those with background music
  4. Accessibility tools: Speech-to-text for hearing-impaired users or voice note conversion
  5. Call center and voice analytics: High-throughput transcription for customer service recordings
  6. Research and data labeling: Generate accurate timestamps and transcripts for speech datasets
  7. Content localization: Transcribe foreign-language media for subtitling or dubbing preparation

Target Audience

  1. Developers and AI engineers: Building speech applications or integrating ASR into products
  2. Researchers in speech AI: Studying multilingual ASR, singing recognition, or alignment techniques
  3. Content creators and media companies: Transcribing videos, podcasts, or music for captions/subtitles
  4. Enterprises with multilingual needs: Call centers, global customer support, or compliance recording
  5. Open-source enthusiasts: Experimenting with state-of-the-art free ASR models
  6. Accessibility advocates: Developing tools for real-time speech assistance

How To Use

  1. Install the package: Run pip install -U qwen-asr, or pip install -U "qwen-asr[vllm]" for the faster vLLM backend
  2. Load the model: Use from qwen_asr import Qwen3ASRModel; model = Qwen3ASRModel.from_pretrained('Qwen/Qwen3-ASR-1.7B', dtype=torch.bfloat16, device_map='cuda:0')
  3. Transcribe audio: Call model.transcribe(audio), where audio is a local path, URL, base64 string, or NumPy array plus sample rate
  4. Enable streaming: Use the vLLM backend with streaming=True for real-time partial results
  5. Force a language or rely on LID: Set the language parameter, or leave it as None for automatic detection
  6. Use the ForcedAligner: Load Qwen3-ForcedAligner-0.6B and align text-audio pairs to get timestamps
  7. Try the demos: Access the Hugging Face Spaces or ModelScope Gradio demo for no-code testing (a combined sketch of steps 2, 3, and 5 follows this list)
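
As referenced in step 7, the sketch below combines steps 2, 3, and 5: loading the model once, then transcribing with automatic language identification and with a forced language. The calls mirror the snippets on this page, but the accepted language codes and the return type are assumptions; check the model card for the exact identifiers.

    # Steps 2, 3, and 5 combined: one model, automatic vs. forced language.
    import torch
    from qwen_asr import Qwen3ASRModel

    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-1.7B",  # or "Qwen/Qwen3-ASR-0.6B" for lower latency
        dtype=torch.bfloat16,
        device_map="cuda:0",
    )

    # Step 5a: language=None selects automatic language identification (LID).
    auto_text = model.transcribe("interview.wav", language=None)

    # Step 5b: force a language when it is known in advance.
    # NOTE: "en" as the identifier is an assumption; check the model card.
    english_text = model.transcribe("interview.wav", language="en")

    print(auto_text)
    print(english_text)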

How we rated Qwen3-ASR

  • Performance: 4.7/5
  • Accuracy: 4.8/5
  • Features: 4.9/5
  • Cost-Efficiency: 5.0/5
  • Ease of Use: 4.4/5
  • Customization: 4.6/5
  • Data Privacy: 4.9/5
  • Support: 4.5/5
  • Integration: 4.7/5
  • Overall Score: 4.8/5

Qwen3-ASR integration with other tools

  1. Hugging Face Transformers: Direct loading and inference via the transformers library for easy integration in Python scripts and pipelines
  2. vLLM Backend: High-performance serving for batch, async, and streaming transcription in production environments
  3. Gradio and Flask Demos: Ready-to-use web interfaces for testing and prototyping on Hugging Face Spaces (an illustrative Flask wrapper is sketched after this list)
  4. Alibaba Cloud DashScope: Optional API integration for hosted inference without local hardware
  5. Custom ASR Pipelines: Can be combined with frameworks like whisper.cpp, Faster-Whisper, or NeMo in hybrid or on-device pipelines
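
To illustrate the Flask integration mentioned above, here is a hand-rolled HTTP endpoint wrapping model.transcribe. This is our own minimal sketch under the assumptions that transcribe() accepts a local path and that its result is string-convertible; it is not the official demo shipped with the toolkit.

    # Illustrative Flask wrapper around model.transcribe; a minimal sketch,
    # not the official demo shipped with the toolkit.
    import torch
    from flask import Flask, jsonify, request
    from qwen_asr import Qwen3ASRModel

    app = Flask(__name__)
    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-1.7B", dtype=torch.bfloat16, device_map="cuda:0"
    )

    @app.route("/transcribe", methods=["POST"])
    def transcribe():
        # Save the uploaded file to a temp path, since a local path is one of
        # the input formats model.transcribe() accepts.
        upload = request.files["audio"]
        path = "/tmp/upload.wav"
        upload.save(path)
        # str() is an assumption about the return type of transcribe().
        return jsonify({"text": str(model.transcribe(path))})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)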

Best prompts optimised for Qwen3-ASR

  1. Since Qwen3-ASR is an automatic speech recognition (ASR) model, it processes audio inputs directly rather than text prompts. There are no traditional 'best prompts' like in text-to-image or text-to-video tools.
  2. Instead, provide audio files (local path, URL, base64, or numpy array + sample rate) to the transcribe method. For best results, use clear audio with good signal-to-noise ratio and enable automatic language detection by setting language=None.
  3. Example usage (no prompt needed): model.transcribe('path/to/audio.wav'), or model.transcribe('https://example.com/audio.mp3', streaming=True) for real-time partial results with the vLLM backend; both patterns are expanded in the sketch below
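
Expanding on items 2 and 3: the page states that a NumPy array plus sample rate is a valid input and that streaming=True yields real-time partial results, but it does not show the exact calling conventions. The (array, sample_rate) tuple and the for-loop over partial results below are therefore assumptions.

    # NumPy input and streaming; the (samples, sample_rate) tuple convention
    # and iterating over partial results are both assumptions.
    import numpy as np
    import torch
    from qwen_asr import Qwen3ASRModel

    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-0.6B", dtype=torch.bfloat16, device_map="cuda:0"
    )

    # One second of silence at 16 kHz stands in for real audio samples.
    sample_rate = 16000
    samples = np.zeros(sample_rate, dtype=np.float32)
    print(model.transcribe((samples, sample_rate)))

    # Streaming partial results from a URL (requires the vLLM backend).
    for partial in model.transcribe("https://example.com/audio.mp3", streaming=True):
        print(partial)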

Qwen3-ASR represents a major leap in open-source speech recognition, delivering exceptional multilingual accuracy, robust real-world handling, and blazing-fast streaming performance that rivals or beats proprietary leaders. The 1.7B model’s SOTA results on diverse benchmarks combined with the 0.6B’s efficiency make the family versatile for everything from low-latency apps to high-volume batch jobs. With full Apache 2.0 openness, a comprehensive toolkit, and unique strengths in singing/dialect support, it’s an outstanding choice for developers seeking powerful, free ASR without compromises on quality or speed. Minor setup hurdles aside, this is currently one of the best open ASR solutions available.

FAQs

  • What is Qwen3-ASR?

    Qwen3-ASR is an open-source automatic speech recognition model family from Alibaba’s Qwen team, with the 1.7B version offering state-of-the-art multilingual transcription, language identification for 52 languages/dialects, streaming support, and strong performance on singing and noisy audio.

  • When was Qwen3-ASR released?

    Qwen3-ASR was officially open-sourced and announced on January 28, 2026, with the technical report published on arXiv around the same time.

  • How many languages does Qwen3-ASR support?

    It supports language identification and transcription for 52 languages and dialects, including 30 global languages, 22 Chinese dialects, and various English accents.

  • Is Qwen3-ASR free to use?

    Yes, the models are completely free and open-source under Apache 2.0 license, with full weights and inference code available on Hugging Face.

  • Does Qwen3-ASR support real-time streaming transcription?

    Yes, it provides unified streaming and offline inference with very low latency (as low as 92ms TTFT on the 0.6B version) using the vLLM backend.

  • Can Qwen3-ASR transcribe songs or singing?

    Yes, it has strong capabilities for singing voice and song transcription even with background music, achieving competitive WER on specialized benchmarks.

  • How accurate is Qwen3-ASR compared to other models?

    The 1.7B version achieves state-of-the-art results among open-source ASR models and is competitive with top proprietary APIs like GPT-4o or Gemini on internal and public benchmarks.

  • What is included with Qwen3-ASR for timestamps?

    It pairs with Qwen3-ForcedAligner-0.6B, a non-autoregressive model that provides highly accurate word/character-level timestamps for 11 languages, outperforming baselines like WhisperX.
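
    The page does not document the aligner's Python API, so the class name and align() method in the sketch below are hypothetical placeholders meant only to convey the text-audio pairing described above.

      # HYPOTHETICAL sketch: Qwen3ForcedAligner and align() are placeholder
      # names; the page does not document the aligner's real API.
      import torch
      from qwen_asr import Qwen3ForcedAligner  # hypothetical import

      aligner = Qwen3ForcedAligner.from_pretrained(
          "Qwen/Qwen3-ForcedAligner-0.6B", dtype=torch.bfloat16, device_map="cuda:0"
      )
      # Pair audio with its known transcript to get word/character timestamps.
      for span in aligner.align("speech.wav", text="hello world"):
          print(span)  # assumed: (token, start_seconds, end_seconds)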

Qwen3-ASR Alternatives

  1. Synthflow AI ($0/Month)
  2. Fireflies ($10/Month)
  3. Notta AI ($9/Month)
