Qwen3-ASR

State-of-the-Art Open-Source Multilingual Speech Recognition with Language ID and Streaming Support

About This AI

Qwen3-ASR is an advanced open-source automatic speech recognition (ASR) model family from Alibaba’s Qwen team, featuring the flagship Qwen3-ASR-1.7B model and a lighter Qwen3-ASR-0.6B variant.

Built on the audio understanding capabilities of the Qwen3-Omni foundation model, it provides all-in-one language identification and high-accuracy transcription across 52 languages and dialects.

That coverage spans 30 global languages, 22 Chinese dialects, and a range of regional English accents.

The model excels in challenging real-world scenarios such as noisy environments, singing voice and song transcription with background music, long-form audio, and complex acoustic conditions.

It supports unified streaming and offline inference with a single model, enabling low-latency real-time transcription and efficient batch processing.

Key highlights include very low time-to-first-token latency (as low as 92 ms on the 0.6B variant) and exceptional throughput (transcribing 2,000 seconds of speech in 1 second at high concurrency).

Paired with Qwen3-ForcedAligner-0.6B, it provides precise timestamp prediction for forced alignment in 11 languages at word or character level.

Released under the Apache 2.0 license with full weights and a comprehensive inference toolkit (supporting transformers as well as vLLM for fast batch, async, and streaming inference), it achieves state-of-the-art results among open-source ASR models and competes closely with top proprietary APIs on both public and internal benchmarks.

Ideal for developers, researchers, and applications needing robust, multilingual, streaming-capable speech-to-text with minimal dependencies.
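
To make the all-in-one claim concrete, here is a minimal quickstart sketch using the Python calls shown in the How To Use section below. It assumes a CUDA GPU, and the return type of transcribe() is an assumption, so treat the final line as illustrative.

    # Minimal quickstart, mirroring the calls in the How To Use section below.
    # Assumes a CUDA GPU; the return type of transcribe() is an assumption.
    import torch
    from qwen_asr import Qwen3ASRModel

    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-1.7B", dtype=torch.bfloat16, device_map="cuda:0"
    )
    # Leaving language unset/None triggers automatic language identification.
    print(model.transcribe("path/to/audio.wav"))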

Key Features

  1. All-in-one language identification and ASR: Automatically detects and transcribes speech in 52 languages and dialects including 30 global languages and 22 Chinese dialects
  2. Multilingual and dialect support: Covers major languages like English (various accents), Chinese Mandarin/dialects, Arabic, French, German, Spanish, Japanese, Korean, and more
  3. Robust real-world performance: Handles noisy environments, accented speech, singing voice, songs with background music, elderly and child speech, and complex text patterns
  4. Streaming and offline unified inference: Single model supports real-time streaming transcription and offline batch processing
  5. Long audio transcription: Efficiently processes extended audio files without segmentation issues
  6. Low-latency TTFT: Achieves as low as 92ms time-to-first-token on the 0.6B variant for fast response
  7. High throughput: Transcribes 2,000 seconds of speech in 1 second at a concurrency of 128 in async mode
  8. Timestamp prediction via ForcedAligner: Non-autoregressive LLM-based aligner provides accurate word/character-level timestamps in 11 languages
  9. Comprehensive inference toolkit: Supports vLLM batch/async/streaming, transformers backend, and easy integration
  10. Singing and music-aware transcription: Strong performance on song lyrics and vocal content
  11. Open-source under Apache 2.0: Full weights, code, and inference framework available for free use and modification

Price Plans

  1. Free ($0): Full open-source access to model weights, inference code, and toolkit under Apache 2.0 license with no usage fees
  2. Cloud API (Paid via DashScope): Optional Alibaba Cloud API access for hosted real-time/file transcription with tiered pricing
  3. Enterprise (Custom): Potential premium support or scaled deployment through Alibaba Cloud services

Pros

  1. Leading open-source ASR accuracy: Achieves SOTA among open models and rivals top proprietary APIs on diverse benchmarks
  2. Exceptional multilingual coverage: 52 languages/dialects with high LID accuracy (97.9% average) and low WER across scenarios
  3. Fast and efficient: Extremely low latency and massive throughput make it suitable for real-time and high-volume applications
  4. Robust to real-world challenges: Excellent handling of noise, accents, singing, long audio, and complex acoustics
  5. Streaming support built-in: True streaming inference without separate models for online use
  6. Accurate timestamp alignment: The ForcedAligner outperforms baselines in both accuracy and efficiency across its 11 supported languages
  7. Fully open and accessible: Apache 2.0 license with easy pip install and demos (Gradio, Flask, Hugging Face Spaces)
  8. Strong community foundation: Backed by Qwen series popularity and large-scale training data

Cons

  1. Requires a GPU for best performance: Optimal speed needs CUDA and the vLLM backend; CPU inference is much slower, especially for the 1.7B model
  2. Limited to supported languages: While broad, does not cover every low-resource language or dialect
  3. Setup complexity for advanced use: vLLM backend for streaming/high-throughput requires additional installation
  4. No native mobile/edge optimization mentioned: The toolkit is primarily server-focused, so on-device deployment may require quantization
  5. Recent release: Community integrations and fine-tuning examples still emerging
  6. Potential VRAM requirements: 1.7B model needs sufficient GPU memory for batch or long audio
  7. Alignment limited to 11 languages: ForcedAligner supports fewer languages than main ASR

Use Cases

  1. Real-time captioning and transcription: Live subtitling for meetings, streams, calls, or broadcasts
  2. Multilingual audio processing: Transcribe podcasts, videos, or interviews in diverse languages/dialects
  3. Song and music transcription: Extract lyrics from songs including those with background music
  4. Accessibility tools: Speech-to-text for hearing-impaired users or voice note conversion
  5. Call center and voice analytics: High-throughput transcription for customer service recordings
  6. Research and data labeling: Generate accurate timestamps and transcripts for speech datasets
  7. Content localization: Transcribe foreign-language media for subtitling or dubbing preparation

Target Audience

  1. Developers and AI engineers: Building speech applications or integrating ASR into products
  2. Researchers in speech AI: Studying multilingual ASR, singing recognition, or alignment techniques
  3. Content creators and media companies: Transcribing videos, podcasts, or music for captions/subtitles
  4. Enterprises with multilingual needs: Call centers, global customer support, or compliance recording
  5. Open-source enthusiasts: Experimenting with state-of-the-art free ASR models
  6. Accessibility advocates: Developing tools for real-time speech assistance

How To Use

  1. Install the package: Run pip install -U qwen-asr, or pip install -U "qwen-asr[vllm]" for the faster vLLM backend
  2. Load the model: Use from qwen_asr import Qwen3ASRModel; model = Qwen3ASRModel.from_pretrained('Qwen/Qwen3-ASR-1.7B', dtype=torch.bfloat16, device_map='cuda:0')
  3. Transcribe audio: Call model.transcribe(audio), where audio is a local path, URL, base64 string, or NumPy array plus sample rate
  4. Enable streaming: Use the vLLM backend with streaming=True for real-time partial results
  5. Force a language or rely on LID: Set the language parameter, or leave it as None for automatic detection
  6. Use the ForcedAligner: Load Qwen3-ForcedAligner-0.6B and align text-audio pairs to get timestamps
  7. Try the demos: Access the Hugging Face Spaces or ModelScope Gradio demo for no-code testing (a combined sketch of steps 2, 3, and 5 follows this list)
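
As referenced in step 7, the sketch below combines steps 2, 3, and 5: loading the model once, then transcribing with automatic language identification and with a forced language. The calls mirror the snippets on this page, but the accepted language codes and the return type are assumptions; check the model card for the exact identifiers.

    # Steps 2, 3, and 5 combined: one model, automatic vs. forced language.
    import torch
    from qwen_asr import Qwen3ASRModel

    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-1.7B",  # or "Qwen/Qwen3-ASR-0.6B" for lower latency
        dtype=torch.bfloat16,
        device_map="cuda:0",
    )

    # Step 5a: language=None selects automatic language identification (LID).
    auto_text = model.transcribe("interview.wav", language=None)

    # Step 5b: force a language when it is known in advance.
    # NOTE: "en" as the identifier is an assumption; check the model card.
    english_text = model.transcribe("interview.wav", language="en")

    print(auto_text)
    print(english_text)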

How we rated Qwen3-ASR

  • Performance: 4.7/5
  • Accuracy: 4.8/5
  • Features: 4.9/5
  • Cost-Efficiency: 5.0/5
  • Ease of Use: 4.4/5
  • Customization: 4.6/5
  • Data Privacy: 4.9/5
  • Support: 4.5/5
  • Integration: 4.7/5
  • Overall Score: 4.8/5

Qwen3-ASR integration with other tools

  1. Hugging Face Transformers: Direct loading and inference via the transformers library for easy integration in Python scripts and pipelines
  2. vLLM Backend: High-performance serving for batch, async, and streaming transcription in production environments
  3. Gradio and Flask Demos: Ready-to-use web interfaces for testing and prototyping on Hugging Face Spaces (an illustrative Flask wrapper is sketched after this list)
  4. Alibaba Cloud DashScope: Optional API integration for hosted inference without local hardware
  5. Custom ASR Pipelines: Can be combined with frameworks like whisper.cpp, Faster-Whisper, or NeMo in hybrid or on-device pipelines
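
To illustrate the Flask integration mentioned above, here is a hand-rolled HTTP endpoint wrapping model.transcribe. This is our own minimal sketch under the assumptions that transcribe() accepts a local path and that its result is string-convertible; it is not the official demo shipped with the toolkit.

    # Illustrative Flask wrapper around model.transcribe; a minimal sketch,
    # not the official demo shipped with the toolkit.
    import torch
    from flask import Flask, jsonify, request
    from qwen_asr import Qwen3ASRModel

    app = Flask(__name__)
    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-1.7B", dtype=torch.bfloat16, device_map="cuda:0"
    )

    @app.route("/transcribe", methods=["POST"])
    def transcribe():
        # Save the uploaded file to a temp path, since a local path is one of
        # the input formats model.transcribe() accepts.
        upload = request.files["audio"]
        path = "/tmp/upload.wav"
        upload.save(path)
        # str() is an assumption about the return type of transcribe().
        return jsonify({"text": str(model.transcribe(path))})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)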

Best prompts optimised for Qwen3-ASR

  1. Since Qwen3-ASR is an automatic speech recognition (ASR) model, it processes audio inputs directly rather than text prompts. There are no traditional 'best prompts' like in text-to-image or text-to-video tools.
  2. Instead, provide audio files (local path, URL, base64, or numpy array + sample rate) to the transcribe method. For best results, use clear audio with good signal-to-noise ratio and enable automatic language detection by setting language=None.
  3. Example usage (no prompt needed): model.transcribe('path/to/audio.wav'), or model.transcribe('https://example.com/audio.mp3', streaming=True) for real-time partial results with the vLLM backend; both patterns are expanded in the sketch below
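
Expanding on items 2 and 3: the page states that a NumPy array plus sample rate is a valid input and that streaming=True yields real-time partial results, but it does not show the exact calling conventions. The (array, sample_rate) tuple and the for-loop over partial results below are therefore assumptions.

    # NumPy input and streaming; the (samples, sample_rate) tuple convention
    # and iterating over partial results are both assumptions.
    import numpy as np
    import torch
    from qwen_asr import Qwen3ASRModel

    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-0.6B", dtype=torch.bfloat16, device_map="cuda:0"
    )

    # One second of silence at 16 kHz stands in for real audio samples.
    sample_rate = 16000
    samples = np.zeros(sample_rate, dtype=np.float32)
    print(model.transcribe((samples, sample_rate)))

    # Streaming partial results from a URL (requires the vLLM backend).
    for partial in model.transcribe("https://example.com/audio.mp3", streaming=True):
        print(partial)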

Qwen3-ASR represents a major leap in open-source speech recognition, delivering exceptional multilingual accuracy, robust real-world handling, and blazing-fast streaming performance that rivals or beats proprietary leaders. The 1.7B model’s SOTA results on diverse benchmarks combined with the 0.6B’s efficiency make the family versatile for everything from low-latency apps to high-volume batch jobs. With full Apache 2.0 openness, a comprehensive toolkit, and unique strengths in singing/dialect support, it’s an outstanding choice for developers seeking powerful, free ASR without compromises on quality or speed. Minor setup hurdles aside, this is currently one of the best open ASR solutions available.

FAQs

  • What is Qwen3-ASR?

    Qwen3-ASR is an open-source automatic speech recognition model family from Alibaba’s Qwen team, with the 1.7B version offering state-of-the-art multilingual transcription, language identification for 52 languages/dialects, streaming support, and strong performance on singing and noisy audio.

  • When was Qwen3-ASR released?

    Qwen3-ASR was officially open-sourced and announced on January 28, 2026, with the technical report published on arXiv around the same time.

  • How many languages does Qwen3-ASR support?

    It supports language identification and transcription for 52 languages and dialects, including 30 global languages, 22 Chinese dialects, and various English accents.

  • Is Qwen3-ASR free to use?

    Yes, the models are completely free and open-source under Apache 2.0 license, with full weights and inference code available on Hugging Face.

  • Does Qwen3-ASR support real-time streaming transcription?

    Yes, it provides unified streaming and offline inference with very low latency (as low as 92ms TTFT on the 0.6B version) using the vLLM backend.

  • Can Qwen3-ASR transcribe songs or singing?

    Yes, it has strong capabilities for singing voice and song transcription even with background music, achieving competitive WER on specialized benchmarks.

  • How accurate is Qwen3-ASR compared to other models?

    The 1.7B version achieves state-of-the-art results among open-source ASR models and is competitive with top proprietary APIs like GPT-4o or Gemini on internal and public benchmarks.

  • What is included with Qwen3-ASR for timestamps?

    It pairs with Qwen3-ForcedAligner-0.6B, a non-autoregressive model that provides highly accurate word/character-level timestamps for 11 languages, outperforming baselines like WhisperX.
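
    The page does not document the aligner's Python API, so the class name and align() method in the sketch below are hypothetical placeholders meant only to convey the text-audio pairing described above.

      # HYPOTHETICAL sketch: Qwen3ForcedAligner and align() are placeholder
      # names; the page does not document the aligner's real API.
      import torch
      from qwen_asr import Qwen3ForcedAligner  # hypothetical import

      aligner = Qwen3ForcedAligner.from_pretrained(
          "Qwen/Qwen3-ForcedAligner-0.6B", dtype=torch.bfloat16, device_map="cuda:0"
      )
      # Pair audio with its known transcript to get word/character timestamps.
      for span in aligner.align("speech.wav", text="hello world"):
          print(span)  # assumed: (token, start_seconds, end_seconds)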

Qwen3-ASR Alternatives

  1. Synthflow AI ($0/Month)
  2. Fireflies ($10/Month)
  3. Notta AI ($9/Month)
