What is Qwen-Audio?
Qwen-Audio (now in its Qwen2-Audio generation) is Alibaba Cloud's open-source large audio-language model. It accepts diverse audio inputs (speech, environmental sounds, music) and generates text outputs such as transcriptions, analyses, translations, and voice-chat responses.
When was Qwen2-Audio released?
Qwen2-Audio was released on August 9, 2024, as the successor to the original Qwen-Audio, with improved audio understanding and instruction-following capabilities.
Is Qwen-Audio free to use?
Yes. Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct are released as open weights under the Apache 2.0 license on Hugging Face and ModelScope, with no usage fees for local or self-hosted deployment.
What languages does Qwen-Audio support?
Qwen2-Audio offers strong multilingual support for speech recognition, translation, and analysis across many languages. The exact language count is not specified, but it performs well on multilingual benchmarks such as Fleurs and CoVoST2.
Can Qwen-Audio handle non-speech audio?
Yes. It processes music, environmental sounds, and mixed audio, and performs tasks such as sound event classification, emotion detection from vocal cues, and more.
How do I run Qwen-Audio locally?
Install the latest transformers from the GitHub source, load the model from "Qwen/Qwen2-Audio-7B-Instruct", prepare the audio and text inputs with the processor, and generate outputs.
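The steps above can be sketched as follows, based on the usage shown on the official model card. The audio file path is a placeholder, and the heavy imports and model download are behind a RUN_MODEL flag so the structure can be inspected without fetching the ~7B weights.

```python
MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"
RUN_MODEL = False  # set True to download the model and actually generate

# A chat-style request: one user turn containing an audio clip plus a text
# instruction. "my_clip.wav" is a placeholder for your own local audio file.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "my_clip.wav"},
        {"type": "text", "text": "What sound is this?"},
    ]},
]

if RUN_MODEL:
    # pip install git+https://github.com/huggingface/transformers librosa
    import librosa
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(
        MODEL_ID, device_map="auto"
    )

    # Render the conversation into the model's prompt format.
    text = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=False
    )

    # Decode the clip at the feature extractor's expected sampling rate.
    audio, _ = librosa.load(
        "my_clip.wav", sr=processor.feature_extractor.sampling_rate
    )

    inputs = processor(text=text, audios=[audio], return_tensors="pt",
                       padding=True).to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=256)
    out_ids = out_ids[:, inputs.input_ids.shape[1]:]  # drop prompt tokens
    print(processor.batch_decode(out_ids, skip_special_tokens=True)[0])
```

With RUN_MODEL set to True and a real audio file in place, the script prints the model's text answer about the clip.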
What benchmarks does Qwen-Audio excel on?
It achieves state-of-the-art results on AIR-Bench for audio instruction following, and performs strongly on LibriSpeech ASR, Common Voice, Fleurs, MELD emotion recognition, VocalSound classification, and more.
Does Qwen-Audio support voice chat?
Yes, users can speak instructions directly for natural conversational interaction, with multi-turn support.
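As a sketch of how multi-turn voice chat is represented, the chat template consumes a list of role/content messages in which user turns can carry audio parts; the file names and the assistant reply below are illustrative placeholders, not real outputs.

```python
# Hypothetical multi-turn voice-chat history in the message format the
# Qwen2-Audio chat template expects: each turn has a role, and user turns
# hold a content list that can mix audio and text parts.
conversation = [
    {"role": "user", "content": [
        # The user simply speaks; no typed text is required.
        {"type": "audio", "audio_url": "turn1_user.wav"},  # placeholder
    ]},
    # Illustrative assistant reply (plain string), kept as chat context.
    {"role": "assistant", "content": "It sounds like glass breaking."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "turn2_user.wav"},  # placeholder
    ]},
]

# Every audio part in the history must also be decoded and passed to the
# processor, in order, alongside the rendered prompt text.
audio_parts = [part["audio_url"]
               for turn in conversation if isinstance(turn["content"], list)
               for part in turn["content"] if part["type"] == "audio"]
print(audio_parts)  # -> ['turn1_user.wav', 'turn2_user.wav']
```

Because earlier turns stay in the message list, the model can resolve follow-up spoken questions against what was said before.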