Fun-Audio-Chat

Large Audio Language Model for Natural Low-Latency Voice Interactions – Speech-to-Speech and Speech-to-Text with Emotion Awareness
Last Updated: January 3, 2026
By Zelili AI

About This AI

Fun-Audio-Chat is an open-source Large Audio Language Model (LALM) developed by Alibaba’s Tongyi Fun Team, released on December 23, 2025.

The 8B-parameter model (with a 30B MoE variant Fun-Audio-Chat-30B-A3B) enables natural, low-latency voice conversations using innovative Dual-Resolution Speech Representations (5Hz efficient backbone + 25Hz refined head) to reduce compute by nearly 50% while preserving high speech quality.
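
To make the efficiency claim concrete, here is a back-of-the-envelope sketch of the frame counts involved. The 5Hz and 25Hz rates come from the model description; the cost split between backbone and head is an illustrative assumption, not the team's published accounting:

```python
# Back-of-the-envelope frame counts for the dual-resolution design.
# 5 Hz backbone = 5 representation frames per second of audio;
# 25 Hz head = 25 frames per second.

def frames(duration_s: float, rate_hz: int) -> int:
    """Representation frames produced for a clip at a given frame rate."""
    return int(duration_s * rate_hz)

clip = 60.0                   # one minute of audio
backbone = frames(clip, 5)    # 300 frames through the heavy LLM backbone
head = frames(clip, 25)       # 1500 frames through the lighter refined head

# A single-resolution 25 Hz model would push all 1500 frames through the
# backbone; the dual-resolution split sends only 1/5 of them there. Once
# the (smaller) cost of the 25 Hz head is added back in, this is roughly
# where the reported ~50% end-to-end compute reduction comes from.
print(backbone, head, head // backbone)  # 300 1500 5
```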

It incorporates Core-Cocktail training to retain strong text LLM capabilities alongside audio understanding, reasoning, and generation.

Key strengths include state-of-the-art performance on spoken QA, audio understanding, speech function calling, speech instruction-following, and voice empathy benchmarks (top rankings among similar-sized models on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, VStyle).

It supports speech-to-text (S2T), speech-to-speech (S2S), full-duplex two-way communication (via the Fun-Audio-Chat-Duplex variant), emotional tone detection and response, and tool/function calling through spoken prompts.

The project is fully open-source under Apache 2.0, with model weights on Hugging Face/ModelScope, training/inference code, a web demo, and vLLM integration for up to 50x speedup on long audio.

It requires Python 3.12, PyTorch 2.8.0, and roughly 24GB of GPU VRAM for inference, and is well suited to building real-time voice assistants, interactive agents, and multimodal speech applications.
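
The ~24GB figure is consistent with simple weight-memory arithmetic. A hedged sketch, assuming 16-bit (bf16/fp16) weights at 2 bytes per parameter; the activation and KV-cache overhead is our estimate, not a published breakdown:

```python
# Rough VRAM estimate for serving an 8B-parameter model in 16-bit precision.
PARAMS = 8e9          # 8 billion parameters
BYTES_PER_PARAM = 2   # bf16 / fp16

weights_gb = PARAMS * BYTES_PER_PARAM / 1024**3
print(round(weights_gb, 1))  # roughly 14.9 GB for the weights alone

# Activations, KV cache, and the audio front-end add several more GB on
# top of the weights, which is why ~24GB of VRAM is the practical floor.
```

Quantized (e.g. 8-bit or 4-bit) variants would lower this floor, at some cost in quality.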

Key Features

  1. Dual-Resolution Speech Representations: 5Hz backbone for efficiency plus 25Hz head for high-quality speech output
  2. Core-Cocktail Training: Preserves text LLM knowledge while gaining strong audio capabilities
  3. Low-Latency Voice Interaction: Natural real-time speech-to-speech conversations with minimal delay
  4. Voice Empathy: Detects and responds to emotional tone, pace, and energy in speech
  5. Spoken Function Calling: Executes tools and instructions via voice prompts
  6. Full-Duplex Support: Simultaneous two-way communication in Duplex variant
  7. Multimodal Audio Understanding: Excels at spoken QA, audio analysis, and instruction-following
  8. vLLM Inference Acceleration: Up to 20x speedup for short audios and 50x for long ones
  9. Open-Source Ecosystem: Full code, weights, web demo, and evaluation scripts available
  10. High Benchmark Performance: Tops leaderboards for 8B-scale models across audio tasks

Price Plans

  1. Free ($0): Fully open-source under Apache 2.0; model weights, code, demo, and inference scripts available at no cost for local use and modification

Pros

  1. State-of-the-art efficiency: Dual-resolution design cuts compute by nearly 50% with no reported quality loss
  2. Top benchmark rankings: Leads in spoken QA, audio understanding, function calling, and empathy
  3. Fully open-source: Apache 2.0 license with complete training/inference code and weights
  4. Real-time low-latency: Enables natural voice conversations suitable for interactive agents
  5. Emotional intelligence: Unique voice empathy for more human-like responses
  6. Acceleration support: vLLM integration dramatically speeds up inference
  7. Community resources: Hugging Face/ModelScope hosting, interactive demo, and paper

Cons

  1. High hardware requirements: Needs ~24GB GPU VRAM for inference
  2. Setup complexity: Requires specific Python/PyTorch versions and dependencies
  3. Limited languages: Primarily English and Chinese (based on LLM backbone)
  4. No hosted service: Local deployment only; no cloud API mentioned
  5. Recent release: Adoption and community integrations still growing
  6. Potential latency variance: Depends on hardware and audio length
  7. Evaluation focused: Strong on benchmarks, but real-world robustness on edge cases is less proven

Use Cases

  1. Voice assistants and chatbots: Build natural spoken dialogue systems with emotion awareness
  2. Interactive AI agents: Real-time voice interaction for gaming, virtual companions, or customer service
  3. Spoken question answering: Handle audio-based queries with high accuracy
  4. Speech instruction-following: Execute complex voice commands and function calls
  5. Audio understanding tasks: Analyze spoken content for insights or summarization
  6. Research in LALMs: Fine-tune or extend for new audio-language applications
  7. Accessibility tools: Voice interfaces for hands-free computing

Target Audience

  1. AI researchers and developers: Experimenting with audio LLMs and voice AI
  2. Voice application builders: Creating real-time speech-to-speech systems
  3. Open-source enthusiasts: Deploying and extending large audio models
  4. Game and virtual agent creators: Adding natural voice interactions
  5. Multimodal AI teams: Integrating speech with LLM reasoning
  6. Alibaba/Tongyi ecosystem users: Leveraging related models like CosyVoice

How To Use

  1. Clone repo: git clone --recurse-submodules https://github.com/FunAudioLLM/Fun-Audio-Chat
  2. Install dependencies: Use Python 3.12, PyTorch 2.8.0 (CUDA), ffmpeg, and pip install -r requirements.txt
  3. Download models: From Hugging Face (FunAudioLLM/Fun-Audio-Chat-8B) or ModelScope
  4. Run inference: python examples/infer_s2t.py for speech-to-text, or python examples/infer_s2s.py for speech-to-speech
  5. Launch web demo: Run server (python -m web_demo.server.server) and client (npm run dev)
  6. Evaluate: Use provided scripts for benchmarks like VoiceBench or OpenAudioBench
  7. Customize: Modify configs or fine-tune with LLaMA-Factory on your data

How we rated Fun-Audio-Chat

  • Performance: 4.8/5
  • Accuracy: 4.7/5
  • Features: 4.6/5
  • Cost-Efficiency: 5.0/5
  • Ease of Use: 4.3/5
  • Customization: 4.7/5
  • Data Privacy: 5.0/5
  • Support: 4.4/5
  • Integration: 4.5/5
  • Overall Score: 4.7/5

Fun-Audio-Chat integration with other tools

  1. Hugging Face: Model weights and pipelines for easy download and inference
  2. ModelScope: Alternative hosting for weights and community access
  3. vLLM: Inference acceleration backend for significant speedups
  4. LLaMA-Factory: Training/fine-tuning framework used in development
  5. CosyVoice: Integrated speech synthesis component for output generation

Best prompts optimised for Fun-Audio-Chat

  1. N/A - Fun-Audio-Chat processes spoken input directly; no text prompts are needed for core voice interaction (transcription and response generation are automatic).
  2. N/A - Use spoken questions or commands in real-time voice mode via the demo or inference scripts.
  3. N/A - For evaluation or custom use, provide audio files or live microphone input instead of text prompts.

Fun-Audio-Chat is a highly efficient open-source audio LLM delivering top-tier natural voice interactions with low latency, emotion awareness, and strong benchmark results in spoken QA and function calling. Its dual-resolution design cuts compute needs while maintaining quality. Fully free to run locally, it is an excellent choice for developers building real-time voice agents and assistants.

FAQs

  • What is Fun-Audio-Chat?

    Fun-Audio-Chat is an open-source Large Audio Language Model (8B parameters) from Alibaba’s Tongyi Fun Team, released December 23, 2025, for natural low-latency speech-to-speech and speech-to-text voice interactions.

  • Is Fun-Audio-Chat free to use?

    Yes, it is completely open-source under Apache 2.0 license with model weights, code, and inference/demo scripts freely available on GitHub and Hugging Face.

  • When was Fun-Audio-Chat released?

    The model was officially released on December 23, 2025, with a technical report on arXiv and open-source code following shortly after.

  • What are the key innovations in Fun-Audio-Chat?

    It uses Dual-Resolution Speech Representations (5Hz backbone + 25Hz head) for ~50% compute reduction and Core-Cocktail training to preserve text LLM strength alongside audio capabilities.

  • What hardware is required for Fun-Audio-Chat?

    Inference needs approximately 24GB of GPU VRAM; training requires 4x80GB GPUs. vLLM integration provides significant inference speedups.

  • Does Fun-Audio-Chat support voice empathy?

    Yes, it detects emotional tone, pace, and energy in speech and responds appropriately for more natural conversations.

  • What benchmarks does Fun-Audio-Chat excel in?

    It ranks top among similar-sized models on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU series, and speech function/instruction benchmarks.

  • How do I run Fun-Audio-Chat locally?

    Clone the GitHub repo, install dependencies (Python 3.12, PyTorch 2.8.0, ffmpeg), download weights from Hugging Face, and run inference scripts like infer_s2t.py or infer_s2s.py.

Fun-Audio-Chat Alternatives

  1. Synthflow AI ($0/Month)
  2. Fireflies ($10/Month)
  3. Notta AI ($9/Month)

About Author

Hi guys! We are a group of ML engineers by profession with years of experience exploring and building AI tools, LLMs, and generative technologies. We analyze new tools not just as users, but as people who understand their technical depth and real-world value. We know how overwhelming these tools can be for most people; that's why we break down complex AI concepts into simple, practical insights. Our goal is to help you discover AI tools that actually save you time and make everyday work smarter, not harder. “We don’t just write about AI: we build, test, and simplify it for you.”