Fun-Audio-Chat

Large Audio Language Model for Natural Low-Latency Voice Interactions – Speech-to-Speech and Speech-to-Text with Emotion Awareness
Tool Release Date

23 Dec 2025


About This AI

Fun-Audio-Chat is an open-source Large Audio Language Model (LALM) developed by Alibaba’s Tongyi Fun Team, released on December 23, 2025.

The 8B-parameter model (with a 30B MoE variant, Fun-Audio-Chat-30B-A3B) enables natural, low-latency voice conversations using innovative Dual-Resolution Speech Representations (a 5Hz efficient backbone plus a 25Hz refined head) that cut compute by nearly 50% while preserving high speech quality.
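To see where the efficiency comes from, compare the frame counts the two resolutions produce for the same audio; a back-of-envelope sketch (the reported ~50% overall saving is smaller than the raw 5x frame reduction because the 25Hz refined head still runs for speech output):

```python
# Frame counts for one minute of audio at the two resolutions.
BACKBONE_HZ = 5   # efficient backbone rate
HEAD_HZ = 25      # refined head rate used for speech output

seconds = 60
backbone_frames = BACKBONE_HZ * seconds  # 300 frames
head_frames = HEAD_HZ * seconds          # 1500 frames

# The backbone processes 5x fewer frames than a single 25Hz pipeline would.
reduction = head_frames / backbone_frames
print(backbone_frames, head_frames, reduction)  # 300 1500 5.0
```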

It incorporates Core-Cocktail training to retain strong text LLM capabilities alongside audio understanding, reasoning, and generation.

Key strengths include state-of-the-art performance on spoken QA, audio understanding, speech function calling, speech instruction-following, and voice empathy benchmarks (top rankings among similar-sized models on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, VStyle).

It supports speech-to-text (S2T), speech-to-speech (S2S), full-duplex two-way communication (via the Fun-Audio-Chat-Duplex variant), emotional tone detection and response, and tool/function calling through spoken prompts.

The model is fully open-source under Apache 2.0, with weights on Hugging Face and ModelScope, training/inference code, a web demo, and vLLM integration for up to 50x speedup on long audio.

Inference requires Python 3.12, PyTorch 2.8.0, and roughly 24GB of GPU VRAM, making the model well suited to building real-time voice assistants, interactive agents, and multimodal speech applications.
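The ~24GB VRAM figure is consistent with simply holding 8B parameters in half precision plus runtime overhead; a rough sketch (2 bytes per parameter is an assumption about the inference dtype, not a stated detail of the release):

```python
# Rough memory estimate for loading the 8B model weights alone.
params = 8e9          # 8 billion parameters
bytes_per_param = 2   # bf16/fp16, assumed inference precision

weights_gib = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")  # ~14.9 GiB
# KV cache, activations, and framework overhead plausibly account for
# the rest of the ~24GB recommendation.
```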

Key Features

  1. Dual-Resolution Speech Representations: 5Hz backbone for efficiency plus 25Hz head for high-quality speech output
  2. Core-Cocktail Training: Preserves text LLM knowledge while gaining strong audio capabilities
  3. Low-Latency Voice Interaction: Natural real-time speech-to-speech conversations with minimal delay
  4. Voice Empathy: Detects and responds to emotional tone, pace, and energy in speech
  5. Spoken Function Calling: Executes tools and instructions via voice prompts
  6. Full-Duplex Support: Simultaneous two-way communication in Duplex variant
  7. Multimodal Audio Understanding: Excels at spoken QA, audio analysis, and instruction-following
  8. vLLM Inference Acceleration: Up to 20x speedup for short audio clips and 50x for long ones
  9. Open-Source Ecosystem: Full code, weights, web demo, and evaluation scripts available
  10. High Benchmark Performance: Tops leaderboards for 8B-scale models across audio tasks

Price Plans

  1. Free ($0): Fully open-source under Apache 2.0; model weights, code, demo, and inference scripts available at no cost for local use and modification

Pros

  1. State-of-the-art efficiency: Dual-resolution design cuts compute by roughly 50% without sacrificing quality
  2. Top benchmark rankings: Leads in spoken QA, audio understanding, function calling, and empathy
  3. Fully open-source: Apache 2.0 license with complete training/inference code and weights
  4. Real-time low-latency: Enables natural voice conversations suitable for interactive agents
  5. Emotional intelligence: Unique voice empathy for more human-like responses
  6. Acceleration support: vLLM integration dramatically speeds up inference
  7. Community resources: Hugging Face/ModelScope hosting, interactive demo, and paper

Cons

  1. High hardware requirements: Needs ~24GB GPU VRAM for inference
  2. Setup complexity: Requires specific Python/PyTorch versions and dependencies
  3. Limited languages: Primarily English and Chinese (based on LLM backbone)
  4. No hosted service: Local deployment only; no cloud API mentioned
  5. Recent release: Adoption and community integrations still growing
  6. Potential latency variance: Depends on hardware and audio length
  7. Evaluation-focused: Strong on benchmarks, but real-world edge-case behavior may vary

Use Cases

  1. Voice assistants and chatbots: Build natural spoken dialogue systems with emotion awareness
  2. Interactive AI agents: Real-time voice interaction for gaming, virtual companions, or customer service
  3. Spoken question answering: Handle audio-based queries with high accuracy
  4. Speech instruction-following: Execute complex voice commands and function calls
  5. Audio understanding tasks: Analyze spoken content for insights or summarization
  6. Research in LALMs: Fine-tune or extend for new audio-language applications
  7. Accessibility tools: Voice interfaces for hands-free computing

Target Audience

  1. AI researchers and developers: Experimenting with audio LLMs and voice AI
  2. Voice application builders: Creating real-time speech-to-speech systems
  3. Open-source enthusiasts: Deploying and extending large audio models
  4. Game and virtual agent creators: Adding natural voice interactions
  5. Multimodal AI teams: Integrating speech with LLM reasoning
  6. Alibaba/Tongyi ecosystem users: Leveraging related models like CosyVoice

How To Use

  1. Clone repo: git clone --recurse-submodules https://github.com/FunAudioLLM/Fun-Audio-Chat
  2. Install dependencies: Use Python 3.12, PyTorch 2.8.0 (CUDA), ffmpeg, and pip install -r requirements.txt
  3. Download models: From Hugging Face (FunAudioLLM/Fun-Audio-Chat-8B) or ModelScope
  4. Run inference: python examples/infer_s2t.py for speech-to-text or infer_s2s.py for speech-to-speech
  5. Launch web demo: Run server (python -m web_demo.server.server) and client (npm run dev)
  6. Evaluate: Use provided scripts for benchmarks like VoiceBench or OpenAudioBench
  7. Customize: Modify configs or fine-tune with LLaMA-Factory on your data

How we rated Fun-Audio-Chat

  • Performance: 4.8/5
  • Accuracy: 4.7/5
  • Features: 4.6/5
  • Cost-Efficiency: 5.0/5
  • Ease of Use: 4.3/5
  • Customization: 4.7/5
  • Data Privacy: 5.0/5
  • Support: 4.4/5
  • Integration: 4.5/5
  • Overall Score: 4.7/5

Fun-Audio-Chat integration with other tools

  1. Hugging Face: Model weights and pipelines for easy download and inference
  2. ModelScope: Alternative hosting for weights and community access
  3. vLLM: Inference acceleration backend for significant speedups
  4. LLaMA-Factory: Training/fine-tuning framework used in development
  5. CosyVoice: Integrated speech synthesis component for output generation

Best prompts optimised for Fun-Audio-Chat

  1. N/A - Fun-Audio-Chat is a speech-to-speech / audio language model that processes spoken input directly; no text prompts required for core voice interaction (automatic transcription and response generation).
  2. N/A - Use spoken questions or commands in real-time voice mode via demo or inference scripts.
  3. N/A - For evaluation or custom use, provide audio files or live microphone input instead of text prompts.

Fun-Audio-Chat is a highly efficient open-source audio LLM delivering natural, low-latency voice interactions with emotion awareness and strong benchmark results in spoken QA and function calling. Its dual-resolution design cuts compute needs while maintaining quality. Fully free and locally run, it is an excellent choice for developers building real-time voice agents and assistants.

FAQs

  • What is Fun-Audio-Chat?

    Fun-Audio-Chat is an open-source Large Audio Language Model (8B parameters) from Alibaba’s Tongyi Fun Team, released December 23, 2025, for natural low-latency speech-to-speech and speech-to-text voice interactions.

  • Is Fun-Audio-Chat free to use?

    Yes, it is completely open-source under Apache 2.0 license with model weights, code, and inference/demo scripts freely available on GitHub and Hugging Face.

  • When was Fun-Audio-Chat released?

    The model was officially released on December 23, 2025, with a technical report on arXiv and open-source code released shortly after.

  • What are the key innovations in Fun-Audio-Chat?

    It uses Dual-Resolution Speech Representations (5Hz backbone + 25Hz head) for ~50% compute reduction and Core-Cocktail training to preserve text LLM strength alongside audio capabilities.

  • What hardware is required for Fun-Audio-Chat?

    Inference needs approximately 24GB GPU VRAM; training requires 4x80GB GPUs; supports vLLM for significant speedups.

  • Does Fun-Audio-Chat support voice empathy?

    Yes, it detects emotional tone, pace, and energy in speech and responds appropriately for more natural conversations.

  • What benchmarks does Fun-Audio-Chat excel in?

    It ranks top among similar-sized models on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU series, and speech function/instruction benchmarks.

  • How do I run Fun-Audio-Chat locally?

    Clone the GitHub repo, install dependencies (Python 3.12, PyTorch 2.8.0, ffmpeg), download weights from Hugging Face, and run inference scripts like infer_s2t.py or infer_s2s.py.
