SAM Audio

Meta’s Unified Multimodal Model for Prompt-Based Audio Separation – Isolate Any Sound with Text, Visual, or Time Prompts
Last Updated: December 28, 2025
By Zelili AI

About This AI

SAM Audio is Meta’s groundbreaking open-source foundation model for audio separation, released December 16, 2025, as the audio counterpart to the Segment Anything family.

It allows users to isolate any specific sound, musical instrument, vocal line, or speech track from complex audio or audio-visual mixtures using intuitive multimodal prompts: text descriptions, visual cues (clicking on a sound source in the video), or time-span selections.

The unified model handles general sounds (traffic, barking dogs), music (separating a guitar from a band mix), and speech (isolating a speaker from background noise), producing both target and residual audio stems at high quality.

Built on a flow-matching Diffusion Transformer in DAC-VAE latent space and powered by the new Perception Encoder Audiovisual (PE-AV) extension of Meta’s Perception Encoder, it achieves state-of-the-art performance across diverse real-world benchmarks.
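
The flow-matching objective behind this kind of architecture can be illustrated with a toy example: along the linear path from a noise sample x0 to a data sample x1, the model learns to regress the constant velocity x1 − x0. The NumPy sketch below is illustrative only (random toy "latents", not Meta's actual training code or the real DAC-VAE latents):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "latents": in SAM Audio these would be DAC-VAE audio latents.
x0 = rng.normal(size=(4, 8))   # noise sample
x1 = rng.normal(size=(4, 8))   # data sample (target latent)
t = rng.uniform(size=(4, 1))   # random interpolation times in [0, 1]

# Linear probability path used in rectified-flow-style flow matching:
x_t = (1.0 - t) * x0 + t * x1

# The regression target for the velocity field at (x_t, t) is constant
# along this path: d/dt x_t = x1 - x0 for all t.
v_target = x1 - x0

# A trained network v(x_t, t) would be fit with MSE against v_target.
# For the linear path, integrating dx/dt = v from t=0 to t=1 with a
# single Euler step recovers x1 exactly:
x1_hat = x0 + 1.0 * v_target
assert np.allclose(x1_hat, x1)
```

At inference time, generation runs this ODE in reverse order of training: start from noise, repeatedly step along the predicted velocity, and decode the resulting latent back to audio.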

Available for download on GitHub/Hugging Face under SAM License (research and commercial use), with a demo in the Segment Anything Playground for easy testing.

Applications include music production, podcast editing, film post-production, accessibility (hearing aid enhancements), scientific audio analysis, and content creation by removing unwanted noises or isolating elements.

As a fully open model, it enables developers to integrate advanced audio separation into apps, tools, or research without proprietary restrictions.

Key Features

  1. Multimodal prompting: Separate sounds using text descriptions, visual clicks on video, or selected time spans in audio
  2. Target and residual separation: Outputs both isolated target audio and remaining mixture stem
  3. Unified model architecture: Handles general sounds, music instruments/vocals, and speech in one framework
  4. State-of-the-art performance: Outperforms prior models on diverse audio separation benchmarks
  5. Flow-matching Diffusion Transformer: Operates in DAC-VAE latent space for high-fidelity results
  6. Perception Encoder Audiovisual (PE-AV): Extends visual embeddings to multimodal audio understanding
  7. Open-source availability: Full model checkpoints, inference code, and evaluation tools on GitHub/Hugging Face
  8. Segment Anything Playground demo: Try separation interactively without local setup
  9. Research and commercial license: SAM License allows broad use including commercial applications
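
The target-plus-residual output format (feature 2 above) implies a simple invariant: the two stems should sum back to the input mixture. A tiny NumPy illustration of that invariant, using hypothetical signals and a pretend-perfect separator (not SAM Audio's code):

```python
import numpy as np

sr = 16_000
t = np.arange(sr) / sr

# Hypothetical mixture: a 440 Hz "target" tone plus broadband background noise.
target = 0.5 * np.sin(2 * np.pi * 440 * t)
background = 0.1 * np.random.default_rng(1).normal(size=sr)
mixture = target + background

# An ideal separator returns the isolated target stem; the residual stem
# is simply everything else left in the mixture:
estimated_target = target               # pretend-perfect separation
residual = mixture - estimated_target

# The two stems reconstruct the original mixture exactly:
assert np.allclose(estimated_target + residual, mixture)
```

In practice a real separator's estimate is imperfect, so the residual stem also absorbs the estimation error; the sum-to-mixture property is what lets editors recombine stems without losing audio.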

Price Plans

  1. Free ($0): Fully open-source model with checkpoints, code, and Playground demo; no usage fees for download, local run, or research/commercial integration
  2. Cloud/Enterprise (Custom): Potential future hosted options or premium support via Meta AI (not available at launch)

Pros

  1. Intuitive prompting: Natural text/visual/time inputs make separation accessible without technical expertise
  2. Versatile across audio types: Excels at music, speech, and general sound isolation in real-world mixtures
  3. High-quality stems: Clean target isolation with minimal artifacts and full residual preservation
  4. Fully open-source: Download, run locally, integrate, or fine-tune freely under permissive license
  5. Multimodal innovation: Among the first to unify text, visual, and time-span prompting in a single audio-separation model
  6. Strong real-world utility: Useful for creators, editors, accessibility, and scientific analysis
  7. Easy demo access: Playground lets anyone test capabilities instantly

Cons

  1. Requires local setup for full use: Playground is demo-only; advanced features need GPU/hardware
  2. Compute intensive: Inference demands powerful GPU for fast processing of long audio
  3. Early release stage: Released late 2025; community integrations and optimizations still emerging
  4. No hosted API yet: Must run locally or via custom deployment for production use
  5. Limited prompt robustness: Complex or ambiguous prompts may require iteration for best results
  6. No mobile/web native app: Primarily research/dev focused rather than consumer-ready app
  7. Potential artifacts in edge cases: Very noisy/overlapping sources can challenge even SOTA models

Use Cases

  1. Music production: Isolate instruments/vocals from mixes for remixing or stem creation
  2. Podcast/video editing: Remove background noise, isolate speakers, or clean unwanted sounds
  3. Film post-production: Separate dialogue, effects, or music from complex scenes
  4. Accessibility enhancements: Isolate speech for hearing aids or captioning tools
  5. Scientific audio analysis: Extract specific events/sounds from field recordings
  6. Content creation: Clean audio for social media, remove distractions in recordings
  7. Developer integration: Build apps/tools with advanced audio separation via model code

Target Audience

  1. Audio engineers and producers: For precise sound isolation in music and post-production
  2. Video creators and podcasters: Cleaning and editing audio from recordings
  3. Film/TV professionals: Dialogue/effects separation in complex mixes
  4. Accessibility researchers: Improving hearing tech and captioning
  5. AI developers and researchers: Extending or integrating the open model
  6. Content creators: Quick fixes for noisy social media or personal audio

How To Use

  1. Visit Playground: Go to aidemos.meta.com/segment-anything/editor/segment-audio for instant demo
  2. Upload audio/video: Load your file (audio or audiovisual source)
  3. Provide prompt: Use text (e.g., 'isolate the guitar'), click visual source in video, or mark time span
  4. Run separation: Model processes and outputs isolated target + residual stems
  5. Download results: Export separated audio files for editing
  6. Local setup: Clone GitHub repo (facebookresearch/sam-audio), install dependencies, download checkpoints from Hugging Face
  7. Run inference: Use provided notebooks/scripts with your prompts and media
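
The local-setup steps above (6–7) follow a standard clone-and-install flow. The sketch below is an assumption-laden outline, not verified commands: the authoritative instructions, package name, and checkpoint IDs live in the facebookresearch/sam-audio README and its Hugging Face model page.

```shell
# Assumed setup flow; consult the repo README for the exact commands.
git clone https://github.com/facebookresearch/sam-audio.git
cd sam-audio

# Isolate dependencies in a virtual environment.
python -m venv .venv && source .venv/bin/activate
pip install -e .   # assumes the repo ships a pip-installable package

# Checkpoints are hosted on Hugging Face; the exact model ID is in the repo docs.
# huggingface-cli download <model-id>

# Then run the provided example notebooks/scripts with your media and prompts.
```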

How we rated SAM Audio

  • Performance: 4.8/5
  • Accuracy: 4.7/5
  • Features: 4.9/5
  • Cost-Efficiency: 5.0/5
  • Ease of Use: 4.5/5
  • Customization: 4.6/5
  • Data Privacy: 4.8/5
  • Support: 4.4/5
  • Integration: 4.5/5
  • Overall Score: 4.8/5

SAM Audio integration with other tools

  1. Segment Anything Playground: Web-based demo for instant testing without installation
  2. GitHub Repository: Full inference code, checkpoints, and example notebooks for local/dev use
  3. Hugging Face: Model weights and community spaces for easy download and experimentation
  4. Audio Editing Software (Potential): Export stems to DAWs like Audacity, Reaper, Logic Pro, or Adobe Audition
  5. Custom Apps: Integrate via code for developers building audio tools or accessibility features

Best prompts optimised for SAM Audio

  1. Isolate the lead guitar solo from this rock band recording while preserving the drums and vocals
  2. Separate the speaking voice from background traffic noise in this street interview video
  3. Extract only the dog barking sounds from this park ambient audio clip
  4. Remove the piano accompaniment and keep just the singer's vocals in this acoustic track
  5. Isolate the dialogue between two characters in this movie scene clip, excluding music and effects

SAM Audio is Meta’s open-source breakthrough for audio separation, letting users isolate any sound from a mixture with simple text, visual, or time-span prompts. It outperforms prior models across music, speech, and general audio, delivering high-quality stems with broad applications. Fully free to download and use, it’s a game-changer for creators, editors, and developers.

FAQs

  • What is SAM Audio?

    SAM Audio is Meta’s open-source unified multimodal model for audio separation, allowing isolation of any sound from complex mixtures using text, visual, or time-span prompts.

  • When was SAM Audio released?

    SAM Audio was officially introduced and released by Meta on December 16, 2025.

  • Is SAM Audio free to use?

    Yes, it is completely free and open-source under the SAM License, with model checkpoints, code, and a Playground demo available for research and commercial use.

  • What prompts does SAM Audio support?

    It supports text prompts (describe the sound), visual prompts (click on video source), and time-span prompts (select segment in timeline).

  • Where can I try SAM Audio?

    Test it instantly in the Segment Anything Playground at aidemos.meta.com/segment-anything/editor/segment-audio, or download from GitHub/Hugging Face for local use.

  • What types of audio can SAM Audio separate?

    It handles general sounds (e.g., traffic, barking), music (instruments/vocals), and speech (speakers from noise) from audio or video files.

  • What license does SAM Audio use?

    Released under the SAM License, which permits both research and commercial applications under its terms.

  • How does SAM Audio compare to other tools?

    It sets new standards with multimodal prompting and unified handling of sounds/music/speech, outperforming previous separation models on benchmarks.

About Author

Hi guys! We are a group of ML engineers by profession, with years of experience exploring and building AI tools, LLMs, and generative technologies. We analyze new tools not just as users, but as people who understand their technical depth and real-world value. We know how overwhelming these tools can be for most people; that’s why we break down complex AI concepts into simple, practical insights. Our goal is to help you discover the AI tools that actually save you time and make everyday work smarter, not harder. “We don’t just write about AI: we build, test, and simplify it for you.”