Zelili AI

ElevenLabs Launches Scribe V2: What’s New, Pricing, API, Comparison

ElevenLabs has released Scribe V2, a new speech-to-text model that combines batch and real-time transcription with ultra-low latency, advanced detection, speaker diarization, and significantly reduced word error rates across dozens of languages.

Scribe V2 Launch

As a blogger who is always collecting audio to transcribe for my AI tool reviews, I was excited to see ElevenLabs release Scribe V2.

This hybrid of batch and real-time speech-to-text should speed up the workflow of creators like me who need consistent, high-quality transcription without the complexity.

Released 3 days ago, Scribe V2 inherits the high-fidelity, AI-driven audio for which ElevenLabs is known, and adds a higher level of accuracy and a suite of features designed to cover use cases from podcasts all the way up to enterprise meetings.

Standout Features and Enhancements

Scribe V2 works in two modes: batch for deep analysis and real-time for lightning-fast results. Key highlights include:

  • Ultra-Low Latency: Real-time mode returns transcription results in less than 150ms, with "negative latency" from predicting the next word, which makes it ideal for live captioning or voice agents.
  • Advanced Detection: Entity recognition for 56 categories (such as PII and health data) with timestamps, plus keyterm prompting with up to 100 custom terms to help with jargon in audio.
  • Speaker Diarization: Transcribes up to 48 speakers, with optional punctuation, which is ideal for interviews or group recordings.
  • Audio Tagging: Identifies non-speech elements like laughter or pauses for richer transcripts.
  • Compliance and Scale: Enterprise-ready with GDPR, HIPAA, SOC 2, and zero-retention options; handles files up to 10 hours and 3GB.
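
The limits in this list (100 keyterms, 48 speakers, 3GB, 10 hours) are easy to check before uploading anything. Here is a minimal sketch of such a pre-flight check. The `TranscriptionJob` class and `preflight_problems` function are hypothetical names of my own, not part of any ElevenLabs SDK, and I am reading "3GB" as 3 × 10^9 bytes.

```python
from dataclasses import dataclass, field

# Limits quoted in the feature list above.
MAX_KEYTERMS = 100
MAX_SPEAKERS = 48
MAX_FILE_BYTES = 3 * 1000**3          # "3GB" read as 3 * 10^9 bytes (assumption)
MAX_DURATION_SECONDS = 10 * 60 * 60   # 10 hours


@dataclass
class TranscriptionJob:
    """Hypothetical description of a batch job; not an ElevenLabs type."""
    file_bytes: int
    duration_seconds: float
    expected_speakers: int = 1
    keyterms: list[str] = field(default_factory=list)


def preflight_problems(job: TranscriptionJob) -> list[str]:
    """Return a list of problems; an empty list means the job fits the limits."""
    problems = []
    if job.file_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 3 GB")
    if job.duration_seconds > MAX_DURATION_SECONDS:
        problems.append("recording is longer than 10 hours")
    if job.expected_speakers > MAX_SPEAKERS:
        problems.append("more than 48 speakers expected")
    if len(job.keyterms) > MAX_KEYTERMS:
        problems.append("more than 100 keyterms supplied")
    return problems
```

Running this on a one-hour, four-person interview with a handful of keyterms returns an empty list, so the file is safe to send. Anything over a limit shows up as a readable message instead of a failed upload.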

Compared to the previous Scribe V1, V2 significantly cuts word error rates (WER), achieving ≤5% in over 35 languages and outperforming V1 on noisy, complex samples.


It also handles accents, silences, and multi-language switching better, which results in up to 20% fewer errors on benchmarks.

For me, that translates to less manual cleanup of interview transcripts and more time saved.
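
For anyone who wants to check WER on their own recordings, the metric is simple: count the word substitutions, deletions, and insertions needed to turn the model's output into a reference transcript, then divide by the number of words in the reference. Below is a minimal sketch of that calculation. It only lowercases and splits on whitespace, so it is not the normalization ElevenLabs uses for its benchmarks, and the function names are my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction by which the error rate dropped, e.g. 0.2 for a 20% reduction."""
    return (old_wer - new_wer) / old_wer
```

As a made-up example, comparing the reference "the quick brown fox jumps over the lazy dog" with the output "the quick brown fox jumped over lazy dog" gives one substitution and one deletion over nine words, a WER of 2/9 or about 22%. A drop from 6.25% to 5% WER would be a 20% relative reduction; those two figures are illustrative, not from the benchmarks.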

Language Support and API Integration

Scribe V2 supports more than 90 languages, from widely spoken ones like English and Mandarin to less commonly supported ones like Zulu and Wolof.

[Image: Scribe V2 Benchmarks]

Scribe V2 automatically recognizes mixed-language context and transcribes even complex cross-talk. This also gives it global reach, so you can use it with international projects.

The API is designed to be developer-friendly, with WebSocket support for real-time streaming and REST for batch uploads.

It accepts a range of audio and video formats (e.g., MP3, MP4) and supports webhooks for asynchronous results. ElevenLabs’ documentation makes integration straightforward, so you can build custom apps such as voice assistants or subtitling tools.
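
Real-time speech-to-text streams usually send interim guesses that get replaced as more audio arrives, followed by a final version of each segment. Here is a minimal sketch of how a client could keep a live caption up to date from such a stream. The event shape (`"type"` set to `"partial"` or `"final"`, plus `"text"`) is a hypothetical format I made up for illustration and is not taken from the ElevenLabs documentation.

```python
from dataclasses import dataclass, field


@dataclass
class LiveTranscript:
    """Accumulates text from a stream of partial and final events."""
    committed: list[str] = field(default_factory=list)
    partial: str = ""

    def apply(self, event: dict) -> None:
        # "type" and "text" are hypothetical field names for this sketch.
        kind = event.get("type")
        text = event.get("text", "").strip()
        if kind == "partial":
            # A partial result replaces the previous guess for the current segment.
            self.partial = text
        elif kind == "final":
            # A final result locks the segment in and clears the pending guess.
            if text:
                self.committed.append(text)
            self.partial = ""
        # Any other event type is ignored.

    def display(self) -> str:
        """Text to show on screen: finished segments plus the current guess."""
        return " ".join(part for part in [*self.committed, self.partial] if part)
```

In a real client, each message received over the WebSocket would be decoded from JSON and passed to `apply`, and `display` would be called to refresh the caption on screen.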

Pricing Breakdown

ElevenLabs uses a subscription model: each tier includes a set number of hours, and additional usage is billed per hour. Here’s a quick tier overview:

Tier | Monthly Cost | Included Batch Hours | Included Real-Time Hours | Additional Batch $/Hour | Additional Real-Time $/Hour
Free | $0 | 2.5 | 10 | $0.48 | N/A
Starter | $5 | 12.5 | 48 | $0.63 | $0.53
Creator | $11 | 62.5 | 225 | $0.07 | $0.63
Pro | $99 | 300 | 786 | $0.40 | $0.46
Scale | $330 | 1100 | 3385 | $0.33 | $0.39
Business | Custom | 6000 | Custom | $0.22 | Custom

Pricing is competitive at roughly 40 cents an hour on average, comparable to competitors like OpenAI’s Whisper but with better accuracy in tests.
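
To see what a given month would cost, add the subscription fee to any hours beyond the included allowance multiplied by the overage rate. Here is a minimal sketch using the batch figures for three tiers as I read them from the table above. The function names are my own, and the numbers should be checked against ElevenLabs' current pricing page before relying on them.

```python
# Batch figures as read from the tier table above:
# tier -> (monthly fee in $, included batch hours, $ per additional batch hour)
BATCH_TIERS = {
    "Starter": (5.00, 12.5, 0.63),
    "Pro": (99.00, 300.0, 0.40),
    "Scale": (330.00, 1100.0, 0.33),
}


def monthly_batch_cost(tier: str, hours: float) -> float:
    """Subscription fee plus overage for hours beyond the included allowance."""
    fee, included, overage_rate = BATCH_TIERS[tier]
    extra_hours = max(0.0, hours - included)
    return round(fee + extra_hours * overage_rate, 2)


def effective_rate(tier: str) -> float:
    """Cost per included hour when the allowance is fully used."""
    fee, included, _ = BATCH_TIERS[tier]
    return fee / included
```

On these figures, fully using the Starter allowance works out to $5 / 12.5 hours = $0.40 per hour, which lines up with the roughly 40 cents quoted above, while Pro comes to $99 / 300 hours = $0.33 per hour.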


In my experience, the Starter tier offers great value for occasional users, while Pro suits heavy transcribers. This launch cements ElevenLabs as a leader in AI audio, and I can’t wait to see how big a leap the next model makes.