Updated April 2026 — Mistral Voxtral · xAI Grok · Microsoft MAI

The independent voice AI intelligence radar.

Real benchmarks. Transparent pricing. Guided recommendations. 20+ TTS & STT providers, updated for April 2026.

20+ Providers Tracked
$11B ElevenLabs Valuation
67+ Models on Artificial Analysis
Aug 2, 2026: EU AI Act Deadline

Independent Benchmarks

Market Leaders at a Glance

Quality, speed, features, and price efficiency across all active providers — sorted by composite value score.

Market Value Composition

Stacked analysis of Quality, Speed, Features, and Price efficiency. Higher total area = better overall value. Sorted by composite score.

Top providers ranked by composite score (Quality + Speed + Features + Price)
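Provider | Quality (out of 5) | Speed (out of 5) | Features (out of 5) | Price Score (out of 5) | Total (out of 20)
Deepgram | 4.5 | 5 | 4 | 4 | 17.5
AssemblyAI | 4.5 | 3 | 5 | 5 | 17.5
Chatterbox | 4.5 | 4 | 4 | 5 | 17.5
Qwen3-TTS | 4.5 | 4 | 4 | 5 | 17.5
Inworld AI | 5 | 5 | 5 | 2 | 17
Azure AI Speech | 4 | 4 | 5 | 4 | 17
Cartesia | 4.5 | 5 | 4 | 3 | 16.5
Hume AI | 4.5 | 4 | 4 | 4 | 16.5
Fish Audio S2 Pro | 4.5 | 4 | 3 | 5 | 16.5
Kokoro | 4.2 | 4 | 3 | 5 | 16.2
ElevenLabs | 5 | 4 | 5 | 2 | 16
Microsoft MAI | 5 | 4 | 4 | 3 | 16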

Scores are composite ratings (1–5 per dimension) compiled from Artificial Analysis, HuggingFace TTS Arena, and official benchmarks. Price Score: 5 = cheapest. Open-source models at $0 self-hosted.
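The composite ranking can be reproduced with a few lines of Python. This is a minimal sketch that assumes the total is a plain unweighted sum of the four dimensions, which is consistent with every row in the table; the `Scores` class and `rank` function are illustrative names, and only four rows are included for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    provider: str
    quality: float
    speed: float
    features: float
    price: float  # 5 = cheapest

    @property
    def total(self) -> float:
        # Assumed formula: unweighted sum of the four 1-5 dimensions (max 20).
        return round(self.quality + self.speed + self.features + self.price, 1)


# A subset of the rows from the table above.
ROWS = [
    Scores("ElevenLabs", 5, 4, 5, 2),
    Scores("Kokoro", 4.2, 4, 3, 5),
    Scores("Deepgram", 4.5, 5, 4, 4),
    Scores("Inworld AI", 5, 5, 5, 2),
]


def rank(rows):
    # sorted() is stable, so providers with equal totals keep their input order.
    return sorted(rows, key=lambda r: r.total, reverse=True)


if __name__ == "__main__":
    for row in rank(ROWS):
        print(f"{row.provider:<12} {row.total}")
```

Because the sum is unweighted, a provider with a low Price Score can still rank near the top on quality, speed, and features alone; a weighted variant would change the order.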

Recommendation Engine

Find Your Perfect Stack

Answer 3 questions to get a personalised recommendation rooted in April 2026 benchmarks.

Step 1 of 3

What are you building?

Market Updates — April 2026

What Changed This Quarter

Three new entrants, open-source models beating commercial leaders, and one major shutdown.

NEW

Mistral Voxtral (Mar 26)

4B-parameter open-weight model, 90ms TTFA, runs on smartphones with 3 GB RAM. $16/1M chars.

NEW

Microsoft MAI (Apr 3)

MAI-Transcribe-1 achieves 3.8% WER — beats Whisper-large-v3 on 22 of 25 languages.

NEW

xAI Grok TTS (Mar 16)

OpenAI Realtime API-compatible format enables drop-in migration from existing stacks.

MILESTONE

ElevenLabs — $11B

$500M Series D led by Sequoia. On-premise/on-device now available. $330M+ ARR.

OPEN SOURCE

Chatterbox beats ElevenLabs

MIT-licensed model preferred by 63.75% of evaluators in blind tests. Free, 23 languages.

SHUTDOWN

PlayHT — Discontinued

Shut down Dec 31, 2025 after Meta acquisition. Migrate to ElevenLabs or Chatterbox.

Provider Comparison

Compare All Providers

Benchmarks, pricing, ELO scores, WER, features, and compliance across every active TTS & STT provider.

Metrics: WER (word error rate) · ELO (arena rating) · TTFA (time to first audio) · MOS (mean opinion score)
Provider | Type
Inworld AI (NEW) | TTS
ElevenLabs | TTS + STT
Deepgram | TTS + STT
AssemblyAI | STT
Cartesia | TTS
Mistral Voxtral (NEW) | TTS
Microsoft MAI (NEW) | TTS + STT
xAI Grok TTS (NEW) | TTS
OpenAI | TTS + STT
Azure AI Speech | TTS + STT
Google Cloud Speech | TTS + STT
Hume AI (NEW) | TTS
LeanVox (NEW) | TTS
Kokoro v1.0 (OPEN SOURCE) | TTS
Chatterbox (NEW, OPEN SOURCE) | TTS
Qwen3-TTS (NEW, OPEN SOURCE) | TTS
Fish Audio S2 Pro (NEW, OPEN SOURCE) | TTS
Moonshine (Useful Sensors) (NEW, OPEN SOURCE) | STT
PlayHT (DISCONTINUED) | TTS
Pricing
Inworld AI: TTS-1.5 Max $30/1M chars (enterprise) · TTS-1.5 Mini $15/1M chars (low latency)
ElevenLabs: Starter $5/mo for 30k chars · Creator $22/mo for 100k chars · Scale API ~$165/1M chars
Deepgram: Nova-3 STT $0.0043–$0.0077/min (per-second billing) · Voice Agent API ~$0.075/min (STT+LLM+TTS)
AssemblyAI: Universal-2 $0.0025/min (99 languages) · Universal-3 Pro $0.0035/min (prompt-based customization)
Cartesia: Pay-as-you-go $5/100k credits (1 credit/char)
Mistral Voxtral: API $16/1M chars
Microsoft MAI: MAI-Transcribe-1 ~$0.017/min (Azure pricing) · MAI-Voice-1 $16/1M chars
xAI Grok TTS: API ~$15/1M chars (estimated)
OpenAI: TTS Standard $15/1M chars (tts-1) · TTS HD $30/1M chars (tts-1-hd) · GPT-4o Transcribe $0.006/min (free diarization) · Mini Transcribe $0.003/min (budget option)
Azure AI Speech: Neural TTS $15–16/1M chars · STT Standard $0.017/min (140+ languages)
Google Cloud Speech: WaveNet $4/1M chars (standard) · Chirp 3 HD $30/1M chars (HD) · STT Standard $0.024/min (60 min/mo free)
Hume AI: Octave 2 $7.60/1M chars
LeanVox: Standard $5/1M chars
Kokoro v1.0: Self-hosted free (compute only) · Hosted (DeepInfra) ~$0.65/1M chars
Chatterbox: Self-hosted free (MIT license)
Qwen3-TTS: Self-hosted free (Apache 2.0)
Fish Audio S2 Pro: API ~$10/1M chars · Self-hosted free (open weights)
Moonshine: Self-hosted free (MIT license)
PlayHT: Service discontinued
Quality & ELO
Inworld AI: ELO 1,236 · 130ms TTFA
ElevenLabs: ELO 1,197 · 75ms TTFA
Deepgram: 5.3% WER
AssemblyAI: 14.5% WER
Cartesia: 40ms TTFA
Mistral Voxtral: 90ms TTFA
Microsoft MAI: 3.8% WER
OpenAI: 5% WER
Kokoro v1.0: ELO 1,056
Qwen3-TTS: 1.24% WER
Fish Audio S2 Pro: ELO 1,128
Key Features
Inworld AI
  • #1 TTS Arena ELO
  • Zero-shot Voice Cloning (5–15s)
  • Sub-250ms P90 Latency
  • Domain-specific Pronunciation
  • Healthcare/Finance/Legal
ElevenLabs
  • Eleven v3 (GA Feb 2)
  • On-Premise / On-Device (Apr 9)
  • Voice Cloning (10,000+ voices)
  • Scribe v2 STT
  • Dubbing & Translation
  • 74 Languages
  • ElevenAgents
  • IBM watsonx Integration
Deepgram
  • Nova-3 (5.3% WER)
  • Sub-300ms Streaming
  • Flux Turn Detection
  • Diarization
  • Smart Formatting
  • TTS Speed Controls (0.7–1.5×)
  • Self-hosted Deployment
  • 45+ Languages
  • Per-second Billing
AssemblyAI
  • Universal-2 (99 languages)
  • Universal-3 Pro Streaming
  • Prompt-based Domain Customization
  • Medical Mode (en/es/de/fr)
  • Sentiment Analysis
  • PII Redaction
  • LLM Integration
  • Audio Intelligence
Cartesia
  • Sonic 3 (SageMaker)
  • 40ms TTFA (Sonic Turbo)
  • 3-second Voice Cloning
  • Emotion Control
  • Sonic Flash 75ms
  • SageMaker JumpStart
Mistral Voxtral
  • 4B Parameters
  • 90ms TTFA
  • Smartphone Deployment (3GB RAM)
  • Voice Cloning
  • EU Data Sovereignty
  • CC BY-NC 4.0 (open weights)
Microsoft MAI
  • MAI-Voice-1 (cloning)
  • MAI-Transcribe-1 (3.8% WER)
  • Beats Whisper-large-v3 (22/25 languages)
  • Beats ElevenLabs Scribe v2 (15/25)
  • 25 Language STT
  • Half GPU usage vs competitors
xAI Grok TTS
  • OpenAI Realtime API Compatible
  • Drop-in Migration Path
  • xAI Infrastructure
OpenAI
  • GPT-4o Transcribe (5% WER)
  • GPT-4o Mini Transcribe
  • Free Diarization
  • tts-1 / tts-1-hd
  • 99+ Languages (STT)
  • Simple API
Azure AI Speech
  • 140+ Languages (TTS)
  • 500+ Neural Voices
  • Custom Neural Voice
  • Speech Translation
  • Real-time Captions
  • Avatar Video Synthesis
Google Cloud Speech
  • Chirp 3 HD
  • WaveNet & Studio Voices
  • Gemini Integration
  • 380+ Voices
  • 75+ Languages
  • 60 min/mo Free STT
  • Google Translate Integration
Hume AI
  • Octave 2 Voice Model
  • TADA Architecture (1B/3B)
  • Zero Content Hallucinations
  • 10× Context Efficiency
  • Emotional Intelligence
  • 11 Languages
LeanVox
  • 23+ Languages
  • Standard Neural Voices
  • REST API
Kokoro v1.0
  • 82M Parameters
  • MOS 4.2 (highest open-source)
  • CPU / Raspberry Pi Capable
  • 210× Real-time on GPU
  • Apache 2.0 License
  • 9 Languages
Chatterbox
  • MIT License
  • 63.75% Preferred over ElevenLabs
  • Chatterbox Turbo (sub-200ms)
  • Chatterbox Multilingual (23 languages)
  • PerTh Neural Watermarking
  • Paralinguistic Tags [laugh] [cough]
  • 11K+ GitHub Stars
  • Emotion Control
Qwen3-TTS
  • Apache 2.0 License
  • 0.77% Chinese WER
  • 1.24% English WER
  • 0.6B & 1.7B Variants
  • 49+ Voice Presets
  • 12Hz Proprietary Tokenizer
  • Natural-language Voice Design
  • 10 Languages
Fish Audio S2 Pro
  • ELO 1128 (Best Open-weights)
  • Voice Cloning
  • Open Weights
  • Commercial API Available
Moonshine (Useful Sensors)
  • 245M Parameters (MIT)
  • Matches Whisper Large-v3
  • 1/6 the Size of Whisper
  • Mobile & Embedded Ready
  • CPU Capable
PlayHT
  • Service Discontinued
  • Acquired by Meta (Dec 31, 2025)
  • Migrate to: ElevenLabs, Chatterbox, Kokoro
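Languages
Inworld AI: 30+
ElevenLabs: 74+
Deepgram: 45+
AssemblyAI: 99+ 🌍
Cartesia: 20+
Mistral Voxtral: 9+
Microsoft MAI: 25+
xAI Grok TTS: 13+
OpenAI: 99+ 🌍
Azure AI Speech: 140+ 🌍
Google Cloud Speech: 75+
Hume AI: 11+
LeanVox: 23+
Kokoro v1.0: 9+
Chatterbox: 23+
Qwen3-TTS: 10+
Fish Audio S2 Pro: 15+
Moonshine: 1+
PlayHT: N/A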
Compliance
SOC2 · HIPAA
SOC2 · HIPAA · GDPR
HIPAA · SOC2
HIPAA · SOC2 Type 2 · ISO 27001:2022 · PCI DSS v4.0 · GDPR
Azure Compliance · GDPR · HIPAA
SOC2 · GDPR
SOC2 · HIPAA · ISO 27001 · GDPR · FedRAMP
SOC2 · HIPAA · ISO 27001 · GDPR
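Best For
Inworld AI: voice agent · content creation · enterprise
ElevenLabs: content creation · narration · voice agent · enterprise
Deepgram: voice agent · transcription · analytics · real time
AssemblyAI: analytics · transcription · understanding · enterprise
Cartesia: voice agent · real time
Mistral Voxtral: voice agent · budget · offline
Microsoft MAI: enterprise · transcription · accessibility
xAI Grok TTS: voice agent · prototyping
OpenAI: simple app · prototyping · transcription
Azure AI Speech: enterprise · accessibility · global
Google Cloud Speech: enterprise · analytics · accessibility
Hume AI: voice agent · content creation
LeanVox: budget · simple app
Kokoro v1.0: budget · accessibility · offline
Chatterbox: budget · content creation · voice agent · offline
Qwen3-TTS: budget · content creation · accessibility
Fish Audio S2 Pro: content creation · budget · voice agent
Moonshine: accessibility · offline · budget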

Data sourced from Artificial Analysis Speech Arena, HuggingFace Open ASR Leaderboard, and official provider documentation. All prices approximate as of April 2026. Benchmark scores may vary by use case.
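For readers comparing the WER figures, here is a minimal sketch of how word error rate is typically computed: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The lowercase, whitespace-split normalization is an assumption made for illustration; leaderboards and providers apply their own text normalization, so their published numbers will not match this sketch exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "turn the lights off in the kitchen"
    hyp = "turn the light off in kitchen"
    # One substitution (lights -> light) and one deletion (the): 2 errors / 7 words.
    print(f"{word_error_rate(ref, hyp):.1%}")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.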

EU AI Act Article 50: August 2, 2026. All voice AI systems must disclose AI origin, mark synthetic audio in machine-readable format, and comply with emotion recognition restrictions. California CAITA aligns to the same date. Penalties under the Act run up to €35M or 7% of global turnover for the most serious violations, and up to €15M or 3% for breaches of transparency obligations such as Article 50.

Pricing Calculator

Estimate Your Monthly Cost

Drag the volume slider to compare real costs across all active providers at your scale. Open-source options shown at $0 self-hosted.


Estimated at base pricing tiers.

TTS: ~15,000 chars/hr · STT: 60 mins/hr

Open-source models shown at $0 (self-hosted).

At 10h/month

Cheapest paid: $0.60 (Google Cloud Speech)
Most expensive: $25.00 (ElevenLabs)
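As a rough sketch of the arithmetic behind these estimates, assuming cost scales linearly with usage at the listed base price and using the stated conversion of ~15,000 chars/hr for TTS and 60 mins/hr for STT. The function names are illustrative and are not part of any provider SDK.

```python
TTS_CHARS_PER_HOUR = 15_000   # calculator's stated assumption
STT_MINUTES_PER_HOUR = 60     # calculator's stated assumption


def tts_monthly_cost(hours: float, usd_per_million_chars: float) -> float:
    """Monthly TTS cost in USD for a per-character price."""
    chars = hours * TTS_CHARS_PER_HOUR
    return round(chars * usd_per_million_chars / 1_000_000, 2)


def stt_monthly_cost(hours: float, usd_per_minute: float) -> float:
    """Monthly STT cost in USD for a per-minute price."""
    minutes = hours * STT_MINUTES_PER_HOUR
    return round(minutes * usd_per_minute, 2)


if __name__ == "__main__":
    hours = 10
    print("Google WaveNet TTS:", tts_monthly_cost(hours, 4.0))
    print("ElevenLabs Scale API TTS:", tts_monthly_cost(hours, 165.0))
    print("AssemblyAI Universal-2 STT:", stt_monthly_cost(hours, 0.0025))
```

At 10 hours this gives $0.60 for Google Cloud WaveNet at $4/1M chars, matching the cheapest paid figure above. ElevenLabs at ~$165/1M chars comes to $24.75, slightly under the $25.00 shown, so the calculator may apply rounding or tiering that is not described here.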

Monthly Cost Comparison

Legend: Commercial API · Open Source (self-hosted $0)

Community Picks

Best Provider by Use Case

Ranked from real production deployments, blind tests, and Artificial Analysis benchmarks.

Real-time Voice Agent

  1. Cartesia Sonic: 40ms TTFA
  2. Inworld TTS-1.5 Mini: <130ms
  3. ElevenLabs Flash v2.5: ~75ms

Highest Quality TTS

  1. Inworld TTS-1.5 Max (ELO 1,236)
  2. ElevenLabs v3 (ELO 1,197)
  3. Fish Audio S2 Pro (ELO 1,128, open)
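To put the ELO gaps in perspective, a rating difference can be turned into an expected head-to-head preference rate. The sketch below assumes the conventional logistic Elo formula with a 400-point scale; the arena's exact rating method is not stated here, so treat the output as indicative only.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of pairwise comparisons won by A over B (standard Elo, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


if __name__ == "__main__":
    inworld, elevenlabs, fish = 1236, 1197, 1128
    print(f"Inworld vs ElevenLabs: {elo_expected_score(inworld, elevenlabs):.3f}")
    print(f"Inworld vs Fish Audio: {elo_expected_score(inworld, fish):.3f}")
```

Under that assumption, the 39-point gap between the top two models corresponds to the leader being preferred in roughly 56% of pairwise comparisons, which is a modest edge.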

Best Value STT

  1. AssemblyAI Universal-2: $0.0025/min
  2. Deepgram Nova-3: $0.0043/min
  3. OpenAI Mini Transcribe: $0.003/min

Voice Cloning

  1. ElevenLabs: professional grade
  2. Cartesia: 3-second cloning
  3. Chatterbox: free, MIT licensed

Edge / Offline TTS

  1. Kokoro 82M: CPU capable, Apache 2.0
  2. Chatterbox Turbo: sub-200ms
  3. Mistral Voxtral: 3 GB RAM

Edge / Offline STT

  1. Moonshine 245M: MIT, matches Whisper Large-v3
  2. Whisper.cpp: 38K+ GitHub stars
  3. NVIDIA Canary Qwen: 5.63% WER