Updated April 2026 — Mistral Voxtral · xAI Grok · Microsoft MAI

The independent voice AI intelligence radar.

Real benchmarks. Transparent pricing. Guided recommendations.
20+ TTS & STT providers, updated for April 2026.

20+: Providers Tracked
$11B: ElevenLabs Valuation
67+: Models on Artificial Analysis
Aug 2: EU AI Act Deadline '26

Independent Benchmarks

Market Leaders at a Glance

Quality, speed, features, and price efficiency across all active providers — sorted by composite value score.

Market Value Composition

Stacked analysis of Quality, Speed, Features, and Price efficiency. Higher total area = better overall value. Sorted by composite score.

Top providers ranked by composite score (Quality + Speed + Features + Price)
Provider	Quality (out of 5)	Speed (out of 5)	Features (out of 5)	Price Score (out of 5)	Total (out of 20)
Deepgram	4.5	5	4	4	17.5
AssemblyAI	4.5	3	5	5	17.5
Chatterbox	4.5	4	4	5	17.5
Qwen3-TTS	4.5	4	4	5	17.5
Inworld AI	5	5	5	2	17
Azure AI Speech	4	4	5	4	17
Cartesia	4.5	5	4	3	16.5
Hume AI	4.5	4	4	4	16.5
Fish Audio S2 Pro	4.5	4	3	5	16.5
Kokoro	4.2	4	3	5	16.2
ElevenLabs	5	4	5	2	16
Microsoft MAI	5	4	4	3	16

Scores are composite ratings (1–5 per dimension) compiled from Artificial Analysis, HuggingFace TTS Arena, and official benchmarks. Price Score: 5 = cheapest. Open-source models are priced at $0 when self-hosted.
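As a rough illustration of how the Total column is formed, the sketch below sums the four dimension scores with equal weight and sorts by the result. The equal weighting is inferred from the table's totals; the `Scores` class and `rank` helper are hypothetical names used only for this example, and only a few rows are copied in.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    """Per-dimension ratings on the page's 1-5 scale (hypothetical container)."""
    quality: float
    speed: float
    features: float
    price: float  # 5 = cheapest

    @property
    def total(self) -> float:
        # Equal-weight sum, matching "Quality + Speed + Features + Price".
        return self.quality + self.speed + self.features + self.price


# A few rows copied from the table above.
PROVIDERS = {
    "ElevenLabs": Scores(5, 4, 5, 2),
    "Kokoro": Scores(4.2, 4, 3, 5),
    "Deepgram": Scores(4.5, 5, 4, 4),
    "Inworld AI": Scores(5, 5, 5, 2),
}


def rank(providers):
    """Return (name, scores) pairs sorted by composite total, highest first."""
    return sorted(providers.items(), key=lambda kv: kv[1].total, reverse=True)


for name, scores in rank(PROVIDERS):
    print(f"{name:<12} {scores.total:>5.1f}")
```

A weighted variant (for example, doubling the weight on Price for budget projects) would only need a change to `total`, at the cost of no longer matching the published totals.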

Recommendation Engine

Find Your Perfect Stack

Answer 3 questions to get a personalised recommendation rooted in April 2026 benchmarks.

Market Updates — April 2026

What Changed This Quarter

Three new entrants, open-source models beating commercial leaders, and one major shutdown.

NEW

Mistral Voxtral (Mar 26)

4B-parameter open-weight model, 90ms TTFA, runs on smartphones with 3 GB RAM. $16/1M chars.

NEW

Microsoft MAI (Apr 3)

MAI-Transcribe-1 achieves 3.8% WER — beats Whisper-large-v3 on 22 of 25 languages.

NEW

xAI Grok TTS (Mar 16)

OpenAI Realtime API-compatible format enables drop-in migration from existing stacks.

MILESTONE

ElevenLabs — $11B

$500M Series D led by Sequoia. On-premise/on-device now available. $330M+ ARR.

OPEN SOURCE

Chatterbox beats ElevenLabs

MIT-licensed model preferred by 63.75% of evaluators in blind tests. Free, 23 languages.

SHUTDOWN

PlayHT — Discontinued

Shut down Dec 31, 2025 after Meta acquisition. Migrate to ElevenLabs or Chatterbox.

Provider Comparison

Compare All Providers

Benchmarks, pricing, ELO scores, WER, features, and compliance across every active TTS & STT provider.

Metric abbreviations: WER (word error rate), ELO (arena preference rating), TTFA (time to first audio), MOS (mean opinion score).

Inworld AI (TTS, new)
Pricing: TTS-1.5 Max: $30/1M chars (enterprise); TTS-1.5 Mini: $15/1M chars (low latency)
Quality & ELO: ELO 1,236; 130ms TTFA
Key Features: #1 TTS Arena ELO; Zero-shot Voice Cloning (5–15s); Sub-250ms P90 Latency; Domain-specific Pronunciation; Healthcare/Finance/Legal
Languages: 30+
Compliance: SOC2, HIPAA
Best For: voice agent, content creation, enterprise

ElevenLabs (TTS and STT)
Pricing: Starter: $5/mo for 30k chars; Creator: $22/mo for 100k chars; Scale API: ~$165/1M chars (Scale)
Quality & ELO: ELO 1,197; 75ms TTFA
Key Features: Eleven v3 (GA Feb 2); On-Premise / On-Device (Apr 9); Voice Cloning (10,000+ voices); Scribe v2 STT; Dubbing & Translation; 74 Languages; ElevenAgents; IBM watsonx Integration
Languages: 74+
Compliance: SOC2, HIPAA, GDPR
Best For: content creation, narration, voice agent, enterprise

Deepgram (TTS and STT)
Pricing: Nova-3 STT: $0.0043–$0.0077/min (per-second billing); Voice Agent API: ~$0.075/min (STT+LLM+TTS)
Quality & ELO: 5.3% WER
Key Features: Nova-3 (5.3% WER); Sub-300ms Streaming; Flux Turn Detection; Diarization; Smart Formatting; TTS Speed Controls (0.7–1.5×); Self-hosted Deployment; 45+ Languages; Per-second Billing
Languages: 45+
Compliance: HIPAA, SOC2
Best For: voice agent, transcription, analytics, real time

AssemblyAI (STT)
Pricing: Universal-2: $0.0025/min (99 languages); Universal-3 Pro: $0.0035/min (prompt-based customization)
Quality & ELO: 14.5% WER
Key Features: Universal-2 (99 languages); Universal-3 Pro Streaming; Prompt-based Domain Customization; Medical Mode (en/es/de/fr); Sentiment Analysis; PII Redaction; LLM Integration; Audio Intelligence
Languages: 99+ 🌍
Compliance: HIPAA, SOC2 Type 2, ISO 27001:2022, PCI DSS v4.0, GDPR
Best For: analytics, transcription, understanding, enterprise

Cartesia (TTS)
Pricing: Pay-as-you-go: $5/100k credits (1 credit/char)
Quality & ELO: 40ms TTFA
Key Features: Sonic 3 (SageMaker); 40ms TTFA (Sonic Turbo); 3-second Voice Cloning; Emotion Control; Sonic Flash 75ms; SageMaker JumpStart
Languages: 20+
Compliance: none listed
Best For: voice agent, real time

Mistral Voxtral (TTS, new)
Pricing: API: $16/1M chars
Quality & ELO: 90ms TTFA
Key Features: 4B Parameters; 90ms TTFA; Smartphone Deployment (3GB RAM); Voice Cloning; EU Data Sovereignty; CC BY-NC 4.0 (open weights)
Languages: 9+
Compliance: none listed
Best For: voice agent, budget, offline

Microsoft MAI (TTS and STT, new)
Pricing: MAI-Transcribe-1: ~$0.017/min (Azure pricing); MAI-Voice-1: $16/1M chars
Quality & ELO: 3.8% WER
Key Features: MAI-Voice-1 (cloning); MAI-Transcribe-1 (3.8% WER); Beats Whisper-large-v3 (22/25 languages); Beats ElevenLabs Scribe v2 (15/25); 25 Language STT; Half GPU usage vs competitors
Languages: 25+
Compliance: Azure Compliance, GDPR, HIPAA
Best For: enterprise, transcription, accessibility

xAI Grok TTS (TTS, new)
Pricing: API: ~$15/1M chars (estimated)
Quality & ELO: not listed
Key Features: OpenAI Realtime API Compatible; Drop-in Migration Path; xAI Infrastructure
Languages: 13+
Compliance: none listed
Best For: voice agent, prototyping

OpenAI (TTS and STT)
Pricing: TTS Standard: $15/1M chars (tts-1); TTS HD: $30/1M chars (tts-1-hd); GPT-4o Transcribe: $0.006/min (free diarization); Mini Transcribe: $0.003/min (budget option)
Quality & ELO: 5% WER
Key Features: GPT-4o Transcribe (5% WER); GPT-4o Mini Transcribe; Free Diarization; tts-1 / tts-1-hd; 99+ Languages (STT); Simple API
Languages: 99+ 🌍
Compliance: SOC2, GDPR
Best For: simple app, prototyping, transcription

Azure AI Speech (TTS and STT)
Pricing: Neural TTS: $15–16/1M chars; STT Standard: $0.017/min (140+ languages)
Quality & ELO: not listed
Key Features: 140+ Languages (TTS); 500+ Neural Voices; Custom Neural Voice; Speech Translation; Real-time Captions; Avatar Video Synthesis
Languages: 140+ 🌍
Compliance: SOC2, HIPAA, ISO 27001, GDPR, FedRAMP
Best For: enterprise, accessibility, global

Google Cloud Speech (TTS and STT)
Pricing: WaveNet: $4/1M chars (standard); Chirp 3 HD: $30/1M chars (HD); STT Standard: $0.024/min (60 min/mo free)
Quality & ELO: not listed
Key Features: Chirp 3 HD; WaveNet & Studio Voices; Gemini Integration; 380+ Voices; 75+ Languages; 60 min/mo Free STT; Google Translate Integration
Languages: 75+
Compliance: SOC2, HIPAA, ISO 27001, GDPR
Best For: enterprise, analytics, accessibility

Hume AI (TTS, new)
Pricing: Octave 2: $7.60/1M chars
Quality & ELO: not listed
Key Features: Octave 2 Voice Model; TADA Architecture (1B/3B); Zero Content Hallucinations; 10× Context Efficiency; Emotional Intelligence; 11 Languages
Languages: 11+
Compliance: none listed
Best For: voice agent, content creation

LeanVox (TTS, new)
Pricing: Standard: $5/1M chars
Quality & ELO: not listed
Key Features: 23+ Languages; Standard Neural Voices; REST API
Languages: 23+
Compliance: none listed
Best For: budget, simple app

Kokoro v1.0 (TTS, open source)
Pricing: Self-hosted: Free (compute only); Hosted (DeepInfra): ~$0.65/1M chars
Quality & ELO: ELO 1,056
Key Features: 82M Parameters; MOS 4.2 (highest open-source); CPU / Raspberry Pi Capable; 210× Real-time on GPU; Apache 2.0 License; 9 Languages
Languages: 9+
Compliance: none listed
Best For: budget, accessibility, offline

Chatterbox (TTS, new, open source)
Pricing: Self-hosted: Free (MIT license)
Quality & ELO: not listed
Key Features: MIT License; 63.75% Preferred over ElevenLabs; Chatterbox Turbo (sub-200ms); Chatterbox Multilingual (23 languages); PerTh Neural Watermarking; Paralinguistic Tags [laugh] [cough]; 11K+ GitHub Stars; Emotion Control
Languages: 23+
Compliance: none listed
Best For: budget, content creation, voice agent, offline

Qwen3-TTS (TTS, new, open source)
Pricing: Self-hosted: Free (Apache 2.0)
Quality & ELO: 1.24% WER
Key Features: Apache 2.0 License; 0.77% Chinese WER; 1.24% English WER; 0.6B & 1.7B Variants; 49+ Voice Presets; 12Hz Proprietary Tokenizer; Natural-language Voice Design; 10 Languages
Languages: 10+
Compliance: none listed
Best For: budget, content creation, accessibility

Fish Audio S2 Pro (TTS, new, open source)
Pricing: API: ~$10/1M chars; Self-hosted: Free (open weights)
Quality & ELO: ELO 1,128
Key Features: ELO 1128 (Best Open-weights); Voice Cloning; Open Weights; Commercial API Available
Languages: 15+
Compliance: none listed
Best For: content creation, budget, voice agent

Moonshine (Useful Sensors) (STT, new, open source)
Pricing: Self-hosted: Free (MIT license)
Quality & ELO: not listed
Key Features: 245M Parameters (MIT); Matches Whisper Large-v3; 1/6 the Size of Whisper; Mobile & Embedded Ready; CPU Capable
Languages: 1+
Compliance: none listed
Best For: accessibility, offline, budget

PlayHT (TTS, discontinued)
Pricing: Service discontinued
Quality & ELO: not listed
Key Features: Service Discontinued; Acquired by Meta (Dec 31, 2025); Migrate to: ElevenLabs, Chatterbox, Kokoro
Languages: N/A
Compliance: none listed
Best For: none listed
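The comparison data above lends itself to simple filtering, which is roughly what a shortlist for a given project amounts to. The sketch below is one way to do that under stated assumptions: the `Provider` record and `shortlist` helper are hypothetical, and only four providers are copied in from the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical record holding a subset of the comparison fields."""
    name: str
    kind: str  # "TTS", "STT", or "BOTH"
    compliance: frozenset = frozenset()
    best_for: frozenset = frozenset()


# A few entries copied from the comparison above.
CATALOG = [
    Provider("Inworld AI", "TTS", frozenset({"SOC2", "HIPAA"}),
             frozenset({"voice agent", "content creation", "enterprise"})),
    Provider("Deepgram", "BOTH", frozenset({"HIPAA", "SOC2"}),
             frozenset({"voice agent", "transcription", "analytics", "real time"})),
    Provider("AssemblyAI", "STT",
             frozenset({"HIPAA", "SOC2 Type 2", "ISO 27001:2022", "PCI DSS v4.0", "GDPR"}),
             frozenset({"analytics", "transcription", "understanding", "enterprise"})),
    Provider("Cartesia", "TTS", frozenset(),
             frozenset({"voice agent", "real time"})),
]


def shortlist(catalog, use_case, needs_tts=False, required=()):
    """Names of providers tagged for use_case that list every required certification."""
    required = set(required)
    return [
        p.name
        for p in catalog
        if use_case in p.best_for
        and required <= p.compliance
        and (not needs_tts or p.kind in ("TTS", "BOTH"))
    ]


print(shortlist(CATALOG, "voice agent", needs_tts=True, required={"HIPAA"}))
```

Certification labels are matched as exact strings here, so "SOC2" and "SOC2 Type 2" count as different entries; a real filter would need to normalise them first.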

Data sourced from Artificial Analysis Speech Arena, HuggingFace Open ASR Leaderboard, and official provider documentation. All prices approximate as of April 2026. Benchmark scores may vary by use case.
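For readers unfamiliar with the WER figures quoted here, the sketch below computes word error rate in its usual form: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The normalisation (lowercasing and whitespace splitting) is a simplifying assumption; public leaderboards apply their own text normalisation, so results from this function will not be directly comparable to the published numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word and one substituted word out of four reference words.
print(word_error_rate("turn the lights off", "turn lights of"))
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.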

EU AI Act Article 50 applies from August 2, 2026. Voice AI systems must disclose AI origin, mark synthetic audio in a machine-readable format, and inform users when emotion recognition is in use. California CAITA aligns to the same date. Penalties for transparency violations run up to €15M or 3% of global turnover; prohibited practices carry up to €35M or 7%.

Pricing Calculator

Estimate Your Monthly Cost

Drag the volume slider to compare real costs across all active providers at your scale. Open-source options shown at $0 self-hosted.

Monthly Volume: 10 hours

Estimated at base pricing tiers.

TTS: ~15,000 chars/hr · STT: 60 mins/hr

Open-source models shown at $0 (self-hosted).

At 10h/month

Cheapest paid: $0.60 (Google Cloud Speech)

Most expensive: $25.00 (ElevenLabs)
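The arithmetic behind these estimates can be reproduced with the conversion factors quoted above (about 15,000 characters per hour for TTS, 60 minutes per hour for STT). The sketch below is a minimal version under those assumptions; the function names are hypothetical, and the rates are base-tier prices taken from the provider comparison.

```python
TTS_CHARS_PER_HOUR = 15_000  # conversion factor quoted by the calculator
STT_MINUTES_PER_HOUR = 60


def tts_monthly_cost(hours: float, usd_per_million_chars: float) -> float:
    """Estimated monthly TTS spend for a given number of audio hours."""
    chars = hours * TTS_CHARS_PER_HOUR
    return chars / 1_000_000 * usd_per_million_chars


def stt_monthly_cost(hours: float, usd_per_minute: float) -> float:
    """Estimated monthly STT spend for a given number of audio hours."""
    return hours * STT_MINUTES_PER_HOUR * usd_per_minute


# 10 hours/month at base-tier rates from the comparison.
print(f"Google WaveNet TTS:   ${tts_monthly_cost(10, 4):.2f}")
print(f"ElevenLabs Scale TTS: ${tts_monthly_cost(10, 165):.2f}")
print(f"AssemblyAI STT:       ${stt_monthly_cost(10, 0.0025):.2f}")
```

At 10 hours this gives $0.60 for Google WaveNet, matching the cheapest paid figure, and $24.75 for ElevenLabs at ~$165/1M chars. The calculator's $25.00 suggests it rounds or uses a slightly different rate.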

Monthly Cost Comparison

Commercial API · Open Source (self-hosted $0)

Community Picks

Best Provider by Use Case

Ranked from real production deployments, blind tests, and Artificial Analysis benchmarks.

⚡

Real-time Voice Agent

1. Cartesia Sonic (40ms TTFA)
2. Inworld TTS-1.5 Mini (<130ms)
3. ElevenLabs Flash v2.5 (~75ms)

✨

Highest Quality TTS

1. Inworld TTS-1.5 Max (ELO 1,236)
2. ElevenLabs v3 (ELO 1,197)
3. Fish Audio S2 Pro (ELO 1,128, open weights)
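To put the ELO gaps in this ranking in perspective, the sketch below converts a rating difference into an expected head-to-head preference rate using the standard Elo logistic formula with a 400-point scale. That the arena uses exactly this formula is an assumption (some leaderboards fit Bradley-Terry or other variants), and the function names are hypothetical.

```python
import math


def expected_preference(elo_a: float, elo_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))


def elo_gap(preference_rate: float) -> float:
    """Rating difference implied by a head-to-head preference rate (0 < rate < 1)."""
    return 400 * math.log10(preference_rate / (1 - preference_rate))


# Inworld TTS-1.5 Max (1,236) against ElevenLabs v3 (1,197).
print(f"{expected_preference(1236, 1197):.3f}")
```

Under this model the 39-point gap between the top two entries corresponds to the leader being preferred in roughly 56% of pairwise comparisons, so a small ELO lead is a modest edge rather than a decisive one.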

💰

Best Value STT

1. AssemblyAI Universal-2 ($0.0025/min)
2. Deepgram Nova-3 ($0.0043/min)
3. OpenAI Mini Transcribe ($0.003/min)

🎤

Voice Cloning

1. ElevenLabs (professional grade)
2. Cartesia (3-second cloning)
3. Chatterbox (free, MIT licensed)

📱

Edge / Offline TTS

1. Kokoro 82M (CPU capable, Apache 2.0)
2. Chatterbox Turbo (sub-200ms)
3. Mistral Voxtral (3 GB RAM)

🔒

Edge / Offline STT

1. Moonshine 245M (MIT, matches Whisper Large-v3)
2. Whisper.cpp (38K+ GitHub stars)
3. NVIDIA Canary Qwen (5.63% WER)
