Provider Comparison Matrix

Benchmarks, pricing models, ELO scores, word error rates, feature sets, and compliance certifications across every active TTS & STT provider. Updated April 2026.


Abbreviations: WER (word error rate), ELO (arena preference rating), TTFA (time to first audio), MOS (mean opinion score).

Provider	Inworld AI (TTS, new)	ElevenLabs (TTS + STT)	Deepgram (TTS + STT)	AssemblyAI (STT)	Cartesia (TTS)	Mistral Voxtral (TTS, new)	Microsoft MAI (TTS + STT, new)	xAI Grok TTS (TTS, new)	OpenAI (TTS + STT)	Azure AI Speech (TTS + STT)	Google Cloud Speech (TTS + STT)	Hume AI (TTS, new)	LeanVox (TTS, new)	Kokoro v1.0 (TTS, open source)	Chatterbox (TTS, new, open source)	Qwen3-TTS (TTS, new, open source)	Fish Audio S2 Pro (TTS, new, open source)	Moonshine (Useful Sensors) (STT, new, open source)	PlayHT (TTS, discontinued)
Pricing	TTS-1.5 Max: $30/1M chars (enterprise); TTS-1.5 Mini: $15/1M chars (low latency)	Starter: $5/mo for 30k chars; Creator: $22/mo for 100k chars; Scale API: ~$165/1M chars	Nova-3 STT: $0.0043–$0.0077/min (per-second billing); Voice Agent API: ~$0.075/min (STT+LLM+TTS)	Universal-2: $0.0025/min (99 languages); Universal-3 Pro: $0.0035/min (prompt-based customization)	Pay-as-you-go: $5/100k credits (1 credit/char)	API: $16/1M chars	MAI-Transcribe-1: ~$0.017/min (Azure pricing); MAI-Voice-1: $16/1M chars	API: ~$15/1M chars (estimated)	TTS Standard: $15/1M chars (tts-1); TTS HD: $30/1M chars (tts-1-hd); GPT-4o Transcribe: $0.006/min (free diarization); Mini Transcribe: $0.003/min (budget option)	Neural TTS: $15–16/1M chars; STT Standard: $0.017/min (140+ languages)	WaveNet: $4/1M chars (standard); Chirp 3 HD: $30/1M chars (HD); STT Standard: $0.024/min (60 min/mo free)	Octave 2: $7.60/1M chars	Standard: $5/1M chars	Self-hosted: Free (compute only); Hosted (DeepInfra): ~$0.65/1M chars	Self-hosted: Free (MIT license)	Self-hosted: Free (Apache 2.0)	API: ~$10/1M chars; Self-hosted: Free (open weights)	Self-hosted: Free (MIT license)	Service discontinued
Quality & ELO	ELO 1,236; 130ms TTFA	ELO 1,197; 75ms TTFA	5.3% WER	14.5% WER	40ms TTFA	90ms TTFA	3.8% WER	N/A	5% WER	N/A	N/A	N/A	N/A	ELO 1,056	N/A	1.24% WER	ELO 1,128	N/A	N/A
Key Features	#1 TTS Arena ELO; Zero-shot Voice Cloning (5–15s); Sub-250ms P90 Latency; Domain-specific Pronunciation; Healthcare/Finance/Legal	Eleven v3 (GA Feb 2); On-Premise / On-Device (Apr 9); Voice Cloning (10,000+ voices); Scribe v2 STT; Dubbing & Translation; 74 Languages; ElevenAgents; IBM watsonx Integration	Nova-3 (5.3% WER); Sub-300ms Streaming; Flux Turn Detection; Diarization; Smart Formatting; TTS Speed Controls (0.7–1.5×); Self-hosted Deployment; 45+ Languages; Per-second Billing	Universal-2 (99 languages); Universal-3 Pro Streaming; Prompt-based Domain Customization; Medical Mode (en/es/de/fr); Sentiment Analysis; PII Redaction; LLM Integration; Audio Intelligence	Sonic 3 (SageMaker); 40ms TTFA (Sonic Turbo); 3-second Voice Cloning; Emotion Control; Sonic Flash 75ms; SageMaker JumpStart	4B Parameters; 90ms TTFA; Smartphone Deployment (3GB RAM); Voice Cloning; EU Data Sovereignty; CC BY-NC 4.0 (open weights)	MAI-Voice-1 (cloning); MAI-Transcribe-1 (3.8% WER); Beats Whisper-large-v3 (22/25 languages); Beats ElevenLabs Scribe v2 (15/25); 25 Language STT; Half GPU usage vs competitors	OpenAI Realtime API Compatible; Drop-in Migration Path; xAI Infrastructure	GPT-4o Transcribe (5% WER); GPT-4o Mini Transcribe; Free Diarization; tts-1 / tts-1-hd; 99+ Languages (STT); Simple API	140+ Languages (TTS); 500+ Neural Voices; Custom Neural Voice; Speech Translation; Real-time Captions; Avatar Video Synthesis	Chirp 3 HD; WaveNet & Studio Voices; Gemini Integration; 380+ Voices; 75+ Languages; 60 min/mo Free STT; Google Translate Integration	Octave 2 Voice Model; TADA Architecture (1B/3B); Zero Content Hallucinations; 10× Context Efficiency; Emotional Intelligence; 11 Languages	23+ Languages; Standard Neural Voices; REST API	82M Parameters; MOS 4.2 (highest open-source); CPU / Raspberry Pi Capable; 210× Real-time on GPU; Apache 2.0 License; 9 Languages	MIT License; 63.75% Preferred over ElevenLabs; Chatterbox Turbo (sub-200ms); Chatterbox Multilingual (23 languages); PerTh Neural Watermarking; Paralinguistic Tags [laugh] [cough]; 11K+ GitHub Stars; Emotion Control	Apache 2.0 License; 0.77% Chinese WER; 1.24% English WER; 0.6B & 1.7B Variants; 49+ Voice Presets; 12Hz Proprietary Tokenizer; Natural-language Voice Design; 10 Languages	ELO 1,128 (Best Open-weights); Voice Cloning; Open Weights; Commercial API Available	245M Parameters (MIT); Matches Whisper Large-v3; 1/6 the Size of Whisper; Mobile & Embedded Ready; CPU Capable	Service Discontinued; Acquired by Meta (Dec 31, 2025); Migrate to: ElevenLabs, Chatterbox, Kokoro
Languages	30+	74+	45+	99+ 🌍	20+	9+	25+	13+	99+ 🌍	140+ 🌍	75+	11+	23+	9+	23+	10+	15+	1+	N/A
Compliance	SOC2, HIPAA	SOC2, HIPAA, GDPR	HIPAA, SOC2	HIPAA, SOC2 Type 2, ISO 27001:2022, PCI DSS v4.0, GDPR	—	—	Azure Compliance, GDPR, HIPAA	—	SOC2, GDPR	SOC2, HIPAA, ISO 27001, GDPR, FedRAMP	SOC2, HIPAA, ISO 27001, GDPR	—	—	—	—	—	—	—	—
Best For	voice agent, content creation, enterprise	content creation, narration, voice agent, enterprise	voice agent, transcription, analytics, real time	analytics, transcription, understanding, enterprise	voice agent, real time	voice agent, budget, offline	enterprise, transcription, accessibility	voice agent, prototyping	simple app, prototyping, transcription	enterprise, accessibility, global	enterprise, analytics, accessibility	voice agent, content creation	budget, simple app	budget, accessibility, offline	budget, content creation, voice agent, offline	budget, content creation, accessibility	content creation, budget, voice agent	accessibility, offline, budget	—

Data sourced from Artificial Analysis Speech Arena, HuggingFace Open ASR Leaderboard, and official provider documentation. All prices approximate as of April 2026. Benchmark scores may vary by use case.
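
The Pricing row mixes units (per 1M characters, per 100k credits, per minute), so a like-for-like comparison needs a small normalization step. The sketch below is one way to do that arithmetic in Python. The prices are copied from the matrix above and are approximate. The dictionary labels and the function names (`tts_cost`, `stt_cost`, `rank_by_cost`) are illustrative assumptions, not part of any provider's API. Free tiers, subscription plans, and volume discounts are ignored.

```python
"""Rough cost comparison using list prices copied from the Pricing row."""

# USD per 1M characters (TTS). Labels are illustrative, not official SKUs.
TTS_USD_PER_1M_CHARS = {
    "Inworld TTS-1.5 Mini": 15.00,
    "Mistral Voxtral": 16.00,
    "OpenAI tts-1": 15.00,
    "Google WaveNet": 4.00,
    "Hume Octave 2": 7.60,
    "LeanVox Standard": 5.00,
    # Cartesia lists $5 per 100k credits at 1 credit per character,
    # which normalizes to $50 per 1M characters.
    "Cartesia pay-as-you-go": 5.00 * (1_000_000 / 100_000),
}

# USD per audio minute (STT). Free monthly allowances are not modeled.
STT_USD_PER_MINUTE = {
    "AssemblyAI Universal-2": 0.0025,
    "OpenAI GPT-4o Transcribe": 0.006,
    "Azure STT Standard": 0.017,
    "Google STT Standard": 0.024,
}


def tts_cost(chars: int, usd_per_1m_chars: float) -> float:
    """Cost in USD to synthesize `chars` characters."""
    return chars / 1_000_000 * usd_per_1m_chars


def stt_cost(minutes: float, usd_per_minute: float) -> float:
    """Cost in USD to transcribe `minutes` of audio."""
    return minutes * usd_per_minute


def rank_by_cost(prices: dict, units: float, per: float = 1.0) -> list:
    """Return (label, cost) pairs sorted from cheapest to most expensive.

    `per` is the unit size the price is quoted in (1_000_000 for
    per-1M-character prices, 1.0 for per-minute prices).
    """
    costs = {label: units / per * price for label, price in prices.items()}
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical monthly workload: 5M characters of TTS, 1,000 minutes of STT.
    for label, cost in rank_by_cost(TTS_USD_PER_1M_CHARS, 5_000_000, per=1_000_000):
        print(f"TTS  {label:<28} ${cost:,.2f}")
    for label, cost in rank_by_cost(STT_USD_PER_MINUTE, 1_000):
        print(f"STT  {label:<28} ${cost:,.2f}")
```

The workload figures in the `__main__` block are placeholders. Substitute your own volumes and re-check each price against the provider's current documentation before relying on the ranking.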