AI Infrastructure Stack
Voice AI Stack
Transcription, text-to-speech, and voice agents. Whether you are adding voice features to an existing product or building a standalone audio pipeline, these are the building blocks.
Transcription (Speech-to-Text)
Turn audio into text. For real-time use cases (live captions, voice agents), streaming latency matters. For batch processing (meeting notes, podcast transcription), accuracy and cost per hour matter more.
Deepgram's Nova-3 model. The most used API for real-time voice agents. Streaming latency under 300 ms, good noise robustness, and 36 languages.
AssemblyAI's Universal-2 model with strong accuracy. Also offers Slam-1, a speech-language model that combines transcription with understanding. Good for structured audio analysis.
Gladia, a French company (EU-hosted). Its Solaria model has anti-hallucination features. Strong on multilingual transcription and code-switching (mixing languages in one stream).
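Streaming transcription APIs typically emit a series of interim (partial) transcripts followed by a final one, and the client is responsible for distinguishing the two. A minimal sketch of that consumer logic, using a hypothetical `TranscriptEvent` shape as a stand-in for any provider's streaming events:

```python
from dataclasses import dataclass

# Hypothetical event shape; real streaming STT APIs emit similar
# interim/final transcript events over a WebSocket.
@dataclass
class TranscriptEvent:
    text: str
    is_final: bool

def consume_stream(events):
    """Maintain a live caption from interim events; commit only finals."""
    committed = []
    live = ""
    for ev in events:
        if ev.is_final:
            committed.append(ev.text)  # final: append to the transcript
            live = ""
        else:
            live = ev.text             # interim: overwrite the live caption
    return " ".join(committed), live

events = [
    TranscriptEvent("hel", False),
    TranscriptEvent("hello wor", False),
    TranscriptEvent("hello world", True),
    TranscriptEvent("how are", False),
]
final_text, pending = consume_stream(events)
print(final_text)  # -> hello world
print(pending)     # -> how are
```

The same split matters for voice agents: you can start the LLM on interim text for lower latency, but only final transcripts are safe to commit.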
Text-to-Speech
Generate spoken audio from text. For voice agents, time-to-first-audio (TTFA) is critical: you want the user to hear something within about 100 ms. For content generation (audiobooks, videos), voice quality and expressiveness matter more.
ElevenLabs' Flash v2.5 for low-latency voice agents, Eleven v3 for higher-quality output. Voice cloning and 74 languages. The largest TTS provider by adoption.
Cartesia's Sonic 3 model. Very low time-to-first-audio, purpose-built for real-time voice agents. Competitive per-character pricing.
TTS API with multiple voices. Straightforward to use if you are already on the OpenAI platform. Less customizable than ElevenLabs, but simpler integration.
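Since TTFA is the metric that matters for voice agents, it is worth measuring it directly rather than relying on vendor numbers. A sketch of that measurement against a stubbed streaming synthesizer (the `fake_tts_stream` generator is a stand-in for any provider's chunked audio stream):

```python
import time

def fake_tts_stream(text, chunk_delay_s=0.02):
    # Stand-in for a streaming TTS API: yields audio chunks as they are
    # synthesized. A real client would receive bytes over a WebSocket.
    for _word in text.split():
        time.sleep(chunk_delay_s)
        yield b"\x00" * 320  # placeholder chunk of PCM audio

def time_to_first_audio(stream):
    """Elapsed seconds until the first audio chunk arrives."""
    start = time.monotonic()
    first_chunk = next(stream)
    return time.monotonic() - start, first_chunk

ttfa, chunk = time_to_first_audio(fake_tts_stream("hello there"))
print(f"TTFA: {ttfa * 1000:.0f} ms")
```

Run the same harness against each candidate provider's real stream, from the region where your agent will actually be deployed, since network distance often dominates.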
Voice Agent Frameworks
A voice agent pipeline is: microphone → STT → LLM → TTS → speaker, running in real time. The challenge is orchestrating these steps with low enough latency that conversation feels natural. You can buy a managed platform or build with an open-source framework.
Open-source framework for real-time voice and video agents. WebRTC-native. Plugin system for mixing STT, LLM, and TTS providers. The most popular open-source option for custom voice pipelines.
Conversational AI platform that combines the vendor's own TTS with an LLM in a managed pipeline. Simpler than building your own if voice quality is the priority.
Emotion-aware voice AI from Hume. Their EVI (Empathic Voice Interface) responds to tone and emotional cues. Useful for support and wellness applications.
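The microphone → STT → LLM → TTS → speaker pipeline described above can be sketched as a chain of async stages. The stubs below are placeholders for real provider calls, but the shape (each stage as a coroutine, one turn as a composition) is what the frameworks orchestrate for you, with streaming between stages added on top:

```python
import asyncio

# Stub stages standing in for real STT/LLM/TTS providers; each simulates
# a network call with a small sleep.
async def stt(audio: bytes) -> str:
    await asyncio.sleep(0.01)
    return "what time is it"

async def llm(text: str) -> str:
    await asyncio.sleep(0.01)
    return f"You asked: {text}"

async def tts(text: str) -> bytes:
    await asyncio.sleep(0.01)
    return text.encode()  # placeholder "audio"

async def handle_turn(mic_audio: bytes) -> bytes:
    # One conversational turn: mic -> STT -> LLM -> TTS -> speaker.
    text = await stt(mic_audio)
    reply = await llm(text)
    return await tts(reply)

speaker_audio = asyncio.run(handle_turn(b"mic frames"))
print(speaker_audio)  # -> b'You asked: what time is it'
```

Real frameworks add what this sketch omits: streaming partial results between stages, interruption handling (the user talking over the agent), and WebRTC transport.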
Things to keep in mind
- For voice agents, the whole pipeline (STT → LLM → TTS) needs to stay under ~1 second total for conversation to feel natural. Each component adds latency, so pick providers with low streaming latency and test the full round trip.
- Self-hosting Whisper is viable for batch transcription but hard to beat the managed APIs on streaming latency and accuracy. If real-time is not a requirement, Whisper large-v3-turbo is a good self-hosted option.
- Voice cloning and custom voices are available from ElevenLabs, Cartesia, and LMNT. If your product has a brand voice, this matters.
- Open-source TTS has improved significantly. Kokoro-82M and Fish Speech are self-hostable with good quality. Worth evaluating if you need to control costs at scale.
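The ~1 second budget from the first point above is easiest to reason about as a per-stage ledger. The stage latencies below are illustrative assumptions, not measurements of any provider; the point is that the stages sum quickly, so every component needs headroom:

```python
# Hypothetical per-stage latencies (ms) for one conversational turn.
BUDGET_MS = 1000  # ~1 s end-to-end for natural-feeling conversation

stages = {
    "endpointing (detect end of speech)": 200,
    "STT final transcript": 150,
    "LLM first token": 350,
    "TTS first audio (TTFA)": 150,
    "network + playback buffer": 100,
}

total = sum(stages.values())
for name, ms in stages.items():
    print(f"{name:36s} {ms:4d} ms")
status = "within" if total <= BUDGET_MS else "over"
print(f"{'total':36s} {total:4d} ms  ({status} budget)")
```

Even with optimistic numbers, the budget is nearly spent, which is why techniques like streaming the LLM output into TTS sentence-by-sentence are common.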