AI Infrastructure Stack

Voice AI Stack

Transcription, text-to-speech, and voice agents. Whether you are adding voice features to an existing product or building a standalone audio pipeline, these are the building blocks.

🎤 Speech-to-text 🔊 Text-to-speech 🎙️ Voice agents

Things to keep in mind

  • For voice agents, the whole pipeline (STT → LLM → TTS) needs to stay under ~1 second total for conversation to feel natural. Each component adds latency, so pick providers with low streaming latency and test the full round trip.
  • Self-hosting Whisper is viable for batch transcription, but a self-hosted setup will struggle to beat the managed APIs on streaming latency and accuracy. If real-time is not a requirement, Whisper large-v3-turbo is a good self-hosted option.
  • Voice cloning and custom voices are available from ElevenLabs, Cartesia, and LMNT. If your product has a brand voice, this matters.
  • Open-source TTS has improved significantly. Kokoro-82M and Fish Speech are self-hostable with good quality. Worth evaluating if you need to control costs at scale.
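
To make the ~1 second budget from the first point concrete, here is a minimal sketch that adds up per-stage latencies and reports the remaining headroom. The stage names and millisecond values are placeholders invented for illustration, not measurements of any provider, and the helper `check_budget` is hypothetical. The sketch also assumes the stages run strictly one after another, which overstates the total if your pipeline overlaps them through streaming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # time until this stage hands off its first usable output


def check_budget(stages, budget_ms=1000.0):
    """Sum sequential stage latencies and compare the total against the budget."""
    total = sum(s.latency_ms for s in stages)
    slowest = max(stages, key=lambda s: s.latency_ms)
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        "slowest_stage": slowest.name,
    }


# Placeholder numbers for illustration only; replace them with measured values.
pipeline = [
    Stage("stt_final_transcript", 250.0),
    Stage("llm_first_token", 400.0),
    Stage("tts_first_audio", 200.0),
    Stage("network_overhead", 100.0),
]

report = check_budget(pipeline)
print(report)
```

Feeding in numbers measured from the full round trip, rather than figures quoted by each provider, shows which stage to optimize first and how much room is left before conversation starts to feel slow.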
