Voice-First AI App (Assistant or Call Automation)
End-to-end stack for building a low-latency voice AI application — from microphone input to intelligent response to synthesized speech.
The Stack
Vapi AI
— Voice agent platform (full-stack)
Vapi handles the entire voice pipeline: telephony (inbound and outbound calls), STT, LLM routing, TTS, barge-in detection, and call recording. This can cut integration time from weeks to hours.
Alternatives: retell-ai, bland-ai, livekit, daily-co
Deepgram
— Speech-to-text transcription (optional)
Deepgram Nova-3 offers streaming transcription latency under 300 ms, with strong accuracy on noisy telephone audio and support for adapting to domain-specific vocabulary.
Alternatives: assemblyai
ElevenLabs
— Text-to-speech synthesis (optional)
ElevenLabs Turbo v2.5 generates human-quality speech at roughly 300 ms time to first byte (TTFB) and supports voice cloning from a 30-second sample, which matters for branded or persona-driven voice products.
Alternatives: playht, cartesia, coqui-ai
Cartesia
— Ultra-low-latency TTS (optional)
Cartesia's Sonic model delivers some of the lowest TTS latency available (~90 ms TTFB) at the cost of slightly less natural speech. Use it when the call flow requires faster back-and-forth.
OpenAI
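— Conversational LLM
GPT-4o-realtime provides native voice-in/voice-out and, because it processes audio directly rather than a transcript, has some sensitivity to the speaker's tone. GPT-4o with streaming text output works for orchestration-based pipelines (STT, then LLM, then TTS).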
Alternatives: anthropic, groq, deepseek-api, minimax-ai, iflytek-spark
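In an orchestration-based pipeline, a common way to keep latency down is to pass LLM output to TTS one sentence at a time instead of waiting for the full reply. The sketch below illustrates that idea only. The function name `sentences_from_tokens` is hypothetical, the token stream is any iterable of text fragments (not a specific vendor SDK), and the sentence rule is a deliberately naive assumption that will split on abbreviations such as "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM text fragments into complete sentences.

    Each yielded sentence can be sent to TTS right away, so speech can
    start before the LLM has finished generating the whole reply.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be incomplete; keep buffering it.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hi", " there", ". How", " can I", " help", "? ", "Sure"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)
```

With this approach the first sentence reaches TTS as soon as its closing punctuation arrives, which is why LLM time to first token matters more than total generation time.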
Retell AI
— Alternative voice agent platform (optional)
Retell AI offers a no-code agent builder and pre-built telephony integrations. It suits teams that need a fast production deployment without custom pipeline wiring.
Bland AI
— Outbound call automation (optional)
Bland AI is purpose-built for high-volume outbound calling (lead qualification, scheduling, surveys) with built-in compliance controls for TCPA and GDPR.
LiveKit
— Real-time audio/video WebRTC infrastructure (optional)
LiveKit handles WebRTC signaling, TURN servers, and media routing for browser-based voice apps, so you do not have to run your own WebRTC infrastructure.
Alternatives: daily-co, agora
Langfuse
— Conversation quality tracing (optional)
Langfuse traces each voice turn (transcription text, LLM response, TTS latency, and user interruption events) so you can identify which pipeline stage is degrading call quality.
Alternatives: langsmith, agentops
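To make the idea of per-turn tracing concrete, here is a minimal plain-Python sketch of the kind of record a tracing tool would hold for one voice turn. It does not use the Langfuse SDK. The class name `TurnTrace`, its fields, and the stage labels are all hypothetical, and the 800 ms default budget is taken from the perception threshold this page cites.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TurnTrace:
    """One voice turn: the texts involved plus a latency (ms) per stage."""

    transcript: str                      # what STT heard
    response: str                        # what the LLM answered
    stage_ms: Dict[str, float] = field(default_factory=dict)
    interrupted: bool = False            # did the user barge in?

    def total_ms(self) -> float:
        """Sum of stage latencies, assuming the stages run sequentially."""
        return sum(self.stage_ms.values())

    def slowest_stage(self) -> str:
        """Name of the stage with the highest latency (needs >= 1 stage)."""
        return max(self.stage_ms, key=self.stage_ms.get)

    def over_budget(self, budget_ms: float = 800.0) -> bool:
        """True when the turn exceeds the latency budget."""
        return self.total_ms() > budget_ms


if __name__ == "__main__":
    trace = TurnTrace(
        transcript="What are your opening hours?",
        response="We are open nine to five on weekdays.",
        stage_ms={"stt": 250.0, "llm_ttft": 400.0, "tts_ttfb": 300.0},
    )
    print(trace.total_ms(), trace.slowest_stage(), trace.over_budget())
```

Aggregating `slowest_stage()` over many turns is one simple way to see which stage most often pushes calls over budget.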
Gotchas
- ⚠️ Latency adds up across stages: 400ms STT + 400ms LLM TTFT + 300ms TTS = 1.1s before the user hears anything. Users perceive anything over ~800ms as slow, so choose Groq for the LLM and Cartesia for TTS when speed matters more than quality.
- ⚠️ Barge-in (the user interrupting the bot mid-speech) must be handled in the real-time audio layer, not in application logic, so that playback stops the moment the user starts talking. Vapi and Retell handle this for you; a custom pipeline needs voice activity detection (VAD) wired into the audio stream.
- ⚠️ Telephone audio quality (8kHz G.711) degrades transcription accuracy significantly, and it also limits how natural TTS output sounds to the caller. For inbound PSTN calls, use a Deepgram model tuned for phone audio rather than the general-purpose Nova model.
- ⚠️ Voice cloning with ElevenLabs requires consent from the person whose voice is cloned, and the documentation required varies by jurisdiction. Check local laws before using a cloned voice in production calls.
- ⚠️ Stored call transcripts, including any LLM-generated text derived from them, may be classified as personal data under GDPR/CCPA. Implement configurable retention policies and deletion endpoints from day one.
- ⚠️ ElevenLabs latency can spike under heavy load. Implement a TTS fallback (e.g. Cartesia) or queue management to avoid dead air during peak traffic.
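The TTS fallback mentioned in the last gotcha can be sketched as a timeout around the primary provider. This is a minimal illustration under stated assumptions, not either vendor's API: `synthesize_with_fallback`, the `Synthesizer` type, and the 0.5 s default timeout are hypothetical, and each provider is represented by any async function that takes text and returns audio bytes.

```python
import asyncio
from typing import Awaitable, Callable

# Assumption: each TTS provider is wrapped as "async text -> audio bytes".
Synthesizer = Callable[[str], Awaitable[bytes]]


async def synthesize_with_fallback(
    text: str,
    primary: Synthesizer,
    fallback: Synthesizer,
    timeout_s: float = 0.5,
) -> bytes:
    """Try the primary TTS; use the fallback on timeout or connection error."""
    try:
        # wait_for cancels the primary call if it exceeds the timeout.
        return await asyncio.wait_for(primary(text), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return await fallback(text)


if __name__ == "__main__":
    async def slow_primary(text: str) -> bytes:
        await asyncio.sleep(1.0)  # simulates a latency spike
        return b"primary-audio"

    async def fast_fallback(text: str) -> bytes:
        return b"fallback-audio"

    audio = asyncio.run(
        synthesize_with_fallback("Hello", slow_primary, fast_fallback, 0.05)
    )
    print(audio)
```

In a streaming setup the timeout would more likely apply to the time to first audio chunk than to the whole synthesis call. Note that the fallback voice will sound different, which may matter for branded voices.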
Related Stacks
Customer-Facing AI Chatbot SaaS
Production stack for shipping a multi-tenant AI chatbot with streaming, memory, guardrails, and usage-based billing.
Multi-Agent Autonomous Platform
Stack for building production multi-agent systems that browse the web, write and run code, use tools, and complete long-horizon tasks autonomously.