
Architecture

How Voicex is structured — from the database layer to real-time voice streaming.


High-Level System Overview

Voicex is a multi-tenant SaaS platform with two main subsystems:

  1. Dashboard — REST API + Next.js frontend for managing agents, providers, calls, and settings.
  2. Voice Engine — WebSocket-based real-time pipeline for STT → LLM → TTS conversations.

Project Structure

voicex/
├── frontend/                        # Next.js 14 (port 3000)
│   └── src/
│       ├── app/                     # App Router pages
│       │   ├── login/               # Sign in
│       │   ├── signup/              # Sign up
│       │   ├── pending/             # Account pending verification
│       │   └── dashboard/           # Protected dashboard
│       │       ├── agents/          # Agent list + detail/edit
│       │       ├── calls/           # Call list + detail
│       │       ├── playground/      # Live voice testing
│       │       ├── analytics/       # Usage charts
│       │       ├── settings/        # Account, API keys, provider link
│       │       └── providers/       # Custom provider management
│       ├── components/              # VoiceAssistant, Toast, ConfirmDialog, etc.
│       ├── lib/                     # API client, plan context, voice hook
│       └── utils/                   # cn() class utility

├── backend/                         # Node.js + Express + WS (port 3001)
│   └── src/
│       ├── db/                      # schema.ts (interfaces), client.ts (connection)
│       ├── repositories/            # Data access: agent, provider, plan, call, user, org, apikey
│       ├── routes/                  # REST endpoints: auth, dashboard, health, setup, twilio
│       ├── services/                # Voice session, pipeline, context, call summary, auth
│       ├── handlers/                # WebSocket connection handlers (voice, twilio)
│       ├── ws/                      # WebSocket gateway (routing, auth, rate limit)
│       ├── middleware/              # Auth, rate-limit, API limiter
│       ├── providers/               # STT (Deepgram), LLM (Groq/OpenAI/Ollama), TTS (ElevenLabs/OpenAI/Edge/System), Call channels
│       ├── shared/                  # Logger, errors, encryption, audio utils, WS types
│       ├── config/                  # Environment (Zod), voice config
│       └── scripts/                 # Seed plans, seed providers, seed test data, migrations

├── docs/                            # This documentation (VitePress)
│   └── src/

└── scripts/                         # Bash helpers for seed scripts

Backend Component Map


Real-Time Voice Pipeline

The core pipeline processes a single conversational turn. Each stage streams data to the next — the AI starts speaking before it finishes thinking.
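The streaming hand-off between stages can be sketched with async generators. This is an illustrative shape, not the actual Voicex code: the function names, the hard-coded tokens, and the fake "audio" are all stand-ins.

```typescript
type AudioChunk = Uint8Array;

// Stand-in for a streaming LLM call (e.g. Groq chat completions).
async function* llmTokens(_prompt: string): AsyncGenerator<string> {
  for (const token of ["Sure", ", I can", " help."]) yield token;
}

// Stand-in for streaming TTS: each text fragment is "synthesized"
// the moment it arrives, instead of waiting for the full reply.
async function* ttsAudio(tokens: AsyncIterable<string>): AsyncGenerator<AudioChunk> {
  for await (const token of tokens) {
    yield new TextEncoder().encode(token); // pretend this is PCM audio
  }
}

export async function runTurn(transcript: string): Promise<number> {
  let chunksSent = 0;
  for await (const _chunk of ttsAudio(llmTokens(transcript))) {
    chunksSent++; // ws.send(chunk) in the real pipeline
  }
  return chunksSent;
}
```

Because each stage is an async iterator over the previous one, the first audio chunk goes out as soon as the first LLM token lands.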

Detailed Data Flow (One Turn)


Interrupt & Echo Suppression

When the user starts talking while the assistant is speaking, the system immediately stops and listens.

**Echo suppression:** While `assistantSpeaking = true`, all STT events are suppressed. After the pipeline finishes, an 800ms cooldown prevents tail-end speaker audio from triggering false interrupts.
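A minimal sketch of that gate, taking the description above literally (class and method names are assumptions, not the real implementation): events are dropped while the assistant is speaking and for the cooldown window after it stops.

```typescript
const ECHO_COOLDOWN_MS = 800;

class SttGate {
  private assistantSpeaking = false;
  private cooldownUntil = 0;

  onAssistantStart(): void {
    this.assistantSpeaking = true;
  }

  onAssistantEnd(now: number = Date.now()): void {
    this.assistantSpeaking = false;
    // Absorb tail-end speaker audio that STT may still transcribe.
    this.cooldownUntil = now + ECHO_COOLDOWN_MS;
  }

  shouldAccept(now: number = Date.now()): boolean {
    return !this.assistantSpeaking && now >= this.cooldownUntil;
  }
}
```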

Interruption sensitivity levels:

| Level | Min chars to trigger | Use case |
|---|---|---|
| `low` | 5 characters | Formal conversations, reduce false interrupts |
| `medium` (default) | 2 characters | Balanced |
| `high` | 1 character | Fast-paced, customer support |
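The threshold check itself is simple; here is one way to express the table (the `isInterrupt` helper is illustrative, not the actual function name):

```typescript
type Sensitivity = "low" | "medium" | "high";

// Minimum transcript length that counts as an interrupt, per the table above.
const MIN_CHARS: Record<Sensitivity, number> = { low: 5, medium: 2, high: 1 };

function isInterrupt(transcript: string, level: Sensitivity = "medium"): boolean {
  return transcript.trim().length >= MIN_CHARS[level];
}
```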

Abort Chain

Every async operation accepts an AbortSignal, so nothing leaks when the user interrupts: one abort cancels the whole in-flight turn.
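A sketch of that pattern, assuming one AbortController per turn (the `Turn` class and `synthesize` function are hypothetical names):

```typescript
async function synthesize(text: string, signal: AbortSignal): Promise<string> {
  // In the real pipeline this would be e.g. fetch(ttsUrl, { signal }).
  if (signal.aborted) throw new Error("aborted");
  return `spoke: ${text}`;
}

class Turn {
  private controller = new AbortController();

  run(text: string): Promise<string> {
    // The same signal would be threaded through the STT, LLM and TTS calls.
    return synthesize(text, this.controller.signal);
  }

  interrupt(): void {
    this.controller.abort(); // cancels every stage holding this signal
  }
}
```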


Data Architecture

See Database Schema for full details. Here's the relationship overview:

Key Relationships

| From | To | Field | Description |
|---|---|---|---|
| organizations | plans | `planId` | Which plan the org is on |
| agents | providers | `llmProviderId` | Which LLM provider the agent uses |
| agents | providers | `ttsProviderId` | Which TTS provider the agent uses |
| agents | providers | `sttProviderId` | Which STT provider the agent uses |
| providers | organizations | `orgId` | Owner org (null = global/platform provider) |
| calls | agents | `agentId` | Which agent handled the call |
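In interface form, the relationships above look roughly like this (only the fields from the table; everything else in the real `schema.ts` is omitted):

```typescript
interface Organization {
  id: string;
  planId: string; // → plans
}

interface Agent {
  id: string;
  llmProviderId: string; // → providers
  ttsProviderId: string; // → providers
  sttProviderId: string; // → providers
}

interface Provider {
  id: string;
  orgId: string | null; // → organizations; null = global/platform provider
}

interface Call {
  id: string;
  agentId: string; // → agents
}
```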

Provider Architecture

Providers are stored in a single unified `providers` collection. There are two types:

| Type | `orgId` | Who manages | Example |
|---|---|---|---|
| Global | `null` | Platform admin | Groq, OpenAI, Ollama, ElevenLabs, Edge, Deepgram |
| Client | `<orgId>` | Client via dashboard | Client's own OpenAI key, custom ElevenLabs voice |

When an agent references a provider (e.g., `llmProviderId`), the system:

  1. Fetches the provider document
  2. If global (`orgId: null`), checks that the org's plan allows the selected model
  3. If client-owned, allows it (client providers bypass plan model checks)
  4. Decrypts credentials at runtime using AES-256-GCM
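The steps above can be sketched as a single resolution function (the names `resolveCredentials`, `encryptedCredentials`, and `allowedModels` are assumptions, and `decrypt` stands in for the AES-256-GCM routine):

```typescript
interface ProviderDoc {
  orgId: string | null;          // null = global/platform provider
  encryptedCredentials: string;  // stored encrypted, never in plaintext
}

interface PlanDoc {
  allowedModels: string[];
}

function resolveCredentials(
  provider: ProviderDoc,
  plan: PlanDoc,
  model: string,
  decrypt: (blob: string) => string, // AES-256-GCM in the real system
): string {
  // Step 2: global providers are gated by the org's plan.
  if (provider.orgId === null && !plan.allowedModels.includes(model)) {
    throw new Error(`model "${model}" not allowed on current plan`);
  }
  // Step 3: client-owned providers (orgId set) skip the plan check.
  // Step 4: credentials are only decrypted at runtime.
  return decrypt(provider.encryptedCredentials);
}
```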

See Providers for full details.


Agent Status Computation

Agent status is computed server-side by checking three conditions:

| Status | Meaning | User action |
|---|---|---|
| `active` | Agent is fully operational | None |
| `inactive` | Agent is manually deactivated | Toggle active on |
| `paused_provider` | A referenced provider is disabled | Edit agent to use a different provider, or re-enable the provider |
| `paused_plan` | A global model requires a higher plan | Upgrade plan or switch to a model included in the current plan |
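The three conditions reduce to a short precedence chain; a sketch (the `computeStatus` name and the flag names are illustrative, not the actual server code):

```typescript
type AgentStatus = "active" | "inactive" | "paused_provider" | "paused_plan";

function computeStatus(check: {
  manuallyActive: boolean;   // the agent's own active toggle
  providersEnabled: boolean; // every referenced provider is enabled
  planAllowsModels: boolean; // plan covers all global models the agent uses
}): AgentStatus {
  if (!check.manuallyActive) return "inactive";
  if (!check.providersEnabled) return "paused_provider";
  if (!check.planAllowsModels) return "paused_plan";
  return "active";
}
```

The ordering matters: a manually deactivated agent reports `inactive` even if its providers or plan would also pause it.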

Latency Breakdown

Where time goes in a typical voice turn:

| Stage | Provider | Typical latency |
|---|---|---|
| STT endpointing | Deepgram | ~300ms |
| LLM first token | Groq | ~150-250ms |
| TTS first audio | ElevenLabs | ~200-300ms |
| Network + decode | WebSocket | ~50-100ms |
| **Total to first audio** | | **~700-950ms** |

Startup Sequence
