TTS Module

Text-to-speech synthesis with word-level timestamp alignment, powered by Kokoro (ONNX) and Whisper.

TtsService

synthesizeWithTimestamps(text: string): Promise<Result<SynthesisResult, TtsError>>

  1. Sanitizes input text
  2. Synthesizes speech via KokoroProvider
  3. Extracts word-level timestamps via WhisperAlignerService
  4. Returns the audio path, the word timestamps, and the total duration as a SynthesisResult
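A caller has to branch on the returned Result before reading the synthesis data. The sketch below shows one way to do that. The `Result` shape (`ok` / `value` / `error`) and the `TtsError` fields are assumptions made for illustration, since the module's actual definitions are not shown on this page.

```typescript
// Hypothetical shapes: the module's real Result and TtsError types are not documented here.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
}

interface SynthesisResult {
  audioPath: string;
  wordTimestamps: WordTimestamp[];
  durationMs: number;
}

// Assumed fields; only the error codes themselves are documented.
interface TtsError {
  code: string;
  message: string;
}

// Turn a synthesis outcome into a one-line summary without throwing.
function summarize(result: Result<SynthesisResult, TtsError>): string {
  if (!result.ok) {
    return `failed: ${result.error.code} (${result.error.message})`;
  }
  const { audioPath, wordTimestamps, durationMs } = result.value;
  return `${audioPath}: ${wordTimestamps.length} words, ${durationMs} ms`;
}
```

In real use the argument would come from `await ttsService.synthesizeWithTimestamps(text)`, where `ttsService` is a placeholder name for however the service instance is obtained.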

Sub-components

KokoroProvider

ONNX-based TTS model (Kokoro-82M-v1.0). Loads the model on startup and synthesizes text to WAV audio.

  • Model: configurable via KOKORO_MODEL_ID
  • Voice: configurable via KOKORO_VOICE (default: af_heart)
  • Quantization: configurable via KOKORO_DTYPE (default: q8)
  • isReady(): boolean — returns true when model is loaded
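The settings above could be resolved from the environment roughly as follows. This is a sketch, not the provider's actual loading code: `KokoroConfig` and `readKokoroConfig` are hypothetical names, and because no default is documented for KOKORO_MODEL_ID, the sketch leaves it undefined when unset.

```typescript
// Hypothetical config shape for illustration.
interface KokoroConfig {
  modelId: string | undefined; // no default is documented, so none is assumed
  voice: string;
  dtype: string;
}

// Resolve the Kokoro settings from an environment-like object.
// Empty strings are treated as unset (an assumption of this sketch).
function readKokoroConfig(env: Record<string, string | undefined>): KokoroConfig {
  return {
    modelId: env["KOKORO_MODEL_ID"] || undefined,
    voice: env["KOKORO_VOICE"] || "af_heart", // documented default
    dtype: env["KOKORO_DTYPE"] || "q8", // documented default
  };
}
```

In a Node.js process, `readKokoroConfig(process.env)` would pick up the real variables.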

WhisperAlignerService

Uses OpenAI Whisper to transcribe the generated audio and extract word boundaries with timestamps. Supports multiple timestamp formats.
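Which timestamp formats the aligner accepts is not listed here. As one illustration, the sketch below normalizes a seconds-based word list into the millisecond-based WordTimestamp shape. `RawWord` and `toWordTimestamps` are hypothetical names, and the seconds-based input is an assumption rather than a documented format.

```typescript
interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
}

// Hypothetical raw shape: one word with start/end offsets in seconds.
interface RawWord {
  word: string;
  start: number;
  end: number;
}

// Convert seconds to whole milliseconds, trim padding around each word,
// and drop entries that are empty after trimming.
function toWordTimestamps(raw: RawWord[]): WordTimestamp[] {
  const out: WordTimestamp[] = [];
  for (const w of raw) {
    const word = w.word.trim();
    if (word.length === 0) continue;
    out.push({
      word,
      startMs: Math.round(w.start * 1000),
      endMs: Math.round(w.end * 1000),
    });
  }
  return out;
}
```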

SynthesisResult

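```typescript
interface SynthesisResult {
  audioPath: string;               // path to the synthesized audio file
  wordTimestamps: WordTimestamp[]; // one entry per aligned word
  durationMs: number;              // audio duration in milliseconds
}
```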

WordTimestamp

```typescript
interface WordTimestamp {
  word: string;
  startMs: number; // start of the word, in milliseconds
  endMs: number;   // end of the word, in milliseconds
}
```
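A common use of word timestamps is highlighting the word being spoken during playback. The sketch below looks up the active word for a playback position. `findWordAt` is a hypothetical helper, and it assumes that `startMs` and `endMs` are offsets from the start of the audio and that each word covers the half-open interval from `startMs` up to, but not including, `endMs`.

```typescript
interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
}

// Return the word being spoken at timeMs, or undefined during silence.
function findWordAt(timestamps: WordTimestamp[], timeMs: number): WordTimestamp | undefined {
  for (const t of timestamps) {
    if (timeMs >= t.startMs && timeMs < t.endMs) {
      return t;
    }
  }
  return undefined;
}
```

A linear scan is enough for short clips. For long audio, a binary search over the (sorted) start times would avoid rescanning on every playback tick.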

Text Processing

chunkTextBySentence(text: string): TextChunk[]

Splits long text into sentence-level chunks for more natural TTS synthesis. Handles edge cases like abbreviations and decimal numbers.
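To illustrate the kind of edge cases involved, here is a simplified splitter. It is a sketch, not the module's implementation: the `TextChunk` fields, the abbreviation list, and the name `splitSentences` are all assumptions. It treats `.`, `!`, or `?` as a sentence end only when followed by whitespace or the end of the text, which keeps decimals such as 3.14 intact, and it skips periods that follow a known abbreviation.

```typescript
// Hypothetical chunk shape; the real TextChunk type is not shown on this page.
interface TextChunk {
  index: number;
  text: string;
}

// Small illustrative list, stored without the trailing period.
const ABBREVIATIONS = ["mr", "mrs", "ms", "dr", "prof", "etc", "vs", "e.g", "i.e"];

function splitSentences(text: string): TextChunk[] {
  const chunks: TextChunk[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (ch !== "." && ch !== "!" && ch !== "?") continue;
    // Require whitespace or end of text after the mark, so "3.14" is not split.
    const next = text.charAt(i + 1);
    if (next !== "" && !/\s/.test(next)) continue;
    if (ch === ".") {
      // Skip periods that terminate a known abbreviation such as "Dr.".
      const words = text.slice(start, i).split(/\s+/);
      const lastWord = words[words.length - 1] || "";
      if (ABBREVIATIONS.indexOf(lastWord.toLowerCase()) !== -1) continue;
    }
    const sentence = text.slice(start, i + 1).trim();
    if (sentence) chunks.push({ index: chunks.length, text: sentence });
    start = i + 1;
  }
  // Keep any trailing text that has no closing punctuation.
  const rest = text.slice(start).trim();
  if (rest) chunks.push({ index: chunks.length, text: rest });
  return chunks;
}
```

For example, `splitSentences("Dr. Smith paid 3.14 dollars. Was it fair? Yes!")` yields three chunks: "Dr. Smith paid 3.14 dollars.", "Was it fair?", and "Yes!". An abbreviation that genuinely ends a sentence (such as a closing "etc.") will not be split by this heuristic.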

Error Codes

| Code | Description |
| --- | --- |
| MODEL_LOAD_FAILED | Kokoro ONNX model failed to initialize |
| SYNTHESIS_FAILED | Audio generation error |
| CHUNKING_FAILED | Text splitting error |
| CONCATENATION_FAILED | Audio chunk joining error |
| ALIGNMENT_FAILED | Whisper timestamp extraction error |
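Callers can use these codes to tell which stage failed. The sketch below models them as a string-literal union with an exhaustive switch, so the compiler flags any code that is added later but not handled. Whether the module exposes the codes as string literals or as an enum is an assumption here, and `failedStage` is a hypothetical helper.

```typescript
// Assumed representation: the codes as a string-literal union.
type TtsErrorCode =
  | "MODEL_LOAD_FAILED"
  | "SYNTHESIS_FAILED"
  | "CHUNKING_FAILED"
  | "CONCATENATION_FAILED"
  | "ALIGNMENT_FAILED";

// Map an error code to the pipeline stage that failed.
function failedStage(code: TtsErrorCode): string {
  switch (code) {
    case "MODEL_LOAD_FAILED":
      return "model initialization";
    case "SYNTHESIS_FAILED":
      return "audio generation";
    case "CHUNKING_FAILED":
      return "text splitting";
    case "CONCATENATION_FAILED":
      return "audio chunk joining";
    case "ALIGNMENT_FAILED":
      return "timestamp extraction";
    default: {
      // Compile-time exhaustiveness check: this assignment fails to
      // type-check if a new code is added to the union but not handled.
      const unreachable: never = code;
      return unreachable;
    }
  }
}
```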