TTS Module

Text-to-speech synthesis with word-level timestamp alignment, powered by Kokoro (ONNX) and Whisper.

`TtsService`

Runs the ONNX-based Kokoro TTS model (Kokoro-82M-v1.0). The service loads the model on startup and synthesizes text to WAV audio.
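As a rough illustration of the WAV output step, the sketch below packs raw float samples into a 16-bit PCM mono WAV buffer. The function name `encodeWav`, the mono 16-bit layout, and the 24000 Hz rate in the usage line are assumptions made for this example; the service's actual encoding path is not documented here.

```typescript
// Hypothetical helper: encodes mono float samples in [-1, 1] as 16-bit PCM WAV bytes.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte canonical PCM header
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling to the signed 16-bit range.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}

// Usage: the sample rate here is an assumed value, not taken from the module.
const wavBytes = encodeWav(new Float32Array([0, 0.5, -0.5]), 24000);
```

The duration of such a buffer in milliseconds is `samples.length / sampleRate * 1000`, which is one plausible way to derive a `durationMs` value.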

Uses OpenAI Whisper to transcribe the generated audio and extract word boundaries with timestamps. Supports multiple timestamp formats.

```typescript
interface SynthesisResult {
  audioPath: string;
  wordTimestamps: WordTimestamp[];
  durationMs: number;
}
```

```typescript
interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
}
```
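Because text is synthesized in chunks and the audio is then joined, per-chunk timestamps presumably need to be shifted onto a single timeline. The sketch below shows one way to do that with the two interfaces above. `mergeWordTimestamps` is a hypothetical helper, and it assumes the chunks are concatenated back to back with no added silence.

```typescript
interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
}

interface SynthesisResult {
  audioPath: string;
  wordTimestamps: WordTimestamp[];
  durationMs: number;
}

// Hypothetical helper: shifts each chunk's word timestamps by the total
// duration of the chunks before it, giving one timeline for the joined audio.
function mergeWordTimestamps(chunks: SynthesisResult[]): WordTimestamp[] {
  const merged: WordTimestamp[] = [];
  let offsetMs = 0;
  for (const chunk of chunks) {
    for (const w of chunk.wordTimestamps) {
      merged.push({
        word: w.word,
        startMs: w.startMs + offsetMs,
        endMs: w.endMs + offsetMs,
      });
    }
    // Assumes no silence is inserted between chunks during concatenation.
    offsetMs += chunk.durationMs;
  }
  return merged;
}

// Usage with made-up chunk data: the second chunk is shifted by 900 ms.
const timeline = mergeWordTimestamps([
  {
    audioPath: "chunk-0.wav",
    durationMs: 900,
    wordTimestamps: [{ word: "Hello", startMs: 0, endMs: 400 }],
  },
  {
    audioPath: "chunk-1.wav",
    durationMs: 700,
    wordTimestamps: [{ word: "world", startMs: 100, endMs: 500 }],
  },
]);
```

If the joining step pads chunks with silence, the padding length would have to be added to `offsetMs` as well.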

Splits long text into sentence-level chunks for more natural TTS synthesis. Handles edge cases like abbreviations and decimal numbers.
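A minimal sketch of sentence-level splitting that respects abbreviations and decimal numbers might look like the following. The function name `splitIntoSentences` and the abbreviation list are assumptions for illustration; the module's real rules may differ.

```typescript
// Assumed abbreviation list; the module's actual list is not documented here.
const ABBREVIATIONS = new Set(["mr", "mrs", "ms", "dr", "prof", "vs", "etc", "e.g", "i.e"]);

// Hypothetical helper: splits text at ".", "!" or "?" when followed by
// whitespace or the end of the text.
function splitIntoSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch !== "." && ch !== "!" && ch !== "?") continue;

    // Requiring whitespace (or end of text) after the punctuation also
    // skips decimals such as "3.50", where a digit follows the period.
    const next = text[i + 1];
    if (next !== undefined && !/\s/.test(next)) continue;

    if (ch === ".") {
      // Skip periods that end a known abbreviation such as "Dr." or "e.g.".
      const lastToken = text.slice(start, i).split(/\s+/).pop() ?? "";
      if (ABBREVIATIONS.has(lastToken.toLowerCase())) continue;
    }

    const sentence = text.slice(start, i + 1).trim();
    if (sentence.length > 0) sentences.push(sentence);
    start = i + 1;
  }
  // Keep any trailing text that has no closing punctuation.
  const tail = text.slice(start).trim();
  if (tail.length > 0) sentences.push(tail);
  return sentences;
}
```

With this sketch, `splitIntoSentences("Dr. Smith paid 3.50 dollars. Was it worth it?")` yields two chunks: `"Dr. Smith paid 3.50 dollars."` and `"Was it worth it?"`. Punctuation followed directly by a closing quote is not treated as a boundary here, which a fuller implementation would need to handle.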

| Code | Description |
| --- | --- |
| `MODEL_LOAD_FAILED` | Kokoro ONNX model failed to initialize |
| `SYNTHESIS_FAILED` | Audio generation error |
| `CHUNKING_FAILED` | Text splitting error |
| `CONCATENATION_FAILED` | Audio chunk joining error |
| `ALIGNMENT_FAILED` | Whisper timestamp extraction error |
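One way a caller might model these codes is a string-literal union carried on an error object, as sketched below. The `TtsError` class, `isTtsError` guard, and `TTS_ERROR_DESCRIPTIONS` map are hypothetical names for this example; only the five codes and their descriptions come from the table above.

```typescript
type TtsErrorCode =
  | "MODEL_LOAD_FAILED"
  | "SYNTHESIS_FAILED"
  | "CHUNKING_FAILED"
  | "CONCATENATION_FAILED"
  | "ALIGNMENT_FAILED";

// Descriptions copied from the error code table.
const TTS_ERROR_DESCRIPTIONS: Record<TtsErrorCode, string> = {
  MODEL_LOAD_FAILED: "Kokoro ONNX model failed to initialize",
  SYNTHESIS_FAILED: "Audio generation error",
  CHUNKING_FAILED: "Text splitting error",
  CONCATENATION_FAILED: "Audio chunk joining error",
  ALIGNMENT_FAILED: "Whisper timestamp extraction error",
};

// Hypothetical error class; the module's actual error type is not shown here.
class TtsError extends Error {
  readonly code: TtsErrorCode;

  constructor(code: TtsErrorCode, message: string = TTS_ERROR_DESCRIPTIONS[code]) {
    super(message);
    this.name = "TtsError";
    this.code = code;
  }
}

// Checks the error name rather than the subclass, so the guard also works
// when the code is compiled for ES5 targets.
function isTtsError(err: unknown): err is TtsError {
  return err instanceof Error && err.name === "TtsError";
}

// Usage: branch on the code inside a catch block.
function explain(err: unknown): string {
  if (!isTtsError(err)) return "Unexpected error";
  return `${err.code}: ${err.message}`;
}
```

Typing the map as `Record<TtsErrorCode, string>` makes the compiler flag any code that is added to the union without a description.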