TTS Module
Text-to-speech synthesis with word-level timestamp alignment, powered by Kokoro (ONNX) and Whisper.
TtsService
synthesizeWithTimestamps(text: string): Promise<Result<SynthesisResult, TtsError>>
- Sanitizes input text
- Synthesizes speech via KokoroProvider
- Extracts word-level timestamps via WhisperAlignerService
- Returns audio path + timestamps + duration
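The four steps above can be sketched as a small pipeline. This is a minimal illustration and not the module's actual implementation: the `Result` shape, the `Synthesizer` and `Aligner` interfaces, the `TtsError` fields, and the whitespace-collapsing `sanitize` helper are all assumptions made for the example, and the collaborators are passed as parameters only to keep the sketch self-contained.

```typescript
// Hypothetical shapes; the module's real Result and TtsError types may differ.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

interface WordTimestamp { word: string; startMs: number; endMs: number; }
interface SynthesisResult { audioPath: string; wordTimestamps: WordTimestamp[]; durationMs: number; }
interface TtsError { code: string; message: string; }

// Stand-ins for KokoroProvider and WhisperAlignerService (method names assumed).
interface Synthesizer {
  synthesize(text: string): Promise<Result<{ audioPath: string; durationMs: number }, TtsError>>;
}
interface Aligner {
  align(audioPath: string): Promise<Result<WordTimestamp[], TtsError>>;
}

// Assumed sanitization: collapse runs of whitespace and trim the ends.
function sanitize(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

async function synthesizeWithTimestamps(
  text: string,
  tts: Synthesizer,
  aligner: Aligner,
): Promise<Result<SynthesisResult, TtsError>> {
  // 1. Sanitize input text.
  const clean = sanitize(text);

  // 2. Synthesize speech; stop early and pass the error through on failure.
  const audio = await tts.synthesize(clean);
  if (audio.ok === false) return audio;

  // 3. Extract word-level timestamps from the generated audio.
  const aligned = await aligner.align(audio.value.audioPath);
  if (aligned.ok === false) return aligned;

  // 4. Return audio path + timestamps + duration.
  return {
    ok: true,
    value: {
      audioPath: audio.value.audioPath,
      wordTimestamps: aligned.value,
      durationMs: audio.value.durationMs,
    },
  };
}
```

Returning a `Result` instead of throwing keeps each failure tied to the step that produced it, so a caller can branch on the error without a `try`/`catch`.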
Sub-components
KokoroProvider
ONNX-based TTS model (Kokoro-82M-v1.0). Loads the model on startup and synthesizes text to WAV audio.
- Model: configurable via KOKORO_MODEL_ID
- Voice: configurable via KOKORO_VOICE (default: af_heart)
- Quantization: configurable via KOKORO_DTYPE (default: q8)
- isReady(): boolean: returns true when the model is loaded
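As a rough illustration of how these settings could be resolved, the sketch below reads the three variables from an environment-like object and applies the documented defaults. The `KokoroConfig` interface and `readKokoroConfig` function are hypothetical names. No default is documented for `KOKORO_MODEL_ID`, so the sketch leaves it undefined when unset rather than guessing one.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface KokoroConfig {
  modelId: string | undefined; // no documented default, so none is assumed here
  voice: string;
  dtype: string;
}

// Takes the environment as a parameter so the function stays pure and testable.
function readKokoroConfig(env: Record<string, string | undefined>): KokoroConfig {
  return {
    modelId: env["KOKORO_MODEL_ID"],
    voice: env["KOKORO_VOICE"] ?? "af_heart", // documented default voice
    dtype: env["KOKORO_DTYPE"] ?? "q8",       // documented default quantization
  };
}
```

In a Node.js process this would typically be called as `readKokoroConfig(process.env)`.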
WhisperAlignerService
Uses OpenAI Whisper to transcribe the generated audio and extract word boundaries with timestamps. Supports multiple timestamp formats.
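The document does not list which timestamp formats are supported. Purely as an illustration of what normalizing "multiple formats" might involve, the sketch below accepts two hypothetical raw shapes, both expressed in seconds, and converts them to the millisecond-based `WordTimestamp` used by this module. `TupleChunk`, `FieldChunk`, and `toWordTimestamp` are invented for this example.

```typescript
interface WordTimestamp { word: string; startMs: number; endMs: number; }

// Two hypothetical raw shapes, both with times in seconds.
type TupleChunk = { text: string; timestamp: [number, number] };
type FieldChunk = { word: string; start: number; end: number };
type RawChunk = TupleChunk | FieldChunk;

// Convert seconds to whole milliseconds.
function secondsToMs(seconds: number): number {
  return Math.round(seconds * 1000);
}

// Normalize either raw shape into a WordTimestamp.
function toWordTimestamp(raw: RawChunk): WordTimestamp {
  if ("timestamp" in raw) {
    const [start, end] = raw.timestamp;
    return { word: raw.text.trim(), startMs: secondsToMs(start), endMs: secondsToMs(end) };
  }
  return { word: raw.word.trim(), startMs: secondsToMs(raw.start), endMs: secondsToMs(raw.end) };
}
```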
SynthesisResult
```typescript
interface SynthesisResult {
  audioPath: string;
  wordTimestamps: WordTimestamp[];
  durationMs: number;
}
```

WordTimestamp
```typescript
interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
}
```

Text Processing
chunkTextBySentence(text: string): TextChunk[]
Splits long text into sentence-level chunks for more natural TTS synthesis. Handles edge cases like abbreviations and decimal numbers.
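One way such a splitter could work is sketched below. It is a simplified stand-in, not the module's actual algorithm: the `TextChunk` fields and the abbreviation list are assumptions. A boundary is only recognized when `.`, `!`, or `?` is followed by whitespace or the end of the input, which leaves decimals such as 3.14 intact, and a period is ignored when the word before it is a known abbreviation.

```typescript
// Hypothetical chunk shape; the real TextChunk may carry other fields.
interface TextChunk { index: number; text: string; }

// Small illustrative list; a real implementation would likely need more entries.
const ABBREVIATIONS = new Set(["mr", "mrs", "ms", "dr", "prof", "vs", "e.g", "i.e"]);

function chunkTextBySentence(text: string): TextChunk[] {
  const chunks: TextChunk[] = [];
  let start = 0;

  // Close the current sentence at `end` and start the next one there.
  const push = (end: number): void => {
    const sentence = text.slice(start, end).trim();
    if (sentence.length > 0) chunks.push({ index: chunks.length, text: sentence });
    start = end;
  };

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch !== "." && ch !== "!" && ch !== "?") continue;

    // Require whitespace or end of input after the punctuation.
    // This already skips decimals ("3.14") and the inner period of "e.g.".
    const next = text[i + 1];
    if (next !== undefined && !/\s/.test(next)) continue;

    if (ch === ".") {
      // Word immediately before the period, e.g. "Dr" in "Dr. Smith".
      const before = text.slice(start, i).split(/\s+/).pop() ?? "";
      if (ABBREVIATIONS.has(before.toLowerCase())) continue;
    }
    push(i + 1);
  }

  // Flush any trailing text that lacks terminal punctuation.
  push(text.length);
  return chunks;
}
```

For example, `chunkTextBySentence("Dr. Smith paid 3.14 dollars. Was it fair? Yes!")` yields three chunks, with "Dr. Smith paid 3.14 dollars." kept together as the first.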
Error Codes
| Code | Description |
|---|---|
| MODEL_LOAD_FAILED | Kokoro ONNX model failed to initialize |
| SYNTHESIS_FAILED | Audio generation error |
| CHUNKING_FAILED | Text splitting error |
| CONCATENATION_FAILED | Audio chunk joining error |
| ALIGNMENT_FAILED | Whisper timestamp extraction error |
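The codes in the table can be modeled as a string union so the compiler flags any code that is misspelled or left unhandled. The sketch below is an assumption about how a caller might do this: the `TtsError` shape (a `code` plus a `message`) and the `formatTtsError` helper are hypothetical, while the codes and descriptions are copied from the table.

```typescript
type TtsErrorCode =
  | "MODEL_LOAD_FAILED"
  | "SYNTHESIS_FAILED"
  | "CHUNKING_FAILED"
  | "CONCATENATION_FAILED"
  | "ALIGNMENT_FAILED";

// Hypothetical error shape; the real TtsError may carry more fields.
interface TtsError { code: TtsErrorCode; message: string; }

// Record<TtsErrorCode, string> forces an entry for every code in the union.
const TTS_ERROR_DESCRIPTIONS: Record<TtsErrorCode, string> = {
  MODEL_LOAD_FAILED: "Kokoro ONNX model failed to initialize",
  SYNTHESIS_FAILED: "Audio generation error",
  CHUNKING_FAILED: "Text splitting error",
  CONCATENATION_FAILED: "Audio chunk joining error",
  ALIGNMENT_FAILED: "Whisper timestamp extraction error",
};

// Build a log-friendly line from an error value.
function formatTtsError(error: TtsError): string {
  return `[${error.code}] ${TTS_ERROR_DESCRIPTIONS[error.code]}: ${error.message}`;
}
```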