# Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

## Overview

Piper produces fast, offline TTS suitable for embedded devices. Training involves:

1. Corpus preparation (text covering the phonetic range)
2. Audio generation or recording
3. Quality validation via Whisper transcription
4. Fine-tuning from an existing checkpoint (recommended) or training from scratch
5. ONNX export for deployment

Fine-tuning vs from-scratch:

- **Fine-tuning:** ~1,300 phrases + 1,000 epochs (days on a modest GPU)
- **From scratch:** ~13,000+ phrases + 2,000+ epochs (weeks/months)

## Workflow

### 1. Corpus Preparation

Gather 1,300-1,500+ phrases covering a broad phonetic range:

- Use the piper-recording-studio corpus as a base
- Add domain-specific phrases for your use case
- Include varied sentence structures and lengths

**Critical for non-US English:** Ensure the corpus uses correct regional spelling. See the Localisation section below.

### 2. Audio Generation

Generate or record training audio as 22050 Hz mono WAV.

If using voice cloning (e.g., Chatterbox TTS):

- Generate at the source sample rate (often 24 kHz)
- Convert to 22050 Hz:

```bash
sox -v 0.95 input.wav -r 22050 -t wav output.wav
```

The `-v 0.95` prevents clipping during resampling.

Recording requirements:

- Consistent microphone position and room acoustics
- Minimal background noise
- Natural speaking pace (not a "reading" voice)

### 3. Quality Validation with Whisper

Automate quality checks rather than listening manually:

```python
import whisper
from piper_phonemize import phonemize_text

model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    """Return True if Whisper's transcription matches the script phonemically."""
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()
    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")
    return expected_phonemes == transcribed_phonemes
```
Retry failed samples up to 3 times, regenerating the audio on each failure (see the sketch below). Target 95%+ of samples passing validation.
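A minimal retry wrapper around `validate_sample`, assuming a hypothetical `generate_audio(sample_id, text)` function that produces (or re-records) one WAV and returns its path:

```python
def validate_with_retries(sample_id, text, generate_audio, max_attempts=3):
    """Regenerate a sample until Whisper validation passes, up to max_attempts."""
    for _ in range(max_attempts):
        audio_path = generate_audio(sample_id, text)  # hypothetical generation step
        if validate_sample(audio_path, text):
            return audio_path
    return None  # exclude from the dataset; counts against the 95% target
```

Samples that still fail after three attempts are usually better excluded than kept: mislabeled audio tends to hurt training more than a slightly smaller dataset.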
### 4. Dataset Format (LJSpeech)

Structure your dataset:

```
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
```

`metadata.csv` format: `{id}|{text}` (pipe-separated, no header row):

```
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
```
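A small helper for writing the manifest from validated samples (a sketch; `samples` is assumed to be an iterable of `(sample_id, text)` pairs):

```python
from pathlib import Path

def write_metadata(samples, dataset_dir="dataset"):
    """Write an LJSpeech-style metadata.csv: one id|text line per sample, no header."""
    lines = [f"{sample_id}|{text}" for sample_id, text in samples]
    Path(dataset_dir, "metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
```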
### 5. Preprocessing

Convert to PyTorch tensors:

```bash
python3 -m piper_train.preprocess \
  --language en-gb \
  --input-dir dataset/ \
  --output-dir piper_training_dir/ \
  --dataset-format ljspeech
```

Use `en-gb` for Australian/NZ/UK voices (espeak-ng phoneme set).
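Before running the preprocessor, it is worth sanity-checking the dataset. A sketch using only the standard library, assuming the layout from step 4:

```python
import wave
from pathlib import Path

def check_dataset(dataset_dir="dataset"):
    """Verify every metadata row has a matching 22050 Hz mono WAV file."""
    root = Path(dataset_dir)
    for line in (root / "metadata.csv").read_text(encoding="utf-8").splitlines():
        sample_id = line.split("|", 1)[0]
        wav_path = root / "wavs" / f"{sample_id}.wav"
        assert wav_path.exists(), f"missing audio: {wav_path}"
        with wave.open(str(wav_path), "rb") as wav:
            assert wav.getframerate() == 22050, f"{wav_path}: {wav.getframerate()} Hz"
            assert wav.getnchannels() == 1, f"{wav_path}: expected mono"
```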
### 6. Training

Fine-tuning (recommended):

```bash
python3 -m piper_train \
  --dataset-dir piper_training_dir/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 12 \
  --max_epochs 3000 \
  --resume_from_checkpoint ljspeech-2000.ckpt \
  --checkpoint-epochs 100 \
  --quality high \
  --precision 32
```
Key parameters:

| Parameter | Purpose |
|---|---|
| `--batch-size` | Reduce if VRAM-limited (12 works on 8 GB) |
| `--resume_from_checkpoint` | Start from the LJSpeech high-quality checkpoint |
| `--precision 32` | More stable than mixed precision |
| `--validation-split 0.0 --num-test-examples 0` | Skip validation for small datasets |
Monitor with TensorBoard: watch `loss_disc_all` for convergence.

### 7. ONNX Export

```bash
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
```

Create the metadata file `output.onnx.json` from the training `config.json`.

## Localisation for Australian, New Zealand and UK English

Piper uses espeak-ng for phonemisation. American pronunciations in the training data cause accent drift.

Corpus preparation:

- Run `scripts/convert_spelling.py` on corpus text before training
- Use the `en-gb` or `en-au` espeak-ng voice for phonemisation
- Review generated phonemes for Americanisms

Common spelling conversions:

| American | Australian/UK |
|---|---|
| -ize | -ise |
| -or | -our |
| -er | -re |
| -og | -ogue |
| -ense | -ence |

Phoneme considerations:

- /r/ linking and intrusion patterns differ
- Vowel sounds in words like "dance", "bath", "castle"
- Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see `references/localisation.md`.

Validation: use Whisper with `language="en"` and verify transcriptions match the expected regional forms.

## Dependencies

Pin versions to avoid API breakage:

```
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
```

Docker containerisation is recommended for reproducibility.

## Hardware Requirements

Minimum (fine-tuning):

- 8 GB VRAM GPU (Pascal or newer)
- 8 GB system RAM
- ~5 days for 1,000 epochs on a Tesla P4

From scratch: multiply training time by ~200x.

## Troubleshooting

| Issue | Solution |
|---|---|
| CUDA OOM | Reduce `--batch-size` (try 8 or 4) |
| Checkpoint won't load | Check that the pytorch-lightning version matches the checkpoint (see below) |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check the espeak-ng language code and corpus spelling |
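For the checkpoint-loading issue above, Lightning records the version that wrote a checkpoint under its `pytorch-lightning_version` key, so a quick check might look like this (filename taken from the fine-tuning example in step 6):

```python
import torch

# torch<2.6 (as pinned above) loads pickled checkpoints without weights_only workarounds
ckpt = torch.load("ljspeech-2000.ckpt", map_location="cpu")
print("written by pytorch-lightning", ckpt.get("pytorch-lightning_version"))
```

Compare the printed version against the pinned `pytorch-lightning==1.9.3` before starting a long training run.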