# Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

## Overview

Piper produces fast, offline TTS suitable for embedded devices. Training involves:

1. Corpus preparation (text covering the phonetic range)
2. Audio generation or recording
3. Quality validation via Whisper transcription
4. Fine-tuning from an existing checkpoint (recommended) or training from scratch
5. ONNX export for deployment

Fine-tuning vs from-scratch:

- **Fine-tuning:** ~1,300 phrases + 1,000 epochs (days on a modest GPU)
- **From scratch:** ~13,000+ phrases + 2,000+ epochs (weeks/months)

## Workflow

### 1. Corpus Preparation

Gather 1,300-1,500+ phrases covering a broad phonetic range:

- Use the piper-recording-studio corpus as a base
- Add domain-specific phrases for your use case
- Include varied sentence structures and lengths

**Critical for non-US English:** Ensure the corpus uses correct regional spelling. See the Localisation section below.

### 2. Audio Generation

Generate or record training audio as 22050 Hz mono WAV.

If using voice cloning (e.g., Chatterbox TTS):

- Generate at the source sample rate (often 24 kHz)
- Convert to 22050 Hz:

```bash
sox -v 0.95 input.wav -r 22050 -t wav output.wav
```

The `-v 0.95` prevents clipping during resampling.

Recording requirements:

- Consistent microphone position and room acoustics
- Minimal background noise
- Natural speaking pace (not a "reading" voice)

### 3. Quality Validation with Whisper

Automate quality checks rather than listening manually:

```python
import whisper
from piper_phonemize import phonemize_text

model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    """Return True if Whisper's transcription matches the script phonemically."""
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()
    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")
    return expected_phonemes == transcribed_phonemes
```
Retry failed samples up to 3 times, regenerating the audio on each failure (see the sketch below). Target 95%+ of samples passing validation.
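A minimal retry wrapper around `validate_sample`, assuming a hypothetical `generate_audio(sample_id, text)` function that produces (or re-records) one WAV and returns its path:

```python
def validate_with_retries(sample_id, text, generate_audio, max_attempts=3):
    """Regenerate a sample until Whisper validation passes, up to max_attempts."""
    for _ in range(max_attempts):
        audio_path = generate_audio(sample_id, text)  # hypothetical generation step
        if validate_sample(audio_path, text):
            return audio_path
    return None  # exclude from the dataset; counts against the 95% target
```

Samples that still fail after three attempts are usually better excluded than kept: mislabeled audio tends to hurt training more than a slightly smaller dataset.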
### 4. Dataset Format (LJSpeech)

Structure your dataset:

```
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
```

`metadata.csv` format: `{id}|{text}` (pipe-separated, no header row):

```
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
```
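A small helper for writing the manifest from validated samples (a sketch; `samples` is assumed to be an iterable of `(sample_id, text)` pairs):

```python
from pathlib import Path

def write_metadata(samples, dataset_dir="dataset"):
    """Write an LJSpeech-style metadata.csv: one id|text line per sample, no header."""
    lines = [f"{sample_id}|{text}" for sample_id, text in samples]
    Path(dataset_dir, "metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
```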
### 5. Preprocessing

Convert to PyTorch tensors:

```bash
python3 -m piper_train.preprocess \
  --language en-gb \
  --input-dir dataset/ \
  --output-dir piper_training_dir/ \
  --dataset-format ljspeech
```

Use `en-gb` for Australian/NZ/UK voices (espeak-ng phoneme set).
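Before running the preprocessor, it is worth sanity-checking the dataset. A sketch using only the standard library, assuming the layout from step 4:

```python
import wave
from pathlib import Path

def check_dataset(dataset_dir="dataset"):
    """Verify every metadata row has a matching 22050 Hz mono WAV file."""
    root = Path(dataset_dir)
    for line in (root / "metadata.csv").read_text(encoding="utf-8").splitlines():
        sample_id = line.split("|", 1)[0]
        wav_path = root / "wavs" / f"{sample_id}.wav"
        assert wav_path.exists(), f"missing audio: {wav_path}"
        with wave.open(str(wav_path), "rb") as wav:
            assert wav.getframerate() == 22050, f"{wav_path}: {wav.getframerate()} Hz"
            assert wav.getnchannels() == 1, f"{wav_path}: expected mono"
```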
### 6. Training

Fine-tuning (recommended):

```bash
python3 -m piper_train \
  --dataset-dir piper_training_dir/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 12 \
  --max_epochs 3000 \
  --resume_from_checkpoint ljspeech-2000.ckpt \
  --checkpoint-epochs 100 \
  --quality high \
  --precision 32
```
Key parameters:

| Parameter | Purpose |
|---|---|
| `--batch-size` | Reduce if VRAM-limited (12 works on 8 GB) |
| `--resume_from_checkpoint` | Start from the LJSpeech high-quality checkpoint |
| `--precision 32` | More stable than mixed precision |
| `--validation-split 0.0 --num-test-examples 0` | Skip validation for small datasets |
Monitor with TensorBoard: watch `loss_disc_all` for convergence.

### 7. ONNX Export

```bash
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
```

Create the metadata file `output.onnx.json` from the training `config.json`.

## Localisation for Australian, New Zealand and UK English

Piper uses espeak-ng for phonemisation. American pronunciations in the training data cause accent drift.

Corpus preparation:

- Run `scripts/convert_spelling.py` on corpus text before training
- Use the `en-gb` or `en-au` espeak-ng voice for phonemisation
- Review generated phonemes for Americanisms

Common spelling conversions:

| American | Australian/UK |
|---|---|
| -ize | -ise |
| -or | -our |
| -er | -re |
| -og | -ogue |
| -ense | -ence |

Phoneme considerations:

- /r/ linking and intrusion patterns differ
- Vowel sounds in words like "dance", "bath", "castle"
- Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see `references/localisation.md`.

Validation: use Whisper with `language="en"` and verify transcriptions match the expected regional forms.

## Dependencies

Pin versions to avoid API breakage:

```
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
```

Docker containerisation is recommended for reproducibility.

## Hardware Requirements

Minimum (fine-tuning):

- 8 GB VRAM GPU (Pascal or newer)
- 8 GB system RAM
- ~5 days for 1,000 epochs on a Tesla P4

From scratch: multiply training time by ~200x.

## Troubleshooting

| Issue | Solution |
|---|---|
| CUDA OOM | Reduce `--batch-size` (try 8 or 4) |
| Checkpoint won't load | Check that the pytorch-lightning version matches the checkpoint (see below) |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check the espeak-ng language code and corpus spelling |
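For the checkpoint-loading issue above, Lightning records the version that wrote a checkpoint under its `pytorch-lightning_version` key, so a quick check might look like this (filename taken from the fine-tuning example in step 6):

```python
import torch

# torch<2.6 (as pinned above) loads pickled checkpoints without weights_only workarounds
ckpt = torch.load("ljspeech-2000.ckpt", map_location="cpu")
print("written by pytorch-lightning", ckpt.get("pytorch-lightning_version"))
```

Compare the printed version against the pinned `pytorch-lightning==1.9.3` before starting a long training run.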