piper-tts-training

Installs: 43
Rank: #17057

Install

npx skills add https://github.com/sammcj/agentic-coding --skill piper-tts-training

# Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

## Overview

Piper produces fast, offline TTS suitable for embedded devices. Training involves:

- Corpus preparation (text covering the phonetic range)
- Audio generation or recording
- Quality validation via Whisper transcription
- Fine-tuning from an existing checkpoint (recommended) or training from scratch
- ONNX export for deployment

Fine-tuning vs. from scratch:

- Fine-tuning: ~1,300 phrases + 1,000 epochs (days on a modest GPU)
- From scratch: ~13,000+ phrases + 2,000+ epochs (weeks to months)

## Workflow

### 1. Corpus Preparation

Gather 1,300-1,500+ phrases covering a broad phonetic range:

- Use the piper-recording-studio corpus as a base
- Add domain-specific phrases for your use case
- Include varied sentence structures and lengths

Critical for non-US English: ensure the corpus uses correct regional spelling. See Localisation below.

### 2. Audio Generation

Generate or record training audio as 22050Hz mono WAV. If using voice cloning (e.g., Chatterbox TTS):

- Generate at the source sample rate (often 24kHz)
- Convert to 22050Hz: `sox -v 0.95 input.wav -r 22050 -t wav output.wav`
- The `-v 0.95` prevents clipping during resampling

Recording requirements:

- Consistent microphone position and room acoustics
- Minimal background noise
- Natural speaking pace (not a reading voice)

### 3. Quality Validation with Whisper

Automate quality checks rather than listening manually:

```python
import whisper
from piper_phonemize import phonemize_text

model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()
    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")
    return expected_phonemes == transcribed_phonemes
```

Retry failed samples up to 3 times. Target 95%+ dataset coverage.
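The retry-and-coverage bookkeeping above can be driven over a whole dataset. A minimal sketch, assuming the per-sample Whisper check is passed in as a callback (`validate_fn` is a hypothetical wrapper, so regeneration of failed synthetic samples can also happen inside it) and the pipe-separated metadata format described in step 4:

```python
from pathlib import Path

def validate_dataset(metadata_path, validate_fn, max_retries=3):
    # validate_fn(wav_path, text) -> bool is assumed to wrap the
    # Whisper/phoneme comparison shown above.
    passed, failed = [], []
    wav_dir = Path(metadata_path).parent / "wavs"
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            sample_id, text = line.rstrip("\n").split("|", 1)
            wav_path = wav_dir / f"{sample_id}.wav"
            # any() short-circuits on the first passing attempt,
            # so each sample is tried at most max_retries times
            ok = any(validate_fn(wav_path, text) for _ in range(max_retries))
            (passed if ok else failed).append(sample_id)
    coverage = len(passed) / max(len(passed) + len(failed), 1)
    return passed, failed, coverage
```

Samples still in `failed` after the retry budget are candidates for re-recording or removal; the returned `coverage` is what the 95% target applies to.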
### 4. Dataset Format (LJSpeech)

Structure your dataset:

```
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
```

metadata.csv uses the `{id}|{text}` format (pipe-separated, no headers):

```
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
```
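The text side of this layout can be generated programmatically. A minimal sketch (the helper name and `sample_NNNN` numbering scheme are illustrative conventions, not part of Piper):

```python
from pathlib import Path

def write_ljspeech_metadata(phrases, out_dir="dataset"):
    # Lay out dataset/metadata.csv plus an empty wavs/ directory
    # for the matching audio files.
    root = Path(out_dir)
    (root / "wavs").mkdir(parents=True, exist_ok=True)
    lines = [f"sample_{i:04d}|{text}" for i, text in enumerate(phrases, start=1)]
    # Pipe-separated, no header row, one sample per line
    (root / "metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Each `sample_NNNN` id in metadata.csv must have a matching `wavs/sample_NNNN.wav` produced in step 2.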
### 5. Preprocessing

Convert to PyTorch tensors:

```shell
python3 -m piper_train.preprocess \
  --language en-gb \
  --input-dir dataset/ \
  --output-dir piper_training_dir/ \
  --dataset-format ljspeech
```

Use `en-gb` for Australian/NZ/UK voices (espeak-ng phoneme set).
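Preprocessing assumes every WAV already matches the 22050Hz mono format from step 2, so it can be worth failing fast on stray files first. A minimal stdlib sketch (the function name is illustrative):

```python
import wave
from pathlib import Path

def find_bad_wavs(wav_dir, rate=22050, channels=1):
    # Return filenames whose sample rate or channel count doesn't
    # match the target training format (22050Hz mono).
    bad = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            if w.getframerate() != rate or w.getnchannels() != channels:
                bad.append(path.name)
    return bad
```

Any files it reports can be re-run through the `sox` conversion from step 2 before preprocessing.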
### 6. Training

Fine-tuning (recommended):

```shell
python3 -m piper_train \
  --dataset-dir piper_training_dir/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 12 \
  --max_epochs 3000 \
  --resume_from_checkpoint ljspeech-2000.ckpt \
  --checkpoint-epochs 100 \
  --quality high \
  --precision 32
```
Key parameters:

- `--batch-size`: reduce if VRAM is limited (12 works on 8GB)
- `--resume_from_checkpoint`: start from the LJSpeech high-quality checkpoint
- `--precision 32`: more stable than mixed precision
- `--validation-split 0.0 --num-test-examples 0`: skip validation for small datasets

Monitor with TensorBoard: watch `loss_disc_all` for convergence.

### 7. ONNX Export

```shell
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
```

Create the metadata file `output.onnx.json` from the training `config.json`.

## Localisation for Australian, New Zealand and UK English

Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.

Corpus preparation:

- Run `scripts/convert_spelling.py` on corpus text before training
- Use the `en-gb` or `en-au` espeak-ng voice for phonemisation
- Review generated phonemes for Americanisms

Common spelling conversions:

| American | Australian/UK |
|----------|---------------|
| -ize     | -ise          |
| -or      | -our          |
| -er      | -re           |
| -og      | -ogue         |
| -ense    | -ence         |

Phoneme considerations:

- /r/ linking and intrusion patterns differ
- Vowel sounds in words like "dance", "bath", "castle"
- Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see `references/localisation.md`.

Validation: use Whisper with `language="en"` and verify transcriptions match the expected regional forms.

## Dependencies

Pin versions to avoid API breakage:

```
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
```

Docker containerisation is recommended for reproducibility.

## Hardware Requirements

Minimum (fine-tuning):

- 8GB VRAM GPU (Pascal or newer)
- 8GB system RAM
- ~5 days for 1,000 epochs on a Tesla P4

From scratch: multiply training time by ~200x.

## Troubleshooting

| Issue | Solution |
|-------|----------|
| CUDA OOM | Reduce `--batch-size` (try 8 or 4) |
| Checkpoint won't load | Check that the pytorch-lightning version matches the checkpoint |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check the espeak-ng language code and corpus spelling |