speak - Talk to your Claude!

Give your agent the ability to speak to you in real time: local text-to-speech, voice cloning, and audio generation on Apple Silicon.

Prerequisites

Requirement         Check              Install
Apple Silicon Mac   uname -m → arm64   Intel not supported
macOS 12.0+         sw_vers            -
sox                 which sox          brew install sox
ffmpeg              which ffmpeg       brew install ffmpeg
poppler (PDF)       which pdftotext    brew install poppler

Input Sources

Source          Example
Text file       speak article.txt
Markdown        speak doc.md
Direct string   speak "Hello"
Clipboard       pbpaste | speak
Stdin           cat file.txt | speak

Web Articles

lynx -dump -nolist "https://example.com/article" | speak --output article.wav
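Before running any of the commands above, the checks from the Prerequisites table can be bundled into one pass. The following is a minimal sketch, not part of speak itself: `missing_tools` and `preflight` are hypothetical helper names, and only `uname -m` and `command -v` are relied on.

```shell
# Print every command from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report a non-arm64 machine and any missing tools; returns non-zero if
# anything is off. (Hypothetical helper, not a speak subcommand.)
preflight() {
    status=0
    if [ "$(uname -m)" != "arm64" ]; then
        echo "Not Apple Silicon: uname -m reports $(uname -m)" >&2
        status=1
    fi
    for tool in $(missing_tools sox ffmpeg pdftotext); do
        # pdftotext ships in the poppler package; the others match their package name.
        case "$tool" in
            pdftotext) pkg=poppler ;;
            *)         pkg=$tool ;;
        esac
        echo "Missing: $tool (brew install $pkg)" >&2
        status=1
    done
    return "$status"
}
```

Running `preflight && speak "Hello" --stream` would then only attempt generation once the machine passes the checks.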

Converting Formats

Format   Convert Command
PDF      pdftotext doc.pdf doc.txt
DOCX     textutil -convert txt doc.docx
HTML     pandoc -f html -t plain doc.html > doc.txt

Output Modes

Goal                     Command
Save for later           speak text.txt --output file.wav
Listen now (streaming)   speak text.txt --stream
Listen now (complete)    speak text.txt --play
Both                     speak text.txt --stream --output file.wav

Default Behavior

speak article.txt   # → ~/Audio/speak/article.wav (no playback)
speak "Hello"       # → ~/Audio/speak/speak_.wav

Directory Auto-Creation

Directory            Auto-Created?
~/Audio/speak/       ✓ Yes
~/.chatter/voices/   ✗ No
Custom directories   ✗ No

Always create custom directories first:

mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
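Since only the default output directory is created automatically, it can help to create the parent directory at the moment a custom path is used. A minimal sketch, where `ensure_parent` is a hypothetical helper name:

```shell
# Create the parent directory of a target file if it is missing,
# then echo the path back so it can be used inline.
ensure_parent() {
    mkdir -p "$(dirname "$1")" && printf '%s\n' "$1"
}
```

For example, `speak notes.txt --output "$(ensure_parent "$HOME/Audio/custom/notes.wav")"` creates `~/Audio/custom/` first if it does not exist yet.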

Voice Cloning

Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.

Quality Expectations

- Output captures general voice characteristics but is not a perfect replica
- Quality depends heavily on sample quality
- 15-25 seconds is optimal (10s minimum, 30s maximum)

Recording Your Voice

Using QuickTime:

1. Open QuickTime Player → File → New Audio Recording
2. Record 20 seconds of clear speech
3. File → Export As → Audio Only (.m4a)
4. Convert to WAV (see below)

Using sox (command line):

-d = use default microphone

Recording starts immediately and stops after 25 seconds

sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25

Converting to Required Format

Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.

From MP3

ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

From M4A (QuickTime)

ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

Trim to 25 seconds

ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

Check sample properties

ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"

Should show: Duration ~15-25s, 24000 Hz, mono
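The same check can be scripted so that it returns a pass or fail result instead of text to read by eye. This is a sketch under the requirements stated above (24000 Hz, mono, 10-30 seconds); `sample_ok` and `check_sample` are hypothetical helper names, and `check_sample` assumes `ffprobe` is installed.

```shell
# Return 0 if rate/channels/duration meet the documented requirements.
# Duration may be fractional, so the range check is done in awk.
sample_ok() {
    rate=$1 channels=$2 duration=$3
    [ "$rate" = "24000" ] && [ "$channels" = "1" ] &&
        awk -v d="$duration" 'BEGIN { exit !(d >= 10 && d <= 30) }'
}

# Read the three values from a file with ffprobe and pass them to sample_ok.
check_sample() {
    fmt="default=noprint_wrappers=1:nokey=1"
    rate=$(ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate -of "$fmt" "$1")
    channels=$(ffprobe -v error -select_streams a:0 -show_entries stream=channels -of "$fmt" "$1")
    duration=$(ffprobe -v error -show_entries format=duration -of "$fmt" "$1")
    sample_ok "$rate" "$channels" "$duration"
}
```

A call such as `check_sample voice.wav && echo "sample looks usable"` then gates the next step on the sample meeting the format rules.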

Using Your Voice

Create directory

mkdir -p ~/.chatter/voices/

Move sample

mv voice.wav ~/.chatter/voices/my_voice.wav

Test

speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

Use for content

speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav

Path requirements:

- ✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
- ✓ Works: /Users/name/.chatter/voices/my_voice.wav
- ✗ Fails: my_voice.wav (relative path)
- ✗ Fails: ./voices/my_voice.wav (relative path)

Voice Sample Tips

Good Sample      Bad Sample
Quiet room       Background noise
Natural pace     Rushed or monotone
Clear diction    Mumbling
Varied content   Repetitive phrases

Default Voice

When --voice is omitted, a built-in default voice is used:

speak "Hello world" --stream # Uses default voice

Emotion Tags

Tags produce audible effects (actual sounds), not spoken words:

speak "[sigh] Monday again." --stream

Output: (sigh sound) "Monday again."

Tag              Effect
[laugh]          Laughter
[chuckle]        Light chuckle
[sigh]           Sighing
[gasp]           Gasping
[groan]          Groaning
[clear throat]   Throat clearing
[cough]          Coughing
[crying]         Crying
[singing]        Sung speech

Not supported: [pause] and [whisper] (both are ignored).

For pauses, use punctuation instead: "Wait... let me think."
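Text written with the unsupported tags can be cleaned up before it reaches speak. The sketch below follows the advice above by turning `[pause]` into an ellipsis and dropping `[whisper]`; `sanitize_tags` is a hypothetical helper name, and supported tags such as `[sigh]` are left untouched.

```shell
# Replace [pause] with an ellipsis, drop [whisper], and trim trailing spaces.
sanitize_tags() {
    printf '%s\n' "$1" | sed \
        -e 's/ *\[pause\] */... /g' \
        -e 's/\[whisper\] *//g' \
        -e 's/ *$//'
}
```

For instance, `speak "$(sanitize_tags 'Wait [pause] let me think.')" --stream` would send `Wait... let me think.` to the engine.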

Batch Processing

mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/

Creates: ch01.wav, ch02.wav, ch03.wav

With auto-chunking (for long files)

speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

Skip completed files

speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing

Auto-Chunk Behavior

When using --auto-chunk with batch processing:

- Each input file is chunked independently
- Chunks are generated and automatically concatenated per file
- Final output: one .wav per input file (e.g., ch01.wav)
- Intermediate chunks deleted (unless --keep-chunks)

You don't need to concatenate chunks manually; only the final chapter files need concatenating.
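As a rough sanity check on how many chunks a file might produce, the character count can be divided by the chunk size. This sketch assumes every chunk is filled to the --chunk-size default of 6000 characters listed in the Options Reference; the actual splitter may break at different points, so treat the result as an estimate. `chunk_count` is a hypothetical helper name.

```shell
# Ceiling division: number of chunks of at most $2 characters (default 6000)
# needed to cover $1 characters.
chunk_count() {
    chars=$1
    size=${2:-6000}
    echo $(( (chars + size - 1) / size ))
}
```

A typical call would be `chunk_count "$(wc -m < long.txt | tr -d ' ')"`, where `tr` strips the padding that `wc` adds on macOS.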

Concatenating Audio

Explicit order (recommended)

speak concat ch01.wav ch02.wav ch03.wav --output book.wav

Glob pattern (REQUIRES zero-padded filenames)

speak concat audiobook/*.wav --output book.wav

Zero-Padding Rules

Critical for correct concatenation order:

Files    Correct              Wrong
1-9      01, 02, ..., 09      1, 2, ..., 9
10-99    01, 02, ..., 99      1, 10, 2, ...
100+     001, 002, ..., 999   1, 100, 2, ...

Why: shell glob expansion sorts names lexicographically, not numerically. Unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.
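Existing files that were numbered without padding can be renamed rather than regenerated. The sketch below assumes files named `ch<N>.txt` in a single directory; `zero_pad` and `pad_chapters` are hypothetical helper names, and the same idea applies to `.wav` files by changing the extension.

```shell
# Zero-pad a number to a given width: zero_pad 7 2 prints 07.
zero_pad() {
    # Strip leading zeros first so printf never reads the value as octal.
    n=$(printf '%s' "$1" | sed 's/^0*//')
    printf "%0${2}d\n" "${n:-0}"
}

# Rename ch<N>.txt files in a directory so every number has the same width.
pad_chapters() {
    dir=$1 width=$2
    for f in "$dir"/ch*.txt; do
        [ -e "$f" ] || continue
        num=$(basename "$f" .txt)
        num=${num#ch}
        # Skip names that are not exactly ch<digits>.txt.
        case "$num" in ''|*[!0-9]*) continue ;; esac
        new="$dir/ch$(zero_pad "$num" "$width").txt"
        [ "$f" = "$new" ] || mv "$f" "$new"
    done
}
```

Running `pad_chapters chapters 2` on a directory containing ch1.txt, ch2.txt and ch10.txt leaves ch01.txt, ch02.txt and ch10.txt, which a glob then expands in reading order.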

PDF to Audiobook (Complete Workflow)

Step 1: Find Chapter Boundaries

Preview table of contents

pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt   # Note chapter page numbers

Or search for "Chapter" markers

pdftotext textbook.pdf - | grep -n "Chapter"

Step 2: Extract Chapters (Zero-Padded!)

For 100-page book with ~10 chapters

pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt

... continue for all chapters
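Typing one command per chapter gets tedious for longer books. One possible approach is to keep the page ranges in a small list and generate the commands from it. The sketch below only prints the `pdftotext` commands so they can be reviewed first; `chapter_commands` is a hypothetical helper name, and the page ranges are whatever Step 1 turned up.

```shell
# Read "first last" page pairs from stdin and print one pdftotext command
# per chapter, numbering the output files ch01.txt, ch02.txt, ...
# Nothing is executed here; pipe the output to sh once it looks right.
chapter_commands() {
    pdf=$1
    i=0
    while read -r first last; do
        [ -n "$first" ] || continue   # ignore blank lines
        i=$((i + 1))
        printf 'pdftotext -f %s -l %s -layout "%s" ch%02d.txt\n' \
            "$first" "$last" "$pdf" "$i"
    done
}
```

With the ranges saved one pair per line in a file such as ranges.txt, `chapter_commands textbook.pdf < ranges.txt | sh` would run the extraction. The two-digit numbering fits books of up to 99 chapters.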

Step 3: Estimate Time

speak --estimate ch*.txt

Shows: total audio duration, generation time, storage needed

Quick estimates:

1 page ≈ 2 min audio ≈ 1 min generation

100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
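The rules of thumb above are simple enough to apply to any page count before extracting anything. This sketch uses 2 minutes of audio and 1 minute of generation per page, plus the storage figure of roughly 2.5 MB per audio minute from the Performance table; `estimate_pages` is a hypothetical helper name, and `speak --estimate` remains the more accurate source once the text files exist.

```shell
# Back-of-the-envelope estimate from a page count.
estimate_pages() {
    pages=$1
    audio=$((pages * 2))          # ~2 min of audio per page
    generation=$pages             # ~1 min of generation per page
    storage=$((audio * 5 / 2))    # ~2.5 MB per audio minute, integer math
    echo "audio=${audio}min generation=${generation}min storage=${storage}MB"
}
```

For a 100-page book, `estimate_pages 100` prints `audio=200min generation=100min storage=500MB`, matching the figures quoted above.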

Step 4: Generate Audio

mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk

Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav

Step 5: Concatenate

speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav

Or with glob (only if zero-padded):

speak concat audiobook/ch*.wav --output complete_audiobook.wav

PDF Troubleshooting

Issue                Solution
Empty/garbled text   Scanned PDF; use OCR: brew install tesseract
Wrong encoding       Try: pdftotext -enc UTF-8 doc.pdf
Check word count     pdftotext doc.pdf - | wc -w (should be >100)

Multi-Voice Content

mkdir -p podcast/scripts podcast/wav

echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt

speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
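With more than a couple of segments, the per-file commands above can be derived from the script names. The sketch below assumes the `NN_speaker.txt` naming used in this example and one voice sample per speaker under `~/.chatter/voices/`; `voice_for`, `wav_for` and `render_podcast` are hypothetical helper names.

```shell
# Map a script named NN_speaker.txt to that speaker's voice sample.
voice_for() {
    name=$(basename "$1" .txt)
    printf '%s/.chatter/voices/%s.wav\n' "$HOME" "${name#*_}"
}

# Map the same script to its numbered output file, e.g. 01.wav.
wav_for() {
    name=$(basename "$1" .txt)
    printf '%s.wav\n' "${name%%_*}"
}

# Generate every segment, then join them. The glob is safe here only
# because the numeric prefixes are zero-padded.
render_podcast() {
    for script in podcast/scripts/*.txt; do
        speak "$script" \
            --voice "$(voice_for "$script")" \
            --output "podcast/wav/$(wav_for "$script")"
    done
    speak concat podcast/wav/*.wav --output podcast.wav
}
```

Building the voice path from `$HOME` keeps it a full path, which is what --voice requires.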

Options Reference

Option            Description                  Default
--stream          Stream as it generates       false
--play            Play after complete          false
--output          Output file                  ~/Audio/speak/
--output-dir      Batch output directory       -
--voice           Voice sample (full path)     default
--timeout         Timeout per file (seconds)   300
--auto-chunk      Split long documents         false
--chunk-size      Chars per chunk              6000
--resume          Resume from manifest         -
--keep-chunks     Keep intermediate files      false
--skip-existing   Skip if output exists        false
--estimate        Show duration estimate       false
--dry-run         Preview only                 false
--quiet           Suppress output              false

Commands

Command             Description
speak setup         Set up environment
speak health        Check system status
speak models        List TTS models
speak concat        Concatenate audio
speak daemon kill   Stop TTS server
speak config        Show configuration

Performance

Metric       Value
Cold start   ~4-8s
Warm start   ~3-8s
Speed        0.3-0.5x RTF (faster than real-time)
Storage      ~2.5 MB/min, ~150 MB/hour

Resume Capability

For interrupted long generations:

Single file with auto-chunk — use --resume

speak long.txt --auto-chunk --output book.wav

If interrupted, manifest saved at ~/Audio/speak/manifest.json

speak --resume ~/Audio/speak/manifest.json

Batch processing — use --skip-existing

speak ch*.txt --output-dir audiobook/ --auto-chunk

If interrupted, re-run same command:

speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
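Before re-running an interrupted batch, it can be useful to see which inputs still lack an output. The sketch below assumes the naming shown earlier, where each input produces `<name>.wav` in the output directory; `pending_inputs` is a hypothetical helper name and only inspects the filesystem, it does not call speak.

```shell
# Print the input files whose <name>.wav is not yet in the output directory.
pending_inputs() {
    outdir=$1
    shift
    for input in "$@"; do
        wav="$outdir/$(basename "${input%.*}").wav"
        [ -e "$wav" ] || printf '%s\n' "$input"
    done
}
```

For example, `pending_inputs audiobook ch*.txt` lists the chapters that a re-run with --skip-existing would still need to generate.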

Common Errors

Error                              Cause               Solution
"Voice file not found"             Relative path       Use full path: ~/.chatter/voices/x.wav
"Invalid WAV format"               Wrong specs         Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav
"Voice sample too short"           <10 seconds         Record 15-25 seconds
"Output directory doesn't exist"   Not created         mkdir -p dirname/
"sox not found"                    Not installed       brew install sox
Scrambled concat order             Non-zero-padded     Use 01, 02, not 1, 2
Timeout                            >5 min generation   Use --auto-chunk or --timeout 600
"Server not running"               Stale daemon        speak daemon kill && speak health

Setup

speak "test"   # Auto-setup on first run (downloads model ~500MB)
speak setup    # Or manual setup
speak health   # Verify everything works

Server Management

Server auto-starts and shuts down after 1 hour idle.

speak health        # Check status
speak daemon kill   # Stop manually
