Gemini Text-to-Speech

Generate natural-sounding speech from text with Gemini's TTS models. The skill wraps the models in executable scripts and supports multiple voices and multi-speaker conversations.

When to Use This Skill

Use this skill when you need to:

Convert text to natural speech

Create audio for podcasts, audiobooks, or videos

Generate multi-speaker conversations

Stream audio for long content

Choose from multiple voice options

Create accessible audio content

Generate voiceovers for presentations

Batch convert text to audio files

Available Scripts

scripts/tts.py

Purpose: Convert text to speech using Gemini TTS models

When to use:

Any text-to-speech conversion
Multi-speaker conversation generation
Streaming audio for long texts
Voiceovers for content creation
Accessible audio generation

Key parameters:

| Parameter | Description | Example |
|---|---|---|
| text | Text to convert (required) | "Hello, world!" |
| --voice, -v | Voice name | Kore |
| --output, -o | Base name for output file | welcome |
| --output-dir | Output directory for audio | audio/ |
| --no-timestamp | Disable auto timestamp | Flag |
| --model, -m | TTS model | gemini-2.5-flash-preview-tts |
| --stream, -s | Enable streaming | Flag |
| --speakers | Multi-speaker mapping | "Joe:Kore,Jane:Puck" |
Output: WAV audio file path

Workflows

Workflow 1: Basic Text-to-Speech

python scripts/tts.py "Hello, world! Have a wonderful day."

Best for: Quick audio generation, simple messages

Voice: Kore (default, clear and professional)

Output: audio/tts_output_YYYYMMDD_HHMMSS.wav (auto timestamp)

Workflow 2: Choose Different Voice

python scripts/tts.py "Welcome to our podcast about technology trends" --voice Puck --output welcome

Best for: Friendly, conversational content

Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat

Output: audio/welcome_YYYYMMDD_HHMMSS.wav

Workflow 3: Multi-Speaker Conversation

python scripts/tts.py "TTS the following conversation: Joe: How's it going today? Jane: Not too bad, how about you? Joe: I'm working on a new project. Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation

Best for: Dialogues, interviews, role-playing content

Format: Marked conversation with speaker names

Script automatically routes text to appropriate voices

Output: audio/conversation_YYYYMMDD_HHMMSS.wav

Workflow 4: Long Content with Streaming

python scripts/tts.py "This is a very long text that would benefit from streaming..." --stream --output long-form

Best for: Podcasts, audiobooks, long articles

Streaming: Processes audio in chunks for long texts

Output: audio/long-form_YYYYMMDD_HHMMSS.wav

Workflow 5: Professional Voiceover

python scripts/tts.py "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover

Best for: Corporate content, presentations, formal announcements

Voice: Charon (deep, authoritative)

Use when: Professional, serious tone required

Workflow 6: Custom Output Directory

python scripts/tts.py "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1

Best for: Organized project structures

Directory created automatically if it doesn't exist

Output: ./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav

Workflow 7: Content Creation Pipeline (Text → Audio)

1. Generate script (gemini-text skill)

python skills/gemini-text/scripts/generate.py "Write a 2-minute podcast intro about sustainable energy"

2. Generate audio (this skill)

python scripts/tts.py "[Paste generated script]" --voice Fenrir --output podcast-intro

3. Use in video or podcast

Best for: Podcasts, audiobooks, video narration

Combines with: gemini-text for script generation

Workflow 8: Accessible Content

python scripts/tts.py "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility

Best for: Web accessibility, screen reader alternatives

Voice: Aoede (melodic, pleasant)

Use when: Making content accessible to visually impaired users

Workflow 9: Educational Content

python scripts/tts.py "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1

Best for: Educational materials, tutorials, e-learning

Voice: Zephyr (light, airy)

Combines well with: gemini-text for content generation

Workflow 10: Disable Timestamp

python scripts/tts.py "Fixed filename." --output my-audio --no-timestamp

Best for: When you want complete control over filename

Output: audio/my-audio.wav (no timestamp)

Use when: Generating files for specific naming schemes
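The naming scheme shown across the workflows (base name, optional timestamp, output directory) can be sketched in a few lines. This is an illustration only, not the code inside scripts/tts.py; the function name `build_output_path` and its arguments are hypothetical, and the defaults are taken from the documented outputs above.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(base="tts_output", output_dir="audio", timestamp=True, now=None):
    """Return the WAV path for a base name, following the documented pattern.

    Hypothetical helper: mirrors <output-dir>/<base>_YYYYMMDD_HHMMSS.wav,
    or <output-dir>/<base>.wav when the timestamp is disabled.
    """
    stem = base
    if timestamp:
        now = now or datetime.now()
        stem = f"{base}_{now:%Y%m%d_%H%M%S}"
    return Path(output_dir) / f"{stem}.wav"


fixed = datetime(2025, 1, 31, 9, 5, 7)
print(build_output_path("welcome", now=fixed).as_posix())
# audio/welcome_20250131_090507.wav
print(build_output_path("my-audio", timestamp=False).as_posix())
# audio/my-audio.wav
```

Passing `now` explicitly keeps the result reproducible, which helps when a pipeline needs to predict the filename before the audio exists.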

Parameters Reference

Model Selection

| Model | Quality | Speed | Best For |
|---|---|---|---|
| gemini-2.5-flash-preview-tts | Good | Fast | General use, high volume |
| gemini-2.5-pro-preview-tts | Higher | Slower | Premium content, voiceovers |

Voice Selection

| Voice | Characteristics | Best For |
|---|---|---|
| Kore | Clear, professional | Announcements, general purpose (default) |
| Puck | Friendly, conversational | Casual content, interviews |
| Charon | Deep, authoritative | Corporate, serious content |
| Fenrir | Warm, expressive | Storytelling, narratives |
| Aoede | Melodic, pleasant | Educational, accessibility |
| Zephyr | Light, airy | Gentle content, tutorials |
| Sulafat | Neutral, balanced | Documentaries, factual content |

Audio Format

| Specification | Value |
|---|---|
| Format | WAV (PCM) |
| Sample rate | 24000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit |
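To make the format concrete, here is a minimal sketch of wrapping raw PCM samples in a WAV container with these settings, using Python's standard `wave` module. It assumes the audio arrives as raw 16-bit little-endian mono PCM bytes; the helper names `write_wav` and `duration_seconds` are hypothetical and are not taken from scripts/tts.py.

```python
import wave

SAMPLE_RATE = 24000  # Hz, from the table above
CHANNELS = 1         # mono
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)


def write_wav(path, pcm_bytes):
    """Write raw 16-bit mono PCM bytes to a WAV file (hypothetical helper)."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm_bytes)


def duration_seconds(pcm_bytes):
    """Playback length implied by the byte count at this format."""
    return len(pcm_bytes) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)


silence = b"\x00\x00" * SAMPLE_RATE  # one second of silence
print(duration_seconds(silence))     # 1.0
```

At this format one second of audio is 48,000 bytes, so a one-minute clip is roughly 2.9 MB before any conversion to MP3 or AAC.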

Token Limits

| Limit | Type | Description |
|---|---|---|
| 8,192 | Input | Maximum input text tokens |
| 16,384 | Output | Maximum output audio tokens |

Output Interpretation

Audio File

Format: WAV (compatible with most players)

Mono channel (single audio track)

Sample rate: 24000 Hz (well suited to speech)

Can be converted to MP3/AAC if needed

Multi-Speaker Files

Single WAV file with multiple voices

Voices separated by timing within file

Use the --speakers parameter to map speakers to voices
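A speaker mapping such as "Joe:Kore,Jane:Puck" pairs each speaker label with a voice, using a colon inside each pair and commas between pairs. The sketch below shows one way such a string could be parsed and validated; it is an assumption about the format as documented here, not the parser used by scripts/tts.py, and `parse_speakers` is a hypothetical name.

```python
def parse_speakers(spec):
    """Parse 'Speaker:Voice,Speaker2:Voice2' into a dict (hypothetical helper).

    Raises ValueError when a pair is missing the colon, the speaker, or the voice.
    """
    mapping = {}
    for pair in spec.split(","):
        speaker, sep, voice = pair.partition(":")
        speaker, voice = speaker.strip(), voice.strip()
        if not sep or not speaker or not voice:
            raise ValueError(f"Expected 'Speaker:Voice', got {pair!r}")
        mapping[speaker] = voice
    return mapping


print(parse_speakers("Joe:Kore,Jane:Puck"))
# {'Joe': 'Kore', 'Jane': 'Puck'}
```

Validating the string before calling the script surfaces a malformed pair early, with a message that names the offending piece.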

Streaming Output

Audio processed in chunks during generation

Script shows "Streaming audio..." message

Useful for very long texts or real-time applications
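Chunked processing can be pictured as appending audio pieces to one file as they arrive. The sketch below assumes each chunk is raw 16-bit mono PCM at 24000 Hz and that the chunks come from some iterable; how scripts/tts.py actually receives and writes its chunks is not documented here, and `write_chunks` is a hypothetical name.

```python
import wave


def write_chunks(path, chunks, sample_rate=24000):
    """Append PCM chunks to a WAV file as they arrive; return bytes written.

    Hypothetical helper: `chunks` is any iterable of bytes objects.
    """
    total = 0
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        for chunk in chunks:
            if not chunk:    # skip empty chunks
                continue
            wf.writeframes(chunk)
            total += len(chunk)
    return total
```

Because the file is written incrementally, memory use stays proportional to the chunk size instead of the full recording.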

Common Issues

"google-genai not installed"

pip install google-genai

"Voice name not found"

Check voice name spelling

Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat

Voice names are case-sensitive
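Because voice names are case-sensitive, a wrapper can normalize user input before passing it to --voice. The following is a minimal sketch that uses only the seven voices listed in this document; `resolve_voice` is a hypothetical helper and is not part of scripts/tts.py.

```python
VOICES = ("Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Sulafat")


def resolve_voice(name):
    """Return the canonical spelling of a voice name, ignoring case.

    Raises ValueError listing the valid voices when there is no match.
    """
    wanted = name.strip().casefold()
    for voice in VOICES:
        if voice.casefold() == wanted:
            return voice
    raise ValueError(f"Unknown voice {name!r}; choose from: {', '.join(VOICES)}")


print(resolve_voice("kore"))  # Kore
```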

"No audio generated"

Check text is not empty

Verify text doesn't exceed token limit (8,192)

Try shorter text segments

Check API quota limits
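One way to produce shorter segments is to group whole sentences under a size budget. The sketch below uses a character count as a rough stand-in for the token limit, which is an assumption: the real limit is measured in tokens, and the 2000-character default is an arbitrary illustrative value. `split_text` is a hypothetical helper.

```python
import re


def split_text(text, max_chars=2000):
    """Group sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own segment.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)   # budget exceeded: close this segment
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_text("One. Two. Three.", max_chars=9))
# ['One. Two.', 'Three.']
```

Splitting on sentence boundaries keeps each segment a natural unit of speech, so the joins between generated files fall on pauses.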

"Multi-speaker format error"

Format: SpeakerName:VoiceName,Speaker2:Voice2

Separate speakers with commas

Use colon between speaker and voice

Example: "Joe:Kore,Jane:Puck,Host:Charon"

"Output file already exists"

Script will overwrite existing files

Change the --output filename to avoid conflicts

Use unique names for batch generation
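For batch generation, unique names can come from a numbered prefix. The sketch below builds one command per text using only the flags documented above (--voice, --output, --output-dir); the helper names and the "clip" prefix are hypothetical, and `run_batch` assumes it is started from the directory that contains scripts/tts.py.

```python
import subprocess
import sys


def build_command(text, output, voice="Kore", output_dir="audio"):
    """Build the argument list for one scripts/tts.py call."""
    return [sys.executable, "scripts/tts.py", text,
            "--voice", voice, "--output", output, "--output-dir", output_dir]


def batch_commands(texts, prefix="clip"):
    """One command per text, with numbered names such as clip-001, clip-002."""
    return [build_command(text, f"{prefix}-{i:03d}")
            for i, text in enumerate(texts, start=1)]


def run_batch(texts, prefix="clip"):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in batch_commands(texts, prefix):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.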

Audio quality issues

Check input text for unusual characters

Try different voice for better pronunciation

Consider splitting long text into smaller segments

Verify audio playback software compatibility

Best Practices

Voice Selection

Kore: General purpose, clear articulation

Puck: Conversational, engaging tone

Charon: Professional, authoritative

Fenrir: Emotional, storytelling

Aoede: Soft, gentle for accessibility

Zephyr: Educational, clear explanations

Text Preparation

Use natural language and punctuation

Include pauses with commas and periods

Spell out difficult words if needed

Break very long text into logical segments

Add speaker labels for multi-speaker content

Performance Optimization

Use streaming for very long texts

Generate shorter segments for better control

Use flash model for faster generation

Batch process multiple files for efficiency

Quality Tips

Test different voices for your content type

Use appropriate pacing with punctuation

Consider context when selecting voice

Listen to output before final use

Multi-speaker requires clear speaker labeling

Use Cases by Voice

| Voice | Ideal Use Cases |
|---|---|
| Kore | Announcements, navigation, general info |
| Puck | Podcasts, interviews, casual content |
| Charon | Corporate, news, formal presentations |
| Fenrir | Audiobooks, stories, emotional content |
| Aoede | Accessibility, educational, gentle content |
| Zephyr | Tutorials, explanations, guides |
| Sulafat | Documentaries, factual presentations |
