Four modes, one entry point:

- Podcast: Two-person dialogue, ideal for deep discussions
- Explain: Single narrator + AI visuals, ideal for product intros
- TTS/Flow Speech: Pure voice reading, ideal for articles
- Image Generation: AI image creation, ideal for creative visualization

Users don't need to remember APIs, modes, or parameters. Just say what you want.

⛔ Hard Constraints (Inviolable)

The scripts are the ONLY interface. Period.

┌─────────────────────────────────────────────────────────┐
│   AI Agent  ──▶  ./scripts/*.sh  ──▶  ListenHub API     │
│                        ▲                                │
│                        │                                │
│              This is the ONLY path.                     │
│              Direct API calls are FORBIDDEN.            │
└─────────────────────────────────────────────────────────┘

MUST:

- Execute functionality ONLY through provided scripts in **/skills/listenhub/scripts/
- Pass user intent as script arguments exactly as documented
- Trust script outputs; do not second-guess internal logic

MUST NOT:

- Write curl commands to ListenHub/Marswave API directly
- Construct JSON bodies for API calls manually
- Guess or fabricate speakerIds, endpoints, or API parameters
- Assume API structure based on patterns or web searches
- Hallucinate features not exposed by existing scripts

Why: The API is proprietary. Endpoints, parameters, and speakerIds are NOT publicly documented. Web searches will NOT find this information. Any attempt to bypass scripts will produce incorrect, non-functional code.

Script Location

Scripts are located at **/skills/listenhub/scripts/ relative to your working context.

Different AI clients use different dot-directories:

- Claude Code: .claude/skills/listenhub/scripts/
- Other clients: may vary (.cursor/, .windsurf/, etc.)

Resolution: Use the glob pattern **/skills/listenhub/scripts/*.sh to locate scripts reliably, or resolve from the SKILL.md file's own path.
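As a minimal sketch of the resolution step, the function below searches under a starting directory for a path ending in skills/listenhub/scripts. The function name resolve_listenhub_scripts and the use of find are illustrative assumptions, not part of the skill itself.

```shell
# Hypothetical helper: locate the ListenHub scripts directory below a search root.
# Prints the first matching directory; returns 1 if nothing is found.
resolve_listenhub_scripts() {
  rls_root="${1:-.}"
  # Match any directory whose path ends in skills/listenhub/scripts,
  # regardless of the client's dot-directory (.claude, .cursor, ...).
  rls_dir="$(find "$rls_root" -type d -path '*/skills/listenhub/scripts' 2>/dev/null | head -n 1)"
  [ -n "$rls_dir" ] || return 1
  printf '%s\n' "$rls_dir"
}
```

For example, SCRIPTS="$(resolve_listenhub_scripts "$PWD")" would populate the $SCRIPTS variable used in the invocation patterns in this document.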

Private Data (Cannot Be Searched)

The following are internal implementation details that AI cannot reliably know:

| Category | Examples | How to Obtain |
|---|---|---|
| API Base URL | api.marswave.ai/... | ✗ Cannot (internal to scripts) |
| Endpoints | podcast/episodes, etc. | ✗ Cannot (internal to scripts) |
| Speaker IDs | cozy-man-english, etc. | ✓ Call get-speakers.sh |
| Request schemas | JSON body structure | ✗ Cannot (internal to scripts) |
| Response formats | Episode ID, status codes | ✓ Documented per script |

Rule: If information is not in this SKILL.md or retrievable via a script (like get-speakers.sh), assume you don't know it.

Design Philosophy

Hide complexity, reveal magic.

Users don't need to know: Episode IDs, API structure, polling mechanisms, credits, endpoint differences.

Users only need: Say idea → wait a moment → get the link.

Environment

ListenHub API Key

API key stored in $LISTENHUB_API_KEY. Check on first use:

source ~/.zshrc 2>/dev/null; [ -n "$LISTENHUB_API_KEY" ] && echo "ready" || echo "need_setup"

If setup needed, guide user:

1. Visit https://listenhub.ai/zh/settings/api-keys
2. Paste key (only the lh_sk_... part)
3. Auto-save to ~/.zshrc

Labnana API Key (for Image Generation)

API key stored in $LABNANA_API_KEY, output path in $LABNANA_OUTPUT_DIR.

On first image generation, the script auto-guides configuration:

1. Visit https://labnana.com/api-keys (requires subscription)
2. Paste API key
3. Configure output path (default: ~/Downloads)
4. Auto-save to shell rc file

Security: Never expose full API keys in output.
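One way to honor this rule when confirming a key back to the user is to print only a masked form. The sketch below is an assumption about how that could look; the helper name mask_api_key and the choice of showing the first 6 and last 4 characters are hypothetical, not something the scripts are documented to do.

```shell
# Hypothetical helper: print a masked form of an API key.
# Shows the first 6 and last 4 characters; keys of 10 characters or fewer are fully hidden.
mask_api_key() {
  mak_key="$1"
  mak_len=${#mak_key}
  if [ "$mak_len" -le 10 ]; then
    printf '%s\n' '****'
    return 0
  fi
  mak_prefix="$(printf '%s' "$mak_key" | cut -c1-6)"
  mak_suffix="$(printf '%s' "$mak_key" | cut -c$((mak_len - 3))-"$mak_len")"
  printf '%s****%s\n' "$mak_prefix" "$mak_suffix"
}
```

With this sketch, mask_api_key "$LISTENHUB_API_KEY" prints something safe to echo in a status message instead of the full key.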

Mode Detection

Auto-detect mode from user input:

→ Podcast (Two-person dialogue)

- Keywords: "podcast", "chat about", "discuss", "debate", "dialogue"
- Use case: Topic exploration, opinion exchange, deep analysis
- Feature: Two voices, interactive feel

→ Explain (Explainer video)

- Keywords: "explain", "introduce", "video", "explainer", "tutorial"
- Use case: Product intro, concept explanation, tutorials
- Feature: Single narrator + AI-generated visuals, can export video

→ TTS (Text-to-speech)

- Keywords: "read aloud", "convert to speech", "tts", "voice"
- Use case: Article to audio, note review, document narration
- Feature: Fastest (1-2 min), pure audio

→ Image Generation

- Keywords: "generate image", "draw", "create picture", "visualize"
- Use case: Creative visualization, concept art, illustrations
- Feature: AI image generation via Labnana API, multiple resolutions and aspect ratios

Default: If unclear, ask user which format they prefer.

Explicit override: User can say "make it a podcast" / "I want explainer video" / "just voice" / "generate image" to override auto-detection.
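The keyword rules above can be sketched as a simple substring match. This is only an illustration: the function name detect_mode, the order in which modes are checked, and the "unknown" return value (meaning "ask the user") are assumptions, and naive substring matching will misfire on words that merely contain a keyword.

```shell
# Hypothetical helper: map free-form user input to a mode name.
# Prints one of: image, podcast, explain, tts, unknown.
detect_mode() {
  # Lowercase the input so matching is case-insensitive.
  dm_input="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$dm_input" in
    *"generate image"*|*draw*|*"create picture"*|*visualize*) echo image ;;
    *podcast*|*"chat about"*|*discuss*|*debate*|*dialogue*)   echo podcast ;;
    *explain*|*introduce*|*video*|*tutorial*)                 echo explain ;;
    *"read aloud"*|*"convert to speech"*|*tts*|*voice*)       echo tts ;;
    *) echo unknown ;;  # unclear input: ask the user which format they prefer
  esac
}
```

Because "explainer" contains "explain", it needs no separate pattern. An explicit phrase such as "make it a podcast" resolves through the same keywords.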

Interaction Flow

Step 1: Receive input + detect mode

→ Got it! Preparing...
Mode: Two-person podcast
Topic: Latest developments in Manus AI

For URLs, identify type:

- youtu.be/XXX → convert to https://www.youtube.com/watch?v=XXX
- Other URLs → use directly

Step 2: Submit generation

→ Generation submitted

Estimated time:
• Podcast: 2-3 minutes
• Explain: 3-5 minutes
• TTS: 1-2 minutes

You can:
• Wait and ask "done yet?"
• Check listenhub.ai/zh/app/library
• Do other things, ask later

Internally remember Episode ID for status queries.
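The URL rule from Step 1 can be sketched as follows. The function name normalize_url is hypothetical, and dropping any query string after the short-link video ID is an assumption made to keep the resulting watch URL well-formed.

```shell
# Hypothetical helper: rewrite youtu.be short links to full watch URLs.
# All other URLs are printed unchanged.
normalize_url() {
  nu_url="$1"
  case "$nu_url" in
    https://youtu.be/*|http://youtu.be/*|youtu.be/*)
      nu_id="${nu_url#*youtu.be/}"   # keep everything after the host
      nu_id="${nu_id%%\?*}"          # assumption: discard any ?query suffix
      printf 'https://www.youtube.com/watch?v=%s\n' "$nu_id"
      ;;
    *)
      printf '%s\n' "$nu_url"
      ;;
  esac
}
```

The normalized value is what would be passed as the optional source_url argument.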

Step 3: Query status

When user says "done yet?" / "ready?" / "check status":

- Success: Show result + next options
- Processing: "Still generating, wait another minute?"
- Failed: "Generation failed, content might be unparseable. Try another?"

Step 4: Show results

Podcast result:

✓ Podcast generated!

"{title}"

Listen: https://listenhub.ai/zh/app/library

Duration: ~{duration} minutes

Need to download? Just say so.

Explain result:

✓ Explainer video generated!

"{title}"

Watch: https://listenhub.ai/zh/app/explainer-video/slides/{episodeId}

Duration: ~{duration} minutes

Need to download audio? Just say so.

Image result:

✓ Image generated!

~/Downloads/labnana-{timestamp}.jpg

Important: Prioritize web experience. Only provide download URLs when user explicitly requests.

Script Reference

All scripts are curl-based (no extra dependencies). Locate via **/skills/listenhub/scripts/*.sh.

⚠️ Long-running Tasks: Generation may take 1-5 minutes. Use your CLI client's native background execution feature:

- Claude Code: set run_in_background: true in Bash tool
- Other CLIs: use built-in async/background job management if available

Invocation pattern: $SCRIPTS/script-name.sh [args]

Where $SCRIPTS = resolved path to **/skills/listenhub/scripts/

Podcast (One-Stage)

$SCRIPTS/create-podcast.sh "query" [mode] [source_url]

mode: quick (default) | deep | debate

source_url: optional URL for content analysis

Example:

$SCRIPTS/create-podcast.sh "The future of AI development" deep
$SCRIPTS/create-podcast.sh "Analyze this article" deep "https://example.com/article"

Podcast (Two-Stage: Text → Audio)

For advanced workflows requiring script editing between generation:

Stage 1: Generate text content

$SCRIPTS/create-podcast-text.sh "query" [mode] [source_url]

Returns: episode_id + scripts array

Stage 2: Generate audio from text

$SCRIPTS/create-podcast-audio.sh "" [modified_scripts.json]

Without scripts file: uses original scripts

With scripts file: uses modified scripts

Speech (Multi-Speaker)

$SCRIPTS/create-speech.sh

Or pipe: echo '{"scripts":[...]}' | $SCRIPTS/create-speech.sh -

scripts.json format:

{
  "scripts": [
    {"content": "Script content here", "speakerId": "speaker-id"},
    ...
  ]
}

Get Available Speakers

$SCRIPTS/get-speakers.sh [language]

language: zh (default) | en

Response structure (for AI parsing):

{
  "code": 0,
  "data": {
    "items": [
      {
        "name": "Yuanye",
        "speakerId": "cozy-man-english",
        "gender": "male",
        "language": "zh"
      }
    ]
  }
}

Usage: When user requests specific voice characteristics (gender, style), call this script first to discover available speakerId values. NEVER hardcode or assume speakerIds.

Explain

$SCRIPTS/create-explainer.sh "" [mode]

mode: info (default) | story

Generate video file (optional)

$SCRIPTS/generate-video.sh ""

TTS

$SCRIPTS/create-tts.sh "" [mode]

mode: smart (default) | direct

Image Generation

$SCRIPTS/generate-image.sh "" [size] [ratio] [reference_images]

size: 1K | 2K | 4K (default: 2K)

ratio: 16:9 | 1:1 | 9:16 | 2:3 | 3:2 | 3:4 | 4:3 | 21:9 (default: 16:9)

reference_images: comma-separated URLs (max 14), e.g. "url1,url2"

- Provides visual guidance for style, composition, or content

- Supports jpg, png, gif, webp, bmp formats

- URLs must be publicly accessible
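Before dispatching to generate-image.sh, the documented values can be checked locally. The sketch below assumes a hypothetical helper named validate_image_args; it only checks the size, ratio, and reference-image count listed above and does not verify that URLs are reachable or in a supported format.

```shell
# Hypothetical helper: validate size, ratio, and reference image count.
# Returns 0 when all arguments are within the documented values, 1 otherwise.
validate_image_args() {
  via_size="${1:-2K}"       # documented default
  via_ratio="${2:-16:9}"    # documented default
  via_refs="${3:-}"         # comma-separated URLs, optional

  case "$via_size" in
    1K|2K|4K) ;;
    *) echo "invalid size: $via_size" >&2; return 1 ;;
  esac

  case "$via_ratio" in
    16:9|1:1|9:16|2:3|3:2|3:4|4:3|21:9) ;;
    *) echo "invalid ratio: $via_ratio" >&2; return 1 ;;
  esac

  if [ -n "$via_refs" ]; then
    # Count non-empty comma-separated entries.
    via_count="$(printf '%s\n' "$via_refs" | tr ',' '\n' | grep -c . || true)"
    if [ "$via_count" -gt 14 ]; then
      echo "too many reference images: $via_count (max 14)" >&2
      return 1
    fi
  fi
  return 0
}
```

Rejecting bad values up front avoids spending a generation request on arguments the script would not accept.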

Check Status

$SCRIPTS/check-status.sh ""

type: podcast | explainer | tts

Language Adaptation

Automatic Language Detection: Adapt output language based on user input and context.

Detection Rules:

- User Input Language: If user writes in Chinese, respond in Chinese. If user writes in English, respond in English.
- Context Consistency: Maintain the same language throughout the interaction unless user explicitly switches.
- CLAUDE.md Override: If project-level CLAUDE.md specifies a default language, respect it unless user input indicates otherwise.
- Mixed Input: If user mixes languages, prioritize the dominant language (>50% of content).

Application:

- Status messages: "→ Got it! Preparing..." (English) vs "→ 收到！准备中..." (Chinese)
- Error messages: Match user's language
- Result summaries: Match user's language
- Script outputs: Pass through as-is (scripts handle their own language)

Example:

User (Chinese): "生成一个关于 AI 的播客"
AI (Chinese): "→ 收到！准备双人播客..."

User (English): "Make a podcast about AI"
AI (English): "→ Got it! Preparing two-person podcast..."

Principle: Language is interface, not barrier. Adapt seamlessly to user's natural expression.

AI Responsibilities

Black Box Principle

You are a dispatcher, not an implementer.

Your job is to:

- Understand user intent (what do they want to create?)
- Select the correct script (which tool fits?)
- Format arguments correctly (what parameters?)
- Execute and relay results (what happened?)

Your job is NOT to:

- Understand or modify script internals
- Construct API calls directly
- Guess parameters not documented here
- Invent features that scripts don't expose

Mode-Specific Behavior

ListenHub modes (passthrough):

- Podcast/Explain/TTS/Speech → pass user input directly
- Server has full AI capability to process content
- If user needs specific speakers → call get-speakers.sh first to list options

Labnana mode (enhance):

- Image Generation → client-side AI optimizes prompt
- Thin forwarding layer, needs client intelligence enhancement

Prompt Optimization (Image Generation)

When generating images, optimize user prompts by adding:

Style Enhancement:

- "cyberpunk" → add "neon lights, futuristic, dystopian"
- "ink painting" → add "Chinese ink painting, traditional art style"
- "photorealistic" → add "highly detailed, 8K quality"

Scene Details:

- Time: at night / at sunset / in the morning
- Lighting: dramatic lighting / soft lighting / neon glow
- Weather: rainy / foggy / clear sky

Composition Quality:

- Composition: cinematic composition / wide-angle / close-up
- Quality: highly detailed / 8K quality / professional photography

DO:

- Understand user intent, add missing details
- Use English keywords (models trained on English)
- Add quality descriptors
- Keep user's core intent unchanged
- Show optimized prompt transparently

DON'T:

- Drastically change user's original meaning
- Add elements user explicitly doesn't want
- Over-stack complex terminology
- If user wants "simple", don't add "highly detailed"
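The style mappings and the "simple" rule can be sketched as a small lookup. This is a deliberately limited illustration: the function name enhance_prompt is hypothetical, only the three style keywords listed above are handled, and leaving any prompt that contains "simple" completely unchanged is a stricter assumption than the rule requires. Real optimization is expected to come from the client-side AI, not a fixed table.

```shell
# Hypothetical helper: append style keywords to an image prompt.
# Prompts containing "simple" are returned unchanged (assumption).
enhance_prompt() {
  ep_prompt="$1"
  ep_lower="$(printf '%s' "$ep_prompt" | tr '[:upper:]' '[:lower:]')"

  case "$ep_lower" in
    *simple*) printf '%s\n' "$ep_prompt"; return 0 ;;
  esac

  ep_extra=""
  case "$ep_lower" in
    *cyberpunk*)      ep_extra="neon lights, futuristic, dystopian" ;;
    *"ink painting"*) ep_extra="Chinese ink painting, traditional art style" ;;
    *photorealistic*) ep_extra="highly detailed, 8K quality" ;;
  esac

  if [ -n "$ep_extra" ]; then
    printf '%s, %s\n' "$ep_prompt" "$ep_extra"
  else
    printf '%s\n' "$ep_prompt"
  fi
}
```

The user's original wording always stays at the front of the result, which keeps the core intent intact and makes the optimized prompt easy to show transparently.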

Podcast submission:

→ Generation submitted, about 2-3 minutes

You can:
• Wait and ask "done yet?"
• Check listenhub.ai/zh/app/library

Explain submission:

→ Generation submitted, explainer videos take 3-5 minutes

Includes: Script + narration + AI visuals

TTS submission:

→ TTS submitted, about 1-2 minutes

Wait a moment, or ask "done yet?" to check

Image generation with prompt optimization:

Original: cyberpunk city at night

Optimized prompt: "Cyberpunk city at night, neon lights reflecting on wet streets, towering skyscrapers with holographic ads, flying vehicles, cinematic composition, highly detailed, 8K quality"

Resolution: 4K (16:9)

✓ Image generated!
~/Downloads/labnana-20260121-143145.jpg

Image generation with a reference image:

Prompt: a futuristic car
Reference images: 1
Resolution: 2K (16:9)

✓ Image generated!
~/Downloads/labnana-20260122-154230.jpg

Podcast result:

"AI Revolution: From GPT to AGI"

Listen: https://listenhub.ai/zh/app/library

Duration: ~8 minutes

Need to download? Just say so.
