OmniVoice TTS Skill

Skill by ara.so (Daily 2026 Skills collection).

OmniVoice is a state-of-the-art zero-shot TTS model supporting 600+ languages, built on a diffusion language model-style architecture. It supports voice cloning (from reference audio), voice design (via text attributes), and auto voice generation, with RTF as low as 0.025.

Installation

Requirements

- Python 3.9+
- PyTorch 2.8+
- CUDA (recommended), Apple Silicon (MPS), or CPU

pip (recommended)

    # Step 1: Install PyTorch for your platform

    # NVIDIA GPU (CUDA 12.8)
    pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

    # Apple Silicon
    pip install torch==2.8.0 torchaudio==2.8.0

    # Step 2: Install OmniVoice
    pip install omnivoice

    # Or from source (latest)
    pip install git+https://github.com/k2-fsa/OmniVoice.git

    # Or editable dev install
    git clone https://github.com/k2-fsa/OmniVoice.git
    cd OmniVoice
    pip install -e .

uv

    git clone https://github.com/k2-fsa/OmniVoice.git
    cd OmniVoice
    uv sync

    # With mirror:
    uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"
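Before loading the model, it can help to confirm that the environment meets the stated requirements (Python 3.9+, PyTorch 2.8+). The following is a minimal sketch using only the standard library. The helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical and not part of OmniVoice, and the parsing assumes a simple `major.minor.patch` version string with an optional local suffix such as `+cu128`.

```python
import re
import sys
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '2.8.0+cu128' into (2, 8, 0), ignoring any local suffix."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric pieces such as 'dev20250101'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Return True if `version` is at least `minimum`, e.g. (2, 8)."""
    return parse_version(version) >= minimum


def check_environment() -> list:
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9+ is required")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch_version, (2, 8)):
            problems.append(f"PyTorch 2.8+ is required, found {torch_version}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Reading the installed version through `importlib.metadata` avoids importing `torch` itself, so the check stays fast even when PyTorch is missing or broken.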

HuggingFace Mirror (if blocked)

    export HF_ENDPOINT="https://hf-mirror.com"

Core Concepts

| Mode | What you provide | Use case |
|---|---|---|
| Voice Cloning | ref_audio + ref_text | Clone a speaker from a short audio clip |
| Voice Design | instruct string | Describe speaker attributes (no audio needed) |
| Auto Voice | nothing extra | Model picks a random voice |

Python API

Load the Model

    from omnivoice import OmniVoice
    import torch
    import torchaudio

    # NVIDIA GPU
    model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

    # Apple Silicon
    model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16)

    # CPU (slower)
    model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cpu", dtype=torch.float32)

Voice Cloning

    # With manual reference transcription (faster, more accurate)
    audio = model.generate(
        text="Hello, this is a test of zero-shot voice cloning.",
        ref_audio="ref.wav",
        ref_text="Transcription of the reference audio.",
    )

    # Without ref_text: Whisper auto-transcribes ref_audio
    audio = model.generate(
        text="Hello, this is a test of zero-shot voice cloning.",
        ref_audio="ref.wav",
    )

    # audio is a list of torch.Tensor, each of shape (1, T), at 24 kHz
    torchaudio.save("out.wav", audio[0], 24000)

Voice Design

    # Describe the speaker via comma-separated attributes
    audio = model.generate(
        text="Hello, this is a test of zero-shot voice design.",
        instruct="female, low pitch, british accent",
    )
    torchaudio.save("out.wav", audio[0], 24000)

Supported attributes:

- Gender: male, female
- Age: child, young, middle-aged, elderly
- Pitch: very low pitch, low pitch, high pitch, very high pitch
- Style: whisper
- English accents: american accent, british accent, australian accent, etc.
- Chinese dialects: 四川话, 陕西话, etc.

Auto Voice

    audio = model.generate(text="This is a sentence without any voice prompt.")
    torchaudio.save("out.wav", audio[0], 24000)

Generation Parameters

    audio = model.generate(
        text="Hello world.",
        ref_audio="ref.wav",
        ref_text="Reference text.",
        num_step=32,   # diffusion steps; use 16 for faster (slightly lower quality)
        speed=1.2,     # speaking rate multiplier (>1 faster, <1 slower)
        duration=8.0,  # fix output duration in seconds (overrides speed)
    )

Non-Verbal Symbols

    # Insert expressive non-verbal sounds inline
    audio = model.generate(text="[laughter] You really got me. I didn't see that coming at all.")

Supported tags: [laughter], [sigh], [confirmation-en], [question-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn]

Pronunciation Control

    # Chinese: pinyin with tone numbers (inline, uppercase)
    audio = model.generate(text="这批货物打ZHE2出售后他严重SHE2本了，再也经不起ZHE1腾了。")

    # English: CMU dict pronunciation in brackets (uppercase)
    audio = model.generate(text="You could probably still make [IH1 T] look good.")

CLI Tools

Web Demo

    omnivoice-demo --ip 0.0.0.0 --port 8001
    omnivoice-demo --help  # all options

Single Inference

    # Voice Cloning (ref_text optional; omit for Whisper auto-transcription)
    omnivoice-infer \
        --model k2-fsa/OmniVoice \
        --text "This is a test for text to speech." \
        --ref_audio ref.wav \
        --ref_text "Transcription of the reference audio." \
        --output hello.wav

    # Voice Design
    omnivoice-infer \
        --model k2-fsa/OmniVoice \
        --text "This is a test for text to speech." \
        --instruct "male, British accent" \
        --output hello.wav

    # Auto Voice
    omnivoice-infer \
        --model k2-fsa/OmniVoice \
        --text "This is a test for text to speech." \
        --output hello.wav

Batch Inference (Multi-GPU)

    omnivoice-infer-batch \
        --model k2-fsa/OmniVoice \
        --test_list test.jsonl \
        --res_dir results/

JSONL format (test.jsonl):

    {"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"}
    {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"}
    {"id": "sample_003", "text": "Auto voice example"}
    {"id": "sample_004", "text": "Speed controlled", "ref_audio": "/path/to/ref.wav", "speed": 1.2}
    {"id": "sample_005", "text": "Duration fixed", "ref_audio": "/path/to/ref.wav", "duration": 10.0}
    {"id": "sample_006", "text": "With language hint", "ref_audio": "/path/to/ref.wav", "language_id": "en", "language_name": "English"}

JSONL field reference:

| Field | Required | Description |
|---|---|---|
| id | ✅ | Unique identifier |
| text | ✅ | Text to synthesize |
| ref_audio | ❌ | Path to reference audio (voice cloning) |
| ref_text | ❌ | Transcript of the reference audio |
| instruct | ❌ | Speaker attributes (voice design) |
| language_id | ❌ | Language code, e.g. "en" |
| language_name | ❌ | Language name, e.g. "English" |
| duration | ❌ | Fixed output duration in seconds |
| speed | ❌ | Speaking rate multiplier (ignored if duration is set) |

Common Patterns

Full Voice Cloning Pipeline

    from omnivoice import OmniVoice
    import torch
    import torchaudio
    from pathlib import Path

    def clone_voice(ref_audio_path: str, texts: list[str], output_dir: str):
        model = OmniVoice.from_pretrained(
            "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
        )
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(texts):
            audio = model.generate(
                text=text,
                ref_audio=ref_audio_path,
                # ref_text omitted: Whisper auto-transcribes
                num_step=32,
                speed=1.0,
            )
            out_path = f"{output_dir}/output_{i:04d}.wav"
            torchaudio.save(out_path, audio[0], 24000)
            print(f"Saved: {out_path}")

    clone_voice(
        ref_audio_path="speaker.wav",
        texts=["Hello world.", "Second sentence.", "Third sentence."],
        output_dir="outputs/",
    )

Batch Processing from a List

    from omnivoice import OmniVoice
    import torch
    import torchaudio

    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
    )

    items = [
        {"id": "s1", "text": "English sentence.", "instruct": "female, american accent"},
        {"id": "s2", "text": "Another sentence.", "ref_audio": "ref.wav"},
        {"id": "s3", "text": "Auto voice."},
    ]

    for item in items:
        # Pass only the keys each item actually provides
        kwargs = {"text": item["text"]}
        if "ref_audio" in item:
            kwargs["ref_audio"] = item["ref_audio"]
        if "ref_text" in item:
            kwargs["ref_text"] = item["ref_text"]
        if "instruct" in item:
            kwargs["instruct"] = item["instruct"]
        audio = model.generate(**kwargs)
        torchaudio.save(f"{item['id']}.wav", audio[0], 24000)

Voice Design Combinations

    designs = [
        "male, elderly, low pitch",
        "female, child, high pitch",
        "male, whisper",
        "female, british accent, high pitch",
        "male, american accent, middle-aged",
    ]

    for design in designs:
        audio = model.generate(
            text="The quick brown fox jumps over the lazy dog.",
            instruct=design,
        )
        # Build a filesystem-friendly name, e.g. "male_elderly_low-pitch"
        safe_name = design.replace(", ", "_").replace(" ", "-")
        torchaudio.save(f"design_{safe_name}.wav", audio[0], 24000)

Fast Inference (Lower Diffusion Steps)

    # Default: num_step=32 (high quality)
    # Fast: num_step=16 (slightly lower quality, ~2x faster)
    audio = model.generate(
        text="Fast inference example.",
        ref_audio="ref.wav",
        num_step=16,
    )
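To compare `num_step` settings on your own hardware, one option is to measure the real-time factor (RTF), conventionally defined as synthesis time divided by the duration of the generated audio. The sketch below assumes the 24 kHz output described in this document and works only on a sample count and an elapsed time, so it does not call OmniVoice itself. `audio_seconds` and `real_time_factor` are hypothetical helper names.

```python
SAMPLE_RATE = 24000  # OmniVoice output sample rate stated in this document


def audio_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration in seconds of a waveform with `num_samples` samples per channel."""
    return num_samples / sample_rate


def real_time_factor(elapsed_seconds: float, num_samples: int,
                     sample_rate: int = SAMPLE_RATE) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    duration = audio_seconds(num_samples, sample_rate)
    if duration <= 0:
        raise ValueError("audio must contain at least one sample")
    return elapsed_seconds / duration


# Example: 0.25 s of compute for 10 s of audio (240,000 samples at 24 kHz)
rtf = real_time_factor(0.25, 240_000)
print(f"RTF = {rtf:.3f}")  # prints "RTF = 0.025"
```

With a real model, `elapsed_seconds` could come from `time.perf_counter()` around the `model.generate(...)` call and `num_samples` from `audio[0].shape[-1]`. On a GPU, the measurement is likely to be more reliable if the device is synchronized before the timer is stopped.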

Output Format

- Sample rate: 24,000 Hz
- Type: list[torch.Tensor], each tensor of shape (1, T)
- Save: use torchaudio.save(path, audio[0], 24000)

Troubleshooting

HuggingFace download fails

    export HF_ENDPOINT="https://hf-mirror.com"

CUDA out of memory

    # Use float16 (not float32)
    model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

    # Or reduce batch size / text length in batch inference
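One way to reduce text length is to split a long passage at sentence boundaries and synthesize each chunk separately. The following is a rough sketch, not part of OmniVoice: `split_text` is a hypothetical helper, the 200-character default is an arbitrary assumption, and the punctuation set covers only common English and Chinese sentence endings.

```python
import re

# Split after English sentence punctuation followed by whitespace,
# or directly after Chinese sentence punctuation (which has no trailing space).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_text(text: str, max_chars: int = 200) -> list:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    Sentences are rejoined with a single space. A single sentence longer than
    `max_chars` is kept whole rather than cut mid-word.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s and s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


for chunk in split_text("First sentence. Second sentence! Third one?", max_chars=32):
    print(chunk)
```

Each chunk could then be passed to `model.generate` and the resulting tensors joined along the time axis, for example with `torch.cat`. Since auto voice picks a random voice, supplying the same `ref_audio` or `instruct` for every chunk is probably needed to keep the speaker consistent.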

Whisper ASR not available for ref_text auto-transcription

    pip install openai-whisper

Wrong pronunciation in Chinese

Use inline pinyin with tone numbers directly in the text string:

    # Format: PINYIN + TONE_NUMBER (e.g. ZHE2) within the sentence
    text = "这批货物打ZHE2出售"

Audio quality issues

- Increase num_step to 32 or 64
- Provide ref_text manually instead of relying on auto-transcription
- Use a clean, noise-free reference audio clip (3–15 seconds recommended)

Apple Silicon (MPS) issues

    # Use mps device explicitly
    model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16)

Model & Resources

| Resource | Link |
|---|---|
| HuggingFace Model | k2-fsa/OmniVoice |
| HuggingFace Space | https://huggingface.co/spaces/k2-fsa/OmniVoice |
| Paper (arXiv) | https://arxiv.org/abs/2604.00688 |
| Demo Page | https://zhu-han.github.io/omnivoice |
| Supported Languages | docs/languages.md in repo |
| Voice Design Attributes | docs/voice-design.md in repo |
| Generation Parameters | docs/generation-parameters.md in repo |
| Training/Eval Examples | examples/ in repo |
