MOSS-TTS-Nano Speech Generation Skill

Skill by ara.so, from the Daily 2026 Skills collection.

MOSS-TTS-Nano is an open-source, multilingual tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time. It supports 20 languages, voice cloning, and streaming inference, and it runs on CPU without a GPU.

Installation

Conda (recommended)

    conda create -n moss-tts-nano python=3.12 -y
    conda activate moss-tts-nano

    git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
    cd MOSS-TTS-Nano

    pip install -r requirements.txt
    pip install -e .

Fix WeTextProcessing if it fails

    conda install -c conda-forge pynini=2.1.6.post1 -y
    pip install git+https://github.com/WhizZest/WeTextProcessing.git

After `pip install -e .`, the `moss-tts-nano` CLI command is available in the active environment.

Model Weights

Models are auto-downloaded from Hugging Face on first run:

- TTS model: OpenMOSS-Team/MOSS-TTS-Nano
- Audio tokenizer: OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano

ModelScope mirrors are available at openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.

CLI Commands

Generate speech (voice clone mode)

    moss-tts-nano generate \
      --prompt-speech assets/audio/zh_1.wav \
      --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"

Output defaults to generated_audio/moss_tts_nano_output.wav.

Generate from a text file (long-form)

    moss-tts-nano generate \
      --prompt-speech assets/audio/zh_1.wav \
      --text-file my_script.txt \
      --output output.wav

Launch local web demo

    moss-tts-nano serve
    # or directly:
    python app.py

The demo opens at http://127.0.0.1:18083. The model stays loaded in memory, so repeated requests are fast.

Direct Python entrypoint

    python infer.py \
      --prompt-audio-path assets/audio/zh_1.wav \
      --text "Hello, this is a test of MOSS-TTS-Nano."

Output: generated_audio/infer_output.wav

Python API Usage

Basic voice clone inference

    import soundfile as sf
    from infer import MossTTSNanoInference

    # Initialize once (downloads weights on first run)
    tts = MossTTSNanoInference()

    # Voice clone: synthesize text in the style of the reference audio
    audio = tts.infer(
        text="欢迎使用MOSS语音合成系统。",
        prompt_audio_path="assets/audio/zh_1.wav",
    )

    # Save output
    sf.write("output.wav", audio, samplerate=48000)

English voice clone

    import soundfile as sf
    from infer import MossTTSNanoInference

    tts = MossTTSNanoInference()

    audio = tts.infer(
        text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
        prompt_audio_path="assets/audio/en_sample.wav",
    )

    sf.write("english_output.wav", audio, samplerate=48000)

Streaming inference (low latency)

    import numpy as np
    import soundfile as sf
    from infer import MossTTSNanoInference

    tts = MossTTSNanoInference()

    chunks = []
    for audio_chunk in tts.infer_stream(
        text="This sentence is generated chunk by chunk for low latency playback.",
        prompt_audio_path="assets/audio/en_sample.wav",
    ):
        chunks.append(audio_chunk)
        # process or play chunk in real time here

    # Join the chunks into one array and save it
    full_audio = np.concatenate(chunks)
    sf.write("streamed_output.wav", full_audio, samplerate=48000)

Long-text synthesis with chunked voice cloning

    import soundfile as sf
    from infer import MossTTSNanoInference

    tts = MossTTSNanoInference()

    long_text = """
    MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
    Each chunk uses the same reference voice, producing consistent speaker identity
    across the entire output even for multi-paragraph documents.
    """

    audio = tts.infer(
        text=long_text,
        prompt_audio_path="assets/audio/en_sample.wav",
    )

    sf.write("long_form_output.wav", audio, samplerate=48000)

FastAPI HTTP endpoint usage

When the server is running (`moss-tts-nano serve` or `python app.py`):

    import base64
    import io

    import requests
    import soundfile as sf

    # Read reference audio as base64
    with open("assets/audio/zh_1.wav", "rb") as f:
        ref_audio_b64 = base64.b64encode(f.read()).decode()

    # Send the text and the reference audio to the running server
    response = requests.post(
        "http://127.0.0.1:18083/generate",
        json={
            "text": "你好，这是一个语音合成测试。",
            "prompt_audio_base64": ref_audio_b64,
        },
    )
    data = response.json()

    # Decode the returned audio and save it
    audio_bytes = base64.b64decode(data["audio_base64"])
    audio_array, sr = sf.read(io.BytesIO(audio_bytes))
    sf.write("api_output.wav", audio_array, samplerate=sr)
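The request and response bodies above are plain JSON with base64-encoded audio, so the encoding round trip can be checked offline before involving the server. The following is a minimal sketch under stated assumptions: `make_silent_wav`, `build_payload`, and `decode_audio_field` are hypothetical helper names, not part of MOSS-TTS-Nano, and the "server" is simulated by echoing the audio back. Only the field names `text`, `prompt_audio_base64`, and `audio_base64` come from the example above.

```python
import base64
import io
import wave


def make_silent_wav(n_frames: int = 4800, rate: int = 48000) -> bytes:
    """Build a tiny mono 16-bit WAV in memory (stand-in for a reference clip)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def build_payload(text: str, wav_bytes: bytes) -> dict:
    """Hypothetical helper: JSON body with the two fields used in the example above."""
    return {
        "text": text,
        "prompt_audio_base64": base64.b64encode(wav_bytes).decode("ascii"),
    }


def decode_audio_field(response_json: dict) -> bytes:
    """Hypothetical helper: decode the "audio_base64" field back to raw bytes."""
    return base64.b64decode(response_json["audio_base64"])


wav_bytes = make_silent_wav()
payload = build_payload("Hello", wav_bytes)

# Simulated response: echo the audio back so the round trip can be checked offline.
fake_response = {"audio_base64": payload["prompt_audio_base64"]}
decoded = decode_audio_field(fake_response)
```

With the live server, `payload` could be passed as `json=payload` to `requests.post`, and `decode_audio_field(response.json())` would replace the manual decode step.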

Streaming HTTP response (real-time web playback)

    import base64

    import requests

    # Read reference audio as base64
    with open("assets/audio/zh_1.wav", "rb") as f:
        ref_audio_b64 = base64.b64encode(f.read()).decode()

    # stream=True lets the client read audio bytes as the server produces them
    with requests.post(
        "http://127.0.0.1:18083/generate_stream",
        json={
            "text": "流式语音合成示例，适合实时播放场景。",
            "prompt_audio_base64": ref_audio_b64,
        },
        stream=True,
    ) as resp:
        with open("stream_output.wav", "wb") as out:
            for chunk in resp.iter_content(chunk_size=4096):
                out.write(chunk)
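For real-time playback, the figure that usually matters is how long the first chunk takes to arrive rather than the total generation time. The sketch below shows one way to measure both while consuming a chunk iterator. `consume_stream` and `fake_stream` are hypothetical names introduced for illustration; `fake_stream` merely stands in for `resp.iter_content(chunk_size=4096)` so the example runs without a server.

```python
import time
from typing import Iterable, Iterator, Tuple


def consume_stream(chunks: Iterable[bytes]) -> Tuple[bytes, float, float]:
    """Collect byte chunks; return (data, seconds to first chunk, total seconds)."""
    start = time.perf_counter()
    first_latency = None
    parts = []
    for chunk in chunks:
        if first_latency is None:
            # Time from the start of iteration until the first chunk arrived.
            first_latency = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    if first_latency is None:
        # Empty stream: no first chunk, so report the total time instead.
        first_latency = total
    return b"".join(parts), first_latency, total


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a streamed HTTP body: three 4-byte chunks, each slightly delayed."""
    for i in range(3):
        time.sleep(0.01)
        yield bytes([i]) * 4


data, first_latency, total_time = consume_stream(fake_stream())
print(f"first chunk after {first_latency:.3f}s, all {len(data)} bytes after {total_time:.3f}s")
```

Against the live endpoint, the same function could be called as `consume_stream(resp.iter_content(chunk_size=4096))` inside the `with requests.post(...)` block.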

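Supported Languages

| Code | Language | Code | Language | Code | Language |
| --- | --- | --- | --- | --- | --- |
| zh | Chinese | en | English | de | German |
| es | Spanish | fr | French | ja | Japanese |
| it | Italian | hu | Hungarian | ko | Korean |
| ru | Russian | fa | Persian | ar | Arabic |
| pl | Polish | pt | Portuguese | cs | Czech |
| da | Danish | sv | Swedish | el | Greek |
| tr | Turkish | | | | |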

The language is inferred automatically from the input text and the reference audio. No explicit language code parameter is required for basic usage.

Architecture Overview

| Component | Details |
| --- | --- |
| Pipeline | Audio Tokenizer + LLM (pure autoregressive) |
| Audio Tokenizer | MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture) |
| Output | 48 kHz, 2-channel (stereo) |
| Token rate | 12.5 Hz token stream |
| Codebooks | RVQ with 16 codebooks (0.125 kbps – 2 kbps) |
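The token rate, codebook count, and bitrate range in the table are mutually consistent, which a few lines of arithmetic can confirm. This is a sketch, not a statement about the tokenizer's internals: the bits-per-token value and the codebook size it implies are derived here from the table's figures under the assumption of fixed-length codes, and neither number is stated in the source.

```python
TOKEN_RATE_HZ = 12.5   # token stream rate, from the table above
NUM_CODEBOOKS = 16     # RVQ codebooks, from the table above
MIN_KBPS = 0.125       # bitrate with a single codebook, from the table above

# Bits carried by each token of one codebook (derived, not stated in the source).
bits_per_token = MIN_KBPS * 1000 / TOKEN_RATE_HZ

# Codebook size this bit width would imply, assuming fixed-length codes.
implied_codebook_size = 2 ** int(bits_per_token)


def bitrate_kbps(num_codebooks: int) -> float:
    """Bitrate when the first `num_codebooks` RVQ codebooks are used."""
    return num_codebooks * bits_per_token * TOKEN_RATE_HZ / 1000


max_kbps = bitrate_kbps(NUM_CODEBOOKS)

# Total tokens per second across all codebooks.
tokens_per_second = TOKEN_RATE_HZ * NUM_CODEBOOKS

print(bits_per_token, implied_codebook_size, max_kbps, tokens_per_second)
```

Using all 16 codebooks reproduces the 2 kbps upper bound from the table, and using one reproduces the 0.125 kbps lower bound.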
LLM: ~0.1B parameters total

Key CLI Flags

| Flag | Alias | Description |
| --- | --- | --- |
| --prompt-audio-path | none | Path to reference WAV for voice cloning (infer.py) |
| --prompt-speech | none | Same purpose in the moss-tts-nano generate CLI |
| --text | none | Input text string |
| --text-file | none | Path to a plain text file for long-form synthesis |
| --output | none | Output WAV file path (default varies by entrypoint) |

Common Patterns

Pattern: Batch synthesis with one reference voice

    import soundfile as sf
    from infer import MossTTSNanoInference

    tts = MossTTSNanoInference()
    ref = "assets/audio/zh_1.wav"

    sentences = [
        "第一句话，用于批量合成测试。",
        "第二句话，保持相同的音色。",
        "第三句话，输出独立的音频文件。",
    ]

    # One output file per sentence, all in the same reference voice
    for i, sentence in enumerate(sentences):
        audio = tts.infer(text=sentence, prompt_audio_path=ref)
        sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
        print(f"Saved output_{i:02d}.wav")

Pattern: Real-time playback with sounddevice

    import numpy as np
    import sounddevice as sd
    from infer import MossTTSNanoInference

    tts = MossTTSNanoInference()

    # Collect the streamed chunks, then play the joined audio once generation ends
    buffer = []
    for chunk in tts.infer_stream(
        text="Real-time playback example using sounddevice.",
        prompt_audio_path="assets/audio/en_sample.wav",
    ):
        buffer.append(chunk)

    audio = np.concatenate(buffer)
    sd.play(audio, samplerate=48000)
    sd.wait()

Pattern: Gradio integration

    import gradio as gr
    from infer import MossTTSNanoInference

    tts = MossTTSNanoInference()

    def synthesize(reference_audio_path: str, text: str):
        audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
        # Return as (sample_rate, numpy_array) tuple for Gradio Audio component
        return (48000, audio)

    demo = gr.Interface(
        fn=synthesize,
        inputs=[
            gr.Audio(type="filepath", label="Reference Voice"),
            gr.Textbox(label="Text to synthesize"),
        ],
        outputs=gr.Audio(label="Generated Speech"),
        title="MOSS-TTS-Nano Voice Clone",
    )
    demo.launch()

Troubleshooting

WeTextProcessing install fails

    # Use conda to get pynini, then install from source
    conda install -c conda-forge pynini=2.1.6.post1 -y
    pip install git+https://github.com/WhizZest/WeTextProcessing.git

Model download is slow or fails

Set HF_ENDPOINT to a mirror if Hugging Face is unreachable:

    export HF_ENDPOINT=https://hf-mirror.com
    python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"

Or use ModelScope:

    pip install modelscope

Then point model paths to openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.

Out of memory on CPU

- Use streaming inference (`infer_stream`) to reduce peak memory.
- Reduce chunk size for long text inputs; the model handles chunked voice cloning automatically.
- Close other applications; the model needs ~1–2 GB RAM.

Audio output is silent or corrupt

- Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, at any sample rate (it will be resampled).
- Minimum reference audio duration: ~3–5 seconds for reliable voice cloning.
- Avoid reference audio with heavy background noise.

moss-tts-nano command not found

    # Re-run editable install inside the active conda env
    pip install -e .
    which moss-tts-nano  # should resolve now

Port conflict for web demo

    # Default port is 18083; check what occupies it
    lsof -i :18083
    # Kill if needed, then relaunch
    moss-tts-nano serve

Output Defaults

| Entrypoint | Default output path |
| --- | --- |
| python infer.py | generated_audio/infer_output.wav |
| moss-tts-nano generate | generated_audio/moss_tts_nano_output.wav |
| python app.py / moss-tts-nano serve | returned via HTTP response |

The generated_audio/ directory is created automatically if it does not exist.
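The Troubleshooting guidance on reference audio (roughly 3–5 seconds minimum) can be checked before synthesis rather than diagnosed afterwards. The following is a minimal sketch using only the standard library. `wav_duration_seconds`, `is_long_enough`, and `make_wav` are hypothetical helpers, not part of MOSS-TTS-Nano, and the 3.0 s threshold is simply the lower end of the range quoted above. Note that Python's `wave` module reads integer PCM files only, so float32 WAVs would need another reader such as soundfile.

```python
import io
import wave

MIN_REFERENCE_SECONDS = 3.0  # lower end of the ~3-5 s guidance; an assumed threshold


def wav_duration_seconds(wav_file) -> float:
    """Duration of a PCM WAV, given a path or a binary file-like object."""
    with wave.open(wav_file, "rb") as w:
        return w.getnframes() / w.getframerate()


def is_long_enough(wav_file, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the clip meets the minimum duration for a reference voice."""
    return wav_duration_seconds(wav_file) >= minimum


def make_wav(n_frames: int, rate: int = 16000) -> io.BytesIO:
    """Build a silent mono 16-bit WAV in memory, used here only as test input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)  # rewind so the buffer can be read back
    return buf


short_clip_ok = is_long_enough(make_wav(16000))  # 1 second of audio
long_clip_ok = is_long_enough(make_wav(64000))   # 4 seconds of audio
```

In practice the same check could be run on a file path, for example `is_long_enough("assets/audio/zh_1.wav")`, before passing that path as the reference audio.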
