# MOSS-TTS-Nano Speech Generation Skill

> Skill by ara.so, Daily 2026 Skills collection.

MOSS-TTS-Nano is an open-source, multilingual, tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time. It supports 20 languages, voice cloning, and streaming inference, and it runs on CPU without a GPU.

## Installation

### Conda (recommended)

```bash
conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano

git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano

pip install -r requirements.txt
pip install -e .
```

### Fix WeTextProcessing if it fails

```bash
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git
```

After `pip install -e .`, the `moss-tts-nano` CLI command is available in the active environment.

## Model Weights

Models are auto-downloaded from Hugging Face on first run:

- TTS model: `OpenMOSS-Team/MOSS-TTS-Nano`
- Audio tokenizer: `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano`

ModelScope mirrors are available at `openmoss/MOSS-TTS-Nano` and `openmoss/MOSS-Audio-Tokenizer-Nano`.

## CLI Commands

### Generate speech (voice clone mode)

```bash
moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
```

Output defaults to `generated_audio/moss_tts_nano_output.wav`.

### Generate from a text file (long-form)

```bash
moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text-file my_script.txt \
  --output output.wav
```

### Launch local web demo

```bash
moss-tts-nano serve
# or directly:
python app.py
```

The demo opens at `http://127.0.0.1:18083`. The model stays loaded in memory, so repeated requests are fast.

### Direct Python entrypoint

```bash
python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Hello, this is a test of MOSS-TTS-Nano."
```

Output: `generated_audio/infer_output.wav`

## Python API Usage

### Basic voice clone inference

```python
import soundfile as sf

from infer import MossTTSNanoInference

# Initialize once (downloads weights on first run)
tts = MossTTSNanoInference()

# Voice clone: synthesize text in the style of the reference audio
audio = tts.infer(
    text="欢迎使用MOSS语音合成系统。",  # "Welcome to the MOSS speech synthesis system."
    prompt_audio_path="assets/audio/zh_1.wav",
)

# Save output
sf.write("output.wav", audio, samplerate=48000)
```

### English voice clone

```python
import soundfile as sf

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

audio = tts.infer(
    text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
    prompt_audio_path="assets/audio/en_sample.wav",
)

sf.write("english_output.wav", audio, samplerate=48000)
```

### Streaming inference (low latency)

```python
import numpy as np
import soundfile as sf

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

chunks = []
for audio_chunk in tts.infer_stream(
    text="This sentence is generated chunk by chunk for low latency playback.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    chunks.append(audio_chunk)
    # process or play chunk in real time here

full_audio = np.concatenate(chunks)
sf.write("streamed_output.wav", full_audio, samplerate=48000)
```

### Long-text synthesis with chunked voice cloning

```python
import soundfile as sf

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

long_text = """
MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
Each chunk uses the same reference voice, producing consistent speaker identity
across the entire output even for multi-paragraph documents.
"""

audio = tts.infer(
    text=long_text,
    prompt_audio_path="assets/audio/en_sample.wav",
)

sf.write("long_form_output.wav", audio, samplerate=48000)
```

### FastAPI HTTP endpoint usage

When the server is running (`moss-tts-nano serve` or `python app.py`):

```python
import base64
import io

import requests
import soundfile as sf

# Read reference audio as base64
with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:18083/generate",
    json={
        "text": "你好,这是一个语音合成测试。",  # "Hello, this is a speech synthesis test."
        "prompt_audio_base64": ref_audio_b64,
    },
)
response.raise_for_status()  # fail early on an HTTP error
data = response.json()

# Decode the returned audio and save it at its original sample rate
audio_bytes = base64.b64decode(data["audio_base64"])
audio_array, sr = sf.read(io.BytesIO(audio_bytes))
sf.write("api_output.wav", audio_array, samplerate=sr)
```
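The request and response handling for `/generate` can be factored into two small helpers, which makes it testable without a running server. This is a minimal sketch: the helper names `build_generate_payload` and `decode_generate_response` are hypothetical, and the only field names used (`text`, `prompt_audio_base64`, `audio_base64`) are the ones that appear in the example for this endpoint. Check them against the server you actually run.

```python
import base64


def build_generate_payload(text: str, reference_wav: bytes) -> dict:
    """Build a JSON body for POST /generate (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "text": text,
        # The reference audio travels as base64-encoded text
        "prompt_audio_base64": base64.b64encode(reference_wav).decode("ascii"),
    }


def decode_generate_response(data: dict) -> bytes:
    """Return the raw audio bytes carried in the 'audio_base64' field."""
    if "audio_base64" not in data:
        raise KeyError("response has no 'audio_base64' field")
    return base64.b64decode(data["audio_base64"])


# Round trip with placeholder bytes (not a valid WAV file)
fake_wav = b"RIFF....WAVE"
payload = build_generate_payload("Hello", fake_wav)
echoed = decode_generate_response({"audio_base64": payload["prompt_audio_base64"]})
print(echoed == fake_wav)
```

The decoded bytes can then be passed to `sf.read(io.BytesIO(...))` exactly as in the endpoint example.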
### Streaming HTTP response (real-time web playback)

```python
import base64

import requests

# Read reference audio as base64
with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

with requests.post(
    "http://127.0.0.1:18083/generate_stream",
    json={
        # "Streaming speech synthesis example, suited to real-time playback."
        "text": "流式语音合成示例,适合实时播放场景。",
        "prompt_audio_base64": ref_audio_b64,
    },
    stream=True,
) as resp:
    # Write each chunk to disk as soon as it arrives
    with open("stream_output.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)
```
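When consuming the streamed response, it can help to separate the write loop from the HTTP call so that the loop can be exercised without a running server. The sketch below makes no assumption about the byte format returned by `/generate_stream`; it only writes the chunks in arrival order and skips empty ones. `save_stream` is a hypothetical helper name, not part of MOSS-TTS-Nano.

```python
from typing import Iterable


def save_stream(chunks: Iterable[bytes], out_path: str) -> int:
    """Write byte chunks to out_path in arrival order; return bytes written."""
    written = 0
    with open(out_path, "wb") as out:
        for chunk in chunks:
            if not chunk:  # skip empty chunks
                continue
            out.write(chunk)
            written += len(chunk)
    return written
```

With `requests`, the iterator would be `resp.iter_content(chunk_size=4096)` from a `stream=True` request; any other iterable of `bytes` works the same way.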
## Supported Languages

| Code | Language | Code | Language | Code | Language |
|------|----------|------|----------|------|----------|
| zh | Chinese | en | English | de | German |
| es | Spanish | fr | French | ja | Japanese |
| it | Italian | hu | Hungarian | ko | Korean |
| ru | Russian | fa | Persian | ar | Arabic |
| pl | Polish | pt | Portuguese | cs | Czech |
| da | Danish | sv | Swedish | el | Greek |
| tr | Turkish | | | | |

The language is inferred automatically from the input text and the reference audio. No explicit language code parameter is required for basic usage.
## Architecture Overview

| Component | Details |
|-----------|---------|
| Pipeline | Audio Tokenizer + LLM (pure autoregressive) |
| Audio Tokenizer | MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture) |
| Output | 48 kHz, 2-channel (stereo) |
| Token rate | 12.5 Hz token stream |
| Codebooks | RVQ with 16 codebooks (0.125 kbps – 2 kbps) |
| LLM | ~0.1B parameters total |

## Key CLI Flags

| Flag | Alias | Description |
|------|-------|-------------|
| `--prompt-audio-path` | none | Path to reference WAV for voice cloning (`infer.py`) |
| `--prompt-speech` | none | Same purpose in the `moss-tts-nano generate` CLI |
| `--text` | none | Input text string |
| `--text-file` | none | Path to plain text file for long-form synthesis |
| `--output` | none | Output WAV file path (default varies by entrypoint) |

## Common Patterns

### Pattern: Batch synthesis with one reference voice

```python
import soundfile as sf

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()
ref = "assets/audio/zh_1.wav"

sentences = [
    "第一句话,用于批量合成测试。",  # "First sentence, for the batch synthesis test."
    "第二句话,保持相同的音色。",  # "Second sentence, keeping the same timbre."
    "第三句话,输出独立的音频文件。",  # "Third sentence, written to its own audio file."
]

# One output file per sentence, all cloned from the same reference voice
for i, sentence in enumerate(sentences):
    audio = tts.infer(text=sentence, prompt_audio_path=ref)
    sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
    print(f"Saved output_{i:02d}.wav")
```

### Pattern: Real-time playback with sounddevice

```python
import numpy as np
import sounddevice as sd

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

# Chunks are collected as they are generated, then played back in one go
buffer = []
for chunk in tts.infer_stream(
    text="Real-time playback example using sounddevice.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    buffer.append(chunk)

audio = np.concatenate(buffer)
sd.play(audio, samplerate=48000)
sd.wait()  # block until playback finishes
```

### Pattern: Gradio integration

```python
import gradio as gr

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()


def synthesize(reference_audio_path: str, text: str):
    audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
    # Return as (sample_rate, numpy_array) tuple for Gradio Audio component
    return (48000, audio)


demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Text to synthesize"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="MOSS-TTS-Nano Voice Clone",
)
demo.launch()
```

## Troubleshooting

### WeTextProcessing install fails
```bash
# Use conda to get pynini, then install from source
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git
```

### Model download is slow or fails

Set `HF_ENDPOINT` to a mirror if Hugging Face is unreachable:

```bash
export HF_ENDPOINT=https://hf-mirror.com
python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"
```

Or use ModelScope:

```bash
pip install modelscope
```

Then point the model paths to `openmoss/MOSS-TTS-Nano` and `openmoss/MOSS-Audio-Tokenizer-Nano`.

### Out of memory on CPU

- Use streaming inference (`infer_stream`) to reduce peak memory.
- Reduce chunk size for long text inputs. The model handles chunked voice cloning automatically.
- Close other applications; the model needs ~1–2 GB RAM.

### Audio output is silent or corrupt

- Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, at any sample rate (it will be resampled).
- Use at least ~3–5 seconds of reference audio for reliable voice cloning.
- Avoid reference audio with heavy background noise.

### `moss-tts-nano` command not found

```bash
# Re-run editable install inside the active conda env
pip install -e .
which moss-tts-nano  # should resolve now
```
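The same check can be made from Python, which also shows which interpreter (and therefore which environment) is active. This is a small sketch using only the standard library; `cli_status` is a hypothetical helper name.

```python
import shutil
import sys


def cli_status(command: str = "moss-tts-nano") -> str:
    """Report whether `command` is on PATH for the current environment."""
    path = shutil.which(command)
    if path is None:
        return f"{command} not found on PATH (interpreter: {sys.executable})"
    return f"{command} resolves to {path}"


print(cli_status())
```

If the interpreter path printed here is not inside the `moss-tts-nano` conda environment, activate that environment before re-running the editable install.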
### Port conflict for web demo

```bash
# Default port is 18083; check what occupies it
lsof -i :18083
# Kill if needed, then relaunch
moss-tts-nano serve
```

## Output Defaults

| Entrypoint | Default output path |
|------------|---------------------|
| `python infer.py` | `generated_audio/infer_output.wav` |
| `moss-tts-nano generate` | `generated_audio/moss_tts_nano_output.wav` |
| `python app.py` / `moss-tts-nano serve` | returned via HTTP response |

The `generated_audio/` directory is created automatically if it does not exist.
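The automatic directory creation applies to the entrypoints listed above. When saving audio yourself from the Python API to a custom location, the parent directory may need to be created first. A minimal sketch, assuming you want the same fallback path as `infer.py`; `prepare_output_path` is a hypothetical helper, not part of MOSS-TTS-Nano:

```python
from pathlib import Path
from typing import Optional


def prepare_output_path(
    output: Optional[str] = None,
    default: str = "generated_audio/infer_output.wav",
) -> Path:
    """Pick the output path and make sure its parent directory exists."""
    path = Path(output) if output else Path(default)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

The returned path can be passed to `sf.write(str(path), audio, samplerate=48000)`.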