moss-tts-nano-speech

Installs: 286
Rank: #10448

Install

npx skills add https://github.com/aradotso/trending-skills --skill moss-tts-nano-speech

MOSS-TTS-Nano Speech Generation Skill

Skill by ara.so, part of the Daily 2026 Skills collection.

MOSS-TTS-Nano is an open-source multilingual tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time. It supports 20 languages, voice cloning, and streaming inference, and it runs on CPU without a GPU.

Installation

Conda (recommended)

    conda create -n moss-tts-nano python=3.12 -y
    conda activate moss-tts-nano
    git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
    cd MOSS-TTS-Nano
    pip install -r requirements.txt
    pip install -e .

Fix WeTextProcessing if it fails

    conda install -c conda-forge pynini=2.1.6.post1 -y
    pip install git+https://github.com/WhizZest/WeTextProcessing.git

After pip install -e . the moss-tts-nano CLI command is available in the active environment.

Model Weights

Models are auto-downloaded from Hugging Face on first run:

- TTS model: OpenMOSS-Team/MOSS-TTS-Nano
- Audio tokenizer: OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano

ModelScope mirrors are available at openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.

CLI Commands

Generate speech (voice clone mode)

    moss-tts-nano generate \
      --prompt-speech assets/audio/zh_1.wav \
      --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"

Output defaults to generated_audio/moss_tts_nano_output.wav.

Generate from a text file (long-form)

    moss-tts-nano generate \
      --prompt-speech assets/audio/zh_1.wav \
      --text-file my_script.txt \
      --output output.wav

Launch local web demo

    moss-tts-nano serve

    # or directly:
    python app.py

Opens at http://127.0.0.1:18083. The model stays loaded in memory for fast repeated requests.

Direct Python entrypoint

    python infer.py \
      --prompt-audio-path assets/audio/zh_1.wav \
      --text "Hello, this is a test of MOSS-TTS-Nano."

Output: generated_audio/infer_output.wav

Python API Usage

Basic voice clone inference

    from infer import MossTTSNanoInference

    # Initialize once (downloads weights on first run)
    tts = MossTTSNanoInference()

    # Voice clone: synthesize text in the style of the reference audio
    audio = tts.infer(
        text="欢迎使用MOSS语音合成系统。",
        prompt_audio_path="assets/audio/zh_1.wav",
    )

    # Save output
    import soundfile as sf
    sf.write("output.wav", audio, samplerate=48000)

English voice clone

    from infer import MossTTSNanoInference
    import soundfile as sf

    tts = MossTTSNanoInference()
    audio = tts.infer(
        text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
        prompt_audio_path="assets/audio/en_sample.wav",
    )
    sf.write("english_output.wav", audio, samplerate=48000)

Streaming inference (low latency)

    from infer import MossTTSNanoInference
    import soundfile as sf
    import numpy as np

    tts = MossTTSNanoInference()
    chunks = []
    for audio_chunk in tts.infer_stream(
        text="This sentence is generated chunk by chunk for low latency playback.",
        prompt_audio_path="assets/audio/en_sample.wav",
    ):
        chunks.append(audio_chunk)
        # process or play chunk in real time here

    full_audio = np.concatenate(chunks)
    sf.write("streamed_output.wav", full_audio, samplerate=48000)

Long-text synthesis with chunked voice cloning

    from infer import MossTTSNanoInference
    import soundfile as sf

    tts = MossTTSNanoInference()
    long_text = """
    MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
    Each chunk uses the same reference voice, producing consistent speaker
    identity across the entire output even for multi-paragraph documents.
    """
    audio = tts.infer(
        text=long_text,
        prompt_audio_path="assets/audio/en_sample.wav",
    )
    sf.write("long_form_output.wav", audio, samplerate=48000)

FastAPI HTTP endpoint usage

When the server is running (moss-tts-nano serve or python app.py):

    import requests
    import base64
    import soundfile as sf
    import io
    import numpy as np

    # Read reference audio as base64
    with open("assets/audio/zh_1.wav", "rb") as f:
        ref_audio_b64 = base64.b64encode(f.read()).decode()

    response = requests.post(
        "http://127.0.0.1:18083/generate",
        json={
            "text": "你好,这是一个语音合成测试。",
            "prompt_audio_base64": ref_audio_b64,
        },
    )
    data = response.json()

    # Decode the returned audio and write it to disk
    audio_bytes = base64.b64decode(data["audio_base64"])
    audio_array, sr = sf.read(io.BytesIO(audio_bytes))
    sf.write("api_output.wav", audio_array, samplerate=sr)
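The request and response bodies above are plain JSON with base64-encoded audio, so the encoding and decoding steps can be factored out and checked without a running server. The following is a minimal sketch using only the standard library. The field names `text`, `prompt_audio_base64`, and `audio_base64` come from the example above; the helper names are illustrative and not part of MOSS-TTS-Nano. `wav_summary` assumes the audio is 16-bit PCM WAV, which is what the `wave` module can parse, and the "response" here is simulated.

```python
import base64
import io
import wave


def build_generate_payload(text: str, ref_wav_bytes: bytes) -> dict:
    """Build the JSON body for /generate from raw reference-WAV bytes."""
    return {
        "text": text,
        "prompt_audio_base64": base64.b64encode(ref_wav_bytes).decode("ascii"),
    }


def decode_generate_response(data: dict) -> bytes:
    """Extract audio bytes from a /generate JSON response, failing loudly if absent."""
    if "audio_base64" not in data:
        raise ValueError(f"no 'audio_base64' in response; keys: {sorted(data)}")
    return base64.b64decode(data["audio_base64"])


def wav_summary(wav_bytes: bytes) -> tuple:
    """Return (channels, sample_rate, frames). Assumes 16-bit PCM WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnchannels(), w.getframerate(), w.getnframes()


def silent_wav(seconds: float = 0.1, rate: int = 48000, channels: int = 2) -> bytes:
    """Make a short silent 16-bit PCM WAV, used here only as stand-in data."""
    frames = int(seconds * rate)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * frames * channels)
    return buf.getvalue()


# Offline round trip: no server involved, the response dict is simulated.
ref = silent_wav()
payload = build_generate_payload("你好", ref)
fake_response = {"audio_base64": payload["prompt_audio_base64"]}
print(wav_summary(decode_generate_response(fake_response)))
```

Against a live server, `payload` would be passed as `json=payload` to `requests.post`, and calling `response.raise_for_status()` before `response.json()` surfaces HTTP errors early.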

Streaming HTTP response (real-time web playback)

    import base64
    import requests

    # Read reference audio as base64
    with open("assets/audio/zh_1.wav", "rb") as f:
        ref_audio_b64 = base64.b64encode(f.read()).decode()

    # stream=True lets the client consume audio as the server produces it
    with requests.post(
        "http://127.0.0.1:18083/generate_stream",
        json={
            "text": "流式语音合成示例,适合实时播放场景。",
            "prompt_audio_base64": ref_audio_b64,
        },
        stream=True,
    ) as resp:
        with open("stream_output.wav", "wb") as out:
            for chunk in resp.iter_content(chunk_size=4096):
                out.write(chunk)
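The write loop above silently produces an empty file if the server returns no data. A slightly more defensive version of the same loop is sketched below, written against any iterable of byte chunks so it can be exercised without a server. The helper name `save_stream` is illustrative and not part of MOSS-TTS-Nano or requests.

```python
from pathlib import Path
from typing import Iterable


def save_stream(chunks: Iterable[bytes], path: str) -> int:
    """Write byte chunks to `path` as they arrive; return total bytes written."""
    total = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            if not chunk:  # skip empty keep-alive chunks
                continue
            out.write(chunk)
            total += len(chunk)
    if total == 0:
        # Remove the empty file so a failed request is not mistaken for audio
        Path(path).unlink()
        raise RuntimeError("stream ended without any audio data")
    return total
```

With the request from the example above, this would be called as `save_stream(resp.iter_content(chunk_size=4096), "stream_output.wav")`, ideally after `resp.raise_for_status()`.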

Supported Languages

| Code | Language | Code | Language | Code | Language |
|------|----------|------|----------|------|----------|
| zh | Chinese | en | English | de | German |
| es | Spanish | fr | French | ja | Japanese |
| it | Italian | hu | Hungarian | ko | Korean |
| ru | Russian | fa | Persian | ar | Arabic |
| pl | Polish | pt | Portuguese | cs | Czech |
| da | Danish | sv | Swedish | el | Greek |
| tr | Turkish | | | | |

The language is inferred automatically from the input text and the reference audio. No explicit language code parameter is required for basic usage.

Architecture Overview

- Pipeline: Audio Tokenizer + LLM (pure autoregressive)
- Audio Tokenizer: MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture)
- Output: 48 kHz, 2-channel (stereo)
- Token rate: 12.5 Hz token stream
- Codebooks: RVQ with 16 codebooks (0.125 kbps – 2 kbps)
- LLM: ~0.1B parameters total

Key CLI Flags

| Flag | Alias | Description |
|------|-------|-------------|
| --prompt-audio-path | - | Path to reference WAV for voice cloning (infer.py) |
| --prompt-speech | - | Same purpose in moss-tts-nano generate CLI |
| --text | - | Input text string |
| --text-file | - | Path to plain text file for long-form synthesis |
| --output | - | Output WAV file path (default varies by entrypoint) |

Common Patterns

Pattern: Batch synthesis with one reference voice

    from infer import MossTTSNanoInference
    import soundfile as sf

    tts = MossTTSNanoInference()
    ref = "assets/audio/zh_1.wav"
    sentences = [
        "第一句话,用于批量合成测试。",
        "第二句话,保持相同的音色。",
        "第三句话,输出独立的音频文件。",
    ]

    # Reuse the loaded model and the same reference voice for every sentence
    for i, sentence in enumerate(sentences):
        audio = tts.infer(text=sentence, prompt_audio_path=ref)
        sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
        print(f"Saved output_{i:02d}.wav")

Pattern: Real-time playback with sounddevice

    import sounddevice as sd
    import numpy as np
    from infer import MossTTSNanoInference

    tts = MossTTSNanoInference()
    buffer = []
    for chunk in tts.infer_stream(
        text="Real-time playback example using sounddevice.",
        prompt_audio_path="assets/audio/en_sample.wav",
    ):
        buffer.append(chunk)

    # Play the collected audio once generation has finished
    audio = np.concatenate(buffer)
    sd.play(audio, samplerate=48000)
    sd.wait()

Pattern: Gradio integration

    import gradio as gr
    import soundfile as sf
    import numpy as np
    import io
    from infer import MossTTSNanoInference

    tts = MossTTSNanoInference()

    def synthesize(reference_audio_path: str, text: str):
        audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
        # Return as (sample_rate, numpy_array) tuple for Gradio Audio component
        return (48000, audio)

    demo = gr.Interface(
        fn=synthesize,
        inputs=[
            gr.Audio(type="filepath", label="Reference Voice"),
            gr.Textbox(label="Text to synthesize"),
        ],
        outputs=gr.Audio(label="Generated Speech"),
        title="MOSS-TTS-Nano Voice Clone",
    )
    demo.launch()

Troubleshooting

WeTextProcessing install fails

    # Use conda to get pynini, then install from source
    conda install -c conda-forge pynini=2.1.6.post1 -y
    pip install git+https://github.com/WhizZest/WeTextProcessing.git

Model download is slow or fails

Set HF_ENDPOINT to a mirror if Hugging Face is unreachable:

    export HF_ENDPOINT=https://hf-mirror.com
    python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"

Or use ModelScope:

    pip install modelscope

Then point model paths to openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.

Out of memory on CPU

- Use streaming inference (infer_stream) to reduce peak memory.
- Reduce chunk size for long text inputs; the model handles chunked voice cloning automatically.
- Close other applications; the model needs ~1–2 GB RAM.

Audio output is silent or corrupt

- Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, any sample rate (it will be resampled).
- Minimum reference audio duration: ~3–5 seconds for reliable voice cloning.
- Avoid reference audio with heavy background noise.

moss-tts-nano command not found

    # Re-run editable install inside the active conda env
    pip install -e .
    which moss-tts-nano  # should resolve now
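When a Python script shells out to the CLI, the same check can be made up front so the failure message is clearer than a bare "command not found". A minimal sketch using the standard library's `shutil.which`; the helper name `cli_available` is illustrative and not part of MOSS-TTS-Nano.

```python
import shutil


def cli_available(command: str = "moss-tts-nano") -> bool:
    """Return True if `command` resolves to an executable on PATH."""
    return shutil.which(command) is not None


if not cli_available():
    print("moss-tts-nano not found; re-run `pip install -e .` in the active env")
```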


Port conflict for web demo

    # Default port is 18083; check what occupies it
    lsof -i :18083

    # Kill if needed, then relaunch
    moss-tts-nano serve

Output Defaults

| Entrypoint | Default output path |
|------------|---------------------|
| python infer.py | generated_audio/infer_output.wav |
| moss-tts-nano generate | generated_audio/moss_tts_nano_output.wav |
| python app.py / moss-tts-nano serve | returned via HTTP response |

The generated_audio/ directory is created automatically if it does not exist.
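
The automatic directory creation is stated only for the default generated_audio/ folder. When writing results to a custom location from the Python API, it may be safer to create the parent directory yourself before saving. A minimal sketch; the helper name `prepare_output_path` and the example path are illustrative only.

```python
from pathlib import Path


def prepare_output_path(path: str) -> Path:
    """Create the parent directory of `path` if needed and return it as a Path."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    return out
```

A call such as `sf.write(str(prepare_output_path("runs/demo/output.wav")), audio, samplerate=48000)` then works whether or not `runs/demo/` already exists.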
