Python bindings for CosyVoice3 TTS using Candle.
CosyVoice3 is a state-of-the-art multilingual zero-shot text-to-speech model from FunAudioLLM.
- ⚡ High-performance Rust implementation via Candle
- 🐍 Native Python bindings via PyO3
- 🎯 Zero-shot voice cloning
- 🌍 Cross-lingual synthesis
- 🎙️ Instruction-guided synthesis
- 💻 Multi-platform support (Windows, macOS, Linux)
- 🔧 GPU acceleration (CUDA, Metal)
- 🎵 Built-in audio file loading (WAV, MP3, OGG)
Pre-converted weights are available on Hugging Face:
spensercai/CosyVoice3-0.5B-Candle
```bash
# Download using the Hugging Face CLI
pip install huggingface_hub
hf download spensercai/CosyVoice3-0.5B-Candle --local-dir ./CosyVoice3-0.5B-Candle
```

If you want to convert from the original PyTorch weights:
```bash
# Download original weights
hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local-dir ./Fun-CosyVoice3-0.5B-2512

# Convert to Candle format
python scripts/convert_weights.py \
    --input ./Fun-CosyVoice3-0.5B-2512 \
    --output ./CosyVoice3-0.5B-Candle
```

Download pre-built wheels from GitHub Releases:
```bash
# CPU version (all platforms)
pip install cosyvoice3-<version>-cp310-abi3-<platform>.whl

# Metal version (macOS Apple Silicon)
pip install cosyvoice3-<version>+metal-cp310-abi3-macosx_11_0_arm64.whl

# CUDA 11.8+ version (Linux, maximum compatibility)
pip install cosyvoice3-<version>+cu118-cp310-abi3-linux_x86_64.whl

# CUDA 12.4+ version (Linux/Windows)
pip install cosyvoice3-<version>+cu124-cp310-abi3-<platform>.whl

# CUDA 12.8+ version (Linux, latest features)
pip install cosyvoice3-<version>+cu128-cp310-abi3-linux_x86_64.whl
```

To build and install from source:

```bash
# Clone the repository
git clone https://github.com/SpenserCai/cosyvoice3.rs
cd cosyvoice3.rs

# Install maturin
pip install maturin

# Build and install (default: CPU + ONNX)
maturin develop --release

# Build with Metal support (macOS Apple Silicon)
maturin develop --release --features metal

# Build with CUDA support (Linux/Windows)
maturin develop --release --features cuda
```

Recommended: Linux users with CUDA should build from source rather than install the pre-built wheels, for better compatibility.
Prerequisites:

- Install the Rust toolchain:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```

- Install protoc (the Protocol Buffers compiler):

```bash
# Ubuntu/Debian
sudo apt install protobuf-compiler

# Or download from GitHub releases
PB_REL="https://github.com/protocolbuffers/protobuf/releases"
curl -LO $PB_REL/download/v28.3/protoc-28.3-linux-x86_64.zip
sudo unzip protoc-28.3-linux-x86_64.zip -d /usr/local
```

- Install maturin:

```bash
pip install maturin
```

Build:

```bash
# Install in editable mode
pip install -e .

# Or use maturin directly
maturin develop --release --features cuda
```

GCC Version Compatibility: CUDA has strict GCC version requirements:
- CUDA 11.x: requires GCC ≤ 11
- CUDA 12.x: requires GCC ≤ 12
- CUDA 12.8: requires GCC ≤ 13
If your system's default GCC is newer (e.g., GCC 13), you need to specify a compatible version:

```bash
# Check your CUDA version
nvcc --version

# Check your GCC version
gcc --version

# If GCC is too new, install a compatible version and set NVCC_CCBIN
# For Ubuntu/Debian:
sudo apt install gcc-11 g++-11  # For CUDA 11.x
sudo apt install gcc-12 g++-12  # For CUDA 12.x

# Build with a specific GCC version
NVCC_CCBIN=/usr/bin/gcc-11 maturin develop --release --features cuda  # CUDA 11.x
NVCC_CCBIN=/usr/bin/gcc-12 maturin develop --release --features cuda  # CUDA 12.x
```

Environment Variables for CUDA Build:
| Variable | Description | Example |
|---|---|---|
| `NVCC_CCBIN` | Path to the GCC compiler for nvcc | `/usr/bin/gcc-11` |
| `CUDA_COMPUTE_CAP` | Target GPU compute capability | `80` (A100), `89` (RTX 4090) |
| `CUDA_PATH` | CUDA installation path (usually auto-detected) | `/usr/local/cuda` |
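Putting these variables together, a full CUDA build invocation might look like the following sketch. The GCC path and compute capability here are examples only (CUDA 12.x with GCC 12, targeting an RTX 4090); adjust both for your system:

```shell
# Example only: pin nvcc's host compiler and the target compute capability,
# then build and install the extension in the current Python environment.
CUDA_COMPUTE_CAP=89 NVCC_CCBIN=/usr/bin/gcc-12 maturin develop --release --features cuda
```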
```bash
# Build a wheel for distribution
maturin build --release

# With specific features
maturin build --release --features "metal"
```

```python
from cosyvoice3 import CosyVoice3

# Load the model
model = CosyVoice3("./CosyVoice3-0.5B-Candle")

# Zero-shot voice cloning - just provide text and a prompt audio file
audio = model.inference_zero_shot(
    text="你好,这是一个测试。",
    prompt_text="You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。",
    prompt_wav="prompt.wav"
)

# Save the audio as 16-bit PCM WAV
import wave, struct

with wave.open("output.wav", "w") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(model.sample_rate)
    audio_int16 = [int(max(-32768, min(32767, s * 32767))) for s in audio]
    f.writeframes(struct.pack(f"{len(audio_int16)}h", *audio_int16))
```

Clone a voice from a reference audio sample:
```python
from cosyvoice3 import CosyVoice3, SamplingConfig

model = CosyVoice3("./CosyVoice3-0.5B-Candle")

# Basic usage
audio = model.inference_zero_shot(
    text="Hello, this is synthesized speech.",
    prompt_text="You are a helpful assistant.<|endofprompt|>Hello, this is my voice sample.",
    prompt_wav="reference_voice.wav"
)

# With a custom sampling config
config = SamplingConfig(top_k=25, top_p=0.8, temperature=1.0)
audio = model.inference_zero_shot(
    text="Hello, this is synthesized speech.",
    prompt_text="You are a helpful assistant.<|endofprompt|>Hello, this is my voice sample.",
    prompt_wav="reference_voice.wav",
    sampling_config=config,
    n_timesteps=10
)
```

Clone a voice across different languages:
```python
audio = model.inference_cross_lingual(
    text="<|en|>Hello, this is cross-lingual synthesis.",
    prompt_wav="chinese_reference.wav"
)
```

Control the speech style with instructions:
```python
audio = model.inference_instruct(
    text="你好世界",
    # The instruction "请用广东话表达" asks the model to speak in Cantonese
    instruct_text="You are a helpful assistant. 请用广东话表达。<|endofprompt|>",
    prompt_wav="reference.wav"
)
```

For repeated synthesis with the same voice, you can load pre-extracted features:
```python
# Load features from a safetensors file
tokens, mel, embedding = model.load_prompt_features("features.safetensors")

# Use the low-level synthesize API
from cosyvoice3 import SynthesisMode

audio = model.synthesize(
    text="Hello, world!",
    prompt_speech_tokens=tokens,
    prompt_mel=mel,
    speaker_embedding=embedding,
    prompt_text="You are a helpful assistant.<|endofprompt|>Hello.",
    mode=SynthesisMode.ZeroShot
)
```

```python
from cosyvoice3 import CosyVoice3, PyDevice

# Auto-select the best available device
model = CosyVoice3("./model")

# Explicitly specify a device
model = CosyVoice3("./model", device=PyDevice("cpu"))
model = CosyVoice3("./model", device=PyDevice("cuda"))   # Requires the cuda feature
model = CosyVoice3("./model", device=PyDevice("metal"))  # Requires the metal feature, macOS only

# Use FP16 precision (GPU only)
model = CosyVoice3("./model", device=PyDevice("metal"), use_f16=True)

# Check device availability
print(f"CUDA available: {PyDevice.cuda_is_available()}")
print(f"Metal available: {PyDevice.metal_is_available()}")
print(f"Best available: {PyDevice.best_available()}")
```

For better TTS quality, use the built-in text normalizer:
```python
from cosyvoice3 import TextNormalizer

# Initialize the normalizer (requires wetext FST files)
normalizer = TextNormalizer("./wetext-fsts")

# Normalize text
text = normalizer.normalize("2024年1月15日,价格是$100.50")
# Output: "二零二四年一月十五日,价格是一百美元五十美分"
# ("January fifteenth, twenty twenty-four, the price is one hundred dollars and fifty cents")

# Use the normalized text for synthesis
audio = model.inference_zero_shot(
    text=text,
    prompt_text="...",
    prompt_wav="prompt.wav"
)
```

The WeText FST files are available on ModelScope:

```bash
git clone https://www.modelscope.cn/models/pengzhendong/wetext
```

CosyVoice3 is the main model class.
Constructor:

```python
CosyVoice3(
    model_dir: str,
    device: PyDevice = None,  # Auto-selects if None
    use_f16: bool = False     # Use FP16 precision (GPU only)
)
```

Methods:
| Method | Description |
|---|---|
| `inference_zero_shot(text, prompt_text, prompt_wav, ...)` | Zero-shot voice cloning |
| `inference_cross_lingual(text, prompt_wav, ...)` | Cross-lingual synthesis |
| `inference_instruct(text, instruct_text, prompt_wav, ...)` | Instruction-guided synthesis |
| `synthesize(text, prompt_speech_tokens, prompt_mel, speaker_embedding, ...)` | Low-level synthesis API |
| `load_prompt_features(features_path)` | Load pre-extracted features |
Properties:

| Property | Type | Description |
|---|---|---|
| `sample_rate` | `int` | Audio sample rate (24000) |
| `config` | `CosyVoice3Config` | Model configuration |
| `has_onnx` | `bool` | Whether the ONNX feature is compiled in |
Enum for synthesis modes:

- `SynthesisMode.ZeroShot` - Zero-shot voice cloning
- `SynthesisMode.CrossLingual` - Cross-lingual synthesis
- `SynthesisMode.Instruct` - Instruction-guided synthesis
LLM sampling configuration:

```python
SamplingConfig(
    top_k: int = 25,
    top_p: float = 0.8,
    temperature: float = 1.0,
    repetition_penalty: float = 1.0
)
```

Device selection:

```python
PyDevice("cpu")               # CPU
PyDevice("cuda")              # CUDA GPU
PyDevice("metal")             # Metal GPU (macOS)
PyDevice.best_available()     # Auto-select the best device
PyDevice.cuda_is_available()  # Check CUDA
PyDevice.metal_is_available() # Check Metal
```

In zero-shot mode, `prompt_text` should follow this format:
```
You are a helpful assistant.<|endofprompt|>[transcript of prompt audio]
```

- Fixed prefix: `You are a helpful assistant.<|endofprompt|>` - required by CosyVoice3's LLM
- Transcript: the actual text content spoken in the prompt audio

Example: If your prompt audio says "希望你以后能够做的比我还好呦" ("I hope you can do even better than me in the future"), use:

```python
prompt_text = "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。"
```

For the best voice cloning quality, the transcript should accurately match the audio content.
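Since every zero-shot call needs the same fixed prefix, a tiny helper can assemble `prompt_text` from the transcript alone. This is a sketch; `build_prompt_text` is our own name, not part of the package:

```python
# Fixed prefix required by CosyVoice3's LLM in zero-shot mode
PROMPT_PREFIX = "You are a helpful assistant.<|endofprompt|>"

def build_prompt_text(transcript: str) -> str:
    """Prepend the fixed CosyVoice3 prefix to the prompt-audio transcript."""
    return PROMPT_PREFIX + transcript

prompt_text = build_prompt_text("希望你以后能够做的比我还好呦。")
# -> "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。"
```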
| Feature | Description |
|---|---|
| `onnx` | Enable the ONNX frontend for prompt feature extraction (default) |
| `symphonia` | Enable audio file loading (WAV, MP3, OGG) (default) |
| `metal` | Enable Metal GPU acceleration (macOS) |
| `cuda` | Enable CUDA GPU acceleration |
| `accelerate` | Enable the Apple Accelerate framework |
| `mkl` | Enable Intel MKL |
| Device | RTF (Real-Time Factor) |
|---|---|
| Apple M1 Pro (Metal) | ~0.3-0.5x |
| CPU (x86_64) | ~2-4x |
RTF < 1.0 means faster than real-time
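You can measure RTF on your own hardware by timing a synthesis call. A minimal sketch (the helper name is ours; 24000 matches `model.sample_rate`):

```python
import time

def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int = 24000) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    return synthesis_seconds / (num_samples / sample_rate)

# Example: 1 s of wall-clock time to generate 48000 samples (2 s of audio at 24 kHz)
print(real_time_factor(1.0, 48000))  # 0.5 -> twice as fast as real-time

# In practice (requires a loaded model):
# t0 = time.time()
# audio = model.inference_zero_shot(...)
# rtf = real_time_factor(time.time() - t0, len(audio))
```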
Apache-2.0