Python bindings for CosyVoice3 TTS using Candle.
CosyVoice3 is a state-of-the-art multilingual zero-shot text-to-speech model from FunAudioLLM.
- ⚡ High-performance Rust implementation via Candle
- 🐍 Native Python bindings via PyO3
- 🎯 Zero-shot voice cloning
- 🌍 Cross-lingual synthesis
- 🎙️ Instruction-guided synthesis
- 💻 Multi-platform support (Windows, macOS, Linux)
- 🔧 GPU acceleration (CUDA, Metal)
- 🎵 Built-in audio file loading (WAV, MP3, OGG)
Pre-converted weights are available on Hugging Face:
spensercai/CosyVoice3-0.5B-Candle
```bash
# Download using the Hugging Face CLI
pip install huggingface_hub
hf download spensercai/CosyVoice3-0.5B-Candle --local-dir ./CosyVoice3-0.5B-Candle
```

If you want to convert from the original PyTorch weights:
```bash
# Download original weights
hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local-dir ./Fun-CosyVoice3-0.5B-2512

# Convert to Candle format
python scripts/convert_weights.py \
    --input ./Fun-CosyVoice3-0.5B-2512 \
    --output ./CosyVoice3-0.5B-Candle
```

Download pre-built wheels from GitHub Releases:
```bash
# CPU version (all platforms)
pip install cosyvoice3-<version>-cp310-abi3-<platform>.whl

# Metal version (macOS Apple Silicon)
pip install cosyvoice3-<version>+metal-cp310-abi3-macosx_11_0_arm64.whl

# CUDA 11.8+ version (Linux, maximum compatibility)
pip install cosyvoice3-<version>+cu118-cp310-abi3-linux_x86_64.whl

# CUDA 12.4+ version (Linux/Windows)
pip install cosyvoice3-<version>+cu124-cp310-abi3-<platform>.whl

# CUDA 12.8+ version (Linux, latest features)
pip install cosyvoice3-<version>+cu128-cp310-abi3-linux_x86_64.whl
```

To build and install from source:

```bash
# Clone the repository
git clone https://github.com/SpenserCai/cosyvoice3.rs
cd cosyvoice3.rs

# Install maturin
pip install maturin

# Build and install (default: CPU + ONNX)
maturin develop --release

# Build with Metal support (macOS Apple Silicon)
maturin develop --release --features metal

# Build with CUDA support (Linux/Windows)
maturin develop --release --features cuda
```

Recommended: Linux users with CUDA should build from source rather than install the pre-built wheels, for better compatibility.
Prerequisites:

- Install the Rust toolchain:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```

- Install protoc (the Protocol Buffers compiler):

```bash
# Ubuntu/Debian
sudo apt install protobuf-compiler

# Or download from GitHub releases
PB_REL="https://github.com/protocolbuffers/protobuf/releases"
curl -LO $PB_REL/download/v28.3/protoc-28.3-linux-x86_64.zip
sudo unzip protoc-28.3-linux-x86_64.zip -d /usr/local
```

- Install maturin:

```bash
pip install maturin
```

Build:

```bash
# Install in editable mode
pip install -e .

# Or use maturin directly
maturin develop --release --features cuda
```

GCC Version Compatibility: CUDA has strict GCC version requirements:
- CUDA 11.x: requires GCC ≤ 11
- CUDA 12.x: requires GCC ≤ 12
- CUDA 12.8: requires GCC ≤ 13
If your system's default GCC is newer (e.g., GCC 13), you need to specify a compatible version:

```bash
# Check your CUDA version
nvcc --version

# Check your GCC version
gcc --version

# If GCC is too new, install a compatible version and set NVCC_CCBIN
# For Ubuntu/Debian:
sudo apt install gcc-11 g++-11  # For CUDA 11.x
sudo apt install gcc-12 g++-12  # For CUDA 12.x

# Build with a specific GCC version
NVCC_CCBIN=/usr/bin/gcc-11 maturin develop --release --features cuda  # CUDA 11.x
NVCC_CCBIN=/usr/bin/gcc-12 maturin develop --release --features cuda  # CUDA 12.x
```

Environment Variables for CUDA Build:
| Variable | Description | Example |
|---|---|---|
| `NVCC_CCBIN` | Path to the GCC compiler for nvcc | `/usr/bin/gcc-11` |
| `CUDA_COMPUTE_CAP` | Target GPU compute capability | `80` (A100), `89` (RTX 4090) |
| `CUDA_PATH` | CUDA installation path (usually auto-detected) | `/usr/local/cuda` |
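Putting these variables together, a full CUDA build invocation might look like the following sketch. The GCC path and compute capability here are examples only (CUDA 12.x with GCC 12, targeting an RTX 4090); adjust both for your system:

```shell
# Example only: pin nvcc's host compiler and the target compute capability,
# then build and install the extension in the current Python environment.
CUDA_COMPUTE_CAP=89 NVCC_CCBIN=/usr/bin/gcc-12 maturin develop --release --features cuda
```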
```bash
# Build a wheel for distribution
maturin build --release

# With specific features
maturin build --release --features "metal"
```

```python
from cosyvoice3 import CosyVoice3

# Load the model
model = CosyVoice3("./CosyVoice3-0.5B-Candle")

# Zero-shot voice cloning - just provide text and a prompt audio file
audio = model.inference_zero_shot(
    text="你好,这是一个测试。",
    prompt_text="You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。",
    prompt_wav="prompt.wav"
)

# Save the audio as 16-bit PCM WAV
import wave, struct

with wave.open("output.wav", "w") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(model.sample_rate)
    audio_int16 = [int(max(-32768, min(32767, s * 32767))) for s in audio]
    f.writeframes(struct.pack(f"{len(audio_int16)}h", *audio_int16))
```

Clone a voice from a reference audio sample:
```python
from cosyvoice3 import CosyVoice3, SamplingConfig

model = CosyVoice3("./CosyVoice3-0.5B-Candle")

# Basic usage
audio = model.inference_zero_shot(
    text="Hello, this is synthesized speech.",
    prompt_text="You are a helpful assistant.<|endofprompt|>Hello, this is my voice sample.",
    prompt_wav="reference_voice.wav"
)

# With a custom sampling config
config = SamplingConfig(top_k=25, top_p=0.8, temperature=1.0)
audio = model.inference_zero_shot(
    text="Hello, this is synthesized speech.",
    prompt_text="You are a helpful assistant.<|endofprompt|>Hello, this is my voice sample.",
    prompt_wav="reference_voice.wav",
    sampling_config=config,
    n_timesteps=10
)
```

Clone a voice across different languages:
```python
audio = model.inference_cross_lingual(
    text="<|en|>Hello, this is cross-lingual synthesis.",
    prompt_wav="chinese_reference.wav"
)
```

Control the speech style with instructions:
```python
audio = model.inference_instruct(
    text="你好世界",
    # The instruction "请用广东话表达" asks the model to speak in Cantonese
    instruct_text="You are a helpful assistant. 请用广东话表达。<|endofprompt|>",
    prompt_wav="reference.wav"
)
```

For repeated synthesis with the same voice, you can load pre-extracted features:
```python
# Load features from a safetensors file
tokens, mel, embedding = model.load_prompt_features("features.safetensors")

# Use the low-level synthesize API
from cosyvoice3 import SynthesisMode

audio = model.synthesize(
    text="Hello, world!",
    prompt_speech_tokens=tokens,
    prompt_mel=mel,
    speaker_embedding=embedding,
    prompt_text="You are a helpful assistant.<|endofprompt|>Hello.",
    mode=SynthesisMode.ZeroShot
)
```

```python
from cosyvoice3 import CosyVoice3, PyDevice

# Auto-select the best available device
model = CosyVoice3("./model")

# Explicitly specify a device
model = CosyVoice3("./model", device=PyDevice("cpu"))
model = CosyVoice3("./model", device=PyDevice("cuda"))   # Requires the cuda feature
model = CosyVoice3("./model", device=PyDevice("metal"))  # Requires the metal feature, macOS only

# Use FP16 precision (GPU only)
model = CosyVoice3("./model", device=PyDevice("metal"), use_f16=True)

# Check device availability
print(f"CUDA available: {PyDevice.cuda_is_available()}")
print(f"Metal available: {PyDevice.metal_is_available()}")
print(f"Best available: {PyDevice.best_available()}")
```

For better TTS quality, use the built-in text normalizer:
```python
from cosyvoice3 import TextNormalizer

# Initialize the normalizer (requires wetext FST files)
normalizer = TextNormalizer("./wetext-fsts")

# Normalize text
text = normalizer.normalize("2024年1月15日,价格是$100.50")
# Output: "二零二四年一月十五日,价格是一百美元五十美分"
# ("January fifteenth, twenty twenty-four, the price is one hundred dollars and fifty cents")

# Use the normalized text for synthesis
audio = model.inference_zero_shot(
    text=text,
    prompt_text="...",
    prompt_wav="prompt.wav"
)
```

The WeText FST files are available on ModelScope:

```bash
git clone https://www.modelscope.cn/models/pengzhendong/wetext
```

CosyVoice3 is the main model class.
Constructor:

```python
CosyVoice3(
    model_dir: str,
    device: PyDevice = None,  # Auto-selects if None
    use_f16: bool = False     # Use FP16 precision (GPU only)
)
```

Methods:
| Method | Description |
|---|---|
| `inference_zero_shot(text, prompt_text, prompt_wav, ...)` | Zero-shot voice cloning |
| `inference_cross_lingual(text, prompt_wav, ...)` | Cross-lingual synthesis |
| `inference_instruct(text, instruct_text, prompt_wav, ...)` | Instruction-guided synthesis |
| `synthesize(text, prompt_speech_tokens, prompt_mel, speaker_embedding, ...)` | Low-level synthesis API |
| `load_prompt_features(features_path)` | Load pre-extracted features |
Properties:

| Property | Type | Description |
|---|---|---|
| `sample_rate` | `int` | Audio sample rate (24000) |
| `config` | `CosyVoice3Config` | Model configuration |
| `has_onnx` | `bool` | Whether the ONNX feature is compiled in |
Enum for synthesis modes:

- `SynthesisMode.ZeroShot` - Zero-shot voice cloning
- `SynthesisMode.CrossLingual` - Cross-lingual synthesis
- `SynthesisMode.Instruct` - Instruction-guided synthesis
LLM sampling configuration:

```python
SamplingConfig(
    top_k: int = 25,
    top_p: float = 0.8,
    temperature: float = 1.0,
    repetition_penalty: float = 1.0
)
```

Device selection:

```python
PyDevice("cpu")               # CPU
PyDevice("cuda")              # CUDA GPU
PyDevice("metal")             # Metal GPU (macOS)
PyDevice.best_available()     # Auto-select the best device
PyDevice.cuda_is_available()  # Check CUDA
PyDevice.metal_is_available() # Check Metal
```

In zero-shot mode, `prompt_text` should follow this format:
```
You are a helpful assistant.<|endofprompt|>[transcript of prompt audio]
```

- Fixed prefix: `You are a helpful assistant.<|endofprompt|>` - required by CosyVoice3's LLM
- Transcript: the actual text content spoken in the prompt audio

Example: If your prompt audio says "希望你以后能够做的比我还好呦" ("I hope you can do even better than me in the future"), use:

```python
prompt_text = "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。"
```

For the best voice cloning quality, the transcript should accurately match the audio content.
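Since every zero-shot call needs the same fixed prefix, a tiny helper can assemble `prompt_text` from the transcript alone. This is a sketch; `build_prompt_text` is our own name, not part of the package:

```python
# Fixed prefix required by CosyVoice3's LLM in zero-shot mode
PROMPT_PREFIX = "You are a helpful assistant.<|endofprompt|>"

def build_prompt_text(transcript: str) -> str:
    """Prepend the fixed CosyVoice3 prefix to the prompt-audio transcript."""
    return PROMPT_PREFIX + transcript

prompt_text = build_prompt_text("希望你以后能够做的比我还好呦。")
# -> "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。"
```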
| Feature | Description |
|---|---|
| `onnx` | Enable the ONNX frontend for prompt feature extraction (default) |
| `symphonia` | Enable audio file loading (WAV, MP3, OGG) (default) |
| `metal` | Enable Metal GPU acceleration (macOS) |
| `cuda` | Enable CUDA GPU acceleration |
| `accelerate` | Enable the Apple Accelerate framework |
| `mkl` | Enable Intel MKL |
| Device | RTF (Real-Time Factor) |
|---|---|
| Apple M1 Pro (Metal) | ~0.3-0.5x |
| CPU (x86_64) | ~2-4x |
RTF < 1.0 means faster than real-time
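You can measure RTF on your own hardware by timing a synthesis call. A minimal sketch (the helper name is ours; 24000 matches `model.sample_rate`):

```python
import time

def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int = 24000) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    return synthesis_seconds / (num_samples / sample_rate)

# Example: 1 s of wall-clock time to generate 48000 samples (2 s of audio at 24 kHz)
print(real_time_factor(1.0, 48000))  # 0.5 -> twice as fast as real-time

# In practice (requires a loaded model):
# t0 = time.time()
# audio = model.inference_zero_shot(...)
# rtf = real_time_factor(time.time() - t0, len(audio))
```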
Apache-2.0