
🦜 VieNeu-TTS


VieNeu-TTS is an advanced on-device Vietnamese Text-to-Speech (TTS) model with instant voice cloning and English-Vietnamese bilingual support.

Important

🚀 VieNeu-TTS-v2 Turbo: optimized for edge devices and very fast inference on CPUs and low-end hardware.
Note: quality is lower than the Standard VieNeu-TTS, and the model may struggle with very short segments (< 5 words).
The VieNeu-TTS-v2 (Non-Turbo) release is coming soon!

✨ Key Features

  • Bilingual (English-Vietnamese): Smooth and natural transitions between languages powered by sea-g2p.
  • Instant Voice Cloning: Clone any voice with just 3-5 seconds of reference audio (Turbo v2 & GPU modes).
  • Ultra-Fast Turbo Mode: Optimized for both CPU (GGUF) and GPU (LMDeploy), offering the fastest inference in the VieNeu family.
  • AI Identification: Built-in audio watermarking for responsible AI content creation.
  • Production-Ready: High-quality 24 kHz waveform generation, fully offline.
Demo-VieNeu-TTS.mp4

📌 Table of Contents

  1. 🦜 Installation & Web UI
  2. 📦 Using the Python SDK
  3. 🐳 High-Quality Server (Standard Mode)
  4. 🔬 Model Overview
  5. 🚀 Roadmap
  6. 🤝 Support & Contact
  7. 📑 Citation

🦜 1. Installation & Web UI

Setup with uv (Recommended)

uv is a fast Python package manager and the recommended way to set up this project's dependencies.

# Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Linux/macOS:
curl -LsSf https://astral.sh/uv/install.sh | sh
  1. Clone the Repo:

    git clone https://github.com/pnnbao97/VieNeu-TTS.git
    cd VieNeu-TTS
  2. Install Dependencies:

    • Option 1: Minimal (Turbo/CPU) - Fast & Lightweight

      ⚠️ Note: This mode only supports VieNeu-TTS-v2-Turbo (CPU) — runs on any machine without a GPU, but audio quality is lower than Standard VieNeu-TTS (especially for short phrases < 5 words). Recommended for quick testing or deployment on low-end devices.

      uv sync
    • Option 2: Full (GPU/Standard) - High Quality & Cloning (For GPU users)

      💡 Note: Requires a CUDA-compatible NVIDIA GPU (CUDA version >= 12.8) or Apple Silicon MPS. NVIDIA Toolkit is required for maximum speed. Enables the full Standard VieNeu-TTS backbone for maximum audio quality and high-fidelity voice cloning.

      uv sync --group gpu
  3. Start the Web UI:

    uv run vieneu-web

    Access the UI at http://127.0.0.1:7860. The Turbo v2 model is selected by default for immediate use.


📦 2. Using the Python SDK (vieneu)

When used locally, the vieneu SDK defaults to Turbo mode to prioritize speed and real-time performance. For maximum audio quality (Standard VieNeu-TTS), set up a remote server and use the SDK in remote mode.

Quick Start

# Minimal installation (Builds llama-cpp from source - may take a while)
pip install vieneu

# Optional: For Windows users (CPU pre-built)
pip install vieneu --extra-index-url https://pnnbao97.github.io/llama-cpp-python-v0.3.16/cpu/

# Optional: For macOS users (ARM64/Apple Silicon - Enables Metal GPU acceleration)
pip install vieneu --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal/

Once installed, basic usage looks like this:

from vieneu import Vieneu

# Initialize in Turbo mode (Default - Minimal dependencies)
tts = Vieneu()

# 1. Simple synthesis (uses default Southern Male voice 'Xuân Vĩnh')
text = "Hệ thống điện chủ yếu sử dụng alternating current because it is more efficient."
audio = tts.infer(text=text)

# Save to file
tts.save(audio, "output_Xuân Vĩnh.wav")
print("💾 Saved to output_Xuân Vĩnh.wav")

# 2. Using a specific Preset Voice
voices = tts.list_preset_voices()
for desc, voice_id in voices:
    print(f"Voice: {desc} (ID: {voice_id})")

my_voice_id = voices[1][1] if len(voices) > 1 else voices[0][1]  # Phạm Tuyên voice
voice_data = tts.get_preset_voice(my_voice_id)

audio_custom = tts.infer(text="Tôi đang nói bằng giọng của Bác sĩ Tuyên.", voice=voice_data)

# 3. Save to file
tts.save(audio_custom, "output_Phạm Tuyên.wav")
print("💾 Saved to output_Phạm Tuyên.wav")

🦜 Zero-shot Voice Cloning (SDK)

Clone any voice with only 3-5 seconds of audio using the local Turbo engine:

from vieneu import Vieneu

tts = Vieneu() # Defaults to Turbo mode

# 1. Encode the reference audio
# Supported formats: .wav, .mp3, .flac (5-10 seconds recommended)
my_voice = tts.encode_reference("examples/audio_ref/example.wav")

# 2. Synthesize with the cloned voice
# No reference text required for Turbo v2!
audio = tts.infer(
    text="Đây là giọng nói được clone trực tiếp bằng SDK của VieNeu-TTS.", 
    voice=my_voice  # accepts numpy array from encode_reference() or preset dict from get_preset_voice()
)

tts.save(audio, "cloned_voice.wav")
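Since `encode_reference()` returns a plain numpy array, the encoded voice can be cached to disk and reused across runs instead of re-encoding the reference audio every time. The helper below is a minimal sketch of that pattern; the `load_or_encode` name and the `.npy` caching scheme are our own illustration, not part of the vieneu SDK:

```python
import os
import numpy as np

def load_or_encode(tts, audio_path, cache_path):
    """Reuse a previously encoded reference voice if one is cached on disk.

    `tts` is any object exposing encode_reference() (e.g. a Vieneu instance).
    The caching scheme itself is illustrative, not part of the vieneu SDK.
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)            # reuse the cached embedding
    voice = tts.encode_reference(audio_path)  # encode once (the slow step)
    np.save(cache_path, voice)                # persist for future runs
    return voice
```

Calling `tts.infer(text=..., voice=load_or_encode(tts, "ref.wav", "ref_voice.npy"))` then behaves like the example above, minus the repeated encoding.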

🐳 3. High-Quality Server (Standard Mode)

Deploy VieNeu-TTS as a high-performance API Server (powered by LMDeploy) with a single command.

1. Run with Docker (Recommended)

Requirement: NVIDIA Container Toolkit is required for GPU support.

Start the Server with a Public Tunnel (No port forwarding needed):

docker run --gpus all -p 23333:23333 pnnbao/vieneu-tts:serve --tunnel
  • Default: The server loads the VieNeu-TTS model for maximum quality.
  • Tunneling: The Docker image includes a built-in bore tunnel. Check the container logs to find your public address (e.g., bore.pub:31631).

2. Using the SDK (Remote Mode)

Once the server is running, you can connect from anywhere (Colab, Web Apps, etc.) without loading heavy models locally:

from vieneu import Vieneu
import os

# Configuration
REMOTE_API_BASE = 'http://your-server-ip:23333/v1'  # Or bore tunnel URL
REMOTE_MODEL_ID = "pnnbao-ump/VieNeu-TTS"

# Initialization (LIGHTWEIGHT - only loads small codec locally)
tts = Vieneu(mode='remote', api_base=REMOTE_API_BASE, model_name=REMOTE_MODEL_ID)
os.makedirs("outputs", exist_ok=True)

# List remote voices
available_voices = tts.list_preset_voices()
for desc, name in available_voices:
    print(f"   - {desc} (ID: {name})")

# Use specific voice (dynamically select second voice)
if len(available_voices) > 1:
    _, my_voice_id = available_voices[1]
    voice_data = tts.get_preset_voice(my_voice_id)
    audio_spec = tts.infer(text="Chào bạn, tôi đang nói bằng giọng của bác sĩ Tuyên.", voice=voice_data)
    tts.save(audio_spec, f"outputs/remote_{my_voice_id}.wav")
    print(f"💾 Saved synthesis to: outputs/remote_{my_voice_id}.wav")

# Standard synthesis (uses default voice)
text_input = "Chế độ remote giúp tích hợp VieNeu vào ứng dụng Web hoặc App cực nhanh mà không cần GPU tại máy khách."
audio = tts.infer(text=text_input)
tts.save(audio, "outputs/remote_output.wav")
print("💾 Saved remote synthesis to: outputs/remote_output.wav")

# Zero-shot voice cloning (encodes audio locally, sends codes to server)
if os.path.exists("examples/audio_ref/example_ngoc_huyen.wav"):
    cloned_audio = tts.infer(
        text="Đây là giọng nói được clone và xử lý thông qua VieNeu Server.",
        ref_audio="examples/audio_ref/example_ngoc_huyen.wav",
        ref_text="Tác phẩm dự thi bảo đảm tính khoa học, tính đảng, tính chiến đấu, tính định hướng."
    )
    tts.save(cloned_audio, "outputs/remote_cloned_output.wav")
    print("💾 Saved remote cloned voice to: outputs/remote_cloned_output.wav")

For full implementation details, see: examples/main_remote.py

Voice Preset Specification (v1.0)

VieNeu-TTS uses the official vieneu.voice.presets specification to define reusable voice assets. Only voices.json files following this spec are guaranteed to be compatible with the VieNeu-TTS SDK (v1.x and later).
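The exact schema of the vieneu.voice.presets spec is defined by the project; the snippet below only illustrates the general pattern of validating a voices.json payload before loading it. The field names (`id`, `description`) and the `spec`/`version` keys are assumptions for illustration, not the official schema:

```python
import json

# Hypothetical preset file; the real field names are defined by the
# official vieneu.voice.presets specification, not by this sketch.
EXAMPLE_VOICES_JSON = """
{
  "spec": "vieneu.voice.presets",
  "version": "1.0",
  "voices": [
    {"id": "xuan-vinh", "description": "Southern Male"}
  ]
}
"""

def check_preset_file(raw: str) -> list:
    """Parse a voices.json payload and return its voice entries,
    failing loudly if the declared spec/version is unexpected."""
    data = json.loads(raw)
    if data.get("spec") != "vieneu.voice.presets":
        raise ValueError(f"unknown spec: {data.get('spec')!r}")
    if data.get("version") != "1.0":
        raise ValueError(f"unsupported version: {data.get('version')!r}")
    return data["voices"]

voices = check_preset_file(EXAMPLE_VOICES_JSON)
```

Rejecting files with an unknown spec or version up front gives a clear error instead of a confusing failure deep inside synthesis.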

3. Advanced Configuration

Customize the server to run specific versions or your own fine-tuned models.

Run the 0.3B Model (Faster):

docker run --gpus all pnnbao/vieneu-tts:serve --model pnnbao-ump/VieNeu-TTS-0.3B --tunnel

Serve a Local Fine-tuned Model: If you have merged a LoRA adapter, mount your output directory to the container:

# Linux / macOS
docker run --gpus all \
  -v $(pwd)/finetune/output:/workspace/models \
  pnnbao/vieneu-tts:serve \
  --model /workspace/models/merged_model --tunnel

🔬 4. Model Overview

| Model | Format | Device | Bilingual | Cloning | Speed |
|---|---|---|---|---|---|
| VieNeu-v2-Turbo | GGUF/ONNX | CPU/Edge | ✅ | Yes | Extreme (Fastest) |
| VieNeu-TTS-v2 | PyTorch | GPU | ✅ | Yes | Standard (Coming soon) |
| VieNeu-TTS 0.3B | PyTorch | GPU/CPU | ✅ | Yes | Very Fast |
| VieNeu-TTS | PyTorch | GPU/CPU | ✅ | Yes | Standard |

Tip

Use Turbo v2 for AI assistants, chatbots, and real-time edge applications where speed is critical. Note: It may have stability issues with very short phrases (< 5 words). Use GPU/Standard (VieNeu-TTS v1/v2) for maximum audio quality and high-fidelity voice cloning.
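The rule of thumb above can be written down as a tiny selection helper. The function name and its mapping are our own illustration, using only the model names from the overview table:

```python
def pick_model(need_max_quality: bool, has_gpu: bool) -> str:
    """Illustrative model picker based on the overview table above."""
    if not has_gpu:
        return "VieNeu-v2-Turbo"   # GGUF/ONNX, runs on CPU/edge, fastest
    if need_max_quality:
        return "VieNeu-TTS"        # Standard quality, best cloning fidelity
    return "VieNeu-TTS 0.3B"       # Smaller backbone, very fast on GPU
```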


🚀 5. Roadmap

  • VieNeu-TTS-v2 Turbo: English-Vietnamese code-switching support.
  • VieNeu-Codec: Optimized neural codec for Vietnamese (ONNX).
  • VieNeu-TTS-v2 (Non-Turbo): Full high-fidelity bilingual architecture with instant Voice Cloning and LMDeploy GPU acceleration support.
  • Turbo Voice Cloning: Bringing instant cloning to the lightweight Turbo engine.
  • Mobile SDK: Official support for Android/iOS deployment.

🤝 6. Support & Contact


📑 7. Citation

@misc{vieneutts2026,
  title        = {VieNeu-TTS: Vietnamese Text-to-Speech with Instant Voice Cloning},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS}}
}



🤝 Contributors

Thanks to all the amazing people who have contributed to this project!


🙏 Acknowledgements

This project uses neucodec for audio decoding and sea-g2p for text normalization and phonemization.

Made with ❤️ for the Vietnamese TTS community
