Chatterbox TTS API Service

This is a high-performance text-to-speech (TTS) service based on Chatterbox-TTS. It provides an OpenAI TTS-compatible API interface, an enhanced interface supporting voice cloning, and a clean web user interface.

This project aims to offer developers and content creators a private, powerful, and easy-to-integrate TTS solution.

Project URL: https://github.com/jianchang512/chatterbox-api


Using with pyVideoTrans

This project can serve as a powerful TTS backend, providing high-quality English voiceovers for pyVideoTrans.

  1. Start this project: Ensure the Chatterbox TTS API service is running locally (http://127.0.0.1:5093).

  2. Update pyVideoTrans: Make sure your pyVideoTrans is upgraded to v3.73 or higher.

  3. Configure pyVideoTrans:

    • In the pyVideoTrans menu, go to TTS Settings -> Chatterbox TTS.
    • API Address: Enter the address of this service; the default is http://127.0.0.1:5093.
    • Reference Audio (Optional): If you want to use voice cloning, enter the filename of the reference audio here (e.g., my_voice.wav). Ensure the audio file is placed in the chatterbox folder in the pyVideoTrans root directory.
    • Adjust Parameters: Tune cfg_weight and exaggeration as needed for optimal results.

    Parameter Tuning Suggestions:

    • General Scenarios (TTS, Voice Assistants): Default settings (cfg_weight=0.5, exaggeration=0.5) work for most cases.
    • Fast-Paced Reference Audio: If the reference audio is fast, try lowering cfg_weight to around 0.3 to improve speech rhythm.
    • Expressive/Dramatic Speech: Try lower cfg_weight (e.g., 0.3) and higher exaggeration (e.g., 0.7 or higher). Increasing exaggeration often speeds up speech, while lowering cfg_weight helps balance for a more measured and clearer pace.
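
If you want to compare settings quickly, you can sweep the two parameters against the service's cloning endpoint (documented under API Calls below). This is a minimal sketch, assuming the service is running locally on the default port; the reference audio path is a placeholder:

```python
import itertools
import requests

API_URL = "http://127.0.0.1:5093/v2/audio/speech_with_prompt"
REFERENCE_AUDIO = "my_voice.wav"  # placeholder: path to your reference audio
TEXT = "A short sentence to compare parameter settings."

# Generate one sample per combination of cfg_weight and exaggeration.
for cfg, exag in itertools.product(["0.3", "0.5"], ["0.5", "0.7"]):
    with open(REFERENCE_AUDIO, "rb") as audio_file:
        response = requests.post(
            API_URL,
            data={"input": TEXT, "cfg_weight": cfg,
                  "exaggeration": exag, "response_format": "mp3"},
            files={"audio_prompt": audio_file},
        )
    if response.ok:
        name = f"sample_cfg{cfg}_exag{exag}.mp3"
        with open(name, "wb") as f:
            f.write(response.content)
        print("Saved", name)
    else:
        print(f"cfg={cfg}, exag={exag} failed:", response.text)
```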

✨ Features

  • Two API Interfaces:
    1. OpenAI-Compatible Interface: /v1/audio/speech integrates seamlessly with existing workflows via the OpenAI SDK.
    2. Voice Cloning Interface: /v2/audio/speech_with_prompt generates speech matching the voice characteristics of a short uploaded reference audio clip.
  • Web User Interface: Provides an intuitive frontend for quick testing and use of TTS features without coding.
  • Flexible Output Formats: Supports generating audio in .mp3 and .wav formats.
  • Cross-Platform Support: Detailed installation guides for Windows, macOS, and Linux.
  • One-Click Windows Deployment: Offers a compressed package for Windows users with all dependencies and startup scripts for out-of-the-box use.
  • GPU Acceleration: Supports NVIDIA GPU (CUDA), with a one-click upgrade script for Windows users.
  • Seamless Integration: Easily integrates as a backend service with tools like pyVideoTrans.

🚀 Quick Start

Method 1: Windows Users (Portable Package)

We provide a portable package, win.7z, for Windows users that contains all dependencies, greatly simplifying installation.

  1. Download and Extract: Download the win.7z package and extract it to a directory of your choice.

  2. Start the Service:

    • Double-click the 启动服务.bat (Start Service) script in the root directory.

    When the command window shows output like the following, the service has started successfully:

```
✅ Model loading complete.
Service started successfully, HTTP address: http://127.0.0.1:5093
```

Method 2: macOS, Linux, and Manual Installation Users

For macOS and Linux users, or for Windows users who prefer a manual setup, follow these steps.

1. Prerequisites

  • Python: Ensure Python 3.9 or higher is installed.
  • ffmpeg: A required audio/video processing tool.
    • macOS (using Homebrew): brew install ffmpeg
    • Debian/Ubuntu: sudo apt-get update && sudo apt-get install ffmpeg
    • Windows (manual): Download ffmpeg and add it to the system PATH environment variable.
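
As a quick preflight, the stdlib-only snippet below checks both prerequisites; it assumes ffmpeg is expected to be discoverable on PATH:

```python
import shutil
import sys

# The project requires Python 3.9 or higher.
assert sys.version_info >= (3, 9), f"Python 3.9+ required, found {sys.version}"

# ffmpeg must be resolvable from PATH for audio processing.
assert shutil.which("ffmpeg"), "ffmpeg not found on PATH"

print("Prerequisites look OK.")
```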

2. Installation Steps

```bash
# 1. Clone the project repository
git clone https://github.com/jianchang512/chatterbox-api.git
cd chatterbox-api

# 2. Create and activate a Python virtual environment (recommended)
python3 -m venv venv
# on Windows:
# venv\Scripts\activate
# on macOS/Linux:
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start the service
python app.py
```

Once the service starts successfully, you will see the service address http://127.0.0.1:5093 in the terminal.
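
To confirm the service is reachable without opening a browser, a simple probe works. This sketch assumes the root URL (which serves the Web UI) responds once the model has loaded:

```python
import requests

try:
    r = requests.get("http://127.0.0.1:5093", timeout=5)
    print("Service is up" if r.ok else f"Unexpected status: {r.status_code}")
except requests.ConnectionError:
    print("Service is not reachable; is app.py running?")
```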


⚡ Upgrade to GPU Version (Optional)

If your computer has a CUDA-capable NVIDIA GPU, and the NVIDIA driver and CUDA Toolkit are correctly installed, you can upgrade to the GPU version for a significant performance gain.

Windows Users (One-Click Upgrade)

  1. First, ensure you have successfully run 启动服务.bat at least once to complete the basic environment setup.
  2. Double-click the 安装N卡GPU支持.bat (Install NVIDIA GPU Support) script.
  3. The script will automatically uninstall the CPU version of PyTorch and install the GPU version compatible with CUDA 12.6.

Linux Manual Upgrade

After activating the virtual environment, run the following commands:

```bash
# 1. Uninstall the existing CPU version of PyTorch
pip uninstall -y torch torchaudio

# 2. Install PyTorch matching your CUDA version
# The following command is for CUDA 12.6; get the correct command from the PyTorch website for your CUDA version
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126
```

Visit the PyTorch website for installation commands suitable for your system.

After upgrading, restart the service. You will see Using device: cuda in the startup logs.
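
Before restarting, you can also confirm from the activated virtual environment that the GPU build of PyTorch is active; this uses only standard PyTorch calls:

```python
import torch

# True only when a CUDA-enabled PyTorch build is installed and a GPU is visible.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```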


📖 Usage Guide

1. Web Interface

After starting the service, open http://127.0.0.1:5093 in your browser to access the Web UI.

  • Input Text: Enter the text you want to convert in the text box.
  • Adjust Parameters:
    • cfg_weight: (Range 0.0 - 1.0) Controls speech rhythm. Lower values result in slower, more measured speech. For fast-paced reference audio, lower this value (e.g., 0.3).
    • exaggeration: (Range 0.25 - 2.0) Controls emotional and intonational exaggeration. Higher values mean more expressive speech and potentially faster speed.
  • Voice Cloning: Click "Choose File" to upload a reference audio file (e.g., .mp3, .wav). If provided, the service uses the cloning interface.
  • Generate Speech: Click the "Generate Speech" button, wait a moment, then listen online or download the generated MP3 file.

2. API Calls

Interface 1: OpenAI-Compatible Interface (/v1/audio/speech)

This interface does not require reference audio and can be called directly using the OpenAI SDK.

Python Example (openai SDK):

```python
from openai import OpenAI

# Point the client to our local service
client = OpenAI(
    base_url="http://127.0.0.1:5093/v1",
    api_key="not-needed"  # the service ignores the API key, but the SDK requires one
)

response = client.audio.speech.create(
    model="chatterbox-tts",   # this parameter is ignored
    voice="en",
    speed=0.5,                # corresponds to the cfg_weight parameter
    input="Hello, this is a test from the OpenAI compatible API.",
    instructions="0.5",       # (optional) corresponds to exaggeration; must be a string
    response_format="mp3"     # 'mp3' or 'wav'
)

# Save the audio stream to a file
response.stream_to_file("output_api1.mp3")
print("Audio saved to output_api1.mp3")
```

Interface 2: Voice Cloning Interface (/v2/audio/speech_with_prompt)

This interface requires uploading both text and a reference audio file in multipart/form-data format.

Python Example (requests library):

```python
import requests

API_URL = "http://127.0.0.1:5093/v2/audio/speech_with_prompt"
REFERENCE_AUDIO = "path/to/your/reference.mp3"  # Replace with your reference audio path

form_data = {
    'input': 'This voice should sound like the reference audio.',
    'cfg_weight': '0.5',
    'exaggeration': '0.5',
    'response_format': 'mp3'  # Optional 'mp3' or 'wav'
}

with open(REFERENCE_AUDIO, 'rb') as audio_file:
    files = {'audio_prompt': audio_file}
    response = requests.post(API_URL, data=form_data, files=files)

if response.ok:
    with open("output_api2.mp3", "wb") as f:
        f.write(response.content)
    print("Cloned audio saved to output_api2.mp3")
else:
    print("Request failed:", response.text)
```