
F5-TTS-api

Project source code: https://github.com/jianchang512/f5-tts-api

This is the API and web UI for the F5-TTS project.

F5-TTS is an advanced text-to-speech system that uses deep learning to generate realistic, high-quality human voices. With just a 10-second audio sample it can clone a voice, reproducing the speaker accurately and with rich emotional tone.

Original voice sample of the Queen of Women's Kingdom:

Cloned audio:

Windows Integrated Package (Includes F5-TTS Model and Runtime Environment)

123 Cloud Download: https://www.123684.com/s/03Sxjv-okTJ3

HuggingFace Download: https://huggingface.co/spaces/mortimerme/s4/resolve/main/f5-tts-api-v0.3.7z?download=true

Compatible Systems: Windows 10/11 (Extract and use after download)

Usage Instructions:

Start API Service: Double-click run-api.bat; the API address is http://127.0.0.1:5010/api.

The API service must be running before it can be used in the translation software.
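
To quickly confirm that the service is listening, you can send a request from Python. This is a minimal sketch; an empty request will not produce audio and may just return an error message, but any HTTP response proves the address is reachable:

import requests

# An empty POST will not generate audio, but receiving any HTTP response
# (even an error message) confirms the API service is up at this address.
try:
    res = requests.post("http://127.0.0.1:5010/api", timeout=10)
    print("API reachable, status:", res.status_code)
except requests.exceptions.ConnectionError:
    print("API not reachable - is run-api.bat running?")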

The integrated package uses CUDA 11.8 by default. If you have an NVIDIA GPU with CUDA/cuDNN configured, the system will automatically use GPU acceleration. To use a higher CUDA version, e.g., 12.4, follow these steps:

Navigate to the folder containing api.py, type cmd in the address bar and press Enter, then run the following commands in the terminal:

.\runtime\python -m pip uninstall -y torch torchaudio

.\runtime\python -m pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
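
After reinstalling, you can verify that the GPU build is active with a short check. Save the following as a small script (the filename check_gpu.py below is just an example) and run it with .\runtime\python check_gpu.py:

import torch

# Prints the CUDA version PyTorch was built with, and whether a GPU is visible
print(torch.version.cuda, torch.cuda.is_available())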

F5-TTS excels in efficiency and voice quality. Compared with similar systems that require longer audio samples, it generates high-fidelity speech from minimal audio input and conveys emotion effectively, which many existing technologies struggle to achieve.

Currently, F5-TTS supports both English and Chinese.

Usage Tip: Proxy/VPN

The model needs to be downloaded from huggingface.co. Since this site is inaccessible in some regions, set up a system or global proxy in advance; otherwise, model download will fail.

The integrated package includes most of the required models, but it may still check for updates or download additional small dependency models. If you see an HTTPS connection error in the terminal, you still need to set up a system proxy.
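
If setting a system-wide proxy is inconvenient, one alternative is to set the proxy environment variables at the very top of api.py, before any other imports; these variables are commonly honored by Python HTTP libraries, including the ones used for model downloads. This is only a sketch: the proxy address and port below are hypothetical and must match your own local proxy.

import os

# Hypothetical local proxy address; replace with your actual proxy host and port
os.environ["HTTP_PROXY"] = "http://127.0.0.1:10809"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:10809"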

Using in Video Translation Software

  1. Start the API service; it must be running before the translation software can use it.

  2. Open the video translation software, go to TTS settings, select F5-TTS, and enter the API address (default: http://127.0.0.1:5010).

  3. Enter the reference audio and the text spoken in that audio.

  4. It is recommended to select the f5-tts model for better generation quality.

Using api.py in Third-Party Integrated Packages

  1. Copy api.py and the configs folder to the root directory of the third-party integrated package.
  2. Check the path of the integrated python.exe; for example, it may be in a py311 folder. In the root directory's address bar, type cmd and press Enter, then run: .\py311\python api.py. If you see an error such as No module named 'flask', first run .\py311\python -m pip install waitress flask.

Using api.py After Deploying the Official F5-TTS Project from Source

  1. Copy api.py and the configs folder to the project directory.
  2. Install the required modules: pip install flask waitress.
  3. Run python api.py.

API Usage Example

import requests

# Send the reference audio (1.wav), its transcript (ref_text), and the text to synthesize (gen_text)
res = requests.post('http://127.0.0.1:5010/api', data={
    "ref_text": 'Enter the text content corresponding to 1.wav here',
    "gen_text": '''Enter the text to generate here.''',
    "model": 'f5-tts'
}, files={"audio": open('./1.wav', 'rb')})

# A non-200 response contains an error message instead of audio
if res.status_code != 200:
    print(res.text)
    exit()

# On success, the response body is the generated WAV audio
with open("ceshi.wav", 'wb') as f:
    f.write(res.content)

Compatible with OpenAI TTS Interface

The voice parameter must contain the reference audio and its corresponding text, separated by three # symbols, for example:

1.wav###你说四大皆空,却为何紧闭双眼,若你睁开眼睛看看我,我不相信你,两眼空空。

This means the reference audio is 1.wav, located in the same directory as api.py, and the text spoken in 1.wav is "你说四大皆空,却为何紧闭双眼,若你睁开眼睛看看我,我不相信你,两眼空空。"

The returned data is always WAV audio.

from openai import OpenAI

# base_url points the OpenAI client at the local F5-TTS service's /v1 path
client = OpenAI(api_key='12314', base_url='http://127.0.0.1:5010/v1')

# voice = reference audio filename + '###' + the text spoken in that audio
with client.audio.speech.with_streaming_response.create(
    model='f5-tts',
    voice='1.wav###你说四大皆空,却为何紧闭双眼,若你睁开眼睛看看我,我不相信你,两眼空空。',
    input='你好啊,亲爱的朋友们',
    speed=1.0
) as response:
    # Stream the returned WAV data to a local file
    with open('./test.wav', 'wb') as f:
        for chunk in response.iter_bytes():
            f.write(chunk)
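
If you prefer not to install the openai package, the same request can be sent directly with requests. This is a minimal sketch that assumes the service exposes the standard OpenAI path /v1/audio/speech under the base URL above; adjust the URL if your deployment differs.

import requests

# Assumed OpenAI-compatible endpoint path; adjust if your deployment differs
url = "http://127.0.0.1:5010/v1/audio/speech"
payload = {
    "model": "f5-tts",
    # Reference audio filename and its transcript, separated by three # symbols
    "voice": "1.wav###你说四大皆空,却为何紧闭双眼,若你睁开眼睛看看我,我不相信你,两眼空空。",
    "input": "你好啊,亲爱的朋友们",
    "speed": 1.0,
}

res = requests.post(url, json=payload)
res.raise_for_status()

# The response body is raw WAV audio data
with open("test_raw.wav", "wb") as f:
    f.write(res.content)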