
From pyVideoTrans v3.68 onwards, you can select Spark-TTS directly under TTS -> F5-TTS in the software's settings, so you no longer need to modify the Spark-TTS source code as described at the end of this article.

Spark-TTS is a recent, highly anticipated open-source voice cloning project, developed jointly by Hong Kong University of Science and Technology, Northwestern Polytechnical University, Shanghai Jiao Tong University, and other universities. In local testing, its output quality is comparable to that of F5-TTS.

Spark-TTS supports Chinese and English voice cloning, and installing and deploying it is not complicated. This article explains how to install and deploy it, and how to modify it to be compatible with the F5-TTS API so that it can be used directly in the F5-TTS dubbing channel of pyVideoTrans.

Prerequisites: Ensure that Python version 3.10, 3.11, or 3.12 is installed.

If you have not installed it yet, please refer to the previous article for installation. We will not repeat the details here.
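To confirm which interpreter the `python` command actually launches, a short script along these lines can help. It is only a sketch: the function name `is_supported` is illustrative, and the accepted range simply mirrors the versions listed above.

```python
import sys

# Minor versions of Python 3 that this article lists as supported.
SUPPORTED_MINORS = (10, 11, 12)


def is_supported(version_info=sys.version_info):
    """Return True if the version tuple is Python 3.10, 3.11, or 3.12."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in SUPPORTED_MINORS


print("Python", sys.version.split()[0], "supported:", is_supported())
```

Save it under any file name and run it with `python`; a result of `False` suggests installing one of the listed versions first.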

1. Download Spark-TTS Source Code

First, create a folder on a non-system drive whose name contains only English letters or digits, such as D:/spark. This avoids potential errors caused by Chinese characters in the path or by permission restrictions on the system drive.

Then, visit the official Spark-TTS code repository: https://github.com/SparkAudio/Spark-TTS

As shown in the figure below, click to download the ZIP package of the source code:

Click to download the source code ZIP package

After the download is complete, unzip it and copy all files and folders to the D:/spark folder. The directory structure after copying should be as shown in the following figure:

Directory structure after copying
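If you want to double-check the copy without comparing screenshots, the sketch below looks for the two files this article uses later, `webui.py` and `requirements.txt`. The helper name `missing_files` is hypothetical, and the real repository contains many more files than this checks.

```python
from pathlib import Path

# Files that later steps of this article rely on; not a full repository listing.
EXPECTED_FILES = ("webui.py", "requirements.txt")


def missing_files(project_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in project_dir."""
    root = Path(project_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # D:/spark is the example folder used in this article.
    print("Missing:", missing_files("D:/spark") or "nothing")
```

A common cause of missing files is copying the outer unzipped folder instead of its contents, which leaves everything one level too deep.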

2. Create a Virtual Environment and Install Dependencies

  • Create a virtual environment

In File Explorer, clear the address bar of the D:/spark folder, type cmd, and press Enter. In the black terminal window that opens, execute the following command:

bash
python -m venv venv

As shown:

Clear the folder address bar and enter cmd then press Enter

Execute command

After execution, a venv folder will be added to the D:/spark directory:

A venv directory will be added to the folder after success

Note: If you see a message saying that python is not recognized as an internal or external command, Python is either not installed or has not been added to the system environment variables. Please refer to relevant articles to install Python.

Next, execute venv\scripts\activate to activate the virtual environment. After activation, (venv) will appear at the beginning of the terminal line, indicating successful activation. All subsequent commands need to be executed in this environment. Please check whether it has been activated before each execution.

Make sure there is (venv) at the beginning
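Besides looking for the (venv) prefix, you can ask Python itself whether it is running inside a virtual environment. The following is a minimal sketch using only the standard library; the function name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the base Python installation.
    return prefix != base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter prefix:", sys.prefix)
```

When the environment created above is active, the printed prefix should end in the venv folder under D:/spark.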

  • Install dependencies

In the activated virtual environment, continue to execute the following command in the terminal to install all dependencies:

bash
pip install -r requirements.txt

The installation process may take a long time, please be patient.

The installation takes a long time
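Once pip finishes, a quick import check can reveal a partially failed installation. The sketch below only tests whether a few packages can be located; the list in `REQUIRED` is an assumption based on what the later steps use (`huggingface_hub` for the download script, `gradio` for the web interface, and `torch` as the assumed model runtime), so adjust it to match requirements.txt.

```python
from importlib.util import find_spec

# Assumed subset of the dependencies; requirements.txt is the authoritative list.
REQUIRED = ("torch", "gradio", "huggingface_hub")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


print("Missing packages:", missing_packages() or "none")
```

If anything is reported missing, rerunning `pip install -r requirements.txt` inside the activated environment is usually the first thing to try.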

3. Download Models

The models required by open-source AI projects are usually hosted on Hugging Face (huggingface.co). Because this website is not accessible from mainland China, you need a proxy to download the models. Please make sure that the proxy environment is configured correctly and that a system proxy is set.

Create a text file named down.txt in the current directory D:/spark, paste the following code into it, and save it:
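If the system proxy is not picked up automatically, one option is to set the standard proxy environment variables inside the Python process before the download starts. This is a sketch, not part of Spark-TTS: `configure_proxy` is a hypothetical helper, and the address in the comment is only a placeholder for whatever your proxy tool exposes.

```python
import os


def configure_proxy(proxy_url, env=os.environ):
    """Route HTTP(S) requests of this process through proxy_url."""
    # Common HTTP client libraries read these variables by default.
    for key in ("HTTP_PROXY", "HTTPS_PROXY"):
        env[key] = proxy_url
    return env


# Hypothetical local proxy address; replace it with your own before enabling.
# configure_proxy("http://127.0.0.1:7890")
```

The call has to run before any download call is made, so it belongs at the very top of the download script.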

python
from huggingface_hub import snapshot_download
snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir="pretrained_models/Spark-TTS-0.5B")
print('Download complete')

Then, execute the following command in the terminal window of the activated virtual environment:

bash
python down.txt

Check that there is (venv) before the command line:

Make sure there is (venv) character at the beginning of the command line

Wait for the terminal to prompt that the download is complete.

If the output resembles the following, the connection to Hugging Face failed, most likely because the proxy environment is not configured correctly:

Returning existing local_dir `pretrained_models\Spark-TTS-0.5B` as remote repo cannot be accessed in `snapshot_download` ((MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/SparkAudio/Spark-TTS-0.5B/revision/main (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001BC4C8A4430>, 'Connection to huggingface.co timed out. (connect timeout=None)'))"), '(Request ID: aa61d1fb-ffc7-4479-9a99-2258c1bc0aee)')).

Connection failed, please configure the proxy environment correctly
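As the message above shows, `snapshot_download` may return the existing local folder even though nothing was fetched, so the presence of the folder alone does not prove the download worked. A rough sanity check is to count the files and bytes it contains. The helper name `model_dir_summary` is illustrative, and no particular file count or size is assumed here.

```python
from pathlib import Path


def model_dir_summary(model_dir="pretrained_models/Spark-TTS-0.5B"):
    """Return (file_count, total_bytes) for everything under model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        return 0, 0
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


count, size = model_dir_summary()
print(f"{count} files, {size / 1024 / 1024:.1f} MiB")
```

Run it from D:/spark. A result of zero files, or only a few kilobytes, suggests the download needs to be repeated once the proxy works.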

4. Start the Web Interface

After the model is downloaded, you can start and open the web interface.

Execute the following command in the terminal of the activated virtual environment:

bash
python webui.py

Confirm (venv) at the beginning

Wait for the following information to appear, indicating that the startup is complete:

Startup successful

At this point, you can open the address http://127.0.0.1:7860 in your browser. The web interface is shown below:

Open the web interface
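If the page does not load, it helps to check whether anything is listening on port 7860 at all. The sketch below attempts a plain TCP connection; `is_port_open` is an illustrative name, and the host and port defaults are taken from the address above.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print("Web UI reachable:", is_port_open())
```

A `False` result while webui.py is still running usually means the startup has not finished yet or the server chose a different port, which the terminal output would show.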

5. Voice Cloning Test

As shown in the figure below, select an audio file of the voice you want to clone (3-10 seconds long, clear pronunciation, clean background).

Then, enter the text corresponding to the audio in the Text of prompt speech on the right. Enter the text you want to generate on the left, and finally click the Generate button at the bottom to start execution.

Execute voice cloning

After execution completes, the result appears as shown in the figure below.
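The 3-10 second guideline for the reference audio can be checked before uploading. The sketch below uses the standard-library `wave` module, so it only works for uncompressed PCM WAV files; both function names are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(seconds, shortest=3.0, longest=10.0):
    """Return True if the clip length falls within the recommended range."""
    return shortest <= seconds <= longest
```

For example, `is_good_reference_length(wav_duration_seconds("ref.wav"))` reports whether a clip named ref.wav (a placeholder file name) falls inside the recommended range. Clarity and background noise still have to be judged by ear.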

6. Use in the pyVideoTrans Software

Spark-TTS is very similar to F5-TTS. With a few simple modifications, you can use Spark-TTS directly in the F5-TTS dubbing channel of pyVideoTrans.

  • Open the webui.py file, and paste the following code above approximately line 135:
python
    def basic_tts(gen_text_input, ref_text_input, ref_audio_input, remove_silence=None, speed_slider=None):
        """
        Gradio callback exposing voice cloning under the F5-TTS-style basic_tts API.
        - gen_text_input: The input text to be synthesised.
        - ref_text_input: Transcript of the reference audio (optional).
        - ref_audio_input: Audio file used as reference.
        - remove_silence, speed_slider: Accepted for F5-TTS compatibility; unused.
        """
        prompt_speech = ref_audio_input
        # Treat a missing or near-empty transcript as "no prompt text".
        prompt_text_clean = None if not ref_text_input or len(ref_text_input) < 2 else ref_text_input

        audio_output_path = run_tts(
            gen_text_input,
            model,
            prompt_text=prompt_text_clean,
            prompt_speech=prompt_speech
        )
        return audio_output_path, prompt_text_clean

Pay special attention to aligning the code levels

Note: Python uses indentation to define code blocks, so misaligned code will raise an error. To avoid this, do not edit webui.py with Notepad; use a code editor such as Notepad++ or VSCode instead, both of which are free.

  • Then, find the generate_buttom_clone = gr.Button("Generate") code at approximately line 190. Paste the following code above it, and pay attention to alignment:
python
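generate_buttom_clone2 = gr.Button("Generate2", visible=False)
generate_buttom_clone2.click(
    basic_tts,
    inputs=[
        text_input,         # gen_text_input
        prompt_text_input,  # ref_text_input
        prompt_wav_upload,  # ref_audio_input
        text_input,         # placeholder for remove_silence (unused)
        text_input,         # placeholder for speed_slider (unused)
    ],
    outputs=[audio_output, prompt_text_input],
    api_name="basic_tts",
)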

Pay attention to level alignment
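Because indentation mistakes are the most likely problem after pasting, it can save time to check that the edited file still compiles before restarting. The sketch below uses Python's built-in `compile`; the helper name `check_python_source` is illustrative.

```python
def check_python_source(source, filename="webui.py"):
    """Return None if source compiles, else a short description of the error."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:
        # IndentationError and TabError are subclasses of SyntaxError.
        return f"{filename}:{err.lineno}: {err.msg}"
    return None


if __name__ == "__main__":
    try:
        with open("webui.py", encoding="utf-8") as f:
            print(check_python_source(f.read()) or "webui.py compiles")
    except FileNotFoundError:
        print("webui.py not found in the current directory")
```

This only catches syntax and indentation errors. Code pasted at a valid but wrong nesting level will still compile, so the screenshots above remain the reference for where each block belongs.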

  • After saving the file, restart webui.py:
bash
python webui.py

Be sure to pay attention to (venv) when starting

  • In pyVideoTrans, open "Menu" -> "TTS Settings" -> "F5-TTS" and enter the address http://127.0.0.1:7860. You can then start using it. The reference audio is placed and filled in the same way as for F5-TTS.

After modification, it can be directly used in the F5-TTS channel
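To test the new endpoint without pyVideoTrans, it can also be called from a script. The sketch below assumes a recent `gradio_client` package that provides `Client` and `handle_file`, and it assumes the endpoint is reachable as `/basic_tts`. The helper names are hypothetical, and the shape of the returned value depends on the installed Gradio version, so none is claimed here.

```python
def basic_tts_args(gen_text, ref_text, ref_audio):
    """Positional arguments in the order of the inputs list wired to Generate2."""
    # The last two slots stand in for remove_silence and speed_slider,
    # which basic_tts accepts but ignores.
    return (gen_text, ref_text, ref_audio, "", "")


def synthesize(gen_text, ref_text, ref_audio_path, url="http://127.0.0.1:7860"):
    """Call the basic_tts endpoint and return whatever the Gradio client yields."""
    # Imported lazily so the helper above works without gradio_client installed.
    from gradio_client import Client, handle_file

    client = Client(url)
    args = basic_tts_args(gen_text, ref_text, handle_file(ref_audio_path))
    return client.predict(*args, api_name="/basic_tts")
```

With webui.py running, calling `synthesize("Hello there.", "Transcript of the reference clip.", "ref.wav")` should trigger a generation, where ref.wav and both texts are placeholders. If the call fails with an unknown endpoint error, the two pasted blocks are the first place to look.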