Spark-TTS is a recently popular open-source voice cloning project developed jointly by institutions including the Hong Kong University of Science and Technology, Northwestern Polytechnical University, and Shanghai Jiao Tong University. In local testing, its results were comparable to those of F5-TTS.

Spark-TTS supports voice cloning in both Chinese and English, and the installation process is straightforward. This guide explains how to install and deploy it, and how to modify it to be compatible with the F5-TTS API so that it can be used directly in pyVideoTrans through the F5-TTS dubbing channel.

Prerequisites: Ensure Python 3.10, 3.11, or 3.12 is installed.
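
If you are unsure which Python version is on your PATH, a small check like the one below can confirm it. This is a minimal sketch: the set of supported versions is taken from the prerequisite above, and the function name is_supported is only illustrative.

```python
import sys

# Versions named in the prerequisites above.
SUPPORTED = {(3, 10), (3, 11), (3, 12)}


def is_supported(version_info=sys.version_info):
    """Return True if the major.minor version is in the supported set."""
    return (version_info[0], version_info[1]) in SUPPORTED


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    status = "OK" if is_supported() else "not in the supported list"
    print(f"Python {current}: {status}")
```

Save it under any file name and run it with the same python command you plan to use for the rest of this guide.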

1. Download Spark-TTS Source Code

First, create a folder named with English letters or numbers on a non-system drive, such as D:/spark. Using a non-system drive and avoiding Chinese characters helps prevent potential errors related to permissions or encoding.
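
The folder rule above can also be checked in code. The sketch below assumes the system drive is C: (adjust the system_drive argument if yours differs); the helper name is_safe_install_path is hypothetical.

```python
from pathlib import PureWindowsPath


def is_safe_install_path(path, system_drive="C:"):
    """Check the two rules above: ASCII-only path, not on the system drive.

    system_drive defaults to "C:", which is an assumption about your machine.
    """
    p = PureWindowsPath(path)
    ascii_only = str(p).isascii()
    off_system_drive = p.drive.upper() != system_drive.upper()
    return ascii_only and off_system_drive


if __name__ == "__main__":
    print(is_safe_install_path("D:/spark"))
```

PureWindowsPath only parses the string, so the check works even before the folder exists.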

Then, visit the official Spark-TTS repository: https://github.com/SparkAudio/Spark-TTS

As shown below, click to download the source code ZIP file:

Click to download the source code zip file

After downloading, extract the contents and copy all files and folders into the D:/spark folder. The directory structure should look like this:

Directory structure after copying

2. Create a Virtual Environment and Install Dependencies

  • Create a Virtual Environment

In the folder's address bar, type cmd and press Enter. In the opened terminal window, execute the following command:

bash
python -m venv venv

As shown:

Type cmd in the address bar and press Enter

Execute the command

After execution, a venv folder will appear in the D:/spark directory:

A venv directory appears after success

Note: If you see an error such as "'python' is not recognized as an internal or external command", Python is either not installed or not added to the PATH environment variable. Refer to a Python installation guide to fix this before continuing.

Next, run venv\scripts\activate to activate the virtual environment. Once activated, (venv) will appear at the beginning of the terminal line, indicating success. All subsequent commands must be run in this activated environment; always check for (venv) before proceeding.

Ensure (venv) appears at the beginning
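
Besides looking for the (venv) prefix, you can ask Python itself whether it is running inside a virtual environment. A minimal sketch using the standard sys.prefix and sys.base_prefix attributes (the function name in_virtualenv is illustrative):

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base install.

    Inside an environment created by `python -m venv`, sys.prefix points at
    the venv folder while sys.base_prefix points at the original Python.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
    print("Interpreter prefix:", sys.prefix)
```

If the script reports False, run venv\scripts\activate again before installing anything.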

  • Install Dependencies

In the activated virtual environment, continue in the terminal and execute the following command to install all dependencies:

bash
pip install -r requirements.txt

The installation may take some time; please wait patiently.

Installation may take a while
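
Once pip finishes, a quick import check can confirm that the key packages are available in the virtual environment. The package names below are examples only: gradio and huggingface_hub are used later in this guide, while torch is an assumption about what requirements.txt contains. The requirements.txt file remains the authoritative list.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Example names only; adjust to match requirements.txt.
    missing = missing_packages(["torch", "gradio", "huggingface_hub"])
    print("Missing:", missing if missing else "none")
```

An empty result means the listed packages can be located; it does not verify their versions.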

3. Download Models

Open-source AI models are often hosted on Hugging Face (huggingface.co). Because the site is blocked in some regions, you may need a working proxy to download the models; make sure your system proxy is configured correctly before continuing.

In the current directory D:/spark, create a text file named down.txt, paste the following code into it, and save. The .txt extension is fine, because Python runs whatever file it is given:

python
from huggingface_hub import snapshot_download
snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir="pretrained_models/Spark-TTS-0.5B")
print('Download complete')

Then, in the activated virtual environment terminal, execute:

bash
python down.txt

Make sure (venv) is visible at the command line:

Ensure (venv) is at the beginning of the command line

Wait for the terminal to indicate the download is complete.

If you see an output like the following, it indicates a network connection error, possibly due to incorrect proxy configuration:

Returning existing local_dir `pretrained_models\Spark-TTS-0.5B` as remote repo cannot be accessed in `snapshot_download` ((MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/SparkAudio/Spark-TTS-0.5B/revision/main (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001BC4C8A4430>, 'Connection to huggingface.co timed out. (connect timeout=None)'))"), '(Request ID: aa61d1fb-ffc7-4479-9a99-2258c1bc0aee)')).

Connection failed; please configure internet access correctly
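
The message above begins with "Returning existing local_dir", which suggests the call can return without raising an exception, so the script may still print 'Download complete' after a failed download. As a sanity check, the sketch below counts the files in the model folder and sums their sizes. The helper name summarize_dir is hypothetical, and no particular file count or size is assumed to be correct; a folder that is empty or only a few kilobytes is simply a sign that the download did not finish.

```python
from pathlib import Path


def summarize_dir(path):
    """Return (file_count, total_bytes) for a folder, or (0, 0) if missing."""
    root = Path(path)
    if not root.is_dir():
        return 0, 0
    files = [f for f in root.rglob("*") if f.is_file()]
    return len(files), sum(f.stat().st_size for f in files)


if __name__ == "__main__":
    count, size = summarize_dir("pretrained_models/Spark-TTS-0.5B")
    print(f"{count} files, {size / 1e6:.1f} MB")
```

Run it from D:/spark so that the relative path matches the local_dir used in down.txt.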

4. Launch the Web Interface

Once the model is downloaded, you can start and open the web interface.

In the activated virtual environment terminal, execute:

bash
python webui.py

Confirm (venv) is at the beginning

Wait until you see the following message, indicating startup is complete:

Startup successful

Now, open your browser and go to http://127.0.0.1:7860. The web interface will look like this:

Open the web interface
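
If the page does not load, it helps to know whether anything is listening on port 7860 at all. The following sketch uses only the standard socket module; the function name is_listening and the 2-second timeout are illustrative choices.

```python
import socket


def is_listening(host="127.0.0.1", port=7860, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts.
        return False


if __name__ == "__main__":
    print("Web UI reachable:", is_listening())
```

A False result while webui.py is still starting is expected; wait for the startup message and try again.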

5. Test Voice Cloning

As shown below, select an audio file for cloning (3-10 seconds long, clear pronunciation, clean background).

Then, enter the text spoken in that audio in the Text of prompt speech field on the right, enter the text you want to generate on the left, and click the Generate button at the bottom to start.

Perform voice cloning

After execution, the result will appear as shown.
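
The 3-10 second guideline for the reference audio can be checked before uploading. This sketch relies on the standard wave module, so it only handles uncompressed PCM WAV files; the function names and default bounds are illustrative and simply restate the guideline above.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(path, min_s=3.0, max_s=10.0):
    """Check the 3-10 second guideline for reference audio."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, is_good_reference_length("ref.wav") returns False for a 2-second clip, where ref.wav is a placeholder file name.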

6. Use in pyVideoTrans Software

Spark-TTS is very similar to F5-TTS. With a simple modification, it can be used directly in pyVideoTrans via the F5-TTS dubbing channel. If you're not comfortable modifying the code, you can download the pre-modified version from https://pvt9.com/spark-use-f5-webui.zip and overwrite webui.py with it.

  • Open the webui.py file and paste the following code above approximately line 135:
python
    def basic_tts(gen_text_input, ref_text_input, ref_audio_input,remove_silence=None,speed_slider=None):
        """
        Gradio callback used for the F5-TTS-compatible endpoint.
        - gen_text_input: The input text to be synthesised.
        - ref_text_input: Transcript of the reference audio (optional).
        - ref_audio_input: Audio file used as the voice reference.
        - remove_silence, speed_slider: Accepted for compatibility only; unused.
        """
        prompt_speech = ref_audio_input
        # Treat an empty or one-character transcript as "no prompt text".
        prompt_text_clean = None if len(ref_text_input) < 2 else ref_text_input

        # run_tts and model come from the surrounding code in webui.py.
        audio_output_path = run_tts(
            gen_text_input,
            model,
            prompt_text=prompt_text_clean,
            prompt_speech=prompt_speech
        )
        return audio_output_path, prompt_text_clean

Pay special attention to code indentation alignment

Important: Python uses indentation to define code blocks, so misaligned lines will cause errors. To avoid issues, do not use Notepad; instead, use a code editor such as Notepad++ or VSCode.

  • Then, find the code generate_buttom_clone = gr.Button("Generate") around line 190. Paste the following code above it, indented to the same level as that line:
python
generate_buttom_clone2 = gr.Button("Generate2", visible=False)
generate_buttom_clone2.click(
    basic_tts,
    inputs=[
        text_input,
        prompt_text_input,
        prompt_wav_upload,
        text_input,  # placeholder for remove_silence (unused)
        text_input,  # placeholder for speed_slider (unused)
    ],
    outputs=[audio_output, prompt_text_input],
    api_name="basic_tts",
)

Pay attention to indentation alignment
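
Because indentation mistakes are the most likely problem after these two edits, it can save time to parse the file before restarting. The sketch below uses the standard ast module to report the first syntax or indentation error; the helper name syntax_error_in is hypothetical. It only checks that the file parses, not that the pasted code sits at the right place.

```python
import ast
from pathlib import Path


def syntax_error_in(source, filename="webui.py"):
    """Return a short error description, or None if the source parses."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


if __name__ == "__main__":
    target = Path("webui.py")
    if target.exists():
        problem = syntax_error_in(target.read_text(encoding="utf-8"))
        print(problem or "No syntax errors found")
```

The reported line number points at the line where Python noticed the problem, which is often the line just after the misaligned one.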

  • Save the file and restart webui.py:
bash
python webui.py

Ensure (venv) is active when starting

  • In pyVideoTrans, go to "Menu" -> "TTS Settings" -> "F5-TTS" and enter http://127.0.0.1:7860 as the API address. The reference audio location and input method are the same as for F5-TTS.

After modification, it can be used directly in the F5-TTS channel
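
To test the new endpoint without pyVideoTrans, it can also be called from a script. The sketch below is an assumption-heavy illustration, not how pyVideoTrans itself calls the API: it assumes a recent gradio_client that provides handle_file, passes arguments positionally in the order of the inputs list above, and fills the two unused slots with empty strings. The helper names basic_tts_args and clone_via_api are hypothetical.

```python
def basic_tts_args(gen_text, ref_text, ref_audio):
    """Build positional arguments in the order of the hidden button's inputs."""
    # The last two slots stand in for remove_silence and speed_slider, which
    # basic_tts accepts but ignores; empty strings are placeholder values.
    return (gen_text, ref_text, ref_audio, "", "")


def clone_via_api(gen_text, ref_text, ref_audio_path,
                  base_url="http://127.0.0.1:7860"):
    """Call the basic_tts endpoint of a running webui.py (assumed reachable)."""
    # Imported lazily so basic_tts_args works without gradio_client installed.
    from gradio_client import Client, handle_file

    client = Client(base_url)
    args = basic_tts_args(gen_text, ref_text, handle_file(ref_audio_path))
    return client.predict(*args, api_name="/basic_tts")


# Example call (requires webui.py to be running; ref.wav is a placeholder):
# result = clone_via_api("Hello there.", "Transcript of ref.wav", "ref.wav")
```

Since basic_tts returns two values, the result is expected to contain the generated audio path and the cleaned reference text.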