
CosyVoice Open Source: https://github.com/FunAudioLLM/CosyVoice

CosyVoice-api Open Source: https://github.com/jianchang512/cosyvoice-api

Supports Chinese, English, Japanese, Korean, and Cantonese. The corresponding language codes are zh|en|jp|ko|yue.

Using CosyVoice in Video Translation Software

  1. First, update the software to version 2.08+.
  2. Ensure that the CosyVoice project is deployed, that api.py from CosyVoice-api has been copied into it, and that api.py has been started successfully (the API service must be running before it can be used in the translation software).
  3. Open the video translation software, go to Settings (top left) -- CosyVoice, and fill in the API address, which is http://127.0.0.1:9233 by default.
  4. Fill in the reference audio and the corresponding text.
Reference audio format:

Each line is split into two parts by the # symbol: the first part is the path to the WAV audio file, and the second part is its corresponding text content. Multiple lines may be entered.

The optimal WAV duration is 5-15 seconds. If the audio file is placed in the root directory of the CosyVoice project (i.e., the same directory as webui.py), enter just the file name here.
If it is placed in a wavs directory under the root directory, enter wavs/audio_name.wav.

Reference audio example:

1.wav#Hello dear friends
wavs/2.wav#Hello friends
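
The line format above is easy to validate with a short sketch; parse_reference_lines is a hypothetical helper for illustration, not part of CosyVoice or the translation software:

```python
def parse_reference_lines(text):
    """Split 'path#text' reference-audio lines into (wav_path, transcript) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        path, _, transcript = line.partition("#")
        pairs.append((path, transcript))
    return pairs
```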
  5. After filling in, select CosyVoice as the dubbing channel on the main interface and select the corresponding role. The "clone" role copies the timbre from the original video.

For other systems, please deploy CosyVoice first. The specific deployment method is as follows:

Source Code Deployment of the Official CosyVoice Project

Deployment uses conda, and this method is strongly recommended; otherwise installation may fail and you may run into many problems. Some dependencies, such as pynini, cannot be installed successfully with pip on Windows.

1. Download and Install Miniconda

Miniconda is a minimal conda distribution that is easy to install on Windows: like any regular application, just click Next until it completes.

Download address: https://docs.anaconda.com/miniconda/

After downloading, double-click the .exe file.

The only thing to note is that on the installer screen, you need to select the top two checkboxes; otherwise, the subsequent steps become more troublesome. The second checkbox means "Add conda to the system environment variables (PATH)"; if you do not select it, you will not be able to use the conda command directly.

Then click "install" and wait for it to complete before closing.

2. Download the CosyVoice Source Code

First create an empty directory, for example a folder D:/py on the D drive; the following uses this path as an example.

Open the CosyVoice open source address: https://github.com/FunAudioLLM/CosyVoice

After downloading and decompressing, copy all the files in the CosyVoice-main directory to D:/py.

3. Create and Activate a Virtual Environment

Enter the D:/py folder, type cmd in the Explorer address bar, and press Enter. This opens a command prompt (cmd) window in that folder.

In this window, enter the command conda create -n cosyvoice python=3.10 and press Enter. This will create a virtual environment named "cosyvoice" with Python version "3.10".

Continue by entering the command conda activate cosyvoice and pressing Enter to activate the virtual environment. Only after activation can you proceed with installation and startup; otherwise, errors are inevitable.

The sign of successful activation is that the command prompt now begins with "(cosyvoice)".

4. Install the pynini Module

On Windows this module can only be installed with the conda command, which is why conda was recommended at the start.

Continue to enter the command conda install -y -c conda-forge pynini==2.1.5 WeTextProcessing==1.0.3 in the cmd window that has been opened and activated, and press Enter.

Note: During installation, a confirmation prompt will appear; when it does, type y and press Enter, as shown in the figure below.

5. Install Other Dependencies Using the Alibaba Cloud Mirror

Open the requirements.txt file and delete the last line, WeTextProcessing==1.0.3; otherwise the installation will certainly fail, because this module depends on pynini, which cannot be installed with pip on Windows.

Then add three lines to the requirements.txt file: Matcha-TTS, flask, and waitress.

Continue to enter the command

pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

And press Enter. After a long wait, the installation should complete without issues.
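
If you prefer to script the requirements.txt edit instead of doing it by hand, here is a minimal sketch (patch_requirements is an illustrative name, not an official tool):

```python
def patch_requirements(text):
    """Drop the WeTextProcessing pin (it depends on pynini, which pip cannot
    install on Windows; it was installed via conda in step 4) and append the
    three extra dependencies."""
    lines = [l for l in text.splitlines()
             if not l.strip().startswith("WeTextProcessing")]
    lines += ["Matcha-TTS", "flask", "waitress"]
    return "\n".join(lines) + "\n"
```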

6. Download the api.py File and Place it in the Project

Go to https://github.com/jianchang512/cosyvoice-api/blob/main/api.py, download the api.py file, and place it in the same directory as webui.py.


Start the API Service

The API interface address is: http://127.0.0.1:9233

In the activated virtual environment, enter and execute the command python api.py.
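
Before pointing the translation software at the service, you can check that something is listening on the port. This is a minimal sketch; is_api_up is an illustrative helper, not part of the API:

```python
import socket

def is_api_up(host="127.0.0.1", port=9233, timeout=3.0):
    """Return True if a TCP server is accepting connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```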

API Interface List

Synthesize Text Based on Built-in Roles

  • Interface address: /tts

  • Simply synthesize text into speech without timbre cloning

  • Required parameters:

text: The text to be synthesized into speech

role: Choose one of '中文女' (Chinese Female), '中文男' (Chinese Male), '日语男' (Japanese Male), '粤语女' (Cantonese Female), '英文女' (English Female), '英文男' (English Male), '韩语女' (Korean Female)

  • Successful return: WAV audio data

  • Sample code

import requests

data = {
    "text": "你好啊亲爱的朋友们",
    "role": "中文女"
}

response = requests.post('http://127.0.0.1:9233/tts', data=data, timeout=3600)
with open('result.wav', 'wb') as f:
    f.write(response.content)

Same Language Timbre Cloning Synthesis

  • Address: /clone_eq

The pronunciation language of the reference audio is the same as the text language to be synthesized. For example, the reference audio is Chinese pronunciation, and the Chinese text needs to be synthesized into speech based on this audio.

  • Required parameters:

text: The text to be synthesized into speech

reference_audio: The reference audio used to clone the timbre, given as a path relative to api.py. For example, if you are referencing 1.wav and the file is in the same folder as api.py, fill in 1.wav

reference_text: The text content corresponding to the reference audio

  • Successful return: WAV data

  • Sample code

import requests

data = {
    "text": "你好啊亲爱的朋友们。",
    "reference_audio": "10.wav",
    "reference_text": "希望你过的比我更好哟。"
}

response = requests.post('http://127.0.0.1:9233/clone_eq', data=data, timeout=3600)

Different Language Timbre Cloning

  • Address: /clone

The pronunciation language of the reference audio is different from the text language to be synthesized. For example, you need to synthesize an English text into speech based on the reference audio of Chinese pronunciation.

  • Required parameters:

text: The text to be synthesized into speech

reference_audio: The reference audio used to clone the timbre, given as a path relative to api.py. For example, if you are referencing 1.wav and the file is in the same folder as api.py, fill in 1.wav

  • Successful return: WAV data

  • Sample code

import requests

data = {
    "text": "親友からの誕生日プレゼントを遠くから受け取り、思いがけないサプライズと深い祝福に、私の心は甘い喜びで満たされた!",
    "reference_audio": "10.wav"
}

response = requests.post('http://127.0.0.1:9233/clone', data=data, timeout=3600)

Compatible with OpenAI TTS

  • Interface address /v1/audio/speech
  • Request method POST
  • Request type Content-Type: application/json
  • Request parameters:

input: The text to synthesize

model: Fixed to tts-1; accepted for OpenAI compatibility but not actually used

speed: Speech rate, default 1.0

response_format: Return format, fixed to WAV audio data

voice: For plain text synthesis, choose one of '中文女' (Chinese Female), '中文男' (Chinese Male), '日语男' (Japanese Male), '粤语女' (Cantonese Female), '英文女' (English Female), '英文男' (English Male), '韩语女' (Korean Female)

When used for cloning, fill in voice with the path of the reference audio relative to api.py. For example, if you are referencing 1.wav and the file is in the same folder as api.py, fill in 1.wav.

  • Sample code
from openai import OpenAI

client = OpenAI(api_key='12314', base_url='http://127.0.0.1:9233/v1')

with client.audio.speech.with_streaming_response.create(
    model='tts-1',
    voice='中文女',
    input='你好啊,亲爱的朋友们',
    speed=1.0
) as response:
    with open('./test.wav', 'wb') as f:
        for chunk in response.iter_bytes():
            f.write(chunk)
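
If the openai package is not available, the same endpoint can also be called with plain requests. The sketch below assembles the JSON body from the parameter list above; build_speech_payload is an illustrative helper, not part of the API:

```python
def build_speech_payload(text, voice="中文女", speed=1.0):
    """Assemble the JSON body for POST /v1/audio/speech (OpenAI-compatible)."""
    return {
        "model": "tts-1",          # accepted for compatibility, not actually used
        "input": text,
        "voice": voice,
        "speed": speed,
        "response_format": "wav",  # the service always returns WAV data
    }
```

Then send it with, for example, requests.post('http://127.0.0.1:9233/v1/audio/speech', json=build_speech_payload('你好啊,亲爱的朋友们'), timeout=3600) and write the binary response content to a .wav file.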