Video translation software works in four steps: recognize speech from the video's audio, translate the recognized text into the target language, dub the translated text, and finally embed the dubbed audio and subtitles into the video.
Speech recognition is the first step, so its accuracy directly affects the subsequent translation and dubbing.
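The four steps can be pictured as a simple pipeline. The sketch below is only an illustration: `Subtitle`, `translate_video`, and the four stage callables are hypothetical names, not the program's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtitle:
    start: float  # seconds
    end: float    # seconds
    text: str


def translate_video(
    audio_path: str,
    recognize: Callable[[str], List[Subtitle]],
    translate: Callable[[str], str],
    dub: Callable[[List[Subtitle]], str],
    embed: Callable[[str, List[Subtitle]], str],
) -> str:
    """Run the four stages in order; every stage callable is a hypothetical stand-in."""
    # 1. Speech recognition: audio -> timed source-language text.
    source_subs = recognize(audio_path)
    # 2. Translation: keep the timings, replace the text.
    target_subs = [Subtitle(s.start, s.end, translate(s.text)) for s in source_subs]
    # 3. Dubbing: synthesize speech for the translated text.
    dubbed_audio = dub(target_subs)
    # 4. Embedding: merge the dubbed audio and subtitles into the video.
    return embed(dubbed_audio, target_subs)
```

Because every later stage consumes the output of `recognize`, a recognition error is carried through translation and dubbing unchanged.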
Faster Local Mode 
Recommended. This mode uses models converted from OpenAI's open-source Whisper. As the name implies, it offers faster recognition speed without compromising accuracy.

After selecting Faster Mode, choose the model to use from the menu on its right. The built-in tiny model is selected by default; it is the smallest model but also the least accurate.

The models increase in size in the order tiny, base, small, medium, large, and recognition accuracy improves accordingly.
For Chinese videos, use at least the medium model. Models can be downloaded from: https://github.com/jianchang512/stt/releases/0.0
Models with the .en suffix and those starting with distil can only be used for English videos.
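The model rules above can be expressed as a small check. This is a sketch under assumptions: the language codes `"en"` and `"zh"`, the function names, and the handling of version suffixes such as `-v3` are illustrative and not taken from the program.

```python
MODEL_SIZES = ["tiny", "base", "small", "medium", "large"]  # smallest to largest


def is_english_only(model_name: str) -> bool:
    """Models ending in .en or starting with distil only handle English."""
    return model_name.endswith(".en") or model_name.startswith("distil")


def check_model(model_name: str, language: str) -> list:
    """Return a list of warnings for a model/language pair (empty if none)."""
    warnings = []
    if is_english_only(model_name) and language != "en":
        warnings.append(f"{model_name} can only be used for English videos")
    # Assumption: drop a ".en" or "-v3" style suffix to find the size tier.
    size = model_name.split(".")[0].split("-")[0]
    if language == "zh" and size in MODEL_SIZES:
        if MODEL_SIZES.index(size) < MODEL_SIZES.index("medium"):
            warnings.append("for Chinese, at least the medium model is recommended")
    return warnings
```

For example, `check_model("medium", "zh")` returns an empty list, while `check_model("small.en", "zh")` returns two warnings: the model is English-only and is also below the recommended size for Chinese.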
To the right of the model selection is a dropdown menu that shows Whole Recognition; when expanded, it also offers Equal Split. Whole Recognition is suitable in most cases. If you want to split the audio into equal-length segments instead, for example so that each subtitle lasts 10 seconds, choose Equal Split, then set the segment duration in seconds under Tools/Advanced Settings → Advanced Settings → VAD Parameters.
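To make Equal Split concrete, the sketch below computes fixed-length segment boundaries for a given audio duration. It only illustrates the idea; `equal_split` is a hypothetical helper, and the program's own splitting logic may differ, for instance in how it treats the final shorter segment.

```python
def equal_split(total_seconds: float, segment_seconds: float = 10.0):
    """Return (start, end) pairs covering the audio in fixed-length pieces."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    segments = []
    start = 0.0
    while start < total_seconds:
        # The last segment is cut short at the end of the audio.
        end = min(start + segment_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


print(equal_split(25.0, 10.0))
# [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```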
On Windows and Linux systems with an NVIDIA GPU, you can install and configure the CUDA and cuDNN environments, then enable CUDA Acceleration to significantly improve execution speed.
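As a rough pre-check before enabling CUDA Acceleration, a script can test whether the operating system qualifies and whether the NVIDIA driver tool `nvidia-smi` is on the PATH. This is a heuristic sketch only: `pick_device` is a hypothetical helper, and finding `nvidia-smi` does not confirm that CUDA and cuDNN are installed correctly.

```python
import platform
import shutil


def pick_device() -> str:
    """Guess whether CUDA acceleration is worth trying on this machine."""
    # CUDA acceleration is described for Windows and Linux only.
    supported_os = platform.system() in ("Windows", "Linux")
    # Heuristic: the NVIDIA driver normally ships the nvidia-smi tool.
    has_nvidia_driver = shutil.which("nvidia-smi") is not None
    return "cuda" if supported_os and has_nvidia_driver else "cpu"


print(pick_device())
```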

View CUDA and cuDNN Installation Tutorial
Auto-Detect Language 

After version v2.59, the original language dropdown menu includes an "Auto-Detect" option. Use this when the language is unknown or not among the 24 supported languages. The program will attempt to automatically identify the spoken language.
However, avoid this option where possible, especially if there is no clear speech in the first 30 seconds of the video, because auto-detection analyzes only the first 30 seconds of audio to determine the language for the entire video.
Note also that languages with similar pronunciation but different writing systems may not be distinguished reliably and may be assigned to either variant at random. For example, a Chinese video might be detected as either Simplified or Traditional Chinese.
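Since detection relies on the first 30 seconds, one way to decide whether Auto-Detect is safe is to measure how much speech falls inside that window. The sketch below assumes you already have speech spans as `(start, end)` pairs in seconds, for example from a voice activity detector; the function names and the 5-second threshold are illustrative assumptions, not program settings.

```python
DETECTION_WINDOW = 30.0  # seconds analyzed by auto-detection


def speech_in_window(speech_spans, window=DETECTION_WINDOW):
    """Total seconds of speech that fall inside the first `window` seconds."""
    total = 0.0
    for start, end in speech_spans:
        overlap = min(end, window) - max(start, 0.0)
        if overlap > 0:
            total += overlap
    return total


def auto_detect_is_risky(speech_spans, min_speech=5.0):
    """Flag videos whose opening has too little speech (assumed threshold)."""
    return speech_in_window(speech_spans) < min_speech
```

A video whose first speech runs from 28 s to 40 s has only 2 seconds of speech inside the window, so `auto_detect_is_risky([(28.0, 40.0)])` returns `True`; in that case selecting the original language manually is the safer choice.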
