Video translation software works in four steps: it recognizes text from the speech in the video, translates that text into the target language, dubs the translated text, and finally embeds the dubbed audio and the text into the video.
Speech recognition is therefore the first step, and its accuracy directly affects the translation and dubbing that follow.
Faster Local Mode
This mode is recommended. It uses a model converted from OpenAI's open-source Whisper. As the name implies, recognition is faster, with no loss of accuracy.
After selecting faster mode, you can choose the model to use on the right. The default built-in model is tiny, which is the smallest and least accurate model.
tiny--base--small--medium--large
From left to right, model size increases and so does recognition accuracy.
For Chinese videos, at least the medium model is recommended. Models can be downloaded from https://github.com/jianchang512/stt/releases/0.0
Models with the .en suffix and models whose names start with distil can only be used for English videos.
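The naming rule above can be written as a small check. This is a minimal sketch: the function names `is_english_only` and `usable_models`, the language code `"en"`, and the example model names are illustrative assumptions, not part of the software.

```python
def is_english_only(model_name: str) -> bool:
    """Return True if the name marks an English-only model.

    Rule from the text: names ending in ".en" or starting with
    "distil" should only be used for English videos.
    """
    name = model_name.strip().lower()
    return name.endswith(".en") or name.startswith("distil")


def usable_models(models: list[str], video_language: str) -> list[str]:
    """Keep only the models that can be used for the given language.

    `video_language` is a hypothetical language code; "en" means English.
    """
    if video_language == "en":
        return list(models)
    return [m for m in models if not is_english_only(m)]
```

For a Chinese video, for example, `usable_models(["tiny", "small.en", "medium"], "zh")` would drop `small.en` and keep the other two.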
To the right of the model selector is a drop-down box that shows Overall Recognition by default and also offers Equal Segmentation. Select Overall Recognition unless you have special needs. If you need to split the audio into parts of equal duration, for example so that each subtitle is 10 seconds long, select Equal Segmentation and set the segment duration in seconds under Menu--Tools/Advanced Settings--Advanced Settings--VAD Parameters.
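To illustrate what equal segmentation means, the sketch below splits a total duration into consecutive parts of a fixed length. It is only an illustration of the idea: the function name `equal_segments` is hypothetical, and a shorter final segment is an assumption. The software's own splitting logic may differ.

```python
import math


def equal_segments(total_seconds: float, segment_seconds: float) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into consecutive (start, end) ranges.

    Every range is `segment_seconds` long except possibly the last,
    which is cut off at `total_seconds` (an assumption of this sketch).
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must not be negative")
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")

    count = math.ceil(total_seconds / segment_seconds)
    segments = []
    for i in range(count):
        start = i * segment_seconds
        end = min((i + 1) * segment_seconds, total_seconds)
        segments.append((start, end))
    return segments
```

With a 25-second audio track and a 10-second segment duration, this yields three ranges: 0 to 10, 10 to 20, and 20 to 25 seconds.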
To speed up the task on Windows or Linux with an Nvidia graphics card, install and configure the CUDA and cuDNN environment, then enable CUDA acceleration, which significantly improves execution speed.
View CUDA and cuDNN Installation Tutorial
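As a rough illustration of the choice that the CUDA acceleration option controls, the sketch below picks device settings depending on whether acceleration is enabled and an Nvidia driver appears to be present. It assumes that faster mode is backed by the faster-whisper library, which the text does not state; the function names are hypothetical, and finding `nvidia-smi` on the PATH is only a heuristic that does not confirm CUDA and cuDNN are installed correctly.

```python
import shutil
from typing import Optional


def has_nvidia_driver() -> bool:
    """Heuristic: the Nvidia driver usually ships the nvidia-smi tool."""
    return shutil.which("nvidia-smi") is not None


def whisper_device_options(enable_cuda: bool,
                           nvidia_available: Optional[bool] = None) -> dict:
    """Return device settings for a Whisper-style model loader.

    The keys match the `device` and `compute_type` arguments of
    faster_whisper.WhisperModel (assumed backend, see lead-in).
    """
    if nvidia_available is None:
        nvidia_available = has_nvidia_driver()
    if enable_cuda and nvidia_available:
        # GPU path: half precision is a common choice on CUDA devices.
        return {"device": "cuda", "compute_type": "float16"}
    # Fallback: run on the CPU with 8-bit quantization.
    return {"device": "cpu", "compute_type": "int8"}
```

Under that assumption, the returned dictionary could be passed on as `WhisperModel("medium", **options)`.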
Auto Detect Language
In v2.59 and later, the original language drop-down box has a new "Auto Detect" option. If you don't know the spoken language, or it is not among the 24 supported languages, select "Auto Detect" and the program will try to recognize the spoken language automatically.
That said, avoid this option where possible, especially when there is no clear speech in the first 30 seconds of the video: automatic detection uses only the first 30 seconds of audio to determine the language of the whole video. Also note that languages that sound alike but are written differently cannot be reliably distinguished and may be identified as any one of them. For example, a Chinese video may be identified as Simplified or Traditional Chinese at random.
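Because detection relies on the first 30 seconds only, it can help to check that window before choosing "Auto Detect". The sketch below is a crude illustration, not the program's detection logic: the function names, the amplitude threshold, and the assumption that samples are floats in the range -1.0 to 1.0 are all hypothetical. A quiet opening is flagged, but loud music without speech would not be.

```python
from typing import Sequence


def detection_window(samples: Sequence[float], sample_rate: int,
                     window_seconds: float = 30.0) -> Sequence[float]:
    """Return the samples covering the first `window_seconds` of audio."""
    return samples[: int(sample_rate * window_seconds)]


def is_mostly_silent(window: Sequence[float], threshold: float = 0.01) -> bool:
    """Return True if the mean absolute amplitude is below `threshold`.

    An empty window counts as silent. The threshold is an arbitrary
    illustrative value, not one used by the software.
    """
    if not window:
        return True
    mean_abs = sum(abs(s) for s in window) / len(window)
    return mean_abs < threshold
```

If `is_mostly_silent(detection_window(samples, sample_rate))` returns True, selecting the original language manually is likely the safer choice.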