The core principle of video translation software is to recognize the text of the speech in a video, translate that text into the target language, generate a voice-over from the translated text, and finally embed the voice-over and subtitles back into the video.
As you can see, the first step is to recognize text from the speech in the video. The accuracy of this recognition directly affects the quality of the subsequent translation and voice-over.
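The four stages above can be sketched as a simple pipeline. All function names and return values below are hypothetical placeholders for illustration, not the software's real API:

```python
# Minimal sketch of the four-stage pipeline described above.
# Every function here is a hypothetical stub, not the software's real API.

def transcribe(video_path: str) -> str:
    """Stage 1: speech recognition -- turn the audio track into text."""
    return "hello world"                  # placeholder result

def translate(text: str, target_lang: str) -> str:
    """Stage 2: translate the recognized text into the target language."""
    return f"[{target_lang}] {text}"      # placeholder result

def synthesize(text: str) -> bytes:
    """Stage 3: generate a voice-over from the translated text."""
    return text.encode("utf-8")           # placeholder audio bytes

def embed(video_path: str, audio: bytes, subtitles: str) -> str:
    """Stage 4: mux the voice-over and subtitles back into the video."""
    return video_path.replace(".mp4", ".translated.mp4")

def translate_video(video_path: str, target_lang: str) -> str:
    text = transcribe(video_path)         # errors here propagate downstream
    subtitles = translate(text, target_lang)
    audio = synthesize(subtitles)
    return embed(video_path, audio, subtitles)
```

Because each stage consumes the previous stage's output, a recognition error in stage 1 is carried through translation and voice-over unchanged, which is why recognition accuracy matters most.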
OpenAI-Whisper Local Mode
This mode uses OpenAI's official open-source Whisper models. Compared to the faster mode, it is slower but offers the same accuracy.
The model selection method on the right is the same: from tiny to large-v3, each larger model consumes more computing resources and delivers higher accuracy.
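The size/accuracy trade-off can be made concrete with the approximate figures published in the openai/whisper README (parameter counts and required VRAM; exact speed and accuracy depend on your hardware). The helper function below is only an illustration of how one might pick a model:

```python
# Approximate Whisper model sizes, taken from the openai/whisper README.
# name: (parameters in millions, approx. VRAM required in GB)
WHISPER_MODELS = {
    "tiny":     (39,   1),
    "base":     (74,   1),
    "small":    (244,  2),
    "medium":   (769,  5),
    "large-v3": (1550, 10),
}

def largest_model_for_vram(vram_gb: float) -> str:
    """Pick the biggest (most accurate) model that fits in the given VRAM."""
    fitting = [(params, name)
               for name, (params, need) in WHISPER_MODELS.items()
               if need <= vram_gb]
    return max(fitting)[1]  # most parameters among the models that fit
```

For example, a GPU with about 4 GB of VRAM tops out at small, while large-v3 needs roughly 10 GB.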
Note: Although the faster mode and the OpenAI mode use mostly the same model names, the model files are not interchangeable. Please download the models for the OpenAI mode from https://github.com/jianchang512/stt/releases/0.0.
Large-v3-turbo Model
OpenAI recently released large-v3-turbo, a model optimized from large-v3. Its recognition accuracy is close to that of large-v3, while its size and resource consumption are much lower, so it can be used as a replacement for large-v3.
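The openai/whisper README gives approximate figures that quantify this reduction; the numbers below are taken from there and should be treated as rough guides, not exact measurements:

```python
# Rough comparison of large-v3 vs. large-v3-turbo, using the approximate
# figures from the openai/whisper README.
LARGE_V3       = {"params_m": 1550, "vram_gb": 10, "relative_speed": 1}
LARGE_V3_TURBO = {"params_m": 809,  "vram_gb": 6,  "relative_speed": 8}

# turbo has roughly half the parameters and runs several times faster,
# at the cost of a small accuracy drop relative to large-v3.
size_ratio = LARGE_V3_TURBO["params_m"] / LARGE_V3["params_m"]
```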
How to use
1. Update the software to version v2.67.
2. In the speech recognition drop-down box, select openai-whisper local.
3. In the model drop-down box next to it, select large-v3-turbo.
4. Download the large-v3-turbo.pt file into the models folder of the software directory.
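After completing the steps above, a quick stdlib-only check like the following can confirm the model file landed in the right place (the models folder name matches the last step; the function name is just an illustration):

```python
from pathlib import Path

def model_present(software_dir: str, filename: str = "large-v3-turbo.pt") -> bool:
    """Check that the downloaded model file sits in <software_dir>/models."""
    return (Path(software_dir) / "models" / filename).is_file()

# With a recent openai-whisper installed, the same folder can also be
# pointed at directly, e.g.:
#   whisper.load_model("large-v3-turbo", download_root="models")
```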