Speech recognition, also known as Automatic Speech Recognition (ASR), converts human speech from audio and video into text. It is the foundational step in video translation and a key factor determining the quality of subsequent dubbing and subtitles.
Currently, the software primarily supports two offline recognition engines: faster-whisper (local) and openai-whisper (local).
These two models are quite similar. Essentially, faster-whisper is a refined and optimized version of openai-whisper. They offer virtually identical recognition accuracy, but the former boasts faster recognition speed. However, when using CUDA acceleration, faster-whisper has stricter requirements for environment configuration.
Faster-Whisper Local Recognition Mode
This mode is the software's default and recommended choice, as it offers faster speed and higher efficiency.
In this mode, the model sizes range from smallest to largest: tiny -> base -> small -> medium -> large-v1 -> large-v2 -> large-v3.
Model size grows from roughly 60MB to 2.7GB, with correspondingly higher memory, VRAM, and CPU/GPU consumption. If your available VRAM is less than 10GB, large-v3 is not recommended, as it may lead to crashes or freezes.
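If you want to automate this guard, here is a minimal sketch using PyTorch (an assumption on my part; the software itself may choose models differently) that mirrors the ~10GB rule of thumb above:

```python
import torch

def pick_model_size() -> str:
    """Heuristic model choice based on available VRAM (illustrative only)."""
    if not torch.cuda.is_available():
        # No CUDA: stay small so CPU-only inference remains tolerable.
        return "small"
    total_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # Mirrors the guidance above: avoid large-v3 below ~10GB of VRAM.
    return "large-v3" if total_vram_gb >= 10 else "medium"

print(pick_model_size())
```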
From tiny to large-v3, recognition accuracy improves along with model size and resource consumption. tiny/base/small are micro-models: they recognize quickly and use few resources, but with lower accuracy. medium is a mid-sized model; if you need to recognize videos with Chinese speech, use at least the medium model or larger for better results.
If your CPU is powerful and you have ample memory, you can choose the large-v1/v2 models even without CUDA acceleration. This significantly improves accuracy over the smaller models, although recognition will be slower.
large-v3 consumes a significant amount of resources and is not recommended unless you have a powerful computer. Consider large-v3-turbo as a replacement: it offers the same accuracy while running faster and consuming fewer resources.
Models ending with .en and models starting with distil are only suitable for English-language videos. Do not use them for videos in other languages.
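For reference, a minimal faster-whisper transcription sketch. The model size, audio path, and compute_type are illustrative placeholders; adapt them to your hardware:

```python
from faster_whisper import WhisperModel

# "medium" follows the guidance above for Chinese-language video; use
# device="cpu" with compute_type="int8" if CUDA is not available.
model = WhisperModel("medium", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```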
OpenAI-Whisper Local Recognition Mode
The models in this mode are essentially the same as those in faster-whisper, from smallest to largest: tiny -> base -> small -> medium -> large-v1 -> large-v2 -> large-v3. The same usage precautions apply: tiny/base/small are micro-models, while large-v1/v2/v3 are large models.
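A comparable openai-whisper sketch, again with an assumed model size and file name:

```python
import whisper

# Same size names as faster-whisper; "medium" is assumed here.
model = whisper.load_model("medium")

# fp16=False avoids the half-precision warning when running on CPU.
result = model.transcribe("audio.wav", fp16=False)
print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")
```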
Summary of Model Selection
- Prioritize the faster-whisper local mode. If you encounter persistent environment errors when attempting CUDA acceleration, fall back to the openai-whisper local mode (see the sketch after this list).
- Regardless of the mode, for videos with Chinese speech choose at least the medium model (small at the bare minimum); for English-language videos, choose at least small. If your computer has sufficient resources, large-v3-turbo is recommended.
- Models ending with .en and models starting with distil are only suitable for English-language videos.
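To illustrate the first point, a minimal sketch of the fallback strategy (the function name and the broad except clause are illustrative, not the software's actual logic):

```python
def transcribe_with_fallback(audio_path: str, size: str = "medium") -> str:
    """Try faster-whisper on CUDA first; fall back to openai-whisper on CPU."""
    try:
        from faster_whisper import WhisperModel
        model = WhisperModel(size, device="cuda", compute_type="float16")
        segments, _ = model.transcribe(audio_path)
        return "".join(seg.text for seg in segments)
    except Exception as exc:  # e.g. missing cuDNN/CUDA runtime libraries
        print(f"faster-whisper failed ({exc}); falling back to openai-whisper")
        import whisper
        model = whisper.load_model(size)
        return model.transcribe(audio_path, fp16=False)["text"]
```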