VAD Parameter Tuning in Speech Recognition
During the speech recognition phase of video translation, subtitles can sometimes be excessively long (tens of seconds or even minutes) or too short (under a second). These issues can often be mitigated by adjusting the Voice Activity Detection (VAD) parameters.
What is VAD?
Silero VAD is an efficient Voice Activity Detection (VAD) tool that identifies whether audio contains speech, separating speech segments from silence or noise. It can be used with speech recognition libraries like Whisper to detect and segment speech fragments before or after recognition, optimizing recognition performance.
Faster-whisper uses VAD by default for speech analysis and segmentation, controlled primarily by the five parameters below. These parameters govern how speech and silence are judged and where fragments are split. Here is a detailed explanation and a setting recommendation for each parameter:
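As a reference, the parameters can be passed to faster-whisper's transcribe call via the vad_parameters argument. The values below are the defaults described in this article; the defaults in your installed faster-whisper version may differ, so treat this as a sketch rather than a definitive configuration:

```python
# Default VAD parameter values as described in this article.
# (Assumption: check your installed faster-whisper's VadOptions for its
# actual defaults, which may differ between versions.)
vad_parameters = dict(
    threshold=0.5,                         # speech probability threshold
    min_speech_duration_ms=250,            # drop fragments shorter than this
    max_speech_duration_s=float("inf"),    # no upper limit on fragment length
    min_silence_duration_ms=2000,          # silence needed before splitting
    speech_pad_ms=400,                     # padding around each fragment
)

# Typical usage (requires faster-whisper installed and an audio file):
# from faster_whisper import WhisperModel
# model = WhisperModel("base")
# segments, info = model.transcribe(
#     "audio.wav", vad_filter=True, vad_parameters=vad_parameters
# )
```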
Threshold
Meaning: Represents the probability threshold for speech. Silero VAD outputs the speech probability for each audio fragment. Probabilities higher than this value are considered speech (SPEECH), while those lower are considered silence or background noise.
Setting Recommendation: The default value is 0.5, which works well in most cases. However, for different datasets, you can adjust this value to more accurately distinguish between speech and noise. If you find too many misjudgments, try increasing it to 0.6 or 0.7; if too many speech fragments are lost, you can lower it to 0.3 or 0.4.
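The effect of the threshold can be illustrated with a toy sketch (not the library's actual implementation): Silero VAD emits a speech probability per audio chunk, and chunks above the threshold count as speech. The probability values here are invented for demonstration:

```python
# Toy illustration of the threshold parameter. The per-chunk speech
# probabilities below are made up; real values come from the VAD model.
probs = [0.10, 0.45, 0.62, 0.91, 0.55, 0.20]

def classify(probs, threshold):
    # Chunks with probability above the threshold are labeled SPEECH,
    # the rest are treated as silence or background noise.
    return ["SPEECH" if p > threshold else "SILENCE" for p in probs]

print(classify(probs, 0.5))  # 0.55, 0.62, 0.91 count as speech
print(classify(probs, 0.7))  # stricter: only 0.91 counts as speech
```

Raising the threshold trades fewer noise misjudgments for a higher risk of dropping quiet speech, which is why the recommendation above adjusts it in both directions.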
min_speech_duration_ms (Minimum Speech Duration, in milliseconds)
Meaning: If the detected speech fragment is shorter than this value, it will be discarded. The purpose is to remove brief non-speech sounds or noise.
Setting Recommendation: The default value is 250 milliseconds, suitable for most scenarios. You can adjust this as needed. If speech fragments are too short and easily mistaken for noise, you can increase this value, for example, to 500 milliseconds.
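A toy sketch of this filtering step (the segment times are invented; this is not the library's actual code):

```python
# Toy filter mirroring min_speech_duration_ms: detected fragments shorter
# than the minimum are treated as noise and discarded.
segments = [(0, 120), (500, 1400), (2000, 2200), (3000, 6000)]  # (start_ms, end_ms)

def drop_short(segments, min_speech_duration_ms=250):
    return [(s, e) for s, e in segments if e - s >= min_speech_duration_ms]

print(drop_short(segments))         # 120-ms and 200-ms blips are removed
print(drop_short(segments, 1000))   # stricter: the 900-ms fragment goes too
```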
max_speech_duration_s (Maximum Speech Duration, in seconds)
Meaning: The maximum length of a single speech fragment. If a speech fragment exceeds this duration, the system will attempt to split it at a silence longer than 100 milliseconds. If no silence is found, it will be forcibly split before this duration to avoid excessively long continuous fragments.
Setting Recommendation: The default is infinity (no limit). If you need to process longer speech fragments, you can keep the default value. However, if you want to control the fragment length, such as for processing dialogues or segmented output, you can set it according to specific needs, such as 10 seconds or 30 seconds.
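The splitting behavior can be sketched as follows. This is a simplified illustration with invented timestamps, not the library's actual algorithm: a fragment over the cap is cut at the last available silence before the limit, or forcibly at the limit if no silence exists:

```python
# Toy split mirroring max_speech_duration_s: a fragment longer than the cap
# is cut at the last silence gap before the limit, or forcibly at the limit.
def split_long(start_s, end_s, silences, max_speech_duration_s=30.0):
    # silences: sorted silence positions (seconds) inside the fragment
    pieces, cur = [], start_s
    while end_s - cur > max_speech_duration_s:
        limit = cur + max_speech_duration_s
        candidates = [t for t in silences if cur < t <= limit]
        cut = candidates[-1] if candidates else limit  # forced split if no silence
        pieces.append((cur, cut))
        cur = cut
    pieces.append((cur, end_s))
    return pieces

# A 70-second fragment with silences at 24.5 s and 41.0 s, capped at 30 s:
print(split_long(0.0, 70.0, silences=[24.5, 41.0], max_speech_duration_s=30.0))
```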
min_silence_duration_ms (Minimum Silence Duration, in milliseconds)
Meaning: The amount of silence to wait for after speech ends. A speech fragment is split only when the silence lasts longer than this value.
Setting Recommendation: The default value is 2000 milliseconds (2 seconds). If you want to detect and split speech fragments more quickly, you can reduce this value, for example, to 500 milliseconds; if you want a more lenient split, you can increase it.
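The effect of this parameter can be illustrated with a toy merge step (invented timestamps, not the library's actual code): a pause is only treated as a split point if it lasts longer than the minimum, and shorter gaps are absorbed into one fragment:

```python
# Toy merge mirroring min_silence_duration_ms: gaps shorter than the
# minimum silence duration do not split the fragment.
def merge_close(segments, min_silence_duration_ms=2000):
    merged = [list(segments[0])]
    for start, end in segments[1:]:
        if start - merged[-1][1] < min_silence_duration_ms:
            merged[-1][1] = end  # gap too short: keep as one fragment
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

segments = [(0, 1000), (1500, 3000), (6000, 8000)]  # times in ms, invented
print(merge_close(segments))        # 500-ms gap merged, 3-s gap kept as a split
print(merge_close(segments, 500))   # stricter: the 500-ms gap also splits
```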
speech_pad_ms (Speech Padding Time, in milliseconds)
Meaning: Padding added before and after each detected speech fragment, so that speech at the edges is not clipped when the fragment is cut too tightly.
Setting Recommendation: The default value is 400 milliseconds. If you find that the cut speech fragments have missing parts, you can increase this value, such as to 500 milliseconds or 800 milliseconds. Conversely, if the speech fragments are too long or contain too many invalid parts, you can reduce this value.
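A toy sketch of the padding step (invented timestamps, not the library's actual code): each fragment is widened on both sides, clamped to the bounds of the audio:

```python
# Toy padding mirroring speech_pad_ms: widen each fragment on both sides
# so word edges are not clipped, without running past the audio bounds.
def pad_segments(segments, speech_pad_ms=400, audio_len_ms=10000):
    return [(max(0, s - speech_pad_ms), min(audio_len_ms, e + speech_pad_ms))
            for s, e in segments]

print(pad_segments([(1000, 2000), (9800, 10000)]))  # second fragment clamps at the end
```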
The specific settings of these parameters need to be optimized based on the speech dataset and application scenario you are using. Reasonable configuration can significantly improve the performance of VAD.
These parameters can be modified under Menu--Tools/Options--Advanced Options--faster/openai. Alternatively, select "faster-whisper local" as the speech recognition mode in the main interface, then click the "Speech Recognition" text on the left; the edit boxes for these parameters will appear below.
Summary:
threshold: Can be adjusted according to the dataset; the default value of 0.5 is relatively universal.
min_speech_duration_ms and min_silence_duration_ms: Determine the length of speech fragments and the sensitivity of silence segmentation; fine-tune according to the application scenario.
max_speech_duration_s: Prevents speech fragments from growing unreasonably long; set it according to the specific application.
speech_pad_ms: Adds a buffer to speech fragments to prevent fragments from being over-cut. The specific numerical choice depends on your audio data and the need for speech segmentation.
The cleaner and less noisy the audio, the better the recognition. Even the most carefully tuned parameters are no substitute for a clean recording.