Why Are Automatically Generated Subtitles Uneven and Messy, and How Can You Optimize Them?
During video translation, the subtitles automatically generated in the speech recognition stage are often unsatisfactory: some are so long they nearly fill the screen, while others show only two or three characters and look fragmented. Why does this happen?
The Sentence Segmentation Standard of Speech Recognition
Speech Recognition:
Converting human speech into subtitle text usually relies on silent intervals to segment sentences, with the minimum silence duration typically set between 200 and 500 milliseconds. If it is set to 250 milliseconds, the program treats any silence lasting at least 250 milliseconds as the end of a sentence and generates one subtitle spanning from the previous cut point to that point.
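Conceptually, the process resembles the sketch below. It uses the `pydub` library as an illustrative choice (not necessarily what any particular software uses internally), and the file name and thresholds are placeholder values:

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_file("speech.wav")

# Any stretch quieter than silence_thresh for at least min_silence_len
# milliseconds is treated as a sentence boundary between subtitles.
speech_ranges = detect_nonsilent(
    audio,
    min_silence_len=250,   # 250 ms of silence ends one subtitle
    silence_thresh=-40,    # dBFS level below which audio counts as silence
)

for start_ms, end_ms in speech_ranges:
    print(f"subtitle segment: {start_ms} ms -> {end_ms} ms")
```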
Factors Affecting Subtitle Quality
Speaking Speed:
If the audio features fast speech with almost no pauses, or pauses shorter than 250 milliseconds, the segmented subtitles will be very long, potentially lasting for tens of seconds and filling the screen when embedded in the video.
Irregular Pauses:
Conversely, if the speaker pauses where they shouldn't, breaking up an otherwise coherent sentence, the resulting subtitles will be fragmented, with some displaying only a few words.
Background Noise:
Background noise or music can also interfere with the detection of silent intervals, leading to inaccurate recognition.
Pronunciation Clarity:
This is self-evident: if pronunciation is so unclear that even humans struggle to understand it, recognition accuracy will suffer as well.
How to Deal with These Problems?
Reduce Background Noise:
If there is significant background noise, you can separate the human voice from the background noise before recognition to remove interference and improve recognition accuracy.
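A minimal sketch of this step, assuming the `noisereduce` and `soundfile` libraries and a mono WAV file (all illustrative choices, not necessarily what the software uses internally):

```python
import noisereduce as nr
import soundfile as sf

# Assumes a mono WAV file; stereo audio would need per-channel handling.
data, rate = sf.read("noisy_speech.wav")

# Estimate the noise profile from the signal itself and subtract it,
# so the cleaned audio can be handed to speech recognition.
cleaned = nr.reduce_noise(y=data, sr=rate)

sf.write("cleaned_speech.wav", cleaned, rate)
```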
Use a Larger Speech Recognition Model:
If your computer's performance allows, use a larger model for recognition, such as `large-v2` or `large-v3-turbo`.
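With the `faster-whisper` library, which the adjustment section below also refers to, selecting a larger model looks roughly like this (the device settings and file name are illustrative):

```python
from faster_whisper import WhisperModel

# Larger models are slower and need more memory, but they transcribe
# and segment noticeably better than small ones.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe("speech.wav")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```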
Adjust Silent Segment Duration:
The software's default minimum silent segment is 200 milliseconds, and you can adjust this value to match the audio. If the video has fast speech, reduce it to around 100 milliseconds; if there are many pauses, increase it to 300 or 500 milliseconds. To configure it, open the Tools/Options menu, select Advanced Options, and modify the Minimum Silent Segment Value in the Faster/OpenAI Speech Recognition Adjustment section.
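The same threshold can be expressed against faster-whisper's built-in Silero VAD options; `min_silence_duration_ms` is faster-whisper's name for this parameter, and the 250 ms value and file name below are illustrative:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")

# min_silence_duration_ms plays the role of the minimum silent segment:
# shorter values split fast speech more aggressively, longer values merge
# fragments separated by brief pauses.
segments, info = model.transcribe(
    "speech.wav",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 250},
)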
Set Maximum Subtitle Duration:
You can set a maximum duration for subtitles, and subtitles exceeding this duration will be forcibly split. This setting is also found in Advanced Options.
For example, with the maximum duration set to 10 seconds, any subtitle longer than 10 seconds will be re-segmented.
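As a rough illustration of forced splitting, here is a hypothetical sketch; the function name and the even distribution of words are inventions for illustration, and the program's actual logic also weighs silence and punctuation, as described later:

```python
import math

def split_long_subtitle(start, end, text, max_seconds=10.0):
    """Split one subtitle into pieces no longer than max_seconds."""
    duration = end - start
    if duration <= max_seconds:
        return [(start, end, text)]
    n = math.ceil(duration / max_seconds)  # number of pieces needed
    step = duration / n                    # duration of each piece
    words = text.split()
    per_piece = math.ceil(len(words) / n)  # words per piece, roughly even
    return [
        (start + i * step,
         start + (i + 1) * step,
         " ".join(words[i * per_piece:(i + 1) * per_piece]))
        for i in range(n)
    ]

# A 24-second subtitle is split into three 8-second pieces.
for piece in split_long_subtitle(0.0, 24.0, "a very long run-on subtitle " * 8):
    print(piece)
```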
Set Maximum Characters per Subtitle Line:
You can set a limit on the number of characters per subtitle line. Subtitles exceeding this limit will automatically wrap to the next line or be split.
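Wrapping at a character limit can be done with Python's standard `textwrap` module; the 42-character width below is simply a common subtitle convention, used here as an example:

```python
import textwrap

line = "This subtitle is far too long to fit comfortably on a single on-screen line."

# Wrap at word boundaries so no rendered line exceeds the character limit.
wrapped = textwrap.wrap(line, width=42)
print("\n".join(wrapped))
```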
Enable Re-Segmentation:
After enabling this option, combined with the maximum duration and maximum characters settings above, the program will automatically re-segment the subtitles.
Once the silent segment, maximum duration, maximum characters, and re-segmentation settings above are in place, the program first generates subtitles based on silent intervals. When it encounters a subtitle that is too long or contains too many characters, it splits that subtitle by re-segmenting it. During re-segmentation, the program uses the nltk natural language processing library, weighing silent interval duration, punctuation, subtitle character count, and other factors to decide where to split.
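As one illustration of the punctuation-aware part of this, here is a minimal sketch using nltk's sentence tokenizer; how the program actually combines silence, punctuation, and length is internal to it, and the sample text is invented:

```python
import nltk

# The tokenizer models must be downloaded once; recent nltk versions
# use "punkt_tab", older ones "punkt".
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

from nltk.tokenize import sent_tokenize

long_subtitle = ("Speech recognition produced one huge block. "
                 "It should really be two or three subtitles. "
                 "Sentence boundaries are a natural place to cut.")

for sentence in sent_tokenize(long_subtitle):
    print(sentence)
```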