Why Are Recognized Subtitles Inconsistent and Messy, and How Can You Optimize Them?

During video translation, subtitles generated automatically through speech recognition often turn out unsatisfactory: some are so long they nearly fill the screen, while others show only two or three characters and look fragmented. Why does this happen?

How Speech Recognition Segments Sentences

When converting human speech into subtitle text, the recognizer typically segments sentences at silence intervals. The minimum silence duration is generally set between 200 and 500 milliseconds. For example, with a 250-millisecond setting, whenever the program detects a silence of at least 250 milliseconds, it treats that point as the end of a sentence and generates a subtitle spanning from the previous endpoint to this one.
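A minimal sketch of this idea in Python (all names here are illustrative, not the software's actual code; it assumes the audio has already been reduced to a per-frame silence flag, for example by thresholding energy):

```python
# Illustrative sketch: split speech into subtitle spans at silences.
# `is_silent` holds one boolean per 10 ms frame of audio.

FRAME_MS = 10            # each frame covers 10 ms of audio
MIN_SILENCE_MS = 250     # a pause at least this long ends a sentence

def segment_by_silence(is_silent: list[bool]) -> list[tuple[int, int]]:
    """Return (start_ms, end_ms) spans; a span closes once a silence
    of at least MIN_SILENCE_MS has been observed."""
    segments = []
    start = None          # start frame of the current speech span
    silent_run = 0        # consecutive silent frames seen so far
    for i, silent in enumerate(is_silent):
        if silent:
            silent_run += 1
            # Close the span once the pause is long enough.
            if start is not None and silent_run * FRAME_MS >= MIN_SILENCE_MS:
                end_frame = i - silent_run + 1   # one past the last speech frame
                segments.append((start * FRAME_MS, end_frame * FRAME_MS))
                start = None
        else:
            if start is None:
                start = i  # speech resumes: open a new span
            silent_run = 0
    if start is not None:  # flush a span still open at the end
        segments.append((start * FRAME_MS, len(is_silent) * FRAME_MS))
    return segments

# 300 ms of speech, a 300 ms pause, then 200 ms more speech
print(segment_by_silence([False] * 30 + [True] * 30 + [False] * 20))
# -> [(0, 300), (600, 800)]
```

All of the quality issues below stem from this rule: segment boundaries depend entirely on where sufficiently long silences happen to fall.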

Factors Affecting Subtitle Quality

  1. Speaking Speed

If the speech is very fast, with almost no pauses or only pauses shorter than 250 milliseconds, a single subtitle can run for tens of seconds and fill the screen when embedded in the video.


  2. Irregular Pauses

Conversely, if there are unnecessary pauses in the speech, such as multiple breaks within a coherent sentence, the subtitles may become very fragmented, with some displaying only a few words.


  3. Background Noise

Background noise or music can interfere with the detection of silence intervals, leading to inaccurate recognition.

  4. Pronunciation Clarity: This is self-evident; unclear pronunciation is difficult even for humans to understand.

How to Address These Issues?

  1. Reduce Background Noise:

If there is significant background noise, separate the human voice from the background sound before recognition; removing the interference improves recognition accuracy.
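As one possibility, vocals can be isolated with an off-the-shelf source-separation tool such as Demucs before transcription (a sketch under that assumption; the software may use a different separator internally):

```python
# Illustrative sketch: isolate vocals with Demucs (pip install demucs),
# then feed the separated vocals file to the recognizer instead of the
# noisy original.
import subprocess

def extract_vocals(audio_path: str) -> None:
    # --two-stems=vocals writes vocals.wav and no_vocals.wav under
    # ./separated/htdemucs/<track name>/
    subprocess.run(
        ["python", "-m", "demucs", "--two-stems=vocals", audio_path],
        check=True,
    )

extract_vocals("noisy_interview.wav")  # then transcribe the vocals.wav output
```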

  2. Use Large Speech Recognition Models:

If computing performance allows, use larger models for recognition, such as large-v2 or large-v3-turbo.
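With faster-whisper, for example, switching to a larger checkpoint is a one-line change (a sketch assuming faster-whisper is installed and a CUDA GPU is available; adjust device and compute_type for CPU-only machines):

```python
# Illustrative sketch: transcribe with a larger Whisper checkpoint via
# faster-whisper (pip install faster-whisper). Larger models run slower
# but usually transcribe and segment more accurately.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("video_audio.wav")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```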

  3. Adjust Silence Segment Duration:

The software defaults to a minimum silence segment of 200 milliseconds. Adjust this value to suit the audio or video: for fast speech, reduce it to around 100 milliseconds; for speech with many pauses, raise it to 300 or 500 milliseconds. To change it, open the Tools/Options menu, select Advanced Options, and modify the minimum silence segment value in the Faster/OpenAI speech recognition section.
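For comparison, the equivalent knob in code is the minimum-silence option of faster-whisper's VAD filter (a sketch; the GUI setting plays the same role, though its internal wiring is the software's own):

```python
# Illustrative sketch: lower the minimum-silence threshold for fast speech.
# min_silence_duration_ms is a Silero-VAD option exposed by faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")
segments, _ = model.transcribe(
    "fast_speech.wav",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 100},  # cut at pauses >= 100 ms
)
```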


  4. Set Maximum Subtitle Duration:

Define a maximum duration for subtitles; any subtitle exceeding this limit will be forcibly segmented. This setting is also available in the Advanced Options.

For example, with this value set to 10 seconds, any subtitle longer than 10 seconds will be re-segmented.
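Conceptually, the forced split looks like this (an illustrative sketch, not the software's actual implementation; it divides time and text evenly rather than at word boundaries):

```python
# Illustrative sketch: force-split any subtitle longer than a maximum
# duration into roughly equal pieces.

MAX_DURATION_S = 10.0

def split_long_subtitle(start: float, end: float, text: str):
    """Split (start, end, text) into pieces no longer than MAX_DURATION_S,
    dividing the text proportionally by character count."""
    duration = end - start
    if duration <= MAX_DURATION_S:
        return [(start, end, text)]
    n_pieces = int(duration // MAX_DURATION_S) + 1
    piece_len = duration / n_pieces
    chars = max(1, len(text) // n_pieces)
    pieces = []
    for k in range(n_pieces):
        t0 = start + k * piece_len
        t1 = start + (k + 1) * piece_len
        hi = None if k == n_pieces - 1 else (k + 1) * chars
        pieces.append((t0, t1, text[k * chars:hi]))
    return pieces

# A 25-second subtitle becomes three pieces of about 8.3 seconds each.
for piece in split_long_subtitle(0.0, 25.0, "one long run-on sentence " * 6):
    print(piece)
```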

  5. Set Maximum Characters per Line:

Limit the number of characters per subtitle line; any subtitle exceeding this limit will automatically wrap or be segmented.

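A minimal sketch of the wrapping step, using only the standard library (the 42-character limit here is a common subtitling convention, not the software's default):

```python
# Illustrative sketch: wrap subtitle text to a per-line character limit.
import textwrap

MAX_CHARS_PER_LINE = 42  # configurable; 42 is a common subtitle convention

def wrap_subtitle(text: str) -> str:
    return "\n".join(textwrap.wrap(text, width=MAX_CHARS_PER_LINE))

print(wrap_subtitle(
    "a single recognized segment whose text is far too long "
    "to fit comfortably on one on-screen line"
))
```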

  6. Enable LLM Re-segmentation Function:

After enabling this option, combined with the maximum-duration and maximum-characters settings (items 4 and 5 above), the program will automatically re-segment subtitles using a large language model.

With settings 3 through 6 in place, the program first generates subtitles based on silence intervals, then re-segments any that are too long or contain too many characters.
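As an illustration of what the LLM step involves, a re-segmentation request might look like the following (a hedged sketch using the OpenAI Python client; the software's actual prompt, provider, and model may differ):

```python
# Illustrative sketch: ask a chat model to re-segment subtitle fragments
# into natural sentences (pip install openai; needs OPENAI_API_KEY set).
from openai import OpenAI

client = OpenAI()

raw_lines = [
    "so what we found was that",
    "the new approach reduces latency by forty percent which",
    "is a big win for real-time translation",
]

prompt = (
    "Re-segment the following subtitle fragments into complete, natural "
    "sentences, one per line, without adding or changing any words:\n\n"
    + "\n".join(raw_lines)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```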