Version 3.64 includes several minor optimizations, with a focus on segmentation during speech recognition and reducing text-to-speech errors.

Adjusting Subtitle Duration in Speech Recognition

The principle of speech recognition is to split the entire audio into small fragments at silent intervals. A fragment may be 1 second, 5 seconds, 10 seconds, or 20 seconds long, depending on where the pauses fall. Each fragment is then transcribed into text, and the results are combined into subtitles.

When using faster-whisper mode or GeminiAI as the speech recognition channel, you may encounter situations where the subtitle recognition results are too long (a large chunk of text) or too fragmented. In this case, you can adjust the segmentation parameters yourself based on the speaking characteristics of the audio. This mainly involves the following parameters:

Open Menu → Tools/Options → Advanced Options → faster/openai Speech Recognition Adjustments:

  1. Silence Separation Milliseconds (unit: milliseconds): This is the basis for speech segmentation. A cut is made only where a period of silence reaches or exceeds the set value. For example, setting it to 200 means a cut is made only when a silent interval lasts at least 200 milliseconds. If the speech rate is fast and pauses are short, lower this value; if the speech rate is slow, raise it.
  2. Minimum Speech Duration/Milliseconds (unit: milliseconds): Only fragments longer than this value are kept as subtitles. For example, setting it to 1000 means no subtitle fragment will be shorter than 1000 milliseconds, which avoids overly fragmented subtitles.
  3. Maximum Speech Duration/Seconds (unit: seconds): The opposite of the previous item, this limits the maximum duration of a subtitle. For example, setting it to 15 means that if a fragment reaches 15 seconds without a suitable cut point being found, a cut is forced.
  4. Maximum Subtitle Duration (Seconds): This parameter re-splits sentences after recognition completes in order to limit subtitle length; it does not affect segmentation during the speech recognition process.
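The first three rules above correspond to faster-whisper's VAD options (`min_silence_duration_ms`, `min_speech_duration_ms`, `max_speech_duration_s`). As an illustration only, and not pyvideotrans's actual implementation, the cutting logic they describe can be sketched on toy voiced intervals like this:

```python
def segment(voiced, min_silence_ms=200, min_speech_ms=1000, max_speech_s=15.0):
    """Group voiced intervals (start_ms, end_ms) into subtitle fragments:
    cut at silences >= min_silence_ms, force a cut past max_speech_s,
    and drop fragments shorter than min_speech_ms."""
    segments = []
    cur_start, cur_end = None, None
    for start, end in voiced:
        if cur_start is None:
            cur_start, cur_end = start, end
            continue
        gap = start - cur_end                       # silence between intervals
        too_long = (end - cur_start) / 1000 > max_speech_s
        if gap >= min_silence_ms or too_long:       # cut here
            segments.append((cur_start, cur_end))
            cur_start, cur_end = start, end
        else:                                       # merge into current fragment
            cur_end = end
    if cur_start is not None:
        segments.append((cur_start, cur_end))
    # rule 2: discard fragments below the minimum speech duration
    return [(s, e) for s, e in segments if e - s >= min_speech_ms]
```

For example, with the defaults, `segment([(0, 900), (1000, 2500), (2800, 3100)])` merges the first two intervals (the 100 ms gap is below 200 ms) and drops the trailing 300 ms fragment, returning `[(0, 2500)]` — one subtitle instead of three slivers.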

Reduce 403 Error Rate for edge-tts (Also Applicable to Other Text-to-Speech Channels)

Since text-to-speech requires connecting to Microsoft's API, 403 errors cannot be completely avoided. However, you can reduce the occurrence of errors by making the following adjustments:

Open Menu → Tools/Options → Advanced Options → Text-to-Speech Adjustments:

  1. Number of Subtitles for Simultaneous Text-to-Speech: It is recommended to set this to 1. Reducing the number of subtitles for simultaneous text-to-speech can reduce errors caused by excessive request frequency. This setting also applies to other text-to-speech channels.
  2. Pause Time After Text-to-Speech/Seconds: For example, setting it to 5 means pausing for 5 seconds after completing text-to-speech for one subtitle before proceeding with the next one. It is recommended to set this value to 5 or higher to reduce the error rate by extending the request interval.
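Together, the two settings above amount to serializing requests and spacing them out. A minimal sketch of that pacing loop — using a hypothetical `synthesize()` callable standing in for the real edge-tts request, not pyvideotrans's actual code — might look like:

```python
import time

def tts_with_pacing(subtitles, synthesize, pause_s=5, retries=3):
    """Synthesize one subtitle at a time (concurrency = 1), pausing
    pause_s seconds between requests and retrying on transient errors
    such as HTTP 403."""
    results = []
    for text in subtitles:
        for attempt in range(retries):
            try:
                results.append(synthesize(text))
                break
            except Exception:
                if attempt == retries - 1:
                    raise                 # give up after the last retry
                time.sleep(pause_s)       # back off before retrying
        time.sleep(pause_s)               # pause before the next subtitle
    return results
```

The trade-off is speed for reliability: with a 5-second pause, 100 subtitles take over 8 extra minutes, but each request arrives at an interval the API is far less likely to reject.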

pyvideotrans is open-source, free software for video translation, speech transcription, text-to-speech, and subtitle translation. Source code: https://github.com/jianchang512/pyvideotrans Documentation: https://pvt9.com The software is free of charge, generates no revenue, and is maintained out of personal interest. If it is useful to you, donations are welcome: https://pvt9.com/about