
Re-segmenting Speech Recognition Results Using Large Language Models (LLMs)

To improve the naturalness and accuracy of subtitle segmentation, pyVideoTrans, starting from version v3.69, has introduced an intelligent re-segmentation feature based on Large Language Models (LLMs), aimed at optimizing your subtitle processing experience.

Background: Limitations of Traditional Re-segmentation

In v3.68 and earlier versions, we provided a "Re-segmentation" feature: after faster-whisper, openai-whisper, or deepgram completed the initial speech recognition, it called an Alibaba model to split and re-segment the generated subtitles.

Traditional Method: Splitting and Segmenting Recognized Subtitles

However, this original "Re-segmentation" feature had some shortcomings:

  1. Inconvenient First Use: It required downloading three large model files from ModelScope.
  2. Suboptimal Efficiency and Results: Processing was slow, and the resulting segmentation was sometimes still not ideal.

Although models like faster-whisper can output their own segmentation, in practice the results often suffer from overly long or overly short lines and unnatural sentence breaks.

Innovation: v3.69+ Introduces LLM Intelligent Re-segmentation

To address the above issues, starting from version v3.69, we have fully upgraded the "Re-segmentation" feature to LLM Re-segmentation.

How it Works: When you perform speech recognition with faster-whisper (Local), openai-whisper (Local), or Deepgram.com, have enabled the new LLM Re-segmentation feature, and have correctly configured the model, API Key (SK), and other settings in Translation Settings - OpenAI API & Compatible AI, the following happens (a rough code sketch of this flow appears after the steps):

  1. pyVideoTrans sends the recognized characters/words, together with their word-level timestamps, to the LLM you have configured, in batches of 3000, for re-segmentation.
  2. The LLM will intelligently segment the text based on the prompt instructions in the /videotrans/recharge-llm.txt file.
  3. After segmentation is complete, the results will be re-organized into standard SRT subtitle format for subsequent translation or direct use.
  4. If LLM re-segmentation fails, the software will automatically fall back to using the segmentation results provided by faster-whisper/openai-whisper/deepgram itself.
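
To make the flow above concrete, here is a minimal sketch of what such a pipeline could look like. This is not the actual pyVideoTrans implementation: the helper names, the word-item shape, and the assumption that the LLM returns one subtitle line per row are illustrative, and the model name and endpoint are placeholders.

```python
# Illustrative sketch of the LLM re-segmentation flow (not the real pyVideoTrans code).
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://your-provider.example/v1")  # OpenAI-compatible endpoint

# Word-level items roughly as produced by faster-whisper / openai-whisper / Deepgram (shape assumed).
all_words = [
    {"word": "hello", "start": 0.00, "end": 0.35},
    {"word": "world", "start": 0.35, "end": 0.80},
]
fallback_segments = ["hello world"]  # the recognizer's own segmentation

def split_into_batches(words, size=3000):
    """Step 1: group the recognized characters/words into batches of `size`."""
    for i in range(0, len(words), size):
        yield words[i:i + size]

def resegment_with_llm(batch, prompt):
    """Step 2: ask the configured LLM to insert natural sentence breaks."""
    text = " ".join(w["word"] for w in batch)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder; use whatever model you configured
        messages=[{"role": "system", "content": prompt},
                  {"role": "user", "content": text}],
        max_tokens=4096,       # the "Maximum Output Tokens" setting
    )
    return resp.choices[0].message.content  # segmented text, one line per subtitle (assumed format)

prompt = open("videotrans/recharge-llm.txt", encoding="utf-8").read()
try:
    segments = [resegment_with_llm(b, prompt) for b in split_into_batches(all_words)]
    # Step 3: map each returned line back onto its word timestamps and write standard SRT blocks.
except Exception:
    segments = fallback_segments  # Step 4: fall back to the recognizer's segmentation
```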

Prerequisites for Enabling "LLM Re-segmentation"

To successfully enable and use this feature, please ensure the following conditions are met:

  1. Check to Enable: In the software interface, select the LLM Re-segmentation option.

  2. Specify Speech Recognition Model: The speech recognition engine must be one of the following three:

    • faster-whisper (Local)
    • openai-whisper (Local)
    • Deepgram.com
  3. Select Audio Segmentation Mode: Needs to be set to Overall Recognition.

  4. Configure LLM API: In Menu -> Translation Settings -> OpenAI API & Compatible AI, correctly fill in your API Key (SK), select the model name, and set the other relevant parameters (see the configuration sketch below).
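
As a quick way to verify these values before running recognition, the fields under OpenAI API & Compatible AI correspond to a standard OpenAI-compatible client call, roughly as in the hedged sketch below; the endpoint URL and model name are placeholders, not recommendations.

```python
# Sanity check for the "OpenAI API & Compatible AI" values (placeholders only).
from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxx",                       # the API Key (SK) field
    base_url="https://api.deepseek.com/v1",  # the API URL field (any OpenAI-compatible endpoint)
)
resp = client.chat.completions.create(
    model="deepseek-chat",                   # the model name you selected
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=16,
)
print(resp.choices[0].message.content)       # any reply means the key, URL, and model are usable
```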

Adjusting and Optimizing LLM Re-segmentation Effect

  1. Adjust the value in Tools -> Options -> Advanced Options -> LLM Re-segmentation Characters/Words per Batch. By default, a re-segmentation request is sent every 3000 characters or words. A larger value generally yields better segmentation, but it will cause an error if the LLM's output exceeds the maximum output tokens allowed by the model in use. If you increase this value, also increase the Maximum Output Tokens described in the next point.

  2. Based on the maximum output tokens allowed by the LLM you are using, you can modify Menu -> Translation Settings -> OpenAI API & Compatible AI -> Maximum Output Tokens, which defaults to 4096 (a value virtually all models support). A larger value here allows a larger LLM Re-segmentation Characters/Words per Batch value.

  3. You can adjust and optimize the prompt in the videotrans/recharge-llm.txt file within the software directory to achieve better results.

In summary: the larger the Maximum Output Tokens, the more characters or words you can allow in LLM Re-segmentation Characters/Words per Batch, and the better the resulting segmentation. However, Maximum Output Tokens must not exceed the value the model itself supports, otherwise requests will fail. Also note that each character or word sent in a batch may produce more than one token in the LLM's output, so increase the batch size cautiously and gradually to avoid errors caused by the output exceeding the Maximum Output Tokens.
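
If you want a rough way to reason about these two settings together, a back-of-the-envelope estimate like the one below can help. The tokens-per-item figure and the formatting overhead are assumptions (CJK characters and non-English words often expand to more than one token), so treat the numbers as ballpark values and verify against your model.

```python
# Rough estimate: how many characters/words fit in one batch for a given output budget?
def safe_batch_size(max_output_tokens: int, tokens_per_item: float = 1.5, overhead: int = 200) -> int:
    """Estimate a batch size that keeps the LLM's reply under max_output_tokens.

    tokens_per_item: assumed average output tokens per recognized character/word
    overhead: assumed tokens spent on line breaks and formatting in the reply
    """
    return int((max_output_tokens - overhead) / tokens_per_item)

print(safe_batch_size(4096))    # ~2597 with the default Maximum Output Tokens
print(safe_batch_size(8192))    # ~5328 on an 8k-output model
print(safe_batch_size(32768))   # ~21712 on a 32k-output model
```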

How to Query the Maximum Output Tokens for Different Models?

Note that the value you need is the Maximum Output Tokens (max output tokens), not the context tokens. The context length is usually very large, such as 128k, 256k, or 1M, while the maximum output tokens are much smaller, generally 8k (8192) or 32k (32768).

1. OpenAI Models

You can view the details of each model in the official OpenAI model documentation: https://platform.openai.com/docs/models

  • Click the name of the model you plan to use to enter the details page.
  • Look for descriptions related to "Max output tokens".

Click on the model you are using to enter the details page

Find the maximum output tokens supported by this model, not the context window

2. Other OpenAI Compatible Models

For other large model providers compatible with the OpenAI API, their Maximum Output Tokens (not context length) are usually listed in their official API documentation or model descriptions.

DeepSeek's maximum output tokens are 8k, i.e., 8192

Other providers are similar; just make sure you are looking at the maximum output length (maximum output tokens), not the context length.

Important Note: Please be sure to find the maximum output tokens for the model you are using, not the context tokens, and fill it in correctly in the pyVideoTrans settings.

Remember to fill it in the correct location