Re-segmenting Speech Recognition Results Using Large Language Models (LLMs)
To improve the naturalness and accuracy of subtitle segmentation, pyVideoTrans, starting from version v3.69, has introduced an intelligent re-segmentation feature based on Large Language Models (LLMs), aimed at optimizing your subtitle processing experience.
Background: Limitations of Traditional Re-segmentation
In v3.68 and earlier versions, we provided a "Re-segmentation" feature. After the initial speech recognition by faster-whisper, openai-whisper, or deepgram was completed, this feature would call an Alibaba model to perform a secondary split and segmentation of the generated subtitles.
The original "Re-segmentation" feature also had some shortcomings:
- Inconvenient First Use: Required downloading three large model files online from ModelScope.
- Suboptimal Efficiency and Effect: Processing speed was slow, and the segmentation effect was sometimes still not ideal.
Although models like faster-whisper can output segmentation results themselves, in practice the resulting segments were often too long, too short, or broken at unnatural points.
Innovation: v3.69+ Introduces LLM Intelligent Re-segmentation
To address the above issues, starting from version v3.69, we have fully upgraded the "Re-segmentation" feature to LLM Re-segmentation.
How it Works: When you perform speech recognition using faster-whisper (Local), openai-whisper (Local), or Deepgram.com, have enabled the new LLM Re-segmentation feature, and have correctly configured the model, API Key (SK), and other information in Translation Settings - OpenAI API & Compatible AI:
- pyVideoTrans will send the recognized characters/words, together with their word-level timestamps, in batches of 3000 to the LLM you've configured for re-segmentation (see the sketch after this list).
- The LLM will intelligently segment the text based on the prompt instructions in the /videotrans/recharge-llm.txt file.
- After segmentation is complete, the results will be re-organized into standard SRT subtitle format for subsequent translation or direct use.
- If LLM re-segmentation fails, the software will automatically fall back to the segmentation results provided by faster-whisper/openai-whisper/deepgram itself.
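The sketch below illustrates this flow in simplified form. It is not the actual pyVideoTrans implementation: the payload format, the prompt handling, the helper parameters, and the endpoint/model values are assumptions chosen for illustration, and it presumes an OpenAI-compatible API configured as described above.

```python
# Simplified sketch of the LLM re-segmentation flow. This is not the actual
# pyVideoTrans code: the payload format, prompt handling, and endpoint/model
# values are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com/v1")  # any OpenAI-compatible endpoint

BATCH_SIZE = 3000          # "LLM Re-segmentation Characters/Words per Batch"
MAX_OUTPUT_TOKENS = 4096   # "Maximum Output Tokens"

def to_srt(segments):
    """segments: list of (start_sec, end_sec, text) tuples -> SRT-formatted string."""
    def ts(t):
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    return "\n".join(f"{i}\n{ts(a)} --> {ts(b)}\n{text}\n"
                     for i, (a, b, text) in enumerate(segments, 1))

def resegment(words, prompt, fallback_segments):
    """words: [{'word': 'hello', 'start': 1.23, 'end': 1.51}, ...] with word-level
    timestamps from faster-whisper / openai-whisper / deepgram.
    fallback_segments: the recognizer's own (start, end, text) segments."""
    segments = []
    for i in range(0, len(words), BATCH_SIZE):
        batch = words[i:i + BATCH_SIZE]
        payload = "\n".join(f"{w['start']:.3f}\t{w['end']:.3f}\t{w['word']}" for w in batch)
        try:
            resp = client.chat.completions.create(
                model="deepseek-chat",        # whichever model you configured
                max_tokens=MAX_OUTPUT_TOKENS,
                messages=[
                    {"role": "system", "content": prompt},  # e.g. the text of videotrans/recharge-llm.txt
                    {"role": "user", "content": payload},
                ],
            )
            # Assume the model replies with one segment per line: "start<TAB>end<TAB>sentence"
            for line in resp.choices[0].message.content.strip().splitlines():
                start, end, text = line.split("\t", 2)
                segments.append((float(start), float(end), text))
        except Exception:
            # On any failure, fall back to the recognizer's own segmentation
            segments = fallback_segments
            break
    return to_srt(segments)
```

In the real application all of this happens internally; the sketch only shows why the batch size and the model's output limit interact, which matters for the settings discussed below.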
Prerequisites for Enabling "LLM Re-segmentation"
To successfully enable and use this feature, please ensure the following conditions are met:
- Check to Enable: In the software interface, select the LLM Re-segmentation option.
- Specify the Speech Recognition Model: The speech recognition engine must be one of the following three:
  - faster-whisper (Local)
  - openai-whisper (Local)
  - Deepgram.com
- Select the Audio Segmentation Mode: This must be set to Overall Recognition.
- Configure the LLM API: In Menu -> Translation Settings -> OpenAI API & Compatible AI, correctly fill in your API Key (SK), select the model name, and set other relevant parameters (an example configuration follows this list).
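For illustration only, here is roughly what those settings might look like if you pointed the software at DeepSeek's OpenAI-compatible endpoint (just one possible provider; the exact field labels in your version of the dialog may differ slightly):

```
API URL (Base URL):     https://api.deepseek.com/v1
API Key (SK):           sk-xxxxxxxxxxxxxxxx   (your own key)
Model name:             deepseek-chat
Maximum Output Tokens:  4096   (see the guidance below)
```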
Adjusting and Optimizing LLM Re-segmentation Effect
- Adjust the value of Tools -> Options -> Advanced Options -> LLM Re-segmentation Characters/Words per Batch, which defaults to sending a re-segmentation request every 3000 characters or words. A larger value generally yields better segmentation, but it will cause an error if the output exceeds the maximum output tokens allowed by the model you are using. If you increase this value, you also need to correspondingly increase the Maximum Output Tokens described in the next point.
- Based on the maximum output tokens allowed by the LLM model you are using, you can modify Menu -> Translation Settings -> OpenAI API & Compatible AI -> Maximum Output Tokens, which defaults to 4096. A larger value here allows a larger LLM Re-segmentation Characters/Words per Batch value.
- You can adjust and optimize the prompt in the videotrans/recharge-llm.txt file within the software directory to achieve better results.
In summary: the larger the Maximum Output Tokens, the more characters or words can be sent per LLM Re-segmentation Characters/Words per Batch, and the better the segmentation. However, Maximum Output Tokens must not exceed the value supported by the model itself, otherwise it will inevitably cause an error. Each character or word in a batch may produce multiple tokens in the segmented output, so increase the batch size cautiously and gradually to avoid errors caused by the output exceeding the Maximum Output Tokens.
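If you want a rough sanity check before raising the batch size, you can estimate it as below. The tokens-per-word ratio and safety margin are assumptions (the real cost depends on the language, the tokenizer, and the prompt), so treat the result only as a starting point:

```python
# Rough estimate of the largest safe "Characters/Words per Batch" value for a
# given Maximum Output Tokens. Both constants are assumptions, not measured values.
TOKENS_PER_WORD = 1.3   # assumed average output tokens per re-emitted word/character
SAFETY_MARGIN = 1.1     # headroom for punctuation, timestamps, and formatting

def max_safe_batch(max_output_tokens: int) -> int:
    """Largest batch size whose segmented output should still fit within the limit."""
    return int(max_output_tokens / (TOKENS_PER_WORD * SAFETY_MARGIN))

print(max_safe_batch(4096))   # roughly 2800 with the default 4096 output tokens
print(max_safe_batch(8192))   # roughly 5700 if your model allows 8k output tokens
```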
How to Query the Maximum Output Tokens for Different Models?
Note that it must be the Maximum Output Tokens (max output tokens), not the Context Tokens (context length). The context length is usually very large, such as 128k, 256k, or 1M, while the maximum output tokens are much smaller, generally 8k (8192), 32k (32768), and so on.
1. OpenAI Models
You can view the details of each model in the official OpenAI model documentation: https://platform.openai.com/docs/models
- Click the name of the model you plan to use to enter the details page.
- Look for descriptions related to "Max output tokens".
2. Other OpenAI Compatible Models
For other large model providers compatible with the OpenAI API, their Maximum Output Tokens (not context length) are usually listed in their official API documentation or model descriptions.
- DeepSeek (e.g., deepseek-chat or deepseek-reasoner): Refer to their pricing or model description page, such as: https://platform.deepseek.com/api-docs/pricing
- SiliconFlow: Look it up in their model documentation: https://docs.siliconflow.cn/cn/faqs/misc#2-%E5%85%B3%E4%BA%8Emax-tokens%E8%AF%B4%E6%98%8E
- Alibaba Cloud Bailian (百炼): Find the parameter limits for specific models in the Model Plaza or the model documentation: https://help.aliyun.com/zh/model-studio/models
Other providers are similar; just make sure you are looking at the maximum output length (maximum output tokens), not the context length.
Important Note: Please be sure to find the maximum output tokens for the model you are using, not the context tokens, and fill it in correctly in the pyVideoTrans settings.