Whisper's Sentence Segmentation Not Good Enough? Use AI LLMs and Structured Data to Create Perfect Subtitles
OpenAI's Whisper model is undoubtedly revolutionary in the field of speech recognition, converting audio to text with remarkable accuracy. However, for long videos or complex dialogues, its automatic sentence segmentation and punctuation features can sometimes be unsatisfactory, often generating large blocks of text that are difficult to read.
This article provides you with the ultimate solution: combining Whisper's word-level timestamps with the powerful understanding capabilities of Large Language Models (LLMs) to create a fully automated subtitle processing pipeline that can intelligently segment sentences, optimize text, and output structured data.
I will detail the entire process from recognition and data preparation to interacting with AI, and focus on analyzing the key issues encountered in practice and their solutions.
Step 1: Get the "Raw Materials" from Whisper - Word-Level Timestamps
To allow the LLM to accurately assign start and end times to new sentences, I must first obtain the time information for each word or term from Whisper. This requires enabling a specific parameter.
When using Whisper for recognition, be sure to set the `word_timestamps` parameter to `True`. Using the `openai-whisper` library in Python as an example:
```python
import whisper

model = whisper.load_model("base")
# Enable the word_timestamps option
result = model.transcribe("audio.mp3", word_timestamps=True)
```
`result` will contain a `segments` list, and each segment will have a `words` list. The data I need is here. Next, I will assemble this data into a clean JSON list designed specifically for LLMs:
```python
word_level_timestamps = []
for segment in result['segments']:
    for word_info in segment['words']:
        word_level_timestamps.append({
            'word': word_info['word'],
            'start': word_info['start'],
            'end': word_info['end']
        })

# The final data structure:
# [
#     {"word": "Five", "start": 1.95, "end": 2.17},
#     {"word": "Old", "start": 2.17, "end": 2.33},
#     ...
# ]
```
This list is the "raw material" I feed into the LLM.
Step 2: Intelligent Chunking - Avoid Token Limits
The list of words and phrases transcribed from an hour-long video can be very large, and sending it to the LLM in a single request may exceed its context window (token limit). Therefore, it must be processed in chunks.
A simple and effective method is to set a threshold, such as 500 words per chunk.
```python
def create_chunks(data, chunk_size=500):
    chunks = []
    for i in range(0, len(data), chunk_size):
        chunks.append(data[i:i + chunk_size])
    return chunks

word_chunks = create_chunks(word_level_timestamps, 500)
```
Advanced Tip: To avoid abruptly cutting in the middle of a sentence, a better chunking strategy is to look for the largest gap between words (the time difference between one word's `end` and the next word's `start`) near the `chunk_size` mark. This improves the contextual integrity of each chunk processed by the LLM.
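Here is a minimal sketch of that gap-aware strategy. The function name `create_chunks_by_gap` and the `search_window` of 50 words are illustrative choices of mine, not part of any library:

```python
def create_chunks_by_gap(data, chunk_size=500, search_window=50):
    """Chunk the word list, cutting at the largest inter-word pause
    found near each nominal chunk boundary."""
    chunks = []
    start = 0
    while start < len(data):
        end = start + chunk_size
        if end >= len(data):
            chunks.append(data[start:])
            break
        # Search +/- search_window words around the boundary for the
        # largest silence between a word's 'end' and the next 'start'.
        lo = max(start + 1, end - search_window)
        hi = min(len(data) - 1, end + search_window)
        best_cut, best_gap = end, -1.0
        for i in range(lo, hi + 1):
            gap = data[i]['start'] - data[i - 1]['end']
            if gap > best_gap:
                best_gap, best_cut = gap, i
        chunks.append(data[start:best_cut])
        start = best_cut
    return chunks

word_chunks = create_chunks_by_gap(word_level_timestamps)
```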
Step 3: Design the "Soul" - Write High-Quality LLM Prompts
The prompt is the soul of the entire process, and it directly determines the quality and stability of the output. An excellent prompt should include the following elements:
- Clear Role and Objective: Clearly tell the LLM its identity (such as "AI Subtitle Processing Engine") and its sole task.
- Detailed Processing Flow: Describe step-by-step what it needs to do, including language identification, intelligent segmentation, text correction, adding punctuation, etc.
- Extremely Strict Output Format Definition: Use tables, code blocks, etc. to precisely define the output JSON structure, key names, and value types, and emphasize what is "required" and "prohibited."
- Provide Examples: Give 1-2 complete examples including input and expected output. This greatly helps the model understand the task, especially when dealing with special situations (such as correcting typos or removing filler words).
- Built-in Final Checklist: Have the model perform a self-check at the end of the prompt; this final nudge noticeably improves adherence to the output format.
The prompt I finally arrived at follows all of the above principles. (See the full prompt in the appendix at the bottom.)
Step 4: Avoid "Pitfalls" - Common Problems and Solutions in Structured Calls
This is the stage where errors are most common in practice.
Pitfall 1: Mixing Instructions and Data
Problem Description: Beginners often concatenate lengthy prompt instructions and massive JSON data into a huge string and send it as a single message to the LLM.
Symptom: The LLM returns an error, complaining that "the input format does not meet the requirements" because it sees a complex text mixed with natural language and JSON, rather than the pure JSON data it was told to process.
{ "error": "The input provided does not conform to the expected format for processing. Please ensure the input is a valid JSON list of dictionaries, each containing \'word\', \'start\', and \'end\' keys."}'
Solution: Strictly separate instructions from data. Use the OpenAI API's `messages` structure: put your prompt into a message with `role: 'system'`, and put the pure JSON data string to be processed into a message with `role: 'user'`.
```python
messages = [
    {"role": "system", "content": "Your complete prompt..."},
    {"role": "user", "content": 'Pure JSON data string...'}  # e.g., json.dumps(chunk)
]
```
Pitfall 2: `json_object` Mode Conflicts with Prompt Instructions
Problem Description: To ensure the model returns valid JSON 100% of the time, I use the `response_format={"type": "json_object"}` parameter. However, this parameter forces the model to return a JSON object (enclosed in `{}`). If, in the prompt, you ask the model to directly return a JSON list (enclosed in `[]`), the two instructions conflict.
```python
response = model.chat.completions.create(
    model=config.params['chatgpt_model'],
    timeout=7200,
    max_tokens=max(int(config.params.get('chatgpt_max_token')) if config.params.get('chatgpt_max_token') else 4096, 4096),
    messages=messages,
    response_format={"type": "json_object"}
)
```
Incorrect Prompt:
```
## Output **json** format result (critical and must be followed)
You **must** return the result as a valid json list. Each element in the output list **must and can only** contain the following three keys:
```
Symptom: Even if the instructions and data are separated, the LLM may still report an error because it cannot simultaneously meet the contradictory requirements of "returning an object" and "returning a list."
Solution: Keep the Prompt instructions consistent with the API's constraints. Modify your prompt to require the model to return a JSON object wrapping the subtitle list.
- Incorrect Approach: Require direct output of `[{...}, {...}]`
- Correct Approach: Require output of `{"subtitles": [{...}, {...}]}`
In this way, the API's requirement (returning an object) and the prompt's instruction (returning an object containing the `subtitles` key) are perfectly unified. Accordingly, when parsing the result in the code, you also need an extra step to extract `result_object['subtitles']`.
Step 5: Integration and Wrap-Up - Other Considerations
- Complete Process: In the code, iterate through all the chunks, call the LLM on each chunk, and then concatenate the subtitle lists returned for each chunk to form the final complete subtitle file (see the sketch after this list).
- Error Handling and Retries: Network requests may fail, and the LLM may occasionally return non-standard JSON. Wrapping the API call in a `try-except` block and adding a retry mechanism (such as the `tenacity` library) is key to ensuring program stability.
- Cost and Model Selection: Models like `GPT-4o` or `deepseek-chat` perform better at following complex instructions and formatting output.
- Final Proofreading: Although the LLM can complete 99% of the work, after splicing all the results you can write simple scripts for a final check, for example: check whether any subtitle is longer than 6 seconds, or whether the start and end times of two subtitles overlap.
Summary
By combining Whisper's precise recognition capabilities with the LLM's deep understanding and generation capabilities, I can build a highly automated, production-level subtitle optimization pipeline. The key to success lies in:
- High-Quality Data Input: Obtain accurate word-level timestamps from Whisper.
- Smart Engineering Handling: Avoid API limitations through chunking.
- Precise and Unambiguous Instructions: Write a watertight system prompt.
- Deep Understanding of API Features: Avoid common pitfalls of `json_object` and similar modes.
Appendix: Final Version of System Prompt
# Role and Ultimate Goal
You are a top-tier AI subtitle processing engine. Your **sole objective** is to convert the **word-level** timestamp data (containing the `'word'` key) in the user input (user message) into a **sentence-level**, intelligently segmented, and text-optimized subtitle list, and return it in a **JSON object** format containing the subtitle list.
---
## Core Processing Flow
1. **Receive Input**: You will receive a json format list as user input. Each element in the list contains `'word'`, `'start'`, `'end'`.
2. **Identify Language**: Automatically determine the main language of the input text (such as Chinese, English, Japanese, Spanish, etc.) and call the corresponding language knowledge base. **Each task processes only one language**.
3. **Intelligent Segmentation and Merging**:
* **Principle**: Sentence segmentation is based on the highest principles of **semantic coherence and grammatical naturalness**.
* **Duration**: The ideal duration of each subtitle is 1-3 seconds, and **absolutely cannot exceed 6 seconds**.
* **Merging**: Merge multiple word/term dictionaries belonging to the same sentence into one.
4. **Text Correction and Enhancement**:
* During the merging of text, perform deep proofreading and optimization of the **entire sentence**.
* **Correction**: Automatically correct spelling errors, grammatical errors, and common wording errors in specific languages.
* **Optimization**: Remove unnecessary filler words and adjust word order to make the expression more fluent and authentic, but never change the original meaning.
* **Punctuation**: Add or correct punctuation marks intelligently at sentence breaks and within sentences according to the norms of the identified language.
5. **Generate Output**: Return the results in the **strictly defined output format** below.
---
## Output json format result (critical and must be followed)
You **must** return the result in a legal **JSON object** format. The object **must** contain a key named `'subtitles'`, whose value is a list of subtitles. Each element in the list **must and can only** contain the following three keys:
| Output Key | Type | Description |
| :--- | :--- | :--- |
| `'start'` | `float` | **Must exist**. Taken from the `start` time of the **first word/term** of the sentence. |
| `'end'` | `float` | **Must exist**. Taken from the `end` time of the **last word/term** of the sentence. |
| `'text'` | `str` | **Must exist**. The **complete subtitle text** after merging, correction, optimization, and punctuation. **(This is the most important key; it must never be named `'word'` or anything else.)** |
**Strictly prohibited**: The `'word'` key **should not** appear in the output dictionary. The input `'word'` content is processed and stored in the `'text'` key.
---
## Example: Demonstrate Core Processing Principles (Applicable to All Languages)
**Important Note**: The following examples are intended to clarify the **processing logic and output format** you need to follow. These principles are universal, and you must apply them to **any language** you identify in the user input, not just the languages in the examples.
### Principle Demonstration 1
#### User Input
```
[
{'word': 'so', 'start': 0.5, 'end': 0.7},
{'word': 'uh', 'start': 0.9, 'end': 1.0},
{'word': 'whatis', 'start': 1.2, 'end': 1.6},
{'word': 'your', 'start': 1.7, 'end': 1.9},
{'word': 'plan', 'start': 2.0, 'end': 2.4}
]
```
#### Your JSON Output
```json
{
"subtitles": [
{
"start": 0.5,
"end": 2.4,
"text": "So, what is your plan?"
}
]
}
```
### Principle Demonstration 2
#### User Input
```
[
{'word': 'This', 'start': 2.1, 'end': 2.2},
{'word': 'is', 'start': 2.3, 'end': 2.6},
{'word': 'air', 'start': 2.8, 'end': 2.9},
{'word': 'port', 'start': 3.0, 'end': 3.5},
{'word': '?', 'start': 3.5, 'end': 3.6},
{'word': 'It', 'start': 4.2, 'end': 4.5},
{'word': 'is', 'start': 4.5, 'end': 4.6},
{'word': 'late', 'start': 4.6, 'end': 5.0}
]
```
#### Your JSON Output
```json
{
"subtitles": [
{
"start": 2.1,
"end": 3.6,
"text": "Is this the airport?"
},
{
"start": 4.2,
"end": 5.0,
"text": "It is late."
}
]
}
```
---
## Final Check Before Execution
Before you generate the final answer, please perform a final internal check to ensure that your output **100%** complies with the following rules:
1. **Is the final output a valid json object `{...}`?** -> (Yes/No)
2. **Does the JSON object contain a key named `'subtitles'`?** -> (Yes/No)
3. **Is the value of `'subtitles'` a list `[...]`, and is each element in the list a valid JSON object `{...}`?** -> (Yes/No)
4. **Does each dictionary in the list contain only the three keys `'start'`, `'end'`, and `'text'`?** -> (Yes/No)
5. **The most critical point: Is the key name `'text'`, not `'word'`?** -> (Yes/No)
**Only generate your final output when the answer to all of the above questions is "Yes".**