Whisper's Sentence Segmentation Not Good Enough? Use AI LLMs and Structured Data to Create Perfect Subtitles

OpenAI's Whisper model is undoubtedly revolutionary in the field of speech recognition, converting audio to text with remarkable accuracy. However, for long videos or complex dialogues, its automatic sentence segmentation and punctuation features can sometimes be unsatisfactory, often generating large blocks of text that are difficult to read.

This article provides you with the ultimate solution: combining Whisper's word-level timestamps with the powerful understanding capabilities of Large Language Models (LLMs) to create a fully automated subtitle processing pipeline that can intelligently segment sentences, optimize text, and output structured data.

I will detail the entire process from recognition and data preparation to interacting with AI, and focus on analyzing the key issues encountered in practice and their solutions.

Step 1: Get the "Raw Materials" from Whisper - Word-Level Timestamps

To allow the LLM to accurately assign start and end times to new sentences, I must first obtain the time information for each word or term from Whisper. This requires enabling a specific parameter.

When using Whisper for recognition, be sure to set the word_timestamps parameter to True. Using the openai-whisper library in Python as an example:

```python
import whisper

model = whisper.load_model("base")
# Enable the word_timestamps option
result = model.transcribe("audio.mp3", word_timestamps=True)
```

The returned result contains a segments list, and each segment contains a words list — that is exactly the data I need. Next, I assemble it into a clean JSON list designed specifically for the LLM.

```python
word_level_timestamps = []
for segment in result['segments']:
    for word_info in segment['words']:
        word_level_timestamps.append({
            'word': word_info['word'],
            'start': word_info['start'],
            'end': word_info['end']
        })

# The final data structure:
# [
#   {"word": "Five", "start": 1.95, "end": 2.17},
#   {"word": "Old", "start": 2.17, "end": 2.33},
#   ...
# ]
```

This list is the "raw material" I feed into the LLM.

Step 2: Intelligent Chunking - Avoid Token Limits

The list of words transcribed from an hour-long video can be very large, and sending it to the LLM in a single request may exceed its token limit (context window). Therefore, it must be processed in chunks.

A simple and effective method is to set a threshold, such as 500 words per chunk.

```python
def create_chunks(data, chunk_size=500):
    """Split the word list into fixed-size chunks."""
    chunks = []
    for i in range(0, len(data), chunk_size):
        chunks.append(data[i:i + chunk_size])
    return chunks

word_chunks = create_chunks(word_level_timestamps, 500)
```

Advanced Tip: To avoid abruptly cutting in the middle of a sentence, a better chunking strategy is to look for the largest gap between words (the time difference between end and the next start) near the chunk_size mark. This improves the contextual integrity of each chunk processed by the LLM.
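That gap-aware strategy can be sketched as follows; the `search_window` of 50 words is an illustrative default, not a tuned value:

```python
def create_chunks_by_gap(data, chunk_size=500, search_window=50):
    """Split word data into chunks of roughly chunk_size words,
    cutting at the largest silence gap near each chunk boundary."""
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + chunk_size, len(data))
        if end < len(data):
            # Look for the largest pause between consecutive words
            # within the last `search_window` words of the chunk.
            lo = max(start + 1, end - search_window)
            best_cut, best_gap = end, -1.0
            for i in range(lo, end):
                gap = data[i]['start'] - data[i - 1]['end']
                if gap > best_gap:
                    best_gap, best_cut = gap, i
            end = best_cut
        chunks.append(data[start:end])
        start = end
    return chunks
```

Because each cut lands on the longest nearby pause, sentences are far less likely to be split across chunks, and the chunks still concatenate back to the original word list.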

Step 3: Design the "Soul" - Write High-Quality LLM Prompts

The prompt is the soul of the entire process, and it directly determines the quality and stability of the output. An excellent prompt should include the following elements:

  1. Clear Role and Objective: Clearly tell the LLM its identity (such as "AI Subtitle Processing Engine") and its sole task.
  2. Detailed Processing Flow: Describe step-by-step what it needs to do, including language identification, intelligent segmentation, text correction, adding punctuation, etc.
  3. Extremely Strict Output Format Definition: Use tables, code blocks, etc. to precisely define the output JSON structure, key names, and value types, and emphasize what is "required" and "prohibited."
  4. Provide Examples: Give 1-2 complete examples including input and expected output. This greatly helps the model understand the task, especially when dealing with special situations (such as correcting typos or removing filler words).
  5. Built-in Final Checklist: Have the model perform a self-check at the end of the Prompt, which is a powerful psychological suggestion that effectively improves adherence to the output format.

The prompt I finally settled on is a complete example of following all of the above principles (see the full prompt in the appendix at the bottom).

Step 4: Avoid "Pitfalls" - Common Problems and Solutions in Structured Calls

This is the link most prone to errors in practice.

Pitfall 1: Mixing Instructions and Data

Problem Description: Beginners often concatenate lengthy prompt instructions and massive JSON data into a huge string and send it as a single message to the LLM.

Symptom: The LLM returns an error, complaining that "the input format does not meet the requirements" because it sees a complex text mixed with natural language and JSON, rather than the pure JSON data it was told to process.

```json
{
  "error": "The input provided does not conform to the expected format for processing. Please ensure the input is a valid JSON list of dictionaries, each containing 'word', 'start', and 'end' keys."
}
```

Solution: Strictly separate instructions from data. Use the OpenAI API's messages structure to put your prompt into a message with role: 'system', and put the pure JSON data string to be processed into a message with role: 'user'.

```python
import json

messages = [
    {"role": "system", "content": "Your complete prompt..."},
    # The user message carries only the pure JSON data string
    {"role": "user", "content": json.dumps(chunk, ensure_ascii=False)}
]
```

Pitfall 2: json_object Mode Conflicts with Prompt Instructions

Problem Description: To ensure a 100% return of valid JSON, I use the response_format={"type": "json_object"} parameter. However, this parameter forces the model to return a JSON object (enclosed in {}). If, in the prompt, you ask the model to directly return a JSON list (enclosed in []), there will be an instruction conflict.

```python
response = model.chat.completions.create(
    model=config.params['chatgpt_model'],
    timeout=7200,
    max_tokens=max(int(config.params.get('chatgpt_max_token') or 4096), 4096),
    messages=messages,
    response_format={"type": "json_object"}
)
```

Incorrect prompt excerpt:

```
## Output **json** format result (critical and must be followed)

You **must** return the result as a valid json list. Each element in the output list **must and can only** contain the following three keys:
```

Symptom: Even if the instructions and data are separated, the LLM may still report an error because it cannot simultaneously meet the contradictory requirements of "returning an object" and "returning a list."

Solution: Keep the Prompt instructions consistent with the API's constraints. Modify your prompt to require the model to return a JSON object wrapping the subtitle list.

  • Incorrect Approach: Require direct output of [{...}, {...}]
  • Correct Approach: Require output of {"subtitles": [{...}, {...}]}

In this way, the API's requirement (returning an object) and the Prompt's instruction (returning an object containing the subtitles key) are perfectly unified. Accordingly, when parsing the result in the code, you also need an extra step to extract: result_object['subtitles'].
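That extraction step can be wrapped in a small helper; this is a sketch, and the function name `parse_subtitle_response` is illustrative:

```python
import json

def parse_subtitle_response(raw_content):
    """Extract the subtitle list from the JSON object the model returns.

    Raises ValueError when the 'subtitles' key is missing, so the
    caller can treat a malformed reply as a retryable failure.
    """
    result_object = json.loads(raw_content)
    if 'subtitles' not in result_object:
        raise ValueError("LLM response has no 'subtitles' key")
    return result_object['subtitles']

# Usage, assuming `response` comes from chat.completions.create:
# subtitles = parse_subtitle_response(response.choices[0].message.content)
```

Raising instead of silently returning an empty list makes the occasional malformed reply visible to the retry logic discussed in the next step.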

Step 5: Integration and Wrap-Up - Other Considerations

  1. Complete Process: In the code, you need to iterate through all the chunks, call the LLM to process each chunk, and then concatenate the subtitle lists returned by each chunk to form the final complete subtitle file.

  2. Error Handling and Retries: Network requests may fail, and the LLM may occasionally return non-standard JSON. Wrapping a try-except block outside the API call and adding a retry mechanism (such as using the tenacity library) is key to ensuring program stability.

  3. Cost and Model Selection: Models like GPT-4o or deepseek-chat perform better in following complex instructions and formatting output.

  4. Final Proofreading: Although the LLM can complete 99% of the work, after splicing all the results, you can write simple scripts for a final check, for example: check if any subtitles are longer than 6 seconds, or if the start and end times of two subtitles overlap.
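The final proofreading script from point 4 only takes a few lines; the 6-second threshold mirrors the duration rule in the prompt, and the function name is illustrative:

```python
def check_subtitles(subtitles, max_duration=6.0):
    """Flag subtitles that break the prompt's rules: entries longer
    than max_duration seconds, or overlapping the previous entry."""
    problems = []
    for i, sub in enumerate(subtitles):
        if sub['end'] - sub['start'] > max_duration:
            problems.append(f"#{i}: duration exceeds {max_duration}s")
        if i > 0 and sub['start'] < subtitles[i - 1]['end']:
            problems.append(f"#{i}: overlaps the previous subtitle")
    return problems
```

Run it over the concatenated result of all chunks; an empty list means the output passed both checks.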

Summary

By combining Whisper's precise recognition capabilities with the LLM's deep understanding and generation capabilities, I can build a highly automated, production-level subtitle optimization pipeline. The key to success lies in:

  • High-Quality Data Input: Obtain accurate word-level timestamps from Whisper.
  • Smart Engineering Handling: Avoid API limitations through chunking.
  • Precise and Unambiguous Instructions: Write a watertight system prompt.
  • Deep Understanding of API Features: Avoid common pitfalls of json_object and other modes.

Appendix: Final Version of System Prompt

# Role and Ultimate Goal

You are a top-tier AI subtitle processing engine. Your **sole objective** is to convert the **word-level** timestamp data (containing the `'word'` key) in the user input (user message) into a **sentence-level**, intelligently segmented, and text-optimized subtitle list, and return it in a **JSON object** format containing the subtitle list.

---

## Core Processing Flow

1.  **Receive Input**: You will receive a json format list as user input. Each element in the list contains `'word'`, `'start'`, `'end'`.
2.  **Identify Language**: Automatically determine the main language of the input text (such as Chinese, English, Japanese, Spanish, etc.) and call the corresponding language knowledge base. **Each task processes only one language**.
3.  **Intelligent Segmentation and Merging**:
    *   **Principle**: Sentence segmentation is based on the highest principles of **semantic coherence and grammatical naturalness**.
    *   **Duration**: The ideal duration of each subtitle is 1-3 seconds, and **absolutely cannot exceed 6 seconds**.
    *   **Merging**: Merge multiple word/term dictionaries belonging to the same sentence into one.
4.  **Text Correction and Enhancement**:
    *   During the merging of text, perform deep proofreading and optimization of the **entire sentence**.
    *   **Correction**: Automatically correct spelling errors, grammatical errors, and common wording errors in specific languages.
    *   **Optimization**: Remove unnecessary filler words and adjust word order to make the expression more fluent and authentic, but never change the original meaning.
    *   **Punctuation**: Add or correct punctuation marks intelligently at sentence breaks and within sentences according to the norms of the identified language.
5.  **Generate Output**: Return the results in the **strictly defined output format** below.

---

## Output json format result (critical and must be followed)

You **must** return the result in a legal **JSON object** format. The object **must** contain a key named `'subtitles'`, whose value is a list of subtitles. Each element in the list **must and can only** contain the following three keys:

| Output Key (Key) | Type (Type)  | Description                                                                                                              |
| :----------------- | :----------- | :----------------------------------------------------------------------------------------------------------------------- |
| `'start'`           | `float`      | **Must exist**.  Taken from the `start` time of the **first word/term** of the sentence.                                    |
| `'end'`             | `float`      | **Must exist**.  Taken from the `end` time of the **last word/term** of the sentence.                                     |
| `'text'`          | `str`        | **Must exist**. **Complete subtitle text** after merging, correcting, optimizing, and adding punctuation. **【This is the most important key, and 'word' or any other name should never be used.】** |

**Strictly prohibited**: The `'word'` key **should not** appear in the output dictionary. The input `'word'` content is processed and stored in the `'text'` key.

---

## Example: Demonstrate Core Processing Principles (Applicable to All Languages)

**Important Note**: The following examples are intended to clarify the **processing logic and output format** you need to follow. These principles are universal, and you must apply them to **any language** you identify in the user input, not just the languages in the examples.

### Principle Demonstration 1
#### User Input
```
[
    {'word': 'so', 'start': 0.5, 'end': 0.7},
    {'word': 'uh', 'start': 0.9, 'end': 1.0},
    {'word': 'whatis', 'start': 1.2, 'end': 1.6},
    {'word': 'your', 'start': 1.7, 'end': 1.9},
    {'word': 'plan', 'start': 2.0, 'end': 2.4}
]
```
#### Your JSON Output
```json
{
    "subtitles": [
        {
            "start": 0.5,
            "end": 2.4,
            "text": "So, what is your plan?"
        }
    ]
}
```

### Principle Demonstration 2
#### User Input
```
[
    {'word': 'This', 'start': 2.1, 'end': 2.2},
    {'word': 'is', 'start': 2.3, 'end': 2.6},
    {'word': 'air', 'start': 2.8, 'end': 2.9},
    {'word': 'port', 'start': 3.0, 'end': 3.5},
    {'word': '?', 'start': 3.5, 'end': 3.6},
    {'word': 'It', 'start': 4.2, 'end': 4.5},
    {'word': 'is', 'start': 4.5, 'end': 4.6},
    {'word': 'late', 'start': 4.6, 'end': 5.0}
]
```
#### Your JSON Output
```json
{
    "subtitles": [
        {
            "start": 2.1,
            "end": 3.6,
            "text": "Is this the airport?"
        },
        {
            "start": 4.2,
            "end": 5.0,
            "text": "It is late."
        }
    ]
}
```

---

## Final Check Before Execution

Before you generate the final answer, please perform a final internal check to ensure that your output **100%** complies with the following rules:

1.  **Is the final output a valid json object `{...}`?** -> (Yes/No)
2.  **Does the JSON object contain a key named `'subtitles'`?** -> (Yes/No)
3.  **Is the value of `'subtitles'` a list `[...]`, and is each element in the list a valid JSON object `{...}`?** -> (Yes/No)
4.  **Does each dictionary in the list contain only the three keys `'start'`, `'end'`, and `'text'`?** -> (Yes/No)
5.  **The most critical point: Is the key name `'text'`, not `'word'`?** -> (Yes/No)

**Only generate your final output when the answer to all of the above questions is "Yes".**