Gemini + VAD Hybrid Architecture: Solving Whisper's Small-Language Problem and Generating Accurate SRT Subtitles
Open-source speech recognition models like Whisper are known for their impressive performance on English. Outside that comfort zone, however, their accuracy drops sharply. For "small languages" without massive amounts of fine-tuning data, the transcription results are often unsatisfactory, which makes creating subtitles for Thai, Vietnamese, Malay, or even some dialects a costly and time-consuming endeavor.
This is where Gemini enters as a game-changer.
Unlike many tools that rely on language-specific models, Google Gemini was trained in a truly global, multimodal, multilingual setting. Its out-of-the-box, high-quality recognition of a wide range of "small languages" is its core competitive advantage: without any additional fine-tuning, we can get results that previously required targeted training.
However, even Gemini, with its powerful "language brain," has a common weakness: it cannot provide the frame-level precision timestamps necessary for generating SRT subtitles.
This article presents a "hybrid architecture" solution that has been repeatedly verified in real-world scenarios:
- `faster-whisper`'s precise voice activity detection (Silero VAD): we use only its best feature, locating the start and end times of human speech with millisecond-level accuracy.
- Gemini's unparalleled language talent: we let it focus on its core task, performing high-quality multilingual transcription and speaker recognition on the short audio clips segmented by VAD.

This workflow achieves the best of both worlds, ultimately generating professional-grade, multilingual SRT subtitle files with accurate timestamps. Whether your audio is in a mainstream language like English or Chinese, or in a language other models struggle with, this solution provides unprecedented convenience and accuracy.
Core Challenge: Why Not Use Gemini Directly?
Gemini's strengths lie in content understanding. It excels at:
- High-quality transcription: High text accuracy, able to understand context.
- Multilingual recognition: Automatically detects the language of the audio.
- Speaker recognition: Identifies the same speaker in multiple audio clips.
But its weakness lies in time precision. Gemini currently cannot give a sufficiently precise answer to the question "when does this word appear," which is critical for generating SRT subtitles. This is where tools built specifically for speech processing, such as `faster-whisper` with its built-in Silero VAD, excel.
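For context, here is the level of timing detail an SRT file demands: each cue needs a millisecond-precise start and end time. The cues below are purely illustrative, not real output:

```
1
00:00:01,250 --> 00:00:03,900
First subtitle line.

2
00:00:04,120 --> 00:00:07,480
Second subtitle line.
```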
Solution: A Hybrid VAD and LLM Architecture
Our solution divides the task in two, letting specialized tools do what they do best:
1. Precise segmentation (`faster-whisper`): We use the Silero VAD voice activity detection built into the `faster-whisper` library. VAD scans the entire audio with millisecond-level accuracy and finds the start and end times of every speech segment. Based on this information, we cut the audio into a series of short `.wav` clips, each with precise timestamps.
2. High-quality transcription (Gemini): We send these small audio clips to Gemini in batches. Since each clip already carries precise time information, we no longer need Gemini to provide timestamps; we only need it to focus on what it does best: transcribing the content and identifying the speakers.

Finally, we correlate the transcription text returned by Gemini with the timestamps provided by `faster-whisper` to assemble a complete SRT file.
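Conceptually, this correlation is just a positional pairing of VAD segments and transcripts. Here is a minimal sketch of that idea, assuming `segments` is a list of `{"start_time": ms, "end_time": ms}` dicts in audio order and `texts` the transcriptions returned in the same order (both names, and `build_srt` itself, are placeholders rather than part of the script below):

```python
def build_srt(segments, texts):
    """Pairs VAD segments with transcripts by position and renders SRT cues."""
    def fmt(ms):
        hours, ms = divmod(ms, 3600000)
        minutes, ms = divmod(ms, 60000)
        seconds, ms = divmod(ms, 1000)
        return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

    cues = []
    for idx, (seg, text) in enumerate(zip(segments, texts), start=1):
        cues.append(f"{idx}\n{fmt(seg['start_time'])} --> {fmt(seg['end_time'])}\n{text}\n")
    return "\n".join(cues)
```

The complete script below does exactly this, plus the audio cutting and the Gemini calls.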
Complete Implementation Code
Below is the complete Python code to implement the above workflow. You can copy it directly and save it as a `test.py` file for testing.
How to Use:

1. Install dependencies:

   ```bash
   pip install faster-whisper pydub google-generativeai
   ```

2. Set the API key. For security reasons, it is recommended to set your Gemini API key as an environment variable.

   - On Linux/macOS:

     ```bash
     export GOOGLE_API_KEY="YOUR_API_KEY"
     ```

   - On Windows:

     ```bash
     set GOOGLE_API_KEY=YOUR_API_KEY
     ```

   - Alternatively, you can modify the `gemini_api_key` variable directly in the code.

3. Run the script:

   ```bash
   python test.py "path/to/your/audio.mp3"
   ```

   Common audio formats such as `.mp3`, `.wav`, `.m4a`, etc. are supported.
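Note that `pydub` needs an ffmpeg (or libav) binary on your `PATH` to decode compressed formats such as `.mp3` and `.m4a`; a quick way to verify it is available:

```bash
ffmpeg -version
```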
````python
import os
import re
import sys
import time

import google.generativeai as genai
from pathlib import Path
from pydub import AudioSegment

# Optional: fill in your proxy address if needed
# os.environ['https_proxy'] = 'http://127.0.0.1:10808'


# --- Helper Function ---
def ms_to_time_string(ms):
    """Converts milliseconds to the SRT time format HH:MM:SS,mmm."""
    hours = ms // 3600000
    ms %= 3600000
    minutes = ms // 60000
    ms %= 60000
    seconds = ms // 1000
    milliseconds = ms % 1000
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"


# --- Core Logic ---
def generate_srt_from_audio(audio_file_path, api_key):
    """Generates an SRT file from an audio file using VAD and Gemini."""
    if not Path(audio_file_path).exists():
        print(f"Error: Audio file not found at {audio_file_path}")
        return

    # 1. VAD-based audio segmentation
    print("Step 1: Segmenting audio with VAD...")
    try:
        # These imports live here so that faster-whisper stays an optional dependency.
        from faster_whisper.audio import decode_audio
        from faster_whisper.vad import VadOptions, get_speech_timestamps
    except ImportError:
        print("Error: faster-whisper is not installed. Please run 'pip install faster-whisper'")
        return

    sampling_rate = 16000
    audio_for_vad = decode_audio(audio_file_path, sampling_rate=sampling_rate)

    # VAD options can be tweaked for better performance.
    vad_p = {
        # "threshold": float(config.settings['threshold']),
        "min_speech_duration_ms": 1,
        "max_speech_duration_s": 8,
        "min_silence_duration_ms": 200,
        "speech_pad_ms": 100,
    }
    vad_options = VadOptions(**vad_p)
    speech_chunks_samples = get_speech_timestamps(audio_for_vad, vad_options)

    # Convert sample-based timestamps to milliseconds.
    speech_chunks_ms = [
        {"start": int(chunk["start"] / sampling_rate * 1000), "end": int(chunk["end"] / sampling_rate * 1000)}
        for chunk in speech_chunks_samples
    ]

    if not speech_chunks_ms:
        print("No speech detected in the audio file.")
        return

    # Create a temporary directory for the audio chunks.
    temp_dir = Path(f"./temp_audio_chunks_{int(time.time())}")
    temp_dir.mkdir(exist_ok=True)
    print(f"Saving segments to {temp_dir}...")

    # Cut the original audio into per-segment .wav clips and remember their timestamps.
    full_audio = AudioSegment.from_file(audio_file_path)
    segment_data = []
    for i, chunk_times in enumerate(speech_chunks_ms):
        start_ms, end_ms = chunk_times['start'], chunk_times['end']
        audio_chunk = full_audio[start_ms:end_ms]
        chunk_file_path = temp_dir / f"chunk_{i}_{start_ms}_{end_ms}.wav"
        audio_chunk.export(chunk_file_path, format="wav")
        segment_data.append({"start_time": start_ms, "end_time": end_ms, "file": str(chunk_file_path)})
    print(f"Created {len(segment_data)} speech segments.")

    # 2. Batch transcription with Gemini
    print("\nStep 2: Transcribing with Gemini in batches...")

    # Configure the Gemini API.
    genai.configure(api_key=api_key)

    # The final, robust prompt
    prompt = """
# Role
You are a highly specialized AI data processor. Your sole function is to receive a batch of audio files and generate a **single, complete XML report** according to the following inviolable rules. You are not a conversational assistant.
# Inviolable Rules and Output Format
You must analyze all audio files received in this request as a whole and strictly adhere to the following rules. **The priority of these rules is above all else, especially rule #1.**
1. **[Highest Priority] Strict One-to-One Mapping**:
* This is the most important rule: **Each audio file** I provide you **must correspond to one and only one `<audio_text>` tag** in the final output.
* **Regardless of how long a single audio file is, or how many pauses or sentences it contains**, you **must** merge all its transcribed content **into a single string** and place it in that unique `<audio_text>` tag.
* **Absolutely forbidden** to create multiple `<audio_text>` tags for the same input file.
2. **[Data Analysis] Speaker Recognition**:
* Analyze all audio, identifying different speakers. All segments spoken by the same person must use the same, incrementally increasing ID starting from 0 (`[spk0]`, `[spk1]`...).
* For audio where the speaker cannot be identified (e.g., noise, music), use the ID `-1` (`[spk-1]`) uniformly.
3. **[Content and Order] Transcription and Sorting**:
* Automatically detect the language of each audio and transcribe it. If transcription is not possible, fill the text content with an empty string.
* The order of `<audio_text>` tags in the final XML must strictly match the order of the input audio files.
# Mandatory Example of Output Format
<!-- You must generate output that is exactly consistent with the structure below. Note: Even if the audio is very long, all its content must be merged into one tag. -->
```xml
<result>
<audio_text>[spk0]This is the transcription result of the first file.</audio_text>
<audio_text>[spk1]This is the transcription for the second file, it might be very long but all content must be in this single tag.</audio_text>
<audio_text>[spk0]This is the transcription result of the third file, the speaker is the same as the first file.</audio_text>
<audio_text>[spk-1]</audio_text>
</result>
```
# !!! Final Mandatory Check !!!
- **Zero Tolerance Policy**: Your response **can only be XML content**. Absolutely forbidden to include any text, explanations, or ` ```xml ` tags outside of XML.
- **Mandatory Counting and Error Correction**: Before you generate the final response, you **must perform a counting check**: Is the number of `<audio_text>` tags you are about to generate **exactly equal** to the number of audio files I provided?
- **If the count does not match**, this means you have seriously violated the **[Highest Priority] rule #1**. You must **[Discard]** the current draft and **[Regenerate]**, ensuring strict compliance with the one-to-one mapping.
- **Only output if the count matches completely.**
"""

    model = genai.GenerativeModel(model_name="gemini-2.0-flash")

    # Process the segments in batches (adjust batch_size as needed).
    batch_size = 50
    all_srt_entries = []
    for i in range(0, len(segment_data), batch_size):
        batch = segment_data[i:i + batch_size]
        print(f"Processing batch {i // batch_size + 1}...")

        # Upload every clip of this batch to the Gemini Files API.
        files_to_upload = []
        for seg in batch:
            files_to_upload.append(genai.upload_file(path=seg['file'], mime_type="audio/wav"))

        try:
            chat_session = model.start_chat(
                history=[
                    {
                        "role": "user",
                        "parts": files_to_upload,
                    }
                ]
            )
            response = chat_session.send_message(prompt, request_options={"timeout": 600})
            # print(response.text)  # uncomment to inspect the raw model output

            # Use a regex to parse the XML-like response.
            transcribed_texts = re.findall(r'<audio_text>(.*?)</audio_text>', response.text.strip(), re.DOTALL)

            # Pair each transcription with the segment it belongs to, by position.
            for idx, text in enumerate(transcribed_texts):
                if idx < len(batch):
                    seg_info = batch[idx]
                    all_srt_entries.append({
                        "start_time": seg_info['start_time'],
                        "end_time": seg_info['end_time'],
                        "text": text.strip()
                    })
        except Exception as e:
            print(f"An error occurred during Gemini API call: {e}")

    # 3. Assemble the SRT file
    print("\nStep 3: Assembling SRT file...")
    srt_file_path = Path(audio_file_path).with_suffix('.srt')
    with open(srt_file_path, 'w', encoding='utf-8') as f:
        for i, entry in enumerate(all_srt_entries):
            start_time_str = ms_to_time_string(entry['start_time'])
            end_time_str = ms_to_time_string(entry['end_time'])
            f.write(f"{i + 1}\n")
            f.write(f"{start_time_str} --> {end_time_str}\n")
            f.write(f"{entry['text']}\n\n")

    print(f"\nSuccess! SRT file saved to: {srt_file_path}")

    # Clean up the temporary chunk files.
    for seg in segment_data:
        Path(seg['file']).unlink()
    temp_dir.rmdir()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python test.py <path_to_audio_file>")
        sys.exit(1)

    audio_file = sys.argv[1]
    # It's recommended to set the API key as an environment variable
    # for security reasons, e.g., export GOOGLE_API_KEY="YOUR_KEY"
    gemini_api_key = os.environ.get("GOOGLE_API_KEY", "YOUR_GEMINI_API_KEY_HERE")

    generate_srt_from_audio(audio_file, gemini_api_key)
````
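As an example, running the script on a file named `my_audio.mp3` (a placeholder name) writes `my_audio.srt` next to the source audio and removes the temporary chunk directory when it finishes:

```bash
python test.py my_audio.mp3
```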
"Blood and Tears" of Prompt Engineering: How to Tame Gemini
The final version of the prompt you see is the result of a series of failures and optimizations. This process is highly valuable for any developer looking to integrate LLMs into automated processes.
Stage 1: Initial Assumptions and Failures
The initial prompt was straightforward, asking Gemini to perform speaker recognition and output the results in order. However, when sending more than 10 audio clips at once, Gemini's behavior became unpredictable: instead of performing the task, it replied like a conversational assistant ("Okay, please provide the audio files"), completely ignoring that the files were already included in the request.
- Conclusion: Overly complex prompts describing "workflows" can easily confuse the model when handling multimodal batch tasks, causing it to degrade into conversation mode.
Stage 2: Format Amnesia
We adjusted the prompt to make it more like a "rule set" rather than a "flowchart". This time, Gemini successfully transcribed everything! But it forgot the XML format we requested, directly concatenating all the transcribed text into a large paragraph and returning it.
- Conclusion: When the model faces a high "cognitive load" (processing dozens of audio files simultaneously), it may prioritize completing the core task (transcription) and ignore or "forget" secondary but critical instructions like formatting.
Stage 3: Uncontrolled "Internal Segmentation"
We further reinforced the format instructions, explicitly requesting XML output. This time the format was correct, but a new problem arose: for a slightly longer (e.g., 10-second) audio clip, Gemini would arbitrarily split it into two or three sentences and generate an `<audio_text>` tag for each sentence. This resulted in receiving more than 30 tags for 20 input files, completely breaking the one-to-one relationship with the timestamps.
- Conclusion: The model's internal logic (such as splitting by sentence) may conflict with our external instructions. We must use stronger, more explicit instructions to override its default behavior.
Final Prompt Version
Finally, we summarized a set of effective "taming" strategies and embodied them in the final prompt:
- Extreme Role Definition: Define it as a "highly specialized AI data processor" rather than an "assistant" from the start, eliminating casual conversation.
- Rule Hierarchy with Highest Priority: Explicitly set "one input file corresponds to one output tag" as the [Highest Priority] rule, letting the model know that this is an inviolable red line.
- Explicit Merging Instructions: Directly order the model to "merge all its content into a single string, regardless of how long the audio is", providing clear operational guidance.
- Mandatory Self-Checking and Error Correction: This is the most critical step. We ordered the model to perform a counting check before outputting, and if the number of tags does not match the number of files, it must [Discard] the draft and [Regenerate]. This is equivalent to building an "assertion" and "error handling" mechanism into the prompt.
This process tells us that programmatic interaction with LLMs is far more than just "asking questions". It is more like designing an API interface, where we need to use rigorous instructions, clear formatting, explicit constraints, and a fallback checking mechanism to ensure that the AI can stably and reliably return the results we expect in any situation.
Complete Prompt
# Role
You are a highly specialized AI data processor. Your sole function is to receive a batch of audio files and generate a **single, complete XML report** according to the following inviolable rules. You are not a conversational assistant.
# Inviolable Rules and Output Format
You must analyze all audio files received in this request as a whole and strictly adhere to the following rules. **The priority of these rules is above all else, especially rule #1.**
1. **[Highest Priority] Strict One-to-One Mapping**:
* This is the most important rule: **Each audio file** I provide you **must correspond to one and only one `<audio_text>` tag** in the final output.
* **Regardless of how long a single audio file is, or how many pauses or sentences it contains**, you **must** merge all its transcribed content **into a single string** and place it in that unique `<audio_text>` tag.
* **Absolutely forbidden** to create multiple `<audio_text>` tags for the same input file.
2. **[Data Analysis] Speaker Recognition**:
* Analyze all audio, identifying different speakers. All segments spoken by the same person must use the same, incrementally increasing ID starting from 0 (`[spk0]`, `[spk1]`...).
* For audio where the speaker cannot be identified (e.g., noise, music), use the ID `-1` (`[spk-1]`) uniformly.
3. **[Content and Order] Transcription and Sorting**:
* Automatically detect the language of each audio and transcribe it. If transcription is not possible, fill the text content with an empty string.
* The order of `<audio_text>` tags in the final XML must strictly match the order of the input audio files.
# Mandatory Example of Output Format
<!-- You must generate output that is exactly consistent with the structure below. Note: Even if the audio is very long, all its content must be merged into one tag. -->
```xml
<result>
<audio_text>[spk0]This is the transcription result of the first file.</audio_text>
<audio_text>[spk1]This is the transcription for the second file, it might be very long but all content must be in this single tag.</audio_text>
<audio_text>[spk0]This is the transcription result of the third file, the speaker is the same as the first file.</audio_text>
<audio_text>[spk-1]</audio_text>
</result>
```
# !!! Final Mandatory Check !!!
- **Zero Tolerance Policy**: Your response **can only be XML content**. Absolutely forbidden to include any text, explanations, or ` ```xml ` tags outside of XML.
- **Mandatory Counting and Error Correction**: Before you generate the final response, you **must perform a counting check**: Is the number of `<audio_text>` tags you are about to generate **exactly equal** to the number of audio files I provided?
- **If the count does not match**, this means you have seriously violated the **[Highest Priority] rule #1**. You must **[Discard]** the current draft and **[Regenerate]**, ensuring strict compliance with the one-to-one mapping.
- **Only output if the count matches completely.**
Of course, even this prompt cannot guarantee that the response format is correct 100% of the time; occasionally the number of returned `<audio_text>` tags will still fail to match the number of input audio files.
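A pragmatic safeguard is therefore to verify the count on the client side and retry the batch when it does not match, instead of trusting the prompt alone. Below is a minimal sketch of that idea; `transcribe_batch` is a hypothetical wrapper around the Gemini call from the script above that returns the raw response text:

```python
import re
import time

def transcribe_with_count_check(transcribe_batch, batch, max_retries=3):
    """Retries a batch call until the number of <audio_text> tags matches
    the number of audio clips in the batch, or the retries run out."""
    texts = []
    for attempt in range(1, max_retries + 1):
        raw = transcribe_batch(batch)  # hypothetical: sends the clips and returns response.text
        texts = re.findall(r'<audio_text>(.*?)</audio_text>', raw, re.DOTALL)
        if len(texts) == len(batch):
            return texts
        print(f"Attempt {attempt}: got {len(texts)} tags for {len(batch)} clips, retrying...")
        time.sleep(2)  # brief pause before retrying
    # Last resort: pad or truncate so the timestamps stay aligned.
    return (texts + [""] * len(batch))[:len(batch)]
```

Splitting the offending batch into smaller ones before retrying is another option if mismatches persist.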