Thanks to the rapid advancement of AI technology, the once challenging task of video translation has become more accessible, although the results may not yet be perfect.
Video translation is more complex than text translation, but its core is still text-based translation. (Technologies that convert speech directly into speech in another language do exist, but they are not yet mature and have limited practicality.)
The video translation workflow can be roughly divided into the following stages:
Speech Recognition: Extract human voices from the video and convert them into text;
Text Translation: Translate the extracted text into the target language;
Speech Synthesis: Generate speech in the target language based on the translated text;
Synchronization Adjustment: Ensure that the dubbed audio and subtitle files are synchronized with the video content;
Embedding Processing: Embed the translated subtitles and dubbing into the video to generate a new video file.
Detailed Discussion of Each Stage:
Speech Recognition
The goal of this step is to accurately convert the speech content in the video into text, with timestamps attached. Currently, there are various implementation methods, including using OpenAI's Whisper model, Alibaba's FunASR series models, or directly calling online speech recognition APIs, such as Baidu Speech Recognition.
When selecting a model, you can choose anything from tiny to large-v3 according to your needs; the larger the model, the higher the recognition accuracy, at the cost of speed and memory usage.
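As a rough sketch of the local approach, here is how the open-source whisper package can produce timestamped segments (the file name audio.wav is a placeholder):

import whisper

# Load a model; sizes range from "tiny" to "large-v3"
model = whisper.load_model("small")
result = model.transcribe("audio.wav")

# Each recognized segment carries start/end timestamps plus the text
for seg in result["segments"]:
    print(f"{seg['start']:8.2f} --> {seg['end']:8.2f}  {seg['text']}")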
Text Translation
After obtaining the text, translation can begin. Note that subtitle translation differs from ordinary text translation: each translated line must stay matched to its original timestamps.
When using traditional translation engines (such as Baidu Translate or Tencent Translate), send only the subtitle text lines for translation; do not pass the sequence numbers or timestamp lines, which would waste the character quota and risk corrupting the subtitle format.
Ideally, the translated subtitles should have the same number of lines as the original, with no blank lines.
However, translation engines, AI-based ones in particular, tend to merge lines based on context: when a line contains only a few isolated characters or one or two words and flows semantically from the previous sentence, the engine will most likely fold it into the preceding line.
The merged translation reads more fluently, but the result no longer matches the original subtitles line by line, leaving blank lines behind.
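A minimal sketch of the text-lines-only approach, assuming the pysrt package and a stand-in translate_lines() function (both are illustrative, not part of any specific tool): parse the SRT, send only the text, and write the translations back against the untouched timestamps.

import pysrt

def translate_lines(lines):
    # Hypothetical stand-in: replace with a real call to your translation
    # engine; it must return exactly one translated string per input string.
    return lines

subs = pysrt.open("subtitles.srt")
texts = [item.text.replace("\n", " ") for item in subs]  # text only, no numbers or timestamps
translated = translate_lines(texts)
for item, new_text in zip(subs, translated):
    item.text = new_text                                 # original timestamps stay untouched
subs.save("subtitles.translated.srt", encoding="utf-8")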
Speech Synthesis (Dubbing)
After the translation is completed, dubbing can be generated based on the translated subtitles.
Currently, EdgeTTS offers an essentially free, nearly unlimited dubbing channel. Send the subtitles to it line by line to obtain one dubbed audio clip per line, then merge the clips into a complete audio track.
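A minimal sketch of per-line dubbing with the edge-tts Python package (the voice name and file names are illustrative):

import asyncio
import edge_tts

async def dub_line(text: str, out_path: str, voice: str = "en-US-AriaNeural") -> None:
    # Synthesize one subtitle line into an audio file (MP3 by default)
    await edge_tts.Communicate(text, voice).save(out_path)

asyncio.run(dub_line("Hello, world.", "line_0001.mp3"))

One common way to merge the resulting clips is to place each one at its subtitle's start time on a silent base track, for example with pydub or ffmpeg's concat demuxer.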
Synchronization and Alignment Adjustment
Ensuring that subtitles and audio are synchronized with the video is the biggest challenge in video translation.
Differences in pronunciation duration between languages are unavoidable, and they cause synchronization problems. Strategies include speeding up audio playback, extending the length of video clips, and exploiting the silent gaps between subtitles, in whatever combination gives the best synchronization.
If no adjustment is made and the dubbing is embedded directly at the original subtitle timestamps, mismatches are inevitable: the subtitle has already disappeared while the voice is still speaking, or the person on screen stopped talking long ago while the audio keeps playing.
To solve this problem, there are two relatively simple approaches:
The first is to speed up the audio so it finishes within the subtitle's time interval. This achieves synchronization, but the speech rate fluctuates from line to line, which makes for a poor listening experience.
The second is to slow down the video clip within the subtitle interval, that is, to stretch the clip until its length matches the new dubbing. This also achieves synchronization, but the picture can appear to stutter.
The two methods can also be combined: accelerate the audio while extending the video clip, so that neither the audio speeds up too much nor the video stretches too far.
Depending on the video, you can also exploit the silent gaps between two subtitles. First try not to accelerate the audio at all: if borrowing the blank interval lets the audio finish within its allotted slot, no acceleration is needed and the result sounds better. The drawback is that the person on screen has visibly finished speaking while the audio is still playing.
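The speed-up factor itself is simple arithmetic: the ratio of the dubbed clip's duration to its subtitle slot. A sketch using ffmpeg's atempo filter from Python (file names and durations are placeholders; note that older ffmpeg builds cap atempo at 2.0, so larger ratios require chaining the filter):

import subprocess

def fit_audio_to_slot(in_path, out_path, audio_dur, slot_dur):
    # If the dubbed clip is longer than its subtitle slot, speed it up by the
    # ratio of the two durations; a tempo of 1.0 leaves the clip unchanged.
    tempo = max(audio_dur / slot_dur, 1.0)
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_path, "-filter:a", f"atempo={tempo:.4f}", out_path],
        check=True,
    )

# Example: a 3.6 s dubbed line must fit a 3.0 s subtitle slot -> tempo 1.2
fit_audio_to_slot("line_0001.mp3", "line_0001_fit.mp3", 3.6, 3.0)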
Synthesis and Output
After completing the above steps, embed the translated subtitles and the dubbing into the original video; a tool such as ffmpeg makes this straightforward. The resulting file completes the translation process. For example:
ffmpeg -y -i original_video.mp4 -i dubbed_audio.m4a -map 0:v:0 -map 1:a:0 -c:v libx264 -c:a aac -vf subtitles=subtitles.srt -shortest out.mp4
(The -map options ensure the video comes from the original file and the audio from the dubbed track; without them, ffmpeg's default stream selection may pick the original audio instead.)
A Hard-to-Solve Problem: Multiple Speaker Recognition
Speaker role recognition, that is, synthesizing a different voice for each character in the video, requires speaker diarization, and the number of speakers usually has to be specified in advance. This is barely workable for ordinary one- or two-person dialogues; for most videos, however, the number of speakers cannot be determined ahead of time, and the final synthesized result is poor, so this part is set aside for now.
Summary
The above is only a simplified outline of the workflow. In practice, achieving good results involves many more details: pre-processing the input format of the original video (mov/mp4/avi/mkv), splitting the video into an audio track and a silent video, separating the human voice from the background sound in the audio, batching translation requests to speed up subtitle translation, re-splitting lines when blank lines appear in the translated subtitles, generating and embedding dual-language subtitles, and so on.
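As a small illustration, the audio/silent-video split mentioned above might look like this (file names are placeholders):

import subprocess

# Extract the full audio track (16 kHz mono WAV suits most ASR models)
subprocess.run(["ffmpeg", "-y", "-i", "original_video.mp4", "-vn",
                "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", "audio.wav"], check=True)

# Keep a silent copy of the video stream for later re-muxing
subprocess.run(["ffmpeg", "-y", "-i", "original_video.mp4", "-an",
                "-c:v", "copy", "video_silent.mp4"], check=True)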
Through this series of steps, a video translation task can be completed, converting the video content into the target language. Technical challenges remain along the way, but as the technology continues to improve, the quality and efficiency of video translation should keep rising.