Agentic Pipeline Chainscript Labs

From source video to voice-cloned localisation — fully automated

A multi-stage agentic pipeline that fetches captions, orchestrates LLM translation, synthesises localised audio using voice cloning, and hands off to video assembly — with a human review gate before irreversible steps.

End-to-end Fully automated pipeline
Voice cloning Dia-1.6B / F5-TTS
Human gate Review before synthesis
Multi-stage Agentic orchestration

The Challenge

Localising video content is one of the most labour-intensive post-production workflows. A single video requires caption extraction, professional translation, voice recording in the target language, audio timing synchronisation, and final video assembly — with specialist handoffs at each stage. For content creators and studios producing at volume, this process is prohibitively expensive and slow.

Chainscript Labs needed to collapse this multi-day, multi-vendor workflow into an autonomous pipeline. The constraint was that voice quality had to be high enough for professional distribution — ruling out generic TTS — and the client needed to be able to review and approve the translation before audio synthesis began.

"The architecture had to respect the difference between reversible and irreversible steps. Translation can be corrected; voice synthesis is compute-intensive and produces an artefact. The review gate exists precisely at that boundary."

The Solution

The Agentic Video Localisation Pipeline is a two-stage orchestration system. Stage one — caption extraction, translation, and review — is fast and reversible. Stage two — voice synthesis and video assembly — is compute-intensive and irreversible. A human-in-the-loop review gate sits between the two stages: the client approves the translated script before the system commits to synthesis.

Voice synthesis uses Dia-1.6B for voice cloning from a short reference sample, with F5-TTS as an alternative for languages where Dia performs less well. The synthesised audio is synchronised to the original video timeline using FFmpeg, producing a localised video that preserves the original pacing and visual content.

How It Works

1
Caption Retrieval

The pipeline fetches captions from the source video — from YouTube's caption API for YouTube content, or from a Whisper transcription run for uploaded video files. Captions are parsed into a timestamped segment list with start time, end time, and text for each segment.

2
LLM Translation

Gemini translates the caption segments into the target language. Critically, timing constraints are passed alongside the text: the model is instructed to produce translations that, when spoken at natural pace, will fit within the original segment duration. Segments that exceed duration limits trigger a condensation pass.
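A duration-constrained prompt might look like the sketch below. The exact wording used in production is not shown here; this is an illustrative construction of the idea of passing the segment's time slot alongside the text:

```python
def translation_prompt(text: str, duration_s: float, target_lang: str) -> str:
    """Build a translation prompt carrying the segment's timing
    constraint (illustrative wording, not the production prompt)."""
    return (
        f"Translate the following caption segment into {target_lang}.\n"
        f"The translation must be speakable at a natural pace within "
        f"{duration_s:.1f} seconds. Prefer concise phrasing over "
        f"literal fidelity when the literal translation would run long.\n\n"
        f"Segment: {text}"
    )
```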

3
Human Review Gate

The translated script is presented to the client for review and approval before any synthesis begins. The client can edit individual segments, adjust timing hints, or flag segments for re-translation. Only after explicit approval does the pipeline proceed to stage two.
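The gate itself can be reduced to a guard that stage two must pass. A minimal sketch, with field names assumed rather than taken from the real schema:

```python
def may_start_synthesis(job: dict) -> bool:
    """Stage two runs only after explicit client approval.
    The 'state' and 'approved_by' fields are assumed names, not the
    pipeline's actual schema."""
    return job.get("state") == "review" and job.get("approved_by") is not None
```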

4
Voice Cloning & Synthesis

The approved translation is synthesised using Dia-1.6B, conditioned on a 10–30 second reference sample of the target speaker's voice. Each segment is synthesised independently with its timing target. F5-TTS serves as a fallback for languages or speaker profiles where Dia's output quality falls below threshold.

5
Audio Synchronisation

Synthesised audio segments are assembled into a full audio track using FFmpeg, with timing adjustments applied where synthesis duration drifted from the target. The original audio is muted and replaced with the localised track; background music and sound effects are preserved from the original.
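Timing drift is typically corrected with FFmpeg's `atempo` filter, which changes speed without changing pitch but accepts a factor of 0.5–2.0 per filter instance, so larger corrections are chained. A sketch of building the filter string for one segment:

```python
def tempo_filter(actual_s: float, target_s: float) -> str:
    """FFmpeg atempo filter string to fit a synthesised segment of
    actual_s seconds into a target_s slot. atempo accepts 0.5-2.0 per
    stage, so extreme ratios are expressed as a chain."""
    ratio = actual_s / target_s
    stages = []
    while ratio > 2.0:
        stages.append("atempo=2.0")
        ratio /= 2.0
    while ratio < 0.5:
        stages.append("atempo=0.5")
        ratio /= 0.5
    stages.append(f"atempo={ratio:.3f}")
    return ",".join(stages)
```

In practice only small corrections should ever be needed here, because the duration-constrained translation step keeps segments close to target before synthesis begins.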

6
Video Assembly & Delivery

The final video is assembled by FFmpeg — original video stream combined with the synthesised audio track, with localised subtitles optionally burned in. The output is encoded to the target specification and delivered to the client's storage location via the API.

Technical Highlights

Duration-Constrained Translation

Standard LLM translation optimises for semantic accuracy — it does not consider whether the output will fit in the original segment's time slot when spoken aloud. This pipeline passes segment durations as hard constraints, and a condensation agent automatically trims translations that exceed their window without human intervention.
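The check that triggers the condensation pass can be approximated with a speech-rate heuristic. The constants below are rough assumptions for illustration; a production check would be calibrated per language and per voice:

```python
WORDS_PER_SECOND = 2.5  # rough speech-rate assumption; varies by language

def estimated_duration(text: str) -> float:
    """Estimate how long the text takes to speak at a natural pace."""
    return len(text.split()) / WORDS_PER_SECOND

def needs_condensing(translation: str, slot_s: float,
                     tolerance: float = 1.1) -> bool:
    """Flag translations whose estimated spoken duration exceeds the
    original segment's slot, with a small tolerance margin."""
    return estimated_duration(translation) > slot_s * tolerance
```

Flagged segments would be sent back to the model with an explicit instruction to shorten, rather than being truncated mechanically.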

Reversible / Irreversible Stage Separation

The pipeline's architecture explicitly models the cost of each stage. Translation is cheap and reversible; voice synthesis is expensive and produces a committed artefact. The human review gate enforces a clean boundary between these stages — preventing the system from wasting synthesis compute on translations the client would reject.

Dual TTS Engine Strategy

Dia-1.6B delivers high-quality voice cloning but has language coverage gaps. F5-TTS is configured as a hot fallback: after synthesis, a quality classifier scores the Dia output on naturalness and timing accuracy. Segments below threshold are automatically re-synthesised with F5-TTS, with no manual intervention required.
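The fallback logic is a simple score-and-retry around the two engines. A sketch with the engines and quality classifier injected as callables, since the real model APIs are not shown here:

```python
QUALITY_THRESHOLD = 0.8  # assumed cut-off for the quality classifier

def synthesise_with_fallback(segment, dia, f5, score):
    """Synthesise with Dia first; re-synthesise with F5-TTS when the
    quality classifier scores the output below threshold. `dia`, `f5`
    and `score` are injected callables standing in for the real
    engine wrappers."""
    audio = dia(segment)
    if score(audio) >= QUALITY_THRESHOLD:
        return audio, "dia"
    return f5(segment), "f5-tts"
```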

Stateful Pipeline Orchestration

Each job's state — retrieval complete, translation approved, synthesis in progress, assembly complete — is tracked in a PostgreSQL state machine. The Flask API exposes real-time job status so clients can monitor progress and receive completion webhooks at each stage boundary.
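A state machine of this shape reduces to a transition table plus a guard. The stage names below are inferred from the pipeline description, not taken from the actual schema, and the production guard would run inside a PostgreSQL transaction:

```python
# Allowed stage transitions (names assumed from the pipeline description)
TRANSITIONS = {
    "created":     {"retrieved"},
    "retrieved":   {"translated"},
    "translated":  {"approved", "retrieved"},  # re-translation loops back
    "approved":    {"synthesised"},
    "synthesised": {"assembled"},
    "assembled":   {"delivered"},
}

def transition(state: str, new_state: str) -> str:
    """Validate a job state change against the transition table."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Because every stage boundary is persisted, a crashed job resumes from its last recorded state instead of re-running completed synthesis work.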

The Stack

Flask Google Gemini Dia-1.6B F5-TTS FFmpeg Whisper YouTube Data API PostgreSQL SQLAlchemy Docker

Outcomes

  • End-to-end video localisation time collapsed from a multi-day, multi-vendor workflow to hours — with human review preserved at the critical approval gate.
  • Duration-constrained translation eliminated the common problem of synthesised audio overrunning segment boundaries, producing naturally paced localised audio without manual timing adjustments.
  • The dual TTS engine strategy ensured consistent voice quality across target languages, with automatic fallback handling language coverage gaps without any client-visible failure.
  • The stateful pipeline architecture meant failed or interrupted jobs could be resumed from any stage — no re-processing of already-completed synthesis segments.

Building a content localisation workflow?

I design agentic pipelines that automate the tedious, preserve the human oversight that matters, and scale without breaking.