The Challenge
Chainscript Labs handles legal and regulatory documents that need accurate, auditable translation across European languages. The standard LLM translation approach — treating each page or paragraph in isolation — produces output that reads fluently page by page but loses coherence at document scale. Technical terms shift mid-document, defined concepts are translated inconsistently, and regulatory language drifts from the source meaning.
The deeper problem: how do you know when a translation is wrong? With standard approaches, you don't — until a human reviewer catches it, often long after the document has been used. The system needed to verify its own output automatically, and fix what it finds.
The Solution
The Agentic PDF Translation Service processes documents page by page, but with a rolling context window that carries forward key terminology and defined concepts from earlier pages. Each translated page is immediately back-translated into the source language, and the result is compared against the original using a composite quality score combining semantic similarity (Sentence Transformers), structural equivalence, and terminology consistency.
Pages scoring below a configurable threshold are automatically retried with enriched context and alternative prompting strategies. The pipeline checkpoints progress to disk, so long documents can resume from the last successful page if interrupted — no re-processing of already-approved translations.
How It Works
The PDF is parsed and segmented into pages with layout metadata preserved. Tables, headers, footnotes, and body text are identified and tagged — the translation prompt adapts based on content type to preserve structural meaning.
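A minimal sketch of how content-type tagging might drive prompt adaptation. The block types and hint strings here are illustrative, not the production taxonomy:

```python
from dataclasses import dataclass
from typing import Literal

Kind = Literal["body", "table", "header", "footnote"]

@dataclass(frozen=True)
class Block:
    """One tagged segment of a parsed page."""
    kind: Kind
    text: str

# Hypothetical per-type instructions appended to the translation prompt.
PROMPT_HINTS: dict[str, str] = {
    "body": "Translate naturally, preserving paragraph breaks.",
    "table": "Preserve row and cell boundaries; translate cell text only.",
    "header": "Translate tersely; keep any section numbering.",
    "footnote": "Keep the footnote marker; translate only the note text.",
}

def prompt_for(block: Block, base_prompt: str) -> str:
    """Adapt the base translation prompt to the block's content type."""
    return f"{base_prompt}\n\n[{block.kind}] {PROMPT_HINTS[block.kind]}"
```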
Before translation begins, Gemini identifies defined terms, proper nouns, and domain-specific vocabulary in the source document. These are compiled into a translation glossary that is injected into every subsequent page translation prompt — enforcing consistency across the full document.
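The injection step can be sketched as a function that renders the extracted glossary into a constraint block for every page prompt. The wording of the constraint is an assumption; only the mechanism (same glossary, every prompt) comes from the pipeline described above:

```python
def glossary_block(glossary: dict[str, str]) -> str:
    """Render extracted source->target term pairs as a hard prompt constraint."""
    rules = [f'- "{src}" must be rendered as "{tgt}"'
             for src, tgt in sorted(glossary.items())]
    return "Translate the following terms exactly as specified:\n" + "\n".join(rules)
```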
Each page is translated with a rolling context window: the translated text from the preceding pages (truncated to fit the context budget) is included in the prompt. This allows the model to maintain narrative continuity, pronoun resolution, and consistent register across page boundaries.
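One plausible truncation policy, assuming a character-denominated context budget: keep the most recent whole pages that fit, dropping the oldest first so continuity near the current page is preserved.

```python
def rolling_context(translated_pages: list[str], budget_chars: int) -> str:
    """Most recent whole pages that fit the budget; oldest pages dropped first."""
    kept: list[str] = []
    used = 0
    for page in reversed(translated_pages):   # walk newest to oldest
        if used + len(page) > budget_chars:
            break
        kept.append(page)
        used += len(page)
    return "\n\n".join(reversed(kept))        # restore document order
```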
The translated page is immediately back-translated into the source language. The back-translation is compared against the original using three signals: semantic similarity via Sentence Transformers embeddings, structural preservation (paragraph count, list items, headers), and glossary term hit rate.
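The structural-preservation signal can be approximated by comparing element counts between the original and the back-translation. This sketch assumes a markdown-like intermediate text form; the exact element taxonomy in the real pipeline may differ:

```python
import re

def structure_signature(text: str) -> dict[str, int]:
    """Counts that should match between original and back-translation."""
    return {
        "paragraphs": len([p for p in text.split("\n\n") if p.strip()]),
        "list_items": len(re.findall(r"(?m)^\s*(?:[-*]|\d+\.)\s", text)),
        "headers": len(re.findall(r"(?m)^#{1,6}\s", text)),
    }

def structure_delta(original: str, back_translation: str) -> int:
    """Total absolute mismatch across the structural counts (0 = preserved)."""
    a, b = structure_signature(original), structure_signature(back_translation)
    return sum(abs(a[k] - b[k]) for k in a)
```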
The composite quality score is compared against a configurable threshold. Pages below threshold are flagged and re-translated with an enriched prompt that includes the specific quality failures as feedback — guiding the model toward a corrected output. The retry loop runs up to three times per page.
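The retry loop above can be sketched as follows. The translation and scoring functions are stand-ins for the real Gemini call and composite scorer; the feedback mechanism (failures from the last attempt enriching the next prompt) is the part taken from the pipeline:

```python
def translate_page(text, translate_fn, score_fn, threshold, max_retries=3):
    """Retry below-threshold pages, feeding quality failures back as feedback."""
    feedback: list[str] = []
    best_score, best = -1.0, ""
    for attempt in range(1 + max_retries):
        candidate = translate_fn(text, feedback)   # feedback enriches the prompt
        score, failures = score_fn(text, candidate)
        if score > best_score:
            best_score, best = score, candidate
        if score >= threshold:
            return candidate, score, attempt
        feedback = failures        # e.g. ['glossary term "Verordnung" missing']
    return best, best_score, max_retries   # best effort after exhausting retries
```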
Each approved page is written to a checkpoint file keyed by document ID and page number. On completion, all approved pages are assembled into the output PDF. If the pipeline is interrupted at any point, it resumes from the last checkpoint — protecting against wasted compute on long documents.
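A checkpoint store keyed by document ID and page number might look like the sketch below. The file layout and naming are assumptions; the write-then-rename pattern is a standard way to keep each checkpoint atomic:

```python
import json
from pathlib import Path

def save_checkpoint(root: Path, doc_id: str, page: int, record: dict) -> None:
    """Persist one approved page; write-then-rename avoids torn files."""
    path = root / doc_id / f"page_{page:04d}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record), encoding="utf-8")
    tmp.replace(path)

def resume_page(root: Path, doc_id: str) -> int:
    """First page with no checkpoint, i.e. where an interrupted run restarts."""
    doc_dir = root / doc_id
    if not doc_dir.exists():
        return 1
    done = {int(p.stem.split("_")[1]) for p in doc_dir.glob("page_*.json")}
    page = 1
    while page in done:
        page += 1
    return page
```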
Technical Highlights
Composite Quality Scoring
Quality is measured on three independent axes: semantic similarity (cosine similarity between Sentence Transformer embeddings of original and back-translation), structural preservation (paragraph and element count delta), and terminology consistency (glossary term coverage). The composite score requires all three to pass — a fluent back-translation that drops a defined term still fails the gate.
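The all-axes-must-pass gate can be sketched like this. Thresholds are illustrative, and the semantic axis is taken as a precomputed cosine similarity rather than calling Sentence Transformers directly:

```python
def quality_gate(semantic_sim: float, structure_delta: int, term_coverage: float,
                 sim_min: float = 0.85, delta_max: int = 0,
                 coverage_min: float = 1.0) -> tuple[bool, list[str]]:
    """Every axis must pass on its own; no averaging can mask a failed axis."""
    failures: list[str] = []
    if semantic_sim < sim_min:
        failures.append(f"semantic similarity {semantic_sim:.2f} below {sim_min}")
    if abs(structure_delta) > delta_max:
        failures.append(f"structural mismatch of {abs(structure_delta)} elements")
    if term_coverage < coverage_min:
        failures.append(f"glossary coverage {term_coverage:.0%} below {coverage_min:.0%}")
    return not failures, failures
```

The returned failure strings are exactly what the retry step needs as feedback: a fluent page that dropped one defined term fails with a specific, actionable reason.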
Glossary-Constrained Generation
Domain terminology extracted during the initial pass is compiled into a structured glossary and injected as a hard constraint into each translation prompt. The model is instructed to translate listed terms using the exact target-language forms specified — preventing terminology drift across hundreds of pages.
Fault-Tolerant Checkpointing
Every approved page is immediately checkpointed to disk. If processing fails mid-document — network interruption, rate limit, unexpected error — the pipeline resumes from the last successful checkpoint rather than reprocessing the entire document. For 200-page regulatory documents, this matters.
Per-Page Audit Log
Every page produces an audit record: original text, translation, back-translation, composite quality score, number of retries, and final approval status. This log is stored in PostgreSQL and accessible via API, providing a complete verifiable record of the translation process — essential for regulatory submission contexts.
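The audit record could be modeled as below; the field names mirror the record described above but are illustrative, not the production schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PageAudit:
    """One row of the per-page audit trail (field names illustrative)."""
    doc_id: str
    page: int
    original: str
    translation: str
    back_translation: str
    quality_score: float
    retries: int
    approved: bool

def to_api_payload(audit: PageAudit) -> dict:
    """Serialize one record for the audit API or a PostgreSQL insert."""
    return asdict(audit)
```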
The Stack
Outcomes
- The self-healing loop catches and corrects translation failures before they reach a human reviewer — reducing post-translation manual correction significantly.
- Glossary-constrained generation maintains consistent terminology across full documents, a result previously achievable only with dedicated human terminology management.
- The per-page audit log provides the evidentiary trail required for regulatory submissions — a verifiable record that each page met a defined quality standard.
- Checkpointing allows large documents (150–300 pages) to be processed reliably across extended time periods without risk of losing progress to transient failures.