The Challenge
Adfluence Hub manages large volumes of video content — creator submissions, campaign recordings, performance clips — and needed to extract structured intelligence from it at scale. Manual review was the bottleneck: a human reviewer could watch perhaps 20–30 hours of video per week, creating an ever-growing backlog as content volume increased.
The requirement was to build a system that could process video programmatically: identify key moments, classify content segments, extract spoken claims, and produce a structured report — all without human intervention in the standard pipeline.
The Solution
The Video Intelligence API accepts video uploads, stores them on S3, and queues them for multimodal analysis through Gemini's vision capabilities. The system extracts frames at configurable intervals, combines them with transcribed audio, and runs a battery of analytical prompts to surface key moments, content classifications, speaker claims, and quality signals.
All processing is asynchronous — clients submit a video and poll or receive a webhook when the report is ready. The entire pipeline runs in Docker containers, making it straightforward to scale analysis workers independently of the API layer.
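A minimal client-side sketch of that submit-and-poll flow. The endpoint paths, response fields, and polling interval are illustrative assumptions, not the service's documented contract:

```python
import time
import requests

API_BASE = "https://video-intel.example.com"  # hypothetical base URL

def submit_and_wait(path: str, timeout_s: int = 1800) -> dict:
    """Upload a video, then poll until the analysis report is ready."""
    with open(path, "rb") as fh:
        resp = requests.post(f"{API_BASE}/videos", files={"file": fh})
    resp.raise_for_status()
    job_id = resp.json()["job_id"]  # the API returns immediately with a job reference

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{API_BASE}/jobs/{job_id}").json()
        if status["state"] == "complete":
            return requests.get(f"{API_BASE}/jobs/{job_id}/report").json()
        if status["state"] == "failed":
            raise RuntimeError(f"job {job_id} failed")
        time.sleep(10)  # poll every 10s; a webhook avoids this loop entirely
    raise TimeoutError(f"job {job_id} not ready after {timeout_s}s")
```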
How It Works
Videos are uploaded via the FastAPI endpoint, validated for format and size, and stored on S3 with a unique job ID. The API returns immediately with a job reference — no waiting for analysis to complete.
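The ingest step might look like the following FastAPI sketch. The bucket name, size cap, and response shape are illustrative assumptions, and the hand-off to the worker queue is elided:

```python
import uuid
from fastapi import FastAPI, HTTPException, UploadFile
import boto3

app = FastAPI()
s3 = boto3.client("s3")  # credentials taken from the environment

ALLOWED_TYPES = {"video/mp4", "video/quicktime", "video/webm"}
MAX_BYTES = 2 * 1024**3  # assumed 2 GB cap

@app.post("/videos", status_code=202)
async def upload_video(file: UploadFile) -> dict:
    if file.content_type not in ALLOWED_TYPES:
        raise HTTPException(415, f"unsupported format: {file.content_type}")
    body = await file.read()
    if len(body) > MAX_BYTES:
        raise HTTPException(413, "file exceeds size limit")
    job_id = str(uuid.uuid4())  # unique job ID doubles as the S3 key prefix
    s3.put_object(Bucket="video-intel-uploads", Key=f"{job_id}/source", Body=body)
    # enqueue_analysis(job_id)  # hand off to the worker queue (not shown)
    return {"job_id": job_id, "status": "queued"}
```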
A worker process retrieves the video from S3 and uses FFmpeg to extract frames at configurable intervals (e.g. every 2 seconds, or keyframes only). The extraction strategy is tuned per use case — dense sampling for action-heavy content, sparse for talking-head video.
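Both strategies map onto standard FFmpeg filters. A sketch, with illustrative paths and a hypothetical wrapper function:

```python
import subprocess

def extract_frames(video_path: str, out_dir: str,
                   interval_s: float = 2.0, keyframes_only: bool = False) -> None:
    """Dump JPEG frames using either sparse keyframe or dense interval sampling."""
    if keyframes_only:
        # select only I-frames; -vsync vfr keeps their original timing
        cmd = ["ffmpeg", "-i", video_path,
               "-vf", "select='eq(pict_type,I)'", "-vsync", "vfr",
               f"{out_dir}/frame_%05d.jpg"]
    else:
        # dense sampling: one frame every `interval_s` seconds
        cmd = ["ffmpeg", "-i", video_path,
               "-vf", f"fps=1/{interval_s}",
               f"{out_dir}/frame_%05d.jpg"]
    subprocess.run(cmd, check=True)
```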
Audio is extracted and transcribed in parallel with frame extraction. The transcript is segmented and timestamped, creating a text track that aligns with the visual frames for joint multimodal analysis.
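In code terms, the audio leg might look like this. The case study does not name the ASR engine, so the transcribe() hook is a placeholder; only the segment shape used downstream is shown:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    start_s: float  # segment start, seconds from video start
    end_s: float
    text: str

def extract_audio(video_path: str, wav_path: str) -> None:
    # 16 kHz mono WAV is a common ASR input format
    subprocess.run(["ffmpeg", "-i", video_path, "-vn", "-ac", "1",
                    "-ar", "16000", wav_path], check=True)

# segments = transcribe(wav_path)  # hypothetical: -> list[TranscriptSegment]
```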
Frame batches are sent to Gemini Vision alongside the aligned transcript segments. Analytical prompts target different signals: key moment detection, content classification, on-screen text extraction, speaker claim identification, and brand safety flags.
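A sketch of one analytical pass using the google-generativeai client. The model name, prompt wording, and input shapes are assumptions for illustration:

```python
import google.generativeai as genai

genai.configure(api_key="GEMINI_API_KEY")  # read from the environment in practice
model = genai.GenerativeModel("gemini-1.5-pro")

def analyse_batch(frames: list[tuple[float, bytes]], transcript_window: str) -> str:
    """Run one analytical prompt over a frame batch plus its aligned transcript."""
    parts: list = [
        "Identify key moments in these frames. Each frame is labelled "
        "with its timestamp; the transcript for the same window follows.\n"
        f"Transcript: {transcript_window}"
    ]
    for ts, jpeg_bytes in frames:
        parts.append(f"Frame at t={ts:.1f}s:")
        parts.append({"mime_type": "image/jpeg", "data": jpeg_bytes})
    return model.generate_content(parts).text
```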
Individual analysis results are aggregated into a structured JSON report with timestamped entries for each detected signal. A summary section synthesises the key findings across the full video duration.
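Structurally, aggregation is a merge-and-sort over per-signal hits. A minimal sketch with illustrative field names; the real report schema is not published in the case study:

```python
def build_report(job_id: str, results: dict[str, list[dict]]) -> dict:
    """Merge per-signal result lists into one timestamp-sorted report."""
    entries = [
        {"signal": signal, **hit}            # each hit carries a "t" timestamp
        for signal, hits in results.items()  # e.g. "key_moment", "brand_safety"
        for hit in hits
    ]
    entries.sort(key=lambda e: e["t"])
    return {
        "job_id": job_id,
        "entries": entries,
        "summary": None,  # filled by a final synthesis pass over the full video
    }
```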
The completed report is stored in PostgreSQL linked to the job ID and optionally delivered via webhook. Clients can retrieve full reports, filtered subsets (e.g. only brand safety flags), or paginated moment timelines via the REST API.
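Retrieval with signal filtering and pagination might look like the sketch below; the route shape and query parameters are assumptions, and an in-memory dict stands in for the PostgreSQL-backed store:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
REPORTS: dict[str, dict] = {}  # stand-in for the PostgreSQL lookup

@app.get("/jobs/{job_id}/report")
def get_report(job_id: str, signal: str | None = None,
               page: int = 1, page_size: int = 100) -> dict:
    report = REPORTS.get(job_id)
    if report is None:
        raise HTTPException(404, "unknown job or report not ready")
    entries = report["entries"]
    if signal:  # e.g. ?signal=brand_safety returns only those flags
        entries = [e for e in entries if e["signal"] == signal]
    start = (page - 1) * page_size
    return {"job_id": job_id, "page": page,
            "entries": entries[start:start + page_size]}
```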
Technical Highlights
Joint Frame + Transcript Analysis
Rather than analysing video and audio in separate pipelines, the system aligns frames and transcript segments by timestamp before sending them to Gemini. This joint multimodal context allows the model to reason about what is said relative to what is shown — catching contradictions, on-screen text, and visual-verbal claim pairs that single-modality analysis misses.
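Alignment itself is simple interval overlap. A minimal sketch, assuming frames carry their extraction timestamps and segments are (start, end, text) triples:

```python
def align(frames: list[tuple[float, bytes]],
          segments: list[tuple[float, float, str]],  # (start_s, end_s, text)
          window_s: float = 2.0) -> list[dict]:
    """Pair each frame with the transcript spoken while it was on screen."""
    aligned = []
    for ts, jpeg in frames:
        spoken = " ".join(text for start, end, text in segments
                          if start < ts + window_s and end > ts)
        aligned.append({"t": ts, "frame": jpeg, "transcript": spoken})
    return aligned
```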
Configurable Extraction Strategies
Frame sampling rate, analysis prompt set, and output schema are all configurable per job via the API request body. Different content types require different analysis cadences — a 30-second ad clip needs different extraction density than a 40-minute interview recording.
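Sketched as a Pydantic model; the field names are illustrative, not the documented request schema:

```python
from pydantic import BaseModel

class AnalysisConfig(BaseModel):
    frame_interval_s: float = 2.0      # dense for action, sparse for talking heads
    keyframes_only: bool = False
    prompts: list[str] = ["key_moments", "classification", "brand_safety"]
    output_schema: str = "default_v1"  # named schema the report must conform to
    webhook_url: str | None = None     # optional completion callback

# A 30-second ad clip might submit {"frame_interval_s": 0.5}; a 40-minute
# interview might submit {"frame_interval_s": 10, "keyframes_only": true}.
```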
Decoupled Worker Architecture
The API layer and the analysis workers are fully decoupled: the API handles ingest and result retrieval; workers handle the compute-intensive processing pipeline. Workers can be scaled horizontally without any changes to the API, and failed jobs re-enter the queue automatically with exponential backoff.
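The retry pattern in miniature. In the real system the delay is enforced by the queue broker rather than an in-process sleep, but the bounded-attempts, doubling-delay logic is the same:

```python
import time
from typing import Callable

def run_with_backoff(job_fn: Callable[[], None],
                     max_attempts: int = 5, base_delay_s: float = 30.0) -> None:
    """Retry a failed job with exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            job_fn()
            return
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface to dead-letter handling
            delay = base_delay_s * 2 ** attempt  # 30s, 60s, 120s, ...
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```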
Structured Output Enforcement
All Gemini analysis calls use structured output mode with explicit JSON schemas. This eliminates post-processing parsing failures and ensures consistent report structure regardless of video content variability — critical for downstream systems that consume the reports programmatically.
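With the google-generativeai client this looks roughly like the following; the schema fields are illustrative:

```python
import json
import google.generativeai as genai

genai.configure(api_key="GEMINI_API_KEY")  # read from the environment in practice

MOMENT_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "t": {"type": "number"},      # timestamp, seconds
            "label": {"type": "string"},  # e.g. "product reveal"
            "confidence": {"type": "number"},
        },
        "required": ["t", "label"],
    },
}

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    generation_config={
        "response_mime_type": "application/json",
        "response_schema": MOMENT_SCHEMA,
    },
)

response = model.generate_content("List the key moments you detected.")
moments = json.loads(response.text)  # parseable JSON conforming to the schema
```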
The Stack
- FastAPI (Python) for the ingest and retrieval API
- FFmpeg for frame and audio extraction
- Gemini for multimodal vision and structured analysis
- Amazon S3 for video storage
- PostgreSQL for report storage
- Docker for containerised API and worker deployment
Outcomes
- Video review throughput scaled from the human-constrained baseline to processing hours of content per minute — with no increase in headcount.
- Structured JSON reports enabled downstream systems to consume video intelligence programmatically, unlocking automated compliance checks and content scoring.
- The decoupled worker architecture allowed analysis capacity to be scaled independently during peak submission periods without touching the API or client integrations.
- Joint multimodal analysis surfaced signal categories — visual-verbal contradictions, on-screen text extraction — that single-modality review pipelines had previously missed entirely.