The Challenge
Adfluence Hub manages large volumes of video content — creator submissions, campaign recordings, performance clips — and needed to extract structured intelligence from it at scale. Manual review was the bottleneck: a human reviewer could watch perhaps 20–30 hours of video per week, creating an ever-growing backlog as content volume increased.
The requirement was to build a system that could process video programmatically: identify key moments, classify content segments, extract spoken claims, and produce a structured report — all without human intervention in the standard pipeline.
The Solution
The Video Intelligence API accepts video uploads, stores them on S3, and queues them for multimodal analysis through Gemini's vision capabilities. The system extracts frames at configurable intervals, combines them with transcribed audio, and runs a battery of analytical prompts to surface key moments, content classifications, speaker claims, and quality signals.
All processing is asynchronous — clients submit a video and poll or receive a webhook when the report is ready. The entire pipeline runs in Docker containers, making it straightforward to scale analysis workers independently of the API layer.
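A minimal client-side sketch of that submit-and-poll flow. The endpoint paths, response fields, and polling interval are illustrative assumptions, not the service's documented contract:

```python
import time
import requests

API_BASE = "https://video-intel.example.com"  # hypothetical base URL

def submit_and_wait(path: str, timeout_s: int = 1800) -> dict:
    """Upload a video, then poll until the analysis report is ready."""
    with open(path, "rb") as fh:
        resp = requests.post(f"{API_BASE}/videos", files={"file": fh})
    resp.raise_for_status()
    job_id = resp.json()["job_id"]  # the API returns immediately with a job reference

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{API_BASE}/jobs/{job_id}").json()
        if status["state"] == "complete":
            return requests.get(f"{API_BASE}/jobs/{job_id}/report").json()
        if status["state"] == "failed":
            raise RuntimeError(f"job {job_id} failed")
        time.sleep(10)  # poll every 10s; a webhook avoids this loop entirely
    raise TimeoutError(f"job {job_id} not ready after {timeout_s}s")
```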
How It Works
Videos are uploaded via the FastAPI endpoint, validated for format and size, and stored on S3 with a unique job ID. The API returns immediately with a job reference — no waiting for analysis to complete.
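The ingest step might look like the following FastAPI sketch. The bucket name, size cap, and response shape are illustrative assumptions, and the hand-off to the worker queue is elided:

```python
import uuid
from fastapi import FastAPI, HTTPException, UploadFile
import boto3

app = FastAPI()
s3 = boto3.client("s3")  # credentials taken from the environment

ALLOWED_TYPES = {"video/mp4", "video/quicktime", "video/webm"}
MAX_BYTES = 2 * 1024**3  # assumed 2 GB cap

@app.post("/videos", status_code=202)
async def upload_video(file: UploadFile) -> dict:
    if file.content_type not in ALLOWED_TYPES:
        raise HTTPException(415, f"unsupported format: {file.content_type}")
    body = await file.read()
    if len(body) > MAX_BYTES:
        raise HTTPException(413, "file exceeds size limit")
    job_id = str(uuid.uuid4())  # unique job ID doubles as the S3 key prefix
    s3.put_object(Bucket="video-intel-uploads", Key=f"{job_id}/source", Body=body)
    # enqueue_analysis(job_id)  # hand off to the worker queue (not shown)
    return {"job_id": job_id, "status": "queued"}
```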
A worker process retrieves the video from S3 and uses FFmpeg to extract frames at configurable intervals (e.g. every 2 seconds, or keyframes only). The extraction strategy is tuned per use case — dense sampling for action-heavy content, sparse for talking-head video.
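Both strategies map onto standard FFmpeg filters. A sketch, with illustrative paths and a hypothetical wrapper function:

```python
import subprocess

def extract_frames(video_path: str, out_dir: str,
                   interval_s: float = 2.0, keyframes_only: bool = False) -> None:
    """Dump JPEG frames using either sparse keyframe or dense interval sampling."""
    if keyframes_only:
        # select only I-frames; -vsync vfr keeps their original timing
        cmd = ["ffmpeg", "-i", video_path,
               "-vf", "select='eq(pict_type,I)'", "-vsync", "vfr",
               f"{out_dir}/frame_%05d.jpg"]
    else:
        # dense sampling: one frame every `interval_s` seconds
        cmd = ["ffmpeg", "-i", video_path,
               "-vf", f"fps=1/{interval_s}",
               f"{out_dir}/frame_%05d.jpg"]
    subprocess.run(cmd, check=True)
```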
Audio is extracted and transcribed in parallel with frame extraction. The transcript is segmented and timestamped, creating a text track that aligns with the visual frames for joint multimodal analysis.
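In code terms, the audio leg might look like this. The case study does not name the ASR engine, so the transcribe() hook is a placeholder; only the segment shape used downstream is shown:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    start_s: float  # segment start, seconds from video start
    end_s: float
    text: str

def extract_audio(video_path: str, wav_path: str) -> None:
    # 16 kHz mono WAV is a common ASR input format
    subprocess.run(["ffmpeg", "-i", video_path, "-vn", "-ac", "1",
                    "-ar", "16000", wav_path], check=True)

# segments = transcribe(wav_path)  # hypothetical: -> list[TranscriptSegment]
```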
Frame batches are sent to Gemini Vision alongside the aligned transcript segments. Analytical prompts target different signals: key moment detection, content classification, on-screen text extraction, speaker claim identification, and brand safety flags.
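A sketch of one analytical pass using the google-generativeai client. The model name, prompt wording, and input shapes are assumptions for illustration:

```python
import google.generativeai as genai

genai.configure(api_key="GEMINI_API_KEY")  # read from the environment in practice
model = genai.GenerativeModel("gemini-1.5-pro")

def analyse_batch(frames: list[tuple[float, bytes]], transcript_window: str) -> str:
    """Run one analytical prompt over a frame batch plus its aligned transcript."""
    parts: list = [
        "Identify key moments in these frames. Each frame is labelled "
        "with its timestamp; the transcript for the same window follows.\n"
        f"Transcript: {transcript_window}"
    ]
    for ts, jpeg_bytes in frames:
        parts.append(f"Frame at t={ts:.1f}s:")
        parts.append({"mime_type": "image/jpeg", "data": jpeg_bytes})
    return model.generate_content(parts).text
```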
Individual analysis results are aggregated into a structured JSON report with timestamped entries for each detected signal. A summary section synthesises the key findings across the full video duration.
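Structurally, aggregation is a merge-and-sort over per-signal hits. A minimal sketch with illustrative field names; the real report schema is not published in the case study:

```python
def build_report(job_id: str, results: dict[str, list[dict]]) -> dict:
    """Merge per-signal result lists into one timestamp-sorted report."""
    entries = [
        {"signal": signal, **hit}            # each hit carries a "t" timestamp
        for signal, hits in results.items()  # e.g. "key_moment", "brand_safety"
        for hit in hits
    ]
    entries.sort(key=lambda e: e["t"])
    return {
        "job_id": job_id,
        "entries": entries,
        "summary": None,  # filled by a final synthesis pass over the full video
    }
```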
The completed report is stored in PostgreSQL linked to the job ID and optionally delivered via webhook. Clients can retrieve full reports, filtered subsets (e.g. only brand safety flags), or paginated moment timelines via the REST API.
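Retrieval with signal filtering and pagination might look like the sketch below; the route shape and query parameters are assumptions, and an in-memory dict stands in for the PostgreSQL-backed store:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
REPORTS: dict[str, dict] = {}  # stand-in for the PostgreSQL lookup

@app.get("/jobs/{job_id}/report")
def get_report(job_id: str, signal: str | None = None,
               page: int = 1, page_size: int = 100) -> dict:
    report = REPORTS.get(job_id)
    if report is None:
        raise HTTPException(404, "unknown job or report not ready")
    entries = report["entries"]
    if signal:  # e.g. ?signal=brand_safety returns only those flags
        entries = [e for e in entries if e["signal"] == signal]
    start = (page - 1) * page_size
    return {"job_id": job_id, "page": page,
            "entries": entries[start:start + page_size]}
```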
Technical Highlights
Joint Frame + Transcript Analysis
Rather than analysing video and audio in separate pipelines, the system aligns frames and transcript segments by timestamp before sending them to Gemini. This joint multimodal context allows the model to reason about what is said relative to what is shown — catching contradictions, on-screen text, and visual-verbal claim pairs that single-modality analysis misses.
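Alignment itself is simple interval overlap. A minimal sketch, assuming frames carry their extraction timestamps and segments are (start, end, text) triples:

```python
def align(frames: list[tuple[float, bytes]],
          segments: list[tuple[float, float, str]],  # (start_s, end_s, text)
          window_s: float = 2.0) -> list[dict]:
    """Pair each frame with the transcript spoken while it was on screen."""
    aligned = []
    for ts, jpeg in frames:
        spoken = " ".join(text for start, end, text in segments
                          if start < ts + window_s and end > ts)
        aligned.append({"t": ts, "frame": jpeg, "transcript": spoken})
    return aligned
```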
Configurable Extraction Strategies
Frame sampling rate, analysis prompt set, and output schema are all configurable per job via the API request body. Different content types require different analysis cadences — a 30-second ad clip needs different extraction density than a 40-minute interview recording.
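Sketched as a Pydantic model; the field names are illustrative, not the documented request schema:

```python
from pydantic import BaseModel

class AnalysisConfig(BaseModel):
    frame_interval_s: float = 2.0      # dense for action, sparse for talking heads
    keyframes_only: bool = False
    prompts: list[str] = ["key_moments", "classification", "brand_safety"]
    output_schema: str = "default_v1"  # named schema the report must conform to
    webhook_url: str | None = None     # optional completion callback

# A 30-second ad clip might submit {"frame_interval_s": 0.5}; a 40-minute
# interview might submit {"frame_interval_s": 10, "keyframes_only": true}.
```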
Decoupled Worker Architecture
The API layer and the analysis workers are fully decoupled: the API handles ingest and result retrieval; workers handle the compute-intensive processing pipeline. Workers can be scaled horizontally without any changes to the API, and failed jobs re-enter the queue automatically with exponential backoff.
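The retry pattern in miniature. In the real system the delay is enforced by the queue broker rather than an in-process sleep, but the bounded-attempts, doubling-delay logic is the same:

```python
import time
from typing import Callable

def run_with_backoff(job_fn: Callable[[], None],
                     max_attempts: int = 5, base_delay_s: float = 30.0) -> None:
    """Retry a failed job with exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            job_fn()
            return
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface to dead-letter handling
            delay = base_delay_s * 2 ** attempt  # 30s, 60s, 120s, ...
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```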
Structured Output Enforcement
All Gemini analysis calls use structured output mode with explicit JSON schemas. This eliminates post-processing parsing failures and ensures consistent report structure regardless of video content variability — critical for downstream systems that consume the reports programmatically.
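With the google-generativeai client this looks roughly like the following; the schema fields are illustrative:

```python
import json
import google.generativeai as genai

genai.configure(api_key="GEMINI_API_KEY")  # read from the environment in practice

MOMENT_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "t": {"type": "number"},      # timestamp, seconds
            "label": {"type": "string"},  # e.g. "product reveal"
            "confidence": {"type": "number"},
        },
        "required": ["t", "label"],
    },
}

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    generation_config={
        "response_mime_type": "application/json",
        "response_schema": MOMENT_SCHEMA,
    },
)

response = model.generate_content("List the key moments you detected.")
moments = json.loads(response.text)  # parseable JSON conforming to the schema
```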
The Stack
- FastAPI (Python) for the ingest and retrieval API
- FFmpeg for frame and audio extraction
- Gemini for multimodal vision and structured analysis
- Amazon S3 for video storage
- PostgreSQL for report storage
- Docker for containerised API and worker deployment
Outcomes
- Video review throughput scaled from the human-constrained baseline to processing hours of content per minute — with no increase in headcount.
- Structured JSON reports enabled downstream systems to consume video intelligence programmatically, unlocking automated compliance checks and content scoring.
- The decoupled worker architecture allowed analysis capacity to be scaled independently during peak submission periods without touching the API or client integrations.
- Joint multimodal analysis surfaced signal categories — visual-verbal contradictions, on-screen text extraction — that single-modality review pipelines had previously missed entirely.