NL Analytics Adfluence Hub

Turning plain English into production-grade data transformations

A conversational agent that interprets natural language instructions, plans multi-step data operations, generates executable Pandas code, and surfaces results — without a single line of user-written code.

NL → Code Gemini-powered pipeline
Multi-step Chained operations
Session Persistent conversation state
Safe Sandboxed code execution

The Challenge

Data analysts at Adfluence Hub were spending hours on repetitive transformation work — filtering, pivoting, aggregating, and joining Excel or CSV exports before any real analysis could begin. The tooling options were either too technical (Python, SQL) or too limited (spreadsheet formulas).

What was needed wasn't another BI tool with a drag-and-drop interface. The team needed to express what they wanted — in plain language — and have the system figure out the steps, write the code, and return the result. The key constraint: the system had to be production-reliable, not a demo that collapses on edge cases.

"The goal wasn't to teach analysts to code. It was to let them stay in their domain language while the system handled the translation into execution logic."

The Solution

DataFlow is a full-stack conversational data agent. Users upload a dataset — CSV, Excel, or a database connection — describe what they want in natural language, and receive the transformed result along with the generated code that produced it.

Under the hood, Gemini decomposes the natural language instruction into a structured operation plan, generates syntactically valid Pandas code for each step, and executes it in a sandboxed environment. The conversation is stateful: follow-up instructions reference earlier transformations, letting analysts build analysis incrementally the same way they would in a conversation with a colleague.

How It Works

1
Dataset Ingestion

The user uploads a file (CSV/Excel) or connects a database. The backend profiles the dataset — infers column types, detects date formats, flags nulls — and passes this schema context to the LLM.
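The profiling step can be sketched in a few lines of Pandas. This is a minimal illustration, not the project's actual code; the function name `profile_dataset` and the exact summary fields are assumptions.

```python
import io
import pandas as pd

def profile_dataset(df: pd.DataFrame, sample_rows: int = 3) -> dict:
    """Build a compact per-column summary to pass to the LLM as schema context."""
    profile = {}
    for col in df.columns:
        series = df[col]
        profile[col] = {
            "dtype": str(series.dtype),                 # inferred column type
            "null_count": int(series.isna().sum()),     # flagged nulls
            "sample_values": series.dropna().head(sample_rows).tolist(),
        }
    return profile

# Example: profile a small CSV loaded from memory
csv = "name,signed_up,score\nAda,2024-01-05,91\nGrace,,88\n"
df = pd.read_csv(io.StringIO(csv), parse_dates=["signed_up"])
schema = profile_dataset(df)
```

A summary like this is small enough to fit in every prompt, which is what makes the schema-constrained generation described below practical.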

2
Intent Parsing

The user's natural language instruction is sent to Gemini with the dataset schema and conversation history as context. Gemini returns a structured operation plan: an ordered list of transformation steps with types, parameters, and expected outputs.
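A structured plan of this shape might look like the following sketch. The `OperationStep` type and the step vocabulary shown are illustrative assumptions, not the project's actual schema.

```python
import json
from dataclasses import dataclass

@dataclass
class OperationStep:
    step_type: str        # e.g. filter, groupby_aggregate, pivot, join
    params: dict          # parameters for that transformation
    expected_output: str  # plain-language description of the result

def parse_plan(raw: str) -> list:
    """Validate the JSON plan returned by the model into typed steps."""
    return [OperationStep(**step) for step in json.loads(raw)]

# A plan the model might return for "average score per region, top 5"
raw_plan = json.dumps([
    {"step_type": "groupby_aggregate",
     "params": {"by": "region", "agg": {"score": "mean"}},
     "expected_output": "one row per region with mean score"},
    {"step_type": "sort_and_limit",
     "params": {"by": "score", "ascending": False, "n": 5},
     "expected_output": "top 5 regions by mean score"},
])
plan = parse_plan(raw_plan)
```

Validating the model's output into a typed structure like this is what lets later stages treat each step as a discrete, auditable unit.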

3
Code Generation

Each step in the plan maps to a generated Pandas expression. Because the dataset schema is part of the prompt context that constrains generation, Gemini produces clean, commented Python with far fewer hallucinated APIs and invalid column references than unconstrained generation would yield.

4
Sandboxed Execution

The generated code runs in an isolated execution environment. Runtime exceptions are caught and fed back to Gemini for self-correction — the system retries up to three times before surfacing a graceful error to the user.
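The retry loop can be sketched as below. This is a simplified stand-in: the real system would run generated code in a proper sandbox (subprocess or container), and the stubbed generator here plays the role of a Gemini re-prompt.

```python
import pandas as pd

MAX_RETRIES = 3

def run_with_retries(df, generate_code):
    """Execute model-generated code, feeding tracebacks back for correction."""
    error = None
    for attempt in range(MAX_RETRIES):
        code = generate_code(error)              # error is None on first attempt
        namespace = {"df": df.copy(), "pd": pd}  # minimal, isolated namespace
        try:
            exec(code, namespace)                # real system: subprocess/container
            return namespace["result"]
        except Exception as exc:
            error = repr(exc)                    # appended to the re-prompt
    raise RuntimeError(f"failed after {MAX_RETRIES} attempts: {error}")

# Stub generator: first attempt references a bad column, second corrects it
attempts = iter([
    "result = df['scor'].mean()",   # KeyError -> fed back to the model
    "result = df['score'].mean()",
])
df = pd.DataFrame({"score": [80, 90, 100]})
mean = run_with_retries(df, lambda err: next(attempts))
# mean == 90.0
```

The key design point is that the error string becomes part of the next prompt, so the model corrects its own output instead of relying on hand-coded error handling.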

5
Result Delivery

The transformed DataFrame is serialised and returned to the React frontend, where it renders as an interactive table. The generated code is shown alongside — giving analysts full transparency into what the system did and why.
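Serialisation for the frontend might look like this sketch; the payload shape and the `serialise_result` name are assumptions, not the project's actual API contract.

```python
import pandas as pd

def serialise_result(df: pd.DataFrame, max_rows: int = 1000) -> dict:
    """Shape a DataFrame into a JSON-friendly payload for the React table."""
    return {
        "columns": list(df.columns),
        "rows": df.head(max_rows).to_dict(orient="records"),
        "total_rows": len(df),
        "truncated": len(df) > max_rows,   # let the UI show a truncation notice
    }

df = pd.DataFrame({"region": ["EU", "US"], "score": [91.0, 88.5]})
payload = serialise_result(df)
# payload["rows"] == [{"region": "EU", "score": 91.0},
#                     {"region": "US", "score": 88.5}]
```

Capping the row count server-side keeps large results from overwhelming the browser while still reporting the true total.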

6
Stateful Conversation

Each exchange is persisted to PostgreSQL. The conversation history and the current state of the dataset are stored so users can return to a session, fork an analysis, or compare transformations across different instruction paths.
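The persistence model can be illustrated with a simplified stand-in. The real system uses PostgreSQL via async SQLAlchemy; this sketch uses stdlib `sqlite3` purely to show the shape of the data, and the table and column names are assumptions.

```python
import json
import sqlite3

# One row per exchange, keyed by session so a conversation can be
# resumed or forked later.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE exchanges (
        session_id TEXT,
        turn INTEGER,
        instruction TEXT,
        generated_code TEXT,
        result_schema TEXT
    )
""")

def save_exchange(session_id, turn, instruction, code, schema):
    conn.execute(
        "INSERT INTO exchanges VALUES (?, ?, ?, ?, ?)",
        (session_id, turn, instruction, code, json.dumps(schema)),
    )

def load_history(session_id):
    return conn.execute(
        "SELECT turn, instruction, generated_code FROM exchanges "
        "WHERE session_id = ? ORDER BY turn", (session_id,)
    ).fetchall()

save_exchange("s1", 1, "drop null rows", "df = df.dropna()", {"cols": ["a"]})
save_exchange("s1", 2, "sort by a", "df = df.sort_values('a')", {"cols": ["a"]})
history = load_history("s1")
```

Storing the generated code alongside each instruction is what makes forking and replaying an analysis possible: a fork simply re-executes the stored code up to the branch point.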

Technical Highlights

Schema-Constrained Prompting

Every Gemini call includes the full dataset schema as structured context. Column names, inferred types, null counts, and sample values are injected into the prompt — dramatically reducing hallucinated column references and type errors compared to unconstrained generation.
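Assembling that context might look like the following sketch; the prompt wording and the `build_prompt` helper are illustrative assumptions, not the production prompt.

```python
def build_prompt(instruction: str, schema: dict, history: list) -> str:
    """Inject the dataset schema and prior turns into every model call."""
    schema_lines = [
        f"- {col}: {info['dtype']}, {info['null_count']} nulls, "
        f"e.g. {info['sample_values']}"
        for col, info in schema.items()
    ]
    return "\n".join([
        "You are a data transformation planner.",
        "Dataset schema (use ONLY these columns):",
        *schema_lines,
        "Conversation so far:",
        *history,
        f"Instruction: {instruction}",
    ])

schema = {"score": {"dtype": "int64", "null_count": 0, "sample_values": [91, 88]}}
prompt = build_prompt("average score", schema, ["user: load sales.csv"])
```

Grounding every call in the actual column names and types is what narrows the model's output space to code that can run against the real dataset.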

Self-Correcting Execution Loop

When generated code raises a runtime exception, the error traceback is appended to the conversation and Gemini is re-prompted to correct the code. This retry loop runs up to three times before failing gracefully — handling the long tail of edge cases without brittle hand-coded error logic.

Stateful Session Architecture

Conversation state, intermediate DataFrames, and the full operation history are persisted in PostgreSQL via SQLAlchemy. Sessions can be resumed, forked, or shared — and analysts can roll back to any prior transformation state with a single instruction.

Multi-Step Operation Planning

Rather than generating a single monolithic transformation, Gemini produces a structured operation plan with discrete steps. This allows partial results to surface even when a later step fails, and gives users an auditable trace of every transformation the system applied.
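Surfacing partial results from a step-wise plan can be sketched as follows. The steps here are plain callables standing in for executed plan steps, and the `execute_plan` name is an assumption.

```python
import pandas as pd

def execute_plan(df, steps):
    """Run plan steps in order; on failure, return the results completed so far."""
    completed, current = [], df
    for i, step in enumerate(steps):
        try:
            current = step(current)
            completed.append((i, current))   # auditable trace of each step
        except Exception as exc:
            return completed, f"step {i} failed: {exc!r}"
    return completed, None

df = pd.DataFrame({"region": ["EU", "EU", "US"], "score": [90, 92, 88]})
steps = [
    lambda d: d.groupby("region", as_index=False)["score"].mean(),
    lambda d: d.sort_values("missing_col"),  # fails: column does not exist
]
completed, error = execute_plan(df, steps)
# completed holds the groupby result even though the sort step failed
```

Because each step's output is retained, a failure in step N still leaves steps 0..N-1 available to show the user, together with the exact step that broke.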

The Stack

FastAPI Google Gemini Pandas PostgreSQL SQLAlchemy (async) React Vite Tailwind CSS python-jose (JWT) OpenPyXL

Outcomes

  • Analysts reduced routine data preparation time from hours to minutes — without writing a single line of code.
  • The self-correcting execution loop resolves the majority of LLM code errors automatically, with no user intervention required.
  • Full code transparency alongside results increased analyst trust in the system — they could verify what the agent did, not just accept the output.
  • Stateful sessions enabled multi-session analysis workflows, allowing teams to build on each other's work rather than starting from scratch.

Need a natural language interface for your data?

I build LLM-powered data agents that work reliably in production — not just on clean demo datasets.