Multi-Agent GenAI Analytics Platform

FastAPI · Databricks · LLM Orchestration · Cost Governance · LLM-as-Judge

Client: Large-scale eCommerce client
Role: Principal Architect

This system was designed, built, and delivered as a forward-deployed engagement. I was embedded inside the client environment as the primary technical lead and architect, owning both the engineering and the stakeholder relationship from requirements through production.


Executive Summary

Metric | Value
Concurrent users (production) | 1,000+ (architected to 10,000)
Response latency target | <100ms
Reports automated per week | 60
FTE manual effort replaced | 40 FTE
Annual operational overhead eliminated | $100K/year
LLM inference cost reduction | 65% ($8 → $2.50 per request)

1. Problem Context & Business Objective

Category managers were responsible for evaluating category performance across 12+ dimensions - traffic source, geography, device type, time frames - each with multiple levels (e.g., traffic source L1/L2), resulting in over 100 possible analytical combinations per category.

Weekly and monthly performance reviews were critical inputs for:

However, generating these insights required manual exploration of multiple dashboards and reports, making the process time-consuming, error-prone, and heavily dependent on analyst support. The evaluation process:

Stakeholders included category managers, regional managers, and sales & marketing leadership up to VP and C-suite level - with weekly platform outputs consumed directly by senior leadership for budget allocation and incentive planning decisions worth tens of millions annually.

Objective: Automate performance analysis across multiple dimensions, surface actionable insights, and reduce dependency on manual reporting while maintaining strict cost and latency constraints.


Key Constraints


2. Why LLM-Based Approach?

A rule-based system was evaluated first. This proved unsuitable:

An LLM-based evaluation layer was chosen to:


3. Data Landscape

The primary data source resided on an internal Big Data Platform (BDP) exposing TB-scale transactional datasets. Query latency was variable due to resource contention across multiple teams.

A dedicated ETL pipeline was designed to extract and materialise the required datasets on a scheduled basis - aligned with downstream consumption needs and optimised for predictable performance.

The source tables were optimised primarily for write-heavy ingestion, and no dedicated OLAP layer was available; most teams relied on ad-hoc SQL aggregations. For this system, performance reports were refreshed on a weekly cadence from controlled snapshots, so that week-over-week comparisons always ran against consistent data.


4. Feature Engineering & Data Representation

The system operated in extract-based mode rather than over a live data connection. Because performance reviews were conducted weekly and downstream actions took multiple days to implement, near-real-time data provided no additional value. Batch extraction enabled predictable performance, lower cost, and consistent snapshots.

Features were constructed by aggregating transactional data into an OLAP-style representation - SQL aggregations materialised as partitioned Parquet files optimised for downstream processing.
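The aggregation step can be sketched as follows. This is a minimal illustration, not the production pipeline: an in-memory SQLite table stands in for the BDP source, and the table name `transactions`, its columns, the sample rows, and the helpers `aggregate_weekly` and `partition_by_week` are all hypothetical. Writing each partition out as a Parquet file is left out to keep the sketch dependency-free.

```python
import sqlite3

# Hypothetical raw rows: (week, category, traffic_source, device, revenue)
ROWS = [
    ("2024-W01", "shoes", "organic", "mobile", 120.0),
    ("2024-W01", "shoes", "organic", "mobile", 80.0),
    ("2024-W01", "shoes", "paid", "desktop", 50.0),
    ("2024-W02", "shoes", "organic", "mobile", 150.0),
]


def aggregate_weekly(rows):
    """Collapse raw transactions into one row per (week, category, source, device)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (week TEXT, category TEXT, "
        "traffic_source TEXT, device TEXT, revenue REAL)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT week, category, traffic_source, device, "
        "COUNT(*) AS orders, SUM(revenue) AS revenue "
        "FROM transactions "
        "GROUP BY week, category, traffic_source, device "
        "ORDER BY week, category, traffic_source, device"
    )
    result = cursor.fetchall()
    conn.close()
    return result


def partition_by_week(aggregates):
    """Group aggregated rows by week, mirroring a week-partitioned file layout."""
    partitions = {}
    for row in aggregates:
        partitions.setdefault(row[0], []).append(row[1:])
    return partitions
```

Partitioning by week keeps each snapshot self-contained, which is what makes week-over-week comparisons cheap: a comparison reads two partitions rather than rescanning the source.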

Feature categories:

An external market-pulse signal was introduced to capture category-level trends from public internet sources, allowing the model to contextualise internal performance with external demand conditions - directly influencing marketing budget decisions.

All feature values were normalised with explicit unit annotations (currency, percentage points). Missing values were imputed using metric-specific defaults based on business relevance - ensuring absence of data did not distort downstream reasoning.
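A minimal sketch of unit annotation and metric-specific imputation, under stated assumptions: the metric names, the `USD` and `pp` unit labels, the default values, and the explicit `imputed` flag are hypothetical illustrations rather than the values used in the engagement.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-metric units and defaults; the real defaults were business decisions.
METRIC_SPECS = {
    "revenue": {"unit": "USD", "default": 0.0},
    "conversion_rate_change": {"unit": "pp", "default": 0.0},
}


@dataclass(frozen=True)
class Feature:
    name: str
    value: float
    unit: str
    imputed: bool


def build_feature(name: str, raw: Optional[float]) -> Feature:
    """Attach the metric's unit and fall back to its default when the value is missing."""
    spec = METRIC_SPECS[name]
    if raw is None:
        return Feature(name, spec["default"], spec["unit"], imputed=True)
    return Feature(name, float(raw), spec["unit"], imputed=False)


def render(feature: Feature) -> str:
    """Format a feature as a single prompt-ready line with its unit."""
    suffix = " (imputed)" if feature.imputed else ""
    return f"{feature.name}: {feature.value:.2f} {feature.unit}{suffix}"
```

Carrying the unit with every value means the model never has to guess whether 1.5 is a percentage or a percentage-point change, and marking imputed values keeps a default from being read as a real observation.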


5. Model Choice

The solution operated within enterprise AI governance constraints. All LLMs were centrally managed by a platform team accountable for responsible AI, security, and compliance. Model selection was limited to the approved options: GPT-4o and Gemini.

A comparative evaluation was conducted using representative performance datasets. Both models were prompted with identical structured inputs and reviewed by business stakeholders using a qualitative scoring framework (1–5) for relevance, clarity, and actionability. GPT-4o consistently scored higher, particularly in synthesising multi-dimensional signals into concise insights, and was selected as the primary model.

Model selection was decoupled from application logic through an abstraction layer, enabling endpoint switching without changes to downstream pipelines.
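One way such an abstraction layer could look is sketched below. The names `InsightModel`, `EchoModel`, `MODEL_REGISTRY`, `get_model`, and `summarise` are hypothetical, and `EchoModel` is a stand-in that returns a tagged echo instead of calling a governed endpoint.

```python
from typing import Callable, Dict, Protocol


class InsightModel(Protocol):
    """Interface the application depends on; providers are hidden behind it."""

    def generate(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for a governed LLM endpoint (hypothetical, makes no network call)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Registry keyed by the model name found in configuration.
MODEL_REGISTRY: Dict[str, Callable[[], InsightModel]] = {
    "gpt-4o": lambda: EchoModel("gpt-4o"),
    "gemini": lambda: EchoModel("gemini"),
}


def get_model(config: Dict[str, str]) -> InsightModel:
    """Resolve the configured model; unknown names raise KeyError."""
    return MODEL_REGISTRY[config["model"]]()


def summarise(config: Dict[str, str], facts: str) -> str:
    """Downstream code calls the interface and never names a provider."""
    return get_model(config).generate(f"Summarise: {facts}")
```

Switching endpoints then becomes a configuration change (`{"model": "gemini"}` instead of `{"model": "gpt-4o"}`), with no edits to the calling pipeline.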


6. System Architecture

The system follows a modular, service-oriented architecture with strict separation between user-facing services and AI execution, enabling independent scaling, cost control, and governance.

Architecture Diagram

Technology Stack Decisions

Language: Python - tight integration with data processing, feature engineering, and AI workflows; used consistently across ETL, backend services, and AI logic.

Backend Framework: FastAPI - native async request handling, strong typing via Pydantic, low overhead, clear API contracts.

Frontend: React - fine-grained control over user interactions, role-based UI rendering, and clean separation between presentation and backend logic. Streamlit was evaluated and rejected: limited support for complex interactions, constraints on data access control, and challenges scaling to multi-user enterprise applications.


High-Level Components

Frontend (React)

Application Backend (FastAPI)

AI Service (FastAPI)

Data Layer


7. Key Design Decision: Custom Async Router vs LangGraph

This was the most consequential architectural decision in the engagement - and the one that separates production-grade AI systems from well-intentioned prototypes. Choosing a popular framework because it exists is not engineering judgment. Rejecting it after benchmarking because it violates a production SLA is.

LangGraph was evaluated for agentic routing in the initial design phase. Benchmarking under production load conditions showed that the framework introduced latency overhead that conflicted with the <100ms end-to-end response requirement.

Decision: Replaced LangGraph with a purpose-built asynchronous FastAPI routing layer.

Why this was the right call:

This decision reflects a deliberate trade-off: less framework abstraction in exchange for predictable latency, operational control, and cost efficiency at scale.
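The shape of a purpose-built async router can be sketched with the standard library alone. This is an illustration under assumptions, not the delivered code: the intent names, handler bodies, and the `route` and `dispatch` helpers are hypothetical, and in a FastAPI service `dispatch` would be awaited from an async endpoint. The 0.1-second default timeout mirrors the <100ms target.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Handler = Callable[[dict], Awaitable[dict]]

# Intent name -> coroutine handler; a plain dict lookup replaces graph traversal.
ROUTES: Dict[str, Handler] = {}


def route(intent: str):
    """Decorator that registers a handler for one intent."""
    def register(handler: Handler) -> Handler:
        ROUTES[intent] = handler
        return handler
    return register


@route("category_summary")
async def category_summary(payload: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for an awaitable cache or store lookup
    return {"agent": "category_summary", "category": payload["category"]}


@route("anomaly_check")
async def anomaly_check(payload: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for an awaitable cache or store lookup
    return {"agent": "anomaly_check", "category": payload["category"]}


async def dispatch(intent: str, payload: dict, timeout_s: float = 0.1) -> dict:
    """Route to the registered handler and enforce a hard latency budget."""
    handler = ROUTES.get(intent)
    if handler is None:
        return {"error": f"unknown intent: {intent}"}
    try:
        return await asyncio.wait_for(handler(payload), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": "timeout", "intent": intent}
```

For example, `asyncio.run(dispatch("category_summary", {"category": "shoes"}))` resolves the handler with a single dictionary lookup, and any handler that exceeds the budget is cancelled rather than allowed to stall the request.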


8. LLM Evaluation Framework

LLM-as-Judge Design

A dedicated LLM-as-Judge evaluation pipeline was implemented to assess output quality before serving to end users.

Evaluation dimensions scored per-output:

Outputs below threshold were flagged for regeneration or human review, rather than served directly.
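The gating logic can be sketched as below. The dimension names, the 1-5 thresholds, the retry limit, and the policy of regenerating first and escalating to human review afterwards are all assumptions made for illustration; `generate` and `judge` are hypothetical callables standing in for the generation and judge model calls.

```python
from typing import Callable, Dict, List

# Hypothetical dimensions and minimum scores on a 1-5 scale.
THRESHOLDS: Dict[str, float] = {"relevance": 4.0, "clarity": 3.5, "actionability": 3.5}
MAX_ATTEMPTS = 2  # assumed regeneration budget before escalating to a human


def failing_dimensions(scores: Dict[str, float]) -> List[str]:
    """Return the dimensions whose score is below threshold (missing counts as failing)."""
    return sorted(d for d, minimum in THRESHOLDS.items() if scores.get(d, 0.0) < minimum)


def gate(generate: Callable[[], str], judge: Callable[[str], Dict[str, float]]) -> dict:
    """Serve an output only if the judge passes it; otherwise regenerate, then escalate."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        output = generate()
        failed = failing_dimensions(judge(output))
        if not failed:
            return {"status": "serve", "output": output, "attempts": attempt}
    # Every attempt failed: hand the last output and its failing dimensions to a reviewer.
    return {
        "status": "human_review",
        "output": output,
        "failed": failed,
        "attempts": MAX_ATTEMPTS,
    }
```

Returning the failing dimensions alongside the output gives the reviewer a starting point instead of an unexplained rejection.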

Human-in-the-Loop Review


9. Prompt Engineering & Input Representation

Prompts are constructed using structured templates - not raw text - to control token usage and improve determinism.

Prompt Design Principles

Prompt templates are versioned and managed independently from application code for iterative refinement.
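A minimal sketch of a versioned, structured template using the standard library's `string.Template`. The template name, version label, wording, and the `render_prompt` helper are hypothetical; in practice the templates would live in their own store rather than in a module-level dictionary.

```python
from string import Template
from typing import Dict

# Templates keyed by (name, version) so a prompt change never requires a code change.
PROMPT_TEMPLATES = {
    ("category_review", "v1"): Template(
        "Category: $category\n"
        "Period: $period\n"
        "Metrics:\n$metrics\n"
        "Task: summarise the three most important changes."
    ),
}


def render_prompt(name: str, version: str, category: str, period: str,
                  metrics: Dict[str, str]) -> str:
    """Fill a template; substitute() raises KeyError if a placeholder is left unfilled."""
    template = PROMPT_TEMPLATES[(name, version)]
    # Sorted keys give a stable line order, so identical inputs yield identical prompts.
    metric_lines = "\n".join(f"- {key}: {value}" for key, value in sorted(metrics.items()))
    return template.substitute(category=category, period=period, metrics=metric_lines)
```

Fixed field order and one metric per line keep token counts predictable, and pinning a version per request makes a change in output quality traceable to a specific template revision.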


10. API Design & Contracts

Application Backend APIs

AI Service APIs

Strict schema validation at all service boundaries mitigates prompt injection risk and ensures stable integration.
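Boundary validation of this kind can be sketched as follows. In the FastAPI services Pydantic models would be the natural tool; the sketch uses a standard-library dataclass to stay dependency-free. The field names, patterns, and allowed dimension values are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical closed vocabularies and formats for request fields.
ALLOWED_DIMENSIONS = {"traffic_source", "geography", "device_type"}
EXPECTED_FIELDS = {"category", "week", "dimension"}
WEEK_PATTERN = re.compile(r"\d{4}-W\d{2}")
CATEGORY_PATTERN = re.compile(r"[a-z0-9_]{1,64}")


@dataclass(frozen=True)
class InsightRequest:
    category: str
    week: str
    dimension: str

    def __post_init__(self) -> None:
        # Identifier-only fields leave no room for free text to reach the prompt.
        if not CATEGORY_PATTERN.fullmatch(self.category):
            raise ValueError("category must be a short lowercase identifier")
        if not WEEK_PATTERN.fullmatch(self.week):
            raise ValueError("week must look like 2024-W02")
        if self.dimension not in ALLOWED_DIMENSIONS:
            raise ValueError("unsupported dimension")


def parse_request(payload: dict) -> InsightRequest:
    """Reject missing, extra, or non-string fields before building the request."""
    if set(payload) != EXPECTED_FIELDS:
        raise ValueError("payload must contain exactly: category, week, dimension")
    if not all(isinstance(value, str) for value in payload.values()):
        raise ValueError("all fields must be strings")
    return InsightRequest(**payload)
```

A payload such as `{"category": "ignore previous instructions", ...}` is rejected at the boundary because it does not match the identifier pattern, so it never reaches prompt construction.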


11. Performance, Scaling & Cost Controls

Scaling Strategy

Cost Controls


12. Reliability & Failure Handling

Failure Scenarios

Mitigations


13. Security & Governance


14. Monitoring & Observability

Metrics Tracked

Logs and metrics were used to continuously refine prompt design and cost–performance trade-offs.


15. Deployment & Environment Strategy


16. Key Design Decisions Summary

Decision | Choice | Rationale
Inference mode | Batch over real-time | Cost, consistency, matches weekly decision cadence
Agent routing | Custom FastAPI async router | Lower latency than LangGraph at production scale
LLM evaluation | LLM-as-Judge + human review loop | Output quality assurance before serving
Frontend | React over Streamlit | Enterprise RBAC, multi-user, separation of concerns
Data access mode | Extract-based OLAP snapshots | Predictable performance, cost control, WoW consistency
AI service isolation | Dedicated FastAPI service | Independent scaling, governance, provider switching

17. Lessons Learned