A case study in eval-driven AI engineering.
RAGbot is a multi-tenant retrieval-augmented generation (RAG) assistant deployed in production across two SaaS verticals: a mid-market FP&A platform and a SaaS ERP. It serves 200+ stakeholders, has reduced customer query volume by 40%, and runs as a Slack-native interface backed by FastAPI, Qdrant, and a hybrid retrieval pipeline.
This article describes the production system that followed an earlier proof-of-concept exploring domain-specific retrieval challenges in enterprise financial platforms. Where that work validated feasibility — chunking strategy, terminology gaps, knowledge curation cost — RAGbot addresses what production requires: tenant isolation, citation traceability, numerical eval gates, and operational observability.
The treatment is technical and descriptive. The aim is to document architecture, design decisions, and the evaluation methodology that gates every deployment.
Implementation consultants at a partner consultancy spent significant time answering repetitive product questions — questions whose answers existed in product documentation, knowledge base articles, and training material, but were hard to surface quickly. The cost was twofold: consultant time spent on retrieval rather than higher-value work, and slower implementation cycles for end clients.
Off-the-shelf chatbots were rejected for three reasons:
Multi-tenancy. Two distinct SaaS products, two distinct knowledge bases, two distinct user populations. A single shared instance would either contaminate retrieval or require operational duplication.
Citation requirements. Consultants needed answers traceable to source documents. Hallucinated or unsourced responses were unacceptable in a context where incorrect guidance could delay client go-lives.
Eval discipline. The system needed measurable accuracy before reaching production users — not vibes-based assessment from demo sessions.
RAGbot was built to address all three.
RAGbot uses a hybrid retrieval pipeline combining BM25 sparse retrieval and dense vector retrieval, fused using weighted Reciprocal Rank Fusion (RRF). The vector store is Qdrant, with per-tenant collections ensuring complete data isolation between products.
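To make the fusion step concrete, here is a minimal sketch of weighted RRF in plain Python. The retriever weights, the constant `k=60`, and the document IDs are illustrative assumptions, not RAGbot's production values.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    rankings: dict mapping retriever name -> list of doc IDs, best first.
    weights:  dict mapping retriever name -> non-negative weight (assumed values).
    k:        smoothing constant; 60 is a commonly used default.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each retriever contributes weight / (k + rank) for every doc it returns.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by doc ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the sparse and dense retrievers.
rankings = {
    "bm25": ["doc_a", "doc_b", "doc_c"],
    "dense": ["doc_b", "doc_d", "doc_a"],
}
weights = {"bm25": 0.4, "dense": 0.6}
fused = weighted_rrf(rankings, weights)
```

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalised onto a common scale.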
Embeddings are generated using OpenAI's text-embedding-3-small. Documents are split on headings using LangChain's MarkdownHeaderTextSplitter, with chunks capped at 1,024 characters and the heading hierarchy preserved in chunk metadata for citation accuracy. Content-hash deduplication prevents redundant indexing when documentation is re-ingested.
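The chunking and deduplication ideas can be sketched without any framework. This is a simplified stand-in for the LangChain splitter, not its implementation. The function names, the fixed-width size cap, and the choice to hash only the chunk text are all assumptions made for illustration.

```python
import hashlib
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text, max_chars=1024):
    """Yield (heading_path, chunk_text) pairs: split on Markdown headings,
    then cap each section at max_chars."""
    path = []    # current heading hierarchy, e.g. ["Journals", "Intercompany"]
    buffer = []  # body lines collected under the current heading

    def flush():
        body = "\n".join(buffer).strip()
        buffer.clear()
        # Naive fixed-width cap; a real splitter would prefer sentence boundaries.
        for start in range(0, len(body), max_chars):
            yield tuple(path), body[start:start + max_chars]

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            yield from flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    yield from flush()


def new_chunks(text, seen_hashes):
    """Return chunks whose SHA-256 is not yet in seen_hashes, recording each one."""
    fresh = []
    for heading_path, chunk in chunk_markdown(text):
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical content already indexed
        seen_hashes.add(digest)
        fresh.append({"headings": heading_path, "text": chunk, "hash": digest})
    return fresh


doc = "# Journals\nOverview text.\n## Intercompany\nHow to post.\n"
seen = set()
first = new_chunks(doc, seen)   # both sections are new
second = new_chunks(doc, seen)  # re-ingesting the same document adds nothing
```

Hashing the chunk text alone means identical passages under different headings collapse to one entry. Including the heading path in the hash would keep them separate, at the cost of re-indexing when a heading is renamed.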
This represents a deliberate evolution from the POC phase, which used dense retrieval only (ChromaDB, semantic search). Hybrid retrieval with heading-aware metadata was the primary lever for moving accuracy toward production thresholds.
Generation uses Anthropic's Claude models, selected for grounding quality and citation behaviour. Prompts are versioned and hosted in LangFuse, which allows prompt iteration without code deployment — a critical capability for production tuning.
The user-facing layer is a Slack bot built with Bolt for Python. Thread-aware multi-turn context allows users to refine questions naturally:
User: How do I configure intercompany journals?
RAGbot: [answers with citations]
User: What about for foreign currency?
RAGbot: [answers in context of the prior turn]
Session IDs use the channel:thread_ts format, which lets LangFuse trace every turn within a conversation as a coherent unit.
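A session ID of this shape can be derived directly from the Slack message event. The sketch below assumes a top-level message (which carries no `thread_ts`) starts a thread keyed on its own `ts`; that fallback and the helper name `session_id` are assumptions, not confirmed behaviour.

```python
def session_id(event):
    """Build a 'channel:thread_ts' session ID from a Slack message event dict."""
    # Replies carry thread_ts; a top-level message falls back to its own ts
    # (an assumption about how the first turn of a conversation is keyed).
    thread_ts = event.get("thread_ts") or event["ts"]
    return f"{event['channel']}:{thread_ts}"


# Hypothetical events: the opening question and a follow-up in the same thread.
opening = {"channel": "C123", "ts": "1700000000.000100"}
reply = {"channel": "C123", "ts": "1700000050.000200", "thread_ts": "1700000000.000100"}
```

Under that assumption, both events map to the same session, so every turn of the thread lands in one trace.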
The interface choice reflects where consultants already work. A separate web UI would add adoption friction; Slack integration met users in their existing workflow.
For the ERP-side RAGbot, the architecture extends beyond pure RAG. The system needs to take actions in the ERP — creating departments, raising journals, configuring records — not just answer questions.
LangGraph was the obvious candidate. It models agent behaviour as a directed graph: nodes for tool calls, edges for routing decisions, shared state passed between steps. For a system that already uses LangChain elsewhere, LangGraph offers a unified stack and a natural expression of multi-step workflows. Tool orchestration, retry logic, and conditional branching are first-class.
The tradeoff is where enforcement lives. In a LangGraph implementation, the agent graph is the execution path. Approval gates, idempotency keys, schema validation, and audit logging either become graph nodes — mixed with LLM routing decisions — or wrap the graph externally, adding complexity without clear separation. State that must survive agent restarts — pending approvals, in-flight journal entries, deduplication keys — tends to accumulate in graph state or ad hoc stores attached to the graph. For read-only RAG, that is acceptable. For ERP mutations, it creates a coupling problem: the same structure that gives the LLM flexibility also becomes the audit surface.
Agent-on-middleware inverts the dependency. The agent orchestrates freely via tool calls; a FastAPI middleware layer intercepts every call before it reaches the ERP API. Enforcement is deterministic and testable without invoking the model.
| Concern | LangGraph | Agent-on-middleware |
|---|---|---|
| Orchestration | Expressive; graph encodes routing logic | Expressive; LLM selects tools freely |
| Enforcement | Entangled with graph structure or wrapped externally | Isolated in middleware; independent of model output |
| Durable state | Graph state / checkpointer | Postgres with explicit schema |
| Audit replay | Reconstruct graph execution path | Query append-only audit log |
| Testability | Integration tests against graph | Unit tests on middleware; mock agent for E2E |
The decision criterion was operational risk, not expressiveness. ERP-side RAGbot performs writes. Writes require approval queues, idempotent API calls, schema validation against the ERP's OpenAPI spec, and append-only audit logs. Those properties are easier to guarantee in middleware that the LLM cannot route around than in graph nodes whose edges the model influences at runtime.
The resulting pattern separates orchestration (which benefits from LLM flexibility) from enforcement (which benefits from deterministic code). The agent can reason about what to do; the middleware controls whether and how it happens. Every action is logged, every destructive operation is approval-gated, and every API call is idempotent.
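The enforcement side can be illustrated with a small, framework-free sketch. It stands in for the FastAPI middleware and uses in-memory structures where the article describes Postgres. The tool names, the `Gate` class, and the key derivation are hypothetical, and schema validation is omitted for brevity.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical names for tools that mutate ERP data.
DESTRUCTIVE_TOOLS = {"create_department", "raise_journal"}


@dataclass
class Gate:
    audit_log: list = field(default_factory=list)  # only ever appended to
    results: dict = field(default_factory=dict)    # idempotency key -> prior result
    approved: set = field(default_factory=set)     # keys a human has approved

    @staticmethod
    def idempotency_key(tool, args):
        # The same tool and arguments always produce the same key.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def handle(self, tool, args, execute):
        key = self.idempotency_key(tool, args)
        if key in self.results:
            outcome = {"status": "replayed", "result": self.results[key]}
        elif tool in DESTRUCTIVE_TOOLS and key not in self.approved:
            outcome = {"status": "pending_approval", "key": key}
        else:
            self.results[key] = execute(tool, args)
            outcome = {"status": "executed", "result": self.results[key]}
        # Every decision is logged, whether or not the ERP was called.
        self.audit_log.append({"tool": tool, "key": key, "status": outcome["status"]})
        return outcome


calls = []


def fake_erp(tool, args):
    """Stand-in for the ERP API client."""
    calls.append(tool)
    return {"ok": True}


gate = Gate()
first = gate.handle("raise_journal", {"amount": 100}, fake_erp)   # held for approval
gate.approved.add(first["key"])
second = gate.handle("raise_journal", {"amount": 100}, fake_erp)  # executed once
third = gate.handle("raise_journal", {"amount": 100}, fake_erp)   # replayed, no ERP call
```

None of this logic touches the model, so it can be unit-tested with a stub in place of the agent.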
The pattern reflects prior experience architecting enterprise systems in regulated environments: the LLM proposes; deterministic code disposes.
RAGbot is built on an eval-driven workflow. Every deployment to production gates on numerical thresholds — not subjective review.
Two curated datasets, one per vertical, define the accuracy baseline. Each question has a known-good answer and a known relevant source document, and each response is scored on a 1–10 scale by a structured LLM judge.
Current production scores:
| Vertical | Score | Scope |
|---|---|---|
| FP&A RAGbot | 9.1/10 | 100 production questions (post-CoT) |
| ERP RAGbot | 9.21/10 | 98 scored questions (baseline) |
These scores represent the outcome of iterative retrieval tuning, prompt versioning, and chain-of-thought judge calibration — not initial deployment quality.
Every deployment runs the full RAGAS suite:
| Metric | Score | Interpretation |
|---|---|---|
| Faithfulness | 0.988 | Answers grounded in retrieved context |
| Answer Relevancy | 0.838 | Answers address the actual question |
| Context Precision | 1.000 | Retrieved documents are relevant |
| Context Recall | 0.882 | Relevant documents are retrieved |
Context precision of 1.000 on the evaluation set indicates that the hybrid retrieval pipeline consistently ranked relevant chunks ahead of irrelevant ones. Irrelevant-document noise is a known weakness of dense-only retrieval in domain-specific corpora.
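A deployment gate over these metrics reduces to a threshold comparison. In the sketch below the scores are taken from the table above, while the threshold values and the function name `failing_metrics` are illustrative assumptions; the article does not state the actual gate values.

```python
def failing_metrics(scores, thresholds):
    """Return the names of metrics that are missing or below their minimum."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if scores.get(name, float("-inf")) < minimum
    )


# Scores as reported in the RAGAS table.
scores = {
    "faithfulness": 0.988,
    "answer_relevancy": 0.838,
    "context_precision": 1.000,
    "context_recall": 0.882,
}
# Assumed thresholds, for illustration only.
thresholds = {
    "faithfulness": 0.95,
    "answer_relevancy": 0.80,
    "context_precision": 0.95,
    "context_recall": 0.85,
}
failures = failing_metrics(scores, thresholds)  # empty list means the gate passes
```

In a CI job, a non-empty result would typically translate into a non-zero exit code so that the pipeline stops before deployment.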
A QA bot reviews every production response and tags it with a four-level grounding confidence enum: grounded, plausible, unverifiable, unsupported.
This replaced an earlier binary hallucination flag. The finer granularity provides a more useful signal for tuning: plausible responses may need prompt adjustment; unsupported responses indicate retrieval or grounding failures requiring investigation.
from enum import Enum


class GroundingConfidence(str, Enum):
    GROUNDED = "grounded"          # claim fully supported by cited source(s)
    PLAUSIBLE = "plausible"        # partially supported; warrants review
    UNVERIFIABLE = "unverifiable"  # sources retrieved; claim not confirmed
    UNSUPPORTED = "unsupported"    # claim absent from retrieved context
Every production response carries a grounding_confidence field. Responses tagged unsupported are excluded from user-facing output and logged for retrieval tuning.
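The exclusion rule can be sketched as a small routing function. The enum is repeated so the example runs on its own. The response dict shape, the `answer` key, and returning `None` for withheld responses are assumptions, since the article does not say what the user sees instead.

```python
from enum import Enum


class GroundingConfidence(str, Enum):
    GROUNDED = "grounded"
    PLAUSIBLE = "plausible"
    UNVERIFIABLE = "unverifiable"
    UNSUPPORTED = "unsupported"


def user_facing_text(response, tuning_log):
    """Return the answer text, or None when the response must be withheld."""
    confidence = GroundingConfidence(response["grounding_confidence"])
    if confidence is GroundingConfidence.UNSUPPORTED:
        tuning_log.append(response)  # kept for retrieval tuning
        return None                  # caller decides what to show instead
    return response["answer"]


# Hypothetical responses as tagged by the QA bot.
tuning_log = []
shown = user_facing_text(
    {"answer": "Use the intercompany journal type.", "grounding_confidence": "grounded"},
    tuning_log,
)
withheld = user_facing_text(
    {"answer": "Unbacked claim.", "grounding_confidence": "unsupported"},
    tuning_log,
)
```

Parsing the field through the enum means an unexpected tag raises a `ValueError` rather than silently passing through.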
Every production session is traced in LangFuse: retrieval calls, prompt versions, token usage, latency, and final response. Regressions are caught in evals, not in production user complaints.
Early in the FP&A vertical deployment, a query about consolidation logic returned a plausible-sounding answer that cited the wrong source document — the retrieved chunk contained similar terminology but addressed a different entity hierarchy. The response passed a binary hallucination check because the language was grounded in some retrieved text.
The incident drove three changes: mandatory source citation with document ID verification, the four-level grounding confidence enum, and golden-dataset questions specifically targeting terminological ambiguity within the product domain. Context precision reached 1.000 after hybrid retrieval was deployed.
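One plausible reading of document ID verification is a set comparison between the IDs an answer cites and the IDs of the chunks actually retrieved. The sketch below takes that reading; the `doc_id` field and the function name are assumptions. It confirms only that a cited source was retrieved, not that the source supports the claim, which remains the job of the grounding confidence tag.

```python
def unverified_citations(cited_ids, retrieved_chunks):
    """Return cited document IDs that match no retrieved chunk."""
    retrieved_ids = {chunk["doc_id"] for chunk in retrieved_chunks}
    return sorted(set(cited_ids) - retrieved_ids)


# Hypothetical retrieval result and citations extracted from an answer.
retrieved = [
    {"doc_id": "kb-101", "text": "Consolidation by legal entity."},
    {"doc_id": "kb-205", "text": "Consolidation by reporting hierarchy."},
]
ok = unverified_citations(["kb-101"], retrieved)
bad = unverified_citations(["kb-101", "kb-999"], retrieved)
```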
Production AI delivery. Not a demo, not a notebook — a live multi-tenant system serving real users with measurable business outcomes (40% query reduction, faster implementation cycles).
Retrieval engineering depth. Hybrid retrieval with RRF, per-tenant vector isolation, structured chunking with heading-aware metadata, content-hash deduplication.
Eval-driven development. Golden datasets, RAGAS metrics, grounding confidence scoring, LangFuse tracing. Deployment is gated on numerical thresholds.
Agent architecture. A considered choice between LangGraph and agent-on-middleware, made for auditability, idempotency, and separation of orchestration from enforcement.
Full-stack ownership. Backend (Python, FastAPI, Postgres, Redis, S3, Docker), AI layer (LangChain, LangGraph, Qdrant, OpenAI, Anthropic, LangFuse, RAGAS), interface (Slack Bolt, thread-aware context), deployment (Render), and CI (GitHub Actions, pytest).
Enterprise context. Multi-tenancy, audit logging, and operational risk treated as first-class concerns.
AI & Retrieval: LangChain, LangGraph, Qdrant, OpenAI text-embedding-3-small, Anthropic Claude, LangFuse, RAGAS
Backend: Python, FastAPI, Postgres, Redis, S3, Docker
Interface: Slack Bolt (production); Next.js, TypeScript (internal admin tooling)
Infra & Tooling: Render, GitHub Actions, pytest, Git
Client names, product identifiers, and consultancy details are anonymised. Metrics reflect production eval runs at time of writing.