
Building RAGbot: A Production RAG System for SaaS Verticals

May 2026•7 min read

Tags: RAG, Production AI, LangChain, Qdrant, FastAPI, Evals, Multi-Tenant

A case study in eval-driven AI engineering.

Overview

RAGbot is a multi-tenant retrieval-augmented generation (RAG) assistant deployed in production across two SaaS verticals: a mid-market FP&A platform and a SaaS ERP. It serves 200+ stakeholders, reduces customer query volume by 40%, and runs as a Slack-native interface backed by FastAPI, Qdrant, and a hybrid retrieval pipeline.

This article describes the production system that followed an earlier proof-of-concept exploring domain-specific retrieval challenges in enterprise financial platforms. Where that work validated feasibility — chunking strategy, terminology gaps, knowledge curation cost — RAGbot addresses what production requires: tenant isolation, citation traceability, numerical eval gates, and operational observability.

The treatment is technical and descriptive. The aim is to document architecture, design decisions, and the evaluation methodology that gates every deployment.

The Problem

Implementation consultants at a partner consultancy spent significant time answering repetitive product questions — questions whose answers existed in product documentation, knowledge base articles, and training material, but were hard to surface quickly. The cost was twofold: consultant time spent on retrieval rather than higher-value work, and slower implementation cycles for end clients.

Off-the-shelf chatbots were rejected for three reasons:

Multi-tenancy. Two distinct SaaS products, two distinct knowledge bases, two distinct user populations. A single shared instance would either contaminate retrieval or require operational duplication.

Citation requirements. Consultants needed answers traceable to source documents. Hallucinated or unsourced responses were unacceptable in a context where incorrect guidance could delay client go-lives.

Eval discipline. The system needed measurable accuracy before reaching production users — not vibes-based assessment from demo sessions.

RAGbot was built to address all three.

Architecture

Retrieval

RAGbot uses a hybrid retrieval pipeline combining BM25 sparse retrieval and dense vector retrieval, fused using weighted Reciprocal Rank Fusion (RRF). The vector store is Qdrant, with per-tenant collections ensuring complete data isolation between products.
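The fusion step can be illustrated with a minimal sketch of weighted RRF. The function name, the retriever weights, and the document IDs are hypothetical; the article does not state the weights or the smoothing constant used in production, so `k=60` is simply the value commonly used in the RRF literature.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    rankings: mapping of retriever name -> list of doc IDs, best first.
    weights:  mapping of retriever name -> non-negative weight (assumed values).
    k:        smoothing constant that dampens the influence of top ranks.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each retriever contributes weight / (k + rank) for a document.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

rankings = {
    "bm25": ["doc_a", "doc_b", "doc_c"],
    "dense": ["doc_b", "doc_d", "doc_a"],
}
weights = {"bm25": 0.4, "dense": 0.6}  # hypothetical weights
fused = weighted_rrf(rankings, weights)
print([doc_id for doc_id, _ in fused])  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF works on ranks rather than raw scores, the BM25 and dense scores never need to be normalised against each other; only the per-retriever weights need tuning.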

Embeddings are generated using OpenAI's text-embedding-3-small. Documents are chunked using LangChain's MarkdownHeaderTextSplitter at 1,024 characters, preserving heading hierarchy in chunk metadata for citation accuracy. Content-hash deduplication prevents redundant indexing when documentation is re-ingested.
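Content-hash deduplication can be sketched as follows. This is an illustration under assumptions, not the production ingestion code: the chunk dictionary shape, the whitespace normalisation, and the in-memory `seen` set (which would be a persistent store in practice) are all hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    # Assumption: collapse whitespace so trivial reformatting of a document
    # does not produce a new hash on re-ingestion.
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def select_new_chunks(chunks, seen_hashes):
    """Return only chunks whose content has not been indexed before.

    Mutates seen_hashes so later batches skip what this batch added.
    """
    fresh = []
    for chunk in chunks:
        digest = content_hash(chunk["text"])
        if digest in seen_hashes:
            continue  # already indexed; skip embedding and upsert
        seen_hashes.add(digest)
        fresh.append({**chunk, "content_hash": digest})
    return fresh

seen = set()  # stand-in for a persistent hash store
first = select_new_chunks(
    [{"text": "Intercompany journals post to both entities.",
      "headings": ["Journals", "Intercompany"]}],
    seen,
)
second = select_new_chunks(
    [{"text": "Intercompany  journals post to both entities.\n",
      "headings": ["Journals", "Intercompany"]},
     {"text": "Foreign currency journals require a rate type.",
      "headings": ["Journals", "Foreign currency"]}],
    seen,
)
print(len(first), len(second))  # 1 1
```

Hashing before embedding means a re-ingested document costs one hash per chunk rather than one embedding call per chunk.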

This represents a deliberate evolution from the POC phase, which used dense retrieval only (ChromaDB, semantic search). Hybrid retrieval with heading-aware metadata was the primary lever for moving accuracy toward production thresholds.

Generation

Generation uses Anthropic's Claude models, selected for grounding quality and citation behaviour. Prompts are versioned and hosted in LangFuse, which allows prompt iteration without code deployment — a critical capability for production tuning.

Slack-Native Interface

The user-facing layer is a Slack bot built with Bolt for Python. Thread-aware multi-turn context allows users to refine questions naturally:

User: How do I configure intercompany journals?
RAGbot: [answers with citations]
User: What about for foreign currency?
RAGbot: [answers in context of the prior turn]

Session IDs use channel:thread_ts format, enabling LangFuse to trace every turn within a conversation as a coherent unit.
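A minimal sketch of deriving that session ID from a Slack message event is shown below. The function name is hypothetical. It relies on Slack's convention that a top-level message has no `thread_ts`, while replies carry a `thread_ts` equal to the parent message's `ts`.

```python
from typing import Optional

def session_id(channel: str, ts: str, thread_ts: Optional[str] = None) -> str:
    """Build a channel:thread_ts session ID for tracing.

    A top-level message has no thread_ts, so its own ts is used. Replies in
    the resulting thread carry that same value as thread_ts, so every turn
    of the conversation maps to one session.
    """
    return f"{channel}:{thread_ts or ts}"

# First message in a channel, then a reply in its thread (example values).
parent = session_id("C0123456", ts="1700000000.000100")
reply = session_id("C0123456", ts="1700000050.000200",
                   thread_ts="1700000000.000100")
print(parent == reply)  # True
```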

The interface choice reflects where consultants already work. A separate web UI would add adoption friction; Slack integration met users in their existing workflow.

Agent-on-Middleware Architecture (ERP Vertical)

For the ERP-side RAGbot, the architecture extends beyond pure RAG. The system needs to take actions in the ERP — creating departments, raising journals, configuring records — not just answer questions.

LangGraph was the obvious candidate. It models agent behaviour as a directed graph: nodes for tool calls, edges for routing decisions, shared state passed between steps. For a system that already uses LangChain elsewhere, LangGraph offers a unified stack and a natural expression of multi-step workflows. Tool orchestration, retry logic, and conditional branching are first-class.

The tradeoff is where enforcement lives. In a LangGraph implementation, the agent graph is the execution path. Approval gates, idempotency keys, schema validation, and audit logging either become graph nodes — mixed with LLM routing decisions — or wrap the graph externally, adding complexity without clear separation. State that must survive agent restarts — pending approvals, in-flight journal entries, deduplication keys — tends to accumulate in graph state or ad hoc stores attached to the graph. For read-only RAG, that is acceptable. For ERP mutations, it creates a coupling problem: the same structure that gives the LLM flexibility also becomes the audit surface.

Agent-on-middleware inverts the dependency. The agent orchestrates freely via tool calls; a FastAPI middleware layer intercepts every call before it reaches the ERP API. Enforcement is deterministic and testable without invoking the model.

Concern	LangGraph	Agent-on-middleware
Orchestration	Expressive; graph encodes routing logic	Expressive; LLM selects tools freely
Enforcement	Entangled with graph structure or wrapped externally	Isolated in middleware; independent of model output
Durable state	Graph state / checkpointer	Postgres with explicit schema
Audit replay	Reconstruct graph execution path	Query append-only audit log
Testability	Integration tests against graph	Unit tests on middleware; mock agent for E2E

The decision criterion was operational risk, not expressiveness. ERP-side RAGbot performs writes. Writes require approval queues, idempotent API calls, schema validation against the ERP's OpenAPI spec, and append-only audit logs. Those properties are easier to guarantee in middleware that the LLM cannot route around than in graph nodes whose edges the model influences at runtime.

The resulting pattern:

The agent (Claude) orchestrates freely via tool calls
A FastAPI middleware layer enforces approval queues, idempotency, schema validation, and audit logging
Durable state lives in Postgres, not in the agent

This separates orchestration (which benefits from LLM flexibility) from enforcement (which benefits from deterministic code). The agent can reason about what to do; the middleware controls whether and how it happens. Every action is logged, every destructive operation is approval-gated, and every API call is idempotent.
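The split between orchestration and enforcement can be sketched in framework-agnostic Python. Everything here is an assumption made for illustration: the tool names, the `Middleware` class, and the in-memory dictionaries standing in for Postgres tables. Schema validation is omitted for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; anything listed here requires approval first.
DESTRUCTIVE_TOOLS = {"create_journal", "delete_department"}

@dataclass
class Middleware:
    results: dict = field(default_factory=dict)    # idempotency key -> result
    approved: set = field(default_factory=set)     # keys a human has approved
    audit_log: list = field(default_factory=list)  # append-only event list

    def handle(self, tool, payload, idempotency_key, execute):
        """Gate one agent tool call; `execute` performs the real ERP call."""
        # Idempotency: a repeated key returns the stored result, no re-run.
        if idempotency_key in self.results:
            self.audit_log.append((idempotency_key, tool, "replayed"))
            return self.results[idempotency_key]
        # Approval gate: destructive tools wait until the key is approved.
        if tool in DESTRUCTIVE_TOOLS and idempotency_key not in self.approved:
            self.audit_log.append((idempotency_key, tool, "pending_approval"))
            return {"status": "pending_approval"}
        result = {"status": "executed", "data": execute(payload)}
        self.results[idempotency_key] = result
        self.audit_log.append((idempotency_key, tool, "executed"))
        return result

erp_calls = []

def fake_erp(payload):
    # Stand-in for the ERP API; records each call it actually receives.
    erp_calls.append(payload)
    return {"id": len(erp_calls)}

mw = Middleware()
r1 = mw.handle("create_journal", {"amount": 100}, "key-1", fake_erp)
mw.approved.add("key-1")  # a human approves the pending action
r2 = mw.handle("create_journal", {"amount": 100}, "key-1", fake_erp)
r3 = mw.handle("create_journal", {"amount": 100}, "key-1", fake_erp)
print(r1["status"], r2["status"], len(erp_calls))  # pending_approval executed 1
```

Nothing in `handle` depends on model output beyond the tool name and payload, so the gate can be unit-tested with a stub in place of the agent.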

The pattern reflects prior experience architecting enterprise systems in regulated environments: the LLM proposes; deterministic code disposes.

Evaluation

RAGbot is built on an eval-driven workflow. Every deployment to production gates on numerical thresholds — not subjective review.

Golden Datasets

Two curated datasets define the accuracy baseline:

FP&A vertical: 128 questions
ERP vertical: 113 questions

Each question has a known-good answer and a known relevant source document, and each response is scored on a 1–10 scale by a structured LLM judge.

Current production scores:

Vertical	Score	Scope
FP&A RAGbot	9.1/10	100 production questions (post-CoT)
ERP RAGbot	9.21/10	98 scored questions (baseline)

These scores represent the outcome of iterative retrieval tuning, prompt versioning, and chain-of-thought judge calibration — not initial deployment quality.

RAGAS Metrics

Every deployment runs the full RAGAS suite:

Metric	Score	Interpretation
Faithfulness	0.988	Answers grounded in retrieved context
Answer Relevancy	0.838	Answers address the actual question
Context Precision	1.000	Retrieved documents are relevant
Context Recall	0.882	Relevant documents are retrieved

Context precision at 1.000 indicates that, on the evaluation set, the hybrid retrieval pipeline eliminated irrelevant document noise, which is a known weakness of dense-only retrieval in domain-specific corpora.
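A deployment gate over these metrics might look like the sketch below. The metric values are the ones reported in the table; the threshold values and the function name are hypothetical, since the article does not publish the actual gate thresholds.

```python
def failed_gates(metrics, thresholds):
    """Return the names of metrics below their minimum (missing counts as fail)."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    )

# Scores from the RAGAS table above.
metrics = {
    "faithfulness": 0.988,
    "answer_relevancy": 0.838,
    "context_precision": 1.000,
    "context_recall": 0.882,
}
# Hypothetical minimums; the real gate values are not stated in the article.
thresholds = {
    "faithfulness": 0.95,
    "answer_relevancy": 0.80,
    "context_precision": 0.95,
    "context_recall": 0.85,
}

failures = failed_gates(metrics, thresholds)
if failures:
    # In CI, a non-zero exit code would block the deployment step.
    raise SystemExit(f"Eval gate failed: {', '.join(failures)}")
print("Eval gate passed")
```

Treating a missing metric as a failure keeps a broken eval run from silently passing the gate.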

Grounding Confidence

A QA bot reviews every production response and tags it with a four-level grounding confidence enum: grounded, plausible, unverifiable, unsupported.

This replaced an earlier binary hallucination flag. The finer granularity provides a more useful signal for tuning: plausible responses may need prompt adjustment; unsupported responses indicate retrieval or grounding failures requiring investigation.

from enum import Enum

class GroundingConfidence(str, Enum):
    GROUNDED = "grounded"         # claim fully supported by cited source(s)
    PLAUSIBLE = "plausible"       # partially supported; warrants review
    UNVERIFIABLE = "unverifiable" # sources retrieved; claim not confirmed
    UNSUPPORTED = "unsupported"   # claim absent from retrieved context

Every production response carries a grounding_confidence field. Responses tagged unsupported are excluded from user-facing output and logged for retrieval tuning.
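The exclusion rule can be sketched as a small partition step. The enum is repeated so the example runs on its own; the function name and the response dictionary shape are assumptions, not the production schema.

```python
from enum import Enum

class GroundingConfidence(str, Enum):
    GROUNDED = "grounded"
    PLAUSIBLE = "plausible"
    UNVERIFIABLE = "unverifiable"
    UNSUPPORTED = "unsupported"

def partition_responses(responses):
    """Split responses into (deliverable, tuning_queue).

    Responses tagged unsupported are withheld from users and queued for
    retrieval tuning; all other levels are delivered.
    """
    deliverable, tuning_queue = [], []
    for response in responses:
        # Raises ValueError on an unknown tag, so bad data fails loudly.
        level = GroundingConfidence(response["grounding_confidence"])
        if level is GroundingConfidence.UNSUPPORTED:
            tuning_queue.append(response)
        else:
            deliverable.append(response)
    return deliverable, tuning_queue

responses = [
    {"id": 1, "grounding_confidence": "grounded"},
    {"id": 2, "grounding_confidence": "unsupported"},
    {"id": 3, "grounding_confidence": "plausible"},
]
deliverable, tuning_queue = partition_responses(responses)
print([r["id"] for r in deliverable], [r["id"] for r in tuning_queue])  # [1, 3] [2]
```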

LangFuse Observability

Every production session is traced in LangFuse: retrieval calls, prompt versions, token usage, latency, and final response. Regressions are caught in evals, not in production user complaints.

A Failure That Shaped the Design

Early in the FP&A vertical deployment, a query about consolidation logic returned a plausible-sounding answer that cited the wrong source document — the retrieved chunk contained similar terminology but addressed a different entity hierarchy. The response passed a binary hallucination check because the language was grounded in some retrieved text.

The incident drove three changes: mandatory source citation with document ID verification, the four-level grounding confidence enum, and golden-dataset questions specifically targeting terminological ambiguity within the product domain. Context precision reached 1.000 after hybrid retrieval was deployed.
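One way to read "document ID verification" is sketched below. This is an interpretation, not the documented implementation: the function name, the issue labels, and the document IDs are hypothetical. It checks that every cited ID was actually retrieved and, for golden-dataset questions, that the known relevant source is among the citations.

```python
def check_citations(cited_ids, retrieved_ids, expected_id=None):
    """Return a list of citation issues; an empty list means the check passed."""
    issues = []
    if not cited_ids:
        issues.append("no_citation")
    # A citation to a document that was never retrieved cannot be grounded.
    unknown = sorted(set(cited_ids) - set(retrieved_ids))
    if unknown:
        issues.append("cited_but_not_retrieved:" + ",".join(unknown))
    # For golden questions, the known relevant source must be cited.
    if expected_id is not None and expected_id not in cited_ids:
        issues.append("expected_source_missing:" + expected_id)
    return issues

# The failure described above: the cited chunk was retrieved and used similar
# terminology, but it was not the document that answers the question.
issues = check_citations(
    cited_ids=["kb-112"],
    retrieved_ids=["kb-112", "kb-207"],
    expected_id="kb-207",
)
print(issues)  # ['expected_source_missing:kb-207']
```

The first two checks can run on live traffic; the third needs a known-good source, so it applies only to golden-dataset runs.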

Engineering Capabilities Demonstrated

Production AI delivery. Not a demo, not a notebook — a live multi-tenant system serving real users with measurable business outcomes (40% query reduction, faster implementation cycles).

Retrieval engineering depth. Hybrid retrieval with RRF, per-tenant vector isolation, structured chunking with heading-aware metadata, content-hash deduplication.

Eval-driven development. Golden datasets, RAGAS metrics, grounding confidence scoring, LangFuse tracing. Deployment is gated on numerical thresholds.

Agent architecture. A considered choice between LangGraph and agent-on-middleware, made for auditability, idempotency, and separation of orchestration from enforcement.

Full-stack ownership. Backend (Python, FastAPI, Postgres, Redis, S3, Docker), AI layer (LangChain, LangGraph, Qdrant, OpenAI, Anthropic, LangFuse, RAGAS), interface (Slack Bolt, thread-aware context), deployment (Render), and CI (GitHub Actions, pytest).

Enterprise context. Multi-tenancy, audit logging, and operational risk treated as first-class concerns.

Stack

AI & Retrieval: LangChain, LangGraph, Qdrant, OpenAI text-embedding-3-small, Anthropic Claude, LangFuse, RAGAS

Backend: Python, FastAPI, Postgres, Redis, S3, Docker

Interface: Slack Bolt (production); Next.js, TypeScript (internal admin tooling)

Infra & Tooling: Render, GitHub Actions, pytest, Git

Client names, product identifiers, and consultancy details are anonymised. Metrics reflect production eval runs at time of writing.
