How to scale RAG from prototype to production
Your RAG prototype works great with 1,000 documents and three test users. But at millions of vectors and thousands of concurrent queries, things break. Response times spike, your autoscaler kicks in too late, and LLM costs spiral because every request hits the API without caching.
Vector search that felt instant during demos now times out because pure vector retrieval struggles at production scale. Without proper memory and observability, agents lose context mid-workflow, and you can't tell whether retrieval or generation caused the failure.
This is where most RAG projects stall. The jump from POC to production isn't about tweaking parameters; you typically need a different architecture. You need dual pipelines instead of single scripts, hybrid retrieval combining vector search with sparse BM25, and semantic caching that cuts LLM costs by up to 68.8% in typical production workloads. Some production systems, including in-memory architectures, can achieve single-digit millisecond P95 latencies even at billion-vector scale.
In this article, we'll cover why hybrid retrieval outperforms pure vector search, how semantic caching cuts LLM costs dramatically, and why agent memory architecture matters for production systems.
What changes when RAG moves from POC to production
Production RAG typically needs a different setup: separated indexing and query pipelines, hybrid retrieval, and complete observability with 99.9% uptime SLAs. Time-to-First-Token (TTFT) at the 90th percentile should stay under 2 seconds, or autoscaling triggers. Throughput requirements vary by two orders of magnitude depending on your retrieval strategy, so design decisions about retrieval frequency have a massive impact on end-to-end latency.
Here's what happens in production: Three months in, you're managing four different data storage layers. Your vectors live in one system, your semantic cache in another, app state scattered across a third, and operational data in a fourth. Each integration point adds latency and creates new failure modes. When your semantic cache misses, LLM costs spike. When vector search times out during peak traffic, your retry logic kicks in, adding even more latency.
The architecture often breaks first. POCs conflate offline indexing and online queries into monolithic deployments. You'll hit this wall when your indexing pipeline starts competing with live queries for resources.
Production typically requires separation: an offline pipeline for document processing and an online pipeline for retrieval. Your first instinct might be to optimize the existing monolith, but that path leads to constant firefighting.
Production systems need complete observability. Track retrieval precision across document types, measure chunk relevance, correlate request IDs across components, and monitor cache hit rates. Without full pipeline traceability, debugging becomes guesswork. You won't know if it's the vector search choking, the embedding model timing out, or the LLM API hitting rate limits.
The retrieval problem: why vector search alone fails
Pure vector search has mathematically provable limitations that more scale alone can't fix. Single-vector embeddings can't represent overlapping relationships in knowledge bases, so retrieval misses relevant evidence by design. High-dimensional sparse vectors introduce their own failure modes: clustering time grows linearly with dimension, sparse distance computation gets expensive, and there's no established method for unified hybrid search.
Keyword matching
A user searches for "ISO 27001 compliance requirements." Pure vector search might return documents about "security best practices" and "compliance frameworks," which are semantically similar but miss the specific standard. The one document that explicitly mentions ISO 27001 by name gets buried on page three because it doesn't have the richest semantic context.
BM25 catches the exact keyword match that vector search glossed over. When retrieval fails like this, support tickets spike with complaints about "the AI doesn't understand what I'm asking for." Your precision metrics look fine in aggregate, but user satisfaction drops because the most important result isn't surfacing.
Hybrid retrieval
Production RAG systems increasingly adopt this approach. The performance gains are measurable: hybrid approaches can improve recall accuracy by 1% to 9% compared to vector search alone, depending on implementation.
A common implementation pattern: run vector search and BM25 in parallel, combine results using Reciprocal Rank Fusion (which handles score incompatibility), then apply a cross-encoder reranking model for final ranking. Hybrid retrieval delivers measurable improvements but adds operational complexity since you're managing two retrieval paths instead of one.
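To make the fusion step concrete, here's a minimal sketch in Python, assuming you already have ranked document-ID lists back from your vector index and your BM25 engine; the function name and the k=60 constant are illustrative, not a specific library's API.

```python
from collections import defaultdict

def rrf_fuse(vector_hits: list[str], bm25_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked lists whose scores aren't comparable
    by scoring each document on its rank position in each list."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in (vector_hits, bm25_hits):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; pass the top results to the cross-encoder reranker.
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked well by both retrievers ("doc_42" here) rises to the top.
fused = rrf_fuse(["doc_42", "doc_7", "doc_13"], ["doc_99", "doc_42", "doc_7"])
```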
Redis supports this through its Redis Query Engine, combining vector search with traditional filtering. Embeddings live alongside operational data in memory, supporting simultaneous vector searches and exact-match filtering without data synchronization. Redis' in-memory architecture delivers low-latency performance, with benchmarks showing recall ≥0.98 at scale with competitive query times.
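As a rough illustration with the redis-py search client, the query below combines a KNN vector search with a tag filter in a single request. It assumes an index named docs_idx with a FLOAT32 vector field called embedding and a category tag field already exists; those names, the 768-dimension embedding, and the stand-in query vector are placeholders for your own schema.

```python
import numpy as np
from redis import Redis
from redis.commands.search.query import Query

r = Redis(host="localhost", port=6379)

# Stand-in for a real query embedding produced by your embedding model.
query_vec = np.random.rand(768).astype(np.float32)

# Tag filter and KNN vector search evaluated together by the Redis Query Engine.
q = (
    Query("(@category:{compliance})=>[KNN 5 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("title", "score")
    .dialect(2)
)
results = r.ft("docs_idx").search(q, query_params={"vec": query_vec.tobytes()})
for doc in results.docs:
    print(doc.id, doc.score, doc.title)
```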
Ingestion & data consistency
Getting data into your RAG system is only half the challenge. Keeping it accurate and in sync as documents change is where most teams run into trouble.
Chunking strategies that work
Many teams start with page-level chunking, which often achieves strong accuracy across document types. Common chunk sizes range from 256-512 tokens for factoid queries to 1,024 tokens for complex analytical queries, though effectiveness varies by document type, layout, and task.
You'll be tempted to use tiny 128-token chunks thinking more granularity means better precision. We've seen this backfire when chunks split mid-concept. Your retrieval starts returning sentence fragments that make no sense to the LLM. The model tries to synthesize an answer from "...in accordance with regulatory standards..." and "The board approved three new..." without the context that ties them together. Your hallucination rate spikes, and you can't figure out why until you audit your chunk boundaries.
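If you do need finer-grained chunks, keep them token-bounded and overlapping so boundaries don't strand a sentence without its context. Here's a minimal sketch assuming the tiktoken tokenizer; the 512-token size and 64-token overlap are starting points to tune, not recommended values.

```python
import tiktoken  # assumes OpenAI's tokenizer; any tokenizer with encode/decode works

def chunk_text(text: str, chunk_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping, token-bounded chunks so a concept split
    across a boundary still appears intact in at least one chunk."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```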
Keeping indexes in sync
The index synchronization problem gets real when you're updating 10,000 documents daily. Keeping indexes synchronized with evolving data typically means choosing between:
- Frequent re-indexing prioritizes freshness but adds overhead
- Continuous sync through CDC offers real-time updates but introduces complexity
- Hybrid approaches combine batch updates for historical data with stream processing for real-time changes
Most teams start with batch updates, running a full re-index every night at 2 AM when traffic is lowest. Three months later, when stakeholders complain that yesterday's product launch isn't reflected in today's search results, you start evaluating CDC pipelines.
The operational complexity triples, but your data freshness SLA goes from 24 hours to sub-minute. That trade-off often only makes sense when staleness becomes a measurable business problem, like support calls about outdated pricing or compliance risks from serving superseded policy documents.
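A middle ground before committing to full CDC is hash-based change detection on each batch run, so the nightly job only re-embeds documents whose content actually changed. A minimal sketch, assuming document IDs and text come from your source system and content hashes live in a Redis hash; the key name is a placeholder.

```python
import hashlib
from redis import Redis

r = Redis()

def needs_reindex(doc_id: str, text: str) -> bool:
    """Return True only when the document's content hash changed since the
    last batch run, so the job skips re-embedding unchanged documents."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    previous = r.hget("doc:hashes", doc_id)
    if previous is not None and previous.decode() == digest:
        return False  # unchanged since the last run
    r.hset("doc:hashes", doc_id, digest)
    return True
```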
The deletion problem
Document deletion creates a hidden challenge: when you deprecate old content, your batch pipeline might remove it from the source system but leave orphaned vectors in your index for days. Users get search results pointing to documents that no longer exist. This typically requires explicit deletion tracking and cleanup jobs, not just additive synchronization.
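A reconciliation job can close that gap. The sketch below assumes indexed chunks are stored under keys like doc:<doc_id>:<chunk_no> and that the source system can list its current document IDs; both conventions are illustrative.

```python
from redis import Redis

r = Redis()

def remove_orphaned_vectors(source_doc_ids: set[str]) -> int:
    """Delete indexed chunks whose source document no longer exists.
    Dropping the key also removes it from any index built on that prefix."""
    removed = 0
    for key in r.scan_iter(match="doc:*"):
        doc_id = key.decode().split(":")[1]
        if doc_id not in source_doc_ids:
            r.delete(key)
            removed += 1
    return removed
```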
The hidden cost of scale: semantic caching
Semantic caching reduces LLM API calls by recognizing when incoming queries are semantically similar to ones you've already answered. Instead of hitting the LLM for every request, you serve cached responses for queries that mean essentially the same thing, cutting costs by up to 68.8% in typical production workloads.
The process works like this: when a query comes in, generate an embedding, search your cache for similar queries using vector search, and return the cached response if similarity exceeds your threshold (typically 0.85-0.95). Cache hits deliver sub-100ms responses versus multi-second LLM calls, making responses up to 65x faster.
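Here's a minimal sketch of that decision logic, with the query and cached entries passed in as precomputed embeddings. In production the nearest-neighbor step runs inside Redis rather than in a Python loop, and the 0.90 threshold is just an illustrative starting point.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.90  # tune per use case: higher for precision, lower for savings

def cache_lookup(query_vec: np.ndarray, cache: list[tuple[np.ndarray, str]]) -> str | None:
    """Return a cached response if the closest cached query clears the threshold,
    otherwise None (a cache miss that falls through to the LLM)."""
    best_sim, best_answer = -1.0, None
    for cached_vec, cached_answer in cache:
        sim = float(np.dot(query_vec, cached_vec)
                    / (np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)))
        if sim > best_sim:
            best_sim, best_answer = sim, cached_answer
    return best_answer if best_sim >= SIMILARITY_THRESHOLD else None

# On a miss: call the LLM, then store (query_vec, response) so similar queries hit next time.
```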
Redis makes this practical through in-memory vector search. Sub-millisecond lookup latency is critical for semantic caching to work without adding noticeable delay, and Redis Cloud delivers exactly that. You can implement this using the RedisSemanticCache class in LangChain or use Redis LangCache for a fully managed solution.
Threshold tuning matters: high-precision use cases typically need thresholds in the 0.90-0.95 range to prioritize correctness over cache hits. High-recall scenarios can use lower thresholds (0.85-0.90) to maximize cost savings, though exact optimal values vary by use case.
Turning RAG into a production system
At scale, your vector search, caching layer, and LLM calls all need to perform consistently under load, not just during demos with a handful of test queries. Hitting performance targets requires infrastructure designed for speed from the start, not bolted on after you've already built your prototype architecture.
Observability becomes critical at scale. Beyond the basics of tracking retrieval precision and cache hit rates, you need to monitor reranking effectiveness, measure embedding quality over time, and watch hallucination rates. Without this visibility, debugging production issues turns into guesswork.
Rate limiting protects your system from cascading failures. Production deployments typically implement limits at multiple levels: user or tenant-level limits for fair usage, LLM API-level limits for cost control, vector database-level limits to handle load spikes, and system-level limits for overall infrastructure protection.
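The tenant-level tier can be as simple as a fixed-window counter in Redis; a minimal sketch, with the limit and window values as illustrative defaults. The same pattern repeats at the LLM-API and vector-database tiers with different keys and limits.

```python
from redis import Redis

r = Redis()

def allow_request(tenant_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Fixed-window rate limit: count requests per tenant and reset the
    counter when the window expires."""
    key = f"ratelimit:{tenant_id}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first request
    return count <= limit
```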
Why agent memory is a production concern
Without proper memory architecture, agents struggle to maintain coherent conversations. Your agent forgets the user said they're looking for Python solutions and starts recommending Java libraries three messages later. Or it asks "What programming language are you using?" twice in the same conversation. The user experience degrades from "intelligent assistant" to "stateless chatbot that keeps forgetting things."
Production agent memory typically needs three storage layers: short-term memory for current session state, long-term memory storing user profiles and preferences through vector embeddings, and episodic memory that helps agents learn from past interactions. Episodic memory is what allows an agent to recall that a particular approach worked well for a similar question yesterday and apply that pattern again.
The technical requirements follow from these needs: vector embedding storage for semantic memory, fast retrieval with sub-second latency, and reliable state persistence. Redis works as the memory substrate with vector search for semantic retrieval, key-value operations for state management, and pub/sub for agent communication. This unified data model doesn't require separate infrastructure for each memory type.
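As a rough illustration of the short-term layer, the sketch below keeps per-session conversation turns in a Redis list that expires when the session goes idle; the key format and TTL are placeholders. Long-term and episodic memory would store embeddings of salient facts in a vector index and retrieve them semantically, along the same lines as the retrieval example earlier.

```python
import json
from redis import Redis

r = Redis()

def remember_turn(session_id: str, role: str, content: str, ttl_seconds: int = 3600) -> None:
    """Short-term memory: append the turn to a per-session list and refresh
    its expiry so idle sessions are cleaned up automatically."""
    key = f"session:{session_id}:messages"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, ttl_seconds)

def recall_session(session_id: str, last_n: int = 20) -> list[dict]:
    """Rebuild the agent's working context from the most recent turns."""
    key = f"session:{session_id}:messages"
    return [json.loads(m) for m in r.lrange(key, -last_n, -1)]
```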
Getting RAG to production scale
The jump from prototype to production changes your entire architecture. Start with page-level chunking for accuracy, add semantic caching early, and design for hybrid retrieval from day one.
The infrastructure you choose determines whether you hit production targets or spend months firefighting. Your RAG system needs fast vector search, semantic caching that doesn't add network hops, and agent memory that scales with concurrent sessions. Managing these across separate systems creates the integration complexity that slows teams down.
Redis provides the in-memory infrastructure that production RAG systems need. Vector search, semantic caching, agent memory, and operational data live in the same unified in-memory infrastructure with no synchronization delays and no integration points that become failure modes under load.
Try Redis free to see how unified in-memory infrastructure handles your RAG workload, or talk to our team about scaling from prototype to production.
