How a Redis caching layer cut P95/P99 latency and stabilized a production Retrieval-Augmented Generation (RAG) system — without touching the LLM.
Generative AI applications are moving rapidly from experiments and proofs of concept into business-critical systems. As organizations adopt Retrieval-Augmented Generation (RAG) to power knowledge assistants, customer support bots, internal copilots, and enterprise search, expectations change.
Users no longer compare your application to another AI tool. They compare it to every digital experience they use daily.
And that means one thing: Speed matters.
Recently, while working on a production RAG implementation, we encountered a challenge that many teams eventually face. The system was providing accurate answers, but under load, response times became inconsistent. Average response times looked acceptable, but a closer look at the latency distribution told a different story.
The solution wasn’t changing the LLM.
It was optimizing the RAG architecture around it.
One of the biggest mistakes teams make when measuring AI application performance is relying solely on average response time.
An average can hide a lot of problems.
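For example, take a hypothetical workload in which 95 of 100 requests complete in 1 second and the remaining 5 take 20 seconds. The average is 1.95 seconds, which looks healthy, yet 1 in 20 requests kept a user waiting roughly ten times longer than that.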
This is why we started tracking the three latency percentiles that matter most for enterprise AI applications:
P50 latency: The response time that half of all requests meet or beat. It is a baseline for typical RAG response time.
P95 latency: The response time that 95% of requests meet or beat. P95 latency is one of the earliest signals of AI latency problems in a retrieval pipeline.
P99 latency: The response time that 99% of requests meet or beat; only the slowest 1% take longer. P99 latency exposes the architectural bottlenecks that averages and even P95 can hide.
In enterprise AI applications, P95 and P99 latency often tell a much more accurate story than averages.
These long-tail delays are exactly what users remember.
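To make the gap between the average and the tail concrete, here is a minimal sketch that computes the mean alongside P50, P95, and P99 using the nearest-rank method. The latency values are hypothetical and chosen only for illustration; they are not measurements from our system, and the `percentile` helper is an illustrative name.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that at least pct% of samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in seconds: 90 fast requests, 8 slower ones, 2 very slow ones.
latencies = [0.8] * 90 + [2.5] * 8 + [12.0] * 2

summary = {
    "mean": statistics.mean(latencies),
    "p50": percentile(latencies, 50),
    "p95": percentile(latencies, 95),
    "p99": percentile(latencies, 99),
}
print(summary)
```

In this made-up sample the mean lands a little above one second, while P95 is 2.5 seconds and P99 is 12 seconds. The mean alone gives no hint that some requests take ten times longer than the typical one.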
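Like many RAG systems, our architecture followed a familiar pattern: a user query goes through vector search and document retrieval, the retrieved context is assembled into a prompt, and the prompt is sent to the LLM.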

When we analyzed the request flow, we noticed something interesting.
A significant portion of the response time wasn't being spent inside the LLM.
Instead, it was being consumed by vector search, document retrieval, and context assembly.
In other words, the system was doing a lot of unnecessary work before the LLM even received the prompt — a common cause of RAG latency issues in production.
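One way to see where request time goes is to time each pipeline stage separately. The sketch below is a generic illustration, not our production instrumentation: the stage names and the three `fake_*` functions are hypothetical stand-ins for real vector search, prompt assembly, and LLM calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_timings = defaultdict(list)  # stage name -> list of durations in seconds


@contextmanager
def timed(stage):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage].append(time.perf_counter() - start)


# Hypothetical stand-ins for the real pipeline steps.
def fake_vector_search(query):
    return ["doc-1", "doc-2"]


def fake_build_prompt(query, doc_ids):
    return f"Context: {', '.join(doc_ids)}\nQuestion: {query}"


def fake_llm(prompt):
    return "stub answer"


def handle_request(query):
    with timed("vector_search"):
        doc_ids = fake_vector_search(query)
    with timed("context_assembly"):
        prompt = fake_build_prompt(query, doc_ids)
    with timed("llm"):
        return fake_llm(prompt)


handle_request("What is our refund policy?")
for stage, durations in stage_timings.items():
    print(stage, f"{sum(durations):.6f}s over {len(durations)} call(s)")
```

Collecting durations per stage, rather than per request, makes it possible to compute percentiles for each stage and see which one drives the tail.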
To address this, we introduced Redis as a caching layer within the retrieval pipeline.
Rather than repeatedly performing identical vector search and retrieval operations, frequently accessed context and retrieval results could be served directly from a Redis cache.
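In the updated architecture, the retrieval pipeline checks Redis first. On a cache hit, the stored retrieval results are used directly. On a miss, the pipeline runs vector search and retrieval as before and stores the result for later requests.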

The goal wasn't simply to reduce latency.
It was to improve consistency.
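The caching step can be sketched as a cache-aside lookup around the retrieval call. This is a minimal illustration under stated assumptions, not our production code: the key prefix, the five-minute TTL, and the function names are hypothetical. `InMemoryStore` is a test stand-in so the example runs without a server; a redis-py client (`redis.Redis`) offers `get` and `set(..., ex=...)` with the same shape and could be passed in its place.

```python
import hashlib
import json


class InMemoryStore:
    """Stand-in exposing only the get/set subset of a Redis client used below."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, ex=None):
        # A real Redis server would expire the key after `ex` seconds; this stub ignores it.
        self._data[key] = value
        return True


def cache_key(query):
    """Normalize case and whitespace so trivially different spellings share one key."""
    normalized = " ".join(query.lower().split())
    return "rag:retrieval:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def retrieve_with_cache(store, query, search_fn, ttl_seconds=300):
    """Return (results, was_cache_hit), calling search_fn only on a miss."""
    key = cache_key(query)
    cached = store.get(key)
    if cached is not None:
        return json.loads(cached), True  # hit: skip vector search entirely
    results = search_fn(query)  # miss: run the expensive retrieval
    store.set(key, json.dumps(results), ex=ttl_seconds)
    return results, False


search_calls = []


def fake_vector_search(query):
    # Hypothetical retrieval function that records how often it is invoked.
    search_calls.append(query)
    return [{"doc_id": "kb-1", "text": "Refunds are processed within 5 days."}]


store = InMemoryStore()
first, first_hit = retrieve_with_cache(store, "Refund policy?", fake_vector_search)
second, second_hit = retrieve_with_cache(store, "  refund   POLICY? ", fake_vector_search)
print(first_hit, second_hit, len(search_calls))
```

Note that this sketch only matches queries that are identical after normalization. Serving cached results for merely similar queries would need a similarity-based lookup, which is not shown here.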
After implementing Redis caching and benchmarking the system under realistic workloads, we observed measurable improvements across all key latency metrics.
P50: Most users experienced faster response times.
P95: Performance became significantly more consistent during periods of higher activity, a direct result of reducing redundant vector search calls.
P99: The number of long-tail response delays dropped noticeably, demonstrating how a Redis caching layer can flatten tail latency in production RAG systems.
This was particularly important because enterprise users rarely judge a system based on its average performance.
They judge it based on the slowest experiences they encounter.
Reducing those outliers had a direct impact on perceived application quality.
| Metric | Before Redis Caching | After Redis Caching |
|---|---|---|
| P50 Latency | Baseline | Improved |
| P95 Latency | Inconsistent under load | Significantly more consistent |
| P99 Latency | Noticeable long-tail delays | Reduced long-tail delays |
Caching introduces another challenge that cannot be ignored.
How do you ensure that cached information remains relevant and up to date?
In RAG systems, stale context can be just as problematic as slow responses.
A fast answer that is based on outdated information is still a bad answer.
To address this, we implemented a combination of:
Time-based expiration (TTL): Cached entries expire after predefined intervals and are rebuilt on the next request, a core cache invalidation strategy for keeping retrieval results current.
Metadata-aware validation: Changes in source documents trigger refresh logic.
Version-based cache management: Document versions help ensure that updates propagate correctly through the retrieval layer.
This allows us to improve RAG performance without compromising answer quality.
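The three freshness checks can be combined in a few lines. The sketch below is an illustration under assumptions, not our exact implementation: the key layout, field names, and function names are hypothetical, and timestamps are plain Unix seconds. With a real Redis server the TTL would usually be handled by the key's own expiry; here the expiry time is stored in the entry so the example is self-contained.

```python
import json
import time


def versioned_key(query_hash, index_version):
    # Version-based management: bumping index_version makes older entries unreachable.
    return f"rag:retrieval:v{index_version}:{query_hash}"


def make_entry(results, source_updated_at, ttl_seconds, now=None):
    """Serialize results together with the metadata needed to validate them later."""
    now = time.time() if now is None else now
    return json.dumps({
        "results": results,
        "source_updated_at": source_updated_at,
        "expires_at": now + ttl_seconds,
    })


def is_fresh(raw_entry, current_source_updated_at, now=None):
    """Return True only if the entry is within its TTL and its source is unchanged."""
    now = time.time() if now is None else now
    entry = json.loads(raw_entry)
    if now >= entry["expires_at"]:
        return False  # time-based expiration
    if current_source_updated_at > entry["source_updated_at"]:
        return False  # metadata-aware validation: source changed after caching
    return True


entry = make_entry(["chunk-a", "chunk-b"], source_updated_at=100.0, ttl_seconds=60, now=1000.0)
print(is_fresh(entry, current_source_updated_at=100.0, now=1030.0))
print(versioned_key("abc123", 2))
```

An entry that fails either check is treated as a miss, so the pipeline falls back to a fresh retrieval and overwrites the stale value.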
Many organizations spend significant time comparing models.
Should we use GPT-4?
Claude?
Llama?
Mistral?
While model selection is important, our experience continues to reinforce a different reality.
Once an AI application reaches production scale, success depends just as much on retrieval architecture, caching strategy, observability, and data quality as it does on the model itself.
The most impactful performance improvements often come from optimizing the ecosystem around the model rather than changing the model.
RAG applications are rapidly becoming a core part of enterprise software landscapes. As adoption grows, performance, scalability, and reliability become critical business requirements.
Our Redis implementation was a relatively small architectural change, but the impact on user experience was significant.
The exercise reinforced an important principle:
Production-grade AI is not just about choosing the right model. It is about designing the right system around the model.
Organizations that focus on retrieval efficiency, intelligent caching, and scalable architecture will be far better positioned to deliver AI experiences that users trust and adopt at scale.
Redis is used as a caching layer in RAG applications because it stores frequently accessed retrieval results and context in memory, eliminating the need to repeat vector search and document retrieval for identical or similar queries. This directly reduces P95 and P99 latency in production RAG systems.
P50 latency is the median: half of all requests complete at or below it. P95 latency is the response time that 95% of requests meet or beat, and P99 latency is the threshold that only the slowest 1% of requests exceed. In enterprise AI applications, P95 and P99 latency are better indicators of real-world experience than average response time, because averages hide long-tail delays.
A Redis cache does not change how fast the LLM itself generates a response. Instead, it reduces the time spent on retrieval, vector search, and context assembly before the prompt reaches the LLM. Since these steps often account for a large share of total RAG latency, caching them improves overall response time and consistency.
Cache freshness in RAG systems is typically managed through Time-Based Expiration (TTL), metadata-aware validation that triggers refresh logic when source documents change, and version-based cache management that propagates document updates through the retrieval layer.
Redis caching does not replace the need to choose a suitable LLM. It is an architectural optimization, not a model replacement. Model selection still matters, but production-scale RAG performance depends just as much on retrieval architecture, caching strategy, observability, and data quality as it does on which LLM is used.
Enterprise RAG applications should track P50, P95, and P99 latency together, rather than relying on average response time alone. This combination shows both typical user experience and the long-tail delays that most affect perceived application quality.
At DotStark, we help organizations build and optimize enterprise-grade AI solutions, including RAG platforms, AI agents, knowledge assistants, document intelligence systems, and scalable AI infrastructure.
If you're evaluating how to improve the performance, reliability, or scalability of your AI applications, we'd be happy to exchange ideas.