The Hidden Cost of LLM Context

The Problem Nobody Talks About

The thing about RAG systems - retrieval-augmented generation - is that they fail silently. The retrieval looks correct. The model produces a coherent response. The user gets an answer that sounds right. And somewhere in that answer is a fact that doesn't appear anywhere in the retrieved context, or worse, contradicts it.

I've spent time working on AI platforms where the retrieval layer was the primary source of production failures. Not model failures - retrieval failures. Context that was stale. Chunks that lost their surrounding meaning. Queries that returned semantically similar but topically wrong documents.

This is the hidden cost of LLM context: the quality of your outputs is bounded by the quality of your retrieval, and retrieval quality is harder to measure than model quality.

What "Context" Actually Means in Production

When engineers talk about context in LLM systems, they usually mean the prompt - the combination of system instructions, user input, and retrieved documents that gets sent to the model. But in a production RAG system, context has several distinct layers, each with its own failure modes.

Retrieval context is the documents or document chunks returned by your vector search or keyword search. The failure mode here is returning topically wrong documents - ones that sit close to the query vector but don't actually answer the question.

Chunking context is the degradation that happens when you break a document into chunks for embedding. The chunk boundaries rarely respect semantic boundaries. A code example gets split mid-function. A paragraph's conclusion gets attached to the next paragraph's topic. The chunk that gets retrieved still reads as comprehensible. The meaning that was lost at the boundary cannot be recovered from it.

Temporal context is the problem of stale context. Your documents change. Your embeddings don't. Until you re-embed, your retrieval is searching against an outdated representation of your current documents. This is the failure mode nobody thinks about until a user asks about something that changed last week and gets an answer about how things worked last month.
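One way to make staleness visible is to store a hash of the text next to each embedding and compare it with the hash of the current document. The sketch below is a minimal illustration under that assumption; `IndexedDoc`, `content_hash`, and `find_stale` are hypothetical names, and the in-memory dictionaries stand in for whatever document store and vector index you actually use.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class IndexedDoc:
    """Hypothetical record kept alongside each embedding."""
    doc_id: str
    content_hash: str  # hash of the text at the time it was embedded


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(index: dict[str, IndexedDoc], current_docs: dict[str, str]) -> list[str]:
    """Return IDs whose current text differs from what was embedded,
    plus IDs that were never embedded at all."""
    stale = []
    for doc_id, text in current_docs.items():
        record = index.get(doc_id)
        if record is None or record.content_hash != content_hash(text):
            stale.append(doc_id)
    return sorted(stale)


# Illustrative data only.
index = {
    "pricing": IndexedDoc("pricing", content_hash("Plan A costs $10.")),
    "faq": IndexedDoc("faq", content_hash("We ship worldwide.")),
}
current = {
    "pricing": "Plan A costs $12.",  # changed since it was embedded
    "faq": "We ship worldwide.",     # unchanged
    "returns": "30-day returns.",    # never embedded
}
stale_ids = find_stale(index, current)
print(stale_ids)
```

Running a check like this on a schedule turns "we should re-embed at some point" into a list of specific documents whose embeddings no longer match their text.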

Conversational context is the context carried across multiple turns in a conversation. This is where systems most commonly fail in ways users notice - losing track of what "it" refers to, forgetting constraints established three messages ago, or re-introducing information that was already provided.

[Figure: RAG pipeline with observability checkpoints]

What Retrieval Quality Actually Looks Like

The standard way to evaluate retrieval quality is embedding similarity - does the retrieved document have a high cosine similarity to the query? This metric is necessary but not sufficient.

What you actually want to know is: did the retrieved document help answer the user's question? This is relevance, not similarity. A document can be semantically similar to a query and still not answer it. A document can be topically relevant but contain the answer in a form that makes it unusable.

I've found two proxies for relevance that are more useful than raw similarity scores.

Context sufficiency testing: Given a retrieved document and a query, can a human answer the query using only that document? If not, the document failed retrieval even if it had a high similarity score.

Distractor testing: Given a distractor document (plausible but wrong), a relevant document, and a user query - does the retrieval system consistently rank the relevant document above the distractor? This tells you whether your system can distinguish correct answers from plausible wrong ones.

Neither of these is cheap to implement. Both give you visibility into retrieval quality that similarity scores don't.
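Distractor testing is the easier of the two to automate. A minimal sketch, assuming you can call your retriever's scoring function directly: `distractor_pass_rate` is a hypothetical helper, and `token_overlap` is a toy stand-in for a real scorer (embedding similarity, a reranker, and so on), included only so the example runs on its own.

```python
from typing import Callable

# Each case: (query, relevant document, distractor document)
Case = tuple[str, str, str]


def distractor_pass_rate(cases: list[Case], score: Callable[[str, str], float]) -> float:
    """Fraction of cases where the relevant document outscores the distractor."""
    if not cases:
        raise ValueError("need at least one case")
    wins = sum(1 for query, rel, dis in cases if score(query, rel) > score(query, dis))
    return wins / len(cases)


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer (assumption): Jaccard overlap of lowercase whitespace tokens."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


cases = [
    (
        "how do I reset my password",
        "to reset your password open settings",
        "password rules require eight characters",
    ),
    (
        "what is the refund window",
        "purchases can be returned within 30 days",
        "the refund window is shown in the order page",
    ),
]
rate = distractor_pass_rate(cases, token_overlap)
print(rate)
```

In the second case the distractor repeats the query's wording while the relevant document paraphrases it, so the lexical toy scorer picks the wrong one. That is the kind of failure this test exists to surface.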

The Chunking Problem Nobody Solves

Most RAG tutorials show you how to chunk text: fixed-size chunks, overlapping chunks, sentence boundary chunks. What they don't show you is what happens when your chunks hit real documents.
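For reference, here is roughly what the tutorial version looks like. This is a minimal sketch of fixed-size character chunking with overlap; `chunk_fixed` and its parameters are illustrative and not taken from any particular library.

```python
def chunk_fixed(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most `size` characters, where each chunk
    repeats the last `overlap` characters of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


doc = (
    "def total_price(items):\n"
    "    subtotal = sum(items)\n"
    "    return subtotal * 1.2\n"
)
chunks = chunk_fixed(doc, size=40, overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

On this three-line function the boundaries fall mid-line, and the last chunk mentions `subtotal` without the line that defines it. Overlap softens the problem but does not remove it.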

A technical documentation page has code examples. Code examples have indentation that carries meaning. Code examples have variable names that refer to other parts of the file. Split the chunk at the wrong place and you've created a fragment that looks valid but refers to undefined variables.

A legal document has definitions that apply to the entire document. Chunk at a section boundary and the definition is in chunk A while the clause that uses it is in chunk B. Your retrieval returns chunk B. The model has the clause but not the definition.

A research paper has methodology in the methods section that explains the results in the results section. Chunk by sections and you've separated the answer from the evidence that validates it.

The chunking strategy that works for one document type will fail for another. The chunking strategy that works for your current documents will fail when someone adds a new document type.

The practical solution I've found: accept that chunking is lossy, measure the loss, and build retrieval systems that are robust to imperfect chunks. This means building redundancy into your retrieval - retrieving more chunks than you need and letting the model decide which are relevant - and accepting that some questions can't be answered from the available context no matter how good your embeddings are.

Token Limits Are a Design Constraint, Not a Performance Problem

When I started working with LLMs, I treated token limits as a performance constraint to work around. Chunk more aggressively. Summarize more aggressively. The goal was to fit everything into context.

What I learned: the token limit is actually a design constraint that forces you to make decisions about what matters. If you have 128,000 tokens of context, you can retrieve everything and let the model figure out what's relevant. Or you can retrieve only what's most relevant and accept that some questions won't have enough context to answer.

The second approach produces better results in my experience, but it requires being honest about what your system doesn't know. A retrieval system that returns partial context with high precision is more useful than one that returns comprehensive context with low precision. The model can work with high-precision context. It struggles with contradictory or irrelevant context that fills the available tokens.
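A minimal sketch of the precision-first approach: keep only chunks that clear a relevance threshold, then fill a token budget greedily from the highest score down. `Chunk`, `pack_context`, the threshold, and the whitespace token count are all assumptions made for illustration; a real system would use the model's own tokenizer and scores from its own retriever.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from the retriever, higher is better


def count_tokens(text: str) -> int:
    """Crude whitespace count (assumption); swap in the model's tokenizer."""
    return len(text.split())


def pack_context(chunks: list[Chunk], budget: int, min_score: float) -> list[Chunk]:
    """Keep the highest-scoring chunks that clear `min_score` and fit in `budget` tokens."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # every remaining chunk scores lower still
        cost = count_tokens(chunk.text)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected


candidates = [
    Chunk("refunds are issued within thirty days", 0.90),
    Chunk("refund requests need the original order number attached", 0.80),
    Chunk("store credit is instant", 0.75),
    Chunk("our office is closed", 0.20),
]
packed = pack_context(candidates, budget=10, min_score=0.5)
print([c.text for c in packed])
```

The threshold is what makes the system honest about what it doesn't know: if nothing clears it, the context comes back empty and the application can say so instead of padding the prompt with weak matches.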

What I'd Do Differently

If I were building a RAG system from scratch today, I'd invest earlier in evaluation infrastructure, not retrieval infrastructure.

Specifically: I'd build a retrieval eval set before I built the retrieval system. A set of (query, relevant documents, irrelevant documents) tuples that I could use to measure whether my retrieval was actually improving. Without this, you're flying blind - you don't know if changes to your chunking strategy or embedding model are making things better or worse.
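As a rough sketch of what that eval set could look like in code, the example below stores each tuple as an `EvalCase` and computes mean precision@k and recall@k over document IDs. The class, the function name, and the fake retriever are hypothetical; the only assumption about your real retriever is that it returns a ranked list of document IDs for a query.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    relevant_ids: set[str]
    irrelevant_ids: set[str]  # known distractors, kept for debugging bad rankings


def precision_recall_at_k(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> tuple[float, float]:
    """Mean precision@k and recall@k across all eval cases."""
    if not cases or k <= 0:
        raise ValueError("need at least one case and k > 0")
    precisions, recalls = [], []
    for case in cases:
        if not case.relevant_ids:
            raise ValueError(f"no relevant documents for query: {case.query!r}")
        top_k = retrieve(case.query)[:k]
        hits = sum(1 for doc_id in top_k if doc_id in case.relevant_ids)
        precisions.append(hits / k)
        recalls.append(hits / len(case.relevant_ids))
    return sum(precisions) / len(cases), sum(recalls) / len(cases)


# Illustrative eval set and a canned retriever standing in for the real one.
eval_set = [
    EvalCase("reset password", {"d1", "d2"}, {"d9"}),
    EvalCase("refund window", {"d3"}, {"d4"}),
]
canned_results = {
    "reset password": ["d1", "d9", "d2"],
    "refund window": ["d3", "d4"],
}
precision, recall = precision_recall_at_k(eval_set, lambda q: canned_results[q], k=2)
print(precision, recall)
```

With numbers like these tracked per commit, a change to chunk size or embedding model becomes a measurable diff rather than a guess.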

I'd also be more aggressive about provenance tracking. Every fact in a generated response should be traceable to a specific retrieved chunk. This is technically non-trivial - the model doesn't naturally tell you which parts of its context it used - but there are approaches that work, and the alternative is generating confident responses that can't be verified.
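One approach, sketched below under stated assumptions: label every retrieved chunk with an ID in the prompt, instruct the model to cite those IDs in square brackets before each sentence's final punctuation, and then audit the response. `format_context` and `audit_citations` are hypothetical helpers, and the citation format is a convention you would have to enforce through your own prompt.

```python
import re

# Matches citations such as [c1] or [doc-12] (format is an assumption).
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def format_context(chunks: dict[str, str]) -> str:
    """Render retrieved chunks with their IDs so the model can cite them."""
    return "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks.items())


def audit_citations(answer: str, chunks: dict[str, str]) -> dict[str, list[str]]:
    """Flag sentences with no citation and citations to IDs that were never retrieved."""
    uncited, unknown_ids = [], []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        ids = CITATION.findall(sentence)
        if not ids:
            uncited.append(sentence)
        unknown_ids.extend(i for i in ids if i not in chunks)
    return {"uncited": uncited, "unknown_ids": unknown_ids}


retrieved = {
    "c1": "Refunds are issued within 30 days of purchase.",
    "c2": "Store credit is applied instantly.",
}
answer = (
    "Refunds are issued within 30 days [c1]. "
    "Store credit is instant [c7]. "
    "Shipping is free."
)
report = audit_citations(answer, retrieved)
print(format_context(retrieved))
print(report)
```

This only verifies that each sentence points at a chunk that was actually retrieved. It does not verify that the chunk supports the claim; that takes a separate check, such as human review of a sample.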

Final Thoughts

The thing about context quality is that it's invisible until it isn't. The system works until it doesn't, and when it doesn't, the failure looks like a model failure - a confident wrong answer, a confused follow-up - rather than a retrieval failure.

The teams I've seen handle this well are the ones who treated retrieval quality as a first-class engineering problem, not an afterthought. They built eval sets. They measured precision and recall on their retrieval. They accepted that the model is only as good as what it can retrieve, and they invested accordingly.

The teams that struggled were the ones who assumed retrieval was solved by vector search - load your documents, embed them, let the vector database handle the rest. Vector search is a component. Retrieval quality is a system property. The difference matters in production.