The 80% Problem

Most RAG demos look magical. You drop in 10 PDFs, ask 3 questions, get clean answers. Ship it.

Then production hits. The document corpus grows from 10 to 10,000. Users ask questions the demo never anticipated. Edge cases stack up. Accuracy drops from 95% to 60% in two weeks. The team starts apologising to the client.

I've built 20+ production RAG systems for clients across the USA, UK, UAE, Canada, Australia, Switzerland, and Pakistan. About 80% of the RAG projects I audit before clients hire me are in this exact failure mode — they passed the demo, then collapsed under real data.

The fixes aren't more complex models. They're architectural patterns designed for failure modes from day one. Here are the five that matter most.

Failure 1: Hallucinations on edge cases

A vanilla RAG pipeline does this: embed the user query, retrieve top-k documents, stuff them into a prompt, ask the LLM to answer. When retrieval finds something, the LLM dutifully constructs an answer — even when the retrieved context is unrelated to the question.

In production, you get confident-sounding nonsense on the long tail of queries.

The fix: a self-correction loop. Before the LLM answers, force it to grade the retrieved context against the question. If the grade is poor, rewrite the query and retrieve again, or fall back to an "I don't have enough information" response.

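from typing import TypedDict

from langgraph.graph import StateGraph, END

# Assumed to exist elsewhere: `llm` (a chat model with .invoke()) and the
# `retrieve`, `rewrite_query`, and `generate_answer` node functions.

class RAGState(TypedDict):
    question: str
    documents: list
    relevance_score: int

def grade_relevance(state):
    # Join the retrieved text first, then truncate by characters.
    # Slicing the list itself would cap the number of documents instead.
    docs = state["documents"]
    context = "\n\n".join(getattr(d, "page_content", str(d)) for d in docs)[:3000]
    question = state["question"]
    prompt = f"""Given the question and retrieved documents, score 0-10 how
    relevant the documents are to answering the question. Be strict.
    Question: {question}
    Documents: {context}
    Respond with just a number."""
    try:
        score = int(llm.invoke(prompt).content.strip())
    except ValueError:
        score = 0  # unparseable reply: treat the context as irrelevant
    return {"relevance_score": score}

def route_after_grading(state):
    if state["relevance_score"] < 6:
        return "rewrite_query"
    return "generate_answer"

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("grade", grade_relevance)
graph.add_node("rewrite_query", rewrite_query)
graph.add_node("generate_answer", generate_answer)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "grade")
graph.add_conditional_edges("grade", route_after_grading)
graph.add_edge("rewrite_query", "retrieve")  # retry retrieval with the new query
graph.add_edge("generate_answer", END)
app = graph.compile()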

I built exactly this pattern for an enterprise client (full breakdown in my Agentic RAG case study). It moved accuracy from ~70% to 90%+ on real questions and cut the hallucination rate to single-digit percentages.
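One detail a grade-and-rewrite loop tends to need in practice is a cap on rewrites, so that a question with no relevant context ends in the fallback response instead of cycling. The sketch below is one possible shape, not the client implementation: the `rewrite_count` state field, the `fallback` node name, and the `MAX_REWRITES` budget are all hypothetical names introduced for illustration.

```python
MAX_REWRITES = 2  # assumed budget; tune against your latency target
FALLBACK_ANSWER = "I don't have enough information to answer that."

def route_with_retry_cap(state, threshold=6):
    """Pick the next node: answer, rewrite, or give up once the budget is spent."""
    if state["relevance_score"] >= threshold:
        return "generate_answer"
    if state.get("rewrite_count", 0) < MAX_REWRITES:
        return "rewrite_query"
    return "fallback"

def count_rewrite(state):
    """State update the rewrite node can return alongside the new query."""
    return {"rewrite_count": state.get("rewrite_count", 0) + 1}

def fallback(state):
    """Terminal node: admit the gap rather than answer from weak context."""
    return {"answer": FALLBACK_ANSWER}
```

With a budget of 2, a question gets at most two rewritten queries before the system declines to answer, which keeps worst-case latency bounded.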

Failure 2: Stale retrieval as your data changes

You ship a RAG system on Monday with 500 documents. By Friday, 50 of those documents have been edited. Your vector store still has the old embeddings.

Users ask questions about the new content. The system retrieves the old version. They lose trust.

The fix: incremental re-indexing with content hashing, not full re-builds. Hash each source document. On a schedule (or webhook), only re-embed documents whose hash changed.

import hashlib
from datetime import datetime, timezone

# `embed(text)` is assumed to be your embedding call, and `pinecone_index`
# a Pinecone Index object.

def document_hash(text, metadata):
    # Sort metadata items so key order does not change the hash.
    payload = text + str(sorted(metadata.items()))
    return hashlib.sha256(payload.encode()).hexdigest()

def upsert_if_changed(doc_id, text, metadata, pinecone_index):
    new_hash = document_hash(text, metadata)
    existing = pinecone_index.fetch([doc_id]).vectors.get(doc_id)
    if existing and (existing.metadata or {}).get("hash") == new_hash:
        return False  # unchanged, skip
    embedding = embed(text)
    pinecone_index.upsert([{
        "id": doc_id,
        "values": embedding,
        "metadata": {
            **metadata,
            "hash": new_hash,
            "indexed_at": datetime.now(timezone.utc).isoformat(),
        },
    }])
    return True

This single pattern saved a client 70% on embedding API costs and kept their knowledge base accurate without manual intervention.
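Hash-based upserts handle edits, but documents that are deleted at the source also need to leave the index, or the system keeps retrieving content that no longer exists. A minimal sketch of the planning step, assuming you keep a manifest of `doc_id -> hash` for what is already indexed (the `plan_sync` name and the manifest shape are hypothetical):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(indexed_hashes: dict, current_docs: dict):
    """Compare stored hashes with the current source text.

    Returns (to_upsert, to_delete): ids that are new or changed, and ids
    that are indexed but no longer present in the source.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))
    return to_upsert, to_delete
```

The two lists can then drive the vector store's upsert and delete calls in one scheduled job, so only changed documents incur embedding cost.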

Failure 3: Bad retrieval ranking

Top-k retrieval over pure semantic similarity has a known weakness: it rewards documents that sound similar to the question, not documents that answer the question. Worse, exact keyword matches (product codes, names, error codes) often get ranked below conceptually-similar-but-wrong chunks.

The fix: hybrid search + a reranker. Combine dense vector search with sparse keyword search (BM25), then run the merged candidates through a cross-encoder reranker.

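from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Assumes `corpus` is a list of chunk objects with a `.text` attribute and that
# `vector_store.similarity_search` returns the same kind of object.

def tokenize(text):
    return text.lower().split()  # lowercase so "ERR-42" matches "err-42"

bm25 = BM25Okapi([tokenize(doc.text) for doc in corpus])
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def dedupe(docs):
    # Keep the first occurrence of each distinct chunk text.
    seen = {}
    for doc in docs:
        seen.setdefault(doc.text, doc)
    return list(seen.values())

def hybrid_retrieve(query, k=20):
    dense_hits = vector_store.similarity_search(query, k=k)
    sparse_hits = bm25.get_top_n(tokenize(query), corpus, n=k)
    candidates = dedupe(dense_hits + sparse_hits)
    pairs = [(query, c.text) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:5]]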

Why this matters: in financial, legal, and medical use cases, missing a specific code or term means missing the entire answer. Pure semantic search misses these constantly. Hybrid + rerank fixed this for a healthcare client managing 10,000+ patient records.
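When the cross-encoder budget is tight, the dense and sparse lists can be fused and truncated before reranking. One common option is reciprocal rank fusion (RRF), which uses only rank positions and so sidesteps the fact that BM25 and cosine scores live on different scales. This is a sketch of that alternative merging step, not what the pipeline above does; the function name and the use of plain document ids are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_ids = ["d1", "d2", "d3"]
sparse_ids = ["d2", "d4"]
fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
```

Here `d2` comes first because it appears in both lists, which is the behaviour you want for a chunk that matches both the meaning and the exact keyword.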

Failure 4: Multimodal blind spots

Most RAG systems can't read the charts, diagrams, screenshots, or tables inside PDFs. They OCR the text and lose 40% of the information.

If your domain has visual content — research papers, technical docs, medical scans, financial reports — text-only RAG is broken by design.

The fix: vision-language embeddings (ColPali, CLIP) for image regions alongside text chunks. Index both. Let the retriever match queries against both modalities.

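import torch
from colpali_engine.models import ColPali, ColPaliProcessor

# Pick the checkpoint that matches your colpali_engine version
# (versioned checkpoints such as "vidore/colpali-v1.2" also exist).
processor = ColPaliProcessor.from_pretrained("vidore/colpali")
model = ColPali.from_pretrained("vidore/colpali").eval()

def embed_page_image(pdf_page_image):
    batch = processor.process_images([pdf_page_image]).to(model.device)
    with torch.no_grad():
        patch_embeddings = model(**batch)  # shape: (1, n_patches, dim)
    # Mean-pool the patch vectors into one vector per page so it fits a
    # standard single-vector store. This trades away ColPali's per-patch
    # late-interaction scoring for simplicity.
    return patch_embeddings.mean(dim=1)

# Store both text embeddings AND image embeddings in the same vector store
# with a 'modality' tag. Retrieve from both, then merge.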

I built this for a research firm searching 10,000+ pages of mixed-content PDFs. Asking "show me the Q3 conversion funnel chart" actually returns the right chart now. Full writeup: Multimodal RAG with ColPali & CLIP.
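The "retrieve from both, then merge" step needs some care, because similarity scores from a text embedding space and an image embedding space are not directly comparable. A toy in-memory sketch of one approach, round-robin interleaving by per-modality rank, is below. The store layout (`id`, `modality`, `vector` keys) and the function names are hypothetical; a real system would run one filtered query per modality against the vector store instead of scanning a list.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_multimodal(query_vectors, store, k_per_modality=3):
    """query_vectors maps modality -> query embedding in that modality's space.

    Ranks items within each modality, then interleaves the ranked lists
    so neither modality crowds out the other.
    """
    per_modality = []
    for modality, qvec in query_vectors.items():
        hits = [item for item in store if item["modality"] == modality]
        hits.sort(key=lambda item: -cosine(qvec, item["vector"]))
        per_modality.append([item["id"] for item in hits[:k_per_modality]])
    merged = []
    for rank in range(k_per_modality):
        for ids in per_modality:
            if rank < len(ids):
                merged.append(ids[rank])
    return merged
```

A reranker or the LLM itself can then decide which of the interleaved candidates actually answers the question.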

Failure 5: No evaluation harness = no improvement

Most teams ship RAG without an evaluation pipeline. Then when accuracy degrades, they can't tell:

- Did retrieval get worse?
- Did the LLM get worse?
- Did the data get harder?
- Was it always this bad and we just didn't notice?

You can't fix what you can't measure.

The fix: a golden dataset plus automated evaluation. Hand-curate 50–100 question/answer pairs covering your edge cases, then run them through the system nightly and on every deploy. Track three metrics:

# `llm_judge` and `llm_judge_grounding` are assumed helpers that return a
# score between 0 and 1 (or a bool) for each example.
def evaluate_rag(golden_dataset, rag_system):
    if not golden_dataset:
        raise ValueError("golden_dataset must contain at least one example")
    results = {
        "retrieval_hit_rate": 0,    # did retrieval find the right doc?
        "answer_correctness": 0,    # did the final answer match?
        "faithfulness": 0,          # was the answer grounded in retrieved docs?
    }
    for q, expected_doc_ids, expected_answer in golden_dataset:
        retrieved = rag_system.retrieve(q)
        answer = rag_system.answer(q)
        results["retrieval_hit_rate"] += any(d.id in expected_doc_ids for d in retrieved)
        results["answer_correctness"] += llm_judge(answer, expected_answer)
        results["faithfulness"] += llm_judge_grounding(answer, retrieved)
    # Average each metric over the dataset.
    return {k: v / len(golden_dataset) for k, v in results.items()}

This is the single highest-leverage thing you can build. Every RAG improvement I've shipped started with one of these metrics moving in the wrong direction.
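To make those metrics actionable, the eval run can be compared against a stored baseline and flag any metric that dropped. A minimal sketch, assuming metrics are dicts of floats like the ones returned by the evaluation function above; the `find_regressions` name and the 0.02 tolerance are illustrative choices, not measured values.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: (baseline, current)} for every metric that dropped
    by more than `tolerance` (absolute) against the stored baseline."""
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }

baseline = {"retrieval_hit_rate": 0.92, "answer_correctness": 0.85, "faithfulness": 0.90}
current = {"retrieval_hit_rate": 0.80, "answer_correctness": 0.84, "faithfulness": 0.91}
regressions = find_regressions(baseline, current)
```

In this made-up example only `retrieval_hit_rate` is flagged, which points at retrieval rather than the LLM. A CI job could exit non-zero whenever `regressions` is non-empty.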

The Pattern: Design for failure on day 1

If I had to compress all 20+ RAG projects into one sentence: the production-ready systems are the ones designed for failure from the first commit. Self-correction loops, hash-based incremental indexing, hybrid retrieval, multimodal embeddings, and an evaluation harness aren't optimizations you add later. They're load-bearing infrastructure.

Most "AI demos that broke in production" stories are really "demos without failure handling that met production." The fix isn't a smarter model. It's better architecture.

If you're building a RAG system that needs to survive real data, look at every component and ask: what happens when this fails? If you don't have an answer, that's the next thing to build.

About the Author

Muaz Ashraf is a freelance AI engineer specialising in production-ready RAG systems, AI agents, and AI integration. He has shipped 20+ AI systems across 7 countries, with a 100% project completion rate.

Open for AI consulting, RAG system development, AI agent development, and LLM application work. Typical MVP delivery: 2–4 weeks.