Advanced Retrieval-Augmented Generation Patterns for Production LLM Systems

Current RAG Landscape

Retrieval-Augmented Generation has evolved far beyond simple vector search. A recent survey, "Retrieval-Augmented Generation for Large Language Models: A Survey", highlights three active areas of innovation:

  1. Query-side augmentation (query transformation)
  2. Agentic orchestration of retrieval
  3. Post-retrieval refinement

Self-RAG introduced a trainable LLM that decides when to retrieve, but its multi-step pipeline can be too heavy for latency-sensitive services. The patterns below are battle-tested in LangChain ≥ 0.1 and LlamaIndex ≥ 0.10 and can be adopted incrementally.


Query Transformation

1. Single-Query Rewrite

A lightweight LLM call that turns colloquial input into a search-optimized string.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

rewrite_tpl = ChatPromptTemplate.from_template(
    "Rephrase the following question into a concise web-search query ending with '**'.\n\n{q}"
)
rewrite_chain = rewrite_tpl | ChatOpenAI(temperature=0) | StrOutputParser()

user_utterance = "man that sam bankman fried trial was crazy! what is langchain?"
# The model terminates its rewrite with '**'; strip the marker and anything after it.
search_query = rewrite_chain.invoke({"q": user_utterance}).split("**")[0].strip()

2. Multi-Query Expansion

Generate k semantically close queries and union their results to reduce recall gaps.

from langchain.retrievers.multi_query import MultiQueryRetriever

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=ChatOpenAI(temperature=0),
    include_original=True
)
docs = mq_retriever.invoke("What are the approaches to Task Decomposition?")

3. Hypothetical Document Embeddings (HyDE)

Let the LLM fabricate a hypothetical answer document, embed it, and search against that embedding instead of embedding the raw query.

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAIEmbeddings, OpenAI

base = OpenAIEmbeddings()
hyde = HypotheticalDocumentEmbedder.from_llm(
    llm=OpenAI(max_tokens=256),
    base_embeddings=base,
    prompt_key="web_search"
)
embedding = hyde.embed_query("Where is the Taj Mahal?")  # vector of the hypothetical answer, not the raw question

4. Step-Back Prompting

Create a broader question, retrieve context for both the broad and original questions, then merge them before generation.

step_back_template = """
Context from original query:
{specific_ctx}

Context from step-back query:
{generic_ctx}

Original question: {question}
Answer:"""
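
Sketched end to end, the flow looks like the following. Note that `llm` is a stub standing in for a real chat-model call and `retrieve` for any retriever callable; both names are illustrative, not library APIs.

```python
def llm(prompt: str) -> str:
    # Stub: a real implementation would call a chat model here.
    return "stub answer"

def step_back_query(question: str) -> str:
    # Ask the model for a broader, more generic version of the question.
    return llm(f"Rewrite this as a more generic, high-level question: {question}")

def answer_with_step_back(question: str, retrieve) -> str:
    broad = step_back_query(question)
    specific_ctx = retrieve(question)  # context for the original question
    generic_ctx = retrieve(broad)      # context for the step-back question
    # Merge both contexts into the prompt shown above, then generate.
    prompt = (
        f"Context from original query:\n{specific_ctx}\n\n"
        f"Context from step-back query:\n{generic_ctx}\n\n"
        f"Original question: {question}\nAnswer:"
    )
    return llm(prompt)
```

In production, swap the stub for a temperature-0 chat model and `retrieve` for your vector-store retriever.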

Agentic Retrieval

Router for Multiple Indexes

When several corpora exist (e.g., NYT news, Wikipedia, internal wiki), an LLM-based router picks the right index.

from llama_index.core.tools import ToolMetadata
from llama_index.core.selectors import LLMSingleSelector

tools = [
    ToolMetadata(name="nyt_covid", description="NYT articles about COVID-19"),
    ToolMetadata(name="wiki_covid", description="Wikipedia page on COVID-19"),
    ToolMetadata(name="tesla_wiki", description="Wikipedia page on Tesla"),
]
selector = LLMSingleSelector.from_defaults()
picked = selector.select(tools, query="Tell me more about COVID-19")
# picked.selections holds the chosen tool's index and the model's reasoning

Post-Retrieval Refinement

Long-Context Reordering (LiTM)

Mitigate the lost-in-the-middle effect by placing the least relevant chunks in the center.

def litm_order(docs):
    # Start from the least relevant chunk (reverse of a relevance-sorted
    # list) and alternate between the front and back of the output, so the
    # least relevant chunks end up in the middle.
    docs = list(reversed(docs))
    out = []
    for i, d in enumerate(docs):
        if i % 2 == 0:
            out.insert(0, d)
        else:
            out.append(d)
    return out

ranked = litm_order(retrieved_docs)

Contextual Compression

Use an LLM to strip irrelevant sentences from each chunk before feeding the generator.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))
compressed_retriever = ContextualCompressionRetriever(
    base_retriever=vector_store.as_retriever(),
    base_compressor=compressor
)
compressed = compressed_retriever.invoke("What did the president say about Ketanji Brown Jackson?")

Refinement Loop

Iteratively improve the answer by injecting extra snippets.

refine_prompt = """
Original question: {query}
Existing answer: {answer}

Additional context:
{context}

Refine the answer if the new context is helpful; otherwise keep the original.
Refined answer:
"""
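
A minimal loop around this template can be sketched as below. Here `llm` is any callable mapping a prompt to text (a stub in tests, a real chat model in production), and `template` is a string with the {query}/{answer}/{context} slots shown above; both are illustrative parameters, not a library API.

```python
def refine_answer(query: str, initial_answer: str,
                  contexts: list[str], llm, template: str) -> str:
    answer = initial_answer
    for ctx in contexts:
        # One refine step per extra snippet, carrying the current
        # best answer forward into the next prompt.
        answer = llm(template.format(query=query, answer=answer,
                                     context=ctx)).strip()
    return answer
```

With an empty context list the draft answer passes through unchanged, which makes the loop safe to run even when retrieval returns nothing new.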

Emotion Stimuli

Microsoft’s 2023 EmotionPrompt paper ("Large Language Models Understand and Can Be Enhanced by Emotional Stimuli") shows that adding urgency cues can measurably lift output quality. Example prompt suffix:

emotion_suffix = (
    "This is very important to my career. "
    "You’d better be sure. Provide a confidence score 0–1."
)

Key Takeaways

  • Query transformation and post-processing yield the biggest gains for the least complexity.
  • Self-routing agents shine when multiple heterogeneous data sources exist.
  • Evaluate end-to-end quality, not just retrieval recall; 70 % recall at 90 % precision can beat 90 % recall at 50 % precision.
  • No single pattern fits every domain—build a benchmark that mirrors real user questions and iterate.

# Minimal production pipeline skeleton (LCEL coerces plain functions
# such as litm_order into RunnableLambda when piped with a Runnable)
pipeline = (
    query_rewrite             # single-query rewrite chain
    | multi_query_retriever   # recall expansion
    | litm_order              # lost-in-the-middle reordering
    | contextual_compressor   # strip irrelevant sentences
    | generator_with_emotion  # final generation with emotion suffix
)

Tags: RAG LLM LangChain LlamaIndex Retrieval

Posted on Tue, 12 May 2026 13:54:23 +0000 by Kane250