Current RAG Landscape
Retrieval-Augmented Generation (RAG) has evolved far beyond simple vector search. The recent survey "Retrieval-Augmented Generation for Large Language Models" (Gao et al., 2023) highlights three active areas of innovation:
- Query-side augmentation (query transformation)
- Agentic orchestration of retrieval
- Post-retrieval refinement
Self-RAG introduced a trainable LLM that decides when to retrieve, but its multi-step pipeline can be too heavy for latency-sensitive services. The patterns below are battle-tested in LangChain ≥ 0.1 and LlamaIndex ≥ 0.10 and can be adopted incrementally.
Query Transformation
1. Single-Query Rewrite
A lightweight LLM call that turns colloquial input into a search-optimized string.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Ask the model to end the rewritten query with a '**' sentinel so the
# query can be split off reliably even if extra commentary is generated.
rewrite_tpl = ChatPromptTemplate.from_template(
    "Rephrase the following question into a concise web-search query ending with '**'.\n\n{q}"
)
rewrite_chain = rewrite_tpl | ChatOpenAI(temperature=0) | StrOutputParser()

user_utterance = "man that sam bankman fried trial was crazy! what is langchain?"
# Keep only the text before the sentinel and trim whitespace.
search_query = rewrite_chain.invoke({"q": user_utterance}).split("**")[0].strip()
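Pinning temperature to 0 keeps the rewrite near-deterministic, so identical utterances map to the same search string and the transformation is safe to cache.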
2. Multi-Query Expansion
Generate k semantically close queries and union their results to reduce recall gaps.
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate alternative phrasings with the LLM, run each against the
# vector store, and return the deduplicated union of results.
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=ChatOpenAI(temperature=0),
    include_original=True,  # also run the user's original query
)
docs = mq_retriever.invoke("What are the approaches to Task Decomposition?")
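To see which paraphrases were actually generated, raise the log level of the retriever's module (the logger name below follows the LangChain docs):

import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)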
3. Hypothetical Document Embeddings (HyDE)
Let the LLM fabricate a pseudo document, embed it, and search against that embedding.
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAI, OpenAIEmbeddings

base = OpenAIEmbeddings()
# "web_search" selects a built-in prompt that asks the LLM to write a short
# passage answering the question; that passage is what gets embedded.
hyde = HypotheticalDocumentEmbedder.from_llm(
    llm=OpenAI(max_tokens=256),
    base_embeddings=base,
    prompt_key="web_search",
)
vector = hyde.embed_query("Where is the Taj Mahal?")
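The resulting vector can be searched directly; a minimal sketch, assuming a vector store (e.g. FAISS or Chroma) that exposes similarity_search_by_vector:

docs = vector_store.similarity_search_by_vector(
    hyde.embed_query("Where is the Taj Mahal?"), k=4
)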
4. Step-Back Prompting
Create a broader question, retrieve context for both the broad and the original question, then merge both contexts before generation (a full sketch follows the template below).
step_back_template = """
Context from original query:
{specific_ctx}
Context from step-back query:
{generic_ctx}
Original question: {question}
Answer:"""
Agentic Retrieval
Router for Multiple Indexes
When several corpora exist (e.g., NYT news, Wikipedia, internal wiki), an LLM-based router picks the right index.
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import ToolMetadata

# Describe each corpus; the selector reasons over these descriptions only.
tools = [
    ToolMetadata(name="nyt_covid", description="NYT articles about COVID-19"),
    ToolMetadata(name="wiki_covid", description="Wikipedia page on COVID-19"),
    ToolMetadata(name="tesla_wiki", description="Wikipedia page on Tesla"),
]
selector = LLMSingleSelector.from_defaults()
picked = selector.select(tools, query="Tell me more about COVID-19")
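The result carries both the chosen index and the model's rationale. A sketch of dispatching on it, assuming one query engine per corpus (the engine names are hypothetical):

choice = picked.selections[0]  # has .index and .reason
engines = [nyt_engine, wiki_covid_engine, tesla_engine]  # hypothetical, same order as tools
response = engines[choice.index].query("Tell me more about COVID-19")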
Post-Retrieval Refinement
Long-Context Reordering (LiTM)
Mitigate the lost-in-the-middle effect by placing the least relevant chunks in the center.
def litm_order(docs):
    """Reorder relevance-ranked docs so the best ones sit at the context edges."""
    docs = list(reversed(docs))  # least relevant first
    out = []
    for i, d in enumerate(docs):
        if i % 2 == 0:
            out.insert(0, d)  # alternate front and back; the most relevant
        else:                 # docs are processed last and land at the ends
            out.append(d)
    return out

ranked = litm_order(retrieved_docs)
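LangChain ships the same shuffle as a built-in document transformer, so the hand-rolled version above can be replaced with:

from langchain_community.document_transformers import LongContextReorder
ranked = LongContextReorder().transform_documents(retrieved_docs)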
Contextual Compression
Use an LLM to strip irrelevant sentences from each chunk before feeding the generator.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# An extra LLM pass that keeps only the sentences relevant to the query.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))
compressed_retriever = ContextualCompressionRetriever(
    base_retriever=vector_store.as_retriever(),
    base_compressor=compressor,
)
compressed = compressed_retriever.invoke(
    "What did the president say about Ketanji Brown Jackson?"
)
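LLM extraction costs one model call per retrieved chunk. When latency is tight, an embedding-similarity filter is a cheaper drop-in compressor (reusing the OpenAIEmbeddings instance from the HyDE example):

from langchain.retrievers.document_compressors import EmbeddingsFilter
compressor = EmbeddingsFilter(embeddings=base, similarity_threshold=0.76)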
Refinement Loop
Iteratively improve the answer by injecting extra snippets.
refine_prompt = """
Original question: {query}
Existing answer: {answer}
Additional context:
{context}
Refine the answer if the new context is helpful; otherwise keep the original.
Refined answer:
"""
Emotion Stimuli
Microsoft’s 2023 EmotionPrompt paper ("Large Language Models Understand and Can be Enhanced by Emotional Stimuli") reports that adding urgency cues can lift output quality. Example prompt suffix:
emotion_suffix = (
    "This is very important to my career. "
    "You’d better be sure. Provide a confidence score 0–1."
)
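Wiring the suffix in is plain string concatenation on the final generation prompt; a sketch that also defines the generator_with_emotion stage used in the skeleton below (the prompt wording is an assumption):

final_prompt = ChatPromptTemplate.from_template(
    "Answer using the context below.\n\n{context}\n\nQuestion: {question}\n" + emotion_suffix
)
generator_with_emotion = final_prompt | ChatOpenAI(temperature=0) | StrOutputParser()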
Key Takeaways
- Query transformation and post-processing yield the biggest gains for the least complexity.
- Self-routing agents shine when multiple heterogeneous data sources exist.
- Evaluate end-to-end answer quality, not just retrieval recall; sometimes 70 % recall with 90 % precision beats 90 % recall with 50 % precision.
- No single pattern fits every domain—build a benchmark that mirrors real user questions and iterate.
# Minimal production pipeline skeleton (conceptual LCEL sketch: every stage
# must be a Runnable, so plain functions such as litm_order would be wrapped
# in RunnableLambda, and the compression stage also needs the query).
pipeline = (
    query_rewrite              # single-query rewrite chain
    | multi_query_retriever    # recall-oriented expansion + retrieval
    | litm_order               # lost-in-the-middle reordering
    | contextual_compressor    # strip irrelevant sentences per chunk
    | generator_with_emotion   # final LLM call with the emotion suffix
)