Retrieval-Augmented Generation Architecture
Large Language Models (LLMs) are constrained by their static training data, which leads to outdated information and hallucinations. Retrieval-Augmented Generation (RAG) addresses this by grounding model responses in external, up-to-date knowledge bases. This technique operates in three primary stages: indexing, retrieval, and generation.
During indexing, raw documents are segmented into chunks, converted into vector embeddings, and stored in a vector database. In the retrieval phase, the user's query is encoded into a vector, and the system performs a similarity search (typically using cosine similarity) to identify the most relevant document chunks. Finally, in generation, the retrieved context and the original query are combined into a prompt for the LLM to synthesize a precise answer.
This approach provides external memory for the model, significantly improving accuracy for knowledge-intensive tasks without the need for costly model retraining.
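The pipeline can be sketched in a few lines of Python. This is a minimal illustration of the three stages, not Huixiangdou's implementation; the embedding model, document chunks, and prompt template are placeholders (the BCE embedding model used later in this guide is shown, but any sentence-embedding model works).
# Minimal sketch of the index -> retrieve -> generate loop (illustrative only;
# model name, chunks, and prompt are placeholders, not Huixiangdou internals).
import numpy as np
from sentence_transformers import SentenceTransformer

# Indexing: embed document chunks once and keep the vectors
encoder = SentenceTransformer("maidalun1020/bce-embedding-base_v1")
chunks = [
    "Huixiangdou ingests Markdown, PDF and Word documents.",
    "Retrieval ranks chunks by cosine similarity to the query vector.",
]
chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)

# Retrieval: encode the query and pick the closest chunk
query = "Which document formats are supported?"
q_vec = encoder.encode([query], normalize_embeddings=True)[0]
best = chunks[int(np.argmax(chunk_vecs @ q_vec))]  # dot product == cosine for normalized vectors

# Generation: combine the retrieved context and the query into the prompt sent to the LLM
prompt = f"Answer using only this context:\n{best}\n\nQuestion: {query}"
print(prompt)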
Introduction to Huixiangdou
Huixiangdou is an open-source RAG framework optimized for group chat scenarios and technical support. It allows users to deploy intelligent assistants capable of answering domain-specific questions based on provided documentation. Key features include support for multiple document formats (Markdown, PDF, Word), integration with platforms like WeChat and Feishu, and flexible backend support for local models (e.g., InternLM) or remote APIs (e.g., GPT-4).
Environment Setup and Dependency Installation
To deploy the assistant, begin by configuring a Python virtual environment. This isolates dependencies and ensures version compatibility.
# Create and activate a new conda environment
conda create -n huixiang_env python=3.10 -y
conda activate huixiang_env
# Install core dependencies including LangChain and FAISS
pip install protobuf accelerate aiohttp auto-gptq bcembedding \
beautifulsoup4 einops faiss-gpu langchain loguru \
lxml_html_clean openai openpyxl pandas pydantic pymupdf \
python-docx pytoml readability-lxml redis requests \
scikit-learn sentence_transformers tiktoken transformers
Next, acquire the necessary model files. For this deployment, we utilize the InternLM2-Chat-7B model alongside BCE embedding and reranking models. Assuming shared model storage is available, link them to your project directory.
# Define project structure
mkdir -p /workspace/models
# Link Embedding and Reranker models
ln -s /shared_models/bce-embedding-base_v1 /workspace/models/embed_model
ln -s /shared_models/bce-reranker-base_v1 /workspace/models/reranker_model
# Link the main LLM
ln -s /shared_models/internlm2-chat-7b /workspace/models/llm_model
Project Configuration
Clone the repository and configure the config.ini file to point to your specific model paths. This configuration controls the behavior of the vector database, the LLM backend, and the retrieval pipeline.
# Clone the source code
cd /workspace
git clone https://github.com/internlm/huixiangdou.git
cd huixiangdou
# Update model paths in the configuration file
sed -i 's|embedding_model_path = .*|embedding_model_path = "/workspace/models/embed_model"|' config.ini
sed -i 's|reranker_model_path = .*|reranker_model_path = "/workspace/models/reranker_model"|' config.ini
sed -i 's|local_llm_path = .*|local_llm_path = "/workspace/models/llm_model"|' config.ini
Constructing the Knowledge Base
The core of a RAG system is its knowledge base. We need to process raw documents into vector embeddings. Huixiangdou distinguishes between "positive" questions (topics the assistant should answer) and "negative" questions (irrelevant or off-topic queries it should ignore).
First, prepare the source text. For this example, we use the project's own documentation as the knowledge source.
# Prepare directory for source documents
mkdir -p /workspace/huixiangdou/source_docs
git clone https://github.com/internlm/huixiangdou --depth=1 /workspace/huixiangdou/source_docs/huixiangdou_repo
Next, define the positive and negative query samples. These vectors help the router determine whether an incoming user query falls within the intended domain. The listing below populates the accepted list; a rejected-question list can be maintained in the same way.
# Update positive questions (save as resource/accept_list.json)
import json

accepted_queries = [
    "How do I configure the environment for Huixiangdou?",
    "What file formats are supported for the knowledge base?",
    "Explain the workflow of the retrieval module.",
    "How to integrate Huixiangdou with WeChat?",
    "What are the hardware requirements for running InternLM2?",
    "Difference between local and remote LLM deployment.",
    "How to update the vector database?"
]

with open('resource/accept_list.json', 'w') as f:
    json.dump(accepted_queries, f)
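To make the routing idea concrete, here is a simplified illustration of how accepted and rejected examples can gate a query by embedding similarity. This is a conceptual sketch, not Huixiangdou's actual rejection logic; the rejected examples and the decision rule are assumptions for demonstration.
# Simplified illustration of positive/negative routing (not Huixiangdou's code).
import json
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("/workspace/models/embed_model")  # BCE embedding linked earlier
accepted = json.load(open('resource/accept_list.json'))
rejected = ["Tell me a joke", "What is the weather today?"]  # example off-topic queries

def in_domain(query: str) -> bool:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    pos = (encoder.encode(accepted, normalize_embeddings=True) @ q).max()
    neg = (encoder.encode(rejected, normalize_embeddings=True) @ q).max()
    return float(pos) > float(neg)  # answer only if closer to an accepted example

print(in_domain("How do I set up the Huixiangdou environment?"))  # expected: True
print(in_domain("Recommend a good movie"))                        # expected: False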
Run the feature extraction script to build the vector indexes. This process converts text chunks into vectors using the specified embedding model.
# Create output directory for vectors
mkdir -p /workspace/huixiangdou/vector_store
# Execute the feature store pipeline
python3 -m huixiangdou.service.feature_store --sample ./test_queries.json
Running the Assistant
With the knowledge base indexed, launch the inference server. Modify the main script to include specific test queries to verify functionality.
# Define test questions
sed -i 's/queries = .*/queries = ["What is Huixiangdou?", "How to deploy on WeChat?", "Tell me a joke"]/' huixiangdou/main.py
# Start the service in standalone mode
python3 -m huixiangdou.main --standalone
Advanced Configuration: Web Search and Remote Models
To enhance the assistant's capabilities, you can enable web search for real-time information access or switch to remote APIs to reduce local GPU load.
Enabling Web Search:
Obtain an API key from a search provider (e.g., Serper) and update the configuration:
[web_search]
x_api_key = "YOUR_API_KEY_HERE"
domain_partial_order = ["github.com", "stackoverflow.com", "pytorch.org"]
save_dir = "logs/web_search_cache"
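Before wiring the key into config.ini, it is worth verifying it with a direct request. The endpoint, header, and response field below follow Serper's public REST API; other search providers use different URLs and payloads.
# Optional sanity check of the search API key (Serper shown; other providers differ).
import requests

resp = requests.post(
    "https://google.serper.dev/search",
    headers={"X-API-KEY": "YOUR_API_KEY_HERE", "Content-Type": "application/json"},
    json={"q": "huixiangdou github"},
)
resp.raise_for_status()
print(resp.json().get("organic", [])[:1])  # first organic result, if any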
Using Remote LLMs:
Modify config.ini to disable the local model and enable the remote client:
[llm.server]
enable_local = 0
enable_remote = 1
[llm.remote]
remote_type = "openai" # or "kimi", "deepseek", etc.
api_key = "YOUR_REMOTE_API_KEY"
model_name = "gpt-4-turbo"
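Before restarting the service, the remote credentials can be verified with a short standalone call using the openai package installed earlier. The snippet assumes an OpenAI-compatible endpoint; pass base_url to the client for other providers.
# Quick sanity check of the remote API key, independent of Huixiangdou.
from openai import OpenAI

client = OpenAI(api_key="YOUR_REMOTE_API_KEY")  # add base_url=... for non-OpenAI providers
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(reply.choices[0].message.content)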
Deploying the Web Interface
For user interaction, a Gradio-based web interface can be launched. This provides a chat UI accessible via a browser.
# Install web dependencies
pip install gradio redis flask lark_oapi
# Start the web server
python3 -m tests.test_query_gradio
If deploying on a remote server, use SSH port forwarding to access the interface locally:
# Forward remote port 7860 to local port 7860
ssh -CNg -L 7860:127.0.0.1:7860 user@remote_host -p 22
Navigate to http://127.0.0.1:7860 in your browser to interact with your custom RAG assistant.