Prerequisites and Environment Setup
Running a large language model on your own hardware lets you use LangChain without relying on external paid APIs. This section walks through the setup using the 4-bit quantized Baichuan2-13B-Chat model (Baichuan2-13B-Chat-4bits).
Ensure the server environment meets the following requirements before proceeding:
- Python Version: 3.8 or higher.
- Deep Learning Framework: PyTorch 1.12+ (preferably 2.0+).
- GPU Acceleration: NVIDIA drivers compatible with CUDA 11.4+.
- Operating System: Linux (the instructions below assume a Linux environment).
Assume the working directory root is set to /aidev.
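Before downloading anything, it is worth confirming that PyTorch can see the GPU. A minimal check, assuming PyTorch is already installed:

import torch

# Report the installed PyTorch version and whether a CUDA-capable GPU is visible.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))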
Acquiring Model Artifacts
The target model resides on Hugging Face. To streamline the download process, a shell script can retrieve the configuration files, tokenizer assets, and binary weight file in one pass rather than selecting each file manually from the web interface.
#!/bin/bash
cd /aidev
mkdir -p baichuan-inc/Baichuan2-13B-Chat-4bits
cd baichuan-inc/Baichuan2-13B-Chat-4bits
# Install the aria2 download utility (multi-connection, resumable transfers)
apt-get update && apt-get install -y aria2

# Small configuration, tokenizer, and model code files in the repository
FILE_LIST=(
"config.json"
"configuration_baichuan.py"
"generation_config.json"
"generation_utils.py"
"handler.py"
"modeling_baichuan.py"
"quantizer.py"
"requirements.txt"
"special_tokens_map.json"
"tokenization_baichuan.py"
"tokenizer_config.json"
"tokenizer.model"
)
WEIGHT_FILE="pytorch_model.bin"

# Fetch the small text files from the repository's raw endpoint
for file in "${FILE_LIST[@]}"; do
  aria2c --console-log-level=error -x 16 -s 16 \
    "https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits/raw/main/$file" \
    -o "$file"
done
# Fetch the large weight file via the resolve endpoint, which follows the
# Git LFS pointer to the actual binary
aria2c --console-log-level=error -x 16 -s 16 \
  "https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits/resolve/main/$WEIGHT_FILE" \
  -o "$WEIGHT_FILE"
Dependency Resolution
With the model directory baichuan-inc/Baichuan2-13B-Chat-4bits populated, install the Python dependencies listed by the model repository to ensure compatibility with the transformers library and the model's custom code.
pip install -r baichuan-inc/Baichuan2-13B-Chat-4bits/requirements.txt
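After installation, a quick sanity check can confirm that the key packages import cleanly. The package names below are an assumption based on typical 4-bit quantized setups (bitsandbytes and accelerate in particular); compare the list against the repository's requirements.txt:

# Hedged sanity check: the package list is assumed, not taken from requirements.txt.
import importlib

for pkg in ("torch", "transformers", "accelerate", "bitsandbytes"):
    try:
        module = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(module, '__version__', 'unknown version')}")
    except ImportError as exc:
        print(f"{pkg}: MISSING ({exc})")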
Inference Verification
Before integrating with more complex application logic, validate the inference pipeline. Create a local script, app_verify.py, that instantiates the tokenizer and the causal language model. Setting torch_dtype=torch.bfloat16 keeps computation in bfloat16, balancing memory usage and performance.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig


def load_local_llm():
    # Path to the locally downloaded model directory
    model_path = "baichuan-inc/Baichuan2-13B-Chat-4bits"
    # The Baichuan tokenizer and model classes live in the repository's custom
    # code, so trust_remote_code=True is required.
    tok_handler = AutoTokenizer.from_pretrained(
        model_path,
        use_fast=False,
        trust_remote_code=True
    )
    llm_engine = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True
    )
    # Attach the repository's generation defaults (sampling parameters, etc.)
    config_params = GenerationConfig.from_pretrained(model_path)
    llm_engine.generation_config = config_params
    return tok_handler, llm_engine


if __name__ == "__main__":
    try:
        chat_tokenizer, inference_model = load_local_llm()
        context_window = [{"role": "user", "content": "Explain the phrase 'learning from the old'"}]
        # chat() is a convenience method provided by the Baichuan model code
        response_text = inference_model.chat(chat_tokenizer, context_window)
        print(f"Output: {response_text}")
    except Exception as error:
        print(f"Execution failed: {error}")
Run this verification script using python app_verify.py. If the terminal returns a coherent answer, the model loading sequence is successful.
Exposing via REST API
To allow external applications to query the hosted model, encapsulate it within a FastAPI service layer. Define a POST endpoint that accepts a text payload, runs it through the loaded model, and returns the generated response.
First, install the web server framework:
pip install uvicorn fastapi pydantic
Next, implement the server logic in server_app.py:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

app = FastAPI(title="Private LLM Service")

# Global instances initialized once at startup
chat_tokenizer = None
inference_model = None
model_config = None


def initialize_model():
    global chat_tokenizer, inference_model, model_config
    path = "baichuan-inc/Baichuan2-13B-Chat-4bits"
    chat_tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False, trust_remote_code=True)
    inference_model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
    model_config = GenerationConfig.from_pretrained(path)
    inference_model.generation_config = model_config


# Ensure the model is loaded once when the server starts
@app.on_event("startup")
async def startup_event():
    initialize_model()


class PromptPayload(BaseModel):
    user_query: str


@app.post("/api/v1/chat")
async def handle_inference(payload: PromptPayload):
    if inference_model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    try:
        input_context = [{"role": "user", "content": payload.user_query}]
        # chat() is the convenience method provided by the Baichuan model code
        result = inference_model.chat(chat_tokenizer, input_context)
        return {"status": "success", "result": result}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    import uvicorn
    # Start the server, listening on all interfaces at port 8000
    uvicorn.run(app, host="0.0.0.0", port=8000)
Start the service with python server_app.py, or run it in the background directly through uvicorn:
uvicorn server_app:app --host 0.0.0.0 --port 8000 &
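Once the service is running, any HTTP client can query it. A minimal sketch using the requests package (assumed to be installed separately with pip install requests) against the /api/v1/chat endpoint defined above:

import requests

# Send a prompt to the locally hosted service; adjust host and port if you changed them.
payload = {"user_query": "Explain the phrase 'learning from the old'"}
response = requests.post("http://127.0.0.1:8000/api/v1/chat", json=payload, timeout=300)
response.raise_for_status()

data = response.json()
print(data["status"], "-", data["result"])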