Hardware Requirements
The following configurations are for reference only:
- ChatGLM3-6B + M3E: NVIDIA RTX 3060 12GB or higher
- Qwen 4B + M3E: NVIDIA RTX 3060 12GB or higher
- Qwen 1.8B + M3E: NVIDIA GTX 1660 6GB or higher
Larger models demand more GPU memory and compute. Very small models can run on low-end CPUs, but their output quality is poor, and pairing them with the M3E embedding model makes processing very slow.
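Before picking a model, you can check how much VRAM your GPU actually has; a minimal sketch, assuming PyTorch with CUDA is already installed as described in the environment setup below:

```python
import torch

# Report the GPU name and total VRAM to help choose an appropriately sized model.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; only very small CPU models are practical.")
```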
Required Resources
- Conda: https://www.anaconda.com/ Package and environment manager for Python and other languages.
- One-API: https://github.com/songquanpeng/one-api A gateway that exposes multiple model providers behind a unified OpenAI-compatible API.
- ChatGLM3: https://github.com/THUDM/ChatGLM3 A conversational pre-trained model developed by Zhipu AI and Tsinghua University.
- M3E: https://modelscope.cn/models/Jerry0/m3e-base/summary Moka Massive Mixed Embedding model for converting text into dense vectors.
- Ollama: https://ollama.com/ A management tool for running large language models locally.
- FastGPT: https://github.com/labring/FastGPT A knowledge base Q&A system based on LLMs with visual workflow orchestration.
- ModelScope: https://modelscope.cn/home An open-source model-as-a-service platform.
Environment Setup with Conda
- Update Conda: `conda update -n base -c defaults conda`
- Update all libraries: `conda update --all`
- Create a virtual environment: `conda create --name llm_env python=3.11 -y`
- Activate the environment: `conda activate llm_env`
- Check the CUDA version: `nvidia-smi`
- Install PyTorch (adjust based on your CUDA version): `conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia`
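As a sanity check after the install, you can confirm that PyTorch sees the GPU; a minimal sketch, run inside the `llm_env` environment:

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # should print True on a working GPU setup
```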
Method 1: Direct Deployment (ChatGLM3 + M3E)
Download Models and Demo
- ChatGLM3-6B: `git lfs install && git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git`
- M3E-Base: `git clone https://www.modelscope.cn/Jerry0/m3e-base.git`
- Official Demo: `git clone https://github.com/THUDM/ChatGLM3`
Configuration and Execution
- Navigate to `ChatGLM3/openai_api_demo`.
- Edit `api_server.py` and update the model paths and port:

  ```python
  from transformers import AutoTokenizer, AutoModel
  from sentence_transformers import SentenceTransformer

  # Configure paths
  llm_tokenizer = AutoTokenizer.from_pretrained("/path/to/chatglm3-6b", trust_remote_code=True)
  llm_model = AutoModel.from_pretrained("/path/to/chatglm3-6b", trust_remote_code=True, device_map="auto").eval()

  # Load embedding model
  vector_model = SentenceTransformer("/path/to/m3e-base", trust_remote_code=True, device="cuda")

  # Launch server
  uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
  ```

- Install dependencies from the ChatGLM3 root directory: `pip install -r requirements.txt`
- Run the server: `python openai_api_demo/api_server.py`
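Once the server is up, you can smoke-test it with a plain HTTP request. The sketch below assumes the demo exposes the standard OpenAI-style `/v1/chat/completions` route on port 8000; the path and model name may differ in your version of the demo:

```python
import requests

# Send one chat turn to the locally hosted ChatGLM3 server.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "chatglm3-6b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```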
Method 2: Ollama Deployment
Install Ollama and Models
- Download from https://ollama.com/download.
- Verify the installation: `ollama -v`
- Pull and run a model (e.g., Qwen): `ollama run qwen:1.8b`
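To confirm the model is being served, you can hit Ollama's local HTTP API; a minimal sketch against the default `localhost:11434` address:

```python
import requests

# Ask the locally pulled qwen:1.8b model for a single non-streamed completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen:1.8b", "prompt": "Say hello.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```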
Deploy M3E via Docker
```bash
docker run -d --name m3e-vector -p 6008:6008 --gpus all \
  -e sk-key=123321 \
  registry.cn-hangzhou.aliyuncs.com/fastgpt_docker/m3e-large-api
```
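You can verify the container with an embeddings request. This sketch assumes the image exposes an OpenAI-style `/v1/embeddings` route and that the bearer token matches the `sk-key` value set above:

```python
import requests

# Request an embedding for one sentence from the containerized M3E service.
resp = requests.post(
    "http://localhost:6008/v1/embeddings",
    headers={"Authorization": "Bearer 123321"},
    json={"model": "m3e", "input": ["test sentence"]},
    timeout=30,
)
print(len(resp.json()["data"][0]["embedding"]))  # prints the embedding dimension
```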
One-API Deployment
```bash
docker run --name one-api -d --restart always -p 3000:3000 \
  -e TZ=Asia/Shanghai \
  -v /data/one-api:/data \
  justsong/one-api
```
Access at http://localhost:3000 (default credentials: root / 123456). Add your LLM and M3E channels, then create a token for FastGPT.
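After creating a token, any OpenAI-compatible client can reach your models through One-API; a minimal sketch, where the token below is a placeholder for the one you created:

```python
from openai import OpenAI

# Route a chat request through the One-API relay on port 3000.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="sk-your-one-api-token")
resp = client.chat.completions.create(
    model="qwen:1.8b",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```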
FastGPT Deployment
- Download `docker-compose.yml` and `config.json`.
- Modify `config.json`:
  - Add your LLM to `llmModels`:

    ```json
    {
      "model": "qwen:1.8b",
      "name": "qwen:1.8b",
      "maxContext": 16000,
      "avatar": "/imgs/model/openai.svg",
      "maxResponse": 4000
    }
    ```

  - If using Ollama, add M3E to `vectorModels`:

    ```json
    {
      "model": "m3e",
      "name": "M3E",
      "defaultToken": 700,
      "maxToken": 1800
    }
    ```

- Adjust `docker-compose.yml` (comment out MySQL/One-API if deployed separately; ensure MongoDB version compatibility).
- Start the services:

  ```bash
  docker-compose up -d
  sleep 10
  docker restart oneapi
  ```

- Access FastGPT at `http://ip:3000` (default credentials: root / 1234).
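FastGPT applications can also be called programmatically through their OpenAI-compatible endpoint; this sketch assumes you created an application and an app-specific API key in the FastGPT UI (the host and key below are placeholders):

```python
from openai import OpenAI

# FastGPT selects the application from the API key, so the model field is not used for routing.
client = OpenAI(base_url="http://your-fastgpt-host:3000/api/v1", api_key="fastgpt-your-app-key")
resp = client.chat.completions.create(
    model="fastgpt",
    messages=[{"role": "user", "content": "What does my knowledge base say about deployment?"}],
)
print(resp.choices[0].message.content)
```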
Troubleshooting
- Hugging Face Hub error: `pip install huggingface-hub==0.20.3`
- CUDA out of memory: use a smaller model or enable quantization; you can also set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (`set` on Windows, `export` on Linux/macOS).
- Ollama reachable on localhost only: set the environment variable `OLLAMA_HOST` to `0.0.0.0:11434`.
Sample API Server Script
```python
import os
import torch
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

# Model paths (overridable via environment variables)
LLM_DIR = os.environ.get('LLM_DIR', '/models/chatglm3-6b')
EMBED_DIR = os.environ.get('EMBED_DIR', '/models/m3e-base')

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Free GPU memory on shutdown
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

app = FastAPI(lifespan=lifespan)
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

if __name__ == "__main__":
    # Load the LLM and the embedding model before serving
    tokenizer = AutoTokenizer.from_pretrained(LLM_DIR, trust_remote_code=True)
    model = AutoModel.from_pretrained(LLM_DIR, trust_remote_code=True, device_map="auto").eval()
    embedder = SentenceTransformer(EMBED_DIR, trust_remote_code=True, device="cuda")
    uvicorn.run(app, host='0.0.0.0', port=8000)
```