Building a Private Knowledge Base with FastGPT, ChatGLM, Ollama, and M3E Embeddings

Hardware Requirements

The following configurations are for reference only:

  • ChatGLM3-6B + M3E: NVIDIA RTX 3060 12GB or higher
  • Qwen:4B + M3E: NVIDIA RTX 3060 12GB or higher
  • Qwen:1.8B + M3E: NVIDIA GTX 1660 6GB or higher

Larger models need correspondingly more GPU memory and compute. Very small models can run on a low-end CPU, but answer quality suffers, and pairing a CPU-bound LLM with the M3E embedding model makes indexing and retrieval very slow.

Environment Setup with Conda

  1. Update Conda:
    conda update -n base -c defaults conda
    
  2. Update all libraries:
    conda update --all
    
  3. Create a virtual environment:
    conda create --name llm_env python=3.11 -y
    
  4. Activate the environment:
    conda activate llm_env
    
  5. Check CUDA version:
    nvidia-smi
    
  6. Install PyTorch (adjust based on your CUDA version):
    conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
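
With the environment in place, a quick Python check confirms that the CUDA build of PyTorch actually sees the GPU before any models are loaded:

    import torch

    # A CUDA build reports a version string ending in a +cuXXX tag
    print(torch.__version__)
    # True means the installed build and the driver are compatible
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))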
    

Method 1: Direct Deployment (ChatGLM3 + M3E)

Download Models and Demo

  1. ChatGLM3-6B:
    git lfs install
    git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git
    
  2. M3E-Base:
    git clone https://www.modelscope.cn/Jerry0/m3e-base.git
    
  3. Official Demo:
    git clone https://github.com/THUDM/ChatGLM3
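
Before moving on, it is worth sanity-checking the cloned m3e-base weights, since an incomplete git-lfs pull is a common cause of cryptic load errors later. The path below is a placeholder for wherever you cloned the repository:

    from sentence_transformers import SentenceTransformer

    # Load the locally cloned M3E model and embed a short test sentence
    embedder = SentenceTransformer("/path/to/m3e-base")
    vec = embedder.encode("这是一个测试句子")
    print(vec.shape)  # m3e-base produces 768-dimensional vectors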
    

Configuration and Execution

  1. Navigate to ChatGLM3/openai_api_demo.
  2. Edit api_server.py. Update the model paths and port:
    import uvicorn
    from transformers import AutoTokenizer, AutoModel
    from sentence_transformers import SentenceTransformer

    # Point these at the directories cloned earlier
    llm_tokenizer = AutoTokenizer.from_pretrained("/path/to/chatglm3-6b", trust_remote_code=True)
    llm_model = AutoModel.from_pretrained("/path/to/chatglm3-6b", trust_remote_code=True, device_map="auto").eval()

    # Embedding model served alongside the LLM for /v1/embeddings
    vector_model = SentenceTransformer("/path/to/m3e-base", device="cuda")

    # Launch the OpenAI-compatible server
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
    
  3. Install dependencies from the ChatGLM3 root directory:
    pip install -r requirements.txt
    
  4. Run the server:
    python openai_api_demo/api_server.py
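
Once the server is up, a short request against its OpenAI-compatible route verifies the whole stack. The model name "chatglm3-6b" is an assumption; use whatever identifier the demo server registers:

    import requests

    # Minimal chat-completion smoke test against the local server
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "chatglm3-6b",  # assumed model id
            "messages": [{"role": "user", "content": "你好"}],
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])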
    

Method 2: Ollama Deployment

Install Ollama and Models

  1. Download from https://ollama.com/download.
  2. Verify installation:
    ollama -v
    
  3. Pull and run a model (e.g., Qwen):
    ollama run qwen:1.8b
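
Ollama listens on port 11434 by default, which is the address One-API will later proxy to. A quick request confirms the model answers:

    import requests

    # Non-streaming generation request against Ollama's native API
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen:1.8b", "prompt": "Hello", "stream": False},
    )
    print(resp.json()["response"])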
    

Deploy M3E via Docker

docker run -d --name m3e-vector -p 6008:6008 --gpus all -e sk-key=123321 registry.cn-hangzhou.aliyuncs.com/fastgpt_docker/m3e-large-api
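
To confirm the container is serving embeddings, send a request using the sk-key value from the run command as the bearer token. The OpenAI-style /v1/embeddings route and payload shape here are assumptions based on how this image is typically wired into One-API:

    import requests

    # Bearer token mirrors the sk-key environment variable above (assumption)
    resp = requests.post(
        "http://localhost:6008/v1/embeddings",
        headers={"Authorization": "Bearer 123321"},
        json={"model": "m3e", "input": ["测试句子"]},
    )
    print(len(resp.json()["data"][0]["embedding"]))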

One-API Deployment

docker run --name one-api -d --restart always -p 3000:3000 -e TZ=Asia/Shanghai -v /data/one-api:/data justsong/one-api

Access at http://localhost:3000 (default credentials: root / 123456). Add your LLM and M3E channels, then create a token for FastGPT.
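
Once the channels and token are in place, the gateway can be exercised end to end: One-API relays OpenAI-style requests to whichever channel serves the requested model. The token value below is a placeholder for the one you created:

    import requests

    TOKEN = "sk-your-one-api-token"  # placeholder for the token created above

    # One-API forwards this OpenAI-style request to the matching channel
    resp = requests.post(
        "http://localhost:3000/v1/chat/completions",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"model": "qwen:1.8b", "messages": [{"role": "user", "content": "ping"}]},
    )
    print(resp.json()["choices"][0]["message"]["content"])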

FastGPT Deployment

  1. Download docker-compose.yml and config.json from the FastGPT repository.
  2. Modify config.json:
    • Add your LLM to llmModels:
      {
        "model": "qwen:1.8b",
        "name": "qwen:1.8b",
        "maxContext": 16000,
        "avatar": "/imgs/model/openai.svg",
        "maxResponse": 4000
      }
      
    • If using Ollama, add M3E to vectorModels:
      {
        "model": "m3e",
        "name": "M3E",
        "defaultToken": 700,
        "maxToken": 1800
      }
      
  3. Adjust docker-compose.yml (comment out MySQL/One-API if separately deployed, ensure MongoDB version compatibility).
  4. Start services:
    docker-compose up -d
    sleep 10
    docker restart oneapi
    
  5. Access FastGPT at http://ip:3000 (default login: root / 1234). If One-API from the earlier step is already bound to port 3000 on the same host, map FastGPT to a different port in docker-compose.yml.

Troubleshooting

  • Hugging Face Hub Error: newer huggingface-hub releases can be incompatible with the demo code; pin a known-good version:
    pip install huggingface-hub==0.20.3
    
  • CUDA Out of Memory: let the allocator grow segments on demand (use export instead of set on Linux/macOS):
    set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    
    Consider using a smaller model or enabling quantization (see the sketch after this list).
  • Ollama Localhost Only: set the OLLAMA_HOST environment variable to 0.0.0.0:11434 and restart Ollama so other hosts and containers can reach it.
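
For ChatGLM3 specifically, quantized loading needs no extra libraries because the model's bundled remote code exposes a quantize() helper. A minimal sketch, reusing the placeholder model path from earlier:

    from transformers import AutoModel

    # .quantize(4) comes from ChatGLM3's trust_remote_code modeling files;
    # 4-bit weights substantially reduce the VRAM footprint
    model = AutoModel.from_pretrained(
        "/path/to/chatglm3-6b", trust_remote_code=True
    ).quantize(4).cuda().eval()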

Sample API Server Script
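
The skeleton below ties the pieces together: it loads ChatGLM3 and M3E once at startup, then exposes minimal chat and embeddings routes shaped loosely after the OpenAI API. Treat it as a starting point rather than a replacement for the official demo server.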

import os
import torch
import uvicorn
from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

# Model directories, overridable via environment variables
LLM_DIR = os.environ.get('LLM_DIR', '/models/chatglm3-6b')
EMBED_DIR = os.environ.get('EMBED_DIR', '/models/m3e-base')

# Load both models once at import time so the routes below can use them
tokenizer = AutoTokenizer.from_pretrained(LLM_DIR, trust_remote_code=True)
model = AutoModel.from_pretrained(LLM_DIR, trust_remote_code=True, device_map="auto").eval()
embedder = SentenceTransformer(EMBED_DIR, device="cuda" if torch.cuda.is_available() else "cpu")

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Release cached GPU memory on shutdown
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

app = FastAPI(lifespan=lifespan)
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

class ChatRequest(BaseModel):
    messages: list[dict]

class EmbeddingRequest(BaseModel):
    input: list[str]

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    # ChatGLM3's remote code provides a .chat() helper; history is kept empty here
    response, _ = model.chat(tokenizer, req.messages[-1]["content"], history=[])
    return {"choices": [{"index": 0, "message": {"role": "assistant", "content": response}}]}

@app.post("/v1/embeddings")
def embeddings(req: EmbeddingRequest):
    # Encode with M3E; the response shape loosely follows the OpenAI embeddings API
    vectors = embedder.encode(req.input, normalize_embeddings=True).tolist()
    return {"object": "list",
            "data": [{"object": "embedding", "index": i, "embedding": v}
                     for i, v in enumerate(vectors)]}

if __name__ == "__main__":
    uvicorn.run(app, host='0.0.0.0', port=8000)

Tags: FastGPT ChatGLM Ollama M3E Knowledge Base

Posted on Fri, 15 May 2026 11:00:07 +0000 by elementaluk