Hardware Requirements
The following configurations are for reference only:
- ChatGLM3-6B + M3E: NVIDIA RTX 3060 12GB or higher
- Qwen 4B + M3E: NVIDIA RTX 3060 12GB or higher
- Qwen 1.8B + M3E: NVIDIA GTX 1660 6GB or higher
Larger models demand more GPU memory and compute. Very small models can run on low-end CPUs, but their output quality is poor, and pairing them with the M3E embedding model makes processing very slow.
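Before picking a model, you can check how much VRAM your GPU actually has; a minimal sketch, assuming PyTorch with CUDA is already installed as described in the environment setup below:

```python
import torch

# Report the GPU name and total VRAM to help choose an appropriately sized model.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; only very small CPU models are practical.")
```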
Required Resources
- Conda: https://www.anaconda.com/ Package and environment manager for Python and other languages.
- One-API: https://github.com/songquanpeng/one-api A gateway that exposes multiple model providers behind a unified OpenAI-compatible API.
- ChatGLM3: https://github.com/THUDM/ChatGLM3 A conversational pre-trained model developed by Zhipu AI and Tsinghua University.
- M3E: https://modelscope.cn/models/Jerry0/m3e-base/summary Moka Massive Mixed Embedding model for converting text into dense vectors.
- Ollama: https://ollama.com/ A management tool for running large language models locally.
- FastGPT: https://github.com/labring/FastGPT A knowledge base Q&A system based on LLMs with visual workflow orchestration.
- ModelScope: https://modelscope.cn/home An open-source model-as-a-service platform.
Environment Setup with Conda
- Update Conda: `conda update -n base -c defaults conda`
- Update all libraries: `conda update --all`
- Create a virtual environment: `conda create --name llm_env python=3.11 -y`
- Activate the environment: `conda activate llm_env`
- Check the CUDA version: `nvidia-smi`
- Install PyTorch (adjust based on your CUDA version): `conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia`
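As a sanity check after the install, you can confirm that PyTorch sees the GPU; a minimal sketch, run inside the `llm_env` environment:

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # should print True on a working GPU setup
```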
Method 1: Direct Deployment (ChatGLM3 + M3E)
Download Models and Demo
- ChatGLM3-6B: `git lfs install && git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git`
- M3E-Base: `git clone https://www.modelscope.cn/Jerry0/m3e-base.git`
- Official Demo: `git clone https://github.com/THUDM/ChatGLM3`
Configuration and Execution
- Navigate to `ChatGLM3/openai_api_demo`.
- Edit `api_server.py` and update the model paths and port:

  ```python
  from transformers import AutoTokenizer, AutoModel
  from sentence_transformers import SentenceTransformer

  # Configure paths
  llm_tokenizer = AutoTokenizer.from_pretrained("/path/to/chatglm3-6b", trust_remote_code=True)
  llm_model = AutoModel.from_pretrained("/path/to/chatglm3-6b", trust_remote_code=True, device_map="auto").eval()

  # Load embedding model
  vector_model = SentenceTransformer("/path/to/m3e-base", trust_remote_code=True, device="cuda")

  # Launch server
  uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
  ```

- Install dependencies from the ChatGLM3 root directory: `pip install -r requirements.txt`
- Run the server: `python openai_api_demo/api_server.py`
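Once the server is up, you can smoke-test it with a plain HTTP request. The sketch below assumes the demo exposes the standard OpenAI-style `/v1/chat/completions` route on port 8000; the path and model name may differ in your version of the demo:

```python
import requests

# Send one chat turn to the locally hosted ChatGLM3 server.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "chatglm3-6b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```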
Method 2: Ollama Deployment
Install Ollama and Models
- Download from https://ollama.com/download.
- Verify the installation: `ollama -v`
- Pull and run a model (e.g., Qwen): `ollama run qwen:1.8b`
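To confirm the model is being served, you can hit Ollama's local HTTP API; a minimal sketch against the default `localhost:11434` address:

```python
import requests

# Ask the locally pulled qwen:1.8b model for a single non-streamed completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen:1.8b", "prompt": "Say hello.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```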
Deploy M3E via Docker
```bash
docker run -d --name m3e-vector -p 6008:6008 --gpus all \
  -e sk-key=123321 \
  registry.cn-hangzhou.aliyuncs.com/fastgpt_docker/m3e-large-api
```
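You can verify the container with an embeddings request. This sketch assumes the image exposes an OpenAI-style `/v1/embeddings` route and that the bearer token matches the `sk-key` value set above:

```python
import requests

# Request an embedding for one sentence from the containerized M3E service.
resp = requests.post(
    "http://localhost:6008/v1/embeddings",
    headers={"Authorization": "Bearer 123321"},
    json={"model": "m3e", "input": ["test sentence"]},
    timeout=30,
)
print(len(resp.json()["data"][0]["embedding"]))  # prints the embedding dimension
```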
One-API Deployment
```bash
docker run --name one-api -d --restart always -p 3000:3000 \
  -e TZ=Asia/Shanghai \
  -v /data/one-api:/data \
  justsong/one-api
```
Access at http://localhost:3000 (default credentials: root / 123456). Add your LLM and M3E channels, then create a token for FastGPT.
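After creating a token, any OpenAI-compatible client can reach your models through One-API; a minimal sketch, where the token below is a placeholder for the one you created:

```python
from openai import OpenAI

# Route a chat request through the One-API relay on port 3000.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="sk-your-one-api-token")
resp = client.chat.completions.create(
    model="qwen:1.8b",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```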
FastGPT Deployment
- Download `docker-compose.yml` and `config.json`.
- Modify `config.json`:
  - Add your LLM to `llmModels`:

    ```json
    {
      "model": "qwen:1.8b",
      "name": "qwen:1.8b",
      "maxContext": 16000,
      "avatar": "/imgs/model/openai.svg",
      "maxResponse": 4000
    }
    ```

  - If using Ollama, add M3E to `vectorModels`:

    ```json
    {
      "model": "m3e",
      "name": "M3E",
      "defaultToken": 700,
      "maxToken": 1800
    }
    ```

- Adjust `docker-compose.yml` (comment out MySQL/One-API if deployed separately; ensure MongoDB version compatibility).
- Start the services:

  ```bash
  docker-compose up -d
  sleep 10
  docker restart oneapi
  ```

- Access FastGPT at `http://ip:3000` (default credentials: root / 1234).
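FastGPT applications can also be called programmatically through their OpenAI-compatible endpoint; this sketch assumes you created an application and an app-specific API key in the FastGPT UI (the host and key below are placeholders):

```python
from openai import OpenAI

# FastGPT selects the application from the API key, so the model field is not used for routing.
client = OpenAI(base_url="http://your-fastgpt-host:3000/api/v1", api_key="fastgpt-your-app-key")
resp = client.chat.completions.create(
    model="fastgpt",
    messages=[{"role": "user", "content": "What does my knowledge base say about deployment?"}],
)
print(resp.choices[0].message.content)
```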
Troubleshooting
- Hugging Face Hub error: `pip install huggingface-hub==0.20.3`
- CUDA out of memory: use a smaller model or enable quantization; you can also set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (`set` on Windows, `export` on Linux/macOS).
- Ollama reachable on localhost only: set the environment variable `OLLAMA_HOST` to `0.0.0.0:11434`.
Sample API Server Script
```python
import os
import torch
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

# Model paths (overridable via environment variables)
LLM_DIR = os.environ.get('LLM_DIR', '/models/chatglm3-6b')
EMBED_DIR = os.environ.get('EMBED_DIR', '/models/m3e-base')

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Free GPU memory on shutdown
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

app = FastAPI(lifespan=lifespan)
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

if __name__ == "__main__":
    # Load the LLM and the embedding model before serving
    tokenizer = AutoTokenizer.from_pretrained(LLM_DIR, trust_remote_code=True)
    model = AutoModel.from_pretrained(LLM_DIR, trust_remote_code=True, device_map="auto").eval()
    embedder = SentenceTransformer(EMBED_DIR, trust_remote_code=True, device="cuda")
    uvicorn.run(app, host='0.0.0.0', port=8000)
```