Prerequisites and Environment Configuration
Before deploying any model, ensure the Python environment is stable. Update package managers and configure mirror sources to improve download stability if network constraints exist.
Dependency Installation
Update pip first, then install core libraries required for Hugging Face or ModelScope models. Using domestic mirrors can significantly accelerate installation in regions with restricted access.
pip install --upgrade pip
pip install transformers==4.35.2
pip install accelerate==0.24.1
pip install streamlit==1.24.0
pip install sentencepiece==0.1.99
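After installation, it can be worth confirming that the pinned versions actually resolved before loading any model. A minimal stdlib check (the helper name is illustrative):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

for dist in ("transformers", "accelerate", "streamlit", "sentencepiece"):
    print(dist, installed_version(dist) or "NOT INSTALLED")
```

Run this in the same interpreter you will serve from; a mismatch between shells is a common source of "module not found" errors later.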
Conda Repository Management
If using Conda, manage channels explicitly to prevent conflicts between system packages and scientific dependencies.
# Add primary Anaconda channels
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
# Enable channel URLs for debugging
conda config --set show_channel_urls yes
# Install specific package from a URL if needed
conda install -c <channel-url> package_name
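For reference, the channel commands above persist their settings into ~/.condarc; the resulting file looks roughly like this (exact contents depend on your existing configuration):

```yaml
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - defaults
show_channel_urls: true
```

Editing ~/.condarc directly is equivalent to running the `conda config` commands.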
Model Acquisition Strategies
Models can be retrieved via ModelScope (domestic) or Hugging Face (international). Both methods support transformers integration.
Method A: ModelScope Snapshot
This method uses the built-in snapshot downloader provided by ModelScope.
import torch
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
import os
# Define repository ID and local cache path
task_repo = "Shanghai_AI_Laboratory/internlm-20b"
cache_path = "/root/models_cache"
revision_tag = "v1.0.2"
# Fetch model files
downloaded_dir = snapshot_download(
    task_repo,
    cache_dir=cache_path,
    revision=revision_tag
)
# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(downloaded_dir, trust_remote_code=True)
model_dtype = torch.bfloat16
llm_model = AutoModelForCausalLM.from_pretrained(
    downloaded_dir,
    torch_dtype=model_dtype,
    trust_remote_code=True
).to("cuda")
llm_model.eval()
# Prepare input prompt
prompt_text = "Discover the wonders of the natural world"
payload = tokenizer(prompt_text, return_tensors="pt")
# Move inputs to GPU
for key, tensor_val in payload.items():
    payload[key] = tensor_val.cuda()
# Generation configuration
generation_params = {
    "max_length": 128,
    "top_p": 0.8,
    "temperature": 0.8,
    "do_sample": True,
    "repetition_penalty": 1.05
}
# Execute inference
raw_response = llm_model.generate(**payload, **generation_params)
final_output = tokenizer.decode(raw_response[0].tolist(), skip_special_tokens=True)
print(final_output)
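One detail worth noting: with this setup the decoded text begins with the original prompt, because generate returns the prompt tokens followed by the newly sampled tokens in a single tensor. A torch-free sketch of the fix (the token IDs are made up):

```python
# Stand-ins for payload["input_ids"][0] and raw_response[0] respectively
prompt_ids = [101, 2023, 2003]
generated = prompt_ids + [2054, 2052, 102]

# Drop the prompt prefix so only the completion remains
new_token_ids = generated[len(prompt_ids):]
print(new_token_ids)  # → [2054, 2052, 102]
```

In the snippet above this corresponds to decoding `raw_response[0][payload["input_ids"].shape[1]:]` instead of the full sequence.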
Method B: Hugging Face Hub
For international repositories, environment variables often need adjustment to bypass connectivity issues.
- Set the endpoint mirror:
export HF_ENDPOINT="https://hf-mirror.com"
- Install the necessary utilities:
pip install --upgrade huggingface_hub hf-transfer
- Authenticate locally:
huggingface-cli login  # Paste token when prompted
- Sync model weights:
huggingface-cli download --resume-download <model-name> --cache-dir /path/to/cache
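Note that installing hf-transfer alone has no effect: the hub client only uses the accelerated backend when it is explicitly enabled via an environment variable before the download starts:

```shell
# Opt in to the Rust-based transfer backend installed above
export HF_HUB_ENABLE_HF_TRANSFER=1
```

If downloads fail with this enabled, unset the variable to fall back to the default (and more verbose) Python downloader.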
Building an API Service Layer
To serve the model programmatically, wrap the inference logic within a Flask application. This approach allows external clients to query the model via HTTP POST requests.
from flask import Flask, request, jsonify
import torch
from modelscope import AutoTokenizer, AutoModelForCausalLM
app = Flask(__name__)
MODEL_CACHE = None
TOKENIZER_OBJ = None
# Load assets on startup
def init_engine(model_path):
    global MODEL_CACHE, TOKENIZER_OBJ
    if MODEL_CACHE is None:
        model_dtype = torch.bfloat16
        TOKENIZER_OBJ = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        MODEL_CACHE = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=model_dtype,
            trust_remote_code=True
        ).cuda()
        MODEL_CACHE.eval()
@app.route('/predict', methods=['POST'])
def run_inference():
    if not app.config.get('ENGINE_READY'):
        return jsonify({"error": "Model engine not initialized"}), 503
    json_body = request.get_json()
    if not json_body or 'input_text' not in json_body:
        return jsonify({"error": "Missing 'input_text' field"}), 400
    user_query = json_body['input_text']
    encoded_input = TOKENIZER_OBJ(user_query, return_tensors="pt")
    # Transfer batch to GPU
    for key in encoded_input:
        encoded_input[key] = encoded_input[key].cuda()
    settings = {
        "max_length": 128,
        "do_sample": True,
        "top_p": 0.8,
        "temperature": 0.8,
        "repetition_penalty": 1.05
    }
    response_tensor = MODEL_CACHE.generate(**encoded_input, **settings)
    decoded_result = TOKENIZER_OBJ.decode(response_tensor[0].tolist(), skip_special_tokens=True)
    return jsonify({"response": decoded_result})

if __name__ == '__main__':
    # Ensure initialization happens before serving
    init_engine("./local_model_dir")
    app.config['ENGINE_READY'] = True
    # debug=True would start the auto-reloader and load the model twice; keep it off
    app.run(host='0.0.0.0', port=8848, debug=False)
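Once the service is running, any HTTP client can query it. A stdlib-only Python client sketch, matching the route and field names defined above:

```python
import json
from urllib import request as urlrequest

def build_request(url, text):
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps({"input_text": text}).encode("utf-8")
    return urlrequest.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def query(url, text):
    """Send the request and return the decoded 'response' field."""
    with urlrequest.urlopen(build_request(url, text)) as resp:
        return json.loads(resp.read())["response"]

# Example (requires the Flask server to be running):
# print(query("http://127.0.0.1:8848/predict", "Hello"))
```

The equivalent curl invocation is: curl -X POST http://127.0.0.1:8848/predict -H "Content-Type: application/json" -d '{"input_text": "Hello"}'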
Hardware Abstraction and CUDA Management
Managing the underlying GPU environment is critical for performance and error prevention. Modern Linux systems often use module systems like Lmod to abstract hardware drivers.
Module System Usage
Environment modules allow dynamic swapping of software versions without affecting the whole shell session.
# View available modules
module avail
# Load specific CUDA version
module load cuda/11.4
# Unload current module
module unload cuda/11.4
# Show currently loaded modules
module list
Driver vs. Toolkit Awareness
PyTorch compatibility relies on matching the CUDA toolkit version rather than the driver version. Typically, driver versions are higher than the toolkit and remain backward compatible with it.
# Inspect driver status and active GPUs
nvidia-smi
# Check compiler/toolkit version
nvcc --version
Debugging Visibility Issues
If you encounter GPU initialization or device-visibility errors (for example, "driver initialization failed"), check the CUDA_VISIBLE_DEVICES environment variable, which controls which physical GPUs the process can access.
To restrict or expose specific cards in the current shell:
export CUDA_VISIBLE_DEVICES=0,1,6,7
To make the setting persist, append the export line to a profile script (e.g. ~/.bashrc) and reload it:
source ~/.bashrc
This ensures consistent GPU selection after server reboots or configuration changes.
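The variable's effect can be mimicked in pure Python: CUDA enumerates only the listed physical devices and renumbers them from zero inside the process, so logical device 2 above is physical GPU 6. A hypothetical stdlib sketch of that mapping:

```python
import os

def visible_gpu_map(env=os.environ):
    """Map in-process logical device indices to physical GPU indices,
    following CUDA_VISIBLE_DEVICES semantics (unset means all visible)."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # identity mapping: every physical device is visible
    physical = [int(tok) for tok in raw.split(",") if tok.strip()]
    return {logical: phys for logical, phys in enumerate(physical)}

print(visible_gpu_map({"CUDA_VISIBLE_DEVICES": "0,1,6,7"}))
# → {0: 0, 1: 1, 2: 6, 3: 7}
```

This is why code that calls .cuda() or .to("cuda:0") still works after restricting visibility: device 0 simply points at the first listed physical card.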