llama.cpp is the underlying implementation for popular applications such as Ollama and LM Studio, and it is one of the inference engines supported by GPUStack. It defines GGUF, a model file format designed specifically for optimized inference, which allows models to be loaded and executed quickly.
The framework also supports model quantization, which reduces storage and computational requirements at the cost of some accuracy; the loss is usually small at moderate bit widths and grows as the bit width shrinks. This makes it practical to deploy large language models on desktops, embedded systems, and other resource-constrained environments, often with faster inference.
This guide explains how to create and quantize GGUF models and upload them to HuggingFace and ModelScope repositories.
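To make the storage saving from quantization concrete, the following sketch estimates file sizes from a weight count. It assumes the classic block layouts, in which 32 weights occupy 64 bytes in F16 and 34, 22, and 18 bytes in Q8_0, Q5_0, and Q4_0 respectively. The function name `estimate_gguf_bytes` and the 3-billion-weight figure are illustrative, and real files will differ somewhat because of metadata and tensors kept at higher precision.

```shell
# Rough GGUF size estimate: weights are stored in blocks of 32.
# Bytes per 32-weight block (assumed from the classic block layouts):
#   F16 = 64, Q8_0 = 34, Q5_0 = 22, Q4_0 = 18
estimate_gguf_bytes() (
    params="$1"   # total number of weights, e.g. 3000000000
    qtype="$2"    # one of F16, Q8_0, Q5_0, Q4_0
    case "$qtype" in
        F16)  block_bytes=64 ;;
        Q8_0) block_bytes=34 ;;
        Q5_0) block_bytes=22 ;;
        Q4_0) block_bytes=18 ;;
        *)    echo "unknown type: $qtype" >&2; exit 1 ;;
    esac
    echo $(( params / 32 * block_bytes ))
)

# Illustrative only: a hypothetical model with 3 billion weights.
for t in F16 Q8_0 Q5_0 Q4_0; do
    printf '%-5s %s bytes\n' "$t" "$(estimate_gguf_bytes 3000000000 "$t")"
done
```

Under these assumptions the hypothetical model shrinks from about 6.0 GB in F16 to about 1.7 GB in Q4_0.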
Setting Up HuggingFace and ModelScope Accounts
Registering for HuggingFace
Visit https://huggingface.co/join to create a HuggingFace account (requires appropriate network access).
Configuring HuggingFace SSH Public Key
Add your local environment's SSH public key to HuggingFace. If you do not have a key pair yet, generate one first with ssh-keygen. To view an existing RSA public key, use:
cat ~/.ssh/id_rsa.pub
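The command above assumes an RSA key. As a small sketch, the helper below (the name `find_pubkey` is hypothetical) looks for the common OpenSSH default file names in order and prints the first public key it finds, or a hint to create one.

```shell
# Print the path of the first public key found among common
# OpenSSH default file names; return non-zero if none exists.
find_pubkey() (
    dir="$1"
    for name in id_ed25519.pub id_ecdsa.pub id_rsa.pub; do
        if [ -f "$dir/$name" ]; then
            echo "$dir/$name"
            exit 0
        fi
    done
    exit 1
)

if key="$(find_pubkey "$HOME/.ssh")"; then
    cat "$key"
else
    # No key yet: generate one, then rerun this snippet.
    echo "No public key found; create one with: ssh-keygen -t ed25519"
fi
```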
In HuggingFace, click your profile picture in the top-right corner, select Settings → SSH and GPG Keys, and add the public key for authentication during model uploads.
Registering for ModelScope
Visit https://www.modelscope.cn/register?back=%2Fhome to create a ModelScope account.
Obtaining ModelScope Token
Access https://www.modelscope.cn/my/myaccesstoken to copy and save your Git access token for authentication during model uploads.
Preparing the llama.cpp Environment
Create and activate a Conda environment (if Conda is not installed, refer to the Miniconda documentation at https://docs.anaconda.com/miniconda/):
conda create -n llama-cpp python=3.12 -y
conda activate llama-cpp
which python
pip -V
Clone llama.cpp at release tag b4034 and compile the binaries required for quantization (the commands below assume macOS with Homebrew):
cd ~
git clone -b b4034 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
pip install -r requirements.txt
brew install cmake
make
After compilation, verify the availability of the llama-quantize binary with:
./llama-quantize --help
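If the build or the help command fails, a missing prerequisite is a common cause. The following sketch lists which of the tools used so far are not on the PATH; the helper name `missing_tools` and the exact tool list are assumptions to adapt to your setup.

```shell
# Print each required command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools used in this guide so far; adjust the list to your setup.
missing="$(missing_tools git python3 pip cmake make)"
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools are available."
fi
```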
Downloading the Original Model
Download the original model that will be converted to GGUF format and quantized.
To download models from HuggingFace, use the huggingface-cli command. First, install dependencies:
pip install -U huggingface_hub
If huggingface.co is slow or unreachable from your network (for example, from mainland China), optionally point the client at a mirror:
export HF_ENDPOINT=https://hf-mirror.com
In this example, we'll download the meta-llama/Llama-3.2-3B-Instruct model, which is a gated model requiring access authorization on HuggingFace:
- Click your profile picture in the top-right corner of HuggingFace and select Access Tokens
- Create a token with Read permissions and save it
Download the model using:
mkdir ~/huggingface.co
cd ~/huggingface.co/
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir Llama-3.2-3B-Instruct --token hf_abcdefghijklmnopqrstuvwxyz
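An interrupted download can leave the directory incomplete, which only surfaces later during conversion. As a hedged sanity check, the sketch below treats a directory as usable when it contains config.json and at least one weight file; the helper name `check_hf_model_dir` and the choice of files to check are assumptions, not an official completeness test.

```shell
# Return 0 if the directory looks like a usable HuggingFace checkpoint:
# a config.json plus at least one weight file.
check_hf_model_dir() (
    dir="$1"
    [ -f "$dir/config.json" ] || { echo "missing config.json" >&2; exit 1; }
    for f in "$dir"/*.safetensors "$dir"/*.bin; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -f "$f" ] && exit 0
    done
    echo "no weight files found" >&2
    exit 1
)

check_hf_model_dir "$HOME/huggingface.co/Llama-3.2-3B-Instruct" \
    && echo "Model directory looks complete." \
    || echo "Model directory is incomplete; rerun the download."
```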
Converting to GGUF Format and Quantizing the Model
Create a script for converting to GGUF format and quantizing the model:
cd ~/huggingface.co/
vim convert_quantize.sh
Enter the following script content. Change llama_cpp_path and base_dir to the absolute paths in your environment, and replace username in destination_dir with your HuggingFace username:
#!/usr/bin/env bash
# Usage: bash convert_quantize.sh <source-model>
llama_cpp_path="/Users/username/llama.cpp"
base_dir="/Users/username/huggingface.co"
export PATH="$PATH:${llama_cpp_path}"

# Require the source model argument
if [[ $# -lt 1 ]]; then
    echo "Usage: $0 <source-model>" >&2
    exit 1
fi

source_model="$1"
model_name="$(echo "${source_model}" | cut -d'/' -f2)"
destination_dir="username/${model_name}-GGUF"

# Prepare directory structure
mkdir -p "${base_dir}/${destination_dir}" &>/dev/null
pushd "${base_dir}/${destination_dir}" &>/dev/null
git init . &>/dev/null

# Copy necessary files if they don't exist
if [[ ! -f .gitattributes ]]; then
    cp -f "${base_dir}/${source_model}/.gitattributes" . &>/dev/null || true
    echo "*.gguf filter=lfs diff=lfs merge=lfs -text" >> .gitattributes
fi
for dir in assets images imgs; do
    if [[ ! -d $dir ]]; then
        cp -rf "${base_dir}/${source_model}/$dir" . &>/dev/null || true
    fi
done
if [[ ! -f README.md ]]; then
    cp -f "${base_dir}/${source_model}/README.md" . &>/dev/null || true
fi

# From here on, stop at the first failing command
set -e
pushd "${llama_cpp_path}" &>/dev/null

# Activate virtual environment if it exists
[[ -f venv/bin/activate ]] && source venv/bin/activate

# Convert to FP16 GGUF
echo "#### Converting ${base_dir}/${source_model} to GGUF format"
python3 convert_hf_to_gguf.py "${base_dir}/${source_model}" --outtype f16 --outfile "${base_dir}/${destination_dir}/${model_name}-FP16.gguf"

# Quantize with different methods
quantization_methods=(
    "Q8_0"
    "Q6_K"
    "Q5_K_M"
    "Q5_0"
    "Q4_K_M"
    "Q4_0"
    "Q3_K"
    "Q2_K"
)
for method in "${quantization_methods[@]}"; do
    echo "#### Quantizing with ${method} method"
    llama-quantize "${base_dir}/${destination_dir}/${model_name}-FP16.gguf" "${base_dir}/${destination_dir}/${model_name}-${method}.gguf" "${method}"
    ls -lth "${base_dir}/${destination_dir}"
    sleep 3
done

popd &>/dev/null
set +e
Execute the script to convert the model to FP16 precision GGUF and quantize it with various methods:
bash convert_quantize.sh Llama-3.2-3B-Instruct
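The script takes the part of its argument after the slash as the model name, which is why both an org/model identifier and a bare directory name work. The sketch below shows the same derivation using parameter expansion; the helper names are hypothetical and the expansion is a stand-in for the script's own method, agreeing with it for these two input shapes.

```shell
# Derive the model name by stripping everything up to the last "/".
model_name_of() {
    echo "${1##*/}"
}

# Build the output directory name.
# $1 = HuggingFace username (placeholder), $2 = source model argument
destination_dir_of() {
    echo "$1/$(model_name_of "$2")-GGUF"
}

model_name_of "meta-llama/Llama-3.2-3B-Instruct"
model_name_of "Llama-3.2-3B-Instruct"
destination_dir_of "username" "Llama-3.2-3B-Instruct"
```

The first two calls print the same name, and the last prints the directory that the verification step lists.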
After execution, verify that the FP16 GGUF model and the quantized versions were created successfully:
ls username/Llama-3.2-3B-Instruct-GGUF/
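A directory listing only shows that the files exist. As a slightly stronger check, GGUF files begin with the four ASCII bytes "GGUF", so a truncated or miswritten output can often be spotted from its header. The helper name `is_gguf` is hypothetical, and the check says nothing about the rest of the file.

```shell
# Return 0 if the file starts with the 4-byte magic "GGUF".
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

for f in username/Llama-3.2-3B-Instruct-GGUF/*.gguf; do
    [ -f "$f" ] || continue   # skip the literal pattern if nothing matched
    if is_gguf "$f"; then
        echo "ok       $f"
    else
        echo "invalid  $f"
    fi
done
```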
Uploading to HuggingFace
In HuggingFace, click your profile picture in the top-right corner and select New Model to create a repository with the format original-model-name-GGUF.
Update the model's README:
cd ~/huggingface.co/username/Llama-3.2-3B-Instruct-GGUF
vim README.md
For maintainability, include metadata about the original model and llama.cpp commit information at the beginning:
# Llama-3.2-3B-Instruct-GGUF
**Model creator**: [meta-llama](https://huggingface.co/meta-llama)<br></br>
**Original model**: [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)<br></br>
**GGUF quantization**: based on llama.cpp release [b8deef0e](https://github.com/ggerganov/llama.cpp/commit/b8deef0ec0af5febac1d2cfd9119ff330ed0b762)
---
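When publishing several models, writing this header by hand invites copy-paste mistakes. The sketch below generates the same five lines from three values; the function name `readme_header` is hypothetical, and the link layout simply mirrors the example above.

```shell
# Print a README header in the layout shown above.
# $1 = model creator, $2 = original model name, $3 = full llama.cpp commit ID
readme_header() (
    creator="$1"
    model="$2"
    commit="$3"
    short="$(printf '%s' "$commit" | cut -c1-8)"   # first 8 characters
    cat <<EOF
# ${model}-GGUF
**Model creator**: [${creator}](https://huggingface.co/${creator})<br></br>
**Original model**: [${model}](https://huggingface.co/${creator}/${model})<br></br>
**GGUF quantization**: based on llama.cpp release [${short}](https://github.com/ggerganov/llama.cpp/commit/${commit})
---
EOF
)

# In a llama.cpp checkout, the full commit ID can be read with:
#   git -C ~/llama.cpp rev-parse HEAD
readme_header "meta-llama" "Llama-3.2-3B-Instruct" \
    "b8deef0ec0af5febac1d2cfd9119ff330ed0b762"
```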
Prepare for upload by installing Git LFS to manage large file uploads, then run git lfs install once to set up its Git hooks:
brew install git-lfs
git lfs install
Add the remote repository:
git remote add origin git@hf.co:username/Llama-3.2-3B-Instruct-GGUF
Add files and verify with git commands:
git add .
git ls-files
git lfs ls-files
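If git lfs ls-files shows no .gguf files, the usual cause is a missing rule in .gitattributes. The following sketch checks for that rule directly; the helper name `has_gguf_lfs_rule` is hypothetical, and the pattern assumes the rule is written in the same form the conversion script appends.

```shell
# Return 0 if the attributes file routes *.gguf files through Git LFS.
has_gguf_lfs_rule() {
    grep -q '^\*\.gguf .*filter=lfs' "$1" 2>/dev/null
}

if has_gguf_lfs_rule .gitattributes; then
    echo "*.gguf files are tracked by Git LFS."
else
    echo "No LFS rule for *.gguf; add one before committing."
fi
```

If the rule is missing, running git lfs track "*.gguf" adds it to .gitattributes.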
For uploading files larger than 5GB to HuggingFace, enable large file uploads. Log in to HuggingFace via the command line using the token created earlier:
huggingface-cli login
Enable large file uploads for the current directory:
huggingface-cli lfs-enable-largefiles .
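To see whether this step matters for a given model, it can help to list the files that exceed the limit. The sketch below does that with a hypothetical helper, `files_over_bytes`; interpreting 5GB as 5 × 1024³ bytes is an assumption about the units.

```shell
# List regular files in a directory whose size exceeds a byte limit.
files_over_bytes() (
    dir="$1"
    limit="$2"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        size="$(wc -c < "$f" | tr -d ' ')"
        [ "$size" -gt "$limit" ] && echo "$f"
    done
    exit 0
)

# 5 GB threshold, taken here as 5 * 1024^3 bytes (an assumption about units).
files_over_bytes . $(( 5 * 1024 * 1024 * 1024 ))
```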
Upload the model to the HuggingFace repository:
git commit -m "feat: initial GGUF model conversion and quantization" --signoff
git push origin main -f
Verify successful upload on the HuggingFace website.
Uploading to ModelScope
In ModelScope, click your profile picture in the top-right corner and select Create Model to create a repository with the format original-model-name-GGUF. Configure settings including License, model type, AI framework, and visibility.
Upload the local repository's README.md file and create the repository.
Add the remote repository using the ModelScope Git access token obtained earlier:
git remote add modelscope https://oauth2:xxxxxxxxxxxxxxxxxxxx@www.modelscope.cn/username/Llama-3.2-3B-Instruct-GGUF.git
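Typing the token into the command leaves it in the shell history. As one possible alternative, the sketch below assembles the same URL from an environment variable; `MODELSCOPE_TOKEN` and `modelscope_remote_url` are hypothetical names, not something ModelScope defines. Note that the token is still stored in .git/config once the remote is added.

```shell
# Build the ModelScope remote URL from its parts.
# $1 = Git access token, $2 = username, $3 = repository name
modelscope_remote_url() {
    echo "https://oauth2:$1@www.modelscope.cn/$2/$3.git"
}

# MODELSCOPE_TOKEN is a hypothetical variable name; export it beforehand.
# The placeholder is used when the variable is unset.
url="$(modelscope_remote_url "${MODELSCOPE_TOKEN:-xxxxxxxxxxxxxxxxxxxx}" \
    "username" "Llama-3.2-3B-Instruct-GGUF")"
echo "git remote add modelscope $url"
```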
Fetch existing files from the remote repository:
git fetch modelscope master
Since ModelScope uses the master branch instead of main, create a local master branch from the fetched remote branch and cherry-pick the commit from main onto it. First, note the ID of the current commit on main (git rev-parse HEAD prints just the ID):
git log
Create the master branch from the fetched head and apply the main-branch commit without committing, replacing the ID below with your own:
git checkout FETCH_HEAD -b master
git cherry-pick -n 833fb20e5b07231e66c677180f95e27376eb25c6
Resolve any conflicts in the .gitattributes file, keeping the entries fetched from ModelScope and making sure the *.gguf LFS rule from the main branch is present:
vim .gitattributes
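For a conflict that consists only of differing rule lines, the manual edit amounts to taking the union of both files. A minimal sketch of that merge, assuming the two versions have been saved to separate files (for example with git show master:.gitattributes and git show main:.gitattributes), is shown below; `merge_gitattributes` is a hypothetical helper.

```shell
# Concatenate two attribute files, dropping exact duplicate lines
# while keeping the first occurrence and the original order.
merge_gitattributes() {
    awk '!seen[$0]++' "$1" "$2"
}

# Small demonstration with throwaway files.
demo="$(mktemp -d)"
printf '%s\n' '*.bin filter=lfs diff=lfs merge=lfs -text' > "$demo/remote"
printf '%s\n' '*.bin filter=lfs diff=lfs merge=lfs -text' \
              '*.gguf filter=lfs diff=lfs merge=lfs -text' > "$demo/local"
merge_gitattributes "$demo/remote" "$demo/local"
rm -rf "$demo"
```

The demonstration prints the shared *.bin rule once, followed by the *.gguf rule. Review the result before saving it as .gitattributes, since later lines take precedence over earlier ones for the same path.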
Add files and verify:
git add .
git ls-files
git lfs ls-files
Upload the model to the ModelScope repository:
git commit -m "feat: initial GGUF model conversion and quantization" --signoff
git push modelscope master -f
Verify successful upload on the ModelScope website.