Creating and Quantizing GGUF Models for Deployment on HuggingFace and ModelScope

llama.cpp is the underlying implementation for popular applications such as Ollama and LM Studio, and it is one of the supported inference engines in GPUStack. It defines GGUF, a binary model file format designed for optimized inference that enables fast loading and execution of models.

The framework also supports model quantization, which stores weights at lower precision to reduce storage and computational requirements while retaining most of the model's accuracy. This allows large language models to be deployed efficiently on desktops, embedded systems, and other resource-constrained environments, and it typically improves inference speed.

This guide explains how to create and quantize GGUF models and upload them to HuggingFace and ModelScope repositories.

Setting Up HuggingFace and ModelScope Accounts

Registering for HuggingFace

Visit https://huggingface.co/join to create a HuggingFace account (requires appropriate network access).

Configuring HuggingFace SSH Public Key

Add your local environment's SSH public key to HuggingFace. To view an existing public key, use the command below; if the file does not exist, generate a key pair first with ssh-keygen:

cat ~/.ssh/id_rsa.pub
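The cat command only prints a key that already exists. The following is a minimal sketch for the case where no key exists yet. The helper name ensure_ssh_key, the Ed25519 key type, and the empty passphrase are illustrative assumptions, not part of the original setup:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the public key for the given private-key path,
# generating a new Ed25519 key pair first if the .pub file is missing.
# The empty passphrase (-N "") is used only to keep the sketch short.
ensure_ssh_key() {
    local key_path="$1"
    if [[ ! -f "${key_path}.pub" ]]; then
        ssh-keygen -q -t ed25519 -N "" -f "${key_path}"
    fi
    cat "${key_path}.pub"
}

# Example: ensure_ssh_key "$HOME/.ssh/id_ed25519"
```

Once the key has been added in the HuggingFace settings, running ssh -T git@hf.co is a quick way to check that authentication works.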

In HuggingFace, click your profile picture in the top-right corner, select Settings → SSH and GPG Keys, and add the public key for authentication during model uploads.

Registering for ModelScope

Visit https://www.modelscope.cn/register?back=%2Fhome to create a ModelScope account.

Obtaining ModelScope Token

Access https://www.modelscope.cn/my/myaccesstoken to copy and save your Git access token for authentication during model uploads.

Preparing the llama.cpp Environment

Create and activate a Conda environment (if Conda is not installed, refer to the Miniconda documentation at https://docs.anaconda.com/miniconda/):

conda create -n llama-cpp python=3.12 -y
conda activate llama-cpp
which python
pip -V

Clone llama.cpp at a release tag (b4034 in this example) and compile the binaries required for quantization. The commands below assume macOS with Homebrew:

cd ~
git clone -b b4034 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
pip install -r requirements.txt
brew install cmake
make

After compilation, verify the availability of the llama-quantize binary with:

./llama-quantize --help
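The later steps depend on two files in the llama.cpp directory: the llama-quantize binary and the convert_hf_to_gguf.py script. As a rough sketch, a small check like the one below can confirm both are in place. The function name check_llama_cpp is hypothetical, and it assumes the make build from this guide, which places binaries in the repository root:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which files needed by this guide are missing
# from a llama.cpp checkout. Returns 0 if everything is present, 1 otherwise.
check_llama_cpp() {
    local dir="$1" rc=0
    if [[ ! -x "${dir}/llama-quantize" ]]; then
        echo "missing: ${dir}/llama-quantize"
        rc=1
    fi
    if [[ ! -f "${dir}/convert_hf_to_gguf.py" ]]; then
        echo "missing: ${dir}/convert_hf_to_gguf.py"
        rc=1
    fi
    return "${rc}"
}

# Example: check_llama_cpp "$HOME/llama.cpp" && echo "llama.cpp is ready"
```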

Downloading the Original Model

Download the original model that will be converted to GGUF format and quantized.

To download models from HuggingFace, use the huggingface-cli command. First, install dependencies:

pip install -U huggingface_hub

If huggingface.co is not directly reachable from your network (for example, from mainland China), configure a download mirror:

export HF_ENDPOINT=https://hf-mirror.com

In this example, we'll download the meta-llama/Llama-3.2-3B-Instruct model. It is a gated model, so request access on its HuggingFace model page first, then create an access token:

Click your profile picture in the top-right corner of HuggingFace and select Access Tokens.
Create a token with Read permissions and save it.

Download the model, replacing the placeholder token with your own:

mkdir ~/huggingface.co
cd ~/huggingface.co/
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir Llama-3.2-3B-Instruct --token hf_abcdefghijklmnopqrstuvwxyz
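Before converting, it can help to confirm that the download completed. The sketch below checks for config.json, which the conversion script reads, and for at least one weight file. It assumes the weights are stored in .safetensors format; the function name check_hf_model_dir is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a downloaded HuggingFace model directory.
# Assumes the model ships its weights as *.safetensors files.
check_hf_model_dir() {
    local dir="$1"
    if [[ ! -f "${dir}/config.json" ]]; then
        echo "missing config.json"
        return 1
    fi
    if ! ls "${dir}"/*.safetensors >/dev/null 2>&1; then
        echo "no .safetensors weight files"
        return 1
    fi
    echo "ok"
}

# Example: check_hf_model_dir "$HOME/huggingface.co/Llama-3.2-3B-Instruct"
```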

Converting to GGUF Format and Quantizing the Model

Create a script for converting to GGUF format and quantizing the model:

cd ~/huggingface.co/
vim convert_quantize.sh

Enter the following script content, changing llama_cpp_path and base_dir to the absolute paths of your llama.cpp and huggingface.co directories. Also replace username in destination_dir with your HuggingFace username:

#!/usr/bin/env bash

llama_cpp_path="/Users/username/llama.cpp"
base_dir="/Users/username/huggingface.co"

export PATH="$PATH:${llama_cpp_path}"

source_model="$1"
model_name="$(echo "${source_model}" | cut -d'/' -f2)"
destination_dir="username/${model_name}-GGUF"

# Prepare directory structure (paths are quoted so that spaces do not break them)
mkdir -p "${base_dir}/${destination_dir}" &>/dev/null
pushd "${base_dir}/${destination_dir}" &>/dev/null
git init . &>/dev/null

# Copy necessary files from the original model if they don't exist yet
if [[ ! -f .gitattributes ]]; then
    cp -f "${base_dir}/${source_model}/.gitattributes" . &>/dev/null || true
    # Track GGUF files with Git LFS
    echo "*.gguf filter=lfs diff=lfs merge=lfs -text" >> .gitattributes
fi
for dir in assets images imgs; do
    if [[ ! -d "${dir}" ]]; then
        cp -rf "${base_dir}/${source_model}/${dir}" . &>/dev/null || true
    fi
done
if [[ ! -f README.md ]]; then
    cp -f "${base_dir}/${source_model}/README.md" . &>/dev/null || true
fi

set -e

pushd ${llama_cpp_path} &>/dev/null

# Activate virtual environment if it exists
[[ -f venv/bin/activate ]] && source venv/bin/activate

# Convert to FP16 GGUF (f16 is requested explicitly rather than relying on the default)
echo "#### Converting ${base_dir}/${source_model} to GGUF format"
python3 convert_hf_to_gguf.py "${base_dir}/${source_model}" --outtype f16 --outfile "${base_dir}/${destination_dir}/${model_name}-FP16.gguf"

# Quantize with different methods
quantization_methods=(
  "Q8_0"
  "Q6_K"
  "Q5_K_M"
  "Q5_0"
  "Q4_K_M"
  "Q4_0"
  "Q3_K"
  "Q2_K"
)

# Produce one quantized file per method from the FP16 GGUF
for method in "${quantization_methods[@]}"; do
    echo "#### Quantizing with ${method} method"
    llama-quantize "${base_dir}/${destination_dir}/${model_name}-FP16.gguf" "${base_dir}/${destination_dir}/${model_name}-${method}.gguf" "${method}"
    ls -lth "${base_dir}/${destination_dir}"
    sleep 3
done

popd &>/dev/null

set +e

Execute the script to convert the model to FP16 precision GGUF and quantize it with various methods:

bash convert_quantize.sh Llama-3.2-3B-Instruct
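The script derives the output names from its first argument with cut. Because cut prints a line unchanged when it contains no delimiter, the argument may be given with or without an organization prefix. The sketch below isolates that logic in a hypothetical function, derive_names, so it can be tried on its own:

```shell
#!/usr/bin/env bash
# Hypothetical helper that mirrors the naming logic of convert_quantize.sh:
# prints "<model_name> <destination_dir>" for a given source_model argument.
derive_names() {
    local source_model="$1" model_name
    model_name="$(echo "${source_model}" | cut -d'/' -f2)"
    echo "${model_name} username/${model_name}-GGUF"
}

derive_names "Llama-3.2-3B-Instruct"
derive_names "meta-llama/Llama-3.2-3B-Instruct"
# Both calls print:
# Llama-3.2-3B-Instruct username/Llama-3.2-3B-Instruct-GGUF
```

Note that the script also uses the full argument as the source path under base_dir, so the prefixed form only works if the model was downloaded into a matching meta-llama/ subdirectory.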

After execution, verify that the FP16 GGUF model and the quantized versions were created successfully:

ls username/Llama-3.2-3B-Instruct-GGUF/
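Rather than reading the listing by eye, the expected files can be checked in a loop. This is a sketch that assumes the file naming used by convert_quantize.sh (model name, a hyphen, the variant, then .gguf); the function name list_missing_gguf is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every expected GGUF file that is
# missing or empty in the given directory. Prints nothing if all are present.
list_missing_gguf() {
    local dir="$1" model_name="$2" variant
    for variant in FP16 Q8_0 Q6_K Q5_K_M Q5_0 Q4_K_M Q4_0 Q3_K Q2_K; do
        if [[ ! -s "${dir}/${model_name}-${variant}.gguf" ]]; then
            echo "${model_name}-${variant}.gguf"
        fi
    done
}

# Example: list_missing_gguf username/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct
```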

Uploading to HuggingFace

In HuggingFace, click your profile picture in the top-right corner and select New Model to create a repository with the format original-model-name-GGUF.

Update the model's README:

cd ~/huggingface.co/username/Llama-3.2-3B-Instruct-GGUF
vim README.md

For maintainability, include metadata about the original model and llama.cpp commit information at the beginning:

# Llama-3.2-3B-Instruct-GGUF

**Model creator**: [meta-llama](https://huggingface.co/meta-llama)<br>
**Original model**: [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)<br>
**GGUF quantization**: based on llama.cpp release [b8deef0e](https://github.com/ggerganov/llama.cpp/commit/b8deef0ec0af5febac1d2cfd9119ff330ed0b762)

---

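Prepare for upload by installing Git LFS to manage large file uploads, then run git lfs install once to register the LFS filters in your Git configuration:

brew install git-lfs
git lfs install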

Add the remote repository:

git remote add origin git@hf.co:username/Llama-3.2-3B-Instruct-GGUF

Add files and verify with git commands:

git add .
git ls-files
git lfs ls-files
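If a .gguf file does not show up under git lfs ls-files, the attribute rule may not be matching it. One way to check, sketched below, is git check-attr, which reports the filter Git would apply to a path. The function name is_lfs_tracked is hypothetical, and the function must be run from inside the repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if .gitattributes assigns the lfs filter
# to the given path (relative to the current directory in the repository).
is_lfs_tracked() {
    local path="$1"
    [[ "$(git check-attr filter -- "${path}")" == "${path}: filter: lfs" ]]
}

# Example:
# is_lfs_tracked "Llama-3.2-3B-Instruct-Q4_K_M.gguf" || echo "not tracked by LFS"
```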

For uploading files larger than 5GB to HuggingFace, enable large file uploads. Log in to HuggingFace via the command line using the token created earlier:

huggingface-cli login

Enable large file uploads for the current directory:

huggingface-cli lfs-enable-largefiles .

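Commit and push the model to the HuggingFace repository. The push assumes your local branch is named main (check with git branch --show-current), and -f overwrites any commits already on the remote branch: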

git commit -m "feat: initial GGUF model conversion and quantization" --signoff
git push origin main -f

Verify successful upload on the HuggingFace website.

Uploading to ModelScope

In ModelScope, click your profile picture in the top-right corner and select Create Model to create a repository with the format original-model-name-GGUF. Configure settings including License, model type, AI framework, and visibility.

Upload the local repository's README.md file and create the repository.

Add the remote repository using the ModelScope Git access token obtained earlier:

git remote add modelscope https://oauth2:xxxxxxxxxxxxxxxxxxxx@www.modelscope.cn/username/Llama-3.2-3B-Instruct-GGUF.git
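To avoid typing the token directly into the command, where it would also land in your shell history, the URL can be assembled from a variable. This is only a sketch: the function name modelscope_remote_url and the variable MODELSCOPE_TOKEN are hypothetical names, and the URL layout is the one shown above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the ModelScope Git remote URL from a token and
# a "username/repository" path, following the URL layout used in this guide.
modelscope_remote_url() {
    local token="$1" repo="$2"
    echo "https://oauth2:${token}@www.modelscope.cn/${repo}.git"
}

# Example (MODELSCOPE_TOKEN is an assumed variable holding your Git access token):
# git remote add modelscope "$(modelscope_remote_url "${MODELSCOPE_TOKEN}" "username/Llama-3.2-3B-Instruct-GGUF")"
```

The resulting URL, including the token, is still stored in the repository's .git/config, so treat that file as sensitive.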

Fetch existing files from the remote repository:

git fetch modelscope master

Since ModelScope uses the master branch instead of main, create a local master branch from the fetched remote branch and cherry-pick the commit from main onto it. First, check and note the current commit ID on main:

git log

Create the master branch from the fetched remote head and cherry-pick the commit from main, replacing the commit ID below with the one you noted:

git checkout FETCH_HEAD -b master
git cherry-pick -n 833fb20e5b07231e66c677180f95e27376eb25c6
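The commit ID does not have to be copied by hand from git log. As a sketch, git rev-parse can resolve the tip of a branch directly. The function name branch_tip is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the full commit ID at the tip of a branch,
# failing if the name does not resolve to a commit.
branch_tip() {
    git rev-parse --verify "$1^{commit}"
}

# Example, run on the master branch created above:
# git cherry-pick -n "$(branch_tip main)"
```

This picks only the latest commit on main, which matches this guide because main holds a single commit at this point.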

Resolve any conflicts in the .gitattributes file (merge the *.gguf filter from the original model's .gitattributes):

vim .gitattributes
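After removing any conflict markers by hand, the *.gguf rule must still be present exactly once. The sketch below appends it only when it is missing, so it is safe to run repeatedly. The function name ensure_gguf_lfs_rule is hypothetical, and it assumes the file ends with a newline:

```shell
#!/usr/bin/env bash
# Hypothetical helper: make sure the GGUF LFS rule appears in a
# .gitattributes file, appending it only if no identical line exists.
ensure_gguf_lfs_rule() {
    local file="${1:-.gitattributes}"
    local rule='*.gguf filter=lfs diff=lfs merge=lfs -text'
    if ! grep -qxF -- "${rule}" "${file}" 2>/dev/null; then
        echo "${rule}" >> "${file}"
    fi
}

# Example: ensure_gguf_lfs_rule .gitattributes
```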

Add files and verify:

git add .
git ls-files
git lfs ls-files

Upload the model to the ModelScope repository:

git commit -m "feat: initial GGUF model conversion and quantization" --signoff
git push modelscope master -f

Verify successful upload on the ModelScope website.

Tags: llama.cpp GGUF model quantization huggingface ModelScope

Posted on Tue, 02 Jun 2026 17:30:08 +0000 by spicerje
