Deploying GeneFace++ for Audio-Driven 3D Facial Animation Synthesis

Project Overview

GeneFace++ is a PyTorch-based deep learning framework that enables real-time audio-driven 3D facial animation synthesis. The system generates synchronized lip movements and facial expressions from audio input, creating realistic virtual character videos. The project repository is available at https://github.com/yerfor/GeneFacePlusPlus/tree/main.

Environment Setup

The deployment ran on a Volcano Engine cloud instance: GPU compute type g1vc (ecs.g1vc.xlarge), 6 vCPUs, 26 GiB memory, a V100 GPU, and Ubuntu 22.04.

CUDA Installation

CUDA 11.7 is the recommended version, validated on A100/V100 GPUs. CUDA 12 and above are not supported due to compatibility issues with torch-ngp.

# Register NVIDIA's apt repository keyring, then install CUDA 11.7 from that repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring*.deb
sudo apt-get update
sudo apt-get -y install cuda=11.7.1-1

# Verify installation
ls /usr/local | grep cuda
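
Listing /usr/local only shows which toolkit directories exist. As a hedged complement, the sketch below parses the release line that `nvcc --version` prints and applies the "below CUDA 12" rule stated above. The helper names are illustrative and not part of GeneFace++, and the sample string is an abbreviated example of nvcc output.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported_cuda(version):
    """Apply the rule from this article: CUDA 12 and newer are not supported."""
    return version is not None and version[0] < 12


# Abbreviated sample of the relevant line; real output has several more lines.
sample = "Cuda compilation tools, release 11.7, V11.7.99"
version = parse_nvcc_release(sample)
print(version, is_supported_cuda(version))
```

On a real machine the text could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`. If nvcc is not on PATH, it normally lives in the toolkit's bin directory under /usr/local.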

If the wrong CUDA version was installed by default, remove and reinstall with the specific version:

sudo apt-get --purge remove cuda
sudo apt-get -y install cuda=11.7.1-1

Python Environment Configuration

# Install Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh
chmod +x Anaconda3-2024.02-1-Linux-x86_64.sh
./Anaconda3-2024.02-1-Linux-x86_64.sh

# Clone repository
git clone https://github.com/yerfor/GeneFacePlusPlus.git

# Create Python 3.9 environment
cd GeneFacePlusPlus
conda create -n geneface python=3.9
conda activate geneface

# Install FFmpeg with libx264 codec
conda install conda-forge::ffmpeg

# Install PyTorch 2.0.1 with CUDA 11.7
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia

# Build PyTorch3D from source
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"

# Install MMCV dependencies
pip install cython
pip install openmim==0.3.9
mim install mmcv==2.1.0

# Install remaining dependencies
sudo apt-get install libasound2-dev portaudio19-dev
pip install -r docs/prepare_env/requirements.txt -v

# Build torch-ngp extensions
bash docs/prepare_env/install_ext.sh

3DMM Model Setup (BFM2009)

Download the BFM2009 model files and place them in the deep_3drecon/BFM/ directory. The folder should contain these files:

deep_3drecon/BFM/
├── 01_MorphableModel.mat
├── BFM_exp_idx.mat
├── BFM_front_idx.mat
├── BFM_model_front.mat
├── Exp_Pca.bin
├── facemodel_info.mat
├── index_mp468_from_mesh35709.npy
├── mediapipe_in_bfm53201.npy
└── std_exp.txt
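
A missing BFM file tends to surface only later, during 3DMM fitting, so it can help to check the folder up front. The following is a minimal sketch that compares the directory against the listing above. The function name is illustrative and the default path assumes the script runs from the repository root.

```python
from pathlib import Path

# File names copied from the listing above.
REQUIRED_BFM_FILES = [
    "01_MorphableModel.mat",
    "BFM_exp_idx.mat",
    "BFM_front_idx.mat",
    "BFM_model_front.mat",
    "Exp_Pca.bin",
    "facemodel_info.mat",
    "index_mp468_from_mesh35709.npy",
    "mediapipe_in_bfm53201.npy",
    "std_exp.txt",
]


def missing_bfm_files(bfm_dir="deep_3drecon/BFM"):
    """Return the required file names that are not present in bfm_dir."""
    root = Path(bfm_dir)
    return [name for name in REQUIRED_BFM_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_bfm_files()
    print("All BFM files present" if not missing else f"Missing: {missing}")
```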

Running Inference Demos

Downloading Pre-trained Models

Download the required datasets and models:

May dataset: trainval_dataset.npy → data/binary/videos/May/trainval_dataset.npy
Audio-to-motion model: audio2motion_vae.zip → ./checkpoints/
Motion-to-video model: motion2video_nerf.zip → ./checkpoints/
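
The two archives need to be unpacked under ./checkpoints/ so that the paths used in the commands below resolve. A hedged sketch using only the standard library follows. It assumes each archive contains a single top-level folder (for example audio2motion_vae/), which this article does not confirm, so the function returns the top-level names it found for you to check.

```python
import zipfile
from pathlib import Path


def extract_checkpoint(zip_path, dest="checkpoints"):
    """Extract an archive into dest and return the sorted top-level names it contains."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Member names inside a zip always use forward slashes.
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
    return top_level
```

Calling `extract_checkpoint("audio2motion_vae.zip")` and seeing `['audio2motion_vae']` in the result would indicate the layout matches the `--a2m_ckpt=checkpoints/audio2motion_vae` argument used later. Any other result means the folder needs to be moved or renamed by hand.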

Command Line Inference

cd GeneFacePlusPlus
conda activate geneface
export PYTHONPATH=./

python inference/genefacepp_infer.py \
    --a2m_ckpt=checkpoints/audio2motion_vae \
    --head_ckpt= \
    --torso_ckpt=checkpoints/motion2video_nerf/may_torso \
    --drv_aud=data/raw/val_wavs/MacronSpeech.wav \
    --out_name=may_demo.mp4
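
When several audio files need to be rendered, it may be convenient to assemble the same command programmatically. The sketch below only builds the argument list, using exactly the flags shown above. The function name and defaults are illustrative.

```python
import shlex


def build_infer_cmd(drv_aud, out_name,
                    a2m_ckpt="checkpoints/audio2motion_vae",
                    torso_ckpt="checkpoints/motion2video_nerf/may_torso"):
    """Return the argv list for inference/genefacepp_infer.py with the flags used above."""
    return [
        "python", "inference/genefacepp_infer.py",
        f"--a2m_ckpt={a2m_ckpt}",
        "--head_ckpt=",  # left empty, as in the command above
        f"--torso_ckpt={torso_ckpt}",
        f"--drv_aud={drv_aud}",
        f"--out_name={out_name}",
    ]


cmd = build_infer_cmd("data/raw/val_wavs/MacronSpeech.wav", "may_demo.mp4")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root with PYTHONPATH set to ./ and the geneface environment active.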

Gradio Web Interface

python inference/app_genefacepp.py \
    --a2m_ckpt=checkpoints/audio2motion_vae \
    --head_ckpt= \
    --torso_ckpt=checkpoints/motion2video_nerf/may_torso

Training Custom Models

Data Preparation

Prepare a video with minimal head occlusion and limited movement. Resolution should be at least 512x512. Place the video in data/raw/videos/.

Step 1: Video Preprocessing

export VIDEO_ID=your_video_name

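# Resample to 25 FPS and crop a 512x512 window. The filter syntax is crop=w:h:x:y,
# where x:y is the top-left corner; 284:224 suits this example video, adjust for yours.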
ffmpeg -i data/raw/videos/${VIDEO_ID}.mp4 \
    -filter:v "fps=25,crop=512:512:284:224" \
    -qmin 1 -q:v 1 \
    data/raw/videos/${VIDEO_ID}_512.mp4

mv data/raw/videos/${VIDEO_ID}.mp4 data/raw/videos/${VIDEO_ID}_backup.mp4
mv data/raw/videos/${VIDEO_ID}_512.mp4 data/raw/videos/${VIDEO_ID}.mp4
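
The crop offsets in the command above are specific to one source video. A small sketch for deriving offsets for other resolutions follows. It assumes the face sits near the frame center, which may not hold for your footage, and the function name is illustrative.

```python
def center_crop_filter(width, height, size=512, fps=25):
    """Build an ffmpeg -filter:v string that crops a centered size x size window."""
    if width < size or height < size:
        raise ValueError(f"source {width}x{height} is smaller than {size}x{size}")
    x = (width - size) // 2  # left edge of the crop window
    y = (height - size) // 2  # top edge of the crop window
    return f"fps={fps},crop={size}:{size}:{x}:{y}"


print(center_crop_filter(1080, 960))  # fps=25,crop=512:512:284:224
```

For a 1080x960 source this reproduces the 284:224 offsets used above. The source dimensions can be read with ffprobe before calling the helper.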

Step 2: Audio Feature Extraction

export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=./
export VIDEO_ID=your_video_name

mkdir -p data/processed/videos/${VIDEO_ID}

# Extract audio waveform
ffmpeg -i data/raw/videos/${VIDEO_ID}.mp4 \
    -f wav -ar 16000 \
    data/processed/videos/${VIDEO_ID}/aud.wav

# Generate HuBERT features
python data_gen/utils/process_audio/extract_hubert.py --video_id=${VIDEO_ID}

# Generate mel spectrogram and F0 features
python data_gen/utils/process_audio/extract_mel_f0.py --video_id=${VIDEO_ID}
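
Before running the two feature extractors it can be worth confirming that aud.wav really was written at 16 kHz, as requested by -ar 16000. The sketch below reads the header with Python's standard wave module, which handles the PCM WAV files ffmpeg produces by default. The function name is illustrative.

```python
import wave


def wav_info(path):
    """Return sample rate, channel count and duration in seconds of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }
```

For example, `wav_info("data/processed/videos/your_video_name/aud.wav")["sample_rate"]` should equal 16000 after the ffmpeg step above.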

Step 3: Frame Extraction and Segmentation

export PYTHONPATH=./
export VIDEO_ID=your_video_name
export CUDA_VISIBLE_DEVICES=0

mkdir -p data/processed/videos/${VIDEO_ID}/gt_imgs

# Extract video frames
ffmpeg -i data/raw/videos/${VIDEO_ID}.mp4 \
    -vf fps=25,scale=w=512:h=512 \
    -qmin 1 -q:v 1 -start_number 0 \
    data/processed/videos/${VIDEO_ID}/gt_imgs/%08d.jpg

# Extract segmentation masks
python data_gen/utils/process_video/extract_segment_imgs.py \
    --ds_name=nerf \
    --vid_dir=data/raw/videos/${VIDEO_ID}.mp4

If segmentation hangs, re-run it in a single process:

python data_gen/utils/process_video/extract_segment_imgs.py \
    --ds_name=nerf \
    --vid_dir=data/raw/videos/${VIDEO_ID}.mp4 \
    --force_single_process

Step 4: Landmark Detection and 3DMM Fitting

export PYTHONPATH=./
export VIDEO_ID=your_video_name

# Extract 2D landmarks
python data_gen/utils/process_video/extract_lm2d.py \
    --ds_name=nerf \
    --vid_dir=data/raw/videos/${VIDEO_ID}.mp4

# Fit 3D morphable model
python data_gen/utils/process_video/fit_3dmm_landmark.py \
    --ds_name=nerf \
    --vid_dir=data/raw/videos/${VIDEO_ID}.mp4 \
    --reset --debug --id_mode=global

Step 5: Data Binarization

export PYTHONPATH=./
export VIDEO_ID=your_video_name

python data_gen/runs/binarizer_nerf.py --video_id=${VIDEO_ID}

Model Training

Configuration Setup

cp -r egs/datasets/May egs/datasets/${VIDEO_ID}

# Update configuration files:
# - Replace video_id references with ${VIDEO_ID}
# - Update directory paths accordingly
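
The manual replacement can be scripted. The sketch below rewrites whole-word occurrences of the old video ID in every .yaml file of the copied folder. It assumes the ID appears verbatim (for example as May) in the files, and it deliberately leaves other spellings such as lower-case names untouched, so the result should still be reviewed by hand. The function name is illustrative.

```python
import re
from pathlib import Path


def retarget_configs(config_dir, old_id, new_id):
    """Replace whole-word old_id with new_id in each .yaml file; return changed file names."""
    pattern = re.compile(rf"\b{re.escape(old_id)}\b")
    changed = []
    for path in sorted(Path(config_dir).glob("*.yaml")):
        text = path.read_text(encoding="utf-8")
        new_text = pattern.sub(new_id, text)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")
            changed.append(path.name)
    return changed
```

A call such as `retarget_configs("egs/datasets/your_video_name", "May", "your_video_name")` returns the list of files it modified. Running `git diff` afterwards shows exactly what changed.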

Train Head NeRF Model

CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_sr.yaml \
    --exp_name=motion2video_nerf/${VIDEO_ID}_head \
    --reset

Train Torso NeRF Model

CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
    --config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_torso_sr.yaml \
    --exp_name=motion2video_nerf/${VIDEO_ID}_torso \
    --hparams=head_model_dir=checkpoints/motion2video_nerf/${VIDEO_ID}_head \
    --reset

Custom Inference

CUDA_VISIBLE_DEVICES=0 python inference/app_genefacepp.py \
    --a2m_ckpt=checkpoints/audio2motion_vae \
    --head_ckpt= \
    --torso_ckpt=checkpoints/motion2video_nerf/${VIDEO_ID}_torso

Troubleshooting Common Issues

Missing bg.jpg During Preprocessing

This typically indicates the segmentation extraction step failed. Re-run with single-process mode:

python data_gen/utils/process_video/extract_segment_imgs.py \
    --ds_name=nerf \
    --vid_dir=data/raw/videos/${VIDEO_ID}.mp4 \
    --force_single_process

Process Killed During Long Audio Inference

Enable low memory mode in the inference script to reduce memory consumption.

Character Eyes Not Opening

Set the eye_blink_dim parameter to 4 in both configuration files:

egs/datasets/${VIDEO_ID}/lm3d_radnerf_torso.yaml
egs/datasets/${VIDEO_ID}/lm3d_radnerf_sr.yaml

State Dict Loading Error

A size mismatch error in blink_encoder layers indicates inconsistent configuration parameters between training and inference. Verify that eye_blink_dim matches across all config files.
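
One way to carry out that verification is to read the key from each config file and compare the values. The sketch below does a plain text scan for a top-level `key: value` line. It does not resolve values inherited from a base config, so a result of None only means the key is not set in that particular file. The function names are illustrative.

```python
import re
from pathlib import Path


def read_scalar(config_path, key):
    """Return the raw value of a top-level `key: value` line, or None if the key is absent."""
    pattern = re.compile(rf"^{re.escape(key)}[ \t]*:[ \t]*([^#\n]+)", re.MULTILINE)
    match = pattern.search(Path(config_path).read_text(encoding="utf-8"))
    return match.group(1).strip() if match else None


def find_mismatches(config_paths, key="eye_blink_dim"):
    """Return {path: value} if the files disagree on key, otherwise an empty dict."""
    values = {str(p): read_scalar(p, key) for p in config_paths}
    distinct = {v for v in values.values() if v is not None}
    return values if len(distinct) > 1 else {}
```

Passing the head and torso config paths for your video to `find_mismatches` returns an empty dict when every explicitly set value agrees.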

Technical Background

HuBERT (Hidden Unit BERT)

HuBERT is a self-supervised speech representation learning model developed by Facebook AI Research. Unlike traditional approaches, HuBERT learns speech representations by predicting hidden units from audio context without requiring labeled data. The architecture uses Transformer-based encoders to process waveform features and generates bidirectional representations suitable for speech recognition, speaker identification, emotion analysis, and speech synthesis tasks.
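
To relate HuBERT features to the 25 FPS video frames, it helps to know how many feature vectors a waveform yields. The sketch below assumes the standard seven-layer convolutional front end shared by the published wav2vec 2.0 and HuBERT Base/Large models. The exact checkpoint that extract_hubert.py loads is not stated in this article, so treat the layer table as an assumption.

```python
# (kernel, stride) of each 1-D convolution in the assumed feature encoder.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def hubert_num_frames(num_samples):
    """Number of feature frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # input too short to pass through this layer
        length = (length - kernel) // stride + 1
    return length


print(hubert_num_frames(16000))  # 49 frames for one second at 16 kHz
```

The strides multiply to 320 samples, or 20 ms at 16 kHz, so the encoder emits roughly 50 feature frames per second, about two per video frame at 25 FPS.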

Mel Spectrogram and F0 Fundamental Frequency

Mel spectrograms apply a perceptually-motivated frequency scale that matches human auditory sensitivity, providing finer resolution at lower frequencies. F0 represents the fundamental frequency or pitch of a voice signal, derived from vocal cord vibration periodicity. Together, these features capture both spectral content and prosodic information essential for natural speech synthesis.
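
The claim about finer resolution at low frequencies can be illustrated numerically. The sketch below uses the HTK-style mel formula, which is one common convention; the variant used by extract_mel_f0.py is not stated here. It spaces band edges evenly on the mel scale and prints how wide each band is in Hz.

```python
import math


def hz_to_mel(hz):
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min=0.0, f_max=8000.0):
    """Edges in Hz of n_bands bands that are equally wide on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * i / n_bands) for i in range(n_bands + 1)]


edges = mel_band_edges(8)
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(w) for w in widths])  # band widths in Hz grow with frequency
```

The default upper limit of 8000 Hz is the Nyquist frequency of the 16 kHz audio extracted earlier. Narrow bands at the low end are what give the mel spectrogram its finer low-frequency resolution.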

2D Facial Landmarks

2D facial landmarks are geometric key points representing facial features such as eye corners, nose tip, and mouth boundaries. They enable face alignment, expression recognition, head pose estimation, and serve as input for 3D face reconstruction. MediaPipe provides robust landmark detection used in the preprocessing pipeline.

3D Morphable Model Fitting

3DMM fitting aligns a statistical 3D face model to 2D image observations through optimization. The process involves detecting 2D landmarks, estimating corresponding 3D positions, and iteratively adjusting model parameters to minimize reprojection error. The fitted 3DMM provides shape, expression, and pose coefficients that drive the NeRF-based rendering pipeline.
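
The idea of iteratively minimizing reprojection error can be shown with a toy problem. The sketch below is not the fitting code in fit_3dmm_landmark.py. It keeps the 3D points fixed and only fits a scale and a 2D translation of a scaled orthographic camera by gradient descent, whereas real 3DMM fitting also optimizes identity, expression, and rotation. All names and numbers are illustrative.

```python
def project(points3d, scale, tx, ty):
    """Scaled orthographic projection: drop z, then scale and translate."""
    return [(scale * x + tx, scale * y + ty) for x, y, _ in points3d]


def fit_camera(points3d, observed2d, steps=200, lr=0.1):
    """Minimize the mean squared reprojection error over (scale, tx, ty)."""
    scale, tx, ty = 1.0, 0.0, 0.0
    n = len(points3d)
    for _ in range(steps):
        predicted = project(points3d, scale, tx, ty)
        g_scale = g_tx = g_ty = 0.0
        for (x, y, _), (u, v), (pu, pv) in zip(points3d, observed2d, predicted):
            ru, rv = pu - u, pv - v  # residual in image space
            g_scale += 2.0 * (ru * x + rv * y) / n
            g_tx += 2.0 * ru / n
            g_ty += 2.0 * rv / n
        scale -= lr * g_scale
        tx -= lr * g_tx
        ty -= lr * g_ty
    return scale, tx, ty


# Synthetic "landmarks" and observations generated with a known camera.
points = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.2), (1.0, 1.0, 0.0),
          (-1.0, 1.0, 0.1), (0.0, 0.0, 0.5)]
observed = project(points, 2.0, 0.5, -0.3)
print(fit_camera(points, observed))
```

Because the observations were generated by the same camera model, the fit recovers values close to (2.0, 0.5, -0.3). With real landmarks the residual does not reach zero, and the remaining error is what the additional shape and expression parameters are meant to absorb.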

Tags: GeneFace++ Deep Learning 3D Facial Animation NeRF pytorch

Posted on Wed, 20 May 2026 02:32:25 +0000 by adamb10
