Project Overview
GeneFace++ is a PyTorch-based deep learning framework that enables real-time audio-driven 3D facial animation synthesis. The system generates synchronized lip movements and facial expressions from audio input, creating realistic virtual character videos. The project repository is available at https://github.com/yerfor/GeneFacePlusPlus/tree/main.
Environment Setup
The deployment was performed on a Volcano Engine cloud instance with the following specifications: GPU compute type g1vc, ecs.g1vc.xlarge instance, 6 vCPUs, 26GiB memory, V100 GPU, running Ubuntu 22.04.
CUDA Installation
CUDA 11.7 is the recommended version, validated on A100/V100 GPUs. CUDA 12 and above are not supported due to compatibility issues with torch-ngp.
# Register NVIDIA's apt repository via the cuda-keyring package, then install CUDA 11.7
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring*.deb
sudo apt-get update
sudo apt-get -y install cuda=11.7.1-1
# Verify installation
ls /usr/local | grep cuda
If the wrong CUDA version was installed by default, remove it and reinstall the specific version:
sudo apt-get --purge remove cuda
sudo apt-get -y install cuda=11.7.1-1
Python Environment Configuration
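Before setting up the Python environment, it may be worth confirming that the installed toolkit satisfies the CUDA constraint stated earlier. The sketch below parses the text printed by `nvcc --version` and classifies it against this guide's rules. The function names and the "untested" label for other 11.x releases are illustrative assumptions, not part of GeneFace++.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def classify_cuda(version):
    """Map a (major, minor) tuple onto the constraints given in this guide."""
    if version is None:
        return "not found"
    major, minor = version
    if major >= 12:
        # The guide reports CUDA 12+ as incompatible with torch-ngp.
        return "unsupported"
    if (major, minor) == (11, 7):
        return "recommended"
    # Assumption: anything else is simply not covered by the guide.
    return "untested"


# Sample line in the format nvcc prints; replace with real output in practice.
sample = "Cuda compilation tools, release 11.7, V11.7.99"
print(classify_cuda(parse_nvcc_release(sample)))
```

On a real machine the input string could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.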
# Install Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh
chmod +x Anaconda3-2024.02-1-Linux-x86_64.sh
./Anaconda3-2024.02-1-Linux-x86_64.sh
# Clone repository
git clone https://github.com/yerfor/GeneFacePlusPlus.git
# Create Python 3.9 environment
cd GeneFacePlusPlus
conda create -n geneface python=3.9
conda activate geneface
# Install FFmpeg with libx264 codec
conda install conda-forge::ffmpeg
# Install PyTorch 2.0.1 with CUDA 11.7
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
# Build PyTorch3D from source
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
# Install MMCV dependencies
pip install cython
pip install openmim==0.3.9
mim install mmcv==2.1.0
# Install remaining dependencies
sudo apt-get install libasound2-dev portaudio19-dev
pip install -r docs/prepare_env/requirements.txt -v
# Build torch-ngp extensions
bash docs/prepare_env/install_ext.sh
3DMM Model Setup (BFM2009)
Download the BFM2009 model files and place them in the deep_3drecon/BFM/ directory. The folder should contain these files:
deep_3drecon/BFM/
├── 01_MorphableModel.mat
├── BFM_exp_idx.mat
├── BFM_front_idx.mat
├── BFM_model_front.mat
├── Exp_Pca.bin
├── facemodel_info.mat
├── index_mp468_from_mesh35709.npy
├── mediapipe_in_bfm53201.npy
└── std_exp.txt
Running Inference Demos
Downloading Pre-trained Models
Download the required datasets and models:
- May dataset: trainval_dataset.npy → data/binary/videos/May/trainval_dataset.npy
- Audio-to-motion model: audio2motion_vae.zip → ./checkpoints/
- Motion-to-video model: motion2video_nerf.zip → ./checkpoints/
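A missing file tends to surface only as a late error inside the inference script, so a quick pre-flight check can save time. The sketch below lists which of the BFM files and demo assets named in this guide are absent under a repository root. It assumes the two zip archives have been extracted in place so that the checkpoint directories used by the inference commands exist; the helper names are hypothetical.

```python
from pathlib import Path

# File names copied from the BFM folder listing in this guide.
BFM_FILES = [
    "01_MorphableModel.mat",
    "BFM_exp_idx.mat",
    "BFM_front_idx.mat",
    "BFM_model_front.mat",
    "Exp_Pca.bin",
    "facemodel_info.mat",
    "index_mp468_from_mesh35709.npy",
    "mediapipe_in_bfm53201.npy",
    "std_exp.txt",
]

# Paths referenced by the demo commands (assumes the archives were unzipped).
DEMO_PATHS = [
    "data/binary/videos/May/trainval_dataset.npy",
    "checkpoints/audio2motion_vae",
    "checkpoints/motion2video_nerf/may_torso",
]


def expected_demo_paths():
    """Every path the May demo needs, relative to the repository root."""
    return [f"deep_3drecon/BFM/{name}" for name in BFM_FILES] + DEMO_PATHS


def missing_paths(repo_root, relative_paths):
    """Return the relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    # Run from inside the GeneFacePlusPlus checkout.
    for path in missing_paths(".", expected_demo_paths()):
        print("missing:", path)
```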
Command Line Inference
cd GeneFacePlusPlus
conda activate geneface
export PYTHONPATH=./
python inference/genefacepp_infer.py \
--a2m_ckpt=checkpoints/audio2motion_vae \
--head_ckpt= \
--torso_ckpt=checkpoints/motion2video_nerf/may_torso \
--drv_aud=data/raw/val_wavs/MacronSpeech.wav \
--out_name=may_demo.mp4
Gradio Web Interface
python inference/app_genefacepp.py \
--a2m_ckpt=checkpoints/audio2motion_vae \
--head_ckpt= \
--torso_ckpt=checkpoints/motion2video_nerf/may_torso
Training Custom Models
Data Preparation
Prepare a video with minimal head occlusion and limited movement. Resolution should be at least 512x512. Place the video in data/raw/videos/.
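The crop filter in Step 1 takes the form crop=width:height:x:y, where x:y is the top-left corner of the cropped region, so the offsets depend on the source resolution. As a minimal sketch, assuming the subject sits roughly in the middle of the frame, the helper below (a hypothetical name, not part of the repository) builds a centered 512x512 filter string and rejects sources that are too small.

```python
def centered_crop_filter(width, height, size=512, fps=25):
    """Build an ffmpeg -filter:v string for a centered square crop.

    Raises ValueError when the source is smaller than the target size.
    """
    if width < size or height < size:
        raise ValueError(
            f"source {width}x{height} is smaller than {size}x{size}"
        )
    # Top-left corner of a crop window centered in the frame.
    x = (width - size) // 2
    y = (height - size) // 2
    return f"fps={fps},crop={size}:{size}:{x}:{y}"


# A 1080x960 source yields the same offsets as the Step 1 command.
print(centered_crop_filter(1080, 960))  # fps=25,crop=512:512:284:224
```

If the face is off-center, the offsets would need to be chosen by hand instead.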
Step 1: Video Preprocessing
export VIDEO_ID=your_video_name
# Crop to 512x512 at 25 FPS; 284:224 is the top-left x:y of the crop window, so adjust it to frame the face in your video
ffmpeg -i data/raw/videos/${VIDEO_ID}.mp4 \
-filter:v "fps=25,crop=512:512:284:224" \
-qmin 1 -q:v 1 \
data/raw/videos/${VIDEO_ID}_512.mp4
mv data/raw/videos/${VIDEO_ID}.mp4 data/raw/videos/${VIDEO_ID}_backup.mp4
mv data/raw/videos/${VIDEO_ID}_512.mp4 data/raw/videos/${VIDEO_ID}.mp4
Step 2: Audio Feature Extraction
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=./
export VIDEO_ID=your_video_name
mkdir -p data/processed/videos/${VIDEO_ID}
# Extract audio waveform
ffmpeg -i data/raw/videos/${VIDEO_ID}.mp4 \
-f wav -ar 16000 \
data/processed/videos/${VIDEO_ID}/aud.wav
# Generate HuBERT features
python data_gen/utils/process_audio/extract_hubert.py --video_id=${VIDEO_ID}
# Generate mel spectrogram and F0 features
python data_gen/utils/process_audio/extract_mel_f0.py --video_id=${VIDEO_ID}
Step 3: Frame Extraction and Segmentation
export PYTHONPATH=./
export VIDEO_ID=your_video_name
export CUDA_VISIBLE_DEVICES=0
mkdir -p data/processed/videos/${VIDEO_ID}/gt_imgs
# Extract video frames
ffmpeg -i data/raw/videos/${VIDEO_ID}.mp4 \
-vf fps=25,scale=w=512:h=512 \
-qmin 1 -q:v 1 -start_number 0 \
data/processed/videos/${VIDEO_ID}/gt_imgs/%08d.jpg
# Extract segmentation masks
python data_gen/utils/process_video/extract_segment_imgs.py \
--ds_name=nerf \
--vid_dir=data/raw/videos/${VIDEO_ID}.mp4
If segmentation hangs, run it in a single process:
python data_gen/utils/process_video/extract_segment_imgs.py \
--ds_name=nerf \
--vid_dir=data/raw/videos/${VIDEO_ID}.mp4 \
--force_single_process
Step 4: Landmark Detection and 3DMM Fitting
export PYTHONPATH=./
export VIDEO_ID=your_video_name
# Extract 2D landmarks
python data_gen/utils/process_video/extract_lm2d.py \
--ds_name=nerf \
--vid_dir=data/raw/videos/${VIDEO_ID}.mp4
# Fit 3D morphable model
python data_gen/utils/process_video/fit_3dmm_landmark.py \
--ds_name=nerf \
--vid_dir=data/raw/videos/${VIDEO_ID}.mp4 \
--reset --debug --id_mode=global
Step 5: Data Binarization
export PYTHONPATH=./
export VIDEO_ID=your_video_name
python data_gen/runs/binarizer_nerf.py --video_id=${VIDEO_ID}
Model Training
Configuration Setup
cp -r egs/datasets/May egs/datasets/${VIDEO_ID}
# Update configuration files:
# - Replace video_id references with ${VIDEO_ID}
# - Update directory paths accordingly
Train Head NeRF Model
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
--config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_sr.yaml \
--exp_name=motion2video_nerf/${VIDEO_ID}_head \
--reset
Train Torso NeRF Model
CUDA_VISIBLE_DEVICES=0 python tasks/run.py \
--config=egs/datasets/${VIDEO_ID}/lm3d_radnerf_torso_sr.yaml \
--exp_name=motion2video_nerf/${VIDEO_ID}_torso \
--hparams=head_model_dir=checkpoints/motion2video_nerf/${VIDEO_ID}_head \
--reset
Custom Inference
CUDA_VISIBLE_DEVICES=0 python inference/app_genefacepp.py \
--a2m_ckpt=checkpoints/audio2motion_vae \
--head_ckpt= \
--torso_ckpt=checkpoints/motion2video_nerf/${VIDEO_ID}_torso
Troubleshooting Common Issues
Missing bg.jpg During Preprocessing
This typically indicates the segmentation extraction step failed. Re-run with single-process mode:
python data_gen/utils/process_video/extract_segment_imgs.py \
--ds_name=nerf \
--vid_dir=data/raw/videos/${VIDEO_ID}.mp4 \
--force_single_process
Process Killed During Long Audio Inference
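Enable low memory mode in the inference script to reduce memory consumption. This is believed to be exposed as the --low_memory_usage flag of inference/genefacepp_infer.py; run the script with --help to confirm the exact name.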
Character Eyes Not Opening
Set the eye_blink_dim parameter to 4 in both configuration files:
egs/datasets/${VIDEO_ID}/lm3d_radnerf_torso.yaml
egs/datasets/${VIDEO_ID}/lm3d_radnerf_sr.yaml
State Dict Loading Error
A size mismatch error in blink_encoder layers indicates inconsistent configuration parameters between training and inference. Verify that eye_blink_dim matches across all config files.
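To see exactly which layers disagree, it can help to compare parameter shapes directly rather than reading the full traceback. The sketch below works on plain name-to-shape mappings so it runs without PyTorch; with PyTorch, such a mapping could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`. The layer names and shapes in the example are hypothetical and chosen only to mimic an eye_blink_dim of 2 versus 4.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for differing entries.

    Only parameters present in both mappings are compared.
    """
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is None:
            continue  # parameter missing from the model; not a size mismatch
        if tuple(ckpt_shape) != tuple(model_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Hypothetical shapes: checkpoint trained with a 2-dim blink input,
# model built from a config that now specifies 4.
checkpoint = {
    "blink_encoder.0.weight": (32, 2),
    "blink_encoder.0.bias": (32,),
}
model = {
    "blink_encoder.0.weight": (32, 4),
    "blink_encoder.0.bias": (32,),
}

for name, (saved, expected) in find_shape_mismatches(checkpoint, model).items():
    print(f"{name}: checkpoint {saved} vs model {expected}")
```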
Technical Background
HuBERT (Hidden Unit BERT)
HuBERT is a self-supervised speech representation learning model developed by Facebook AI Research. Unlike traditional approaches, HuBERT learns speech representations by predicting hidden units from audio context without requiring labeled data. The architecture uses Transformer-based encoders to process waveform features and generates bidirectional representations suitable for speech recognition, speaker identification, emotion analysis, and speech synthesis tasks.
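One practical consequence of the waveform front end is the feature frame rate. As a rough sketch, assuming the standard HuBERT Base convolutional feature extractor (seven 1-D convolutions with the kernel sizes and strides listed below, the same layout as wav2vec 2.0), the number of output frames for a clip can be computed as follows. Whether the checkpoint used by extract_hubert.py matches this layout is an assumption.

```python
# (kernel_size, stride) for each conv layer of the assumed front end.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def total_stride():
    """Hop size in samples between consecutive output frames."""
    stride = 1
    for _, layer_stride in CONV_LAYERS:
        stride *= layer_stride
    return stride


def hubert_num_frames(num_samples):
    """Frames produced for a waveform of num_samples (no padding)."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length


sample_rate = 16000  # matches the -ar 16000 resampling in Step 2
print("hop (ms):", 1000 * total_stride() / sample_rate)
print("frames for 1 s of audio:", hubert_num_frames(sample_rate))
```

Under these assumptions the hop is 20 ms, so roughly two HuBERT frames fall within each frame of 25 FPS video.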
Mel Spectrogram and F0 Fundamental Frequency
Mel spectrograms apply a perceptually-motivated frequency scale that matches human auditory sensitivity, providing finer resolution at lower frequencies. F0 represents the fundamental frequency or pitch of a voice signal, derived from vocal cord vibration periodicity. Together, these features capture both spectral content and prosodic information essential for natural speech synthesis.
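Both ideas can be illustrated in a few lines. The sketch below uses the common HTK-style mel formula to show that equally spaced mel bands are narrower at low frequencies, and a naive autocorrelation search to recover the pitch of a synthetic tone. It is not the method used by extract_mel_f0.py, which may rely on a different mel variant and a more robust pitch tracker; all names here are illustrative.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min, f_max):
    """Edge frequencies in Hz of n_bands equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]


def estimate_f0(samples, sample_rate, f0_min=60.0, f0_max=400.0):
    """Pick the lag with the largest raw autocorrelation and convert to Hz."""
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(samples) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Correlate the signal with a copy of itself delayed by `lag`.
        score = sum(a * b for a, b in zip(samples, samples[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


edges = mel_band_edges(10, 0.0, 8000.0)
print("lowest band width (Hz): ", round(edges[1] - edges[0], 1))
print("highest band width (Hz):", round(edges[-1] - edges[-2], 1))

# 50 ms of a 200 Hz sine at 16 kHz stands in for a voiced speech frame.
tone = [math.sin(2 * math.pi * 200.0 * i / 16000) for i in range(800)]
print("estimated F0 (Hz):", estimate_f0(tone, 16000))
```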
2D Facial Landmarks
2D facial landmarks are geometric key points representing facial features such as eye corners, nose tip, and mouth boundaries. They enable face alignment, expression recognition, head pose estimation, and serve as input for 3D face reconstruction. MediaPipe provides robust landmark detection used in the preprocessing pipeline.
3D Morphable Model Fitting
3DMM fitting aligns a statistical 3D face model to 2D image observations through optimization. The process involves detecting 2D landmarks, estimating corresponding 3D positions, and iteratively adjusting model parameters to minimize reprojection error. The fitted 3DMM provides shape, expression, and pose coefficients that drive the NeRF-based rendering pipeline.
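The optimization loop can be sketched on a toy problem. The example below assumes a linear model with a single coefficient, a scaled orthographic camera with known scale and shift, and plain gradient descent on the mean squared reprojection error. The real fit_3dmm_landmark.py step estimates many more parameters (identity, expression, pose) and is not reproduced here; every name and number in the sketch is made up for illustration.

```python
def project(point3d, scale, shift):
    """Scaled orthographic projection: drop z, then scale and translate."""
    x, y, _ = point3d
    return (scale * x + shift[0], scale * y + shift[1])


def model_points(mean_shape, basis, coeff):
    """Linear face model: mean shape plus coeff times one basis direction."""
    return [
        tuple(m + coeff * b for m, b in zip(mean_pt, basis_pt))
        for mean_pt, basis_pt in zip(mean_shape, basis)
    ]


def reprojection_error(coeff, mean_shape, basis, landmarks2d, scale, shift):
    """Mean squared 2D distance between projected points and landmarks."""
    total = 0.0
    points = model_points(mean_shape, basis, coeff)
    for p3d, (u, v) in zip(points, landmarks2d):
        pu, pv = project(p3d, scale, shift)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(landmarks2d)


def fit_coeff(mean_shape, basis, landmarks2d, scale, shift, lr=1.0, steps=100):
    """Gradient descent on the coefficient, starting from the mean shape."""
    coeff = 0.0
    n = len(landmarks2d)
    for _ in range(steps):
        grad = 0.0
        points = model_points(mean_shape, basis, coeff)
        for p3d, b, (u, v) in zip(points, basis, landmarks2d):
            pu, pv = project(p3d, scale, shift)
            # Chain rule: d(pu)/d(coeff) = scale * b_x, likewise for pv.
            grad += 2.0 * ((pu - u) * scale * b[0] + (pv - v) * scale * b[1])
        coeff -= lr * grad / n
    return coeff


MEAN_SHAPE = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.5)]
BASIS = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, -0.3, 0.0), (0.1, 0.1, 0.0)]
SCALE, SHIFT = 2.0, (5.0, 3.0)

# Synthetic "detected" landmarks generated with a known coefficient of 0.7.
observed = [project(p, SCALE, SHIFT) for p in model_points(MEAN_SHAPE, BASIS, 0.7)]
fitted = fit_coeff(MEAN_SHAPE, BASIS, observed, SCALE, SHIFT)
print("fitted coefficient:", round(fitted, 4))
print("error at start:", reprojection_error(0.0, MEAN_SHAPE, BASIS, observed, SCALE, SHIFT))
print("error after fit:", reprojection_error(fitted, MEAN_SHAPE, BASIS, observed, SCALE, SHIFT))
```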