Effective generative modeling relies heavily on robust tooling. This article focuses on two essential Python libraries from Hugging Face: diffusers for diffusion-based models and accelerate for streamlined distributed training.
Accelerate Library
The accelerate library simplifies distributed training, mixed-precision computation, gradient accumulation, and integration with logging tools like TensorBoard or Weights & Biases. Installation is straightforward:
pip install accelerate
A typical workflow involves initializing an Accelerator instance, preparing model components, and managing training loops with built-in synchronization:
from accelerate import Accelerator
import torch
import os

# model, criterion, dataloader, and num_epochs are assumed to be defined elsewhere

# Initialize accelerator with mixed precision and gradient accumulation
accelerator = Accelerator(
    mixed_precision='fp16',
    gradient_accumulation_steps=2,
    log_with='wandb',
    project_dir='./logs'
)

# Create output directory only on main process
if accelerator.is_main_process:
    os.makedirs('./output', exist_ok=True)

# Safe to call on every process; accelerate only initializes the tracker on the main one
accelerator.init_trackers(
    project_name="Diffusion-Training",
    init_kwargs={"wandb": {"name": "experiment-01"}}
)

# Configure optimizer with per-group learning rates
optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},
    {'params': model.head.parameters(), 'lr': 5e-5}
])

# Learning rate schedule: linear warmup followed by cosine decay
total_steps = num_epochs * len(dataloader)
warmup_steps = int(0.1 * total_steps)
warmup_scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=warmup_steps
)
cosine_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps, eta_min=1e-7
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[warmup_steps]
)

# Prepare components for distributed training
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

# Training loop
for epoch in range(num_epochs):
    for batch_idx, batch in enumerate(dataloader):
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            accelerator.backward(loss)
            # Clip only on iterations where gradients are actually synchronized
            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
            # The prepared optimizer skips its step while gradients are still accumulating
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        # Log metrics
        accelerator.log({
            "loss": loss.item(),
            "lr": scheduler.get_last_lr()[0]
        }, step=epoch * len(dataloader) + batch_idx)

# Save model after synchronization
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained('./output')

accelerator.end_training()
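Multi-GPU execution requires no code changes: configure the hardware once with accelerate config, then start the script (assumed here to be saved as train.py) through the CLI launcher:

accelerate launch train.py

The same script still runs under plain python on a single device.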
Diffusers Library
The diffusers library provides modular components for building and training diffusion models. Install via:
pip install diffusers
Core Components
Training pairs a noise scheduler with a denoising model: clean samples are noised at random timesteps, and the network learns to predict the added noise:
from diffusers import DDPMScheduler
import torch
import torch.nn.functional as F

noise_scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear"
)

# Training step
for images in train_dataloader:
    clean_images = images.to(accelerator.device)
    batch_size = clean_images.shape[0]

    # Sample a random timestep for each image in the batch
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (batch_size,), device=clean_images.device
    ).long()

    # Add noise according to the forward diffusion process
    noise = torch.randn_like(clean_images)
    noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

    # Predict the noise and regress it against the true noise
    noise_pred = model(noisy_images, timesteps).sample
    loss = F.mse_loss(noise_pred, noise)
    accelerator.backward(loss)
    # ... optimizer steps ...
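Under the hood, add_noise applies the closed-form forward process \( x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon \), which is why arbitrary timesteps can be sampled directly without simulating the noising chain step by step.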
Inference Pipeline
Generation uses reverse diffusion through scheduler steps:
def generate_image(model, scheduler, latent_shape, device, num_inference_steps=1000):
    # Start from pure Gaussian noise
    latents = torch.randn(latent_shape, device=device)
    scheduler.set_timesteps(num_inference_steps)
    for t in scheduler.timesteps:
        with torch.no_grad():
            noise_pred = model(latents, t).sample
        # One reverse-diffusion update toward the clean sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
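A minimal usage sketch, assuming model and noise_scheduler are the trained UNet and DDPMScheduler from the training step above, and that training images were normalized to [-1, 1]:

# Generate four 64x64 RGB samples (the shape is an assumed training resolution)
samples = generate_image(model, noise_scheduler, (4, 3, 64, 64), "cuda")
# Map from [-1, 1] back to [0, 1] for viewing
samples = (samples / 2 + 0.5).clamp(0, 1)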
Scheduler Mechanics
The DDPMScheduler.step() method implements the reverse diffusion process:
- Looks up the cumulative alphas \( \bar{\alpha}_t \) and betas \( \beta_t \) for the current timestep
- Predicts the original sample: \( x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon_\theta}{\sqrt{\bar{\alpha}_t}} \)
- Calculates the previous sample with variance-preserving coefficients: \( x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \, c_1 \, x_0 + \sqrt{\alpha_t} \, c_2 \, x_t \), where \( c_1 = \frac{\beta_t}{1 - \bar{\alpha}_t} \) and \( c_2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \) (see the sketch below)
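For intuition, here is a minimal sketch of the deterministic part of that computation (the real step() also adds scaled noise for t > 0). alphas_cumprod stands for the scheduler's precomputed \( \bar{\alpha} \) tensor; all names are illustrative, not the diffusers internals:

import torch

def ddpm_step_mean(noise_pred, t, t_prev, sample, alphas_cumprod):
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    alpha_t = alpha_bar_t / alpha_bar_prev
    beta_t = 1 - alpha_t
    # Predict x_0 from the current sample and the noise estimate
    x0 = (sample - (1 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
    # Posterior mean coefficients c_1 and c_2 from the formula above
    c1 = beta_t / (1 - alpha_bar_t)
    c2 = (1 - alpha_bar_prev) / (1 - alpha_bar_t)
    return alpha_bar_prev.sqrt() * c1 * x0 + alpha_t.sqrt() * c2 * sample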
Stable Diffusion Pipeline
High-level pipelines combine a VAE, a CLIP text encoder, a UNet, and a noise scheduler:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]
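The pipeline returns PIL images, so the result can be written straight to disk (the filename is arbitrary):

image.save("astronaut.png")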
Classifier-Free Guidance (CFG), sketched in code after this list, works by:
- Encoding both prompt and empty/negative prompt
- Processing concatenated embeddings through UNet
- Combining predictions: \( \epsilon = \epsilon_{\text{uncond}} + w(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}) \)
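A sketch of how one denoising step applies this inside the pipeline; unet, latents, t, cond_emb, and uncond_emb are assumed to come from the surrounding pipeline code:

import torch

# Run the conditional and unconditional branches in a single forward pass
latent_input = torch.cat([latents, latents])
text_emb = torch.cat([uncond_emb, cond_emb])
noise_pred = unet(latent_input, t, encoder_hidden_states=text_emb).sample
noise_uncond, noise_cond = noise_pred.chunk(2)
# w > 1 pushes the prediction toward the prompt
guidance_scale = 7.5
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)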
SDXL Inpainting Example
Two-stage inpainting with base and refiner models:
import torch
from diffusers import StableDiffusionXLInpaintPipeline

base = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16
).to("cuda")

# init_image and mask are PIL images (see the loading sketch below)
prompt = "a cat sitting on a park bench"

# Base generation: denoise the first 80% of the schedule, keep latents
latents = base(
    prompt=prompt,
    image=init_image,
    mask_image=mask,
    denoising_end=0.8,
    output_type="latent"
).images

# Refinement: take over for the final 20% of the schedule
final_image = refiner(
    prompt=prompt,
    image=latents,
    mask_image=mask,
    denoising_start=0.8
).images[0]
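The init_image and mask variables above are assumed to be pre-loaded PIL images at SDXL resolution; one way to obtain them is diffusers' load_image helper (the paths here are placeholders):

from diffusers.utils import load_image

init_image = load_image("path/or/url/to/image.png").resize((1024, 1024))
mask = load_image("path/or/url/to/mask.png").resize((1024, 1024))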
LoRA Fine-tuning
Parameter-efficient adaptation using Low-Rank Adaptation:
from peft import LoraConfig
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("path/to/model")
unet.requires_grad_(False)  # freeze the base weights

lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["to_k", "to_q", "to_v", "to_out.0"]
)
unet.add_adapter(lora_config)
LoRA modifies attention layers via: \( y = Wx + \frac{\alpha}{r} \cdot B(Ax) \), where \( A \in \mathbb{R}^{r \times d} \) and \( B \in \mathbb{R}^{d \times r} \) are trainable low-rank matrices.
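To make the update concrete, here is a self-contained toy module implementing that equation (illustrative names, not the peft implementation):

import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, r=4, alpha=8):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)
        self.W.weight.requires_grad_(False)   # frozen pretrained weight
        self.A = nn.Linear(d, r, bias=False)  # A in R^{r x d}
        self.B = nn.Linear(r, d, bias=False)  # B in R^{d x r}
        nn.init.zeros_(self.B.weight)         # the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.W(x) + self.scale * self.B(self.A(x))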
Custom Attention Processors
Modify attention mechanisms by replacing processors:
import torch

class CustomAttnProcessor:
    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None):
        # Fall back to self-attention when no encoder states are passed
        context = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        # Custom attention logic
        attn_weights = torch.softmax(query @ key.transpose(-2, -1) * attn.scale, dim=-1)
        output = attn.batch_to_head_dim(attn_weights @ value)
        # to_out is a ModuleList: [Linear, Dropout]
        return attn.to_out[1](attn.to_out[0](output))
# Apply to cross-attention layers only
processor_dict = {}
for name in unet.attn_processors.keys():
    if "attn2" in name:  # cross-attention
        processor_dict[name] = CustomAttnProcessor()
    else:
        processor_dict[name] = unet.attn_processors[name]
unet.set_attn_processor(processor_dict)