Practical Guide to Diffusers and Accelerate for Generative Modeling

Effective generative modeling relies heavily on robust tooling. This article focuses on two essential Python libraries from Hugging Face: diffusers for diffusion-based models and accelerate for streamlined distributed training.

Accelerate Library

The accelerate library simplifies distributed training, mixed-precision computation, gradient accumulation, and integration with logging tools like TensorBoard or Weights & Biases. Installation is straightforward:

pip install accelerate

A typical workflow involves initializing an Accelerator instance, preparing model components, and managing training loops with built-in synchronization:

from accelerate import Accelerator
import torch
import os

# Initialize accelerator with mixed precision and gradient accumulation
accelerator = Accelerator(
    mixed_precision='fp16',
    gradient_accumulation_steps=2,
    log_with='wandb',
    project_dir='./logs'
)

# Create output directory only on main process
if accelerator.is_main_process:
    os.makedirs('./output', exist_ok=True)
    accelerator.init_trackers(
        project_name="Diffusion-Training",
        init_kwargs={"wandb": {"name": "experiment-01"}}
    )

# Configure optimizer with multiple learning rates
optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},
    {'params': model.head.parameters(), 'lr': 5e-5}
])

# Learning rate scheduler with warmup and cosine decay
total_steps = num_epochs * len(dataloader)
warmup_steps = int(0.1 * total_steps)
warmup_scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=warmup_steps
)
cosine_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps, eta_min=1e-7
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, [warmup_scheduler, cosine_scheduler], [warmup_steps]
)

# Prepare components for distributed training
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

# Training loop
for epoch in range(num_epochs):
    for batch_idx, batch in enumerate(dataloader):
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            
            accelerator.backward(loss)
            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

            # Log metrics
            accelerator.log({
                "loss": loss.item(),
                "lr": scheduler.get_last_lr()[0]
            }, step=epoch * len(dataloader) + batch_idx)

# Save model after synchronization
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained('./output')
accelerator.end_training()
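
For multi-GPU or multi-node runs, the same script is typically started through the accelerate CLI: accelerate config records the hardware setup once, and accelerate launch runs the script under that configuration (the script name below is a placeholder):

accelerate config
accelerate launch train.py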

Diffusers Library

The diffusers library provides modular components for building and training diffusion models. Install via:

pip install diffusers

Core Components

Training adds noise to clean samples according to a schedule and optimizes the model to predict that noise:

from diffusers import DDPMScheduler
import torch.nn.functional as F

noise_scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear"
)

# Training step
for images in train_dataloader:
    clean_images = images.to(accelerator.device)
    batch_size = clean_images.shape[0]
    
    # Sample random timesteps
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (batch_size,), device=clean_images.device
    ).long()
    
    # Add noise
    noise = torch.randn_like(clean_images)
    noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
    
    # Predict noise
    noise_pred = model(noisy_images, timesteps).sample
    loss = F.mse_loss(noise_pred, noise)
    
    accelerator.backward(loss)
    # ... optimizer steps ...

Inference Pipeline

Generation uses reverse diffusion through scheduler steps:

def generate_image(model, scheduler, latent_shape, device, num_inference_steps=50):
    # Start from pure Gaussian noise
    latents = torch.randn(latent_shape, device=device)
    
    # Set the inference timestep sequence before iterating over it
    scheduler.set_timesteps(num_inference_steps)
    for t in scheduler.timesteps:
        with torch.no_grad():
            noise_pred = model(latents, t).sample
        # One reverse-diffusion step toward the previous, less noisy sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    
    return latents
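
A usage sketch, assuming the model and scheduler from the training snippet above (batch size and resolution here are illustrative):

samples = generate_image(model, noise_scheduler, (4, 3, 64, 64), accelerator.device, num_inference_steps=50)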

Scheduler Mechanics

The DDPMScheduler.step() method implements the reverse diffusion process:

  1. Computes the cumulative product \( \bar{\alpha}_t \) and the per-step values \( \alpha_t \) and \( \beta_t = 1 - \alpha_t \)
  2. Predicts the original sample: \( x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon_\theta}{\sqrt{\bar{\alpha}_t}} \)
  3. Computes the posterior mean \( \mu_t = \frac{\sqrt{\bar{\alpha}_{t-1}} \, \beta_t}{1 - \bar{\alpha}_t} \cdot x_0 + \frac{\sqrt{\alpha_t} \, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \cdot x_t \) and, for \( t > 0 \), adds noise scaled by the posterior standard deviation to obtain \( x_{t-1} \) (see the sketch below)
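
A simplified sketch of this update, assuming epsilon prediction and access to the scheduler's precomputed cumulative alphas; variable names are illustrative rather than the library's internals:

import torch

def ddpm_step(noise_pred, t, sample, alphas_cumprod):
    # Cumulative alphas at the current and previous timestep
    alpha_prod_t = alphas_cumprod[t]
    alpha_prod_prev = alphas_cumprod[t - 1] if t > 0 else alphas_cumprod.new_tensor(1.0)
    alpha_t = alpha_prod_t / alpha_prod_prev
    beta_t = 1 - alpha_t

    # Predict the original sample x_0 from the current sample and the noise estimate
    pred_x0 = (sample - (1 - alpha_prod_t).sqrt() * noise_pred) / alpha_prod_t.sqrt()

    # Posterior mean over x_{t-1}
    coef_x0 = alpha_prod_prev.sqrt() * beta_t / (1 - alpha_prod_t)
    coef_xt = alpha_t.sqrt() * (1 - alpha_prod_prev) / (1 - alpha_prod_t)
    prev_sample = coef_x0 * pred_x0 + coef_xt * sample

    # Add posterior noise except at the final step (t == 0)
    if t > 0:
        variance = (1 - alpha_prod_prev) / (1 - alpha_prod_t) * beta_t
        prev_sample = prev_sample + variance.sqrt() * torch.randn_like(sample)
    return prev_sample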

Stable Diffusion Pipeline

High-level pipelines combine VAE, text encoder (CLIP), and UNet:

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

Classifier-Free Guidance (CFG) works by the following steps (a code sketch follows the list):

  1. Encoding both prompt and empty/negative prompt
  2. Processing concatenated embeddings through UNet
  3. Combining predictions: \( \epsilon = \epsilon_{\text{uncond}} + w(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}) \)
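
A minimal sketch of that combination inside one denoising step. It assumes latents, a timestep t, a unet, a guidance_scale, and text_embeddings (the negative-prompt embeddings concatenated with the prompt embeddings) are already in scope:

# One batched UNet call covers the unconditional and conditional branches
latent_input = torch.cat([latents] * 2)
noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample

# Split the prediction and apply the guidance formula
noise_uncond, noise_cond = noise_pred.chunk(2)
guided_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)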

SDXL Inpainting Example

Two-stage inpainting with base and refiner models:

from diffusers import StableDiffusionXLInpaintPipeline

base = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16
).to("cuda")

# Base generation (high noise fraction)
latents = base(
    prompt="a cat sitting on a park bench",
    image=init_image,
    mask_image=mask,
    denoising_end=0.8,
    output_type="latent"
).images

# Refinement (low noise fraction)
final_image = refiner(
    prompt="a cat sitting on a park bench",
    image=latents,
    mask_image=mask,
    denoising_start=0.8
).images[0]

LoRA Fine-tuning

Parameter-efficient adaptation using Low-Rank Adaptation:

from peft import LoraConfig
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("path/to/model")
unet.requires_grad_(False)

lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["to_k", "to_q", "to_v", "to_out.0"]
)
unet.add_adapter(lora_config)

LoRA modifies attention layers via: \( y = Wx + \frac{\alpha}{r} \cdot B(Ax) \) where \( A \in \mathbb{R}^{r \times d} \) and \( B \in \mathbb{R}^{d \times r} \) are trainable low-rank matrices and \( W \) is the frozen pretrained weight.
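
A minimal hand-rolled sketch of that update for a single linear layer (an illustration of the math, not peft's implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, r: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained projection
        # Low-rank factors: A projects down to rank r, B projects back up
        self.A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, r))  # zero init: starts as plain Wx
        self.scaling = alpha / r

    def forward(self, x):
        # y = W x + (alpha / r) * B (A x)
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())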

Custom Attention Processors

Modify attention mechanisms by replacing processors:

class CustomAttnProcessor:
    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None, **kwargs):
        # Fall back to self-attention when no encoder states are provided
        context = hidden_states if encoder_hidden_states is None else encoder_hidden_states
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        
        # Custom attention logic
        attn_weights = torch.softmax(query @ key.transpose(-2, -1) * attn.scale, dim=-1)
        output = attn.batch_to_head_dim(attn_weights @ value)
        
        # to_out is a ModuleList: linear projection followed by dropout
        output = attn.to_out[0](output)
        output = attn.to_out[1](output)
        return output

# Apply to cross-attention layers
processor_dict = {}
for name in unet.attn_processors.keys():
    if "attn2" in name:  # Cross-attention
        processor_dict[name] = CustomAttnProcessor()
    else:
        processor_dict[name] = unet.attn_processors[name]
unet.set_attn_processor(processor_dict)

Tags: diffusers accelerate huggingface stable-diffusion lora

Posted on Mon, 11 May 2026 09:49:07 +0000 by Jagand