Architectural Breakdown and Operational Workflow of YOLOv5

Model Parameter Profiling

Utility functions in torch_utils facilitate the analysis of model complexity, including layer counts, parameter counts, and computational cost (FLOPs). The following snippet demonstrates how to aggregate parameter statistics and estimate floating-point operations using a dummy input tensor aligned with the model's stride.

import torch
import thop
from copy import deepcopy

def analyze_network(net, verbose_flag=False, input_dim=640):
    """
    Outputs layer details including gradients and FLOPs.
    """
    total_params = sum(p.numel() for p in net.parameters())
    trainable_params = sum(p.numel() for p in net.parameters() if p.requires_grad)

    if verbose_flag:
        print(f"{'Index':>5} {'Name':>40} {'Grad':>8} {'Params':>12} {'Shape':>20} {'Mean':>10} {'Std':>10}")
        for idx, (layer_name, param) in enumerate(net.named_parameters()):
            layer_name = layer_name.replace("module_list.", "")
            print(
                "%5g %40s %8s %12g %20s %10.3g %10.3g"
                % (idx, layer_name, param.requires_grad, param.numel(), list(param.shape), param.mean(), param.std())
            )

    try:
        # Profile once at the minimum stride-sized input, then scale FLOPs to the target size
        dummy_p = next(net.parameters())
        stride_val = max(int(net.stride.max()), 32) if hasattr(net, "stride") else 32
        dummy_input = torch.empty((1, dummy_p.shape[1], stride_val, stride_val), device=dummy_p.device)
        ops = thop.profile(deepcopy(net), inputs=(dummy_input,), verbose=False)[0] / 1e9 * 2  # thop returns MACs; x2 for FLOPs
        dims = [input_dim, input_dim] if isinstance(input_dim, int) else input_dim
        flops_info = f", {dims} {ops * dims[0] / stride_val * dims[1] / stride_val:.1f} GFLOPs"
    except Exception:
        flops_info = ""

    print(f"Summary: {len(list(net.modules()))} layers, {total_params} params, {trainable_params} grads{flops_info}")

Hyperparameter Configuration

Training behavior is governed by a YAML configuration file defining learning rates, augmentation probabilities, and loss weights. The hyp.scratch-low.yaml file illustrates the default settings for low-augmentation training on COCO. Key parameters include the initial learning rate (lr0), momentum, weight decay, and specific augmentation probabilities like mosaic and mixup.
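An abridged excerpt of the file is shown below. Values follow the upstream YOLOv5 repository at the time of writing; consult your own checkout for the authoritative, complete list.

```yaml
# Abridged from hyp.scratch-low.yaml
lr0: 0.01            # initial learning rate (SGD)
lrf: 0.01            # final OneCycleLR multiplier
momentum: 0.937      # SGD momentum
weight_decay: 0.0005 # optimizer weight decay
box: 0.05            # box loss gain
cls: 0.5             # classification loss gain
obj: 1.0             # objectness loss gain
hsv_h: 0.015         # HSV hue augmentation
fliplr: 0.5          # horizontal flip probability
mosaic: 1.0          # mosaic augmentation probability
mixup: 0.0           # mixup probability (disabled in the low-aug profile)
```

Note that the loss gains (box, cls, obj) weight the three loss components discussed later, so tuning them shifts the balance between localization and classification quality.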

Model Parsing and Architecture

The Model class in yolo.py parses a dictionary derived from a YAML file to construct the neural network. The parse_model function iterates through the backbone and head definitions, instantiating modules like Conv, C3, and SPPF. It adjusts channel depth based on the width_multiple and layer repetition count based on depth_multiple.

import torch.nn as nn

# Assumes Conv, C3, SPPF, Concat, Detect (models/common.py, models/yolo.py) and
# make_divisible (utils/general.py) are in scope, as in the original parse_model.
def construct_architecture(arch_dict, in_channels):
    """
    Builds the YOLOv5 model from a dictionary definition.
    """
    print(f"\n{'From':>18}{'N':>3}{'Params':>10}  {'Module':<40}{'Args':<30}")
    
    anchors, num_classes, depth_gain, width_gain = (
        arch_dict["anchors"],
        arch_dict["nc"],
        arch_dict["depth_multiple"],
        arch_dict["width_multiple"]
    )
    
    num_anchors = (len(anchors[0]) // 2) if isinstance(anchors, list) else anchors
    num_outputs = num_anchors * (num_classes + 5)

    output_layers, save_indices, out_c = [], [], in_channels[-1]

    for i, (f, n, m, args) in enumerate(arch_dict["backbone"] + arch_dict["head"]):
        m = eval(m) if isinstance(m, str) else m
        
        # Evaluate string arguments
        for j, arg in enumerate(args):
            try:
                args[j] = eval(arg) if isinstance(arg, str) else arg
            except NameError:
                pass

        # Apply depth gain
        n = max(round(n * depth_gain), 1) if n > 1 else n

        if m in {Conv, C3, SPPF}:
            c1, c2 = in_channels[f], args[0]
            c2 = make_divisible(c2 * width_gain, 8) if c2 != num_outputs else c2
            args = [c1, c2, *args[1:]]
            if m is C3:
                args.insert(2, n)
                n = 1
        elif m is Concat:
            c2 = sum(in_channels[x] for x in f)
        elif m is Detect:
            args.append([in_channels[x] for x in f])
        else:
            c2 = in_channels[f]

        module_instance = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)
        t = str(module_instance)[8:-2].replace("__main__.", "")
        n_params = sum(x.numel() for x in module_instance.parameters())
        
        module_instance.i, module_instance.f, module_instance.type, module_instance.np = i, f, t, n_params
        print(f"{i:>3}{str(f):>18}{n:>3}{n_params:10.0f}  {t:<40}{str(args):<30}")
        
        save_indices.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1)
        output_layers.append(module_instance)
        
        if i == 0:
            in_channels = []
        in_channels.append(c2)
        
    return nn.Sequential(*output_layers), sorted(save_indices)
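construct_architecture snaps every width-scaled channel count to a multiple of 8 via make_divisible. A minimal sketch of that helper, assuming the same round-up behavior as the version in utils/general.py:

```python
import math

def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor.

    Sketch of the helper used when applying width_multiple, so channel
    counts stay friendly to hardware vectorization.
    """
    return math.ceil(x / divisor) * divisor

# Example: yolov5s uses width_multiple = 0.5, so a nominal 128-channel
# layer becomes make_divisible(128 * 0.5) = 64 channels.
```
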

Core Building Blocks

The fundamental components of YOLOv5 include the Conv block (Conv2d + BatchNorm + SiLU), the Bottleneck, and the C3 module (Cross Stage Partial).

Standard Convolution

class Conv(nn.Module):
    # Conv2d + BatchNorm2d + activation (SiLU by default); autopad computes "same" padding
    default_act = nn.SiLU()

    def __init__(self, inp, oup, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(inp, oup, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(oup)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
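Conv delegates padding to an autopad helper that is not shown above. A sketch of it, matching the "same"-padding behavior of the version in models/common.py:

```python
def autopad(k, p=None, d=1):
    """Return padding that keeps spatial size constant at stride 1.

    k: kernel size (int or list), p: explicit padding (overrides the
    default), d: dilation. With d > 1 the effective kernel grows to
    d*(k-1)+1 before the floor-divide.
    """
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p
```

For example, autopad(3) returns 1, so a 3x3 convolution with stride 1 preserves the feature-map resolution.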

C3 Module

class C3(nn.Module):
    # CSP bottleneck: two parallel 1x1 branches (one through n Bottlenecks), fused by a 1x1 conv
    def __init__(self, inp, oup, n=1, shortcut=True, g=1, e=0.5):
        super().__init__()
        hidden_chn = int(oup * e)
        self.cv1 = Conv(inp, hidden_chn, 1, 1)
        self.cv2 = Conv(inp, hidden_chn, 1, 1)
        self.cv3 = Conv(2 * hidden_chn, oup, 1)
        self.m = nn.Sequential(*(Bottleneck(hidden_chn, hidden_chn, shortcut, g, e=1.0) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))
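C3 depends on the Bottleneck block, which is not listed above. The sketch below matches the standard YOLOv5 Bottleneck (1x1 reduce, 3x3 expand, optional residual add); a compact Conv is repeated here only so the block runs standalone.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    # Compact restatement of the Conv block defined earlier (Conv2d + BN + SiLU)
    def __init__(self, inp, oup, k=1, s=1, g=1):
        super().__init__()
        self.conv = nn.Conv2d(inp, oup, k, s, k // 2, groups=g, bias=False)
        self.bn = nn.BatchNorm2d(oup)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Standard bottleneck: 1x1 conv, 3x3 conv, residual add when shapes allow."""
    def __init__(self, inp, oup, shortcut=True, g=1, e=0.5):
        super().__init__()
        hidden_chn = int(oup * e)
        self.cv1 = Conv(inp, hidden_chn, 1, 1)
        self.cv2 = Conv(hidden_chn, oup, 3, 1, g=g)
        # Residual connection only when requested and channel counts match
        self.add = shortcut and inp == oup

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
```

Inside C3 the Bottlenecks are built with e=1.0, so the hidden width equals the branch width and only the C3-level 1x1 convolutions change channel counts.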

Forward Propagation and Tensor Shapes

During inference, the input tensor passes through the backbone and neck. The _forward_once method handles routing layers (e.g., Concat, Upsample). Below is a simplified analysis of tensor shape transformations assuming an input of 256x256 and stride of 8 (P3), 16 (P4), and 32 (P5).

  • Layer 0 (Conv): Input [1, 3, 256, 256] → Output [1, 32, 128, 128].
  • Layer 1 (Conv): Input [1, 32, 128, 128] → Output [1, 64, 64, 64].
  • Layer 4 (C3): Input [1, 128, 32, 32] → Output [1, 128, 32, 32] (Saved for P3).
  • Layer 6 (C3): Input [1, 256, 16, 16] → Output [1, 256, 16, 16] (Saved for P4).
  • Layer 9 (SPPF): Input [1, 512, 8, 8] → Output [1, 512, 8, 8] (Saved for P5).
  • Head: Upsampling and concatenation merge features from P5/P4 and P4/P3 to produce multi-scale outputs.
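The grid sizes in the walkthrough follow directly from the strides; a quick sanity check (grid_sizes is a hypothetical helper, not part of YOLOv5):

```python
def grid_sizes(img_size, strides=(8, 16, 32)):
    """Feature-map resolution at each detection scale for a square input.

    Each stride-s head sees the input downsampled by a factor of s,
    so a 256x256 image yields 32x32 (P3), 16x16 (P4), and 8x8 (P5) grids.
    """
    return [img_size // s for s in strides]

print(grid_sizes(256))  # [32, 16, 8]
print(grid_sizes(640))  # [80, 40, 20]
```

This is also why the input must be divisible by the maximum stride: a non-divisible size would truncate at P5 and misalign the upsample/concat pairs in the head.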

Non-Maximum Suppression (NMS)

NMS filters redundant bounding boxes. The algorithm converts center-width-height coordinates to corner coordinates, applies confidence thresholds, and suppresses overlapping boxes based on the Intersection over Union (IoU). The non_max_suppression function implements this logic efficiently using PyTorch operations.

import torch
import torchvision

# Assumes xywh2xyxy (utils/general.py in YOLOv5) is in scope.
def filter_overlaps(inference_res, conf_cutoff=0.25, iou_cutoff=0.45, max_dets=300):
    """
    Removes overlapping detections via per-class non-maximum suppression.
    """
    bs = inference_res.shape[0]
    nc = inference_res.shape[2] - 5  # Number of classes
    xc = inference_res[..., 4] > conf_cutoff  # Candidates mask

    output = [torch.zeros((0, 6), device=inference_res.device)] * bs
    for xi, x in enumerate(inference_res):
        x = x[xc[xi]]  # Filter by confidence

        if not x.shape[0]:
            continue

        # Compute conf and convert boxes
        x[:, 5:] *= x[:, 4:5]  # Class conf = obj_conf * cls_conf
        box = xywh2xyxy(x[:, :4])  # Convert to xyxy format

        # Get best class only
        conf, j = x[:, 5:6].max(1, keepdim=True)
        x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_cutoff]

        # Sort by confidence and apply NMS
        n = x.shape[0]
        if not n:
            continue
        
        x = x[x[:, 4].argsort(descending=True)[:max_dets]]
        
        c = x[:, 5:6] * 7680  # Per-class box offset so NMS never suppresses across classes
        boxes, scores = x[:, :4] + c, x[:, 4]
        i = torchvision.ops.nms(boxes, scores, iou_cutoff)
        i = i[:max_dets]
        
        output[xi] = x[i]
    return output
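The box conversion relies on xywh2xyxy, which is not shown above. A sketch matching the converter in YOLOv5's utils/general.py:

```python
import torch

def xywh2xyxy(x):
    """Convert [cx, cy, w, h] boxes to [x1, y1, x2, y2] corner format."""
    y = x.clone()
    y[..., 0] = x[..., 0] - x[..., 2] / 2  # x1 = cx - w/2
    y[..., 1] = x[..., 1] - x[..., 3] / 2  # y1 = cy - h/2
    y[..., 2] = x[..., 0] + x[..., 2] / 2  # x2 = cx + w/2
    y[..., 3] = x[..., 1] + x[..., 3] / 2  # y2 = cy + h/2
    return y

# A 4x6 box centered at (10, 10) becomes corners (8, 7) to (12, 13)
print(xywh2xyxy(torch.tensor([[10.0, 10.0, 4.0, 6.0]])))
```
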

Loss Function Components

Training involves minimizing three specific loss components:

  1. Box Loss (CIoU): Measures the overlap and aspect ratio difference between predicted and ground truth boxes.
  2. Objectness Loss (BCE): Binary Cross Entropy determining the probability that an object exists within the anchor box.
  3. Classification Loss (BCE): Binary Cross Entropy for classifying the detected object into one of the nc categories.
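The three terms above are combined as a weighted sum, with gains taken from the hyperparameter file (box, obj, cls). The sketch below is a deliberate simplification: total_loss is a hypothetical function, and pred_iou stands in for a precomputed CIoU per matched box rather than a full CIoU implementation.

```python
import torch
import torch.nn as nn

def total_loss(pred_iou, pred_obj, target_obj, pred_cls, target_cls,
               box_w=0.05, obj_w=1.0, cls_w=0.5):
    """Weighted sum of the three YOLOv5 loss terms (simplified sketch).

    pred_iou:  CIoU of each matched box vs. its target, assumed precomputed
    pred_obj:  objectness logits; target_obj in [0, 1]
    pred_cls:  class logits; target_cls one-hot (or smoothed) targets
    """
    bce = nn.BCEWithLogitsLoss()
    l_box = (1.0 - pred_iou).mean()    # box loss: penalize low CIoU
    l_obj = bce(pred_obj, target_obj)  # objectness BCE
    l_cls = bce(pred_cls, target_cls)  # classification BCE
    return box_w * l_box + obj_w * l_obj + cls_w * l_cls
```

In the real implementation the objectness target is itself the CIoU of the matched box (not a hard 0/1 label), which couples localization quality to the confidence score.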

Inference Workflow and Stride Handling

During preprocessing, the input image is resized so that each dimension is divisible by the model's maximum stride (e.g., 32). The aspect ratio is preserved by computing a scale ratio r = min(target/h, target/w) and padding the shorter dimension evenly on both sides. If the input shape is [640, 427] (height, width) and the target size is 416, the image is scaled by r = 0.65 and the width is padded so that the final tensor dimensions remain compatible with the downsampling layers.
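The ratio-and-padding arithmetic can be reproduced in isolation. The sketch below mirrors what YOLOv5's letterbox function computes in its minimal-rectangle mode (letterbox_dims is an illustrative name, not the real API):

```python
def letterbox_dims(h0, w0, target=416, stride=32):
    """Resized (unpadded) size and per-side padding for letterbox resizing.

    Scales by the limiting dimension, then pads only enough to reach the
    next stride multiple (minimal-rectangle mode), keeping aspect ratio.
    """
    r = min(target / h0, target / w0)            # scale ratio
    new_h, new_w = round(h0 * r), round(w0 * r)  # resized, unpadded size
    dh = (target - new_h) % stride               # total vertical padding
    dw = (target - new_w) % stride               # total horizontal padding
    return (new_h, new_w), (dh / 2, dw / 2)      # padding split evenly per side

# 640x427 input, target 416: scaled by 0.65 to 416x278, then padded to 416x288
print(letterbox_dims(640, 427))
```

Here 278 is 10 short of a stride multiple, so 5 pixels of padding are added to each side of the width, giving a final 416x288 tensor in which every dimension divides cleanly by 32.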

The final detection output typically reshapes tensors from [batch, anchors, grid_y, grid_x, (5+classes)] into a flat list of predictions [batch, total_predictions, 85] (for 80 classes). This flattened tensor is then passed through the NMS function to generate the final bounding boxes.
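This flattening step can be sketched on its own. The shapes below assume 3 anchors per scale, 80 classes (so 85 values per prediction), and the 256-input grids from the walkthrough above:

```python
import torch

na, no = 3, 85                        # anchors per scale; 5 + 80 class scores
grids = [(32, 32), (16, 16), (8, 8)]  # P3/P4/P5 grids for a 256x256 input

# One [batch, anchors, grid_y, grid_x, 5+classes] tensor per detection scale
scale_outputs = [torch.zeros(1, na, gy, gx, no) for gy, gx in grids]

# Flatten each scale to [batch, anchors*gy*gx, no], then concatenate across scales
flat = torch.cat([t.view(t.shape[0], -1, no) for t in scale_outputs], dim=1)
print(flat.shape)  # torch.Size([1, 4032, 85]) = (1024 + 256 + 64) * 3 anchors
```

The 4032 rows are what the NMS function above receives as inference_res; each row is one candidate box before confidence filtering.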

Tags: YOLOv5 Computer Vision Deep Learning Object Detection pytorch

Posted on Mon, 11 May 2026 10:06:51 +0000 by smith.james0