Model Parameter Profiling
Utility functions in torch_utils facilitate the analysis of model complexity: layer counts, parameter counts, and computational cost (FLOPs). The following snippet demonstrates how to aggregate parameter statistics and estimate floating-point operations by profiling a dummy input tensor sized to the model's stride.
import torch
import thop
from copy import deepcopy
def analyze_network(net, verbose_flag=False, input_dim=640):
    """Output layer details including gradients and estimated GFLOPs."""
    total_params = sum(p.numel() for p in net.parameters())
    trainable_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
    if verbose_flag:
        print(f"{'Index':>5} {'Name':>40} {'Grad':>8} {'Params':>12} {'Shape':>20} {'Mean':>10} {'Std':>10}")
        for idx, (layer_name, param) in enumerate(net.named_parameters()):
            layer_name = layer_name.replace("module_list.", "")
            print(
                "%5g %40s %8s %12g %20s %10.3g %10.3g"
                % (idx, layer_name, param.requires_grad, param.numel(), list(param.shape), param.mean(), param.std())
            )
    try:
        dummy_p = next(net.parameters())
        stride_val = max(int(net.stride.max()), 32) if hasattr(net, "stride") else 32
        dummy_input = torch.empty((1, dummy_p.shape[1], stride_val, stride_val), device=dummy_p.device)
        ops = thop.profile(deepcopy(net), inputs=(dummy_input,), verbose=False)[0] / 1e9 * 2  # MACs -> GFLOPs
        dims = [input_dim, input_dim] if isinstance(input_dim, int) else input_dim
        flops_info = f", {dims} {ops * dims[0] / stride_val * dims[1] / stride_val:.1f} GFLOPs"
    except Exception:
        flops_info = ""  # thop unavailable or profiling failed
    print(f"Summary: {len(list(net.modules()))} layers, {total_params} params, {trainable_params} grads{flops_info}")
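The scaling in flops_info follows from the fact that convolutional FLOPs grow with spatial area: a measurement taken on a stride-by-stride probe input is multiplied by (H / stride) * (W / stride) to estimate the cost at the real input size. A quick arithmetic check (the 0.42 GFLOPs figure is a hypothetical probe measurement, not a real model's):

```python
stride = 32
gflops_at_probe = 0.42  # hypothetical thop result on a 1x3x32x32 input, in GFLOPs
h = w = 640             # target inference resolution

# FLOPs scale with spatial area, so multiply by (h / stride) * (w / stride)
total_gflops = gflops_at_probe * (h / stride) * (w / stride)
print(f"{total_gflops:.1f} GFLOPs")  # 168.0 GFLOPs
```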
Hyperparameter Configuration
Training behavior is governed by a YAML configuration file defining learning rates, augmentation probabilities, and loss weights. The hyp.scratch-low.yaml file illustrates the default settings for low-augmentation training on COCO. Key parameters include the initial learning rate (lr0), momentum, weight decay, and specific augmentation probabilities like mosaic and mixup.
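A representative excerpt of hyp.scratch-low.yaml (these values match the YOLOv5 defaults at the time of writing; consult the file in your checkout for the authoritative numbers):

```yaml
lr0: 0.01            # initial learning rate (SGD)
lrf: 0.01            # final OneCycleLR learning rate fraction (lr0 * lrf)
momentum: 0.937      # SGD momentum / Adam beta1
weight_decay: 0.0005 # optimizer weight decay
box: 0.05            # box loss gain
cls: 0.5             # classification loss gain
obj: 1.0             # objectness loss gain
mosaic: 1.0          # mosaic augmentation probability
mixup: 0.0           # mixup augmentation probability
```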
Model Parsing and Architecture
The Model class in yolo.py parses a dictionary derived from a YAML file to construct the neural network. The parse_model function iterates through the backbone and head definitions, instantiating modules like Conv, C3, and SPPF. It scales channel counts by width_multiple and layer repetition counts by depth_multiple.
def construct_architecture(arch_dict, in_channels):
    """Build the YOLOv5 model from a dictionary definition.

    Assumes the module classes (Conv, C3, SPPF, Concat, Detect), torch.nn (as nn),
    and the make_divisible helper are available in the enclosing scope.
    """
    print(f"\n{'From':>18}{'N':>3}{'Params':>10} {'Module':<40}{'Args':<30}")
    anchors, num_classes, depth_gain, width_gain = (
        arch_dict["anchors"],
        arch_dict["nc"],
        arch_dict["depth_multiple"],
        arch_dict["width_multiple"],
    )
    num_anchors = (len(anchors[0]) // 2) if isinstance(anchors, list) else anchors
    num_outputs = num_anchors * (num_classes + 5)  # per-cell outputs: box(4) + obj(1) + classes
    output_layers, save_indices, c2 = [], [], in_channels[-1]
    for i, (f, n, m, args) in enumerate(arch_dict["backbone"] + arch_dict["head"]):
        m = eval(m) if isinstance(m, str) else m  # resolve module class from its YAML name
        for j, arg in enumerate(args):
            try:
                args[j] = eval(arg) if isinstance(arg, str) else arg  # evaluate string arguments
            except NameError:
                pass
        n = max(round(n * depth_gain), 1) if n > 1 else n  # apply depth gain
        if m in {Conv, C3, SPPF}:
            c1, c2 = in_channels[f], args[0]
            if c2 != num_outputs:
                c2 = make_divisible(c2 * width_gain, 8)  # apply width gain
            args = [c1, c2, *args[1:]]
            if m is C3:
                args.insert(2, n)  # number of Bottleneck repeats
                n = 1
        elif m is Concat:
            c2 = sum(in_channels[x] for x in f)
        elif m is Detect:
            args.append([in_channels[x] for x in f])
        else:
            c2 = in_channels[f]
        module_instance = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)
        t = str(m)[8:-2].replace("__main__.", "")  # module type, e.g. 'models.common.Conv'
        n_params = sum(x.numel() for x in module_instance.parameters())
        module_instance.i, module_instance.f, module_instance.type, module_instance.np = i, f, t, n_params
        print(f"{i:>3}{str(f):>18}{n:>3}{n_params:10.0f} {t:<40}{str(args):<30}")
        save_indices.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1)
        output_layers.append(module_instance)
        if i == 0:
            in_channels = []
        in_channels.append(c2)
    return nn.Sequential(*output_layers), sorted(save_indices)
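The listing above depends on make_divisible, which rounds a scaled channel count up to a hardware-friendly multiple. A minimal sketch, shown together with the width/depth arithmetic for the YOLOv5s multipliers (0.50 and 0.33):

```python
import math

def make_divisible(x, divisor=8):
    # Round up to the nearest multiple of divisor so channel counts stay divisible by 8
    return math.ceil(x / divisor) * divisor

width_multiple, depth_multiple = 0.50, 0.33  # YOLOv5s scaling factors

# A layer declared with 256 output channels in the YAML...
print(make_divisible(256 * width_multiple, 8))  # 128
# ...and a C3 block declared with 9 repeats:
print(max(round(9 * depth_multiple), 1))        # 3
```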
Core Building Blocks
The fundamental components of YOLOv5 include the Conv block (Conv2d + BatchNorm + SiLU), the Bottleneck, and the C3 module (Cross Stage Partial).
Standard Convolution
class Conv(nn.Module):
    # Conv2d + BatchNorm2d + activation (SiLU by default)
    default_act = nn.SiLU()

    def __init__(self, inp, oup, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(inp, oup, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(oup)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
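Conv calls autopad, which is defined elsewhere in models/common.py. A sketch of its behavior: when no explicit padding is given, it returns k // 2 (adjusted for dilation) so that stride-1 convolutions preserve spatial size.

```python
def autopad(k, p=None, d=1):
    # Compute 'same'-style padding for kernel size k and dilation d
    if d > 1:
        # Effective kernel size grows with dilation
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p

print(autopad(3))       # 1
print(autopad(5))       # 2
print(autopad(3, d=2))  # 2 (effective kernel size 5)
```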
C3 Module
class C3(nn.Module):
    # CSP bottleneck with three convolutions
    def __init__(self, inp, oup, n=1, shortcut=True, g=1, e=0.5):
        super().__init__()
        hidden_chn = int(oup * e)
        self.cv1 = Conv(inp, hidden_chn, 1, 1)
        self.cv2 = Conv(inp, hidden_chn, 1, 1)
        self.cv3 = Conv(2 * hidden_chn, oup, 1)
        self.m = nn.Sequential(*(Bottleneck(hidden_chn, hidden_chn, shortcut, g, e=1.0) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))
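C3 stacks Bottleneck units, which the listing above references but does not define. A self-contained sketch (using a plain Conv2d + BatchNorm + SiLU stand-in for the Conv block so it runs on its own):

```python
import torch
import torch.nn as nn

def conv_bn_silu(inp, oup, k, s=1, p=0, g=1):
    # Minimal stand-in for the Conv block defined above
    return nn.Sequential(
        nn.Conv2d(inp, oup, k, s, p, groups=g, bias=False),
        nn.BatchNorm2d(oup),
        nn.SiLU(),
    )

class Bottleneck(nn.Module):
    # 1x1 reduce -> 3x3 conv, with a residual add when input/output shapes match
    def __init__(self, inp, oup, shortcut=True, g=1, e=0.5):
        super().__init__()
        hidden_chn = int(oup * e)
        self.cv1 = conv_bn_silu(inp, hidden_chn, 1)
        self.cv2 = conv_bn_silu(hidden_chn, oup, 3, p=1, g=g)
        self.add = shortcut and inp == oup

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

x = torch.randn(1, 64, 32, 32)
print(Bottleneck(64, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```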
Forward Propagation and Tensor Shapes
During inference, the input tensor passes through the backbone and neck. The _forward_once method handles routing layers (e.g., Concat, Upsample). Below is a simplified analysis of tensor shape transformations assuming an input of 256x256 and stride of 8 (P3), 16 (P4), and 32 (P5).
- Layer 0 (Conv): Input [1, 3, 256, 256] → Output [1, 32, 128, 128].
- Layer 1 (Conv): Input [1, 32, 128, 128] → Output [1, 64, 64, 64].
- Layer 4 (C3): Input [1, 128, 32, 32] → Output [1, 128, 32, 32] (Saved for P3).
- Layer 6 (C3): Input [1, 256, 16, 16] → Output [1, 256, 16, 16] (Saved for P4).
- Layer 9 (SPPF): Input [1, 512, 8, 8] → Output [1, 512, 8, 8] (Saved for P5).
- Head: Upsampling and concatenation merge features from P5/P4 and P4/P3 to produce multi-scale outputs.
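The grid sizes above follow directly from the strides: each detection head predicts 3 anchors per cell, so a 256x256 input produces (256/8)^2 + (256/16)^2 + (256/32)^2 cells across P3 through P5. A quick check of the prediction counts:

```python
img = 256
anchors_per_cell = 3
total = 0
for level, stride in (("P3", 8), ("P4", 16), ("P5", 32)):
    g = img // stride                  # grid cells per side at this stride
    preds = g * g * anchors_per_cell   # predictions emitted at this scale
    total += preds
    print(f"{level}: {g}x{g} grid -> {preds} predictions")
print(f"total: {total}")  # total: 4032
```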
Non-Maximum Suppression (NMS)
NMS filters redundant bounding boxes. The algorithm converts center-width-height coordinates to corner coordinates, applies confidence thresholds, and suppresses overlapping boxes based on the Intersection over Union (IoU). The non_max_suppression function implements this logic efficiently using PyTorch operations.
import torch
import torchvision

def filter_overlaps(inference_res, conf_cutoff=0.25, iou_cutoff=0.45, max_dets=300):
    """Remove overlapping detections; returns one [n, 6] tensor (xyxy, conf, cls) per image."""
    bs = inference_res.shape[0]
    nc = inference_res.shape[2] - 5  # number of classes
    xc = inference_res[..., 4] > conf_cutoff  # candidates mask (objectness)
    output = [torch.zeros((0, 6), device=inference_res.device)] * bs
    for xi, x in enumerate(inference_res):
        x = x[xc[xi]]  # filter by objectness
        if not x.shape[0]:
            continue
        # Compute conf and convert boxes
        x[:, 5:] *= x[:, 4:5]  # class conf = obj_conf * cls_conf
        box = xywh2xyxy(x[:, :4])  # convert to xyxy format
        # Keep only the best class per box
        conf, j = x[:, 5:].max(1, keepdim=True)
        x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_cutoff]
        n = x.shape[0]
        if not n:
            continue
        # Sort by confidence, then offset boxes by class index so NMS runs per class
        x = x[x[:, 4].argsort(descending=True)[:max_dets]]
        c = x[:, 5:6] * 7680  # class offsets (use 0 instead for class-agnostic NMS)
        boxes, scores = x[:, :4] + c, x[:, 4]
        i = torchvision.ops.nms(boxes, scores, iou_cutoff)
        i = i[:max_dets]
        output[xi] = x[i]
    return output
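filter_overlaps relies on xywh2xyxy, defined elsewhere in the repository. A sketch of the conversion from center/size to corner coordinates:

```python
import torch

def xywh2xyxy(x):
    # (cx, cy, w, h) -> (x1, y1, x2, y2)
    y = x.clone()
    y[..., 0] = x[..., 0] - x[..., 2] / 2  # left
    y[..., 1] = x[..., 1] - x[..., 3] / 2  # top
    y[..., 2] = x[..., 0] + x[..., 2] / 2  # right
    y[..., 3] = x[..., 1] + x[..., 3] / 2  # bottom
    return y

print(xywh2xyxy(torch.tensor([[50.0, 50.0, 20.0, 10.0]])))
# tensor([[40., 45., 60., 55.]])
```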
Loss Function Components
Training involves minimizing three specific loss components:
- Box Loss (CIoU): Measures the overlap and aspect ratio difference between predicted and ground truth boxes.
- Objectness Loss (BCE): Binary Cross Entropy determining the probability that an object exists within the anchor box.
- Classification Loss (BCE): Binary Cross Entropy for classifying the detected object into one of the nc categories.
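The three components are combined as a weighted sum using the gains from the hyperparameter file (box: 0.05, obj: 1.0, cls: 0.5 in hyp.scratch-low.yaml). The raw per-batch loss values below are hypothetical, purely to illustrate the weighting:

```python
# Gains from hyp.scratch-low.yaml
box_gain, obj_gain, cls_gain = 0.05, 1.0, 0.5

# Hypothetical raw per-batch loss values
box_loss, obj_loss, cls_loss = 0.08, 0.12, 0.03

# Weighted sum: objectness dominates, box regression is down-weighted
total = box_gain * box_loss + obj_gain * obj_loss + cls_gain * cls_loss
print(f"{total:.3f}")  # 0.139
```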
Inference Workflow and Stride Handling
During preprocessing, the input image is resized to a dimension divisible by the model's stride (e.g., 32). The aspect ratio is preserved by calculating the ratio r and padding the image evenly. If the input shape is [640, 427] and the target size is 416, the resize logic calculates padding such that the final tensor dimensions are compatible with the downsampling layers.
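Working that resize example through (a sketch of the letterbox arithmetic under the minimal-rectangle padding mode; the real implementation lives in utils/augmentations.py):

```python
img_h, img_w = 640, 427   # source shape from the example above
target, stride = 416, 32

r = target / max(img_h, img_w)                     # scale ratio, 0.65
new_h, new_w = round(img_h * r), round(img_w * r)  # 416, 278

# Pad each dimension up to the next multiple of the stride
pad_h = (target - new_h) % stride                  # 0
pad_w = (target - new_w) % stride                  # 10

print(new_h + pad_h, new_w + pad_w)  # 416 288
```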
The final detection output typically reshapes tensors from [batch, anchors, grid_y, grid_x, (5+classes)] into a flat list of predictions [batch, total_predictions, 85] (for 80 classes). This flattened tensor is then passed through the NMS function to generate the final bounding boxes.
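The flattening itself is a single view over the anchor and grid dimensions; with illustrative numbers for one 8x8 head (3 anchors, 80 classes):

```python
import torch

bs, na, ny, nx, no = 1, 3, 8, 8, 85  # batch, anchors, grid_y, grid_x, 5 + classes
x = torch.randn(bs, na, ny, nx, no)

flat = x.view(bs, na * ny * nx, no)  # collapse anchors and grid into one prediction axis
print(flat.shape)  # torch.Size([1, 192, 85])
```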