Introduction to PyTorch Framework // Optimizing Convolution Operations with AVX // Essential GDB Debugging Techniques

PyTorch is a tensor library optimized for deep learning that runs on both GPUs and CPUs.

Chinese documentation: https://pytorch.org/resources

Gradient and Derivative Calculation

# gradient_calculation.py

import torch

input_val = torch.tensor(3.)
weight = torch.tensor(4., requires_grad=True)
bias = torch.tensor(5., requires_grad=True)

output = weight * input_val + bias
output.backward()

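# input_val was created without requires_grad, so no gradient is stored for it
print('d(output)/d(input_val):', input_val.grad)  # None
print('d(output)/d(weight):', weight.grad)        # tensor(3.), equal to input_val
print('d(output)/d(bias):', bias.grad)            # tensor(1.)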
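
As a sanity check on the values autograd reports, the same derivatives can be approximated with central differences in plain Python. This is a minimal sketch that does not use PyTorch; the helper names output_fn and central_difference are made up for illustration.

```python
def output_fn(input_val, weight, bias):
    """Same expression as in gradient_calculation.py, on plain floats."""
    return weight * input_val + bias


def central_difference(fn, args, index, step=1e-6):
    """Approximate the partial derivative of fn with respect to args[index]."""
    plus = list(args)
    minus = list(args)
    plus[index] += step
    minus[index] -= step
    return (fn(*plus) - fn(*minus)) / (2 * step)


args = (3.0, 4.0, 5.0)  # input_val, weight, bias
d_weight = central_difference(output_fn, args, 1)
d_bias = central_difference(output_fn, args, 2)

# d(output)/d(weight) should be close to input_val, d(output)/d(bias) close to 1
print(f"d(output)/d(weight) ~ {d_weight:.4f}")
print(f"d(output)/d(bias)   ~ {d_bias:.4f}")
```

The numerical estimates should agree with weight.grad and bias.grad up to floating-point error.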

Linear Regression Implementation

# linear_regression.py

import numpy as np
import torch

# Input data (temperature, rainfall, humidity)
input_data = np.array([[73, 67, 43], 
                       [91, 88, 64], 
                       [87, 134, 58], 
                       [102, 43, 37], 
                       [69, 96, 70]], dtype='float32')

# Target values (apples, oranges)
target_values = np.array([[56, 70], 
                         [81, 101], 
                         [119, 133], 
                         [22, 37], 
                         [103, 119]], dtype='float32')

input_data = torch.from_numpy(input_data)
target_values = torch.from_numpy(target_values)

# Initialize weights and biases with gradient tracking
weights = torch.randn(2, 3, requires_grad=True)
biases = torch.randn(2, requires_grad=True)

# Define the linear model
def linear_model(features):
    return features @ weights.t() + biases

# Calculate mean squared error
def mean_squared_error(predictions, targets):
    difference = predictions - targets
    return torch.sum(difference * difference) / difference.numel()

# Training loop
epochs = 1000
for epoch in range(epochs):
    predictions = linear_model(input_data)
    loss = mean_squared_error(predictions, target_values)
    print(f"Epoch {epoch+1} Loss: {loss.item()}")
    
    # Backpropagation
    loss.backward()
    
    # Update weights and biases
    with torch.no_grad():
        # Learning rate of 1e-5
        weights -= weights.grad * 1e-5
        biases -= biases.grad * 1e-5
        
        # Reset gradients
        weights.grad.zero_()
        biases.grad.zero_()
        
        print(f"Epoch {epoch+1} Weights & Biases: {weights}, {biases}")

print("Target values:\n", target_values, '\nPredictions:\n', predictions)
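
To make the update rule in the training loop more concrete, here is a rough sketch of the same gradient descent procedure without autograd, with the gradients of the mean squared error written out by hand. It assumes a toy one-feature dataset (y = 2x + 1) rather than the fruit data above, and the learning rate is chosen for that toy data only.

```python
# Hypothetical toy data following y = 2x + 1 (not the dataset used above)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def mse(w, b):
    """Mean squared error of the model w * x + b over the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Analytic gradients of the MSE with respect to w and b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


w, b = 0.0, 0.0
learning_rate = 0.01
initial_loss = mse(w, b)

for _ in range(1000):
    grad_w, grad_b = gradients(w, b)
    # Same update as "weights -= weights.grad * lr" in the PyTorch version
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(f"loss: {initial_loss:.3f} -> {final_loss:.5f}, w={w:.2f}, b={b:.2f}")
```

In the PyTorch version, loss.backward() computes these gradients automatically, which is why they must be reset with zero_() after each update.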

Linear Regression Using PyTorch Built-in Functions

# pytorch_linear_regression.py

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torch.nn.functional as F

# Load training data
# Input (temperature, rainfall, humidity)
input_data = np.array([[73, 67, 43], 
                       [91, 88, 64], 
                       [87, 134, 58], 
                       [102, 43, 37], 
                       [69, 96, 70], 
                       [74, 66, 43], 
                       [91, 87, 65], 
                       [88, 134, 59], 
                       [101, 44, 37], 
                       [68, 96, 71], 
                       [73, 66, 44], 
                       [92, 87, 64], 
                       [87, 135, 57], 
                       [103, 43, 36], 
                       [68, 97, 70]], 
                      dtype='float32')

# Targets (apples, oranges)
target_values = np.array([[56, 70], 
                          [81, 101], 
                          [119, 133], 
                          [22, 37], 
                          [103, 119],
                          [57, 69], 
                          [80, 102], 
                          [118, 132], 
                          [21, 38], 
                          [104, 118], 
                          [57, 69], 
                          [82, 100], 
                          [118, 134], 
                          [20, 38], 
                          [102, 120]], 
                     dtype='float32')

input_data = torch.from_numpy(input_data)
target_values = torch.from_numpy(target_values)

# Create dataset and dataloader
train_dataset = TensorDataset(input_data, target_values)
batch_size = 5
train_dataloader = DataLoader(train_dataset, batch_size, shuffle=True)

# Initialize linear model
model = nn.Linear(3, 2)
# Define loss function
loss_function = F.mse_loss
# Calculate and report the initial loss before training
initial_loss = loss_function(model(input_data), target_values)
print(f"Initial loss: {initial_loss.item():.4f}")

# Define optimizer (stochastic gradient descent)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

# Training function
def train_model(num_epochs, model, loss_fn, optimizer):
    for epoch in range(num_epochs):
        for inputs, targets in train_dataloader:
            # Forward pass
            predictions = model(inputs)
            # Calculate loss
            loss = loss_fn(predictions, targets)
            # Backward pass
            loss.backward()
            # Update weights
            optimizer.step()
            # Reset gradients
            optimizer.zero_grad()
        
        # Print progress every 10 epochs (loss of the last mini-batch in the epoch)
        if (epoch + 1) % 10 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

# Train the model
train_model(100, model, loss_function, optimizer)
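
The mini-batching that DataLoader performs with batch_size=5 and shuffle=True can be pictured with a small stand-alone sketch. This is not how DataLoader is implemented; iterate_minibatches is a hypothetical helper that only illustrates the idea of shuffling indices and slicing them into batches.

```python
import random


def iterate_minibatches(samples, batch_size, shuffle=True, seed=None):
    """Yield lists of samples, batch_size at a time, optionally shuffled."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]


samples = list(range(15))  # stand-in for the 15 training rows
batches = list(iterate_minibatches(samples, batch_size=5, seed=0))

# 15 samples with a batch size of 5 give three batches per epoch
print(len(batches), [len(batch) for batch in batches])
```

Each epoch in train_model therefore performs three optimizer steps, one per batch.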

Optimizing Convolution Operations with AVX

Checking Supported SIMD Instruction Sets

Windows:

  • Task Manager:
    1. Right-click on the Start button and select "Task Manager"
    2. Switch to the "Performance" tab
    3. Select "CPU" and check whether an "Instruction Sets" or "Features" entry lists AVX, AVX2, or similar (not every Windows version shows this entry)
  • System Information:
    1. Press the Windows key and type "System Information", then select the application
    2. In "System Summary", look for the "Instruction Sets" section to see if it includes AVX or other SIMD instruction sets
  • CPU-Z:
    1. Download and install CPU-Z
    2. Run CPU-Z and switch to the "CPU" or "Instructions" tab to see supported instruction sets
  • Command Line:
    1. Open Command Prompt and enter: wmic CPU get caption, deviceID, name, numberOfCores, numberOfLogicalProcessors, architecture, family, manufacturer, status
       This lists CPU information but may not directly show AVX or NEON support

Linux:

  • /proc/cpuinfo:
    1. Open a terminal
    2. Enter cat /proc/cpuinfo or grep flags /proc/cpuinfo
    3. Look for "avx", "avx2", and other keywords in the flags
  • lscpu:
    1. Enter lscpu in terminal
    2. Look in the "Flags" or "Features" section
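
The Linux check can also be scripted. The sketch below assumes the x86 layout of /proc/cpuinfo, where each processor entry has a "flags" line; on ARM systems the line is typically named "Features" instead, so the key would need to change. parse_cpu_flags is a hypothetical helper name.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


cpuinfo_path = Path("/proc/cpuinfo")
if cpuinfo_path.exists():
    flags = parse_cpu_flags(cpuinfo_path.read_text())
    for name in ("sse4_2", "avx", "avx2", "avx512f"):
        print(f"{name}: {'yes' if name in flags else 'no'}")
else:
    print("/proc/cpuinfo not found (not a Linux system?)")
```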

macOS:

  • About This Mac:
    1. Click the Apple menu and select "About This Mac"
    2. Click "System Report"
    3. In the Hardware section, check "Processor" information (though it may not directly show AVX or NEON support)
  • Command Line:
    1. Open Terminal
    2. Enter sysctl -a | grep machdep.cpu.features
    3. On Intel Macs this lists the supported CPU features, including AVX (AVX2 is reported separately under machdep.cpu.leaf7_features)

Original Convolution Code

// convolution_basic.cpp

#include <cstdint>  // int64_t

// Using __restrict__ keyword to declare pointer parameters
// Memory regions accessed through different __restrict__ pointers won't overlap
bool perform_convolution(double *__restrict__ output, const double * __restrict__ input, 
                        const double*  __restrict__ kernel, int64_t data_length)
{
  constexpr int64_t kernel_size = 5;
  constexpr int64_t half_kernel = kernel_size / 2;
 
  if (data_length < kernel_size){
    return false;
  }

  int64_t i = half_kernel;

  // Each pass writes output[i..i+3] and reads up to input[i + 5],
  // so it only runs while i + 5 is a valid index, then advances by four.
  for (; i + 5 < data_length; i += 4)
  {
    double temp0[4] = {0.0}, temp1[4] = {0.0},
           temp2[4] = {0.0}, temp3[4] = {0.0}, temp4[4] = {0.0};

    temp0[0] = kernel[half_kernel - 2] * input[i + 2];
    temp0[1] = kernel[half_kernel - 2] * input[i + 3];
    temp0[2] = kernel[half_kernel - 2] * input[i + 4];
    temp0[3] = kernel[half_kernel - 2] * input[i + 5];

    temp1[0] = kernel[half_kernel - 1] * input[i + 1];
    temp1[1] = kernel[half_kernel - 1] * input[i + 2];
    temp1[2] = kernel[half_kernel - 1] * input[i + 3];
    temp1[3] = kernel[half_kernel - 1] * input[i + 4];

    temp2[0] = kernel[half_kernel - 0] * input[i + 0];
    temp2[1] = kernel[half_kernel - 0] * input[i + 1];
    temp2[2] = kernel[half_kernel - 0] * input[i + 2];
    temp2[3] = kernel[half_kernel - 0] * input[i + 3];

    temp3[0] = kernel[half_kernel - (-1)] * input[i + (-1)];
    temp3[1] = kernel[half_kernel - (-1)] * input[i + 0];
    temp3[2] = kernel[half_kernel - (-1)] * input[i + 1];
    temp3[3] = kernel[half_kernel - (-1)] * input[i + 2];

    temp4[0] = kernel[half_kernel - (-2)] * input[i + (-2)];
    temp4[1] = kernel[half_kernel - (-2)] * input[i + (-1)];
    temp4[2] = kernel[half_kernel - (-2)] * input[i + 0];
    temp4[3] = kernel[half_kernel - (-2)] * input[i + 1];

    output[i] = temp0[0] + temp1[0] + temp2[0] + temp3[0] + temp4[0];
    output[i + 1] = temp0[1] + temp1[1] + temp2[1] + temp3[1] + temp4[1];
    output[i + 2] = temp0[2] + temp1[2] + temp2[2] + temp3[2] + temp4[2];
    output[i + 3] = temp0[3] + temp1[3] + temp2[3] + temp3[3] + temp4[3];
  }

  // Scalar tail: interior elements left over after the last block of four
  for (; i < data_length - half_kernel; i++)
  {
    double sum = 0.0;
    for (int64_t k = -half_kernel; k <= half_kernel; k++)
    {
      sum += kernel[half_kernel - k] * input[i + k];
    }
    output[i] = sum;
  }
  return true;
}

The perform_convolution function processes input data using a convolution kernel. Parameters include:

  • output: Stores the convolution results
  • input: The signal or data to be processed
  • kernel: The convolution kernel array (constant size 5)
  • data_length: Length of the input array

The function first checks that the input is at least as long as the kernel. It then processes the input in blocks of four, computing four output elements per pass to mimic what a 4-wide SIMD register does.
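
A plain reference implementation is handy for checking the unrolled version. The sketch below uses the same indexing as the C++ code (kernel[half - k] multiplied by input[i + k]) and, as an assumption about the intended behavior, only fills the interior elements where the whole kernel fits. convolve_interior is a hypothetical name.

```python
def convolve_interior(signal, kernel):
    """Convolve signal with an odd-length kernel, interior elements only."""
    half = len(kernel) // 2
    if len(signal) < len(kernel):
        return None
    output = [0.0] * len(signal)
    for i in range(half, len(signal) - half):
        output[i] = sum(
            kernel[half - k] * signal[i + k] for k in range(-half, half + 1)
        )
    return output


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.1, 0.2, 0.4, 0.2, 0.1]  # symmetric, sums to 1

result = convolve_interior(signal, kernel)
print(result)
```

Because this kernel is symmetric and sums to 1, a linear ramp passes through unchanged in the interior (up to floating-point rounding), while the two elements at each edge stay at 0.0.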

AVX-Optimized Implementation

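// convolution_avx.cpp
// Build with AVX enabled, e.g. -mavx on GCC/Clang

#include <cstdint>      // int64_t
#include <immintrin.h>  // AVX intrinsics (__m256d, _mm256_*_pd)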

bool optimized_convolution(double* __restrict__ output, const double* __restrict__ input, 
                          const double* __restrict__ kernel, int64_t data_length) {
    constexpr int64_t kernel_size = 5;
    constexpr int64_t half_kernel = kernel_size / 2;

    // Check if input is sufficient for convolution
    if (data_length < kernel_size) {
        return false;
    }

    // Broadcast each kernel value across all four lanes of an AVX register
    __m256d k0 = _mm256_set1_pd(kernel[0]);
    __m256d k1 = _mm256_set1_pd(kernel[1]);
    __m256d k2 = _mm256_set1_pd(kernel[2]);
    __m256d k3 = _mm256_set1_pd(kernel[3]);
    __m256d k4 = _mm256_set1_pd(kernel[4]);

    // Process four outputs per iteration. The load at input[i + 2] covers
    // input[i + 2..i + 5], so the loop only runs while i + 5 is a valid index.
    int64_t i = half_kernel;
    for (; i + 5 < data_length; i += 4) {
        // Load shifted input windows into AVX registers
        __m256d x0 = _mm256_loadu_pd(&input[i + 2]);
        __m256d x1 = _mm256_loadu_pd(&input[i + 1]);
        __m256d x2 = _mm256_loadu_pd(&input[i]);
        __m256d x3 = _mm256_loadu_pd(&input[i - 1]);
        __m256d x4 = _mm256_loadu_pd(&input[i - 2]);

        // Perform element-wise multiplication and accumulate results
        __m256d result = _mm256_add_pd(
            _mm256_add_pd(
                _mm256_mul_pd(x0, k0),
                _mm256_mul_pd(x1, k1)),
            _mm256_add_pd(
                _mm256_mul_pd(x2, k2),
                _mm256_add_pd(
                    _mm256_mul_pd(x3, k3),
                    _mm256_mul_pd(x4, k4))));

        // Store results back to output array
        _mm256_storeu_pd(&output[i], result);
    }

    // Process the remaining interior elements one at a time
    for (; i < data_length - half_kernel; i++) {
        double result_value = 0.0;
        for (int64_t k = -half_kernel; k <= half_kernel; k++) {
            result_value += input[i - k] * kernel[k + half_kernel];
        }
        output[i] = result_value;
    }

    return true;
}

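During testing, a segmentation fault occurred when using _mm256_load_pd, which requires its address to be 32-byte aligned. Because the loads start at input[i + 2], input[i + 1], input[i], and so on, they cannot all be aligned. The fix was to use the unaligned variants: _mm256_loadu_pd for loads and _mm256_storeu_pd for stores.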
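
The alignment problem comes down to simple address arithmetic. The sketch below assumes a hypothetical base address that is itself 32-byte aligned and checks which of the five shifted load addresses would satisfy the aligned-load requirement.

```python
ALIGNMENT = 32    # bytes required by the aligned 256-bit load
DOUBLE_SIZE = 8   # bytes per double


def is_aligned(base_address, index):
    """True if the address of element `index` is a multiple of ALIGNMENT."""
    return (base_address + index * DOUBLE_SIZE) % ALIGNMENT == 0


base_address = 0x1000  # hypothetical, 32-byte aligned start of the input array
i = 4                  # a loop index whose own address happens to be aligned
offsets = [2, 1, 0, -1, -2]

aligned = {offset: is_aligned(base_address, i + offset) for offset in offsets}
print(aligned)
# → {2: False, 1: False, 0: True, -1: False, -2: False}
```

Consecutive doubles are 8 bytes apart, so at most one of the five shifted windows can start on a 32-byte boundary, and the other loads need the unaligned intrinsic.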

Interestingly, because the convolution kernel used here is symmetric, flipping it has no effect, so the results are correct regardless of the order in which the kernel values are applied.
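
The effect of kernel symmetry can be checked directly. This small sketch applies a kernel both flipped (convolution) and unflipped (cross-correlation) and compares the results; the signal and kernel values are made up for illustration, and apply_kernel is a hypothetical helper.

```python
def apply_kernel(signal, kernel, flip):
    """Slide the kernel over the interior of signal, optionally flipped."""
    half = len(kernel) // 2
    taps = kernel[::-1] if flip else kernel
    return [
        sum(taps[j] * signal[i - half + j] for j in range(len(taps)))
        for i in range(half, len(signal) - half)
    ]


signal = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0]
symmetric = [1.0, 2.0, 3.0, 2.0, 1.0]
asymmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

same_for_symmetric = (
    apply_kernel(signal, symmetric, True) == apply_kernel(signal, symmetric, False)
)
same_for_asymmetric = (
    apply_kernel(signal, asymmetric, True) == apply_kernel(signal, asymmetric, False)
)
print(same_for_symmetric, same_for_asymmetric)
# → True False
```

With an asymmetric kernel the order matters, so the index arithmetic would need to be checked carefully before reusing the code with other kernels.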

Essential GDB Debugging Techniques

Starting GDB

  • Launch GDB with a compiled program: gdb [executable_program]

Setting Breakpoints

  • Set breakpoint at function entry: break [function_name]
  • Set breakpoint at specific line in a file: break [filename]:[line_number]

Running the Program

  • Execute the program: run [arguments] where [arguments] are any parameters passed to the program

Inspecting Program State

  • Display all register values: info registers
  • Show local variables in current stack frame: info locals
  • Print expression value: print [expression] where [expression] can be a variable name or valid C expression
  • Display call stack: backtrace or bt

Stepping Through Code

  • Execute next line (step over): next or n
  • Execute next line (step into): step or s

Continuing Execution

  • Resume program execution: continue or c

Exiting GDB

  • Quit debugger: quit or q

Viewing and Modifying Variables

  • Set variable value: set var [variable]=[value]
  • Display variable value: print [variable]

Examining Source Code

  • Show source code around specific line: list [line_number]
  • Show source code for a function: list [function]

Conditional Breakpoints

  • Set breakpoint with condition: break [location] if [condition]
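
The commands above can also be run non-interactively, which is convenient for repeating the same debugging steps. The sketch below builds a gdb invocation using the --batch, -ex, and --args options; the executable name ./convolution_test is hypothetical, and build_gdb_command is a made-up helper.

```python
import os
import shutil
import subprocess


def build_gdb_command(executable, gdb_commands, program_args=()):
    """Build a non-interactive gdb invocation from a list of gdb commands."""
    command = ["gdb", "--batch"]
    for gdb_command in gdb_commands:
        command += ["-ex", gdb_command]
    # Everything after --args is treated as the program and its arguments
    command += ["--args", executable, *program_args]
    return command


command = build_gdb_command(
    "./convolution_test",  # hypothetical executable, built with -g
    ["break optimized_convolution", "run", "backtrace", "info locals"],
)
print(" ".join(command))

# Only launch gdb when both gdb and the (hypothetical) executable are present
if shutil.which("gdb") and os.path.isfile("./convolution_test"):
    subprocess.run(command, check=False)
```

In batch mode gdb executes the listed commands in order and then exits, so the breakpoint, run, and inspection steps behave as they would when typed at the prompt.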

Tags: pytorch Deep Learning AVX SIMD convolution

Posted on Sat, 04 Jul 2026 17:50:31 +0000 by zoozoo