PyTorch is a tensor library optimized for deep learning that runs on both GPUs and CPUs.
Chinese documentation: https://pytorch.org/resources
Gradient and Derivative Calculation
# gradient_calculation.py
import torch

input_val = torch.tensor(3.)                   # no requires_grad, so no gradient is tracked
weight = torch.tensor(4., requires_grad=True)
bias = torch.tensor(5., requires_grad=True)

output = weight * input_val + bias
output.backward()                              # fills .grad on tensors that require gradients

print('d(output)/d(input_val):', input_val.grad)  # None
print('d(output)/d(weight):', weight.grad)        # tensor(3.)
print('d(output)/d(bias):', bias.grad)            # tensor(1.)
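As a sanity check on the values autograd reports, the same derivatives can be approximated with central finite differences. The sketch below is plain Python and is not part of the original script; the helper name `central_difference` and the step size `h` are illustrative assumptions.

```python
# finite_difference_check.py
# Hypothetical helper (not from the original script): numerically approximate
# the derivatives of output = weight * input_val + bias.

def output_fn(input_val, weight, bias):
    return weight * input_val + bias

def central_difference(fn, args, index, h=1e-4):
    """Approximate d(fn)/d(args[index]) with a central difference."""
    upper = list(args)
    lower = list(args)
    upper[index] += h
    lower[index] -= h
    return (fn(*upper) - fn(*lower)) / (2 * h)

point = (3.0, 4.0, 5.0)  # input_val, weight, bias, as in the script above
d_input = central_difference(output_fn, point, 0)
d_weight = central_difference(output_fn, point, 1)
d_bias = central_difference(output_fn, point, 2)

print(f"d(output)/d(input_val) ~ {d_input:.4f}")
print(f"d(output)/d(weight)    ~ {d_weight:.4f}")
print(f"d(output)/d(bias)      ~ {d_bias:.4f}")
```

Analytically, d(output)/d(weight) equals input_val (3) and d(output)/d(bias) equals 1, which is what weight.grad and bias.grad hold after backward(). The derivative with respect to input_val is 4 mathematically, but autograd only records it for tensors created with requires_grad=True.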
Linear Regression Implementation
# linear_regression.py
import numpy as np
import torch
# Input data (temperature, rainfall, humidity)
input_data = np.array([[73, 67, 43],
[91, 88, 64],
[87, 134, 58],
[102, 43, 37],
[69, 96, 70]], dtype='float32')
# Target values (apples, oranges)
target_values = np.array([[56, 70],
[81, 101],
[119, 133],
[22, 37],
[103, 119]], dtype='float32')
input_data = torch.from_numpy(input_data)
target_values = torch.from_numpy(target_values)
# Initialize weights and biases with gradient tracking
weights = torch.randn(2, 3, requires_grad=True)
biases = torch.randn(2, requires_grad=True)
# Define the linear model
def linear_model(features):
    return features @ weights.t() + biases
# Calculate mean squared error
def mean_squared_error(predictions, targets):
    difference = predictions - targets
    return torch.sum(difference * difference) / difference.numel()
# Training loop
epochs = 1000
for epoch in range(epochs):
    predictions = linear_model(input_data)
    loss = mean_squared_error(predictions, target_values)
    print(f"Epoch {epoch+1} Loss: {loss.item()}")
    # Backpropagation
    loss.backward()
    # Update weights and biases
    with torch.no_grad():
        # Learning rate of 1e-5
        weights -= weights.grad * 1e-5
        biases -= biases.grad * 1e-5
        # Reset gradients
        weights.grad.zero_()
        biases.grad.zero_()
    print(f"Epoch {epoch+1} Weights & Biases: {weights}, {biases}")
print("Target values:\n", target_values, '\nPredictions:\n', predictions)
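The update inside torch.no_grad() is plain gradient descent. For reference, the gradients that loss.backward() computes for this model can be written out by hand. The symbols are notation introduced here rather than taken from the script: $X$ stands for input_data ($N \times 3$), $Y$ for target_values ($N \times M$), $W$ for weights ($M \times 3$), $b$ for biases, $\mathbf{1}$ for a column of $N$ ones, and $\eta$ for the learning rate.

```latex
\begin{aligned}
P &= X W^{\top} + \mathbf{1} b^{\top}, \qquad
L = \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \left(P_{nm} - Y_{nm}\right)^{2}, \\
\frac{\partial L}{\partial W} &= \frac{2}{NM} \, (P - Y)^{\top} X, \qquad
\frac{\partial L}{\partial b} = \frac{2}{NM} \, (P - Y)^{\top} \mathbf{1}, \\
W &\leftarrow W - \eta \, \frac{\partial L}{\partial W}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}, \qquad \eta = 10^{-5}.
\end{aligned}
```

With five samples and two targets, $NM = 10$, which matches difference.numel() in mean_squared_error.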
Linear Regression Using PyTorch Built-in Functions
# pytorch_linear_regression.py
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torch.nn.functional as F
# Load training data
# Input (temperature, rainfall, humidity)
input_data = np.array([[73, 67, 43],
[91, 88, 64],
[87, 134, 58],
[102, 43, 37],
[69, 96, 70],
[74, 66, 43],
[91, 87, 65],
[88, 134, 59],
[101, 44, 37],
[68, 96, 71],
[73, 66, 44],
[92, 87, 64],
[87, 135, 57],
[103, 43, 36],
[68, 97, 70]],
dtype='float32')
# Targets (apples, oranges)
target_values = np.array([[56, 70],
[81, 101],
[119, 133],
[22, 37],
[103, 119],
[57, 69],
[80, 102],
[118, 132],
[21, 38],
[104, 118],
[57, 69],
[82, 100],
[118, 134],
[20, 38],
[102, 120]],
dtype='float32')
input_data = torch.from_numpy(input_data)
target_values = torch.from_numpy(target_values)
# Create dataset and dataloader
train_dataset = TensorDataset(input_data, target_values)
batch_size = 5
train_dataloader = DataLoader(train_dataset, batch_size, shuffle=True)
# Initialize linear model
model = nn.Linear(3, 2)
# Define loss function
loss_function = F.mse_loss
# Calculate initial loss
initial_loss = loss_function(model(input_data), target_values)
# Define optimizer (stochastic gradient descent)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
# Training function
def train_model(num_epochs, model, loss_fn, optimizer):
    for epoch in range(num_epochs):
        for inputs, targets in train_dataloader:
            # Forward pass
            predictions = model(inputs)
            # Calculate loss
            loss = loss_fn(predictions, targets)
            # Backward pass
            loss.backward()
            # Update weights
            optimizer.step()
            # Reset gradients
            optimizer.zero_grad()
        # Print progress every 10 epochs
        if (epoch + 1) % 10 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
# Train the model
train_model(100, model, loss_function, optimizer)
Optimizing Convolution Operations with AVX
Checking Supported SIMD Instruction Sets
Windows:
- Task Manager:
- Right-click on the Start button and select "Task Manager"
- Switch to the "Performance" tab
- Select "CPU" and look for "Instruction Sets" or "Features" to see if AVX, AVX2, or other information is listed
- System Information:
- Press the Windows key and type "System Information", then select the application
- In "System Summary", look for the "Instruction Sets" section to see if it includes AVX or other SIMD instruction sets
- CPU-Z:
- Download and install CPU-Z
- Run CPU-Z and switch to the "CPU" or "Instructions" tab to see supported instruction sets
- Command Line:
- Open Command Prompt and enter:
wmic CPU get caption, deviceID, name, numberOfCores, numberOfLogicalProcessors, architecture, family, manufacturer, status
- This lists general CPU information but may not directly show AVX or NEON support
Linux:
- /proc/cpuinfo:
- Open a terminal
- Enter one of the following:
cat /proc/cpuinfo
grep flags /proc/cpuinfo
- Look for "avx", "avx2", and other keywords in the flags
- lscpu:
- Enter the following in a terminal:
lscpu
- Look in the "Flags" or "Features" section
macOS:
- About This Mac:
- Click the Apple menu and select "About This Mac"
- Click "System Report"
- In the Hardware section, check "Processor" information (though it may not directly show AVX or NEON support)
- Command Line:
- Open Terminal
- Enter:
sysctl -a | grep machdep.cpu.features
- On Intel-based Macs, this should display the supported instruction sets, including AVX
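On Linux, the /proc/cpuinfo check described above can also be scripted. The following is a minimal sketch, assuming the usual "flags" (x86) or "Features" (ARM) line format; the function name `find_simd_flags` and the list of extensions it looks for are illustrative choices, not an exhaustive catalogue.

```python
# simd_flags.py
# Minimal sketch: report which SIMD extensions appear in /proc/cpuinfo-style text.
import platform
from pathlib import Path

# Illustrative selection of flag names; extend as needed.
SIMD_FLAGS = ("sse2", "sse4_1", "sse4_2", "avx", "avx2", "avx512f", "fma", "neon", "asimd")

def find_simd_flags(cpuinfo_text):
    """Return the entries of SIMD_FLAGS that occur in the given cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels use "Features".
        if key.strip().lower() in ("flags", "features"):
            found.update(value.lower().split())
    return [flag for flag in SIMD_FLAGS if flag in found]

if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if platform.system() == "Linux" and cpuinfo.exists():
        print("SIMD support:", find_simd_flags(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo here; use the OS-specific steps above instead.")
```

The parsing is kept separate from the file read so it can be exercised with a pasted flags line from any machine.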
Original Convolution Code
// convolution_basic.cpp
#include <cstdint>

// The __restrict__ qualifier (a compiler extension) promises that memory
// regions accessed through different __restrict__ pointers do not overlap.
bool perform_convolution(double* __restrict__ output, const double* __restrict__ input,
                         const double* __restrict__ kernel, int64_t data_length)
{
    constexpr int64_t kernel_size = 5;
    constexpr int64_t half_kernel = kernel_size / 2;
    if (data_length < kernel_size) {
        return false;
    }
    int64_t i = half_kernel;
    // Compute four outputs per iteration. Lane j of each temp array holds one
    // kernel tap's contribution to output[i + j]. The bound keeps input[i + 5]
    // in range, and stepping by 4 avoids recomputing the same outputs.
    for (; i + 3 + half_kernel < data_length; i += 4)
    {
        double temp0[4] = {0.0}, temp1[4] = {0.0},
               temp2[4] = {0.0}, temp3[4] = {0.0}, temp4[4] = {0.0};
        temp0[0] = kernel[half_kernel - 2] * input[i + 2];
        temp0[1] = kernel[half_kernel - 2] * input[i + 3];
        temp0[2] = kernel[half_kernel - 2] * input[i + 4];
        temp0[3] = kernel[half_kernel - 2] * input[i + 5];
        temp1[0] = kernel[half_kernel - 1] * input[i + 1];
        temp1[1] = kernel[half_kernel - 1] * input[i + 2];
        temp1[2] = kernel[half_kernel - 1] * input[i + 3];
        temp1[3] = kernel[half_kernel - 1] * input[i + 4];
        temp2[0] = kernel[half_kernel - 0] * input[i + 0];
        temp2[1] = kernel[half_kernel - 0] * input[i + 1];
        temp2[2] = kernel[half_kernel - 0] * input[i + 2];
        temp2[3] = kernel[half_kernel - 0] * input[i + 3];
        temp3[0] = kernel[half_kernel - (-1)] * input[i + (-1)];
        temp3[1] = kernel[half_kernel - (-1)] * input[i + 0];
        temp3[2] = kernel[half_kernel - (-1)] * input[i + 1];
        temp3[3] = kernel[half_kernel - (-1)] * input[i + 2];
        temp4[0] = kernel[half_kernel - (-2)] * input[i + (-2)];
        temp4[1] = kernel[half_kernel - (-2)] * input[i + (-1)];
        temp4[2] = kernel[half_kernel - (-2)] * input[i + 0];
        temp4[3] = kernel[half_kernel - (-2)] * input[i + 1];
        output[i] = temp0[0] + temp1[0] + temp2[0] + temp3[0] + temp4[0];
        output[i+1] = temp0[1] + temp1[1] + temp2[1] + temp3[1] + temp4[1];
        output[i+2] = temp0[2] + temp1[2] + temp2[2] + temp3[2] + temp4[2];
        output[i+3] = temp0[3] + temp1[3] + temp2[3] + temp3[3] + temp4[3];
    }
    // Scalar tail for the interior elements that do not fill a group of four
    for (; i < data_length - half_kernel; i++)
    {
        double sum = 0.0;
        for (int64_t k = -half_kernel; k <= half_kernel; k++) {
            sum += kernel[half_kernel - k] * input[i + k];
        }
        output[i] = sum;
    }
    return true;
}
The perform_convolution function processes input data using a convolution kernel. Parameters include:
- output: Stores the convolution results
- input: The signal or data to be processed
- kernel: The convolution kernel array (constant size 5)
- data_length: Length of the input array
The function first checks if the input is large enough for the convolution operation. It then processes the input in chunks, calculating multiple output elements simultaneously to simulate SIMD operations.
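When checking the C++ output, it can help to have a slow but obviously correct reference. The sketch below is a hypothetical Python helper, not part of the original code; it uses the same index pairing as perform_convolution (kernel[half_kernel - k] with input[i + k]) and, as an assumption made here, leaves the two border entries at each end as 0.0.

```python
# convolution_reference.py
# Hypothetical reference implementation for checking the C++ results.

def reference_convolution(input_data, kernel):
    """Return a list whose interior entries hold the 5-tap convolution.

    The first and last two entries stay at 0.0, because the C++ loop only
    computes indices from half_kernel up to data_length - half_kernel - 1.
    """
    kernel_size = 5
    half_kernel = kernel_size // 2
    if len(kernel) != kernel_size:
        raise ValueError("kernel must have exactly 5 elements")
    data_length = len(input_data)
    if data_length < kernel_size:
        return None  # mirrors the C++ function returning false
    output = [0.0] * data_length
    for i in range(half_kernel, data_length - half_kernel):
        total = 0.0
        for k in range(-half_kernel, half_kernel + 1):
            # Same pairing as the C++ code: kernel[half_kernel - k] * input[i + k]
            total += kernel[half_kernel - k] * input_data[i + k]
        output[i] = total
    return output

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
identity = [0.0, 0.0, 1.0, 0.0, 0.0]  # passes interior samples through unchanged
shift = [1.0, 0.0, 0.0, 0.0, 0.0]     # kernel[0] pairs with input[i + 2]

print(reference_convolution(signal, identity))  # [0.0, 0.0, 3.0, 4.0, 5.0, 0.0, 0.0]
print(reference_convolution(signal, shift))     # [0.0, 0.0, 5.0, 6.0, 7.0, 0.0, 0.0]
```

Simple kernels such as the identity and a one-tap shift make indexing mistakes easy to spot before moving on to real data.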
AVX-Optimized Implementation
// convolution_avx.cpp
// Build with AVX enabled, e.g. -mavx on GCC or Clang.
#include <immintrin.h>
#include <cstdint>

bool optimized_convolution(double* __restrict__ output, const double* __restrict__ input,
                           const double* __restrict__ kernel, int64_t data_length) {
    constexpr int64_t kernel_size = 5;
    constexpr int64_t half_kernel = kernel_size / 2;
    // Check if input is sufficient for convolution
    if (data_length < kernel_size) {
        return false;
    }
    // Broadcast each kernel value across a 256-bit AVX register (4 doubles)
    __m256d k0 = _mm256_set1_pd(kernel[0]);
    __m256d k1 = _mm256_set1_pd(kernel[1]);
    __m256d k2 = _mm256_set1_pd(kernel[2]);
    __m256d k3 = _mm256_set1_pd(kernel[3]);
    __m256d k4 = _mm256_set1_pd(kernel[4]);
    int64_t i = half_kernel;
    // Process four outputs per iteration; the bound keeps the highest element
    // loaded, input[i + 5], inside the array
    for (; i + 3 + half_kernel < data_length; i += 4) {
        // Load five overlapping input windows (unaligned loads)
        __m256d x0 = _mm256_loadu_pd(&input[i + 2]);
        __m256d x1 = _mm256_loadu_pd(&input[i + 1]);
        __m256d x2 = _mm256_loadu_pd(&input[i]);
        __m256d x3 = _mm256_loadu_pd(&input[i - 1]);
        __m256d x4 = _mm256_loadu_pd(&input[i - 2]);
        // Perform element-wise multiplication and accumulate results
        __m256d result = _mm256_add_pd(
            _mm256_add_pd(
                _mm256_mul_pd(x0, k0),
                _mm256_mul_pd(x1, k1)),
            _mm256_add_pd(
                _mm256_mul_pd(x2, k2),
                _mm256_add_pd(
                    _mm256_mul_pd(x3, k3),
                    _mm256_mul_pd(x4, k4))));
        // Store results back to output array
        _mm256_storeu_pd(&output[i], result);
    }
    // Process the remaining interior elements with scalar code
    for (; i < data_length - half_kernel; i++) {
        double result_value = 0.0;
        for (int64_t k = -half_kernel; k <= half_kernel; k++) {
            result_value += input[i - k] * kernel[k + half_kernel];
        }
        output[i] = result_value;
    }
    return true;
}
During testing, a segmentation fault occurred when using _mm256_load_pd, which requires 32-byte-aligned addresses; offsets such as &input[i + 1] cannot all be aligned at once. The solution was to use the unaligned variants: _mm256_loadu_pd for loads and _mm256_storeu_pd for stores.
Interestingly, since the convolution kernel used is symmetric, flipping it has no effect, so the results remain correct whether or not the taps are applied in standard convolution order.
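The point about symmetric kernels can be checked with a few lines of Python. This is an illustrative sketch with made-up sample values; `apply_taps` is a hypothetical helper in which flip=True applies the kernel reversed against the input window (convolution order) and flip=False applies it as-is (cross-correlation order).

```python
# kernel_symmetry_check.py
# Illustrative check: tap order matters only for asymmetric kernels.

def apply_taps(samples, kernel, flip):
    """Apply a kernel to the interior of samples, optionally reversing the taps."""
    half = len(kernel) // 2
    taps = kernel[::-1] if flip else kernel
    return [
        sum(taps[j] * samples[i - half + j] for j in range(len(kernel)))
        for i in range(half, len(samples) - half)
    ]

samples = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0]  # arbitrary test data
symmetric = [1.0, 2.0, 3.0, 2.0, 1.0]
asymmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

same_for_symmetric = apply_taps(samples, symmetric, True) == apply_taps(samples, symmetric, False)
same_for_asymmetric = apply_taps(samples, asymmetric, True) == apply_taps(samples, asymmetric, False)

print("symmetric kernel, same result:", same_for_symmetric)    # True
print("asymmetric kernel, same result:", same_for_asymmetric)  # False
```

Testing with an asymmetric kernel is therefore the more revealing check when validating an optimized implementation.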
Essential GDB Debugging Techniques
Starting GDB
- Launch GDB with a compiled program:
gdb [executable_program]
Setting Breakpoints
- Set breakpoint at function entry:
break [function_name]
- Set breakpoint at specific line in a file:
break [filename]:[line_number]
Running the Program
- Execute the program:
run [arguments]
where [arguments] are any parameters passed to the program
Inspecting Program State
- Display all register values:
info registers
- Show local variables in current stack frame:
info locals
- Print expression value:
print [expression]
where [expression] can be a variable name or valid C expression
- Display call stack:
backtrace or bt
Stepping Through Code
- Execute next line (step over):
next or n
- Execute next line (step into):
step or s
Continuing Execution
- Resume program execution:
continue or c
Exiting GDB
- Quit debugger:
quit or q
Viewing and Modifying Variables
- Set variable value:
set var [variable]=[value]
- Display variable value:
print [variable]
Examining Source Code
- Show source code around specific line:
list [line_number]
- Show source code for a function:
list [function]
Conditional Breakpoints
- Set breakpoint with condition:
break [location] if [condition]