Gradient descent is widely adopted for training modern machine learning models because it scales well to large datasets and high-dimensional feature spaces. Unlike closed-form solutions, which become computationally prohibitive as data volume grows, gradient descent updates parameters iteratively using gradients computed on subsets of the data or on the full dataset.
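For contrast, the closed-form least-squares solution can be written directly via the normal equation; it is exact, but the matrix inversion scales poorly with feature dimension. A minimal sketch (the synthetic data mirrors the examples below; the fixed seed is only for reproducibility):

```python
import numpy as np

np.random.seed(0)                    # illustrative seed, not from the original
X = np.random.rand(100, 1)           # 100 samples, one feature
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]    # prepend intercept column

# Normal equation: w = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(w_closed)  # close to the true parameters [4, 3]
```

This costs one inversion of a d×d matrix, which is fine for two parameters but becomes the bottleneck when d is large; gradient descent avoids it entirely.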
Batch Gradient Descent
import numpy as np

# np.random.seed(1)  # uncomment for reproducible results
features = np.random.rand(100, 1)                     # 100 samples, uniform in [0, 1)
targets = 4 + 3 * features + np.random.randn(100, 1)  # y = 4 + 3x + Gaussian noise
design_matrix = np.c_[np.ones((100, 1)), features]    # prepend intercept column

step_size = 0.001
iterations = 10000
weights = np.random.randn(2, 1)                       # random init: [intercept, slope]

for _ in range(iterations):
    error = design_matrix.dot(weights) - targets      # residuals on the full batch
    grad = design_matrix.T.dot(error)                 # (unnormalized) gradient of the squared error
    weights = weights - step_size * grad              # step against the gradient

print(weights)
Library Import
numpy is used for numerical operations and matrix computations.
Data Generation
features = np.random.rand(100, 1) creates 100 random values following a uniform distribution. targets = 4 + 3 * features + np.random.randn(100, 1) generates the corresponding target values with a linear relationship plus Gaussian noise.
Design Matrix Construction
design_matrix = np.c_[np.ones((100, 1)), features] appends a column of ones to the feature matrix to account for the intercept term in the linear model.
Hyperparameter Setup
step_size = 0.001 controls the magnitude of parameter updates. iterations = 10000 defines the total number of update cycles.
Weight Initialization
weights = np.random.randn(2, 1) initializes the model parameters (intercept and slope) with random values.
Gradient Descent Loop
The loop runs for the specified number of iterations. error = design_matrix.dot(weights) - targets computes the difference between predictions and true values. grad = design_matrix.T.dot(error) calculates the gradient via matrix multiplication. weights = weights - step_size * grad updates the parameters in the direction opposite to the gradient.
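To watch convergence in practice, the loop can track the mean-squared error and stop early once it stabilizes. A sketch under the same setup (the `tolerance` threshold and the early-stopping logic are illustrative additions, not part of the original):

```python
import numpy as np

np.random.seed(1)
features = np.random.rand(100, 1)
targets = 4 + 3 * features + np.random.randn(100, 1)
design_matrix = np.c_[np.ones((100, 1)), features]

step_size = 0.001
weights = np.random.randn(2, 1)
tolerance = 1e-10        # illustrative stopping threshold
prev_mse = np.inf

for step in range(10000):
    error = design_matrix.dot(weights) - targets
    mse = float(np.mean(error ** 2))
    if abs(prev_mse - mse) < tolerance:   # loss has plateaued
        break
    prev_mse = mse
    weights -= step_size * design_matrix.T.dot(error)

print(step, weights)
```

Since the data carry noise with unit variance, the MSE levels off near 1 rather than 0; the break fires once further iterations no longer change it measurably.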
Mini-Batch Gradient Descent
import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

step_size = 0.001
epochs = 1000
sample_count = 100
batch_size = 10
batch_num = sample_count // batch_size
weights = np.random.randn(2, 1)

for epoch in range(epochs):
    for _ in range(batch_num):
        # pick a random start so the slice always contains a full batch
        idx = np.random.randint(sample_count - batch_size + 1)
        x_batch = X_b[idx: idx + batch_size]
        y_batch = y[idx: idx + batch_size]
        grad = x_batch.T.dot(x_batch.dot(weights) - y_batch)
        weights = weights - step_size * grad

print(weights)
Hyperparameter Differences
batch_size = 10 specifies the number of samples per mini-batch. batch_num = sample_count // batch_size calculates how many batches the dataset is split into.
Iterative Training
The outer loop runs for the total number of epochs (full passes over the dataset). The inner loop processes one mini-batch per iteration: a random starting index is selected, and a contiguous slice of batch_size samples is extracted to compute the gradient and update the weights. Because each update uses only a fraction of the data, this approach scales much better to large datasets.
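A common alternative to slicing from a random starting index is to sample the batch indices directly with np.random.choice, which guarantees every mini-batch has exactly batch_size distinct samples. An illustrative sketch (the fixed seed is only for reproducibility):

```python
import numpy as np

np.random.seed(0)  # illustrative seed, not from the original
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

step_size = 0.001
weights = np.random.randn(2, 1)

for epoch in range(1000):
    for _ in range(10):
        # draw a full-size mini-batch without replacement
        batch_idx = np.random.choice(100, size=10, replace=False)
        x_batch, y_batch = X_b[batch_idx], y[batch_idx]
        grad = x_batch.T.dot(x_batch.dot(weights) - y_batch)
        weights -= step_size * grad

print(weights)
```

The trade-off is that nothing forces every sample to appear in a given epoch; the shuffling variant below addresses that.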
Shuffled Batch Optimization
for epoch in range(epochs):
    indices = np.arange(len(X_b))
    np.random.shuffle(indices)
    X_b = X_b[indices]
    y = y[indices]
    for i in range(batch_num):
        x_batch = X_b[i * batch_size: (i + 1) * batch_size]
        y_batch = y[i * batch_size: (i + 1) * batch_size]
        grad = x_batch.T.dot(x_batch.dot(weights) - y_batch)
        weights = weights - step_size * grad
Shuffling the dataset at the start of each epoch and then slicing it into consecutive batches guarantees that every sample is used exactly once per epoch, while the random order still provides the stochasticity that benefits mini-batch training. This removes the risk, present with purely random indexing, of some samples being rarely selected, and improves training stability.
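That once-per-epoch guarantee can be checked directly: after a permutation, the consecutive batches partition the dataset, so each sample index appears exactly once. A small illustrative check:

```python
import numpy as np

np.random.seed(0)  # illustrative seed
n, batch_size = 100, 10
indices = np.arange(n)
np.random.shuffle(indices)

seen = []
for i in range(n // batch_size):
    seen.extend(indices[i * batch_size:(i + 1) * batch_size])

# every sample index occurs exactly once per epoch
print(sorted(seen) == list(range(n)))  # True
```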
Adaptive Learning Rate
import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

a, b = 5, 500

def get_learning_rate(t):
    # decaying schedule: large steps early, small steps late
    return a / (t + b)

epochs = 1000
sample_count = 100
batch_size = 10
batch_num = sample_count // batch_size
weights = np.random.randn(2, 1)

for epoch in range(epochs):
    indices = np.arange(len(X_b))
    np.random.shuffle(indices)
    X_b = X_b[indices]
    y = y[indices]
    for i in range(batch_num):
        x_batch = X_b[i * batch_size: (i + 1) * batch_size]
        y_batch = y[i * batch_size: (i + 1) * batch_size]
        grad = x_batch.T.dot(x_batch.dot(weights) - y_batch)
        # t counts completed updates: batch_num per epoch, plus i within this epoch
        step_size = get_learning_rate(epoch * batch_num + i)
        weights = weights - step_size * grad

print(weights)
The get_learning_rate function reduces the step size as training progresses. Larger steps are used early to speed up convergence, while smaller steps in later stages help fine-tune the parameters and avoid overshooting the optimal solution.
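The decay of this schedule is easy to inspect in isolation: with a = 5 and b = 500, the step size starts at 0.01 and shrinks toward zero as the update counter t grows. A quick check of a few values:

```python
a, b = 5, 500

def get_learning_rate(t):
    return a / (t + b)

for t in [0, 500, 2000, 10000]:
    print(t, get_learning_rate(t))
# t = 0     -> 0.01
# t = 500   -> 0.005
# t = 10000 -> ~0.000476
```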