From LeNet to AlexNet: How Deep Convolutional Networks Finally Took Over Computer Vision

After LeNet demonstrated that convolutional architectures could work, interest in neural networks for vision spiked—yet for almost two decades they remained a niche curiosity. The problem was not the concept but the constraints: tiny labeled corpora, weak accelerators, and training tricks that had not yet been invented. Support-vector machines, boosted trees, and other shallow learners dominated benchmarks because they thrived on the small, hand-crafted feature sets that were all anyone could realistically compute.

Traditional vision pipelines of the 1990s and 2000s looked like this:

  1. Acquire a few thousand images—often captured with expensive, low-resolution sensors.
  2. Design a bespoke pre-processing chain based on optics, geometry, or luck.
  3. Extract descriptors such as SIFT, SURF, or HOG.
  4. Feed the descriptors into an off-the-shelf classifier.
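
To make the four steps concrete, here is a toy sketch in plain Python. It is not real SIFT, SURF, or HOG: the hypothetical `orientation_histogram` helper is a heavily simplified, HOG-like descriptor (one gradient-orientation histogram over the whole image), and the hypothetical `nearest_class` function stands in for an off-the-shelf classifier. The point is only to show that the learning part sees the hand-designed descriptor, never the pixels.

```python
import math

def orientation_histogram(img, bins=8):
    """Simplified HOG-like descriptor (assumption: one histogram per image,
    no cells or block normalization as in real HOG)."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = img[r][c + 1] - img[r][c - 1]   # horizontal central difference
            gy = img[r + 1][c] - img[r - 1][c]   # vertical central difference
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            hist[int(angle / math.pi * bins) % bins] += mag
    total = sum(hist) or 1.0
    return [v / total for v in hist]

def nearest_class(descriptor, prototypes):
    """Stand-in classifier: pick the label whose prototype descriptor is closest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(descriptor, prototypes[label]))

# Steps 1-2: tiny synthetic 6 x 6 "images" containing a single edge.
vertical = [[1.0 if c >= 3 else 0.0 for c in range(6)] for r in range(6)]
horizontal = [[1.0 if r >= 3 else 0.0 for c in range(6)] for r in range(6)]
query = [[1.0 if c >= 2 else 0.0 for c in range(6)] for r in range(6)]

# Step 3: extract descriptors. Step 4: classify on descriptors only.
prototypes = {
    "vertical": orientation_histogram(vertical),
    "horizontal": orientation_histogram(horizontal),
}
prediction = nearest_class(orientation_histogram(query), prototypes)
print(prediction)
```

The classifier here is trivial on purpose. In pipelines of this kind, most of the engineering effort went into the descriptor function.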

Researchers in machine learning and computer vision told two different stories. ML theorists celebrated elegant proofs and ever-better optimizers; vision practitioners quietly admitted that progress came from cleverer features and larger, cleaner data sets. End-to-end learning from raw pixels sounded utopian.

What Was Missing

Data

Deep models have millions of parameters, so fitting them without overfitting calls for large labeled corpora. Until the late 2000s, most public sets contained only hundreds or thousands of low-resolution images. That changed with ImageNet, released in 2009 with labels crowd-sourced via Amazon Mechanical Turk. The subset used for the ImageNet challenge (ILSVRC), first held in 2010, contains roughly 1.2 million training images across 1,000 classes. The bottleneck shifted from "how do we engineer features?" to "how do we train models big enough to exploit this data?"

Hardware

Neural-network accelerators existed in the 1990s, but they were rare and weak. CPUs of the era were optimized for single-threaded, branching code, not the dense linear algebra that dominates convolutions. GPUs, originally built for 3-D graphics, turned out to be well suited to the job: hundreds to thousands of simple cores, high memory bandwidth, and a programming model that maps naturally onto tensor operations. Alex Krizhevsky's cuda-convnet, the GPU code behind the 2012 AlexNet result he produced with Ilya Sutskever and Geoffrey Hinton, showed that large CNNs could be trained on commodity gaming GPUs. The original paper reports five to six days of training on two GTX 580 cards, and the race was on.

AlexNet Architecture

AlexNet’s blueprint is deeper and wider than LeNet-5:

  • Five convolutional layers followed by three fully-connected layers.
  • ReLU activations instead of sigmoids to fight vanishing gradients.
  • Dropout on the dense layers for regularization.
  • Heavy data augmentation (flips, crops, color jitter) to enlarge the effective data set.
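
The vanishing-gradient point can be illustrated with a few lines of arithmetic. The sketch below is a deliberate simplification: it ignores the weights and multiplies the same local derivative once per layer, with an assumed depth of 8 and hypothetical helper names. It only shows the trend, not what happens in a real network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # never exceeds 0.25 (reached at x == 0)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # exactly 1 for any active unit

def chained_grad(local_grad, pre_activation, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return local_grad(pre_activation) ** depth

depth = 8                                     # assumed depth for illustration
sig = chained_grad(sigmoid_grad, 0.0, depth)  # best case for the sigmoid
rel = chained_grad(relu_grad, 1.0, depth)     # an active ReLU path
print(f"sigmoid: {sig:.2e}  relu: {rel:.1f}")
```

Even in the sigmoid's best case the backpropagated signal shrinks by a factor of four per layer, while an active ReLU path passes it through unchanged.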

The first convolution uses an 11 × 11 kernel with stride 4 to cope with 224 × 224 ImageNet inputs. Subsequent layers shrink the kernel to 5 × 5 and then 3 × 3, while the channel count grows from 96 to 256 and then 384 before dropping back to 256. Three max-pool layers (3 × 3, stride 2) each roughly halve the spatial resolution. After the last pooling step the conv stack outputs 256 feature maps of size 6 × 6; after flattening, two 4096-unit dense layers and a 1000-way softmax complete the network.
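
The spatial sizes follow from the usual output-size formula, floor((n + 2p - k) / s) + 1. Here is a small sketch that traces them in plain Python. The padding values are an assumption taken from the PyTorch listing later in this post, and `out_size` is a hypothetical helper name.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding); padding values assumed from the listing below
layers = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 224
trace = []
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    trace.append((name, size))
    print(f"{name}: {size} x {size}")
# final line printed: pool3: 6 x 6
```

With these settings the stride-4 first layer alone takes the input from 224 to 55 pixels per side, and the three pooling steps bring it down to 6.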

Re-implementing AlexNet in PyTorch

Below is a compact single-GPU version. Note that the channel counts (48, 128, 192, 192, 128) and the dense-layer widths (2048) are halved compared to the original paper so the model fits on a modern laptop GPU.

import torch
from torch import nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(48, 128, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(128, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(192, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(128 * 6 * 6, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(2048, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(2048, 1000)
)
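
It can be useful to know where the parameters of this halved model sit. The following sketch counts weights and biases by hand, using the layer sizes from the listing above; the helper names are hypothetical and nothing here needs PyTorch.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def linear_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out

# (in_channels, out_channels, kernel_size) for the five conv layers above
conv_layers = [(3, 48, 11), (48, 128, 5), (128, 192, 3), (192, 192, 3), (192, 128, 3)]
# (in_features, out_features) for the three dense layers above
dense_layers = [(128 * 6 * 6, 2048), (2048, 2048), (2048, 1000)]

conv_total = sum(conv_params(*layer) for layer in conv_layers)
dense_total = sum(linear_params(*layer) for layer in dense_layers)
total = conv_total + dense_total

print(f"conv: {conv_total:,}  dense: {dense_total:,}  total: {total:,}")
print(f"dense share: {dense_total / total:.1%}")
```

The dense layers account for the large majority of the parameters, which is one reason dropout is applied there.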

Shape tracing on a single 224 × 224 RGB image:

X = torch.randn(1, 3, 224, 224)
for layer in alexnet:
    X = layer(X)
    print(layer.__class__.__name__, X.shape)

Expected output:


Conv2d torch.Size([1, 48, 55, 55])
ReLU torch.Size([1, 48, 55, 55])
MaxPool2d torch.Size([1, 48, 27, 27])
Conv2d torch.Size([1, 128, 27, 27])
ReLU torch.Size([1, 128, 27, 27])
MaxPool2d torch.Size([1, 128, 13, 13])
Conv2d torch.Size([1, 192, 13, 13])
ReLU torch.Size([1, 192, 13, 13])
Conv2d torch.Size([1, 192, 13, 13])
ReLU torch.Size([1, 192, 13, 13])
Conv2d torch.Size([1, 128, 13, 13])
ReLU torch.Size([1, 128, 13, 13])
MaxPool2d torch.Size([1, 128, 6, 6])
Flatten torch.Size([1, 4608])
Linear torch.Size([1, 2048])
ReLU torch.Size([1, 2048])
Dropout torch.Size([1, 2048])
Linear torch.Size([1, 2048])
ReLU torch.Size([1, 2048])
Dropout torch.Size([1, 2048])
Linear torch.Size([1, 1000])

Training on Fashion-MNIST

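Because training on ImageNet is expensive even on modern GPUs, we demonstrate the workflow on Fashion-MNIST. Its 28 × 28 grayscale images are up-sampled to 224 × 224 and their single channel is replicated three times, so the architecture can be reused verbatim.

import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import FashionMNIST

transform = T.Compose([
    T.Resize(224),                        # 28 × 28 -> 224 × 224
    T.Grayscale(num_output_channels=3),   # 1 channel -> 3, to match Conv2d(3, ...)
    T.ToTensor()
])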

train_set = FashionMNIST(root='.', train=True,  download=True, transform=transform)
test_set  = FashionMNIST(root='.', train=False, download=True, transform=transform)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True,  num_workers=4)
test_loader  = DataLoader(test_set,  batch_size=128, shuffle=False, num_workers=4)

Replace the final Linear(2048, 1000) with Linear(2048, 10) for the ten Fashion-MNIST classes, then train with momentum SGD and a small learning rate:

model = alexnet
model[-1] = nn.Linear(2048, 10)

opt   = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
loss  = nn.CrossEntropyLoss()

# A minimal training loop (uses the GPU when available, otherwise the CPU)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

for epoch in range(10):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        out = model(x)
        l = loss(out, y)
        l.backward()
        opt.step()
    # validation omitted for brevity
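
For readers wondering what `opt.step()` does with `momentum=0.9` and `weight_decay=5e-4`, here is a scalar sketch of the update as I understand PyTorch's SGD with its default `dampening=0` and `nesterov=False`. The function `sgd_step` is a hypothetical illustration, not part of any library, and the quadratic objective is a toy.

```python
def sgd_step(param, grad, velocity, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """One momentum-SGD update for a single scalar parameter."""
    g = grad + weight_decay * param   # L2 weight decay folded into the gradient
    # The momentum buffer starts out as the first gradient, then accumulates.
    velocity = g if velocity is None else momentum * velocity + g
    param = param - lr * velocity
    return param, velocity

# Toy objective f(w) = 0.5 * w**2, whose gradient is simply w.
w, v = 1.0, None
history = []
for step in range(3):
    w, v = sgd_step(w, grad=w, velocity=v)
    history.append(w)
    print(step, w)
```

Each step moves `w` toward zero, and the steps grow as the velocity builds up. That is the effect momentum is meant to have on consistent gradient directions.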

With the settings above you should reach roughly 90 % test accuracy within minutes on a single RTX-class GPU, although exact numbers vary with hardware and random seed. It suggests that the same ideas that conquered ImageNet also scale down gracefully.
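
Since the loop above skips validation, here is a minimal sketch of how test accuracy could be measured. The helper name `evaluate` is my own, and it assumes the loader yields `(inputs, labels)` batches like the `test_loader` defined earlier.

```python
import torch
from torch import nn

@torch.no_grad()  # no gradients are needed for evaluation
def evaluate(model, loader, device):
    """Return top-1 accuracy of `model` over all batches in `loader`."""
    model.eval()  # switches dropout off; the training loop re-enables it via model.train()
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        pred = model(x).argmax(dim=1)        # index of the largest logit per sample
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

# Possible usage after (or inside) the epoch loop:
# acc = evaluate(model, test_loader, device)
# print(f"test accuracy: {acc:.3f}")
```

Calling it once per epoch also makes it easy to spot overfitting, at the cost of one extra pass over the test set.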

Tags: alexnet convolutional-neural-networks imagenet ReLU dropout

Posted on Fri, 03 Jul 2026 16:35:35 +0000 by jamessw