Foundations of Deep Learning: From Nearest Neighbors to Transformers

Nearest Neighbor and k-NN Classifiers

The Nearest Neighbor classifier stores the entire training set and predicts labels by finding the closest training example using a distance metric like L1 (Manhattan) or L2 (Euclidean). While simple, it suffers from high prediction latency (O(n)) and large memory usage.

class KNearestNeighbor: def init(self): self.X_train = None self.y_train = None

def fit(self, X, y):
    self.X_train = X
    self.y_train = y

def predict(self, X, k=1):
    num_test = X.shape[0]
    predictions = np.zeros(num_test, dtype=self.y_train.dtype)

    for i in range(num_test):
        distances = np.sum(np.abs(self.X_train - X[i]), axis=1)
        nearest_indices = np.argsort(distances)[:k]
        nearest_labels = self.y_train[nearest_indices]
        predictions[i] = np.bincount(nearest_labels).argmax()

    return predictions


</details>L1 distance preserves feature sparsity better, while L2 is more robust to small geometric transformations. For *k &gt; 1*, ties in voting can occur; cross-validation over multiple splits (e.g., 5-fold) helps choose optimal *k*.

Linear Classifiers and Optimization
-----------------------------------

A linear classifier computes scores via *f(x) = Wx + b*. The bias term allows decision boundaries to shift away from the origin. However, linear models fail on non-linearly separable data.

The loss function combines empirical risk and regularization:

<div>\[ \mathcal{L} = \frac{1}{N} \sum_{i=1}^N \ell(f(x_i; W, b), y_i) + \lambda \|W\|_F^2 \]</div>For softmax with cross-entropy loss, the gradient w.r.t. weights is:

<div>\[ \nabla_W \mathcal{L} = \frac{1}{N} \sum_{i=1}^N (p_i - \mathbf{1}_{y_i}) x_i^\top + 2\lambda W \]</div>where *p\_i* is the softmax probability vector and *1<sub>y\_i</sub>* is a one-hot label vector.

### Optimization Algorithms

- **SGD**: Updates parameters using gradients from mini-batches.
- **Momentum**: Accumulates velocity to dampen oscillations and escape shallow minima.
- **RMSProp**: Adapts learning rates per parameter using squared gradient history.
- **Adam**: Combines momentum and RMSProp with bias correction: ```
    mt = beta1 * mt + (1 - beta1) * grad
    vt = beta2 * vt + (1 - beta2) * grad**2
    mt_hat = mt / (1 - beta1**t)
    vt_hat = vt / (1 - beta2**t)
    W -= lr * mt_hat / (np.sqrt(vt_hat) + 1e-8)
    ```

Learning rate schedules (e.g., cosine decay, linear warmup) improve convergence. Larger batch sizes often require proportionally larger learning rates.

Loss Functions
--------------

- **SVM (Hinge) Loss**: *L\_i = Σ<sub>j≠y\_i</sub> max(0, s\_j − s<sub>y\_i</sub> + Δ)*
- **MSE**: *L = ½(y − t)²*; suffers from vanishing gradients when errors are large.
- **Cross-Entropy**: Derived from information theory; equivalent to minimizing KL divergence between true and predicted distributions.

Regularization
--------------

Regularization penalizes model complexity to reduce overfitting:

- **L2**: Encourages small, diffuse weights.
- **L1**: Promotes sparsity (many zero weights).
- **Elastic Net**: Hybrid of L1 and L2.

Activation Functions
--------------------

- **Sigmoid**: Prone to vanishing gradients in deep networks.
- **tanh**: Zero-centered but still saturates.
- **ReLU**: *f(x) = max(0, x)*; efficient but can cause "dead neurons."
- **Leaky ReLU**: *f(x) = max(αx, x)* with small α &gt; 0.
- **GELU / SiLU**: Smooth approximations used in modern architectures.

Neural Network Architectures
----------------------------

### Fully Connected Networks

Multi-layer perceptrons (MLPs) use alternating linear layers and non-linear activations. Without activation functions, they collapse to a single linear transformation. Backpropagation efficiently computes gradients via the chain rule.

### Convolutional Neural Networks (CNNs)

CNNs exploit spatial locality through convolutional filters. Key components:

- **Conv Layers**: Apply learnable filters; padding preserves spatial dimensions.
- **Pooling**: Max pooling reduces spatial size and introduces translation equivariance.
- **Normalization**: BatchNorm or LayerNorm stabilizes training.
- **Dropout**: Randomly deactivates neurons during training (typically *p=0.5*); scales activations by *p* at test time.

Deep networks face vanishing/exploding gradients. Solutoins include:

- **ResNet**: Uses skip connections to enable identity mapping.
- **Kaiming Initialization**: Scales weights by *√(2 / fan\_in)* for ReLU networks.

### Recurrent Neural Networks (RNNs)

RNNs process sequences via recurrent hidden states: *h\_t = tanh(W\_hh h\_{t−1} + W\_xh x\_t)*. They suffer from gradient vanishing/explosion over long sequences.

- **LSTM**: Uses gating mechanisms to control information flow.
- **Truncated BPTT**: Limits backpropagation to recent time steps.

### Transformers

Transformers replace recurrence with self-attention:

<div>\[ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]</div>Key innovations:

- **Positional Encoding**: Injects sequence order information.
- **Pre-Norm Architecture**: Places LayerNorm before attention/MLP blocks.
- **RMSNorm**: Normalizes using root-mean-square instead of variance.
- **MoE (Mixture of Experts)**: Routes tokens to specialized subnetworks.

Computer Vision Tasks
---------------------

### Semantic Segmentation

Assigns a class label to every pixel. U-Net uses an encoder-decoder structure with skip connections. Upsampling methods include transposed convolutions and max unpooling.

### Instance Segmentation

Distinguishes individual object instances. Mask R-CNN extends Faster R-CNN with a mask prediction branch and RoIAlign (avoids quantization errors in RoI Pooling).

### Object Detection Evolution

- **R-CNN**: Applies CNN to region proposals from Selective Search.
- **Fast R-CNN**: Shares convolutional features across proposals.
- **Faster R-CNN**: Replaces Selective Search with a Region Proposal Network (RPN).
- **DETR**: Uses Transformer with object queries for end-to-end detection.

Video Understanding
-------------------

Approaches include:

- **3D CNNs**: Extend convolutions to spatio-temporal domains (e.g., C3D, I3D).
- **Two-Stream Networks**: Separate RGB (appearance) and optical flow (motion) streams.
- **Non-local Blocks**: Capture long-range dependencies via self-attention.

Visualization Techniques
------------------------

- **Class Activation Mapping (Grad-CAM)**: Highlights image regions influencing predictions.
- **Filter Visualization**: Shows learned convolutional features.
- **Attention Maps**: Reveals token interactions in Vision Transformers.

Tags: deep-learning neural-networks convolutional-neural-networks transformers Optimization

Posted on Mon, 25 May 2026 19:10:33 +0000 by suigion

Freaks City

Foundations of Deep Learning: From Nearest Neighbors to Transformers

Nearest Neighbor and k-NN Classifiers

Hot Tags