Nearest Neighbor and k-NN Classifiers
The Nearest Neighbor classifier stores the entire training set and predicts labels by finding the closest training example using a distance metric like L1 (Manhattan) or L2 (Euclidean). While simple, it suffers from high prediction latency (O(n)) and large memory usage.
class KNearestNeighbor: def init(self): self.X_train = None self.y_train = None
def fit(self, X, y):
self.X_train = X
self.y_train = y
def predict(self, X, k=1):
num_test = X.shape[0]
predictions = np.zeros(num_test, dtype=self.y_train.dtype)
for i in range(num_test):
distances = np.sum(np.abs(self.X_train - X[i]), axis=1)
nearest_indices = np.argsort(distances)[:k]
nearest_labels = self.y_train[nearest_indices]
predictions[i] = np.bincount(nearest_labels).argmax()
return predictions
</details>L1 distance preserves feature sparsity better, while L2 is more robust to small geometric transformations. For *k > 1*, ties in voting can occur; cross-validation over multiple splits (e.g., 5-fold) helps choose optimal *k*.
Linear Classifiers and Optimization
-----------------------------------
A linear classifier computes scores via *f(x) = Wx + b*. The bias term allows decision boundaries to shift away from the origin. However, linear models fail on non-linearly separable data.
The loss function combines empirical risk and regularization:
<div>\[ \mathcal{L} = \frac{1}{N} \sum_{i=1}^N \ell(f(x_i; W, b), y_i) + \lambda \|W\|_F^2 \]</div>For softmax with cross-entropy loss, the gradient w.r.t. weights is:
<div>\[ \nabla_W \mathcal{L} = \frac{1}{N} \sum_{i=1}^N (p_i - \mathbf{1}_{y_i}) x_i^\top + 2\lambda W \]</div>where *p\_i* is the softmax probability vector and *1<sub>y\_i</sub>* is a one-hot label vector.
### Optimization Algorithms
- **SGD**: Updates parameters using gradients from mini-batches.
- **Momentum**: Accumulates velocity to dampen oscillations and escape shallow minima.
- **RMSProp**: Adapts learning rates per parameter using squared gradient history.
- **Adam**: Combines momentum and RMSProp with bias correction: ```
mt = beta1 * mt + (1 - beta1) * grad
vt = beta2 * vt + (1 - beta2) * grad**2
mt_hat = mt / (1 - beta1**t)
vt_hat = vt / (1 - beta2**t)
W -= lr * mt_hat / (np.sqrt(vt_hat) + 1e-8)
```
Learning rate schedules (e.g., cosine decay, linear warmup) improve convergence. Larger batch sizes often require proportionally larger learning rates.
Loss Functions
--------------
- **SVM (Hinge) Loss**: *L\_i = Σ<sub>j≠y\_i</sub> max(0, s\_j − s<sub>y\_i</sub> + Δ)*
- **MSE**: *L = ½(y − t)²*; suffers from vanishing gradients when errors are large.
- **Cross-Entropy**: Derived from information theory; equivalent to minimizing KL divergence between true and predicted distributions.
Regularization
--------------
Regularization penalizes model complexity to reduce overfitting:
- **L2**: Encourages small, diffuse weights.
- **L1**: Promotes sparsity (many zero weights).
- **Elastic Net**: Hybrid of L1 and L2.
Activation Functions
--------------------
- **Sigmoid**: Prone to vanishing gradients in deep networks.
- **tanh**: Zero-centered but still saturates.
- **ReLU**: *f(x) = max(0, x)*; efficient but can cause "dead neurons."
- **Leaky ReLU**: *f(x) = max(αx, x)* with small α > 0.
- **GELU / SiLU**: Smooth approximations used in modern architectures.
Neural Network Architectures
----------------------------
### Fully Connected Networks
Multi-layer perceptrons (MLPs) use alternating linear layers and non-linear activations. Without activation functions, they collapse to a single linear transformation. Backpropagation efficiently computes gradients via the chain rule.
### Convolutional Neural Networks (CNNs)
CNNs exploit spatial locality through convolutional filters. Key components:
- **Conv Layers**: Apply learnable filters; padding preserves spatial dimensions.
- **Pooling**: Max pooling reduces spatial size and introduces translation equivariance.
- **Normalization**: BatchNorm or LayerNorm stabilizes training.
- **Dropout**: Randomly deactivates neurons during training (typically *p=0.5*); scales activations by *p* at test time.
Deep networks face vanishing/exploding gradients. Solutoins include:
- **ResNet**: Uses skip connections to enable identity mapping.
- **Kaiming Initialization**: Scales weights by *√(2 / fan\_in)* for ReLU networks.
### Recurrent Neural Networks (RNNs)
RNNs process sequences via recurrent hidden states: *h\_t = tanh(W\_hh h\_{t−1} + W\_xh x\_t)*. They suffer from gradient vanishing/explosion over long sequences.
- **LSTM**: Uses gating mechanisms to control information flow.
- **Truncated BPTT**: Limits backpropagation to recent time steps.
### Transformers
Transformers replace recurrence with self-attention:
<div>\[ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]</div>Key innovations:
- **Positional Encoding**: Injects sequence order information.
- **Pre-Norm Architecture**: Places LayerNorm before attention/MLP blocks.
- **RMSNorm**: Normalizes using root-mean-square instead of variance.
- **MoE (Mixture of Experts)**: Routes tokens to specialized subnetworks.
Computer Vision Tasks
---------------------
### Semantic Segmentation
Assigns a class label to every pixel. U-Net uses an encoder-decoder structure with skip connections. Upsampling methods include transposed convolutions and max unpooling.
### Instance Segmentation
Distinguishes individual object instances. Mask R-CNN extends Faster R-CNN with a mask prediction branch and RoIAlign (avoids quantization errors in RoI Pooling).
### Object Detection Evolution
- **R-CNN**: Applies CNN to region proposals from Selective Search.
- **Fast R-CNN**: Shares convolutional features across proposals.
- **Faster R-CNN**: Replaces Selective Search with a Region Proposal Network (RPN).
- **DETR**: Uses Transformer with object queries for end-to-end detection.
Video Understanding
-------------------
Approaches include:
- **3D CNNs**: Extend convolutions to spatio-temporal domains (e.g., C3D, I3D).
- **Two-Stream Networks**: Separate RGB (appearance) and optical flow (motion) streams.
- **Non-local Blocks**: Capture long-range dependencies via self-attention.
Visualization Techniques
------------------------
- **Class Activation Mapping (Grad-CAM)**: Highlights image regions influencing predictions.
- **Filter Visualization**: Shows learned convolutional features.
- **Attention Maps**: Reveals token interactions in Vision Transformers.