1. Anomaly Detection
Anomaly detection identifies unusual patterns that deviate from expected behavior. While primarily an unsupervised learning task, it shares characteristics with supervised learning in certain aspects.
1.1 Algorithm Overview
Given a dataset of normal examples, the goal is to build a probability model that flags observations with low probability as anomalies.
Training Data: (x^{(1)}, x^{(2)}, ..., x^{(m)}) where each (x^{(i)} \in \mathbb{R}^n)
Feature Selection: Choose features (x_j) that exhibit unusually large or small values when anomalies ocur.
Parameter Estimation: Each feature is modeled as a Gaussian distribution:
$$x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$$
The parameters are estimated using maximum likelihood:
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}$$
$$\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}(x_j^{(i)} - \mu_j)^2$$
Probability Computation: For a new example (x), compute the joint probability:
$$P(x) = \prod_{j=1}^{n} P(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sigma_j\sqrt{2\pi}}\exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
Classification Rule:
$$y = \begin{cases} 1 & \text{if } p(x) < \epsilon \text{ (anomaly)} \ 0 & \text{if } p(x) \geq \epsilon \text{ (normal)} \end{cases}$$
Selecting the threshold (\epsilon) requires labeled data, which explains the semi-supervised nature of this approach.
1.2 System Evaluation
Having labeled data enables systematic evaluation of the algorithm.
Dataset Labeling Convention:
- (y = 0): Normal example
- (y = 1): Anomalous example
Training Procedure: Fit (p(x)) exclusively on negative examples (normal cases). Do not include positive examples during training.
Typical Split (Aircraft Engine Example):
- 10,000 normal engines, 20 anomalous engines
- Training set: 6,000 normal engines
- Cross-validation set: 2,000 normal + 10 anomalous
- Test set: 2,000 normal + 10 anomalous
Prediction for New Examples:
$$y = \begin{cases} 1 & \text{if } p(x) < \epsilon \ 0 & \text{otherwise} \end{cases}$$
Performance Metrics:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
The threshold (\epsilon) is selected by testing multiple values and choosing the one that maximizes (F_1) score.
# True positives: predicted anomaly, actually anomaly
tp = np.sum((predictions == 1) & (y_true == 1))
# False positives: predicted anomaly, actually normal
fp = np.sum((predictions == 1) & (y_true == 0))
# False negatives: predicted normal, actually anomaly
fn = np.sum((predictions == 0) & (y_true == 1))
1.3 Comparison with Supervised Learning
| Scenario | Anomaly Detection | Supervised Learning |
|---|---|---|
| Positive examples | Very few (0-20) | Abundant |
| Negative examples | Large quantity | Abundant |
| Anomaly diversity | Many different types; future anomalies may differ from training examples | Sufficient samples to learn patterns |
| Applications | Fraud detection, quality control, monitoring | Classification, weather prediction, diagnosis |
1.4 Multivariate Gaussian Distribution
Standard Gaussian models assume feature independence. When features exhibit correlations, the multivariate Gaussian distribution provides better modeling.
Parameter Estimation:
$$\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}$$
$$\Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)} - \mu)(x^{(i)} - \mu)^T$$
Probability Density:
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)$$
Comparison of Approaches:
| Aspect | Original Model | Multivariate Gaussian |
|---|---|---|
| Formula | Product of independent Gaussians | Full covariance matrix |
| Feature engineering | Manual creation of interaction features | Automatic correlation capture |
| Computational cost | Lower | Higher |
| Sample requirements | Works with small datasets | Requires (m > n) |
The covariance matrix (\Sigma) controls the orientation and shape of the distribution. Diagonal entries correspond to individual feature variances.
2. Recommender Systems
2.1 Problem Definition
Recommender systems predict user preferences for items. Key applications include product recommendations (Amazon), movie suggestions (Netflix), and content personalization.
Notation:
- (n_u): Number of users
- (n_m): Number of movies
- (r(i,j) = 1): User (j) has rated movie (i)
- (y^{(i,j)}): Rating by user (j) on movie (i) (defined only when (r(i,j) = 1))
- (\theta^{(j)} \in \mathbb{R}^n): Preference parameter vector for user (j)
- (x^{(i)} \in \mathbb{R}^n): Feature vector for movie (i)
Prediction: User (j)'s rating for movie (i) is predicted as:
$$\hat{y}^{(i,j)} = (\theta^{(j)})^T x^{(i)}$$
2.2 Optimization Objectives
Objective 1: Learning User Parameters
For a single user (j):
$$J(\theta^{(j)}) = \frac{1}{2}\sum_{i: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$
Only movies rated by user (j) contribute to the cost. Unrated entries are effectively excluded from the loss computation.
Objective 2: Learning All User Parameters
$$J(\theta^{(1)}, ..., \theta^{(n_u)}) = \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$
Objective 3: Learning Movie Features
Given user parameters, learn movie features:
$$J(x^{(i)}) = \frac{1}{2}\sum_{j: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n}(x_k^{(i)})^2$$
Objective 4: Learning All Movie Features
$$J(x^{(1)}, ..., x^{(n_m)}) = \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2$$
2.3 Collaborative Filtering
The key insight is that user preferences and item features can be learned simultaneously in a joint optimization problem.
Combined Objective:
$$J(\theta^{(1)}, ..., \theta^{(n_u)}, x^{(1)}, ..., x^{(n_m)}) = \frac{1}{2}\sum_{(i,j): r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2$$
Algorithm:
- Initialize (\theta^{(1)}, ..., \theta^{(n_u)}) and (x^{(1)}, ..., x^{(n_m)}) to small random values
- Minimize (J) using gradient descent for all parameters
- Predict ratings as (\theta^T x)
# Cost function computation
cost = (1/2) * np.sum(((movie_features @ user_params.T - ratings) ** 2) * rating_mask) \
+ (lambda_reg / 2) * np.sum(user_params ** 2) \
+ (lambda_reg / 2) * np.sum(movie_features ** 2)
# Gradient with respect to movie features
feat_grad = (user_params.T @ ((user_params @ movie_features.T - ratings.T) * rating_mask.T)).T \
+ lambda_reg * movie_features
# Gradient with respect to user parameters
param_grad = (movie_features.T @ ((movie_features @ user_params.T - ratings) * rating_mask)).T \
+ lambda_reg * user_params
Advantages of Collaborative Filtering:
- Automatically discovers relevant features without manual feature engineering
- Captures latent patterns in user behavior and item characteristics
- Handles the cold-start problem through learned representations