Anomaly Detection and Recommender Systems

1. Anomaly Detection

Anomaly detection identifies unusual patterns that deviate from expected behavior. While primarily an unsupervised learning task, it shares characteristics with supervised learning in certain aspects.

1.1 Algorithm Overview

Given a dataset of normal examples, the goal is to build a probability model that flags observations with low probability as anomalies.

Training Data: (x^{(1)}, x^{(2)}, ..., x^{(m)}) where each (x^{(i)} \in \mathbb{R}^n)

Feature Selection: Choose features (x_j) that exhibit unusually large or small values when anomalies ocur.

Parameter Estimation: Each feature is modeled as a Gaussian distribution:

$$x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$$

The parameters are estimated using maximum likelihood:

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}$$

$$\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}(x_j^{(i)} - \mu_j)^2$$

Probability Computation: For a new example (x), compute the joint probability:

$$P(x) = \prod_{j=1}^{n} P(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sigma_j\sqrt{2\pi}}\exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$

Classification Rule:

$$y = \begin{cases} 1 & \text{if } p(x) < \epsilon \text{ (anomaly)} \ 0 & \text{if } p(x) \geq \epsilon \text{ (normal)} \end{cases}$$

Selecting the threshold (\epsilon) requires labeled data, which explains the semi-supervised nature of this approach.

1.2 System Evaluation

Having labeled data enables systematic evaluation of the algorithm.

Dataset Labeling Convention:

(y = 0): Normal example
(y = 1): Anomalous example

Training Procedure: Fit (p(x)) exclusively on negative examples (normal cases). Do not include positive examples during training.

Typical Split (Aircraft Engine Example):

10,000 normal engines, 20 anomalous engines
Training set: 6,000 normal engines
Cross-validation set: 2,000 normal + 10 anomalous
Test set: 2,000 normal + 10 anomalous

Prediction for New Examples:

$$y = \begin{cases} 1 & \text{if } p(x) < \epsilon \ 0 & \text{otherwise} \end{cases}$$

Performance Metrics:

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

The threshold (\epsilon) is selected by testing multiple values and choosing the one that maximizes (F_1) score.

# True positives: predicted anomaly, actually anomaly
tp = np.sum((predictions == 1) & (y_true == 1))

# False positives: predicted anomaly, actually normal
fp = np.sum((predictions == 1) & (y_true == 0))

# False negatives: predicted normal, actually anomaly
fn = np.sum((predictions == 0) & (y_true == 1))

1.3 Comparison with Supervised Learning

Scenario	Anomaly Detection	Supervised Learning
Positive examples	Very few (0-20)	Abundant
Negative examples	Large quantity	Abundant
Anomaly diversity	Many different types; future anomalies may differ from training examples	Sufficient samples to learn patterns
Applications	Fraud detection, quality control, monitoring	Classification, weather prediction, diagnosis

1.4 Multivariate Gaussian Distribution

Standard Gaussian models assume feature independence. When features exhibit correlations, the multivariate Gaussian distribution provides better modeling.

Parameter Estimation:

$$\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}$$

$$\Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)} - \mu)(x^{(i)} - \mu)^T$$

Probability Density:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)$$

Comparison of Approaches:

Aspect	Original Model	Multivariate Gaussian
Formula	Product of independent Gaussians	Full covariance matrix
Feature engineering	Manual creation of interaction features	Automatic correlation capture
Computational cost	Lower	Higher
Sample requirements	Works with small datasets	Requires (m > n)

The covariance matrix (\Sigma) controls the orientation and shape of the distribution. Diagonal entries correspond to individual feature variances.

2. Recommender Systems

2.1 Problem Definition

Recommender systems predict user preferences for items. Key applications include product recommendations (Amazon), movie suggestions (Netflix), and content personalization.

Notation:

(n_u): Number of users
(n_m): Number of movies
(r(i,j) = 1): User (j) has rated movie (i)
(y^{(i,j)}): Rating by user (j) on movie (i) (defined only when (r(i,j) = 1))
(\theta^{(j)} \in \mathbb{R}^n): Preference parameter vector for user (j)
(x^{(i)} \in \mathbb{R}^n): Feature vector for movie (i)

Prediction: User (j)'s rating for movie (i) is predicted as:

$$\hat{y}^{(i,j)} = (\theta^{(j)})^T x^{(i)}$$

2.2 Optimization Objectives

Objective 1: Learning User Parameters

For a single user (j):

$$J(\theta^{(j)}) = \frac{1}{2}\sum_{i: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$

Only movies rated by user (j) contribute to the cost. Unrated entries are effectively excluded from the loss computation.

Objective 2: Learning All User Parameters

$$J(\theta^{(1)}, ..., \theta^{(n_u)}) = \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$

Objective 3: Learning Movie Features

Given user parameters, learn movie features:

$$J(x^{(i)}) = \frac{1}{2}\sum_{j: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n}(x_k^{(i)})^2$$

Objective 4: Learning All Movie Features

$$J(x^{(1)}, ..., x^{(n_m)}) = \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2$$

2.3 Collaborative Filtering

The key insight is that user preferences and item features can be learned simultaneously in a joint optimization problem.

Combined Objective:

$$J(\theta^{(1)}, ..., \theta^{(n_u)}, x^{(1)}, ..., x^{(n_m)}) = \frac{1}{2}\sum_{(i,j): r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2$$

Algorithm:

Initialize (\theta^{(1)}, ..., \theta^{(n_u)}) and (x^{(1)}, ..., x^{(n_m)}) to small random values
Minimize (J) using gradient descent for all parameters
Predict ratings as (\theta^T x)

# Cost function computation
cost = (1/2) * np.sum(((movie_features @ user_params.T - ratings) ** 2) * rating_mask) \
       + (lambda_reg / 2) * np.sum(user_params ** 2) \
       + (lambda_reg / 2) * np.sum(movie_features ** 2)

# Gradient with respect to movie features
feat_grad = (user_params.T @ ((user_params @ movie_features.T - ratings.T) * rating_mask.T)).T \
            + lambda_reg * movie_features

# Gradient with respect to user parameters
param_grad = (movie_features.T @ ((movie_features @ user_params.T - ratings) * rating_mask)).T \
             + lambda_reg * user_params

Advantages of Collaborative Filtering:

Automatically discovers relevant features without manual feature engineering
Captures latent patterns in user behavior and item characteristics
Handles the cold-start problem through learned representations

Tags: Machine Learning anomaly detection recommender systems gaussian distribution collaborative filtering

Posted on Sun, 05 Jul 2026 16:37:15 +0000 by mewhocorrupts

Freaks City