Anomaly Detection and Recommender Systems

1. Anomaly Detection

Anomaly detection identifies unusual patterns that deviate from expected behavior. While primarily an unsupervised learning task, it shares characteristics with supervised learning in certain aspects.

1.1 Algorithm Overview

Given a dataset of normal examples, the goal is to build a probability model that flags observations with low probability as anomalies.

Training Data: (x^{(1)}, x^{(2)}, ..., x^{(m)}) where each (x^{(i)} \in \mathbb{R}^n)

Feature Selection: Choose features (x_j) that exhibit unusually large or small values when anomalies ocur.

Parameter Estimation: Each feature is modeled as a Gaussian distribution:

$$x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$$

The parameters are estimated using maximum likelihood:

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}$$

$$\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}(x_j^{(i)} - \mu_j)^2$$

Probability Computation: For a new example (x), compute the joint probability:

$$P(x) = \prod_{j=1}^{n} P(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sigma_j\sqrt{2\pi}}\exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$

Classification Rule:

$$y = \begin{cases} 1 & \text{if } p(x) < \epsilon \text{ (anomaly)} \ 0 & \text{if } p(x) \geq \epsilon \text{ (normal)} \end{cases}$$

Selecting the threshold (\epsilon) requires labeled data, which explains the semi-supervised nature of this approach.

1.2 System Evaluation

Having labeled data enables systematic evaluation of the algorithm.

Dataset Labeling Convention:

  • (y = 0): Normal example
  • (y = 1): Anomalous example

Training Procedure: Fit (p(x)) exclusively on negative examples (normal cases). Do not include positive examples during training.

Typical Split (Aircraft Engine Example):

  • 10,000 normal engines, 20 anomalous engines
  • Training set: 6,000 normal engines
  • Cross-validation set: 2,000 normal + 10 anomalous
  • Test set: 2,000 normal + 10 anomalous

Prediction for New Examples:

$$y = \begin{cases} 1 & \text{if } p(x) < \epsilon \ 0 & \text{otherwise} \end{cases}$$

Performance Metrics:

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

The threshold (\epsilon) is selected by testing multiple values and choosing the one that maximizes (F_1) score.

# True positives: predicted anomaly, actually anomaly
tp = np.sum((predictions == 1) & (y_true == 1))

# False positives: predicted anomaly, actually normal
fp = np.sum((predictions == 1) & (y_true == 0))

# False negatives: predicted normal, actually anomaly
fn = np.sum((predictions == 0) & (y_true == 1))

1.3 Comparison with Supervised Learning

Scenario Anomaly Detection Supervised Learning
Positive examples Very few (0-20) Abundant
Negative examples Large quantity Abundant
Anomaly diversity Many different types; future anomalies may differ from training examples Sufficient samples to learn patterns
Applications Fraud detection, quality control, monitoring Classification, weather prediction, diagnosis

1.4 Multivariate Gaussian Distribution

Standard Gaussian models assume feature independence. When features exhibit correlations, the multivariate Gaussian distribution provides better modeling.

Parameter Estimation:

$$\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}$$

$$\Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)} - \mu)(x^{(i)} - \mu)^T$$

Probability Density:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)$$

Comparison of Approaches:

Aspect Original Model Multivariate Gaussian
Formula Product of independent Gaussians Full covariance matrix
Feature engineering Manual creation of interaction features Automatic correlation capture
Computational cost Lower Higher
Sample requirements Works with small datasets Requires (m > n)

The covariance matrix (\Sigma) controls the orientation and shape of the distribution. Diagonal entries correspond to individual feature variances.

2. Recommender Systems

2.1 Problem Definition

Recommender systems predict user preferences for items. Key applications include product recommendations (Amazon), movie suggestions (Netflix), and content personalization.

Notation:

  • (n_u): Number of users
  • (n_m): Number of movies
  • (r(i,j) = 1): User (j) has rated movie (i)
  • (y^{(i,j)}): Rating by user (j) on movie (i) (defined only when (r(i,j) = 1))
  • (\theta^{(j)} \in \mathbb{R}^n): Preference parameter vector for user (j)
  • (x^{(i)} \in \mathbb{R}^n): Feature vector for movie (i)

Prediction: User (j)'s rating for movie (i) is predicted as:

$$\hat{y}^{(i,j)} = (\theta^{(j)})^T x^{(i)}$$

2.2 Optimization Objectives

Objective 1: Learning User Parameters

For a single user (j):

$$J(\theta^{(j)}) = \frac{1}{2}\sum_{i: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$

Only movies rated by user (j) contribute to the cost. Unrated entries are effectively excluded from the loss computation.

Objective 2: Learning All User Parameters

$$J(\theta^{(1)}, ..., \theta^{(n_u)}) = \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$

Objective 3: Learning Movie Features

Given user parameters, learn movie features:

$$J(x^{(i)}) = \frac{1}{2}\sum_{j: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n}(x_k^{(i)})^2$$

Objective 4: Learning All Movie Features

$$J(x^{(1)}, ..., x^{(n_m)}) = \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j: r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2$$

2.3 Collaborative Filtering

The key insight is that user preferences and item features can be learned simultaneously in a joint optimization problem.

Combined Objective:

$$J(\theta^{(1)}, ..., \theta^{(n_u)}, x^{(1)}, ..., x^{(n_m)}) = \frac{1}{2}\sum_{(i,j): r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2$$

Algorithm:

  1. Initialize (\theta^{(1)}, ..., \theta^{(n_u)}) and (x^{(1)}, ..., x^{(n_m)}) to small random values
  2. Minimize (J) using gradient descent for all parameters
  3. Predict ratings as (\theta^T x)
# Cost function computation
cost = (1/2) * np.sum(((movie_features @ user_params.T - ratings) ** 2) * rating_mask) \
       + (lambda_reg / 2) * np.sum(user_params ** 2) \
       + (lambda_reg / 2) * np.sum(movie_features ** 2)

# Gradient with respect to movie features
feat_grad = (user_params.T @ ((user_params @ movie_features.T - ratings.T) * rating_mask.T)).T \
            + lambda_reg * movie_features

# Gradient with respect to user parameters
param_grad = (movie_features.T @ ((movie_features @ user_params.T - ratings) * rating_mask)).T \
             + lambda_reg * user_params

Advantages of Collaborative Filtering:

  • Automatically discovers relevant features without manual feature engineering
  • Captures latent patterns in user behavior and item characteristics
  • Handles the cold-start problem through learned representations

Tags: Machine Learning anomaly detection recommender systems gaussian distribution collaborative filtering

Posted on Sun, 05 Jul 2026 16:37:15 +0000 by mewhocorrupts