Understanding the K-Nearest Neighbor Algorithm

KNN (K-Nearest Neighbor) Algorithm

KNN is a classification algorithm in supervised learning that stands out because it can be considered both a model-free algorithm and one where the training dataset itself serves as the model.

KNN Algorithm Principles

When predicting a new value, the KNN algorithm determines its class based on the classes of the K closest points in the feature space.

Distance Metrics in KNN

Implementing KNN requires calculating distances between points. Common distance metrics include:

Euclidean Distance: [\sqrt{(x^{(a)}{(1)} - x^{(b)}{(1)})^2 + \cdots + (x^{(a)}{(n)} - x^{(b)}{(n)})^2} = ({\sum {i=1}^n(x{(i)}^{(a)} - x_{(i)}^{(b)})^2})^{(\frac{1}{2})} ]
Manhattan Distance: [\sum_{i=1}^n {|x_{(i)}^{(a)} - x_{(i)}^{(b)}|} ]
Minkowski Distance:
- p is a hyperparameter
- When p=2, it equals Euclidean distance [(\sum_{i=1}^{n} |x_{(i)}^{(a)} - x_{(i)}^{(b)}|^{p})^{(\frac{1}{p})} ]
Cosine Similarity: Measures the cosine of the angle between two vectors in a vector space, where values closer to 1 indicate higher similarity. [\cos \theta = \frac{\displaystyle\sum_{i=1}^n (x_{(i)}^{(a)}\times x_{(i)}^{(b)})}{\sqrt{\displaystyle\sum_{i=1}^{n} (x_{(i)}^{(a)})^2}\times \sqrt{\displaystyle\sum_{i=1}^{n}(x_{(i)}^{(b)})^{2}}} ]
Pearson Correlation Coefficient: The ratio of covariance to the product of standard deviations between two variables. [\rho {X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma{X}\sigma_{Y}} = \frac{E[(X-\mu_{x})(Y-\mu_{y})]}{\sigma_{X}\sigma_{Y}} ]

KNN Implementation

import numpy as np
from math import sqrt
from collections import Counter

class NearestNeighborClassifier:
    def __init__(self, neighbors_count):
        # Initialize the classifier
        assert neighbors_count >= 1, "Neighbor count must be positive"
        self.neighbors = neighbors_count
        self.training_features = None
        self.training_labels = None

    def train(self, feature_matrix, label_vector):
        # Train the classifier with provided data
        assert feature_matrix.shape[0] == label_vector.shape[0], \
            "Number of samples must match between features and labels"
        assert self.neighbors <= feature_matrix.shape[0], \
            "Training samples must be at least equal to neighbor count"
        self.training_features = feature_matrix
        self.training_labels = label_vector
        return self

    # Predict class for new data points
    def predict(self, unknown_samples):
        assert self.training_features is not None and self.training_labels is not None, \
            "Must train the model before making predictions"
        assert unknown_samples.shape[1] == self.training_features.shape[1], \
            "Feature dimensions must match training data"
        predictions = [self._classify(sample) for sample in unknown_samples]
        return np.array(predictions)

    # Classify a single data point
    def _classify(self, data_point):
        # Classify a single unknown sample and return the result
        assert data_point.shape[0] == self.training_features.shape[1], \
            "Feature dimensions must match training data"

        # Calculate distances to all training points
        distances = [sqrt(np.sum((training_point - data_point) ** 2)) 
                     for training_point in self.training_features]
        
        # Find indices of k nearest neighbors
        nearest_indices = np.argsort(distances)
        k_nearest_labels = [self.training_labels[i] for i in nearest_indices[:self.neighbors]]
        
        # Return the most common label among neighbors
        return Counter(k_nearest_labels).most_common(1)[0][0]

    def __repr__(self):
        return "NearestNeighborClassifier(neighbors=%d)" % self.neighbors

Testing the Implementation

Test Data

# Dataset
raw_features = [[3.393533211, 2.331273381],
               [3.110073483, 1.781539638],
               [1.343808831, 3.368360954],
               [3.582294042, 4.679179110],
               [2.280362439, 2.866990263],
               [7.423436942, 4.696522875],
               [5.745051997, 3.533989803],
               [9.172168622, 2.511101045],
               [7.792783481, 3.424088941],
               [7.939820817, 0.791637231]]
class_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
training_features = np.array(raw_features)
training_labels = np.array(class_labels)
# Sample to predict (must be provided as matrix)
unknown_sample = np.array([8.093607318, 3.365731514]).reshape(1,-1)

Advantages of KNN

Effective classification performance
Simple concept and implementation
Robust to outliers
Minimal mathematical requiremants
Proivdes a clear illustration of the machine learning workflow

Disadvantages of KNN

High computational and space complexity
Sensitive to class imbalance issues
Performance degrades with high-dimensional data
Cannot provide insights into data intrinsic relationships

Tags: KNN-algorithm supervised-learning Classification distance-metrics machine-learning

Posted on Sun, 31 May 2026 17:36:19 +0000 by Rayhan Muktader

Freaks City