Understanding the K-Nearest Neighbor Algorithm

KNN (K-Nearest Neighbor) Algorithm

KNN is a classification algorithm in supervised learning that stands out because it can be considered both a model-free algorithm and one where the training dataset itself serves as the model.

KNN Algorithm Principles

When predicting a new value, the KNN algorithm determines its class based on the classes of the K closest points in the feature space.

Distance Metrics in KNN

Implementing KNN requires calculating distances between points. Common distance metrics include:

  • Euclidean Distance: [\sqrt{(x^{(a)}{(1)} - x^{(b)}{(1)})^2 + \cdots + (x^{(a)}{(n)} - x^{(b)}{(n)})^2} = ({\sum {i=1}^n(x{(i)}^{(a)} - x_{(i)}^{(b)})^2})^{(\frac{1}{2})} ]

  • Manhattan Distance: [\sum_{i=1}^n {|x_{(i)}^{(a)} - x_{(i)}^{(b)}|} ]

  • Minkowski Distance:

    • p is a hyperparameter
    • When p=2, it equals Euclidean distance [(\sum_{i=1}^{n} |x_{(i)}^{(a)} - x_{(i)}^{(b)}|^{p})^{(\frac{1}{p})} ]
  • Cosine Similarity: Measures the cosine of the angle between two vectors in a vector space, where values closer to 1 indicate higher similarity. [\cos \theta = \frac{\displaystyle\sum_{i=1}^n (x_{(i)}^{(a)}\times x_{(i)}^{(b)})}{\sqrt{\displaystyle\sum_{i=1}^{n} (x_{(i)}^{(a)})^2}\times \sqrt{\displaystyle\sum_{i=1}^{n}(x_{(i)}^{(b)})^{2}}} ]

  • Pearson Correlation Coefficient: The ratio of covariance to the product of standard deviations between two variables. [\rho {X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma{X}\sigma_{Y}} = \frac{E[(X-\mu_{x})(Y-\mu_{y})]}{\sigma_{X}\sigma_{Y}} ]

KNN Implementation

import numpy as np
from math import sqrt
from collections import Counter

class NearestNeighborClassifier:
    def __init__(self, neighbors_count):
        # Initialize the classifier
        assert neighbors_count >= 1, "Neighbor count must be positive"
        self.neighbors = neighbors_count
        self.training_features = None
        self.training_labels = None

    def train(self, feature_matrix, label_vector):
        # Train the classifier with provided data
        assert feature_matrix.shape[0] == label_vector.shape[0], \
            "Number of samples must match between features and labels"
        assert self.neighbors <= feature_matrix.shape[0], \
            "Training samples must be at least equal to neighbor count"
        self.training_features = feature_matrix
        self.training_labels = label_vector
        return self

    # Predict class for new data points
    def predict(self, unknown_samples):
        assert self.training_features is not None and self.training_labels is not None, \
            "Must train the model before making predictions"
        assert unknown_samples.shape[1] == self.training_features.shape[1], \
            "Feature dimensions must match training data"
        predictions = [self._classify(sample) for sample in unknown_samples]
        return np.array(predictions)

    # Classify a single data point
    def _classify(self, data_point):
        # Classify a single unknown sample and return the result
        assert data_point.shape[0] == self.training_features.shape[1], \
            "Feature dimensions must match training data"

        # Calculate distances to all training points
        distances = [sqrt(np.sum((training_point - data_point) ** 2)) 
                     for training_point in self.training_features]
        
        # Find indices of k nearest neighbors
        nearest_indices = np.argsort(distances)
        k_nearest_labels = [self.training_labels[i] for i in nearest_indices[:self.neighbors]]
        
        # Return the most common label among neighbors
        return Counter(k_nearest_labels).most_common(1)[0][0]

    def __repr__(self):
        return "NearestNeighborClassifier(neighbors=%d)" % self.neighbors

Testing the Implementation

  • Test Data
# Dataset
raw_features = [[3.393533211, 2.331273381],
               [3.110073483, 1.781539638],
               [1.343808831, 3.368360954],
               [3.582294042, 4.679179110],
               [2.280362439, 2.866990263],
               [7.423436942, 4.696522875],
               [5.745051997, 3.533989803],
               [9.172168622, 2.511101045],
               [7.792783481, 3.424088941],
               [7.939820817, 0.791637231]]
class_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
training_features = np.array(raw_features)
training_labels = np.array(class_labels)
# Sample to predict (must be provided as matrix)
unknown_sample = np.array([8.093607318, 3.365731514]).reshape(1,-1)

Advantages of KNN

  • Effective classification performance
  • Simple concept and implementation
  • Robust to outliers
  • Minimal mathematical requiremants
  • Proivdes a clear illustration of the machine learning workflow

Disadvantages of KNN

  • High computational and space complexity
  • Sensitive to class imbalance issues
  • Performance degrades with high-dimensional data
  • Cannot provide insights into data intrinsic relationships

Tags: KNN-algorithm supervised-learning Classification distance-metrics machine-learning

Posted on Sun, 31 May 2026 17:36:19 +0000 by Rayhan Muktader