KNN (K-Nearest Neighbor) Algorithm
KNN is a classification algorithm in supervised learning that stands out because it can be considered both a model-free algorithm and one where the training dataset itself serves as the model.
KNN Algorithm Principles
When predicting a new value, the KNN algorithm determines its class based on the classes of the K closest points in the feature space.
Distance Metrics in KNN
Implementing KNN requires calculating distances between points. Common distance metrics include:
-
Euclidean Distance: [\sqrt{(x^{(a)}{(1)} - x^{(b)}{(1)})^2 + \cdots + (x^{(a)}{(n)} - x^{(b)}{(n)})^2} = ({\sum {i=1}^n(x{(i)}^{(a)} - x_{(i)}^{(b)})^2})^{(\frac{1}{2})} ]
-
Manhattan Distance: [\sum_{i=1}^n {|x_{(i)}^{(a)} - x_{(i)}^{(b)}|} ]
-
Minkowski Distance:
- p is a hyperparameter
- When p=2, it equals Euclidean distance [(\sum_{i=1}^{n} |x_{(i)}^{(a)} - x_{(i)}^{(b)}|^{p})^{(\frac{1}{p})} ]
-
Cosine Similarity: Measures the cosine of the angle between two vectors in a vector space, where values closer to 1 indicate higher similarity. [\cos \theta = \frac{\displaystyle\sum_{i=1}^n (x_{(i)}^{(a)}\times x_{(i)}^{(b)})}{\sqrt{\displaystyle\sum_{i=1}^{n} (x_{(i)}^{(a)})^2}\times \sqrt{\displaystyle\sum_{i=1}^{n}(x_{(i)}^{(b)})^{2}}} ]
-
Pearson Correlation Coefficient: The ratio of covariance to the product of standard deviations between two variables. [\rho {X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma{X}\sigma_{Y}} = \frac{E[(X-\mu_{x})(Y-\mu_{y})]}{\sigma_{X}\sigma_{Y}} ]
KNN Implementation
import numpy as np
from math import sqrt
from collections import Counter
class NearestNeighborClassifier:
def __init__(self, neighbors_count):
# Initialize the classifier
assert neighbors_count >= 1, "Neighbor count must be positive"
self.neighbors = neighbors_count
self.training_features = None
self.training_labels = None
def train(self, feature_matrix, label_vector):
# Train the classifier with provided data
assert feature_matrix.shape[0] == label_vector.shape[0], \
"Number of samples must match between features and labels"
assert self.neighbors <= feature_matrix.shape[0], \
"Training samples must be at least equal to neighbor count"
self.training_features = feature_matrix
self.training_labels = label_vector
return self
# Predict class for new data points
def predict(self, unknown_samples):
assert self.training_features is not None and self.training_labels is not None, \
"Must train the model before making predictions"
assert unknown_samples.shape[1] == self.training_features.shape[1], \
"Feature dimensions must match training data"
predictions = [self._classify(sample) for sample in unknown_samples]
return np.array(predictions)
# Classify a single data point
def _classify(self, data_point):
# Classify a single unknown sample and return the result
assert data_point.shape[0] == self.training_features.shape[1], \
"Feature dimensions must match training data"
# Calculate distances to all training points
distances = [sqrt(np.sum((training_point - data_point) ** 2))
for training_point in self.training_features]
# Find indices of k nearest neighbors
nearest_indices = np.argsort(distances)
k_nearest_labels = [self.training_labels[i] for i in nearest_indices[:self.neighbors]]
# Return the most common label among neighbors
return Counter(k_nearest_labels).most_common(1)[0][0]
def __repr__(self):
return "NearestNeighborClassifier(neighbors=%d)" % self.neighbors
Testing the Implementation
- Test Data
# Dataset
raw_features = [[3.393533211, 2.331273381],
[3.110073483, 1.781539638],
[1.343808831, 3.368360954],
[3.582294042, 4.679179110],
[2.280362439, 2.866990263],
[7.423436942, 4.696522875],
[5.745051997, 3.533989803],
[9.172168622, 2.511101045],
[7.792783481, 3.424088941],
[7.939820817, 0.791637231]]
class_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
training_features = np.array(raw_features)
training_labels = np.array(class_labels)
# Sample to predict (must be provided as matrix)
unknown_sample = np.array([8.093607318, 3.365731514]).reshape(1,-1)
Advantages of KNN
- Effective classification performance
- Simple concept and implementation
- Robust to outliers
- Minimal mathematical requiremants
- Proivdes a clear illustration of the machine learning workflow
Disadvantages of KNN
- High computational and space complexity
- Sensitive to class imbalance issues
- Performance degrades with high-dimensional data
- Cannot provide insights into data intrinsic relationships