Iris Species Classification Using K-Nearest Neighbors Algorithm

Dataset Overview

The Iris dataset, collected by Fisher in 1936, is a widely used classification dataset containing 150 samples from three iris species: Setosa, Versicolor, and Virginica. Each species has 50 samples with four features: sepal length, sepal width, petal length, and petal width.
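
The sample and feature counts quoted above can be checked directly against the copy of the dataset that ships with scikit-learn. The snippet below is a minimal sketch; the variable names iris and class_counts are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Shape of the feature matrix: (number of samples, number of features)
print(iris.data.shape)

# Count how many samples carry each class label (0, 1, 2)
class_counts = np.bincount(iris.target)
per_species = {str(name): int(count)
               for name, count in zip(iris.target_names, class_counts)}
print(per_species)
```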

In machine learning practice, data collection is typically performed by domain experts who understand which features are relevant for classification tasks.

Loading and Exploring the Dataset

Scikit-learn Dataset APIs

Scikit-learn provides two categories of datasets:

  1. Small datasets: Loaded using sklearn.datasets.load_*() functions, stored locally within the library
  2. Large datasets: Retrieved using sklearn.datasets.fetch_*() functions, downloaded from remote sources

from sklearn.datasets import load_iris

# Load the Iris dataset
iris_data = load_iris()
print(iris_data)

For larger datasets requiring network download:

from sklearn.datasets import fetch_20newsgroups

# Fetch 20 newsgroups dataset
news_data = fetch_20newsgroups()
print(news_data)

Dataset Structure

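Both load and fetch functions return sklearn.utils.Bunch objects, dictionary-like containers whose keys can also be read as attributes. For a numeric dataset such as Iris, the key attributes are: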

  • data: Feature matrix as 2D numpy array
  • target: Label vector as 1D numpy array
  • DESCR: Dataset description
  • feature_names: List of feature names
  • target_names: List of target class names

from sklearn.datasets import load_iris

# Load and examine dataset components
iris_dataset = load_iris()

print('Feature matrix:\n', iris_dataset.data)
print('Target labels:\n', iris_dataset['target'])
print('Feature names:\n', iris_dataset.feature_names)
print('Target names:\n', iris_dataset.target_names)
print('Dataset description:\n', iris_dataset.DESCR)
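
As an alternative to reading the arrays one by one, load_iris accepts as_frame=True, in which case the returned Bunch also carries a pandas DataFrame under the frame attribute. The sketch below assumes pandas is installed; the names iris_bunch and iris_frame are illustrative.

```python
from sklearn.datasets import load_iris

# as_frame=True returns the data as pandas objects instead of NumPy arrays
iris_bunch = load_iris(as_frame=True)

# 'frame' holds the four feature columns plus a 'target' column
iris_frame = iris_bunch.frame

print(iris_frame.head())
print(iris_frame.describe())

# Key access and attribute access refer to the same object
print(iris_bunch['data'] is iris_bunch.data)
```

A DataFrame is convenient for quick summaries such as describe(), while the plain arrays are what the estimators in the later sections consume.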

Data Visualization

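Visualizing data distributions helps show how well the classes separate on a given pair of features. In the Iris data, Setosa separates cleanly from the other two species on the petal measurements, while Versicolor and Virginica overlap, so a single pair of features is not always enough to tell classes apart.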

Seaborn provides high-level plotting capabilities built on Matplotlib:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Load dataset
iris_samples = load_iris()

# Convert to DataFrame for easier manipulation
iris_df = pd.DataFrame(iris_samples.data, 
                      columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_df['Species'] = iris_samples.target


def visualize_distribution(dataset, x_col, y_col):
    sns.lmplot(x=x_col, y=y_col, data=dataset, hue='Species', fit_reg=False)
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.title('Iris Species Distribution')
    plt.show()

visualize_distribution(iris_df, 'Petal_Width', 'Sepal_Length')
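
The plot above colors points by the numeric codes 0, 1, and 2 stored in target, so the legend shows numbers rather than species. One possible way to get readable labels is to map the codes to target_names before plotting. This is a sketch, not part of the original walkthrough; named_df is a name introduced here.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

named_df = pd.DataFrame(
    iris.data,
    columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])

# Translate the integer codes 0/1/2 into the matching species names
named_df['Species'] = pd.Categorical.from_codes(
    iris.target, categories=iris.target_names)

print(named_df['Species'].value_counts())
```

Passing named_df to visualize_distribution in place of iris_df should then produce a legend with species names.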

Dataset Partitioning

Machine learning datasets are typically divided into:

  • Training set: Used to build the model (70-80% of data)
  • Testing set: Used to evaluate model performance (20-30% of data)

Train-Test Split API

sklearn.model_selection.train_test_split() handles dataset partitioning:

Parameters:

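  • *arrays: One or more arrays of equal length, typically the feature matrix followed by the target vector
  • test_size: Proportion of samples held out for testing (a float between 0 and 1), or an absolute number of samples if an integer is given
  • random_state: Seed for reproducible sampling

For each input array, the function returns the training part followed by the testing part. With a feature matrix and a target vector, that gives four arrays: training features, testing features, training targets, testing targets.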

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris_collection = load_iris()

# Split dataset
features_train, features_test, targets_train, targets_test = train_test_split(
    iris_collection.data, iris_collection.target, 
    test_size=0.2, random_state=22
)

print('Training features:\n', features_train)
print('Training targets:\n', targets_train)
print('Testing features:\n', features_test)
print('Testing targets:\n', targets_test)
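
A plain random split does not guarantee that each species appears in the test set in the same proportion as in the full dataset. train_test_split also accepts a stratify argument that preserves class proportions. The sketch below reuses the same test_size and random_state as above; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target keeps the class balance identical in both parts
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.2, random_state=22, stratify=iris.target
)

print('Train/test shapes:', X_train.shape, X_test.shape)
print('Class counts in training set:', np.bincount(y_train))
print('Class counts in testing set:', np.bincount(y_test))
```

With three equally sized classes and a 20% test share, each species contributes the same number of samples to the test set.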

Feature Engineering

Normalization

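Normalization (min-max scaling) maps each feature to a specified range, typically [0, 1]:

Formula: X' = (x - min) / (max - min)

For a target range [a, b] other than [0, 1], the result is rescaled as X'' = X' * (b - a) + a. The example below uses feature_range=(2, 3), so a = 2 and b = 3.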

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Generate sample data
sample_data = pd.DataFrame(np.random.randint(200, 4000, size=(5, 4)))
print(sample_data)

# Apply normalization
normalizer = MinMaxScaler(feature_range=(2, 3))
normalized_result = normalizer.fit_transform(sample_data[[0, 1, 2]])
print("Normalized data:\n", normalized_result)
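
To make the formula concrete, the following sketch applies it by hand with NumPy on a small fixed array and compares the result with MinMaxScaler. The input values and the names low, high, manual, and library are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[200.0, 10.0],
                   [400.0, 30.0],
                   [1000.0, 20.0]])
low, high = 2, 3  # target range, matching feature_range=(2, 3)

# Step 1: scale each column to [0, 1] using its own min and max
col_min = values.min(axis=0)
col_max = values.max(axis=0)
unit_scaled = (values - col_min) / (col_max - col_min)

# Step 2: stretch and shift from [0, 1] to [low, high]
manual = unit_scaled * (high - low) + low

library = MinMaxScaler(feature_range=(low, high)).fit_transform(values)

print(manual)
print('Matches MinMaxScaler:', np.allclose(manual, library))
```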

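Normalization is sensitive to outliers: a single extreme value sets the minimum or maximum and compresses every other value into a narrow part of the range. Standardization is generally less affected when the sample is reasonably large, so it is often preferred.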

Standardization

Standardization transforms data to have zero mean and unit variance:

Formula: X' = (x - mean) / standard_deviation

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Generate sample data
training_data = pd.DataFrame(np.random.randint(200, 4000, size=(5, 4)))
print(training_data)

# Apply standardization
scaler = StandardScaler()
standardized_features = scaler.fit_transform(training_data[[0, 1, 2]])
print('Standardized results:\n', standardized_features)
print('Feature means:\n', scaler.mean_)
print('Feature variances:\n', scaler.var_)
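
The same kind of check works for standardization. The sketch below computes the formula by hand and compares it with StandardScaler. One detail worth knowing is that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default. The sample values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Per-column mean and population standard deviation (ddof=0)
col_mean = values.mean(axis=0)
col_std = values.std(axis=0)
manual = (values - col_mean) / col_std

scaler = StandardScaler().fit(values)
library = scaler.transform(values)

print(manual)
print('Matches StandardScaler:', np.allclose(manual, library))
print('Column means after scaling:', manual.mean(axis=0))
print('Column std devs after scaling:', manual.std(axis=0))
```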

Iris Species Prediction Implementation

K-Nearest Neighbors Classifier

Key parameters for sklearn.neighbors.KNeighborsClassifier:

  • n_neighbors: Number of neighbors to consider (default: 5)
  • algorithm: Method for computing nearest neighbors ('auto', 'ball_tree', 'kd_tree', 'brute')

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Load and prepare data
iris_dataset = load_iris()
train_features, test_features, train_targets, test_targets = train_test_split(
    iris_dataset.data, iris_dataset.target, 
    test_size=0.2, random_state=22
)

# Feature scaling: fit the scaler on the training set only, then reuse
# its mean and variance to transform the test set
feature_scaler = StandardScaler()
train_features_scaled = feature_scaler.fit_transform(train_features)
test_features_scaled = feature_scaler.transform(test_features)

# Model training
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(train_features_scaled, train_targets)

# Model evaluation
predictions = classifier.predict(test_features_scaled)
print('Predictions:\n', predictions)
print('Prediction accuracy comparison:\n', predictions == test_targets)
accuracy = classifier.score(test_features_scaled, test_targets)
print('Model accuracy:\n', accuracy)
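
To illustrate what n_neighbors controls, here is a minimal from-scratch sketch of the k-nearest neighbors idea: measure the distance from a query point to every training point, take the k closest, and let them vote. It assumes Euclidean distance and unweighted majority voting, which correspond to the KNeighborsClassifier defaults, but it is not the library's implementation. The function knn_predict and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters labeled 0 and 1
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])

near_first_cluster = knn_predict(train_X, train_y, np.array([1.1, 1.0]))
near_second_cluster = knn_predict(train_X, train_y, np.array([5.0, 5.2]))
print(near_first_cluster, near_second_cluster)
```

Because the prediction depends directly on distances, features on larger numeric scales would dominate the vote, which is why the features are scaled before fitting the classifier.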

Cross-Validation and Hyperparameter Tuning

Cross-Validation Process

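Cross-validation divides the training data into k subsets (folds). The model is trained k times, each time on k - 1 folds and validated on the remaining fold, so every fold serves as the validation set exactly once. Averaging the k scores gives a more reliable performance estimate than a single train/validation split.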
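
As a small illustration of the procedure, cross_val_score runs the fold-by-fold training and scoring in a single call. For brevity, the sketch below applies it to the full Iris dataset with 5 folds and an unscaled classifier; in the workflow of this article it would normally be applied to the training portion only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# One accuracy score per fold; cv=5 means five train/validate rounds
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), iris.data, iris.target, cv=5)

print('Per-fold accuracy:', scores)
print('Mean accuracy:', scores.mean())
```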

Grid Search Optimization

Grid search systematically evaluates different hyperparameter combinations using cross-validation to identify optimal settings.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Load and prepare data
iris_samples = load_iris()
train_x, test_x, train_y, test_y = train_test_split(
    iris_samples.data, iris_samples.target, 
    test_size=0.2, random_state=22
)

# Feature scaling: fit the scaler on the training set only, then reuse
# its mean and variance to transform the test set
scaler_object = StandardScaler()
train_x_processed = scaler_object.fit_transform(train_x)
test_x_processed = scaler_object.transform(test_x)

# Model with hyperparameter tuning
base_classifier = KNeighborsClassifier()
hyperparameter_grid = {'n_neighbors': [1, 3, 5, 7]}
optimized_model = GridSearchCV(base_classifier, 
                              param_grid=hyperparameter_grid, 
                              cv=5)
optimized_model.fit(train_x_processed, train_y)

# Evaluation
model_predictions = optimized_model.predict(test_x_processed)
print('Model predictions:\n', model_predictions)
print('Accuracy comparison:\n', model_predictions == test_y)
final_accuracy = optimized_model.score(test_x_processed, test_y)
print('Final accuracy:\n', final_accuracy)

# Cross-validation results
print('Best cross-validation score:\n', optimized_model.best_score_)
print('Optimal parameters:\n', optimized_model.best_params_)
print('Detailed CV results:\n', optimized_model.cv_results_)
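
In the example above, the scaler is fitted on the whole training set before grid search splits it into folds, so each validation fold has already influenced the scaling statistics. One common way to avoid this is to wrap the scaler and the classifier in a Pipeline, so that scaling is refitted inside every fold. The following is a sketch of that variant, not the approach used in the code above. The parameter key kneighborsclassifier__n_neighbors follows make_pipeline's naming of steps after their lowercased class names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22
)

# Scaling and classification are treated as one estimator
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {'kneighborsclassifier__n_neighbors': [1, 3, 5, 7]}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
search.fit(X_train, y_train)  # raw features: the pipeline scales per fold

print('Optimal parameters:', search.best_params_)
print('Best cross-validation score:', search.best_score_)
print('Test accuracy:', search.score(X_test, y_test))

# cv_results_ is easier to read as a table
results = pd.DataFrame(search.cv_results_)
print(results[['param_kneighborsclassifier__n_neighbors',
               'mean_test_score', 'rank_test_score']])
```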

Tags: machine-learning knn Classification scikit-learn iris-dataset

Posted on Mon, 01 Jun 2026 17:37:31 +0000 by 22Pixels