Dataset Overview
The Iris dataset, introduced by Ronald Fisher in 1936 from measurements collected by the botanist Edgar Anderson, is a widely used classification dataset containing 150 samples from three iris species: Setosa, Versicolor, and Virginica. Each species has 50 samples with four features: sepal length, sepal width, petal length, and petal width.
In machine learning practice, data collection is typically performed by domain experts who understand which features are relevant for classification tasks.
Loading and Exploring the Dataset
Scikit-learn Dataset APIs
Scikit-learn provides two categories of datasets:
- Small datasets: Loaded using sklearn.datasets.load_*() functions, stored locally within the library
- Large datasets: Retrieved using sklearn.datasets.fetch_*() functions, downloaded from remote sources
from sklearn.datasets import load_iris
# Load the Iris dataset
iris_data = load_iris()
print(iris_data)
For larger datasets requiring network download:
from sklearn.datasets import fetch_20newsgroups
# Fetch 20 newsgroups dataset
news_data = fetch_20newsgroups()
print(news_data)
Dataset Structure
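Both load and fetch functions return Bunch objects (sklearn.utils.Bunch in current releases; older versions exposed the class under sklearn.datasets.base). A Bunch is a dictionary whose keys can also be read as attributes. The key attributes are:
- data: Feature matrix as a 2D NumPy array
- target: Label vector as a 1D NumPy array
- DESCR: Dataset description
- feature_names: List of feature names
- target_names: List of target class names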
from sklearn.datasets import load_iris
# Load and examine dataset components
iris_dataset = load_iris()
print('Feature matrix:\n', iris_dataset.data)
print('Target labels:\n', iris_dataset['target'])
print('Feature names:\n', iris_dataset.feature_names)
print('Target names:\n', iris_dataset.target_names)
print('Dataset description:\n', iris_dataset.DESCR)
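As a quick check on the structure described above, the following sketch prints the array shapes and shows that attribute access and key access return the same object. It also uses the return_X_y=True option of load_iris, which is not covered in the text above, to get the two arrays without the Bunch wrapper.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# One row per sample, one column per feature
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)

# Attribute access and dictionary-style access refer to the same array
print(iris.data is iris['data'])  # True

# return_X_y=True skips the Bunch and returns (features, labels) directly
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```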
Data Visualization
Visualizing data distributions helps understand how different classes separate based on features. While ideal scenarios show perfect separation, real-world data often requires more sophisticated analysis.
Seaborn provides high-level plotting capabilities built on Matplotlib:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
# Load dataset
iris_samples = load_iris()
# Convert to DataFrame for easier manipulation
iris_df = pd.DataFrame(iris_samples.data,
                       columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
# Species holds the numeric class codes 0, 1, 2
iris_df['Species'] = iris_samples.target

def visualize_distribution(dataset, x_col, y_col):
    # Scatter plot of two features, colored by species (no regression line)
    sns.lmplot(x=x_col, y=y_col, data=dataset, hue='Species', fit_reg=False)
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.title('Iris Species Distribution')
    plt.show()

visualize_distribution(iris_df, 'Petal_Width', 'Sepal_Length')
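Because the Species column holds the integer codes from target, a plot legend built from it shows 0, 1, and 2 rather than species names. One possible way to get readable labels, sketched here with a hypothetical labeled_df variable, is to map the codes to target_names with pd.Categorical.from_codes before plotting.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# labeled_df is an illustrative name, not part of the original example
labeled_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Translate the integer codes 0/1/2 into the species names
labeled_df['Species'] = pd.Categorical.from_codes(
    iris.target, categories=iris.target_names
)

print(labeled_df['Species'].value_counts())
```

Passing this frame to a plotting function with hue='Species' should then label the legend with the species names.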
Dataset Partitioning
Machine learning datasets are typically divided into:
- Training set: Used to build the model (70-80% of data)
- Testing set: Used to evaluate model performance (20-30% of data)
Train-Test Split API
sklearn.model_selection.train_test_split() handles dataset partitioning:
Parameters:
- arrays: Feature and target arrays
- test_size: Proportion of test data
- random_state: Seed for reproducible sampling
Returns four arrays when given features and targets: training features, testing features, training targets, testing targets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load dataset
iris_collection = load_iris()
# Split dataset
features_train, features_test, targets_train, targets_test = train_test_split(
    iris_collection.data, iris_collection.target,
    test_size=0.2, random_state=22
)
print('Training features:\n', features_train)
print('Training targets:\n', targets_train)
print('Testing features:\n', features_test)
print('Testing targets:\n', targets_test)
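A plain random split does not guarantee that each species appears in the same proportion in both subsets. The sketch below adds the stratify parameter of train_test_split, which the text above does not use, to keep the class balance identical in the training and testing sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y preserves the 50/50/50 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_train))         # [40 40 40]
print(np.bincount(y_test))          # [10 10 10]
```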
Feature Engineering
Normalization
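Normalization (min-max scaling) maps each feature to a specified range (typically [0, 1]):
Formula: X' = (x - min) / (max - min), where min and max are computed per feature (column). For a target range [a, b] other than [0, 1], the result is rescaled as X'' = X' * (b - a) + a. The example below uses the range [2, 3].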
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Generate sample data
sample_data = pd.DataFrame(np.random.randint(200, 4000, size=(5, 4)))
print(sample_data)
# Apply normalization
normalizer = MinMaxScaler(feature_range=(2, 3))
normalized_result = normalizer.fit_transform(sample_data[[0, 1, 2]])
print("Normalized data:\n", normalized_result)
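Normalization is sensitive to outliers: the minimum and maximum are set by the most extreme values, so a single outlier compresses all other values into a narrow part of the range. Standardization is usually less affected, because the mean and standard deviation are computed from all samples.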
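To make the formula and the outlier effect concrete, here is a minimal sketch on a small hand-made array (the values are invented for illustration). It applies the formula by hand with NumPy, compares the result with MinMaxScaler, and prints how one large value squeezes the rest of its column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data; the last value in the second column is an outlier
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 1000.0]])

lo, hi = 2, 3  # target range, matching feature_range=(2, 3)

# Step 1: scale each column to [0, 1]
X_unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: stretch and shift to [lo, hi]
X_manual = X_unit * (hi - lo) + lo

X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)
print(np.allclose(X_manual, X_sklearn))  # True

# The three ordinary values of column 2 end up close to 0
print(X_unit[:, 1])
```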
Standardization
Standardization transforms data to have zero mean and unit variance:
Formula: X' = (x - mean) / standard_deviation
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Generate sample data
training_data = pd.DataFrame(np.random.randint(200, 4000, size=(5, 4)))
print(training_data)
# Apply standardization
scaler = StandardScaler()
standardized_features = scaler.fit_transform(training_data[[0, 1, 2]])
print('Standardized results:\n', standardized_features)
print('Feature means:\n', scaler.mean_)
print('Feature variances:\n', scaler.var_)
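The standardization formula can be checked by hand in the same way. This sketch uses a small invented array and assumes the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0]])

# Per-column mean and population standard deviation (ddof=0)
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_manual = (X - col_mean) / col_std

scaler = StandardScaler().fit(X)
print(np.allclose(X_manual, scaler.transform(X)))  # True
print(np.allclose(scaler.mean_, col_mean))         # True
print(np.allclose(scaler.var_, col_std ** 2))      # True
```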
Iris Species Prediction Implementation
K-Nearest Neighbors Classifier
Key parameters for sklearn.neighbors.KNeighborsClassifier:
- n_neighbors: Number of neighbors to consider (default: 5)
- algorithm: Method for computing nearest neighbors ('auto', 'ball_tree', 'kd_tree', 'brute')
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Load and prepare data
iris_dataset = load_iris()
train_features, test_features, train_targets, test_targets = train_test_split(
    iris_dataset.data, iris_dataset.target,
    test_size=0.2, random_state=22
)
# Feature scaling
feature_scaler = StandardScaler()
# Fit the scaler on the training set only
train_features_scaled = feature_scaler.fit_transform(train_features)
# Reuse the training mean and variance; refitting on test data would leak test statistics
test_features_scaled = feature_scaler.transform(test_features)
# Model training
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(train_features_scaled, train_targets)
# Model evaluation
predictions = classifier.predict(test_features_scaled)
print('Predictions:\n', predictions)
print('Prediction accuracy comparison:\n', predictions == test_targets)
accuracy = classifier.score(test_features_scaled, test_targets)
print('Model accuracy:\n', accuracy)
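To show what n_neighbors controls, here is a minimal from-scratch sketch of the prediction step: measure the distance from a query point to every training point, keep the k closest, and take a majority vote. The knn_predict helper and the toy data are hypothetical and only illustrate the idea; Euclidean distance is assumed, which corresponds to the classifier's default metric, but the library's indexing structures and tie handling are not reproduced here.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_X, train_y, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(train_X, train_y, np.array([5.0, 5.0]), k=3))  # 1
```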
Cross-Validation and Hyperparameter Tuning
Cross-Validation Process
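Cross-validation divides the training data into k subsets (folds). The model is trained on k-1 folds and validated on the remaining fold, and the process repeats until each fold has served as the validation set once. Averaging the k validation scores gives a more reliable performance estimate than a single train/validation split.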
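The procedure can be run directly with cross_val_score. The sketch below assumes a 5-fold split and wraps the scaler and classifier with make_pipeline, a design choice not taken from the text above, so that the scaler is refit on the training folds of each round rather than on the validation fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22
)

# Scaling and classification are fit together inside each fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# One accuracy score per fold
scores = cross_val_score(model, X_train, y_train, cv=5)
print('Fold scores:', scores)
print('Mean score:', scores.mean())
```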
Grid Search Optimization
Grid search systematically evaluates different hyperparameter combinations using cross-validation to identify optimal settings.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Load and prepare data
iris_samples = load_iris()
train_x, test_x, train_y, test_y = train_test_split(
    iris_samples.data, iris_samples.target,
    test_size=0.2, random_state=22
)
# Feature scaling
scaler_object = StandardScaler()
# Fit the scaler on the training set only
train_x_processed = scaler_object.fit_transform(train_x)
# Apply the training statistics to the test set instead of refitting
test_x_processed = scaler_object.transform(test_x)
# Model with hyperparameter tuning
base_classifier = KNeighborsClassifier()
hyperparameter_grid = {'n_neighbors': [1, 3, 5, 7]}
optimized_model = GridSearchCV(base_classifier,
                               param_grid=hyperparameter_grid,
                               cv=5)
optimized_model.fit(train_x_processed, train_y)
# Evaluation
model_predictions = optimized_model.predict(test_x_processed)
print('Model predictions:\n', model_predictions)
print('Accuracy comparison:\n', model_predictions == test_y)
final_accuracy = optimized_model.score(test_x_processed, test_y)
print('Final accuracy:\n', final_accuracy)
# Cross-validation results
print('Best cross-validation score:\n', optimized_model.best_score_)
print('Optimal parameters:\n', optimized_model.best_params_)
print('Detailed CV results:\n', optimized_model.cv_results_)
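The raw cv_results_ dictionary is hard to read when printed. One possible way to inspect it, sketched here on the unscaled full dataset for brevity, is to load it into a pandas DataFrame and keep only the columns that summarize each candidate value of n_neighbors.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': [1, 3, 5, 7]},
                      cv=5)
search.fit(X, y)

# One row per hyperparameter candidate
results = pd.DataFrame(search.cv_results_)
summary = results[['param_n_neighbors', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]

# Best-ranked candidates first
print(summary.sort_values('rank_test_score'))
```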