Machine learning models must be validated before they are deployed to production to ensure that they are reliable and accurate.
Training and Testing Data Separation
Splitting a dataset into training and testing subsets makes it possible to evaluate a model on data it has not seen. The model is fit on the training subset and then scored on the held-out testing subset.
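A minimal sketch of why the held-out subset matters, assuming synthetic random data and a hypothetical helper named predict_1nn (neither comes from the original material). A 1-nearest-neighbor rule scores perfectly on the data it memorized, even though the labels are random and nothing can be learned:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random features with random binary labels: there is nothing real to learn here
features = rng.normal(size=(200, 5))
labels = rng.integers(0, 2, size=200)
train_x, test_x = features[:150], features[150:]
train_y, test_y = labels[:150], labels[150:]

def predict_1nn(query_x):
    # Label each query with the label of its single closest training sample
    distances = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=2)
    return train_y[np.argmin(distances, axis=1)]

train_accuracy = np.mean(predict_1nn(train_x) == train_y)
test_accuracy = np.mean(predict_1nn(test_x) == test_y)
# Training accuracy is 1.00 because every training point is its own nearest neighbor
print(f"Accuracy on training data: {train_accuracy:.2f}")
# Held-out accuracy should sit near chance; the exact value depends on the seed
print(f"Accuracy on held-out data: {test_accuracy:.2f}")
```

Only the second number says anything about how the rule behaves on new data.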
Manual Implementation
- Data Partitioning
# Dataset splitting implementation
import numpy as np

def split_dataset(features, labels, test_fraction, random_seed=None):
    """Shuffle the samples and split them into training and testing subsets."""
    if features.shape[0] != labels.shape[0]:
        raise ValueError("Feature and label dimensions must match")
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("Test fraction must be between 0 and 1")
    # Compare against None so that a seed of 0 is still honored
    if random_seed is not None:
        np.random.seed(random_seed)
    # Shuffle the row indices, then take the first test_count of them for testing
    shuffled_indices = np.random.permutation(len(features))
    test_count = int(len(features) * test_fraction)
    test_indices = shuffled_indices[:test_count]
    train_indices = shuffled_indices[test_count:]
    train_features = features[train_indices]
    test_features = features[test_indices]
    train_labels = labels[train_indices]
    test_labels = labels[test_indices]
    return train_features, test_features, train_labels, test_labels
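Calling np.random.seed inside split_dataset changes NumPy's global random state for the rest of the program. A minimal sketch of a variant that keeps the randomness local, assuming np.random.default_rng is acceptable here (the name split_dataset_rng is hypothetical and not part of the original module):

```python
import numpy as np

def split_dataset_rng(features, labels, test_fraction, random_seed=None):
    """Hypothetical variant of split_dataset that uses a local random generator."""
    if features.shape[0] != labels.shape[0]:
        raise ValueError("Feature and label dimensions must match")
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("Test fraction must be between 0 and 1")
    # default_rng(None) draws fresh entropy; an integer seed gives a repeatable shuffle
    rng = np.random.default_rng(random_seed)
    shuffled_indices = rng.permutation(len(features))
    test_count = int(len(features) * test_fraction)
    test_indices = shuffled_indices[:test_count]
    train_indices = shuffled_indices[test_count:]
    return (features[train_indices], features[test_indices],
            labels[train_indices], labels[test_indices])

# Usage on a toy array of 10 samples with 2 features each
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.arange(10)
train_x, test_x, train_y, test_y = split_dataset_rng(
    toy_features, toy_labels, 0.2, random_seed=0
)
print(train_x.shape, test_x.shape)  # prints (8, 2) (2, 2)
```

The generator is created per call, so other code that relies on NumPy's global random state is unaffected.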
- Model Training and Validation
- Using the K-Nearest Neighbors classification algorithm
import numpy as np
from sklearn import datasets
# Local modules: knn_classifier provides KNNClassifier, data_utils provides split_dataset
from knn_classifier import KNNClassifier
from data_utils import split_dataset

iris_data = datasets.load_iris()
features = iris_data.data
targets = iris_data.target

for iteration in range(5):
    # Split dataset: 20% of the samples are held out for testing
    train_x, test_x, train_y, test_y = split_dataset(features, targets, 0.2)
    # Model training and evaluation with k = 3 neighbors
    classifier = KNNClassifier(3)
    classifier.fit(train_x, train_y)
    predictions = classifier.predict(test_x)
    # Accuracy is the percentage of test samples predicted correctly
    accuracy = np.sum(predictions == test_y) / len(test_y) * 100
    print(f"Iteration {iteration+1} accuracy: {accuracy:.5f}%")
- Validation Results
Iteration 1 accuracy: 96.66667%
Iteration 2 accuracy: 96.66667%
Iteration 3 accuracy: 96.66667%
Iteration 4 accuracy: 93.33333%
Iteration 5 accuracy: 93.33333%
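The script above imports KNNClassifier from a local knn_classifier module that is not shown. A minimal sketch of what such a class might look like, assuming Euclidean distance and a simple majority vote (this is a hypothetical stand-in, not the implementation that produced the results listed above):

```python
from collections import Counter
import numpy as np

class KNNClassifier:
    """Hypothetical stand-in for the KNNClassifier imported from knn_classifier."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._train_x = None
        self._train_y = None

    def fit(self, train_x, train_y):
        # KNN has no training phase beyond storing the labeled samples
        if len(train_x) < self.k:
            raise ValueError("Need at least k training samples")
        self._train_x = np.asarray(train_x, dtype=float)
        self._train_y = np.asarray(train_y)
        return self

    def predict(self, test_x):
        if self._train_x is None:
            raise RuntimeError("Call fit before predict")
        return np.array([self._predict_one(x) for x in np.asarray(test_x, dtype=float)])

    def _predict_one(self, x):
        # Euclidean distance from x to every stored training sample
        distances = np.sqrt(np.sum((self._train_x - x) ** 2, axis=1))
        nearest = np.argsort(distances)[:self.k]
        # Majority vote among the k nearest labels; a tie goes to the label met first
        votes = Counter(self._train_y[nearest].tolist())
        return votes.most_common(1)[0][0]

# Usage on two well-separated toy clusters
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
classifier = KNNClassifier(3).fit(toy_x, toy_y)
predictions = classifier.predict(np.array([[0.05, 0.05], [5.05, 5.0]]))
print(predictions)  # prints [0 1]
```

The sketch computes distances one query at a time for readability; a vectorized version would be faster on larger datasets.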
Using Scikit-learn Modules
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits_dataset = datasets.load_digits()
features = digits_dataset.data
targets = digits_dataset.target

for trial in range(5):
    # With no test_size given, train_test_split holds out 25% of the samples
    x_train, x_test, y_train, y_test = train_test_split(features, targets)
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(x_train, y_train)
    # score returns the mean accuracy on the given test data as a fraction
    score = model.score(x_test, y_test) * 100
    print(f"Trial {trial+1} accuracy: {score:.5f}%")
- Performance Results
Trial 1 accuracy: 98.22222%
Trial 2 accuracy: 98.66667%
Trial 3 accuracy: 98.44444%
Trial 4 accuracy: 98.66667%
Trial 5 accuracy: 99.11111%
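Each trial above draws a different random split, which is why the accuracy varies from trial to trial. A minimal sketch of a repeatable split, assuming a fixed seed is wanted (random_state=42 and stratify=targets are illustrative choices that are not part of the original script; test_size=0.25 matches the default the original call relies on):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits_dataset = datasets.load_digits()
features = digits_dataset.data
targets = digits_dataset.target

# random_state fixes the shuffle; stratify keeps class proportions similar in both subsets
x_train, x_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.25, random_state=42, stratify=targets
)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)
score = model.score(x_test, y_test) * 100
print(f"Accuracy with a fixed split: {score:.5f}%")
```

Rerunning this script reproduces the same split and therefore the same score, which makes comparisons between models or settings easier to interpret.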