Environment Configuration
Before executing any machine learning pipelines, ensure the computational environment contains the necessary dependencies. Using an isolated virtual environment (e.g. venv or conda) is strongly recommended to prevent package conflicts.
pip install numpy pillow scikit-learn tensorflow keras opencv-contrib-python imutils
Key libraries include numpy for array operations, pillow for image I/O, scikit-learn for classical algorithms and evaluation metrics, and tensorflow/keras for constructing neural architectures. opencv-contrib-python and imutils provide supplementary computer vision utilities; they are installed here for completeness but are not imported by the scripts below.
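As a quick sanity check after installation, a short script can confirm that the core packages import correctly and report their versions (a minimal sketch; note that pillow imports as PIL, and scikit-learn as sklearn):

```python
import importlib

# Map of pip package names to the module names they import as.
packages = {
    "numpy": "numpy",
    "pillow": "PIL",
    "scikit-learn": "sklearn",
    "tensorflow": "tensorflow",
}

versions = {}
for pip_name, module_name in packages.items():
    try:
        mod = importlib.import_module(module_name)
        versions[pip_name] = getattr(mod, "__version__", "unknown")
    except ImportError:
        versions[pip_name] = "NOT INSTALLED"

print(versions)
```

Any package reported as NOT INSTALLED can be added with the pip command above before proceeding.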
Target Datasets
Two distinct datasets are utilized to demonstrate algorithm behavior across different data modalities:
- Iris Dataset: A standard numerical dataset containing 150 samples across three flower species (Setosa, Versicolor, Virginica). Each sample includes four continuous features: sepal length, sepal width, petal length, and petal width. While one class is linearly separable, the remaining two require non-linear decision boundaries.
- 3-Scenes Image Dataset: A collection of 948 photographs categorized into three environmental classes: Coast (360), Forest (328), and Highway (260). This dataset tests model capability in processing raw pixel data and extracting spatial patterns.
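The Iris half of this setup ships with scikit-learn, so its shape and class balance can be verified directly (the 3-Scenes images must be obtained separately as a directory of class-named subfolders):

```python
from collections import Counter

from sklearn.datasets import load_iris

# Load the bundled Iris dataset and confirm the counts quoted above.
bundle = load_iris()
n_samples, n_features = bundle.data.shape
class_counts = Counter(bundle.target_names[t] for t in bundle.target)

print(f"{n_samples} samples, {n_features} features per sample")  # 150 samples, 4 features
print(dict(class_counts))  # 50 samples of each species
```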
Standard Development Workflow
A systematic approach to model development typically follows these phases:
- Problem Assessment: Identify data type, define success metrics, and hypothesize suitable algorithm families.
- Data Preparation: Handle ingestion, cleaning, feature extraction, and engineering.
- Algorithm Screening: Train multiple baseline models across different families (linear, tree-based, neural, etc.).
- Evaluation: Compare metrics to identify top performers.
- Refinement: Tune hyperparameters or architect custom solutions for the best-performing approach.
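The Algorithm Screening phase above can be sketched with scikit-learn's uniform fit/score interface; the three baseline families and the split parameters here are illustrative choices, not prescriptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# One representative per family: linear, tree-based, distance-based.
baselines = {
    "linear": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "distance": KNeighborsClassifier(n_neighbors=3),
}

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit each baseline on the same split and record its test accuracy.
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in baselines.items()}
print(scores)
```

The top scorer in such a screen then becomes the candidate for the Refinement phase.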
Tabular Data Classification with Scikit-Learn
The following script demonstrates a unified interface for training and evaluating multiple classical algorithms on numerical data. The architecture uses a registry pattern to dynamically select classifiers via command-line arguments.
import argparse
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
CLASSIFIER_REGISTRY = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "gaussian_nb": GaussianNB(),
    "log_reg": LogisticRegression(max_iter=1000, solver="lbfgs"),
    "svc_rbf": SVC(kernel="rbf", probability=True),
    "dtree": DecisionTreeClassifier(max_depth=5),
    "rf_ensemble": RandomForestClassifier(n_estimators=50, random_state=42),
    "mlp_net": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
}

def execute_tabular_experiment(selected_model: str):
    print("[STATUS] Fetching Iris dataset...")
    iris_bundle = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris_bundle.data, iris_bundle.target, test_size=0.25, random_state=42
    )
    clf = CLASSIFIER_REGISTRY.get(selected_model)
    if clf is None:
        raise ValueError(f"Model '{selected_model}' not recognized.")
    print(f"[STATUS] Training {selected_model}...")
    clf.fit(X_train, y_train)
    print("[STATUS] Generating evaluation metrics...")
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=iris_bundle.target_names))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--algo", type=str, default="knn", choices=CLASSIFIER_REGISTRY.keys())
    args = parser.parse_args()
    execute_tabular_experiment(args.algo)
Image Classification Using Handcrafted Statistical Features
Raw pixel values are a poor input representation for traditional machine learning models. This implementation instead extracts first-order statistical moments (the mean and standard deviation) of each RGB channel, reducing each image to a 6-dimensional feature vector before classification.
import os
import argparse
import numpy as np
from PIL import Image
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
IMG_CLASSIFIER_MAP = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "svc_rbf": SVC(kernel="rbf"),
    "rf_ensemble": RandomForestClassifier(n_estimators=50, random_state=42)
}

def compute_channel_moments(img_obj):
    r_band, g_band, b_band = img_obj.split()
    return [
        np.mean(r_band), np.std(r_band),
        np.mean(g_band), np.std(g_band),
        np.mean(b_band), np.std(b_band)
    ]

def run_image_experiment(dataset_dir: str, model_key: str):
    print("[STATUS] Parsing image directory and extracting features...")
    feature_matrix = []
    raw_labels = []
    for root, _, files in os.walk(dataset_dir):
        for fname in files:
            if fname.lower().endswith(('.png', '.jpg', '.jpeg')):
                fpath = os.path.join(root, fname)
                pil_img = Image.open(fpath).convert('RGB')
                feature_matrix.append(compute_channel_moments(pil_img))
                raw_labels.append(os.path.basename(root))
    encoder = LabelEncoder()
    y_encoded = encoder.fit_transform(raw_labels)
    X_train, X_test, y_train, y_test = train_test_split(
        np.array(feature_matrix), y_encoded, test_size=0.25, random_state=42
    )
    clf = IMG_CLASSIFIER_MAP[model_key]
    print(f"[STATUS] Fitting {model_key} on color statistics...")
    clf.fit(X_train, y_train)
    print("[STATUS] Evaluation complete.")
    preds = clf.predict(X_test)
    print(classification_report(y_test, preds, target_names=encoder.classes_))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dir", type=str, default="3scenes")
    parser.add_argument("--model", type=str, default="rf_ensemble", choices=IMG_CLASSIFIER_MAP.keys())
    cfg = parser.parse_args()
    run_image_experiment(cfg.dir, cfg.model)
Neural Network Implementation for Numerical Data
Deep learning frameworks can also process tabular data effectively. The following pipeline constructs a multi-layer perceptron (MLP) using Keras, applying one-hot encoding to categorical targets and utilizing stochastic gradient descent for optimization.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
def train_mlp_on_iris():
    print("[STATUS] Preparing numerical data...")
    bundle = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        bundle.data, bundle.target, test_size=0.25, random_state=42
    )
    binarizer = LabelBinarizer()
    y_train_bin = binarizer.fit_transform(y_train)
    y_test_bin = binarizer.transform(y_test)
    print("[STATUS] Constructing feed-forward network...")
    net = Sequential([
        Dense(12, input_dim=4, activation="relu"),
        Dense(8, activation="relu"),
        Dense(3, activation="softmax")
    ])
    optimizer = SGD(learning_rate=0.05, momentum=0.8)
    net.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
    print("[STATUS] Initiating training cycle...")
    net.fit(X_train, y_train_bin, validation_data=(X_test, y_test_bin), epochs=200, batch_size=8, verbose=0)
    print("[STATUS] Computing final metrics...")
    raw_preds = net.predict(X_test)
    print(classification_report(
        y_test_bin.argmax(axis=1),
        raw_preds.argmax(axis=1),
        target_names=bundle.target_names
    ))

if __name__ == "__main__":
    train_mlp_on_iris()
Convolutional Architecture for Visual Recognition
Convolutional Neural Networks (CNNs) automatically learn hierarchical spatial features, eliminating the need for manual statistical extraction. This implementation resizes inputs to a fixed 32x32 resolution, normalizes pixel intensity, and stacks convolutional blocks with max-pooling layers.
import os
import numpy as np
from PIL import Image
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Activation
from tensorflow.keras.optimizers import Adam
def prepare_visual_data(source_dir):
    print("[STATUS] Ingesting and normalizing images...")
    pixel_data = []
    class_names = []
    for root, _, files in os.walk(source_dir):
        for f in files:
            if f.lower().endswith(('.jpg', '.png')):
                full_path = os.path.join(root, f)
                img = Image.open(full_path).convert('RGB').resize((32, 32))
                pixel_data.append(np.array(img) / 255.0)
                class_names.append(os.path.basename(root))
    lbl_enc = LabelBinarizer()
    y_vec = lbl_enc.fit_transform(class_names)
    X_train, X_test, y_train, y_test = train_test_split(
        np.array(pixel_data), y_vec, test_size=0.25, random_state=42
    )
    return X_train, X_test, y_train, y_test, lbl_enc

def build_and_train_cnn(X_tr, X_te, y_tr, y_te, encoder):
    print("[STATUS] Assembling convolutional pipeline...")
    cnn = Sequential([
        Conv2D(16, (3, 3), padding="same", input_shape=(32, 32, 3)),
        Activation("relu"),
        MaxPooling2D(pool_size=(2, 2)),
        Conv2D(32, (3, 3), padding="same"),
        Activation("relu"),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(64, activation="relu"),
        Dense(3, activation="softmax")
    ])
    adam_opt = Adam(learning_rate=1e-3)
    cnn.compile(loss="categorical_crossentropy", optimizer=adam_opt, metrics=["accuracy"])
    print("[STATUS] Running backpropagation...")
    cnn.fit(X_tr, y_tr, validation_data=(X_te, y_te), epochs=40, batch_size=16, verbose=0)
    print("[STATUS] Validation results:")
    probs = cnn.predict(X_te)
    print(classification_report(
        y_te.argmax(axis=1),
        probs.argmax(axis=1),
        target_names=encoder.classes_
    ))

if __name__ == "__main__":
    X_tr, X_te, y_tr, y_te, enc = prepare_visual_data("3scenes")
    build_and_train_cnn(X_tr, X_te, y_tr, y_te, enc)
Algorithm Performance Comparison
Executing these pipelines reveals distinct performance characteristics based on data modality and algorithm family:
- Tabular Data (Iris): Classical algorithms like KNN, Logistic Regression, and SVM consistently achieve accuracy between 95% and 98%. The structured, low-dimensional nature of the data allows linear and distance-based models to establish highly effective decision boundaries. Neural networks also reach near-perfect accuracy but require careful learning rate scheduling to avoid oscillation.
- Image Data (3-Scenes): Traditional models relying on raw color statistics (mean/std per channel) typically plateau between 63% and 77% accuracy. Color distribution alone lacks spatial context, making it difficult to distinguish structurally different scenes with similar palettes. Random Forests generally outperform linear models here due to their ability to capture non-linear interactions between channel moments.
- Convolutional Networks: The CNN architecture significantly outperforms handcrafted feature approaches on the image dataset, typically exceeding 90% accuracy. By learning edge detectors, texture filters, and object parts directly from pixel arrays, convolutional layers capture the spatial hierarchy that statistical moments miss.
Variations in reported metrics across different executions are expected. Factors such as random train/test splits, weight initialization seeds, and stochastic optimization steps introduce natural variance. Consistent evaluation requires fixed random states and cross-validation strategies. The results demonstrate that algorithm selection must align with data structure: tree-based and linear models excel on engineered tabular features, while convolutional architectures are necessary for raw visual recognition tasks.
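The fixed-seed, cross-validated evaluation described above can be sketched as follows; the model, fold count, and seed are illustrative choices, not part of the pipelines above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Fix the shuffling seed and average accuracy over stratified folds instead
# of relying on a single random train/test split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=cv)

print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and standard deviation across folds gives a variance estimate that a single split cannot provide.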