Principle Overview
A decision tree classifier is a supervised learning algorithm that constructs a tree-like model of decisions based on feature values. The algorithm recursively partitions the dataset into subsets using the feature that best separates the class labels. Each internal node represents a test on a particular feature, each branch represents an outcome of that test, and each leaf node represents a class label.
The splitting criterion typically uses either Gini impurity or information gain (entropy) to determine the best feature for division. Splitting continues recursively until all instances in a subset belong to the same class or further splitting yields no significant improvement. At prediction time, a new instance traverses the tree from the root to a leaf according to its feature values and is assigned that leaf's class label.
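To make these criteria concrete, here is a minimal sketch (supplementary to the experiment below) that computes Gini impurity and entropy for an array of class labels; the function names are illustrative.

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def shannon_entropy(labels):
    # Entropy: -sum(p * log2(p)) over class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array([0, 0, 1, 1, 1, 2])
print(f"Gini: {gini_impurity(labels):.3f}")       # 0.611
print(f"Entropy: {shannon_entropy(labels):.3f}")  # 1.459

A pure node (all labels identical) scores 0 under both measures; the split chosen at each internal node is the one that most reduces the weighted average impurity of the resulting child nodes.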
Experimental Implementation
Library Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
Dataset Loading
data = pd.read_csv('D:\\train.csv')  # local path to the MNIST train.csv (e.g., from Kaggle's Digit Recognizer)
print("Dataset shape:", data.shape)
print("\nFirst few rows:")
data.head()
Output:
Dataset shape: (42000, 785)
The dataset contains 42,000 samples with 784 pixel features (28x28 images) and one label column. Each pixel value ranges from 0 to 255, representing grayscale intensity.
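As a quick sanity check of the pixel layout (supplementary to the original walkthrough), a single row can be reshaped back into its 28x28 image, assuming the data frame loaded above:

# Reshape the first sample's 784 pixel columns into a 28x28 image
sample = data.drop('label', axis=1).iloc[0].values.reshape(28, 28)
plt.figure(figsize=(3, 3))
plt.imshow(sample, cmap='gray')
plt.title(f"Label: {data['label'].iloc[0]}")
plt.axis('off')
plt.show()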
Label Distribution Visualization
plt.figure(figsize=(12, 6))
label_counts = data['label'].value_counts().sort_index()
plt.bar(label_counts.index, label_counts.values, color='steelblue', edgecolor='black')
plt.xlabel('Digit Class', fontsize=14)
plt.ylabel('Sample Count', fontsize=14)
plt.title('MNIST Dataset Class Distribution', fontsize=18)
plt.xticks(range(10), fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Data Preparation
X = data.drop('label', axis=1).values
y = data['label'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
Model Training
clf = DecisionTreeClassifier(
    criterion='gini',
    max_depth=15,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)
clf.fit(X_train, y_train)
print("Model training completed.")
Model Evaluation
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Confusion Matrix Visualization
plt.figure(figsize=(10, 8))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Label', fontsize=14)
plt.ylabel('True Label', fontsize=14)
plt.title('Confusion Matrix for Decision Tree Classifier', fontsize=16)
plt.tight_layout()
plt.show()
Results Analysis
The decision tree classifier achieves reasonable accuracy on the MNIST dataset. The confusion matrix reveals that structurally similar digits (such as 3 and 8, or 1 and 7) are confused with each other more frequently. This is expected, since the algorithm makes hard decisions at each node based on individual pixel thresholds.
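One way to see which pixels drive those threshold decisions is to project the tree's learned importances back onto the 28x28 grid; this supplementary sketch uses scikit-learn's feature_importances_ attribute:

# Map each pixel's importance in the fitted tree back onto the image grid
importances = clf.feature_importances_.reshape(28, 28)
plt.figure(figsize=(6, 5))
sns.heatmap(importances, cmap='viridis', cbar=True)
plt.title('Pixel Importances Learned by the Decision Tree')
plt.axis('off')
plt.show()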
Hyperparameters such as maximum depth, minimum samples per split, and minimum samples per leaf significantly affect model performance. Deeper trees can capture more complex patterns but risk overfitting, while shallower trees may underfit the data. The choice of splitting criterion (Gini impurity or entropy) also influences the resulting tree structure.
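These settings can also be tuned systematically rather than fixed by hand. The following is a hedged sketch of a grid search using scikit-learn's GridSearchCV; the parameter grid is illustrative, not the values used above, and an exhaustive search over 42,000 samples can be slow:

from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are examples, not tuned results
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=3,        # 3-fold cross-validation
    n_jobs=-1,   # use all available cores
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")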