Implementing Binary Classification with Logistic Regression in Python

Binary Classification Overview

Logistic regression serves as a foundational algorithm for binary classification tasks where the target variable consists of two distinct categories. Typical scenarios include spam detection, medical disease screening, and customer churn prediction.

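The algorithm computes a weighted sum of the input features plus a bias term and passes that score through the sigmoid function, which maps any real number to a value between 0 and 1. The result is interpreted as the estimated probability that a sample belongs to the positive class; a hard label is then obtained by comparing this probability with a threshold, conventionally 0.5.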
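
To make the mapping from scores to probabilities concrete, here is a minimal NumPy sketch. The weights, bias, and sample values are hypothetical numbers chosen only for illustration, not parameters learned from any dataset.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and samples (illustration only)
weights = np.array([0.8, -1.2])
bias = 0.5
samples = np.array([[2.0, 0.5],
                    [-1.0, 1.5],
                    [0.0, 0.0]])

# Linear score for each sample, then squash it into a probability
scores = samples @ weights + bias
probs = sigmoid(scores)

# Assign the positive class when the probability is at least 0.5
labels = (probs >= 0.5).astype(int)

print("Scores:       ", scores)
print("Probabilities:", np.round(probs, 4))
print("Labels:       ", labels)
```

A score of exactly 0 maps to a probability of 0.5, so the 0.5 threshold on the probability is equivalent to checking the sign of the linear score.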

Strengths and Limitations

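  • Interpretability: Each coefficient gives the change in the log-odds of the positive class per unit increase in its feature, so the direction and relative strength of every feature's influence can be read directly from the model.
  • Efficiency: Training and prediction are computationally inexpensive, making the model suitable for large datasets.
  • Probabilistic Output: Provides class probabilities rather than just hard labels, which allows the decision threshold to be tuned and samples to be ranked by predicted risk.
  • Linearity Assumption: The decision boundary is linear in the input features, so the model struggles with non-linear relationships unless suitably transformed features are supplied.
  • Underfitting Risk: May oversimplify complex patterns, resulting in poor performance on intricate datasets.
  • Outlier Sensitivity: Extreme feature values can disproportionately skew the fitted coefficients, so outliers should be inspected and handled before training.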
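
The linearity limitation can be illustrated with a small experiment. The sketch below is not part of the original workflow: it assumes a synthetic two-ring dataset from make_circles and compares plain logistic regression with the same model fed degree-2 polynomial features. The dataset parameters and variable names are illustrative choices.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Two concentric rings: no straight line separates the classes
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Plain logistic regression on the raw coordinates
linear_clf = make_pipeline(StandardScaler(), LogisticRegression())
linear_clf.fit(X_train, y_train)
acc_linear = linear_clf.score(X_test, y_test)

# Same classifier, with squared and interaction terms added as features
poly_clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         StandardScaler(), LogisticRegression())
poly_clf.fit(X_train, y_train)
acc_poly = poly_clf.score(X_test, y_test)

print(f"Accuracy with raw features:        {acc_linear:.3f}")
print(f"Accuracy with polynomial features: {acc_poly:.3f}")
```

The classifier itself is unchanged in the second pipeline; only the feature space differs. The boundary is still linear, but in the expanded features, which is what lets it trace a circle in the original coordinates.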

Python Implementation and Evaluation

A comprehensive workflow involves generating synthetic data, preprocessing, training the classifier, and evaluating performance through various metrics and visualizations.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve, confusion_matrix,
                             ConfusionMatrixDisplay)

# Generate synthetic binary classification dataset
features, targets = make_classification(n_samples=1500, n_features=10,
                                        n_informative=5, n_classes=2, random_state=101)

# Split into training and testing subsets
feat_train, feat_test, targ_train, targ_test = train_test_split(
    features, targets, test_size=0.25, random_state=101)

# Standardize features to have zero mean and unit variance
std_scaler = StandardScaler()
feat_train_scaled = std_scaler.fit_transform(feat_train)
feat_test_scaled = std_scaler.transform(feat_test)

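# Initialize and train the logistic regression classifier
# (random_state only affects the 'sag', 'saga', and 'liblinear' solvers;
# the default 'lbfgs' solver is deterministic and ignores it)
clf = LogisticRegression(random_state=101)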
clf.fit(feat_train_scaled, targ_train)

# Generate predictions and probability estimates
targ_preds = clf.predict(feat_test_scaled)
pred_probs = clf.predict_proba(feat_test_scaled)[:, 1]

# Calculate evaluation metrics
acc = accuracy_score(targ_test, targ_preds)
prec = precision_score(targ_test, targ_preds)
rec = recall_score(targ_test, targ_preds)
f1 = f1_score(targ_test, targ_preds)
roc_auc = roc_auc_score(targ_test, pred_probs)
cm = confusion_matrix(targ_test, targ_preds)

print(f"Accuracy: {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall: {rec:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")

Visualizing Model Performance

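The confusion matrix shows the counts of true negatives, false positives, false negatives, and true positives, making it clear not only how often the classifier is wrong but also which type of error it makes. In scikit-learn's layout, rows correspond to true labels and columns to predicted labels.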

# Plot Confusion Matrix
fig_cm, ax_cm = plt.subplots(figsize=(6, 6))
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
cm_display.plot(cmap='Blues', ax=ax_cm)
ax_cm.set_title('Classifier Confusion Matrix')
plt.show()
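
The four counts in a confusion matrix are enough to recompute the headline metrics by hand, which is a useful sanity check on what each metric measures. The sketch below uses small hypothetical label arrays (not the dataset from this article) and compares the manual results with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels {0, 1}, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics expressed directly in terms of the four counts
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.2f}  (sklearn: {precision_score(y_true, y_pred):.2f})")
print(f"Recall:    {recall:.2f}  (sklearn: {recall_score(y_true, y_pred):.2f})")
print(f"Accuracy:  {accuracy:.2f}")
print(f"F1-Score:  {f1:.2f}")
```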

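The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate as the decision threshold is varied. The Area Under the Curve (AUC) summarizes the model's discriminative ability across all thresholds: 0.5 corresponds to random guessing (the dashed diagonal in the plot) and 1.0 to perfect separation of the two classes.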

# Plot ROC Curve
fpr_vals, tpr_vals, _ = roc_curve(targ_test, pred_probs)
fig_roc, ax_roc = plt.subplots(figsize=(6, 6))
ax_roc.plot(fpr_vals, tpr_vals, color='coral', lw=2,
            label=f'ROC Curve (AUC = {roc_auc:.2f})')
ax_roc.plot([0, 1], [0, 1], color='slategray', lw=2, linestyle='--')
ax_roc.set_xlim([0.0, 1.0])
ax_roc.set_ylim([0.0, 1.05])
ax_roc.set_xlabel('False Positive Rate')
ax_roc.set_ylabel('True Positive Rate')
ax_roc.set_title('ROC Curve Analysis')
ax_roc.legend(loc="lower right")
plt.show()
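
Since every point on the ROC curve corresponds to one threshold, the curve can also be used to pick a threshold other than the default 0.5. The sketch below uses Youden's J statistic (TPR minus FPR), which is one common heuristic and not something prescribed by the workflow above. It regenerates similar synthetic data so that it runs on its own; the names model, probs, and best_threshold are illustrative. In practice the threshold would normally be chosen on a separate validation split rather than on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data with the same shape as in the main workflow
X, y = make_classification(n_samples=1500, n_features=10,
                           n_informative=5, n_classes=2, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Each (fpr, tpr) pair on the curve corresponds to one threshold
fpr, tpr, thresholds = roc_curve(y_test, probs)

# Youden's J: the point farthest above the chance diagonal
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]

# Apply the chosen threshold instead of the default 0.5
custom_preds = (probs >= best_threshold).astype(int)

print(f"Chosen threshold: {best_threshold:.3f}")
print(f"TPR at threshold: {tpr[best]:.3f}")
print(f"FPR at threshold: {fpr[best]:.3f}")
```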

Interpreting Feature Coefficients

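Extracting the learned weights shows how each feature drives the prediction. A positive coefficient raises the log-odds, and therefore the probability, of the positive class as the feature increases; a negative coefficient lowers it. Because the features were standardized before training, each weight refers to a one-standard-deviation change in its feature, which makes the magnitudes comparable across features.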

# Extract and display model coefficients
weights = clf.coef_[0]
bias = clf.intercept_[0]

feature_names = [f'Var_{i}' for i in range(len(weights))]
coeff_dataframe = pd.DataFrame({
    'Feature': feature_names,
    'Weight': weights
})
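
# The intercept is a single scalar, so report it separately instead of
# repeating the same value on every row of the table
print(coeff_dataframe)
print(f"Bias (intercept): {bias:.4f}")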
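
Raw weights are on the log-odds scale, which can be hard to read. Exponentiating a weight gives an odds ratio: the factor by which the odds of the positive class are multiplied when the feature increases by one unit (one standard deviation here, since the inputs are standardized). The sketch below is a self-contained illustration that refits a model on similar synthetic data; the column name Odds_Ratio and the sorting step are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with the same shape as in the main workflow
X, y = make_classification(n_samples=1500, n_features=10,
                           n_informative=5, n_classes=2, random_state=101)
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

coefs = model.coef_[0]
odds_table = pd.DataFrame({
    'Feature': [f'Var_{i}' for i in range(len(coefs))],
    'Weight': coefs,
    # exp(weight): multiplicative change in the odds per standard deviation
    'Odds_Ratio': np.exp(coefs),
})

# Order rows by absolute weight so the most influential features come first
order = odds_table['Weight'].abs().sort_values(ascending=False).index
odds_table = odds_table.reindex(order)

print(odds_table.to_string(index=False))
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and a value below 1 pushes them away from it; a value near 1 indicates little influence.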

Tags: Machine Learning Logistic Regression python Binary Classification scikit-learn

Posted on Thu, 18 Jun 2026 17:09:32 +0000 by jaylearning