Scikit-learn is a Python library for machine learning, offering efficient tools for data mining and analysis. This guide covers its core concepts and practical usage.
Installation
Install Scikit-learn via pip:
pip install scikit-learn
Core Concepts
- Dataset: Data is structured into features (input variables) and labels (target values).
- Model: An implementation of a machine learning algorithm that learns from data to make predictions.
- Training and Testing: Data is split into training sets (for model learning) and test sets (for evaluation).
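The three concepts above can be sketched in a few lines. This is a minimal illustration, not the only way to wire them together: the Iris dataset, the KNeighborsClassifier model, and the 80/20 split are choices made here for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Dataset: X holds the features (one row per sample), y holds the labels
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)

# Training and testing: hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model: fit() learns from the training set, score() evaluates on the test set
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The model never sees X_test during fit(), so the reported accuracy estimates how it behaves on new data.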
Key Functionalities
Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Sample data
features = [[10, 20], [20, 30], [30, 40], [40, 50]]
targets = [0, 1, 0, 1]
# Split data
feat_train, feat_test, targ_train, targ_test = train_test_split(features, targets, test_size=0.2, random_state=42)
# Standardize features to zero mean and unit variance (fit on the training set only)
normalizer = StandardScaler()
feat_train = normalizer.fit_transform(feat_train)
feat_test = normalizer.transform(feat_test)
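To see what StandardScaler does to the numbers, the sketch below fits it on a small training array and then reuses the learned statistics on a held-out row. The arrays are made up for illustration; the point is that the test row is scaled with the training mean and standard deviation, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: three training rows and one held-out row
train = np.array([[10.0, 20.0], [20.0, 30.0], [30.0, 40.0]])
test = np.array([[40.0, 50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns per-column mean and std
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)               # [20. 30.]
print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)                # approximately [[2.45 2.45]]
```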
Model Selection
Common models include:
- Classification:
  - Logistic Regression:
    from sklearn.linear_model import LogisticRegression
  - Support Vector Classifier:
    from sklearn.svm import SVC
  - Decision Tree Classifier:
    from sklearn.tree import DecisionTreeClassifier
  - Random Forest Classifier:
    from sklearn.ensemble import RandomForestClassifier
- Regression:
  - Linear Regression:
    from sklearn.linear_model import LinearRegression
  - Ridge Regression:
    from sklearn.linear_model import Ridge
  - Random Forest Regressor:
    from sklearn.ensemble import RandomForestRegressor
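Because all of these estimators share the same fit/predict/score interface, swapping one for another usually takes a single line. The sketch below compares the four classifiers listed above on the Iris dataset; the dataset, the candidates dictionary, and the hyperparameters (max_iter, random_state) are assumptions made for the example, and the scores from one small split should not be read as a ranking of the algorithms.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical candidate set; each entry is an unfitted estimator
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svc": SVC(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

# The same two calls work for every estimator
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = estimator.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```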
Model Training
from sklearn.linear_model import LogisticRegression
# Initialize model
classifier = LogisticRegression()
# Train model
classifier.fit(feat_train, targ_train)
Prediction
# Make predictions
predicted_labels = classifier.predict(feat_test)
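Besides hard labels, many classifiers can also report how confident they are. The sketch below assumes a LogisticRegression fitted on the small sample data from this guide and calls predict_proba on two made-up samples (new_samples is a hypothetical name); with only four training rows the probabilities themselves are not meaningful, so the focus is on the shape of the output.

```python
from sklearn.linear_model import LogisticRegression

# Same toy data as in the preprocessing example
features = [[10, 20], [20, 30], [30, 40], [40, 50]]
targets = [0, 1, 0, 1]

classifier = LogisticRegression(max_iter=1000)
classifier.fit(features, targets)

# Hypothetical unseen samples
new_samples = [[15, 25], [35, 45]]

labels = classifier.predict(new_samples)               # one label per sample
probabilities = classifier.predict_proba(new_samples)  # one row per sample, one column per class

print(classifier.classes_)  # column order of predict_proba
print(labels)
print(probabilities)
```

Each row of probabilities sums to 1, and predict returns the class whose column holds the largest value.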
Model Evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Accuracy
acc = accuracy_score(targ_test, predicted_labels)
print(f'Accuracy: {acc}')
# Confusion matrix
conf_mat = confusion_matrix(targ_test, predicted_labels)
print(f'Confusion Matrix:\n{conf_mat}')
# Classification report
class_report = classification_report(targ_test, predicted_labels)
print(f'Classification Report:\n{class_report}')
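A small hand-checkable case helps when reading these metrics. The labels below are invented for illustration: four of the six predictions are correct, and in the confusion matrix each row is a true class and each column a predicted class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels so the numbers can be verified by hand
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
conf_mat = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {acc:.3f}")  # Accuracy: 0.667
print(conf_mat)
# [[2 1]
#  [1 2]]
```

Reading the first row: of the three samples whose true class is 0, two were predicted as 0 and one as 1.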
Classification Example
A complete example using the Iris dataset:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load dataset
iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features to zero mean and unit variance (fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create model (random_state fixed so results are reproducible)
clf = RandomForestClassifier(random_state=42)
# Train model
clf.fit(X_train, y_train)
# Predict
predictions = clf.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')
Regression Example
A regression example using the diabetes dataset that ships with Scikit-learn:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load dataset (load_boston was removed in scikit-learn 1.2, so load_diabetes is used here)
diabetes_data = datasets.load_diabetes()
X = diabetes_data.data
y = diabetes_data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create model
regressor = LinearRegression()
# Train model
regressor.fit(X_train, y_train)
# Predict
pred_vals = regressor.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, pred_vals)
print(f'Mean Squared Error: {mse}')
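Mean squared error is expressed in squared units of the target, which can be hard to interpret. As a sketch, the example below also reports the root mean squared error (same units as the target) and the R² score; it assumes the bundled diabetes dataset and a plain LinearRegression, both chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
pred_vals = regressor.predict(X_test)

mse = mean_squared_error(y_test, pred_vals)
rmse = np.sqrt(mse)               # back in the units of the target
r2 = r2_score(y_test, pred_vals)  # 1.0 would be a perfect fit

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```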
Common Modules
- Model Selection:
  sklearn.model_selection
- Data Preprocessing:
  sklearn.preprocessing
- Model Evaluation:
  sklearn.metrics
- Ensemble Methods:
  sklearn.ensemble