Mineral Resource Clustering Analysis Using Random Forest Classification

Overview

This analysis applies multivariate statistical techniques to uncover patterns and relationships within mineral resource datasets. The dataset contains voltage (V), altitude (H), and soil type (S) as features, with mineral type (M) as the classification target. A Random Forest classifier serves as the primary predictive model, using ensemble learning to achieve robust classification performance.

The methodology involves standard data preprocessing steps, followed by dataset partitioning into training, validation, and test subsets. Performance evaluation employs ROC curves, confusion matrices, correlation heatmaps, and feature importance visualizations. This approach enables accurate mineral type classification while providing interpretable insights into feature contributions.
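The training code later in this article shows only a two-way train/test split. A three-way partition could be sketched as follows. This is a minimal illustration on synthetic stand-in data, and the 60/20/20 proportions and the `_demo` variable names are assumptions, not values taken from the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 samples, 3 features, 2 classes
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Step 1: hold out 20% of all samples as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Step 2: split the remaining 80% into 75/25, which gives
# 60% training and 20% validation relative to the full dataset
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

The validation subset would be used for hyperparameter choices, leaving the test subset untouched until the final evaluation.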

Dataset Description

The dataset originates from mineral resource surveys and contains the following variables:

Variable   Description              Type
V          Voltage measurements     Numeric
H          Altitude/Height          Numeric
S          Soil classification      Categorical
M          Mineral type (target)    Categorical

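Features are standardized to zero mean and unit variance with StandardScaler before PCA, LDA, K-Means, and factor analysis, so that all features are on comparable scales. The categorical soil type has to be encoded numerically before this step. The Random Forest itself is trained on the unscaled features, which is acceptable because tree-based models are insensitive to feature scaling.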
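The article does not show how X and y are built, so the following is only a sketch of one plausible preprocessing path. The synthetic DataFrame, the soil category names, and the choice of one-hot encoding are all assumptions made for illustration. Only the column names V, H, S, and M come from the table above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the survey table; values are synthetic
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V": rng.normal(5.0, 1.5, size=120),
    "H": rng.uniform(100, 900, size=120),
    "S": rng.choice(["clay", "sand", "loam"], size=120),
    "M": rng.choice(["A", "B"], size=120),
})

# One-hot encode the categorical soil column (an assumed encoding choice);
# the numeric columns V and H pass through unchanged
X = pd.get_dummies(df[["V", "H", "S"]], columns=["S"], dtype=float)
y = df["M"]

# Standardize every column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X.columns.tolist())
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Fitting the scaler on the full dataset is fine for exploratory plots. For a model that consumes scaled features, the scaler should be fitted on the training subset only.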

Visualization Analysis

Soil Type and Mineral Type Impact on Voltage

Box plots and violin plots reveal the distribution of voltage measurements across different soil and mineral type combinations. This visualization exposes outliers and helps identify feature interactions.
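The plotting code for this step is not included in the article. A minimal sketch using plain matplotlib is given here, with synthetic data and hypothetical soil and mineral category names standing in for the real survey values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (hypothetical categories and values)
rng = np.random.default_rng(1)
df_demo = pd.DataFrame({
    "V": rng.normal(5.0, 1.5, size=150),
    "S": rng.choice(["clay", "sand", "loam"], size=150),
    "M": rng.choice(["A", "B"], size=150),
})

# Collect voltage values for each (soil type, mineral type) combination
groups = {
    f"{soil}/{mineral}": g["V"].to_numpy()
    for (soil, mineral), g in df_demo.groupby(["S", "M"])
}
labels = list(groups)
positions = list(range(1, len(labels) + 1))

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# Box plot: quartiles, whiskers, and outlier points per combination
ax_box.boxplot(list(groups.values()), positions=positions)
ax_box.set_title("Voltage by Soil/Mineral (Box)")
ax_box.set_ylabel("Voltage (V)")

# Violin plot: kernel density estimate of the same groups
ax_violin.violinplot(list(groups.values()), positions=positions, showmedians=True)
ax_violin.set_title("Voltage by Soil/Mineral (Violin)")

for ax in (ax_box, ax_violin):
    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=45, ha="right")

fig.tight_layout()
plt.show()
```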

Altitude and Mineral Type Relationships

Scatter plots with mineral type color-coding illustrate how voltage measurements vary with altitude across different mineral categories. Polynomial regression curves highlight non-linear trends.
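The article does not include code for this plot either. One way to draw it is sketched here with numpy's polynomial fitting. The data is synthetic, and the quadratic relationship and the degree-2 fit are assumptions chosen only to make the example concrete.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data with an assumed quadratic altitude/voltage trend
rng = np.random.default_rng(2)
altitude = rng.uniform(100, 900, size=200)
mineral = rng.choice(["A", "B"], size=200)
voltage = 3.0 + 0.01 * altitude - 8e-6 * altitude**2 + rng.normal(0, 0.3, size=200)

grid = np.linspace(altitude.min(), altitude.max(), 200)
fits = {}

fig, ax = plt.subplots(figsize=(8, 6))
for label in np.unique(mineral):
    mask = mineral == label
    ax.scatter(altitude[mask], voltage[mask], alpha=0.5, label=f"Type {label}")
    # Degree-2 least-squares polynomial fitted separately per mineral type
    fits[label] = np.polynomial.Polynomial.fit(altitude[mask], voltage[mask], deg=2)
    ax.plot(grid, fits[label](grid), linewidth=2)

ax.set_xlabel("Altitude (H)")
ax.set_ylabel("Voltage (V)")
ax.set_title("Voltage vs. Altitude by Mineral Type")
ax.legend()
plt.show()
```

A low polynomial degree keeps the curve from chasing noise. Higher degrees should be justified by held-out error rather than visual fit.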

Principal Component Analysis (PCA)

PCA reduces dimensionality while preserving variance structure:

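import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: numeric feature DataFrame (soil type already encoded); y: mineral type labels
# Standardize features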
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize principal components (c=y expects numeric class codes)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.title('PCA Visualization of Mineral Resources')
plt.colorbar(label='Mineral Type')
plt.show()

Discriminant Analysis

Linear Discriminant Analysis (LDA) maximizes class separability:

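import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# n_components cannot exceed min(n_classes - 1, n_features),
# so two components require at least three mineral types
lda = LinearDiscriminantAnalysis(n_components=2)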
X_lda = lda.fit_transform(X_scaled, y)

plt.figure(figsize=(10, 6))
for label in np.unique(y):
    mask = y == label
    plt.scatter(X_lda[mask, 0], X_lda[mask, 1], label=f'Type {label}', alpha=0.7)
plt.xlabel('First Discriminant Component')
plt.ylabel('Second Discriminant Component')
plt.legend()
plt.title('Linear Discriminant Analysis')
plt.show()

Clustering Visualization

K-Means clustering identifies natural groupings:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='coolwarm', marker='o')
# Project the cluster centroids into the same PCA space as the points
centers_pca = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers_pca[:, 0], centers_pca[:, 1],
            c='black', s=200, marker='X', label='Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering Results')
plt.legend()
plt.show()
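The snippet above fixes the number of clusters at 3 without showing how that value was chosen. One common check is the silhouette score, sketched here on synthetic blobs. The data, the candidate range of k, and the variable names are assumptions for illustration and do not reproduce the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three well-separated groups
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X_blobs_scaled = StandardScaler().fit_transform(X_blobs)

# Fit K-Means for several candidate k values and record the mean silhouette
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_blobs_scaled)
    scores[k] = silhouette_score(X_blobs_scaled, labels_k)

# The k with the highest mean silhouette is the suggested cluster count
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Suggested k:", best_k)
```

The silhouette score ranges from -1 to 1, and higher values indicate tighter, better separated clusters.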

Factor Analysis

Factor analysis reveals underlying latent variables:

from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_factors[:, 0], X_factors[:, 1], alpha=0.6)
plt.xlabel('Factor 1')
plt.ylabel('Factor 2')
plt.title('Factor Analysis Representation')
plt.grid(True, alpha=0.3)
plt.show()

Model Training

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Initialize and train Random Forest
rf_classifier = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1
)

rf_classifier.fit(X_train, y_train)

Model Evaluation

ROC Curve Analysis

ROC curves evaluate binary classification performance:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

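# Compute ROC curve (binary case: the second entry of classes_ is the positive class)
positive_class = rf_classifier.classes_[1]
proba_scores = rf_classifier.predict_proba(X_test)[:, 1]
fpr_values, tpr_values, thresholds = roc_curve(
    y_test, proba_scores, pos_label=positive_class
)
# Named roc_auc to avoid shadowing sklearn.metrics.roc_auc_score
roc_auc = auc(fpr_values, tpr_values)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr_values, tpr_values, color='crimson', linewidth=2,
         label=f'ROC Curve (AUC = {roc_auc:.2f})')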
plt.plot([0, 1], [0, 1], color='slategray', linestyle='--', linewidth=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()
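The ROC code above assumes exactly two mineral types. With three or more classes, a one-vs-rest treatment is one option, sketched here on synthetic data. The dataset, the classifier settings, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic three-class stand-in data
X_mc, y_mc = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_mc, y_mc, test_size=0.3, random_state=42, stratify=y_mc
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest: binarize the labels and compute one ROC curve per class
y_te_bin = label_binarize(y_te, classes=clf.classes_)
per_class_auc = {}
for i, cls in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_te_bin[:, i], proba[:, i])
    per_class_auc[int(cls)] = auc(fpr, tpr)

# Macro-averaged one-vs-rest AUC computed directly by scikit-learn
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")

print(per_class_auc)
print(f"Macro OvR AUC: {macro_auc:.3f}")
```

The macro average weights every class equally, which matters when some mineral types are rare in the survey data.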

Confusion Matrix and Classification Report

Detailed performance metrics for each class:

from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

# Generate predictions
predictions = rf_classifier.predict(X_test)

# Confusion matrix visualization
cm = confusion_matrix(y_test, predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=rf_classifier.classes_,
            yticklabels=rf_classifier.classes_)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

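# Classification metrics (display names derived from the fitted classes,
# so the report works for any number of mineral types)
class_names = [f'Type {c}' for c in rf_classifier.classes_]
report = classification_report(y_test, predictions, target_names=class_names)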
print('Classification Report:')
print(report)

Feature Importance Analysis

Understanding which features drive predictions:

import numpy as np

# Extract feature importance scores
importance_scores = rf_classifier.feature_importances_
feature_names = X.columns
sorted_indices = np.argsort(importance_scores)[::-1]

# Create horizontal bar chart
plt.figure(figsize=(10, 6))
colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(feature_names)))
plt.barh(range(len(feature_names)), importance_scores[sorted_indices[::-1]], 
         color=colors)
plt.yticks(range(len(feature_names)), feature_names[sorted_indices[::-1]])
plt.xlabel('Importance Score')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()

# Print sorted importance values
for idx in sorted_indices:
    print(f'{feature_names[idx]}: {importance_scores[idx]:.4f}')
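Impurity-based importances such as `feature_importances_` are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data in which only the first two features carry signal. The dataset and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False, features 0 and 1 are informative
# and features 2 to 4 are pure noise
X_syn, y_syn = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in turn on the test set and measure the accuracy drop
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=42)

for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

If the two rankings disagree strongly, the permutation result is usually the safer guide to what the model relies on for unseen data.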

Correlation Heatmap

Exploring inter-feature relationships:

import seaborn as sns
import pandas as pd

# Compute correlation matrix
dataframe = pd.DataFrame(X_scaled, columns=feature_names)
correlation = dataframe.corr()

# Generate heatmap visualization
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, fmt='.2f', center=0,
            cmap='RdBu_r', square=True, linewidths=0.5,
            cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Heatmap')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

Performance Summary

The Random Forest classifier demonstrates strong predictive capability across the evaluation metrics used here. The ROC AUC score summarizes classification quality across all decision thresholds, while the confusion matrix shows which mineral types are mistaken for one another. Feature importance rankings identify voltage and altitude as the most influential predictors, informing future data collection priorities. The correlation heatmap exposes moderate correlations between certain feature pairs, so multicollinearity should be checked in subsequent modeling efforts.

This integrated approach, which combines dimensionality reduction, clustering, and ensemble classification, provides a comprehensive framework for mineral resource pattern analysis and type prediction.

Tags: python Machine Learning Random Forest Clustering Analysis Principal Component Analysis

Posted on Sat, 06 Jun 2026 18:29:05 +0000 by tauchai83