Overview
This guide demonstrates implementing a K-Nearest Neighbors classifier using MindSpore for the Wine dataset. We'll explore how to process chemical composition data to predict wine cultivars through distance-based classification.
Prerequisites
Before proceeding, ensure you have:
- Python programming proficiency
- Basic understanding of KNN algorithm and Euclidean distance
- MindSpore 2.0+ installed (CPU/GPU/Ascend backends supported)
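As a quick refresher on the distance computation KNN relies on, here is a minimal NumPy sketch of the Euclidean distance between two samples (the values are invented for illustration):

```python
import numpy as np

# Two hypothetical 3-feature samples
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared differences
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0
```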
Dataset Acquisition
The Wine dataset contains 178 samples with 13 chemical properties measured for three Italian wine types. Download the compressed archive:
from download import download
dataset_url = "https://ascend-professional-construction-dataset.obs.cn-north-4.myhuaweicloud.com:443/MachineLearning/wine.zip"
download_path = download(dataset_url, "./", kind="zip", replace=True)
Data Preparation
Import necessary libraries and configure the execution environment:
import os
import csv
import numpy as np
import matplotlib.pyplot as plt
import mindspore as ms
from mindspore import ops
ms.set_context(device_target="CPU")
Load and inspect the raw data:
with open('wine.data') as csv_file:
    raw_data = list(csv.reader(csv_file, delimiter=','))
print("Sample records:", raw_data[56:62] + raw_data[130:133])
Separate features and labels, converting them to appropriate numeric types:
features = np.array([[float(val) for val in row[1:]] for row in raw_data[:178]], dtype=np.float32)
labels = np.array([int(row[0]) for row in raw_data[:178]], dtype=np.int32)
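To sanity-check the parsing logic, the same row-to-array conversion can be exercised on a couple of hand-written rows in the Wine layout (label first, then 13 feature values; the numbers below are invented, not real records):

```python
import numpy as np

# Two fake CSV rows: class label in column 0, 13 features after it
fake_rows = [
    ["1", "14.2", "1.7"] + ["0.0"] * 11,
    ["3", "12.8", "2.9"] + ["0.0"] * 11,
]

feats = np.array([[float(v) for v in row[1:]] for row in fake_rows], dtype=np.float32)
labs = np.array([int(row[0]) for row in fake_rows], dtype=np.int32)

print(feats.shape)  # (2, 13)
print(labs)         # [1 3]
```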
Feature Exploration
Visualize pairwise relationships between attributes to assess class separability:
attribute_names = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols',
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue',
'OD280/OD315 of diluted wines', 'Proline']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
attribute_pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]
for ax, (attr1, attr2) in zip(axes.flat, attribute_pairs):
    ax.scatter(features[:59, attr1], features[:59, attr2], label='Cultivar 1', alpha=0.7)
    ax.scatter(features[59:130, attr1], features[59:130, attr2], label='Cultivar 2', alpha=0.7)
    ax.scatter(features[130:, attr1], features[130:, attr2], label='Cultivar 3', alpha=0.7)
    ax.set_xlabel(attribute_names[attr1])
    ax.set_ylabel(attribute_names[attr2])
    ax.legend()
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Dataset Partitioning
Randomly split data into training (128 samples) and testing (50 samples) sets:
train_size = 128
total_samples = features.shape[0]
shuffled_indices = np.random.permutation(total_samples)
train_indices = shuffled_indices[:train_size]
test_indices = shuffled_indices[train_size:]
train_features, train_labels = features[train_indices], labels[train_indices]
test_features, test_labels = features[test_indices], labels[test_indices]
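The split above uses an unseeded permutation, so the partition (and hence the final accuracy) changes between runs. If reproducibility matters, a seeded generator can drive the same logic; a small sketch on toy indices (the seed value is arbitrary):

```python
import numpy as np

# Seeded generator so the split is identical across runs
rng = np.random.default_rng(0)

n_samples, train_size = 10, 7
indices = rng.permutation(n_samples)
train_idx, test_idx = indices[:train_size], indices[train_size:]

# Every sample lands in exactly one partition
print(sorted(np.concatenate([train_idx, test_idx]).tolist()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```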
KNN Implementation
Create a MindSpore Cell that computes distances and identifies nearest neighbors:
class KnnModel(ms.nn.Cell):
    def __init__(self, neighbor_count):
        super(KnnModel, self).__init__()
        self.neighbor_count = neighbor_count

    def construct(self, sample, training_set):
        # Broadcast the query sample to match the training set's shape
        expanded_input = ops.broadcast_to(sample, (training_set.shape[0], training_set.shape[1]))
        element_diff = ops.sub(expanded_input, training_set)
        squared_diff = ops.pow(element_diff, 2)
        distance = ops.sqrt(ops.sum(squared_diff, 1))
        # topk returns the largest values, so negate distances to get the k smallest
        _, nearest_indices = ops.topk(ops.neg(distance), self.neighbor_count)
        return nearest_indices
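The broadcast-subtract-square-sum-sqrt pipeline inside `construct` can be verified against plain NumPy on toy data, with `np.argsort` playing the role of `topk` over negated distances (the example points are invented):

```python
import numpy as np

train = np.array([[0.0, 0.0],
                  [1.0, 1.0],
                  [5.0, 5.0]], dtype=np.float32)
sample = np.array([0.9, 1.1], dtype=np.float32)

# Broadcasting subtracts the sample from every training row at once
dists = np.sqrt(np.sum((sample - train) ** 2, axis=1))

# Indices of the k=2 rows with the smallest distances
k = 2
nearest = np.argsort(dists)[:k]
print(nearest.tolist())  # [1, 0]
```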
Define the classification function that aggregates neighbor votes:
def classify_sample(model, sample, train_features, train_labels):
    sample_tensor = ms.Tensor(sample)
    train_tensor = ms.Tensor(train_features)
    neighbor_indices = model(sample_tensor, train_tensor).asnumpy()
    vote_tally = {}
    for idx in neighbor_indices:
        label = int(train_labels[idx])
        vote_tally[label] = vote_tally.get(label, 0) + 1
    return max(vote_tally, key=vote_tally.get)
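The voting step is plain Python, so it can be exercised in isolation with hypothetical neighbor labels. Note that on a tie, `max` returns the first key that reaches the shared top count (insertion order of the tally dict):

```python
def majority_vote(neighbor_labels):
    # Count occurrences of each label and return the most frequent one
    tally = {}
    for label in neighbor_labels:
        tally[label] = tally.get(label, 0) + 1
    return max(tally, key=tally.get)

print(majority_vote([2, 1, 2, 3, 2]))  # 2
# Tie between 1 and 3: the first label tallied wins
print(majority_vote([1, 1, 3, 3]))  # 1
```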
Model Evaluation
Test the classifier with k=5 neighbors:
knn_model = KnnModel(neighbor_count=5)
correct_count = 0
for test_sample, true_label in zip(test_features, test_labels):
    prediction = classify_sample(knn_model, test_sample, train_features, train_labels)
    correct_count += (prediction == true_label)
    print(f"Actual: {true_label}, Predicted: {prediction}")
accuracy = correct_count / len(test_labels)
print(f"\nFinal validation accuracy: {accuracy:.4f}")
Accuracy typically lands around 80%, though the exact figure varies with the random split. This demonstrates that distance-based classification on chemical composition discriminates the cultivars reasonably well.
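One caveat worth noting: the Wine attributes span very different scales (Proline runs into the thousands while most other features stay in single digits), so large-magnitude features dominate the raw Euclidean distance. Z-score standardization, with statistics computed from the training split only, often improves distance-based classifiers; a hedged sketch on invented toy values:

```python
import numpy as np

# Toy feature matrix with wildly different column scales
train_feats = np.array([[13.0, 1000.0],
                        [12.0, 1200.0],
                        [14.0,  800.0]], dtype=np.float32)

# Compute mean/std on the training split, then apply to both splits
mean = train_feats.mean(axis=0)
std = train_feats.std(axis=0)
train_scaled = (train_feats - mean) / std

# After scaling, each column has zero mean and unit variance
print(np.allclose(train_scaled.mean(axis=0), 0.0, atol=1e-5))  # True
```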