Wine Classification Using K-Nearest Neighbors in MindSpore

Overview

This guide demonstrates implementing a K-Nearest Neighbors classifier using MindSpore for the Wine dataset. We'll explore how to process chemical composition data to predict wine cultivars through distance-based classification.

Prerequisites

Before proceeding, ensure you have:

  • Python programming proficiency
  • Basic understanding of the KNN algorithm and Euclidean distance (a quick refresher sketch follows this list)
  • MindSpore 2.0+ installed (CPU/GPU/Ascend backends supported)
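
As a quick refresher, KNN labels a sample with the majority class among its k nearest training points under the Euclidean distance. Below is a minimal NumPy sketch of that distance computation (the names are illustrative, not the tutorial's code):

import numpy as np

def euclidean_distances(sample, training_set):
    # Distance from one sample to every row of the training set:
    # square the per-feature differences, sum per row, take the root
    return np.sqrt(((training_set - sample) ** 2).sum(axis=1))

points = np.array([[0.0, 0.0], [3.0, 4.0]])
print(euclidean_distances(np.array([0.0, 0.0]), points))  # [0. 5.]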

Dataset Acquisition

The Wine dataset contains 178 samples, each with 13 chemical properties, measured for wines from three Italian cultivars. Download the compressed archive:

from download import download

dataset_url = "https://ascend-professional-construction-dataset.obs.cn-north-4.myhuaweicloud.com:443/MachineLearning/wine.zip"
download_path = download(dataset_url, "./", kind="zip", replace=True)
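
The loading code below reads wine.data directly from the working directory, so the archive should extract there. A quick existence check (os is also imported in the next section; it is repeated here so the snippet stands alone):

import os

assert os.path.exists('./wine.data'), "wine.data not found after extraction"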

Data Preparation

Import the necessary libraries and configure the execution environment (set device_target to "GPU" or "Ascend" to match your backend):

import os
import csv
import numpy as np
import matplotlib.pyplot as plt
import mindspore as ms
from mindspore import ops

ms.set_context(device_target="CPU")

Load and inspect the raw data:

with open('wine.data') as csv_file:
    raw_data = list(csv.reader(csv_file, delimiter=','))
print("Sample records:", raw_data[56:62] + raw_data[130:133])

Separate features and labels, converting them to appropriate numeric types:

features = np.array([[float(val) for val in row[1:]] for row in raw_data[:178]], dtype=np.float32)
labels = np.array([int(row[0]) for row in raw_data[:178]], dtype=np.int32)
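
A quick sanity check on the parsed arrays helps catch slicing mistakes; the per-class counts of 59, 71, and 48 are the standard Wine dataset distribution:

print(features.shape, labels.shape)           # (178, 13) (178,)
print(np.unique(labels, return_counts=True))  # classes 1-3 with counts 59, 71, 48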

Feature Exploration

Visualize pairwise relationships between attributes to assess class separability:

attribute_names = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
attribute_pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]

for ax, (attr1, attr2) in zip(axes.flat, attribute_pairs):
    ax.scatter(features[:59, attr1], features[:59, attr2], label='Cultivar 1', alpha=0.7)
    ax.scatter(features[59:130, attr1], features[59:130, attr2], label='Cultivar 2', alpha=0.7)
    ax.scatter(features[130:, attr1], features[130:, attr2], label='Cultivar 3', alpha=0.7)
    ax.set_xlabel(attribute_names[attr1])
    ax.set_ylabel(attribute_names[attr2])
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Dataset Partitioning

Randomly split the data into training (128 samples) and testing (50 samples) sets:

train_size = 128
total_samples = features.shape[0]
shuffled_indices = np.random.permutation(total_samples)

train_indices = shuffled_indices[:train_size]
test_indices = shuffled_indices[train_size:]

train_features, train_labels = features[train_indices], labels[train_indices]
test_features, test_labels = features[test_indices], labels[test_indices]
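
Because the split is random, the reported accuracy varies between runs. For a reproducible partition, you can draw the permutation from a seeded generator instead (a sketch; the seed value 42 is arbitrary):

rng = np.random.default_rng(seed=42)               # fixed seed makes the split repeatable
shuffled_indices = rng.permutation(total_samples)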

KNN Implementation

Create a MindSpore Cell that computes distances and identifies nearest neighbors:

class KnnModel(ms.nn.Cell):
    def __init__(self, neighbor_count):
        super(KnnModel, self).__init__()
        self.neighbor_count = neighbor_count
    
    def construct(self, sample, training_set):
        # Broadcast sample to match training set dimensions
        expanded_input = ops.broadcast_to(sample, (training_set.shape[0], training_set.shape[1]))
        element_diff = ops.sub(expanded_input, training_set)
        squared_diff = ops.pow(element_diff, 2)
        distance = ops.sqrt(ops.sum(squared_diff, 1))
        
        # Find k smallest distances (topk returns the largest values, so negate)
        _, nearest_indices = ops.topk(ops.neg(distance), self.neighbor_count)
        return nearest_indices
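
A quick smoke test of the Cell on one held-out sample (purely illustrative; the returned values are row indices into the training set):

knn = KnnModel(neighbor_count=5)
neighbor_idx = knn(ms.Tensor(test_features[0]), ms.Tensor(train_features))
print(neighbor_idx.asnumpy())  # indices of the 5 nearest training samples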

Define the classification function that aggregates neighbor votes (on a tie, max returns the first label encountered in the tally):

def classify_sample(model, sample, train_features, train_labels):
    sample_tensor = ms.Tensor(sample)
    train_tensor = ms.Tensor(train_features)
    
    neighbor_indices = model(sample_tensor, train_tensor).asnumpy()
    vote_tally = {}
    
    for idx in neighbor_indices:
        label = int(train_labels[idx])
        vote_tally[label] = vote_tally.get(label, 0) + 1
    
    return max(vote_tally, key=vote_tally.get)

Model Evaluation

Test the classifier with k=5 neighbors:

knn_model = KnnModel(neighbor_count=5)
correct_count = 0

for test_sample, true_label in zip(test_features, test_labels):
    prediction = classify_sample(knn_model, test_sample, train_features, train_labels)
    correct_count += (prediction == true_label)
    print(f"Actual: {true_label}, Predicted: {prediction}")

accuracy = correct_count / len(test_labels)
print(f"\nFinal validation accuracy: {accuracy:.4f}")

An accuracy of approximately 80% shows that distance-based classification on these chemical measurements discriminates the three cultivars effectively.
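
Since k=5 was fixed upfront, a natural follow-up is to sweep the neighbor count and compare validation accuracies; a minimal sketch reusing the pieces above:

for k in (1, 3, 5, 7, 9):
    model = KnnModel(neighbor_count=k)
    correct = sum(
        classify_sample(model, sample, train_features, train_labels) == label
        for sample, label in zip(test_features, test_labels)
    )
    print(f"k={k}: accuracy {correct / len(test_labels):.4f}")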

Tags: mindspore knn wine-classification euclidean-distance matrix-operations

Posted on Fri, 15 May 2026 10:08:46 +0000 by Mikell