When applying the k-means clustering algorithm to the Iris and Wine datasets, the Iris dataset executes successfully, whereas the Wine dataset throws an error.
The error message "index 0 is out of bounds for axis 1 with size 0" means the code tried to read the first column (index 0 along axis 1) of an array that has no columns. To investigate, the data loading function was updated to print the dataset's dimensions:
import pandas as pd

def load_and_preprocess_data():
    try:
        # Load datasets
        iris = pd.read_csv("dataset/Iris.csv", header=0)
        wine = pd.read_csv("dataset/wine.csv")
        wine_red = pd.read_csv("dataset/wine+quality/winequality-red.csv", sep=';')
        wine_white = pd.read_csv("dataset/wine+quality/winequality-white.csv", sep=';')
        df = wine_red  # Select dataset to process

        print("=== Dataset Summary ===")
        print(f"Shape: {df.shape}")
        print(f"Columns: {df.columns.tolist()}")
        print("First 5 rows:")
        print(df.head())
        print("Data types:")
        print(df.dtypes)

        # Identify label column (assumed to be the last column)
        col_names = list(df.columns)
        print(f"Last column name: {col_names[-1]}")
        print(f"Unique values in last column: {df[col_names[-1]].unique()}")

        # Separate features and labels
        feature_cols = col_names[:-1]
        label_col = col_names[-1]
        features = df[feature_cols]
        labels = df[label_col]
        print(f"Number of features: {len(feature_cols)}")
        print(f"Feature names: {feature_cols}")
        print(f"Features shape: {features.shape}")
        print(f"Labels shape: {labels.shape}")
        return features, labels, feature_cols, label_col, df
    except Exception as e:
        print(f"Error loading data: {e}")
        # Fallback to synthetic data
        print("Generating sample data for testing...")
        from sklearn.datasets import make_blobs
        X, y = make_blobs(n_samples=1599, centers=3, n_features=4, random_state=42)
        features = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(4)])
        labels = pd.Series(y)
        feature_cols = features.columns.tolist()
        label_col = 'target'
        df = features.copy()
        df[label_col] = y
        return features, labels, feature_cols, label_col, df
Upon execution, the Wine dataset showed unexpected dimensions:
=== Dataset Summary ===
Shape: (4898, 0)
Dimensions: 2
Size: 0
Column count: 0
This output shows 4898 rows but zero columns, which suggests that no feature columns survived data loading.
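To see why a zero-column feature matrix leads to this exact error, the minimal sketch below builds an array with rows but no columns and tries to read its first column. The array X is a hypothetical stand-in rather than the actual clustering input, and the sketch assumes the k-means code indexes the columns of a NumPy array.

```python
import numpy as np

# Hypothetical stand-in for a feature matrix with 4898 rows and no columns
X = np.empty((4898, 0))

try:
    X[:, 0]  # reading the first feature column fails: axis 1 has size 0
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(X.shape)
print(error_message)  # index 0 is out of bounds for axis 1 with size 0
```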
The root cause was identified as a delimiter mismatch: the Iris dataset uses comma-separated values, whereas the Wine dataset uses semicolon-separated values. Because no separator was specified, pd.read_csv fell back to its default, the comma, and parsed each semicolon-delimited line as a single column. With the last column taken as the label, no columns remained for the features.
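The delimiter mismatch can be reproduced without the original files. The sketch below is an illustration under assumptions: the in-memory sample is made up and simplified (unquoted headers, two rows), and it only mirrors the "last column is the label" split used in the loading function.

```python
import io
import pandas as pd

# Hypothetical, simplified sample in a semicolon-delimited layout
sample = "fixed acidity;volatile acidity;quality\n7.4;0.7;5\n7.8;0.88;5\n"

wrong = pd.read_csv(io.StringIO(sample))            # default sep=","
right = pd.read_csv(io.StringIO(sample), sep=";")   # explicit separator

# Same split as in the loading function: last column is the label
wrong_features = wrong[list(wrong.columns)[:-1]]
right_features = right[list(right.columns)[:-1]]

print(wrong.shape, wrong_features.shape)  # one column in total, zero features
print(right.shape, right_features.shape)  # three columns in total, two features
```

With the default separator the whole line becomes one column, so the feature frame ends up with zero columns, which matches the symptom described above.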
The solution is to specify the correct separator explicitly when reading the file:
wine_red = pd.read_csv("dataset/winequality-red.csv", sep=';')
Adding sep=';' resolves the parsing issue and allows proper execution of the clustering algorithm.
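One possible safeguard is to check the column count before splitting features from labels, so that a wrong separator fails at load time with a clear message instead of surfacing later inside the clustering code. This is only a sketch: split_features_labels is a hypothetical helper, not part of the original code, and the sample data is made up.

```python
import io
import pandas as pd

def split_features_labels(df):
    """Hypothetical helper: last column is the label, the rest are features."""
    if df.shape[1] < 2:
        raise ValueError(
            f"Expected at least 2 columns, got {df.shape[1]}: {df.columns.tolist()}. "
            "Check the sep argument passed to pd.read_csv."
        )
    return df.iloc[:, :-1], df.iloc[:, -1]

# Hypothetical semicolon-delimited sample
sample = "alcohol;pH;quality\n9.4;3.51;5\n9.8;3.2;5\n"

# Correct separator: the split succeeds
features, labels = split_features_labels(pd.read_csv(io.StringIO(sample), sep=";"))

# Default separator: the guard reports the problem immediately
try:
    split_features_labels(pd.read_csv(io.StringIO(sample)))
    guard_message = None
except ValueError as exc:
    guard_message = str(exc)

print(features.shape, labels.tolist())
print(guard_message)
```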