Resolving IndexError in K-Means Clustering Due to Incorrect CSV Delimiter

When applying the k-means clustering algorithm to the Iris and Wine datasets, the Iris dataset executes successfully, whereas the Wine dataset throws an error.

The error message "index 0 is out of bounds for axis 1 with size 0" indicates an attempt to access column 0 of an array that has no columns. To investigate the issue, the data loading function was updated to inspect dataset dimensions:

import pandas as pd

def load_and_preprocess_data():
    try:
        # Load datasets
        iris = pd.read_csv("dataset/Iris.csv", header=0)
        wine = pd.read_csv("dataset/wine.csv")
        wine_red = pd.read_csv("dataset/wine+quality/winequality-red.csv", sep=';')
        wine_white = pd.read_csv("dataset/wine+quality/winequality-white.csv", sep=';')
        
        df = wine_red  # Select dataset to process
        
        print("=== Dataset Summary ===")
        print(f"Shape: {df.shape}")
        print(f"Columns: {df.columns.tolist()}")
        print("First 5 rows:")
        print(df.head())
        print("Data types:")
        print(df.dtypes)
        
        # Identify label column
        col_names = list(df.columns)
        print(f"Last column name: {col_names[-1]}")
        print(f"Unique values in last column: {df[col_names[-1]].unique()}")
        
        # Separate features and labels
        feature_cols = col_names[:-1]
        label_col = col_names[-1]
        
        features = df[feature_cols]
        labels = df[label_col]
        
        print(f"Number of features: {len(feature_cols)}")
        print(f"Feature names: {feature_cols}")
        print(f"Features shape: {features.shape}")
        print(f"Labels shape: {labels.shape}")
        
        return features, labels, feature_cols, label_col, df
        
    except Exception as e:
        print(f"Error loading data: {e}")
        # Fallback to synthetic data
        print("Generating sample data for testing...")
        from sklearn.datasets import make_blobs
        X, y = make_blobs(n_samples=1599, centers=3, n_features=4, random_state=42)
        features = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(4)])
        labels = pd.Series(y)
        feature_cols = features.columns.tolist()
        label_col = 'target'
        df = features.copy()
        df[label_col] = y
        
        return features, labels, feature_cols, label_col, df

Upon execution, the Wine dataset showed unexpected dimensions:

=== Dataset Summary ===
Shape: (4898, 0)
Dimensions: 2
Size: 0
Column count: 0

This output reports 4898 rows but zero columns: no feature columns survived data loading.
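
A shape check right after loading can surface this kind of problem before it reaches the clustering step. The following is a minimal sketch, not part of the original loader; the helper name require_columns and the min_columns threshold are assumptions chosen for illustration:

```python
import pandas as pd


def require_columns(df: pd.DataFrame, min_columns: int = 2) -> pd.DataFrame:
    """Raise early if the frame cannot be split into features plus a label.

    Hypothetical helper: the name and the default threshold are illustrative.
    """
    n_cols = df.shape[1]
    if n_cols < min_columns:
        raise ValueError(
            f"Expected at least {min_columns} columns, got {n_cols}: "
            f"{df.columns.tolist()}"
        )
    return df


# A frame with two features and a label passes through unchanged.
good = require_columns(pd.DataFrame({"a": [1.0], "b": [2.0], "label": [0]}))
print(good.shape)

# A frame with a single column is rejected with a descriptive message.
error_text = ""
try:
    require_columns(pd.DataFrame({"only_column": [1.0]}))
except ValueError as err:
    error_text = str(err)
    print(error_text)
```

Calling such a check on df before separating features and labels would turn the later IndexError into an immediate, readable failure.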

The root cause was identified as a delimiter mismatch: the Iris dataset uses comma-separated values, whereas the Wine dataset uses semicolons. Because no separator was specified, pandas fell back to its default comma and read each semicolon-separated line as a single column. Splitting off the last column as the label then left no feature columns.
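
The effect can be reproduced without the real files. The sketch below assumes a small in-memory sample; its column names and values are made up for illustration and are not taken from the dataset:

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample (illustrative values only).
raw = "alcohol;pH;quality\n9.4;3.51;5\n9.8;3.20;5\n"

# No sep given, so pandas uses the default comma and finds a single column.
df_wrong = pd.read_csv(io.StringIO(raw))
print(df_wrong.shape)

# Treating the last column as the label leaves an empty feature list.
feature_cols = list(df_wrong.columns)[:-1]
features = df_wrong[feature_cols]
print(features.shape)

# Indexing column 0 of the empty feature matrix raises the IndexError.
message = ""
try:
    features.to_numpy()[:, 0]
except IndexError as err:
    message = str(err)
    print(message)
```

The first shape has one column, the second has none, and the final access fails in the same way as the clustering code.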

The solution involves explicitly specifying the correct separator when reading the file:

wine_red = pd.read_csv("dataset/wine+quality/winequality-red.csv", sep=';')

Adding sep=';' resolves the parsing issue and allows proper execution of the clustering algorithm.
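
When the delimiter of a file is not known in advance, pandas can also try to detect it: passing sep=None together with engine="python" makes read_csv use Python's csv.Sniffer on the start of the file. This is a heuristic rather than a guarantee, so the sketch below (again using a made-up in-memory sample) should be read as an optional alternative, not as a replacement for stating the separator explicitly:

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample (illustrative values only).
raw = "alcohol;pH;quality\n9.4;3.51;5\n9.8;3.20;5\n"

# sep=None requires the Python parsing engine, which sniffs the delimiter.
df_auto = pd.read_csv(io.StringIO(raw), sep=None, engine="python")
print(df_auto.shape)
print(df_auto.columns.tolist())
```

For files whose format is known, as with the Wine data here, the explicit sep=';' remains the more predictable choice.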

Tags: k-means Pandas CSV data-loading error-handling

Posted on Fri, 15 May 2026 23:14:48 +0000 by sargus