Preparing and Visualizing Data for Machine Learning

Data Preparation and Cleaning

When working with machine learning, the initial step involves preparing the dataset. For demonstration purposes, we'll use a pre-downloaded dataset containing pumpkin pricing information.

Initial Data Exploration

import pandas as pd
pumpkin_data = pd.read_csv('../data/US-pumpkins.csv')
print(pumpkin_data.head())
print(pumpkin_data.tail())

Column Selection and Cleaning

# Keep only the columns needed for the analysis; Month is derived from Date later
selected_columns = ['Package', 'Low Price', 'High Price', 'Date']
pumpkin_data = pumpkin_data.drop([col for col in pumpkin_data.columns
                                if col not in selected_columns], axis=1)
print(pumpkin_data.isnull().sum())
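The null count above only reports where values are missing; what to do about them is a separate decision. As a minimal sketch, assuming rows without a package or price are not useful for this analysis, the example below runs the same check on a small made-up frame (the rows are invented for illustration, not taken from US-pumpkins.csv) and then drops incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from US-pumpkins.csv
toy = pd.DataFrame({
    'Package': ['bushel cartons', '1/2 bushel cartons', None],
    'Low Price': [20.0, np.nan, 12.0],
    'High Price': [22.0, 11.0, 14.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One possible follow-up (an assumption, not part of the original steps):
# keep only rows that have every field the later steps rely on
complete_rows = toy.dropna(subset=['Package', 'Low Price', 'High Price'])
print(len(complete_rows))
```

Dropping rows is only one option; filling gaps or leaving them in place may suit other analyses better.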

Data Transformation

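# Extract the month from the date and store it as a column
pumpkin_data['Month'] = pd.DatetimeIndex(pumpkin_data['Date']).month

# Average the low and high prices into a single Price column
pumpkin_data['Price'] = (pumpkin_data['Low Price'] + pumpkin_data['High Price']) / 2

# Keep only pumpkins sold by the bushel; copy so that later assignments
# modify this frame rather than a view of pumpkin_data
filtered_pumpkins = pumpkin_data[pumpkin_data['Package'].str.contains('bushel',
                                    case=True, regex=True, na=False)].copy()

# Normalize prices to a per-bushel basis by dividing by the package size
is_1_1_9 = filtered_pumpkins['Package'].str.contains('1 1/9')
is_half = filtered_pumpkins['Package'].str.contains('1/2')
filtered_pumpkins.loc[is_1_1_9, 'Price'] = filtered_pumpkins.loc[is_1_1_9, 'Price'] / (1 + 1/9)
filtered_pumpkins.loc[is_half, 'Price'] = filtered_pumpkins.loc[is_half, 'Price'] / (1/2)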
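To make the per-bushel normalization concrete, here is a small self-contained sketch on made-up rows. The package labels and prices are invented for illustration and are not taken from the dataset; the idea is simply that a price quoted for a 1 1/9 bushel or 1/2 bushel package is divided by the package size to get a price per bushel.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from US-pumpkins.csv
toy = pd.DataFrame({
    'Package': ['1 1/9 bushel cartons', '1/2 bushel cartons', 'bushel cartons'],
    'Low Price': [15.0, 9.0, 20.0],
    'High Price': [17.0, 11.0, 22.0],
})

# Average of the low and high price for each row
toy['Price'] = (toy['Low Price'] + toy['High Price']) / 2

# Divide by the package size (in bushels) to get a price per bushel
is_1_1_9 = toy['Package'].str.contains('1 1/9')
is_half = toy['Package'].str.contains('1/2')
toy.loc[is_1_1_9, 'Price'] = toy.loc[is_1_1_9, 'Price'] / (1 + 1/9)
toy.loc[is_half, 'Price'] = toy.loc[is_half, 'Price'] / (1/2)

print(toy[['Package', 'Price']])
```

In this toy frame, the half-bushel carton averaging 10 becomes 20 per bushel, the 1 1/9 bushel carton averaging 16 becomes roughly 14.4, and the plain bushel row is left unchanged at 21.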

Data Visualization

import matplotlib.pyplot as plt

# Basic scatter plot
plt.scatter(filtered_pumpkins.Month, filtered_pumpkins.Price)
plt.show()

# Enhanced bar chart
filtered_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Average Pumpkin Price")
plt.show()

While this approach provides a basic analysis of pumpkin prices by month, it's important to note that additional factors like year, weather conditions, and measurement units could provide more comprehensive insights.
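As one way to start accounting for year, the sketch below groups a made-up frame by both year and month rather than by month alone. The dates and prices are invented for illustration, and the `Year` column is a hypothetical addition that does not appear in the steps above.

```python
import pandas as pd

# Hypothetical dates and prices for illustration only
toy = pd.DataFrame({
    'Date': ['2016-09-24', '2016-10-01', '2017-09-30', '2017-10-07', '2017-10-14'],
    'Price': [15.0, 18.0, 16.0, 20.0, 22.0],
})

# Derive both year and month from the date
dates = pd.to_datetime(toy['Date'])
toy['Year'] = dates.dt.year
toy['Month'] = dates.dt.month

# Mean price for each (year, month) pair instead of pooling all years
by_year_month = toy.groupby(['Year', 'Month'])['Price'].mean()
print(by_year_month)
```

Grouping this way keeps an unusually cheap or expensive season from being averaged into the same month of other years.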

Tags: machine-learning data-preprocessing Pandas matplotlib data-visualization

Posted on Wed, 13 May 2026 22:36:29 +0000 by Sianide