Data Preparation and Cleaning
The first step in a machine learning workflow is preparing the dataset. For demonstration purposes, we'll use a pre-downloaded dataset of pumpkin prices.
Initial Data Exploration
import pandas as pd
# Load the pre-downloaded pumpkin pricing data
pumpkin_data = pd.read_csv('../data/US-pumpkins.csv')
# Inspect the first and last five rows
print(pumpkin_data.head())
print(pumpkin_data.tail())
Column Selection and Cleaning
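# Columns to keep; any name not present in the data is simply ignored
selected_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
pumpkin_data = pumpkin_data.drop([col for col in pumpkin_data.columns
                                  if col not in selected_columns], axis=1)
# Count missing values in each remaining column
print(pumpkin_data.isnull().sum())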
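To make the column selection and the missing-value check concrete, here is a minimal sketch on a small hand-made frame. The rows and the `sample`, `kept`, and `trimmed` names are assumptions for illustration and are not taken from US-pumpkins.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, not taken from US-pumpkins.csv
sample = pd.DataFrame({
    'City Name': ['BALTIMORE', 'BALTIMORE', 'BOSTON'],
    'Package': ['24 inch bins', '1 1/9 bushel cartons', '1/2 bushel cartons'],
    'Low Price': [270.0, 15.0, np.nan],
    'High Price': [280.0, 18.0, 17.0],
    'Date': ['4/29/17', '9/24/16', '10/1/16'],
})

selected_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']

# Keep only the selected columns that are actually present in the frame
kept = [col for col in sample.columns if col in selected_columns]
trimmed = sample[kept]

# Count missing values per column
missing_counts = trimmed.isnull().sum()
print(missing_counts)
```

Selecting by the intersection of names avoids a KeyError when a listed column, such as 'Month' here, does not exist in the data yet.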
Data Transformation
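# Extract the month from the date and store it as a column
pumpkin_data['Month'] = pd.DatetimeIndex(pumpkin_data['Date']).month
# Calculate the average of the low and high price and store it as a column
pumpkin_data['Price'] = (pumpkin_data['Low Price'] + pumpkin_data['High Price']) / 2
# Keep only pumpkins sold by the bushel; copy so later assignments
# modify an independent frame rather than a slice of pumpkin_data
filtered_pumpkins = pumpkin_data[pumpkin_data['Package'].str.contains('bushel',
    case=True, regex=True)].copy()
# Normalize prices to a per-bushel basis, using each row's own price
is_one_and_ninth = filtered_pumpkins['Package'].str.contains('1 1/9')
is_half = filtered_pumpkins['Package'].str.contains('1/2')
filtered_pumpkins.loc[is_one_and_ninth, 'Price'] = (
    filtered_pumpkins.loc[is_one_and_ninth, 'Price'] / (1 + 1/9))
filtered_pumpkins.loc[is_half, 'Price'] = (
    filtered_pumpkins.loc[is_half, 'Price'] / (1/2))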
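The per-bushel normalization can be checked by hand on a few rows. The sketch below uses made-up prices and hypothetical names (`sample`, `is_one_and_ninth`, `is_half`); it assumes the package size appears in the 'Package' text exactly as '1 1/9' or '1/2'.

```python
import pandas as pd

# Hypothetical sample rows; prices are made up for illustration
sample = pd.DataFrame({
    'Package': ['1 1/9 bushel cartons', '1/2 bushel cartons', 'bushel cartons'],
    'Low Price': [15.0, 8.0, 12.0],
    'High Price': [25.0, 12.0, 14.0],
})

# Average of low and high price for each row
sample['Price'] = (sample['Low Price'] + sample['High Price']) / 2

# Boolean masks selecting each package size (plain substring match)
is_one_and_ninth = sample['Package'].str.contains('1 1/9', regex=False)
is_half = sample['Package'].str.contains('1/2', regex=False)

# Divide by the package size in bushels to get a price per bushel
sample.loc[is_one_and_ninth, 'Price'] = sample.loc[is_one_and_ninth, 'Price'] / (1 + 1/9)
sample.loc[is_half, 'Price'] = sample.loc[is_half, 'Price'] / (1/2)

print(sample[['Package', 'Price']])
```

A 1 1/9 bushel carton averaging 20 works out to 20 / (10/9) = 18 per bushel, a half-bushel carton averaging 10 becomes 20 per bushel, and the full-bushel row keeps its average of 13.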
Data Visualization
import matplotlib.pyplot as plt
# Basic scatter plot
plt.scatter(filtered_pumpkins.Month, filtered_pumpkins.Price)
plt.show()
# Enhanced bar chart
filtered_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Average Pumpkin Price")
plt.show()
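The bar chart plots the output of a group-by aggregation, which can be inspected directly before plotting. This is a minimal sketch with made-up values; `sample` and `monthly_mean` are hypothetical names.

```python
import pandas as pd

# Hypothetical per-bushel prices; values are made up for illustration
sample = pd.DataFrame({
    'Month': [9, 9, 10, 10, 11],
    'Price': [18.0, 22.0, 15.0, 17.0, 14.0],
})

# Mean price for each month; the resulting index holds the month numbers
monthly_mean = sample.groupby('Month')['Price'].mean()
print(monthly_mean)
```

Calling `.plot(kind='bar')` on a series like `monthly_mean` draws one bar per index value, which is why the months end up on the x-axis.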
This approach gives a basic view of pumpkin prices by month. Additional factors such as year, weather conditions, and units of measurement could provide more comprehensive insights.
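As one possible way to bring the year into the analysis, the sketch below groups by both year and month. The rows are made up, the names are hypothetical, and the month/day/two-digit-year date format is an assumption that should be checked against the actual file.

```python
import pandas as pd

# Hypothetical rows; dates and prices are made up for illustration
sample = pd.DataFrame({
    'Date': ['9/24/16', '10/1/16', '9/23/17', '10/7/17'],
    'Price': [18.0, 16.0, 22.0, 20.0],
})

# Parse with an explicit format (assumed: month/day/two-digit year)
dates = pd.to_datetime(sample['Date'], format='%m/%d/%y')
sample['Year'] = dates.dt.year
sample['Month'] = dates.dt.month

# Mean price per (year, month), reshaped so each year becomes a column
by_year_month = sample.groupby(['Year', 'Month'])['Price'].mean().unstack('Year')
print(by_year_month)
```

With one column per year, the same month can be compared across years instead of being averaged together.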