Data Exploration with Visualization
Understanding the dataset structure is crucial before building any model. The training data contains house identifiers, daily timestamps, house types, and the target variable representing electricity consumption.
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from lightgbm import log_evaluation, early_stopping
import warnings
warnings.filterwarnings('ignore')
# Load datasets
train_df = pd.read_csv('./data/data283931/train.csv')
test_df = pd.read_csv('./data/data283931/test.csv')
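Before engineering features, it helps to confirm the structure described above. A minimal inspection sketch, assuming the id, dt, type, and target columns mentioned in the introduction:
# Check shapes, column types, and a sample of rows
print(train_df.shape, test_df.shape)
print(train_df.dtypes)
print(train_df.head())
# The test set should lack only the target column, which is what we predict
print(set(train_df.columns) - set(test_df.columns))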
Visualizing Category Distribution
Analyzing how different house categories correlate with consumption patterns helps identify meaningful features:
category_stats = train_df.groupby('type')['target'].mean().reset_index()
plt.figure(figsize=(8, 4))
plt.bar(category_stats['type'], category_stats['target'], color=['blue', 'green'])
plt.xlabel('Category')
plt.ylabel('Mean Consumption')
plt.title('Average Consumption by House Type')
plt.show()
Time Series Trend Analysis
Examining individual time series patterns reveals seasonal behavior and anomalies:
single_house = train_df[train_df['id'] == '00037f39cf'].copy()
plt.figure(figsize=(10, 5))
plt.plot(single_house['dt'], single_house['target'], marker='o', linestyle='-')
plt.xlabel('Day')
plt.ylabel('Consumption')
plt.title('Consumption Trend for Specific House ID')
plt.show()
Constructing Time-Based Features
Data Preprocessing Pipeline
Combining train and test sets enables consistent feature engineering across both datasets:
# Stack test on top of train so every feature is computed identically for both
combined_data = pd.concat([test_df, train_df], axis=0, ignore_index=True)
# dt counts down toward the forecast window (smaller dt = more recent), so
# sorting dt in descending order arranges each house's series chronologically
combined_data = combined_data.sort_values(['id', 'dt'], ascending=False).reset_index(drop=True)
Lag Features (Historical Shift)
Lag features expose historical values by shifting each series so that past observations align with the current row, which helps the model learn temporal dependencies:
# Lags start at 10 so every test row draws only on observed history
# (the holdout horizon), never on other unknown test-period values
for offset in range(10, 30):
    combined_data[f'lag_{offset}'] = combined_data.groupby(['id'])['target'].shift(offset)
Rolling Window Statistics
Aggregating values within sliding windows provides smoothed representations of recent trends:
# Average the three freshest available lags as a simple smoothed trend signal
combined_data['window_3_mean'] = (combined_data['lag_10'] + combined_data['lag_11'] + combined_data['lag_12']) / 3
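The same statistic generalizes through pandas' rolling API, which scales to wider windows without spelling out each lag by hand. A sketch over the lag_10 column built above (window_3_mean_alt matches window_3_mean wherever all three lags exist):
# Mean of lag_10 across the current and two preceding rows per house,
# i.e. the average of the targets 10, 11, and 12 steps back
combined_data['window_3_mean_alt'] = combined_data.groupby('id')['lag_10'].transform(
    lambda x: x.rolling(window=3, min_periods=3).mean()
)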
Dataset Split and Feature Selection
# Rows with a known target form the training pool; NaN targets are the test rows
train_data = combined_data[combined_data.target.notna()].reset_index(drop=True)
test_data = combined_data[combined_data.target.isna()].reset_index(drop=True)
# Use every engineered column except the identifier and the target itself
feature_columns = [col for col in combined_data.columns if col not in ['id', 'target']]
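A quick sanity check that the split reproduces the original row counts and that the target never leaks into the feature list is cheap insurance:
assert len(train_data) == len(train_df)
assert len(test_data) == len(test_df)
assert 'target' not in feature_columns
print(f'{len(feature_columns)} features selected')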
LightGBM Model Training
Training Configuration
def train_temporal_model(model_type, train_set, eval_set, features):
    # Split by temporal boundary: older days (dt >= 31) train, the most
    # recent 30 days (dt <= 30) validate, mimicking a real forecast
    X_train = train_set[train_set.dt >= 31][features]
    y_train = train_set[train_set.dt >= 31]['target']
    X_val = train_set[train_set.dt <= 30][features]
    y_val = train_set[train_set.dt <= 30]['target']
    # Convert to LightGBM's internal Dataset format
    train_matrix = model_type.Dataset(X_train, label=y_train)
    val_matrix = model_type.Dataset(X_val, label=y_val)
    # Hyperparameters
    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': 'mse',
        'min_child_weight': 5,
        'num_leaves': 32,
        'lambda_l2': 10,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 4,
        'learning_rate': 0.05,
        'seed': 2024,
        'nthread': 16,
        'verbose': -1,
    }
    # Training with callbacks: log every 500 rounds, stop after 500 rounds
    # without validation improvement
    callbacks = [log_evaluation(period=500), early_stopping(stopping_rounds=500)]
    fitted_model = model_type.train(
        params,
        train_matrix,
        50000,
        valid_sets=[train_matrix, val_matrix],
        categorical_feature=[],
        callbacks=callbacks
    )
    # Predict with the best iteration found by early stopping
    val_predictions = fitted_model.predict(X_val, num_iteration=fitted_model.best_iteration)
    test_predictions = fitted_model.predict(eval_set[features], num_iteration=fitted_model.best_iteration)
    # Evaluation (mean_squared_error expects y_true first)
    mse_score = mean_squared_error(y_val, val_predictions)
    print(f'Validation MSE: {mse_score}')
    return val_predictions, test_predictions
validation_preds, test_preds = train_temporal_model(lgb, train_data, test_data, feature_columns)
Saving Predictions
test_data['target'] = test_preds
test_data[['id', 'dt', 'target']].to_csv('submission.csv', index=False)
Common Pitfalls and Solutions
File Path Resolution
If you encounter a FileNotFoundError, verify that the working directory matches your file layout. Use absolute paths, or check os.getcwd() to confirm your location, and make sure the data directory exists and contains the required CSV files before running the script.
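A minimal pre-flight check, assuming the ./data/data283931 layout used above:
import os
print(os.getcwd())  # confirm the working directory
data_dir = './data/data283931'
for name in ('train.csv', 'test.csv'):
    print(name, os.path.exists(os.path.join(data_dir, name)))  # expect True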
Missing Dependencies
When ModuleNotFoundError occurs for LightGBM, install via pip:
pip install lightgbm
Or within a notebook:
!pip install lightgbm
API Compatibility Issues
The verbose_eval parameter was deprecated in newer LightGBM versions. Use callbacks instead:
from lightgbm import log_evaluation, early_stopping
callbacks = [log_evaluation(period=500), early_stopping(stopping_rounds=500)]
Indentation Problems
Python relies on consistent indentation. Use four spaces per level and avoid mixing tabs with spaces. Most IDEs include auto-formatting tools to resolve these issues.
Understanding Gradient Boosting Decision Trees
GBDT constructs an ensemble of decision trees iteratively. Each tree corrects errors from previous iterations by fitting to negative gradients of the loss function. This sequential approach enables the model to capture complex non-linear relationships in structured data.
Key characteristics include:
- Sequential tree building with residual correction
- Gradient descent optimization of loss functions
- Regularization through tree depth and leaf count constraints
- Robustness to outliers and missing values
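To make the sequential residual correction described above concrete, here is a from-scratch sketch for squared-error loss, where the negative gradient reduces to the residual y - F(x). The tree depth, learning rate, and tree count are illustrative, not tuned:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=50, lr=0.1):
    pred = np.full(len(y), y.mean())  # constant start: the mean minimizes squared error
    trees = []
    for _ in range(n_trees):
        residual = y - pred                              # negative gradient of 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += lr * tree.predict(X)                     # shrunken corrective step
        trees.append(tree)
    return y.mean(), trees

def gbdt_predict(base, trees, X, lr=0.1):
    return base + lr * sum(tree.predict(X) for tree in trees)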
LightGBM Implementation Details
LightGBM optimizes the traditional gradient boosting approach through several innovations:
Histogram-Based Splitting: Continuous features are discretized into bins, reducing the cost of evaluating split points from O(#data) to O(#bins) per feature.
Leaf-Wise Growth: Unlike level-wise strategies, LightGBM expands the leaf node with maximum delta loss, often achieving faster convergence.
Parallel Processing: Multi-threading accelerates both data partitioning and histogram construction.
Native Categorical Support: Categorical features can be consumed directly without one-hot encoding; LightGBM searches for optimal splits over groups of categories rather than treating the codes as ordered numbers.
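The parameters below map these ideas onto LightGBM's configuration surface; the values are illustrative examples, not tuned settings:
# Each key touches one of the optimizations above
params = {
    'max_bin': 255,      # histogram-based splitting: number of discretization bins
    'num_leaves': 32,    # leaf-wise growth: cap the leaf count rather than the depth
    'nthread': 16,       # parallel histogram construction and partitioning
    'verbose': -1,
}
# Native categorical handling: name the columns instead of one-hot encoding them,
# e.g. lgb.Dataset(X, label=y, categorical_feature=['type'])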
Advanced Feature Engineering Techniques
Lag Features
Shifting values backward in time creates features representing past observations. These capture autocorrelation and delayed responses in the target variable.
| Aspect | Consideration |
|---|---|
| Lag Period | Longer lags capture slower trends but may dilute recent patterns |
| Missing Values | Lag features for early time steps will be null |
| Target Leakage | Ensure lag periods don't include future information |
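The considerations in the table can be audited directly on the engineered frame; a sketch assuming the lag_10 through lag_29 columns built earlier:
lag_cols = [f'lag_{k}' for k in range(10, 30)]
# Missing values: early rows of each series cannot look far enough back
print(combined_data[lag_cols].isna().mean().round(3))
# Target leakage: every lag must reach back at least the 10-day forecast horizon
assert all(int(c.split('_')[1]) >= 10 for c in lag_cols)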
Rolling Statistics
Computing aggregations over sliding windows reveals local trends and volatility:
# shift(1) offsets the window so the statistic at each row only sees strictly
# earlier values; min_periods=3 yields NaN until three observations accumulate
combined_data['rolling_std'] = combined_data.groupby('id')['target'].transform(
    lambda x: x.shift(1).rolling(window=7, min_periods=3).std()
)
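Note the shift before the rolling call: without it, the window would include the current target, a direct leak. For features that must be computable on test rows, a shift matching the forecast horizon (10 here, as with the lag features above) keeps the window entirely inside observed history.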
Difference Features
First-order differences represent the rate of change between consecutive time points:
combined_data['diff_1'] = combined_data.groupby('id')['target'].diff(1)
combined_data['diff_7'] = combined_data.groupby('id')['target'].diff(7) # Weekly seasonality
Differencing helps stationarize non-stationary time series and emphasizes relative changes over absolute values.
Model Evaluation Strategy
Using the most recent time periods as validation sets simulates real-world prediction scenarios where future values must be forecast. This temporal validation approach prevents data leakage that would occur with random cross-validation on time series data.
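The single holdout used in train_temporal_model generalizes to several temporal folds. A sketch assuming the dt countdown convention used throughout (smaller dt = more recent):
# Each fold trains on older days and validates on a more recent block,
# never the reverse
def temporal_folds(df, val_blocks=((1, 30), (31, 60), (61, 90))):
    for lo, hi in val_blocks:
        train_fold = df[df.dt > hi]
        val_fold = df[(df.dt >= lo) & (df.dt <= hi)]
        yield train_fold, val_fold

for i, (tr, va) in enumerate(temporal_folds(train_data)):
    print(f'fold {i}: train={len(tr)} rows, val={len(va)} rows')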