Time Series Prediction with LightGBM: Feature Engineering and Model Training

Data Exploration with Visualization

Understanding the dataset structure is crucial before building any model. The training data contains house identifiers, daily timestamps, house types, and the target variable representing electricity consumption.

import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from lightgbm import log_evaluation, early_stopping
import warnings
warnings.filterwarnings('ignore')

# Load datasets
train_df = pd.read_csv('./data/data283931/train.csv')
test_df = pd.read_csv('./data/data283931/test.csv')
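
A brief look at the raw tables (an optional check, not part of the original pipeline) confirms the shape and the columns described above:

# Optional sanity checks on the raw data
print(train_df.shape, test_df.shape)
print(train_df.head())                 # expected columns: id, dt, type, target
print(train_df['target'].describe())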

Visualizing Category Distribution

Analyzing how different house types correlate with consumption patterns helps identify meaningful features:

category_stats = train_df.groupby('type')['target'].mean().reset_index()
plt.figure(figsize=(8, 4))
plt.bar(category_stats['type'], category_stats['target'], color=['blue', 'green'])
plt.xlabel('Category')
plt.ylabel('Mean Consumption')
plt.title('Average Consumption by House Type')
plt.show()

Time Series Trend Analysis

Examining individual time series patterns reveals seasonal behavior and anomalies:

single_house = train_df[train_df['id'] == '00037f39cf'].copy()
plt.figure(figsize=(10, 5))
plt.plot(single_house['dt'], single_house['target'], marker='o', linestyle='-')
plt.xlabel('Day')
plt.ylabel('Consumption')
plt.title('Consumption Trend for Specific House ID')
plt.show()

Constructing Time-Based Features

Data Preprocessing Pipeline

Combining train and test sets enables consistent feature engineering across both datasets:

# Concatenate, then sort so that rows within each id run in chronological order
# (in this dataset larger dt values are earlier days, so descending dt means oldest-to-newest)
combined_data = pd.concat([test_df, train_df], axis=0, ignore_index=True)
combined_data = combined_data.sort_values(['id', 'dt'], ascending=False).reset_index(drop=True)

Lag Features (Historical Shift)

Lag features capture historical values by shifting time series backward. This approach helps the model learn temporal dependencies:

for offset in range(10, 30):
    combined_data[f'lag_{offset}'] = combined_data.groupby(['id'])['target'].shift(offset)

Rolling Window Statistics

Aggregating values within sliding windows provides smoothed representations of recent trends:

combined_data['window_3_mean'] = (combined_data['lag_10'] + combined_data['lag_11'] + combined_data['lag_12']) / 3
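
The same statistic can also be computed with a grouped rolling transform. The sketch below (using a hypothetical window_3_mean_alt column) is equivalent to the manual average above given the chronological sort applied earlier; the shift of 10 keeps the window on values that are already available at prediction time:

# Illustrative alternative: rolling mean via groupby-transform
# shift(10) aligns the window with lag_10..lag_12, so no future information leaks in
combined_data['window_3_mean_alt'] = combined_data.groupby('id')['target'].transform(
    lambda x: x.shift(10).rolling(window=3).mean()
)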

Dataset Split and Feature Selection

train_data = combined_data[combined_data.target.notna()].reset_index(drop=True)
test_data = combined_data[combined_data.target.isna()].reset_index(drop=True)

feature_columns = [col for col in combined_data.columns if col not in ['id', 'target']]

LightGBM Model Training

Training Configuration

def train_temporal_model(model_type, train_set, eval_set, features):
    # Temporal split: older days (dt >= 31) train the model, the most recent days (dt <= 30) form the validation set
    X_train = train_set[train_set.dt >= 31][features]
    y_train = train_set[train_set.dt >= 31]['target']
    X_val = train_set[train_set.dt <= 30][features]
    y_val = train_set[train_set.dt <= 30]['target']
    
    # Convert to LightGBM format
    train_matrix = model_type.Dataset(X_train, label=y_train)
    val_matrix = model_type.Dataset(X_val, label=y_val)
    
    # Hyperparameters
    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': 'mse',
        'min_child_weight': 5,
        'num_leaves': 32,
        'lambda_l2': 10,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 4,
        'learning_rate': 0.05,
        'seed': 2024,
        'nthread': 16,
        'verbose': -1,
    }
    
    # Training with callbacks
    callbacks = [log_evaluation(period=500), early_stopping(stopping_rounds=500)]
    fitted_model = model_type.train(
        params, 
        train_matrix, 
        50000, 
        valid_sets=[train_matrix, val_matrix], 
        categorical_feature=[], 
        callbacks=callbacks
    )
    
    # Predictions
    val_predictions = fitted_model.predict(X_val, num_iteration=fitted_model.best_iteration)
    test_predictions = fitted_model.predict(eval_set[features], num_iteration=fitted_model.best_iteration)
    
    # Evaluation
    mse_score = mean_squared_error(y_val, val_predictions)
    print(f'Validation MSE: {mse_score}')
    
    return val_predictions, test_predictions

validation_preds, test_preds = train_temporal_model(lgb, train_data, test_data, feature_columns)

Saving Predictions

test_data['target'] = test_preds
test_data[['id', 'dt', 'target']].to_csv('submission.csv', index=False)

Common Pitfalls and Solutions

File Path Resolution

If you encounter a FileNotFoundError, verify that the working directory matches your file structure. Use absolute paths or check os.getcwd() to confirm the location, and make sure the data directory exists and contains the required CSV files before running the script.
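
For example, a short check along these lines (illustrative, not part of the original script) catches path problems before any training starts:

import os

# Confirm where the script is running from and that the expected files exist
print(os.getcwd())
for path in ['./data/data283931/train.csv', './data/data283931/test.csv']:
    if not os.path.exists(path):
        raise FileNotFoundError(f'Expected data file not found: {path}')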

Missing Dependencies

When ModuleNotFoundError occurs for LightGBM, install via pip:

pip install lightgbm

Or within a notebook:

!pip install lightgbm

API Compatibility Issues

The verbose_eval argument to lgb.train was deprecated and later removed in newer LightGBM versions. Use the log_evaluation and early_stopping callbacks instead:

from lightgbm import log_evaluation, early_stopping
callbacks = [log_evaluation(period=500), early_stopping(stopping_rounds=500)]

Indentation Problems

Python relies on consistent indentation. Use four spaces per level and avoid mixing tabs with spaces. Most IDEs include auto-formatting tools to resolve these issues.

Understanding Gradient Boosting Decision Trees

GBDT constructs an ensemble of decision trees iteratively. Each tree corrects errors from previous iterations by fitting to negative gradients of the loss function. This sequential approach enables the model to capture complex non-linear relationships in structured data.
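
To make the residual-fitting idea concrete, here is a toy boosting loop for squared-error loss (where the negative gradient is simply the residual). It uses shallow scikit-learn trees purely for illustration and is not how LightGBM is implemented internally:

from sklearn.tree import DecisionTreeRegressor
import numpy as np

def toy_gbdt_fit(X, y, n_trees=50, learning_rate=0.1):
    # Start from the mean prediction, then repeatedly fit trees to the residuals
    # (the negative gradient of the squared-error loss)
    base = np.mean(y)
    pred = np.full(len(y), base, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def toy_gbdt_predict(X, base, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], base, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

LightGBM follows the same principle, but replaces the per-tree least-squares fit with second-order gradient statistics and histogram-based split finding.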

Key characteristics include:

  • Sequential tree building with residual correction
  • Gradient descent optimization of loss functions
  • Regularization through tree depth and leaf count constraints
  • Robustness to outliers and missing values

LightGBM Implementation Details

LightGBM optimizes the traditional gradient boosting approach through several innovations:

Histogram-Based Splitting: Continuous features are discretized into bins, so finding the best split for a feature scans O(#bins) candidate thresholds instead of O(#data) sorted values.

Leaf-Wise Growth: Unlike level-wise strategies, LightGBM expands the leaf node with maximum delta loss, often achieving faster convergence.

Parallel Processing: Multi-threading accelerates both data partitioning and histogram construction.

Native Categorical Support: Categorical features can be processed directly without one-hot encoding; LightGBM searches for optimal splits over groupings of categories rather than treating them as ordered values.
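
As a brief illustration of the native categorical handling, the sketch below builds a hypothetical variant of the training Dataset (dataset_with_cat), assuming the type column holds the integer-coded house type from the data above:

# Illustrative variant: declare 'type' as categorical instead of one-hot encoding it
X_cat = train_data[feature_columns].copy()
X_cat['type'] = X_cat['type'].astype('category')
dataset_with_cat = lgb.Dataset(X_cat, label=train_data['target'], categorical_feature=['type'])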

Advanced Feature Engineering Techniques

Lag Features

Shifting values backward in time creates features representing past observations. These capture autocorrelation and delayed responses in the target variable.

Practical considerations for lag features:

  • Lag Period: Longer lags capture slower trends but may dilute recent patterns
  • Missing Values: Lag features for early time steps will be null
  • Target Leakage: Ensure lag periods don't include future information (a simple check is sketched below)
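
One lightweight leakage guard is to assert that every lag offset is at least as long as the forecast horizon. The 10-day horizon below is an assumption consistent with lag_10 being the smallest lag used earlier, not something stated explicitly in the data description:

# Hypothetical leakage guard: every lag must look back at least forecast_horizon days,
# otherwise a feature would require target values that are unknown at prediction time
forecast_horizon = 10  # assumed horizon, matching the smallest lag used above
lag_offsets = range(10, 30)
assert all(offset >= forecast_horizon for offset in lag_offsets), 'Lag shorter than horizon -> leakage'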

Rolling Statistics

Computing aggregations over sliding windows reveals local trends and volatility:

combined_data['rolling_std'] = combined_data.groupby('id')['target'].transform(
    lambda x: x.shift(1).rolling(window=7, min_periods=3).std()
)

Difference Features

First-order differences represent the rate of change between consecutive time points:

combined_data['diff_1'] = combined_data.groupby('id')['target'].diff(1)
combined_data['diff_7'] = combined_data.groupby('id')['target'].diff(7)  # Weekly seasonality

Differencing helps stationarize non-stationary time series and emphasizes relative changes over absolute values.

Model Evaluation Strategy

Using the most recent time periods as validation sets simulates real-world prediction scenarios where future values must be forecast. This temporal validation approach prevents data leakage that would occur with random cross-validation on time series data.
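
A multi-fold variant of the same idea can be sketched as follows. The fold boundaries are illustrative, and it assumes (as in train_temporal_model above) that smaller dt values correspond to more recent days:

def temporal_folds(df, boundaries=(30, 60, 90), val_span=20):
    """Yield (train_mask, val_mask) pairs where the validation rows are always
    more recent than the rows used for training (smaller dt = more recent)."""
    for boundary in boundaries:
        train_mask = df['dt'] > boundary                                       # older days
        val_mask = (df['dt'] <= boundary) & (df['dt'] > boundary - val_span)   # next most recent days
        yield train_mask, val_mask

# Example usage with the training frame and feature set defined earlier
for train_mask, val_mask in temporal_folds(train_data):
    fold_train = train_data[train_mask]
    fold_val = train_data[val_mask]
    # ... train a model on fold_train and score it on fold_val ...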

Tags: lightgbm time-series feature-engineering machine-learning gradient-boosting
