Data Preparation
The dataset comes from a Kaggle store item demand forecasting competition. It contains 5 years of daily sales data (2013-2017) for 50 items across 10 stores, and the task is to predict sales for the following 3 months (January-March 2018). This is a multi-step, multivariate time series problem with 500 distinct series (50 items x 10 stores) to forecast.
Feature Engineering
Key observations from the data include weekly/monthly seasonality and annual trends. To capture these patterns, we incorporate:
- DateTime features with cyclic encoding (sine/cosine transformations)
- Annual autocorrelation values
- Per-series normalization of all features
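The three ideas in the list above can be sketched with the standard library alone. This is a minimal illustration, not the article's actual preprocessing code: the helper names (cyclic_encode, date_features, autocorrelation, standardize) are hypothetical, and "annual autocorrelation" is assumed here to mean the sample autocorrelation of a daily series at a lag of 365 days.

```python
import math
import statistics
from datetime import date

def cyclic_encode(value, period):
    """Map a periodic integer (e.g. day of week) onto the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def date_features(d):
    """Sine/cosine pairs for day of week and month of year."""
    dow_sin, dow_cos = cyclic_encode(d.weekday(), 7)   # Monday = 0
    mon_sin, mon_cos = cyclic_encode(d.month - 1, 12)  # January = 0
    return [dow_sin, dow_cos, mon_sin, mon_cos]

def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    if var == 0:
        return 0.0  # constant series: undefined, reported as 0 in this sketch
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

def standardize(series):
    """Scale one series to zero mean and unit variance; keep the stats for inversion."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series) or 1.0  # guard against constant series
    return [(v - mean) / std for v in series], mean, std

print(date_features(date(2017, 1, 2)))  # features for a Monday in January
```

With this encoding, Sunday and Monday end up as neighbours on the circle, which a raw 0-6 integer would not express. The mean and standard deviation returned by standardize are needed later to map predictions back to sales units.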
Sequence Construction
The model requires fixed-length input/output sequences:
- Output sequence: 90 days (3 months)
- Input sequence: 180 days (6 months)
- Sliding window approach generates sequential training samples
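The sliding window described above can be sketched as follows. The function name make_windows and the stride parameter are illustrative assumptions; the article does not state which stride it uses, so a stride of one day is assumed.

```python
def make_windows(series, input_len=180, output_len=90, stride=1):
    """Slice one series into (input, target) pairs with a sliding window."""
    samples = []
    last_start = len(series) - input_len - output_len
    for start in range(0, last_start + 1, stride):
        split = start + input_len
        samples.append((series[start:split], series[split:split + output_len]))
    return samples

# Stand-in for one store/item series: 2013-2017 is 1826 days (2016 is a leap year)
series = list(range(1826))
samples = make_windows(series)
print(len(samples))  # 1826 - 180 - 90 + 1 = 1557 windows per series
```

A series shorter than 270 days yields no samples at all, which is worth checking before building the training set.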
PyTorch Data Pipeline
import torch
from torch.utils.data import Dataset

class TimeSeriesDataset(Dataset):
    def __init__(self, categorical_cols=None, numeric_cols=None, embed_dims=None, include_decoder_input=True):
        self.sequences = None
        # None defaults avoid sharing one mutable list across instances
        self.cat_cols = categorical_cols if categorical_cols else []
        self.num_cols = numeric_cols if numeric_cols else []
        self.embed_config = []
        self.embed_dims = embed_dims if embed_dims else {}
        self.decoder_input = include_decoder_input

    def load_data(self, processed_df):
        self.sequences = processed_df

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sample = self.sequences.iloc[[idx]]
        x_seq = torch.tensor(sample['x_sequence'].values[0], dtype=torch.float32)
        y_seq = torch.tensor(sample['y_sequence'].values[0], dtype=torch.float32)
        # Column 0 of y_sequence is the target; the remaining columns are
        # features known in advance and are fed to the decoder
        decoder_in = y_seq[:, 1:] if self.decoder_input else None
        # Handle numeric features: repeat each per-series value along the time axis
        for col in self.num_cols:
            num_val = torch.tensor([sample[col].values[0]], dtype=torch.float32)
            x_seq = torch.cat((x_seq, num_val.repeat(x_seq.size(0)).unsqueeze(1)), dim=1)
            if decoder_in is not None:
                decoder_in = torch.cat((decoder_in, num_val.repeat(decoder_in.size(0)).unsqueeze(1)), dim=1)
        if decoder_in is None:
            return x_seq, y_seq[:, 0]
        return (x_seq, decoder_in), y_seq[:, 0]
Model Architecture
The encoder-decoder framework consists of two main components:
Encoder Network
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, input_dim, hidden_size, num_layers=1, bidirectional=False, dropout=0.2):
        super().__init__()
        self.gru = nn.GRU(
            input_size=input_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            bidirectional=bidirectional,
            # nn.GRU applies dropout only between stacked layers
            dropout=dropout if num_layers > 1 else 0.0,
            batch_first=True
        )

    def forward(self, x):
        # A univariate batch of shape (batch, seq_len) gets a feature axis
        if x.ndim < 3:
            x = x.unsqueeze(2)
        # Zero initial state: (num_layers * num_directions, batch, hidden_size)
        hidden = torch.zeros(self.gru.num_layers * (2 if self.gru.bidirectional else 1),
                             x.size(0), self.gru.hidden_size, device=x.device)
        output, hidden = self.gru(x, hidden)
        # Keep only the last state slice so the decoder receives (batch, hidden_size)
        return output, hidden[-1]
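To make the encoder's return value concrete, the shape check below runs a standalone nn.GRU on random data. The batch size, feature count and hidden size are made-up values for illustration. It shows that the final hidden state stacks layers and directions along its first axis, so indexing it with [-1] selects the reverse direction of the last layer when the GRU is bidirectional.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; none of these come from the article
batch, seq_len, n_features, hidden_size = 8, 180, 5, 16

gru = nn.GRU(n_features, hidden_size, num_layers=2,
             bidirectional=True, batch_first=True)
x = torch.randn(batch, seq_len, n_features)
output, h_n = gru(x)

print(output.shape)  # torch.Size([8, 180, 32]): both directions concatenated
print(h_n.shape)     # torch.Size([4, 8, 16]): num_layers * num_directions first

# Last slice of h_n: last layer, reverse direction, shape (batch, hidden_size)
context = h_n[-1]
print(context.shape)  # torch.Size([8, 16])
```

Because the reverse direction finishes at the first time step, this context vector has read the whole window but weights the oldest days last. Concatenating or summing both directions is a possible alternative, at the cost of changing the hidden size the decoder expects.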
Decoder Network
class DecoderUnit(nn.Module):
    def __init__(self, input_dim, hidden_size, dropout=0.2):
        super().__init__()
        self.gru_cell = nn.GRUCell(input_dim, hidden_size)
        self.linear = nn.Linear(hidden_size, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden, x):
        hidden = self.gru_cell(x, hidden)
        return self.linear(self.dropout(hidden)), hidden
Complete Model
class Seq2SeqForecaster(nn.Module):
    def __init__(self, encoder, decoder, pred_length=90, teacher_forcing=0.3):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.pred_len = pred_length
        self.teacher_forcing = teacher_forcing

    def forward(self, x, y_true=None):
        # x is the (x_seq, decoder_in) pair produced by TimeSeriesDataset
        enc_out, hidden = self.encoder(x[0])
        predictions = torch.zeros(x[0].size(0), self.pred_len, device=x[0].device)
        # Seed the decoder with the last observed target (assumed to be feature 0)
        prev_val = x[0][:, -1, 0].unsqueeze(1)
        for i in range(self.pred_len):
            # Teacher forcing: occasionally feed the true previous value instead of
            # the model's own prediction. Step i must use y_true[:, i - 1], because
            # y_true[:, i] is the value being predicted at this step.
            if y_true is not None and i > 0 and torch.rand(1).item() < self.teacher_forcing:
                prev_val = y_true[:, i - 1].unsqueeze(1)
            dec_input = torch.cat((prev_val, x[1][:, i]), dim=1)
            pred, hidden = self.decoder(hidden, dec_input)
            predictions[:, i] = pred.squeeze(1)
            prev_val = pred
        return predictions
Training Strategy
Key training considerations:
- Validation Approach: Time-based split (2014-2016 train, 2017 validation)
- Optimizer: AdamW, with separate optimizers for the encoder and decoder
- Learning Rate: 1cycle policy with maximum rate determined via LR finder
- Loss Function: MSE (more stable than SMAPE during training)
- Regularization: Dropout in both encoder and decoder networks
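The loss function choice above can be illustrated with a small comparison of SMAPE and MSE. This is a plain-Python sketch rather than the training code: the function names are hypothetical, the SMAPE variant shown is the common one scaled to the 0-200 range, and treating a zero denominator as a zero-error term is an assumed convention.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if not actual:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Assumed convention: both values zero counts as a perfect prediction
        total += 0.0 if denom == 0 else abs(a - p) / denom
    return 100.0 * total / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    if not actual:
        raise ValueError("inputs must not be empty")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# A tiny miss on a zero-sales day hits the SMAPE ceiling, while MSE barely moves
print(smape([0.0], [0.001]), mse([0.0], [0.001]))

# A 10-unit miss on a 100-unit day is a much smaller SMAPE term
print(smape([100.0], [110.0]), mse([100.0], [110.0]))
```

The first case is one plausible reading of why MSE is described as more stable: near zero sales, SMAPE jumps to its maximum for arbitrarily small errors, so its gradient is poorly behaved. SMAPE can still be reported on the validation set as an evaluation metric.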
Performance
The model placed in the top 10% of the Kaggle competition. Future improvements could include attention mechanisms and additional hyperparameter tuning.