Deconstructing the Transformer Architecture: A Component-Level Implementation Guide

Input Tensor Configuration and Data Pipeline

Sequence-to-sequence translation systems map discrete token indices from a source vocabulary to a target vocabulary. For implementation purposes, consider a source vocabulary of 2,000 tokens and a target vocabulary of 1,000 tokens. Training proceeds in batches: each batch arrives as a two-dimensional matrix of token indices and becomes three-dimensional only after embedding. With a batch size of 10 sequences, each truncated or padded to a length of 8 tokens, the raw input tensors for both the encoder and decoder branches have the shape (10, 8).
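
As a concrete illustration of these shapes, the following minimal sketch builds one source batch and one target batch. The random indices are only a stand-in for a real tokenized corpus, and the constant names are illustrative rather than taken from any particular codebase.

```python
import torch

SRC_VOCAB, TGT_VOCAB = 2000, 1000   # vocabulary sizes from the text
BATCH_SIZE, SEQ_LEN = 10, 8         # batch size and padded length from the text

# Assumption: random indices stand in for real tokenized sentences.
src_batch = torch.randint(0, SRC_VOCAB, (BATCH_SIZE, SEQ_LEN))
tgt_batch = torch.randint(0, TGT_VOCAB, (BATCH_SIZE, SEQ_LEN))

print(src_batch.shape, src_batch.dtype)  # torch.Size([10, 8]) torch.int64
print(tgt_batch.shape, tgt_batch.dtype)  # torch.Size([10, 8]) torch.int64
```

The indices are 64-bit integers, which is the type nn.Embedding expects as input.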

Token Vectorization and Positional Integration

Discrete token indices lack geometric continuity, which prevents effective gradient-based optimization. A lookup table converts these integers into dense vector representations. To stabilize gradient magnitudes during early training phases, the output of the embedding lookup is scaled by the square root of the model's latent dimension. Because the architecture inherently lacks recurrence and convolution, it has no built-in notion of token order, so deterministic sinusoidal signals are added to the embeddings to encode position.

import copy
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenEmbedding(nn.Module):
    def __init__(self, num_tokens: int, embed_dim: int):
        super().__init__()
        self.vector_lookup = nn.Embedding(num_tokens, embed_dim)
        self.scale = math.sqrt(embed_dim)

    def forward(self, token_indices: torch.Tensor) -> torch.Tensor:
        return self.vector_lookup(token_indices) * self.scale


class FixedPositionalEncoding(nn.Module):
    def __init__(self, embed_dim: int, max_len: int = 5000, drop_rate: float = 0.1):
        super().__init__()
        self.regularizer = nn.Dropout(drop_rate)
        freq_matrix = torch.zeros(max_len, embed_dim)
        positions = torch.arange(0, max_len).float().unsqueeze(1)
        divisors = torch.exp(torch.arange(0, embed_dim, 2) * -(math.log(10000.0) / embed_dim))
        freq_matrix[:, 0::2] = torch.sin(positions * divisors)
        freq_matrix[:, 1::2] = torch.cos(positions * divisors)
        self.register_buffer('encoding', freq_matrix.unsqueeze(0))

    def forward(self, sequence_vectors: torch.Tensor) -> torch.Tensor:
        seq_len = sequence_vectors.size(1)
        # A registered buffer carries no gradient, so no detach() is needed here.
        return self.regularizer(sequence_vectors + self.encoding[:, :seq_len])

Executing a forward pass with a batch of shape (10, 8) and a model dimension of 512 yields a tensor of shape (10, 8, 512) after embedding and positional addition. The batch and sequence dimensions are preserved, and each token index is replaced by a 512-dimensional vector.
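
The shape claim can be checked with a short, self-contained sketch. It inlines the same arithmetic as TokenEmbedding and FixedPositionalEncoding instead of importing them, omits dropout for clarity, and assumes a model dimension of 512 and a source vocabulary of 2,000 tokens.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab, max_len = 512, 2000, 5000  # assumed sizes for this sketch

# Embedding lookup scaled by sqrt(d_model), as in TokenEmbedding.
lookup = nn.Embedding(vocab, d_model)
tokens = torch.randint(0, vocab, (10, 8))
embedded = lookup(tokens) * math.sqrt(d_model)            # (10, 8, 512)

# Sinusoidal table, same formula as in FixedPositionalEncoding.
pos = torch.arange(max_len).float().unsqueeze(1)          # (max_len, 1)
div = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
table = torch.zeros(max_len, d_model)
table[:, 0::2] = torch.sin(pos * div)                     # even feature indices
table[:, 1::2] = torch.cos(pos * div)                     # odd feature indices

# Add the first 8 rows of the table, broadcast over the batch dimension.
out = embedded + table[:8].unsqueeze(0)
print(out.shape)  # torch.Size([10, 8, 512])
```

At position 0 the table holds sin(0) = 0 in every even column and cos(0) = 1 in every odd column, which is a quick sanity check on the interleaving.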

Stacked Encoder Architecture

The encoding subsystem functions as a sequential stack of identical processing units. A configuration parameter dictates how many times the base unit is duplicated. After the data traverses the entire stack, a final normalization layer standardizes the feature distribution before passing it to subsequent modules.

class TransformerEncoder(nn.Module):
    def __init__(self, base_layer: nn.Module, repetition_count: int, feature_dim: int):
        super().__init__()
        self.layer_stack = nn.ModuleList([copy.deepcopy(base_layer) for _ in range(repetition_count)])
        self.final_norm = nn.LayerNorm(feature_dim)

    def forward(self, input_tensor: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        for processing_layer in self.layer_stack:
            input_tensor = processing_layer(input_tensor, mask)
        return self.final_norm(input_tensor)
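
The use of copy.deepcopy in the stack matters: repeating the same module object would make every layer share a single set of weights. The sketch below illustrates the difference, using nn.Linear as a hypothetical stand-in for an encoder block.

```python
import copy

import torch
import torch.nn as nn

base = nn.Linear(4, 4)  # assumption: stand-in for a real encoder block

# Same object listed three times: all entries share one set of weights.
shared = nn.ModuleList([base for _ in range(3)])
# Deep copies: three independent sets of weights.
cloned = nn.ModuleList([copy.deepcopy(base) for _ in range(3)])


def count_unique_params(module: nn.Module) -> int:
    # parameters() yields each distinct parameter tensor once.
    return sum(p.numel() for p in module.parameters())


print(count_unique_params(shared))  # 20
print(count_unique_params(cloned))  # 60
```

Note that deep copies start from identical initial values; they only diverge once training updates each copy separately.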

Encoder Block Internals

Each individual block contains two computational pathways: a self-attention mechanism that contextualizes each token against the entire sequence, and a position-wise feed-forward network that is applied to every token independently. Both pathways are wrapped in a residual connection with layer normalization and dropout to stabilize gradient flow through deep stacks. In the code below, normalization is applied to the input of each sublayer rather than to its output.

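from typing import Callable


class SublayerWrapper(nn.Module):
    """Residual connection around a sublayer, with layer normalization applied to the sublayer's input."""

    def __init__(self, dim: int, drop_prob: float):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(drop_prob)

    def forward(self, residual_tensor: torch.Tensor, operation: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
        # Normalize, apply the sublayer, apply dropout, then add the untouched input back.
        return residual_tensor + self.drop(operation(self.norm(residual_tensor)))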


class EncoderBlock(nn.Module):
    def __init__(self, model_dim: int, attention_unit: nn.Module, ffn_unit: nn.Module, drop_p: float):
        super().__init__()
        self.attn = attention_unit
        self.ffn = ffn_unit
        self.connections = nn.ModuleList([SublayerWrapper(model_dim, drop_p) for _ in range(2)])

    def forward(self, x: torch.Tensor, src_mask: torch.Tensor) -> torch.Tensor:
        x = self.connections[0](x, lambda vec: self.attn(vec, vec, vec, src_mask))
        return self.connections[1](x, self.ffn)
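
To make the ordering of normalization explicit, the sketch below contrasts the pre-norm arrangement used by SublayerWrapper with the post-norm arrangement described in the original Transformer paper. The small feed-forward network and the dimension of 16 are assumptions chosen for illustration, and dropout is omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 16  # assumption: a small feature size for illustration
norm = nn.LayerNorm(dim)
# Hypothetical stand-in sublayer: a two-layer feed-forward network.
ffn = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

x = torch.randn(10, 8, dim)

pre_norm = x + ffn(norm(x))    # ordering used by SublayerWrapper
post_norm = norm(x + ffn(x))   # ordering from the original Transformer paper

print(pre_norm.shape, post_norm.shape)
# Post-norm output is normalized per token; pre-norm output is not.
print(post_norm.mean(dim=-1).abs().max().item())
print(pre_norm.mean(dim=-1).abs().max().item())
```

With pre-norm, the output of the last block is never normalized, which is consistent with TransformerEncoder applying a final LayerNorm after the stack.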

Multi-Head Attention Projection

The attention module linearly projects incoming tensors into query, key, and value subspaces. These projections are split across parallel heads, allowing the model to jointly attend to information from different representation subspaces at different positions. The dot products are scaled by the square root of the per-head dimension, invalid positions are masked, and a softmax converts the scores into weights. The weighted values from all heads are then concatenated and passed through a final output projection.

class MultiHeadAttention(nn.Module):
    def __init__(self, head_count: int, hidden_size: int, dropout_val: float = 0.1):
        super().__init__()
        assert hidden_size % head_count == 0
        self.head_dim = hidden_size // head_count
        self.heads = head_count
        self.proj_q = nn.Linear(hidden_size, hidden_size)
        self.proj_k = nn.Linear(hidden_size, hidden_size)
        self.proj_v = nn.Linear(hidden_size, hidden_size)
        self.proj_out = nn.Linear(hidden_size, hidden_size)
        self.drop = nn.Dropout(dropout_val)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        if attn_mask is not None:
            # Insert a head dimension so the same mask broadcasts across all heads.
            attn_mask = attn_mask.unsqueeze(1)
        batch_size = q.size(0)

        def reshape_and_project(proj_layer, tensor):
            return proj_layer(tensor).view(batch_size, -1, self.heads, self.head_dim).transpose(1, 2)

        q_states = reshape_and_project(self.proj_q, q)
        k_states = reshape_and_project(self.proj_k, k)
        v_states = reshape_and_project(self.proj_v, v)

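        # Scaled dot-product scores: (batch, heads, q_len, k_len)
        scores = torch.matmul(q_states, k_states.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if attn_mask is not None:
            # Masked positions get a large negative score, so softmax gives them ~0 weight.
            scores = scores.masked_fill(attn_mask == 0, -1e9)

        attn_probs = F.softmax(scores, dim=-1)
        context = self.drop(attn_probs) @ v_states
        # Merge the heads back together: (batch, q_len, heads * head_dim)
        output = context.transpose(1, 2).contiguous().view(batch_size, -1, self.heads * self.head_dim)
        return self.proj_out(output)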
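
The masking step is easiest to follow with concrete shapes. The sketch below reproduces the score, mask, and softmax steps on random tensors. It assumes a padding index of 0 and a mask of shape (batch, 1, seq_len); both are illustrative choices, since the text does not specify how masks are built.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 2, 4, 5, 8
PAD = 0  # hypothetical padding index

tokens = torch.tensor([[5, 7, 9, PAD, PAD],
                       [3, 4, 6, 8, 2]])
pad_mask = (tokens != PAD).unsqueeze(1)   # (batch, 1, seq_len)
mask = pad_mask.unsqueeze(1)              # (batch, 1, 1, seq_len), as in forward()

# Random per-head queries and keys stand in for projected states.
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)

scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # (batch, heads, seq_len, seq_len)
scores = scores.masked_fill(mask == 0, -1e9)            # mask broadcasts over heads and queries
probs = F.softmax(scores, dim=-1)

# For the first sequence, the last two entries (padding) receive ~0 weight.
print(probs[0, 0, 0])
```

Each row of probs still sums to 1, so the weight removed from padded positions is redistributed across the valid tokens.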

Posted on Thu, 02 Jul 2026 16:19:48 +0000 by wmvdwerf