The Attention Mechanism: How Transformers Learn to Focus
A deep dive into scaled dot-product attention with code — the core building block behind every modern LLM.
The attention mechanism is the single most important innovation in modern NLP. Every LLM you interact with (GPT-4, Claude, Gemini) is built on top of it.
The Problem It Solves
Before attention, sequence models (RNNs, LSTMs) processed tokens one at a time, left to right. Long-range dependencies were hard to learn; by the time the model processed token 100, information from token 1 had passed through 99 compression steps.
Attention solves this by allowing every token to directly attend to every other token in the sequence, regardless of distance.
Scaled Dot-Product Attention
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (batch, heads, seq_len, d_k)
    K: (batch, heads, seq_len, d_k)
    V: (batch, heads, seq_len, d_v)
    Returns: (output, attn_weights)
    """
    d_k = Q.size(-1)

    # Step 1: Compute attention scores, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Step 2: Apply mask (for causal/padding masking)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Step 3: Softmax over the key dimension
    attn_weights = F.softmax(scores, dim=-1)

    # Step 4: Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights
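A quick shape sanity check of the function above, using random tensors (the specific sizes here are illustrative, not from the original post):

```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V), attn_weights

# Illustrative sizes: batch=2, heads=4, seq_len=8, d_k=d_v=16
Q = torch.randn(2, 4, 8, 16)
K = torch.randn(2, 4, 8, 16)
V = torch.randn(2, 4, 8, 16)

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # torch.Size([2, 4, 8, 16]) -- same shape as V
print(weights.shape)  # torch.Size([2, 4, 8, 8])  -- one row per query, one column per key
# Each row of the attention weights sums to 1 (softmax over keys)
print(torch.allclose(weights.sum(dim=-1), torch.ones(2, 4, 8), atol=1e-5))  # True
```

Note that the weight matrix is (seq_len × seq_len) per head: this is where the quadratic time and memory cost of self-attention comes from.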
Interview Questions on This Topic
After reading this write-up, you should be able to answer:
- Why do we divide by √d_k in the attention formula?
- What is the time and space complexity of self-attention?
- What is the difference between encoder attention and decoder attention?
- Why does causal masking work the way it does?
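As a starting point for the causal-masking question, here is one common way to build a mask that plugs into the function above (a sketch; the exact broadcastable shape expected by a given implementation is an assumption):

```python
import torch

seq_len = 4
# Lower-triangular matrix: position i may attend only to positions j <= i.
# Entries equal to 0 are replaced by -1e9 before the softmax,
# so "future" tokens receive ~0 attention weight.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```

The last row is all ones: the final token can see the entire sequence, while the first token can only see itself.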
These are all in the question bank. Head to Drill Mode to practice.