The Attention Mechanism: How Transformers Learn to Focus
A deep dive into scaled dot-product attention with code — the core building block behind every modern LLM.
The attention mechanism is the single most important innovation in modern NLP. Every LLM you interact with (GPT-4, Claude, Gemini) is built on top of it.
The Problem It Solves
Before attention, sequence models (RNNs, LSTMs) processed tokens one at a time, left to right. Long-range dependencies were hard to learn; by the time the model processed token 100, information from token 1 had passed through 99 compression steps.
Attention solves this by allowing every token to directly attend to every other token in the sequence, regardless of distance.
Scaled Dot-Product Attention
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (batch, heads, seq_len, d_k)
    K: (batch, heads, seq_len, d_k)
    V: (batch, heads, seq_len, d_v)
    Returns: (output, attn_weights)
    """
    d_k = Q.size(-1)

    # Step 1: Compute attention scores, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Step 2: Apply mask (for causal/padding masking)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Step 3: Softmax over the key dimension
    attn_weights = F.softmax(scores, dim=-1)

    # Step 4: Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights
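A quick shape sanity check of the function above, using random tensors (the specific sizes here are illustrative, not from the original post):

```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V), attn_weights

# Illustrative sizes: batch=2, heads=4, seq_len=8, d_k=d_v=16
Q = torch.randn(2, 4, 8, 16)
K = torch.randn(2, 4, 8, 16)
V = torch.randn(2, 4, 8, 16)

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # torch.Size([2, 4, 8, 16]) -- same shape as V
print(weights.shape)  # torch.Size([2, 4, 8, 8])  -- one row per query, one column per key
# Each row of the attention weights sums to 1 (softmax over keys)
print(torch.allclose(weights.sum(dim=-1), torch.ones(2, 4, 8), atol=1e-5))  # True
```

Note that the weight matrix is (seq_len × seq_len) per head: this is where the quadratic time and memory cost of self-attention comes from.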
Interview Questions on This Topic
After reading this write-up, you should be able to answer:
- Why do we divide by √d_k in the attention formula?
- What is the time and space complexity of self-attention?
- What is the difference between encoder attention and decoder attention?
- Why does causal masking work the way it does?
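As a starting point for the causal-masking question, here is one common way to build a mask that plugs into the function above (a sketch; the exact broadcastable shape expected by a given implementation is an assumption):

```python
import torch

seq_len = 4
# Lower-triangular matrix: position i may attend only to positions j <= i.
# Entries equal to 0 are replaced by -1e9 before the softmax,
# so "future" tokens receive ~0 attention weight.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```

The last row is all ones: the final token can see the entire sequence, while the first token can only see itself.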
These are all in the question bank. Head to Drill Mode to practice.