The Attention Mechanism: How Transformers Learn to Focus
A deep dive into scaled dot-product attention with code — the core building block behind every modern LLM.
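Before the practice questions, a minimal NumPy sketch of the article's core building block, scaled dot-product attention. This is an illustrative implementation written for this page (the function name and shapes are my own choices, not from any particular library); it also shows the 1/sqrt(d_k) scaling and causal masking that several questions below ask about.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V for Q:(n,d_k), K:(m,d_k), V:(m,d_v)."""
    d_k = Q.shape[-1]
    # Scale by 1/sqrt(d_k) so logit variance stays ~1 as d_k grows,
    # keeping softmax out of its saturated, vanishing-gradient regime.
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Causal mask: position i may only attend to positions j <= i.
        n, m = scores.shape
        scores = np.where(np.tril(np.ones((n, m), dtype=bool)), scores, -np.inf)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V, causal=True)
print(out.shape)  # (4, 8)
```

With the causal mask, position 0 can attend only to itself, so the first output row equals the first value row exactly.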
Differences between BERT and the original Transformer encoder (easy)
What problem did the Transformer architecture solve that RNNs/LSTMs could not? (easy)
Why does attention use a scaling factor of 1/sqrt(d_k)? (easy)
Why is decoder-only the dominant LLM architecture? (medium)
Why divide by √d_k in scaled dot-product attention? (medium)
What is the purpose of multi-head attention vs. single-head attention? (medium)
LLM decoding methods: greedy, sampling, top-k, top-p, beam search (medium)
Can K, Q, V use the same weight matrix in attention? (medium)
Does the decoder cross-attention use a causal mask? (medium)
Pre-Norm vs. Post-Norm in Transformers (medium)
GQA, MHA, and MLA principles; does GQA save memory during training? (medium)
Multi-head attention: linear projection order relative to head splitting (medium)
Speculative decoding vs. blockwise parallel decoding (hard)
Implement LayerNorm from scratch in PyTorch (hard)
Implement multi-head attention from scratch (hard)
Implement multi-head attention using NumPy (einsum and matmul) (hard)
What is the time and space complexity of self-attention, and how does this limit context length? (hard)
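Several of the hard questions above ask for an implementation from scratch. As a reference point, here is one hedged sketch of multi-head self-attention in NumPy with einsum: project first, then split the projected vectors into heads (one common convention; the weight shapes and names here are illustrative assumptions, not a definitive answer key).

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X:(n, d_model); Wq/Wk/Wv/Wo:(d_model, d_model).
    Project with the full matrices first, then split each projected vector
    into n_heads chunks of size d_model // n_heads."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def project_and_split(W):
        # (n, d_model) -> (n, heads, d_head) -> (heads, n, d_head)
        return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (project_and_split(W) for W in (Wq, Wk, Wv))
    # Per-head scaled dot-product logits: (heads, n, n).
    scores = np.einsum('hid,hjd->hij', Q, K) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Weighted sum of values, concatenate heads, final output projection.
    heads = np.einsum('hij,hjd->hid', A, V)               # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, n_heads, n = 16, 4, 5
X = rng.normal(size=(n, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads)
print(out.shape)  # (5, 16)
```

Note the quadratic cost visible in the `(heads, n, n)` score tensor: both time and memory grow as O(n²) in sequence length, which is exactly the limitation the last question above asks about.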
17 practice questions