The Attention Mechanism: How Transformers Learn to Focus
A deep dive into scaled dot-product attention with code — the core building block behind every modern LLM.
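Before the practice questions, a minimal NumPy sketch of the article's core building block, scaled dot-product attention. This is an illustrative implementation written for this page (the function name and shapes are my own choices, not from any particular library); it also shows the 1/sqrt(d_k) scaling and causal masking that several questions below ask about.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V for Q:(n,d_k), K:(m,d_k), V:(m,d_v)."""
    d_k = Q.shape[-1]
    # Scale by 1/sqrt(d_k) so logit variance stays ~1 as d_k grows,
    # keeping softmax out of its saturated, vanishing-gradient regime.
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Causal mask: position i may only attend to positions j <= i.
        n, m = scores.shape
        scores = np.where(np.tril(np.ones((n, m), dtype=bool)), scores, -np.inf)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V, causal=True)
print(out.shape)  # (4, 8)
```

With the causal mask, position 0 can attend only to itself, so the first output row equals the first value row exactly.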
Differences between BERT and the original Transformer encoder (easy)
What problem did the Transformer architecture solve that RNNs/LSTMs could not? (easy)
Why does attention use a scaling factor of 1/sqrt(d_k)? (easy)
Why is decoder-only the dominant LLM architecture? (medium)
Why divide by √d_k in scaled dot-product attention? (medium)
What is the purpose of multi-head attention vs. single-head attention? (medium)
LLM decoding methods: greedy, sampling, top-k, top-p, beam search (medium)
Can K, Q, V use the same weight matrix in attention? (medium)
Does the decoder cross-attention use a causal mask? (medium)
Pre-Norm vs. Post-Norm in Transformers (medium)
GQA, MHA, and MLA principles; does GQA save memory during training? (medium)
Multi-head attention: linear projection order relative to head splitting (medium)
Speculative decoding vs. blockwise parallel decoding (hard)
Implement LayerNorm from scratch in PyTorch (hard)
Implement multi-head attention from scratch (hard)
Implement multi-head attention using NumPy (einsum and matmul) (hard)
What is the time and space complexity of self-attention, and how does this limit context length? (hard)
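Several of the hard questions above ask for an implementation from scratch. As a reference point, here is one hedged sketch of multi-head self-attention in NumPy with einsum: project first, then split the projected vectors into heads (one common convention; the weight shapes and names here are illustrative assumptions, not a definitive answer key).

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X:(n, d_model); Wq/Wk/Wv/Wo:(d_model, d_model).
    Project with the full matrices first, then split each projected vector
    into n_heads chunks of size d_model // n_heads."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def project_and_split(W):
        # (n, d_model) -> (n, heads, d_head) -> (heads, n, d_head)
        return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (project_and_split(W) for W in (Wq, Wk, Wv))
    # Per-head scaled dot-product logits: (heads, n, n).
    scores = np.einsum('hid,hjd->hij', Q, K) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Weighted sum of values, concatenate heads, final output projection.
    heads = np.einsum('hij,hjd->hid', A, V)               # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, n_heads, n = 16, 4, 5
X = rng.normal(size=(n, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads)
print(out.shape)  # (5, 16)
```

Note the quadratic cost visible in the `(heads, n, n)` score tensor: both time and memory grow as O(n²) in sequence length, which is exactly the limitation the last question above asks about.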
17 practice questions