Depth

Paper-level deep dives on the models and techniques that shaped modern ML — the kind of understanding that shows in interviews

Transformers
The Attention Mechanism: How Transformers Learn to Focus

Scaled dot-product attention with code — the core building block behind every modern LLM.

Transformers · Free
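As a preview of what this article builds up to, here is a minimal NumPy sketch of scaled dot-product attention for a single head, with no masking (the function name and shapes are illustrative, not taken from the article):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```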
1.1.1 Tokenization

How text is split into tokens for LLMs. Covers BPE, BBPE, WordPiece, and Unigram with algorithm details and key interview distinctions.

Transformers · Pro
1.1.2 Word Embedding

From one-hot encoding to contextual LLM embeddings — covers Word2Vec, GloVe, and weight tying in modern large language models.

Transformers · Pro
1.1.3 Positional Encoding

Why Transformers need positional encoding and how it works — sinusoidal, learnable, RoPE, and ALiBi methods compared.

Transformers · Pro
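For a flavor of the sinusoidal method compared here, a minimal NumPy sketch (names illustrative), following PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal position table: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16)
```

At position 0 every sine entry is 0 and every cosine entry is 1, a quick sanity check for any implementation.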
1.1.4 Attention Mechanism

Scaled dot-product attention, self-attention vs cross-attention, and the MHA/MQA/GQA variants explained with code.

Transformers · Pro
1.1.5 Feed-Forward Networks (FFN) and Activation Functions

FFN structure in Transformers and activation functions from ReLU to SwiGLU — the building blocks of every modern LLM layer.

Transformers · Pro
1.1.6 Masking

Padding masks, causal masks, and MLM masks — how Transformers control which tokens can attend to which, and why it matters.

Transformers · Pro
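The causal mask mentioned above fits in a few lines of NumPy (a sketch, not the article's code): a lower-triangular boolean matrix where True means "may attend", applied by setting disallowed scores to negative infinity before the softmax.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# Disallowed positions get -inf so softmax assigns them zero weight.
scores = np.where(mask, 0.0, -np.inf)
```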
1.1.7 Normalization

BatchNorm, LayerNorm, RMSNorm, and beyond — normalization techniques in Transformers with Pre-LN vs Post-LN trade-offs.

Transformers · Pro
1.1.8 Encoder vs. Decoder

The architectural split between BERT and GPT — how encoder-only and decoder-only models differ in structure and use case.

Transformers · Pro
1.1.9 Decoding Strategies

Greedy, beam search, top-k, top-p, and temperature — how vocabulary distributions are turned into generated text.

Transformers · Pro
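As an illustration of two of these knobs working together, a minimal NumPy sketch of temperature scaling plus top-p (nucleus) sampling; `sample_top_p` is a made-up helper name, not from the article:

```python
import numpy as np

def sample_top_p(logits, p=0.9, temperature=1.0, rng=None):
    """Temperature-scale logits, keep the smallest token set whose
    cumulative probability reaches p, renormalize, and sample one id."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest nucleus covering mass p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

token = sample_top_p([2.0, 1.0, 0.1, -1.0], p=0.8, temperature=0.7,
                     rng=np.random.default_rng(0))
```

With a very small p the nucleus shrinks to the single most likely token, which recovers greedy decoding.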
MHA (Importance: ★★★★★)

Four reference implementations of Multi-Head Attention in PyTorch and NumPy — with a test harness to verify your own solution.

Transformers · Pro
MQA/GQA (Importance: ★★★★)

Multi-Query and Grouped-Query Attention implementations from scratch — the attention variants used in Qwen, GLM, and most modern LLMs.

Transformers · Pro
MHA with KV Cache (Importance: ★★★★)

Implementing Multi-Head Attention with KV cache — the essential memory optimization for production LLM inference.

Transformers · Pro
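The core idea can be sketched in NumPy: cache each decode step's key and value once, then let the new query attend over the whole cache instead of recomputing past projections (class and function names here are illustrative, not the article's solution):

```python
import numpy as np

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

class KVCache:
    """Append-only cache: each decode step stores its key/value once."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)
        self.V.append(v)
        return attend(q, np.stack(self.K), np.stack(self.V))

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(3):                  # three autoregressive decode steps
    q, k, v = rng.standard_normal((3, 8))
    out = cache.step(q, k, v)       # attends over all steps so far
print(len(cache.K), out.shape)  # 3 (8,)
```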
LayerNorm/RMSNorm (Importance: ★★★★)

From-scratch implementations of LayerNorm and RMSNorm — including the DeepSeek interview coding question.

Transformers · Pro
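For reference, RMSNorm itself fits in a few lines of NumPy (a sketch, not the article's solution): normalize by the root-mean-square of the features instead of subtracting the mean as LayerNorm does.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1/RMS(x) along the last axis, then by a learned weight
    (LayerNorm without the mean-centering step or bias)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x, weight=np.ones(4))
```

With a unit weight, the output's mean square is 1 by construction, a handy property to assert in an interview setting.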

More models added regularly. Pro members get early access.