Depth

Paper-level deep dives on the models and techniques that shaped modern ML — the kind of understanding that shows in interviews

Transformers
The Attention Mechanism: How Transformers Learn to Focus

Scaled dot-product attention with code — the core building block behind every modern LLM.

Transformers · Free
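As a preview of what this article builds up to, here is a minimal NumPy sketch of scaled dot-product attention for a single head, with no masking (the function name and shapes are illustrative, not taken from the article):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```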
1.1.1 Tokenization

How text is split into tokens for LLMs. Covers BPE, BBPE, WordPiece, and Unigram with algorithm details and key interview distinctions.

Transformers · Pro
1.1.2 Word Embedding

From one-hot encoding to contextual LLM embeddings — covers Word2Vec, GloVe, and weight tying in modern large language models.

Transformers · Pro
1.1.3 Positional Encoding

Why Transformers need positional encoding and how it works — sinusoidal, learnable, RoPE, and ALiBi methods compared.

Transformers · Pro
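For a flavor of the sinusoidal method compared here, a minimal NumPy sketch (names illustrative), following PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal position table: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16)
```

At position 0 every sine entry is 0 and every cosine entry is 1, a quick sanity check for any implementation.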
1.1.4 Attention Mechanism

Scaled dot-product attention, self-attention vs cross-attention, and the MHA/MQA/GQA variants explained with code.

Transformers · Pro
1.1.5 Feed-Forward Networks (FFN) and Activation Functions

FFN structure in Transformers and activation functions from ReLU to SwiGLU — the building blocks of every modern LLM layer.

Transformers · Pro
1.1.6 Masking

Padding masks, causal masks, and MLM masks — how Transformers control which tokens can attend to which, and why it matters.

Transformers · Pro
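The causal mask mentioned above fits in a few lines of NumPy (a sketch, not the article's code): a lower-triangular boolean matrix where True means "may attend", applied by setting disallowed scores to negative infinity before the softmax.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# Disallowed positions get -inf so softmax assigns them zero weight.
scores = np.where(mask, 0.0, -np.inf)
```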
1.1.7 Normalization

BatchNorm, LayerNorm, RMSNorm, and beyond — normalization techniques in Transformers with Pre-LN vs Post-LN trade-offs.

Transformers · Pro
1.1.8 Encoder vs. Decoder

The architectural split between BERT and GPT — how encoder-only and decoder-only models differ in structure and use case.

Transformers · Pro
1.1.9 Decoding Strategies

Greedy, beam search, top-k, top-p, and temperature — how vocabulary distributions are turned into generated text.

Transformers · Pro
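As an illustration of two of these knobs working together, a minimal NumPy sketch of temperature scaling plus top-p (nucleus) sampling; `sample_top_p` is a made-up helper name, not from the article:

```python
import numpy as np

def sample_top_p(logits, p=0.9, temperature=1.0, rng=None):
    """Temperature-scale logits, keep the smallest token set whose
    cumulative probability reaches p, renormalize, and sample one id."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest nucleus covering mass p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

token = sample_top_p([2.0, 1.0, 0.1, -1.0], p=0.8, temperature=0.7,
                     rng=np.random.default_rng(0))
```

With a very small p the nucleus shrinks to the single most likely token, which recovers greedy decoding.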
MHA (Importance: ★★★★★)

Four reference implementations of Multi-Head Attention in PyTorch and NumPy — with a test harness to verify your own solution.

Transformers · Pro
MQA/GQA (Importance: ★★★★)

Multi-Query and Grouped-Query Attention implementations from scratch — the attention variants used in Qwen, GLM, and most modern LLMs.

Transformers · Pro
MHA with KV Cache (Importance: ★★★★)

Implementing Multi-Head Attention with KV cache — the essential memory optimization for production LLM inference.

Transformers · Pro
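The core idea can be sketched in NumPy: cache each decode step's key and value once, then let the new query attend over the whole cache instead of recomputing past projections (class and function names here are illustrative, not the article's solution):

```python
import numpy as np

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

class KVCache:
    """Append-only cache: each decode step stores its key/value once."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)
        self.V.append(v)
        return attend(q, np.stack(self.K), np.stack(self.V))

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(3):                  # three autoregressive decode steps
    q, k, v = rng.standard_normal((3, 8))
    out = cache.step(q, k, v)       # attends over all steps so far
print(len(cache.K), out.shape)  # 3 (8,)
```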
LayerNorm/RMSNorm (Importance: ★★★★)

From-scratch implementations of LayerNorm and RMSNorm — including the DeepSeek interview coding question.

Transformers · Pro
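For reference, RMSNorm itself fits in a few lines of NumPy (a sketch, not the article's solution): normalize by the root-mean-square of the features instead of subtracting the mean as LayerNorm does.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1/RMS(x) along the last axis, then by a learned weight
    (LayerNorm without the mean-centering step or bias)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x, weight=np.ones(4))
```

With a unit weight, the output's mean square is 1 by construction, a handy property to assert in an interview setting.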

More models added regularly. Pro members get early access.