VLM Foundations

CLIP (2021, OpenAI)

Contrastive Language-Image Pre-training enables zero-shot visual classification by aligning image and text embeddings in a shared semantic space.
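As a quick illustration of this zero-shot setup, here is a minimal sketch using the Hugging Face transformers CLIP API. The checkpoint name is the public openai/clip-vit-base-patch32 release; the image path and label set are placeholders.

```python
# Minimal CLIP zero-shot classification sketch.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "airplane"]                      # placeholder classes
prompts = [f"a photo of a {label}" for label in labels]  # prompt template
image = Image.open("example.jpg")                        # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Scaled image-text cosine similarities act as class logits; a softmax over
# the candidate prompts yields zero-shot class probabilities.
probs = out.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because every class is expressed as a text prompt, swapping in a new label set requires no retraining.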

CLIP contrastive pre-training objective (easy; see the first sketch after this list)
ViT patch embedding and CLS token role (easy)
BLIP-2 Q-Former: what is frozen and what is trained (easy)
How LLaVA generates instruction-following training data (easy)
Role of Variable Visual Position Encoding (V2PE) in InternVL3 (easy)
SigLIP2 NaFlex variant: purpose and mechanism (easy)
InternVL3 pixel unshuffle and its equivalence to patch merging (easy)
Causes of hallucination in multimodal LLMs and mitigation strategies (medium)
CLIP zero-shot inference with prompt templates (medium)
CLIP dual-encoder architecture and its weaknesses (medium)
ViT vs. CNN: inductive biases and data requirements (medium)
Flamingo Perceiver Resampler design and purpose (medium)
Flamingo few-shot multimodal in-context learning (medium)
BLIP MED architecture: three training objectives (medium)
BLIP-2 Q-Former as lossy compression and its failure modes (medium)
LLaVA visual connector design vs. BLIP-2 and Flamingo (medium)
LLaVA two-stage training pipeline (medium)
LLaVA-1.5 improvements over the original LLaVA (medium)
Qwen-VL three-stage training strategy (medium)
Qwen2.5-VL Time-Absolute MRoPE for video understanding (medium)
Qwen2.5-VL post-training: SFT and DPO stages (medium)
InternVL3 Native Multimodal Pre-Training vs. the traditional two-stage approach (medium)
InternVL3 tile-and-thumbnail image processing strategy (medium)
DeepSeek-VL Hybrid Vision Encoder: SigLIP + SAM design rationale (medium)
DeepSeek-VL vs. DeepSeek-VL2: architecture changes and motivations (medium)
SigLIP sigmoid loss vs. CLIP softmax loss: two key improvements (medium; see the second sketch after this list)
SigLIP2 multi-task training objectives beyond sigmoid contrastive loss (medium)
SigLIP-SO400M architecture and its use in VLMs (medium)
ViT positional embedding interpolation during fine-tuning (hard)
Flamingo Gated Cross-Attention initialization and motivation (hard)
BLIP CapFilt: bootstrapping clean data from noisy web pairs (hard)
BLIP-2 Q-Former two-stage training and attention masking strategy (hard)
Why LLaVA uses CLIP's second-to-last layer features (hard)
Qwen-VL Position-Aware Vision-Language Adapter design (hard)
Qwen2-VL MRoPE: how temporal, height, and width are encoded (hard)
Qwen2-VL PatchMerger: mechanism and purpose (hard)
Qwen2.5-VL Interleaved Window Attention in ViT (hard)
InternVL3 Mixed Preference Optimization (MPO) loss formulation (hard)
DeepSeek-VL2 dynamic tiling token sequence length formula (hard)
SigLIP chunked loss computation for memory efficiency (hard)
InternVL3 Test-Time Scaling with VisualPRM-8B (hard)
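First sketch (for the CLIP contrastive pre-training objective question above): a plain-PyTorch rendering of the symmetric InfoNCE loss CLIP trains with. The function name, batch size, embedding width, and fixed temperature are illustrative; in CLIP itself the temperature is a learned parameter.

```python
# Sketch of CLIP's symmetric contrastive (InfoNCE) objective.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize both modalities so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # N x N similarity matrix: entry (i, j) scores image i against text j.
    logits = img_emb @ txt_emb.t() / temperature

    # Matched pairs sit on the diagonal; cross-entropy is applied in both
    # directions (image-to-text and text-to-image) and averaged.
    targets = torch.arange(len(img_emb))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 random image/text embedding pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```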
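Second sketch (for the SigLIP sigmoid vs. CLIP softmax question): the sigmoid loss scores every image-text pairing as an independent binary decision, so no batch-wide softmax normalization is required. The temperature and bias defaults below mirror the initialization described in the SigLIP paper; everything else is illustrative.

```python
# Sketch of the SigLIP pairwise sigmoid loss.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # Cosine similarities between all image/text pairs in the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b  # t: temperature, b: learned bias

    # +1 for matched pairs (the diagonal), -1 for every mismatched pair.
    labels = 2.0 * torch.eye(len(img_emb)) - 1.0

    # Independent binary log-loss per pair; because no softmax couples the
    # batch, the sum can be computed in memory-friendly chunks.
    # (The paper normalizes by batch size; .mean() divides by all N*N pairs.)
    return -F.logsigmoid(labels * logits).mean()

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```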

41 practice questions
