VLM Foundations

CLIP (2021, OpenAI)

Contrastive Language-Image Pre-training enables zero-shot visual classification by aligning image and text embeddings in a shared semantic space.
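As a quick illustration of this zero-shot setup, here is a minimal sketch using the Hugging Face transformers CLIP API. The checkpoint name is the public openai/clip-vit-base-patch32 release; the image path and label set are placeholders.

```python
# Minimal CLIP zero-shot classification sketch.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "airplane"]                      # placeholder classes
prompts = [f"a photo of a {label}" for label in labels]  # prompt template
image = Image.open("example.jpg")                        # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Scaled image-text cosine similarities act as class logits; a softmax over
# the candidate prompts yields zero-shot class probabilities.
probs = out.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because every class is expressed as a text prompt, swapping in a new label set requires no retraining.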

CLIP contrastive pre-training objective (easy; see the first sketch after this list)
ViT patch embedding and CLS token role (easy)
BLIP-2 Q-Former: what is frozen and what is trained (easy)
How LLaVA generates instruction-following training data (easy)
Role of Variable Visual Position Encoding (V2PE) in InternVL3 (easy)
SigLIP2 NaFlex variant: purpose and mechanism (easy)
InternVL3 pixel unshuffle and its equivalence to patch merging (easy)
Causes of hallucination in multimodal LLMs and mitigation strategies (medium)
CLIP zero-shot inference with prompt templates (medium)
CLIP dual-encoder architecture and its weaknesses (medium)
ViT vs. CNN: inductive biases and data requirements (medium)
Flamingo Perceiver Resampler design and purpose (medium)
Flamingo few-shot multimodal in-context learning (medium)
BLIP MED architecture: three training objectives (medium)
BLIP-2 Q-Former as lossy compression and its failure modes (medium)
LLaVA visual connector design vs. BLIP-2 and Flamingo (medium)
LLaVA two-stage training pipeline (medium)
LLaVA-1.5 improvements over the original LLaVA (medium)
Qwen-VL three-stage training strategy (medium)
Qwen2.5-VL Time-Absolute MRoPE for video understanding (medium)
Qwen2.5-VL post-training: SFT and DPO stages (medium)
InternVL3 Native Multimodal Pre-Training vs. the traditional two-stage approach (medium)
InternVL3 tile-and-thumbnail image processing strategy (medium)
DeepSeek-VL Hybrid Vision Encoder: SigLIP + SAM design rationale (medium)
DeepSeek-VL vs. DeepSeek-VL2: architecture changes and motivations (medium)
SigLIP sigmoid loss vs. CLIP softmax loss: two key improvements (medium; see the second sketch after this list)
SigLIP2 multi-task training objectives beyond sigmoid contrastive loss (medium)
SigLIP-SO400M architecture and its use in VLMs (medium)
ViT positional embedding interpolation during fine-tuning (hard)
Flamingo Gated Cross-Attention initialization and motivation (hard)
BLIP CapFilt: bootstrapping clean data from noisy web pairs (hard)
BLIP-2 Q-Former two-stage training and attention masking strategy (hard)
Why LLaVA uses CLIP's second-to-last layer features (hard)
Qwen-VL Position-Aware Vision-Language Adapter design (hard)
Qwen2-VL MRoPE: how temporal, height, and width are encoded (hard)
Qwen2-VL PatchMerger: mechanism and purpose (hard)
Qwen2.5-VL Interleaved Window Attention in ViT (hard)
InternVL3 Mixed Preference Optimization (MPO) loss formulation (hard)
DeepSeek-VL2 dynamic tiling token sequence length formula (hard)
SigLIP chunked loss computation for memory efficiency (hard)
InternVL3 Test-Time Scaling with VisualPRM-8B (hard)
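First sketch (for the CLIP contrastive pre-training objective question above): a plain-PyTorch rendering of the symmetric InfoNCE loss CLIP trains with. The function name, batch size, embedding width, and fixed temperature are illustrative; in CLIP itself the temperature is a learned parameter.

```python
# Sketch of CLIP's symmetric contrastive (InfoNCE) objective.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize both modalities so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # N x N similarity matrix: entry (i, j) scores image i against text j.
    logits = img_emb @ txt_emb.t() / temperature

    # Matched pairs sit on the diagonal; cross-entropy is applied in both
    # directions (image-to-text and text-to-image) and averaged.
    targets = torch.arange(len(img_emb))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 random image/text embedding pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```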
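Second sketch (for the SigLIP sigmoid vs. CLIP softmax question): the sigmoid loss scores every image-text pairing as an independent binary decision, so no batch-wide softmax normalization is required. The temperature and bias defaults below mirror the initialization described in the SigLIP paper; everything else is illustrative.

```python
# Sketch of the SigLIP pairwise sigmoid loss.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # Cosine similarities between all image/text pairs in the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b  # t: temperature, b: learned bias

    # +1 for matched pairs (the diagonal), -1 for every mismatched pair.
    labels = 2.0 * torch.eye(len(img_emb)) - 1.0

    # Independent binary log-loss per pair; because no softmax couples the
    # batch, the sum can be computed in memory-friendly chunks.
    # (The paper normalizes by batch size; .mean() divides by all N*N pairs.)
    return -F.logsigmoid(labels * logits).mean()

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```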

41 practice questions
