VLM Foundations

VLM Fundamentals: Overview and Learning Roadmap

Cross-modal alignment timeline from CLIP to modern VLMs, recommended reading, key survey papers, and a curated model list.

Fundamentals Section

1. Models (Understanding-Side Multimodal)

- Cross-modal alignment: CLIP, VLMo
- LLM capabilities become more important: Frozen, Flamingo, BLIP-2
- An MLP is sufficient for alignment: LLaVA, MiniGPT-4
- The visual encoder matters: DeepSeek-VL, Qwen2.5-VL
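The cross-modal alignment stage above is easiest to grasp through CLIP's training objective: a symmetric contrastive (InfoNCE) loss that pulls matched image/text embedding pairs together and pushes mismatched pairs apart. Below is a minimal NumPy sketch of that idea; all names (`clip_contrastive_loss`, the toy batch) are illustrative, not CLIP's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project each feature vector onto the unit sphere so the dot
    # product becomes cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of image_feats and row i of text_feats are a matched pair;
    every other row in the batch serves as a negative.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(logits))          # diagonal = positive pairs

    def cross_entropy(lg, lb):
        # Numerically stable log-softmax cross-entropy.
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the image->text and text->image directions, as CLIP does.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy batch: 4 image/text pairs, 8-dim features. Text features are a
# slightly perturbed copy of the image features, so the loss is small.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))
print(clip_contrastive_loss(img, txt))
```

The later rows of the list (LLaVA, MiniGPT-4) keep this aligned vision encoder frozen and learn only a small projector into the LLM's embedding space, which is why a plain MLP turns out to be enough.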

2. Papers

4. Recommended Beginner Projects

- Zeyi-Lin/LLM-Finetune: Qwen2-VL fine-tuning
- lemon-little/BetterSynth: Multimodal data synthesis
- dvlab-research/MGM: Full pipeline reproduction
- MMMU-Benchmark/MMMU: Multimodal model evaluation
