VLM Fundamentals: Overview and Learning Roadmap
Cross-modal alignment timeline from CLIP to modern VLMs, recommended reading, key survey papers, and a curated model list.
Fundamentals
1. Models (Understanding-Side Multimodal)
- Cross-modal alignment: CLIP, VLMo (contrastive loss sketched below)
- LLM capabilities become more important: Frozen, Flamingo, BLIP-2
- An MLP is sufficient for alignment: LLaVA, MiniGPT-4 (projector sketched below)
- The visual encoder matters: DeepSeek-VL, Qwen2.5-VL
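To make the first and third stages of this timeline concrete, here is a minimal PyTorch sketch of the two alignment ideas: a CLIP-style symmetric contrastive (InfoNCE) loss, and a LLaVA-style two-layer MLP projector that maps vision-encoder tokens into the LLM's embedding space. The dimensions, the fixed temperature, and the class names are illustrative assumptions, not the papers' exact configurations (CLIP, for instance, learns its temperature as a parameter).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders;
    matched image/text pairs share a row index.
    """
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

class MLPProjector(nn.Module):
    """LLaVA-1.5-style two-layer MLP mapping vision-encoder tokens into
    the LLM's embedding space. Dimensions here are illustrative only."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_tokens)
```

The projected visual tokens are simply concatenated with the text token embeddings before the LLM forward pass, which is why the LLaVA-era finding that a plain MLP suffices was striking next to Flamingo's heavier gated cross-attention layers.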
2. Papers
- Surveys
  - Multimodal large model research progress: "A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks"
  - Aligning multimodal LLMs with human preferences: "Aligning Multimodal LLM with Human Preference: A Survey"
  - Hallucination in multimodal large models: "A Survey on Hallucination in Large Vision-Language Models"
  - Interpretability of multimodal large models: "Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey"
- Latest paper repos
  - Multimodal techniques: BradyFU/Awesome-Multimodal-Large-Language-Models
  - Multimodal reasoning: lwpyh/Awesome-MLLM-Reasoning-Collection
  - Multimodal reasoning benchmarks: Wild-Cooperation-Hub/Awesome-MLLM-Reasoning-Benchmarks
  - Multimodal large model infra: DefTruth/Awesome-LLM-Inference
  - Video-understanding MLLMs: pipixin321/Awesome-Video-MLLMs
3. Recommended Beginner Projects

- Zeyi-Lin/LLM-Finetune: Qwen2-VL fine-tuning (minimal inference sketch below)
- lemon-little/BetterSynth: multimodal data synthesis
- dvlab-research/MGM: full-pipeline reproduction
- MMMU-Benchmark/MMMU: multimodal model evaluation
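Before diving into fine-tuning, it helps to get plain inference working. Below is a minimal sketch of Qwen2-VL inference with Hugging Face transformers; the checkpoint id and image URL are placeholders, and the official model card additionally recommends the qwen_vl_utils helper for preparing vision inputs.

```python
# Minimal Qwen2-VL inference sketch, assuming a recent transformers
# release with Qwen2-VL support.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumed checkpoint; swap as needed
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; replace with your own data.
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image],
                   return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding the model's answer.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```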