VLM Foundations

VLM Fundamentals: Overview and Learning Roadmap

Cross-modal alignment timeline from CLIP to modern VLMs, recommended reading, key survey papers, and a curated model list.

Fundamentals Section

1. Models (Understanding-Side Multimodal)

- Cross-modal alignment: CLIP, VLMo
- LLM capabilities become more important: Frozen, Flamingo, BLIP-2
- An MLP is sufficient for alignment: LLaVA, MiniGPT-4
- The visual encoder matters: DeepSeek-VL, Qwen2.5-VL
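The cross-modal alignment stage above is easiest to grasp through CLIP's training objective: a symmetric contrastive (InfoNCE) loss that pulls matched image/text embedding pairs together and pushes mismatched pairs apart. Below is a minimal NumPy sketch of that idea; all names (`clip_contrastive_loss`, the toy batch) are illustrative, not CLIP's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project each feature vector onto the unit sphere so the dot
    # product becomes cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of image_feats and row i of text_feats are a matched pair;
    every other row in the batch serves as a negative.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(logits))          # diagonal = positive pairs

    def cross_entropy(lg, lb):
        # Numerically stable log-softmax cross-entropy.
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the image->text and text->image directions, as CLIP does.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy batch: 4 image/text pairs, 8-dim features. Text features are a
# slightly perturbed copy of the image features, so the loss is small.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))
print(clip_contrastive_loss(img, txt))
```

The later rows of the list (LLaVA, MiniGPT-4) keep this aligned vision encoder frozen and learn only a small projector into the LLM's embedding space, which is why a plain MLP turns out to be enough.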

2. Papers

4. Recommended Beginner Projects

- Zeyi-Lin/LLM-Finetune: Qwen2-VL fine-tuning
- lemon-little/BetterSynth: Multimodal data synthesis
- dvlab-research/MGM: Full pipeline reproduction
- MMMU-Benchmark/MMMU: Multimodal model evaluation
