LLaVA(2023, Microsoft)

LLaVA (2023.04, Microsoft), short for Large Language and Vision Assistant, was introduced in the paper "Visual Instruction Tuning".

First, understand two different concepts: visual instruction tuning vs. visual prompt tuning.

Visual Instruction Tuning: Improving the model's instruction-following ability

Refers to using visual information as part of the instruction and training the model to follow instructions grounded in visual inputs, i.e., dialogue capability grounded in visual data. This approach focuses on enabling the model to understand visual information and execute specific instructions or tasks based on it.
Example: Suppose a model needs to generate a textual description of an image's content. In visual instruction tuning, the model is trained to understand the image content and generate a corresponding description, which involves both visual understanding and instruction execution.

Visual Prompt Tuning: Parameter-efficient adaptation of models

A parameter-efficient fine-tuning method that introduces a small number of learnable parameters to map images into the prompt space (task-relevant continuous vectors), learning an optimal prompt vector to transfer pre-trained models to specific tasks. This approach focuses on adapting to specific downstream tasks by adding or adjusting prompts.
Example: In an image classification task, a specific prompt can be added to the input image to fine-tune a pre-trained model for recognizing and classifying objects. During training, only the prompt parameters are updated, while the rest of the model remains unchanged.

Instruction tuning is common in NLP, where it improves LLM generalization in zero-shot and few-shot settings (it is used in InstructGPT, ChatGPT, FLAN-T5, etc.), but it had not been fully explored for multimodal tasks combining vision and language.

LLaVA's motivation is to build a general-purpose multimodal assistant, a multimodal counterpart to what InstructGPT is for LLMs:

An end-to-end trained multimodal large model connecting a visual encoder and LLM for general visual and language understanding.
The first work to use language-only GPT-4 to generate multimodal language-image instruction-following data and to train a large multimodal model on it via instruction tuning.

Experiments show that LLaVA demonstrates relatively strong multimodal chat capability, sometimes exhibiting behavior similar to multimodal GPT-4 on unseen images/instructions, and achieves a relative score of 85.1% compared to GPT-4 on a synthetic multimodal instruction-following dataset. After fine-tuning on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. LLaVA excels at understanding and following instructions in conversational contexts; it can reason about and respond to queries in a way aligned with human intent, outperforming other models such as BLIP-2 and Flamingo.

Concept supplement: an instruction-following Large Multimodal Model (LMM) = a pre-trained visual backbone for encoding visual features + a pre-trained LLM for understanding user instructions and generating responses + a visual-language cross-modal connector for aligning the visual encoder output with the language model. Training proceeds in two stages:

Stage a: Visual-language alignment pre-training; using image-text pairs to align visual features with the word embedding space of the language model.
Stage b: Visual instruction tuning; fine-tuning the model based on visual instructions to enable it to follow diverse user instruction requirements.

1. Main Contributions

Core:

Connects the CLIP ViT-L/14 visual encoder to a LLaMA-based large language model through a simple linear projection, and fine-tunes end-to-end on the generated multimodal instruction-following dataset, yielding a Large Multimodal Model (LMM).
The model architecture adopts a BLIP-like Image Encoder + Projector + LLM architecture, but the Projector connection differs from Flamingo's gated cross-attention and BLIP-2's Q-Former. LLaVA directly uses a simple linear layer to align visual features with text features.
Uses ChatGPT/GPT-4 to convert image-text pairs into appropriate instruction-following data format, generating three types of instruction-following data: Conversation, Detailed description, and Complex reasoning.

2. Instruction-Following Data

Three types of instructions are carefully designed, together with well-crafted prompts for constructing the instruction data with GPT-4/ChatGPT, i.e., standing on the "shoulders of giants." This approach has been widely adopted in subsequent work.

LLaVA uses GPT-4/ChatGPT to generate training data for multimodal tasks.

In LLaVA, a "Human - Assistant" dialogue format is used to simulate human-machine interaction. Each data sample follows this format (instruction-following data format):

Human: Xq Xv<STOP> Assistant: Xc<STOP>

Xq (Question/Prompt): The text instruction issued by the user; a request about the image (question, description requirement, etc.), generated by GPT in training data.
Xv (Visual Information): The image input. LLaVA directly accepts images, but in training data, textual descriptions (captions + bounding boxes) are used as substitutes.
Xc (Completion): The model's answer; the response generated by the assistant for the given image and instruction, generated by GPT in training data.
STOP: End token, marking the end of a conversation turn.
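To make the format concrete, here is a minimal sketch that renders one single-turn sample as a string. The `<image>` placeholder and the function name `format_sample` are assumptions made for illustration; they are not taken from the LLaVA codebase.

```python
STOP = "<STOP>"                # end-of-turn marker from the format above
IMAGE_PLACEHOLDER = "<image>"  # hypothetical stand-in for the visual input Xv


def format_sample(question: str, completion: str) -> str:
    """Render one sample as 'Human: Xq Xv<STOP> Assistant: Xc<STOP>'."""
    human_turn = f"Human: {question} {IMAGE_PLACEHOLDER}{STOP}"
    assistant_turn = f"Assistant: {completion}{STOP}"
    return f"{human_turn} {assistant_turn}"


sample = format_sample("What is in this image?", "A dog sleeping on a carpet.")
print(sample)
```

In a real pipeline the placeholder would be replaced by projected image features rather than kept as text.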

However, ChatGPT and the language-only GPT-4 used here cannot process images directly; they only accept text input. Image information must therefore be converted to text so that GPT-4 can "see" the image content. Given that textual description, GPT-4 generates a series of questions about the image ($X_q$, the user input) and then answers them in the "Assistant" role to produce $X_c$ (the answers).

How to convert image information to text? LLaVA uses two methods to encode image information as text, enabling GPT-4/ChatGPT to generate multimodal task data:

Captions: A textual description of an image that typically summarizes the overall content.
Bounding Boxes: Object positions in images obtained via object detection methods and represented as text. For example: "dog in the lower-left corner, running."

Through these two methods, each image's information becomes pure-text input that GPT-4 can understand.
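The conversion described above can be sketched as a small function that serializes captions and bounding boxes into one text context. The exact layout (headings, coordinate order, normalization to [0, 1]) is an assumption chosen for readability, not the prompt format actually used by the authors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """One detected object: a label plus normalized corner coordinates."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def image_to_text_context(captions: List[str], boxes: List[Box]) -> str:
    """Serialize captions and boxes into plain text a language-only model can read."""
    lines = ["Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append("Objects (label: [x1, y1, x2, y2], normalized):")
    lines += [
        f"- {b.label}: [{b.x1:.2f}, {b.y1:.2f}, {b.x2:.2f}, {b.y2:.2f}]"
        for b in boxes
    ]
    return "\n".join(lines)


context = image_to_text_context(
    ["A dog runs across a grassy field."],
    [Box("dog", 0.1, 0.6, 0.4, 0.95)],
)
print(context)
```

The resulting string would be placed in the GPT-4 prompt in place of the image.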

The COCO dataset was selected as the base dataset, using captions + bounding boxes as input to let GPT-4 generate three types of instruction-following data:

1. Conversation:

Purpose: Enable LLaVA to understand and answer various questions about images (diversity + sensible answers), so the model can engage in image-based dialogue like a chat assistant.
Method: Generate a series of questions about image content and let GPT-4 generate answers.

Human: What is in this image? Xv Assistant: There is a dog, a cat, and a sofa in the image.

Human: What is the dog doing? Xv Assistant: It is sleeping on the carpet in the living room.

Human: Which is closer to the sofa; the cat or the dog? Xv Assistant: The dog is closer to the sofa; it is lying right next to it.

This approach aims to improve LLaVA's performance in multi-turn dialogue, enabling it to understand various questions users ask about images.

2. Detailed Description:

Purpose: Enable the model to produce complete, detailed image descriptions.
Method: Design a series of different prompts, randomly select one each time, and ask GPT-4 to generate descriptions as detailed as possible.

Human: Please describe the content of this image in detail. Xv Assistant: The image shows a lush green meadow with a red kite flying in the center. On the ground, two children are chasing the kite, wearing blue and yellow clothes. The background shows a blue sky and a few white clouds.

This approach trains the model to describe images more naturally and in greater detail.

3. Complex Reasoning:

Purpose: Train the model to perform higher-level logical reasoning in multimodal tasks; not just describing image content, but engaging in causal reasoning, event analysis, etc.
Method: Based on existing dialogue and description data, generate questions requiring logical reasoning.

Human: What might the people in this image be doing? Why? Xv Assistant: A group of people is sitting around a table with a cake and drinks. They are probably celebrating a birthday, because a little girl is about to blow out the candles on the cake while everyone around her smiles and watches.

Human: In what season might this image have been taken? Xv Assistant: Judging from the short sleeves the people are wearing and the bright sunlight, this image was probably taken in summer.

This data helps improve LLaVA's performance on complex problems such as science and logical reasoning.

The final constructed multimodal dataset contains 158K instruction-following samples, including: 58K conversation data + 23K detailed description data + 77K complex reasoning data.

The figure below shows LLaVA's instruction examples, containing 3 types of questions.

![Image](images/LLaVA(2023, Microsoft)_68.png)

The figure below shows the prompt design for having GPT-4 generate instruction-tuning data.

![Image](images/LLaVA(2023, Microsoft)_70.png)

In LLaVA's first-stage training (image-text alignment pre-training), the CC3M dataset was filtered and the instructions were constructed by hand. The filtering uses the spaCy library to extract noun phrases from each caption and count the frequency of each unique noun phrase. Noun phrases with frequency below 3 are skipped (filtering out rare concepts), and for noun phrases with frequency above 100, a random subset of 100 captions is kept.
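The frequency-based filtering can be sketched as follows. To keep the example self-contained, noun phrases are assumed to be already extracted (in practice this would be done with spaCy), and frequency is counted as the number of captions containing a phrase. The function name and the thresholds as keyword arguments are illustrative choices.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set


def filter_captions(
    caption_phrases: Dict[int, List[str]],
    min_freq: int = 3,
    max_per_phrase: int = 100,
    seed: int = 0,
) -> Set[int]:
    """Return the ids of captions kept after noun-phrase frequency filtering."""
    rng = random.Random(seed)

    # Map each unique noun phrase to the captions that contain it.
    phrase_to_ids = defaultdict(list)
    for caption_id, phrases in caption_phrases.items():
        for phrase in set(phrases):
            phrase_to_ids[phrase].append(caption_id)

    kept: Set[int] = set()
    for phrase, ids in phrase_to_ids.items():
        if len(ids) < min_freq:
            continue  # skip rare phrases entirely
        if len(ids) > max_per_phrase:
            ids = rng.sample(ids, max_per_phrase)  # cap very frequent phrases
        kept.update(ids)
    return kept


# Toy data: one rare phrase, one very frequent phrase, one mid-frequency phrase.
toy = {0: ["a zeppelin"], 1: ["a zeppelin"]}
toy.update({i: ["a dog"] for i in range(100, 250)})
toy.update({i: ["a kite"] for i in range(300, 305)})
kept = filter_captions(toy)
print(len(kept))
```

Capping frequent phrases while dropping rare ones balances concept coverage against dataset size.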

LLaVA's instruction-following data format is essentially Human (user asks question or makes request) → Assistant (model generates answer), specifically:

Dialogue mode (question-answer)
Detailed description (complete image description)
Complex reasoning (logical reasoning and advanced analysis)

This data format not only trains LLaVA for visual question answering (VQA), but also strengthens its reasoning and description capabilities, making it more competitive in multimodal tasks.

3. Model Architecture

![Image](images/LLaVA(2023, Microsoft)_76.png)


Note: CLIP's last-layer features may focus more on global and abstract image properties compared to earlier layers, while earlier layers can focus more on local properties that help understand specific image details.

Using a simple linear layer as the Projector is a better choice than a complex Q-Former.

Concatenate the visual features $H_v$ and text features $H_q$ (already in the same dimension), then feed them together into the LLM below.
Large Language Model: Vicuna 13B (a Decoder-Only LLM, fine-tuned on dialogue data based on the LLaMA series), generating a Language Response.
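The projection-and-concatenation step can be illustrated with a tiny dependency-free sketch. Here $Z_v$ denotes the visual encoder output and $W$ the projection matrix; both names, the toy dimensions, and the image-first ordering are assumptions for illustration. A real implementation would use a tensor library and much larger dimensions.

```python
from typing import List

Matrix = List[List[float]]


def linear_project(features: Matrix, weight: Matrix) -> Matrix:
    """Compute H_v = Z_v W, where features is N x d_v and weight is d_v x d_llm."""
    columns = list(zip(*weight))  # columns of W
    return [
        [sum(f * w for f, w in zip(row, col)) for col in columns]
        for row in features
    ]


# Two visual patch features with d_v = 3 (toy values).
z_v = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
# Projection matrix mapping d_v = 3 to the LLM embedding size d_llm = 2.
w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_v = linear_project(z_v, w)

# Three text token embeddings, already of size d_llm = 2.
h_q = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Concatenate along the sequence dimension to form the LLM input.
llm_input = h_v + h_q
print(len(llm_input), len(llm_input[0]))
```

Because the projection is a single matrix, the only requirement is that its output size matches the LLM's embedding dimension.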

4. Training Method

Also uses the LM Loss + Causal Mask Decoder paradigm.

Instruction-following data generation (data is in dialogue format): For each image $X_v$ , generate multi-turn conversation data $(X_{q1}, X_{a1}, ..., X_{qT}, X_{aT})$ , where $T$ is the total number of turns. These are organized into a sequence; all answers are treated as the assistant's responses. The instruction $X^t_{instruct}$ for the $t$ -th turn is:

![Image](images/LLaVA(2023, Microsoft)_85.png)
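As a rough sketch of how the per-turn instructions might be assembled: this assumes the rule from the LLaVA paper, in which the first turn combines the question and the image in random order and later turns contain only the question. The `<image>` placeholder and the function name are hypothetical.

```python
import random
from typing import List

IMAGE_TOKEN = "<image>"  # hypothetical placeholder for the image X_v


def build_instructions(questions: List[str], rng: random.Random) -> List[str]:
    """Build the instruction for each turn t = 1..T from the questions."""
    instructions = []
    for turn, question in enumerate(questions, start=1):
        if turn == 1:
            # First turn: question and image, in a randomly chosen order.
            parts = [question, IMAGE_TOKEN]
            if rng.random() < 0.5:
                parts.reverse()
            instructions.append(" ".join(parts))
        else:
            # Later turns: the question alone; the image is already in context.
            instructions.append(question)
    return instructions


instructions = build_instructions(
    ["What is in the image?", "What is the dog doing?"],
    random.Random(0),
)
print(instructions)
```

Randomizing the order in the first turn keeps the model from depending on a fixed position for the image.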

Training objective: next-token prediction. The model is trained by maximum likelihood: the probability of the answer sequence factorizes into a product of per-token conditional probabilities, and training maximizes it (equivalently, minimizes the negative log-likelihood). $X_a$ is the answer, $X_v$ is the image, $X_{instruct}$ is the instruction, and $\theta$ is the trainable parameters for each stage:

![Image](images/LLaVA(2023, Microsoft)_87.png)
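For readers without access to the figure, the objective can be written out as follows. This is a reconstruction from the surrounding definitions; $L$ (the answer length), $x_i$ (the $i$-th answer token), and the subscript $<i$ (all tokens before position $i$) are symbols introduced here:

$$
p(X_a \mid X_v, X_{instruct}) = \prod_{i=1}^{L} p_\theta\left(x_i \mid X_v, X_{instruct,<i}, X_{a,<i}\right)
$$

Taking the negative logarithm turns the product into a sum of per-token cross-entropy terms, which is the usual language-modeling loss.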

During training, the model only predicts how the assistant should respond and when to stop. As shown in the figure below, only the answer $X_a$ and the <STOP> tokens are used to compute the autoregressive model's loss:

![Image](images/LLaVA(2023, Microsoft)_89.png)
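The masking can be sketched in a few lines: the loss averages the negative log-likelihood over assistant tokens only (including the assistant's closing `<STOP>`), so probabilities assigned to human-turn tokens have no effect. The per-token probabilities and role labels below are made-up toy values, and `masked_nll` is an illustrative name.

```python
import math
from typing import List


def masked_nll(token_probs: List[float], roles: List[str]) -> float:
    """Average negative log-likelihood over assistant tokens only."""
    supervised = [p for p, role in zip(token_probs, roles) if role == "assistant"]
    return -sum(math.log(p) for p in supervised) / len(supervised)


# Toy sequence: three human-turn tokens, then two answer tokens and <STOP>.
roles = ["human", "human", "human", "assistant", "assistant", "assistant"]
# Model probability assigned to each ground-truth token (toy values).
probs = [0.9, 0.1, 0.5, 0.5, 0.25, 0.5]

loss = masked_nll(probs, roles)
print(round(loss, 4))
```

Frameworks usually implement the same idea by setting the labels of non-assistant positions to an ignore index.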

Training proceeds in two stages:

Stage 1 - Image-Text Feature Alignment Pre-training (Pre-training for Feature Alignment): (Training a compatible visual tokenizer with a frozen LLM)

Filter 595K image-text pairs from CC3M. These pairs are converted to instruction-following data through a simple expansion approach.
Each sample is a single-turn conversation.
- The ground truth is the original caption text for the image.
- Through this step, image features output by the visual encoder can be aligned with the word vectors of the pre-trained LLM.

Stage 2 - Visual Instruction Tuning: Fine-tune the projection and LLM on single-turn/multi-turn dialogue data, keeping the visual encoder weights frozen throughout. Uses conversation, detailed description, and complex reasoning data.

Multimodal Chatbot: Fine-tuned on the 158K language-image instruction-following data constructed above. Among the three types of responses (conversation, detailed description, complex reasoning), conversation is multi-turn, while the other two are single-turn. During training, the three types are sampled uniformly.
Science QA: Input includes questions and corresponding context information; the model's answers must provide reasoning and determine answers from multiple choices.

| | Stage 1: Feature Alignment Pre-training | Stage 2: Visual Instruction Tuning, Task 1 (Multimodal Chatbot) | Stage 2: Visual Instruction Tuning, Task 2 (Science QA) |
|---|---|---|---|
| Objective | Align visual features with text features | Enable the model to better adapt to visual QA tasks | Enable the model to better adapt to visual QA tasks |
| Dataset | 595K pairs filtered from CC3M | 158K GPT-generated instruction-following samples (conversation 58K, detailed description 23K, complex reasoning 77K) | Science QA: 12.7K, 4.2K, 4.2K (train, val, test) |
| ViT trained | No | No | No |
| Projection trained | Yes | Yes | Yes |
| LLM trained | No | Yes | Yes |
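The freeze/train pattern across the two stages can be captured in a small lookup, shown here as a sketch. The stage and module names are labels chosen for this example, not identifiers from the LLaVA code.

```python
from typing import Dict, List

# Which components receive gradient updates in each stage (True = trained).
TRAINABLE: Dict[str, Dict[str, bool]] = {
    "stage1_feature_alignment": {
        "vision_encoder": False,
        "projection": True,
        "llm": False,
    },
    "stage2_instruction_tuning": {
        "vision_encoder": False,
        "projection": True,
        "llm": True,
    },
}


def trainable_modules(stage: str) -> List[str]:
    """Return the sorted names of modules that are updated in the given stage."""
    return sorted(name for name, trained in TRAINABLE[stage].items() if trained)


for stage in TRAINABLE:
    print(stage, trainable_modules(stage))
```

In a training script, such a table would typically drive which parameters have gradients enabled before the optimizer is built.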

LLaVA ablation study:

Using CLIP's second-to-last layer is better than the last layer
CoT has limited effect
Stage 1 feature alignment pre-training is very important and yields significant performance gains
Larger models perform better

![Image](images/LLaVA(2023, Microsoft)_101.png)

A blog post with detailed code explanation: https://zhuanlan.zhihu.com/p/721647840

5. Summary

LLaVA's greatest strength lies in its pursuit of "simplicity and practicality." Rather than re-inventing widely available or mature basic functionality, it combines the widely used CLIP ViT-L/14 (image encoder) and Vicuna, which is based on LLaMA (large language model), with a lightweight "projection matrix" to unify images and text in the same representation space. This lets LLaVA excel at understanding and following instructions in conversational contexts, outperforming more complex alternatives such as Flamingo's gated cross-attention and BLIP-2's Q-Former.
