CLIP (2021, OpenAI)
Contrastive Language-Image Pre-training — zero-shot visual classification by aligning image and text embeddings in a shared semantic space.
Multimodality refers to input data from different domains, such as text, images, audio, and video. In traditional approaches, each domain has its own set of classical processing algorithms (for example, RNNs, LSTMs, and Transformers for text, and various CNNs for images), with each domain operating relatively independently. However, many tasks require joint cross-domain data: for instance, given an image, output a textual description of it; or given a piece of text, output an image matching the description. The challenge lies in the fact that different domains have different feature distributions and carry different feature information. The overall goal of multimodal models is therefore to train a model that can both unify feature representations and learn correlations between features from different modalities. Core tasks include representation learning, modality mapping, modality alignment, modality fusion, and co-learning.

CLIP (Contrastive Language-Image Pre-training) is a multimodal pre-trained neural network based on contrastive learning over image-text pairs, proposed by OpenAI in February 2021. It effectively leverages natural language supervision to learn visual concepts. Through cross-modal learning, CLIP represents images and text in a unified semantic space, enabling machines to understand images and text simultaneously and to match them more accurately and intelligently.
1. CLIP Overview
This section covers the motivation behind CLIP, the problems it addresses in computer vision, and its core idea.
1.1 Motivation
Despite the revolutionary breakthroughs that deep learning has brought to computer vision, existing methods still face several key challenges:

- High cost of dataset construction with limited coverage: typical visual datasets require substantial human annotation and cover only a limited range of visual concepts, making it difficult to scale to broader scenarios.
- Poor model generalization and difficulty adapting to new tasks: traditional visual models are typically designed for specific tasks; transferring them to new tasks requires additional fine-tuning and large amounts of labeled data.
- Poor performance in real-world environments: although many models perform well on standard benchmarks, their performance in stress tests (handling noise, occlusion, viewpoint changes, etc.) is often disappointing, which has raised questions about the feasibility of deep learning in computer vision.

When performing traditional classification tasks, a model such as ViT has two notable shortcomings:
- If an image contains a category the model has never seen before, the model cannot produce the correct output. For example, the model was trained on animal images, but at inference time it is given an image of a car.
- If the input data undergoes a distribution shift, the model may also fail to produce the correct output. The situation in the first point is one type of shift; another example is training on normal animal images but being given Picasso-style animal images at inference time.

The traditional solution to these two shortcomings is fine-tuning. Multimodal approaches, however, aim to do everything in one step: zero-shot image classification without any fine-tuning.

One-sentence summary of the CLIP paper: CLIP is a zero-shot visual classification model; the pre-trained model achieves strong transfer performance on downstream tasks without fine-tuning.
1.2 Core Idea
Core idea: contrastive learning, letting images and text perform contrastive learning in the same semantic space. This can also be described as extracting supervisory signals from natural language to guide model learning.

CLIP uses contrastive learning to map image and text representations into a shared high-dimensional embedding space, such that semantically related image-text pairs have high similarity while unrelated pairs have low similarity. In this way, although images and text come from different modalities, they are aligned in a common semantic space. For example, given a photo of a cat and the description "a cute cat," CLIP learns to place this image-text pair close together while pushing unrelated pairs farther apart; a photo of a dog and the description "a cute cat" would be placed far apart.

Advantages of this core idea:

- No need for human annotation; free from fixed label sets
- Strong zero-shot transfer learning capability
- Strong general visual understanding capability
- Multimodality
2. CLIP Pipeline
The CLIP pipeline has three stages:

1️⃣ Contrastive pre-training: contrastive learning is performed using image-text pairs.
2️⃣ Create dataset classifier from label text: extract text features for the prediction categories.
3️⃣ Use for zero-shot prediction: perform zero-shot inference and prediction.
2.1 Dataset Construction
The process of constructing a dataset using the Internet as the data source is as follows:
- Find all words appearing more than 100 times in English Wikipedia to form a query list.
- Construct a query set of size 500,000.
- For each query, find related images from a search engine. To maintain relative balance across queries, cap the number of (image, text) pairs per query at 20,000.
- The resulting dataset is called WIT (WebImageText). It contains approximately the same number of words as the WebText dataset used by GPT-2. This dataset gave rise not only to CLIP but also to DALL-E.
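The per-query balancing step above can be sketched as follows. This is a toy illustration, not the paper's actual pipeline: the `(query, image_url, alt_text)` stream format and the `build_wit` helper are hypothetical; only the 20,000-pair cap comes from the source.

```python
from collections import defaultdict

MAX_PER_QUERY = 20_000  # cap per query, for balance across queries

def build_wit(pair_stream):
    """Collect (image, text) pairs, keeping at most MAX_PER_QUERY per query.

    pair_stream yields (query, image_url, alt_text) tuples (hypothetical format).
    """
    counts = defaultdict(int)
    dataset = []
    for query, image_url, alt_text in pair_stream:
        if counts[query] < MAX_PER_QUERY:
            counts[query] += 1
            dataset.append((image_url, alt_text))
    return dataset

# A query with more hits than the cap is truncated to 20,000 pairs:
toy_stream = [("cat", f"img_{i}.jpg", "a cat") for i in range(25_000)]
print(len(build_wit(toy_stream)))  # 20000
```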
2.2 Pre-training Phase
Pre-training method: Using the idea of contrastive learning, the model only predicts which text, as a whole, is paired with which image; it does not need to predict the exact content of the text. This contrastive learning approach is 4x more training-efficient compared to a predictive approach.
2.2.1 Pre-training Process
- Input: Each batch contains N (image, text) pairs.
- Image feature extraction: each image passes through an Image Encoder, yielding a feature vector $I_f$ of dimension $d_i$.
- Text feature extraction: each text passes through a Text Encoder, yielding a feature vector $T_f$ of dimension $d_t$.
- Projection to the multimodal space:
  a. Image features are mapped through a linear projection to a unified text-image multimodal space, producing a $d_e$-dimensional vector $I_e$. All N images in a batch produce N such vectors.
  b. Text features are mapped through a linear projection to the same space, producing a $d_e$-dimensional vector $T_e$. All N texts in a batch produce N such vectors.

After obtaining the N image vectors and N text vectors, contrastive learning begins: pairing $I_1$ with $T_1$, all the way through $I_N$ with $T_N$.

Ground-truth definition for positive and negative samples: each element of the similarity matrix S is the dot-product similarity between an image vector and a text vector, $S_{ij} = I_i \cdot T_j$. Elements on the diagonal of S are positive samples; all other elements are negative samples.

Similarity is computed using cosine similarity, a method for measuring the similarity between two vectors based on the cosine of the angle between them:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Cosine similarity ranges from -1 to 1. Values closer to 1 indicate greater similarity; values closer to -1 indicate greater dissimilarity; a value of 0 means the two vectors are perpendicular (orthogonal). Cosine similarity is widely used in text matching, data mining, machine learning, and other fields to evaluate similarity between samples. In some cases it can replace Euclidean distance, because cosine similarity considers the direction of the vectors rather than their magnitude.

Pre-training pseudocode:
# I_f: image features [n, d_i]; T_f: text features [n, d_t].
# W_i [d_i, d_e] and W_t [d_t, d_e] are learned projection layers that map each
# unimodal feature into the shared multimodal embedding space.
# L2 normalization then yields the embeddings I_e, T_e used for contrastive learning.
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)   # [n, d_e]
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)   # [n, d_e]

# Scaled pairwise cosine similarities [n, n]; t is the learnable temperature.
logits = np.dot(I_e, T_e.T) * np.exp(t)

# Symmetric loss function.
# Ground truth is generated with arange: values 0 to n-1, because for CLIP all
# positive samples lie on the diagonal of the similarity matrix.
labels = np.arange(n)

# Cross-entropy between logits and labels, computed in both directions:
# loss_i is the loss for images (each image against all texts) and loss_t is the
# loss for texts (each text against all images); the final loss is their average.
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
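The pseudocode above can be run end-to-end in NumPy. This is a toy sketch under stated assumptions: features and projections are random, the batch and dimension sizes are made up, and the temperature is fixed at its CLIP initialization value rather than learned; `l2_normalize` and a diagonal cross-entropy are written out explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 6, 5, 3          # toy batch size and dimensions

I_f = rng.normal(size=(n, d_i))        # stand-in for Image Encoder outputs
T_f = rng.normal(size=(n, d_t))        # stand-in for Text Encoder outputs
W_i = rng.normal(size=(d_i, d_e))      # learned image projection (random here)
W_t = rng.normal(size=(d_t, d_e))      # learned text projection (random here)
t = np.log(1 / 0.07)                   # temperature; learnable in CLIP, fixed here

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def diag_cross_entropy(logits, axis):
    # log-softmax along `axis`; the positive for row/column k is index k,
    # i.e. the diagonal of the similarity matrix.
    logp = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    return -np.mean(np.diagonal(logp))

I_e = l2_normalize(I_f @ W_i)          # [n, d_e]
T_e = l2_normalize(T_f @ W_t)          # [n, d_e]
logits = (I_e @ T_e.T) * np.exp(t)     # [n, n] scaled cosine similarities

loss_i = diag_cross_entropy(logits, axis=1)  # each image vs. all texts
loss_t = diag_cross_entropy(logits, axis=0)  # each text vs. all images
loss = (loss_i + loss_t) / 2
print("loss:", float(loss))
```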
2.2.2 Model Architecture
CLIP is a classic dual-tower model with two encoders: one for text and one for images.

Text Encoder: a Transformer in the style of GPT-2. Each text sequence begins with the [SOS] token and ends with the [EOS] token. The vector corresponding to [EOS] in the last layer serves as the representation of the entire text, i.e., the text feature vector used downstream. Experiments found that CLIP's performance is not strongly correlated with the capacity of the text encoder, so when increasing compute, only the width of the text encoder is increased, not its depth.

Image Encoder: either a ResNet or a ViT can be used. ViT performs better and is used by default. ViT uses the CLS token to represent image information; in ViT (2020, Google Research), CLS serves as the representation of classification information.
2.2.3 Training Details
Both encoders are trained from scratch rather than initialized from existing pre-trained weights. Why:
- CLIP's training objective is to learn the multimodal alignment relationship between images and text, rather than a purely visual or language task.
- If an ImageNet pre-trained image encoder is used, the model may be biased toward traditional classification tasks rather than cross-modal alignment.
- Training from scratch ensures the model is not subject to single-modality (visual or textual) pre-training biases, and can learn a joint image-text representation from the very beginning.
Other training details:

- When mapping text feature vectors and image feature vectors to the multimodal space, linear projection (multiplication by a parameter matrix W) is used rather than nonlinear projection.
- Data augmentation uses only random cropping. Additional data augmentation would further increase computational overhead, and overly strong augmentation might cause the model to learn visual features mismatched with text, harming image-text alignment.
- The temperature parameter of softmax is not chosen as a hyperparameter; instead it is set as a learnable parameter jointly optimized during training. Given the enormous size of the dataset and model and the heavy computation cost, hyperparameter search is impractical.
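To see why the temperature matters, compare softmax outputs with and without CLIP-style scaling. The similarity values below are made up for illustration; the scale exp(log(1/0.07)) ≈ 14.3 matches CLIP's initialization of the learnable parameter.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sims = np.array([0.6, 0.5, 0.1])       # toy cosine similarities, bounded in [-1, 1]
p_unscaled = softmax(sims)              # scale 1: nearly uniform, weak training signal
p_scaled = softmax(sims * np.exp(np.log(1 / 0.07)))  # scale ~14.3: sharply peaked

print(p_unscaled.round(3))
print(p_scaled.round(3))
```

Because cosine similarities live in [-1, 1], the unscaled softmax can never get close to a one-hot target; the learned temperature lets the model sharpen the distribution as much as training requires.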
2.3 Inference Phase
This phase uses CLIP's pre-trained Image Encoder and Text Encoder to perform Zero-Shot Transfer. For example, given an image from the ImageNet-1K validation set, we want the pre-trained CLIP model to complete the classification task. As shown below, CLIP adopts the Prompt Template pattern.
2.3.1 Extract Text Features for Prediction Categories
For each category, construct a descriptive text: "A photo of a {label}," then feed these texts into the Text Encoder to get the corresponding text features. If there are N categories, N text feature vectors are obtained. Taking ImageNet as an example: all category labels such as "cat," "dog," etc. are turned into prompts like "A photo of a {object}" and fed into the CLIP pre-trained Text Encoder sequentially to obtain feature vectors $T_1, T_2, \dots, T_N$. These text feature vectors are then compared with the image feature vector, allowing CLIP to convert standard classification tasks into image-text matching tasks.
2.3.2 Zero-Shot Prediction
The given image is fed into the Image Encoder to produce image features, which are then compared with the N text feature vectors. The scaled cosine similarity is computed (consistent with training), and the category corresponding to the text with the highest similarity is chosen as the image classification prediction result. Furthermore, these similarities can be treated as logits and passed through softmax to obtain the predicted probability for each category. This enables zero-shot classification.
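The two inference steps can be sketched in NumPy. The encoder outputs here are random stand-ins (in practice they come from CLIP's Text and Image Encoders), and the image feature is nudged toward the "cat" prompt so the toy example has a known answer; the fixed logit scale of 100 is a stand-in for the learned temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = ["cat", "dog", "car"]
prompts = [f"A photo of a {label}" for label in labels]

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-ins for Text Encoder outputs, one 512-d vector per prompt (random here).
text_features = l2_normalize(rng.normal(size=(len(prompts), 512)))

# Stand-in for the Image Encoder output, nudged toward the "cat" prompt.
image_features = l2_normalize(text_features[0] + 0.1 * rng.normal(size=512))

# Scaled cosine similarities -> softmax over categories (same form as training).
logits = 100.0 * image_features @ text_features.T
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(labels[int(np.argmax(probs))])  # cat
```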
3. Summary
CLIP's advantages:

- CLIP advances the CNN idea that "similar images have similar feature vectors" one large step further: not only do similar images have similar feature vectors, but similar images and text do as well. The image features learned by CLIP are strongly related to the natural language people use to describe visual concepts.
- It broke down the boundary between text and image understanding, opening up enormous possibilities for multimodal learning. Subsequent work on image-to-text and text-to-image search builds on this foundation.

CLIP's limitations:

- Contrastive learning only focuses on global matching and lacks finer-grained semantic alignment. It can know that "a photo of a cat" corresponds to an image of a cat, but cannot accurately localize where the cat is in the image. If an image contains multiple objects, CLIP may not accurately understand which textual descriptions correspond to which local regions.
- The text encoder is a Transformer, but it is only used to convert text into fixed-dimensional vectors without deep text understanding. It cannot handle complex text reasoning tasks or contextual reasoning across multiple sentences; it simply matches text-image similarity. Examples:
  - "Which dog in this picture looks happier?"
  - "What might be the background story of this image?"
- Difficulty handling fine-grained classification: it performs poorly on fine-grained tasks (e.g., distinguishing similar species, car models, brands) and on some out-of-distribution datasets.
- At inference time, it is heavily dependent on prompt engineering; small changes in the prompt can significantly affect results. For the same dog image:
  - "A photo of a dog" → CLIP may correctly classify it
  - "A fluffy pet" → CLIP may fail to correctly identify it
- CLIP primarily processes global image information and has weak understanding of local regions. An image may contain both "a cat" and "a dog," but CLIP may not clearly localize the two objects and can only classify based on overall information.
Supplement
How can GPT's approach (using self-supervised learning on internet text data to automatically extract supervisory signals, guiding model training, achieving broad transferability) be applied to computer vision? How can the rich image and text information on the internet be used to train CV models in a self-supervised manner?

Problems to solve, and CLIP's approach to each:

- How to construct a text-image dataset. Key features: fully leverage existing internet information; the data must be large in scale, broad in coverage, and rich in information. CLIP mined a dataset of 400 million (image, text) pairs from the internet.
- How to design supervisory signals. Key feature: since training must be large-scale and automated, human annotation cannot be relied upon. CLIP uses the prediction of which text is paired with which image (and thus semantically related) as the training objective, which elegantly and simply addresses the challenge of extracting visual supervisory signals from text.
- How to design a large model. Key feature: the model must have sufficient capacity to consume the continuously supplied data. CLIP pairs a Text Encoder and an Image Encoder that can be jointly trained, solving the capacity problem with mainstream Transformers and ResNets.
- How to perform joint training. Key feature: joint training on text-image content yields extremely strong zero-shot transfer capability. In a zero-shot manner, without any fine-tuning, CLIP achieves results on ImageNet comparable to supervised training with ResNet-50.

Advantages of CLIP's inference method (Prompt Template):

- Free from category restrictions: categories not seen in the training dataset (e.g., "tricycle") can still be recognized at inference time; simply add the "tricycle" category label in the text encoding step.
- Prompt engineering: convert the input label into a sentence, e.g., "a photo of a {label}." For inference on specific categories, additional information can be added. For example, for a pet dataset, the input could be "a photo of a {label}, a type of pet." For an OCR dataset, double quotes could be added around the text. Adding extra information narrows the solution space and improves performance.
- Prompt ensembling: construct multiple prompt templates for the same label, typically by adding different adjectives and modifiers. For example: "a photo of a big {label}," "a photo of a small {label}."
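Prompt ensembling amounts to averaging the normalized text embeddings of several filled-in templates per label, then re-normalizing to get one classifier vector per label. In this sketch, `embed_text` is a hypothetical stand-in for CLIP's Text Encoder (a deterministic random vector per string), not the real model.

```python
import zlib
import numpy as np

templates = [
    "a photo of a {}",
    "a photo of a big {}",
    "a photo of a small {}",
]

def embed_text(text):
    # Hypothetical stand-in for the Text Encoder: a deterministic
    # pseudo-random unit vector derived from the string.
    seed = zlib.crc32(text.encode())
    v = np.random.default_rng(seed).normal(size=512)
    return v / np.linalg.norm(v)

def ensemble_embedding(label):
    # Average the embeddings of all filled-in templates, then re-normalize;
    # the result is a single classifier vector for this label.
    vecs = np.stack([embed_text(t.format(label)) for t in templates])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)

classifier = np.stack([ensemble_embedding(l) for l in ["cat", "dog"]])
print(classifier.shape)  # (2, 512)
```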
Code
Overall calling code
# Based on the README of the official openai/CLIP package; abbreviated sketch.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

# -------------------------- Contrastive similarity computation --------------------------
with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # Output label probabilities, e.g. [[0.99, 0.004, 0.003]]
CLIP implementation
# Extract the vector at the EOS (end-of-sentence) position as the text feature
# text.argmax(dim=-1) finds the actual length of each text (CLIP marks EOS with the maximum id)
x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)]
x = x @ self.text_projection # Project into unified feature space
return x
def forward(self, image, text):
    # Get image and text features
    image_features = self.encode_image(image)
    text_features = self.encode_text(text)
    # Normalize feature vectors (so cosine similarity = dot product)
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    return image_features, text_features
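Why `text.argmax(dim=-1)` locates the EOS token: in CLIP's BPE vocabulary the end-of-text token has the largest id (49407, with 49406 as start-of-text), so an argmax over each padded token-id sequence returns the EOS position. A NumPy illustration (the intermediate token ids are made up; only 49406/49407 are CLIP's actual special ids):

```python
import numpy as np

# Two padded token-id sequences; 49406 = start-of-text, 49407 = end-of-text.
# Every other id is smaller, so argmax along the sequence finds the EOS position.
text = np.array([
    [49406, 320, 1125, 539, 320, 2368, 49407, 0, 0],
    [49406, 320, 1929, 49407, 0, 0, 0, 0, 0],
])
eos_pos = text.argmax(axis=-1)
print(eos_pos)  # [6 3]
```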
Image encoder ViT implementation
# Pre-layer normalization
x = self.ln_pre(x)

# Through the Transformer (dimensions adjusted to [sequence_length, batch, features])
x = x.permute(1, 0, 2)
x = self.transformer(x)
x = x.permute(1, 0, 2)

# Take the feature corresponding to the class token → [B, 768]
x = self.ln_post(x[:, 0, :])

# Project to the unified feature space → [B, 512] (assuming output_dim=512)
if self.proj is not None:
    x = x @ self.proj
return x
Text encoder implementation (in CLIP class)
The text Transformer module implementation is as follows:
class Transformer(nn.Module):
    def __init__(self, width: int, layers: int, heads: int, attn_mask=None):
        super().__init__()
        # Stack multiple Transformer blocks
        self.resblocks = nn.Sequential(*[
            ResidualAttentionBlock(width, heads, attn_mask)
            for _ in range(layers)
        ])

    def forward(self, x: torch.Tensor):
        return self.resblocks(x)  # Pass through each block sequentially
Loss
def train():
    # Initialize CLIP model
    model = CLIP(...)
    # Get image and text features (already L2-normalized by model.forward)
    image_features, text_features = model(image, text)
    # Compute image-text similarity matrix (cosine similarity)
    logits = text_features @ image_features.T  # [B, B]
    # Compute self-similarity within images and within texts (for constructing soft targets)
    images_similarity = image_features @ image_features.T  # [B, B]
    texts_similarity = text_features @ text_features.T     # [B, B]
    # Construct the target distribution (average of the two self-similarities)
    targets = F.softmax((images_similarity + texts_similarity) / 2, dim=-1)
    # Cross-entropy against the soft targets; the logits must go through
    # log-softmax first for this to be a valid cross-entropy
    loss = (-targets * F.log_softmax(logits, dim=-1)).sum(1).mean()
    return loss