# CSE5519 Advances in Computer Vision (Topic B: 2022: Vision-Language Models)

## BLIP

[link to the paper](https://arxiv.org/pdf/2201.12086)

Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is a unified Vision-Language Pre-training framework that learns from noisy image-text pairs.

### Novelty in BLIP

#### MED

MED is a multimodal mixture of encoder-decoder architecture with three functional components:

- Unimodal encoder - separately encodes image and text
- Image-grounded text encoder - injects visual information by inserting one additional cross-attention layer
- Image-grounded text decoder - replaces the bidirectional self-attention layers of the image-grounded text encoder with causal self-attention layers

#### Pre-training objectives

- Image-text contrastive loss - aligns the feature spaces of the visual transformer and the text transformer
- Image-text matching loss - learns an image-text multimodal representation that captures the fine-grained alignment between image and text
- Language modeling loss - generates textual descriptions given an image, optimizing a cross-entropy loss over the predicted text tokens

A toy sketch of these three objectives appears at the end of this note.

#### CapFilt

CapFilt is a method to improve the quality of the text corpus: a captioner generates synthetic captions for web images, and a filter removes noisy image-text pairs (see the bootstrapping sketch at the end of this note).

- Diversity is key for synthetic captions.
- The improvement is not simply due to longer training.

> [!TIP]
>
> This paper shows a new way to pre-train a unified Vision-Language model from noisy image-text pairs. With the combined architecture and pre-training objectives, the model is able to learn from noisy image-text pairs, generate high-quality captions, and handle various image-text tasks. CapFilt is a simple but effective method to improve the quality of the text corpus over noisy image-alt-text pairs.
>
> I wonder how this method applies to general unlabeled images, since in some extreme cases the captions are generated solely by CapFilt. Will the performance continue to improve? Taking a few steps further, is it possible to use image-only data for alignment and integrate an image generation model into the framework so that the model self-improves?
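As referenced above, here is a toy, self-contained sketch of the three pre-training objectives (ITC, ITM, LM). All module names and shapes are illustrative stand-ins (plain linear layers in place of the actual transformers), not the authors' implementation:

```python
# Toy sketch of BLIP's three pre-training losses; every module here is a
# stand-in assumption, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D, V, T = 4, 256, 1000, 16           # batch, feature dim, vocab, text length
image_encoder = nn.Linear(768, D)       # stand-in for the vision transformer
text_encoder  = nn.Linear(D, D)         # stand-in for the unimodal text encoder
fusion        = nn.Linear(2 * D, D)     # stand-in for cross-attention fusion
itm_head      = nn.Linear(D, 2)         # matched / not-matched classifier
lm_head       = nn.Linear(D, V)         # token-prediction head for the decoder

images = torch.randn(B, 768)            # pretend pooled image features
text   = torch.randn(B, D)              # pretend pooled text features
tokens = torch.randint(0, V, (B, T))    # pretend caption token ids

# 1) Image-text contrastive (ITC): align the unimodal feature spaces with a
#    symmetric InfoNCE loss over in-batch positives and negatives.
img_f = F.normalize(image_encoder(images), dim=-1)
txt_f = F.normalize(text_encoder(text), dim=-1)
logits = img_f @ txt_f.t() / 0.07
labels = torch.arange(B)
itc = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# 2) Image-text matching (ITM): binary classification on the fused
#    representation; negatives come from shuffling captions within the batch
#    (a toy stand-in for the paper's hard negative mining).
pos = fusion(torch.cat([img_f, txt_f], dim=-1))
neg = fusion(torch.cat([img_f, txt_f[torch.randperm(B)]], dim=-1))
itm_logits = itm_head(torch.cat([pos, neg], dim=0))
itm_labels = torch.cat([torch.ones(B), torch.zeros(B)]).long()
itm = F.cross_entropy(itm_logits, itm_labels)

# 3) Language modeling (LM): cross-entropy over predicted caption tokens,
#    conditioned on the image (the causal decoder is stubbed out here).
dec_states = torch.randn(B, T, D)       # pretend causal decoder outputs
lm = F.cross_entropy(lm_head(dec_states).reshape(-1, V), tokens.reshape(-1))

loss = itc + itm + lm
```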
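And here is a hedged sketch of the CapFilt bootstrapping loop. The interfaces `captioner`, `filter_model`, and `threshold` are assumptions standing in for the fine-tuned image-grounded text decoder and encoder described in the paper:

```python
def capfilt(web_pairs, human_pairs, captioner, filter_model, threshold=0.5):
    """Return a bootstrapped corpus: filtered web text plus synthetic captions.

    web_pairs:    iterable of (image, alt_text) scraped pairs (noisy)
    human_pairs:  small set of (image, caption) annotated pairs (clean)
    captioner:    image -> synthetic caption (image-grounded text decoder,
                  fine-tuned on human_pairs)
    filter_model: (image, text) -> match score in [0, 1] (image-grounded
                  text encoder, fine-tuned on human_pairs)
    """
    bootstrapped = list(human_pairs)  # always keep the human annotations
    for image, alt_text in web_pairs:
        # Keep the web alt-text only if the filter judges it a match.
        if filter_model(image, alt_text) >= threshold:
            bootstrapped.append((image, alt_text))
        # Generate a synthetic caption (the paper uses nucleus sampling,
        # since diverse captions matter) and filter it the same way.
        synthetic = captioner(image)
        if filter_model(image, synthetic) >= threshold:
            bootstrapped.append((image, synthetic))
    return bootstrapped
```

The cleaned corpus returned here would then be used to pre-train a fresh model, which is the bootstrapping step that gives the paper its name.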