This commit is contained in:
Trance-0
2025-09-29 20:54:31 -05:00
parent 158dd0b6dc
commit 85ef67610c
2 changed files with 73 additions and 0 deletions

@@ -1,2 +1,48 @@
# CSE5519 Advances in Computer Vision (Topic B: 2022: Vision-Language Models)
## BLIP
[link to the paper](https://arxiv.org/pdf/2201.12086)
Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BLIP is a unified Vision-Language Pre-training framework that learns from noisy image-text pairs.
### Novelty in BLIP
#### MED
MED is a Multimodal mixture of Encoder-Decoder architecture with three functional modes (a minimal sketch follows this list).
- Unimodal encoder
- separately encodes image and text
- Image-grounded text encoder
- injects visual information by inserting one additional cross-attention layer between the self-attention layer and the feed-forward network of each transformer block
- Image-grounded text decoder
- replaces the bidirectional self-attention layers of the image-grounded text encoder with causal self-attention layers
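As a reading aid (not from the paper's released code), here is a minimal PyTorch sketch of how one text-transformer block could switch between the three MED modes; the module names, dimensions, and pre-norm layout are my assumptions.

```python
import torch
import torch.nn as nn

class MEDTextBlock(nn.Module):
    """One text-transformer block that can run in the three MED modes (sketch)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text, image_feats=None, causal=False):
        # causal=False, image_feats=None  -> unimodal text encoder
        # causal=False, image_feats given -> image-grounded text encoder
        # causal=True,  image_feats given -> image-grounded text decoder
        L = text.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1) if causal else None
        q = self.norm1(text)
        text = text + self.self_attn(q, q, q, attn_mask=mask)[0]
        if image_feats is not None:
            # cross-attention injects visual information from the image encoder
            text = text + self.cross_attn(self.norm2(text), image_feats, image_feats)[0]
        return text + self.ffn(self.norm3(text))

# toy usage: 2 samples, 16 text tokens, 49 image patches, width 768
txt, img = torch.randn(2, 16, 768), torch.randn(2, 49, 768)
block = MEDTextBlock()
unimodal = block(txt)                                # unimodal encoder
grounded = block(txt, image_feats=img)               # image-grounded text encoder
decoded = block(txt, image_feats=img, causal=True)   # image-grounded text decoder
```

In the paper, the encoder and decoder share all parameters except the self-attention layers; this sketch simply reuses one block for brevity.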
#### Pre-training objectives
- Image-text contrastive loss
- align the feature space of the visual transformer and text transformer
- Image-text matching loss
- learn image-text multimodal representation that captures the fine-grained alignment between the image and text
- Language modeling loss
- generate textual descriptions given an image by optimizing the cross-entropy loss over the predicted text tokens (a combined loss sketch follows this list)
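The following is not the authors' implementation, just a rough sketch of the three objectives on toy tensors, assuming image and text features have already been projected into a shared embedding space; every function name and shape here is illustrative.

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Image-text contrastive loss: align matched image/text embeddings (InfoNCE-style)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))               # diagonal pairs are positives
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def itm_loss(pair_logits, is_matched):
    """Image-text matching loss: binary classification of matched vs. unmatched pairs."""
    return F.cross_entropy(pair_logits, is_matched)       # pair_logits: (N, 2)

def lm_loss(token_logits, target_tokens):
    """Language modeling loss: cross-entropy over the predicted caption tokens."""
    return F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                           target_tokens.reshape(-1))

# toy usage
B, V = 4, 1000
total = (itc_loss(torch.randn(B, 256), torch.randn(B, 256))
         + itm_loss(torch.randn(2 * B, 2), torch.randint(0, 2, (2 * B,)))
         + lm_loss(torch.randn(B, 16, V), torch.randint(0, V, (B, 16))))
```

In BLIP, ITC uses the unimodal encoders, ITM uses the image-grounded text encoder, and LM uses the image-grounded text decoder.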
#### CapFilt
CapFilt is a method to improve the quality of the text corpus; a bootstrapping sketch follows the bullets below.
A captioner generates synthetic captions for web images.
A filter removes noisy image-text pairs.
- Diversity is key for synthetic captions.
- The improvement is not simply due to longer training.
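Below is a schematic Python sketch of the CapFilt bootstrapping step (not the paper's code): `captioner` and `filter_model` are placeholder callables standing in for the fine-tuned image-grounded decoder and encoder, and the data structures are illustrative.

```python
def bootstrap_dataset(web_pairs, human_pairs, captioner, filter_model):
    """CapFilt sketch: caption web images, filter noisy pairs, merge with annotated data.

    web_pairs:    list of (image, alt_text) scraped from the web (noisy)
    human_pairs:  list of (image, caption) with human annotations (e.g. COCO)
    captioner:    image -> synthetic caption       (image-grounded text decoder)
    filter_model: (image, text) -> bool "matched"  (image-grounded text encoder)
    """
    bootstrapped = []
    for image, alt_text in web_pairs:
        synthetic = captioner(image)               # generate a synthetic caption
        for text in (alt_text, synthetic):
            if filter_model(image, text):          # keep only pairs judged as matched
                bootstrapped.append((image, text))
    return bootstrapped + human_pairs              # pre-train a new model on this set
```

In the paper, both the captioner and the filter are initialized from the same pre-trained MED and fine-tuned individually on COCO before this bootstrapping step.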
> [!TIP]
>
> This paper presents a new way to pre-train a unified Vision-Language model from noisy image-text pairs. With the combined architecture and pre-training objectives, the model learns from noisy pairs, generates high-quality captions, and handles a variety of image-text tasks. CapFilt is a simple but effective method to improve the quality of a text corpus built from noisy image and alt-text pairs.
>
> I wonder how this method applies to general unlabeled images, since in some extreme cases the captions would be generated solely by CapFilt. Will the performance continue to improve? Taking a few steps further, is it possible to use image-only data as a form of alignment and integrate an image generation model into the framework so that the model can self-improve?

@@ -1,2 +1,29 @@
# CSE5519 Advances in Computer Vision (Topic D: 2022: Image and Video Generation)
## An Image is Worth One Word
[link to the paper](https://arxiv.org/pdf/2208.01618)
Personalizing Text-to-Image Generation using Textual Inversion
Goal: Enable language-guided generation of new, user-specific concepts.
### Novelty in Textual Inversion
Use pseudo-words, whose embeddings are learned, to guide generation toward new, user-specific concepts.
Textual inversion optimizes the pseudo-word embedding $v_*$ (a training-step sketch follows the equation):
$$
v_*=\arg\min_{v}\mathbb{E}_{z\sim \varepsilon(x),y,\epsilon\sim \mathcal{N}(0,1),t}\left[\|\epsilon-\epsilon_\theta(z_t,t,c_{\theta}(y))\|_2^2\right],
$$
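This is not the official implementation, just a minimal sketch of the optimization defined by the equation above, in the style of latent-diffusion training; the `modules` callables (`encode`, `add_noise`, `embed`, `denoise`) and all names are assumptions, and every network weight is frozen so that only the pseudo-word embedding $v_*$ receives gradients.

```python
import torch

def textual_inversion_step(images, prompt_tokens, v_star, modules, optimizer):
    """One optimization step: only the pseudo-word embedding v_star is updated.

    images:        a few example images of the new concept
    prompt_tokens: tokenized templates such as "a photo of S*" (S* = pseudo-word)
    v_star:        torch.nn.Parameter, the learned embedding for S*
    modules:       frozen placeholder callables:
                   'encode'    image -> latent z                 (z ~ E(x))
                   'add_noise' z, eps, t -> noised latent z_t
                   'embed'     tokens, v_star -> conditioning    (c_theta(y))
                   'denoise'   z_t, t, c -> predicted noise      (epsilon_theta)
    """
    z = modules['encode'](images)
    noise = torch.randn_like(z)                                 # epsilon ~ N(0, 1)
    t = torch.randint(0, 1000, (z.size(0),), device=z.device)   # random diffusion step
    z_t = modules['add_noise'](z, noise, t)
    cond = modules['embed'](prompt_tokens, v_star)               # v_star replaces S*'s embedding
    pred = modules['denoise'](z_t, t, cond)
    loss = ((noise - pred) ** 2).mean()                          # the objective in the equation
    optimizer.zero_grad()
    loss.backward()                                              # gradients reach only v_star
    optimizer.step()                                             # e.g. Adam([v_star])
    return loss.item()
```

Because the generator itself is untouched, the $\arg\min$ is taken only over $v_*$, which is why a handful of images is enough to invert a concept into the text-embedding space.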
> [!TIP]
>
> This paper shows that we can use pseudo-words to guide the generation of new, user-specific concepts.
>
> However, the technical details are not fully explained in the paper: for example, how is the loss function constructed from scratch, and how does it maximize the generalization ability of the model across different styles and concepts?
>
> For example, what does $v_*=\arg\min_{v}\mathbb{E}_{z\sim \varepsilon(x),y,\epsilon\sim \mathcal{N}(0,1),t}\left[\|\epsilon-\epsilon_\theta(z_t,t,c_{\theta}(y))\|_2^2\right]$ mean, and how is it used to generate the new, user-specific concepts?