Zheyuan Wu
2025-09-09 14:23:28 -05:00
parent 341af61ea9
commit 384e538bc9
2 changed files with 50 additions and 0 deletions


@@ -38,3 +38,12 @@ Zero-shot CLIP still generalizes poorly to data that is truly out-of-distributio
> From the definition of the general task CLIP solves, and the experimental results comparing Zero-Shot CLIP against a Linear Probe on ResNet50, I can see that Zero-Shot CLIP beats the Linear Probe on tasks that humans "frequently label" in captions, such as the car brand or the location of the image. It performs badly when humans rarely label the concept or the concept is more abstract, such as the distance to the camera, counting objects in the image, or classifying satellite images of terrain.
>
> Is the CLIP model really learning enough knowledge from the general natural-language description of the image? If the descriptions were more comprehensive, would CLIP outperform the Linear Probe on ResNet50?
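
A minimal sketch of what zero-shot CLIP classification looks like, with placeholder linear encoders standing in for CLIP's image and text towers (the 512-d embedding size matches CLIP; the input sizes and encoders here are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

# Placeholder encoders standing in for CLIP's image/text towers.
image_encoder = torch.nn.Linear(2048, 512)  # pretend image features -> embedding
text_encoder = torch.nn.Linear(300, 512)    # pretend caption features -> embedding

def zero_shot_classify(image_feat, class_prompts, text_feats):
    # Embed the image and one caption prompt per class, then pick the class
    # whose caption embedding has the highest cosine similarity with the image.
    img = F.normalize(image_encoder(image_feat), dim=-1)  # (1, 512)
    txt = F.normalize(text_encoder(text_feats), dim=-1)   # (C, 512)
    sims = img @ txt.T                                    # (1, C) cosine similarities
    return class_prompts[sims.argmax(dim=-1).item()]

prompts = ["a photo of a dog", "a photo of a car", "a satellite photo of terrain"]
pred = zero_shot_classify(torch.randn(1, 2048), prompts, torch.randn(3, 300))
```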
### New takes from the lecture
Flickr image label selection.
Visual Commonsense Reasoning dataset (VisualBERT).
Loss based on cosine similarity between matched image-caption pairs.
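
A minimal sketch of that objective: a symmetric InfoNCE loss over the cosine-similarity matrix of a batch of matched image/caption embeddings. CLIP actually learns the temperature, so the fixed 0.07 here is an assumption:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over cosine similarities of matched image-caption pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature           # (N, N) cosine-similarity matrix
    targets = torch.arange(len(logits))          # i-th image matches i-th caption
    loss_i = F.cross_entropy(logits, targets)    # image -> caption direction
    loss_t = F.cross_entropy(logits.T, targets)  # caption -> image direction
    return (loss_i + loss_t) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```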


@@ -12,3 +12,44 @@ Image synthesis in high resolution.
Use cross-attention to integrate the text embedding into the latent space (see the sketch after the tip below).
> [!TIP]
>
> How are the transformer encoder and decoder embedded in the UNet? How is it implemented? How does the cross-attention help improve the image generation?
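
A hedged guess at part of these questions: a cross-attention block where the UNet's spatial features act as queries and the text embeddings as keys/values. The dimensions (320-d latent, 77 text tokens, 768-d text embedding) mirror Stable Diffusion's published config, but this block is a sketch, not the actual implementation:

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Sketch of text conditioning in an LDM UNet layer: queries come from
    the spatial latent, keys/values from the text encoder output."""
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, z, text_emb):
        # z: (B, C, H, W) UNet feature map; text_emb: (B, T, text_dim)
        b, c, h, w = z.shape
        tokens = z.flatten(2).transpose(1, 2)   # (B, H*W, C) spatial tokens
        attended, _ = self.attn(self.norm(tokens), text_emb, text_emb)
        tokens = tokens + attended              # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

out = CrossAttentionBlock()(torch.randn(2, 320, 16, 16), torch.randn(2, 77, 768))
```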
### New takes from the lecture
#### Variational Autoencoder (VAE)
- Maps input data into a probabilistic latent space and then reconstructs the original data from it.
- The probabilistic latent space is smoother, so we can sample from it.
- Each input is mapped to a Gaussian distribution in the latent space.
- The exact posterior is intractable, so we approximate it with a Gaussian (the variational posterior) and regularize it toward a standard Gaussian prior via a KL term (see the sketch after the drawbacks below).
Drawbacks:
- Loses high-frequency information.
- The aggregated latent distribution is usually not actually Gaussian, despite the Gaussian prior.
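
A minimal VAE sketch illustrating these points: the encoder outputs a Gaussian per input, the reparameterization trick makes sampling differentiable, and the KL term pulls the variational posterior toward the standard Gaussian prior (all sizes here are arbitrary assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # outputs mu and logvar
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term + KL divergence to the standard Gaussian prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

x = torch.rand(32, 784)
recon, mu, logvar = TinyVAE()(x)
loss = vae_loss(x, recon, mu, logvar)
```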
#### Diffusion models
Can be viewed as a stack of learnable VAE decoders: the forward noising process acts as a fixed encoder, and each denoising step is a learnable decoder (sketch below).
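
One way to read that in code: a single DDPM-style training step, where the fixed forward process plays the encoder and the learnable noise predictor plays the decoder. The schedule values are the commonly used ones, and the stand-in model is purely illustrative:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # fixed noising schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal fraction

def ddpm_training_step(model, x0):
    """The fixed forward process noises x0 at a random step t; the learnable
    'decoder' (model) is trained to predict that noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)         # denoising objective

toy = lambda x, t: torch.zeros_like(x)  # stand-in noise predictor for the sketch
loss = ddpm_training_step(toy, torch.randn(8, 3, 32, 32))
```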
#### Latent Diffusion Models (Stable diffusion)
Ok, that's the name I recognize.
Vanilla diffusion models operate in pixel space, which is expensive.
Latent diffusion performs the diffusion process in latent space instead.
First train a powerful VAE to encode the data, then run diffusion on these VAE latent codes, and finally decode the latent code back to an image with the VAE decoder (see the pipeline sketch below).
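
A sketch of that pipeline at sampling time, with trivial lambdas standing in for the trained denoiser and VAE decoder (the 4×64×64 latent shape follows Stable Diffusion, but that is an assumption):

```python
import torch

@torch.no_grad()
def ldm_sample(vae_decoder, denoiser, steps=50, latent_shape=(1, 4, 64, 64)):
    """Latent diffusion sampling: iterate the denoiser entirely in the cheap
    VAE latent space, then decode once to pixel space."""
    z = torch.randn(latent_shape)      # start from pure latent noise
    for t in reversed(range(steps)):
        z = denoiser(z, t)             # one learned denoising step
    return vae_decoder(z)              # single decode back to pixels

# Stand-ins so the sketch runs; real LDMs use a trained UNet and VAE decoder.
image = ldm_sample(vae_decoder=lambda z: z, denoiser=lambda z, t: 0.9 * z)
```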
#### VAE training
Semantic compression: handled by the latent diffusion model (LDM).
Perceptual compression: handled by the first-stage autoencoder, trained with perceptual and GAN losses.
#### Limitations
Lack of contextual understanding.