Image synthesis in high resolution.
Use cross-attention to integrate the text embedding into the latent space.
> [!TIP]
>
> How are the transformer encoder and decoder embedded in the UNet? How is this implemented? How does cross-attention help improve image generation?
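A minimal sketch toward the first question, assuming PyTorch; the module names and shapes are my own illustrative choices, not the course's code. The idea: queries come from the flattened UNet feature map, while keys and values come from the text encoder's token embeddings, so every spatial position can attend to every prompt token.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Sketch of cross-attention inside a UNet stage (assumed design)."""
    def __init__(self, dim: int, text_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=n_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True,
        )

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; text: (B, T, text_dim) token embeddings
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)        # (B, H*W, C): one query per pixel
        out, _ = self.attn(self.norm(q), text, text)
        q = q + out                             # residual connection
        return q.transpose(1, 2).reshape(b, c, h, w)
```

In Stable Diffusion's UNet, blocks like this alternate with self-attention and convolutional ResNet blocks at several resolutions, which is how the text conditioning reaches the denoiser.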
### New takes from the lecture
#### Variational Autoencoder (VAE)
- Maps input data into a probabilistic latent space and then reconstructs the original data from it.
- The probabilistic latent space lets the model operate on a smoother space from which we can sample.
- Each sample is mapped to a Gaussian distribution in the latent space (see the sketch after this list).
- The exact posterior is intractable, so we approximate it with a Gaussian variational distribution, regularized toward a standard-Gaussian prior.
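A minimal VAE sketch (layer names and sizes are illustrative assumptions, not a reference implementation): the encoder predicts a per-sample Gaussian, the reparameterization trick keeps sampling differentiable, and the KL term pulls q(z|x) toward the standard-Gaussian prior.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    # Illustrative sizes; a real model would use convolutional encoders/decoders.
    def __init__(self, x_dim: int = 784, z_dim: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(256, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(z)
        # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        recon = ((x_hat - x) ** 2).sum(dim=1)  # Gaussian likelihood up to a constant
        return (recon + kl).mean()             # negative ELBO to minimize
```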
Drawbacks:
- Loses high-frequency information.
- The aggregate latent distribution is usually not Gaussian, so the Gaussian prior is a mismatch.
#### Diffusion models
Stacks of learnable VAE decoders: the fixed forward noising process acts as the encoder, and a shared denoising network plays the decoder at every noise level.
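A hedged DDPM-style training-step sketch under that reading (`model` is any noise-prediction network taking `(x_t, t)`; the schedule constants are typical choices, not the lecture's):

```python
import torch

# Forward process is a fixed "encoder" that adds Gaussian noise;
# one shared denoiser serves as the "decoder" at all T levels.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of alphas

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.shape[0],))      # random noise level per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    # Closed-form forward process: x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    # Train the shared denoiser to predict the added noise
    return ((model(x_t, t) - eps) ** 2).mean()
```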
#### Latent Diffusion Models (Stable Diffusion)
Ok, that's the name I recognize.
Vanilla diffusion models operate in pixel space, which is expensive.
Perform the diffusion process in latent space instead:
First train a powerful VAE to encode the data, then run diffusion on the VAE latent codes, and finally decode the latents back to an image with the VAE decoder.
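A sketch of that pipeline at sampling time; `vae.decode` and `unet(z, t, text_emb)` are assumed interfaces, not a specific library's API, and the latent shape is illustrative.

```python
import torch

@torch.no_grad()
def generate(vae, unet, text_emb, T: int = 1000) -> torch.Tensor:
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    z = torch.randn(1, 4, 64, 64)   # latent grid, far smaller than pixel space
    for t in reversed(range(T)):
        eps = unet(z, torch.tensor([t]), text_emb)   # text enters via cross-attention
        # Standard DDPM posterior-mean update from the predicted noise
        z = (z - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return vae.decode(z)   # pixels only appear at the very end
```

The whole denoising loop never touches pixel space, which is where the cost savings come from.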
#### VAE training
Two stages of compression in the LDM recipe:
- Semantic compression: handled by the latent diffusion model, which learns the semantic composition of the data.
- Perceptual compression: handled by the first-stage autoencoder, trained with a perceptual loss and a patch-based GAN objective.
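A hedged sketch of such a first-stage objective (`disc` and `lpips` are stand-in callables for a patch discriminator and a perceptual-distance network; the loss weights are illustrative, not the paper's exact values):

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_hat, mu, logvar, disc, lpips,
                     kl_weight: float = 1e-6, adv_weight: float = 0.5):
    recon = F.l1_loss(x_hat, x)                  # pixel-level reconstruction
    perceptual = lpips(x_hat, x).mean()          # perceptual compression term
    adv = -disc(x_hat).mean()                    # generator tries to fool the patch GAN
    # Tiny KL weight: regularize the latent without forcing a strict Gaussian
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + perceptual + adv_weight * adv + kl_weight * kl
```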
#### Limitations
Lack of contextual understanding: the model composes familiar visual concepts well but struggles with prompts that require relational or compositional reasoning.