Image synthesis in high resolution.

Use cross-attention to integrate the text embedding into the latent space.
> [!TIP]
>
> How are the transformer encoder and decoder embedded in the UNet? How is this implemented? How does cross-attention help improve image generation?
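
As a partial answer to the implementation question: in a text-conditioned UNet block, the flattened image features act as queries while the text-encoder tokens supply keys and values. A minimal sketch, assuming illustrative dimensions and names (this is not the actual Stable Diffusion code):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Text-to-image cross-attention: image latents query the text tokens."""
    def __init__(self, latent_dim, text_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )

    def forward(self, x, text_emb):
        # x:        (B, H*W, latent_dim) flattened UNet feature map -> queries
        # text_emb: (B, T, text_dim)     text-encoder tokens -> keys/values
        out, _ = self.attn(query=x, key=text_emb, value=text_emb)
        return x + out  # residual connection keeps the original image features

x = torch.randn(2, 64 * 64, 320)      # UNet block features (illustrative sizes)
txt = torch.randn(2, 77, 768)         # e.g. CLIP-style text embeddings
y = CrossAttention(320, 768)(x, txt)  # same shape as x, now text-conditioned
```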

### New takes from the lecture

#### Variational Autoencoder (VAE)

- Maps input data into a probabilistic latent space, then reconstructs the original data from it.
- The probabilistic latent space is smoother than a deterministic one, so we can sample from it.
- Each input sample is mapped to a Gaussian distribution in the latent space.
- The exact posterior is intractable, so we approximate it with a Gaussian (the prior over latents is a standard Gaussian); see the sketch after this list.
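
A minimal sketch of these ideas, with illustrative layer sizes: the encoder outputs the mean and log-variance of a Gaussian q(z|x), sampling uses the reparameterization trick, and a KL term pulls q(z|x) toward the standard Gaussian prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # predicts mu and log-variance
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")  # reconstruction term
    # Closed-form KL divergence between N(mu, sigma^2) and the N(0, I) prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```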

Drawbacks:

- Loses high-frequency information (reconstructions tend to be blurry).
- The joint latent distribution is usually not Gaussian.

#### Diffusion models

Diffusion models can be viewed as a stack of learnable VAE decoders: the forward noising process acts as a fixed encoder, and each learned denoising step acts as a decoder.
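
A sketch of the fixed forward (noising) process under this view, with an illustrative linear schedule; the learned reverse steps are trained to undo it, typically by predicting the added noise:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # illustrative noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

def q_sample(x0, t, noise):
    # Closed form of the forward process:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

# Training step (model assumed to predict the noise eps):
#   t     = torch.randint(0, T, (batch_size,))
#   noise = torch.randn_like(x0)
#   loss  = F.mse_loss(model(q_sample(x0, t, noise), t), noise)
```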

#### Latent Diffusion Models (Stable Diffusion)

OK, that's the name I recognize.

Vanilla diffusion models operate in pixel space, which is expensive. LDMs perform the diffusion process in latent space instead.

First train a powerful VAE to encode the data. Then run diffusion on the VAE latent codes. Finally, decode the denoised latents back into an image with the VAE decoder.
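
A sketch of that sampling pipeline, assuming pretrained `vae` and `unet` objects with hypothetical interfaces (`denoise_step` stands in for the actual DDPM/DDIM update rule):

```python
import torch

@torch.no_grad()
def generate(vae, unet, text_emb, steps=50, shape=(1, 4, 64, 64)):
    z = torch.randn(shape)              # start from pure noise in latent space
    for t in reversed(range(steps)):
        eps = unet(z, t, text_emb)      # predict noise, conditioned on the text
        z = denoise_step(z, eps, t)     # hypothetical reverse-diffusion update
    return vae.decode(z)                # only the final latents touch pixel space
```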

#### VAE training

- Semantic compression: handled by the LDM.
- Perceptual compression: handled by the autoencoder + GAN.
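
A sketch of the autoencoder's training objective under this split. The loss weights and terms below are simplified assumptions (the actual LDM recipe also adds an LPIPS perceptual loss, among other details):

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_hat, disc_logits_fake, mu, logvar, kl_weight=1e-6):
    recon = F.l1_loss(x_hat, x)            # pixel-level reconstruction
    adv = -disc_logits_fake.mean()         # generator side of the GAN loss
    # A tiny KL term keeps the latent space roughly Gaussian without
    # sacrificing reconstruction quality (perceptual compression).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + 0.5 * adv + kl_weight * kl
```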

#### Limitations

Lack of contextual understanding.