# CSE5519 Advances in Computer Vision (Topic D: 2021 and before: Image and Video Generation)
## High-Resolution Image Synthesis with Latent Diffusion Models
[link to the paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.pdf)

Image synthesis at high resolution.
### Novelty in Latent Diffusion Models
#### Transformer encoder for LDMs
LDMs use cross-attention to integrate the text embeddings into the UNet's latent feature maps.

> [!TIP]
>
> How are the transformer encoder and decoder embedded in the UNet? How is this implemented in practice? How does cross-attention help improve image generation?
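To make the tip concrete, here is a minimal sketch of cross-attention in PyTorch: the flattened UNet feature map forms the queries, and the text-encoder tokens form the keys and values. The class name, tensor shapes, and dimensions below are illustrative stand-ins, not the paper's actual code.

```python
import torch
import torch.nn as nn

class ToyCrossAttention(nn.Module):
    """Minimal cross-attention block: image features (queries) attend to
    text-encoder tokens (keys/values), as in the LDM UNet's attention layers."""

    def __init__(self, dim: int, context_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=heads,
            kdim=context_dim, vdim=context_dim, batch_first=True,
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (B, H*W, dim)       -- flattened UNet feature map
        # context: (B, T, context_dim) -- text token embeddings
        out, _ = self.attn(query=x, key=context, value=context)
        return x + out  # residual connection


feat = torch.randn(2, 64 * 64, 320)  # flattened 64x64 feature map, 320 channels
text = torch.randn(2, 77, 768)       # e.g. 77 CLIP-style text tokens of width 768
y = ToyCrossAttention(dim=320, context_dim=768)(feat, text)
print(y.shape)  # torch.Size([2, 4096, 320])
```

Because the text enters only through keys and values, every spatial location can pull in prompt information at every denoising step, which is how the conditioning steers generation.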
### New takeaways from the lecture
#### Variational Autoencoder (VAE)
- Maps input data into a probabilistic latent space and then reconstructs the original data from it (see the sketch after this list).
- The probabilistic latent space is smoother than a deterministic one, so we can sample from it.
- Each input is mapped to a Gaussian distribution in the latent space.
- The exact posterior is intractable, so we approximate it with a Gaussian variational posterior, regularized toward a Gaussian prior.
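A minimal sketch of these ideas in PyTorch, assuming flattened inputs (e.g. 28x28 images) and an MLP encoder/decoder; the names and sizes are illustrative, not from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    """Toy MLP VAE: encodes each input to a Gaussian N(mu, sigma^2) in latent
    space, samples via the reparameterization trick, and decodes back."""

    def __init__(self, x_dim: int = 784, z_dim: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(256, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar


def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term plus KL(q(z|x) || N(0, I)).
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl


x = torch.rand(8, 784)  # e.g. a batch of flattened 28x28 images
model = ToyVAE()
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
loss.backward()
```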
Drawbacks:
- Reconstructions lose high-frequency detail.
- The aggregate (joint) latent distribution is usually not Gaussian.
#### Diffusion models
A diffusion model can be viewed as a stack of learnable VAE-like decoders: each denoising step inverts one step of a fixed noising process.
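One concrete way to see this is the standard DDPM training step, sketched below. `eps_model` is a stand-in for the denoising UNet, and the schedule values are the common DDPM defaults; none of these names come from the lecture notes.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

def ddpm_training_step(eps_model, x0):
    """One DDPM training step: noise x0 to a random timestep t and train the
    model to predict the added noise (the standard epsilon-prediction loss)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))  # broadcast over C,H,W
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps        # q(x_t | x_0) in closed form
    return F.mse_loss(eps_model(x_t, t), eps)

# Dummy model so the sketch runs end to end (hypothetical, for shape-checking only).
dummy = lambda x_t, t: torch.zeros_like(x_t)
loss = ddpm_training_step(dummy, torch.randn(4, 3, 32, 32))
```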
#### Latent Diffusion Models (Stable Diffusion)
Ok, that's the name I recognize.

Vanilla diffusion models operate in pixel space, which is expensive.

LDMs instead perform the diffusion process in latent space.

First train a powerful VAE to encode the data, then run diffusion on the VAE latent codes, and finally decode the denoised latents back to an image with the VAE decoder.
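For reference, this three-stage pipeline is what the public Stable Diffusion release implements. A minimal usage sketch with the Hugging Face `diffusers` library (my addition, not something in these notes or the paper):

```python
# Assumes the `diffusers` library and a GPU; the model ID below is the
# public Stable Diffusion v1.5 checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Internally: the text encoder embeds the prompt, the UNet iteratively
# denoises a latent with cross-attention to those embeddings, and the VAE
# decoder maps the final latent back to pixels.
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```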
#### VAE training
- Semantic compression: handled by the LDM operating on the autoencoder's latents.
- Perceptual compression: handled by the autoencoder, trained with a perceptual loss plus a patch-based adversarial (GAN) objective (sketched after this list).
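A hedged sketch of what that perceptual-compression objective looks like. Here `perceptual` and `discriminator` are hypothetical callables standing in for an LPIPS-style metric and a patch discriminator, and the loss weights are made up for illustration.

```python
import torch

def autoencoder_loss(x, x_hat, perceptual, discriminator, w_p=1.0, w_g=0.5):
    """Sketch of the autoencoder (perceptual compression) objective:
    pixel reconstruction + perceptual distance + adversarial term."""
    rec = (x - x_hat).abs().mean()     # pixel-space L1 reconstruction
    per = perceptual(x, x_hat).mean()  # LPIPS-style perceptual distance
    adv = -discriminator(x_hat).mean() # push the patch D to score x_hat as real
    return rec + w_p * per + w_g * adv

# Dummy callables so the sketch runs end to end (hypothetical, for shapes only).
x, x_hat = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
loss = autoencoder_loss(
    x, x_hat,
    perceptual=lambda a, b: ((a - b) ** 2).mean(dim=(1, 2, 3)),
    discriminator=lambda img: img.mean(dim=(1, 2, 3)),
)
```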
#### Limitations
Lack of contextual understanding.