# CSE5519 Advances in Computer Vision (Topic D: 2021 and before: Image and Video Generation)

## High-Resolution Image Synthesis with Latent Diffusion Models

[link to the paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.pdf)

Image synthesis in high resolution.

### Novelty in Latent Diffusion Models

#### Transformer encoder for LDMs

LDMs use cross-attention to integrate the text embedding into the latent space.

> [!TIP]
>
> How are the transformer encoder and decoder embedded in the UNet? How is this implemented? How does cross-attention help improve image generation?

### New takes from the lecture

#### Variational Autoencoder (VAE)

- Maps input data into a probabilistic latent space and then reconstructs the original data.
- The probabilistic latent space gives the model a smoother latent space that we can sample from.
- Each sample is mapped to a Gaussian distribution in the latent space.
- The exact posterior is not known. The encoder approximates it with a Gaussian, which is regularized toward a Gaussian prior.

Drawbacks:

- Loses high-frequency information.
- The joint latent space is usually not Gaussian.

#### Diffusion models

Stacks of learnable VAE decoders.

#### Latent Diffusion Models (Stable Diffusion)

Ok, that's the name I recognize.

Vanilla diffusion models operate in pixel space, which is expensive.

Perform the diffusion process in latent space instead.

First train a powerful VAE to encode the data. Then run diffusion on these VAE latent codes. Finally, decode the latent code with the VAE decoder to get the image.

#### VAE training

- Semantic compression: LDM
- Perceptual compression: Autoencoder + GAN

#### Limitations

Lack of contextual understanding.
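
#### Cross-attention sketch

To make the cross-attention conditioning from the novelty section more concrete, here is a minimal sketch in plain Python. It noises a toy latent code with the standard closed-form forward process and then lets the noisy latent tokens attend to text tokens, with queries from the latents and keys/values from the text. All names (`add_noise`, `cross_attention`, `rand_matrix`), the tensor sizes, and the random weights are illustrative assumptions. In the paper the queries come from intermediate UNet features rather than directly from the latent, and the weights are learned, so treat this as a simplified illustration rather than the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def add_noise(z0, alpha_bar_t, rng):
    """Sample z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I) in latent space."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [[signal * v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in z0]


def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries from latents, keys and values from text."""
    q = matmul(latent_tokens, w_q)  # (n_latent, d)
    k = matmul(text_tokens, w_k)    # (n_text, d)
    v = matmul(text_tokens, w_v)    # (n_text, d)
    scale = 1.0 / math.sqrt(len(q[0]))
    # Scaled dot product between every latent query and every text key.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]  # each row sums to 1 over text tokens
    return matmul(weights, v), weights


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Hypothetical helper: random Gaussian matrix standing in for learned values."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


z0 = rand_matrix(4, 8)    # stand-in for VAE encoder output: 4 latent positions, 8 channels
text = rand_matrix(3, 6)  # stand-in for text encoder output: 3 tokens, 6 dimensions
w_q, w_k, w_v = rand_matrix(8, 5), rand_matrix(6, 5), rand_matrix(6, 5)

z_t = add_noise(z0, alpha_bar_t=0.5, rng=rng)
out, weights = cross_attention(z_t, text, w_q, w_k, w_v)
print(len(out), len(out[0]), len(weights[0]))
```

Each row of `weights` says how much one latent position draws from each text token, which is how the text embedding gets injected into the latent representation. A real model would add multiple heads, a residual connection, and an output projection.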