Image synthesis in high resolution.
Use cross-attention to integrate the text embedding into the latent space.
> [!TIP]
>
> How are the transformer encoder and decoder embedded in the UNet? How is this implemented? How does cross-attention help improve image generation?
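A minimal sketch toward the first question, assuming PyTorch; the module names and shapes are my own illustrative choices, not the course's code. The idea: queries come from the flattened UNet feature map, while keys and values come from the text encoder's token embeddings, so every spatial position can attend to every prompt token.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Sketch of cross-attention inside a UNet stage (assumed design)."""
    def __init__(self, dim: int, text_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=n_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True,
        )

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; text: (B, T, text_dim) token embeddings
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)        # (B, H*W, C): one query per pixel
        out, _ = self.attn(self.norm(q), text, text)
        q = q + out                             # residual connection
        return q.transpose(1, 2).reshape(b, c, h, w)
```

In Stable Diffusion's UNet, blocks like this alternate with self-attention and convolutional ResNet blocks at several resolutions, which is how the text conditioning reaches the denoiser.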
### New takes from the lecture
#### Variational Autoencoder (VAE)
- Maps input data into a probabilistic latent space and then reconstructs the original data from it.
- The probabilistic latent space lets the model operate on a smoother space from which we can sample.
- Each sample is mapped to a Gaussian distribution in the latent space (see the sketch after this list).
- The exact posterior is intractable, so we approximate it with a Gaussian variational distribution, regularized toward a standard-Gaussian prior.
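A minimal VAE sketch (layer names and sizes are illustrative assumptions, not a reference implementation): the encoder predicts a per-sample Gaussian, the reparameterization trick keeps sampling differentiable, and the KL term pulls q(z|x) toward the standard-Gaussian prior.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    # Illustrative sizes; a real model would use convolutional encoders/decoders.
    def __init__(self, x_dim: int = 784, z_dim: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(256, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(z)
        # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        recon = ((x_hat - x) ** 2).sum(dim=1)  # Gaussian likelihood up to a constant
        return (recon + kl).mean()             # negative ELBO to minimize
```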
Drawbacks:
- Loses high-frequency information.
- The aggregate latent distribution is usually not Gaussian, so the Gaussian prior is a mismatch.
#### Diffusion models
Stacks of learnable VAE decoders: the fixed forward noising process acts as the encoder, and a shared denoising network plays the decoder at every noise level.
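A hedged DDPM-style training-step sketch under that reading (`model` is any noise-prediction network taking `(x_t, t)`; the schedule constants are typical choices, not the lecture's):

```python
import torch

# Forward process is a fixed "encoder" that adds Gaussian noise;
# one shared denoiser serves as the "decoder" at all T levels.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of alphas

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.shape[0],))      # random noise level per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    # Closed-form forward process: x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    # Train the shared denoiser to predict the added noise
    return ((model(x_t, t) - eps) ** 2).mean()
```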
#### Latent Diffusion Models (Stable Diffusion)
Ok, that's the name I recognize.
Vanilla diffusion models operate in pixel space, which is expensive.
Perform the diffusion process in latent space instead:
First train a powerful VAE to encode the data, then run diffusion on the VAE latent codes, and finally decode the latents back to an image with the VAE decoder.
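A sketch of that pipeline at sampling time; `vae.decode` and `unet(z, t, text_emb)` are assumed interfaces, not a specific library's API, and the latent shape is illustrative.

```python
import torch

@torch.no_grad()
def generate(vae, unet, text_emb, T: int = 1000) -> torch.Tensor:
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    z = torch.randn(1, 4, 64, 64)   # latent grid, far smaller than pixel space
    for t in reversed(range(T)):
        eps = unet(z, torch.tensor([t]), text_emb)   # text enters via cross-attention
        # Standard DDPM posterior-mean update from the predicted noise
        z = (z - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return vae.decode(z)   # pixels only appear at the very end
```

The whole denoising loop never touches pixel space, which is where the cost savings come from.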
#### VAE training
Two stages of compression in the LDM recipe:
- Semantic compression: handled by the latent diffusion model, which learns the semantic composition of the data.
- Perceptual compression: handled by the first-stage autoencoder, trained with a perceptual loss and a patch-based GAN objective.
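A hedged sketch of such a first-stage objective (`disc` and `lpips` are stand-in callables for a patch discriminator and a perceptual-distance network; the loss weights are illustrative, not the paper's exact values):

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_hat, mu, logvar, disc, lpips,
                     kl_weight: float = 1e-6, adv_weight: float = 0.5):
    recon = F.l1_loss(x_hat, x)                  # pixel-level reconstruction
    perceptual = lpips(x_hat, x).mean()          # perceptual compression term
    adv = -disc(x_hat).mean()                    # generator tries to fool the patch GAN
    # Tiny KL weight: regularize the latent without forcing a strict Gaussian
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + perceptual + adv_weight * adv + kl_weight * kl
```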
#### Limitations
Lack of contextual understanding: the model composes familiar visual concepts well but struggles with prompts that require relational or compositional reasoning.