# CSE5519 Advances in Computer Vision (Topic D: 2024: Image and Video Generation)

## Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

[link to the paper](https://arxiv.org/pdf/2406.06525)

This paper shows that an autoregressive model can outperform diffusion models at image generation.

### Novelty in the autoregressive model

Uses the Llama architecture as the autoregressive model.
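
As a minimal sketch of what this means (not the paper's code; `model` and `prompt_tokens` are hypothetical placeholders for any decoder-only causal transformer and its conditioning prefix), the image is produced as a sequence of discrete tokens, one at a time, exactly like next-token prediction in a language model:

```python
import torch

@torch.no_grad()
def generate_image_tokens(model, prompt_tokens, num_tokens=256, temperature=1.0):
    # `model`: any causal transformer mapping (B, T) token ids to
    # (B, T, vocab_size) logits; `prompt_tokens`: a (B, T0) conditioning
    # prefix (e.g., a class token). Both names are illustrative assumptions.
    tokens = prompt_tokens
    for _ in range(num_tokens):
        logits = model(tokens)[:, -1, :] / temperature   # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token id
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, prompt_tokens.shape[1]:]  # return only the generated image tokens
```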

Uses a codebook (a learned image tokenizer) together with spatial downsampling to reduce the memory footprint of the token sequence.
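
A minimal sketch of the codebook idea, assuming an encoder has already downsampled the image to a small feature grid (the shapes and codebook size below are illustrative, not the paper's exact configuration): each feature vector is replaced by the index of its nearest codebook entry, so the image becomes a short sequence of integers.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # features: (B, H, W, D) encoder output after spatial downsampling.
    # codebook: (K, D) learned embedding table.
    # Returns (B, H, W) integer token ids: H*W discrete tokens per image
    # instead of H*W*D floats, which is what shrinks the footprint.
    flat = features.reshape(-1, features.shape[-1])  # (B*H*W, D)
    dists = torch.cdist(flat, codebook)              # distance to every code, (B*H*W, K)
    ids = dists.argmin(dim=-1)                       # nearest codebook entry per vector
    return ids.reshape(features.shape[:-1])

# Example: a 256x256 image downsampled 16x gives a 16x16 grid of tokens.
features = torch.randn(1, 16, 16, 8)   # illustrative shapes only
codebook = torch.randn(1024, 8)        # K=1024 codes of dimension 8
print(quantize(features, codebook).shape)  # torch.Size([1, 16, 16])
```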

> [!TIP]
>
> Later works showed that an image can often be represented by just a few code words; for example, 32 tokens may be enough to represent most images (at the level of detail most human annotators would describe). However, I doubt whether this result generalizes to more complex image generation tasks, such as generating human faces, since I find it difficult to describe the people around me distinctively without using their names.
>
> For more realistic, real-life videos, we may need many more code words to ensure contextual consistency. Is such a method scalable to video generation while still producing realistic results, or will the memory cost for video generation grow exponentially?
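
A quick back-of-envelope calculation on that question (the numbers are my own illustrative assumptions, not from the paper): the token count grows linearly with the number of frames, while full self-attention memory grows quadratically with sequence length, which is steep, though not literally exponential.

```python
tokens_per_frame = 16 * 16          # assume a 16x16 token grid per frame
for frames in (1, 16, 64):
    seq_len = tokens_per_frame * frames
    attn_cells = seq_len ** 2       # pairwise attention scores per layer/head
    print(f"{frames:3d} frames -> {seq_len:6d} tokens, {attn_cells:>13,} attention score cells")
```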