# CSE559A Lecture 13

## Positional Encodings
### Fixed Positional Encodings

A set of sinusoids of different frequencies:
$$
f(p,2i)=\sin\left(\frac{p}{10000^{2i/d}}\right)\quad f(p,2i+1)=\cos\left(\frac{p}{10000^{2i/d}}\right)
$$

Here $p$ is the token's position, $i$ indexes the encoding dimensions, and $d$ is the total encoding dimension.

[source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)
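
A minimal sketch of computing this table in PyTorch (the function name and sizes are illustrative, not from the lecture):

```python
import torch

def sinusoidal_encoding(num_positions: int, d: int) -> torch.Tensor:
    """Build a (num_positions, d) table of fixed sinusoidal encodings."""
    p = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # positions
    two_i = torch.arange(0, d, 2, dtype=torch.float32)                 # even dims 2i
    freq = 1.0 / (10000.0 ** (two_i / d))                              # 1 / 10000^(2i/d)
    enc = torch.zeros(num_positions, d)
    enc[:, 0::2] = torch.sin(p * freq)  # f(p, 2i)
    enc[:, 1::2] = torch.cos(p * freq)  # f(p, 2i+1)
    return enc

pe = sinusoidal_encoding(num_positions=128, d=64)  # added to token embeddings
```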
### Positional Encodings in Reconstruction

It is hard for an MLP to learn high-frequency information from raw scalar inputs $(x,y)$; encoding the coordinates with sinusoids of several frequencies, as above, makes high-frequency detail much easier to fit.

Example: a network mapping pixel coordinates $(x,y)$ to colors $(r,g,b)$.
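
A sketch of this coordinate encoding (the frequencies and MLP sizes are illustrative assumptions, in the spirit of Fourier-feature encodings):

```python
import torch
import torch.nn as nn

def encode_xy(xy: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map raw (x, y) coords in [0, 1] to sin/cos features at several frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi  # 2^k * pi
    args = xy.unsqueeze(-1) * freqs                    # (..., 2, num_freqs)
    feats = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
    return feats.flatten(-2)                           # (..., 2 * 2 * num_freqs)

# MLP from encoded coordinates to colors; fitting fine image detail this way
# is much easier than from the raw 2-D input.
mlp = nn.Sequential(
    nn.Linear(2 * 2 * 8, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3),  # (r, g, b)
)

rgb = mlp(encode_xy(torch.rand(1024, 2)))  # colors for a batch of coordinates
```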
### Generalized Positional Encodings

- Dependence on location, scalar values, metadata, etc.
- Can just be fully learned (use `nn.Embedding` and optimize based on a categorical input); see the sketch below.
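
A minimal sketch of a fully learned encoding (sizes are illustrative):

```python
import torch
import torch.nn as nn

num_positions, d = 196, 768                 # e.g., 14x14 patches
pos_embed = nn.Embedding(num_positions, d)  # one learned vector per position

idx = torch.arange(num_positions)           # categorical position ids 0..195
tokens = torch.randn(1, num_positions, d)   # stand-in for patch tokens
tokens = tokens + pos_embed(idx)            # trained end to end with the model
```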
## Vision Transformer (ViT)

### Class Token

In Vision Transformers, a special learnable token called the class token is prepended to the input sequence of patch tokens; its output aggregates information from all patches and is used for classification.
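
A sketch of how the class token is typically prepended (shapes are illustrative):

```python
import torch
import torch.nn as nn

d = 768
cls_token = nn.Parameter(torch.zeros(1, 1, d))  # learned class token

patch_tokens = torch.randn(8, 196, d)           # (batch, num_patches, d)
cls = cls_token.expand(8, -1, -1)               # one copy per batch element
x = torch.cat([cls, patch_tokens], dim=1)       # (batch, 197, d)

# After the transformer encoder, the class token's output, x[:, 0],
# feeds the classification head.
```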
### Hidden CNN Modules

- A PxP convolution with stride P (splits the image into non-overlapping patches, which are then given positional encodings); see the sketch below.
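
A sketch of this patch-embedding convolution (P and d are illustrative):

```python
import torch
import torch.nn as nn

P, d = 16, 768
patchify = nn.Conv2d(3, d, kernel_size=P, stride=P)  # PxP conv, stride P

img = torch.randn(1, 3, 224, 224)
feat = patchify(img)                      # (1, d, 14, 14): one vector per patch
tokens = feat.flatten(2).transpose(1, 2)  # (1, 196, d) sequence of patch tokens
# positional encodings are then added to these tokens
```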
### ViT + ResNet Hybrid

Build a hybrid model that applies the vision transformer to the feature maps of a 50-layer ResNet.
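
A rough sketch of such a hybrid, using torchvision's ResNet-50 as the convolutional stem (the layer cut, token width, and encoder depth are my assumptions, not the exact model from the lecture):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 up to its final conv stage (drops global pooling and the fc head)
backbone = nn.Sequential(*list(resnet50().children())[:-2])

d = 768
proj = nn.Linear(2048, d)  # map ResNet channels to the token width
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True),
    num_layers=6,
)

img = torch.randn(1, 3, 224, 224)
feat = backbone(img)                            # (1, 2048, 7, 7)
tokens = proj(feat.flatten(2).transpose(1, 2))  # (1, 49, d)
out = encoder(tokens)                           # transformer over ResNet features
```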
## Moving Forward

At least for now, CNN and ViT architectures have similar performance, at least on ImageNet.

- General consensus: once an architecture is big enough, and not designed terribly, it can do well.
- Differences remain:
  - Computational efficiency
  - Ease of use in other tasks and with other input data
  - Ease of training
## Wrap up

- Self-attention as a key building block
- Flexible input specification using tokens with positional encodings
- A wide variety of architectural styles

Up Next: Training deep neural networks