# CSE559A Lecture 13

## Positional Encodings

### Fixed Positional Encodings

A set of sinusoids of different frequencies:

$$
f(p, 2i) = \sin\left(\frac{p}{10000^{2i/d}}\right) \quad f(p, 2i+1) = \cos\left(\frac{p}{10000^{2i/d}}\right)
$$

[source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)

(A PyTorch sketch of this encoding appears at the end of these notes.)

### Positional Encodings in Reconstruction

It is hard for an MLP to learn high-frequency information from a raw scalar input $(x, y)$. Example: a network mapping from $(x, y)$ to $(r, g, b)$.

### Generalized Positional Encodings

- Can depend on location, scale, metadata, etc.
- Can be fully learned (use `nn.Embedding` and optimize based on a categorical input).

## Vision Transformer (ViT)

### Class Token

In Vision Transformers, a special token called the class token is prepended to the input sequence to aggregate information for classification tasks.

### Hidden CNN Modules

- A $P \times P$ convolution with stride $P$ splits the image into patches, which then receive positional encodings (see the patch-embedding sketch at the end of these notes).

### ViT + ResNet Hybrid

Build a hybrid model that runs the vision transformer on top of the feature maps from a 50-layer ResNet.

## Moving Forward

At least for now, CNN and ViT architectures have similar performance, at least on ImageNet.

- General consensus: once an architecture is big enough, and not designed terribly, it can do well.
- Differences remain:
  - Computational efficiency
  - Ease of use in other tasks and with other input data
  - Ease of training

## Wrap up

- Self-attention as a key building block
- Flexible input specification using tokens with positional encodings
- A wide variety of architectural styles

Up next: Training deep neural networks
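
The fixed sinusoidal encoding above can be written in a few lines of PyTorch. This is a minimal sketch, assuming an even embedding dimension $d$; the function name `sinusoidal_encoding` is illustrative.

```python
import torch

def sinusoidal_encoding(num_positions: int, d: int) -> torch.Tensor:
    """Table of shape (num_positions, d): even columns hold sin(p / 10000^(2i/d)),
    odd columns hold cos(p / 10000^(2i/d)), matching the formula above. Assumes d is even."""
    p = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (L, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)                 # values 0, 2, 4, ...
    angles = p / (10000.0 ** (two_i / d))                              # (L, d/2)
    enc = torch.zeros(num_positions, d)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

# Example: a 16-token sequence with 64-dimensional embeddings;
# the table is added elementwise to the token embeddings.
pe = sinusoidal_encoding(16, 64)   # shape (16, 64)
```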
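
For the reconstruction example, one common fix is to apply the same sine/cosine idea to the continuous $(x, y)$ coordinates before the MLP. The sketch below assumes coordinates in $[0, 1]$ and a power-of-two frequency schedule; both choices, and the name `fourier_features`, are assumptions rather than anything fixed in the lecture.

```python
import torch

def fourier_features(xy: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map raw (x, y) coordinates to sines and cosines at several frequencies,
    so an MLP from (x, y) to (r, g, b) can fit high-frequency detail."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)        # 1, 2, 4, ... (assumed schedule)
    angles = xy.unsqueeze(-1) * freqs * torch.pi                       # (N, 2, F)
    feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (N, 2, 2F)
    return feats.flatten(1)                                            # (N, 4F)

coords = torch.rand(1024, 2)           # (x, y) sampled in [0, 1]^2
features = fourier_features(coords)    # shape (1024, 32); fed to the MLP instead of raw (x, y)
```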
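
For fully learned positional encodings, `nn.Embedding` maps a categorical position index to a trainable vector that is optimized along with the rest of the model. A minimal sketch; the table size and dimension below are illustrative.

```python
import torch
import torch.nn as nn

# Fully learned positional encodings: one trainable vector per position index,
# looked up from a categorical input and optimized with the rest of the model.
max_positions, dim = 256, 64
pos_embed = nn.Embedding(max_positions, dim)

positions = torch.arange(10)        # categorical input: position indices 0..9
encodings = pos_embed(positions)    # shape (10, 64), added to the token embeddings
```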
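
The ViT input pipeline, combining the $P \times P$ stride-$P$ convolution, the class token, and learned positional embeddings, can be sketched as a single module. This is a sketch under assumed sizes (224-pixel images, 16-pixel patches, width 768), not a reference implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into PxP patches with a PxP convolution of stride P,
    prepend a learned class token, and add learned positional embeddings."""

    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the class token
        return x + self.pos_embed            # add positional embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))   # shape (2, 197, 768)
```

The first output token corresponds to the class token, which is what the classification head reads after the transformer blocks.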