1.6 KiB
1.6 KiB
CSE559A Lecture 13
Positional Encodings
Fixed Positional Encodings
Set of sinusoids of different frequencies.
f(p,2i)=\sin(\frac{p}{10000^{2i/d}})\quad f(p,2i+1)=\cos(\frac{p}{10000^{2i/d}})
Positional Encodings in Reconstruction
MLP is hard to learn high-frequency information from scaler input (x,y).
Example: network mapping from (x,y) to (r,g,b).
Generalized Positional Encodings
- Dependence on location, scaler, metadata, etc.
- Can just be fully learned (use
nn.Embeddingand optimize based on a categorical input.)
Vision Transformer (ViT)
Class Token
In Vision Transformers, a special token called the class token is added to the input sequence to aggregate information for classification tasks.
Hidden CNN Modules
- PxP convolution with stride P (split the image into patches and use positional encoding)
ViT + ResNet Hybrid
Build a hybrid model that combines the vision transformer after 50 layer of ResNet.
Moving Forward
At least for now, CNN and ViT architectures have similar performance at least in ImageNet.
- General Consensus: once the architecture is big enough, and not designed terribly, it can do well.
- Differences remain:
- Computational efficiency
- Ease of use in other tasks and with other input data
- Ease of training
Wrap up
Self attention as a key building block
Flexible input specification using tokens with positional encodings
A wide variety of architectural styles
Up Next: Training deep neural networks