# CSE559A Lecture 13

## Positional Encodings
### Fixed Positional Encodings

A set of sinusoids of different frequencies:
$$
f(p,2i)=\sin\left(\frac{p}{10000^{2i/d}}\right)\quad f(p,2i+1)=\cos\left(\frac{p}{10000^{2i/d}}\right)
$$

Here $p$ is the token's position, $i$ indexes the encoding dimensions, and $d$ is the total encoding dimension.

[source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)
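
A minimal sketch of computing this table in PyTorch (the function name and sizes are illustrative, not from the lecture):

```python
import torch

def sinusoidal_encoding(num_positions: int, d: int) -> torch.Tensor:
    """Build a (num_positions, d) table of fixed sinusoidal encodings."""
    p = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # positions
    two_i = torch.arange(0, d, 2, dtype=torch.float32)                 # even dims 2i
    freq = 1.0 / (10000.0 ** (two_i / d))                              # 1 / 10000^(2i/d)
    enc = torch.zeros(num_positions, d)
    enc[:, 0::2] = torch.sin(p * freq)  # f(p, 2i)
    enc[:, 1::2] = torch.cos(p * freq)  # f(p, 2i+1)
    return enc

pe = sinusoidal_encoding(num_positions=128, d=64)  # added to token embeddings
```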
### Positional Encodings in Reconstruction

It is hard for an MLP to learn high-frequency information from raw scalar inputs $(x,y)$; encoding the coordinates with sinusoids of several frequencies, as above, makes high-frequency detail much easier to fit.

Example: a network mapping pixel coordinates $(x,y)$ to colors $(r,g,b)$.
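
A sketch of this coordinate encoding (the frequencies and MLP sizes are illustrative assumptions, in the spirit of Fourier-feature encodings):

```python
import torch
import torch.nn as nn

def encode_xy(xy: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map raw (x, y) coords in [0, 1] to sin/cos features at several frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi  # 2^k * pi
    args = xy.unsqueeze(-1) * freqs                    # (..., 2, num_freqs)
    feats = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
    return feats.flatten(-2)                           # (..., 2 * 2 * num_freqs)

# MLP from encoded coordinates to colors; fitting fine image detail this way
# is much easier than from the raw 2-D input.
mlp = nn.Sequential(
    nn.Linear(2 * 2 * 8, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3),  # (r, g, b)
)

rgb = mlp(encode_xy(torch.rand(1024, 2)))  # colors for a batch of coordinates
```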
### Generalized Positional Encodings

- Dependence on location, scalar values, metadata, etc.
- Can just be fully learned (use `nn.Embedding` and optimize based on a categorical input); see the sketch below.
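
A minimal sketch of a fully learned encoding (sizes are illustrative):

```python
import torch
import torch.nn as nn

num_positions, d = 196, 768                 # e.g., 14x14 patches
pos_embed = nn.Embedding(num_positions, d)  # one learned vector per position

idx = torch.arange(num_positions)           # categorical position ids 0..195
tokens = torch.randn(1, num_positions, d)   # stand-in for patch tokens
tokens = tokens + pos_embed(idx)            # trained end to end with the model
```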
## Vision Transformer (ViT)

### Class Token

In Vision Transformers, a special learnable token called the class token is prepended to the input sequence of patch tokens; its output aggregates information from all patches and is used for classification.
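
A sketch of how the class token is typically prepended (shapes are illustrative):

```python
import torch
import torch.nn as nn

d = 768
cls_token = nn.Parameter(torch.zeros(1, 1, d))  # learned class token

patch_tokens = torch.randn(8, 196, d)           # (batch, num_patches, d)
cls = cls_token.expand(8, -1, -1)               # one copy per batch element
x = torch.cat([cls, patch_tokens], dim=1)       # (batch, 197, d)

# After the transformer encoder, the class token's output, x[:, 0],
# feeds the classification head.
```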
### Hidden CNN Modules

- A PxP convolution with stride P (splits the image into non-overlapping patches, which are then given positional encodings); see the sketch below.
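
A sketch of this patch-embedding convolution (P and d are illustrative):

```python
import torch
import torch.nn as nn

P, d = 16, 768
patchify = nn.Conv2d(3, d, kernel_size=P, stride=P)  # PxP conv, stride P

img = torch.randn(1, 3, 224, 224)
feat = patchify(img)                      # (1, d, 14, 14): one vector per patch
tokens = feat.flatten(2).transpose(1, 2)  # (1, 196, d) sequence of patch tokens
# positional encodings are then added to these tokens
```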
### ViT + ResNet Hybrid

Build a hybrid model that applies the vision transformer to the feature maps of a 50-layer ResNet.
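
A rough sketch of such a hybrid, using torchvision's ResNet-50 as the convolutional stem (the layer cut, token width, and encoder depth are my assumptions, not the exact model from the lecture):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 up to its final conv stage (drops global pooling and the fc head)
backbone = nn.Sequential(*list(resnet50().children())[:-2])

d = 768
proj = nn.Linear(2048, d)  # map ResNet channels to the token width
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True),
    num_layers=6,
)

img = torch.randn(1, 3, 224, 224)
feat = backbone(img)                            # (1, 2048, 7, 7)
tokens = proj(feat.flatten(2).transpose(1, 2))  # (1, 49, d)
out = encoder(tokens)                           # transformer over ResNet features
```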
## Moving Forward

At least for now, CNN and ViT architectures have similar performance, at least on ImageNet.

- General consensus: once an architecture is big enough, and not designed terribly, it can do well.
- Differences remain:
  - Computational efficiency
  - Ease of use in other tasks and with other input data
  - Ease of training
## Wrap up

- Self-attention as a key building block
- Flexible input specification using tokens with positional encodings
- A wide variety of architectural styles

Up Next: Training deep neural networks