# CSE559A Lecture 13

## Positional Encodings

### Fixed Positional Encodings

A set of sinusoids of different frequencies:

$$
f(p,2i)=\sin\left(\frac{p}{10000^{2i/d}}\right)\qquad f(p,2i+1)=\cos\left(\frac{p}{10000^{2i/d}}\right)
$$
[source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)
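
A minimal PyTorch sketch of these fixed encodings (the function name and dimension choices are my own):

```python
import torch

def sinusoidal_encoding(num_positions: int, d: int) -> torch.Tensor:
    """Fixed positional encodings: even dimensions get sin, odd get cos."""
    p = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # positions, shape (P, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)                 # 2i = 0, 2, 4, ...
    freq = 1.0 / (10000.0 ** (two_i / d))                              # 1 / 10000^(2i/d)
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(p * freq)   # f(p, 2i)
    pe[:, 1::2] = torch.cos(p * freq)   # f(p, 2i+1)
    return pe

pe = sinusoidal_encoding(num_positions=128, d=64)  # one d-dim encoding per position
```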
### Positional Encodings in Reconstruction

It is hard for an MLP to learn high-frequency information from raw scalar inputs $(x,y)$.

Example: a network mapping coordinates $(x,y)$ to colors $(r,g,b)$, as in the sketch below.
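
A sketch of the fix: encode $(x,y)$ with sinusoids at several frequencies before the MLP. The frequencies, layer widths, and names here are illustrative assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

def fourier_features(xy: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Lift raw (x, y) coordinates to sin/cos features at several frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs)        # 1, 2, 4, ... (illustrative choice)
    angles = xy.unsqueeze(-1) * freqs * torch.pi  # (N, 2, num_freqs)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (N, 2, 2*num_freqs)
    return feats.flatten(start_dim=1)             # (N, 4 * num_freqs)

# MLP from encoded coordinates to colors; widths are illustrative
mlp = nn.Sequential(
    nn.Linear(4 * 8, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3), nn.Sigmoid(),  # (r, g, b) in [0, 1]
)

xy = torch.rand(1024, 2)         # coordinates in [0, 1]^2
rgb = mlp(fourier_features(xy))  # (1024, 3)
```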
### Generalized Positional Encodings

- Dependence on location, scalar values, metadata, etc.
- Can just be fully learned: use `nn.Embedding` and optimize it from a categorical input (see the sketch below).
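
A minimal sketch of a fully learned encoding with `nn.Embedding`; the shapes are illustrative:

```python
import torch
import torch.nn as nn

num_tokens, d_model = 197, 768                 # e.g. 14x14 patches + class token (illustrative)
pos_embed = nn.Embedding(num_tokens, d_model)  # one learnable vector per position

tokens = torch.randn(8, num_tokens, d_model)   # a batch of token sequences
positions = torch.arange(num_tokens)           # categorical input: position ids 0..N-1
tokens = tokens + pos_embed(positions)         # encodings are trained with the rest of the model
```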
## Vision Transformer (ViT)

### Class Token

In Vision Transformers, a special learnable token called the class token is prepended to the input sequence to aggregate information for classification tasks.
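
A sketch of how the class token is typically handled (names and sizes are my own):

```python
import torch
import torch.nn as nn

d_model = 768
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable class token

patch_tokens = torch.randn(8, 196, d_model)            # a batch of patch embeddings
cls = cls_token.expand(patch_tokens.shape[0], -1, -1)  # one copy per example
x = torch.cat([cls, patch_tokens], dim=1)              # (8, 197, d_model)
# ... transformer encoder runs on x ...
# the classification head then reads only the class token's output: x[:, 0]
```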
### Hidden CNN Modules

- A P×P convolution with stride P (splits the image into patches, after which positional encodings are added); see the sketch below.
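
A sketch of that patch-embedding convolution, using the common 16×16-patch setup for illustration:

```python
import torch
import torch.nn as nn

P, d_model = 16, 768
patch_embed = nn.Conv2d(3, d_model, kernel_size=P, stride=P)  # the "hidden" P x P conv

images = torch.randn(8, 3, 224, 224)
x = patch_embed(images)                # (8, 768, 14, 14): one embedding per patch
tokens = x.flatten(2).transpose(1, 2)  # (8, 196, 768): patch tokens, ready for positional encodings
```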
### ViT + ResNet Hybrid

Build a hybrid model that applies the Vision Transformer to the features of a 50-layer ResNet (ResNet-50).
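
A rough sketch of such a hybrid, assuming torchvision's `resnet50` and illustrative transformer hyperparameters:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 backbone without its pooling/classification head
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])

images = torch.randn(8, 3, 224, 224)
feats = backbone(images)                   # (8, 2048, 7, 7) feature map
tokens = feats.flatten(2).transpose(1, 2)  # (8, 49, 2048): one token per spatial location

# a small transformer encoder over the CNN tokens; hyperparameters are illustrative
layer = nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(tokens)                      # contextualized tokens for a classification head
```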
## Moving Forward

For now, CNN and ViT architectures achieve similar performance, at least on ImageNet.

- General consensus: once an architecture is big enough and not badly designed, it can perform well.
- Differences remain:
  - Computational efficiency
  - Ease of use in other tasks and with other input data
  - Ease of training
## Wrap up

- Self-attention as a key building block
- Flexible input specification using tokens with positional encodings
- A wide variety of architectural styles
Up Next: Training deep neural networks