This commit is contained in:
Zheyuan Wu
2025-09-11 12:47:03 -05:00
parent 0200ab7eed
commit 533ea36b37
5 changed files with 538 additions and 0 deletions

content/CSE510/CSE510_L6.md Normal file

@@ -0,0 +1,247 @@
# CSE510 Lecture 6
## Active reinforcement learning
### Exploration vs. Exploitation
- **Exploitation**: Try to get reward. We exploit our current knowledge to get a payoff.
- **Exploration**: Get more information about the world. How do we know if there is not a pot of gold around the corner?
- To explore we typically need to take actions that do not seem best according to our current model
- Managing the trade-off between exploration and exploitation is a critical issue in RL
- Basic intuition behind most approaches
- Explore more when knowledge is weak
- Exploit more as we gain knowledge
### ADP-based RL
Model-based approach:
1. Start with an initial (uninformed) model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Take action according to an **exploration/exploitation** policy
4. Update estimated model based on observed transition
5. Goto 2
#### Exploration/Exploitation policy
**Greedy action** is the action maximizing estimated $Q$ value
$$
Q(s,a) = R(s) + \gamma \sum_{s'\in S} T(s,a,s')V(s')
$$
- where $V$ is the current optimal value function estimate (based on the current model), and $R, T$ are the current estimates of the reward and transition model
- $Q(s,a)$ is the expected value of taking action $a$ in state $s$ and then getting estimated value $V(s')$ for the next state $s'$
We want an exploration policy that is **greedy in the limit of infinite exploration (GLIE)**:
- Try each action in each state an unbounded number of times
- This guarantees convergence
#### Greedy Policy 1
On time step $t$, select a random action with probability $p(t)$ and the greedy action with probability $1-p(t)$.
Setting $p(t) = \frac{1}{t}$ leads to convergence, but is slow.
> [!TIP]
>
> In practice, it's common to simply set $p(t) = \epsilon$ for all $t$ ($\epsilon$-greedy exploration).
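As a rough illustration, here is a minimal Python sketch of this exploration policy; the `q_values` table, state, and action set are hypothetical placeholders, not part of the lecture.

```python
import random

def epsilon_greedy_action(q_values, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: dict mapping (state, action) -> estimated Q-value (assumed given).
    """
    if random.random() < epsilon:
        return random.choice(actions)                             # explore
    return max(actions, key=lambda a: q_values[(state, a)])       # exploit
```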
#### Greedy Policy 2
Boltzmann exploration
Select an action with probability
$$
Pr(a\mid s)=\frac{\exp(Q(s,a)/T)}{\sum_{a'\in A}\exp(Q(s,a')/T)}
$$
$T$ is the temperature. Large $T$ means that each action has about the same probability. Small $T$ leads to more greedy behavior.
Typically start with large $T$ and decrease with time.
<details>
<summary>Example: impact of temperature</summary>
Suppose we have two actions and that $Q(s,a_1) = 1$ and $Q(s,a_2) = 2$.
When $T=10$, we have
$$
Pr(a_1\mid s)=\frac{\exp(1/10)}{\exp(1/10)+\exp(2/10)}=0.48
$$
$$
Pr(a_2\mid s)=\frac{\exp(2/10)}{\exp(1/10)+\exp(2/10)}=0.52
$$
When $T=1$, we have
$$
Pr(a_1\mid s)=\frac{\exp(1/1)}{\exp(1/1)+\exp(2/1)}=0.27
$$
$$
Pr(a_2\mid s)=\frac{\exp(2/1)}{\exp(1/1)+\exp(2/1)}=0.73
$$
When $T=0.1$, we have
$$
Pr(a_1\mid s)=\frac{\exp(1/0.1)}{\exp(1/0.1)+\exp(2/0.1)}\approx 0.00005
$$
$$
Pr(a_2\mid s)=\frac{\exp(2/0.1)}{\exp(1/0.1)+\exp(2/0.1)}\approx 0.99995
$$
</details>
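A minimal Python sketch of Boltzmann action selection, reproducing the probabilities from the example above; the Q-table format and temperature values are illustrative assumptions.

```python
import math
import random

def boltzmann_action(q_values, state, actions, temperature):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    prefs = [math.exp(q_values[(state, a)] / temperature) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]

# Probabilities from the example: Q(s,a1)=1, Q(s,a2)=2
q = {("s", "a1"): 1.0, ("s", "a2"): 2.0}
for T in (10.0, 1.0, 0.1):
    e1, e2 = math.exp(1.0 / T), math.exp(2.0 / T)
    print(T, e1 / (e1 + e2), e2 / (e1 + e2))

print(boltzmann_action(q, "s", ["a1", "a2"], temperature=1.0))  # sampled action
```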
### (Alternative Model-Based RL) Optimistic Exploration: Rmax [Brafman & Tennenholtz, 2002]
1. Start with an **optimistic model**
- (assign largest possible reward to "unexplored states")
- (actions from "unexplored states" only self-transition)
2. Solve for optimal policy in optimistic model (standard VI)
3. Take greedy action according to the computed policy
4. Update optimistic estimated model
- (if a state becomes "known" then use its true statistics)
5. Goto 2
Agent always acts greedily according to a model that assumes
all "unexplored" states are maximally rewarding
#### Implementation for optimistic model
- Keep track of number of times a state-action pair is tried
- If $N(s, a) < N_e$ then $T(s,a,s)=1$ and $R(s) = R_{\max}$ in the optimistic model,
- Otherwise, $T(s,a,s')$ and $R(s)$ are based on estimates obtained from the $N_e$ experiences (the estimate of the true model)
- $N_e$ can be determined by using Chernoff Bound
- An optimal policy for this optimistic model will try to reach unexplored states (those with unexplored actions) since it can stay at those states and accumulate maximum reward
- Rmax never explicitly explores. It is always greedy, but with respect to an optimistic outlook.
```pseudocode
Algorithm: Rmax (for infinite-horizon RL problems)
Initialize $\hat{p}$, $\hat{r}$, and $N(s,a)$
For $t = 1, 2, ...$
  1. Build the optimistic model and compute $(Q(s,a))_{s,a}$ from $\hat{p}$, $\hat{r}$, and $N(s,a)$
  2. Select action $a(t)$ maximizing $Q(s(t),a)$ over $A_{s(t)}$
  3. Observe the transition to $s(t+1)$ and collect the reward $r(s(t),a(t))$
  4. Update $\hat{p}$, $\hat{r}$, and $N(s,a)$
```
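A minimal sketch of how the optimistic model in step 1 could be built from counts. The data structures (`counts`, `trans_counts`, `reward_sums`) and the per-state-action reward are illustrative assumptions, not the lecture's exact formulation.

```python
def optimistic_model(states, actions, counts, trans_counts, reward_sums, n_e, r_max):
    """Return optimistic T(s,a,s') and R(s,a) estimates in the spirit of Rmax.

    counts[(s, a)]           : number of times (s, a) was tried
    trans_counts[(s, a, s2)] : number of observed transitions s --a--> s2
    reward_sums[(s, a)]      : total reward collected when taking a in s
    """
    T, R = {}, {}
    for s in states:
        for a in actions:
            n = counts.get((s, a), 0)
            if n < n_e:
                # "Unknown" pair: maximal reward and a forced self-transition.
                R[(s, a)] = r_max
                for s2 in states:
                    T[(s, a, s2)] = 1.0 if s2 == s else 0.0
            else:
                # "Known" pair: use the empirical estimates.
                R[(s, a)] = reward_sums[(s, a)] / n
                for s2 in states:
                    T[(s, a, s2)] = trans_counts.get((s, a, s2), 0) / n
    return T, R
```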
#### Efficiency of Rmax
If the model is completely learned (i.e., $N(s, a) \geq N_e$ for all $s, a$), then Rmax will be near optimal.
Results show that this will happen "quickly" in terms of the number of steps.
General proof strategy: **PAC guarantee (roughly speaking):** there is a value $N_e$ such that, with high probability, the Rmax algorithm will select at most a polynomial number of actions whose value is more than $\epsilon$ below optimal.
RL can be solved in time polynomial in the number of actions, the number of states, and $1/(1-\gamma)$.
### TD-based Active RL
1. Start with initial value function
2. Take action from an **exploration/exploitation** policy, giving new state $s'$ (should converge to optimal policy)
3. **Update** the estimated model (needed to compute the exploration/exploitation policy)
4. Perform TD update
$$
V(s) \gets V(s) + \alpha (R(s) + \gamma V(s') - V(s))
$$
$V(s)$ is new estimate of optimal value function at state $s$.
5. Goto 2
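A minimal sketch of the TD(0) update in step 4, assuming the value function is stored in a dict; the names and the learning rate are illustrative placeholders.

```python
def td_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])
    return V
```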
Given the usual assumptions about learning rate and GLIE, TD will converge to an optimal value function!
- The exploration/exploitation policy requires computing $\arg\max_a Q(s, a)$ for the exploitation part of the policy
- Computing $\arg\max_a Q(s, a)$ requires $T$ in addition to $V$
- Thus TD-learning must still maintain an estimated model for action selection
- It is computationally more efficient at each step compared to Rmax (i.e., optimistic exploration)
- TD-update vs. Value Iteration
- But model requires much more memory than value function
- Can we get a model-free variant?
### Q-learning
Instead of learning the optimal value function $V$, directly learn the optimal $Q$ function.
Recall $Q(s, a)$ is the expected value of taking action $a$ in state $s$ and then
following the optimal policy thereafter
Given the $Q$ function we can act optimally by selecting action greedily according to $Q(s, a)$ without a model
The optimal $Q$-function satisfies $V(s) = \max_{a'\in A} Q(s, a')$ which gives:
$$
\begin{aligned}
Q(s,a) &= R(s) + \gamma \sum_{s'\in S} T(s,a,s') V(s')\\
&= R(s) + \gamma \sum_{s'\in S} T(s,a,s') \max_{a'\in A} Q(s',a')\\
\end{aligned}
$$
How can we learn the $Q$-function directly?
#### Q-learning implementation
Model-free reinforcement learning
1. Start with initial Q-values (e.g. all zeros)
2. Take action from an **exploration/exploitation** policy, giving new state $s'$ (should converge to optimal policy)
3. Perform TD update
$$
Q(s,a) \gets Q(s,a) + \alpha (R(s) + \gamma \max_{a'\in A} Q(s',a') - Q(s,a))
$$
$Q(s,a)$ is current estimate of optimal Q-value for state $s$ and action $a$.
4. Goto 2
- Does not require model since we learn the Q-value function directly
- Uses an explicit $|S|\times |A|$ table to store Q-values
- Off-policy learning: the update does not depend on the actual next action
- The exploration/exploitation policy directly uses $Q$-values
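A minimal tabular Q-learning sketch. The environment interface (`env.reset`, `env.step`) and the hyperparameters are assumptions for illustration, not part of the lecture.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to zero
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Exploration/exploitation policy: epsilon-greedy on the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)   # assumed environment interface
            # Off-policy TD target: max over next actions, not the action taken next
            target = r + (0.0 if done else gamma * max(Q[(s_next, x)] for x in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```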
#### Convergence of Q-learning
Q-learning converges to the optimal Q-values in the limit with probability 1 if:
- Every state-action pair is visited infinitely often
- The learning rate decays appropriately: $\sum_{t=1}^{\infty} \alpha(t) = \infty$ and $\sum_{t=1}^{\infty} \alpha(t)^2 < \infty$ (e.g., $\alpha(t) = 1/t$)
#### Speedup for Goal-Based Problems
- **Goal-Based Problem**: receive big reward in goal state and then transition to terminal state
- Initialize $Q(s, a)$ to zero for all $s \in S$ and $a \in A$, then observe the following sequence of (state, reward, action) triples
- $(s_0, 0, a_0),\ (s_1, 0, a_1),\ (s_2, 10, a_2),\ (\text{terminal}, 0)$
- The sequence of Q-value updates would result in: $Q(s_0, a_0) = 0$, $Q(s_1, a_1) = 0$, $Q(s_2, a_2) = 10$
- So nothing was learned at $s_0$ and $s_1$
- The next time this trajectory is observed we will get a non-zero value for $Q(s_1, a_1)$, but still $Q(s_0, a_0) = 0$
From the example we see that it can take many learning trials for the final reward to "back propagate" to early state-action pairs
- Two approaches for addressing this problem:
1. Trajectory replay: store each trajectory and do several iterations of Q-updates on each one
2. Reverse updates: store trajectory and do Q-updates in reverse order
- In our example (with learning rate and discount factor equal to 1 for ease of illustration), reverse updates would give
- $Q(s_2,a_2) = 10$, $Q(s_1,a_1) = 10$, $Q(s_0,a_0) = 10$
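A minimal sketch of the reverse-update idea on one stored trajectory, using learning rate and discount 1 as in the example above; the trajectory format and action set are illustrative assumptions.

```python
from collections import defaultdict

def reverse_q_updates(Q, trajectory, actions, alpha=1.0, gamma=1.0):
    """Apply Q-updates in reverse order over one stored trajectory.

    trajectory: list of (state, action, reward, next_state) tuples,
    where next_state == "terminal" marks the end (value 0).
    """
    for s, a, r, s_next in reversed(trajectory):
        if s_next == "terminal":
            target = r
        else:
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

# The example from the notes: one reverse pass propagates the reward all the way back.
Q = defaultdict(float)
traj = [("s0", "a0", 0, "s1"), ("s1", "a1", 0, "s2"), ("s2", "a2", 10, "terminal")]
reverse_q_updates(Q, traj, actions=["a0", "a1", "a2"])
print(Q[("s0", "a0")], Q[("s1", "a1")], Q[("s2", "a2")])  # all three are 10.0
```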
### Off-policy vs on-policy RL
Q-learning is **off-policy**: its update uses $\max_{a'} Q(s',a')$, regardless of which action the behavior policy actually takes next. SARSA (below) is **on-policy**: its update uses the Q-value of the action actually chosen by the $\epsilon$-greedy behavior policy.
### SARSA
1. Start with initial Q-values (e.g. all zeros)
2. Take action $a_n$ in state $s_n$ from an $\epsilon$-greedy policy, giving new state $s_{n+1}$
3. Choose action $a_{n+1}$ in state $s_{n+1}$ from an $\epsilon$-greedy policy
4. Perform TD update
$$
Q(s_n,a_n) \gets Q(s_n,a_n) + \alpha (R(s_n) + \gamma Q(s_{n+1},a_{n+1}) - Q(s_n,a_n))
$$
5. Goto 2
> [!NOTE]
>
> Compared with Q-learning, SARSA (on-policy) usually takes safer actions.
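A minimal SARSA sketch; as with the Q-learning sketch above, the environment interface and hyperparameters are assumptions.

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: the TD target uses the action actually chosen next."""
    Q = defaultdict(float)

    def policy(s):  # epsilon-greedy behavior policy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = policy(s)
        while not done:
            s_next, r, done = env.step(a)      # assumed environment interface
            a_next = policy(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```

The only difference from the Q-learning sketch is the TD target: $Q(s_{n+1}, a_{n+1})$ for the action actually taken, rather than $\max_{a'} Q(s_{n+1}, a')$.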


@@ -6,4 +6,7 @@ export default {
CSE510_L1: "CSE510 Deep Reinforcement Learning (Lecture 1)",
CSE510_L2: "CSE510 Deep Reinforcement Learning (Lecture 2)",
CSE510_L3: "CSE510 Deep Reinforcement Learning (Lecture 3)",
CSE510_L4: "CSE510 Deep Reinforcement Learning (Lecture 4)",
CSE510_L5: "CSE510 Deep Reinforcement Learning (Lecture 5)",
CSE510_L6: "CSE510 Deep Reinforcement Learning (Lecture 6)",
}


@@ -0,0 +1,246 @@
# CSE5313 Coding and information theory for data science (Lecture 6)
## Recap
### Vector spaces and subspaces over finite fields
$\mathbb{F}^n$ is a vector space over $\mathbb{F}$, with point-wise vector addition and scalar multiplication.
<details>
<summary>Example</summary>
$\mathbb{F}_2^4$ is a vector space over $\mathbb{F}_2$.
Let $v=\begin{pmatrix}
1 & 1 & 1 & 1
\end{pmatrix}$
Then $v$ is a vector in $\mathbb{F}_2^4$ that's "orthogonal" to itself.
$v\cdot v=1+1+1+1=4=0$ in $\mathbb{F}_2$.
Over a general finite field, a subspace and its dual space may intersect non-trivially.
</details>
Let $V$ be a subspace of $\mathbb{F}^n$.
$V$ is a subgroup of $\mathbb{F}^n$ under vector addition $(\mathbb{F}^n,+)$.
- Apply the theorem: If $H$ is finite, non-empty, and closed under the operation of $G$, then $H$ is a subgroup of $G$.
<details>
<summary>Proof</summary>
Suppose $H\subseteq G$ is finite, non-empty, and closed under the operation of $G$. We show $H\leq G$:
- Associativity: inherited from $G$.
- Unit element: take any $a\in H$. By closure, $a, a^2, a^3, \cdots$ are all in $H$. Since $H$ is finite, there exist $i>j$ such that $a^i=a^j$, and hence $a^{i-j}=e\in H$.
- Inverses: with $i>j$ as above, $a\cdot a^{i-j-1}=a^{i-j}=e$, so $a^{-1}=a^{i-j-1}\in H$ (when $i-j=1$, $a=e$ and $a^{-1}=e$).
</details>
> Is every subgroup of $\mathbb{F}^n$ a subspace?
<details>
<summary>Answer</summary>
No.
Consider $F_4=\{0,1,x,x+1\}$ (the field extension of $\mathbb{F}_2$ by an irreducible polynomial of degree 2, e.g. $p(x)=x^2+x+1$).
$F_4^2=\{(a,b):a,b\in F_4\}$; $\{(0,0),(1,1)\}$ is a subgroup of $(F_4^2,+)$.
But the span of $\{(1,1)\}$ over $F_4$ is $\{(0,0),(1,1),(x,x),(x+1,x+1)\}\neq \{(0,0),(1,1)\}$, so $\{(0,0),(1,1)\}$ is not closed under scalar multiplication and hence not a subspace of $F_4^2$.
</details>
Cosets of a subspace $V$ are called affine subspaces:
$$
V+a=\{v+a:v\in V\}\text{ for some }a\in \mathbb{F}^n
$$
## New content
### Linear codes
A linear code $\mathcal{C}$ is a subspace of $\mathbb{F}^n$ over $\mathbb{F}$.
- The dimension of $\mathcal{C}$ is denoted by $k$.
- The minimum Hamming distance of $\mathcal{C}$ is denoted by $d$.
- Notation $\mathcal{C}= [n,k,d]_{\mathbb{F}}$.
Two equivalent ways of constructing a linear code:
- A **generator matrix** $G\in \mathbb{F}^{k\times n}$ with $k$ rows and $n$ columns.
$$
\mathcal{C}=\{xG:x\in \mathbb{F}^k\}
$$
- The left image (row space) of $G$ is $\mathcal{C}$.
- The rows of $G$ are a basis for $\mathcal{C}$.
- A **parity check** matrix $H\in \mathbb{F}^{(n-k)\times n}$ with $(n-k)$ rows and $n$ columns.
$$
\mathcal{C}=\{c\in \mathbb{F}^n:Hc^T=0\}
$$
- The right kernel of $H$ is $\mathcal{C}$.
- Multiplying $c^T$ by $H$ "checks" if $c\in \mathcal{C}$.
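A small numeric illustration in Python with plain numpy and arithmetic mod 2, using the $[5,2,3]_2$ code that appears in the standard-array example later in these notes; the parity-check construction $H=(-A^T\mid I)$ (which equals $(A^T\mid I)$ over $\mathbb{F}_2$) is the standard one for $G=(I\mid A)$.

```python
import numpy as np

# Generator matrix of the [5,2,3]_2 example code used later in these notes.
G = np.array([[1, 0, 1, 1, 0],
              [0, 1, 0, 1, 1]])
k, n = G.shape

# All codewords xG for x in F_2^k (arithmetic mod 2).
messages = [(a, b) for a in (0, 1) for b in (0, 1)]
codewords = [tuple((np.array(x) @ G) % 2) for x in messages]
print(codewords)  # (0,0,0,0,0), (0,1,0,1,1), (1,0,1,1,0), (1,1,1,0,1)

# Since G = (I | A), a parity-check matrix is H = (A^T | I_{n-k}) over F_2.
A = G[:, k:]
H = np.concatenate([A.T, np.eye(n - k, dtype=int)], axis=1)
# Every codeword satisfies H c^T = 0, i.e., C is the right kernel of H.
assert all(((H @ np.array(c)) % 2 == 0).all() for c in codewords)
```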
### Encoding of linear codes
Reminder:
- Encoding is the process of mapping a message $u\in \mathbb{F}^k$ to a codeword $c\in \mathcal{C}\subseteq \mathbb{F}^n$.
$E: \mathbb{F}^k\to \mathcal{C}$ is a linear map.
Let $\mathcal{C}= [n,k,d]_{\mathbb{F}}$ be a linear code with generator matrix $G\in \mathbb{F}^{k\times n}$.
- Encoding is given by $E(x)=xG$.
- It is injective (1-1): suppose otherwise, i.e., there exist $x_1\neq x_2\in \mathbb{F}^k$ with $x_1G=x_2G$. Then $(x_1-x_2)G=0$, and since the rows of $G$ are linearly independent (they form a basis of $\mathcal{C}$), it follows that $x_1-x_2=0$, i.e., $x_1=x_2$, a contradiction. Therefore, $E$ is injective.
So linear codes imply linear encoding: $E(x)+E(y)=E(x+y)$.
### Systematic codes
Fact: every generator matrix $G\in \mathbb{F}^{k\times n}$ (which has rank $k$) can be brought to the form $G_{sys}=(I|A)$ by
- Row operations.
- Permutation of columns.
Fact: the codes $\{xG \mid x\in \mathbb{F}^k\}$ and $\{xG_{sys} \mid x\in \mathbb{F}^k\}$ are equivalent.
- Same length $n$.
- Same dimension $k$.
- Same minimum Hamming distance $d$.
Encoding a systematic code:
- The input is a part of the output.
- Efficient encoding
- Immediate decoding (if no errors).
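A minimal sketch of systematic encoding with $G_{sys}=(I\mid A)$, reusing the $A$ of the $[5,2,3]_2$ example above; the helper name is an illustrative choice.

```python
import numpy as np

def systematic_encode(x, A):
    """Encode message x with G_sys = (I | A): the codeword is (x, xA) over F_2."""
    x = np.asarray(x) % 2
    parity = (x @ A) % 2                  # redundancy part
    return np.concatenate([x, parity])    # the message is a prefix of the codeword

A = np.array([[1, 1, 0],
              [0, 1, 1]])
print(systematic_encode([1, 1], A))   # [1 1 1 0 1]: the message bits 11 appear first
```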
### Codes, cosets, encoding, decoding
A linear code $[n,k,d]_{\mathbb{F}}$ is a $k$-dimensional subspace of $\mathbb{F}^n$.
Size of the code is $|\mathbb{F}|^k$.
Encoding: $x\to xG$.
Decoding: $(y+e)\to x$, where $y=xG$ and $e$ is the error.
Use the **syndrome** to identify which coset $\mathcal{C}+e$ the noisy word $y+e$ belongs to.
$$
H(y+e)^T=Hy^T+He^T=He^T
$$
### Syndrome decoding
- Heavily depends on the linear structure of the code.
Linear code $\mathcal{C}= [n,k,d]_{\mathbb{F}}$ is a $k$-dimensional subspace of $(\mathbb{F}^n,+)$.
A **shift** (coset) of a linear code $[n,k,d]_{\mathbb{F}}$ is a $k$-dimensional affine subspace of $\mathbb{F}^n$.
All cosets have the same size.
If $w_H(e)\leq \lfloor \frac{d-1}{2}\rfloor$, then it is possible to extract $y$ from $y+e$.
By syndrome decoding, we can do better than exhaustive search.
Idea:
The received word $y+e$ belongs to the coset $\mathcal{C}+e$.
Moreover, $y_1+e$ and $y_2+e$ are in the same coset for any codewords $y_1,y_2\in\mathcal{C}$.
#### Standard Array
Let $\mathcal{C}= [n,k,d]_{\mathbb{F}}$ and denote $|F|=q$.
- Then $|\mathcal{C}|=q^k$.
- The number of cosets is $q^{n-k}$.
Then we arrange all $q^n$ elements of $\mathbb{F}^n$ into a $q^{n-k}\times q^k$ array.
- So that every row is a coset (including $\mathcal{C}$ itself)
- The lightest word in each coset is placed in the leftmost column
<details>
<summary>Example</summary>
Let $\mathbb{F}=\mathbb{F}_2$ and $\mathcal{C}=\{xG \mid x\in \mathbb{F}_2^2\}$ with
$$
G=\begin{pmatrix}
1 & 0 & 1 & 1 & 0\\
0 & 1 & 0 & 1 & 1
\end{pmatrix}
$$
So $\mathcal{C}=\{00000,10110,01011,11101\}$.
Then $\mathcal{C}$ is a $[5,2,3]_2$ code.
The first rows of the standard array (there are $2^{5-2}=8$ cosets in total):
the first row is $\mathcal{C}$ itself,
the second row is $\mathcal{C}+(00001)$,
the third row is $\mathcal{C}+(00010)$,
and the fourth row is $\mathcal{C}+(00100)$.
|00000|10110|01011|11101|
|---|---|---|---|
|00001|10111|01010|11100|
|00010|10100|01001|11110|
|00100|10010|01101|11000|
</details>
Any two elements in a row are of the form $y_1'=y_1+e$ and $y_2'=y_2+e$ for codewords $y_1,y_2\in \mathcal{C}$ and some $e\in \mathbb{F}^n$.
They have the same syndrome: $H(y_1+e)^T=He^T=H(y_2+e)^T$.
Entries in different rows have different syndromes.
<details>
<summary>Proof</summary>
Suppose $He_1^T=He_2^T$ for coset representatives $e_1,e_2$. Then $H(e_1-e_2)^T=0$, so $e_1-e_2\in \mathcal{C}$ and hence $\mathcal{C}+e_1=\mathcal{C}+e_2$, i.e., the two entries lie in the same row.
</details>
Choose the lightest word in each coset as the coset leader (placed in the leftmost column).
Time complexity: $O(n(n-k))$ per received word (one syndrome computation). Space complexity: $n|\mathbb{F}|^n$.
Compare with exhaustive search: time $O(|\mathbb{F}|^n)$.
#### Syndrome decoding
Given $y'\in \mathbb{F}^n$, we identify the coset $\mathcal{C}+e$ to which $y'$ belongs by computing the syndrome $H(y')^T$.
We identify $e$ as the coset leader (leftmost entry) of the row $\mathcal{C}+e$.
We output the closest codeword in $\mathcal{C}$ by subtracting $e$ from $y'$.
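A minimal sketch of syndrome-table decoding for the $[5,2,3]_2$ example above; building the coset-leader table by brute force over all $2^5$ words is for illustration only.

```python
import itertools
import numpy as np

G = np.array([[1, 0, 1, 1, 0],
              [0, 1, 0, 1, 1]])
A = G[:, 2:]
H = np.concatenate([A.T, np.eye(3, dtype=int)], axis=1)   # parity-check matrix

def syndrome(y):
    return tuple((H @ np.asarray(y)) % 2)

# Build the syndrome -> coset-leader table: keep the lightest word per syndrome.
leaders = {}
for word in itertools.product((0, 1), repeat=5):
    s = syndrome(word)
    if s not in leaders or sum(word) < sum(leaders[s]):
        leaders[s] = word

def decode(y):
    """Return the closest codeword to y: subtract the coset leader of y's coset."""
    e = np.array(leaders[syndrome(y)])
    return (np.asarray(y) - e) % 2

print(decode([1, 0, 1, 0, 0]))  # 10110 with one bit flipped decodes back to 10110
```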


@@ -7,4 +7,6 @@ export default {
CSE5313_L2: "CSE5313 Coding and information theory for data science (Lecture 2)",
CSE5313_L3: "CSE5313 Coding and information theory for data science (Lecture 3)",
CSE5313_L4: "CSE5313 Coding and information theory for data science (Lecture 4)",
CSE5313_L5: "CSE5313 Coding and information theory for data science (Lecture 5)",
CSE5313_L6: "CSE5313 Coding and information theory for data science (Lecture 6)",
}


@@ -1,2 +1,42 @@
# CSE5519 Advances in Computer Vision (Topic A: 2021 and before: Semantic Segmentation)
## SETR
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
[link to the paper](https://arxiv.org/pdf/2012.15840)
Treating semantic segmentation as a sequence-to-sequence prediction task.
### Novelty in SETR
#### FCN-based semantic segmentation
An FCN encoder consists of a stack of sequentially connected convolutional layers. Having limited receptive fields for context modeling is thus an intrinsic limitation of the vanilla FCN architecture.
#### Segmentation transformers (SETR)
Image to sequence.
By further mapping each vectorized patch $p$ into a latent $C$-dimensional embedding space using a linear projection function $f: p \rightarrow e \in \mathbb{R}^C$, we obtain a 1D sequence of patch embeddings for an image $x$. To encode the patch spatial information, we learn a specific embedding $p_i$ for every location $i$ which is added to $e_i$ to form the final sequence input $E = \{e_1 + p_1, e_2 + p_2, \cdots, e_L + p_L\}$. This way, spatial information is kept despite the orderless self-attention nature of transformers.
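A rough numpy sketch of the patch-embedding step described above; the patch size, embedding dimension, and the random projection/position matrices are illustrative assumptions (in the paper these are learned), not SETR's exact configuration.

```python
import numpy as np

def patchify_and_embed(image, patch=16, dim_c=256, rng=np.random.default_rng(0)):
    """Split an HxWx3 image into flattened patches, project each to R^C,
    and add a position embedding per patch (random placeholders for the learned ones)."""
    h, w, ch = image.shape
    patches = (image.reshape(h // patch, patch, w // patch, patch, ch)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * ch))       # L x (patch*patch*3)
    W_proj = rng.standard_normal((patches.shape[1], dim_c))  # linear projection f
    pos = rng.standard_normal((patches.shape[0], dim_c))     # position embeddings p_i
    return patches @ W_proj + pos                            # sequence E, shape L x C

E = patchify_and_embed(np.zeros((64, 64, 3)))
print(E.shape)  # (16, 256): L = (64/16)*(64/16) patches, each a C-dimensional embedding
```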
#### Decoder for segmentation
Three decoder choices:
- Naive upsampling: upsample the sequence to the original image size, then use a 1x1 convolution to get the final segmentation map.
- Progressive upsampling (PUP): progressively upsample to get the final segmentation map.
- Multi-level feature aggregation (MLA): aggregate multi-level features to get the final segmentation map.
> [!TIP]
>
> This paper shows a remarkable success of transformers in semantic segmentation. The authors split large images into small patches, use a linear projection to obtain patch embeddings, and then use a transformer encoder (followed by a decoder) to produce the final segmentation map.
>
> I'm really interested in the linear projection function $f$. How does it work to preserve the spatial information across the patches? What would happen if we had square frames overlapping the image? How does the transformer encoder solve the occlusion problem, or is that out of scope of the paper?