This commit is contained in:
Zheyuan Wu
2025-10-15 00:28:41 -05:00
5 changed files with 532 additions and 6 deletions

View File

@@ -1,8 +1,12 @@
# CSE510 Deep Reinforcement Learning (Lecture 13)
> [!CAUTION]
> Recap from last lecture
>
> This lecture crams a lot into 90 minutes and I had no background in optimization or Lagrangian duality, so these notes may contain numerous mistakes; it is a good idea to restart learning from this checkpoint.
> For any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$,
>
> The policy gradient is
>
> $\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]$
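A minimal numerical sketch of this estimator for a tabular softmax policy is given below; the uniform state distribution, the random `Q` table, and all variable names are illustrative assumptions rather than anything from the lecture.

```python
# Monte Carlo estimate of E_pi[ grad log pi_theta(a|s) * Q(s,a) ] for a softmax policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))   # one logit per (s, a)
Q = rng.normal(size=(n_states, n_actions))       # stand-in for Q^{pi_theta}(s, a)

def pi(s):                                       # softmax policy pi_theta(.|s)
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):                           # grad_theta log pi_theta(a|s)
    g = np.zeros_like(theta)
    g[s] = -pi(s)                                # d log softmax / d theta[s, a'] = 1{a'=a} - pi(a'|s)
    g[s, a] += 1.0
    return g

grad_est = np.zeros_like(theta)
n_samples = 5000
for _ in range(n_samples):                       # states drawn uniformly for simplicity
    s = rng.integers(n_states)
    a = rng.choice(n_actions, p=pi(s))
    grad_est += grad_log_pi(s, a) * Q[s, a]
grad_est /= n_samples
print(grad_est)                                  # approximates the policy gradient up to the state weighting
```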
## Problem for policy gradient method
@@ -20,7 +24,7 @@ Unstable update step size is very important
- Bad samples -> worse policy (compared to supervised learning, where correct labels and data in the following batches may correct it)
- If step size is too small: the learning process is slow
## Deriving the optimization objection function of TRPO
## Deriving the optimization objective function of Trust Region Policy Optimization (TRPO)
### Objective of Policy Gradient Methods
@@ -30,8 +34,7 @@ $$
J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]
$$
here $\tau$ is the trajectory.
here $\tau$ is the trajectory for the policy $\pi_\theta$.
The policy objective can be written in terms of the old one:
@@ -97,6 +100,12 @@ $$
### Lower bound of Optimization
> [!NOTE]
>
> KL (Kullback-Leibler) divergence is a measure of the difference between two probability distributions.
>
> $D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))=\int_a \pi(a|s)\log \frac{\pi(a|s)}{\pi'(a|s)}\,da$
$$
J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))
$$
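For the finite-action case the KL integral above becomes a sum; a small sketch (all values illustrative):

```python
# KL divergence between two discrete action distributions pi(.|s) and pi'(.|s).
import numpy as np

def kl(p, p_prime, eps=1e-12):
    p = np.asarray(p, dtype=float)
    p_prime = np.asarray(p_prime, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (p_prime + eps))))  # eps avoids log(0)

print(kl([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))  # small positive number; 0 iff the two distributions match
```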

View File

@@ -0,0 +1,182 @@
# CSE510 Deep Reinforcement Learning (Lecture 14)
## Advanced Policy Gradient Methods
### Trust Region Policy Optimization (TRPO)
"Recall" from last lecture
$$
\max_{\pi'} \mathbb{E}_{s\sim d^{\pi},a\sim \pi} \left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]
$$
such that
$$
\mathbb{E}_{s\sim d^{\pi}} D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))\leq \delta
$$
Unconstrained penalized objective:
$$
d^*=\arg\max_{d} J(\theta+d)-\lambda(D_{KL}\left[\pi_\theta||\pi_{\theta+d}\right]-\delta)
$$
$\theta_{new}=\theta_{old}+d$
First order Taylor expansion for the loss and second order for the KL:
$$
\approx \arg\max_{d} J(\theta_{old})+\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d-\frac{1}{2}\lambda(d^T\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}}d)+\lambda \delta
$$
If you are really interested, try to fill in the details of the "Solving the KL-Constrained Problem" section.
#### Natural Gradient Descent
Setting the gradient to zero:
$$
\begin{aligned}
0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d+\frac{1}{2}\lambda d^T F(\theta_{old})d\right)\\
&=-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}+\frac{1}{2}\lambda F(\theta_{old})d
\end{aligned}
$$
$$
d=\frac{2}{\lambda} F^{-1}(\theta_{old})\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}
$$
The natural gradient is
$$
\tilde{\nabla}J(\theta)=F^{-1}(\theta_{old})\nabla_\theta J(\theta)
$$
$$
\theta_{new}=\theta_{old}+\alpha F^{-1}(\theta_{old})\hat{g}
$$
$$
D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^T F(\theta_{old})(\theta-\theta_{old})
$$
$$
\frac{1}{2}(\alpha g_N)^T F(\alpha g_N)=\delta
$$
$$
\alpha=\sqrt{\frac{2\delta}{g_N^T F g_N}}
$$
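A minimal numpy sketch of one natural gradient step, assuming we already have a gradient estimate `g` and sampled score vectors from which to estimate the Fisher matrix; the shapes, the damping term, and all names are illustrative assumptions.

```python
# Natural gradient step: solve F d = g, then rescale so the quadratic KL approximation equals delta.
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples, delta = 5, 2000, 0.01

g = rng.normal(size=dim)                          # stand-in for a Monte Carlo estimate of grad J
scores = rng.normal(size=(n_samples, dim))        # stand-in for grad_theta log pi_theta(a|s) samples

F = scores.T @ scores / n_samples                 # Fisher matrix estimate E[score score^T]
F += 1e-3 * np.eye(dim)                           # damping for numerical stability

g_nat = np.linalg.solve(F, g)                     # natural gradient F^{-1} g (no explicit inverse)
alpha = np.sqrt(2 * delta / (g_nat @ F @ g_nat))  # step size from (1/2)(alpha g_N)^T F (alpha g_N) = delta
theta_step = alpha * g_nat
print(alpha, theta_step)
```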
However, due to the quadratic approximation, the KL constraint may be violated.
#### Line search
We do a line search for the best step size, making sure that we are
- Improving the objective
- Satisfying the KL constraint
TRPO = NPG + line search + monotonic improvement theorem
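A sketch of the backtracking line search on top of the natural gradient step; `surrogate` and `kl_to_old` are placeholder callables (not a specific library API), and the constants are illustrative.

```python
# Backtracking line search: shrink the step until the surrogate improves and the KL constraint holds.
def line_search(theta_old, full_step, surrogate, kl_to_old,
                delta=0.01, backtrack=0.8, max_iters=10):
    L_old = surrogate(theta_old)
    for i in range(max_iters):
        theta_new = theta_old + (backtrack ** i) * full_step
        if surrogate(theta_new) > L_old and kl_to_old(theta_new) <= delta:
            return theta_new                      # accept: objective improved and KL constraint satisfied
    return theta_old                              # no acceptable step found: keep the old policy
```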
#### Summary of TRPO
Pros
- Proper learning step size
- [Monotonic improvement guarantee](./CSE510_L13.md#Monotonic-Improvement-Theorem)
Cons
- Poor scalability
- Second-order optimization: computing Fisher Information Matrix and its inverse every time for the current policy model is expensive
- Not quite sample efficient
- Requiring a large batch of rollouts to approximate accurately
### Proximal Policy Optimization (PPO)
> Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. -- OpenAI
[link to paper](https://arxiv.org/pdf/1707.06347)
Idea:
- The constraint helps in the training process. However, maybe the constraint is not a strict constraint:
- Does it matter if we only break the constraint just a few times?
What if we treat it as a “soft” constraint and add a proximal term to the objective function?
#### PPO with Adaptive KL Penalty
$$
\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
$$
Use an adaptive $\beta$ value (a sketch of the update rule follows the list below).
$$
L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
$$
Compute $d=\hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]$
- If $d<d_{target}/1.5$, $\beta\gets \beta/2$
- If $d>d_{target}\times 1.5$, $\beta\gets \beta\times 2$
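A sketch of this update rule (the factors 1.5 and 2 follow the PPO paper; `d_target` and the initial $\beta$ are illustrative choices):

```python
# Adaptive KL penalty coefficient: loosen the penalty if KL is small, tighten it if KL is large.
def update_beta(beta, d, d_target=0.01):
    if d < d_target / 1.5:        # policies barely moved: penalize less next iteration
        beta = beta / 2
    elif d > d_target * 1.5:      # policies moved too far: penalize more next iteration
        beta = beta * 2
    return beta
```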
#### PPO with Clipped Objective
$$
\max_{\theta} \hat{\mathbb{E}_t}\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]
$$
$$
r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
$$
- Here, $r_t(\theta)$ measures how much the new policy changes the probability of taking action $a_t$ in state $s_t$:
- If $r_t > 1$ :the action becomes more likely under the new policy.
- If $r_t < 1$ :the action becomes less likely.
- We'd like $r_tA_t$ to increase if $A_t > 0$ (good actions become more probable) and decrease if $A_t < 0$.
- But if $r_t$ changes too much, the update becomes **unstable**, just like in vanilla PG.
We limit $r_t(\theta)$ to be in a range:
$$
L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]
$$
> Trust Region Policy Optimization (TRPO): Don't move further than $\delta$ in KL.
> Proximal Policy Optimization (PPO): Don't let $r_t(\theta)$ drift further than $\epsilon$.
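A numpy sketch of the clipped surrogate, assuming per-sample arrays of new/old action log-probabilities and advantage estimates (names illustrative):

```python
# Clipped surrogate L^CLIP: take the minimum of the unclipped and clipped ratio terms.
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    r = np.exp(logp_new - logp_old)                    # probability ratio r_t(theta)
    unclipped = r * adv
    clipped = np.clip(r, 1 - eps, 1 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))     # maximize this (or minimize its negative)
```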
#### PPO in Practice
$$
L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}_t}\left[L_t^{CLIP}(\theta)+c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)\right]
$$
Here $L_t^{CLIP}(\theta)$ is the surrogate objective function.
$L_t^{VF}(\theta)$ is a squared-error loss for "critic" $(V_\theta(s_t)-V_t^{target})^2$.
$S[\pi_\theta](s_t)$ is the entropy bonus to ensure sufficient exploration. Encourage diversity of actions.
$c_1$ and $c_2$ are trade-off parameters; in the paper, $c_1=1$ and $c_2=0.01$.
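A sketch of the combined objective with the paper's $c_1=1$ and $c_2=0.01$; the arrays (value predictions, targets, per-sample entropies) are illustrative placeholders.

```python
# Combined PPO objective: clipped surrogate minus value loss plus entropy bonus.
import numpy as np

def ppo_total_objective(logp_new, logp_old, adv, v_pred, v_target, entropy,
                        c1=1.0, c2=0.01, eps=0.2):
    r = np.exp(logp_new - logp_old)
    l_clip = np.mean(np.minimum(r * adv, np.clip(r, 1 - eps, 1 + eps) * adv))
    l_vf = np.mean((v_pred - v_target) ** 2)           # critic squared-error loss
    s = np.mean(entropy)                               # entropy bonus
    return l_clip - c1 * l_vf + c2 * s                 # maximize this quantity
```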
### Summary for Policy Gradient Methods
Trust region policy optimization (TRPO)
- Optimization problem formulation
- Natural gradient ascent + monotonic improvement + line search
- But require second-order optimization
Proximal policy optimization (PPO)
- Clipped objective
- Simple yet effective
Take-away:
- Proper step size is critical for policy gradient methods
- Sample efficiency can be improved by using importance sampling

View File

@@ -15,5 +15,6 @@ export default {
CSE510_L10: "CSE510 Deep Reinforcement Learning (Lecture 10)",
CSE510_L11: "CSE510 Deep Reinforcement Learning (Lecture 11)",
CSE510_L12: "CSE510 Deep Reinforcement Learning (Lecture 12)",
CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)"
CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)",
CSE510_L14: "CSE510 Deep Reinforcement Learning (Lecture 14)",
}

View File

@@ -0,0 +1,333 @@
# CSE5313 Coding and information theory for data science (Lecture 13)
## Recap from last lecture
Local recoverable codes:
An $[n, k]_q$ code is called $r$-locally recoverable if
- every codeword symbol $y_j$ has a recovering set $R_j \subseteq [n] \setminus \{j\}$ ($[n]=\{1,2,\ldots,n\}$),
- such that $y_j$ is computable from $\{y_i : i \in R_j\}$, and
- $|R_j| \leq r$ for every $j \in [n]$.
### Bounds for locally recoverable codes
**Bound 1**: $\frac{k}{n}\leq \frac{r}{r+1}$.
**Bound 2**: $d\leq n-k-\lceil\frac{k}{r}\rceil +2$. ($r=k$ gives the Singleton bound.)
#### Turan's Lemma
Let $G$ be a directed graph with $n$ vertices. Then there exists an induced directed acyclic subgraph (DAG) of $G$ on at least $\frac{n}{1+\operatorname{avg}_i(d^{out}_i)}$ nodes, where $d^{out}_i$ is the out-degree of vertex $i$.
Proof using probabilistic method.
<details>
<summary>Proof via the probabilistic method</summary>
> Useful for showing the existence of a large acyclic subgraph, but not for finding it.
> [!TIP]
>
> Show that $\mathbb{E}[X]$ is at least the desired bound; therefore there exists a $\pi$ with $|U_\pi|$ at least that bound, by an averaging (pigeonhole) argument.
For a permutation $\pi$ of $[n]$, define $U_\pi = \{i\in [n]: \pi(j)>\pi(i) \text{ for every out-neighbor } j \text{ of } i\}$.
That is, $i\in U_\pi$ if each of the $d_i^{out}$ outgoing edges from $i$ connects to a node $j$ with $\pi(j)>\pi(i)$.
In other words, we select the subset of nodes $U_\pi$ whose outgoing edges all point to nodes with a larger $\pi$-value; within $U_\pi$, all edges go "to the right".
The induced subgraph on $U_\pi$ is therefore clearly acyclic.
Choose $\pi$ uniformly at random and let $X=|U_\pi|$ be a random variable.
Let $X_i$ be the indicator random variable for $i\in U_\pi$.
So $X=\sum_{i=1}^{n} X_i$.
Using linearity of expectation, we have
$$
E[X]=\sum_{i=1}^{n} E[X_i]
$$
$E[X_i]$ is the probability that $\pi$ places $i$ before any of its out-neighbors.
Among the $(d_i^{out}+1)!$ equally likely relative orderings of node $i$ and its out-neighbors, exactly $d_i^{out}!$ place $i$ first.
So, $E[X_i]=\frac{d_i^{out}!}{(d_i^{out}+1)!}=\frac{1}{d_i^{out}+1}$.
Since $X=\sum_{i=1}^{n} X_i$, we have
> [!TIP]
>
> Recall Arithmetic mean ($\frac{1}{n}\sum_{i=1}^{n} x_i$) is greater than or equal to harmonic mean ($\frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$).
$$
\begin{aligned}
E[X]&=\sum_{i=1}^{n} E[X_i]\\
&=\sum_{i=1}^{n} \frac{1}{d_i^{out}+1}\\
&=n\cdot\frac{1}{n}\sum_{i=1}^{n} \frac{1}{d_i^{out}+1}\\
&\geq \frac{n}{\frac{1}{n}\sum_{i=1}^{n} (d_i^{out}+1)}\\
&=\frac{n}{1+\operatorname{avg}_i(d_i^{out})}\\
\end{aligned}
$$
</details>
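A quick numerical check of the lemma on a random directed graph (all parameters are arbitrary illustrative choices); sampling a few hundred random permutations should find a $U_\pi$ at least as large as the bound.

```python
# Empirically compare max |U_pi| over random permutations with n / (1 + avg out-degree).
import random

random.seed(0)
n, r = 30, 3
# random out-neighborhoods of size r (playing the role of recovery sets with |R_i| <= r)
out = {i: random.sample([j for j in range(n) if j != i], r) for i in range(n)}
avg_out = sum(len(v) for v in out.values()) / n

best = 0
for _ in range(200):
    pi = list(range(n))
    random.shuffle(pi)                    # pi[i] = position of node i in the permutation
    # i is kept iff every out-neighbor comes later, so the kept set induces a DAG
    U = [i for i in range(n) if all(pi[j] > pi[i] for j in out[i])]
    best = max(best, len(U))

print("largest |U_pi| found:", best, "  bound n/(1+avg):", n / (1 + avg_out))
```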
## Locally recoverable codes
### Bound 1
$$
\frac{k}{n}\leq \frac{r}{r+1}
$$
<details>
<summary>Proof via Turan's Lemma</summary>
Let $G$ be a directed graph such that $i\to j$ is an edge if $j\in R_i$. ($j$ required to repair $i$)
$d_i^{out}\leq r$ for every $i$, so $\operatorname{avg}_i(d_i^{out})\leq r$.
By Turan's Lemma, there exists an induced directed acyclic subgraph (DAG) $G_U$ of $G$ on a vertex set $U$ with $|U|\geq \frac{n}{1+\operatorname{avg}_i(d_i^{out})}\geq \frac{n}{1+r}$.
Since $G_U$ is acyclic, there exists a vertex $u_1\in U$ with no outgoing edges **inside** $G_U$ (a sink of $G_U$).
Note that if we remove $u_1$ from $G_U$, the remaining graph is still a DAG.
Let $G_{U_1}$ be the induced graph on $U_1=U\setminus \{u_1\}$.
We can find a vertex $u_2\in U_1$ with no outgoing edges **inside** $G_{U_1}$.
We can continue this process until all the vertices in $U$ have been removed.
Working back through this order, every symbol in $U$ is determined by the symbols outside $U$, so all symbols in $U$ are redundant.
Hence the number of redundant symbols is $n-k\geq |U|\geq \frac{n}{1+r}$.
So $k\leq \frac{rn}{r+1}$.
So $\frac{k}{n}\leq \frac{r}{r+1}$.
</details>
### Bound 2
$$
d\leq n-k-\lceil\frac{k}{r}\rceil +2
$$
#### Definition for code reduction
For $C=[n,k]_q$ code, let $I\subseteq [n]$ be a set of indices, e.g., $I=\{1,3,5,6,7,8\}$.
Let $C_I$ be the code obtained by deleting the symbols whose indices are not in $I$ (only keep the symbols with indices in $I$).
Then $C_I$ is a code of length $|I|$, generated by $G_I\in \mathbb{F}^{k\times |I|}_q$ (only keep the columns of $G$ with indices in $I$).
$|C_I|=q^{\operatorname{rank}(G_I)}$.
Note that $G_I$ is not necessarily of rank $k$.
#### Reduction lemma
Let $\mathcal{C}$ be an $[n, k]_q$ code. Then $d=n-\max\{|I|:|C_I|<q^k,I\subseteq [n]\}$.
<details>
<summary>Proof for reduction lemma</summary>
We proceed from two inequalities.
First, $d\leq n-\max\{|I|:|C_I|<q^k,I\subseteq [n]\}$.
Let $I\subseteq [n]$ with $|C_I|< q^k$.
Then there exist distinct $m_1,m_2\in \mathbb{F}_q^k$ such that $m_1G_I=m_2G_I$.
Let $y_1=m_1G$ and $y_2=m_2G$ be the corresponding (distinct) codewords.
Since $m_1G_I=m_2G_I$, $y_1$ and $y_2$ agree on at least $|I|$ entries.
So $d_H(y_1,y_2)\leq n-|I|$.
So $d\leq n-|I|$ for every $I$ such that $|C_I|< q^k$.
So $d\leq n-\max\{|I|:|C_I|<q^k,I\subseteq [n]\}$.
---
Now we show the other inequality.
$d\geq n-\max\{|I|:|C_I|<q^k,I\subseteq [n]\}$.
Let $y_1,y_2\in \mathcal{C}$ be distinct codewords such that $d_H(y_1,y_2)=d$.
Let $J$ be the set of coordinates on which $y_1$ and $y_2$ are identical.
So $|J|=n-d$.
$|C_J|<q^k$, since the distinct codewords $y_1\neq y_2$ have the same projection onto $J$.
So $\max\{|I|:|C_I|<q^k,I\subseteq [n]\}\geq n-d$ (by the existence of $J$).
So $d\geq n-\max\{|I|:|C_I|<q^k,I\subseteq [n]\}$.
</details>
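A brute-force sanity check of the lemma on a tiny code; the $[4,2]_2$ generator matrix below is an illustrative example, not from the lecture.

```python
# Verify d = n - max{|I| : |C_I| < q^k} for a small binary linear code.
from itertools import combinations, product

import numpy as np

q = 2
G = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0]])             # generator matrix of a [4, 2]_2 code
k, n = G.shape

codewords = [tuple(np.mod(np.array(m) @ G, q)) for m in product(range(q), repeat=k)]

# minimum Hamming distance computed directly
d_direct = min(sum(a != b for a, b in zip(c1, c2))
               for c1, c2 in combinations(codewords, 2))

# largest |I| whose projection C_I loses codewords, i.e. |C_I| < q^k
best = 0
for size in range(n + 1):
    for I in combinations(range(n), size):
        C_I = {tuple(c[i] for i in I) for c in codewords}
        if len(C_I) < q ** k:
            best = max(best, size)

assert d_direct == n - best
print("d =", d_direct, "= n - max{|I| : |C_I| < q^k} =", n - best)
```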
We prove Bound 2 using the same subgraph as in the proof of Bound 1.
<details>
<summary>Proof for bound 2</summary>
Let $G$ be a directed graph such that $i\to j$ is an edge if $j\in R_i$. ($j$ required to repair $i$)
$d_i^{out}\leq r$ for every $i$, so $\operatorname{avg}_i(d_i^{out})\leq r$.
By Turan's Lemma, there exists an induced directed acyclic subgraph (DAG) $G_U$ on $G$ with $U$ nodes such that $|U|\geq \frac{n}{1+\operatorname{avg}_i(d_i^{out})}\geq \frac{n}{1+r}$.
Let $U'\subseteq U$ be a subset of size $\lfloor\frac{k-1}{r}\rfloor$.
$U'$ exists since $|U|\geq\frac{n}{r+1}\geq \frac{k}{r}$ by Bound 1, and $\frac{k}{r}>\lfloor\frac{k-1}{r}\rfloor$.
So $G_{U'}$ is also acyclic.
Let $N\subseteq [n]\setminus U'$ be the set of out-neighbors of $U'$ in $G$, i.e., $N=\bigcup_{i\in U'}R_i\setminus U'$.
$|N|\leq r|U'|\leq k-1$.
Complete $N$ to a set of size exactly $k-1$ by adding elements not in $U'$.
$|C_N|\leq q^{|N|}=q^{k-1}$.
Also, $|N\cup U'|=k-1+\lfloor\frac{k-1}{r}\rfloor$.
Note that all symbols indexed by $U'$ can be recovered from the symbols indexed by $N$ (peel the sinks of $G_{U'}$ as in the proof of Bound 1).
So $|C_{N\cup U'}|=|C_N|\leq q^{k-1}<q^k$.
Therefore, $\max\{|I|:|C_I|<q^k,I\subseteq [n]\}\geq |N\cup U'|=k-1+\lfloor\frac{k-1}{r}\rfloor$.
Using the reduction lemma, we have $d= n-\max\{|I|:|C_I|<q^k,I\subseteq [n]\}\leq n-\left(k-1+\lfloor\tfrac{k-1}{r}\rfloor\right)=n-k+1-\lfloor\tfrac{k-1}{r}\rfloor=n-k-\lceil\tfrac{k}{r}\rceil +2$.
</details>
### Construction of locally recoverable codes
Recall the Reed-Solomon code:
$\mathcal{C}=\{(f(\alpha_1),f(\alpha_2),\ldots,f(\alpha_n))\mid f\in \mathbb{F}_q[x],\ \deg f\leq k-1\}$, where $\alpha_1,\ldots,\alpha_n\in\mathbb{F}_q$ are distinct evaluation points.
- $\dim=k$
- $d=n-k+1$
- Need $n\leq q$.
- To encode $a=(a_0,a_1,\ldots,a_{k-1})\in\mathbb{F}_q^k$, evaluate $f_a(x)=\sum_{i=0}^{k-1}a_i x^i$ at the points $\alpha_1,\ldots,\alpha_n$.
Assume $(r+1)\mid n$ and $r\mid k$.
Choose a set of $n$ distinct evaluation points $A=\{\alpha_1,\alpha_2,\ldots,\alpha_n\}\subseteq\mathbb{F}_q$.
Partition $A$ into $\frac{n}{r+1}$ subsets $A_1,\ldots,A_{n/(r+1)}$, each of size $r+1$.
#### Definition for good polynomial
A polynomial $g$ over $\mathbb{F}_q$ is called good if
- $\deg(g)=r+1$
- $g$ is constant on every $A_i$, i.e., $\forall x,y\in A_i,\ g(x)=g(y)$.
To encode $a=(a_0,a_1,\ldots,a_{k-1})\in\mathbb{F}_q^k$, we view $a$ as a matrix $a\in \mathbb{F}_q^{r\times (k/r)}$ with entries $a_{i,j}$.
Let $f_a(x)=\sum_{i=0}^{r-1} f_{a,i}(x)\cdot x^i$
$f_{a,i}(x)=\sum_{j=0}^{k/r-1} a_{i,j}\cdot g(x)^j$, where $g$ is a good polynomial.
<details>
<summary>Proof that distinct messages give distinct codewords</summary>
If $a\neq b$, then $(f_a(\alpha_i))_{i=1}^{n}\neq (f_b(\alpha_i))_{i=1}^{n}$.
- $\deg f_a \leq k+\frac{k}{r}-2$
- Maximum-degree monomial in $f_a$:
$a_{r-1,k/r-1}\cdot g(x)^{k/r-1}\cdot x^{r-1}$.
- $\deg f_a \leq (r+1)\left(\frac{k}{r}-1\right)+(r-1)=k+\frac{k}{r}-2$.
By Bound 1, $n\geq k+\frac{k}{r}$, so $\deg f_a\leq n-2<n$.
If $(f_a(\alpha_i))_{i=1}^{n}=(f_b(\alpha_i))_{i=1}^{n}$, then
- $f_a$ and $f_b$ agree on $n$ points, which is more than their degree, so $f_a=f_b$ as polynomials;
- hence $a=b$, a contradiction.
Corollary: $|C|=q^k$.
</details>
In other words,
$$
\mathcal{C}=\{(f_a(\alpha_i))_{i=1}^{n}|a\in \mathbb{F}_q^k\}
$$
For data $a\in \mathbb{F}_q^k$, the $i$th server stores $f_a(\alpha_i)$.
Note that $\mathcal{C}$ is a linear code and $\dim \mathcal{C}=k$.
Let $\alpha\in \{\alpha_1,\alpha_2,\ldots,\alpha_n\}$.
Suppose $\alpha\in A_1$.
Locality is ensured as follows: the failed symbol $f_a(\alpha)$ can be recovered from $\{f_a(\beta):\beta\in A_1\setminus \{\alpha\}\}$.
Note that $g$ is constant on $A_1$, so each $f_{a,i}(x)=\sum_{j=0}^{k/r-1} a_{i,j}\, g(x)^j$ is also constant on $A_1$; denote this constant by $f_{a,i}(A_1)$.
Let $\delta_1(x)=\sum_{i=0}^{r-1} f_{a,i}(A_1)x^i$.
Then $\delta_1(x)=f_a(x)$ for all $x\in A_1$, and since $\deg \delta_1\leq r-1$, it can be interpolated from the $r$ values $\{f_a(\beta):\beta\in A_1\setminus\{\alpha\}\}$ and then evaluated at $\alpha$.
#### Constructing good polynomial
Let $(\mathbb{F}_q^*,\cdot)$ be the multiplicative group of $\mathbb{F}_q$.
For a subgroup $H$ of this group and an element $a\in \mathbb{F}_q^*$, the multiplicative coset is $Ha=\{ha\mid h\in H\}$.
The family of all cosets of $H$ is $\{Ha\mid a\in \mathbb{F}_q^*\}$.
This is a partition of $\mathbb{F}_q^*$.
#### Definition of annihilator
The annihilator of $H$ is $g(x)=\prod_{h\in H}(x-h)$.
Note that $g(h)=0$ for all $h\in H$.
#### Annihilator is a good polynomial
Fix $H$, consider the partition $\{Ha\mid a\in \mathbb{F}_q^*\}$, and let $g(x)=\prod_{h\in H}(x-h)$ be the annihilator of $H$. Then $g$ is a good polynomial (note that $\deg g=|H|$).
<details>
<summary>Proof</summary>
Need to show if $x,y\in Ha$, then $g(x)=g(y)$.
Since $x\in Ha$, there exists $h',h''\in H$ such that $x=h'a$ and $y=h''a$.
So $g(x)=g(h'a)=\prod_{h\in H}(h'a-h)=(h')^{|H|}\prod_{h\in H}(a-h/h')$.
By Lagrange's theorem, the order of $h'$ divides $|H|$, so $(h')^{|H|}= 1$ for every $h'\in H$.
So $\{h/h'|h\in H\}=H$ for every $h'\in H$.
So $g(x)=(h')^{|H|}\prod_{h\in H}(a-h/h')=\prod_{h\in H}(a-h/h')=g(a)$.
Similarly, $g(y)=g(h''a)=\prod_{h\in H}(h''a-h)=(h'')^{|H|}\prod_{h\in H}(a-h/h'')=g(a)$.
</details>
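To make the construction concrete, here is a small end-to-end numerical check, assuming $q=13$, $r=2$, $n=12$, $k=4$, and $H=\{1,3,9\}$ (the cube roots of unity in $\mathbb{F}_{13}^*$, so $g(x)=x^3-1$); the message values and all names are illustrative.

```python
# Tamo-Barg-style LRC over F_13: good polynomial from cosets, encoding, and local recovery.
q = 13
H = [1, 3, 9]                         # multiplicative subgroup of F_13^* of size r + 1 = 3

cosets, seen = [], set()              # partition F_13^* into cosets Hc (the sets A_i)
for c in range(1, q):
    if c not in seen:
        coset = sorted(h * c % q for h in H)
        cosets.append(coset)
        seen.update(coset)

def g(x):                             # annihilator of H: prod_{h in H} (x - h) mod q
    out = 1
    for h in H:
        out = out * (x - h) % q
    return out

for A_i in cosets:                    # good-polynomial property: g is constant on every A_i
    assert len({g(x) for x in A_i}) == 1

r, k = 2, 4
a = [[5, 7], [2, 11]]                 # message viewed as an r x (k/r) matrix a[i][j]

def f_a(x):                           # f_a(x) = sum_{i,j} a[i][j] * g(x)^j * x^i mod q
    return sum(a[i][j] * pow(g(x), j, q) * pow(x, i, q)
               for i in range(r) for j in range(k // r)) % q

def lagrange_eval(points, x):         # interpolate through `points` and evaluate at x, over F_q
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % q
                den = den * (xi - xj) % q
        total = (total + yi * num * pow(den, q - 2, q)) % q   # Fermat inverse of den
    return total

A1 = cosets[0]                        # recover the symbol at alpha from the rest of its coset
alpha, others = A1[0], A1[1:]
# delta_1 has degree <= r - 1 = 1, so r = 2 known values determine it
recovered = lagrange_eval([(b, f_a(b)) for b in others], alpha)
assert recovered == f_a(alpha)
print("local recovery succeeded:", recovered, "==", f_a(alpha))
```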

View File

@@ -15,4 +15,5 @@ export default {
CSE5313_L10: "CSE5313 Coding and information theory for data science (Recitation 10)",
CSE5313_L11: "CSE5313 Coding and information theory for data science (Recitation 11)",
CSE5313_L12: "CSE5313 Coding and information theory for data science (Lecture 12)",
CSE5313_L13: "CSE5313 Coding and information theory for data science (Lecture 13)",
}