From e4250883181c11fd98ec69b2f159cc69ed2c8be0 Mon Sep 17 00:00:00 2001
From: Trance-0 <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 14 Oct 2025 11:19:51 -0500
Subject: [PATCH 1/2] updates

---
 content/CSE510/CSE510_L13.md |  19 +++-
 content/CSE510/CSE510_L14.md | 182 +++++++++++++++++++++++++++++++++++
 content/CSE510/_meta.js      |   3 +-
 3 files changed, 198 insertions(+), 6 deletions(-)
 create mode 100644 content/CSE510/CSE510_L14.md

diff --git a/content/CSE510/CSE510_L13.md b/content/CSE510/CSE510_L13.md
index 5109681..43ef0f9 100644
--- a/content/CSE510/CSE510_L13.md
+++ b/content/CSE510/CSE510_L13.md
@@ -1,8 +1,12 @@
 # CSE510 Deep Reinforcement Learning (Lecture 13)
 
-> [!CAUTION]
+> Recap from last lecture
 >
-> This lecture is terribly taught in 90 minutes for me, who have no idea about the optimization and the Lagrangian Duality. May include numerous mistakes and it is good to restart learning from this checkpoint.
+> For any differentiable policy $\pi_\theta(s,a)$ and any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$,
+>
+> the policy gradient is
+>
+> $\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]$
 
 ## Problem for policy gradient method
 
@@ -20,7 +24,7 @@ Unstable update: step size is very important
 - Bad samples -> worse policy (compare to supervised learning: the correct label and data in the following batches may correct it)
 - If step size is too small: the learning process is slow
 
-## Deriving the optimization objection function of TRPO
+## Deriving the optimization objective function of Trust Region Policy Optimization (TRPO)
 
 ### Objective of Policy Gradient Methods
 
@@ -30,8 +34,7 @@ $$
 J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_theta}\sum_{t=0}^{\infty} \gamma^t r^t
 $$
 
-
-here $\tau$ is the trajectory.
+here $\tau$ is a trajectory generated by the policy $\pi_\theta$.
 
 Policy objective can be written in terms of old one:
 
@@ -97,6 +100,12 @@ $$
 
 ### Lower bound of Optimization
 
+> [!NOTE]
+>
+> The Kullback-Leibler (KL) divergence is a measure of the difference between two probability distributions:
+>
+> $D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))=\int_a \pi(a|s)\log \frac{\pi(a|s)}{\pi'(a|s)}da$
+
 $$
 J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\dot|s)||\pi'(\dot|s))
 $$
diff --git a/content/CSE510/CSE510_L14.md b/content/CSE510/CSE510_L14.md
new file mode 100644
index 0000000..96263a9
--- /dev/null
+++ b/content/CSE510/CSE510_L14.md
@@ -0,0 +1,182 @@
+# CSE510 Deep Reinforcement Learning (Lecture 14)
+
+## Advanced Policy Gradient Methods
+
+### Trust Region Policy Optimization (TRPO)
+
+"Recall" from last lecture:
+
+$$
+\max_{\pi'} \mathbb{E}_{s\sim d^{\pi},a\sim \pi} \left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]
+$$
+
+such that
+
+$$
+\mathbb{E}_{s\sim d^{\pi}} D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))\leq \delta
+$$
+
+Unconstrained penalized objective:
+
+$$
+d^*=\arg\max_{d} J(\theta+d)-\lambda(D_{KL}\left[\pi_\theta||\pi_{\theta+d}\right]-\delta)
+$$
+
+$\theta_{new}=\theta_{old}+d$
+
+First-order Taylor expansion for the objective and second-order for the KL:
+
+$$
+\approx \arg\max_{d} J(\theta_{old})+\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}^T d-\frac{1}{2}\lambda\left(d^T\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}}d\right)+\lambda \delta
+$$
+
+If you are really interested, try to fill in the details of solving the KL-constrained problem yourself.
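+
+As a concrete illustration, here is a minimal NumPy sketch (all arrays and names are illustrative, not from the lecture) of the surrogate objective and the sample-based KL term that appear in the constrained and penalized problems above:
+
+```python
+import numpy as np
+
+# Action probabilities of the old and new policy at a batch of sampled states
+# (each row is a distribution over 3 discrete actions); illustrative values only.
+pi_old = np.array([[0.20, 0.50, 0.30],
+                   [0.60, 0.30, 0.10]])
+pi_new = np.array([[0.25, 0.45, 0.30],
+                   [0.50, 0.35, 0.15]])
+actions = np.array([1, 0])        # actions that were actually taken
+adv = np.array([0.8, -0.3])       # advantage estimates A^pi(s, a)
+
+rows = np.arange(len(actions))
+ratio = pi_new[rows, actions] / pi_old[rows, actions]   # pi'(a|s) / pi(a|s)
+surrogate = np.mean(ratio * adv)                        # sample estimate of the objective
+
+# Sample estimate of E_{s ~ d^pi} D_KL(pi(.|s) || pi'(.|s))
+kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=1))
+
+delta, lam = 0.01, 10.0
+penalized = surrogate - lam * (kl - delta)              # unconstrained penalized objective
+print(surrogate, kl, penalized)
+```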
+
+#### Natural Gradient Descent
+
+Setting the gradient to zero:
+
+$$
+\begin{aligned}
+0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}^T d+\frac{1}{2}\lambda d^T F(\theta_{old})d\right)\\
+&=-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}+\lambda F(\theta_{old})d
+\end{aligned}
+$$
+
+$$
+d=\frac{1}{\lambda} F^{-1}(\theta_{old})\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}
+$$
+
+The natural gradient is
+
+$$
+\tilde{\nabla}J(\theta)=F^{-1}(\theta_{old})\nabla_\theta J(\theta)
+$$
+
+and the update with step size $\alpha$, where $\hat{g}$ is the sampled policy gradient, is
+
+$$
+\theta_{new}=\theta_{old}+\alpha F^{-1}(\theta_{old})\hat{g}
+$$
+
+The KL divergence is locally approximated by its second-order expansion,
+
+$$
+D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^T F(\theta_{old})(\theta-\theta_{old}),
+$$
+
+so requiring the natural-gradient step $g_N=F^{-1}(\theta_{old})\hat{g}$ to sit on the trust-region boundary,
+
+$$
+\frac{1}{2}(\alpha g_N)^T F(\alpha g_N)=\delta,
+$$
+
+gives
+
+$$
+\alpha=\sqrt{\frac{2\delta}{g_N^T F g_N}}
+$$
+
+However, due to the quadratic approximation, the KL constraint may still be violated.
+
+#### Line search
+
+We do a line search for the best step size, making sure that we are
+
+- Improving the objective
+- Satisfying the KL constraint
+
+TRPO = NPG + line search + monotonic improvement theorem
+
+#### Summary of TRPO
+
+Pros
+
+- Proper learning step size
+- [Monotonic improvement guarantee](./CSE510_L13.md#Monotonic-Improvement-Theorem)
+
+Cons
+
+- Poor scalability
+  - Second-order optimization: computing the Fisher information matrix and its inverse for the current policy model at every update is expensive
+- Not quite sample efficient
+  - Requires a large batch of rollouts to approximate the expectations accurately
+
+### Proximal Policy Optimization (PPO)
+
+> Proximal Policy Optimization (PPO), which performs comparably or better than state-of-the-art approaches while being much simpler to implement and tune. -- OpenAI
+
+[link to paper](https://arxiv.org/pdf/1707.06347)
+
+Idea:
+
+- The constraint helps the training process. However, maybe it does not have to be a hard constraint:
+- Does it matter if we break the constraint just a few times?
+
+What if we treat it as a “soft” constraint and add a proximal penalty to the objective function?
+
+#### PPO with Adaptive KL Penalty
+
+$$
+\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
+$$
+
+Use an adaptive $\beta$ value.
+
+$$
+L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
+$$
+
+Compute $d=\hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]$
+
+- If $d < d_{target}/1.5$, $\beta\gets \beta/2$
+- If $d > d_{target}\times 1.5$, $\beta\gets \beta\times 2$
+
+#### PPO with Clipped Objective
+
+$$
+\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]
+$$
+
+where the probability ratio is
+
+$$
+r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
+$$
+
+- Here, $r_t(\theta)$ measures how much the new policy changes the probability of taking action $a_t$ in state $s_t$:
+  - If $r_t > 1$: the action becomes more likely under the new policy.
+  - If $r_t < 1$: the action becomes less likely.
+- We'd like $r_t\hat{A}_t$ to increase if $\hat{A}_t > 0$ (good actions become more probable) and decrease if $\hat{A}_t < 0$.
+- But if $r_t$ changes too much, the update becomes **unstable**, just like in vanilla PG.
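+
+To make the ratio concrete, here is a minimal NumPy sketch (illustrative numbers, not from the lecture) of how $r_t(\theta)$ is computed from stored log-probabilities and how the unclipped surrogate blows up when the policy moves too far; the clipped objective defined next addresses exactly this:
+
+```python
+import numpy as np
+
+# Log-probabilities of the taken actions under the old (behavior) policy
+# and under the current policy; illustrative values only.
+logp_old = np.array([-1.2, -0.7, -2.3, -0.9])
+logp_new = np.array([-1.1, -0.6, -0.3, -3.5])  # the third action became much more likely
+adv = np.array([0.5, -0.2, 1.0, -0.8])         # advantage estimates \hat{A}_t
+
+ratio = np.exp(logp_new - logp_old)            # r_t(theta) = pi_theta / pi_theta_old
+surrogate = ratio * adv                        # unclipped surrogate r_t * A_t
+
+print(ratio)      # the third ratio is exp(2.0) ~ 7.4, far from 1
+print(surrogate)  # that term dominates the batch objective -> unstable update
+```
+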
+We limit $r_t(\theta)$ to be in a range:
+
+$$
+L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min(r_t(\theta)\hat{A}_t, \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)\right]
+$$
+
+> Trust Region Policy Optimization (TRPO): Don't move further than $\delta$ in KL.
+> Proximal Policy Optimization (PPO): Don't let $r_t(\theta)$ drift further than $\epsilon$.
+
+#### PPO in Practice
+
+$$
+L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)\right]
+$$
+
+Here $L_t^{CLIP}(\theta)$ is the clipped surrogate objective.
+
+$L_t^{VF}(\theta)$ is a squared-error loss for the “critic”, $(V_\theta(s_t)-V_t^{target})^2$ (subtracted, since the overall objective is maximized).
+
+$S[\pi_\theta](s_t)$ is an entropy bonus that ensures sufficient exploration by encouraging diversity of actions.
+
+$c_1$ and $c_2$ are trade-off coefficients; in the paper, $c_1=1$ and $c_2=0.01$.
+
+### Summary for Policy Gradient Methods
+
+Trust region policy optimization (TRPO)
+
+- Optimization problem formulation
+- Natural gradient ascent + monotonic improvement + line search
+- But requires second-order optimization
+
+Proximal policy optimization (PPO)
+
+- Clipped objective
+- Simple yet effective
+
+Take-away:
+
+- A proper step size is critical for policy gradient methods
+- Sample efficiency can be improved by using importance sampling
\ No newline at end of file
diff --git a/content/CSE510/_meta.js b/content/CSE510/_meta.js
index 9e0456e..f3bfa16 100644
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -15,5 +15,6 @@ export default {
   CSE510_L10: "CSE510 Deep Reinforcement Learning (Lecture 10)",
   CSE510_L11: "CSE510 Deep Reinforcement Learning (Lecture 11)",
   CSE510_L12: "CSE510 Deep Reinforcement Learning (Lecture 12)",
-  CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)"
+  CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)",
+  CSE510_L14: "CSE510 Deep Reinforcement Learning (Lecture 14)",
 }
\ No newline at end of file

From 0b10672bc86d7d4bfae56770ae940d4646fe574e Mon Sep 17 00:00:00 2001
From: Trance-0 <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 14 Oct 2025 12:44:28 -0500
Subject: [PATCH 2/2] updates

---
 content/CSE5313/CSE5313_L13.md | 333 +++++++++++++++++++++++++++++++++
 content/CSE5313/_meta.js       |   1 +
 2 files changed, 334 insertions(+)
 create mode 100644 content/CSE5313/CSE5313_L13.md

diff --git a/content/CSE5313/CSE5313_L13.md b/content/CSE5313/CSE5313_L13.md
new file mode 100644
index 0000000..3e452ba
--- /dev/null
+++ b/content/CSE5313/CSE5313_L13.md
@@ -0,0 +1,333 @@
+# CSE5313 Coding and information theory for data science (Lecture 13)
+
+## Recap from last lecture
+
+Locally recoverable codes:
+
+An $[n, k]_q$ code is called $r$-locally recoverable if
+
+- every codeword symbol $y_j$ has a recovering set $R_j \subseteq [n] \setminus \{j\}$ ($[n]=\{1,2,\ldots,n\}$),
+- such that $y_j$ is computable from $\{y_i : i \in R_j\}$,
+- $|R_j| \leq r$ for every $j \in [n]$.
+
+### Bounds for locally recoverable codes
+
+**Bound 1**: $\frac{k}{n}\leq \frac{r}{r+1}$.
+
+**Bound 2**: $d\leq n-k-\lceil\frac{k}{r}\rceil +2$. ($r=k$ gives the Singleton bound)
+
+#### Turan's Lemma
+
+Let $G$ be a directed graph with $n$ vertices. Then there exists an induced directed acyclic subgraph (DAG) of $G$ on at least $\frac{n}{1+\operatorname{avg}_i(d^{out}_i)}$ nodes, where $d^{out}_i$ is the out-degree of vertex $i$.
+
+Proof using the probabilistic method.
+
+**Proof via the probabilistic method**
+
+> Useful for showing the existence of a large acyclic subgraph, but not for finding it.
+
+> [!TIP]
+>
+> Show that $\mathbb{E}[X]\geq c$ for the desired bound $c$; then, by an averaging (pigeonhole) argument, there exists a permutation $\pi$ with $|U_\pi|\geq c$.
+
+For a permutation $\pi$ of $[n]$, define $U_\pi\subseteq [n]$ as follows: $i\in U_\pi$ if each of the $d_i^{out}$ outgoing edges from $i$ connects to a node $j$ with $\pi(j)>\pi(i)$.
+
+In other words, we select the nodes all of whose outgoing edges point to nodes that appear later in the order $\pi$ (all edges go "to the right").
+
+The subgraph induced on $U_\pi$ is clearly acyclic.
+
+Choose $\pi$ uniformly at random and let $X=|U_\pi|$ be a random variable.
+
+Let $X_i$ be the indicator random variable for $i\in U_\pi$.
+
+So $X=\sum_{i=1}^{n} X_i$.
+
+Using linearity of expectation, we have
+
+$$
+E[X]=\sum_{i=1}^{n} E[X_i]
+$$
+
+$E[X_i]$ is the probability that $\pi$ places $i$ before all of its out-neighbors.
+
+For each node $i$, there are $(d_i^{out}+1)!$ equally likely relative orderings of $i$ and its out-neighbors, and $d_i^{out}!$ of them place $i$ first.
+
+So, $E[X_i]=\frac{d_i^{out}!}{(d_i^{out}+1)!}=\frac{1}{d_i^{out}+1}$.
+
+Since $X=\sum_{i=1}^{n} X_i$, we have
+
+> [!TIP]
+>
+> Recall that the arithmetic mean ($\frac{1}{n}\sum_{i=1}^{n} x_i$) is greater than or equal to the harmonic mean ($\frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$).
+
+$$
+\begin{aligned}
+E[X]&=\sum_{i=1}^{n} E[X_i]\\
+&=\sum_{i=1}^{n} \frac{1}{d_i^{out}+1}\\
+&\geq \frac{n^2}{\sum_{i=1}^{n} (d_i^{out}+1)}\\
+&=\frac{n}{\frac{1}{n}\sum_{i=1}^{n}(d_i^{out}+1)}\\
+&=\frac{n}{1+\operatorname{avg}_i(d_i^{out})}
+\end{aligned}
+$$
+
+Hence some permutation $\pi$ attains $|U_\pi|\geq \frac{n}{1+\operatorname{avg}_i(d_i^{out})}$, and the subgraph induced on $U_\pi$ is acyclic.
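+
+A quick numerical sanity check of this argument (the graph and all names below are illustrative, not from the lecture): sample random permutations, build $U_\pi$, and compare the best $|U_\pi|$ found with the bound $\frac{n}{1+\operatorname{avg}_i(d_i^{out})}$.
+
+```python
+import random
+
+# Directed graph as adjacency lists: out_nbrs[i] = out-neighbors of node i (illustrative).
+out_nbrs = {0: [1, 2], 1: [2], 2: [0, 3], 3: [1], 4: [0]}
+n = len(out_nbrs)
+avg_out = sum(len(v) for v in out_nbrs.values()) / n
+bound = n / (1 + avg_out)
+
+best = 0
+for _ in range(1000):
+    order = list(out_nbrs)
+    random.shuffle(order)
+    pos = {v: i for i, v in enumerate(order)}   # pi: node -> position
+    # U_pi = nodes all of whose out-edges go "to the right" in pi
+    U = [v for v in out_nbrs if all(pos[w] > pos[v] for w in out_nbrs[v])]
+    best = max(best, len(U))
+
+# The lemma guarantees that some permutation achieves |U_pi| >= bound.
+print(f"bound = {bound:.2f}, best |U_pi| found = {best}")
+```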
+
+## Locally recoverable codes
+
+### Bound 1
+
+$$
+\frac{k}{n}\leq \frac{r}{r+1}
+$$
+
+**Proof via Turan's Lemma**
+
+Let $G$ be a directed graph such that $i\to j$ is an edge if $j\in R_i$ ($j$ is required to repair $i$).
+
+$d_i^{out}\leq r$ for every $i$, so $\operatorname{avg}_i(d_i^{out})\leq r$.
+
+By Turan's Lemma, there exists an induced directed acyclic subgraph (DAG) $G_U$ of $G$ on a vertex set $U$ such that $|U|\geq \frac{n}{1+\operatorname{avg}_i(d_i^{out})}\geq \frac{n}{1+r}$.
+
+Since $G_U$ is acyclic, there exists a vertex $u_1\in U$ with no outgoing edges **inside** $G_U$ (a sink of $G_U$), so $R_{u_1}\subseteq [n]\setminus U$.
+
+Note that if we remove $u_1$ from $G_U$, the remaining graph is still a DAG.
+
+Let $G_{U_1}$ be the induced graph on $U_1=U\setminus \{u_1\}$.
+
+We can find a vertex $u_2\in U_1$ with no outgoing edges **inside** $G_{U_1}$, so $R_{u_2}\subseteq ([n]\setminus U)\cup\{u_1\}$.
+
+We can continue this process until all the vertices in $U$ are removed.
+
+Recovering $u_1,u_2,\ldots$ in this order shows that every symbol in $U$ is determined by the symbols outside $U$; so all symbols in $U$ are redundant.
+
+Hence the number of redundant symbols satisfies $n-k\geq |U|\geq \frac{n}{1+r}$.
+
+So $k\leq \frac{rn}{r+1}$.
+
+So $\frac{k}{n}\leq \frac{r}{r+1}$.
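+
+As a sanity check that Bound 1 is tight (a standard example, not from this lecture): partition the $n$ codeword symbols into groups of size $r+1$, and let each group consist of $r$ information symbols plus one parity symbol equal to their sum. Every symbol is recoverable from the other $r$ symbols of its group, so the code has locality $r$, and its rate is exactly
+
+$$
+\frac{k}{n}=\frac{r\cdot\frac{n}{r+1}}{n}=\frac{r}{r+1}.
+$$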
+
+### Bound 2
+
+$$
+d\leq n-k-\lceil\frac{k}{r}\rceil +2
+$$
+
+#### Definition for code reduction
+
+For an $[n,k]_q$ code $C$, let $I\subseteq [n]$ be a set of indices, e.g., $I=\{1,3,5,6,7,8\}$.
+
+Let $C_I$ be the code obtained by deleting the symbols that are not in $I$ (only keep the symbols with indices in $I$).
+
+Then $C_I$ is a code of length $|I|$, and $C_I$ is generated by $G_I\in \mathbb{F}^{k\times |I|}_q$ (only keep the columns of $G$ with indices in $I$).
+
+$|C_I|=q^{\operatorname{rank}(G_I)}$.
+
+Note that $G_I$ is not necessarily of rank $k$.
+
+#### Reduction lemma
+
+Let $\mathcal{C}$ be an $[n, k]_q$ code. Then $d=n-\max\{|I|:|\mathcal{C}_I|<q^k\}$.
+
+**Proof for reduction lemma**
+
+We proceed from two inequalities.
+
+First, $d\leq n-\max\{|I|:|\mathcal{C}_I|<q^k\}$: let $I$ attain the maximum. Since $|\mathcal{C}_I|<q^k$, the projection $c\mapsto c_I$ is not injective, so two distinct codewords agree on $I$. Their difference is a nonzero codeword that is zero on $I$, hence of weight at most $n-|I|$, and by linearity $d\leq n-|I|$.
+
+Second, $d\geq n-\max\{|I|:|\mathcal{C}_I|<q^k\}$: let $c$ be a nonzero codeword of minimum weight $d$, and let $I=[n]\setminus\operatorname{supp}(c)$, so $|I|=n-d$. Then $c_I=0_I$, so the projection is not injective on $\mathcal{C}$ and $|\mathcal{C}_I|<q^k$. Hence $\max\{|I|:|\mathcal{C}_I|<q^k\}\geq n-d$.
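+
+A small brute-force check of the reduction lemma (the $[5,2]_2$ generator matrix below is illustrative, not from the lecture): compute $d$ directly and compare it with $n-\max\{|I|:\operatorname{rank}(G_I)<k\}$.
+
+```python
+from itertools import combinations, product
+
+# A small [5,2] binary code (illustrative generator matrix).
+G = [[1, 0, 1, 1, 0],
+     [0, 1, 0, 1, 1]]
+n, k, q = 5, 2, 2
+
+def rank_gf2(rows):
+    """Rank of a binary matrix (list of row lists) over GF(2), by Gaussian elimination."""
+    rows = [r[:] for r in rows]
+    ncols = len(rows[0]) if rows and rows[0] else 0
+    rank, col = 0, 0
+    while rank < len(rows) and col < ncols:
+        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
+        if pivot is None:
+            col += 1
+            continue
+        rows[rank], rows[pivot] = rows[pivot], rows[rank]
+        for i in range(len(rows)):
+            if i != rank and rows[i][col]:
+                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
+        rank += 1
+        col += 1
+    return rank
+
+# Minimum distance by brute force over all nonzero codewords.
+codewords = [[sum(m[i] * G[i][j] for i in range(k)) % q for j in range(n)]
+             for m in product(range(q), repeat=k)]
+d = min(sum(c) for c in codewords if any(c))
+
+# max{|I| : |C_I| < q^k} is the same as max{|I| : rank(G_I) < k}
+best = max(len(I)
+           for size in range(n + 1)
+           for I in combinations(range(n), size)
+           if rank_gf2([[G[i][j] for j in I] for i in range(k)]) < k)
+
+print(d, n - best)  # the reduction lemma says these two numbers agree
+```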
+
+We prove Bound 2 using the same repair graph as in the proof of Bound 1.
+
+**Proof for bound 2**
+
+Let $G$ be a directed graph such that $i\to j$ is an edge if $j\in R_i$ ($j$ is required to repair $i$).
+
+$d_i^{out}\leq r$ for every $i$, so $\operatorname{avg}_i(d_i^{out})\leq r$.
+
+By Turan's Lemma, there exists an induced directed acyclic subgraph (DAG) $G_U$ of $G$ on a vertex set $U$ such that $|U|\geq \frac{n}{1+\operatorname{avg}_i(d_i^{out})}\geq \frac{n}{1+r}$.
+
+Let $U'\subseteq U$ be of size $\lfloor\frac{k-1}{r}\rfloor$.
+
+$U'$ exists since $\frac{n}{r+1}\geq \frac{k}{r}$ by Bound 1, and $\frac{k}{r}>\lfloor\frac{k-1}{r}\rfloor$.
+
+So $G_{U'}$ is also acyclic.
+
+Let $N\subseteq [n]\setminus U'$ be the set of out-neighbors (in $G$) of the vertices of $U'$ that lie outside $U'$.
+
+$|N|\leq r|U'|\leq k-1$.
+
+Complete $N$ to be of size exactly $k-1$ by adding elements of $[n]\setminus (N\cup U')$.
+
+$|C_N|\leq q^{|N|}=q^{k-1}$.
+
+Also $|N\cup U'|=k-1+\lfloor\frac{k-1}{r}\rfloor$.
+
+Note that all symbols indexed by $U'$ can be recovered from the symbols indexed by $N$ (peel off the vertices of the acyclic graph $G_{U'}$ one by one, as in the proof of Bound 1).
+
+So $|C_{N\cup U'}|=|C_N|\leq q^{k-1}<q^k$.
+
+Therefore, $\max\{|I|:|C_I|<q^k\}\geq |N\cup U'|=k-1+\lfloor\frac{k-1}{r}\rfloor$, and by the reduction lemma
+
+$$
+d\leq n-\left(k-1+\left\lfloor\frac{k-1}{r}\right\rfloor\right)=n-k-\lceil\frac{k}{r}\rceil+2.
+$$
+
+### Construction of locally recoverable codes
+
+Recall the Reed-Solomon code:
+
+$\{(f(\alpha_1),f(\alpha_2),\ldots,f(\alpha_n))\mid f\in \mathbb{F}_q[x],\ \deg f\leq k-1\}$.
+
+- $\dim=k$
+- $d=n-k+1$
+- Need $n\leq q$ (the evaluation points must be distinct elements of $\mathbb{F}_q$).
+- To encode $a=(a_0,a_1,\ldots,a_{k-1})$, we evaluate $f_a(x)=\sum_{i=0}^{k-1}a_i x^i$ at $\alpha_1,\ldots,\alpha_n$.
+
+Assume $r+1\mid n$ and $r\mid k$.
+
+Choose a set of evaluation points $A=\{\alpha_1,\alpha_2,\ldots,\alpha_n\}\subseteq\mathbb{F}_q$.
+
+Partition $A$ into $\frac{n}{r+1}$ subsets $A_1,\ldots,A_{n/(r+1)}$ of size $r+1$.
+
+#### Definition for good polynomial
+
+A polynomial $g$ over $\mathbb{F}_q$ is called good if
+
+- $\deg(g)=r+1$
+- $g$ is constant on every $A_i$, i.e., $\forall x,y\in A_i,\ g(x)=g(y)$.
+
+To encode $a=(a_0,a_1,\ldots,a_{k-1})\in\mathbb{F}_q^k$, we view $a$ as a matrix $(a_{i,j})\in \mathbb{F}_q^{r\times (k/r)}$.
+
+Let $f_a(x)=\sum_{i=0}^{r-1} f_{a,i}(x)\cdot x^i$, where
+
+$f_{a,i}(x)=\sum_{j=0}^{k/r-1} a_{i,j}\cdot g(x)^j$ and $g$ is a good polynomial.
+**Proof for valid codeword**
+
+In other words,
+
+$$
+\mathcal{C}=\{(f_a(\alpha_i))_{i=1}^{n}\mid a\in \mathbb{F}_q^k\}
+$$
+
+For data $a\in \mathbb{F}_q^k$, the $i$th server stores $f_a(\alpha_i)$.
+
+Note that $\mathcal{C}$ is a linear code and $\dim \mathcal{C}=k$ (the $k$ polynomials $g(x)^j x^i$ have pairwise distinct degrees, and $\deg f_a\leq k+\frac{k}{r}-2<n$, so distinct messages give distinct codewords).
+
+Let $\alpha\in \{\alpha_1,\alpha_2,\ldots,\alpha_n\}$.
+
+Suppose $\alpha\in A_1$.
+
+Locality is ensured by recovering the failed server's symbol $f_a(\alpha)$ from $\{f_a(\beta):\beta\in A_1\setminus \{\alpha\}\}$.
+
+Note that $g$ is constant on $A_1$, so each $f_{a,i}$ is constant on $A_1$ as well.
+
+Denote this constant by $f_{a,i}(A_1)$.
+
+Let $\delta_1(x)=\sum_{i=0}^{r-1} f_{a,i}(A_1)x^i$.
+
+Then $\delta_1(x)=f_a(x)$ for all $x\in A_1$.
+
+Since $\deg \delta_1\leq r-1$, it can be interpolated from the $r$ known values $\{(\beta,f_a(\beta)):\beta\in A_1\setminus\{\alpha\}\}$, and then $f_a(\alpha)=\delta_1(\alpha)$.
+
+#### Constructing a good polynomial
+
+Let $(\mathbb{F}_q^*,\cdot)$ be the multiplicative group of $\mathbb{F}_q$.
+
+For a subgroup $H$ of this group and $a\in\mathbb{F}_q^*$, the multiplicative coset is $Ha=\{ha\mid h\in H\}$.
+
+The family of all cosets of $H$, $\{Ha\mid a\in \mathbb{F}_q^*\}$, is a partition of $\mathbb{F}_q^*$.
+
+#### Definition of annihilator
+
+The annihilator of $H$ is $g(x)=\prod_{h\in H}(x-h)$.
+
+Note that $g(h)=0$ for all $h\in H$, and $\deg g=|H|$.
+
+#### Annihilator is a good polynomial
+
+Fix $H$, consider the partition $\{Ha\mid a\in \mathbb{F}_q^*\}$, and let $g(x)=\prod_{h\in H}(x-h)$ be the annihilator of $H$. Then $g$ is a good polynomial (taking $|H|=r+1$, which requires $r+1\mid q-1$, gives $\deg g=r+1$ and parts $A_i$ of size $r+1$).
+
+**Proof**
+
+We need to show that if $x,y\in Ha$, then $g(x)=g(y)$.
+
+Since $x,y\in Ha$, there exist $h',h''\in H$ such that $x=h'a$ and $y=h''a$.
+
+So $g(x)=g(h'a)=\prod_{h\in H}(h'a-h)=(h')^{|H|}\prod_{h\in H}(a-h/h')$.
+
+By Lagrange's theorem (the order of $h'$ divides $|H|$), $(h')^{|H|}=1$ for every $h'\in H$.
+
+Moreover, $\{h/h'\mid h\in H\}=H$ for every $h'\in H$, since $H$ is a group.
+
+So $g(x)=(h')^{|H|}\prod_{h\in H}(a-h/h')=\prod_{h\in H}(a-h/h')=g(a)$.
+
+Similarly, $g(y)=g(h''a)=\prod_{h\in H}(h''a-h)=(h'')^{|H|}\prod_{h\in H}(a-h/h'')=g(a)$.
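+
+A small end-to-end numerical check of the construction (all parameters below are illustrative, not from the lecture): over $\mathbb{F}_{13}$ take $H=\{1,3,9\}$, so $r=2$, $g(x)=\prod_{h\in H}(x-h)=x^3-1$, $n=12$, and $k=4$; verify that $g$ is constant on every coset and that an erased symbol can be recovered from the other two symbols in its coset.
+
+```python
+# Tamo--Barg-style locally recoverable code over F_13 (illustrative sketch).
+q, r = 13, 2
+H = [1, 3, 9]                          # multiplicative subgroup of order r+1 = 3
+A = list(range(1, q))                  # evaluation points: all of F_13^*, n = 12
+
+def g(x):                              # annihilator of H: (x-1)(x-3)(x-9) = x^3 - 1 mod 13
+    out = 1
+    for h in H:
+        out = out * (x - h) % q
+    return out
+
+# Cosets of H partition F_13^* into n/(r+1) = 4 sets of size 3.
+cosets, seen = [], set()
+for a in A:
+    if a not in seen:
+        c = sorted(h * a % q for h in H)
+        cosets.append(c)
+        seen.update(c)
+assert all(len({g(x) for x in c}) == 1 for c in cosets)   # g is constant on every coset
+
+# Encoding with k = 4 (so k/r = 2): f_a(x) = sum_i sum_j a[i][j] * g(x)^j * x^i.
+k = 4
+msg = [[2, 7],                         # a_{0,0}, a_{0,1}
+       [5, 1]]                         # a_{1,0}, a_{1,1}
+
+def f_a(x):
+    return sum(msg[i][j] * pow(g(x), j, q) * pow(x, i, q)
+               for i in range(r) for j in range(k // r)) % q
+
+codeword = {x: f_a(x) for x in A}
+
+# Local recovery: erase the symbol at alpha and re-interpolate delta_1 of degree <= r-1 = 1.
+alpha = cosets[0][0]
+others = [b for b in cosets[0] if b != alpha]
+(x1, y1), (x2, y2) = [(b, codeword[b]) for b in others]
+slope = (y2 - y1) * pow(x2 - x1, q - 2, q) % q             # Lagrange interpolation over F_13
+recovered = (y1 + slope * (alpha - x1)) % q
+assert recovered == codeword[alpha]
+print("recovered", recovered, "== stored", codeword[alpha])
+```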
diff --git a/content/CSE5313/_meta.js b/content/CSE5313/_meta.js
index 8388c92..c7a5b5b 100644
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -15,4 +15,5 @@ export default {
   CSE5313_L10: "CSE5313 Coding and information theory for data science (Recitation 10)",
   CSE5313_L11: "CSE5313 Coding and information theory for data science (Recitation 11)",
   CSE5313_L12: "CSE5313 Coding and information theory for data science (Lecture 12)",
+  CSE5313_L13: "CSE5313 Coding and information theory for data science (Lecture 13)",
 }
\ No newline at end of file