diff --git a/content/CSE510/CSE510_L12.md b/content/CSE510/CSE510_L12.md
new file mode 100644
index 0000000..8ad0405
--- /dev/null
+++ b/content/CSE510/CSE510_L12.md
@@ -0,0 +1,204 @@
+# CSE510 Deep Reinforcement Learning (Lecture 12)
+
+## Policy Gradient Theorem
+
+For any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$,
+
+the policy gradient is
+
+$$
+\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]
+$$
+
+## Policy Gradient Methods
+
+Advantages and disadvantages of policy-based RL:
+
+Advantages:
+
+- Better convergence properties
+- Effective in high-dimensional or continuous action spaces
+- Can learn stochastic policies
+
+Disadvantages:
+
+- Typically converge to a local rather than a global optimum
+- Evaluating a policy is typically inefficient and high variance
+
+### Actor-Critic Methods
+
+#### Q Actor-Critic
+
+Reducing variance using a critic
+
+Monte-Carlo policy gradient still has high variance.
+
+We use a critic to estimate the action-value function $Q_w(s,a)\approx Q^{\pi_\theta}(s,a)$.
+
+Actor-critic algorithms maintain two sets of parameters:
+
+Critic: updates action-value function parameters $w$.
+
+Actor: updates policy parameters $\theta$, in the direction suggested by the critic.
+
+Actor-critic algorithms follow an approximate policy gradient:
+
+$$
+\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)\right]
+$$
+$$
+\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)
+$$
+
+Action-Value Actor-Critic
+
+- Simple actor-critic algorithm based on an action-value critic
+- Uses linear value function approximation $Q_w(s,a)=\phi(s,a)^T w$
+
+Critic: updates $w$ by linear TD(0).
+Actor: updates $\theta$ by policy gradient.
+
+The sketch below cleans up the pseudocode; the helpers (`sample_action`, `grad_log_pi`, `Q_w`, `phi`) and the `env.reset()`/`env.step()` interface are placeholders for illustration, not a specific library API.
+
+```python
+def q_actor_critic(env, theta, w, alpha, beta, gamma, num_steps):
+    s = env.reset()
+    a = sample_action(s, theta)                    # a ~ pi_theta(s, .)
+    for _ in range(num_steps):
+        r, s_next = env.step(a)                    # sample reward and transition
+        a_next = sample_action(s_next, theta)      # a' ~ pi_theta(s', .)
+        # TD(0) error for the action-value critic
+        delta = r + gamma * Q_w(s_next, a_next, w) - Q_w(s, a, w)
+        # Actor: policy-gradient step using the critic's estimate Q_w(s, a)
+        theta = theta + alpha * grad_log_pi(s, a, theta) * Q_w(s, a, w)
+        # Critic: linear TD(0) update, since Q_w(s, a) = phi(s, a)^T w
+        w = w + beta * delta * phi(s, a)
+        s, a = s_next, a_next
+    return theta, w
+```
+
+#### Advantage Actor-Critic
+
+Reducing variance using a baseline
+
+- We subtract a baseline function $B(s)$ from the policy gradient
+- This can reduce the variance without changing the expectation
+
+$$
+\begin{aligned}
+\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)\right]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\
+&=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\
+&=0
+\end{aligned}
+$$
+
+A good baseline is the state value function $B(s)=V^{\pi_\theta}(s)$.
+
+So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)$:
+
+$$
+\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)\right]
+$$
+
+##### Estimating the Advantage Function
+
+**Method 1:** direct estimation
+
+> May increase the variance
+
+The advantage function can significantly reduce the variance of the policy gradient.
+
+So the critic should really estimate the advantage function,
+
+for example, by estimating both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s,a)$,
+
+using two function approximators and two parameter vectors,
+
+$$
+V_v(s)\approx V^{\pi_\theta}(s)\\
+Q_w(s,a)\approx Q^{\pi_\theta}(s,a)\\
+A(s,a)=Q_w(s,a)-V_v(s)
+$$
+
+and updating both value functions by, e.g., TD learning.
+
+**Method 2:** using the TD error
+
+> We can prove that the TD error is an unbiased estimate of the advantage function
+
+For the true value function $V^{\pi_\theta}(s)$, the TD error $\delta^{\pi_\theta}$
+
+$$
+\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)
+$$
+
+is an unbiased estimate of the advantage function:
+
+$$
+\begin{aligned}
+\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}| s,a]&=\mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') |s,a]-V^{\pi_\theta}(s)\\
+&=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\
+&=A^{\pi_\theta}(s,a)
+\end{aligned}
+$$
+
+So we can use the TD error to compute the policy gradient:
+
+$$
+\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}]
+$$
+
+In practice, we use an approximate TD error $\delta_v=r+\gamma V_v(s')-V_v(s)$ to compute the policy gradient.
+
+### Summary of policy gradient algorithms
+
+The policy gradient has many equivalent forms:
+
+$$
+\begin{aligned}
+\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) v_t] \text{ REINFORCE} \\
+&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)] \text{ Q Actor-Critic} \\
+&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)] \text{ Advantage Actor-Critic} \\
+&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}] \text{ TD Actor-Critic}
+\end{aligned}
+$$
+
+Each leads to a stochastic gradient ascent algorithm.
+
+The critic uses policy evaluation to estimate $Q^\pi(s,a)$, $A^\pi(s,a)$, or $V^\pi(s)$.
+
+## Compatible Function Approximation
+
+If the following two conditions are satisfied:
+
+1. The value function approximator is compatible with the policy
+   $$
+   \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)
+   $$
+2. The value function parameters $w$ minimize the mean-squared error
+   $$
+   \epsilon = \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s,a)-Q_w(s,a))^2]
+   $$
+   Note that $\epsilon$ need not be zero; it only needs to be minimized.
+
+Then the policy gradient is exact:
+
+$$
+\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)]
+$$
+
+Remember:
+
+$$
+\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)]
+$$
+
+### Challenges with Policy Gradient Methods
+
+- Data inefficiency
+  - On-policy method: for each new policy, we need to generate a completely new trajectory
+  - The data is thrown out after just one gradient update
+  - As complex neural networks need many updates, this makes the training process very slow
+- Unstable updates: the step size is very important
+  - If the step size is too large:
+    - Large step -> bad policy
+    - The next batch is generated from the current bad policy -> collect bad samples
+    - Bad samples -> worse policy (compare to supervised learning, where the correct labels and data in the following batches may correct it)
+  - If the step size is too small: the learning process is slow
+
diff --git a/content/CSE510/_meta.js b/content/CSE510/_meta.js
index aa4ef7e..be90a9e 100644
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -14,4 +14,5 @@ export default {
   CSE510_L9: "CSE510 Deep Reinforcement Learning (Lecture 9)",
   CSE510_L10: "CSE510 Deep Reinforcement Learning (Lecture 10)",
   CSE510_L11: "CSE510 Deep Reinforcement Learning (Lecture 11)",
+  CSE510_L12: "CSE510 Deep Reinforcement Learning (Lecture 12)"
 }
\ No newline at end of file
diff --git a/content/CSE5313/CSE5313_L11.md b/content/CSE5313/CSE5313_L11.md
new file mode 100644
index 0000000..287eefc
--- /dev/null
+++ b/content/CSE5313/CSE5313_L11.md
@@ -0,0 +1,166 @@
+# CSE5313 Coding and information theory for data science (Recitation 11)
+
+## Question 5
+
+Prove that the minimum distance of the Reed-Muller code $\operatorname{RM}(r,m)$ is $2^{m-r}$.
+
+Here $n=2^m$.
+
+Recall the definition of the RM code:
+
+$$
+\operatorname{RM}(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_{2^m}))\mid \alpha_i\in \mathbb{F}_2^m,\ \deg f\leq r\right\},
+$$
+
+where $f$ ranges over the polynomials in $m$ variables over $\mathbb{F}_2$ of degree at most $r$, and $\alpha_1,\ldots,\alpha_{2^m}$ enumerate all points of $\mathbb{F}_2^m$.
+
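+To make the definition concrete, here is a small Python sketch (not from the recitation; representing a polynomial as a dict keyed by tuples of variable indices is my own choice) that evaluates a multilinear polynomial of degree at most $r$ at every point of $\mathbb{F}_2^m$ to produce one codeword:
+
+```python
+import itertools
+
+def rm_codeword(coeffs, r, m):
+    """Evaluate f = sum over S with |S| <= r of coeffs[S] * prod_{i in S} x_i at all points of F_2^m."""
+    points = list(itertools.product([0, 1], repeat=m))      # alpha_1, ..., alpha_{2^m}
+    word = []
+    for x in points:
+        value = 0
+        for S, c in coeffs.items():                          # S is a tuple of variable indices
+            if c and len(S) <= r:
+                value ^= int(all(x[i] for i in S))           # monomial is 1 iff all its variables are 1
+        word.append(value)
+    return word
+
+# f(x_0, x_1, x_2) = 1 + x_0*x_1 (the key () is the constant term); this gives a codeword of RM(2, 3).
+print(rm_codeword({(): 1, (0, 1): 1}, r=2, m=3))             # -> [1, 1, 1, 1, 1, 1, 0, 0]
+```
+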
+**Example of RM code**
+
+Let $r=0$. Then $\operatorname{RM}(0,m)$ is the repetition code.
+
+$\dim \operatorname{RM}(r,m)=\sum_{i=0}^{r}\binom{m}{i}$.
+
+Here $r=0$, so $\dim \operatorname{RM}(0,m)=1$.
+
+So the minimum distance of $\operatorname{RM}(0,m)$ is $2^{m-0}=2^m=n$.
+
+---
+
+Let $r=m$.
+
+Then $\dim \operatorname{RM}(m,m)=\sum_{i=0}^{m}\binom{m}{i}=2^m$ (binomial theorem).
+
+So the generator matrix is $n\times n$ of full rank, i.e. $\operatorname{RM}(m,m)=\mathbb{F}_2^n$.
+
+So the minimum distance of $\operatorname{RM}(m,m)$ is $2^{m-m}=1$.
+
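+For small parameters, both the dimension formula and the claimed minimum distance can be checked by brute force. The sketch below (mine, not from the recitation) builds the generator matrix of $\operatorname{RM}(r,m)$ from the monomial evaluations and enumerates all codewords of $\operatorname{RM}(1,3)$:
+
+```python
+import itertools
+import math
+
+def rm_generator_matrix(r, m):
+    """Rows are the evaluations of the monomials prod_{i in S} x_i, |S| <= r, at all 2^m points of F_2^m."""
+    points = list(itertools.product([0, 1], repeat=m))
+    monomials = [S for k in range(r + 1) for S in itertools.combinations(range(m), k)]
+    return [[int(all(x[i] for i in S)) for x in points] for S in monomials]
+
+def min_distance(G):
+    """Minimum Hamming weight over all nonzero codewords (only feasible for small codes)."""
+    k, n = len(G), len(G[0])
+    best = n
+    for msg in itertools.product([0, 1], repeat=k):
+        if any(msg):
+            word = [sum(msg[j] * G[j][i] for j in range(k)) % 2 for i in range(n)]
+            best = min(best, sum(word))
+    return best
+
+r, m = 1, 3
+G = rm_generator_matrix(r, m)
+assert len(G) == sum(math.comb(m, i) for i in range(r + 1))   # dim RM(1,3) = 1 + 3 = 4
+assert min_distance(G) == 2 ** (m - r)                        # minimum distance 2^{3-1} = 4
+```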
+
+Then we prove the general case by induction on $m$.
+
+Assume that the minimum distance of $\operatorname{RM}(r',m')$ is $2^{m'-r'}$ for all $0\leq r'\leq r$ with $r'\leq m'<m$.
+
+**Proof**
+
+Recall that a polynomial $p(x_1,x_2,\ldots,x_m)$ of degree at most $r$ can be written as $p(x_1,x_2,\ldots,x_m)=\sum_{S\subseteq [m],|S|\leq r}f_S X_S$, where $f_S\in \mathbb{F}_2$ and the monomial $X_S=\prod_{i\in S}x_i$.
+
+Every such polynomial can be split as
+
+$$
+\begin{aligned}
+p(x_1,x_2,\ldots,x_m)&=\sum_{S\subseteq [m],|S|\leq r}f_S X_S\\
+&=g(x_1,x_2,\ldots,x_{m-1})+x_m h(x_1,x_2,\ldots,x_{m-1})
+\end{aligned}
+$$
+
+Here $g(x_1,x_2,\ldots,x_{m-1})$ has degree at most $r$ and does not contain $x_m$,
+
+and $h(x_1,x_2,\ldots,x_{m-1})$ has degree at most $r-1$ (so $x_m h$ collects exactly the monomials that contain $x_m$).
+
+Note that a codeword of $\operatorname{RM}(r,m)$ is the truth table of some such polynomial $p$ evaluated at all $2^m$ points $\alpha_i\in \mathbb{F}_2^m$.
+
+Since the code is linear, its minimum distance equals the minimum Hamming weight of a nonzero codeword, which is the number of $\alpha_i$ such that $p(\alpha_i)=1$.
+
+So we can define the weight of $p$ as the set of all $\alpha_i$ such that $p(\alpha_i)=1$:
+
+$$
+\operatorname{wt}(p)=\{\alpha_i\mid p(\alpha_i)=1\}
+$$
+
+Note that $g(x_1,x_2,\ldots,x_{m-1})$ corresponds to a codeword of $\operatorname{RM}(r,m-1)$ and $h(x_1,x_2,\ldots,x_{m-1})$ to a codeword of $\operatorname{RM}(r-1,m-1)$.
+
+If $x_m=0$, then $p(\alpha_i)=g(\alpha_i)$.
+If $x_m=1$, then $p(\alpha_i)=g(\alpha_i)+h(\alpha_i)$.
+
+So $\operatorname{wt}(p)$ is the disjoint union of $\operatorname{wt}(g)$ (on the half with $x_m=0$) and $\operatorname{wt}(g+h)$ (on the half with $x_m=1$).
+
+Note that $\operatorname{wt}(g+h)$ is the set of $\alpha_i$ such that $g(\alpha_i)+h(\alpha_i)=1$, where addition is `XOR` over the binary field.
+
+So $\operatorname{wt}(g+h)=(\operatorname{wt}(g)\setminus\operatorname{wt}(h))\cup (\operatorname{wt}(h)\setminus\operatorname{wt}(g))$.
+
+So
+
+$$
+\begin{aligned}
+|\operatorname{wt}(p)|&=|\operatorname{wt}(g)|+|\operatorname{wt}(g+h)|\\
+&=|\operatorname{wt}(g)|+|\operatorname{wt}(g)\setminus\operatorname{wt}(h)|+|\operatorname{wt}(h)\setminus\operatorname{wt}(g)|\\
+&=|\operatorname{wt}(h)|+2|\operatorname{wt}(g)\setminus\operatorname{wt}(h)|
+\end{aligned}
+$$
+
+If $h\neq 0$, then $h$ is a nonzero codeword of $\operatorname{RM}(r-1,m-1)$, so by the induction hypothesis $|\operatorname{wt}(h)|\geq 2^{(m-1)-(r-1)}=2^{m-r}$, and hence $|\operatorname{wt}(p)|\geq 2^{m-r}$.
+
+If $h=0$, then $p=g$ appears on both halves, so $|\operatorname{wt}(p)|=2|\operatorname{wt}(g)|\geq 2\cdot 2^{(m-1)-r}=2^{m-r}$ by the induction hypothesis (for nonzero $g$).
+
+Since the monomial $x_1x_2\cdots x_r$ has weight exactly $2^{m-r}$, the minimum distance of $\operatorname{RM}(r,m)$ is exactly $2^{m-r}$.
+
+## Theorem for Reed-Muller code
+
+$$
+\operatorname{RM}(r,m)^\perp=\operatorname{RM}(m-r-1,m)
+$$
+
+Let $\mathcal{C}$ be an $[n,k,d]_q$ code.
+
+The dual code of $\mathcal{C}$ is $\mathcal{C}^\perp=\{x\in \mathbb{F}^n_q\mid xc^T=0\text{ for all }c\in \mathcal{C}\}$.
+
+**Example**
+
+$\operatorname{RM}(0,m)^\perp=\operatorname{RM}(m-1,m)$.
+
+$\operatorname{RM}(0,m)$ is the repetition code,
+
+which is indeed the dual of the single parity-check code $\operatorname{RM}(m-1,m)$.
+
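+The duality statement can also be checked numerically for small parameters. This sketch (mine, reusing the same monomial-evaluation construction of the generator matrix) verifies that every generator of $\operatorname{RM}(1,4)$ is orthogonal to every generator of $\operatorname{RM}(2,4)$ over $\mathbb{F}_2$, and that the dimensions add up to $2^m$:
+
+```python
+import itertools
+
+def rm_generator_matrix(r, m):
+    """Rows are the evaluations of the monomials of degree <= r at all 2^m points of F_2^m."""
+    points = list(itertools.product([0, 1], repeat=m))
+    monomials = [S for k in range(r + 1) for S in itertools.combinations(range(m), k)]
+    return [[int(all(x[i] for i in S)) for x in points] for S in monomials]
+
+m, r = 4, 1
+G1 = rm_generator_matrix(r, m)             # generators of RM(1, 4)
+G2 = rm_generator_matrix(m - r - 1, m)     # generators of RM(2, 4)
+
+# Orthogonality: every pair of generators has even inner product, so RM(2,4) lies in RM(1,4)^perp.
+assert all(sum(a * b for a, b in zip(u, v)) % 2 == 0 for u in G1 for v in G2)
+
+# Dimension check: dim RM(1,4) + dim RM(2,4) = 5 + 11 = 16 = 2^m, so the inclusion is an equality.
+assert len(G1) + len(G2) == 2 ** m
+```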
+
+### Lemma for sum of binary product
+
+For $A\subseteq [m]=\{1,2,\ldots,m\}$, let $X^A=\prod_{i\in A}x_i$. We can define the inner product (over $\mathbb{F}_2$) $\langle X^A,X^B\rangle=\sum_{x\in \{0,1\}^m}\prod_{i\in A}x_i\prod_{i\in B}x_i=\sum_{x\in \{0,1\}^m}\prod_{i\in A\cup B}x_i$.
+
+So $\langle X^A,X^B\rangle=\begin{cases}
+1 & \text{if }A\cup B=[m]\\
+0 & \text{otherwise}
+\end{cases}$
+
+because $\prod_{i\in A\cup B}x_i=1$ if and only if every coordinate of $x$ indexed by $A\cup B$ equals 1.
+
+So the number of such $x\in \{0,1\}^m$ is $2^{m-|A\cup B|}$.
+
+The sum over $\mathbb{F}_2$ is therefore 1 if and only if $2^{m-|A\cup B|}$ is odd, i.e. $m-|A\cup B|=0$.
+
+Recall that $\operatorname{RM}(r,m)$ consists of the evaluations of $f=\sum_{B\subseteq [m],|B|\leq r}\beta_B X^B$ (with $\beta_B\in\mathbb{F}_2$) at all $x\in \{0,1\}^m$.
+
+$\operatorname{RM}(m-r-1,m)$ consists of the evaluations of $h=\sum_{A\subseteq [m],|A|\leq m-r-1}\alpha_A X^A$ (with $\alpha_A\in\mathbb{F}_2$) at all $x\in \{0,1\}^m$.
+
+By bilinearity of the inner product, we have
+
+$$
+\begin{aligned}
+\langle f,h\rangle&=\left\langle \sum_{B\subseteq [m],|B|\leq r}\beta_B X^B,\sum_{A\subseteq [m],|A|\leq m-r-1}\alpha_A X^A\right\rangle\\
+&=\sum_{B\subseteq [m],|B|\leq r}\sum_{A\subseteq [m],|A|\leq m-r-1}\beta_B\alpha_A\langle X^B,X^A\rangle
+\end{aligned}
+$$
+
+Because $|A\cup B|\leq |A|+|B|\leq m-r-1+r=m-1$,
+
+we get $\langle X^B,X^A\rangle=0$, since $|A\cup B|\leq m-1<m$ means $A\cup B\neq [m]$. Hence $\langle f,h\rangle=0$.
+
+**Proof of the theorem**
+
+Recall that the dual code is $\operatorname{RM}(r,m)^\perp=\{x\in \mathbb{F}_2^n\mid xc^T=0\text{ for all }c\in \operatorname{RM}(r,m)\}$, where $n=2^m$.
+
+By the lemma, $\operatorname{RM}(m-r-1,m)\subseteq \operatorname{RM}(r,m)^\perp$.
+
+So the last step is the dimension check.
+
+We have $\dim \operatorname{RM}(r,m)=\sum_{i=0}^{r}\binom{m}{i}$, so the dimension of the dual code is $2^m-\dim \operatorname{RM}(r,m)=\sum_{i=0}^{m}\binom{m}{i}-\sum_{i=0}^{r}\binom{m}{i}=\sum_{i=r+1}^{m}\binom{m}{i}$.
+
+Since $\binom{m}{i}=\binom{m}{m-i}$, we have $\sum_{i=r+1}^{m}\binom{m}{i}=\sum_{i=r+1}^{m}\binom{m}{m-i}=\sum_{i=0}^{m-r-1}\binom{m}{i}$.
+
+This is exactly the dimension of $\operatorname{RM}(m-r-1,m)$, so $\operatorname{RM}(r,m)^\perp=\operatorname{RM}(m-r-1,m)$.
+
\ No newline at end of file
diff --git a/content/CSE5313/_meta.js b/content/CSE5313/_meta.js
index 3ca12dc..b2a3452 100644
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -13,4 +13,5 @@ export default {
   CSE5313_L8: "CSE5313 Coding and information theory for data science (Lecture 8)",
   CSE5313_L9: "CSE5313 Coding and information theory for data science (Lecture 9)",
   CSE5313_L10: "CSE5313 Coding and information theory for data science (Recitation 10)",
+  CSE5313_L11: "CSE5313 Coding and information theory for data science (Recitation 11)",
 }
\ No newline at end of file
diff --git a/content/Math4201/Math4201_L17.md b/content/Math4201/Math4201_L17.md
new file mode 100644
index 0000000..9f376c2
--- /dev/null
+++ b/content/Math4201/Math4201_L17.md
@@ -0,0 +1,2 @@
+# Math4201 Topology I (Lecture 17)
+
diff --git a/content/Math4201/_meta.js b/content/Math4201/_meta.js
index bf60a58..c209da5 100644
--- a/content/Math4201/_meta.js
+++ b/content/Math4201/_meta.js
@@ -19,4 +19,5 @@ export default {
   Math4201_L14: "Topology I (Lecture 14)",
   Math4201_L15: "Topology I (Lecture 15)",
   Math4201_L16: "Topology I (Lecture 16)",
+  Math4201_L17: "Topology I (Lecture 17)",
 }