diff --git a/content/CSE510/CSE510_L15.md b/content/CSE510/CSE510_L15.md
index 6cd3509..1aa6254 100644
--- a/content/CSE510/CSE510_L15.md
+++ b/content/CSE510/CSE510_L15.md
@@ -1,8 +1,148 @@
 # CSE510 Deep Reinforcement Learning (Lecture 15)
+## Motivation
+
+Policy gradient methods work with stochastic policies
+
+$$
+\pi_\theta(a|s) = P[a|s,\theta]
+$$
+
+Advantages
+
+- Can potentially learn optimal solutions in multi-agent settings
+- Can handle partially observable settings
+- Provide sufficient exploration
+
+Disadvantages
+
+- Cannot learn a deterministic policy
+- Extension to continuous action spaces is not straightforward
+
+### On-Policy vs. Off-Policy Policy Gradients
+
+On-policy policy gradients:
+
+- Training samples are collected according to the current policy.
+
+Off-policy algorithms:
+
+- Enable the reuse of past experience.
+- Samples can be collected by an exploratory behavior policy.
+
+How do we design an off-policy policy gradient?
+
+- Use importance sampling.
+
 ## Off-Policy Actor-Critic (OffPAC)
+A stochastic behavior policy is used for exploration.
+
+- It is used to collect data and is labelled $\beta(a|s)$.
+
+The objective function is:
+
+$$
+\begin{aligned}
+J(\theta)=\mathbb{E}_{s\sim d^\beta}[V^{\pi}(s)]
+&= \sum_{s\in S} d^\beta(s) \sum_{a\in A} \pi_\theta(a|s) Q^{\pi}(s,a)
+\end{aligned}
+$$
+
+where $d^\beta(s)$ is the stationary state distribution under the behavior policy $\beta(a|s)$.
+
+### Solving the Off-Policy Policy Gradient
+
+$$
+\begin{aligned}
+\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \pi_\theta(a|s) Q^{\pi}(s,a)\right]\\
+&= \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \nabla_\theta \pi_\theta(a|s) Q^{\pi}(s,a)+\pi_\theta(a|s) \nabla_\theta Q^{\pi}(s,a)\right]\\
+&\approx \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \nabla_\theta \pi_\theta(a|s) Q^{\pi}(s,a)\right]\\
+&= \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \beta(a|s) \frac{1}{\beta(a|s)} \nabla_\theta \pi_\theta(a|s) Q^{\pi}(s,a)\right]\\
+&= \mathbb{E}_{\beta}\left[\frac{1}{\beta(a|s)} \nabla_\theta \pi_\theta(a|s) Q^{\pi}(s,a)\right]\\
+&= \mathbb{E}_{\beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)} Q^{\pi}(s,a)\nabla_\theta \log \pi_\theta(a|s)\right]
+\end{aligned}
+$$
+
+The third step drops the term $\pi_\theta(a|s)\nabla_\theta Q^{\pi}(s,a)$; this is the approximation used by OffPAC.
+
+To compute the off-policy policy gradient, $Q^{\pi}(s,a)$ must be estimated from data collected by $\beta$.
+
+Common solutions:
+
+- Importance sampling
+- Tree backup
+- Gradient temporal-difference learning
+- Retrace [Munos et al., 2016], [IMPALA](https://arxiv.org/abs/1802.01561)
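+Given such an estimate of $Q^{\pi}(s,a)$, the last line of the derivation can be computed directly from a batch of samples gathered by $\beta$. Below is a minimal PyTorch sketch of that step, not part of the original notes: the tiny categorical `policy` network, the recorded `beta_probs`, and the placeholder `q_values` are all assumptions made for illustration.
+
+```python
+import torch
+import torch.nn as nn
+
+# Minimal sketch; all sizes and tensors below are assumed, not from the lecture.
+policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # logits over 2 actions
+
+states = torch.randn(32, 4)            # s ~ d^beta
+actions = torch.randint(0, 2, (32,))   # a ~ beta(.|s), recorded at collection time
+beta_probs = torch.full((32,), 0.5)    # beta(a|s), recorded at collection time
+q_values = torch.randn(32)             # placeholder critic estimates of Q^pi(s, a)
+
+dist = torch.distributions.Categorical(logits=policy(states))
+log_pi = dist.log_prob(actions)                 # log pi_theta(a|s)
+rho = (log_pi.exp() / beta_probs).detach()      # importance ratio pi_theta/beta, treated as a constant
+
+# Surrogate loss whose gradient is the negative of E_beta[(pi/beta) * Q * grad log pi].
+loss = -(rho * q_values * log_pi).mean()
+loss.backward()                                 # gradients now estimate the off-policy policy gradient
+```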
+
+### Importance Sampling
+
+Assume that samples come in the form of episodes.
+
+Let $M$ be the number of episodes containing $(s,a)$, and let $t_m$ be the first time at which $(s,a)$ appears in episode $m$.
+
+The first-visit importance sampling estimator of $Q^{\pi}(s,a)$ is:
+
+$$
+Q^{IS}(s,a)\coloneqq \frac{1}{M}\sum_{m=1}^M R_m w_m
+$$
+
+where $R_m$ is the return following $(s,a)$ in episode $m$,
+
+$$
+R_m\coloneqq r_{t_m +1}+\gamma r_{t_m +2}+\cdots+\gamma^{T_m-t_m -1} r_{T_m}
+$$
+
+and $w_m$ is the importance sampling weight:
+
+$$
+w_m\coloneqq \frac{\pi(a_{t_m}|s_{t_m})}{\beta(a_{t_m}|s_{t_m})}\frac{\pi(a_{t_m+1}|s_{t_m+1})}{\beta(a_{t_m+1}|s_{t_m+1})}\cdots\frac{\pi(a_{T_m-1}|s_{T_m-1})}{\beta(a_{T_m-1}|s_{T_m-1})}
+$$
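+A small Python sketch of $Q^{IS}$ follows, added for illustration only: the episode format, a list of `(state, action, reward_next, pi_prob, beta_prob)` tuples with both action probabilities recorded at collection time, is an assumption.
+
+```python
+# Illustrative sketch of the first-visit importance sampling estimator Q^IS(s, a).
+# Each episode is assumed to be a list of tuples
+# (state, action, reward_next, pi_prob, beta_prob).
+
+def q_first_visit_is(episodes, s, a, gamma=0.99):
+    estimates = []
+    for ep in episodes:
+        # t_m: first time (s, a) appears in this episode, if at all
+        t_m = next((t for t, step in enumerate(ep) if step[0] == s and step[1] == a), None)
+        if t_m is None:
+            continue
+        R_m, w_m = 0.0, 1.0
+        for k, (_, _, r_next, pi_p, beta_p) in enumerate(ep[t_m:]):
+            R_m += (gamma ** k) * r_next   # r_{t_m+1} + gamma * r_{t_m+2} + ...
+            w_m *= pi_p / beta_p           # one ratio per action from t_m to T_m - 1
+        estimates.append(R_m * w_m)
+    # average over the M episodes that contain (s, a)
+    return sum(estimates) / len(estimates) if estimates else 0.0
+```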
+
+### Per-decision algorithm
+
+Consider the terms that appear in the importance sampling estimator:
+
+$$
+R_m w_m=\sum_{i=t_m+1}^{T_m}\gamma^{i-t_m-1} r_i \frac{\pi(a_{t_m}|s_{t_m})}{\beta(a_{t_m}|s_{t_m})}\cdots \frac{\pi(a_{i-1}|s_{i-1})}{\beta(a_{i-1}|s_{i-1})}\frac{\pi(a_{i}|s_{i})}{\beta(a_{i}|s_{i})}\cdots \frac{\pi(a_{T_m-1}|s_{T_m-1})}{\beta(a_{T_m-1}|s_{T_m-1})}
+$$
+
+Intuitively, $r_i$ does not depend on the actions taken at time $i$ or later, so the corresponding ratios should not weight it.
+
+This gives the per-decision importance sampling estimator:
+
+$$
+Q^{PD}(s,a)\coloneqq \frac{1}{M}\sum_{m=1}^M \sum_{k=1}^{T_m-t_m} \gamma^{k-1} r_{t_m+k}\prod_{i=t_m}^{t_m+k-1} \frac{\pi(a_{i}|s_{i})}{\beta(a_{i}|s_{i})}
+$$
+
+The per-decision importance sampling estimator is a consistent and unbiased estimator of $Q^{\pi}(s,a)$.
+
+Proof as exercise. Hints:
+
+- Show that the expectation of $Q^{PD}(s,a)$ is the same as that of $Q^{IS}(s,a)$.
+- $Q^{IS}(s,a)$ is a consistent and unbiased estimator of $Q^{\pi}(s,a)$.
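+For comparison, here is a matching sketch of $Q^{PD}$ under the same assumed episode format as the earlier $Q^{IS}$ sketch; running both estimators on the same simulated episodes and comparing their averages is an informal way to sanity-check the first hint.
+
+```python
+# Illustrative sketch of the per-decision importance sampling estimator Q^PD(s, a).
+# Same assumed episode format: (state, action, reward_next, pi_prob, beta_prob).
+
+def q_per_decision_is(episodes, s, a, gamma=0.99):
+    estimates = []
+    for ep in episodes:
+        t_m = next((t for t, step in enumerate(ep) if step[0] == s and step[1] == a), None)
+        if t_m is None:
+            continue
+        total, ratio = 0.0, 1.0
+        for k, (_, _, r_next, pi_p, beta_p) in enumerate(ep[t_m:], start=1):
+            ratio *= pi_p / beta_p                        # product of ratios up to time t_m + k - 1
+            total += (gamma ** (k - 1)) * r_next * ratio  # r_{t_m+k}, weighted only by the decisions that influence it
+        estimates.append(total)
+    return sum(estimates) / len(estimates) if estimates else 0.0
+```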
+
 ## Deterministic Policy Gradient (DPG)
 
-## Deep Deterministic Policy Gradient (DDPG)
-§ Extensions of DDPG
\ No newline at end of file
+The objective function is:
+
+$$
+J(\theta)=\int_{s\in S} \rho^{\mu}(s) r(s,\mu_\theta(s))\, ds
+$$
+
+where $\rho^{\mu}(s)$ is the stationary state distribution under the deterministic policy $\mu_\theta$.
+
+The proof follows along the same lines as the standard policy gradient theorem:
+
+$$
+\nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}\left[\nabla_\theta Q^{\mu_\theta}(s,\mu_\theta(s))\right]=\mathbb{E}_{s\sim \rho^{\mu}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\right]
+$$
+
+### Issues for DPG
+
+The formulation up to this point can only use on-policy data.
+
+
+## Deep Deterministic Policy Gradient (DDPG)
\ No newline at end of file
diff --git a/content/Math4501/Math4501_L1.md b/content/Swap/Math4501/Math4501_L1.md
similarity index 100%
rename from content/Math4501/Math4501_L1.md
rename to content/Swap/Math4501/Math4501_L1.md
diff --git a/content/Math4501/Math4501_L2.md b/content/Swap/Math4501/Math4501_L2.md
similarity index 100%
rename from content/Math4501/Math4501_L2.md
rename to content/Swap/Math4501/Math4501_L2.md
diff --git a/content/Math4501/Math4501_L3.md b/content/Swap/Math4501/Math4501_L3.md
similarity index 100%
rename from content/Math4501/Math4501_L3.md
rename to content/Swap/Math4501/Math4501_L3.md
diff --git a/content/Math4501/_meta.js b/content/Swap/Math4501/_meta.js
similarity index 100%
rename from content/Math4501/_meta.js
rename to content/Swap/Math4501/_meta.js
diff --git a/content/Math4501/index.md b/content/Swap/Math4501/index.md
similarity index 100%
rename from content/Math4501/index.md
rename to content/Swap/Math4501/index.md
diff --git a/content/_meta.js b/content/_meta.js
index fd1070c..427baf6 100644
--- a/content/_meta.js
+++ b/content/_meta.js
@@ -47,12 +47,6 @@ export default {
       timestamp: true,
     }
   },
-  Math4501: {
-    type: 'page',
-    theme:{
-      timestamp: true,
-    }
-  },
   Math416: {
     type: 'page',
     theme:{