fix typos
@@ -29,13 +29,18 @@ Scale of rewards and Q-values is unknown
 
 ### Deadly Triad in Reinforcement Learning
 
-Off-policy learning (learning the expected reward changes of policy change instead of the optimal policy)
-Function approximation (usually with supervised learning)
+Off-policy learning
 
-$Q(s,a)\gets f_\theta(s,a)$
+- (learning the expected reward changes of policy change instead of the optimal policy)
 
-Bootstrapping (self-reference)
+Function approximation
 
+- (usually with supervised learning)
+- $Q(s,a)\gets f_\theta(s,a)$
+
+Bootstrapping
+
+- (self-reference, update new function from itself)
 - $Q(s,a)\gets r(s,a)+\gamma \max_{a'\in A} Q(s',a')$
 
 ### Stable Solutions for DQN
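The three ingredients listed in this hunk interact in a single DQN-style update. Below is a minimal sketch of that interaction, assuming PyTorch and using illustrative names (`q_net`, `replay`, `td_update`) that are not part of this commit:

```python
# Minimal illustration of the "deadly triad" in one update step (hypothetical names).
# Assumes PyTorch; q_net plays the role of the function approximator f_theta for Q(s, a).
import random
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # f_theta(s) -> Q(s, .)
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Off-policy data: transitions (s, a, r, s') collected by some behaviour policy, kept in a replay buffer.
replay = [(torch.randn(4), random.randrange(2), random.random(), torch.randn(4)) for _ in range(1000)]

def td_update(batch_size: int = 32) -> None:
    batch = random.sample(replay, batch_size)          # off-policy sampling
    s  = torch.stack([b[0] for b in batch])
    a  = torch.tensor([b[1] for b in batch])
    r  = torch.tensor([b[2] for b in batch])
    s2 = torch.stack([b[3] for b in batch])

    # Bootstrapping: the target reuses the very network being trained.
    with torch.no_grad():
        target = r + gamma * q_net(s2).max(dim=1).values   # r(s,a) + gamma * max_a' Q(s',a')

    # Function approximation: regress f_theta(s,a) onto the bootstrapped target (supervised-style loss).
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each call mixes off-policy replay samples, a bootstrapped target built from the network itself, and supervised-style function approximation, which is exactly the combination the deadly triad refers to.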
@@ -81,7 +81,7 @@ Reducing variance using a baseline
 
 $$
 \begin{aligned}
-\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\
+\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)\right]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\
 &=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\
 &=0
 \end{aligned}
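The change above only adds the missing `\right]`; the identity itself can be sanity-checked numerically. A small sketch, assuming NumPy and a tabular softmax policy at a single state (all names illustrative, not from the commit), that evaluates $\mathbb{E}_{a\sim\pi_\theta(s,\cdot)}\left[\nabla_\theta\log\pi_\theta(s,a)B(s)\right]$ exactly:

```python
# Numerical check that a state-dependent baseline adds no bias:
# E_{a ~ pi_theta(s, .)}[grad_theta log pi_theta(s, a) * B(s)] = 0.
# Illustrative sketch with a softmax policy over logits theta at one fixed state.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
theta = rng.normal(size=n_actions)        # logits for a single state s
B_s = 3.7                                 # arbitrary baseline value B(s)

pi = np.exp(theta) / np.exp(theta).sum()  # pi_theta(s, a)

# grad_theta log pi_theta(s, a) for a softmax policy is e_a - pi; row a holds the gradient for action a.
grad_log_pi = np.eye(n_actions) - pi

# Exact expectation over actions: sum_a pi(s,a) * grad_theta log pi_theta(s,a) * B(s)
expectation = (pi[:, None] * grad_log_pi * B_s).sum(axis=0)
print(expectation)                        # ~ [0, 0, 0, 0] up to floating-point error
```

The printed vector is zero (up to floating-point error) because $\sum_{a\in A}\pi_\theta(s,a)=1$, so its gradient vanishes, matching the derivation in this hunk.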