Scale of rewards and Q-values is unknown
### Deadly Triad in Reinforcement Learning
- Off-policy learning
  - learning the value of a policy different from the one used to collect the data, rather than directly learning the optimal policy
- Function approximation
  - usually with supervised learning
  - $Q(s,a)\gets f_\theta(s,a)$
- Bootstrapping
  - self-reference: the new estimate of the function is computed from the function itself
  - $Q(s,a)\gets r(s,a)+\gamma \max_{a'\in A} Q(s',a')$
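
The instability is easiest to see when all three ingredients meet in a single update: the target is computed from the same approximator it trains, under a greedy (off-policy) max. A minimal sketch of such a semi-gradient Q-learning step, assuming a linear approximator over a hypothetical feature map (all names here are illustrative, not from the notes):

```python
import numpy as np

n_features, n_actions = 8, 4
theta = np.zeros((n_actions, n_features))  # parameters of f_theta

def features(s):
    # Hypothetical feature map phi(s); a seeded RNG stands in for real features.
    return np.random.default_rng(s).standard_normal(n_features)

def q(s, a):
    # Function approximation: Q(s, a) is represented by f_theta(s, a).
    return theta[a] @ features(s)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the max evaluates the greedy policy, whatever
    # behavior policy actually produced the transition (s, a, r, s').
    # Bootstrapping: the target reuses the current estimate Q itself.
    target = r + gamma * max(q(s_next, b) for b in range(n_actions))
    # Semi-gradient step: regress Q(s, a) toward the moving target.
    theta[a] += alpha * (target - q(s, a)) * features(s)
```

Because the target moves with every update to $\theta$, the regression chases its own output; that feedback loop is exactly the divergence risk the triad names.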
### Stable Solutions for DQN
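
The hunk ends at this heading, so the notes' own list is not visible here. The standard stabilizers for DQN are experience replay and a periodically frozen target network (Mnih et al., 2015); a minimal sketch of both, assuming those two and reusing the illustrative names from the sketch above:

```python
import random
from collections import deque

replay = deque(maxlen=10_000)  # experience replay: breaks sample correlation
theta_target = theta.copy()    # frozen copy of the parameters

def q_frozen(s, a):
    # Target network: bootstrap from parameters that do not move with
    # every update, so the regression target stays fixed for a while.
    return theta_target[a] @ features(s)

def replay_step(batch_size=32, alpha=0.1, gamma=0.99):
    # Sample past transitions uniformly instead of learning on-stream.
    for s, a, r, s_next in random.sample(list(replay), min(batch_size, len(replay))):
        target = r + gamma * max(q_frozen(s_next, b) for b in range(n_actions))
        theta[a] += alpha * (target - q(s, a)) * features(s)

def sync_target():
    # Refresh the frozen copy every fixed number of steps.
    global theta_target
    theta_target = theta.copy()
```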