fix typos

Zheyuan Wu
2025-10-11 12:25:24 -05:00
parent 15a7be1dad
commit 29a5945f05
2 changed files with 10 additions and 5 deletions

@@ -29,13 +29,18 @@ Scale of rewards and Q-values is unknown
### Deadly Triad in Reinforcement Learning
Off-policy learning (learning the expected reward changes of policy change instead of the optimal policy)
Function approximation (usually with supervised learning)
Off-policy learning
$Q(s,a)\gets f_\theta(s,a)$
- (learning about a target policy, e.g. the greedy one, from data generated by a different behavior policy)
Bootstrapping (self-reference)
Function approximation
- (usually with supervised learning)
- $Q(s,a)\gets f_\theta(s,a)$
Bootstrapping
- (self-reference: the update target is computed from the current estimate of the same function)
- $Q(s,a)\gets r(s,a)+\gamma \max_{a'\in A} Q(s',a')$
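The three ingredients above can be seen together in a single Q-learning step. The following is a minimal sketch, not the notes' own implementation: it assumes a linear approximator $f_\theta(s,a)=\theta^\top\phi(s,a)$, and the feature vectors, learning rate, and helper names (`q_value`, `td_target`, `semi_gradient_update`) are hypothetical choices made only for illustration.

```python
def q_value(theta, features):
    """Function approximation: Q(s,a) = theta . phi(s,a) (linear, by assumption)."""
    return sum(t * f for t, f in zip(theta, features))


def td_target(theta, reward, next_features_per_action, gamma):
    """Bootstrapping: r(s,a) + gamma * max_a' Q(s',a'), using the current theta.

    The max over a' is also what makes the update off-policy: the target
    follows the greedy policy regardless of which policy chose the action.
    """
    best_next = max(q_value(theta, f) for f in next_features_per_action)
    return reward + gamma * best_next


def semi_gradient_update(theta, features, target, lr):
    """One regression-style step moving Q(s,a) toward the (fixed) target."""
    error = target - q_value(theta, features)
    return [t + lr * error * f for t, f in zip(theta, features)]


# Hypothetical transition (s, a, r, s') with two actions available in s'.
theta = [0.0, 0.0]
phi_sa = [1.0, 0.0]                    # phi(s, a)
phi_next = [[0.0, 1.0], [1.0, 1.0]]    # phi(s', a') for each a'

first_target = td_target(theta, 1.0, phi_next, gamma=0.9)
theta = semi_gradient_update(theta, phi_sa, first_target, lr=0.5)

# The same transition now yields a different target, because the target
# is computed from the parameters that were just updated.
second_target = td_target(theta, 1.0, phi_next, gamma=0.9)
```

Because `phi_sa` shares a feature with one of the next-state actions, updating `theta` toward the first target also shifts the second target. This moving-target effect is the kind of coupling that can make the combination unstable.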
### Stable Solutions for DQN