# CSE510 Deep Reinforcement Learning (Lecture 18)

## Model-based RL framework

Model Learning with High-Dimensional Observations:

- Learning a model in a latent space with observation reconstruction
- Learning a model in a latent space without observation reconstruction
- Learning a model in the observation space (i.e., videos)

### Naive approach

If we knew the dynamics $f(s_t,a_t)=s_{t+1}$ (or $p(s_{t+1}\mid s_t, a_t)$ in the stochastic case), we could use the planning tools from last week.

So we can learn $f(s_t,a_t)$ from data, and _then_ plan through it.

Model-based reinforcement learning version **0.5**:

1. Run a base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
3. Plan through $f(s_t,a_t)$ to choose actions $a_t$ (see the sketch below)
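
Below is a minimal sketch of version 0.5 in PyTorch. It is not the lecture's reference implementation: the environment is assumed to be a classic Gym-style one with continuous states and actions, the reward function `reward_fn(s, a)` is assumed to be known, and `DynamicsModel`, `collect_random_data`, `fit_dynamics`, and `plan_random_shooting` (random shooting standing in for whatever planner step 3 uses) are illustrative names of my own.

```python
import numpy as np
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """f(s_t, a_t) -> s_{t+1}, modeled as a small MLP."""
    def __init__(self, s_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, s_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def collect_random_data(env, steps):
    """Step 1: run the random base policy pi_0 and record (s_t, a_t, s_{t+1})."""
    data, s = [], env.reset()
    for _ in range(steps):
        a = env.action_space.sample()
        s_next, _, done, _ = env.step(a)   # classic 4-tuple Gym API assumed
        data.append((s, a, s_next))
        s = env.reset() if done else s_next
    return data

def fit_dynamics(model, data, epochs=200, lr=1e-3):
    """Step 2: minimize sum_i ||f(s_i, a_i) - s_{i+1}||^2 over the dataset."""
    S, A, S1 = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                for x in zip(*data))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((model(S, A) - S1) ** 2).sum(dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def plan_random_shooting(model, s0, a_dim, reward_fn, horizon=15, n_candidates=1000):
    """Step 3: sample candidate action sequences, roll them out through f,
    and return the first action of the highest-return sequence."""
    with torch.no_grad():
        A = 2 * torch.rand(n_candidates, horizon, a_dim) - 1   # actions in [-1, 1]
        s = torch.as_tensor(s0, dtype=torch.float32).repeat(n_candidates, 1)
        ret = torch.zeros(n_candidates)
        for t in range(horizon):
            ret = ret + reward_fn(s, A[:, t])
            s = model(s, A[:, t])
        return A[ret.argmax(), 0]
```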

Sometimes, it does work!

- Essentially how system identification works in classical robotics
- Some care should be taken to design a good base policy
- Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters

However, the distribution mismatch problem becomes worse as we use more expressive model classes.

Version 0.5: collect random samples, train dynamics, plan

- Pro: simple, no iterative procedure
- Con: distribution mismatch problem

Version 1.0: iteratively collect data, replan, collect data

- Pro: simple, solves distribution mismatch
- Con: open-loop plans might perform poorly, esp. in stochastic domains

Version 1.5: iteratively collect data using MPC (replan at each step)

- Pro: robust to small model errors
- Con: computationally expensive, but a planning algorithm is already available (see the MPC loop sketched below)
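
Version 1.5 then wraps the same hypothetical helpers in a model-predictive control loop: replan from the current state at every step, execute only the first planned action, and keep refitting the model on the growing dataset. Again only a sketch, continuing the code above.

```python
def mpc_rollout(env, model, reward_fn, a_dim, episode_len=200):
    """Version 1.5: replan at every step (MPC) and log the visited transitions."""
    s, new_data = env.reset(), []
    for _ in range(episode_len):
        a = plan_random_shooting(model, s, a_dim, reward_fn).numpy()  # replan here
        s_next, _, done, _ = env.step(a)
        new_data.append((s, a, s_next))
        if done:
            break
        s = s_next
    return new_data

# Outer loop for versions 1.0/1.5: alternate refitting f with on-policy data collection.
# data = collect_random_data(env, steps=5000)
# for _ in range(num_iterations):
#     fit_dynamics(model, data)
#     data += mpc_rollout(env, model, reward_fn, a_dim)
```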

Version 2.0: backpropagate directly into the policy

- Pro: computationally cheap at runtime
- Con: can be numerically unstable, especially in stochastic domains
- Solution: model-free RL + model-based RL

Final version:

1. Run a base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
3. Backpropagate through $f(s_t,a_t)$ into the policy to optimize $\pi_\theta(a_t \mid s_t)$ (sketched below)
4. Run the policy $\pi_\theta(a_t \mid s_t)$ to collect additional transitions $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
5. Go to step 2
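
The heart of the final version is step 3. Continuing the same PyTorch sketch, one way to implement it is to unroll the learned model under the current policy for a short horizon and ascend the predicted return; `Policy`, the deterministic-policy choice, and the horizon are my own simplifying assumptions, and this differentiable rollout is exactly where the numerical instability mentioned under version 2.0 tends to appear.

```python
class Policy(nn.Module):
    """pi_theta: s_t -> a_t, kept deterministic for simplicity."""
    def __init__(self, s_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, a_dim), nn.Tanh(),   # squash actions into [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

def backprop_through_model(policy, model, reward_fn, start_states,
                           horizon=15, epochs=200, lr=1e-3):
    """Step 3: unroll s_{t+1} = f(s_t, pi_theta(s_t)) and backprop the return into theta."""
    for p in model.parameters():           # keep the learned dynamics fixed here
        p.requires_grad_(False)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        s, total_reward = start_states, 0.0
        for _ in range(horizon):
            a = policy(s)
            total_reward = total_reward + reward_fn(s, a).mean()
            s = model(s, a)                # gradients flow through the model rollout
        loss = -total_reward               # maximize return <=> minimize its negative
        opt.zero_grad(); loss.backward(); opt.step()
```

Steps 1, 2, 4, and 5 then form the same outer loop as before: refit $f(s_t,a_t)$ on everything collected so far, run $\pi_\theta$ in the real environment to gather fresh transitions, and repeat.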

## Model Learning with High-Dimensional Observations

- Learning a model in a latent space with observation reconstruction
- Learning a model in a latent space without observation reconstruction (a sketch of both variants follows below)
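
As a rough illustration of the first bullet (still in the same PyTorch style), a latent-space model with observation reconstruction can pair an encoder and decoder with a latent transition model; the flat MLP layers, the particular loss terms, and their equal weighting are all simplifying assumptions rather than the architecture used in the lecture.

```python
class LatentDynamicsModel(nn.Module):
    """Encode o_t -> z_t, predict z_{t+1} from (z_t, a_t), decode back to observations."""
    def __init__(self, o_dim, a_dim, z_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(o_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, o_dim))
        self.dynamics = nn.Sequential(nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, z_dim))

    def loss(self, o, a, o_next):
        z = self.encoder(o)
        z_next_pred = self.dynamics(torch.cat([z, a], dim=-1))
        recon = ((self.decoder(z) - o) ** 2).mean()                  # reconstruct o_t
        pred = ((self.decoder(z_next_pred) - o_next) ** 2).mean()    # predict o_{t+1}
        latent = ((z_next_pred - self.encoder(o_next)) ** 2).mean()  # latent consistency
        return recon + pred + latent
```

Dropping the decoder and the two reconstruction terms, and supervising the latent space only through latent consistency plus auxiliary targets such as reward prediction, gives the reconstruction-free variant in the second bullet.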