CSE510 Deep Reinforcement Learning (Lecture 18)
Model-based RL framework
Model Learning with High-Dimensional Observations
- Learning model in a latent space with observation reconstruction
- Learning model in a latent space without observation reconstruction
- Learning model in the observation space (i.e., videos)
Naive approach:
If we knew f(s_t, a_t) = s_{t+1} (or p(s_{t+1} | s_t, a_t) in the stochastic case), we could use the planning tools from last week.
So we can learn f(s_t, a_t) from data, and then plan through it.
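As a sketch of what "learn f(s_t, a_t) from data" can look like, the snippet below fits a small neural-network dynamics model by regression on collected transitions. This is a minimal sketch assuming PyTorch; the class name, network size, and training hyperparameters are illustrative, not from the lecture.

```python
# Minimal sketch of the supervised dynamics-fitting step (assumption: PyTorch;
# the dataset D of (s_t, a_t, s_{t+1}) tuples is already given as tensors).
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Deterministic model f(s_t, a_t) -> predicted s_{t+1}."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def fit_dynamics(model, states, actions, next_states, epochs=100, lr=1e-3):
    """Minimize sum_i ||f(s_i, a_i) - s_{i+1}||^2 over the collected data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = ((pred - next_states) ** 2).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```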
Model-based reinforcement learning version 0.5:
1. Run base policy \pi_0 (e.g., a random policy) to collect \mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}
2. Learn dynamics model f(s_t, a_t) to minimize \sum_{i} \|f(s_i, a_i) - s_{i+1}\|^2
3. Plan through f(s_t, a_t) to choose actions a_t
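A minimal sketch of the "plan through f" step, here using random shooting: sample open-loop action sequences, roll them out through the learned model, and execute the best one. `f` and `reward` are hypothetical callables standing in for the fitted dynamics model and the task reward; states and actions are NumPy arrays.

```python
# Random-shooting planner through a learned model f (sketch; `f` and `reward`
# are hypothetical placeholders, and the horizon/sample counts are illustrative).
import numpy as np

def plan_random_shooting(f, reward, s0, action_dim, horizon=15,
                         n_candidates=1000, action_low=-1.0, action_high=1.0):
    """Sample open-loop action sequences, evaluate them under the model,
    and return the first action of the highest-return sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        s = s0
        total = 0.0
        actions = np.random.uniform(action_low, action_high,
                                    size=(horizon, action_dim))
        for a in actions:
            total += reward(s, a)
            s = f(s, a)          # predicted next state from the learned model
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```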
Sometimes, this does work!
- Essentially how system identification works in classical robotics
- Some care should be taken to design a good base policy
- Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
However, the distribution mismatch problem becomes worse as we use more expressive model classes: the model is trained on states visited by \pi_0, but the planner drives the system into states the model has never seen.
Version 0.5: collect random samples, train dynamics, plan
- Pro: simple, no iterative procedure
- Con: distribution mismatch problem
Version 1.0: iteratively collect data, replan, collect data
- Pro: simple, solves distribution mismatch
- Con: an open-loop plan might perform poorly, especially in stochastic domains
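A sketch of the version 1.0 outer loop: collect data with the base policy, fit the model, plan, execute the whole open-loop plan, append the new transitions, and refit. Assumptions: a gym-style environment with the old 4-tuple step() API; `fit_dynamics`, `plan_actions`, and `random_action` are hypothetical placeholders for the fitting, planning, and exploration routines.

```python
# Version 1.0 sketch: iteratively collect data, refit the dynamics, replan.
def model_based_rl_v1(env, fit_dynamics, plan_actions, random_action,
                      n_iters=10, horizon=50):
    data = []                                   # D = {(s_t, a_t, s_{t+1})}
    # Seed the dataset with a base (random) policy.
    s = env.reset()
    for _ in range(horizon):
        a = random_action()
        s_next, _, done, _ = env.step(a)
        data.append((s, a, s_next))
        s = env.reset() if done else s_next
    for _ in range(n_iters):
        f = fit_dynamics(data)                  # learn dynamics on all data so far
        s = env.reset()
        plan = plan_actions(f, s, horizon)      # open-loop action sequence
        for a in plan:                          # execute the whole plan, keep the data
            s_next, _, done, _ = env.step(a)
            data.append((s, a, s_next))
            if done:
                break
            s = s_next
    return f
```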
Version 1.5: iteratively collect data using MPC (replan at each step)
- Pro: robust to small model errors
- Con: computationally expensive at runtime; requires an online planning algorithm to be available at every step
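The MPC variant only changes the execution loop: replan from the current true state at every step and execute just the first planned action. A minimal sketch under the same assumptions as above (gym-style env with the old 4-tuple step() API; `plan_first_action` could be the random-shooting planner sketched earlier).

```python
# Version 1.5 sketch: replan at every step (MPC) and execute only the first action.
def mpc_rollout(env, f, plan_first_action, episode_len=200):
    data = []
    s = env.reset()
    for _ in range(episode_len):
        a = plan_first_action(f, s)        # replan from the current true state
        s_next, _, done, _ = env.step(a)   # execute only the first planned action
        data.append((s, a, s_next))
        if done:
            break
        s = s_next
    return data                            # appended to D before refitting f
```

Because the plan is recomputed from the observed state, small model errors do not accumulate over the whole episode the way they do with a single open-loop plan.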
Version 2.0: backpropagate directly into policy
- Pro: computationally cheap at runtime
- Con: can be numerically unstable, especially in stochastic domains
- Solution: model-free RL + model-based RL
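A minimal sketch of version 2.0, backpropagating through a learned, differentiable model into the policy. Assumptions: PyTorch; `model` is a fitted f(s, a) -> s_{t+1} module such as the one sketched earlier; `reward_fn` is a differentiable reward returning a scalar tensor; the fixed start state, network sizes, and horizon are illustrative.

```python
# Version 2.0 sketch: unroll the learned model and optimize the policy by
# backpropagating the predicted return through the model.
import torch
import torch.nn as nn

def train_policy_through_model(model, reward_fn, state_dim, action_dim,
                               horizon=20, n_updates=500, lr=1e-3):
    for p in model.parameters():
        p.requires_grad_(False)                # the learned model is held fixed here
    policy = nn.Sequential(
        nn.Linear(state_dim, 64), nn.Tanh(),
        nn.Linear(64, action_dim), nn.Tanh(),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_updates):
        s = torch.zeros(1, state_dim)          # illustrative fixed start state
        total_reward = 0.0
        for _ in range(horizon):
            a = policy(s)
            total_reward = total_reward + reward_fn(s, a)  # scalar tensor assumed
            s = model(s, a)                    # gradients flow through the model
        loss = -total_reward                   # maximize predicted return
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

The long chain of model applications behaves like an unrolled RNN, which is where the numerical instability noted above comes from, especially in stochastic domains.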
Final version:
1. Run base policy \pi_0 (e.g., a random policy) to collect \mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}
2. Learn dynamics model f(s_t, a_t) to minimize \sum_{i} \|f(s_i, a_i) - s_{i+1}\|^2
3. Backpropagate through f(s_t, a_t) into the policy to optimize \pi_\theta(a_t | s_t)
4. Run the policy \pi_\theta(a_t | s_t) to collect more data \mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}
5. Go to step 2
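Putting the steps together, a sketch of the final loop. Assumptions: gym-style env with the old 4-tuple step() API; `fit_dynamics`, `train_policy_through_model`, and `random_action` stand in loosely for the routines sketched above.

```python
# Final-version sketch: alternate between collecting data (first with the base
# policy, then with pi_theta), refitting f, and backpropagating into the policy.
def model_based_rl_final(env, fit_dynamics, train_policy_through_model,
                         random_action, n_iters=10, steps_per_iter=200):
    data = []
    policy = None
    for _ in range(n_iters):
        # Steps 1 and 4: collect transitions with the current behavior policy.
        s = env.reset()
        for _ in range(steps_per_iter):
            a = random_action() if policy is None else policy(s)
            s_next, _, done, _ = env.step(a)
            data.append((s, a, s_next))
            s = env.reset() if done else s_next
        f = fit_dynamics(data)                      # step 2: learn f(s_t, a_t)
        policy = train_policy_through_model(f)      # step 3: backprop into pi_theta
    return policy
```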