# CSE510 Deep Reinforcement Learning (Lecture 18)

## Model-based RL framework

Model Learning with High-Dimensional Observations:

- Learning model in a latent space with observation reconstruction
- Learning model in a latent space without observation reconstruction
- Learning model in the observation space (i.e., videos)

### Naive approach

If we knew the dynamics $f(s_t, a_t) = s_{t+1}$ (or $p(s_{t+1} \mid s_t, a_t)$ in the stochastic case), we could use the planning tools from last week. So we can learn $f(s_t, a_t)$ from data, and _then_ plan through it.

Model-based reinforcement learning version **0.5**:

1. Run base policy $\pi_0$ (e.g., a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$
2. Learn dynamics model $f(s_t, a_t)$ to minimize $\sum_{i}\|f(s_i, a_i) - s_{i+1}\|^2$
3. Plan through $f(s_t, a_t)$ to choose actions $a_t$ (see the first code sketch below)

Sometimes, it does work!

- Essentially how system identification works in classical robotics
- Some care should be taken to design a good base policy
- Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters

However, the distribution mismatch problem becomes worse as we use more expressive model classes: planning with the model drives the agent into states that the random data never covered, exactly where the model is wrong.

Version 0.5: collect random samples, train dynamics, plan
- Pro: simple, no iterative procedure
- Con: distribution mismatch problem

Version 1.0: iteratively collect data, replan, collect data
- Pro: simple, solves the distribution mismatch problem
- Con: open-loop plan might perform poorly, especially in stochastic domains

Version 1.5: iteratively collect data using MPC (replan at each step; see the MPC sketch below)
- Pro: robust to small model errors
- Con: computationally expensive, since a planning algorithm must be run at every step

Version 2.0: backpropagate directly into the policy
- Pro: computationally cheap at runtime
- Con: can be numerically unstable, especially in stochastic domains
- Solution: combine model-free RL with model-based RL

Final version (sketched below):

1. Run base policy $\pi_0$ (e.g., a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$
2. Learn dynamics model $f(s_t, a_t)$ to minimize $\sum_{i}\|f(s_i, a_i) - s_{i+1}\|^2$
3. Backpropagate through $f(s_t, a_t)$ into the policy to optimize $\pi_\theta(a_t \mid s_t)$
4. Run the policy $\pi_\theta(a_t \mid s_t)$ to collect new transitions $\{(s_t, a_t, s_{t+1})\}_{t=0}^{T}$ and add them to $\mathcal{D}$
5. Go to step 2

## Model Learning with High-Dimensional Observations

- Learning model in a latent space with observation reconstruction
- Learning model in a latent space without observation reconstruction
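As a concrete illustration of the version 0.5 recipe above (collect random samples, fit the dynamics, plan), here is a minimal PyTorch sketch. It is not from the lecture: the Gymnasium-style `env`, the reward function `reward_fn(s, a)`, and the continuous actions scaled to $[-1, 1]$ are all assumptions made for illustration.

```python
# Minimal sketch of model-based RL version 0.5 (assumed Gymnasium-style env API).
import numpy as np
import torch


def collect_random_data(env, num_steps):
    """Step 1: run a random base policy pi_0 and record (s_t, a_t, s_{t+1})."""
    states, actions, next_states = [], [], []
    s, _ = env.reset()
    for _ in range(num_steps):
        a = env.action_space.sample()
        s_next, _, terminated, truncated, _ = env.step(a)
        states.append(s)
        actions.append(a)
        next_states.append(s_next)
        s = env.reset()[0] if (terminated or truncated) else s_next
    as_t = lambda x: torch.as_tensor(np.asarray(x), dtype=torch.float32)
    return as_t(states), as_t(actions), as_t(next_states)


def fit_dynamics(model, data, epochs=200, lr=1e-3):
    """Step 2: fit f(s, a) -> s_{t+1} by minimizing sum_i ||f(s_i, a_i) - s_{i+1}||^2."""
    states, actions, next_states = data
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(torch.cat([states, actions], dim=-1))
        loss = ((pred - next_states) ** 2).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def plan_random_shooting(model, reward_fn, s0, action_dim, horizon=15, num_seqs=1000):
    """Step 3: plan through the learned model by sampling random action sequences
    and returning the first action of the sequence with the best predicted return."""
    s = s0.unsqueeze(0).repeat(num_seqs, 1)
    action_seqs = 2 * torch.rand(num_seqs, horizon, action_dim) - 1  # actions in [-1, 1]
    returns = torch.zeros(num_seqs)
    with torch.no_grad():
        for t in range(horizon):
            returns += reward_fn(s, action_seqs[:, t])     # reward_fn returns a batch of scalars
            s = model(torch.cat([s, action_seqs[:, t]], dim=-1))
    return action_seqs[returns.argmax(), 0]
```

The dynamics model here could be as simple as a small MLP, e.g. `torch.nn.Sequential(torch.nn.Linear(state_dim + action_dim, 256), torch.nn.ReLU(), torch.nn.Linear(256, state_dim))`.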
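Version 1.5 turns the plan into closed-loop control: at every environment step we replan with the current model, execute only the first action, and periodically refit the dynamics on the aggregated dataset. The sketch below reuses the hypothetical helpers from the previous block and the same assumed Gymnasium-style API.

```python
# Sketch of version 1.5: MPC-style control that replans at every step and
# aggregates the newly visited transitions into the training set.
def mpc_iteration(env, model, reward_fn, dataset, action_dim,
                  num_outer_iters=10, steps_per_iter=200):
    states, actions, next_states = [list(x) for x in dataset]
    for _ in range(num_outer_iters):
        s, _ = env.reset()
        for _ in range(steps_per_iter):
            s_t = torch.as_tensor(s, dtype=torch.float32)
            # Replan from the current state at every step (MPC), execute only the first action.
            a = plan_random_shooting(model, reward_fn, s_t, action_dim)
            s_next, _, terminated, truncated, _ = env.step(a.numpy())
            states.append(s_t)
            actions.append(a)
            next_states.append(torch.as_tensor(s_next, dtype=torch.float32))
            s = env.reset()[0] if (terminated or truncated) else s_next
        # Refit the dynamics on old + newly collected data, then keep iterating.
        dataset = tuple(torch.stack(x) for x in (states, actions, next_states))
        fit_dynamics(model, dataset)
    return model, dataset
```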
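Version 2.0 and the final recipe replace explicit planning with a policy $\pi_\theta$ trained by backpropagating the predicted return through the learned dynamics. The sketch below assumes a differentiable `reward_fn(s, a)` and a deterministic `policy` network (both assumptions for illustration); only the policy parameters are updated, while gradients merely flow through the dynamics model. This is where the numerical instability noted above shows up in practice, which motivates combining it with model-free RL.

```python
# Sketch of the final recipe: backpropagate through the learned dynamics f(s, a)
# into the policy pi_theta (assumes reward_fn is differentiable in s and a).
def train_policy_through_model(policy, model, reward_fn, start_states,
                               horizon=15, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)  # only policy params are updated
    for _ in range(epochs):
        s = start_states            # batch of states sampled from the collected dataset
        total_reward = torch.zeros(())
        for _ in range(horizon):
            a = policy(s)                               # a_t = pi_theta(s_t)
            total_reward = total_reward + reward_fn(s, a).mean()
            s = model(torch.cat([s, a], dim=-1))        # s_{t+1} = f(s_t, a_t)
        loss = -total_reward        # maximize predicted return along the imagined rollout
        opt.zero_grad()
        loss.backward()             # gradient flows back through the dynamics model
        opt.step()
    return policy
```

After each round of policy optimization, the outer loop runs $\pi_\theta$ in the environment, adds the new transitions to $\mathcal{D}$, and refits the dynamics, matching steps 4 and 5 of the final version above.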