CSE510 Deep Reinforcement Learning (Lecture 18)

Model-based RL framework

Model Learning with High-Dimensional Observations

  • Learning model in a latent space with observation reconstruction
  • Learning model in a latent space without observation reconstruction
  • Learning model directly in the observation space (e.g., video prediction)

Naive approach:

If we knew the dynamics f(s_t,a_t)=s_{t+1} (or p(s_{t+1} | s_t, a_t) in the stochastic case), we could use the planning tools from last week.

So the idea is to learn f(s_t,a_t) from data and then plan through it.

Model-based reinforcement learning version 0.5:

  1. Run base policy \pi_0 (e.g., a random policy) to collect \mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}
  2. Learn dynamics model f(s_t,a_t) to minimize \sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2
  3. Plan through f(s_t,a_t) to choose action a_t
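
Below is a minimal sketch of version 0.5 in PyTorch, offered as an illustration rather than the lecture's reference implementation: `env` is assumed to follow the Gymnasium API with a continuous action space, and `reward_fn` is assumed to be a known, batched reward function; neither comes from the notes. Step 2 fits an MLP to the squared-error objective above, and step 3 plans with simple random shooting.

```python
import numpy as np
import torch
import torch.nn as nn

# Step 1: run a random base policy pi_0 to collect transitions (s, a, s').
def collect_random_data(env, num_steps):
    data = []
    s, _ = env.reset()
    for _ in range(num_steps):
        a = env.action_space.sample()
        s_next, _, terminated, truncated, _ = env.step(a)
        data.append((s, a, s_next))
        s = env.reset()[0] if (terminated or truncated) else s_next
    return data

# Step 2: fit f(s, a) ~ s' by minimizing sum_i ||f(s_i, a_i) - s_{i+1}||^2.
def fit_dynamics(data, s_dim, a_dim, epochs=200, lr=1e-3):
    f = nn.Sequential(nn.Linear(s_dim + a_dim, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, s_dim))
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    S  = torch.tensor(np.array([d[0] for d in data]), dtype=torch.float32)
    A  = torch.tensor(np.array([d[1] for d in data]), dtype=torch.float32)
    S1 = torch.tensor(np.array([d[2] for d in data]), dtype=torch.float32)
    for _ in range(epochs):
        loss = ((f(torch.cat([S, A], dim=-1)) - S1) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return f

# Step 3: plan through f with random shooting: sample candidate action
# sequences, roll them out under the model, return the best first action.
@torch.no_grad()
def plan(f, s, reward_fn, a_dim, horizon=15, n_candidates=1000):
    s_t = torch.tensor(s, dtype=torch.float32).repeat(n_candidates, 1)
    A = torch.rand(n_candidates, horizon, a_dim) * 2 - 1  # actions in [-1, 1]
    total_r = torch.zeros(n_candidates)
    for t in range(horizon):
        total_r += reward_fn(s_t, A[:, t])
        s_t = f(torch.cat([s_t, A[:, t]], dim=-1))
    return A[total_r.argmax(), 0].numpy()
```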

Sometimes, this does work!

  • Essentially how system identification works in classical robotics
  • Some care should be taken to design a good base policy
  • Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters

However, the distribution mismatch problem becomes worse as we use more expressive model classes.

Version 0.5: collect random samples, train dynamics, plan

  • Pro: simple, no iterative procedure
  • Con: distribution mismatch problem

Version 1.0: iteratively collect data, replan, collect data

  • Pro: simple, solves distribution mismatch
  • Con: an open-loop plan might perform poorly, especially in stochastic domains

Version 1.5: iteratively collect data using MPC (replan at each step)

  • Pro: robust to small model errors
  • Con: computationally expensive at runtime, since a planning algorithm must be available and re-run at every step
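
A minimal sketch of version 1.5, reusing the hypothetical `fit_dynamics`, `plan`, and `reward_fn` from the sketch above: at every real step the controller replans from the current state, executes only the first planned action, and logs the resulting transition so the model can be refit on on-policy data.

```python
def mpc_rollout(env, f, data, reward_fn, a_dim, max_steps=200):
    """One episode of model-predictive control: replan at each step and
    execute only the first action of the plan."""
    s, _ = env.reset()
    for _ in range(max_steps):
        a = plan(f, s, reward_fn, a_dim)          # replan from the current state
        s_next, _, terminated, truncated, _ = env.step(a)
        data.append((s, a, s_next))               # on-policy data for refitting f
        s = s_next
        if terminated or truncated:
            break
    return data

# Outer loop (the "iteratively collect data" part of versions 1.0/1.5):
# for _ in range(num_iterations):
#     f = fit_dynamics(data, s_dim, a_dim)
#     data = mpc_rollout(env, f, data, reward_fn, a_dim)
```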

Version 2.0: backpropagate directly into policy

  • Pro: computationally cheap at runtime
  • Con: can be numerically unstable, especially in stochastic domains
  • Solution: model-free RL + model-based RL
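
The notes do not spell out how the two are combined; one common instantiation (a Dyna-style scheme, offered here as an assumption rather than the lecture's prescription) uses the learned model only to generate short synthetic rollouts that feed an ordinary model-free learner's replay buffer, which sidesteps backpropagating through long model unrolls.

```python
@torch.no_grad()
def generate_synthetic_transitions(f, reward_fn, real_states, policy,
                                   rollout_len=1, noise_std=0.1):
    """Dyna-style augmentation: branch short imagined rollouts off real states
    and hand the synthetic (s, a, r, s') tuples to a model-free learner."""
    synthetic = []
    s = real_states                               # (batch, s_dim) states sampled from D
    for _ in range(rollout_len):
        a = policy(s)
        a = a + noise_std * torch.randn_like(a)   # exploration noise
        s_next = f(torch.cat([s, a], dim=-1))     # imagined next state from the model
        synthetic.append((s, a, reward_fn(s, a), s_next))
        s = s_next
    return synthetic
```

Keeping the imagined rollouts short limits how far model errors can compound, which is exactly the instability the bullet above warns about.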

Final version:

  1. Run base policy \pi_0 (e.g., a random policy) to collect \mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}
  2. Learn dynamics model f(s_t,a_t) to minimize \sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2
  3. Backpropagate through f(s_t,a_t) into the policy to optimize \pi_\theta(a_t \mid s_t)
  4. Run the policy \pi_\theta(a_t \mid s_t) to collect new transitions \{(s_t, a_t, s_{t+1})\} and add them to \mathcal{D}
  5. Go to step 2
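
A minimal sketch of step 3 under the same assumptions as before (a learned model `f`, a differentiable batched `reward_fn`, and a deterministic policy network `policy`, none of which are fixed by the notes): the policy is unrolled through the model for a short horizon and updated by backpropagating the negative total reward.

```python
def improve_policy(f, policy, reward_fn, start_states, horizon=15,
                   iters=100, lr=1e-3):
    """Step 3: backpropagate through the learned dynamics f into pi_theta.
    Only the policy's parameters are updated; f is treated as fixed."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(iters):
        s = start_states                   # (batch, s_dim) states sampled from D
        total_reward = 0.0
        for _ in range(horizon):
            a = policy(s)                  # action from pi_theta at the model state
            total_reward = total_reward + reward_fn(s, a).mean()
            s = f(torch.cat([s, a], dim=-1))   # gradient flows through the model
        loss = -total_reward               # maximize reward = minimize its negative
        opt.zero_grad(); loss.backward(); opt.step()
    return policy

# Steps 4-5: run pi_theta in the real environment, append the new
# transitions to D, refit f (step 2), and repeat.
```

Long horizons tend to make these gradients explode or vanish, which is the numerical instability noted under version 2.0.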

Model Learning with High-Dimensional Observations

  • Learning model in a latent space with observation reconstruction
  • Learning model in a latent space without observation reconstruction