CSE510 Deep Reinforcement Learning (Lecture 18)
Model-based RL framework
Model Learning with High-Dimensional Observations
- Learning model in a latent space with observation reconstruction
- Learning model in a latent space without observation reconstruction
- Learning model in the observation space (i.e., videos)
Naive approach:
If we knew f(s_t, a_t) = s_{t+1} (or p(s_{t+1} | s_t, a_t) in the stochastic case), we could use the planning tools from last week.
So we can learn f(s_t, a_t) from data, and then plan through it.
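As a sketch of what "learn f(s_t, a_t) from data" can look like, the snippet below fits a small neural-network dynamics model by regression on collected transitions. This is a minimal sketch assuming PyTorch; the class name, network size, and training hyperparameters are illustrative, not from the lecture.

```python
# Minimal sketch of the supervised dynamics-fitting step (assumption: PyTorch;
# the dataset D of (s_t, a_t, s_{t+1}) tuples is already given as tensors).
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Deterministic model f(s_t, a_t) -> predicted s_{t+1}."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def fit_dynamics(model, states, actions, next_states, epochs=100, lr=1e-3):
    """Minimize sum_i ||f(s_i, a_i) - s_{i+1}||^2 over the collected data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = ((pred - next_states) ** 2).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```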
Model-based reinforcement learning version 0.5:
1. Run base policy \pi_0 (e.g., a random policy) to collect \mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}
2. Learn dynamics model f(s_t, a_t) to minimize \sum_{i} \|f(s_i, a_i) - s_{i+1}\|^2
3. Plan through f(s_t, a_t) to choose actions a_t
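A minimal sketch of the "plan through f" step, here using random shooting: sample open-loop action sequences, roll them out through the learned model, and execute the best one. `f` and `reward` are hypothetical callables standing in for the fitted dynamics model and the task reward; states and actions are NumPy arrays.

```python
# Random-shooting planner through a learned model f (sketch; `f` and `reward`
# are hypothetical placeholders, and the horizon/sample counts are illustrative).
import numpy as np

def plan_random_shooting(f, reward, s0, action_dim, horizon=15,
                         n_candidates=1000, action_low=-1.0, action_high=1.0):
    """Sample open-loop action sequences, evaluate them under the model,
    and return the first action of the highest-return sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        s = s0
        total = 0.0
        actions = np.random.uniform(action_low, action_high,
                                    size=(horizon, action_dim))
        for a in actions:
            total += reward(s, a)
            s = f(s, a)          # predicted next state from the learned model
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```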
Sometimes, this does work!
- Essentially how system identification works in classical robotics
- Some care should be taken to design a good base policy
- Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
However, the distribution mismatch problem becomes worse as we use more expressive model classes: the model is trained on states visited by \pi_0, but the planner drives the system into states the model has never seen.
Version 0.5: collect random samples, train dynamics, plan
- Pro: simple, no iterative procedure
- Con: distribution mismatch problem
Version 1.0: iteratively collect data, replan, collect data
- Pro: simple, solves distribution mismatch
- Con: an open-loop plan might perform poorly, especially in stochastic domains
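A sketch of the version 1.0 outer loop: collect data with the base policy, fit the model, plan, execute the whole open-loop plan, append the new transitions, and refit. Assumptions: a gym-style environment with the old 4-tuple step() API; `fit_dynamics`, `plan_actions`, and `random_action` are hypothetical placeholders for the fitting, planning, and exploration routines.

```python
# Version 1.0 sketch: iteratively collect data, refit the dynamics, replan.
def model_based_rl_v1(env, fit_dynamics, plan_actions, random_action,
                      n_iters=10, horizon=50):
    data = []                                   # D = {(s_t, a_t, s_{t+1})}
    # Seed the dataset with a base (random) policy.
    s = env.reset()
    for _ in range(horizon):
        a = random_action()
        s_next, _, done, _ = env.step(a)
        data.append((s, a, s_next))
        s = env.reset() if done else s_next
    for _ in range(n_iters):
        f = fit_dynamics(data)                  # learn dynamics on all data so far
        s = env.reset()
        plan = plan_actions(f, s, horizon)      # open-loop action sequence
        for a in plan:                          # execute the whole plan, keep the data
            s_next, _, done, _ = env.step(a)
            data.append((s, a, s_next))
            if done:
                break
            s = s_next
    return f
```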
Version 1.5: iteratively collect data using MPC (replan at each step)
- Pro: robust to small model errors
- Con: computationally expensive at runtime; requires an online planning algorithm to be available at every step
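The MPC variant only changes the execution loop: replan from the current true state at every step and execute just the first planned action. A minimal sketch under the same assumptions as above (gym-style env with the old 4-tuple step() API; `plan_first_action` could be the random-shooting planner sketched earlier).

```python
# Version 1.5 sketch: replan at every step (MPC) and execute only the first action.
def mpc_rollout(env, f, plan_first_action, episode_len=200):
    data = []
    s = env.reset()
    for _ in range(episode_len):
        a = plan_first_action(f, s)        # replan from the current true state
        s_next, _, done, _ = env.step(a)   # execute only the first planned action
        data.append((s, a, s_next))
        if done:
            break
        s = s_next
    return data                            # appended to D before refitting f
```

Because the plan is recomputed from the observed state, small model errors do not accumulate over the whole episode the way they do with a single open-loop plan.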
Version 2.0: backpropagate directly into policy
- Pro: computationally cheap at runtime
- Con: can be numerically unstable, especially in stochastic domains
- Solution: model-free RL + model-based RL
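A minimal sketch of version 2.0, backpropagating through a learned, differentiable model into the policy. Assumptions: PyTorch; `model` is a fitted f(s, a) -> s_{t+1} module such as the one sketched earlier; `reward_fn` is a differentiable reward returning a scalar tensor; the fixed start state, network sizes, and horizon are illustrative.

```python
# Version 2.0 sketch: unroll the learned model and optimize the policy by
# backpropagating the predicted return through the model.
import torch
import torch.nn as nn

def train_policy_through_model(model, reward_fn, state_dim, action_dim,
                               horizon=20, n_updates=500, lr=1e-3):
    for p in model.parameters():
        p.requires_grad_(False)                # the learned model is held fixed here
    policy = nn.Sequential(
        nn.Linear(state_dim, 64), nn.Tanh(),
        nn.Linear(64, action_dim), nn.Tanh(),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_updates):
        s = torch.zeros(1, state_dim)          # illustrative fixed start state
        total_reward = 0.0
        for _ in range(horizon):
            a = policy(s)
            total_reward = total_reward + reward_fn(s, a)  # scalar tensor assumed
            s = model(s, a)                    # gradients flow through the model
        loss = -total_reward                   # maximize predicted return
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

The long chain of model applications behaves like an unrolled RNN, which is where the numerical instability noted above comes from, especially in stochastic domains.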
Final version:
1. Run base policy \pi_0 (e.g., a random policy) to collect \mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}
2. Learn dynamics model f(s_t, a_t) to minimize \sum_{i} \|f(s_i, a_i) - s_{i+1}\|^2
3. Backpropagate through f(s_t, a_t) into the policy to optimize \pi_\theta(a_t | s_t)
4. Run the policy \pi_\theta(a_t | s_t) to collect more data \mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^{T}
5. Go to step 2
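Putting the steps together, a sketch of the final loop. Assumptions: gym-style env with the old 4-tuple step() API; `fit_dynamics`, `train_policy_through_model`, and `random_action` stand in loosely for the routines sketched above.

```python
# Final-version sketch: alternate between collecting data (first with the base
# policy, then with pi_theta), refitting f, and backpropagating into the policy.
def model_based_rl_final(env, fit_dynamics, train_policy_through_model,
                         random_action, n_iters=10, steps_per_iter=200):
    data = []
    policy = None
    for _ in range(n_iters):
        # Steps 1 and 4: collect transitions with the current behavior policy.
        s = env.reset()
        for _ in range(steps_per_iter):
            a = random_action() if policy is None else policy(s)
            s_next, _, done, _ = env.step(a)
            data.append((s, a, s_next))
            s = env.reset() if done else s_next
        f = fit_dynamics(data)                      # step 2: learn f(s_t, a_t)
        policy = train_policy_through_model(f)      # step 3: backprop into pi_theta
    return policy
```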