# CSE510 Deep Reinforcement Learning (Lecture 22)

> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture notes.

## Offline Reinforcement Learning: Introduction and Challenges

Offline reinforcement learning (offline RL), also called batch RL, aims to learn an optimal policy *without* interacting with the environment. Instead, the agent is given a fixed dataset of transitions collected by an unknown behavior policy.

### The Offline RL Dataset

We are given a static dataset:

$$
D = \{ (s_i, a_i, s'_i, r_i) \}_{i=1}^{N}
$$

Parameter explanations:

- $s_i$: state sampled from the behavior policy state distribution.
- $a_i$: action selected by the behavior policy $\pi_{\beta}$.
- $s'_i$: next state sampled from environment dynamics $p(s'|s,a)$.
- $r_i$: reward observed for transition $(s_i, a_i)$.
- $N$: total number of transitions in the dataset.
- $D$: full offline dataset used for training.

The goal is to learn a new policy $\pi$ maximizing expected discounted return using only $D$:

$$
\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\Big]
$$

Parameter explanations:

- $\pi$: policy we want to learn.
- $r(s,a)$: reward received for the state-action pair.
- $\gamma$: discount factor controlling the weight of future rewards.
- $T$: horizon or trajectory length.

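
To make the objective concrete, here is a minimal NumPy sketch of the discounted-return sum above applied to a stand-in offline dataset; the array layout, the `discounted_return` helper, and all numbers are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one sequence of rewards."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Hypothetical offline dataset D stored as flat arrays of N transitions,
# as if logged by an unknown behavior policy.
rng = np.random.default_rng(0)
N, state_dim = 1000, 4
D = {
    "states":      rng.normal(size=(N, state_dim)),   # s_i
    "actions":     rng.integers(0, 2, size=N),        # a_i
    "rewards":     rng.normal(size=N),                # r_i
    "next_states": rng.normal(size=(N, state_dim)),   # s'_i
}

# Discounted return of one 100-step reward sequence taken from the logs.
print(discounted_return(D["rewards"][:100], gamma=0.99))
```
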
### Why Offline RL Is Difficult

Offline RL is fundamentally harder than online RL because:

- The agent cannot try new actions to fix wrong value estimates.
- The policy may choose out-of-distribution actions not present in $D$.
- Q-value estimates for unseen actions can be arbitrarily incorrect.
- Bootstrapping on wrong Q-values can cause divergence.

This leads to two major failure modes:

1. **Distribution shift**: new policy actions differ from dataset actions.
2. **Extrapolation error**: the Q-function guesses values for unseen actions.

### Extrapolation Error Problem

In standard Q-learning, the Bellman backup is:

$$
Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')
$$

Parameter explanations:

- $Q(s,a)$: estimated value of taking action $a$ in state $s$.
- $\max_{a'}$: maximum over possible next actions.
- $a'$: candidate next action evaluated in the backup step.

If $a'$ was rarely or never taken in the dataset, $Q(s',a')$ is poorly estimated, so Q-learning bootstraps from invalid values, causing instability.

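
The following toy tabular sketch (not from the lecture; the state/action counts and the optimistic initialization are arbitrary assumptions) shows how the max in the backup bootstraps from Q-values of actions that never appear in the dataset:

```python
import numpy as np

n_states, n_actions = 5, 3
rng = np.random.default_rng(1)

# Arbitrary (optimistic) initialization: unseen entries keep these values forever.
Q = rng.normal(loc=5.0, size=(n_states, n_actions))

# Tiny fixed dataset: only action 0 was ever logged by the behavior policy.
dataset = [(s, 0, 1.0, (s + 1) % n_states) for s in range(n_states)]

gamma, alpha = 0.99, 0.1
for _ in range(100):
    for s, a, r, s_next in dataset:
        # The max over a' includes actions 1 and 2, which never appear in the data,
        # so their Q-values are pure initialization noise ("extrapolation error").
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])

print(Q)  # Q[:, 0] stays inflated by bootstrapping on the never-updated columns.
```
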
### Behavior Cloning (BC): The Safest Baseline

The simplest offline method is to imitate the behavior policy:

$$
\max_{\phi} \; \mathbb{E}_{(s,a) \sim D}[\log \pi_{\phi}(a|s)]
$$

Parameter explanations:

- $\phi$: neural network parameters of the cloned policy.
- $\pi_{\phi}$: learned policy approximating the behavior policy.
- $\log \pi_{\phi}(a|s)$: log-likelihood of the dataset action; its negative serves as the training loss.

Pros:

- Does not suffer from extrapolation error.
- Extremely stable.

Cons:

- Cannot outperform the behavior policy.
- Ignores reward information entirely.

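
As a sketch of the BC objective, the snippet below trains a discrete-action policy by minimizing cross-entropy on dataset actions, which is the same as maximizing the log-likelihood above; the network sizes, batch, and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical discrete-action BC setup: state_dim and n_actions are placeholders.
state_dim, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Stand-in for a batch sampled from the offline dataset D.
states = torch.randn(256, state_dim)
actions = torch.randint(0, n_actions, (256,))

for _ in range(100):
    logits = policy(states)
    # Cross-entropy = negative log-likelihood of the dataset actions,
    # so minimizing it maximizes E_{(s,a)~D}[log pi_phi(a|s)].
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
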
### Naive Offline Q-Learning Fails

Directly applying off-policy Q-learning on $D$ generally leads to:

- Overestimation of unseen actions.
- Divergence due to extrapolation error.
- Policies worse than behavior cloning.

## Strategies for Safe Offline RL

There are two primary families of solutions:

1. **Policy constraint methods**
2. **Conservative value estimation methods**

---

# 1. Policy Constraint Methods

These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.

### Advantage Weighted Regression (AWR / AWAC)

Policy update:

$$
\pi(a|s) \propto \pi_{\beta}(a|s)\exp\left(\frac{1}{\lambda}A(s,a)\right)
$$

Parameter explanations:

- $\pi_{\beta}$: behavior policy used to collect the dataset.
- $A(s,a)$: advantage function derived from Q or V estimates.
- $\lambda$: temperature controlling the strength of advantage weighting.
- $\exp(\cdot)$: positive weighting on high-advantage actions.

Properties:

- Uses advantages to upweight good actions and downweight bad ones.
- Improves beyond the behavior policy while staying safe.

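
A minimal sketch of the weighted-regression update, assuming a discrete-action policy, stand-in advantage estimates, and a hypothetical clipping value `w_max` for numerical stability (a common implementation detail, not stated in the lecture):

```python
import torch
import torch.nn as nn

def awr_policy_loss(policy, states, actions, advantages, lam=1.0, w_max=20.0):
    """Advantage-weighted behavior cloning: weight the log-likelihood of dataset
    actions by exp(A(s,a) / lambda), clipped at w_max for numerical stability."""
    weights = torch.clamp(torch.exp(advantages / lam), max=w_max)
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    return -(weights.detach() * log_probs).mean()

# Hypothetical discrete-action setup with stand-in data and advantage estimates.
state_dim, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
states = torch.randn(256, state_dim)
actions = torch.randint(0, n_actions, (256,))
advantages = torch.randn(256)  # in practice derived from learned Q/V estimates

loss = awr_policy_loss(policy, states, actions, advantages, lam=1.0)
loss.backward()
```

In practice the exponential weights are typically clipped or self-normalized, and the advantages come from a separately trained critic.
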
### Batch-Constrained Q-learning (BCQ)

BCQ constrains the policy using a generative model:

1. Train a VAE $G_{\omega}$ to model $a$ given $s$.
2. Train a small perturbation model $\xi$.
3. Limit the policy to $a = G_{\omega}(s) + \xi(s)$.

Parameter explanations:

- $G_{\omega}(s)$: VAE-generated action similar to dataset actions.
- $\omega$: VAE parameters.
- $\xi(s)$: small correction to generated actions.
- $a$: final policy action constrained near the dataset distribution.

Because candidate actions come from the generative model, BCQ avoids selecting unseen actions and strongly reduces extrapolation error.

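
The action-selection step of BCQ can be sketched as below; `vae_decode`, `perturb`, and `q_net` stand for already-trained networks, and the candidate count and perturbation scale `phi` are illustrative assumptions.

```python
import torch

def bcq_select_action(state, vae_decode, perturb, q_net, n_candidates=10, phi=0.05):
    """BCQ-style action selection: sample candidate actions near the data
    distribution via the VAE, add a small bounded correction xi(s), and pick
    the candidate with the highest Q-value."""
    states = state.unsqueeze(0).repeat(n_candidates, 1)          # (N, state_dim)
    candidates = vae_decode(states)                              # a ~ G_omega(s)
    candidates = candidates + phi * torch.tanh(perturb(states))  # + xi(s), bounded
    q_values = q_net(states, candidates).squeeze(-1)             # (N,)
    return candidates[torch.argmax(q_values)]

# Toy stand-ins just to exercise the function; real networks would be trained on D.
state_dim, action_dim = 4, 2
vae_decode = lambda s: torch.tanh(torch.randn(s.shape[0], action_dim))
perturb = lambda s: torch.zeros(s.shape[0], action_dim)
q_net = lambda s, a: s.sum(dim=1, keepdim=True) + a.sum(dim=1, keepdim=True)
print(bcq_select_action(torch.randn(state_dim), vae_decode, perturb, q_net))
```

The policy therefore only ever ranks actions that the generative model considers likely under the data.
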
### BEAR (Bootstrapping Error Accumulation Reduction)

BEAR adds an explicit constraint:

$$
D_{\mathrm{MMD}}\left(\pi(\cdot|s), \pi_{\beta}(\cdot|s)\right) < \epsilon
$$

Parameter explanations:

- $D_{\mathrm{MMD}}$: Maximum Mean Discrepancy distance between action distributions.
- $\epsilon$: threshold restricting policy deviation from the behavior policy.

BEAR controls distribution shift more tightly than BCQ.

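
A small sketch of the MMD term with a Gaussian (RBF) kernel, one common kernel choice; the kernel bandwidth and sample counts are assumptions for illustration.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel matrix between two sets of action samples."""
    diff = x.unsqueeze(1) - y.unsqueeze(0)              # (n, m, action_dim)
    return torch.exp(-(diff ** 2).sum(-1) / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x ~ pi(.|s) and y ~ pi_beta(.|s)."""
    return (gaussian_kernel(x, x, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

# Stand-in action samples (e.g. 10 samples from each policy at one state).
pi_actions = torch.randn(10, 2)
beta_actions = torch.randn(10, 2) + 0.5
print(mmd_squared(pi_actions, beta_actions))  # constrain or penalize this to stay below epsilon
```
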
---
# 2. Conservative Value Function Methods

These methods modify Q-learning so that Q-values of unseen actions are *underestimated*, preventing the policy from exploiting overestimated values.

### Conservative Q-Learning (CQL)

One formulation is:

$$
J(Q) = J_{\mathrm{TD}}(Q) + \alpha\big(\mathbb{E}_{a\sim\pi(\cdot|s)}[Q(s,a)] - \mathbb{E}_{a\sim D}[Q(s,a)]\big)
$$

Parameter explanations:

- $J_{\mathrm{TD}}$: standard Bellman TD loss.
- $\alpha$: weight of the conservatism penalty.
- $\mathbb{E}_{a\sim\pi(\cdot|s)}$: expectation over policy-chosen actions.
- $\mathbb{E}_{a\sim D}$: expectation over dataset actions.

Effect of minimizing $J(Q)$:

- Increases Q-values of dataset actions.
- Decreases Q-values of out-of-distribution actions.

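
A sketch of the conservatism penalty for a discrete-action Q-network; following common CQL implementations, the expectation over policy actions is replaced here by a logsumexp over all actions, and the network and batch are stand-ins.

```python
import torch
import torch.nn as nn

def cql_penalty(q_net, states, dataset_actions):
    """CQL-style regularizer for discrete actions: logsumexp of Q over all actions
    (a soft-maximum surrogate for the policy term) minus the mean Q of dataset actions.
    Added to the TD loss with weight alpha, it pushes down Q-values of
    out-of-distribution actions and pushes up Q-values of dataset actions."""
    q_all = q_net(states)                                            # (batch, n_actions)
    q_data = q_all.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)
    return (torch.logsumexp(q_all, dim=1) - q_data).mean()

# Stand-in Q-network and batch; alpha and the TD loss come from the usual pipeline.
state_dim, n_actions = 8, 4
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
states = torch.randn(256, state_dim)
dataset_actions = torch.randint(0, n_actions, (256,))
penalty = cql_penalty(q_net, states, dataset_actions)
# total_loss = td_loss + alpha * penalty
```
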
### Implicit Q-Learning (IQL)

IQL avoids explicit policy constraints by using expectile regression.

Value regression:

$$
V(s) = \arg\min_{v} \; \mathbb{E}\big[\rho_{\tau}(Q(s,a) - v)\big]
$$

Parameter explanations:

- $v$: scalar value estimate for state $s$.
- $\rho_{\tau}(x)$: expectile regression loss.
- $\tau$: expectile parameter controlling conservatism.
- $Q(s,a)$: Q-value estimate.

Key idea:

- For $\tau < 1$, IQL reduces sensitivity to large (possibly incorrect) Q-values.
- It is implicitly conservative without special constraints.

IQL often achieves state-of-the-art performance due to its simplicity and stability.

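
The expectile loss can be written in a few lines; the sketch below assumes the standard form $\rho_{\tau}(u) = |\tau - \mathbb{1}\{u < 0\}|\,u^2$ with stand-in Q and V values.

```python
import torch

def expectile_loss(q_values, v, tau=0.7):
    """rho_tau(u) = |tau - 1{u < 0}| * u^2 with u = Q(s,a) - V(s).
    For tau > 0.5, positive errors (Q above V) are weighted more heavily,
    so V moves toward an upper expectile of Q without a hard max over actions."""
    u = q_values - v
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u ** 2).mean()

# Stand-in values: Q(s,a) for a batch of dataset (s,a) pairs and V(s) predictions.
q_values = torch.randn(256)
v = torch.randn(256, requires_grad=True)
loss = expectile_loss(q_values.detach(), v, tau=0.7)
loss.backward()
```
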
---
# Model-Based Offline RL
### Forward Model-Based RL

Train a dynamics model:

$$
p_{\theta}(s'|s,a)
$$

Parameter explanations:

- $p_{\theta}$: learned transition model.
- $\theta$: parameters of the transition model.

We can generate synthetic transitions using $p_{\theta}$, but model error accumulates.

### Penalty-Based Model Approaches (MOPO, MOReL)

Add an uncertainty penalty to the reward used in model rollouts:

$$
r_{\text{model}}(s,a) = r(s,a) - \beta \, u(s,a)
$$

Parameter explanations:

- $r_{\text{model}}$: penalized reward for model rollouts.
- $u(s,a)$: model uncertainty estimate.
- $\beta$: penalty coefficient.

These methods keep the policy from exploiting regions where the learned model is unreliable.

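
A sketch of the penalized reward, using ensemble disagreement (standard deviation of predicted next states) as one possible stand-in for the uncertainty term $u(s,a)$; the ensemble size, dimensions, and $\beta$ value are illustrative assumptions.

```python
import numpy as np

def penalized_reward(reward, next_state_preds, beta=1.0):
    """Model-based reward penalty: subtract beta * u(s,a), where u(s,a) is taken
    here to be the ensemble's disagreement (std of predicted next states), one
    common proxy for model uncertainty in MOPO/MOReL-style methods."""
    u = np.std(next_state_preds, axis=0).max(axis=-1)   # disagreement per (s,a)
    return reward - beta * u

# Stand-in: an ensemble of 5 dynamics models predicting next states for a batch of 32.
ensemble_preds = np.random.randn(5, 32, 4)   # (n_models, batch, state_dim)
rewards = np.random.randn(32)
print(penalized_reward(rewards, ensemble_preds, beta=1.0))
```
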
---
# Reverse Model-Based Imagination (ROMI)

ROMI generates new training data by *backward* imagination.

### Reverse Dynamics Model

ROMI learns:

$$
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
$$

Parameter explanations:

- $\psi$: parameters of the reverse dynamics model.
- $s_{t+1}$: later state.
- $a_{t}$: action taken leading to $s_{t+1}$.
- $s_{t}$: predicted predecessor state.

ROMI also learns a reverse policy for sampling likely predecessor actions.

### Reverse Imagination Process

Given a goal state $s_{g}$, initialize $s_{t+1} = s_{g}$ and then:

1. Sample $a_{t}$ from the reverse policy.
2. Predict the predecessor state $s_{t}$ from the reverse dynamics model.
3. Form the imagined transition $(s_{t}, a_{t}, s_{t+1})$.
4. Repeat backward to build longer imagined trajectories (a minimal sketch follows below).

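
A minimal sketch of the backward rollout loop, with `reverse_policy` and `reverse_model` standing in for the trained reverse policy and reverse dynamics model; the dimensions, horizon, and toy stand-ins are assumptions for illustration.

```python
import torch

def reverse_rollout(s_goal, reverse_policy, reverse_model, horizon=5):
    """Roll backward from a real (e.g. goal or high-reward) state: repeatedly sample
    a predecessor action and state, producing imagined transitions that all lead
    into the real state. `reverse_policy` and `reverse_model` are assumed trained."""
    transitions = []
    s_next = s_goal
    for _ in range(horizon):
        a = reverse_policy(s_next)          # a_t sampled given s_{t+1}
        s_prev = reverse_model(s_next, a)   # s_t ~ p_psi(s_t | s_{t+1}, a_t)
        transitions.append((s_prev, a, s_next))
        s_next = s_prev                     # continue imagining further backward
    return list(reversed(transitions))      # ordered forward in time

# Toy stand-ins just to exercise the rollout loop.
state_dim, action_dim = 4, 2
reverse_policy = lambda s: torch.tanh(torch.randn(action_dim))
reverse_model = lambda s, a: s + 0.1 * torch.randn(state_dim)
imagined = reverse_rollout(torch.zeros(state_dim), reverse_policy, reverse_model)
print(len(imagined), imagined[0][0].shape)
```

Each imagined transition can then be added to the offline dataset and consumed by a conservative offline RL method, as described below.
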
Benefits:

- Imagined transitions end in real states, ensuring grounding.
- Completes missing parts of the dataset.
- Helps propagate reward backward reliably.

ROMI combined with conservative RL often outperforms standard offline methods.

---
# Summary of Lecture 22

Offline RL requires balancing:

- Improvement beyond dataset behavior.
- Avoiding unsafe extrapolation to unseen actions.

Three major families of solutions:

1. Policy constraints (BCQ, BEAR, AWR)
2. Conservative value estimation (CQL, IQL)
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)

Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.

---
# Recommended Screenshot Frames for Lecture 22

- Lecture 22, page 7: Offline RL diagram showing policy learning from a fixed dataset, subsection "Offline RL Setting".
- Lecture 22, page 35: Illustration of dataset support vs. policy action distribution, subsection "Strategies for Safe Offline RL".

---

*End of CSE510_L22.md*