# CSE510 Deep Reinforcement Learning (Lecture 22)

## Offline Reinforcement Learning: Introduction and Challenges

> Note: due to a lapse of attention during class, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.

Offline reinforcement learning (offline RL), also called batch RL, aims to learn an optimal policy *without* interacting with the environment. Instead, the agent is given a fixed dataset of transitions collected by an unknown behavior policy.

### Requirements for Current Successes

- Access to the environment model or a simulator
- Exploration (trial-and-error) is not costly

#### Background: Offline RL

- The success of modern machine learning
  - Scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
- Reinforcement learning
  - Online learning paradigm
  - Interaction is expensive & dangerous
    - Healthcare, robotics, recommendation...
- Can we develop data-driven offline RL?

### The Offline RL Dataset

#### Definition in Offline RL

- The policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which is collected by an _unknown behavior policy_ $\pi_\beta$
- Interaction with the environment is not allowed
- $\mathcal{D}=\{(s_i,a_i,s_i',r_i)\}$
  - $s\sim d^{\pi_\beta}(s)$
  - $a\sim \pi_\beta(a|s)$
  - $s'\sim p(s'|s,a)$
  - $r\gets r(s,a)$
- Objective: $\max_\pi\sum_{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s),\,a_t\sim \pi(a|s)}[\gamma^t r(s_t,a_t)]$
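
To make this setting concrete, here is a minimal sketch (names are illustrative, not from the lecture) of an offline dataset object that only supports sampling stored transitions; there is no `env.step`, reflecting that interaction is not allowed.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class OfflineDataset:
    """Fixed batch D = {(s, a, s', r)} collected by an unknown behavior policy."""
    states: np.ndarray       # shape [N, state_dim]
    actions: np.ndarray      # shape [N] (discrete) or [N, action_dim]
    next_states: np.ndarray  # shape [N, state_dim]
    rewards: np.ndarray      # shape [N]

    def sample(self, batch_size: int, rng: np.random.Generator):
        # The only operation available offline: sample stored transitions.
        idx = rng.integers(0, len(self.rewards), size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.next_states[idx], self.rewards[idx])
```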

#### Key Challenge in Offline RL: Distribution Shift

We are given a static dataset:

$$
D = \{ (s_i, a_i, s'_i, r_i) \}_{i=1}^N
$$

Parameter explanations:

- $s_i$: state sampled from the behavior policy's state distribution $d^{\pi_\beta}(s)$.
- $a_i$: action selected by the behavior policy $\pi_\beta$.
- $s'_i$: next state sampled from the environment dynamics $p(s'|s,a)$.
- $r_i$: reward observed for the transition $(s_i,a_i)$.
- $N$: total number of transitions in the dataset.
- $D$: full offline dataset used for training.

The goal is to learn a new policy $\pi$ maximizing the expected discounted return using only $D$:

$$
\max_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^T \gamma^t r(s_t, a_t)\Big]
$$

Parameter explanations:

- $\pi$: policy we want to learn.
- $r(s,a)$: reward received for a state-action pair.
- $\gamma$: discount factor controlling the weight of future rewards.
- $T$: horizon or trajectory length.

What about using traditional reinforcement learning with bootstrapping?

$$
Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')
$$

$$
\pi(s)=\arg\max_{a\in A} Q(s,a)
$$

But notice that the state-action distribution induced by the learned policy $\pi_f$ differs from that of the behavior policy:

$$
P_{\pi_\beta}(s,a)\neq P_{\pi_f}(s,a)
$$
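
One simple way to see this mismatch in a discrete toy setting is to measure how often the greedy action under the current Q-estimates was never observed in the dataset for that state. This is a sketch of such a diagnostic (my own illustrative helper, not part of the lecture material):

```python
import numpy as np

def ood_action_rate(states, actions, q_values):
    """Fraction of dataset states where the greedy action argmax_a Q(s,a)
    was never observed for that state in the dataset (discrete toy setting).

    states, actions: 1-D integer arrays of state/action indices from D.
    q_values: array [num_states, num_actions] of current Q estimates.
    """
    seen = set(zip(states.tolist(), actions.tolist()))
    greedy = q_values[states].argmax(axis=1)
    ood = [(s, a) not in seen for s, a in zip(states.tolist(), greedy.tolist())]
    return float(np.mean(ood))
```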

### Why Offline RL Is Difficult

Offline RL is fundamentally harder than online RL because:

- The agent cannot try new actions to fix wrong value estimates.
- The policy may choose out-of-distribution actions not present in $D$.
- Q-value estimates for unseen actions can be arbitrarily incorrect.
- Bootstrapping on wrong Q-values can cause divergence.

This leads to two major failure modes:

1. **Distribution shift**: new policy actions differ from dataset actions.
2. **Extrapolation error**: the Q-function guesses values for unseen actions.

### Extrapolation Error Problem

In standard Q-learning, the Bellman backup is:

$$
Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')
$$

Parameter explanations:

- $Q(s,a)$: estimated value of taking action $a$ in state $s$.
- $\max_{a'}$: maximum over possible next actions.
- $a'$: candidate next action evaluated in the backup step.

If $a'$ was rarely or never taken in the dataset, $Q(s',a')$ is poorly estimated, so Q-learning bootstraps off invalid values, causing instability.
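
A tiny tabular illustration of this effect (toy numbers of my own, not from the lecture): all dataset rewards are zero, so every in-data action is truly worth 0, but an unseen action keeps its optimistic initialization and the max in the backup latches onto it.

```python
import numpy as np

# Toy example: 2 states, 2 actions. Action 1 never appears in the dataset,
# so its Q-value keeps whatever (optimistic) initialization it was given.
Q = np.array([[0.0, 10.0],
              [0.0, 10.0]])
gamma = 0.99
dataset = [(0, 0, 1, 0.0), (1, 0, 0, 0.0)]  # (s, a, s', r): only action 0 observed

for _ in range(100):
    for s, a, s_next, r in dataset:
        # The max bootstraps off the unseen action's bogus value of 10.
        Q[s, a] = r + gamma * Q[s_next].max()

print(Q)
# Q[:, 0] settles near 9.9 instead of the true value 0, and the greedy
# policy prefers the completely unsupported action 1.
```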

### Behavior Cloning (BC): The Safest Baseline

The simplest offline method is to imitate the behavior policy:

$$
\max_{\phi} \; \mathbb{E}_{(s,a) \sim D}[\log \pi_{\phi}(a|s)]
$$

Parameter explanations:

- $\phi$: neural network parameters of the cloned policy.
- $\pi_{\phi}$: learned policy approximating the behavior policy.
- $\log \pi_{\phi}(a|s)$: log-likelihood of the dataset action (its negative is the training loss).

Pros:

- Does not suffer from extrapolation error.
- Extremely stable.

Cons:

- Cannot outperform the behavior policy.
- Ignores reward information entirely.
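
A minimal sketch of the BC objective for discrete actions, assuming PyTorch and a policy network that outputs action logits (the function and argument names are illustrative, not from the lecture):

```python
import torch
import torch.nn.functional as F

def bc_loss(action_logits: torch.Tensor, dataset_actions: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of dataset actions under the cloned policy.

    action_logits: [batch, num_actions] outputs of pi_phi for the batch states.
    dataset_actions: [batch] integer actions taken by the behavior policy.
    """
    # cross_entropy = -log softmax(logits)[a]; minimizing it maximizes log pi_phi(a|s).
    return F.cross_entropy(action_logits, dataset_actions)
```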

### Naive Offline Q-Learning Fails

Directly applying off-policy Q-learning on $D$ generally leads to:

- Overestimation of unseen actions.
- Divergence due to extrapolation error.
- Policies worse than behavior cloning.

## Strategies for Safe Offline RL

There are two primary families of solutions:

1. **Policy constraint methods**
2. **Conservative value estimation methods**

---

# 1. Policy Constraint Methods

These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.

### Advantage Weighted Regression (AWR / AWAC)

Policy update:

$$
\pi(a|s) \propto \pi_{\beta}(a|s)\exp\left(\frac{1}{\lambda}A(s,a)\right)
$$

Parameter explanations:

- $\pi_{\beta}$: behavior policy used to collect the dataset.
- $A(s,a)$: advantage function derived from Q or V estimates.
- $\lambda$: temperature controlling the strength of advantage weighting.
- $\exp(\cdot)$: positive weighting on high-advantage actions.

Properties:

- Uses advantages to filter good and bad actions.
- Improves beyond the behavior policy while staying safe.
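
In practice this target is usually fit by weighted regression on dataset actions. A sketch of one such loss, assuming PyTorch and a separately trained critic that supplies advantages (names are illustrative):

```python
import torch

def awr_loss(log_probs: torch.Tensor, advantages: torch.Tensor,
             temperature: float = 1.0, max_weight: float = 20.0) -> torch.Tensor:
    """Advantage-weighted log-likelihood on dataset actions.

    log_probs: [batch] log pi_phi(a_i | s_i) for the dataset actions.
    advantages: [batch] A(s_i, a_i) from a separately learned critic.
    """
    # exp(A / lambda) up-weights good dataset actions; clipping keeps training stable.
    weights = torch.exp(advantages / temperature).clamp(max=max_weight)
    return -(weights.detach() * log_probs).mean()
```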

### Batch-Constrained Q-learning (BCQ)

BCQ constrains the policy using a generative model:

1. Train a VAE $G_{\omega}$ to model $a$ given $s$.
2. Train a small perturbation model $\xi$.
3. Limit the policy to $a = G_{\omega}(s) + \xi(s)$.

Parameter explanations:

- $G_{\omega}(s)$: VAE-generated action similar to data actions.
- $\omega$: VAE parameters.
- $\xi(s)$: small correction to generated actions.
- $a$: final policy action constrained near the dataset distribution.

BCQ avoids selecting unseen actions and strongly reduces extrapolation.
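
A sketch of BCQ-style action selection under the note's simplified form $a = G_{\omega}(s) + \xi(s)$ (the original BCQ perturbation network also conditions on the candidate action; all three callables below are illustrative stand-ins, PyTorch assumed):

```python
import torch

@torch.no_grad()
def bcq_select_action(state, vae_decoder, perturbation, q_net, num_candidates: int = 10):
    """Pick an action near the data distribution with the highest Q-value.

    state: [state_dim] tensor.
    vae_decoder(states): samples one candidate action per state, like G_omega(s).
    perturbation(states): small correction xi(s).
    q_net(states, actions): returns Q(s, a) for each pair.
    """
    states = state.unsqueeze(0).expand(num_candidates, -1)
    candidates = vae_decoder(states) + perturbation(states)  # a = G_omega(s) + xi(s)
    q_values = q_net(states, candidates).squeeze(-1)
    return candidates[q_values.argmax()]
```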

### BEAR (Bootstrapping Error Accumulation Reduction)

BEAR adds an explicit constraint:

$$
D_{\mathrm{MMD}}\left(\pi(a|s), \pi_{\beta}(a|s)\right) < \epsilon
$$

Parameter explanations:

- $D_{\mathrm{MMD}}$: Maximum Mean Discrepancy distance between action distributions.
- $\epsilon$: threshold restricting policy deviation from the behavior policy.

BEAR controls distribution shift more tightly than BCQ.
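
The MMD term is typically estimated from samples of the two action distributions using a kernel. A minimal sketch with a Gaussian kernel (illustrative, PyTorch assumed; the kernel bandwidth is a hyperparameter):

```python
import torch

def mmd_squared(x: torch.Tensor, y: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
    """Sample-based squared MMD between two sets of actions.

    x: [n, action_dim] actions sampled from pi(.|s).
    y: [m, action_dim] actions sampled from (an estimate of) pi_beta(.|s).
    """
    def kernel(a, b):
        sq_dist = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(-1)
        return torch.exp(-sq_dist / (2 * sigma ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```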

---

# 2. Conservative Value Function Methods

These methods modify Q-learning so that Q-values of unseen actions are *underestimated*, preventing the policy from exploiting overestimated values.

### Conservative Q-Learning (CQL)

One formulation is:

$$
J(Q) = J_{TD}(Q) + \alpha\big(\mathbb{E}_{a\sim\pi(\cdot|s)}Q(s,a) - \mathbb{E}_{a\sim D}Q(s,a)\big)
$$

Parameter explanations:

- $J_{TD}$: standard Bellman TD loss.
- $\alpha$: weight of the conservatism penalty.
- $\mathbb{E}_{a\sim\pi(\cdot|s)}$: expectation over policy-chosen actions.
- $\mathbb{E}_{a\sim D}$: expectation over dataset actions.

Effect:

- Increases Q-values of dataset actions.
- Decreases Q-values of out-of-distribution actions.
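
For discrete actions, the penalty in the formulation above can be computed directly from the Q-table of the batch. A sketch (illustrative names, PyTorch assumed):

```python
import torch

def cql_penalty(q_values: torch.Tensor, dataset_actions: torch.Tensor,
                policy_probs: torch.Tensor) -> torch.Tensor:
    """Conservatism term  E_{a~pi}[Q(s,a)] - E_{a~D}[Q(s,a)]  for discrete actions.

    q_values: [batch, num_actions] current Q estimates.
    dataset_actions: [batch] actions stored in D.
    policy_probs: [batch, num_actions] current policy pi(a|s).
    """
    q_under_pi = (policy_probs * q_values).sum(dim=1)                            # E_{a~pi} Q(s,a)
    q_under_data = q_values.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)   # Q(s, a_data)
    return (q_under_pi - q_under_data).mean()

# Total objective (sketch): loss = td_loss + alpha * cql_penalty(q, a_data, pi_probs)
```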

### Implicit Q-Learning (IQL)

IQL avoids explicit constraints entirely by using expectile regression.

Value regression:

$$
V(s) = \arg\min_{v} \; \mathbb{E}\big[\rho_{\tau}(Q(s,a) - v)\big]
$$

Parameter explanations:

- $v$: scalar value estimate for state $s$.
- $\rho_{\tau}(x)$: expectile regression loss.
- $\tau$: expectile parameter controlling conservatism.
- $Q(s,a)$: Q-value estimate.

Key idea:

- For $\tau < 1$, the expectile is softer than a hard maximum, which reduces sensitivity to large (possibly incorrect) Q-values.
- Implicitly conservative without special constraints.

IQL often achieves state-of-the-art performance due to its simplicity and stability.
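
The expectile loss $\rho_{\tau}$ has a very compact implementation. A sketch of the value-fitting loss (PyTorch assumed; names are illustrative):

```python
import torch

def expectile_loss(q: torch.Tensor, v: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Expectile regression loss rho_tau(Q(s,a) - v) used to fit V(s).

    q: [batch] Q(s, a) targets evaluated at dataset actions.
    v: [batch] current V(s) predictions.
    tau: expectile in (0, 1); tau > 0.5 pushes V toward the upper range of
         dataset Q-values without ever taking a max over unseen actions.
    """
    diff = q - v
    # |tau - 1{diff < 0}| gives weight tau for positive errors and (1 - tau) for negative ones.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()
```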

---

# Model-Based Offline RL

### Forward Model-Based RL

Train a dynamics model:

$$
p_{\theta}(s'|s,a)
$$

Parameter explanations:

- $p_{\theta}$: learned transition model.
- $\theta$: parameters of the transition model.

We can generate synthetic transitions using $p_{\theta}$, but model error accumulates.
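
A minimal sketch of fitting such a model on the offline dataset, here just a linear-Gaussian model fit by least squares (purely illustrative; practical methods use neural network ensembles):

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s' ≈ W [s; a; 1], a crude stand-in for p_theta(s'|s,a)."""
    x = np.concatenate([states, actions, np.ones((len(states), 1))], axis=1)
    w, *_ = np.linalg.lstsq(x, next_states, rcond=None)
    residual = next_states - x @ w
    sigma = residual.std(axis=0)          # per-dimension noise estimate
    return w, sigma

def rollout_step(w, sigma, state, action, rng):
    # One synthetic transition from the learned model (errors compound over steps).
    x = np.concatenate([state, action, [1.0]])
    return x @ w + rng.normal(0.0, sigma)
```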

### Penalty-Based Model Approaches (MOPO, MOReL)

Add an uncertainty penalty:

$$
r_{\text{model}}(s,a) = r(s,a) - \beta\, u(s,a)
$$

Parameter explanations:

- $r_{\text{model}}$: penalized reward for model rollouts.
- $u(s,a)$: model uncertainty estimate.
- $\beta$: penalty coefficient.

These methods limit exploration into unknown model regions.
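
A sketch of the penalized reward, using disagreement between an ensemble of dynamics models as a simple stand-in for $u(s,a)$ (MOPO itself uses the predicted model variance; names below are illustrative):

```python
import numpy as np

def penalized_reward(reward, ensemble_next_state_preds, beta: float = 1.0):
    """MOPO-style penalized reward r(s,a) - beta * u(s,a) for model rollouts.

    reward: [batch] rewards for model-generated transitions.
    ensemble_next_state_preds: [num_models, batch, state_dim] next-state
        predictions from an ensemble of learned dynamics models.
    """
    # Disagreement between ensemble members serves as the uncertainty estimate u(s,a).
    u = ensemble_next_state_preds.std(axis=0).mean(axis=-1)
    return reward - beta * u
```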

---

# Reverse Model-Based Imagination (ROMI)

ROMI generates new training data by *backward* imagination.

### Reverse Dynamics Model

ROMI learns:

$$
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
$$

Parameter explanations:

- $\psi$: parameters of the reverse dynamics model.
- $s_{t+1}$: later state.
- $a_{t}$: action taken leading to $s_{t+1}$.
- $s_{t}$: predicted predecessor state.

ROMI also learns a reverse policy for sampling likely predecessor actions.

### Reverse Imagination Process

Given a goal state $s_{g}$:

1. Sample $a_{t}$ from the reverse policy.
2. Predict $s_{t}$ from the reverse dynamics model.
3. Form the imagined transition $(s_{t}, a_{t}, s_{t+1})$.
4. Repeat to build longer imagined trajectories.

Benefits:

- Imagined transitions end in real states, ensuring grounding.
- Completes missing parts of the dataset.
- Helps propagate reward backward reliably.

ROMI combined with conservative RL often outperforms standard offline methods.
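
A sketch of the backward rollout loop described above, with hypothetical `reverse_policy` and `reverse_model` callables (the full method also models rewards and works in batches):

```python
def reverse_imagine(start_state, reverse_policy, reverse_model, horizon: int = 5):
    """Generate imagined transitions that end in a real dataset state.

    start_state: a real state from the dataset (e.g., a goal state s_g).
    reverse_policy(s_next) -> a_t   : samples a likely predecessor action.
    reverse_model(s_next, a_t) -> s_t : predicts the predecessor state.
    """
    transitions = []
    s_next = start_state
    for _ in range(horizon):
        a_t = reverse_policy(s_next)             # step 1: sample predecessor action
        s_t = reverse_model(s_next, a_t)         # step 2: predict predecessor state
        transitions.append((s_t, a_t, s_next))   # step 3: imagined transition
        s_next = s_t                             # step 4: extend the trajectory backward
    return transitions  # every imagined chain is grounded in the real start_state
```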

---

# Summary of Lecture 22

Offline RL requires balancing:

- Improvement beyond the dataset's behavior.
- Avoiding unsafe extrapolation to unseen actions.

Three major families of solutions:

1. Policy constraints (BCQ, BEAR, AWR)
2. Conservative value estimation (CQL, IQL)
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)

Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.

---

# Recommended Screenshot Frames for Lecture 22

- Lecture 22, page 7: Offline RL diagram showing policy learning from a fixed dataset, subsection "Offline RL Setting".
- Lecture 22, page 35: Illustration of dataset support vs. policy action distribution, subsection "Strategies for Safe Offline RL".

---

*End of CSE510_L22.md*