# CSE510 Deep Reinforcement Learning (Lecture 22)

> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.

## Offline Reinforcement Learning: Introduction and Challenges

Offline reinforcement learning (offline RL), also called batch RL, aims to learn an optimal policy *without* interacting with the environment. Instead, the agent is given a fixed dataset of transitions collected by an unknown behavior policy.

### The Offline RL Dataset

We are given a static dataset:

$$
D = \{ (s_i, a_i, s'_i, r_i) \}_{i=1}^{N}
$$

Parameter explanations:

- $s_i$: state sampled from the behavior policy's state distribution.
- $a_i$: action selected by the behavior policy $\pi_{\beta}$.
- $s'_i$: next state sampled from the environment dynamics $p(s'|s,a)$.
- $r_i$: reward observed for the transition $(s_i, a_i)$.
- $N$: total number of transitions in the dataset.
- $D$: full offline dataset used for training.

The goal is to learn a new policy $\pi$ maximizing expected discounted return using only $D$:

$$
\max_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\Big]
$$

Parameter explanations:

- $\pi$: policy we want to learn.
- $r(s,a)$: reward received for a state-action pair.
- $\gamma$: discount factor controlling the weight of future rewards.
- $T$: horizon or trajectory length.

### Why Offline RL Is Difficult

Offline RL is fundamentally harder than online RL because:

- The agent cannot try new actions to correct wrong value estimates.
- The policy may choose out-of-distribution actions not present in $D$.
- Q-value estimates for unseen actions can be arbitrarily incorrect.
- Bootstrapping on wrong Q-values can cause divergence.

This leads to two major failure modes:

1. **Distribution shift**: new policy actions differ from dataset actions.
2. **Extrapolation error**: the Q-function guesses values for unseen actions.

### Extrapolation Error Problem

In standard Q-learning, the Bellman backup is:

$$
Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')
$$

Parameter explanations:

- $Q(s,a)$: estimated value of taking action $a$ in state $s$.
- $\max_{a'}$: maximum over possible next actions.
- $a'$: candidate next action evaluated in the backup step.

If $a'$ was rarely or never taken in the dataset, $Q(s',a')$ is poorly estimated, so Q-learning bootstraps from invalid values, causing instability.

### Behavior Cloning (BC): The Safest Baseline

The simplest offline method is to imitate the behavior policy:

$$
\max_{\phi} \; \mathbb{E}_{(s,a) \sim D}\big[\log \pi_{\phi}(a|s)\big]
$$

Parameter explanations:

- $\phi$: neural network parameters of the cloned policy.
- $\pi_{\phi}$: learned policy approximating the behavior policy.
- $\log \pi_{\phi}(a|s)$: log-likelihood of the dataset action; its negation is the training loss.

Pros:

- Does not suffer from extrapolation error.
- Extremely stable.

Cons:

- Cannot outperform the behavior policy.
- Ignores reward information entirely.

### Naive Offline Q-Learning Fails

Directly applying off-policy Q-learning to $D$ generally leads to:

- Overestimation of unseen actions.
- Divergence due to extrapolation error.
- Policies worse than behavior cloning.

## Strategies for Safe Offline RL

There are two primary families of solutions:

1. **Policy constraint methods**
2. **Conservative value estimation methods**

---

# 1. Policy Constraint Methods

These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.
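As a concrete illustration of the constraint idea, here is a minimal sketch of a regularized actor objective: a simple mean-squared behavior-cloning penalty added to a Q-maximization term (in the spirit of TD3+BC, not a method covered in this lecture). The `actor` and `critic` networks, the batch tensors, and the weight `lam` are placeholder assumptions.

```python
import torch.nn.functional as F

def constrained_actor_loss(actor, critic, states, dataset_actions, lam=2.5):
    """Actor objective for one batch sampled from the offline dataset D."""
    policy_actions = actor(states)
    # Improvement term: push the policy toward actions the learned critic rates highly.
    q_term = critic(states, policy_actions).mean()
    # Constraint term: keep policy actions close to dataset actions, so the critic
    # is only queried near its training distribution (limits extrapolation error).
    bc_term = F.mse_loss(policy_actions, dataset_actions)
    return -q_term + lam * bc_term
```

Taking `lam` very large recovers behavior cloning; setting it to zero recovers unconstrained off-policy learning, with all the extrapolation problems described above. The methods below implement this trade-off in more principled ways.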
### Advantage Weighted Regression (AWR / AWAC)

Policy update:

$$
\pi(a|s) \propto \pi_{\beta}(a|s)\exp\left(\frac{1}{\lambda}A(s,a)\right)
$$

Parameter explanations:

- $\pi_{\beta}$: behavior policy used to collect the dataset.
- $A(s,a)$: advantage function derived from Q or V estimates.
- $\lambda$: temperature controlling the strength of advantage weighting.
- $\exp(\cdot)$: positive weighting on high-advantage actions.

Properties:

- Uses advantages to filter good and bad actions.
- Improves beyond the behavior policy while staying safe.

### Batch-Constrained Q-learning (BCQ)

BCQ constrains the policy using a generative model:

1. Train a VAE $G_{\omega}$ to model $a$ given $s$.
2. Train a small perturbation model $\xi$.
3. Limit the policy to $a = G_{\omega}(s) + \xi(s)$.

Parameter explanations:

- $G_{\omega}(s)$: VAE-generated action similar to dataset actions.
- $\omega$: VAE parameters.
- $\xi(s)$: small correction to generated actions.
- $a$: final policy action constrained near the dataset distribution.

BCQ avoids selecting unseen actions and strongly reduces extrapolation error.

### BEAR (Bootstrapping Error Accumulation Reduction)

BEAR adds an explicit constraint:

$$
D_{\mathrm{MMD}}\big(\pi(a|s), \pi_{\beta}(a|s)\big) < \epsilon
$$

Parameter explanations:

- $D_{\mathrm{MMD}}$: Maximum Mean Discrepancy distance between action distributions.
- $\epsilon$: threshold restricting policy deviation from the behavior policy.

BEAR controls distribution shift more tightly than BCQ.

---

# 2. Conservative Value Function Methods

These methods modify Q-learning so that Q-values of unseen actions are *underestimated*, preventing the policy from exploiting overestimated values.

### Conservative Q-Learning (CQL)

One formulation is:

$$
J(Q) = J_{TD}(Q) + \alpha\big(\mathbb{E}_{a\sim\pi(\cdot|s)}[Q(s,a)] - \mathbb{E}_{a\sim D}[Q(s,a)]\big)
$$

Parameter explanations:

- $J_{TD}$: standard Bellman TD loss.
- $\alpha$: weight of the conservatism penalty.
- $\mathbb{E}_{a\sim\pi(\cdot|s)}$: expectation over policy-chosen actions.
- $\mathbb{E}_{a\sim D}$: expectation over dataset actions.

Effect:

- Increases Q-values of dataset actions.
- Decreases Q-values of out-of-distribution actions.

### Implicit Q-Learning (IQL)

IQL avoids explicit constraints entirely by using expectile regression.

Value regression:

$$
V(s) = \arg\min_{v} \; \mathbb{E}\big[\rho_{\tau}(Q(s,a) - v)\big]
$$

Parameter explanations:

- $v$: scalar value estimate for state $s$.
- $\rho_{\tau}(x)$: expectile regression loss.
- $\tau$: expectile parameter controlling conservatism.
- $Q(s,a)$: Q-value estimate.

Key idea:

- For $\tau < 1$, IQL reduces sensitivity to large (possibly incorrect) Q-values.
- Implicitly conservative without special constraints.

IQL often achieves state-of-the-art performance due to its simplicity and stability.
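A minimal sketch of the expectile regression loss $\rho_{\tau}$ used in the value update is shown below; the function name, tensor arguments, and the default `tau=0.7` are illustrative assumptions rather than lecture material.

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """rho_tau(u) = |tau - 1(u < 0)| * u^2, applied to u = Q(s, a) - V(s)."""
    u = q_values - v_values
    # Errors where Q > V are weighted by tau, errors where Q < V by (1 - tau);
    # for tau > 0.5 the fitted V leans toward an upper expectile of Q over dataset actions.
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()
```

With `tau = 0.5` this reduces to ordinary mean-squared regression; values closer to 1 push $V(s)$ toward the best dataset-supported actions, which is the conservatism knob described above.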
---

# Model-Based Offline RL

### Forward Model-Based RL

Train a dynamics model:

$$
p_{\theta}(s'|s,a)
$$

Parameter explanations:

- $p_{\theta}$: learned transition model.
- $\theta$: parameters of the transition model.

We can generate synthetic transitions using $p_{\theta}$, but model error accumulates over long rollouts.

### Penalty-Based Model Approaches (MOPO, MOReL)

Add an uncertainty penalty to the model reward:

$$
r_{\mathrm{model}}(s,a) = r(s,a) - \beta \, u(s,a)
$$

Parameter explanations:

- $r_{\mathrm{model}}$: penalized reward used for model rollouts.
- $u(s,a)$: model uncertainty estimate.
- $\beta$: penalty coefficient.

These penalties discourage the policy from exploiting regions where the learned model is unreliable.

---

# Reverse Model-Based Imagination (ROMI)

ROMI generates new training data by *backward* imagination.

### Reverse Dynamics Model

ROMI learns:

$$
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
$$

Parameter explanations:

- $\psi$: parameters of the reverse dynamics model.
- $s_{t+1}$: later state.
- $a_{t}$: action taken leading to $s_{t+1}$.
- $s_{t}$: predicted predecessor state.

ROMI also learns a reverse policy for sampling likely predecessor actions.

### Reverse Imagination Process

Given a goal state $s_{g}$:

1. Sample $a_{t}$ from the reverse policy.
2. Predict $s_{t}$ from the reverse dynamics model.
3. Form the imagined transition $(s_{t}, a_{t}, s_{t+1})$.
4. Repeat to build longer imagined trajectories.

Benefits:

- Imagined transitions end in real states, which keeps them grounded.
- Completes missing parts of the dataset.
- Helps propagate reward backward reliably.

ROMI combined with a conservative offline RL algorithm often outperforms standard offline methods.

---

# Summary of Lecture 22

Offline RL requires balancing:

- Improvement beyond the dataset behavior.
- Avoiding unsafe extrapolation to unseen actions.

Three major families of solutions:

1. Policy constraints (BCQ, BEAR, AWR)
2. Conservative Q-learning (CQL, IQL)
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)

Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.

---

# Recommended Screenshot Frames for Lecture 22

- Lecture 22, page 7: Offline RL diagram showing policy learning from a fixed dataset, subsection "Offline RL Setting".
- Lecture 22, page 35: Illustration of dataset support vs. policy action distribution, subsection "Strategies for Safe Offline RL".

---

--End of CSE510_L22.md--