partial updates
@@ -1,146 +1,242 @@
|
||||
# CSE510 Deep Reinforcement Learning (Lecture 21)
|
||||
|
||||
## Exploration in RL
|
||||
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
|
||||
|
||||
### Information state search
|
||||
## Exploration in RL: Information-Based Exploration (Intrinsic Curiosity)
|
||||
|
||||
Uncertainty about state transitions or dynamics
|
||||
|
||||
Dynamics prediction error or Information gain for dynamics learning
|
||||
|
||||
#### Computational Curiosity
|
||||
### Computational Curiosity
|
||||
|
||||
- "The direct goal of curiosity and boredom is to improve the world model."
|
||||
- "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
|
||||
- There is positive reinforcement whenever the system fails to correctly predict the environment.
|
||||
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation (i.e., planning to make your internal world model fail).
|
||||
- Curiosity encourages agents to seek experiences that better predict or explain the environment.
|
||||
- A "curiosity unit" gives reward based on the mismatch between current model predictions and actual outcomes.
|
||||
- Intrinsic reward is high when the agent's prediction fails, that is, when it encounters surprising outcomes.
|
||||
- This yields positive intrinsic reinforcement when the internal predictive model errs, causing the agent to repeat actions that lead to prediction errors.
|
||||
- The agent is effectively motivated to create situations where its model fails.
|
||||
|
||||
#### Reward Prediction Error
|
||||
### Model Prediction Error as Intrinsic Reward
|
||||
|
||||
- Add exploration reward bonuses that encourage policies to visit states that will cause the prediction model to fail.
|
||||
We augment the reward with an intrinsic bonus based on model prediction error:
|
||||
|
||||
$$
|
||||
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a,\theta)-s'\|)
|
||||
$$
|
||||
$R(s, a, s') = r(s, a, s') + B(|T(s, a; \theta) - s'|)$
|
||||
|
||||
- where $r(s,a,s')$ is the extrinsic reward, $T(s,a,\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (intrinsic reward bonus).
|
||||
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.
|
||||
Parameter explanations:
|
||||
|
||||
[link to the paper](https://arxiv.org/pdf/1507.08750)
|
||||
</details>
|
||||
- $s$: current state of the agent.
|
||||
- $a$: action taken by the agent in state $s$.
|
||||
- $s'$: next state resulting from executing action $a$ in state $s$.
|
||||
- $r(s, a, s')$: extrinsic environment reward for transition $(s, a, s')$.
|
||||
- $T(s, a; \theta)$: learned dynamics model with parameters $\theta$ that predicts the next state.
|
||||
- $\theta$: parameter vector of the predictive dynamics model $T$.
|
||||
- $|T(s, a; \theta) - s'|$: prediction error magnitude between predicted next state and actual next state.
|
||||
- $B(\cdot)$: function converting prediction error magnitude into an intrinsic reward bonus.
|
||||
- $R(s, a, s')$: total reward, sum of extrinsic reward and intrinsic curiosity bonus.
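
As a minimal illustration (not from the slides), the sketch below computes this augmented reward with a hypothetical `forward_model` standing in for $T(s,a;\theta)$ and a simple scaled norm as the bonus function $\mathcal{B}$:

```python
import numpy as np

def intrinsic_bonus(forward_model, s, a, s_next, scale=1.0):
    """Curiosity bonus B(||T(s,a;theta) - s'||): scaled forward-model prediction error."""
    s_pred = forward_model(s, a)              # predicted next state T(s, a; theta)
    error = np.linalg.norm(s_pred - s_next)   # prediction error magnitude
    return scale * error                      # bonus grows with surprise

def total_reward(r_extrinsic, forward_model, s, a, s_next, scale=1.0):
    """R(s,a,s') = r(s,a,s') + B(||T(s,a;theta) - s'||)."""
    return r_extrinsic + intrinsic_bonus(forward_model, s, a, s_next, scale)

# Toy usage with a hypothetical linear "model" that ignores the action.
forward_model = lambda s, a: 0.9 * s
s, a, s_next = np.ones(4), 0, np.full(4, 1.2)
print(total_reward(r_extrinsic=0.0, forward_model=forward_model, s=s, a=a, s_next=s_next))
```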
|
||||
|
||||
<details>
|
||||
<summary>Example</summary>
|
||||
Key ideas:
|
||||
|
||||
Learning Visual Dynamics
|
||||
- The agent receives an intrinsic reward $B(|T(s, a; \theta) - s'|)$ when the actual outcome differs from what its world model predicts.
|
||||
- Initially many transitions are surprising, encouraging broad exploration.
|
||||
- As the model improves, familiar transitions yield smaller error and smaller intrinsic reward.
|
||||
- Exploration becomes focused on less-known parts of the state space.
|
||||
- Intrinsic motivation is non-stationary: as the agent learns, previously novel states lose their intrinsic reward.
|
||||
|
||||
- Exploration reward bonuses $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$
|
||||
- However, a trivial solution exists: the agent could collect intrinsic reward by just moving around randomly.
|
||||
#### Avoiding Trivial Curiosity Traps
|
||||
|
||||
---
|
||||
[link to paper](https://ar5iv.labs.arxiv.org/html/1705.05363#:~:text=reward%20signal%20based%20on%20how,this%20feature%20space%20using%20self)
|
||||
|
||||
- Exploration reward bonuses with autoencoders $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$
|
||||
- But this suffers from the problem that the autoencoder reconstruction loss has little to do with our task
|
||||
Naively defining $B(s, a, s')$ directly in raw observation space can lead to trivial curiosity traps.
|
||||
|
||||
#### Task Rewards vs. Exploration Rewards
|
||||
Examples:
|
||||
|
||||
Exploration rewards bonuses:
|
||||
- The agent may purposely cause chaotic or noisy observations (like flickering pixels) that are impossible to predict.
|
||||
- The model cannot reduce prediction error on pure noise, so the agent is rewarded for meaningless randomness.
|
||||
- This yields high intrinsic reward without meaningful learning or progress toward task goals.
|
||||
|
||||
$$
|
||||
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
|
||||
$$
|
||||
To prevent this, we restrict prediction to a more informative feature space:
|
||||
|
||||
Only task rewards:
|
||||
$B(s, a, s') = |T(E(s; \phi), a; \theta) - E(s'; \phi)|$
|
||||
|
||||
$$
|
||||
R(s,a,s') = r(s,a,s')
|
||||
$$
|
||||
Parameter explanations:
|
||||
|
||||
Task+curiosity rewards:
|
||||
- $E(s; \phi)$: learned encoder mapping raw state $s$ into a feature vector.
|
||||
- $\phi$: parameter vector of the encoder $E$.
|
||||
- $T(E(s; \phi), a; \theta)$: forward model predicting next feature representation from encoded state and action.
|
||||
- $E(s'; \phi)$: encoded feature representation of the next state $s'$.
|
||||
- $B(s, a, s')$: intrinsic reward based on prediction error in feature space.
|
||||
|
||||
$$
|
||||
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
|
||||
$$
|
||||
Key ideas:
|
||||
|
||||
Sparse task + curiosity rewards:
|
||||
- The encoder $E(s; \phi)$ is trained so that features capture aspects of the state that are controllable by the agent.
|
||||
- One approach is to train $E$ via an inverse dynamics model that predicts $a$ from $(s, s')$.
|
||||
- This encourages $E$ to keep only information necessary to infer actions, discarding irrelevant noise.
|
||||
- Measuring prediction error in feature space ignores unpredictable environmental noise.
|
||||
- Intrinsic reward focuses on errors due to lack of knowledge about controllable dynamics.
|
||||
- The agent's curiosity is directed toward aspects of the environment it can influence and learn.
|
||||
|
||||
$$
|
||||
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
|
||||
$$
|
||||
A practical implementation is the Intrinsic Curiosity Module (ICM) by Pathak et al. (2017):
|
||||
|
||||
Only curiosity rewards:
|
||||
- The encoder $E$ and forward model $T$ are trained jointly.
|
||||
- The loss includes both forward prediction error and inverse dynamics error.
|
||||
- Intrinsic reward is set to the forward prediction error in feature space.
|
||||
- This drives exploration of states where the agent cannot yet predict the effect of its actions.
|
||||
|
||||
$$
|
||||
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
|
||||
$$
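
The following is a hedged PyTorch sketch of an ICM-style module; the layer sizes, names, and loss combination are illustrative assumptions rather than the original implementation. It shows the recipe described above: an encoder $E$, a forward model $T$ trained in feature space, and an inverse model whose loss shapes the features to keep controllable information:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Sketch of an ICM-style module: encoder E, forward model T, inverse dynamics model."""
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)   # predicts E(s')
        self.inverse_model = nn.Linear(2 * feat_dim, n_actions)          # predicts a from (E(s), E(s'))
        self.n_actions = n_actions

    def losses_and_bonus(self, s, a, s_next):
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_onehot = F.one_hot(a, self.n_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        forward_loss = F.mse_loss(phi_next_pred, phi_next.detach())      # train T in feature space
        a_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(a_logits, a)                      # shapes E toward controllable factors
        bonus = (phi_next_pred - phi_next).pow(2).sum(dim=-1).detach()   # intrinsic reward per transition
        return forward_loss, inverse_loss, bonus

icm = ICM(obs_dim=8, n_actions=4)
s, a, s_next = torch.randn(16, 8), torch.randint(0, 4, (16,)), torch.randn(16, 8)
f_loss, i_loss, bonus = icm.losses_and_bonus(s, a, s_next)
(f_loss + i_loss).backward()   # in practice weighted and combined with the policy loss
```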
|
||||
#### Random Network Distillation (RND)
|
||||
|
||||
#### Intrinsic Reward RL is not New
|
||||
Random Network Distillation (RND) provides a simpler curiosity bonus without learning a dynamics model.
|
||||
|
||||
- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS’05. pp. 547–554 (2006)
|
||||
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN’91. vol. 2, pp. 1458–1463 (1991)
|
||||
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990-2010). Autonomous Mental Development, IEEE Trans. on Autonomous Mental Development 2(3), 230–247 (9 2010)
|
||||
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS’04 (2004)
|
||||
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN’95 (1995)
|
||||
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708
|
||||
Basic idea:
|
||||
|
||||
#### Limitation of Prediction Errors
|
||||
- Use a fixed random neural network $f_{\text{target}}$ that maps states to feature vectors.
|
||||
- Train a predictor network $f_{\text{pred}}$ to approximate $f_{\text{target}}$ on visited states.
|
||||
- The intrinsic reward is the prediction error between $f_{\text{pred}}(s)$ and $f_{\text{target}}(s)$.
|
||||
|
||||
- Agent will be rewarded even though the model cannot improve.
|
||||
- So it will focus on parts of environment that are inherently unpredictable or stochastic.
|
||||
- Example: the noisy-TV problem
|
||||
- The agent is attracted forever in the most noisy states, with unpredictable outcomes.
|
||||
Typical form of the intrinsic reward:
|
||||
|
||||
#### Random Network Distillation
|
||||
$r^{\text{int}}(s) = \|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}$
|
||||
|
||||
Original idea: Predicting the output of a fixed and randomly initialized neural network on the next state, given the current state and action.
|
||||
Parameter explanations:
|
||||
|
||||
New idea: Predicting the output of a fixed and randomly initialized neural network on the next state, given the **next state itself.**
|
||||
- $f_{\text{target}}$: fixed random neural network generating target features for each state.
|
||||
- $f_{\text{pred}}(s; \psi)$: trainable predictor network with parameters $\psi$.
|
||||
- $\psi$: parameter vector for the predictor network.
|
||||
- $s$: state input to both networks.
|
||||
- $\|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}$: squared error between predictor and target features.
|
||||
- $r^{\text{int}}(s)$: intrinsic reward based on prediction error in random feature space.
|
||||
|
||||
- The target network is a neural network with fixed, randomized weights, which is never trained.
|
||||
- The prediction network is trained to predict the target network's output.
|
||||
Key properties:
|
||||
|
||||
> the more you visit the state, the less loss you will have.
|
||||
- For novel or rarely visited states, $f_{\text{pred}}$ has not yet learned to match $f_{\text{target}}$, so error is high.
|
||||
- For frequently visited states, prediction error becomes small, and intrinsic reward decays.
|
||||
- The target network is random and fixed, so it does not adapt to the policy.
|
||||
- This provides a stable novelty signal without explicit dynamics learning.
|
||||
- RND achieves strong exploration performance in challenging environments, such as hard-exploration Atari games.
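
A minimal RND sketch in PyTorch, assuming small MLPs and an Adam optimizer (all sizes are illustrative): the target network is frozen, the predictor is trained on visited states, and the per-state squared error serves as the intrinsic reward:

```python
import torch
import torch.nn as nn

def make_net(obs_dim, out_dim=64):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

obs_dim = 8
target = make_net(obs_dim)                    # f_target: random and fixed
for p in target.parameters():
    p.requires_grad_(False)                   # never trained
predictor = make_net(obs_dim)                 # f_pred: trained on visited states
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus_and_update(states):
    """Intrinsic reward r_int(s) = ||f_pred(s) - f_target(s)||^2, then one predictor step."""
    with torch.no_grad():
        tgt = target(states)
    pred = predictor(states)
    per_state_error = (pred - tgt).pow(2).sum(dim=-1)   # novelty signal per state
    loss = per_state_error.mean()
    opt.zero_grad(); loss.backward(); opt.step()        # error shrinks for frequently visited states
    return per_state_error.detach()

print(rnd_bonus_and_update(torch.randn(32, obs_dim)))
```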
|
||||
|
||||
### Posterior Sampling
|
||||
### Efficacy of Curiosity-Driven Exploration
|
||||
|
||||
Uncertainty about Q-value functions or policies
|
||||
Empirical observations:
|
||||
|
||||
Selecting actions according to the probability that they are optimal under the current model.
|
||||
- Curiosity-driven intrinsic rewards often lead to significantly higher extrinsic returns in sparse-reward environments compared to agents trained only on extrinsic rewards.
|
||||
- Intrinsic rewards act as a proxy objective that guides the agent toward interesting or informative regions of the state space.
|
||||
- In some experiments, agents trained with only intrinsic rewards (no extrinsic reward during training) still learn behaviors that later achieve high task scores when extrinsic rewards are measured.
|
||||
- Using random features for curiosity (as in RND) can perform nearly as well as using learned features in many domains.
|
||||
- Simple surprise signals are often sufficient to drive effective exploration.
|
||||
- Learned feature spaces may generalize better to truly novel scenarios but are not always necessary.
|
||||
|
||||
#### Exploration with Action Value Information
|
||||
Historical context:
|
||||
|
||||
Count-based and curiosity-driven methods do not take into account action-value information.
|
||||
- The concept of learning from intrinsic rewards alone is not new.
|
||||
- Itti and Baldi (2005) studied "Bayesian surprise" as a driver of human attention.
|
||||
- Schmidhuber (1991, 2010) formalized curiosity, creativity, and fun as intrinsic motivations in learning agents.
|
||||
- Singh et al. (2004) proposed intrinsically motivated reinforcement learning frameworks.
|
||||
- These early works laid the conceptual foundation for modern curiosity-driven deep RL methods.
|
||||
|
||||

|
||||
For further reading on intrinsic curiosity methods:
|
||||
|
||||
> In this case, the optimal solution is action 1, but we will explore action 3 because it has the highest uncertainty. It also takes a long time to distinguish actions 1 and 2 since they have similar values.
|
||||
- Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction", 2017.
|
||||
- Burda et al., "Exploration by Random Network Distillation", 2018.
|
||||
- Schmidhuber, "Formal Theory of Creativity, Fun, and Intrinsic Motivation", 2010.
|
||||
|
||||
#### Exploration via Posterior Sampling of Q Functions
|
||||
## Exploration via Posterior Sampling
|
||||
|
||||
- Represent a posterior distribution of Q functions, instead of a point estimate.
|
||||
1. Sample a Q-function from the posterior: $Q\sim P(Q)$
|
||||
2. Choose actions according to this $Q$ for one episode $a=\arg\max_{a} Q(s,a)$
|
||||
3. Update $P(Q)$ based on the sampled $Q$ and collected experience tuples $(s,a,r,s')$
|
||||
- Then we do not need $\epsilon$-greedy for exploration! Better exploration by representing uncertainty over Q.
|
||||
- But how can we learn a distribution of Q functions $P(Q)$ if Q function is a deep neural network?
|
||||
While optimistic and curiosity bonus methods modify the reward function, posterior sampling approaches handle exploration by maintaining uncertainty over models or value functions and sampling from this uncertainty.
|
||||
|
||||
#### Bootstrap Ensemble
|
||||
These methods are rooted in Thompson Sampling and naturally balance exploration and exploitation.
|
||||
|
||||
- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data
- Computationally expensive
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data
|
||||
### Posterior Sampling in Multi-Armed Bandits (Thompson Sampling)
|
||||
|
||||
### Questions
|
||||
In a multi-armed bandit problem (no state transitions), Thompson Sampling works as follows:
|
||||
|
||||
- Why do PG methods implicitly support exploration?
|
||||
- Is it sufficient? How can we improve its implicit exploration?
|
||||
- What are limitations of entropy regularization?
|
||||
- How can we improve exploration for PG methods?
|
||||
- Intrinsic-motivated bonuses (e.g., RND)
|
||||
- Explicitly optimize per-state entropy in the return (e.g., SAC)
|
||||
- Hierarchical RL
|
||||
- Goal-conditional RL
|
||||
- What are potentially more effective exploration methods?
|
||||
- Knowledge-driven
|
||||
- Model-based exploration
|
||||
1. Maintain a prior and posterior distribution over the reward parameters for each arm.
|
||||
2. At each time step, sample reward parameters for all arms from their current posterior.
|
||||
3. Select the arm with the highest sampled mean reward.
|
||||
4. Observe the reward, update the posterior, and repeat.
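
For concreteness, here is a small Thompson Sampling sketch for the steps above, using Bernoulli arms with Beta posteriors (the arm means are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.6])        # hypothetical Bernoulli arms
alpha = np.ones(3)                            # Beta posterior: observed successes + 1
beta = np.ones(3)                             # Beta posterior: observed failures + 1

for t in range(2000):
    theta = rng.beta(alpha, beta)             # sample reward parameters from the posterior
    arm = int(np.argmax(theta))               # pick the arm that looks best in this draw
    reward = rng.random() < true_means[arm]   # observe a Bernoulli reward
    alpha[arm] += reward                      # conjugate posterior update
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))  # concentrates near the true means over time
```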
|
||||
|
||||
Intuition:
|
||||
|
||||
- Each action is selected with probability equal to the posterior probability that it is optimal.
|
||||
- Arms with high uncertainty are more likely to be sampled as optimal in some posterior draws.
|
||||
- Exploration arises naturally from uncertainty, without explicit epsilon-greedy noise or bonus terms.
|
||||
- Over time, the posterior concentrates on the true reward means, and the algorithm shifts toward exploitation.
|
||||
|
||||
Theoretical properties:
|
||||
|
||||
- Thompson Sampling attains near-optimal regret bounds in many bandit settings.
|
||||
- It often performs as well as or better than upper confidence bound algorithms in practice.
|
||||
|
||||
### Posterior Sampling for Reinforcement Learning (PSRL)
|
||||
|
||||
In reinforcement learning with states and transitions, posterior sampling generalizes to sampling entire MDP models.
|
||||
|
||||
Posterior Sampling for Reinforcement Learning (PSRL) operates as follows:
|
||||
|
||||
1. Maintain a posterior distribution over environment dynamics and rewards, based on observed transitions.
|
||||
2. At the beginning of an episode, sample an MDP model from this posterior.
|
||||
3. Compute the optimal policy for the sampled MDP (for example, by value iteration).
|
||||
4. Execute this policy in the real environment for the whole episode.
|
||||
5. Use the observed transitions to update the posterior, then repeat.
|
||||
|
||||
Key advantages:
|
||||
|
||||
- The agent commits to a sampled model's policy for an extended duration, which induces deep exploration.
|
||||
- If a sampled model is optimistic in unexplored regions, the corresponding policy will deliberately visit those regions.
|
||||
- Exploration is coherent across time within an episode, unlike per-step randomization in epsilon-greedy.
|
||||
- The method does not require ad hoc exploration bonuses; exploration is an emergent property of the posterior.
|
||||
|
||||
Challenges:
|
||||
|
||||
- Maintaining an exact posterior over high-dimensional MDPs is usually intractable.
|
||||
- Practical implementations use approximations.
|
||||
|
||||
### Approximate Posterior Sampling with Ensembles (Bootstrapped DQN)
|
||||
|
||||
A common approximate posterior method in deep RL is Bootstrapped DQN.
|
||||
|
||||
Basic idea:
|
||||
|
||||
- Train an ensemble of $K$ Q-networks (heads), $Q^{(1)}, \dots, Q^{(K)}$.
|
||||
- Each head is trained on a different bootstrap sample or masked subset of experience.
|
||||
- At the start of each episode, sample a head index $k$ uniformly from $\{1, \dots, K\}$.
|
||||
- For the entire episode, act greedily with respect to $Q^{(k)}$.
|
||||
|
||||
Parameter definitions for the ensemble:
|
||||
|
||||
- $K$: number of Q-network heads in the ensemble.
|
||||
- $Q^{(k)}(s, a)$: Q-value estimate for head $k$ at state-action pair $(s, a)$.
|
||||
- $k$: index of the sampled head used for the current episode.
|
||||
- $(s, a)$: state and action arguments to Q-value functions.
|
||||
|
||||
Implementation details:
|
||||
|
||||
- A shared feature backbone network processes state inputs, feeding into all heads.
|
||||
- Each head has its own final layers, allowing diverse value estimates.
|
||||
- Masking or bootstrapping assigns different subsets of transitions to different heads during training.
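
A hedged sketch of the shared-backbone, multi-head architecture (layer sizes and the mask probability are assumptions, not the paper's exact settings):

```python
import torch
import torch.nn as nn

class BootstrappedQ(nn.Module):
    """Shared backbone with K Q-heads; each head trains on its own bootstrap-masked data."""
    def __init__(self, obs_dim, n_actions, K=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(128, n_actions) for _ in range(K)])

    def forward(self, s, k):
        return self.heads[k](self.backbone(s))   # Q^{(k)}(s, .)

K = 10
net = BootstrappedQ(obs_dim=8, n_actions=4, K=K)
k = torch.randint(K, (1,)).item()                 # sample one head per episode
state = torch.randn(1, 8)
action = net(state, k).argmax(dim=-1)             # act greedily w.r.t. the sampled head all episode
# When storing a transition, also store a bootstrap mask deciding which heads train on it.
mask = torch.bernoulli(torch.full((K,), 0.8))
```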
|
||||
|
||||
Benefits:
|
||||
|
||||
- Each head approximates a different plausible Q-function, analogous to a sample from a posterior.
|
||||
- When a head is optimistic about certain under-explored actions, its greedy policy will explore them deeply.
|
||||
- Exploration behavior is temporally consistent within an episode.
|
||||
- No modification of the reward function is required; exploration arises from policy randomization via multiple heads.
|
||||
|
||||
Comparison to epsilon-greedy:
|
||||
|
||||
- Epsilon-greedy adds per-step random actions, which can be inefficient for long-horizon exploration.
|
||||
- Bootstrapped DQN commits to a strategy for an episode, enabling the agent to execute complete exploratory plans.
|
||||
- This can dramatically increase the probability of discovering long sequences needed to reach sparse rewards.
|
||||
|
||||
Other approximate posterior approaches:
|
||||
|
||||
- Bayesian neural networks for Q-functions (explicit parameter distributions).
|
||||
- Using Monte Carlo dropout at inference to sample Q-functions.
|
||||
- Randomized prior functions added to Q-networks to maintain exploration.
|
||||
|
||||
Theoretical insights:
|
||||
|
||||
- Posterior sampling methods can enjoy strong regret bounds in some RL settings.
|
||||
- They can have better asymptotic constants than optimism-based methods in certain problems.
|
||||
- Coherent, temporally extended exploration is essential in environments with delayed rewards and complex goals.
|
||||
|
||||
For further reading:
|
||||
|
||||
- Osband et al., "Deep Exploration via Bootstrapped DQN", 2016.
|
||||
- Osband and Van Roy, "Why Is Posterior Sampling Better Than Optimism for Reinforcement Learning?", 2017.
|
||||
- Chapelle and Li, "An Empirical Evaluation of Thompson Sampling", 2011.
|
||||
@@ -1,50 +1,317 @@
|
||||
# CSE510 Deep Reinforcement Learning (Lecture 22)
|
||||
|
||||
## Offline Reinforcement Learning
|
||||
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
|
||||
|
||||
### Requirements for Current Successes
|
||||
## Offline Reinforcement Learning: Introduction and Challenges
|
||||
|
||||
- Access to the Environment Model or Simulator
|
||||
- Not Costly for Exploration or Trial-and-Error
|
||||
Offline reinforcement learning (offline RL), also called batch RL, aims to learn an optimal policy **without** interacting with the environment. Instead, the agent is given a fixed dataset of transitions collected by an unknown behavior policy.
|
||||
|
||||
#### Background: Offline RL
|
||||
### The Offline RL Dataset
|
||||
|
||||
- The success of modern machine learning
|
||||
- Scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
|
||||
- Reinforcement learning
|
||||
- Online learning paradigm
|
||||
- Interaction is expensive & dangerous
|
||||
- Healthcare, Robotics, Recommendation...
|
||||
- Can we develop data-driven offline RL?
|
||||
|
||||
#### Definition in Offline RL
|
||||
|
||||
- the policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which is collected by an _unknown behavior policy_ $\pi_\beta$
|
||||
- Interaction is not allowed
|
||||
|
||||
- $\mathcal{D}=\{(s_i,a_i,s_i',r_i)\}$
|
||||
- $s\sim d^{\pi_\beta} (s)$
|
||||
- $a\sim \pi_\beta (a|s)$
|
||||
- $s'\sim p(s'|s,a)$
|
||||
- $r\gets r(s,a)$
|
||||
- Objective: $\max_\pi\sum _{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s),a_t\sim \pi(a|s)}[\gamma^tr(s_t,a_t)]$
|
||||
|
||||
#### Key challenge in Offline RL
|
||||
|
||||
Distribution Shift
|
||||
|
||||
What about using traditional reinforcement learning (bootstrapping)?
|
||||
We are given a static dataset:
|
||||
|
||||
$$
|
||||
Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')
|
||||
D = \{ (s_i, a_i, s'_i, r_i) \}_{i=1}^N
|
||||
$$
|
||||
|
||||
$$
|
||||
\pi(s)=\arg\max_{a\in A} Q(s,a)
|
||||
$$
|
||||
Parameter explanations:
|
||||
|
||||
but notice that
|
||||
- $s_i$: state sampled from behavior policy state distribution.
|
||||
- $a_i$: action selected by the behavior policy $\pi_\beta$.
|
||||
- $s'_i$: next state sampled from environment dynamics $p(s'|s,a)$.
|
||||
- $r_i$: reward observed for transition $(s_i,a_i)$.
|
||||
- $N$: total number of transitions in the dataset.
|
||||
- $D$: full offline dataset used for training.
|
||||
|
||||
The goal is to learn a new policy $\pi$ maximizing expected discounted return using only $D$:
|
||||
|
||||
$$
P_{\pi_\beta}(s,a)\neq P_{\pi_f}(s,a)
$$

$$
\max_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^T \gamma^t r(s_t, a_t)\Big]
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\pi$: policy we want to learn.
|
||||
- $r(s,a)$: reward received for state-action pair.
|
||||
- $\gamma$: discount factor controlling weight of future rewards.
|
||||
- $T$: horizon or trajectory length.
|
||||
|
||||
### Why Offline RL Is Difficult
|
||||
|
||||
Offline RL is fundamentally harder than online RL because:
|
||||
|
||||
- The agent cannot try new actions to fix wrong value estimates.
|
||||
- The policy may choose out-of-distribution actions not present in $D$.
|
||||
- Q-value estimates for unseen actions can be arbitrarily incorrect.
|
||||
- Bootstrapping on wrong Q-values can cause divergence.
|
||||
|
||||
This leads to two major failure modes:
|
||||
|
||||
1. **Distribution shift**: new policy actions differ from dataset actions.
2. **Extrapolation error**: the Q-function guesses values for unseen actions.
|
||||
|
||||
### Extrapolation Error Problem
|
||||
|
||||
In standard Q-learning, the Bellman backup is:
|
||||
|
||||
$$
|
||||
Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $Q(s,a)$: estimated value of taking action $a$ in state $s$.
|
||||
- $\max_{a'}$: maximum over possible next actions.
|
||||
- $a'$: candidate next action for evaluation in backup step.
|
||||
|
||||
If $a'$ was rarely or never taken in the dataset, $Q(s',a')$ is poorly estimated, so Q-learning bootstraps off invalid values, causing instability.
|
||||
|
||||
### Behavior Cloning (BC): The Safest Baseline
|
||||
|
||||
The simplest offline method is to imitate the behavior policy:
|
||||
|
||||
$$
|
||||
\max_{\phi} \; \mathbb{E}_{(s,a) \sim D}[\log \pi_{\phi}(a|s)]
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\phi$: neural network parameters of the cloned policy.
|
||||
- $\pi_{\phi}$: learned policy approximating behavior policy.
|
||||
- $\log \pi_{\phi}(a|s)$: log-likelihood of the dataset action; its negative is the training loss.
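
A minimal behavior-cloning sketch for discrete actions, assuming a small MLP policy and a random stand-in batch for the dataset:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # logits of pi_phi(a|s)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_step(states, actions):
    """Maximize E_{(s,a)~D}[log pi_phi(a|s)], i.e. minimize cross-entropy on dataset actions."""
    logits = policy(states)
    loss = F.cross_entropy(logits, actions)   # negative log-likelihood of behavior actions
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# One step on a random "dataset" batch (discrete actions assumed).
print(bc_step(torch.randn(64, 8), torch.randint(0, 4, (64,))))
```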
|
||||
|
||||
Pros:
|
||||
|
||||
- Does not suffer from extrapolation error.
|
||||
- Extremely stable.
|
||||
|
||||
Cons:
|
||||
|
||||
- Cannot outperform the behavior policy.
|
||||
- Ignores reward information entirely.
|
||||
|
||||
### Naive Offline Q-Learning Fails
|
||||
|
||||
Directly applying off-policy Q-learning on $D$ generally leads to:
|
||||
|
||||
- Overestimation of unseen actions.
|
||||
- Divergence due to extrapolation error.
|
||||
- Policies worse than behavior cloning.
|
||||
|
||||
## Strategies for Safe Offline RL
|
||||
|
||||
There are two primary families of solutions:
|
||||
|
||||
1. **Policy constraint methods**
2. **Conservative value estimation methods**
|
||||
|
||||
---
|
||||
|
||||
# 1. Policy Constraint Methods
|
||||
|
||||
These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.
|
||||
|
||||
### Advantage Weighted Regression (AWR / AWAC)
|
||||
|
||||
Policy update:
|
||||
|
||||
$$
|
||||
\pi(a|s) \propto \pi_{\beta}(a|s)\exp\left(\frac{1}{\lambda}A(s,a)\right)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\pi_{\beta}$: behavior policy used to collect dataset.
|
||||
- $A(s,a)$: advantage function derived from Q or V estimates.
|
||||
- $\lambda$: temperature controlling strength of advantage weighting.
|
||||
- $\exp(\cdot)$: positive weighting on high-advantage actions.
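
A sketch of the corresponding policy loss for discrete actions: dataset actions are re-weighted by exponentiated advantages. The weight clipping and batch shapes are assumptions for illustration, and the advantages are assumed to come from a separately learned critic:

```python
import torch
import torch.nn.functional as F

def awr_policy_loss(policy_logits, actions, advantages, lam=1.0, max_weight=20.0):
    """Weighted behavior cloning: log pi(a|s) weighted by exp(A(s,a)/lambda)."""
    weights = torch.exp(advantages / lam).clamp(max=max_weight)    # clip to avoid exploding weights
    log_probs = -F.cross_entropy(policy_logits, actions, reduction="none")
    return -(weights.detach() * log_probs).mean()                  # minimize negative weighted log-likelihood

# Toy batch: 64 states, 4 discrete actions.
loss = awr_policy_loss(torch.randn(64, 4), torch.randint(0, 4, (64,)), torch.randn(64))
print(loss.item())
```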
|
||||
|
||||
Properties:
|
||||
|
||||
- Uses advantages to filter good and bad actions.
|
||||
- Improves beyond behavior policy while staying safe.
|
||||
|
||||
### Batch-Constrained Q-learning (BCQ)
|
||||
|
||||
BCQ constrains the policy using a generative model:
|
||||
|
||||
1. Train a VAE $G_{\omega}$ to model $a$ given $s$.
|
||||
2. Train a small perturbation model $\xi$.
|
||||
3. Limit the policy to $a = G_{\omega}(s) + \xi(s)$.
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $G_{\omega}(s)$: VAE-generated action similar to data actions.
|
||||
- $\omega$: VAE parameters.
|
||||
- $\xi(s)$: small correction to generated actions.
|
||||
- $a$: final policy action constrained near dataset distribution.
|
||||
|
||||
BCQ avoids selecting unseen actions and strongly reduces extrapolation.
|
||||
|
||||
### BEAR (Bootstrapping Error Accumulation Reduction)
|
||||
|
||||
BEAR adds explicit constraints:
|
||||
|
||||
$$
|
||||
D_{MMD}\left(\pi(a|s), \pi_{\beta}(a|s)\right) < \epsilon
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $D_{MMD}$: Maximum Mean Discrepancy distance between action distributions.
|
||||
- $\epsilon$: threshold restricting policy deviation from behavior policy.
|
||||
|
||||
BEAR controls distribution shift more tightly than BCQ.
|
||||
|
||||
---
|
||||
|
||||
# 2. Conservative Value Function Methods
|
||||
|
||||
These methods modify Q-learning so that Q-values of unseen actions are **underestimated**, preventing the policy from exploiting overestimated values.
|
||||
|
||||
### Conservative Q-Learning (CQL)
|
||||
|
||||
One formulation is:
|
||||
|
||||
$$
|
||||
J(Q) = J_{TD}(Q) + \alpha\big(\mathbb{E}_{a\sim\pi(\cdot|s)}Q(s,a) - \mathbb{E}_{a\sim D}Q(s,a)\big)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $J_{TD}$: standard Bellman TD loss.
|
||||
- $\alpha$: weight of conservatism penalty.
|
||||
- $\mathbb{E}_{a\sim\pi(\cdot|s)}$: expectation over policy-chosen actions.
|
||||
- $\mathbb{E}_{a\sim D}$: expectation over dataset actions.
|
||||
|
||||
Effect:
|
||||
|
||||
- Increases Q-values of dataset actions.
|
||||
- Decreases Q-values of out-of-distribution actions.
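
A sketch of the conservatism penalty alone for a discrete-action critic; the TD term $J_{TD}$ is omitted and the value of $\alpha$ is an arbitrary example:

```python
import torch

def cql_penalty(q_values, data_actions, policy_probs):
    """Gap E_{a~pi}[Q(s,a)] - E_{a~D}[Q(s,a)] for a discrete-action Q-network.

    q_values:     (batch, n_actions) Q(s, .) from the critic
    data_actions: (batch,) actions actually present in the dataset
    policy_probs: (batch, n_actions) current policy pi(.|s)
    """
    q_pi = (policy_probs * q_values).sum(dim=-1)                       # E_{a~pi} Q(s,a)
    q_data = q_values.gather(1, data_actions.unsqueeze(1)).squeeze(1)  # Q(s, a_data)
    return (q_pi - q_data).mean()

q = torch.randn(32, 4)
a = torch.randint(0, 4, (32,))
pi = torch.softmax(torch.randn(32, 4), dim=-1)
alpha = 5.0
penalty = alpha * cql_penalty(q, a, pi)   # J(Q) = J_TD(Q) + penalty in the full objective
print(penalty.item())
```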
|
||||
|
||||
### Implicit Q-Learning (IQL)
|
||||
|
||||
IQL avoids constraints entirely by using expectile regression:
|
||||
|
||||
Value regression:
|
||||
|
||||
$$
|
||||
V(s) = \arg\min_{v} \; \mathbb{E}\big[\rho_{\tau}(Q(s,a) - v)\big]
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $v$: scalar value estimate for state $s$.
|
||||
- $\rho_{\tau}(x)$: expectile regression loss.
|
||||
- $\tau$: expectile parameter controlling conservatism.
|
||||
- $Q(s,a)$: Q-value estimate.
|
||||
|
||||
Key idea:
|
||||
|
||||
- For $\tau < 1$, IQL reduces sensitivity to large (possibly incorrect) Q-values.
|
||||
- Implicitly conservative without special constraints.
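
A minimal sketch of the expectile loss $\rho_{\tau}$ used to fit $V(s)$ against $Q(s,a)$ on dataset actions (tensor shapes are illustrative):

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """rho_tau(Q - V): asymmetric squared error used to fit V(s) in IQL-style training."""
    diff = q_values - v_values
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1 - tau))
    return (weight * diff.pow(2)).mean()   # tau > 0.5 pushes V toward an upper expectile of Q

# With tau close to 1, V(s) approximates a max over in-distribution actions
# without ever querying Q at unseen actions.
print(expectile_loss(torch.randn(128), torch.randn(128), tau=0.7).item())
```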
|
||||
|
||||
IQL often achieves state-of-the-art performance due to simplicity and stability.
|
||||
|
||||
---
|
||||
|
||||
# Model-Based Offline RL
|
||||
|
||||
### Forward Model-Based RL
|
||||
|
||||
Train a dynamics model:
|
||||
|
||||
$$
|
||||
p_{\theta}(s'|s,a)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $p_{\theta}$: learned transition model.
|
||||
- $\theta$: parameters of transition model.
|
||||
|
||||
We can generate synthetic transitions using $p_{\theta}$, but model error accumulates.
|
||||
|
||||
### Penalty-Based Model Approaches (MOPO, MOReL)
|
||||
|
||||
Add uncertainty penalty:
|
||||
|
||||
$$
|
||||
r_{model}(s,a) = r(s,a) - \beta\, u(s,a)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $r_{model}$: penalized reward for model rollouts.
|
||||
- $u(s,a)$: model uncertainty estimate.
|
||||
- $\beta$: penalty coefficient.
|
||||
|
||||
These methods limit exploration into unknown model regions.
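
As an illustration, the sketch below penalizes rewards using ensemble disagreement as the uncertainty proxy $u(s,a)$; this is a common but simplified choice, not the exact penalty used in MOPO or MOReL:

```python
import numpy as np

def penalized_reward(r, s, a, ensemble, beta=1.0):
    """r_model(s,a) = r(s,a) - beta * u(s,a), with u(s,a) taken here as the
    disagreement (std) across an ensemble of learned dynamics models."""
    preds = np.stack([model(s, a) for model in ensemble])   # (n_models, state_dim) predictions
    u = preds.std(axis=0).max()                             # simple uncertainty proxy
    return r - beta * u

# Hypothetical 3-model ensemble of linear dynamics models (stand-ins for p_theta).
ensemble = [lambda s, a, w=w: w * s for w in (0.9, 1.0, 1.1)]
print(penalized_reward(1.0, np.ones(4), 0, ensemble, beta=0.5))
```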
|
||||
|
||||
---
|
||||
|
||||
# Reverse Model-Based Imagination (ROMI)
|
||||
|
||||
ROMI generates new training data by **backward** imagination.
|
||||
|
||||
### Reverse Dynamics Model
|
||||
|
||||
ROMI learns:
|
||||
|
||||
$$
|
||||
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\psi$: parameters of reverse dynamics model.
|
||||
- $s_{t+1}$: later state.
|
||||
- $a_{t}$: action taken leading to $s_{t+1}$.
|
||||
- $s_{t}$: predicted predecessor state.
|
||||
|
||||
ROMI also learns a reverse policy for sampling likely predecessor actions.
|
||||
|
||||
### Reverse Imagination Process
|
||||
|
||||
Given a goal state $s_{g}$:
|
||||
|
||||
1. Sample $a_{t}$ from reverse policy.
|
||||
2. Predict $s_{t}$ from reverse dynamics.
|
||||
3. Form imagined transition $(s_{t}, a_{t}, s_{t+1})$.
|
||||
4. Repeat to build longer imagined trajectories.
|
||||
|
||||
Benefits:
|
||||
|
||||
- Imagined transitions end in real states, ensuring grounding.
|
||||
- Completes missing parts of dataset.
|
||||
- Helps propagate reward backward reliably.
|
||||
|
||||
ROMI combined with conservative RL often outperforms standard offline methods.
|
||||
|
||||
---
|
||||
|
||||
# Summary of Lecture 22
|
||||
|
||||
Offline RL requires balancing:
|
||||
|
||||
- Improvement beyond dataset behavior.
|
||||
- Avoiding unsafe extrapolation to unseen actions.
|
||||
|
||||
Three major families of solutions:
|
||||
|
||||
1. Policy constraints (BCQ, BEAR, AWR)
|
||||
2. Conservative Q-learning (CQL, IQL)
|
||||
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)
|
||||
|
||||
Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.
|
||||
|
||||
---
|
||||
|
||||
# Recommended Screenshot Frames for Lecture 22
|
||||
|
||||
- Lecture 22, page 7: Offline RL diagram showing policy learning from fixed dataset, subsection "Offline RL Setting".
|
||||
- Lecture 22, page 35: Illustration of dataset support vs policy action distribution, subsection "Strategies for Safe Offline RL".
|
||||
|
||||
---
|
||||
|
||||
**End of CSE510_L22.md**
|
||||
|
||||
@@ -1,3 +1,177 @@
|
||||
# CSE510 Deep Reinforcement Learning (Lecture 23)
|
||||
|
||||
## Offline Reinforcement Learning
|
||||
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
|
||||
|
||||
## Offline Reinforcement Learning Part II: Advanced Approaches
|
||||
|
||||
Lecture 23 continues with advanced topics in offline RL, expanding on model-based imagination methods and credit assignment structures relevant for offline multi-agent and single-agent settings.
|
||||
|
||||
## Reverse Model-Based Imagination (ROMI)
|
||||
|
||||
ROMI is a method for augmenting an offline dataset with additional transitions generated by imagining trajectories **backwards** from desirable states. Unlike forward model rollouts, backward imagination stays grounded in real data because imagined transitions always terminate in dataset states.
|
||||
|
||||
### Reverse Dynamics Model
|
||||
|
||||
ROMI learns a reverse dynamics model:
|
||||
|
||||
$$
|
||||
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $p_{\psi}$: learned reverse transition model.
|
||||
- $\psi$: parameter vector for the reverse model.
|
||||
- $s_{t+1}$: next state (from dataset).
|
||||
- $a_{t}$: action that hypothetically leads into $s_{t+1}$.
|
||||
- $s_{t}$: predicted predecessor state.
|
||||
|
||||
ROMI also learns a reverse policy to sample actions that likely lead into known states:
|
||||
|
||||
$$
|
||||
\pi_{rev}(a_{t} \mid s_{t+1})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\pi_{rev}$: reverse policy distribution.
|
||||
- $a_{t}$: action sampled for backward trajectory generation.
|
||||
- $s_{t+1}$: state whose predecessors are being imagined.
|
||||
|
||||
### Reverse Imagination Process
|
||||
|
||||
To generate imagined transitions:
|
||||
|
||||
1. Select a goal or high-value state $s_{g}$ from the offline dataset.
|
||||
2. Sample $a_{t}$ from $\pi_{rev}(a_{t} \mid s_{g})$.
|
||||
3. Predict $s_{t}$ from $p_{\psi}(s_{t} \mid s_{g}, a_{t})$.
|
||||
4. Form an imagined transition $(s_{t}, a_{t}, s_{g})$.
|
||||
5. Repeat backward to obtain a longer imagined trajectory.
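
A schematic sketch of this backward-imagination loop, with toy stand-ins for the learned reverse policy and reverse dynamics model:

```python
import numpy as np

def reverse_rollout(s_goal, reverse_policy, reverse_model, horizon=5):
    """Generate imagined transitions backward from a real (goal) state.

    reverse_policy(s_next) -> a_t        samples an action likely to lead into s_next
    reverse_model(s_next, a_t) -> s_t    predicts a predecessor state
    """
    transitions, s_next = [], s_goal
    for _ in range(horizon):
        a = reverse_policy(s_next)
        s_prev = reverse_model(s_next, a)
        transitions.append((s_prev, a, s_next))   # each imagined transition ends in a grounded state
        s_next = s_prev                           # keep imagining further back in time
    return list(reversed(transitions))            # ordered forward in time

# Toy stand-ins for the learned reverse policy/model.
rng = np.random.default_rng(0)
rollout = reverse_rollout(
    s_goal=np.ones(4),
    reverse_policy=lambda s: rng.integers(4),
    reverse_model=lambda s, a: s - 0.1,
)
print(len(rollout), "imagined transitions; the last one ends in the real goal state")
```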
|
||||
|
||||
Benefits:
|
||||
|
||||
- Imagined states remain grounded by terminating in real dataset states.
|
||||
- Helps propagate reward signals backward through states not originally visited.
|
||||
- Avoids runaway model error that occurs in forward model rollouts.
|
||||
|
||||
ROMI effectively fills in missing gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms.
|
||||
|
||||
---
|
||||
|
||||
## Implicit Credit Assignment via Value Factorization Structures
|
||||
|
||||
Although initially studied for multi-agent systems, insights from value factorization also improve offline RL by providing structured credit assignment signals.
|
||||
|
||||
### Counterfactual Credit Assignment Insight
|
||||
|
||||
A factored value function structure of the form:
|
||||
|
||||
$$
|
||||
Q_{tot}(s, a_{1}, \dots, a_{n}) = f(Q_{1}(s, a_{1}), \dots, Q_{n}(s, a_{n}))
|
||||
$$
|
||||
|
||||
can implicitly implement counterfactual credit assignment.
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $Q_{tot}$: global value function.
|
||||
- $Q_{i}(s,a_{i})$: individual component value for agent or subsystem $i$.
|
||||
- $f(\cdot)$: mixing function combining components.
|
||||
- $s$: environment state.
|
||||
- $a_{i}$: action taken by entity $i$.
|
||||
|
||||
In architectures designed for IGM (Individual-Global-Max) consistency, gradients backpropagated through $f$ isolate the marginal effect of each component. This implicitly gives each agent or subsystem a counterfactual advantage signal.
|
||||
|
||||
Even in single-agent structured RL, similar factorization structures allow credit flowing into components representing skills, modes, or action groups, enabling better temporal and structural decomposition.
|
||||
|
||||
---
|
||||
|
||||
## Model-Based vs Model-Free Offline RL
|
||||
|
||||
Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL.
|
||||
|
||||
### Forward Model-Based Rollouts
|
||||
|
||||
Forward imagination using a learned model:
|
||||
|
||||
$$
|
||||
p_{\theta}(s'|s,a)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $p_{\theta}$: learned forward dynamics model.
|
||||
- $\theta$: parameters of the forward model.
|
||||
- $s'$: predicted next state.
|
||||
- $s$: current state.
|
||||
- $a$: action taken in current state.
|
||||
|
||||
Problems:
|
||||
|
||||
- Forward rollouts drift away from dataset support.
|
||||
- Model error compounds with each step.
|
||||
- Leads to training instability if used without penalties.
|
||||
|
||||
### Penalty Methods (MOPO, MOReL)
|
||||
|
||||
Augmented reward:
|
||||
|
||||
$$
|
||||
r_{model}(s,a) = r(s,a) - \beta u(s,a)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $r_{model}(s,a)$: penalized reward for model-generated steps.
|
||||
- $u(s,a)$: uncertainty score of model for state-action pair.
|
||||
- $\beta$: penalty coefficient.
|
||||
- $r(s,a)$: original reward.
|
||||
|
||||
These methods limit exploration into uncertain model regions.
|
||||
|
||||
### ROMI vs Forward Rollouts
|
||||
|
||||
- Forward methods expand state space beyond dataset.
|
||||
- ROMI expands **backward**, staying consistent with known good future states.
|
||||
- ROMI reduces error accumulation because future anchors are real.
|
||||
|
||||
---
|
||||
|
||||
## Combining ROMI With Conservative Offline RL
|
||||
|
||||
ROMI is typically combined with:
|
||||
|
||||
- CQL (Conservative Q-Learning)
|
||||
- IQL (Implicit Q-Learning)
|
||||
- BCQ and BEAR (policy constraint methods)
|
||||
|
||||
Workflow:
|
||||
|
||||
1. Generate imagined transitions via ROMI.
|
||||
2. Add them to dataset.
|
||||
3. Train Q-function or policy using conservative losses.
|
||||
|
||||
Benefits:
|
||||
|
||||
- Better coverage of reward-relevant states.
|
||||
- Increased policy improvement over dataset.
|
||||
- More stable Q-learning backups.
|
||||
|
||||
---
|
||||
|
||||
## Summary of Lecture 23
|
||||
|
||||
Key points:
|
||||
|
||||
- Offline RL can be improved via structured imagination.
|
||||
- ROMI creates safe imagined transitions by reversing dynamics.
|
||||
- Reverse imagination avoids pitfalls of forward model error.
|
||||
- Factored value structures provide implicit counterfactual credit assignment.
|
||||
- Combining ROMI with conservative learners yields state-of-the-art performance.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Screenshot Frames for Lecture 23
|
||||
|
||||
- Lecture 23, page 20: ROMI concept diagram depicting reverse imagination from goal states. Subsection: "Reverse Model-Based Imagination (ROMI)".
|
||||
- Lecture 23, page 24: Architecture figure showing reverse policy and reverse dynamics model used to generate imagined transitions. Subsection: "Reverse Imagination Process".
|
||||
|
||||
275
content/CSE510/CSE510_L24.md
Normal file
@@ -0,0 +1,275 @@
|
||||
# CSE510 Deep Reinforcement Learning (Lecture 24)
|
||||
|
||||
## Cooperative Multi-Agent Reinforcement Learning (MARL)
|
||||
|
||||
This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.
|
||||
|
||||
|
||||
|
||||
## Multi-Agent Coordination Under Uncertainty
|
||||
|
||||
In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP).
|
||||
|
||||
A transition is defined as:
|
||||
|
||||
$$
|
||||
P(s' \mid s, a_{1}, \dots, a_{n})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $s$: current global state.
|
||||
- $s'$: next global state.
|
||||
- $a_{i}$: action taken by agent $i$.
|
||||
- $P(\cdot)$: environment transition function.
|
||||
|
||||
The shared return is:
|
||||
|
||||
$$
|
||||
\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\gamma$: discount factor.
|
||||
- $T$: horizon length.
|
||||
- $r_{t}$: shared team reward at time $t$.
|
||||
|
||||
### CTDE: Centralized Training, Decentralized Execution
|
||||
|
||||
Training uses global information (centralized), but execution uses local agent observations. This is critical for real-world deployment.
|
||||
|
||||
|
||||
## Joint vs Factored Q-Learning
|
||||
|
||||
### Joint Q-Learning
|
||||
|
||||
In joint-action learning, one learns a full joint Q-function:
|
||||
|
||||
$$
|
||||
Q_{tot}(s, a_{1}, \dots, a_{n})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $Q_{tot}$: joint value for the entire team.
|
||||
- $(a_{1}, \dots, a_{n})$: joint action vector across agents.
|
||||
|
||||
Problem:
|
||||
|
||||
- The joint action space grows exponentially in $n$.
|
||||
- Learning is not scalable.
|
||||
|
||||
### Value Factorization
|
||||
|
||||
Instead of learning $Q_{tot}$ directly, we factorize it into individual utility functions:
|
||||
|
||||
$$
|
||||
Q_{tot}(s, \mathbf{a}) = f(Q_{1}(s,a_{1}), \dots, Q_{n}(s,a_{n}))
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\mathbf{a}$: joint action vector.
|
||||
- $f(\cdot)$: mixing network combining individual Q-values.
|
||||
|
||||
The goal is to enable decentralized greedy action selection.
|
||||
|
||||
|
||||
|
||||
## Individual-Global-Max (IGM) Condition
|
||||
|
||||
The IGM condition enables decentralized optimal action selection:
|
||||
|
||||
$$
|
||||
\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a}) = \big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\arg\max_{\mathbf{a}}$: search for best joint action.
|
||||
- $\arg\max_{a_{i}}$: best local action for agent $i$.
|
||||
- $Q_{i}(s,a_{i})$: individual utility for agent $i$.
|
||||
|
||||
IGM makes decentralized execution optimal with respect to the learned factorized value.
|
||||
|
||||
|
||||
|
||||
## Linear Value Factorization
|
||||
|
||||
### VDN (Value Decomposition Networks)
|
||||
|
||||
VDN assumes:
|
||||
|
||||
$$
|
||||
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} Q_{i}(s,a_{i})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $Q_{i}(s,a_{i})$: value of agent $i$'s action.
|
||||
- $\sum_{i=1}^{n}$: linear sum over agents.
|
||||
|
||||
Pros:
|
||||
|
||||
- Very simple, satisfies IGM.
|
||||
- Fully decentralized execution.
|
||||
|
||||
Cons:
|
||||
|
||||
- Limited representation capacity.
|
||||
- Cannot model non-linear teamwork interactions.
|
||||
|
||||
|
||||
|
||||
## QMIX: Monotonic Value Factorization
|
||||
|
||||
QMIX uses a state-conditioned mixing network enforcing monotonicity:
|
||||
|
||||
$$
|
||||
\frac{\partial Q_{tot}}{\partial Q_{i}} \ge 0
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\partial Q_{tot} / \partial Q_{i}$: gradient of global Q w.r.t. individual Q.
|
||||
- $\ge 0$: ensures monotonicity required for IGM.
|
||||
|
||||
The mixing function is:
|
||||
|
||||
$$
|
||||
Q_{tot}(s,\mathbf{a}) = f_{mix}(Q_{1}, \dots, Q_{n}; s)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $f_{mix}$: neural network with non-negative weights.
|
||||
- $s$: global state conditioning the mixing process.
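
A hedged PyTorch sketch of such a monotonic mixer: state-conditioned hypernetworks produce the mixing weights, and taking their absolute value enforces $\partial Q_{tot}/\partial Q_{i} \ge 0$. Layer sizes are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class QmixMixer(nn.Module):
    """Monotonic mixer: weights come from state-conditioned hypernetworks, made non-negative."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = self.hyper_w1(state).abs().view(-1, self.n_agents, self.embed_dim)  # abs() => monotonicity
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = self.hyper_w2(state).abs().unsqueeze(-1)
        b2 = self.hyper_b2(state).unsqueeze(1)
        return (torch.bmm(hidden, w2) + b2).squeeze(-1).squeeze(-1)              # Q_tot(s, a) per batch item

mixer = QmixMixer(n_agents=3, state_dim=16)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 16))   # (batch,) global values
```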
|
||||
|
||||
Benefits:
|
||||
|
||||
- More expressive than VDN.
|
||||
- Supports CTDE while keeping decentralized greedy execution.
|
||||
|
||||
|
||||
|
||||
## Theoretical Issues With Linear and Monotonic Factorization
|
||||
|
||||
Limitations:
|
||||
|
||||
- Linear models (VDN) cannot represent complex coordination.
|
||||
- QMIX monotonicity limits representation power for tasks requiring non-monotonic interactions.
|
||||
- Off-policy training can diverge in some factorizations.
|
||||
|
||||
|
||||
|
||||
## QPLEX: Duplex Dueling Multi-Agent Q-Learning
|
||||
|
||||
QPLEX introduces a dueling architecture that satisfies IGM while providing full representation capacity within the IGM class.
|
||||
|
||||
### QPLEX Advantage Factorization
|
||||
|
||||
QPLEX factorizes:
|
||||
|
||||
$$
|
||||
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} \lambda_{i}(s,\mathbf{a})\big(Q_{i}(s,a_{i}) - \max_{a'_{i}} Q_{i}(s,a'_{i})\big) + \max_{\mathbf{a}'} \sum_{i=1}^{n} Q_{i}(s,a'_{i})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\lambda_{i}(s,\mathbf{a})$: positive mixing coefficients.
|
||||
- $Q_{i}(s,a_{i})$: individual utility.
|
||||
- $\max_{a'_{i}} Q_{i}(s,a'_{i})$: per-agent baseline value.
|
||||
- $\max_{\mathbf{a}'}$: maximization over joint actions (state-value baseline).
|
||||
|
||||
QPLEX Properties:
|
||||
|
||||
- Fully satisfies IGM.
|
||||
- Has full representation capacity for all IGM-consistent Q-functions.
|
||||
- Enables stable off-policy training.
|
||||
|
||||
|
||||
|
||||
## QPLEX Training Objective
|
||||
|
||||
QPLEX minimizes a TD loss over $Q_{tot}$:
|
||||
|
||||
$$
|
||||
L = \mathbb{E}\Big[(r + \gamma \max_{\mathbf{a'}} Q_{tot}(s',\mathbf{a'}) - Q_{tot}(s,\mathbf{a}))^{2}\Big]
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $r$: shared team reward.
|
||||
- $\gamma$: discount factor.
|
||||
- $s'$: next state.
|
||||
- $\mathbf{a'}$: next joint action evaluated by TD target.
|
||||
- $Q_{tot}$: QPLEX global value estimate.
|
||||
|
||||
|
||||
|
||||
## Role of Credit Assignment
|
||||
|
||||
Credit assignment addresses: "Which agent contributed what to the team reward?"
|
||||
|
||||
Value factorization supports implicit credit assignment:
|
||||
|
||||
- Gradients into each $Q_{i}$ act as counterfactual signals.
|
||||
- Dueling architectures allow each agent to learn its influence.
|
||||
- QPLEX provides clean marginal contributions implicitly.
|
||||
|
||||
|
||||
|
||||
## Performance on SMAC Benchmarks
|
||||
|
||||
QPLEX outperforms:
|
||||
|
||||
- QTRAN
|
||||
- QMIX
|
||||
- VDN
|
||||
- Other CTDE baselines
|
||||
|
||||
Key reasons:
|
||||
|
||||
- Effective realization of IGM.
|
||||
- Strong representational capacity.
|
||||
- Off-policy stability.
|
||||
|
||||
|
||||
|
||||
## Extensions: Diversity and Shared Parameter Learning
|
||||
|
||||
Parameter sharing encourages sample efficiency, but can cause homogeneous agent behavior.
|
||||
|
||||
Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce:
|
||||
|
||||
- Identity-aware diversity.
|
||||
- Information-based intrinsic rewards for agent differentiation.
|
||||
- Balanced sharing vs agent specialization.
|
||||
|
||||
These techniques improve exploration and cooperation in complex multi-agent tasks.
|
||||
|
||||
|
||||
|
||||
## Summary of Lecture 24
|
||||
|
||||
Key points:
|
||||
|
||||
- Cooperative MARL requires scalable value decomposition.
|
||||
- IGM enables decentralized action selection from centralized training.
|
||||
- QMIX introduces monotonic non-linear factorization.
|
||||
- QPLEX achieves full IGM representational capacity.
|
||||
- Implicit credit assignment arises naturally from factorization.
|
||||
- Diversity methods allow richer multi-agent coordination strategies.
|
||||
|
||||
|
||||
|
||||
## Recommended Screenshot Frames for Lecture 24
|
||||
|
||||
- Lecture 24, page 16: CTDE and QMIX architecture diagram (mixing network). Subsection: "QMIX: Monotonic Value Factorization".
|
||||
- Lecture 24, page 31: QPLEX benchmark performance on SMAC. Subsection: "Performance on SMAC Benchmarks".
|
||||
@@ -26,4 +26,5 @@ export default {
|
||||
CSE510_L21: "CSE510 Deep Reinforcement Learning (Lecture 21)",
|
||||
CSE510_L22: "CSE510 Deep Reinforcement Learning (Lecture 22)",
|
||||
CSE510_L23: "CSE510 Deep Reinforcement Learning (Lecture 23)",
|
||||
CSE510_L24: "CSE510 Deep Reinforcement Learning (Lecture 24)",
|
||||
}
|
||||