partial updates
@@ -1,146 +1,242 @@
|
||||
# CSE510 Deep Reinforcement Learning (Lecture 21)
|
||||
|
||||
## Exploration in RL
|
||||
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
|
||||
|
||||
### Information state search
|
||||
## Exploration in RL: Information-Based Exploration (Intrinsic Curiosity)
|
||||
|
||||
Uncertainty about state transitions or dynamics
|
||||
|
||||
Dynamics prediction error or Information gain for dynamics learning
|
||||
|
||||
#### Computational Curiosity
|
||||
### Computational Curiosity
|
||||
|
||||
- "The direct goal of curiosity and boredom is to improve the world model."
|
||||
- "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
|
||||
- There is positive reinforcement whenever the system fails to correctly predict the environment.
|
||||
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation (i.e., planning to make your internal world model fail).
|
||||
- Curiosity encourages agents to seek experiences that better predict or explain the environment.
|
||||
- A "curiosity unit" gives reward based on the mismatch between current model predictions and actual outcomes.
|
||||
- Intrinsic reward is high when the agent's prediction fails, that is, when it encounters surprising outcomes.
|
||||
- This yields positive intrinsic reinforcement when the internal predictive model errs, causing the agent to repeat actions that lead to prediction errors.
|
||||
- The agent is effectively motivated to create situations where its model fails.
|
||||
|
||||
#### Reward Prediction Error
|
||||
### Model Prediction Error as Intrinsic Reward
|
||||
|
||||
- Add exploration reward bonuses that encourage policies to visit states that will cause the prediction model to fail.
|
||||
We augment the reward with an intrinsic bonus based on model prediction error:
|
||||
|
||||
$$
|
||||
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a,\theta)-s'\|)
|
||||
$$
|
||||
$R(s, a, s') = r(s, a, s') + B(|T(s, a; \theta) - s'|)$
|
||||
|
||||
- where $r(s,a,s')$ is the extrinsic reward, $T(s,a,\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (intrinsic reward bonus).
|
||||
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.
|
||||
Parameter explanations:
|
||||
|
||||
[link to the paper](https://arxiv.org/pdf/1507.08750)
|
||||
</details>
|
||||
- $s$: current state of the agent.
|
||||
- $a$: action taken by the agent in state $s$.
|
||||
- $s'$: next state resulting from executing action $a$ in state $s$.
|
||||
- $r(s, a, s')$: extrinsic environment reward for transition $(s, a, s')$.
|
||||
- $T(s, a; \theta)$: learned dynamics model with parameters $\theta$ that predicts the next state.
|
||||
- $\theta$: parameter vector of the predictive dynamics model $T$.
|
||||
- $|T(s, a; \theta) - s'|$: prediction error magnitude between predicted next state and actual next state.
|
||||
- $B(\cdot)$: function converting prediction error magnitude into an intrinsic reward bonus.
|
||||
- $R(s, a, s')$: total reward, sum of extrinsic reward and intrinsic curiosity bonus.
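
As a minimal illustration (not from the slides), the sketch below computes this augmented reward with a hypothetical `forward_model` standing in for $T(s,a;\theta)$ and a simple scaled norm as the bonus function $\mathcal{B}$:

```python
import numpy as np

def intrinsic_bonus(forward_model, s, a, s_next, scale=1.0):
    """Curiosity bonus B(||T(s,a;theta) - s'||): scaled forward-model prediction error."""
    s_pred = forward_model(s, a)              # predicted next state T(s, a; theta)
    error = np.linalg.norm(s_pred - s_next)   # prediction error magnitude
    return scale * error                      # bonus grows with surprise

def total_reward(r_extrinsic, forward_model, s, a, s_next, scale=1.0):
    """R(s,a,s') = r(s,a,s') + B(||T(s,a;theta) - s'||)."""
    return r_extrinsic + intrinsic_bonus(forward_model, s, a, s_next, scale)

# Toy usage with a hypothetical linear "model" that ignores the action.
forward_model = lambda s, a: 0.9 * s
s, a, s_next = np.ones(4), 0, np.full(4, 1.2)
print(total_reward(r_extrinsic=0.0, forward_model=forward_model, s=s, a=a, s_next=s_next))
```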
|
||||
|
||||
<details>
|
||||
<summary>Example</summary>
|
||||
Key ideas:
|
||||
|
||||
Learning Visual Dynamics
|
||||
- The agent receives an intrinsic reward $B(|T(s, a; \theta) - s'|)$ when the actual outcome differs from what its world model predicts.
|
||||
- Initially many transitions are surprising, encouraging broad exploration.
|
||||
- As the model improves, familiar transitions yield smaller error and smaller intrinsic reward.
|
||||
- Exploration becomes focused on less-known parts of the state space.
|
||||
- Intrinsic motivation is non-stationary: as the agent learns, previously novel states lose their intrinsic reward.
|
||||
|
||||
- Exploration reward bonuses $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$
|
||||
- However, a trivial solution exists: the agent could collect intrinsic reward by just moving around randomly.
|
||||
#### Avoiding Trivial Curiosity Traps
|
||||
|
||||
---
|
||||
[link to paper](https://ar5iv.labs.arxiv.org/html/1705.05363#:~:text=reward%20signal%20based%20on%20how,this%20feature%20space%20using%20self)
|
||||
|
||||
- Exploration reward bonuses with autoencoders $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$
|
||||
- But this suffers from the problem that the autoencoder reconstruction loss has little to do with our task
|
||||
Naively defining $B(s, a, s')$ directly in raw observation space can lead to trivial curiosity traps.
|
||||
|
||||
#### Task Rewards vs. Exploration Rewards
|
||||
Examples:
|
||||
|
||||
Exploration rewards bonuses:
|
||||
- The agent may purposely cause chaotic or noisy observations (like flickering pixels) that are impossible to predict.
|
||||
- The model cannot reduce prediction error on pure noise, so the agent is rewarded for meaningless randomness.
|
||||
- This yields high intrinsic reward without meaningful learning or progress toward task goals.
|
||||
|
||||
$$
|
||||
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
|
||||
$$
|
||||
To prevent this, we restrict prediction to a more informative feature space:
|
||||
|
||||
Only task rewards:
|
||||
$B(s, a, s') = |T(E(s; \phi), a; \theta) - E(s'; \phi)|$
|
||||
|
||||
$$
|
||||
R(s,a,s') = r(s,a,s')
|
||||
$$
|
||||
Parameter explanations:
|
||||
|
||||
Task+curiosity rewards:
|
||||
- $E(s; \phi)$: learned encoder mapping raw state $s$ into a feature vector.
|
||||
- $\phi$: parameter vector of the encoder $E$.
|
||||
- $T(E(s; \phi), a; \theta)$: forward model predicting next feature representation from encoded state and action.
|
||||
- $E(s'; \phi)$: encoded feature representation of the next state $s'$.
|
||||
- $B(s, a, s')$: intrinsic reward based on prediction error in feature space.
|
||||
|
||||
$$
|
||||
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
|
||||
$$
|
||||
Key ideas:
|
||||
|
||||
Sparse task + curiosity rewards:
|
||||
- The encoder $E(s; \phi)$ is trained so that features capture aspects of the state that are controllable by the agent.
|
||||
- One approach is to train $E$ via an inverse dynamics model that predicts $a$ from $(s, s')$.
|
||||
- This encourages $E$ to keep only information necessary to infer actions, discarding irrelevant noise.
|
||||
- Measuring prediction error in feature space ignores unpredictable environmental noise.
|
||||
- Intrinsic reward focuses on errors due to lack of knowledge about controllable dynamics.
|
||||
- The agent's curiosity is directed toward aspects of the environment it can influence and learn.
|
||||
|
||||
$$
|
||||
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
|
||||
$$
|
||||
A practical implementation is the Intrinsic Curiosity Module (ICM) by Pathak et al. (2017):
|
||||
|
||||
Only curiosity rewards:
|
||||
- The encoder $E$ and forward model $T$ are trained jointly.
|
||||
- The loss includes both forward prediction error and inverse dynamics error.
|
||||
- Intrinsic reward is set to the forward prediction error in feature space.
|
||||
- This drives exploration of states where the agent cannot yet predict the effect of its actions.
|
||||
|
||||
$$
|
||||
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
|
||||
$$
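
The following is a hedged PyTorch sketch of an ICM-style module; the layer sizes, names, and loss combination are illustrative assumptions rather than the original implementation. It shows the recipe described above: an encoder $E$, a forward model $T$ trained in feature space, and an inverse model whose loss shapes the features to keep controllable information:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Sketch of an ICM-style module: encoder E, forward model T, inverse dynamics model."""
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)   # predicts E(s')
        self.inverse_model = nn.Linear(2 * feat_dim, n_actions)          # predicts a from (E(s), E(s'))
        self.n_actions = n_actions

    def losses_and_bonus(self, s, a, s_next):
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_onehot = F.one_hot(a, self.n_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        forward_loss = F.mse_loss(phi_next_pred, phi_next.detach())      # train T in feature space
        a_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(a_logits, a)                      # shapes E toward controllable factors
        bonus = (phi_next_pred - phi_next).pow(2).sum(dim=-1).detach()   # intrinsic reward per transition
        return forward_loss, inverse_loss, bonus

icm = ICM(obs_dim=8, n_actions=4)
s, a, s_next = torch.randn(16, 8), torch.randint(0, 4, (16,)), torch.randn(16, 8)
f_loss, i_loss, bonus = icm.losses_and_bonus(s, a, s_next)
(f_loss + i_loss).backward()   # in practice weighted and combined with the policy loss
```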
|
||||
#### Random Network Distillation (RND)
|
||||
|
||||
#### Intrinsic Reward RL is not New
|
||||
Random Network Distillation (RND) provides a simpler curiosity bonus without learning a dynamics model.
|
||||
|
||||
- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS’05. pp. 547–554 (2006)
|
||||
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN’91. vol. 2, pp. 1458–1463 (1991)
|
||||
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990-2010). Autonomous Mental Development, IEEE Trans. on Autonomous Mental Development 2(3), 230–247 (9 2010)
|
||||
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS’04 (2004)
|
||||
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN’95 (1995)
|
||||
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708
|
||||
Basic idea:
|
||||
|
||||
#### Limitation of Prediction Errors
|
||||
- Use a fixed random neural network $f_{\text{target}}$ that maps states to feature vectors.
|
||||
- Train a predictor network $f_{\text{pred}}$ to approximate $f_{\text{target}}$ on visited states.
|
||||
- The intrinsic reward is the prediction error between $f_{\text{pred}}(s)$ and $f_{\text{target}}(s)$.
|
||||
|
||||
- Agent will be rewarded even though the model cannot improve.
|
||||
- So it will focus on parts of environment that are inherently unpredictable or stochastic.
|
||||
- Example: the noisy-TV problem
|
||||
- The agent is attracted forever in the most noisy states, with unpredictable outcomes.
|
||||
Typical form of the intrinsic reward:
|
||||
|
||||
#### Random Network Distillation
|
||||
$r^{\text{int}}(s) = \|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}$
|
||||
|
||||
Original idea: Predicting the output of a fixed and randomly initialized neural network on the next state, given the current state and action.
|
||||
Parameter explanations:
|
||||
|
||||
New idea: Predicting the output of a fixed and randomly initialized neural network on the next state, given the **next state itself.**
|
||||
- $f_{\text{target}}$: fixed random neural network generating target features for each state.
|
||||
- $f_{\text{pred}}(s; \psi)$: trainable predictor network with parameters $\psi$.
|
||||
- $\psi$: parameter vector for the predictor network.
|
||||
- $s$: state input to both networks.
|
||||
- $\|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}$: squared error between predictor and target features.
|
||||
- $r^{\text{int}}(s)$: intrinsic reward based on prediction error in random feature space.
|
||||
|
||||
- The target network is a neural network with fixed, randomized weights, which is never trained.
|
||||
- The prediction network is trained to predict the target network's output.
|
||||
Key properties:
|
||||
|
||||
> the more you visit the state, the less loss you will have.
|
||||
- For novel or rarely visited states, $f_{\text{pred}}$ has not yet learned to match $f_{\text{target}}$, so error is high.
|
||||
- For frequently visited states, prediction error becomes small, and intrinsic reward decays.
|
||||
- The target network is random and fixed, so it does not adapt to the policy.
|
||||
- This provides a stable novelty signal without explicit dynamics learning.
|
||||
- RND achieves strong exploration performance in challenging environments, such as hard-exploration Atari games.
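
A minimal RND sketch in PyTorch, assuming small MLPs and an Adam optimizer (all sizes are illustrative): the target network is frozen, the predictor is trained on visited states, and the per-state squared error serves as the intrinsic reward:

```python
import torch
import torch.nn as nn

def make_net(obs_dim, out_dim=64):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

obs_dim = 8
target = make_net(obs_dim)                    # f_target: random and fixed
for p in target.parameters():
    p.requires_grad_(False)                   # never trained
predictor = make_net(obs_dim)                 # f_pred: trained on visited states
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus_and_update(states):
    """Intrinsic reward r_int(s) = ||f_pred(s) - f_target(s)||^2, then one predictor step."""
    with torch.no_grad():
        tgt = target(states)
    pred = predictor(states)
    per_state_error = (pred - tgt).pow(2).sum(dim=-1)   # novelty signal per state
    loss = per_state_error.mean()
    opt.zero_grad(); loss.backward(); opt.step()        # error shrinks for frequently visited states
    return per_state_error.detach()

print(rnd_bonus_and_update(torch.randn(32, obs_dim)))
```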
|
||||
|
||||
### Posterior Sampling
|
||||
### Efficacy of Curiosity-Driven Exploration
|
||||
|
||||
Uncertainty about Q-value functions or policies
|
||||
Empirical observations:
|
||||
|
||||
Selecting actions according to the probability that they are optimal under the current model.
|
||||
- Curiosity-driven intrinsic rewards often lead to significantly higher extrinsic returns in sparse-reward environments compared to agents trained only on extrinsic rewards.
|
||||
- Intrinsic rewards act as a proxy objective that guides the agent toward interesting or informative regions of the state space.
|
||||
- In some experiments, agents trained with only intrinsic rewards (no extrinsic reward during training) still learn behaviors that later achieve high task scores when extrinsic rewards are measured.
|
||||
- Using random features for curiosity (as in RND) can perform nearly as well as using learned features in many domains.
|
||||
- Simple surprise signals are often sufficient to drive effective exploration.
|
||||
- Learned feature spaces may generalize better to truly novel scenarios but are not always necessary.
|
||||
|
||||
#### Exploration with Action Value Information
|
||||
Historical context:
|
||||
|
||||
Count-based and curiosity-driven methods do not take into account action-value information.
|
||||
- The concept of learning from intrinsic rewards alone is not new.
|
||||
- Itti and Baldi (2005) studied "Bayesian surprise" as a driver of human attention.
|
||||
- Schmidhuber (1991, 2010) formalized curiosity, creativity, and fun as intrinsic motivations in learning agents.
|
||||
- Singh et al. (2004) proposed intrinsically motivated reinforcement learning frameworks.
|
||||
- These early works laid the conceptual foundation for modern curiosity-driven deep RL methods.
|
||||
|
||||

|
||||
For further reading on intrinsic curiosity methods:
|
||||
|
||||
> In this case, the optimal solution is action 1, but we will explore action 3 because it has the highest uncertainty. It also takes a long time to distinguish actions 1 and 2 since they have similar values.
|
||||
- Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction", 2017.
|
||||
- Burda et al., "Exploration by Random Network Distillation", 2018.
|
||||
- Schmidhuber, "Formal Theory of Creativity, Fun, and Intrinsic Motivation", 2010.
|
||||
|
||||
#### Exploration via Posterior Sampling of Q Functions
|
||||
## Exploration via Posterior Sampling
|
||||
|
||||
- Represent a posterior distribution of Q functions, instead of a point estimate.
|
||||
1. Sample a Q-function from the posterior: $Q\sim P(Q)$
|
||||
2. Choose actions according to this $Q$ for one episode $a=\arg\max_{a} Q(s,a)$
|
||||
3. Update $P(Q)$ based on the sampled $Q$ and collected experience tuples $(s,a,r,s')$
|
||||
- Then we do not need $\epsilon$-greedy for exploration! Better exploration by representing uncertainty over Q.
|
||||
- But how can we learn a distribution of Q functions $P(Q)$ if Q function is a deep neural network?
|
||||
While optimistic and curiosity bonus methods modify the reward function, posterior sampling approaches handle exploration by maintaining uncertainty over models or value functions and sampling from this uncertainty.
|
||||
|
||||
#### Bootstrap Ensemble
|
||||
These methods are rooted in Thompson Sampling and naturally balance exploration and exploitation.
|
||||
|
||||
- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data
- Computationally expensive
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data
|
||||
### Posterior Sampling in Multi-Armed Bandits (Thompson Sampling)
|
||||
|
||||
### Questions
|
||||
In a multi-armed bandit problem (no state transitions), Thompson Sampling works as follows:
|
||||
|
||||
- Why do PG methods implicitly support exploration?
|
||||
- Is it sufficient? How can we improve its implicit exploration?
|
||||
- What are limitations of entropy regularization?
|
||||
- How can we improve exploration for PG methods?
|
||||
- Intrinsic-motivated bonuses (e.g., RND)
|
||||
- Explicitly optimize per-state entropy in the return (e.g., SAC)
|
||||
- Hierarchical RL
|
||||
- Goal-conditional RL
|
||||
- What are potentially more effective exploration methods?
|
||||
- Knowledge-driven
|
||||
- Model-based exploration
|
||||
1. Maintain a prior and posterior distribution over the reward parameters for each arm.
|
||||
2. At each time step, sample reward parameters for all arms from their current posterior.
|
||||
3. Select the arm with the highest sampled mean reward.
|
||||
4. Observe the reward, update the posterior, and repeat.
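
For concreteness, here is a small Thompson Sampling sketch for the steps above, using Bernoulli arms with Beta posteriors (the arm means are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.6])        # hypothetical Bernoulli arms
alpha = np.ones(3)                            # Beta posterior: observed successes + 1
beta = np.ones(3)                             # Beta posterior: observed failures + 1

for t in range(2000):
    theta = rng.beta(alpha, beta)             # sample reward parameters from the posterior
    arm = int(np.argmax(theta))               # pick the arm that looks best in this draw
    reward = rng.random() < true_means[arm]   # observe a Bernoulli reward
    alpha[arm] += reward                      # conjugate posterior update
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))  # concentrates near the true means over time
```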
|
||||
|
||||
Intuition:
|
||||
|
||||
- Each action is selected with probability equal to the posterior probability that it is optimal.
|
||||
- Arms with high uncertainty are more likely to be sampled as optimal in some posterior draws.
|
||||
- Exploration arises naturally from uncertainty, without explicit epsilon-greedy noise or bonus terms.
|
||||
- Over time, the posterior concentrates on the true reward means, and the algorithm shifts toward exploitation.
|
||||
|
||||
Theoretical properties:
|
||||
|
||||
- Thompson Sampling attains near-optimal regret bounds in many bandit settings.
|
||||
- It often performs as well as or better than upper confidence bound algorithms in practice.
|
||||
|
||||
### Posterior Sampling for Reinforcement Learning (PSRL)
|
||||
|
||||
In reinforcement learning with states and transitions, posterior sampling generalizes to sampling entire MDP models.
|
||||
|
||||
Posterior Sampling for Reinforcement Learning (PSRL) operates as follows:
|
||||
|
||||
1. Maintain a posterior distribution over environment dynamics and rewards, based on observed transitions.
|
||||
2. At the beginning of an episode, sample an MDP model from this posterior.
|
||||
3. Compute the optimal policy for the sampled MDP (for example, by value iteration).
|
||||
4. Execute this policy in the real environment for the whole episode.
|
||||
5. Use the observed transitions to update the posterior, then repeat.
|
||||
|
||||
Key advantages:
|
||||
|
||||
- The agent commits to a sampled model's policy for an extended duration, which induces deep exploration.
|
||||
- If a sampled model is optimistic in unexplored regions, the corresponding policy will deliberately visit those regions.
|
||||
- Exploration is coherent across time within an episode, unlike per-step randomization in epsilon-greedy.
|
||||
- The method does not require ad hoc exploration bonuses; exploration is an emergent property of the posterior.
|
||||
|
||||
Challenges:
|
||||
|
||||
- Maintaining an exact posterior over high-dimensional MDPs is usually intractable.
|
||||
- Practical implementations use approximations.
|
||||
|
||||
### Approximate Posterior Sampling with Ensembles (Bootstrapped DQN)
|
||||
|
||||
A common approximate posterior method in deep RL is Bootstrapped DQN.
|
||||
|
||||
Basic idea:
|
||||
|
||||
- Train an ensemble of $K$ Q-networks (heads), $Q^{(1)}, \dots, Q^{(K)}$.
|
||||
- Each head is trained on a different bootstrap sample or masked subset of experience.
|
||||
- At the start of each episode, sample a head index $k$ uniformly from $\{1, \dots, K\}$.
|
||||
- For the entire episode, act greedily with respect to $Q^{(k)}$.
|
||||
|
||||
Parameter definitions for the ensemble:
|
||||
|
||||
- $K$: number of Q-network heads in the ensemble.
|
||||
- $Q^{(k)}(s, a)$: Q-value estimate for head $k$ at state-action pair $(s, a)$.
|
||||
- $k$: index of the sampled head used for the current episode.
|
||||
- $(s, a)$: state and action arguments to Q-value functions.
|
||||
|
||||
Implementation details:
|
||||
|
||||
- A shared feature backbone network processes state inputs, feeding into all heads.
|
||||
- Each head has its own final layers, allowing diverse value estimates.
|
||||
- Masking or bootstrapping assigns different subsets of transitions to different heads during training.
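
A hedged sketch of the shared-backbone, multi-head architecture (layer sizes and the mask probability are assumptions, not the paper's exact settings):

```python
import torch
import torch.nn as nn

class BootstrappedQ(nn.Module):
    """Shared backbone with K Q-heads; each head trains on its own bootstrap-masked data."""
    def __init__(self, obs_dim, n_actions, K=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(128, n_actions) for _ in range(K)])

    def forward(self, s, k):
        return self.heads[k](self.backbone(s))   # Q^{(k)}(s, .)

K = 10
net = BootstrappedQ(obs_dim=8, n_actions=4, K=K)
k = torch.randint(K, (1,)).item()                 # sample one head per episode
state = torch.randn(1, 8)
action = net(state, k).argmax(dim=-1)             # act greedily w.r.t. the sampled head all episode
# When storing a transition, also store a bootstrap mask deciding which heads train on it.
mask = torch.bernoulli(torch.full((K,), 0.8))
```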
|
||||
|
||||
Benefits:
|
||||
|
||||
- Each head approximates a different plausible Q-function, analogous to a sample from a posterior.
|
||||
- When a head is optimistic about certain under-explored actions, its greedy policy will explore them deeply.
|
||||
- Exploration behavior is temporally consistent within an episode.
|
||||
- No modification of the reward function is required; exploration arises from policy randomization via multiple heads.
|
||||
|
||||
Comparison to epsilon-greedy:
|
||||
|
||||
- Epsilon-greedy adds per-step random actions, which can be inefficient for long-horizon exploration.
|
||||
- Bootstrapped DQN commits to a strategy for an episode, enabling the agent to execute complete exploratory plans.
|
||||
- This can dramatically increase the probability of discovering long sequences needed to reach sparse rewards.
|
||||
|
||||
Other approximate posterior approaches:
|
||||
|
||||
- Bayesian neural networks for Q-functions (explicit parameter distributions).
|
||||
- Using Monte Carlo dropout at inference to sample Q-functions.
|
||||
- Randomized prior functions added to Q-networks to maintain exploration.
|
||||
|
||||
Theoretical insights:
|
||||
|
||||
- Posterior sampling methods can enjoy strong regret bounds in some RL settings.
|
||||
- They can have better asymptotic constants than optimism-based methods in certain problems.
|
||||
- Coherent, temporally extended exploration is essential in environments with delayed rewards and complex goals.
|
||||
|
||||
For further reading:
|
||||
|
||||
- Osband et al., "Deep Exploration via Bootstrapped DQN", 2016.
|
||||
- Osband and Van Roy, "Why Is Posterior Sampling Better Than Optimism for Reinforcement Learning?", 2017.
|
||||
- Chapelle and Li, "An Empirical Evaluation of Thompson Sampling", 2011.
|
||||
@@ -1,50 +1,317 @@
|
||||
# CSE510 Deep Reinforcement Learning (Lecture 22)
|
||||
|
||||
## Offline Reinforcement Learning
|
||||
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
|
||||
|
||||
### Requirements for Current Successes
|
||||
## Offline Reinforcement Learning: Introduction and Challenges
|
||||
|
||||
- Access to the Environment Model or Simulator
|
||||
- Not Costly for Exploration or Trial-and-Error
|
||||
Offline reinforcement learning (offline RL), also called batch RL, aims to learn an optimal policy **without** interacting with the environment. Instead, the agent is given a fixed dataset of transitions collected by an unknown behavior policy.
|
||||
|
||||
#### Background: Offline RL
|
||||
### The Offline RL Dataset
|
||||
|
||||
- The success of modern machine learning
|
||||
- Scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
|
||||
- Reinforcement learning
|
||||
- Online learning paradigm
|
||||
- Interaction is expensive & dangerous
|
||||
- Healthcare, Robotics, Recommendation...
|
||||
- Can we develop data-driven offline RL?
|
||||
|
||||
#### Definition in Offline RL
|
||||
|
||||
- the policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which is collected by an _unknown behavior policy_ $\pi_\beta$
|
||||
- Interaction is not allowed
|
||||
|
||||
- $\mathcal{D}=\{(s_i,a_i,s_i',r_i)\}$
|
||||
- $s\sim d^{\pi_\beta} (s)$
|
||||
- $a\sim \pi_\beta (a|s)$
|
||||
- $s'\sim p(s'|s,a)$
|
||||
- $r\gets r(s,a)$
|
||||
- Objective: $\max_\pi\sum _{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s),a_t\sim \pi(a|s)}[\gamma^tr(s_t,a_t)]$
|
||||
|
||||
#### Key challenge in Offline RL
|
||||
|
||||
Distribution Shift
|
||||
|
||||
What about using traditional reinforcement learning (bootstrapping)?
|
||||
We are given a static dataset:
|
||||
|
||||
$$
|
||||
Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')
|
||||
D = \{ (s_i, a_i, s'_i, r_i) \}_{i=1}^N
|
||||
$$
|
||||
|
||||
$$
|
||||
\pi(s)=\arg\max_{a\in A} Q(s,a)
|
||||
$$
|
||||
Parameter explanations:
|
||||
|
||||
but notice that
|
||||
- $s_i$: state sampled from behavior policy state distribution.
|
||||
- $a_i$: action selected by the behavior policy $\pi_\beta$.
|
||||
- $s'_i$: next state sampled from environment dynamics $p(s'|s,a)$.
|
||||
- $r_i$: reward observed for transition $(s_i,a_i)$.
|
||||
- $N$: total number of transitions in the dataset.
|
||||
- $D$: full offline dataset used for training.
|
||||
|
||||
The goal is to learn a new policy $\pi$ maximizing expected discounted return using only $D$:
|
||||
|
||||
$$
P_{\pi_\beta}(s,a)\neq P_{\pi_f}(s,a)
$$

$$
\max_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^T \gamma^t r(s_t, a_t)\Big]
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\pi$: policy we want to learn.
|
||||
- $r(s,a)$: reward received for state-action pair.
|
||||
- $\gamma$: discount factor controlling weight of future rewards.
|
||||
- $T$: horizon or trajectory length.
|
||||
|
||||
### Why Offline RL Is Difficult
|
||||
|
||||
Offline RL is fundamentally harder than online RL because:
|
||||
|
||||
- The agent cannot try new actions to fix wrong value estimates.
|
||||
- The policy may choose out-of-distribution actions not present in $D$.
|
||||
- Q-value estimates for unseen actions can be arbitrarily incorrect.
|
||||
- Bootstrapping on wrong Q-values can cause divergence.
|
||||
|
||||
This leads to two major failure modes:
|
||||
|
||||
1. **Distribution shift**: new policy actions differ from dataset actions.
2. **Extrapolation error**: the Q-function guesses values for unseen actions.
|
||||
|
||||
### Extrapolation Error Problem
|
||||
|
||||
In standard Q-learning, the Bellman backup is:
|
||||
|
||||
$$
|
||||
Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $Q(s,a)$: estimated value of taking action $a$ in state $s$.
|
||||
- $\max_{a'}$: maximum over possible next actions.
|
||||
- $a'$: candidate next action for evaluation in backup step.
|
||||
|
||||
If $a'$ was rarely or never taken in the dataset, $Q(s',a')$ is poorly estimated, so Q-learning bootstraps off invalid values, causing instability.
|
||||
|
||||
### Behavior Cloning (BC): The Safest Baseline
|
||||
|
||||
The simplest offline method is to imitate the behavior policy:
|
||||
|
||||
$$
|
||||
\max_{\phi} \; \mathbb{E}_{(s,a) \sim D}[\log \pi_{\phi}(a|s)]
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\phi$: neural network parameters of the cloned policy.
|
||||
- $\pi_{\phi}$: learned policy approximating behavior policy.
|
||||
- $\log \pi_{\phi}(a|s)$: log-likelihood of the dataset action; its negative is the training loss.
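
A minimal behavior-cloning sketch for discrete actions, assuming a small MLP policy and a random stand-in batch for the dataset:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # logits of pi_phi(a|s)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_step(states, actions):
    """Maximize E_{(s,a)~D}[log pi_phi(a|s)], i.e. minimize cross-entropy on dataset actions."""
    logits = policy(states)
    loss = F.cross_entropy(logits, actions)   # negative log-likelihood of behavior actions
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# One step on a random "dataset" batch (discrete actions assumed).
print(bc_step(torch.randn(64, 8), torch.randint(0, 4, (64,))))
```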
|
||||
|
||||
Pros:
|
||||
|
||||
- Does not suffer from extrapolation error.
|
||||
- Extremely stable.
|
||||
|
||||
Cons:
|
||||
|
||||
- Cannot outperform the behavior policy.
|
||||
- Ignores reward information entirely.
|
||||
|
||||
### Naive Offline Q-Learning Fails
|
||||
|
||||
Directly applying off-policy Q-learning on $D$ generally leads to:
|
||||
|
||||
- Overestimation of unseen actions.
|
||||
- Divergence due to extrapolation error.
|
||||
- Policies worse than behavior cloning.
|
||||
|
||||
## Strategies for Safe Offline RL
|
||||
|
||||
There are two primary families of solutions:
|
||||
|
||||
1. **Policy constraint methods**
2. **Conservative value estimation methods**
|
||||
|
||||
---
|
||||
|
||||
# 1. Policy Constraint Methods
|
||||
|
||||
These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.
|
||||
|
||||
### Advantage Weighted Regression (AWR / AWAC)
|
||||
|
||||
Policy update:
|
||||
|
||||
$$
|
||||
\pi(a|s) \propto \pi_{\beta}(a|s)\exp\left(\frac{1}{\lambda}A(s,a)\right)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\pi_{\beta}$: behavior policy used to collect dataset.
|
||||
- $A(s,a)$: advantage function derived from Q or V estimates.
|
||||
- $\lambda$: temperature controlling strength of advantage weighting.
|
||||
- $\exp(\cdot)$: positive weighting on high-advantage actions.
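
A sketch of the corresponding policy loss for discrete actions: dataset actions are re-weighted by exponentiated advantages. The weight clipping and batch shapes are assumptions for illustration, and the advantages are assumed to come from a separately learned critic:

```python
import torch
import torch.nn.functional as F

def awr_policy_loss(policy_logits, actions, advantages, lam=1.0, max_weight=20.0):
    """Weighted behavior cloning: log pi(a|s) weighted by exp(A(s,a)/lambda)."""
    weights = torch.exp(advantages / lam).clamp(max=max_weight)    # clip to avoid exploding weights
    log_probs = -F.cross_entropy(policy_logits, actions, reduction="none")
    return -(weights.detach() * log_probs).mean()                  # minimize negative weighted log-likelihood

# Toy batch: 64 states, 4 discrete actions.
loss = awr_policy_loss(torch.randn(64, 4), torch.randint(0, 4, (64,)), torch.randn(64))
print(loss.item())
```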
|
||||
|
||||
Properties:
|
||||
|
||||
- Uses advantages to filter good and bad actions.
|
||||
- Improves beyond behavior policy while staying safe.
|
||||
|
||||
### Batch-Constrained Q-learning (BCQ)
|
||||
|
||||
BCQ constrains the policy using a generative model:
|
||||
|
||||
1. Train a VAE $G_{\omega}$ to model $a$ given $s$.
|
||||
2. Train a small perturbation model $\xi$.
|
||||
3. Limit the policy to $a = G_{\omega}(s) + \xi(s)$.
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $G_{\omega}(s)$: VAE-generated action similar to data actions.
|
||||
- $\omega$: VAE parameters.
|
||||
- $\xi(s)$: small correction to generated actions.
|
||||
- $a$: final policy action constrained near dataset distribution.
|
||||
|
||||
BCQ avoids selecting unseen actions and strongly reduces extrapolation.
|
||||
|
||||
### BEAR (Bootstrapping Error Accumulation Reduction)
|
||||
|
||||
BEAR adds explicit constraints:
|
||||
|
||||
$$
|
||||
D_{MMD}\left(\pi(a|s), \pi_{\beta}(a|s)\right) < \epsilon
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $D_{MMD}$: Maximum Mean Discrepancy distance between action distributions.
|
||||
- $\epsilon$: threshold restricting policy deviation from behavior policy.
|
||||
|
||||
BEAR controls distribution shift more tightly than BCQ.
|
||||
|
||||
---
|
||||
|
||||
# 2. Conservative Value Function Methods
|
||||
|
||||
These methods modify Q-learning so that Q-values of unseen actions are **underestimated**, preventing the policy from exploiting overestimated values.
|
||||
|
||||
### Conservative Q-Learning (CQL)
|
||||
|
||||
One formulation is:
|
||||
|
||||
$$
|
||||
J(Q) = J_{TD}(Q) + \alpha\big(\mathbb{E}_{a\sim\pi(\cdot|s)}Q(s,a) - \mathbb{E}_{a\sim D}Q(s,a)\big)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $J_{TD}$: standard Bellman TD loss.
|
||||
- $\alpha$: weight of conservatism penalty.
|
||||
- $\mathbb{E}_{a\sim\pi(\cdot|s)}$: expectation over policy-chosen actions.
|
||||
- $\mathbb{E}_{a\sim D}$: expectation over dataset actions.
|
||||
|
||||
Effect:
|
||||
|
||||
- Increases Q-values of dataset actions.
|
||||
- Decreases Q-values of out-of-distribution actions.
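
A sketch of the conservatism penalty alone for a discrete-action critic; the TD term $J_{TD}$ is omitted and the value of $\alpha$ is an arbitrary example:

```python
import torch

def cql_penalty(q_values, data_actions, policy_probs):
    """Gap E_{a~pi}[Q(s,a)] - E_{a~D}[Q(s,a)] for a discrete-action Q-network.

    q_values:     (batch, n_actions) Q(s, .) from the critic
    data_actions: (batch,) actions actually present in the dataset
    policy_probs: (batch, n_actions) current policy pi(.|s)
    """
    q_pi = (policy_probs * q_values).sum(dim=-1)                       # E_{a~pi} Q(s,a)
    q_data = q_values.gather(1, data_actions.unsqueeze(1)).squeeze(1)  # Q(s, a_data)
    return (q_pi - q_data).mean()

q = torch.randn(32, 4)
a = torch.randint(0, 4, (32,))
pi = torch.softmax(torch.randn(32, 4), dim=-1)
alpha = 5.0
penalty = alpha * cql_penalty(q, a, pi)   # J(Q) = J_TD(Q) + penalty in the full objective
print(penalty.item())
```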
|
||||
|
||||
### Implicit Q-Learning (IQL)
|
||||
|
||||
IQL avoids constraints entirely by using expectile regression:
|
||||
|
||||
Value regression:
|
||||
|
||||
$$
|
||||
V(s) = \arg\min_{v} \; \mathbb{E}\big[\rho_{\tau}(Q(s,a) - v)\big]
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $v$: scalar value estimate for state $s$.
|
||||
- $\rho_{\tau}(x)$: expectile regression loss.
|
||||
- $\tau$: expectile parameter controlling conservatism.
|
||||
- $Q(s,a)$: Q-value estimate.
|
||||
|
||||
Key idea:
|
||||
|
||||
- For $\tau < 1$, IQL reduces sensitivity to large (possibly incorrect) Q-values.
|
||||
- Implicitly conservative without special constraints.
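
A minimal sketch of the expectile loss $\rho_{\tau}$ used to fit $V(s)$ against $Q(s,a)$ on dataset actions (tensor shapes are illustrative):

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """rho_tau(Q - V): asymmetric squared error used to fit V(s) in IQL-style training."""
    diff = q_values - v_values
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1 - tau))
    return (weight * diff.pow(2)).mean()   # tau > 0.5 pushes V toward an upper expectile of Q

# With tau close to 1, V(s) approximates a max over in-distribution actions
# without ever querying Q at unseen actions.
print(expectile_loss(torch.randn(128), torch.randn(128), tau=0.7).item())
```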
|
||||
|
||||
IQL often achieves state-of-the-art performance due to simplicity and stability.
|
||||
|
||||
---
|
||||
|
||||
# Model-Based Offline RL
|
||||
|
||||
### Forward Model-Based RL
|
||||
|
||||
Train a dynamics model:
|
||||
|
||||
$$
|
||||
p_{\theta}(s'|s,a)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $p_{\theta}$: learned transition model.
|
||||
- $\theta$: parameters of transition model.
|
||||
|
||||
We can generate synthetic transitions using $p_{\theta}$, but model error accumulates.
|
||||
|
||||
### Penalty-Based Model Approaches (MOPO, MOReL)
|
||||
|
||||
Add uncertainty penalty:
|
||||
|
||||
$$
|
||||
r_{model}(s,a) = r(s,a) - \beta\, u(s,a)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $r_{model}$: penalized reward for model rollouts.
|
||||
- $u(s,a)$: model uncertainty estimate.
|
||||
- $\beta$: penalty coefficient.
|
||||
|
||||
These methods limit exploration into unknown model regions.
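
As an illustration, the sketch below penalizes rewards using ensemble disagreement as the uncertainty proxy $u(s,a)$; this is a common but simplified choice, not the exact penalty used in MOPO or MOReL:

```python
import numpy as np

def penalized_reward(r, s, a, ensemble, beta=1.0):
    """r_model(s,a) = r(s,a) - beta * u(s,a), with u(s,a) taken here as the
    disagreement (std) across an ensemble of learned dynamics models."""
    preds = np.stack([model(s, a) for model in ensemble])   # (n_models, state_dim) predictions
    u = preds.std(axis=0).max()                             # simple uncertainty proxy
    return r - beta * u

# Hypothetical 3-model ensemble of linear dynamics models (stand-ins for p_theta).
ensemble = [lambda s, a, w=w: w * s for w in (0.9, 1.0, 1.1)]
print(penalized_reward(1.0, np.ones(4), 0, ensemble, beta=0.5))
```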
|
||||
|
||||
---
|
||||
|
||||
# Reverse Model-Based Imagination (ROMI)
|
||||
|
||||
ROMI generates new training data by **backward** imagination.
|
||||
|
||||
### Reverse Dynamics Model
|
||||
|
||||
ROMI learns:
|
||||
|
||||
$$
|
||||
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\psi$: parameters of reverse dynamics model.
|
||||
- $s_{t+1}$: later state.
|
||||
- $a_{t}$: action taken leading to $s_{t+1}$.
|
||||
- $s_{t}$: predicted predecessor state.
|
||||
|
||||
ROMI also learns a reverse policy for sampling likely predecessor actions.
|
||||
|
||||
### Reverse Imagination Process
|
||||
|
||||
Given a goal state $s_{g}$:
|
||||
|
||||
1. Sample $a_{t}$ from reverse policy.
|
||||
2. Predict $s_{t}$ from reverse dynamics.
|
||||
3. Form imagined transition $(s_{t}, a_{t}, s_{t+1})$.
|
||||
4. Repeat to build longer imagined trajectories.
|
||||
|
||||
Benefits:
|
||||
|
||||
- Imagined transitions end in real states, ensuring grounding.
|
||||
- Completes missing parts of dataset.
|
||||
- Helps propagate reward backward reliably.
|
||||
|
||||
ROMI combined with conservative RL often outperforms standard offline methods.
|
||||
|
||||
---
|
||||
|
||||
# Summary of Lecture 22
|
||||
|
||||
Offline RL requires balancing:
|
||||
|
||||
- Improvement beyond dataset behavior.
|
||||
- Avoiding unsafe extrapolation to unseen actions.
|
||||
|
||||
Three major families of solutions:
|
||||
|
||||
1. Policy constraints (BCQ, BEAR, AWR)
|
||||
2. Conservative Q-learning (CQL, IQL)
|
||||
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)
|
||||
|
||||
Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.
|
||||
|
||||
---
|
||||
|
||||
# Recommended Screenshot Frames for Lecture 22
|
||||
|
||||
- Lecture 22, page 7: Offline RL diagram showing policy learning from fixed dataset, subsection "Offline RL Setting".
|
||||
- Lecture 22, page 35: Illustration of dataset support vs policy action distribution, subsection "Strategies for Safe Offline RL".
|
||||
|
||||
---
|
||||
|
||||
**End of CSE510_L22.md**
|
||||
|
||||
@@ -1,3 +1,177 @@
|
||||
# CSE510 Deep Reinforcement Learning (Lecture 23)
|
||||
|
||||
## Offline Reinforcement Learning
|
||||
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
|
||||
|
||||
## Offline Reinforcement Learning Part II: Advanced Approaches
|
||||
|
||||
Lecture 23 continues with advanced topics in offline RL, expanding on model-based imagination methods and credit assignment structures relevant for offline multi-agent and single-agent settings.
|
||||
|
||||
## Reverse Model-Based Imagination (ROMI)
|
||||
|
||||
ROMI is a method for augmenting an offline dataset with additional transitions generated by imagining trajectories **backwards** from desirable states. Unlike forward model rollouts, backward imagination stays grounded in real data because imagined transitions always terminate in dataset states.
|
||||
|
||||
### Reverse Dynamics Model
|
||||
|
||||
ROMI learns a reverse dynamics model:
|
||||
|
||||
$$
|
||||
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $p_{\psi}$: learned reverse transition model.
|
||||
- $\psi$: parameter vector for the reverse model.
|
||||
- $s_{t+1}$: next state (from dataset).
|
||||
- $a_{t}$: action that hypothetically leads into $s_{t+1}$.
|
||||
- $s_{t}$: predicted predecessor state.
|
||||
|
||||
ROMI also learns a reverse policy to sample actions that likely lead into known states:
|
||||
|
||||
$$
|
||||
\pi_{rev}(a_{t} \mid s_{t+1})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\pi_{rev}$: reverse policy distribution.
|
||||
- $a_{t}$: action sampled for backward trajectory generation.
|
||||
- $s_{t+1}$: state whose predecessors are being imagined.
|
||||
|
||||
### Reverse Imagination Process
|
||||
|
||||
To generate imagined transitions:
|
||||
|
||||
1. Select a goal or high-value state $s_{g}$ from the offline dataset.
|
||||
2. Sample $a_{t}$ from $\pi_{rev}(a_{t} \mid s_{g})$.
|
||||
3. Predict $s_{t}$ from $p_{\psi}(s_{t} \mid s_{g}, a_{t})$.
|
||||
4. Form an imagined transition $(s_{t}, a_{t}, s_{g})$.
|
||||
5. Repeat backward to obtain a longer imagined trajectory.
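
A schematic sketch of this backward-imagination loop, with toy stand-ins for the learned reverse policy and reverse dynamics model:

```python
import numpy as np

def reverse_rollout(s_goal, reverse_policy, reverse_model, horizon=5):
    """Generate imagined transitions backward from a real (goal) state.

    reverse_policy(s_next) -> a_t        samples an action likely to lead into s_next
    reverse_model(s_next, a_t) -> s_t    predicts a predecessor state
    """
    transitions, s_next = [], s_goal
    for _ in range(horizon):
        a = reverse_policy(s_next)
        s_prev = reverse_model(s_next, a)
        transitions.append((s_prev, a, s_next))   # each imagined transition ends in a grounded state
        s_next = s_prev                           # keep imagining further back in time
    return list(reversed(transitions))            # ordered forward in time

# Toy stand-ins for the learned reverse policy/model.
rng = np.random.default_rng(0)
rollout = reverse_rollout(
    s_goal=np.ones(4),
    reverse_policy=lambda s: rng.integers(4),
    reverse_model=lambda s, a: s - 0.1,
)
print(len(rollout), "imagined transitions; the last one ends in the real goal state")
```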
|
||||
|
||||
Benefits:
|
||||
|
||||
- Imagined states remain grounded by terminating in real dataset states.
|
||||
- Helps propagate reward signals backward through states not originally visited.
|
||||
- Avoids runaway model error that occurs in forward model rollouts.
|
||||
|
||||
ROMI effectively fills in missing gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms.
|
||||
|
||||
---
|
||||
|
||||
## Implicit Credit Assignment via Value Factorization Structures
|
||||
|
||||
Although initially studied for multi-agent systems, insights from value factorization also improve offline RL by providing structured credit assignment signals.
|
||||
|
||||
### Counterfactual Credit Assignment Insight
|
||||
|
||||
A factored value function structure of the form:
|
||||
|
||||
$$
|
||||
Q_{tot}(s, a_{1}, \dots, a_{n}) = f(Q_{1}(s, a_{1}), \dots, Q_{n}(s, a_{n}))
|
||||
$$
|
||||
|
||||
can implicitly implement counterfactual credit assignment.
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $Q_{tot}$: global value function.
|
||||
- $Q_{i}(s,a_{i})$: individual component value for agent or subsystem $i$.
|
||||
- $f(\cdot)$: mixing function combining components.
|
||||
- $s$: environment state.
|
||||
- $a_{i}$: action taken by entity $i$.
|
||||
|
||||
In architectures designed for IGM (Individual-Global-Max) consistency, gradients backpropagated through $f$ isolate the marginal effect of each component. This implicitly gives each agent or subsystem a counterfactual advantage signal.
|
||||
|
||||
Even in single-agent structured RL, similar factorization structures allow credit flowing into components representing skills, modes, or action groups, enabling better temporal and structural decomposition.
|
||||
|
||||
---
|
||||
|
||||
## Model-Based vs Model-Free Offline RL
|
||||
|
||||
Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL.
|
||||
|
||||
### Forward Model-Based Rollouts
|
||||
|
||||
Forward imagination using a learned model:
|
||||
|
||||
$$
|
||||
p_{\theta}(s'|s,a)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $p_{\theta}$: learned forward dynamics model.
|
||||
- $\theta$: parameters of the forward model.
|
||||
- $s'$: predicted next state.
|
||||
- $s$: current state.
|
||||
- $a$: action taken in current state.
|
||||
|
||||
Problems:
|
||||
|
||||
- Forward rollouts drift away from dataset support.
|
||||
- Model error compounds with each step.
|
||||
- Leads to training instability if used without penalties.
|
||||
|
||||
### Penalty Methods (MOPO, MOReL)
|
||||
|
||||
Augmented reward:
|
||||
|
||||
$$
|
||||
r_{model}(s,a) = r(s,a) - \beta u(s,a)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $r_{model}(s,a)$: penalized reward for model-generated steps.
|
||||
- $u(s,a)$: uncertainty score of model for state-action pair.
|
||||
- $\beta$: penalty coefficient.
|
||||
- $r(s,a)$: original reward.
|
||||
|
||||
These methods limit exploration into uncertain model regions.
|
||||
|
||||
### ROMI vs Forward Rollouts
|
||||
|
||||
- Forward methods expand state space beyond dataset.
|
||||
- ROMI expands **backward**, staying consistent with known good future states.
|
||||
- ROMI reduces error accumulation because future anchors are real.
|
||||
|
||||
---
|
||||
|
||||
## Combining ROMI With Conservative Offline RL
|
||||
|
||||
ROMI is typically combined with:
|
||||
|
||||
- CQL (Conservative Q-Learning)
|
||||
- IQL (Implicit Q-Learning)
|
||||
- BCQ and BEAR (policy constraint methods)
|
||||
|
||||
Workflow:
|
||||
|
||||
1. Generate imagined transitions via ROMI.
|
||||
2. Add them to dataset.
|
||||
3. Train Q-function or policy using conservative losses.
|
||||
|
||||
Benefits:
|
||||
|
||||
- Better coverage of reward-relevant states.
|
||||
- Increased policy improvement over dataset.
|
||||
- More stable Q-learning backups.
|
||||
|
||||
---
|
||||
|
||||
## Summary of Lecture 23
|
||||
|
||||
Key points:
|
||||
|
||||
- Offline RL can be improved via structured imagination.
|
||||
- ROMI creates safe imagined transitions by reversing dynamics.
|
||||
- Reverse imagination avoids pitfalls of forward model error.
|
||||
- Factored value structures provide implicit counterfactual credit assignment.
|
||||
- Combining ROMI with conservative learners yields state-of-the-art performance.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Screenshot Frames for Lecture 23
|
||||
|
||||
- Lecture 23, page 20: ROMI concept diagram depicting reverse imagination from goal states. Subsection: "Reverse Model-Based Imagination (ROMI)".
|
||||
- Lecture 23, page 24: Architecture figure showing reverse policy and reverse dynamics model used to generate imagined transitions. Subsection: "Reverse Imagination Process".
|
||||
|
||||
275
content/CSE510/CSE510_L24.md
Normal file
@@ -0,0 +1,275 @@
|
||||
# CSE510 Deep Reinforcement Learning (Lecture 24)
|
||||
|
||||
## Cooperative Multi-Agent Reinforcement Learning (MARL)
|
||||
|
||||
This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.
|
||||
|
||||
|
||||
|
||||
## Multi-Agent Coordination Under Uncertainty
|
||||
|
||||
In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP).
|
||||
|
||||
A transition is defined as:
|
||||
|
||||
$$
|
||||
P(s' \mid s, a_{1}, \dots, a_{n})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $s$: current global state.
|
||||
- $s'$: next global state.
|
||||
- $a_{i}$: action taken by agent $i$.
|
||||
- $P(\cdot)$: environment transition function.
|
||||
|
||||
The shared return is:
|
||||
|
||||
$$
|
||||
\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\gamma$: discount factor.
|
||||
- $T$: horizon length.
|
||||
- $r_{t}$: shared team reward at time $t$.
|
||||
|
||||
### CTDE: Centralized Training, Decentralized Execution
|
||||
|
||||
Training uses global information (centralized), but execution uses local agent observations. This is critical for real-world deployment.
|
||||
|
||||
|
||||
## Joint vs Factored Q-Learning
|
||||
|
||||
### Joint Q-Learning
|
||||
|
||||
In joint-action learning, one learns a full joint Q-function:
|
||||
|
||||
$$
|
||||
Q_{tot}(s, a_{1}, \dots, a_{n})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $Q_{tot}$: joint value for the entire team.
|
||||
- $(a_{1}, \dots, a_{n})$: joint action vector across agents.
|
||||
|
||||
Problem:
|
||||
|
||||
- The joint action space grows exponentially in $n$.
|
||||
- Learning is not scalable.
|
||||
|
||||
### Value Factorization
|
||||
|
||||
Instead of learning $Q_{tot}$ directly, we factorize it into individual utility functions:
|
||||
|
||||
$$
|
||||
Q_{tot}(s, \mathbf{a}) = f(Q_{1}(s,a_{1}), \dots, Q_{n}(s,a_{n}))
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\mathbf{a}$: joint action vector.
|
||||
- $f(\cdot)$: mixing network combining individual Q-values.
|
||||
|
||||
The goal is to enable decentralized greedy action selection.
|
||||
|
||||
|
||||
|
||||
## Individual-Global-Max (IGM) Condition
|
||||
|
||||
The IGM condition enables decentralized optimal action selection:
|
||||
|
||||
$$
|
||||
\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a}) = \big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\arg\max_{\mathbf{a}}$: search for best joint action.
|
||||
- $\arg\max_{a_{i}}$: best local action for agent $i$.
|
||||
- $Q_{i}(s,a_{i})$: individual utility for agent $i$.
|
||||
|
||||
IGM makes decentralized execution optimal with respect to the learned factorized value.
|
||||
|
||||
|
||||
|
||||
## Linear Value Factorization
|
||||
|
||||
### VDN (Value Decomposition Networks)
|
||||
|
||||
VDN assumes:
|
||||
|
||||
$$
|
||||
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} Q_{i}(s,a_{i})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $Q_{i}(s,a_{i})$: value of agent $i$'s action.
|
||||
- $\sum_{i=1}^{n}$: linear sum over agents.
|
||||
|
||||
Pros:
|
||||
|
||||
- Very simple, satisfies IGM.
|
||||
- Fully decentralized execution.
|
||||
|
||||
Cons:
|
||||
|
||||
- Limited representation capacity.
|
||||
- Cannot model non-linear teamwork interactions.
|
||||
|
||||
|
||||
|
||||
## QMIX: Monotonic Value Factorization
|
||||
|
||||
QMIX uses a state-conditioned mixing network enforcing monotonicity:
|
||||
|
||||
$$
|
||||
\frac{\partial Q_{tot}}{\partial Q_{i}} \ge 0
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\partial Q_{tot} / \partial Q_{i}$: gradient of global Q w.r.t. individual Q.
|
||||
- $\ge 0$: ensures monotonicity required for IGM.
|
||||
|
||||
The mixing function is:
|
||||
|
||||
$$
|
||||
Q_{tot}(s,\mathbf{a}) = f_{mix}(Q_{1}, \dots, Q_{n}; s)
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $f_{mix}$: neural network with non-negative weights.
|
||||
- $s$: global state conditioning the mixing process.
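
A hedged PyTorch sketch of such a monotonic mixer: state-conditioned hypernetworks produce the mixing weights, and taking their absolute value enforces $\partial Q_{tot}/\partial Q_{i} \ge 0$. Layer sizes are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class QmixMixer(nn.Module):
    """Monotonic mixer: weights come from state-conditioned hypernetworks, made non-negative."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = self.hyper_w1(state).abs().view(-1, self.n_agents, self.embed_dim)  # abs() => monotonicity
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = self.hyper_w2(state).abs().unsqueeze(-1)
        b2 = self.hyper_b2(state).unsqueeze(1)
        return (torch.bmm(hidden, w2) + b2).squeeze(-1).squeeze(-1)              # Q_tot(s, a) per batch item

mixer = QmixMixer(n_agents=3, state_dim=16)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 16))   # (batch,) global values
```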
|
||||
|
||||
Benefits:
|
||||
|
||||
- More expressive than VDN.
|
||||
- Supports CTDE while keeping decentralized greedy execution.
|
||||
|
||||
|
||||
|
||||
## Theoretical Issues With Linear and Monotonic Factorization
|
||||
|
||||
Limitations:
|
||||
|
||||
- Linear models (VDN) cannot represent complex coordination.
|
||||
- QMIX monotonicity limits representation power for tasks requiring non-monotonic interactions.
|
||||
- Off-policy training can diverge in some factorizations.
|
||||
|
||||
|
||||
|
||||
## QPLEX: Duplex Dueling Multi-Agent Q-Learning
|
||||
|
||||
QPLEX introduces a dueling architecture that satisfies IGM while providing full representation capacity within the IGM class.
|
||||
|
||||
### QPLEX Advantage Factorization
|
||||
|
||||
QPLEX factorizes:
|
||||
|
||||
$$
|
||||
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} \lambda_{i}(s,\mathbf{a})\big(Q_{i}(s,a_{i}) - \max_{a'_{i}} Q_{i}(s,a'_{i})\big) + \max_{\mathbf{a}'} \sum_{i=1}^{n} Q_{i}(s,a'_{i})
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $\lambda_{i}(s,\mathbf{a})$: positive mixing coefficients.
|
||||
- $Q_{i}(s,a_{i})$: individual utility.
|
||||
- $\max_{a'_{i}} Q_{i}(s,a'_{i})$: per-agent baseline value.
|
||||
- $\max_{\mathbf{a}'}$: maximization over joint actions (state-value baseline).
|
||||
|
||||
QPLEX Properties:
|
||||
|
||||
- Fully satisfies IGM.
|
||||
- Has full representation capacity for all IGM-consistent Q-functions.
|
||||
- Enables stable off-policy training.
|
||||
|
||||
|
||||
|
||||
## QPLEX Training Objective
|
||||
|
||||
QPLEX minimizes a TD loss over $Q_{tot}$:
|
||||
|
||||
$$
|
||||
L = \mathbb{E}\Big[(r + \gamma \max_{\mathbf{a'}} Q_{tot}(s',\mathbf{a'}) - Q_{tot}(s,\mathbf{a}))^{2}\Big]
|
||||
$$
|
||||
|
||||
Parameter explanations:
|
||||
|
||||
- $r$: shared team reward.
|
||||
- $\gamma$: discount factor.
|
||||
- $s'$: next state.
|
||||
- $\mathbf{a'}$: next joint action evaluated by TD target.
|
||||
- $Q_{tot}$: QPLEX global value estimate.
|
||||
|
||||
|
||||
|
||||
## Role of Credit Assignment
|
||||
|
||||
Credit assignment addresses: "Which agent contributed what to the team reward?"
|
||||
|
||||
Value factorization supports implicit credit assignment:
|
||||
|
||||
- Gradients into each $Q_{i}$ act as counterfactual signals.
|
||||
- Dueling architectures allow each agent to learn its influence.
|
||||
- QPLEX provides clean marginal contributions implicitly.
|
||||
|
||||
|
||||
|
||||
## Performance on SMAC Benchmarks
|
||||
|
||||
QPLEX outperforms:
|
||||
|
||||
- QTRAN
|
||||
- QMIX
|
||||
- VDN
|
||||
- Other CTDE baselines
|
||||
|
||||
Key reasons:
|
||||
|
||||
- Effective realization of IGM.
|
||||
- Strong representational capacity.
|
||||
- Off-policy stability.
|
||||
|
||||
|
||||
|
||||
## Extensions: Diversity and Shared Parameter Learning
|
||||
|
||||
Parameter sharing encourages sample efficiency, but can cause homogeneous agent behavior.
|
||||
|
||||
Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce:
|
||||
|
||||
- Identity-aware diversity.
|
||||
- Information-based intrinsic rewards for agent differentiation.
|
||||
- Balanced sharing vs agent specialization.
|
||||
|
||||
These techniques improve exploration and cooperation in complex multi-agent tasks.
|
||||
|
||||
|
||||
|
||||
## Summary of Lecture 24
|
||||
|
||||
Key points:
|
||||
|
||||
- Cooperative MARL requires scalable value decomposition.
|
||||
- IGM enables decentralized action selection from centralized training.
|
||||
- QMIX introduces monotonic non-linear factorization.
|
||||
- QPLEX achieves full IGM representational capacity.
|
||||
- Implicit credit assignment arises naturally from factorization.
|
||||
- Diversity methods allow richer multi-agent coordination strategies.
|
||||
|
||||
|
||||
|
||||
## Recommended Screenshot Frames for Lecture 24
|
||||
|
||||
- Lecture 24, page 16: CTDE and QMIX architecture diagram (mixing network). Subsection: "QMIX: Monotonic Value Factorization".
|
||||
- Lecture 24, page 31: QPLEX benchmark performance on SMAC. Subsection: "Performance on SMAC Benchmarks".
|
||||
@@ -26,4 +26,5 @@ export default {
|
||||
CSE510_L21: "CSE510 Deep Reinforcement Learning (Lecture 21)",
|
||||
CSE510_L22: "CSE510 Deep Reinforcement Learning (Lecture 22)",
|
||||
CSE510_L23: "CSE510 Deep Reinforcement Learning (Lecture 23)",
|
||||
CSE510_L24: "CSE510 Deep Reinforcement Learning (Lecture 24)",
|
||||
}
|
||||