# CSE510 Deep Reinforcement Learning (Lecture 23)
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.

## Offline Reinforcement Learning Part II: Advanced Approaches

Lecture 23 continues with advanced topics in offline RL, expanding on model-based imagination methods and credit assignment structures relevant to offline multi-agent and single-agent settings.
## Reverse Model-Based Imagination (ROMI)

ROMI is a method for augmenting an offline dataset with additional transitions generated by imagining trajectories *backwards* from desirable states. Unlike forward model rollouts, backward imagination stays grounded in real data because imagined transitions always terminate in dataset states.
### Reverse Dynamics Model

ROMI learns a reverse dynamics model:

$$
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
$$

Parameter explanations:

- $p_{\psi}$: learned reverse transition model.
- $\psi$: parameter vector for the reverse model.
- $s_{t+1}$: next state (from dataset).
- $a_{t}$: action that hypothetically leads into $s_{t+1}$.
- $s_{t}$: predicted predecessor state.

ROMI also learns a reverse policy to sample actions that likely lead into known states:

$$
\pi_{rev}(a_{t} \mid s_{t+1})
$$

Parameter explanations:

- $\pi_{rev}$: reverse policy distribution.
- $a_{t}$: action sampled for backward trajectory generation.
- $s_{t+1}$: state whose predecessors are being imagined.
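
As a concrete illustration, both components can be parameterized as small neural networks. The sketch below is a minimal version assuming continuous states and actions with diagonal-Gaussian outputs; the class names and dimensions are illustrative rather than taken from the lecture or the ROMI paper, and both networks would typically be trained by maximum likelihood on dataset transitions $(s_{t}, a_{t}, s_{t+1})$.

```python
import torch
import torch.nn as nn


class ReverseDynamicsModel(nn.Module):
    """p_psi(s_t | s_{t+1}, a_t): predicts the predecessor state."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def forward(self, next_state, action):
        h = self.net(torch.cat([next_state, action], dim=-1))
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mean(h), std)


class ReversePolicy(nn.Module):
    """pi_rev(a_t | s_{t+1}): proposes actions that plausibly lead into a known state."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, next_state):
        h = self.net(next_state)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mean(h), std)
```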
### Reverse Imagination Process

To generate imagined transitions (a code sketch follows the list):

1. Select a goal or high-value state $s_{g}$ from the offline dataset.
2. Sample $a_{t}$ from $\pi_{rev}(a_{t} \mid s_{g})$.
3. Predict $s_{t}$ from $p_{\psi}(s_{t} \mid s_{g}, a_{t})$.
4. Form an imagined transition $(s_{t}, a_{t}, s_{g})$.
5. Repeat backward to obtain a longer imagined trajectory.
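
The loop below is a minimal sketch of this backward rollout, reusing the `ReverseDynamicsModel` and `ReversePolicy` classes sketched above. The reward term is an added assumption (imagined transitions also need reward labels, e.g. from a learned reward model); `reward_fn` and the function name are illustrative.

```python
import torch


@torch.no_grad()
def reverse_imagine(goal_state, reverse_policy, reverse_model, reward_fn, horizon=5):
    """Generate an imagined trajectory that ends in a real dataset state `goal_state`.

    reverse_policy(s_next) and reverse_model(s_next, a) are assumed to return
    torch distributions, as in the sketch above; reward_fn(s, a, s_next) is an
    assumed learned reward model (hypothetical helper, not from the lecture).
    Returns transitions ordered forward in time: (s_t, a_t, r_t, s_{t+1}).
    """
    transitions = []
    s_next = goal_state
    for _ in range(horizon):
        a = reverse_policy(s_next).sample()    # step 2: a_t ~ pi_rev(. | s_{t+1})
        s = reverse_model(s_next, a).sample()  # step 3: s_t ~ p_psi(. | s_{t+1}, a_t)
        r = reward_fn(s, a, s_next)            # assumed reward labelling
        transitions.append((s, a, r, s_next))  # step 4: imagined transition
        s_next = s                             # step 5: continue backward
    transitions.reverse()                      # reorder so the trajectory reads forward
    return transitions
```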
Benefits:

- Imagined states remain grounded by terminating in real dataset states.
- Helps propagate reward signals backward through states not originally visited.
- Avoids runaway model error that occurs in forward model rollouts.

ROMI effectively fills in missing gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms.

---
## Implicit Credit Assignment via Value Factorization Structures

Although initially studied for multi-agent systems, insights from value factorization also improve offline RL by providing structured credit assignment signals.

### Counterfactual Credit Assignment Insight

A factored value function structure of the form

$$
Q_{tot}(s, a_{1}, \dots, a_{n}) = f(Q_{1}(s, a_{1}), \dots, Q_{n}(s, a_{n}))
$$

can implicitly implement counterfactual credit assignment.

Parameter explanations:

- $Q_{tot}$: global value function.
- $Q_{i}(s,a_{i})$: individual component value for agent or subsystem $i$.
- $f(\cdot)$: mixing function combining components.
- $s$: environment state.
- $a_{i}$: action taken by entity $i$.

In architectures designed for IGM (Individual-Global-Max) consistency, gradients backpropagated through $f$ isolate the marginal effect of each component. This implicitly gives each agent or subsystem a counterfactual advantage signal.
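
A simplified, QMIX-flavored sketch of such a mixing function $f$: with nonnegative state-dependent weights, $\partial Q_{tot} / \partial Q_{i} = w_{i}(s) \ge 0$, so the mixer is IGM-consistent and the gradient reaching each $Q_{i}$ acts as an implicit credit-assignment weight. The one-layer mixer below is an illustrative simplification, not the exact architecture from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MonotonicMixer(nn.Module):
    """Q_tot = f(Q_1, ..., Q_n) with dQ_tot/dQ_i >= 0 (simplified QMIX-style mixer).

    Monotonicity gives IGM consistency: each component maximizing its own Q_i
    also maximizes Q_tot. During TD learning, the gradient flowing into Q_i is
    scaled by the state-dependent weight w_i(s), which acts as an implicit
    credit-assignment signal for component i.
    """

    def __init__(self, n_agents: int, state_dim: int, hidden: int = 64):
        super().__init__()
        self.hyper_w = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_agents))
        self.hyper_b = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) individual Q_i(s, a_i); state: (batch, state_dim)
        w = F.softplus(self.hyper_w(state))  # nonnegative mixing weights w_i(s)
        return (w * agent_qs).sum(dim=-1, keepdim=True) + self.hyper_b(state)
```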
Even in single-agent structured RL, similar factorization structures allow credit to flow into components representing skills, modes, or action groups, enabling better temporal and structural decomposition.

---
## Model-Based vs Model-Free Offline RL

Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL.

### Forward Model-Based Rollouts

Forward imagination using a learned model:

$$
p_{\theta}(s' \mid s, a)
$$

Parameter explanations:

- $p_{\theta}$: learned forward dynamics model.
- $\theta$: parameters of the forward model.
- $s'$: predicted next state.
- $s$: current state.
- $a$: action taken in current state.

Problems:

- Forward rollouts drift away from dataset support.
- Model error compounds with each step.
- Leads to training instability if used without penalties.
### Penalty Methods (MOPO, MOReL)

Augmented reward:

$$
r_{model}(s,a) = r(s,a) - \beta u(s,a)
$$

Parameter explanations:

- $r_{model}(s,a)$: penalized reward for model-generated steps.
- $u(s,a)$: uncertainty score of the model for a state-action pair.
- $\beta$: penalty coefficient.
- $r(s,a)$: original reward.

These methods limit exploration into uncertain model regions.
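
A minimal sketch of the penalized reward, assuming an ensemble of forward models and using their disagreement as the uncertainty score $u(s,a)$; this particular choice of $u$ is an illustrative heuristic (MOPO and MOReL use related but more specific uncertainty estimates).

```python
import numpy as np


def penalized_reward(reward, next_state_preds, beta=1.0):
    """r_model(s, a) = r(s, a) - beta * u(s, a).

    next_state_preds: array of shape (ensemble_size, state_dim) holding each
    ensemble member's predicted next state for the same (s, a). Disagreement
    between members serves as the uncertainty score u(s, a).
    """
    u = np.linalg.norm(next_state_preds.std(axis=0))  # ensemble disagreement
    return reward - beta * u
```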
### ROMI vs Forward Rollouts

- Forward methods expand state space beyond dataset.
- ROMI expands *backward*, staying consistent with known good future states.
- ROMI reduces error accumulation because future anchors are real.

---
## Combining ROMI With Conservative Offline RL

ROMI is typically combined with:

- CQL (Conservative Q-Learning)
- IQL (Implicit Q-Learning)
- BCQ and BEAR (policy constraint methods)

Workflow (a code sketch follows these steps):

1. Generate imagined transitions via ROMI.
2. Add them to the dataset.
3. Train the Q-function or policy using conservative losses.
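
A sketch of this workflow as a single function. The callables here are hypothetical placeholders: `imagine_fn` would wrap the reverse-imagination sketch from earlier, `select_goal_states` picks high-value real states, and `train_conservative_agent` stands in for whichever conservative learner is used (e.g. CQL or IQL).

```python
def romi_augment_and_train(dataset, imagine_fn, select_goal_states,
                           train_conservative_agent):
    """Sketch of the three workflow steps above.

    `dataset` is a list of (s, a, r, s_next) tuples; the three callables are
    assumed placeholders, not APIs from the lecture.
    """
    # Step 1: generate imagined transitions that terminate in promising real states.
    imagined = []
    for goal in select_goal_states(dataset):
        imagined.extend(imagine_fn(goal))

    # Step 2: add them to the dataset (optionally tagged as model-generated).
    augmented = list(dataset) + list(imagined)

    # Step 3: train the Q-function/policy with a conservative offline RL loss.
    return train_conservative_agent(augmented)
```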
Benefits:

- Better coverage of reward-relevant states.
- Increased policy improvement over the dataset behavior.
- More stable Q-learning backups.

---
## Summary of Lecture 23

Key points:

- Offline RL can be improved via structured imagination.
- ROMI creates safe imagined transitions by reversing dynamics.
- Reverse imagination avoids pitfalls of forward model error.
- Factored value structures provide implicit counterfactual credit assignment.
- Combining ROMI with conservative learners yields state-of-the-art performance.

---
## Recommended Screenshot Frames for Lecture 23

- Lecture 23, page 20: ROMI concept diagram depicting reverse imagination from goal states. Subsection: "Reverse Model-Based Imagination (ROMI)".
- Lecture 23, page 24: Architecture figure showing the reverse policy and reverse dynamics model used to generate imagined transitions. Subsection: "Reverse Imagination Process".