# CSE510 Deep Reinforcement Learning (Lecture 23)
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
## Offline Reinforcement Learning Part II: Advanced Approaches
Lecture 23 continues with advanced topics in offline RL, expanding on model-based imagination methods and credit-assignment structures relevant to offline multi-agent and single-agent settings.
## Reverse Model-Based Imagination (ROMI)
ROMI is a method for augmenting an offline dataset with additional transitions generated by imagining trajectories *backwards* from desirable states. Unlike forward model rollouts, backward imagination stays grounded in real data because imagined transitions always terminate in dataset states.
### Reverse Dynamics Model
ROMI learns a reverse dynamics model:
$$
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
$$
Parameter explanations:
- $p_{\psi}$: learned reverse transition model.
- $\psi$: parameter vector for the reverse model.
- $s_{t+1}$: next state (from dataset).
- $a_{t}$: action that hypothetically leads into $s_{t+1}$.
- $s_{t}$: predicted predecessor state.
ROMI also learns a reverse policy to sample actions that likely lead into known states:
$$
\pi_{rev}(a_{t} \mid s_{t+1})
$$
Parameter explanations:
- $\pi_{rev}$: reverse policy distribution.
- $a_{t}$: action sampled for backward trajectory generation.
- $s_{t+1}$: state whose predecessors are being imagined.
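A minimal sketch (not the reference ROMI implementation) of how these two reverse models could be parameterized as Gaussian MLPs in PyTorch; the dimensions and architecture below are illustrative assumptions.

```python
# Sketch: reverse dynamics model p_psi(s_t | s_{t+1}, a_t) and reverse policy
# pi_rev(a_t | s_{t+1}) as diagonal-Gaussian MLPs. Dimensions are illustrative.
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """Outputs the mean and log-std of a diagonal Gaussian over `out_dim`."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, out_dim)
        self.log_std = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

state_dim, action_dim = 17, 6  # illustrative sizes only

# p_psi(s_t | s_{t+1}, a_t): conditions on next state and action, predicts the predecessor state.
reverse_dynamics = GaussianMLP(state_dim + action_dim, state_dim)

# pi_rev(a_t | s_{t+1}): conditions on next state, proposes actions that lead into it.
reverse_policy = GaussianMLP(state_dim, action_dim)

# Both models are trained by maximum likelihood on (s_t, a_t, s_{t+1}) tuples from the offline dataset.
```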
### Reverse Imagination Process
To generate imagined transitions:
1. Select a goal or high-value state $s_{g}$ from the offline dataset.
2. Sample $a_{t}$ from $\pi_{rev}(a_{t} \mid s_{g})$.
3. Predict $s_{t}$ from $p_{\psi}(s_{t} \mid s_{g}, a_{t})$.
4. Form an imagined transition $(s_{t}, a_{t}, s_{g})$.
5. Repeat backward to obtain a longer imagined trajectory (see the sketch at the end of this section).
Benefits:
- Imagined states remain grounded by terminating in real dataset states.
- Helps propagate reward signals backward through states not originally visited.
- Avoids runaway model error that occurs in forward model rollouts.
ROMI effectively fills in missing gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms.
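A minimal sketch of the reverse imagination loop above, assuming trained reverse-policy and reverse-dynamics samplers with the interfaces shown; the samplers here are random stubs so the snippet runs standalone.

```python
# Sketch of ROMI-style backward imagination. Replace the stubs with the
# learned reverse policy and reverse dynamics model.
import numpy as np

state_dim, action_dim = 17, 6  # illustrative dimensions
rng = np.random.default_rng(0)

def sample_reverse_action(next_state):
    """pi_rev(a_t | s_{t+1}) -- stub; in practice, sample from the learned reverse policy."""
    return rng.normal(size=action_dim)

def sample_predecessor_state(next_state, action):
    """p_psi(s_t | s_{t+1}, a_t) -- stub; in practice, sample from the learned reverse dynamics."""
    return next_state + rng.normal(scale=0.1, size=state_dim)

def reverse_rollout(goal_state, horizon=5):
    """Imagine a trajectory that terminates in `goal_state` (a real dataset state)."""
    transitions = []
    s_next = goal_state
    for _ in range(horizon):
        a = sample_reverse_action(s_next)               # step 2: sample a_t
        s_prev = sample_predecessor_state(s_next, a)    # step 3: predict s_t
        transitions.append((s_prev, a, s_next))         # step 4: imagined transition
        s_next = s_prev                                 # step 5: continue backwards
    return transitions[::-1]  # return in forward (chronological) order

# Usage: pick a high-value dataset state and imagine trajectories that reach it.
goal = rng.normal(size=state_dim)
imagined = reverse_rollout(goal, horizon=3)
print(len(imagined), "imagined transitions ending in the goal state")
```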
## Implicit Credit Assignment via Value Factorization Structures
Although initially studied for multi-agent systems, insights from value factorization also improve offline RL by providing structured credit assignment signals.
### Counterfactual Credit Assignment Insight
A factored value function structure of the form:
$$
Q_{tot}(s, a_{1}, \dots, a_{n}) = f(Q_{1}(s, a_{1}), \dots, Q_{n}(s, a_{n}))
$$
can implicitly implement counterfactual credit assignment.
Parameter explanations:
- $Q_{tot}$: global value function.
- $Q_{i}(s,a_{i})$: individual component value for agent or subsystem $i$.
- $f(\cdot)$: mixing function combining components.
- $s$: environment state.
- $a_{i}$: action taken by entity $i$.
In architectures designed for IGM (Individual-Global-Max) consistency, gradients backpropagated through $f$ isolate the marginal effect of each component. This implicitly gives each agent or subsystem a counterfactual advantage signal.
Even in single-agent structured RL, similar factorization structures allow credit to flow into components representing skills, modes, or action groups, enabling better temporal and structural decomposition.
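A minimal sketch, in the spirit of QMIX, of a monotonic (IGM-consistent) mixing function $f$; the hypernetwork architecture and dimensions are illustrative assumptions, not a reference implementation.

```python
# Sketch: mixing per-agent values into Q_tot with state-conditioned
# non-negative weights, so dQ_tot / dQ_i >= 0 (monotonicity, a sufficient
# condition for IGM consistency).
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    def __init__(self, n_agents, state_dim, hidden=32):
        super().__init__()
        self.n_agents, self.hidden = n_agents, hidden
        # Hypernetworks produce the mixing weights from the global state.
        self.w1 = nn.Linear(state_dim, n_agents * hidden)
        self.b1 = nn.Linear(state_dim, hidden)
        self.w2 = nn.Linear(state_dim, hidden)
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        batch = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(batch, self.n_agents, self.hidden)
        b1 = self.b1(state).view(batch, 1, self.hidden)
        h = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(batch, self.hidden, 1)
        b2 = self.b2(state).view(batch, 1, 1)
        return (torch.bmm(h, w2) + b2).view(batch)  # Q_tot per batch element

# Gradients of Q_tot w.r.t. each Q_i pass through non-negative weights,
# giving each agent an implicit counterfactual-style credit signal.
```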
## Model-Based vs Model-Free Offline RL
Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL.
### Forward Model-Based Rollouts
Forward imagination using a learned model:
$$
p_{\theta}(s' \mid s, a)
$$
Parameter explanations:
- $p_{\theta}$: learned forward dynamics model.
- $\theta$: parameters of the forward model.
- $s'$: predicted next state.
- $s$: current state.
- $a$: action taken in current state.
Problems:
- Forward rollouts drift away from dataset support.
- Model error compounds with each step.
- Leads to training instability if used without penalties.
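A toy illustration (not from the lecture) of the compounding-error point: a slightly biased linear model, rolled out open-loop, drifts further from the true trajectory at every step.

```python
# Open-loop forward rollouts reuse the model's own predictions, so a small
# one-step bias accumulates into a large multi-step error.
import numpy as np

true_A = np.array([[0.99, 0.05], [-0.05, 0.99]])   # true dynamics: s' = A s
model_A = true_A + 0.02                            # learned model with a small bias

s_true = s_model = np.array([1.0, 0.0])
for t in range(1, 21):
    s_true = true_A @ s_true                       # true trajectory
    s_model = model_A @ s_model                    # model rollout, open-loop
    if t % 5 == 0:
        print(f"step {t:2d}  error = {np.linalg.norm(s_model - s_true):.3f}")
```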
### Penalty Methods (MOPO, MOReL)
Augmented reward:
$$
r_{model}(s,a) = r(s,a) - \beta u(s,a)
$$
Parameter explanations:
- $r_{model}(s,a)$: penalized reward for model-generated steps.
- $u(s,a)$: uncertainty score of model for state-action pair.
- $\beta$: penalty coefficient.
- $r(s,a)$: original reward.
These methods discourage the learned policy from exploiting model predictions in regions where the model is uncertain.
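A minimal sketch of a MOPO-style penalized reward, assuming ensemble disagreement is used as the uncertainty score $u(s,a)$ (one common choice); the ensemble here is a random stub so the snippet runs standalone.

```python
# Sketch: r_model(s, a) = r(s, a) - beta * u(s, a), with u(s, a) taken to be
# the maximum pairwise disagreement between forward-model ensemble predictions.
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, ensemble_size = 17, 6, 5

def ensemble_predict(state, action):
    """Stub for an ensemble of learned forward models p_theta(s' | s, a)."""
    return [state + rng.normal(scale=0.05, size=state_dim) for _ in range(ensemble_size)]

def penalized_reward(state, action, env_reward, beta=1.0):
    preds = ensemble_predict(state, action)
    u = max(np.linalg.norm(p - q) for p in preds for q in preds)  # disagreement as uncertainty
    return env_reward - beta * u

s = rng.normal(size=state_dim)
a = rng.normal(size=action_dim)
print(penalized_reward(s, a, env_reward=1.0, beta=0.5))
```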
### ROMI vs Forward Rollouts
- Forward methods expand state space beyond dataset.
- ROMI expands *backward*, staying consistent with known good future states.
- ROMI reduces error accumulation because future anchors are real.
## Combining ROMI With Conservative Offline RL
ROMI is typically combined with:
- CQL (Conservative Q-Learning)
- IQL (Implicit Q-Learning)
- BCQ and BEAR (policy constraint methods)
Workflow:
1. Generate imagined transitions via ROMI.
2. Add them to the dataset.
3. Train the Q-function or policy using conservative losses (see the sketch below).
Benefits:
- Better coverage of reward-relevant states.
- Greater policy improvement over the dataset behavior policy.
- More stable Q-learning backups.
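A minimal sketch of the conservative loss term used in step 3, here the CQL regularizer for a discrete-action Q-network trained on the ROMI-augmented batch; the network shapes and the weighting against the TD loss are illustrative assumptions.

```python
# Sketch: CQL-style penalty that pushes down Q-values on out-of-distribution
# actions relative to the actions actually present in the (augmented) dataset.
import torch

def cql_penalty(q_values, dataset_actions):
    """q_values: (batch, n_actions) from the Q-network.
    dataset_actions: (batch,) long tensor of actions in the augmented dataset."""
    logsumexp_q = torch.logsumexp(q_values, dim=1)                         # soft-max over all actions
    data_q = q_values.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)   # Q of dataset actions
    return (logsumexp_q - data_q).mean()

# Total loss (sketch): standard TD loss on the augmented batch plus alpha * cql_penalty.
q_values = torch.randn(64, 4, requires_grad=True)
actions = torch.randint(0, 4, (64,))
loss = cql_penalty(q_values, actions)
loss.backward()
```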
## Summary of Lecture 23
Key points:
- Offline RL can be improved via structured imagination.
- ROMI creates safe imagined transitions by reversing dynamics.
- Reverse imagination avoids pitfalls of forward model error.
- Factored value structures provide implicit counterfactual credit assignment.
- Combining ROMI with conservative learners yields state-of-the-art performance.