# CSE510 Deep Reinforcement Learning (Lecture 23)

> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.

## Offline Reinforcement Learning Part II: Advanced Approaches

Lecture 23 continues with advanced topics in offline RL, expanding on model-based imagination methods and credit assignment structures relevant for offline multi-agent and single-agent settings.

## Reverse Model-Based Imagination (ROMI)

ROMI is a method for augmenting an offline dataset with additional transitions generated by imagining trajectories *backwards* from desirable states. Unlike forward model rollouts, backward imagination stays grounded in real data because imagined transitions always terminate in dataset states.

### Reverse Dynamics Model

ROMI learns a reverse dynamics model:

$$
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
$$

Parameter explanations:

- $p_{\psi}$: learned reverse transition model.
- $\psi$: parameter vector for the reverse model.
- $s_{t+1}$: next state (from the dataset).
- $a_{t}$: action that hypothetically leads into $s_{t+1}$.
- $s_{t}$: predicted predecessor state.

ROMI also learns a reverse policy to sample actions that likely lead into known states:

$$
\pi_{rev}(a_{t} \mid s_{t+1})
$$

Parameter explanations:

- $\pi_{rev}$: reverse policy distribution.
- $a_{t}$: action sampled for backward trajectory generation.
- $s_{t+1}$: state whose predecessors are being imagined.

### Reverse Imagination Process

To generate imagined transitions:

1. Select a goal or high-value state $s_{g}$ from the offline dataset.
2. Sample $a_{t}$ from $\pi_{rev}(a_{t} \mid s_{g})$.
3. Predict $s_{t}$ from $p_{\psi}(s_{t} \mid s_{g}, a_{t})$.
4. Form an imagined transition $(s_{t}, a_{t}, s_{g})$.
5. Repeat backward to obtain a longer imagined trajectory.

Benefits:

- Imagined states remain grounded by terminating in real dataset states.
- Helps propagate reward signals backward through states not originally visited.
- Avoids the runaway model error that occurs in forward model rollouts.

ROMI effectively fills in gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms.

## Implicit Credit Assignment via Value Factorization Structures

Although initially studied for multi-agent systems, insights from value factorization also improve offline RL by providing structured credit assignment signals.

### Counterfactual Credit Assignment Insight

A factored value function structure of the form:

$$
Q_{tot}(s, a_{1}, \dots, a_{n}) = f(Q_{1}(s, a_{1}), \dots, Q_{n}(s, a_{n}))
$$

can implicitly implement counterfactual credit assignment.

Parameter explanations:

- $Q_{tot}$: global value function.
- $Q_{i}(s, a_{i})$: individual component value for agent or subsystem $i$.
- $f(\cdot)$: mixing function combining the components.
- $s$: environment state.
- $a_{i}$: action taken by entity $i$.

In architectures designed for IGM (Individual-Global-Max) consistency, gradients backpropagated through $f$ isolate the marginal effect of each component. This implicitly gives each agent or subsystem a counterfactual advantage signal.

Even in single-agent structured RL, similar factorization structures allow credit to flow into components representing skills, modes, or action groups, enabling better temporal and structural decomposition.

## Model-Based vs Model-Free Offline RL

Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL.
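As a concrete reference point for that comparison, the following is a minimal sketch of the reverse imagination loop described above. It assumes small deterministic PyTorch networks in place of ROMI's probabilistic models; the dimensions, horizon, and names (`ReversePolicy`, `ReverseDynamics`, `reverse_rollout`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of ROMI-style reverse imagination (illustrative, not the paper's code).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 2  # toy sizes for the sketch

class ReversePolicy(nn.Module):
    """pi_rev(a_t | s_{t+1}): proposes actions that plausibly lead into s_{t+1}."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM))
    def forward(self, next_state):
        return torch.tanh(self.net(next_state))  # deterministic mean for simplicity

class ReverseDynamics(nn.Module):
    """p_psi(s_t | s_{t+1}, a_t): predicts the predecessor state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, STATE_DIM))
    def forward(self, next_state, action):
        return self.net(torch.cat([next_state, action], dim=-1))

def reverse_rollout(goal_state, pi_rev, p_psi, horizon=5):
    """Imagine a backward trajectory that terminates in a real dataset state."""
    transitions, s_next = [], goal_state
    for _ in range(horizon):
        a = pi_rev(s_next)                       # step 2: sample a_t from pi_rev
        s_prev = p_psi(s_next, a)                # step 3: predict predecessor s_t
        transitions.append((s_prev, a, s_next))  # step 4: imagined transition
        s_next = s_prev                          # step 5: continue backward
    return transitions[::-1]  # return in forward (time-ordered) direction

# Usage: the rollout is anchored on a state that would come from the offline dataset.
s_goal = torch.randn(1, STATE_DIM)  # stand-in for a real high-value dataset state
imagined = reverse_rollout(s_goal, ReversePolicy(), ReverseDynamics())
print(len(imagined), "imagined transitions ending in the dataset state")
```

In practice the goal states would be drawn from high-value dataset states rather than sampled randomly, and each imagined transition would also need a reward estimate before being added to the dataset.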
### Forward Model-Based Rollouts

Forward imagination uses a learned model:

$$
p_{\theta}(s' \mid s, a)
$$

Parameter explanations:

- $p_{\theta}$: learned forward dynamics model.
- $\theta$: parameters of the forward model.
- $s'$: predicted next state.
- $s$: current state.
- $a$: action taken in the current state.

Problems:

- Forward rollouts drift away from dataset support.
- Model error compounds with each step.
- Leads to training instability if used without penalties.

### Penalty Methods (MOPO, MOReL)

Augmented reward:

$$
r_{model}(s, a) = r(s, a) - \beta u(s, a)
$$

Parameter explanations:

- $r_{model}(s, a)$: penalized reward for model-generated steps.
- $u(s, a)$: uncertainty score of the model for the state-action pair.
- $\beta$: penalty coefficient.
- $r(s, a)$: original reward.

These methods limit exploration into uncertain model regions.

### ROMI vs Forward Rollouts

- Forward methods expand the state space beyond the dataset.
- ROMI expands *backward*, staying consistent with known good future states.
- ROMI reduces error accumulation because the future anchors are real.

## Combining ROMI With Conservative Offline RL

ROMI is typically combined with:

- CQL (Conservative Q-Learning)
- IQL (Implicit Q-Learning)
- BCQ and BEAR (policy constraint methods)

Workflow (see the sketch at the end of this note):

1. Generate imagined transitions via ROMI.
2. Add them to the dataset.
3. Train the Q-function or policy using conservative losses.

Benefits:

- Better coverage of reward-relevant states.
- Increased policy improvement over the dataset.
- More stable Q-learning backups.

## Summary of Lecture 23

Key points:

- Offline RL can be improved via structured imagination.
- ROMI creates safe imagined transitions by reversing dynamics.
- Reverse imagination avoids the pitfalls of forward model error.
- Factored value structures provide implicit counterfactual credit assignment.
- Combining ROMI with conservative learners yields state-of-the-art performance.
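To make the combination workflow above concrete, here is a minimal sketch of a conservative Q-update applied to a batch that mixes real and ROMI-imagined transitions. It assumes a discrete-action Q-network and a CQL-style logsumexp penalty; the network sizes, the coefficient `ALPHA`, and the toy batch are illustrative assumptions rather than a specific published implementation.

```python
# Minimal sketch: conservative Q-learning step on a ROMI-augmented batch (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA, ALPHA = 4, 3, 0.99, 1.0  # toy sizes and coefficients

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def conservative_q_update(batch):
    """One gradient step on real + ROMI-imagined transitions with a CQL-style penalty."""
    s, a, r, s_next = batch                      # steps 1-2: batch from the augmented dataset
    q_all = q_net(s)                             # Q(s, .) for every action
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                        # simple bootstrap target (no target network)
        target = r + GAMMA * q_net(s_next).max(dim=1).values
    bellman_loss = F.mse_loss(q_taken, target)
    # Conservative term: push down values of out-of-dataset actions, keep dataset actions up.
    cql_penalty = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    loss = bellman_loss + ALPHA * cql_penalty    # step 3: conservative loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a toy batch standing in for a mix of real and imagined transitions.
batch = (torch.randn(32, STATE_DIM), torch.randint(0, N_ACTIONS, (32,)),
         torch.randn(32), torch.randn(32, STATE_DIM))
print(conservative_q_update(batch))
```

The same update applies unchanged whether a batch element comes from the original dataset or from reverse imagination, which is what makes ROMI a drop-in data-augmentation step for conservative offline learners.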