# CSE510 Deep Reinforcement Learning (Lecture 22)
## Offline Reinforcement Learning
### Requirements for Current Successes
- Access to the Environment Model or Simulator
- Exploration and trial-and-error are not costly
#### Background: Offline RL
- The success of modern machine learning
  - Scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
- Reinforcement learning
  - Online learning paradigm
  - Interaction is expensive & dangerous
    - Healthcare, Robotics, Recommendation...
- Can we develop data-driven offline RL?
#### Definition of Offline RL
- The policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which is collected by an _unknown behavior policy_ $\pi_\beta$
  - Interaction with the environment is not allowed
- $\mathcal{D}=\{(s_i,a_i,s_i',r_i)\}$
  - $s\sim d^{\pi_\beta}(s)$
  - $a\sim \pi_\beta(a|s)$
  - $s'\sim p(s'|s,a)$
  - $r\gets r(s,a)$
- Objective: $\max_\pi\sum_{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s),\,a_t\sim \pi(a|s_t)}[\gamma^t r(s_t,a_t)]$
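To make the setup concrete, here is a minimal Python sketch of such a static dataset. The `Transition` class and `sample_batch` helper are hypothetical names used only for illustration, not from the lecture; the key point is that the learner may resample batches from $\mathcal{D}$ but never queries the environment for new experience.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class Transition:
    """One element of D = {(s_i, a_i, s'_i, r_i)} (illustrative field names)."""
    s: np.ndarray       # s  ~ d^{pi_beta}(s)
    a: int              # a  ~ pi_beta(a | s)
    s_next: np.ndarray  # s' ~ p(s' | s, a)
    r: float            # r  = r(s, a)

def sample_batch(dataset: list[Transition], batch_size: int,
                 rng: np.random.Generator) -> list[Transition]:
    """Offline constraint: we may resample from the frozen dataset,
    but we never call env.step() to gather new transitions."""
    idx = rng.integers(len(dataset), size=batch_size)
    return [dataset[i] for i in idx]
```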
#### Key Challenge in Offline RL
Distribution Shift
What about using traditional reinforcement learning with bootstrapping?
$$
Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')
$$
$$
\pi(s)=\arg\max_{a\in A} Q(s,a)
$$
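Below is a minimal tabular sketch of this naive bootstrapped approach applied to a fixed dataset. The dataset is synthetic and all sizes are made-up assumptions for illustration; in particular, the behavior policy is assumed to have tried only a subset of the actions, as is typical of logged data.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma, alpha = 20, 4, 0.99, 0.1

# Synthetic logged dataset: pi_beta only ever tried actions {0, 1},
# so actions 2 and 3 never appear in the data.
N  = 5_000
s  = rng.integers(num_states, size=N)
a  = rng.integers(2, size=N)                  # narrow pi_beta coverage
s2 = rng.integers(num_states, size=N)
r  = rng.normal(size=N)

Q = np.zeros((num_states, num_actions))
for _ in range(50):                           # repeated sweeps over the fixed dataset
    for i in range(N):
        # Bellman backup: Q(s,a) <- r(s,a) + gamma * max_a' Q(s',a')
        target = r[i] + gamma * Q[s2[i]].max()
        Q[s[i], a[i]] += alpha * (target - Q[s[i], a[i]])

pi = Q.argmax(axis=1)                         # pi(s) = argmax_a Q(s,a)
```

Note that the update only touches $(s,a)$ pairs that actually occur in the dataset, while the $\max_{a'}$ inside the target is free to evaluate any action.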
But notice that the state-action distribution induced by the behavior policy is not the one induced by the learned policy:
$$
P_{\pi_\beta}(s,a)\neq P_{\pi}(s,a)
$$
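Continuing the tabular sketch above, we can count how often the bootstrapped target $\max_{a'}Q(s',a')$ is evaluated at a state-action pair the behavior policy never visited; those Q-values are never corrected by real transitions, so any error in them is copied straight into the targets. This diagnostic is an illustrative addition, not part of the lecture.

```python
# Which (s, a) pairs does the dataset actually cover?
seen = np.zeros((num_states, num_actions), dtype=bool)
seen[s, a] = True

# The action selected by max_a' Q(s', a') in each backup...
greedy_next = Q[s2].argmax(axis=1)
# ...may be one that pi_beta never took at that state (here, any a' in {2, 3}
# is guaranteed to be out-of-distribution, with Q left at its initial value).
ood = ~seen[s2, greedy_next]
print(f"{ood.mean():.1%} of backups bootstrap from (s', a') pairs absent from D")
```

This mismatch between the logged distribution and the distribution induced by the greedy learned policy is exactly the distribution shift described above.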