# CSE510 Deep Reinforcement Learning (Lecture 22)

## Offline Reinforcement Learning

### Requirements for Current Successes

- Access to the environment model or a simulator
- Exploration / trial-and-error is not costly

#### Background: Offline RL

- The success of modern machine learning
  - Scalable, data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
- Reinforcement learning
  - Online learning paradigm
  - Interaction is expensive & dangerous
    - Healthcare, robotics, recommendation...
- Can we develop data-driven offline RL?

#### Definition of Offline RL

- The policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which was collected by an _unknown behavior policy_ $\pi_\beta$
- Interaction with the environment is not allowed
- $\mathcal{D}=\{(s_i, a_i, s_i', r_i)\}$
  - $s \sim d^{\pi_\beta}(s)$
  - $a \sim \pi_\beta(a \mid s)$
  - $s' \sim p(s' \mid s, a)$
  - $r \gets r(s, a)$
- Objective: $\max_\pi \sum_{t=0}^{T} \mathbb{E}_{s_t \sim d^\pi(s),\, a_t \sim \pi(a \mid s)}\left[\gamma^t r(s_t, a_t)\right]$

#### Key Challenge in Offline RL: Distribution Shift

What if we simply apply traditional reinforcement learning with bootstrapping?

$$
Q(s,a) = r(s,a) + \gamma \max_{a' \in A} Q(s', a')
$$

$$
\pi(s) = \arg\max_{a \in A} Q(s, a)
$$

Notice, however, that the state-action distribution induced by the behavior policy differs from that induced by the learned policy $\pi$:

$$
P_{\pi_\beta}(s, a) \neq P_{\pi}(s, a)
$$
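To make the distribution-shift problem concrete, here is a minimal sketch (a toy example assumed for illustration, not part of the lecture): a static dataset $\mathcal{D}$ is logged by a behavior policy $\pi_\beta$ that never takes one of the actions, and the naive bootstrapped Q-update is then run on that fixed dataset. The environment, dataset size, and hyperparameters are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

def behavior_policy(s):
    # pi_beta is unknown to the learner; here it only ever chooses actions 0 or 1,
    # so action 2 never appears in the dataset (it is out-of-distribution).
    return int(rng.integers(0, 2))

def env_step(s, a):
    # Hypothetical dynamics p(s'|s,a) and reward r(s,a); used only to *log* D,
    # never queried during learning.
    s_next = (s + a + 1) % n_states
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

# Static dataset D = {(s_i, a_i, r_i, s'_i)} collected once by pi_beta.
D, s = [], 0
for _ in range(2000):
    a = behavior_policy(s)
    s_next, r = env_step(s, a)
    D.append((s, a, r, s_next))
    s = s_next

# Naive offline Q-learning: repeatedly sweep the fixed dataset and apply the
# bootstrapped backup Q(s,a) <- r + gamma * max_a' Q(s',a').
gamma, alpha = 0.99, 0.1
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    for (s, a, r, s_next) in D:
        # The max ranges over ALL actions, including action 2, which pi_beta
        # never took. Its Q-values are never corrected by data, so whatever
        # value they hold feeds directly into the bootstrapped targets.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

# Greedy policy pi(s) = argmax_a Q(s,a): it may select out-of-distribution
# actions whose values are pure extrapolation, and offline there is no way
# to observe and correct the resulting errors.
pi = Q.argmax(axis=1)
print(pi)
```

In this tabular sketch the unvisited action simply keeps its initial value, but with neural function approximation the same backup extrapolates to actions never seen in $\mathcal{D}$; that is exactly where the mismatch $P_{\pi_\beta}(s, a) \neq P_{\pi}(s, a)$ becomes harmful.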