CSE510 Deep Reinforcement Learning (Lecture 22)

Offline Reinforcement Learning

Requirements for Current Successes

  • Access to the Environment Model or Simulator
  • Exploration and Trial-and-Error Are Not Costly

Background: Offline RL

  • The success of modern machine learning
    • Scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
  • Reinforcement learning
    • Online learning paradigm
    • Interaction is expensive & dangerous
      • Healthcare, Robotics, Recommendation...
  • Can we develop data-driven offline RL?

Definition in Offline RL

  • The policy \pi_k is updated using a static dataset \mathcal{D}, which is collected by an unknown behavior policy \pi_\beta

  • Interaction is not allowed

  • \mathcal{D}=\{(s_i,a_i,s_i',r_i)\}

  • s\sim d^{\pi_\beta}(s)

  • a\sim \pi_\beta (a|s)

  • s'\sim p(s'|s,a)

  • r\gets r(s,a)

  • Objective: \max_\pi\sum_{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s_t),\,a_t\sim \pi(a_t|s_t)}[\gamma^t r(s_t,a_t)]
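
A minimal sketch of this setup, assuming a hypothetical dataset layout with randomly generated placeholder data: the learner only ever samples from a fixed buffer of (s, a, s', r) transitions collected by \pi_\beta and never queries the environment.

```python
import numpy as np

# Hypothetical static dataset D = {(s_i, a_i, s'_i, r_i)} collected by an
# unknown behavior policy pi_beta; the sizes and random contents are
# placeholders, not real data.
rng = np.random.default_rng(0)
num_transitions, state_dim, num_actions = 10_000, 4, 3

dataset = {
    "states":      rng.normal(size=(num_transitions, state_dim)),    # s  ~ d^{pi_beta}(s)
    "actions":     rng.integers(num_actions, size=num_transitions),  # a  ~ pi_beta(a|s)
    "next_states": rng.normal(size=(num_transitions, state_dim)),    # s' ~ p(s'|s,a)
    "rewards":     rng.normal(size=num_transitions),                 # r  = r(s,a)
}

def sample_batch(batch_size=256):
    """Offline learners only sample from the fixed dataset D;
    no new environment interaction is allowed."""
    idx = rng.integers(num_transitions, size=batch_size)
    return {k: v[idx] for k, v in dataset.items()}
```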

Key challenge in Offline RL

Distribution Shift

What about using traditional reinforcement learning with bootstrapping?


Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')

\pi(s)=\arg\max_{a\in A} Q(s,a)

but notice that the state-action distribution induced by the behavior policy \pi_\beta differs from that induced by the learned policy \pi_f:


P_{\pi_\beta}(s,a)\neq P_{\pi_f}(s,a)

so the bootstrapped target \max_{a'}Q(s',a') is evaluated on state-action pairs that may never appear in \mathcal{D}, and these errors cannot be corrected by further interaction.
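
A minimal sketch of what goes wrong, using a tabular Q-function with hypothetical sizes and randomly generated placeholder transitions: the standard bootstrapped update is run on the fixed dataset, and the max over a' in the target is free to pick actions the behavior policy never took in s'.

```python
import numpy as np

# Naive offline Q-learning (tabular sketch, placeholder data): the standard
# bootstrapped update applied to a fixed dataset with no further interaction.
rng = np.random.default_rng(0)
num_states, num_actions, gamma, alpha = 20, 4, 0.99, 0.1

# Static dataset of (s, a, s', r) transitions from an unknown pi_beta.
S  = rng.integers(num_states,  size=5_000)
A  = rng.integers(num_actions, size=5_000)
S2 = rng.integers(num_states,  size=5_000)
R  = rng.normal(size=5_000)

Q = np.zeros((num_states, num_actions))
for _ in range(50):
    for s, a, s2, r in zip(S, A, S2, R):
        # Bootstrapped target: max_{a'} Q(s', a') may select an action a'
        # that pi_beta never took in s', i.e. an out-of-distribution
        # (s', a') pair whose value is never grounded in data.
        target = r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])

# Greedy extraction: pi(s) = argmax_a Q(s, a). The argmax chases the
# (possibly overestimated) out-of-distribution values, and without new
# interaction those errors are never corrected: this is the
# distribution-shift problem highlighted above.
pi = Q.argmax(axis=1)
```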