updates

content/CSE510/CSE510_L22.md (new file, 50 lines)

@@ -0,0 +1,50 @@
# CSE510 Deep Reinforcement Learning (Lecture 22)

## Offline Reinforcement Learning

### Requirements for Current Successes

- Access to the environment model or a simulator
- Exploration and trial-and-error are not costly

#### Background: Offline RL

- The success of modern machine learning
  - Scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
- Reinforcement learning
  - Online learning paradigm
  - Interaction is expensive & dangerous
    - Healthcare, robotics, recommendation...
- Can we develop data-driven offline RL?

#### Definition of Offline RL

- The policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which was collected by an _unknown behavior policy_ $\pi_\beta$
  - Interaction with the environment is not allowed
- $\mathcal{D}=\{(s_i,a_i,s_i',r_i)\}$
  - $s\sim d^{\pi_\beta}(s)$
  - $a\sim \pi_\beta(a|s)$
  - $s'\sim p(s'|s,a)$
  - $r\gets r(s,a)$
- Objective: $\max_\pi\sum_{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s),\,a_t\sim \pi(a|s_t)}[\gamma^t r(s_t,a_t)]$
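As a concrete illustration, the dataset definition above can be sketched in a toy chain MDP. Everything here (the chain dynamics, the 80/20 behavior policy, the helper names) is an illustrative assumption, not part of the lecture:

```python
import random

N_STATES = 5  # toy chain MDP with states 0..4

def behavior_policy(s):
    # pi_beta(a|s): mostly move right, sometimes left (unknown to the learner)
    return random.choices([+1, -1], weights=[0.8, 0.2])[0]

def step(s, a):
    # p(s'|s,a) and r(s,a): deterministic chain, reward 1 for reaching the end
    s_next = max(0, min(N_STATES - 1, s + a))
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r

random.seed(0)
D = []  # the static dataset D = {(s_i, a_i, s_i', r_i)}
s = 0
for _ in range(1000):
    a = behavior_policy(s)
    s_next, r = step(s, a)
    D.append((s, a, s_next, r))
    s = 0 if s_next == N_STATES - 1 else s_next  # reset at terminal

# The learner only ever sees D; no further interaction is allowed.
print(len(D))
```

The key constraint is the last comment: once `D` is collected, `behavior_policy` and `step` are no longer available to the learner.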

#### Key challenge in Offline RL

**Distribution shift.**

What happens if we simply apply traditional reinforcement learning with bootstrapping?

$$
Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')
$$

$$
\pi(s)=\arg\max_{a\in A} Q(s,a)
$$
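A minimal sketch of what this bootstrapped update looks like when run naively on a static dataset, in the exact tabular case (the toy transitions below are assumptions for illustration):

```python
from collections import defaultdict

GAMMA = 0.9
ACTIONS = [-1, +1]

# A tiny fixed dataset of (s, a, s', r) transitions on a chain of states 0..3.
D = [(0, +1, 1, 0.0), (1, +1, 2, 0.0), (2, +1, 3, 1.0), (1, -1, 0, 0.0)]

Q = defaultdict(float)  # Q(s,a), initialized to 0
for _ in range(100):  # sweep the static dataset repeatedly
    for s, a, s_next, r in D:
        # Q(s,a) <- r(s,a) + gamma * max_a' Q(s',a')
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)

# pi(s) = argmax_a Q(s,a)
pi = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)}
print(pi)
```

Note that the `max` over `b` queries Q at state-action pairs that may never appear in `D`; here they harmlessly stay at their initial value 0, but with function approximation those unseen values are arbitrary.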

But notice that the state-action distribution induced by the behavior policy differs from that induced by the learned policy:

$$
P_{\pi_\beta}(s,a)\neq P_{\pi}(s,a)
$$

so the bootstrapped $\max_{a'\in A}$ evaluates $Q$ at out-of-distribution actions that the static dataset never covers, and those errors cannot be corrected by further interaction.
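A small sketch of why this mismatch bites: if the learned Q has errors at state-action pairs the behavior policy never visited (as any function approximator will), the bootstrap's $\max_{a'}$ routinely picks exactly those pairs. The dataset and the random-error model below are illustrative assumptions:

```python
import random
from collections import defaultdict

random.seed(0)
ACTIONS = list(range(4))

# The behavior policy only ever took actions 0 or 1; actions 2 and 3 are unseen.
D = [(s, random.choice([0, 1]), (s + 1) % 10, 0.0)
     for s in range(10) for _ in range(5)]
seen = {(s, a) for s, a, _, _ in D}

# A "learned" Q with small random errors at every (s, a), as function
# approximation would produce -- including at pairs never in the data.
Q = defaultdict(lambda: random.gauss(0.0, 1.0))

# Count bootstrap targets whose argmax action was never taken in s' by pi_beta.
ood = sum(
    1 for _, _, s_next, _ in D
    if (s_next, max(ACTIONS, key=lambda b: Q[(s_next, b)])) not in seen
)
print(f"{ood}/{len(D)} bootstrap targets use an out-of-distribution action")
```

Since the argmax is roughly uniform over the four actions while the data covers at most two per state, a large fraction of targets rest on Q-values the dataset can never correct.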
@@ -24,4 +24,5 @@ export default {
  CSE510_L19: "CSE510 Deep Reinforcement Learning (Lecture 19)",
  CSE510_L20: "CSE510 Deep Reinforcement Learning (Lecture 20)",
  CSE510_L21: "CSE510 Deep Reinforcement Learning (Lecture 21)",
  CSE510_L22: "CSE510 Deep Reinforcement Learning (Lecture 22)",
}