There are two primary families of solutions:

1. **Policy constraint methods**
2. **Conservative value estimation methods**

---

## 1. Policy Constraint Methods

These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.
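
One simple way to implement such a constraint, in the style of TD3+BC (shown here as an illustration, not necessarily the method covered in the lecture), is to add a behavior-cloning term to the actor objective. A minimal sketch, where `actor`, `critic`, and the batch tensors `states` and `dataset_actions` are assumed to exist:

```python
import torch
import torch.nn.functional as F

def constrained_actor_loss(actor, critic, states, dataset_actions, alpha=2.5):
    """Maximize Q-values while keeping the policy near the dataset actions.

    The MSE term is the policy constraint: it penalizes actions the
    behavior policy (i.e., the dataset) does not support.
    """
    pi_actions = actor(states)                    # actions proposed by the learned policy
    q_values = critic(states, pi_actions)         # critic's estimate of their value
    lam = alpha / q_values.abs().mean().detach()  # normalize the RL term (TD3+BC trick)
    bc_term = F.mse_loss(pi_actions, dataset_actions)
    return -lam * q_values.mean() + bc_term
```

The single hyperparameter `alpha` trades off return maximization against staying on-support; larger values weight the Q-term more heavily.
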
BEAR controls distribution shift more tightly than BCQ.
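
BEAR enforces its constraint with a sampled maximum mean discrepancy (MMD) between policy actions and dataset actions. A minimal sketch of a Gaussian-kernel squared MMD; the bandwidth `sigma` is an illustrative choice, not a value from the lecture:

```python
import torch

def mmd_squared(policy_actions, data_actions, sigma=10.0):
    """Gaussian-kernel MMD^2 between two batches of actions.

    BEAR constrains this quantity so the learned policy's action
    distribution stays close to the behavior policy's.
    """
    k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (
        k(policy_actions, policy_actions).mean()
        + k(data_actions, data_actions).mean()
        - 2 * k(policy_actions, data_actions).mean()
    )
```
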
---

## 2. Conservative Value Function Methods

These methods modify Q-learning so Q-values of unseen actions are *underestimated*, preventing the policy from exploiting overestimated values.
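
A canonical method in this family is Conservative Q-Learning (CQL), which pushes down Q-values across all actions while holding up the Q-values of dataset actions. A minimal sketch of the discrete-action penalty term, assuming a hypothetical `q_net` that maps a state batch to per-action Q-values (an illustrative reconstruction, not necessarily the lecture's formulation):

```python
import torch

def cql_penalty(q_net, states, dataset_actions, alpha=1.0):
    """CQL regularizer for discrete actions, added to the usual TD loss.

    dataset_actions: [batch] long tensor of action indices.
    logsumexp over actions is a soft maximum: penalizing it pushes down
    Q-values of all (including unseen) actions, while the second term
    keeps Q-values of dataset actions up.
    """
    q_all = q_net(states)                        # [batch, num_actions]
    pushed_down = torch.logsumexp(q_all, dim=1)  # soft max over all actions
    pushed_up = q_all.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)
    return alpha * (pushed_down - pushed_up).mean()
```
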
IQL (Implicit Q-Learning) often achieves state-of-the-art performance due to its simplicity and stability.
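
IQL's key component is expectile regression: the state-value function is trained toward an upper expectile of the Q-values, approximating a maximum over dataset actions without ever evaluating out-of-distribution ones. A minimal sketch of the expectile loss; the value `tau = 0.7` is illustrative:

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric squared loss used by IQL to train V(s).

    With tau > 0.5, positive errors (Q above V) are weighted more
    heavily, so V regresses toward an upper expectile of Q.
    """
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())  # tau if diff >= 0, else 1 - tau
    return (weight * diff ** 2).mean()
```
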
---

## Model-Based Offline RL

### Forward Model-Based RL

These methods limit rollouts into regions of the state space where the learned model is unreliable.
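
MOPO, for instance, subtracts a model-uncertainty penalty from imagined rewards so that rollouts into poorly modeled states look unattractive. A minimal sketch, assuming next-state predictions from a hypothetical ensemble of dynamics models stacked into one tensor; the disagreement measure and `lam` are illustrative:

```python
import torch

def penalized_rewards(ensemble_next_states, rewards, lam=1.0):
    """MOPO-style reward penalty based on ensemble disagreement.

    ensemble_next_states: [n_models, batch, state_dim] predictions.
    High disagreement between models signals an unknown region, so the
    imagined reward there is reduced.
    """
    disagreement = ensemble_next_states.std(dim=0).max(dim=-1).values  # [batch]
    return rewards - lam * disagreement
```
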
---

## Reverse Model-Based Imagination (ROMI)

ROMI generates new training data by *backward* imagination: instead of rolling a forward model out from dataset states, it uses a learned reverse model to imagine transitions that lead *into* states already in the dataset.
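
A minimal sketch of the idea, assuming a hypothetical `reverse_model` that, given a batch of states, predicts a plausible predecessor state, action, and reward (the interface is illustrative, not ROMI's actual code):

```python
def reverse_rollout(reverse_model, dataset_states, horizon=5):
    """Generate imagined transitions that end inside the data support.

    Starting from real dataset states, repeatedly ask the reverse model
    what (state, action, reward) could have led to the current state.
    """
    transitions = []
    s_next = dataset_states
    for _ in range(horizon):
        s, a, r = reverse_model(s_next)        # plausible predecessor transition
        transitions.append((s, a, r, s_next))  # standard (s, a, r, s') tuple
        s_next = s                             # step further back in time
    return transitions  # usable by any offline RL method's replay buffer
```

Because every imagined trajectory terminates at a real dataset state, the generated transitions stay anchored to the data support.
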
ROMI combined with conservative RL often outperforms standard offline methods.

---

## Summary of Lecture 22

Offline RL requires balancing policy improvement against distributional shift away from the fixed dataset.

Three major families of solutions:

1. Policy constraints (e.g., BCQ, BEAR)
2. Conservative value estimation (e.g., CQL, IQL)
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)

Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.

---

## Recommended Screenshot Frames for Lecture 22

- Lecture 22, page 7: Offline RL diagram showing policy learning from a fixed dataset, subsection "Offline RL Setting".
- Lecture 22, page 35: Illustration of dataset support vs. policy action distribution, subsection "Strategies for Safe Offline RL".

---

**End of CSE510_L22.md**