Trance-0
2025-11-18 14:08:20 -06:00
parent 9416bd4956
commit 2946feefbe
4 changed files with 18 additions and 72 deletions


@@ -105,9 +105,7 @@ There are two primary families of solutions:
1. **Policy constraint methods**
2. **Conservative value estimation methods**
---
# 1. Policy Constraint Methods
## 1. Policy Constraint Methods
These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.
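As a rough illustration of the policy-constraint idea (a sketch, not the lecture's exact algorithm), the actor can maximize the critic's value while being penalized for deviating from dataset actions; the helper names below are hypothetical, PyTorch and continuous actions are assumed:

```python
import torch

def constrained_actor_loss(actor, critic, states, dataset_actions, lam=2.5):
    """Generic policy-constraint sketch (TD3+BC-style): maximize Q(s, pi(s))
    while penalizing deviation from actions actually seen in the dataset."""
    pi_actions = actor(states)                              # actions proposed by the learned policy
    q_values = critic(states, pi_actions)                   # critic's evaluation of those actions
    bc_term = ((pi_actions - dataset_actions) ** 2).mean()  # behavior-cloning penalty
    scale = lam / q_values.abs().mean().detach()            # keeps the two terms on a comparable scale
    return -scale * q_values.mean() + bc_term
```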
@@ -163,9 +161,7 @@ Parameter explanations:
BEAR controls distribution shift more tightly than BCQ.
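For concreteness, BEAR's support constraint is typically enforced with a kernel MMD between policy actions and dataset actions; a minimal sketch of such an estimator (Gaussian kernel, illustrative bandwidth) might look like:

```python
import torch

def mmd_gaussian(policy_actions, data_actions, sigma=20.0):
    """Kernel MMD between two action batches: the quantity BEAR constrains to
    keep the learned policy inside the support of the behavior policy."""
    def kernel(a, b):
        sq_dist = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(-1)  # pairwise squared distances
        return torch.exp(-sq_dist / (2 * sigma ** 2))
    return (kernel(policy_actions, policy_actions).mean()
            + kernel(data_actions, data_actions).mean()
            - 2 * kernel(policy_actions, data_actions).mean())
```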
---
# 2. Conservative Value Function Methods
## 2. Conservative Value Function Methods
These methods modify Q-learning so Q-values of unseen actions are *underestimated*, preventing the policy from exploiting overestimated values.
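A minimal sketch of a CQL-style regularizer, assuming a PyTorch critic and actions normalized to [-1, 1] (helper names are illustrative):

```python
import torch

def cql_penalty(critic, states, dataset_actions, num_random=10):
    """CQL-style regularizer sketch: push Q down on sampled (possibly unseen)
    actions and up on dataset actions, so off-support actions end up underestimated."""
    batch, act_dim = dataset_actions.shape
    rand_actions = torch.empty(batch * num_random, act_dim).uniform_(-1.0, 1.0)
    rep_states = states.repeat_interleave(num_random, dim=0)
    q_rand = critic(rep_states, rand_actions).reshape(batch, num_random)
    q_data = critic(states, dataset_actions).reshape(batch)
    # logsumexp acts as a soft maximum of Q over the sampled actions.
    return (torch.logsumexp(q_rand, dim=1) - q_data).mean()
```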
@@ -213,9 +209,7 @@ Key idea:
IQL often achieves state-of-the-art performance due to its simplicity and stability.
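IQL's central trick, expectile regression on dataset actions only, is compact enough to sketch directly (PyTorch assumed):

```python
import torch

def expectile_loss(value_pred, q_target, tau=0.7):
    """IQL-style expectile regression sketch: with tau > 0.5 the value function
    tracks an upper expectile of Q over dataset actions, so training never
    queries out-of-distribution actions."""
    diff = q_target - value_pred
    weight = torch.abs(tau - (diff < 0).float())  # tau when diff >= 0, (1 - tau) otherwise
    return (weight * diff ** 2).mean()
```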
---
# Model-Based Offline RL
## Model-Based Offline RL
### Forward Model-Based RL
@@ -248,9 +242,7 @@ Parameter explanations:
These methods penalize or truncate rollouts that enter unknown regions of the learned model, limiting exploration where the model is unreliable.
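A MOPO-style version of this idea penalizes imagined rewards by the dynamics model's uncertainty; the sketch below uses ensemble disagreement as that uncertainty estimate (names and tensor shapes are assumptions):

```python
import torch

def penalized_reward(predicted_reward, next_state_ensemble, lam=1.0):
    """MOPO-style reward penalty sketch: subtract an uncertainty estimate so
    rollouts are discouraged in regions the dataset does not cover.
    next_state_ensemble: (ensemble_size, batch, state_dim) model predictions."""
    disagreement = next_state_ensemble.std(dim=0).norm(dim=-1)  # per-transition uncertainty
    return predicted_reward - lam * disagreement
```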
---
# Reverse Model-Based Imagination (ROMI)
## Reverse Model-Based Imagination (ROMI)
ROMI generates new training data by *backward* imagination: a reverse dynamics model imagines trajectories that roll backward from states in the dataset, so every imagined transition leads into supported states.
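A rough sketch of backward imagination, assuming a learned reverse dynamics model and reverse policy (all names are illustrative):

```python
def reverse_rollout(reverse_model, reverse_policy, dataset_states, horizon=5):
    """ROMI-style backward imagination sketch: starting from states that appear
    in the dataset, a reverse policy proposes actions and a reverse dynamics
    model predicts predecessor states, so imagined trajectories lead back into
    the data support."""
    transitions = []
    state = dataset_states
    for _ in range(horizon):
        action = reverse_policy(state)             # action believed to lead *into* `state`
        prev_state = reverse_model(state, action)  # predicted predecessor state
        transitions.append((prev_state, action, state))  # stored as a forward transition
        state = prev_state
    return transitions
```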
@@ -288,8 +280,6 @@ Benefits:
ROMI combined with conservative RL often outperforms standard offline methods.
---
# Summary of Lecture 22
Offline RL requires balancing:
@@ -304,14 +294,3 @@ Three major families of solutions:
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)
Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.
---
# Recommended Screenshot Frames for Lecture 22
- Lecture 22, page 7: Offline RL diagram showing policy learning from a fixed dataset, subsection "Offline RL Setting".
- Lecture 22, page 35: Illustration of dataset support vs policy action distribution, subsection "Strategies for Safe Offline RL".
---
**End of CSE510_L22.md**