Trance-0
2025-11-18 14:08:20 -06:00
parent 9416bd4956
commit 2946feefbe
4 changed files with 18 additions and 72 deletions

View File

@@ -105,9 +105,7 @@ There are two primary families of solutions:
1. **Policy constraint methods**
2. **Conservative value estimation methods**
----
-# 1. Policy Constraint Methods
+## 1. Policy Constraint Methods
These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.
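To make the constraint concrete, here is a minimal sketch of a behavior-cloning-regularized actor update in the style of TD3+BC; the network sizes, the critic `q_net`, and the batch tensors are illustrative assumptions, not code from the lecture.

```python
import torch
import torch.nn as nn

# Minimal sketch: one actor update with a behavior-cloning constraint (TD3+BC style).
# Assumes continuous actions in [-1, 1]; q_net and the batch tensors are placeholders.
state_dim, action_dim, batch_size = 17, 6, 256

actor = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                      nn.Linear(256, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                      nn.Linear(256, 1))
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

states = torch.randn(batch_size, state_dim)                   # s from the fixed dataset
dataset_actions = torch.rand(batch_size, action_dim) * 2 - 1  # a from the behavior policy

pi = actor(states)                                  # candidate actions pi(s)
q = q_net(torch.cat([states, pi], dim=-1))          # critic's value of pi(s)
lam = 2.5 / q.abs().mean().detach()                 # scale so both terms are comparable
# Maximize Q while staying close to dataset actions (the policy constraint).
loss = -(lam * q).mean() + ((pi - dataset_actions) ** 2).mean()

opt.zero_grad()
loss.backward()
opt.step()
```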
@@ -163,9 +161,7 @@ Parameter explanations:
BEAR controls distribution shift more tightly than BCQ.
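As a rough illustration of the kind of divergence measure BEAR uses, the sketch below estimates a Gaussian-kernel MMD between policy actions and dataset actions; the kernel bandwidth and the batch shapes are arbitrary assumptions.

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate with a Gaussian kernel between two action batches."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() - 2 * kernel(x, y).mean() + kernel(y, y).mean()

policy_actions = torch.randn(64, 6)          # actions proposed by the learned policy
dataset_actions = torch.randn(64, 6)         # actions observed in the offline dataset
mmd = gaussian_mmd(policy_actions, dataset_actions)
# BEAR-style constraint: keep this divergence below a threshold during policy improvement.
print(float(mmd))
```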
----
-# 2. Conservative Value Function Methods
+## 2. Conservative Value Function Methods
These methods modify Q-learning so Q-values of unseen actions are *underestimated*, preventing the policy from exploiting overestimated values.
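A minimal sketch of a CQL-style conservative penalty for a discrete-action Q-network, assuming illustrative sizes and a placeholder `q_net`; a full implementation would add the usual TD loss on top.

```python
import torch
import torch.nn as nn

# Sketch of a CQL-style conservative penalty for discrete actions (illustrative sizes).
num_actions, batch_size, state_dim = 5, 128, 8
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))

states = torch.randn(batch_size, state_dim)
dataset_actions = torch.randint(num_actions, (batch_size,))

q_values = q_net(states)                                        # Q(s, a) for all actions
q_data = q_values.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)
# Push down values of all (including unseen) actions, push up values of dataset actions.
cql_penalty = (torch.logsumexp(q_values, dim=1) - q_data).mean()
alpha = 1.0                                                     # conservatism coefficient
total_loss = alpha * cql_penalty        # + the usual TD loss in a full implementation
```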
@@ -213,9 +209,7 @@ Key idea:
IQL often achieves state-of-the-art performance due to simplicity and stability.
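A small sketch of the expectile (asymmetric L2) loss that IQL uses to fit its value function, with toy numbers; choosing `tau` above 0.5 biases the fit toward the upper tail of the Q-targets.

```python
import torch

def expectile_loss(q, v, tau=0.7):
    """Asymmetric L2 loss used by IQL: errors where q > v are weighted more when tau > 0.5."""
    diff = q - v
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1 - tau))
    return (weight * diff ** 2).mean()

# Toy numbers: target Q-values from the dataset and current V(s) estimates.
q_targets = torch.tensor([1.0, 2.0, 0.5, 3.0])
v_estimates = torch.tensor([1.5, 1.5, 1.5, 1.5])
print(float(expectile_loss(q_targets, v_estimates)))
```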
----
-# Model-Based Offline RL
+## Model-Based Offline RL
### Forward Model-Based RL
@@ -248,9 +242,7 @@ Parameter explanations:
These methods limit exploration into unknown model regions.
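As one concrete instance of this idea, the sketch below applies a MOPO-style pessimistic reward penalty proportional to the disagreement of an ensemble of learned dynamics models; the ensemble outputs and the penalty coefficient are toy assumptions.

```python
import numpy as np

# Sketch of a MOPO-style pessimistic reward: penalize model uncertainty, measured here
# by disagreement (std) across an ensemble of learned dynamics models. Numbers are toy.
ensemble_next_states = np.random.randn(7, 32, 11)   # (ensemble, batch, state_dim) predictions
model_reward = np.random.randn(32)                  # reward predicted for each transition

disagreement = ensemble_next_states.std(axis=0).max(axis=-1)   # per-transition uncertainty
lam = 1.0                                                      # penalty coefficient
pessimistic_reward = model_reward - lam * disagreement
# Transitions well inside the dataset's support get small penalties; unfamiliar ones get large ones.
```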
----
-# Reverse Model-Based Imagination (ROMI)
+## Reverse Model-Based Imagination (ROMI)
ROMI generates new training data by *backward* imagination.
@@ -288,8 +280,6 @@ Benefits:
ROMI combined with conservative RL often outperforms standard offline methods.
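A schematic sketch of the reverse-imagination loop, assuming a reverse policy and a reverse dynamics model have already been trained (they are stubbed with untrained linear layers here) and that the anchor states come from the fixed dataset.

```python
import torch
import torch.nn as nn

# Schematic reverse imagination: starting from a real state s' in the dataset, sample a
# predecessor action with a reverse policy and a predecessor state (plus reward) with a
# reverse dynamics model, yielding an imagined transition (s, a, r, s') that ends in known data.
state_dim, action_dim = 11, 3
reverse_policy = nn.Linear(state_dim, action_dim)                     # stub for a trained model
reverse_dynamics = nn.Linear(state_dim + action_dim, state_dim + 1)   # predicts (s, r)

dataset_states = torch.randn(256, state_dim)                          # real states acting as anchors
imagined = []
for _ in range(5):                                                    # short backward rollouts
    s_next = dataset_states[torch.randint(len(dataset_states), (64,))]
    a = torch.tanh(reverse_policy(s_next))             # action that could have led to s_next
    pred = reverse_dynamics(torch.cat([s_next, a], dim=-1))
    s, r = pred[:, :-1], pred[:, -1]
    imagined.append((s, a, r, s_next))                 # add to the offline buffer
```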
----
# Summary of Lecture 22
Offline RL requires balancing:
@@ -304,14 +294,3 @@ Three major families of solutions:
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)
Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.
----
-# Recommended Screenshot Frames for Lecture 22
-- Lecture 22, page 7: Offline RL diagram showing policy learning from fixed dataset, subsection "Offline RL Setting".
-- Lecture 22, page 35: Illustration of dataset support vs policy action distribution, subsection "Strategies for Safe Offline RL".
----
-**End of CSE510_L22.md**

View File

@@ -56,8 +56,6 @@ Benefits:
ROMI effectively fills in missing gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms.
----
## Implicit Credit Assignment via Value Factorization Structures
Although initially studied for multi-agent systems, insights from value factorization also improve offline RL by providing structured credit assignment signals.
@@ -84,8 +82,6 @@ In architectures designed for IGM (Individual-Global-Max) consistency, gradients
Even in single-agent structured RL, similar factorization structures allow credit to flow into components representing skills, modes, or action groups, enabling better temporal and structural decomposition.
----
## Model-Based vs Model-Free Offline RL
Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL.
@@ -135,8 +131,6 @@ These methods limit exploration into uncertain model regions.
- ROMI expands *backward*, staying consistent with known good future states.
- ROMI reduces error accumulation because future anchors are real.
----
## Combining ROMI With Conservative Offline RL
ROMI is typically combined with:
@@ -157,8 +151,6 @@ Benefits:
- Increased policy improvement over the dataset.
- More stable Q-learning backups.
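One minimal way to realize this combination is to train any conservative learner on batches that mix real and imagined transitions; the sampling helper below is illustrative, and the 70/30 split is an assumption rather than a value from the lecture.

```python
import random

def sample_mixed_batch(real_buffer, imagined_buffer, batch_size=256, real_fraction=0.7):
    """Draw a training batch mixing real dataset transitions with ROMI-imagined ones.

    The 70/30 split is an illustrative choice, not a value from the lecture.
    """
    n_real = int(batch_size * real_fraction)
    batch = random.sample(real_buffer, n_real)
    batch += random.sample(imagined_buffer, batch_size - n_real)
    random.shuffle(batch)
    return batch   # feed to any conservative learner (e.g., CQL- or BCQ-style updates)
```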
----
## Summary of Lecture 23
Key points:
@@ -168,10 +160,3 @@ Key points:
- Reverse imagination avoids pitfalls of forward model error.
- Factored value structures provide implicit counterfactual credit assignment.
- Combining ROMI with conservative learners yields state-of-the-art performance.
----
-## Recommended Screenshot Frames for Lecture 23
-- Lecture 23, page 20: ROMI concept diagram depicting reverse imagination from goal states. Subsection: "Reverse Model-Based Imagination (ROMI)".
-- Lecture 23, page 24: Architecture figure showing reverse policy and reverse dynamics model used to generate imagined transitions. Subsection: "Reverse Imagination Process".

View File

@@ -4,8 +4,6 @@
This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.
## Multi-Agent Coordination Under Uncertainty
In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP).
@@ -39,7 +37,6 @@ Parameter explanations:
Training uses global information (centralized), but execution uses local agent observations. This is critical for real-world deployment.
## Joint vs Factored Q-Learning
### Joint Q-Learning
@@ -75,15 +72,12 @@ Parameter explanations:
The goal is to enable decentralized greedy action selection.
## Individual-Global-Max (IGM) Condition
The IGM condition enables decentralized optimal action selection:
$$
-\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a})
-=
+\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a})=
\big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)
$$
@@ -96,8 +90,6 @@ Parameter explanations:
IGM makes decentralized execution optimal with respect to the learned factorized value.
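A toy numeric check of the IGM property for an additive (VDN-style) factorization with two agents; the Q-tables are random and purely illustrative.

```python
import numpy as np

# Toy check of the IGM property for an additive (VDN-style) factorization with two agents.
rng = np.random.default_rng(0)
q1 = rng.normal(size=4)            # Q_1(s, a_1) over 4 actions for agent 1
q2 = rng.normal(size=3)            # Q_2(s, a_2) over 3 actions for agent 2
q_tot = q1[:, None] + q2[None, :]  # Q_tot(s, a_1, a_2) = Q_1 + Q_2

joint_argmax = tuple(int(i) for i in np.unravel_index(q_tot.argmax(), q_tot.shape))
decentralized_argmax = (int(q1.argmax()), int(q2.argmax()))
assert joint_argmax == decentralized_argmax   # greedy local choices recover the joint greedy action
```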
## Linear Value Factorization
### VDN (Value Decomposition Networks)
@@ -123,8 +115,6 @@ Cons:
- Limited representation capacity.
- Cannot model non-linear teamwork interactions.
## QMIX: Monotonic Value Factorization
QMIX uses a state-conditioned mixing network enforcing monotonicity:
@@ -154,8 +144,6 @@ Benefits:
- More expressive than VDN.
- Supports CTDE while keeping decentralized greedy execution.
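A simplified QMIX-style mixer sketch: hypernetwork outputs are passed through an absolute value so all mixing weights are non-negative, which enforces monotonicity of Q_tot in each agent utility. Layer sizes and the class name are illustrative, not the exact lecture architecture.

```python
import torch
import torch.nn as nn

class SimpleQMixMixer(nn.Module):
    """Simplified QMIX-style mixer: state-conditioned weights made non-negative with abs,
    which guarantees dQ_tot/dQ_i >= 0 (monotonicity). Layer sizes are illustrative."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)  # hypernetwork for layer-1 weights
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)              # hypernetwork for layer-2 weights
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):                        # agent_qs: (batch, n_agents)
        w1 = torch.abs(self.w1(state)).view(-1, self.n_agents, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + self.b1(state).unsqueeze(1))
        w2 = torch.abs(self.w2(state)).view(-1, self.embed_dim, 1)
        return (torch.bmm(hidden, w2) + self.b2(state).unsqueeze(1)).squeeze(-1).squeeze(-1)

mixer = SimpleQMixMixer(n_agents=3, state_dim=10)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 10))           # (batch,) team value
```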
## Theoretical Issues With Linear and Monotonic Factorization
Limitations:
@@ -164,8 +152,6 @@ Limitations:
- QMIX monotonicity limits representation power for tasks requiring non-monotonic interactions.
- Off-policy training can diverge in some factorizations.
## QPLEX: Duplex Dueling Multi-Agent Q-Learning
QPLEX introduces a dueling architecture that satisfies IGM while providing full representation capacity within the IGM class.
@@ -193,8 +179,6 @@ QPLEX Properties:
- Has full representation capacity for all IGM-consistent Q-functions.
- Enables stable off-policy training.
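A schematic view of why a duplex-dueling combination stays IGM-consistent: with advantages A_i = Q_i - max_a Q_i <= 0 and any positive weights, the joint argmax matches the individual argmaxes. The fixed lambda weights below stand in for QPLEX's attention-based, state-conditioned weights, which are omitted here.

```python
import numpy as np

# Schematic duplex-dueling combination: QPLEX's learned positive lambda weights are replaced
# by fixed positive constants, just to show why the IGM property still holds.
rng = np.random.default_rng(1)
q1, q2 = rng.normal(size=4), rng.normal(size=3)           # individual utilities
v1, v2 = q1.max(), q2.max()                               # V_i = max_a Q_i
a1, a2 = q1 - v1, q2 - v2                                 # advantages, always <= 0
lam1, lam2 = 0.3, 2.0                                     # any positive weights

q_tot = (v1 + v2) + lam1 * a1[:, None] + lam2 * a2[None, :]
joint = tuple(int(i) for i in np.unravel_index(q_tot.argmax(), q_tot.shape))
assert joint == (int(q1.argmax()), int(q2.argmax()))      # IGM preserved for any lam_i > 0
```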
## QPLEX Training Objective
QPLEX minimizes a TD loss over $Q_{tot}$:
@@ -211,8 +195,6 @@ Parameter explanations:
- $\mathbf{a'}$: next joint action evaluated by the TD target.
- $Q_{tot}$: QPLEX global value estimate.
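A generic sketch of this TD objective with placeholder tensors standing in for the outputs of the agent networks plus mixer and their target copies; the batch size and discount factor are arbitrary assumptions.

```python
import torch

# Generic TD objective over Q_tot, as used to train value-factorization methods.
# q_tot_online / q_tot_next_target stand in for (agent networks + mixer) outputs.
batch = 32
rewards = torch.randn(batch)                 # team reward r
dones = torch.zeros(batch)                   # episode-termination flags
gamma = 0.99

q_tot_online = torch.randn(batch, requires_grad=True)    # Q_tot(s, a) from the online nets
q_tot_next_target = torch.randn(batch)                   # max_{a'} Q_tot(s', a') from target nets

td_target = rewards + gamma * (1 - dones) * q_tot_next_target
td_loss = ((q_tot_online - td_target.detach()) ** 2).mean()
td_loss.backward()                                        # gradients flow into all agent networks
```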
## Role of Credit Assignment
Credit assignment addresses: "Which agent contributed what to the team reward?"
@@ -223,8 +205,6 @@ Value factorization supports implicit credit assignment:
- Dueling architectures allow each agent to learn its influence.
- QPLEX provides clean marginal contributions implicitly.
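One way to see the implicit credit signal is to differentiate Q_tot with respect to each agent's utility; the toy monotonic mixer below is an illustration, not the exact QMIX/QPLEX architecture.

```python
import torch
import torch.nn as nn

# Illustration of implicit credit assignment: the gradient of Q_tot with respect to each
# agent's utility acts as that agent's "credit" for the team value. The mixer is a toy
# monotonic (abs-weight) network.
n_agents, state_dim = 3, 10
weight_net = nn.Linear(state_dim, n_agents)

state = torch.randn(1, state_dim)
agent_qs = torch.randn(1, n_agents, requires_grad=True)
q_tot = (torch.abs(weight_net(state)) * agent_qs).sum()

credit = torch.autograd.grad(q_tot, agent_qs)[0]   # non-negative per-agent contribution weights
print(credit)
```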
## Performance on SMAC Benchmarks
QPLEX outperforms:
@@ -240,8 +220,6 @@ Key reasons:
- Strong representational capacity.
- Off-policy stability.
## Extensions: Diversity and Shared Parameter Learning
Parameter sharing encourages sample efficiency, but can cause homogeneous agent behavior.
@@ -254,8 +232,6 @@ Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce:
These techniques improve exploration and cooperation in complex multi-agent tasks.
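A minimal sketch of the shared-parameter pattern, where a one-hot agent ID is appended to each observation so a single network can still produce agent-specific behavior; CDS-style diversity objectives would be added on top and are not shown here.

```python
import torch
import torch.nn as nn

# Common parameter-sharing pattern: all agents use one network, with a one-hot agent ID
# appended to the observation so shared weights can still produce agent-specific behavior.
n_agents, obs_dim, n_actions = 4, 20, 6
shared_q = nn.Sequential(nn.Linear(obs_dim + n_agents, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

obs = torch.randn(n_agents, obs_dim)                 # one observation per agent
agent_ids = torch.eye(n_agents)                      # one-hot identities
q_values = shared_q(torch.cat([obs, agent_ids], dim=-1))   # (n_agents, n_actions)
```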
## Summary of Lecture 24
Key points:
@@ -266,10 +242,3 @@ Key points:
- QPLEX achieves full IGM representational capacity.
- Implicit credit assignment arises naturally from factorization.
- Diversity methods allow richer multi-agent coordination strategies.
-## Recommended Screenshot Frames for Lecture 24
-- Lecture 24, page 16: CTDE and QMIX architecture diagram (mixing network). Subsection: "QMIX: Monotonic Value Factorization".
-- Lecture 24, page 31: QPLEX benchmark performance on SMAC. Subsection: "Performance on SMAC Benchmarks".

View File

@@ -7,14 +7,17 @@ CSE 5100
**Fall 2025**
## Instructor Information
**Chongjie Zhang**
Office: McKelvey Hall 2010D
Email: chongjie@wustl.edu
### Instructor's Office Hours:
Chongjie Zhang's Office Hours: Wednesdays 11:00 am–12:00 pm in McKelvey Hall 2010D, or you may email me to make an appointment.
### TAs:
- Jianing Ye: jianing.y@wustl.edu
- Kefei Duan: d.kefei@wustl.edu
- Xiu Yuan: xiu@wustl.edu
@@ -22,6 +25,7 @@ Chongjie Zhang's Office Hours: Wednesdays 11:00 -12:00 am in Mckelvey Hall 2010D
**Office Hours:** Thursdays 4:00–5:00 pm in McKelvey Hall 1030 (tentative), or you may email the TAs to make an appointment.
## Course Description
Deep Reinforcement Learning (RL) is a cutting-edge field at the intersection of artificial intelligence and decision-making. This course provides an in-depth exploration of the fundamental principles, algorithms, and applications of deep reinforcement learning. We start from the Markov Decision Process (MDP) framework and cover basic RL algorithms—value-based, policy-based, actor-critic, and model-based methods—then move to advanced topics including offline RL and multi-agent RL. By combining deep learning with reinforcement learning, students will gain the skills to build intelligent systems that learn from experience and make near-optimal decisions in complex environments.
The course caters to graduate and advanced undergraduate students. Student performance evaluation will revolve around written and programming assignments and the course project.
@@ -39,6 +43,7 @@ By the end of this course, students should be able to:
- Execute an end-to-end DRL project: problem selection, environment design, algorithm selection, experimental protocol, ablations, and reproducibility.
## Prerequisites
If you are unsure about any of these, please speak to the instructor.
- Proficiency in Python programming.
@@ -51,11 +56,13 @@ One of the following:
- b) a Machine Learning course (CSE 417T or ESE 417).
## Textbook
**Primary text** (optional but recommended): Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed., online). We will not cover all of the chapters and, from time to time, cover topics not contained in the book.
**Additional references:** Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed.); OpenAI Spinning Up in Deep RL tutorial.
## Homeworks
There will be a total of three homework assignments distributed throughout the semester. Each assignment will be accessible on Canvas, allowing you approximately two weeks to finish and submit it before the designated deadline.
Late work will not be accepted. If you have a documented medical or emergency reason, contact the TAs as soon as possible.
@@ -65,21 +72,25 @@ Late work will not be accepted. If you have a documented medical or emergency re
**Academic Integrity:** Do not copy from peers or online sources. Violations will be referred per university policy.
## Final Project
A research-level project of your choice that demonstrates mastery of DRL concepts and empirical methodology. Possible directions include: (a) improving an existing approach, (b) tackling an unsolved task/benchmark, (c) reproducing and extending a recent paper, or (d) creating a new task/problem relevant to RL.
**Team size:** 1–2 students by default (contact instructor/TAs for approval if proposing a larger team).
### Milestones:
- **Proposal:** ≤ 2 pages outlining problem, related work, methodology, evaluation plan, and risks.
- **Progress report with short survey:** ≤ 4 pages with preliminary results or diagnostics.
- **Presentation/Poster session:** brief talk or poster demo.
- **Final report:** 7–10 pages (NeurIPS format) with clear experiments, ablations, and reproducibility details.
## Evaluation
**Homework / Problem Sets (3) — 45%**
Each problem set combines written questions (derivations/short answers) and programming components (implementations and experiments).
**Final Course Project — 50% total**
- Proposal (max 2 pages) — 5% of project
- Progress report with brief survey (max 4 pages) — 10% of project
- Presentation/Poster session — 10% of project
@@ -91,7 +102,9 @@ Contributions in class and on the course discussion forum, especially in the pro
**Course evaluations** (mid-semester and final course evaluations): extra credit up to 2%
## Grading Scale
The intended grading scale is as follows. The instructor reserves the right to adjust the grading scale.
- A's (A-, A, A+): >= 90%
- B's (B-, B, B+): >= 80%
- C's (C-, C, C+): >= 70%