From 2946feefbe2de391b0267f601172a1e4a8651680 Mon Sep 17 00:00:00 2001 From: Trance-0 <60459821+Trance-0@users.noreply.github.com> Date: Tue, 18 Nov 2025 14:08:20 -0600 Subject: [PATCH] updates? --- content/CSE510/CSE510_L22.md | 29 ++++------------------------- content/CSE510/CSE510_L23.md | 15 --------------- content/CSE510/CSE510_L24.md | 33 +-------------------------------- content/CSE510/index.md | 13 +++++++++++++ 4 files changed, 18 insertions(+), 72 deletions(-) diff --git a/content/CSE510/CSE510_L22.md b/content/CSE510/CSE510_L22.md index ba86826..33ed2f3 100644 --- a/content/CSE510/CSE510_L22.md +++ b/content/CSE510/CSE510_L22.md @@ -105,9 +105,7 @@ There are two primary families of solutions: 1. --Policy constraint methods-- 2. --Conservative value estimation methods-- ---- - -# 1. Policy Constraint Methods +## 1. Policy Constraint Methods These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions. @@ -163,9 +161,7 @@ Parameter explanations: BEAR controls distribution shift more tightly than BCQ. ---- - -# 2. Conservative Value Function Methods +## 2. Conservative Value Function Methods These methods modify Q-learning so Q-values of unseen actions are -underestimated-, preventing the policy from exploiting overestimated values. @@ -213,9 +209,7 @@ Key idea: IQL often achieves state-of-the-art performance due to simplicity and stability. ---- - -# Model-Based Offline RL +## Model-Based Offline RL ### Forward Model-Based RL @@ -248,9 +242,7 @@ Parameter explanations: These methods limit exploration into unknown model regions. ---- - -# Reverse Model-Based Imagination (ROMI) +## Reverse Model-Based Imagination (ROMI) ROMI generates new training data by -backward- imagination. @@ -288,8 +280,6 @@ Benefits: ROMI combined with conservative RL often outperforms standard offline methods. ---- - # Summary of Lecture 22 Offline RL requires balancing: @@ -304,14 +294,3 @@ Three major families of solutions: 3. Model-based conservatism and imagination (MOPO, MOReL, ROMI) Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems. - ---- - -# Recommended Screenshot Frames for Lecture 22 - -- Lecture 22, page 7: Offline RL diagram showing policy learning from fixed dataset, subsection "Offline RL Setting". -- Lecture 22, page 35: Illustration of dataset support vs policy action distribution, subsection "Strategies for Safe Offline RL". - ---- - ---End of CSE510_L22.md-- diff --git a/content/CSE510/CSE510_L23.md b/content/CSE510/CSE510_L23.md index 31752c5..46f58a9 100644 --- a/content/CSE510/CSE510_L23.md +++ b/content/CSE510/CSE510_L23.md @@ -56,8 +56,6 @@ Benefits: ROMI effectively fills in missing gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms. ---- - ## Implicit Credit Assignment via Value Factorization Structures Although initially studied for multi-agent systems, insights from value factorization also improve offline RL by providing structured credit assignment signals. @@ -84,8 +82,6 @@ In architectures designed for IGM (Individual-Global-Max) consistency, gradients Even in single-agent structured RL, similar factorization structures allow credit flowing into components representing skills, modes, or action groups, enabling better temporal and structural decomposition. 
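To make the implicit credit assignment point above concrete, here is a minimal VDN-style sketch. PyTorch is assumed (the lecture does not prescribe a framework), and the names `AgentQNet` and `vdn_td_loss` plus all layer sizes are illustrative only, not from the notes. A single TD loss is placed on the summed team value $Q_{tot} = \sum_i Q_i$, and backpropagation through that sum routes a separate gradient into each agent's utility network, which is the "implicit counterfactual credit assignment" described above.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent utility Q_i(o_i, .); sizes and names are illustrative."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):                  # obs: [batch, obs_dim]
        return self.net(obs)                 # [batch, n_actions]

def vdn_td_loss(agent_nets, target_nets, batch, gamma=0.99):
    """One VDN-style TD step: a single team loss on Q_tot = sum_i Q_i.
    Backprop through the sum sends each agent its own gradient signal,
    i.e. the implicit credit assignment discussed in the notes."""
    obs, actions, reward, next_obs, done = batch   # per-agent lists of tensors
    q_tot, next_q_tot = 0.0, 0.0
    for net, tgt, o, a, o2 in zip(agent_nets, target_nets, obs, actions, next_obs):
        q_i = net(o).gather(1, a.unsqueeze(1)).squeeze(1)
        q_tot = q_tot + q_i                        # linear factorization (VDN)
        with torch.no_grad():                      # frozen target networks
            next_q_tot = next_q_tot + tgt(o2).max(dim=1).values
    target = reward + gamma * (1.0 - done) * next_q_tot
    return nn.functional.mse_loss(q_tot, target)
```

Because the sum is monotone in every $Q_i$, each agent's greedy argmax over its own utility also maximizes $Q_{tot}$, which is the IGM consistency the notes refer to.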
---- - ## Model-Based vs Model-Free Offline RL Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL. @@ -135,8 +131,6 @@ These methods limit exploration into uncertain model regions. - ROMI expands -backward-, staying consistent with known good future states. - ROMI reduces error accumulation because future anchors are real. ---- - ## Combining ROMI With Conservative Offline RL ROMI is typically combined with: @@ -157,8 +151,6 @@ Benefits: - Increased policy improvement over dataset. - More stable Q-learning backups. ---- - ## Summary of Lecture 23 Key points: @@ -168,10 +160,3 @@ Key points: - Reverse imagination avoids pitfalls of forward model error. - Factored value structures provide implicit counterfactual credit assignment. - Combining ROMI with conservative learners yields state-of-the-art performance. - ---- - -## Recommended Screenshot Frames for Lecture 23 - -- Lecture 23, page 20: ROMI concept diagram depicting reverse imagination from goal states. Subsection: "Reverse Model-Based Imagination (ROMI)". -- Lecture 23, page 24: Architecture figure showing reverse policy and reverse dynamics model used to generate imagined transitions. Subsection: "Reverse Imagination Process". diff --git a/content/CSE510/CSE510_L24.md b/content/CSE510/CSE510_L24.md index ff93ecf..7be42e5 100644 --- a/content/CSE510/CSE510_L24.md +++ b/content/CSE510/CSE510_L24.md @@ -4,8 +4,6 @@ This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX. - - ## Multi-Agent Coordination Under Uncertainty In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP). @@ -39,7 +37,6 @@ Parameter explanations: Training uses global information (centralized), but execution uses local agent observations. This is critical for real-world deployment. - ## Joint vs Factored Q-Learning ### Joint Q-Learning @@ -75,15 +72,12 @@ Parameter explanations: The goal is to enable decentralized greedy action selection. - - ## Individual-Global-Max (IGM) Condition The IGM condition enables decentralized optimal action selection: $$ -\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a}) -=========================================== +\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a})= \big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big) $$ @@ -96,8 +90,6 @@ Parameter explanations: IGM makes decentralized execution optimal with respect to the learned factorized value. - - ## Linear Value Factorization ### VDN (Value Decomposition Networks) @@ -123,8 +115,6 @@ Cons: - Limited representation capacity. - Cannot model non-linear teamwork interactions. - - ## QMIX: Monotonic Value Factorization QMIX uses a state-conditioned mixing network enforcing monotonicity: @@ -154,8 +144,6 @@ Benefits: - More expressive than VDN. - Supports CTDE while keeping decentralized greedy execution. - - ## Theoretical Issues With Linear and Monotonic Factorization Limitations: @@ -164,8 +152,6 @@ Limitations: - QMIX monotonicity limits representation power for tasks requiring non-monotonic interactions. - Off-policy training can diverge in some factorizations. - - ## QPLEX: Duplex Dueling Multi-Agent Q-Learning QPLEX introduces a dueling architecture that satisfies IGM while providing full representation capacity within the IGM class. 
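Since QMIX's monotonic mixing was just described and QPLEX builds on the same IGM constraint, a compact sketch of that mixer may help before the QPLEX details. This is an illustrative PyTorch sketch only (framework assumed; the class name `MonotonicMixer` and all sizes are not from the lecture): state-conditioned hypernetworks generate the mixing weights, and taking their absolute value enforces $\frac{\partial Q_{tot}}{\partial Q_{i}} \ge 0$, so decentralized greedy execution stays consistent with the centralized value.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: Q_tot = f_s(Q_1, ..., Q_n) with non-negative mixing
    weights, so Q_tot is monotone in every agent utility (illustrative sizes)."""
    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        # Hypernetworks: the global state generates the mixing parameters.
        self.w1 = nn.Linear(state_dim, n_agents * embed)
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)
        self.b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                nn.Linear(embed, 1))

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        bs = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(bs, self.n_agents, self.embed)
        b1 = self.b1(state).view(bs, 1, self.embed)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(bs, self.embed, 1)
        b2 = self.b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2     # [batch, 1, 1]
        return q_tot.view(bs)                  # monotone in each Q_i
```

Training places the usual TD loss on `q_tot` under CTDE; because every mixing weight is non-negative and ReLU is monotone, each agent's local argmax still maximizes $Q_{tot}$. This is exactly the monotonicity-implies-IGM argument above, and also the representational restriction that QPLEX is designed to remove.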
@@ -193,8 +179,6 @@ QPLEX Properties: - Has full representation capacity for all IGM-consistent Q-functions. - Enables stable off-policy training. - - ## QPLEX Training Objective QPLEX minimizes a TD loss over $Q_{tot}$: @@ -211,8 +195,6 @@ Parameter explanations: - $\mathbf{a'}$: next joint action evaluated by TD target. - $Q_{tot}$: QPLEX global value estimate. - - ## Role of Credit Assignment Credit assignment addresses: "Which agent contributed what to the team reward?" @@ -223,8 +205,6 @@ Value factorization supports implicit credit assignment: - Dueling architectures allow each agent to learn its influence. - QPLEX provides clean marginal contributions implicitly. - - ## Performance on SMAC Benchmarks QPLEX outperforms: @@ -240,8 +220,6 @@ Key reasons: - Strong representational capacity. - Off-policy stability. - - ## Extensions: Diversity and Shared Parameter Learning Parameter sharing encourages sample efficiency, but can cause homogeneous agent behavior. @@ -254,8 +232,6 @@ Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce: These techniques improve exploration and cooperation in complex multi-agent tasks. - - ## Summary of Lecture 24 Key points: @@ -266,10 +242,3 @@ Key points: - QPLEX achieves full IGM representational capacity. - Implicit credit assignment arises naturally from factorization. - Diversity methods allow richer multi-agent coordination strategies. - - - -## Recommended Screenshot Frames for Lecture 24 - -- Lecture 24, page 16: CTDE and QMIX architecture diagram (mixing network). Subsection: "QMIX: Monotonic Value Factorization". -- Lecture 24, page 31: QPLEX benchmark performance on SMAC. Subsection: "Performance on SMAC Benchmarks". diff --git a/content/CSE510/index.md b/content/CSE510/index.md index 4c349a0..4f18e99 100644 --- a/content/CSE510/index.md +++ b/content/CSE510/index.md @@ -7,14 +7,17 @@ CSE 5100 **Fall 2025** ## Instructor Information + **Chongjie Zhang** Office: McKelvey Hall 2010D Email: chongjie@wustl.edu ### Instructor's Office Hours: + Chongjie Zhang's Office Hours: Wednesdays 11:00 -12:00 am in Mckelvey Hall 2010D Or you may email me to make an appointment. ### TAs: + - Jianing Ye: jianing.y@wustl.edu - Kefei Duan: d.kefei@wustl.edu - Xiu Yuan: xiu@wustl.edu @@ -22,6 +25,7 @@ Chongjie Zhang's Office Hours: Wednesdays 11:00 -12:00 am in Mckelvey Hall 2010D **Office Hours:** Thursday 4:00pm -5:00pm in Mckelvey Hall 1030 (tentative) Or you may email TAs to make an appointment. ## Course Description + Deep Reinforcement Learning (RL) is a cutting-edge field at the intersection of artificial intelligence and decision-making. This course provides an in-depth exploration of the fundamental principles, algorithms, and applications of deep reinforcement learning. We start from the Markov Decision Process (MDP) framework and cover basic RL algorithms—value-based, policy-based, actor–critic, and model-based methods—then move to advanced topics including offline RL and multi-agent RL. By combining deep learning with reinforcement learning, students will gain the skills to build intelligent systems that learn from experience and make near-optimal decisions in complex environments. The course caters to graduate and advanced undergraduate students. Student performance evaluation will revolve around written and programming assignments and the course project. 
@@ -39,6 +43,7 @@ By the end of this course, students should be able to: - Execute an end-to-end DRL project: problem selection, environment design, algorithm selection, experimental protocol, ablations, and reproducibility. ## Prerequisites + If you are unsure about any of these, please speak to the instructor. - Proficiency in Python programming. @@ -51,11 +56,13 @@ One of the following: - b) a Machine Learning course (CSE 417T or ESE 417). ## Textbook + **Primary text** (optional but recommended): Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed., online). We will not cover all of the chapters and, from time to time, cover topics not contained in the book. **Additional references:** Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed.); OpenAI Spinning Up in Deep RL tutorial. ## Homeworks + There will be a total of three homework assignments distributed throughout the semester. Each assignment will be accessible on Canvas, allowing you approximately two weeks to finish and submit it before the designated deadline. Late work will not be accepted. If you have a documented medical or emergency reason, contact the TAs as soon as possible. @@ -65,21 +72,25 @@ Late work will not be accepted. If you have a documented medical or emergency re **Academic Integrity:** Do not copy from peers or online sources. Violations will be referred per university policy. ## Final Project + A research‑level project of your choice that demonstrates mastery of DRL concepts and empirical methodology. Possible directions include: (a) improving an existing approach, (b) tackling an unsolved task/benchmark, (c) reproducing and extending a recent paper, or (d) creating a new task/problem relevant to RL. **Team size:** 1–2 students by default (contact instructor/TAs for approval if proposing a larger team). ### Milestones: + - **Proposal:** ≤ 2 pages outlining problem, related work, methodology, evaluation plan, and risks. - **Progress report with short survey:** ≤ 4 pages with preliminary results or diagnostics. - **Presentation/Poster session:** brief talk or poster demo. - **Final report:** 7–10 pages (NeurIPS format) with clear experiments, ablations, and reproducibility details. ## Evaluation + **Homework / Problem Sets (3) — 45%** Each problem set combines written questions (derivations/short answers) and programming components (implementations and experiments). **Final Course Project — 50% total** + - Proposal (max 2 pages) — 5% of project - Progress report with brief survey (max 4 pages) — 10% of project - Presentation/Poster session — 10% of project @@ -91,7 +102,9 @@ Contributions in class and on the course discussion forum, especially in the pro **Course evaluations** (mid-semester and final course evaluations): extra credit up to 2% ## Grading Scale + The intended grading scale is as follows. The instructor reserves the right to adjust the grading scale. + - A's (A-,A,A+): >= 90% - B's (B-,B,B+): >= 80% - C's (C-,C,C+): >= 70%