This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.
## Multi-Agent Coordination Under Uncertainty
In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP).
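In one standard formalization (the notation here follows common Dec-POMDP conventions rather than any specific slide), a Dec-POMDP is the tuple

$$
\langle \mathcal{N}, S, \{A_{i}\}_{i=1}^{n}, P, r, \{\Omega_{i}\}_{i=1}^{n}, O, \gamma \rangle
$$

where $\mathcal{N}$ is the set of $n$ agents, $S$ the state space, $A_{i}$ each agent's action space, $P(s' \mid s, \mathbf{a})$ the transition function, $r(s, \mathbf{a})$ the shared team reward, $\Omega_{i}$ and $O$ the observation spaces and observation function, and $\gamma$ the discount factor.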
Training uses global information (centralized), but execution relies only on each agent's local observations (decentralized). This centralized-training, decentralized-execution (CTDE) paradigm is critical for real-world deployment.
## Joint vs Factored Q-Learning
### Joint Q-Learning

Joint Q-learning treats the team as a single meta-agent and learns one value $Q(s, \mathbf{a})$ over the joint action space, which grows exponentially with the number of agents.

### Factored Q-Learning

Factored Q-learning instead learns per-agent utilities $Q_{i}(s, a_{i})$ that are combined into $Q_{tot}$. The goal is to enable decentralized greedy action selection.
## Individual-Global-Max (IGM) Condition
The IGM condition enables decentralized optimal action selection:
$$
\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a}) = \big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)
$$
IGM makes decentralized execution optimal with respect to the learned factorized value.
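As a concrete check, the sketch below (illustrative only; it uses NumPy and an additive factorization, neither of which is taken from the lecture) verifies on a tiny two-agent problem that the decentralized per-agent argmaxes coincide with the centralized joint argmax:

```python
import numpy as np

# Tiny 2-agent x 3-action problem: an additive factorization Q_tot = Q_1 + Q_2
# satisfies IGM, so per-agent greedy actions must match the joint greedy action.
rng = np.random.default_rng(0)
q1 = rng.normal(size=3)                 # Q_1(s, a_1) for agent 1's actions
q2 = rng.normal(size=3)                 # Q_2(s, a_2) for agent 2's actions
q_tot = q1[:, None] + q2[None, :]       # Q_tot(s, a_1, a_2), shape (3, 3)

# Decentralized greedy selection: each agent maximizes its own utility.
decentralized = (int(np.argmax(q1)), int(np.argmax(q2)))

# Centralized greedy selection: maximize the joint value over all action pairs.
centralized = tuple(int(i) for i in np.unravel_index(np.argmax(q_tot), q_tot.shape))

assert decentralized == centralized     # IGM holds for this additive factorization
print(decentralized, centralized)
```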
## Linear Value Factorization
### VDN (Value Decomposition Networks)
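VDN assumes an additive factorization; written in the document's notation, the joint value is the sum of per-agent utilities:

$$
Q_{tot}(s, \mathbf{a}) = \sum_{i=1}^{n} Q_{i}(s, a_{i})
$$

Because each agent's argmax over its own $Q_{i}$ also maximizes the sum, IGM holds by construction, but only additive team interactions can be represented.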
Cons:
- Limited representation capacity.
- Cannot model non-linear teamwork interactions.
## QMIX: Monotonic Value Factorization
QMIX uses a state-conditioned mixing network enforcing monotonicity:
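In the standard QMIX formulation, the constraint is that the joint value never decreases when any individual utility increases:

$$
\frac{\partial Q_{tot}(s, \mathbf{a})}{\partial Q_{i}(s, a_{i})} \ge 0, \qquad \forall i \in \{1, \dots, n\}
$$

This is typically enforced by generating the mixing weights with state-conditioned hypernetworks and taking their absolute values, so the mixer is monotone in every $Q_{i}$ while still depending non-linearly on the state.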
Benefits:
- More expressive than VDN.
- Supports CTDE while keeping decentralized greedy execution.
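A minimal sketch of such a mixer in PyTorch (simplified to a single mixing layer with hypothetical sizes; real QMIX stacks two hypernetwork-generated layers): non-negative weights produced from the state guarantee the monotonicity constraint above.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Single-layer QMIX-style mixer: Q_tot is monotone in each agent's Q_i."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        # Hypernetworks: the global state decides the mixing weights and biases.
        self.hyper_w = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b = nn.Linear(state_dim, embed_dim)
        self.hyper_v = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, 1))
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w = torch.abs(self.hyper_w(state)).view(-1, self.n_agents, self.embed_dim)
        b = self.hyper_b(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w) + b)  # non-negative weights
        v = self.hyper_v(state).view(-1, 1, 1)                        # state-value bias term
        # Summing the hidden features keeps Q_tot monotone in every agent_q.
        return (hidden.sum(dim=2, keepdim=True) + v).squeeze(-1).squeeze(-1)

# Tiny usage example with random tensors.
mixer = MonotonicMixer(n_agents=3, state_dim=8)
q_tot = mixer(torch.randn(4, 3), torch.randn(4, 8))  # shape: (4,)
```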
## Theoretical Issues With Linear and Monotonic Factorization
Limitations:
- QMIX monotonicity limits representation power for tasks requiring non-monotonic interactions.
- Off-policy training can diverge in some factorizations.
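A common illustration of the monotonicity limitation is a one-step cooperative matrix game of the kind used in the QTRAN/QPLEX analyses; the payoff values below are the usual illustrative ones, not taken from the lecture.

|            | $a_{2}=A$ | $a_{2}=B$ | $a_{2}=C$ |
|------------|-----------|-----------|-----------|
| $a_{1}=A$  | 8         | -12       | -12       |
| $a_{1}=B$  | -12       | 0         | 0         |
| $a_{1}=C$  | -12       | 0         | 0         |

The optimal joint action is $(A, A)$, but if a monotonic mixer's greedy joint action were $(A, A)$, then $Q_{1}(A) \ge Q_{1}(B)$ and monotonicity would force $Q_{tot}(A, B) \ge Q_{tot}(B, B)$, contradicting the true payoffs $-12 < 0$. No monotonic factorization can both fit this matrix and keep the optimal action greedy, which is exactly the representational gap QPLEX targets.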
## QPLEX: Duplex Dueling Multi-Agent Q-Learning
QPLEX introduces a dueling architecture that satisfies IGM while providing full representation capacity within the IGM class.
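At a high level (following the published QPLEX formulation; the lecture's exact notation may differ), the joint value is split into a state value and a positively weighted sum of per-agent advantages:

$$
Q_{tot}(s, \mathbf{a}) = V_{tot}(s) + \sum_{i=1}^{n} \lambda_{i}(s, \mathbf{a})\, A_{i}(s, a_{i}), \qquad \lambda_{i}(s, \mathbf{a}) > 0
$$

Because the weights $\lambda_{i}$ are strictly positive and each $A_{i}$ is maximized (at zero) by the agent's own greedy action, the joint argmax coincides with the per-agent argmaxes, so IGM holds while the action-dependent weights recover full expressiveness within the IGM class.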
QPLEX Properties:
- Has full representation capacity for all IGM-consistent Q-functions.
- Enables stable off-policy training.
## QPLEX Training Objective
QPLEX minimizes a TD loss over $Q_{tot}$:
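In the usual one-step deep Q-learning form (target-network parameters $\theta^{-}$ and a replay buffer are standard assumptions here, not spelled out in this excerpt), the objective is

$$
\mathcal{L}(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{\mathbf{a'}} Q_{tot}(s', \mathbf{a'}; \theta^{-}) - Q_{tot}(s, \mathbf{a}; \theta)\big)^{2}\Big]
$$

Because QPLEX satisfies IGM, the maximization over the joint action $\mathbf{a'}$ in the target reduces to each agent maximizing its own utility.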
Parameter explanations:
- $\mathbf{a'}$: next joint action evaluated in the TD target.
- $Q_{tot}$: QPLEX global value estimate.
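A minimal sketch of computing this loss with a separate target network (names and shapes are placeholders; it reuses the mixer interface sketched in the QMIX section, whereas QPLEX's duplex-dueling mixer would additionally take the chosen actions). The greedy joint value at the next state is built agent-by-agent, as IGM allows.

```python
import torch

def factored_td_loss(q_tot, rewards, dones, next_agent_qs_target, next_states,
                     mixer_target, gamma=0.99):
    """One-step TD loss for a factorized Q_tot (illustrative shapes and names)."""
    # q_tot:                (batch,) Q_tot(s, a) from the online mixer for the taken actions
    # next_agent_qs_target: (batch, n_agents, n_actions) per-agent target-network values at s'
    # IGM: the greedy joint value at s' is obtained from per-agent maxima.
    greedy_next = next_agent_qs_target.max(dim=2).values     # (batch, n_agents)
    q_tot_next = mixer_target(greedy_next, next_states)      # (batch,)
    targets = rewards + gamma * (1.0 - dones) * q_tot_next
    return torch.mean((q_tot - targets.detach()) ** 2)
```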
## Role of Credit Assignment
Credit assignment addresses the question: "Which agent contributed what to the team reward?"
Value factorization supports implicit credit assignment:
- Dueling architectures allow each agent to learn its influence on the team value.
- QPLEX implicitly provides clean per-agent marginal contributions.
## Performance on SMAC Benchmarks
On the StarCraft Multi-Agent Challenge (SMAC) benchmark, QPLEX outperforms value-factorization baselines such as VDN and QMIX.
Key reasons:
- Strong representational capacity.
- Off-policy stability.
## Extensions: Diversity and Shared Parameter Learning
Parameter sharing improves sample efficiency, but can cause homogeneous agent behavior.
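One simple way to keep shared parameters while letting behavior differentiate is to condition the shared network on an agent identifier, a common baseline trick rather than the CDS method itself (CDS additionally separates agent-specific modules and adds an explicit diversity objective). A minimal sketch with hypothetical sizes:

```python
import torch
import torch.nn as nn

class SharedAgentNet(nn.Module):
    """One network shared by all agents, conditioned on a one-hot agent ID."""

    def __init__(self, obs_dim: int, n_actions: int, n_agents: int, hidden: int = 64):
        super().__init__()
        self.n_agents = n_agents
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor, agent_id: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim); agent_id: (batch,) integer indices.
        one_hot = torch.nn.functional.one_hot(agent_id, self.n_agents).float()
        return self.net(torch.cat([obs, one_hot], dim=-1))  # per-action utilities Q_i

# All agents reuse the same parameters but can produce ID-specific policies.
net = SharedAgentNet(obs_dim=10, n_actions=5, n_agents=3)
q_values = net(torch.randn(4, 10), torch.tensor([0, 1, 2, 0]))
```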
Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce agent-specific components and explicit diversity objectives on top of the shared parameters.
These techniques improve exploration and cooperation in complex multi-agent tasks.
## Summary of Lecture 24
Key points:
- QPLEX achieves full IGM representational capacity.
- Implicit credit assignment arises naturally from factorization.
- Diversity methods allow richer multi-agent coordination strategies.
## Recommended Screenshot Frames for Lecture 24
- Lecture 24, page 16: CTDE and QMIX architecture diagram (mixing network). Subsection: "QMIX: Monotonic Value Factorization".
- Lecture 24, page 31: QPLEX benchmark performance on SMAC. Subsection: "Performance on SMAC Benchmarks".