CSE510 Deep Reinforcement Learning (Lecture 24)
Cooperative Multi-Agent Reinforcement Learning (MARL)
This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.
Multi-Agent Coordination Under Uncertainty
In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP).
A transition is defined as:
P(s' \mid s, a_{1}, \dots, a_{n})
Parameter explanations:
- s: current global state.
- s': next global state.
- a_{i}: action taken by agent i.
- P(\cdot): environment transition function.
The shared return is:
\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]
Parameter explanations:
- \gamma: discount factor.
- T: horizon length.
- r_{t}: shared team reward at time t.
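As a quick numerical illustration, the minimal Python sketch below computes this discounted shared return for a toy list of team rewards; the reward values and discount factor are made up for the example.

```python
# Minimal sketch: discounted shared return for a cooperative episode.
# `rewards` is a hypothetical list of team rewards r_0, ..., r_T (illustrative values).
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):      # accumulate backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```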
CTDE: Centralized Training, Decentralized Execution
Training uses global information (centralized), but execution uses local agent observations. This is critical for real-world deployment.
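A minimal sketch of this interface split is given below; the function and argument names are illustrative, not from the lecture.

```python
# Sketch of the CTDE split (all names here are assumptions for illustration).
def decentralized_act(agent_q_fn, local_obs, actions):
    # Execution: each agent acts greedily from its own observation only.
    return max(actions, key=lambda a: agent_q_fn(local_obs, a))

def centralized_batch(global_state, joint_obs, joint_actions, team_reward):
    # Training: the learner may also consume the global state s (e.g. to condition
    # a mixing network), even though s is unavailable to agents at execution time.
    return {"s": global_state, "obs": joint_obs, "a": joint_actions, "r": team_reward}

# Toy usage: a hand-written utility function standing in for a learned Q_i.
print(decentralized_act(lambda obs, a: -abs(a - obs), local_obs=2, actions=[0, 1, 2, 3]))  # 2
```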
Joint vs Factored Q-Learning
Joint Q-Learning
In joint-action learning, one learns a full joint Q-function:
Q_{tot}(s, a_{1}, \dots, a_{n})
Parameter explanations:
- Q_{tot}: joint value for the entire team.
- (a_{1}, \dots, a_{n}): joint action vector across agents.
Problem:
- The joint action space grows exponentially in n (see the quick count below).
- Learning is not scalable.
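The sketch below counts this blow-up for an assumed action set of 6 discrete actions per agent; the numbers are purely illustrative.

```python
# Why joint Q-learning does not scale: with n agents and |A| actions each,
# the joint action space has |A|**n entries per state.
num_actions = 6                          # illustrative per-agent action count
for n in (2, 5, 10):
    print(n, "agents ->", num_actions ** n, "joint actions")
# 2 agents -> 36, 5 agents -> 7776, 10 agents -> 60466176
```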
Value Factorization
Instead of learning Q_{tot} directly, we factorize it into individual utility functions:
Q_{tot}(s, \mathbf{a}) = f(Q_{1}(s,a_{1}), \dots, Q_{n}(s,a_{n}))
Parameter explanations:
- \mathbf{a}: joint action vector.
- f(\cdot): mixing network combining individual Q-values.
The goal is to enable decentralized greedy action selection.
Individual-Global-Max (IGM) Condition
The IGM condition enables decentralized optimal action selection:
\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a})=
\big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)
Parameter explanations:
- \arg\max_{\mathbf{a}}: search over joint actions for the best joint action.
- \arg\max_{a_{i}}: best local action for agent i.
- Q_{i}(s,a_{i}): individual utility for agent i.
IGM makes decentralized execution optimal with respect to the learned factorized value.
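The snippet below is a toy sanity check of IGM under an additive (VDN-style) factorization; the random Q-tables are illustrative only, not learned values.

```python
import numpy as np

# Toy IGM check: with Q_tot = Q_1 + Q_2, the joint argmax equals the pair
# of per-agent argmaxes, so decentralized greedy selection is optimal.
rng = np.random.default_rng(0)
Q1 = rng.normal(size=5)                  # Q_1(s, a_1) over 5 actions of agent 1
Q2 = rng.normal(size=5)                  # Q_2(s, a_2) over 5 actions of agent 2

Q_tot = Q1[:, None] + Q2[None, :]        # joint table Q_tot(s, a_1, a_2)
joint_best = np.unravel_index(Q_tot.argmax(), Q_tot.shape)
local_best = (Q1.argmax(), Q2.argmax())
assert joint_best == tuple(local_best)   # decentralized greedy = centralized greedy
print(joint_best, local_best)
```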
Linear Value Factorization
VDN (Value Decomposition Networks)
VDN assumes:
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} Q_{i}(s,a_{i})
Parameter explanations:
- Q_{i}(s,a_{i}): value of agent i's action.
- \sum_{i=1}^{n}: linear sum over agents.
Pros:
- Very simple, satisfies IGM.
- Fully decentralized execution.
Cons:
- Limited representation capacity.
- Cannot model non-linear teamwork interactions.
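Below is a minimal PyTorch-style sketch of the VDN mixing step; tensor shapes and names are assumptions, not the lecture's implementation.

```python
import torch

# VDN mixing sketch: agent_qs holds each agent's utility for its chosen action,
# assumed shape [batch, n_agents].
def vdn_mix(agent_qs: torch.Tensor) -> torch.Tensor:
    # Q_tot(s, a) = sum_i Q_i(s, a_i); purely additive, so IGM holds trivially.
    return agent_qs.sum(dim=1)

agent_qs = torch.randn(32, 4)            # batch of 32, 4 agents (illustrative)
q_tot = vdn_mix(agent_qs)                # shape [32]
print(q_tot.shape)
```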
QMIX: Monotonic Value Factorization
QMIX uses a state-conditioned mixing network enforcing monotonicity:
\frac{\partial Q_{tot}}{\partial Q_{i}} \ge 0
Parameter explanations:
- \partial Q_{tot} / \partial Q_{i}: gradient of the global Q-value with respect to the individual Q-value.
- \ge 0: ensures the monotonicity required for IGM.
The mixing function is:
Q_{tot}(s,\mathbf{a}) = f_{mix}(Q_{1}, \dots, Q_{n}; s)
Parameter explanations:
- f_{mix}: neural network with non-negative weights.
- s: global state conditioning the mixing process.
Benefits:
- More expressive than VDN.
- Supports CTDE while keeping decentralized greedy execution.
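A minimal sketch of a QMIX-style monotonic mixer follows; layer sizes and module names are assumptions, but the abs() on the hypernetwork outputs is what enforces \partial Q_{tot} / \partial Q_{i} \ge 0.

```python
import torch
import torch.nn as nn

# Sketch of a state-conditioned monotonic mixer in the spirit of QMIX.
# Hypernetworks map the global state to the mixer's weights; abs() keeps them
# non-negative, which guarantees monotonicity of Q_tot in each Q_i.
class MonotonicMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)  # layer-1 weights
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)             # layer-1 bias
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)             # layer-2 weights
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))      # state-dependent bias

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)                 # Q_tot, shape [batch]

mixer = MonotonicMixer(n_agents=4, state_dim=10)
print(mixer(torch.randn(8, 4), torch.randn(8, 10)).shape)           # torch.Size([8])
```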
Theoretical Issues With Linear and Monotonic Factorization
Limitations:
- Linear models (VDN) cannot represent complex coordination.
- QMIX monotonicity limits representation power for tasks requiring non-monotonic interactions.
- Off-policy training can diverge in some factorizations.
QPLEX: Duplex Dueling Multi-Agent Q-Learning
QPLEX introduces a dueling architecture that satisfies IGM while providing full representation capacity within the IGM class.
QPLEX Advantage Factorization
QPLEX factorizes:
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} \lambda_{i}(s,\mathbf{a})\big(Q_{i}(s,a_{i}) - \max_{a_{i}'} Q_{i}(s,a_{i}')\big) + \max_{\mathbf{a}'} \sum_{i=1}^{n} Q_{i}(s,a_{i}')
Parameter explanations:
- \lambda_{i}(s,\mathbf{a}): positive mixing coefficients.
- Q_{i}(s,a_{i}): individual utility.
- \max_{a_{i}'} Q_{i}(s,a_{i}'): per-agent baseline value.
- \max_{\mathbf{a}'}: maximization over joint actions.
QPLEX Properties:
- Fully satisfies IGM.
- Has full representation capacity for all IGM-consistent Q-functions.
- Enables stable off-policy training.
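The sketch below combines precomputed per-agent utilities, per-agent baselines, and positive coefficients in the duplex dueling form above; in QPLEX the coefficients come from a learned (attention-based) network, which is omitted here, and all shapes are assumptions.

```python
import torch

# QPLEX-style duplex dueling combination on precomputed quantities.
# agent_qs:     Q_i(s, a_i) for the chosen actions, shape [batch, n_agents]
# agent_q_maxes: max_{a_i'} Q_i(s, a_i'), shape [batch, n_agents]
# lambdas:      positive coefficients lambda_i(s, a), shape [batch, n_agents]
def qplex_q_tot(agent_qs, agent_q_maxes, lambdas):
    advantages = agent_qs - agent_q_maxes            # A_i <= 0, zero at the greedy action
    v_tot = agent_q_maxes.sum(dim=1)                 # max_{a'} sum_i Q_i(s, a_i')
    return v_tot + (lambdas * advantages).sum(dim=1) # Q_tot = V_tot + sum_i lambda_i A_i

qs = torch.randn(16, 3)
q_max = qs + torch.rand(16, 3)                       # illustrative valid per-agent maxima
lam = torch.rand(16, 3) + 0.1                        # strictly positive coefficients
print(qplex_q_tot(qs, q_max, lam).shape)             # torch.Size([16])
```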
QPLEX Training Objective
QPLEX minimizes a TD loss over Q_{tot}:
L = \mathbb{E}\Big[(r + \gamma \max_{\mathbf{a'}} Q_{tot}(s',\mathbf{a'}) - Q_{tot}(s,\mathbf{a}))^{2}\Big]
Parameter explanations:
- r: shared team reward.
- \gamma: discount factor.
- s': next state.
- \mathbf{a'}: next joint action evaluated in the TD target.
- Q_{tot}: QPLEX global value estimate.
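A minimal sketch of this TD objective, assuming Q_{tot} for the current state and the maximized Q_{tot} for the next state have already been produced by the mixer:

```python
import torch

# TD loss over Q_tot; all tensors are assumed precomputed, shape [batch].
def td_loss(q_tot, reward, q_tot_next_max, gamma=0.99, done=None):
    # Target: r + gamma * max_{a'} Q_tot(s', a'); mask bootstrapping at terminal states.
    mask = 1.0 if done is None else (1.0 - done)
    target = reward + gamma * mask * q_tot_next_max
    return ((q_tot - target.detach()) ** 2).mean()   # detach() plays the role of a target net

loss = td_loss(torch.randn(32), torch.randn(32), torch.randn(32))
print(loss.item() >= 0)
```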
Role of Credit Assignment
Credit assignment addresses: "Which agent contributed what to the team reward?"
Value factorization supports implicit credit assignment (illustrated by the gradient sketch below):
- Gradients flowing into each Q_{i} act as counterfactual signals.
- Dueling architectures allow each agent to learn its influence.
- QPLEX provides clean marginal contributions implicitly.
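The toy snippet below illustrates the gradient view of implicit credit assignment; the hand-picked linear mixer and its weights are assumptions for illustration, not a learned network.

```python
import torch

# Gradients of Q_tot w.r.t. each Q_i act as implicit per-agent credit signals.
agent_qs = torch.randn(4, requires_grad=True)    # Q_1..Q_4 for the chosen actions
weights = torch.tensor([0.1, 0.5, 0.2, 1.0])     # illustrative positive mixing weights
q_tot = (weights * agent_qs).sum()               # toy differentiable mixer

q_tot.backward()
print(agent_qs.grad)                             # dQ_tot/dQ_i: larger gradient = more credit
```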
Performance on SMAC Benchmarks
QPLEX outperforms:
- QTRAN
- QMIX
- VDN
- Other CTDE baselines
Key reasons:
- Effective realization of IGM.
- Strong representational capacity.
- Off-policy stability.
Extensions: Diversity and Shared Parameter Learning
Parameter sharing improves sample efficiency, but can cause homogeneous agent behavior.
Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce:
- Identity-aware diversity.
- Information-based intrinsic rewards for agent differentiation.
- Balanced sharing vs agent specialization.
These techniques improve exploration and cooperation in complex multi-agent tasks.
Summary of Lecture 24
Key points:
- Cooperative MARL requires scalable value decomposition.
- IGM enables decentralized action selection from centralized training.
- QMIX introduces monotonic non-linear factorization.
- QPLEX achieves full IGM representational capacity.
- Implicit credit assignment arises naturally from factorization.
- Diversity methods allow richer multi-agent coordination strategies.