# CSE510 Deep Reinforcement Learning (Lecture 24)
## Cooperative Multi-Agent Reinforcement Learning (MARL)

This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.

## Multi-Agent Coordination Under Uncertainty

In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP).

A transition is defined as:

$$
P(s' \mid s, a_{1}, \dots, a_{n})
$$

Parameter explanations:

- $s$: current global state.
- $s'$: next global state.
- $a_{i}$: action taken by agent $i$.
- $P(\cdot)$: environment transition function.

The shared return is:

$$
\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]
$$

Parameter explanations:

- $\gamma$: discount factor.
- $T$: horizon length.
- $r_{t}$: shared team reward at time $t$.
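
As a concrete illustration of the objective above, here is a minimal Python sketch (with made-up reward values) of the discounted shared return that all agents jointly optimize:

```python
# Discounted shared team return for one episode (illustrative values).
gamma = 0.99                          # discount factor
team_rewards = [0.0, 1.0, 0.0, 5.0]   # shared reward r_t received at each step

discounted_return = sum(gamma**t * r for t, r in enumerate(team_rewards))
print(discounted_return)              # 0.99*1.0 + 0.99**3*5.0 ≈ 5.84
```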
### CTDE: Centralized Training, Decentralized Execution

Training may use global information (centralized), but execution relies only on each agent's local observations. This is critical for real-world deployment, where agents cannot access the global state at run time.
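
As a minimal sketch of the execution side, assuming each agent already has a trained utility network mapping its local observation to action values (names and structure are illustrative):

```python
import torch

def decentralized_step(agent_nets, local_obs):
    """Decentralized execution: each agent acts greedily on its OWN observation.

    agent_nets: list of per-agent Q-networks (parameters may be shared in practice)
    local_obs:  list of per-agent observation tensors
    """
    actions = []
    for q_net, obs in zip(agent_nets, local_obs):
        with torch.no_grad():
            q_values = q_net(obs)            # shape: (n_actions,)
        actions.append(int(q_values.argmax()))
    return actions  # joint action chosen without any access to the global state
```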
## Joint vs Factored Q-Learning

### Joint Q-Learning

In joint-action learning, one learns a full joint Q-function:

$$
Q_{tot}(s, a_{1}, \dots, a_{n})
$$

Parameter explanations:

- $Q_{tot}$: joint value for the entire team.
- $(a_{1}, \dots, a_{n})$: joint action vector across agents.

Problems:

- The joint action space grows exponentially in $n$.
- Learning is not scalable (see the sketch below).
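
To make the scalability problem concrete, here is a toy calculation (with illustrative numbers) of how fast the joint action space grows with the number of agents:

```python
# Joint action space size: k actions per agent, n agents => k**n joint actions.
k = 5                      # actions per agent (illustrative)
for n in (2, 5, 10, 20):   # team sizes
    print(f"n={n:2d}: {k**n:,} joint actions")
# n= 2: 25 | n= 5: 3,125 | n=10: 9,765,625 | n=20: 95,367,431,640,625
```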
### Value Factorization

Instead of learning $Q_{tot}$ directly, we factorize it into individual utility functions:

$$
Q_{tot}(s, \mathbf{a}) = f(Q_{1}(s,a_{1}), \dots, Q_{n}(s,a_{n}))
$$

Parameter explanations:

- $\mathbf{a}$: joint action vector.
- $f(\cdot)$: mixing network combining individual Q-values.

The goal is to enable decentralized greedy action selection.
## Individual-Global-Max (IGM) Condition

The IGM condition enables decentralized optimal action selection:

$$
\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a}) = \big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)
$$

Parameter explanations:

- $\arg\max_{\mathbf{a}}$: search for the best joint action.
- $\arg\max_{a_{i}}$: best local action for agent $i$.
- $Q_{i}(s,a_{i})$: individual utility for agent $i$.

IGM makes decentralized execution optimal with respect to the learned factorized value.
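
A small numerical check of what IGM buys us, using made-up per-agent utilities and a simple additive mixer: brute-force maximization over joint actions picks the same joint action as independent per-agent greedy selection.

```python
import itertools
import numpy as np

# Made-up utilities Q_i(s, a_i) for a fixed state s: 2 agents, 3 actions each.
Q = [np.array([1.0, 2.5, 0.3]),
     np.array([0.2, 0.1, 1.7])]

def q_tot(joint_action):
    # Additive mixer (VDN-style); any IGM-consistent mixer preserves this argmax.
    return sum(q[a] for q, a in zip(Q, joint_action))

# Centralized view: exhaustive search over all 3**2 joint actions.
joint_best = max(itertools.product(range(3), repeat=2), key=q_tot)

# Decentralized view: each agent greedily maximizes its own utility.
local_best = tuple(int(q.argmax()) for q in Q)

assert joint_best == local_best  # both yield (1, 2)
```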
## Linear Value Factorization

### VDN (Value Decomposition Networks)

VDN assumes:

$$
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} Q_{i}(s,a_{i})
$$

Parameter explanations:

- $Q_{i}(s,a_{i})$: value of agent $i$'s action.
- $\sum_{i=1}^{n}$: linear sum over agents.

Pros:

- Very simple, satisfies IGM.
- Fully decentralized execution.

Cons:

- Limited representation capacity.
- Cannot model non-linear teamwork interactions.
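
A minimal PyTorch-style sketch of the VDN mixer, assuming the per-agent utilities for the chosen actions have already been gathered:

```python
import torch

def vdn_mix(chosen_qs):
    """VDN: Q_tot is the plain sum of the agents' chosen Q-values.

    chosen_qs: tensor of shape (batch, n_agents), entry [b, i] = Q_i(s, a_i).
    """
    return chosen_qs.sum(dim=1)  # shape: (batch,)

q_tot = vdn_mix(torch.tensor([[2.5, 1.7]]))  # tensor([4.2000])
```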
## QMIX: Monotonic Value Factorization

QMIX uses a state-conditioned mixing network that enforces monotonicity:

$$
\frac{\partial Q_{tot}}{\partial Q_{i}} \ge 0
$$

Parameter explanations:

- $\partial Q_{tot} / \partial Q_{i}$: gradient of the global Q with respect to each individual Q.
- $\ge 0$: monotonicity constraint, which is sufficient for IGM.

The mixing function is:

$$
Q_{tot}(s,\mathbf{a}) = f_{mix}(Q_{1}, \dots, Q_{n}; s)
$$

Parameter explanations:

- $f_{mix}$: mixing network whose weights are constrained to be non-negative; in QMIX they are produced by hypernetworks conditioned on $s$.
- $s$: global state conditioning the mixing process.

Benefits:

- More expressive than VDN.
- Supports CTDE while keeping decentralized greedy execution.
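
A condensed sketch of a QMIX-style mixer. It follows the standard construction (state-conditioned hypernetworks produce the mixing weights, with absolute values enforcing non-negativity), but layer sizes and names here are illustrative rather than the reference implementation:

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixer: Q_tot = f_mix(Q_1..Q_n; s) with non-negative mixing weights."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: the global state produces the mixer's weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # (b, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)  # Q_tot, shape: (batch,)
```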
## Theoretical Issues With Linear and Monotonic Factorization

Limitations:

- Linear models (VDN) cannot represent complex coordination.
- QMIX's monotonicity constraint limits representational power for tasks requiring non-monotonic interactions between agents.
- Off-policy training can diverge under some factorizations.
## QPLEX: Duplex Dueling Multi-Agent Q-Learning

QPLEX introduces a duplex dueling architecture that satisfies IGM while providing full representation capacity within the IGM class.

### QPLEX Advantage Factorization

QPLEX factorizes:

$$
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} \lambda_{i}(s,\mathbf{a})\big(Q_{i}(s,a_{i}) - \max_{a'_{i}} Q_{i}(s,a'_{i})\big) + \max_{\mathbf{a}'} \sum_{i=1}^{n} Q_{i}(s,a'_{i})
$$

Parameter explanations:

- $\lambda_{i}(s,\mathbf{a})$: positive mixing coefficients.
- $Q_{i}(s,a_{i})$: individual utility.
- $\max_{a'_{i}} Q_{i}(s,a'_{i})$: per-agent greedy baseline, so the term in parentheses is a non-positive advantage.
- $\max_{\mathbf{a}'} \sum_{i} Q_{i}(s,a'_{i})$: value of the greedy joint action; because the sum is separable across agents, it equals $\sum_{i} \max_{a'_{i}} Q_{i}(s,a'_{i})$.

QPLEX properties:

- Fully satisfies IGM.
- Has full representation capacity for all IGM-consistent Q-functions.
- Enables stable off-policy training.
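
A simplified numerical sketch of the advantage-based mixing above, assuming the per-agent utilities and the positive coefficients $\lambda_i$ are given (in the paper they come from an attention module over the state and joint action; here they are simply passed in):

```python
import torch

def qplex_mix(agent_qs, chosen_actions, lambdas):
    """Simplified QPLEX-style mixing (attention module for lambda omitted).

    agent_qs:       (batch, n_agents, n_actions)  Q_i(s, .) for every agent
    chosen_actions: (batch, n_agents) long tensor, the a_i actually taken
    lambdas:        (batch, n_agents) positive coefficients lambda_i(s, a)
    """
    v_i = agent_qs.max(dim=2).values                                  # max_{a'_i} Q_i(s, a'_i)
    q_i = agent_qs.gather(2, chosen_actions.unsqueeze(2)).squeeze(2)  # Q_i(s, a_i)
    advantages = q_i - v_i                                            # non-positive per-agent advantage
    return v_i.sum(dim=1) + (lambdas * advantages).sum(dim=1)         # Q_tot(s, a)
```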
## QPLEX Training Objective

QPLEX minimizes a TD loss over $Q_{tot}$:

$$
L = \mathbb{E}\Big[\big(r + \gamma \max_{\mathbf{a}'} Q_{tot}(s',\mathbf{a}') - Q_{tot}(s,\mathbf{a})\big)^{2}\Big]
$$

Parameter explanations:

- $r$: shared team reward.
- $\gamma$: discount factor.
- $s'$: next state.
- $\mathbf{a}'$: next joint action used in the TD target.
- $Q_{tot}$: QPLEX global value estimate.

Because IGM holds, the target maximization $\max_{\mathbf{a}'} Q_{tot}(s',\mathbf{a}')$ can be computed by letting each agent maximize its own $Q_{i}$, avoiding a search over the exponential joint action space.
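
A brief sketch of this loss, assuming `q_tot` and the bootstrapped `next_q_tot_max` (computed with a target network, using per-agent greedy actions as IGM allows) are already available; terminal-state masking is omitted to match the formula above:

```python
import torch.nn.functional as F

def qtot_td_loss(q_tot, next_q_tot_max, rewards, gamma=0.99):
    """Squared TD error on the factored global value Q_tot.

    q_tot:          (batch,) Q_tot(s, a) for the actions actually taken
    next_q_tot_max: (batch,) max_{a'} Q_tot(s', a') from a target network
    rewards:        (batch,) shared team reward r
    """
    target = rewards + gamma * next_q_tot_max.detach()
    return F.mse_loss(q_tot, target)
```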
## Role of Credit Assignment

Credit assignment addresses the question: "Which agent contributed what to the team reward?"

Value factorization supports implicit credit assignment:

- Gradients flowing into each $Q_{i}$ act as counterfactual learning signals.
- Dueling architectures allow each agent to learn its own influence on the team value.
- QPLEX provides clean marginal contributions implicitly through its advantage terms.
## Performance on SMAC Benchmarks

On the StarCraft Multi-Agent Challenge (SMAC), QPLEX outperforms:

- QTRAN
- QMIX
- VDN
- Other CTDE baselines

Key reasons:

- Effective realization of IGM.
- Strong representational capacity.
- Off-policy stability.
## Extensions: Diversity and Shared-Parameter Learning

Parameter sharing improves sample efficiency, but can cause homogeneous agent behavior.

Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce:

- Identity-aware diversity.
- Information-based intrinsic rewards that encourage agent differentiation.
- A balance between parameter sharing and agent specialization.

These techniques improve exploration and cooperation in complex multi-agent tasks.
## Summary of Lecture 24

Key points:

- Cooperative MARL requires scalable value decomposition.
- IGM enables decentralized action selection from centralized training.
- QMIX introduces monotonic non-linear factorization.
- QPLEX achieves full IGM representational capacity.
- Implicit credit assignment arises naturally from factorization.
- Diversity methods allow richer multi-agent coordination strategies.