# CSE510 Deep Reinforcement Learning (Lecture 24)

## Cooperative Multi-Agent Reinforcement Learning (MARL)

This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.

## Multi-Agent Coordination Under Uncertainty

In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled as a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP). The transition function is:

$$
P(s' \mid s, a_{1}, \dots, a_{n})
$$

Parameter explanations:

- $s$: current global state.
- $s'$: next global state.
- $a_{i}$: action taken by agent $i$.
- $P(\cdot)$: environment transition function.

The shared return is:

$$
\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]
$$

Parameter explanations:

- $\gamma$: discount factor.
- $T$: horizon length.
- $r_{t}$: shared team reward at time $t$.

### CTDE: Centralized Training, Decentralized Execution

Training uses global information (centralized), but execution relies only on each agent's local observations. This separation is critical for real-world deployment.

## Joint vs. Factored Q-Learning

### Joint Q-Learning

In joint-action learning, one learns a full joint Q-function:

$$
Q_{tot}(s, a_{1}, \dots, a_{n})
$$

Parameter explanations:

- $Q_{tot}$: joint value for the entire team.
- $(a_{1}, \dots, a_{n})$: joint action vector across agents.

Problems:

- The joint action space grows exponentially in $n$.
- Learning does not scale to many agents.

### Value Factorization

Instead of learning $Q_{tot}$ directly, we factorize it into individual utility functions:

$$
Q_{tot}(s, \mathbf{a}) = f(Q_{1}(s,a_{1}), \dots, Q_{n}(s,a_{n}))
$$

Parameter explanations:

- $\mathbf{a}$: joint action vector.
- $f(\cdot)$: mixing network combining individual Q-values.

The goal is to enable decentralized greedy action selection.

## Individual-Global-Max (IGM) Condition

The IGM condition enables decentralized optimal action selection:

$$
\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a}) = \big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)
$$

Parameter explanations:

- $\arg\max_{\mathbf{a}}$: search for the best joint action.
- $\arg\max_{a_{i}}$: best local action for agent $i$.
- $Q_{i}(s,a_{i})$: individual utility for agent $i$.

IGM makes decentralized execution optimal with respect to the learned factorized value.

## Linear Value Factorization

### VDN (Value Decomposition Networks)

VDN assumes:

$$
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} Q_{i}(s,a_{i})
$$

Parameter explanations:

- $Q_{i}(s,a_{i})$: value of agent $i$'s action.
- $\sum_{i=1}^{n}$: linear sum over agents.

Pros:

- Very simple; satisfies IGM.
- Fully decentralized execution.

Cons:

- Limited representational capacity.
- Cannot model non-linear teamwork interactions.

## QMIX: Monotonic Value Factorization

QMIX uses a state-conditioned mixing network that enforces monotonicity:

$$
\frac{\partial Q_{tot}}{\partial Q_{i}} \ge 0
$$

Parameter explanations:

- $\partial Q_{tot} / \partial Q_{i}$: gradient of the global Q w.r.t. an individual Q.
- $\ge 0$: monotonicity constraint, a sufficient condition for IGM.

The mixing function is:

$$
Q_{tot}(s,\mathbf{a}) = f_{mix}(Q_{1}, \dots, Q_{n}; s)
$$

Parameter explanations:

- $f_{mix}$: neural network with non-negative mixing weights.
- $s$: global state conditioning the mixing process.

Benefits:

- More expressive than VDN.
- Supports CTDE while keeping decentralized greedy execution.
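To make the monotonic mixing concrete, below is a minimal PyTorch sketch, not the published QMIX implementation: hypernetworks map the global state $s$ to mixing weights, and taking the absolute value of those weights keeps $\partial Q_{tot} / \partial Q_{i} \ge 0$. Class names, layer sizes, and the two-layer mixer structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Toy QMIX-style mixer: combines per-agent Q-values into Q_tot.

    Monotonicity (dQ_tot/dQ_i >= 0) is enforced by using the absolute
    value of hypernetwork outputs as mixing weights.
    """

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks: map the global state s to mixing weights / biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        q = agent_qs.view(bs, 1, self.n_agents)
        # Non-negative weights guarantee monotonicity in each Q_i.
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(q, w1) + b1)           # (bs, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                   # (bs, 1, 1)
        return q_tot.view(bs, 1)

# Usage: 3 agents, 10-dim global state, batch of 4 transitions.
mixer = MonotonicMixer(n_agents=3, state_dim=10)
q_tot = mixer(torch.randn(4, 3), torch.randn(4, 10))
print(q_tot.shape)  # torch.Size([4, 1])
```

Because the biases produced by the hypernetworks are not constrained, the mixer can still shift $Q_{tot}$ freely with the state; only the dependence on each $Q_{i}$ is forced to be non-decreasing.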
## Theoretical Issues With Linear and Monotonic Factorization

Limitations:

- Linear models (VDN) cannot represent complex coordination.
- QMIX's monotonicity constraint limits representational power for tasks that require non-monotonic interactions.
- Off-policy training can diverge under some factorizations.

## QPLEX: Duplex Dueling Multi-Agent Q-Learning

QPLEX introduces a duplex dueling architecture that satisfies IGM while providing full representational capacity within the IGM function class.

### QPLEX Advantage Factorization

QPLEX factorizes the joint value as:

$$
Q_{tot}(s,\mathbf{a}) = \max_{\mathbf{a}'} \sum_{i=1}^{n} Q_{i}(s,a'_{i}) + \sum_{i=1}^{n} \lambda_{i}(s,\mathbf{a})\big(Q_{i}(s,a_{i}) - \max_{a'_{i}} Q_{i}(s,a'_{i})\big)
$$

Parameter explanations:

- $\lambda_{i}(s,\mathbf{a})$: positive mixing coefficients.
- $Q_{i}(s,a_{i})$: individual utility.
- $\max_{a'_{i}} Q_{i}(s,a'_{i})$: per-agent baseline, so each advantage term is non-positive.
- $\max_{\mathbf{a}'}$: maximization over joint actions, giving the joint value baseline.

QPLEX properties:

- Fully satisfies IGM.
- Has full representational capacity for the class of IGM-consistent Q-functions.
- Enables stable off-policy training.

## QPLEX Training Objective

QPLEX minimizes a TD loss over $Q_{tot}$ (a minimal code sketch combining the factorization and this loss appears at the end of these notes):

$$
L = \mathbb{E}\Big[\big(r + \gamma \max_{\mathbf{a}'} Q_{tot}(s',\mathbf{a}') - Q_{tot}(s,\mathbf{a})\big)^{2}\Big]
$$

Parameter explanations:

- $r$: shared team reward.
- $\gamma$: discount factor.
- $s'$: next state.
- $\mathbf{a}'$: next joint action evaluated in the TD target.
- $Q_{tot}$: QPLEX global value estimate.

## Role of Credit Assignment

Credit assignment addresses the question: "Which agent contributed what to the team reward?"

Value factorization supports implicit credit assignment:

- Gradients flowing into each $Q_{i}$ act as counterfactual signals.
- Dueling architectures let each agent learn its influence on the team value.
- QPLEX provides clean marginal contributions implicitly.

## Performance on SMAC Benchmarks

QPLEX outperforms:

- QTRAN
- QMIX
- VDN
- Other CTDE baselines

Key reasons:

- Effective realization of IGM.
- Strong representational capacity.
- Off-policy stability.

## Extensions: Diversity and Shared-Parameter Learning

Parameter sharing improves sample efficiency but can cause homogeneous agent behavior. Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce:

- Identity-aware diversity.
- Information-based intrinsic rewards for agent differentiation.
- A balance between sharing and agent specialization.

These techniques improve exploration and cooperation in complex multi-agent tasks.

## Summary of Lecture 24

Key points:

- Cooperative MARL requires scalable value decomposition.
- IGM enables decentralized action selection from centralized training.
- QMIX introduces monotonic non-linear factorization.
- QPLEX achieves full representational capacity within the IGM class.
- Implicit credit assignment arises naturally from factorization.
- Diversity methods allow richer multi-agent coordination strategies.

## Recommended Screenshot Frames for Lecture 24

- Lecture 24, page 16: CTDE and QMIX architecture diagram (mixing network). Subsection: "QMIX: Monotonic Value Factorization".
- Lecture 24, page 31: QPLEX benchmark performance on SMAC. Subsection: "Performance on SMAC Benchmarks".
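As referenced in the QPLEX training-objective section above, the following is a minimal PyTorch sketch of a QPLEX-style duplex dueling mixer and a one-step TD loss. It is a simplified illustration, not the published QPLEX implementation: the real architecture computes the $\lambda_i$ coefficients with attention over state and joint-action features, whereas here a small MLP with a softplus output stands in for that module, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuplexDuelingMixer(nn.Module):
    """Toy QPLEX-style mixer.

    Q_tot(s, a) = sum_i max_{a_i'} Q_i(s, a_i')                        (joint value baseline)
                  + sum_i lambda_i(s, a) * (Q_i(s, a_i) - max Q_i)     (weighted advantages)

    With lambda_i > 0 and non-positive advantages, the joint argmax
    coincides with the per-agent argmaxes, so IGM holds.
    """

    def __init__(self, n_agents: int, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        # Small MLP producing one positive coefficient per agent from (s, joint action).
        self.lambda_net = nn.Sequential(
            nn.Linear(state_dim + n_agents * action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_agents),
        )

    def forward(self, chosen_qs, max_qs, state, joint_action_onehot):
        # chosen_qs, max_qs: (batch, n_agents); state: (batch, state_dim)
        advantages = chosen_qs - max_qs                      # non-positive when max_qs are true per-agent maxima
        lambdas = F.softplus(self.lambda_net(torch.cat([state, joint_action_onehot], dim=-1)))
        v_tot = max_qs.sum(dim=-1, keepdim=True)             # max_a sum_i Q_i = sum_i max_{a_i} Q_i
        a_tot = (lambdas * advantages).sum(dim=-1, keepdim=True)
        return v_tot + a_tot                                 # Q_tot, shape (batch, 1)

def td_loss(q_tot, reward, q_tot_next_max, gamma=0.99):
    """One-step TD loss on Q_tot (target-network details omitted)."""
    target = reward + gamma * q_tot_next_max
    return F.mse_loss(q_tot, target.detach())

# Usage with per-agent Q-tables of shape (batch, n_agents, action_dim).
agent_q_values = torch.randn(5, 2, 4)
actions = torch.randint(0, 4, (5, 2))
chosen_qs = agent_q_values.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (5, 2)
max_qs = agent_q_values.max(dim=-1).values                                # (5, 2)
onehot = F.one_hot(actions, num_classes=4).float().view(5, -1)            # (5, 8)
mixer = DuplexDuelingMixer(n_agents=2, state_dim=8, action_dim=4)
q_tot = mixer(chosen_qs, max_qs, torch.randn(5, 8), onehot)
loss = td_loss(q_tot, torch.randn(5, 1), torch.randn(5, 1))
```

Because the $\lambda_i$ scale only non-positive advantage terms, the mixer can express rich joint-action dependence without breaking the decentralized greedy action selection that IGM guarantees.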