CSE510 Deep Reinforcement Learning (Lecture 25)
Linear Value Factorization
Why Does Linear Factorization Work?
- Multi-agent reinforcement learning methods are mostly empirical
- Theoretical model: Factored Multi-Agent Fitted Q-Iteration (FMA-FQI)
Theorem 1
FMA-FQI with linear factorization realizes a counterfactual credit assignment mechanism.
Agent i:
Q_i^{(t+1)}(s,a_i)=\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]-\frac{n-1}{n}\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]
Here \mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right] is the evaluation of action a_i,
and \mathbb{E}_{a'}\left[y^{(t)}(s,a')\right] (scaled by \frac{n-1}{n}) is the counterfactual baseline.
The target Q-value: y^{(t)}(s,a)=r+\gamma\max_{a'}Q_{tot}^{(t)}(s',a')
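A minimal numeric sketch of this update for a one-step, two-agent matrix game (so y = r), assuming a uniform data distribution for both expectations; the payoff matrix is the one used in the Prisoner's Dilemma example below, and because the result depends on the distributional assumption, the values here differ from the factored table shown later.
```python
import numpy as np

# One-step, two-agent matrix game, so the target is just the reward: y(s, a) = r(a).
# Rows index agent 1's action, columns index agent 2's action.
y = np.array([[8.0, -12.0],
              [-12.0, 0.0]])
n = 2  # number of agents

def counterfactual_q(y, i, a_i):
    """Q_i^{(t+1)}(s, a_i) = E_{a'_{-i}}[y(s, a_i (+) a'_{-i})] - (n-1)/n * E_{a'}[y(s, a')],
    with both expectations taken under a uniform distribution (an assumption here)."""
    evaluation = y[a_i, :].mean() if i == 0 else y[:, a_i].mean()  # evaluation of a_i
    baseline = (n - 1) / n * y.mean()                              # counterfactual baseline
    return evaluation - baseline

for i in range(n):
    print(i, [float(counterfactual_q(y, i, a_i)) for a_i in range(2)])
# Both agents get [0.0, -4.0] for this payoff matrix under the uniform distribution.
```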
Theorem 2
FMA-FQI with linear factorization has local convergence under on-policy training.
Limitations of Linear Factorization
Linear: Q_{tot}(s,a)=\sum_{i=1}^{n}Q_{i}(s,a_i)
Limited Representation: the factored Q_{tot} can be suboptimal (a Prisoner's-Dilemma-style matrix game):
| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | 8 | -12 |
| Action 2 | -12 | 0 |
After linear factorization (Q_{tot}(a_1,a_2)=Q_1(a_1)+Q_2(a_2)), the learned values become:
| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | -6.5 | -5 |
| Action 2 | -5 | -3.5 |
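A quick check of the factored table, assuming it comes from per-agent utilities read off the table (Q_1(Action 1) = Q_2(Action 1) = -3.25, Q_1(Action 2) = Q_2(Action 2) = -1.75, a hypothetical symmetric decomposition): the sum reproduces the table exactly, but its greedy joint action earns a true payoff of 0 instead of the optimal 8.
```python
import numpy as np

# True joint payoff: rows = agent 1's action, columns = agent 2's action.
payoff = np.array([[8.0, -12.0],
                   [-12.0, 0.0]])

# One additive decomposition consistent with the factored table above
# (assumed symmetric: both agents share the same per-action utilities).
q1 = np.array([-3.25, -1.75])
q2 = np.array([-3.25, -1.75])
q_tot = q1[:, None] + q2[None, :]   # Q_tot(a_1, a_2) = Q_1(a_1) + Q_2(a_2)
print(q_tot)                        # [[-6.5 -5. ] [-5.  -3.5]], matching the table

greedy = tuple(int(k) for k in np.unravel_index(q_tot.argmax(), q_tot.shape))
best = tuple(int(k) for k in np.unravel_index(payoff.argmax(), payoff.shape))
print(greedy, payoff[greedy])   # (1, 1) -> true payoff 0.0 (suboptimal)
print(best, payoff[best])       # (0, 0) -> true payoff 8.0 (optimal)
```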
Theorem 3
FMA-FQI with linear factorization may diverge under off-policy training.
Perfect Alignment: IGM Factorization
- Individual-Global Maximization (IGM) Constraint
\argmax_{a}Q_{tot}(s,a)=(\argmax_{a_1}Q_1(s,a_1), \dots, \argmax_{a_n}Q_n(s,a_n))
- IGM Factorization:
Q_{tot}(s,a)=f(Q_1(s,a_1), \dots, Q_n(s,a_n)), where f is the factorization function
f realizes all functions satisfying IGM.
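To make the constraint concrete, here is a minimal sketch (the random utilities and the choice f = sum are illustrative assumptions; the sum is just one function satisfying IGM) showing that decentralized per-agent argmaxes select the same joint action as a centralized argmax over Q_tot:
```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

# Per-agent utilities Q_i(s, a_i) for one fixed state (random, for illustration).
q_locals = rng.normal(size=(n_agents, n_actions))

# One simple factorization function f that satisfies IGM: the sum (monotone in each Q_i).
def q_tot(joint_action):
    return sum(q_locals[i, a_i] for i, a_i in enumerate(joint_action))

# Decentralized greedy selection: each agent maximizes its own Q_i independently.
decentralized = tuple(int(q_locals[i].argmax()) for i in range(n_agents))

# Centralized greedy selection: brute force over all n_actions ** n_agents joint actions.
centralized = max(product(range(n_actions), repeat=n_agents), key=q_tot)

print(decentralized, centralized)   # identical: under IGM the two argmaxes coincide
```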
FQI-IGM: Fitted Q-Iteration with IGM Factorization
Theorem 4
Convergence & optimality. FQI-IGM globally converges to the optimal value function in multi-agent MDPs.
QPLEX: Multi-Agent Q-Learning with IGM Factorization
IGM: \argmax_a Q_{tot}(s,a)=\begin{pmatrix} \argmax_{a_1}Q_1(s,a_1) \\ \vdots \\ \argmax_{a_n}Q_n(s,a_n) \end{pmatrix}
Core idea:
- Fit the values of optimal actions accurately
- Approximate the values of non-optimal actions
QPLEX Mixing Network:
Q_{tot}(s,a)=\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')+\sum_{i=1}^{n} \lambda_i(s,a)(Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i'))
Here \sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i') is the baseline, equal to \max_a Q_{tot}(s,a) under IGM,
and Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i') is agent i's "advantage" (always \le 0).
Coefficients: \lambda_i(s,a)>0, easily realized and learned with neural networks
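A minimal NumPy sketch of this mixing computation for one state and one joint action (the random utilities and coefficients are placeholders; in QPLEX the \lambda_i come from a learned network conditioned on s and a):
```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_actions = 3, 4

# Per-agent utilities Q_i(s, a_i') for one fixed state, and one chosen joint action a.
q_locals = rng.normal(size=(n_agents, n_actions))
joint_action = (2, 0, 3)

# Positive coefficients lambda_i(s, a); in QPLEX they are produced by a learned
# network conditioned on s and a. Random positive placeholders here.
lam = np.exp(rng.normal(size=n_agents))

v = q_locals.max(axis=1)                                          # max_{a_i'} Q_i(s, a_i')
chosen = np.array([q_locals[i, a] for i, a in enumerate(joint_action)])
adv = chosen - v                                                  # advantages, always <= 0

q_tot = v.sum() + (lam * adv).sum()
print(float(q_tot))

# Since lam > 0 and every advantage is <= 0 (zero exactly at each agent's greedy
# action), Q_tot is maximized by the joint action of per-agent argmaxes, so the
# IGM constraint holds by construction.
```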
To be continued next time...