# CSE510 Deep Reinforcement Learning (Lecture 25)

> Restore human intelligence

## Linear Value Factorization

[link to paper](https://arxiv.org/abs/2006.00587)

### Why Does Linear Factorization Work?

- Multi-agent reinforcement learning is mostly empirical
- Theoretical model: Factored Multi-Agent Fitted Q-Iteration (FMA-FQI)

#### Theorem 1

Linear value factorization realizes a **counterfactual** credit assignment mechanism. For agent $i$:

$$
Q_i^{(t+1)}(s,a_i)=\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]-\frac{n-1}{n}\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]
$$

Here $\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]$ is the evaluation of $a_i$, and $\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]$ is the baseline (a numerical sketch appears at the end of these notes).

The target $Q$-value: $y^{(t)}(s,a)=r+\gamma\max_{a'}Q_{tot}^{(t)}(s',a')$

#### Theorem 2

Linear value factorization has local convergence with on-policy training.

##### Limitations of Linear Factorization

Linear: $Q_{tot}(s,a)=\sum_{i=1}^{n}Q_{i}(s,a_i)$

Limited representational capacity: the greedy joint action can be suboptimal (Prisoner's Dilemma example).

True payoffs:

| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | **8** | -12 |
| Action 2 | -12 | 0 |

After linear factorization:

| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | -6.5 | -5 |
| Action 2 | -5 | **-3.5** |

(The sketch at the end of these notes reproduces this effect qualitatively.)

#### Theorem 3

Linear value factorization may diverge with off-policy training.

### Perfect Alignment: IGM Factorization

- Individual-Global-Max (IGM) constraint:

$$
\argmax_{a}Q_{tot}(s,a)=\left(\argmax_{a_1}Q_1(s,a_1), \dots, \argmax_{a_n}Q_n(s,a_n)\right)
$$

- IGM factorization: $Q_{tot}(s,a)=f(Q_1(s,a_1), \dots, Q_n(s,a_n))$
- The factorization function $f$ realizes all functions satisfying IGM.
- FQI-IGM: Fitted Q-Iteration with IGM Factorization

#### Theorem 4

Convergence & optimality: FQI-IGM globally converges to the optimal value function in multi-agent MDPs.

### QPLEX: Multi-Agent Q-Learning with IGM Factorization

[link to paper](https://arxiv.org/pdf/2008.01062)

IGM:

$$
\argmax_a Q_{tot}(s,a)=\begin{pmatrix} \argmax_{a_1}Q_1(s,a_1) \\ \vdots \\ \argmax_{a_n}Q_n(s,a_n) \end{pmatrix}
$$

Core idea:

- Fit the values of optimal actions well
- Approximate the values of non-optimal actions

QPLEX mixing network:

$$
Q_{tot}(s,a)=\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')+\sum_{i=1}^{n} \lambda_i(s,a)\left(Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i')\right)
$$

Here $\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')$ is the baseline $\max_a Q_{tot}(s,a)$, and $Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i')$ is the "advantage".

Coefficients: $\lambda_i(s,a)>0$, **easily realized and learned with neural networks** (a minimal mixing-network sketch appears at the end of these notes).

> Continue next time...
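As a concrete illustration of the Theorem 1 decomposition, the sketch below evaluates it on the 2-agent payoff matrix above, assuming a uniform data distribution over actions and using the payoff matrix itself as the target $y^{(t)}$ (a single-state game, so the bootstrap term is dropped). Variable names are illustrative only.

```python
import numpy as np

# Counterfactual credit assignment (Theorem 1) for a 2-agent, single-state
# matrix game, assuming a uniform data distribution over joint actions.
y = np.array([[8.0, -12.0],
              [-12.0, 0.0]])   # target values y(s, (a_1, a_2))
n = 2

baseline = y.mean()                            # E_{a'}[ y(s, a') ]
q1 = y.mean(axis=1) - (n - 1) / n * baseline   # E_{a_2'}[ y(s, a_1 + a_2') ] - (n-1)/n * baseline
q2 = y.mean(axis=0) - (n - 1) / n * baseline   # same for agent 2

print(q1, q2)                      # per-agent credit for each local action
print(q1[:, None] + q2[None, :])   # reconstructed Q_tot = Q_1 + Q_2
```

Under the uniform distribution this additive reconstruction coincides with the least-squares projection of $y^{(t)}$ onto linearly factored functions, which is the projection view taken in the FMA-FQI analysis.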
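The next sketch illustrates why the linearly factored greedy point can be suboptimal in the Prisoner's Dilemma table above. It assumes a hypothetical on-policy, $\epsilon$-greedy product distribution concentrated on (Action 2, Action 2); the exact fitted values in the lecture table depend on the data distribution used in the paper's example, so only the qualitative conclusion is reproduced here.

```python
import numpy as np

# Best additive fit Q_1(a_1) + Q_2(a_2) under a hypothetical product
# distribution p1 x p2 that mostly plays Action 2 (on-policy-style data).
y = np.array([[8.0, -12.0],
              [-12.0, 0.0]])
eps = 1.0 / 3.0                       # hypothetical exploration rate
p1 = np.array([eps, 1.0 - eps])       # agent 1 plays Action 2 most of the time
p2 = np.array([eps, 1.0 - eps])       # agent 2 likewise

q1 = y @ p2                           # E_{a_2 ~ p2}[ y(a_1, a_2) ]
q2 = p1 @ y                           # E_{a_1 ~ p1}[ y(a_1, a_2) ]
mean_y = p1 @ y @ p2                  # E[ y ] under the product distribution
q_tot = q1[:, None] + q2[None, :] - mean_y

print(np.round(q_tot, 2))
# The greedy joint action of the fitted Q_tot is (Action 2, Action 2),
# even though the true optimum is (Action 1, Action 1) with payoff 8.
print(np.unravel_index(q_tot.argmax(), q_tot.shape))
```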
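The IGM constraint can be checked by brute force on a small joint action space. Below is a minimal sketch for a fixed state; `satisfies_igm` and its arguments are hypothetical names, and ties in the argmax are ignored for simplicity.

```python
import itertools
import numpy as np

def satisfies_igm(q_tot, per_agent_qs):
    # q_tot: callable mapping a joint-action tuple to a scalar Q_tot(s, a)
    # per_agent_qs: list of 1-D arrays, per_agent_qs[i][a_i] = Q_i(s, a_i)
    greedy_locals = tuple(int(np.argmax(q)) for q in per_agent_qs)
    joint_actions = itertools.product(*(range(len(q)) for q in per_agent_qs))
    greedy_joint = max(joint_actions, key=q_tot)
    return greedy_joint == greedy_locals

# Example: an additively factored Q_tot always satisfies IGM.
q1 = np.array([0.0, -4.0])
q2 = np.array([0.0, -4.0])
print(satisfies_igm(lambda a: q1[a[0]] + q2[a[1]], [q1, q2]))   # True
```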
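Finally, a minimal sketch of the QPLEX mixing rule above. This is not the published QPLEX architecture (which uses a duelling mixing network with multi-head attention to produce the $\lambda_i$ weights); here a plain MLP with a softplus output stands in for it, and the module name, shapes, and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QPlexStyleMixer(nn.Module):
    """Simplified QPLEX-style mixing: Q_tot = sum_i max Q_i + sum_i lambda_i * advantage_i."""

    def __init__(self, n_agents, n_actions, state_dim, hidden_dim=64):
        super().__init__()
        self.n_actions = n_actions
        # lambda_i(s, a): one strictly positive coefficient per agent,
        # produced here by a small MLP over the state and the joint action.
        self.lambda_net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_agents),
        )

    def forward(self, agent_qs, actions, state):
        # agent_qs: (batch, n_agents, n_actions)  per-agent Q_i(s, .)
        # actions:  (batch, n_agents), dtype long  chosen action indices a_i
        # state:    (batch, state_dim)             global state s
        q_max, _ = agent_qs.max(dim=2)                                  # max_{a_i'} Q_i(s, a_i')
        q_chosen = agent_qs.gather(2, actions.unsqueeze(2)).squeeze(2)  # Q_i(s, a_i)
        advantages = q_chosen - q_max                                   # <= 0 by construction

        actions_onehot = F.one_hot(actions, self.n_actions).float()
        lam_input = torch.cat([state, actions_onehot.flatten(1)], dim=1)
        lam = F.softplus(self.lambda_net(lam_input)) + 1e-6             # lambda_i(s, a) > 0

        # Q_tot(s, a) = sum_i max Q_i + sum_i lambda_i(s, a) * (Q_i(s, a_i) - max Q_i)
        return q_max.sum(dim=1) + (lam * advantages).sum(dim=1)
```

Because every advantage term is non-positive and every $\lambda_i$ is strictly positive, $Q_{tot}$ is maximized exactly when each agent plays its own greedy action, which is the IGM property stated above.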