updates?
@@ -90,8 +90,6 @@ Parameter explanations:
IGM makes decentralized execution optimal with respect to the learned factorized value.
## Linear Value Factorization
### VDN (Value Decomposition Networks)
VDN assumes:
content/CSE510/CSE510_L25.md
@@ -0,0 +1,101 @@
# CSE510 Deep Reinforcement Learning (Lecture 25)
> Restore human intelligence
## Linear Value Factorization
[link to paper](https://arxiv.org/abs/2006.00587)
### Why Does Linear Factorization Work?
- Multi-agent reinforcement learning methods are mostly empirical
- Theoretical Model: Factored Multi-Agent Fitted Q-Iteration (FMA-FQI)
#### Theorem 1
Linear factorization realizes a **counterfactual** credit assignment mechanism. For agent $i$:

$$
Q_i^{(t+1)}(s,a_i)=\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]-\frac{n-1}{n}\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]
$$

Here $\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]$ is the evaluation of $a_i$, and $\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]$ is the baseline.

The target $Q$-value: $y^{(t)}(s,a)=r+\gamma\max_{a'}Q_{tot}^{(t)}(s',a')$
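To make the update concrete, here is a minimal pure-Python sketch of one counterfactual iteration on a single-state, two-agent matrix game with $\gamma=0$ and uniform sampling of the other agent's actions; the payoff values reuse the matrix-game example later in these notes, but the uniform sampling distribution is an illustrative assumption, so the resulting values need not match that table.

```python
# One counterfactual credit-assignment update (Theorem 1) in a single-state,
# two-agent matrix game with gamma = 0, so y(s, a) = r(a).
# Uniform sampling of a' is an illustrative assumption.

n = 2
payoff = [[8.0, -12.0],   # r(a_1, a_2): rows = agent 1's action, cols = agent 2's
          [-12.0, 0.0]]
actions = [0, 1]

mean_all = sum(payoff[a1][a2] for a1 in actions for a2 in actions) / 4.0

# Q_i(a_i) = E_{a_-i'}[ y(a_i, a_-i') ] - (n-1)/n * E_{a'}[ y(a') ]
q1 = {a1: sum(payoff[a1][a2] for a2 in actions) / 2.0 - (n - 1) / n * mean_all
      for a1 in actions}
q2 = {a2: sum(payoff[a1][a2] for a1 in actions) / 2.0 - (n - 1) / n * mean_all
      for a2 in actions}

# Reconstructed linear value: Q_tot(a) = Q_1(a_1) + Q_2(a_2)
q_tot = [[q1[a1] + q2[a2] for a2 in actions] for a1 in actions]
print(q1, q2, q_tot)
```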
#### Theorem 2
Linear factorization has local convergence with on-policy training.
##### Limitations of Linear Factorization
Linear: $Q_{tot}(s,a)=\sum_{i=1}^{n}Q_{i}(s,a_i)$

Limited representation: linear factorization can be suboptimal, e.g., in this Prisoner's-Dilemma-style matrix game (a small numerical sketch follows the second table):

| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | **8** | -12 |
| Action 2 | -12 | 0 |
After linear factorization:

| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | -6.5 | -5 |
| Action 2 | -5 | **-3.5** |
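As a rough numerical companion to the tables above, the sketch below fits an additive $Q_1(a_1)+Q_2(a_2)$ to the payoff matrix under a hypothetical behavior distribution that mostly plays Action 2; the distribution, learning rate, and exact fitted numbers are assumptions (they will not match the table), but the qualitative failure is the same: the decentralized greedy joint action lands on the suboptimal (Action 2, Action 2).

```python
# Weighted least-squares additive fit q1[a1] + q2[a2] ~ payoff[a1][a2].
# The visitation weights model a hypothetical behavior policy that mostly
# plays Action 2, so the jointly optimal (Action 1, Action 1) entry is
# rarely observed -- an assumption for illustration only.
# Indices 0/1 correspond to Action 1/Action 2 in the tables above.

payoff = [[8.0, -12.0],
          [-12.0, 0.0]]
p = [0.2, 0.8]               # hypothetical per-agent action probabilities
actions = [0, 1]

q1 = {a: 0.0 for a in actions}
q2 = {a: 0.0 for a in actions}
lr = 0.05
for _ in range(20000):
    for a1 in actions:
        for a2 in actions:
            w = p[a1] * p[a2]                  # visitation probability of (a1, a2)
            err = q1[a1] + q2[a2] - payoff[a1][a2]
            q1[a1] -= lr * w * err
            q2[a2] -= lr * w * err

fitted = [[round(q1[a1] + q2[a2], 2) for a2 in actions] for a1 in actions]
greedy = (max(q1, key=q1.get), max(q2, key=q2.get))
print("fitted additive Q_tot:", fitted)
print("decentralized greedy:", greedy)   # (1, 1) -> true payoff 0, optimum is 8
```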
#### Theorem 3
Linear factorization may diverge with off-policy training.
### Perfect Alignment: IGM Factorization
- Individual-Global Maximization (IGM) Constraint

$$
\argmax_{a}Q_{tot}(s,a)=(\argmax_{a_1}Q_1(s,a_1), \dots, \argmax_{a_n}Q_n(s,a_n))
$$

- IGM Factorization: $Q_{tot}(s,a)=f(Q_1(s,a_1), \dots, Q_n(s,a_n))$
- The factorization function $f$ realizes all functions satisfying IGM (a brute-force numerical check follows this list).
- FQI-IGM: Fitted Q-Iteration with IGM Factorization
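As a small sanity check (the per-agent utilities and the particular monotonic mixing weights are assumptions, not the lecture's construction), this sketch builds $Q_{tot}$ from individual utilities through a positively weighted $f$ and verifies the IGM property by brute force:

```python
from itertools import product

# Hypothetical per-agent utilities Q_i(s, a_i) for one fixed state.
Q = [
    [1.0, 3.0, 2.0],   # agent 1
    [0.5, -1.0, 4.0],  # agent 2
]

def q_tot(joint):
    # A monotonic mixing function f: strictly positive weights, so raising any
    # Q_i can only raise Q_tot -- a simple sufficient condition for IGM.
    weights = [0.7, 1.3]
    return sum(w * Q[i][a] for i, (w, a) in enumerate(zip(weights, joint)))

# Decentralized argmax: each agent maximizes its own utility.
individual = tuple(qi.index(max(qi)) for qi in Q)

# Centralized argmax over the joint action space.
joint_best = max(product(range(3), range(3)), key=q_tot)

print(individual, joint_best)   # both (1, 2)
assert individual == joint_best
```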
#### Theorem 4
Convergence & optimality: FQI-IGM globally converges to the optimal value function in multi-agent MDPs.
### QPLEX: Multi-Agent Q-Learning with IGM Factorization
[link to paper](https://arxiv.org/pdf/2008.01062)

IGM:

$$
\argmax_a Q_{tot}(s,a)=\begin{pmatrix}
\argmax_{a_1}Q_1(s,a_1) \\
\vdots \\
\argmax_{a_n}Q_n(s,a_n)
\end{pmatrix}
$$

Core idea:
- Fit the values of optimal actions well
- Approximate the values of non-optimal actions

QPLEX Mixing Network:

$$
Q_{tot}(s,a)=\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')+\sum_{i=1}^{n} \lambda_i(s,a)(Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i'))
$$

Here $\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')$ is the baseline $\max_a Q_{tot}(s,a)$,

and $Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i')$ is the "advantage".

Coefficients: $\lambda_i(s,a)>0$, **easily realized and learned with neural networks**.
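To show how the pieces of the mixing formula fit together, here is a minimal pure-Python sketch of the QPLEX-style computation for one state; the per-agent values and the positive $\lambda_i$ coefficients are illustrative assumptions (in QPLEX they are produced by learned networks conditioned on $s$ and $a$).

```python
# QPLEX-style mixing for one state: Q_tot = baseline + sum_i lambda_i * advantage_i.
# Values and lambdas are illustrative; in QPLEX the lambdas are learned and
# state-action dependent, with lambda_i(s, a) > 0.

Q = [
    [1.0, 3.0, 2.0],   # Q_1(s, .)
    [0.5, -1.0, 4.0],  # Q_2(s, .)
]

def q_tot(joint, lam):
    baseline = sum(max(qi) for qi in Q)                    # sum_i max_a' Q_i(s, a')
    advantages = [Q[i][a] - max(Q[i]) for i, a in enumerate(joint)]
    return baseline + sum(l * adv for l, adv in zip(lam, advantages))

lam = [0.8, 1.5]   # any positive coefficients keep the argmax of Q_tot aligned (IGM)
print(q_tot((1, 2), lam))   # greedy joint action: advantages are zero -> baseline (7.0)
print(q_tot((0, 0), lam))   # non-greedy action: strictly below the baseline
```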
> Continued next time...
@@ -27,4 +27,5 @@ export default {
CSE510_L22: "CSE510 Deep Reinforcement Learning (Lecture 22)",
CSE510_L23: "CSE510 Deep Reinforcement Learning (Lecture 23)",
CSE510_L24: "CSE510 Deep Reinforcement Learning (Lecture 24)",
CSE510_L25: "CSE510 Deep Reinforcement Learning (Lecture 25)",
}