updates?
@@ -90,8 +90,6 @@ Parameter explanations:
IGM makes decentralized execution optimal with respect to the learned factorized value.
## Linear Value Factorization
### VDN (Value Decomposition Networks)
VDN assumes:
content/CSE510/CSE510_L25.md
@@ -0,0 +1,101 @@
# CSE510 Deep Reinforcement Learning (Lecture 25)
> Restore human intelligence
## Linear Value Factorization
[link to paper](https://arxiv.org/abs/2006.00587)
### Why Does Linear Factorization Work?
- Multi-agent reinforcement learning methods are mostly empirical
- Theoretical Model: Factored Multi-Agent Fitted Q-Iteration (FMA-FQI)
#### Theorem 1
Linear factorization realizes a **counterfactual** credit assignment mechanism. For agent $i$:

$$
Q_i^{(t+1)}(s,a_i)=\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]-\frac{n-1}{n}\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]
$$

Here $\mathbb{E}_{a_{-i}'}\left[y^{(t)}(s,a_i\oplus a_{-i}')\right]$ is the evaluation of $a_i$, and $\mathbb{E}_{a'}\left[y^{(t)}(s,a')\right]$ is the baseline.

The target $Q$-value: $y^{(t)}(s,a)=r+\gamma\max_{a'}Q_{tot}^{(t)}(s',a')$
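To make the update concrete, here is a minimal pure-Python sketch of one counterfactual iteration on a single-state, two-agent matrix game with $\gamma=0$ and uniform sampling of the other agent's actions; the payoff values reuse the matrix-game example later in these notes, but the uniform sampling distribution is an illustrative assumption, so the resulting values need not match that table.

```python
# One counterfactual credit-assignment update (Theorem 1) in a single-state,
# two-agent matrix game with gamma = 0, so y(s, a) = r(a).
# Uniform sampling of a' is an illustrative assumption.

n = 2
payoff = [[8.0, -12.0],   # r(a_1, a_2): rows = agent 1's action, cols = agent 2's
          [-12.0, 0.0]]
actions = [0, 1]

mean_all = sum(payoff[a1][a2] for a1 in actions for a2 in actions) / 4.0

# Q_i(a_i) = E_{a_-i'}[ y(a_i, a_-i') ] - (n-1)/n * E_{a'}[ y(a') ]
q1 = {a1: sum(payoff[a1][a2] for a2 in actions) / 2.0 - (n - 1) / n * mean_all
      for a1 in actions}
q2 = {a2: sum(payoff[a1][a2] for a1 in actions) / 2.0 - (n - 1) / n * mean_all
      for a2 in actions}

# Reconstructed linear value: Q_tot(a) = Q_1(a_1) + Q_2(a_2)
q_tot = [[q1[a1] + q2[a2] for a2 in actions] for a1 in actions]
print(q1, q2, q_tot)
```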
#### Theorem 2
Linear factorization has local convergence with on-policy training.
##### Limitations of Linear Factorization
Linear: $Q_{tot}(s,a)=\sum_{i=1}^{n}Q_{i}(s,a_i)$

Limited representation: linear factorization can be suboptimal, e.g., in this Prisoner's-Dilemma-style matrix game (a small numerical sketch follows the second table):

| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | **8** | -12 |
| Action 2 | -12 | 0 |
After linear factorization:

| a_1 \ a_2 | Action 1 | Action 2 |
|---|---|---|
| Action 1 | -6.5 | -5 |
| Action 2 | -5 | **-3.5** |
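As a rough numerical companion to the tables above, the sketch below fits an additive $Q_1(a_1)+Q_2(a_2)$ to the payoff matrix under a hypothetical behavior distribution that mostly plays Action 2; the distribution, learning rate, and exact fitted numbers are assumptions (they will not match the table), but the qualitative failure is the same: the decentralized greedy joint action lands on the suboptimal (Action 2, Action 2).

```python
# Weighted least-squares additive fit q1[a1] + q2[a2] ~ payoff[a1][a2].
# The visitation weights model a hypothetical behavior policy that mostly
# plays Action 2, so the jointly optimal (Action 1, Action 1) entry is
# rarely observed -- an assumption for illustration only.
# Indices 0/1 correspond to Action 1/Action 2 in the tables above.

payoff = [[8.0, -12.0],
          [-12.0, 0.0]]
p = [0.2, 0.8]               # hypothetical per-agent action probabilities
actions = [0, 1]

q1 = {a: 0.0 for a in actions}
q2 = {a: 0.0 for a in actions}
lr = 0.05
for _ in range(20000):
    for a1 in actions:
        for a2 in actions:
            w = p[a1] * p[a2]                  # visitation probability of (a1, a2)
            err = q1[a1] + q2[a2] - payoff[a1][a2]
            q1[a1] -= lr * w * err
            q2[a2] -= lr * w * err

fitted = [[round(q1[a1] + q2[a2], 2) for a2 in actions] for a1 in actions]
greedy = (max(q1, key=q1.get), max(q2, key=q2.get))
print("fitted additive Q_tot:", fitted)
print("decentralized greedy:", greedy)   # (1, 1) -> true payoff 0, optimum is 8
```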
#### Theorem 3
Linear factorization may diverge with off-policy training.
### Perfect Alignment: IGM Factorization
- Individual-Global Maximization (IGM) Constraint

$$
\argmax_{a}Q_{tot}(s,a)=(\argmax_{a_1}Q_1(s,a_1), \dots, \argmax_{a_n}Q_n(s,a_n))
$$

- IGM Factorization: $Q_{tot}(s,a)=f(Q_1(s,a_1), \dots, Q_n(s,a_n))$
- The factorization function $f$ realizes all functions satisfying IGM (a brute-force numerical check follows this list).
- FQI-IGM: Fitted Q-Iteration with IGM Factorization
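As a small sanity check (the per-agent utilities and the particular monotonic mixing weights are assumptions, not the lecture's construction), this sketch builds $Q_{tot}$ from individual utilities through a positively weighted $f$ and verifies the IGM property by brute force:

```python
from itertools import product

# Hypothetical per-agent utilities Q_i(s, a_i) for one fixed state.
Q = [
    [1.0, 3.0, 2.0],   # agent 1
    [0.5, -1.0, 4.0],  # agent 2
]

def q_tot(joint):
    # A monotonic mixing function f: strictly positive weights, so raising any
    # Q_i can only raise Q_tot -- a simple sufficient condition for IGM.
    weights = [0.7, 1.3]
    return sum(w * Q[i][a] for i, (w, a) in enumerate(zip(weights, joint)))

# Decentralized argmax: each agent maximizes its own utility.
individual = tuple(qi.index(max(qi)) for qi in Q)

# Centralized argmax over the joint action space.
joint_best = max(product(range(3), range(3)), key=q_tot)

print(individual, joint_best)   # both (1, 2)
assert individual == joint_best
```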
#### Theorem 4
Convergence & optimality: FQI-IGM globally converges to the optimal value function in multi-agent MDPs.
### QPLEX: Multi-Agent Q-Learning with IGM Factorization
[link to paper](https://arxiv.org/pdf/2008.01062)

IGM:

$$
\argmax_a Q_{tot}(s,a)=\begin{pmatrix}
\argmax_{a_1}Q_1(s,a_1) \\
\vdots \\
\argmax_{a_n}Q_n(s,a_n)
\end{pmatrix}
$$

Core idea:
- Fit the values of optimal actions well
- Approximate the values of non-optimal actions

QPLEX Mixing Network:

$$
Q_{tot}(s,a)=\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')+\sum_{i=1}^{n} \lambda_i(s,a)(Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i'))
$$

Here $\sum_{i=1}^{n}\max_{a_i'}Q_i(s,a_i')$ is the baseline $\max_a Q_{tot}(s,a)$,

and $Q_i(s,a_i)-\max_{a_i'}Q_i(s,a_i')$ is the "advantage".

Coefficients: $\lambda_i(s,a)>0$, **easily realized and learned with neural networks**.
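To show how the pieces of the mixing formula fit together, here is a minimal pure-Python sketch of the QPLEX-style computation for one state; the per-agent values and the positive $\lambda_i$ coefficients are illustrative assumptions (in QPLEX they are produced by learned networks conditioned on $s$ and $a$).

```python
# QPLEX-style mixing for one state: Q_tot = baseline + sum_i lambda_i * advantage_i.
# Values and lambdas are illustrative; in QPLEX the lambdas are learned and
# state-action dependent, with lambda_i(s, a) > 0.

Q = [
    [1.0, 3.0, 2.0],   # Q_1(s, .)
    [0.5, -1.0, 4.0],  # Q_2(s, .)
]

def q_tot(joint, lam):
    baseline = sum(max(qi) for qi in Q)                    # sum_i max_a' Q_i(s, a')
    advantages = [Q[i][a] - max(Q[i]) for i, a in enumerate(joint)]
    return baseline + sum(l * adv for l, adv in zip(lam, advantages))

lam = [0.8, 1.5]   # any positive coefficients keep the argmax of Q_tot aligned (IGM)
print(q_tot((1, 2), lam))   # greedy joint action: advantages are zero -> baseline (7.0)
print(q_tot((0, 0), lam))   # non-greedy action: strictly below the baseline
```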
> Continued next time...
@@ -27,4 +27,5 @@ export default {
CSE510_L22: "CSE510 Deep Reinforcement Learning (Lecture 22)",
CSE510_L23: "CSE510 Deep Reinforcement Learning (Lecture 23)",
CSE510_L24: "CSE510 Deep Reinforcement Learning (Lecture 24)",
CSE510_L25: "CSE510 Deep Reinforcement Learning (Lecture 25)",
}