# CSE510 Deep Reinforcement Learning (Lecture 4)

Markov Decision Process (MDP) Part II

## Recall from last lecture

A finite MDP is defined by:

- A finite set of **states** $s \in S$
- A finite set of **actions** $a \in A$
- A **transition function** $T(s, a, s')$
  - The probability that action $a$ taken in state $s$ leads to $s'$, i.e., $P(s'|s, a)$
  - Also called the model or the dynamics
- A **reward function $R(s)$** (sometimes $R(s,a)$ or $R(s, a, s')$)
- A **start state**
- Maybe a **terminal state**

An MDP is a model for sequential decision-making problems under uncertainty.
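As an illustration (not from the lecture), a finite MDP maps naturally onto plain data structures; the state and action names and the dictionary layout below are assumptions for the sketch:

```python
# A tiny finite MDP as plain Python data (illustrative sketch, not from the lecture).
mdp = {
    "states": ["s0", "s1"],
    "actions": ["stay", "go"],
    # Transition function T(s, a, s') = P(s' | s, a).
    "T": {
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"s0": 0.8, "s1": 0.2},
    },
    # Reward function R(s).
    "R": {"s0": 0.0, "s1": 1.0},
    "start": "s0",
}
```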
### Optimal Policy and Bellman Optimality Equation

The goal for an MDP is to compute or learn an optimal policy.

- An **optimal policy** is one that achieves the highest value at every state

$$
\pi^* = \arg\max_\pi V^\pi(s)
$$

- We define the optimal value function using the Bellman Optimality Equation

$$
V^*(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^*(s')
$$

- The optimal policy is

$$
\pi^*(s) = \arg\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^*(s')
$$
### The Existence of the Optimal Policy

Theorem: for any Markov Decision Process,

- There exists an optimal policy
- There can be many optimal policies, but all optimal policies achieve the same optimal value function
- There is always a deterministic optimal policy for any MDP
## Solving MDPs

### Value Iteration

Repeatedly update an estimate of the optimal value function according to the Bellman Optimality Equation.

1. Initialize an estimate for the value function arbitrarily:

$$
\hat{V}(s) \gets 0, \forall s \in S
$$

2. Repeat the update:

$$
\hat{V}(s) \gets R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) \hat{V}(s'), \forall s \in S
$$

<details>
<summary>Example</summary>

Suppose we have a robot that can move in a 2D grid, with the following dynamics:

- with 80% probability, the robot moves in the direction of the action
- with 10% probability, the robot moves 90° to the left of the intended direction
- with 10% probability, the robot moves 90° to the right of the intended direction

If a move would take the robot off the grid (or into the wall marked `*`), it stays in its current cell.

The grid, with $V^0(s)$ initialized to the rewards $R(s)$, is:

|  0  |  0  |  0  |   1  |
|-----|-----|-----|------|
|  0  |  *  |  0  | -100 |
|  0  |  0  |  0  |   0  |

If we run value iteration with $\gamma = 0.9$, we can update the value function as follows:

$$
V^1(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^0(s')
$$

At point $(3,3)$, the best action is to move right, toward the goal state, so:

$$
\begin{aligned}
V^1((3,3)) &= R((3,3)) + \gamma \sum_{s'\in S} P(s'\mid (3,3),\text{right})\, V^0(s') \\
&= 0 + 0.9 \times 0.8 \times 1 = 0.72
\end{aligned}
$$

At point $(3,4)$, the best action is to move up, which keeps the robot in the same cell with probability $0.9$ (both the intended move and the rightward drift bump into the boundary), so:

$$
\begin{aligned}
V^1((3,4)) &= R((3,4)) + \gamma \sum_{s'\in S} P(s'\mid (3,4),\text{up})\, V^0(s') \\
&= 1 + 0.9 \times (0.8 + 0.1) \times 1 = 1.81
\end{aligned}
$$

At $t=1$, the values on the grid are:

|  0  |  0  | 0.72 |  1.81  |
|-----|-----|------|--------|
|  0  |  *  |  0   | -99.91 |
|  0  |  0  |  0   |   0    |

</details>
The general algorithm, for the grid example above, can be written as:
```python
# Value iteration for the grid example above.
# '*' marks a wall; every other cell holds its reward R(s).
grid = [
    [0, 0, 0, 1],
    [0, '*', 0, -100],
    [0, 0, 0, 0],
]
m, n = len(grid), len(grid[0])
gamma = 0.9

# Actions as (row, col) offsets; DIRS lists them in clockwise order, so index
# k-1 is the direction to the left of action k and index k+1 is to its right.
ACTIONS = {'up': (-1, 0), 'right': (0, 1), 'down': (1, 0), 'left': (0, -1)}
DIRS = ['up', 'right', 'down', 'left']


def next_state(i, j, move):
    """Apply a move; stay in place if it would leave the grid or hit the wall."""
    di, dj = ACTIONS[move]
    ni, nj = i + di, j + dj
    if 0 <= ni < m and 0 <= nj < n and grid[ni][nj] != '*':
        return ni, nj
    return i, j


def expected_value(V, i, j, action):
    """Expected next-state value: 80% intended move, 10% drift to either side."""
    k = DIRS.index(action)
    total = 0.0
    for move, p in [(action, 0.8),
                    (DIRS[(k - 1) % 4], 0.1),
                    (DIRS[(k + 1) % 4], 0.1)]:
        ni, nj = next_state(i, j, move)
        total += p * V[ni][nj]
    return total


def value_iteration(gamma, tol=1e-6):
    # Initialize V^0 to the rewards, as in the worked example above.
    V = [[0.0 if c == '*' else float(c) for c in row] for row in grid]
    while True:
        V_new = [[0.0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                if grid[i][j] == '*':
                    continue
                V_new[i][j] = grid[i][j] + gamma * max(
                    expected_value(V, i, j, a) for a in ACTIONS)
        if max(abs(V_new[i][j] - V[i][j])
               for i in range(m) for j in range(n)) < tol:
            return V_new
        V = V_new


V = value_iteration(gamma)
for row in V:
    print([round(v, 2) for v in row])
```
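As a sanity check, one sweep of this loop, with $V^0$ initialized to the reward grid as in the worked example, reproduces the $t=1$ values above (0.72, 1.81, and -99.91).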
### Convergence of Value Iteration

Theorem: Value Iteration converges to the optimal value function, $\hat{V}\to V^*$, as $t\to\infty$.

<details>
<summary>Proof</summary>

For any estimate of the value function $\hat{V}$, we define the Bellman backup operator $\operatorname{B}:\mathbb{R}^{|S|}\to \mathbb{R}^{|S|}$ by

$$
(\operatorname{B}\hat{V})(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) \hat{V}(s')
$$

Note that $\operatorname{B}(V^*) = V^*$.

Since $|\max_{x\in X}f(x)-\max_{x\in X}g(x)|\leq \max_{x\in X}|f(x)-g(x)|$, for any value functions $V_1$ and $V_2$ we have

$$
\begin{aligned}
|(\operatorname{B}V_1)(s)-(\operatorname{B}V_2)(s)|&= \gamma \left|\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V_1(s')-\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V_2(s')\right|\\
&\leq \gamma \max_{a\in A} \left|\sum_{s'\in S} P(s'|s,a) V_1(s')-\sum_{s'\in S} P(s'|s,a) V_2(s')\right|\\
&\leq \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) |V_1(s')-V_2(s')|\\
&\leq \gamma \max_{s'\in S}|V_1(s')-V_2(s')|
\end{aligned}
$$

Hence $\operatorname{B}$ is a $\gamma$-contraction in the sup-norm, so repeated application of $\operatorname{B}$ converges to its unique fixed point $V^*$.
</details>

Assume $0\leq \gamma < 1$ and the reward $R(s)$ is bounded by $R_{\max}$.

Then

$$
V^*(s)\leq \sum_{t=0}^\infty \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}
$$

Let $V^k$ be the value function after $k$ iterations of Value Iteration. Then

$$
\max_{s\in S}|V^k(s)-V^*(s)|\leq \frac{R_{\max}}{1-\gamma}\gamma^k
$$
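As an illustration (not from the lecture), this bound also tells us how many iterations suffice for a target accuracy $\epsilon$:

$$
\frac{R_{\max}}{1-\gamma}\gamma^k \leq \epsilon
\quad\Longleftrightarrow\quad
k \geq \frac{\ln\big(R_{\max}/(\epsilon(1-\gamma))\big)}{\ln(1/\gamma)}
$$

For example, with $\gamma=0.9$, $R_{\max}=1$, and $\epsilon=0.01$, about $k\approx 66$ iterations suffice.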
#### Stopping condition

We can obtain a value function (and hence a greedy policy) arbitrarily close to optimal by stopping once successive iterates are close enough.

If $\|V^k-V^{k+1}\|<\epsilon$, then $\|V^{k+1}-V^*\|\leq \epsilon\frac{\gamma}{1-\gamma}$.

So we can select a small $\epsilon$ as the stopping threshold.

### Greedy Policy

Given a $V^k$ that is close to the optimal value function $V^*$, the greedy policy is:

$$
\pi_{g}(s) = \arg\max_{a\in A} \sum_{s'\in S} T(s,a,s')\, V^k(s')
$$

Here $T(s,a,s') = P(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ under action $a$.

This selects the action that looks best under a one-step lookahead, assuming we receive value $V^k$ after that step.
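As a minimal sketch (not the lecture's code), greedy-policy extraction for the grid example can reuse the `grid`, `ACTIONS`, and `expected_value` helpers from the value-iteration code above:

```python
def greedy_policy(V):
    """Extract the greedy policy from a value estimate V on the grid."""
    pi = {}
    for i in range(m):
        for j in range(n):
            if grid[i][j] == '*':
                continue
            # One-step lookahead: pick the action with the best expected value.
            pi[(i, j)] = max(ACTIONS, key=lambda a: expected_value(V, i, j, a))
    return pi
```

For example, `greedy_policy(value_iteration(gamma))` returns a dictionary mapping each free cell to its chosen action.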
#### Value of a greedy policy

Define $V_g$ to be the value function of the greedy policy $\pi_g$. The greedy policy is not necessarily optimal, but its value is a good approximation of $V^*$.

In the homework, we will prove that if $\|V^k-V^*\|<\lambda$, then $\|V_g-V^*\|\leq 2\lambda\frac{\gamma}{1-\gamma}$.

So we can set the stopping condition so that $V_g$ has the desired accuracy relative to $V^*$.

There is a finite $\epsilon$ such that the greedy policy is $\epsilon$-optimal.
### Problems of Value Iteration (Motivating Policy Iteration)

- Each iteration is slow: $O(|S|^2|A|)$
- The max action at each state rarely changes
- The policy converges before the value function does
### Policy Iteration

Policy Iteration interleaves policy evaluation and policy improvement:

1. Initialize a random policy $\pi$
2. Compute the value function $V^{\pi}$
3. Update the policy $\pi$ to be the greedy policy with respect to $V^{\pi}$:

$$
\pi(s)\gets \arg\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^{\pi}(s')
$$

4. Repeat steps 2-3 until the policy no longer changes (a code sketch follows below)
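A minimal sketch of this loop for the grid example, reusing `grid`, `ACTIONS`, `expected_value`, `gamma`, `m`, and `n` from the value-iteration code above (an illustrative sketch, not the lecture's exact algorithm):

```python
def evaluate_policy(pi, tol=1e-6):
    """Iterative policy evaluation: fixed point of V = R + gamma * P_pi V."""
    V = [[0.0] * n for _ in range(m)]
    while True:
        V_new = [[0.0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                if grid[i][j] == '*':
                    continue
                V_new[i][j] = grid[i][j] + gamma * expected_value(V, i, j, pi[(i, j)])
        if max(abs(V_new[i][j] - V[i][j]) for i in range(m) for j in range(n)) < tol:
            return V_new
        V = V_new


def policy_iteration():
    # 1. Start from an arbitrary policy (here: always move 'up').
    pi = {(i, j): 'up' for i in range(m) for j in range(n) if grid[i][j] != '*'}
    while True:
        V = evaluate_policy(pi)                      # 2. policy evaluation
        new_pi = {s: max(ACTIONS, key=lambda a: expected_value(V, s[0], s[1], a))
                  for s in pi}                       # 3. greedy improvement
        if new_pi == pi:                             # 4. stop when the policy is stable
            return pi, V
        pi = new_pi
```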
### Exact Policy Evaluation by Linear Solver

Let $V^{\pi}\in \mathbb{R}^{|S|}$ be the vector of values for each state and $r\in \mathbb{R}^{|S|}$ be the vector of rewards for each state.

Let $P^{\pi}\in \mathbb{R}^{|S|\times |S|}$ be the transition matrix under policy $\pi$:

$$
P^{\pi}_{ij} = P(s_{t+1}=j\mid s_t=i, a_t=\pi(s_t))
$$

The Bellman equation for the policy can be written in vector form as:

$$
\begin{aligned}
V^{\pi} &= r + \gamma P^{\pi} V^{\pi} \\
(I-\gamma P^{\pi})V^{\pi} &= r \\
V^{\pi} &= (I-\gamma P^{\pi})^{-1} r
\end{aligned}
$$
- The proof involves showing that each iteration is also a contraction and monotonically improves the policy
- Policy Iteration converges to the exact optimal policy
- The number of policies is finite

In practice, policy iteration is usually faster than value iteration.
#### Policy Iteration Complexity

- Each iteration runs in polynomial time in the number of states and actions
- There are at most $|A|^n$ policies (for $n$ states), and PI never repeats a policy
- So there are at most an exponential number of iterations
- Not a very good complexity bound
- Empirically, $O(n)$ iterations are required
- Challenge: try to generate an MDP that requires more than $n$ iterations
### Generalized Policy Iteration

- Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement
- Independent of the granularity and other details of the two processes
### Summary

#### Policy Iteration vs Value Iteration

- **PI has two loops**: an inner loop (evaluate $V^{\pi}$) and an outer loop (improve $\pi$)
- **VI has one loop**: repeatedly apply $V^{k+1}(s) = \max_{a\in A} [r(s,a) + \gamma \sum_{s'\in S} P(s'|s,a) V^k(s')]$
- **Trade-offs**:
  - PI converges in few outer steps if you can evaluate the policy quickly and accurately
  - VI avoids expensive exact evaluation, doing cheaper but many more Bellman optimality updates
- **Modified Policy Iteration**: partial evaluation + improvement