# CSE510 Lecture 6

## Active reinforcement learning

### Exploration vs. Exploitation

- **Exploitation**: To try to get reward. We exploit our current knowledge to get a payoff.
- **Exploration**: Get more information about the world. How do we know if there is not a pot of gold around the corner?

- To explore we typically need to take actions that do not seem best according to our current model
- Managing the trade-off between exploration and exploitation is a critical issue in RL
- Basic intuition behind most approaches:
  - Explore more when knowledge is weak
  - Exploit more as we gain knowledge

### ADP-based RL

Model-based reinforcement learning:

1. Start with an initial (uninformed) model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Take an action according to an **exploration/exploitation** policy
4. Update the estimated model based on the observed transition
5. Goto 2

#### Exploration/Exploitation policy

The **greedy action** is the action maximizing the estimated $Q$-value

$$
Q(s,a) = R(s) + \gamma \sum_{s'\in S} T(s,a,s')V(s')
$$

- where $V$ is the current optimal value function estimate (based on the current model), and $R$, $T$ are the current estimates of the model
- $Q(s,a)$ is the expected value of taking action $a$ in state $s$ and then getting estimated value $V(s')$ for the next state $s'$

We want an exploration policy that is **greedy in the limit of infinite exploration** (GLIE)

- Try each action in each state an unbounded number of times
- Guarantees convergence

**GLIE**: Greedy in the limit of infinite exploration

#### Greedy Policy 1

On time step $t$, select a random action with probability $p(t)$ and the greedy action with probability $1-p(t)$.

$p(t) = \frac{1}{t}$ will lead to convergence, but is slow.

> [!TIP]
>
> In practice, it's common to simply set $p(t) = \epsilon$ for all $t$.
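
A minimal sketch of this GLIE policy with the $p(t) = \frac{1}{t}$ schedule, assuming a tabular value estimate accessed as `Q[s][a]` (the function and variable names are illustrative, not from the lecture):

```python
import random

def glie_action(Q, s, actions, t):
    """Pick a random action with probability p(t) = 1/t, otherwise the greedy action.

    Q: dict-of-dicts of current Q-value estimates, Q[s][a].
    s: current state; actions: actions available in s; t: time step (t >= 1).
    """
    p_t = 1.0 / t  # exploration probability decays over time (GLIE)
    if random.random() < p_t:
        return random.choice(actions)           # explore
    return max(actions, key=lambda a: Q[s][a])  # exploit: greedy w.r.t. current Q
```

Replacing `p_t = 1.0 / t` with a fixed `epsilon` gives the constant-$\epsilon$ variant mentioned in the tip above.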

#### Greedy Policy 2

Boltzmann exploration

Select action $a$ with probability

$$
Pr(a\mid s)=\frac{\exp(Q(s,a)/T)}{\sum_{a'\in A}\exp(Q(s,a')/T)}
$$

$T$ is the temperature. Large $T$ means that each action has about the same probability. Small $T$ leads to more greedy behavior.

Typically start with large $T$ and decrease it over time.

<details>
<summary>Example: impact of temperature</summary>

Suppose we have two actions and that $Q(s,a_1) = 1$ and $Q(s,a_2) = 2$.

When $T=10$, we have

$$
Pr(a_1\mid s)=\frac{\exp(1/10)}{\exp(1/10)+\exp(2/10)}\approx 0.48
$$

$$
Pr(a_2\mid s)=\frac{\exp(2/10)}{\exp(1/10)+\exp(2/10)}\approx 0.52
$$

When $T=1$, we have

$$
Pr(a_1\mid s)=\frac{\exp(1/1)}{\exp(1/1)+\exp(2/1)}\approx 0.27
$$

$$
Pr(a_2\mid s)=\frac{\exp(2/1)}{\exp(1/1)+\exp(2/1)}\approx 0.73
$$

When $T=0.1$, we have

$$
Pr(a_1\mid s)=\frac{\exp(1/0.1)}{\exp(1/0.1)+\exp(2/0.1)}\approx 0.00005
$$

$$
Pr(a_2\mid s)=\frac{\exp(2/0.1)}{\exp(1/0.1)+\exp(2/0.1)}\approx 0.99995
$$

</details>
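
A minimal sketch of Boltzmann action selection matching the formula above; the max-subtraction is a numerical-stability detail, not part of the lecture, and the `Q[s][a]` layout is an assumption:

```python
import math
import random

def boltzmann_action(Q, s, actions, temperature):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    q_vals = [Q[s][a] for a in actions]
    # Subtracting the max before exponentiating avoids overflow and does not
    # change the resulting probabilities.
    q_max = max(q_vals)
    weights = [math.exp((q - q_max) / temperature) for q in q_vals]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(actions, weights=probs, k=1)[0]
```

With $Q(s,a_1)=1$, $Q(s,a_2)=2$, and `temperature=1`, this samples $a_2$ about 73% of the time, matching the worked example above.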

### (Alternative Model-Based RL) Optimistic Exploration: Rmax [Brafman & Tennenholtz, 2002]

1. Start with an **optimistic model**
   - (assign the largest possible reward to "unexplored states")
   - (actions from "unexplored states" only self-transition)
2. Solve for the optimal policy in the optimistic model (standard VI)
3. Take the greedy action according to the computed policy
4. Update the optimistic estimated model
   - (if a state becomes "known" then use its true statistics)
5. Goto 2

The agent always acts greedily according to a model that assumes all "unexplored" states are maximally rewarding.

#### Implementation for optimistic model

- Keep track of the number of times each state-action pair is tried
- If $N(s, a) < N_e$ then $T(s,a,s)=1$ (a self-loop) and $R(s) = R_{\max}$ in the optimistic model
- Otherwise, $T(s,a,s')$ and $R(s)$ are based on estimates obtained from the $N_e$ experiences (the estimate of the true model)
- $N_e$ can be determined by using a Chernoff bound
- An optimal policy for this optimistic model will try to reach unexplored states (those with unexplored actions), since it can stay at those states and accumulate maximum reward
- Rmax never explicitly explores; it is always greedy, but with respect to an optimistic outlook.

```pseudocode
Algorithm (for infinite-horizon RL problems):

Initialize $\hat{p}$, $\hat{r}$, and $N(s,a)$
For t = 1, 2, ...
    1. Build an optimistic reward model $(Q(s,a))_{s,a}$ from $\hat{p}$, $\hat{r}$, and $N(s,a)$
    2. Select action $a(t)$ maximizing $Q(s(t),a)$ over $A_{s(t)}$
    3. Observe the transition to $s(t+1)$ and collect the reward $r(s(t),a(t))$
    4. Update $\hat{p}$, $\hat{r}$, and $N(s,a)$
```
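
A minimal sketch of how the optimistic model could be assembled before running value iteration. The tabular layout (`N[s][a]` counts, `T_hat`/`R_hat` empirical estimates keyed by `(s, a)`, rewards stored per state-action pair rather than per state) is an illustrative assumption, not the lecture's exact data structures:

```python
def optimistic_model(states, actions, N, T_hat, R_hat, N_e, R_max):
    """Return (T, R) for the optimistic Rmax model.

    Under-explored pairs (N[s][a] < N_e) get reward R_max and a self-loop,
    so value iteration on (T, R) is drawn toward unexplored state-actions.
    T_hat[(s, a)] is assumed to be a dict mapping next states to probabilities.
    """
    T, R = {}, {}
    for s in states:
        for a in actions:
            if N[s][a] < N_e:
                # "Unknown" (s, a): pretend it self-transitions and pays R_max forever.
                T[(s, a)] = {s: 1.0}
                R[(s, a)] = R_max
            else:
                # "Known" (s, a): use the empirical estimates of the true model.
                T[(s, a)] = T_hat[(s, a)]
                R[(s, a)] = R_hat[(s, a)]
    return T, R
```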

#### Efficiency of Rmax

If the model is completely learned (i.e. $N(s, a) \ge N_e$ for all $s, a$), then Rmax will be near-optimal.

Results show that this will happen "quickly" in terms of the number of steps.

General proof strategy, a **PAC guarantee** (roughly speaking): there is a value $N_e$ such that, with high probability, the Rmax algorithm will select at most a polynomial number of actions whose value is more than $\epsilon$ below optimal.

RL can thus be solved in time polynomial in the number of states, the number of actions, and the horizon term $1/(1-\gamma)$ of the discount factor.

### TD-based Active RL

1. Start with an initial value function
2. Take an action from an **exploration/exploitation** policy, giving new state $s'$ (should converge to the optimal policy)
3. **Update** the estimated model (needed to compute the exploration/exploitation policy)
4. Perform the TD update

   $$
   V(s) \gets V(s) + \alpha (R(s) + \gamma V(s') - V(s))
   $$

   $V(s)$ is the new estimate of the optimal value function at state $s$.

5. Goto 2

Given the usual assumptions about the learning rate and GLIE, TD will converge to an optimal value function!
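
A minimal sketch of the TD update in step 4, assuming `V` is a dict of value estimates keyed by state and `r` is the observed reward for `s` (names and default hyperparameters are illustrative):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] = V[s] + alpha * td_error
    return td_error
```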

- The exploration/exploitation policy requires computing $\arg\max_a Q(s, a)$ for the exploitation part of the policy
- Computing $\arg\max_a Q(s, a)$ requires $T$ in addition to $V$
  - Thus TD-learning must still maintain an estimated model for action selection
- It is computationally more efficient at each step compared to Rmax (i.e., optimistic exploration)
  - TD update vs. value iteration
- But the model requires much more memory than the value function
- Can we get a model-free variant?

### Q-learning

Instead of learning the optimal value function $V$, directly learn the optimal $Q$-function.

Recall that $Q(s, a)$ is the expected value of taking action $a$ in state $s$ and then following the optimal policy thereafter.

Given the $Q$-function, we can act optimally by selecting actions greedily according to $Q(s, a)$, without a model.

The optimal $Q$-function satisfies $V(s) = \max_{a'\in A} Q(s, a')$, which gives:

$$
\begin{aligned}
Q(s,a) &= R(s) + \gamma \sum_{s'\in S} T(s,a,s') V(s')\\
&= R(s) + \gamma \sum_{s'\in S} T(s,a,s') \max_{a'\in A} Q(s',a')\\
\end{aligned}
$$

How can we learn the $Q$-function directly?

#### Q-learning implementation

Model-free reinforcement learning:

1. Start with initial Q-values (e.g. all zeros)
2. Take an action from an **exploration/exploitation** policy, giving new state $s'$ (should converge to the optimal policy)
3. Perform the TD update

   $$
   Q(s,a) \gets Q(s,a) + \alpha (R(s) + \gamma \max_{a'\in A} Q(s',a') - Q(s,a))
   $$

   $Q(s,a)$ is the current estimate of the optimal Q-value for state $s$ and action $a$.

4. Goto 2

- Does not require a model, since we learn the Q-value function directly
- Uses an explicit $|S|\times |A|$ table to store Q-values
- Off-policy learning: the update does not depend on the actual next action
- The exploration/exploitation policy directly uses the $Q$-values
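
A minimal tabular Q-learning sketch combining the update above with an $\epsilon$-greedy behavior policy. The environment interface (`reset() -> state`, `step(action) -> (next_state, reward, done)`), the reward-on-transition convention, and the hyperparameter defaults are illustrative assumptions, not from the lecture:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0 as in the notes
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (the exploration/exploitation policy).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Off-policy TD target: max over next actions, not the action actually taken.
            target = r + (0.0 if done else gamma * max(Q[(s_next, x)] for x in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```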

#### Convergence of Q-learning

Q-learning converges to the optimal Q-values in the limit with probability 1 if:

- Every state-action pair is visited infinitely often
- The learning rate decays appropriately: $\sum_{t=1}^{\infty} \alpha(t) = \infty$ and $\sum_{t=1}^{\infty} \alpha(t)^2 < \infty$

#### Speedup for Goal-Based Problems

- **Goal-based problem**: receive a big reward in the goal state and then transition to a terminal state
- Initialize $Q(s, a)$ to zero for all $s \in S$ and $a \in A$, and then observe the following sequence of (state, reward, action) triples:
  - $(s_0, 0, a_0)\ (s_1, 0, a_1)\ (s_2, 10, a_2)\ (\text{terminal}, 0)$
- The sequence of Q-value updates would result in: $Q(s_0, a_0) = 0$, $Q(s_1, a_1) = 0$, $Q(s_2, a_2) = 10$
  - So nothing was learned at $s_0$ and $s_1$
- The next time this trajectory is observed we will get a non-zero value for $Q(s_1, a_1)$, but still $Q(s_0, a_0) = 0$

From the example we see that it can take many learning trials for the final reward to "back propagate" to early state-action pairs.

- Two approaches for addressing this problem (see the sketch below):
  1. Trajectory replay: store each trajectory and do several iterations of Q-updates on each one
  2. Reverse updates: store the trajectory and do Q-updates in reverse order
- In our example (with the learning rate and discount factor equal to 1 for ease of illustration), reverse updates would give
  - $Q(s_2,a_2) = 10$, $Q(s_1,a_1) = 10$, $Q(s_0,a_0) = 10$
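
A minimal sketch of the reverse-update idea, assuming a stored trajectory of (state, reward, action) triples that ends in a terminal state of value 0 and a `Q` table such as `defaultdict(float)` keyed by `(state, action)` (names are illustrative):

```python
def reverse_q_updates(Q, trajectory, actions, alpha=1.0, gamma=1.0):
    """Apply Q-updates over a stored trajectory in reverse order.

    trajectory: list of (state, reward, action) triples, followed implicitly
    by a terminal state whose value is 0. With alpha = gamma = 1 this
    propagates the final reward all the way back in a single pass.
    """
    next_value = 0.0  # value of the terminal state
    for s, r, a in reversed(trajectory):
        target = r + gamma * next_value
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        next_value = max(Q[(s, x)] for x in actions)  # bootstrap value for the predecessor
    return Q
```

On the lecture's example trajectory $[(s_0, 0, a_0), (s_1, 0, a_1), (s_2, 10, a_2)]$ with `alpha=gamma=1`, one call yields $Q(s_2,a_2)=Q(s_1,a_1)=Q(s_0,a_0)=10$, as stated above.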

### Off-policy vs. on-policy RL

### SARSA

1. Start with initial Q-values (e.g. all zeros)
2. Take action $a_n$ in state $s_n$ from an $\epsilon$-greedy policy, giving new state $s_{n+1}$
3. Take action $a_{n+1}$ in state $s_{n+1}$ from an $\epsilon$-greedy policy
4. Perform the TD update

   $$
   Q(s_n,a_n) \gets Q(s_n,a_n) + \alpha (R(s_n) + \gamma Q(s_{n+1},a_{n+1}) - Q(s_n,a_n))
   $$

5. Goto 2

> [!NOTE]
>
> Compared with Q-learning, SARSA (on-policy) usually takes safer actions, because its update uses the action actually chosen by the exploring behavior policy rather than the greedy maximum.
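
A minimal sketch of the SARSA update for contrast with Q-learning's max-based target, assuming the same `Q[(s, a)]` table and reward convention as the Q-learning sketch above (names illustrative):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy SARSA update: the target uses the next action a_next actually
    taken by the behavior policy, not the max over actions as in Q-learning."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```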