# CSE510 Lecture 5

## Passive Reinforcement Learning

New twist: we don't know the transition model T or the reward function R.

- i.e. we don't know which states are good or what the actions do
- We must actually try out actions and states to learn

### Passive learning and active learning

Passive Learning

- The agent has a fixed policy and tries to learn the utilities of states by observing the world go by
- Analogous to policy evaluation
- Often serves as a component of active learning algorithms
- Often inspires active learning algorithms

Active Learning

- The agent attempts to find an optimal (or at least good) policy by acting in the world
- Analogous to solving the underlying MDP, but without first being given the MDP model

### Model-based vs. Model-free RL

Model-based RL

- Learn the MDP model, or an approximation of it
- Use it for policy evaluation or to find the optimal policy

Human analogy: learning to navigate by exploring the environment.

Model-free RL

- Derive the optimal policy without explicitly learning the model
- Useful when the model is difficult to represent and/or learn

Human analogy: learning to walk or talk by trial and feedback, without needing to know the laws of physics.

### Small vs. Huge MDPs

We will first cover RL methods for small MDPs

- MDPs where the number of states and actions is reasonably small
- These algorithms will inspire more advanced methods

Later we will cover algorithms for huge MDPs

- Function Approximation Methods
- Policy Gradient Methods
- Actor-Critic Methods

### Problem settings

Suppose we are given a stationary policy $\pi$ and want to determine how good it is.

We want to estimate $V^\pi(s)$, but we are not given:

- the transition model $T(s, a, s')$
- the reward function $R(s)$
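
The code sketches in this lecture assume a minimal, hypothetical environment/policy interface (names like `env.reset`, `env.step`, `env.done`, and `policy.act` are illustrative, not a specific library). A toy instance might look like this:

```python
class ChainEnv:
    """Tiny deterministic chain MDP, used only to exercise the sketches below."""
    def __init__(self, n_states=5):
        self.n_states, self.state, self.done = n_states, 0, False

    def reset(self):
        self.state, self.done = 0, False
        return self.state

    def step(self, action):
        # Action 1 moves right, anything else stays; reward 1 upon reaching the last state.
        self.state = min(self.state + (1 if action == 1 else 0), self.n_states - 1)
        self.done = self.state == self.n_states - 1
        reward = 1.0 if self.done else 0.0
        return self.state, reward, self.done


class FixedPolicy:
    """A stationary policy: always move right."""
    def act(self, state):
        return 1
```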

### Monte Carlo direct estimation (model-free)

Also called Direct Estimation.

```python
def monte_carlo_direct_estimation(policy, env, num_episodes, gamma=1.0):
    returns_sum = {}   # sum of observed reward-to-go per state
    visit_counts = {}  # number of times each state was visited
    for _ in range(num_episodes):
        # Generate one episode by following the fixed policy.
        state = env.reset()
        trajectory = []
        while not env.done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state
        # Work backwards to accumulate the (discounted) reward-to-go from each state.
        reward_to_go = 0.0
        for state, reward in reversed(trajectory):
            reward_to_go = reward + gamma * reward_to_go
            returns_sum[state] = returns_sum.get(state, 0.0) + reward_to_go
            visit_counts[state] = visit_counts.get(state, 0) + 1
    # Estimate V^pi(s) as the average observed reward-to-go from s.
    return {s: returns_sum[s] / visit_counts[s] for s in returns_sum}
```

Estimate $V^\pi(s)$ as the average reward-to-go observed over all episodes in which state $s$ appears.

**Reward to go** of state $s$ is the sum of the (discounted) rewards from that state until a terminal state is reached.

Drawbacks:

- Needs a large number of episodes to get accurate estimates
- Does not exploit the Bellman constraints on policy values

### Adaptive Dynamic Programming (ADP) (model-based)

- Follow the policy for a while
- Estimate the transition model based on observations
- Learn the reward function
- Use the estimated model to compute the utilities of the policy

$$
V^\pi(s) = R(s) + \gamma \sum_{s'\in S} T(s,\pi(s),s') V^\pi(s')
$$

```python
from collections import defaultdict

def adaptive_dynamic_programming(policy, env, num_episodes, gamma=0.9, eval_sweeps=50):
    T_counts = defaultdict(lambda: defaultdict(int))  # estimated T: (s, a) -> {s': count}
    R = {}                                            # estimated reward for each state
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        while not env.done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            T_counts[(state, action)][next_state] += 1  # update the estimated model
            R[state] = reward
            state = next_state
        # Evaluate the policy on the estimated model with iterative Bellman backups.
        for _ in range(eval_sweeps):
            for (s, a), counts in T_counts.items():
                total = sum(counts.values())
                V[s] = R[s] + gamma * sum(c / total * V[s2] for s2, c in counts.items())
    return dict(V)
```

Drawback:

- Still requires a full DP policy evaluation (which is expensive) after each batch of observations

### Temporal difference learning (model-free)

- Do local updates of utility/value function on a **per-action** basis
- Don't try to estimate the entire transition function
- For each transition from $s$ to $s'$, update:

$$
V^\pi(s) \gets V^\pi(s) + \alpha (R(s) + \gamma V^\pi(s') - V^\pi(s))
$$

Here $\alpha$ is the learning rate and $\gamma$ is the discount factor.

```python
from collections import defaultdict

def temporal_difference_learning(policy, env, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)  # value estimates default to 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        while not env.done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(s) toward the one-step sample R(s) + gamma * V(s').
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return dict(V)
```

Trade-off:

- Requires more training experience (epochs) than ADP, but much less computation per epoch
- The choice depends on the relative cost of experience vs. computation

#### Online Mean Estimation algorithm

Suppose we want to incrementally compute the mean of a stream of numbers

$$
(x_1, x_2, \ldots)
$$

Given a new sample $x_{n+1}$, the new mean is the old estimate (for $n$ samples) plus the weighted difference between the new sample and the old estimate:

$$
\begin{aligned}
\hat{X}_{n+1} &= \frac{1}{n+1} \sum_{i=1}^{n+1} x_i \\
&= \frac{1}{n+1} \left[ x_{n+1} + \sum_{i=1}^{n} x_i \right] \\
&= \frac{1}{n+1} \left[ x_{n+1} + n \hat{X}_n \right] \\
&= \hat{X}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{X}_n \right)
\end{aligned}
$$
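
A quick sketch of this update in code (the running-mean loop below is illustrative, not from the lecture). Note that the TD update has exactly the same shape, but with a fixed learning rate $\alpha$ in place of the decaying weight $\frac{1}{n+1}$:

```python
def online_mean(stream):
    """Incrementally compute the mean of a stream of numbers."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n  # same form as the derivation above
    return mean

assert abs(online_mean([1, 2, 3, 4]) - 2.5) < 1e-9
```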

### Summary of passive RL

**Monte-Carlo Direct Estimation (model-free)**

- Simple to implement
- Each update is fast
- Does not exploit Bellman constraints
- Converges slowly

**Adaptive Dynamic Programming (model-based)**

- Harder to implement
- Each update is a full policy evaluation (expensive)
- Fully exploits Bellman constraints
- Fast convergence (in terms of updates)

**Temporal Difference Learning (model-free)**

- Update speed and implementation similar to direct estimation
- Partially exploits Bellman constraints: adjusts the value of a state to 'agree' with its observed successor
  - Rather than with all possible successors, as in ADP
- Convergence speed is between direct estimation and ADP

### Between ADP and TD

- Moving TD toward ADP
  - At each step, perform TD updates based on the observed transition and on "imagined" transitions generated from a model estimated from the trajectories seen so far (this makes the method model-based); see the sketch after this list
  - Imagined transitions are generated using the estimated model
  - The more imagined transitions we use, the more the method behaves like ADP
    - Making the estimate more consistent with the next-state distribution
    - In the limit of infinitely many imagined transitions, it converges to ADP
  - Trade-off between computational efficiency and experience efficiency
    - More imagined transitions require more time per step, but fewer steps of actual experience
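
A minimal sketch of this idea (essentially Dyna-style planning; the function name and the `num_imagined` parameter are illustrative assumptions, not from the lecture):

```python
import random
from collections import defaultdict

def td_with_imagined_updates(policy, env, num_episodes,
                             alpha=0.1, gamma=0.9, num_imagined=5):
    V = defaultdict(float)
    model = {}     # (s, a) -> (s', r): a simple deterministic model estimate
    observed = []  # (s, a) pairs seen so far, to replay from
    for _ in range(num_episodes):
        state = env.reset()
        while not env.done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # Real TD update from the observed transition.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            # Record the transition in the estimated model.
            model[(state, action)] = (next_state, reward)
            observed.append((state, action))
            # Extra TD updates on "imagined" transitions replayed from the model.
            for _ in range(num_imagined):
                s, a = random.choice(observed)
                s2, r = model[(s, a)]
                V[s] += alpha * (r + gamma * V[s2] - V[s])
            state = next_state
    return dict(V)
```

The more imagined updates per real step (`num_imagined`), the closer the value estimates track what ADP would compute, at the cost of more computation per step.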

## Active Reinforcement Learning

### Naive Model-Based Approach

1. Act randomly for a (long) time
   - Or systematically explore all possible actions
2. Learn
   - Transition function
   - Reward function
3. Use value iteration, policy iteration, ...
4. Follow the resulting policy

This will work if step 1 runs long enough and the environment has no dead ends that trap the agent during exploration. A rough sketch is given below.
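
A rough sketch of this procedure under the same hypothetical interface (a count-based model estimate followed by value iteration; the helper names are illustrative):

```python
import random
from collections import defaultdict

def naive_model_based(env, actions, explore_episodes, gamma=0.9, sweeps=100):
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
    R = defaultdict(float)                          # estimated reward for each state
    # Step 1: act randomly for a long time to gather experience.
    for _ in range(explore_episodes):
        s = env.reset()
        while not env.done:
            a = random.choice(actions)
            s2, r, done = env.step(a)
            counts[(s, a)][s2] += 1  # Step 2: estimate T from counts ...
            R[s] = r                 # ... and R from observed rewards
            s = s2
    # Step 3: value iteration on the estimated model.
    V = defaultdict(float)
    def q_value(s, a):
        total = sum(counts[(s, a)].values())
        return R[s] + gamma * sum(c / total * V[s2] for s2, c in counts[(s, a)].items())
    for _ in range(sweeps):
        for s in {s for (s, _a) in counts}:
            V[s] = max(q_value(s, a) for (ss, a) in counts if ss == s)
    # Step 4: act greedily with respect to V (the greedy policy is left implicit here).
    return dict(V)
```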

Drawback:

- Takes a long time to converge

### Revision of Naive Approach

1. Start with an initial (uninformed) model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Execute an action suggested by the policy in the current state
4. Update the estimated model based on the observed transition
5. Go to 2

This is just like ADP, except that we follow the greedy policy suggested by the current value estimate. A compact sketch of this loop is shown below.
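
A compact sketch of this loop (the `solve_mdp` helper stands in for value or policy iteration on the estimated model and is assumed here, not shown):

```python
def greedy_model_based_rl(env, actions, num_steps, solve_mdp):
    counts = {}   # estimated transition counts: (s, a, s') -> count
    rewards = {}  # estimated reward for each state
    policy = {}   # greedy policy returned by the solver: state -> action
    state = env.reset()
    for _ in range(num_steps):
        # Step 3: execute the action the current policy suggests (fall back to a default action).
        action = policy.get(state, actions[0])
        next_state, reward, done = env.step(action)
        # Step 4: update the estimated model with the observed transition.
        counts[(state, action, next_state)] = counts.get((state, action, next_state), 0) + 1
        rewards[state] = reward
        # Step 2 (repeated): re-solve the estimated MDP for a new greedy policy.
        policy = solve_mdp(counts, rewards, actions)
        state = env.reset() if done else next_state
    return policy
```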

**Will this work?**

No. It can get stuck in a local minimum: the greedy policy keeps exploiting whatever looks best under the current (possibly wrong) model and may never gather the experience needed to correct it.

#### Exploration vs. Exploitation

Two reasons to take an action in RL:

- **Exploitation**: To try to get reward. We exploit our current knowledge to get a payoff.
- **Exploration**: To get more information about the world. How do we know there is not a pot of gold around the corner?

- To explore, we typically need to take actions that do not seem best according to our current model
- Managing the trade-off between exploration and exploitation is a critical issue in RL
- Basic intuition behind most approaches (one common rule is sketched below):
  - Explore more when knowledge is weak
  - Exploit more as we gain knowledge
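
One common way to implement this intuition is an epsilon-greedy rule, where the exploration probability shrinks as a state is visited more often (a sketch; the decay schedule and names are illustrative choices, not from the lecture):

```python
import random

def epsilon_greedy_action(Q, state, actions, visit_count, min_epsilon=0.05):
    # Explore with probability epsilon, which decays as the state is visited more.
    epsilon = max(min_epsilon, 1.0 / (1 + visit_count.get(state, 0)))
    if random.random() < epsilon:
        return random.choice(actions)                           # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploitation
```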