diff --git a/content/CSE332S/index.mdx b/content/CSE332S/index.md
similarity index 100%
rename from content/CSE332S/index.mdx
rename to content/CSE332S/index.md
diff --git a/content/CSE510/CSE510_L5.md b/content/CSE510/CSE510_L5.md
new file mode 100644
index 0000000..5b8b858
--- /dev/null
+++ b/content/CSE510/CSE510_L5.md
@@ -0,0 +1,241 @@
# CSE510 Lecture 5

## Passive Reinforcement Learning

New twist: we don't know $T$ or $R$

- i.e. we don't know which states are good or what the actions do
- We must actually try out actions and states in order to learn

### Passive learning and active learning

Passive Learning

- The agent has a fixed policy and tries to learn the utilities of states by observing the world go by
- Analogous to policy evaluation
- Often serves as a component of active learning algorithms
- Often inspires active learning algorithms

Active Learning

- The agent attempts to find an optimal (or at least good) policy by acting in the world
- Analogous to solving the underlying MDP, but without first being given the MDP model

### Model-based vs. Model-free RL

Model-based RL

- Learn the MDP model, or an approximation of it
- Use it for policy evaluation or to find the optimal policy

Example (as a human): learning to navigate by exploring the environment and building a mental map.

Model-free RL

- Derive the optimal policy without explicitly learning the model
- Useful when the model is difficult to represent and/or learn

Example (as a human): learning to walk or talk through practice and feedback, without needing to know the laws of physics.

### Small vs. Huge MDPs

We will first cover RL methods for small MDPs

- MDPs where the number of states and actions is reasonably small
- These algorithms will inspire more advanced methods

Later we will cover algorithms for huge MDPs

- Function Approximation Methods
- Policy Gradient Methods
- Actor-Critic Methods

### Problem settings

Suppose we are given a stationary policy $\pi$ and want to determine how good it is.

We want to estimate $V^\pi(s)$, but we are not given:

- the transition model $T(s, a, s')$
- the reward function $R(s)$

### Monte Carlo direct estimation (model-free)

Also called Direct Estimation.

Estimate $V^\pi(s)$ as the average reward-to-go observed from $s$, taken over all visits to $s$ across the sampled episodes.

**Reward to go** of a state $s$ is the sum of the (discounted) rewards from that state until a terminal state is reached.

```python
from collections import defaultdict

def monte_carlo_direct_estimation(policy, env, num_episodes, gamma=1.0):
    # Average the observed reward-to-go over every visit to each state.
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(num_episodes):
        state, trajectory, done = env.reset(), [], False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state
        G = 0.0  # discounted reward-to-go, accumulated backwards
        for s, r in reversed(trajectory):
            G = r + gamma * G
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```

Drawbacks:

- Needs a large number of episodes to get an accurate estimate
- Does not exploit the Bellman constraints on policy values

### Adaptive Dynamic Programming (ADP) (model-based)

- Follow the policy for a while
- Estimate the transition model from the observed transitions
- Learn the reward function
- Use the estimated model to compute the utilities of the policy

$$
V^\pi(s) = R(s) + \gamma \sum_{s'\in S} T(s,\pi(s),s')\, V^\pi(s')
$$

A sketch of this procedure, using the same hypothetical `env`/`policy` interface as above:

```python
from collections import defaultdict

def adaptive_dynamic_programming(policy, env, num_episodes, eval_sweeps=50, gamma=0.9):
    counts = defaultdict(lambda: defaultdict(int))  # counts[s][s'] of transitions observed under pi
    R = {}                                          # estimated immediate reward when acting in s
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy.act(state))
            counts[state][next_state] += 1
            R[state] = reward  # assumes (near-)deterministic rewards
            state = next_state
        # Policy evaluation on the estimated model: iterate the Bellman equation above.
        for _ in range(eval_sweeps):
            for s, successors in counts.items():
                n = sum(successors.values())
                V[s] = R[s] + gamma * sum(c / n * V[s2] for s2, c in successors.items())
    return dict(V)
```

Drawback:

- Still needs a full DP policy evaluation after each model update, which is expensive for large state spaces

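The snippets above assume a minimal episodic interface: `env.reset()` returns the initial state, `env.step(action)` returns `(next_state, reward, done)`, and `policy.act(state)` returns an action. For concreteness, here is a hypothetical toy chain MDP and fixed policy (not from the lecture) that can be used to exercise both estimators:

```python
import random

class ChainEnv:
    """Hypothetical 5-state chain: start at state 0, reward 1 on reaching the last state."""
    def __init__(self, n=5):
        self.n = n
    def reset(self):
        self.state, self.done = 0, False
        return self.state
    def step(self, action):
        # action is +1 (right) or -1 (left); the episode ends at state n-1
        self.state = max(0, min(self.n - 1, self.state + action))
        reward = 1.0 if self.state == self.n - 1 else 0.0
        self.done = self.state == self.n - 1
        return self.state, reward, self.done

class MostlyRightPolicy:
    def act(self, state):
        return 1 if random.random() < 0.9 else -1  # fixed stochastic policy

env, policy = ChainEnv(), MostlyRightPolicy()
print(monte_carlo_direct_estimation(policy, env, num_episodes=1000))
print(adaptive_dynamic_programming(policy, env, num_episodes=200))
```
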
### Temporal difference learning (model-free)

- Do local updates of the utility/value function on a **per-action** basis
- Don't try to estimate the entire transition function
- For each observed transition from $s$ to $s'$, update:
  $$
  V^\pi(s) \gets V^\pi(s) + \alpha \left( R(s) + \gamma V^\pi(s') - V^\pi(s) \right)
  $$

Here $\alpha$ is the learning rate and $\gamma$ is the discount factor.

```python
from collections import defaultdict

def temporal_difference_learning(policy, env, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)  # value estimates default to 0
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # Move V(state) toward the one-step bootstrapped target.
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return dict(V)
```

Drawback:

- Requires more training experience (epochs) than ADP, but much less computation per epoch
- The choice between them depends on the relative cost of experience vs. computation

#### Online Mean Estimation algorithm

Suppose we want to incrementally compute the mean of a stream of numbers

$$
(x_1, x_2, \ldots)
$$

Given a new sample $x_{n+1}$, the new mean is the old estimate (over $n$ samples) plus a weighted difference between the new sample and the old estimate:

$$
\begin{aligned}
\hat{X}_{n+1} &= \frac{1}{n+1} \sum_{i=1}^{n+1} x_i \\
&= \frac{1}{n+1} \left( x_{n+1} + \sum_{i=1}^{n} x_i \right) \\
&= \frac{1}{n+1} \left( x_{n+1} + n \hat{X}_n \right) \\
&= \hat{X}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{X}_n \right)
\end{aligned}
$$

The TD update has exactly this form, with the learning rate $\alpha$ playing the role of the weight $\frac{1}{n+1}$.

### Summary of passive RL

**Monte-Carlo Direct Estimation (model-free)**

- Simple to implement
- Each update is fast
- Does not exploit Bellman constraints
- Converges slowly

**Adaptive Dynamic Programming (model-based)**

- Harder to implement
- Each update is a full policy evaluation (expensive)
- Fully exploits Bellman constraints
- Fast convergence (in terms of updates)

**Temporal Difference Learning (model-free)**

- Update speed and implementation similar to direct estimation
- Partially exploits Bellman constraints: adjusts a state's value to 'agree' with its observed successor
  - Not with all possible successors, as in ADP
- Convergence speed in between direct estimation and ADP

### Between ADP and TD

- Moving TD toward ADP
  - At each step, perform TD updates based on the observed transition and on "imagined" transitions drawn from the model estimated from earlier trajectories (this makes the method partly model-based); see the sketch after this list
  - Imagined transitions are generated using the estimated model
- The more imagined transitions used, the more ADP-like the method becomes
  - The estimate becomes more consistent with the next-state distribution
  - In the limit of infinitely many imagined transitions, it converges to ADP
- Trade-off between computational and experience efficiency
  - More imagined transitions require more time per step, but fewer steps of actual experience

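A minimal sketch of this idea (essentially a Dyna-style method), assuming the same hypothetical `env`/`policy` interface as the earlier snippets; `k_imagined` is an illustrative parameter controlling how many model-sampled updates accompany each real transition:

```python
import random
from collections import defaultdict

def td_with_imagined_transitions(policy, env, num_episodes, k_imagined=10,
                                 alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    model = defaultdict(list)  # observed (reward, next_state) samples for each state
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy.act(state))
            model[state].append((reward, next_state))
            # Real TD update from the observed transition.
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])
            # k imagined TD updates, replayed from the estimated model.
            for _ in range(k_imagined):
                s = random.choice(list(model))
                r, s2 = random.choice(model[s])
                V[s] += alpha * (r + gamma * V[s2] - V[s])
            state = next_state
    return dict(V)
```

With `k_imagined = 0` this reduces to plain TD; as `k_imagined` grows, each step looks more and more like a sweep of ADP's policy evaluation on the estimated model.
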
## Active Reinforcement Learning

### Naive Model-Based Approach

1. Act randomly for a (long) time
   - Or systematically explore all possible actions
2. Learn
   - the transition function
   - the reward function
3. Use value iteration, policy iteration, ...
4. Follow the resulting policy

This will work if step 1 runs long enough and there are no dead ends the agent can get stuck in while exploring.

Drawback:

- Takes a long time to converge

### Revision of Naive Approach

1. Start with an initial (uninformed) model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Execute an action suggested by the policy in the current state
4. Update the estimated model based on the observed transition
5. Go to step 2

This is just like ADP, except that we follow the greedy policy suggested by the current value estimates.

**Will this work?**

No. The greedy agent can get stuck in a local minimum, because it never explores actions that currently look suboptimal.

#### Exploration vs. Exploitation

Two reasons to take an action in RL:

- **Exploitation**: to try to get reward. We exploit our current knowledge to get a payoff.
- **Exploration**: to get more information about the world. How do we know there isn't a pot of gold around the corner?

- To explore, we typically need to take actions that do not look best according to our current model
- Managing the trade-off between exploration and exploitation is a critical issue in RL
- Basic intuition behind most approaches:
  - Explore more when knowledge is weak
  - Exploit more as we gain knowledge

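One common way to implement this intuition (not spelled out in the notes above) is $\epsilon$-greedy action selection: explore with probability $\epsilon$, exploit otherwise, and decay $\epsilon$ as knowledge accumulates. A minimal sketch, assuming state-action value estimates stored in a dictionary `Q`:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    # Explore: with probability epsilon, pick a random action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Exploit: otherwise pick the action with the highest current estimate.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Decay exploration as knowledge grows, e.g. epsilon = max(0.05, 1.0 / (1 + visits[state]))
```
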