# CSE510 Lecture 5

## Passive Reinforcement Learning

New twist: we don't know the transition model T or the reward function R.

- i.e. we don't know which states are good or what the actions do
- We must actually try out actions and states to learn

### Passive learning and active learning

Passive Learning

- The agent has a fixed policy and tries to learn the utilities of states by observing the world go by
- Analogous to policy evaluation
- Often serves as a component of active learning algorithms
- Often inspires active learning algorithms

Active Learning

- The agent attempts to find an optimal (or at least good) policy by acting in the world
- Analogous to solving the underlying MDP, but without first being given the MDP model

### Model-based vs. Model-free RL

Model-based RL

- Learn the MDP model, or an approximation of it
- Use it for policy evaluation or to find the optimal policy

Human analogy: learning to navigate by exploring the environment.

Model-free RL

- Derive the optimal policy without explicitly learning the model
- Useful when the model is difficult to represent and/or learn

Human analogy: learning to walk or talk by trial and feedback, without needing to know the laws of physics.

### Small vs. Huge MDPs

We will first cover RL methods for small MDPs

- MDPs where the number of states and actions is reasonably small
- These algorithms will inspire more advanced methods

Later we will cover algorithms for huge MDPs

- Function Approximation Methods
- Policy Gradient Methods
- Actor-Critic Methods

### Problem settings

Suppose we are given a stationary policy $\pi$ and want to determine how good it is.

We want to estimate $V^\pi(s)$, but we are not given:

- the transition model $T(s, a, s')$
- the reward function $R(s)$
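
The code sketches in this lecture assume a minimal, hypothetical environment/policy interface (names like `env.reset`, `env.step`, `env.done`, and `policy.act` are illustrative, not a specific library). A toy instance might look like this:

```python
class ChainEnv:
    """Tiny deterministic chain MDP, used only to exercise the sketches below."""
    def __init__(self, n_states=5):
        self.n_states, self.state, self.done = n_states, 0, False

    def reset(self):
        self.state, self.done = 0, False
        return self.state

    def step(self, action):
        # Action 1 moves right, anything else stays; reward 1 upon reaching the last state.
        self.state = min(self.state + (1 if action == 1 else 0), self.n_states - 1)
        self.done = self.state == self.n_states - 1
        reward = 1.0 if self.done else 0.0
        return self.state, reward, self.done


class FixedPolicy:
    """A stationary policy: always move right."""
    def act(self, state):
        return 1
```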

### Monte Carlo direct estimation (model-free)

Also called Direct Estimation.

```python
def monte_carlo_direct_estimation(policy, env, num_episodes, gamma=1.0):
    returns_sum = {}   # sum of observed reward-to-go per state
    visit_counts = {}  # number of times each state was visited
    for _ in range(num_episodes):
        # Generate one episode by following the fixed policy.
        state = env.reset()
        trajectory = []
        while not env.done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state
        # Work backwards to accumulate the (discounted) reward-to-go from each state.
        reward_to_go = 0.0
        for state, reward in reversed(trajectory):
            reward_to_go = reward + gamma * reward_to_go
            returns_sum[state] = returns_sum.get(state, 0.0) + reward_to_go
            visit_counts[state] = visit_counts.get(state, 0) + 1
    # Estimate V^pi(s) as the average observed reward-to-go from s.
    return {s: returns_sum[s] / visit_counts[s] for s in returns_sum}
```

Estimate $V^\pi(s)$ as the average reward-to-go observed over all episodes in which state $s$ appears.

**Reward to go** of state $s$ is the sum of the (discounted) rewards from that state until a terminal state is reached.

Drawbacks:

- Needs a large number of episodes to get accurate estimates
- Does not exploit the Bellman constraints on policy values

### Adaptive Dynamic Programming (ADP) (model-based)

- Follow the policy for a while
- Estimate the transition model based on observations
- Learn the reward function
- Use the estimated model to compute the utilities of the policy

$$
V^\pi(s) = R(s) + \gamma \sum_{s'\in S} T(s,\pi(s),s') V^\pi(s')
$$

```python
from collections import defaultdict

def adaptive_dynamic_programming(policy, env, num_episodes, gamma=0.9, eval_sweeps=50):
    T_counts = defaultdict(lambda: defaultdict(int))  # estimated T: (s, a) -> {s': count}
    R = {}                                            # estimated reward for each state
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        while not env.done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            T_counts[(state, action)][next_state] += 1  # update the estimated model
            R[state] = reward
            state = next_state
        # Evaluate the policy on the estimated model with iterative Bellman backups.
        for _ in range(eval_sweeps):
            for (s, a), counts in T_counts.items():
                total = sum(counts.values())
                V[s] = R[s] + gamma * sum(c / total * V[s2] for s2, c in counts.items())
    return dict(V)
```

Drawback:

- Still requires a full DP policy evaluation (which is expensive) after each batch of observations

### Temporal difference learning (model-free)

- Do local updates of utility/value function on a **per-action** basis
- Don't try to estimate the entire transition function
- For each transition from $s$ to $s'$, update:

$$
V^\pi(s) \gets V^\pi(s) + \alpha (R(s) + \gamma V^\pi(s') - V^\pi(s))
$$

Here $\alpha$ is the learning rate and $\gamma$ is the discount factor.

```python
from collections import defaultdict

def temporal_difference_learning(policy, env, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)  # value estimates default to 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        while not env.done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(s) toward the one-step sample R(s) + gamma * V(s').
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return dict(V)
```

Trade-off:

- Requires more training experience (epochs) than ADP, but much less computation per epoch
- The choice depends on the relative cost of experience vs. computation

#### Online Mean Estimation algorithm

Suppose we want to incrementally compute the mean of a stream of numbers

$$
(x_1, x_2, \ldots)
$$

Given a new sample $x_{n+1}$, the new mean is the old estimate (for $n$ samples) plus the weighted difference between the new sample and the old estimate:

$$
\begin{aligned}
\hat{X}_{n+1} &= \frac{1}{n+1} \sum_{i=1}^{n+1} x_i \\
&= \frac{1}{n+1} \left[ x_{n+1} + \sum_{i=1}^{n} x_i \right] \\
&= \frac{1}{n+1} \left[ x_{n+1} + n \hat{X}_n \right] \\
&= \hat{X}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{X}_n \right)
\end{aligned}
$$
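
A quick sketch of this update in code (the running-mean loop below is illustrative, not from the lecture). Note that the TD update has exactly the same shape, but with a fixed learning rate $\alpha$ in place of the decaying weight $\frac{1}{n+1}$:

```python
def online_mean(stream):
    """Incrementally compute the mean of a stream of numbers."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n  # same form as the derivation above
    return mean

assert abs(online_mean([1, 2, 3, 4]) - 2.5) < 1e-9
```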

### Summary of passive RL

**Monte-Carlo Direct Estimation (model-free)**

- Simple to implement
- Each update is fast
- Does not exploit Bellman constraints
- Converges slowly

**Adaptive Dynamic Programming (model-based)**

- Harder to implement
- Each update is a full policy evaluation (expensive)
- Fully exploits Bellman constraints
- Fast convergence (in terms of updates)

**Temporal Difference Learning (model-free)**

- Update speed and implementation similar to direct estimation
- Partially exploits Bellman constraints: adjusts the value of a state to 'agree' with its observed successor
  - Rather than with all possible successors, as in ADP
- Convergence speed is between direct estimation and ADP

### Between ADP and TD

- Moving TD toward ADP
  - At each step, perform TD updates based on the observed transition and on "imagined" transitions generated from a model estimated from the trajectories seen so far (this makes the method model-based); see the sketch after this list
  - Imagined transitions are generated using the estimated model
  - The more imagined transitions we use, the more the method behaves like ADP
    - Making the estimate more consistent with the next-state distribution
    - In the limit of infinitely many imagined transitions, it converges to ADP
  - Trade-off between computational efficiency and experience efficiency
    - More imagined transitions require more time per step, but fewer steps of actual experience
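
A minimal sketch of this idea (essentially Dyna-style planning; the function name and the `num_imagined` parameter are illustrative assumptions, not from the lecture):

```python
import random
from collections import defaultdict

def td_with_imagined_updates(policy, env, num_episodes,
                             alpha=0.1, gamma=0.9, num_imagined=5):
    V = defaultdict(float)
    model = {}     # (s, a) -> (s', r): a simple deterministic model estimate
    observed = []  # (s, a) pairs seen so far, to replay from
    for _ in range(num_episodes):
        state = env.reset()
        while not env.done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            # Real TD update from the observed transition.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            # Record the transition in the estimated model.
            model[(state, action)] = (next_state, reward)
            observed.append((state, action))
            # Extra TD updates on "imagined" transitions replayed from the model.
            for _ in range(num_imagined):
                s, a = random.choice(observed)
                s2, r = model[(s, a)]
                V[s] += alpha * (r + gamma * V[s2] - V[s])
            state = next_state
    return dict(V)
```

The more imagined updates per real step (`num_imagined`), the closer the value estimates track what ADP would compute, at the cost of more computation per step.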

## Active Reinforcement Learning

### Naive Model-Based Approach

1. Act randomly for a (long) time
   - Or systematically explore all possible actions
2. Learn
   - Transition function
   - Reward function
3. Use value iteration, policy iteration, ...
4. Follow the resulting policy

This will work if step 1 runs long enough and the environment has no dead ends that trap the agent during exploration. A rough sketch is given below.
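
A rough sketch of this procedure under the same hypothetical interface (a count-based model estimate followed by value iteration; the helper names are illustrative):

```python
import random
from collections import defaultdict

def naive_model_based(env, actions, explore_episodes, gamma=0.9, sweeps=100):
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
    R = defaultdict(float)                          # estimated reward for each state
    # Step 1: act randomly for a long time to gather experience.
    for _ in range(explore_episodes):
        s = env.reset()
        while not env.done:
            a = random.choice(actions)
            s2, r, done = env.step(a)
            counts[(s, a)][s2] += 1  # Step 2: estimate T from counts ...
            R[s] = r                 # ... and R from observed rewards
            s = s2
    # Step 3: value iteration on the estimated model.
    V = defaultdict(float)
    def q_value(s, a):
        total = sum(counts[(s, a)].values())
        return R[s] + gamma * sum(c / total * V[s2] for s2, c in counts[(s, a)].items())
    for _ in range(sweeps):
        for s in {s for (s, _a) in counts}:
            V[s] = max(q_value(s, a) for (ss, a) in counts if ss == s)
    # Step 4: act greedily with respect to V (the greedy policy is left implicit here).
    return dict(V)
```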

Drawback:

- Takes a long time to converge

### Revision of Naive Approach

1. Start with an initial (uninformed) model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Execute an action suggested by the policy in the current state
4. Update the estimated model based on the observed transition
5. Go to 2

This is just like ADP, except that we follow the greedy policy suggested by the current value estimate. A compact sketch of this loop is shown below.
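
A compact sketch of this loop (the `solve_mdp` helper stands in for value or policy iteration on the estimated model and is assumed here, not shown):

```python
def greedy_model_based_rl(env, actions, num_steps, solve_mdp):
    counts = {}   # estimated transition counts: (s, a, s') -> count
    rewards = {}  # estimated reward for each state
    policy = {}   # greedy policy returned by the solver: state -> action
    state = env.reset()
    for _ in range(num_steps):
        # Step 3: execute the action the current policy suggests (fall back to a default action).
        action = policy.get(state, actions[0])
        next_state, reward, done = env.step(action)
        # Step 4: update the estimated model with the observed transition.
        counts[(state, action, next_state)] = counts.get((state, action, next_state), 0) + 1
        rewards[state] = reward
        # Step 2 (repeated): re-solve the estimated MDP for a new greedy policy.
        policy = solve_mdp(counts, rewards, actions)
        state = env.reset() if done else next_state
    return policy
```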

**Will this work?**

No. It can get stuck in a local minimum: the greedy policy keeps exploiting whatever looks best under the current (possibly wrong) model and may never gather the experience needed to correct it.

#### Exploration vs. Exploitation

Two reasons to take an action in RL:

- **Exploitation**: To try to get reward. We exploit our current knowledge to get a payoff.
- **Exploration**: To get more information about the world. How do we know there is not a pot of gold around the corner?

- To explore, we typically need to take actions that do not seem best according to our current model
- Managing the trade-off between exploration and exploitation is a critical issue in RL
- Basic intuition behind most approaches (one common rule is sketched below):
  - Explore more when knowledge is weak
  - Exploit more as we gain knowledge
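
One common way to implement this intuition is an epsilon-greedy rule, where the exploration probability shrinks as a state is visited more often (a sketch; the decay schedule and names are illustrative choices, not from the lecture):

```python
import random

def epsilon_greedy_action(Q, state, actions, visit_count, min_epsilon=0.05):
    # Explore with probability epsilon, which decays as the state is visited more.
    epsilon = max(min_epsilon, 1.0 / (1 + visit_count.get(state, 0)))
    if random.random() < epsilon:
        return random.choice(actions)                           # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploitation
```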