updates

2025-09-11 12:47:03 -05:00
parent 0200ab7eed
commit 533ea36b37
5 changed files with 538 additions and 0 deletions
--- a/content/CSE510/CSE510_L6.md
+++ b/content/CSE510/CSE510_L6.md
@@ -0,0 +1,247 @@
+# CSE510 Lecture 6
+
+## Active reinforcement learning
+
+### Exploration vs. Exploitation
+
+- **Exploitation**: Try to collect reward. We exploit our current knowledge to get a payoff.
+- **Exploration**: Gather more information about the world. How do we know there is not a pot of gold around the corner?
+
+- To explore we typically need to take actions that do not seem best according to our current model
+- Managing the trade-off between exploration and exploitation is a critical issue in RL
+- Basic intuition behind most approaches
+  - Explore more when knowledge is weak
+  - Exploit more as we gain knowledge
+
+### ADP-based RL
+
+Model-based
+
+1. Start with an initial (uninformed) model
+2. Solve for the optimal policy given the current model (using value or policy iteration)
+3. Take action according to an **exploration/exploitation** policy
+4. Update estimated model based on observed transition
+5. Goto 2
+
+#### Exploration/Exploitation policy
+
+**Greedy action** is the action maximizing estimated $Q$ value
+
+$$
+Q(s,a) = R(s) + \gamma \sum_{s'\in S} T(s,a,s')V(s')
+$$
+
+- where $V$ is current optimal value function estimate (based on current model), and $R, T$ are current estimates of model
+- $Q(s,a)$ is the expected value of taking action $a$ in state $s$ and then getting estimated value $V(s')$ for the next state $s'$
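As a minimal sketch of this computation, the snippet below evaluates $Q(s,a)$ from an estimated model by taking the expectation of $V(s')$ over next states and then picks the greedy action. The two-state model, the nested-list layout `T[s][a][s2]`, and the function names are assumptions made for illustration, not part of the lecture.

```python
def q_from_model(R, T, V, gamma):
    """Return Q[s][a] = R[s] + gamma * sum over s2 of T[s][a][s2] * V[s2]."""
    n_states = len(R)
    n_actions = len(T[0])
    return [
        [
            R[s] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_states))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]


def greedy_action(Q, s):
    """Index of the action with the largest estimated Q-value in state s."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


# Hypothetical estimated model with 2 states and 2 actions.
R = [0.0, 1.0]
T = [
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 under actions 0 and 1
    [[1.0, 0.0], [0.0, 1.0]],  # transitions from state 1 under actions 0 and 1
]
V = [0.0, 10.0]

Q = q_from_model(R, T, V, gamma=0.9)
print(Q, greedy_action(Q, 0))
```

In this toy model action 1 is greedy in state 0 because it is more likely to reach the high-value state 1.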
+
+Want an exploration policy that is **greedy in the limit of infinite exploration** (GLIE)
+
+- Try each action in each state an unbounded number of times
+- Become greedy with respect to the current estimates in the limit
+- Guarantees convergence
+
+**GLIE**: Greedy in the limit of infinite exploration
+
+#### GLIE Policy 1
+
+On time step $t$ select random action with probability $p(t)$ and greedy action with probability $1-p(t)$
+
+$p(t) = \frac{1}{t}$ will lead to convergence, but is slow.
+
+> [!TIP]
+>
+> In practice, it's common to simply set $p(t) = \epsilon$ for all $t$ ($\epsilon$-greedy), although a constant $\epsilon$ never becomes fully greedy and so is not GLIE.
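The random-versus-greedy rule can be sketched as follows. This is an illustrative implementation under assumptions: the function names, the three Q-values, and the seeded generator are hypothetical, and ties are broken toward the lowest action index.

```python
import random


def explore_exploit_action(q_values, p_explore, rng=random):
    """With probability p_explore pick a uniformly random action, else the greedy one."""
    if rng.random() < p_explore:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decaying_p(t):
    """Exploration probability p(t) = 1/t for time steps t = 1, 2, ..."""
    return 1.0 / t


rng = random.Random(0)        # seeded only to make the sketch reproducible
q_values = [0.2, 1.0, 0.5]    # hypothetical Q(s, a) estimates for one state
counts = [0, 0, 0]
for t in range(1, 1001):
    a = explore_exploit_action(q_values, decaying_p(t), rng)
    counts[a] += 1
print(counts)
```

With $p(t) = 1/t$ only a handful of the 1000 steps are exploratory, which is why convergence under this schedule is slow. Passing a constant instead of `decaying_p(t)` gives the fixed-$\epsilon$ variant.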
+
+#### GLIE Policy 2
+
+Boltzmann exploration
+
+Select action $a$ with probability
+
+$$
+Pr(a\mid s)=\frac{\exp(Q(s,a)/T)}{\sum_{a'\in A}\exp(Q(s,a')/T)}
+$$
+
+$T$ is the temperature. Large $T$ means that each action has about the same probability. Small $T$ leads to more greedy behavior.
+
+Typically start with large $T$ and decrease with time.
+
+<details>
+<summary>Example: impact of temperature</summary>
+
+Suppose we have two actions and that $Q(s,a_1) = 1$ and $Q(s,a_2) = 2$.
+
+When $T=10$, we have
+
+$$
+Pr(a_1\mid s)=\frac{\exp(1/10)}{\exp(1/10)+\exp(2/10)}=0.48
+$$
+
+$$
+Pr(a_2\mid s)=\frac{\exp(2/10)}{\exp(1/10)+\exp(2/10)}=0.52
+$$
+
+When $T=1$, we have
+
+$$
+Pr(a_1\mid s)=\frac{\exp(1/1)}{\exp(1/1)+\exp(2/1)}=0.27
+$$
+
+$$
+Pr(a_2\mid s)=\frac{\exp(2/1)}{\exp(1/1)+\exp(2/1)}=0.73
+$$
+
+When $T=0.25$, we have
+
+$$
+Pr(a_1\mid s)=\frac{\exp(1/0.25)}{\exp(1/0.25)+\exp(2/0.25)}=0.02
+$$
+
+$$
+Pr(a_2\mid s)=\frac{\exp(2/0.25)}{\exp(1/0.25)+\exp(2/0.25)}=0.98
+$$
+
+</details>
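The effect of temperature can be checked numerically with a short sketch. It assumes Q-values of 1 and 2 (the values that appear in the exponents of the worked example) and illustrative temperatures 10, 1 and 0.25. The function name is hypothetical, and subtracting the maximum Q-value is a standard numerical-stability step that does not change the probabilities.

```python
import math


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values: Pr(a|s) is proportional to exp(Q(s,a) / T)."""
    q_max = max(q_values)  # shift by the max so exp() cannot overflow
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


for temperature in (10.0, 1.0, 0.25):
    probs = boltzmann_probs([1.0, 2.0], temperature)
    print(temperature, [round(p, 2) for p in probs])
# prints:
# 10.0 [0.48, 0.52]
# 1.0 [0.27, 0.73]
# 0.25 [0.02, 0.98]
```

As the temperature drops, the probability mass moves toward the action with the larger Q-value.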
+
+### (Alternative Model-Based RL) Optimistic Exploration: Rmax [Brafman & Tennenholtz, 2002]
+
+1. Start with an **optimistic model**
+   - (assign largest possible reward to "unexplored states") 
+   - (actions from "unexplored states" only self transition)
+2. Solve for optimal policy in optimistic model (standard VI)
+3. Take greedy action according to the computed policy
+4. Update optimistic estimated model
+   - (if a state becomes "known" then use its true statistics)
+5. Goto 2
+
+Agent always acts greedily according to a model that assumes
+all "unexplored" states are maximally rewarding
+
+#### Implementation for optimistic model
+
+- Keep track of number of times a state-action pair is tried
+- If $N(s, a) < N_e$ then $T(s,a,s)=1$ and $R(s) = R_{\max}$ in the optimistic model
+- Otherwise, $T(s,a,s')$ and $R(s)$ are based on estimates obtained from the $N_e$ experiences (the estimate of the true model)
+- $N_e$ can be determined using the Chernoff bound
+- An optimal policy for this optimistic model will try to reach unexplored states (those with unexplored actions) since it can stay at those states and accumulate maximum reward
+- Never explicitly explores. Is always greedy, but with respect to an optimistic outlook.
+
+```pseudocode
+Algorithm: (for infinite-horizon RL problems)
+Initialize $\hat{p}, \hat{r}$, and $N(s,a)$
+For $t = 1, 2, ...$
+1. Build an optimistic reward model $(Q(s,a))_{s,a}$ from $\hat{p}, \hat{r}$, and $N(s,a)$
+2. Select action $a(t)$ maximizing $Q(s(t),a)$ over $A_{s(t)}$
+3. Observe the transition to $s(t+1)$ and collect reward $r(s(t),a(t))$ from the environment
+4. Update $\hat{p}, \hat{r}$, and $N(s,a)$
+```
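The construction of the optimistic model from visit counts can be sketched as below. This is an illustration under assumptions, not the algorithm from the paper: rewards are stored per state-action pair so that only the under-sampled pair is made optimistic, and the names `n_sa`, `n_sas`, `r_hat`, `n_e` and `r_max` are hypothetical.

```python
def optimistic_model(n_sa, n_sas, r_hat, n_e, r_max):
    """Return (T, R) in which pairs tried fewer than n_e times self-loop with reward r_max.

    n_sa[s][a]      number of times action a was tried in state s
    n_sas[s][a][s2] number of observed transitions from s to s2 under a
    r_hat[s][a]     empirical mean reward for the pair (s, a)
    """
    n_states = len(n_sa)
    n_actions = len(n_sa[0])
    T = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    R = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            if n_sa[s][a] < n_e:
                # Unknown pair: pretend it stays in s forever and earns the maximum reward.
                T[s][a][s] = 1.0
                R[s][a] = r_max
            else:
                # Known pair: use the empirical statistics.
                for s2 in range(n_states):
                    T[s][a][s2] = n_sas[s][a][s2] / n_sa[s][a]
                R[s][a] = r_hat[s][a]
    return T, R


# Hypothetical counts for 2 states and 2 actions, with n_e = 3.
n_sa = [[3, 1], [0, 4]]
n_sas = [[[1, 2], [0, 1]], [[0, 0], [4, 0]]]
r_hat = [[0.5, 0.0], [0.0, 1.0]]
T, R = optimistic_model(n_sa, n_sas, r_hat, n_e=3, r_max=10.0)
print(R)
```

Running standard value iteration on the returned `(T, R)` and acting greedily would then steer the agent toward the pairs that are still marked unknown.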
+
+#### Efficiency of Rmax
+
+If the model is completely learned (i.e. $N(s, a) \geq N_e$ for all $s, a$), then Rmax will be near optimal.
+
+Results show that this will happen "quickly" in terms of number of steps.
+
+General proof strategy: **PAC Guarantee (Roughly speaking):** There is a value of $N_e$ such that, with high probability, the Rmax algorithm will select at most a polynomial number of actions whose value is more than $\epsilon$ below optimal.
+
+In this sense, RL can be solved in time polynomial in the number of actions, the number of states, and the effective horizon $1/(1-\gamma)$ determined by the discount factor.
+
+### TD-based Active RL
+
+1. Start with initial value function
+2. Take action from an **exploration/exploitation** policy, giving new state $s'$ (should converge to optimal policy)
+3. **Update** estimated model (To compute the exploration/exploitation policy.)
+4. Perform TD update
+   $$
+   V(s) \gets V(s) + \alpha (R(s) + \gamma V(s') - V(s))
+   $$
+   $V(s)$ is new estimate of optimal value function at state $s$.
+5. Goto 2
+
+Given the usual assumptions about learning rate and GLIE, TD will converge to an optimal value function!
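A single TD update can be sketched in a few lines. The dictionary-based value table, the state names, and the numbers are assumptions chosen only to make the arithmetic easy to follow.

```python
def td_update(V, s, r, s_next, alpha, gamma):
    """Apply V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) in place and return V(s)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V[s]


# Hypothetical value estimates for two states.
V = {"A": 0.0, "B": 1.0}

# Observed: reward 0.5 in state A, then a transition to state B.
new_value = td_update(V, "A", 0.5, "B", alpha=0.1, gamma=0.9)
print(new_value)
```

Here the TD target is $0.5 + 0.9 \cdot 1.0 = 1.4$, so $V(A)$ moves one tenth of the way from 0 toward 1.4.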
+
+- Exploration/Exploitation policy requires computing $\arg\max_a Q(s, a)$ for the exploitation part of the policy
+  - Computing $\arg\max_a Q(s, a)$ requires $T$ in addition to $V$
+- Thus TD-learning must still maintain an estimated model for action selection
+- It is computationally more efficient at each step compared to Rmax (i.e., optimistic exploration)
+  - TD-update vs. Value Iteration
+  - But model requires much more memory than value function
+- Can we get a model-free variant?
+
+### Q-learning
+
+Instead of learning the optimal value function $V$, directly learn the optimal $Q$ function.
+
+Recall $Q(s, a)$ is the expected value of taking action $a$ in state $s$ and then
+following the optimal policy thereafter
+
+Given the $Q$ function we can act optimally by selecting action greedily according to $Q(s, a)$ without a model
+
+The optimal $Q$-function satisfies $V(s) = \max_{a'\in A} Q(s, a')$ which gives:
+
+$$
+\begin{aligned}
+Q(s,a) &= R(s) + \gamma \sum_{s'\in S} T(s,a,s') V(s')\\
+&= R(s) + \gamma \sum_{s'\in S} T(s,a,s') \max_{a'\in A} Q(s',a')\\
+\end{aligned}
+$$
+
+How can we learn the $Q$-function directly?
+
+#### Q-learning implementation
+
+Model-free reinforcement learning
+
+1. Start with initial Q-values (e.g. all zeros)
+2. Take action from an **exploration/exploitation** policy giving new state $s'$ (should converge to optimal policy)
+3. Perform TD update
+   $$
+   Q(s,a) \gets Q(s,a) + \alpha (R(s) + \gamma \max_{a'\in A} Q(s',a') - Q(s,a))
+   $$
+   $Q(s,a)$ is current estimate of optimal Q-value for state $s$ and action $a$.
+4. Goto 2
+
+- Does not require model since we learn the Q-value function directly
+- Use explicit $|S|\times |A|$ table to store Q-values
+- Off-policy learning: the update does not depend on the actual next action
+- The exploration/exploitation policy directly uses $Q$-values
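The loop above can be sketched as tabular Q-learning on a toy problem. Everything specific here is an assumption for illustration: the three-state chain, the `chain_step` function, a constant $\epsilon$-greedy policy, and the convention that the reward is returned together with the transition. The target drops the bootstrap term when the next state is terminal.

```python
import random


def chain_step(s, a):
    """Hypothetical 3-state chain: action 1 moves right, action 0 stays; state 2 is terminal."""
    s_next = min(s + 1, 2) if a == 1 else s
    done = s_next == 2
    reward = 10.0 if done else 0.0
    return reward, s_next, done


def q_learning(step, n_states, n_actions, start, episodes,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = start, False
        while not done:
            # Exploration/exploitation policy: epsilon-greedy on the current Q-values.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            r, s_next, done = step(s, a)
            # Off-policy target: best next Q-value, regardless of the action taken next.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


Q = q_learning(chain_step, n_states=3, n_actions=2, start=0, episodes=500)
print([[round(q, 2) for q in row] for row in Q])
```

After training, moving right has the larger Q-value in both non-terminal states, so acting greedily on the table reaches the goal without any model of the chain.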
+
+#### Convergence of Q-learning
+
+Q-learning converges to the optimal Q-value in the limit with probability 1 if:
+
+- Every state-action pair is visited infinitely often
+- Learning rate decays just so: $\sum_{t=1}^{\infty} \alpha(t) = \infty$ and $\sum_{t=1}^{\infty} \alpha(t)^2 < \infty$
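A quick check, given here as an illustration rather than as part of the lecture: the schedule $\alpha(t) = 1/t$ satisfies both conditions, while a constant learning rate $\alpha(t) = c > 0$ satisfies the first but fails the second.

$$
\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty, \qquad \sum_{t=1}^{\infty} c^2 = \infty
$$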
+
+#### Speedup for Goal-Based Problems
+
+- **Goal-Based Problem**: receive big reward in goal state and then transition to terminal state
+- Initializing $Q(s, a)$ for all $s \in S$ and $a \in A$ to zeros and then observing the following sequence of (state, reward, action) triples
+  - $(s0, 0, a0) (s1, 0, a1) (s2, 10, a2) (terminal,0)$
+- The sequence of Q-value updates would result in: $Q(s0, a0) = 0$, $Q(s1, a1) =0$, $Q(s2, a2)=10$
+- So nothing was learned at $s0$ and $s1$
+  - Next time this trajectory is observed we will get non-zero for $Q(s1, a1)$ but still $Q(s0, a0)=0$
+
+
+From the example we see that it can take many learning trials for the final reward to "back propagate" to early state-action pairs
+
+- Two approaches for addressing this problem:
+  1. Trajectory replay: store each trajectory and do several iterations of Q-updates on each one
+  2. Reverse updates: store trajectory and do Q-updates in reverse order
+- In our example (with learning rate and discount factor equal to 1 for ease of illustration) reverse updates would give
+  - $Q(s2,a2) = 10$, $Q(s1,a1) = 10$, $Q(s0,a0)=10$
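The difference between forward and reverse updates on this trajectory can be reproduced with a small sketch. It assumes learning rate and discount factor equal to 1, as in the example, and uses a hypothetical dictionary layout in which the terminal state has no Q-values and therefore contributes 0.

```python
def q_update(Q, s, a, r, s_next, alpha=1.0, gamma=1.0):
    """One Q-learning update; a next state without entries (terminal) has value 0."""
    best_next = max(Q[s_next].values()) if s_next in Q else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])


def fresh_q():
    return {"s0": {"a0": 0.0}, "s1": {"a1": 0.0}, "s2": {"a2": 0.0}}


# (state, action, reward, next state) for the trajectory in the example.
trajectory = [
    ("s0", "a0", 0.0, "s1"),
    ("s1", "a1", 0.0, "s2"),
    ("s2", "a2", 10.0, "terminal"),
]

forward = fresh_q()
for s, a, r, s_next in trajectory:
    q_update(forward, s, a, r, s_next)

backward = fresh_q()
for s, a, r, s_next in reversed(trajectory):
    q_update(backward, s, a, r, s_next)

print(forward)   # only Q(s2, a2) becomes 10
print(backward)  # all three Q-values become 10
```

Processing the trajectory backwards lets each update see the already-updated value of its successor, so the goal reward reaches $s0$ in a single pass.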
+
+### Off-policy vs on-policy RL
+
+### SARSA
+
+1. Start with initial Q-values (e.g. all zeros)
+2. Take action $a_n$ in state $s_n$ from an $\epsilon$-greedy policy, giving new state $s_{n+1}$
+3. Choose action $a_{n+1}$ in state $s_{n+1}$ from the same $\epsilon$-greedy policy
+4. Perform TD update
+   $$
+   Q(s_n,a_n) \gets Q(s_n,a_n) + \alpha (R(s_n) + \gamma Q(s_{n+1},a_{n+1}) - Q(s_n,a_n))
+   $$
+5. Goto 2
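To make the contrast with Q-learning concrete, the sketch below applies both updates to the same transition. The Q-table values and the choice of an exploratory next action are assumptions for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action actually chosen in the next state."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the best action in the next state."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])


# Hypothetical Q-table: in state 1, action 1 looks much better than action 0.
q_sarsa = [[0.0, 0.0], [1.0, 5.0]]
q_qlearn = [[0.0, 0.0], [1.0, 5.0]]

# Same transition (s=0, a=0, r=0, s'=1); the epsilon-greedy policy happened to
# pick the exploratory action 0 in state 1.
sarsa_update(q_sarsa, 0, 0, 0.0, 1, 0, alpha=0.5, gamma=1.0)
q_learning_update(q_qlearn, 0, 0, 0.0, 1, alpha=0.5, gamma=1.0)

print(q_sarsa[0][0], q_qlearn[0][0])  # 0.5 2.5
```

SARSA's estimate reflects the value of the policy that is actually being followed, exploration included, whereas Q-learning's estimate assumes greedy behavior from the next state on.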
+
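+> [!NOTE]
+>
+> Compared with Q-learning, SARSA (on-policy) usually takes safer actions, because its update uses the action the $\epsilon$-greedy policy actually takes next rather than the greedy one.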
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -6,4 +6,7 @@ export default {
    CSE510_L1: "CSE510 Deep Reinforcement Learning (Lecture 1)",
    CSE510_L2: "CSE510 Deep Reinforcement Learning (Lecture 2)",
    CSE510_L3: "CSE510 Deep Reinforcement Learning (Lecture 3)",
+    CSE510_L4: "CSE510 Deep Reinforcement Learning (Lecture 4)",
+    CSE510_L5: "CSE510 Deep Reinforcement Learning (Lecture 5)",
+    CSE510_L6: "CSE510 Deep Reinforcement Learning (Lecture 6)",
 }