update

2025-10-28 11:20:24 -05:00
parent e4490f6fa2
commit 361745d658
4 changed files with 199 additions and 10 deletions
--- a/content/CSE510/CSE510_L17.md
+++ b/content/CSE510/CSE510_L17.md
@@ -1,14 +1,47 @@
 # CSE510 Deep Reinforcement Learning (Lecture 17)

-## Model-based RL
-
-### Model-based RL vs. Model-free RL
+## Why Model-Based RL?

 - Sample efficiency
 - Generalization and transferability
 - Support efficient exploration in large-scale RL problems
 - Explainability
 - Super-human performance in practice
+  - Video games, Go, Algorithm discovery, etc.
+
+> [!NOTE]
+>
+> A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition $T(s'| s, a)$ and the reward $R(s, a)$.
+
+For ADP-based (model-based) RL:
+
+1. Start with an initial model
+2. Solve for the optimal policy given the current model
+   - (using value or policy iteration)
+3. Take an action according to an exploration/exploitation policy
+   - Explores more early on and gradually shifts to the policy from step 2
+4. Update the estimated model based on the observed transition
+5. Go to step 2
+
+### Problems in Large Scale Model-Based RL
+
+- New planning methods given a model
+  - Model is large and not perfect
+- Model learning
+  - Requiring generalization
+- Exploration/exploitation strategy
+  - Requiring generalization and attention
+
+### Large Scale Model-Based RL
+
+- New optimal planning methods (Today)
+  - Model is large and not perfect
+- Model learning (Next Lecture)
+  - Requiring generalization
+- Exploration/exploitation strategy (Next week)
+  - Requiring generalization and attention
+
+## Model-based RL

 ### Deterministic Environment: Cross-Entropy Method

@@ -29,12 +62,14 @@ Simplest method: guess and check: "random shooting method"
 - pick $A_1, A_2, ..., A_n$ from some distribution (e.g. uniform)
 - Choose $A_i$ based on $\argmax_i J(A_i)$

-#### Cross-Entropy Method with continuous-valued inputs
+#### Cross-Entropy Method (CEM) with continuous-valued inputs

-1. sample $A_1, A_2, ..., A_n$ from some distribution $p(A)$
-2. evaluate $J(A_1), J(A_2), ..., J(A_n)$
-3. pick the _elites_ $A_1, A_2, ..., A_m$ with the highest $J(A_i)$, where $m<n$
-4. update the distribution $p(A)$ to be more likely to choose the elites
+Cross-entropy method with continuous-valued inputs:
+
+1. Sample $A_1, A_2, ..., A_n$ from some distribution $p(A)$
+2. Evaluate $J(A_1), J(A_2), ..., J(A_n)$
+3. Pick the _elites_ $A_1, A_2, ..., A_m$ with the highest $J(A_i)$, where $m<n$
+4. Update the distribution $p(A)$ to be more likely to choose the elites
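+
+As one possible concrete instance of step 4 (an assumption; the notes do not fix the form of $p(A)$), take $p(A)$ to be a diagonal Gaussian $\mathcal{N}(\mu, \sigma^2)$ and refit it to the $m$ elites by maximum likelihood, with the square taken elementwise:
+
+$$
+\mu \leftarrow \frac{1}{m}\sum_{i=1}^{m} A_i,\qquad \sigma^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (A_i-\mu)^2
+$$
+
+Repeating steps 1 to 4 with the refitted $\mu$ and $\sigma^2$ concentrates the samples around action sequences with high $J$.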

 Pros:

@@ -68,15 +103,70 @@ Use model as simulator to evaluate actions.

 Tree policy:

-Decision policy:
+- Select/create leaf node
+- Selection and Expansion
+- Bandit problem!
+
+Default policy/rollout policy
+
+- Play the game till end
+- Simulation
+
+Decision policy
+
+- Selecting the final action
+
+#### Upper Confidence Bound on Trees (UCT)
+
+Selecting Child Node - Multi-Arm Bandit Problem
+
+UCB1 applied for each child selection
+
+$$
+UCT=\overline{X_j}+2C_p\sqrt{\frac{2\ln n}{n_j}}
+$$
+
+- where $\overline{X_j}$ is the mean reward of child node $j$
+  - Rewards are in $[0,1]$
+- $n$ is the number of times the current (parent) node has been visited
+- $n_j$ is the number of times child node $j$ has been visited
+  - An unvisited child ($n_j = 0$) has an unbounded bonus, so each child node is guaranteed to be explored at least once
+- $C_p$ is some constant $>0$
+
+Each child has non-zero probability of being selected
+
+We can adjust $C_p$ to change exploration vs. exploitation trade-off
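+
+A small worked example, with numbers chosen purely for illustration (they are not from the lecture): take $C_p = 1/\sqrt{2}$, so the exploration term $2C_p\sqrt{2\ln n / n_j}$ simplifies to $2\sqrt{\ln n / n_j}$, and suppose the parent has $n = 100$ visits.
+
+$$
+\begin{aligned}
+\text{child 1: } & \overline{X_1}=0.6,\ n_1=50 &&\Rightarrow\ 0.6 + 2\sqrt{\tfrac{\ln 100}{50}} \approx 0.6 + 0.61 = 1.21 \\
+\text{child 2: } & \overline{X_2}=0.4,\ n_2=10 &&\Rightarrow\ 0.4 + 2\sqrt{\tfrac{\ln 100}{10}} \approx 0.4 + 1.36 = 1.76
+\end{aligned}
+$$
+
+Child 2 is selected despite its lower mean reward because it has been visited far less often; a sufficiently small $C_p$ would shrink both bonuses and tip the choice toward child 1.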
+
+#### Decision Policy: Final Action Selection
+
+Selecting the best child

 - Max (highest weight)
 - Robust (most visits)
 - Max-Robust (max of the two)

-#### Upper Confidence Bound on Trees (UCT)
+#### Advantages and disadvantages of MCTS

+Advantages:

+- Proven to converge to the minimax solution
+- Domain-independent
+- Anytime algorithm
+- Performs comparatively well with a large branching factor
+
+Disadvantages:
+
+- Basic version converges very slowly
+- Slow convergence can lead to small-probability failures
+
+### Example usage of MCTS
+
+AlphaGo vs Lee Sedol, Game 4
+
+- White 78 (Lee): an unexpected move that even other professional players did not see coming, a needle in the haystack
+- AlphaGo failed to explore this move in MCTS
+
+Imitation learning from MCTS:

 #### Continuous Case: Trajectory Optimization

--- a/content/CSE510/CSE510_L18.md
+++ b/content/CSE510/CSE510_L18.md
@@ -0,0 +1,65 @@
+# CSE510 Deep Reinforcement Learning (Lecture 18)
+
+## Model-based RL framework
+
+Model Learning with High-Dimensional Observations
+
+- Learning model in a latent space with observation reconstruction
+- Learning model in a latent space without observation reconstruction
+- Learning model in the observation space (i.e., videos)
+
+### Naive approach
+
+If we knew $f(s_t,a_t)=s_{t+1}$ (or $p(s_{t+1}| s_t, a_t)$ in the stochastic case), we could use the tools from last week.
+
+So we can learn $f(s_t,a_t)$ from data, and _then_ plan through it.
+
+Model-based reinforcement learning version **0.5**:
+
+1. Run base policy $\pi_0$ (e.g. random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
+2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
+3. Plan through $f(s_t,a_t)$ to choose action $a_t$
+
+Sometimes, it does work!
+
+- Essentially how system identification works in classical robotics
+- Some care should be taken to design a good base policy
+- Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
+
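+However, the distribution mismatch problem (the planner visits states unlike those the base policy $\pi_0$ collected, where the model is unreliable) becomes worse as we use more expressive model classes.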
+
+Version 0.5: collect random samples, train dynamics, plan
+
+- Pro: simple, no iterative procedure
+- Con: distribution mismatch problem
+
+Version 1.0: iteratively collect data, replan, collect data
+
+- Pro: simple, solves distribution mismatch
+- Con: open loop plan might perform poorly, esp. in stochastic domains
+
+Version 1.5: iteratively collect data using MPC (replan at each step)
+
+- Pro: robust to small model errors
+- Con: computationally expensive, though a planning algorithm is already available
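+
+One way to write the MPC step, as a sketch under assumptions: the planning horizon $H$ and the reward function $r$ are notation introduced here, not taken from the notes. At every time step $t$, plan through the learned model from the current state:
+
+$$
+a_{t:t+H-1}^{*}=\argmax_{a_t,\dots,a_{t+H-1}}\sum_{k=t}^{t+H-1} r(s_k,a_k)\quad\text{s.t. } s_{k+1}=f(s_k,a_k)
+$$
+
+Only the first action $a_t^{*}$ is executed; the observed transition $(s_t, a_t^{*}, s_{t+1})$ is added to $\mathcal{D}$ and planning restarts from $s_{t+1}$, which is why small model errors do not accumulate over the whole trajectory.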
+
+Version 2.0: backpropagate directly into policy
+
+- Pro: computationally cheap at runtime
+- Con: can be numerically unstable, especially in stochastic domains
+- Solution: model-free RL + model-based RL
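+
+To see where the instability can come from, consider a deterministic policy $a_t=\pi_\theta(s_t)$ and an objective $J(\theta)=\sum_{t} r(s_t,a_t)$ (both are simplifying assumptions for this sketch; $r$ is notation introduced here). With $s_{t+1}=f(s_t,a_t)$, the chain rule gives the recursion
+
+$$
+\frac{d s_{t+1}}{d\theta}=\left(\frac{\partial f}{\partial s_t}+\frac{\partial f}{\partial a_t}\frac{\partial \pi_\theta}{\partial s_t}\right)\frac{d s_t}{d\theta}+\frac{\partial f}{\partial a_t}\frac{\partial \pi_\theta}{\partial \theta}
+$$
+
+Unrolling this over a long horizon multiplies many Jacobians together, so the gradient of $J$ can vanish or explode, much like backpropagation through time in a recurrent network.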
+
+Final version:
+
+1. Run base policy $\pi_0$ (e.g. random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
+2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
+3. Backpropagate through $f(s_t,a_t)$ into the policy to optimize $\pi_\theta(s_t,a_t)$
+4. Run the policy $\pi_\theta(s_t,a_t)$ to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
+5. Go to step 2
+
+## Model Learning with High-Dimensional Observations
+
+- Learning model in a latent space with observation reconstruction
+- Learning model in a latent space without observation reconstruction
+
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -20,4 +20,5 @@ export default {
    CSE510_L15: "CSE510 Deep Reinforcement Learning (Lecture 15)",
    CSE510_L16: "CSE510 Deep Reinforcement Learning (Lecture 16)",
    CSE510_L17: "CSE510 Deep Reinforcement Learning (Lecture 17)",
+    CSE510_L18: "CSE510 Deep Reinforcement Learning (Lecture 18)",
 }