# CSE510 Deep Reinforcement Learning (Lecture 17)

## Model-based RL

### Model-based RL vs. Model-free RL

## Why Model-Based RL?

- Sample efficiency
- Generalization and transferability
- Support efficient exploration in large-scale RL problems
- Explainability
- Super-human performance in practice
  - Video games, Go, algorithm discovery, etc.

> [!NOTE]
>
> A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition $T(s' \mid s, a)$ and the reward $R(s, a)$.

For ADP-based (model-based) RL:

1. Start with an initial model
2. Solve for the optimal policy given the current model
   - (using value or policy iteration)
3. Take an action according to an exploration/exploitation policy
   - Explores more early on and gradually uses the policy from step 2
4. Update the estimated model based on the observed transition
5. Go to step 2
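
To make the loop concrete, here is a minimal tabular sketch of the ADP-style loop above. Everything in it is an illustrative assumption rather than the lecture's reference implementation: the toy environment interface (`env.reset()` returns a state index, `env.step(a)` returns `(next_state, reward, done)`), the epsilon-greedy exploration schedule, and all hyperparameters.

```python
import numpy as np

def adp_model_based_rl(env, n_states, n_actions, episodes=200,
                       gamma=0.95, eps_start=1.0, eps_end=0.05):
    """ADP loop: estimate T and R from counts, re-plan, act, update the model."""
    counts = np.zeros((n_states, n_actions, n_states))    # transition counts N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))          # accumulated observed rewards
    V = np.zeros(n_states)

    def plan(n_iters=100):
        """Step 2: value iteration on the current estimated model."""
        nonlocal V
        n_sa = np.maximum(counts.sum(axis=2, keepdims=True), 1)
        T = counts / n_sa                                  # T(s'|s,a) ~ N(s,a,s') / N(s,a)
        R = reward_sum / n_sa[..., 0]                      # R(s,a) ~ mean observed reward
        for _ in range(n_iters):
            Q = R + gamma * (T @ V)                        # backup under the estimated model
            V = Q.max(axis=1)
        return Q

    for ep in range(episodes):
        Q = plan()                                         # step 2: solve the current model
        # Step 3: epsilon-greedy exploration; explore more early on, then follow the plan.
        eps = eps_start + (eps_end - eps_start) * ep / max(episodes - 1, 1)
        s, done = env.reset(), False
        while not done:
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            counts[s, a, s_next] += 1                      # step 4: update the estimated model
            reward_sum[s, a] += r
            s = s_next
        # Step 5: loop back and re-plan with the updated model.
    return plan()
```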

### Problems in Large Scale Model-Based RL

- New planning methods for a given model
  - Model is large and not perfect
- Model learning
  - Requiring generalization
- Exploration/exploitation strategy
  - Requiring generalization and attention

### Large Scale Model-Based RL

- New optimal planning methods (Today)
  - Model is large and not perfect
- Model learning (Next Lecture)
  - Requiring generalization
- Exploration/exploitation strategy (Next week)
  - Requiring generalization and attention
## Model-based RL

### Deterministic Environment: Cross-Entropy Method

Simplest method: guess and check, the "random shooting method":

- Pick $A_1, A_2, ..., A_n$ from some distribution (e.g. uniform)
- Choose $A_i$ based on $\arg\max_i J(A_i)$
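
A minimal sketch of random shooting for planning with a deterministic model. The objective `J` (which internally rolls out the model), the horizon, the action dimensionality, and the uniform action bounds are illustrative assumptions.

```python
import numpy as np

def random_shooting(J, horizon, action_dim, n_candidates=1000, low=-1.0, high=1.0):
    """Guess and check: sample action sequences uniformly, return the best one."""
    # Each candidate A_i is a full open-loop action sequence of shape (horizon, action_dim).
    candidates = np.random.uniform(low, high, size=(n_candidates, horizon, action_dim))
    returns = np.array([J(A) for A in candidates])   # evaluate J(A_i) with the model
    return candidates[np.argmax(returns)]            # choose A_i = argmax_i J(A_i)
```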

#### Cross-Entropy Method (CEM) with continuous-valued inputs

1. Sample $A_1, A_2, ..., A_n$ from some distribution $p(A)$
2. Evaluate $J(A_1), J(A_2), ..., J(A_n)$
3. Pick the _elites_ $A_1, A_2, ..., A_m$ with the highest $J(A_i)$, where $m<n$
4. Update the distribution $p(A)$ to be more likely to choose the elites
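
A minimal CEM sketch over open-loop action sequences. The Gaussian parameterization of $p(A)$, the elite fraction, the iteration count, and the objective `J` are illustrative assumptions; with a single iteration and no refit this reduces to random shooting.

```python
import numpy as np

def cem_plan(J, horizon, action_dim, n_samples=500, n_elites=50,
             n_iters=10, init_std=1.0):
    """Cross-Entropy Method: iteratively refit p(A) toward the elite samples."""
    mean = np.zeros((horizon, action_dim))             # parameters of the Gaussian p(A)
    std = np.full((horizon, action_dim), init_std)

    for _ in range(n_iters):
        # 1. Sample A_1..A_n from the current distribution p(A).
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)
        # 2. Evaluate J(A_1)..J(A_n), using the model as a simulator inside J.
        returns = np.array([J(A) for A in samples])
        # 3. Pick the m elites with the highest J(A_i).
        elites = samples[np.argsort(returns)[-n_elites:]]
        # 4. Update p(A) to make the elites more likely (refit mean and std).
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean                                        # planned action sequence
```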

Pros:

Use the model as a simulator to evaluate actions.

Tree policy:

- Select/create a leaf node
- Selection and Expansion
- Bandit problem!

Default policy / rollout policy:

- Play the game till the end
- Simulation

Decision policy:

- Selecting the final action

A compact sketch combining these three policies is given after the decision-policy subsection below.

#### Upper Confidence Bound on Trees (UCT)

Selecting a child node is a multi-armed bandit problem.

UCB1 is applied for each child selection:

$$
UCT = \overline{X}_j + 2C_p\sqrt{\frac{2\ln n}{n_j}}
$$

- where $\overline{X}_j$ is the mean reward of selecting this position
  - lies in $[0,1]$
- $n$ is the number of times the current (parent) node has been visited
- $n_j$ is the number of times child node $j$ has been visited
  - guarantees we explore each child node at least once (an unvisited child has an infinite UCT value)
- $C_p$ is some constant $>0$

Each child has a non-zero probability of being selected.

We can adjust $C_p$ to change the exploration vs. exploitation trade-off.
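
As a concrete reading of the formula, a small helper might compute the UCT score and pick the child to descend into. The node fields (`visits`, `total_reward`, `children`) and the default $C_p = 1/\sqrt{2}$ (a commonly used value for rewards in $[0,1]$) are assumptions for illustration.

```python
import math

def uct_score(child, parent_visits, c_p=1 / math.sqrt(2)):
    """UCT = mean reward + 2 * C_p * sqrt(2 * ln(n) / n_j)."""
    if child.visits == 0:
        return float("inf")            # unvisited children go first: each is explored at least once
    mean_reward = child.total_reward / child.visits              # assumed normalized to [0, 1]
    return mean_reward + 2 * c_p * math.sqrt(2 * math.log(parent_visits) / child.visits)

def select_child(node):
    """Tree-policy step: descend into the child with the highest UCT score."""
    return max(node.children, key=lambda c: uct_score(c, node.visits))
```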

#### Decision Policy: Final Action Selection

Selecting the best child:

- Max (highest weight)
- Robust (most visits)
- Max-Robust (max of the two)
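
Below is the compact MCTS sketch referred to earlier, tying the pieces together: UCT-based selection, expansion, a random rollout as the default policy, backpropagation of the result, and a "robust" (most-visited) final decision. The game interface (`legal_actions()`, `step()`, `is_terminal()`, `result()` returning a reward in $[0,1]$) and all hyperparameters are illustrative assumptions, not the lecture's reference implementation.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried_actions = list(state.legal_actions())
        self.visits = 0
        self.total_reward = 0.0

def uct(child, parent_visits, c_p=1 / math.sqrt(2)):
    if child.visits == 0:
        return float("inf")
    return (child.total_reward / child.visits
            + 2 * c_p * math.sqrt(2 * math.log(parent_visits) / child.visits))

def mcts(root_state, n_iterations=1000):
    root = Node(root_state)
    for _ in range(n_iterations):
        node = root
        # Tree policy (selection): follow UCT while the node is fully expanded and has children.
        while not node.untried_actions and node.children:
            node = max(node.children, key=lambda c: uct(c, node.visits))
        # Tree policy (expansion): create one new leaf node for an untried action.
        if node.untried_actions:
            action = node.untried_actions.pop()
            child = Node(node.state.step(action), parent=node, action=action)
            node.children.append(child)
            node = child
        # Default/rollout policy (simulation): play the game randomly until the end.
        state = node.state
        while not state.is_terminal():
            state = state.step(random.choice(list(state.legal_actions())))
        reward = state.result()   # in a two-player game, the perspective would alternate on backup
        # Backpropagation: update visit counts and rewards along the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    # Decision policy: "robust" child = most visits (alternatives: max value, max-robust).
    return max(root.children, key=lambda c: c.visits).action
```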

#### Advantages and disadvantages of MCTS

Advantages:

- MCTS has been proved to converge to the minimax solution
- Domain-independent
- Anytime algorithm
- Achieves better performance when the branching factor is large

Disadvantages:

- The basic version converges very slowly
  - Leading to small-probability failures (rarely explored but critical moves can be missed)

### Example usage of MCTS

AlphaGo vs. Lee Sedol, Game 4:

- White's move 78 (Lee): an unexpected move that even other professional players didn't see coming; a needle in the haystack
  - AlphaGo failed to explore this move in its MCTS

Imitation learning from MCTS: train a policy network to imitate the action choices produced by the MCTS search.

#### Continuous Case: Trajectory Optimization

# CSE510 Deep Reinforcement Learning (Lecture 18)

## Model-based RL framework

Model Learning with High-Dimensional Observations:

- Learning a model in a latent space with observation reconstruction
- Learning a model in a latent space without observation reconstruction
- Learning a model in the observation space (i.e., videos)

### Naive approach

If we knew $f(s_t,a_t)=s_{t+1}$ (or $p(s_{t+1} \mid s_t, a_t)$ in the stochastic case), we could use the planning tools from last week.

So we can learn $f(s_t,a_t)$ from data, and _then_ plan through it.

Model-based reinforcement learning version **0.5**:

1. Run base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
3. Plan through $f(s_t,a_t)$ to choose actions $a_t$
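
A minimal sketch of version 0.5 with a small neural-network dynamics model. The use of PyTorch, the network architecture, the random-shooting planner, and the `reward_fn(s, a)` interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fit_dynamics(states, actions, next_states, epochs=100, lr=1e-3):
    """Step 2: fit f(s, a) ~ s' by minimizing the squared prediction error."""
    s, a, s_next = (torch.as_tensor(x).float() for x in (states, actions, next_states))
    model = nn.Sequential(
        nn.Linear(s.shape[1] + a.shape[1], 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, s.shape[1]),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(torch.cat([s, a], dim=-1))     # f(s_i, a_i)
        loss = ((pred - s_next) ** 2).mean()        # sum_i ||f(s_i, a_i) - s_{i+1}||^2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

def plan(model, s0, reward_fn, horizon=15, n_candidates=500, action_dim=1):
    """Step 3: plan through f by random shooting (CEM would also work here)."""
    s0 = torch.as_tensor(s0).float()
    best_return, best_actions = -float("inf"), None
    with torch.no_grad():
        for _ in range(n_candidates):
            actions = torch.rand(horizon, action_dim) * 2 - 1   # candidate sequence in [-1, 1]
            s, total = s0.clone(), 0.0
            for a in actions:
                total += float(reward_fn(s, a))
                s = model(torch.cat([s, a]))                    # roll the learned model forward
            if total > best_return:
                best_return, best_actions = total, actions
    return best_actions
```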

Sometimes, it does work!

- Essentially how system identification works in classical robotics
- Some care should be taken to design a good base policy
- Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters

However, the distribution mismatch problem becomes worse as we use more expressive model classes.

Version 0.5: collect random samples, train dynamics, plan

- Pro: simple, no iterative procedure
- Con: distribution mismatch problem

Version 1.0: iteratively collect data, replan, collect data

- Pro: simple, solves the distribution mismatch problem
- Con: an open-loop plan might perform poorly, especially in stochastic domains

Version 1.5: iteratively collect data using MPC (replan at each step)

- Pro: robust to small model errors
- Con: computationally expensive, but we have a planning algorithm available
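
A minimal sketch of the version 1.5 loop. It reuses the hypothetical `fit_dynamics` and `plan` helpers sketched above and assumes an `env.sample_action()` method for the random base policy; refitting on all data collected so far is what addresses distribution mismatch.

```python
import numpy as np

def mbrl_v15(env, reward_fn, n_outer=10, steps_per_iter=200, n_random=200):
    """Version 1.5: iteratively collect data with MPC (replan at each step), refit, repeat."""
    states, actions, next_states = [], [], []

    def record(s, a, s_next):
        states.append(s)
        actions.append(a)
        next_states.append(s_next)

    # Seed the dataset with a random base policy (as in version 0.5).
    s = env.reset()
    for _ in range(n_random):
        a = env.sample_action()                        # assumed helper on the environment
        s_next, r, done = env.step(a)
        record(s, a, s_next)
        s = env.reset() if done else s_next

    for _ in range(n_outer):
        # Refit the dynamics model on all data collected so far.
        model = fit_dynamics(np.array(states), np.array(actions), np.array(next_states))
        s = env.reset()
        for _ in range(steps_per_iter):
            planned = plan(model, s, reward_fn)        # replan from the current state (MPC)...
            a = planned[0].numpy()                     # ...but execute only the first planned action
            s_next, r, done = env.step(a)
            record(s, a, s_next)
            s = env.reset() if done else s_next
    return model
```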

Version 2.0: backpropagate directly into the policy

- Pro: computationally cheap at runtime
- Con: can be numerically unstable, especially in stochastic domains
- Solution: model-free RL + model-based RL

Final version:

1. Run base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
3. Backpropagate through $f(s_t,a_t)$ into the policy to optimize $\pi_\theta(s_t,a_t)$
4. Run the policy $\pi_\theta(s_t,a_t)$ to collect more data $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
5. Go to step 2
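
A minimal sketch of step 3, backpropagating a model-based rollout into the policy. The deterministic policy network, the batch of start states, the differentiable `reward_fn`, and all hyperparameters are illustrative assumptions; as noted above, in practice this is often combined with model-free methods for stability.

```python
import torch
import torch.nn as nn

def train_policy_through_model(model, reward_fn, state_dim, action_dim,
                               horizon=15, iters=200, lr=1e-3):
    """Unroll the learned dynamics model and backpropagate the return into pi_theta."""
    policy = nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, action_dim), nn.Tanh(),      # deterministic policy with actions in [-1, 1]
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)   # only the policy is updated, not the model
    for _ in range(iters):
        s = torch.randn(32, state_dim)             # batch of (assumed) start states
        total_reward = 0.0
        for _ in range(horizon):
            a = policy(s)                          # a_t = pi_theta(s_t)
            total_reward = total_reward + reward_fn(s, a).mean()
            s = model(torch.cat([s, a], dim=-1))   # s_{t+1} = f(s_t, a_t), kept differentiable
        loss = -total_reward                       # gradients flow through f into the policy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```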
## Model Learning with High-Dimensional Observations

- Learning a model in a latent space with observation reconstruction
- Learning a model in a latent space without observation reconstruction
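
As a rough illustration of the first bullet (not the specific architecture used in this course), a latent-space model with observation reconstruction can be built from an encoder, a latent dynamics network, and a decoder; the MLPs below stand in for the convolutional networks one would use for image observations.

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Encode observations into a latent z, predict z' from (z, a), decode z for reconstruction."""
    def __init__(self, obs_dim, action_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
                                      nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

    def loss(self, obs, action, next_obs):
        z = self.encoder(obs)
        z_next_pred = self.dynamics(torch.cat([z, action], dim=-1))
        recon = self.decoder(z)                    # reconstruction ties the latent z to the observation
        recon_loss = ((recon - obs) ** 2).mean()
        # Stop-gradient on the target latent is one common design choice, assumed here.
        latent_loss = ((z_next_pred - self.encoder(next_obs).detach()) ** 2).mean()
        return recon_loss + latent_loss
```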