# CSE510 Deep Reinforcement Learning (Lecture 17)

## Why Model-Based RL?

- Sample efficiency
- Generalization and transferability
- Support efficient exploration in large-scale RL problems
- Explainability
- Super-human performance in practice
  - Video games, Go, algorithm discovery, etc.

> [!NOTE]
> A model is anything the agent can use to predict how the environment will respond to its actions: concretely, the state-transition distribution $T(s' \mid s, a)$ and the reward function $R(s, a)$.

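As a small illustration (not from the lecture), such a model can be thought of as two learned functions; the interface below is a hypothetical sketch:

```python
from typing import Dict, Protocol


class Model(Protocol):
    """Anything the agent can use to predict the environment's response."""

    def transition(self, s: int, a: int) -> Dict[int, float]:
        """Estimated distribution T(s' | s, a) over next states."""
        ...

    def reward(self, s: int, a: int) -> float:
        """Estimated reward R(s, a)."""
        ...
```
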
For ADP-based (model-based) RL (a minimal code sketch follows the list):

1. Start with an initial model
2. Solve for the optimal policy given the current model
   - (using value or policy iteration)
3. Take an action according to an exploration/exploitation policy
   - Explores more early on and gradually shifts toward the policy from step 2
4. Update the estimated model based on the observed transition
5. Go to step 2

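A minimal tabular sketch of this loop, assuming a simplified discrete environment interface (`env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`; these names and the once-per-episode re-planning are simplifications for illustration, not part of the lecture):

```python
import random

import numpy as np


def adp_model_based_rl(env, n_states, n_actions, episodes=200, gamma=0.95, eps=0.3):
    """Tabular ADP-style RL: estimate T and R from experience, plan by value iteration."""
    counts = np.zeros((n_states, n_actions, n_states))  # transition counts N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))        # accumulated rewards for R(s, a)

    def plan():
        """Step 2: solve the current estimated model with value iteration."""
        sa = np.maximum(counts.sum(axis=2, keepdims=True), 1)
        T = counts / sa                                  # estimated T(s' | s, a)
        R = reward_sum / sa.squeeze(-1)                  # estimated R(s, a)
        V = np.zeros(n_states)
        for _ in range(200):
            Q = R + gamma * (T @ V)                      # backup, shape (S, A)
            V = Q.max(axis=1)
        return Q

    for ep in range(episodes):
        Q = plan()                                       # re-plan (here: once per episode)
        s, done = env.reset(), False
        while not done:
            # Step 3: epsilon-greedy exploration, decaying as training proceeds
            if random.random() < eps * (1 - ep / episodes):
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Step 4: update the estimated model with the observed transition
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r
            s = s_next
    return plan()
```
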
### Problems in Large-Scale Model-Based RL

- New optimal planning methods for a given model (Today)
  - The model is large and not perfect
- Model learning (Next lecture)
  - Requires generalization
- Exploration/exploitation strategy (Next week)
  - Requires generalization and attention

## Model-Based RL

### Deterministic Environment: Cross-Entropy Method

#### Stochastic Optimization

Abstract away optimal control/planning as a generic optimization problem:

$$
a_1, \ldots, a_T = \argmax_{a_1, \ldots, a_T} J(a_1, \ldots, a_T)
$$

or, writing $A = (a_1, \ldots, a_T)$,

$$
A = \argmax_{A} J(A)
$$

The simplest method is guess and check, the "random shooting" method (see the sketch below):

- Pick $A_1, A_2, \ldots, A_n$ from some distribution (e.g., uniform)
- Choose $A_i$ based on $\argmax_i J(A_i)$

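A minimal sketch of random shooting, assuming a learned deterministic dynamics function `f(s, a)` and reward function `r(s, a)` with continuous actions (these names are hypothetical, for illustration only):

```python
import numpy as np


def random_shooting(f, r, s0, horizon, action_dim, n_samples=1000,
                    low=-1.0, high=1.0, seed=0):
    """Guess and check: sample n action sequences, keep the best one."""
    rng = np.random.default_rng(seed)
    best_J, best_A = -np.inf, None
    for _ in range(n_samples):
        A = rng.uniform(low, high, size=(horizon, action_dim))  # one guess A_i
        s, J = s0, 0.0
        for a in A:                      # roll the model forward to evaluate J(A_i)
            J += r(s, a)
            s = f(s, a)
        if J > best_J:
            best_J, best_A = J, A
    return best_A                        # the A_i with the highest J(A_i)
```
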
#### Cross-Entropy Method (CEM) with Continuous-Valued Inputs

Cross-entropy method with continuous-valued inputs (see the sketch below):

1. Sample $A_1, A_2, \ldots, A_n$ from some distribution $p(A)$
2. Evaluate $J(A_1), J(A_2), \ldots, J(A_n)$
3. Pick the _elites_: the $m$ samples with the highest $J(A_i)$, where $m < n$
4. Update the distribution $p(A)$ to be more likely to generate the elites, and repeat

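A minimal sketch of CEM over action sequences, reusing the hypothetical `f(s, a)` and `r(s, a)` from the random-shooting sketch and representing $p(A)$ as a diagonal Gaussian that is refit to the elites each iteration:

```python
import numpy as np


def cem_plan(f, r, s0, horizon, action_dim, n_samples=500, n_elites=50,
             n_iters=10, seed=0):
    """Cross-entropy method: sample, evaluate, refit p(A) to the elites, repeat."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))     # mean of p(A)
    sigma = np.ones((horizon, action_dim))   # std of p(A)

    def J(A):
        s, total = s0, 0.0
        for a in A:
            total += r(s, a)
            s = f(s, a)
        return total

    for _ in range(n_iters):
        # 1. sample A_1, ..., A_n from p(A)
        samples = mu + sigma * rng.standard_normal((n_samples, horizon, action_dim))
        # 2. evaluate J(A_1), ..., J(A_n)
        scores = np.array([J(A) for A in samples])
        # 3. pick the m elites with the highest scores
        elites = samples[np.argsort(scores)[-n_elites:]]
        # 4. refit p(A) so the elites become more likely
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu                                # plan: mean of the final p(A)
```
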
Pros:

- Very fast to run if parallelized
- Extremely simple to implement

Cons:

- Very harsh dimensionality limit
- Only open-loop planning
- Suboptimal in stochastic environments

### Discrete Case: Monte Carlo Tree Search (MCTS)

Discrete planning can be treated as a search problem.

Closed-loop planning:

- At each state, iteratively build a search tree to evaluate actions, select the best action, and then move to the next state.

Use the model as a simulator to evaluate actions.

#### MCTS Algorithm Overview

Each iteration consists of four steps (a code sketch follows the list):

1. Selection: starting from the root, use the tree policy to choose actions until reaching a leaf of the search tree
2. Expansion: add a new child node to the search tree
3. Simulation: run a rollout from the new node to the end of the episode using the default policy
4. Backpropagation: propagate the simulation result back up the tree, updating the values and visit counts of the nodes along the path

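A compact sketch of these four steps for a single-player, reward-maximizing setting; the game/model interface (`legal_actions(s)`, `step(s, a)`, `is_terminal(s)`, `reward(s)`) is an assumed one for illustration, and the exploration bonus in the selection step is the UCT rule discussed below:

```python
import math
import random


class Node:
    """Search-tree node holding visit and value statistics."""

    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}           # action -> Node
        self.visits, self.value = 0, 0.0


def mcts(root_state, game, n_iters=1000, c_p=1.0):
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend with the tree policy (UCT) while fully expanded
        while node.children and len(node.children) == len(game.legal_actions(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: ch.value / ch.visits
                       + 2 * c_p * math.sqrt(2 * math.log(node.visits) / ch.visits))
        # 2. Expansion: add one untried child (unless the state is terminal)
        if not game.is_terminal(node.state):
            untried = [a for a in game.legal_actions(node.state) if a not in node.children]
            a = random.choice(untried)
            node.children[a] = Node(game.step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Simulation: rollout to the end with the default (random) policy
        s = node.state
        while not game.is_terminal(s):
            s = game.step(s, random.choice(game.legal_actions(s)))
        outcome = game.reward(s)
        # 4. Backpropagation: update statistics along the path back to the root
        while node is not None:
            node.visits += 1
            node.value += outcome
            node = node.parent
    # Decision policy ("robust child"): return the most-visited root action
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```
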
#### Policies in MCTS

Tree policy:

- Selects/creates a leaf node
- Covers Selection and Expansion
- A bandit problem!

Default policy (rollout policy):

- Plays the game out to the end
- Covers Simulation

Decision policy:

- Selects the final action at the root

#### Upper Confidence Bound on Trees (UCT)

Selecting a child node is a multi-armed bandit problem. UCB1 is applied to each child selection:

$$
UCT = \overline{X_j} + 2C_p\sqrt{\frac{2\ln n}{n_j}}
$$

- where $\overline{X_j}$ is the mean reward of selecting this position (child $j$)
  - assumed to lie in $[0, 1]$
- $n$ is the number of times the current (parent) node has been visited
- $n_j$ is the number of times child node $j$ has been visited
  - this guarantees we explore each child node at least once (the bonus is infinite when $n_j = 0$)
- $C_p$ is some constant $> 0$

Each child has a non-zero probability of being selected.

We can adjust $C_p$ to change the exploration vs. exploitation trade-off, as the toy example below illustrates.

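A tiny numeric illustration with made-up statistics: an under-explored child can outscore a child with a higher mean reward.

```python
import math


def uct(mean_reward, n_parent, n_child, c_p=1.0):
    """UCT score: mean reward plus exploration bonus."""
    return mean_reward + 2 * c_p * math.sqrt(2 * math.log(n_parent) / n_child)


# Parent visited 100 times: the well-explored child gets a small bonus,
# the rarely visited child gets a large one and is selected next.
print(uct(0.6, 100, 90))  # ~1.24
print(uct(0.4, 100, 2))   # ~4.69
```
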
#### Decision Policy: Final Action Selection

Selecting the best child of the root (a small sketch follows the list):

- Max (highest value)
- Robust (most visits)
- Max-Robust (maximal in both value and visits)

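A small sketch over hypothetical root-child statistics (the `Child` record and the numbers are made up):

```python
from dataclasses import dataclass


@dataclass
class Child:
    action: str
    value: float   # mean reward observed for this child
    visits: int


children = [Child("a", 0.71, 120), Child("b", 0.65, 480), Child("c", 0.80, 35)]

max_child = max(children, key=lambda c: c.value)      # Max: highest value -> "c"
robust_child = max(children, key=lambda c: c.visits)  # Robust: most visits -> "b"
# Max-Robust: a child that is maximal in both; here none exists, so a Max-Robust
# rule would typically keep searching before committing to an action.
print(max_child.action, robust_child.action)
```
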
#### Advantages and Disadvantages of MCTS

Advantages:

- MCTS is proven to converge to the minimax solution (given enough time)
- Domain-independent
- Anytime algorithm
- Scales better than exhaustive search when the branching factor is large

Disadvantages:

- The basic version converges very slowly
- Rarely explored branches can lead to small-probability failures

### Example Usage of MCTS

AlphaGo vs. Lee Sedol, Game 4:

- White 78 (Lee): an unexpected move that even other professional players did not see coming; a needle in the haystack
- AlphaGo failed to explore this move in its MCTS

Imitation learning from MCTS: use MCTS as an expert that labels states with good actions, and train a policy network to imitate those choices so that an expensive tree search is not needed at decision time.

### Continuous Case: Trajectory Optimization

#### Linear Quadratic Regulator (LQR)

#### Nonlinear Iterative LQR (iLQR) / Differential Dynamic Programming (DDP)