CSE510 Deep Reinforcement Learning (Lecture 17)
Why Model-Based RL?
- Sample efficiency
- Generalization and transferability
- Support efficient exploration in large-scale RL problems
- Explainability
- Super-human performance in practice
- Video games, Go, Algorithm discovery, etc.
Note
A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition T(s' | s, a) and the reward R(s, a).
For ADP-based (model-based) RL
1. Start with an initial model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Take an action according to an exploration/exploitation policy (explore more early on, gradually shift toward the policy from step 2)
4. Update the estimated model based on the observed transition
5. Go to step 2 (a tabular sketch of this loop follows below)
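A minimal tabular sketch of this loop, assuming a small discrete MDP and a hypothetical `env` with a `reset()`/`step()` interface returning `(next_state, reward, done)`; the model is estimated from transition counts and re-solved with value iteration after each episode. All names below are placeholders, not from the lecture.

```python
import numpy as np

def adp_model_based_rl(env, n_states, n_actions, gamma=0.95, episodes=500, eps=0.2):
    counts = np.zeros((n_states, n_actions, n_states))   # transition counts N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))          # accumulated rewards for estimating R(s, a)
    Q = np.zeros((n_states, n_actions))

    def solve(T, R, iters=100):
        # Step 2: value iteration on the current estimated model
        Q = np.zeros((n_states, n_actions))
        for _ in range(iters):
            V = Q.max(axis=1)
            Q = R + gamma * T @ V
        return Q

    for ep in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Step 3: epsilon-greedy exploration, exploring more early on
            if np.random.rand() < eps * (1 - ep / episodes):
                a = np.random.randint(n_actions)
            else:
                a = Q[s].argmax()
            s_next, r, done = env.step(a)

            # Step 4: update the estimated model from the observed transition
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r
            s = s_next

        # Step 5: re-estimate T(s'|s,a) and R(s,a), then re-plan (back to step 2)
        visits = counts.sum(axis=2, keepdims=True)
        T_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_states)
        R_hat = reward_sum / np.maximum(visits[..., 0], 1)
        Q = solve(T_hat, R_hat)
    return Q
```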
Problems in Large Scale Model-Based RL
- New planning methods given a model
  - Model is large and not perfect
- Model learning
  - Requires generalization
- Exploration/exploitation strategy
  - Requires generalization and attention
Large Scale Model-Based RL
- New optimal planning methods (Today)
  - Model is large and not perfect
- Model learning (Next Lecture)
  - Requires generalization
- Exploration/exploitation strategy (Next week)
  - Requires generalization and attention
Model-based RL
Deterministic Environment: Cross-Entropy Method
Stochastic Optimization
Abstract away optimal control/planning as a generic optimization problem:
a_1, \ldots, a_T = \argmax_{a_1, \ldots, a_T} J(a_1, \ldots, a_T)
or, writing A = (a_1, \ldots, a_T),
A = \argmax_A J(A)
Simplest method, guess and check (the "random shooting method", sketched below):
- Pick A_1, A_2, ..., A_n from some distribution (e.g., uniform)
- Choose A_i based on \argmax_i J(A_i)
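A minimal sketch of random shooting, assuming a user-supplied return function J that scores a whole action sequence under the learned model; `J`, `horizon`, and `action_dim` are placeholder names, not from the lecture.

```python
import numpy as np

def random_shooting(J, horizon, action_dim, n_samples=1000, low=-1.0, high=1.0):
    """Guess-and-check open-loop planning: sample candidate action sequences
    uniformly and return the one with the highest predicted return J(A)."""
    candidates = np.random.uniform(low, high, size=(n_samples, horizon, action_dim))
    scores = np.array([J(A) for A in candidates])
    return candidates[scores.argmax()]
```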
Cross-Entropy Method (CEM) with continuous-valued inputs
Cross-entropy method with continuous-valued inputs:
- Sample A_1, A_2, ..., A_n from some distribution p(A)
- Evaluate J(A_1), J(A_2), ..., J(A_n)
- Pick the elites A_1, A_2, ..., A_m with the highest J(A_i), where m < n
- Update the distribution p(A) to be more likely to choose the elites (sketched below)
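A minimal sketch of the CEM loop, assuming a Gaussian sampling distribution over flattened action sequences and a user-supplied return function J; all names here are illustrative placeholders.

```python
import numpy as np

def cem_plan(J, dim, n_samples=500, n_elites=50, iters=10):
    """Cross-entropy method: repeatedly sample, keep the elites,
    and refit the Gaussian sampling distribution p(A) to them."""
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = mean + std * np.random.randn(n_samples, dim)     # sample A_1..A_n ~ p(A)
        scores = np.array([J(A) for A in samples])                  # evaluate J(A_i)
        elites = samples[np.argsort(scores)[-n_elites:]]            # keep the top-m sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit p(A) to the elites
    return mean  # final plan estimate (the mean of the fitted distribution)
```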
Pros:
- Very fast to run if parallelized
- Extremely simple to implement
Cons:
- Very harsh dimensionality limit
- Only open-loop planning
- Suboptimal in stochastic environments
Discrete Case: Monte Carlo Tree Search (MCTS)
Discrete planning as a search problem
Closed-loop planning:
- At each state, iteratively build a search tree to evaluate actions, select the best first action, and then move to the next state.
Use the model as a simulator to evaluate actions.
MCTS Algorithm Overview
- Selection: starting from the root, use the tree policy to descend the search tree until reaching a node that can be expanded
- Expansion: add a new child node to the search tree
- Simulation: run a rollout with the default policy from the new node to the end of the episode
- Backpropagation: propagate the rollout result back up the tree, updating the value estimates and visit counts of the nodes along the path
Policies in MCTS
Tree policy:
- Select/create a leaf node (Selection and Expansion)
- A bandit problem!
Default policy / rollout policy:
- Play the game till the end (Simulation)
Decision policy:
- Select the final action
Upper Confidence Bound on Trees (UCT)
Selecting a Child Node: a Multi-Armed Bandit Problem
UCB1 is applied to each child selection:
UCT = \overline{X_j} + 2C_p\sqrt{\frac{2\ln n}{n_j}}
where:
- \overline{X_j} is the mean reward of selecting child j, normalized to [0, 1]
- n is the number of times the current (parent) node has been visited
- n_j is the number of times child j has been visited
- C_p > 0 is some constant
Properties:
- Each child node is guaranteed to be explored at least once (an unvisited child has n_j = 0, so its UCT value is infinite)
- Each child has a non-zero probability of being selected
- We can adjust C_p to trade off exploration vs. exploitation (see the sketch below)
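A small sketch of UCB1-based child selection, assuming each tree node stores its total rollout reward, visit count, and children; the `Node` class and `uct_select` function below are illustrative, not from the lecture.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []        # child Nodes, one per tried action
        self.visits = 0           # n_j for this node
        self.total_reward = 0.0   # sum of rollout returns backed up through this node

def uct_select(node, c_p=1 / math.sqrt(2)):
    """Pick the child maximizing mean reward + 2*C_p*sqrt(2*ln(n)/n_j).
    Unvisited children get an infinite score, so each is explored at least once."""
    def score(child):
        if child.visits == 0:
            return float("inf")
        mean = child.total_reward / child.visits
        return mean + 2 * c_p * math.sqrt(2 * math.log(node.visits) / child.visits)
    return max(node.children, key=score)
```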
Decision Policy: Final Action Selection
Selecting the best child of the root node (sketched below):
- Max (highest value/reward)
- Robust (most visits)
- Max-Robust (both highest value and most visits; if no such child exists, continue searching)
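A sketch of the three decision policies for the final root action, operating on a plain list of per-child statistics; the field names and fallback behavior are assumptions for illustration, not from the lecture.

```python
def choose_final_action(children):
    """children: list of dicts with keys 'action', 'visits', 'total_reward'."""
    value = lambda c: c["total_reward"] / max(c["visits"], 1)

    max_child = max(children, key=value)                      # Max: highest estimated value
    robust_child = max(children, key=lambda c: c["visits"])   # Robust: most visits

    # Max-Robust: prefer a child that is best on both criteria, if one exists;
    # otherwise fall back to the robust child (in practice: keep searching).
    if max_child is robust_child:
        return max_child["action"]
    return robust_child["action"]

# Example: action 0 has both the highest value (0.75) and the most visits.
children = [
    {"action": 0, "visits": 120, "total_reward": 90.0},
    {"action": 1, "visits": 40,  "total_reward": 20.0},
]
print(choose_final_action(children))  # -> 0 (highest value and most visits)
```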
Advantages and disadvantages of MCTS
Advantages:
- MCTS provably converges to the minimax solution
- Domain-independent
- Anytime algorithm
- Performs relatively better when the branching factor is large
Disadvantages:
- The basic version converges very slowly
- It can miss rare but critical moves, leading to small-probability failures
Example usage of MCTS
AlphaGo vs Lee Sedol, Game 4
- White 78 (Lee): an unexpected move that even other professional players didn't see coming, a needle in the haystack
- AlphaGo failed to explore this move in its MCTS
Imitation learning from MCTS: