# CSE510 Deep Reinforcement Learning (Lecture 17)

## Why Model-Based RL?

- Sample efficiency
- Generalization and transferability
- Support for efficient exploration in large-scale RL problems
- Explainability
- Super-human performance in practice
  - Video games, Go, algorithm discovery, etc.

> [!NOTE]
>
> A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition $T(s' \mid s, a)$ and reward $R(s, a)$.

For ADP-based (model-based) RL:

1. Start with an initial model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Take an action according to an exploration/exploitation policy (explore more early on, gradually shifting to the policy from step 2)
4. Update the estimated model based on the observed transition
5. Go to step 2

(A minimal tabular code sketch of this loop appears at the end of these notes.)

### Problems in Large Scale Model-Based RL

- New planning methods for a given model
  - The model is large and not perfect
- Model learning
  - Requiring generalization
- Exploration/exploitation strategy
  - Requiring generalization and attention

### Large Scale Model-Based RL

- New optimal planning methods (today)
  - The model is large and not perfect
- Model learning (next lecture)
  - Requiring generalization
- Exploration/exploitation strategy (next week)
  - Requiring generalization and attention

## Model-Based RL

### Deterministic Environment: Cross-Entropy Method

#### Stochastic Optimization

Abstract away optimal control/planning as a generic optimization problem:

$$
a_1, \ldots, a_T = \argmax_{a_1, \ldots, a_T} J(a_1, \ldots, a_T)
$$

$$
A = \argmax_{A} J(A)
$$

Simplest method (guess and check): the "random shooting method"

- Pick $A_1, A_2, \ldots, A_n$ from some distribution (e.g. uniform)
- Choose $A_i$ based on $\argmax_i J(A_i)$

#### Cross-Entropy Method (CEM) with continuous-valued inputs

Cross-entropy method with continuous-valued inputs:

1. Sample $A_1, A_2, \ldots, A_n$ from some distribution $p(A)$
2. Evaluate $J(A_1), J(A_2), \ldots, J(A_n)$
3. Pick the _elites_ $A_1, A_2, \ldots, A_m$ with the highest $J(A_i)$, where $m < n$
4. Refit the distribution $p(A)$ to the elites and repeat from step 1

(A code sketch of random shooting and CEM appears at the end of these notes.)

### Discrete Case: Monte Carlo Tree Search (MCTS)

#### Tree Policy: Child Selection (UCT)

$$
UCT(v', v) = \frac{Q(v')}{N(v')} + C_p \sqrt{\frac{2 \ln N(v)}{N(v')}}
$$

where $Q(v')$ is the total value of child $v'$, $N(\cdot)$ is the visit count, and $C_p > 0$.

- Each child has a non-zero probability of being selected
- We can adjust $C_p$ to change the exploration vs. exploitation trade-off

#### Decision Policy: Final Action Selection

Selecting the best child of the root:

- Max (highest value)
- Robust (most visits)
- Max-Robust (max of the two)

#### Advantages and disadvantages of MCTS

Advantages:

- Proved to converge to the minimax solution
- Domain-independent
- Anytime algorithm
- Performs well even with a large branching factor

Disadvantages:

- The basic version converges very slowly
- Can miss rare but critical moves, leading to small-probability failures

### Example usage of MCTS

AlphaGo vs. Lee Sedol, Game 4:

- White 78 (Lee): an unexpected move that even other professional players did not see coming
  - A needle in the haystack
- AlphaGo failed to explore this move in its MCTS

Imitation learning from MCTS: the actions chosen by the search can also be used as supervised targets for training a reactive policy.

### Continuous Case: Trajectory Optimization

#### Linear Quadratic Regulator (LQR)

#### Non-linear iterative LQR (iLQR) / Differential Dynamic Programming (DDP)
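The LQR and iLQR/DDP subsections above are left as headings in these notes. As a hedged illustration of the LQR building block only, here is a minimal sketch of the finite-horizon, discrete-time LQR backward pass, assuming known linear dynamics $x_{t+1} = A x_t + B u_t$ and quadratic cost $\sum_t (x_t^\top Q x_t + u_t^\top R u_t)$; the function name and the toy double-integrator system are illustrative, not from the lecture.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, horizon):
    """Finite-horizon Riccati recursion; returns gains K_t for u_t = -K_t x_t."""
    P = Q.copy()                       # terminal value-function matrix (terminal cost = Q here)
    gains = []
    for _ in range(horizon):
        # K = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update for the previous time step
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]                 # ordered t = 0, ..., horizon - 1

# Toy usage: a double integrator driven back toward the origin under the computed gains.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
Ks = lqr_backward_pass(A, B, Q, R, horizon=100)
x = np.array([1.0, 0.0])
for t in range(100):
    u = -Ks[t] @ x                     # time-varying linear feedback
    x = A @ x + B @ u
print("final state:", x)
```

iLQR/DDP repeats this kind of backward pass around a nominal trajectory, using local linearizations of the nonlinear dynamics.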
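For the stochastic-optimization planners in the Cross-Entropy Method section, here is a minimal sketch of random shooting and CEM over an open-loop action sequence $A = (a_1, \ldots, a_T)$. The objective `J` is a made-up stand-in for "roll out the model and sum the rewards"; the horizon, action dimension, sample counts, and elite count are illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, action_dim = 10, 2                               # planning horizon and action size (illustrative)
target = rng.normal(size=(T, action_dim))           # toy problem: the best plan matches this

def J(A):
    """Stand-in objective: in real use, roll out the model from A and sum rewards."""
    return -np.sum((A - target) ** 2)

def random_shooting(n=500):
    """Guess and check: sample n plans from a fixed distribution, keep the best."""
    candidates = rng.uniform(-2, 2, size=(n, T, action_dim))
    scores = np.array([J(A) for A in candidates])
    return candidates[np.argmax(scores)]

def cem(iters=20, n=500, m=50):
    """Cross-entropy method: repeatedly refit a Gaussian p(A) to the m elites."""
    mean = np.zeros((T, action_dim))
    std = np.ones((T, action_dim))
    for _ in range(iters):
        candidates = mean + std * rng.normal(size=(n, T, action_dim))  # 1. sample from p(A)
        scores = np.array([J(A) for A in candidates])                  # 2. evaluate J(A_i)
        elites = candidates[np.argsort(scores)[-m:]]                   # 3. pick the m elites
        mean = elites.mean(axis=0)                                     # 4. refit p(A)
        std = elites.std(axis=0) + 1e-6
    return mean

print("random shooting suboptimality:", -J(random_shooting()))
print("CEM suboptimality:            ", -J(cem()))
```

Refitting $p(A)$ to the elites is what lets CEM reach much better objective values than pure random shooting for the same total number of samples.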
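For the MCTS tree policy, here is a minimal sketch of UCT child selection matching the formula above. The `Node` fields ($Q$ = total value, $N$ = visit count) and the default $C_p = 1/\sqrt{2}$ are assumptions for the example, not from the lecture.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    Q: float = 0.0                      # total return backed up through this node
    N: int = 0                          # visit count
    children: list = field(default_factory=list)

def uct_select(parent, c_p=1.0 / math.sqrt(2)):
    """Return the child maximizing Q(v')/N(v') + C_p * sqrt(2 ln N(v) / N(v'))."""
    def score(child):
        if child.N == 0:
            return float("inf")         # unvisited children are always tried first
        exploit = child.Q / child.N
        explore = c_p * math.sqrt(2.0 * math.log(parent.N) / child.N)
        return exploit + explore
    return max(parent.children, key=score)

# Toy usage: selection trades off average value against how often a child was visited.
root = Node(N=30, children=[Node(Q=10.0, N=20), Node(Q=4.0, N=5), Node(Q=0.0, N=0)])
print(uct_select(root))
```

Unvisited children score infinity and are expanded first, which reflects the property that every child has a non-zero chance of being selected.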
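Finally, to make the ADP-style model-based RL loop from the top of these notes concrete, here is a tabular sketch: estimate $T$ and $R$ from counts, plan with a few value-iteration sweeps, act with a simple epsilon-greedy exploration policy, and update the model from each observed transition. The `env.reset()`/`env.step()` interface returning integer states and a `(next_state, reward, done)` tuple is an assumption for the example, as are all the hyperparameters.

```python
import numpy as np

def adp_loop(env, n_states, n_actions, episodes=200, gamma=0.95, eps=0.2, seed=0):
    """Tabular model-based RL: learn T and R from counts, plan, act, update, repeat."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((n_states, n_actions, n_states))   # transition counts N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))          # running reward sums for R(s, a)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s, done = env.reset(), False                      # assumed API: integer state index
        while not done:
            # 2. Solve for a policy under the current estimated model (value-iteration sweeps)
            visits = counts.sum(axis=2)
            T = counts / np.maximum(visits[:, :, None], 1)
            R = reward_sum / np.maximum(visits, 1)
            for _ in range(20):
                V = (R + gamma * (T @ V)).max(axis=1)
            # 3. Exploration/exploitation policy (epsilon-greedy here)
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(R[s] + gamma * (T[s] @ V)))
            s2, r, done = env.step(a)                     # assumed API: (next_state, reward, done)
            # 4. Update the estimated model from the observed transition
            counts[s, a, s2] += 1
            reward_sum[s, a] += r
            s = s2
    return V
```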