# CSE510 Deep Reinforcement Learning (Lecture 17)

## Why Model-Based RL?

- Sample efficiency
- Generalization and transferability
- Support for efficient exploration in large-scale RL problems
- Explainability
- Super-human performance in practice
  - Video games, Go, algorithm discovery, etc.

> [!NOTE]
>
> A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition $T(s' \mid s, a)$ and reward $R(s, a)$.

For ADP-based (model-based) RL:

1. Start with an initial model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Take an action according to an exploration/exploitation policy (explore more early on, gradually shifting to the policy from step 2)
4. Update the estimated model based on the observed transition
5. Go to step 2

(A minimal tabular code sketch of this loop appears at the end of these notes.)

### Problems in Large Scale Model-Based RL

- New planning methods for a given model
  - The model is large and not perfect
- Model learning
  - Requiring generalization
- Exploration/exploitation strategy
  - Requiring generalization and attention

### Large Scale Model-Based RL

- New optimal planning methods (today)
  - The model is large and not perfect
- Model learning (next lecture)
  - Requiring generalization
- Exploration/exploitation strategy (next week)
  - Requiring generalization and attention

## Model-Based RL

### Deterministic Environment: Cross-Entropy Method

#### Stochastic Optimization

Abstract away optimal control/planning as a generic optimization problem:

$$
a_1, \ldots, a_T = \argmax_{a_1, \ldots, a_T} J(a_1, \ldots, a_T)
$$

$$
A = \argmax_{A} J(A)
$$

Simplest method (guess and check): the "random shooting method"

- Pick $A_1, A_2, \ldots, A_n$ from some distribution (e.g. uniform)
- Choose $A_i$ based on $\argmax_i J(A_i)$

#### Cross-Entropy Method (CEM) with continuous-valued inputs

Cross-entropy method with continuous-valued inputs:

1. Sample $A_1, A_2, \ldots, A_n$ from some distribution $p(A)$
2. Evaluate $J(A_1), J(A_2), \ldots, J(A_n)$
3. Pick the _elites_ $A_1, A_2, \ldots, A_m$ with the highest $J(A_i)$, where $m < n$
4. Refit the distribution $p(A)$ to the elites and repeat from step 1

(A code sketch of random shooting and CEM appears at the end of these notes.)

### Discrete Case: Monte Carlo Tree Search (MCTS)

#### Tree Policy: Child Selection (UCT)

$$
UCT(v', v) = \frac{Q(v')}{N(v')} + C_p \sqrt{\frac{2 \ln N(v)}{N(v')}}
$$

where $Q(v')$ is the total value of child $v'$, $N(\cdot)$ is the visit count, and $C_p > 0$.

- Each child has a non-zero probability of being selected
- We can adjust $C_p$ to change the exploration vs. exploitation trade-off

#### Decision Policy: Final Action Selection

Selecting the best child of the root:

- Max (highest value)
- Robust (most visits)
- Max-Robust (max of the two)

#### Advantages and disadvantages of MCTS

Advantages:

- Proved to converge to the minimax solution
- Domain-independent
- Anytime algorithm
- Performs well even with a large branching factor

Disadvantages:

- The basic version converges very slowly
- Can miss rare but critical moves, leading to small-probability failures

### Example usage of MCTS

AlphaGo vs. Lee Sedol, Game 4:

- White 78 (Lee): an unexpected move that even other professional players did not see coming
  - A needle in the haystack
- AlphaGo failed to explore this move in its MCTS

Imitation learning from MCTS: the actions chosen by the search can also be used as supervised targets for training a reactive policy.

### Continuous Case: Trajectory Optimization

#### Linear Quadratic Regulator (LQR)

#### Non-linear iterative LQR (iLQR) / Differential Dynamic Programming (DDP)
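The LQR and iLQR/DDP subsections above are left as headings in these notes. As a hedged illustration of the LQR building block only, here is a minimal sketch of the finite-horizon, discrete-time LQR backward pass, assuming known linear dynamics $x_{t+1} = A x_t + B u_t$ and quadratic cost $\sum_t (x_t^\top Q x_t + u_t^\top R u_t)$; the function name and the toy double-integrator system are illustrative, not from the lecture.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, horizon):
    """Finite-horizon Riccati recursion; returns gains K_t for u_t = -K_t x_t."""
    P = Q.copy()                       # terminal value-function matrix (terminal cost = Q here)
    gains = []
    for _ in range(horizon):
        # K = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update for the previous time step
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]                 # ordered t = 0, ..., horizon - 1

# Toy usage: a double integrator driven back toward the origin under the computed gains.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
Ks = lqr_backward_pass(A, B, Q, R, horizon=100)
x = np.array([1.0, 0.0])
for t in range(100):
    u = -Ks[t] @ x                     # time-varying linear feedback
    x = A @ x + B @ u
print("final state:", x)
```

iLQR/DDP repeats this kind of backward pass around a nominal trajectory, using local linearizations of the nonlinear dynamics.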
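For the stochastic-optimization planners in the Cross-Entropy Method section, here is a minimal sketch of random shooting and CEM over an open-loop action sequence $A = (a_1, \ldots, a_T)$. The objective `J` is a made-up stand-in for "roll out the model and sum the rewards"; the horizon, action dimension, sample counts, and elite count are illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, action_dim = 10, 2                               # planning horizon and action size (illustrative)
target = rng.normal(size=(T, action_dim))           # toy problem: the best plan matches this

def J(A):
    """Stand-in objective: in real use, roll out the model from A and sum rewards."""
    return -np.sum((A - target) ** 2)

def random_shooting(n=500):
    """Guess and check: sample n plans from a fixed distribution, keep the best."""
    candidates = rng.uniform(-2, 2, size=(n, T, action_dim))
    scores = np.array([J(A) for A in candidates])
    return candidates[np.argmax(scores)]

def cem(iters=20, n=500, m=50):
    """Cross-entropy method: repeatedly refit a Gaussian p(A) to the m elites."""
    mean = np.zeros((T, action_dim))
    std = np.ones((T, action_dim))
    for _ in range(iters):
        candidates = mean + std * rng.normal(size=(n, T, action_dim))  # 1. sample from p(A)
        scores = np.array([J(A) for A in candidates])                  # 2. evaluate J(A_i)
        elites = candidates[np.argsort(scores)[-m:]]                   # 3. pick the m elites
        mean = elites.mean(axis=0)                                     # 4. refit p(A)
        std = elites.std(axis=0) + 1e-6
    return mean

print("random shooting suboptimality:", -J(random_shooting()))
print("CEM suboptimality:            ", -J(cem()))
```

Refitting $p(A)$ to the elites is what lets CEM reach much better objective values than pure random shooting for the same total number of samples.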
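For the MCTS tree policy, here is a minimal sketch of UCT child selection matching the formula above. The `Node` fields ($Q$ = total value, $N$ = visit count) and the default $C_p = 1/\sqrt{2}$ are assumptions for the example, not from the lecture.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    Q: float = 0.0                      # total return backed up through this node
    N: int = 0                          # visit count
    children: list = field(default_factory=list)

def uct_select(parent, c_p=1.0 / math.sqrt(2)):
    """Return the child maximizing Q(v')/N(v') + C_p * sqrt(2 ln N(v) / N(v'))."""
    def score(child):
        if child.N == 0:
            return float("inf")         # unvisited children are always tried first
        exploit = child.Q / child.N
        explore = c_p * math.sqrt(2.0 * math.log(parent.N) / child.N)
        return exploit + explore
    return max(parent.children, key=score)

# Toy usage: selection trades off average value against how often a child was visited.
root = Node(N=30, children=[Node(Q=10.0, N=20), Node(Q=4.0, N=5), Node(Q=0.0, N=0)])
print(uct_select(root))
```

Unvisited children score infinity and are expanded first, which reflects the property that every child has a non-zero chance of being selected.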
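Finally, to make the ADP-style model-based RL loop from the top of these notes concrete, here is a tabular sketch: estimate $T$ and $R$ from counts, plan with a few value-iteration sweeps, act with a simple epsilon-greedy exploration policy, and update the model from each observed transition. The `env.reset()`/`env.step()` interface returning integer states and a `(next_state, reward, done)` tuple is an assumption for the example, as are all the hyperparameters.

```python
import numpy as np

def adp_loop(env, n_states, n_actions, episodes=200, gamma=0.95, eps=0.2, seed=0):
    """Tabular model-based RL: learn T and R from counts, plan, act, update, repeat."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((n_states, n_actions, n_states))   # transition counts N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))          # running reward sums for R(s, a)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s, done = env.reset(), False                      # assumed API: integer state index
        while not done:
            # 2. Solve for a policy under the current estimated model (value-iteration sweeps)
            visits = counts.sum(axis=2)
            T = counts / np.maximum(visits[:, :, None], 1)
            R = reward_sum / np.maximum(visits, 1)
            for _ in range(20):
                V = (R + gamma * (T @ V)).max(axis=1)
            # 3. Exploration/exploitation policy (epsilon-greedy here)
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(R[s] + gamma * (T[s] @ V)))
            s2, r, done = env.step(a)                     # assumed API: (next_state, reward, done)
            # 4. Update the estimated model from the observed transition
            counts[s, a, s2] += 1
            reward_sum[s, a] += r
            s = s2
    return V
```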