NoteNextra-origin/content/CSE510/CSE510_L17.md
Zheyuan Wu dbb201ef37 updates
2025-10-23 13:39:36 -05:00


# CSE510 Deep Reinforcement Learning (Lecture 17)
## Model-based RL
### Model-based RL vs. Model-free RL
- Sample efficiency
- Generalization and transferability
- Support efficient exploration in large-scale RL problems
- Explainability
- Super-human performance in practice
### Deterministic Environment: Cross-Entropy Method
#### Stochastic Optimization
Abstract away optimal control/planning as a generic optimization problem:
$$
a_1,\ldots, a_T =\argmax_{a_1,\ldots, a_T} J(a_1,\ldots, a_T)
$$
$$
A=\argmax_{A} J(A)
$$
Simplest method, guess and check ("random shooting method"):
- pick $A_1, A_2, \ldots, A_n$ from some distribution (e.g. uniform)
- choose $A_{i^*}$ with $i^* = \argmax_i J(A_i)$
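A minimal sketch of random shooting with NumPy; the objective `J`, bounds, and sample count are hypothetical placeholders, since the notes do not fix an environment:

```python
import numpy as np

def random_shooting(J, horizon, action_dim, n_samples=1000, low=-1.0, high=1.0):
    """Guess-and-check planner: sample n action sequences uniformly
    and return the one with the highest objective J."""
    # Pick A_1, ..., A_n from a uniform distribution over the action bounds.
    candidates = np.random.uniform(low, high, size=(n_samples, horizon, action_dim))
    # Evaluate J on each candidate sequence and keep the argmax.
    scores = np.array([J(A) for A in candidates])
    return candidates[np.argmax(scores)]
```

Each candidate is scored independently, which is why the method parallelizes trivially.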
#### Cross-Entropy Method with continuous-valued inputs
1. sample $A_1, A_2, ..., A_n$ from some distribution $p(A)$
2. evaluate $J(A_1), J(A_2), ..., J(A_n)$
3. pick the _elites_ $A_{i_1}, \ldots, A_{i_m}$ with the highest values $J(A_i)$, where $m < n$
4. refit the distribution $p(A)$ to the elites (e.g., a Gaussian fit to their mean and variance), then repeat from step 1
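The four steps above can be sketched with a Gaussian sampling distribution; the objective `J` and the iteration/sample counts are illustrative assumptions:

```python
import numpy as np

def cross_entropy_method(J, horizon, action_dim, n_samples=100, n_elites=10, n_iters=20):
    """CEM planner: iteratively refit a Gaussian over action sequences
    to the elite (highest-scoring) samples."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # 1. Sample A_1, ..., A_n from the current distribution p(A).
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)
        # 2. Evaluate J(A_i) for each sample.
        scores = np.array([J(A) for A in samples])
        # 3. Pick the m elites with the highest scores.
        elites = samples[np.argsort(scores)[-n_elites:]]
        # 4. Refit p(A) to the elites (small floor on std avoids collapse).
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```

The returned mean is the planned action sequence; in practice only the first action is executed before replanning.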
Pros:
- Very fast to run if parallelized
- Extremely simple to implement
Cons:
- Very harsh dimensionality limit
- Only open-loop planning
- Suboptimal in stochastic environments
### Discrete Case: Monte Carlo Tree Search (MCTS)
Discrete planning as a search problem
Closed-loop planning:
- At each state, iteratively build a search tree to evaluate actions, execute the best first action, and then move to the next state.
Use the model as a simulator to evaluate actions.
#### MCTS Algorithm Overview
1. Selection: descend from the root using the tree policy until reaching a node that is not fully expanded
2. Expansion: add a new child node to the search tree for an untried action
3. Simulation: run a rollout from the new node to the end of the episode to estimate its return
4. Backpropagation: propagate the return up the tree, updating the value and visit count of each node on the path
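The four phases can be sketched for a deterministic discrete model; the interface (`step`, `actions`, `is_terminal`, `reward`) is a hypothetical stand-in for the learned model used as a simulator:

```python
import math, random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def mcts(root_state, step, actions, is_terminal, reward, n_iters=200, c=1.4):
    """Minimal MCTS: selection, expansion, simulation, backpropagation."""
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend with the UCT tree policy while fully expanded.
        while node.children and len(node.children) == len(actions(node.state)):
            node = max(node.children, key=lambda ch: ch.value / (ch.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
        # 2. Expansion: add one child for an untried action.
        if not is_terminal(node.state):
            tried = {ch.action for ch in node.children}
            a = random.choice([a for a in actions(node.state) if a not in tried])
            node = Node(step(node.state, a), parent=node, action=a)
            node.parent.children.append(node)
        # 3. Simulation: random rollout from the new node to a terminal state.
        s = node.state
        while not is_terminal(s):
            s = step(s, random.choice(actions(s)))
        r = reward(s)
        # 4. Backpropagation: update values and visit counts along the path.
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    # Robust decision policy: execute the most-visited root action.
    return max(root.children, key=lambda ch: ch.visits).action
```

The final line implements the "robust" decision policy from the notes; swapping `ch.visits` for the average value gives the "max" policy.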
#### Policies in MCTS
Tree policy: selects actions inside the tree, trading off exploration and exploitation (e.g., UCT)
Decision policy: selects the action actually executed at the root:
- Max (highest weight)
- Robust (most visits)
- Max-Robust (max of the two)
#### Upper Confidence Bound on Trees (UCT)
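This heading is left empty in the notes; the standard UCT selection rule (the UCB1 bandit bound applied at each tree node) scores child actions as
$$
a^* = \argmax_{a}\left[ Q(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}} \right]
$$
where $Q(s,a)$ is the average return of taking $a$ at $s$, $N(s)$ and $N(s,a)$ are visit counts, and $c$ is an exploration constant (often $\sqrt{2}$). The first term exploits high-value actions; the second favors rarely visited ones.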
### Continuous Case: Trajectory Optimization
#### Linear Quadratic Regulator (LQR)
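As a standard anchor for this heading: LQR assumes linear dynamics and quadratic cost,
$$
x_{t+1} = A x_t + B u_t, \qquad
\min_{u_1,\ldots,u_T} \sum_{t=1}^{T} \left( x_t^\top Q x_t + u_t^\top R u_t \right)
$$
and the optimal controller is linear state feedback $u_t = K_t x_t$, computed exactly by a backward Riccati recursion over the horizon.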
#### Non-linear iterative LQR (iLQR)/ Differential Dynamic Programming (DDP)
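For nonlinear dynamics $x_{t+1} = f(x_t, u_t)$, iLQR linearizes around a nominal trajectory $(\hat{x}_t, \hat{u}_t)$:
$$
f(x_t, u_t) \approx f(\hat{x}_t, \hat{u}_t) + \nabla_{x_t, u_t} f(\hat{x}_t, \hat{u}_t)
\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}
$$
It then solves the resulting LQR subproblem in a backward pass, rolls the new controller forward through the true dynamics, and repeats until convergence; DDP additionally keeps second-order dynamics terms.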