CSE510 Deep Reinforcement Learning (Lecture 17)
Why Model-Based RL?
- Sample efficiency
- Generalization and transferability
- Support efficient exploration in large-scale RL problems
- Explainability
- Super-human performance in practice
- Video games, Go, Algorithm discovery, etc.
Note
A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition T(s' | s, a) and the reward R(s, a).
For ADP-based (model-based) RL
1. Start with an initial model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Take an action according to an exploration/exploitation policy (explore more early on, gradually shift toward the policy from step 2)
4. Update the estimated model based on the observed transition
5. Go to step 2 (a tabular sketch of this loop follows below)
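A minimal tabular sketch of this loop, assuming a small discrete MDP and a hypothetical `env` with a `reset()`/`step()` interface returning `(next_state, reward, done)`; the model is estimated from transition counts and re-solved with value iteration after each episode. All names below are placeholders, not from the lecture.

```python
import numpy as np

def adp_model_based_rl(env, n_states, n_actions, gamma=0.95, episodes=500, eps=0.2):
    counts = np.zeros((n_states, n_actions, n_states))   # transition counts N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))          # accumulated rewards for estimating R(s, a)
    Q = np.zeros((n_states, n_actions))

    def solve(T, R, iters=100):
        # Step 2: value iteration on the current estimated model
        Q = np.zeros((n_states, n_actions))
        for _ in range(iters):
            V = Q.max(axis=1)
            Q = R + gamma * T @ V
        return Q

    for ep in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Step 3: epsilon-greedy exploration, exploring more early on
            if np.random.rand() < eps * (1 - ep / episodes):
                a = np.random.randint(n_actions)
            else:
                a = Q[s].argmax()
            s_next, r, done = env.step(a)

            # Step 4: update the estimated model from the observed transition
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r
            s = s_next

        # Step 5: re-estimate T(s'|s,a) and R(s,a), then re-plan (back to step 2)
        visits = counts.sum(axis=2, keepdims=True)
        T_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_states)
        R_hat = reward_sum / np.maximum(visits[..., 0], 1)
        Q = solve(T_hat, R_hat)
    return Q
```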
Problems in Large Scale Model-Based RL
- New planning methods given a model
  - Model is large and not perfect
- Model learning
  - Requires generalization
- Exploration/exploitation strategy
  - Requires generalization and attention
Large Scale Model-Based RL
- New optimal planning methods (Today)
  - Model is large and not perfect
- Model learning (Next Lecture)
  - Requires generalization
- Exploration/exploitation strategy (Next week)
  - Requires generalization and attention
Model-based RL
Deterministic Environment: Cross-Entropy Method
Stochastic Optimization
Abstract away optimal control/planning as a generic optimization problem:
a_1, \ldots, a_T = \argmax_{a_1, \ldots, a_T} J(a_1, \ldots, a_T)
or, writing A = (a_1, \ldots, a_T),
A = \argmax_A J(A)
Simplest method, guess and check (the "random shooting method", sketched below):
- Pick A_1, A_2, ..., A_n from some distribution (e.g., uniform)
- Choose A_i based on \argmax_i J(A_i)
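A minimal sketch of random shooting, assuming a user-supplied return function J that scores a whole action sequence under the learned model; `J`, `horizon`, and `action_dim` are placeholder names, not from the lecture.

```python
import numpy as np

def random_shooting(J, horizon, action_dim, n_samples=1000, low=-1.0, high=1.0):
    """Guess-and-check open-loop planning: sample candidate action sequences
    uniformly and return the one with the highest predicted return J(A)."""
    candidates = np.random.uniform(low, high, size=(n_samples, horizon, action_dim))
    scores = np.array([J(A) for A in candidates])
    return candidates[scores.argmax()]
```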
Cross-Entropy Method (CEM) with continuous-valued inputs
Cross-entropy method with continuous-valued inputs:
- Sample A_1, A_2, ..., A_n from some distribution p(A)
- Evaluate J(A_1), J(A_2), ..., J(A_n)
- Pick the elites A_1, A_2, ..., A_m with the highest J(A_i), where m < n
- Update the distribution p(A) to be more likely to choose the elites (sketched below)
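A minimal sketch of the CEM loop, assuming a Gaussian sampling distribution over flattened action sequences and a user-supplied return function J; all names here are illustrative placeholders.

```python
import numpy as np

def cem_plan(J, dim, n_samples=500, n_elites=50, iters=10):
    """Cross-entropy method: repeatedly sample, keep the elites,
    and refit the Gaussian sampling distribution p(A) to them."""
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = mean + std * np.random.randn(n_samples, dim)     # sample A_1..A_n ~ p(A)
        scores = np.array([J(A) for A in samples])                  # evaluate J(A_i)
        elites = samples[np.argsort(scores)[-n_elites:]]            # keep the top-m sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit p(A) to the elites
    return mean  # final plan estimate (the mean of the fitted distribution)
```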
Pros:
- Very fast to run if parallelized
- Extremely simple to implement
Cons:
- Very harsh dimensionality limit
- Only open-loop planning
- Suboptimal in stochastic environments
Discrete Case: Monte Carlo Tree Search (MCTS)
Discrete planning as a search problem
Closed-loop planning:
- At each state, iteratively build a search tree to evaluate actions, select the best first action, and then move to the next state.
Use the model as a simulator to evaluate actions.
MCTS Algorithm Overview
- Selection: starting from the root, use the tree policy to descend the search tree until reaching a node that can be expanded
- Expansion: add a new child node to the search tree
- Simulation: run a rollout with the default policy from the new node to the end of the episode
- Backpropagation: propagate the rollout result back up the tree, updating the value estimates and visit counts of the nodes along the path
Policies in MCTS
Tree policy:
- Select/create a leaf node (Selection and Expansion)
- A bandit problem!
Default policy / rollout policy:
- Play the game till the end (Simulation)
Decision policy:
- Select the final action
Upper Confidence Bound on Trees (UCT)
Selecting a Child Node: a Multi-Armed Bandit Problem
UCB1 is applied to each child selection:
UCT = \overline{X_j} + 2C_p\sqrt{\frac{2\ln n}{n_j}}
where:
- \overline{X_j} is the mean reward of selecting child j, normalized to [0, 1]
- n is the number of times the current (parent) node has been visited
- n_j is the number of times child j has been visited
- C_p > 0 is some constant
Properties:
- Each child node is guaranteed to be explored at least once (an unvisited child has n_j = 0, so its UCT value is infinite)
- Each child has a non-zero probability of being selected
- We can adjust C_p to trade off exploration vs. exploitation (see the sketch below)
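A small sketch of UCB1-based child selection, assuming each tree node stores its total rollout reward, visit count, and children; the `Node` class and `uct_select` function below are illustrative, not from the lecture.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []        # child Nodes, one per tried action
        self.visits = 0           # n_j for this node
        self.total_reward = 0.0   # sum of rollout returns backed up through this node

def uct_select(node, c_p=1 / math.sqrt(2)):
    """Pick the child maximizing mean reward + 2*C_p*sqrt(2*ln(n)/n_j).
    Unvisited children get an infinite score, so each is explored at least once."""
    def score(child):
        if child.visits == 0:
            return float("inf")
        mean = child.total_reward / child.visits
        return mean + 2 * c_p * math.sqrt(2 * math.log(node.visits) / child.visits)
    return max(node.children, key=score)
```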
Decision Policy: Final Action Selection
Selecting the best child of the root node (sketched below):
- Max (highest value/reward)
- Robust (most visits)
- Max-Robust (both highest value and most visits; if no such child exists, continue searching)
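A sketch of the three decision policies for the final root action, operating on a plain list of per-child statistics; the field names and fallback behavior are assumptions for illustration, not from the lecture.

```python
def choose_final_action(children):
    """children: list of dicts with keys 'action', 'visits', 'total_reward'."""
    value = lambda c: c["total_reward"] / max(c["visits"], 1)

    max_child = max(children, key=value)                      # Max: highest estimated value
    robust_child = max(children, key=lambda c: c["visits"])   # Robust: most visits

    # Max-Robust: prefer a child that is best on both criteria, if one exists;
    # otherwise fall back to the robust child (in practice: keep searching).
    if max_child is robust_child:
        return max_child["action"]
    return robust_child["action"]

# Example: action 0 has both the highest value (0.75) and the most visits.
children = [
    {"action": 0, "visits": 120, "total_reward": 90.0},
    {"action": 1, "visits": 40,  "total_reward": 20.0},
]
print(choose_final_action(children))  # -> 0 (highest value and most visits)
```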
Advantages and disadvantages of MCTS
Advantages:
- MCTS provably converges to the minimax solution
- Domain-independent
- Anytime algorithm
- Performs relatively better when the branching factor is large
Disadvantages:
- The basic version converges very slowly
- It can miss rare but critical moves, leading to small-probability failures
Example usage of MCTS
AlphaGo vs Lee Sedol, Game 4
- White 78 (Lee): an unexpected move that even other professional players didn't see coming, a needle in the haystack
- AlphaGo failed to explore this move in its MCTS
Imitation learning from MCTS: