CSE510 Deep Reinforcement Learning (Lecture 20)

Exploration in RL

Motivations

Exploration vs. Exploitation Dilemma

Online decision-making involves a fundamental choice:

  • Exploration: trying out new things (new behaviors), with the hope of discovering higher rewards
  • Exploitation: doing what you know will yield the highest reward

The best long-term strategy may involve short-term sacrifices

Gather enough knowledge early to make the best long-term decisions

Example: Restaurant Selection
  • Exploitation: Go to your favorite restaurant
  • Exploration: Try a new restaurant

Oil Drilling

  • Exploitation: Drill at the best known location
  • Exploration: Drill at a new location

Game Playing

  • Exploitation: Play the move you believe is best
  • Exploration: Play an experimental move

Breakout vs. Montezuma's Revenge

| Property | Breakout | Montezuma's Revenge |
| --- | --- | --- |
| Reward frequency | Dense (every brick hit gives points) | Extremely sparse (only after collecting a key or treasure) |
| State space | Simple (ball, paddle, bricks) | Complex (many rooms, objects, ladders, timing) |
| Action relevance | Almost any action affects reward soon | Most actions have no immediate feedback |
| Exploration depth | Shallow (few steps to reward) | Deep (dozens or hundreds of steps before reward) |
| Determinism | Mostly deterministic dynamics | Deterministic, but requires long sequences of precise actions |
| Credit assignment | Easy (short time gap) | Very hard (long delay from cause to effect) |

Motivation

  • Motivation: "Forces" that energize an organism to act and that direct its activity
  • Extrinsic Motivation: being motivated to do something because of some external reward ($, a prize, food, water, etc.)
  • Intrinsic Motivation: being motivated to do something because it is inherently enjoyable (curiosity, exploration, novelty, surprise, incongruity, complexity…)

Intuitive Exploration Strategy

  • Intrinsic motivation drives the exploration of unknowns

  • Intuitively, we explore efficiently once we know what we do not know, and target our exploration efforts to the unknown part of the space.

  • All non-naive exploration methods use some form of uncertainty estimation: about the states (or state-action pairs) we have visited, the transition dynamics, or the Q-functions.

  • Methods that are optimal in small settings do not scale directly, but they can inspire approaches for larger settings

  • In practice, these larger-scale methods may rely on some heuristics ("hacks")

Classes of Exploration Methods in Deep RL

  • Optimistic exploration
    • Uncertainty about states
    • Visiting novel states (state visitation counting)
  • Information state search
    • Uncertainty about state transitions or dynamics
    • Dynamics prediction error or Information gain for dynamics learning
  • Posterior sampling
    • Uncertainty about Q-value functions or policies
    • Selecting actions according to the probability they are best

Optimistic Exploration

Count-Based Exploration in Small MDPs

Book-keep state visitation counts N(s) and add exploration reward bonuses that encourage the policy to visit states with low counts (a minimal code sketch follows the bullets below).


R(s,a,s') = r(s,a,s') + \mathcal{B}(N(s))

where \mathcal{B}(N(s)) is the intrinsic exploration reward bonus.

  • UCB: \mathcal{B}(N(s)) = \sqrt{\frac{2\ln n}{N(s)}} (more aggressive exploration)

  • MBIE-EB (Strehl & Littman): \mathcal{B}(N(s)) = \sqrt{\frac{1}{N(s)}}

  • BEB (Kolter & Ng): \mathcal{B}(N(s)) = \frac{1}{N(s)}

  • We want to come up with something that rewards states that we have not visited often.

  • But in large MDPs, we rarely visit a state twice!

  • We need to capture a notion of state similarity, and reward states that are most dissimilar to what we have seen so far

    • as opposed to merely different (since in a large state space every new state is different).
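
A minimal sketch of the count-based bonus described above, for the tabular setting; the dictionary-based counter, the bonus scale beta, and the MBIE-EB-style bonus choice are illustrative assumptions rather than the lecture's prescription:

```python
import math
from collections import defaultdict


class CountBonus:
    """Tabular count-based exploration bonus, B(N(s)) = beta / sqrt(N(s))."""

    def __init__(self, beta=0.1):
        self.beta = beta                # bonus scale (hyperparameter)
        self.counts = defaultdict(int)  # N(s): state visitation counts

    def augmented_reward(self, state, reward):
        # Update the visitation count for this (hashable) state.
        self.counts[state] += 1
        # Bonus is large for rarely visited states and decays with N(s).
        bonus = self.beta / math.sqrt(self.counts[state])
        return reward + bonus


# Usage inside the RL loop: train on r_plus instead of the raw reward r.
# r_plus = count_bonus.augmented_reward(s_next, r)
```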

Fitting Generative Models

Idea: fit a density model p_\theta(s) (or p_\theta(s,a))

p_\theta(s) might be high even for a new s, if s is similar to previously seen states.

Can we then use p_\theta(s) to derive a "pseudo-count" for s?

In a small MDP, the true visitation probability is


P(s)=\frac{N(s)}{n}

where N(s) is the number of times s has been visited and n is the total number of state visits.

After one more visit to s, the updated probability is


P'(s)=\frac{N(s)+1}{n+1}

This suggests the following pseudo-count procedure:

  1. fit a density model p_\theta(s) to all states \mathcal{D} seen so far.
  2. take a step i and observe s_i.
  3. fit a new model p_\theta'(s) to \mathcal{D} \cup \{s_i\}.
  4. use p_\theta(s_i) and p_\theta'(s_i) to estimate the pseudo-count \hat{N}(s_i).
  5. set r_i^+ = r_i + \mathcal{B}(\hat{N}(s_i))
  6. go to 1

How do we get \hat{N}(s_i)? Use the two equations


p_\theta(s_i)=\frac{\hat{N}(s_i)}{\hat{n}}\quad p_\theta'(s_i)=\frac{\hat{N}(s_i)+1}{\hat{n}+1}
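
Solving these two equations for \hat{n} and then \hat{N}(s_i) gives the closed form

\hat{n}=\frac{1-p_\theta'(s_i)}{p_\theta'(s_i)-p_\theta(s_i)}\quad \hat{N}(s_i)=\hat{n}\,p_\theta(s_i)=\frac{p_\theta(s_i)\left(1-p_\theta'(s_i)\right)}{p_\theta'(s_i)-p_\theta(s_i)}

A minimal sketch of this computation; the eps guard and the BEB-style bonus are illustrative assumptions:

```python
def pseudo_count(p, p_prime, eps=1e-8):
    """Pseudo-count N_hat(s_i) from the density model's outputs.

    p       = p_theta(s_i), evaluated before updating the model on s_i
    p_prime = p_theta'(s_i), evaluated after updating the model on s_i
    Solves p = N/n and p_prime = (N + 1)/(n + 1) for N.
    """
    n_hat = (1.0 - p_prime) / max(p_prime - p, eps)  # eps guards a zero denominator
    return p * n_hat


def bonus_reward(r, p, p_prime, eps=1e-8):
    # BEB-style bonus B(N) = 1 / N_hat added to the environment reward.
    return r + 1.0 / max(pseudo_count(p, p_prime, eps), eps)
```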

link to the paper

Density models

link to the paper

State Counting with Deep Hashing

  • We still count states (images), but in a compressed latent space rather than in pixel space (a minimal code sketch follows this list).
  • Compress s into a latent code, then count occurrences of the code.
  • How do we get the image encoding? e.g., using autoencoders.
  • There is no guarantee that such a reconstruction loss will capture the important features that make two states similar
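
A minimal sketch of counting via hashed latent codes, using a fixed random projection (SimHash-style) of a state feature vector as a stand-in for a learned autoencoder encoding; the code width k, the bonus scale beta, and the feature extractor are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict


class HashCounter:
    """Count states by hashing features into short binary codes; bonus = beta / sqrt(N(code))."""

    def __init__(self, feature_dim, k=16, beta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, feature_dim))  # fixed random projection
        self.beta = beta
        self.counts = defaultdict(int)                  # visit counts per binary code

    def bonus(self, features):
        # Binary code: sign pattern of the projected feature vector.
        code = tuple((self.A @ features > 0).astype(np.int8))
        self.counts[code] += 1
        # Similar states tend to share a code, so their counts accumulate together.
        return self.beta / np.sqrt(self.counts[code])


# Usage (illustrative): phi = encoder(s) or flattened pixels,
# then r_plus = r + hash_counter.bonus(phi)
```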