updates

content/CSE510/CSE510_L20.md (new file, 143 lines)

@@ -0,0 +1,143 @@

# CSE510 Deep Reinforcement Learning (Lecture 20)

## Exploration in RL

### Motivations

#### Exploration vs. Exploitation Dilemma

Online decision-making involves a fundamental choice:

- Exploration: trying out new things (new behaviors), with the hope of discovering higher rewards
- Exploitation: doing what you know will yield the highest reward

The best long-term strategy may involve short-term sacrifices: gather enough knowledge early to make the best long-term decisions.

<details>
<summary>Example</summary>

Restaurant Selection

- Exploitation: Go to your favorite restaurant
- Exploration: Try a new restaurant

Oil Drilling

- Exploitation: Drill at the best known location
- Exploration: Drill at a new location

Game Playing

- Exploitation: Play the move you believe is best
- Exploration: Play an experimental move

</details>
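
To make the trade-off concrete, here is a minimal $\epsilon$-greedy sketch on a toy 3-armed bandit. This is illustrative only: the arm means, the noise level, and the value of `EPSILON` are made-up assumptions, not from the lecture.

```python
import random

# Hypothetical 3-armed bandit: true mean rewards are unknown to the agent.
TRUE_MEANS = [0.3, 0.5, 0.7]   # assumption for illustration
EPSILON = 0.1                  # exploration probability (assumed)

counts = [0, 0, 0]             # pulls per arm
values = [0.0, 0.0, 0.0]       # empirical mean reward per arm

def pull(arm):
    """Simulate a noisy reward from the chosen arm."""
    return TRUE_MEANS[arm] + random.gauss(0, 0.1)

for t in range(10_000):
    if random.random() < EPSILON:
        arm = random.randrange(3)                      # explore: try something new
    else:
        arm = max(range(3), key=lambda a: values[a])   # exploit: best known arm
    r = pull(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]     # incremental mean update

print(counts, [round(v, 2) for v in values])
```

With $\epsilon = 0$ the agent can lock onto a suboptimal arm it happened to try first; a small $\epsilon$ keeps gathering knowledge about the other arms.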

#### Breakout vs. Montezuma's Revenge

| Property | Breakout | Montezuma's Revenge |
|----------|----------|---------------------|
| **Reward frequency** | Dense (every brick hit gives points) | Extremely sparse (only after collecting a key or treasure) |
| **State space** | Simple (ball, paddle, bricks) | Complex (many rooms, objects, ladders, timing) |
| **Action relevance** | Almost any action affects reward soon | Most actions have no immediate feedback |
| **Exploration depth** | Shallow (few steps to reward) | Deep (dozens or hundreds of steps before reward) |
| **Determinism** | Mostly deterministic dynamics | Deterministic, but requires long sequences of precise actions |
| **Credit assignment** | Easy (short time gap) | Very hard (long delay from cause to effect) |

#### Motivation

- Motivation: "Forces" that energize an organism to act and that direct its activity
- Extrinsic Motivation: being motivated to do something because of some external reward ($, a prize, food, water, etc.)
- Intrinsic Motivation: being motivated to do something because it is inherently enjoyable (curiosity, exploration, novelty, surprise, incongruity, complexity, ...)

### Intuitive Exploration Strategy

- Intrinsic motivation drives the exploration of unknowns.
- Intuitively, we explore efficiently once we know what we do not know, and target our exploration efforts at the unknown part of the space.
- All non-naive exploration methods rely on some form of uncertainty estimation: over the states (or state-action pairs) we have visited, over the transition dynamics, or over the Q-functions.

- Methods that are optimal in small (tabular) settings do not work directly at scale, but they can inspire methods for larger settings.
- These larger-scale methods may rely on some hacks.

### Classes of Exploration Methods in Deep RL

- Optimistic exploration
  - Uncertainty about states
  - Visiting novel states (state visitation counting)
- Information state search
  - Uncertainty about state transitions or dynamics
  - Dynamics prediction error or information gain for dynamics learning
- Posterior sampling
  - Uncertainty about Q-value functions or policies
  - Selecting actions according to the probability that they are best

### Optimistic Exploration

#### Count-Based Exploration in Small MDPs

Book-keep state visitation counts $N(s)$, and add exploration reward bonuses that encourage policies to visit states with lower counts:

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(N(s))
$$

where $\mathcal{B}(N(s))$ is the intrinsic exploration reward bonus.

- UCB: $\mathcal{B}(N(s)) = \sqrt{\frac{2\ln n}{N(s)}}$ (more aggressive exploration)
- MBIE-EB (Strehl & Littman): $\mathcal{B}(N(s)) = \sqrt{\frac{1}{N(s)}}$
- BEB (Kolter & Ng): $\mathcal{B}(N(s)) = \frac{1}{N(s)}$
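
For concreteness, a minimal tabular sketch of augmenting the environment reward with such a bonus (the MBIE-EB form is used here; the bonus scale `BETA` and the dictionary-of-counts bookkeeping are illustrative assumptions, not from the lecture):

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)   # N(s): state visitation counts
BETA = 0.1                        # bonus scale (assumed hyperparameter)

def exploration_bonus(state):
    """MBIE-EB style bonus: BETA * sqrt(1 / N(s))."""
    return BETA * math.sqrt(1.0 / visit_counts[state])

def augmented_reward(state, env_reward):
    """R(s, a, s') = r(s, a, s') + B(N(s')): update the count, then add the bonus."""
    visit_counts[state] += 1
    return env_reward + exploration_bonus(state)
```

Any tabular agent (e.g., Q-learning) can then learn from `augmented_reward(s_next, r)` in place of the raw reward.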

- We want to come up with something that rewards states we have not visited often.
- But in large MDPs, we rarely visit the same state twice!
- We need to capture a notion of state similarity, and reward states that are most dissimilar to what we have seen so far
  - as opposed to merely different (since new states will always be different).

#### Fitting Generative Models

Idea: fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$).

$p_\theta(s)$ might be high even for a new $s$, if $s$ is similar to previously seen states. Can we use $p_\theta(s)$ to get a "pseudo-count" for $s$?

In a small MDP, the true visitation probability is

$$
P(s)=\frac{N(s)}{n}
$$

where $N(s)$ is the number of times $s$ has been visited and $n$ is the total number of state visits.

After we visit $s$ once more,

$$
P'(s)=\frac{N(s)+1}{n+1}
$$

1. Fit a density model $p_\theta(s)$ to all states $\mathcal{D}$ seen so far.
2. Take a step $i$ and observe $s_i$.
3. Fit a new model $p_{\theta'}(s)$ to $\mathcal{D} \cup \{s_i\}$.
4. Use $p_\theta(s_i)$ and $p_{\theta'}(s_i)$ to estimate the pseudo-count $\hat{N}(s_i)$.
5. Set $r_i^+ = r_i + \mathcal{B}(\hat{N}(s_i))$.
6. Go to 1.

How do we get $\hat{N}(s_i)$? Use the equations

$$
p_\theta(s_i)=\frac{\hat{N}(s_i)}{\hat{n}}\qquad p_{\theta'}(s_i)=\frac{\hat{N}(s_i)+1}{\hat{n}+1}
$$
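
Solving these two equations for the two unknowns gives

$$
\hat{n} = \frac{1 - p_{\theta'}(s_i)}{p_{\theta'}(s_i) - p_\theta(s_i)},\qquad
\hat{N}(s_i) = \hat{n}\, p_\theta(s_i) = \frac{p_\theta(s_i)\,\bigl(1 - p_{\theta'}(s_i)\bigr)}{p_{\theta'}(s_i) - p_\theta(s_i)},
$$

so the pseudo-count requires only the density model's output at $s_i$ before and after the update.
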
[link to the paper](https://arxiv.org/pdf/1606.01868)
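
A schematic sketch of the loop above (steps 1-6). The density model interface (`fit`, `prob`) and the bonus form are placeholder assumptions, not a specific implementation from the paper:

```python
import math

def pseudo_count(p_old, p_new):
    """Invert p = N/n and p' = (N+1)/(n+1) to recover the pseudo-count N_hat."""
    if p_new <= p_old:
        return float("inf")            # state already well modeled -> effectively infinite count
    return p_old * (1.0 - p_new) / (p_new - p_old)

def bonus(n_hat, beta=0.1):
    """MBIE-EB style bonus on the pseudo-count (beta is an assumed scale)."""
    return beta / math.sqrt(n_hat + 1e-8)

def augment_reward(density_model, states_so_far, s_i, r_i):
    """One iteration: p_theta(s_i) before the update, p_theta'(s_i) after, then the bonus."""
    p_old = density_model.prob(s_i)            # model already fit to D (step 1)
    density_model.fit(states_so_far + [s_i])   # refit on D plus {s_i} (step 3)
    p_new = density_model.prob(s_i)
    return r_i + bonus(pseudo_count(p_old, p_new))
```

In practice one would update the density model online rather than refit it from scratch at every step.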

#### Density Models

[link to the paper](https://arxiv.org/pdf/1703.01310)

#### State Counting with Deep Hashing

- We still count states (images), but in a compressed latent space rather than in pixel space.
- Compress $s$ into a latent code, then count occurrences of the code.
- How do we get the image encoding? E.g., using autoencoders.
- There is no guarantee that such a reconstruction loss will capture the important features that make two states similar. A sketch of a simple hash-based counting scheme follows.
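
As one illustrative instantiation (a SimHash-style scheme with a fixed random projection rather than a learned autoencoder; the code length `K`, the 84x84 grayscale frame shape, and the bonus scale are assumptions):

```python
import math
from collections import defaultdict

import numpy as np

K = 32                                  # number of hash bits (assumed)
FRAME_SHAPE = (84, 84)                  # assumed grayscale observation size
rng = np.random.default_rng(0)
A = rng.standard_normal((K, FRAME_SHAPE[0] * FRAME_SHAPE[1]))  # fixed random projection

code_counts = defaultdict(int)          # N(phi(s)): visitation counts per latent code

def hash_code(state_image):
    """Compress an image into a K-bit binary code via a random projection (SimHash)."""
    flat = np.asarray(state_image, dtype=np.float32).reshape(-1)
    bits = (A @ flat > 0).astype(np.uint8)
    return bits.tobytes()               # hashable key for counting

def count_bonus(state_image, beta=0.1):
    """Update the count of the state's code and return a bonus that decays with visits."""
    code = hash_code(state_image)
    code_counts[code] += 1
    return beta / math.sqrt(code_counts[code])
```

A learned encoder (e.g., an autoencoder's latent code, as mentioned above) can be substituted for the random projection; the counting logic stays the same.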

@@ -22,4 +22,5 @@ export default {
  CSE510_L17: "CSE510 Deep Reinforcement Learning (Lecture 17)",
  CSE510_L18: "CSE510 Deep Reinforcement Learning (Lecture 18)",
  CSE510_L19: "CSE510 Deep Reinforcement Learning (Lecture 19)",
  CSE510_L20: "CSE510 Deep Reinforcement Learning (Lecture 20)",
}