# CSE510 Deep Reinforcement Learning (Lecture 20)

## Exploration in RL

### Motivations

#### Exploration vs. Exploitation Dilemma

Online decision-making involves a fundamental choice:

- Exploration: trying out new things (new behaviors), with the hope of discovering higher rewards
- Exploitation: doing what you know will yield the highest reward

The best long-term strategy may involve short-term sacrifices: gather enough knowledge early to make the best long-term decisions.
Examples:

- Restaurant Selection
  - Exploitation: Go to your favorite restaurant
  - Exploration: Try a new restaurant
- Oil Drilling
  - Exploitation: Drill at the best known location
  - Exploration: Drill at a new location
- Game Playing
  - Exploitation: Play the move you believe is best
  - Exploration: Play an experimental move
#### Breakout vs. Montezuma's Revenge

| Property | Breakout | Montezuma's Revenge |
|----------|----------|---------------------|
| **Reward frequency** | Dense (every brick hit gives points) | Extremely sparse (only after collecting a key or treasure) |
| **State space** | Simple (ball, paddle, bricks) | Complex (many rooms, objects, ladders, timing) |
| **Action relevance** | Almost any action affects reward soon | Most actions have no immediate feedback |
| **Exploration depth** | Shallow (few steps to reward) | Deep (dozens/hundreds of steps before reward) |
| **Determinism** | Mostly deterministic dynamics | Deterministic, but requires long sequences of precise actions |
| **Credit assignment** | Easy — short time gap | Very hard — long delay from cause to effect |

#### Motivation

- Motivation: "forces" that energize an organism to act and that direct its activity
- Extrinsic motivation: being motivated to do something because of some external reward ($, a prize, food, water, etc.)
- Intrinsic motivation: being motivated to do something because it is inherently enjoyable (curiosity, exploration, novelty, surprise, incongruity, complexity, ...)

### Intuitive Exploration Strategy

- Intrinsic motivation drives the exploration of unknowns.
- Intuitively, we explore efficiently once we know what we do not know, and target our exploration efforts at the unknown part of the space.
- All non-naive exploration methods rely on some form of uncertainty estimation: over the states (or state-action pairs) we have visited, over the transition dynamics, or over the Q-functions.
- Methods that are optimal in small settings do not scale to large ones, but they can inspire approximations for larger settings.
- These approximations may use some hacks.

### Classes of Exploration Methods in Deep RL

- Optimistic exploration
  - Uncertainty about states
  - Visiting novel states (state visitation counting)
- Information state search
  - Uncertainty about state transitions or dynamics
  - Dynamics prediction error or information gain for dynamics learning
- Posterior sampling
  - Uncertainty about Q-value functions or policies
  - Selecting actions according to the probability they are best

### Optimistic Exploration

#### Count-Based Exploration in Small MDPs

Book-keep state visitation counts $N(s)$ and add exploration reward bonuses that encourage policies to visit states with low counts:

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(N(s))
$$

where $\mathcal{B}(N(s))$ is the intrinsic exploration reward bonus. Common choices (see the sketch after this subsection):

- UCB: $\mathcal{B}(N(s)) = \sqrt{\frac{2\ln n}{N(s)}}$ (more aggressive exploration)
- MBIE-EB (Strehl & Littman): $\mathcal{B}(N(s)) = \sqrt{\frac{1}{N(s)}}$
- BEB (Kolter & Ng): $\mathcal{B}(N(s)) = \frac{1}{N(s)}$

Observations:

- We want a bonus that rewards states we have not visited often.
- But in large MDPs, we rarely visit the same state twice!
- We need to capture a notion of state similarity and reward states that are most dissimilar to what we have seen so far, as opposed to merely different (they will always be different).
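To make the bonus definitions concrete, here is a minimal tabular sketch in Python. The class name, the dictionary-backed count table, and the default bonus choice are illustrative assumptions, not part of the lecture.

```python
import math
from collections import defaultdict


def ucb_bonus(n_s: int, n_total: int) -> float:
    """UCB-style bonus: sqrt(2 ln n / N(s))."""
    return math.sqrt(2.0 * math.log(n_total) / n_s)


def mbie_eb_bonus(n_s: int) -> float:
    """MBIE-EB bonus (Strehl & Littman): sqrt(1 / N(s))."""
    return math.sqrt(1.0 / n_s)


def beb_bonus(n_s: int) -> float:
    """BEB bonus (Kolter & Ng): 1 / N(s)."""
    return 1.0 / n_s


class CountBonus:
    """Tabular visit counter that augments rewards with an exploration bonus."""

    def __init__(self, bonus: str = "mbie_eb"):
        self.counts = defaultdict(int)  # N(s)
        self.total = 0                  # n, total number of state visits
        self.bonus = bonus

    def augmented_reward(self, s, r: float) -> float:
        """Update the count for state s and return r + B(N(s))."""
        self.counts[s] += 1
        self.total += 1
        n_s = self.counts[s]
        if self.bonus == "ucb":
            b = ucb_bonus(n_s, self.total)
        elif self.bonus == "beb":
            b = beb_bonus(n_s)
        else:
            b = mbie_eb_bonus(n_s)
        return r + b
```

For instance, `CountBonus(bonus="ucb").augmented_reward(s, r)` adds the UCB bonus to each environment reward; as a state's count grows, its bonus decays and the policy is pushed toward less-visited states.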
#### Fitting Generative Models

Idea: fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$).

$p_\theta(s)$ might be high even for a new $s$ if $s$ is similar to previously seen states. Can we use $p_\theta(s)$ to get a "pseudo-count" for $s$?

In a small MDP, the true (empirical) probability is

$$
P(s) = \frac{N(s)}{n}
$$

where $N(s)$ is the number of times $s$ has been visited and $n$ is the total number of state visits. After we visit $s$ once more,

$$
P'(s) = \frac{N(s)+1}{n+1}
$$

The algorithm:

1. Fit a model $p_\theta(s)$ to all states $\mathcal{D}$ seen so far.
2. Take a step $i$ and observe $s_i$.
3. Fit a new model $p_\theta'(s)$ to all states $\mathcal{D} \cup \{s_i\}$.
4. Use $p_\theta(s_i)$ and $p_\theta'(s_i)$ to estimate the pseudo-count $\hat{N}(s_i)$.
5. Set $r_i^+ = r_i + \mathcal{B}(\hat{N}(s_i))$.
6. Go to 1.

How do we get $\hat{N}(s_i)$? Solve the two equations

$$
p_\theta(s_i) = \frac{\hat{N}(s_i)}{\hat{n}} \qquad p_\theta'(s_i) = \frac{\hat{N}(s_i)+1}{\hat{n}+1}
$$

for the two unknowns $\hat{N}(s_i)$ and $\hat{n}$ (a code sketch of this calculation appears at the end of these notes).

[Link to the paper](https://arxiv.org/pdf/1606.01868)

#### Density Models

[Link to the paper](https://arxiv.org/pdf/1703.01310)

#### State Counting with Deep Hashing

- We still count states (images), but in a compressed latent space rather than in pixel space.
- Compress $s$ into a latent code, then count occurrences of the code.
- How do we get the image encoding? For example, with an autoencoder.
- There is no guarantee that such a reconstruction loss will capture the important features that make two states similar.
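One concrete way to instantiate this latent-code counting (not spelled out in the notes) is SimHash-style counting: project the state, binarize the projection, and count the resulting codes. The sketch below uses a random projection for simplicity; the class name, the code length `k`, and the bonus scale `beta` are illustrative assumptions, and the projection could be replaced by an autoencoder's encoder.

```python
import numpy as np
from collections import defaultdict


class HashCounter:
    """SimHash-style state counter: project, binarize, and count hash codes."""

    def __init__(self, state_dim: int, k: int = 32, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random Gaussian projection; an autoencoder's latent code could be
        # substituted here to make the codes more semantically meaningful.
        self.A = rng.standard_normal((k, state_dim))
        self.counts = defaultdict(int)

    def code(self, s: np.ndarray) -> bytes:
        """Map a (possibly flattened image) state to a k-bit binary code."""
        bits = (self.A @ s.ravel()) > 0
        return np.packbits(bits).tobytes()

    def bonus(self, s: np.ndarray, beta: float = 0.1) -> float:
        """Update the count for s's code and return beta / sqrt(count)."""
        c = self.code(s)
        self.counts[c] += 1
        return beta / np.sqrt(self.counts[c])
```

A typical usage would be `r_plus = r + HashCounter(state_dim=obs.size).bonus(obs)` with a single shared counter across training; states that hash to the same code share one count, which is exactly the notion of "similar enough" that the hash function encodes.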
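Finally, returning to the pseudo-count equations from the Fitting Generative Models subsection: solving the pair of equations for $\hat{N}(s_i)$ gives the closed form $\hat{N}(s_i) = \frac{p_\theta(s_i)\,(1 - p_\theta'(s_i))}{p_\theta'(s_i) - p_\theta(s_i)}$, which matches the pseudo-count used in the linked paper. The sketch below implements only this algebra plus a bonus; the `density` object in the usage comment is a hypothetical stand-in for whatever density model is actually trained.

```python
import math


def pseudo_count(p_before: float, p_after: float) -> float:
    """Solve p = N/n and p' = (N+1)/(n+1) for the pseudo-count N.

    p_before = p_theta(s_i):  density of s_i before updating the model on s_i
    p_after  = p_theta'(s_i): density of s_i after one update on s_i
    Closed form: N = p_before * (1 - p_after) / (p_after - p_before).
    """
    gain = p_after - p_before
    if gain <= 0:
        # The model did not become more confident about s_i; treat the state
        # as already well known (infinite pseudo-count, zero bonus).
        return float("inf")
    return p_before * (1.0 - p_after) / gain


def exploration_bonus(n_hat: float, beta: float = 0.1) -> float:
    """MBIE-EB-style bonus on the pseudo-count: beta / sqrt(N_hat)."""
    return beta / math.sqrt(n_hat + 1e-8)


# Hypothetical usage with a density model `density` exposing prob/update:
#
#   p_before = density.prob(s_i)
#   density.update(s_i)
#   p_after = density.prob(s_i)
#   r_plus = r_i + exploration_bonus(pseudo_count(p_before, p_after))
```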