# CSE510 Deep Reinforcement Learning (Lecture 21)

## Exploration in RL

### Information state search

Uncertainty about state transitions or dynamics: explore using the dynamics prediction error, or the information gain for dynamics learning.

#### Computational Curiosity

- "The direct goal of curiosity and boredom is to improve the world model."
- "Curiosity unit": the reward is a function of the mismatch between the model's current predictions and actuality.
- There is positive reinforcement whenever the system fails to correctly predict the environment.
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation (planning to make your internal world model fail).

#### Reward Prediction Error

- Add an exploration reward bonus that encourages policies to visit states that will cause the prediction model to fail:

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
$$

- where $r(s,a,s')$ is the extrinsic reward, $T(s,a;\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (intrinsic reward bonus).
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.

[Link to the paper](https://arxiv.org/pdf/1507.08750)
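As a concrete illustration, here is a minimal PyTorch sketch of this bonus, assuming a small learned forward model $T(s,a;\theta)$ over continuous state and action vectors; the class and function names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Learned dynamics model T(s, a; theta): predicts the next state from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def curiosity_reward(model, optimizer, s, a, r, s_next, beta=0.1):
    """Augmented reward R = r + beta * ||T(s, a; theta) - s'||."""
    pred = model(s, a)
    bonus = (pred - s_next).norm(dim=-1)   # prediction error = intrinsic bonus
    loss = bonus.pow(2).mean()             # training the model is what makes the bonus
    optimizer.zero_grad()                  # non-stationary: novel states become predictable later
    loss.backward()
    optimizer.step()
    return r + beta * bonus.detach()       # augmented reward handed to the RL algorithm
```

For discrete actions one would one-hot encode `a`; for image observations, predicting raw pixels is hard, which motivates the encoded-feature variant in the next subsection.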
#### Example: Learning Visual Dynamics

- Exploration reward bonus: $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$
- However, a trivial solution exists: the agent can collect reward by just moving around randomly.

---

- Exploration reward bonus with autoencoders: $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$, where $E$ encodes a state into a latent feature vector.
- But this suffers from the problems of the autoencoder reconstruction loss, which captures features that may have little to do with our task.

#### Task Rewards vs. Exploration Rewards

Exploration reward bonuses:

$$
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
$$

Only task rewards:

$$
R(s,a,s') = r(s,a,s')
$$

Task + curiosity rewards:

$$
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Sparse task + curiosity rewards:

$$
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Only curiosity rewards:

$$
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
$$

#### Intrinsic reward RL is not New

- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS'05, pp. 547–554 (2006)
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN'91, vol. 2, pp. 1458–1463 (1991)
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–247 (2010)
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS'04 (2004)
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN'95 (1995)
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708

#### Limitation of Prediction Errors

- The agent is rewarded even when the model cannot improve.
- So it will focus on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem
  - The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.

#### Random Network Distillation

Original idea: predict the output of a fixed and randomly initialized neural network on the next state, given the current state and action.

New idea: predict the output of a fixed and randomly initialized neural network on the next state, given the **next state itself**.

- The target network is a neural network with fixed, random weights, which is never trained.
- The prediction network is trained to predict the target network's output.

> The more often you visit a state, the smaller the prediction loss becomes.

### Posterior Sampling

Uncertainty about Q-value functions or policies: select actions according to the probability that they are best under the current model.

#### Exploration with Action Value Information

Count-based and curiosity-driven methods do not take the action-value information into account.

![Action Value Information](https://notenextra.trance-0.com/CSE510/Action_Value_Information.png)

> In this case the optimal choice is action 1, but we will keep exploring action 3 because it has the highest uncertainty, and it takes a long time to distinguish actions 1 and 2 since their values are similar.

#### Exploration via Posterior Sampling of Q Functions

- Represent a posterior distribution over Q functions instead of a point estimate.

1. Sample a Q function from the posterior: $Q \sim P(Q)$.
2. Choose actions according to this $Q$ for one episode: $a = \arg\max_{a} Q(s,a)$.
3. Update $P(Q)$ based on the sampled $Q$ and the collected experience tuples $(s,a,r,s')$.

- Then we do not need $\epsilon$-greedy for exploration! A minimal tabular sketch follows below.
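A minimal tabular sketch of steps 1–3, keeping an independent Gaussian posterior over each $Q(s,a)$; the Gaussian form, the update rule, and all names are illustrative assumptions, not prescribed by the lecture.

```python
import numpy as np

n_states, n_actions = 10, 4
mu = np.zeros((n_states, n_actions))      # posterior mean of Q(s, a)
var = np.ones((n_states, n_actions))      # posterior variance: large where we are uncertain
counts = np.zeros((n_states, n_actions))  # visit counts

def sample_q():
    """Step 1: draw one Q function from the posterior P(Q)."""
    return np.random.normal(mu, np.sqrt(var))

def act(q_sample, s):
    """Step 2: act greedily w.r.t. the sampled Q for the whole episode (no epsilon-greedy)."""
    return int(np.argmax(q_sample[s]))

def update(s, a, td_target):
    """Step 3: shrink the posterior for (s, a) toward the observed target."""
    counts[s, a] += 1
    lr = 1.0 / counts[s, a]
    mu[s, a] += lr * (td_target - mu[s, a])
    var[s, a] = 1.0 / (1.0 + counts[s, a])  # uncertainty decreases with more visits
```

Because one sampled Q is held fixed for a whole episode, exploration is temporally consistent, unlike the per-step dithering of $\epsilon$-greedy.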
- Better exploration by representing uncertainty over Q.
- But how can we learn a distribution of Q functions $P(Q)$ if the Q function is a deep neural network?

#### Bootstrap Ensemble

- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data.
  - Computationally expensive.
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data (see the sketch after the questions below).

### Questions

- Why do PG methods implicitly support exploration?
  - Is it sufficient? How can we improve its implicit exploration?
  - What are the limitations of entropy regularization?
- How can we improve exploration for PG methods?
  - Intrinsic-motivation bonuses (e.g., RND)
  - Explicitly optimize per-state entropy in the return (e.g., SAC)
  - Hierarchical RL
  - Goal-conditioned RL
- What are potentially more effective exploration methods?
  - Knowledge-driven
  - Model-based exploration
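A minimal PyTorch sketch of the shared-backbone bootstrap ensemble mentioned above; the layer sizes, head count, and the Bernoulli masking noted in the comment are illustrative assumptions, not details given in the lecture.

```python
import torch
import torch.nn as nn

class BootstrappedQ(nn.Module):
    """Shared backbone with K Q-value heads; each head is trained on its own bootstrap of the data."""
    def __init__(self, state_dim, n_actions, n_heads=10, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, s):
        z = self.backbone(s)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (batch, K, n_actions)

# Posterior-sampling-style acting: pick one head per episode and act greedily with it.
q_net = BootstrappedQ(state_dim=8, n_actions=4)
k = torch.randint(len(q_net.heads), (1,)).item()  # sampled head index for this episode
s = torch.zeros(1, 8)                             # placeholder state
action = q_net(s)[0, k].argmax().item()
# During training, each transition is assigned (e.g., via a Bernoulli mask) to a random
# subset of heads, so the heads disagree where data is scarce -- that disagreement drives exploration.
```

Sharing the backbone amortizes most of the computation, so adding heads is far cheaper than training K separate networks, while the independently initialized heads still give a useful approximation to $P(Q)$.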