# CSE510 Deep Reinforcement Learning (Lecture 21)

## Exploration in RL

### Information state search

Uncertainty about state transitions or dynamics

Dynamics prediction error or information gain for dynamics learning

#### Computational Curiosity

- "The direct goal of curiosity and boredom is to improve the world model."
- "Curiosity Unit": the reward is a function of the mismatch between the model's current predictions and actuality.
- There is positive reinforcement whenever the system fails to correctly predict the environment.
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation (i.e., planning to make your internal world model fail).

#### Reward Prediction Error

- Add exploration reward bonuses that encourage policies to visit states that will cause the prediction model to fail (see the sketch below).

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
$$

- where $r(s,a,s')$ is the extrinsic reward, $T(s,a;\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (intrinsic reward bonus).
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is now new and novel becomes old and known.

[link to the paper](https://arxiv.org/pdf/1507.08750)

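A minimal PyTorch sketch of this idea (not from the lecture): a small forward-dynamics model plays the role of $T(s,a;\theta)$, its prediction error is added to the extrinsic reward, and fitting the model on visited transitions is exactly what makes the bonus non-stationary. The names `DynamicsModel`, `augmented_reward`, and the scale `beta` are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsModel(nn.Module):
    """Predicts the next state from (state, action); plays the role of T(s, a; theta)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def augmented_reward(model, s, a, s_next, r_ext, beta=0.1):
    """R(s,a,s') = r(s,a,s') + beta * ||T(s,a;theta) - s'||."""
    with torch.no_grad():
        pred_error = (model(s, a) - s_next).pow(2).sum(dim=-1).sqrt()
    return r_ext + beta * pred_error  # bonus shrinks as the model improves

def update_model(model, optimizer, s, a, s_next):
    """Fitting T on visited transitions is what makes the bonus non-stationary."""
    loss = F.mse_loss(model(s, a), s_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
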
<details>
<summary>Example</summary>

Learning Visual Dynamics

- Exploration reward bonuses $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$
- However, a trivial solution exists: the agent can keep collecting this reward by just moving around randomly, since raw visual observations remain hard to predict.

---

- Exploration reward bonuses with autoencoders $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$
- But this suffers from the problems of the autoencoding reconstruction loss, which has little to do with our task (see the sketch after this example).

</details>

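A minimal sketch of the latent-space variant, under the same illustrative assumptions as before (hypothetical module names and sizes, not prescribed by the lecture): an autoencoder supplies the encoder $E$, and a latent forward model predicts the encoding of $s'$ from $(E(s), a)$, so the bonus is computed in feature space rather than on raw observations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    """Encoder E and decoder trained with a reconstruction loss."""
    def __init__(self, state_dim, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, state_dim))

    def forward(self, s):
        z = self.enc(s)
        return z, self.dec(z)

class LatentDynamics(nn.Module):
    """Predicts E(s') from (E(s), a)."""
    def __init__(self, latent_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def latent_bonus(ae, dyn, s, a, s_next):
    """B(s,a,s') = ||T(E(s), a) - E(s')||, computed without tracking gradients."""
    with torch.no_grad():
        z, z_next = ae.enc(s), ae.enc(s_next)
        return (dyn(z, a) - z_next).norm(dim=-1)

def reconstruction_loss(ae, s):
    """Autoencoder objective; note it may have little to do with the actual task."""
    _, recon = ae(s)
    return F.mse_loss(recon, s)
```
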
#### Task Rewards vs. Exploration Rewards

Exploration reward bonuses:

$$
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
$$

Only task rewards:

$$
R(s,a,s') = r(s,a,s')
$$

Task + curiosity rewards:

$$
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Sparse task + curiosity rewards:

$$
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Only curiosity rewards:

$$
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
$$

#### Intrinsic Reward RL Is Not New

- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS’05, pp. 547–554 (2006)
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN’91, vol. 2, pp. 1458–1463 (1991)
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–247 (2010)
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS’04 (2004)
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN’95 (1995)
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708

#### Limitation of Prediction Errors

- The agent will be rewarded even when the model cannot improve.
- So it will focus on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem
  - The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.

#### Random Network Distillation

Original idea: predict the output of a fixed and randomly initialized neural network on the next state, given the current state and action.

New idea: predict the output of a fixed and randomly initialized neural network on the next state, given the **next state itself.**

- The target network is a neural network with fixed, randomized weights, which is never trained.
- The prediction network is trained to predict the target network's output; its prediction error is used as the exploration bonus (sketched below).

> The more often you visit a state, the smaller the prediction loss becomes.

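A minimal RND-style sketch, assuming low-dimensional observations and hypothetical names (the original method uses CNNs on image observations and also normalizes observations and bonuses): a frozen random target network, a trained predictor, and the predictor's error on the next state as the bonus.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net(obs_dim, out_dim=64):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, out_dim))

class RND(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.target = make_net(obs_dim)      # fixed, random, never trained
        self.predictor = make_net(obs_dim)   # trained to match the target
        for p in self.target.parameters():
            p.requires_grad_(False)

    def bonus(self, s_next):
        """Exploration bonus: predictor error on the next state only."""
        with torch.no_grad():
            return (self.predictor(s_next) - self.target(s_next)).pow(2).mean(dim=-1)

    def loss(self, s_next):
        """Training loss; frequently visited states get low loss, hence low bonus."""
        return F.mse_loss(self.predictor(s_next), self.target(s_next).detach())
```

Because the target is a deterministic, fixed function of the observation, the bonus does not stay high just because the environment's transitions are stochastic, which mitigates the noisy-TV problem above.
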
### Posterior Sampling

Uncertainty about Q-value functions or policies

Select actions according to the probability that they are best under the current model.

#### Exploration with Action Value Information

Count-based and curiosity-driven methods do not take action-value information into account.

![2024-11-26-9.png](./images/2024-11-26-9.png)

> In this case, the optimal choice is action 1, but we will keep exploring action 3 because it has the highest uncertainty. And it takes a long time to distinguish actions 1 and 2 since they have similar values.

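A minimal sketch of "select each action with the probability that it is best" for a bandit-style setting like the figure. It assumes an independent Gaussian posterior per action value with known noise variance; the class name and the toy values are illustrative, not from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianThompson:
    """Thompson sampling with an independent N(mu, var) posterior per action value."""
    def __init__(self, n_actions, prior_var=1.0, noise_var=1.0):
        self.mu = np.zeros(n_actions)
        self.var = np.full(n_actions, prior_var)
        self.noise_var = noise_var

    def act(self):
        # Sample one value per action from its posterior, then act greedily:
        # each action is picked with exactly the probability that it is best.
        samples = rng.normal(self.mu, np.sqrt(self.var))
        return int(np.argmax(samples))

    def update(self, a, reward):
        # Conjugate Gaussian update: combine prior and observation by precision.
        post_precision = 1.0 / self.var[a] + 1.0 / self.noise_var
        post_mean = (self.mu[a] / self.var[a] + reward / self.noise_var) / post_precision
        self.mu[a], self.var[a] = post_mean, 1.0 / post_precision

# Toy usage: actions 1 and 2 have similar true values, action 3 is clearly worse.
true_q = [1.0, 0.95, 0.2]
agent = GaussianThompson(n_actions=3)
for _ in range(1000):
    a = agent.act()
    agent.update(a, true_q[a] + rng.normal(0.0, 1.0))
```
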
#### Exploration via Posterior Sampling of Q Functions

- Represent a posterior distribution over Q functions, instead of a point estimate.
  1. Sample a Q function from the posterior: $Q \sim P(Q)$
  2. Act greedily with respect to this $Q$ for one episode: $a = \arg\max_{a} Q(s,a)$
  3. Update $P(Q)$ using the sampled $Q$ and the collected experience tuples $(s,a,r,s')$
- Then we do not need $\epsilon$-greedy for exploration! We get better exploration by representing uncertainty over Q.
- But how can we learn a distribution over Q functions $P(Q)$ when the Q function is a deep neural network?

#### Bootstrap Ensemble

- Neural network ensembles: train multiple Q-function approximators, each on a different subset of the data.
  - Computationally expensive
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data (a minimal sketch follows this list).

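A minimal sketch of the shared-backbone variant, with hypothetical names and sizes: K Q-heads on one trunk, one head sampled per episode (acting like a sample $Q \sim P(Q)$), and Bernoulli bootstrap masks deciding which heads train on which transitions.

```python
import random
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Shared backbone with K Q-value heads; each head approximates one sample from P(Q)."""
    def __init__(self, state_dim, n_actions, n_heads=10, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, s, head):
        return self.heads[head](self.backbone(s))

def start_episode(n_heads):
    """Sample one head (the 'Q ~ P(Q)' step) and act greedily with it all episode."""
    return random.randrange(n_heads)

def greedy_action(qnet, s, head):
    with torch.no_grad():
        return int(qnet(s, head).argmax(dim=-1))

def bootstrap_mask(n_heads, p=0.8):
    """Bernoulli mask: which heads get to train on this transition (data bootstrapping)."""
    return (torch.rand(n_heads) < p).float()
```
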
### Questions

- Why do PG methods implicitly support exploration?
  - Is it sufficient? How can we improve their implicit exploration?
  - What are the limitations of entropy regularization?
- How can we improve exploration for PG methods?
  - Intrinsically motivated bonuses (e.g., RND)
  - Explicitly optimize per-state entropy in the return (e.g., SAC)
  - Hierarchical RL
  - Goal-conditioned RL
- What are potentially more effective exploration methods?
  - Knowledge-driven exploration
  - Model-based exploration