# CSE510 Deep Reinforcement Learning (Lecture 21)

## Exploration in RL

### Information state search

Uncertainty about state transitions or dynamics.

Dynamics prediction error, or information gain for dynamics learning.

#### Computational Curiosity

- "The direct goal of curiosity and boredom is to improve the world model."
|
||||
- "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
|
||||
- There is positive reinforcement whenever the system fails to correctly predict the environment.
|
||||
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation. (planning to make your (internal) world model to fail)
|
||||
|
||||
#### Reward Prediction Error

- Add exploration reward bonuses that encourage policies to visit states where the prediction model fails (sketched below).

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
$$

- where $r(s,a,s')$ is the extrinsic reward, $T(s,a;\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (the intrinsic reward bonus).
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.

[link to the paper](https://arxiv.org/pdf/1507.08750)
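
As a concrete illustration, here is a minimal PyTorch sketch of this bonus. The network sizes, the bonus scale `beta`, and the name `DynamicsModel` are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Learned dynamics T(s, a; theta): predicts the next state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def augmented_reward(r, s, a, s_next, model, beta=0.1):
    """R(s,a,s') = r(s,a,s') + beta * ||T(s,a;theta) - s'||, with the bonus detached from gradients."""
    with torch.no_grad():
        prediction_error = (model(s, a) - s_next).norm(dim=-1)
    return r + beta * prediction_error
```

In practice the same prediction error (without `no_grad`) serves as the training loss for the dynamics model, which is exactly what makes the bonus non-stationary.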

<details>
<summary>Example</summary>

Learning Visual Dynamics

- Exploration reward bonuses: $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$
- However, a trivial solution exists: the agent can collect this reward just by moving around randomly.

---

- Exploration reward bonuses with autoencoders: $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$
- But this suffers from the problem that the autoencoder reconstruction loss has little to do with our task (see the sketch below).
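
A compact sketch of that latent-space bonus, under the same illustrative assumptions (an MLP encoder $E$ and a latent forward model $T$; with an autoencoder, the encoder would additionally be trained with a reconstruction loss):

```python
import torch
import torch.nn as nn

class LatentDynamicsBonus(nn.Module):
    """B(s,a,s') = ||T(E(s), a) - E(s')||: predict the embedding of the next state."""
    def __init__(self, obs_dim, action_dim, latent_dim=32, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))
        self.forward_model = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))

    def bonus(self, s, a, s_next):
        # Exploration bonus only; gradients are not propagated into the policy's reward.
        with torch.no_grad():
            z, z_next = self.encoder(s), self.encoder(s_next)
            z_pred = self.forward_model(torch.cat([z, a], dim=-1))
        return (z_pred - z_next).norm(dim=-1)
```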

</details>

#### Task Rewards vs. Exploration Rewards

Exploration reward bonuses:

$$
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
$$

Only task rewards:

$$
R(s,a,s') = r(s,a,s')
$$

Task + curiosity rewards:

$$
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Sparse task + curiosity rewards:

$$
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Only curiosity rewards:

$$
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
$$
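
The variants above differ only in which terms are kept. A small sketch of how they might be assembled, assuming a `bonus_fn` such as `LatentDynamicsBonus.bonus` from the example and an environment-provided extrinsic reward `r` (the mode names are illustrative, not the lecture's):

```python
def shaped_reward(r, s, a, s_next, bonus_fn, mode="task+curiosity"):
    """Combine the extrinsic reward r(s,a,s') with a curiosity bonus B(s,a,s')."""
    if mode == "task_only":          # R = r(s,a,s')
        return r
    b = bonus_fn(s, a, s_next)
    if mode == "task+curiosity":     # R^t = r(s,a,s') + B^t(s,a,s'); same form when r^t is sparse
        return r + b
    if mode == "curiosity_only":     # R^c = B^c(s,a,s')
        return b
    raise ValueError(f"unknown mode: {mode}")
```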

#### Intrinsic Reward RL is not New

- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS’05, pp. 547–554 (2006)
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN’91, vol. 2, pp. 1458–1463 (1991)
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–247 (2010)
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS’04 (2004)
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN’95 (1995)
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708

#### Limitation of Prediction Errors

- The agent will be rewarded even when the model cannot improve.
- So it will focus on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem.
- The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.

#### Random Network Distillation

Original idea: predict the output of a fixed, randomly initialized neural network on the next state, given the current state and action.

New idea: predict the output of a fixed, randomly initialized neural network on the next state, given the **next state itself**.

- The target network is a neural network with fixed, randomized weights, which is never trained.
- The predictor network is trained to predict the target network's output.

> The more you visit a state, the smaller the prediction loss (and hence the bonus) becomes.
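
A minimal PyTorch sketch of this idea; the architectures and feature dimension are illustrative assumptions, not the lecture's (or the original paper's) exact configuration.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: the bonus is the predictor's error on a fixed random target."""
    def __init__(self, obs_dim, feat_dim=64, hidden=128):
        super().__init__()
        def make_net():
            return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))
        self.target = make_net()      # fixed, randomly initialized, never trained
        self.predictor = make_net()   # trained to match the target's output
        for p in self.target.parameters():
            p.requires_grad_(False)

    def prediction_error(self, s_next):
        # Depends only on the next state itself, so stochastic dynamics cannot inflate it forever.
        with torch.no_grad():
            target_feat = self.target(s_next)
        return (self.predictor(s_next) - target_feat).pow(2).mean(dim=-1)

# Exploration bonus for the agent:     rnd.prediction_error(s_next).detach()
# Predictor loss on visited states:    rnd.prediction_error(s_next).mean()
# Frequently visited states end up with small errors and therefore small bonuses.
```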

### Posterior Sampling

Uncertainty about Q-value functions or policies.

Select actions according to the probability that they are best under the current model.

#### Exploration with Action Value Information

Count-based and curiosity-driven methods do not take the action value information into account.

*(Figure: estimated value distributions for three actions.)*

> In this case, the optimal action is action 1, but we will keep exploring action 3 because it has the highest uncertainty. It also takes a long time to distinguish actions 1 and 2, since they have similar values.

#### Exploration via Posterior Sampling of Q Functions

- Represent a posterior distribution over Q functions, instead of a point estimate:
  1. Sample a Q function from the posterior: $Q \sim P(Q)$
  2. Choose actions according to this $Q$ for one episode: $a = \arg\max_{a} Q(s,a)$
  3. Update $P(Q)$ based on the sampled $Q$ and the collected experience tuples $(s,a,r,s')$
- Then we do not need $\epsilon$-greedy for exploration! We get better exploration by representing uncertainty over $Q$.
- But how can we learn a distribution over Q functions, $P(Q)$, when the Q function is a deep neural network? (See the sketch below and the bootstrap ensemble that follows.)
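
Before the deep-network answer, here is a minimal tabular sketch of the sampling loop on a 3-armed bandit that mirrors the earlier figure. The Gaussian posteriors, arm means, and (assumed known) noise levels are made-up illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bandit: arm 0 is best, arm 1 has a similar mean, arm 2 is very noisy.
true_mean = np.array([1.0, 0.9, 0.0])
noise_std = np.array([0.1, 0.1, 2.0])

# Gaussian posterior over each Q(a), starting from a broad prior.
mu, var = np.zeros(3), np.full(3, 10.0)
counts = np.zeros(3, dtype=int)

for t in range(2000):
    q_sample = rng.normal(mu, np.sqrt(var))       # 1. sample Q ~ P(Q)
    a = int(np.argmax(q_sample))                  # 2. act greedily w.r.t. the sample
    r = rng.normal(true_mean[a], noise_std[a])    # observe a reward
    counts[a] += 1
    # 3. conjugate Gaussian update of P(Q(a)) given the observation
    obs_var = noise_std[a] ** 2
    new_var = 1.0 / (1.0 / var[a] + 1.0 / obs_var)
    mu[a] = new_var * (mu[a] / var[a] + r / obs_var)
    var[a] = new_var

print("pull counts:", counts, "posterior means:", mu.round(2))
```

Pulls concentrate on the best arm without any $\epsilon$-greedy dithering, because actions are chosen exactly as often as the posterior believes they might be optimal.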

#### Bootstrap Ensemble

- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data.
  - Computationally expensive.
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data (see the sketch below).
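
A sketch of the shared-backbone variant (a bootstrapped-DQN-style head arrangement; the sizes and the head-sampling scheme are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Shared backbone with K Q-heads; each head is trained on its own bootstrap sample of the data."""
    def __init__(self, state_dim, n_actions, n_heads=10, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, s, head=None):
        z = self.backbone(s)
        if head is None:                                           # all heads at once, e.g. for training
            return torch.stack([h(z) for h in self.heads], dim=1)  # shape (batch, K, n_actions)
        return self.heads[head](z)                                 # one sampled head, e.g. for acting

# "Posterior sample" per episode: pick one head at random and act greedily with it.
# k = torch.randint(len(qnet.heads), (1,)).item()
# action = qnet(state, head=k).argmax(dim=-1)
```

Training each head only on its own bootstrap mask of the replay data is what keeps the heads diverse enough to stand in for samples from $P(Q)$.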

### Questions

- Why do PG methods implicitly support exploration?
  - Is it sufficient? How can we improve their implicit exploration?
  - What are the limitations of entropy regularization?
- How can we improve exploration for PG methods?
  - Intrinsic-motivation bonuses (e.g., RND)
  - Explicitly optimize per-state entropy in the return (e.g., SAC)
  - Hierarchical RL
  - Goal-conditioned RL
- What are potentially more effective exploration methods?
  - Knowledge-driven exploration
  - Model-based exploration