CSE510 Deep Reinforcement Learning (Lecture 21)

Exploration in RL

Uncertainty about state transitions or dynamics

Dynamics prediction error or Information gain for dynamics learning

Computational Curiosity

  • "The direct goal of curiosity and boredom is to improve the world model."
  • "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
  • There is positive reinforcement whenever the system fails to correctly predict the environment.
  • Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation (planning to make your internal world model fail).

Reward Prediction Error

  • Add exploration reward bonuses that encourage policies to visit states that will cause the prediction model to fail.

R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
  • where r(s,a,s') is the extrinsic reward, T(s,a;\theta) is the predicted next state, and \mathcal{B} is a bonus function (the intrinsic reward bonus); a sketch follows below.
  • Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.
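
A minimal sketch of this bonus in PyTorch, assuming continuous state and action vectors; the model names, layer sizes, and the choice of \mathcal{B} as a simple scaling of the error are illustrative assumptions, not the lecture's code:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Forward model T(s, a; theta): predicts the next state from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def augmented_reward(r_ext, s, a, s_next, model, beta=1.0):
    """R(s,a,s') = r(s,a,s') + B(||T(s,a;theta) - s'||), with B taken as a simple scaling."""
    with torch.no_grad():
        pred_error = (model(s, a) - s_next).norm(dim=-1)  # dynamics prediction error
    return r_ext + beta * pred_error
```

The same prediction error also serves as the dynamics model's training loss, which is what makes the bonus non-stationary: as the model improves on familiar transitions, their bonus shrinks.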

link to the paper

Example

Learning Visual Dynamics

  • Exploration reward bonuses \mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|
    • However, a trivial solution exists: the agent can collect this reward just by moving around randomly.

  • Exploration reward bonuses with autoencoders: \mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\| (a latent-space sketch follows below)
    • But this suffers from the problem that the autoencoder reconstruction loss may have little to do with our task
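
A sketch of the latent-space variant, assuming an encoder E (e.g., the encoder half of an autoencoder trained with a reconstruction loss) and a latent forward model T; the training loop is omitted and all names are placeholders:

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Latent forward model: predicts E(s') from (E(s), a)."""
    def __init__(self, latent_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def latent_bonus(encoder, latent_model, s, a, s_next):
    """B(s,a,s') = ||T(E(s), a) - E(s')||, measured in the autoencoder's latent space."""
    with torch.no_grad():
        z, z_next = encoder(s), encoder(s_next)
        return (latent_model(z, a) - z_next).norm(dim=-1)
```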

Task Rewards vs. Exploration Rewards

Exploration reward bonuses:


\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|

Only task rewards:


R(s,a,s') = r(s,a,s')

Task+curiosity rewards:


R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')

Sparse task + curiosity rewards:


R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')

Only curiosity rewards:


R^c(s,a,s') = \mathcal{B}^c(s, a, s')

Intrinsic Reward RL is not New

  • Itti, L., Baldi, P. F.: Bayesian surprise attracts human attention. In: NIPS 2005, pp. 547–554 (2006)
  • Schmidhuber, J.: Curious model-building control systems. In: IJCNN 1991, vol. 2, pp. 1458–1463 (1991)
  • Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–247 (2010)
  • Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS 2004 (2004)
  • Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN 1995 (1995)
  • Sun, Y., Gomez, F. J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708

Limitation of Prediction Errors

  • The agent keeps being rewarded even when the model cannot improve.
  • So it will focus on parts of the environment that are inherently unpredictable or stochastic.
  • Example: the noisy-TV problem
    • The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.

Random Network Distillation

Original idea: Predicting the output of a fixed and randomly initialized neural network on the next state, given the current state and action.

New idea: Predicting the output of a fixed and randomly initialized neural network on the next state, given the next state itself.

  • The target network is a neural network with fixed, randomized weights, which is never trained.
  • The prediction network is trained to predict the target network's output.

The more often a state is visited, the smaller the prediction loss on it becomes, so novel states receive the largest bonus.
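
A minimal sketch of the two networks, assuming flat state vectors (architecture and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

def _embed_net(state_dim, embed_dim):
    return nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

class RND(nn.Module):
    def __init__(self, state_dim, embed_dim=64):
        super().__init__()
        self.target = _embed_net(state_dim, embed_dim)     # fixed, randomly initialized, never trained
        self.predictor = _embed_net(state_dim, embed_dim)  # trained to match the target's output
        for p in self.target.parameters():
            p.requires_grad_(False)

    def bonus(self, s_next):
        """Intrinsic reward: prediction error on the next state only."""
        with torch.no_grad():
            return (self.predictor(s_next) - self.target(s_next)).pow(2).mean(dim=-1)

    def loss(self, s_next):
        """Minimized on visited states, so frequently visited states earn a small bonus."""
        return (self.predictor(s_next) - self.target(s_next)).pow(2).mean()
```

Because the target is a fixed, deterministic function of s', the prediction error can be driven down on states the agent actually visits, even when the transitions leading to them are stochastic, which is how this sidesteps the noisy-TV failure mode of forward-model prediction errors.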

Posterior Sampling

Uncertainty about Q-value functions or policies

Select actions according to the probability that they are optimal under the current model.

Exploration with Action Value Information

Count-based and curiosity-driven methods do not take action-value information into account.

Action Value Information

Consider a setting with three actions: the optimal one is action 1, but a purely uncertainty-driven method keeps exploring action 3 because it has the highest uncertainty, and it takes a long time to distinguish actions 1 and 2 since their values are similar.
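
A hypothetical numeric illustration of this failure mode, and of the posterior-sampling idea developed below (all means and uncertainties are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over three action values:
# action 1 is optimal, action 2 is nearly as good, action 3 is poor but highly uncertain.
means = np.array([1.0, 0.9, 0.0])   # actions 1, 2, 3
stds  = np.array([0.1, 0.1, 2.0])

# Pure uncertainty-driven exploration always prefers action 3:
print("most uncertain action:", np.argmax(stds) + 1)

# Posterior (Thompson) sampling: choose the action whose sampled value is largest,
# i.e. sample each action in proportion to the probability that it is best.
samples = rng.normal(means, stds, size=(10_000, 3))
p_best = np.bincount(samples.argmax(axis=1), minlength=3) / len(samples)
print("P(action is best):", p_best)  # action 1 dominates; actions 2 and 3 still get tried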

Exploration via Posterior Sampling of Q Functions

  • Represent a posterior distribution of Q functions, instead of a point estimate (see the sketch after this list):
    1. Sample Q\sim P(Q)
    2. Act greedily with respect to this Q for one episode: a=\arg\max_{a} Q(s,a)
    3. Update P(Q) based on the sampled Q and the collected experience tuples (s,a,r,s')
  • Then we do not need \epsilon-greedy for exploration: representing uncertainty over Q gives better exploration.
  • But how can we learn a distribution P(Q) over Q functions if the Q function is a deep neural network?
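
Schematically, steps 1–3 look like the loop below. This is a minimal sketch: `q_posterior` (with `sample()` and `update()`) and an environment whose `step(a)` returns `(s', r, done)` are placeholder interfaces, and the next section addresses how P(Q) can actually be represented for deep networks.

```python
import numpy as np

def run_posterior_sampling(env, q_posterior, num_episodes):
    """Posterior sampling over Q: sample one Q per episode, act greedily with it, update."""
    for _ in range(num_episodes):
        q = q_posterior.sample()                 # 1. Q ~ P(Q)
        s, done, transitions = env.reset(), False, []
        while not done:
            a = int(np.argmax(q(s)))             # 2. greedy w.r.t. the sampled Q
            s_next, r, done = env.step(a)
            transitions.append((s, a, r, s_next))
            s = s_next
        q_posterior.update(q, transitions)       # 3. update P(Q) with the new experience
```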

Bootstrap Ensemble

  • Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data
    • Computationally expensive
  • Neural network ensembles with a shared backbone: only the heads are trained, each on a different subset of the data (see the sketch below)
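
A sketch of the shared-backbone variant (class name, sizes, and the per-head masking scheme mentioned in the comments are assumptions, not the lecture's code):

```python
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Shared backbone with K independent Q-heads (bootstrap ensemble)."""
    def __init__(self, state_dim, num_actions, num_heads=10, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, num_actions) for _ in range(num_heads)]
        )

    def forward(self, s, head=None):
        z = self.backbone(s)
        if head is None:                          # all heads at once, e.g. for training
            return torch.stack([h(z) for h in self.heads], dim=1)
        return self.heads[head](z)                # one sampled head, e.g. for acting

# "Sampling a Q function" then amounts to picking one head per episode:
#   head = torch.randint(len(net.heads), (1,)).item()
#   a = net(s, head).argmax(dim=-1)
# and each head is trained only on its own bootstrapped subset of the replay data
# (e.g., a per-head Bernoulli mask over transitions).
```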

Questions

  • Why do PG methods implicitly support exploration?
  • Is it sufficient? How can we improve its implicit exploration?
  • What are limitations of entropy regularization?
  • How can we improve exploration for PG methods?
    • Intrinsic-motivated bonuses (e.g., RND)
    • Explicitly optimize per-state entropy in the return (e.g., SAC)
    • Hierarchical RL
    • Goal-conditional RL
  • What are potentially more effective exploration methods?
    • Knowledge-driven
    • Model-based exploration