CSE510 Deep Reinforcement Learning (Lecture 21)
Exploration in RL
Information state search
Uncertainty about state transitions or dynamics
Dynamics prediction error or Information gain for dynamics learning
Computational Curiosity
- "The direct goal of curiosity and boredom is to improve the world model."
- "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
- There is positive reinforcement whenever the system fails to correctly predict the environment.
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation (planning to make your (internal) world model fail).
Reward Prediction Error
- Add exploration reward bonuses that encourage policies to visit states that will cause the prediction model to fail.
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
- where r(s,a,s') is the extrinsic reward, T(s,a;\theta) is the predicted next state, and \mathcal{B} is a bonus function (intrinsic reward bonus).
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.
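A minimal sketch of this augmented reward, assuming a small PyTorch forward model for T(s,a;\theta) and an illustrative bonus_scale coefficient (both the architecture and the coefficient are assumptions, not part of the lecture):

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Learned dynamics model T(s, a; theta): predicts the next state from (s, a)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def augmented_reward(r_ext, s, a, s_next, model, bonus_scale=0.1):
    """R(s,a,s') = r(s,a,s') + B(||T(s,a;theta) - s'||), with B a simple scaling here."""
    with torch.no_grad():
        pred_error = (model(s, a) - s_next).norm(dim=-1)  # ||T(s,a;theta) - s'||
    return r_ext + bonus_scale * pred_error
```

The same prediction error also serves as the forward model's training loss, which is exactly why the bonus is non-stationary: once-surprising states stop paying out as the model improves.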
Example
Learning Visual Dynamics
- Exploration reward bonuses:
\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|
- However, a trivial solution exists: the agent could collect this reward just by moving around randomly.
- Exploration reward bonuses with autoencoders:
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
- But this suffers from the problems of the autoencoder reconstruction loss, which has little to do with our task.
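A sketch of this latent-space variant, assuming an encoder E (e.g., the encoder half of an autoencoder) and a latent forward model; the module names are illustrative:

```python
import torch

def latent_bonus(s, a, s_next, encoder, latent_model):
    """B(s,a,s') = ||T(E(s;theta), a; theta) - E(s';theta)|| in the autoencoder's latent space."""
    with torch.no_grad():
        z, z_next = encoder(s), encoder(s_next)                  # E(s), E(s')
        pred_error = (latent_model(z, a) - z_next).norm(dim=-1)  # error in latent space
    return pred_error
```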
Task Rewards vs. Exploration Rewards
Exploration reward bonuses:
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
Only task rewards:
R(s,a,s') = r(s,a,s')
Task+curiosity rewards:
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
Sparse task + curiosity rewards:
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
Only curiosity rewards:
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
Intrinsic Reward RL Is Not New
- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS’05. pp. 547–554 (2006)
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN’91. vol. 2, pp. 1458–1463 (1991)
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–247 (2010)
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS’04 (2004)
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN’95 (1995)
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708
Limitation of Prediction Errors
- The agent will still be rewarded even when the model cannot improve.
- So it will focus on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem
- The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.
Random Network Distillation
Original idea: Predicting the output of a fixed and randomly initialized neural network on the next state, given the current state and action.
New idea: Predicting the output of a fixed and randomly initialized neural network on the next state, given the next state itself.
- The target network is a neural network with fixed, randomized weights, which is never trained.
- The prediction network is trained to predict the target network's output.
- The more often a state is visited, the smaller the prediction loss, and hence the smaller the exploration bonus.
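A minimal RND sketch, assuming a fixed random target network and a same-shaped trainable predictor; the architecture and sizes below are illustrative, not those from the paper:

```python
import torch
import torch.nn as nn

def make_net(state_dim, feat_dim=64):
    return nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

state_dim = 8                                      # illustrative state dimension
target = make_net(state_dim)                       # fixed, randomly initialized, never trained
for p in target.parameters():
    p.requires_grad_(False)
predictor = make_net(state_dim)                    # trained to match the target's output
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus(s_next):
    """Exploration bonus = predictor's error on the *next state*; also take one training step."""
    with torch.no_grad():
        y = target(s_next)
    error = ((predictor(s_next) - y) ** 2).mean(dim=-1)   # per-sample prediction error
    optimizer.zero_grad()
    error.mean().backward()
    optimizer.step()
    return error.detach()   # shrinks for frequently visited states, stays high for novel ones
```

Because the target is deterministic given the input, there is nothing inherently unpredictable to latch onto, which is how RND sidesteps the noisy-TV problem above.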
Posterior Sampling
Uncertainty about Q-value functions or policies
Select actions according to the probability that they are optimal under the current model.
Exploration with Action Value Information
Count-based and curiosity-driven methods do not take action-value information into account.
In the example, the optimal action is action 1, but these methods will keep exploring action 3 because it has the highest uncertainty, and it takes a long time to distinguish actions 1 and 2 since their values are similar.
Exploration via Posterior Sampling of Q Functions
- Represent a posterior distribution of Q functions, instead of a point estimate.
- Sample from P(Q): Q \sim P(Q)
- Choose actions according to this Q for one episode: a=\arg\max_{a} Q(s,a)
- Update P(Q) based on the sampled Q and the collected experience tuples (s,a,r,s')
- Repeat: sample a new Q from the updated P(Q) (see the sketch after this list)
- Then we do not need $\epsilon$-greedy for exploration! Better exploration by representing uncertainty over Q.
- But how can we learn a distribution of Q functions P(Q) if the Q function is a deep neural network?
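In sketch form, the loop above looks like the following; `posterior.sample()` and `posterior.update()` are placeholders for however P(Q) is represented (the bootstrapped ensemble in the next section is one option), and the environment interface is simplified:

```python
def posterior_sampling_episode(env, posterior):
    """Thompson sampling over Q: commit to one sampled Q for a whole episode."""
    Q = posterior.sample()                    # Q ~ P(Q)
    s, done, transitions = env.reset(), False, []
    while not done:
        a = int(Q(s).argmax())                # a = argmax_a Q(s, a), greedy w.r.t. the sampled Q
        s_next, r, done = env.step(a)         # simplified step() signature
        transitions.append((s, a, r, s_next))
        s = s_next
    posterior.update(transitions)             # update P(Q) from the (s, a, r, s') tuples
```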
Bootstrap Ensemble
- Neural network ensembles: train multiple Q-function approximators, each on a different subset of the data
- Computationally expensive
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data
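A sketch of the shared-backbone variant in PyTorch; the class and argument names are illustrative. Each head is one Q estimate, so picking a head at random per episode acts as an approximate sample from P(Q):

```python
import torch
import torch.nn as nn

class BootstrappedQ(nn.Module):
    """Shared backbone with K independent Q heads (approximate posterior over Q)."""
    def __init__(self, state_dim, n_actions, n_heads=10, hidden=256):
        super().__init__()
        # Backbone is shared across heads, so most computation and data are reused.
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # Heads are trained on different bootstrap subsets of the replay data.
        self.heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, s, head_idx=None):
        z = self.backbone(s)
        if head_idx is not None:
            return self.heads[head_idx](z)              # Q used for this episode's actions
        return torch.stack([h(z) for h in self.heads])  # all heads, e.g. for masked TD updates
```

At the start of each episode a head index is drawn uniformly and followed greedily; during training, a Bernoulli bootstrap mask decides which heads receive gradients from each transition.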
Questions
- Why do PG methods implicitly support exploration?
- Is it sufficient? How can we improve its implicit exploration?
- What are limitations of entropy regularization?
- How can we improve exploration for PG methods?
- Intrinsic-motivated bonuses (e.g., RND)
- Explicitly optimize per-state entropy in the return (e.g., SAC)
- Hierarchical RL
- Goal-conditional RL
- What are potentially more effective exploration methods?
- Knowledge-driven
- Model-based exploration
