# CSE510 Deep Reinforcement Learning (Lecture 21)

## Exploration in RL

### Information state search

Uncertainty about state transitions or dynamics: explore using the dynamics prediction error, or the information gain for dynamics learning.

#### Computational Curiosity

- "The direct goal of curiosity and boredom is to improve the world model."
- "Curiosity unit": the reward is a function of the mismatch between the model's current predictions and actuality.
- There is positive reinforcement whenever the system fails to correctly predict the environment.
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation (planning to make your internal world model fail).

#### Reward Prediction Error

- Add an exploration reward bonus that encourages policies to visit states that will cause the prediction model to fail:

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
$$

- where $r(s,a,s')$ is the extrinsic reward, $T(s,a;\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (intrinsic reward bonus).
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.

[Link to the paper](https://arxiv.org/pdf/1507.08750)
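As a concrete illustration, here is a minimal PyTorch sketch of this bonus, assuming a small learned forward model $T(s,a;\theta)$ over continuous state and action vectors; the class and function names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Learned dynamics model T(s, a; theta): predicts the next state from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def curiosity_reward(model, optimizer, s, a, r, s_next, beta=0.1):
    """Augmented reward R = r + beta * ||T(s, a; theta) - s'||."""
    pred = model(s, a)
    bonus = (pred - s_next).norm(dim=-1)   # prediction error = intrinsic bonus
    loss = bonus.pow(2).mean()             # training the model is what makes the bonus
    optimizer.zero_grad()                  # non-stationary: novel states become predictable later
    loss.backward()
    optimizer.step()
    return r + beta * bonus.detach()       # augmented reward handed to the RL algorithm
```

For discrete actions one would one-hot encode `a`; for image observations, predicting raw pixels is hard, which motivates the encoded-feature variant in the next subsection.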
#### Example: Learning Visual Dynamics

- Exploration reward bonus: $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$
- However, a trivial solution exists: the agent can collect reward by just moving around randomly.

---

- Exploration reward bonus with autoencoders: $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$, where $E$ encodes a state into a latent feature vector.
- But this suffers from the problems of the autoencoder reconstruction loss, which captures features that may have little to do with our task.

#### Task Rewards vs. Exploration Rewards

Exploration reward bonuses:

$$
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
$$

Only task rewards:

$$
R(s,a,s') = r(s,a,s')
$$

Task + curiosity rewards:

$$
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Sparse task + curiosity rewards:

$$
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Only curiosity rewards:

$$
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
$$

#### Intrinsic reward RL is not New

- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS'05, pp. 547–554 (2006)
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN'91, vol. 2, pp. 1458–1463 (1991)
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–247 (2010)
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS'04 (2004)
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN'95 (1995)
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708

#### Limitation of Prediction Errors

- The agent is rewarded even when the model cannot improve.
- So it will focus on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem
  - The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.

#### Random Network Distillation

Original idea: predict the output of a fixed and randomly initialized neural network on the next state, given the current state and action.

New idea: predict the output of a fixed and randomly initialized neural network on the next state, given the **next state itself**.

- The target network is a neural network with fixed, random weights, which is never trained.
- The prediction network is trained to predict the target network's output.

> The more often you visit a state, the smaller the prediction loss becomes.

### Posterior Sampling

Uncertainty about Q-value functions or policies: select actions according to the probability that they are best under the current model.

#### Exploration with Action Value Information

Count-based and curiosity-driven methods do not take the action-value information into account.

![Action Value Information](https://notenextra.trance-0.com/CSE510/Action_Value_Information.png)

> In this case the optimal choice is action 1, but we will keep exploring action 3 because it has the highest uncertainty, and it takes a long time to distinguish actions 1 and 2 since their values are similar.

#### Exploration via Posterior Sampling of Q Functions

- Represent a posterior distribution over Q functions instead of a point estimate.

1. Sample a Q function from the posterior: $Q \sim P(Q)$.
2. Choose actions according to this $Q$ for one episode: $a = \arg\max_{a} Q(s,a)$.
3. Update $P(Q)$ based on the sampled $Q$ and the collected experience tuples $(s,a,r,s')$.

- Then we do not need $\epsilon$-greedy for exploration! A minimal tabular sketch follows below.
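A minimal tabular sketch of steps 1–3, keeping an independent Gaussian posterior over each $Q(s,a)$; the Gaussian form, the update rule, and all names are illustrative assumptions, not prescribed by the lecture.

```python
import numpy as np

n_states, n_actions = 10, 4
mu = np.zeros((n_states, n_actions))      # posterior mean of Q(s, a)
var = np.ones((n_states, n_actions))      # posterior variance: large where we are uncertain
counts = np.zeros((n_states, n_actions))  # visit counts

def sample_q():
    """Step 1: draw one Q function from the posterior P(Q)."""
    return np.random.normal(mu, np.sqrt(var))

def act(q_sample, s):
    """Step 2: act greedily w.r.t. the sampled Q for the whole episode (no epsilon-greedy)."""
    return int(np.argmax(q_sample[s]))

def update(s, a, td_target):
    """Step 3: shrink the posterior for (s, a) toward the observed target."""
    counts[s, a] += 1
    lr = 1.0 / counts[s, a]
    mu[s, a] += lr * (td_target - mu[s, a])
    var[s, a] = 1.0 / (1.0 + counts[s, a])  # uncertainty decreases with more visits
```

Because one sampled Q is held fixed for a whole episode, exploration is temporally consistent, unlike the per-step dithering of $\epsilon$-greedy.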
- Better exploration by representing uncertainty over Q.
- But how can we learn a distribution of Q functions $P(Q)$ if the Q function is a deep neural network?

#### Bootstrap Ensemble

- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data.
  - Computationally expensive.
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data (see the sketch after the questions below).

### Questions

- Why do PG methods implicitly support exploration?
  - Is it sufficient? How can we improve its implicit exploration?
  - What are the limitations of entropy regularization?
- How can we improve exploration for PG methods?
  - Intrinsic-motivation bonuses (e.g., RND)
  - Explicitly optimize per-state entropy in the return (e.g., SAC)
  - Hierarchical RL
  - Goal-conditioned RL
- What are potentially more effective exploration methods?
  - Knowledge-driven
  - Model-based exploration
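A minimal PyTorch sketch of the shared-backbone bootstrap ensemble mentioned above; the layer sizes, head count, and the Bernoulli masking noted in the comment are illustrative assumptions, not details given in the lecture.

```python
import torch
import torch.nn as nn

class BootstrappedQ(nn.Module):
    """Shared backbone with K Q-value heads; each head is trained on its own bootstrap of the data."""
    def __init__(self, state_dim, n_actions, n_heads=10, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, s):
        z = self.backbone(s)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (batch, K, n_actions)

# Posterior-sampling-style acting: pick one head per episode and act greedily with it.
q_net = BootstrappedQ(state_dim=8, n_actions=4)
k = torch.randint(len(q_net.heads), (1,)).item()  # sampled head index for this episode
s = torch.zeros(1, 8)                             # placeholder state
action = q_net(s)[0, k].argmax().item()
# During training, each transition is assigned (e.g., via a Bernoulli mask) to a random
# subset of heads, so the heads disagree where data is scarce -- that disagreement drives exploration.
```

Sharing the backbone amortizes most of the computation, so adding heads is far cheaper than training K separate networks, while the independently initialized heads still give a useful approximation to $P(Q)$.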