# CSE510 Deep Reinforcement Learning (Lecture 21)

## Exploration in RL

### Information state search

Uncertainty about state transitions or dynamics.

Dynamics prediction error, or information gain for dynamics learning.

#### Computational Curiosity

- "The direct goal of curiosity and boredom is to improve the world model."
|
||||
- "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
|
||||
- There is positive reinforcement whenever the system fails to correctly predict the environment.
|
||||
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation. (planning to make your (internal) world model to fail)
|
||||
|
||||
#### Reward Prediction Error

- Add exploration reward bonuses that encourage policies to visit states where the prediction model fails (sketched below).

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
$$

- where $r(s,a,s')$ is the extrinsic reward, $T(s,a;\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (the intrinsic reward bonus).
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.

[link to the paper](https://arxiv.org/pdf/1507.08750)
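
As a concrete illustration, here is a minimal PyTorch sketch of this bonus. The network sizes, the bonus scale `beta`, and the name `DynamicsModel` are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Learned dynamics T(s, a; theta): predicts the next state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def augmented_reward(r, s, a, s_next, model, beta=0.1):
    """R(s,a,s') = r(s,a,s') + beta * ||T(s,a;theta) - s'||, with the bonus detached from gradients."""
    with torch.no_grad():
        prediction_error = (model(s, a) - s_next).norm(dim=-1)
    return r + beta * prediction_error
```

In practice the same prediction error (without `no_grad`) serves as the training loss for the dynamics model, which is exactly what makes the bonus non-stationary.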

<details>
<summary>Example</summary>

Learning Visual Dynamics

- Exploration reward bonuses: $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$
- However, a trivial solution exists: the agent can collect this reward just by moving around randomly.

---

- Exploration reward bonuses with autoencoders: $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$
- But this suffers from the problem that the autoencoder reconstruction loss has little to do with our task (see the sketch below).
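
A compact sketch of that latent-space bonus, under the same illustrative assumptions (an MLP encoder $E$ and a latent forward model $T$; with an autoencoder, the encoder would additionally be trained with a reconstruction loss):

```python
import torch
import torch.nn as nn

class LatentDynamicsBonus(nn.Module):
    """B(s,a,s') = ||T(E(s), a) - E(s')||: predict the embedding of the next state."""
    def __init__(self, obs_dim, action_dim, latent_dim=32, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))
        self.forward_model = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))

    def bonus(self, s, a, s_next):
        # Exploration bonus only; gradients are not propagated into the policy's reward.
        with torch.no_grad():
            z, z_next = self.encoder(s), self.encoder(s_next)
            z_pred = self.forward_model(torch.cat([z, a], dim=-1))
        return (z_pred - z_next).norm(dim=-1)
```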

</details>

#### Task Rewards vs. Exploration Rewards

Exploration reward bonuses:

$$
\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|
$$

Only task rewards:

$$
R(s,a,s') = r(s,a,s')
$$

Task + curiosity rewards:

$$
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Sparse task + curiosity rewards:

$$
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Only curiosity rewards:

$$
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
$$
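
The variants above differ only in which terms are kept. A small sketch of how they might be assembled, assuming a `bonus_fn` such as `LatentDynamicsBonus.bonus` from the example and an environment-provided extrinsic reward `r` (the mode names are illustrative, not the lecture's):

```python
def shaped_reward(r, s, a, s_next, bonus_fn, mode="task+curiosity"):
    """Combine the extrinsic reward r(s,a,s') with a curiosity bonus B(s,a,s')."""
    if mode == "task_only":          # R = r(s,a,s')
        return r
    b = bonus_fn(s, a, s_next)
    if mode == "task+curiosity":     # R^t = r(s,a,s') + B^t(s,a,s'); same form when r^t is sparse
        return r + b
    if mode == "curiosity_only":     # R^c = B^c(s,a,s')
        return b
    raise ValueError(f"unknown mode: {mode}")
```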

#### Intrinsic Reward RL is not New

- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS’05, pp. 547–554 (2006)
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN’91, vol. 2, pp. 1458–1463 (1991)
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–247 (2010)
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS’04 (2004)
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN’95 (1995)
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708

#### Limitation of Prediction Errors

- The agent will be rewarded even when the model cannot improve.
- So it will focus on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem.
- The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.

#### Random Network Distillation

Original idea: predict the output of a fixed, randomly initialized neural network on the next state, given the current state and action.

New idea: predict the output of a fixed, randomly initialized neural network on the next state, given the **next state itself**.

- The target network is a neural network with fixed, randomized weights, which is never trained.
- The predictor network is trained to predict the target network's output.

> The more you visit a state, the smaller the prediction loss (and hence the bonus) becomes.
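
A minimal PyTorch sketch of this idea; the architectures and feature dimension are illustrative assumptions, not the lecture's (or the original paper's) exact configuration.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: the bonus is the predictor's error on a fixed random target."""
    def __init__(self, obs_dim, feat_dim=64, hidden=128):
        super().__init__()
        def make_net():
            return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))
        self.target = make_net()      # fixed, randomly initialized, never trained
        self.predictor = make_net()   # trained to match the target's output
        for p in self.target.parameters():
            p.requires_grad_(False)

    def prediction_error(self, s_next):
        # Depends only on the next state itself, so stochastic dynamics cannot inflate it forever.
        with torch.no_grad():
            target_feat = self.target(s_next)
        return (self.predictor(s_next) - target_feat).pow(2).mean(dim=-1)

# Exploration bonus for the agent:     rnd.prediction_error(s_next).detach()
# Predictor loss on visited states:    rnd.prediction_error(s_next).mean()
# Frequently visited states end up with small errors and therefore small bonuses.
```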

### Posterior Sampling

Uncertainty about Q-value functions or policies.

Select actions according to the probability that they are best under the current model.

#### Exploration with Action Value Information

Count-based and curiosity-driven methods do not take the action value information into account.

*(Figure: estimated value distributions for three actions.)*

> In this case, the optimal action is action 1, but we will keep exploring action 3 because it has the highest uncertainty. It also takes a long time to distinguish actions 1 and 2, since they have similar values.

#### Exploration via Posterior Sampling of Q Functions

- Represent a posterior distribution over Q functions, instead of a point estimate:
  1. Sample a Q function from the posterior: $Q \sim P(Q)$
  2. Choose actions according to this $Q$ for one episode: $a = \arg\max_{a} Q(s,a)$
  3. Update $P(Q)$ based on the sampled $Q$ and the collected experience tuples $(s,a,r,s')$
- Then we do not need $\epsilon$-greedy for exploration! We get better exploration by representing uncertainty over $Q$.
- But how can we learn a distribution over Q functions, $P(Q)$, when the Q function is a deep neural network? (See the sketch below and the bootstrap ensemble that follows.)
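
Before the deep-network answer, here is a minimal tabular sketch of the sampling loop on a 3-armed bandit that mirrors the earlier figure. The Gaussian posteriors, arm means, and (assumed known) noise levels are made-up illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bandit: arm 0 is best, arm 1 has a similar mean, arm 2 is very noisy.
true_mean = np.array([1.0, 0.9, 0.0])
noise_std = np.array([0.1, 0.1, 2.0])

# Gaussian posterior over each Q(a), starting from a broad prior.
mu, var = np.zeros(3), np.full(3, 10.0)
counts = np.zeros(3, dtype=int)

for t in range(2000):
    q_sample = rng.normal(mu, np.sqrt(var))       # 1. sample Q ~ P(Q)
    a = int(np.argmax(q_sample))                  # 2. act greedily w.r.t. the sample
    r = rng.normal(true_mean[a], noise_std[a])    # observe a reward
    counts[a] += 1
    # 3. conjugate Gaussian update of P(Q(a)) given the observation
    obs_var = noise_std[a] ** 2
    new_var = 1.0 / (1.0 / var[a] + 1.0 / obs_var)
    mu[a] = new_var * (mu[a] / var[a] + r / obs_var)
    var[a] = new_var

print("pull counts:", counts, "posterior means:", mu.round(2))
```

Pulls concentrate on the best arm without any $\epsilon$-greedy dithering, because actions are chosen exactly as often as the posterior believes they might be optimal.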

#### Bootstrap Ensemble

- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data.
  - Computationally expensive.
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data (see the sketch below).
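
A sketch of the shared-backbone variant (a bootstrapped-DQN-style head arrangement; the sizes and the head-sampling scheme are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Shared backbone with K Q-heads; each head is trained on its own bootstrap sample of the data."""
    def __init__(self, state_dim, n_actions, n_heads=10, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, s, head=None):
        z = self.backbone(s)
        if head is None:                                           # all heads at once, e.g. for training
            return torch.stack([h(z) for h in self.heads], dim=1)  # shape (batch, K, n_actions)
        return self.heads[head](z)                                 # one sampled head, e.g. for acting

# "Posterior sample" per episode: pick one head at random and act greedily with it.
# k = torch.randint(len(qnet.heads), (1,)).item()
# action = qnet(state, head=k).argmax(dim=-1)
```

Training each head only on its own bootstrap mask of the replay data is what keeps the heads diverse enough to stand in for samples from $P(Q)$.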

### Questions

- Why do PG methods implicitly support exploration?
  - Is it sufficient? How can we improve their implicit exploration?
  - What are the limitations of entropy regularization?
- How can we improve exploration for PG methods?
  - Intrinsic-motivation bonuses (e.g., RND)
  - Explicitly optimize per-state entropy in the return (e.g., SAC)
  - Hierarchical RL
  - Goal-conditioned RL
- What are potentially more effective exploration methods?
  - Knowledge-driven exploration
  - Model-based exploration