# CSE510 Deep Reinforcement Learning (Lecture 21)
## Exploration in RL

> Due to a lapse of my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.

### Information State Search (Intrinsic Curiosity)

- Targets uncertainty about state transitions or dynamics.
- Exploration signal: dynamics prediction error, or information gain for dynamics learning.

#### Computational Curiosity
- "The direct goal of curiosity and boredom is to improve the world model."
|
||||
- "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
|
||||
- There is positive reinforcement whenever the system fails to correctly predict the environment.
|
||||
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation. (planning to make your (internal) world model to fail)
|
||||
- Curiosity encourages agents to seek experiences that better predict or explain the environment.
|
||||
- A "curiosity unit" gives reward based on the mismatch between current model predictions and actual outcomes.
|
||||
- Intrinsic reward is high when the agent's prediction fails, that is, when it encounters surprising outcomes.
|
||||
- This yields positive intrinsic reinforcement when the internal predictive model errs, causing the agent to repeat actions that lead to prediction errors.
|
||||
- The agent is effectively motivated to create situations where its model fails.
|
||||
|
||||

#### Model Prediction Error as Intrinsic Reward

- Add an exploration reward bonus that encourages the policy to visit states where the prediction model fails.
- Concretely, augment the extrinsic reward with an intrinsic bonus based on the model prediction error:

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
$$

- where $r(s,a,s')$ is the extrinsic reward, $T(s,a;\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (intrinsic reward bonus).
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.
- [Link to the paper](https://arxiv.org/pdf/1507.08750)

Parameter explanations:

- $s$: current state of the agent.
- $a$: action taken by the agent in state $s$.
- $s'$: next state resulting from executing action $a$ in state $s$.
- $r(s, a, s')$: extrinsic environment reward for the transition $(s, a, s')$.
- $T(s, a; \theta)$: learned dynamics model with parameters $\theta$ that predicts the next state.
- $\theta$: parameter vector of the predictive dynamics model $T$.
- $\|T(s, a; \theta) - s'\|$: prediction error magnitude between the predicted and actual next state.
- $\mathcal{B}(\cdot)$: function converting the prediction error magnitude into an intrinsic reward bonus.
- $R(s, a, s')$: total reward, the sum of the extrinsic reward and the intrinsic curiosity bonus.
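
As a concrete illustration of this formula, here is a minimal PyTorch sketch; the network architecture, the linear bonus $\mathcal{B}(e) = \beta e$, and the coefficient `beta` are illustrative assumptions, not choices taken from the paper.

```python
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """Forward model T(s, a; theta): predicts the next state from (state, action)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))


def total_reward(r_ext, s, a, s_next, model, beta: float = 0.1):
    """R(s,a,s') = r(s,a,s') + B(||T(s,a;theta) - s'||), with the assumed bonus B(e) = beta * e."""
    with torch.no_grad():
        pred_error = torch.norm(model(s, a) - s_next, dim=-1)  # ||T(s,a;theta) - s'||
    return r_ext + beta * pred_error

# The dynamics model itself is trained on observed transitions by minimizing
# the same prediction error, e.g.:
#   loss = ((model(s, a) - s_next) ** 2).mean(); loss.backward(); optimizer.step()
```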

<details>
<summary>Example: Learning Visual Dynamics</summary>

Key ideas:

- The agent receives an intrinsic reward $\mathcal{B}(\|T(s, a; \theta) - s'\|)$ when the actual outcome differs from what its world model predicts.
- Initially many transitions are surprising, encouraging broad exploration.
- As the model improves, familiar transitions yield smaller errors and smaller intrinsic rewards.
- Exploration becomes focused on less-known parts of the state space.
- Intrinsic motivation is non-stationary: as the agent learns, previously novel states lose their intrinsic reward.

</details>

#### Avoiding Trivial Curiosity Traps

- Exploration reward bonuses of the form $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$ admit a trivial solution: the agent can collect reward by just moving around randomly.

- Naively defining $\mathcal{B}(s, a, s')$ directly in raw observation space can lead to trivial curiosity traps.

Examples:

- The agent may purposely cause chaotic or noisy observations (like flickering pixels) that are impossible to predict.
- The model cannot reduce prediction error on pure noise, so the agent is rewarded for meaningless randomness.
- This yields high intrinsic reward without meaningful learning or progress toward the task goals.

One remedy is an exploration reward bonus computed with an autoencoder, $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$, but this suffers from the problem that the autoencoding reconstruction loss has little to do with our task.

To prevent this, we instead restrict prediction to a more informative, learned feature space ([link to paper](https://ar5iv.labs.arxiv.org/html/1705.05363#:~:text=reward%20signal%20based%20on%20how,this%20feature%20space%20using%20self)):

$$
\mathcal{B}(s, a, s') = \|T(E(s; \phi), a; \theta) - E(s'; \phi)\|
$$

Parameter explanations:

- $E(s; \phi)$: learned encoder mapping the raw state $s$ into a feature vector.
- $\phi$: parameter vector of the encoder $E$.
- $T(E(s; \phi), a; \theta)$: forward model predicting the next feature representation from the encoded state and action.
- $E(s'; \phi)$: encoded feature representation of the next state $s'$.
- $\mathcal{B}(s, a, s')$: intrinsic reward based on the prediction error in feature space.

Key ideas:

- The encoder $E(s; \phi)$ is trained so that the features capture aspects of the state that are controllable by the agent.
- One approach is to train $E$ via an inverse dynamics model that predicts $a$ from $(s, s')$.
- This encourages $E$ to keep only the information necessary to infer actions, discarding irrelevant noise.
- Measuring prediction error in feature space ignores unpredictable environmental noise.
- The intrinsic reward focuses on errors due to lack of knowledge about controllable dynamics.
- The agent's curiosity is directed toward aspects of the environment it can influence and learn.

A practical implementation is the Intrinsic Curiosity Module (ICM) by Pathak et al. (2017); see the sketch after this list:

- The encoder $E$ and forward model $T$ are trained jointly.
- The loss includes both the forward prediction error and the inverse dynamics error.
- The intrinsic reward is set to the forward prediction error in feature space.
- This drives exploration of states where the agent cannot yet predict the effect of its actions.
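
Below is a minimal ICM-style sketch in PyTorch. The layer sizes, the loss weight `beta`, and the bonus scale `eta` are assumed values, and the code is a simplification of the design described above rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ICM(nn.Module):
    """Encoder E(.; phi), inverse model (predicts a), and forward model T(.; theta) in feature space."""

    def __init__(self, obs_dim: int, n_actions: int, feat_dim: int = 64):
        super().__init__()
        self.n_actions = n_actions
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        self.inverse = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_actions))        # predicts the action taken
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 128), nn.ReLU(),
                                           nn.Linear(128, feat_dim))   # predicts next features

    def forward(self, s, a, s_next):
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_onehot = F.one_hot(a, num_classes=self.n_actions).float()
        # Inverse dynamics: shapes the features to keep only controllable information.
        inv_logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
        # Forward prediction in feature space; its error is the curiosity signal.
        phi_next_pred = self.forward_model(torch.cat([phi.detach(), a_onehot], dim=-1))
        fwd_error = 0.5 * ((phi_next_pred - phi_next.detach()) ** 2).sum(dim=-1)
        return inv_logits, fwd_error


def icm_update_terms(icm, s, a, s_next, eta: float = 0.01, beta: float = 0.2):
    """Returns the ICM training loss and the intrinsic reward B(s, a, s')."""
    inv_logits, fwd_error = icm(s, a, s_next)
    inverse_loss = F.cross_entropy(inv_logits, a)
    icm_loss = (1 - beta) * inverse_loss + beta * fwd_error.mean()
    intrinsic_reward = eta * fwd_error.detach()   # added to the extrinsic reward
    return icm_loss, intrinsic_reward
```

In this sketch the forward-prediction error is detached before being used as the intrinsic reward, so the policy gradient does not flow back into the curiosity module.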

#### Task Rewards vs. Exploration Rewards

Exploration reward bonuses:

$$
\mathcal{B}(s, a, s') = \|T(E(s; \phi), a; \theta) - E(s'; \phi)\|
$$

Only task rewards:

$$
R(s,a,s') = r(s,a,s')
$$

Task + curiosity rewards:

$$
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Sparse task + curiosity rewards:

$$
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Only curiosity rewards:

$$
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
$$

#### Intrinsic Reward RL Is Not New

- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS'05, pp. 547–554 (2006)
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN'91, vol. 2, pp. 1458–1463 (1991)
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Trans. on Autonomous Mental Development 2(3), 230–247 (2010)
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS'04 (2004)
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN'95 (1995)
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708

#### Limitation of Prediction Errors

- The agent keeps being rewarded even when the model cannot improve.
- It therefore focuses on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem.
- The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.

#### Random Network Distillation (RND)

Random Network Distillation (RND) provides a simpler curiosity bonus that does not require learning a dynamics model.

- Original idea: predict the output of a fixed, randomly initialized neural network on the next state, given the current state and action.
- New idea: predict the output of a fixed, randomly initialized neural network on the next state, given the **next state itself**.

Basic idea:

- Use a fixed random neural network $f_{\text{target}}$ that maps states to feature vectors.
- Train a predictor network $f_{\text{pred}}$ to approximate $f_{\text{target}}$ on visited states.
- The intrinsic reward is the prediction error between $f_{\text{pred}}(s)$ and $f_{\text{target}}(s)$.

Typical form of the intrinsic reward:

$$
r^{\text{int}}(s) = \|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}
$$

Parameter explanations:

- $f_{\text{target}}$: fixed random neural network generating target features for each state.
- $f_{\text{pred}}(s; \psi)$: trainable predictor network with parameters $\psi$.
- $\psi$: parameter vector of the predictor network.
- $s$: state input to both networks.
- $\|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}$: squared error between predictor and target features.
- $r^{\text{int}}(s)$: intrinsic reward based on the prediction error in the random feature space.
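
A minimal RND sketch in PyTorch that matches the formula above; the architecture and sizes are placeholder assumptions, and a practical implementation would additionally normalize observations and intrinsic rewards.

```python
import torch
import torch.nn as nn


def make_net(obs_dim: int, feat_dim: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, feat_dim))


class RND(nn.Module):
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        self.target = make_net(obs_dim, feat_dim)     # f_target: fixed, random
        self.predictor = make_net(obs_dim, feat_dim)  # f_pred(.; psi): trained
        for p in self.target.parameters():
            p.requires_grad_(False)                   # the target is never updated

    def intrinsic_reward(self, s: torch.Tensor) -> torch.Tensor:
        """r_int(s) = ||f_pred(s; psi) - f_target(s)||^2"""
        with torch.no_grad():
            target_feat = self.target(s)
        pred_feat = self.predictor(s)
        return ((pred_feat - target_feat) ** 2).sum(dim=-1)


rnd = RND(obs_dim=8)
optimizer = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)

s_batch = torch.randn(32, 8)            # a batch of visited states
r_int = rnd.intrinsic_reward(s_batch)   # use r_int.detach() as the exploration bonus
loss = r_int.mean()                     # training the predictor on visited states
optimizer.zero_grad(); loss.backward(); optimizer.step()
```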

- The target network has fixed, randomized weights and is never trained; the prediction network is trained to predict the target network's output on visited states.

> The more you visit a state, the smaller this prediction loss becomes.

Key properties:

- For novel or rarely visited states, $f_{\text{pred}}$ has not yet learned to match $f_{\text{target}}$, so the error is high.
- For frequently visited states, the prediction error becomes small and the intrinsic reward decays.
- The target network is random and fixed, so it does not adapt to the policy.
- This provides a stable novelty signal without explicit dynamics learning.
- RND achieves strong exploration performance in challenging environments, such as hard-exploration Atari games.

#### Efficacy of Curiosity-Driven Exploration

Empirical observations:

- Curiosity-driven intrinsic rewards often lead to significantly higher extrinsic returns in sparse-reward environments than training on extrinsic rewards alone.
- Intrinsic rewards act as a proxy objective that guides the agent toward interesting or informative regions of the state space.
- In some experiments, agents trained with only intrinsic rewards (no extrinsic reward during training) still learn behaviors that later achieve high task scores when extrinsic rewards are measured.
- Using random features for curiosity (as in RND) can perform nearly as well as using learned features in many domains.
- Simple surprise signals are often sufficient to drive effective exploration.
- Learned feature spaces may generalize better to truly novel scenarios but are not always necessary.

Historical context:

- The concept of learning from intrinsic rewards alone is not new.
- Itti and Baldi (2005) studied "Bayesian surprise" as a driver of human attention.
- Schmidhuber (1991, 2010) formalized curiosity, creativity, and fun as intrinsic motivations in learning agents.
- Singh et al. (2004) proposed intrinsically motivated reinforcement learning frameworks.
- These early works laid the conceptual foundation for modern curiosity-driven deep RL methods.

For further reading on intrinsic curiosity methods:

- Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction", 2017.
- Burda et al., "Exploration by Random Network Distillation", 2018.
- Schmidhuber, "Formal Theory of Creativity, Fun, and Intrinsic Motivation", 2010.

### Posterior Sampling

- Targets uncertainty about Q-value functions or policies.
- Idea: select actions according to the probability that they are best under the current model.

While optimism and curiosity bonuses modify the reward function, posterior sampling approaches handle exploration by maintaining uncertainty over models or value functions and sampling from this uncertainty. These methods are rooted in Thompson Sampling and naturally balance exploration and exploitation.

#### Exploration with Action Value Information

Count-based and curiosity-driven methods do not take the action-value information into account.



> In this case, the optimal action is action 1, but we will explore action 3 because it has the highest uncertainty. It also takes a long time to distinguish actions 1 and 2, since they have similar values.

#### Exploration via Posterior Sampling of Q Functions

- Represent a posterior distribution over Q functions, $P(Q)$, instead of a point estimate:

1. Sample a Q function from the posterior, $Q \sim P(Q)$.
2. Choose actions according to this $Q$ for one episode: $a = \arg\max_{a} Q(s,a)$.
3. Update $P(Q)$ based on the sampled $Q$ and the collected experience tuples $(s,a,r,s')$.

- Then we do not need $\epsilon$-greedy for exploration: representing uncertainty over Q yields better exploration.
- But how can we learn a distribution over Q functions, $P(Q)$, when the Q function is a deep neural network?

#### Bootstrap Ensemble

- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data.
  - This is computationally expensive.
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data.

#### Posterior Sampling in Multi-Armed Bandits (Thompson Sampling)

In a multi-armed bandit problem (no state transitions), Thompson Sampling works as follows:

1. Maintain a prior and posterior distribution over the reward parameters of each arm.
2. At each time step, sample reward parameters for all arms from their current posteriors.
3. Select the arm with the highest sampled mean reward.
4. Observe the reward, update the posterior, and repeat.
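
A minimal sketch of these four steps for Bernoulli-reward arms with Beta posteriors; the arm means and the number of rounds are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])       # unknown to the agent (toy environment)
n_arms = len(true_means)

# Step 1: Beta(1, 1) priors over each arm's success probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(10_000):
    theta = rng.beta(alpha, beta)            # Step 2: sample parameters from the posteriors
    arm = int(np.argmax(theta))              # Step 3: pick the arm with the highest sample
    reward = rng.random() < true_means[arm]  # Step 4: observe a Bernoulli reward...
    alpha[arm] += reward                     # ...and update the played arm's posterior
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))                # posterior means concentrate near true_means
```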

Intuition:

- Each action is selected with probability equal to the posterior probability that it is optimal.
- Arms with high uncertainty are more likely to be sampled as optimal in some posterior draws.
- Exploration arises naturally from uncertainty, without explicit epsilon-greedy noise or bonus terms.
- Over time, the posterior concentrates on the true reward means, and the algorithm shifts toward exploitation.

Theoretical properties:

- Thompson Sampling attains near-optimal regret bounds in many bandit settings.
- It often performs as well as or better than upper confidence bound algorithms in practice.

#### Posterior Sampling for Reinforcement Learning (PSRL)

In reinforcement learning with states and transitions, posterior sampling generalizes to sampling entire MDP models.

Posterior Sampling for Reinforcement Learning (PSRL) operates as follows:

1. Maintain a posterior distribution over environment dynamics and rewards, based on observed transitions.
2. At the beginning of an episode, sample an MDP model from this posterior.
3. Compute the optimal policy for the sampled MDP (for example, by value iteration).
4. Execute this policy in the real environment for the whole episode.
5. Use the observed transitions to update the posterior, then repeat.
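
A tabular sketch of this loop, assuming a small finite MDP, Dirichlet posteriors over the transition probabilities, and (for brevity) a known reward table; the toy environment and all sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, horizon = 5, 2, 0.95, 20

# Posterior over dynamics: Dirichlet counts per (s, a); rewards assumed known here for brevity.
dirichlet_counts = np.ones((n_states, n_actions, n_states))
R = rng.random((n_states, n_actions))       # stand-in known reward table (assumption)


def value_iteration(P, R, gamma, iters=200):
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * P @ V               # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                 # greedy policy of the sampled MDP


def true_step(s, a):                        # placeholder real environment, unknown to the agent
    return int(rng.integers(n_states))


for episode in range(100):
    # Sample one MDP from the posterior, plan in it, then run its policy for a whole episode.
    P = np.array([[rng.dirichlet(dirichlet_counts[s, a]) for a in range(n_actions)]
                  for s in range(n_states)])
    policy = value_iteration(P, R, gamma)
    s = 0
    for t in range(horizon):
        a = policy[s]
        s_next = true_step(s, a)
        dirichlet_counts[s, a, s_next] += 1  # posterior update from the observed transition
        s = s_next
```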

Key advantages:

- The agent commits to a sampled model's policy for an extended duration, which induces deep exploration.
- If a sampled model is optimistic in unexplored regions, the corresponding policy will deliberately visit those regions.
- Exploration is coherent across time within an episode, unlike the per-step randomization of epsilon-greedy.
- The method does not require ad hoc exploration bonuses; exploration is an emergent property of the posterior.

Challenges:

- Maintaining an exact posterior over high-dimensional MDPs is usually intractable.
- Practical implementations use approximations.

#### Approximate Posterior Sampling with Ensembles (Bootstrapped DQN)

A common approximate posterior method in deep RL is Bootstrapped DQN.

Basic idea:

- Train an ensemble of $K$ Q-networks (heads), $Q^{(1)}, \dots, Q^{(K)}$.
- Each head is trained on a different bootstrap sample or masked subset of experience.
- At the start of each episode, sample a head index $k$ uniformly from $\{1, \dots, K\}$.
- For the entire episode, act greedily with respect to $Q^{(k)}$.

Parameter definitions for the ensemble:

- $K$: number of Q-network heads in the ensemble.
- $Q^{(k)}(s, a)$: Q-value estimate of head $k$ at the state-action pair $(s, a)$.
- $k$: index of the sampled head used for the current episode.
- $(s, a)$: state and action arguments of the Q-value functions.

Implementation details:

- A shared feature backbone network processes state inputs, feeding into all heads.
- Each head has its own final layers, allowing diverse value estimates.
- Masking or bootstrapping assigns different subsets of transitions to different heads during training.
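
A minimal sketch of the shared backbone with $K$ heads and the episode-wise head selection; the layer sizes, the number of heads, and the Bernoulli masking scheme in the final comment are assumptions.

```python
import random
import torch
import torch.nn as nn


class BootstrappedQNet(nn.Module):
    """Shared backbone with K independent Q-heads, Q^(1..K)(s, a)."""

    def __init__(self, obs_dim: int, n_actions: int, n_heads: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(128, n_actions) for _ in range(n_heads))

    def forward(self, s: torch.Tensor, head: int) -> torch.Tensor:
        return self.heads[head](self.backbone(s))


qnet = BootstrappedQNet(obs_dim=8, n_actions=4, n_heads=10)

# At the start of each episode: sample one head and act greedily with it for the whole episode.
head = random.randrange(len(qnet.heads))


def act(s: torch.Tensor) -> int:
    with torch.no_grad():
        return int(qnet(s, head).argmax(dim=-1))


# During training, each stored transition carries a Bernoulli mask per head (bootstrap):
# only heads whose mask is 1 receive a TD-loss gradient from that transition, e.g.
#   mask = torch.bernoulli(torch.full((10,), 0.5))
```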

Benefits:

- Each head approximates a different plausible Q-function, analogous to a sample from a posterior.
- When a head is optimistic about certain under-explored actions, its greedy policy will explore them deeply.
- Exploration behavior is temporally consistent within an episode.
- No modification of the reward function is required; exploration arises from policy randomization via multiple heads.

Comparison to epsilon-greedy:

- Epsilon-greedy adds per-step random actions, which can be inefficient for long-horizon exploration.
- Bootstrapped DQN commits to a strategy for an episode, enabling the agent to execute complete exploratory plans.
- This can dramatically increase the probability of discovering the long action sequences needed to reach sparse rewards.

Other approximate posterior approaches:

- Bayesian neural networks for Q-functions (explicit parameter distributions).
- Monte Carlo dropout at inference time to sample Q-functions.
- Randomized prior functions added to Q-networks to maintain exploration.

Theoretical insights:

- Posterior sampling methods can enjoy strong regret bounds in some RL settings.
- They can have better asymptotic constants than optimism-based methods in certain problems.
- Coherent, temporally extended exploration is essential in environments with delayed rewards and complex goals.

For further reading:

- Osband et al., "Deep Exploration via Bootstrapped DQN", 2016.
- Osband and Van Roy, "Why Is Posterior Sampling Better Than Optimism for Reinforcement Learning?", 2017.
- Chapelle and Li, "An Empirical Evaluation of Thompson Sampling", 2011.

### Questions

- Why do PG methods implicitly support exploration?
- Is that sufficient? How can we improve their implicit exploration?
  - What are the limitations of entropy regularization?
- How can we improve exploration for PG methods?
  - Intrinsic-motivation bonuses (e.g., RND)
  - Explicitly optimizing per-state entropy in the return (e.g., SAC)
  - Hierarchical RL
  - Goal-conditioned RL
- What are potentially more effective exploration methods?
  - Knowledge-driven
  - Model-based exploration