# CSE510 Deep Reinforcement Learning (Lecture 21)
## Exploration in RL

> Due to a lapse of my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.

### Information State Search (Intrinsic Curiosity)

- Targets uncertainty about state transitions or dynamics.
- Exploration signal: dynamics prediction error, or information gain for dynamics learning.

#### Computational Curiosity
- "The direct goal of curiosity and boredom is to improve the world model."
|
||||
- "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
|
||||
- There is positive reinforcement whenever the system fails to correctly predict the environment.
|
||||
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation. (planning to make your (internal) world model to fail)
|
||||
- Curiosity encourages agents to seek experiences that better predict or explain the environment.
|
||||
- A "curiosity unit" gives reward based on the mismatch between current model predictions and actual outcomes.
|
||||
- Intrinsic reward is high when the agent's prediction fails, that is, when it encounters surprising outcomes.
|
||||
- This yields positive intrinsic reinforcement when the internal predictive model errs, causing the agent to repeat actions that lead to prediction errors.
|
||||
- The agent is effectively motivated to create situations where its model fails.
|
||||
|
||||

#### Model Prediction Error as Intrinsic Reward

- Add an exploration reward bonus that encourages the policy to visit states where the prediction model fails.
- Concretely, augment the extrinsic reward with an intrinsic bonus based on the model prediction error:

$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
$$

- where $r(s,a,s')$ is the extrinsic reward, $T(s,a;\theta)$ is the predicted next state, and $\mathcal{B}$ is a bonus function (intrinsic reward bonus).
- Exploration reward bonuses are non-stationary: as the agent interacts with the environment, what is new and novel now becomes old and known.
- [Link to the paper](https://arxiv.org/pdf/1507.08750)

Parameter explanations:

- $s$: current state of the agent.
- $a$: action taken by the agent in state $s$.
- $s'$: next state resulting from executing action $a$ in state $s$.
- $r(s, a, s')$: extrinsic environment reward for the transition $(s, a, s')$.
- $T(s, a; \theta)$: learned dynamics model with parameters $\theta$ that predicts the next state.
- $\theta$: parameter vector of the predictive dynamics model $T$.
- $\|T(s, a; \theta) - s'\|$: prediction error magnitude between the predicted and actual next state.
- $\mathcal{B}(\cdot)$: function converting the prediction error magnitude into an intrinsic reward bonus.
- $R(s, a, s')$: total reward, the sum of the extrinsic reward and the intrinsic curiosity bonus.
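
As a concrete illustration of this formula, here is a minimal PyTorch sketch; the network architecture, the linear bonus $\mathcal{B}(e) = \beta e$, and the coefficient `beta` are illustrative assumptions, not choices taken from the paper.

```python
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """Forward model T(s, a; theta): predicts the next state from (state, action)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))


def total_reward(r_ext, s, a, s_next, model, beta: float = 0.1):
    """R(s,a,s') = r(s,a,s') + B(||T(s,a;theta) - s'||), with the assumed bonus B(e) = beta * e."""
    with torch.no_grad():
        pred_error = torch.norm(model(s, a) - s_next, dim=-1)  # ||T(s,a;theta) - s'||
    return r_ext + beta * pred_error

# The dynamics model itself is trained on observed transitions by minimizing
# the same prediction error, e.g.:
#   loss = ((model(s, a) - s_next) ** 2).mean(); loss.backward(); optimizer.step()
```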

<details>
<summary>Example: Learning Visual Dynamics</summary>

Key ideas:

- The agent receives an intrinsic reward $\mathcal{B}(\|T(s, a; \theta) - s'\|)$ when the actual outcome differs from what its world model predicts.
- Initially many transitions are surprising, encouraging broad exploration.
- As the model improves, familiar transitions yield smaller errors and smaller intrinsic rewards.
- Exploration becomes focused on less-known parts of the state space.
- Intrinsic motivation is non-stationary: as the agent learns, previously novel states lose their intrinsic reward.

</details>

#### Avoiding Trivial Curiosity Traps

- Exploration reward bonuses of the form $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$ admit a trivial solution: the agent can collect reward by just moving around randomly.

- Naively defining $\mathcal{B}(s, a, s')$ directly in raw observation space can lead to trivial curiosity traps.

Examples:

- The agent may purposely cause chaotic or noisy observations (like flickering pixels) that are impossible to predict.
- The model cannot reduce prediction error on pure noise, so the agent is rewarded for meaningless randomness.
- This yields high intrinsic reward without meaningful learning or progress toward the task goals.

One remedy is an exploration reward bonus computed with an autoencoder, $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$, but this suffers from the problem that the autoencoding reconstruction loss has little to do with our task.

To prevent this, we instead restrict prediction to a more informative, learned feature space ([link to paper](https://ar5iv.labs.arxiv.org/html/1705.05363#:~:text=reward%20signal%20based%20on%20how,this%20feature%20space%20using%20self)):

$$
\mathcal{B}(s, a, s') = \|T(E(s; \phi), a; \theta) - E(s'; \phi)\|
$$

Parameter explanations:

- $E(s; \phi)$: learned encoder mapping the raw state $s$ into a feature vector.
- $\phi$: parameter vector of the encoder $E$.
- $T(E(s; \phi), a; \theta)$: forward model predicting the next feature representation from the encoded state and action.
- $E(s'; \phi)$: encoded feature representation of the next state $s'$.
- $\mathcal{B}(s, a, s')$: intrinsic reward based on the prediction error in feature space.

Key ideas:

- The encoder $E(s; \phi)$ is trained so that the features capture aspects of the state that are controllable by the agent.
- One approach is to train $E$ via an inverse dynamics model that predicts $a$ from $(s, s')$.
- This encourages $E$ to keep only the information necessary to infer actions, discarding irrelevant noise.
- Measuring prediction error in feature space ignores unpredictable environmental noise.
- The intrinsic reward focuses on errors due to lack of knowledge about controllable dynamics.
- The agent's curiosity is directed toward aspects of the environment it can influence and learn.

A practical implementation is the Intrinsic Curiosity Module (ICM) by Pathak et al. (2017); see the sketch after this list:

- The encoder $E$ and forward model $T$ are trained jointly.
- The loss includes both the forward prediction error and the inverse dynamics error.
- The intrinsic reward is set to the forward prediction error in feature space.
- This drives exploration of states where the agent cannot yet predict the effect of its actions.
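
Below is a minimal ICM-style sketch in PyTorch. The layer sizes, the loss weight `beta`, and the bonus scale `eta` are assumed values, and the code is a simplification of the design described above rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ICM(nn.Module):
    """Encoder E(.; phi), inverse model (predicts a), and forward model T(.; theta) in feature space."""

    def __init__(self, obs_dim: int, n_actions: int, feat_dim: int = 64):
        super().__init__()
        self.n_actions = n_actions
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        self.inverse = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_actions))        # predicts the action taken
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 128), nn.ReLU(),
                                           nn.Linear(128, feat_dim))   # predicts next features

    def forward(self, s, a, s_next):
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_onehot = F.one_hot(a, num_classes=self.n_actions).float()
        # Inverse dynamics: shapes the features to keep only controllable information.
        inv_logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
        # Forward prediction in feature space; its error is the curiosity signal.
        phi_next_pred = self.forward_model(torch.cat([phi.detach(), a_onehot], dim=-1))
        fwd_error = 0.5 * ((phi_next_pred - phi_next.detach()) ** 2).sum(dim=-1)
        return inv_logits, fwd_error


def icm_update_terms(icm, s, a, s_next, eta: float = 0.01, beta: float = 0.2):
    """Returns the ICM training loss and the intrinsic reward B(s, a, s')."""
    inv_logits, fwd_error = icm(s, a, s_next)
    inverse_loss = F.cross_entropy(inv_logits, a)
    icm_loss = (1 - beta) * inverse_loss + beta * fwd_error.mean()
    intrinsic_reward = eta * fwd_error.detach()   # added to the extrinsic reward
    return icm_loss, intrinsic_reward
```

In this sketch the forward-prediction error is detached before being used as the intrinsic reward, so the policy gradient does not flow back into the curiosity module.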

#### Task Rewards vs. Exploration Rewards

Exploration reward bonuses:

$$
\mathcal{B}(s, a, s') = \|T(E(s; \phi), a; \theta) - E(s'; \phi)\|
$$

Only task rewards:

$$
R(s,a,s') = r(s,a,s')
$$

Task + curiosity rewards:

$$
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Sparse task + curiosity rewards:

$$
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
$$

Only curiosity rewards:

$$
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
$$

#### Intrinsic Reward RL Is Not New

- Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS'05, pp. 547–554 (2006)
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN'91, vol. 2, pp. 1458–1463 (1991)
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Trans. on Autonomous Mental Development 2(3), 230–247 (2010)
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS'04 (2004)
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN'95 (1995)
- Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708

#### Limitation of Prediction Errors

- The agent keeps being rewarded even when the model cannot improve.
- It therefore focuses on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem.
- The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.

#### Random Network Distillation (RND)

Random Network Distillation (RND) provides a simpler curiosity bonus that does not require learning a dynamics model.

- Original idea: predict the output of a fixed, randomly initialized neural network on the next state, given the current state and action.
- New idea: predict the output of a fixed, randomly initialized neural network on the next state, given the **next state itself**.

Basic idea:

- Use a fixed random neural network $f_{\text{target}}$ that maps states to feature vectors.
- Train a predictor network $f_{\text{pred}}$ to approximate $f_{\text{target}}$ on visited states.
- The intrinsic reward is the prediction error between $f_{\text{pred}}(s)$ and $f_{\text{target}}(s)$.

Typical form of the intrinsic reward:

$$
r^{\text{int}}(s) = \|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}
$$

Parameter explanations:

- $f_{\text{target}}$: fixed random neural network generating target features for each state.
- $f_{\text{pred}}(s; \psi)$: trainable predictor network with parameters $\psi$.
- $\psi$: parameter vector of the predictor network.
- $s$: state input to both networks.
- $\|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}$: squared error between predictor and target features.
- $r^{\text{int}}(s)$: intrinsic reward based on the prediction error in the random feature space.
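
A minimal RND sketch in PyTorch that matches the formula above; the architecture and sizes are placeholder assumptions, and a practical implementation would additionally normalize observations and intrinsic rewards.

```python
import torch
import torch.nn as nn


def make_net(obs_dim: int, feat_dim: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, feat_dim))


class RND(nn.Module):
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        self.target = make_net(obs_dim, feat_dim)     # f_target: fixed, random
        self.predictor = make_net(obs_dim, feat_dim)  # f_pred(.; psi): trained
        for p in self.target.parameters():
            p.requires_grad_(False)                   # the target is never updated

    def intrinsic_reward(self, s: torch.Tensor) -> torch.Tensor:
        """r_int(s) = ||f_pred(s; psi) - f_target(s)||^2"""
        with torch.no_grad():
            target_feat = self.target(s)
        pred_feat = self.predictor(s)
        return ((pred_feat - target_feat) ** 2).sum(dim=-1)


rnd = RND(obs_dim=8)
optimizer = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)

s_batch = torch.randn(32, 8)            # a batch of visited states
r_int = rnd.intrinsic_reward(s_batch)   # use r_int.detach() as the exploration bonus
loss = r_int.mean()                     # training the predictor on visited states
optimizer.zero_grad(); loss.backward(); optimizer.step()
```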

- The target network has fixed, randomized weights and is never trained; the prediction network is trained to predict the target network's output on visited states.

> The more you visit a state, the smaller this prediction loss becomes.

Key properties:

- For novel or rarely visited states, $f_{\text{pred}}$ has not yet learned to match $f_{\text{target}}$, so the error is high.
- For frequently visited states, the prediction error becomes small and the intrinsic reward decays.
- The target network is random and fixed, so it does not adapt to the policy.
- This provides a stable novelty signal without explicit dynamics learning.
- RND achieves strong exploration performance in challenging environments, such as hard-exploration Atari games.

#### Efficacy of Curiosity-Driven Exploration

Empirical observations:

- Curiosity-driven intrinsic rewards often lead to significantly higher extrinsic returns in sparse-reward environments than training on extrinsic rewards alone.
- Intrinsic rewards act as a proxy objective that guides the agent toward interesting or informative regions of the state space.
- In some experiments, agents trained with only intrinsic rewards (no extrinsic reward during training) still learn behaviors that later achieve high task scores when extrinsic rewards are measured.
- Using random features for curiosity (as in RND) can perform nearly as well as using learned features in many domains.
- Simple surprise signals are often sufficient to drive effective exploration.
- Learned feature spaces may generalize better to truly novel scenarios but are not always necessary.

Historical context:

- The concept of learning from intrinsic rewards alone is not new.
- Itti and Baldi (2005) studied "Bayesian surprise" as a driver of human attention.
- Schmidhuber (1991, 2010) formalized curiosity, creativity, and fun as intrinsic motivations in learning agents.
- Singh et al. (2004) proposed intrinsically motivated reinforcement learning frameworks.
- These early works laid the conceptual foundation for modern curiosity-driven deep RL methods.

For further reading on intrinsic curiosity methods:

- Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction", 2017.
- Burda et al., "Exploration by Random Network Distillation", 2018.
- Schmidhuber, "Formal Theory of Creativity, Fun, and Intrinsic Motivation", 2010.

### Posterior Sampling

- Targets uncertainty about Q-value functions or policies.
- Idea: select actions according to the probability that they are best under the current model.

While optimism and curiosity bonuses modify the reward function, posterior sampling approaches handle exploration by maintaining uncertainty over models or value functions and sampling from this uncertainty. These methods are rooted in Thompson Sampling and naturally balance exploration and exploitation.

#### Exploration with Action Value Information

Count-based and curiosity-driven methods do not take the action-value information into account.



> In this case, the optimal action is action 1, but we will explore action 3 because it has the highest uncertainty. It also takes a long time to distinguish actions 1 and 2, since they have similar values.

#### Exploration via Posterior Sampling of Q Functions

- Represent a posterior distribution over Q functions, $P(Q)$, instead of a point estimate:

1. Sample a Q function from the posterior, $Q \sim P(Q)$.
2. Choose actions according to this $Q$ for one episode: $a = \arg\max_{a} Q(s,a)$.
3. Update $P(Q)$ based on the sampled $Q$ and the collected experience tuples $(s,a,r,s')$.

- Then we do not need $\epsilon$-greedy for exploration: representing uncertainty over Q yields better exploration.
- But how can we learn a distribution over Q functions, $P(Q)$, when the Q function is a deep neural network?

#### Bootstrap Ensemble

- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data.
  - This is computationally expensive.
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data.

#### Posterior Sampling in Multi-Armed Bandits (Thompson Sampling)

In a multi-armed bandit problem (no state transitions), Thompson Sampling works as follows:

1. Maintain a prior and posterior distribution over the reward parameters of each arm.
2. At each time step, sample reward parameters for all arms from their current posteriors.
3. Select the arm with the highest sampled mean reward.
4. Observe the reward, update the posterior, and repeat.
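
A minimal sketch of these four steps for Bernoulli-reward arms with Beta posteriors; the arm means and the number of rounds are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])       # unknown to the agent (toy environment)
n_arms = len(true_means)

# Step 1: Beta(1, 1) priors over each arm's success probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(10_000):
    theta = rng.beta(alpha, beta)            # Step 2: sample parameters from the posteriors
    arm = int(np.argmax(theta))              # Step 3: pick the arm with the highest sample
    reward = rng.random() < true_means[arm]  # Step 4: observe a Bernoulli reward...
    alpha[arm] += reward                     # ...and update the played arm's posterior
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))                # posterior means concentrate near true_means
```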

Intuition:

- Each action is selected with probability equal to the posterior probability that it is optimal.
- Arms with high uncertainty are more likely to be sampled as optimal in some posterior draws.
- Exploration arises naturally from uncertainty, without explicit epsilon-greedy noise or bonus terms.
- Over time, the posterior concentrates on the true reward means, and the algorithm shifts toward exploitation.

Theoretical properties:

- Thompson Sampling attains near-optimal regret bounds in many bandit settings.
- It often performs as well as or better than upper confidence bound algorithms in practice.

#### Posterior Sampling for Reinforcement Learning (PSRL)

In reinforcement learning with states and transitions, posterior sampling generalizes to sampling entire MDP models.

Posterior Sampling for Reinforcement Learning (PSRL) operates as follows:

1. Maintain a posterior distribution over environment dynamics and rewards, based on observed transitions.
2. At the beginning of an episode, sample an MDP model from this posterior.
3. Compute the optimal policy for the sampled MDP (for example, by value iteration).
4. Execute this policy in the real environment for the whole episode.
5. Use the observed transitions to update the posterior, then repeat.
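
A tabular sketch of this loop, assuming a small finite MDP, Dirichlet posteriors over the transition probabilities, and (for brevity) a known reward table; the toy environment and all sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, horizon = 5, 2, 0.95, 20

# Posterior over dynamics: Dirichlet counts per (s, a); rewards assumed known here for brevity.
dirichlet_counts = np.ones((n_states, n_actions, n_states))
R = rng.random((n_states, n_actions))       # stand-in known reward table (assumption)


def value_iteration(P, R, gamma, iters=200):
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * P @ V               # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                 # greedy policy of the sampled MDP


def true_step(s, a):                        # placeholder real environment, unknown to the agent
    return int(rng.integers(n_states))


for episode in range(100):
    # Sample one MDP from the posterior, plan in it, then run its policy for a whole episode.
    P = np.array([[rng.dirichlet(dirichlet_counts[s, a]) for a in range(n_actions)]
                  for s in range(n_states)])
    policy = value_iteration(P, R, gamma)
    s = 0
    for t in range(horizon):
        a = policy[s]
        s_next = true_step(s, a)
        dirichlet_counts[s, a, s_next] += 1  # posterior update from the observed transition
        s = s_next
```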

Key advantages:

- The agent commits to a sampled model's policy for an extended duration, which induces deep exploration.
- If a sampled model is optimistic in unexplored regions, the corresponding policy will deliberately visit those regions.
- Exploration is coherent across time within an episode, unlike the per-step randomization of epsilon-greedy.
- The method does not require ad hoc exploration bonuses; exploration is an emergent property of the posterior.

Challenges:

- Maintaining an exact posterior over high-dimensional MDPs is usually intractable.
- Practical implementations use approximations.

#### Approximate Posterior Sampling with Ensembles (Bootstrapped DQN)

A common approximate posterior method in deep RL is Bootstrapped DQN.

Basic idea:

- Train an ensemble of $K$ Q-networks (heads), $Q^{(1)}, \dots, Q^{(K)}$.
- Each head is trained on a different bootstrap sample or masked subset of experience.
- At the start of each episode, sample a head index $k$ uniformly from $\{1, \dots, K\}$.
- For the entire episode, act greedily with respect to $Q^{(k)}$.

Parameter definitions for the ensemble:

- $K$: number of Q-network heads in the ensemble.
- $Q^{(k)}(s, a)$: Q-value estimate of head $k$ at the state-action pair $(s, a)$.
- $k$: index of the sampled head used for the current episode.
- $(s, a)$: state and action arguments of the Q-value functions.

Implementation details:

- A shared feature backbone network processes state inputs, feeding into all heads.
- Each head has its own final layers, allowing diverse value estimates.
- Masking or bootstrapping assigns different subsets of transitions to different heads during training.
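
A minimal sketch of the shared backbone with $K$ heads and the episode-wise head selection; the layer sizes, the number of heads, and the Bernoulli masking scheme in the final comment are assumptions.

```python
import random
import torch
import torch.nn as nn


class BootstrappedQNet(nn.Module):
    """Shared backbone with K independent Q-heads, Q^(1..K)(s, a)."""

    def __init__(self, obs_dim: int, n_actions: int, n_heads: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(128, n_actions) for _ in range(n_heads))

    def forward(self, s: torch.Tensor, head: int) -> torch.Tensor:
        return self.heads[head](self.backbone(s))


qnet = BootstrappedQNet(obs_dim=8, n_actions=4, n_heads=10)

# At the start of each episode: sample one head and act greedily with it for the whole episode.
head = random.randrange(len(qnet.heads))


def act(s: torch.Tensor) -> int:
    with torch.no_grad():
        return int(qnet(s, head).argmax(dim=-1))


# During training, each stored transition carries a Bernoulli mask per head (bootstrap):
# only heads whose mask is 1 receive a TD-loss gradient from that transition, e.g.
#   mask = torch.bernoulli(torch.full((10,), 0.5))
```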

Benefits:

- Each head approximates a different plausible Q-function, analogous to a sample from a posterior.
- When a head is optimistic about certain under-explored actions, its greedy policy will explore them deeply.
- Exploration behavior is temporally consistent within an episode.
- No modification of the reward function is required; exploration arises from policy randomization via multiple heads.

Comparison to epsilon-greedy:

- Epsilon-greedy adds per-step random actions, which can be inefficient for long-horizon exploration.
- Bootstrapped DQN commits to a strategy for an episode, enabling the agent to execute complete exploratory plans.
- This can dramatically increase the probability of discovering the long action sequences needed to reach sparse rewards.

Other approximate posterior approaches:

- Bayesian neural networks for Q-functions (explicit parameter distributions).
- Monte Carlo dropout at inference time to sample Q-functions.
- Randomized prior functions added to Q-networks to maintain exploration.

Theoretical insights:

- Posterior sampling methods can enjoy strong regret bounds in some RL settings.
- They can have better asymptotic constants than optimism-based methods in certain problems.
- Coherent, temporally extended exploration is essential in environments with delayed rewards and complex goals.

For further reading:

- Osband et al., "Deep Exploration via Bootstrapped DQN", 2016.
- Osband and Van Roy, "Why Is Posterior Sampling Better Than Optimism for Reinforcement Learning?", 2017.
- Chapelle and Li, "An Empirical Evaluation of Thompson Sampling", 2011.

### Questions

- Why do PG methods implicitly support exploration?
- Is that sufficient? How can we improve their implicit exploration?
  - What are the limitations of entropy regularization?
- How can we improve exploration for PG methods?
  - Intrinsic-motivation bonuses (e.g., RND)
  - Explicitly optimizing per-state entropy in the return (e.g., SAC)
  - Hierarchical RL
  - Goal-conditioned RL
- What are potentially more effective exploration methods?
  - Knowledge-driven
  - Model-based exploration