# CSE510 Deep Reinforcement Learning (Lecture 21)

> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.

## Exploration in RL: Information-Based Exploration (Intrinsic Curiosity)

### Computational Curiosity

- "The direct goal of curiosity and boredom is to improve the world model."
- Curiosity encourages agents to seek experiences that better predict or explain the environment.
- A "curiosity unit" gives reward based on the mismatch between current model predictions and actual outcomes.
- Intrinsic reward is high when the agent's prediction fails, that is, when it encounters surprising outcomes.
- This yields positive intrinsic reinforcement when the internal predictive model errs, causing the agent to repeat actions that lead to prediction errors.
- The agent is effectively motivated to create situations where its model fails.

### Model Prediction Error as Intrinsic Reward

We augment the reward with an intrinsic bonus based on model prediction error:

$R(s, a, s') = r(s, a, s') + B(|T(s, a; \theta) - s'|)$

Parameter explanations:

- $s$: current state of the agent.
- $a$: action taken by the agent in state $s$.
- $s'$: next state resulting from executing action $a$ in state $s$.
- $r(s, a, s')$: extrinsic environment reward for transition $(s, a, s')$.
- $T(s, a; \theta)$: learned dynamics model with parameters $\theta$ that predicts the next state.
- $\theta$: parameter vector of the predictive dynamics model $T$.
- $|T(s, a; \theta) - s'|$: prediction error magnitude between predicted next state and actual next state.
- $B(\cdot)$: function converting prediction error magnitude into an intrinsic reward bonus.
- $R(s, a, s')$: total reward, sum of extrinsic reward and intrinsic curiosity bonus.

Key ideas:

- The agent receives an intrinsic reward $B(|T(s, a; \theta) - s'|)$ when the actual outcome differs from what its world model predicts.
- Initially many transitions are surprising, encouraging broad exploration.
- As the model improves, familiar transitions yield smaller error and smaller intrinsic reward.
- Exploration becomes focused on less-known parts of the state space.
- Intrinsic motivation is non-stationary: as the agent learns, previously novel states lose their intrinsic reward.

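To make the formula concrete, the sketch below implements the augmented reward in a toy tabular setting. The class, the scalar encoding of states, the linear choice of $B(\cdot)$, and the learning rate are illustrative assumptions, not part of the lecture; note how, once the model predicts a transition accurately, that transition's bonus shrinks toward zero, which is exactly the non-stationarity described above.

```python
import numpy as np

# Toy curiosity bonus, assuming integer-indexed states so that the prediction error
# |T(s, a; theta) - s'| is just the absolute difference of two numbers.
class CuriosityReward:
    def __init__(self, n_states, n_actions, bonus_scale=1.0, lr=0.1):
        self.pred_next = np.zeros((n_states, n_actions))  # learned dynamics model T(s, a)
        self.bonus_scale = bonus_scale                     # B(.) is a linear scaling of the error here
        self.lr = lr

    def augmented_reward(self, s, a, s_next, r_ext):
        error = abs(self.pred_next[s, a] - s_next)         # |T(s, a; theta) - s'|
        bonus = self.bonus_scale * error                   # B(|T(s, a; theta) - s'|)
        # Move the model's prediction toward the observed next state.
        self.pred_next[s, a] += self.lr * (s_next - self.pred_next[s, a])
        return r_ext + bonus                               # R(s, a, s') = r(s, a, s') + B(...)
```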
#### Avoiding Trivial Curiosity Traps
Reference: [Pathak et al. (2017), "Curiosity-driven Exploration by Self-supervised Prediction"](https://ar5iv.labs.arxiv.org/html/1705.05363)

Naively defining $B(s, a, s')$ directly in raw observation space can lead to trivial curiosity traps.

Examples:

- The agent may purposely cause chaotic or noisy observations (like flickering pixels) that are impossible to predict.
- The model cannot reduce prediction error on pure noise, so the agent is rewarded for meaningless randomness.
- This yields high intrinsic reward without meaningful learning or progress toward task goals.

To prevent this, we restrict prediction to a more informative feature space:

$B(s, a, s') = |T(E(s; \phi), a; \theta) - E(s'; \phi)|$

Parameter explanations:

- $E(s; \phi)$: learned encoder mapping raw state $s$ into a feature vector.
- $\phi$: parameter vector of the encoder $E$.
- $T(E(s; \phi), a; \theta)$: forward model predicting the next feature representation from the encoded state and action.
- $E(s'; \phi)$: encoded feature representation of the next state $s'$.
- $B(s, a, s')$: intrinsic reward based on prediction error in feature space.

Key ideas:

- The encoder $E(s; \phi)$ is trained so that features capture aspects of the state that are controllable by the agent.
- One approach is to train $E$ via an inverse dynamics model that predicts $a$ from $(s, s')$.
- This encourages $E$ to keep only information necessary to infer actions, discarding irrelevant noise.
- Measuring prediction error in feature space ignores unpredictable environmental noise.
- Intrinsic reward focuses on errors due to lack of knowledge about controllable dynamics.
- The agent's curiosity is directed toward aspects of the environment it can influence and learn.

A practical implementation is the Intrinsic Curiosity Module (ICM) by Pathak et al. (2017):

- The encoder $E$ and forward model $T$ are trained jointly.
- The loss includes both forward prediction error and inverse dynamics error.
- Intrinsic reward is set to the forward prediction error in feature space.
- This drives exploration of states where the agent cannot yet predict the effect of its actions.

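As a minimal ICM-style sketch, the PyTorch module below wires together an encoder, a forward model, and an inverse model, and reads the intrinsic reward off the forward prediction error in feature space. The layer sizes, the loss weight `beta`, the reward scale `eta`, and the choice to detach the target features are illustrative assumptions, not the exact configuration of Pathak et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=32, beta=0.2, eta=1.0):
        super().__init__()
        # Encoder E(s; phi): raw observation -> feature vector.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        # Forward model T: predicts phi(s') from (phi(s), one-hot action).
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        # Inverse model: predicts the action from (phi(s), phi(s')).
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.n_actions = n_actions
        self.beta = beta   # weight between forward and inverse losses
        self.eta = eta     # scale of the intrinsic reward

    def forward(self, obs, action, next_obs):
        # obs, next_obs: (batch, obs_dim); action: (batch,) long tensor.
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        a_onehot = F.one_hot(action, self.n_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        action_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        # Forward prediction error in feature space -> intrinsic reward.
        forward_err = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)
        intrinsic_reward = self.eta * forward_err.detach()
        # Training loss mixes the forward error and the inverse dynamics error.
        loss = self.beta * forward_err.mean() + (1 - self.beta) * F.cross_entropy(action_logits, action)
        return intrinsic_reward, loss
```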
#### Random Network Distillation (RND)
Random Network Distillation (RND) provides a simpler curiosity bonus without learning a dynamics model.

Basic idea:

- Use a fixed random neural network $f_{\text{target}}$ that maps states to feature vectors.
- Train a predictor network $f_{\text{pred}}$ to approximate $f_{\text{target}}$ on visited states.
- The intrinsic reward is the prediction error between $f_{\text{pred}}(s)$ and $f_{\text{target}}(s)$.

Typical form of the intrinsic reward:

$r^{\text{int}}(s) = |f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)|^{2}$

Parameter explanations:

- $f_{\text{target}}$: fixed random neural network generating target features for each state.
- $f_{\text{pred}}(s; \psi)$: trainable predictor network with parameters $\psi$.
- $\psi$: parameter vector for the predictor network.
- $s$: state input to both networks.
- $|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)|^{2}$: squared error between predictor and target features.
- $r^{\text{int}}(s)$: intrinsic reward based on prediction error in random feature space.

Key properties:

- For novel or rarely visited states, $f_{\text{pred}}$ has not yet learned to match $f_{\text{target}}$, so error is high.
- For frequently visited states, prediction error becomes small, and intrinsic reward decays.
- The target network is random and fixed, so it does not adapt to the policy.
- This provides a stable novelty signal without explicit dynamics learning.
- RND achieves strong exploration performance in challenging environments, such as hard-exploration Atari games.

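A minimal RND sketch along these lines is shown below; the MLP sizes, feature dimension, and learning rate are assumptions for illustration, and the usual tricks (observation and reward normalization) are omitted.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=32, lr=1e-4):
        super().__init__()
        # Fixed random target network f_target: parameters are never updated.
        self.target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Trainable predictor network f_pred with the same output dimensionality.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def intrinsic_reward(self, obs):
        # r_int(s) = || f_pred(s; psi) - f_target(s) ||^2, computed without tracking gradients.
        with torch.no_grad():
            return (self.predictor(obs) - self.target(obs)).pow(2).sum(dim=-1)

    def update(self, obs):
        # Distillation step: move the predictor toward the fixed random targets on visited states.
        loss = (self.predictor(obs) - self.target(obs)).pow(2).sum(dim=-1).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```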
### Efficacy of Curiosity-Driven Exploration
Empirical observations:

- Curiosity-driven intrinsic rewards often lead to significantly higher extrinsic returns in sparse-reward environments compared to agents trained only on extrinsic rewards.
- Intrinsic rewards act as a proxy objective that guides the agent toward interesting or informative regions of the state space.
- In some experiments, agents trained with only intrinsic rewards (no extrinsic reward during training) still learn behaviors that later achieve high task scores when extrinsic rewards are measured.
- Using random features for curiosity (as in RND) can perform nearly as well as using learned features in many domains.
- Simple surprise signals are often sufficient to drive effective exploration.
- Learned feature spaces may generalize better to truly novel scenarios but are not always necessary.

Historical context:

- The concept of learning from intrinsic rewards alone is not new.
- Itti and Baldi (2005) studied "Bayesian surprise" as a driver of human attention.
- Schmidhuber (1991, 2010) formalized curiosity, creativity, and fun as intrinsic motivations in learning agents.
- Singh et al. (2004) proposed intrinsically motivated reinforcement learning frameworks.
- These early works laid the conceptual foundation for modern curiosity-driven deep RL methods.

For further reading on intrinsic curiosity methods:

- Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction", 2017.
- Burda et al., "Exploration by Random Network Distillation", 2018.
- Schmidhuber, "Formal Theory of Creativity, Fun, and Intrinsic Motivation", 2010.

## Exploration via Posterior Sampling
While optimistic and curiosity bonus methods modify the reward function, posterior sampling approaches handle exploration by maintaining uncertainty over models or value functions and sampling from this uncertainty.

These methods are rooted in Thompson Sampling and naturally balance exploration and exploitation.

### Posterior Sampling in Multi-Armed Bandits (Thompson Sampling)

In a multi-armed bandit problem (no state transitions), Thompson Sampling works as follows:

1. Maintain a prior and posterior distribution over the reward parameters for each arm.
2. At each time step, sample reward parameters for all arms from their current posterior.
3. Select the arm with the highest sampled mean reward.
4. Observe the reward, update the posterior, and repeat.

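To make steps 1-4 concrete, here is a minimal sketch of Thompson Sampling on a Bernoulli bandit with Beta(1, 1) priors; the simulated environment (`true_probs`) and the Beta-Bernoulli model are assumptions chosen for illustration.

```python
import numpy as np

def thompson_sampling(true_probs, n_steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    successes = np.ones(n_arms)   # alpha parameters of the Beta posteriors
    failures = np.ones(n_arms)    # beta parameters of the Beta posteriors
    total_reward = 0.0
    for _ in range(n_steps):
        # Step 2: sample a mean-reward estimate for every arm from its posterior.
        samples = rng.beta(successes, failures)
        # Step 3: play the arm whose sampled mean is highest.
        arm = int(np.argmax(samples))
        # Step 4: observe a Bernoulli reward and update that arm's posterior counts.
        reward = rng.random() < true_probs[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: three arms with unknown success probabilities 0.2, 0.5, 0.7.
total = thompson_sampling([0.2, 0.5, 0.7])
```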
Intuition:

- Each action is selected with probability equal to the posterior probability that it is optimal.
- Arms with high uncertainty are more likely to be sampled as optimal in some posterior draws.
- Exploration arises naturally from uncertainty, without explicit epsilon-greedy noise or bonus terms.
- Over time, the posterior concentrates on the true reward means, and the algorithm shifts toward exploitation.

Theoretical properties:

- Thompson Sampling attains near-optimal regret bounds in many bandit settings.
- It often performs as well as or better than upper confidence bound algorithms in practice.

### Posterior Sampling for Reinforcement Learning (PSRL)
In reinforcement learning with states and transitions, posterior sampling generalizes to sampling entire MDP models.

Posterior Sampling for Reinforcement Learning (PSRL) operates as follows:

1. Maintain a posterior distribution over environment dynamics and rewards, based on observed transitions.
2. At the beginning of an episode, sample an MDP model from this posterior.
3. Compute the optimal policy for the sampled MDP (for example, by value iteration).
4. Execute this policy in the real environment for the whole episode.
5. Use the observed transitions to update the posterior, then repeat.

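A minimal tabular sketch of one PSRL episode is given below, assuming known mean rewards, a Dirichlet posterior over transition probabilities, and a hypothetical episodic `env` with `reset()` and `step()` methods; these interface details are illustrative assumptions rather than part of the lecture.

```python
import numpy as np

def psrl_episode(env, counts, rewards, gamma=0.95, horizon=100, vi_iters=200, rng=None):
    # counts: (S, A, S) array of observed transition counts; rewards: (S, A) known mean rewards.
    if rng is None:
        rng = np.random.default_rng()
    n_states, n_actions, _ = counts.shape
    # Step 2: sample one transition model from the Dirichlet posterior (prior pseudo-count 1).
    P = np.array([[rng.dirichlet(counts[s, a] + 1) for a in range(n_actions)]
                  for s in range(n_states)])
    # Step 3: compute the optimal Q-function of the sampled MDP by value iteration.
    Q = np.zeros((n_states, n_actions))
    for _ in range(vi_iters):
        V = Q.max(axis=1)
        Q = rewards + gamma * P @ V
    # Steps 4-5: act greedily w.r.t. the sampled MDP for the whole episode and
    # record observed transitions to update the posterior counts.
    s = env.reset()
    for _ in range(horizon):
        a = int(np.argmax(Q[s]))
        s_next, _, done = env.step(a)
        counts[s, a, s_next] += 1
        s = s_next
        if done:
            break
    return counts
```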
Key advantages:

- The agent commits to a sampled model's policy for an extended duration, which induces deep exploration.
- If a sampled model is optimistic in unexplored regions, the corresponding policy will deliberately visit those regions.
- Exploration is coherent across time within an episode, unlike per-step randomization in epsilon-greedy.
- The method does not require ad hoc exploration bonuses; exploration is an emergent property of the posterior.

Challenges:

- Maintaining an exact posterior over high-dimensional MDPs is usually intractable.
- Practical implementations use approximations.

### Approximate Posterior Sampling with Ensembles (Bootstrapped DQN)
A common approximate posterior method in deep RL is Bootstrapped DQN.

Basic idea:

- Train an ensemble of $K$ Q-networks (heads), $Q^{(1)}, \dots, Q^{(K)}$.
- Each head is trained on a different bootstrap sample or masked subset of experience.
- At the start of each episode, sample a head index $k$ uniformly from $\{1, \dots, K\}$.
- For the entire episode, act greedily with respect to $Q^{(k)}$.

Parameter definitions for the ensemble:

- $K$: number of Q-network heads in the ensemble.
- $Q^{(k)}(s, a)$: Q-value estimate for head $k$ at state-action pair $(s, a)$.
- $k$: index of the sampled head used for the current episode.
- $(s, a)$: state and action arguments to Q-value functions.

Implementation details:

- A shared feature backbone network processes state inputs, feeding into all heads.
- Each head has its own final layers, allowing diverse value estimates.
- Masking or bootstrapping assigns different subsets of transitions to different heads during training.

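The sketch below illustrates the shared-backbone, multi-head architecture together with a bootstrap-masked TD loss; the layer sizes, the number of heads, and the way TD targets are supplied are illustrative assumptions, and the replay buffer and target networks are omitted. At the start of each episode, one head index would be drawn uniformly at random and passed to `act_for_episode` until the episode ends.

```python
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, n_heads=10):
        super().__init__()
        # Shared feature backbone processes the state once for all heads.
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        # Each head has its own final layers, giving diverse Q-value estimates.
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_actions))
             for _ in range(n_heads)])

    def forward(self, obs):
        feats = self.backbone(obs)
        # Shape: (n_heads, batch, n_actions)
        return torch.stack([head(feats) for head in self.heads])

def act_for_episode(qnet, obs, head_idx):
    # Act greedily with respect to the single head sampled for this episode.
    with torch.no_grad():
        q_values = qnet(obs.unsqueeze(0))[head_idx, 0]
    return int(q_values.argmax())

def masked_td_loss(q_all_heads, targets, actions, masks):
    # q_all_heads: (n_heads, batch, n_actions); targets, masks: (n_heads, batch);
    # actions: (batch,) long tensor; masks is the Bernoulli bootstrap mask deciding
    # which head is trained on which transition.
    idx = actions.view(1, -1, 1).expand(q_all_heads.size(0), -1, 1)
    chosen = q_all_heads.gather(2, idx).squeeze(-1)
    per_head_error = (chosen - targets) ** 2
    return (masks * per_head_error).sum() / masks.sum().clamp(min=1)
```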
Benefits:

- Each head approximates a different plausible Q-function, analogous to a sample from a posterior.
- When a head is optimistic about certain under-explored actions, its greedy policy will explore them deeply.
- Exploration behavior is temporally consistent within an episode.
- No modification of the reward function is required; exploration arises from policy randomization via multiple heads.

Comparison to epsilon-greedy:

- Epsilon-greedy adds per-step random actions, which can be inefficient for long-horizon exploration.
- Bootstrapped DQN commits to a strategy for an episode, enabling the agent to execute complete exploratory plans.
- This can dramatically increase the probability of discovering long sequences needed to reach sparse rewards.

Other approximate posterior approaches:

- Bayesian neural networks for Q-functions (explicit parameter distributions).
- Using Monte Carlo dropout at inference to sample Q-functions.
- Randomized prior functions added to Q-networks to maintain exploration.

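As a sketch of the last idea, in the spirit of randomized prior functions, each ensemble member can be written as the sum of a trainable network and a fixed, randomly initialized prior network; the layer sizes and the prior scale `beta` below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QWithRandomPrior(nn.Module):
    def __init__(self, obs_dim, n_actions, beta=3.0):
        super().__init__()
        self.trainable = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.prior = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
        # The prior network is never trained; it injects persistent, state-dependent diversity
        # so that ensemble members keep disagreeing in rarely visited states.
        for p in self.prior.parameters():
            p.requires_grad_(False)
        self.beta = beta

    def forward(self, obs):
        return self.trainable(obs) + self.beta * self.prior(obs)
```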
Theoretical insights:

- Posterior sampling methods can enjoy strong regret bounds in some RL settings.
- They can have better asymptotic constants than optimism-based methods in certain problems.
- Coherent, temporally extended exploration is essential in environments with delayed rewards and complex goals.

For further reading:

- Osband et al., "Deep Exploration via Bootstrapped DQN", 2016.
- Osband and Van Roy, "Why Is Posterior Sampling Better Than Optimism for Reinforcement Learning?", 2017.
- Chapelle and Li, "An Empirical Evaluation of Thompson Sampling", 2011.