From d24c0bdd9e76adc2c6f00711cea01638b3b191aa Mon Sep 17 00:00:00 2001
From: Trance-0 <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 4 Nov 2025 11:30:12 -0600
Subject: [PATCH] updates

---
 content/CSE510/CSE510_L20.md  | 143 ++++++++++++++++++++++++++++++++++
 content/CSE510/_meta.js       |   1 +
 content/CSE5519/CSE5519_B4.md |   2 +-
 3 files changed, 145 insertions(+), 1 deletion(-)
 create mode 100644 content/CSE510/CSE510_L20.md

diff --git a/content/CSE510/CSE510_L20.md b/content/CSE510/CSE510_L20.md
new file mode 100644
index 0000000..7a7f108
--- /dev/null
+++ b/content/CSE510/CSE510_L20.md
@@ -0,0 +1,143 @@
+# CSE510 Deep Reinforcement Learning (Lecture 20)
+
+## Exploration in RL
+
+### Motivations
+
+#### Exploration vs. Exploitation Dilemma
+
+Online decision-making involves a fundamental choice:
+
+- Exploration: trying out new things (new behaviors), in the hope of discovering higher rewards
+- Exploitation: doing what you know will yield the highest reward
+
+The best long-term strategy may involve short-term sacrifices.
+
+Gather enough knowledge early to make the best long-term decisions.
+
+Examples:
+
+Restaurant Selection
+
+- Exploitation: Go to your favorite restaurant
+- Exploration: Try a new restaurant
+
+Oil Drilling
+
+- Exploitation: Drill at the best known location
+- Exploration: Drill at a new location
+
+Game Playing
+
+- Exploitation: Play the move you believe is best
+- Exploration: Play an experimental move
+
+#### Breakout vs. Montezuma's Revenge
+
+| Property | Breakout | Montezuma's Revenge |
+|----------|----------|---------------------|
+| **Reward frequency** | Dense (every brick hit gives points) | Extremely sparse (only after collecting a key or treasure) |
+| **State space** | Simple (ball, paddle, bricks) | Complex (many rooms, objects, ladders, timing) |
+| **Action relevance** | Almost any action affects reward soon | Most actions have no immediate feedback |
+| **Exploration depth** | Shallow (few steps to reward) | Deep (dozens or hundreds of steps before reward) |
+| **Determinism** | Mostly deterministic dynamics | Deterministic, but requires long sequences of precise actions |
+| **Credit assignment** | Easy (short time gap) | Very hard (long delay from cause to effect) |
+
+#### Motivation
+
+- Motivation: "forces" that energize an organism to act and that direct its activity
+- Extrinsic motivation: being motivated to do something because of some external reward ($, a prize, food, water, etc.)
+- Intrinsic motivation: being motivated to do something because it is inherently enjoyable (curiosity, exploration, novelty, surprise, incongruity, complexity, ...)
+
+### Intuitive Exploration Strategy
+
+- Intrinsic motivation drives the exploration of unknowns.
+- Intuitively, we explore efficiently once we know what we do not know, and target our exploration efforts at the unknown part of the space.
+- All non-naive exploration methods rely on some form of uncertainty estimation: over the states (or state-action pairs) visited so far, over the transition dynamics, or over the Q-functions.
+
+- Optimal methods for small settings do not work directly at scale, but they can inspire methods for larger settings.
+- These larger-scale methods may rely on some hacks (heuristic approximations).
+
+### Classes of Exploration Methods in Deep RL
+
+- Optimistic exploration
+  - Uncertainty about states
+  - Visiting novel states (state visitation counting)
+- Information state search
+  - Uncertainty about state transitions or dynamics
+  - Dynamics prediction error or information gain for dynamics learning
+- Posterior sampling
+  - Uncertainty about Q-value functions or policies
+  - Selecting actions according to the probability that they are best
+
+### Optimistic Exploration
+
+#### Count-Based Exploration in Small MDPs
+
+Book-keep state visitation counts $N(s)$ and add exploration reward bonuses that encourage policies to visit states with lower counts:
+
+$$
+R(s,a,s') = r(s,a,s') + \mathcal{B}(N(s))
+$$
+
+where $\mathcal{B}(N(s))$ is the intrinsic exploration reward bonus.
+
+- UCB: $\mathcal{B}(N(s)) = \sqrt{\frac{2\ln n}{N(s)}}$ (more aggressive exploration)
+- MBIE-EB (Strehl & Littman): $\mathcal{B}(N(s)) = \sqrt{\frac{1}{N(s)}}$
+- BEB (Kolter & Ng): $\mathcal{B}(N(s)) = \frac{1}{N(s)}$
+
+- We want to come up with something that rewards states that we have not visited often.
+- But in large MDPs, we rarely visit a state twice!
+- We need to capture a notion of state similarity, and reward states that are most dissimilar to what we have seen so far
+  - as opposed to merely different (new states will always be different).
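+
+A minimal tabular sketch of the bonus above, assuming the UCB-style $\mathcal{B}$ from the list; the counters and helper names (`ucb_bonus`, `shaped_reward`) are illustrative, not from the lecture:
+
+```python
+import math
+from collections import defaultdict
+
+# Simple tabular bookkeeping: N(s) per state and the total number of visits n.
+visit_counts = defaultdict(int)   # N(s)
+total_visits = 0                  # n
+
+def ucb_bonus(n_s: int, n_total: int) -> float:
+    """UCB-style bonus B(N(s)) = sqrt(2 ln n / N(s)); large when N(s) is small."""
+    return math.sqrt(2.0 * math.log(max(n_total, 1)) / max(n_s, 1))
+
+def shaped_reward(state, reward: float) -> float:
+    """Return r(s,a,s') + B(N(s')) and update the counts for the visited state."""
+    global total_visits
+    visit_counts[state] += 1
+    total_visits += 1
+    return reward + ucb_bonus(visit_counts[state], total_visits)
+
+# The rarely visited state s1 receives a larger bonus than the repeated s0.
+for s, r in [("s0", 0.0), ("s0", 0.0), ("s1", 0.0)]:
+    print(s, shaped_reward(s, r))
+```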
+
+#### Fitting Generative Models
+
+Idea: fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$).
+
+$p_\theta(s)$ might be high even for a new $s$ if $s$ is similar to previously seen states. Can we use $p_\theta(s)$ to get a "pseudo-count" for $s$?
+
+In a small MDP, the true visitation probability is
+
+$$
+P(s)=\frac{N(s)}{n}
+$$
+
+where $N(s)$ is the number of times $s$ has been visited and $n$ is the total number of state visits.
+
+After we visit $s$ once more,
+
+$$
+P'(s)=\frac{N(s)+1}{n+1}
+$$
+
+This suggests the following procedure:
+
+1. fit a model $p_\theta(s)$ to all states $\mathcal{D}$ seen so far.
+2. take a step $i$ and observe $s_i$.
+3. fit a new model $p_{\theta'}(s)$ to $\mathcal{D} \cup \{s_i\}$.
+4. use $p_\theta(s_i)$ and $p_{\theta'}(s_i)$ to estimate a pseudo-count $\hat{N}(s_i)$.
+5. set $r_i^+ = r_i + \mathcal{B}(\hat{N}(s_i))$.
+6. go to 1.
+
+How do we get $\hat{N}(s_i)$? Use the two equations
+
+$$
+p_\theta(s_i)=\frac{\hat{N}(s_i)}{\hat{n}}\quad p_{\theta'}(s_i)=\frac{\hat{N}(s_i)+1}{\hat{n}+1}
+$$
+
+Solving them for the two unknowns gives
+
+$$
+\hat{n}=\frac{1-p_{\theta'}(s_i)}{p_{\theta'}(s_i)-p_\theta(s_i)},\qquad \hat{N}(s_i)=\hat{n}\,p_\theta(s_i)
+$$
+
+[Link to the paper (Bellemare et al., Unifying Count-Based Exploration and Intrinsic Motivation)](https://arxiv.org/pdf/1606.01868)
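+
+A small sketch of step 4 under the equations above; the function names and the example density values are illustrative, and the inputs are assumed to be the density-model queries $p_\theta(s_i)$ and $p_{\theta'}(s_i)$:
+
+```python
+def pseudo_count(p: float, p_prime: float) -> float:
+    """Solve p = N/n and p' = (N+1)/(n+1) for the pseudo-count N.
+
+    p       : p_theta(s_i), density of s_i before updating on s_i
+    p_prime : p_theta'(s_i), density of s_i after updating on s_i
+    Assumes p_prime > p, i.e. the model assigns more mass to s_i after seeing it.
+    """
+    n_hat = (1.0 - p_prime) / (p_prime - p)   # total pseudo visit count
+    return p * n_hat                          # pseudo-count of s_i
+
+def exploration_bonus(p: float, p_prime: float) -> float:
+    """MBIE-EB-style bonus B(N) = 1 / sqrt(N), computed from the pseudo-count."""
+    return 1.0 / pseudo_count(p, p_prime) ** 0.5
+
+# A familiar state (density barely changes) gets a small bonus ...
+print(exploration_bonus(p=0.100, p_prime=0.101))
+# ... while a novel state (density jumps after one update) gets a large bonus.
+print(exploration_bonus(p=0.001, p_prime=0.010))
+```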
+
+#### Density Models
+
+[Link to the paper (Ostrovski et al., Count-Based Exploration with Neural Density Models)](https://arxiv.org/pdf/1703.01310)
+
+#### State Counting with Deep Hashing
+
+- We still count states (images), but in a compressed latent space rather than in pixel space.
+- Compress $s$ into a latent code, then count occurrences of the code.
+- How do we get the image encoding? For example, using autoencoders.
+- There is no guarantee that such a reconstruction loss will capture the important features that make two states similar.
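+
+A sketch of counting latent codes, using a SimHash-style random projection as the encoder (an illustrative assumption; an autoencoder's latent code could be hashed and counted the same way). The class and parameter names (`HashCounter`, `k`, `beta`) are made up:
+
+```python
+import numpy as np
+from collections import defaultdict
+
+class HashCounter:
+    """Count visits to discretized codes instead of raw states.
+
+    Project a feature vector with a fixed random matrix and keep only the
+    signs, so similar states tend to collide into the same k-bit code.
+    """
+
+    def __init__(self, feature_dim: int, k: int = 16, seed: int = 0):
+        rng = np.random.default_rng(seed)
+        self.A = rng.standard_normal((k, feature_dim))  # fixed random projection
+        self.counts = defaultdict(int)                  # N(code)
+
+    def _code(self, features: np.ndarray) -> bytes:
+        return np.packbits(self.A @ features > 0).tobytes()
+
+    def bonus(self, features: np.ndarray, beta: float = 1.0) -> float:
+        """Update the count of this state's code and return beta / sqrt(N(code))."""
+        code = self._code(features)
+        self.counts[code] += 1
+        return beta / np.sqrt(self.counts[code])
+
+# `features` could be a flattened observation or an autoencoder latent code.
+counter = HashCounter(feature_dim=64)
+x = np.random.default_rng(1).standard_normal(64)
+print(counter.bonus(x))          # first visit to this code: bonus = 1.0
+print(counter.bonus(x + 1e-3))   # a near-duplicate usually maps to the same code
+```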
\ No newline at end of file
diff --git a/content/CSE510/_meta.js b/content/CSE510/_meta.js
index b8071df..43bbfc8 100644
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -22,4 +22,5 @@ export default {
   CSE510_L17: "CSE510 Deep Reinforcement Learning (Lecture 17)",
   CSE510_L18: "CSE510 Deep Reinforcement Learning (Lecture 18)",
   CSE510_L19: "CSE510 Deep Reinforcement Learning (Lecture 19)",
+  CSE510_L20: "CSE510 Deep Reinforcement Learning (Lecture 20)",
 }
\ No newline at end of file
diff --git a/content/CSE5519/CSE5519_B4.md b/content/CSE5519/CSE5519_B4.md
index 74ad181..18221d7 100644
--- a/content/CSE5519/CSE5519_B4.md
+++ b/content/CSE5519/CSE5519_B4.md
@@ -14,4 +14,4 @@ This paper shows that the visual instruction tuning can improve the performance
 
 >[!TIP]
 >
-> This paper shows that LLaVA-1.5 obeys the scaling law and splitting the high resolution images into grids to maintain the data efficiency. I wonder why this method is not applicable to multi-image understanding tasks? Why we cannot assign index embeddings to each image and push the image sets to the model for better understanding?
\ No newline at end of file
+> This paper shows that LLaVA-1.5 obeys the scaling law and splits high-resolution images into grids to maintain data efficiency. I wonder why this method is not applicable to multi-image understanding tasks. Why can't we assign index embeddings to each image and feed the image set to the model for better understanding? What are the technical challenges in implementing this idea?
\ No newline at end of file