Zheyuan Wu
2025-10-28 11:20:24 -05:00
parent e4490f6fa2
commit 361745d658
4 changed files with 199 additions and 10 deletions

View File

@@ -1,14 +1,47 @@
# CSE510 Deep Reinforcement Learning (Lecture 17)
## Model-based RL
### Model-based RL vs. Model-free RL
## Why Model-Based RL?
- Sample efficiency
- Generalization and transferability
- Support efficient exploration in large-scale RL problems
- Explainability
- Super-human performance in practice
- Video games, Go, Algorithm discovery, etc.
> [!NOTE]
>
> A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition $T(s'\mid s, a)$ and the reward $R(s, a)$.
For ADP-based (model-based) RL
1. Start with initial model
2. Solve for optimal policy given current model
- (using value or policy iteration)
3. Take action according to an exploration/exploitation policy
- Explores more early on and gradually uses policy from 2
4. Update estimated model based on observed transition
5. Goto 2
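A minimal sketch of this loop on a toy two-state MDP; the hidden environment (`TRUE_T`, `TRUE_R`), the count-based model estimate, and the decaying $\epsilon$-greedy schedule are illustrative assumptions, not from the lecture.
```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; its dynamics are hidden from the agent.
rng = np.random.default_rng(0)
N_S, N_A, GAMMA = 2, 2, 0.9
TRUE_T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # TRUE_T[s, a, s']
                   [[0.7, 0.3], [0.1, 0.9]]])
TRUE_R = np.array([[0.0, 0.1], [0.2, 1.0]])    # TRUE_R[s, a]

# 1. Start with an initial model (Laplace-smoothed counts).
counts = np.ones((N_S, N_A, N_S))
reward_sum, reward_cnt = np.zeros((N_S, N_A)), np.ones((N_S, N_A))

def value_iteration(T, R, iters=100):
    """2. Solve for the optimal policy given the current model."""
    V = np.zeros(N_S)
    for _ in range(iters):
        Q = R + GAMMA * T @ V          # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

s = 0
for step in range(2000):
    T_hat = counts / counts.sum(axis=2, keepdims=True)
    policy = value_iteration(T_hat, reward_sum / reward_cnt)

    # 3. Explore more early on, gradually following the planned policy.
    eps = max(0.05, 1.0 - step / 1000)
    a = rng.integers(N_A) if rng.random() < eps else policy[s]
    s_next = rng.choice(N_S, p=TRUE_T[s, a])

    # 4. Update the estimated model from the observed transition, then 5. go to 2.
    counts[s, a, s_next] += 1
    reward_sum[s, a] += TRUE_R[s, a]
    reward_cnt[s, a] += 1
    s = s_next

print("learned policy:", value_iteration(counts / counts.sum(axis=2, keepdims=True),
                                          reward_sum / reward_cnt))
```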
### Large Scale Model-Based RL
- New optimal planning methods (Today)
- Model is large and not perfect
- Model learning (Next Lecture)
- Requiring generalization
- Exploration/exploitation strategy (Next week)
- Requiring generalization and attention
## Model-based RL
### Deterministic Environment: Cross-Entropy Method
@@ -29,12 +62,14 @@ Simplest method: guess and check: "random shooting method"
- pick $A_1, A_2, ..., A_n$ from some distribution (e.g. uniform)
- Choose $A_i$ based on $\argmax_i J(A_i)$
#### Cross-Entropy Method (CEM) with continuous-valued inputs
1. Sample $A_1, A_2, ..., A_n$ from some distribution $p(A)$
2. Evaluate $J(A_1), J(A_2), ..., J(A_n)$
3. Pick the _elites_ $A_1, A_2, ..., A_m$ with the highest $J(A_i)$, where $m<n$
4. Update the distribution $p(A)$ to be more likely to choose the elites
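A minimal numpy sketch of CEM over open-loop action sequences; the quadratic objective `J`, the horizon, and the hyperparameters are placeholder assumptions (in model-based RL, `J(A)` would roll the sequence through the learned model or simulator).
```python
import numpy as np

rng = np.random.default_rng(0)
HORIZON, N_SAMPLES, N_ELITES, N_ITERS = 10, 64, 8, 20

def J(A):
    """Placeholder objective: reward for matching a target action sequence."""
    target = np.linspace(-1.0, 1.0, HORIZON)
    return -np.sum((A - target) ** 2)

# p(A): a factorized Gaussian over the whole action sequence.
mu, sigma = np.zeros(HORIZON), np.ones(HORIZON)

for _ in range(N_ITERS):
    A = rng.normal(mu, sigma, size=(N_SAMPLES, HORIZON))         # 1. sample A_1..A_n from p(A)
    scores = np.array([J(a) for a in A])                         # 2. evaluate J(A_i)
    elites = A[np.argsort(scores)[-N_ELITES:]]                   # 3. pick the elites with the highest J
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6   # 4. refit p(A) to the elites

print("planned action sequence:", np.round(mu, 2))
```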
Pros:
@@ -68,15 +103,70 @@ Use model as simulator to evaluate actions.
Tree policy:
- Select/create a leaf node
- Selection and Expansion
- Bandit problem!
Default policy (rollout policy):
- Play the game till the end
- Simulation
Decision policy:
- Selecting the final action
#### Upper Confidence Bound on Trees (UCT)
Selecting a child node is a multi-armed bandit problem.
UCB1 is applied for each child selection:
$$
UCT=\overline{X_j}+2C_p\sqrt{\frac{2\ln n}{n_j}}
$$
- where $\overline{X_j}$ is the mean reward of selecting this position, normalized to $[0,1]$
- $n$ is the number of times the current (parent) node has been visited
- $n_j$ is the number of times child node $j$ has been visited
- each child node is guaranteed to be explored at least once
- $C_p$ is some constant $>0$
Each child has a non-zero probability of being selected.
We can adjust $C_p$ to change the exploration vs. exploitation trade-off.
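A small sketch of UCB1-style child selection using the (corrected) formula above; the dict-based node representation and the choice $C_p = 1/\sqrt{2}$ are assumptions for illustration.
```python
import math

def uct_score(child_value_sum, child_visits, parent_visits, c_p=1 / math.sqrt(2)):
    """Mean child reward (assumed normalized to [0, 1]) plus the exploration bonus.
    Unvisited children get an infinite score, so each child is tried at least once."""
    if child_visits == 0:
        return float("inf")
    x_bar = child_value_sum / child_visits
    return x_bar + 2 * c_p * math.sqrt(2 * math.log(parent_visits) / child_visits)

def select_child(children, parent_visits):
    """Tree policy step: pick the child maximizing the UCT score."""
    return max(children, key=lambda c: uct_score(c["value_sum"], c["visits"], parent_visits))

# Tiny usage example with dict-based nodes (a stand-in for a real tree-node class).
children = [{"value_sum": 3.0, "visits": 5},
            {"value_sum": 1.0, "visits": 1},
            {"value_sum": 0.0, "visits": 0}]
print(select_child(children, parent_visits=6))   # the unvisited child is selected first
```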
#### Decision Policy: Final Action Selection
Selecting the best child
- Max (highest weight)
- Robust (most visits)
- Max-Robust (max of the two)
#### Advantages and disadvantages of MCTS
Advantages:
- MCTS is proven to converge to the minimax solution
- Domain-independent
- Anytime algorithm
- Performs well even with a large branching factor
Disadvantages:
- Basic version converges very slowly
- This can lead to small-probability failures (rare but critical moves may be missed)
### Example usage of MCTS
AlphaGo vs Lee Sedol, Game 4
- White 78 (Lee): an unexpected move that even other professional players didn't see coming; a needle in the haystack
- AlphaGo failed to explore this in MCTS
Imitation learning from MCTS:
#### Continuous Case: Trajectory Optimization

View File

@@ -0,0 +1,65 @@
# CSE510 Deep Reinforcement Learning (Lecture 18)
## Model-based RL framework
Model Learning with High-Dimensional Observations
- Learning model in a latent space with observation reconstruction
- Learning model in a latent space without observation reconstruction
- Learning model in the observation space (i.e., videos)
### Naive approach:
If we knew $f(s_t,a_t)=s_{t+1}$ (or $p(s_{t+1}\mid s_t, a_t)$ in the stochastic case), we could use the tools from last week.
So we can learn $f(s_t,a_t)$ from data, and _then_ plan through it.
Model-based reinforcement learning version **0.5**:
1. Run base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
3. Plan through $f(s_t,a_t)$ to choose action $a_t$
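A minimal sketch of version 0.5 on a hypothetical 1-D point-mass task; the environment, the least-squares dynamics model, and the random-shooting planner are all illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
HORIZON, N_CANDIDATES = 15, 256

def env_step(s, a):
    """Hidden true dynamics of a toy 1-D point mass (illustrative only)."""
    return s + 0.1 * a + 0.01 * rng.normal()

# 1. Run a random base policy to collect D = {(s, a, s')}.
data, s = [], 0.0
for _ in range(500):
    a = rng.uniform(-1, 1)
    s_next = env_step(s, a)
    data.append((s, a, s_next))
    s = s_next

# 2. Fit f(s, a) ~ s' by least squares on the features [s, a, 1].
X = np.array([[s_i, a_i, 1.0] for s_i, a_i, _ in data])
y = np.array([s_next for _, _, s_next in data])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def f(s, a):
    return np.array([s, a, 1.0]) @ w

# 3. Plan through f with random shooting: drive the state toward the goal s* = 1.
def plan(s0, goal=1.0):
    best_seq, best_cost = None, np.inf
    for _ in range(N_CANDIDATES):
        seq = rng.uniform(-1, 1, size=HORIZON)
        s_model, cost = s0, 0.0
        for a in seq:
            s_model = f(s_model, a)
            cost += (s_model - goal) ** 2
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

print("first planned action:", round(plan(0.0)[0], 3))
```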
Sometimes, it does work!
- Essentially how system identification works in classical robotics
- Some care should be taken to design a good base policy
- Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
However, the distribution mismatch problem becomes worse as we use more expressive model classes.
Version 0.5: collect random samples, train dynamics, plan
- Pro: simple, no iterative procedure
- Con: distribution mismatch problem
Version 1.0: iteratively collect data, replan, collect data
- Pro: simple, solves distribution mismatch
- Con: open loop plan might perform poorly, esp. in stochastic domains
Version 1.5: iteratively collect data using MPC (replan at each step)
- Pro: robust to small model errors
- Con: computationally expensive, but a planning algorithm is available
Version 2.0: backpropagate directly into policy
- Pro: computationally cheap at runtime
- Con: can be numerically unstable, especially in stochastic domains
- Solution: model-free RL + model-based RL
Final version:
1. Run base policy $\pi_0$ (e.g. a random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
3. Backpropagate through $f(s_t,a_t)$ into the policy to optimize $\pi_\theta(s_t,a_t)$
4. Run the policy $\pi_\theta(s_t,a_t)$ to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
5. Goto 2
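A compact PyTorch sketch of this final loop; the toy dynamics, network sizes, rollout horizon, and the goal-reaching cost are all assumptions, and a deterministic policy is used for simplicity.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
HORIZON = 10

def env_step(s, a):
    """Hidden true dynamics of a toy 1-D task (illustrative only)."""
    return s + 0.1 * a + 0.01 * torch.randn_like(s)

dynamics = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))           # f(s, a) -> s'
policy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1), nn.Tanh())  # pi(s) -> a in [-1, 1]
opt_f = torch.optim.Adam(dynamics.parameters(), lr=1e-2)
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)

def collect(pi, n=256):
    """Run a policy (random for the first round) to collect (s, a, s')."""
    s = torch.randn(n, 1)
    with torch.no_grad():
        a = torch.rand(n, 1) * 2 - 1 if pi is None else pi(s)
    return s, a, env_step(s, a)

data = collect(None)                                    # 1. base (random) policy
for _ in range(20):
    s, a, s_next = data
    for _ in range(200):                                # 2. fit f to minimize ||f(s, a) - s'||^2
        opt_f.zero_grad()
        loss = ((dynamics(torch.cat([s, a], dim=-1)) - s_next) ** 2).mean()
        loss.backward()
        opt_f.step()

    for _ in range(100):                                # 3. backprop a model rollout into the policy
        opt_pi.zero_grad()
        s_roll, cost = torch.randn(64, 1), 0.0
        for _ in range(HORIZON):
            a_roll = policy(s_roll)
            s_roll = dynamics(torch.cat([s_roll, a_roll], dim=-1))
            cost = cost + ((s_roll - 1.0) ** 2).mean()  # cost: distance to goal state s* = 1
        cost.backward()
        opt_pi.step()

    data = collect(policy)                              # 4. collect fresh data, 5. go to 2
```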
## Model Learning with High-Dimensional Observations
- Learning model in a latent space with observation reconstruction
- Learning model in a latent space without observation reconstruction

View File

@@ -20,4 +20,5 @@ export default {
CSE510_L15: "CSE510 Deep Reinforcement Learning (Lecture 15)",
CSE510_L16: "CSE510 Deep Reinforcement Learning (Lecture 16)",
CSE510_L17: "CSE510 Deep Reinforcement Learning (Lecture 17)",
CSE510_L18: "CSE510 Deep Reinforcement Learning (Lecture 18)",
}

View File

@@ -1,2 +1,35 @@
# CSE5519 Advances in Computer Vision (Topic F: 2024: Representation Learning)
## Long-CLIP: Unlocking the long-text capability of CLIP
[link to the paper](https://arxiv.org/pdf/2403.15378)
### Novelty in Long-CLIP
1. a **knowledge-preserving** stretching of positional embeddings
2. a **primary component matching** of CLIP features.
### Knowledge-preserving stretching of positional embeddings
Retain the embeddings of the first 20 positions; for the remaining 57 positions (most training texts are shorter than 77 tokens), stretch them with a larger ratio $\lambda_2$ using linear interpolation.
$$
\operatorname{PE}^*(pos)=\begin{cases}
\operatorname{PE}(pos) & \text{if } pos \leq 20 \\
(1-\alpha)\times \operatorname{PE}(\lfloor \frac{pos}{\lambda_2}\rfloor) + \alpha \times \operatorname{PE}(\lceil \frac{pos}{\lambda_2}\rceil) & \text{if } pos > 20
\end{cases}
$$
where $\alpha=\frac{pos\%\lambda_2}{\lambda_2}$.
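A small numpy sketch of this stretching, implemented directly from the piecewise formula above; the kept prefix of 20 positions, $\lambda_2 = 4$, and the target length of $20 + 57 \times 4 = 248$ are taken as illustrative assumptions.
```python
import numpy as np

def stretch_positional_embeddings(pe, keep=20, lam2=4.0, new_len=248):
    """Knowledge-preserving stretching of a (77, d) positional-embedding table.
    Positions <= `keep` are copied unchanged; later positions are linearly
    interpolated between PE(floor(pos / lam2)) and PE(ceil(pos / lam2))."""
    pe_star = np.zeros((new_len, pe.shape[1]))
    for pos in range(new_len):
        if pos <= keep:
            pe_star[pos] = pe[pos]
        else:
            lo, hi = int(np.floor(pos / lam2)), int(np.ceil(pos / lam2))
            alpha = (pos % lam2) / lam2
            pe_star[pos] = (1 - alpha) * pe[lo] + alpha * pe[hi]
    return pe_star

pe = np.random.default_rng(0).normal(size=(77, 512))   # stand-in for CLIP's learned PE table
print(stretch_positional_embeddings(pe).shape)         # (248, 512)
```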
### Primary component matching of CLIP features
Do not train with long text alone, since this may decrease the performance of CLIP on short text.
Instead, use both fine-grained and coarse-grained components to match the CLIP features.
> [!TIP]
>
> This paper shows an interesting approach to increasing the long-text capability of CLIP. The authors use a knowledge-preserving stretching of positional embeddings and a primary component matching of CLIP features to achieve this.
>
> However, primary component matching is not a fully satisfying solution, as it may not capture the novelty in high-frequency components, for example, the texture of the main character's clothes when multiple textures exist in the image. How does the model align the features with the correct object in the description? Or does it simply assume that larger objects in the image are more important for the captioning task?