updates
content/CSE510/CSE510_L16.md
@@ -0,0 +1,150 @@
# CSE510 Deep Reinforcement Learning (Lecture 16)
## Deterministic Policy Gradient (DPG)
### Learning Deterministic Policies
- Deterministic policy gradients [Silver et al., ICML 2014]
  - Explicitly learn a deterministic policy $a = \mu_\theta(s)$ (see the sketch after this list).
- Advantages
  - An optimal deterministic policy exists for MDPs.
  - Naturally handles continuous action spaces.
  - Expected to be more efficient than learning stochastic policies:
    - Computing the stochastic gradient requires more samples, as it integrates over both the state and action spaces.
    - The deterministic gradient is preferable, as it integrates over the state space only.
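A minimal sketch of what such a deterministic policy network might look like, assuming a PyTorch setting, a small MLP with tanh-squashed outputs, and hypothetical state/action dimensions (none of this is prescribed by the lecture):

```python
import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    """mu_theta(s): maps a state directly to a continuous action (no sampling)."""
    def __init__(self, state_dim: int, action_dim: int, action_bound: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.action_bound = action_bound  # assumed symmetric action range [-bound, bound]

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.action_bound * self.net(state)

# Usage: a deterministic action for a hypothetical 3-dimensional state.
policy = DeterministicPolicy(state_dim=3, action_dim=1)
action = policy(torch.randn(1, 3))  # shape (1, 1)
```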
### Deterministic Policy Gradient
The objective function is:

$$
J(\theta)=\int_{s\in S} \rho^{\mu}(s)\, r(s,\mu_\theta(s))\, ds
$$

where $\rho^{\mu}(s)$ is the stationary state distribution induced by the policy $\mu_\theta$.

By the deterministic policy gradient theorem, the gradient is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta Q^{\mu_\theta}(s,\mu_\theta(s))\big]=\mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\big]
$$
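In practice this chain rule is usually applied implicitly via automatic differentiation: treat $-Q(s,\mu_\theta(s))$ as the actor loss and backpropagate through the critic into the policy parameters. A minimal sketch, assuming PyTorch, the `DeterministicPolicy` above, and a hypothetical `critic(s, a)` module approximating $Q^{\mu_\theta}$:

```python
import torch

def actor_gradient_step(policy, critic, states, actor_optimizer):
    """One deterministic policy gradient step.

    Backpropagating -Q(s, mu(s)) through the critic yields
    grad_theta mu_theta(s) * grad_a Q(s, a)|_{a = mu_theta(s)} by the chain rule.
    """
    actions = policy(states)                       # a = mu_theta(s), differentiable in theta
    actor_loss = -critic(states, actions).mean()   # maximizing Q == minimizing -Q
    actor_optimizer.zero_grad()
    actor_loss.backward()                          # autograd performs the chain rule
    actor_optimizer.step()                         # gradient ascent on J(theta)
```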
#### Issues for DPG
The formulation so far can only use on-policy data.

A deterministic policy can hardly guarantee sufficient exploration.

- Solution: off-policy training using a stochastic behavior policy.
#### Off-Policy Deterministic Policy Gradient (Off-DPG)
Use a stochastic behavior policy $\beta(a|s)$. The modified objective function is:

$$
J_\beta(\mu_\theta)=\int_{s\in S} \rho^{\beta}(s)\, Q^{\mu_\theta}(s,\mu_\theta(s))\, ds
$$

The gradient is:

$$
\begin{aligned}
\nabla_\theta J_\beta(\mu_\theta) &\approx \int_{s\in S} \rho^{\beta}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\, ds\\
&= \mathbb{E}_{s\sim \rho^{\beta}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\big]
\end{aligned}
$$

Importance sampling is avoided in the actor because the objective no longer involves an integral over actions.
#### Policy Evaluation in DPG
Importance sampling can also be avoided in the critic.

A gradient-TD-like algorithm can be applied directly to the critic, using a Q-learning-style target:

$$
\mathcal{L}_{critic}(w) = \mathbb{E}\big[\big(r_t+\gamma Q^w(s_{t+1},\mu_\theta(s_{t+1}))-Q^w(s_t,a_t)\big)^2\big]
$$
#### Off-Policy Deterministic Actor-Critic
$$
\delta_t=r_t+\gamma Q^w(s_{t+1},\mu_\theta(s_{t+1}))-Q^w(s_t,a_t)
$$

$$
w_{t+1} = w_t + \alpha_w \delta_t \nabla_w Q^w(s_t,a_t)
$$

$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^w(s_t,a)\vert_{a=\mu_\theta(s_t)}
$$
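A minimal sketch of these three updates, assuming PyTorch, the hypothetical `policy` / `critic` modules from the earlier sketches, and a batch of transitions `(s, a, r, s_next)` collected under some behavior policy; this is an illustration of the update structure, not the lecture's reference implementation:

```python
import torch
import torch.nn.functional as F

def off_policy_dac_step(policy, critic, batch, actor_optimizer, critic_optimizer, gamma=0.99):
    """One off-policy deterministic actor-critic update: critic (TD) first, then actor."""
    s, a, r, s_next = batch  # tensors of states, behavior actions, rewards, next states

    # Critic: squared TD error with a_{t+1} = mu_theta(s_{t+1}) supplied by the actor.
    with torch.no_grad():
        td_target = r + gamma * critic(s_next, policy(s_next))
    critic_loss = F.mse_loss(critic(s, a), td_target)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    # Actor: same autograd trick as before -- ascend grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)}.
    actor_loss = -critic(s, policy(s)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```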
### Deep Deterministic Policy Gradient (DDPG)
DDPG combines insights from DQN with deterministic policy gradients:

- Use a replay buffer.
- The critic is updated at every timestep on a minibatch sampled from the buffer:

$$
\mathcal{L}_{critic}(w) = \mathbb{E}\big[\big(r_t+\gamma Q^{w'}(s_{t+1},\mu_{\theta'}(s_{t+1}))-Q^w(s_t,a_t)\big)^2\big]
$$

where $w'$ and $\theta'$ are target-network parameters.

The actor is updated at every timestep with the sampled deterministic policy gradient:

$$
\nabla_\theta \mu_\theta(s_t)\, \nabla_a Q(s_t,a;w)\vert_{a=\mu_\theta(s_t)}
$$

The target networks are smoothed (soft-updated) at every timestep:

$$
w' \leftarrow \tau w + (1-\tau) w'
$$

$$
\theta' \leftarrow \tau \theta + (1-\tau) \theta'
$$

with $\tau \ll 1$.

Exploration: add noise to the action selection, $a_t = \mu_\theta(s_t) + \mathcal{N}_t$.

Batch normalization is used when training the networks.
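A minimal sketch of the DDPG-specific pieces: soft (Polyak) target updates and noisy action selection. This assumes PyTorch online/target networks and uses simple Gaussian noise where the original DDPG paper uses an Ornstein-Uhlenbeck process; the bounds and noise scale are illustrative assumptions:

```python
import copy
import torch

def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    """target <- tau * online + (1 - tau) * target, parameter by parameter."""
    with torch.no_grad():
        for p, p_target in zip(online.parameters(), target.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)

def noisy_action(policy, state, noise_std: float = 0.1, low: float = -1.0, high: float = 1.0):
    """a_t = mu_theta(s_t) + N_t, clipped to the (assumed) action bounds."""
    with torch.no_grad():
        a = policy(state)
        a = a + noise_std * torch.randn_like(a)   # exploration noise N_t
    return a.clamp(low, high)

# Target networks start as exact copies of the online networks, e.g.:
#   critic_target = copy.deepcopy(critic); policy_target = copy.deepcopy(policy)
# and are softly updated after every training step:
#   soft_update(critic, critic_target); soft_update(policy, policy_target)
```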
### Extension of DDPG
Overestimation bias is an issue in Q-learning in which the maximization of a noisy value estimate induces a consistent overestimation. DDPG inherits this problem, since the actor is trained to maximize the critic's (noisy) value estimate:

$$
\text{DDPG:}\quad \nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\big]
$$
#### Double DQN is not enough
Because the policy changes slowly in an actor-critic setting:

- the current and target value estimates remain too similar to avoid maximization bias;
- the Double DQN-style target, $r_t + \gamma Q^{w'}(s_{t+1},\mu_\theta(s_{t+1}))$, therefore provides little benefit.
#### TD3: Twin Delayed Deep Deterministic policy gradient
Addressing overestimation bias:

- Double Q-learning is unbiased in tabular settings, but still shows slight overestimation with function approximation. With two actor-critic pairs, the Double-Q-style targets are:

$$
y_1 = r + \gamma Q^{\theta_2'}(s', \pi_{\phi_1}(s'))
$$

$$
y_2 = r + \gamma Q^{\theta_1'}(s', \pi_{\phi_2}(s'))
$$

It is still possible that $Q^{\theta_2}(s, \pi_{\phi_1}(s)) > Q^{\theta_1}(s, \pi_{\phi_1}(s))$, so the overestimation is not fully removed.

Clipped double Q-learning instead takes the minimum of the two critics:

$$
y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_1}(s'))
$$
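A minimal sketch of the clipped double-Q target, assuming PyTorch and hypothetical target networks `policy_target`, `critic1_target`, `critic2_target`; TD3's other components (target-policy smoothing noise and delayed actor updates) are omitted here:

```python
import torch

def clipped_double_q_target(r, s_next, done, policy_target, critic1_target, critic2_target, gamma=0.99):
    """y = r + gamma * min_i Q_i'(s', pi'(s')); bootstrapping is cut at terminal states."""
    with torch.no_grad():
        a_next = policy_target(s_next)                    # action from the target actor
        q1 = critic1_target(s_next, a_next)
        q2 = critic2_target(s_next, a_next)
        y = r + gamma * (1.0 - done) * torch.min(q1, q2)  # pessimistic (clipped) bootstrap
    return y
```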
@@ -18,4 +18,5 @@ export default {
CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)",
CSE510_L14: "CSE510 Deep Reinforcement Learning (Lecture 14)",
CSE510_L15: "CSE510 Deep Reinforcement Learning (Lecture 15)",
CSE510_L16: "CSE510 Deep Reinforcement Learning (Lecture 16)",
}
content/CSE5313/CSE5313_L15.md
@@ -0,0 +1 @@
# CSE5313 Coding and information theory for data science (Lecture 15)
@@ -17,4 +17,5 @@ export default {
CSE5313_L12: "CSE5313 Coding and information theory for data science (Lecture 12)",
CSE5313_L13: "CSE5313 Coding and information theory for data science (Lecture 13)",
CSE5313_L14: "CSE5313 Coding and information theory for data science (Lecture 14)",
CSE5313_L15: "CSE5313 Coding and information theory for data science (Lecture 15)",
}
@@ -1,2 +1,26 @@
# CSE5519 Advances in Computer Vision (Topic F: 2023: Representation Learning)
## Self-supervised learning from images with a joint-embedding predictive architecture
[link to the paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Assran_Self-Supervised_Learning_From_Images_With_a_Joint-Embedding_Predictive_Architecture_CVPR_2023_paper.pdf)
### Novelty in Joint-Embedding Predictive Architecture
- Sample target blocks with sufficiently large scale
- Use sufficiently informative (sparsely distributed) context blocks to predict the target block
These ideas are combined with a Vision Transformer (ViT) backbone.

A motivation: representation learning in biological systems can be viewed as the adaptation of an internal model to predict responses to sensory input.

The I-JEPA model predicts the missing information in an abstract latent space.
A ViT predictor makes separate predictions for the different target blocks, similar to Masked Autoencoders (MAE). **However, the prediction is made on the abstract representation of the target block, which is produced by a target encoder.**

(Recall that in MAE, the prediction is made in the pixel space of the target block.)
> [!TIP]
>
> This paper presents a simple and effective self-supervised learning method for image representation learning. The key seems to be multi-block masking: learning the abstract representation of the target block given the representations of the context blocks.
>
> In the ablation study, the authors found that multi-block masking is more effective than single-block masking or random masking strategies. I wonder: if we increase the number of blocks used to predict the target block, will the performance continue to improve? Is the improvement mainly due to the fine-grained prediction of the target block that comes from learning its representation from the context blocks, or simply due to the larger context available to each ViT prediction? How is the consistency of the multi-block predictions guaranteed? I may have missed it, but if we enforce more consistency on the intersections of the multiple blocks used to predict the target block, will the performance continue to improve?
@@ -1,2 +1,8 @@
# CSE5519 Advances in Computer Vision (Topic I: 2023 - 2024: Embodied Computer Vision and Robotics)
## RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
[link to the paper](https://arxiv.org/abs/2307.15818)
### Novelty in RT-2