partial updates

This commit is contained in:
Trance-0
2025-11-18 13:25:21 -06:00
parent 946d0b605f
commit 9416bd4956
10 changed files with 1218 additions and 136 deletions

View File

@@ -1,146 +1,242 @@
# CSE510 Deep Reinforcement Learning (Lecture 21)
## Exploration in RL
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
### Information-Based Exploration (Intrinsic Curiosity)
Information-state search handles uncertainty about the state transitions or dynamics: exploration is driven by the dynamics prediction error or by the information gain from learning the dynamics.
#### Computational Curiosity
- "The direct goal of curiosity and boredom is to improve the world model."
- "Curiosity Unit": reward is a function of the mismatch between model's current predictions and actuality.
- There is positive reinforcement whenever the system fails to correctly predict the environment.
- Thus the usual credit assignment process ... encourages certain past actions in order to repeat situations similar to the mismatch situation. (planning to make your (internal) world model to fail)
- Curiosity encourages agents to seek experiences that better predict or explain the environment.
- A "curiosity unit" gives reward based on the mismatch between current model predictions and actual outcomes.
- Intrinsic reward is high when the agent's prediction fails, that is, when it encounters surprising outcomes.
- This yields positive intrinsic reinforcement when the internal predictive model errs, causing the agent to repeat actions that lead to prediction errors.
- The agent is effectively motivated to create situations where its model fails.
#### Model Prediction Error as Intrinsic Reward
- Add exploration reward bonuses that encourage the policy to visit states that will cause the prediction model to fail.
We augment the reward with an intrinsic bonus based on the model prediction error:
$$
R(s,a,s') = r(s,a,s') + \mathcal{B}(\|T(s,a;\theta)-s'\|)
$$
Parameter explanations:
- $s$: current state of the agent.
- $a$: action taken by the agent in state $s$.
- $s'$: next state resulting from executing action $a$ in state $s$.
- $r(s, a, s')$: extrinsic environment reward for the transition $(s, a, s')$.
- $T(s, a; \theta)$: learned dynamics model with parameters $\theta$ that predicts the next state.
- $\theta$: parameter vector of the predictive dynamics model $T$.
- $\|T(s, a; \theta) - s'\|$: prediction error magnitude between the predicted and the actual next state.
- $\mathcal{B}(\cdot)$: bonus function converting the prediction error magnitude into an intrinsic reward bonus.
- $R(s, a, s')$: total reward, the sum of the extrinsic reward and the intrinsic curiosity bonus.
Key ideas:
- The agent receives an intrinsic reward $\mathcal{B}(\|T(s, a; \theta) - s'\|)$ when the actual outcome differs from what its world model predicts.
- Initially many transitions are surprising, encouraging broad exploration.
- As the model improves, familiar transitions yield smaller error and smaller intrinsic reward, so exploration becomes focused on less-known parts of the state space.
- Exploration reward bonuses are therefore non-stationary: as the agent interacts with the environment, what is now new and novel becomes old and known, and previously novel states lose their intrinsic reward.
<details>
<summary>Example: Learning Visual Dynamics</summary>
[link to the paper](https://arxiv.org/pdf/1507.08750)
</details>
#### Avoiding Trivial Curiosity Traps
- However, a trivial solution exists for the raw prediction-error bonus $\mathcal{B}(s, a, s') = \|T(s, a; \theta) - s'\|$: the agent can collect reward by just moving around randomly.
- Naively defining $\mathcal{B}(s, a, s')$ directly in raw observation space leads to trivial curiosity traps:
  - The agent may purposely cause chaotic or noisy observations (like flickering pixels) that are impossible to predict.
  - The model cannot reduce prediction error on pure noise, so the agent is rewarded for meaningless randomness.
  - This yields high intrinsic reward without meaningful learning or progress toward task goals.
- One option is an exploration bonus computed on autoencoder features, $\mathcal{B}(s, a, s') = \|T(E(s;\theta),a;\theta)-E(s';\theta)\|$, but this suffers from the problem that the autoencoder reconstruction loss has little to do with our task.
To prevent this, we restrict prediction to a more informative feature space:
$$
\mathcal{B}(s, a, s') = \|T(E(s; \phi), a; \theta) - E(s'; \phi)\|
$$
[link to paper](https://ar5iv.labs.arxiv.org/html/1705.05363#:~:text=reward%20signal%20based%20on%20how,this%20feature%20space%20using%20self)
Parameter explanations:
- $E(s; \phi)$: learned encoder mapping the raw state $s$ into a feature vector.
- $\phi$: parameter vector of the encoder $E$.
- $T(E(s; \phi), a; \theta)$: forward model predicting the next feature representation from the encoded state and the action.
- $E(s'; \phi)$: encoded feature representation of the next state $s'$.
- $\mathcal{B}(s, a, s')$: intrinsic reward based on the prediction error in feature space.
Key ideas:
- The encoder $E(s; \phi)$ is trained so that the features capture aspects of the state that are controllable by the agent.
- One approach is to train $E$ via an inverse dynamics model that predicts $a$ from $(s, s')$.
- This encourages $E$ to keep only the information necessary to infer actions, discarding irrelevant noise.
- Measuring prediction error in feature space ignores unpredictable environmental noise.
- Intrinsic reward focuses on errors due to a lack of knowledge about the controllable dynamics.
- The agent's curiosity is directed toward aspects of the environment it can influence and learn.
A practical implementation is the Intrinsic Curiosity Module (ICM) by Pathak et al. (2017), sketched in code after the next subsection:
- The encoder $E$ and forward model $T$ are trained jointly.
- The loss includes both the forward prediction error and the inverse dynamics error.
- The intrinsic reward is set to the forward prediction error in feature space.
- This drives exploration of states where the agent cannot yet predict the effect of its actions.
#### Task Rewards vs. Exploration Rewards
Exploration reward bonuses:
$$
\mathcal{B}(s, a, s') = \|T(E(s;\phi),a;\theta)-E(s';\phi)\|
$$
Only task rewards:
$$
R(s,a,s') = r(s,a,s')
$$
Task + curiosity rewards:
$$
R^t(s,a,s') = r(s,a,s') + \mathcal{B}^t(s, a, s')
$$
Sparse task + curiosity rewards:
$$
R^t(s,a,s') = r^t(s,a,s') + \mathcal{B}^t(s, a, s')
$$
Only curiosity rewards:
$$
R^c(s,a,s') = \mathcal{B}^c(s, a, s')
$$
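The following is a minimal PyTorch sketch of the feature-space curiosity bonus and the ICM idea from the "Avoiding Trivial Curiosity Traps" subsection above; the layer sizes, `obs_dim`, `n_actions`, and the loss weighting `beta` are illustrative assumptions rather than values from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, feat_dim = 16, 4, 32  # hypothetical sizes

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))                     # E(.; phi)
forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 64), nn.ReLU(), nn.Linear(64, feat_dim))  # T(.; theta)
inverse_model = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

opt = torch.optim.Adam(
    [*encoder.parameters(), *forward_model.parameters(), *inverse_model.parameters()], lr=1e-3
)

def icm_step(s, a, s_next, beta=0.2):
    """Return the intrinsic reward B(s,a,s') and update the encoder/forward/inverse models."""
    a_onehot = F.one_hot(a, n_actions).float()
    phi_s, phi_s_next = encoder(s), encoder(s_next)
    # Forward model in feature space; features are detached so only the inverse loss shapes the encoder.
    phi_pred = forward_model(torch.cat([phi_s.detach(), a_onehot], dim=-1))
    forward_err = 0.5 * (phi_pred - phi_s_next.detach()).pow(2).sum(dim=-1)
    # Inverse model: predict a from (phi(s), phi(s')), which keeps only controllable information.
    logits = inverse_model(torch.cat([phi_s, phi_s_next], dim=-1))
    inverse_loss = F.cross_entropy(logits, a)
    loss = beta * forward_err.mean() + (1 - beta) * inverse_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return forward_err.detach()  # per-transition intrinsic reward

# Example with a random batch of transitions.
s = torch.randn(8, obs_dim); a = torch.randint(0, n_actions, (8,)); s_next = torch.randn(8, obs_dim)
r_intrinsic = icm_step(s, a, s_next)
```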
#### Limitation of Prediction Errors
- The agent keeps being rewarded even when the model cannot improve.
- So it will focus on parts of the environment that are inherently unpredictable or stochastic.
- Example: the noisy-TV problem.
  - The agent is attracted forever to the noisiest states, whose outcomes are unpredictable.
#### Random Network Distillation (RND)
Random Network Distillation (RND) provides a simpler curiosity bonus without learning a dynamics model.
- Original idea: predict the output of a fixed, randomly initialized neural network on the next state, given the current state and action.
- New idea: predict the output of a fixed, randomly initialized neural network on the next state, given the **next state itself**.
Basic idea:
- Use a fixed random neural network $f_{\text{target}}$ that maps states to feature vectors; the target network has fixed, randomized weights and is never trained.
- Train a predictor network $f_{\text{pred}}$ to approximate $f_{\text{target}}$ on visited states.
- The intrinsic reward is the prediction error between $f_{\text{pred}}(s)$ and $f_{\text{target}}(s)$.
Typical form of the intrinsic reward:
$$
r^{\text{int}}(s) = \|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}
$$
Parameter explanations:
- $f_{\text{target}}$: fixed random neural network generating target features for each state.
- $f_{\text{pred}}(s; \psi)$: trainable predictor network with parameters $\psi$.
- $\psi$: parameter vector of the predictor network.
- $s$: state input to both networks.
- $\|f_{\text{pred}}(s; \psi) - f_{\text{target}}(s)\|^{2}$: squared error between predictor and target features.
- $r^{\text{int}}(s)$: intrinsic reward based on the prediction error in the random feature space.
Key properties:
> The more you visit a state, the smaller the loss becomes.
- For novel or rarely visited states, $f_{\text{pred}}$ has not yet learned to match $f_{\text{target}}$, so the error is high.
- For frequently visited states, the prediction error becomes small, and the intrinsic reward decays.
- The target network is random and fixed, so it does not adapt to the policy.
- This provides a stable novelty signal without explicit dynamics learning.
- RND achieves strong exploration performance in challenging environments, such as hard-exploration Atari games.
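A minimal PyTorch sketch of the RND bonus: a fixed random target network, a trained predictor, and the squared feature error as intrinsic reward. The layer sizes are made up, and the observation and reward normalization used in practice is omitted.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 16, 32  # hypothetical sizes

def make_net():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))

f_target = make_net()                      # fixed random target network
for p in f_target.parameters():
    p.requires_grad_(False)                # never trained
f_pred = make_net()                        # trainable predictor
opt = torch.optim.Adam(f_pred.parameters(), lr=1e-3)

def rnd_bonus(s):
    """Intrinsic reward r_int(s) = ||f_pred(s) - f_target(s)||^2, plus one predictor update."""
    err = (f_pred(s) - f_target(s)).pow(2).sum(dim=-1)   # per-state squared error
    opt.zero_grad(); err.mean().backward(); opt.step()
    return err.detach()

# Frequently visited states end up with a low bonus; novel states keep a high bonus.
states = torch.randn(8, obs_dim)
print(rnd_bonus(states))
```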
#### Efficacy of Curiosity-Driven Exploration
Empirical observations:
- Curiosity-driven intrinsic rewards often lead to significantly higher extrinsic returns in sparse-reward environments compared to agents trained only on extrinsic rewards.
- Intrinsic rewards act as a proxy objective that guides the agent toward interesting or informative regions of the state space.
- In some experiments, agents trained with only intrinsic rewards (no extrinsic reward during training) still learn behaviors that later achieve high task scores when extrinsic rewards are measured.
- Using random features for curiosity (as in RND) can perform nearly as well as using learned features in many domains.
- Simple surprise signals are often sufficient to drive effective exploration.
- Learned feature spaces may generalize better to truly novel scenarios but are not always necessary.
#### Intrinsic Reward RL is not New
Historical context:
- The concept of learning from intrinsic rewards alone is not new.
- Itti and Baldi (2005) studied "Bayesian surprise" as a driver of human attention.
- Schmidhuber (1991, 2010) formalized curiosity, creativity, and fun as intrinsic motivations in learning agents.
- Singh et al. (2004) proposed intrinsically motivated reinforcement learning frameworks.
- These early works laid the conceptual foundation for modern curiosity-driven deep RL methods.
References:
- Itti, L., Baldi, P. F.: Bayesian surprise attracts human attention. In: NIPS 2005, pp. 547-554 (2006).
- Schmidhuber, J.: Curious model-building control systems. In: IJCNN 1991, vol. 2, pp. 1458-1463 (1991).
- Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development 2(3), 230-247 (2010).
- Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS 2004 (2004).
- Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN 1995 (1995).
- Sun, Y., Gomez, F. J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011), http://arxiv.org/abs/1103.5708
For further reading on intrinsic curiosity methods:
- Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction", 2017.
- Burda et al., "Exploration by Random Network Distillation", 2018.
- Schmidhuber, "Formal Theory of Creativity, Fun, and Intrinsic Motivation", 2010.
## Exploration via Posterior Sampling
- Uncertainty about Q-value functions or policies.
- Select actions according to the probability that they are the best according to the current model.
- While optimistic and curiosity-bonus methods modify the reward function, posterior sampling approaches handle exploration by maintaining uncertainty over models or value functions and sampling from this uncertainty.
- These methods are rooted in Thompson Sampling and naturally balance exploration and exploitation.
### Exploration with Action Value Information
Count-based and curiosity-driven methods do not take the action-value information into account.
![Action Value Information](https://notenextra.trance-0.com/CSE510/Action_Value_Information.png)
> In this case, the optimal choice is action 1, but we keep exploring action 3 because it has the highest uncertainty, and it takes a long time to distinguish actions 1 and 2 since they have similar values.
### Posterior Sampling in Multi-Armed Bandits (Thompson Sampling)
In a multi-armed bandit problem (no state transitions), Thompson Sampling works as follows:
1. Maintain a prior and posterior distribution over the reward parameters for each arm.
2. At each time step, sample reward parameters for all arms from their current posterior.
3. Select the arm with the highest sampled mean reward.
4. Observe the reward, update the posterior, and repeat.
Intuition:
- Each action is selected with probability equal to the posterior probability that it is optimal.
- Arms with high uncertainty are more likely to be sampled as optimal in some posterior draws.
- Exploration arises naturally from uncertainty, without explicit epsilon-greedy noise or bonus terms.
- Over time, the posterior concentrates on the true reward means, and the algorithm shifts toward exploitation.
Theoretical properties:
- Thompson Sampling attains near-optimal regret bounds in many bandit settings.
- It often performs as well as or better than upper confidence bound algorithms in practice.
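A minimal NumPy sketch of Thompson Sampling on a Bernoulli bandit with Beta posteriors; the arm success probabilities `p_true` and the horizon are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.3, 0.5, 0.6])      # hypothetical unknown arm reward probabilities
alpha = np.ones(3)                      # Beta posterior: 1 + number of successes per arm
beta = np.ones(3)                       # Beta posterior: 1 + number of failures per arm

for t in range(2000):
    theta = rng.beta(alpha, beta)       # sample reward parameters for all arms from the posterior
    arm = int(np.argmax(theta))         # play the arm with the highest sampled mean
    reward = rng.random() < p_true[arm] # observe a Bernoulli reward
    alpha[arm] += reward                # update the posterior of the played arm
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))  # concentrates on the best arm over time
```

Each arm is pulled with probability equal to its posterior probability of being optimal, so exploration fades automatically as the posterior concentrates.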
### Posterior Sampling for Reinforcement Learning (PSRL)
In reinforcement learning with states and transitions, posterior sampling generalizes to sampling entire MDP models.
Posterior Sampling for Reinforcement Learning (PSRL) operates as follows:
1. Maintain a posterior distribution over environment dynamics and rewards, based on observed transitions.
2. At the beginning of an episode, sample an MDP model from this posterior.
3. Compute the optimal policy for the sampled MDP (for example, by value iteration).
4. Execute this policy in the real environment for the whole episode.
5. Use the observed transitions to update the posterior, then repeat.
Key advantages:
- The agent commits to a sampled model's policy for an extended duration, which induces deep exploration.
- If a sampled model is optimistic in unexplored regions, the corresponding policy will deliberately visit those regions.
- Exploration is coherent across time within an episode, unlike per-step randomization in epsilon-greedy.
- The method does not require ad hoc exploration bonuses; exploration is an emergent property of the posterior.
Challenges:
- Maintaining an exact posterior over high-dimensional MDPs is usually intractable.
- Practical implementations use approximations.
### Exploration via Posterior Sampling of Q Functions
- Represent a posterior distribution over Q functions, instead of a point estimate:
  1. Sample $Q\sim P(Q)$.
  2. Choose actions according to this $Q$ for one episode: $a=\arg\max_{a} Q(s,a)$.
  3. Update $P(Q)$ based on the sampled $Q$ and the collected experience tuples $(s,a,r,s')$.
- Then we do not need $\epsilon$-greedy for exploration: we get better exploration by representing uncertainty over Q.
- But how can we learn a distribution over Q functions, $P(Q)$, when the Q function is a deep neural network?
### Approximate Posterior Sampling with Ensembles (Bootstrapped DQN)
One answer is a bootstrap ensemble:
- Neural network ensembles: train multiple Q-function approximations, each on a different subset of the data.
  - This is computationally expensive.
- Neural network ensembles with a shared backbone: only the heads are trained on different subsets of the data.
A common approximate posterior method in deep RL is Bootstrapped DQN.
Basic idea:
- Train an ensemble of $K$ Q-networks (heads), $Q^{(1)}, \dots, Q^{(K)}$.
- Each head is trained on a different bootstrap sample or masked subset of experience.
- At the start of each episode, sample a head index $k$ uniformly from $\{1, \dots, K\}$.
- For the entire episode, act greedily with respect to $Q^{(k)}$.
Parameter definitions for the ensemble:
- $K$: number of Q-network heads in the ensemble.
- $Q^{(k)}(s, a)$: Q-value estimate for head $k$ at state-action pair $(s, a)$.
- $k$: index of the sampled head used for the current episode.
- $(s, a)$: state and action arguments to Q-value functions.
Implementation details:
- A shared feature backbone network processes state inputs, feeding into all heads.
- Each head has its own final layers, allowing diverse value estimates.
- Masking or bootstrapping assigns different subsets of transitions to different heads during training.
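A minimal PyTorch sketch of the shared-backbone, multi-head Q-network described above; the layer sizes and the number of heads `K` are illustrative assumptions, and training (masked TD updates per head) is omitted.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, K = 16, 4, 10   # hypothetical sizes; K = number of heads

class BootstrappedQNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())            # shared features
        self.heads = nn.ModuleList([nn.Linear(128, n_actions) for _ in range(K)])    # K value heads

    def forward(self, s, k):
        """Q-values from head k only, used for greedy acting within one episode."""
        return self.heads[k](self.backbone(s))

qnet = BootstrappedQNet()
k = torch.randint(0, K, ()).item()          # sample one head at the start of the episode
s = torch.randn(1, obs_dim)
action = qnet(s, k).argmax(dim=-1)          # act greedily w.r.t. the sampled head all episode
```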
Benefits:
- Each head approximates a different plausible Q-function, analogous to a sample from a posterior.
- When a head is optimistic about certain under-explored actions, its greedy policy will explore them deeply.
- Exploration behavior is temporally consistent within an episode.
- No modification of the reward function is required; exploration arises from policy randomization via multiple heads.
Comparison to epsilon-greedy:
- Epsilon-greedy adds per-step random actions, which can be inefficient for long-horizon exploration.
- Bootstrapped DQN commits to a strategy for an episode, enabling the agent to execute complete exploratory plans.
- This can dramatically increase the probability of discovering long sequences needed to reach sparse rewards.
Other approximate posterior approaches:
- Bayesian neural networks for Q-functions (explicit parameter distributions).
- Using Monte Carlo dropout at inference to sample Q-functions.
- Randomized prior functions added to Q-networks to maintain exploration.
Theoretical insights:
- Posterior sampling methods can enjoy strong regret bounds in some RL settings.
- They can have better asymptotic constants than optimism-based methods in certain problems.
- Coherent, temporally extended exploration is essential in environments with delayed rewards and complex goals.
For further reading:
- Osband et al., "Deep Exploration via Bootstrapped DQN", 2016.
- Osband and Van Roy, "Why Is Posterior Sampling Better Than Optimism for Reinforcement Learning?", 2017.
- Chapelle and Li, "An Empirical Evaluation of Thompson Sampling", 2011.
### Questions
- Why do PG methods implicitly support exploration?
- Is it sufficient? How can we improve their implicit exploration?
- What are the limitations of entropy regularization?
- How can we improve exploration for PG methods?
  - Intrinsic-motivation bonuses (e.g., RND)
  - Explicitly optimize per-state entropy in the return (e.g., SAC)
  - Hierarchical RL
  - Goal-conditioned RL
- What are potentially more effective exploration methods?
  - Knowledge-driven exploration
  - Model-based exploration

View File

@@ -1,50 +1,317 @@
# CSE510 Deep Reinforcement Learning (Lecture 22)
## Offline Reinforcement Learning: Introduction and Challenges
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
### Requirements for Current Successes
- Access to the environment model or a simulator.
- Exploration and trial-and-error are not costly.
### Background: Offline RL
- The success of modern machine learning rests on scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora).
- Reinforcement learning, in contrast, is an online learning paradigm.
  - Interaction is expensive and dangerous: healthcare, robotics, recommendation...
- Can we develop data-driven offline RL?
Offline reinforcement learning (offline RL), also called batch RL, aims to learn an optimal policy *without* interacting with the environment. Instead, the agent is given a fixed dataset of transitions collected by an unknown behavior policy.
### The Offline RL Dataset
- The policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which is collected by an _unknown behavior policy_ $\pi_\beta$; interaction is not allowed.
$$
\mathcal{D}=\{(s_i,a_i,s_i',r_i)\}_{i=1}^N
$$
Parameter explanations:
- $s_i$: state sampled from the behavior policy's state distribution, $s\sim d^{\pi_\beta}(s)$.
- $a_i$: action selected by the behavior policy, $a\sim \pi_\beta(a|s)$.
- $s_i'$: next state sampled from the environment dynamics, $s'\sim p(s'|s,a)$.
- $r_i$: reward observed for the transition, $r\gets r(s,a)$.
- $N$: total number of transitions in the dataset.
- $\mathcal{D}$: full offline dataset used for training.
The goal is to learn a new policy $\pi$ maximizing the expected discounted return using only $\mathcal{D}$:
$$
\max_\pi\sum_{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s),\,a_t\sim \pi(a|s)}[\gamma^t r(s_t,a_t)]
$$
Parameter explanations:
- $\pi$: policy we want to learn.
- $r(s,a)$: reward received for a state-action pair.
- $\gamma$: discount factor controlling the weight of future rewards.
- $T$: horizon or trajectory length.
### Key Challenge in Offline RL: Distribution Shift
What if we simply apply traditional reinforcement learning (bootstrapping)?
$$
Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')
$$
$$
\pi(s)=\arg\max_{a\in A} Q(s,a)
$$
But notice that the state-action distribution induced by the learned policy $\pi_f$ differs from that of the behavior policy:
$$
P_{\pi_\beta}(s,a)\neq P_{\pi_f}(s,a)
$$
### Why Offline RL Is Difficult
Offline RL is fundamentally harder than online RL because:
- The agent cannot try new actions to fix wrong value estimates.
- The policy may choose out-of-distribution actions not present in $D$.
- Q-value estimates for unseen actions can be arbitrarily incorrect.
- Bootstrapping on wrong Q-values can cause divergence.
This leads to two major failure modes:
1. **Distribution shift**: new policy actions differ from dataset actions.
2. **Extrapolation error**: the Q-function guesses values for unseen actions.
### Extrapolation Error Problem
In standard Q-learning, the Bellman backup is:
$$
Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')
$$
Parameter explanations:
- $Q(s,a)$: estimated value of taking action $a$ in state $s$.
- $\max_{a'}$: maximum over possible next actions.
- $a'$: candidate next action for evaluation in backup step.
If $a'$ was rarely or never taken in the dataset, $Q(s',a')$ is poorly estimated, so Q-learning bootstraps from invalid values, causing instability.
### Behavior Cloning (BC): The Safest Baseline
The simplest offline method is to imitate the behavior policy:
$$
\max_{\phi} \; \mathbb{E}_{(s,a) \sim D}[\log \pi_{\phi}(a|s)]
$$
Parameter explanations:
- $\phi$: neural network parameters of the cloned policy.
- $\pi_{\phi}$: learned policy approximating behavior policy.
- $\log \pi_{\phi}(a|s)$: log-likelihood of the dataset action; its negative is the training loss.
Pros:
- Does not suffer from extrapolation error.
- Extremely stable.
Cons:
- Cannot outperform the behavior policy.
- Ignores reward information entirely.
### Naive Offline Q-Learning Fails
Directly applying off-policy Q-learning on $D$ generally leads to:
- Overestimation of unseen actions.
- Divergence due to extrapolation error.
- Policies worse than behavior cloning.
## Strategies for Safe Offline RL
There are two primary families of solutions:
1. **Policy constraint methods**
2. **Conservative value estimation methods**
---
## 1. Policy Constraint Methods
These methods restrict the learned policy to stay close to the behavior policy so it does not take unsupported actions.
### Advantage Weighted Regression (AWR / AWAC)
Policy update:
$$
\pi(a|s) \propto \pi_{\beta}(a|s)\exp\left(\frac{1}{\lambda}A(s,a)\right)
$$
Parameter explanations:
- $\pi_{\beta}$: behavior policy used to collect the dataset.
- $A(s,a)$: advantage function derived from Q or V estimates.
- $\lambda$: temperature controlling strength of advantage weighting.
- $\exp(\cdot)$: positive weighting on high-advantage actions.
Properties:
- Uses advantages to filter good and bad actions.
- Improves beyond behavior policy while staying safe.
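A minimal PyTorch sketch of the advantage-weighted update for discrete actions: dataset actions are cloned with weights $\exp(A/\lambda)$. The critic that produces the advantages is assumed to exist elsewhere; the sizes, temperature, and weight clipping are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, lam = 16, 4, 1.0       # hypothetical sizes and temperature
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def awr_update(s, a, advantage):
    """Weighted behavior cloning: maximize E[ exp(A/lam) * log pi(a|s) ] over dataset actions."""
    log_probs = F.log_softmax(policy(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
    weights = torch.exp(advantage / lam).clamp(max=20.0)   # clip large weights for stability
    loss = -(weights * log_probs).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Batch from the offline dataset, with advantages assumed to come from a learned critic.
s = torch.randn(32, obs_dim); a = torch.randint(0, n_actions, (32,)); adv = torch.randn(32)
awr_update(s, a, adv)
```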
### Batch-Constrained Q-learning (BCQ)
BCQ constrains the policy using a generative model:
1. Train a VAE $G_{\omega}$ to model $a$ given $s$.
2. Train a small perturbation model $\xi$.
3. Limit the policy to $a = G_{\omega}(s) + \xi(s)$.
Parameter explanations:
- $G_{\omega}(s)$: VAE-generated action similar to data actions.
- $\omega$: VAE parameters.
- $\xi(s)$: small correction to generated actions.
- $a$: final policy action constrained near dataset distribution.
BCQ avoids selecting unseen actions and strongly reduces extrapolation.
### BEAR (Bootstrapping Error Accumulation Reduction)
BEAR adds explicit constraints:
$$
D_{MMD}\left(\pi(a|s), \pi_{\beta}(a|s)\right) < \epsilon
$$
Parameter explanations:
- $D_{MMD}$: Maximum Mean Discrepancy distance between action distributions.
- $\epsilon$: threshold restricting policy deviation from behavior policy.
BEAR controls distribution shift more tightly than BCQ.
---
## 2. Conservative Value Function Methods
These methods modify Q-learning so that Q-values of unseen actions are *underestimated*, preventing the policy from exploiting overestimated values.
### Conservative Q-Learning (CQL)
One formulation is:
$$
J(Q) = J_{TD}(Q) + \alpha\big(\mathbb{E}_{a\sim\pi(\cdot|s)}Q(s,a) - \mathbb{E}_{a\sim D}Q(s,a)\big)
$$
Parameter explanations:
- $J_{TD}$: standard Bellman TD loss.
- $\alpha$: weight of conservatism penalty.
- $\mathbb{E}_{a\sim\pi(\cdot|s)}$: expectation over policy-chosen actions.
- $\mathbb{E}_{a\sim D}$: expectation over dataset actions.
Effect:
- Increases Q-values of dataset actions.
- Decreases Q-values of out-of-distribution actions.
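A minimal PyTorch sketch of a CQL-style loss for discrete actions, using the common logsumexp instantiation of the term that pushes down policy-preferred Q-values; the coefficient $\alpha$, the small network, and the omission of a target network are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma, alpha = 16, 4, 0.99, 1.0   # hypothetical sizes and coefficients
qnet = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(qnet.parameters(), lr=3e-4)

def cql_loss(s, a, r, s_next, done):
    q = qnet(s)                                             # Q(s, .)
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s, a) for dataset actions
    with torch.no_grad():                                   # TD target (no target network, for brevity)
        target = r + gamma * (1 - done) * qnet(s_next).max(dim=-1).values
    td_loss = F.mse_loss(q_sa, target)
    # Conservative term: push down values the policy could exploit, push up dataset actions.
    conservative = (torch.logsumexp(q, dim=-1) - q_sa).mean()
    return td_loss + alpha * conservative

s = torch.randn(32, obs_dim); a = torch.randint(0, n_actions, (32,))
r = torch.randn(32); s_next = torch.randn(32, obs_dim); done = torch.zeros(32)
loss = cql_loss(s, a, r, s_next, done)
opt.zero_grad(); loss.backward(); opt.step()
```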
### Implicit Q-Learning (IQL)
IQL avoids constraints entirely by using expectile regression:
Value regression:
$$
V(s) = \arg\min_{v} \; \mathbb{E}\big[\rho_{\tau}(Q(s,a) - v)\big]
$$
Parameter explanations:
- $v$: scalar value estimate for state $s$.
- $\rho_{\tau}(x)$: expectile regression loss, $\rho_{\tau}(x) = |\tau - \mathbb{1}(x < 0)|\,x^{2}$.
- $\tau$: expectile parameter controlling conservatism.
- $Q(s,a)$: Q-value estimate.
Key idea:
- For $\tau < 1$, IQL reduces sensitivity to large (possibly incorrect) Q-values.
- Implicitly conservative without special constraints.
IQL often achieves state-of-the-art performance due to simplicity and stability.
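A minimal PyTorch sketch of the expectile regression step, assuming the asymmetric squared loss $\rho_{\tau}(u) = |\tau - \mathbb{1}(u<0)|\,u^{2}$ given above; the Q-value samples are made up.

```python
import torch

def expectile_loss(q_values, v, tau=0.7):
    """rho_tau(u) = |tau - 1(u < 0)| * u^2, with u = Q(s,a) - V(s)."""
    u = q_values - v
    weight = torch.abs(tau - (u < 0).float())   # residuals where Q > V are weighted more when tau > 0.5
    return (weight * u.pow(2)).mean()

# For tau > 0.5 the fitted V(s) moves toward an upper expectile of the dataset Q(s, a) values,
# approximating a max over in-distribution actions without querying out-of-distribution actions.
q = torch.tensor([1.0, 2.0, 5.0, 0.5])   # hypothetical Q(s, a) samples for one state
v = torch.tensor(1.5, requires_grad=True)
opt = torch.optim.Adam([v], lr=0.05)
for _ in range(200):
    loss = expectile_loss(q, v, tau=0.9)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(v))   # moves toward the upper end of the Q samples
```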
---
## Model-Based Offline RL
### Forward Model-Based RL
Train a dynamics model:
$$
p_{\theta}(s'|s,a)
$$
Parameter explanations:
- $p_{\theta}$: learned transition model.
- $\theta$: parameters of transition model.
We can generate synthetic transitions using $p_{\theta}$, but model error accumulates.
### Penalty-Based Model Approaches (MOPO, MOReL)
Add uncertainty penalty:
$$
r_{model}(s,a) = r(s,a) - \beta\, u(s,a)
$$
Parameter explanations:
- $r_{model}$: penalized reward for model rollouts.
- $u(s,a)$: model uncertainty estimate.
- $\beta$: penalty coefficient.
These methods limit exploration into unknown model regions.
---
## Reverse Model-Based Imagination (ROMI)
ROMI generates new training data by *backward* imagination.
### Reverse Dynamics Model
ROMI learns:
$$
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
$$
Parameter explanations:
- $\psi$: parameters of reverse dynamics model.
- $s_{t+1}$: later state.
- $a_{t}$: action taken leading to $s_{t+1}$.
- $s_{t}$: predicted predecessor state.
ROMI also learns a reverse policy for sampling likely predecessor actions.
### Reverse Imagination Process
Given a goal state $s_{g}$:
1. Sample $a_{t}$ from reverse policy.
2. Predict $s_{t}$ from reverse dynamics.
3. Form imagined transition $(s_{t}, a_{t}, s_{t+1})$.
4. Repeat to build longer imagined trajectories.
Benefits:
- Imagined transitions end in real states, ensuring grounding.
- Completes missing parts of dataset.
- Helps propagate reward backward reliably.
ROMI combined with conservative RL often outperforms standard offline methods.
---
## Summary of Lecture 22
Offline RL requires balancing:
- Improvement beyond dataset behavior.
- Avoiding unsafe extrapolation to unseen actions.
Three major families of solutions:
1. Policy constraints (BCQ, BEAR, AWR)
2. Conservative Q-learning (CQL, IQL)
3. Model-based conservatism and imagination (MOPO, MOReL, ROMI)
Offline RL is becoming practical for real-world domains such as healthcare, robotics, autonomous driving, and recommender systems.
---
## Recommended Screenshot Frames for Lecture 22
- Lecture 22, page 7: Offline RL diagram showing policy learning from fixed dataset, subsection "Offline RL Setting".
- Lecture 22, page 35: Illustration of dataset support vs policy action distribution, subsection "Strategies for Safe Offline RL".
---

View File

@@ -1,3 +1,177 @@
# CSE510 Deep Reinforcement Learning (Lecture 23)
## Offline Reinforcement Learning Part II: Advanced Approaches
> Due to a lapse in my attention, this lecture note was generated by ChatGPT as a continuation of the previous lecture note.
Lecture 23 continues with advanced topics in offline RL, expanding on model-based imagination methods and credit assignment structures relevant for offline multi-agent and single-agent settings.
## Reverse Model-Based Imagination (ROMI)
ROMI is a method for augmenting an offline dataset with additional transitions generated by imagining trajectories *backwards* from desirable states. Unlike forward model rollouts, backward imagination stays grounded in real data because imagined transitions always terminate in dataset states.
### Reverse Dynamics Model
ROMI learns a reverse dynamics model:
$$
p_{\psi}(s_{t} \mid s_{t+1}, a_{t})
$$
Parameter explanations:
- $p_{\psi}$: learned reverse transition model.
- $\psi$: parameter vector for the reverse model.
- $s_{t+1}$: next state (from dataset).
- $a_{t}$: action that hypothetically leads into $s_{t+1}$.
- $s_{t}$: predicted predecessor state.
ROMI also learns a reverse policy to sample actions that likely lead into known states:
$$
\pi_{rev}(a_{t} \mid s_{t+1})
$$
Parameter explanations:
- $\pi_{rev}$: reverse policy distribution.
- $a_{t}$: action sampled for backward trajectory generation.
- $s_{t+1}$: state whose predecessors are being imagined.
### Reverse Imagination Process
To generate imagined transitions:
1. Select a goal or high-value state $s_{g}$ from the offline dataset.
2. Sample $a_{t}$ from $\pi_{rev}(a_{t} \mid s_{g})$.
3. Predict $s_{t}$ from $p_{\psi}(s_{t} \mid s_{g}, a_{t})$.
4. Form an imagined transition $(s_{t}, a_{t}, s_{g})$.
5. Repeat backward to obtain a longer imagined trajectory.
Benefits:
- Imagined states remain grounded by terminating in real dataset states.
- Helps propagate reward signals backward through states not originally visited.
- Avoids runaway model error that occurs in forward model rollouts.
ROMI effectively fills in missing gaps in the state-action graph, improving training stability and performance when paired with conservative offline RL algorithms.
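A minimal Python sketch of the reverse imagination loop described above; `reverse_policy` and `reverse_dynamics` stand in for the learned models $\pi_{rev}$ and $p_{\psi}$ and are hypothetical callables here.

```python
import torch

def reverse_rollout(s_goal, reverse_policy, reverse_dynamics, horizon=5):
    """Generate imagined transitions (s_t, a_t, s_{t+1}) backward from a real goal state.

    reverse_policy(s_next) -> a_t and reverse_dynamics(s_next, a_t) -> s_t are assumed to be
    learned models; every imagined transition ends in a state anchored to the dataset.
    """
    transitions = []
    s_next = s_goal
    for _ in range(horizon):
        a = reverse_policy(s_next)             # sample an action likely to lead into s_next
        s_prev = reverse_dynamics(s_next, a)   # predict a predecessor state
        transitions.append((s_prev, a, s_next))
        s_next = s_prev                        # keep imagining further backward
    return transitions

# Toy stand-ins for the learned models (for illustration only).
reverse_policy = lambda s: torch.randn(2)
reverse_dynamics = lambda s, a: s - 0.1 * torch.randn_like(s)
imagined = reverse_rollout(torch.zeros(4), reverse_policy, reverse_dynamics)
```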
---
## Implicit Credit Assignment via Value Factorization Structures
Although initially studied for multi-agent systems, insights from value factorization also improve offline RL by providing structured credit assignment signals.
### Counterfactual Credit Assignment Insight
A factored value function structure of the form:
$$
Q_{tot}(s, a_{1}, \dots, a_{n}) = f(Q_{1}(s, a_{1}), \dots, Q_{n}(s, a_{n}))
$$
can implicitly implement counterfactual credit assignment.
Parameter explanations:
- $Q_{tot}$: global value function.
- $Q_{i}(s,a_{i})$: individual component value for agent or subsystem $i$.
- $f(\cdot)$: mixing function combining components.
- $s$: environment state.
- $a_{i}$: action taken by entity $i$.
In architectures designed for IGM (Individual-Global-Max) consistency, gradients backpropagated through $f$ isolate the marginal effect of each component. This implicitly gives each agent or subsystem a counterfactual advantage signal.
Even in single-agent structured RL, similar factorization structures allow credit to flow into components representing skills, modes, or action groups, enabling better temporal and structural decomposition.
---
## Model-Based vs Model-Free Offline RL
Lecture 23 contrasts model-based imagination (ROMI) with conservative model-free methods such as IQL and CQL.
### Forward Model-Based Rollouts
Forward imagination using a learned model:
$$
p_{\theta}(s'|s,a)
$$
Parameter explanations:
- $p_{\theta}$: learned forward dynamics model.
- $\theta$: parameters of the forward model.
- $s'$: predicted next state.
- $s$: current state.
- $a$: action taken in current state.
Problems:
- Forward rollouts drift away from dataset support.
- Model error compounds with each step.
- Leads to training instability if used without penalties.
### Penalty Methods (MOPO, MOReL)
Augmented reward:
$$
r_{model}(s,a) = r(s,a) - \beta u(s,a)
$$
Parameter explanations:
- $r_{model}(s,a)$: penalized reward for model-generated steps.
- $u(s,a)$: uncertainty score of model for state-action pair.
- $\beta$: penalty coefficient.
- $r(s,a)$: original reward.
These methods limit exploration into uncertain model regions.
### ROMI vs Forward Rollouts
- Forward methods expand state space beyond dataset.
- ROMI expands *backward*, staying consistent with known good future states.
- ROMI reduces error accumulation because future anchors are real.
---
## Combining ROMI With Conservative Offline RL
ROMI is typically combined with:
- CQL (Conservative Q-Learning)
- IQL (Implicit Q-Learning)
- BCQ and BEAR (policy constraint methods)
Workflow:
1. Generate imagined transitions via ROMI.
2. Add them to dataset.
3. Train Q-function or policy using conservative losses.
Benefits:
- Better coverage of reward-relevant states.
- Increased policy improvement over dataset.
- More stable Q-learning backups.
---
## Summary of Lecture 23
Key points:
- Offline RL can be improved via structured imagination.
- ROMI creates safe imagined transitions by reversing dynamics.
- Reverse imagination avoids pitfalls of forward model error.
- Factored value structures provide implicit counterfactual credit assignment.
- Combining ROMI with conservative learners yields state-of-the-art performance.
---
## Recommended Screenshot Frames for Lecture 23
- Lecture 23, page 20: ROMI concept diagram depicting reverse imagination from goal states. Subsection: "Reverse Model-Based Imagination (ROMI)".
- Lecture 23, page 24: Architecture figure showing reverse policy and reverse dynamics model used to generate imagined transitions. Subsection: "Reverse Imagination Process".

View File

@@ -0,0 +1,275 @@
# CSE510 Deep Reinforcement Learning (Lecture 24)
## Cooperative Multi-Agent Reinforcement Learning (MARL)
This lecture introduces cooperative multi-agent reinforcement learning, focusing on formal models, value factorization, and modern algorithms such as QMIX and QPLEX.
## Multi-Agent Coordination Under Uncertainty
In cooperative MARL, multiple agents aim to maximize a shared team reward. The environment can be modeled using a Markov game or a Decentralized Partially Observable MDP (Dec-POMDP).
A transition is defined as:
$$
P(s' \mid s, a_{1}, \dots, a_{n})
$$
Parameter explanations:
- $s$: current global state.
- $s'$: next global state.
- $a_{i}$: action taken by agent $i$.
- $P(\cdot)$: environment transition function.
The shared return is:
$$
\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]
$$
Parameter explanations:
- $\gamma$: discount factor.
- $T$: horizon length.
- $r_{t}$: shared team reward at time $t$.
### CTDE: Centralized Training, Decentralized Execution
Training uses global information (centralized), but execution uses local agent observations. This is critical for real-world deployment.
## Joint vs Factored Q-Learning
### Joint Q-Learning
In joint-action learning, one learns a full joint Q-function:
$$
Q_{tot}(s, a_{1}, \dots, a_{n})
$$
Parameter explanations:
- $Q_{tot}$: joint value for the entire team.
- $(a_{1}, \dots, a_{n})$: joint action vector across agents.
Problem:
- The joint action space grows exponentially in $n$.
- Learning is not scalable.
### Value Factorization
Instead of learning $Q_{tot}$ directly, we factorize it into individual utility functions:
$$
Q_{tot}(s, \mathbf{a}) = f(Q_{1}(s,a_{1}), \dots, Q_{n}(s,a_{n}))
$$
Parameter explanations:
- $\mathbf{a}$: joint action vector.
- $f(\cdot)$: mixing network combining individual Q-values.
The goal is to enable decentralized greedy action selection.
## Individual-Global-Max (IGM) Condition
The IGM condition enables decentralized optimal action selection:
$$
\arg\max_{\mathbf{a}} Q_{tot}(s,\mathbf{a}) = \big(\arg\max_{a_{1}} Q_{1}(s,a_{1}), \dots, \arg\max_{a_{n}} Q_{n}(s,a_{n})\big)
$$
Parameter explanations:
- $\arg\max_{\mathbf{a}}$: search for best joint action.
- $\arg\max_{a_{i}}$: best local action for agent $i$.
- $Q_{i}(s,a_{i})$: individual utility for agent $i$.
IGM makes decentralized execution optimal with respect to the learned factorized value.
## Linear Value Factorization
### VDN (Value Decomposition Networks)
VDN assumes:
$$
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} Q_{i}(s,a_{i})
$$
Parameter explanations:
- $Q_{i}(s,a_{i})$: value of agent $i$'s action.
- $\sum_{i=1}^{n}$: linear sum over agents.
Pros:
- Very simple, satisfies IGM.
- Fully decentralized execution.
Cons:
- Limited representation capacity.
- Cannot model non-linear teamwork interactions.
## QMIX: Monotonic Value Factorization
QMIX uses a state-conditioned mixing network enforcing monotonicity:
$$
\frac{\partial Q_{tot}}{\partial Q_{i}} \ge 0
$$
Parameter explanations:
- $\partial Q_{tot} / \partial Q_{i}$: gradient of global Q w.r.t. individual Q.
- $\ge 0$: ensures monotonicity required for IGM.
The mixing function is:
$$
Q_{tot}(s,\mathbf{a}) = f_{mix}(Q_{1}, \dots, Q_{n}; s)
$$
Parameter explanations:
- $f_{mix}$: neural network with non-negative weights.
- $s$: global state conditioning the mixing process.
Benefits:
- More expressive than VDN.
- Supports CTDE while keeping decentralized greedy execution.
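Below is a minimal PyTorch sketch of a QMIX-style monotonic mixing network as described above; the hypernetwork layout is simplified (single linear hypernetworks) and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

n_agents, state_dim, embed = 3, 10, 32   # hypothetical sizes

class MonotonicMixer(nn.Module):
    """Q_tot = f_mix(Q_1..Q_n; s) with non-negative mixing weights, so dQ_tot/dQ_i >= 0."""
    def __init__(self):
        super().__init__()
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)  # state-conditioned first-layer weights
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)              # state-conditioned second-layer weights
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, q_agents, state):                          # q_agents: (batch, n_agents)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, n_agents, embed)  # abs() enforces monotonicity
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(q_agents.unsqueeze(1), w1) + b1)  # (batch, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).unsqueeze(-1)              # (batch, embed, 1)
        q_tot = torch.bmm(hidden, w2).squeeze(-1) + self.hyper_b2(state)
        return q_tot.squeeze(-1)                                        # (batch,)

mixer = MonotonicMixer()
q_i = torch.randn(5, n_agents)          # individual Q_i(s, a_i) from the agent networks
s = torch.randn(5, state_dim)
print(mixer(q_i, s).shape)              # torch.Size([5])
```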
## Theoretical Issues With Linear and Monotonic Factorization
Limitations:
- Linear models (VDN) cannot represent complex coordination.
- QMIX monotonicity limits representation power for tasks requiring non-monotonic interactions.
- Off-policy training can diverge in some factorizations.
## QPLEX: Duplex Dueling Multi-Agent Q-Learning
QPLEX introduces a dueling architecture that satisfies IGM while providing full representation capacity within the IGM class.
### QPLEX Advantage Factorization
QPLEX factorizes:
$$
Q_{tot}(s,\mathbf{a}) = \sum_{i=1}^{n} \lambda_{i}(s,\mathbf{a})\big(Q_{i}(s,a_{i}) - \max_{a_{i}'} Q_{i}(s,a_{i}')\big) + \max_{\mathbf{a}} \sum_{i=1}^{n} Q_{i}(s,a_{i})
$$
Parameter explanations:
- $\lambda_{i}(s,\mathbf{a})$: positive mixing coefficients.
- $Q_{i}(s,a_{i})$: individual utility.
- $\max_{a_{i}'} Q_{i}(s,a_{i}')$: per-agent baseline value.
- $\max_{\mathbf{a}}$: maximization over joint actions.
QPLEX Properties:
- Fully satisfies IGM.
- Has full representation capacity for all IGM-consistent Q-functions.
- Enables stable off-policy training.
## QPLEX Training Objective
QPLEX minimizes a TD loss over $Q_{tot}$:
$$
L = \mathbb{E}\Big[(r + \gamma \max_{\mathbf{a'}} Q_{tot}(s',\mathbf{a'}) - Q_{tot}(s,\mathbf{a}))^{2}\Big]
$$
Parameter explanations:
- $r$: shared team reward.
- $\gamma$: discount factor.
- $s'$: next state.
- $\mathbf{a'}$: next joint action evaluated by TD target.
- $Q_{tot}$: QPLEX global value estimate.
## Role of Credit Assignment
Credit assignment addresses: "Which agent contributed what to the team reward?"
Value factorization supports implicit credit assignment:
- Gradients into each $Q_{i}$ act as counterfactual signals.
- Dueling architectures allow each agent to learn its influence.
- QPLEX provides clean marginal contributions implicitly.
## Performance on SMAC Benchmarks
QPLEX outperforms:
- QTRAN
- QMIX
- VDN
- Other CTDE baselines
Key reasons:
- Effective realization of IGM.
- Strong representational capacity.
- Off-policy stability.
## Extensions: Diversity and Shared Parameter Learning
Parameter sharing encourages sample efficiency, but can cause homogeneous agent behavior.
Approaches such as CDS (Celebrating Diversity in Shared MARL) introduce:
- Identity-aware diversity.
- Information-based intrinsic rewards for agent differentiation.
- Balanced sharing vs agent specialization.
These techniques improve exploration and cooperation in complex multi-agent tasks.
## Summary of Lecture 24
Key points:
- Cooperative MARL requires scalable value decomposition.
- IGM enables decentralized action selection from centralized training.
- QMIX introduces monotonic non-linear factorization.
- QPLEX achieves full IGM representational capacity.
- Implicit credit assignment arises naturally from factorization.
- Diversity methods allow richer multi-agent coordination strategies.
## Recommended Screenshot Frames for Lecture 24
- Lecture 24, page 16: CTDE and QMIX architecture diagram (mixing network). Subsection: "QMIX: Monotonic Value Factorization".
- Lecture 24, page 31: QPLEX benchmark performance on SMAC. Subsection: "Performance on SMAC Benchmarks".

View File

@@ -26,4 +26,5 @@ export default {
CSE510_L21: "CSE510 Deep Reinforcement Learning (Lecture 21)",
CSE510_L22: "CSE510 Deep Reinforcement Learning (Lecture 22)",
CSE510_L23: "CSE510 Deep Reinforcement Learning (Lecture 23)",
CSE510_L24: "CSE510 Deep Reinforcement Learning (Lecture 24)",
}

View File

@@ -1,4 +1,4 @@
# CSE5313 Coding and information theory for data science (Lecture 22)
## Approximate Gradient Coding

View File

@@ -0,0 +1,242 @@
# CSE5313 Coding and information theory for data science (Lecture 23)
## Coded Computing
### Motivation
Some facts:
- Moore's law is saturating.
- Improving CPU performance is hard.
- Modern datasets are growing remarkably large.
- E.g., TikTok, YouTube.
- Learning tasks are computationally heavy.
- E.g., training neural networks.
Solution: Distributed Computing for Scalability
- Offloading computation tasks to multiple computation nodes.
- Gather and accumulate computation results.
- E.g., Apache Hadoop, Apache Spark, MapReduce.
### General Framework
- The system involves 1 master node and $P$ worker nodes.
- The master has a dataset $D$ and wants $f(D)$, where $f$ is some function.
- The master partitions $D=(D_1,\cdots,D_P)$, and sends $D_i$ to node $i$.
- Every node $i$ computes $g(D_i)$, where $g$ is some function.
- Finally, the master collects $g(D_1),\cdots,g(D_P)$ and computes $f(D)=h(g(D_1),\cdots,g(D_P))$, where $h$ is some function.
#### Challenges
Stragglers
- Nodes that are significantly slower than the others.
Adversaries
- Nodes that return erroneous results.
- Computation/communication errors.
- Adversarial attacks.
Privacy
- Nodes may be curious about the dataset.
### Resemblance to communication channel
Suppose $f,g=\operatorname{id}$, and let $D=(D_1,\cdots,D_P)\in \mathbb{F}^P$ be a message.
- Each $D_i$ is a field element.
- $\mathbb{F}$ could be $\mathbb{R}$, $\mathbb{C}$, or a finite field $\mathbb{F}_q$.
Observation: This is a distributed storage system.
- An erasure - a node that does not respond.
- An error - a node that returns erroneous results.
Solution:
- Add redundancy to the message
- Error-correcting codes.
### Coded Distributed Computing
- The master partitions $D$ and encodes it before sending to $P$ workers.
- Workers perform computations on coded data $\tilde{D}$ and generate coded results $g(\tilde{D})$.
- The master decodes the coded results and obtains $f(D)=h(g(\tilde{D}))$.
### Outline
Matrix-Vector Multiplication
- MDS codes.
- Short-Dot codes.
Matrix-Matrix Multiplication
- Polynomial codes.
- MatDot codes.
Polynomial Evaluation
- Lagrange codes.
- Application to Blockchain.
### Trivial solution - replication
Why no straggler tolerance?
- We employ an individual worker node $i$ to compute $y_i=(a_{i1},\ldots,a_{iN})\cdot (x_1,\ldots,x_N)^T$.
Replicate the computation?
- Let $r+1$ nodes compute every $y_i$.
We need $P=rM+M$ worker nodes to tolerate $r$ erasures and $\lfloor \frac{r}{2}\rfloor$ adversaries.
### Use of MDS codes
Let $2|M$ and $P=3$.
Let $A_1,A_2$ be submatrices of $A$ such that $A=[A_1^\top|A_2^\top]^\top$.
- Worker node 1 computes $A_1\cdot x$.
- Worker node 2 computes $A_2\cdot x$.
- Worker node 3 computes $(A_1+A_2)\cdot x$.
Observation: the results can be obtained from any two worker nodes.
Let $G\in \mathbb{F}^{M\times P}$ be the generator matrix of a $(P,M)$ MDS code.
The master node computes $F=G^\top A\in \mathbb{F}^{P\times N}$.
Every worker node $i$ computes $F_i\cdot x$.
- $F_i=(G^\top A)_i$ is the $i$-th row of $G^\top A$.
Notice that $Fx=G^\top A\cdot x=G^\top y$ is the codeword of $y$.
Node $i$ computes an entry in this codeword.
$1$ response = $1$ entry of the codeword.
The master does **not** need all workers to respond to obtain $y$.
- The MDS property allows decoding from any $M$ $y_i$'s
- This scheme tolerates $P-M$ erasures, and the recovery threshold $K=M$.
- We need $P=r+M$ worker nodes to tolerate $r$ stragglers or $\frac{r}{2}$ adversaries.
- With replication, we need $P=rM+M$ worker nodes.
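A small NumPy sketch of the $P=3$ example above, where the third worker computes $(A_1+A_2)\cdot x$ and the result is recovered from any two responses; the matrix sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, N = 4, 6                              # A has n_rows rows, x has N entries (made-up sizes)
A = rng.integers(0, 10, (n_rows, N)).astype(float)
x = rng.integers(0, 10, N).astype(float)

# P = 3 workers, recovery threshold 2: encode the two row-blocks A1, A2 with a (3, 2) MDS code.
A1, A2 = A[: n_rows // 2], A[n_rows // 2 :]
worker_tasks = [A1, A2, A1 + A2]              # worker i holds its coded block F_i in advance
results = [F_i @ x for F_i in worker_tasks]   # worker i computes F_i x

# Suppose worker 1 (holding A1) straggles: decode y = A x from workers 2 and 3 only.
y2, y3 = results[1], results[2]
y1_recovered = y3 - y2                        # (A1 + A2) x - A2 x = A1 x
y = np.concatenate([y1_recovered, y2])
assert np.allclose(y, A @ x)
```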
#### Potential improvements for MDS codes
- The matrix $A$ is usually a (trained) model, and $x$ is the data (feature vector).
- $x$ is transmitted frequently, while the row of $A$ (or $G^\top A$) is communicated in advance.
- Every worker needs to receive the entire $x$ and compute the dot product.
- Communication-heavy
- Can we design a scheme that allows every node to receive only a part of $x$?
### Short-Dot codes
[link to paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8758338)
We want to create a matrix $F\in \mathbb{F}^{P\times M}$ from $A$ such that:
- Every node computes $F_i\cdot x$.
- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s$ non-zero entries.
In the MDS method, $F=G^\top A$.
- The recovery threshold $K=M$.
- Every worker node needs to receive $s=N$ symbols (the entire $x$).
No free lunch: can we trade the recovery threshold $K$ for a smaller $s$?
- Every worker node receives less than $N$ symbols.
- The master will need more than $M$ responses to recover the computation result.
#### Construction of Short-Dot codes
Choose a super-regular matrix $B\in \mathbb{F}^{P\times K}$, where $P$ is the number of worker nodes.
- A matrix is super-regular if every square submatrix is invertible.
- Lagrange/Cauchy matrix is super-regular (next lecture).
Create matrix $\tilde{A}$ by stacking some $Z\in \mathbb{F}^{(K-M)\times N}$ below matrix $A$.
Let $F=B\cdot \tilde{A}\in \mathbb{F}^{P\times N}$.
**Short-Dot**: create matrix $F\in \mathbb{F}^{P\times N}$ such that:
- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s=\frac{P-K+M}{P}N$ non-zero entries (sparse).
#### Recovery of Short-Dot codes
Claim: Every $K$ rows of $F$ linearly span the row space of $A$.
<details>
<summary>Proof</summary>
Since $B$ is super-regular, it is also MDS, i.e., every $K\times K$ submatrix of $B$ is invertible.
Hence, every row of $A$ can be represented as a linear combination of any $K$ rows of $F$.
That is, for every $\mathcal{X}\subseteq[P],|\mathcal{X}|=K$, we can have $\tilde{A}=(B^{\mathcal{X}})^{-1}F^{\mathcal{X}}$.
</details>
What about the sparsity of $F$?
- Want each row of $F$ to be sparse.
#### Sparsity of Short-Dot codes
Build a $P\times P$ square matrix in which each row and each column contains $P-K+M$ non-zero entries.
Concatenate $\frac{N}{P}$ such matrices to obtain the sparsity pattern of $F$.
[Missing slides 18]
We now investigate what $Z$ should look like to construct such a matrix $F$.
- Recall that each column of $F$ must contain $K-M$ zeros.
  - They are indexed by a set $\mathcal{U}\subseteq[P]$, where $|\mathcal{U}| = K-M$.
- Let $B^{\mathcal{U}}\in\mathbb{F}^{(K-M)\times K}$ be the submatrix of $B$ containing the rows indexed by $\mathcal{U}$.
- Since $F = B\tilde{A}$, it follows that $F_j = B\tilde{A}_j$, where $F_j$ and $\tilde{A}_j$ are the $j$-th columns of $F$ and $\tilde{A}$.
- Next, we need $B^{\mathcal{U}}\tilde{A}_j = 0_{(K-M)\times 1}$.
- Split $B^{\mathcal{U}} = [B^{\mathcal{U}}_{[1,M]}\,|\,B^{\mathcal{U}}_{[M+1,K]}]$ and $\tilde{A}_j = [A_j^\top, Z_j^\top]^\top$. Then
$$
B^{\mathcal{U}}\tilde{A}_j = B^{\mathcal{U}}_{[1,M]} A_j + B^{\mathcal{U}}_{[M+1,K]} Z_j = 0_{(K-M)\times 1},
$$
$$
Z_j = -\big(B^{\mathcal{U}}_{[M+1,K]}\big)^{-1} B^{\mathcal{U}}_{[1,M]} A_j.
$$
- Note that $B^{\mathcal{U}}_{[M+1,K]}\in\mathbb{F}^{(K-M)\times(K-M)}$ is invertible, since $B$ is super-regular.

View File

@@ -26,4 +26,5 @@ export default {
CSE5313_L20: "CSE5313 Coding and information theory for data science (Lecture 20)",
CSE5313_L21: "CSE5313 Coding and information theory for data science (Lecture 21)",
CSE5313_L22: "CSE5313 Coding and information theory for data science (Lecture 22)",
CSE5313_L23: "CSE5313 Coding and information theory for data science (Lecture 23)",
}

View File

@@ -1,2 +1,14 @@
# CSE5519 Advances in Computer Vision (Topic G: 2025: Correspondence Estimation and Structure from Motion)
## MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos
[link to paper](https://arxiv.org/pdf/2412.04463)
- vanilla Droid-SLAM
- mono-depth initialization
- object movement map prediction
- two-stage training scheme
> [!TIP]
>
> How does the two-stage training scheme help with the robustness of the model? For me, it seems that this paper is just the integration of GeoNet (separated pose and depth) with full regression.

View File

@@ -1,2 +1,16 @@
# CSE5519 Advances in Computer Vision (Topic I: 2025: Embodied Computer Vision and Robotics)
## Navigation World Models
[link to paper](https://arxiv.org/pdf/2412.03572)
### Novelty in NWM
- Conditional Diffusion Transformer
- Uses time and action to condition the diffusion process
> [!TIP]
>
> This paper provides a new way to train navigation world models. Via conditioned diffusion, the model can generate an imagined trajectory in an unknown environment and perform navigation tasks.
>
> However, the model collapses frequently when using out-of-distribution data, resulting in poor navigation performance. I wonder how we can further condition on the novelty of the environment and integrate exploration strategies to train the model online to fix the collapse issue. What might be the challenges of doing so in the Conditioned Diffusion Transformer?