# CSE510 Deep Reinforcement Learning (Lecture 26)

## Continue on Real-World Practical Challenges for RL

### Factored multi-agent RL

- Sample efficiency -> Shared Learning
- Complexity -> High-Order Factorization
- Partial Observability -> Communication Learning
- Sparse reward -> Coordinated Exploration

#### Parameter Sharing vs. Diversity

- Parameter sharing is critical for deep MARL methods
- However, agents tend to acquire homogeneous behaviors
- Diversity is essential for exploration and practical tasks

[link to paper: Google Football](https://arxiv.org/pdf/1907.11180)

Schematic of the approach: Celebrating Diversity in Shared MARL (CDS)

- In representation, CDS allows MARL to adaptively decide when to share learning
- In optimization, CDS encourages diversity

In optimization, CDS maximizes an information-theoretic objective to achieve identity-aware diversity:

$$
\begin{aligned}
I^{\pi}(\tau_T; id) &= H(\tau_T) - H(\tau_T \mid id) = \mathbb{E}_{id,\,\tau_T \sim \pi}\left[\log \frac{p(\tau_T \mid id)}{p(\tau_T)}\right]\\
&= \mathbb{E}_{id,\,\tau}\left[\log \frac{p(o_0 \mid id)}{p(o_0)} + \sum_{t=0}^{T-1}\left(\log\frac{p(a_t \mid \tau_t, id)}{p(a_t \mid \tau_t)} + \log \frac{p(o_{t+1} \mid \tau_t, a_t, id)}{p(o_{t+1} \mid \tau_t, a_t)}\right)\right]
\end{aligned}
$$

Here, $\sum_{t=0}^{T-1}\log\frac{p(a_t \mid \tau_t, id)}{p(a_t \mid \tau_t)}$ represents the action diversity, and $\log \frac{p(o_{t+1} \mid \tau_t, a_t, id)}{p(o_{t+1} \mid \tau_t, a_t)}$ represents the observation diversity.
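
As a rough illustration of how the action-diversity term could be turned into a per-step intrinsic bonus, here is a minimal numpy sketch. It is not the CDS implementation: the identity-free policy $p(a_t \mid \tau_t)$ is approximated here by averaging the agents' policies, and the function and variable names are made up for this example.

```python
import numpy as np

def action_diversity_bonus(policy_probs, agent_id, action):
    """Per-step bonus log p(a_t | tau_t, id) - log p(a_t | tau_t).

    policy_probs: array of shape (n_agents, n_actions) holding each agent's action
    distribution given the same local trajectory tau_t. The identity-free term
    p(a_t | tau_t) is approximated by the average over identities (CDS instead
    learns variational estimates of these distributions).
    """
    p_given_id = policy_probs[agent_id, action]
    p_marginal = policy_probs[:, action].mean()
    return float(np.log(p_given_id + 1e-8) - np.log(p_marginal + 1e-8))

# Toy example: 3 agents, 4 actions; agent 0 behaves differently from the others,
# so its chosen action receives a positive diversity bonus.
probs = np.array([[0.10, 0.10, 0.70, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.30, 0.30, 0.20, 0.20]])
print(action_diversity_bonus(probs, agent_id=0, action=2))
```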

### Summary

- MARL plays a critical role for AI, but is still at an early stage
- Value factorization enables scalable MARL
- Linear factorization is sometimes surprisingly effective
- Non-linear factorization shows promise in offline settings
- Parameter sharing plays an important role for deep MARL
- Diversity and dynamic parameter sharing can be critical for complex cooperative tasks

## Challenges and open problems in DRL

### Overview for Reinforcement Learning Algorithms

Recall from Lecture 2, ordered from more sample-efficient to less sample-efficient:

- Model-based
- Off-policy/Q-learning
- Actor-critic
- On-policy/Policy gradient
- Evolutionary/Gradient-free

#### Model-Based

- Learn a model of the world, then plan using the model (see the sketch below)
- Update the model often
- Re-plan often

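
To make the learn/plan/re-plan loop concrete, here is a minimal sketch on a made-up one-dimensional system: the dynamics are fit by least squares from random interaction, and a random-shooting planner re-plans at every step (MPC style). The environment, horizon, and candidate counts are illustrative assumptions, not a specific algorithm from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D system: s' = 0.9*s + 0.5*a + noise; per-step cost = s^2 + 0.1*a^2.
def env_step(s, a):
    return 0.9 * s + 0.5 * a + 0.01 * rng.normal()

# 1) Learn a model of the world from random interaction (linear least squares here).
S, A, S_next = [], [], []
s = 1.0
for _ in range(200):
    a = rng.uniform(-1, 1)
    s_next = env_step(s, a)
    S.append(s); A.append(a); S_next.append(s_next)
    s = s_next
X = np.column_stack([S, A])
theta, *_ = np.linalg.lstsq(X, np.array(S_next), rcond=None)   # learned dynamics coefficients

# 2) Plan using the learned model (random shooting), 3) re-plan after every real step.
def plan(s, horizon=5, n_candidates=200):
    best_first_action, best_cost = 0.0, np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, horizon)
        s_sim, cost = s, 0.0
        for a in actions:
            s_sim = theta @ np.array([s_sim, a])    # roll out the learned model, not the real env
            cost += s_sim ** 2 + 0.1 * a ** 2
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action                        # MPC: execute only the first action

s = 1.0
for _ in range(20):
    s = env_step(s, plan(s))
print("final state:", round(float(s), 3))
```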

#### Value-Based

- Learn the state or state-action value (see the sketch below)
- Act by choosing the best action in a state
- Exploration is a necessary add-on

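
A minimal tabular sketch of the value-based recipe on a hypothetical 5-state chain (not an environment from the lecture): learn Q(s, a) by bootstrapping, act greedily with respect to it, and bolt on epsilon-greedy exploration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-state chain: actions are "left" (0) and "right" (1); reward 1 at the right end.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.2     # learning rate, discount, exploration rate

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = int(rng.integers(n_states))        # random start state for broad coverage
    for t in range(20):
        # Exploration is an explicit add-on: epsilon-greedy around the greedy policy.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Bootstrapped TD target; acting later just takes argmax over the learned Q.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # greedy policy: should prefer "right" (action 1) in every state
```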

#### Policy-based

- Learn a stochastic policy function that maps states to actions (see the sketch below)
- Act by sampling the policy
- Exploration is baked in

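
And a minimal policy-based sketch on a hypothetical two-armed bandit: a softmax policy over logits, actions drawn by sampling (so exploration comes for free), and a plain REINFORCE update. The bandit means, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-armed bandit treated as a one-step RL problem; arm 1 pays more on average.
true_means = np.array([0.2, 0.8])
theta = np.zeros(2)                      # logits of the stochastic policy

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.1
for it in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)           # act by sampling the policy: exploration is baked in
    r = true_means[a] + 0.1 * rng.normal()
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                # gradient of log pi(a) w.r.t. the softmax logits
    theta += lr * r * grad_log_pi        # REINFORCE: score function times reward

print(softmax(theta))                    # probability mass should concentrate on arm 1
```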

### Where are we?

Deep RL has achieved impressive results in games, robotics, control, and decision systems.

But it is still far from a general, reliable, and efficient learning paradigm.

Today: what limits Deep RL, what's being worked on, and what's still open.

### Outline of challenges

- Offline RL
- Multi-Agent complexity
- Sample efficiency & data reuse
- Stability & reproducibility
- Generalization & distribution shift
- Scalable model-based RL
- Safety
- Theory gaps & evaluation

### Sample inefficiency

Model-free deep RL often needs millions or billions of environment steps.

- Humans with 15 minutes of experience tend to outperform DDQN trained for 115 hours
- OpenAI Five for Dota 2: 180 years of playing time per day of training

Real-world systems cannot afford this.

Root causes: high-variance gradients, weak priors, poor credit assignment.

#### Open direction for sample efficiency

- Better data reuse: off-policy learning & replay improvements (see the sketch after this list)
- Self-supervised representation learning for control (learning from interacting with the environment)
- Hybrid model-based/model-free approaches
- Transfer & pre-training on large datasets
- Knowledge-driven RL: leveraging pre-trained models
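
As a tiny illustration of the data-reuse point, here is a generic uniform replay buffer sketch. This is the standard pattern rather than any particular paper's method, and the class and method names are mine.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform replay buffer: store each transition once and reuse it for
    many off-policy updates instead of discarding it after a single gradient step."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
        return list(zip(*batch))                    # (states, actions, rewards, next_states, dones)

# Usage sketch: call buf.add(...) after every environment step, then repeatedly draw
# minibatches with buf.sample(64) for Q-learning or actor-critic updates.
```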

#### Knowledge-Driven RL: Motivation

Current LLMs are not good at decision making.

Pros: rich knowledge.

Cons: auto-regressive decoding lacks long-term memory.

Reinforcement learning in decision making:

Pros: can go beyond human intelligence.

Cons: sample inefficiency.

### Instability & the Deadly Triad

Function approximation + bootstrapping + off-policy learning can diverge (the "deadly triad").

Even nominally stable algorithms (e.g., PPO) can be unstable in practice.


#### Open direction for Stability

- Better optimization landscapes + regularization
- Calibration/monitoring tools for RL training (see the sketch below)
- Architectures with built-in inductive biases (e.g., equivariance)

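
On the monitoring point, a toy sketch of what a training monitor might track to catch deadly-triad-style divergence early. The class name, window size, and threshold are assumptions, not an established tool.

```python
import numpy as np

class DivergenceMonitor:
    """Track running statistics of Q-value estimates and TD errors during training
    and flag runs whose value estimates blow up or turn non-finite."""

    def __init__(self, window=1000, q_limit=1e3):
        self.q_values, self.td_errors = [], []
        self.window, self.q_limit = window, q_limit

    def log(self, q_value, td_error):
        self.q_values.append(float(q_value))
        self.td_errors.append(abs(float(td_error)))

    def status(self):
        mean_q = float(np.mean(self.q_values[-self.window:]))
        mean_td = float(np.mean(self.td_errors[-self.window:]))
        diverging = (not np.isfinite(mean_q)) or mean_q > self.q_limit
        return {"mean_q": mean_q, "mean_abs_td": mean_td, "diverging": diverging}

# Usage sketch: monitor.log(q_sa, td_error) inside the update loop, then check
# monitor.status() every few thousand steps and stop or lower the learning rate if it flags.
```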

### Reproducibility & Evaluation

Results often depend on random seeds, codebase, and compute budget.

Benchmarks can be overfit, and comparisons are often apples-to-oranges.

Offline evaluation is especially tricky.


#### Toward Better Evaluation

- Robustness checks and ablations (see the seed-reporting sketch below)
- Out-of-distribution test suites
- Realistic benchmarks beyond games (e.g., science and healthcare)

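
As a small example of seed-aware reporting, relating back to the seed-dependence issue above, here is a sketch that aggregates final returns across independent seeds with a bootstrap confidence interval; the numbers are made up.

```python
import numpy as np

def seed_summary(returns_per_seed, n_boot=10_000, seed=0):
    """Aggregate per-seed evaluation returns: report the mean and a bootstrap 95%
    confidence interval instead of a single best-seed number."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns_per_seed, dtype=float)
    boot_means = rng.choice(returns, size=(n_boot, len(returns)), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return returns.mean(), (lo, hi)

# Hypothetical final returns from 5 independent seeds of the same algorithm and config.
print(seed_summary([210.0, 250.0, 190.0, 305.0, 150.0]))
```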

### Generalization & Distribution Shift

Policies overfit to training environments and fail under small changes.

Sim-to-real gap, sensor noise, morphology changes, domain drift.

Requires learning invariances and robust decision rules.

#### Open direction for Generalization

- Domain randomization + system identification (see the sketch after this list)
- Robust / risk-sensitive RL
- Representation learning for invariance
- Meta-RL and fast adaptation

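
A minimal sketch of the domain-randomization idea: sample the simulator's physical parameters per episode so the policy is trained across a distribution of environments rather than one fixed simulator. The parameter names and ranges are hypothetical, and `build_simulator` / `run_episode` are placeholders left as comments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_randomized_params():
    """Sample hypothetical physical parameters for one training episode."""
    return {
        "mass":      rng.uniform(0.5, 2.0),    # kg
        "friction":  rng.uniform(0.0, 0.3),
        "delay":     int(rng.integers(0, 3)),  # action delay in steps
        "obs_noise": rng.uniform(0.0, 0.05),   # sensor noise std
    }

# Training loop skeleton: a freshly randomized environment for every episode.
for episode in range(3):
    params = sample_randomized_params()
    print(f"episode {episode}: {params}")
    # env = build_simulator(**params)   # hypothetical simulator constructor
    # run_episode(policy, env)          # collect data and update the policy as usual
```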

### Model-based RL: Promise & Pitfalls

- Learned models enable planning and sample efficiency
- But distribution mismatch and model exploitation can break policies
- Long-horizon imagination amplifies errors
- Model learning is challenging

### Safety, alignment, and constraints

Reward mis-specification -> unsafe or unintended behavior

Need to respect constraints: energy, collisions, ethics, regulation

Exploration itself may be unsafe

#### Open direction for Safety RL

- Constrained RL (Lagrangian methods, CBFs, shielding); see the sketch below
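
To make the Lagrangian variant concrete, here is a minimal sketch of the dual-variable update for a constrained objective of the form max_pi E[return] - lambda * (E[cost] - d): lambda rises while the measured cost exceeds the limit and decays back toward zero once the constraint is satisfied. The cost trace and step size are made up.

```python
def lagrangian_dual_update(lmbda, avg_cost, cost_limit, lr=0.01):
    """One dual-ascent step on the Lagrange multiplier: increase lambda when the
    constraint is violated, decrease it (down to 0) when it is satisfied."""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))

# Toy trace: the measured constraint cost drifts above the limit and lambda reacts.
lmbda, cost_limit = 0.0, 1.0
for avg_cost in [0.8, 0.9, 1.3, 1.5, 1.2, 0.9, 0.7]:
    lmbda = lagrangian_dual_update(lmbda, avg_cost, cost_limit, lr=0.5)
    print(f"avg_cost={avg_cost:.1f} -> lambda={lmbda:.2f}")
# The policy update would then maximize reward - lambda * cost using the current lambda.
```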

### Theory Gaps & Evaluation

Deep RL lacks strong general guarantees.

We don't fully understand when/why it works


Bridging theory and practice remains an open problem.

#### Promising theory directions

- Optimization theory of RL objectives
- Generalization and representation learning bounds
- Finite-sample analysis

### Connection to foundation models

- Pre-training on large-scale experience
- World models as sequence predictors
- RLHF/preference optimization for alignment
- Open problems: grounding

### What to expect in the next 3-5 years

- Unified model-based, offline, and safe RL stacks
- Large pretrained decision models
- Deployment in high-stakes domains