Here: $\sum_{t=0}^{T-1}\log\frac{p(a_t|\tau_t,id)}{p(a_t|\tau_t)}$ represents the action diversity.

$\log \frac{p(o_{t+1}|\tau_t,a_t,id)}{p(o_{t+1}|\tau_t,a_t)}$ represents the observation diversity.

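
How the two diversity terms fit together (a sketch, assuming the trajectory factorizes step by step into action and observation likelihoods and that the initial observation does not depend on $id$): the per-step log-ratios telescope into a single log-ratio over the whole trajectory, whose expectation is the mutual information between trajectory and agent identity.

$$
\log\frac{p(\tau_T|id)}{p(\tau_T)}
= \sum_{t=0}^{T-1}\left[\log\frac{p(a_t|\tau_t,id)}{p(a_t|\tau_t)} + \log\frac{p(o_{t+1}|\tau_t,a_t,id)}{p(o_{t+1}|\tau_t,a_t)}\right],
\qquad
\mathbb{E}\left[\log\frac{p(\tau_T|id)}{p(\tau_T)}\right] = I(\tau_T;\, id).
$$
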
### Summary

- MARL plays a critical role for AI, but is still at an early stage
- Value factorization enables scalable MARL (see the sketch after this list)
- Linear factorization is sometimes surprisingly effective
- Non-linear factorization shows promise in offline settings
- Parameter sharing plays an important role in deep MARL
- Diversity and dynamic parameter sharing can be critical for complex cooperative tasks

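
A minimal sketch of linear value factorization with parameter sharing (VDN-style), assuming PyTorch; the network, dimensions, and toy shapes are illustrative, not the lecture's reference implementation.

```python
import torch
import torch.nn as nn

class SharedAgentQ(nn.Module):
    """One Q-network shared by all agents (parameter sharing)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):          # obs: (batch, n_agents, obs_dim)
        return self.net(obs)         # -> (batch, n_agents, n_actions)

def vdn_q_tot(q_net, obs, actions):
    """Linear (VDN-style) factorization: Q_tot = sum_i Q_i(o_i, a_i)."""
    q_all = q_net(obs)                                             # (B, N, A)
    q_taken = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B, N)
    return q_taken.sum(dim=-1)                                     # (B,)

# Toy usage: batch of 8, 3 agents, obs_dim 10, 5 actions.
q_net = SharedAgentQ(obs_dim=10, n_actions=5)
obs = torch.randn(8, 3, 10)
actions = torch.randint(0, 5, (8, 3))
q_tot = vdn_q_tot(q_net, obs, actions)   # trained against a team-level TD target
```
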
## Challenges and open problems in DRL

### Overview of Reinforcement Learning Algorithms

Recall from lecture 2.

From better to worse sample efficiency:

- Model-based
- Off-policy/Q-learning
- Actor-critic
- On-policy/Policy gradient
- Evolutionary/Gradient-free

#### Model-Based

- Learn a model of the world, then plan using the model (see the sketch below)
- Update the model often
- Re-plan often

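
A minimal sketch of the learn-the-model / plan / re-plan loop above, using a simple random-shooting planner; `env`, `model.predict`, `model.fit`, and all hyperparameters are assumed placeholders, not a specific algorithm from the lecture.

```python
import numpy as np

def plan_random_shooting(model, state, horizon=10, n_candidates=100, action_dim=2):
    """Pick the first action of the best random action sequence under the learned model."""
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s, r = model.predict(s, a)     # learned dynamics + reward model
            total += r
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

def model_based_loop(env, model, episodes=100):
    data = []
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = plan_random_shooting(model, s)   # re-plan at every step
            s_next, r, done = env.step(a)
            data.append((s, a, r, s_next))
            s = s_next
        model.fit(data)                          # update the model often
```
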
#### Value-Based

- Learn the state or state-action value
- Act by choosing the best action in each state
- Exploration is a necessary add-on (e.g., ε-greedy; see the sketch below)

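
A minimal tabular Q-learning sketch that makes the three bullets concrete (learn Q, act greedily, bolt on ε-greedy exploration); the integer-state environment interface and hyperparameters are assumptions.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))      # learned state-action values
    for _ in range(episodes):
        s = env.reset()                      # assumed to return an integer state index
        done = False
        while not done:
            # Exploration is an add-on: with prob. eps act randomly, else greedily
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)    # assumed (next_state, reward, done)
            # TD update toward the bootstrapped target
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```
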
#### Policy-Based

- Learn a stochastic policy that maps states to actions
- Act by sampling from the policy
- Exploration is baked in (see the sketch below)

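
A minimal REINFORCE-style sketch, assuming PyTorch, a discrete-action environment with a `reset()`/`step()` interface returning `(obs, reward, done)`, and toy dimensions; exploration comes from sampling the learned policy rather than a separate mechanism.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # obs_dim=4, 2 actions
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(env, gamma=0.99):
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()                   # exploration = sampling the stochastic policy
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)
    # Monte Carlo returns, then the policy-gradient loss -sum(log pi * G)
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    loss = -(torch.stack(log_probs) * torch.tensor(returns, dtype=torch.float32)).sum()
    optim.zero_grad(); loss.backward(); optim.step()
```
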
### Where are we?

Deep RL has achieved impressive results in games, robotics, control, and decision systems.

But it is still far from a general, reliable, and efficient learning paradigm.

Today: what limits Deep RL, what's being worked on, and what's still open.

### Outline of challenges

- Offline RL
- Multi-agent complexity
- Sample efficiency & data reuse
- Stability & reproducibility
- Generalization & distribution shift
- Scalable model-based RL
- Safety
- Theory gaps & evaluation

### Sample inefficiency

Model-free Deep RL often needs millions or billions of environment steps.

- Humans with 15 minutes of learning tend to outperform DDQN trained for 115 hours
- OpenAI Five for Dota 2: 180 years of playing time per day

Real-world systems can't afford this.

Root causes: high-variance gradients, weak priors, poor credit assignment.

Open directions for sample efficiency:

- Better data reuse: off-policy learning & replay improvements (see the sketch below)
- Self-supervised representation learning for control (learning from interaction with the environment)
- Hybrid model-based/model-free approaches
- Transfer & pre-training on large datasets
- Knowledge-driven RL: leveraging pre-trained models

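
A minimal experience-replay sketch for the data-reuse bullet above: each transition is stored once and sampled many times for off-policy updates; capacity and batch size are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions so each environment step can be reused in many updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return map(list, zip(*batch))   # states, actions, rewards, next_states, dones

# Typical use: one environment step, several gradient updates from replayed data:
#   buffer.add(s, a, r, s2, done)
#   for _ in range(updates_per_step):
#       states, actions, rewards, next_states, dones = buffer.sample()
#       ... off-policy TD update (e.g., the Q-learning target shown earlier) ...
```
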
#### Knowledge-Driven RL: Motivation

Current LLMs are not good at decision making.

- Pros: rich knowledge
- Cons: auto-regressive decoding, lack of long-term memory

Reinforcement learning in decision making:

- Pros: can go beyond human intelligence
- Cons: sample inefficiency

### Instability & the Deadly Triad

Function approximation + bootstrapping + off-policy learning can diverge (the "deadly triad").

Even nominally stable algorithms (e.g., PPO) can be unstable in practice.

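
A sketch of the update that combines all three ingredients, assuming PyTorch and batched tensors of suitable shapes: function approximation (a learned Q-network), bootstrapping (the TD target uses the current estimate), and off-policy data (e.g., from a replay buffer). A target network is shown as one common stabilizer, not a guaranteed fix.

```python
import torch
import torch.nn as nn

q, q_target = nn.Linear(8, 4), nn.Linear(8, 4)    # function approximation (toy shapes)
q_target.load_state_dict(q.state_dict())
optim = torch.optim.Adam(q.parameters(), lr=1e-3)

def td_update(states, actions, rewards, next_states, dones, gamma=0.99):
    # states/next_states: (B, 8) float; actions: (B,) long; rewards/dones: (B,) float
    # Bootstrapping: the target uses the current value estimate, not the true return
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * q_target(next_states).max(dim=1).values
    pred = q(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Off-policy: (s, a, r, s') typically come from a replay buffer / older policy.
    # This triad can diverge; freezing q_target and syncing it occasionally helps.
    loss = nn.functional.mse_loss(pred, target)
    optim.zero_grad(); loss.backward(); optim.step()

# Periodically: q_target.load_state_dict(q.state_dict())
```
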
#### Open directions for Stability

- Better optimization landscapes + regularization
- Calibration/monitoring tools for RL training
- Architectures with built-in inductive biases (e.g., equivariance)

### Reproducibility & Evaluation

Results often depend on random seeds, codebase, and compute budget.

Benchmarks can be overfit; comparisons are often apples-to-oranges.

Offline evaluation is especially tricky.

#### Toward Better Evaluation

- Robustness checks and ablations (see the sketch below)
- Out-of-distribution test suites
- Realistic benchmarks beyond games (e.g., science and healthcare)

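
A small sketch of seed-aware reporting for the robustness-check bullet: aggregate over several seeds and report spread rather than a single best run; the aggregation choices (mean, standard deviation, interquartile mean) are illustrative.

```python
import numpy as np

def summarize_over_seeds(returns_per_seed):
    """returns_per_seed: list of final returns, one entry per random seed."""
    x = np.asarray(returns_per_seed, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    iqm = x[(x >= q25) & (x <= q75)].mean()   # interquartile mean is less seed-sensitive
    return {"mean": x.mean(), "std": x.std(ddof=1), "iqm": iqm, "n_seeds": len(x)}

print(summarize_over_seeds([212.0, 190.5, 240.1, 95.3, 225.7]))
```
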
### Generalization & Distribution Shift

Policies overfit to their training environments and fail under small changes.

Sim-to-real gap, sensor noise, morphology changes, domain drift.

Requires learning invariances and robust decision rules.

#### Open directions for Generalization

- Domain randomization + system identification (see the sketch below)
- Robust/risk-sensitive RL
- Representation learning for invariance
- Meta-RL and fast adaptation

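
A minimal domain-randomization sketch: physics parameters are resampled on every reset so the policy must cope with a family of dynamics; the wrapped `base_env` interface (including `set_params`) and the parameter ranges are assumptions for illustration.

```python
import numpy as np

class DomainRandomizedEnv:
    """Wraps a simulator and resamples physics parameters on every reset."""
    def __init__(self, base_env, mass_range=(0.5, 2.0), friction_range=(0.3, 1.2)):
        self.base_env = base_env
        self.mass_range = mass_range
        self.friction_range = friction_range

    def reset(self):
        # Train over a distribution of dynamics instead of a single simulator
        self.base_env.set_params(
            mass=np.random.uniform(*self.mass_range),
            friction=np.random.uniform(*self.friction_range),
        )
        return self.base_env.reset()

    def step(self, action):
        return self.base_env.step(action)
```
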
### Model-based RL: Promise & Pitfalls

- Learned models enable planning and sample efficiency
- But distribution mismatch and model exploitation can break policies
- Long-horizon imagination amplifies errors
- Model learning is challenging

### Safety, alignment, and constraints

Reward mis-specification -> unsafe or unintended behavior.

Need to respect constraints: energy, collisions, ethics, regulation.

Exploration itself may be unsafe.

#### Open directions for Safe RL

- Constrained RL (Lagrangian methods, control barrier functions (CBFs), shielding; see the sketch below)

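
A sketch of the Lagrangian idea for constrained RL: a dual variable scales a cost penalty added to the reward and grows whenever measured cost exceeds the budget; the update rule and constants are illustrative, not a specific algorithm from the lecture.

```python
def lagrangian_penalty_step(rewards, costs, lam, cost_limit=25.0, lam_lr=0.01):
    """One dual-ascent step: penalize rewards by lam * cost, then adjust lam."""
    # The policy is trained (elsewhere) on the penalized reward r - lam * c
    penalized = [r - lam * c for r, c in zip(rewards, costs)]
    # Dual ascent on lam: grow the penalty while episode cost exceeds the budget
    episode_cost = sum(costs)
    lam = max(0.0, lam + lam_lr * (episode_cost - cost_limit))
    return penalized, lam

# Usage across training: lam starts at 0 and rises only while constraints are violated
lam = 0.0
penalized_rewards, lam = lagrangian_penalty_step(
    rewards=[1.0, 1.0, 0.5], costs=[0.0, 10.0, 20.0], lam=lam)
```
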
### Theory Gaps & Evaluation

Deep RL lacks strong general guarantees.

We don't fully understand when and why it works.

Bridging theory and practice remains an open problem.

#### Promising theory directions

- Optimization theory of RL objectives
- Generalization and representation learning bounds
- Finite-sample analysis

### Connection to foundation models

- Pre-training on large-scale experience
- World models as sequence predictors
- RLHF/preference optimization for alignment (see the sketch below)
- Open problems: grounding

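
A minimal sketch of the pairwise preference loss behind reward-model training in RLHF (Bradley-Terry style): the reward model is pushed to score the preferred sample above the rejected one; the toy linear reward model over fixed-size embeddings is a stand-in, not an actual RLHF pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model over fixed-size embeddings (a stand-in for a transformer head)
reward_model = nn.Linear(16, 1)

def preference_loss(preferred_emb, rejected_emb):
    """Bradley-Terry style loss: -log sigmoid(r(preferred) - r(rejected))."""
    r_pos = reward_model(preferred_emb).squeeze(-1)   # (B,)
    r_neg = reward_model(rejected_emb).squeeze(-1)    # (B,)
    return -F.logsigmoid(r_pos - r_neg).mean()

loss = preference_loss(torch.randn(8, 16), torch.randn(8, 16))
loss.backward()   # gradients train the reward model; a policy is later optimized against it
```
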
### What to expect in the next 3-5 years

- Unified model-based, offline, and safe RL stacks
- Large pretrained decision models
- Deployment in high-stakes domains
