# CSE510 Deep Reinforcement Learning (Lecture 26)

## Real-World Practical Challenges for RL (continued)

### Factored multi-agent RL

- Sample efficiency -> Shared Learning
- Complexity -> High-Order Factorization
- Partial Observability -> Communication Learning
- Sparse reward -> Coordinated Exploration

#### Parameter Sharing vs. Diversity

- Parameter sharing is critical for deep MARL methods
- However, agents tend to acquire homogeneous behaviors
- Diversity is essential for exploration and for practical tasks

[Paper: Google Research Football](https://arxiv.org/pdf/1907.11180)

Schematic of the approach: Celebrating Diversity in Shared MARL (CDS)

- In representation, CDS allows MARL to adaptively decide when to share learning
- In optimization, CDS encourages diversity by maximizing an information-theoretic objective that achieves identity-aware diversity:

$$
\begin{aligned}
I^\pi(\tau_T; id) &= H(\tau_T) - H(\tau_T \mid id) = \mathbb{E}_{id,\, \tau_T \sim \pi}\left[\log \frac{p(\tau_T \mid id)}{p(\tau_T)}\right]\\
&= \mathbb{E}_{id,\, \tau}\left[ \log \frac{p(o_0 \mid id)}{p(o_0)} + \sum_{t=0}^{T-1}\left(\log\frac{p(a_t \mid \tau_t, id)}{p(a_t \mid \tau_t)} + \log \frac{p(o_{t+1} \mid \tau_t, a_t, id)}{p(o_{t+1} \mid \tau_t, a_t)}\right)\right]
\end{aligned}
$$

Here:

- $\sum_{t=0}^{T-1}\log\frac{p(a_t \mid \tau_t, id)}{p(a_t \mid \tau_t)}$ captures action diversity.
- $\log \frac{p(o_{t+1} \mid \tau_t, a_t, id)}{p(o_{t+1} \mid \tau_t, a_t)}$ captures observation diversity.

### Summary

- MARL plays a critical role for AI, but is still at an early stage
- Value factorization enables scalable MARL
- Linear factorization is sometimes surprisingly effective
- Non-linear factorization shows promise in offline settings
- Parameter sharing plays an important role for deep MARL
- Diversity and dynamic parameter sharing can be critical for complex cooperative tasks

## Challenges and open problems in DRL

### Overview of Reinforcement Learning Algorithms

Recall from Lecture 2 the main algorithm families, ordered from better to worse sample efficiency:

- Model-based
- Off-policy/Q-learning
- Actor-critic
- On-policy/Policy gradient
- Evolutionary/Gradient-free

#### Model-Based

- Learn a model of the world, then plan using the model
- Update the model often
- Re-plan often

#### Value-Based

- Learn the state or state-action value
- Act by choosing the best action in each state
- Exploration is a necessary add-on

#### Policy-Based

- Learn a stochastic policy function that maps state to action
- Act by sampling from the policy
- Exploration is baked in

### Where are we?

Deep RL has achieved impressive results in games, robotics, control, and decision systems. But it is still far from a general, reliable, and efficient learning paradigm. Today: what limits deep RL, what is being worked on, and what is still open.

### Outline of challenges

- Offline RL
- Multi-agent complexity
- Sample efficiency & data reuse
- Stability & reproducibility
- Generalization & distribution shift
- Scalable model-based RL
- Safety
- Theory gaps & evaluation

### Sample inefficiency

Model-free deep RL often needs millions or billions of environment steps:

- Humans with 15 minutes of learning tend to outperform DDQN trained for 115 hours
- OpenAI Five for Dota 2: 180 years of playing time per day

Real-world systems can't afford this.

Root causes: high-variance gradients, weak priors, poor credit assignment.
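Before turning to open directions, here is a minimal Python sketch of data reuse through experience replay (the `ReplayBuffer` class and the commented `agent.update` call are illustrative assumptions, not part of the lecture): an off-policy learner can reuse each stored transition across many gradient updates, whereas a purely on-policy learner discards it after one update.

```python
# Illustrative sketch only: experience replay as a data-reuse mechanism.
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling; prioritized replay would instead weight by TD error.
        return random.sample(self.storage, min(batch_size, len(self.storage)))


# Usage sketch (hypothetical agent API): one collected transition, many updates.
# buffer = ReplayBuffer()
# buffer.add(s, a, r, s_next, done)      # gathered once from the environment
# for _ in range(updates_per_step):      # reused for several gradient steps
#     agent.update(buffer.sample())
```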
#### Open directions for sample efficiency

- Better data reuse: off-policy learning & replay improvements
- Self-supervised representation learning for control (learning from interaction with the environment)
- Hybrid model-based/model-free approaches
- Transfer & pre-training on large datasets
- Knowledge-driven RL: leveraging pre-trained models

#### Knowledge-Driven RL: Motivation

Current LLMs are not good at decision making.

- Pros: rich knowledge
- Cons: auto-regressive decoding, lack of long-term memory

Reinforcement learning for decision making:

- Pros: can go beyond human intelligence
- Cons: sample inefficiency

### Instability & the Deadly Triad

- Function approximation + bootstrapping + off-policy learning can diverge
- Even "stable" algorithms (e.g., PPO) can be unstable in practice

#### Open directions for stability

- Better optimization landscapes + regularization
- Calibration/monitoring tools for RL training
- Architectures with built-in inductive biases (e.g., equivariance)

### Reproducibility & Evaluation

- Results often depend on random seeds, codebase, and compute budget
- Benchmarks can be overfit; comparisons are often apples-to-oranges
- Offline evaluation is especially tricky

#### Toward Better Evaluation

- Robustness checks and ablations
- Out-of-distribution test suites
- Realistic benchmarks beyond games (e.g., science and healthcare)

### Generalization & Distribution Shift

- Policies overfit to training environments and fail under small changes
- Sim-to-real gap, sensor noise, morphology changes, domain drift
- Requires learning invariances and robust decision rules

#### Open directions for generalization

- Domain randomization + system identification
- Robust/risk-sensitive RL
- Representation learning for invariance
- Meta-RL and fast adaptation

### Model-Based RL: Promise & Pitfalls

- Learned models enable planning and sample efficiency
- But distribution mismatch and model exploitation can break policies
- Long-horizon imagination amplifies errors
- Model learning is itself challenging

### Safety, alignment, and constraints

- Reward mis-specification -> unsafe or unintended behavior
- Need to respect constraints: energy, collisions, ethics, regulation
- Exploration itself may be unsafe

#### Open directions for safe RL

- Constrained RL (Lagrangian methods, control barrier functions (CBFs), shielding); a minimal Lagrangian sketch appears at the end of these notes

### Theory Gaps & Evaluation

- Deep RL lacks strong general guarantees
- We don't fully understand when or why it works
- Bridging theory and practice remains open

#### Promising theory directions

- Optimization theory of RL objectives
- Generalization and representation-learning bounds
- Finite-sample analyses

### Connection to foundation models

- Pre-training on large-scale experience
- World models as sequence predictors
- RLHF/preference optimization for alignment
- Open problems: grounding

### What to expect in the next 3-5 years

- Unified model-based, offline, and safe RL stacks
- Large pretrained decision models
- Deployment in high-stakes domains
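As referenced in the safe-RL bullet above, here is a minimal Python sketch of constrained RL via a Lagrangian relaxation (the `agent` and `env_rollout` interfaces are hypothetical stand-ins for whatever base algorithm is used): the constraint "expected cost no greater than a budget $d$" is folded into the objective with a dual variable $\lambda$, which is increased whenever the budget is violated.

```python
# Illustrative sketch only: Lagrangian relaxation for constrained RL.
# We alternate a primal policy update on the penalized objective
#     J_r(pi) - lam * (J_c(pi) - d)
# with a dual (gradient-ascent) update on lam to enforce the cost budget d.

def constrained_training_loop(agent, env_rollout, budget_d,
                              lambda_lr=0.01, iterations=1_000):
    lam = 0.0
    for _ in range(iterations):
        # Hypothetical rollout: returns a batch of transitions and the
        # average episode cost measured under the current policy.
        batch, avg_cost = env_rollout(agent)

        # Primal step: the agent's usual update, with costs weighted by lam
        # (i.e., the shaped reward r_t - lam * c_t).
        agent.update(batch, cost_weight=lam)

        # Dual step: raise lam when the constraint is violated, relax it
        # (never below zero) when the constraint is satisfied.
        lam = max(0.0, lam + lambda_lr * (avg_cost - budget_d))
    return agent, lam
```

The same primal-dual pattern underlies many constrained deep RL methods; the design choice illustrated here is simply that constraint enforcement becomes an adaptive penalty rather than a hard projection.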