# CSE510 Deep Reinforcement Learning (Lecture 2): Introduction and Markov Decision Processes (MDPs)

## What is reinforcement learning (RL)?

- A general computational framework for learning behavior through reinforcement, i.e., trial and error
- Deep RL: combining deep learning with RL for complex problems
- Shows promise for artificial general intelligence (AGI)

## What RL can do now

### Backgammon

#### Neurogammon

Developed by Gerald Tesauro in 1989 at IBM's research center. Trained to mimic expert demonstrations using supervised learning. Reached the level of an intermediate human player.

#### TD-Gammon (temporal difference learning)

Developed by Gerald Tesauro in 1992 at IBM's research center. A neural network that trains itself to be an evaluation function by playing against itself, starting from random weights. Achieved performance close to the top human players of its time.

### DeepMind Atari

Uses deep Q-learning to play Atari games. Without human demonstrations, it learns to play at a superhuman level; a minimal sketch of the underlying Q-learning idea follows below.
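To make the idea concrete, here is a minimal sketch of the tabular Q-learning update that deep Q-learning builds on. The environment interface (`env.reset()`, `env.step(a)`, `env.actions`) is a hypothetical Gym-style convention assumed for illustration, not DeepMind's actual code; DQN replaces the table with a neural network and adds a replay buffer and a target network.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q[(state, action)] -> estimated return; defaults to 0.0.
    Q = defaultdict(float)

    for _ in range(episodes):
        state, done = env.reset(), False  # hypothetical env interface
        while not done:
            # Epsilon-greedy exploration: mostly exploit, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # One-step TD target: bootstrap from the best next action.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + gamma * best_next * (not done)

            # Move the estimate toward the target by step size alpha.
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```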
### AlphaGo

Combines Monte Carlo tree search, learned policy and value networks for pruning the search tree, expert demonstrations, self-play, and Google's TPUs.

### Video games

OpenAI Five for Dota 2 won 5v5 best-of-3 games against top human players. DeepMind's AlphaStar for StarCraft used supervised training followed by league-based competition training.

### AlphaTensor

Discovers faster matrix multiplication algorithms with reinforcement learning: 76 multiplications versus Strassen-based methods' 80 for 5x5 matrix multiplication.

### Training LLMs

For verifiable tasks (coding, math, etc.), RL can train a model to perform the task without human supervision.

### Robotics

Unitree Go, Atlas by Boston Dynamics, etc.

## What are the challenges of RL in real-world applications?

Beating the human champion is "easier" than physically placing the Go stones.

### State estimation

Known environments (known entities and dynamics) vs. unknown environments (unknown entities and dynamics). Behaviors need to **transfer/generalize** across environmental variations, since the real world is very diverse.

> **State estimation**
>
> To be able to act, you first need to be able to **see**: detect the **objects** you interact with, and detect whether you achieved the **goal**.

Most work sits between two extremes:

- Assume the world model is known (object locations, shapes, and physical properties obtained via AR tags or manual tuning) and use planners to search for the action sequence that achieves a desired goal.
- Do not attempt to detect any objects, and instead learn to map RGB images directly to actions.

Behavior learning is challenging because state estimation is challenging; in other words, because computer vision/perception is challenging.

Interesting direction: **leveraging DRL and vision-language models**.

### Efficiency

Cheap vs. expensive to obtain experience samples.

#### DRL sample efficiency

Humans after 15 minutes tend to outperform DDQN after 115 hours.

#### Reinforcement learning in humans

Humans appear to learn to act (e.g., walk) from "very few examples" of trial and error. How remains an open question. Possible answers:

- Hardware: 230 million years of bipedal movement data
- Imitation learning: observation of other humans walking (e.g., imitation learning, episodic memory, and semantic memory)
- Algorithms: something better than backpropagation and stochastic gradient descent

#### Discrete vs. continuous action spaces

Computation is discrete, but the real action space is continuous.

#### One goal vs. multiple goals

Life is a multi-goal problem, involving infinitely many possible games.

#### Rewards

Automatic rewards vs. rewards that must be detected automatically. Our curiosity is a reward.

#### And more

- Transfer learning
- Generalization
- Long-horizon reasoning
- Model-based RL
- Sparse rewards
- Reward design/learning
- Planning/learning
- Lifelong learning
- Safety
- Interpretability
- etc.

## What is the course about?

- To teach you RL models and algorithms, so you can tackle real-world problems.
- To excite you about RL.
- To provide a primer for you to launch advanced studies.

Schedule:

- RL model and basic algorithms
  - Markov Decision Process (MDP)
  - Passive RL: ADP and TD learning
  - Active RL: Q-learning and SARSA
- Deep RL algorithms
  - Value-based methods
  - Policy gradient methods
  - Model-based methods
- Advanced topics
  - Offline RL, multi-agent RL, etc.

### Reinforcement learning algorithms

#### Model-based

- Learn a model of the world, then plan using the model
- Update the model often
- Re-plan often

#### Value-based

- Learn the state or state-action value
- Act by choosing the best action in each state
- Exploration is a necessary add-on

#### Policy-based

- Learn a stochastic policy function that maps states to actions
- Act by sampling from the policy
- Exploration is baked in

#### From better to worse sample efficiency

- Model-based
- Off-policy / Q-learning
- Actor-critic
- On-policy / policy gradient
- Evolutionary / gradient-free

## What is RL?

## RL model: Markov Decision Process (MDP)
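As a preview of the formalism this section develops, here is the standard textbook definition of an MDP (standard material, not specific to these slides):

```latex
% An MDP is a tuple (S, A, P, R, gamma):
%   S: states   A: actions   P: transition kernel
%   R: reward function       gamma: discount factor
\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a).
\]
The agent seeks a policy $\pi(a \mid s)$ maximizing the expected
discounted return
\[
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t)\right],
\qquad \gamma \in [0, 1).
\]
```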
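To make the tuple concrete, here is a toy sketch of an explicit MDP and one planning method, value iteration, following the model-based recipe above (learn or specify the model, then plan with it). The two-state MDP and all names here are illustrative assumptions, not an example from the lecture.

```python
# Toy two-state MDP written out explicitly: P[s][a] is a list of
# (probability, next_state, reward) outcome triples.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P(s'|s,a) * [r + gamma * V(s')].
V = {s: 0.0 for s in P}
for _ in range(100):
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        for s in P
    }

print(V)  # approximate optimal state values
```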