update

content/CSE510/CSE510_L2.md (new file)

@@ -0,0 +1,187 @@

# CSE510 Deep Reinforcement Learning (Lecture 2)

Introduction and Markov Decision Processes (MDPs)

## What is reinforcement learning (RL)?

- A general computational framework for learning behavior through reinforcement (trial and error)
- Deep RL: combining deep learning with RL for complex problems
- Showing promise for artificial general intelligence (AGI)

## What RL can do now

### Backgammon

#### Neurogammon

Developed by Gerald Tesauro in 1989 at IBM Research.

Trained to mimic expert demonstrations using supervised learning.

Achieved the level of an intermediate human player.

#### TD-Gammon (Temporal Difference Learning)

Developed by Gerald Tesauro in 1992 at IBM Research.

A neural network that trains itself to be an evaluation function by playing against itself, starting from random weights.

Achieved performance close to top human players of its time.
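
The core idea behind TD-Gammon is temporal difference learning: after each move, the value estimate of a position is nudged toward the reward plus the value of the position that follows it. Below is a minimal tabular TD(0) sketch of that update; the `env` interface and hyperparameters are hypothetical placeholders, and TD-Gammon itself used TD(λ) with a neural-network evaluator rather than a table.

```python
# Minimal tabular TD(0) sketch -- the idea behind TD-Gammon, not the original
# algorithm (which used TD(lambda) with a neural network trained via self-play).
from collections import defaultdict

def td0_episode(env, V, alpha=0.1, gamma=1.0):
    """Run one episode and update the state-value table V in place.

    `env` is a hypothetical interface: reset() -> state,
    step(state) -> (next_state, reward, done).
    """
    state = env.reset()
    done = False
    while not done:
        next_state, reward, done = env.step(state)
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])  # TD(0) update toward the target
        state = next_state

V = defaultdict(float)  # value estimates start uninformed, as in self-play from scratch
```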

### DeepMind Atari

Used deep Q-learning (DQN) to play Atari games.

Without human demonstrations, it learned to play many of the games at a superhuman level.
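
The heart of deep Q-learning is regressing Q(s, a) toward the bootstrapped target r + γ · max over a′ of Q(s′, a′). The sketch below shows that target and loss with a small network; the layer sizes, input and action dimensions, and the frozen target-network copy are illustrative assumptions rather than the exact DQN recipe.

```python
# Sketch of the deep Q-learning target and loss. Sizes and hyperparameters are
# illustrative; a real DQN also uses replay buffers, epsilon-greedy exploration,
# and periodic target-network updates.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, .)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # frozen copy
target_net.load_state_dict(q_net.state_dict())
gamma = 0.99

def dqn_loss(states, actions, rewards, next_states, dones):
    """states: [B, 4] float, actions: [B] long, rewards/dones: [B] float."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a)
    with torch.no_grad():
        # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), zero at terminals
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q
    return nn.functional.mse_loss(q_sa, target)
```

Holding the target network fixed between periodic copies keeps the regression target from chasing itself.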

### AlphaGo

Combines Monte Carlo Tree Search with learned policy and value networks that prune the search tree, trained on expert demonstrations and self-play, and accelerated by Google's TPUs.

### Video Games

- OpenAI Five for Dota 2: won a 5v5 best-of-three match against top human players.
- DeepMind AlphaStar for StarCraft II: supervised training followed by league-based competition training.

### AlphaTensor

Discovers faster matrix multiplication algorithms with reinforcement learning.

AlphaTensor: 76 multiplications vs. the Strassen-based 80 for multiplying 4x5 by 5x5 matrices.
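
For context on what "fewer multiplications" means: Strassen's classic scheme multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8, and AlphaTensor searches for analogous decompositions at larger sizes. A small sketch verifying the 2x2 scheme (an illustration only, not AlphaTensor's own output):

```python
# Strassen's 2x2 scheme: 7 scalar multiplications instead of the naive 8.
import numpy as np

def strassen_2x2(A, B):
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)  # matches the standard product
```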

### Training LLMs

For verifiable tasks (coding, math, etc.), RL can be used to train a model to perform the task without human supervision, because the reward can be checked automatically.
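
A toy illustration of what "verifiable" means here: the reward is computed by a programmatic checker rather than a human labeler. The output format and checker below are hypothetical, not any particular training library's API.

```python
# Toy verifiable reward for a math task: 1.0 if the final answer matches the
# reference, else 0.0. Real pipelines use task-specific parsers and checkers.
def verifiable_reward(model_output: str, reference_answer: str) -> float:
    # Assume the model ends its response with a line "Answer: <value>".
    last_line = model_output.strip().splitlines()[-1]
    predicted = last_line.removeprefix("Answer:").strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

print(verifiable_reward("Let x = 3 + 4.\nAnswer: 7", "7"))  # -> 1.0
print(verifiable_reward("Answer: 12", "7"))                 # -> 0.0
```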

### Robotics

Unitree Go, Atlas by Boston Dynamics, etc.

## What are the challenges of RL in real-world applications?

Beating the human champion is "easier" than physically placing the Go stones.

### State estimation

Known environments (known entities and dynamics) vs. unknown environments (unknown entities and dynamics).

Behaviors need to **transfer/generalize** across environmental variations, since the real world is very diverse.

> **State estimation**
>
> To be able to act, you first need to be able to **see**: detect the **objects** you interact with and detect whether you have achieved the **goal**.

Most work falls between two extremes:

- Assume the world model is known (object locations, shapes, and physical properties obtained via AR tags or manual tuning) and use planners to search for an action sequence that achieves a desired goal.
- Do not attempt to detect any objects at all, and instead learn to map RGB images directly to actions.

Behavior learning is challenging because state estimation is challenging; in other words, because computer vision/perception is challenging.

Interesting direction: **leveraging DRL and vision-language models**

### Efficiency

Cheap vs. expensive to obtain experience samples.

#### DRL Sample Efficiency

Humans after 15 minutes tend to outperform DDQN after 115 hours.

#### Reinforcement Learning in Humans

Humans appear to learn to act (e.g., walk) from "very few examples" of trial and error. How remains an open question...

Possible answers:

- Hardware: 230 million years of bipedal movement data
- Imitation learning: observation of other humans walking (e.g., imitation learning, episodic memory, and semantic memory)
- Algorithms: possibly better than backpropagation and stochastic gradient descent

#### Discrete and continuous action spaces

Computation is discrete, but the real action space is continuous.

#### One-goal vs. Multi-goal

Life is a multi-goal problem, involving infinitely many possible games.

#### Automatic rewards vs. detecting rewards

Our curiosity is itself a reward.

#### And more

- Transfer learning
- Generalization
- Long-horizon reasoning
- Model-based RL
- Sparse rewards
- Reward design/learning
- Planning/learning
- Lifelong learning
- Safety
- Interpretability
- etc.

## What is the course about?

To teach you RL models and algorithms.

- To be able to tackle real-world problems.

To excite you about RL.

- To provide a primer for you to launch advanced studies.

Schedule:

- RL Model and basic algorithms
  - Markov Decision Process (MDP)
  - Passive RL: ADP and TD-learning
  - Active RL: Q-Learning and SARSA
- Deep RL algorithms
  - Value-Based methods
  - Policy Gradient Methods
  - Model-Based methods
- Advanced Topics
  - Offline RL, Multi-Agent RL, etc.

### Reinforcement Learning Algorithms

#### Model-Based

- Learn a model of the world, then plan using the model (see the sketch below)
- Update the model often
- Re-plan often
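
A minimal sketch of this pattern under simplifying assumptions: a linear dynamics model fit by least squares, a known reward function, and random-shooting planning. All names and shapes are illustrative.

```python
# Model-based pattern: learn a dynamics model, then plan by simulating with it.
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of next_state ~ [state, action] @ W."""
    X = np.hstack([states, actions])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def plan_random_shooting(W, state, reward_fn, horizon=10, n_candidates=100, action_dim=1):
    """Return the first action of the best random action sequence under the model."""
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        seq = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, total = state.copy(), 0.0
        for a in seq:
            s = np.concatenate([s, a]) @ W   # one simulated step through the model
            total += reward_fn(s, a)
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action
```

In the full loop, the agent executes the chosen action, appends the new transition to its dataset, refits the model, and re-plans, matching the "update the model often, re-plan often" bullets above.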

#### Value-Based

- Learn the state or state-action value function (see the sketch below)
- Act by choosing the best action in the current state
- Exploration is a necessary add-on
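
A minimal tabular sketch of this pattern with an illustrative action set and hyperparameters: learn Q(s, a), act greedily with respect to it, and bolt on epsilon-greedy exploration.

```python
# Value-based pattern: learn Q(s, a); act by argmax; exploration is an add-on.
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)]
actions = [0, 1]        # hypothetical discrete action set
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def act(state):
    if random.random() < epsilon:                     # exploration add-on
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])  # greedy with respect to Q

def q_update(state, action, reward, next_state, done):
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```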

#### Policy-Based

- Learn a stochastic policy function that maps states to actions (see the sketch below)
- Act by sampling from the policy
- Exploration is baked in
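
A minimal sketch of this pattern, assuming an illustrative softmax policy over two discrete actions: acting means sampling from the policy, so exploration comes from its own randomness rather than from an added rule.

```python
# Policy-based pattern: a stochastic policy pi(a | s); acting = sampling from it.
import numpy as np

theta = np.zeros((4, 2))  # hypothetical: 4 state features, 2 discrete actions

def policy(state):
    """Softmax policy pi(a | s) over the action set."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def act(state):
    probs = policy(state)
    return np.random.choice(len(probs), p=probs)  # sample rather than argmax
```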

#### From better to worse sample efficiency

- Model-Based
- Off-policy/Q-learning
- Actor-critic
- On-policy/Policy gradient
- Evolutionary/Gradient-free

## What is RL?

## RL model: Markov Decision Process (MDP)

@@ -4,4 +4,5 @@ export default {
    type: 'separator'
  },
  CSE510_L1: "CSE510 Deep Reinforcement Learning (Lecture 1)",
  CSE510_L2: "CSE510 Deep Reinforcement Learning (Lecture 2)",
}