update

content/CSE510/CSE510_L2.md (new file)

@@ -0,0 +1,187 @@

# CSE510 Deep Reinforcement Learning (Lecture 2)

Introduction and Markov Decision Processes (MDPs)

## What is reinforcement learning (RL)?

- A general computational framework for learning behavior through reinforcement (trial and error)
- Deep RL: combining deep learning with RL for complex problems
- Showing promise for artificial general intelligence (AGI)

## What RL can do now

### Backgammon

#### Neurogammon

Developed by Gerald Tesauro in 1989 at IBM Research.

Trained to mimic expert demonstrations using supervised learning.

Achieved the level of an intermediate human player.

#### TD-Gammon (Temporal Difference Learning)

Developed by Gerald Tesauro in 1992 at IBM Research.

A neural network that trains itself to be an evaluation function by playing against itself, starting from random weights.

Achieved performance close to top human players of its time.
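
The core idea behind TD-Gammon is temporal difference learning: after each move, the value estimate of a position is nudged toward the reward plus the value of the position that follows it. Below is a minimal tabular TD(0) sketch of that update; the `env` interface and hyperparameters are hypothetical placeholders, and TD-Gammon itself used TD(λ) with a neural-network evaluator rather than a table.

```python
# Minimal tabular TD(0) sketch -- the idea behind TD-Gammon, not the original
# algorithm (which used TD(lambda) with a neural network trained via self-play).
from collections import defaultdict

def td0_episode(env, V, alpha=0.1, gamma=1.0):
    """Run one episode and update the state-value table V in place.

    `env` is a hypothetical interface: reset() -> state,
    step(state) -> (next_state, reward, done).
    """
    state = env.reset()
    done = False
    while not done:
        next_state, reward, done = env.step(state)
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])  # TD(0) update toward the target
        state = next_state

V = defaultdict(float)  # value estimates start uninformed, as in self-play from scratch
```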

### DeepMind Atari

Used deep Q-learning (DQN) to play Atari games.

Without human demonstrations, it learned to play many of the games at a superhuman level.
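
The heart of deep Q-learning is regressing Q(s, a) toward the bootstrapped target r + γ · max over a′ of Q(s′, a′). The sketch below shows that target and loss with a small network; the layer sizes, input and action dimensions, and the frozen target-network copy are illustrative assumptions rather than the exact DQN recipe.

```python
# Sketch of the deep Q-learning target and loss. Sizes and hyperparameters are
# illustrative; a real DQN also uses replay buffers, epsilon-greedy exploration,
# and periodic target-network updates.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, .)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # frozen copy
target_net.load_state_dict(q_net.state_dict())
gamma = 0.99

def dqn_loss(states, actions, rewards, next_states, dones):
    """states: [B, 4] float, actions: [B] long, rewards/dones: [B] float."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a)
    with torch.no_grad():
        # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), zero at terminals
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q
    return nn.functional.mse_loss(q_sa, target)
```

Holding the target network fixed between periodic copies keeps the regression target from chasing itself.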

### AlphaGo

Combines Monte Carlo Tree Search with learned policy and value networks that prune the search tree, trained on expert demonstrations and self-play, and accelerated by Google's TPUs.

### Video Games

- OpenAI Five for Dota 2: won a 5v5 best-of-three match against top human players.
- DeepMind AlphaStar for StarCraft II: supervised training followed by league-based competition training.

### AlphaTensor

Discovers faster matrix multiplication algorithms with reinforcement learning.

AlphaTensor: 76 multiplications vs. the Strassen-based 80 for multiplying 4x5 by 5x5 matrices.
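
For context on what "fewer multiplications" means: Strassen's classic scheme multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8, and AlphaTensor searches for analogous decompositions at larger sizes. A small sketch verifying the 2x2 scheme (an illustration only, not AlphaTensor's own output):

```python
# Strassen's 2x2 scheme: 7 scalar multiplications instead of the naive 8.
import numpy as np

def strassen_2x2(A, B):
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)  # matches the standard product
```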

### Training LLMs

For verifiable tasks (coding, math, etc.), RL can be used to train a model to perform the task without human supervision, because the reward can be checked automatically.
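
A toy illustration of what "verifiable" means here: the reward is computed by a programmatic checker rather than a human labeler. The output format and checker below are hypothetical, not any particular training library's API.

```python
# Toy verifiable reward for a math task: 1.0 if the final answer matches the
# reference, else 0.0. Real pipelines use task-specific parsers and checkers.
def verifiable_reward(model_output: str, reference_answer: str) -> float:
    # Assume the model ends its response with a line "Answer: <value>".
    last_line = model_output.strip().splitlines()[-1]
    predicted = last_line.removeprefix("Answer:").strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

print(verifiable_reward("Let x = 3 + 4.\nAnswer: 7", "7"))  # -> 1.0
print(verifiable_reward("Answer: 12", "7"))                 # -> 0.0
```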

### Robotics

Unitree Go, Atlas by Boston Dynamics, etc.

## What are the challenges of RL in real-world applications?

Beating the human champion is "easier" than physically placing the Go stones.

### State estimation

Known environments (known entities and dynamics) vs. unknown environments (unknown entities and dynamics).

Behaviors need to **transfer/generalize** across environmental variations, since the real world is very diverse.

> **State estimation**
>
> To be able to act, you first need to be able to **see**: detect the **objects** you interact with and detect whether you have achieved the **goal**.

Most work falls between two extremes:

- Assume the world model is known (object locations, shapes, and physical properties obtained via AR tags or manual tuning) and use planners to search for an action sequence that achieves a desired goal.
- Do not attempt to detect any objects at all, and instead learn to map RGB images directly to actions.

Behavior learning is challenging because state estimation is challenging; in other words, because computer vision/perception is challenging.

Interesting direction: **leveraging DRL and vision-language models**

### Efficiency

Cheap vs. expensive to obtain experience samples.

#### DRL Sample Efficiency

Humans after 15 minutes tend to outperform DDQN after 115 hours.

#### Reinforcement Learning in Humans

Humans appear to learn to act (e.g., walk) from "very few examples" of trial and error. How remains an open question...

Possible answers:

- Hardware: 230 million years of bipedal movement data
- Imitation learning: observation of other humans walking (e.g., imitation learning, episodic memory, and semantic memory)
- Algorithms: possibly better than backpropagation and stochastic gradient descent

#### Discrete and continuous action spaces

Computation is discrete, but the real action space is continuous.

#### One-goal vs. Multi-goal

Life is a multi-goal problem, involving infinitely many possible games.

#### Automatic rewards vs. detecting rewards

Our curiosity is itself a reward.

#### And more

- Transfer learning
- Generalization
- Long-horizon reasoning
- Model-based RL
- Sparse rewards
- Reward design/learning
- Planning/learning
- Lifelong learning
- Safety
- Interpretability
- etc.

## What is the course about?

To teach you RL models and algorithms.

- To be able to tackle real-world problems.

To excite you about RL.

- To provide a primer for you to launch advanced studies.

Schedule:

- RL Model and basic algorithms
  - Markov Decision Process (MDP)
  - Passive RL: ADP and TD-learning
  - Active RL: Q-Learning and SARSA
- Deep RL algorithms
  - Value-Based methods
  - Policy Gradient Methods
  - Model-Based methods
- Advanced Topics
  - Offline RL, Multi-Agent RL, etc.

### Reinforcement Learning Algorithms

#### Model-Based

- Learn a model of the world, then plan using the model (see the sketch below)
- Update the model often
- Re-plan often
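
A minimal sketch of this pattern under simplifying assumptions: a linear dynamics model fit by least squares, a known reward function, and random-shooting planning. All names and shapes are illustrative.

```python
# Model-based pattern: learn a dynamics model, then plan by simulating with it.
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of next_state ~ [state, action] @ W."""
    X = np.hstack([states, actions])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def plan_random_shooting(W, state, reward_fn, horizon=10, n_candidates=100, action_dim=1):
    """Return the first action of the best random action sequence under the model."""
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        seq = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, total = state.copy(), 0.0
        for a in seq:
            s = np.concatenate([s, a]) @ W   # one simulated step through the model
            total += reward_fn(s, a)
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action
```

In the full loop, the agent executes the chosen action, appends the new transition to its dataset, refits the model, and re-plans, matching the "update the model often, re-plan often" bullets above.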

#### Value-Based

- Learn the state or state-action value function (see the sketch below)
- Act by choosing the best action in the current state
- Exploration is a necessary add-on
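
A minimal tabular sketch of this pattern with an illustrative action set and hyperparameters: learn Q(s, a), act greedily with respect to it, and bolt on epsilon-greedy exploration.

```python
# Value-based pattern: learn Q(s, a); act by argmax; exploration is an add-on.
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)]
actions = [0, 1]        # hypothetical discrete action set
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def act(state):
    if random.random() < epsilon:                     # exploration add-on
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])  # greedy with respect to Q

def q_update(state, action, reward, next_state, done):
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```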

#### Policy-Based

- Learn a stochastic policy function that maps states to actions (see the sketch below)
- Act by sampling from the policy
- Exploration is baked in
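
A minimal sketch of this pattern, assuming an illustrative softmax policy over two discrete actions: acting means sampling from the policy, so exploration comes from its own randomness rather than from an added rule.

```python
# Policy-based pattern: a stochastic policy pi(a | s); acting = sampling from it.
import numpy as np

theta = np.zeros((4, 2))  # hypothetical: 4 state features, 2 discrete actions

def policy(state):
    """Softmax policy pi(a | s) over the action set."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def act(state):
    probs = policy(state)
    return np.random.choice(len(probs), p=probs)  # sample rather than argmax
```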

#### From better to worse sample efficiency

- Model-Based
- Off-policy/Q-learning
- Actor-critic
- On-policy/Policy gradient
- Evolutionary/Gradient-free

## What is RL?

## RL model: Markov Decision Process (MDP)

@@ -4,4 +4,5 @@ export default {
    type: 'separator'
  },
  CSE510_L1: "CSE510 Deep Reinforcement Learning (Lecture 1)",
  CSE510_L2: "CSE510 Deep Reinforcement Learning (Lecture 2)",
}