# CSE510 Deep Reinforcement Learning (Lecture 1)
## Artificial general intelligence
- Multimodal perception
- Persistent memory + retrieval
- World modeling + planning
- Tool use with verification
- Interactive learning loops (RLHF/RLAIF)
- Uncertainty estimation & oversight

LLMs may not be the ultimate solution for AGI, but they may be part of the solution.
## Long-Horizon Agency
Decision-making/control and multi-agent collaboration.
## Course logistics
Announcements and discussion on Canvas.

Weekly recitations:

- Thursday 4:00 PM - 5:00 PM in McKelvey Hall 1030,
- or office hours (11 AM - 12 PM Wednesday in McKelvey Hall 2010D),
- or by appointment.
### Prerequisites
- Proficiency in Python programming.
- **Programming experience with deep learning**.
- Research experience (not required, but highly recommended)
- Mathematics: Linear Algebra (MA 429 or MA 439 or ESE 318), Calculus III (MA 233), Probability & Statistics.
### Textbook
Not required, but recommended:
- Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed., online).
- Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed.).
- OpenAI Spinning Up in Deep RL tutorial.
### Final Project
Research-level project of your choice
- Improving an existing approach
- Tackling an unsolved task/benchmark
- Creating a new task/problem that hasn't been addressed by RL

The project can be done in a team of 1-2 students and must be harder than the homework. The core goal is to understand the pipeline of RL research; the result does not have to be an improvement over existing methods.
#### Milestones
- Proposal (max 2 pages)
- Progress report with brief survey (max 4 pages)
- Presentation/Poster session
- Final report (7-10 pages, NeurIPS style)
## What is RL?
### Goal of the course
How do we build intelligent agents that **learn to act** and achieve specific goals in **dynamic environments**?

Acting to achieve goals is a key part of intelligence.

> The brain exists to produce adaptable and complex movements. (Daniel Wolpert)
## What RL does
A general-purpose framework for decision making and behavioral learning:
- RL is for an agent with the capacity to act.
- Each action influences the agent's future observations.
- Success is measured by a scalar reward signal.
- Goal: find a policy that maximizes the expected total reward (see the objective below).
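
As a point of reference, this goal can be written out explicitly. A minimal formalization (my own sketch, not from the lecture, assuming a discounted setting with discount factor $\gamma \in [0, 1)$, per-step rewards $r_t$, and trajectories $\tau$ generated by running the policy $\pi$ in the environment):

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],
\qquad
\pi^{*} = \arg\max_{\pi} J(\pi).
$$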
Exploration: add randomness to your action selection. If the result was better than expected, do more of the same in the future (see the sketch below).
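
A minimal sketch of this loop (my own illustration, not course code): epsilon-greedy exploration on a made-up 3-armed bandit, where actions that turn out better than the agent's current estimate are chosen more often over time.

```python
import random

# Toy 3-armed bandit; the reward probabilities are made up and unknown to the agent.
REWARD_PROBS = [0.2, 0.5, 0.8]
EPSILON = 0.1      # exploration rate: how often we pick a random action
STEP_SIZE = 0.1    # learning rate for the value estimates

values = [0.0, 0.0, 0.0]  # the agent's current estimate of each action's reward

def select_action():
    """Exploration: with probability EPSILON act randomly,
    otherwise pick the action currently believed to be best."""
    if random.random() < EPSILON:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

for step in range(10_000):
    action = select_action()
    reward = 1.0 if random.random() < REWARD_PROBS[action] else 0.0
    # "If the result was better than expected, do more of the same":
    # move the estimate toward the observed reward, so actions that beat
    # expectations get selected more often in the future.
    values[action] += STEP_SIZE * (reward - values[action])

print(values)  # estimates should roughly approach REWARD_PROBS
```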
### Deep reinforcement learning
|
|
|
|
Deep learning (DL) is a general-purpose framework for representation learning:
- Given an objective
- Learn the representation required to achieve the objective
- Directly from raw inputs
- Using minimal domain knowledge
Deep learning enables RL algorithms to solve complex problems in an end-to-end manner.
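
A rough sketch of what this looks like in code (assuming PyTorch; the layer sizes and dimensions below are made up for illustration): a policy network maps raw observations directly to a distribution over actions, learning its internal representation end-to-end.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a raw observation vector to a categorical distribution over actions."""

    def __init__(self, obs_dim: int = 8, num_actions: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),      # learned features from raw input
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one logit per action
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

# Usage: sample an action for a (fake) raw observation.
obs = torch.randn(8)
action = PolicyNetwork()(obs).sample()
```

With image observations, the linear feature layers would typically be replaced by convolutional ones; the end-to-end idea is the same.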
### Machine learning paradigms
- Supervised learning: learning from examples
- Self-supervised learning: learning structures in data
- Reinforcement learning: learning from experiences

Example using LLMs:

- Self-supervised learning: pretraining
- Supervised learning: SFT, supervised fine-tuning (post-training)
- Reinforcement learning: RLHF, reinforcement learning from human feedback (fine-tuning); RL is also used in post-training to improve reasoning capabilities

_RL generates data beyond the original training data._

All the paradigms are "supervised" by a loss function.
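
One way to see this concretely (a toy sketch of my own, assuming PyTorch; the tensors are random stand-ins, not real data): each paradigm minimizes some loss, they just differ in where the supervision signal comes from.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # model outputs for a toy batch

# Supervised learning: loss against human-provided labels.
labels = torch.randint(0, 10, (4,))
supervised_loss = F.cross_entropy(logits, labels)

# Self-supervised learning: the "label" is carved out of the data itself,
# e.g. next-token prediction uses the following token as the target.
next_tokens = torch.randint(0, 10, (4,))
self_supervised_loss = F.cross_entropy(logits, next_tokens)

# Reinforcement learning: no labels; sampled actions are reinforced in
# proportion to a scalar reward (a REINFORCE-style surrogate loss).
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
rewards = torch.randn(4)                    # made-up reward signal
rl_loss = -(dist.log_prob(actions) * rewards).mean()
```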
### How RL differs from other paradigms
- **Exploration**: the agent does not have prior data known to be good; it must gather its own experience.
- **Non-stationarity**: the environment is dynamic, and the agent's actions influence what it observes next.
- **Credit assignment**: rewards can be delayed, so the agent must learn which of its earlier actions deserve credit (see the sketch below).
- **Limited samples**: actions take time to execute in the real world, which may limit the amount of experience available.
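
A tiny sketch of the credit-assignment point (my own illustration, with a made-up discount factor): when the only reward arrives at the end of an episode, discounted returns are one standard way to propagate credit back to earlier actions.

```python
# Credit assignment with a delayed reward: a single reward arrives only at the
# last step; discounted returns spread credit back to the earlier actions.
GAMMA = 0.99                      # discount factor (assumed value)
rewards = [0.0, 0.0, 0.0, 1.0]    # reward is delayed until the final step

returns = []
g = 0.0
for r in reversed(rewards):       # work backwards through the episode
    g = r + GAMMA * g
    returns.append(g)
returns.reverse()

print(returns)  # [0.9703, 0.9801, 0.99, 1.0] — earlier steps receive discounted credit
```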