# CSE510 Deep Reinforcement Learning (Lecture 2)

Introduction and Markov Decision Processes (MDPs)

## What is reinforcement learning (RL)?

- A general computational framework for behavior learning through reinforcement/trial and error
- Deep RL: combining deep learning with RL for complex problems
- Shows promise as a path toward artificial general intelligence (AGI)

## What RL can do now

### Backgammon

#### Neurogammon

Developed by Gerald Tesauro at IBM Research in 1989.

Trained to mimic expert demonstrations using supervised learning.

Achieved the level of an intermediate human player.

#### TD-Gammon (Temporal Difference Learning)

Developed by Gerald Tesauro at IBM Research in 1992.

A neural network that trains itself to be an evaluation function by playing against itself, starting from random weights.

Achieved performance close to the top human players of its time.
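
For reference, the core temporal-difference idea is to nudge the value estimate of the current position toward the reward plus the estimated value of the next position. A minimal TD(0) form of the update is shown below; TD-Gammon itself used the TD($\lambda$) variant with eligibility traces and a neural-network value function:

$$
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \right]
$$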

### DeepMind Atari

Uses deep Q-learning (DQN) to play Atari games directly from screen pixels.

Without human demonstrations, it learned to play many of the games at a superhuman level.
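
As a rough sketch of the rule underneath (this is tabular Q-learning; DQN replaces the table with a deep network trained on minibatches from a replay buffer, and all sizes below are purely illustrative):

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99              # learning rate and discount factor (illustrative)
Q = np.zeros((n_states, n_actions))   # tabular stand-in for the deep Q-network

def q_update(s, a, r, s_next, done):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition: in state 3, action 1 yielded reward 1.0 and led to state 4.
q_update(s=3, a=1, r=1.0, s_next=4, done=False)
```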
### AlphaGo

Combines Monte Carlo tree search with learned policy and value networks that prune the search tree, plus expert demonstrations, self-play, and Google's TPUs.

### Video Games

OpenAI Five for Dota 2: won a best-of-three 5v5 match against top human players.

DeepMind AlphaStar for StarCraft II: supervised training followed by league-based competition training.
### AlphaTensor

Discovers faster matrix multiplication algorithms with reinforcement learning.

For example, AlphaTensor found an algorithm that multiplies a 4x5 matrix by a 5x5 matrix using 76 scalar multiplications, beating the previously best known 80.
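
To make "fewer scalar multiplications" concrete, here is the classic Strassen construction for the 2x2 case, which uses 7 multiplications instead of the naive 8; AlphaTensor searches for decompositions of exactly this kind for larger shapes (a minimal sketch, not AlphaTensor's actual output):

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications instead of 8."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
assert np.allclose(strassen_2x2(A, B), A @ B)
```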
### Training LLMs

For tasks with verifiable outcomes (coding, math, etc.), RL can train a model to perform the task without human supervision, since correctness can be checked automatically.

### Robotics

Unitree Go, Atlas by Boston Dynamics, etc.

## What are the challenges of RL in real-world applications?

Beating the human champion at Go is "easier" than physically placing the Go stones on the board.

### State estimation

Known environments (known entities and dynamics) vs. unknown environments (unknown entities and dynamics).

Behaviors need to **transfer/generalize** across environmental variations, since the real world is very diverse.

> **State estimation**
>
> To be able to act, you first need to be able to **see**: detect the **objects** you interact with, and detect whether you have achieved the **goal**.

Most work falls between two extremes:

- Assume the world model is known (object locations, shapes, and physical properties obtained via AR tags or manual tuning) and use planners to search for an action sequence that achieves the desired goal.
- Do not attempt to detect any objects at all, and instead learn to map RGB images directly to actions.

Behavior learning is challenging because state estimation is challenging; in other words, because computer vision/perception is challenging.

Interesting direction: **leveraging DRL and vision-language models**

### Efficiency

Cheap vs. expensive experience samples (e.g., simulated games vs. real-robot interaction).

#### DRL Sample Efficiency

Humans after 15 minutes of play tend to outperform DDQN after 115 hours of training.

#### Reinforcement Learning in Humans

Humans appear to learn to act (e.g., walk) from "very few examples" of trial and error. How remains an open question...

Possible answers:

- Hardware: ~230 million years of bipedal movement "data" baked in by evolution
- Imitation learning: observation of other humans walking (e.g., imitation learning, episodic memory, and semantic memory)
- Algorithms: learning rules possibly better than backpropagation and stochastic gradient descent
#### Discrete and continuous action spaces

Computation is discrete, but the real action space is continuous.

#### One-goal vs. Multi-goal

Life is a multi-goal problem, involving infinitely many possible games.

#### Automatic reward detection

Our curiosity acts as an intrinsic reward.

#### And more

- Transfer learning
- Generalization
- Long-horizon reasoning
- Model-based RL
- Sparse rewards
- Reward design/learning
- Planning/learning
- Lifelong learning
- Safety
- Interpretability
- etc.

## What is the course about?

To teach you RL models and algorithms:

- To be able to tackle real-world problems.

To excite you about RL:

- To provide a primer for you to launch advanced studies.

Schedule:

- RL model and basic algorithms
  - Markov Decision Process (MDP)
  - Passive RL: ADP and TD-learning
  - Active RL: Q-learning and SARSA
- Deep RL algorithms
  - Value-based methods
  - Policy gradient methods
  - Model-based methods
- Advanced topics
  - Offline RL, multi-agent RL, etc.

### Reinforcement Learning Algorithms

#### Model-Based

- Learn a model of the world, then plan using the model
- Update the model often
- Re-plan often (a minimal sketch of this loop follows below)
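
Below is a minimal sketch of this loop under illustrative assumptions (a made-up tabular problem; every size, the count smoothing, and the interface are invented for the example): estimate a tabular model from transition counts, then plan on it with value iteration.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9

trans_counts = np.ones((n_states, n_actions, n_states))  # smoothed transition counts
reward_sum = np.zeros((n_states, n_actions))
visit_counts = np.zeros((n_states, n_actions))

def update_model(s, a, r, s_next):
    """Fold one observed transition into the learned model."""
    trans_counts[s, a, s_next] += 1
    reward_sum[s, a] += r
    visit_counts[s, a] += 1

def plan(n_iters=100):
    """Value iteration on the current model estimate; returns a greedy policy."""
    P = trans_counts / trans_counts.sum(axis=2, keepdims=True)  # P[s, a, s']
    R = reward_sum / np.maximum(visit_counts, 1)                # mean reward per (s, a)
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)   # Q[s, a] = R[s, a] + gamma * E[V(s') | s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

update_model(s=0, a=1, r=1.0, s_next=3)  # update the model often...
policy = plan()                          # ...and re-plan often
```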

#### Value-Based

- Learn the state or state-action value
- Act by choosing the best action in the state
- Exploration is a necessary add-on (see the epsilon-greedy sketch below)
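
A purely greedy policy never tries untested actions, so exploration must be added explicitly; epsilon-greedy is the usual minimal recipe (values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the highest-value action (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

action = epsilon_greedy(np.array([0.2, 0.8, 0.5]))
```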

#### Policy-Based

- Learn a stochastic policy function that maps states to actions
- Act by sampling from the policy
- Exploration is baked in (see the sampling sketch below)
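
Because the policy itself is stochastic, acting is just sampling from it, which is why exploration comes for free; a minimal sketch with a softmax over action scores (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(logits):
    """Turn unnormalized action scores into action probabilities."""
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

probs = softmax_policy(np.array([1.0, 2.0, 0.5]))
action = rng.choice(len(probs), p=probs)  # acting = sampling from the policy
```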

#### From better to worse sample efficiency

- Model-Based
- Off-policy/Q-learning
- Actor-critic
- On-policy/Policy gradient
- Evolutionary/Gradient-free

## What is RL?

## RL model: Markov Decision Process (MDP)