update

content/CSE510/CSE510_L3.md (new file, 240 lines)

# CSE510 Lecture 3

## Introduction and Definition of MDPs

### Definition and Examples

#### Reinforcement Learning

A computational framework for behavior learning through reinforcement.

- RL is for an agent with the capacity to act
- Each action influences the agent's future observations
- Success is measured by a scalar reward signal
- Goal: find a policy that maximizes expected total rewards
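
To make this interaction loop concrete, here is a minimal sketch of the agent-environment loop in Python. The `env.reset()` / `env.step(action)` interface (returning observation, reward, done) and the `RandomAgent` class are illustrative assumptions for this sketch, not a specific library's API.

```python
import random

# Minimal sketch of the RL interaction loop (environment interface is assumed, not a real API).
class RandomAgent:
    def __init__(self, actions):
        self.actions = actions

    def act(self, observation):
        # A real agent would map the observation to an action via a learned policy.
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """Roll out one episode and return the total (undiscounted) reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)                # the agent acts
        obs, reward, done = env.step(action)   # each action influences future observations
        total_reward += reward                 # success is measured by scalar rewards
        if done:
            break
    return total_reward
```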

Mathematical Model: Markov Decision Processes (MDP)

#### Markov Decision Processes (MDP)

A Finite MDP is defined by:

- A finite set of states $s \in S$
- A finite set of actions $a \in A$
- A transition function $T(s, a, s')$
  - The probability that taking action $a$ in state $s$ leads to $s'$, i.e., $P(s' \mid s, a)$
  - Also called the model or the dynamics
- A reward function $R(s)$ (sometimes $R(s, a)$ or $R(s, a, s')$)
- A start state
- Maybe a terminal state

A model for sequential decision-making problems under uncertainty.
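
For concreteness, a finite MDP can be written down directly as data. The following is a minimal sketch using a made-up two-state MDP; the names `states`, `actions`, `P`, `R`, and `gamma` are illustrative, not from the lecture.

```python
# A toy finite MDP encoded as plain Python data structures (illustrative only).
states = ["s0", "s1"]
actions = ["stay", "go"]

# Transition function T(s, a, s') = P(s' | s, a), stored as nested dicts.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# State-dependent reward R(s).
R = {"s0": 0.0, "s1": 1.0}

gamma = 0.9  # discount factor

# Sanity check: each conditional distribution P(. | s, a) sums to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```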

#### States

- **State is a snapshot of everything that matters for the next decision**
- _Experience_ is a sequence of observations, actions, and rewards.
- _Observation_ is the raw input of the agent's sensors.
- The state is a summary of the experience.

$$
s_t = f(o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t)
$$

- The state can **include immediate observations, highly processed observations, and structures built up over time from sequences of observations, memories**, etc.
- In a fully observed environment, $s_t = f(o_t)$
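
As a small illustration of one common choice of $f$, the state can be built from the last $k$ observations (e.g., frame stacking in Atari). The class and variable names below are my own for this sketch.

```python
from collections import deque

# One concrete choice of s_t = f(o_1, ..., o_t): keep only the last k observations
# (e.g., frame stacking in Atari). Illustrative sketch; names are invented.
class FrameStackState:
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def update(self, observation):
        """Incorporate the newest observation and return the current state."""
        self.frames.append(observation)
        # Pad by repeating the oldest frame until we have k of them.
        while len(self.frames) < self.k:
            self.frames.appendleft(self.frames[0])
        return tuple(self.frames)

# In a fully observed environment, f would simply be: state = observation.
```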

#### Actions

- **Action = choice you make now**
- They are used by the agent to interact with the world.
- They can have many different temporal granularities and abstractions.
- For a robot gripper, for example, actions can be defined to be
  - The instantaneous torques on the gripper
  - The instantaneous gripper translation, rotation, and opening
  - Instantaneous forces applied to the objects
  - Short sequences of the above

#### Rewards

- **Reward = score you get as a result**
- They are scalar values provided by the environment to the agent that indicate whether goals have been achieved
  - e.g., 1 if the goal is achieved and 0 otherwise, or -1 for every time step the goal is not achieved
- Rewards specify what the agent needs to achieve, not how to achieve it.
- The simplest and cheapest form of supervision, and surprisingly general.
- **Dense rewards are always preferred if available**
  - e.g., distance changes to a goal.
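
A small illustration of sparse vs. dense rewards for a goal-reaching task; the function and argument names (`goal_reached`, `position`, `goal`) are hypothetical, invented for this sketch.

```python
import math

# Illustrative sparse vs. dense reward for a goal-reaching task (hypothetical names).
def sparse_reward(goal_reached: bool) -> float:
    # Only says *whether* the goal was achieved.
    return 1.0 if goal_reached else 0.0

def dense_reward(prev_position, position, goal) -> float:
    # Rewards progress: positive when the agent moved closer to the goal this step.
    prev_dist = math.dist(prev_position, goal)
    dist = math.dist(position, goal)
    return prev_dist - dist  # the "distance change" to the goal
```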

#### Dynamics or the Environment Model

- **Transition = dice roll** the world makes after your choice.
- How the state changes given the current state and action

$$
P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)
$$

- Modeling the uncertainty
  - Everyone has their own "world model", capturing the physical laws of the world.
  - Humans also have their own "social model", shaped by their values, beliefs, etc.
- Two problems:
  - Planning: the dynamics model is known
  - Reinforcement learning: the dynamics model is unknown
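
The dynamics model is just a conditional distribution we can sample from. A minimal sketch, reusing the toy `P` dictionary from the earlier finite-MDP example (names are illustrative):

```python
import random

# Sample s' ~ P(. | s, a) from a tabular dynamics model stored as nested dicts,
# e.g. the toy `P` from the earlier finite-MDP sketch (illustrative).
def sample_next_state(P, s, a, rng=random):
    dist = P[(s, a)]                 # {s': probability}
    states = list(dist.keys())
    probs = list(dist.values())
    return rng.choices(states, weights=probs, k=1)[0]

# Planning assumes P is known; reinforcement learning must act without knowing P,
# observing only sampled transitions like the one returned here.
```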

#### Assumptions we have for MDPs

**First-order Markovian dynamics** (history independence)

- The next state depends only on the current state and current action

$$
P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t, S_{t-1}, A_{t-1}, \ldots, S_1, A_1) = P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)
$$

**State-dependent reward**

- The reward is a deterministic function of the current state

**Stationary dynamics**: the dynamics do not depend on time

$$
P(S_{t+1}=s' \mid A_t, S_t) = P(S_{k+1}=s' \mid A_k, S_k), \quad \forall t, k
$$

**Full observability** of the state

- Though we can't predict exactly which state we will reach when we execute an action, once the action is executed, we know the new state.

### Examples

#### Atari games

- States: raw RGB frames (one frame is not enough, so we use a sequence of frames, usually 4)
- Actions: 18 joystick actions
- Reward: score changes

#### Go

- States: features of the game board
- Actions: place a stone or resign
- Reward: win +1, lose -1, draw 0

#### Autonomous car driving

- States: speed, direction, lanes, traffic, weather, etc.
- Actions: steer, brake, throttle
- Reward (example values): +1 for reaching the destination, -1 for getting honked at by surrounding cars, -100 for a collision

#### Grid World

A maze-like problem

- The agent lives in a grid
- States: position of the agent
- Noisy actions: east, south, west, north
- Dynamics: actions do not always go as planned (see the sketch below)
  - 80% of the time, the action North takes the agent north (if there is a wall, the agent stays put)
  - 10% of the time, the action North takes the agent west; 10% of the time, it takes the agent east
- Rewards the agent receives at each time step
  - A small "living" reward each step (can be negative)
  - A big reward for reaching the goal

> [!NOTE]
>
> Due to the noise in the actions, it is insufficient to just output a sequence of actions to reach the goal.
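
Here is a minimal sketch of the 80/10/10 noisy dynamics described above for the action North; the grid size, wall layout, and helper names are assumptions made for this example, not from the lecture.

```python
# Noisy Grid World dynamics for the action "north": 80% north, 10% west, 10% east.
# If the intended move hits a wall or leaves the grid, the agent stays put.
# Grid size, wall layout, and names here are illustrative assumptions.
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}

def _move(state, delta):
    x, y = state
    nx, ny = x + delta[0], y + delta[1]
    if 0 <= nx < WIDTH and 0 <= ny < HEIGHT and (nx, ny) not in WALLS:
        return (nx, ny)
    return (x, y)  # blocked: stay in place

def transition_north(state):
    """Return {next_state: probability} for taking action 'north' in `state`."""
    dist = {}
    for delta, prob in [((0, 1), 0.8), ((-1, 0), 0.1), ((1, 0), 0.1)]:
        s_next = _move(state, delta)
        dist[s_next] = dist.get(s_next, 0.0) + prob
    return dist

# Example: from (1, 0) the intended northward move is blocked by the wall at (1, 1),
# so that 0.8 probability mass stays at (1, 0).
print(transition_north((1, 0)))  # {(1, 0): 0.8, (0, 0): 0.1, (2, 0): 0.1}
```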

### Solution and its criterion

### Solution to an MDP

- Actions have stochastic effects, so the state we end up in is uncertain
- This means that we might end up in states where the remainder of a fixed action sequence doesn't apply or is a bad choice
- A solution should tell us what the best action is for any possible situation/state that might arise

### Policy as the output of an MDP

A stationary policy is a mapping from states to actions

- $\pi: S \to A$
- $\pi(s)$ is the action to take in state $s$ (regardless of the time step)
- Specifies a continuously reactive controller
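
In the tabular case, a stationary policy is literally a lookup table from states to actions. A minimal sketch; the grid states and action names are illustrative.

```python
# A tabular stationary policy: one action per state, independent of the time step.
# States and action names are illustrative.
policy = {
    (0, 0): "north",
    (0, 1): "north",
    (0, 2): "east",
    (1, 0): "west",
}

def act(state):
    """A continuously reactive controller: whatever state arises, it has an answer."""
    return policy[state]
```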

We don't want to output just any policy; we want a good policy, one that accumulates a lot of reward.

### Value of a policy

Value function

$V: S \to \mathbb{R}$ associates a value with each state

$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t) \;\middle|\; s_0=s,\; a_t=\pi(s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)\right] \\
&= R(s) + \gamma\, \mathbb{E}\left[\sum_{t=1}^\infty \gamma^{t-1} R(s_t) \;\middle|\; s_0=s,\; a_t=\pi(s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)\right] \\
&= R(s) + \gamma \sum_{s'\in S} P(s' \mid s, \pi(s))\, V^\pi(s')
\end{aligned}
$$

Future rewards are "discounted" by $\gamma$ per time step.

We value a state by the expected total reward from this state onwards, discounted by $\gamma$ for each time step.

> A small $\gamma$ makes the agent short-sighted but reduces computational complexity.
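
A tiny numeric illustration of discounting; the reward sequence is made up for this sketch.

```python
# Discounted return: sum_t gamma^t * r_t. The reward sequence is made up for illustration.
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 0.81 + 0.729 = 1.539
print(discounted_return(rewards, gamma=0.5))  # 0.25 + 0.125 = 0.375 (short-sighted)
```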

#### Bellman Equation

It gives the value of a policy via a one-step lookahead.

$$
V^\pi(s) = R(s) + \gamma \sum_{s'\in S} P(s' \mid s, \pi(s))\, V^\pi(s')
$$

Today's value = today's reward + discounted future value
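
One direct use of the Bellman equation is iterative policy evaluation: start from $V \equiv 0$ and repeatedly apply the right-hand side until the values stop changing. A minimal tabular sketch, reusing the toy `states`, `P`, `R`, and `gamma` from the earlier finite-MDP example; the `policy` dict maps each state to an action (all names are illustrative).

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation for a fixed policy.
# Tabular sketch; `states`, `P`, `R`, `gamma` follow the earlier toy finite-MDP example.
def evaluate_policy(states, P, R, gamma, policy, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            # V(s) <- R(s) + gamma * sum_{s'} P(s'|s,a) V(s')
            new_v = R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

# Example with the earlier two-state toy MDP:
# V = evaluate_policy(states, P, R, gamma, policy={"s0": "go", "s1": "stay"})
```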

### Optimal Policy and Bellman Optimality Equation

The goal for an MDP is to compute or learn an optimal policy.

- An optimal policy is one that achieves the highest value at every state

$$
\pi^* = \arg\max_\pi V^\pi(s), \quad \forall s \in S
$$

We define the optimal value function using the Bellman Optimality Equation (proof left as an exercise):

$$
V^*(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s' \mid s, a)\, V^*(s')
$$

The optimal policy is

$$
\pi^*(s) = \arg\max_{a\in A} \sum_{s'\in S} P(s' \mid s, a)\, V^*(s')
$$
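
Given $V^*$ (or an approximation of it), the optimal policy is extracted by exactly this one-step greedy lookahead. A minimal tabular sketch in the same style as the earlier toy snippets; all names are illustrative.

```python
# Extract a greedy policy (optimal if V is V*) by one-step lookahead:
# pi(s) = argmax_a sum_{s'} P(s'|s,a) V(s').
# Tabular sketch; `states`, `actions`, `P` follow the earlier toy finite-MDP example.
def greedy_policy(states, actions, P, V):
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
    return policy
```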

![](/CSE510/MDP-optimal-policy.png)

> [!NOTE]
>
> When $R(s)$ is small, the agent will prefer actions that avoid punishment in the short term.

### The existence of the optimal policy

Theorem: for any Markov Decision Process

- There exists an optimal policy
- There can be many optimal policies, but all optimal policies achieve the same optimal value function
- There is always a deterministic optimal policy for any MDP

## Value Iteration

## Policy Iteration

@@ -5,4 +5,5 @@ export default {
  },
  CSE510_L1: "CSE510 Deep Reinforcement Learning (Lecture 1)",
  CSE510_L2: "CSE510 Deep Reinforcement Learning (Lecture 2)",
  CSE510_L3: "CSE510 Deep Reinforcement Learning (Lecture 3)",
}

BIN  public/CSE510/MDP-optimal-policy.png  (new binary file, 71 KiB)