updates

2025-09-04 12:51:31 -05:00
parent 95bb726462
commit 5abe1dcda6
4 changed files with 503 additions and 1 deletions
--- a/content/CSE510/CSE510_L3.md
+++ b/content/CSE510/CSE510_L3.md
@@ -1,4 +1,4 @@
-# CSE510 Lecture 3
+# CSE510 Deep Reinforcement Learning (Lecture 3)

 ## Introduction and Definition of MDPs

--- a/content/CSE510/CSE510_L4.md
+++ b/content/CSE510/CSE510_L4.md
@@ -0,0 +1,298 @@
+# CSE510 Deep Reinforcement Learning (Lecture 4)
+
+Markov Decision Process (MDP) Part II
+
+## Recall from last lecture
+
+A finite MDP is defined by:
+
+- A finite set of **states** $s \in S$
+- A finite set of **actions** $a \in A$
+- A **transition function** $T(s, a, s')$
+  - Probability that taking action $a$ in state $s$ leads to $s'$, i.e.,
+  $P(s'| s, a)$
+  - Also called the model or the dynamics
+- A **reward function $R(s)$** ( Sometimes $R(s,a)$ or $R(s, a, s')$ )
+- A **start state**
+- Maybe a **terminal state**
+
+An MDP is a model for sequential decision-making problems under uncertainty.
+
+### Optimal Policy and Bellman Optimality Equation
+
+The goal for an MDP is to compute or learn an optimal policy.
+
+- An **optimal policy** is one that achieves the highest value at any state
+
+$$
+\pi^* = \arg\max_\pi V^\pi(s)
+$$
+
+- We define the optimal value function using Bellman Optimality Equation
+
+$$
+V^*(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^*(s')
+$$
+
+- The optimal policy is
+
+$$
+\pi^*(s) = \arg\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^*(s')
+$$
+
+### The Existence of the Optimal Policy
+
+Theorem: for any Markov Decision Process
+  
+- There exists an optimal policy
+- There can be many optimal policies, but all optimal policies achieve the same optimal value function
+- There is always a deterministic optimal policy for any MDP
+
+## Solve MDP
+
+### Value Iteration
+
+Repeatedly update an estimate of the optimal value function according to Bellman Optimality Equation.
+
+1. Initialize an estimate for the value function arbitrarily
+
+$$
+\hat{V}(s) \gets 0, \forall s \in S
+$$
+
+2. Repeat, update:
+
+$$
+\hat{V}(s) \gets R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) \hat{V}(s'), \forall s \in S
+$$
+
+<details>
+<summary>Example</summary>
+
+Suppose we have a robot that can move in a 2D grid with the following dynamics:
+
+- with 80% probability, the robot moves in the direction of the action
+- with 10% probability, the robot slips perpendicular to the action, to its left (action + 1, wrapping around)
+- with 10% probability, the robot slips perpendicular to the action, to its right (action - 1, wrapping around)
+
+The grid ($V^0(s)$) is:
+
+|0|0|0|1|
+|0|*|0|-100|
+|0|0|0|0|
+
+If we run value iteration with $\gamma = 0.9$, we can update the value function as follows:
+
+$$
+V^1(s) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^0(s')
+$$
+
+On point $(3,3)$, the best action is to move to the goal state, so:
+
+$$
+\begin{aligned}
+V^1((3,3)) &= R((3,3)) + \gamma \sum_{s'\in S} P(s'|(3,3),\text{right}) V^0(s')\\
+&= 0+0.9 \times 0.8 \times 1 = 0.72
+\end{aligned}
+$$
+
+On point $(3,4)$, the best action is to move up, so that the robot stays in this cell with $90\%$ probability (it is blocked by the boundary when moving up or slipping right), so:
+
+$$
+\begin{aligned}
+V^1((3,4)) &= R((3,4)) + \gamma \sum_{s'\in S} P(s'|(3,4),\text{up}) V^0(s')\\
+&= 1+0.9 \times (0.8+0.1) \times 1 = 1.81
+\end{aligned}
+$$
+
+At $t=1$, the values on the grid are:
+
+|0|0|0.72|1.81|
+|0|*|0|-99.91|
+|0|0|0|0|
+
+</details>
+
+The general algorithm can be written as:
+
+```python
+# Value iteration on the grid from the previous example.
+# '*' marks a wall; every other cell holds its reward R(s).
+grid = [
+    [0, 0, 0, 1],
+    [0, '*', 0, -100],
+    [0, 0, 0, 0],
+]
+m, n = len(grid), len(grid[0])
+# Actions as (row offset, column offset).
+ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
+
+
+def move(i, j, di, dj):
+    """Return the cell reached from (i, j); stay in place when blocked."""
+    ni, nj = i + di, j + dj
+    if 0 <= ni < m and 0 <= nj < n and grid[ni][nj] != '*':
+        return ni, nj
+    return i, j
+
+
+def expected_value(V, i, j, action):
+    """Expected next-state value: 80% intended move, 10% per perpendicular slip."""
+    di, dj = action
+    outcomes = [(0.8, (di, dj)), (0.1, (dj, di)), (0.1, (-dj, -di))]
+    total = 0.0
+    for prob, (a, b) in outcomes:
+        ni, nj = move(i, j, a, b)
+        total += prob * V[ni][nj]
+    return total
+
+
+def value_iteration(gamma, actions, grid, tol=1e-6):
+    # V starts at 0, so the first sweep yields V = R (the V^0 grid of the example).
+    V = [[0.0] * n for _ in range(m)]
+    while True:
+        # Build a fresh table each sweep so V and V_new never alias.
+        V_new = [[0.0] * n for _ in range(m)]
+        for i in range(m):
+            for j in range(n):
+                if grid[i][j] == '*':
+                    continue  # the wall is not a state
+                best = max(expected_value(V, i, j, a) for a in actions.values())
+                V_new[i][j] = grid[i][j] + gamma * best
+        delta = max(abs(V_new[i][j] - V[i][j]) for i in range(m) for j in range(n))
+        V = V_new
+        if delta < tol:
+            return V
+
+
+gamma = 0.9
+V = value_iteration(gamma, ACTIONS, grid)
+print(V)
+```
+
+### Convergence of Value Iteration
+
+Theorem: Value Iteration converges to the optimal value function $\hat{V}\to V^*$ as $t\to\infty$.
+
+<details>
+<summary>Proof</summary>
+
+For any estimate of the value function $\hat{V}$, we define the Bellman backup operator $\operatorname{B}:\mathbb{R}^{|S|}\to \mathbb{R}^{|S|}$ by
+
+$$
+\operatorname{B}(\hat{V}(s)) = R(s) + \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) \hat{V}(s')
+$$
+
+Note that $\operatorname{B}(V^*) = V^*$.
+
+Since $|\max_{x\in X}f(x)-\max_{x\in X}g(x)|\leq \max_{x\in X}|f(x)-g(x)|$, for any value functions $V_1$ and $V_2$, we have
+
+$$
+\begin{aligned}
+|\operatorname{B}(V_1(s))-\operatorname{B}(V_2(s))|&= \gamma \left|\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V_1(s')-\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V_2(s')\right|\\
+&\leq \gamma \max_{a\in A} \left|\sum_{s'\in S} P(s'|s,a) V_1(s')-\sum_{s'\in S} P(s'|s,a) V_2(s')\right|\\
+&\leq \gamma \max_{a\in A} \sum_{s'\in S} P(s'|s,a) |V_1(s')-V_2(s')|\\
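+&\leq \gamma \max_{s'\in S}|V_1(s')-V_2(s')|
+\end{aligned}
+$$
+
+So $\operatorname{B}$ is a contraction in the max norm with factor $\gamma<1$. Since $V^*$ is its fixed point, each iteration shrinks $\max_{s\in S}|\hat{V}(s)-V^*(s)|$ by at least a factor of $\gamma$, so $\hat{V}\to V^*$.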
+
+</details>
+
+Assume $0\leq \gamma < 1$, and reward $R(s)$ is bounded by $R_{\max}$.
+
+Then
+
+$$
+V^*(s)\leq \sum_{t=0}^\infty \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}
+$$
+
+Let $V^k$ be the value function after $k$ iterations of Value Iteration.
+
+$$
+\max_{s\in S}|V^k(s)-V^*(s)|\leq \frac{R_{\max}}{1-\gamma}\gamma^k
+$$
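+
+As a rough worked example of this bound (the numbers $\gamma=0.9$, $R_{\max}=1$, and target accuracy $0.01$ are illustrative assumptions, not from the lecture), we can solve for the number of iterations that guarantees the accuracy:
+
+$$
+\frac{R_{\max}}{1-\gamma}\gamma^k \leq 0.01
+\iff 10\cdot 0.9^k \leq 0.01
+\iff k \geq \frac{\ln 0.001}{\ln 0.9} \approx 65.6
+$$
+
+So under these assumptions $k=66$ iterations are enough, and the required $k$ grows only logarithmically as the target accuracy shrinks.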
+
+#### Stopping condition
+
+We can stop once the estimate is arbitrarily close to the optimal value function.
+
+If $\|V^k-V^{k+1}\|<\epsilon$, then $\|V^{k+1}-V^*\|\leq \epsilon\frac{\gamma}{1-\gamma}$.
+
+So we can select small $\epsilon$ to stop the iteration.
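+
+One way to see where a bound of this form comes from is the following sketch. It uses only the contraction property $\|\operatorname{B}(V_1)-\operatorname{B}(V_2)\|\leq\gamma\|V_1-V_2\|$ and $\operatorname{B}(V^*)=V^*$, and it is stated for the newer iterate $V^{k+1}$:
+
+$$
+\begin{aligned}
+\|V^{k+1}-V^*\| &\leq \|V^{k+1}-V^{k+2}\| + \|V^{k+2}-V^*\| \\
+&\leq \gamma\|V^k-V^{k+1}\| + \gamma\|V^{k+1}-V^*\| \\
+\implies (1-\gamma)\|V^{k+1}-V^*\| &\leq \gamma\|V^k-V^{k+1}\| < \gamma\epsilon
+\end{aligned}
+$$
+
+Dividing by $1-\gamma$ gives $\|V^{k+1}-V^*\|<\epsilon\frac{\gamma}{1-\gamma}$.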
+
+### Greedy Policy
+
+Given a $V^k$ that is close to the optimal value $V^*$, the greedy policy is:
+
+$$
+\pi_{g}(s) = \arg\max_{a\in A} \sum_{s'\in S} T(s,a,s') V^k(s')
+$$
+
+Here $T(s,a,s')$ is the probability of moving from state $s$ to state $s'$ under action $a$.
+
+This selects the action that looks best if we assume that we get value $V^k$ in one step.
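+
+A minimal sketch of greedy policy extraction, assuming a hypothetical two-state MDP stored as nested dictionaries (the layout `P[s][a]`, the action names, the value estimate `V_k`, and the function name `greedy_policy` are all illustrative and not from the lecture):
+
+```python
+# Hypothetical 2-state, 2-action MDP (an assumption for illustration).
+# P[s][a] maps each next state s' to its probability P(s'|s,a).
+P = {
+    0: {'stay': {0: 1.0}, 'go': {0: 0.2, 1: 0.8}},
+    1: {'stay': {1: 1.0}, 'go': {0: 0.8, 1: 0.2}},
+}
+
+
+def greedy_policy(P, V):
+    """For each state, pick the action with the largest expected next-state value."""
+    policy = {}
+    for s, actions in P.items():
+        policy[s] = max(
+            actions,
+            key=lambda a: sum(p * V[s2] for s2, p in actions[a].items()),
+        )
+    return policy
+
+
+V_k = {0: 0.0, 1: 10.0}  # an assumed value estimate
+pi_g = greedy_policy(P, V_k)
+print(pi_g)
+```
+
+With this estimate, state 0 prefers `go` (expected value 8 versus 0) and state 1 prefers `stay` (10 versus 2).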
+
+#### Value of a greedy policy
+
+Let $V_g$ be the value function of the greedy policy.
+
+It is not necessarily optimal, but it is a good approximation.
+
+In homework, we will prove that if $\|V^k-V^*\|<\lambda$, then $\|V_g-V^*\|\leq 2\lambda\frac{\gamma}{1-\gamma}$.
+
+So we can set the stopping condition so that $V_g$ has the desired accuracy relative to $V^*$.
+
+There is a finite $\epsilon$ such that greedy policy is $\epsilon$-optimal.
+
+### Problem of Value Iteration and Policy Iteration
+
+- Each iteration is slow: $O(|S|^2|A|)$
+- The max action at each state rarely changes
+- The policy converges before the value function
+
+### Policy Iteration
+
+Interleave policy evaluation and policy improvement.
+
+1. Initialize a random policy $\pi$
+2. Compute the value function $V^{\pi}$
+3. Update the policy $\pi$ to be greedy policy with respect to $V^{\pi}$
+   $$
+   \pi(s)\gets \arg\max_{a\in A} \sum_{s'\in S} P(s'|s,a) V^{\pi}(s')
+   $$
+4. Repeat until convergence
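+
+The steps above can be sketched as follows. This is a minimal illustration on a hypothetical two-state MDP (the rewards `R`, transitions `P`, and all function names are assumptions), and it evaluates each policy by repeated Bellman backups rather than by an exact solve:
+
+```python
+# Hypothetical 2-state MDP: state 1 pays reward 1, state 0 pays 0.
+R = {0: 0.0, 1: 1.0}
+P = {
+    0: {'stay': {0: 1.0}, 'go': {0: 0.2, 1: 0.8}},
+    1: {'stay': {1: 1.0}, 'go': {0: 0.8, 1: 0.2}},
+}
+GAMMA = 0.9
+
+
+def q_value(s, a, V):
+    """Expected next-state value of taking action a in state s."""
+    return sum(p * V[s2] for s2, p in P[s][a].items())
+
+
+def evaluate(policy, tol=1e-10):
+    """Iterative policy evaluation: V(s) <- R(s) + gamma * E[V(s')]."""
+    V = {s: 0.0 for s in P}
+    while True:
+        V_new = {s: R[s] + GAMMA * q_value(s, policy[s], V) for s in P}
+        if max(abs(V_new[s] - V[s]) for s in P) < tol:
+            return V_new
+        V = V_new
+
+
+def policy_iteration():
+    policy = {s: 'stay' for s in P}  # arbitrary initial policy
+    while True:
+        V = evaluate(policy)
+        improved = {s: max(P[s], key=lambda a: q_value(s, a, V)) for s in P}
+        if improved == policy:  # the greedy policy is stable, so stop
+            return policy, V
+        policy = improved
+
+
+pi, V = policy_iteration()
+print(pi, V)
+```
+
+On this toy problem the policy changes once (state 0 switches from `stay` to `go`) and is stable on the next pass.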
+
+### Exact Policy Evaluation by Linear Solver
+
+Let $V^{\pi}\in \mathbb{R}^{|S|}$ be a vector of values for each state, $r\in \mathbb{R}^{|S|}$ be a vector of rewards for each state.
+
+Let $P^{\pi}\in \mathbb{R}^{|S|\times |S|}$ be a transition matrix for the policy $\pi$.
+
+$$
+P^{\pi}_{ij} = P(s_{t+1}=j|s_t=i,a_t=\pi(s_t))
+$$
+
+The Bellman equation for the policy can be written in vector form as:
+
+$$
+\begin{aligned}
+V^{\pi} &= r + \gamma P^{\pi} V^{\pi} \\
+(I-\gamma P^{\pi})V^{\pi} &= r \\
+V^{\pi} &= (I-\gamma P^{\pi})^{-1} r
+\end{aligned}
+$$
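+
+As a small worked example (a hypothetical two-state chain, not from the lecture), take $\gamma=0.9$, $r=(1,0)^\top$, and a policy under which state 1 moves to state 2 with probability $0.5$ and state 2 is absorbing. Here row $i$ of $P^{\pi}$ holds the next-state distribution from state $i$:
+
+$$
+P^{\pi}=\begin{pmatrix}0.5&0.5\\0&1\end{pmatrix},\quad
+I-\gamma P^{\pi}=\begin{pmatrix}0.55&-0.45\\0&0.1\end{pmatrix}
+$$
+
+Solving $(I-\gamma P^{\pi})V^{\pi}=r$ from the bottom row up:
+
+$$
+0.1\,V^{\pi}_2=0\implies V^{\pi}_2=0,\qquad
+0.55\,V^{\pi}_1-0.45\,V^{\pi}_2=1\implies V^{\pi}_1=\frac{1}{0.55}\approx 1.818
+$$
+
+This agrees with the Bellman equation for state 1: $1+0.9\,(0.5\cdot 1.818+0.5\cdot 0)\approx 1.818$.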
+
+- The convergence proof involves showing that each iteration is also a contraction and monotonically improves the policy
+- Convergence to the exact optimal policy
+  - The number of policies is finite
+
+In practice, policy iteration is usually faster than value iteration.
+
+#### Policy Iteration Complexity
+
+- Each iteration runs in polynomial time in the number of states and actions
+- There are at most $|A|^n$ policies (where $n=|S|$) and PI never repeats a policy
+  - So at most an exponential number of iterations
+  - Not a very good complexity bound
+- Empirically $O(n)$ iterations are required
+  - Challenge: try to generate an MDP that requires more than $n$ iterations
+
+### Generalized Policy Iteration
+
+- Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, independent of the granularity and other details of the two processes
+
+### Summary
+
+#### Policy Iteration vs Value Iteration
+
+- **PI has two loops**: inner loop (evaluate $V^{\pi}$)
+and outer loop (improve $\pi$)
+- **VI has one loop**: repeatedly apply
+$V^{k+1}(s) = \max_{a\in A} [r(s,a) + \gamma \sum_{s'\in S} P(s'|s,a) V^k(s')]$
+- **Trade-offs**:
+  - PI converges in few outer steps if you can evaluate quickly/accurately;
+  - VI avoids expensive exact evaluation, doing cheaper but many Bellman optimality updates.
+- **Modified Policy Iteration**: partial evaluation + improvement.
--- a/content/CSE5313/CSE5313_L4.md
+++ b/content/CSE5313/CSE5313_L4.md
@@ -0,0 +1,203 @@
+# CSE5313 Coding and information theory for data science (Lecture 4)
+
+Algebra over finite fields
+
+$\mathbb{N}$ is the set of natural numbers.
+
+By adding additive inverses, we can define the set of integers $\mathbb{Z}$.
+
+By adding multiplicative inverses, we can define the set of rational numbers $\mathbb{Q}$.
+
+By adding limits, we can define the set of real numbers $\mathbb{R}$.
+
+By adding polynomial roots, we can define the set of complex numbers $\mathbb{C}$.
+
+Two further extensions are $\mathbb{H}$ (the quaternions) and $\mathbb{O}$ (the octonions).
+
+## A few theorems about field extensions
+
+- As long as it is irreducible, the choice of $f(x)$ does not matter.
+  - If $f_1(x), f_2(x)$ are irreducible of the same degree, then $\mathbb{Z}_p[x] \mod f_1(x) \cong \mathbb{Z}_p[x] \mod f_2(x)$.
+- Over every $\mathbb{Z}_p$ ($p$ prime), there exists an irreducible polynomial of every degree.
+- All finite fields of the same size are isomorphic.
+- All finite fields are of size $p^d$ for prime $p$ and integer $d$.
+
+Corollary: This is effectively the only way to construct finite fields.
+
+## Common notations
+
+$\mathbb{F}_q$ is the finite field with $q$ elements, where $q$ is a prime power.
+
+$\mathbb{F}_q^t$ is a vector space of dimension $t$ over $\mathbb{F}_q$. (no notion of multiplication)
+
+$\mathbb{F}_{q^t}$ is the finite field with $q^t$ elements. (as extension field of $\mathbb{F}_q$)
+
+$\mathbb{F}_{q^t}\Rightarrow \mathbb{F}_q^t$. Every extension field of $\mathbb{F}_q$ is a vector space over $\mathbb{F}_q$.
+
+$\mathbb{F}_q^t\nRightarrow \mathbb{F}_{q^t}$. Additional structure is required.
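+
+For instance, take $q=2$ and $t=2$ with the irreducible polynomial $x^2+x+1$ over $\mathbb{F}_2$ (an illustrative choice). Both $\mathbb{F}_2^2$ and $\mathbb{F}_4$ have four elements, and $a_0+a_1x$ corresponds to the vector $(a_0,a_1)$ under addition. Only $\mathbb{F}_4$ carries a product, for example:
+
+$$
+x\cdot(x+1)=x^2+x\equiv (x+1)+x=1 \pmod{x^2+x+1}
+$$
+
+So $x$ and $x+1$ are multiplicative inverses in $\mathbb{F}_4$, while no product of $(0,1)$ and $(1,1)$ is defined in $\mathbb{F}_2^2$.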
+
+> [!IMPORTANT]
+>
+> From now on, we will use $+,\cdot$ to denote the modulo operations $\oplus,\odot$.
+
+### A few exercises
+
+Let $F$ be a field and $\alpha(x)\in F[x]$. Prove that for $\beta\in F$, $\alpha(\beta)=0\iff (x-\beta)\mid \alpha(x)$.
+
+<details>
+<summary>Proof</summary>
+
+$\impliedby$:
+
+Suppose $x-\beta\mid \alpha(x)$, then $\alpha(x)=(x-\beta)q(x)$ for some $q(x)\in F[x]$.
+
+Thus, $\alpha(\beta)=(\beta-\beta)q(\beta)=0$.
+
+$\implies$:
+
+We proceed by induction over $\deg(\alpha(x))$.
+
+Base case: $\deg(\alpha(x))=1$, then $\alpha(x)=\alpha_0+\alpha_1x$ for some $\alpha_0,\alpha_1\in F$.
+Then $\alpha(\beta)=\alpha_0+\alpha_1\beta=0$, so $\alpha_0=-\alpha_1\beta$.
+
+So $\alpha(x)=\alpha_1 x-\alpha_1\beta=\alpha_1 (x-\beta)$.
+
+Inductive step: Suppose $\deg(\alpha(x))>1$, and the condition holds for all polynomials $r(x)$ with $\deg(r(x))<\deg(\alpha(x))$.
+
+Then $\alpha(x)=(x-\beta)q(x)+r(x)$ for some $q(x),r(x)\in F[x]$ with $\deg(r(x))<1$ (by Euclid's division algorithm), so $r(x)=r$ is a constant.
+
+Evaluating at $\beta$ gives $\alpha(\beta)=(\beta-\beta)q(\beta)+r=r$, so $\alpha(\beta)=0$ forces $r=0$.
+
+So $\alpha(x)=(x-\beta)q(x)$, i.e., $(x-\beta)\mid \alpha(x)$.
+
+</details>
+
+Construct a finite field with $9$ elements.
+
+<details>
+<summary>Solution</summary>
+
+$9=3^2$
+
+We extend $\mathbb{Z}_3=\mathbb{F}_3$ to $\mathbb{F}_{3^2}$.
+
+We extend this by using an irreducible polynomial of degree 2.
+
+Need to find an irreducible polynomial $f(x)=x^2+\alpha\in \mathbb{F}_3[x]$.
+
+> [!TIP]
+>
+> In $\mathbb{F}_3$, $\forall x\in \mathbb{F}_3$, $x^2\neq 2$. So $f(x)=x^2+1$ has no root in $\mathbb{F}_3$.
+
+Consider $f(x)=x^2-2=x^2+1$.
+
+Suppose for contradiction that $f(x)$ is reducible, then $f(x)=a(x)b(x)$ for some $a(x),b(x)\in \mathbb{F}_3[x]$. And $a(x),b(x)$ are both of degree 1.
+
+So $f(x)=(\alpha_1 x-\alpha_2)(\beta_1 x-\beta_2)$ for some $\alpha_1,\alpha_2,\beta_1,\beta_2\in \mathbb{F}_3$, and $\alpha_1\beta_1\neq 0$.
+
+So $f(\frac{\alpha_2}{\alpha_1})=0$.
+
+This contradicts the fact that $f(x)$ has no root in $\mathbb{F}_3$.
+
+</details>
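+
+The construction above can be checked numerically. A minimal sketch, assuming elements of $\mathbb{F}_9$ are stored as pairs `(a, b)` standing for $a+bx$ with $x^2=-1$ (the representation and the names `add` and `mul` are illustrative choices):
+
+```python
+from itertools import product
+
+P_MOD = 3  # coefficients live in Z_3
+
+
+def add(u, v):
+    """Coefficient-wise addition mod 3."""
+    return ((u[0] + v[0]) % P_MOD, (u[1] + v[1]) % P_MOD)
+
+
+def mul(u, v):
+    """(a + bx)(c + dx) reduced with x^2 = -1, i.e. modulo x^2 + 1."""
+    a, b = u
+    c, d = v
+    return ((a * c - b * d) % P_MOD, (a * d + b * c) % P_MOD)
+
+
+elements = list(product(range(P_MOD), repeat=2))  # all 9 pairs (a, b)
+one = (1, 0)
+
+# Every nonzero element should have a multiplicative inverse.
+inverses = {
+    u: next(v for v in elements if mul(u, v) == one)
+    for u in elements
+    if u != (0, 0)
+}
+print(len(elements), len(inverses))
+```
+
+The search for inverses would fail for a reducible modulus such as $x^2+2=(x+1)(x+2)$, which is why irreducibility of $f(x)$ matters.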
+
+## Summary from last lecture
+
+A recipe for constructing a field $\mathbb{F}$ with $p^t$ elements ($p$ prime).
+
+- Construct $\mathbb{Z}_p$.
+- Find an irreducible polynomial $f(x)$ of degree $t$ (always exists).
+- Let $\mathbb{F} = \mathbb{Z}_p[x] \mod f(x)$.
+  - The elements: polynomials in $\mathbb{Z}_p[x]$ of degree at most $t-1$.
+  - Addition and multiplication $\mod f(x)$.
+
+Facts:
+
+- Choice of $f(x)$ does not matter.
+  - Always end up with isomorphic $\mathbb{F}$ (identical up to renaming of elements).
+  - All finite fields are of size $p^t$ for prime $p$ and some $t$.
+- The above recipe is unique.
+
+## Algebra over finite fields
+
+### Groups
+
+A group is a set $G$ with an operation $\cdot$ that satisfies the following axioms:
+
+1. Closure: $\forall a,b\in G, a\cdot b\in G$.
+2. Associativity: $\forall a,b,c\in G, (a\cdot b)\cdot c=a\cdot (b\cdot c)$.
+3. Identity: $\exists e\in G, \forall a\in G, a\cdot e=e\cdot a=a$.
+4. Inverses: $\forall a\in G, \exists a^{-1}\in G, a\cdot a^{-1}=a^{-1}\cdot a=e$.
+
+> [!IMPORTANT]
+>
+> 1. Operator $\cdot$ is not necessarily commutative. (If it is, the group is called an abelian group.)
+> 2. May be finite or infinite.
+> 3. May be written in power notation $a^n=a^{n-1}\cdot a$.
+
+#### Examples of groups
+
+$(\mathbb{Z},+)$ is an abelian group.
+$(\mathbb{Q},+)$ is an abelian group.
+
+$(\mathbb{R}\setminus\{0\},\cdot)$ is an abelian group.
+
+Let $\mathbb{Z}_n^*=\{x\in \mathbb{Z}_n:\gcd(x,n)=1\}$.
+
+Then $(\mathbb{Z}_n^*,\cdot)$ is a group. If $n$ is prime, then $\mathbb{Z}_n^*$ is $\mathbb{Z}_n$ with only $0$ removed.
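+
+For example, with $n=8$ (an illustrative choice):
+
+$$
+\mathbb{Z}_8^*=\{1,3,5,7\},\qquad 3\cdot 5=15\equiv 7,\quad 3\cdot 3\equiv 5\cdot 5\equiv 7\cdot 7\equiv 1 \pmod 8
+$$
+
+So the set is closed under multiplication, and here each element is its own inverse.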
+
+#### Order of element in group
+
+Let $(G,\cdot)$ be a **finite** group and $a\in G$.
+
+Then there exists $k\in\mathbb{N}$ such that $a^k=e$.
+
+<details>
+<summary>Proof</summary>
+
+Consider the sequence $a,a^2,a^3,\cdots$.
+
+Since $G$ is finite, there exist $i,j\in\mathbb{N}$ with $i>j$ such that $a^i=a^j$.
+
+Multiplying both sides by $a^{-j}$ gives $a^{i-j}=e$.
+
+So there exists $k=i-j\in\mathbb{N}$ such that $a^k=e$.
+
+</details>
+
+The order of an element $a\in G$ (denoted as $\mathcal{O}(a)$) is the smallest positive integer $n$ such that $a^n=e$.
+
+Examples:
+
+The order of $5$ in $(\mathbb{Z}_6,+)$ is $6$.
+
+If $a^l=e$, then $\mathcal{O}(a)\mid l$.
+
+<details>
+<summary>Proof</summary>
+
+Let $\mathcal{O}(a)=m$. If $a^l=e$, then by definition of order, $l\geq m$. So $\exists q,r\in\mathbb{N}$ such that $l=qm+r$ and $0\leq r<m$.
+
+So $e=a^l=a^{qm+r}=(a^m)^q a^r=a^r$. If $r>0$, this contradicts the definition of order, since $m$ is the smallest positive integer such that $a^m=e$. Hence $r=0$ and $m\mid l$.
+
+</details>
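+
+A small sketch for computing orders by brute force, assuming the groups $(\mathbb{Z}_n,+)$ and $(\mathbb{Z}_n^*,\cdot)$ with $n\geq 2$ (the function names `additive_order` and `multiplicative_order` are hypothetical):
+
+```python
+from math import gcd
+
+
+def additive_order(a, n):
+    """Smallest k >= 1 with k * a = 0 in (Z_n, +)."""
+    k, total = 1, a % n
+    while total != 0:
+        total = (total + a) % n
+        k += 1
+    return k
+
+
+def multiplicative_order(a, n):
+    """Smallest k >= 1 with a^k = 1 in (Z_n^*, *); requires gcd(a, n) = 1."""
+    if gcd(a, n) != 1:
+        raise ValueError("a must be a unit modulo n")
+    k, power = 1, a % n
+    while power != 1:
+        power = (power * a) % n
+        k += 1
+    return k
+
+
+print(additive_order(5, 6))        # order of 5 in (Z_6, +)
+print(multiplicative_order(2, 7))  # order of 2 in Z_7^*
+```
+
+In $\mathbb{Z}_7^*$ we have $2^6=64\equiv 1$, and the order of $2$ is $3$, which divides $6$ as the statement above predicts.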
+
+### Cyclic groups
+
+A group $G$ is called cyclic if there exists $a\in G$ such that $\mathcal{O}(a)=|G|$.
+
+$G=\{a^i|i\in\mathbb{Z}\}$
+
+Such an element $a$ is called a generator of the group.
+
+#### Exercises for cyclic groups
+
+Is $(\mathbb{Z}_n,+)$ cyclic? If so, find a generator.
+
+<details>
+<summary>Solution</summary>
+
+Yes. Every $a\in \mathbb{Z}_n$ with $\gcd(a,n)=1$ is a generator; in particular, $1$ is a generator.
+
+</details>
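+
+For example, in $(\mathbb{Z}_6,+)$ (an illustrative case), writing $\langle a\rangle$ for the set of multiples of $a$, the multiples of $5$ run through every element while the multiples of $2$ do not:
+
+$$
+\langle 5\rangle=\{5,4,3,2,1,0\}=\mathbb{Z}_6,\qquad \langle 2\rangle=\{2,4,0\}\neq\mathbb{Z}_6
+$$
+
+This matches $\gcd(5,6)=1$ and $\gcd(2,6)=2$.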
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -6,4 +6,5 @@ export default {
    CSE5313_L1: "CSE5313 Coding and information theory for data science (Lecture 1)",
    CSE5313_L2: "CSE5313 Coding and information theory for data science (Lecture 2)",
    CSE5313_L3: "CSE5313 Coding and information theory for data science (Lecture 3)",
+    CSE5313_L4: "CSE5313 Coding and information theory for data science (Lecture 4)",
 }