update

2025-08-28 12:51:46 -05:00
parent 802993833b
commit 7137d8aca2
6 changed files with 538 additions and 0 deletions
--- a/content/CSE510/CSE510_L2.md
+++ b/content/CSE510/CSE510_L2.md
@@ -0,0 +1,187 @@
+# CSE510 Deep Reinforcement Learning (Lecture 2)
+
+Introduction and Markov Decision Processes (MDPs)
+
+## What is reinforcement learning (RL)
+
+- A general computational framework for behavior learning through reinforcement/trial and error
+- Deep RL: combining deep learning with RL for complex problems
+- Shows promise as a path toward artificial general intelligence (AGI)
+
+## What RL can do now.
+
+### Backgammon
+
+#### Neuro-Gammon
+
+Developed by Gerald Tesauro in 1989 in IBM's research center.
+
+Trained to mimic expert demonstrations using supervised learning.
+
+Reached the level of an intermediate human player.
+
+#### TD-Gammon (Temporal Difference Learning)
+
+Developed by Gerald Tesauro in 1992 in IBM's research center.
+
+A neural network that trains itself to be an evaluation function by playing against itself starting from random weights.
+
+Achieved performance close to top human players of its time.
+
+### DeepMind Atari
+
+Use deep Q-learning to play Atari games.
+
+Without human demonstrations, it can learn to play the game at a superhuman level.
+
+### AlphaGo
+
+Monte Carlo Tree Search, learning policy and value function networks for pruning the search tree, expert demonstrations, self-play, and TPU from Google.
+
+### Video Games
+
+OpenAI Five for Dota 2
+
+won 5v5 best of 3 games against top human players.
+
+DeepMind AlphaStar for StarCraft
+
+supervised training followed by a league competition training.
+
+### AlphaTensor
+
+discovering faster matrix multiplication algorithms with reinforcement learning.
+
+AlphaTensor: 76 multiplications vs. the previous best of 80 for multiplying a $4\times 5$ matrix by a $5\times 5$ matrix.
+
+### Training LLMs
+
+For verifiable tasks (coding, math, etc.), RL can be used to train a model to perform the task without human supervision.
+
+### Robotics
+
+Unitree Go, Atlas by Boston Dynamics, etc.
+
+## What are the challenges of RL in real world applications?
+
+Beating the human champion is "easier" than physically placing the Go stones.
+
+### State estimation
+
+Known environments (known entities and dynamics) vs. unknown environments (unknown entities and dynamics).
+
+Need for behaviors to **transfer/generalize** across environmental variations since the real world is very diverse.
+
+> **State estimation**
+>
+> To be able to act, you need first to be able to **see**, detect the **objects** that you interact with, detect whether you achieved the **goal**.
+
+Most works are between two extremes:
+
+- Assuming the world model is known (object locations, shapes, physical properties obtained via AR tags or manual tuning), they use planners to search for the action sequence that achieves a desired goal.
+
+- Do not attempt to detect any objects and learn to map RGB images directly to actions.
+
+Behavior learning is challenging because state estimation is challenging, in other words, because computer vision/perception is challenging.
+
+Interesting direction: **leveraging DRL and vision-language models**
+
+### Efficiency
+
+Cheap vs. Expensive to get experience samples
+
+#### DRL Sample Efficiency
+
+Humans after 15 minutes tend to outperform DDQN after 115 hours.
+
+#### Reinforcement Learning in Human
+
+Humans appear to learn to act (e.g., walk) through "very few examples" of trial and error. How they do so is an open question.
+
+Possible answers:
+
+- Hardware: 230 million years of bipedal movement data
+- Imitation Learning: Observation of other humans walking (e.g., imitation learning, episodic memory and semantic memory)
+- Algorithms: Better than backpropagation and stochastic gradient descent
+
+#### Discrete and continuous action spaces
+
+Computation is discrete, but the real action space is continuous.
+
+#### One-goal vs. Multi-goal
+
+Life is a multi-goal problem involving infinitely many possible games.
+
+#### Automatic rewards and reward detection
+
+Our curiosity is a reward.
+
+#### And more
+
+- Transfer learning
+- Generalization
+- Long horizon reasoning
+- Model-based RL
+- Sparse rewards
+- Reward design/learning
+- Planning/Learning
+- Lifelong learning
+- Safety
+- Interpretability
+- etc.
+
+## What is the course about?
+
+To teach you RL models and algorithms.
+
+- To be able to tackle real world problems.
+
+To excite you about RL.
+
+- To provide a primer for you to launch advanced studies.
+
+Schedule:
+
+- RL Model and basic algorithms
+  - Markov Decision Process (MDP)
+  - Passive RL: ADP and TD-learning
+  - Active RL: Q-Learning and SARSA
+- Deep RL algorithms
+  - Value-Based methods
+  - Policy Gradient Methods
+  - Model-Based methods
+- Advanced Topics
+  - Offline RL, Multi-Agent RL, etc.
+
+### Reinforcement Learning Algorithms
+
+#### Model-Based
+
+- Learn the model of the world, then plan using the model
+- Update model often
+- Re-plan often
+
+#### Value-Based
+
+- Learn the state or state-action value
+- Act by choosing best action in state
+- Exploration is a necessary add-on
+
+#### Policy-based
+
+- Learn the stochastic policy function that maps state to action
+- Act by sampling policy
+- Exploration is baked in
+
+#### From better to worse sample efficiency
+
+- Model-Based
+- Off-policy/Q-learning
+- Actor-critic
+- On-policy/Policy gradient
+- Evolutionary/Gradient-free
+
+## What is RL?
+
+## RL model: Markov Decision Process (MDP)
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -4,4 +4,5 @@ export default {
        type: 'separator'
    },
    CSE510_L1: "CSE510 Deep Reinforcement Learning (Lecture 1)",
+    CSE510_L2: "CSE510 Deep Reinforcement Learning (Lecture 2)",
 }
--- a/content/CSE5313/CSE5313_L2.md
+++ b/content/CSE5313/CSE5313_L2.md
@@ -0,0 +1,348 @@
+# CSE5313 Coding and information theory for data science (Lecture 2)
+
+## Review on Channel coding
+
+Let $F$ be the input alphabet, $\Phi$ be the output alphabet.
+
+e.g. $F=\{0,1\}$ or $F=\mathbb{R}$.
+
+Introduce noise: $\operatorname{Pr}(c'\text{ received}|c\text{ transmitted})$.
+
+We use $u$ to denote the information to be transmitted
+
+$c$ to be the codeword.
+
+$c'$ is the received word, given to the decoder.
+
+$u'$ is the decoded information word.
+
+Error if $u' \neq u$.
+
+Example:
+
+**Binary symmetric channel (BSC)**
+
+$F=\Phi=\{0,1\}$
+
+Every bit of $c$ is flipped with probability $p$.
+
+**Binary erasure channel (BEC)**
+
+$F=\{0,1\}$, $\Phi=\{0,1,*\}$. Very common in practice when the receiver cannot tell which bit was transmitted.
+
+$c$ is transmitted, $c'$ is received.
+
+Every bit of $c$ is received correctly with probability $1-p$ and erased (replaced by $*$) with probability $p$.
+
+## Encoding
+
+Encoding $E$ is a function from $F^k$ to $F^n$.
+
+Where $E(u)=c$ is the codeword.
+
+Assume $n\geq k$, we don't compress the information.
+
+A code $\mathcal{C}$ is a subset of $F^n$.
+
+Encoding is a one to one mapping from $F^k$ to $\mathcal{C}$.
+
+In practice, we usually choose $\mathcal{C}\subseteq F^n$ with $|\mathcal{C}|=|F^k|$.
+
+## Decoding
+
+$D$ is a function from $\Phi^n$ to $\mathcal{C}$.
+
+$D(c')=\hat{c}$
+
+The decoder then outputs the unique $u'$ such that $E(u')=\hat{c}$.
+
+Our aim is to have $u=u'$.
+
+Decoding error probability: $\operatorname{P}_{err}=\max_{c\in \mathcal{C}}\operatorname{P}_{err}(c)$.
+
+where $\operatorname{P}_{err}(c)=\sum_{y|D(y)\neq c}\operatorname{Pr}(y\text{ received}|c\text{ transmitted})$.
+
+Our goal is to construct decoder $D$ such that $\operatorname{P}_{err}$ is bounded.
+
+Example:
+
+Repetition code in binary symmetric channel:
+
+Let $F=\Phi=\{0,1\}$. Every bit of $c$ is flipped with probability $p$.
+
+Say $k=1$, $n=3$ and let $\mathcal{C}=\{000,111\}$.
+
+Let the encoder be $E(u)=u u u$.
+
+The decoder is $D(000)=D(100)=D(010)=D(001)=000$, $D(110)=D(101)=D(011)=D(111)=111$ (majority vote).
+
+Exercise: Compute the error probability of the repetition code in binary symmetric channel.
+
+<details>
+<summary>Solution</summary>
+
+Recall that $P_{err}(c)=\sum_{y|D(y)\neq c}\operatorname{Pr}(y\text{ received}|c\text{ transmitted})$.
+
+Use binomial random variable:
+
+$$
+\begin{aligned}
+P_{err}(000)&=\sum_{y|D(y)\neq 000}\operatorname{Pr}(y\text{ received}|000\text{ transmitted})\\
+&=\operatorname{Pr}(2\text{ flips or more})\\
+&=\binom{3}{2}p^2(1-p)+\binom{3}{3}p^3\\
+&=3p^2(1-p)+p^3\\
+\end{aligned}
+$$
+
+The computation is identical for $111$.
+
+$P_{err}=\max\{P_{err}(000),P_{err}(111)\}=P_{err}(000)=3p^2(1-p)+p^3$.
+
+</details>
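+
+As a quick sanity check of the formula above, here is a worked instance. The value $p=0.1$ is an assumption chosen only for illustration and does not come from the lecture:
+
+$$
+\begin{aligned}
+P_{err}&=3p^2(1-p)+p^3\\
+&=3(0.1)^2(0.9)+(0.1)^3\\
+&=0.027+0.001=0.028
+\end{aligned}
+$$
+
+So in this case the repetition code lowers the error probability from $0.1$ (sending the bit uncoded) to $0.028$, at the cost of sending three bits for every information bit.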
+
+### Maximum likelihood principle
+
+For $p\leq 1/2$, the decoder in the last example is a maximum likelihood decoder.
+
+Notice that $\operatorname{Pr}(c'=000|c=000)=(1-p)^3$ and $\operatorname{Pr}(c'=000|c=111)=p^3$.
+
+- If $p\leq 1/2$, then $(1-p)^3\geq p^3$. $c=000$ is more likely to be transmitted than $c=111$.
+
+Similarly, $\operatorname{Pr}(c'=001|c=000)=(1-p)^2p$ and $\operatorname{Pr}(c'=001|c=111)=p^2(1-p)$.
+
+- If $p\leq 1/2$, then $(1-p)^2p\geq p^2(1-p)$. $c=000$ is more likely to be transmitted than $c=111$.
+
+For $p>1/2$, the inequalities reverse, so the decoder's outputs are swapped.
+
+In general, Maximum likelihood decoder is $D(c')=\arg\max_{c\in \mathcal{C}}\operatorname{Pr}(c'\text{ received}|c\text{ transmitted})$.
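+
+For the BSC, the maximum likelihood rule can be made concrete. The following is a sketch, writing $t$ for the number of positions in which $c$ and $c'$ differ (the symbol $t$ is introduced here only for illustration):
+
+$$
+\operatorname{Pr}(c'\text{ received}|c\text{ transmitted})=p^{t}(1-p)^{n-t}=(1-p)^n\left(\frac{p}{1-p}\right)^{t}
+$$
+
+For $p<1/2$ we have $\frac{p}{1-p}<1$, so the likelihood decreases as $t$ grows. Under this assumption, maximum likelihood decoding picks the codeword that differs from $c'$ in the fewest positions, which is exactly what the majority-vote decoder of the repetition code does.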
+
+## Defining a "good" code
+
+Two metrics:
+
+- How many redundant bits are needed?
+  - e.g. repetition code: $k=1$, $n=3$ sends $2$ redundant bits.
+- What is the resulting error probability?
+  - Depends on the decoding function.
+  - Normally, maximum likelihood decoding is assumed.
+  - Should go to zero as $n$ grows.
+
+### Definition for rate of code is $\frac{k}{n}$.
+
+More generally, $\frac{\log_{|F|}|\mathcal{C}|}{n}$.
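+
+As an illustration, the general definition can be checked on the repetition code $\mathcal{C}=\{000,111\}$ from the earlier example (binary alphabet, so $|F|=2$, with $n=3$):
+
+$$
+\frac{\log_{|F|}|\mathcal{C}|}{n}=\frac{\log_2 2}{3}=\frac{1}{3}=\frac{k}{n}
+$$
+
+The two expressions agree whenever $|\mathcal{C}|=|F|^k$, since then $\log_{|F|}|\mathcal{C}|=k$.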
+
+### Definition for information entropy
+
+Let $X$ be a random variable over a discrete set $\mathcal{X}$.
+
+- That is every $x\in \mathcal{X}$ has a probability $\operatorname{Pr}(X=x)$.
+
+The entropy $H(X)$ of a discrete random variable $X$ is defined as:
+
+$$
+H(X)=\mathbb{E}_{x\sim X}{\log \frac{1}{\operatorname{Pr}(x)}}=-\sum_{x\in \mathcal{X}}\operatorname{Pr}(x)\log \operatorname{Pr}(x)
+$$
+
+When $X\sim\operatorname{Bernoulli}(p)$, we denote $H(X)=H(p)=-p\log p-(1-p)\log (1-p)$.
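+
+Two worked values may help calibrate the binary entropy function. This sketch assumes $\log$ is taken base 2 (the notes do not fix the base at this point, but the later use of $2^{-nH(p)}$ suggests it); the sample values of $p$ are chosen only for illustration:
+
+$$
+\begin{aligned}
+H\left(\tfrac{1}{2}\right)&=\tfrac{1}{2}\log_2 2+\tfrac{1}{2}\log_2 2=1\\
+H\left(\tfrac{1}{4}\right)&=\tfrac{1}{4}\log_2 4+\tfrac{3}{4}\log_2 \tfrac{4}{3}\approx 0.5+0.311=0.811
+\end{aligned}
+$$
+
+A fair coin carries one full bit of uncertainty, while a biased coin carries less.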
+
+A deeper explanation will be given later in the course.
+
+## Which rates are possible?
+
+Claude Shannon '48: Coding theorem of the BSC(binary symmetric channel)
+
+Recall $r=\frac{k}{n}$.
+
+Let $H(\cdot)$ be the entropy function.
+
+For every $0\leq r<1-H(p)$,
+
+- There exists $\mathcal{C}_1, \mathcal{C}_2,\ldots$ of rates $r_1,r_2,\ldots$ lengths $n_1,n_2,\ldots$ and $r_i\geq r$.
+- That with maximum likelihood decoding satisfy $P_{err}\to 0$ as $i\to \infty$.
+
+For any $R> 1-H(p)$,
+
+- For any sequence $\mathcal{C}_1, \mathcal{C}_2,\ldots$ of rates $r_1,r_2,\ldots$ lengths $n_1,n_2,\ldots$ with $r_i\geq R$,
+- and any decoding algorithm, $P_{err}\to 1$ as $i\to \infty$.
+
+$1-H(p)$ is the capacity of the BSC.
+
+- Informally, the capacity is the best possible rate of the code (asymptotically).
+- A special case of a broader theorem (Shannon's coding theorem).
+- We will see it later in this course.
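+
+A numerical illustration of the capacity formula, with crossover probability $p=0.1$ assumed only for the example and $\log$ taken base 2:
+
+$$
+\begin{aligned}
+H(0.1)&=0.1\log_2 10+0.9\log_2\tfrac{1}{0.9}\approx 0.332+0.137=0.469\\
+1-H(0.1)&\approx 0.531
+\end{aligned}
+$$
+
+So on this channel, rates up to roughly $0.53$ are achievable with vanishing error probability, whereas the repetition code with $n=3$ has rate $1/3$ and a fixed nonzero error probability.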
+
+Polar codes give an explicit construction of codes with rate arbitrarily close to capacity.
+
+### BSC capacity - Intuition
+
+The capacity of the binary symmetric channel with crossover probability $p$ is $1-H(p)$.
+
+A correct decoder $c'\to c$ essentially identifies two objects:
+
+- The codeword $c$
+- The error word $e=c'-c$ (subtraction $\bmod 2$).
+- $c$ and $e$ are independent of each other.
+
+A **typical** $e$ has $\approx np$ $1$'s (law of large numbers), say $n(p\pm \delta)$.
+
+Exercise:
+
+$\operatorname{Pr}(e)=p^{n(p\pm \delta)}(1-p)^{n(1-p\mp \delta)}=2^{-n(H(p)+\epsilon)}$ for some $\epsilon$ that goes to zero as $n\to \infty$.
+
+<details>
+<summary>Intuition</summary>
+
+There exist $\approx 2^{nH(p)}$ typical error words.
+
+To index those typical error words, we need $\log_2 (2^{nH(p)})=nH(p)+O(1)$ bits to identify the error word $e$.
+
+To encode the message, we need $\log_2 |\mathcal{C}|=k$ bits.
+
+Since we send only $n$ bits, we need $k+nH(p)+O(1)\leq n$, so $\frac{k}{n}\leq 1-H(p)$.
+
+So the rate cannot exceed $1-H(p)$.
+
+</details>
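+
+The count of typical error words used in the intuition above can be backed by a standard bound on binomial coefficients. This is a sketch, assuming for simplicity that $np$ is an integer and counting only the words with exactly $np$ ones:
+
+$$
+\frac{1}{n+1}2^{nH(p)}\leq \binom{n}{np}\leq 2^{nH(p)}
+$$
+
+Taking $\log_2$ shows that indexing these words takes between $nH(p)-\log_2(n+1)$ and $nH(p)$ bits, that is, $nH(p)$ up to lower-order terms.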
+
+<details>
+<summary>Formal proof</summary>
+
+$$
+\begin{aligned}
+\operatorname{Pr}(e)&=p^{n(p\pm \delta)}(1-p)^{n(1-p\mp \delta)}\\
+&=p^{np}(1-p)^{n(1-p)}p^{\pm n\delta}(1-p)^{\mp n\delta}\\
+\end{aligned}
+$$
+
+And
+
+$$
+\begin{aligned}
+2^{-n(H(p)+\epsilon)}&=2^{-n(-p\log p-(1-p)\log (1-p)+\epsilon)}\\
+&=2^{np\log p}2^{n(1-p)\log (1-p)}2^{-n\epsilon}\\
+&=p^{np}(1-p)^{n(1-p)}2^{-n\epsilon}\\
+\end{aligned}
+$$
+
+So we need to check that there exists $\epsilon$, vanishing as $\delta\to 0$, such that
+
+$$
+p^{\pm n\delta}(1-p)^{\mp n\delta}=2^{-n\epsilon}
+$$
+
+Solving for $\epsilon$ (taking the upper sign; the lower sign flips the sign of $\epsilon$):
+
+$$
+\begin{aligned}
+2^{-n\epsilon}&=p^{n\delta}(1-p)^{-n\delta}\\
+-n\epsilon&=\delta n\log p-\delta n\log (1-p)\\
+\epsilon&=\delta (\log (1-p)-\log p)\\
+\end{aligned}
+$$
+
+
+</details>
+
+## Hamming distance
+
+How to quantify the noise in the channel?
+
+- Number of flipped bits.
+
+Definition of Hamming distance:
+
+- Denote $c=(c_1,c_2,\ldots,c_n)$ and $c'=(c'_1,c'_2,\ldots,c'_n)$.
+- $d_H(c,c')=\sum_{i=1}^n [c_i\neq c'_i]$.
+
+Minimum Hamming distance:
+
+- Let $\mathcal{C}$ be a code.
+- $d_H(\mathcal{C})=\min_{c_1,c_2\in \mathcal{C},c_1\neq c_2}d_H(c_1,c_2)$.
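+
+A small worked example of both definitions. The words $10110$ and $11100$ are chosen here only for illustration; the code $\{000,111\}$ is the repetition code from the earlier example:
+
+$$
+\begin{aligned}
+d_H(10110,11100)&=[1\neq 1]+[0\neq 1]+[1\neq 1]+[1\neq 0]+[0\neq 0]=2\\
+d_H(\{000,111\})&=d_H(000,111)=3
+\end{aligned}
+$$
+
+The first line counts the two positions (second and fourth) where the words disagree. The second line has only one pair of distinct codewords to minimize over.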
+
+Hamming distance is a metric.
+
+- $d_H(x,y)\geq 0$, with equality iff $x=y$.
+- $d_H(x,y)=d_H(y,x)$
+- Triangle inequality: $d_H(x,y)\leq d_H(x,z)+d_H(z,y)$
+
+### Level of error handling
+
+error detection
+
+erasure correction
+
+error correction
+
+Erasure: replacement of an entry by $*\not\in F$.
+
+Error: substitution of one entry by a different one.
+
+Example: If $d_H(\mathcal{C})=d$.
+
+#### Error detection
+
+Theorem: If $d_H(\mathcal{C})=d$, then there exists $f:F^n\to \mathcal{C}\cup \{\text{"error detected"}\}$ that detects every pattern of $\leq d-1$ errors correctly.
+
+\* track lost *\
+
+Idea:
+
+Since $d_H(\mathcal{C})=d$, one needs $\geq d$ errors to cause "confusion", that is, to turn one codeword into another.
+
+#### Erasure correction
+
+Theorem: If $d_H(\mathcal{C})=d$, then there exists $f:(F\cup \{*\})^n\to \mathcal{C}\cup \{\text{"failed"}\}$ that recovers every pattern of at most $d-1$ erasures.
+
+Idea:
+
+\* track lost *\
+
+#### Error correction
+
+Define the Hamming ball of radius $r$ centered at $c$ as:
+
+$$
+B_H(c,r)=\{y\in F^n:d_H(c,y)\leq r\}
+$$
+
+Theorem: If $d_H(\mathcal{C})\geq d$, then there exists $f:F^n\to \mathcal{C}$ that corrects every pattern of at most $\lfloor \frac{d-1}{2}\rfloor$ errors.
+
+Ideas:
+
+The balls $\{B_H(c,\lfloor \frac{d-1}{2}\rfloor)|c\in \mathcal{C}\}$ are disjoint.
+
+Use closest neighbor decoding, use triangle inequality.
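+
+The disjointness claim can be sketched with the triangle inequality as follows. Here $t=\lfloor \frac{d-1}{2}\rfloor$ is shorthand introduced for this sketch. Suppose some $y$ lies in both $B_H(c_1,t)$ and $B_H(c_2,t)$ for distinct $c_1,c_2\in \mathcal{C}$. Then
+
+$$
+d_H(c_1,c_2)\leq d_H(c_1,y)+d_H(y,c_2)\leq 2t\leq d-1<d
+$$
+
+which contradicts $d_H(\mathcal{C})\geq d$. Hence every received word with at most $t$ errors lies in exactly one ball, and closest neighbor decoding returns the transmitted codeword.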
+
+## Intro to linear codes
+
+Summary: a code of minimum Hamming distance $d$ can
+
+- detect $\leq d-1$ errors.
+- correct $\leq d-1$ erasures.
+- correct $\leq \lfloor \frac{d-1}{2}\rfloor$ errors.
+
+Problems:
+
+- How to construct good codes, $k/n$ and $d$ large?
+- How good can these codes possibly be?
+- How to encode?
+- How to decode over a noisy channel?
+
+Tools
+
+- Linear algebra over finite fields.
+
+### Linear codes
+
+Consider $F^n$ as a vector space, and let $\mathcal{C}\subseteq F^n$ be a subspace.
+
+Since $F,\Phi$ are finite, we use finite fields (algebraic objects that "imitate" $\mathbb{R}$ and $\mathbb{C}$).
+
+Formally, they satisfy the field axioms.
+
+Next Lectures:
+
+- Field axioms
+- Prime fields ($\mathbb{F}_p$)
+- Field extensions (e.g. $\mathbb{F}_{p^t}$)
+
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -4,4 +4,5 @@ export default {
        type: 'separator'
    },
    CSE5313_L1: "CSE5313 Coding and information theory for data science (Lecture 1)",
+    CSE5313_L2: "CSE5313 Coding and information theory for data science (Lecture 2)",
 }
--- a/content/CSE5519/CSE5519_L2.md
+++ b/content/CSE5519/CSE5519_L2.md
--- a/content/CSE5519/_meta.js
+++ b/content/CSE5519/_meta.js
@@ -4,4 +4,5 @@ export default {
        type: 'separator'
    },
    CSE5519_L1: "CSE5519 Advances in Computer Vision (Lecture 1)",
+    CSE5519_L2: "CSE5519 Advances in Computer Vision (Lecture 2)",
 }