update

2025-08-28 12:51:46 -05:00
parent 802993833b
commit 7137d8aca2
6 changed files with 538 additions and 0 deletions
--- a/content/CSE510/CSE510_L2.md
+++ b/content/CSE510/CSE510_L2.md
@@ -0,0 +1,187 @@
 # CSE510 Deep Reinforcement Learning (Lecture 2)
 Introduction and Markov Decision Processes (MDPs)
 ## What is reinforcement learning (RL)
 - A general computational framework for behavior learning through reinforcement/trial and error
 - Deep RL: combining deep learning with RL for complex problems
 - Shows promise for artificial general intelligence (AGI)
 ## What RL can do now.
 ### Backgammon
 #### Neuro-Gammon
 Developed by Gerald Tesauro in 1989 at IBM's research center.
 Trained to mimic expert demonstrations using supervised learning.
 Played at the level of an intermediate human player.
 #### TD-Gammon (Temporal Difference Learning)
 Developed by Gerald Tesauro in 1992 in IBM's research center.
 A neural network that trains itself to be an evaluation function by playing against itself starting from random weights.
 Achieved performance close to top human players of its time.
 ### DeepMind Atari
 Use deep Q-learning to play Atari games.
 Without human demonstrations, it can learn to play the game at a superhuman level.
 ### AlphaGo
 Monte Carlo Tree Search, learning policy and value function networks for pruning the search tree, expert demonstrations, self-play, and TPU from Google.
 ### Video Games
 - OpenAI Five for Dota 2: won a 5v5 best-of-three match against top human players.
 - DeepMind AlphaStar for StarCraft: supervised training followed by league-based competition training.
 ### AlphaTensor
 Discovering faster matrix multiplication algorithms with reinforcement learning.
 AlphaTensor: 76 multiplications vs. the previously best known 80 for multiplying a $4\times 5$ matrix by a $5\times 5$ matrix.
 ### Training LLMs
 For verifiable tasks (coding, math, etc.), RL can be used to train a model to perform the task without human supervision.
 ### Robotics
 Unitree Go, Atlas by Boston Dynamics, etc.
 ## What are the challenges of RL in real world applications?
 Beating the human champion is "easier" than physically placing the Go stones.
 ### State estimation
 Known environments (known entities and dynamics) vs. unknown environments (unknown entities and dynamics).
 Need for behaviors to **transfer/generalize** across environmental variations since the real world is very diverse.
 > **State estimation**
 >
 > To be able to act, you need first to be able to **see**, detect the **objects** that you interact with, detect whether you achieved the **goal**.
 Most works are between two extremes:
 - Assuming the world model is known (object locations, shapes, and physical properties obtained via AR tags or manual tuning), they use planners to search for the action sequence that achieves a desired goal.
 - Do not attempt to detect any objects and learn to map RGB images directly to actions.
 Behavior learning is challenging because state estimation is challenging, in other words, because computer vision/perception is challenging.
 Interesting direction: **leveraging DRL and vision-language models**
 ### Efficiency
 Cheap vs. Expensive to get experience samples
 #### DRL Sample Efficiency
 Humans after 15 minutes tend to outperform DDQN after 115 hours.
 #### Reinforcement Learning in Human
 Humans appear to learn to act (e.g., walk) through "very few examples" of trial and error. How they do so is an open question.
 Possible answers:
 - Hardware: 230 million years of bipedal movement data
 - Imitation learning: observation of other humans walking (e.g., episodic memory and semantic memory)
 - Algorithms: Better than backpropagation and stochastic gradient descent
 #### Discrete and continuous action spaces
 Computation is discrete, but the real action space is continuous.
 #### One-goal vs. Multi-goal
 Life is a multi-goal problem involving infinitely many possible games.
 #### Automatic reward detection
 Our curiosity is a reward.
 #### And more
 - Transfer learning
 - Generalization
 - Long horizon reasoning
 - Model-based RL
 - Sparse rewards
 - Reward design/learning
 - Planning/Learning
 - Lifelong learning
 - Safety
 - Interpretability
 - etc.
 ## What is the course about?
 To teach you RL models and algorithms.
 - To be able to tackle real world problems.
 To excite you about RL.
 - To provide a primer for you to launch advanced studies.
 Schedule:
 - RL Model and basic algorithms
  - Markov Decision Process (MDP)
  - Passive RL: ADP and TD-learning
  - Active RL: Q-Learning and SARSA
 - Deep RL algorithms
  - Value-Based methods
  - Policy Gradient Methods
  - Model-Based methods
 - Advanced Topics
  - Offline RL, Multi-Agent RL, etc.
 ### Reinforcement Learning Algorithms
 #### Model-Based
 - Learn the model of the world, then plan using the model
 - Update model often
 - Re-plan often
 #### Value-Based
 - Learn the state or state-action value
 - Act by choosing best action in state
 - Exploration is a necessary add-on
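The value-based recipe above can be sketched as follows. Epsilon-greedy is used here as one common way to add exploration; it is an assumption of this sketch, not something the lecture prescribes, and the names `epsilon_greedy` and `q_values` are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon a uniformly random action is chosen (exploration);
    otherwise the action with the highest estimated value is chosen (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: with epsilon = 0 the choice is purely greedy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Setting `epsilon` to 0 recovers "act by choosing the best action in state"; any positive value occasionally overrides that choice.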
 #### Policy-based
 - Learn the stochastic policy function that maps state to action
 - Act by sampling policy
 - Exploration is baked in
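A minimal sketch of "act by sampling policy", assuming the policy is given as a softmax over per-action scores. The softmax parameterization and the names `softmax` and `sample_action` are assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng=random):
    """Sample an action index with probability given by the softmax policy."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because every action with nonzero probability can be drawn, exploration comes from the sampling step itself rather than from a separate mechanism.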
 #### From better sample efficiency to less sample efficiency
 - Model-Based
 - Off-policy/Q-learning
 - Actor-critic
 - On-policy/Policy gradient
 - Evolutionary/Gradient-free
 ## What is RL?
 ## RL model: Markov Decision Process (MDP)
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -4,4 +4,5 @@ export default {
        type: 'separator'
    },
    CSE510_L1: "CSE510 Deep Reinforcement Learning (Lecture 1)",
    CSE510_L2: "CSE510 Deep Reinforcement Learning (Lecture 2)",
 }
--- a/content/CSE5313/CSE5313_L2.md
+++ b/content/CSE5313/CSE5313_L2.md
@@ -0,0 +1,348 @@
 # CSE5313 Coding and information theory for data science (Lecture 2)
 ## Review on Channel coding
 Let $F$ be the input alphabet, $\Phi$ be the output alphabet.
 e.g. $F=\{0,1\},\mathbb{R}$.
 Introduce noise: $\operatorname{Pr}(c'\text{ received}|c\text{ transmitted})$.
 $u$ denotes the information word to be transmitted.
 $c$ is the codeword.
 $c'$ is the received word, given to the decoder.
 $u'$ is the decoded information word.
 Error if $u' \neq u$.
 Example:
 **Binary symmetric channel (BSC)**
 $F=\Phi=\{0,1\}$
 Every bit of $c$ is flipped with probability $p$.
 **Binary erasure channel (BEC)**
 $F=\{0,1\}$, $\Phi=\{0,1,*\}$; very common in practice when we are unsure which bit was transmitted.
 $c$ is transmitted, $c'$ is received.
 Each symbol of $c'$ equals the corresponding bit of $c$ with probability $1-p$, and is the erasure symbol $*$ with probability $p$.
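The two channels can be simulated with a few lines of Python. This is a sketch that assumes each position is affected independently; the function names `bsc` and `bec` are chosen here for illustration.

```python
import random

def bsc(codeword, p, rng=random):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [bit ^ 1 if rng.random() < p else bit for bit in codeword]

def bec(codeword, p, rng=random):
    """Binary erasure channel: replace each bit by '*' independently with probability p."""
    return ['*' if rng.random() < p else bit for bit in codeword]

# Example: send the same word through both channels with p = 0.2.
sent = [0, 1, 1, 0, 1]
received_bsc = bsc(sent, 0.2)
received_bec = bec(sent, 0.2)
```

The difference between the two is visible in the output alphabet: the BSC only ever produces bits, while the BEC marks the damaged positions explicitly.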
 ## Encoding
 Encoding $E$ is a function from $F^k$ to $F^n$.
 Where $E(u)=c$ is the codeword.
 Assume $n\geq k$, we don't compress the information.
 A code $\mathcal{C}$ is a subset of $F^n$.
 Encoding is a one to one mapping from $F^k$ to $\mathcal{C}$.
 In practice, we usually choose $\mathcal{C}\subseteq F^n$ to have the same size as $F^k$.
 ## Decoding
 $D$ is a function from $\Phi^n$ to $\mathcal{C}$.
 $D(c')=\hat{c}$
 The decoder then outputs the unique $u'$ such that $E(u')=\hat{c}$.
 Our aim is to have $u=u'$.
 Decoding error probability: $\operatorname{P}_{err}=\max_{c\in \mathcal{C}}\operatorname{P}_{err}(c)$.
 where $\operatorname{P}_{err}(c)=\sum_{y|D(y)\neq c}\operatorname{Pr}(y\text{ received}|c\text{ transmitted})$.
 Our goal is to construct decoder $D$ such that $\operatorname{P}_{err}$ is bounded.
 Example:
 Repetition code in binary symmetric channel:
 Let $F=\Phi=\{0,1\}$. Every bit of $c$ is flipped with probability $p$.
 Say $k=1$, $n=3$ and let $\mathcal{C}=\{000,111\}$.
 Let the encoder be $E(u)=u u u$.
 The decoder is $D(000)=D(100)=D(010)=D(001)=000$, $D(110)=D(101)=D(011)=D(111)=111$.
 Exercise: Compute the error probability of the repetition code in binary symmetric channel.
 <details>
 <summary>Solution</summary>
 Recall that $P_{err}(c)=\sum_{y|D(y)\neq c}\operatorname{Pr}(y\text{ received}|c\text{ transmitted})$.
 Use binomial random variable:
 $$
 \begin{aligned}
 P_{err}(000)&=\sum_{y|D(y)\neq 000}\operatorname{Pr}(y\text{ received}|000\text{ transmitted})\\
 &=\operatorname{Pr}(2\text{ flips or more})\\
 &=\binom{3}{2}p^2(1-p)+\binom{3}{3}p^3\\
 &=3p^2(1-p)+p^3\\
 \end{aligned}
 $$
 The computation is identical for $111$.
 $P_{err}=\max\{P_{err}(000),P_{err}(111)\}=P_{err}(000)=3p^2(1-p)+p^3$.
 </details>
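The closed form in the solution can be checked by brute force: enumerate all $2^3$ received words, sum the probability of those that decode incorrectly, and compare with $3p^2(1-p)+p^3$. The helper names below are hypothetical, and $p=0.1$ is an arbitrary test value.

```python
from itertools import product

def majority_decode(y):
    """Decode a received 3-bit word to the codeword 000 or 111 by majority vote."""
    return (1, 1, 1) if sum(y) >= 2 else (0, 0, 0)

def bsc_prob(y, c, p):
    """Pr(y received | c transmitted) on a BSC with crossover probability p."""
    flips = sum(yi != ci for yi, ci in zip(y, c))
    return p ** flips * (1 - p) ** (len(c) - flips)

def error_probability(c, p):
    """Sum Pr(y | c) over all received words y that do not decode to c."""
    return sum(bsc_prob(y, c, p)
               for y in product((0, 1), repeat=len(c))
               if majority_decode(y) != c)

p = 0.1
closed_form = 3 * p**2 * (1 - p) + p**3
enumerated = error_probability((0, 0, 0), p)
```

Both `enumerated` and `closed_form` agree up to floating-point rounding, and the same value is obtained for the codeword `(1, 1, 1)`.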
 ### Maximum likelihood principle
 For $p\leq 1/2$, the decoder in the last example is a maximum likelihood decoder.
 Notice that $\operatorname{Pr}(c'=000|c=000)=(1-p)^3$ and $\operatorname{Pr}(c'=000|c=111)=p^3$.
 - If $p\leq 1/2$, then $(1-p)^3\geq p^3$. $c=000$ is more likely to be transmitted than $c=111$.
 Similarly, $\operatorname{Pr}(c'=001|c=000)=(1-p)^2p$ and $\operatorname{Pr}(c'=001|c=111)=p^2(1-p)$.
 - If $p\leq 1/2$, then $(1-p)^2p\geq p^2(1-p)$. $c=000$ is more likely to have been transmitted than $c=111$.
 For $p>1/2$, the inequalities are reversed.
 In general, the maximum likelihood decoder is $D(c')=\arg\max_{c\in \mathcal{C}}\operatorname{Pr}(c'\text{ received}|c\text{ transmitted})$.
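A direct, brute-force sketch of the $\arg\max$ rule for the BSC. It evaluates the likelihood of every codeword, so it is only practical for tiny codes; the names `likelihood` and `ml_decode` are illustrative.

```python
def likelihood(received, codeword, p):
    """Pr(received | codeword transmitted) on a BSC with crossover probability p."""
    flips = sum(r != c for r, c in zip(received, codeword))
    return p ** flips * (1 - p) ** (len(codeword) - flips)

def ml_decode(received, code, p):
    """Return the codeword that maximizes the likelihood of the received word."""
    return max(code, key=lambda c: likelihood(received, c, p))

code = [(0, 0, 0), (1, 1, 1)]
decoded_low_noise = ml_decode((0, 0, 1), code, p=0.1)   # majority vote wins
decoded_high_noise = ml_decode((0, 0, 1), code, p=0.9)  # decision is reversed
```

For $p<1/2$ this reproduces majority decoding on the repetition code, and for $p>1/2$ the decision flips, matching the discussion above.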
 ## Defining a "good" code
 Two metrics:
 - How many redundant bits are needed?
  - e.g. repetition code: $k=1$, $n=3$ sends $2$ redundant bits.
 - What is the resulting error probability?
  - Depends on the decoding function.
  - Normally, maximum likelihood decoding is assumed.
  - Should go to zero as $n$ grows.
 ### Definition for rate of code
 The rate of a code is $\frac{k}{n}$.
 More generally, $\frac{\log_{|F|}|\mathcal{C}|}{n}$.
 ### Definition for information entropy
 Let $X$ be a random variable over a discrete set $\mathcal{X}$.
 - That is every $x\in \mathcal{X}$ has a probability $\operatorname{Pr}(X=x)$.
 The entropy $H(X)$ of a discrete random variable $X$ is defined as:
 $$
 H(X)=\mathbb{E}_{x\sim X}{\log \frac{1}{\operatorname{Pr}(x)}}=-\sum_{x\in \mathcal{X}}\operatorname{Pr}(x)\log \operatorname{Pr}(x)
 $$
 When $X\sim\operatorname{Bernoulli}(p)$, we denote $H(X)=H(p)=-p\log p-(1-p)\log (1-p)$.
 A deeper explanation will be given later in the course.
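The entropy definition translates directly into code. The sketch below assumes logarithms base 2 (entropy measured in bits), which the notes leave implicit, and uses the usual convention that terms with probability 0 contribute nothing.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

def binary_entropy(p):
    """H(p) for a Bernoulli(p) random variable."""
    return entropy([p, 1 - p])

fair_coin = binary_entropy(0.5)    # maximal uncertainty for a single bit
biased_coin = binary_entropy(0.11)
```

A fair coin has exactly one bit of entropy, and the value decreases toward zero as $p$ approaches 0 or 1.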
 ## Which rates are possible?
 Claude Shannon '48: Coding theorem of the BSC (binary symmetric channel)
 Recall $r=\frac{k}{n}$.
 Let $H(\cdot)$ be the entropy function.
 For every $0\leq r<1-H(p)$,
 - There exists a sequence of codes $\mathcal{C}_1, \mathcal{C}_2,\ldots$ of rates $r_1,r_2,\ldots$ and lengths $n_1,n_2,\ldots$ with $r_i\geq r$
 - that, with maximum likelihood decoding, satisfies $P_{err}\to 0$ as $i\to \infty$.
 For any $R>1-H(p)$,
 - for any sequence of codes $\mathcal{C}_1, \mathcal{C}_2,\ldots$ of rates $r_1,r_2,\ldots$ and lengths $n_1,n_2,\ldots$ with $r_i\geq R$
 - and any decoding algorithm, $P_{err}\to 1$ as $i\to \infty$.
 $1-H(p)$ is the capacity of the BSC.
 - Informally, the capacity is the best possible rate of the code (asymptotically).
 - A special case of a broader theorem (Shannon's coding theorem).
 - We will see later in this course.
 Polar codes give an explicit construction of codes with rate arbitrarily close to capacity.
 ### BSC capacity - Intuition
 The capacity of the binary symmetric channel with crossover probability $p$ is $1-H(p)$.
 A correct decoder $c'\to c$ essentially identifies two objects:
 - The codeword $c$
 - The error word $e=c'-c$, where subtraction is $\bmod 2$.
 - $c$ and $e$ are independent of each other.
 A **typical** $e$ has $\approx np$ $1$'s (law of large numbers), say $n(p\pm \delta)$.
 Exercise:
 $\operatorname{Pr}(e)=p^{n(p\pm \delta)}(1-p)^{n(1-p\mp \delta)}=2^{-n(H(p)+\epsilon)}$ for some $\epsilon$ that goes to zero as $n\to \infty$.
 <details>
 <summary>Intuition</summary>
 There exist $\approx 2^{nH(p)}$ typical error words.
 To index those typical error words, we need $\log_2 (2^{nH(p)})=nH(p)+O(1)$ bits to identify the error word $e$.
 To encode the message, we need $\log_2 |\mathcal{C}|=k$ bits.
 Since we send $n$ bits, we need $k+nH(p)+O(1)\leq n$, so $\frac{k}{n}\leq 1-H(p)$.
 So the rate cannot exceed $1-H(p)$.
 </details>
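The count of typical error words can be checked numerically. The sketch below counts only the error words of weight exactly $\operatorname{round}(np)$, a simplification of the range $n(p\pm\delta)$, and compares the exponent $\frac{1}{n}\log_2\binom{n}{np}$ with $H(p)$. The function names are illustrative.

```python
import math

def binary_entropy(p):
    """H(p) in bits."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def typical_count_exponent(n, p):
    """(1/n) * log2 of the number of length-n binary words with round(n*p) ones."""
    weight = round(n * p)
    return math.log2(math.comb(n, weight)) / n

p = 0.1
exponent_small = typical_count_exponent(100, p)
exponent_large = typical_count_exponent(1000, p)
target = binary_entropy(p)
```

As $n$ grows the exponent approaches `target` from below, which is the sense in which there are about $2^{nH(p)}$ typical error words.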
 <details>
 <summary>Formal proof</summary>
 $$
 \begin{aligned}
 \operatorname{Pr}(e)&=p^{n(p\pm \delta)}(1-p)^{n(1-p\mp \delta)}\\
 &=p^{np}(1-p)^{n(1-p)}p^{\pm n\delta}(1-p)^{\mp n\delta}\\
 \end{aligned}
 $$
 And
 $$
 \begin{aligned}
 2^{-n(H(p)+\epsilon)}&=2^{-n(-p\log p-(1-p)\log (1-p)+\epsilon)}\\
 &=2^{np\log p}2^{n(1-p)\log (1-p)}2^{-n\epsilon}\\
 &=p^{np}(1-p)^{n(1-p)}2^{-n\epsilon}\\
 \end{aligned}
 $$
 So we need to find $\epsilon$ such that
 $$
 p^{\pm n\delta}(1-p)^{\mp n\delta}=2^{-n\epsilon}
 $$
 Taking the upper signs (the lower signs are symmetric) and solving for $\epsilon$:
 $$
 \begin{aligned}
 2^{-n\epsilon}&=p^{n\delta}(1-p)^{-n\delta}\\
 -n\epsilon&=\delta n\log p-\delta n\log (1-p)\\
 \epsilon&=\delta (\log (1-p)-\log p)\\
 \end{aligned}
 $$
 which goes to zero as $\delta\to 0$.
 </details>
 ## Hamming distance
 How to quantify the noise in the channel?
 - Number of flipped bits.
 Definition of Hamming distance:
 - Denote $c=(c_1,c_2,\ldots,c_n)$ and $c'=(c'_1,c'_2,\ldots,c'_n)$.
 - $d_H(c,c')=\sum_{i=1}^n [c_i\neq c'_i]$.
 Minimum Hamming distance:
 - Let $\mathcal{C}$ be a code.
 - $d_H(\mathcal{C})=\min_{c_1,c_2\in \mathcal{C},c_1\neq c_2}d_H(c_1,c_2)$.
 Hamming distance is a metric.
 - $d_H(x,y)\geq 0$, with equality iff $x=y$.
 - $d_H(x,y)=d_H(y,x)$
 - Triangle inequality: $d_H(x,y)\leq d_H(x,z)+d_H(z,y)$
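Both definitions above are short enough to write out directly. This is a brute-force sketch: the minimum distance is computed over all pairs of codewords, which is fine for small codes only.

```python
from itertools import combinations

def hamming_distance(x, y):
    """Number of positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(hamming_distance(c1, c2) for c1, c2 in combinations(code, 2))

d_pair = hamming_distance("10110", "11100")
d_repetition = minimum_distance(["000", "111"])
```

For the length-3 repetition code the minimum distance is 3, since the two codewords differ in every position.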
 ### Level of error handling
 - error detection
 - erasure correction
 - error correction
 Erasure: replacement of an entry by $*\not\in F$.
 Error: substitution of one entry by a different one.
 Example: If $d_H(\mathcal{C})=d$.
 #### Error detection
 Theorem: If $d_H(\mathcal{C})=d$, then there exists $f:F^n\to \mathcal{C}\cup \{\text{"error detected"}\}$ that detects every pattern of $\leq d-1$ errors correctly.
 \* track lost *\
 Idea:
 Since $d_H(\mathcal{C})=d$, one needs $\geq d$ errors to cause "confusion", i.e., to turn one codeword into another.
 #### Erasure correction
 Theorem: If $d_H(\mathcal{C})=d$, then there exists $f:(F\cup \{*\})^n\to \mathcal{C}\cup \{\text{"failed"}\}$ that recovers every pattern of at most $d-1$ erasures.
 Idea:
 \* track lost *\
 #### Error correction
 Define the Hamming ball of radius $r$ centered at $c$ as:
 $$
 B_H(c,r)=\{y\in F^n:d_H(c,y)\leq r\}
 $$
 Theorem: If $d_H(\mathcal{C})\geq d$, then there exists $f:F^n\to \mathcal{C}$ that corrects every pattern of at most $\lfloor \frac{d-1}{2}\rfloor$ errors.
 Ideas:
 The balls $\{B_H(c,\lfloor \frac{d-1}{2}\rfloor)|c\in \mathcal{C}\}$ are disjoint.
 Use nearest-neighbor decoding; correctness follows from the triangle inequality.
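The theorem can be verified exhaustively on a small example. The sketch below uses the length-5 binary repetition code (an example chosen here, with $d=5$), decodes to the nearest codeword, and checks every error pattern of weight at most $t$.

```python
from itertools import combinations

def hamming_distance(x, y):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))

def nearest_codeword(received, code):
    """Nearest-neighbor decoding: return the codeword closest to the received word."""
    return min(code, key=lambda c: hamming_distance(received, c))

def corrects_all(code, t):
    """Check that every pattern of at most t bit flips is decoded back correctly."""
    n = len(code[0])
    for c in code:
        for weight in range(t + 1):
            for positions in combinations(range(n), weight):
                y = list(c)
                for i in positions:
                    y[i] ^= 1  # flip the bit at this position
                if nearest_codeword(tuple(y), code) != c:
                    return False
    return True

code = [(0,) * 5, (1,) * 5]
ok_two_errors = corrects_all(code, 2)    # floor((5 - 1) / 2) = 2
ok_three_errors = corrects_all(code, 3)  # beyond the guaranteed radius
```

With $d=5$ every pattern of up to 2 errors is corrected, while 3 errors already move the received word closer to the other codeword.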
 ## Intro to linear codes
 Summary: a code of minimum Hamming distance $d$ can
 - detect $\leq d-1$ errors.
 - correct $\leq d-1$ erasures.
 - correct $\leq \lfloor \frac{d-1}{2}\rfloor$ errors.
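The first two capabilities in the summary can be sketched for the length-3 repetition code ($d=3$). The functions below are one possible choice of $f$, assuming brute-force search over the code; the return strings follow the ones used in the theorems.

```python
def detect(received, code):
    """Return the received word if it is a codeword, otherwise report an error."""
    return received if received in code else "error detected"

def correct_erasures(received, code):
    """Return the unique codeword agreeing with all non-erased positions, or 'failed'."""
    matches = [c for c in code
               if all(r == '*' or r == s for r, s in zip(received, c))]
    return matches[0] if len(matches) == 1 else "failed"

code = ["000", "111"]
detected = detect("010", code)               # 1 error, fewer than d = 3
recovered = correct_erasures("*1*", code)    # 2 erasures, at most d - 1
ambiguous = correct_erasures("***", code)    # 3 erasures, too many
```

Up to $d-1=2$ errors can never turn one codeword into another, and up to 2 erasures always leave at least one position that distinguishes the two codewords.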
 Problems:
 - How to construct good codes, with both $k/n$ and $d$ large?
 - How good can these codes possibly be?
 - How to encode?
 - How to decode over a noisy channel?
 Tools
 - Linear algebra over finite fields.
 ### Linear codes
 Consider $F^n$ as a vector space, and let $\mathcal{C}\subseteq F^n$ be a subspace.
 Since $F,\Phi$ are finite, we use finite fields (algebraic objects that "imitate" $\mathbb{R}$, $\mathbb{C}$).
 Formally, they satisfy the field axioms.
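A small illustration of a code that is a subspace, assuming $F=\mathbb{F}_2$ (bits with addition mod 2). The generator vectors below are a hypothetical example, not taken from the lecture; the code is their span, and the check confirms it is closed under addition.

```python
from itertools import product

def span_f2(generators):
    """All linear combinations over F_2 of the given generator vectors."""
    n = len(generators[0])
    code = set()
    for coeffs in product((0, 1), repeat=len(generators)):
        word = tuple(sum(a * g[i] for a, g in zip(coeffs, generators)) % 2
                     for i in range(n))
        code.add(word)
    return code

def is_closed_under_addition(code):
    """Check that the sum (mod 2) of any two codewords is again a codeword."""
    return all(tuple((a + b) % 2 for a, b in zip(x, y)) in code
               for x in code for y in code)

generators = [(1, 0, 1), (0, 1, 1)]  # hypothetical example
code = span_f2(generators)
closed = is_closed_under_addition(code)
```

Here $k=2$ generators give $2^2=4$ codewords of length $n=3$. An arbitrary subset of $F^n$ generally fails the closure check, which is what distinguishes a linear code from a general one.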
 Next Lectures:
 - Field axioms
 - Prime fields ($\mathbb{F}_p$)
 - Field extensions (e.g. $\mathbb{F}_{p^t}$)
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -4,4 +4,5 @@ export default {
        type: 'separator'
    },
    CSE5313_L1: "CSE5313 Coding and information theory for data science (Lecture 1)",
    CSE5313_L2: "CSE5313 Coding and information theory for data science (Lecture 2)",
 }
--- a/content/CSE5519/CSE5519_L2.md
+++ b/content/CSE5519/CSE5519_L2.md
--- a/content/CSE5519/_meta.js
+++ b/content/CSE5519/_meta.js
@@ -4,4 +4,5 @@ export default {
        type: 'separator'
    },
    CSE5519_L1: "CSE5519 Advances in Computer Vision (Lecture 1)",
    CSE5519_L2: "CSE5519 Advances in Computer Vision (Lecture 2)",
 }