upgrade structures and migrate to nextra v4

Zheyuan Wu
2025-07-06 12:40:25 -05:00
parent 76e50de44d
commit 717520624d
317 changed files with 18143 additions and 22777 deletions

content/CSE347/CSE347_L9.md Normal file

# Lecture 9
## Randomized Algorithms
### Hashing
Hashing with chaining:
Input: We have integers from a universe $U=\{0,1,\dots,N-1\}$. We want to map them to a hash table $T$ with $m$ slots.
Hash function: $h:U\rightarrow [m]$
Goal: Hash a set $S\subseteq U$ with $|S|=n$ into $T$ such that the number of elements in each slot is at most $1$.
#### Collisions
When multiple keys are mapped to the same slot, we call it a collision; we keep a linked list of all the keys that map to the same slot.
**Runtime** of insert, query, delete of elements $=\Theta(\textup{length of the chain})$
**Worst-case** runtime of insert, query, delete of elements $=\Theta(n)$
Therefore, we want chains to be short, i.e., $\Theta(1)$ as long as $|S|$ is reasonably sized; equivalently, we want the elements of any set $S$ to hash **uniformly** across all slots.
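As a quick illustration (a minimal Python sketch, not from the lecture; the class and method names are made up), chaining keeps one list per slot, and every operation walks the chain for the key's slot, which is why chain length determines the cost:
```python
class ChainedHashTable:
    """Hash table with chaining: one Python list per slot."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # the chains

    def _h(self, x):
        return x % self.m   # placeholder hash function h: U -> [m]

    def insert(self, x):
        chain = self.slots[self._h(x)]
        if x not in chain:                    # cost = Theta(chain length)
            chain.append(x)

    def query(self, x):
        return x in self.slots[self._h(x)]    # scans the whole chain

    def delete(self, x):
        chain = self.slots[self._h(x)]
        if x in chain:
            chain.remove(x)
```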
#### Simple Uniform Hashing Assumptions
The $n$ elements we want to hash (the set $S$) are picked uniformly at random from $U$. Under this assumption, the following simple hash function works fine:
$$
h(x)=x \bmod m
$$
Question: What happens if an adversary knows this function and designs $S$ to make the worst-case runtime happen?
Answer: The adversary can make the runtime of each operation $\Theta(n)$ by simply making all the elements hash to the same slot.
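For instance, with $h(x)=x\bmod m$ the adversary can simply submit multiples of $m$ (a tiny hypothetical demo):
```python
# Adversarial input for h(x) = x % m: every key lands in slot 0,
# so one chain has length n and each operation costs Theta(n).
m = 8
S = [k * m for k in range(10)]   # 0, 8, 16, ..., 72
print({x % m for x in S})        # prints {0}: all keys collide
```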
#### Randomization to the rescue
We don't want the adversary to know the hash function based on just looking at the code.
Idea: Randomize the choice of the hash function.
### Randomized Algorithm
#### Definition
A randomized algorithm is an algorithm that makes internal random choices.
2 kinds of randomized algorithms:
1. Las Vegas: The runtime is random, but the output is always correct.
2. Monte Carlo: The runtime is fixed, but the output is sometimes incorrect.
We will focus on Las Vegas algorithms in this course.
For these, we analyze the expected runtime, showing e.g. $E[T(n)]=O(n)$, or some other probabilistic quantity.
#### Randomization can help
Idea: Randomize the choice of hash function $h$ from a family of hash functions, $H$.
If we randomly pick a hash function from this family, then the probability that the hash function is bad on **any particular** set $S$ is small.
Intuitively, the adversary cannot pick a bad input, since most hash functions in the family are good for any particular input $S$.
#### Universal Hashing: Goal
We want to design a universal family of hash functions, $H$, such that the probability that the hash table behaves badly on any input $S$ is small.
#### Universal Hashing: Definition
Suppose we have $m$ buckets in the hash table. We also have $2$ inputs $x\neq y$ and $x,y\in U$. We want $x$ and $y$ to be unlikely to hash to the same bucket.
$H$ is a universal **family** of hash functions if for any two elements $x\neq y$,
$$
Pr_{h\in H}[h(x)=h(y)]\leq\frac{1}{m}
$$
where $h$ is picked uniformly at random from the family $H$.
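One classic family with this property is the Carter-Wegman construction $h_{a,b}(x)=((ax+b)\bmod p)\bmod m$ for a fixed prime $p\geq N$. It is not covered in these notes, so treat the sketch below as illustrative; the randomness is in the draw of $(a,b)$:
```python
import random

def make_universal_hash(p, m):
    """Draw h(x) = ((a*x + b) % p) % m from the Carter-Wegman family.

    Assumes p is a prime at least as large as the universe size N.
    """
    a = random.randint(1, p - 1)   # a != 0
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

# The adversary can read this code, but not the random (a, b):
h = make_universal_hash(p=10_007, m=100)   # 10007 is prime
```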
#### Universal Hashing: Analysis
Claim: If we choose $h$ randomly from a universal family of hash functions, $H$, then the hash table will exhibit good behavior on any set $S$ of size $n$ with high probability.
Question: What are the good properties, and what does "with high probability" mean?
Claim: Given a universal family of hash functions, $H$, $S=\{a_1,a_2,\cdots,a_n\}\subset \mathbb{N}$. For any probability $0\leq \delta\leq 1$, if $n\leq \sqrt{2m\delta}$, the chance that no two keys hash to the same slot is $\geq1-\delta$.
Example: If we pick $\delta=\frac{1}{2}$, then as long as $n\leq\sqrt{m}$, the chance that no two keys hash to the same slot is $\geq\frac{1}{2}$.
If we pick $\delta=\frac{1}{3}$, then as long as $n\leq\sqrt{\frac{2}{3}m}$, the chance that no two keys hash to the same slot is $\geq\frac{2}{3}$.
Proof Strategy:
1. Compute the **expected number** of collisions. Note that a collision occurs when two different keys are hashed to the same slot. (Indicator random variables)
2. Apply a "tail" bound that converts the expected value to probability. (Markov's inequality)
##### Compute the expected number of collisions
Let $m$ be the size of the hash table. $n$ is the number of keys in the set $S$. $N$ is the size of the universe.
For inputs $x,y\in S,x\neq y$, we define a random variable
$$
C_{xy}=
\begin{cases}
1 & \text{if } h(x)=h(y) \\
0 & \text{otherwise}
\end{cases}
$$
$C_{xy}$ is called an indicator random variable, that takes value $0$ or $1$.
By universality, the expected value of a single indicator is
$$
E[C_{xy}]=1\times Pr[C_{xy}=1]+0\times Pr[C_{xy}=0]=Pr[C_{xy}=1]\leq\frac{1}{m}
$$
Define $C_x$: the number of elements of $S$ that collide with $x$ (= number of elements $y\neq x$ such that $h(x)=h(y)$). The cost of inserting/searching/deleting $x$ is at most $1+C_x$, since each operation scans the chain in slot $h(x)$.
$$
C_x=\sum_{y\in S,y\neq x,h(x)=h(y)}1
$$
So, $C_x=\sum_{y\in S,y\neq x}C_{xy}$.
By linearity of expectation,
$$
E[C_x]=\sum_{y\in S,y\neq x}E[C_{xy}]\leq\sum_{y\in S,y\neq x}\frac{1}{m}=\frac{n-1}{m}
$$
So $E[C_x]=O(1)$ if $n=O(m)$, and by linearity of expectation the total expected cost of $k$ insert/search operations is $O(k)$.
Now let $C$ be the total number of collisions.
$C=\frac{\sum_{x\in S}C_x}{2}$, because each colliding pair $\{x,y\}$ is counted once for $x$ and once for $y$.
$$
E[C]=\frac{1}{2}\sum_{x\in S}E[C_x]\leq\frac{1}{2}\sum_{x\in S}\frac{n-1}{m}=\frac{n(n-1)}{2m}
$$
If we want $E[C]\leq \delta$, it suffices to take $n\leq\sqrt{2m\delta}$, since then $E[C]\leq\frac{n^2}{2m}\leq\delta$.
#### The probability of no collisions $C=0$
We know that the expected number of collisions is now $\leq \delta$, but what about the probability of **NO** collisions?
> Markov's inequality: for a non-negative random variable $X$ and any $k>0$, $$Pr[X\geq k]\leq\frac{E[X]}{k}$$
> Equivalently, $Pr[X\geq k\cdot E[X]]\leq \frac{1}{k}$.
Apply this to $C$ with $k=1$:
$$
Pr[C\geq 1]\leq E[C]\leq\delta
$$
So, if $n\leq\sqrt{2m\delta}$, then $Pr[C=0]\geq 1-\delta$: with probability at least $1-\delta$, there are no collisions.
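A quick empirical check of this conclusion (a sketch reusing the hypothetical Carter-Wegman family above; the constants $p=10007$, $m=1000$, $\delta=\frac{1}{2}$ are arbitrary choices for the demo):
```python
import random

def count_collisions(n, m, p=10_007):
    """One trial: draw a fresh random h, hash n distinct keys, count collisions."""
    a, b = random.randint(1, p - 1), random.randint(0, p - 1)
    slots = {}
    for x in random.sample(range(p), n):
        slots.setdefault(((a * x + b) % p) % m, []).append(x)
    return sum(len(c) * (len(c) - 1) // 2 for c in slots.values())

delta, m = 0.5, 1000
n = int((2 * m * delta) ** 0.5)                      # n <= sqrt(2*m*delta)
trials = [count_collisions(n, m) for _ in range(10_000)]
print(sum(trials) / len(trials))                     # ~ n(n-1)/(2m) <= delta
print(sum(t == 0 for t in trials) / len(trials))     # should be >= 1 - delta
```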
#### More general conclusion
Claim: For a universal hash function family $H$, if the number of keys satisfies $n\leq \sqrt{Bm\delta}$, then the probability that at most $B+1$ keys hash to any one slot is $> 1-\delta$.
### Example: Quicksort
Quicksort is based on partitioning (assume all elements are distinct): Partition($A[p\cdots r]$)
- Rearranges $A$ into $A[p\cdots q-1],A[q],A[q+1\cdots r]$, where every element of $A[p\cdots q-1]$ is less than the pivot $A[q]$ and every element of $A[q+1\cdots r]$ is greater.
Runtime: $O(r-p)$, linear time.
```python
def partition(A, p, r):
    x = A[r]                       # pivot: the last element
    lo = p                         # boundary of the "< pivot" region
    for i in range(p, r):
        if A[i] < x:
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]      # put the pivot in its final position q
    return lo

def quicksort(A, p, r):
    if p < r:
        q = partition(A, p, r)     # pivot lands at index q
        quicksort(A, p, q - 1)
        quicksort(A, q + 1, r)
```
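A quick usage check (hypothetical demo values):
```python
A = [5, 2, 9, 1, 7, 3]
quicksort(A, 0, len(A) - 1)
print(A)   # [1, 2, 3, 5, 7, 9]
```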
#### Runtime analysis
Let the number of elements in $A_{low}$ (the part left of the pivot) be $k$.
$$
T(n)=\Theta(n)+T(k)+T(n-k-1)
$$
Under an even-split assumption, $k=\frac{n}{2}$:
$$
T(n)=T\left(\frac{n}{2}\right)+T\left(\frac{n}{2}-1\right)+\Theta(n)=\Theta(n\log n)
$$
which is roughly the same as merge sort. (In the worst case, though, the splits are maximally unbalanced and $T(n)=T(n-1)+\Theta(n)=\Theta(n^2)$.)
_Average case analysis is always suspicious._
### Randomized Quicksort
- Pick a random pivot element.
- Analyze the expected runtime over the random choices of pivot.
```python
import random

def randomized_partition(A, p, r):
    ix = random.randint(p, r)      # choose the pivot index uniformly at random
    x = A[ix]
    A[r], A[ix] = A[ix], A[r]      # move the pivot to the end, then
    lo = p                         # partition exactly as before
    for i in range(p, r):
        if A[i] < x:
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]
    return lo

def randomized_quicksort(A, p, r):
    if p < r:
        q = randomized_partition(A, p, r)
        randomized_quicksort(A, p, q - 1)
        randomized_quicksort(A, q + 1, r)
```
$$
E[T(n)]=E[T(k)+T(n-k-1)+cn]=E[T(k)]+E[T(n-k-1)]+cn
$$
by linearity of expectation.
$$
Pr[\textup{pivot has rank }k]=\frac{1}{n}
$$
So,
$$
\begin{aligned}
E[T(n)]&=cn+\sum_{k=0}^{n-1}Pr[\textup{pivot has rank }k]\left(E[T(k)]+E[T(n-k-1)]\right)\\
&=cn+\frac{1}{n}\sum_{k=0}^{n-1}E[T(k)]+\frac{1}{n}\sum_{k=0}^{n-1}E[T(n-k-1)]\\
&=cn+\frac{2}{n}\sum_{k=0}^{n-1}E[T(k)]
\end{aligned}
$$
(the two sums are equal, since as $k$ runs over $0,\dots,n-1$, so does $n-k-1$).
Claim: the solution to this recurrence is $E[T(n)]=O(n\log n)$; specifically, $E[T(n)]\leq c'n\log n+1$ for a suitable constant $c'$.
Proof:
We prove by induction.
Base case: $n=1$, $E[T(1)]=c$.
Inductive step: Assume that $E[T(k)]\leq c'k\log k+1$ for all $k<n$.
Then,
$$
\begin{aligned}
E[T(n)]&=cn+\frac{2}{n}\sum_{k=0}^{n-1}E[T(k)]\\
&\leq cn+\frac{2}{n}\sum_{k=0}^{n-1}(c'k\log k+1)\\
&=cn+\frac{2c'}{n}\sum_{k=0}^{n-1}k\log k+\frac{2}{n}\sum_{k=0}^{n-1}1
\end{aligned}
$$
Then we use the fact that $\sum_{k=0}^{n-1}k\log k\leq \frac{n^2\log n}{2}-\frac{n^2}{8}$ (can be proved by induction).
$$
\begin{aligned}
E[T(n)]&\leq cn+\frac{2c'}{n}\left(\frac{n^2\log n}{2}-\frac{n^2}{8}\right)+\frac{2}{n}\cdot n\\
&=c'n\log n-\frac{1}{4}c'n+cn+2\\
&=(c'n\log n+1)-\left(\frac{1}{4}c'n-cn-1\right)
\end{aligned}
$$
It remains to show that $\frac{1}{4}c'n-cn-1\geq 0$, i.e., to choose $c'$ so that $\frac{1}{4}c'n\geq cn+1$ for all $n\geq 2$.
If $c'\geq 8c$ (with $c\geq 1$), then $\frac{1}{4}c'n\geq 2cn\geq cn+1$, so $E[T(n)]\leq c'n\log n+1$.
$E[T(n)]\leq c'n\log n+1=O(n\log n)$
QED
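The recurrence can also be checked numerically (a sketch; the choices $c=1$, $c'=8c$, and base-2 logarithms are assumptions for the demo, not from the lecture):
```python
import math

# Iterate E[T(n)] = c*n + (2/n) * sum_{k<n} E[T(k)] bottom-up and
# compare against the claimed bound c'*n*log(n) + 1.
c, cp = 1.0, 8.0                 # cp plays the role of c' >= 8c
ET = [0.0, c]                    # E[T(0)] = 0, E[T(1)] = c
prefix = ET[0] + ET[1]           # running sum of E[T(k)] for k < n
for n in range(2, 1001):
    ET.append(c * n + 2.0 * prefix / n)
    prefix += ET[-1]

for n in (10, 100, 1000):
    print(n, round(ET[n], 1), round(cp * n * math.log2(n) + 1, 1))
    # the claimed bound holds with a lot of room to spare
```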
A more elegant proof:
Let $X_{ij}$ be an indicator random variable that is $1$ if the element of rank $i$ is ever compared to the element of rank $j$ during the execution.
The total number of comparisons, which dominates the running time, is $$X=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}X_{ij}$$
For a single indicator,
$$
E[X_{ij}]=Pr[X_{ij}=1]\times 1+Pr[X_{ij}=0]\times 0=Pr[X_{ij}=1]
$$
So the expected running time of randomized quicksort is governed by the expected number of comparisons:
$$
\begin{aligned}
E[X]&=E[\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}X_{ij}]\\
&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}E[X_{ij}]\\
&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}Pr[X_{ij}=1]
\end{aligned}
$$
For any two elements $z_i,z_j\in S$ of ranks $i<j$, $z_i$ is compared to $z_j$ if and only if one of them is the first pivot picked among the $j-i+1$ elements of ranks $i,i+1,\dots,j$ (otherwise some element of intermediate rank is picked first and separates them). Each of these elements is equally likely to be picked first, so
$$
\begin{aligned}
Pr[X_{ij}=1]&=Pr[z_i\text{ is picked first}]+Pr[z_j\text{ is picked first}]\\
&=\frac{1}{j-i+1}+\frac{1}{j-i+1}\\
&=\frac{2}{j-i+1}
\end{aligned}
$$
So, with the harmonic number $H_n=\sum_{k=1}^{n}\frac{1}{k}=\Theta(\log n)$,
$$
\begin{aligned}
E[X]&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}\frac{2}{j-i+1}\\
&\leq 2\sum_{i=0}^{n-2}\sum_{k=1}^{n-i-1}\frac{1}{k}\\
&\leq 2\sum_{i=0}^{n-2}H_n\\
&=2(n-1)H_n\\
&=O(n\log n)
\end{aligned}
$$
(a matching lower bound on the same double sum shows $E[X]=\Theta(n\log n)$).
QED
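The $\Theta(n\log n)$ comparison count is easy to observe empirically (a sketch; `quicksort_comparisons` is a hypothetical instrumented version of the code above, not part of the lecture):
```python
import random

def quicksort_comparisons(A):
    """Run randomized quicksort on A and return the number of comparisons."""
    count = 0

    def qs(p, r):
        nonlocal count
        if p >= r:
            return
        ix = random.randint(p, r)            # random pivot
        A[r], A[ix] = A[ix], A[r]
        x, lo = A[r], p
        for i in range(p, r):
            count += 1                       # one comparison: A[i] < x
            if A[i] < x:
                A[lo], A[i] = A[i], A[lo]
                lo += 1
        A[lo], A[r] = A[r], A[lo]
        qs(p, lo - 1)
        qs(lo + 1, r)

    qs(0, len(A) - 1)
    return count

n = 10_000
H_n = sum(1 / k for k in range(1, n + 1))    # harmonic number
runs = [quicksort_comparisons(random.sample(range(n), n)) for _ in range(20)]
print(sum(runs) / len(runs), 2 * n * H_n)    # average stays below 2*n*H_n
```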