# CSE347 Analysis of Algorithms (Lecture 9)

## Randomized Algorithms

### Hashing

Hashing with chaining:

Input: We have integers from a universe $U=\{0,1,\cdots,N-1\}$. We want to map them to a hash table $T$ with $m$ slots.

Hash function: $h:U\rightarrow [m]$

Goal: Hash a set $S\subseteq U$, $|S|=n$, into $T$ such that the number of elements in each slot is at most $1$.

#### Collisions

When multiple keys map to the same slot, we call it a collision; we keep a linked list (a chain) of all the keys that map to the same slot.

**Runtime** of insert, query, delete of an element $=\Theta(\textup{length of the chain})$

**Worst-case** runtime of insert, query, delete $=\Theta(n)$

Therefore, we want chains to be short, $\Theta(1)$, as long as $|S|$ is reasonably sized; equivalently, we want the elements of any set $S$ to hash **uniformly** across all slots.

#### Simple Uniform Hashing Assumptions

Assume the $n$ elements we want to hash (the set $S$) are picked uniformly at random from $U$. Under this assumption, the simple hash function
$$
h(x)=x\bmod m
$$
works fine.

Question: What happens if an adversary knows this function and designs $S$ to force the worst-case runtime?

Answer: The adversary can make the runtime of each operation $\Theta(n)$ by choosing elements that all hash to the same slot (e.g., all multiples of $m$).

#### Randomization to the rescue

We don't want the adversary to be able to predict the hash function just by looking at the code.

Idea: Randomize the choice of the hash function.

### Randomized Algorithms

#### Definition

A randomized algorithm is an algorithm that makes internal random choices.

2 kinds of randomized algorithms:

1. Las Vegas: The runtime is random, but the output is always correct.
2. Monte Carlo: The runtime is fixed, but the output is sometimes incorrect.

We will focus on Las Vegas algorithms in this course. For a Las Vegas algorithm, we analyze the expected runtime, e.g.
$$E[T(n)]=O(n)$$
or some other probabilistic quantity.

#### Randomization can help

Idea: Randomize the choice of the hash function $h$ from a family of hash functions $H$. If we randomly pick a hash function from this family, then the probability that the hash function is bad on **any particular** set $S$ is small. Intuitively, the adversary cannot pick a bad input, since most hash functions in the family are good for any particular input $S$.

#### Universal Hashing: Goal

We want to design a universal family of hash functions $H$ such that the probability that the hash table behaves badly on any input $S$ is small.

#### Universal Hashing: Definition

Suppose we have $m$ buckets in the hash table, and two inputs $x\neq y$ with $x,y\in U$. We want $x$ and $y$ to be unlikely to hash to the same bucket.

$H$ is a universal **family** of hash functions if for any two elements $x\neq y$,
$$
Pr_{h\in H}[h(x)=h(y)]\leq\frac{1}{m}
$$
where $h$ is picked uniformly at random from the family $H$.

#### Universal Hashing: Analysis

Claim: If we choose $h$ randomly from a universal family of hash functions $H$, then the hash table exhibits good behavior on any set $S$ of size $n$ with high probability.

Question: What are the good properties, and what does "with high probability" mean?

Claim: Given a universal family of hash functions $H$ and $S=\{a_1,a_2,\cdots,a_n\}\subset U$: for any probability $0\leq \delta\leq 1$, if $n\leq \sqrt{2m\delta}$, the chance that no two keys hash to the same slot is $\geq 1-\delta$.

Example: If we pick $\delta=\frac{1}{2}$, then as long as $n\leq\sqrt{m}$, the chance that no two keys hash to the same slot is $\geq\frac{1}{2}$. If we pick $\delta=\frac{1}{3}$, then as long as $n\leq\sqrt{\frac{2}{3}m}$, the chance that no two keys hash to the same slot is $\geq\frac{2}{3}$.
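To make the chaining setup concrete, here is a minimal Python sketch (the class name `ChainedHashTable` and the example keys are illustrative, not from the lecture). Each of insert, search, and delete scans a single chain, so its cost is $\Theta(\textup{length of the chain})$; the usage example also shows how an adversary defeats $h(x)=x\bmod m$ by picking keys that are congruent mod $m$.

```python
class ChainedHashTable:
    """Hash table with chaining: slot i holds a list of all keys x with h(x) = i."""

    def __init__(self, m, h):
        self.m = m                           # number of slots
        self.h = h                           # hash function h: U -> [m]
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def insert(self, x):
        chain = self.slots[self.h(x)]        # cost: Theta(len(chain))
        if x not in chain:
            chain.append(x)

    def search(self, x):
        return x in self.slots[self.h(x)]    # cost: Theta(len(chain))

    def delete(self, x):
        chain = self.slots[self.h(x)]
        if x in chain:
            chain.remove(x)                  # cost: Theta(len(chain))

# Example: the simple (non-random) hash function h(x) = x mod m.
m = 8
T = ChainedHashTable(m, lambda x: x % m)
for key in [3, 11, 19, 5]:                   # 3, 11, 19 all land in slot 3
    T.insert(key)
print(T.search(11), T.slots[3])              # True [3, 11, 19]
```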
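A standard construction of a universal family (due to Carter and Wegman) is $h_{a,b}(x)=((ax+b)\bmod p)\bmod m$, where $p\geq N$ is prime, $a\in\{1,\cdots,p-1\}$, and $b\in\{0,\cdots,p-1\}$. The sketch below draws $h$ at random from this family and empirically checks the no-collision claim above; the specific parameters ($p$, $m$, $\delta$, the number of trials) are illustrative choices, not from the lecture.

```python
import random

class UniversalHashFamily:
    """Carter-Wegman family: h_{a,b}(x) = ((a*x + b) mod p) mod m, p prime >= N."""

    def __init__(self, p, m):
        self.p, self.m = p, m

    def draw(self):
        # Pick h uniformly at random from the family.
        a = random.randrange(1, self.p)   # a in {1, ..., p-1}
        b = random.randrange(0, self.p)   # b in {0, ..., p-1}
        return lambda x: ((a * x + b) % self.p) % self.m

# Empirical check of the claim: with m slots and n <= sqrt(2*m*delta) keys,
# Pr[no two keys collide] >= 1 - delta.
p = 10**9 + 7                       # prime larger than the universe we sample from
m = 1000                            # number of slots
delta = 0.5
n = int((2 * m * delta) ** 0.5)     # n = sqrt(2*m*delta), about 31 keys
family = UniversalHashFamily(p, m)

S = random.sample(range(p), n)      # any fixed set S works; the randomness is in h
trials, no_collision = 10000, 0
for _ in range(trials):
    h = family.draw()
    slots = [h(x) for x in S]
    if len(set(slots)) == n:        # all keys landed in distinct slots
        no_collision += 1

print(f"empirical Pr[no collision] = {no_collision / trials:.3f} (claim: >= {1 - delta})")
```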
Proof Strategy:

1. Compute the **expected number** of collisions. Note that a collision occurs when two different keys hash to the same slot. (Indicator random variables)
2. Apply a "tail" bound that converts the expected value into a probability. (Markov's inequality)

##### Compute the expected number of collisions

Let $m$ be the size of the hash table, $n$ the number of keys in the set $S$, and $N$ the size of the universe.

For inputs $x,y\in S$, $x\neq y$, we define a random variable
$$
C_{xy}=
\begin{cases}
1 & \text{if } h(x)=h(y) \\
0 & \text{otherwise}
\end{cases}
$$
$C_{xy}$ is called an indicator random variable: it takes value $0$ or $1$. By universality,
$$
E[C_{xy}]=1\times Pr[C_{xy}=1]+0\times Pr[C_{xy}=0]=Pr[C_{xy}=1]\leq\frac{1}{m}
$$

Define $C_x=\sum_{y\in S,y\neq x}C_{xy}$: the number of elements of $S$ that collide with $x$ (= the number of elements $y\neq x$ such that $h(x)=h(y)$). The cost of inserting/searching/deleting $x$ in the hash table is at most $C_x+O(1)$, the length of $x$'s chain.

By linearity of expectation,
$$
E[C_x]=\sum_{y\in S,y\neq x}E[C_{xy}]\leq\sum_{y\in S,y\neq x}\frac{1}{m}=\frac{n-1}{m}
$$
So $E[C_x]=O(1)$ if $n=O(m)$, and the total expected cost of $k$ insert/search/delete operations is $O(k)$, by linearity of expectation.

Say $C$ is the total number of collisions (colliding pairs). $C=\frac{\sum_{x\in S}C_x}{2}$, because each colliding pair is counted twice.
$$
E[C]=\frac{1}{2}\sum_{x\in S}E[C_x]\leq\frac{1}{2}\sum_{x\in S}\frac{n-1}{m}=\frac{n(n-1)}{2m}
$$
If we want $E[C]\leq \delta$, it suffices to take $n\leq\sqrt{2m\delta}$, since then $E[C]\leq\frac{n^2}{2m}\leq\delta$.

#### The probability of no collisions ($C=0$)

We know that the expected number of collisions is now $\leq \delta$, but what about the probability of **NO** collisions?

> Markov's inequality: For a non-negative random variable $X$ and any $k>0$,
> $$Pr[X\geq k]\leq\frac{E[X]}{k}$$
> Equivalently, $Pr[X\geq k\cdot E[X]]\leq \frac{1}{k}$.

Apply this to $C$ with $k=\frac{1}{\delta}$:
$$
Pr\left[C\geq \frac{1}{\delta}E[C]\right]\leq\delta
$$
Since $E[C]\leq\delta$, we have $\frac{1}{\delta}E[C]\leq 1$, so $Pr[C\geq 1]\leq Pr[C\geq \frac{1}{\delta}E[C]]\leq\delta$. Since $C$ takes integer values, $Pr[C=0]\geq 1-\delta$: if $n\leq\sqrt{2m\delta}$, then with probability $\geq 1-\delta$ you will have no collisions.

#### More general conclusion

Claim: For a universal hash function family $H$, if the number of keys is $n\leq \sqrt{Bm\delta}$, then with probability $> 1-\delta$ at most $B+1$ keys hash to any single slot.

### Example: Quicksort

Quicksort is based on partitioning [assume all elements are distinct]:

Partition($A[p\cdots r]$): rearranges $A[p\cdots r]$ into $A[p\cdots q-1],A[q],A[q+1\cdots r]$, where every element of $A[p\cdots q-1]$ is less than the pivot $A[q]$ and every element of $A[q+1\cdots r]$ is greater.

Runtime: $O(r-p)$, linear time.

```python
def partition(A, p, r):
    x = A[r]                      # pivot
    lo = p                        # boundary of the "< pivot" region
    for i in range(p, r):
        if A[i] < x:              # move small elements left of the boundary
            A[i], A[lo] = A[lo], A[i]
            lo += 1
    A[lo], A[r] = A[r], A[lo]     # place the pivot in its final position
    return lo
```

Randomized quicksort chooses the pivot uniformly at random from $A[p\cdots r]$ (swap it into position $r$, then run Partition).

Claim: The expected runtime of randomized quicksort is $E[T(n)]=O(n\log n)$.

Since each rank is equally likely to be chosen as the pivot, the expected runtime satisfies
$$
E[T(n)]=\frac{1}{n}\sum_{q=0}^{n-1}\big(E[T(q)]+E[T(n-1-q)]\big)+cn=\frac{2}{n}\sum_{k=1}^{n-1}E[T(k)]+cn
$$

Proof: We prove by induction that $E[T(n)]\leq c'n\log n+c$ for a suitable constant $c'$.

Base case: $n=1$: $E[T(1)]=c\leq c'\cdot 1\cdot\log 1+c=c$.

Inductive step: Assume that $E[T(k)]\leq c'k\log k+c$ for all $k<n$. Then
$$
E[T(n)]\leq\frac{2}{n}\sum_{k=1}^{n-1}\big(c'k\log k+c\big)+cn\leq\frac{2c'}{n}\sum_{k=1}^{n-1}k\log k+cn+2c
$$
Splitting the sum at $n/2$ (terms with $k<n/2$ satisfy $\log k\leq\log n-1$) gives $\sum_{k=1}^{n-1}k\log k\leq\frac{1}{2}n^2\log n-\frac{1}{8}n^2$, so
$$
E[T(n)]\leq c'n\log n-\frac{c'n}{4}+cn+2c\leq c'n\log n+c
$$
as long as $c'\geq 8c$. $\blacksquare$
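A minimal sketch of randomized quicksort in Python, reusing the `partition` function from the code block above (the names `randomized_partition` and `randomized_quicksort` are my own, not from the lecture; the block assumes `partition` is in scope):

```python
import random

def randomized_partition(A, p, r):
    # Pick the pivot uniformly at random from A[p..r] and swap it
    # into position r, so the deterministic partition can be reused.
    i = random.randint(p, r)
    A[i], A[r] = A[r], A[i]
    return partition(A, p, r)    # partition(A, p, r) as defined above

def randomized_quicksort(A, p, r):
    # Sorts A[p..r] in place; expected O(n log n) runtime.
    if p < r:
        q = randomized_partition(A, p, r)
        randomized_quicksort(A, p, q - 1)
        randomized_quicksort(A, q + 1, r)

# Example usage:
A = [5, 2, 9, 1, 7, 3]
randomized_quicksort(A, 0, len(A) - 1)
print(A)   # [1, 2, 3, 5, 7, 9]
```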
#### A more elegant proof

Proof: Let $z_i$ denote the element of rank $i$ (the $i$-th smallest), and let $X_{ij}$ be an indicator random variable that is $1$ if the element of rank $i$ is ever compared to the element of rank $j$. The running time is dominated by the total number of comparisons
$$X=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}X_{ij}$$
(each pair is compared at most once, since elements are only ever compared to the current pivot). For each pair,
$$
E[X_{ij}]=Pr[X_{ij}=1]\times 1+Pr[X_{ij}=0]\times 0=Pr[X_{ij}=1]
$$
so by linearity of expectation, the expected number of comparisons in randomized quicksort is
$$
\begin{aligned}
E[X]&=E\left[\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}X_{ij}\right]\\
&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}E[X_{ij}]\\
&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}Pr[X_{ij}=1]
\end{aligned}
$$
For any two elements $z_i,z_j\in S$, $z_i$ is compared to $z_j$ if and only if, among the $j-i+1$ elements $z_i,z_{i+1},\cdots,z_j$, either $z_i$ or $z_j$ is the first one picked as a pivot (otherwise some element of rank strictly between $i$ and $j$ is picked first and separates $z_i$ from $z_j$). Since each of these $j-i+1$ elements is equally likely to be picked first,
$$
\begin{aligned}
Pr[X_{ij}=1]&=Pr[z_i\text{ is picked first}]+Pr[z_j\text{ is picked first}]\\
&=\frac{1}{j-i+1}+\frac{1}{j-i+1}\\
&=\frac{2}{j-i+1}
\end{aligned}
$$
So, with the harmonic number $H_n=\sum_{k=1}^{n}\frac{1}{k}=\Theta(\log n)$,
$$
\begin{aligned}
E[X]&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}\frac{2}{j-i+1}\\
&\leq 2\sum_{i=0}^{n-2}\sum_{k=1}^{n-i-1}\frac{1}{k}\\
&\leq 2\sum_{i=0}^{n-2}c\log n\\
&=2c(n-1)\log n\\
&=O(n\log n)
\end{aligned}
$$
(A matching lower bound also holds, so the expected number of comparisons, and hence the expected runtime, is $\Theta(n\log n)$.)
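The analysis can be sanity-checked empirically. The sketch below (the function name and the parameters $n$ and the trial count are my own illustrative choices) counts the comparisons performed by randomized quicksort and compares the average over many runs against the exact expectation $\sum_{i<j}\frac{2}{j-i+1}$ from the formula above.

```python
import random

def quicksort_comparisons(A):
    """Sort A in place with randomized quicksort; return the number of comparisons."""
    count = 0
    def sort(lo, hi):
        nonlocal count
        if lo >= hi:
            return
        i = random.randint(lo, hi)
        A[i], A[hi] = A[hi], A[i]        # random pivot into the last slot
        x, store = A[hi], lo
        for k in range(lo, hi):          # each iteration is one comparison
            count += 1
            if A[k] < x:
                A[k], A[store] = A[store], A[k]
                store += 1
        A[store], A[hi] = A[hi], A[store]
        sort(lo, store - 1)
        sort(store + 1, hi)
    sort(0, len(A) - 1)
    return count

n, trials = 1000, 200
avg = sum(quicksort_comparisons(random.sample(range(10 * n), n))  # distinct keys
          for _ in range(trials)) / trials
# Exact expectation from the analysis: sum over i < j of 2 / (j - i + 1).
exact = sum(2 / (j - i + 1) for i in range(n - 1) for j in range(i + 1, n))
print(f"average comparisons: {avg:.0f}, predicted E[X]: {exact:.0f}")
```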