CSE347 Analysis of Algorithms (Lecture 9)
Randomized Algorithms
Hashing
Hashing with chaining:
Input: We have integers from a universe U=\{0,1,\cdots,N-1\} of size N. We want to map them to a hash table T with m slots.
Hash function: h:U\rightarrow [m]
Goal: Hash a set S\subseteq U, |S|=n, into T such that each slot holds at most one element.
Collisions
When multiple keys map to the same slot, we call it a collision. To resolve collisions, we keep a linked list (a chain) of all the keys that map to the same slot.
Runtime of insert, query, delete of elements =\Theta(\textup{length of the chain})
Worst-case runtime of insert, query, delete of elements =\Theta(n)
Therefore, we want chains to be short, ideally \Theta(1) as long as |S| is reasonably sized. Equivalently, we want the elements of any set S to hash roughly uniformly across all slots.
Simple Uniform Hashing Assumptions
Assume the n elements we want to hash (the set S) are picked uniformly at random from U. Under this assumption, the following simple hash function works fine:
h(x)=x\mod m
Question: What happens if an adversary knows this function and designs S to make the worst-case runtime happen?
Answer: The adversary can make the runtime of each operation \Theta(n) by simply making all the elements hash to the same slot.
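For concreteness, here is a minimal sketch of this attack against the fixed function h(x)=x\mod m (the numbers are illustrative, not from the lecture): every key congruent to 0 modulo m lands in the same chain.

m = 8                                    # number of slots
h = lambda x: x % m                      # publicly known hash function

S = [m * i for i in range(10)]           # adversarial keys: 0, 8, 16, ...
table = [[] for _ in range(m)]           # chaining: one list per slot
for key in S:
    table[h(key)].append(key)

print([len(chain) for chain in table])   # [10, 0, 0, 0, 0, 0, 0, 0]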
Randomization to the rescue
We don't want the adversary to know the hash function based on just looking at the code.
Idea: Randomize the choice of the hash function.
Randomized Algorithm
Definition
A randomized algorithm is an algorithm that makes internal random choices.
2 kinds of randomized algorithms:
- Las Vegas: The runtime is random, but the output is always correct.
- Monte Carlo: The runtime is fixed, but the output is sometimes incorrect.
We will focus on Las Vegas algorithms in this course.
For Las Vegas algorithms, we bound the expected runtime E[T(n)] (or some other probabilistic quantity) rather than the worst case.
Randomization can help
Idea: Randomize the choice of hash function h from a family of hash functions, H.
If we randomly pick a hash function from this family, then the probability that the hash function is bad on any particular set S is small.
Intuitively, the adversary cannot pick a bad input, since most hash functions in the family are good for any particular input S.
Universal Hashing: Goal
We want to design a universal family of hash functions, H, such that the probability that the hash table behaves badly on any input S is small.
Universal Hashing: Definition
Suppose we have m buckets in the hash table. Consider any two inputs x,y\in U with x\neq y. We want x and y to be unlikely to hash to the same bucket.
H is a universal family of hash functions if for any two elements x\neq y,
Pr_{h\in H}[h(x)=h(y)]\leq\frac{1}{m}
where h is picked uniformly at random from the family H.
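The lecture leaves the family abstract, but one classic construction (due to Carter and Wegman) satisfies this definition: fix a prime p larger than every key in U, and take h_{a,b}(x)=((ax+b)\mod p)\mod m, with a drawn uniformly from \{1,\cdots,p-1\} and b from \{0,\cdots,p-1\}. A minimal Python sketch (the function name and the choice p=2^{31}-1 are illustrative):

import random

def make_universal_hash(m, p=2_147_483_647):
    # Sample h(x) = ((a*x + b) % p) % m from the Carter-Wegman family.
    # p = 2^31 - 1 is prime, so this works for keys below 2^31 - 1.
    a = random.randint(1, p - 1)   # a != 0, otherwise h is constant
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

h = make_universal_hash(m=100)
print(h(42), h(43))                # two slot indices in [0, 100)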
Universal Hashing: Analysis
Claim: If we choose h randomly from a universal family of hash functions, H, then the hash table will exhibit good behavior on any set S of size n with high probability.
Question: What are these good properties, and what does "with high probability" mean?
Claim: Let H be a universal family of hash functions and S=\{a_1,a_2,\cdots,a_n\}\subseteq U. For any 0<\delta\leq 1, if n\leq \sqrt{2m\delta}, then the chance that no two keys hash to the same slot is \geq 1-\delta.
Example: If we pick \delta=\frac{1}{2}, then as long as n\leq\sqrt{m}, the chance that no two keys hash to the same slot is \geq\frac{1}{2}.
If we pick \delta=\frac{1}{3}, then as long as n\leq\sqrt{\frac{2}{3}m}, the chance that no two keys hash to the same slot is \geq\frac{2}{3}.
Proof Strategy:
- Compute the expected number of collisions, where a collision occurs when two different keys hash to the same slot. (Indicator random variables)
- Apply a "tail" bound that converts the expected value to probability. (Markov's inequality)
Compute the expected number of collisions
Let m be the size of the hash table. n is the number of keys in the set S. N is the size of the universe.
For inputs x,y\in S,x\neq y, we define a random variable
C_{xy}=
\begin{cases}
1 & \text{if } h(x)=h(y) \\
0 & \text{otherwise}
\end{cases}
C_{xy} is called an indicator random variable, that takes value 0 or 1.
The expected value of each C_{xy} is
E[C_{xy}]=1\times Pr[C_{xy}=1]+0\times Pr[C_{xy}=0]=Pr[C_{xy}=1]\leq\frac{1}{m}
by the definition of a universal family.
Define C_x: the random variable giving the cost of inserting/searching/deleting x in the hash table.
C_x is at most the total number of elements that collide with x (the number of y\in S, y\neq x, such that h(x)=h(y)):
C_x\leq\sum_{y\in S,y\neq x}C_{xy}
By linearity of expectation,
E[C_x]\leq\sum_{y\in S,y\neq x}E[C_{xy}]\leq\sum_{y\in S,y\neq x}\frac{1}{m}=\frac{n-1}{m}
So E[C_x]=O(1) if n=O(m): the expected cost of a single operation is constant, and the total expected cost of k insert/search/delete operations is O(k) by linearity of expectation.
Say C is the total number of collisions.
C=\frac{1}{2}\sum_{x\in S}\sum_{y\in S,y\neq x}C_{xy}, because each colliding pair \{x,y\} is counted twice in the double sum.
E[C]=\frac{1}{2}\sum_{x\in S}\sum_{y\in S,y\neq x}E[C_{xy}]\leq\frac{1}{2}\sum_{x\in S}\frac{n-1}{m}=\frac{n(n-1)}{2m}
If we want E[C]\leq\delta, it suffices that n\leq\sqrt{2m\delta}, since then \frac{n(n-1)}{2m}<\frac{n^2}{2m}\leq\delta.
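As a sanity check (not from the lecture; the parameters and helper names are illustrative), we can estimate E[C] empirically by repeatedly drawing h from the Carter-Wegman family sketched earlier and counting pairwise collisions; the empirical average should sit at or below \frac{n(n-1)}{2m}.

import random
from itertools import combinations

P = 2_147_483_647                        # prime > every key used below

def sample_hash(m):
    # One draw from the Carter-Wegman family sketched earlier.
    a, b = random.randint(1, P - 1), random.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % m

def count_collisions(S, h):
    # Number of pairs x != y in S with h(x) == h(y).
    return sum(1 for x, y in combinations(S, 2) if h(x) == h(y))

m, n, trials = 1000, 30, 2000
S = random.sample(range(10**6), n)       # any fixed set of n keys
avg = sum(count_collisions(S, sample_hash(m)) for _ in range(trials)) / trials
print("empirical E[C]   :", avg)
print("bound n(n-1)/(2m):", n * (n - 1) / (2 * m))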
The probability of no collisions C=0
We know that the expected value of number of collisions is now \leq \delta, but what about the probability of NO collisions?
Markov's inequality: For a non-negative random variable X and any k>0,
Pr[X\geq k]\leq\frac{E[X]}{k},\quad\text{or equivalently}\quad Pr[X\geq k\cdot E[X]]\leq\frac{1}{k}.
Apply this to C with k=\frac{1}{\delta}:
Pr[C\geq 1]\leq Pr\left[C\geq\tfrac{1}{\delta}E[C]\right]\leq\delta,
where the first inequality holds because \frac{1}{\delta}E[C]\leq 1.
So if n\leq\sqrt{2m\delta}, then Pr[C=0]\geq 1-\delta: with probability at least 1-\delta, there are no collisions.
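To see the whole chain concretely for \delta=\frac{1}{2} (this is the earlier example worked end to end): if n\leq\sqrt{m}, then
E[C]\leq\frac{n(n-1)}{2m}<\frac{n^2}{2m}\leq\frac{1}{2},\quad Pr[C\geq 1]\leq E[C]\leq\frac{1}{2},\quad Pr[C=0]\geq\frac{1}{2}.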
More general conclusion
Claim: For a universal hash function family H, if the number of keys satisfies n\leq\sqrt{Bm\delta}, then with probability >1-\delta, at most B+1 keys hash to any single slot.
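The lecture does not prove this claim here; a sketch in the same spirit as the argument above (my reconstruction, not the official proof): if n\leq\sqrt{Bm\delta}, then E[C]\leq\frac{n^2}{2m}\leq\frac{B\delta}{2}. A slot holding at least B+2 keys would contain at least \binom{B+2}{2}=\frac{(B+1)(B+2)}{2} colliding pairs, so by Markov's inequality
Pr\left[\text{some slot has }\geq B+2\text{ keys}\right]\leq Pr\left[C\geq\tfrac{(B+1)(B+2)}{2}\right]\leq\frac{B\delta/2}{(B+1)(B+2)/2}=\frac{B\delta}{(B+1)(B+2)}<\delta.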
Example: Quicksort
Based on partitioning (assume all elements are distinct): Partition(A[p\cdots r])
- Rearranges A into A[p\cdots q-1], A[q], A[q+1\cdots r], where every element of A[p\cdots q-1] is less than the pivot A[q] and every element of A[q+1\cdots r] is at least the pivot.
Runtime: O(r-p), linear time.
def partition(A, p, r):
    # Lomuto partition with pivot x = A[r].  Invariant: at the top of
    # each iteration, A[p..lo-1] < x and A[lo..i-1] >= x.
    x = A[r]
    lo = p
    for i in range(p, r):
        if A[i] < x:
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]   # place the pivot in its final position
    return lo
def quicksort(A, p, r):
    # Sort A[p..r] in place.
    if p < r:
        q = partition(A, p, r)
        quicksort(A, p, q - 1)
        quicksort(A, q + 1, r)
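A quick usage check of the two functions above (the sample values are illustrative):

A = [9, 3, 7, 1, 8, 2, 5]
quicksort(A, 0, len(A) - 1)   # sorts A in place
print(A)                      # [1, 2, 3, 5, 7, 8, 9]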
Runtime analysis
Let the number of elements in A_{\text{low}} (the left part A[p\cdots q-1]) be k.
T(n)=\Theta(n)+T(k)+T(n-k-1)
If we assume an even split, k=\frac{n}{2}:
T(n)=T\left(\frac{n}{2}\right)+T\left(\frac{n}{2}-1\right)+\Theta(n)=\Theta(n\log n),
which is the same runtime as merge sort.
But average-case analysis is always suspicious: nothing forces real inputs to produce even splits.
Randomized Quicksort
- Pick a random pivot element.
- Analyze the expected runtime over the random choices of pivot.
import random

def randomized_partition(A, p, r):
    # Pick the pivot uniformly at random, swap it into A[r], then
    # partition exactly as before.
    ix = random.randint(p, r)
    x = A[ix]
    A[r], A[ix] = A[ix], A[r]
    lo = p
    for i in range(p, r):
        if A[i] < x:
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]
    return lo

def randomized_quicksort(A, p, r):
    # Same as quicksort, but with a random pivot in each call.
    if p < r:
        q = randomized_partition(A, p, r)
        randomized_quicksort(A, p, q - 1)
        randomized_quicksort(A, q + 1, r)
E[T(n)]=E[T(k)+T(n-k-1)+cn]=E[T(k)]+E[T(n-k-1)]+cn
by linearity of expectation, where k is the (random) rank of the pivot.
Pr[\textup{pivot has rank }k]=\frac{1}{n}
So,
\begin{aligned}
E[T(n)]&=cn+\frac{1}{n}\sum_{k=0}^{n-1}\left(E[T(k)]+E[T(n-k-1)]\right)\\
&=cn+\frac{1}{n}\sum_{k=0}^{n-1}E[T(k)]+\frac{1}{n}\sum_{j=0}^{n-1}E[T(j)]\qquad(\text{substituting }j=n-k-1)\\
&=cn+\frac{2}{n}\sum_{k=0}^{n-1}E[T(k)]
\end{aligned}
Claim: the solution to this recurrence is E[T(n)]=O(n\log n); concretely, E[T(n)]\leq c'n\log n+1 for a suitable constant c'.
Proof
We prove by induction.
Base case: n=1: E[T(1)] is a constant; assume (rescaling if necessary) that this constant is at most 1, so E[T(1)]\leq c'\cdot 1\cdot\log 1+1=1.
Inductive step: Assume that E[T(k)]\leq c'k\log k+1 for all k<n.
Then,
\begin{aligned}
E[T(n)]&=cn+\frac{2}{n}\sum_{k=0}^{n-1}E[T(k)]\\
&\leq cn+\frac{2}{n}\sum_{k=0}^{n-1}(c'k\log k+1)\\
&=cn+\frac{2c'}{n}\sum_{k=0}^{n-1}k\log k+\frac{2}{n}\sum_{k=0}^{n-1}1
\end{aligned}
Then we use the fact that \sum_{k=0}^{n-1}k\log k\leq \frac{n^2\log n}{2}-\frac{n^2}{8} (can be proved by induction).
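One standard way to see this fact (a sketch, with \log base 2, valid for n\geq 2): split the sum at \lceil n/2\rceil, and bound \log k by \log n-1 on the low half and by \log n on the high half:
\begin{aligned}
\sum_{k=1}^{n-1}k\log k&=\sum_{k=1}^{\lceil n/2\rceil-1}k\log k+\sum_{k=\lceil n/2\rceil}^{n-1}k\log k\\
&\leq(\log n-1)\sum_{k=1}^{\lceil n/2\rceil-1}k+\log n\sum_{k=\lceil n/2\rceil}^{n-1}k\\
&=\log n\sum_{k=1}^{n-1}k-\sum_{k=1}^{\lceil n/2\rceil-1}k\\
&\leq\frac{n^2\log n}{2}-\frac{n^2}{8}
\end{aligned}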
Substituting this bound:
\begin{aligned}
E[T(n)]&\leq cn+\frac{2c'}{n}\left(\frac{n^2\log n}{2}-\frac{n^2}{8}\right)+\frac{2}{n}\cdot n\\
&=c'n\log n-\frac{1}{4}c'n+cn+2\\
&=(c'n\log n+1)-\left(\frac{1}{4}c'n-cn-1\right)
\end{aligned}
It remains to show that \frac{1}{4}c'n-cn-1\geq 0 for all n\geq 2.
Choosing c'\geq 4c+2 suffices: then \frac{1}{4}c'n\geq cn+\frac{n}{2}\geq cn+1 for all n\geq 2.
With such a c', E[T(n)]\leq c'n\log n+1, completing the induction.
E[T(n)]\leq c'n\log n+1=O(n\log n)
A more elegant proof:
Proof
Let X_{ij} be an indicator random variable that is 1 if the element of rank i is ever compared to the element of rank j during the algorithm.
Total number of comparisons: X=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}X_{ij}
The expectation of each indicator is
E[X_{ij}]=Pr[X_{ij}=1]\times 1+Pr[X_{ij}=0]\times 0=Pr[X_{ij}=1]
The running time of randomized quicksort is dominated by the number of comparisons, so it suffices to bound the expected number of comparisons:
\begin{aligned}
E[X]&=E[\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}X_{ij}]\\
&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}E[X_{ij}]\\
&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}Pr[X_{ij}=1]
\end{aligned}
For any two elements z_i,z_j\in S with i<j, z_i is compared to z_j if and only if either z_i or z_j is the first element among \{z_i,z_{i+1},\cdots,z_j\} to be picked as a pivot (otherwise some pivot of rank strictly between i and j separates them forever). Each of the j-i+1 candidates is equally likely to be first, so
\begin{aligned}
Pr[X_{ij}=1]&=Pr[z_i\text{ is picked first}]+Pr[z_j\text{ is picked first}]\\
&=\frac{1}{j-i+1}+\frac{1}{j-i+1}\\
&=\frac{2}{j-i+1}
\end{aligned}
So, using the harmonic number H_n=\sum_{k=1}^{n}\frac{1}{k}=O(\log n),
\begin{aligned}
E[X]&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}\frac{2}{j-i+1}\\
&\leq 2\sum_{i=0}^{n-2}\sum_{k=1}^{n-i-1}\frac{1}{k}\\
&\leq 2\sum_{i=0}^{n-2}c\log(n)\\
&=2c\log(n)\sum_{i=0}^{n-2}1\\
&=O(n\log n)
\end{aligned}
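As an empirical check of this bound (a sketch, not from the lecture; the function name and parameters are illustrative), we can instrument randomized quicksort to count comparisons and compare the average against 2nH_n:

import math
import random

def rqsort_count(A, p, r):
    # Randomized quicksort, instrumented to return its comparison count.
    if p >= r:
        return 0
    ix = random.randint(p, r)
    A[r], A[ix] = A[ix], A[r]
    x, lo, comps = A[r], p, 0
    for i in range(p, r):
        comps += 1                       # one comparison per iteration
        if A[i] < x:
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]
    return comps + rqsort_count(A, p, lo - 1) + rqsort_count(A, lo + 1, r)

n, trials = 500, 200
avg = sum(rqsort_count(list(range(n)), 0, n - 1) for _ in range(trials)) / trials
print("average comparisons:", avg)
print("bound 2n*H_n       :", 2 * n * (math.log(n) + 0.5772))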