# CSE347 Analysis of Algorithms (Lecture 9)

## Randomized Algorithms

### Hashing

Hashing with chaining:

Input: We have integers from a universe $U=\{0,1,\cdots,N-1\}$. We want to map them to a hash table $T$ with $m$ slots.

Hash function: $h:U\rightarrow [m]$

Goal: Hash a set $S\subseteq U$, $|S|=n$, into $T$ such that the number of elements in each slot is at most $1$.

#### Collisions

When multiple keys map to the same slot, we call it a collision; we keep a linked list (a chain) of all the keys that map to the same slot.

**Runtime** of insert, query, delete of an element $=\Theta(\textup{length of the chain})$

**Worst-case** runtime of insert, query, delete $=\Theta(n)$

Therefore, we want chains to be short, $\Theta(1)$, as long as $|S|$ is reasonably sized; equivalently, we want the elements of any set $S$ to hash **uniformly** across all slots.

#### Simple Uniform Hashing Assumptions

Assume the $n$ elements we want to hash (the set $S$) are picked uniformly at random from $U$. Under this assumption, the simple hash function
$$
h(x)=x\bmod m
$$
works fine.

Question: What happens if an adversary knows this function and designs $S$ to force the worst-case runtime?

Answer: The adversary can make the runtime of each operation $\Theta(n)$ by choosing elements that all hash to the same slot (e.g., all multiples of $m$).

#### Randomization to the rescue

We don't want the adversary to be able to predict the hash function just by looking at the code.

Idea: Randomize the choice of the hash function.

### Randomized Algorithms

#### Definition

A randomized algorithm is an algorithm that makes internal random choices.

2 kinds of randomized algorithms:

1. Las Vegas: The runtime is random, but the output is always correct.
2. Monte Carlo: The runtime is fixed, but the output is sometimes incorrect.

We will focus on Las Vegas algorithms in this course. For a Las Vegas algorithm, we analyze the expected runtime, e.g.
$$E[T(n)]=O(n)$$
or some other probabilistic quantity.

#### Randomization can help

Idea: Randomize the choice of the hash function $h$ from a family of hash functions $H$. If we randomly pick a hash function from this family, then the probability that the hash function is bad on **any particular** set $S$ is small. Intuitively, the adversary cannot pick a bad input, since most hash functions in the family are good for any particular input $S$.

#### Universal Hashing: Goal

We want to design a universal family of hash functions $H$ such that the probability that the hash table behaves badly on any input $S$ is small.

#### Universal Hashing: Definition

Suppose we have $m$ buckets in the hash table, and two inputs $x\neq y$ with $x,y\in U$. We want $x$ and $y$ to be unlikely to hash to the same bucket.

$H$ is a universal **family** of hash functions if for any two elements $x\neq y$,
$$
Pr_{h\in H}[h(x)=h(y)]\leq\frac{1}{m}
$$
where $h$ is picked uniformly at random from the family $H$.

#### Universal Hashing: Analysis

Claim: If we choose $h$ randomly from a universal family of hash functions $H$, then the hash table exhibits good behavior on any set $S$ of size $n$ with high probability.

Question: What are the good properties, and what does "with high probability" mean?

Claim: Given a universal family of hash functions $H$ and $S=\{a_1,a_2,\cdots,a_n\}\subset U$: for any probability $0\leq \delta\leq 1$, if $n\leq \sqrt{2m\delta}$, the chance that no two keys hash to the same slot is $\geq 1-\delta$.

Example: If we pick $\delta=\frac{1}{2}$, then as long as $n\leq\sqrt{m}$, the chance that no two keys hash to the same slot is $\geq\frac{1}{2}$. If we pick $\delta=\frac{1}{3}$, then as long as $n\leq\sqrt{\frac{2}{3}m}$, the chance that no two keys hash to the same slot is $\geq\frac{2}{3}$.
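To make the chaining setup concrete, here is a minimal Python sketch (the class name `ChainedHashTable` and the example keys are illustrative, not from the lecture). Each of insert, search, and delete scans a single chain, so its cost is $\Theta(\textup{length of the chain})$; the usage example also shows how an adversary defeats $h(x)=x\bmod m$ by picking keys that are congruent mod $m$.

```python
class ChainedHashTable:
    """Hash table with chaining: slot i holds a list of all keys x with h(x) = i."""

    def __init__(self, m, h):
        self.m = m                           # number of slots
        self.h = h                           # hash function h: U -> [m]
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def insert(self, x):
        chain = self.slots[self.h(x)]        # cost: Theta(len(chain))
        if x not in chain:
            chain.append(x)

    def search(self, x):
        return x in self.slots[self.h(x)]    # cost: Theta(len(chain))

    def delete(self, x):
        chain = self.slots[self.h(x)]
        if x in chain:
            chain.remove(x)                  # cost: Theta(len(chain))

# Example: the simple (non-random) hash function h(x) = x mod m.
m = 8
T = ChainedHashTable(m, lambda x: x % m)
for key in [3, 11, 19, 5]:                   # 3, 11, 19 all land in slot 3
    T.insert(key)
print(T.search(11), T.slots[3])              # True [3, 11, 19]
```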
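A standard construction of a universal family (due to Carter and Wegman) is $h_{a,b}(x)=((ax+b)\bmod p)\bmod m$, where $p\geq N$ is prime, $a\in\{1,\cdots,p-1\}$, and $b\in\{0,\cdots,p-1\}$. The sketch below draws $h$ at random from this family and empirically checks the no-collision claim above; the specific parameters ($p$, $m$, $\delta$, the number of trials) are illustrative choices, not from the lecture.

```python
import random

class UniversalHashFamily:
    """Carter-Wegman family: h_{a,b}(x) = ((a*x + b) mod p) mod m, p prime >= N."""

    def __init__(self, p, m):
        self.p, self.m = p, m

    def draw(self):
        # Pick h uniformly at random from the family.
        a = random.randrange(1, self.p)   # a in {1, ..., p-1}
        b = random.randrange(0, self.p)   # b in {0, ..., p-1}
        return lambda x: ((a * x + b) % self.p) % self.m

# Empirical check of the claim: with m slots and n <= sqrt(2*m*delta) keys,
# Pr[no two keys collide] >= 1 - delta.
p = 10**9 + 7                       # prime larger than the universe we sample from
m = 1000                            # number of slots
delta = 0.5
n = int((2 * m * delta) ** 0.5)     # n = sqrt(2*m*delta), about 31 keys
family = UniversalHashFamily(p, m)

S = random.sample(range(p), n)      # any fixed set S works; the randomness is in h
trials, no_collision = 10000, 0
for _ in range(trials):
    h = family.draw()
    slots = [h(x) for x in S]
    if len(set(slots)) == n:        # all keys landed in distinct slots
        no_collision += 1

print(f"empirical Pr[no collision] = {no_collision / trials:.3f} (claim: >= {1 - delta})")
```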
Proof Strategy:

1. Compute the **expected number** of collisions. Note that a collision occurs when two different keys hash to the same slot. (Indicator random variables)
2. Apply a "tail" bound that converts the expected value into a probability. (Markov's inequality)

##### Compute the expected number of collisions

Let $m$ be the size of the hash table, $n$ the number of keys in the set $S$, and $N$ the size of the universe.

For inputs $x,y\in S$, $x\neq y$, we define a random variable
$$
C_{xy}=
\begin{cases}
1 & \text{if } h(x)=h(y) \\
0 & \text{otherwise}
\end{cases}
$$
$C_{xy}$ is called an indicator random variable: it takes value $0$ or $1$. By universality,
$$
E[C_{xy}]=1\times Pr[C_{xy}=1]+0\times Pr[C_{xy}=0]=Pr[C_{xy}=1]\leq\frac{1}{m}
$$

Define $C_x=\sum_{y\in S,y\neq x}C_{xy}$: the number of elements of $S$ that collide with $x$ (= the number of elements $y\neq x$ such that $h(x)=h(y)$). The cost of inserting/searching/deleting $x$ in the hash table is at most $C_x+O(1)$, the length of $x$'s chain.

By linearity of expectation,
$$
E[C_x]=\sum_{y\in S,y\neq x}E[C_{xy}]\leq\sum_{y\in S,y\neq x}\frac{1}{m}=\frac{n-1}{m}
$$
So $E[C_x]=O(1)$ if $n=O(m)$, and the total expected cost of $k$ insert/search/delete operations is $O(k)$, by linearity of expectation.

Say $C$ is the total number of collisions (colliding pairs). $C=\frac{\sum_{x\in S}C_x}{2}$, because each colliding pair is counted twice.
$$
E[C]=\frac{1}{2}\sum_{x\in S}E[C_x]\leq\frac{1}{2}\sum_{x\in S}\frac{n-1}{m}=\frac{n(n-1)}{2m}
$$
If we want $E[C]\leq \delta$, it suffices to take $n\leq\sqrt{2m\delta}$, since then $E[C]\leq\frac{n^2}{2m}\leq\delta$.

#### The probability of no collisions ($C=0$)

We know that the expected number of collisions is now $\leq \delta$, but what about the probability of **NO** collisions?

> Markov's inequality: For a non-negative random variable $X$ and any $k>0$,
> $$Pr[X\geq k]\leq\frac{E[X]}{k}$$
> Equivalently, $Pr[X\geq k\cdot E[X]]\leq \frac{1}{k}$.

Apply this to $C$ with $k=\frac{1}{\delta}$:
$$
Pr\left[C\geq \frac{1}{\delta}E[C]\right]\leq\delta
$$
Since $E[C]\leq\delta$, we have $\frac{1}{\delta}E[C]\leq 1$, so $Pr[C\geq 1]\leq Pr[C\geq \frac{1}{\delta}E[C]]\leq\delta$. Since $C$ takes integer values, $Pr[C=0]\geq 1-\delta$: if $n\leq\sqrt{2m\delta}$, then with probability $\geq 1-\delta$ you will have no collisions.

#### More general conclusion

Claim: For a universal hash function family $H$, if the number of keys is $n\leq \sqrt{Bm\delta}$, then with probability $> 1-\delta$ at most $B+1$ keys hash to any single slot.

### Example: Quicksort

Quicksort is based on partitioning [assume all elements are distinct]:

Partition($A[p\cdots r]$): rearranges $A[p\cdots r]$ into $A[p\cdots q-1],A[q],A[q+1\cdots r]$, where every element of $A[p\cdots q-1]$ is less than the pivot $A[q]$ and every element of $A[q+1\cdots r]$ is greater.

Runtime: $O(r-p)$, linear time.

```python
def partition(A, p, r):
    x = A[r]                      # pivot
    lo = p                        # boundary of the "< pivot" region
    for i in range(p, r):
        if A[i] < x:              # move small elements left of the boundary
            A[i], A[lo] = A[lo], A[i]
            lo += 1
    A[lo], A[r] = A[r], A[lo]     # place the pivot in its final position
    return lo
```

Randomized quicksort chooses the pivot uniformly at random from $A[p\cdots r]$ (swap it into position $r$, then run Partition).

Claim: The expected runtime of randomized quicksort is $E[T(n)]=O(n\log n)$.

Since each rank is equally likely to be chosen as the pivot, the expected runtime satisfies
$$
E[T(n)]=\frac{1}{n}\sum_{q=0}^{n-1}\big(E[T(q)]+E[T(n-1-q)]\big)+cn=\frac{2}{n}\sum_{k=1}^{n-1}E[T(k)]+cn
$$

Proof: We prove by induction that $E[T(n)]\leq c'n\log n+c$ for a suitable constant $c'$.

Base case: $n=1$: $E[T(1)]=c\leq c'\cdot 1\cdot\log 1+c=c$.

Inductive step: Assume that $E[T(k)]\leq c'k\log k+c$ for all $k<n$. Then
$$
E[T(n)]\leq\frac{2}{n}\sum_{k=1}^{n-1}\big(c'k\log k+c\big)+cn\leq\frac{2c'}{n}\sum_{k=1}^{n-1}k\log k+cn+2c
$$
Splitting the sum at $n/2$ (terms with $k<n/2$ satisfy $\log k\leq\log n-1$) gives $\sum_{k=1}^{n-1}k\log k\leq\frac{1}{2}n^2\log n-\frac{1}{8}n^2$, so
$$
E[T(n)]\leq c'n\log n-\frac{c'n}{4}+cn+2c\leq c'n\log n+c
$$
as long as $c'\geq 8c$. $\blacksquare$
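A minimal sketch of randomized quicksort in Python, reusing the `partition` function from the code block above (the names `randomized_partition` and `randomized_quicksort` are my own, not from the lecture; the block assumes `partition` is in scope):

```python
import random

def randomized_partition(A, p, r):
    # Pick the pivot uniformly at random from A[p..r] and swap it
    # into position r, so the deterministic partition can be reused.
    i = random.randint(p, r)
    A[i], A[r] = A[r], A[i]
    return partition(A, p, r)    # partition(A, p, r) as defined above

def randomized_quicksort(A, p, r):
    # Sorts A[p..r] in place; expected O(n log n) runtime.
    if p < r:
        q = randomized_partition(A, p, r)
        randomized_quicksort(A, p, q - 1)
        randomized_quicksort(A, q + 1, r)

# Example usage:
A = [5, 2, 9, 1, 7, 3]
randomized_quicksort(A, 0, len(A) - 1)
print(A)   # [1, 2, 3, 5, 7, 9]
```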
#### A more elegant proof

Proof: Let $z_i$ denote the element of rank $i$ (the $i$-th smallest), and let $X_{ij}$ be an indicator random variable that is $1$ if the element of rank $i$ is ever compared to the element of rank $j$. The running time is dominated by the total number of comparisons
$$X=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}X_{ij}$$
(each pair is compared at most once, since elements are only ever compared to the current pivot). For each pair,
$$
E[X_{ij}]=Pr[X_{ij}=1]\times 1+Pr[X_{ij}=0]\times 0=Pr[X_{ij}=1]
$$
so by linearity of expectation, the expected number of comparisons in randomized quicksort is
$$
\begin{aligned}
E[X]&=E\left[\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}X_{ij}\right]\\
&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}E[X_{ij}]\\
&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}Pr[X_{ij}=1]
\end{aligned}
$$
For any two elements $z_i,z_j\in S$, $z_i$ is compared to $z_j$ if and only if, among the $j-i+1$ elements $z_i,z_{i+1},\cdots,z_j$, either $z_i$ or $z_j$ is the first one picked as a pivot (otherwise some element of rank strictly between $i$ and $j$ is picked first and separates $z_i$ from $z_j$). Since each of these $j-i+1$ elements is equally likely to be picked first,
$$
\begin{aligned}
Pr[X_{ij}=1]&=Pr[z_i\text{ is picked first}]+Pr[z_j\text{ is picked first}]\\
&=\frac{1}{j-i+1}+\frac{1}{j-i+1}\\
&=\frac{2}{j-i+1}
\end{aligned}
$$
So, with the harmonic number $H_n=\sum_{k=1}^{n}\frac{1}{k}=\Theta(\log n)$,
$$
\begin{aligned}
E[X]&=\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}\frac{2}{j-i+1}\\
&\leq 2\sum_{i=0}^{n-2}\sum_{k=1}^{n-i-1}\frac{1}{k}\\
&\leq 2\sum_{i=0}^{n-2}c\log n\\
&=2c(n-1)\log n\\
&=O(n\log n)
\end{aligned}
$$
(A matching lower bound also holds, so the expected number of comparisons, and hence the expected runtime, is $\Theta(n\log n)$.)
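The analysis can be sanity-checked empirically. The sketch below (the function name and the parameters $n$ and the trial count are my own illustrative choices) counts the comparisons performed by randomized quicksort and compares the average over many runs against the exact expectation $\sum_{i<j}\frac{2}{j-i+1}$ from the formula above.

```python
import random

def quicksort_comparisons(A):
    """Sort A in place with randomized quicksort; return the number of comparisons."""
    count = 0
    def sort(lo, hi):
        nonlocal count
        if lo >= hi:
            return
        i = random.randint(lo, hi)
        A[i], A[hi] = A[hi], A[i]        # random pivot into the last slot
        x, store = A[hi], lo
        for k in range(lo, hi):          # each iteration is one comparison
            count += 1
            if A[k] < x:
                A[k], A[store] = A[store], A[k]
                store += 1
        A[store], A[hi] = A[hi], A[store]
        sort(lo, store - 1)
        sort(store + 1, hi)
    sort(0, len(A) - 1)
    return count

n, trials = 1000, 200
avg = sum(quicksort_comparisons(random.sample(range(10 * n), n))  # distinct keys
          for _ in range(trials)) / trials
# Exact expectation from the analysis: sum over i < j of 2 / (j - i + 1).
exact = sum(2 / (j - i + 1) for i in range(n - 1) for j in range(i + 1, n))
print(f"average comparisons: {avg:.0f}, predicted E[X]: {exact:.0f}")
```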