Perfect hash functions
Fix a finite set of keys \(K\).
- A function \(h : K \to \{0,\dots,m-1\}\) is a perfect hash for \(K\) if \(h\) is injective on \(K\): no two distinct keys “collide” in the codomain, so each key gets its own bucket.
If also \(|K| = m\), then \(h\) is surjective as well, so \(h\) is a bijection from \(K\) onto the buckets.
Perfect hash ⇒ no collisions (for that particular key set).
BUT:
- It might be very hand-crafted and fragile.
- It might stop being perfect as soon as we add more keys.
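A minimal sketch of how fragile a perfect hash can be. It assumes the digit-sum hash on course codes used in the example that follows, and it assumes a starting set of four keys (the first four from that example). The helper names `digit_sum_hash` and `is_perfect` are illustrative, not from any library.

```python
def digit_sum_hash(key: str, m: int = 4) -> int:
    """Sum the decimal digits that appear in the key, then reduce modulo m."""
    return sum(int(ch) for ch in key if ch.isdigit()) % m


def is_perfect(h, keys, m: int) -> bool:
    """True if h sends every key into range(m) and no two keys share a value.

    Assumes the keys themselves are distinct.
    """
    values = [h(k) for k in keys]
    in_range = all(0 <= v < m for v in values)
    return in_range and len(set(values)) == len(values)


# Assumed starting key set: four course codes, four buckets.
small = ["MATH101", "DSCI220", "STAT201", "DSCI221"]

print(is_perfect(digit_sum_hash, small, 4))                # True
print(is_perfect(digit_sum_hash, small + ["DSCI100"], 4))  # False
```

The four keys hash to 2, 0, 3, 1, so the hash is perfect for them. Adding "DSCI100" (hash value 1) is enough to break it.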
What if we add more keys?
- Keys:
"MATH101", "DSCI220", "STAT201", "DSCI221", "DSCI100", "CPSC330", "STAT200"
- Buckets: \(\{0,1,2,3\}\) (4 buckets)
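- Hash: \(h(\text{"*}xyz\text{"}) = (x+y+z) \bmod 4\), where \(x, y, z\) are the three digits of the course number and \(*\) stands for the letter prefix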
Question:
- We now have 7 keys and still 4 buckets. Can any function from these 7 keys to \(\{0,1,2,3\}\) be injective?
- Why or why not?
Pigeonhole principle (PHP)
Formal version in our language:
- If \(|X| > |Y|\), then no function \(f : X \to Y\) can be injective.
Apply to hashing:
- Keys = pigeons
- Buckets = pigeonholes
If we have more keys than buckets, that is, if
\[|\text{Keys}| > m = |\{0,\dots,m-1\}|,\]
then no hash function \(h : \text{Keys} \to \{0,\dots,m-1\}\) can be one-to-one on that key set.
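The pigeonhole claim can be checked by brute force on small cases. The sketch below enumerates every function from a set of `num_keys` keys into `m` buckets and counts the injective ones. The function name `count_injective` and the sizes tried are illustrative choices.

```python
from itertools import product


def count_injective(num_keys: int, m: int) -> int:
    """Count the injective functions from a num_keys-element set into range(m)."""
    count = 0
    # Each tuple gives one bucket per key, so product() walks all m**num_keys functions.
    for assignment in product(range(m), repeat=num_keys):
        if len(set(assignment)) == num_keys:  # all chosen buckets are distinct
            count += 1
    return count


print(count_injective(3, 4))  # 24: fewer keys than buckets, injections exist
print(count_injective(5, 4))  # 0: more keys than buckets, none of the 1024 functions is injective
```

This is only a sanity check for tiny sizes. The principle itself covers every function at once, with no enumeration needed.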
Collisions
A collision happens when two different keys share the same hash value:
\[
k_1 \ne k_2, \quad h(k_1) = h(k_2).
\]
By the pigeonhole principle:
- As soon as we have more keys than buckets, collisions are guaranteed (for every hash function).
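To see the collisions concretely, here is a short sketch that applies the digit-sum hash from the course-code example to all 7 keys and groups them by bucket. The helper name `digit_sum_hash` is illustrative.

```python
from collections import defaultdict

keys = ["MATH101", "DSCI220", "STAT201", "DSCI221",
        "DSCI100", "CPSC330", "STAT200"]


def digit_sum_hash(key: str, m: int = 4) -> int:
    """Sum the decimal digits that appear in the key, then reduce modulo m."""
    return sum(int(ch) for ch in key if ch.isdigit()) % m


# Group the keys by the bucket they hash to.
buckets = defaultdict(list)
for key in keys:
    buckets[digit_sum_hash(key)].append(key)

for b in range(4):
    print(b, buckets[b])
# 0 ['DSCI220']
# 1 ['DSCI221', 'DSCI100']
# 2 ['MATH101', 'CPSC330', 'STAT200']
# 3 ['STAT201']
```

Buckets 1 and 2 each hold more than one key. With 7 keys and 4 buckets, some such bucket has to exist no matter which hash we pick.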
A Rosy Hash Function
From “perfect” to “general-purpose”
We’ve seen: on a fixed set of keys, we can sometimes build a perfect hash.
But for general-purpose hashing:
- The key space is HUGE (all possible strings, IDs, …), so we only ever see a sample of keys in our data.
- For any fixed hash function \(h : \text{Keys} \to \{0,\dots,m-1\}\)…
Collisions before the table is full?
The pigeonhole principle told us:
- If the number of items actually stored in the table is greater than the number of buckets \(m\),
- then some bucket must contain at least 2 items.
So if we keep inserting more and more items while keeping \(m\) fixed, collisions are eventually guaranteed.
To ponder…
When we have fewer than \(m\) items, will we have collisions?
Create a small sketch or example that you can use to illustrate your understanding.
Hashing?
Place a colorful little dot on your own birthday.
Interesting questions to ask:
- How do we tally collisions?
- How many collisions do we expect?
- What is the general relationship between the number of keys (\(n\)), the table size (\(m\)), and the expected number of collisions?
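One way to explore these questions is a small simulation. The sketch below makes two assumptions that are not fixed by the text above: each key lands in a bucket uniformly at random and independently of the others, and collisions are tallied as unordered pairs of keys that share a bucket. Under those assumptions, linearity of expectation gives an expected tally of \(\binom{n}{2}/m = n(n-1)/(2m)\), which the simulation can be compared against. The names `colliding_pairs` and `simulate` are illustrative.

```python
import random
from collections import Counter


def colliding_pairs(values) -> int:
    """Count unordered pairs of items that landed in the same bucket."""
    return sum(c * (c - 1) // 2 for c in Counter(values).values())


def simulate(n: int, m: int, trials: int = 2000, seed: int = 0) -> float:
    """Average number of colliding pairs when n keys fall uniformly into m buckets."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    total = 0
    for _ in range(trials):
        total += colliding_pairs(rng.randrange(m) for _ in range(n))
    return total / trials


# Birthday setting: m = 365 "buckets", n people in the room.
for n in (10, 23, 50):
    formula = n * (n - 1) / (2 * 365)
    print(n, round(simulate(n, 365), 3), round(formula, 3))
```

Other tallies are possible (for example, counting buckets that hold more than one key), and each gives a different expected value. The pair count is used here because its expectation has a simple closed form.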