Discrete Math for Data Science

DSCI 220, 2025 W1

December 1, 2025


If we fix a finite set of keys \(K\):

A function \(h : K \to \{0,\dots,m-1\}\) is a perfect hash for \(K\) if:
- \(h\) is injective on \(K\) (no two distinct keys “collide” on the same value),
- equivalently, each key gets its own bucket.

If also \(|K| = m\), then \(h\) is automatically surjective as well, so \(h\) is a bijection from \(K\) onto \(\{0,\dots,m-1\}\).

Perfect hash ⇒ no collisions (for that particular key set).

BUT:

Keys: "MATH101", "DSCI220", "STAT201", "DSCI221", "DSCI100", "CPSC330", "STAT200"
Buckets: \(\{0,1,2,3\}\) (4 buckets)
Hash: \(h(\text{"*}xyz\text{"}) = (x+y+z) \bmod 4\), where \(x, y, z\) are the last three digits of the key

Question:

We now have 7 keys and still 4 buckets. Can any function from these 7 keys to \(\{0,1,2,3\}\) be injective?
Why or why not?
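
To see what this particular hash does on the 7 keys, here is a minimal sketch. It assumes that \(x, y, z\) are the three trailing digits of each course code; the names `h` and `bucketize` are illustrative, not from the slides.

```python
from collections import defaultdict

KEYS = ["MATH101", "DSCI220", "STAT201", "DSCI221", "DSCI100", "CPSC330", "STAT200"]
M = 4  # number of buckets


def h(key: str, m: int = M) -> int:
    """Assumed reading of the slide's hash: add the last three digits, reduce mod m."""
    return sum(int(ch) for ch in key[-3:]) % m


def bucketize(keys, m: int = M) -> dict:
    """Group keys by hash value so that collisions become visible."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[h(key, m)].append(key)
    return dict(buckets)


buckets = bucketize(KEYS)
for b in range(M):
    print(b, buckets.get(b, []))
# 0 ['DSCI220']
# 1 ['DSCI221', 'DSCI100']
# 2 ['MATH101', 'CPSC330', 'STAT200']
# 3 ['STAT201']
```

Under this reading, the first four keys land in four different buckets, but the full list of seven does not.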

Formal version in our language:
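
A standard way to state the pigeonhole principle in terms of functions, assuming that is the language intended here (the set names \(A\) and \(B\) are placeholders):

\[ |A| > |B| \;\Longrightarrow\; \text{for every } f : A \to B \text{ there exist } a_1 \ne a_2 \in A \text{ with } f(a_1) = f(a_2). \]

In other words, when \(A\) and \(B\) are finite and \(|A| > |B|\), no function \(f : A \to B\) is injective.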

Apply to hashing:

If we have more keys than buckets:

\[|\text{Keys}| > m = |\{0,\dots,m-1\}|,\]
no hash function \(h : \text{Keys} \to \{0,\dots,m-1\}\) can be one-to-one on that key set.

A collision happens when two different keys share the same hash value:

\[ k_1 \ne k_2, \quad h(k_1) = h(k_2). \]

By the pigeonhole principle:

As soon as we have more keys than buckets,
collisions are guaranteed (for every hash function).
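
The claim "for every hash function" can be checked by brute force on a tiny case. This sketch enumerates every function from a few keys into a few buckets and counts the injective ones; `count_injective` is a hypothetical helper name, and the approach only scales to very small sizes.

```python
from itertools import product


def count_injective(n_keys: int, m: int) -> int:
    """Count injective functions from n_keys keys into m buckets by enumeration."""
    count = 0
    # Each tuple lists the bucket chosen for key 0, key 1, ..., key n_keys - 1.
    for assignment in product(range(m), repeat=n_keys):
        if len(set(assignment)) == n_keys:  # all buckets distinct: no collision
            count += 1
    return count


print(count_injective(3, 4))  # 24: fewer keys than buckets, perfect hashes exist
print(count_injective(4, 4))  # 24: as many keys as buckets, the bijections
print(count_injective(5, 4))  # 0: more keys than buckets, none are injective
```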

We’ve seen: on a fixed set of keys, we can sometimes build a perfect hash.

But for general-purpose hashing:

The key space is HUGE (all possible strings, IDs, …), so we only ever see a sample of keys in our data.
For any fixed hash function \(h : \text{Keys} \to \{0,\dots,m-1\}\)…

The pigeonhole principle told us:

If the number of items actually stored in the table is greater than the number of buckets \(m\),
then some bucket must contain at least 2 items.

So if we keep inserting more and more items while keeping \(m\) fixed, collisions are eventually guaranteed.

When we have fewer than \(m\) items, will we have collisions?

Create a small sketch or example that you can use to illustrate your understanding.

Place a colorful little dot on your own birthday.
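
The birthday activity can be read as hashing with \(m = 365\) buckets. As a rough sketch, assuming each key is sent to a bucket uniformly at random and independently of the others (an idealization, not a property of any particular hash function), the chance of at least one collision among \(n\) keys is \(1 - \prod_{i=0}^{n-1} \frac{m-i}{m}\). The function name below is illustrative.

```python
def collision_probability(n: int, m: int) -> float:
    """P(at least one collision) for n keys placed uniformly at random into m buckets."""
    if n > m:
        return 1.0  # pigeonhole: a collision is certain
    p_no_collision = 1.0
    for i in range(n):
        # The (i+1)-th key must avoid the i buckets already taken.
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision


for n in (5, 10, 23, 50):
    print(n, round(collision_probability(n, 365), 3))
```

With \(m = 365\), the probability passes one half at \(n = 23\), long before \(n\) reaches \(m\).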

Interesting questions to ask:

How do we tally collisions?
How many collisions do we expect?
What is the general relationship between
1. number of keys (\(n\)),
2. table size (\(m\)), and
3. expected # of collisions?
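
One possible way to explore these questions: tally a collision as an unordered pair of distinct keys that share a bucket. Under the assumption of uniform, independent hashing, each of the \(\binom{n}{2}\) pairs collides with probability \(1/m\), so the expected number of colliding pairs is \(\frac{n(n-1)}{2m}\). The sketch below compares that formula with a small simulation; the tallying convention, function names, trial count, and seed are all choices made for illustration.

```python
import random
from collections import Counter


def expected_colliding_pairs(n: int, m: int) -> float:
    """Expected number of colliding pairs under uniform, independent hashing."""
    return n * (n - 1) / (2 * m)


def count_colliding_pairs(hashes) -> int:
    """Tally collisions as unordered pairs of items sharing a bucket."""
    counts = Counter(hashes)
    return sum(c * (c - 1) // 2 for c in counts.values())


def simulate(n: int, m: int, trials: int = 2000, seed: int = 0) -> float:
    """Average number of colliding pairs over random assignments of n keys to m buckets."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += count_colliding_pairs(rng.randrange(m) for _ in range(n))
    return total / trials


n, m = 23, 365
print("formula:   ", expected_colliding_pairs(n, m))
print("simulation:", simulate(n, m))
```

Other tallies are possible, such as counting buckets that hold more than one item, and they lead to different formulas.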