Discrete Math for Data Science

DSCI 220, 2025 W1

December 1, 2025

Announcements

Hashing

Perfect hash functions

If we fix a finite set of keys \(K\):

  • A function \(h : K \to \{0,\dots,m-1\}\) is a perfect hash for \(K\) if:
    • \(h\) is injective on \(K\) (no two keys “collide” in the codomain),
    • so each key gets its own bucket.

If also \(|K| = m\), then an injective \(h\) is automatically surjective, so \(h\) is a bijection between \(K\) and \(\{0,\dots,m-1\}\).

Perfect hash ⇒ no collisions (for that particular key set).

BUT:

  • It might be very hand-crafted and fragile.
  • It might stop being perfect as soon as we add more keys.
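A minimal sketch of how one might check the definition and see the fragility in action. The helper `is_perfect_hash`, the hash `length_hash`, and the fruit keys are all hypothetical, chosen only for illustration:

```python
def is_perfect_hash(h, keys, m):
    """Return True if h maps the distinct keys injectively into {0, ..., m-1}."""
    values = [h(k) for k in set(keys)]
    in_range = all(0 <= v < m for v in values)
    # Injective on the key set: no hash value is used twice.
    return in_range and len(set(values)) == len(values)


m = 4

def length_hash(key):
    # Hypothetical hand-crafted hash: bucket = length of the key, mod m.
    return len(key) % m


keys = ["fig", "kiwi", "apple", "banana"]  # lengths 3, 4, 5, 6 -> buckets 3, 0, 1, 2

print(is_perfect_hash(length_hash, keys, m))             # True
print(is_perfect_hash(length_hash, keys + ["pear"], m))  # False
```

Adding the single key "pear" (length 4) sends it to bucket 0, where "kiwi" already sits, so the hash is no longer perfect for the enlarged key set.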

What if we add more keys?

  • Keys: "MATH101", "DSCI220", "STAT201", "DSCI221", "DSCI100", "CPSC330", "STAT200"
  • Buckets: \(\{0,1,2,3\}\) (4 buckets)
  • Hash: \(h(\text{"*}xyz\text{"}) = (x+y+z) \bmod 4\)
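To see where each key lands, here is a small sketch of the hash above. It assumes that \(x, y, z\) are the three trailing digits of the course code and that "*" stands for the letter prefix; the function names are illustrative only:

```python
from collections import defaultdict

KEYS = ["MATH101", "DSCI220", "STAT201", "DSCI221", "DSCI100", "CPSC330", "STAT200"]
M = 4  # number of buckets


def digit_sum_hash(key, m=M):
    # Assumption: x, y, z are the last three characters, read as digits.
    x, y, z = (int(ch) for ch in key[-3:])
    return (x + y + z) % m


def bucket_contents(keys, m=M):
    # Group the keys by the bucket they hash to.
    buckets = defaultdict(list)
    for key in keys:
        buckets[digit_sum_hash(key, m)].append(key)
    return dict(buckets)


for bucket, members in sorted(bucket_contents(KEYS).items()):
    print(bucket, members)
```

Under this reading, the first four keys occupy four different buckets, while each of the three later keys lands in a bucket that is already taken.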

Question:

  1. We now have 7 keys and still 4 buckets. Can any function from these 7 keys to \(\{0,1,2,3\}\) be injective?
  2. Why or why not?

Pigeonhole principle (PHP)

Formal version in our language:

  • If \(|X| > |Y|\), then no function \(f : X \to Y\) can be injective.

Apply to hashing:

  • Keys = pigeons
  • Buckets = pigeonholes

If we have more keys than buckets:

  • \[|\text{Keys}| > m = |\{0,\dots,m-1\}|,\]
  • no hash function \(h : \text{Keys} \to \{0,\dots,m-1\}\) can be one-to-one on that key set.
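For small sets, the pigeonhole claim can be sanity-checked by brute force. The sketch below enumerates every function from a domain of a given size to a codomain of a given size and counts the injective ones; the function name is made up for this illustration, and the check is not a proof:

```python
from itertools import product


def count_injective_functions(domain_size, codomain_size):
    """Count the injective functions from a set of domain_size elements
    to a set of codomain_size elements by trying every function."""
    count = 0
    # Each tuple lists the images of domain elements 0, 1, ..., domain_size - 1.
    for images in product(range(codomain_size), repeat=domain_size):
        if len(set(images)) == domain_size:  # all images distinct
            count += 1
    return count


print(count_injective_functions(7, 4))  # 0: none of the 4**7 functions is injective
print(count_injective_functions(3, 4))  # 24 = 4 * 3 * 2
```

With 7 keys and 4 buckets, all \(4^7 = 16384\) candidate functions fail to be injective, which matches what the pigeonhole principle predicts.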

Collisions

A collision happens when two different keys share the same hash value:

\[ k_1 \ne k_2, \quad h(k_1) = h(k_2). \]

By the pigeonhole principle:

  • As soon as we have more keys than buckets,
    collisions are guaranteed (for every hash function).
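The definition of a collision translates directly into a search over pairs of keys. A minimal sketch, using a hypothetical hash \(h(k) = k \bmod 4\) on integer keys:

```python
from itertools import combinations


def colliding_pairs(h, keys):
    """Return all pairs (k1, k2) with k1 != k2 and h(k1) == h(k2)."""
    return [
        (k1, k2)
        for k1, k2 in combinations(keys, 2)
        if k1 != k2 and h(k1) == h(k2)
    ]


m = 4

def mod_hash(k):
    # Hypothetical hash for illustration: remainder mod the table size.
    return k % m


keys = [3, 7, 10, 12, 15]  # 5 keys, 4 buckets: a collision is unavoidable
print(colliding_pairs(mod_hash, keys))  # [(3, 7), (3, 15), (7, 15)]
```

Here 3, 7, and 15 all have remainder 3, so that one bucket accounts for three colliding pairs.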

A Rosey Hash Function

[Figure: buckets labeled 0 through 20]

From “perfect” to “general-purpose”

We’ve seen: on a fixed set of keys, we can sometimes build a perfect hash.

But for general-purpose hashing:

  • The key space is HUGE (all possible strings, IDs, …), so we only ever see a sample of keys in our data.
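  • For any fixed hash function \(h : \text{Keys} \to \{0,\dots,m-1\}\), the key space has far more than \(m\) elements, so by the pigeonhole principle some pairs of keys must collide.

[Figure: buckets labeled 0, 1, …, m-1]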

Collisions before the table is full?

The pigeonhole principle told us:

  • If the number of items actually stored in the table is greater than the number of buckets \(m\),
  • then some bucket must contain at least 2 items.

So if we keep inserting more and more items while keeping \(m\) fixed, collisions are eventually guaranteed.

To ponder…

 

When we have fewer than \(m\) items, will we have collisions?

 

Create a small sketch or example that you can use to illustrate your understanding.
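One possible starting point for such a sketch, under an assumption that is not part of the pigeonhole argument: each item is hashed independently and uniformly at random into one of the \(m\) buckets. The probability that all \(n\) items land in distinct buckets is then \(\prod_{i=0}^{n-1} \frac{m-i}{m}\), and the function below (a hypothetical helper) returns the complement:

```python
def prob_some_collision(n, m):
    """Probability of at least one collision when n items are hashed
    independently and uniformly at random into m buckets (an assumption)."""
    if n > m:
        return 1.0  # pigeonhole: a collision is certain
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th item must avoid the i buckets already in use.
        p_all_distinct *= (m - i) / m
    return 1.0 - p_all_distinct


for n in (1, 2, 5, 10):
    print(n, round(prob_some_collision(n, 10), 4))
```

Under this model, 5 items in 10 buckets already collide with probability \(1 - 30240/100000 = 0.6976\), even though the table is only half full.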

Hashing?

Place a colorful little dot on your own birthday.

Interesting questions to ask:

  • How do we tally collisions?
  • How many collisions do we expect?
  • What is the general relationship between
    1. number of keys (\(n\)),
    2. table size (\(m\)), and
    3. expected # of collisions?
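One way to explore these questions, sketched under two assumptions that the slides do not fix: collisions are tallied as colliding pairs of keys, and each key is hashed independently and uniformly at random. Each of the \(\binom{n}{2}\) pairs then collides with probability \(1/m\), so the expected number of colliding pairs is \(\frac{n(n-1)}{2m}\). All function names below are illustrative:

```python
import random
from collections import Counter


def count_colliding_pairs(hash_values):
    """Tally collisions as pairs: a bucket holding c items contributes c*(c-1)/2."""
    counts = Counter(hash_values)
    return sum(c * (c - 1) // 2 for c in counts.values())


def expected_colliding_pairs(n, m):
    """Expected colliding pairs for n keys and m buckets under uniform random hashing."""
    return n * (n - 1) / (2 * m)


def simulate_mean_pairs(n, m, trials=2000, seed=0):
    """Average the pair tally over repeated random experiments."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        values = [rng.randrange(m) for _ in range(n)]
        total += count_colliding_pairs(values)
    return total / trials


n, m = 23, 365  # 23 people, 365 possible birthdays
print(expected_colliding_pairs(n, m))  # 253 / 365, roughly 0.69
print(simulate_mean_pairs(n, m))       # should land close to the value above
```

Other tallies are possible, for example counting the number of buckets that hold more than one key; the formula above applies only to the pair-counting convention.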