Discrete Math for Data Science

DSCI 220, 2025 W1

December 2, 2025

Announcements

Hashing

Hashing?

Place a colorful little dot on your own birthday.

Interesting questions to ask:

  • How do we tally collisions?
  • How many collisions do we expect?
  • What is the general relationship between
    1. number of keys (\(n\)),
    2. table size (\(m\)), and
    3. expected # of collisions?

Side Quest – Birthdays

What’s the probability that no one in this room has a birthday today?

What’s the probability that two (or more) people in this room share a birthday?

Does this change depending on the number of people in the room?
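One way to explore these three questions numerically is sketched below. It assumes 365 equally likely birthdays (no leap years) and independent people; the function names are illustrative and not part of the course materials.

```python
from math import prod

DAYS = 365  # assumption: ignore Feb 29 and treat all birthdays as equally likely


def prob_nobody_born_today(n: int, days: int = DAYS) -> float:
    """P(none of n people has a birthday today): each person misses with prob (days-1)/days."""
    return ((days - 1) / days) ** n


def prob_shared_birthday(n: int, days: int = DAYS) -> float:
    """P(at least two of n people share a birthday), via the complement 'all distinct'."""
    if n > days:
        return 1.0  # pigeonhole: more people than days forces a repeat
    p_all_distinct = prod((days - i) / days for i in range(n))
    return 1.0 - p_all_distinct


# Both probabilities depend on the room size n.
for n in (10, 23, 50, 100):
    print(n, round(prob_nobody_born_today(n), 3), round(prob_shared_birthday(n), 3))
```

Under these assumptions, the chance of a shared birthday first exceeds 1/2 at 23 people, which is far fewer than most people guess.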

Expected Value (an aside)

How many pips do we expect to see on a die?

Definitions that will help us:

  • \(X\):
  • \(E[X]\):

How many pips do we expect to see on 2 dice?

  • \(X\), \(Y\):
  • \(E[ X + Y ] =\)
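
Sums of two dice (rows: first die, columns: second die):

          |  1   2   3   4   5   6
      ----+------------------------
        1 |  2   3   4   5   6   7
        2 |  3   4   5   6   7   8
        3 |  4   5   6   7   8   9
        4 |  5   6   7   8   9  10
        5 |  6   7   8   9  10  11
        6 |  7   8   9  10  11  12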
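The dice questions can be checked by brute-force enumeration. The sketch below assumes fair six-sided dice, so every outcome is equally likely; `expected_value` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # assumption: a fair six-sided die


def expected_value(outcomes):
    """Average of equally likely outcomes, kept exact with Fraction."""
    outcomes = list(outcomes)
    return Fraction(sum(outcomes), len(outcomes))


e_one = expected_value(faces)                                       # one die
e_two = expected_value(x + y for x, y in product(faces, repeat=2))  # all 36 sums

print(e_one, e_two)  # 7/2 7
assert e_two == e_one + e_one  # linearity of expectation: E[X + Y] = E[X] + E[Y]
```

The final assertion is the point: the expected sum can be read off from the single-die answer without enumerating all 36 cells.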

Expected Number of Collisions

What is the expected number of pairs of people who share a birthday?

Definitions that will help us:

  • \(X_{ij}\):
  • \(E[X_{ij}]\):

Sum over all pairs \(i < j\) of \(X_{ij}\) to get the number of shared birthdays and its expectation:

  • \(X\):
  • \(E[X] =\)
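One possible way to fill in these definitions is sketched here. It assumes \(n\) people in the room (the symbol \(n\) for the room size is an assumption), 365 equally likely birthdays, and independent birthdays:

\[
X_{ij} =
\begin{cases}
1 & \text{if persons } i \text{ and } j \text{ share a birthday,} \\
0 & \text{otherwise,}
\end{cases}
\qquad
E[X_{ij}] = P(X_{ij} = 1) = \frac{1}{365}
\]

\[
X = \sum_{i < j} X_{ij},
\qquad
E[X] = \sum_{i < j} E[X_{ij}] = \binom{n}{2} \cdot \frac{1}{365} = \frac{n(n-1)}{730}
\]

Linearity of expectation does all the work, and it does not require the \(X_{ij}\) to be independent of one another. With \(n = 23\), this gives \(253/365 \approx 0.69\) expected sharing pairs.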

Collisions: if we randomly put \(n\) items into \(m\) bins, we expect ________ pairs to collide.
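A quick simulation can be used to sanity-check whatever formula goes in the blank. The sketch below assumes each item lands in a uniformly random bin, independently of the others, and compares the simulated average against the indicator-variable prediction of "number of pairs divided by number of bins". All function and parameter names are illustrative.

```python
import random
from collections import Counter
from math import comb


def expected_colliding_pairs(num_items: int, num_bins: int) -> float:
    """Indicator-variable prediction: each of C(num_items, 2) pairs collides w.p. 1/num_bins."""
    return comb(num_items, 2) / num_bins


def count_colliding_pairs(num_items: int, num_bins: int, rng: random.Random) -> int:
    """Throw items into uniformly random bins and count pairs sharing a bin."""
    loads = Counter(rng.randrange(num_bins) for _ in range(num_items))
    return sum(comb(c, 2) for c in loads.values())  # a bin holding c items has C(c, 2) pairs


def simulate(num_items: int, num_bins: int, trials: int = 20_000, seed: int = 0) -> float:
    """Average number of colliding pairs over many independent trials."""
    rng = random.Random(seed)
    total = sum(count_colliding_pairs(num_items, num_bins, rng) for _ in range(trials))
    return total / trials


print(expected_colliding_pairs(23, 365), simulate(23, 365))
```

With 23 items and 365 bins this is the birthday setting again, and the two printed numbers should be close.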

Implication:

Hash function summary

We can only avoid collisions if the size of the keyspace is less than or equal to the size of our hash table and we have a perfect hash function.

We can thwart poor performance by randomizing our choice of hash function for each application. Universal Hashing to the rescue!
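To make "randomizing our choice of hash function" concrete, here is a minimal sketch of one classic universal family, \(h_{a,b}(k) = ((ak + b) \bmod p) \bmod m\), often attributed to Carter and Wegman. It is not necessarily the construction used in this course. The prime `P`, the helper name `make_universal_hash`, and the restriction to integer keys in \([0, P)\) are all assumptions for illustration.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; assumption: keys are integers in [0, P)


def make_universal_hash(m: int, rng=None):
    """Draw one random function h(k) = ((a*k + b) mod P) mod m from the family."""
    if rng is None:
        rng = random.Random()
    a = rng.randrange(1, P)  # a must be nonzero
    b = rng.randrange(0, P)

    def h(key: int) -> int:
        return ((a * key + b) % P) % m

    return h


# Each application (or each table) draws its own function.
h = make_universal_hash(m=10, rng=random.Random(42))
print([h(k) for k in range(5)])
```

Because `a` and `b` are chosen at random when the table is created, no fixed set of keys is bad for every function in the family: any two distinct keys collide with probability about \(1/m\) over the random choice.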

We have to deal with collisions: even with only ____ keys, we expect at least one.
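One reading of "expect at least one" is that the expected number of colliding pairs reaches 1. Under that interpretation (an assumption), and assuming keys are hashed uniformly at random, the threshold can be found by direct search. The function name below is hypothetical.

```python
from math import comb


def min_keys_for_expected_collision(num_bins: int) -> int:
    """Smallest n with C(n, 2) / num_bins >= 1, i.e. at least one expected colliding pair."""
    n = 1
    while comb(n, 2) < num_bins:
        n += 1
    return n


for m in (365, 10_000, 1_000_000):
    print(m, min_keys_for_expected_collision(m))
```

The threshold grows like \(\sqrt{2m}\), so a table with a million slots already expects a collision after roughly 1,400 keys.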

We need a collision resolution strategy…