Lists vs dictionaries
Suppose we want to store our friends’ favourite drink orders…
Use a list of pairs:
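A minimal sketch of such a list. The names and drinks are hypothetical, made up for illustration; only Bianca is named in these notes.

```python
# Hypothetical data: each entry is a (friend, drink order) pair.
orders = [
    ("Amir", "flat white"),
    ("Bianca", "iced matcha latte"),
    ("Chen", "black coffee"),
    ("Dana", "chai tea"),
]

# Positions are just 0, 1, 2, ...; they say nothing about whose order is where.
print(orders[1])  # ('Bianca', 'iced matcha latte')
```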
How do we find Bianca’s order?
Finding things with a list
If we store data in a list, to find Bianca’s order we have to search:
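A sketch of that search, again with hypothetical names and drinks. The helper name find_order is an assumption for illustration, not something defined in these notes.

```python
# Hypothetical data, made up for illustration.
orders = [
    ("Amir", "flat white"),
    ("Bianca", "iced matcha latte"),
    ("Chen", "black coffee"),
    ("Dana", "chai tea"),
]

def find_order(pairs, name):
    """Scan the list front to back until the name matches."""
    for friend, drink in pairs:
        if friend == name:
            return drink
    return None  # the name is not in the list

print(find_order(orders, "Bianca"))  # iced matcha latte
```

In the worst case the loop looks at every pair before it can answer.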
- Do you like this?
- What if the list were very long?
- What’s so special about the indices?
Python moment
Python has an implementation of an Associative Array called a dict.
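A short sketch of the same drink orders stored in a dict. The names and drinks are hypothetical, made up for illustration.

```python
# Hypothetical data, made up for illustration.
orders = {
    "Amir": "flat white",
    "Bianca": "iced matcha latte",
    "Chen": "black coffee",
}

print(orders["Bianca"])      # look up by key: iced matcha latte
orders["Dana"] = "chai tea"  # add a new key: value pair
print("Dana" in orders)      # membership test on keys: True
print(orders.get("Eli"))     # .get() on a missing key: None
```

No loop is needed: the key itself is what we use to reach the value.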
Example Key: Value pairs
We often want to map one kind of thing (the key) to another kind of thing (the value):
- Song title → artist
"Anti-Hero" → "Taylor Swift"
- Course code → course title
"DSCI100" → "Intro to DSci"
- Email address → user ID
- Date → temperature
- Flight → destination
- City name → population
- Word → number of times it appears
- User ID → email address
All of these are naturally modeled as key–value pairs — exactly what Python dictionaries (and hash tables) are built for.
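As one illustration, the "word → number of times it appears" mapping can be sketched with a dict. The sample sentence is made up.

```python
# Made-up sample text for illustration.
text = "the cat saw the dog and the dog saw the cat"

counts = {}
for word in text.split():
    # .get(word, 0) returns 0 the first time we meet a word.
    counts[word] = counts.get(word, 0) + 1

print(counts["the"])  # 4
print(counts["and"])  # 1
```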
Where we are
- We’ve been thinking about functions:
- \(f : X \to Y\)
- “every \(x \in X\) gets exactly one \(y \in Y\)”
- We saw:
- functions as sets of input–output pairs,
- counting functions,
- huge feature spaces (\(X\) can be enormous).
Today:
- A special and practical kind of function: hash functions.
- Goal:
- connect hashing to our function language,
- introduce injective / surjective via a concrete example
Arrays vs associative arrays
A simple array/list:
- Indices: \(0,1,2,\dots,m-1\)
- Values: whatever we store there
- Access: by integer position, e.g. A[3]
An associative array:
- Keys: strings, IDs, course codes
- Access: by key, e.g. T["DSCI220"]
Idea:
Use a function that turns a key into an index.
That function is called a hash function.
Recall: functions
Formal definition:
A function \(f : X \to Y\) assigns to each input \(x \in X\) exactly one output \(y \in Y\).
- \(X\) = domain
- \(Y\) = codomain
Today’s special case:
- \(X\) = set of keys (e.g., strings like "DSCI220")
- \(Y\) = set of indices \(\{0,1,\dots,m-1\}\)
So a hash function is just:
\[
h : \text{Keys} \to \{0,1,\dots,m-1\}.
\]
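To make the signature concrete, here is a toy hash function in Python. It is only a sketch: the rule (add up the character codes, reduce mod \(m\)) is an assumption chosen for illustration, and it is not how Python's built-in dict hashes strings.

```python
def h(key, m):
    """Toy hash: add up the character codes, then reduce mod m."""
    return sum(ord(ch) for ch in key) % m

m = 4
for key in ["DSCI220", "hello", ""]:
    index = h(key, m)
    assert 0 <= index < m  # the output is always a valid index
    print(repr(key), "->", index)
```

Whatever string goes in, exactly one index in \(\{0,1,\dots,m-1\}\) comes out, which is what makes \(h\) a function.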
Example
Let’s start with a set of keys:
"MATH101", "DSCI220", "STAT201", "DSCI221"
And an array with 4 buckets, indexed \(0,1,2,3\).
Goal: Define a hash function \(h\) that maps each key to a bucket index.
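One candidate, sketched in Python, assuming the digit-sum rule \(h(\text{"*}xyz\text{"}) = (x+y+z) \bmod 4\) that these notes use later for the larger key set. The variable name table is hypothetical.

```python
def h(key, m=4):
    """Add the last three characters as digits, then reduce mod m."""
    return sum(int(d) for d in key[-3:]) % m

keys = ["MATH101", "DSCI220", "STAT201", "DSCI221"]
table = [None] * 4  # the array: one slot per index

for key in keys:
    table[h(key)] = key

print(table)  # ['DSCI220', 'DSCI221', 'MATH101', 'STAT201']
```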
Observations
Notes:
- Every key has exactly one hash value. Why must this be true?
- No two keys share the same output.
- Every index is the output of some input.
- We call this a perfect hash function.
Vocabulary: injective / surjective
Let \(f : X \to Y\).
- \(f\) is Injective (one-to-one) if:
- \(x_1 \ne x_2\) implies \(f(x_1) \ne f(x_2)\).
- No two different inputs share the same output.
- \(f\) is Surjective (onto) if:
- For every \(y \in Y\), there is at least one \(x \in X\) with \(f(x) = y\).
- Every output value is used.
- \(f\) is Bijective if:
- it is both injective and surjective.
- Perfect “pairing” between \(X\) and \(Y\).
On our 4 course codes and 4 buckets, \(h\) is bijective, and therefore a perfect hash for that key set.
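These definitions can be checked mechanically for a finite function stored as a dict. This is a sketch: the helper names are hypothetical, and the bucket values assume the digit-sum rule \((x+y+z) \bmod 4\) on the last three digits of each course code.

```python
def is_injective(f):
    """f is a dict from inputs to outputs: no output may repeat."""
    outputs = list(f.values())
    return len(set(outputs)) == len(outputs)

def is_surjective(f, codomain):
    """Every element of the codomain must appear as some output."""
    return set(codomain) <= set(f.values())

# Assumed values: digit sum of the course number, mod 4.
h = {"MATH101": 2, "DSCI220": 0, "STAT201": 3, "DSCI221": 1}
buckets = range(4)
print(is_injective(h), is_surjective(h, buckets))  # True True

# A counterexample: both inputs land on 0, and 1 is never used.
g = {"a": 0, "b": 0}
print(is_injective(g), is_surjective(g, range(2)))  # False False
```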
Perfect hash functions
If we fix a finite set of keys \(K\):
- A function \(h : K \to \{0,\dots,m-1\}\) is a perfect hash for \(K\) if:
- \(h\) is injective on \(K\) (no two keys collide),
- so each key gets its own bucket.
If also \(|K| = m\) and \(h\) is surjective, then \(h\) is bijective on \(K\).
Perfect hash ⇒ no collisions (for that particular key set).
BUT:
- It might be very hand-crafted and fragile.
- It might stop being perfect as soon as we add more keys.
What if we add more keys?
- Keys:
"MATH101", "DSCI220", "STAT201", "DSCI221", "DSCI100", "CPSC330", "STAT200"
- Buckets: \(\{0,1,2,3\}\) (4 buckets)
- Hash: \(h(\text{"*}xyz\text{"}) = (x+y+z) \bmod 4\), where \(x, y, z\) are the three digits at the end of the key
Question:
- We now have 7 keys and still 4 buckets. Can any function from these 7 keys to \(\{0,1,2,3\}\) be injective?
- Why or why not?
Pigeonhole principle (PHP)
Formal version in our language:
- If \(|X| > |Y|\), then no function \(f : X \to Y\) can be injective.
Apply to hashing:
- Keys = pigeons
- Buckets = pigeonholes
If we have more keys than buckets:
- \[|\text{Keys}| > m = |\{0,\dots,m-1\}|,\]
- no hash function \(h : \text{Keys} \to \{0,\dots,m-1\}\) can be one-to-one on that key set.
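For a small case, the claim can be confirmed by brute force. The sketch below lists every function from 5 keys to 4 buckets and looks for an injective one; the key names are hypothetical placeholders.

```python
from itertools import product

keys = ["k1", "k2", "k3", "k4", "k5"]  # 5 pigeons (placeholder names)
m = 4                                  # 4 pigeonholes

# Each tuple gives one bucket per key, so this lists all m**len(keys) functions.
all_functions = list(product(range(m), repeat=len(keys)))

# A function is injective exactly when its outputs are all distinct.
injective = [f for f in all_functions if len(set(f)) == len(f)]

print(len(all_functions))  # 1024
print(len(injective))      # 0
```

Exhaustive search only works for tiny sets; the pigeonhole principle gives the same answer for any sizes with \(|X| > |Y|\).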
Collisions
A collision happens when two different keys share the same hash value:
\[
k_1 \ne k_2, \quad h(k_1) = h(k_2).
\]
By the pigeonhole principle:
- As soon as we have more keys than buckets, collisions are guaranteed (for every hash function).
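To see collisions concretely, the sketch below applies the digit-sum hash \((x+y+z) \bmod 4\) to the 7 course codes from the earlier slide and groups the keys by bucket. The variable name buckets is hypothetical.

```python
from collections import defaultdict

def h(key, m=4):
    """Add the last three characters as digits, then reduce mod m."""
    return sum(int(d) for d in key[-3:]) % m

keys = ["MATH101", "DSCI220", "STAT201", "DSCI221",
        "DSCI100", "CPSC330", "STAT200"]

buckets = defaultdict(list)
for key in keys:
    buckets[h(key)].append(key)

for index in range(4):
    print(index, buckets[index])
# 0 ['DSCI220']
# 1 ['DSCI221', 'DSCI100']
# 2 ['MATH101', 'CPSC330', 'STAT200']
# 3 ['STAT201']
```

Buckets 1 and 2 each hold more than one key, so this \(h\) is no longer injective on the larger key set.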
A Rosy Hash Function
From “perfect” to “general-purpose”
We’ve seen: on a tiny, fixed set of keys, we can sometimes build a perfect hash.
But for general-purpose hashing:
- The key space is HUGE (all possible strings, IDs, …), so we only ever see a sample of keys in our data.
- For any fixed hash function \(h : \text{Keys} \to \{0,\dots,m-1\}\)…
Collisions before the table is full?
The pigeonhole principle told us:
- If the number of items actually stored in the table is greater than the number of buckets \(m\),
- then some bucket must contain at least 2 items.
So if we keep inserting more and more items while keeping \(m\) fixed, collisions are eventually guaranteed.
To ponder for next time:
- When we have fewer than \(m\) items, will we have collisions?