Discrete Math for Data Science

DSCI 220, 2025 W1

November 25, 2025

Announcements

Feature Spaces and the Curse of Dimensionality

Today

  • Think of data as points in a feature space \(X\)
  • Count how big \(X\) gets as we:
    • add more features
    • increase feature precision
  • See why data are sparse in high dimensions
  • Connect this to learning a function \(f : X \to Y\) from data

Quick recap: functions and models

From last time:

  • A model / classifier is a function \(f : X \to Y\)
    • \(X\) = input space / feature space
    • \(Y\) = labels or outputs
  • For finite \(X\) and \(Y\):
    • If \(|X| = N\) and \(|Y| = m\) then there are \(m^N\) functions \(X \to Y\)
  • Example:
    • \(X = \{0,1\}^3\) (3-bit vectors), so \(|X| = 8\)
    • Binary labels \(Y = \{0,1\}\): there are \(2^8 = 256\) classifiers
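
Here is a minimal sketch (in Python, not part of the original slides) that checks this count by brute force: it enumerates every function from \(X = \{0,1\}^3\) to \(Y = \{0,1\}\) as a choice of one label per input.

```python
# Enumerate all classifiers f : {0,1}^3 -> {0,1} and count them.
from itertools import product

X = list(product([0, 1], repeat=3))                   # the 8 possible 3-bit inputs
all_functions = list(product([0, 1], repeat=len(X)))  # one label per input
print(len(X), len(all_functions))                     # 8 256  (= 2**8)
```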

Quick recap: learning from data

  • There is some unknown true function \(f : X \to Y\)
  • We see data: \[ D = \{(x_1, y_1), \dots, (x_n, y_n)\}, \quad y_i = f(x_i) \]
  • We pick a function \(h\) (our model) that:
    • agrees with the data (or mostly agrees)
    • we hope generalizes to new \(x \in X\)

Today’s question:

What is this space \(X\) really like, and why does it cause trouble?

How many images exist?

Consider black-and-white images:

  • Size: \(28 \times 28\) pixels
  • Each pixel is either 0 (black) or 1 (white)

Question:
If each pixel can be 0 or 1, how many different images are possible?

  • Think: how many pixels? how many choices per pixel?
  • Discuss with a neighbor and write your answer as a power of 2.

Counting images

  • There are \(28 \times 28 = 784\) pixels
  • Each pixel: 2 choices (0 or 1)

So the number of possible images is:

\[ 2^{784}. \]

We can estimate:

  • \(\log_{10}(2^{784}) = 784 \log_{10} 2 \approx 784 \cdot 0.301 \approx 236\)
  • So about \(10^{236}\) possible images
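
As a quick sanity check (a sketch, not part of the slides), Python's arbitrary-precision integers let us compute \(2^{784}\) exactly and count its decimal digits.

```python
# Check the order-of-magnitude estimate for the number of 28x28 binary images.
from math import log10

n_images = 2 ** 784
print(len(str(n_images)))        # 237 digits, so n_images is roughly 10**236
print(round(784 * log10(2), 1))  # 236.0
```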

Images as feature vectors

We can think of each image as a point in a feature space:

  • Feature space: \[ X = \{0,1\}^{784} \]
  • Each image is a 784-dimensional vector of 0s and 1s
  • Size of \(X\): \(|X| = 2^{784}\)

A digit recognizer is a function:

\[ f : X \to \{0,1,\dots,9\} \]

that assigns a label to each possible image.
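
Concretely (a made-up illustration, not the course's code), an image is just a tuple in \(\{0,1\}^{784}\), and any Python function that maps such a tuple to a digit is one of these classifiers.

```python
# One point of X = {0,1}^784 and one (arbitrary, useless) classifier on X.
import random

random.seed(0)
image = tuple(random.randint(0, 1) for _ in range(28 * 28))

def classify(x):
    # A made-up rule: label by the number of white pixels modulo 10.
    # Any rule that assigns a digit to every x in X is some f : X -> {0,...,9}.
    return sum(x) % 10

print(classify(image))
```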

Data and the Domain

In practice:

  • We might have tens of thousands or millions of labeled images
  • But the total number of possible images is on the order of \(10^{236}\)

So:

Our training set is a microscopic fraction of the full input space \(X\).

Yet we want \(f\) to behave well on images it has never seen.

This is one face of the curse of dimensionality.

Binary feature vectors

Now simplify:

  • Suppose examples are described by \(d\) binary features
    • Each feature: 0 or 1
  • The feature space is: \[ X = \{0,1\}^d \]

How big is \(X\)?

  • There are \(2\) choices for each of the \(d\) features
  • So: \[ |X| = 2^d \]

10 features vs 20 vs 30

Suppose we have a dataset with \(n = 10{,}000\) examples
(assume they are all distinct points in \(X\)).

Fill in this table:

# features \(d\) | \(|X| = 2^d\) (approx) | Could 10,000 data points cover all of \(X\)? | Rough fraction seen \(n/|X|\)
10 | | |
20 | | |
30 | | |

10 features vs 20 vs 30

Let’s plug in:

  • \(d = 10\): \(|X| = 2^{10} = 1{,}024\)
  • \(d = 20\): \(|X| = 2^{20} \approx 1{,}000{,}000\)
  • \(d = 30\): \(|X| = 2^{30} \approx 1{,}000{,}000{,}000\)

With \(n = 10{,}000\) data points:

# features \(d\) | \(|X|\) | Can 10,000 cover all of \(X\)? | Fraction \(n/|X|\)
10 | \(1{,}024\) | Yes (with many repeats) | \(\approx 10\) (about 10 examples per element)
20 | \(\approx 10^6\) | No | \(\approx 10^{-2}\)
30 | \(\approx 10^9\) | Definitely not | \(\approx 10^{-5}\)

As we add features, \(|X|\) explodes, and the fraction of \(X\) we see shrinks rapidly.
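
A short sketch (assuming \(n = 10{,}000\) distinct points, as above) reproduces the table's numbers:

```python
# Feature-space size 2**d and the fraction of it that 10,000 points could cover.
n = 10_000
for d in (10, 20, 30):
    size = 2 ** d
    print(f"d={d:2d}  |X|={size:>13,}  n/|X|={n / size:.2e}")
```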

Functions on a huge \(X\)

Recall:

  • For a finite input space \(X\) with \(|X| = N\) and binary labels: \[ \#\{f : X \to \{0,1\}\} = 2^N \]

If we have labeled data on \(n\) distinct inputs:

  • The function is fixed on those \(n\) points
  • Free on the remaining \(N - n\) points
  • So the number of functions consistent with the data is: \[ 2^{N - n} \]

If \(N\) is huge and \(n \ll N\), then \(N - n \approx N\): there are still astronomically many functions that fit the data perfectly.
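
On a space small enough to enumerate, we can verify the \(2^{N-n}\) count directly (a brute-force sketch with made-up labels):

```python
# Count the functions {0,1}^3 -> {0,1} consistent with labels on n = 3 inputs.
from itertools import product

X = list(product([0, 1], repeat=3))            # N = 8 inputs
observed = {X[0]: 1, X[1]: 0, X[2]: 1}         # hypothetical data on 3 inputs

consistent = 0
for labels in product([0, 1], repeat=len(X)):  # every function X -> {0,1}
    f = dict(zip(X, labels))
    if all(f[x] == y for x, y in observed.items()):
        consistent += 1

print(consistent, 2 ** (len(X) - len(observed)))  # both 32 = 2**(8 - 3)
```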

Example: 20 binary features

Take \(d = 20\) binary features:

  • Feature space size: \[ |X| = 2^{20} \approx 1{,}000{,}000 \]
  • Suppose we have \(n = 10{,}000\) labeled examples

Then:

  • \(N = 2^{20} = 1{,}048{,}576\), so \(N - n = 1{,}038{,}576\)
  • Number of functions \(X \to \{0,1\}\) consistent with the data: \[ 2^{N - n} = 2^{1{,}038{,}576} \approx 2^{10^6} \]

Even after seeing 10,000 examples, there are still astronomically many functions that agree with every observed label.
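
To get a feel for how large \(2^{N-n}\) is here, a one-line estimate (a sketch, using the exact \(N = 2^{20}\)):

```python
# How many decimal digits does 2**(N - n) have for d = 20, n = 10,000?
from math import log10

N, n = 2 ** 20, 10_000
print(N - n, int((N - n) * log10(2)) + 1)  # 1,038,576 free inputs; ~312,643 digits
```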

From discrete to continuous features

Now imagine each feature is a real number in \([0,1]\):

  • Feature space: \[ X = [0,1]^d \]
  • A model is a function: \[ f : [0,1]^d \to Y \]

We’d like our data to be dense enough that we can approximate \(f(x)\)
everywhere in \([0,1]^d\), not just at the sample points.

A grid in 1 dimension

Start with \(d=1\):

  • Interval \([0,1]\)
  • We want a resolution of about \(0.1\):
    • Divide into 10 equal segments
    • Place a grid point in each segment

We need about 10 points to cover \([0,1]\) at resolution \(0.1\).

(Each point “represents” nearby values.)

A grid in 2 and 3 dimensions

Now \(d=2\):

  • Square \([0,1]^2\)
  • Use the same spacing \(0.1\) in each dimension:
    • 10 grid points along \(x\)
    • 10 grid points along \(y\)
    • Total grid points: \(10 \times 10 = 100\)

\(d=3\):

  • Cube \([0,1]^3\)
  • 10 grid points per dimension → \(10^3 = 1{,}000\) total

Question:
With a partner, extend this pattern to \(d = 5\) and \(d = 10\).

A grid in \(d\) dimensions

In general:

  • If we use 10 grid points per dimension (spacing \(\approx 0.1\)),
  • Then the total number of grid points is: \[ 10^d \]

Some values:

  • \(d = 2\): \(10^2 = 100\)
  • \(d = 5\): \(10^5 = 100{,}000\)
  • \(d = 10\): \(10^{10} = 10{,}000{,}000{,}000\)

To “cover” \([0,1]^d\) with this resolution, we’d need about \(10^d\) samples.
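
The grid itself is easy to describe in code (a sketch; cell centers at spacing 0.1), though for \(d = 10\) we can only count it, not store it:

```python
# Number of grid points at spacing 0.1 in [0,1]^d, and the explicit grid for d = 2.
from itertools import product

for d in (2, 5, 10):
    print(d, 10 ** d)                        # 100, 100000, 10000000000

centers = [(i + 0.5) / 10 for i in range(10)]
grid_2d = list(product(centers, repeat=2))   # the 100 cell centers in [0,1]^2
print(len(grid_2d))
```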

What can 10,000 points cover?

Suppose we have \(n = 10{,}000\) data points.

Compare:

  • \(d=2\): need \(\approx 100\) grid points
    • 10,000 >> 100 → many points per cell
  • \(d=5\): need \(\approx 100{,}000\) grid points
    • 10,000 < 100,000 → many empty cells
  • \(d=10\): need \(\approx 10^{10}\) grid points
    • 10,000 is negligible

As dimension \(d\) grows, a fixed number of points becomes very sparse in \([0,1]^d\).
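
A small simulation (an illustration, not from the slides) makes the sparsity visible: drop \(n = 10{,}000\) uniform random points into \([0,1]^d\) and count how many of the \(10^d\) grid cells contain at least one point.

```python
# Fraction of spacing-0.1 grid cells that 10,000 random points actually touch.
import random

random.seed(0)
n = 10_000
for d in (2, 5, 10):
    occupied = {tuple(int(random.random() * 10) for _ in range(d))
                for _ in range(n)}
    print(f"d={d:2d}  cells=10^{d}  occupied={len(occupied):,}")
# d=2: essentially all 100 cells are hit; d=10: at most 10,000 of 10**10 cells.
```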

Curse of dimensionality

One way to state it:

To approximate an arbitrary function \(f : [0,1]^d \to Y\) everywhere with a fixed resolution, the number of required samples grows exponentially with the dimension \(d\).

So:

  • High-dimensional spaces are extremely sparse
  • Most of the volume is “far away” from any given data point (see the sketch below)
  • It is impossible to densely sample the space unless we have an enormous amount of data
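
The “far away” claim can be illustrated with a quick experiment (assuming numpy and scipy are available; this is a sketch, not part of the slides): for a fixed number of uniform random points in \([0,1]^d\), the typical distance to the nearest neighbour grows with \(d\).

```python
# Mean nearest-neighbour distance among random points in [0,1]^d, for several d.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n = 2_000
for d in (2, 5, 10, 50):
    pts = rng.random((n, d))
    dist, _ = cKDTree(pts).query(pts, k=2)   # k=2: nearest point other than itself
    print(f"d={d:2d}  mean NN distance = {dist[:, 1].mean():.3f}")
```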

Putting it together: data, functions, and generalization

Across these two lessons:

  • The input space \(X\) (feature space) can be huge:
    • discrete: \(|X| = 2^d\) for \(d\) binary features
    • continuous: \([0,1]^d\) needs \(10^d\) grid points for modest resolution
  • A model is a function \(f : X \to Y\)
  • Even after observing data on \(n\) points:
    • there are still many functions that fit the data perfectly
    • especially when \(|X|\) is enormous and \(n \ll |X|\)

ML algorithms cope by:

  • restricting to a structured hypothesis class (linear, trees, neural nets, …)
  • using inductive biases (smoothness, simplicity, etc.) so that sparse data can still guide the choice of \(f\)

Big picture

  • Counting tells us that:
    • High-dimensional feature spaces explode in size
    • Our datasets occupy a tiny fraction of those spaces
  • Functions lesson:
    • Huge number of possible functions \(f : X \to Y\)
    • Many functions remain consistent with the same data
  • Curse-of-dimensionality lesson:
    • As dimension grows, data become sparser
    • Approximating a function everywhere in \(X\) requires exponentially many samples

Reflection:

In your own words: why does adding more features make learning a function from data harder, from a counting perspective?