Discrete Math for Data Science

DSCI 220, 2025 W1

October 28, 2025

Announcements

Regular Expressions

Regular Expressions as String Filters

Goals:

Use regex to filter strings (and rows in a DataFrame).
Read/write the core regex operators.
See a regex as a set descriptor (a language over an alphabet).
Build boolean masks: df['col'].str.contains(pattern).

From Sets to Strings

Sets:

Math: $A=\{x\in U: P(x)\}$
Strings: $L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}$

Key idea: A regular expression specifies membership in a set of strings using patterns of characters.

We treat regexes both as filters and as languages.

Warm-Up: In or Out?

pattern = ^[A-Z]{2}\d{3}$

$L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}$

Which strings are in the set $L$?

CS110

ab123

DS2200

MA101
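The warm-up can be checked mechanically. The sketch below uses Python's standard `re` module; the names `candidates` and `in_L` are illustrative and not part of the slides.

```python
import re

# Pattern from the slide: exactly two uppercase letters, then exactly three digits.
pattern = r'^[A-Z]{2}\d{3}$'

candidates = ['CS110', 'ab123', 'DS2200', 'MA101']

# Membership test: w is in L exactly when the pattern matches w.
in_L = {w: re.search(pattern, w) is not None for w in candidates}

for w, ok in in_L.items():
    print(f'{w!r}: {"in" if ok else "out"}')
```

ab123 fails the uppercase class, and DS2200 has one digit too many for {3} once the anchors pin both ends.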

Standard Operators

Literals: cat

Dot: . any char (except newline)

Class: [aeiou] or ranges like [A-Z0-9]

Negated class: [^,] anything but comma

Alternation: (cat|dog)

Quantifiers: a* 0 or more, a+ 1 or more, a? 0 or 1, a{m,n} between m and n

Anchors: ^ start, $ end

Escapes: \. literal dot

Groups: ( ) captures, (?: ) non-capturing
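As a rough illustration of the operators above, the following sketch pairs each one with a string it matches and a string it rejects. The example strings are made up for this illustration; only standard `re` calls are used.

```python
import re

# (operator, pattern, a string that matches, a string that does not)
examples = [
    ('literal',       r'cat',       'concat',   'dog'),
    ('dot',           r'c.t',       'cut',      'ct'),
    ('class',         r'[aeiou]',   'sky blue', 'sky'),
    ('negated class', r'^[^,]+$',   'a b',      'a,b'),
    ('alternation',   r'(cat|dog)', 'hotdog',   'bird'),
    ('quantifier',    r'^a{2,3}$',  'aaa',      'aaaa'),
    ('optional',      r'^colou?r$', 'color',    'colouur'),
    ('anchors',       r'^cat$',     'cat',      'concat'),
    ('escape',        r'a\.b',      'a.b',      'axb'),
    ('non-capturing', r'^(?:ab)+$', 'abab',     'aba'),
]

results = []
for name, pat, yes, no in examples:
    hit = re.search(pat, yes) is not None   # pattern found somewhere in `yes`
    miss = re.search(pat, no) is None       # pattern found nowhere in `no`
    results.append((name, hit, miss))
    print(f'{name:14} {pat:12} matches {yes!r}: {hit}; rejects {no!r}: {miss}')

# Capturing groups keep the matched pieces for later use.
m = re.search(r'([A-Z]{2})(\d{3})', 'CS110')
print(m.groups())   # ('CS', '110')
```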

Regex vs. Language

Fix an alphabet of characters, $\Sigma$.

Then a pattern defines a language $L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}$:

If a pattern matches nothing, $L=\varnothing$
^$ matches only the empty string ⇒ $L=\{\varepsilon\}$
Pattern a means $L=\{\texttt{a}\}$
Concatenated patterns: AB means $L(A)L(B)=\{xy : x\in L(A),\ y\in L(B)\}$
Union of patterns: A|B means $L(A)\cup L(B)$
Kleene star: A* means all finite concatenations of strings from $L(A)$ (incl. $\varepsilon$)
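One way to see a pattern as a set is to list the short strings it accepts. The sketch below assumes the small alphabet $\Sigma=\{a,b\}$ and a length cap of 3, both chosen only for illustration; the helper `language` is hypothetical. It uses `re.fullmatch`, which requires the whole string to match and so plays the role of wrapping the pattern in ^ and $.

```python
import re
from itertools import product

def language(pattern, alphabet='ab', max_len=3):
    """Strings over `alphabet` of length <= max_len that the whole pattern matches."""
    words = (''.join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return [w for w in words if re.fullmatch(pattern, w)]

print(language(r'[^ab]'))  # []                     nothing over this alphabet matches
print(language(r''))       # ['']                   only the empty string
print(language(r'a'))      # ['a']                  a single literal
print(language(r'ab'))     # ['ab']                 concatenation
print(language(r'a|b'))    # ['a', 'b']             union
print(language(r'a*'))     # ['', 'a', 'aa', 'aaa'] Kleene star, cut off at length 3
```

The true language of a* is infinite; the cap only limits how much of it gets printed.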

Say the Set

^(ab|ba)*$: strings over {a,b} built by concatenating zero or more blocks, each ab or ba (so the empty string is included).

^(0|1)*11(0|1)*$: all binary strings containing 11 as a substring.

^1(01|10)*0$: binary strings that start with 1, end with 0, and have a middle made of zero or more two-character blocks, each 01 or 10.
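A description like these can be sanity-checked by writing it as ordinary Python and comparing it with the regex on every short string. This is only a sketch: the helpers `words` and `pairs_ok` are hypothetical, and agreement up to length 8 is evidence, not a proof.

```python
import re
from itertools import product

def words(alphabet, max_len):
    """All strings over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for t in product(alphabet, repeat=n):
            yield ''.join(t)

def pairs_ok(w, allowed):
    """True if w splits into length-2 blocks that all lie in `allowed`."""
    return len(w) % 2 == 0 and all(w[i:i + 2] in allowed
                                   for i in range(0, len(w), 2))

# (pattern, alphabet, plain-Python version of the English description)
claims = [
    (r'^(ab|ba)*$', 'ab',
     lambda w: pairs_ok(w, {'ab', 'ba'})),
    (r'^(0|1)*11(0|1)*$', '01',
     lambda w: '11' in w),
    (r'^1(01|10)*0$', '01',
     lambda w: len(w) >= 2 and w[0] == '1' and w[-1] == '0'
               and pairs_ok(w[1:-1], {'01', '10'})),
]

agree = []
for pattern, alphabet, described in claims:
    # Does the regex accept exactly the strings the description accepts?
    same = all((re.search(pattern, w) is not None) == described(w)
               for w in words(alphabet, 8))
    agree.append(same)
    shortest = [w for w in words(alphabet, 4) if re.search(pattern, w)]
    print(pattern, same, shortest)
```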

Back to the Cafe

Suppose a column drink has values like: “Iced Latte”, “Drip Coffee”, “Americano”, “Matcha Tea”, “Espresso”

Goals:

Keep espresso-style drinks: “espresso”, “americano”, “latte”, or “cappuccino”
Keep drinks where “Iced” appears as a whole word
Keep drinks that end with “Tea”

Cafe continued

Typical Patterns

Whole word: r'\bIced\b'

Starts with: r'^Iced'

Ends with: r'Tea$'

Optional word: r'^(Iced )?Latte$'

One of set (case-insensitive):

r'(?i)\b(espresso|americano|latte|cappuccino)\b'
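Applied to the cafe column, these patterns become boolean masks. The sketch below assumes pandas is installed and builds a small DataFrame from the five sample values. The alternation uses a non-capturing group (?: ) because `str.contains` warns when a pattern contains capturing groups, and the whole-word pattern is written with a capital I so that it matches the sample values.

```python
import pandas as pd

# Sample data taken from the example values on the slide.
df = pd.DataFrame({'drink': ['Iced Latte', 'Drip Coffee', 'Americano',
                             'Matcha Tea', 'Espresso']})

# Each mask is a boolean Series with one entry per row.
espresso_style = df['drink'].str.contains(
    r'(?i)\b(?:espresso|americano|latte|cappuccino)\b')
has_iced = df['drink'].str.contains(r'\bIced\b')
ends_tea = df['drink'].str.contains(r'Tea$')

# Indexing with a mask keeps only the rows where it is True.
print(df[espresso_style])
print(df[has_iced])
print(df[ends_tea])
```

Masks combine with `&`, `|`, and `~`, in the same way predicates combine with and, or, and not.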

Your Turn

Filter for:

Drinks exactly “Iced Latte” (any case)
Drinks containing either “Americano” or “Espresso” (whole words, any case)
Drinks that do not contain “Iced” but end with “Tea” (hint: ________________)

Reminder: From Filter to Language

Let $\Sigma$ = letters, numbers, and spaces.

Pattern (?i)^(espresso|americano|latte|cappuccino)$ denotes a language $L\subseteq \Sigma^*$.

This parallels:

Set: $L=\{\,w\in\Sigma^*: \text{pattern matches } w\,\}$

Predicate: $P(w)=\texttt{matches}(w,\text{pattern})$

Mask: df['drink'].str.match(pattern)

You’ve been doing language membership tests all class, with filters.
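The three views can be written side by side. This is a minimal sketch using only the standard `re` module; the list `drinks` is made-up sample data and the names `P`, `L_sample`, and `mask` are illustrative. With pandas, `df['drink'].str.match(pattern)` would give the same mask as a Series.

```python
import re

# The inline flag (?i) goes at the very start of the pattern.
pattern = r'(?i)^(espresso|americano|latte|cappuccino)$'

drinks = ['Iced Latte', 'Drip Coffee', 'Americano', 'Matcha Tea',
          'Espresso', 'latte']

def P(w):
    """Predicate: is w a member of the language denoted by `pattern`?"""
    return re.match(pattern, w) is not None

L_sample = {w for w in drinks if P(w)}   # the set L, restricted to the sample
mask = [P(w) for w in drinks]            # boolean mask, one entry per row

print(L_sample)
print(mask)
```

“Iced Latte” is out because the anchors require the whole string to be one of the four names.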