Discrete Math for Data Science

DSCI 220, 2025 W1

October 28, 2025

Announcements

Regular Expressions

Regular Expressions as String Filters

Goals:

  • Use regex to filter strings (and rows in a DataFrame).

  • Read/write the core regex operators.

  • See a regex as a set descriptor (a language over an alphabet).

  • Build boolean masks: df['col'].str.contains(pattern).

From Sets to Strings

Sets:

  • Math: \(A=\{x\in U: P(x)\}\)

  • Strings: \(L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}\)

Key idea: A regular expression specifies membership in a set of strings using patterns of characters.

We treat regex as both filters and as languages.

Warm-Up: In or Out?

pattern = ^[A-Z]{2}\d{3}$

\(L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}\)

 

Which strings are in the set \(L\)?

 

CS110

ab123

DS2200

MA101
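One way to check your answers is to run the membership test directly. The sketch below uses Python's standard `re` module; the helper name `in_language` is made up for illustration.

```python
import re

# Pattern from the slide: exactly two uppercase letters, then exactly three digits.
pattern = r"^[A-Z]{2}\d{3}$"

candidates = ["CS110", "ab123", "DS2200", "MA101"]

def in_language(w: str) -> bool:
    """Return True when w is in L, i.e. when the pattern matches w."""
    return re.search(pattern, w) is not None

# Map each candidate string to its membership result.
membership = {w: in_language(w) for w in candidates}

for w, ok in membership.items():
    print(f"{w!r}: {'in' if ok else 'out'}")
```

Because the pattern is anchored with ^ and $, a string is in \(L\) only if the whole string fits the pattern, not just a piece of it.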


Standard Operators

Literals: cat

Dot: . any char (except newline)

Class: [aeiou] or ranges like [A-Z0-9]

Negated class: [^,] anything but comma

Alternation: (cat|dog)

Quantifiers: a* 0 or more, a+ 1 or more, a? 0 or 1, a{m,n} between m and n

Anchors: ^ start, $ end

Escapes: \. literal dot

Groups: ( ) captures, (?: ) non-capturing
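The operators above can be tried one at a time. In this small sketch, each row pairs a pattern with one string it should match and one it should not; the example strings are invented for illustration and are not from the slides.

```python
import re

# (operator, pattern, a string that should match, a string that should not)
examples = [
    ("literal",       r"cat",       "concat",   "dog"),
    ("dot",           r"c.t",       "cut",      "ct"),
    ("class",         r"[aeiou]",   "sky high", "rhythm"),
    ("negated class", r"^[^,]+$",   "a b",      "a,b"),
    ("alternation",   r"(cat|dog)", "hotdog",   "bird"),
    ("star",          r"^ba*$",     "b",        "bab"),
    ("plus",          r"^ba+$",     "baaa",     "b"),
    ("optional",      r"^colou?r$", "color",    "colouur"),
    ("range",         r"^\d{2,3}$", "123",      "1234"),
    ("anchors",       r"^cat$",     "cat",      "concat"),
    ("escape",        r"a\.b",      "a.b",      "axb"),
]

# For each row, record whether the pattern was found in each of the two strings.
results = [
    (name, bool(re.search(p, yes)), bool(re.search(p, no)))
    for name, p, yes, no in examples
]

for name, hit, miss in results:
    print(f"{name:14} matches first: {hit}, matches second: {miss}")
```

Note that `re.search` looks for the pattern anywhere in the string, so unanchored patterns such as `cat` also succeed on `concat`.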

Regex vs. Language

Fix an alphabet of characters, \(\Sigma\).

Then a pattern defines the language \(L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}\):

  • if a pattern matches nothing, \(L=\varnothing\)

  • ^$ matches only empty string ⇒ \(L=\{\varepsilon\}\)

  • pattern a means \(L=\{\)a\(\}\)

  • Concatenated patterns: AB means \(L(A)L(B)=\{xy : x\in L(A),\ y\in L(B)\}\)

  • Union of patterns: A|B means \(L(A)\cup L(B)\)

  • Kleene Star: A* means all finite concatenations from \(L(A)\) (incl. \(\varepsilon\))
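For a small alphabet, these identities can be checked by brute force: list every string up to some length and keep the ones the pattern fully matches. The sketch below does this; the helper `language`, the alphabet {a, b}, and the example patterns A and B are all assumptions chosen for illustration.

```python
import re
from itertools import product

def language(pattern: str, alphabet: str, max_len: int) -> list[str]:
    """List every w over `alphabet` with len(w) <= max_len that the pattern matches in full."""
    words = (
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
    )
    return [w for w in words if re.fullmatch(pattern, w)]

sigma = "ab"
A, B = r"(?:a|ab)", r"(?:b|ba)"   # two small example patterns

L_A = language(A, sigma, 4)       # the language of A
L_B = language(B, sigma, 4)       # the language of B

empty = language(r"(?!)", sigma, 3)    # (?!) is a lookahead that always fails
only_eps = language(r"^$", sigma, 3)   # matches only the empty string
concat = language(A + B, sigma, 4)     # concatenation AB
union = language(A + "|" + B, sigma, 4)
star = language(r"(?:ab)*", sigma, 4)  # Kleene star, truncated at length 4

print(L_A, L_B, concat, union, star, sep="\n")
```

The length bound matters only for the star: \(L(A^*)\) is infinite, so the listing shows just its members of length at most 4.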

Say the Set

 

^(ab|ba)*$: strings over {a,b} formed by concatenating zero or more copies of ab or ba (so the empty string is included).

 

^(0|1)*11(0|1)*$: all binary strings containing 11 as a substring.

 

^1(01|10)*0$: binary strings that start with 1, end with 0, and have zero or more two-character blocks in between, each of which is 01 or 10.
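A plain-English description of a language can be tested against its regex by writing the description as an ordinary Python predicate and comparing the two on every short string. This is a sketch of that check; the helper names (`all_strings`, `in_pairs`, `disagreements`) and the length bound of 8 are assumptions for illustration, and agreement on short strings is evidence rather than a proof.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def in_pairs(s, allowed):
    """True when s splits into consecutive two-character blocks, all in `allowed`."""
    return len(s) % 2 == 0 and all(s[i:i + 2] in allowed for i in range(0, len(s), 2))

# (regex, alphabet, the English description written as a predicate)
claims = [
    (r"^(ab|ba)*$", "ab", lambda w: in_pairs(w, {"ab", "ba"})),
    (r"^(0|1)*11(0|1)*$", "01", lambda w: "11" in w),
    (r"^1(01|10)*0$", "01",
     lambda w: len(w) >= 2 and w[0] == "1" and w[-1] == "0"
     and in_pairs(w[1:-1], {"01", "10"})),
]

def disagreements(pattern, alphabet, predicate, max_len=8):
    """Strings on which the regex and the predicate give different answers."""
    return [
        w for w in all_strings(alphabet, max_len)
        if bool(re.search(pattern, w)) != predicate(w)
    ]

report = {pattern: disagreements(pattern, alphabet, pred)
          for pattern, alphabet, pred in claims}

for pattern, bad in report.items():
    print(pattern, "disagreements:", bad)
```

An empty list for a pattern means the description and the regex agree on every string up to the bound.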

Back to the Cafe

Suppose a column drink has values like: “Iced Latte”, “Drip Coffee”, “Americano”, “Matcha Tea”, “Espresso”

 

Goals:

  1. Keep espresso-style drinks: “espresso”, “americano”, “latte”, or “cappuccino”

  2. “Iced” appears as a whole word

  3. Drinks that end with “Tea”

Cafe continued

Typical Patterns

Whole word: r'\biced\b' (lowercase as written; add (?i) to also match “Iced”)

Starts with: r'^Iced'

Ends with: r'Tea$'

Optional word: r'^(Iced )?Latte$'

One of set (case-insensitive):

r'(?i)\b(espresso|americano|latte|cappuccino)\b'
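Applied to the sample drink values, these patterns give one boolean per row. The sketch below builds the masks with the standard `re` module so that it runs without pandas; the helper `mask` is hypothetical. It applies `re.search` to each value, which is roughly what df['drink'].str.contains(pattern) does per row.

```python
import re

# Sample values of the drink column, taken from the slide.
drinks = ["Iced Latte", "Drip Coffee", "Americano", "Matcha Tea", "Espresso"]

def mask(pattern, values, flags=0):
    """Boolean mask: True where the pattern is found somewhere in the value."""
    return [re.search(pattern, v, flags) is not None for v in values]

# Goal 1: espresso-style drinks, any case, whole words only.
espresso_style = mask(r"(?i)\b(espresso|americano|latte|cappuccino)\b", drinks)

# Goal 2: "Iced" as a whole word. The pattern is lowercase, so ignore case.
iced_word = mask(r"\biced\b", drinks, re.IGNORECASE)

# Goal 3: drinks that end with "Tea".
ends_with_tea = mask(r"Tea$", drinks)

# Use a mask the way a DataFrame filter would: keep rows where it is True.
kept = [d for d, keep in zip(drinks, espresso_style) if keep]
print(kept)
```

With pandas, the same filter would be written as df[df['drink'].str.contains(pattern)], assuming a DataFrame df with a drink column.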

Your Turn

Filter for:

  1. Drinks exactly “Iced Latte” (any case)

  2. Drinks containing either “Americano” or “Espresso” (whole words, any case)

  3. Drinks that do not contain “Iced” but end with “Tea” (hint: ________________)


Reminder: From Filter to Language

Let \(\Sigma\) = letters, numbers, and spaces.

Pattern (?i)^(espresso|americano|latte|cappuccino)$ denotes a language \(L\subseteq \Sigma^*\).

This parallels:

Set: \(L=\{\,w\in\Sigma^*: \text{pattern matches } w\,\}\)

Predicate: \(P(w)=\texttt{matches}(w,\text{pattern})\)

Mask: df['drink'].str.match(pattern)

You’ve been doing language membership tests all class, with filters.
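The three views can be written side by side in a few lines. This is a minimal sketch using the standard `re` module and the sample drinks from earlier. The pattern is written with balanced parentheses and the (?i) flag first, which is an assumption about the intended pattern, and the names `P`, `L_sample`, and `kept` are illustrative.

```python
import re

pattern = r"(?i)^(espresso|americano|latte|cappuccino)$"
drinks = ["Iced Latte", "Drip Coffee", "Americano", "Matcha Tea", "Espresso"]

# Predicate: P(w) is True exactly when the pattern matches w.
def P(w: str) -> bool:
    return re.search(pattern, w) is not None

# Set: the members of L that appear in our sample (L itself lives inside Sigma*).
L_sample = {w for w in drinks if P(w)}

# Mask: one boolean per row, in row order.
mask = [P(w) for w in drinks]

# Filtering with the mask recovers the same strings as the set.
kept = [w for w, keep in zip(drinks, mask) if keep]

print(mask)
print(kept)
```

"Iced Latte" is filtered out here because the anchors ^ and $ require the whole value to be one of the four words.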