DSCI 220, 2025 W1
October 28, 2025
Goals:
Use regex to filter strings (and rows in a DataFrame).
Read/write the core regex operators.
See a regex as a set descriptor (a language over an alphabet).
Build boolean masks: df['col'].str.contains(pattern).
Sets:
Math: \(A=\{x\in U: P(x)\}\)
Strings: \(L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}\)
Key idea: A regular expression specifies membership in a set of strings using patterns of characters.
We treat regex as both filters and as languages.
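The parallel between set-builder notation and regex filtering can be sketched in Python. This is only an illustration: the universe `U`, the predicate `is_even`, the candidate list `words`, and the pattern `^c.t$` are made up for the example and do not come from the slides.

```python
import re

# Math: A = {x in U : P(x)}, with a hypothetical universe and predicate.
U = range(10)

def is_even(x):
    return x % 2 == 0

A = {x for x in U if is_even(x)}

# Strings: L = {w : matches(w, pattern)}. Sigma* is infinite, so we test
# membership over a small, made-up list of candidate words instead.
pattern = r"^c.t$"
words = ["cat", "cot", "coat", "dog"]
L = {w for w in words if re.search(pattern, w)}

print(sorted(A))
print(sorted(L))
```

Both sets are built the same way: a comprehension that keeps the elements for which a predicate is true.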
pattern = ^[A-Z]{2}\d{3}$, \(L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}\)
Which strings are in the set \(L\)?
CS110
ab123
DS2200
MA101
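One way to check the four candidates is to run the membership test directly. A minimal sketch using Python's standard `re` module; the variable names are chosen for illustration.

```python
import re

pattern = r"^[A-Z]{2}\d{3}$"
candidates = ["CS110", "ab123", "DS2200", "MA101"]

# matches(w, pattern) is taken here to mean "re.search finds a match".
# Because the pattern is anchored with ^ and $, the whole string must fit.
for w in candidates:
    in_L = re.search(pattern, w) is not None
    print(f"{w!r:10} in L? {in_L}")

members = [w for w in candidates if re.search(pattern, w)]
```

Here `ab123` is rejected because `[A-Z]` only accepts uppercase letters, and `DS2200` is rejected because `\d{3}` followed by `$` allows exactly three digits.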
Literals: cat
Dot: . any char (except newline)
Class: [aeiou] or ranges like [A-Z0-9]
Negated class: [^,] anything but comma
Alternation: (cat|dog)
Quantifiers: a* 0 or more, a+ 1 or more, a? 0 or 1, a{m,n} between m and n
Anchors: ^ start, $ end
Escapes: \. literal dot
Groups: ( ) captures, (?: ) non-capturing
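Each operator above can be tried in isolation. The following is a small sketch using Python's `re` module; the test strings (such as `"concatenate"` and `"colou?r"`) are invented for illustration.

```python
import re

# Literal and dot: "." stands for any one character except a newline.
assert re.fullmatch(r"c.t", "cut")
assert not re.fullmatch(r"c.t", "c\nt")

# Character class and negated class.
vowels = re.findall(r"[aeiou]", "regex")
fields = re.findall(r"[^,]+", "a,b,c")

# Alternation: either branch may match.
assert re.fullmatch(r"(cat|dog)", "dog")

# Quantifiers: * (0 or more), + (1 or more), ? (0 or 1), {m,n} (m to n).
assert re.fullmatch(r"a*", "")
assert not re.fullmatch(r"a+", "")
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"a{2,3}", "aaa") and not re.fullmatch(r"a{2,3}", "aaaa")

# Anchors: ^ and $ pin the match to the start and end of the string.
assert re.search(r"cat", "concatenate")
assert not re.search(r"^cat$", "concatenate")

# Escape: \. is a literal dot, not "any character".
assert re.search(r"\.", "a.b") and not re.search(r"\.", "ab")

# Groups: ( ) captures its text, (?: ) groups without capturing.
m = re.fullmatch(r"(\d+)-(?:\d+)", "12-34")
captured = m.groups()

print(vowels, fields, captured)
```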
Fix an alphabet of characters, \(\Sigma\).
Then a language \(L=\{w\in\Sigma^*: \texttt{matches}(w,\text{pattern})\}\)
If a pattern matches nothing, \(L=\varnothing\)
^$ matches only the empty string ⇒ \(L=\{\varepsilon\}\)
Pattern a means \(L=\{\texttt{a}\}\)
Concatenation of patterns: AB means \(L(A)L(B)=\{xy : x\in L(A),\ y\in L(B)\}\)
Union of patterns: A|B means \(L(A)\cup L(B)\)
Kleene Star: A* means all finite concatenations from \(L(A)\) (incl. \(\varepsilon\))
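These set identities can be checked by brute force on a small alphabet. The sketch below assumes a hypothetical helper `language(pattern, alphabet, max_len)` that enumerates every string up to a length bound and keeps the ones that fully match. The sub-patterns `A` and `B` are made up for the example.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """Strings over `alphabet` with length <= max_len that fully match `pattern`."""
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if re.fullmatch(pattern, "".join(chars))
    }

sigma = "ab"
A, B = "a|ab", "b|ba"  # hypothetical sub-patterns

LA = language(A, sigma, 2)
LB = language(B, sigma, 2)

# Concatenation: L(AB) = {xy : x in L(A), y in L(B)}
concat = {x + y for x in LA for y in LB}
assert language(f"(?:{A})(?:{B})", sigma, 4) == concat

# Union: L(A|B) = L(A) ∪ L(B)
assert language(f"(?:{A})|(?:{B})", sigma, 2) == LA | LB

# Kleene star (truncated at length 2): always contains the empty string.
star = language(f"(?:{A})*", sigma, 2)
assert "" in star

# A pattern with no match over this alphabet gives the empty language,
# while ^$ gives the language containing only the empty string.
assert language("c", sigma, 3) == set()
assert language("^$", sigma, 3) == {""}

print(sorted(concat), sorted(star))
```

The length bound is what keeps the enumeration finite; the real languages for starred patterns are infinite.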
^(ab|ba)*$: strings over {a,b} that are concatenations of ab or ba.
^(0|1)*11(0|1)*$: all binary strings containing 11 as a substring.
^1(01|10)*0$: binary strings that start with 1, end with 0, and have zero or more two-character blocks in between, each either 01 or 10.
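The English descriptions of these three patterns can be tested against the regexes by enumerating short strings. A sketch under the assumption that checking all strings up to length 8 is convincing enough for a sanity check (it is not a proof); `strings_up_to` and `is_ab_ba_blocks` are helper names invented here.

```python
import re
from itertools import product

def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def is_ab_ba_blocks(w):
    """True if w splits into consecutive 2-character blocks, each 'ab' or 'ba'."""
    blocks = [w[i:i + 2] for i in range(0, len(w), 2)]
    return all(b in ("ab", "ba") for b in blocks)

# 1) ^(ab|ba)*$ agrees with the block description (empty string included).
for w in strings_up_to("ab", 8):
    assert bool(re.search(r"^(ab|ba)*$", w)) == is_ab_ba_blocks(w)

# 2) ^(0|1)*11(0|1)*$ agrees with "contains 11 as a substring".
for w in strings_up_to("01", 8):
    assert bool(re.search(r"^(0|1)*11(0|1)*$", w)) == ("11" in w)

# 3) Members of ^1(01|10)*0$ with length at most 4.
members = [w for w in strings_up_to("01", 4) if re.search(r"^1(01|10)*0$", w)]
print(members)
```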
Suppose a column drink has values like: “Iced Latte”, “Drip Coffee”, “Americano”, “Matcha Tea”, “Espresso”
Goals:
Keep espresso-style drinks: “espresso”, “americano”, “latte”, or “cappuccino”
“Iced” appears as a whole word
Drinks that end with “Tea”
Whole word: r'\biced\b' (pass case=False so it also matches "Iced")
Starts with: r'^Iced'
Ends with: r'Tea$'
Optional word: r'^(Iced )?Latte$'
One of set (case-insensitive):
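A minimal sketch of how these patterns might be applied to the sample drink column with pandas, including one possible one-of-a-set filter. The DataFrame `df` and the mask names are assumptions for illustration. The one-of-a-set pattern uses a non-capturing group, since `str.contains` warns when a pattern has capturing groups.

```python
import pandas as pd

df = pd.DataFrame(
    {"drink": ["Iced Latte", "Drip Coffee", "Americano", "Matcha Tea", "Espresso"]}
)

# Whole word "iced", ignoring case.
has_iced = df["drink"].str.contains(r"\biced\b", case=False)

# Ends with "Tea" (case-sensitive).
ends_tea = df["drink"].str.contains(r"Tea$")

# One of a set, as a whole word, ignoring case.
espresso_style = df["drink"].str.contains(
    r"\b(?:espresso|americano|latte|cappuccino)\b", case=False
)

# A boolean mask selects the rows where it is True.
print(df[espresso_style])
```

Each mask is a boolean Series aligned with `df`, so masks can be used directly for row selection.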
Filter for:
Drinks exactly “Iced Latte” (any case)
Drinks containing either “Americano” or “Espresso” (whole words, any case)
Drinks that do not contain “Iced” but end with “Tea” (hint: ________________)
Let \(\Sigma\) = letters, numbers, and spaces.
Pattern (?i)^(espresso|americano|latte|cappuccino)$ denotes a language \(L\subseteq \Sigma^*\).
This parallels:
Set: \(L=\{\,w\in\Sigma^*: \text{pattern matches } w\,\}\)
Predicate: \(P(w)=\texttt{matches}(w,\text{pattern})\)
Mask: df['drink'].str.match(pattern)
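The three views (set, predicate, mask) can be placed side by side in code. This sketch assumes the intended pattern is `(?i)^(espresso|americano|latte|cappuccino)$` and reuses the sample drink values. The names `P`, `L`, and `mask` mirror the notation above but are otherwise illustrative.

```python
import re
import pandas as pd

pattern = r"(?i)^(espresso|americano|latte|cappuccino)$"
drinks = ["Iced Latte", "Drip Coffee", "Americano", "Matcha Tea", "Espresso"]

# Predicate: P(w) = matches(w, pattern)
def P(w):
    return re.search(pattern, w) is not None

# Set: the members of L among the sample strings.
L = {w for w in drinks if P(w)}

# Mask: the same test applied to every row of a column.
df = pd.DataFrame({"drink": drinks})
mask = df["drink"].str.match(pattern)

# All three views agree on which strings are in the language.
assert mask.tolist() == [P(w) for w in drinks]
assert set(df.loc[mask, "drink"]) == L

print(sorted(L))
```

Note that `str.match` only anchors at the start of each string, so the `$` in the pattern is what forces the whole value to match. "Iced Latte" is excluded here for that reason.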
You’ve been doing language membership tests all class, with filters.