Programming, problem solving, and algorithms

CPSC 203, 2025 W1

November 27, 2025

Announcements

Text as Data

Visualizing Literature

Lexical Novelty

“Novels are full of new characters, new locations and new expressions. The discourse between characters involves new ideas being exchanged. We can get a hint of this by tracking the introduction of new terms in a novel. In the below visualizations (in which each column represents a chapter and each small block a paragraph of text), I maintain a variable which represents novelty. When a paragraph contains more than 25% new terms (i.e. words that have not been observed thus far) this variable is set at its maximum of 1. Otherwise, the variable decays. The variable is used to colour the paragraph with red being 1.0 and blue being 0. The result is that we can get an idea of the introduction of new ideas in novels.” - Matthew Hurst
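The novelty variable described in the quote can be sketched in a few lines of Python. This is only one reading of the description, not Hurst's actual code: the function name `novelty_scores`, the regular expression used to find terms, the choice to count distinct terms per paragraph, and the multiplicative `decay` rate are all assumptions. The quote only says that the variable "decays".

```python
import re

def novelty_scores(paragraphs, threshold=0.25, decay=0.8):
    """Return one novelty value between 0 and 1 for each paragraph."""
    seen = set()     # every term observed so far in the book
    novelty = 0.0    # the running novelty variable
    scores = []
    for paragraph in paragraphs:
        # Distinct lowercase terms in this paragraph (assumed definition of "term").
        terms = set(re.findall(r"[a-z']+", paragraph.lower()))
        new_terms = terms - seen
        if terms and len(new_terms) / len(terms) > threshold:
            novelty = 1.0        # more than 25% new terms: jump to the maximum
        else:
            novelty *= decay     # otherwise decay (the rate is an assumption)
        seen |= terms
        scores.append(novelty)
    return scores
```

Each score could then be mapped to a colour between blue (0) and red (1.0), one block per paragraph.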

Word Co-occurrence

Cross References

Sentiment Analysis

Mapping

Visualizing Literature

A novel is just a long sequence of characters.
If we treat it as data, we can:
- Count and visualize patterns,
- Compare books or authors,
- Build networks of characters and concepts.
Today: turn a novel into numbers and pictures.

What Can We Compute From a Book?

If I give you the full text of a novel as a .txt file:

What questions could we ask?
What numbers or graphs could we compute?

Ideas:

The Text-as-Data Pipeline

Data flow:

Load the text from a file
Tokenize: split into words
Preprocess:
- lowercase
- remove punctuation
- remove stopwords
- use lemmas
Count frequencies
Visualize as a bar chart
Interpret: what do the counts tell us?
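The steps above can be sketched using only the standard library. This is a minimal illustration under several assumptions: the text is already loaded into a string, `STOPWORDS` is a tiny made-up list (real stopword lists are much longer), lemmatization is skipped because it needs an external library, and the "bar chart" is printed as rows of `#` characters rather than drawn with a plotting package. The function names are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "her", "she", "it"}

def word_frequencies(text):
    """Tokenize, preprocess, and count the words in a string."""
    tokens = text.split()                                 # tokenize on whitespace
    tokens = [t.lower() for t in tokens]                  # lowercase
    tokens = [re.sub(r"[^\w']", "", t) for t in tokens]   # remove punctuation
    tokens = [t for t in tokens if t and t not in STOPWORDS]  # remove stopwords
    return Counter(tokens)                                # count frequencies

def print_bar_chart(counts, top=5):
    """Print the most common words as a simple text bar chart."""
    for word, n in counts.most_common(top):
        print(f"{word:<12} {'#' * n}")

counts = word_frequencies("The governess was muddled, and she got muddled.")
print_bar_chart(counts)
```

Interpreting the result is the part the code cannot do: here "muddled" dominates only because the sample is a single sentence.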

Step 1: Getting the Text into Python
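One common way to do this step is to read the whole file into a single string. The sketch below assumes the file is UTF-8 encoded; the helper name `load_text` and the file name in the usage note are hypothetical.

```python
from pathlib import Path

def load_text(path):
    """Read an entire text file into one string (assumes UTF-8 encoding)."""
    return Path(path).read_text(encoding="utf-8")
```

Usage would look like `text = load_text("novel.txt")`, after which `len(text)` gives the number of characters in the book.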

Tokenizer

We want to analyze the data by word or by ____________ or by ____________ or by ____________…

We can separate the data (a string) into any of these using a “tokenizer”.

Tokenization

Translate: “Astrology. The governess was always getting muddled with her astrolabe, and when she got specially muddled she would take it out of the Wart by rapping his knuckles. She did not rap Kay’s knuckles, because when Kay grew older”

Into: ['Astrology.', 'The', 'governess', 'was', 'always', 'getting', 'muddled', 'with', 'her', 'astrolabe', ',', 'and', 'when', 'she', 'got', 'specially', 'muddled', 'she', 'would', 'take', 'it', 'out', 'of', 'the', 'Wart', 'by', 'rapping', 'his', 'knuckles.', 'She', 'did', 'not', 'rap', 'Kay', "'s", 'knuckles', ',', 'because', 'when', 'Kay', 'grew', 'older']
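To see what a tokenizer has to decide, here is a small sketch comparing plain whitespace splitting with a regular-expression tokenizer. The `tokenize` function and its pattern are assumptions made for illustration, not the tokenizer that produced the list above. Different tokenizers make different choices (for example, whether to split "Kay's" or where to detach a period), so the output need not match that list exactly.

```python
import re

def tokenize(text):
    """Split text into words and single punctuation marks."""
    # A word, optionally containing internal apostrophes, OR one punctuation mark.
    return re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", text)

sample = "She did not rap Kay's knuckles, because"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
print(sample.split())

# The regular-expression version separates the comma into its own token.
print(tokenize(sample))
```

Note that whitespace splitting yields 'knuckles,' as one token, while `tokenize` yields 'knuckles' and ',' separately and keeps "Kay's" together.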