CPSC 203, 2025 W1
November 27, 2025

“Novels are full of new characters, new locations and new expressions. The discourse between characters involves new ideas being exchanged. We can get a hint of this by tracking the introduction of new terms in a novel. In the below visualizations (in which each column represents a chapter and each small block a paragraph of text), I maintain a variable which represents novelty. When a paragraph contains more than 25% new terms (i.e. words that have not been observed thus far) this variable is set at its maximum of 1. Otherwise, the variable decays. The variable is used to colour the paragraph with red being 1.0 and blue being 0. The result is that we can get an idea of the introduction of new ideas in novels.” - Matthew Hurst
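The novelty variable described in the quote can be sketched in a few lines. This is only a rough reading of the description, not Hurst's actual code: the 25% threshold comes from the quote, but the decay factor (0.8 here), the simple regex used to pull out terms, and the name `novelty_scores` are all assumptions made for illustration.

```python
import re

def novelty_scores(paragraphs, threshold=0.25, decay=0.8):
    """Return one novelty value in [0, 1] per paragraph.

    threshold follows the quote (more than 25% new terms).
    decay=0.8 is an assumption: the quote does not give a rate.
    """
    seen = set()      # every term observed so far
    novelty = 0.0     # the running novelty variable
    scores = []
    for paragraph in paragraphs:
        # crude term extraction: lowercase runs of letters and apostrophes
        terms = re.findall(r"[a-z']+", paragraph.lower())
        new_terms = [t for t in terms if t not in seen]
        if terms and len(new_terms) / len(terms) > threshold:
            novelty = 1.0        # many new terms: jump to the maximum
        else:
            novelty *= decay     # otherwise the variable decays
        seen.update(terms)
        scores.append(novelty)
    return scores

scores = novelty_scores(["the cat sat", "the cat sat", "the cat sat on a mat"])
```

Each score could then be mapped to a colour between blue (0) and red (1.0) to draw one block per paragraph.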
If I give you the full text of a novel as a .txt file
Ideas:
Data flow:
We want to analyze the data by word or by ____________ or by ____________ or by ____________…
We can separate the data (string) into any of these using a “tokenizer”
Translate: “Astrology. The governess was always getting muddled with her astrolabe, and when she got specially muddled she would take it out of the Wart by rapping his knuckles. She did not rap Kay’s knuckles, because when Kay grew older”
Into: [‘Astrology.’, ‘The’, ‘governess’, ‘was’, ‘always’, ‘getting’, ‘muddled’, ‘with’, ‘her’, ‘astrolabe’, ‘,’, ‘and’, ‘when’, ‘she’, ‘got’, ‘specially’, ‘muddled’, ‘she’, ‘would’, ‘take’, ‘it’, ‘out’, ‘of’, ‘the’, ‘Wart’, ‘by’, ‘rapping’, ‘his’, ‘knuckles.’, ‘She’, ‘did’, ‘not’, ‘rap’, ‘Kay’, “’s”, ‘knuckles’, ‘,’, ‘because’, ‘when’, ‘Kay’, ‘grew’, ‘older’]
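To get a feel for what a tokenizer does, here is a minimal regex-based sketch. It is a stand-in, not the tokenizer that produced the list above: the function name `tokenize` and the pattern are assumptions, and unlike the output above it splits every punctuation mark off, so “knuckles.” would become two tokens.

```python
import re

def tokenize(text):
    # Three alternatives, tried left to right at each position:
    #   's\b      the possessive/contraction ending, kept as one token
    #   \w+       a run of letters or digits (a word)
    #   [^\w\s]   any single punctuation character
    return re.findall(r"'s\b|\w+|[^\w\s]", text)

tokens = tokenize("She did not rap Kay's knuckles, because when Kay grew older")
print(tokens)
```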

Thirty most common tokens in the mystery text.
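One way to produce such a tally, assuming the tokens are already in a Python list, is `collections.Counter`. The helper name `most_common_tokens` and the tiny sample list are made up for illustration; they are not the mystery text.

```python
from collections import Counter

def most_common_tokens(tokens, n=30):
    # Counter tallies each distinct token; most_common(n) returns the
    # n highest counts as (token, count) pairs, largest first.
    return Counter(tokens).most_common(n)

sample = "the cat and the dog and the bird".split()
top_two = most_common_tokens(sample, n=2)
print(top_two)
```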

Unify tally for “Valor” and “valor”.
Depending on task, may not want to do this.
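A small sketch of unifying the tally by lowercasing each token before counting. The sample tokens are invented for illustration.

```python
from collections import Counter

tokens = ["Valor", "valor", "Wart", "valor"]

raw = Counter(tokens)                          # "Valor" and "valor" counted separately
folded = Counter(t.lower() for t in tokens)    # one shared tally after lowercasing

print(raw["Valor"], raw["valor"])   # separate counts
print(folded["valor"])              # combined count
```

Note that lowercasing also turns the name “Wart” into “wart”, which is one example of why a task might not want this step.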

Tokenizer leaves periods at end of sentences: “father.”
Amazingly, it works fine for “Dr.”, “$3.50”, “!”
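If the attached periods get in the way of counting, one possible cleanup is to strip them from the end of each token. This is only a sketch under assumptions: the `ABBREVIATIONS` set is hypothetical and deliberately tiny, and the helper name `strip_final_period` is invented here.

```python
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs."}  # hypothetical list, far from complete

def strip_final_period(token):
    stripped = token.rstrip(".")
    # Keep the token unchanged if it is a known abbreviation,
    # or if it consists only of periods (e.g. "." or "...").
    if token in ABBREVIATIONS or not stripped:
        return token
    return stripped

cleaned = [strip_final_period(t) for t in ["father.", "Dr.", "$3.50", "!"]]
print(cleaned)
```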

List of common, unhelpful words compiled by spaCy from large corpora. We keep words that aren’t in that list.
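Such lists are commonly called stop words. The filter itself is a one-liner; the sketch below uses a small hand-written set as a stand-in so that it runs without any installation. With spaCy installed, its English list can be imported as `STOP_WORDS` from `spacy.lang.en.stop_words` and passed in instead.

```python
# Hypothetical stand-in for a real stop-word list; spaCy's list is much longer.
STOP_WORDS = {"the", "a", "and", "of", "was", "with", "her", "she", "it", "by", "his", "when"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so "The" is filtered along with "the".
    return [t for t in tokens if t.lower() not in stop_words]

kept = remove_stop_words(["The", "governess", "was", "muddled", "with", "her", "astrolabe"])
print(kept)
```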
More sophisticated approach is called tf-idf.
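As a rough illustration of the idea, tf-idf scores a term highly when it is frequent in one document but rare across the collection. The sketch below uses one common textbook variant (term frequency times the log of inverse document frequency); libraries often use smoothed versions, and the function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            # (share of this document) * log(total documents / documents containing term)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [["the", "wart", "the"], ["the", "kay"]]
scores = tf_idf(docs)
```

A word such as “the” that appears in every document gets a score of zero, so it drops out without needing a hand-made list.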

goes -> go
running -> run
eaten -> eat
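Reducing each word to its dictionary form like this is usually called lemmatization. The toy sketch below hard-codes only the three mappings above in a lookup table; real lemmatizers rely on much larger dictionaries and on each word's part of speech (in spaCy, for example, each token carries a `lemma_` attribute). The names `LEMMAS` and `lemmatize` are invented for illustration.

```python
# Toy lookup table containing only the three examples from the notes.
LEMMAS = {"goes": "go", "running": "run", "eaten": "eat"}

def lemmatize(token):
    # Fall back to the token itself when it is not in the table.
    return LEMMAS.get(token.lower(), token)

lemmas = [lemmatize(t) for t in ["goes", "Running", "eaten", "astrolabe"]]
print(lemmas)
```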

