Programming, problem solving, and algorithms

CPSC 203, 2025 W1

November 27, 2025

Announcements

Text as Data

Visualizing Literature

Lexical Novelty

“Novels are full of new characters, new locations and new expressions. The discourse between characters involves new ideas being exchanged. We can get a hint of this by tracking the introduction of new terms in a novel. In the below visualizations (in which each column represents a chapter and each small block a paragraph of text), I maintain a variable which represents novelty. When a paragraph contains more than 25% new terms (i.e. words that have not been observed thus far) this variable is set at its maximum of 1. Otherwise, the variable decays. The variable is used to colour the paragraph with red being 1.0 and blue being 0. The result is that we can get an idea of the introduction of new ideas in novels.” - Matthew Hurst
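The novelty variable described in the quote can be sketched in a few lines. This is only an illustration of the idea, not Hurst's code: the 25% threshold comes from the quote, while the `decay` factor, the regular expression used to pull out terms, and the function name `novelty_trace` are assumptions made here.

```python
import re

def novelty_trace(paragraphs, threshold=0.25, decay=0.8):
    """Return one novelty value in [0, 1] per paragraph.

    threshold follows the quoted description (more than 25% new terms).
    decay is an assumed multiplicative factor; the quote does not say
    how the variable decays.
    """
    seen = set()      # every term observed so far
    novelty = 0.0
    trace = []
    for paragraph in paragraphs:
        terms = re.findall(r"[a-z']+", paragraph.lower())
        new_terms = [t for t in terms if t not in seen]
        if terms and len(new_terms) / len(terms) > threshold:
            novelty = 1.0        # enough new terms: reset to the maximum
        else:
            novelty *= decay     # otherwise let the value fade
        seen.update(terms)
        trace.append(novelty)
    return trace

print(novelty_trace(["the cat sat", "the cat sat", "the cat sat down quickly"]))
```

Each value in the returned list would then be mapped to a colour between blue (0) and red (1.0) for that paragraph's block.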

Word Co-occurrence

Cross References

Sentiment Analysis

Mapping

Visualizing Literature

  • A novel is just a long sequence of characters.
  • If we treat it as data, we can:
    • Count and visualize patterns,
    • Compare books or authors,
    • Build networks of characters and concepts.
  • Today: turn a novel into numbers and pictures.

What Can We Compute From a Book?

If I give you the full text of a novel as a .txt file:

  • What questions could we ask?
  • What numbers or graphs could we compute?

Ideas:

The Text-as-Data Pipeline

Data flow:

  1. Load the text from a file
  2. Tokenize: split into words
  3. Preprocess:
    • lowercase
    • remove punctuation
    • remove stopwords
    • use lemmas
  4. Count frequencies
  5. Visualize as a bar chart
  6. Interpret: what do the counts tell us?
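The data flow above can be sketched end to end with the standard library alone. This is a rough sketch under several assumptions: the text is a short sample string rather than a loaded file, the stop list is a tiny hand-written stand-in, lemmas are skipped, and the bar chart is drawn with `#` characters in place of a plotting library. The names `word_counts` and `text_bar_chart` are made up for this example.

```python
import re
from collections import Counter

# Tiny stand-in stop list (an assumption); a real run would use a longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "was", "it", "she", "her"}

def word_counts(text):
    """Steps 2-4: tokenize, preprocess, and count."""
    tokens = text.split()                                 # 2. tokenize on whitespace
    tokens = [t.lower() for t in tokens]                  # 3. lowercase
    tokens = [re.sub(r"[^\w']", "", t) for t in tokens]   # 3. remove punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # 3. remove stopwords
    return Counter(tokens)                                # 4. count (lemmas skipped)

def text_bar_chart(counts, top=5):
    """Step 5: one line per word, with a bar of '#' as long as its count."""
    return [f"{word:<10} {'#' * n}" for word, n in counts.most_common(top)]

sample = "The governess was muddled, and the Wart was muddled. The Wart sighed."
counts = word_counts(sample)
print("\n".join(text_bar_chart(counts)))
```

Step 6 is still up to us: the chart only shows which words are frequent, not why.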

Step 1: Getting the Text into Python
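One simple way to do this, assuming the novel is a plain UTF-8 text file, is to read the whole file into a single string. The helper name `load_text` is hypothetical.

```python
from pathlib import Path

def load_text(path, encoding="utf-8"):
    """Read the whole file at `path` into one string."""
    return Path(path).read_text(encoding=encoding)
```

A call such as `text = load_text("novel.txt")` (the file name is a placeholder) gives one long string, and `len(text)` is the number of characters in the book.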

Tokenizer

We want to analyze the data by word or by ____________ or by ____________ or by ____________…

We can separate the data (a string) into any of these using a “tokenizer”.

Tokenization

 

Translate: “Astrology. The governess was always getting muddled with her astrolabe, and when she got specially muddled she would take it out of the Wart by rapping his knuckles. She did not rap Kay’s knuckles, because when Kay grew older”

 

Into: [‘Astrology.’, ‘The’, ‘governess’, ‘was’, ‘always’, ‘getting’, ‘muddled’, ‘with’, ‘her’, ‘astrolabe’, ‘,’, ‘and’, ‘when’, ‘she’, ‘got’, ‘specially’, ‘muddled’, ‘she’, ‘would’, ‘take’, ‘it’, ‘out’, ‘of’, ‘the’, ‘Wart’, ‘by’, ‘rapping’, ‘his’, ‘knuckles.’, ‘She’, ‘did’, ‘not’, ‘rap’, ‘Kay’, “’s”, ‘knuckles’, ‘,’, ‘because’, ‘when’, ‘Kay’, ‘grew’, ‘older’]
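To make the idea concrete, here is a very small regular-expression tokenizer. It is not the tokenizer that produced the list above: this sketch splits every punctuation mark into its own token, including sentence-final periods, and treats a possessive 's as one token. The pattern and the name `tokenize` are assumptions for illustration.

```python
import re

# Try, in order: a possessive 's (straight or curly apostrophe),
# a run of word characters, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"['’]s\b|\w+|[^\w\s]")

def tokenize(text):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("She did not rap Kay's knuckles, because"))
```

Library tokenizers apply many more rules than this, which is why they can cope with abbreviations and prices.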

Demo

 

 

Thirty most common tokens in the mystery text.

Pre-processing

A Feasible Sequence

lower case -> eliminate punctuation -> remove stop words -> lemma

Unify tally for “Valor” and “valor”.


Depending on the task, we may not want to do this.
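A short sketch of what lowercasing does to the tallies. The token list is invented for this example; the "Rose"/"rose" pair is an added illustration of a case where merging might not be wanted (a name versus a flower).

```python
from collections import Counter

tokens = ["Valor", "valor", "Rose", "rose", "valor"]

# Without lowercasing, "Valor" and "valor" are tallied separately.
case_sensitive = Counter(tokens)

# With lowercasing, the tallies are unified, but so are "Rose" and "rose".
case_folded = Counter(t.lower() for t in tokens)

print(case_sensitive)
print(case_folded)
```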

A Feasible Sequence

lower case -> eliminate punctuation -> remove stop words -> lemma

Tokenizer leaves periods at end of sentences: “father.”

 

Amazingly, it works fine for “Dr.”, “$3.50”, “!”
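One possible way to eliminate punctuation after tokenizing is to trim it from the ends of each token and drop tokens that were nothing but punctuation. This is a sketch, not necessarily what the demo does; `strip_punctuation` is a hypothetical helper, and `string.punctuation` covers ASCII marks only, so curly quotes would need extra handling.

```python
import string

def strip_punctuation(tokens):
    """Trim ASCII punctuation from both ends of each token; drop empty results."""
    cleaned = (t.strip(string.punctuation) for t in tokens)
    return [t for t in cleaned if t]

print(strip_punctuation(["father.", ",", "astrolabe", "!"]))
```

Because only the ends are trimmed, "$3.50" becomes "3.50" and keeps its decimal point, but "Dr." would lose its period as well.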

A Feasible Sequence

lower case -> eliminate punctuation -> remove stop words -> lemma

List of common, unhelpful words compiled by spaCy from large corpora. We keep words that aren’t in that list.
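Filtering against a stop list is a one-line membership test. To stay dependency-free, this sketch uses a short hand-written set as a stand-in (an assumption); with spaCy installed, its English list can be imported as `spacy.lang.en.stop_words.STOP_WORDS` instead. The tokens are assumed to be lowercased already.

```python
# Hand-written stand-in for a real stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "was", "with", "her", "and", "when", "she", "it", "of", "by", "his"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "governess", "was", "muddled", "with", "her", "astrolabe"]
print(remove_stop_words(tokens))
```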

 

A more sophisticated approach is tf-idf (term frequency-inverse document frequency), which down-weights words that appear in most documents.
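A minimal sketch of one common tf-idf variant, assuming each document (for example, each chapter) is already a list of tokens. Libraries differ in the exact weighting and smoothing they apply, so treat the formula in the docstring as one choice among several; the function name `tf_idf` is made up here.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score each term in each tokenized document by tf * idf.

    tf  = count of the term in the document / length of the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))   # count each term once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [["the", "wart", "the"], ["the", "kay"]]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A word that occurs in every document, like "the" here, scores 0, so no hand-made stop list is needed to push it down.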

A Feasible Sequence

lower case -> eliminate punctuation -> remove stop words -> lemma

goes -> go, running -> run, eaten -> eat
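The mapping can be pictured as a lookup from each word form to its lemma. The table below is a toy stand-in that covers only the three examples above; real lemmatizers rely on a full vocabulary and the word's part of speech (in spaCy, the result is available as `token.lemma_`). The names `LEMMAS` and `lemmatize` are hypothetical.

```python
# Toy lookup table covering only the three examples (an assumption for this sketch).
LEMMAS = {"goes": "go", "running": "run", "eaten": "eat"}

def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each token with its lemma when known; otherwise keep the token."""
    return [lemmas.get(t, t) for t in tokens]

print(lemmatize(["she", "goes", "running"]))
```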

Curious?

Demo

PrairieLearn Activity