Programming, problem solving, and algorithms

CPSC 203, 2025 W1

December 2, 2025

Announcements

Text as Data

The Text-as-Data Pipeline

Data flow:

  1. Load the text from a file
  2. Tokenize: split into words
  3. Preprocess:
    • lowercase
    • remove punctuation
    • remove stopwords
    • use lemmas
  4. Count frequencies
  5. Visualize as a bar chart
  6. Interpret: what do the counts tell us?
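Steps 2 through 5 above can be sketched with the standard library alone. This is a minimal illustration, not the course demo: the sample sentence, the tiny `STOPWORDS` set, and the function names are all made up here, the file-loading step is replaced by a string, the lemma step is left out, and the "bar chart" is printed as text rather than plotted.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "was", "it"}


def tokenize(text):
    """Step 2: split the text into tokens on whitespace."""
    return text.split()


def preprocess(tokens):
    """Step 3: lowercase, strip punctuation, drop stopwords (no lemmas here)."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        tok = re.sub(r"[^\w\s]", "", tok)  # remove punctuation characters
        if tok and tok not in STOPWORDS:
            cleaned.append(tok)
    return cleaned


def count_frequencies(tokens):
    """Step 4: map each token to the number of times it occurs."""
    return Counter(tokens)


def text_bar_chart(counts, n=5):
    """Step 5: one row per token, one '#' per occurrence."""
    rows = []
    for word, freq in counts.most_common(n):
        rows.append(f"{word:<10} {'#' * freq}")
    return "\n".join(rows)


# Made-up sample text standing in for the file loaded in step 1.
sample = "The cat sat on the mat. The cat was happy, and the mat was warm."
counts = count_frequencies(preprocess(tokenize(sample)))
print(text_bar_chart(counts))
```

Step 6 is still up to the reader: the chart only shows which tokens dominate, not why.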

Demo

[Figure: Thirty most common tokens in the mystery text.]

A Feasible Sequence

lowercase -> remove punctuation -> remove stopwords -> lemmatize

Examples of lemmas:

  • goes -> go
  • running -> run
  • eaten -> eat
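The three examples above show why lemmas cannot be found by simply chopping off a suffix. The sketch below contrasts a naive suffix stripper with a lookup table. Both `strip_suffix` and `LEMMA_TABLE` are hypothetical helpers written for this illustration; a real lemmatizer relies on a much larger vocabulary and on part-of-speech information.

```python
# Hypothetical lookup table containing only the three slide examples.
LEMMA_TABLE = {"goes": "go", "running": "run", "eaten": "eat"}


def strip_suffix(token):
    """Naive approach: remove a common ending if one is present."""
    for suffix in ("ing", "es", "en"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Table approach: return the known lemma, or the token unchanged."""
    return LEMMA_TABLE.get(token, token)


tokens = ["goes", "running", "eaten"]
stripped = [strip_suffix(t) for t in tokens]
lemmas = [lemmatize(t) for t in tokens]

print(stripped)
print(lemmas)
```

Suffix stripping gets "goes" right but leaves a doubled consonant in "running" and cuts "eaten" down too far, which is why the pipeline asks for lemmas rather than trimmed words.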

Demo

PrairieLearn Activity

Named Entity Recognition

Essentially: tagging proper nouns

 

NER used for inferring relationships between entities:

  • PERSON lives LOCATION
  • LOCATION has ORGANIZATION

NER

  1. Underline all of the proper nouns (named entities) in this text:
Harriet Smith’s intimacy at Hartfield was soon a settled thing. Quick and decided in her ways, Emma lost no time in inviting, encouraging, and telling her to come very often; and as their acquaintance increased, so did their satisfaction in each other. As a walking companion, Emma had very early foreseen how useful she might find her. In that respect Mrs. Weston’s loss had been important. Her father never went beyond the shrubbery, where two divisions of the ground sufficed him for his long walk, or his short, as the year varied; and since Mrs. Weston’s marriage her exercise had been too much confined. She had ventured once alone to Randalls, but it was not pleasant; and a Harriet Smith, therefore, one whom she could summon at any time to a walk, would be a valuable addition to her privileges. But in every respect, as she saw more of her, she approved her, and was confirmed in all her kind designs.

Typical categories of entities are PERSON, LOCATION, ORGANIZATION. Think about how you might discover each of the entities using a program.
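One possible starting point for discovering entities with a program is a capitalization heuristic: treat any capitalized word that does not begin a sentence as a candidate. The sketch below is only that heuristic, not how spaCy or any trained model works, and the function name `candidate_entities` is made up for this example. It cannot assign a category, it misses names that open a sentence, and it would be confused by abbreviations such as "Mrs." that look like sentence endings.

```python
import re


def candidate_entities(text):
    """Return runs of capitalized words that do not start a sentence."""
    candidates = []
    # Naive sentence split: end punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        run = []
        for i, word in enumerate(sentence.split()):
            stripped = word.strip(".,;:!?\"'")
            if i > 0 and stripped[:1].isupper():
                run.append(stripped)  # extend the current capitalized run
            elif run:
                candidates.append(" ".join(run))
                run = []
        if run:
            candidates.append(" ".join(run))
    return candidates


# Two sentences taken from the passage above.
sample = (
    "As a walking companion, Emma had very early foreseen how useful "
    "she might find her. She had ventured once alone to Randalls, but "
    "it was not pleasant."
)
found = candidate_entities(sample)
print(found)
```

The heuristic finds both names here but cannot say that one is a PERSON and the other a LOCATION, which is the part that needs context or a trained model.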

NER with spaCy

  1. The PL activity contains a file called ner_nb.py. Modify and execute this file to answer the following questions. In each case, sketch an example of the output, and explain it briefly in English.
    a. Load an OFK excerpt from ofk_ch1Short.txt.

    b. Create a spaCy doc object with doc = nlp(textRaw).

NER with spaCy

    c. If doc is the result of part b, what does sents = list(doc.sents) do?

    d. If sents is the result of part c, what does sentWords = [[token.text for token in sent] for sent in sents] do?

    e. If sents is the result of part c, what does sentWordsPOS = [[(token.text, token.pos_) for token in sent] for sent in sents] do?

NER with spaCy

    f. If sents is the result of part c, what does sentWordsNER = [[(token.text, token.ent_iob_, token.ent_type_) for token in sent] for sent in sents] do?

    g. If sents is the result of part c, what does ents_by_sent = [[(ent.text, ent.label_) for ent in sent.ents] for sent in sents] do?

    h. If doc is the result of part b, what does names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"] do?
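The token-level and entity-level views in the last three questions are related through the IOB scheme: "B" marks the first token of an entity, "I" a token inside one, and "O" a token outside any entity. The sketch below shows that relationship in plain Python, without spaCy. The `tagged` triples are written by hand for illustration and are not actual model output, and `group_entities` is a hypothetical helper, not part of spaCy.

```python
def group_entities(tagged_tokens):
    """Merge (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    words, label = [], None
    for text, iob, ent_type in tagged_tokens:
        if iob == "B":
            # A new entity begins; close any entity still open.
            if words:
                entities.append((" ".join(words), label))
            words, label = [text], ent_type
        elif iob == "I" and words:
            words.append(text)  # continue the open entity
        else:
            # "O" (or an unset tag) ends any open entity.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written triples in the shape (token text, IOB tag, entity type).
tagged = [
    ("Harriet", "B", "PERSON"),
    ("Smith", "I", "PERSON"),
    ("walked", "O", ""),
    ("to", "O", ""),
    ("Randalls", "B", "LOC"),
    (".", "O", ""),
]
entities = group_entities(tagged)
names = [text for text, label in entities if label == "PERSON"]
print(entities)
print(names)
```

Reading the triples this way makes clear why a two-word name appears as two tokens in the token-level list but as a single item in the entity-level list.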

One Last Project

Suppose we’d like to understand the bonds between pairs of people in a book!

  • How can we infer connections between characters?
  • How can we draw a graph?

Friends?

Given the text from a novel, how can we infer interaction or connections between characters? Discuss this question, and write down your ideas.

________________________________________

 

________________________________________

 

________________________________________

 

________________________________________

We Are Not the First

Starting From the Goal

 

The social network graph will have…

 

Vertices:

 

Edges:

Edges

We could consider every pair of people and check every paragraph for their presence.

  • Do you like this plan?
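To help judge the plan, here is a minimal sketch of it. The names, the sample paragraphs, and the function `count_cooccurrences` are made up for illustration, and presence is tested with a plain substring check, which is itself a simplification.

```python
from itertools import combinations


def count_cooccurrences(people, paragraphs):
    """For every pair of people, count the paragraphs that mention both."""
    tally = {}
    for a, b in combinations(people, 2):  # every unordered pair
        for para in paragraphs:           # every paragraph, for every pair
            if a in para and b in para:
                tally[(a, b)] = tally.get((a, b), 0) + 1
    return tally


# Made-up paragraphs, for illustration only.
people = ["Ron", "Hermione", "Malfoy"]
paragraphs = [
    "Ron turned to Hermione.",
    "Malfoy glared at Ron.",
    "Hermione opened a book.",
]
pair_counts = count_cooccurrences(people, paragraphs)
print(pair_counts)
```

With P people and N paragraphs, this checks P(P-1)/2 pairs against all N paragraphs, even though most pairs never appear together.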

OR, we could

Edges

"I've heard of his family," said Ron darkly. "They were some of the first to come back to our side after You-Know-Who disappeared. Said they'd been bewitched. My dad doesn't believe it. He says Malfoy's father didn't need an excuse to go over to the Dark Side." He turned to Hermione. "Can we help you with something?"

  • What names appear?
  • What pairs should be tallied?
  • General observations:

Moving to the Middle

 

Given [[RW, HG, HP], [RW, AD], [H, HP, HG]],

  1. Could you create [RW, HG, HP, AD, H]?

 

  2. and {(RW,HG):1, (RW,HP):1, (HG,HP):2, (RW,AD):1, (H,HP):1, (H,HG):1}?
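Both structures can be built in one pass over the nested list. The sketch below is one possible way to do it, not necessarily what socnet_nb.py does. The function names are made up, and the initials are treated as plain strings. A pair is stored under whichever ordering was seen first, so (HP, HG) in the third group is tallied under the existing key (HG, HP).

```python
from itertools import combinations


def build_vertices(groups):
    """List every name once, in order of first appearance."""
    vertices = []
    for group in groups:
        for name in group:
            if name not in vertices:
                vertices.append(name)
    return vertices


def tally_pairs(groups):
    """Count, for each unordered pair, the groups containing both names."""
    tally = {}
    for group in groups:
        unique = list(dict.fromkeys(group))  # drop repeats, keep order
        for a, b in combinations(unique, 2):
            # Reuse the reversed key if this pair was already recorded.
            key = (b, a) if (b, a) in tally else (a, b)
            tally[key] = tally.get(key, 0) + 1
    return tally


groups = [["RW", "HG", "HP"], ["RW", "AD"], ["H", "HP", "HG"]]
vertices = build_vertices(groups)
edges = tally_pairs(groups)
print(vertices)
print(edges)
```

The vertex list gives the nodes of the social network graph, and each dictionary entry gives an edge together with a weight.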

Demo

In the PL activity, load socnet_nb.py.