Appendix C: Basic text preprocessing [video]#
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("pineapple") # extract all interesting information about the document
doc.vector[:10]
array([ 0.65486 , -2.2584 , 0.062793, 1.8801 , 0.207 , -3.3299 ,
-0.96833 , 1.5131 , -3.7041 , -0.077749], dtype=float32)
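These vectors let us compare meanings numerically. As a small sketch (reusing the en_core_web_md model loaded above; the example words are our own), spaCy's similarity method compares two documents via their vectors:
# Similarity between two documents, based on their vectors
print(nlp("pineapple").similarity(nlp("mango")))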
Why do we need preprocessing?
Text data is unstructured and messy.
We need to “normalize” it before we do anything interesting with it.
Example:
Lemma: Same stem, same part-of-speech, roughly the same meaning
Vancouver’s → Vancouver
computers → computer
rising, rose, rises → rise
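As a quick illustration (a sketch using the spaCy model loaded above; the example words come from the list), lemmatization performs exactly this kind of normalization:
### How lemmatization normalizes these forms (sketch)
for word in ["Vancouver's", "computers", "rising", "rises"]:
    print(word, "->", [token.lemma_ for token in nlp(word)])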
Tokenization#
Sentence segmentation
Split text into sentences
Word tokenization
Split sentences into words
Sentence segmentation
MDS is a Master's program at UBC in British Columbia. MDS teaching team is truly multicultural!! Dr. George did his Ph.D. in Scotland. Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. Dr. Gelbart did his PhD in the U.S.
How many sentences are there in this text?
### Let's do sentence segmentation on "."
text = (
"UBC is one of the well known universities in British Columbia. "
"UBC CS teaching team is truly multicultural!! "
"Dr. Toti completed her Ph.D. in Italy."
"Dr. Moosvi, Dr. Kolhatkar, and Dr. Ola completed theirs in Canada."
"Dr. Heeren and Dr. Lécuyer completed theirs in the U.S."
)
print(text.split("."))
['UBC is one of the well known universities in British Columbia', ' UBC CS teaching team is truly multicultural!! Dr', ' Toti completed her Ph', 'D', ' in Italy', 'Dr', ' Moosvi, Dr', ' Kolhatkar, and Dr', ' Ola completed theirs in Canada', 'Dr', ' Heeren and Dr', ' Lécuyer completed theirs in the U', 'S', '']
In English, period (.) is quite ambiguous. (In Chinese, it is unambiguous.)
Abbreviations like Dr., U.S., Inc.
Numbers like 60.44%, 0.98
! and ? are relatively unambiguous.
How about writing regular expressions?
A common approach is to use off-the-shelf models for sentence segmentation.
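To see why simple rules struggle, here is a naive regular-expression splitter (a rough sketch; the pattern below is ours and deliberately simple):
### A naive regex-based sentence splitter (sketch)
import re
# Splitting on ., ! or ? followed by whitespace still breaks on abbreviations like "Dr." and "Ph.D."
print(re.split(r"(?<=[.!?])\s+", text))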
### Let's try to do sentence segmentation using nltk
from nltk.tokenize import sent_tokenize
sent_tokenized = sent_tokenize(text)
print(sent_tokenized)
['UBC is one of the well known universities in British Columbia.', 'UBC CS teaching team is truly multicultural!!', 'Dr. Toti completed her Ph.D. in Italy.Dr.', 'Moosvi, Dr. Kolhatkar, and Dr. Ola completed theirs in Canada.Dr.', 'Heeren and Dr. Lécuyer completed theirs in the U.S.']
Word tokenization
MDS is a Master's program at UBC in British Columbia.
How many words are there in this sentence?
Is whitespace a sufficient condition for a word boundary?
What’s our definition of a word?
Should British Columbia be one word or two words?
Should punctuation be considered a separate word?
What about the punctuation in U.S.?
What do we do with words like Master's?
This process of identifying word boundaries is referred to as tokenization.
You can use regular expressions, but it is better to use off-the-shelf ML models.
### Let's do word segmentation on white spaces
print("Splitting on whitespace: ", [sent.split() for sent in sent_tokenized])
### Let's try to do word segmentation using nltk
from nltk.tokenize import word_tokenize
word_tokenized = [word_tokenize(sent) for sent in sent_tokenized]
# This is similar to the input format of word2vec algorithm
print("\n\n\nTokenized: ", word_tokenized)
Splitting on whitespace: [['UBC', 'is', 'one', 'of', 'the', 'well', 'known', 'universities', 'in', 'British', 'Columbia.'], ['UBC', 'CS', 'teaching', 'team', 'is', 'truly', 'multicultural!!'], ['Dr.', 'Toti', 'completed', 'her', 'Ph.D.', 'in', 'Italy.Dr.'], ['Moosvi,', 'Dr.', 'Kolhatkar,', 'and', 'Dr.', 'Ola', 'completed', 'theirs', 'in', 'Canada.Dr.'], ['Heeren', 'and', 'Dr.', 'Lécuyer', 'completed', 'theirs', 'in', 'the', 'U.S.']]
Tokenized: [['UBC', 'is', 'one', 'of', 'the', 'well', 'known', 'universities', 'in', 'British', 'Columbia', '.'], ['UBC', 'CS', 'teaching', 'team', 'is', 'truly', 'multicultural', '!', '!'], ['Dr.', 'Toti', 'completed', 'her', 'Ph.D.', 'in', 'Italy.Dr', '.'], ['Moosvi', ',', 'Dr.', 'Kolhatkar', ',', 'and', 'Dr.', 'Ola', 'completed', 'theirs', 'in', 'Canada.Dr', '.'], ['Heeren', 'and', 'Dr.', 'Lécuyer', 'completed', 'theirs', 'in', 'the', 'U.S', '.']]
Word segmentation
For some languages you need much more sophisticated tokenizers.
For languages such as Chinese, there are no spaces between words.
jieba is a popular tokenizer for Chinese; see the sketch below.
German doesn’t separate compound words.
Example: Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
(the law for the delegation of monitoring beef labeling)
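As mentioned above, jieba handles Chinese word segmentation. A minimal sketch (assuming the third-party package is installed via `pip install jieba`; the example sentence is ours):
### Word segmentation for Chinese using jieba (sketch)
import jieba
print(jieba.lcut("我喜欢自然语言处理"))  # "I like natural language processing" -> a list of words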
Types and tokens
Usually in NLP, we talk about
Type: an element in the vocabulary.
Token: an instance of that type in running text.
Exercise for you
UBC is located in the beautiful province of British Columbia. It's very close to the U.S. border. You'll get to the USA border in about 45 mins by car.
Consider the example above.
How many types? (task dependent)
How many tokens?
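One way to count them for the sentence above (a sketch; the decision to lowercase, keep punctuation, or merge U.S./USA is exactly what makes the type count task dependent):
### Counting tokens and types (sketch)
example = (
    "UBC is located in the beautiful province of British Columbia. "
    "It's very close to the U.S. border. "
    "You'll get to the USA border in about 45 mins by car."
)
tokens = word_tokenize(example)
types = set(token.lower() for token in tokens)  # one possible definition of a type
print("Number of tokens:", len(tokens))
print("Number of types:", len(types))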
Other commonly used preprocessing steps#
Punctuation and stopword removal
Stemming and lemmatization
Punctuation and stopword removal
The most frequently occurring words in English are not very useful in many NLP tasks.
Example: the, is, a, and punctuation
Probably not very informative in many tasks
# Let's use `nltk.stopwords`.
# Add punctuations to the list.
stop_words = list(set(stopwords.words("english")))
import string
punctuation = string.punctuation
stop_words += list(punctuation)
# stop_words.extend(['``','`','br','"',"”", "''", "'s"])
print(stop_words)
['above', 'myself', 'ain', 'will', 'have', 'won', 'hasn', 'then', "he'll", 'by', 'why', "that'll", "you'd", 'or', 'any', 'do', 'haven', 'against', 'll', "i'd", "you've", 'had', 'itself', 'mightn', "won't", 'before', "it's", 'on', 'aren', 'ourselves', 'ours', "i've", "we'd", 'once', 'too', "she's", 'is', 'here', "doesn't", 'as', "he'd", "we've", 'yourself', 'shouldn', 'doesn', 'this', 'wouldn', 'her', "hadn't", 'did', 'are', 'no', "couldn't", 'themselves', 'but', 'having', 'needn', "she'll", 'with', 'wasn', 'below', 'off', 'theirs', "needn't", 'out', 'each', 'their', "weren't", 'what', 'again', 'now', "hasn't", "it'd", 'after', 'all', 'its', 'more', 'should', 's', "isn't", 'am', 'couldn', 'until', "mightn't", 'we', 'me', 'under', 'the', 'some', 'how', 'nor', 'my', "aren't", 'because', 'him', "they'd", 'm', 'doing', 'if', 'at', 'over', "they're", 'other', 'and', "we'll", 'they', "i'm", 've', 'for', 'than', 'been', 'just', 'own', 'being', 'our', 'from', 'himself', 'your', "wasn't", 'does', 'ma', 'only', 'he', 'so', 'these', 'his', 'who', 'whom', 'herself', "mustn't", 'of', "you're", 'was', 'in', 'about', 'while', 'during', "shan't", 'didn', 'where', "they've", 't', 'between', "you'll", 'can', 'a', "shouldn't", 'there', 'both', 'further', 'o', 'isn', 'to', 'down', 'not', "it'll", 'it', 'mustn', "haven't", 'up', 'weren', "he's", 'you', 'don', 'same', 'yourselves', 'few', 'most', "didn't", 'shan', "they'll", 'those', 'y', "i'll", 'be', 'were', 'hadn', 'such', 'them', "don't", "wouldn't", 'when', 'an', 'very', 'that', 'which', 'through', 'has', 'she', 'i', "should've", "we're", 'into', 're', 'yours', 'hers', "she'd", 'd', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
### Get rid of stop words
preprocessed = []
for sent in word_tokenized:
for token in sent:
token = token.lower()
if token not in stop_words:
preprocessed.append(token)
print(preprocessed)
['ubc', 'one', 'well', 'known', 'universities', 'british', 'columbia', 'ubc', 'cs', 'teaching', 'team', 'truly', 'multicultural', 'dr.', 'toti', 'completed', 'ph.d.', 'italy.dr', 'moosvi', 'dr.', 'kolhatkar', 'dr.', 'ola', 'completed', 'canada.dr', 'heeren', 'dr.', 'lécuyer', 'completed', 'u.s']
Lemmatization
For many NLP tasks (e.g., web search), we want to ignore morphological differences between words.
Example: If your search term is “studying for ML quiz” you might want to include pages containing “tips to study for an ML quiz” or “here is how I studied for my ML quiz”
Lemmatization converts inflected forms into the base form.
import nltk
nltk.download("wordnet")
[nltk_data] Downloading package wordnet to /Users/kvarada/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
nltk.download('omw-1.4')
[nltk_data] Downloading package omw-1.4 to /Users/kvarada/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
True
# nltk has a lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("Lemma of studying: ", lemmatizer.lemmatize("studying", "v"))
print("Lemma of studied: ", lemmatizer.lemmatize("studied", "v"))
Lemma of studying: study
Lemma of studied: study
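Note that the WordNet lemmatizer treats a word as a noun unless you pass a part-of-speech tag, so results can change without it (a small sketch):
# Without a POS tag, the WordNet lemmatizer assumes the word is a noun
print("Lemma of studying (no POS tag): ", lemmatizer.lemmatize("studying"))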
Stemming
It has a similar purpose, but it is a crude chopping of affixes.
automates, automatic, automation all reduced to automat.
Usually these reduced forms (stems) are not actual words themselves.
A popular stemming algorithm for English is PorterStemmer.
Beware that it can be aggressive sometimes.
from nltk.stem.porter import PorterStemmer
text = (
"UBC is located in the beautiful province of British Columbia... "
"It's very close to the U.S. border."
)
ps = PorterStemmer()
tokenized = word_tokenize(text)
stemmed = [ps.stem(token) for token in tokenized]
print("Before stemming: ", text)
print("\n\nAfter stemming: ", " ".join(stemmed))
Before stemming: UBC is located in the beautiful province of British Columbia... It's very close to the U.S. border.
After stemming: ubc is locat in the beauti provinc of british columbia ... it 's veri close to the u.s. border .
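To see how the two differ, here is a quick side-by-side on a few words (a sketch reusing the stemmer and lemmatizer defined above; the word list is ours):
### Stemming vs. lemmatization on a few words (sketch)
for word in ["universities", "beautiful", "located"]:
    print(word, "-> stem:", ps.stem(word), "| lemma:", lemmatizer.lemmatize(word))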
Other tools for preprocessing#
We used the Natural Language Toolkit (nltk) above.
Many tools are available. spaCy is a popular one:
An industrial-strength NLP library.
Lightweight, fast, and convenient to use.
spaCy does many of the things we did above in one line of code!
It also has multilingual support.
import spacy
# Load the model
nlp = spacy.load("en_core_web_md")
text = (
"MDS is a Master's program at UBC in British Columbia. "
"MDS teaching team is truly multicultural!! "
"Dr. George did his Ph.D. in Scotland. "
"Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. "
"Dr. Gelbart did his PhD in the U.S."
)
doc = nlp(text)
# Accessing tokens
tokens = [token for token in doc]
print("\nTokens: ", tokens)
# Accessing lemma
lemmas = [token.lemma_ for token in doc]
print("\nLemmas: ", lemmas)
# Accessing pos
pos = [token.pos_ for token in doc]
print("\nPOS: ", pos)
Tokens: [MDS, is, a, Master, 's, program, at, UBC, in, British, Columbia, ., MDS, teaching, team, is, truly, multicultural, !, !, Dr., George, did, his, Ph.D., in, Scotland, ., Dr., Timbers, ,, Dr., Ostblom, ,, Dr., Rodríguez, -, Arelis, ,, and, Dr., Kolhatkar, did, theirs, in, Canada, ., Dr., Gelbart, did, his, PhD, in, the, U.S.]
Lemmas: ['mds', 'be', 'a', 'Master', "'s", 'program', 'at', 'UBC', 'in', 'British', 'Columbia', '.', 'mds', 'teaching', 'team', 'be', 'truly', 'multicultural', '!', '!', 'Dr.', 'George', 'do', 'his', 'ph.d.', 'in', 'Scotland', '.', 'Dr.', 'Timbers', ',', 'Dr.', 'Ostblom', ',', 'Dr.', 'Rodríguez', '-', 'Arelis', ',', 'and', 'Dr.', 'Kolhatkar', 'do', 'theirs', 'in', 'Canada', '.', 'Dr.', 'Gelbart', 'do', 'his', 'phd', 'in', 'the', 'U.S.']
POS: ['NOUN', 'AUX', 'DET', 'PROPN', 'PART', 'NOUN', 'ADP', 'PROPN', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'NOUN', 'NOUN', 'NOUN', 'AUX', 'ADV', 'ADJ', 'PUNCT', 'PUNCT', 'PROPN', 'PROPN', 'VERB', 'PRON', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'CCONJ', 'PROPN', 'PROPN', 'VERB', 'PRON', 'ADP', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'VERB', 'PRON', 'NOUN', 'ADP', 'DET', 'PROPN']
Other typical NLP tasks#
To understand text, we are usually interested in extracting information from it. Some common tasks in an NLP pipeline are:
Part of speech tagging
Assigning part-of-speech tags to all words in a sentence.
Named entity recognition
Labelling named “real-world” objects, like persons, companies or locations.
Coreference resolution
Deciding whether two strings (e.g., UBC vs. University of British Columbia) refer to the same entity.
Dependency parsing
Representing the grammatical structure of a sentence.
Extracting named-entities using spaCy
from spacy import displacy
doc = nlp(
"University of British Columbia "
"is located in the beautiful "
"province of British Columbia."
)
displacy.render(doc, style="ent")
# Text and label of named entity span
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
print("\nORG means: ", spacy.explain("ORG"))
print("GPE means: ", spacy.explain("GPE"))
Named entities:
[('University of British Columbia', 'ORG'), ('British Columbia', 'GPE')]
ORG means: Companies, agencies, institutions, etc.
GPE means: Countries, cities, states
Dependency parsing using spaCy
doc = nlp("I like cats")
displacy.render(doc, style="dep")
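Besides the visualization, the dependency relations are available programmatically (a short sketch):
# Dependency label and syntactic head for each token
print([(token.text, token.dep_, token.head.text) for token in doc])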
Many other things are possible.
spaCy is a powerful tool.
You can build your own rule-based searches; see the sketch below.
You can also access word vectors using spaCy with bigger models. (Currently we are using the en_core_web_md model.)
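For example, a minimal rule-based match with spaCy's Matcher (a sketch; the pattern and example sentence are our own):
### Rule-based matching with spaCy's Matcher (sketch)
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Match the two-token sequence "british columbia", case-insensitively
matcher.add("BC", [[{"LOWER": "british"}, {"LOWER": "columbia"}]])

doc = nlp("UBC is located in British Columbia.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)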