Appendix C: Basic text preprocessing [video]#
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("pineapple") # extract all interesting information about the document
doc.vector[:10]
array([ 0.65486 , -2.2584 , 0.062793, 1.8801 , 0.207 , -3.3299 ,
-0.96833 , 1.5131 , -3.7041 , -0.077749], dtype=float32)
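These vectors let us compare meanings numerically. As a small sketch (reusing the en_core_web_md model loaded above; the example words are our own), spaCy's similarity method compares two documents via their vectors:
# Similarity between two documents, based on their vectors
print(nlp("pineapple").similarity(nlp("mango")))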
Why do we need preprocessing?
Text data is unstructured and messy.
We need to “normalize” it before we do anything interesting with it.
Example:
Lemma: Same stem, same part-of-speech, roughly the same meaning
Vancouver’s → Vancouver
computers → computer
rising, rose, rises → rise
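As a quick illustration (a sketch using the spaCy model loaded above; the example words come from the list), lemmatization performs exactly this kind of normalization:
### How lemmatization normalizes these forms (sketch)
for word in ["Vancouver's", "computers", "rising", "rises"]:
    print(word, "->", [token.lemma_ for token in nlp(word)])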
Tokenization#
Sentence segmentation
Split text into sentences
Word tokenization
Split sentences into words
Sentence segmentation
MDS is a Master's program at UBC in British Columbia. MDS teaching team is truly multicultural!! Dr. George did his Ph.D. in Scotland. Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. Dr. Gelbart did his PhD in the U.S.
How many sentences are there in this text?
### Let's do sentence segmentation on "."
text = (
"UBC is one of the well known universities in British Columbia. "
"UBC CS teaching team is truly multicultural!! "
"Dr. Toti completed her Ph.D. in Italy."
"Dr. Moosvi, Dr. Kolhatkar, and Dr. Ola completed theirs in Canada."
"Dr. Heeren and Dr. Lécuyer completed theirs in the U.S."
)
print(text.split("."))
['UBC is one of the well known universities in British Columbia', ' UBC CS teaching team is truly multicultural!! Dr', ' Toti completed her Ph', 'D', ' in Italy', 'Dr', ' Moosvi, Dr', ' Kolhatkar, and Dr', ' Ola completed theirs in Canada', 'Dr', ' Heeren and Dr', ' Lécuyer completed theirs in the U', 'S', '']
In English, period (.) is quite ambiguous. (In Chinese, it is unambiguous.)
Abbreviations like Dr., U.S., Inc.
Numbers like 60.44%, 0.98
! and ? are relatively unambiguous.
How about writing regular expressions?
A common approach is to use off-the-shelf models for sentence segmentation.
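To see why simple rules struggle, here is a naive regular-expression splitter (a rough sketch; the pattern below is ours and deliberately simple):
### A naive regex-based sentence splitter (sketch)
import re
# Splitting on ., ! or ? followed by whitespace still breaks on abbreviations like "Dr." and "Ph.D."
print(re.split(r"(?<=[.!?])\s+", text))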
### Let's try to do sentence segmentation using nltk
from nltk.tokenize import sent_tokenize
sent_tokenized = sent_tokenize(text)
print(sent_tokenized)
['UBC is one of the well known universities in British Columbia.', 'UBC CS teaching team is truly multicultural!!', 'Dr. Toti completed her Ph.D. in Italy.Dr.', 'Moosvi, Dr. Kolhatkar, and Dr. Ola completed theirs in Canada.Dr.', 'Heeren and Dr. Lécuyer completed theirs in the U.S.']
Word tokenization
MDS is a Master's program at UBC in British Columbia.
How many words are there in this sentence?
Is whitespace a sufficient condition for a word boundary?
What’s our definition of a word?
Should British Columbia be one word or two words?
Should punctuation be considered a separate word?
What about the punctuation in U.S.?
What do we do with words like Master's?
This process of identifying word boundaries is referred to as tokenization.
You can use regular expressions, but it is better to use off-the-shelf ML models.
### Let's do word segmentation on white spaces
print("Splitting on whitespace: ", [sent.split() for sent in sent_tokenized])
### Let's try to do word segmentation using nltk
from nltk.tokenize import word_tokenize
word_tokenized = [word_tokenize(sent) for sent in sent_tokenized]
# This is similar to the input format of word2vec algorithm
print("\n\n\nTokenized: ", word_tokenized)
Splitting on whitespace: [['UBC', 'is', 'one', 'of', 'the', 'well', 'known', 'universities', 'in', 'British', 'Columbia.'], ['UBC', 'CS', 'teaching', 'team', 'is', 'truly', 'multicultural!!'], ['Dr.', 'Toti', 'completed', 'her', 'Ph.D.', 'in', 'Italy.Dr.'], ['Moosvi,', 'Dr.', 'Kolhatkar,', 'and', 'Dr.', 'Ola', 'completed', 'theirs', 'in', 'Canada.Dr.'], ['Heeren', 'and', 'Dr.', 'Lécuyer', 'completed', 'theirs', 'in', 'the', 'U.S.']]
Tokenized: [['UBC', 'is', 'one', 'of', 'the', 'well', 'known', 'universities', 'in', 'British', 'Columbia', '.'], ['UBC', 'CS', 'teaching', 'team', 'is', 'truly', 'multicultural', '!', '!'], ['Dr.', 'Toti', 'completed', 'her', 'Ph.D.', 'in', 'Italy.Dr', '.'], ['Moosvi', ',', 'Dr.', 'Kolhatkar', ',', 'and', 'Dr.', 'Ola', 'completed', 'theirs', 'in', 'Canada.Dr', '.'], ['Heeren', 'and', 'Dr.', 'Lécuyer', 'completed', 'theirs', 'in', 'the', 'U.S', '.']]
Word segmentation
For some languages you need much more sophisticated tokenizers.
For languages such as Chinese, there are no spaces between words.
jieba is a popular tokenizer for Chinese; see the sketch below.
German doesn’t separate compound words.
Example: Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
(the law for the delegation of monitoring beef labeling)
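As mentioned above, jieba handles Chinese word segmentation. A minimal sketch (assuming the third-party package is installed via `pip install jieba`; the example sentence is ours):
### Word segmentation for Chinese using jieba (sketch)
import jieba
print(jieba.lcut("我喜欢自然语言处理"))  # "I like natural language processing" -> a list of words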
Types and tokens
Usually in NLP, we talk about
Type: an element in the vocabulary.
Token: an instance of that type in running text.
Exercise for you
UBC is located in the beautiful province of British Columbia. It's very close to the U.S. border. You'll get to the USA border in about 45 mins by car.
Consider the example above.
How many types? (task dependent)
How many tokens?
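One way to count them for the sentence above (a sketch; the decision to lowercase, keep punctuation, or merge U.S./USA is exactly what makes the type count task dependent):
### Counting tokens and types (sketch)
example = (
    "UBC is located in the beautiful province of British Columbia. "
    "It's very close to the U.S. border. "
    "You'll get to the USA border in about 45 mins by car."
)
tokens = word_tokenize(example)
types = set(token.lower() for token in tokens)  # one possible definition of a type
print("Number of tokens:", len(tokens))
print("Number of types:", len(types))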
Other commonly used preprocessing steps#
Punctuation and stopword removal
Stemming and lemmatization
Punctuation and stopword removal
The most frequently occurring words in English are not very useful in many NLP tasks.
Example: the, is, a, and punctuation
Probably not very informative in many tasks
# Let's use `nltk.stopwords`.
# Add punctuations to the list.
stop_words = list(set(stopwords.words("english")))
import string
punctuation = string.punctuation
stop_words += list(punctuation)
# stop_words.extend(['``','`','br','"',"”", "''", "'s"])
print(stop_words)
['above', 'myself', 'ain', 'will', 'have', 'won', 'hasn', 'then', "he'll", 'by', 'why', "that'll", "you'd", 'or', 'any', 'do', 'haven', 'against', 'll', "i'd", "you've", 'had', 'itself', 'mightn', "won't", 'before', "it's", 'on', 'aren', 'ourselves', 'ours', "i've", "we'd", 'once', 'too', "she's", 'is', 'here', "doesn't", 'as', "he'd", "we've", 'yourself', 'shouldn', 'doesn', 'this', 'wouldn', 'her', "hadn't", 'did', 'are', 'no', "couldn't", 'themselves', 'but', 'having', 'needn', "she'll", 'with', 'wasn', 'below', 'off', 'theirs', "needn't", 'out', 'each', 'their', "weren't", 'what', 'again', 'now', "hasn't", "it'd", 'after', 'all', 'its', 'more', 'should', 's', "isn't", 'am', 'couldn', 'until', "mightn't", 'we', 'me', 'under', 'the', 'some', 'how', 'nor', 'my', "aren't", 'because', 'him', "they'd", 'm', 'doing', 'if', 'at', 'over', "they're", 'other', 'and', "we'll", 'they', "i'm", 've', 'for', 'than', 'been', 'just', 'own', 'being', 'our', 'from', 'himself', 'your', "wasn't", 'does', 'ma', 'only', 'he', 'so', 'these', 'his', 'who', 'whom', 'herself', "mustn't", 'of', "you're", 'was', 'in', 'about', 'while', 'during', "shan't", 'didn', 'where', "they've", 't', 'between', "you'll", 'can', 'a', "shouldn't", 'there', 'both', 'further', 'o', 'isn', 'to', 'down', 'not', "it'll", 'it', 'mustn', "haven't", 'up', 'weren', "he's", 'you', 'don', 'same', 'yourselves', 'few', 'most', "didn't", 'shan', "they'll", 'those', 'y', "i'll", 'be', 'were', 'hadn', 'such', 'them', "don't", "wouldn't", 'when', 'an', 'very', 'that', 'which', 'through', 'has', 'she', 'i', "should've", "we're", 'into', 're', 'yours', 'hers', "she'd", 'd', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
### Get rid of stop words
preprocessed = []
for sent in word_tokenized:
for token in sent:
token = token.lower()
if token not in stop_words:
preprocessed.append(token)
print(preprocessed)
['ubc', 'one', 'well', 'known', 'universities', 'british', 'columbia', 'ubc', 'cs', 'teaching', 'team', 'truly', 'multicultural', 'dr.', 'toti', 'completed', 'ph.d.', 'italy.dr', 'moosvi', 'dr.', 'kolhatkar', 'dr.', 'ola', 'completed', 'canada.dr', 'heeren', 'dr.', 'lécuyer', 'completed', 'u.s']
Lemmatization
For many NLP tasks (e.g., web search), we want to ignore morphological differences between words.
Example: If your search term is “studying for ML quiz” you might want to include pages containing “tips to study for an ML quiz” or “here is how I studied for my ML quiz”
Lemmatization converts inflected forms into the base form.
import nltk
nltk.download("wordnet")
[nltk_data] Downloading package wordnet to /Users/kvarada/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
nltk.download('omw-1.4')
[nltk_data] Downloading package omw-1.4 to /Users/kvarada/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
True
# nltk has a lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("Lemma of studying: ", lemmatizer.lemmatize("studying", "v"))
print("Lemma of studied: ", lemmatizer.lemmatize("studied", "v"))
Lemma of studying: study
Lemma of studied: study
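Note that the WordNet lemmatizer treats a word as a noun unless you pass a part-of-speech tag, so results can change without it (a small sketch):
# Without a POS tag, the WordNet lemmatizer assumes the word is a noun
print("Lemma of studying (no POS tag): ", lemmatizer.lemmatize("studying"))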
Stemming
It has a similar purpose, but it is a crude chopping of affixes.
automates, automatic, automation all reduced to automat.
Usually these reduced forms (stems) are not actual words themselves.
A popular stemming algorithm for English is PorterStemmer.
Beware that it can be aggressive sometimes.
from nltk.stem.porter import PorterStemmer
text = (
"UBC is located in the beautiful province of British Columbia... "
"It's very close to the U.S. border."
)
ps = PorterStemmer()
tokenized = word_tokenize(text)
stemmed = [ps.stem(token) for token in tokenized]
print("Before stemming: ", text)
print("\n\nAfter stemming: ", " ".join(stemmed))
Before stemming: UBC is located in the beautiful province of British Columbia... It's very close to the U.S. border.
After stemming: ubc is locat in the beauti provinc of british columbia ... it 's veri close to the u.s. border .
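To see how the two differ, here is a quick side-by-side on a few words (a sketch reusing the stemmer and lemmatizer defined above; the word list is ours):
### Stemming vs. lemmatization on a few words (sketch)
for word in ["universities", "beautiful", "located"]:
    print(word, "-> stem:", ps.stem(word), "| lemma:", lemmatizer.lemmatize(word))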
Other tools for preprocessing#
We used the Natural Language Toolkit (nltk) above.
Many tools are available. spaCy is a popular one:
An industrial-strength NLP library.
Lightweight, fast, and convenient to use.
spaCy does many of the things we did above in one line of code!
It also has multilingual support.
import spacy
# Load the model
nlp = spacy.load("en_core_web_md")
text = (
"MDS is a Master's program at UBC in British Columbia. "
"MDS teaching team is truly multicultural!! "
"Dr. George did his Ph.D. in Scotland. "
"Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. "
"Dr. Gelbart did his PhD in the U.S."
)
doc = nlp(text)
# Accessing tokens
tokens = [token for token in doc]
print("\nTokens: ", tokens)
# Accessing lemma
lemmas = [token.lemma_ for token in doc]
print("\nLemmas: ", lemmas)
# Accessing pos
pos = [token.pos_ for token in doc]
print("\nPOS: ", pos)
Tokens: [MDS, is, a, Master, 's, program, at, UBC, in, British, Columbia, ., MDS, teaching, team, is, truly, multicultural, !, !, Dr., George, did, his, Ph.D., in, Scotland, ., Dr., Timbers, ,, Dr., Ostblom, ,, Dr., Rodríguez, -, Arelis, ,, and, Dr., Kolhatkar, did, theirs, in, Canada, ., Dr., Gelbart, did, his, PhD, in, the, U.S.]
Lemmas: ['mds', 'be', 'a', 'Master', "'s", 'program', 'at', 'UBC', 'in', 'British', 'Columbia', '.', 'mds', 'teaching', 'team', 'be', 'truly', 'multicultural', '!', '!', 'Dr.', 'George', 'do', 'his', 'ph.d.', 'in', 'Scotland', '.', 'Dr.', 'Timbers', ',', 'Dr.', 'Ostblom', ',', 'Dr.', 'Rodríguez', '-', 'Arelis', ',', 'and', 'Dr.', 'Kolhatkar', 'do', 'theirs', 'in', 'Canada', '.', 'Dr.', 'Gelbart', 'do', 'his', 'phd', 'in', 'the', 'U.S.']
POS: ['NOUN', 'AUX', 'DET', 'PROPN', 'PART', 'NOUN', 'ADP', 'PROPN', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'NOUN', 'NOUN', 'NOUN', 'AUX', 'ADV', 'ADJ', 'PUNCT', 'PUNCT', 'PROPN', 'PROPN', 'VERB', 'PRON', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'CCONJ', 'PROPN', 'PROPN', 'VERB', 'PRON', 'ADP', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'VERB', 'PRON', 'NOUN', 'ADP', 'DET', 'PROPN']
Other typical NLP tasks#
To understand text, we are usually interested in extracting information from it. Some common tasks in an NLP pipeline are:
Part of speech tagging
Assigning part-of-speech tags to all words in a sentence.
Named entity recognition
Labelling named “real-world” objects, like persons, companies or locations.
Coreference resolution
Deciding whether two strings (e.g., UBC vs. University of British Columbia) refer to the same entity.
Dependency parsing
Representing the grammatical structure of a sentence.
Extracting named-entities using spaCy
from spacy import displacy
doc = nlp(
"University of British Columbia "
"is located in the beautiful "
"province of British Columbia."
)
displacy.render(doc, style="ent")
# Text and label of named entity span
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
print("\nORG means: ", spacy.explain("ORG"))
print("GPE means: ", spacy.explain("GPE"))
Named entities:
[('University of British Columbia', 'ORG'), ('British Columbia', 'GPE')]
ORG means: Companies, agencies, institutions, etc.
GPE means: Countries, cities, states
Dependency parsing using spaCy
doc = nlp("I like cats")
displacy.render(doc, style="dep")
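Besides the visualization, the dependency relations are available programmatically (a short sketch):
# Dependency label and syntactic head for each token
print([(token.text, token.dep_, token.head.text) for token in doc])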
Many other things are possible.
spaCy is a powerful tool.
You can build your own rule-based searches; see the sketch below.
You can also access word vectors using spaCy with bigger models. (Currently we are using the en_core_web_md model.)
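For example, a minimal rule-based match with spaCy's Matcher (a sketch; the pattern and example sentence are our own):
### Rule-based matching with spaCy's Matcher (sketch)
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Match the two-token sequence "british columbia", case-insensitively
matcher.add("BC", [[{"LOWER": "british"}, {"LOWER": "columbia"}]])

doc = nlp("UBC is located in British Columbia.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)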