Appendix A: Demo of feature engineering for text data#
Import#
We will be using the Covid tweets dataset for this demo.
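A minimal sketch of the imports assumed by the code in this appendix (DATA_DIR is a path you set yourself to the folder containing the CSV):
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

DATA_DIR = 'data/'  # assumption: point this at wherever Corona_NLP_test.csv lives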
df = pd.read_csv(DATA_DIR + 'Corona_NLP_test.csv')
df['Sentiment'].value_counts()
Sentiment
Negative 1041
Positive 947
Neutral 619
Extremely Positive 599
Extremely Negative 592
Name: count, dtype: int64
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
train_df
 | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
---|---|---|---|---|---|---|
1927 | 1928 | 46880 | Seattle, WA | 13-03-2020 | While I don't like all of Amazon's choices, to... | Positive |
1068 | 1069 | 46021 | NaN | 13-03-2020 | Me: shit buckets, its time to do the weekly s... | Negative |
803 | 804 | 45756 | The Outer Limits | 12-03-2020 | @SecPompeo @realDonaldTrump You mean the plan ... | Neutral |
2846 | 2847 | 47799 | Flagstaff, AZ | 15-03-2020 | @lauvagrande People who are sick arent panic ... | Extremely Negative |
3768 | 3769 | 48721 | Montreal, Canada | 16-03-2020 | Coronavirus Panic: Toilet Paper Is the People... | Negative |
... | ... | ... | ... | ... | ... | ... |
1122 | 1123 | 46075 | NaN | 13-03-2020 | Photos of our local grocery store shelveswher... | Extremely Positive |
1346 | 1347 | 46299 | Toronto | 13-03-2020 | Just went to the the grocery store (Highland F... | Positive |
3454 | 3455 | 48407 | Houston, TX | 16-03-2020 | Real talk though. Am I the only one spending h... | Neutral |
3437 | 3438 | 48390 | Washington, DC | 16-03-2020 | The supermarket business is booming! #COVID2019 | Neutral |
3582 | 3583 | 48535 | St James' Park, Newcastle | 16-03-2020 | Evening All Here s the story on the and the im... | Positive |
3038 rows × 6 columns
train_df.columns
Index(['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet',
'Sentiment'],
dtype='object')
train_df['Location'].value_counts()
Location
United States 63
London, England 37
Los Angeles, CA 30
New York, NY 29
Washington, DC 29
..
Suburb of Chicago 1
philippines 1
Dont ask for freedom, take it. 1
Windsor Heights, IA 1
St James' Park, Newcastle 1
Name: count, Length: 1441, dtype: int64
X_train, y_train = train_df[['OriginalTweet', 'Location']], train_df['Sentiment']
X_test, y_test = test_df[['OriginalTweet', 'Location']], test_df['Sentiment']
y_train.value_counts()
Sentiment
Negative 852
Positive 743
Neutral 501
Extremely Negative 472
Extremely Positive 470
Name: count, dtype: int64
scoring_metrics = 'accuracy'
results = {}
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation.

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    -------
        pandas Series with mean scores from cross_validation
    """
    scores = cross_validate(model, X_train, y_train, **kwargs)
    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []
    for i in range(len(mean_scores)):
        # Use .iloc for positional access to avoid the pandas FutureWarning
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))
    return pd.Series(data=out_col, index=mean_scores.index)
Dummy classifier#
dummy = DummyClassifier()
results["dummy"] = mean_std_cross_val_scores(
dummy, X_train, y_train, return_train_score=True, scoring=scoring_metrics
)
pd.DataFrame(results).T
 | fit_time | score_time | test_score | train_score |
---|---|---|---|---|
dummy | 0.001 (+/- 0.001) | 0.001 (+/- 0.001) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
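The dummy score is simply the relative frequency of the most common class; a quick sanity check using the y_train defined above:
# DummyClassifier's default strategy always predicts the most frequent class,
# so its accuracy equals that class's proportion in the training data
y_train.value_counts(normalize=True).max()  # ~0.280 (852 Negative tweets out of 3038)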
Bag-of-words model#
from sklearn.feature_extraction.text import CountVectorizer
pipe = make_pipeline(CountVectorizer(stop_words='english'),
LogisticRegression(max_iter=1000))
results["logistic regression"] = mean_std_cross_val_scores(
pipe, X_train['OriginalTweet'], y_train, return_train_score=True, scoring=scoring_metrics
)
pd.DataFrame(results).T
 | fit_time | score_time | test_score | train_score |
---|---|---|---|---|
dummy | 0.001 (+/- 0.001) | 0.001 (+/- 0.001) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
logistic regression | 0.278 (+/- 0.019) | 0.008 (+/- 0.000) | 0.414 (+/- 0.012) | 0.999 (+/- 0.000) |
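As a reminder of what the bag-of-words representation looks like, here is a tiny sketch on made-up sentences (not from the dataset):
toy_corpus = ["Panic buying of toilet paper",
              "No need to panic",
              "Toilet paper is sold out everywhere"]
toy_vec = CountVectorizer(stop_words='english')
toy_bow = toy_vec.fit_transform(toy_corpus)
# Each row is a sentence, each column a vocabulary word, each value a count
pd.DataFrame(toy_bow.toarray(), columns=toy_vec.get_feature_names_out())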
Is it possible to further improve the scores?#
How about adding new features based on our intuitions? Let’s extract our own features that might be useful for this prediction task. In other words, let’s carry out feature engineering.
The code below adds some very basic length-related and sentiment features. We will be using a popular library called nltk for this exercise. If you have successfully created the course conda environment on your machine, you should already have this package in the environment.
How do we extract interesting information from text? We use pre-trained models!
A couple of popular libraries include such pre-trained models:
- nltk: conda install -c anaconda nltk
- spaCy: conda install -c conda-forge spacy
- For emoji support: pip install spacymoji
You also need to download the language model, which contains all the pre-trained models. For that, run the following in your course conda environment.
import spacy
# !python -m spacy download en_core_web_md
import nltk
nltk.download("punkt")
[nltk_data] Downloading package punkt to /Users/kvarada/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
nltk.download("vader_lexicon")
nltk.download("punkt")
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
[nltk_data] Downloading package vader_lexicon to
[nltk_data] /Users/kvarada/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/kvarada/nltk_data...
[nltk_data] Package punkt is already up-to-date!
s = "CPSC 330 students are smart, sweet, and funny."
print(sid.polarity_scores(s))
{'neg': 0.0, 'neu': 0.368, 'pos': 0.632, 'compound': 0.8225}
s = "CPSC 330 students are tired because of all the hard work they have been doing."
print(sid.polarity_scores(s))
{'neg': 0.249, 'neu': 0.751, 'pos': 0.0, 'compound': -0.5106}
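The compound score is the single number we will use later: it summarizes the sentence sentiment on a scale from -1 (most negative) to +1 (most positive). For example (made-up sentence, output not shown):
s = "The store was crowded but the staff were incredibly helpful."
print(sid.polarity_scores(s)["compound"])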
spaCy#
- A useful package for text processing and feature extraction
- Active development: https://github.com/explosion/spaCy
- Interactive lessons by Ines Montani: https://course.spacy.io/en/
- Good documentation, easy to use, and customizable
To run the code below, you have to download the pretrained model in the course environment:
python -m spacy download en_core_web_md
import spacy
nlp = spacy.load("en_core_web_md")
sample_text = """Dolly Parton is a gift to us all.
From writing all-time great songs like “Jolene” and “I Will Always Love You”,
to great performances in films like 9 to 5, to helping fund a COVID-19 vaccine,
she’s given us so much. Now, Netflix bring us Dolly Parton’s Christmas on the Square,
an original musical that stars Christine Baranski as a Scrooge-like landowner
who threatens to evict an entire town on Christmas Eve to make room for a new mall.
Directed and choreographed by the legendary Debbie Allen and counting Jennifer Lewis
and Parton herself amongst its cast, Christmas on the Square seems like the perfect movie
to save Christmas 2020. 😻 👍🏿"""
# [Adapted from here.](https://thepopbreak.com/2020/11/22/dolly-partons-christmas-on-the-square-review-not-quite-a-christmas-miracle/)
spaCy extracts all the interesting information from the text with this single call.
doc = nlp(sample_text)
Let’s look at part-of-speech tags.
print([(token, token.pos_) for token in doc][:20])
[(Dolly, 'PROPN'), (Parton, 'PROPN'), (is, 'AUX'), (a, 'DET'), (gift, 'NOUN'), (to, 'ADP'), (us, 'PRON'), (all, 'PRON'), (., 'PUNCT'), (
, 'SPACE'), (From, 'ADP'), (writing, 'VERB'), (all, 'DET'), (-, 'PUNCT'), (time, 'NOUN'), (great, 'ADJ'), (songs, 'NOUN'), (like, 'ADP'), (“, 'PUNCT'), (Jolene, 'PROPN')]
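Part-of-speech tags are only one of many token-level attributes set by the same nlp() call; a small sketch using standard spaCy token attributes:
# Lemmas and stopword flags for the first few tokens of the same document
print([(token.text, token.lemma_, token.is_stop) for token in doc[:8]])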
Often we want to know who did what to whom.
Named entities give you this information.
What are named entities in the text?
from spacy import displacy
displacy.render(doc, style="ent")
(displacy highlights each entity span in the passage with its label, e.g. Christine Baranski and Debbie Allen as PERSON, Netflix as ORG, Christmas Eve and Christmas 2020 as DATE, and the Square as FAC.)
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
print("\nORG means: ", spacy.explain("ORG"))
print("\nPERSON means: ", spacy.explain("PERSON"))
print("\nDATE means: ", spacy.explain("DATE"))
Named entities:
[('Dolly Parton', 'PERSON'), ('Jolene', 'PERSON'), ('9 to 5', 'DATE'), ('Netflix', 'ORG'), ('Dolly Parton', 'PERSON'), ('Christmas', 'DATE'), ('Square', 'FAC'), ('Christine Baranski', 'PERSON'), ('Christmas Eve', 'DATE'), ('Debbie Allen', 'PERSON'), ('Jennifer Lewis', 'PERSON'), ('Parton', 'PERSON'), ('Christmas', 'DATE'), ('Square', 'FAC'), ('Christmas 2020', 'DATE')]
ORG means: Companies, agencies, institutions, etc.
PERSON means: People, including fictional
DATE means: Absolute or relative dates or periods
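The FAC label also appears in the entities above; spacy.explain works the same way for any label string:
print("FAC means: ", spacy.explain("FAC"))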
An example from a project#
Goal: Extract and visualize inter-corporate relationships from disclosed annual 10-K reports of public companies.
text = (
"Heavy hitters, including Microsoft and Google, "
"are competing for customers in cloud services with the likes of IBM and Salesforce."
)
doc = nlp(text)
displacy.render(doc, style="ent")
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
Named entities:
[('Microsoft', 'ORG'), ('Google', 'ORG'), ('IBM', 'ORG'), ('Salesforce', 'PRODUCT')]
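For the project described above, one would keep only the ORG mentions when building a company relationship graph. Note the model is not perfect; Salesforce is mis-tagged as PRODUCT here. A minimal sketch:
# Keep only organization mentions from the parsed text
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(orgs)  # ['Microsoft', 'Google', 'IBM']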
If you want emoji identification support, install spacymoji in the course environment:
pip install spacymoji
After installing spacymoji, if Python still complains that the module is not found, my guess is that you do not have pip installed in your conda environment. Go to your course conda environment, install pip there, and then install the spacymoji package with the pip you just installed in that environment:
conda install pip
YOUR_MINICONDA_PATH/miniconda3/envs/cpsc330/bin/pip install spacymoji
from spacymoji import Emoji
nlp.add_pipe("emoji", first=True);
Does the text have any emojis? If yes, extract the description.
doc = nlp(sample_text)
doc._.emoji
[('😻', 138, 'smiling cat with heart-eyes'),
('👍🏿', 139, 'thumbs up dark skin tone')]
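spacymoji also sets a boolean flag on the document, which is what the feature engineering code below relies on:
doc._.has_emoji  # True, since sample_text contains two emojis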
Simple feature engineering for our problem.#
import en_core_web_md
import spacy
nlp = en_core_web_md.load()
from spacymoji import Emoji
nlp.add_pipe("emoji", first=True)
def get_relative_length(text, TWITTER_ALLOWED_CHARS=280.0):
    """
    Returns the relative length of text.

    Parameters:
    ------
    text: (str)
        the input text

    Keyword arguments:
    ------
    TWITTER_ALLOWED_CHARS: (float)
        the denominator for finding relative length

    Returns:
    -------
    relative length of text: (float)
    """
    return len(text) / TWITTER_ALLOWED_CHARS
def get_length_in_words(text):
    """
    Returns the length of the text in words.

    Parameters:
    ------
    text: (str)
        the input text

    Returns:
    -------
    length of tokenized text: (int)
    """
    return len(nltk.word_tokenize(text))
def get_sentiment(text):
    """
    Returns the compound score representing the sentiment of the text,
    ranging from -1 (most extreme negative) to +1 (most extreme positive).
    The compound score is a normalized score calculated by summing the
    valence scores of each word in the lexicon.

    Parameters:
    ------
    text: (str)
        the input text

    Returns:
    -------
    sentiment of the text: (float)
    """
    scores = sid.polarity_scores(text)
    return scores["compound"]
def get_avg_word_length(text):
    """
    Returns the average word length of the given text.

    Parameters:
    text -- (str)
    """
    words = text.split()
    return sum(len(word) for word in words) / len(words)
def has_emoji(text):
    """
    Returns 1 if the given text contains an emoji and 0 otherwise.

    Parameters:
    text -- (str)
    """
    doc = nlp(text)
    return 1 if doc._.has_emoji else 0
import nltk
nltk.download('punkt_tab')
[nltk_data] Downloading package punkt_tab to
[nltk_data] /Users/kvarada/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
True
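Before applying the helpers to the whole data frame, it can help to sanity-check them on a single made-up tweet (hypothetical string; outputs not shown):
toy_tweet = "Panic buying everywhere!! Stores are out of toilet paper 😱"
print(get_length_in_words(toy_tweet))
print(get_sentiment(toy_tweet))
print(get_relative_length(toy_tweet))
print(get_avg_word_length(toy_tweet))
print(has_emoji(toy_tweet))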
train_df = train_df.assign(n_words=train_df["OriginalTweet"].apply(get_length_in_words))
train_df = train_df.assign(vader_sentiment=train_df["OriginalTweet"].apply(get_sentiment))
train_df = train_df.assign(rel_char_len=train_df["OriginalTweet"].apply(get_relative_length))
test_df = test_df.assign(n_words=test_df["OriginalTweet"].apply(get_length_in_words))
test_df = test_df.assign(vader_sentiment=test_df["OriginalTweet"].apply(get_sentiment))
test_df = test_df.assign(rel_char_len=test_df["OriginalTweet"].apply(get_relative_length))
train_df = train_df.assign(
average_word_length=train_df["OriginalTweet"].apply(get_avg_word_length)
)
test_df = test_df.assign(average_word_length=test_df["OriginalTweet"].apply(get_avg_word_length))
# whether all letters are uppercase or not (all_caps)
train_df = train_df.assign(
all_caps=train_df["OriginalTweet"].apply(lambda x: 1 if x.isupper() else 0)
)
test_df = test_df.assign(
all_caps=test_df["OriginalTweet"].apply(lambda x: 1 if x.isupper() else 0)
)
train_df = train_df.assign(has_emoji=train_df["OriginalTweet"].apply(has_emoji))
test_df = test_df.assign(has_emoji=test_df["OriginalTweet"].apply(has_emoji))
train_df.head()
 | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment | n_words | vader_sentiment | rel_char_len | average_word_length | all_caps | has_emoji |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1927 | 1928 | 46880 | Seattle, WA | 13-03-2020 | While I don't like all of Amazon's choices, to... | Positive | 31 | -0.1053 | 0.589286 | 5.640000 | 0 | 0 |
1068 | 1069 | 46021 | NaN | 13-03-2020 | Me: shit buckets, its time to do the weekly s... | Negative | 52 | -0.2500 | 0.932143 | 4.636364 | 0 | 0 |
803 | 804 | 45756 | The Outer Limits | 12-03-2020 | @SecPompeo @realDonaldTrump You mean the plan ... | Neutral | 44 | 0.0000 | 0.910714 | 6.741935 | 0 | 0 |
2846 | 2847 | 47799 | Flagstaff, AZ | 15-03-2020 | @lauvagrande People who are sick arent panic ... | Extremely Negative | 46 | -0.8481 | 0.907143 | 5.023810 | 0 | 0 |
3768 | 3769 | 48721 | Montreal, Canada | 16-03-2020 | Coronavirus Panic: Toilet Paper Is the People... | Negative | 21 | -0.5106 | 0.500000 | 9.846154 | 0 | 0 |
train_df.shape
(3038, 12)
(train_df['all_caps'] == 1).sum()
0
Not a single tweet in the training split is written entirely in uppercase, so the all_caps feature will not carry any information for this particular dataset.
X_train = train_df.drop(columns=['Sentiment'])
numeric_features = ['vader_sentiment',
'rel_char_len',
'average_word_length']
passthrough_features = ['all_caps', 'has_emoji']
text_feature = 'OriginalTweet'
drop_features = ['UserName', 'ScreenName', 'Location', 'TweetAt']
preprocessor = make_column_transformer(
(StandardScaler(), numeric_features),
("passthrough", passthrough_features),
(CountVectorizer(stop_words='english'), text_feature),
("drop", drop_features)
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
results["LR (more feats)"] = mean_std_cross_val_scores(
pipe, X_train, y_train, return_train_score=True, scoring=scoring_metrics
)
pd.DataFrame(results).T
 | fit_time | score_time | test_score | train_score |
---|---|---|---|---|
dummy | 0.001 (+/- 0.001) | 0.001 (+/- 0.001) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
logistic regression | 0.278 (+/- 0.019) | 0.008 (+/- 0.000) | 0.414 (+/- 0.012) | 0.999 (+/- 0.000) |
LR (more feats) | 0.246 (+/- 0.015) | 0.009 (+/- 0.000) | 0.690 (+/- 0.007) | 0.998 (+/- 0.001) |
pipe.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                                  ['vader_sentiment', 'rel_char_len',
                                                   'average_word_length']),
                                                 ('passthrough', 'passthrough',
                                                  ['all_caps', 'has_emoji']),
                                                 ('countvectorizer',
                                                  CountVectorizer(stop_words='english'),
                                                  'OriginalTweet'),
                                                 ('drop', 'drop',
                                                  ['UserName', 'ScreenName', 'Location',
                                                   'TweetAt'])])),
                ('logisticregression', LogisticRegression(max_iter=1000))])
cv_feats = pipe.named_steps['columntransformer'].named_transformers_['countvectorizer'].get_feature_names_out().tolist()
feat_names = numeric_features + passthrough_features + cv_feats
coefs = pipe.named_steps['logisticregression'].coef_[0]
df = pd.DataFrame(
data={
"features": feat_names,
"coefficients": coefs,
}
)
df.sort_values('coefficients')
 | features | coefficients |
---|---|---|
0 | vader_sentiment | -6.167241 |
11331 | won | -1.384111 |
2551 | coronapocalypse | -0.817034 |
2214 | closed | -0.754165 |
8661 | retail | -0.729109 |
... | ... | ... |
9862 | stupid | 1.157157 |
3299 | don | 1.162007 |
4879 | hell | 1.312696 |
3129 | die | 1.365420 |
7504 | panic | 1.539459 |
11664 rows × 2 columns
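Keep in mind that coef_ has one row per class and coef_[0] corresponds to the first entry of the fitted classes_ attribute; scikit-learn sorts the classes alphabetically, so these coefficients are for 'Extremely Negative'. A quick way to check:
# One coefficient row per class, in the order given by classes_
pipe.named_steps['logisticregression'].classes_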
We get a clear improvement with our engineered features: cross-validation accuracy goes from 0.414 with bag-of-words alone to 0.690 once the extra features are added, with vader_sentiment standing out in the coefficient table above.