# Appendix A: Common features used in text classification

### Bag of words 

- So far, for text data, we have been using bag-of-words features.
- They are good enough for many tasks. But ...
- This encoding throws out a lot of what we know about language.
- It assumes that word order is not that important.
- To improve scores further on text classification tasks, you can carry out **feature engineering**.
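
To see what ignoring word order costs, here is a minimal sketch using only the Python standard library. The two sentences and the `bag_of_words` helper are made up for illustration, and the tokenizer is deliberately naive. The sentences mean opposite things but get identical bag-of-words representations.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())


s1 = "the movie was good not bad"
s2 = "the movie was bad not good"

bow1 = bag_of_words(s1)
bow2 = bag_of_words(s2)

# Same words, same counts: the representation cannot tell the sentences apart.
print(bow1 == bow2)  # True
```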

Let's look at some examples from research papers.  

### Example: Label "Personalized" Important E-mails
- [The Learning Behind Gmail Priority Inbox](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36955.pdf)
- Features: bag of words, trigrams, regular expressions, and so on.
- There might be some "globally" important messages:
    - "This is your mother, something terrible happened, give me a call ASAP."
- But your "important" message may be unimportant to others.
     - Similar for spam: "spam" for one user could be "not spam" for another.

- Social features (e.g., percentage of the sender's emails that are read by the recipient)
- Content features (e.g., recent terms the user has been using in emails)
- Thread features (e.g., whether the user has started the thread)
- ...
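
As a rough illustration of a social feature like the one above, the sketch below computes the fraction of each sender's past emails that the recipient read. The `email_log` data and the `read_rate` helper are hypothetical, and this is not necessarily how the paper computes the feature.

```python
# Hypothetical log of past emails: (sender, was_read_by_recipient)
email_log = [
    ("mom@example.com", True),
    ("mom@example.com", True),
    ("deals@example.com", False),
    ("deals@example.com", False),
    ("deals@example.com", True),
    ("boss@example.com", True),
]


def read_rate(sender, log):
    """Fraction of this sender's past emails that the recipient read (0.0 if none seen)."""
    outcomes = [was_read for s, was_read in log if s == sender]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


for sender in ["mom@example.com", "deals@example.com", "stranger@example.com"]:
    print(sender, read_rate(sender, email_log))
```

A feature like this could be added as an extra numeric column alongside the text features.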

### [The Learning Behind Gmail Priority Inbox](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36955.pdf)

![](img/gmail_priority_inbox.png)

<!-- <img src="img/gmail_priority_inbox.png" width="1000" height="1000"> -->

#### Feature engineering examples: [Automatically Identifying Good Conversations Online](http://www.courtneynapoles.com/res/icwsm17-automatically.pdf)

![](img/classifying_good_conversations_online.png)

<!-- <img src="img/classifying_good_conversations_online.png" width="800" height="800"> -->


### Term weighting (TF-IDF)

- A measure of relatedness between words and documents
- Intuition: Meaningful words may occur repeatedly in related documents, but function words (e.g., _make_, _the_) may be distributed evenly over all documents.


$$\text{tf-idf}(w_i,d_j) = (1+\log(tf_{ij})) \log\frac{D}{df_i}$$


where, 
- $tf_{ij}$ &rarr; number of occurrences of the term $w_i$ in document $d_j$
- $D$ &rarr; number of documents
- $df_i$ &rarr; number of documents in which $w_i$ occurs

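In practice, you can use `TfidfVectorizer` from `sklearn`. Note that its default weighting (smoothed idf plus length normalization) differs slightly from the formula above.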
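
The formula above can also be worked through by hand. The following is a minimal sketch, assuming natural logarithms and whitespace tokenization (neither is specified above) and a made-up toy corpus. It treats the weight as 0 when the term does not occur in the document, since $\log(0)$ is undefined. It follows the formula in this section, not the `sklearn` defaults.

```python
import math

# Toy corpus, made up for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
tokenized = [d.split() for d in docs]
D = len(tokenized)  # number of documents


def tf_idf(word, doc_tokens):
    """(1 + log(tf)) * log(D / df); returns 0.0 when the word is absent (assumption)."""
    tf = doc_tokens.count(word)
    if tf == 0:
        return 0.0
    df = sum(1 for toks in tokenized if word in toks)  # documents containing the word
    return (1 + math.log(tf)) * math.log(D / df)


# "the" occurs in every document, so log(D / df) = log(1) = 0
print(tf_idf("the", tokenized[0]))
# "cat" occurs in 2 of 3 documents, "mat" in only 1, so "mat" gets the larger weight
print(tf_idf("cat", tokenized[0]))
print(tf_idf("mat", tokenized[0]))
```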

### N-grams 

- Incorporating more context 
- A contiguous sequence of _n_ items (characters, tokens) in text.
    <blockquote>
        CPSC330 students are hard-working .
    </blockquote>    

- 2-grams (bigrams): a contiguous sequence of two words
    * _CPSC330 students, students are, are hard-working, hard-working ._
- 3-grams (trigrams): a contiguous sequence of three words
    * _CPSC330 students are, students are hard-working, are hard-working ._
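
The bigrams and trigrams listed above can be reproduced with a few lines of plain Python. This is a minimal sketch; the `ngrams` helper is written here for illustration and assumes the sentence is already tokenized by whitespace.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "CPSC330 students are hard-working .".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print([" ".join(g) for g in bigrams])
# ['CPSC330 students', 'students are', 'are hard-working', 'hard-working .']
print([" ".join(g) for g in trigrams])
# ['CPSC330 students are', 'students are hard-working', 'are hard-working .']
```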

You can extract n-gram features using `CountVectorizer` by passing `ngram_range`.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

X = [
    "URGENT!! As a valued network customer you have been selected to receive a $900 prize reward!",
    "Lol you are always so convincing.",
    "URGENT!! Call right away!!",
]
vec = CountVectorizer(ngram_range=(1, 3))
X_counts = vec.fit_transform(X)
bow_df = pd.DataFrame(X_counts.toarray(), columns=vec.get_feature_names_out().tolist(), index=X)

In [None]:
bow_df

### Aside: [Google n-gram viewer](https://books.google.com/ngrams)
 
- All Our N-gram are Belong to You
    - https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-toyou.html

<blockquote>
Here at Google Research we have been using word n-gram models for a variety
of R&D projects, such as statistical machine translation, speech recognition,
spelling correction, entity detection, information extraction, and others.
That's why we decided to share this enormous dataset with everyone. We
processed 1,024,908,267,229 words of running text and are publishing the
counts for all 1,176,470,663 five-word sequences that appear at least 40
times. There are 13,588,391 unique words, after discarding words that appear
less than 200 times.
</blockquote>

In [None]:
from IPython.display import HTML

url = "https://books.google.com/ngrams/"
HTML(f'<iframe src="{url}" width="1000" height="800"></iframe>')

### Aside: [Google n-gram viewer](https://books.google.com/ngrams)
 
- Count the occurrences of the bigram _smart women_ in the corpus from 1800 to 2000 

![](img/ngram_viewer_smart_woman.png)

<!-- <img src="img/ngram_viewer_smart_woman.png" width="800" height="800"> -->

### Aside: [Google n-gram viewer](https://books.google.com/ngrams)
 
- Trends in the word _challenge_ used as a noun vs. verb

![](img/ngram_viewer_challenge_NN_VB.png)

<!-- <img src="img/ngram_viewer_challenge_NN_VB.png" width="800" height="800"> -->

### Part-of-speech features

#### Part-of-speech (POS) in English

- Part-of-speech: A kind of syntactic category that tells you some of the grammatical properties of a word.
    * Noun &rarr; water, sun, cat  
    * Verb &rarr; run, eat, teach

    
<blockquote>
The ____ was running. 
</blockquote>    

- Only a noun fits here. 

### Part-of-speech (POS) features

- POS features use POS information for the words in text.  

<blockquote>
    CPSC330/<span style="color:green">PROPER_NOUN</span> students/<span style="color:green">NOUN</span> are/<span style="color:green">VERB</span> hard-working/<span style="color:green">ADJECTIVE</span>
</blockquote>    
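
One simple way to turn tagged text into features is to count tags and word/tag pairs. The sketch below does this for the tagged example above. The tags are copied by hand (in practice a POS tagger would produce them), and the `pos_features` helper and its feature names are made up for illustration.

```python
from collections import Counter

# Hand-tagged tokens copied from the example above
tagged = [
    ("CPSC330", "PROPER_NOUN"),
    ("students", "NOUN"),
    ("are", "VERB"),
    ("hard-working", "ADJECTIVE"),
]


def pos_features(tagged_tokens):
    """Build a feature dict with one count per POS tag and one count per word/TAG pair."""
    feats = Counter()
    for word, tag in tagged_tokens:
        feats[f"pos={tag}"] += 1
        feats[f"{word.lower()}/{tag}"] += 1
    return dict(feats)


features = pos_features(tagged)
print(features)
```

A dictionary like this could then be converted to a numeric matrix, for example with `DictVectorizer` from `sklearn`.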


### An example from a project 

- Data: a bunch of documents 
- Task: identify texts with *permissions* and identify who is giving permission to whom. 

<blockquote>
<b>You</b> may <b>disclose</b> Google confidential information when compelled to do so by law if <b>you</b> provide <b>us</b> reasonable prior notice, unless a court orders that <b>we</b> not receive notice.
</blockquote>

- A very simple solution
    * Look for pronouns and verbs. 
    * Add POS tags as features in your model. 
    * Maybe look up words similar to **disclose**.
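
A very rough sketch of the first and last steps, using hand-written word lists instead of a real POS tagger or similarity lookup. Both `PRONOUNS` and `PERMISSION_VERBS` are hypothetical, incomplete lists chosen for illustration only.

```python
import re

PRONOUNS = {"you", "we", "us", "i", "they", "them"}  # hypothetical, incomplete list
PERMISSION_VERBS = {"disclose", "share", "reveal"}   # hypothetical words similar to "disclose"

sentence = (
    "You may disclose Google confidential information when compelled to do so "
    "by law if you provide us reasonable prior notice, unless a court orders "
    "that we not receive notice."
)

# Lowercase and keep alphabetic tokens only (a deliberately simple tokenizer)
tokens = re.findall(r"[a-z]+", sentence.lower())
found_pronouns = [t for t in tokens if t in PRONOUNS]
found_verbs = [t for t in tokens if t in PERMISSION_VERBS]

print(found_pronouns)  # ['you', 'you', 'us', 'we']
print(found_verbs)     # ['disclose']
```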
    

### Penn Treebank part-of-speech tags (bonus)

![](img/PTB_POS.png)

<!-- <img src="img/PTB_POS.png" width="900" height="900"/> -->

- You also need to download the language model, which contains all the pre-trained models. To do so, run the following in your course `conda` environment.