Lecture 14: Midterm review guiding questions#
UBC 2023-24
Instructor: Varada Kolhatkar
Imports#
```python
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
import pandas as pd
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
```
ML fundamentals#
What are four splits of data we have seen so far?
What are the advantages of cross-validation?
Why is it important to look at the sub-scores of cross-validation?
What is the fundamental trade-off in supervised machine learning?
What is the Golden rule in supervised machine learning?
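As a quick refresher on the sub-scores question, here is a minimal sketch (on synthetic data, not the lecture's dataset) showing why the mean cross-validation score alone can hide fold-to-fold variability:

```python
# Minimal sketch: inspect individual cross-validation fold scores,
# not just their mean, on a synthetic classification problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

cv_results = cross_validate(
    LogisticRegression(), X, y, cv=5, return_train_score=True
)
# The per-fold sub-scores reveal variance that the mean hides:
print("test sub-scores:", np.round(cv_results["test_score"], 3))
print("mean test score:", cv_results["test_score"].mean().round(3))
print("mean train score:", cv_results["train_score"].mean().round(3))
```

A large spread across the sub-scores suggests the score estimate is unreliable (e.g., too little data), which a single averaged number would not show.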
Pros and cons of different ML models#
Decision trees
KNNs, SVM RBFs
Linear models
Random forests
LGBM, CatBoost
Stacking, averaging
Preprocessing#
Let’s bring back our quiz2 grades toy dataset.
```python
grades_df = pd.read_csv('data/quiz2-grade-toy-col-transformer.csv')
grades_df.head()
```
|    | enjoy_course | ml_experience | major | class_attendance | university_years | lab1 | lab2 | lab3 | lab4 | quiz1 | quiz2 |
|----|---|---|---|---|---|---|---|---|---|---|
| 0 | yes | 1 | Computer Science | Excellent | 3 | 92 | 93.0 | 84 | 91 | 92 | A+ |
| 1 | yes | 1 | Mechanical Engineering | Average | 2 | 94 | 90.0 | 80 | 83 | 91 | not A+ |
| 2 | yes | 0 | Mathematics | Poor | 3 | 78 | 85.0 | 83 | 80 | 80 | not A+ |
| 3 | no | 0 | Mathematics | Excellent | 3 | 91 | NaN | 92 | 91 | 89 | A+ |
| 4 | yes | 0 | Psychology | Good | 4 | 77 | 83.0 | 90 | 92 | 85 | A+ |
```python
X, y = grades_df.drop(columns=['quiz2']), grades_df['quiz2']

numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = [
    "lab2",
    "class_attendance",
    "enjoy_course",
]  # do not include these features in modeling
```
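As a sketch (not necessarily the lecture's solution), the feature lists above could be wired into a `ColumnTransformer`; the tiny frame below is made up for illustration and only mimics the columns of `grades_df`:

```python
# Sketch: a ColumnTransformer that scales numeric features, one-hot encodes
# categorical features, passes ml_experience through, and drops the rest.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]
categorical_feats = ["major"]
passthrough_feats = ["ml_experience"]

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_feats),
    (OneHotEncoder(handle_unknown="ignore"), categorical_feats),
    ("passthrough", passthrough_feats),
    remainder="drop",  # lab2, class_attendance, enjoy_course are dropped
)

# Made-up two-row frame with the same columns as grades_df (illustration only)
toy = pd.DataFrame({
    "enjoy_course": ["yes", "no"],
    "ml_experience": [1, 0],
    "major": ["Mathematics", "Psychology"],
    "class_attendance": ["Good", "Poor"],
    "university_years": [3, 4],
    "lab1": [90, 80], "lab2": [85.0, None],
    "lab3": [88, 70], "lab4": [91, 75], "quiz1": [89, 72],
})
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # 5 scaled + 2 one-hot + 1 passthrough = (2, 8)
```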
What’s the difference between sklearn estimators and transformers?
Can you think of a better way to impute missing values compared to `SimpleImputer`?
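One candidate answer (not necessarily the lecture's) is `KNNImputer`, which fills a missing value from the k most similar rows instead of one global column statistic:

```python
# Sketch: SimpleImputer uses the column mean; KNNImputer uses the values
# of the k nearest rows, which can respect local structure in the data.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])

simple = SimpleImputer(strategy="mean").fit_transform(X)
knn = KNNImputer(n_neighbors=2).fit_transform(X)
print("mean-imputed:", simple[1, 1])  # global column mean
print("KNN-imputed:", knn[1, 1])      # mean of the 2 nearest rows' values
```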
One-hot encoding#
What’s the purpose of the following arguments of one-hot encoding?
`handle_unknown="ignore"`
`sparse_output=False`
`drop="if_binary"`
How do you deal with categorical features with only two possible categories?
Ordinal encoding#
What’s the difference between ordinal encoding and one-hot encoding?
What happens if we do not order the categories when we apply ordinal encoding? Does it matter if we order the categories in ascending or descending order?
What would happen if an unknown category shows up during validation or test time with ordinal encoding? For example, what if a category called “super poor” shows up for the `class_attendance` feature?
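One way to probe this question (a sketch with made-up data, not the lecture's answer): pass the category order explicitly, and use `handle_unknown="use_encoded_value"` so an unseen category maps to a sentinel instead of raising an error:

```python
# Sketch: explicit category ordering plus a sentinel for unknown categories.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = ["Poor", "Average", "Good", "Excellent"]  # worst -> best
oe = OrdinalEncoder(
    categories=[order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # sentinel for categories never seen during fit
)
train_df = pd.DataFrame({"class_attendance": ["Good", "Poor", "Excellent"]})
test_df = pd.DataFrame({"class_attendance": ["Average", "super poor"]})

oe.fit(train_df)
print(oe.transform(test_df).ravel())  # [ 1. -1.]  ("super poor" -> sentinel)
```

Without `handle_unknown="use_encoded_value"`, the default behavior is to raise an error on the unseen category at transform time.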
OHE vs. ordinal encoding#
Since the `enjoy_course` feature is binary, you decide to apply one-hot encoding with `drop="if_binary"`. Your friend decides to apply ordinal encoding on it. Will it make any difference in the transformed data?
```python
ohe = OneHotEncoder(drop="if_binary", sparse_output=False)
ohe_encoded = ohe.fit_transform(grades_df[['enjoy_course']]).ravel()

oe = OrdinalEncoder()
oe_encoded = oe.fit_transform(grades_df[['enjoy_course']]).ravel()

data = {"oe_encoded": oe_encoded,
        "ohe_encoded": ohe_encoded}
pd.DataFrame(data)
```
|    | oe_encoded | ohe_encoded |
|----|---|---|
| 0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 |
| 3 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 |
| 5 | 0.0 | 0.0 |
| 6 | 1.0 | 1.0 |
| 7 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 |
| 9 | 1.0 | 1.0 |
| 10 | 1.0 | 1.0 |
| 11 | 1.0 | 1.0 |
| 12 | 1.0 | 1.0 |
| 13 | 1.0 | 1.0 |
| 14 | 0.0 | 0.0 |
| 15 | 0.0 | 0.0 |
| 16 | 1.0 | 1.0 |
| 17 | 1.0 | 1.0 |
| 18 | 0.0 | 0.0 |
| 19 | 0.0 | 0.0 |
| 20 | 1.0 | 1.0 |
In what scenarios is it OK to break the golden rule?
What are possible ways to deal with categorical columns with a large number of categories?
In what scenarios would you not include a feature in your model even if it’s a good predictor?
What’s the problem with calling `fit_transform` on the test data in the context of `CountVectorizer`?
Do we need to scale features after applying the bag-of-words representation?
Hyperparameter optimization#
What makes hyperparameter optimization a hard problem?
What are two different tools provided by sklearn for hyperparameter optimization?
What is optimization bias?
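A minimal sketch of the two sklearn tools side by side (synthetic data; the parameter grid is made up for illustration):

```python
# Sketch: GridSearchCV tries every candidate exhaustively; RandomizedSearchCV
# samples a fixed number of candidates from a distribution.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=42)
pipe = make_pipeline(StandardScaler(), SVC())

grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=3).fit(X, y)
rand = RandomizedSearchCV(
    pipe, {"svc__C": loguniform(1e-2, 1e2)}, n_iter=5, cv=3, random_state=42
).fit(X, y)
print("grid best:", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```

Note that both searches cross-validate inside the training set; reporting `best_score_` as the final estimate is exactly where optimization bias creeps in.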
Evaluation metrics#
Why is accuracy not always enough?
Why is it useful to get prediction probabilities?
In what scenarios do you care more about precision or recall?
What’s the main difference between the AP score and the F1 score?
What are the advantages of RMSE or MAPE over MSE?
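A sketch of the accuracy question on a made-up imbalanced problem, along with `predict_proba` for confidence scores:

```python
# Sketch: on imbalanced data, accuracy can look good while recall is poor;
# predict_proba gives scores useful for thresholding and ranking.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.1).astype(int)  # roughly 10% positives: imbalanced

model = LogisticRegression().fit(X, y)
pred = model.predict(X)
proba = model.predict_proba(X)[:, 1]  # P(y = 1), usable for custom thresholds

print("accuracy:", accuracy_score(y, pred))  # can be high by predicting 0s
print("recall:", recall_score(y, pred, zero_division=0))
print("precision:", precision_score(y, pred, zero_division=0))
```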
Ensembles#
How does a random forest model inject randomness into the model?
What’s the difference between random forests and gradient boosted trees?
Why do we need averaging or stacking?
What are the benefits of stacking over averaging?
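The averaging-vs-stacking distinction, sketched on synthetic data: `VotingClassifier` combines the base models' predictions with a fixed rule, while `StackingClassifier` trains a meta-model to learn how to weight them:

```python
# Sketch: averaging (fixed combination rule) vs. stacking (learned combiner).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)
base = [("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier(random_state=42))]

averaging = VotingClassifier(base, voting="soft")  # average predicted probabilities
stacking = StackingClassifier(base, final_estimator=LogisticRegression())

for name, model in [("averaging", averaging), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=3).mean().round(3))
```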
Feature importances and selection#
What are the limitations of looking at simple correlations between features and targets?
How can you get feature importances for non-linear models?
What might you need to explain a single prediction?
What’s the difference between feature engineering and feature selection?
Why do we need feature selection?
What are the three possible ways we looked at for feature selection?
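Three common feature-selection approaches, sketched on synthetic data (these may or may not match the lecture's list): model-based selection, recursive feature elimination, and forward sequential search:

```python
# Sketch: three feature-selection strategies, each keeping 3 of 8 features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=100, n_features=8, n_informative=3, random_state=42
)

# Model-based: keep the 3 features with the largest coefficients
model_based = SelectFromModel(
    LogisticRegression(), max_features=3, threshold=-np.inf
).fit(X, y)
# Recursive: repeatedly refit and drop the weakest feature
rfe = RFE(LogisticRegression(), n_features_to_select=3).fit(X, y)
# Search-based: greedily add the feature that most improves CV score
forward = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=3, direction="forward"
).fit(X, y)

for name, sel in [("model-based", model_based), ("RFE", rfe), ("forward", forward)]:
    print(name, sel.get_support().nonzero()[0])
```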