Lecture 14: Midterm review guiding questions#
UBC 2023-24
Instructor: Varada Kolhatkar
Imports#
```python
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
import pandas as pd
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
```
ML fundamentals#
What are four splits of data we have seen so far?
What are the advantages of cross-validation?
Why is it important to look at the sub-scores of cross-validation?
What is the fundamental trade-off in supervised machine learning?
What is the Golden rule in supervised machine learning?
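As a quick refresher on the sub-scores question, here is a minimal sketch (on synthetic data, not the lecture's dataset) showing why the mean cross-validation score alone can hide fold-to-fold variability:

```python
# Minimal sketch: inspect individual cross-validation fold scores,
# not just their mean, on a synthetic classification problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

cv_results = cross_validate(
    LogisticRegression(), X, y, cv=5, return_train_score=True
)
# The per-fold sub-scores reveal variance that the mean hides:
print("test sub-scores:", np.round(cv_results["test_score"], 3))
print("mean test score:", cv_results["test_score"].mean().round(3))
print("mean train score:", cv_results["train_score"].mean().round(3))
```

A large spread across the sub-scores suggests the score estimate is unreliable (e.g., too little data), which a single averaged number would not show.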
Pros and cons of different ML models#
Decision trees
KNNs, SVM RBFs
Linear models
Random forests
LGBM, CatBoost
Stacking, averaging
Preprocessing#
Let’s bring back our quiz2 grades toy dataset.
```python
grades_df = pd.read_csv('data/quiz2-grade-toy-col-transformer.csv')
grades_df.head()
```
|    | enjoy_course | ml_experience | major | class_attendance | university_years | lab1 | lab2 | lab3 | lab4 | quiz1 | quiz2 |
|----|---|---|---|---|---|---|---|---|---|---|
| 0 | yes | 1 | Computer Science | Excellent | 3 | 92 | 93.0 | 84 | 91 | 92 | A+ |
| 1 | yes | 1 | Mechanical Engineering | Average | 2 | 94 | 90.0 | 80 | 83 | 91 | not A+ |
| 2 | yes | 0 | Mathematics | Poor | 3 | 78 | 85.0 | 83 | 80 | 80 | not A+ |
| 3 | no | 0 | Mathematics | Excellent | 3 | 91 | NaN | 92 | 91 | 89 | A+ |
| 4 | yes | 0 | Psychology | Good | 4 | 77 | 83.0 | 90 | 92 | 85 | A+ |
```python
X, y = grades_df.drop(columns=['quiz2']), grades_df['quiz2']

numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = [
    "lab2",
    "class_attendance",
    "enjoy_course",
]  # do not include these features in modeling
```
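As a sketch (not necessarily the lecture's solution), the feature lists above could be wired into a `ColumnTransformer`; the tiny frame below is made up for illustration and only mimics the columns of `grades_df`:

```python
# Sketch: a ColumnTransformer that scales numeric features, one-hot encodes
# categorical features, passes ml_experience through, and drops the rest.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]
categorical_feats = ["major"]
passthrough_feats = ["ml_experience"]

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_feats),
    (OneHotEncoder(handle_unknown="ignore"), categorical_feats),
    ("passthrough", passthrough_feats),
    remainder="drop",  # lab2, class_attendance, enjoy_course are dropped
)

# Made-up two-row frame with the same columns as grades_df (illustration only)
toy = pd.DataFrame({
    "enjoy_course": ["yes", "no"],
    "ml_experience": [1, 0],
    "major": ["Mathematics", "Psychology"],
    "class_attendance": ["Good", "Poor"],
    "university_years": [3, 4],
    "lab1": [90, 80], "lab2": [85.0, None],
    "lab3": [88, 70], "lab4": [91, 75], "quiz1": [89, 72],
})
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # 5 scaled + 2 one-hot + 1 passthrough = (2, 8)
```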
What’s the difference between sklearn estimators and transformers?
Can you think of a better way to impute missing values compared to `SimpleImputer`?
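One candidate answer (not necessarily the lecture's) is `KNNImputer`, which fills a missing value from the k most similar rows instead of one global column statistic:

```python
# Sketch: SimpleImputer uses the column mean; KNNImputer uses the values
# of the k nearest rows, which can respect local structure in the data.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])

simple = SimpleImputer(strategy="mean").fit_transform(X)
knn = KNNImputer(n_neighbors=2).fit_transform(X)
print("mean-imputed:", simple[1, 1])  # global column mean
print("KNN-imputed:", knn[1, 1])      # mean of the 2 nearest rows' values
```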
One-hot encoding#
What’s the purpose of the following arguments of one-hot encoding?
`handle_unknown="ignore"`
`sparse_output=False`
`drop="if_binary"`
How do you deal with categorical features with only two possible categories?
Ordinal encoding#
What’s the difference between ordinal encoding and one-hot encoding?
What happens if we do not order the categories when we apply ordinal encoding? Does it matter if we order the categories in ascending or descending order?
What would happen if an unknown category shows up during validation or test time with ordinal encoding? For example, what if a category called “super poor” shows up for the `class_attendance` feature?
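One way to probe this question (a sketch with made-up data, not the lecture's answer): pass the category order explicitly, and use `handle_unknown="use_encoded_value"` so an unseen category maps to a sentinel instead of raising an error:

```python
# Sketch: explicit category ordering plus a sentinel for unknown categories.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = ["Poor", "Average", "Good", "Excellent"]  # worst -> best
oe = OrdinalEncoder(
    categories=[order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # sentinel for categories never seen during fit
)
train_df = pd.DataFrame({"class_attendance": ["Good", "Poor", "Excellent"]})
test_df = pd.DataFrame({"class_attendance": ["Average", "super poor"]})

oe.fit(train_df)
print(oe.transform(test_df).ravel())  # [ 1. -1.]  ("super poor" -> sentinel)
```

Without `handle_unknown="use_encoded_value"`, the default behavior is to raise an error on the unseen category at transform time.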
OHE vs. ordinal encoding#
Since the `enjoy_course` feature is binary, you decide to apply one-hot encoding with `drop="if_binary"`. Your friend decides to apply ordinal encoding on it. Will it make any difference in the transformed data?
```python
ohe = OneHotEncoder(drop="if_binary", sparse_output=False)
ohe_encoded = ohe.fit_transform(grades_df[['enjoy_course']]).ravel()

oe = OrdinalEncoder()
oe_encoded = oe.fit_transform(grades_df[['enjoy_course']]).ravel()

data = {"oe_encoded": oe_encoded,
        "ohe_encoded": ohe_encoded}
pd.DataFrame(data)
```
|    | oe_encoded | ohe_encoded |
|----|---|---|
| 0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 |
| 3 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 |
| 5 | 0.0 | 0.0 |
| 6 | 1.0 | 1.0 |
| 7 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 |
| 9 | 1.0 | 1.0 |
| 10 | 1.0 | 1.0 |
| 11 | 1.0 | 1.0 |
| 12 | 1.0 | 1.0 |
| 13 | 1.0 | 1.0 |
| 14 | 0.0 | 0.0 |
| 15 | 0.0 | 0.0 |
| 16 | 1.0 | 1.0 |
| 17 | 1.0 | 1.0 |
| 18 | 0.0 | 0.0 |
| 19 | 0.0 | 0.0 |
| 20 | 1.0 | 1.0 |
In what scenarios is it OK to break the golden rule?
What are possible ways to deal with categorical columns with a large number of categories?
In what scenarios would you not include a feature in your model even if it’s a good predictor?
What’s the problem with calling `fit_transform` on the test data in the context of `CountVectorizer`?
Do we need to scale features after applying the bag-of-words representation?
Hyperparameter optimization#
What makes hyperparameter optimization a hard problem?
What are two different tools provided by sklearn for hyperparameter optimization?
What is optimization bias?
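A minimal sketch of the two sklearn tools side by side (synthetic data; the parameter grid is made up for illustration):

```python
# Sketch: GridSearchCV tries every candidate exhaustively; RandomizedSearchCV
# samples a fixed number of candidates from a distribution.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=42)
pipe = make_pipeline(StandardScaler(), SVC())

grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=3).fit(X, y)
rand = RandomizedSearchCV(
    pipe, {"svc__C": loguniform(1e-2, 1e2)}, n_iter=5, cv=3, random_state=42
).fit(X, y)
print("grid best:", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```

Note that both searches cross-validate inside the training set; reporting `best_score_` as the final estimate is exactly where optimization bias creeps in.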
Evaluation metrics#
Why is accuracy not always enough?
Why is it useful to get prediction probabilities?
In what scenarios do you care more about precision or recall?
What’s the main difference between the AP score and the F1 score?
What are the advantages of RMSE or MAPE over MSE?
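A sketch of the accuracy question on a made-up imbalanced problem, along with `predict_proba` for confidence scores:

```python
# Sketch: on imbalanced data, accuracy can look good while recall is poor;
# predict_proba gives scores useful for thresholding and ranking.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.1).astype(int)  # roughly 10% positives: imbalanced

model = LogisticRegression().fit(X, y)
pred = model.predict(X)
proba = model.predict_proba(X)[:, 1]  # P(y = 1), usable for custom thresholds

print("accuracy:", accuracy_score(y, pred))  # can be high by predicting 0s
print("recall:", recall_score(y, pred, zero_division=0))
print("precision:", precision_score(y, pred, zero_division=0))
```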
Ensembles#
How does a random forest model inject randomness into the model?
What’s the difference between random forests and gradient boosted trees?
Why do we need averaging or stacking?
What are the benefits of stacking over averaging?
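The averaging-vs-stacking distinction, sketched on synthetic data: `VotingClassifier` combines the base models' predictions with a fixed rule, while `StackingClassifier` trains a meta-model to learn how to weight them:

```python
# Sketch: averaging (fixed combination rule) vs. stacking (learned combiner).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)
base = [("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier(random_state=42))]

averaging = VotingClassifier(base, voting="soft")  # average predicted probabilities
stacking = StackingClassifier(base, final_estimator=LogisticRegression())

for name, model in [("averaging", averaging), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=3).mean().round(3))
```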
Feature importances and selection#
What are the limitations of looking at simple correlations between features and targets?
How can you get feature importances for non-linear models?
What might you need to explain a single prediction?
What’s the difference between feature engineering and feature selection?
Why do we need feature selection?
What are the three possible ways we looked at for feature selection?
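Three common feature-selection approaches, sketched on synthetic data (these may or may not match the lecture's list): model-based selection, recursive feature elimination, and forward sequential search:

```python
# Sketch: three feature-selection strategies, each keeping 3 of 8 features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=100, n_features=8, n_informative=3, random_state=42
)

# Model-based: keep the 3 features with the largest coefficients
model_based = SelectFromModel(
    LogisticRegression(), max_features=3, threshold=-np.inf
).fit(X, y)
# Recursive: repeatedly refit and drop the weakest feature
rfe = RFE(LogisticRegression(), n_features_to_select=3).fit(X, y)
# Search-based: greedily add the feature that most improves CV score
forward = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=3, direction="forward"
).fit(X, y)

for name, sel in [("model-based", model_based), ("RFE", rfe), ("forward", forward)]:
    print(name, sel.get_support().nonzero()[0])
```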