Lecture 14: Midterm review guiding questions#

UBC 2023-24

Instructor: Varada Kolhatkar

Imports#

import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
import pandas as pd
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor

ML fundamentals#

  • What are the four splits of data we have seen so far?

  • What are the advantages of cross-validation?

  • Why is it important to look at the sub-scores of cross-validation?

  • What is the fundamental trade-off in supervised machine learning?

  • What is the golden rule in supervised machine learning?
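
The cross-validation questions above can be made concrete with a minimal sketch (on a synthetic dataset, not our course data): `cross_validate` returns one sub-score per fold, and looking at all of them shows how much the estimate varies, not just its mean.

```python
# A minimal sketch of cross-validation on synthetic data.
# The per-fold sub-scores reveal the variability that a single mean hides.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
scores = cross_validate(
    LogisticRegression(), X, y, cv=5, return_train_score=True
)
print(scores["test_score"])         # five sub-scores, one per fold
print(scores["test_score"].mean())  # the usual single summary number
```

Comparing `train_score` and `test_score` here is also a quick way to see the fundamental trade-off: a large gap suggests overfitting.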

Pros and cons of different ML models#

  • Decision trees

  • KNNs, SVM RBFs

  • Linear models

  • Random forests

  • LGBM, CatBoost

  • Stacking, averaging

Preprocessing#

Let’s bring back our quiz2 grades toy dataset.

grades_df = pd.read_csv('data/quiz2-grade-toy-col-transformer.csv')
grades_df.head()
enjoy_course ml_experience major class_attendance university_years lab1 lab2 lab3 lab4 quiz1 quiz2
0 yes 1 Computer Science Excellent 3 92 93.0 84 91 92 A+
1 yes 1 Mechanical Engineering Average 2 94 90.0 80 83 91 not A+
2 yes 0 Mathematics Poor 3 78 85.0 83 80 80 not A+
3 no 0 Mathematics Excellent 3 91 NaN 92 91 89 A+
4 yes 0 Psychology Good 4 77 83.0 90 92 85 A+
X, y = grades_df.drop(columns=['quiz2']), grades_df['quiz2']
numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = [
    "lab2",
    "class_attendance",
    "enjoy_course",
]  # do not include these features in modeling
  • What’s the difference between sklearn estimators and transformers?

  • Can you think of a better way to impute missing values compared to SimpleImputer?
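
One candidate answer to the imputation question: `KNNImputer` fills a missing value using the rows most similar on the other features, rather than a column-wide mean. A tiny contrast, on made-up numbers:

```python
# SimpleImputer fills with the column mean; KNNImputer borrows the value
# from the nearest row(s), which can respect structure in the data.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[90.0, 92.0],
              [80.0, 81.0],
              [88.0, np.nan]])
print(SimpleImputer().fit_transform(X))            # missing -> column mean 86.5
print(KNNImputer(n_neighbors=1).fit_transform(X))  # missing -> 92.0 (nearest row)
```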

One-hot encoding#

  • What’s the purpose of the following arguments of one-hot encoding?

    • handle_unknown="ignore"

    • sparse_output=False

    • drop="if_binary"

  • How do you deal with categorical features with only two possible categories?

Ordinal encoding#

  • What’s the difference between ordinal encoding and one-hot encoding?

  • What happens if we do not order the categories when we apply ordinal encoding? Does it matter if we order the categories in ascending or descending order?

  • What would happen if an unknown category shows up during validation or test time with ordinal encoding? For example, for the class_attendance feature, what if a category called “super poor” shows up?
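
A sketch of ordinal encoding with an explicit category order, plus one way to survive an unknown category such as “super poor” at transform time:

```python
# Passing categories= fixes the order (worst to best here); by default the
# order would be alphabetical, which is usually meaningless.
# handle_unknown="use_encoded_value" maps unseen categories to a sentinel
# instead of raising an error.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

levels = ["Poor", "Average", "Good", "Excellent"]  # worst to best
train = pd.DataFrame({"class_attendance": ["Excellent", "Average", "Poor"]})
test = pd.DataFrame({"class_attendance": ["super poor"]})

oe = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # sentinel for unseen categories
)
print(oe.fit_transform(train))  # [[3.], [1.], [0.]]
print(oe.transform(test))       # [[-1.]]
```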

OHE vs. ordinal encoding#

  • Since the enjoy_course feature is binary, you decide to apply one-hot encoding with drop="if_binary". Your friend decides to apply ordinal encoding on it. Will it make any difference in the transformed data?

ohe = OneHotEncoder(drop="if_binary", sparse_output=False)
ohe_encoded = ohe.fit_transform(grades_df[['enjoy_course']]).ravel()
oe = OrdinalEncoder()
oe_encoded = oe.fit_transform(grades_df[['enjoy_course']]).ravel()
data = { "oe_encoded": oe_encoded, 
         "ohe_encoded": ohe_encoded}
pd.DataFrame(data)
oe_encoded ohe_encoded
0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
3 0.0 0.0
4 1.0 1.0
5 0.0 0.0
6 1.0 1.0
7 0.0 0.0
8 0.0 0.0
9 1.0 1.0
10 1.0 1.0
11 1.0 1.0
12 1.0 1.0
13 1.0 1.0
14 0.0 0.0
15 0.0 0.0
16 1.0 1.0
17 1.0 1.0
18 0.0 0.0
19 0.0 0.0
20 1.0 1.0
  • In what scenarios is it OK to break the golden rule?

  • What are possible ways to deal with categorical columns with a large number of categories?

  • In what scenarios would you not include a feature in your model even if it’s a good predictor?

  • What’s the problem with calling fit_transform on the test data in the context of CountVectorizer?

  • Do we need to scale after applying bag-of-words representation?
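
The `fit_transform`-on-test problem can be seen directly with a tiny corpus (the documents below are made up): refitting the vectorizer on the test set builds a different vocabulary, so the test features no longer line up with the columns the model was trained on.

```python
# fit on train, transform on test: same feature columns, unseen words ignored.
# fit_transform on test: a new vocabulary, i.e., a different feature space.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie"]
test_docs = ["great acting"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # vocabulary learned from train only
X_test = vec.transform(test_docs)        # same 3 columns; unseen words dropped
print(X_train.shape[1], X_test.shape[1])

X_test_wrong = CountVectorizer().fit_transform(test_docs)  # golden-rule violation
print(X_test_wrong.shape[1])  # different number of features -- model can't use it
```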

Hyperparameter optimization#

  • What makes hyperparameter optimization a hard problem?

  • What are two different tools provided by sklearn for hyperparameter optimization?

  • What is optimization bias?
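
The two sklearn tools in question can be sketched side by side (hyperparameter grids below are arbitrary illustrative choices): `GridSearchCV` tries every combination, while `RandomizedSearchCV` samples a fixed budget of candidates.

```python
# GridSearchCV: exhaustive search over a small grid.
# RandomizedSearchCV: n_iter random draws from (possibly continuous) ranges.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=42)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)

random_search = RandomizedSearchCV(
    SVC(), {"C": np.logspace(-2, 2, 20)}, n_iter=5, cv=3, random_state=42
)
random_search.fit(X, y)
print(random_search.best_params_)
```

Optimization bias is why `best_score_` from either search is an optimistic estimate: picking the maximum over many tried configurations overfits the validation folds, so a held-out test set is still needed.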

Evaluation metrics#

  • Why is accuracy not always enough?

  • Why is it useful to get prediction probabilities?

  • In what scenarios do you care more about precision or recall?

  • What’s the main difference between AP score and F1 score?

  • What are the advantages of RMSE or MAPE over MSE?
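
Why accuracy alone can mislead is easy to demonstrate on synthetic imbalanced data: a dummy classifier scores high accuracy by always predicting the majority class, while recall on the rare class exposes the failure.

```python
# On 95/5 imbalanced labels, "always predict majority" gets 95% accuracy
# but 0% recall on the positive class.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 95 + [1] * 5)  # 5% positive class
X = np.zeros((100, 1))            # features are irrelevant to a dummy model

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)
print(accuracy_score(y, pred))  # 0.95 -- looks great
print(recall_score(y, pred))    # 0.0  -- catches no positives at all
```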

Ensembles#

  • How does a random forest model inject randomness into the model?

  • What’s the difference between random forests and gradient boosted trees?

  • Why do we need averaging or stacking?

  • What are the benefits of stacking over averaging?
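
Averaging and stacking can be contrasted in a short sketch (on synthetic data, with an arbitrary trio of base models): `VotingClassifier` combines predictions by a vote, while `StackingClassifier` fits a meta-model that learns how much to trust each base model.

```python
# Averaging (voting) vs. stacking over the same base estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier, StackingClassifier, VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)
base = [
    ("lr", LogisticRegression()),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svc", SVC()),
]

averaging = VotingClassifier(base).fit(X, y)  # majority vote, fixed weights
stacking = StackingClassifier(               # meta-model learns the weights
    base, final_estimator=LogisticRegression()
).fit(X, y)
print(averaging.score(X, y), stacking.score(X, y))
```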

Feature importances and selection#

  • What are the limitations of looking at simple correlations between features and targets?

  • How can you get feature importances for non-linear models?

  • What might you need to explain a single prediction?

  • What’s the difference between feature engineering and feature selection?

  • Why do we need feature selection?

  • What are the three possible ways we looked at for feature selection?
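
Two of the feature selection approaches can be sketched on synthetic data: `SelectFromModel` keeps features whose (here, random forest) importances clear a threshold, and `RFE` recursively removes the weakest feature until a target count remains.

```python
# Model-based selection (SelectFromModel) vs. recursive elimination (RFE)
# on 20 features of which only 5 are informative by construction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=42
)

sfm = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)
print(sfm.transform(X).shape)  # typically far fewer than 20 columns remain

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print(rfe.transform(X).shape)  # exactly 5 columns by construction
```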