{
"cells": [
{
"cell_type": "markdown",
"id": "3811e6e2-fae4-48fe-92ce-d19930dda37a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "67f25497-8be5-480a-854c-cf728b6da55c",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"# Lecture 14: Midterm review guiding questions\n",
"\n",
"UBC 2023-24\n",
"\n",
"Instructor: Varada Kolhatkar"
]
},
{
"cell_type": "markdown",
"id": "ecc0bcc5-d8ae-45d6-ab33-81b3d9d15c92",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c58c8084-8c94-4ac8-9550-8e5fb8991bbb",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import numpy.random as npr\n",
"import pandas as pd\n",
"from sklearn.compose import (\n",
" ColumnTransformer,\n",
" TransformedTargetRegressor,\n",
" make_column_transformer,\n",
")\n",
"from sklearn.dummy import DummyClassifier, DummyRegressor\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV\n",
"from sklearn.metrics import make_scorer, mean_squared_error, r2_score\n",
"from sklearn.model_selection import cross_val_score, cross_validate, train_test_split\n",
"from sklearn.pipeline import Pipeline, make_pipeline\n",
"from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler\n",
"from sklearn.svm import SVC\n",
"from sklearn.tree import DecisionTreeRegressor"
]
},
{
"cell_type": "markdown",
"id": "4568c7cf-39e5-48c7-b2e7-a6b0ff49878f",
"metadata": {},
"source": [
"## ML fundamentals\n",
"\n",
"- What are four splits of data we have seen so far? \n",
"- What are the advantages of cross-validation?\n",
"- Why it's important to look at sub-scores of cross-validation?\n",
"- What is the fundamental trade-off in supervised machine learning?\n",
"- What is the Golden rule in supervised machine learning? "
]
},
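{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d401",
"metadata": {},
"source": [
"A minimal sketch on synthetic data of why the sub-scores matter: `cross_validate` returns one score per fold, and the spread across folds tells us how much to trust the mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d402",
"metadata": {},
"outputs": [],
"source": [
"# A sketch on synthetic data: the fold-by-fold sub-scores of cross_validate\n",
"# reveal variation that the mean score hides.\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"X_toy, y_toy = make_classification(n_samples=100, random_state=42)\n",
"cv_results = cross_validate(\n",
"    DecisionTreeClassifier(), X_toy, y_toy, cv=5, return_train_score=True\n",
")\n",
"pd.DataFrame(cv_results)  # look at each fold, not just the mean"
]
},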
{
"cell_type": "markdown",
"id": "ea6efe36-72ac-4244-97f3-9e47f99e7f84",
"metadata": {},
"source": [
"## Pros and cons of different ML models\n",
"\n",
"- Decision trees\n",
"- KNNs, SVM RBFs\n",
"- Linear models \n",
"- Random forests\n",
"- LGBM, CatBoost\n",
"- Stacking, averaging "
]
},
{
"cell_type": "markdown",
"id": "888bd743-456a-4f6f-b5ef-139c52d6818e",
"metadata": {},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "bf30b454-9f43-481e-9b1c-da43031fc0d8",
"metadata": {},
"source": [
"Let's bring back our quiz2 grades toy dataset. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "64a854ff-7887-40d0-b2fe-a1c3c0189b4a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" enjoy_course | \n",
" ml_experience | \n",
" major | \n",
" class_attendance | \n",
" university_years | \n",
" lab1 | \n",
" lab2 | \n",
" lab3 | \n",
" lab4 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" yes | \n",
" 1 | \n",
" Computer Science | \n",
" Excellent | \n",
" 3 | \n",
" 92 | \n",
" 93.0 | \n",
" 84 | \n",
" 91 | \n",
" 92 | \n",
" A+ | \n",
"
\n",
" \n",
" 1 | \n",
" yes | \n",
" 1 | \n",
" Mechanical Engineering | \n",
" Average | \n",
" 2 | \n",
" 94 | \n",
" 90.0 | \n",
" 80 | \n",
" 83 | \n",
" 91 | \n",
" not A+ | \n",
"
\n",
" \n",
" 2 | \n",
" yes | \n",
" 0 | \n",
" Mathematics | \n",
" Poor | \n",
" 3 | \n",
" 78 | \n",
" 85.0 | \n",
" 83 | \n",
" 80 | \n",
" 80 | \n",
" not A+ | \n",
"
\n",
" \n",
" 3 | \n",
" no | \n",
" 0 | \n",
" Mathematics | \n",
" Excellent | \n",
" 3 | \n",
" 91 | \n",
" NaN | \n",
" 92 | \n",
" 91 | \n",
" 89 | \n",
" A+ | \n",
"
\n",
" \n",
" 4 | \n",
" yes | \n",
" 0 | \n",
" Psychology | \n",
" Good | \n",
" 4 | \n",
" 77 | \n",
" 83.0 | \n",
" 90 | \n",
" 92 | \n",
" 85 | \n",
" A+ | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" enjoy_course ml_experience major class_attendance \\\n",
"0 yes 1 Computer Science Excellent \n",
"1 yes 1 Mechanical Engineering Average \n",
"2 yes 0 Mathematics Poor \n",
"3 no 0 Mathematics Excellent \n",
"4 yes 0 Psychology Good \n",
"\n",
" university_years lab1 lab2 lab3 lab4 quiz1 quiz2 \n",
"0 3 92 93.0 84 91 92 A+ \n",
"1 2 94 90.0 80 83 91 not A+ \n",
"2 3 78 85.0 83 80 80 not A+ \n",
"3 3 91 NaN 92 91 89 A+ \n",
"4 4 77 83.0 90 92 85 A+ "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grades_df = pd.read_csv('data/quiz2-grade-toy-col-transformer.csv')\n",
"grades_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2e081808-d250-40a2-8500-0fba312533a7",
"metadata": {},
"outputs": [],
"source": [
"X, y = grades_df.drop(columns=['quiz2']), grades_df['quiz2']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f74c704-4c1a-42ef-8838-3f0ae6b36a50",
"metadata": {},
"outputs": [],
"source": [
"numeric_feats = [\"university_years\", \"lab1\", \"lab3\", \"lab4\", \"quiz1\"] # apply scaling\n",
"categorical_feats = [\"major\"] # apply one-hot encoding\n",
"passthrough_feats = [\"ml_experience\"] # do not apply any transformation\n",
"drop_feats = [\n",
" \"lab2\",\n",
" \"class_attendance\",\n",
" \"enjoy_course\",\n",
"] # do not include these features in modeling"
]
},
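{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d403",
"metadata": {},
"source": [
"A sketch of how these feature groups could be combined into a single preprocessor with `make_column_transformer` (the median-imputation choice here is an assumption, not the only option):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d404",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: one preprocessor built from the feature groups above.\n",
"# Numeric features: impute then scale; categorical: one-hot encode;\n",
"# ml_experience passes through unchanged; unlisted columns are dropped.\n",
"preprocessor = make_column_transformer(\n",
"    (make_pipeline(SimpleImputer(strategy=\"median\"), StandardScaler()), numeric_feats),\n",
"    (OneHotEncoder(handle_unknown=\"ignore\"), categorical_feats),\n",
"    (\"passthrough\", passthrough_feats),\n",
"    remainder=\"drop\",\n",
")\n",
"preprocessor.fit_transform(X)"
]
},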
{
"cell_type": "markdown",
"id": "fef7234a-0368-4716-876e-ef0da500c0a2",
"metadata": {},
"source": [
"- What's the difference between sklearn estimators and transformers? \n",
"- Can you think of a better way to impute missing values compared to `SimpleImputer`? "
]
},
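{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d405",
"metadata": {},
"source": [
"One alternative to `SimpleImputer` is `KNNImputer`, which fills a missing value from the most similar rows rather than a global mean or median; a sketch on the numeric columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d406",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: KNNImputer fills lab2's missing value using the n_neighbors\n",
"# rows most similar on the other numeric columns, instead of a global\n",
"# mean/median as SimpleImputer would.\n",
"from sklearn.impute import KNNImputer\n",
"\n",
"knn_imputer = KNNImputer(n_neighbors=2)\n",
"knn_imputer.fit_transform(grades_df[[\"lab1\", \"lab2\", \"lab3\", \"lab4\", \"quiz1\"]])"
]
},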
{
"cell_type": "markdown",
"id": "42b10938-8e20-4be4-aefd-22b686c4851f",
"metadata": {},
"source": [
"### One-hot encoding \n",
"- What's the purpose of the following arguments of one-hot encoding?\n",
" - handle_unknown=\"ignore\"\n",
" - sparse=False\n",
" - drop=\"if_binary\" \n",
"- How do you deal with categorical features with only two possible categories? "
]
},
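{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d407",
"metadata": {},
"source": [
"A sketch of `handle_unknown=\"ignore\"`: a category unseen during `fit` is encoded as all zeros at `transform` time instead of raising an error (the \"Bioinformatics\" major is assumed absent from the toy data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d408",
"metadata": {},
"outputs": [],
"source": [
"# handle_unknown=\"ignore\": an unseen category becomes an all-zero row\n",
"# at transform time instead of raising an error.\n",
"ohe_demo = OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)\n",
"ohe_demo.fit(grades_df[[\"major\"]])\n",
"ohe_demo.transform(pd.DataFrame({\"major\": [\"Bioinformatics\"]}))  # -> all zeros"
]
},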
{
"cell_type": "markdown",
"id": "e4cf3d09-b521-476b-80cc-ea9e87ad191e",
"metadata": {},
"source": [
"### Ordinal encoding\n",
"\n",
"- What's the difference between ordinal encoding and one-hot encoding? \n",
"- What happens if we do not order the categories when we apply ordinal encoding? Does it matter if we order the categories in ascending or descending order? \n",
"- What would happen if an unknown category shows up during validation or test time during ordinal encoding? For example, for `class_attendance` feature what if a category called \"super poor\" shows up? \n",
"\n",
"
"
]
},
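{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d409",
"metadata": {},
"source": [
"A sketch of passing the category order explicitly and handling unknown categories (the ordering below is the natural ascending order of `class_attendance`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d410",
"metadata": {},
"outputs": [],
"source": [
"# Without `categories`, OrdinalEncoder sorts alphabetically, which would\n",
"# scramble the natural order. handle_unknown=\"use_encoded_value\" maps an\n",
"# unseen category such as \"super poor\" to unknown_value instead of erroring.\n",
"attendance_levels = [\"Poor\", \"Average\", \"Good\", \"Excellent\"]\n",
"oe_demo = OrdinalEncoder(\n",
"    categories=[attendance_levels],\n",
"    handle_unknown=\"use_encoded_value\",\n",
"    unknown_value=-1,\n",
")\n",
"oe_demo.fit(grades_df[[\"class_attendance\"]])\n",
"oe_demo.transform(pd.DataFrame({\"class_attendance\": [\"super poor\"]}))  # -> [[-1.]]"
]
},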
{
"cell_type": "markdown",
"id": "ca74bc0b-3175-4c9d-be61-cef9b8bee30a",
"metadata": {},
"source": [
"#### OHE vs. ordinal encoding\n",
"\n",
"- Since `enjoy_course` feature is binary you decide to apply one-hot encoding with `drop=\"if_binary\"`. Your friend decide to apply ordinal encoding on it. Will it make any difference in the transformed data? "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3037d867-ed9d-4a08-b027-3dea90c8d2d3",
"metadata": {},
"outputs": [],
"source": [
"ohe = OneHotEncoder(drop=\"if_binary\", sparse_output=False)\n",
"ohe_encoded = ohe.fit_transform(grades_df[['enjoy_course']]).ravel()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "07f82887-36f1-428d-a423-c70ffd104c7d",
"metadata": {},
"outputs": [],
"source": [
"oe = OrdinalEncoder()\n",
"oe_encoded = oe.fit_transform(grades_df[['enjoy_course']]).ravel()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8f89b3cb-b42d-40b5-94e7-c6dfbf010c8e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" oe_encoded | \n",
" ohe_encoded | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 8 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 9 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 10 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 11 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 12 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 13 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 14 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 15 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 16 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 17 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 18 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 19 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 20 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" oe_encoded ohe_encoded\n",
"0 1.0 1.0\n",
"1 1.0 1.0\n",
"2 1.0 1.0\n",
"3 0.0 0.0\n",
"4 1.0 1.0\n",
"5 0.0 0.0\n",
"6 1.0 1.0\n",
"7 0.0 0.0\n",
"8 0.0 0.0\n",
"9 1.0 1.0\n",
"10 1.0 1.0\n",
"11 1.0 1.0\n",
"12 1.0 1.0\n",
"13 1.0 1.0\n",
"14 0.0 0.0\n",
"15 0.0 0.0\n",
"16 1.0 1.0\n",
"17 1.0 1.0\n",
"18 0.0 0.0\n",
"19 0.0 0.0\n",
"20 1.0 1.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = { \"oe_encoded\": oe_encoded, \n",
" \"ohe_encoded\": ohe_encoded}\n",
"pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"id": "23b71e80-fe7e-446d-b54c-a88d06fc48c1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- In what scenarios it's OK to break the golden rule?\n",
"- What are possible ways to deal with categorical columns with large number of categories? \n",
"- In what scenarios you'll not include a feature in your model even if it's a good predictor? "
]
},
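{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d411",
"metadata": {},
"source": [
"One sketch for dealing with a large number of categories: pool infrequent ones into a single column with `min_frequency` (available in newer scikit-learn); hand-crafted grouping or target encoding are other options."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d412",
"metadata": {},
"outputs": [],
"source": [
"# A sketch on a made-up column: categories appearing fewer than\n",
"# min_frequency times are pooled into one \"infrequent\" column.\n",
"many_cats = pd.DataFrame(\n",
"    {\"city\": [\"Vancouver\"] * 8 + [\"Surrey\"] * 6 + [\"Kelowna\", \"Victoria\", \"Burnaby\"]}\n",
")\n",
"ohe_rare = OneHotEncoder(\n",
"    handle_unknown=\"infrequent_if_exist\", min_frequency=5, sparse_output=False\n",
")\n",
"ohe_rare.fit_transform(many_cats)\n",
"ohe_rare.get_feature_names_out()  # rare cities collapse into one column"
]
},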
{
"cell_type": "markdown",
"id": "81a054f9-d9e3-4885-b697-e8d3821b0cfd",
"metadata": {},
"source": [
"- What's the problem with calling `fit_transform` on the test data in the context of `CountVectorizer`?\n",
"- Do we need to scale after applying bag-of-words representation? "
]
},
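{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d413",
"metadata": {},
"source": [
"A sketch of the correct pattern with `CountVectorizer`: fit on training text only, so the vocabulary (and hence the meaning of each column) stays fixed at test time."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d414",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: calling fit_transform on test data would rebuild the vocabulary\n",
"# (breaking the golden rule and changing the columns); transform() keeps\n",
"# the training-time vocabulary fixed and ignores unseen words.\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"train_texts = [\"good movie\", \"bad movie\", \"great plot\"]\n",
"test_texts = [\"good plot\", \"terrible acting\"]  # \"terrible\", \"acting\" unseen\n",
"\n",
"vec = CountVectorizer()\n",
"X_train_bow = vec.fit_transform(train_texts)  # learn vocabulary on train only\n",
"X_test_bow = vec.transform(test_texts)        # unseen words are ignored"
]
},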
{
"cell_type": "markdown",
"id": "0f12b0b8-58a9-48d6-a79d-e534cfafdff5",
"metadata": {},
"source": [
"### Hyperparameter optimization \n",
"\n",
"- What makes hyperparameter optimization a hard problem?\n",
"- What are two different tools provided by sklearn for \n",
"hyperparameter optimization? \n",
"- What is optimization bias? "
]
},
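{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d415",
"metadata": {},
"source": [
"The two sklearn tools are `GridSearchCV` (exhaustive search) and `RandomizedSearchCV` (random sampling); a sketch on an SVC pipeline with arbitrary parameter ranges:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d416",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: exhaustive grid search vs. random sampling of the same space.\n",
"from scipy.stats import loguniform\n",
"from sklearn.model_selection import GridSearchCV, RandomizedSearchCV\n",
"\n",
"pipe = make_pipeline(StandardScaler(), SVC())\n",
"\n",
"grid_search = GridSearchCV(\n",
"    pipe,\n",
"    param_grid={\"svc__C\": [0.1, 1, 10], \"svc__gamma\": [0.01, 0.1, 1]},\n",
"    cv=5,\n",
")\n",
"random_search = RandomizedSearchCV(\n",
"    pipe,\n",
"    param_distributions={\"svc__C\": loguniform(1e-2, 1e2)},\n",
"    n_iter=10,\n",
"    cv=5,\n",
"    random_state=42,\n",
")"
]
},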
{
"cell_type": "markdown",
"id": "066bd8da-ddc6-42e5-ac93-dd146732d6fa",
"metadata": {},
"source": [
"### Evaluation metrics\n",
"\n",
"- Why accuracy is not always enough?\n",
"- Why it's useful to get prediction probabilities? \n",
"- In what scenarios do you care more about precision or recall? \n",
"- What's the main difference between AP score and F1 score?\n",
"- What are advantages of RMSE or MAPE over MSE? "
]
},
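{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d417",
"metadata": {},
"source": [
"A sketch of these classification metrics on synthetic imbalanced data (the class weights are an arbitrary choice):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d418",
"metadata": {},
"outputs": [],
"source": [
"# A sketch on synthetic imbalanced data: accuracy alone hides which kinds\n",
"# of errors a classifier makes; precision/recall and predict_proba help.\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.metrics import (\n",
"    average_precision_score,\n",
"    f1_score,\n",
"    precision_score,\n",
"    recall_score,\n",
")\n",
"\n",
"X_toy, y_toy = make_classification(n_samples=200, weights=[0.9], random_state=42)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)\n",
"\n",
"clf = LogisticRegression().fit(X_tr, y_tr)\n",
"y_pred = clf.predict(X_te)\n",
"proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class\n",
"\n",
"print(\"precision:\", precision_score(y_te, y_pred))\n",
"print(\"recall:   \", recall_score(y_te, y_pred))\n",
"print(\"f1:       \", f1_score(y_te, y_pred))\n",
"print(\"AP score: \", average_precision_score(y_te, proba))  # needs probabilities"
]
},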
{
"cell_type": "markdown",
"id": "a1e6c11b-ee26-4d37-87ea-2b6bd3560f60",
"metadata": {},
"source": [
"### Ensembles\n",
"\n",
"- How does a random forest model inject randomness in the model?\n",
"- What's the difference between random forests and gradient boosted trees?\n",
"- Why do we need averaging or stacking? \n",
"- What are the benefits of stacking over averaging? "
]
},
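{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d419",
"metadata": {},
"source": [
"A sketch contrasting averaging (`VotingClassifier`) with stacking (`StackingClassifier`), which fits a meta-model on the base models' cross-validated predictions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d420",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: averaging combines the base models' predictions directly;\n",
"# stacking fits a meta-model (LogisticRegression by default) on their\n",
"# cross-validated predictions, learning how much to trust each one.\n",
"from sklearn.ensemble import StackingClassifier, VotingClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"base_models = [\n",
"    (\"lr\", make_pipeline(StandardScaler(), LogisticRegression())),\n",
"    (\"tree\", DecisionTreeClassifier(random_state=42)),\n",
"    (\"svm\", make_pipeline(StandardScaler(), SVC(probability=True))),\n",
"]\n",
"averaging = VotingClassifier(estimators=base_models, voting=\"soft\")\n",
"stacking = StackingClassifier(estimators=base_models)"
]
},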
{
"cell_type": "markdown",
"id": "30fe7d3b-ec86-4ec3-86b8-f7ef6d0b1b79",
"metadata": {},
"source": [
"### Feature importances and selection \n",
"\n",
"- What are the limitations of looking at simple correlations between features and targets? \n",
"- How can you get feature importances or non-linear models?\n",
"- What you might need to explain a single prediction?\n",
"- What's the difference between feature engineering and feature selection? \n",
"- Why do we need feature selection?\n",
"- What are the three possible ways we looked at for feature selection? \n"
]
}
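,
{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d421",
"metadata": {},
"source": [
"A sketch of one feature-selection approach, recursive feature elimination (`RFE`); model-based selection (`SelectFromModel`) and sequential forward selection (`SequentialFeatureSelector`) are two other common options."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d422",
"metadata": {},
"outputs": [],
"source": [
"# A sketch on synthetic data: RFE repeatedly fits the model and drops the\n",
"# weakest feature until n_features_to_select remain.\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.feature_selection import RFE\n",
"\n",
"X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=42)\n",
"rfe = RFE(LogisticRegression(), n_features_to_select=5)\n",
"rfe.fit(X_toy, y_toy)\n",
"rfe.support_  # boolean mask over features: True = selected"
]
}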
],
"metadata": {
"kernelspec": {
"display_name": "cpsc330",
"language": "python",
"name": "cpsc330"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}