{ "cells": [ { "cell_type": "markdown", "id": "3811e6e2-fae4-48fe-92ce-d19930dda37a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](img/330-banner.png)" ] }, { "cell_type": "markdown", "id": "67f25497-8be5-480a-854c-cf728b6da55c", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "# Lecture 14: Midterm review guiding questions\n", "\n", "UBC 2023-24\n", "\n", "Instructor: Varada Kolhatkar" ] }, { "cell_type": "markdown", "id": "ecc0bcc5-d8ae-45d6-ab33-81b3d9d15c92", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "c58c8084-8c94-4ac8-9550-8e5fb8991bbb", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import os\n", "import sys\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import numpy.random as npr\n", "import pandas as pd\n", "from sklearn.compose import (\n", " ColumnTransformer,\n", " TransformedTargetRegressor,\n", " make_column_transformer,\n", ")\n", "from sklearn.dummy import DummyClassifier, DummyRegressor\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV\n", "from sklearn.metrics import make_scorer, mean_squared_error, r2_score\n", "from sklearn.model_selection import cross_val_score, cross_validate, train_test_split\n", "from sklearn.pipeline import Pipeline, make_pipeline\n", "from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler\n", "from sklearn.svm import SVC\n", "from sklearn.tree import DecisionTreeRegressor" ] }, { "cell_type": "markdown", "id": "4568c7cf-39e5-48c7-b2e7-a6b0ff49878f", "metadata": {}, "source": [ "## ML fundamentals\n", "\n", "- What are the four splits of data we have seen so far? 
\n", "- What are the advantages of cross-validation?\n", "- Why is it important to look at sub-scores of cross-validation?\n", "- What is the fundamental trade-off in supervised machine learning?\n", "- What is the Golden rule in supervised machine learning? " ] }, { "cell_type": "markdown", "id": "ea6efe36-72ac-4244-97f3-9e47f99e7f84", "metadata": {}, "source": [ "## Pros and cons of different ML models\n", "\n", "- Decision trees\n", "- KNNs, SVM RBFs\n", "- Linear models \n", "- Random forests\n", "- LGBM, CatBoost\n", "- Stacking, averaging " ] }, { "cell_type": "markdown", "id": "888bd743-456a-4f6f-b5ef-139c52d6818e", "metadata": {}, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "id": "bf30b454-9f43-481e-9b1c-da43031fc0d8", "metadata": {}, "source": [ "Let's bring back our quiz2 grades toy dataset. " ] }, { "cell_type": "code", "execution_count": 2, "id": "64a854ff-7887-40d0-b2fe-a1c3c0189b4a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " enjoy_course ml_experience major class_attendance \\\n", "0 yes 1 Computer Science Excellent \n", "1 yes 1 Mechanical Engineering Average \n", "2 yes 0 Mathematics Poor \n", "3 no 0 Mathematics Excellent \n", "4 yes 0 Psychology Good \n", "\n", " university_years lab1 lab2 lab3 lab4 quiz1 quiz2 \n", "0 3 92 93.0 84 91 92 A+ \n", "1 2 94 90.0 80 83 91 not A+ \n", "2 3 78 85.0 83 80 80 not A+ \n", "3 3 91 NaN 92 91 89 A+ \n", "4 4 77 83.0 90 92 85 A+ " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades_df = pd.read_csv('data/quiz2-grade-toy-col-transformer.csv')\n", "grades_df.head()" ] }, { "cell_type": "code", "execution_count": 3, "id": "2e081808-d250-40a2-8500-0fba312533a7", "metadata": {}, "outputs": [], "source": [ "X, y = grades_df.drop(columns=['quiz2']), grades_df['quiz2']" ] }, { "cell_type": "code", "execution_count": 4, "id": "4f74c704-4c1a-42ef-8838-3f0ae6b36a50", "metadata": {}, "outputs": [], "source": [ "numeric_feats = [\"university_years\", \"lab1\", \"lab3\", \"lab4\", \"quiz1\"] # apply scaling\n", "categorical_feats = [\"major\"] # apply one-hot encoding\n", "passthrough_feats = [\"ml_experience\"] # do not apply any transformation\n", "drop_feats = [\n", " \"lab2\",\n", " \"class_attendance\",\n", " \"enjoy_course\",\n", "] # do not include these features in modeling" ] }, { "cell_type": "markdown", "id": "fef7234a-0368-4716-876e-ef0da500c0a2", "metadata": {}, "source": [ "- What's the difference between sklearn estimators and transformers? \n", "- Can you think of a better way to impute missing values compared to `SimpleImputer`? 
" ] }, { "cell_type": "markdown", "id": "42b10938-8e20-4be4-aefd-22b686c4851f", "metadata": {}, "source": [ "### One-hot encoding \n", "- What's the purpose of the following arguments of one-hot encoding?\n", " - handle_unknown=\"ignore\"\n", " - sparse_output=False\n", " - drop=\"if_binary\" \n", "- How do you deal with categorical features with only two possible categories? " ] }, { "cell_type": "markdown", "id": "e4cf3d09-b521-476b-80cc-ea9e87ad191e", "metadata": {}, "source": [ "### Ordinal encoding\n", "\n", "- What's the difference between ordinal encoding and one-hot encoding? \n", "- What happens if we do not order the categories when we apply ordinal encoding? Does it matter if we order the categories in ascending or descending order? \n", "- What would happen if an unknown category shows up at validation or test time with ordinal encoding? For example, for the `class_attendance` feature, what if a category called \"super poor\" shows up? \n", "\n", "
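A minimal sketch of the unknown-category question, with made-up rows rather than the course data: by default `OrdinalEncoder` raises an error on a category it did not see during `fit`, but `handle_unknown="use_encoded_value"` maps it to a sentinel instead.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit ascending order for class_attendance.
attendance_order = ["Poor", "Average", "Good", "Excellent"]

oe = OrdinalEncoder(
    categories=[attendance_order],
    handle_unknown="use_encoded_value",  # don't raise on unseen categories...
    unknown_value=-1,                    # ...map them to -1 instead
)

train = pd.DataFrame({"class_attendance": ["Excellent", "Average", "Poor", "Good"]})
test = pd.DataFrame({"class_attendance": ["super poor"]})  # unseen at fit time

oe.fit(train)
print(oe.transform(test))  # the unseen category becomes -1
```

Whether -1 is a sensible stand-in is a modeling decision: it quietly places every unseen category below "Poor" on the ordinal scale.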



" ] }, { "cell_type": "markdown", "id": "ca74bc0b-3175-4c9d-be61-cef9b8bee30a", "metadata": {}, "source": [ "#### OHE vs. ordinal encoding\n", "\n", "- Since the `enjoy_course` feature is binary, you decide to apply one-hot encoding with `drop=\"if_binary\"`. Your friend decides to apply ordinal encoding to it. Will it make any difference in the transformed data? " ] }, { "cell_type": "code", "execution_count": 5, "id": "3037d867-ed9d-4a08-b027-3dea90c8d2d3", "metadata": {}, "outputs": [], "source": [ "ohe = OneHotEncoder(drop=\"if_binary\", sparse_output=False)\n", "ohe_encoded = ohe.fit_transform(grades_df[['enjoy_course']]).ravel()" ] }, { "cell_type": "code", "execution_count": 6, "id": "07f82887-36f1-428d-a423-c70ffd104c7d", "metadata": {}, "outputs": [], "source": [ "oe = OrdinalEncoder()\n", "oe_encoded = oe.fit_transform(grades_df[['enjoy_course']]).ravel()" ] }, { "cell_type": "code", "execution_count": 7, "id": "8f89b3cb-b42d-40b5-94e7-c6dfbf010c8e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " oe_encoded ohe_encoded\n", "0 1.0 1.0\n", "1 1.0 1.0\n", "2 1.0 1.0\n", "3 0.0 0.0\n", "4 1.0 1.0\n", "5 0.0 0.0\n", "6 1.0 1.0\n", "7 0.0 0.0\n", "8 0.0 0.0\n", "9 1.0 1.0\n", "10 1.0 1.0\n", "11 1.0 1.0\n", "12 1.0 1.0\n", "13 1.0 1.0\n", "14 0.0 0.0\n", "15 0.0 0.0\n", "16 1.0 1.0\n", "17 1.0 1.0\n", "18 0.0 0.0\n", "19 0.0 0.0\n", "20 1.0 1.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = { \"oe_encoded\": oe_encoded, \n", " \"ohe_encoded\": ohe_encoded}\n", "pd.DataFrame(data)" ] }, { "cell_type": "markdown", "id": "23b71e80-fe7e-446d-b54c-a88d06fc48c1", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- In what scenarios is it OK to break the golden rule?\n", "- What are possible ways to deal with categorical columns with a large number of categories? \n", "- In what scenarios would you not include a feature in your model even if it's a good predictor? " ] }, { "cell_type": "markdown", "id": "81a054f9-d9e3-4885-b697-e8d3821b0cfd", "metadata": {}, "source": [ "- What's the problem with calling `fit_transform` on the test data in the context of `CountVectorizer`?\n", "- Do we need to scale after applying the bag-of-words representation? " ] }, { "cell_type": "markdown", "id": "0f12b0b8-58a9-48d6-a79d-e534cfafdff5", "metadata": {}, "source": [ "### Hyperparameter optimization \n", "\n", "- What makes hyperparameter optimization a hard problem?\n", "- What are two different tools provided by sklearn for \n", "hyperparameter optimization? \n", "- What is optimization bias? " ] }, { "cell_type": "markdown", "id": "066bd8da-ddc6-42e5-ac93-dd146732d6fa", "metadata": {}, "source": [ "### Evaluation metrics\n", "\n", "- Why is accuracy not always enough?\n", "- Why is it useful to get prediction probabilities? \n", "- In what scenarios do you care more about precision or recall? 
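To make the precision/recall question concrete, here is a tiny hand-made example (the labels are invented for illustration): the same predictions can score differently on the two metrics, because they count different kinds of mistakes.

```python
from sklearn.metrics import precision_score, recall_score

# 1 = positive class (e.g., "fraud"); hand-picked toy labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

# TP = 2, FP = 2, FN = 1
print(precision_score(y_true, y_pred))  # 2 / (2 + 2) = 0.5
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.666...
```

Roughly: when false positives are the expensive mistake (e.g., flagging legitimate transactions), you care more about precision; when missed positives are the expensive mistake (e.g., disease screening), you care more about recall.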
\n", "- What's the main difference between AP score and F1 score?\n", "- What are the advantages of RMSE or MAPE over MSE? " ] }, { "cell_type": "markdown", "id": "a1e6c11b-ee26-4d37-87ea-2b6bd3560f60", "metadata": {}, "source": [ "### Ensembles\n", "\n", "- How does a random forest model inject randomness into the model?\n", "- What's the difference between random forests and gradient-boosted trees?\n", "- Why do we need averaging or stacking? \n", "- What are the benefits of stacking over averaging? " ] }, { "cell_type": "markdown", "id": "30fe7d3b-ec86-4ec3-86b8-f7ef6d0b1b79", "metadata": {}, "source": [ "### Feature importances and selection \n", "\n", "- What are the limitations of looking at simple correlations between features and targets? \n", "- How can you get feature importances for non-linear models?\n", "- What might you need to explain a single prediction?\n", "- What's the difference between feature engineering and feature selection? \n", "- Why do we need feature selection?\n", "- What are the three possible ways we looked at for feature selection? \n" ] } ], "metadata": { "kernelspec": { "display_name": "cpsc330", "language": "python", "name": "cpsc330" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }