{
"cells": [
{
"cell_type": "markdown",
"id": "3811e6e2-fae4-48fe-92ce-d19930dda37a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "67f25497-8be5-480a-854c-cf728b6da55c",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"# Lecture 14: Midterm review guiding questions\n",
"\n",
"UBC 2023-24\n",
"\n",
"Instructor: Varada Kolhatkar"
]
},
{
"cell_type": "markdown",
"id": "ecc0bcc5-d8ae-45d6-ab33-81b3d9d15c92",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c58c8084-8c94-4ac8-9550-8e5fb8991bbb",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import numpy.random as npr\n",
"import pandas as pd\n",
"from sklearn.compose import (\n",
" ColumnTransformer,\n",
" TransformedTargetRegressor,\n",
" make_column_transformer,\n",
")\n",
"from sklearn.dummy import DummyClassifier, DummyRegressor\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV\n",
"from sklearn.metrics import make_scorer, mean_squared_error, r2_score\n",
"from sklearn.model_selection import cross_val_score, cross_validate, train_test_split\n",
"from sklearn.pipeline import Pipeline, make_pipeline\n",
"from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler\n",
"from sklearn.svm import SVC\n",
"from sklearn.tree import DecisionTreeRegressor"
]
},
{
"cell_type": "markdown",
"id": "4568c7cf-39e5-48c7-b2e7-a6b0ff49878f",
"metadata": {},
"source": [
"## ML fundamentals\n",
"\n",
"- What are four splits of data we have seen so far? \n",
"- What are the advantages of cross-validation?\n",
"- Why it's important to look at sub-scores of cross-validation?\n",
"- What is the fundamental trade-off in supervised machine learning?\n",
"- What is the Golden rule in supervised machine learning? "
]
},
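{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d401",
"metadata": {},
"source": [
"A minimal sketch on synthetic data of why the sub-scores matter: `cross_validate` returns one score per fold, and the spread across folds tells us how much to trust the mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d402",
"metadata": {},
"outputs": [],
"source": [
"# A sketch on synthetic data: the fold-by-fold sub-scores of cross_validate\n",
"# reveal variation that the mean score hides.\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"X_toy, y_toy = make_classification(n_samples=100, random_state=42)\n",
"cv_results = cross_validate(\n",
"    DecisionTreeClassifier(), X_toy, y_toy, cv=5, return_train_score=True\n",
")\n",
"pd.DataFrame(cv_results)  # look at each fold, not just the mean"
]
},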
{
"cell_type": "markdown",
"id": "ea6efe36-72ac-4244-97f3-9e47f99e7f84",
"metadata": {},
"source": [
"## Pros and cons of different ML models\n",
"\n",
"- Decision trees\n",
"- KNNs, SVM RBFs\n",
"- Linear models \n",
"- Random forests\n",
"- LGBM, CatBoost\n",
"- Stacking, averaging "
]
},
{
"cell_type": "markdown",
"id": "888bd743-456a-4f6f-b5ef-139c52d6818e",
"metadata": {},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "bf30b454-9f43-481e-9b1c-da43031fc0d8",
"metadata": {},
"source": [
"Let's bring back our quiz2 grades toy dataset. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "64a854ff-7887-40d0-b2fe-a1c3c0189b4a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" enjoy_course | \n",
" ml_experience | \n",
" major | \n",
" class_attendance | \n",
" university_years | \n",
" lab1 | \n",
" lab2 | \n",
" lab3 | \n",
" lab4 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" yes | \n",
" 1 | \n",
" Computer Science | \n",
" Excellent | \n",
" 3 | \n",
" 92 | \n",
" 93.0 | \n",
" 84 | \n",
" 91 | \n",
" 92 | \n",
" A+ | \n",
"
\n",
" \n",
" 1 | \n",
" yes | \n",
" 1 | \n",
" Mechanical Engineering | \n",
" Average | \n",
" 2 | \n",
" 94 | \n",
" 90.0 | \n",
" 80 | \n",
" 83 | \n",
" 91 | \n",
" not A+ | \n",
"
\n",
" \n",
" 2 | \n",
" yes | \n",
" 0 | \n",
" Mathematics | \n",
" Poor | \n",
" 3 | \n",
" 78 | \n",
" 85.0 | \n",
" 83 | \n",
" 80 | \n",
" 80 | \n",
" not A+ | \n",
"
\n",
" \n",
" 3 | \n",
" no | \n",
" 0 | \n",
" Mathematics | \n",
" Excellent | \n",
" 3 | \n",
" 91 | \n",
" NaN | \n",
" 92 | \n",
" 91 | \n",
" 89 | \n",
" A+ | \n",
"
\n",
" \n",
" 4 | \n",
" yes | \n",
" 0 | \n",
" Psychology | \n",
" Good | \n",
" 4 | \n",
" 77 | \n",
" 83.0 | \n",
" 90 | \n",
" 92 | \n",
" 85 | \n",
" A+ | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" enjoy_course ml_experience major class_attendance \\\n",
"0 yes 1 Computer Science Excellent \n",
"1 yes 1 Mechanical Engineering Average \n",
"2 yes 0 Mathematics Poor \n",
"3 no 0 Mathematics Excellent \n",
"4 yes 0 Psychology Good \n",
"\n",
" university_years lab1 lab2 lab3 lab4 quiz1 quiz2 \n",
"0 3 92 93.0 84 91 92 A+ \n",
"1 2 94 90.0 80 83 91 not A+ \n",
"2 3 78 85.0 83 80 80 not A+ \n",
"3 3 91 NaN 92 91 89 A+ \n",
"4 4 77 83.0 90 92 85 A+ "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grades_df = pd.read_csv('data/quiz2-grade-toy-col-transformer.csv')\n",
"grades_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2e081808-d250-40a2-8500-0fba312533a7",
"metadata": {},
"outputs": [],
"source": [
"X, y = grades_df.drop(columns=['quiz2']), grades_df['quiz2']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f74c704-4c1a-42ef-8838-3f0ae6b36a50",
"metadata": {},
"outputs": [],
"source": [
"numeric_feats = [\"university_years\", \"lab1\", \"lab3\", \"lab4\", \"quiz1\"] # apply scaling\n",
"categorical_feats = [\"major\"] # apply one-hot encoding\n",
"passthrough_feats = [\"ml_experience\"] # do not apply any transformation\n",
"drop_feats = [\n",
" \"lab2\",\n",
" \"class_attendance\",\n",
" \"enjoy_course\",\n",
"] # do not include these features in modeling"
]
},
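{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d403",
"metadata": {},
"source": [
"A sketch of how these feature groups could be combined into a single preprocessor with `make_column_transformer` (the median-imputation choice here is an assumption, not the only option):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d404",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: one preprocessor built from the feature groups above.\n",
"# Numeric features: impute then scale; categorical: one-hot encode;\n",
"# ml_experience passes through unchanged; unlisted columns are dropped.\n",
"preprocessor = make_column_transformer(\n",
"    (make_pipeline(SimpleImputer(strategy=\"median\"), StandardScaler()), numeric_feats),\n",
"    (OneHotEncoder(handle_unknown=\"ignore\"), categorical_feats),\n",
"    (\"passthrough\", passthrough_feats),\n",
"    remainder=\"drop\",\n",
")\n",
"preprocessor.fit_transform(X)"
]
},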
{
"cell_type": "markdown",
"id": "fef7234a-0368-4716-876e-ef0da500c0a2",
"metadata": {},
"source": [
"- What's the difference between sklearn estimators and transformers? \n",
"- Can you think of a better way to impute missing values compared to `SimpleImputer`? "
]
},
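{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d405",
"metadata": {},
"source": [
"One alternative to `SimpleImputer` is `KNNImputer`, which fills a missing value from the most similar rows rather than a global mean or median; a sketch on the numeric columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d406",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: KNNImputer fills lab2's missing value using the n_neighbors\n",
"# rows most similar on the other numeric columns, instead of a global\n",
"# mean/median as SimpleImputer would.\n",
"from sklearn.impute import KNNImputer\n",
"\n",
"knn_imputer = KNNImputer(n_neighbors=2)\n",
"knn_imputer.fit_transform(grades_df[[\"lab1\", \"lab2\", \"lab3\", \"lab4\", \"quiz1\"]])"
]
},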
{
"cell_type": "markdown",
"id": "42b10938-8e20-4be4-aefd-22b686c4851f",
"metadata": {},
"source": [
"### One-hot encoding \n",
"- What's the purpose of the following arguments of one-hot encoding?\n",
" - handle_unknown=\"ignore\"\n",
" - sparse=False\n",
" - drop=\"if_binary\" \n",
"- How do you deal with categorical features with only two possible categories? "
]
},
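{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d407",
"metadata": {},
"source": [
"A sketch of `handle_unknown=\"ignore\"`: a category unseen during `fit` is encoded as all zeros at `transform` time instead of raising an error (the \"Bioinformatics\" major is assumed absent from the toy data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d408",
"metadata": {},
"outputs": [],
"source": [
"# handle_unknown=\"ignore\": an unseen category becomes an all-zero row\n",
"# at transform time instead of raising an error.\n",
"ohe_demo = OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)\n",
"ohe_demo.fit(grades_df[[\"major\"]])\n",
"ohe_demo.transform(pd.DataFrame({\"major\": [\"Bioinformatics\"]}))  # -> all zeros"
]
},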
{
"cell_type": "markdown",
"id": "e4cf3d09-b521-476b-80cc-ea9e87ad191e",
"metadata": {},
"source": [
"### Ordinal encoding\n",
"\n",
"- What's the difference between ordinal encoding and one-hot encoding? \n",
"- What happens if we do not order the categories when we apply ordinal encoding? Does it matter if we order the categories in ascending or descending order? \n",
"- What would happen if an unknown category shows up during validation or test time during ordinal encoding? For example, for `class_attendance` feature what if a category called \"super poor\" shows up? \n",
"\n",
"
"
]
},
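{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d409",
"metadata": {},
"source": [
"A sketch of passing the category order explicitly and handling unknown categories (the ordering below is the natural ascending order of `class_attendance`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d410",
"metadata": {},
"outputs": [],
"source": [
"# Without `categories`, OrdinalEncoder sorts alphabetically, which would\n",
"# scramble the natural order. handle_unknown=\"use_encoded_value\" maps an\n",
"# unseen category such as \"super poor\" to unknown_value instead of erroring.\n",
"attendance_levels = [\"Poor\", \"Average\", \"Good\", \"Excellent\"]\n",
"oe_demo = OrdinalEncoder(\n",
"    categories=[attendance_levels],\n",
"    handle_unknown=\"use_encoded_value\",\n",
"    unknown_value=-1,\n",
")\n",
"oe_demo.fit(grades_df[[\"class_attendance\"]])\n",
"oe_demo.transform(pd.DataFrame({\"class_attendance\": [\"super poor\"]}))  # -> [[-1.]]"
]
},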
{
"cell_type": "markdown",
"id": "ca74bc0b-3175-4c9d-be61-cef9b8bee30a",
"metadata": {},
"source": [
"#### OHE vs. ordinal encoding\n",
"\n",
"- Since `enjoy_course` feature is binary you decide to apply one-hot encoding with `drop=\"if_binary\"`. Your friend decide to apply ordinal encoding on it. Will it make any difference in the transformed data? "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3037d867-ed9d-4a08-b027-3dea90c8d2d3",
"metadata": {},
"outputs": [],
"source": [
"ohe = OneHotEncoder(drop=\"if_binary\", sparse_output=False)\n",
"ohe_encoded = ohe.fit_transform(grades_df[['enjoy_course']]).ravel()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "07f82887-36f1-428d-a423-c70ffd104c7d",
"metadata": {},
"outputs": [],
"source": [
"oe = OrdinalEncoder()\n",
"oe_encoded = oe.fit_transform(grades_df[['enjoy_course']]).ravel()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8f89b3cb-b42d-40b5-94e7-c6dfbf010c8e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" oe_encoded | \n",
" ohe_encoded | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 8 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 9 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 10 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 11 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 12 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 13 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 14 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 15 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 16 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 17 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 18 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 19 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 20 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" oe_encoded ohe_encoded\n",
"0 1.0 1.0\n",
"1 1.0 1.0\n",
"2 1.0 1.0\n",
"3 0.0 0.0\n",
"4 1.0 1.0\n",
"5 0.0 0.0\n",
"6 1.0 1.0\n",
"7 0.0 0.0\n",
"8 0.0 0.0\n",
"9 1.0 1.0\n",
"10 1.0 1.0\n",
"11 1.0 1.0\n",
"12 1.0 1.0\n",
"13 1.0 1.0\n",
"14 0.0 0.0\n",
"15 0.0 0.0\n",
"16 1.0 1.0\n",
"17 1.0 1.0\n",
"18 0.0 0.0\n",
"19 0.0 0.0\n",
"20 1.0 1.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = { \"oe_encoded\": oe_encoded, \n",
" \"ohe_encoded\": ohe_encoded}\n",
"pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"id": "23b71e80-fe7e-446d-b54c-a88d06fc48c1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- In what scenarios it's OK to break the golden rule?\n",
"- What are possible ways to deal with categorical columns with large number of categories? \n",
"- In what scenarios you'll not include a feature in your model even if it's a good predictor? "
]
},
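{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d411",
"metadata": {},
"source": [
"One sketch for dealing with a large number of categories: pool infrequent ones into a single column with `min_frequency` (available in newer scikit-learn); hand-crafted grouping or target encoding are other options."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d412",
"metadata": {},
"outputs": [],
"source": [
"# A sketch on a made-up column: categories appearing fewer than\n",
"# min_frequency times are pooled into one \"infrequent\" column.\n",
"many_cats = pd.DataFrame(\n",
"    {\"city\": [\"Vancouver\"] * 8 + [\"Surrey\"] * 6 + [\"Kelowna\", \"Victoria\", \"Burnaby\"]}\n",
")\n",
"ohe_rare = OneHotEncoder(\n",
"    handle_unknown=\"infrequent_if_exist\", min_frequency=5, sparse_output=False\n",
")\n",
"ohe_rare.fit_transform(many_cats)\n",
"ohe_rare.get_feature_names_out()  # rare cities collapse into one column"
]
},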
{
"cell_type": "markdown",
"id": "81a054f9-d9e3-4885-b697-e8d3821b0cfd",
"metadata": {},
"source": [
"- What's the problem with calling `fit_transform` on the test data in the context of `CountVectorizer`?\n",
"- Do we need to scale after applying bag-of-words representation? "
]
},
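{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d413",
"metadata": {},
"source": [
"A sketch of the correct pattern with `CountVectorizer`: fit on training text only, so the vocabulary (and hence the meaning of each column) stays fixed at test time."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d414",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: calling fit_transform on test data would rebuild the vocabulary\n",
"# (breaking the golden rule and changing the columns); transform() keeps\n",
"# the training-time vocabulary fixed and ignores unseen words.\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"train_texts = [\"good movie\", \"bad movie\", \"great plot\"]\n",
"test_texts = [\"good plot\", \"terrible acting\"]  # \"terrible\", \"acting\" unseen\n",
"\n",
"vec = CountVectorizer()\n",
"X_train_bow = vec.fit_transform(train_texts)  # learn vocabulary on train only\n",
"X_test_bow = vec.transform(test_texts)        # unseen words are ignored"
]
},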
{
"cell_type": "markdown",
"id": "0f12b0b8-58a9-48d6-a79d-e534cfafdff5",
"metadata": {},
"source": [
"### Hyperparameter optimization \n",
"\n",
"- What makes hyperparameter optimization a hard problem?\n",
"- What are two different tools provided by sklearn for \n",
"hyperparameter optimization? \n",
"- What is optimization bias? "
]
},
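{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d415",
"metadata": {},
"source": [
"The two sklearn tools are `GridSearchCV` (exhaustive search) and `RandomizedSearchCV` (random sampling); a sketch on an SVC pipeline with arbitrary parameter ranges:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d416",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: exhaustive grid search vs. random sampling of the same space.\n",
"from scipy.stats import loguniform\n",
"from sklearn.model_selection import GridSearchCV, RandomizedSearchCV\n",
"\n",
"pipe = make_pipeline(StandardScaler(), SVC())\n",
"\n",
"grid_search = GridSearchCV(\n",
"    pipe,\n",
"    param_grid={\"svc__C\": [0.1, 1, 10], \"svc__gamma\": [0.01, 0.1, 1]},\n",
"    cv=5,\n",
")\n",
"random_search = RandomizedSearchCV(\n",
"    pipe,\n",
"    param_distributions={\"svc__C\": loguniform(1e-2, 1e2)},\n",
"    n_iter=10,\n",
"    cv=5,\n",
"    random_state=42,\n",
")"
]
},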
{
"cell_type": "markdown",
"id": "066bd8da-ddc6-42e5-ac93-dd146732d6fa",
"metadata": {},
"source": [
"### Evaluation metrics\n",
"\n",
"- Why accuracy is not always enough?\n",
"- Why it's useful to get prediction probabilities? \n",
"- In what scenarios do you care more about precision or recall? \n",
"- What's the main difference between AP score and F1 score?\n",
"- What are advantages of RMSE or MAPE over MSE? "
]
},
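{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d417",
"metadata": {},
"source": [
"A sketch of these classification metrics on synthetic imbalanced data (the class weights are an arbitrary choice):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d418",
"metadata": {},
"outputs": [],
"source": [
"# A sketch on synthetic imbalanced data: accuracy alone hides which kinds\n",
"# of errors a classifier makes; precision/recall and predict_proba help.\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.metrics import (\n",
"    average_precision_score,\n",
"    f1_score,\n",
"    precision_score,\n",
"    recall_score,\n",
")\n",
"\n",
"X_toy, y_toy = make_classification(n_samples=200, weights=[0.9], random_state=42)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)\n",
"\n",
"clf = LogisticRegression().fit(X_tr, y_tr)\n",
"y_pred = clf.predict(X_te)\n",
"proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class\n",
"\n",
"print(\"precision:\", precision_score(y_te, y_pred))\n",
"print(\"recall:   \", recall_score(y_te, y_pred))\n",
"print(\"f1:       \", f1_score(y_te, y_pred))\n",
"print(\"AP score: \", average_precision_score(y_te, proba))  # needs probabilities"
]
},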
{
"cell_type": "markdown",
"id": "a1e6c11b-ee26-4d37-87ea-2b6bd3560f60",
"metadata": {},
"source": [
"### Ensembles\n",
"\n",
"- How does a random forest model inject randomness in the model?\n",
"- What's the difference between random forests and gradient boosted trees?\n",
"- Why do we need averaging or stacking? \n",
"- What are the benefits of stacking over averaging? "
]
},
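{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d419",
"metadata": {},
"source": [
"A sketch contrasting averaging (`VotingClassifier`) with stacking (`StackingClassifier`), which fits a meta-model on the base models' cross-validated predictions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d420",
"metadata": {},
"outputs": [],
"source": [
"# A sketch: averaging combines the base models' predictions directly;\n",
"# stacking fits a meta-model (LogisticRegression by default) on their\n",
"# cross-validated predictions, learning how much to trust each one.\n",
"from sklearn.ensemble import StackingClassifier, VotingClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"base_models = [\n",
"    (\"lr\", make_pipeline(StandardScaler(), LogisticRegression())),\n",
"    (\"tree\", DecisionTreeClassifier(random_state=42)),\n",
"    (\"svm\", make_pipeline(StandardScaler(), SVC(probability=True))),\n",
"]\n",
"averaging = VotingClassifier(estimators=base_models, voting=\"soft\")\n",
"stacking = StackingClassifier(estimators=base_models)"
]
},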
{
"cell_type": "markdown",
"id": "30fe7d3b-ec86-4ec3-86b8-f7ef6d0b1b79",
"metadata": {},
"source": [
"### Feature importances and selection \n",
"\n",
"- What are the limitations of looking at simple correlations between features and targets? \n",
"- How can you get feature importances or non-linear models?\n",
"- What you might need to explain a single prediction?\n",
"- What's the difference between feature engineering and feature selection? \n",
"- Why do we need feature selection?\n",
"- What are the three possible ways we looked at for feature selection? \n"
]
}
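,
{
"cell_type": "markdown",
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d421",
"metadata": {},
"source": [
"A sketch of one feature-selection approach, recursive feature elimination (`RFE`); model-based selection (`SelectFromModel`) and sequential forward selection (`SequentialFeatureSelector`) are two other common options."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c20-1b6e-4d2a-9c47-aa01b2c3d422",
"metadata": {},
"outputs": [],
"source": [
"# A sketch on synthetic data: RFE repeatedly fits the model and drops the\n",
"# weakest feature until n_features_to_select remain.\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.feature_selection import RFE\n",
"\n",
"X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=42)\n",
"rfe = RFE(LogisticRegression(), n_features_to_select=5)\n",
"rfe.fit(X_toy, y_toy)\n",
"rfe.support_  # boolean mask over features: True = selected"
]
}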
],
"metadata": {
"kernelspec": {
"display_name": "cpsc330",
"language": "python",
"name": "cpsc330"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}