Final exam preparation: guiding questions#

Imports#

import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
import pandas as pd

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

DATA_DIR = os.path.join(os.path.abspath(".."), "data/")

Study tips#

  • Review the course learning objectives, found at the start of the lecture notes, to clarify expectations. The overall course objectives are available here. Consider creating a checklist for each objective and rate your confidence in performing each task.

  • Focus on understanding not just how to do something, but also why it’s done that way. This will prepare you for exam questions requiring novel applications of concepts.

  • If reasoning questions are challenging, try explaining concepts out loud. Articulating your thoughts often highlights gaps in understanding.

  • Revisit homework assignments thoroughly. Review the problems, your solutions, and any feedback to deepen your understanding.

  • Use active recall techniques like flashcards, summary sheets, or teaching concepts to peers to test your memory and comprehension.

  • Develop a strong grasp of core machine learning concepts. For each, know the definition, application, and implications in real-world contexts.

  • Review case studies and examples from the course to see how theoretical concepts are applied in practice.

  • Take advantage of office hours, tutorials, and other available resources.

  • Create a study schedule that prioritizes topics where you feel less confident.

  • Study in focused intervals (e.g., 25 minutes of work followed by a 5-minute break) to maintain concentration.

  • Begin each study session with a minute or two of focused breathing to calm your mind and improve focus.

  • Join or form study groups to discuss material and exchange ideas. Teaching others is one of the most effective ways to solidify your understanding.

Part 1#

Introduction#

  • What is ML? When is it suitable?

  • ML terminology

  • ML types

ML fundamentals#

  • What are the four splits of data we have seen so far?

  • What are the advantages of cross-validation?

  • Why is it important to look at the sub-scores of cross-validation? (See the sketch after this list.)

  • What is the fundamental trade-off in supervised machine learning?

  • What is the Golden rule in supervised machine learning?

  • Scenarios for data leakage
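
As a refresher, here is a minimal sketch (on a synthetic dataset from make_classification, standing in for course data) of how cross_validate exposes the per-fold sub-scores:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=123)
scores = cross_validate(
    DecisionTreeClassifier(random_state=123), X_toy, y_toy, cv=5, return_train_score=True
)
pd.DataFrame(scores)  # one row per fold: fit_time, score_time, test_score, train_score

Looking at the per-fold scores (rather than only their mean) reveals how much the estimate varies across folds.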

Pros, cons, parameters and hyperparameters of different ML models#

  • Decision trees

  • KNNs, SVM RBFs

  • Linear models

  • Random forests

  • Gradient boosting, LGBM, CatBoost

  • Stacking, averaging

Comparison of models

| Model | Parameters and hyperparameters | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Decision Trees |  |  |  |
| KNNs |  |  |  |
| SVM RBF |  |  |  |
| Linear models |  |  |  |
| Random forests |  |  |  |
| Gradient boosting |  |  |  |
| Stacking |  |  |  |
| Averaging |  |  |  |

Preprocessing#

  • What are the various data preprocessing steps, such as scaling, one-hot encoding (OHE), ordinal encoding, and handling missing values? Why and when is each step necessary?

sklearn Transformers

| Transformer | Hyperparameters | When to use? |
| --- | --- | --- |
| SimpleImputer |  |  |
| StandardScaler |  |  |
| OneHotEncoder |  |  |
| OrdinalEncoder |  |  |
| CountVectorizer |  |  |
| TransformedTargetRegressor |  |  |

Let’s bring back our quiz2 grades toy dataset.

grades_df = pd.read_csv(DATA_DIR + 'quiz2-grade-toy-col-transformer.csv')
grades_df.head()
  enjoy_course  ml_experience                   major class_attendance  university_years  lab1  lab2  lab3  lab4  quiz1   quiz2
0          yes              1        Computer Science        Excellent                 3    92  93.0    84    91     92      A+
1          yes              1  Mechanical Engineering          Average                 2    94  90.0    80    83     91  not A+
2          yes              0             Mathematics             Poor                 3    78  85.0    83    80     80  not A+
3           no              0             Mathematics        Excellent                 3    91   NaN    92    91     89      A+
4          yes              0              Psychology             Good                 4    77  83.0    90    92     85      A+
X, y = grades_df.drop(columns=['quiz2']), grades_df['quiz2']
numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = [
    "lab2",
    "class_attendance",
    "enjoy_course",
]  # do not include these features in modeling
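
Building on these feature groups, here is one possible preprocessing sketch with make_column_transformer; the median-imputation choice is an illustrative assumption, not the only valid design:

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric_feats),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_feats),
    ("passthrough", passthrough_feats),
    ("drop", drop_feats),
)
preprocessor.fit_transform(X)  # imputes and scales numeric features, one-hot encodes "major"
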
  • What’s the difference between sklearn estimators and transformers?

  • Can you think of a better way to impute missing values compared to SimpleImputer?

One-hot encoding

  • What’s the purpose of the following arguments of one-hot encoding? (A sketch of handle_unknown follows this list.)

    • handle_unknown="ignore"

    • sparse_output=False (named sparse in older scikit-learn versions)

    • drop="if_binary"

  • How do you deal with categorical features with only two possible categories?
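
As a quick illustration, a minimal sketch (with a made-up toy column) of what handle_unknown="ignore" does when an unseen category appears at transform time:

toy_train = pd.DataFrame({"major": ["Mathematics", "Psychology"]})
toy_valid = pd.DataFrame({"major": ["Linguistics"]})  # category never seen during fit

ohe_demo = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
ohe_demo.fit(toy_train)
ohe_demo.transform(toy_valid)  # -> [[0., 0.]]: an all-zeros row instead of an error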

Ordinal encoding

  • What’s the difference between ordinal encoding and one-hot encoding?

  • What happens if we do not order the categories when we apply ordinal encoding? Does it matter if we order the categories in ascending or descending order?

  • What would happen if an unknown category shows up during validation or test time with ordinal encoding? For example, for the class_attendance feature, what if a category called “super poor” shows up? (See the sketch below.)
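
As a sketch of one way to handle both issues, we can pass an explicit category order and a fallback for unseen categories; the worst-to-best ordering below is an assumption about the class_attendance levels:

attendance_levels = [["Poor", "Average", "Good", "Excellent"]]  # worst to best
oe_demo = OrdinalEncoder(
    categories=attendance_levels,
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # an unseen category such as "super poor" maps to -1
)
oe_demo.fit_transform(grades_df[["class_attendance"]])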

OHE vs. ordinal encoding

  • Since the enjoy_course feature is binary, you decide to apply one-hot encoding with drop="if_binary". Your friend decides to apply ordinal encoding to it. Will it make any difference in the transformed data?

ohe = OneHotEncoder(drop="if_binary", sparse_output=False)
ohe_encoded = ohe.fit_transform(grades_df[['enjoy_course']]).ravel()
oe = OrdinalEncoder()
oe_encoded = oe.fit_transform(grades_df[['enjoy_course']]).ravel()
data = {"oe_encoded": oe_encoded, "ohe_encoded": ohe_encoded}
pd.DataFrame(data)
    oe_encoded  ohe_encoded
0          1.0          1.0
1          1.0          1.0
2          1.0          1.0
3          0.0          0.0
4          1.0          1.0
5          0.0          0.0
6          1.0          1.0
7          0.0          0.0
8          0.0          0.0
9          1.0          1.0
10         1.0          1.0
11         1.0          1.0
12         1.0          1.0
13         1.0          1.0
14         0.0          0.0
15         0.0          0.0
16         1.0          1.0
17         1.0          1.0
18         0.0          0.0
19         0.0          0.0
20         1.0          1.0
  • In what scenarios is it OK to break the golden rule?

  • What are possible ways to deal with categorical columns with a large number of categories?

  • In what scenarios would you not include a feature in your model even if it’s a good predictor?

  • What’s the problem with calling fit_transform on the test data in the context of CountVectorizer? (See the sketch after this list.)

  • Do we need to scale after applying bag-of-words representation?
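
As a reminder of the right pattern with CountVectorizer, a minimal sketch (with made-up documents) that fits the vocabulary on the training data only and merely transforms the test data:

from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["machine learning is fun", "learning with sklearn"]
test_docs = ["deep learning is fun too"]

vec = CountVectorizer()
X_train_bow = vec.fit_transform(train_docs)  # vocabulary is learned from training data only
X_test_bow = vec.transform(test_docs)        # words unseen in training are silently dropped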

Hyperparameter optimization#

  • What makes hyperparameter optimization a hard problem?

  • What are two different tools provided by sklearn for hyperparameter optimization? (See the sketch after the table below.)

  • What is optimization bias?

| Method | Strengths/Weaknesses | When to use? |
| --- | --- | --- |
| Nested for loops |  |  |
| Grid search |  |  |
| Random search |  |  |
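
As a refresher on the two sklearn tools, a minimal sketch on a hypothetical SVC model; the parameter ranges are illustrative assumptions:

from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)  # tries all 9 combinations

param_dist = {"C": loguniform(1e-3, 1e3), "gamma": loguniform(1e-3, 1e3)}
random_search = RandomizedSearchCV(
    SVC(), param_dist, n_iter=20, cv=5, n_jobs=-1, random_state=123
)  # samples 20 combinations from the given distributions

# grid_search.fit(X_train, y_train)  # X_train, y_train: your training split (hypothetical names)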

Evaluation metrics#

  • Understand the different metrics used to evaluate machine learning models: accuracy, precision, recall, F1-score, PR curves, and ROC curves for classification; mean squared error (MSE), root mean squared error (RMSE), MAPE, and the r2 score for regression. Be prepared to discuss why you would choose one metric over another based on the problem context.

  • Why is accuracy not always enough?

  • Why is it useful to get prediction probabilities?

  • In what scenarios do you care more about precision, and in what scenarios more about recall?

  • What’s the main difference between AP score and F1 score?

  • What are the advantages of RMSE or MAPE over MSE? (See the sketch after this list.)
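
To jog your memory, a minimal sketch (with made-up labels and predictions) of computing several of these metrics via sklearn.metrics:

from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, recall_score)

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
accuracy_score(y_true, y_pred)   # 5/6: fraction of correct predictions
precision_score(y_true, y_pred)  # 3/3: of the predicted positives, how many are real?
recall_score(y_true, y_pred)     # 3/4: of the real positives, how many did we catch?
f1_score(y_true, y_pred)         # harmonic mean of precision and recall

y_true_reg = np.array([3.0, 5.0, 2.5])
y_pred_reg = np.array([2.5, 5.0, 4.0])
np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # RMSE, in the units of the target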

Classification Metrics

| Metric | How to generate/calculate? | When to use? |
| --- | --- | --- |
| Accuracy |  |  |
| Precision |  |  |
| Recall |  |  |
| F1-score |  |  |
| AP score |  |  |
| AUC |  |  |

Regression Metrics

| Metric | How to generate/calculate? | When to use? |
| --- | --- | --- |
| MSE |  |  |
| RMSE |  |  |
| r2 score |  |  |
| MAPE |  |  |

Ensembles#

  • How does a random forest model inject randomness into the model?

  • What’s the difference between random forests and gradient boosted trees?

  • Why do we need averaging or stacking?

  • What are the benefits of stacking over averaging?
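
A minimal sketch contrasting the two (the base models and their settings are illustrative choices):

from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=123)),
    ("svm", SVC(probability=True, random_state=123)),
]
averaging = VotingClassifier(base_models, voting="soft")  # averages predicted probabilities
stacking = StackingClassifier(
    base_models, final_estimator=LogisticRegression()
)  # the meta-model learns how much to trust each base model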

Feature importances#

  • What are the limitations of looking at simple correlations between features and targets?

  • How can you get feature importances for non-linear models?

  • What might you need to explain a single prediction?
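
For non-linear models without coefficients, permutation importance is one model-agnostic option; here is a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=123)
model = RandomForestClassifier(random_state=123).fit(X_toy, y_toy)
result = permutation_importance(model, X_toy, y_toy, n_repeats=10, random_state=123)
result.importances_mean  # average drop in score when each feature is shuffled

For explaining a single prediction, per-prediction attribution tools such as SHAP are a common choice.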

Feature engineering and selection#

  • What’s the difference between feature engineering and feature selection?

  • Why do we need feature selection?

  • What are the three possible ways we looked at for feature selection?
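
As a refresher on one of these approaches, a minimal recursive feature elimination (RFE) sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=123)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_toy, y_toy)
rfe.support_  # boolean mask over features: True = kept, False = eliminated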

Part 2#

Clustering#

  • Why do we use clustering, and what is the problem of clustering?

  • Compare and contrast different clustering methods.

  • What’s the difficulty in evaluation of clustering? How do we evaluate clusters?

| Scenario | Which clustering method? |
| --- | --- |
| Well-separated spherical clusters |  |
| Large datasets |  |
| Flexibility with cluster shapes |  |
| Small to medium datasets |  |
| Prior knowledge on how many clusters |  |
| Clusters are roughly of equal size |  |
| Irregularly shaped clusters |  |
| Clusters with different densities |  |
| Datasets with hierarchical relationships |  |
| No prior knowledge on number of clusters |  |
| Noise and outliers |  |

  • Which clustering method would you use in each of the scenarios below? Why?

  • How would you represent the data in each case?

    • Scenario 1: Customer segmentation in retail

    • Scenario 2: An environmental study aiming to identify clusters of a rare plant species

    • Scenario 3: Clustering furniture items for inventory management and customer recommendations

  • How to decide the number of clusters?
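
Two common heuristics, sketched on synthetic blobs: the elbow method (watch where inertia stops dropping sharply) and silhouette scores (higher is better):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_toy, _ = make_blobs(n_samples=300, centers=4, random_state=123)
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=123).fit(X_toy)
    print(k, round(km.inertia_, 1), round(silhouette_score(X_toy, km.labels_), 3))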

Recommender systems#

  • What’s the utility matrix?

  • How do we evaluate recommender systems?

  • What are the baseline models we talked about?

    • Global average

    • Per user average

    • Per item average

  • Evaluation of recommender systems

  • Compare and contrast KNN Imputer and content-based filtering

  • Ethical issues associated with recommender systems
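
To make the baselines concrete, a minimal sketch on a made-up utility matrix (rows = users, columns = items, NaN = not yet rated):

ratings = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [4.0, 2.0, np.nan],
     [np.nan, 1.0, 4.0]],
    index=["user_a", "user_b", "user_c"],
    columns=["item_1", "item_2", "item_3"],
)
global_avg = ratings.stack().mean()  # one number for every missing entry
per_user_avg = ratings.mean(axis=1)  # each user's average rating
per_item_avg = ratings.mean(axis=0)  # each item's average rating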

Introduction to NLP#

  • Embeddings

    • What are different document and word representations we talked about?

    • Why do we care about creating different representations?

    • What are pre-trained models? What are the benefits of using them?

  • Topic modeling

    • What is topic modeling? What are the inputs and outputs of topic modeling?

    • How is it different from clustering documents using a clustering model, say KMeans? (See the sketch after this list.)

  • Text Preprocessing
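
To make the inputs and outputs of topic modeling concrete, a minimal sketch with sklearn's LatentDirichletAllocation on made-up documents:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats make wonderful pets",
    "stocks fell as markets reacted to rates",
]
dtm = CountVectorizer(stop_words="english").fit_transform(docs)  # input: document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=123).fit(dtm)
lda.transform(dtm)  # output: each document's distribution over topics
lda.components_     # output: each topic's weights over the vocabulary words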

Multiclass classification and computer vision#

  • How is the Softmax function used by logistic regression in the context of multiclass classification? (See the numeric sketch after this list.)

  • What are the methods we saw to use pre-trained image classification models for our image classification tasks?

    • Out of the box

    • Using pre-trained models as feature extractors
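
The numeric sketch promised above: softmax turns a vector of raw class scores into probabilities that sum to 1:

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    return np.exp(z) / np.exp(z).sum()

scores = np.array([2.0, 1.0, 0.1])  # raw per-class scores (logits)
softmax(scores)  # probabilities summing to 1; the largest score gets the largest probability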

How would you use a pre-trained model in each case below?

  • Imagine you want to quickly develop a prototype for an app that can identify different cat breeds from photos.

  • Suppose you’re working on a project to predict the city in Canada based on the photos of landmarks in the city, a task for which there’s limited training data available.

  • Suppose you’re developing a system to diagnose specific types of tumors from MRI scans.

Time series#

  • When is time series analysis appropriate?

    • Time series analysis is used when there is a temporal aspect in the data.

  • Data splitting: Data should be split based on time to avoid future data leaking into the training set.

  • Essential questions for Exploratory Data Analysis (EDA):

    • What is the frequency of data collection (e.g., hourly, daily)?

    • How many time series are present within the dataset?

    • Are there any gaps or missing values in the data?

  • Feature engineering

    • Deriving new features from the date/time column.

    • Encoding features appropriately based on the chosen model.

    • Creating lag features to incorporate past values for prediction.

  • Baseline model approach: Employ a simple model, such as using today’s target value to predict tomorrow’s, as a starting point for comparison.

  • Cross-validation method for time series: In sklearn, use TimeSeriesSplit as the cv parameter in functions like cross_validate or cross_val_score for time-appropriate validation (see the sketch after this list).

  • Strategies for long-term forecasting:

    • Generate forecasts for sequential time steps by assuming the predictions for the previous steps are accurate.

  • Trends

    • A ‘days since’ feature to capture the trend over time
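
The sketch referenced above: time-aware cross-validation plus a lag feature and a "days since" trend feature, on a made-up daily series:

from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_validate

dates = pd.date_range("2023-01-01", periods=100, freq="D")
ts = pd.DataFrame({"y": np.sin(np.arange(100) / 7) + npr.normal(0, 0.1, 100)}, index=dates)
ts["lag_1"] = ts["y"].shift(1)         # yesterday's value as a feature
ts["days_since"] = np.arange(len(ts))  # simple trend feature
ts = ts.dropna()                       # the first row has no lag value

cross_validate(Ridge(), ts[["lag_1", "days_since"]], ts["y"], cv=TimeSeriesSplit(n_splits=5))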

Survival analysis#

  • What is right-censored data?

  • What happens when we treat right-censored data the same as “regular” data?

    • Predicting churn vs. no churn

    • Predicting tenure

      • Throw away people who haven’t churned

      • Assume everyone churns today

  • Survival analysis encompasses predicting both churn and tenure; it deals properly with censoring and can make rich and useful predictions!

    • We can get survival curves which show the probability of survival over time.

    • KM model \(\rightarrow\) doesn’t look at features

    • CPH model \(\rightarrow\) like linear regression, does look at the features and provides coefficients associated with each feature
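
A minimal sketch of both models; the lifelines package and the toy column names are assumptions for illustration:

from lifelines import CoxPHFitter, KaplanMeierFitter

# toy data: "tenure" = months observed; "churned" = 1 if churned, 0 if right-censored
churn_df = pd.DataFrame({
    "tenure": [5, 12, 20, 3, 30, 8],
    "churned": [1, 0, 1, 1, 0, 0],
    "monthly_charges": [70, 40, 90, 65, 30, 55],
})

kmf = KaplanMeierFitter().fit(churn_df["tenure"], event_observed=churn_df["churned"])
kmf.survival_function_  # probability of "surviving" (not churning) over time; ignores features

cph = CoxPHFitter().fit(churn_df, duration_col="tenure", event_col="churned")
cph.params_  # per-feature coefficients, analogous to linear regression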

Communication#

  • Why is communication important in ML and Data Science?

  • What are different principles of good explanation?

  • What to watch out for when producing or consuming visualizations?

Ethics#

  • Fairness, accountability, transparency

  • Representation bias, measurement bias, historical bias

Deployment (Not examinable)#

  • Deploying a model as a web app

  • Deploying a model as a REST API