Final review guiding questions

UBC 2023-24

Instructor: Varada Kolhatkar

The final is going to be cumulative, but in this review session we’ll focus on the post-midterm material.

Scenario	Which clustering method?
Well-separated spherical clusters
Large datasets
Flexibility with cluster shapes
Small to medium datasets
Prior knowledge of how many clusters
Clusters are roughly of equal size
Irregularly shaped clusters
Clusters with different densities
Datasets with hierarchical relationships
No prior knowledge on number of clusters
Noise and outliers

Which clustering method would you use in each of the scenarios below? Why?
How would you represent the data in each case?
- Scenario 1: Customer segmentation in retail
- Scenario 2: An environmental study aiming to identify clusters of a rare plant species
- Scenario 3: Clustering furniture items for inventory management and customer recommendations

What’s the utility matrix?
How do we evaluate recommender systems?
What are the baseline models we talked about?
- Global average
- Per user average
- Per item average
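
The three baselines above can be sketched on a toy utility matrix. This is a minimal illustration, assuming ratings are stored in a NumPy array with NaN for missing entries; the matrix `Y` and the helper `fill` are made up for this example.

```python
import numpy as np

# Toy utility matrix (hypothetical data): rows are users, columns are items,
# NaN marks a rating we have not observed.
Y = np.array([
    [5.0, np.nan, 3.0],
    [4.0, 2.0, np.nan],
    [np.nan, 1.0, 2.0],
])

global_avg = np.nanmean(Y)          # a single number used for every missing cell
user_avg = np.nanmean(Y, axis=1)    # one average per row (user)
item_avg = np.nanmean(Y, axis=0)    # one average per column (item)


def fill(Y, baseline):
    """Return a copy of Y where missing ratings are replaced by `baseline`.

    `baseline` must broadcast against Y's shape.
    """
    return np.where(np.isnan(Y), baseline, Y)


pred_global = fill(Y, global_avg)
pred_user = fill(Y, user_avg[:, None])   # broadcast each user's average across the row
pred_item = fill(Y, item_avg[None, :])   # broadcast each item's average down the column
```

In a real evaluation the averages would be computed on the training ratings only and scored on held-out ratings.
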
Evaluation of recommender systems
Compare and contrast KNN Imputer, collaborative filtering, and content-based filtering
Ethical issues associated with recommender systems

Embeddings
- What are different document and word representations we talked about?
- Why do we care about creating different representations?
- What are pre-trained models? What are the benefits of using them?
Topic modeling
- What is topic modeling? What are the inputs and outputs of topic modeling?
- How is it different from clustering documents using a clustering model, say KMeans?
Text Preprocessing

What’s the difference between OVR and OVO?
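
One way to see the difference concretely is to count the binary models each strategy trains. The sketch below assumes scikit-learn's `OneVsRestClassifier` and `OneVsOneClassifier` wrappers and a synthetic 4-class dataset; the dataset and the choice of logistic regression are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class problem (hypothetical data).
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

n_ovr = len(ovr.estimators_)  # one binary model per class
n_ovo = len(ovo.estimators_)  # one binary model per pair of classes
```

With k classes, one-vs-rest fits k models (each class against all others), while one-vs-one fits k(k-1)/2 models (each trained only on the examples of two classes).
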
What are the methods we saw to use pre-trained image classification models for our image classification tasks?
- Out of the box
- Using pre-trained models as feature extractors
- Fine-tuning pre-trained models for our task (only mentioned)

How would you use a pre-trained model in each case below?

Imagine you want to quickly develop a prototype for an app that can identify different cat breeds from photos.
Suppose you’re working on a project to predict the city in Canada based on the photos of landmarks in the city, a task for which there’s limited training data available.
Suppose you’re developing a system to diagnose specific types of tumors from MRI scans.

When is time series analysis appropriate?
- Time series analysis is used when there is a temporal aspect in the data.
Data splitting: Data should be split based on time to avoid future data leaking into the training set.
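
A minimal sketch of a time-based split, assuming a pandas DataFrame with a `date` column; the column names and the cutoff date are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical), shuffled on purpose to show that sorting matters.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "target": range(10),
}).sample(frac=1, random_state=0)

df = df.sort_values("date")
cutoff = pd.Timestamp("2023-01-08")   # hypothetical cutoff date

train = df[df["date"] < cutoff]       # everything strictly before the cutoff
test = df[df["date"] >= cutoff]       # the cutoff date and everything after it
```

Every training row is earlier than every test row, so the model never sees the future during training.
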
Essential questions for Exploratory Data Analysis (EDA):
- What is the frequency of data collection (e.g., hourly, daily)?
- How many time series are present within the dataset?
- Are there any gaps or missing values in the data?
Feature engineering
- Derive new features from the date/time column.
- Encode features appropriately for the chosen model.
- Create lag features to incorporate past values for prediction.
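
The date-derived and lag features above might look like this in pandas. This is a sketch on a single toy series; the column names `date` and `sales` are assumptions for illustration.

```python
import pandas as pd

# Toy daily series (hypothetical data), already sorted by date.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6, freq="D"),
    "sales": [10, 12, 9, 14, 15, 11],
})

# Features derived from the date column.
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek   # Monday=0, Sunday=6

# Lag features: the target's value 1 and 2 steps in the past.
df["sales_lag1"] = df["sales"].shift(1)
df["sales_lag2"] = df["sales"].shift(2)

# The first rows have no history, so their lags are NaN; drop or impute them.
df_model = df.dropna()
```

If the dataset contains several time series (for example, one per store), the shift would need to be applied within each series, for instance via `groupby(...).shift(...)`, so that lags do not leak across series.
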
Baseline model approach: Employ a simple model, such as using today’s target value to predict tomorrow’s, as a starting point for comparison.
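
As a rough illustration of this baseline, the sketch below predicts each value with the previous one and scores it with mean absolute error; the series is made up and the metric is only one reasonable choice.

```python
import numpy as np
import pandas as pd

# Toy target series (hypothetical data), ordered in time.
y = pd.Series([10.0, 12.0, 9.0, 14.0, 15.0, 11.0])

# "Tomorrow equals today": the prediction for step t is the value at step t-1.
y_pred = y.shift(1)

# The first step has no previous value, so exclude it from scoring.
mask = y_pred.notna()
mae = np.mean(np.abs(y[mask] - y_pred[mask]))
```

Any model worth keeping should beat this number on the same evaluation period.
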
Cross-Validation Method for Time Series: In sklearn, use TimeSeriesSplit as the cv parameter in functions like cross_validate or cross_val_score for time-appropriate validation.
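
A minimal sketch of passing `TimeSeriesSplit` as `cv`, assuming the rows are already sorted by time; the synthetic data, the `Ridge` model, and the scoring metric are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_validate

# Synthetic data (hypothetical): rows must be in chronological order.
rng = np.random.default_rng(0)
n = 60
X = np.arange(n, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=n)

# Each fold trains on an initial segment and validates on the segment right after it.
tscv = TimeSeriesSplit(n_splits=5)

scores = cross_validate(
    Ridge(), X, y, cv=tscv, scoring="neg_mean_absolute_error"
)
```

Unlike shuffled k-fold, the training indices in every fold come strictly before the validation indices, which can be checked by iterating over `tscv.split(X)`.
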
Strategies for long-term forecasting:
- Generate forecasts for sequential time steps, feeding each prediction back in as if it were the true value for that step.
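
The strategy above can be sketched as a loop that feeds each prediction back in as a lag feature for the next step. This is an illustrative sketch with two lag features and a linear model; the helper `forecast` and the toy series are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy series with a perfectly linear trend (hypothetical data).
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
n_lags = 2

# Training rows look like [y[t-2], y[t-1]] -> y[t].
X = np.column_stack(
    [series[i : len(series) - n_lags + i] for i in range(n_lags)]
)
y = series[n_lags:]
model = LinearRegression().fit(X, y)


def forecast(model, history, n_steps, n_lags):
    """Predict n_steps ahead, treating earlier predictions as observed values."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(n_steps):
        y_hat = float(model.predict(np.array([window]))[0])
        preds.append(y_hat)
        window = window[1:] + [y_hat]   # slide the window over the new prediction
    return preds


preds = forecast(model, series, n_steps=3, n_lags=n_lags)
```

Because each step builds on earlier predictions, errors can compound as the forecast horizon grows.
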
Trends
- A ‘days since’ feature to capture the trend over time
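
A ‘days since’ feature could be computed as below. The sketch assumes a pandas `date` column and uses the earliest date in the data as the reference point, which is one possible choice.

```python
import pandas as pd

# Toy dates (hypothetical data).
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-10"]),
})

# Number of days elapsed since the reference date (here, the earliest date).
reference = df["date"].min()
df["days_since"] = (df["date"] - reference).dt.days
```

In practice the reference date would be fixed from the training set and reused for the test set, so the feature means the same thing in both.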

What is right-censored data?
What happens when we treat right-censored data the same as “regular” data?
- Predicting churn vs. no churn
- Predicting tenure
  - Throw away people who haven’t churned
  - Assume everyone churns today
Survival analysis covers both churn and tenure prediction, handles censoring properly, and can make rich and useful predictions!
- We can get survival curves which show the probability of survival over time.
- KM model \(\rightarrow\) doesn’t look at features
- CPH model \(\rightarrow\) like linear regression, does look at the features and provides coefficients associated with each feature
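
To make the KM idea concrete, here is a hand-rolled Kaplan-Meier estimate in NumPy. It is a sketch for intuition only, not the library implementation one would normally use; the function `kaplan_meier` and the toy data are hypothetical.

```python
import numpy as np


def kaplan_meier(durations, observed):
    """Return (event_times, survival_probabilities) for right-censored data.

    durations: time until churn, or until last observation if censored.
    observed:  1/True if churn was observed, 0/False if right-censored.
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)

    event_times = np.unique(durations[observed])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)               # still being followed just before t
        events = np.sum((durations == t) & observed)   # churned exactly at t
        s *= 1.0 - events / at_risk
        surv.append(s)
    return event_times, np.array(surv)


# Toy data (hypothetical): three customers churned, three are censored.
durations = [2, 3, 3, 5, 8, 8]
observed = [1, 1, 0, 1, 0, 0]
times, surv = kaplan_meier(durations, observed)
```

Censored customers stay in the at-risk count until their last observed time, so they contribute information without being treated as churned. Note that no features enter the computation, only durations and the censoring indicator.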