Tutorial 3

UBC 2025-26

Outline

During this tutorial, we will focus on preprocessing - the steps needed to make the data meaningful for a learning algorithm.

All questions can be discussed with your classmates and the TAs - this is not a graded exercise!

import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML

sys.path.append("../code/.")
from plotting_functions import *
from utils import *

pd.set_option("display.max_colwidth", 200)

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

`ColumnTransformer` on the California housing dataset

In this notebook, you will practice feature preprocessing using the California housing dataset.

Let’s start by loading the dataset (this is done for you):

housing_df = pd.read_csv("../data/housing.csv")
train_df, test_df = train_test_split(housing_df, test_size=0.1, random_state=123)

train_df.head()

Let’s also add some new features that may help us with the prediction:

train_df = train_df.assign(
    rooms_per_household=train_df["total_rooms"] / train_df["households"]
)
test_df = test_df.assign(
    rooms_per_household=test_df["total_rooms"] / test_df["households"]
)

train_df = train_df.assign(
    bedrooms_per_household=train_df["total_bedrooms"] / train_df["households"]
)
test_df = test_df.assign(
    bedrooms_per_household=test_df["total_bedrooms"] / test_df["households"]
)

train_df = train_df.assign(
    population_per_household=train_df["population"] / train_df["households"]
)
test_df = test_df.assign(
    population_per_household=test_df["population"] / test_df["households"]
)

Finally, we separate the target from the features for you:

# Let's keep both numeric and categorical columns in the data.
X_train = train_df.drop(columns=["median_house_value"])
y_train = train_df["median_house_value"]

X_test = test_df.drop(columns=["median_house_value"])
y_test = test_df["median_house_value"]

Step 0: EDA

Let’s get a sense of our dataset using the strategies we learned about last time. From this information, what kinds of preprocessing steps might we need to take here?

train_df.info()

housing_df["ocean_proximity"].unique()

train_df.describe()

Step 1

Your turn now! Start by importing ColumnTransformer and make_column_transformer from sklearn.compose.

Step 2

Next, group features by type (numerical or categorical). You may also want to save the target separately.
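One possible way to do the grouping is sketched below. It is only an illustration: `toy_X` is a hypothetical, made-up stand-in for `X_train` (so that the snippet runs on its own), and the names `numeric_feats` and `categorical_feats` are assumptions, not names required by the tutorial. You could equally type the column lists by hand after looking at the output of `train_df.info()`.

```python
import pandas as pd

# Hypothetical toy stand-in for X_train (values are made up for illustration).
toy_X = pd.DataFrame(
    {
        "longitude": [-122.2, -118.3, -121.9],
        "median_income": [8.3, 3.1, 5.6],
        "total_bedrooms": [129.0, float("nan"), 190.0],
        "ocean_proximity": ["NEAR BAY", "<1H OCEAN", "INLAND"],
    }
)

# Numeric columns: everything pandas stores with a numeric dtype.
numeric_feats = toy_X.select_dtypes(include="number").columns.tolist()

# Categorical columns: everything else (here, the string column).
categorical_feats = toy_X.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_feats)
print("categorical:", categorical_feats)
```

Selecting by dtype is convenient, but it is worth checking the result by eye: a categorical feature stored as integers would end up in the numeric group.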

Step 3

Create a ColumnTransformer for your features. The transformer should include imputation and scaling for numeric features, and encoding for categorical features (which type of encoding?).
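As a rough sketch of what such a transformer could look like, the snippet below uses a hypothetical toy dataframe (`toy_X`, made-up values) in place of `X_train`. The specific choices here are assumptions rather than the only valid answer: median imputation for the numeric columns, and one-hot encoding for the categorical column on the assumption that its categories have no natural order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy stand-in for X_train (values are made up for illustration).
toy_X = pd.DataFrame(
    {
        "median_income": [8.3, 3.1, 5.6, 2.4],
        "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
        "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
    }
)
numeric_feats = ["median_income", "total_bedrooms"]
categorical_feats = ["ocean_proximity"]

# Numeric columns: fill missing values first, then standardize.
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical columns: one binary column per category;
# categories unseen during fit are encoded as all zeros.
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_feats),
    (categorical_transformer, categorical_feats),
)

transformed = preprocessor.fit_transform(toy_X)
print(transformed.shape)  # 2 numeric columns + 1 column per category
```

Imputation comes before scaling inside the numeric pipeline so that the scaler never sees missing values.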

Step 4

Visualize the transformed training set as a dataframe.

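Step 5

Finally, let’s train a model. Since the target is continuous, this is a regression task, so we need a regressor rather than a classifier (or even better, for practice, a baseline and another regressor):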

create a pipeline with the preprocessor and a regressor of your choice.
use the pipeline to perform cross-validation
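The two steps above can be sketched roughly as follows. Everything in this snippet is an assumption made for illustration: the data are synthetic (generated with a fixed seed, not taken from the housing dataset), and `DummyRegressor` and `KNeighborsRegressor` are just one possible choice of baseline and model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X_train / y_train (made-up data, for illustration only).
rng = np.random.default_rng(123)
n = 60
toy_X = pd.DataFrame(
    {
        "median_income": rng.uniform(1, 10, size=n),
        "housing_median_age": rng.integers(1, 52, size=n).astype(float),
        "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], size=n),
    }
)
toy_X.loc[::10, "housing_median_age"] = np.nan  # inject a few missing values
toy_y = 40_000 * toy_X["median_income"] + rng.normal(0, 10_000, size=n)

preprocessor = make_column_transformer(
    (
        make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
        ["median_income", "housing_median_age"],
    ),
    (OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
)

models = {"dummy": DummyRegressor(), "knn": KNeighborsRegressor(n_neighbors=5)}

results = {}
for name, model in models.items():
    # Preprocessing lives inside the pipeline, so it is re-fit on each training fold.
    pipe = make_pipeline(preprocessor, model)
    scores = cross_validate(pipe, toy_X, toy_y, cv=5, return_train_score=True)
    results[name] = pd.DataFrame(scores).mean()

results_df = pd.DataFrame(results)
print(results_df)
```

Putting the preprocessor inside the pipeline, instead of transforming the data once up front, keeps information from the validation folds out of the imputer and scaler during cross-validation. For regressors, the default score reported by `cross_validate` is R².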

Recap/comprehension questions

Do we have to preprocess the target column too?
If we only plan to use a decision tree as our model, do we still need to scale the numerical features?
If the dataset included an ordinal feature “Neighbourhood desirability”, with numerical labels 1 (poor), 2 (good) and 3 (excellent), would we need to apply an ordinal encoder to it?
Why do we add the argument drop="if_binary" to OneHotEncoder when dealing with categorical features with only two possible values? What would be the disadvantages of not doing so?
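To help you think about the last question, here is a small sketch comparing the output of OneHotEncoder with and without drop="if_binary" on a hypothetical two-valued feature (the "yes"/"no" values are made up for illustration). The encoder returns a sparse matrix by default, so the snippet calls `.toarray()` to show the values.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical binary feature with two possible values.
binary_col = np.array([["yes"], ["no"], ["yes"], ["no"]])

# Default behaviour: one column per category.
full = OneHotEncoder().fit_transform(binary_col).toarray()

# drop="if_binary": a single column for features with exactly two categories.
compact = OneHotEncoder(drop="if_binary").fit_transform(binary_col).toarray()

print(full.shape)     # two columns, one per category
print(compact.shape)  # one column
print(full.sum(axis=1))  # each row sums to 1, so one column determines the other
```

Look at the two columns of `full`: does the second one tell you anything the first one does not?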
