
Lecture 10: Regression metrics


UBC, 2025-26

Imports and LO

Imports

import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

%matplotlib inline
DATA_DIR = os.path.join(os.path.abspath(".."), "data/")

# Ignore FutureWarnings from sklearn. Setting the PYTHONWARNINGS environment
# variable (rather than using the `warnings` module) also silences warnings
# raised in subprocesses.
os.environ["PYTHONWARNINGS"] = "ignore::FutureWarning"

Learning outcomes

From this lecture, students are expected to be able to:

  • Carry out feature transformations on a somewhat complicated dataset.

  • Visualize transformed features as a dataframe.

  • Use Ridge and RidgeCV.

  • Explain how the alpha hyperparameter of Ridge relates to the fundamental tradeoff.

  • Explain the effect of alpha on the magnitude of the learned coefficients.

  • Examine coefficients of transformed features.

  • Appropriately select a scoring metric given a regression problem.

  • Interpret and communicate the meanings of different scoring metrics on regression problems.

    • MSE, RMSE, $R^2$, MAPE

  • Apply a log transform to the target values in a regression problem with TransformedTargetRegressor.

After carrying out preprocessing, why is it useful to get feature names for the transformed features?

More comments on tackling class imbalance

  • In Lecture 9 we talked about a few rather simple approaches to dealing with class imbalance.

  • If you have a problem such as fraud detection, where you want to spot rare events, you can also frame it as an anomaly detection problem and use different kinds of algorithms, such as isolation forests (see the sketch below).

  • If you are interested in this area, it might be worth checking out this book: Imbalanced Learning: Foundations, Algorithms, and Applications
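
Below is a minimal sketch of this idea using sklearn's IsolationForest on synthetic data; the toy data and the contamination value here are made up for illustration and are not part of our course pipeline.

from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_common = rng.normal(loc=0, scale=1, size=(1000, 2))  # common events
X_rare = rng.normal(loc=5, scale=1, size=(10, 2))      # rare events
X_toy = np.vstack([X_common, X_rare])

# `contamination` is our guess at the fraction of anomalies in the data
iso = IsolationForest(contamination=0.01, random_state=42)
iso.fit(X_toy)
labels = iso.predict(X_toy)  # +1 for inliers, -1 for flagged anomalies
print("Number flagged as anomalies:", (labels == -1).sum())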

Dataset [video]

In this lecture, we'll be using the Kaggle House Prices dataset. As usual, to run this notebook you'll need to download the data. For this dataset, train and test have already been separated; we'll be working with the train portion in this lecture.

df = pd.read_csv(DATA_DIR + "housing-kaggle/train.csv")
train_df, test_df = train_test_split(df, test_size=0.10, random_state=123)
train_df.head()
  • The supervised machine learning problem is predicting housing price given features associated with properties.

  • Here, the target is SalePrice, which is continuous. So it’s a regression problem (as opposed to classification).

train_df.shape
(1314, 81)

Let’s separate X and y

X_train = train_df.drop(columns=["SalePrice"])
y_train = train_df["SalePrice"]

X_test = test_df.drop(columns=["SalePrice"])
y_test = test_df["SalePrice"]

EDA

train_df.describe()
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1314 entries, 302 to 1389
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1314 non-null   int64  
 1   MSSubClass     1314 non-null   int64  
 2   MSZoning       1314 non-null   object 
 3   LotFrontage    1089 non-null   float64
 4   LotArea        1314 non-null   int64  
 5   Street         1314 non-null   object 
 6   Alley          81 non-null     object 
 7   LotShape       1314 non-null   object 
 8   LandContour    1314 non-null   object 
 9   Utilities      1314 non-null   object 
 10  LotConfig      1314 non-null   object 
 11  LandSlope      1314 non-null   object 
 12  Neighborhood   1314 non-null   object 
 13  Condition1     1314 non-null   object 
 14  Condition2     1314 non-null   object 
 15  BldgType       1314 non-null   object 
 16  HouseStyle     1314 non-null   object 
 17  OverallQual    1314 non-null   int64  
 18  OverallCond    1314 non-null   int64  
 19  YearBuilt      1314 non-null   int64  
 20  YearRemodAdd   1314 non-null   int64  
 21  RoofStyle      1314 non-null   object 
 22  RoofMatl       1314 non-null   object 
 23  Exterior1st    1314 non-null   object 
 24  Exterior2nd    1314 non-null   object 
 25  MasVnrType     528 non-null    object 
 26  MasVnrArea     1307 non-null   float64
 27  ExterQual      1314 non-null   object 
 28  ExterCond      1314 non-null   object 
 29  Foundation     1314 non-null   object 
 30  BsmtQual       1280 non-null   object 
 31  BsmtCond       1280 non-null   object 
 32  BsmtExposure   1279 non-null   object 
 33  BsmtFinType1   1280 non-null   object 
 34  BsmtFinSF1     1314 non-null   int64  
 35  BsmtFinType2   1280 non-null   object 
 36  BsmtFinSF2     1314 non-null   int64  
 37  BsmtUnfSF      1314 non-null   int64  
 38  TotalBsmtSF    1314 non-null   int64  
 39  Heating        1314 non-null   object 
 40  HeatingQC      1314 non-null   object 
 41  CentralAir     1314 non-null   object 
 42  Electrical     1313 non-null   object 
 43  1stFlrSF       1314 non-null   int64  
 44  2ndFlrSF       1314 non-null   int64  
 45  LowQualFinSF   1314 non-null   int64  
 46  GrLivArea      1314 non-null   int64  
 47  BsmtFullBath   1314 non-null   int64  
 48  BsmtHalfBath   1314 non-null   int64  
 49  FullBath       1314 non-null   int64  
 50  HalfBath       1314 non-null   int64  
 51  BedroomAbvGr   1314 non-null   int64  
 52  KitchenAbvGr   1314 non-null   int64  
 53  KitchenQual    1314 non-null   object 
 54  TotRmsAbvGrd   1314 non-null   int64  
 55  Functional     1314 non-null   object 
 56  Fireplaces     1314 non-null   int64  
 57  FireplaceQu    687 non-null    object 
 58  GarageType     1241 non-null   object 
 59  GarageYrBlt    1241 non-null   float64
 60  GarageFinish   1241 non-null   object 
 61  GarageCars     1314 non-null   int64  
 62  GarageArea     1314 non-null   int64  
 63  GarageQual     1241 non-null   object 
 64  GarageCond     1241 non-null   object 
 65  PavedDrive     1314 non-null   object 
 66  WoodDeckSF     1314 non-null   int64  
 67  OpenPorchSF    1314 non-null   int64  
 68  EnclosedPorch  1314 non-null   int64  
 69  3SsnPorch      1314 non-null   int64  
 70  ScreenPorch    1314 non-null   int64  
 71  PoolArea       1314 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          259 non-null    object 
 74  MiscFeature    50 non-null     object 
 75  MiscVal        1314 non-null   int64  
 76  MoSold         1314 non-null   int64  
 77  YrSold         1314 non-null   int64  
 78  SaleType       1314 non-null   object 
 79  SaleCondition  1314 non-null   object 
 80  SalePrice      1314 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 841.8+ KB

pandas-profiling

We do not have pandas-profiling in our course environment. You will have to install it in your environment if you want to run the code below.

conda install -c conda-forge pandas-profiling

# from pandas_profiling import ProfileReport

# profile = ProfileReport(train_df, title="Pandas Profiling Report")  # , minimal=True)
# profile.to_notebook_iframe()

Feature types

  • Do not blindly trust all the info given to you by automated tools.

  • How does pandas profiling figure out the data type?

    • You can look at the Python data type and say floats are numeric, strings are categorical.

    • However, in doing so you would miss out on various subtleties such as some of the string features being ordinal rather than truly categorical.

    • Also, it will think free text is categorical.

  • In addition to tools such as the above, it's important to go through the data description to understand the data.

  • The data description for our dataset is available here.

Feature types

  • We have mixed feature types and a bunch of missing values.

  • Now, let’s identify feature types and transformations.

  • Let’s get the numeric-looking columns.

numeric_looking_columns = X_train.select_dtypes(include=np.number).columns.tolist()
print(numeric_looking_columns)
['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

Not all numeric-looking columns are necessarily numeric.

train_df["MSSubClass"].unique()
array([ 20, 50, 30, 60, 160, 85, 90, 120, 180, 80, 70, 75, 190, 45, 40])

MSSubClass: Identifies the type of dwelling involved in the sale.

    20	1-STORY 1946 & NEWER ALL STYLES
    30	1-STORY 1945 & OLDER
    40	1-STORY W/FINISHED ATTIC ALL AGES
    45	1-1/2 STORY - UNFINISHED ALL AGES
    50	1-1/2 STORY FINISHED ALL AGES
    60	2-STORY 1946 & NEWER
    70	2-STORY 1945 & OLDER
    75	2-1/2 STORY ALL AGES
    80	SPLIT OR MULTI-LEVEL
    85	SPLIT FOYER
    90	DUPLEX - ALL STYLES AND AGES
   120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150	1-1/2 STORY PUD - ALL AGES
   160	2-STORY PUD - 1946 & NEWER
   180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190	2 FAMILY CONVERSION - ALL STYLES AND AGES

Also, month sold is more of a categorical feature than a numeric feature.

train_df["MoSold"].unique()  # Month Sold
array([ 1, 7, 3, 5, 8, 10, 6, 9, 12, 2, 4, 11])
drop_features = ["Id"]
numeric_features = [
    "BedroomAbvGr",
    "KitchenAbvGr",
    "LotFrontage",
    "LotArea",
    "OverallQual",
    "OverallCond",
    "YearBuilt",
    "YearRemodAdd",
    "MasVnrArea",
    "BsmtFinSF1",
    "BsmtFinSF2",
    "BsmtUnfSF",
    "TotalBsmtSF",
    "1stFlrSF",
    "2ndFlrSF",
    "LowQualFinSF",
    "GrLivArea",
    "BsmtFullBath",
    "BsmtHalfBath",
    "FullBath",
    "HalfBath",
    "TotRmsAbvGrd",
    "Fireplaces",
    "GarageYrBlt",
    "GarageCars",
    "GarageArea",
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "3SsnPorch",
    "ScreenPorch",
    "PoolArea",
    "MiscVal",
    "YrSold",
]
set(numeric_looking_columns) - set(numeric_features) - set(drop_features)
{'MSSubClass', 'MoSold'}

We’ll treat the above numeric-looking features as categorical features.

  • There are a bunch of ordinal features in this dataset.

  • Ordinal features with the same scale

    • Poor (Po), Fair (Fa), Typical (TA), Good (Gd), Excellent (Ex)

    • These we’ll be calling ordinal_features_reg.

  • Ordinal features with different scales

    • These we’ll be calling ordinal_features_oth.

ordinal_features_reg = [
    "ExterQual",
    "ExterCond",
    "BsmtQual",
    "BsmtCond",
    "HeatingQC",
    "KitchenQual",
    "FireplaceQu",
    "GarageQual",
    "GarageCond",
    "PoolQC",
]
ordering = [
    "Po",
    "Fa",
    "TA",
    "Gd",
    "Ex",
]  # if N/A it will just impute something, per below
ordering_ordinal_reg = [ordering] * len(ordinal_features_reg)
ordering_ordinal_reg
[['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex']]

We’ll pass the above as categories in our OrdinalEncoder.
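
As a toy illustration (with made-up data, not from our dataset), an explicit categories list makes OrdinalEncoder map the levels to 0, 1, 2, ... in exactly the order we specify:

toy = pd.DataFrame({"ExterQual": ["TA", "Gd", "Po", "Ex"]})
toy_enc = OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]])
toy_enc.fit_transform(toy)  # array([[2.], [3.], [0.], [4.]])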

  • There are a bunch more ordinal features using different scales.

    • These we’ll be calling ordinal_features_oth.

    • We are encoding them separately.

ordinal_features_oth = [
    "BsmtExposure",
    "BsmtFinType1",
    "BsmtFinType2",
    "Functional",
    "Fence",
]
ordering_ordinal_oth = [
    ["NA", "No", "Mn", "Av", "Gd"],
    ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    ["Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"],
    ["NA", "MnWw", "GdWo", "MnPrv", "GdPrv"],
]

The remaining features are categorical features.

categorical_features = list(
    set(X_train.columns)
    - set(numeric_features)
    - set(ordinal_features_reg)
    - set(ordinal_features_oth)
    - set(drop_features)
)
categorical_features
['RoofStyle', 'BldgType', 'Electrical', 'Condition2', 'Alley', 'LandSlope', 'CentralAir', 'GarageFinish', 'RoofMatl', 'Condition1', 'Foundation', 'LotShape', 'MSZoning', 'PavedDrive', 'SaleCondition', 'Heating', 'Neighborhood', 'HouseStyle', 'MSSubClass', 'MasVnrType', 'MoSold', 'LotConfig', 'GarageType', 'Street', 'SaleType', 'Exterior2nd', 'MiscFeature', 'LandContour', 'Exterior1st', 'Utilities']
  • We are not doing it here, but we can engineer our own features too.

  • Would price per square foot be a good feature to add in here?

Applying feature transformations

  • Since we have mixed feature types, let’s use ColumnTransformer to apply different transformations on different features types.

from sklearn.compose import make_column_transformer

numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
ordinal_transformer_reg = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=ordering_ordinal_reg),
)

ordinal_transformer_oth = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=ordering_ordinal_oth),
)

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)

preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numeric_features),
    (ordinal_transformer_reg, ordinal_features_reg),
    (ordinal_transformer_oth, ordinal_features_oth),
    (categorical_transformer, categorical_features),
)

Examining the preprocessed data

preprocessor.fit(X_train)  # Calling fit to examine all the transformers.
preprocessor.named_transformers_
{'drop': 'drop', 'pipeline-1': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), 'pipeline-2': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('ordinalencoder', OrdinalEncoder(categories=[['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex']]))]), 'pipeline-3': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('ordinalencoder', OrdinalEncoder(categories=[['NA', 'No', 'Mn', 'Av', 'Gd'], ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'], ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'], ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'], ['NA', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']]))]), 'pipeline-4': Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])}
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-4"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)
new_columns = (
    numeric_features + ordinal_features_reg + ordinal_features_oth + ohe_columns
)
X_train_enc = pd.DataFrame(
    preprocessor.transform(X_train), index=X_train.index, columns=new_columns
)
X_train_enc.head()
X_train.shape
(1314, 80)
X_train_enc.shape
(1314, 262)

We went from 80 features to 262 features!!

Other possible preprocessing?

  • There is a lot of room for improvement ...

  • We’re just using SimpleImputer.

    • In reality we’d want to go through this more carefully.

    • We may also want to drop some columns that are almost entirely missing (see the sketch after this list).

  • We could also check for outliers, and do other exploratory data analysis (EDA).

  • But for now this is good enough ...
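
For instance, here is one quick sketch for spotting nearly-empty columns; the 80% threshold is an arbitrary choice for illustration:

# Fraction of missing values per column, sorted. In this training split,
# PoolQC, MiscFeature, Alley, and Fence are more than 80% missing.
missing_frac = X_train.isnull().mean().sort_values(ascending=False)
missing_frac[missing_frac > 0.8]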





Model building

DummyRegressor

dummy = DummyRegressor()
pd.DataFrame(cross_validate(dummy, X_train, y_train, cv=10, return_train_score=True))

Let’s try a linear model: Ridge

  • Recall that we are going to use Ridge() instead of LinearRegression() in this course.

  • Similar to linear regression, ridge regression is also a linear model for regression.

  • So the formula it uses to make predictions is the same one used for ordinary least squares.

  • But it has a hyperparameter alpha which controls the fundamental tradeoff.
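
For reference (we won't derive this here), ridge regression fits the same linear prediction function as ordinary least squares, but its training objective adds a penalty on the size of the coefficients, weighted by alpha:

$$\min_{w} \sum_{i=1}^n \left(y_i - w^\top x_i\right)^2 + \alpha \lVert w \rVert_2^2$$

Larger alpha puts more weight on keeping the coefficients small.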

lr = make_pipeline(preprocessor, Ridge())
lr.fit(X_train, y_train);
lr_preds = lr.predict(X_test)
lr_preds[:10]
array([224865.34161762, 65243.29045704, 133337.46729488, 250900.65424895, 128984.32127053, 205597.8361313 , 334576.71199006, 158982.72345029, 148343.09009799, 129257.94933723])
lr_preds.max(), lr_preds.min()
(np.float64(411412.67186503176), np.float64(38853.27502379083))
print("Smallest coefficient: ", lr.named_steps["ridge"].coef_.min())
print("Largest coefficient:", lr.named_steps["ridge"].coef_.max())
Smallest coefficient:  -191169.07174469245
Largest coefficient: 83603.01311996076

Let’s carry out cross-validation with Ridge.

lr_pipe = make_pipeline(preprocessor, Ridge())
pd.DataFrame(cross_validate(lr_pipe, X_train, y_train, cv=10, return_train_score=True))
  • Quite a bit of variation in the test scores.

  • Performing poorly in fold 8; it's not clear why.

    • Probably that fold contains some of the outliers in the data, which we have mostly ignored.

Feature names of transformed data

  • If you want to get the column names of newly created columns, you need to fit the transformer.
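
In recent scikit-learn versions you can also ask the fitted ColumnTransformer for all the output feature names at once instead of assembling them by hand; a quick sketch (the names come back prefixed by the transformer that produced them):

preprocessor.get_feature_names_out()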

preprocessor
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-4"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)
ohe_columns
['RoofStyle_Flat', 'RoofStyle_Gable', 'RoofStyle_Gambrel', 'RoofStyle_Hip', 'RoofStyle_Mansard', 'RoofStyle_Shed', 'BldgType_1Fam', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'Electrical_FuseA', 'Electrical_FuseF', 'Electrical_FuseP', 'Electrical_Mix', 'Electrical_SBrkr', 'Electrical_missing', 'Condition2_Artery', 'Condition2_Feedr', 'Condition2_Norm', 'Condition2_PosA', 'Condition2_PosN', 'Condition2_RRAe', 'Condition2_RRAn', 'Condition2_RRNn', 'Alley_Grvl', 'Alley_Pave', 'Alley_missing', 'LandSlope_Gtl', 'LandSlope_Mod', 'LandSlope_Sev', 'CentralAir_N', 'CentralAir_Y', 'GarageFinish_Fin', 'GarageFinish_RFn', 'GarageFinish_Unf', 'GarageFinish_missing', 'RoofMatl_ClyTile', 'RoofMatl_CompShg', 'RoofMatl_Membran', 'RoofMatl_Metal', 'RoofMatl_Roll', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 'RoofMatl_WdShngl', 'Condition1_Artery', 'Condition1_Feedr', 'Condition1_Norm', 'Condition1_PosA', 'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn', 'Condition1_RRNe', 'Condition1_RRNn', 'Foundation_BrkTil', 'Foundation_CBlock', 'Foundation_PConc', 'Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg', 'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM', 'PavedDrive_N', 'PavedDrive_P', 'PavedDrive_Y', 'SaleCondition_Abnorml', 'SaleCondition_AdjLand', 'SaleCondition_Alloca', 'SaleCondition_Family', 'SaleCondition_Normal', 'SaleCondition_Partial', 'Heating_Floor', 'Heating_GasA', 'Heating_GasW', 'Heating_Grav', 'Heating_OthW', 'Heating_Wall', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker', 'HouseStyle_1.5Fin', 'HouseStyle_1.5Unf', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2.5Unf', 'HouseStyle_2Story', 'HouseStyle_SFoyer', 'HouseStyle_SLvl', 'MSSubClass_20', 'MSSubClass_30', 'MSSubClass_40', 'MSSubClass_45', 'MSSubClass_50', 'MSSubClass_60', 'MSSubClass_70', 'MSSubClass_75', 'MSSubClass_80', 'MSSubClass_85', 'MSSubClass_90', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_180', 'MSSubClass_190', 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_Stone', 'MasVnrType_missing', 'MoSold_1', 'MoSold_2', 'MoSold_3', 'MoSold_4', 'MoSold_5', 'MoSold_6', 'MoSold_7', 'MoSold_8', 'MoSold_9', 'MoSold_10', 'MoSold_11', 'MoSold_12', 'LotConfig_Corner', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3', 'LotConfig_Inside', 'GarageType_2Types', 'GarageType_Attchd', 'GarageType_Basment', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Detchd', 'GarageType_missing', 'Street_Grvl', 'Street_Pave', 'SaleType_COD', 'SaleType_CWD', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw', 'SaleType_New', 'SaleType_Oth', 'SaleType_WD', 'Exterior2nd_AsbShng', 'Exterior2nd_AsphShn', 'Exterior2nd_Brk Cmn', 'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd_CmentBd', 'Exterior2nd_HdBoard', 'Exterior2nd_ImStucc', 'Exterior2nd_MetalSd', 'Exterior2nd_Other', 'Exterior2nd_Plywood', 'Exterior2nd_Stone', 'Exterior2nd_Stucco', 
'Exterior2nd_VinylSd', 'Exterior2nd_Wd Sdng', 'Exterior2nd_Wd Shng', 'MiscFeature_Gar2', 'MiscFeature_Othr', 'MiscFeature_Shed', 'MiscFeature_TenC', 'MiscFeature_missing', 'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low', 'LandContour_Lvl', 'Exterior1st_AsbShng', 'Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_BrkFace', 'Exterior1st_CBlock', 'Exterior1st_CemntBd', 'Exterior1st_HdBoard', 'Exterior1st_ImStucc', 'Exterior1st_MetalSd', 'Exterior1st_Plywood', 'Exterior1st_Stone', 'Exterior1st_Stucco', 'Exterior1st_VinylSd', 'Exterior1st_Wd Sdng', 'Exterior1st_WdShing', 'Utilities_AllPub', 'Utilities_NoSeWa']
new_columns = (
    numeric_features + ordinal_features_reg + ordinal_features_oth + ohe_columns
)

Tuning alpha hyperparameter of Ridge

  • Recall that Ridge has a hyperparameter alpha that controls the fundamental tradeoff.

  • This is like C in LogisticRegression but, annoyingly, alpha is the inverse of C.

  • That is, large C is like small alpha and vice versa.

  • Smaller alpha: lower training error (overfitting)

param_grid = {"ridge__alpha": 10.0 ** np.arange(-5, 5, 1)}

pipe_ridge = make_pipeline(preprocessor, Ridge())

search = GridSearchCV(pipe_ridge, param_grid, return_train_score=True, n_jobs=-1)
search.fit(X_train, y_train)
train_scores = search.cv_results_["mean_train_score"]
cv_scores = search.cv_results_["mean_test_score"]
plt.semilogx(param_grid["ridge__alpha"], train_scores.tolist(), label="train")
plt.semilogx(param_grid["ridge__alpha"], cv_scores.tolist(), label="cv")
plt.legend()
plt.xlabel("alpha")
plt.ylabel("score");
best_alpha = search.best_params_
best_alpha
{'ridge__alpha': np.float64(100.0)}
search.best_score_
np.float64(0.7951559890417759)
  • It seems alpha=100 is the best choice here.

  • General intuition: larger alpha leads to smaller coefficients.

  • Smaller coefficients mean the predictions are less sensitive to changes in the data. Hence less chance of overfitting.

pipe_bigalpha = make_pipeline(preprocessor, Ridge(alpha=1000))
pipe_bigalpha.fit(X_train, y_train)
bigalpha_coeffs = pipe_bigalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=bigalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)
  • Smaller alpha leads to bigger coefficients.

pipe_smallalpha = make_pipeline(preprocessor, Ridge(alpha=0.01))
pipe_smallalpha.fit(X_train, y_train)
smallalpha_coeffs = pipe_smallalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=smallalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)

With the best alpha found by the grid search, the coefficients are somewhere in between.

pipe_bestalpha = make_pipeline(
    preprocessor, Ridge(alpha=search.best_params_["ridge__alpha"])
)
pipe_bestalpha.fit(X_train, y_train)
bestalpha_coeffs = pipe_bestalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=bestalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)

To summarize:

  • Higher values of alpha means a more restricted model.

  • The values of coefficients are likely to be smaller for higher values of alpha compared to lower values of alpha.

RidgeCV

Because it’s so common to want to tune alpha with Ridge, sklearn provides a class called RidgeCV, which automatically tunes alpha based on cross-validation.

alphas = 10.0 ** np.arange(-6, 6, 1)
ridgecv_pipe = make_pipeline(preprocessor, RidgeCV(alphas=alphas, cv=10))
ridgecv_pipe.fit(X_train, y_train);
best_alpha = ridgecv_pipe.named_steps["ridgecv"].alpha_

best_alpha
np.float64(100.0)



Let’s examine the tuned model.

ridge_tuned = make_pipeline(preprocessor, Ridge(alpha=best_alpha))
ridge_tuned.fit(X_train, y_train)
ridge_preds = ridge_tuned.predict(X_test)
ridge_preds[:10]
array([228603.06797961, 104643.99931882, 155624.61029914, 246332.07245741, 127614.33726089, 243243.60429913, 304784.87844893, 145425.67512181, 157008.6526853 , 128528.91515735])
df = pd.DataFrame(
    data={"coefficients": ridge_tuned.named_steps["ridge"].coef_}, index=new_columns
)
df.sort_values("coefficients", ascending=False)

So according to this model:

  • As the OverallQual feature gets bigger, the predicted housing price gets bigger.

  • Neighborhood_Edwards is associated with a lower predicted housing price.

    • We’ll talk more about interpretation of different kinds of features next week.

ridge_preds.max(), ridge_preds.min()
(np.float64(390725.7472092076), np.float64(30815.281220255387))



❓❓ Questions for you

iClicker Exercise 10.1

Select all of the following statements which are TRUE.

  • (A) Price per square foot would be a good feature to add in our X.

  • (B) The alpha hyperparameter of Ridge has a similar interpretation to the C hyperparameter of LogisticRegression; higher alpha means a more complex model.

  • (C) In Ridge, smaller alpha means bigger coefficients whereas bigger alpha means smaller coefficients.

Can we use the metrics we looked at in the previous lecture for this problem? Why or why not?





Regression scoring functions

  • We aren’t doing classification anymore, so we can’t just check for equality:

ridge_tuned.predict(X_train) == y_train
302 False 767 False 429 False 1139 False 558 False ... 1041 False 1122 False 1346 False 1406 False 1389 False Name: SalePrice, Length: 1314, dtype: bool
y_train.values
array([205000, 160000, 175000, ..., 262500, 133000, 131000], shape=(1314,))
ridge_tuned.predict(X_train)
array([212870.41150573, 178494.22221894, 189981.34426571, ..., 245329.56804591, 129904.71327467, 135422.67672595], shape=(1314,))

We need a score that reflects how right/wrong each prediction is.

There are a number of popular scoring functions for regression. We are going to look at some common metrics:

  • mean squared error (MSE)

  • $R^2$

  • root mean squared error (RMSE)

  • MAPE

See sklearn documentation for more details.

Mean squared error (MSE)

  • A common metric is mean squared error:

preds = ridge_tuned.predict(X_train)
np.mean((y_train - preds) ** 2)
np.float64(872961060.7986544)

Perfect predictions would have MSE=0.

np.mean((y_train - y_train) ** 2)
np.float64(0.0)

This is also implemented in sklearn:

from sklearn.metrics import mean_squared_error

mean_squared_error(y_train, preds)
872961060.7986544
  • MSE looks huge and unreasonable. There is an error of ~$1 Billion!

  • Is this score good or bad?

  • Unlike classification, with regression our target has units.

  • The target is in dollars, so the mean squared error is in $\text{dollars}^2$.

  • The score also depends on the scale of the targets.

  • If we were working in cents instead of dollars, our MSE would be $10{,}000\times$ (i.e., $100^2$) higher!

np.mean((y_train * 100 - preds * 100) ** 2)

Root mean squared error or RMSE

  • The MSE above is in $\text{dollars}^2$.

  • A more relatable metric would be the root mean squared error, or RMSE

np.sqrt(mean_squared_error(y_train, ridge_tuned.predict(X_train)))
np.float64(29545.914451894267)
  • An error of ~$30,000 makes more sense.

  • Let’s dig deeper.

plt.scatter(y_train, ridge_tuned.predict(X_train), alpha=0.3)
grid = np.linspace(y_train.min(), y_train.max(), 1000)
plt.plot(grid, grid, "--k")
plt.xlabel("true price")
plt.ylabel("predicted price");
  • Here we can see a few cases where our prediction is way off.

  • Is there something weird about those houses, perhaps? Outliers?

  • Under the line means we’re under-predicting, over the line means we’re over-predicting.

$R^2$ (not in detail)

A common score is $R^2$.

  • This is the score that sklearn uses by default when you call score()

  • You can read about it if interested.

  • $R^2$ measures the proportion of variability in $y$ that can be explained using $X$.

  • It is independent of the scale of $y$, so the maximum is 1.

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

  • The denominator measures the total variance in $y$.

  • The numerator measures the amount of variability left unexplained after performing the regression.

Key points:

  • The maximum is 1, for perfect predictions.

  • Negative values are very bad: "worse than DummyRegressor" (see the sanity check below).
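
A small sanity check with made-up numbers: predicting the mean of $y$ gives $R^2 = 0$ (what DummyRegressor would do), and doing worse than the mean gives a negative $R^2$:

y_toy = np.array([100, 200, 300, 400])
print(r2_score(y_toy, np.full_like(y_toy, y_toy.mean())))  # 0.0: same as predicting the mean
print(r2_score(y_toy, np.array([400, 300, 200, 100])))     # -3.0: worse than the mean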

(Optional) Warning: MSE is “reversible” but $R^2$ is not:

mean_squared_error(y_train, preds)
872961060.7986544
mean_squared_error(preds, y_train)
872961060.7986544
r2_score(y_train, preds)
0.8601643854446082
r2_score(preds, y_train)
0.8280229354283182
  • When you call fit, it minimizes MSE / maximizes $R^2$ (or something like that) by default.

  • Just like in classification, this isn’t always what you want!!

MAPE

  • We got an RMSE of ~$30,000 before.

Question: Is an error of $30,000 acceptable?

np.sqrt(mean_squared_error(y_train, ridge_tuned.predict(X_train)))
np.float64(29545.914451894267)
  • For a house worth $600k, it seems reasonable! That’s 5% error.

  • For a house worth $60k, that is terrible. It’s 50% error.

We have both of these cases in our dataset.

plt.hist(y_train, bins=100);

How about looking at percent error?

pred_train = ridge_tuned.predict(X_train)
percent_errors = (pred_train - y_train) / y_train * 100.0
percent_errors
302 3.839225 767 11.558889 429 8.560768 1139 -16.405650 558 17.194710 ... 1041 -0.496213 1122 -28.624615 1346 -6.541117 1406 -2.327283 1389 3.376089 Name: SalePrice, Length: 1314, dtype: float64

These are both positive (predict too high) and negative (predict too low).

We can look at the absolute percent error:

np.abs(percent_errors)
302 3.839225 767 11.558889 429 8.560768 1139 16.405650 558 17.194710 ... 1041 0.496213 1122 28.624615 1346 6.541117 1406 2.327283 1389 3.376089 Name: SalePrice, Length: 1314, dtype: float64

And, like MSE, we can take the average over examples. This is called mean absolute percent error (MAPE).

def my_mape(true, pred):
    return np.mean(np.abs((pred - true) / true))
my_mape(y_train, pred_train)
np.float64(0.10092665203438746)

Let’s use sklearn to calculate MAPE.

from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_train, pred_train)
0.10092665203438746
  • Ok, this is quite interpretable.

  • On average, we have around 10% error.

Transforming the targets

  • When you have prices or count data, the target values are often skewed.

  • Let’s look at our target column.

plt.hist(y_train, bins=100);
  • A common trick in such cases is applying a log transform on the target column to make it more normal and less skewed.

  • That is, transform $y \rightarrow \log(y)$.

  • Linear regression will usually work better on something that looks more normal.
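
One way to see why this can help: a model that is additive in log space is multiplicative in the original space, so errors of a fixed size in $\log(y)$ correspond to fixed percentage errors in the price:

$$\log \hat{y} = w^\top x \quad \Longleftrightarrow \quad \hat{y} = e^{w^\top x}$$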

plt.hist(np.log10(y_train), bins=100);

We can incorporate this in our pipeline using sklearn.

from sklearn.compose import TransformedTargetRegressor
ttr = TransformedTargetRegressor(
    Ridge(alpha=best_alpha), func=np.log1p, inverse_func=np.expm1
) # transformer for log transforming the target
ttr_pipe = make_pipeline(preprocessor, ttr)
ttr_pipe

Why can't we incorporate preprocessing of the targets in our ColumnTransformer?

ttr_pipe.fit(X_train, y_train); # y_train automatically transformed
ttr_pipe.predict(X_train)  # predictions automatically un-transformed
array([221329.43833466, 170670.12761659, 182639.31984311, ..., 248609.60023631, 132158.15026771, 133270.71866979], shape=(1314,))
mean_absolute_percentage_error(y_test, ttr_pipe.predict(X_test))
0.07808506982896264

We reduced MAPE from ~10% to ~8% with this trick!

  • Does .fit() know we care about MAPE?

  • No, it doesn’t. Why are we minimizing MSE (or something similar) if we care about MAPE??

  • When minimizing MSE, the expensive houses will dominate because they have the biggest errors (see the illustration below).
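
A quick illustration with made-up numbers: the same 10% relative error contributes 100x more squared error for an expensive house than for a cheap one, so a fit that minimizes MSE pays far more attention to the expensive houses.

expensive_sq_err = (1_000_000 * 0.10) ** 2  # $1M house, 10% off -> 1e10
cheap_sq_err = (100_000 * 0.10) ** 2        # $100k house, 10% off -> 1e8
print(expensive_sq_err / cheap_sq_err)      # 100.0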

Different scoring functions with cross_validate

  • Let's try using MSE instead of the default $R^2$ score.

pd.DataFrame(
    cross_validate(
        ridge_tuned,
        X_train,
        y_train,
        return_train_score=True,
        scoring="neg_mean_squared_error",
    )
)
# make a scorer function that we can pass into cross-validation
mape_scorer = make_scorer(my_mape, greater_is_better=False)

pd.DataFrame(
    cross_validate(
        ridge_tuned, X_train, y_train, return_train_score=True, scoring=mape_scorer
    )
)

If you are finding greater_is_better=False argument confusing, here is the documentation:

greater_is_better (bool), default=True: Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.

Since our custom metric my_mape returns an error rather than a score, I'm passing greater_is_better=False; the scorer will flip the sign so that we can interpret bigger numbers as better performance.
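
A quick check of the sign flip, using the already-fitted ridge_tuned pipeline from above:

mape_scorer(ridge_tuned, X_train, y_train)  # roughly -0.10: the negated training MAPE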

# ?make_scorer
scoring = {
    "r2": "r2",
    "mape_scorer": mape_scorer, # just for demonstration for a custom scorer
    "sklearn MAPE": "neg_mean_absolute_percentage_error",
    "neg_root_mean_square_error": "neg_root_mean_squared_error",
    "neg_mean_squared_error": "neg_mean_squared_error",
}

pd.DataFrame(
    cross_validate(
        ridge_tuned, X_train, y_train, return_train_score=True, scoring=scoring
    )
).T

Are we getting the same alpha with MAPE?

param_grid = {"ridge__alpha": 10.0 ** np.arange(-6, 6, 1)}

pipe_ridge = make_pipeline(preprocessor, Ridge())

search = GridSearchCV(
    pipe_ridge, param_grid, return_train_score=True, n_jobs=-1, scoring=mape_scorer
)
search.fit(X_train, y_train);
print("Best hyperparameter values: ", search.best_params_)
print("Best score: %0.3f" % (search.best_score_))
pd.DataFrame(search.cv_results_)[
    [
        "mean_train_score",
        "mean_test_score",
        "param_ridge__alpha",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

Using multiple metrics in GridSearchCV or RandomizedSearchCV

  • We could use multiple metrics with GridSearchCV or RandomizedSearchCV.

  • But if you do so, you need to set refit to the metric (string) for which the best_params_ will be found and used to build the best_estimator_ on the whole dataset.

search_multi = GridSearchCV(
    pipe_ridge,
    param_grid,
    return_train_score=True,
    n_jobs=-1,
    scoring=scoring,
    refit="sklearn MAPE",
)
search_multi.fit(X_train, y_train);
print("Best hyperparameter values: ", search_multi.best_params_)
print("Best score: %0.3f" % (search_multi.best_score_))
pd.DataFrame(search_multi.cv_results_).set_index("rank_test_mape_scorer").sort_index()

What’s the test score?

search_multi.score(X_test, y_test)
my_mape(y_test, ridge_tuned.predict(X_test))

Using regression metrics with scikit-learn

  • In sklearn, you will notice negated versions of the error metrics above (e.g., neg_mean_squared_error, neg_root_mean_squared_error).

  • The reason for this is that scorers must return a value to maximize (the higher the better), so error metrics are negated to fit that convention.
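
For example (a sketch reusing the pipeline and data from above), the neg_* scorers return negative numbers, and we flip the sign back when reporting:

scores = cross_val_score(
    ridge_tuned, X_train, y_train, scoring="neg_root_mean_squared_error", cv=5
)
print(-scores.mean())  # average cross-validation RMSE, in dollars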

❓❓ Questions for you

iClicker Exercise 10.2

Select all of the following statements which are TRUE.

  • (A) We can still use precision and recall for regression problems but now we have other metrics we can use as well.

  • (B) In sklearn for regression problems, using r2_score() and .score() (with default values) will produce the same results.

  • (C) RMSE is always going to be non-negative.

  • (D) MSE does not directly provide the information about whether the model is underpredicting or overpredicting.

  • (E) We can pass multiple scoring metrics to GridSearchCV or RandomizedSearchCV for regression as well as classification problems.





What did we learn today?

  • House prices dataset target is price, which is numeric -> regression rather than classification

  • There are corresponding versions of all the tools we used:

    • DummyClassifier -> DummyRegressor

    • LogisticRegression -> Ridge

  • Ridge hyperparameter alpha is like LogisticRegression hyperparameter C, but opposite meaning

  • We’ll avoid LinearRegression in this course.

  • Scoring metrics

  • $R^2$ is the default for .score(); it is unitless, 0 is bad, 1 is best

  • MSE (mean squared error) is in units of target squared, hard to interpret; 0 is best

  • RMSE (root mean squared error) is in the same units as the target; 0 is best

  • MAPE (mean absolute percent error) is unitless; 0 is best, 1 is bad