
Lecture 10: Regression metrics


UBC, 2025-26

Imports and LO

Imports

import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

%matplotlib inline
DATA_DIR = os.path.join(os.path.abspath(".."), "data/")

# Ignore FutureWarnings from sklearn. Setting the PYTHONWARNINGS environment
# variable (rather than using the `warnings` module) also silences warnings
# raised in subprocesses.
os.environ["PYTHONWARNINGS"] = "ignore::FutureWarning"

Learning outcomes

From this lecture, students are expected to be able to:

  • Carry out feature transformations on a somewhat complicated dataset.

  • Visualize transformed features as a dataframe.

  • Use Ridge and RidgeCV.

  • Explain how the alpha hyperparameter of Ridge relates to the fundamental tradeoff.

  • Explain the effect of alpha on the magnitude of the learned coefficients.

  • Examine coefficients of transformed features.

  • Appropriately select a scoring metric given a regression problem.

  • Interpret and communicate the meanings of different scoring metrics on regression problems.

    • MSE, RMSE, $R^2$, MAPE

  • Apply a log transform to the target values in a regression problem with TransformedTargetRegressor.

After carrying out preprocessing, why is it useful to get feature names for the transformed features?

More comments on tackling class imbalance

  • In Lecture 9 we talked about a few rather simple approaches to dealing with class imbalance.

  • If you have a problem such as fraud detection, where you want to spot rare events, you can also frame it as an anomaly detection problem and use different kinds of algorithms, such as isolation forests (see the sketch below).

  • If you are interested in this area, it might be worth checking out this book: Imbalanced Learning: Foundations, Algorithms, and Applications
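
Below is a minimal sketch of this idea using sklearn's IsolationForest on synthetic data; the toy data and the contamination value here are made up for illustration and are not part of our course pipeline.

from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_common = rng.normal(loc=0, scale=1, size=(1000, 2))  # common events
X_rare = rng.normal(loc=5, scale=1, size=(10, 2))      # rare events
X_toy = np.vstack([X_common, X_rare])

# `contamination` is our guess at the fraction of anomalies in the data
iso = IsolationForest(contamination=0.01, random_state=42)
iso.fit(X_toy)
labels = iso.predict(X_toy)  # +1 for inliers, -1 for flagged anomalies
print("Number flagged as anomalies:", (labels == -1).sum())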

Dataset [video]

In this lecture, we'll be using the Kaggle House Prices dataset. As usual, to run this notebook you'll need to download the data. For this dataset, train and test have already been separated; we'll be working with the train portion in this lecture.

df = pd.read_csv(DATA_DIR + "housing-kaggle/train.csv")
train_df, test_df = train_test_split(df, test_size=0.10, random_state=123)
train_df.head()
  • The supervised machine learning problem is predicting housing price given features associated with properties.

  • Here, the target is SalePrice, which is continuous. So it’s a regression problem (as opposed to classification).

train_df.shape
(1314, 81)

Let’s separate X and y

X_train = train_df.drop(columns=["SalePrice"])
y_train = train_df["SalePrice"]

X_test = test_df.drop(columns=["SalePrice"])
y_test = test_df["SalePrice"]

EDA

train_df.describe()
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1314 entries, 302 to 1389
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1314 non-null   int64  
 1   MSSubClass     1314 non-null   int64  
 2   MSZoning       1314 non-null   object 
 3   LotFrontage    1089 non-null   float64
 4   LotArea        1314 non-null   int64  
 5   Street         1314 non-null   object 
 6   Alley          81 non-null     object 
 7   LotShape       1314 non-null   object 
 8   LandContour    1314 non-null   object 
 9   Utilities      1314 non-null   object 
 10  LotConfig      1314 non-null   object 
 11  LandSlope      1314 non-null   object 
 12  Neighborhood   1314 non-null   object 
 13  Condition1     1314 non-null   object 
 14  Condition2     1314 non-null   object 
 15  BldgType       1314 non-null   object 
 16  HouseStyle     1314 non-null   object 
 17  OverallQual    1314 non-null   int64  
 18  OverallCond    1314 non-null   int64  
 19  YearBuilt      1314 non-null   int64  
 20  YearRemodAdd   1314 non-null   int64  
 21  RoofStyle      1314 non-null   object 
 22  RoofMatl       1314 non-null   object 
 23  Exterior1st    1314 non-null   object 
 24  Exterior2nd    1314 non-null   object 
 25  MasVnrType     528 non-null    object 
 26  MasVnrArea     1307 non-null   float64
 27  ExterQual      1314 non-null   object 
 28  ExterCond      1314 non-null   object 
 29  Foundation     1314 non-null   object 
 30  BsmtQual       1280 non-null   object 
 31  BsmtCond       1280 non-null   object 
 32  BsmtExposure   1279 non-null   object 
 33  BsmtFinType1   1280 non-null   object 
 34  BsmtFinSF1     1314 non-null   int64  
 35  BsmtFinType2   1280 non-null   object 
 36  BsmtFinSF2     1314 non-null   int64  
 37  BsmtUnfSF      1314 non-null   int64  
 38  TotalBsmtSF    1314 non-null   int64  
 39  Heating        1314 non-null   object 
 40  HeatingQC      1314 non-null   object 
 41  CentralAir     1314 non-null   object 
 42  Electrical     1313 non-null   object 
 43  1stFlrSF       1314 non-null   int64  
 44  2ndFlrSF       1314 non-null   int64  
 45  LowQualFinSF   1314 non-null   int64  
 46  GrLivArea      1314 non-null   int64  
 47  BsmtFullBath   1314 non-null   int64  
 48  BsmtHalfBath   1314 non-null   int64  
 49  FullBath       1314 non-null   int64  
 50  HalfBath       1314 non-null   int64  
 51  BedroomAbvGr   1314 non-null   int64  
 52  KitchenAbvGr   1314 non-null   int64  
 53  KitchenQual    1314 non-null   object 
 54  TotRmsAbvGrd   1314 non-null   int64  
 55  Functional     1314 non-null   object 
 56  Fireplaces     1314 non-null   int64  
 57  FireplaceQu    687 non-null    object 
 58  GarageType     1241 non-null   object 
 59  GarageYrBlt    1241 non-null   float64
 60  GarageFinish   1241 non-null   object 
 61  GarageCars     1314 non-null   int64  
 62  GarageArea     1314 non-null   int64  
 63  GarageQual     1241 non-null   object 
 64  GarageCond     1241 non-null   object 
 65  PavedDrive     1314 non-null   object 
 66  WoodDeckSF     1314 non-null   int64  
 67  OpenPorchSF    1314 non-null   int64  
 68  EnclosedPorch  1314 non-null   int64  
 69  3SsnPorch      1314 non-null   int64  
 70  ScreenPorch    1314 non-null   int64  
 71  PoolArea       1314 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          259 non-null    object 
 74  MiscFeature    50 non-null     object 
 75  MiscVal        1314 non-null   int64  
 76  MoSold         1314 non-null   int64  
 77  YrSold         1314 non-null   int64  
 78  SaleType       1314 non-null   object 
 79  SaleCondition  1314 non-null   object 
 80  SalePrice      1314 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 841.8+ KB

pandas-profiling

We do not have pandas-profiling in our course environment. You will have to install it in your environment if you want to run the code below.

conda install -c conda-forge pandas-profiling

# from pandas_profiling import ProfileReport

# profile = ProfileReport(train_df, title="Pandas Profiling Report")  # , minimal=True)
# profile.to_notebook_iframe()

Feature types

  • Do not blindly trust all the info given to you by automated tools.

  • How does pandas profiling figure out the data type?

    • You can look at the Python data type and say floats are numeric, strings are categorical.

    • However, in doing so you would miss out on various subtleties such as some of the string features being ordinal rather than truly categorical.

    • Also, it will think free text is categorical.

  • In addition to tools such as the above, it's important to go through the data description to understand the data.

  • The data description for our dataset is available here.

Feature types

  • We have mixed feature types and a bunch of missing values.

  • Now, let’s identify feature types and transformations.

  • Let’s get the numeric-looking columns.

numeric_looking_columns = X_train.select_dtypes(include=np.number).columns.tolist()
print(numeric_looking_columns)
['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

Not all numeric-looking columns are necessarily numeric.

train_df["MSSubClass"].unique()
array([ 20, 50, 30, 60, 160, 85, 90, 120, 180, 80, 70, 75, 190, 45, 40])

MSSubClass: Identifies the type of dwelling involved in the sale.

    20	1-STORY 1946 & NEWER ALL STYLES
    30	1-STORY 1945 & OLDER
    40	1-STORY W/FINISHED ATTIC ALL AGES
    45	1-1/2 STORY - UNFINISHED ALL AGES
    50	1-1/2 STORY FINISHED ALL AGES
    60	2-STORY 1946 & NEWER
    70	2-STORY 1945 & OLDER
    75	2-1/2 STORY ALL AGES
    80	SPLIT OR MULTI-LEVEL
    85	SPLIT FOYER
    90	DUPLEX - ALL STYLES AND AGES
   120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150	1-1/2 STORY PUD - ALL AGES
   160	2-STORY PUD - 1946 & NEWER
   180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190	2 FAMILY CONVERSION - ALL STYLES AND AGES

Also, month sold is more of a categorical feature than a numeric feature.

train_df["MoSold"].unique()  # Month Sold
array([ 1, 7, 3, 5, 8, 10, 6, 9, 12, 2, 4, 11])
drop_features = ["Id"]
numeric_features = [
    "BedroomAbvGr",
    "KitchenAbvGr",
    "LotFrontage",
    "LotArea",
    "OverallQual",
    "OverallCond",
    "YearBuilt",
    "YearRemodAdd",
    "MasVnrArea",
    "BsmtFinSF1",
    "BsmtFinSF2",
    "BsmtUnfSF",
    "TotalBsmtSF",
    "1stFlrSF",
    "2ndFlrSF",
    "LowQualFinSF",
    "GrLivArea",
    "BsmtFullBath",
    "BsmtHalfBath",
    "FullBath",
    "HalfBath",
    "TotRmsAbvGrd",
    "Fireplaces",
    "GarageYrBlt",
    "GarageCars",
    "GarageArea",
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "3SsnPorch",
    "ScreenPorch",
    "PoolArea",
    "MiscVal",
    "YrSold",
]
set(numeric_looking_columns) - set(numeric_features) - set(drop_features)
{'MSSubClass', 'MoSold'}

We’ll treat the above numeric-looking features as categorical features.

  • There are a bunch of ordinal features in this dataset.

  • Ordinal features with the same scale

    • Poor (Po), Fair (Fa), Typical (TA), Good (Gd), Excellent (Ex)

    • These we’ll be calling ordinal_features_reg.

  • Ordinal features with different scales

    • These we’ll be calling ordinal_features_oth.

ordinal_features_reg = [
    "ExterQual",
    "ExterCond",
    "BsmtQual",
    "BsmtCond",
    "HeatingQC",
    "KitchenQual",
    "FireplaceQu",
    "GarageQual",
    "GarageCond",
    "PoolQC",
]
ordering = [
    "Po",
    "Fa",
    "TA",
    "Gd",
    "Ex",
]  # if N/A it will just impute something, per below
ordering_ordinal_reg = [ordering] * len(ordinal_features_reg)
ordering_ordinal_reg
[['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex']]

We’ll pass the above as categories in our OrdinalEncoder.
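
As a toy illustration (with made-up data, not from our dataset), an explicit categories list makes OrdinalEncoder map the levels to 0, 1, 2, ... in exactly the order we specify:

toy = pd.DataFrame({"ExterQual": ["TA", "Gd", "Po", "Ex"]})
toy_enc = OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]])
toy_enc.fit_transform(toy)  # array([[2.], [3.], [0.], [4.]])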

  • There are a bunch more ordinal features using different scales.

    • These we’ll be calling ordinal_features_oth.

    • We are encoding them separately.

ordinal_features_oth = [
    "BsmtExposure",
    "BsmtFinType1",
    "BsmtFinType2",
    "Functional",
    "Fence",
]
ordering_ordinal_oth = [
    ["NA", "No", "Mn", "Av", "Gd"],
    ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    ["Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"],
    ["NA", "MnWw", "GdWo", "MnPrv", "GdPrv"],
]

The remaining features are categorical features.

categorical_features = list(
    set(X_train.columns)
    - set(numeric_features)
    - set(ordinal_features_reg)
    - set(ordinal_features_oth)
    - set(drop_features)
)
categorical_features
['RoofStyle', 'BldgType', 'Electrical', 'Condition2', 'Alley', 'LandSlope', 'CentralAir', 'GarageFinish', 'RoofMatl', 'Condition1', 'Foundation', 'LotShape', 'MSZoning', 'PavedDrive', 'SaleCondition', 'Heating', 'Neighborhood', 'HouseStyle', 'MSSubClass', 'MasVnrType', 'MoSold', 'LotConfig', 'GarageType', 'Street', 'SaleType', 'Exterior2nd', 'MiscFeature', 'LandContour', 'Exterior1st', 'Utilities']
  • We are not doing it here, but we can engineer our own features too.

  • Would price per square foot be a good feature to add in here?

Applying feature transformations

  • Since we have mixed feature types, let’s use ColumnTransformer to apply different transformations on different features types.

from sklearn.compose import make_column_transformer

numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
ordinal_transformer_reg = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=ordering_ordinal_reg),
)

ordinal_transformer_oth = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=ordering_ordinal_oth),
)

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)

preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numeric_features),
    (ordinal_transformer_reg, ordinal_features_reg),
    (ordinal_transformer_oth, ordinal_features_oth),
    (categorical_transformer, categorical_features),
)

Examining the preprocessed data

preprocessor.fit(X_train)  # Calling fit to examine all the transformers.
preprocessor.named_transformers_
{'drop': 'drop', 'pipeline-1': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), 'pipeline-2': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('ordinalencoder', OrdinalEncoder(categories=[['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ['Po', 'Fa', 'TA', 'Gd', 'Ex']]))]), 'pipeline-3': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('ordinalencoder', OrdinalEncoder(categories=[['NA', 'No', 'Mn', 'Av', 'Gd'], ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'], ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'], ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'], ['NA', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']]))]), 'pipeline-4': Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])}
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-4"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)
new_columns = (
    numeric_features + ordinal_features_reg + ordinal_features_oth + ohe_columns
)
X_train_enc = pd.DataFrame(
    preprocessor.transform(X_train), index=X_train.index, columns=new_columns
)
X_train_enc.head()
X_train.shape
(1314, 80)
X_train_enc.shape
(1314, 262)

We went from 80 features to 262 features!!

Other possible preprocessing?

  • There is a lot of room for improvement ...

  • We’re just using SimpleImputer.

    • In reality we’d want to go through this more carefully.

    • We may also want to drop some columns that are almost entirely missing (see the sketch after this list).

  • We could also check for outliers, and do other exploratory data analysis (EDA).

  • But for now this is good enough ...
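
For instance, here is one quick sketch for spotting nearly-empty columns; the 80% threshold is an arbitrary choice for illustration:

# Fraction of missing values per column, sorted. In this training split,
# PoolQC, MiscFeature, Alley, and Fence are more than 80% missing.
missing_frac = X_train.isnull().mean().sort_values(ascending=False)
missing_frac[missing_frac > 0.8]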





Model building

DummyRegressor

dummy = DummyRegressor()
pd.DataFrame(cross_validate(dummy, X_train, y_train, cv=10, return_train_score=True))

Let’s try a linear model: Ridge

  • Recall that we are going to use Ridge() instead of LinearRegression() in this course.

  • Similar to linear regression, ridge regression is also a linear model for regression.

  • So the formula it uses to make predictions is the same one used for ordinary least squares.

  • But it has a hyperparameter alpha which controls the fundamental tradeoff.
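
For reference (we won't derive this here), ridge regression fits the same linear prediction function as ordinary least squares, but its training objective adds a penalty on the size of the coefficients, weighted by alpha:

$$\min_{w} \sum_{i=1}^n \left(y_i - w^\top x_i\right)^2 + \alpha \lVert w \rVert_2^2$$

Larger alpha puts more weight on keeping the coefficients small.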

lr = make_pipeline(preprocessor, Ridge())
lr.fit(X_train, y_train);
lr_preds = lr.predict(X_test)
lr_preds[:10]
array([224865.34161762, 65243.29045704, 133337.46729488, 250900.65424895, 128984.32127053, 205597.8361313 , 334576.71199006, 158982.72345029, 148343.09009799, 129257.94933723])
lr_preds.max(), lr_preds.min()
(np.float64(411412.67186503176), np.float64(38853.27502379083))
print("Smallest coefficient: ", lr.named_steps["ridge"].coef_.min())
print("Largest coefficient:", lr.named_steps["ridge"].coef_.max())
Smallest coefficient:  -191169.07174469245
Largest coefficient: 83603.01311996076

Let’s carry out cross-validation with Ridge.

lr_pipe = make_pipeline(preprocessor, Ridge())
pd.DataFrame(cross_validate(lr_pipe, X_train, y_train, cv=10, return_train_score=True))
  • Quite a bit of variation in the test scores.

  • Performing poorly in fold 8; it's not clear why.

    • Probably that fold contains some of the outliers in the data, which we have mostly ignored.

Feature names of transformed data

  • If you want to get the column names of newly created columns, you need to fit the transformer.
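
In recent scikit-learn versions you can also ask the fitted ColumnTransformer for all the output feature names at once instead of assembling them by hand; a quick sketch (the names come back prefixed by the transformer that produced them):

preprocessor.get_feature_names_out()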

preprocessor
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-4"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)
ohe_columns
['RoofStyle_Flat', 'RoofStyle_Gable', 'RoofStyle_Gambrel', 'RoofStyle_Hip', 'RoofStyle_Mansard', 'RoofStyle_Shed', 'BldgType_1Fam', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'Electrical_FuseA', 'Electrical_FuseF', 'Electrical_FuseP', 'Electrical_Mix', 'Electrical_SBrkr', 'Electrical_missing', 'Condition2_Artery', 'Condition2_Feedr', 'Condition2_Norm', 'Condition2_PosA', 'Condition2_PosN', 'Condition2_RRAe', 'Condition2_RRAn', 'Condition2_RRNn', 'Alley_Grvl', 'Alley_Pave', 'Alley_missing', 'LandSlope_Gtl', 'LandSlope_Mod', 'LandSlope_Sev', 'CentralAir_N', 'CentralAir_Y', 'GarageFinish_Fin', 'GarageFinish_RFn', 'GarageFinish_Unf', 'GarageFinish_missing', 'RoofMatl_ClyTile', 'RoofMatl_CompShg', 'RoofMatl_Membran', 'RoofMatl_Metal', 'RoofMatl_Roll', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 'RoofMatl_WdShngl', 'Condition1_Artery', 'Condition1_Feedr', 'Condition1_Norm', 'Condition1_PosA', 'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn', 'Condition1_RRNe', 'Condition1_RRNn', 'Foundation_BrkTil', 'Foundation_CBlock', 'Foundation_PConc', 'Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg', 'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM', 'PavedDrive_N', 'PavedDrive_P', 'PavedDrive_Y', 'SaleCondition_Abnorml', 'SaleCondition_AdjLand', 'SaleCondition_Alloca', 'SaleCondition_Family', 'SaleCondition_Normal', 'SaleCondition_Partial', 'Heating_Floor', 'Heating_GasA', 'Heating_GasW', 'Heating_Grav', 'Heating_OthW', 'Heating_Wall', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker', 'HouseStyle_1.5Fin', 'HouseStyle_1.5Unf', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2.5Unf', 'HouseStyle_2Story', 'HouseStyle_SFoyer', 'HouseStyle_SLvl', 'MSSubClass_20', 'MSSubClass_30', 'MSSubClass_40', 'MSSubClass_45', 'MSSubClass_50', 'MSSubClass_60', 'MSSubClass_70', 'MSSubClass_75', 'MSSubClass_80', 'MSSubClass_85', 'MSSubClass_90', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_180', 'MSSubClass_190', 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_Stone', 'MasVnrType_missing', 'MoSold_1', 'MoSold_2', 'MoSold_3', 'MoSold_4', 'MoSold_5', 'MoSold_6', 'MoSold_7', 'MoSold_8', 'MoSold_9', 'MoSold_10', 'MoSold_11', 'MoSold_12', 'LotConfig_Corner', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3', 'LotConfig_Inside', 'GarageType_2Types', 'GarageType_Attchd', 'GarageType_Basment', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Detchd', 'GarageType_missing', 'Street_Grvl', 'Street_Pave', 'SaleType_COD', 'SaleType_CWD', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw', 'SaleType_New', 'SaleType_Oth', 'SaleType_WD', 'Exterior2nd_AsbShng', 'Exterior2nd_AsphShn', 'Exterior2nd_Brk Cmn', 'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd_CmentBd', 'Exterior2nd_HdBoard', 'Exterior2nd_ImStucc', 'Exterior2nd_MetalSd', 'Exterior2nd_Other', 'Exterior2nd_Plywood', 'Exterior2nd_Stone', 'Exterior2nd_Stucco', 
'Exterior2nd_VinylSd', 'Exterior2nd_Wd Sdng', 'Exterior2nd_Wd Shng', 'MiscFeature_Gar2', 'MiscFeature_Othr', 'MiscFeature_Shed', 'MiscFeature_TenC', 'MiscFeature_missing', 'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low', 'LandContour_Lvl', 'Exterior1st_AsbShng', 'Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_BrkFace', 'Exterior1st_CBlock', 'Exterior1st_CemntBd', 'Exterior1st_HdBoard', 'Exterior1st_ImStucc', 'Exterior1st_MetalSd', 'Exterior1st_Plywood', 'Exterior1st_Stone', 'Exterior1st_Stucco', 'Exterior1st_VinylSd', 'Exterior1st_Wd Sdng', 'Exterior1st_WdShing', 'Utilities_AllPub', 'Utilities_NoSeWa']
new_columns = (
    numeric_features + ordinal_features_reg + ordinal_features_oth + ohe_columns
)

Tuning alpha hyperparameter of Ridge

  • Recall that Ridge has a hyperparameter alpha that controls the fundamental tradeoff.

  • This is like C in LogisticRegression but, annoyingly, alpha is the inverse of C.

  • That is, large C is like small alpha and vice versa.

  • Smaller alpha: lower training error (overfitting)

param_grid = {"ridge__alpha": 10.0 ** np.arange(-5, 5, 1)}

pipe_ridge = make_pipeline(preprocessor, Ridge())

search = GridSearchCV(pipe_ridge, param_grid, return_train_score=True, n_jobs=-1)
search.fit(X_train, y_train)
train_scores = search.cv_results_["mean_train_score"]
cv_scores = search.cv_results_["mean_test_score"]
plt.semilogx(param_grid["ridge__alpha"], train_scores.tolist(), label="train")
plt.semilogx(param_grid["ridge__alpha"], cv_scores.tolist(), label="cv")
plt.legend()
plt.xlabel("alpha")
plt.ylabel("score");
best_alpha = search.best_params_
best_alpha
{'ridge__alpha': np.float64(100.0)}
search.best_score_
np.float64(0.7951559890417759)
  • It seems alpha=100 is the best choice here.

  • General intuition: larger alpha leads to smaller coefficients.

  • Smaller coefficients mean the predictions are less sensitive to changes in the data. Hence less chance of overfitting.

pipe_bigalpha = make_pipeline(preprocessor, Ridge(alpha=1000))
pipe_bigalpha.fit(X_train, y_train)
bigalpha_coeffs = pipe_bigalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=bigalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)
  • Smaller alpha leads to bigger coefficients.

pipe_smallalpha = make_pipeline(preprocessor, Ridge(alpha=0.01))
pipe_smallalpha.fit(X_train, y_train)
smallalpha_coeffs = pipe_smallalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=smallalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)

With the best alpha found by the grid search, the coefficients are somewhere in between.

pipe_bestalpha = make_pipeline(
    preprocessor, Ridge(alpha=search.best_params_["ridge__alpha"])
)
pipe_bestalpha.fit(X_train, y_train)
bestalpha_coeffs = pipe_bestalpha.named_steps["ridge"].coef_
pd.DataFrame(
    data=bestalpha_coeffs, index=new_columns, columns=["Coefficients"]
).sort_values(by="Coefficients", ascending=False)

To summarize:

  • Higher values of alpha means a more restricted model.

  • The values of coefficients are likely to be smaller for higher values of alpha compared to lower values of alpha.

RidgeCV

Because it’s so common to want to tune alpha with Ridge, sklearn provides a class called RidgeCV, which automatically tunes alpha based on cross-validation.

alphas = 10.0 ** np.arange(-6, 6, 1)
ridgecv_pipe = make_pipeline(preprocessor, RidgeCV(alphas=alphas, cv=10))
ridgecv_pipe.fit(X_train, y_train);
best_alpha = ridgecv_pipe.named_steps["ridgecv"].alpha_

best_alpha
np.float64(100.0)



Let’s examine the tuned model.

ridge_tuned = make_pipeline(preprocessor, Ridge(alpha=best_alpha))
ridge_tuned.fit(X_train, y_train)
ridge_preds = ridge_tuned.predict(X_test)
ridge_preds[:10]
array([228603.06797961, 104643.99931882, 155624.61029914, 246332.07245741, 127614.33726089, 243243.60429913, 304784.87844893, 145425.67512181, 157008.6526853 , 128528.91515735])
df = pd.DataFrame(
    data={"coefficients": ridge_tuned.named_steps["ridge"].coef_}, index=new_columns
)
df.sort_values("coefficients", ascending=False)

So according to this model:

  • As the OverallQual feature gets bigger, the predicted housing price gets bigger.

  • Neighborhood_Edwards is associated with a lower predicted housing price.

    • We’ll talk more about interpretation of different kinds of features next week.

ridge_preds.max(), ridge_preds.min()
(np.float64(390725.7472092076), np.float64(30815.281220255387))



❓❓ Questions for you

iClicker Exercise 10.1

Select all of the following statements which are TRUE.

  • (A) Price per square foot would be a good feature to add in our X.

  • (B) The alpha hyperparameter of Ridge has a similar interpretation to the C hyperparameter of LogisticRegression; higher alpha means a more complex model.

  • (C) In Ridge, smaller alpha means bigger coefficients whereas bigger alpha means smaller coefficients.

Can we use the metrics we looked at in the previous lecture for this problem? Why or why not?





Regression scoring functions

  • We aren’t doing classification anymore, so we can’t just check for equality:

ridge_tuned.predict(X_train) == y_train
302 False 767 False 429 False 1139 False 558 False ... 1041 False 1122 False 1346 False 1406 False 1389 False Name: SalePrice, Length: 1314, dtype: bool
y_train.values
array([205000, 160000, 175000, ..., 262500, 133000, 131000], shape=(1314,))
ridge_tuned.predict(X_train)
array([212870.41150573, 178494.22221894, 189981.34426571, ..., 245329.56804591, 129904.71327467, 135422.67672595], shape=(1314,))

We need a score that reflects how right/wrong each prediction is.

There are a number of popular scoring functions for regression. We are going to look at some common metrics:

  • mean squared error (MSE)

  • $R^2$

  • root mean squared error (RMSE)

  • MAPE

See sklearn documentation for more details.

Mean squared error (MSE)

  • A common metric is mean squared error:

preds = ridge_tuned.predict(X_train)
np.mean((y_train - preds) ** 2)
np.float64(872961060.7986544)

Perfect predictions would have MSE=0.

np.mean((y_train - y_train) ** 2)
np.float64(0.0)

This is also implemented in sklearn:

from sklearn.metrics import mean_squared_error

mean_squared_error(y_train, preds)
872961060.7986544
  • MSE looks huge and unreasonable. There is an error of ~$1 Billion!

  • Is this score good or bad?

  • Unlike classification, with regression our target has units.

  • The target is in dollars, so the mean squared error is in $\text{dollars}^2$.

  • The score also depends on the scale of the targets.

  • If we were working in cents instead of dollars, our MSE would be $10{,}000\times$ (i.e., $100^2$) higher!

np.mean((y_train * 100 - preds * 100) ** 2)

Root mean squared error or RMSE

  • The MSE above is in $\text{dollars}^2$.

  • A more relatable metric would be the root mean squared error, or RMSE

np.sqrt(mean_squared_error(y_train, ridge_tuned.predict(X_train)))
np.float64(29545.914451894267)
  • An error of ~$30,000 makes more sense.

  • Let’s dig deeper.

plt.scatter(y_train, ridge_tuned.predict(X_train), alpha=0.3)
grid = np.linspace(y_train.min(), y_train.max(), 1000)
plt.plot(grid, grid, "--k")
plt.xlabel("true price")
plt.ylabel("predicted price");
  • Here we can see a few cases where our prediction is way off.

  • Is there something weird about those houses, perhaps? Outliers?

  • Under the line means we’re under-predicting, over the line means we’re over-predicting.

$R^2$ (not in detail)

A common score is $R^2$.

  • This is the score that sklearn uses by default when you call score()

  • You can read about it if interested.

  • $R^2$ measures the proportion of variability in $y$ that can be explained using $X$.

  • It is independent of the scale of $y$, so the maximum is 1.

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

  • The denominator measures the total variance in $y$.

  • The numerator measures the amount of variability left unexplained after performing the regression.

Key points:

  • The maximum is 1, for perfect predictions.

  • Negative values are very bad: "worse than DummyRegressor" (see the sanity check below).
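
A small sanity check with made-up numbers: predicting the mean of $y$ gives $R^2 = 0$ (what DummyRegressor would do), and doing worse than the mean gives a negative $R^2$:

y_toy = np.array([100, 200, 300, 400])
print(r2_score(y_toy, np.full_like(y_toy, y_toy.mean())))  # 0.0: same as predicting the mean
print(r2_score(y_toy, np.array([400, 300, 200, 100])))     # -3.0: worse than the mean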

(Optional) Warning: MSE is “reversible” but $R^2$ is not:

mean_squared_error(y_train, preds)
872961060.7986544
mean_squared_error(preds, y_train)
872961060.7986544
r2_score(y_train, preds)
0.8601643854446082
r2_score(preds, y_train)
0.8280229354283182
  • When you call fit, it minimizes MSE / maximizes $R^2$ (or something like that) by default.

  • Just like in classification, this isn’t always what you want!!

MAPE

  • We got an RMSE of ~$30,000 before.

Question: Is an error of $30,000 acceptable?

np.sqrt(mean_squared_error(y_train, ridge_tuned.predict(X_train)))
np.float64(29545.914451894267)
  • For a house worth $600k, it seems reasonable! That’s 5% error.

  • For a house worth $60k, that is terrible. It’s 50% error.

We have both of these cases in our dataset.

plt.hist(y_train, bins=100);

How about looking at percent error?

pred_train = ridge_tuned.predict(X_train)
percent_errors = (pred_train - y_train) / y_train * 100.0
percent_errors
302 3.839225 767 11.558889 429 8.560768 1139 -16.405650 558 17.194710 ... 1041 -0.496213 1122 -28.624615 1346 -6.541117 1406 -2.327283 1389 3.376089 Name: SalePrice, Length: 1314, dtype: float64

These are both positive (predict too high) and negative (predict too low).

We can look at the absolute percent error:

np.abs(percent_errors)
302 3.839225 767 11.558889 429 8.560768 1139 16.405650 558 17.194710 ... 1041 0.496213 1122 28.624615 1346 6.541117 1406 2.327283 1389 3.376089 Name: SalePrice, Length: 1314, dtype: float64

And, like MSE, we can take the average over examples. This is called mean absolute percent error (MAPE).

def my_mape(true, pred):
    return np.mean(np.abs((pred - true) / true))
my_mape(y_train, pred_train)
np.float64(0.10092665203438746)

Let’s use sklearn to calculate MAPE.

from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_train, pred_train)
0.10092665203438746
  • Ok, this is quite interpretable.

  • On average, we have around 10% error.

Transforming the targets

  • When you have prices or count data, the target values are often skewed.

  • Let’s look at our target column.

plt.hist(y_train, bins=100);
  • A common trick in such cases is applying a log transform on the target column to make it more normal and less skewed.

  • That is, transform $y \rightarrow \log(y)$.

  • Linear regression will usually work better on something that looks more normal.
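
One way to see why this can help: a model that is additive in log space is multiplicative in the original space, so errors of a fixed size in $\log(y)$ correspond to fixed percentage errors in the price:

$$\log \hat{y} = w^\top x \quad \Longleftrightarrow \quad \hat{y} = e^{w^\top x}$$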

plt.hist(np.log10(y_train), bins=100);

We can incorporate this in our pipeline using sklearn.

from sklearn.compose import TransformedTargetRegressor
ttr = TransformedTargetRegressor(
    Ridge(alpha=best_alpha), func=np.log1p, inverse_func=np.expm1
) # transformer for log transforming the target
ttr_pipe = make_pipeline(preprocessor, ttr)
ttr_pipe

Why can't we incorporate preprocessing of the targets in our ColumnTransformer?

ttr_pipe.fit(X_train, y_train); # y_train automatically transformed
ttr_pipe.predict(X_train)  # predictions automatically un-transformed
array([221329.43833466, 170670.12761659, 182639.31984311, ..., 248609.60023631, 132158.15026771, 133270.71866979], shape=(1314,))
mean_absolute_percentage_error(y_test, ttr_pipe.predict(X_test))
0.07808506982896264

We reduced MAPE from ~10% to ~8% with this trick!

  • Does .fit() know we care about MAPE?

  • No, it doesn’t. Why are we minimizing MSE (or something similar) if we care about MAPE??

  • When minimizing MSE, the expensive houses will dominate because they have the biggest errors (see the illustration below).
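
A quick illustration with made-up numbers: the same 10% relative error contributes 100x more squared error for an expensive house than for a cheap one, so a fit that minimizes MSE pays far more attention to the expensive houses.

expensive_sq_err = (1_000_000 * 0.10) ** 2  # $1M house, 10% off -> 1e10
cheap_sq_err = (100_000 * 0.10) ** 2        # $100k house, 10% off -> 1e8
print(expensive_sq_err / cheap_sq_err)      # 100.0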

Different scoring functions with cross_validate

  • Let's try using MSE instead of the default $R^2$ score.

pd.DataFrame(
    cross_validate(
        ridge_tuned,
        X_train,
        y_train,
        return_train_score=True,
        scoring="neg_mean_squared_error",
    )
)
# make a scorer function that we can pass into cross-validation
mape_scorer = make_scorer(my_mape, greater_is_better=False)

pd.DataFrame(
    cross_validate(
        ridge_tuned, X_train, y_train, return_train_score=True, scoring=mape_scorer
    )
)

If you are finding greater_is_better=False argument confusing, here is the documentation:

greater_is_better (bool), default=True: Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.

Since our custom metric my_mape returns an error rather than a score, I'm passing greater_is_better=False; the scorer will flip the sign so that we can interpret bigger numbers as better performance.
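
A quick check of the sign flip, using the already-fitted ridge_tuned pipeline from above:

mape_scorer(ridge_tuned, X_train, y_train)  # roughly -0.10: the negated training MAPE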

# ?make_scorer
scoring = {
    "r2": "r2",
    "mape_scorer": mape_scorer, # just for demonstration for a custom scorer
    "sklearn MAPE": "neg_mean_absolute_percentage_error",
    "neg_root_mean_square_error": "neg_root_mean_squared_error",
    "neg_mean_squared_error": "neg_mean_squared_error",
}

pd.DataFrame(
    cross_validate(
        ridge_tuned, X_train, y_train, return_train_score=True, scoring=scoring
    )
).T

Are we getting the same alpha with MAPE?

param_grid = {"ridge__alpha": 10.0 ** np.arange(-6, 6, 1)}

pipe_ridge = make_pipeline(preprocessor, Ridge())

search = GridSearchCV(
    pipe_ridge, param_grid, return_train_score=True, n_jobs=-1, scoring=mape_scorer
)
search.fit(X_train, y_train);
print("Best hyperparameter values: ", search.best_params_)
print("Best score: %0.3f" % (search.best_score_))
pd.DataFrame(search.cv_results_)[
    [
        "mean_train_score",
        "mean_test_score",
        "param_ridge__alpha",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

Using multiple metrics in GridSearchCV or RandomizedSearchCV

  • We could use multiple metrics with GridSearchCV or RandomizedSearchCV.

  • But if you do so, you need to set refit to the metric (string) for which the best_params_ will be found and used to build the best_estimator_ on the whole dataset.

search_multi = GridSearchCV(
    pipe_ridge,
    param_grid,
    return_train_score=True,
    n_jobs=-1,
    scoring=scoring,
    refit="sklearn MAPE",
)
search_multi.fit(X_train, y_train);
print("Best hyperparameter values: ", search_multi.best_params_)
print("Best score: %0.3f" % (search_multi.best_score_))
pd.DataFrame(search_multi.cv_results_).set_index("rank_test_mape_scorer").sort_index()

What’s the test score?

search_multi.score(X_test, y_test)
my_mape(y_test, ridge_tuned.predict(X_test))

Using regression metrics with scikit-learn

  • In sklearn, you will notice negated versions of the error metrics above (e.g., neg_mean_squared_error, neg_root_mean_squared_error).

  • The reason for this is that scorers must return a value to maximize (the higher the better), so error metrics are negated to fit that convention.
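
For example (a sketch reusing the pipeline and data from above), the neg_* scorers return negative numbers, and we flip the sign back when reporting:

scores = cross_val_score(
    ridge_tuned, X_train, y_train, scoring="neg_root_mean_squared_error", cv=5
)
print(-scores.mean())  # average cross-validation RMSE, in dollars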

❓❓ Questions for you

iClicker Exercise 10.2

Select all of the following statements which are TRUE.

  • (A) We can still use precision and recall for regression problems but now we have other metrics we can use as well.

  • (B) In sklearn for regression problems, using r2_score() and .score() (with default values) will produce the same results.

  • (C) RMSE is always going to be non-negative.

  • (D) MSE does not directly provide the information about whether the model is underpredicting or overpredicting.

  • (E) We can pass multiple scoring metrics to GridSearchCV or RandomizedSearchCV for regression as well as classification problems.





What did we learn today?

  • House prices dataset target is price, which is numeric -> regression rather than classification

  • There are corresponding versions of all the tools we used:

    • DummyClassifier -> DummyRegressor

    • LogisticRegression -> Ridge

  • Ridge hyperparameter alpha is like LogisticRegression hyperparameter C, but opposite meaning

  • We’ll avoid LinearRegression in this course.

  • Scoring metrics

  • $R^2$ is the default for .score(); it is unitless, 0 is bad, 1 is best

  • MSE (mean squared error) is in units of target squared, hard to interpret; 0 is best

  • RMSE (root mean squared error) is in the same units as the target; 0 is best

  • MAPE (mean absolute percent error) is unitless; 0 is best, 1 is bad