Lecture 7: Linear Models#

UBC 2023-24

Instructors: Varada Kolhatkar and Andrew Roth

Imports, Announcements, and LO#

Imports#

import os
import sys

sys.path.append("code/.")

import IPython
import ipywidgets as widgets
import matplotlib.pyplot as plt
import mglearn
import numpy as np
import pandas as pd
from IPython.display import HTML, display
from ipywidgets import interact, interactive
from plotting_functions import *
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from utils import *

%matplotlib inline
pd.set_option("display.max_colwidth", 200)

Announcements#

  • Homework 3 is due on Oct 2nd.

Learning outcomes#

From this lecture, students are expected to be able to:

  • Explain the general intuition behind linear models;

  • Explain how predict works for linear regression;

  • Use scikit-learn’s Ridge model;

  • Demonstrate how the alpha hyperparameter of Ridge is related to the fundamental tradeoff;

  • Explain the difference between linear regression and logistic regression;

  • Use scikit-learn’s LogisticRegression model and predict_proba to get probability scores

  • Explain the advantages of getting probability scores instead of hard predictions during classification;

  • Broadly describe linear SVMs

  • Explain how you can interpret model predictions using coefficients learned by a linear model;

  • Explain the advantages and limitations of linear classifiers.



Linear models [video]#

Linear models are a fundamental and widely used class of models. They are called linear because they make predictions using a linear function of the input features.

We will talk about three linear models:

  • Linear regression

  • Logistic regression

  • Linear SVM (brief mention)

Linear regression#

  • A very popular statistical model with a long history.

  • Imagine a hypothetical regression problem of predicting weight of a snake given its length.

np.random.seed(7)
n = 100
X_1 = np.linspace(0, 2, n) + np.random.randn(n) * 0.01
X = pd.DataFrame(X_1[:, None], columns=["length"])

y = abs(np.random.randn(n, 1)) * 3 + X_1[:, None] * 5 + 0.2
y = pd.DataFrame(y, columns=["weight"])
snakes_df = pd.concat([X, y], axis=1)
train_df, test_df = train_test_split(snakes_df, test_size=0.2, random_state=77)

X_train = train_df[["length"]].values
y_train = train_df["weight"].values
X_test = test_df[["length"]].values
y_test = test_df["weight"].values
train_df.head()
length weight
73 1.489130 10.507995
53 1.073233 7.658047
80 1.622709 9.748797
49 0.984653 9.731572
23 0.484937 3.016555

Let’s visualize the hypothetical snake data.

plt.plot(X_train, y_train, ".", markersize=10)
plt.xlabel("length")
plt.ylabel("weight (target)");
../_images/592d4e41fd09f5159988ad19b367642f0f7052dc7f14ba7a482737fc8c5e4123.png

Let’s plot a linear regression model on this dataset.

grid = np.linspace(min(X_train)[0], max(X_train)[0], 1000)
grid = grid.reshape(-1, 1)
from sklearn.linear_model import Ridge

r = Ridge()
r.fit(X_train, y_train)
plt.plot(X_train, y_train, ".", markersize=10)
plt.plot(grid, r.predict(grid))
plt.grid(True)
plt.xlabel("length")
plt.ylabel("weight (target)");
../_images/9d3450f337db4d3fca103ba23020669205ee3da12ca6262ed781eef523effcef.png

The orange line is the learned linear model.

Prediction of linear regression#

  • Given a snake length, we can use the model above to predict the target (i.e., the weight of the snake).

  • The prediction will be the corresponding weight on the orange line.

snake_length = 0.75
r.predict([[snake_length]])
array([6.20683258])

What are we exactly learning?#

  • The model above is a line, which can be represented with a slope (i.e., coefficient or weight) and an intercept.

  • For the above model, we can access the slope (i.e., coefficient or weight) and the intercept using coef_ and intercept_, respectively.

r.coef_  # r is our linear regression object
array([5.26370005])
r.intercept_  # r is our linear regression object
2.259057547817185

How are we making predictions?#

  • Given a feature value \(x_1\) and learned coefficient \(w_1\) and intercept \(b\), we can get the prediction \(\hat{y}\) with the following formula:

\[\hat{y} = w_1x_1 + b\]

prediction = snake_length * r.coef_ + r.intercept_
prediction
array([6.20683258])
r.predict([[snake_length]])
array([6.20683258])

Great! Now we know exactly how the model is making the prediction.

Generalizing to more features#

For more features, the model is a higher dimensional hyperplane and the general prediction formula looks as follows:

\[\hat{y} = w_1x_1 + \dots + w_dx_d + b\]

where,

  • (\(x_1, \dots, x_d\)) are input features

  • (\(w_1, \dots, w_d\)) are coefficients or weights (learned from the data)

  • \(b\) is the bias which can be used to offset your hyperplane (learned from the data)
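For example, here is a minimal sketch of this formula with hypothetical weights and feature values (nothing below is learned from data; it only illustrates the weighted sum):

w = np.array([0.5, -1.2, 2.0])  # hypothetical coefficients w_1, ..., w_d
b = 0.3                         # hypothetical intercept
x = np.array([1.0, 0.0, 2.5])   # feature values of one example
y_hat = np.dot(w, x) + b        # 0.5*1.0 + (-1.2)*0.0 + 2.0*2.5 + 0.3 = 5.8
y_hat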

Example#

  • Suppose these are the coefficients learned by a linear regression model on a hypothetical housing price prediction dataset.

| Feature        | Learned coefficient |
|----------------|---------------------|
| Bedrooms       | 0.20                |
| Bathrooms      | 0.11                |
| Square Footage | 0.002               |
| Age            | -0.02               |

  • Now given a new example, the target will be predicted as follows:

| Bedrooms | Bathrooms | Square Footage | Age |
|----------|-----------|----------------|-----|
| 3        | 2         | 1875           | 66  |

\[\hat{y} = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + b\]
\[\text{predicted price}= 0.20 \times 3 + 0.11 \times 2 + 0.002 \times 1875 + (-0.02) \times 66 + b\]
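As a quick arithmetic check of the example above (the intercept \(b\) is not given in the table, so we show only the contribution of the features):

w = np.array([0.20, 0.11, 0.002, -0.02])  # Bedrooms, Bathrooms, Square Footage, Age
x = np.array([3, 2, 1875, 66])            # the new example
np.dot(w, x)                              # 3.25; the predicted price is 3.25 + b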

When we call fit, a coefficient or weight is learned for each feature which tells us the role of that feature in prediction. These coefficients are learned from the training data.

Important

In linear models for regression, the model is a line for a single feature, a plane for two features, and a hyperplane for higher dimensions. We are not yet ready to discuss how linear regression learns these coefficients and the intercept.

Ridge#

  • scikit-learn has a model called LinearRegression for linear regression.

  • But if we use this “vanilla” version of linear regression, it may result in large coefficients and unexpected results.

  • So instead of using LinearRegression, we will always use another linear model called Ridge, which is a linear regression model with a complexity hyperparameter alpha.

from sklearn.linear_model import LinearRegression  # DO NOT USE IT IN THIS COURSE
from sklearn.linear_model import Ridge  # USE THIS INSTEAD

Data#

Let’s use sklearn’s built-in California housing dataset. The task associated with this dataset is to predict the median house value of California block groups (from the 1990 census), using information such as median income, median house age, average number of rooms, and the location of the block group.

from sklearn.datasets import fetch_california_housing


california = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target, test_size=0.2
)
pd.DataFrame(X_train, columns=california.feature_names)
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 2.0000 52.0 5.303030 1.082251 725.0 3.138528 38.03 -121.88
1 2.0000 52.0 5.506410 1.134615 1026.0 3.288462 34.00 -118.30
2 4.0474 30.0 5.419355 1.006452 858.0 2.767742 37.31 -121.94
3 3.2794 7.0 5.546473 1.044166 5146.0 3.392221 34.46 -117.20
4 2.5551 35.0 4.018487 1.016807 1886.0 3.169748 37.35 -121.86
... ... ... ... ... ... ... ... ...
16507 3.0185 17.0 4.205479 0.863014 434.0 1.981735 34.61 -120.16
16508 12.6320 5.0 7.462963 0.888889 208.0 3.851852 34.44 -119.31
16509 3.9808 20.0 5.678689 1.006557 999.0 3.275410 38.28 -121.20
16510 5.8195 25.0 6.585513 0.961771 1645.0 3.309859 33.71 -117.97
16511 3.7315 20.0 7.368304 1.738839 1085.0 2.421875 37.58 -118.74

16512 rows × 8 columns

print(california.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

Ridge on the California housing dataset#

pipe = make_pipeline(StandardScaler(), Ridge())
scores = cross_validate(pipe, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.006541 0.000512 0.599492 0.608050
1 0.002695 0.000391 0.613507 0.604619
2 0.002454 0.000362 0.598100 0.608468
3 0.002432 0.000355 0.612621 0.604797
4 0.002944 0.000456 0.601790 0.606902

Hyperparameter alpha of Ridge#

  • Ridge has hyperparameters just like the rest of the models we learned.

  • The alpha hyperparameter is what makes Ridge different from vanilla LinearRegression.

  • Similar to the other hyperparameters that we saw, alpha controls the fundamental tradeoff.

Note

If we set alpha=0, that is the same as using LinearRegression.
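As a quick sanity check of this note (a sketch assuming the California housing X_train and y_train from above), the coefficients of Ridge(alpha=0) and LinearRegression should come out nearly identical:

print(Ridge(alpha=0).fit(X_train, y_train).coef_)          # Ridge with no regularization
print(LinearRegression().fit(X_train, y_train).coef_)      # ordinary least squares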

Let’s examine the effect of alpha on the fundamental tradeoff.

scores_dict = {
    "alpha": 10.0 ** np.arange(-3, 6, 1),
    "mean_train_scores": list(),
    "mean_cv_scores": list(),
}
for alpha in scores_dict["alpha"]:
    pipe_ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_validate(pipe_ridge, X_train, y_train, return_train_score=True)
    scores_dict["mean_train_scores"].append(scores["train_score"].mean())
    scores_dict["mean_cv_scores"].append(scores["test_score"].mean())

results_df = pd.DataFrame(scores_dict)
results_df
alpha mean_train_scores mean_cv_scores
0 0.001 0.606567 0.605101
1 0.010 0.606567 0.605101
2 0.100 0.606567 0.605101
3 1.000 0.606567 0.605102
4 10.000 0.606557 0.605100
5 100.000 0.605696 0.604320
6 1000.000 0.579432 0.578523
7 10000.000 0.431137 0.430868
8 100000.000 0.116167 0.115956

Here we do not really see overfitting but in general,

  • larger alpha \(\rightarrow\) likely to underfit

  • smaller alpha \(\rightarrow\) likely to overfit

Coefficients and intercept#

The model learns

  • coefficients associated with each feature

  • the intercept or bias

Let’s examine the coefficients learned by the model.

pipe_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_ridge.fit(X_train, y_train)
coeffs = pipe_ridge.named_steps["ridge"].coef_
pd.DataFrame(data=coeffs, index=california.feature_names, columns=["Coefficients"])
Coefficients
MedInc 0.827039
HouseAge 0.117058
AveRooms -0.265273
AveBedrms 0.307521
Population -0.003194
AveOccup -0.039498
Latitude -0.895963
Longitude -0.866110
  • The model also learns an intercept (bias).

  • For each prediction, we are adding this amount irrespective of the feature values.

pipe_ridge.named_steps["ridge"].intercept_
2.0667579112160865

Can we use this information to interpret model predictions?

❓❓ Questions for you#

(iClicker) Exercise 7.1#

iClicker cloud join link: https://join.iclicker.com/3DP5H

Select all of the following statements which are TRUE.

  • (A) Increasing the hyperparameter alpha of Ridge is likely to decrease model complexity.

  • (B) Ridge can be used with datasets that have multiple features.

  • (C) With Ridge, we learn one coefficient per training example.

  • (D) If you train a linear regression model on a 2-dimensional problem (2 features), the model will learn 3 parameters: one for each feature and one for the bias term.





Interpretation of coefficients#

  • One of the main advantages of linear models is that they are relatively easy to interpret.

  • We have one coefficient per feature which kind of describes the role of the feature in the prediction according to the model.

There are two pieces of information in the coefficients based on

  • Sign

  • Magnitude

Sign of the coefficients#

In the example below, for instance:

  • MedInc (median income) has a positive coefficient

    • the prediction increases as the feature value increases; as MedInc gets bigger, the predicted median house value gets bigger (holding the other features fixed)

  • AveRooms (Average number of rooms) has a negative coefficient

    • the prediction decreases as the feature value increases; as AveRooms gets bigger, the predicted median house value gets smaller (holding the other features fixed)

pd.DataFrame(data=coeffs, index=california.feature_names, columns=["Coefficients"])
Coefficients
MedInc 0.827039
HouseAge 0.117058
AveRooms -0.265273
AveBedrms 0.307521
Population -0.003194
AveOccup -0.039498
Latitude -0.895963
Longitude -0.866110

Magnitude of the coefficients#

  • Bigger magnitude \(\rightarrow\) bigger impact on the prediction

  • In the example below, both MedInc and AveBedrms have a positive impact on the prediction, but MedInc would have a bigger positive impact because its feature value is going to be multiplied by a coefficient with a bigger magnitude.

data = {
    "coefficient": pipe_ridge.named_steps["ridge"].coef_.tolist(),
    "magnitude": np.absolute(pipe_ridge.named_steps["ridge"].coef_.tolist()),
}
coef_df = pd.DataFrame(data, index=california.feature_names).sort_values(
    "magnitude", ascending=False
)
coef_df
coefficient magnitude
Latitude -0.895963 0.895963
Longitude -0.866110 0.866110
MedInc 0.827039 0.827039
AveBedrms 0.307521 0.307521
AveRooms -0.265273 0.265273
HouseAge 0.117058 0.117058
AveOccup -0.039498 0.039498
Population -0.003194 0.003194

Importance of scaling#

  • When you are interpreting the model coefficients, scaling is crucial.

  • If you do not scale the data, features on a smaller scale will tend to get coefficients with a bigger magnitude, whereas features on a bigger scale will tend to get coefficients with a smaller magnitude, regardless of their actual importance (see the sketch after this list).

  • That said, when you scale the data, feature values become hard to interpret for humans!
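Here is a minimal sketch of this effect (assuming X_train, y_train, and california from above): the same Ridge model is fit with and without scaling, and the resulting coefficients are put side by side.

unscaled_ridge = Ridge(alpha=1.0).fit(X_train, y_train)
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)
pd.DataFrame(
    {
        "unscaled coefficients": unscaled_ridge.coef_,
        "scaled coefficients": scaled_ridge.named_steps["ridge"].coef_,
    },
    index=california.feature_names,
)

Notice that the magnitudes of the unscaled coefficients reflect the scales of the features as much as their importance.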

Important

Take these coefficients with a grain of salt. They might not always match your intuitions. Also, they do not tell us about how the world works. They only tell us about how the prediction of your model works.



❓❓ Questions for you#

  • Discuss the importance of scaling when interpreting linear regression coefficients.

  • What might be the meaning of a complex vs. a simpler model in the case of linear regression?





Logistic regression [video]#

Logistic regression intuition#

  • A linear model for classification.

  • Similar to linear regression, it learns weights associated with each feature and the bias.

  • It applies a threshold on the raw output to decide whether the class is positive or negative.

  • In this lecture we will focus on the following aspects of logistic regression.

    • predict, predict_proba

    • how to use learned coefficients to interpret the model

Motivating example#

  • Consider the problem of predicting sentiment expressed in movie reviews.

Training data for the motivating example#

Review 1: This movie was excellent! The performances were oscar-worthy! 👍

Review 2: What a boring movie! I almost fell asleep twice while watching it. 👎

Review 3: I enjoyed the movie. Excellent! 👍

  • Targets: positive 👍 and negative 👎

  • Features: words (e.g., excellent, flawless, boring)

Learned coefficients associated with all features#

  • Suppose our vocabulary contains only the following 7 words.

  • A linear classifier learns weights or coefficients associated with the features (words in this example).

  • Let’s ignore bias for a bit.

Predicting with learned weights#

  • Use these learned coefficients to make predictions. For example, consider the following review \(x_i\).

It got a bit boring at times but the direction was excellent and the acting was flawless.

  • Feature vector for \(x_i\): [1, 0, 1, 1, 0, 0, 0]

  • \(score(x_i) = \text{coefficient(boring)} \times 1 + \text{coefficient(excellent)} \times 1 + \text{coefficient(flawless)} \times 1 = -1.40 + 1.93 + 1.43 = 1.96\)

  • \(1.96 > 0\) so predict the review as positive 👍.

x = ["boring=1", "excellent=1", "flawless=1"]
w = [-1.40, 1.93, 1.43]
display(plot_logistic_regression(x, w))
Weighted sum of the input features = 1.960 y_hat = pos
../_images/b878a23f996d3fae70cbe370cb3e05f77386c82c6265a686b51e325329055741.svg
  • So the prediction is based on the weighted sum of the input features.

  • Some features are pulling the prediction towards positive sentiment and some are pulling it towards negative sentiment.

  • If the coefficient of boring had a bigger magnitude or excellent and flawless had smaller magnitudes, we would have predicted “neg”.

def f(w_0):
    x = ["boring=1", "excellent=1", "flawless=1"]
    w = [-1.40, 1.93, 1.43]
    w[0] = w_0
    print(w)
    display(plot_logistic_regression(x, w))
interactive(
    f,
    w_0=widgets.FloatSlider(min=-6, max=2, step=0.5, value=-1.40),
)

In our case, if the coefficient of boring were smaller than -3.36, the prediction would be negative.

A linear model learns these coefficients or weights from the training data!

So a linear classifier is a linear function of the input X, followed by a threshold.

\[\begin{split} z &= w_1x_1 + \dots + w_dx_d + b\\ &= w^Tx + b \end{split}\]
\[\hat{y} = \begin{cases} 1, & \text{if } z \geq r\\ -1, & \text{if } z < r \end{cases}\]

Components of a linear classifier#

  1. input features (\(x_1, \dots, x_d\))

  2. coefficients (weights) (\(w_1, \dots, w_d\))

  3. bias (\(b\) or \(w_0\)) (can be used to offset your hyperplane)

  4. threshold (\(r\))

In our example before, we assumed \(r=0\) and \(b=0\).
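As a minimal sketch (a hypothetical helper, not part of scikit-learn), these components fit together as follows:

def predict_linear_classifier(x, w, b=0.0, r=0.0):
    z = np.dot(w, x) + b        # raw score: weighted sum of the features plus bias
    return 1 if z >= r else -1  # hard prediction after applying the threshold r

predict_linear_classifier(x=[1, 1, 1], w=[-1.40, 1.93, 1.43])  # 1.96 >= 0, so predict 1 (positive)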

Logistic regression on the cities data#

cities_df = pd.read_csv("data/canada_usa_cities.csv")
train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
X_train, y_train = train_df.drop(columns=["country"]).values, train_df["country"].values
X_test, y_test = test_df.drop(columns=["country"]).values, test_df["country"].values

cols = train_df.drop(columns=["country"]).columns
train_df.head()
longitude latitude country
160 -76.4813 44.2307 Canada
127 -81.2496 42.9837 Canada
169 -66.0580 45.2788 Canada
188 -73.2533 45.3057 Canada
187 -67.9245 47.1652 Canada

Let’s first try DummyClassifier on the cities data.

dummy = DummyClassifier()
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.000420 0.000757 0.588235 0.601504
1 0.000168 0.000261 0.588235 0.601504
2 0.000160 0.000255 0.606061 0.597015
3 0.000113 0.000199 0.606061 0.597015
4 0.000109 0.000189 0.606061 0.597015

Now let’s try LogisticRegression

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
scores = cross_validate(lr, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.005852 0.000337 0.852941 0.827068
1 0.002138 0.000257 0.823529 0.827068
2 0.001907 0.000257 0.696970 0.858209
3 0.001977 0.000259 0.787879 0.843284
4 0.002188 0.000259 0.939394 0.805970

Logistic regression seems to be doing better than the dummy classifier. But note that there is a lot of variation in the cross-validation scores.

Accessing learned parameters#

  • Recall that logistic regression learns the weights \(w\) and bias or intercept \(b\).

  • How to access these weights?

    • Similar to Ridge, we can access the weights and intercept using coef_ and intercept_ attribute of the LogisticRegression object, respectively.

lr = LogisticRegression()
lr.fit(X_train, y_train)
print("Model weights: %s" % (lr.coef_))  # these are the learned weights
print("Model intercept: %s" % (lr.intercept_))  # this is the bias term
data = {"features": cols, "coefficients": lr.coef_[0]}
pd.DataFrame(data)
Model weights: [[-0.04108149 -0.33683126]]
Model intercept: [10.8869838]
features coefficients
0 longitude -0.041081
1 latitude -0.336831
  • Both weights are negative.

  • The weight of latitude is larger in magnitude.

  • This makes sense because Canada lies to the north of the USA, so we expect latitude values to contribute more to a prediction than longitude.

Prediction with learned parameters#

Let’s predict target of a test example.

example = X_test[0, :]
example
array([-64.8001,  46.098 ])

Raw scores#

  • Calculate the raw score as: z = np.dot(w, x) + b

(
    np.dot(
        example,
        lr.coef_.reshape(
            2,
        ),
    )
    + lr.intercept_
)
array([-1.97817876])
  • Apply the threshold to the raw score.

  • Since the raw score is < 0, predict the “negative” class.

  • What is a “negative” class in our context?

  • With logistic regression, scikit-learn designates one of the classes as the positive class and the other as the negative class.

    • It sorts the target labels (alphabetically, for string labels) and treats the first one as the negative class and the second one as the positive class.

  • The classes_ attribute tells us which class is considered negative and which one positive.

    • In this case, Canada is the negative class and USA is the positive class.

lr.classes_
array(['Canada', 'USA'], dtype=object)
  • So based on the negative score above (-1.978), we would predict Canada.

  • Let’s check the prediction given by the model.

lr.predict([example])
array(['Canada'], dtype=object)

Great! The predictions match! We exactly know how the model is making predictions.

Decision boundary of logistic regression#

  • The decision boundary of logistic regression is a hyperplane dividing the feature space in half.

lr = LogisticRegression()
lr.fit(X_train, y_train)
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
mglearn.plots.plot_2d_separator(lr, X_train, fill=False, eps=0.5, alpha=0.7)
plt.title(lr.__class__.__name__)
plt.xlabel("longitude")
plt.ylabel("latitude");
../_images/35bbcf792010804231118bb2b8d24698d2e1eb9d546edfbc98419d1b443dc35c.png
  • For \(d=2\), the decision boundary is a line (1-dimensional)

  • For \(d=3\), the decision boundary is a plane (2-dimensional)

  • For \(d\gt 3\), the decision boundary is a \(d-1\)-dimensional hyperplane

fig, axes = plt.subplots(1, 3, figsize=(20, 5))
for model, ax in zip(
    [KNeighborsClassifier(), SVC(gamma=0.01), LogisticRegression()], axes
):
    clf = model.fit(X_train, y_train)
    mglearn.plots.plot_2d_separator(
        clf, X_train, fill=True, eps=0.5, ax=ax, alpha=0.4
    )
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, ax=ax)
    ax.set_title(clf.__class__.__name__)
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
axes[0].legend();
../_images/00e38e4be59b8213d67bc33dc498a160a3256f8ab202a96977da3c2dfc7d820a.png
  • Notice a linear decision boundary (a line in our case).

  • Compare it with KNN or SVM RBF decision boundaries.

Main hyperparameter of logistic regression#

  • C is the main hyperparameter which controls the fundamental trade-off.

  • We won’t really talk about the interpretation of this hyperparameter right now.

  • At a high level, the interpretation is similar to C of SVM RBF

    • smaller C \(\rightarrow\) might lead to underfitting

    • bigger C \(\rightarrow\) might lead to overfitting

scores_dict = {
    "C": 10.0 ** np.arange(-4, 6, 1),
    "mean_train_scores": list(),
    "mean_cv_scores": list(),
}
for C in scores_dict["C"]:
    lr = LogisticRegression(C=C)
    scores = cross_validate(lr, X_train, y_train, return_train_score=True)
    scores_dict["mean_train_scores"].append(scores["train_score"].mean())
    scores_dict["mean_cv_scores"].append(scores["test_score"].mean())

results_df = pd.DataFrame(scores_dict)
results_df
C mean_train_scores mean_cv_scores
0 0.0001 0.664707 0.658645
1 0.0010 0.784424 0.790731
2 0.0100 0.827842 0.826203
3 0.1000 0.832320 0.820143
4 1.0000 0.832320 0.820143
5 10.0000 0.832320 0.820143
6 100.0000 0.832320 0.820143
7 1000.0000 0.832320 0.820143
8 10000.0000 0.832320 0.820143
9 100000.0000 0.832320 0.820143





Predicting probability scores [video]#

predict_proba#

  • So far in the context of classification problems, we focused on getting “hard” predictions.

  • Very often it’s useful to know “soft” predictions, i.e., how confident the model is with a given prediction.

  • For most of the scikit-learn classification models we can access this confidence score or probability score using a method called predict_proba.

Let’s look at probability scores of logistic regression model for our test example.

example
array([-64.8001,  46.098 ])
lr = LogisticRegression(random_state=123)
lr.fit(X_train, y_train)
lr.predict([example])  # hard prediction
array(['Canada'], dtype=object)
lr.predict_proba([example])  # soft prediction
array([[0.87848688, 0.12151312]])
  • The output of predict_proba is the probability of each class.

  • In binary classification, we get probabilities associated with both classes (even though this information is redundant).

  • The first entry is the estimated probability of the first class and the second entry is the estimated probability of the second class from model.classes_.

lr.classes_
array(['Canada', 'USA'], dtype=object)
  • Because these are probabilities, the entries for the two classes always sum to 1.

  • Since the probabilities for the two classes sum to 1, the class with the higher probability (>= 0.5) is going to be our predicted class.
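As a quick check (assuming the fitted lr model and the cities X_train from above), picking the class with the larger probability reproduces the hard predictions of predict:

probs = lr.predict_proba(X_train[:5])
print(lr.classes_[np.argmax(probs, axis=1)])  # class with the higher probability score
print(lr.predict(X_train[:5]))                # the same hard predictions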

How does logistic regression calculate these probabilities?#

  • The weighted sum \(w_1x_1 + \dots + w_dx_d + b\) gives us “raw model output”.

  • For linear regression this would have been the prediction.

  • For logistic regression, you check the sign of this value.

    • If positive (or 0), predict \(+1\); if negative, predict \(-1\).

    • These are “hard predictions”.

  • You can also have “soft predictions”, aka predicted probabilities.

    • To convert the raw model output into probabilities, instead of taking the sign, we apply the sigmoid.

The sigmoid function#

  • The sigmoid function “squashes” the raw model output from any number into the range \([0,1]\) using the following formula, where \(x\) is the raw model output:

\[\frac{1}{1+e^{-x}}\]

  • Then we can interpret the output as probabilities.

sigmoid = lambda x: 1 / (1 + np.exp(-x))
raw_model_output = np.linspace(-8, 8, 1000)
plt.plot(raw_model_output, sigmoid(raw_model_output))
plt.plot([0, 0], [0, 0.5], "--k")
plt.plot([-8, 0], [0.5, 0.5], "--k")
plt.xlabel("raw model output, $w^Tx$")
plt.ylabel("predicted probability")
plt.title("the sigmoid function");
../_images/cc98c2d207d0f64c79d9ad7afa6414b0d876554e39b7ecf7480f79392121da38.png
  • Recall our hard predictions that check the sign of \(w^Tx\), or, in other words, whether or not it is \(\geq 0\).

    • The threshold \(w^Tx=0\) corresponds to \(p=0.5\).

    • In other words, if our predicted probability is \(\geq 0.5\) then our hard prediction is \(+1\).

Let’s get the probability score by calling sigmoid on the raw model output for our test example.

sigmoid(
    np.dot(
        example,
        lr.coef_.reshape(
            2,
        ),
    )
    + lr.intercept_
)
array([0.12151312])

This is the probability score of the positive class, which is USA.

lr.predict_proba([example])
array([[0.87848688, 0.12151312]])

With predict_proba, we get the same probability score for USA!!

  • Let’s visualize probability scores for some examples.

data_dict = {
    "y": y_train[:12],
    "y_hat": lr.predict(X_train[:12]).tolist(),
    "probabilities": lr.predict_proba(X_train[:12]).tolist(),
}
pd.DataFrame(data_dict)
y y_hat probabilities
0 Canada Canada [0.7046068097086478, 0.2953931902913523]
1 Canada Canada [0.5630169062040135, 0.43698309379598654]
2 Canada Canada [0.8389680973255862, 0.16103190267441386]
3 Canada Canada [0.7964150775404333, 0.20358492245956678]
4 Canada Canada [0.901080665234097, 0.09891933476590302]
5 Canada Canada [0.7753006388010788, 0.22469936119892117]
6 USA USA [0.030740704606528224, 0.9692592953934718]
7 Canada Canada [0.6880304799160926, 0.3119695200839075]
8 Canada Canada [0.7891358587234142, 0.21086414127658581]
9 USA USA [0.006546969753885579, 0.9934530302461144]
10 USA USA [0.2787419584843105, 0.7212580415156895]
11 Canada Canada [0.8388877146644935, 0.16111228533550653]

The actual y and the predicted y_hat match in these cases, but the model is more confident about some predictions than others.

Least confident cases#

Let’s examine some cases where the model is least confident about the prediction.

least_confident_X = X_train[[127, 141]]
least_confident_X
array([[ -79.7599,   43.6858],
       [-123.078 ,   48.9854]])
least_confident_y = y_train[[127, 141]]
least_confident_y
array(['Canada', 'USA'], dtype=object)
probs = lr.predict_proba(least_confident_X)

data_dict = {
    "y": least_confident_y,
    "y_hat": lr.predict(least_confident_X).tolist(),
    "probability score (Canada)": probs[:, 0],
    "probability score (USA)": probs[:, 1],
}
pd.DataFrame(data_dict)
y y_hat probability score (Canada) probability score (USA)
0 Canada Canada 0.634392 0.365608
1 USA Canada 0.635666 0.364334
mglearn.discrete_scatter(
    least_confident_X[:, 0],
    least_confident_X[:, 1],
    least_confident_y,
    markers="o",
)
mglearn.plots.plot_2d_separator(lr, X_train, fill=True, eps=0.5, alpha=0.5)
../_images/4829f9b483b406594454a41095c72e3070190d74dd74121428e904e0debc6744.png

The points are close to the decision boundary which makes sense.

Most confident cases#

Let’s examine some cases where the model is most confident about the prediction.

most_confident_X = X_train[[37, 4]]
most_confident_X
array([[-110.9748,   32.2229],
       [ -67.9245,   47.1652]])
most_confident_y = y_train[[37, 4]]
most_confident_y
array(['USA', 'Canada'], dtype=object)
probs = lr.predict_proba(most_confident_X)

data_dict = {
    "y": most_confident_y,
    "y_hat": lr.predict(most_confident_X).tolist(),
    "probability score (Canada)": probs[:, 0],
    "probability score (USA)": probs[:, 1],
}
pd.DataFrame(data_dict)
y y_hat probability score (Canada) probability score (USA)
0 USA USA 0.010028 0.989972
1 Canada Canada 0.901081 0.098919
mglearn.discrete_scatter(
    most_confident_X[:, 0],
    most_confident_X[:, 1],
    most_confident_y,
    markers="o",
)
mglearn.plots.plot_2d_separator(lr, X_train, fill=True, eps=0.5, alpha=0.5)
../_images/1039cd3899943e1454c5cc4b6a753e7b66d97a09dd68fdf3f4cdd16591560273.png

The points are far away from the decision boundary which makes sense.

Over confident cases#

Let’s examine some cases where the model is confident about the prediction but the prediction is wrong.

over_confident_X = X_train[[25, 98]]
over_confident_X
array([[-129.9912,   55.9383],
       [-134.4197,   58.3019]])
over_confident_y = y_train[[25, 98]]
over_confident_y
array(['Canada', 'USA'], dtype=object)
probs = lr.predict_proba(over_confident_X)

data_dict = {
    "y": over_confident_y,
    "y_hat": lr.predict(over_confident_X).tolist(),
    "probability score (Canada)": probs[:, 0],
    "probability score (USA)": probs[:, 1],
}
pd.DataFrame(data_dict)
y y_hat probability score (Canada) probability score (USA)
0 Canada Canada 0.931792 0.068208
1 USA Canada 0.961902 0.038098
mglearn.discrete_scatter(
    over_confident_X[:, 0],
    over_confident_X[:, 1],
    over_confident_y,
    markers="o",
)
mglearn.plots.plot_2d_separator(lr, X_train, fill=True, eps=0.5, alpha=0.5)
../_images/43c6525d5e1a6cf939d02b5f6313c753b6467f01e30e32dcf1bbd5b9c4109f7d.png
  • The cities are far away from the decision boundary. So the model is pretty confident about the prediction.

  • But these cities are likely in Alaska, and our linear model is not able to capture that this region belongs to the USA and not Canada.

Below we are using colour to represent prediction probabilities. For points closer to the decision boundary, the model is less confident, whereas it is more confident about the mainland cities, which makes sense.

fig, axes = plt.subplots(1, 2, figsize=(18, 5))
from matplotlib.colors import ListedColormap

for ax in axes:
    mglearn.discrete_scatter(
        X_train[:, 0], X_train[:, 1], y_train, markers="o", ax=ax
    )
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")

axes[0].legend(["Train class 0", "Train class 1"], ncol=2, loc=(0.1, 1.1))

mglearn.plots.plot_2d_separator(
    lr, X_train, fill=True, eps=0.5, ax=axes[0], alpha=0.5
)
mglearn.plots.plot_2d_separator(
    lr, X_train, fill=False, eps=0.5, ax=axes[1], alpha=0.5
)
scores_image = mglearn.tools.plot_2d_scores(
    lr, X_train, eps=0.5, ax=axes[1], alpha=0.5, cm=plt.cm.coolwarm
)
cbar = plt.colorbar(scores_image, ax=axes.tolist())
../_images/430780408cbbcaf7eda3ac332c13c6079b2a13e59aea19446aad6c1afaae8494.png

Sometimes a complex, overfitted model tends to make more confident predictions, even when they are wrong, whereas a simpler model tends to make predictions with more uncertainty.

To summarize,

  • With hard predictions, we only know the class.

  • With probability scores we know how confident the model is with certain predictions, which can be useful in understanding the model better.

❓❓ Questions for you#

(iClicker) Exercise 7.2#

iClicker cloud join link: https://join.iclicker.com/3DP5H

Select all of the following statements which are TRUE.

  • (A) Increasing logistic regression’s C hyperparameter increases model complexity.

  • (B) The raw output score can be used to calculate the probability score for a given prediction.

  • (C) For a linear classifier trained on \(d\) features, the decision boundary is a \(d-1\)-dimensional hyperplane.

  • (D) A linear model is likely to be uncertain about the data points close to the decision boundary.





Linear SVM#

  • We have seen non-linear SVM with RBF kernel before. This is the default SVC model in sklearn because it tends to work better in many cases.

  • There is also a linear SVM. You can pass kernel="linear" to create a linear SVM.

cities_df = pd.read_csv("data/canada_usa_cities.csv")
train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
X_train, y_train = train_df.drop(columns=["country"]).values, train_df["country"].values
X_test, y_test = test_df.drop(columns=["country"]).values, test_df["country"].values
fig, axes = plt.subplots(1, 2, figsize=(18, 5))
from matplotlib.colors import ListedColormap

for (model, ax) in zip([SVC(gamma=0.01), SVC(kernel="linear")], axes):
    mglearn.discrete_scatter(
        X_train[:, 0], X_train[:, 1], y_train, markers="o", ax=ax
    )
    model.fit(X_train, y_train)
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    mglearn.plots.plot_2d_separator(
        model, X_train, fill=True, eps=0.5, ax=ax, alpha=0.5
    )

axes[0].set_title("SVM RBF")
axes[1].set_title("Linear SVM");
../_images/5f2e1b1163061938091583724868da78d44f43a933b55ec83239f681159f485f.png
  • The predict method of a linear SVM works the same way as that of logistic regression.

  • We can access the learned coef_ associated with the features and the intercept_ of a linear SVM model.

linear_svc = SVC(kernel="linear")
linear_svc.fit(X_train, y_train)
print("Model weights: %s" % (linear_svc.coef_))
print("Model intercept: %s" % (linear_svc.intercept_))
Model weights: [[-0.0195598  -0.23640124]]
Model intercept: [8.22811601]
lr = LogisticRegression()
lr.fit(X_train, y_train)
print("Model weights: %s" % (lr.coef_))
print("Model intercept: %s" % (lr.intercept_))
Model weights: [[-0.04108149 -0.33683126]]
Model intercept: [10.8869838]
  • Note that the coefficients and intercept are slightly different for logistic regression.

  • This is because linear SVM and logistic regression are fit by optimizing different objectives.









Summary of linear models#

  • Linear regression is a linear model for regression whereas logistic regression is a linear model for classification.

  • Both these models learn one coefficient per feature, plus an intercept.

Main hyperparameters#

  • The main hyperparameter is the “regularization” hyperparameter controlling the fundamental tradeoff.

    • Logistic Regression: C

    • Linear SVM: C

    • Ridge: alpha

Interpretation of coefficients in linear models#

  • the \(j\)th coefficient tells us how feature \(j\) affects the prediction

  • if \(w_j > 0\) then increasing \(x_{ij}\) moves us toward predicting \(+1\)

  • if \(w_j < 0\) then increasing \(x_{ij}\) moves us toward predicting \(-1\)

  • if \(w_j = 0\) then the feature is not used in making a prediction

Strengths of linear models#

  • Fast to train and predict

  • Scale to large datasets and work well with sparse data

  • Relatively easy to understand and interpret the predictions

  • Perform well when there is a large number of features

Limitations of linear models#

  • Is your data “linearly separable”? Can you draw a hyperplane between these datapoints that separates them with 0 error?

    • If the training examples can be separated by a linear decision rule, they are linearly separable.

A few questions you might be thinking about

  • How often is real-life data linearly separable?

  • Is the following XOR function linearly separable?

| \(x_1\) | \(x_2\) | target |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 1      |
| 1       | 0       | 1      |
| 1       | 1       | 0      |

  • Are linear classifiers very limiting because of this?
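As a quick check of the XOR question above (a minimal sketch using the four points from the table), a linear classifier cannot reach 100% training accuracy here because no single line can put (0, 1) and (1, 0) on one side and (0, 0) and (1, 1) on the other:

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
lr_xor = LogisticRegression().fit(X_xor, y_xor)
lr_xor.score(X_xor, y_xor)  # stays below 1.0: a line can classify at most 3 of the 4 points correctly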