
Tutorial 1
UBC 2024-25
Outline
During this tutorial, you will see another example of classification with decision trees and will take a closer look at decision boundaries.
Imports
import os
import re
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
sys.path.append(os.path.join(os.path.abspath(".."), "code"))
import graphviz
import IPython
import mglearn
from IPython.display import HTML, display
from plotting_functions import *
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from utils import *
plt.rcParams["font.size"] = 16
pd.set_option("display.max_colwidth", 200)
%matplotlib inline
DATA_DIR = '../data/'

# Custom function to plot decision boundary and tree
def plot_tree_decision_boundary_and_tree(
    model, X, y, height=6, width=16, fontsize=9, x_label="x-axis", y_label="y-axis", eps=None
):
    fig, ax = plt.subplots(
        1,
        2,
        figsize=(width, height),
        subplot_kw={"xticks": (), "yticks": ()},
        gridspec_kw={"width_ratios": [1.5, 2]},
    )
    plot_tree_decision_boundary(model, X, y, x_label, y_label, eps, ax=ax[0])
    custom_plot_tree(
        model,
        feature_names=X.columns.tolist(),
        class_names=['Canada', 'US'],
        impurity=False,
        fontsize=fontsize,
        ax=ax[1],
    )
    ax[1].set_axis_off()
    plt.show()

Exercise: Predicting country using the longitude and latitude
Imagine that you are given the longitude and latitude of some border cities of the USA and Canada, along with the country each city belongs to. Using this training data, your task is to build a classification model that predicts whether a given longitude and latitude combination is in the USA or Canada.
### US Canada cities data
df = pd.read_csv(DATA_DIR + "canada_usa_cities.csv")
df
X = df[["longitude", "latitude"]]
y = df["country"]
mglearn.discrete_scatter(X.iloc[:, 0], X.iloc[:, 1], y)
plt.xlabel("longitude")
plt.ylabel("latitude");

Question 1
Given what you know about decision trees, do you think the tree will split the samples on latitude or on longitude first? And around what value?
Your answer here
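To reason about which split a tree would try first, it can help to see how a single split is scored. The following is a minimal sketch, not the tutorial's own code: it assumes the Gini criterion (the scikit-learn default) and uses eight made-up points rather than the cities data. The helper names `gini` and `best_split` are hypothetical and introduced only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(X, y):
    """Try every midpoint threshold on every feature and return
    (feature_index, threshold, weighted_gini) for the lowest impurity."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])                 # sorted distinct values
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
        for t in thresholds:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Made-up toy points: column 0 plays the role of longitude, column 1 of latitude
X_toy = np.array([
    [-120.0, 50.0], [-100.0, 49.0], [-80.0, 46.0], [-76.0, 47.0],
    [-118.0, 34.0], [-96.0, 40.0], [-74.0, 41.0], [-88.0, 42.0],
])
y_toy = np.array(["Canada"] * 4 + ["US"] * 4)

feature, threshold, impurity = best_split(X_toy, y_toy)
print(feature, threshold, impurity)

# A depth-1 scikit-learn tree fitted on the same toy points, for comparison
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)
print(stump.tree_.feature[0], stump.tree_.threshold[0])
```

In this toy set only the second column separates the two classes cleanly, so both the hand-written search and the stump choose it. Whether the same holds for the real cities data is what Question 1 asks you to guess.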
Real boundary between Canada and USA
In real life, we know where the boundary between the USA and Canada lies.

Can a learning algorithm infer this boundary based on the limited training examples given to us?
Question 2
Before moving to more advanced models, let’s create a baseline classifier.
Reminder: do you remember what the purpose of a baseline is? If not, ask a TA to help you understand why we use them.
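As a refresher on how a most-frequent baseline behaves, here is a small sketch on made-up data (not the cities dataset). The arrays `X_toy` and `y_toy` are assumptions introduced for illustration. The point is that the baseline ignores the features entirely, so its accuracy equals the share of the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up toy data: 10 samples, 2 features, 7 of class "A" and 3 of class "B"
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array(["A"] * 7 + ["B"] * 3)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)  # features are ignored; only label counts matter

baseline_accuracy = baseline.score(X_toy, y_toy)  # fraction of samples in the majority class
majority_class = baseline.predict(X_toy[:1])[0]   # one row in, one prediction out

print(baseline_accuracy, majority_class)
```

Any model worth keeping should beat this number on the same data.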
from sklearn.dummy import DummyClassifier # import the classifier
dummy_clf = DummyClassifier(strategy="most_frequent") # Create a classifier object
dummy_clf.fit( ); # Complete the code to fit the classifier

Now, let’s see how accurate the predictions of our dummy classifier are. This will be our baseline (any classifier that performs worse than this should certainly go in the trash bin!).
# Score the DummyClassifier

Do you know which class was picked as the majority class? You can see this easily by making a single prediction, since DummyClassifier always predicts the same class.
# Double brackets keep the first row as a one-row DataFrame, so we get exactly one prediction
dummy_clf.predict(X.iloc[[0]])

Question 3
Now that we have our baseline, let’s try a more suitable model. We will start with a simple decision tree of depth 1 (a decision stump). Run the cell below to fit the tree and see the decision boundary. Is it the same boundary you picked when answering Question 1?
model = DecisionTreeClassifier(max_depth=1)
model.fit(X.values, y)
plot_tree_decision_boundary_and_tree(
    model,
    X,
    y,
    height=6,
    width=16,
    fontsize=15,
    eps=10,
    x_label="longitude",
    y_label="latitude",
)

# Score this classifier to see if it performs better than the baseline

Question 4
Finally, play with the max_depth parameter of the decision tree and try a few different values. Do you see a relationship between depth and performance?
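If you would like to compare several depths at once rather than editing the cell by hand, a loop along these lines is one option. This is only a sketch under stated assumptions: the data is synthetic (uniform random points with a made-up wavy boundary), the names `X_toy`, `y_toy` and `train_accuracy` are introduced here for illustration, and the accuracy is measured on the training points themselves.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the cities data: 200 random (longitude, latitude) pairs
X_toy = rng.uniform(low=[-130.0, 25.0], high=[-60.0, 60.0], size=(200, 2))

# Made-up wavy "border": points north of it are labelled Canada, the rest US
border = 45.0 + 3.0 * np.sin(X_toy[:, 0] / 10.0)
y_toy = np.where(X_toy[:, 1] > border, "Canada", "US")

train_accuracy = {}
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_toy, y_toy)
    train_accuracy[depth] = tree.score(X_toy, y_toy)  # accuracy on the training points

for depth, acc in train_accuracy.items():
    print(f"max_depth={depth}: training accuracy = {acc:.3f}")
```

Keep in mind that a high score on the training points does not by itself tell you how the tree would do on cities it has never seen.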
model = DecisionTreeClassifier(max_depth=1) # Change depth here
model.fit(X.values, y)
plot_tree_decision_boundary_and_tree(
    model,
    X,
    y,
    height=6,
    width=16,
    fontsize=12,
    eps=10,
    x_label="longitude",
    y_label="latitude",
)

# Score the classifier to see which depth gives the best accuracy

Recap Questions
Terminology
Assign the correct definition to each element of the model:
Latitude:
Longitude:
Country (Canada/US):
Latitude <= 42.5 (in a tree node):
max_depth: