Tutorial 1

UBC 2024-25

Outline

During this tutorial, you will see another example of classification with decision trees and will take a closer look at decision boundaries.

Imports

import os
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

sys.path.append(os.path.join(os.path.abspath(".."), "code"))
import graphviz
import IPython
import mglearn
from IPython.display import HTML, display
from plotting_functions import *
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from utils import *

plt.rcParams["font.size"] = 16
pd.set_option("display.max_colwidth", 200)
%matplotlib inline

DATA_DIR = '../data/' 
# Custom function to plot decision boundary and tree

def plot_tree_decision_boundary_and_tree(
    model, X, y, height=6, width=16, fontsize = 9, x_label="x-axis", y_label="y-axis", eps=None
):
    fig, ax = plt.subplots(
        1,
        2,
        figsize=(width, height),
        subplot_kw={"xticks": (), "yticks": ()},
        gridspec_kw={"width_ratios": [1.5, 2]},
    )
    plot_tree_decision_boundary(model, X, y, x_label, y_label, eps, ax=ax[0])
    custom_plot_tree(model, 
                 feature_names=X.columns.tolist(), 
                 class_names=['Canada', 'US'],
                 impurity=False,
                 fontsize=fontsize, ax=ax[1])
    ax[1].set_axis_off()
    plt.show()



Exercise: Predicting country using the longitude and latitude

Imagine that you are given the longitude and latitude of some cities near the border between the USA and Canada, along with the country each city belongs to. Using this training data, your task is to build a classification model that predicts whether a given longitude and latitude combination is in the USA or in Canada.

### US Canada cities data
df = pd.read_csv(DATA_DIR + "canada_usa_cities.csv")
df
X = df[["longitude", "latitude"]]
y = df["country"]
mglearn.discrete_scatter(X.iloc[:, 0], X.iloc[:, 1], y)
plt.xlabel("longitude")
plt.ylabel("latitude");

Question 1

Given what you know about decision trees, do you think the tree will first try to separate the samples by latitude or by longitude? And around what value?

Your answer here

Real boundary between Canada and USA

In real life, we know where the boundary between the USA and Canada is.

Can a learning algorithm infer this boundary based on the limited training examples given to us?

Question 2

Before moving to more advanced models, let’s create a baseline classifier.

Reminder: do you remember what the purpose of a baseline is? If not, ask a TA to help you understand why we use them.

from sklearn.dummy import DummyClassifier # import the classifier

dummy_clf = DummyClassifier(strategy="most_frequent") # Create a classifier object

dummy_clf.fit( ); # Complete the code to fit the classifier

Now, let’s see how accurate the predictions of our dummy classifier are. This will be our baseline: any classifier that performs worse than this should certainly go in the trash bin!

# Score the DummyClassifier

Do you know what class was picked as the majority class? You can see this easily by making a single prediction, since DummyClassifier always predicts the same class.

# Double brackets keep the input as a one-row, two-column DataFrame (a single sample)
dummy_clf.predict(X.iloc[[0]])
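
If the idea of a baseline is still unclear, the following minimal sketch may help. It does not use the cities data: `X_toy` and `y_toy` are made-up arrays introduced only for illustration. It shows that a `most_frequent` dummy classifier ignores the features, so its accuracy is simply the share of the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data (not the cities dataset): 4 samples of class "A", 2 of class "B"
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array(["A", "A", "A", "A", "B", "B"])

toy_baseline = DummyClassifier(strategy="most_frequent")
toy_baseline.fit(X_toy, y_toy)

# score() returns accuracy; here it equals the majority-class share (4 out of 6)
baseline_accuracy = toy_baseline.score(X_toy, y_toy)

# Slicing with [:1] keeps a 2D array containing a single sample
single_prediction = toy_baseline.predict(X_toy[:1])

print(baseline_accuracy)
print(single_prediction)
```

The same two calls, `score` and `predict`, apply to the classifier you fit on the cities data.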

Question 3

Now that we have our baseline, let’s try a proper model. We will start with a simple decision tree of depth 1 (a decision stump). Run the cell below to fit the tree and see the decision boundary. Is it the same boundary you picked when answering Question 1?

model = DecisionTreeClassifier(max_depth=1)
model.fit(X.values, y)
plot_tree_decision_boundary_and_tree(
    model,
    X,
    y,
    height=6,
    width=16,
    fontsize=15,
    eps=10,
    x_label="longitude",
    y_label="latitude",
)
# Score this classifier to see if it performs better than the baseline
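
To check your reading of the plot, you can also inspect the fitted stump directly. The sketch below is an illustration on made-up data (`X_toy` and `y_toy` are hypothetical, not the cities data); it reads the feature index and threshold of the root split from the `tree_` attribute of a fitted `DecisionTreeClassifier`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column 0 does not separate the classes, column 1 does
X_toy = np.array(
    [[1.0, 10.0], [5.0, 12.0], [3.0, 11.0], [2.0, 30.0], [4.0, 31.0], [6.0, 32.0]]
)
y_toy = np.array(["low", "low", "low", "high", "high", "high"])

stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Node 0 is the root; feature holds the column index used for the split
root_feature = stump.tree_.feature[0]
# threshold is the cut-off: samples with value <= threshold go to the left child
root_threshold = stump.tree_.threshold[0]
stump_accuracy = stump.score(X_toy, y_toy)

print(f"split on column {root_feature} at {root_threshold}")
print(f"training accuracy: {stump_accuracy}")
```

On the cities model, the column index refers to the order of the columns in `X` (longitude first, then latitude).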

Question 4

Finally, play with the max_depth parameter of the decision tree and try a few different values. Do you see a relationship between depth and performance?

model = DecisionTreeClassifier(max_depth=1)  # Change depth here
model.fit(X.values, y)
plot_tree_decision_boundary_and_tree(
    model,
    X,
    y,
    height=6,
    width=16,
    fontsize=12,
    eps=10,
    x_label="longitude",
    y_label="latitude",
)
# Score the classifier to see which depth gives the best accuracy
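
Rather than editing `max_depth` by hand each time, you could loop over several depths. The sketch below is one possible way to do this, shown on synthetic data (the arrays, the noise level, and the list of depths are assumptions made for illustration). Note that it reports accuracy on the same data the tree was trained on.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data: the label depends on the second feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1.0, 1.0, size=(200, 2))
noise = rng.normal(scale=0.3, size=200)
y_toy = np.where(X_toy[:, 1] + noise > 0, "north", "south")

train_accuracy = {}
for depth in [1, 2, 4, 8, None]:  # None lets the tree grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_toy, y_toy)
    # Accuracy measured on the training data itself
    train_accuracy[depth] = tree.score(X_toy, y_toy)

for depth, acc in train_accuracy.items():
    print(f"max_depth={depth}: training accuracy = {acc:.3f}")
```

To apply the same loop to the cities data, replace `X_toy` and `y_toy` with `X.values` and `y`.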

Recap Questions

Terminology

Assign the correct definition to each element of the model:

  • Latitude:

  • Longitude:

  • Country (Canada/US):

  • Latitude <= 42.5 (in a tree node):

  • max_depth: