Lecture 1: Course Introduction#

UBC 2024-25

Imports#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sys
sys.path.append(os.path.join(os.path.abspath(".."), "code"))
from IPython.display import HTML, display
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

plt.rcParams["font.size"] = 16
pd.set_option("display.max_colwidth", 200)

DATA_DIR = '../data/' 

Learning outcomes#

From this lecture, you will be able to

  • Explain the motivation behind studying machine learning.

  • Briefly describe supervised learning.

  • Differentiate between traditional programming and machine learning.

  • Assess whether a given problem is suitable for a machine learning solution.



Characters in this course?#

  • CPSC 330 teaching team (instructors and the TAs)

  • Eva (a fictitious enthusiastic student)

  • And you all, of course 🙂!

Meet Eva (a fictitious persona)!#

Eva is one of you. She has some experience in Python programming. She knows machine learning as a buzzword. During her recent internship, she developed some interest and curiosity in the field. She wants to learn what it is and how to use it. She is a curious person and usually has a lot of questions!

Why machine learning (ML)? [video]#

See also

Check out the accompanying video on this material.

Prevalence of ML#

Let’s look at some examples.

Saving time and scaling products#

  • Imagine writing a program for spam identification, i.e., whether an email is spam or non-spam.

  • Traditional programming

    • Come up with rules using human understanding of spam messages.

    • It is time consuming and hard to come up with a robust set of rules.

  • Machine learning

    • Collect a large amount of spam and non-spam email data and let the machine learning algorithm figure out the rules.

  • With machine learning, you’re likely to

    • Save time

    • Customize and scale products
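
To make the contrast concrete, here is a minimal sketch; it is not the approach used later in this lecture. The toy messages, the keyword list, and the helper names `rule_based_is_spam`, `learn_word_scores`, and `learned_is_spam` are all made up for illustration. The first function hard-codes rules chosen by a human, while the second derives word scores from labelled examples.

```python
from collections import Counter

# Toy labelled messages, invented for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free cash claim your prize", "spam"),
    ("urgent claim your free reward", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("see you at the lecture tomorrow", "ham"),
    ("can you send me your notes", "ham"),
]


def rule_based_is_spam(message):
    """Traditional programming: a human picks the keywords."""
    keywords = {"free", "prize", "winner"}
    return any(word in keywords for word in message.lower().split())


def learn_word_scores(examples):
    """Toy 'learning': score each word from labelled data.

    score = (# spam messages containing the word) - (# ham messages containing it)
    """
    scores = Counter()
    for message, label in examples:
        for word in set(message.lower().split()):
            scores[word] += 1 if label == "spam" else -1
    return scores


def learned_is_spam(message, scores):
    """Predict spam when the summed word scores are positive."""
    total = sum(scores[word] for word in message.lower().split())
    return total > 0


scores = learn_word_scores(TRAIN)
new_message = "urgent claim your cash reward"
print(rule_based_is_spam(new_message))       # False: no hand-picked keyword matches
print(learned_is_spam(new_message, scores))  # True: these words appeared in spam examples
```

The hand-written rules miss the new message because none of its words were anticipated, while the learned scores pick up on words that appeared in the spam examples. With different toy data the outcome could differ; the point is only where the rules come from.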



Supervised machine learning#

Types of machine learning#

Here are some typical learning problems.

  • Supervised learning (Gmail spam filtering)

    • Training a model from input data and its corresponding targets to predict targets for new examples.

  • Unsupervised learning (Google News)

    • Training a model to find patterns in a dataset, typically an unlabeled dataset.

  • Reinforcement learning (AlphaGo)

    • A family of algorithms for finding suitable actions to take in a given situation in order to maximize a reward.

  • Recommendation systems (Amazon item recommendation system)

    • Predict the “rating” or “preference” a user would give to an item.

What is supervised machine learning (ML)?#

  • Training data comprises a set of observations (\(X\)) and their corresponding targets (\(y\)).

  • We wish to find a model function \(f\) that relates \(X\) to \(y\).

  • We use the model function to predict targets of new examples.
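
The three points above can be sketched in a few lines of code. This is only an illustration that assumes a 1-nearest-neighbour rule as the model function \(f\); the toy data and the names `fit` and `predict` are hypothetical, and this is not the model used in the examples of this lecture.

```python
# Toy training data, invented for illustration:
# each row of X is (number of exclamation marks, number of links) in a message.
X = [(0, 0), (1, 0), (5, 2), (7, 3)]
y = ["ham", "ham", "spam", "spam"]


def fit(X, y):
    """'Training' a 1-nearest-neighbour model just stores the examples."""
    return list(zip(X, y))


def predict(model, x_new):
    """Return the target of the stored observation closest to x_new."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, nearest_target = min(model, key=lambda pair: squared_distance(pair[0], x_new))
    return nearest_target


f = fit(X, y)              # find a model function f from X and y
print(predict(f, (6, 2)))  # spam: the closest training example is (5, 2)
print(predict(f, (0, 1)))  # ham: the closest training example is (0, 0)
```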

Example: Predict whether a message is spam or not#

Input features \(X\) and target \(y\)#

Note

Do not worry about the code and syntax for now.

Note

Download the SMS Spam Collection Dataset from here.

Training a supervised machine learning model with \(X\) and \(y\)#

sms_df = pd.read_csv(DATA_DIR + "spam.csv", encoding="latin-1")
sms_df = sms_df.drop(columns = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})
train_df, test_df = train_test_split(sms_df, test_size=0.10, random_state=42)
HTML(train_df.head().to_html(index=False))
X_train, y_train = train_df["sms"], train_df["target"]
X_test, y_test = test_df["sms"], test_df["target"]
clf = make_pipeline(CountVectorizer(max_features=5000), LogisticRegression(max_iter=5000))
clf.fit(X_train, y_train);
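
The pipeline above first turns each message into word counts and then fits a classifier on those counts. As a rough sketch of the first step only, the pure-Python snippet below builds a bag-of-words count table by hand. It merely approximates what `CountVectorizer` does (the real class has its own tokenization rules and returns a sparse matrix); the toy messages and the helper name `bag_of_words` are made up for illustration.

```python
from collections import Counter


def bag_of_words(messages):
    """Return (vocabulary, rows); each row holds per-word counts for one message."""
    tokenized = [message.lower().split() for message in messages]
    # The vocabulary is the sorted set of all words seen in the messages.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Toy messages, invented for illustration.
messages = ["free prize free entry", "call me later", "free call"]
vocabulary, rows = bag_of_words(messages)
print(vocabulary)  # ['call', 'entry', 'free', 'later', 'me', 'prize']
for row in rows:
    print(row)
```

Each message becomes a fixed-length row of numbers, which is the kind of input a classifier such as logistic regression can be trained on.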

Predicting on unseen data using the trained model#

pd.DataFrame(X_test[0:4])

Note

Do not worry about the code and syntax for now.

pred_dict = {
    "sms": X_test[0:4],
    "spam_predictions": clf.predict(X_test[0:4]),
}
pred_df = pd.DataFrame(pred_dict)
pred_df.style.set_properties(**{"text-align": "left"})

We have accurately predicted labels for the unseen text messages above!

Question: how many examples do you think were needed to get this result? In other words, how many samples are in train_df?

  • (A) < 10

  • (B) 10 - 100

  • (C) 100 - 1000

  • (D) 1000 - 10000

  • (E) 10000 - 100000

# Ready to find out?

len(train_df)



Examples#

Let’s look at some concrete examples of supervised machine learning.

Note

Do not worry about the code at this point. Just focus on the input and output in each example.

Example 1: Predicting whether a patient has a liver disease or not#

Input data#

Suppose we are interested in predicting whether a patient has liver disease. We are given tabular data for liver patients, shown below, with a number of input features and a special column called “Target”, which is the output we want to predict.

Note

Download the data from here.

df = pd.read_csv(DATA_DIR + "indian_liver_patient.csv")
df = df.drop(columns = ["Gender"])
df["Dataset"] = df["Dataset"].replace(1, "Disease")
df["Dataset"] = df["Dataset"].replace(2, "No Disease")
df.rename(columns={"Dataset": "Target"}, inplace=True)
train_df, test_df = train_test_split(df, test_size=4, random_state=42)
HTML(train_df.head().to_html(index=False))

Building a supervised machine learning model#

Let’s train a supervised machine learning model with the input and output above.

from lightgbm.sklearn import LGBMClassifier

X_train = train_df.drop(columns=["Target"])
y_train = train_df["Target"]
X_test = test_df.drop(columns=["Target"])
y_test = test_df["Target"]
model = LGBMClassifier(random_state=123, verbose=-1)
model.fit(X_train, y_train)

Model predictions on unseen data#

  • Given the features of the new patients below, we’ll use this model to predict whether these patients have liver disease.

HTML(X_test.reset_index(drop=True).to_html(index=False))
pred_df = pd.DataFrame({"Predicted_target": model.predict(X_test).tolist()})

df_concat = pd.concat([pred_df, X_test.reset_index(drop=True)], axis=1)
HTML(df_concat.to_html(index=False))



Example 2: Predicting the label of a given image#

Suppose you want to predict the label of a given image using supervised machine learning. We are using a pre-trained model here to predict labels of new unseen images.

Note

Assuming that you have successfully created the cpsc330 conda environment on your computer, you’ll have to install torchvision in that environment to run the following code. If you are unable to install torchvision on your laptop, please don’t worry; it is not crucial at this point.

conda activate cpsc330
conda install -c pytorch torchvision

import img_classify
from PIL import Image
import glob
import matplotlib.pyplot as plt
# Predict topn labels and their associated probabilities for unseen images
images = glob.glob(DATA_DIR + "test_images/*.*")
class_labels_file = DATA_DIR + 'imagenet_classes.txt'
%matplotlib inline 
for img_path in images:
    img = Image.open(img_path).convert('RGB')
    img.load()
    plt.imshow(img)
    plt.show();    
    df = img_classify.classify_image(img_path, class_labels_file)
    print(df.to_string(index=False))
    print("--------------------------------------------------------------")



Example 3: Predicting sentiment expressed in a movie review#

Suppose you are interested in predicting whether a given movie review is positive or negative. You can do it using supervised machine learning.

Note

Download the data from here.

Note: the textbook uses a very similar dataset for this example (https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews), but be aware that the columns have different names.

imdb_df = pd.read_csv("../data/imdb_master.csv", encoding="ISO-8859-1")
imdb_df = imdb_df[imdb_df["label"].str.startswith(("pos", "neg"))]
imdb_df.drop(["Unnamed: 0", "type", "file"], axis=1, inplace=True)
imdb_df.rename(columns={"label": "target"}, inplace=True)
train_df, test_df = train_test_split(imdb_df, test_size=0.10, random_state=123)
HTML(train_df.head().to_html(index=False))
# Build an ML model
X_train, y_train = train_df["review"], train_df["target"]
X_test, y_test = test_df["review"], test_df["target"]

clf = make_pipeline(CountVectorizer(max_features=5000), LogisticRegression(max_iter=5000))
clf.fit(X_train, y_train);
# Predict on unseen data using the built model
pred_dict = {
    "reviews": X_test[0:4],
    "sentiment_predictions": clf.predict(X_test[0:4]),
}
pred_df = pd.DataFrame(pred_dict)
pred_df.style.set_properties(**{"text-align": "left"})



Example 4: Predicting housing prices#

Suppose we want to predict housing prices given a number of attributes associated with houses.

Note

Download the data from here.

df = pd.read_csv( DATA_DIR + "kc_house_data.csv")
df = df.drop(columns = ["id", "date"])
df.rename(columns={"price": "target"}, inplace=True)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=4)
HTML(train_df.head().to_html(index=False))
# Build a regression model
from lightgbm.sklearn import LGBMRegressor

X_train, y_train = train_df.drop(columns= ["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns= ["target"]), test_df["target"]

model = LGBMRegressor()
#model = XGBRegressor()
model.fit(X_train, y_train);
# Predict on unseen examples using the built model
pred_df = pd.DataFrame(
    # {"Predicted target": model.predict(X_test[0:4]).tolist(), "Actual price": y_test[0:4].tolist()}
    {"Predicted_target": model.predict(X_test[0:4]).tolist()}
)
df_concat = pd.concat([pred_df, X_test[0:4].reset_index(drop=True)], axis=1)
HTML(df_concat.to_html(index=False))

To summarize, supervised machine learning can be used on a variety of problems and different kinds of data.



🤔 Eva’s questions#

At this point, Eva is wondering about many questions.

  • How exactly are we “learning” whether a message is spam or ham?

  • What do you mean by “learn without being explicitly programmed”? The code has to be somewhere …

  • Are we expected to get correct predictions for all possible messages? How does it predict the label for a message it has not seen before?

  • What if the model mis-labels an unseen example? For instance, what if the model incorrectly predicts a non-spam as a spam? What would be the consequences?

  • How do we measure the success or failure of spam identification?

  • If you want to use this model in the wild, how do you know how reliable it is?

  • Would it be useful to know how confident the model is about the predictions rather than just a yes or a no?

It’s great to think about these questions right now. But Eva has to be patient. By the end of this course, you’ll know the answers to many of these questions!

Machine learning workflow#

Supervised machine learning is quite flexible; it can be used on a variety of problems and different kinds of data. Here is a typical workflow of a supervised machine learning system.

We will build machine learning pipelines in this course, focusing on some of the steps above.



❓❓ Questions for you#

iClicker cloud join link: https://join.iclicker.com/VYFJ

Select all of the following statements which are True (iClicker)#

  • (A) Predicting spam is an example of machine learning.

  • (B) Predicting housing prices is not an example of machine learning.

  • (C) For problems such as spelling correction, translation, face recognition, spam identification, if you are a domain expert, it’s usually faster and more scalable to come up with a robust set of rules manually rather than building a machine learning model.

  • (D) If you are asked to write a program to find all prime numbers up to a limit, it is better to implement one of the algorithms for doing so rather than using machine learning.

  • (E) Google News is likely using machine learning to organize news.



Break (5 min)#

  • We will try to take a 5-minute break halfway through every class.



About this course#

Course website#

UBC-CS/cpsc330-2024W1 is the most important link. Please read everything on there!

CPSC 330 vs. 340#

Read UBC-CS/cpsc330-2024W1 which explains the difference between the two courses.

TLDR:

  • 340: how do ML models work?

  • 330: how do I use ML models?

  • CPSC 340 has many prerequisites.

  • CPSC 340 goes deeper but has a narrower scope.

  • I think CPSC 330 will be more useful if you just plan to apply basic ML.

Registration, waitlist and prerequisites#

Please go through the syllabus carefully before contacting me about these issues. Even then, I am very unlikely to be able to help with registration, waitlist or prerequisite issues.

Course format#

  • Lectures are T/Th at 3:30pm.

  • Often, there will be videos to watch before or during the lecture time. (Check the main course page to see if you are expected to watch videos before the class.)

  • Weekly tutorials follow an office-hour format, are run by the TAs, and are completely optional.

  • We’ll have two midterms and one final (dates).

Communications#

  • Our main forum for getting help will be Piazza, which you can access through Canvas.

  • Other forms of communication (Canvas, email…) will likely go unanswered.

  • Let’s all take 2 minutes to register (through the link or through the tab on Canvas)

  • You must have a @ubc.ca email associated with your account. Unrecognizable accounts (other emails) will be dropped without warning.

Grades#

  • The grading breakdown is here. This page also explains the policy on challenging grades and late tokens.

  • You have one week to raise a concern from the time that your grades were posted, by contacting the course coordinator.

First deliverables#

  • First homework assignment is due this coming Tuesday, September 10th, at 11:59pm. The assignment is available on GitHub.

  • You must do the first homework assignment on your own.

  • The Syllabus quiz is available on PrairieLearn and is due by Sept 19th, 11:59 pm.

Please read this entire document about asking for help. TLDR: Be nice.

Lecture and homework format: Jupyter notebooks#

  • This document is a Jupyter notebook, with file extension .ipynb.

  • Confusingly, “Jupyter Notebook” is also the name of the original application that opens .ipynb files, but it has since been largely replaced by JupyterLab.

    • Some things might not work with the Jupyter Notebook application.

    • The course setup/install instructions include JupyterLab.

  • Jupyter notebooks contain a mix of code, code output, markdown-formatted text (including LaTeX equations), and more.

    • When you open a Jupyter notebook in one of these apps, the document is “live”, meaning you can run the code.

    • For example:

1+1
x = [1,2,3]
x[0] = 9999
x
  • By default, Jupyter prints out the result of the last line of code, so you don’t need as many print statements.

  • In addition to the “live” notebooks, Jupyter notebooks can be statically rendered in the web browser, e.g. this.

    • This can be convenient for quick read-only access, without needing to launch the Jupyter notebook/lab application.

    • But you need to launch the app properly to interact with the notebooks.

Lecture style#

  • Lots of code snippets in Jupyter.

  • There will be some YouTube videos to watch before or during the lecture.

  • We will also try to work on some questions and exercises together during the class.

  • All materials will be posted on the course website. Lecture notes will be posted right before each class.

  • Lectures from previous semesters are available in the previous course repositories (replace 2024W1 with 2022W2, for example).

  • I cannot promise anything will stay the same from last year to this year, so watch out for differences.



Setting up your computer for the course#

Python requirements/resources#

We will primarily use Python in this course.

Here is the basic Python knowledge you’ll need for the course:

  • Basic Python programming

  • Numpy

  • Pandas

  • Basic matplotlib

  • Sparse matrices

Some of you will already know Python, others won’t. Homework 1 is all about Python.

We do not have time to teach all the Python we need but you can find some useful Python resources here.

Activity#

In this course, we will primarily be using Python, git, GitHub, Canvas, Gradescope, and Piazza. Let’s set up your computers for the course.

  • Follow the setup instructions here to create a course conda environment on your computer.

Checklist for you before next class#

  • [ ] Are you able to access the course Canvas shell?

  • [ ] Are you able to access Gradescope? (If not, refer to the Gradescope Student Guide.)

  • [ ] Are you able to access the course Piazza (through Canvas)?

  • [ ] Did you follow the setup instructions here to create a course conda environment on your computer?

  • [ ] Did you complete the Syllabus quiz on PrairieLearn? (Due date: Thursday, Sep 19th at 11:59pm)

  • [ ] Are you done with homework 1? (Due: Tuesday, Sep 10th at 11:59pm)



Summary#

  • Machine learning is increasingly being applied across various fields.

  • In supervised learning, we are given a set of observations (\(X\)) and their corresponding targets (\(y\)) and we wish to find a model function \(f\) that relates \(X\) to \(y\).

  • Machine learning is a different paradigm for problem solving. Very often it reduces the time you spend programming and helps you customize and scale your products.

  • Before applying machine learning to a problem, it’s always advisable to assess whether a given problem is suitable for a machine learning solution or not.
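
As a small illustration of the last point, finding all prime numbers up to a limit has an exact, well-known algorithm, so there is nothing to learn from data. The sketch below uses the Sieve of Eratosthenes; the function name `primes_up_to` is chosen here for illustration.

```python
def primes_up_to(limit):
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Mark every multiple of p, starting at p * p, as composite.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
        p += 1
    return [n for n, flag in enumerate(is_prime) if flag]


print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Every answer here is guaranteed correct by construction, which a model trained on examples could not promise.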