Appendix B: Multi-class meta-strategies#

  • So far we have been talking about binary classification

  • Can we use these classifiers when there are more than two classes?

  • Can we use decision trees or KNNs for multi-class classification?

  • What about logistic regression and Linear SVMs?

  • Many linear classification models don’t extend naturally to the multiclass case.

  • A common technique is to reduce a multiclass classification problem into several binary classification problems.

  • Two somewhat “hacky” ways to reduce multi-class classification to binary classification:

    • the one-vs.-rest approach

    • the one-vs.-one approach

One vs. Rest#

  • With three classes: 1 vs. {2, 3}, 2 vs. {1, 3}, 3 vs. {1, 2}

  • Learn a binary model for each class which tries to separate that class from all of the other classes.

  • If you have \(k\) classes, it’ll train \(k\) binary classifiers, one for each class.

  • Each binary classifier is trained on the full dataset, which is imbalanced: the examples of one class vs. all the rest.

  • Given a test point, get scores from all binary classifiers (e.g., raw scores for logistic regression).

  • The classifier with the highest score “wins”, and its class is the prediction for that test point.

  • Since we have one binary classifier per class, we get one coefficient per feature and one intercept for each class.
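
A minimal from-scratch sketch of this procedure (illustration only: ovr_predict is a hypothetical helper, and the equivalent scikit-learn functionality is demonstrated below):

import numpy as np
from sklearn.linear_model import LogisticRegression

def ovr_predict(X_train, y_train, X_test):
    # Train one binary classifier per class (class c vs. all the rest),
    # then predict the class whose classifier gives the highest raw score.
    classes = np.unique(y_train)
    scores = np.zeros((len(X_test), len(classes)))
    for idx, c in enumerate(classes):
        clf = LogisticRegression().fit(X_train, y_train == c)  # class c vs. rest
        scores[:, idx] = clf.decision_function(X_test)  # raw score for class c
    return classes[scores.argmax(axis=1)]  # highest score wins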

Note

Note that there is also multinomial logistic regression, also called the maxent classifier. This is different from the multi-class meta-strategies above. More on this in DSCI 573.
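
For a quick contrast (details deferred to DSCI 573): in recent scikit-learn versions, a bare LogisticRegression fits the multinomial (softmax) model by default on multiclass data, while the OVR meta-strategy is obtained with a wrapper:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

multinomial_lr = LogisticRegression()               # one softmax over all classes
ovr_lr = OneVsRestClassifier(LogisticRegression())  # k independent binary models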

Let’s create some synthetic data with two features and three classes.

import matplotlib.pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(centers=3, n_samples=120, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0", "Class 1", "Class 2"]);
[Figure: scatter plot of the training data, Feature 0 vs. Feature 1, colored by class]
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=2000, multi_class="ovr")
lr.fit(X_train, y_train)
print("Coefficient shape: ", lr.coef_.shape)
print("Intercept shape: ", lr.intercept_.shape)
Coefficient shape:  (3, 2)
Intercept shape:  (3,)
(On recent scikit-learn versions this emits a FutureWarning: multi_class was deprecated in version 1.5 and will be removed in 1.7; the recommended replacement is OneVsRestClassifier(LogisticRegression(..)).)
  • This learns three binary linear models.

  • So we have coefficients for two features for each of these three linear models.

  • We also have three intercepts, one for each class.

# Function definition in code/plotting_functions.py
plot_multiclass_lr_ovr(lr, X_train, y_train, 3)
[Figure: the three one-vs.-rest binary decision boundaries overlaid on the training data]
  • How would you classify the following points?

  • Pick the class with the highest value of the decision function (the raw score).

test_points = [[-4.0, 12], [-2, 0.0], [-8, 3.0], [4, 8.5], [0, -7]]            
plot_multiclass_lr_ovr(lr, X_train, y_train, 3, test_points)
[Figure: the same plot with the five test points marked]
plot_multiclass_lr_ovr(lr, X_train, y_train, 3, test_points, decision_boundary=True)
[Figure: the combined multi-class decision regions, showing where each test point falls]

Let’s calculate the raw scores for a test point.

test_points[4]
[0, -7]
lr.coef_
array([[-0.65125032,  1.05350962],
       [ 1.35375221, -0.2865025 ],
       [-0.63316788, -0.7250894 ]])
lr.intercept_ 
array([-5.42541681,  0.21616484, -2.47242662])
test_points[4] @ lr.coef_.T + lr.intercept_
array([-12.79998417,   2.22168237,   2.60319915])
lr.classes_
array([0, 1, 2])

Classes 1 and 2 have similar scores, which makes sense because the point is close to the boundary between them. The point falls in the green region because the Class 2 score is slightly higher.
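
Equivalently, taking the argmax of the raw scores reproduces the model's prediction for this point:

import numpy as np

raw_scores = test_points[4] @ lr.coef_.T + lr.intercept_
lr.classes_[np.argmax(raw_scores)]  # 2, same as lr.predict([test_points[4]])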

lr.predict_proba([test_points[4]])
array([[1.50596432e-06, 4.92120500e-01, 5.07877994e-01]])
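
These OVR probabilities come from passing each binary classifier's raw score through a sigmoid and then normalizing across classes so they sum to one. A minimal sketch that reproduces the predict_proba output above (assuming this is how scikit-learn computes OVR probabilities):

from scipy.special import expit  # the sigmoid function

raw_scores = test_points[4] @ lr.coef_.T + lr.intercept_
sigmoids = expit(raw_scores)  # each binary classifier's "probability"
sigmoids / sigmoids.sum()  # normalize across classes; matches predict_proba above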

One vs. One approach#

  • Build a binary model for each pair of classes.

  • With three classes: 1 vs. 2, 1 vs. 3, 2 vs. 3

  • With \(k\) classes, this trains \(k(k-1)/2\) binary classifiers.

  • Each binary classifier is trained on a relatively balanced subset: only the examples of the two classes involved.

One vs. One prediction#

  • Apply all of the classifiers to the test example.

  • Count how often each class was predicted.

  • Predict the class with the most votes, as sketched below.
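
A minimal from-scratch sketch of one-vs.-one training and voting (illustration only: ovo_predict is a hypothetical helper; the built-in OneVsOneClassifier wrapper is shown in the next section):

from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression

def ovo_predict(X_train, y_train, X_test):
    # Train one binary classifier per pair of classes, then predict the
    # class that collects the most votes across all pairwise classifiers.
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in combinations(range(len(classes)), 2):
        mask = (y_train == classes[i]) | (y_train == classes[j])
        clf = LogisticRegression().fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)
        votes[pred == classes[i], i] += 1  # a vote for class i
        votes[pred == classes[j], j] += 1  # a vote for class j
    return classes[votes.argmax(axis=1)]  # most votes wins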

Using OVR and OVO as wrappers#

  • You can use these strategies as meta-strategies for any binary classifiers.

  • When do we use OneVsRestClassifier and OneVsOneClassifier?

  • You are unlikely to need OneVsRestClassifier or OneVsOneClassifier, because most of the methods you'll use have native multi-class support.

  • However, it’s good to know in case you ever need to extend a binary classifier (perhaps one you’ve implemented on your own).

from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
# Let's examine the time taken by OneVsRestClassifier and OneVsOneClassifier

# generate blobs with fixed random generator
X_multi, y_multi = make_blobs(n_samples=1000, centers=20, random_state=300)

X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi
)

mglearn.discrete_scatter(X_train_multi[:, 0], X_train_multi[:, 1], y_train_multi, s=6);
[Figure: scatter plot of the 20-class blob training data]
model = OneVsOneClassifier(LogisticRegression())
%timeit model.fit(X_train_multi, y_train_multi);
print("With OVO wrapper")
print(model.score(X_train_multi, y_train_multi))
print(model.score(X_test_multi, y_test_multi))
148 ms ± 7.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
With OVO wrapper
0.792
0.792
model = OneVsRestClassifier(LogisticRegression())
%timeit model.fit(X_train_multi, y_train_multi);
print("With OVR wrapper")
print(model.score(X_train_multi, y_train_multi))
print(model.score(X_test_multi, y_test_multi))
22 ms ± 490 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
With OVR wrapper
0.7266666666666667
0.68
  • As expected, OVO takes much longer to fit than OVR (about 148 ms vs. 22 ms here): with 20 classes, OVO trains \(20 \times 19 / 2 = 190\) binary classifiers, whereas OVR trains only 20 (see the check below).

  • You will find a summary of how scikit-learn handles multi-class classification for different classifiers in the scikit-learn documentation: https://scikit-learn.org/stable/modules/multiclass.html
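
To confirm the classifier counts mentioned above, we can inspect the fitted wrappers' estimators_ attribute, which holds the underlying binary classifiers (a quick sanity check, reusing the 20-class dataset from before):

ovo = OneVsOneClassifier(LogisticRegression()).fit(X_train_multi, y_train_multi)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X_train_multi, y_train_multi)
len(ovo.estimators_), len(ovr.estimators_)  # (190, 20) for 20 classes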