13  Introduction to supervised classification

Methods for supervised classification originated in the early twentieth century, under the term discriminant analysis (see, for example, Fisher (1936)). The late 1900s and 2000s have seen an explosion of research with many new methods emerging, especially addressing the increased collection of data and its storage in databases. The fundamental goals and approaches remain the same: to be able to accurately predict the class labels using a model developed from a categorical response variable and multivariate predictors. In contrast to unsupervised classification, the class label (categorical response variable) is known in the training sample. The training sample is used to build the prediction model, and also to estimate the accuracy, or inversely the error, of the model for future data.

Start by colouring points by the class variable, and examine whether the colour groups separate into clusters or are different in shape.

13.1 Beyond predictive accuracy

For a long time the focus has been on methods and algorithms that maximise predictive accuracy. While this is still an ongoing aim, a substantial change is underway: being able to understand a model, to interpret it, and to understand how predictions are made has been gaining traction. This raises questions like:

  • Are the classes well separated in the data space, so that they correspond to distinct clusters? If so, what are the shapes of the clusters? Is each cluster sufficiently ellipsoidal so that we can assume that the data arises from a mixture of multivariate normal distributions? Do the clusters exhibit characteristics that suggest one algorithm in preference to others?
  • Where does the boundary between classes fall? Are the classes linearly separable, or does the difference between classes suggest a non-linear boundary? How do changes in the input parameters affect these boundaries? How do the boundaries generated by different methods vary?
  • What cases are misclassified, or have more uncertain predictions? Are there places in the data space where predictions are especially good or bad?
  • Which predictors most contribute to the model predictions? Is it possible to reduce the set of explanatory variables?
Figure 13.1: Examples of supervised classification patterns: (a) linearly separable, (b) linear but not completely separable, (c) non-linearly separable, (d) non-linear, but not completely separable.

Figure 13.1 shows some 2D examples where the two classes are (a) linearly separable, (b) not completely separable but linearly different, (c) non-linearly separable and (d) not completely separable but with a non-linear difference. We can also see that in (a) only the horizontal variable would be important for the model because the two classes are completely separable in this direction. Although the data in (c) has separable classes, most models would have difficulty capturing the separation. It is for this reason that it is important to understand the boundary between classes produced by a fitted model. In each of b, c, d it is likely that some observations would be misclassified. Identifying these cases, and inspecting where they are in the data space is important for understanding the model’s performance on different samples.

Explainability has been a pursuit in statistics for a long time, and includes checking model assumptions, diagnosing the fit, and assessing variable importance. Interpretability also motivates the emerging field called explainable artificial intelligence (XAI), which is developing methods to determine how a model makes decisions on individual cases. Statistical models tend to be easy to describe and interpret because they impose strong assumptions on the data, which may or may not be reasonable. In contrast, computational models that follow prescribed algorithms tend to be seen as having fewer assumptions, and can fit flexibly to data. However, all algorithms impose some belief that may or may not be reasonable, and this needs checking.

13.2 Balancing bias and variance

Using a machine learning conceptual framework and associated terminology may be helpful in understanding the importance of explainability. Modeling methods are characterised by bias and variance, and flexibility. Imposing more beliefs or assumptions on the model fit potentially creates a more biased model. Here are some examples of the reasoning:

  • In linear discriminant analysis (LDA) (Chapter 14) there is a strong assumption that classes correspond to elliptical clusters. If the data does not have classes with this structure the result is a biased model that does not fit the data.
  • In contrast, tree algorithms (Chapter 15) (most commonly) operate on single variables at each split, which imposes the belief that there is no correlation between variables. If the data does have elliptical clusters where separations are in combinations of variables then a tree model will produce a biased model that does not effectively fit the data.
  • The tree model is considered more flexible than LDA because the algorithm can be forced to keep iterating to closely fit a particular training sample.
  • The LDA model may have high bias, but it would likely generate a similar fitted model if a different sample of data was used, and thus have low variance.
  • In contrast, a tree model could be forced to have low bias for a particular sample of data. However, if the same algorithm was fitted to a different sample the resulting model would likely be quite different, and thus the tree model is described as having high variance.

Using high-dimensional visualisation to understand the shapes of the class clusters will help to assess whether a particular method will likely result in a high or low bias model.

Predictive accuracy trades off these characteristics, with the best model having low bias and low variance. Ideally, the model is sufficiently flexible but no more than necessary. Figure 13.2 illustrates these model attributes for the data used in Figure 13.1 c. There are three samples drawn from the same process, which generates the separated zig-zag clusters, and two different types of models are fitted. The training samples are shown as black points, with the symbol indicating the class. The colour represents the predictions from the different fitted models. Plots a and d show one training sample; plots b and e show a second training sample; plots c and f show the third training sample. Plots a, b and c show predictions from a simple statistical model, and plots d, e and f show predictions from a computational model that has few assumptions. The model fitted in a-c has high bias because it does not capture the zig-zag, but low variance because it is virtually identical for all samples. The second model, with fit shown in d-f, has lower bias because it more flexibly fits the zig-zag structure. But it has higher variance, because the actual model changes substantially for different training samples. Balancing this trade-off to achieve a model that has low bias and low variance is desirable.

Figure 13.2: Illustrating bias and variance using three different samples from the same distribution (a/d, b/e, c/f) and two different models (a-c, d-f). The black points are the training samples and colour indicates predictions from the resulting fitted model. The model fitted in a-c has high bias but low variance. The model fitted in d-f has lower bias but higher variance.
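The comparison in Figure 13.2 can be mimicked numerically. The following is a minimal sketch and not the code used to make the figure. It assumes a hypothetical zig-zag generator, `sim_zigzag()`, and uses LDA (from MASS) and a fully grown tree (from rpart) as stand-ins for the simple statistical model and the computational model with few assumptions. Error against the known boundary on a fixed grid serves as a rough proxy for bias, and the proportion of grid points where the three fits disagree serves as a rough proxy for variance.

```r
library(MASS)   # lda()
library(rpart)  # rpart()

# Hypothetical generator: two classes separated by a zig-zag-like boundary
sim_zigzag <- function(n) {
  x1 <- runif(n, -1, 1)
  x2 <- runif(n, -1, 1)
  boundary <- 0.5 * sin(3 * pi * x1)   # assumed shape of the boundary
  keep <- abs(x2 - boundary) > 0.1     # leave a gap so classes are separated
  data.frame(x1 = x1[keep], x2 = x2[keep],
             cl = factor(ifelse(x2[keep] > boundary[keep], "A", "B"),
                         levels = c("A", "B")))
}

# Fixed grid on which every fitted model is evaluated
grid <- expand.grid(x1 = seq(-1, 1, length.out = 50),
                    x2 = seq(-1, 1, length.out = 50))
truth <- ifelse(grid$x2 > 0.5 * sin(3 * pi * grid$x1), "A", "B")

# Three training samples from the same process
set.seed(1301)
samples <- lapply(1:3, function(i) sim_zigzag(300))

# Predictions on the grid: one column per training sample
pred_lda <- sapply(samples, function(d)
  as.character(predict(lda(cl ~ x1 + x2, data = d), grid)$class))
pred_tree <- sapply(samples, function(d)
  as.character(predict(
    rpart(cl ~ x1 + x2, data = d,
          control = rpart.control(cp = 0, minsplit = 2)),
    grid, type = "class")))

# Proportion of grid points where the three fits do not all agree
disagree <- function(p) {
  mean(apply(p, 1, function(r) length(unique(r)) > 1))
}

summary_tab <- data.frame(
  model = c("lda", "tree"),
  error_vs_truth = c(mean(pred_lda != truth), mean(pred_tree != truth)),
  disagreement = c(disagree(pred_lda), disagree(pred_tree)))
summary_tab
```

One would expect the LDA rows to show larger error against the true boundary but little disagreement between samples, and the tree rows to show the reverse, although the exact numbers depend on the simulated samples.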

Bias and variance are conceptual constructs. Bias cannot be quantified unless the true model is known. The concepts are used for setting up simulations and comparing various models, because in these controlled scenarios they can be computed. In practice, with observed data, bias cannot be computed. High-dimensional visualisation can help with understanding the shapes of the class clusters and the separation between classes, giving a better sense of whether a particular approach will be able to capture this structure, and thus whether it is likely to have low or high bias.

To understand variance, we need to know how the model fit changes when different training samples are used to fit the model. This is achieved by dividing the training sample into folds and fitting a model to each fold. This is more difficult to evaluate with visual methods because it would require examining multiple samples for small differences.
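A minimal sketch of the fold idea follows, using the built-in iris data and LDA purely for illustration (neither is used in this chapter). Each model is fitted with one fold held out, which is the usual cross-validation convention and is an assumption about how the folds would be used. The spread of the held-out errors and of the fitted coefficients gives a rough indication of how much the fit changes between samples.

```r
library(MASS)  # lda()

set.seed(1302)
k <- 5
d <- iris   # illustrative data, not from this chapter

# Randomly assign each observation to one of k folds
fold <- sample(rep(1:k, length.out = nrow(d)))

cv <- lapply(1:k, function(i) {
  fit <- lda(Species ~ ., data = d[fold != i, ])   # fit without fold i
  pred <- predict(fit, d[fold == i, ])$class       # predict fold i
  list(error = mean(pred != d$Species[fold == i]),
       ld1 = fit$scaling[, "LD1"])
})

cv_error <- sapply(cv, `[[`, "error")    # held-out error for each fold
ld1_coefs <- sapply(cv, `[[`, "ld1")     # one column of coefficients per fold

cv_error
# The sign of a discriminant direction is arbitrary, so compare the
# relative sizes of the coefficients rather than their signs
round(ld1_coefs, 2)
```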

However, related to this is the practice of dividing the data into training and testing (or training, testing and validation) sets. The training set is used to fit the model, and the test set is used to estimate the error of the model when used on new samples. This is particularly important for computational models with few assumptions. High-dimensional visualisation can help to assess whether training and test samples are comparable, and thus suitable to use for the two tasks.
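As a sketch of how such a split might be set up and checked, the code below makes a stratified training and test split of the built-in iris data (an illustrative choice, not data from this chapter), compares the two sets numerically, and then colours points by set membership in a grand tour. The tour call assumes the tourr package is installed and that `animate_xy()` accepts a factor for `col`, and it is only run in an interactive session.

```r
set.seed(1303)
d <- iris   # illustrative data, not from this chapter

# Stratified split: sample two thirds of the rows within each class
train_id <- unlist(lapply(
  split(seq_len(nrow(d)), d$Species),
  function(idx) sample(idx, size = round(2 / 3 * length(idx)))))
d$set <- factor(ifelse(seq_len(nrow(d)) %in% train_id, "train", "test"),
                levels = c("train", "test"))

# Numerical check: class counts and variable means in each set
table(d$Species, d$set)
aggregate(. ~ Species + set, data = d, FUN = mean)

# Visual check: do training and test points occupy the same region?
if (interactive() && requireNamespace("tourr", quietly = TRUE)) {
  tourr::animate_xy(d[, 1:4], col = d$set)
}
```

If the two colours are thoroughly mixed in all projections, the sets are plausibly comparable. A region occupied by only one set would be a warning sign.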

13.3 How to use the tour with classification tasks

The primary start to using tours with classification problems is to colour observations by the class variable. Then use the grand tour and guided tour to explore the shapes of the clusters, and the separation between them. If there are many classes, start by colouring one class against all the others. After a model has been fitted to the data, there are two directions for effort: (1) labelling observations as correctly classified or not, and (2) simulating a set of data in the domain of the predictors so that the model boundary can be examined. Figure 13.3 illustrates the approach, and the purposes of using different tours. There are two 4D data sets. One has elliptically shaped clusters corresponding to two classes and a gap between them (a,b). The other (c,d), with one class cluster nested inside the other, is one of my favourite data games to play with students. Because the class clusters are nested it is very difficult to construct a good classification model using simple methods. However, with good visualisation it is relatively easy to see the nested structure, and fit a simple and perfect model by transforming the variables using distance from the mean. The slice tour is useful for this data, and occasionally useful for detecting odd shapes of clusters.

Set up to examine data in a classification task by colouring points using the class variable. If there are many classes, this can be done by colouring one class against the rest, and sequentially working through the classes.

Generally, the grand and guided tour are useful for exploring the class structure. The grand tour, used in (a), shows that the shape of the two clusters is elliptical, roughly the same size but oriented differently. It is also possible to see a large gap between the two clusters. This shape would suggest that some methods are technically incorrect to apply. But the large gap between clusters means that, practically, several simple methods would probably be adequate, and there is no need to apply a more complex model. The guided tour, used in (b), focuses the view on separations between the classes. The primary index to use is lda_pp(), which will help to find some differences between clusters, but it is simple, and easily confused by odd cluster shapes.
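The two tours described here might be run as in the following sketch. The data is simulated with assumed parameters to roughly resemble the description of plots (a) and (b), and is not the data used for Figure 13.3. The tour calls assume the tourr package is installed and that `animate_xy()` accepts a factor for `col`, and they are only run in an interactive session.

```r
library(MASS)  # mvrnorm()

set.seed(1304)
# Two elliptical 4D clusters of similar size but different orientation,
# with a large gap between them (all parameter values are assumptions)
s1 <- diag(4); s1[1, 2] <- s1[2, 1] <- 0.8
s2 <- diag(4); s2[1, 2] <- s2[2, 1] <- -0.8
x <- rbind(mvrnorm(200, mu = c(0, 0, 0, 0), Sigma = s1),
           mvrnorm(200, mu = c(10, 0, 0, 0), Sigma = s2))
colnames(x) <- paste0("x", 1:4)
cl <- factor(rep(c("A", "B"), each = 200))

if (interactive() && requireNamespace("tourr", quietly = TRUE)) {
  library(tourr)
  # Grand tour: overall shapes of the clusters and the gap between them
  animate_xy(x, tour_path = grand_tour(), col = cl)
  # Guided tour with the LDA index: steer towards class separation
  animate_xy(x, tour_path = guided_tour(lda_pp(cl)), col = cl)
}
```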

Figure 13.3 panels: (a) grand tour, (b) guided tour, (c) projection display, (d) slice display.
Figure 13.3: The grand tour, guided tour and slice tour are each useful for visualising class differences: (a,b) two classes with elliptically shaped clusters that have similar size but different orientations, and (c,d) a spherical kernel cluster inside a spherical shell. Plot (a) shows a grand tour revealing the overall elliptical shape with different orientations and the gap between the two clusters. Plot (b) shows the use of a guided tour to focus in on the gap. Plot (c) shows a grand tour of projections which hints at some difference between the two class clusters, but the slice tour (d) cuts through the projection to reveal the kernel inner cluster and the shell shape of the outside cluster.
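The nested example can be imitated with simulated data. This is a sketch under assumed radii, not the data behind plots (c) and (d), and `rdir()` is a hypothetical helper. It shows why transforming to distance from the mean makes the problem simple: a single threshold on that one derived variable separates the simulated kernel from the shell. The slice tour call assumes tourr is installed and that `animate_slice()` passes `col` through to the slice display.

```r
set.seed(1305)
p <- 4

# Hypothetical helper: n random directions (unit vectors) in p dimensions
rdir <- function(n, p) {
  z <- matrix(rnorm(n * p), ncol = p)
  z / sqrt(rowSums(z^2))
}

kernel <- rdir(200, p) * runif(200, 0, 1)     # inner cluster, radius in (0, 1)
shell  <- rdir(200, p) * runif(200, 2, 2.5)   # outer shell, radius in (2, 2.5)
nested <- data.frame(rbind(kernel, shell))
names(nested) <- paste0("x", 1:p)
nested$cl <- factor(rep(c("kernel", "shell"), each = 200))

# Transform: distance of each observation from the overall mean
ctr <- colMeans(nested[, 1:p])
nested$dist <- sqrt(rowSums(sweep(as.matrix(nested[, 1:p]), 2, ctr)^2))

# A single threshold on distance classifies the simulated data
pred <- factor(ifelse(nested$dist < 1.5, "kernel", "shell"),
               levels = levels(nested$cl))
table(nested$cl, pred)

# Slice tour to see the hollow shell around the kernel
if (interactive() && requireNamespace("tourr", quietly = TRUE)) {
  tourr::animate_slice(as.matrix(nested[, 1:p]), col = nested$cl)
}
```

The threshold of 1.5 is chosen by eye from the simulated radii. With real data it would need to be read from a plot of the distances or estimated by a model fitted to the transformed variable.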

The tour is also used to assess the fit of a model. For this, it can be helpful to generate diagnostic data by:

  • Augmenting the training and test data with new variables containing fitted values (as proportions or posterior probabilities), class predictions, and an indicator variable representing an error in the prediction. This is analogous to how the broom (Robinson et al., 2024) package augments data from some model objects.
  • Creating a separate data table containing variable importance estimates. These can be used to choose variables to view in the tour.
  • Additionally simulating data covering the domain of predictors and predicting the response. This is used to examine and compare boundaries between classes produced by different methods. The classifly (Wickham, 2022) package can do this for many methods. Using the slice tour can be better when examining boundaries in high dimensions.
  • Augmenting the training data with XAI metrics (described in Molnar (2022)) as produced by methods such as LIME, SHAP, or counterfactuals, to better understand how the model “sees” the classification. These metrics are observation-level variable importance scores that help to determine how the model arrived at the prediction for this observation. The reason to use these metrics only for the training data is because they describe the fitted model that was built from this training data. They can help to understand the fitted model in a local neighbourhood of an observation.

To diagnose a fitted model, augment the observed data with fitted values and local variable importance, and simulate a full domain of new observations with predictions to examine the classification boundaries it produces.
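A minimal sketch of the first and third items follows, done by hand with LDA on the built-in iris data (an illustrative choice). It does not use the broom or classifly interfaces, and the object names are hypothetical. The observed data is augmented with predictions, posterior probabilities and an error indicator, and a second data set is simulated uniformly over the range of each predictor and labelled with the model predictions so that the boundaries can be examined.

```r
library(MASS)  # lda()

fit <- lda(Species ~ ., data = iris)   # illustrative model and data
p <- predict(fit, iris)

# Augment the observed data: predicted class, posterior probabilities
# (columns post.setosa, post.versicolor, post.virginica), error indicator
iris_aug <- cbind(iris,
                  pred = p$class,
                  post = p$posterior,
                  error = as.integer(p$class != iris$Species))
table(iris_aug$Species, iris_aug$pred)

# Simulate points covering the domain of the predictors and predict them
set.seed(1306)
n_sim <- 2000
sim <- as.data.frame(lapply(iris[, 1:4], function(v) {
  runif(n_sim, min(v), max(v))
}))
sim$pred <- predict(fit, sim)$class

# Examine errors in the data space, and boundaries with slices
# (assumes tourr is installed and accepts a factor for col)
if (interactive() && requireNamespace("tourr", quietly = TRUE)) {
  tourr::animate_xy(iris_aug[, 1:4], col = factor(iris_aug$error))
  tourr::animate_slice(as.matrix(sim[, 1:4]), col = sim$pred)
}
```

Sampling uniformly within the bounding box is the simplest way to cover the domain. It places many points in regions where no data was observed, so boundaries far from the data should be read with caution.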

Exercises

  1. Using just the variables se, maxt, mint, log_dist_road, and “accident” or “lightning” causes, in the bushfires data use the tour (grand or guided using lda_pp) to decide
     a. whether the two classes are separable,
     b. which variables might be more important, and
     c. whether one should fit a linear or nonlinear classifier to separate the groups by assessing whether the border between groups is linear or nonlinear.
  2. Make a 10% sample of the pisa data, stratified by the two countries, and examine the first 5 math scores PV1MATH-PV5MATH. Explain whether there is a difference between the two countries, and whether they are separable based on math scores.

  3. There are several interesting data sets with class variables available on the GGobi website. Examine the differences between types of music, based on the variables lvar, lave, lmax, lfener, lfreq. Are these music types separable? If so, which variables are important? The music data can be read using:

library(readr)
library(dplyr)
music <- read_csv("http://ggobi.org/book/data/music-sub.csv",
                  show_col_types = FALSE) |>
  rename(title = `...1`) |>
  mutate(type = factor(type))
  4. The aflw data contains information on the player positions, as well as their statistics. Subset the data to FF (full forward), BPL (back pocket left) and RR (ruck). Examine the differences between player statistics (goals to tackles) for these positions. Explain whether the players in these positions have different profiles on the statistics, and whether the position is distinguishable based on combinations of the statistics.