14  Linear discriminant analysis

Linear discriminant analysis (LDA) dates to the early 1900s. It’s one of the most elegant and simple techniques for both modeling separation between groups, and as an added bonus, producing a low-dimensional representation of the differences between groups. LDA has two strong assumptions: the groups are samples from multivariate normal distributions, and each have the same variance-covariance. If the latter assumption is relaxed, a slightly less elegant solution results from quadratic discriminant analysis.

Useful explanations can be found in Venables & Ripley (2002a) and Ripley (1996). A good general treatment of parametric methods for supervised classification can be found in R. A. Johnson & Wichern (2002) or another similar multivariate analysis textbook. It’s also useful to know that hypothesis testing for the difference in multivariate means using multivariate analysis of variance (MANOVA) has similar assumptions to LDA. Also model-based clustering assumes that each cluster arises from a multivariate normal distribution, and is related to LDA. The methods described here can be used to check these assumptions when applying these methods, too.

Because LDA is a parametric model it is important to check that these assumptions are reasonably satisfied:

  • shape of clusters are elliptical.
  • spread of the observations are the same.

14.1 Extracting the key elements of the model

LDA builds the model on the between-group sum-of-square matrix

\[B=\sum_{k=1}^g n_k(\bar{X}_k-\bar{X})(\bar{X}_k-\bar{X})^\top\] which measures the differences between the class means, compared with the overall data mean \(\bar{X}\) and the within-group sum-of-squares matrix,

\[ W = \sum_{k=1}^g\sum_{i=1}^{n_k} (X_{ki}-\bar{X}_k)(X_{ki}-\bar{X}_k)^\top \]

which measures the variation of values around each class mean. The linear discriminant space is generated by computing the eigenvectors (canonical coordinates) of \(W^{-1}B\), and this is the \((g-1)\)-D space where the group means are most separated with respect to the pooled variance-covariance. For each class we compute

\[ \delta_k(x) = (x-\mu_k)^\top W^{-1}\mu_k + \log \pi_k \]

where \(\pi_k\) is a prior probability for class \(k\) that might be based on unequal sample sizes, or cost of misclassification. The LDA classifier rule is to assign a new observation to the class with the largest value of \(\delta_k(x)\).

We can fit an LDA model using the lda() function from the MASS package. Here we have used the penguins data, assuming equal prior probability, to illustrate.

# Code to fit the model
library(dplyr)
library(mulgar)
library(MASS)
load("data/penguins_sub.rda")

p_lda <- lda(species~bl+bd+fl+bm, 
             data=penguins_sub,
             prior=c(1/3, 1/3, 1/3))
options(digits=2)
# p_lda

Because there are three classes the dimension of the discriminant space is 2D. We can easily extract the group means from the model.

# Extract the sample means
p_lda$means
             bl    bd    fl    bm
Adelie    -0.95  0.60 -0.78 -0.62
Chinstrap  0.89  0.64 -0.37 -0.59
Gentoo     0.65 -1.10  1.16  1.10

The coefficients to project the data into the discriminant space, that is the eigenvectors of \(W^{-1}B\) are:

# Extract the discriminant space
p_lda$scaling
     LD1   LD2
bl -0.24 -2.31
bd  2.04  0.19
fl -1.20  0.08
bm -1.22  1.24

and the predicted values, which include class predictions, and coordinates in the discriminant space are generated as:

# Extract the fitted values
p_lda_pred <- predict(p_lda, penguins_sub)

The best separation between classes can be viewed from this object, which can be shown to match the original data projected using the scaling component of the model object (see Figure 14.1).

Code to generate LDA plots
# Check calculations from the fitted model, and equations
library(colorspace)
library(ggplot2)
library(ggpubr)
# Using the predicted values from the model object
p_lda_pred_x1 <- data.frame(p_lda_pred$x)
p_lda_pred_x1$species <- penguins_sub$species
p_lda1 <- ggplot(p_lda_pred_x1, 
                 aes(x=LD1, y=LD2, 
                     colour=species)) + 
  geom_point() +
  xlim(-6, 8) + ylim(-6.5, 5.5) +
  scale_color_discrete_divergingx("Zissou 1") +
  ggtitle("(a)") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.title = element_blank()) 

# matches the calculations done manually
p_lda_pred_x2 <- data.frame(as.matrix(penguins_sub[,1:4]) %*%
                              p_lda$scaling)
p_lda_pred_x2$species <- penguins_sub$species
p_lda2 <- ggplot(p_lda_pred_x2, 
                 aes(x=LD1, y=LD2, 
                     colour=species)) + 
  geom_point() +
  xlim(-6, 8) + ylim(-7, 5.5) +
  scale_color_discrete_divergingx("Zissou 1") +
  ggtitle("(b)") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.title = element_blank()) 
ggarrange(p_lda1, p_lda2, ncol=2, 
          common.legend = TRUE, legend = "bottom")
Figure 14.1: Penguins projected into the 2D discriminant space, done two ways: (a) using the predicted values, (b) directly projecting using the model component. The scale is not quite the same but the projected data is identical in shape.

The \(W\) and \(B\) matrices cannot be extracted from the model object, so we need to compute these separately. We only need \(W\) actually. It is useful to think of this as the pooled variance-covariance matrix. Because the assumption for LDA is that the population group variance-covariances are identical, we estimate this by computing them for each class and then averaging them to get the pooled variance-covariance matrix. It’s laborious, but easy.

# Compute pooled variance-covariance
p_vc_pool <- mulgar::pooled_vc(penguins_sub[,1:4],
                               penguins_sub$species)
p_vc_pool
     bl   bd   fl   bm
bl 0.31 0.18 0.13 0.18
bd 0.18 0.32 0.14 0.20
fl 0.13 0.14 0.23 0.16
bm 0.18 0.20 0.16 0.31

This can be used to draw an ellipse corresponding to the pooled variance-covariance that is used by the LDA model.

14.2 Checking assumptions

This LDA approach is widely applicable, but it is useful to check the underlying assumptions on which it depends: (1) that the cluster structure corresponding to each class forms an ellipse, showing that the class is consistent with a sample from a multivariate normal distribution, and (2) that the variance of values around each mean is nearly the same. Figure 14.2 and Figure 14.3 illustrates two datasets, of which only one is consistent with these assumptions. Other parametric models, such as quadratic discriminant analysis or logistic regression, also depend on assumptions about the data which should be validated.

To check the equal and elliptical variance-covariance assumption, generate points on the surface of an ellipse corresponding to the variance-covariance for each group. When watching these ellipses in a tour, they should similar in all projections.

Code
# Generate ellipses for each group's variance-covariance
p_ell <- NULL
for (i in unique(penguins_sub$species)) {
  x <- penguins_sub %>% dplyr::filter(species == i)
  e <- gen_xvar_ellipse(x[,1:2], n=150, nstd=1.5)
  e$species <- i
  p_ell <- bind_rows(p_ell, e)
}
Code for penguins data and ellipse plots
lda1 <- ggplot(penguins_sub, aes(x=bl, 
                         y=bd, 
                         colour=species)) +
  geom_point() +
  scale_color_discrete_divergingx("Zissou 1") +
  xlim(-2.5, 3) + ylim(-2.5, 2.5) +
  ggtitle("(a)") +
  theme_minimal() +
  theme(aspect.ratio = 1) 
lda2 <- ggplot(p_ell, aes(x=bl, 
                         y=bd, 
                         colour=species)) +
  geom_point() +
  scale_color_discrete_divergingx("Zissou 1") +
  xlim(-2.5, 3) + ylim(-2.5, 2.5) +
  ggtitle("(b)") +
  theme_minimal() +
  theme(aspect.ratio = 1)
ggarrange(lda1, lda2, ncol=2, 
          common.legend = TRUE, legend = "bottom")
Figure 14.2: Scatterplot of flipper length by bill length of the penguins data, and corresponding variance-covariance ellipses. There is a small amount of difference between the ellipses, but they are similar enough to be confident in assuming the population variance-covariances are equal.
Code for bushfires data and ellipse plots
# Now repeat for a data set that violates assumptions
data(bushfires)
lda3 <- ggplot(bushfires, aes(x=log_dist_cfa, 
                         y=log_dist_road, 
                         colour=cause)) +
  geom_point() +
  scale_color_discrete_divergingx("Zissou 1") +
  xlim(6, 11) + ylim(-1, 10.5) +
  ggtitle("(a)") +
  theme_minimal() +
  theme(aspect.ratio = 1)
b_ell <- NULL
for (i in unique(bushfires$cause)) {
  x <- bushfires %>% dplyr::filter(cause == i)
  e <- gen_xvar_ellipse(x[,c(57, 59)], n=150, nstd=2)
  e$cause <- i
  b_ell <- bind_rows(b_ell, e)
}
lda4 <- ggplot(b_ell, aes(x=log_dist_cfa, 
                         y=log_dist_road, 
                         colour=cause)) +
  geom_point() +
  scale_color_discrete_divergingx("Zissou 1") +
  xlim(6, 11) + ylim(-1, 10.5) +
  ggtitle("(b)") +
  theme_minimal() +
  theme(aspect.ratio = 1)
ggarrange(lda3, lda4, ncol=2, 
          common.legend = TRUE, legend = "bottom")
Figure 14.3: Scatterplot of distance to cfa and road for the bushfires data, and corresponding variance-covariance ellipses. There is a lot of difference between the ellipses, so it cannot be assumed that the population variance-covariances are equal.

The equal and elliptical variance-covariance assumption is reasonable for the penguins data because the ellipse shapes roughly match the spread of the data. It is not a suitable assumption for the bushfires data, because the spread is not elliptically-shaped and varies in size between groups.

This approach extends to any dimension. We would use the same projection sequence to view both the data and the variance-covariance ellipses, as in Figure 14.4. It can be seen that there is some difference in the shape and size of the ellipses between species, in some projections, and also with the spread of points in the projected data. However, it is the differences are small, so it would be safe to assume that the population variance-covariances are equal.

Code for making animated gifs
library(tourr)
p_ell <- NULL
for (i in unique(penguins_sub$species)) {
  x <- penguins_sub %>% dplyr::filter(species == i)
  e <- gen_xvar_ellipse(x[,1:4], n=150, nstd=1.5)
  e$species <- i
  p_ell <- bind_rows(p_ell, e)
}
p_ell$species <- factor(p_ell$species)
load("data/penguins_tour_path.rda")
animate_xy(p_ell[,1:4], col=factor(p_ell$species))
render_gif(penguins_sub[,1:4], 
           planned_tour(pt1), 
           display_xy(half_range=0.9, axes="off", col=penguins_sub$species),
           gif_file="gifs/penguins_lda1.gif",
           frames=500,
           loop=FALSE)
render_gif(p_ell[,1:4], 
           planned_tour(pt1), 
           display_xy(half_range=0.9, axes="off", col=p_ell$species),
           gif_file="gifs/penguins_lda2.gif",
           frames=500,
           loop=FALSE)
Animation showing a tour of the penguins data, with colour indicating species. The spread of points in each group is reasonably similar regardless of projection.
(a) Data
Animation showing a tour of the ellipses corresponding to variance-covariance matrices for each species. The shape of the ellipse for each group is reasonably similar regardless of projection.
(b) Variance-covariance ellipses
Figure 14.4: Checking the assumption of equal variance-covariance matrices for the 4D penguins data. Each ellipse corresponds to the sample variance-covariance for each species.

As a further check, we could generate three ellipses corresponding to the pooled variance-covariance matrix, as would be used in the model, centered at each of the means. Overlay this with the data, as done in Figure 14.5. Now you will compare the spread of the observations in the data, with the elliptical shape of the pooled variance-covariance. If it matches reasonably we can safely use LDA. This can also be done group by group when multiple groups make it difficult to view all together.

To check the fit of the equal variance-covariance assumption, simulate points on the ellipse corresponding to the pooled sample variance-covariance matrix. Generate one for each group centered at the group mean, and compare with the data.

Code for adding ellipses to data
# Create an ellipse corresponding to pooled vc
pool_ell <- gen_vc_ellipse(p_vc_pool, 
                           xm=rep(0, ncol(p_vc_pool)))

# Add means to produce ellipses for each species
p_lda_pool <- data.frame(rbind(
  pool_ell +
    matrix(rep(p_lda$means[1,],
      each=nrow(pool_ell)), ncol=4),
  pool_ell +
    matrix(rep(p_lda$means[2,],
      each=nrow(pool_ell)), ncol=4),
  pool_ell +
    matrix(rep(p_lda$means[3,],
      each=nrow(pool_ell)), ncol=4)))
# Create one data set with means, data, ellipses
p_lda_pool$species <- factor(rep(levels(penguins_sub$species),
                          rep(nrow(pool_ell), 3)))
p_lda_pool$type <- "ellipse"
p_lda_means <- data.frame(
  p_lda$means,
  species=factor(rownames(p_lda$means)),
                          type="mean")
p_data <- data.frame(penguins_sub[,1:5], 
                     type="data")
p_lda_all <- bind_rows(p_lda_means,
                       p_data,
                       p_lda_pool)
p_lda_all$type <- factor(p_lda_all$type, 
   levels=c("mean", "data", "ellipse"))
shapes <- c(3, 4, 20)
p_pch <- shapes[p_lda_all$type]
Code to generate animated gifs
# Code to run the tour
animate_xy(p_lda_all[,1:4], col=p_lda_all$species, pch=p_pch)
load("data/penguins_tour_path.rda")
render_gif(p_lda_all[,1:4], 
           planned_tour(pt1), 
           display_xy(col=p_lda_all$species, pch=p_pch, 
                      axes="off", half_range = 0.7),
           gif_file="gifs/penguins_lda_pooled1.gif",
           frames=500,
           loop=FALSE)

# Focus on one species
render_gif(p_lda_all[p_lda_all$species == "Gentoo",1:4], 
           planned_tour(pt1), 
           display_xy(col="#F5191C", 
                      pch=p_pch[p_lda_all$species == "Gentoo"], 
                      axes="off", half_range = 0.7),
           gif_file="gifs/penguins_lda_pooled2.gif",
           frames=500,
           loop=FALSE)
Animation showing a tour of the pooled variance-covariance ellipse, computed for each species, overlaid on the data. The shape of the ellipse for each group is reasonably similar to the spread of the points in all projections.
(a) All species
Animation showing a tour of the pooled variance-covariance ellipse overlaid on the data for the Gentoo penguins (red). The shape of the ellipse is reasonably similar to the spread of the points in all projections.
(b) Gentoo
Figure 14.5: Checking how the pooled variance-covariance matches the spread of points in each group.

From the tour, we can see that the assumption of equal elliptical variance-covariance is a reasonable assumption for the penguins data. In all projections the ellipse is reasonably matching the spread of the observations.

14.3 Examining results

The boundaries for a classification model can be examined by:

  1. generating a large number of test points in the domain of the data
  2. predicting the class for each test point

We’ll look at this for 2D using the LDA model fitted to bl, and bd of the penguins data.

p_bl_bd_lda <- lda(species~bl+bd, data=penguins_sub, 
                                  prior = c(1/3, 1/3, 1/3))

The fitted model means \(\bar{x}_{Adelie} = (\) -0.95, 0.6\()^\top\), \(\bar{x}_{Chinstrap} = (\) 0.89, 0.64\()^\top\), and \(\bar{x}_{Gentoo} = (\) 0.65, -1.1\()^\top\) can be added to the plots.

The boundaries can be examined using the explore() function from the classifly package, which generates observations in the range of all values ofbl and bd and predicts their class. Figure 14.6 shows the resulting prediction regions, with the observed data and the sample means overlaid.

# Compute points in domain of data and predict
library(classifly)

p_bl_bd_lda_boundaries <- explore(p_bl_bd_lda, penguins_sub)
p_bl_bd_lda_m1 <- ggplot(p_bl_bd_lda_boundaries) +
  geom_point(aes(x=bl, y=bd, 
                 colour=species, 
                 shape=.TYPE), alpha=0.8) + 
  scale_color_discrete_divergingx("Zissou 1") +
  scale_shape_manual(values=c(46, 16)) +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none")

p_bl_bd_lda_means <- data.frame(p_bl_bd_lda$means,
                        species=rownames(p_bl_bd_lda$means))
p_bl_bd_lda_m1 +   
  geom_point(data=p_bl_bd_lda_means, 
             aes(x=bl, y=bd), 
             colour="black", 
             shape=3,
             size=3) 
Square divided into three regions by the coloured points, primarily a wedge of yellow (Chinstrap) extending from the top right divides the other two groups. The regions mostly contain the observed values, with just a few over the boundary in the wrong region.
Figure 14.6: Prediction regions of the LDA model for two variables of the three species of penguins indicated by the small points. Large points are the observations, and the sample mean of each species is represented by the plus. The boundaries between groups can be seen to be roughly half-way between the means, taking the elliptical spread into account, and mostly distinguishes the three species.

This approach can be readily extended to higher dimensions. One first fits the model with all four variables, and uses the explore() to generate points in the 4D space with predictions, generating a representation of the prediction regions. Figure 14.7(a) shows the results using a slice tour (Laa et al., 2020a). Points inside the slice are shown in larger size. The slice is made in the centre of the data, to show the boundaries in this neighbourhood. As the tour progresses we see a thin slice through the centre of the data, parallel with the projection plane. In most projections there is some small overlap of points between groups, which happens because we are examining a 4D object with 2D. The slice helps to alleviate this, allowing a focus on the boundaries in the centre of the cube. In all projections the boundaries between groups is linear, as would be expected when using LDA. We can also see that the model roughly divides the cube into three relatively equally-sized regions.

Figure 14.7(b) shows the three prediction regions, represented by points in 4D, projected into the discriminant space. Linear boundaries neatly divide the full space, which is to be expected because the LDA model computes it’s classification rules in this 2D space.

p_lda <- lda(species ~ ., penguins_sub[,1:5], prior = c(1/3, 1/3, 1/3))
p_lda_boundaries <- explore(p_lda, penguins_sub)
Code for generating slice tour
# Code to run the tour
p_lda_boundaries$species
animate_slice(p_lda_boundaries[p_lda_boundaries$.TYPE == "simulated",1:4], col=p_lda_boundaries$species[p_lda_boundaries$.TYPE == "simulated"], v_rel=0.02, axes="bottomleft")
render_gif(p_lda_boundaries[p_lda_boundaries$.TYPE == "simulated",1:4],
           planned_tour(pt1),
           display_slice(v_rel=0.02, 
             col=p_lda_boundaries$species[p_lda_boundaries$.TYPE == "simulated"], 
             axes="bottomleft"),                     gif_file="gifs/penguins_lda_boundaries.gif",
           frames=500,
           loop=FALSE
           )
Code for projecting into LDA space
# Project the boundaries into the 2D discriminant space
p_lda_b_sub <- p_lda_boundaries[
  p_lda_boundaries$.TYPE == "simulated", 
  c(1:4, 6)]
p_lda_b_sub_ds <- data.frame(as.matrix(p_lda_b_sub[,1:4]) %*%
  p_lda$scaling)
p_lda_b_sub_ds$species <- p_lda_b_sub$species
p_lda_b_sub_ds_p <- ggplot(p_lda_b_sub_ds, 
       aes(x=LD1, y=LD2, 
           colour=species)) +
  geom_point(alpha=0.5) +  
  geom_point(data=p_lda_pred_x1, aes(x=LD1, 
                               y=LD2, 
                               shape=species),
             inherit.aes = FALSE) +
  scale_color_discrete_divergingx("Zissou 1") +
  scale_shape_manual(values=c(1, 2, 3)) +
  theme_minimal() +
  theme(aspect.ratio = 1, 
        legend.position = "bottom",
        legend.title = element_blank()) 
Animation plot, where three groups of coloured points can be seen. They roughly break the square into three regions with linear boundaries, which varies in each projection.
(a) 4D
Scatterplot of points divided neatly into three colour groups by linear boundaries. The regions are split at the middle, and with the boundary between blue and red, blue and yellow, and yellow and red, starting near 11, 4, and o'clock, respectively.
(b) Discriminant space
Figure 14.7: Examining the boundaries produced by the LDA model in the full 4D with a slice tour (left) and in the discriminant space (right). Large points indicate observations within the slice, and dots are observations outside the slice. Focusing on the within-slice points, there is some overlap of points between regions in most views which represents the occlusion of 4D shapes when examining projections with thin slices. The linear boundaries are seen exactly in the discriminant space, that is they are orthogonal to these two dimensions.

From the tour, we can see that the LDA boundaries divide the classes only in the discriminant space. It is not using the space orthogonal to the 2D discriminant space. You can see this because the boundary is sharp in just one 2D projection, while most of the projections show some overlap of regions.

Exercises

  1. For the simple_clusters compute the LDA model, and make a plot of the data, with points coloured by the class. Overlay variance-covariance ellipses, and a \(+\) indicating the sample mean for each class. Is it reasonable to assume that the two classes are sampled from populations with the same variance-covariance?
  2. Examine the clusters corresponding to the classes in the clusters data set, using a tour. Based on the shape of the data is the assumption of equal variance-covariance reasonable?
  3. Examine the pooled variance-covariance for the clusters data, overlaid on the data in a tour on the 5D. Does it fit the variance of each cluster nicely?
  4. Fit an LDA model to the simple_clusters data. Examine the boundaries produced by the model, in 2D.
  5. Fit an LDA model to the clusters data. Examine the boundaries produced by the model in 5D.
  6. Assess the LDA assumptions for the multicluster data. Is LDA an appropriate model?
  7. Compute the first 12 PCs of the sketches data. Check the assumption of equal, elliptical variance-covariance of the 6 groups. Regardless of whether you decide that the assumption is satisfied or not, fit an LDA to the 12 PCs. Extract the discriminant space (the x component of the predict object), and examine the separation (or not) of the 6 groups in this 5D space. Is LDA providing a good classification model for this data?
  8. Even though the bushfires data does not satisfy the assumptions for LDA, fit LDA to the first five PCs. Examine the class differences in the 3D discriminant space.
  9. Compute the boundary between classes, for the LDA model where the prior probability reflects the sample size, and the LDA model where the priors are equal for all groups. How does the boundary between lightning caused fires and the other groups change?