16  Support vector machines

A support vector machine (SVM) (Vapnik, 1999) looks for gaps between clusters in the data, based on the extreme observations in each class. In this sense it mirrors the graphical approach described in Chapter 7, in which we searched for gaps between groups. It can be viewed as similar to LDA, in that the boundary between classes is a hyperplane. The difference between LDA and SVM is the placement of the boundary: LDA uses the means and covariance matrices of the classes to place the boundary, whereas SVM uses the extreme observations.

The key elements of the SVM model to extract are:

  • support vectors
  • separating hyperplane.

SVM is widely used for its ability to fit non-linear classification models in a simple fashion, using kernels in the boundary equation. We focus on the linear method here because it makes for a useful comparison with LDA, highlighting how the two models place the boundary differently. SVM tends to place the boundary between groups in a gap, if one exists. This is nice from a visual perspective, because when we look at differences between classes using a tour we naturally focus on the gaps. SVM matches this perception better than LDA does.
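For reference, the fitted SVM classifier has the standard form below, where the sum runs over the set of support vectors \(\mathcal{S}\), \(y_i \in \{-1, 1\}\) are the class labels, \(\alpha_i\) are the coefficients, \(K\) is the kernel function (the plain inner product for a linear SVM), and \(\rho\) is the intercept (following the sign convention of the e1071 output used below):

\[
f(x) = \operatorname{sign}\left(\sum_{i \in \mathcal{S}} \alpha_i y_i K(x_i, x) - \rho\right)
\]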

Non-linear SVM models are also interesting to examine. Mostly one would examine the boundaries between classes, which can be done in the same way as documented in Chapter 14 and Chapter 15.

16.1 Components of the SVM model

To illustrate the approach, we use two simple simulated data examples. Both have only two variables, and two classes. Explaining SVM is easier when there are just two groups. In the first data set the two classes have different covariance matrices, which will cause trouble for LDA, but SVM should see the gap between the two clusters and place the separating hyperplane in the middle of it. In the second data set the two groups are concentric circles, with the inner one solid. A non-linear SVM should be fitted to this data, which should find the circular gap between the two classes.

Note that the svm function in the e1071 package will automatically scale observations by default (to zero mean and unit variance). To make it easier to examine the fitted model, it is best to standardise your data first, and then fit the model with scale=FALSE.
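A minimal sketch of this workflow (assuming a data frame df with numeric predictors and a class column cl, both hypothetical), using the same scale2 standardisation applied throughout this chapter:

library(dplyr)
library(e1071)
# Standardise the numeric columns first, then turn off the internal
# scaling when fitting, so the model components stay on the data scale.
df_std <- df %>% mutate_if(is.numeric, mulgar:::scale2)
df_fit <- svm(cl~., data=df_std, kernel="linear", scale=FALSE)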

Code to simulate data examples
# Toy examples
library(mulgar)
library(ggplot2)
library(geozoo)
library(tourr)

set.seed(1071)
# Two elliptical clusters with different covariance matrices
n1 <- 162
vc1 <- matrix(c(1, -0.7, -0.7, 1), ncol=2, byrow=TRUE)
c1 <- rmvn(n=n1, p=2, mn=c(-2, -2), vc=vc1)
vc2 <- matrix(c(1, -0.4, -0.4, 1)*2, ncol=2, byrow=TRUE)
n2 <- 138
c2 <- rmvn(n=n2, p=2, mn=c(2, 2), vc=vc2)
df1 <- data.frame(x1=mulgar:::scale2(c(c1[,1], c2[,1])), 
                 x2=mulgar:::scale2(c(c1[,2], c2[,2])), 
                 cl = factor(c(rep("A", n1), 
                               rep("B", n2))))
# A hollow circle of points (with noise) surrounding a solid circle
c1 <- sphere.hollow(p=2, n=n1)$points*3 + 
  c(rnorm(n1, sd=0.3), rnorm(n1, sd=0.3))
c2 <- sphere.solid.random(p=2, n=n2)$points
df2 <- data.frame(x1=mulgar:::scale2(c(c1[,1], c2[,1])), 
                  x2=mulgar:::scale2(c(c1[,2], c2[,2])), 
                  cl = factor(c(rep("A", n1), 
                               rep("B", n2))))
library(classifly)
library(e1071)
df1_svm <- svm(cl~., data=df1, 
                     probability=TRUE, 
                     kernel="linear", 
               scale=FALSE)
df1_svm_e <- explore(df1_svm, df1)

df2_svm <- svm(cl~., data=df2,  
                     probability=TRUE, 
                     kernel="radial")
df2_svm_e <- explore(df2_svm, df2)
Code to make plots
library(patchwork)
library(colorspace)
s1 <- ggplot() + 
  geom_point(data=df1, aes(x=x1, y=x2, colour=cl),
             shape=20) +
  scale_colour_discrete_divergingx(palette="Zissou 1") +
  geom_point(data=df1_svm_e[(!df1_svm_e$.BOUNDARY)&(df1_svm_e$.TYPE=="simulated"),], 
             aes(x=x1, y=x2, colour=cl), shape=3) +
  geom_point(data=df1[df1_svm$index,], 
             aes(x=x1, y=x2, colour=cl), 
             shape=4, size=4) +
  theme_minimal() +
  theme(aspect.ratio=1, legend.position = "none") +
  ggtitle("(a)")

s2 <- ggplot() + 
  geom_point(data=df2, aes(x=x1, y=x2, colour=cl), shape=20) +
  scale_colour_discrete_divergingx(palette="Zissou 1") +
  geom_point(data=df2_svm_e[(!df2_svm_e$.BOUNDARY)&(df2_svm_e$.TYPE=="simulated"),], 
             aes(x=x1, y=x2, colour=cl), 
             shape=3) +
  geom_point(data=df2[df2_svm$index,], 
             aes(x=x1, y=x2, colour=cl), 
             shape=4, size=4) +
  theme_minimal() +
  theme(aspect.ratio=1, legend.position = "none") +
  ggtitle("(b)")

s1+s2
Figure 16.1: SVM classifier fit overlaid on two simulated data examples: (a) groups with different variance-covariance, fitted using a linear kernel, (b) groups with non-linear separation, fitted using a radial kernel. The band of points shown as ‘+’ marks the SVM boundary, and the points marked by ‘x’ are the support vectors used to define the boundary.

Figure 16.1 shows the two data sets and the important aspects of the fitted SVM model for each. The observations are represented by dots, and the separating hyperplane (just a line in 2D) by ‘+’. Where the two colours merge is the actual location of the boundary between classes. It can be seen that this is located right down the middle of the gap, for both data sets. Even though the boundary is circular for the second data set, in a transformed high-dimensional space it would be linear.
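For the radial kernel used in (b), e1071 uses the parameterisation \(K(x, x') = \exp(-\gamma \|x - x'\|^2)\), where \(\gamma\) is set by the gamma argument (by default, one over the number of predictors).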

SVMs use a subset of the observations to define the boundary, and these are called the support vectors. For each of the data sets these are marked with ‘x’. For the linear boundary, there are nine support vectors, five in one group and four in the other. There is one interesting observation in the red group, which falls on the other side of the boundary. It is marked as a support vector, but its contribution to the fitted hyperplane is limited by a control parameter in the model fitting process.
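The control parameter is the cost argument of svm(), with default value 1: the reported coefficients are bounded in magnitude by this value. A quick way to see the effect (our illustration, not part of the original analysis) is to refit with a larger cost:

# Refit with a larger cost: the support vector coefficients are now
# bounded at +/-5 instead of +/-1, so the stray observation can
# contribute more to the fit.
df1_svm_c5 <- svm(cl~., data=df1, kernel="linear", 
                  scale=FALSE, cost=5)
range(df1_svm_c5$coefs)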

Linear SVMs can be assessed similarly to regression models. The components of the model are:

  1. The points that are the support vectors:
df1_svm$index
[1]  15  45 123 135 155 180 202 239 292
  2. Their coefficients:
df1_svm$coefs
            [,1]
 [1,]  0.3771240
 [2,]  0.1487726
 [3,]  1.0000000
 [4,]  1.0000000
 [5,]  1.0000000
 [6,] -0.5258966
 [7,] -1.0000000
 [8,] -1.0000000
 [9,] -1.0000000

which indicate that all but observations 15, 45 and 180 are bounded support vectors (their coefficients have been capped at magnitude 1).

  3. The intercept:
df1_svm$rho
[1] 0.3520001

which, together with the support vectors and coefficients, can be used to compute the equation of the fitted hyperplane.
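Specifically, because the coefs element holds the products \(\alpha_i y_i\), the normal vector of the hyperplane is

\[
w = \sum_{i \in \mathcal{S}} \alpha_i y_i x_i,
\]

and the separating hyperplane is \(w^\top x - \rho = 0\). The next computation applies this directly.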

w <- t(df1_svm$SV) %*% df1_svm$coefs
w
        [,1]
x1 -1.501086
x2 -1.356237

This gives the equation \(-1.5 x_1 - 1.36 x_2 - 0.35 = 0\), or alternatively, \(x_2 = -1.11 x_1 - 0.26\), which can be used to generate a line to show the boundary with the data.

s1 + geom_abline(intercept=df1_svm$rho/w[2],
                 slope=-w[1]/w[2])

Note that care with the scaling of the data is needed to compute the intercept correctly. We have standardised the data, and set the scale=FALSE parameter in the svm function call. The slope calculation is quite robust to the data scaling.
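As an additional check (our addition), the decision values reported by predict() should be approximately \(\pm 1\) at the unbounded support vectors, which lie exactly on the margins of the fitted hyperplane:

# Decision values at the support vectors: approximately +/-1 for the
# unbounded ones (observations 15, 45 and 180); the bounded ones
# satisfy y*f(x) <= 1 and can fall inside the margins.
predict(df1_svm, df1[df1_svm$index,], decision.values=TRUE)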

Like LDA, a linear SVM model for two groups can be written using the equation of a hyperplane. The fitted model coefficients are then used to generate points on this plane, to examine the boundary between groups.
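A minimal sketch of how such points can be generated (our simplification; the explore() function from classifly automates this), using the w and rho computed above for the first example:

# Sample candidate points over the data range, keeping those that lie
# within a small (arbitrary) tolerance of the plane w'x - rho = 0.
set.seed(423)
cand <- cbind(x1=runif(5000, min(df1$x1), max(df1$x1)),
              x2=runif(5000, min(df1$x2), max(df1$x2)))
near <- abs(drop(cand %*% w) - df1_svm$rho) < 0.02
plane_pts <- as.data.frame(cand[near,])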

16.2 Examining the model components in high dimensions

For higher dimensions, the procedures are similar, with the hyperplane and support vectors being examined using a tour. Here we examine the model for differentiating male and female Chinstrap penguins. The Chinstrap penguins have a noticeable difference in size between the sexes, unlike the other two species. Working with a two-class problem is easier for explaining SVM, but multi-class calculations can also follow this approach.

library(dplyr)
load("data/penguins_sub.rda")
chinstrap <- penguins_sub %>%
  filter(species == "Chinstrap") %>%
  select(-species) %>%
  mutate_if(is.numeric, mulgar:::scale2)
chinstrap_svm <- svm(sex~., data=chinstrap, 
                     kernel="linear",
                     probability=TRUE, 
                     scale=FALSE)
chinstrap_svm_e <- explore(chinstrap_svm, chinstrap)
Code to make the tours
# Tour raw data
animate_xy(chinstrap[,1:4], col=chinstrap$sex)
# Add all SVs, including bounded
c_pch <- rep(20, nrow(chinstrap))
c_pch[chinstrap_svm$index] <- 4
animate_xy(chinstrap[,1:4], col=chinstrap$sex, pch=c_pch)
# Only show the SVs with |coefs| < 1
c_pch <- rep(20, nrow(chinstrap))
c_pch[chinstrap_svm$index[abs(chinstrap_svm$coefs)<1]] <- 4
c_cex <- rep(1, nrow(chinstrap))
c_cex[chinstrap_svm$index[abs(chinstrap_svm$coefs)<1]] <- 2
animate_xy(chinstrap[,1:4], col=chinstrap$sex, 
           pch=c_pch, cex=c_cex)
render_gif(chinstrap[,1:4],
           grand_tour(),
           display_xy(col=chinstrap$sex, pch=c_pch, cex=c_cex),
           gif_file="gifs/chinstrap_svs.gif",
           width=400,
           height=400,
           frames=500)

# Tour the separating hyperplane also
symbols <- c(3, 20)
c_pch <- symbols[as.numeric(chinstrap_svm_e$.TYPE[!chinstrap_svm_e$.BOUNDARY])]
animate_xy(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4], 
           col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY],
           pch=c_pch)
render_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],
           grand_tour(),
           display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),
           gif_file="gifs/chinstrap_svm.gif",
           width=400,
           height=400,
           frames=500)
(a) Support vectors
(b) SVM boundary
Figure 16.2: SVM model for distinguishing the sexes of the Chinstrap penguins. The separating hyperplane is 3D, and separates primarily on the variables bl and bd, as can be seen because these two axes extend out from the plane when it is viewed from the side, separating the two groups.

Mark the support vectors by point shape, and examine where these are relative to the difference between groups.

Examining the hyperplane in a grand tour display (Figure 16.2) indicates that two of the variables, bl and bd, are important for separating the two classes. We can check this interpretation using the radial tour. Using the components from the model, the coefficients of the hyperplane are:

t(chinstrap_svm$SV) %*% chinstrap_svm$coefs
         [,1]
bl -0.9102439
bd -1.1073475
fl -0.5223364
bm -0.2846370

The coefficients for bl and bd are the largest in magnitude, which supports the interpretation that they are the most important. Once normalised, this vector can be used as the starting point for a radial tour. Combining it with any other vector, and orthonormalising the pair, gives a 2D projection in which to visualise the boundary.

set.seed(1022)
prj1 <- mulgar::norm_vec(t(chinstrap_svm$SV) %*%
                           chinstrap_svm$coefs)
prj2 <- basis_random(4, 1)
prj <- orthonormalise(cbind(prj1, prj2))
prj
         [,1]        [,2]
bl -0.5865081 -0.06412875
bd -0.7135101  0.51192498
fl -0.3365631 -0.77713899
bm -0.1834035 -0.36038216
Code to conduct the radial tours
animate_xy(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4], 
           tour_path = radial_tour(start=prj, mvar = 2),
           col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY],
           pch=c_pch)
render_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],
           radial_tour(start=prj, mvar = 2),
           display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),
           gif_file="gifs/chinstrap_rad_bd.gif",
           apf = 1/30,
           width=400,
           height=400,
           frames=500)
render_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],
           radial_tour(start=prj, mvar = 1),
           display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),
           gif_file="gifs/chinstrap_rad_bl.gif",
           apf = 1/30,
           width=400,
           height=400,
           frames=500)
render_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],
           radial_tour(start=prj, mvar = 3),
           display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),
           gif_file="gifs/chinstrap_rad_fl.gif",
           apf = 1/30,
           width=400,
           height=400,
           frames=500)
render_gif(chinstrap_svm_e[!chinstrap_svm_e$.BOUNDARY,1:4],
           radial_tour(start=prj, mvar = 4),
           display_xy(col=chinstrap_svm_e$sex[!chinstrap_svm_e$.BOUNDARY], pch=c_pch),
           gif_file="gifs/chinstrap_rad_bm.gif",
           apf = 1/30,
           width=400,
           height=400,
           frames=500)

This projection is shown in Figure 16.3. You can see the boundary between the two sexes as a clear line, marked by a sample of points on either side. We use the radial tour to remove each variable from the projection in turn, to examine its importance to the model, and hence the boundary. If the clear view of the boundary gets jumbled when a variable is removed, we infer that this variable is very important for the model (as seen for bl and bd). If there is little change in the clarity when a variable is removed, then it is less important (as seen for fl and bm).

(a) bl
(b) bd
(c) fl
(d) bm
Figure 16.3: Exploring the importance of the four variables to the separating hyperplane using a radial tour, where the contribution of each variable is reduced to zero, and then increased to its original value. You can see that bl and bd contribute most to the plane, because when either is removed the plane is no longer on its side marking the boundary. Variables fl and bm contribute a small amount to the separating hyperplane, but it is possible that these two could be removed without affecting the strength of the separation between the sexes.

Use a radial tour to zero out coefficients defining the separating hyperplane, to explore variable importance.

In this example, we can see that the clarity of the boundary changes substantially when either bl or bd is removed. There is only a small change when fl or bm is removed, so they are less important. This interpretation matches the one that would be made from the magnitude of the hyperplane coefficients (printed earlier); the two approaches reinforce each other. It is also possible that the interpretation could differ after using the radial tour, most likely by simplifying the coefficient vector, that is, supporting forcing some coefficients to zero.
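One way to check this (our addition, not part of the original analysis) is to refit the model using only bl and bd and compare the misclassifications:

# Refit with only the two important variables; if fl and bm are
# dispensable, the confusion table should be nearly unchanged.
chinstrap_svm2 <- svm(sex~bl+bd, data=chinstrap, 
                      kernel="linear", scale=FALSE)
table(chinstrap$sex, predict(chinstrap_svm2, chinstrap))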

When we use the radial tour to examine how the different variables contribute to the separating hyperplane between the sexes, we learn that bl and bd are the most important variables. We could (almost) ignore fl and bm for this classification.

Exercises

  1. Generate a small subset from the bushfire data: we keep the variables log_dist_cfa, log_dist_road and cause, and we select only observations where cause is either lightning or arson. Fit a linear SVM model to separate the two classes and show the decision boundary together with the data. Compare to the boundary obtained by LDA and argue how the two models place the separating hyperplane in different ways.
  2. We extend this into a multivariate setting by also including amaxt180 and amaxt720 as predictors. Fit a linear SVM model and calculate the hyperplane to judge which of these variables are important.
  3. Calculate the decision boundary and look at it with a radial tour to see the effect of including individual predictors in a projection. Also explore what happens when rotating out multiple variables together. What can you learn from this?
  4. From the sketches_train data select all observations of class banana or boomerang. For this subset use PCA to find the first 5 PCs. Fit two SVM models: one with a linear kernel, and one with a radial kernel and the default value of the gamma parameter. Compare the number of misclassified observations in the training data for the two models.
  5. Compute the model predictions and compare the decision boundaries between the linear and radial SVM using a slice tour. Does the shape match what you expect given the respective kernel function?
  6. SVM models are defined for separating two classes, but an ensemble of such models can be used when we want to distinguish more than two classes. Look up the documentation of the svm function to learn how this works, then fit an SVM model to separate the three penguin species. In this case we primarily use the model predictions to investigate the decision boundaries; you can use explore together with the slice tour to do this. You can use different kernels and compare the resulting decision boundaries.