10  Model-based clustering

Model-based clustering (Fraley & Raftery, 2002) fits a multivariate normal mixture model to the data. It uses the EM algorithm to estimate the means, the variance-covariance matrices of each population, and the mixing proportions. The variance-covariance matrix is re-parameterized using an eigen-decomposition

\[ \Sigma_k = \lambda_kD_kA_kD_k^\top, ~~~k=1, \dots, g ~~\mbox{(number of clusters)} \]

resulting in several model choices, ranging from simple to complex, as shown in Table 10.1.

Table 10.1: Parameterizations of the covariance matrix.
| Model | $$\Sigma$$ | Family | Volume | Shape | Orientation |
|-------|------------|--------|--------|-------|-------------|
| EII | $$\lambda I$$ | Spherical | Equal | Equal | NA |
| VII | $$\lambda_k I$$ | Spherical | Variable | Equal | NA |
| EEI | $$\lambda A$$ | Diagonal | Equal | Equal | Coordinate axes |
| VEI | $$\lambda_kA$$ | Diagonal | Variable | Equal | Coordinate axes |
| EVI | $$\lambda A_k$$ | Diagonal | Equal | Variable | Coordinate axes |
| VVI | $$\lambda_k A_k$$ | Diagonal | Variable | Variable | Coordinate axes |
| EEE | $$\lambda DAD^\top$$ | Ellipsoidal | Equal | Equal | Equal |
| EVE | $$\lambda DA_kD^\top$$ | Ellipsoidal | Equal | Variable | Equal |
| VEE | $$\lambda_k DAD^\top$$ | Ellipsoidal | Variable | Equal | Equal |
| VVE | $$\lambda_k DA_kD^\top$$ | Ellipsoidal | Variable | Variable | Equal |
| EEV | $$\lambda D_kAD_k^\top$$ | Ellipsoidal | Equal | Equal | Variable |
| VEV | $$\lambda_k D_kAD_k^\top$$ | Ellipsoidal | Variable | Equal | Variable |
| EVV | $$\lambda D_kA_kD_k^\top$$ | Ellipsoidal | Equal | Variable | Variable |
| VVV | $$\lambda_k D_kA_kD_k^\top$$ | Ellipsoidal | Variable | Variable | Variable |
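The components of this decomposition can be recovered from any variance-covariance matrix with base R's `eigen()`. A minimal sketch (the matrix `Sigma` here is an illustrative example, not taken from the penguins data):

```r
# Decompose Sigma into volume (lambda), shape (A) and orientation (D):
# Sigma = lambda * D %*% A %*% t(D), with det(A) = 1.
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 2.0), nrow = 2)
e <- eigen(Sigma)
p <- nrow(Sigma)
lambda <- det(Sigma)^(1 / p)   # volume: geometric mean of the eigenvalues
A <- diag(e$values / lambda)   # shape: normalized eigenvalues, det(A) = 1
D <- e$vectors                 # orientation: eigenvectors
# The reconstruction matches Sigma up to floating point error
max(abs(lambda * D %*% A %*% t(D) - Sigma))
```

The model names in Table 10.1 then simply record which of these three components are constrained to be equal across clusters and which are left to vary.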

Note the distribution descriptions “spherical” and “ellipsoidal”. These describe the shape of the variance-covariance of a multivariate normal distribution. A multivariate normal with equal variances and zero correlation (a variance-covariance matrix proportional to the identity) corresponds to spherically shaped data. When the variances (diagonal elements) differ or the variables are correlated, data from a multivariate normal is ellipsoidal in shape.
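One way to see the distinction numerically: the eigenvalues of a spherical variance-covariance matrix are all equal, while an ellipsoidal one has unequal eigenvalues. A small sketch with illustrative matrices:

```r
# Spherical: equal variances, zero correlation (proportional to identity)
spherical <- diag(c(1, 1))
# Ellipsoidal: unequal variances and correlation between variables
ellipsoidal <- matrix(c(1.0, 0.7,
                        0.7, 2.0), nrow = 2)
eigen(spherical)$values    # equal: density contours are circles
eigen(ellipsoidal)$values  # unequal: density contours are ellipses
```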

The models are typically scored using the Bayesian Information Criterion (BIC), which is based on the log likelihood penalized by the number of parameters, which in turn depends on the number of variables and the number of mixture components. They should also be assessed using graphical methods, as we demonstrate using the penguins data.
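The mclust package reports BIC on the scale $$2\log L - (\mbox{number of parameters})\log(n)$$, so larger values are better. A sketch verifying this on simulated data (this assumes the mclust package is installed; `nMclustParams()` counts the free parameters for a given model, dimension, and number of clusters):

```r
library(mclust)
set.seed(1019)
# Two well-separated spherical clusters in 2D
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
fit <- Mclust(x, G = 2, modelNames = "EII", verbose = FALSE)
# BIC by hand: 2*loglik penalized by parameter count and sample size
npar <- nMclustParams("EII", d = 2, G = 2)
bic_by_hand <- 2 * fit$loglik - npar * log(nrow(x))
bic_by_hand - fit$bic   # essentially zero
```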

10.1 Examining the model in 2D

We start with two of the four real-valued variables (bl, fl) and the three species. The goal is to determine whether model-based methods can discover clusters that closely correspond to the three species. Based on the scatterplot in Figure 10.1 we would expect it to do well, and would suggest a model with elliptical variance-covariances of roughly equal size.

Code to make plot
library(ggplot2)
load("data/penguins_sub.rda")
ggplot(penguins_sub, aes(x=bl, y=fl)) +
  geom_point() +
  geom_density2d(colour="#3B99B1") +
  theme_minimal() +
  theme(aspect.ratio = 1)
Figure 10.1: Scatterplot of flipper length by bill length of the penguins data.
library(mclust)     # mclustBIC(), Mclust()
library(mulgar)     # ggmcbic(), mc_ellipse()
library(colorspace) # scale_color_discrete_divergingx()
library(patchwork)  # plot_layout()
penguins_BIC <- mclustBIC(penguins_sub[,c(1,3)])
ggmc <- ggmcbic(penguins_BIC, cl=2:9, top=4) + 
  scale_color_discrete_divergingx(palette = "Roma") +
  ggtitle("(a)") +
  theme_minimal() 
penguins_mc <- Mclust(penguins_sub[,c(1,3)], 
                      G=3, 
                      modelNames = "EVE")
penguins_mce <- mc_ellipse(penguins_mc)
penguins_cl <- penguins_sub[,c(1,3)]
penguins_cl$cl <- factor(penguins_mc$classification)
ggell <- ggplot() +
   geom_point(data=penguins_cl, aes(x=bl, y=fl,
                                    colour=cl),
              alpha=0.3) +
   geom_point(data=penguins_mce$ell, aes(x=bl, y=fl,
                                         colour=cl),
              shape=16) +
   geom_point(data=penguins_mce$mn, aes(x=bl, y=fl,
                                        colour=cl),
              shape=3, size=2) +
  scale_color_discrete_divergingx(palette = "Zissou 1")  +
  theme_minimal() +
  theme(aspect.ratio=1, legend.position="none") +
  ggtitle("(b)")
ggmc + ggell + plot_layout(ncol=2)
Figure 10.2: Summary plots from model-based clustering: (a) BIC values of the top four models for 2-9 clusters, (b) variance-covariance ellipses and cluster means (+) corresponding to the best model. The best model is the three-cluster EVE, which has differently shaped variance-covariances but the same volume and orientation.

Figure 10.2 summarises the results. All of the top models agree that three clusters is best. The different variance-covariance models for three clusters have similar BIC values, with EVE (different shape, same volume and orientation) being slightly higher. These plots are made from the mclust package output using the ggmcbic and mc_ellipse functions from the mulgar package.

10.2 Extending to higher dimensions

Now we will examine how model-based clustering will group the penguins data using all four variables.

penguins_BIC <- mclustBIC(penguins_sub[,1:4])
ggmc <- ggmcbic(penguins_BIC, cl=2:9, top=7) + 
  scale_color_discrete_divergingx(palette = "Roma") +
  theme_minimal() 
ggmc
Figure 10.3: BIC values for the top models for 2-9 clusters on the penguins data. The interpretation is mixed: if one were to choose three clusters, any of the variance-covariance models would do equally well, but the very best model is the four-cluster VEE.
penguins_mc <- Mclust(penguins_sub[,1:4], 
                      G=4, 
                      modelNames = "VEE")
penguins_mce <- mc_ellipse(penguins_mc)
penguins_cl <- penguins_sub
penguins_cl$cl <- factor(penguins_mc$classification)

library(dplyr)
penguins_mc_data <- penguins_cl %>%
  select(bl:bm, cl) %>%
  mutate(type = "data") %>%
  bind_rows(bind_cols(penguins_mce$ell,
                      type=rep("ellipse",
                               nrow(penguins_mce$ell)))) %>%
  mutate(type = factor(type))
Code to make animated gifs
library(tourr)
animate_xy(penguins_mc_data[,1:4],
           col=penguins_mc_data$cl,
           pch=c(4, 20)[as.numeric(penguins_mc_data$type)],
           axes="off")

load("data/penguins_tour_path.rda")
render_gif(penguins_mc_data[,1:4], 
           planned_tour(pt1), 
           display_xy(col=penguins_mc_data$cl,
               pch=c(4, 20)[
                 as.numeric(penguins_mc_data$type)], 
                      axes="off",
               half_range = 0.7),
           gif_file="gifs/penguins_best_mc.gif",
           frames=500,
           loop=FALSE)

penguins_mc <- Mclust(penguins_sub[,1:4], 
                      G=3, 
                      modelNames = "EEE")
penguins_mce <- mc_ellipse(penguins_mc)
penguins_cl <- penguins_sub
penguins_cl$cl <- factor(penguins_mc$classification)

penguins_mc_data <- penguins_cl %>%
  select(bl:bm, cl) %>%
  mutate(type = "data") %>%
  bind_rows(bind_cols(penguins_mce$ell,
                      type=rep("ellipse",
                               nrow(penguins_mce$ell)))) %>%
  mutate(type = factor(type))

animate_xy(penguins_mc_data[,1:4],
           col=penguins_mc_data$cl,
           pch=c(4, 20)[as.numeric(penguins_mc_data$type)], 
           axes="off")

# Save the animated gif
load("data/penguins_tour_path.rda")
render_gif(penguins_mc_data[,1:4], 
           planned_tour(pt1), 
           display_xy(col=penguins_mc_data$cl,
               pch=c(4, 20)[
                 as.numeric(penguins_mc_data$type)], 
                      axes="off",
               half_range = 0.7),
           gif_file="gifs/penguins_simpler_mc.gif",
           frames=500,
           loop=FALSE)
(a) Best model: four-cluster VEE
(b) Three-cluster EEE
Figure 10.4: Examining the model-based clustering results for the penguins data: (a) the best model according to BIC, the four-cluster VEE, (b) the simpler three-cluster EEE model. Dots are ellipse points, and “x” are data points. It is important to note that the three-cluster solution fits the data better, even though it has a lower BIC.

Using the tour to visualise the final few models with similarly high BIC values helps to choose the one that best fits the data, which may not be the one with the highest BIC value.

Exercises

  1. Examine the three-cluster EVE, VVE and VEE models with the tour, and explain whether these are distinguishably different from the three-cluster EEE model.
  2. Fit model-based clustering to the clusters data. Does it suggest the data has three clusters? Using the tour, examine the best model. How well does this fit the data?
  3. Fit model-based clustering to the multicluster data. Does it suggest the data has six clusters? Using the tour, examine the best model. How well does this fit the data?
  4. Fit model-based clustering to the fake_trees data. Does it suggest that the data has 10 clusters? If not, why do you think this is? Using the tour, examine the best model. How well does this fit the branching structure?
  5. Try fitting model-based clustering to the aflw data. What is the best model? Is the solution related to offensive, defensive, and mid-fielder skills?
  6. Use model-based clustering on the challenge data sets, c1-c7, from the mulgar package. Explain why the best model does or does not fit the cluster structure.