12  Summarising and comparing clustering results

12.1 Summarising results

The result of a cluster analysis can be thought of as a smaller data set, corresponding to the cluster centres, that describes the original data more compactly. With this in mind, the key elements for summarising a cluster analysis are, for each cluster:

  1. the number of observations,
  2. a measure of the centre, and
  3. a measure of the variability of observations around the centre.
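These three elements are straightforward to compute from the vector of cluster labels. A minimal base R sketch, using a small simulated data set (the penguins clustering used in this chapter is set up in the code further below):

```r
# Simulated stand-in for a clustered data set: two variables and labels
set.seed(853)
d <- data.frame(x1 = c(rnorm(50), rnorm(50, mean = 4)),
                x2 = c(rnorm(50), rnorm(50, mean = 4)),
                cl = rep(1:2, each = 50))

# 1. number of observations in each cluster
table(d$cl)

# 2. measure of centre: the multivariate mean of each cluster
aggregate(cbind(x1, x2) ~ cl, data = d, FUN = mean)

# 3. a simple measure of variability: within-cluster standard
#    deviations (variance-covariance matrices or convex hulls,
#    discussed below, are richer summaries)
aggregate(cbind(x1, x2) ~ cl, data = d, FUN = sd)
```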

With numerical data, which we have focused on here, the measure of centre is typically the multivariate mean. Measuring the variability can be more complicated. The exception is model-based clustering: because it imposes an elliptical variance-covariance structure, the variance-covariance estimates for each cluster are suitable measures of variability (item 3 above). The methods discussed in Chapter 10 for displaying the corresponding ellipses in high dimensions are suitable here.

For all other methods, the shape of the variance is not controlled, so cluster shapes can vary. Some methods tend to create particular shapes; for example, \(k\)-means and Ward's linkage hierarchical clustering may produce spherical clusters. But more than likely this is not the case: the algorithm uses a ball to group observations, yet the shape of the resulting cluster might be very different. Summarising the variability, though, is very important, because it provides feedback on the strength of the clustering. A good clustering result makes the large data set simpler, in the sense that a smaller set of points is an adequate summary of the data.

To accommodate summarising odd-shaped clusters, it is common to generate the convex hull of each cluster, as shown in Figure 12.1. This can also be done in high dimensions, using the R package cxhull (Laurent, 2023) to compute the \(p\)-D convex hull.

For model-based clustering, the variability within clusters can be summarised by \(p\)-dimensional ellipses. For other methods, a \(p\)-dimensional convex hull can show the variability within clusters without assuming a specific distribution.

Code to do clustering
load("data/penguins_sub.rda")
p_dist <- dist(penguins_sub[,1:2])
p_hcw <- hclust(p_dist, method="ward.D2")
p_cl <- data.frame(cl_w = cutree(p_hcw, 3))

Code for convex hulls in 2D
library(dplyr)
library(ggplot2)
library(colorspace)  # for scale_colour_discrete_divergingx()
psub <- penguins_sub |>
  select(bl, bd) |>
  mutate(cl = p_cl$cl_w)

phull <- gen_chull(psub[,1:2], psub$cl)
phull_segs <- data.frame(x = phull$data$bl[phull$edges[,1]],
                         y = phull$data$bd[phull$edges[,1]],
                         xend = phull$data$bl[phull$edges[,2]],
                         yend = phull$data$bd[phull$edges[,2]],
                         cl = phull$edge_clr)
p_chull2D <- ggplot() +
  geom_point(data=phull$data, aes(x=bl, y=bd, 
                            colour=cl)) + 
  geom_segment(data=phull_segs, aes(x=x, xend=xend,
                                    y=y, yend=yend,
                                    colour=cl)) +
  scale_colour_discrete_divergingx(palette = "Zissou 1") +
  theme_minimal() +
  theme(aspect.ratio = 1)
Code to do clustering
p_dist <- dist(penguins_sub[,1:4])
p_hcw <- hclust(p_dist, method="ward.D2")

p_cl <- data.frame(cl_w = cutree(p_hcw, 3))

library(mclust)
penguins_mc <- Mclust(penguins_sub[,1:4], 
                      G=3, 
                      modelNames = "EEE")
p_cl <- p_cl |> 
  mutate(cl_mc = penguins_mc$classification)
Code to generate pD convex hull
phull <- gen_chull(penguins_sub[,1:4], p_cl$cl_w)
Code to generate pD convex hull and view in tour
library(tourr)
animate_xy(phull$data[,1:4], 
           col=phull$data$cl,
           edges=phull$edges, 
           edges.col=phull$edge_clr)
render_gif(phull$data[,1:4], 
           tour_path = grand_tour(),
           display = display_xy(col=phull$data$cl,
                                edges=phull$edges,
                                edges.col=phull$edge_clr),
           gif_file = "gifs/penguins_chull.gif",
           frames = 500, 
           width = 400,
           height = 400)
Figure 12.1: Convex hulls summarising the extent of Ward's linkage clustering in (a) 2D and (b) 4D.

12.2 Comparing two clusterings

Each cluster analysis results in a vector of class labels for the data. To compare two results, we tabulate and plot the pair of label vectors. The labels given to each cluster will likely differ between methods. If the two methods agree, there will be just a few cells with large counts among mostly empty cells.
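In base R this tabulation is simply table() applied to the two label vectors. A small sketch with hypothetical labels:

```r
# Two hypothetical clusterings of the same ten observations
cl_a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
cl_b <- c(2, 2, 2, 3, 3, 3, 1, 1, 1, 2)

# Rows are the first clustering, columns the second. Agreement shows
# up as one large count in each row and column; the stray count in
# row 3, column 2 is a disagreement on one observation.
table(cl_a, cl_b)
```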

Below is a comparison between the three-cluster results of Ward's linkage hierarchical clustering (rows) and model-based clustering (columns). The two methods mostly agree, as seen from the three cells with large counts and the mostly zero cells elsewhere. They disagree on only eight penguins: Ward's considers them part of cluster 1, but model-based considers them members of cluster 2.

The two methods also label the clusters differently: what Ward's labels cluster 3, model-based labels cluster 2. The labels produced by any algorithm are arbitrary, and can easily be recoded to coordinate between methods.
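Recoding to coordinate the labels is a one-line mapping. A sketch, using the correspondence visible in the confusion table below (model-based 1 → 1, 3 → 2, 2 → 3, to match Ward's):

```r
# Named vector mapping model-based labels onto Ward's labelling
relabel <- c(`1` = 1, `2` = 3, `3` = 2)

# Example model-based labels, and their recoded values
cl_mc <- c(1, 2, 2, 3, 1)
cl_mc_new <- unname(relabel[as.character(cl_mc)])
cl_mc_new  # 1 3 3 2 1
```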

Code to make confusion table
library(dplyr)
library(tidyr)
library(gt)
p_cl |> 
  count(cl_w, cl_mc) |> 
  pivot_wider(names_from = cl_mc, 
              values_from = n, 
              values_fill = 0) |>
  gt() |>
  tab_spanner(label = "cl_mc", columns=c(`2`, `3`, `1`)) |>
  cols_width(everything() ~ px(60))
 cl_w        cl_mc
          2     3     1
   1      8     0   149
   2      0   119     0
   3     57     0     0

We can examine the disagreement by linking a plot of the table with a tour plot. Linking the confusion matrix with the tour can be accomplished with crosstalk and detourr. Figure 12.2 shows screenshots of the exploration of the eight penguins on which the methods disagree. It makes sense that there is some confusion: these penguins are part of the large clump of observations that doesn't separate cleanly into two clusters. The eight penguins fall roughly in the middle of this clump, and there is no clear place to split it. Realistically, both methods produce a plausible clustering.

Figure 12.2: Comparing the model-based and Ward's linkage hierarchical solutions, using linking between the confusion table and a tour with detourr. Points are coloured according to the model-based result. The disagreement on eight penguins is between cluster 1 from Ward's and cluster 2 from model-based. These penguins fall in the middle of the large clump of points, where it is not possible to cleanly decide on a split. Both solutions are plausible.

Code to do interactive graphics
library(crosstalk)
library(plotly)
library(detourr)
library(viridis)
p_cl <- p_cl |> 
  mutate(cl_w_j = jitter(cl_w),
         cl_mc_j = jitter(cl_mc))
penguins_cl <- bind_cols(penguins_sub, p_cl)
p_cl_shared <- SharedData$new(penguins_cl)

set.seed(1046)
detour_plot <- detour(p_cl_shared, tour_aes(
  projection = bl:bm,
  colour = cl_mc)) |>
    tour_path(grand_tour(2), 
                    max_bases=100, fps = 60) |>
       show_scatter(alpha = 0.7, axes = FALSE,
                    width = "100%", height = "450px")

conf_mat <- plot_ly(p_cl_shared, 
                    x = ~cl_mc_j,
                    y = ~cl_w_j,
                    color = ~cl_mc,
                    colors = viridis_pal(option = "D")(3),
                    height = 450) |>
  highlight(on = "plotly_selected", 
              off = "plotly_doubleclick") |>
    add_trace(type = "scatter", 
              mode = "markers")
  
bscols(
     detour_plot, conf_mat,
     widths = c(5, 6)
 )                 

Linking a scatterplot showing the confusion matrix and a tour plot can help to decide whether one solution is better than another, or not.

Exercises

  1. Compare the results of the four-cluster model-based clustering with those of the four-cluster Ward's linkage clustering of the penguins data.
  2. Compare the results from clustering of the fake_trees data for two different choices of \(k\). (This follows from the exercise in Chapter 9.) Which choice of \(k\) is best? And what choice of \(k\) best captures the 10 known branches?
  3. Compare and contrast the cluster solutions for the first four PCs of the aflw data, conducted in Chapter 8 and Chapter 9. Which provides the most useful clustering of this data?
  4. Compute the convex hulls for the final clustering of the aflw data. What do you learn about the shapes of the final clusters, and which variables contribute most to the differences between clusters?
  5. Pick two clusterings of one of the challenge data sets, c1-c7 from the mulgar package, that give very different results. Compare and contrast the two solutions, and decide which is better.
  6. Use model-based clustering on the c1 data, as done in Q6 of Chapter 10. Construct the summary of the best six-cluster solution, using ellipses to summarise each cluster. How good is this fit?
  7. Use Ward's linkage clustering on the c4 data set. Choose 5 clusters as the result. Construct the convex hulls for the solution and display them with a grand tour.

Project

Most of the time your data will not neatly separate into clusters, but partitioning it into groups of similar observations can still be useful. In this case our toolbox helps in comparing and contrasting different methods, understanding to what extent a cluster mean can describe the observations in the cluster, and also how the boundaries between clusters have been drawn. To explore this we will use survey data that examines the risk-taking behaviour of tourists: the risk_MSA data, see the Appendix for details.

  1. We first examine the data in a grand tour. Do you notice that each variable was measured on a discrete scale?
  2. Next we explore different solutions from hierarchical clustering of the data. For comparison we will keep the number of clusters fixed at 6, and we will perform the hierarchical clustering with different combinations of distance functions (Manhattan distance and Euclidean distance) and linkage (single, complete and Ward's linkage). Which combinations make sense based on what we know about the method and the data?
  3. For each of the hierarchical clustering solutions draw the dendrogram in 2D and also in the data space. You can also map the grouping into 6 clusters to different colors. How would you describe the different solutions?
  4. Using the method introduced in this chapter, compare the solution using Manhattan distance and complete linkage to one using Euclidean distance and Ward linkage. First compute a confusion table and then use liminal or detourr to explore some of the differences. For example, you should be able to see how small subsets where the two clustering solutions disagree can be outlying and are grouped differently depending on the choices we make.
  5. Selecting your preferred solution from hierarchical clustering, we will now compare it to what is found using \(k\)-means clustering with \(k=6\). Use a tour to show the cluster means together with the data points (make sure to pick an appropriate symbol for the data points to avoid too much overplotting). What can you say about the variation within the clusters? Can you match some of the clusters with the most relevant variables from following the movement of the cluster means during the tour?
  6. Use a projection pursuit guided tour to best separate the clusters identified with \(k\)-means clustering. How are the clusters related to the different types of risk?
  7. Use the approaches from this chapter to summarise and compare the \(k\)-means solution to your selected hierarchical clustering results. Are the groupings mostly similar?
  8. Some other possible activities include examining how model-based methods would cluster the data. We expect the result to be similar to Ward's hierarchical clustering or \(k\)-means, partitioning the data into roughly equal chunks, with an EII variance-covariance model being optimal. You could also examine an SOM fit. SOM is not ideal for this data because the data fills the space: if the SOM model is fitted properly, it should be a tangled net where the nodes (cluster means) are fairly evenly spread out, so the result should again be similar to Ward's hierarchical clustering or \(k\)-means. A common problem with fitting an SOM is that the optimisation stops early, before fully capturing the data set. This is the reason to use the tour for SOM: if the net is bunched in one part of the data space, the optimisation wasn't successful.