9  \(k\)-means clustering

One of the simplest and most efficient techniques for clustering data is the \(k\)-means algorithm. The algorithm begins with a choice of \(k\), the number of clusters to divide the data into. It is seeded with \(k\) initial means, and then iterates through the observations, assigning each to the nearest mean and re-calculating the \(k\) means. It stops after a given number of iterations or when points no longer change clusters. The algorithm tends to segment the data into roughly equal-sized, spherical clusters, and thus works well when the clusters are separated and roughly spherical in shape.
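
To make these steps concrete, here is a bare-bones sketch of the procedure (Lloyd's algorithm) written directly in R. The function name simple_kmeans is hypothetical and for illustration only; the examples in this chapter use the standard stats::kmeans() function.

simple_kmeans <- function(x, k, max_iter = 50) {
  x <- as.matrix(x)
  # seed with k randomly chosen observations as the initial means
  means <- x[sample(nrow(x), k), , drop = FALSE]
  cl <- rep(0, nrow(x))
  for (i in seq_len(max_iter)) {
    # assign each observation to its nearest mean (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(x) - means[j, ])^2))
    new_cl <- apply(d, 1, which.min)
    # stop when no point changes cluster
    if (all(new_cl == cl)) break
    cl <- new_cl
    # re-calculate the k means from the current assignments
    # (empty clusters are not handled in this sketch)
    for (j in seq_len(k)) means[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = means)
}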

A good place to learn more about the \(k\)-means algorithm is Chapter 20 of Boehmke & Greenwell (2019). The algorithm has been in use for a long time! It was named \(k\)-means by MacQueen (1967), but was developed by Lloyd in 1957 (as described in Lloyd (1982)) and separately by Forgy (1965), and perhaps others, as it is a very simple procedure.

The key elements to examine in a \(k\)-means clustering result are:

  • means
  • boundaries (implicit in the assignments; see the sketch after this list)
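
The means are returned directly by kmeans() (as the $centers component). The boundaries are implicit: every location in the data space belongs to its nearest mean, so they can be sketched by classifying a grid of points. Below is a minimal sketch for the 2D simple_clusters data; the grid resolution and plotting choices are illustrative and not part of the figure code later in the chapter.

library(mulgar)
library(ggplot2)
data("simple_clusters")

sc_km <- kmeans(simple_clusters[, 1:2], centers = 2,
                iter.max = 50, nstart = 5)
centers <- sc_km$centers

# grid of points covering the range of the data
grid <- expand.grid(
  x1 = seq(min(simple_clusters$x1), max(simple_clusters$x1), length.out = 100),
  x2 = seq(min(simple_clusters$x2), max(simple_clusters$x2), length.out = 100))

# assign each grid point to its nearest mean (squared Euclidean distance)
grid$cl <- factor(apply(as.matrix(grid), 1, function(p)
  which.min(colSums((t(centers) - p)^2))))

# shade the grid by assignment; the colour change marks the boundary
ggplot() +
  geom_point(data = grid, aes(x = x1, y = x2, colour = cl),
             alpha = 0.1, size = 0.5) +
  geom_point(data = simple_clusters, aes(x = x1, y = x2)) +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none")

Because each point is assigned to its nearest mean, the boundary between two clusters is the perpendicular bisector of the line joining their means, which is why \(k\)-means partitions are always convex.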

9.1 Examining results in 2D

Figure 9.1 shows the results of \(k\)-means clustering on the 2D simple_clusters data and on two variables of the penguins data. We can see that it works well when the clusters are spherical, but for the penguins data it fails because the clusters are elliptical in shape. It actually makes a mistake that would not be made if we clustered visually: cluster 3 groups points across a gap, a divide that we would all agree should form a separation between clusters.

library(mulgar)
library(ggplot2)
library(dplyr)
library(colorspace)
library(patchwork)
data("simple_clusters")
load("data/penguins_sub.rda")

set.seed(202305)
sc_bl_bd_km <- kmeans(simple_clusters[,1:2], centers=2, 
                     iter.max = 50, nstart = 5)
sc_bl_bd_km_means <- data.frame(sc_bl_bd_km$centers) %>%
  mutate(cl = factor(rownames(sc_bl_bd_km$centers)))
sc_bl_bd_km_d <- simple_clusters[,1:2] %>% 
  mutate(cl = factor(sc_bl_bd_km$cluster))
# Code to make plots
sc_bl_bd_km_p <- ggplot() +
  geom_point(data=sc_bl_bd_km_d, 
             aes(x=x1, y=x2, colour=cl), 
             shape=16, alpha=0.4) +
  geom_point(data=sc_bl_bd_km_means, 
             aes(x=x1, y=x2, colour=cl), 
             shape=3, size=5) +
  scale_color_discrete_divergingx("Zissou 1") +
  ggtitle("(a)") +
  theme_minimal() +
  theme(aspect.ratio = 1, 
        legend.position = "bottom",
        legend.title = element_blank()) 

p_bl_bd_km <- kmeans(penguins_sub[,1:2], centers=3, 
                     iter.max = 50, nstart = 5)
p_bl_bd_km_means <- data.frame(p_bl_bd_km$centers) %>%
  mutate(cl = factor(rownames(p_bl_bd_km$centers)))
p_bl_bd_km_d <- penguins_sub[,1:2] %>% 
  mutate(cl = factor(p_bl_bd_km$cluster))

p_bl_bd_km_p <- ggplot() +
  geom_point(data=p_bl_bd_km_d, 
             aes(x=bl, y=bd, colour=cl), 
             shape=16, alpha=0.4) +
  geom_point(data=p_bl_bd_km_means, 
             aes(x=bl, y=bd, colour=cl), 
             shape=3, size=5) +
  scale_color_discrete_divergingx("Zissou 1") +
  ggtitle("(b)") +
  theme_minimal() +
  theme(aspect.ratio = 1, 
        legend.position = "bottom",
        legend.title = element_blank()) 

sc_bl_bd_km_p + p_bl_bd_km_p + plot_layout(ncol=2)
Figure 9.1: Examining \(k\)-means clustering results for simple clusters (a) and two variables of the penguins data (b). The means are indicated by a \(+\). The results are perfect for the simple clusters but not for the penguins data. The penguin clusters are elliptically shaped which is not captured by \(k\)-means. Cluster 3 has observations grouped across a gap in the data.

9.2 Examining results in high dimensions

This approach extends to high dimensions. One colours the observations by their cluster label, and overlays the final cluster means. If we see gaps between points within a single cluster, it means that \(k\)-means has failed to capture important cluster structure. This is what happens with the 4D penguins data, as shown in Figure 9.2.

p_km <- kmeans(penguins_sub[,1:4], centers=3, 
                     iter.max = 50, nstart = 5)
p_km_means <- data.frame(p_km$centers) %>%
  mutate(cl = factor(rownames(p_km$centers)))
p_km_d <- penguins_sub[,1:4] %>% 
  mutate(cl = factor(p_km$cluster))
# Code to make animated gifs
library(tourr)
p_km_means <- p_km_means %>%
  mutate(type = "mean")
p_km_d <- p_km_d %>%
  mutate(type = "data")
p_km_all <- bind_rows(p_km_means, p_km_d)
p_km_all$type <- factor(p_km_all$type, levels=c("mean", "data"))
p_pch <- c(3, 20)[as.numeric(p_km_all$type)]
p_cex <- c(3, 1)[as.numeric(p_km_all$type)]
animate_xy(p_km_all[,1:4], col=p_km_all$cl, 
           pch=p_pch, cex=p_cex, axes="bottomleft")
render_gif(p_km_all[,1:4],
           grand_tour(),
           display_xy(col=p_km_all$cl, 
                      pch=p_pch, 
                      cex=p_cex, 
                      axes="bottomleft"),
           gif_file="gifs/p_km.gif",
           width=400,
           height=400,
           frames=500)
Figure 9.2: Exploring the \(k\)-means clustering result for the 4D penguins data. You can see cluster 2 clearly separated from the other observations. Cluster 3, like in the 2D example, is a mix of observations across a gap. Even the mean of the cluster is almost in the gap.

Generally, there is no need to commit to a single choice of \(k\) ahead of time. One would re-fit \(k\)-means with various choices of \(k\), compare the tot.withinss, and examine the clusters visually, to decide on the final value of \(k\). The tot.withinss values can be assessed in a similar way to a scree plot for PCA.
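
A minimal sketch of this comparison for the 4D penguins data follows; the range of \(k\) values tried is an illustrative choice.

# re-fit k-means for a range of k and compare the total
# within-cluster sum of squares, read like a scree plot
p_wss <- data.frame(k = 2:8)
p_wss$tot.withinss <- sapply(p_wss$k, function(kk)
  kmeans(penguins_sub[, 1:4], centers = kk,
         iter.max = 50, nstart = 5)$tot.withinss)

ggplot(p_wss, aes(x = k, y = tot.withinss)) +
  geom_line() +
  geom_point() +
  theme_minimal()

The value of \(k\) where the decrease flattens out is a reasonable candidate, to be confirmed by examining the resulting clusters visually.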

Exercises

  1. Compute a \(k\)-means clustering for the fake_trees data, varying \(k\) up to about 20. Choose your best \(k\), and examine the solution using the first 10 PCs of the data. It should capture the data quite nicely, although it will break each branch into multiple clusters.
  2. Compute a \(k\)-means clustering of the first four PCs of the aflw data. Examine the best solution (you choose which \(k\)), and describe how it divides the data. By examining the means, can you tell whether it extracts clusters of offensive vs defensive vs midfield players? Or does it break the data into high-skill vs low-skill players?
  3. Use \(k\)-means clustering on the challenge data sets, c1-c7, from the mulgar package. Explain which choice of \(k\) is best for each data set, and why the cluster structure, as you described it in earlier chapters, is or is not detected.