6  Overview

Unsupervised classification, or cluster analysis, organizes observations into similar groups. It is a commonly used, appealing, and conceptually intuitive statistical method. Some of its uses include market segmentation, where customers are grouped into clusters with similar attributes for targeted marketing; gene expression analysis, where genes with similar expression patterns are grouped together; and the creation of taxonomies for animals, insects, or plants. Clustering can be used to reduce a massive amount of data, because the observations within a cluster can be summarized by the cluster centre. Clustering also effectively subsets the data, which simplifies analysis because the observations in each cluster can be analyzed separately.

6.1 What are clusters?

Organizing objects into groups is a common task that helps us make sense of the world around us, which is perhaps why clustering is such an appealing method of data analysis. However, cluster analysis is more complex than it initially appears. Many people imagine that it will produce neatly separated clusters like those in Figure 6.1(a), but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to relax our objective from “find the natural clusters in this data” to “organize the cases into groups that are similar in some way.” Even though this may seem disappointing when compared with the ideal, it is still often an effective means of simplifying and understanding a dataset.

Knowing what shapes are in your data helps you to decide on the best method and to diagnose the result. For example, if the clusters are elliptical, model-based clustering is recommended.

Figure 6.1: Different structures in data impact cluster analysis. When there are well-separated groups (a), it is simple to group similar observations. Even when there are not, partitioning observations into groups may still be useful. There may be nuisance observations (b) or nuisance variables (c) that affect the interpoint distance calculations and distract the clustering algorithm, and there may be oddly shaped clusters (d) that are hard to describe numerically.

At the heart of the clustering process is the work of discovering which variables are most important for defining the groups. It is often true that only a subset of the variables is needed for finding clusters, whereas another subset (the nuisance variables) has no impact. In Figure 6.1(c), it is clear that the variable plotted horizontally is important for splitting the data into two clusters, whereas the variable plotted vertically is a nuisance variable. Nuisance is an apt term for these variables, because they can radically change the interpoint distances and impair the clustering process.

Dynamic graphical methods help us to find and understand the cluster structure in high dimensions. With the tools in our toolbox, primarily tours, along with linked scatterplots and parallel coordinate plots, we can see clusters in high-dimensional spaces. We can detect gaps between clusters, the shape and relative positions of clusters, and the presence of nuisance variables. We can even find unusually shaped clusters, like those in Figure 6.1(d). In simple situations we can use graphics alone to group observations into clusters, using a “spin and brush” method. In more difficult data problems, we can assess and refine numerical solutions using graphics.
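For readers who want to try this dynamic approach themselves, the tourr package can run a tour from the R console. Below is a minimal sketch using the flea beetle data bundled with tourr; any multivariate numeric dataset could be substituted.

library(tourr)
# Grand tour of the six physical measurements in the flea data,
# colouring points by species so that cluster structure shows up as it rotates
animate_xy(flea[, 1:6], col = flea$species)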

This part of the book discusses the use of interactive and dynamic graphics in the clustering of data. Section 6.2 introduces cluster analysis, focusing on interpoint distance measures. Chapter 7 describes a purely graphical approach to cluster analysis, the spin and brush method. In the example shown in that chapter, we were able to find simplifications of the data that had not been found using numerical clustering methods, and to find a variety of structures in high-dimensional space. Chapter 8 describes methods for reducing the interpoint distance matrix to an intercluster distance matrix using hierarchical algorithms, Chapter 10 covers model-based clustering, and Chapter 11 describes clustering with self-organising maps. Each of these chapters shows how graphical tools can be used to assess the results of numerical methods. Chapter 12 summarizes this part of the book and revisits the data analysis strategies used in the examples. Additional references that provide good companions to the material presented in these chapters are Venables & Ripley (2002a), Boehmke & Greenwell (2019), Hennig et al. (2015), Giordani et al. (2020), Kassambara (2017), and the CRAN Task View (Leisch & Gruen, 2023).

6.2 The importance of defining similar

Before we can begin finding groups of cases that are similar1, we need to decide how to define or measure whether they are close together or far apart. Consider a dataset with three cases \((a_1, a_2, a_3)\) and four variables \((V_1, V_2, V_3, V_4)\), described in matrix format as

\[\begin{align*} X = \begin{bmatrix} & {\color{grey} V_1} & {\color{grey} V_2} & {\color{grey} V_3} & {\color{grey} V_4} \\\hline {\color{grey} a_1} | & x_{11} & x_{12} & x_{13} & x_{14} \\ {\color{grey} a_2} | & x_{21} & x_{22} & x_{23} & x_{24} \\ {\color{grey} a_3} | & x_{31} & x_{32} & x_{33} & x_{34} \end{bmatrix} = \begin{bmatrix} & {\color{grey} V_1} & {\color{grey} V_2} & {\color{grey} V_3} & {\color{grey} V_4} \\\hline {\color{grey} a_1} | & 7.3 & 7.6 & 7.7 & 8.0 \\ {\color{grey} a_2} | & 7.4 & 7.2 & 7.3 & 7.2 \\ {\color{grey} a_3} | & 4.1 & 4.6 & 4.6 & 4.8 \end{bmatrix} \end{align*}\]

which is plotted in Figure 6.2. The Euclidean distance between two cases (rows of the matrix) with \(p\) elements is defined as

\[\begin{align*} d_{\rm Euc}(a_i,a_j) &= ||a_i-a_j|| \\ &= \sqrt{(x_{i1}-x_{j1})^2+\dots + (x_{ip}-x_{jp})^2}, \quad i,j=1,\dots, n, \end{align*}\]

where \(||x_i||=\sqrt{x_{i1}^2+x_{i2}^2+\dots +x_{ip}^2}\). For example, the Euclidean distance between cases 1 and 2 in the above data is

\[\begin{align*} d_{\rm Euc}(a_1,a_2) &= \sqrt{(7.3-7.4)^2+(7.6-7.2)^2+ (7.7-7.3)^2+(8.0-7.2)^2} \\ &= 1.0 \end{align*}\]

For the three cases, the interpoint Euclidean distance matrix is

\[\begin{align*} d_{\rm Euc} = \begin{bmatrix} & {\color{grey} a_1} & {\color{grey} a_2} & {\color{grey} a_3} \\\hline {\color{grey} a_1} | & 0.0 & 1.0 & 6.3 \\ {\color{grey} a_2} | & 1.0 & 0.0 & 5.5 \\ {\color{grey} a_3} | & 6.3 & 5.5 & 0.0 \end{bmatrix} \end{align*}\]
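In R, these distances can be checked with the base dist() function; a minimal sketch, entering the three cases as rows of a matrix:

# The three cases from the example, as rows of a matrix
X <- matrix(c(7.3, 7.6, 7.7, 8.0,
              7.4, 7.2, 7.3, 7.2,
              4.1, 4.6, 4.6, 4.8),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a1", "a2", "a3"), paste0("V", 1:4)))
# Euclidean interpoint distances, rounded to one decimal place
round(dist(X, method = "euclidean"), 1)   # 1.0 (a1,a2), 6.3 (a1,a3), 5.5 (a2,a3)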

Code for plot
# Data from the example: three cases (a1, a2, a3) measured on four variables
x <- data.frame(V1 = c(7.3, 7.4, 4.1),
                V2 = c(7.6, 7.2, 4.6),
                V3 = c(7.7, 7.3, 4.6),
                V4 = c(8.0, 7.2, 4.8),
                point = factor(c("a1", "a2", "a3")))
library(GGally)
library(colorspace)
library(gridExtra)
# Scatterplot matrix of the four variables, with points coloured by case
pscat <- ggpairs(x, columns=1:4,
                 upper=list(continuous="points"),
                 diag=list(continuous="blankDiag"),
                 axisLabels="internal",
                 ggplot2::aes(colour=point)) +
    scale_colour_discrete_divergingx(
      palette = "Zissou 1", nmax=4) +
    xlim(3.7, 8.5) + ylim(3.7, 8.5) +
    theme_minimal() +
    theme(aspect.ratio=1)
pscat
Code for plot
# Parallel coordinate plot of the same data, one profile line per case
ppar <- ggparcoord(x, columns=1:4,
                   groupColumn = 5,
                   scale = "globalminmax") +
    scale_colour_discrete_divergingx(
      palette = "Zissou 1", nmax=4) +
  xlab("") + ylab("") +
  theme_minimal() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        legend.title = element_blank())
ppar

Figure 6.2: The scatterplot matrix (left) shows that cases \(a_1\) and \(a_2\) have similar values. The parallel coordinate plot (right) allows a comparison of other structure, showing the similarity in the trend of the profiles of cases \(a_1\) and \(a_3\).

Cases \(a_1\) and \(a_2\) are more similar to each other than they are to case \(a_3\), because the Euclidean distance between cases \(a_1\) and \(a_2\) is much smaller than the distance between cases \(a_1\) and \(a_3\) and between cases \(a_2\) and \(a_3\).

There are many different ways to calculate similarity. A similarity measure based on correlation distance can be useful; it is typically used where similarity of structure or shape is more important than similarity in magnitude.

As an example, see the parallel coordinate plot of the sample data at the right of Figure 6.2. Cases \(a_1\) and \(a_3\) are widely separated, but their shapes are similar (low, medium, medium, high). Case \(a_2\), although overlapping with case \(a_1\), has a very different shape (high, medium, medium, low). The Pearson correlation between two cases, \(\rho(a_i,a_j)\), is defined as

\[\begin{align*} \rho(a_i,a_j) = \frac{(a_i-c_i)^\top(a_j-c_j)} {\sqrt{(a_i-c_i)^\top(a_i-c_i)} \sqrt{(a_j-c_j)^\top(a_j-c_j)}} \end{align*}\]

Typically, \(c_i, c_j\) are the sample means of each case, \(\bar{a}_i,\bar{a}_j\). For these three observations, \(c_1=\bar{a}_1=7.650, c_2=\bar{a}_2=7.275, c_3=\bar{a}_3=4.525\). An interesting geometric fact is that if \(c_i, c_j\) are set to 0, as is commonly done, \(\rho\) is a generalized correlation that describes the cosine of the angle between the two data vectors. The correlation is then converted to a distance metric, with one possibility being as follows:

\[\begin{align*} d_{\rm Cor}(a_i,a_j) = \sqrt{2(1-\rho(a_i,a_j))} \end{align*}\]

This distance metric will treat cases that are strongly negatively correlated as the most distant. If you instead want strong negative correlation to count as close, you could take the absolute value of \(\rho(a_i,a_j)\) in the above equation and remove the multiplication by 2.
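Written out explicitly (the name \(d_{\rm AbsCor}\) is simply a label for this variant), that alternative would be

\[\begin{align*} d_{\rm AbsCor}(a_i,a_j) = \sqrt{1-|\rho(a_i,a_j)|} \end{align*}\]

so that both strongly positive and strongly negative correlations yield small distances.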

The interpoint distance matrix for the sample data using \(d_{\rm Cor}\) and the Pearson correlation coefficient is

\[\begin{align*} d_{\rm Cor} = \begin{bmatrix} & {\color{grey} a_1} & {\color{grey} a_2} & {\color{grey} a_3} \\\hline {\color{grey} a_1} | & 0.0 & 1.9 & 0.3 \\ {\color{grey} a_2} | & 1.9 & 0.0 & 2.0 \\ {\color{grey} a_3} | & 0.3 & 2.0 & 0.0 \end{bmatrix} \end{align*}\]
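As a check, this matrix can be reproduced in R by correlating the cases (transposing the data matrix so that cases become columns) and applying the conversion above; a minimal sketch, reusing the matrix X from the Euclidean example:

# Pearson correlations between cases: transpose so that cases are columns
rho <- cor(t(X))
# Convert to correlation distance, d = sqrt(2 * (1 - rho))
round(as.dist(sqrt(2 * (1 - rho))), 1)   # 1.9 (a1,a2), 0.3 (a1,a3), 2.0 (a2,a3)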

By this metric, cases \(a_1\) and \(a_3\) are the most similar, because the correlation distance between them is smaller than that between either of the other pairs of cases.

Note that these interpoint distances differ dramatically from those for Euclidean distance. As a consequence, the way the cases would be clustered is also very different. Choosing the appropriate distance measure is an important part of a cluster analysis.

After a distance metric has been chosen and a cluster analysis has been performed, the analyst must evaluate the results, and this is actually a difficult task. A cluster analysis does not generate \(p\)-values or other numerical criteria, and the process tends to produce hypotheses rather than testing them. Even the most determined attempts to produce the “best” results using modeling and validation techniques may result in clusters that, although seemingly significant, are useless for practical purposes. As a result, cluster analysis is best thought of as an exploratory technique, and it can be quite useful despite the lack of formal validation because of its power in data simplification.

Defining an appropriate distance metric from the context of the problem is one of the most important decisions. For example, if your variables are all numeric and on the same scale, then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance.
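For data that mixes numeric and categorical variables, one readily available option in R is Gower distance, as implemented in daisy() from the cluster package; a minimal sketch on a small made-up data frame (the variables here are hypothetical, for illustration only):

library(cluster)
# Hypothetical mixed-type data: one numeric and two categorical variables
customers <- data.frame(income = c(35, 72, 50, 41),
                        region = factor(c("N", "S", "S", "N")),
                        segment = factor(c("A", "A", "B", "B")))
# Gower dissimilarities lie in [0, 1] and handle mixed variable types
daisy(customers, metric = "gower")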

The context in which the data arises is the key to assessing the results. If the clusters can be characterized in a sensible manner, and they increase our knowledge of the data, then we are on the right track. To use an even more pragmatic criterion, if a company can gain an economic advantage by using a particular clustering method to carve up their customer database, then that is the method they should use.

Exercises

Use the following data to answer these questions:

     x1   x2   x3
a1 0.13 0.21 0.09
a2 0.91 0.95 0.85
a3 0.62 0.73 0.65
a4 0.21 0.92 0.43
  1. Compute the Euclidean distance between cases a1, a2, a3, a4.

  2. Compute the correlation distance (as defined above) between cases a1, a2, a3, a4.

  3. Which two points have the (a) biggest (b) smallest Mahalanobis (statistical) distance, assuming that the covariance matrix is:

    x1  x2  x3
x1 1.0 0.8 0.8
x2 0.8 1.0 0.8
x3 0.8 0.8 1.0

(The function mahalanobis will calculate this in R. Technically, it gives the distance between each case and the mean vector.)

  4. Is the ordering of distances between cases the same if Manhattan distance is used instead of Euclidean distance?

  5. Compute the Chebyshev distance between cases a1, a2, a3, a4.

  6. Compute the Bray-Curtis distance between cases a1, a2, a3, a4.

  7. Make a plot of the data, and write a paragraph describing how the different distance metrics agree and disagree on how close or far the cases are from each other.


  1. Both similarity and dissimilarity measures are used for defining how alike cases are. It can be confusing! They measure alikeness in opposite directions. With a dissimilarity measure, a smaller number means the cases are closer, as in a distance metric. A similarity measure usually ranges between 0 and 1, with 1 indicating that the cases are closer; correlation is an example.↩︎