2  Technical details

2.1 Notation conventions and R objects

The data can be considered to be a matrix of numbers, with the columns corresponding to variables and the rows corresponding to observations. It can be helpful to write this in mathematical notation, like:

\[\begin{eqnarray*} X_{n\times p} = [X_1~X_2~\dots~X_p]_{n\times p} = \left[ \begin{array}{cccc} X_{11} & X_{12} & \dots & X_{1p} \\ X_{21} & X_{22} & \dots & X_{2p}\\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \dots & X_{np} \end{array} \right]_{n\times p} \end{eqnarray*}\]

where \(X\) indicates the \(n\times p\) data matrix, \(X_j\) indicates variable \(j\), \(j=1, \dots, p\), and \(X_{ij}\) indicates the value of the \(j^{th}\) variable for the \(i^{th}\) observation. (It can be confusing to distinguish whether one is referring to an observation or a variable, because \(X_i\) may also be used to indicate observation \(i\). In descriptions where it is unclear we will use \(X_{i.}\) to indicate observation/row and \(X_{.j}\) to indicate variable/column. This will usually be accompanied by qualifying words such as observation or variable.)

Having notation is helpful for concise explanations of different methods, to explain how data is scaled, processed and projected for various tasks, and how different quantities are calculated from the data.

When there is a response variable (or variables), it is common to consider \(X\) to be the predictors, and to use \(Y\) to indicate the response. \(Y\) could also be a matrix, of dimension \(n\times q\), though commonly \(q=1\). \(Y\) could be numeric or categorical, and this changes how it is handled in visualisation.

To make a low-dimensional projection (shadow) of the data onto \(d\) dimensions (\(d < p\)), we need an orthonormal basis:

\[\begin{eqnarray*} A_{p\times d} = \left[ \begin{array}{cccc} A_{11} & A_{12} & \dots & A_{1d} \\ A_{21} & A_{22} & \dots & A_{2d}\\ \vdots & \vdots & & \vdots \\ A_{p1} & A_{p2} & \dots & A_{pd} \end{array} \right]_{p\times d} \end{eqnarray*}\]

\(A\) should be an orthonormal matrix, which means that \(\sum_{j=1}^p A_{jk}^2=1, k=1, \dots, d\) (columns represent vectors of length 1) and \(\sum_{j=1}^p A_{jk}A_{jl}=0, k,l=1, \dots, d; k\neq l\) (columns represent vectors that are orthogonal to each other). In matrix notation, this can be written as \(A^{\top}A = I_d\).

Then the projected data is written as:

\[\begin{eqnarray*} Y_{n\times d} = XA = \left[ \begin{array}{cccc} y_{11} & y_{12} & \dots & y_{1d} \\ y_{21} & y_{22} & \dots & y_{2d}\\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \dots & y_{nd} \end{array} \right]_{n\times d} \end{eqnarray*}\]

where \(y_{ij} = \sum_{k=1}^p X_{ik}A_{kj}\). Note that we are using \(Y\) as the projected data here, as well as it possibly being used for a response variable. Where necessary, this will be clarified with words in the text, when notation is used in explanations later.

When using R, if we only have the data corresponding to \(X\) it makes sense to use a matrix object. However, if the response variable is included and it is categorical, then we might use a data.frame or a tibble which can accommodate non-numerical values. Then to work with the data, we can use the base R methods:

X <- matrix(c(1.1, 1.3, 1.4, 1.2, 
              2.7, 2.6, 2.4, 2.5, 
              3.5, 3.4, 3.2, 3.6), 
            ncol=4, byrow=TRUE)
X
     [,1] [,2] [,3] [,4]
[1,]  1.1  1.3  1.4  1.2
[2,]  2.7  2.6  2.4  2.5
[3,]  3.5  3.4  3.2  3.6

which is a data matrix with \(n=3, p=4\). To extract a column (variable):

X[,2]
[1] 1.3 2.6 3.4

or a row (observation):

X[2,]
[1] 2.7 2.6 2.4 2.5

or an individual cell (value):

X[3,2]
[1] 3.4

To make the data projection we need an orthonormal matrix:

A <- matrix(c(0.707,0.707,0,0,0,0,0.707,0.707),
            ncol=2, byrow=FALSE)
A
      [,1]  [,2]
[1,] 0.707 0.000
[2,] 0.707 0.000
[3,] 0.000 0.707
[4,] 0.000 0.707

You can check that it is orthonormal (approximately, because 0.707 is a rounded value) by

sum(A[,1]^2)
[1] 0.999698
sum(A[,1]*A[,2])
[1] 0

and compute the projected data using matrix multiplication:

X %*% A
       [,1]   [,2]
[1,] 1.6968 1.8382
[2,] 3.7471 3.4643
[3,] 4.8783 4.8076
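Because 0.707 is rounded, \(A\) is only approximately orthonormal, as the column sum of squares of 0.999698 showed. If an exactly orthonormal basis is needed, the matrix can be orthonormalised, for example with base R's QR decomposition (tourr::orthonormalise, used in the exercises, serves the same purpose):

```r
# Orthonormalise A with the QR decomposition; qr.Q() returns a
# matrix whose columns are orthonormal and span the same plane as A.
A <- matrix(c(0.707, 0.707, 0, 0,
              0, 0, 0.707, 0.707),
            ncol = 2, byrow = FALSE)
A_on <- qr.Q(qr(A))
sum(A_on[, 1]^2)             # now exactly 1 (up to floating point)
sum(A_on[, 1] * A_on[, 2])   # still 0
```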

The magical number 0.707 used above, and to create the projection in Figure 1.2, arises from normalising a vector with equal contributions from each variable, (1, 1): dividing by \(\sqrt{2}\) gives (0.707, 0.707), to three decimal places.

The notation convention used throughout the book is:

n = number of observations
p = number of variables, dimension of data
d = dimension of the projection
g = number of groups, in classification
X = data matrix

2.2 Mechanics of tours

2.2.1 Different ways to choose target bases

Although there are a variety of different tour types, they are (almost) all composed of three core building blocks: a set of target projection bases, a method for interpolating between them, and a method to display the projected data. The manner in which the target planes are chosen primarily determines the type of tour. Figure 2.1 illustrates three main tour types: grand, guided and radial. The appendix has a list of the many others.

Tours are composed of three elements:

- a set of target projection bases
- an interpolation method
- a method to display the projected data

The original tour was called the grand tour (Asimov, 1985). In a grand tour, the target bases are chosen randomly from all possible projections. The reason to use a grand tour is to get an overview of the data quickly - it is possible to discover relationships between variables that were not pre-conceived. The particular grand tour available in the tourr package also has the feature that all projections are equally likely to be viewed, and it efficiently covers the space of all projections. The tour in the langevitour (Harrison, 2023) software is similar to a grand tour but uses a different dynamic to choose the tour path, based on jostling particles. It doesn’t need to do interpolation because changes are incremental.
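For intuition, a random target basis like those used by the grand tour can be sketched in base R by filling a \(p\times d\) matrix with standard normal values and orthonormalising it (the tourr package does this, with more care, internally):

```r
# Sketch: generate a random p x d orthonormal target basis by
# orthonormalising a matrix of standard normal values.
set.seed(2022)
p <- 4
d <- 2
A_target <- qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))
zapsmall(crossprod(A_target))   # the d x d identity: a valid basis
```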

Because the space gets large as dimension increases, the wait to see interesting structure in projections when using a grand tour can be long. So if you have an idea of the types of structure that would be interesting to observe, using a guided tour (Cook et al., 1995) may help. A guided tour chooses target bases according to a function describing what is interesting, and the tour path follows an optimisation of this function. For example, the LDA index (E.-K. Lee et al., 2005) is a function that describes the separation between known classes in a projection, and is defined as follows:

\[\begin{eqnarray*} I_{\rm LDA}(A) = 1- \frac{|A'WA|}{|A'(W+B)A|} \end{eqnarray*}\]

where \(B = \sum_{i=1}^g n_i(\bar{y}_{i.}-\bar{y}_{..})(\bar{y}_{i.}-\bar{y}_{..})'\) and \(W=\sum_{i=1}^g\sum_{j=1}^{n_i} (y_{ij}-\bar{y}_{i.})(y_{ij}-\bar{y}_{i.})'\) are the between- and within-group sum of squares matrices in a linear discriminant analysis, with \(g =\) number of groups, and \(n_i, i=1, \dots, g\) the number of cases in each group. The notation \(\bar{y}_{i.}\) means the mean value for group \(i\) across its \(n_i\) observations, and \(\bar{y}_{..}\) is the overall mean value across all groups.
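The index can be computed directly from projected data and group labels. This is a minimal base R sketch following the definition above (the function name lda_index is our own; the tourr package provides this index for use in a guided tour):

```r
# Sketch: compute the LDA index from projected data y (n x d)
# and a vector of group labels, following the definition above.
lda_index <- function(y, group) {
  y <- as.matrix(y)
  ybar <- colMeans(y)                        # overall mean
  W <- matrix(0, ncol(y), ncol(y))           # within-group SS
  B <- matrix(0, ncol(y), ncol(y))           # between-group SS
  for (g in unique(group)) {
    yg <- y[group == g, , drop = FALSE]
    ygbar <- colMeans(yg)
    B <- B + nrow(yg) * tcrossprod(ygbar - ybar)
    W <- W + crossprod(sweep(yg, 2, ygbar))
  }
  1 - det(W) / det(W + B)
}

# Well-separated groups give an index close to 1
set.seed(1)
y <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))
lda_index(y, rep(1:2, each = 50))
```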

The radial tour (Laa et al., 2023) provides a way to assess the importance of any variable or combination of variables. The target basis is chosen by zeroing out a variable (or multiple variables), and the interpolation runs from the initial projection to the target and back to the initial. If the structure observed in the initial plot doesn’t change much when the variable(s) are removed, then the variable is not important. This is often best combined with the guided tour, where the radial tour starts from the best (most structured) projection (as done in Figure 2.1 c), and each of the variables can be tested for its importance in producing the structure.
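The construction of a radial tour target can be sketched by zeroing the basis row for the chosen variable and re-orthonormalising (a rough sketch of the idea only; the tourr and woylier implementations handle the details, such as orientation):

```r
# Sketch: a radial tour target that removes variable 2's contribution.
A <- matrix(c(0.5, 0.5, 0.5, 0.5,
              0.5, -0.5, 0.5, -0.5),
            ncol = 2)
A_radial <- A
A_radial[2, ] <- 0                # zero out variable 2
A_radial <- qr.Q(qr(A_radial))    # restore orthonormality;
                                  # row 2 remains zero
zapsmall(crossprod(A_radial))     # identity: still a valid basis
```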

Each of these tour types can be run in the tourr package by setting the tour_path parameter of the animate() function (and also when using specific displays via the animate_XXX() functions; a full list is given in the appendix). The paths of projection bases can also be generated and saved for later use with the save_history() function. This is the way to save a particular data projection, or to generate sequences of projections to pass to external software.

(a) grand tour: animation of a sequence of 2D projections of the three penguin species on the four variables bl, bd, fl, bm. The projections change gradually; the species can be seen to differ because separations appear in some projections and the three clusters move differently, and a few unusual penguins appear as outliers.

(b) guided tour: animation of a sequence of 2D projections in which the three species gradually become more separated, ending at a projection with contributions from all four variables.

(c) radial: animation starting from the projection in which the three species are distinct; the contribution of fl is reduced to zero and then restored, which has only a small effect on the separation between the groups.

Figure 2.1: Three main tour types for examining multivariate data: (a) grand tour for obtaining an overview, (b) guided tour to explore for particular structure, (c) radial tour to assess the importance of a variable. The guided tour does show the separations between the three groups better, but can miss unexpected features like the anomalies. The radial tour shows that fl is not contributing much to the separation of the Gentoo (red) penguins from the others.
Two illustrations of interpolation paths: (a) geodesic shows a sequence of planes (plane-to-plane interpolation), and (b) Givens shows a sequence of Cartesian axes (basis-to-basis interpolation), each moving towards an orange target pattern; the target appears on its side under geodesic interpolation but upright under Givens.

Figure 2.2: Two different types of interpolation to a target: (a) geodesic plane-to-plane, (b) basis-to-basis using Givens rotations. The geodesic interpolation follows the shortest path between planes. Givens respects the orientation of the basis when plotting or calculating projection pursuit indexes. Both methods maintain orthonormality of the bases at each step.

2.2.2 Interpolating between targets

Interpolating from the current basis to the target basis is most commonly achieved using geodesic interpolation. This type of interpolation (described in Buja et al. (2005)) has two especially important properties: (1) it is plane-to-plane, so it follows the shortest path, and (2) it maintains orthonormality of the basis at each step when \(d>1\). The plane-to-plane movement operates like video stabilisation, removing the within-plane spin so you can focus on structure without distracting movement. However, when using the guided tour, if the index is not rotationally invariant, for example if it is based on correlation or splines, it can be important to respect a particular orientation. For this rare situation, Givens interpolation, available in the woylier package (Batsaikhan et al., 2024), can be helpful. Figure 2.2 illustrates the difference.
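For intuition, in the \(d=1\) case geodesic movement reduces to rotating one unit vector towards another along a great circle, so every intermediate basis is still a unit vector. A minimal sketch of this special case (not the full plane-to-plane algorithm of Buja et al. (2005)):

```r
# Sketch: geodesic interpolation between two 1D bases (unit vectors).
# Each step rotates a towards b along the great circle between them.
a <- c(1, 0, 0)
b <- c(0, 1, 0)
theta <- acos(sum(a * b))       # angle between the two bases
steps <- seq(0, 1, by = 0.25)
path <- sapply(steps, function(t)
  (sin((1 - t) * theta) * a + sin(t * theta) * b) / sin(theta))
colSums(path^2)                 # all 1: orthonormal at every step
```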

2.2.3 Displaying the projected data

Based on the dimension of the projection basis there are natural ways to display the projected data. If \(d=2\) it is most common that we would use a scatterplot, as shown in Figure 2.1.

If \(d=1\) there are many choices of display - histogram, density plot, dotplot, boxplot - any method commonly used to display univariate continuous data (see animate_dist()). If \(d=1\) but a second (set of) variable(s) is available, such as a time index, a categorical variable index (animate_idx()), or spatial coordinates (animate_image()), then the display might reserve one axis for the projected data and another for the additional variables. Figure 2.3 shows two different display choices for \(d=1\): a histogram of the projected data (a), and an ordered dotplot of an index created by combining values of several ranking variables (b). In each display the box with horizontal lines at the bottom represents the projection coefficients. In (a) there are six variables contributing to the projection, four with negative coefficients, which reveals some bimodality and a potential anomaly in this data. In (b), nine ranking variables on the liveability of nine US cities are combined to give an overall rank for each city. (Data originally from Savageau & Boyer (1993).) This combination is a projection, and a tour is used to assess how the city ranks would change if the combination of these measures changed. That is, the use case is to assess how robust the ranking is to small changes in its construction.

When \(d>2\), even though it is primarily of academic interest, some possible choices are stereo displays to simulate 3D, scatterplot matrices, or parallel coordinate plots.

(a) histogram: histogram with two modes and a small separated bar on the right; segments below show the projection coefficients for t1, t2, hd, a1, a2, a3, with a1 the largest, pointing left.

(b) dotplot: ordered dotplot of city ranks, with segments at the bottom showing the projection coefficients 0.20, 0.13, 0.07, 0.10, 0.07, 0.14, 0.08, 0.08, 0.13 for cl, hs, hl, cr, tr, ed, ar, rc, ec, respectively.

Figure 2.3: Different choices for displaying projected data: (a) histogram, (b) ordered dotplot. In each case the projected data is 1D. In the histogram, the lines at the bottom of the display correspond to the projection basis. For the ordered dotplot, each point corresponds to a US city which has been measured on several criteria such as housing, climate, crime, and education. The lines at the bottom correspond to the coefficients for each of these ratings in the projection basis.

Adjustments to the display are useful for some problems. The crowding problem, where points concentrate in the centre of the plot as dimension increases, can be alleviated with a transformation of the projected data documented in Laa et al. (2022) and available with the animate_sage() function. When the data is particularly dense, for example simulated data filling out a full \(p\)-dimensional cube, generated to understand a fitted model, it can be useful to use slicing (Laa et al., 2020a). Slicing fades out observations beyond a given distance from the centre of the data, relative to the projection. Figure 2.4 illustrates the slice display available using the animate_slice() function.
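The slice computation itself is simple: an observation's distance from the projection plane (through the data centre) is the norm of the component left after removing its projection. A base R sketch of the idea (the animate_slice() function implements the full display; slice_distance is our own name):

```r
# Sketch: orthogonal distance of each observation from the projection
# plane spanned by the orthonormal basis A. Points with distance above
# a cutoff would be faded out in a slice display.
slice_distance <- function(X, A) {
  X <- sweep(as.matrix(X), 2, colMeans(X))  # centre the data
  resid <- X - X %*% A %*% t(A)             # component orthogonal to plane
  sqrt(rowSums(resid^2))
}

set.seed(3)
X <- matrix(rnorm(40), ncol = 4)            # 10 observations, p = 4
A <- qr.Q(qr(matrix(rnorm(8), ncol = 2)))   # a random orthonormal basis
d <- slice_distance(X, A)
mean(d < 1)   # fraction of points inside a slice of thickness 1
```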

(a) projection: animation of 2D projections of points on the surface of a 3D donut; the donut shape can roughly be seen.

(b) slice: animation of slices of 2D projections of the same points; circular shapes reveal the donut to be hollow.

Figure 2.4: Slicing cuts through high density, and is useful for finding hollowness in high-dimensional data, or for understanding the limitations of fitted models. Here points on the surface of a 3D torus are shown using (a) projections, and (b) slices. The slicing reveals the hollowness of this data with the circular cuts to the donut shape.

It can also be useful to add layers to the display, such as summary statistics, labels for selected observations, or connecting points with lines to represent model fits. Figure 2.5 illustrates two use cases. In (a) labels for specific players are shown when examining player statistics in the aflw data. All three were named best players in the 2021 AFLW season: Bowers and Davey were equal best and fairest, and Vescio was awarded best goal kicker. All three have quite different player statistics profiles. The play of Bowers and Davey differs in the number of handballs and kicks - Davey tends to distribute the ball with handballs, while Bowers tends to kick. In (b) the lines indicate constraints. This is compositional data where each row adds to 1. It is the Fireworks data from the compositions (van den Boogaart et al., 2024) package, and measures the mixtures of five different ingredients. We can see that the mixtures are relatively equal because all points are well centred within the constraints. We can also see points banding into stripes, suggesting that there are just a small number of firework recipes.

(a) labelling: animation of 2D projections of four variables hnd, gls, bhn, kck, with three labelled points, Davey, Bowers and Vescio, sitting in the extremes of many projections; the points concentrate in the middle and thin out away from it, with stripes in some projections indicating a discrete-valued variable.

(b) constraints: animation of 2D projections from 4D, with lines corresponding to the simplex constraint surrounding the orange points; the points are mostly central and fall along discrete sets of lines.

Figure 2.5: It is sometimes useful to overlay additional information, such as labels for a few points (a) or representing constraints (b). The labels in (a) correspond to players that won awards in the 2021 AFLW season. The lines in (b) correspond to constraints - this is compositional data where each 5D observation is constrained to sum to 1.

It is often asked why the data display does not use solid shapes, or provide depth cues with point size or colour. The reason is that the comprehensive research on lighting models and perspective conducted for 3D graphics does not extend beyond 3D. For example, a light source, like a lamp, is a point object, and in 4D the shadows it casts would be 3D; a surface in 4D is also 3D. Points are used to represent objects, and lines to indicate geometric features, because these remain meaningful in any dimension.

Exercises

  1. Generate a matrix \(A\) with \(p=5\) (rows) and \(d=2\) (columns), where each value is randomly drawn from a standard normal distribution. Extract the element at row 3 and column 1.
  2. We will interpret \(A\) as an orthonormal basis and therefore it needs to be checked for orthonormality, and if it fails, then to be orthonormalised. Use the function tourr::is_orthonormal to explicitly check that each column is normalised and that the two columns are orthogonal. If they are not, then use tourr::orthonormalise to make them so. For the fixed version of \(A\), which dimensions contribute most to the projection, horizontally and vertically?
  3. Use matrix multiplication to calculate the projection of the mulgar::clusters data onto the 2D plane defined by \(A\). Make a scatterplot of the projected data. Can you identify clustering in this view?
  4. Use save_history() to generate a grand tour path for the mulgar::clusters data, with three target planes. Extract and report the second target basis, and plot the resulting data.
  5. Using the saved path, generate the interpolation between the target planes using interpolate(). How many bases are in the tour path? Plot the first four in the set of planes, and explain what is happening.
  6. Repeat save_history() but use a holes index guided tour. Plot the final projection. Does the guided tour find the three clusters? Which of the variables have the largest contributions to the differences between groups?
  7. Repeat save_history() using a radial tour, starting from the best projection basis from the guided tour, and exploring the contribution of x1 (mvar=1). Why should the length of the path be set to three? Why is the last basis the same as the first? Examine the second target basis, and explain how it is different from the first basis.
  8. Run the tours from exercises 4-7 using animate_xy(). You can use the planned_tour method, but it is more interesting to simply re-generate them using the different tour_path methods. It is interesting to examine the importance of x5 and x4 to the clustering seen in the best projection from the guided tour.
  9. Examine tours of 1D projections of the mulgar::clusters data, using density or histogram rendering. Can the three clusters be seen with 1D projections?
  10. Explore the use of the slice tour on these geometric objects, which can be simulated using geozoo:

      a. Roman surface, generated by

rs <- geozoo::roman.surface()$points |>
  scale() |>
  as.data.frame()

      b. solid 4D sphere, generated by

s_solid <- geozoo::sphere.solid.random(4, 2000)$points |>
  as.data.frame()

      c. hollow 4D sphere, generated by

s_hollow <- geozoo::sphere.hollow(4, 2000)$points |>
  as.data.frame()

Use both regular tours, and the slice tours on each. What does the slice tour allow us to see in the Roman Surface? What is the difference between the solid and hollow spheres?