Appendix A — Toolbox

A.1 Using tours in the tourr package

A.1.1 Installation

You can install the released version of tourr from CRAN with:

install.packages("tourr")

and the development version from the GitHub repo with:

# install.packages("remotes")
remotes::install_github("ggobi/tourr")

A.1.2 Getting started

To run a tour in R, use one of the animate functions. The following code will show a 2D tour displayed as a scatterplot on a 6D data set with three labelled classes.

library(tourr)
# Tour the six numeric flea variables, coloured by species (column 7)
animate_xy(flea[,-7], col=flea$species)

Wickham et al. (2011a) remains a good reference for learning more about this package. The package website has a list of current functionality.

A.1.3 Different tours

The two main components of the tour algorithm are the projection dimension, which affects the choice of display, and the algorithm that delivers the projection sequence. The primary functions for these two parts are:

  1. For display of different projection dimensions:
  2. To change the way projections are delivered:
  • grand_tour(): Smooth sequence of random projections to view all possible projections as quickly as possible. Good for getting an overview of the high-dimensional data, especially when you don’t know what you are looking for.
  • guided_tour(): Follow a projection pursuit optimisation to find projections that have particular patterns. This is used when you want to learn if the data has particular patterns, such as clustering or outliers. Use the holes() index to find projections with gaps that allow one to see clusters, or lda_pp() or pda_pp() when class labels are known and you want to find the projections where the clusters are separated.
  • little_tour(): Smoothly interpolate between pairs of variables, to show all the marginal views of the data.
  • local_tour(): Makes small movements around a chosen projection to explore a small neighbourhood. Very useful to learn whether small distances away from a projection change the pattern substantially or not.
  • radial_tour(): Interpolates a chosen variable out of the projection, and then back into the projection. This is useful for assessing the importance of variables to the pattern in a projection. If the pattern changes a lot when the variable is rotated out, then the variable is important for producing it.
  • dependence_tour(): Delivers two sequences of 1D grand tours, to examine associations between two sets of variables. This is useful for displaying two groups of variables, as in multiple regression, multivariate regression or canonical correlation analysis, as two independent 1D projections.
  • frozen_tour(): Allows the coefficients for some variables to be fixed, while the others vary.
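
As a rough illustration of how the two parts fit together, the sketch below pairs a tour path with a display through the general animate() function. It assumes the flea data that ships with tourr, as used in A.1.2. The object names are ours, and the animations are only run in an interactive session.

```r
library(tourr)

# Numeric variables only; column 7 holds the species labels
flea_num <- flea[, -7]

# A tour path function returns a generator that animate() steps through
path_grand_1d  <- grand_tour(d = 1)             # random 1D projections
path_guided_2d <- guided_tour(holes(), d = 2)   # optimise the holes index

if (interactive()) {
  # 1D projections shown as a density, 2D projections as a scatterplot
  animate(flea_num, tour_path = path_grand_1d, display = display_dist())
  animate(flea_num, tour_path = path_guided_2d,
          display = display_xy(col = flea$species))
}
```

Swapping the tour path while keeping the display fixed, or the reverse, is usually the quickest way to see what each part contributes.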

A.1.4 The importance of scale

Scaling of multivariate data matters in several ways. It affects most model fitting, and can affect the perception of patterns when data is visualised. Here we describe a few scaling issues to take control of when using tours.

Pre-processing data

It is generally useful to standardise your data to have mean 0 and variance-covariance equal to the identity matrix before using the tour. We use the tour to discover associations between variables. Characteristics of single variables should be examined and understood before embarking on looking for high-dimensional structure.

The rescale parameter in the animate() function will scale all variables to range between 0 and 1, prior to starting the tour. This forces all variables to have the same range. It is the default, and without it data with different ranges across variables may show some strange patterns. If you have already scaled the data yourself, even with a different scaling such as standardising the variables, you should set rescale=FALSE.

A more severe transformation that can be useful prior to starting a tour is to sphere the data. This is also an option in the animate() function, but is FALSE by default. Sphering is the same as conducting a principal component analysis, and using the principal components as the variables. It removes all linear association between variables! This can be especially useful if you want to focus on finding non-linear associations, including clusters, and outliers.
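
To make the three scalings concrete, here is a minimal base R sketch on simulated data. The data and the helper range01() are our own illustrations, not part of tourr. For sphering we assume that each principal component is also divided by its standard deviation, so that the variance-covariance matrix becomes the identity; using the raw component scores would remove the correlation but leave unequal variances.

```r
# Simulated data with correlated variables on very different scales
set.seed(1)
x1 <- rnorm(200, mean = 50, sd = 10)
x2 <- 0.02 * x1 + rnorm(200, sd = 0.1)
x3 <- runif(200, min = 0, max = 1000)
d <- data.frame(x1, x2, x3)

# Range scaling: every variable runs from 0 to 1
range01 <- function(v) (v - min(v)) / (max(v) - min(v))
d_01 <- as.data.frame(lapply(d, range01))

# Standardising: mean 0 and variance 1, correlations are kept
d_std <- as.data.frame(scale(d))

# Sphering: principal component scores, each divided by its
# standard deviation, so all linear association is removed
pc <- prcomp(d, center = TRUE, scale. = TRUE)
d_sph <- sweep(pc$x, 2, pc$sdev, "/")

round(cor(d_std), 2)   # x1 and x2 are still strongly correlated
round(cov(d_sph), 2)   # identity matrix
```

Data prepared this way would be passed to the tour with rescale=FALSE, as described above.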

Scaling to fit into plot region

The half_range parameter in most of the display types sets the range used to scale the data into the plot. It is estimated when a tour is started, but you may need to change it if you find that the data keeps escaping the plot window or is not fully using the space. Space expands exponentially as dimension increases, and the estimation takes this into account. However, different distributions of data points lead to different variance of observations in high-dimensional space. A skewed distribution will be more varied than a normal distribution. It is hard to estimate precisely how the data should be scaled so that it fits nicely into the plot space for all projections viewed.

The center parameter is used to centre each projection by setting the mean to be at the middle of the plot space. With different distributions the mean of the data can vary around the plot region, and this can be distracting. Fixing the mean of each projection to always be at the center of the plot space makes it easier to focus on other patterns.
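
The sketch below gives a feel for why the half range depends on the shape of the data. It uses the largest distance of any observation from the mean as a crude stand-in for the range that a plot would need to cover. The helper largest_distance() is hypothetical and is not necessarily how tourr estimates half_range.

```r
set.seed(2)
n <- 500
p <- 4
normal_d <- scale(matrix(rnorm(n * p), ncol = p))   # symmetric
skewed_d <- scale(matrix(rexp(n * p), ncol = p))    # right-skewed

# Hypothetical helper: largest Euclidean distance from the mean
largest_distance <- function(x) {
  centred <- scale(x, center = TRUE, scale = FALSE)
  max(sqrt(rowSums(centred^2)))
}

hr_normal <- largest_distance(normal_d)
hr_skewed <- largest_distance(skewed_d)
c(normal = hr_normal, skewed = hr_skewed)

# The values can be supplied to a display if the automatic choice
# leaves points outside the plot, or leaves too much empty space
if (interactive() && requireNamespace("tourr", quietly = TRUE)) {
  tourr::animate_xy(skewed_d, half_range = hr_skewed, center = TRUE)
}
```

Both data sets have the same variance in every variable, yet the skewed data reaches further from its mean, so it needs a larger half range to stay inside the plot.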

A.1.5 Saving your tour

The functions save_history() and planned_tour() allow the tour path to be pre-computed, and re-played in your chosen way. The tour path is saved as a list of projection vectors, which can also be passed to external software for displaying tours easily. Only a minimal set of projections is saved, by default, and a full interpolation path of projections can always be generated from it using the interpolate() function.
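
A small sketch of this workflow, again assuming the flea data from tourr. The choice of five bases and the interpolation angle are arbitrary values picked for illustration.

```r
library(tourr)

set.seed(3)
flea_num <- flea[, -7]

# Pre-compute a short grand tour path: only the target bases are stored
path <- save_history(flea_num, tour_path = grand_tour(d = 2), max_bases = 5)
dim(path)        # variables x projection dimension x number of bases

# Fill in the intermediate projections between the saved bases
full_path <- interpolate(path, angle = 0.2)
dim(full_path)

# Replay exactly the same path, here as a scatterplot
if (interactive()) {
  animate_xy(flea_num, tour_path = planned_tour(path), col = flea$species)
}
```

Because the path is stored, the same sequence of projections can be replayed with a different display or with different colouring.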

Versions and elements of tours can be saved for publication using a variety of functions:

  • render_gif(): Save a tour as an animated gif, using the gifski package.
  • render_proj(): Save an object that can be used to produce a polished rendering of a single projection, possibly with ggplot.
  • render_anim(): Creates an object containing a sequence of projections that can be used with plotly() to produce an HTML animation, with interactive control.

A.1.6 Understanding your tour path

Figure A.1 shows tour paths on 3D data spaces. For 1D projections the space of all possible projections is a \(p\)-dimensional sphere (Figure A.1 (a)). For 2D projections the space of all possible projections is a \(p\times 2\)-dimensional torus (Figure A.1 (b)). The geometry is elegant.

In these figures, the space is represented by the light colour, and is constructed by simulating a large number of random projections. The two darker colours indicate paths generated by a grand tour and a guided tour. The grand tour will cover the full space of all possible projections if allowed to run for some time. The guided tour will quickly converge to an optimal projection, so will cover only a small part of the overall space.

(a) 1D tour paths
(b) 2D tour paths
Figure A.1: Grand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection.
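
The idea of simulating a large number of random projections to represent the space can be sketched in base R as follows. This is our own minimal version, assuming normalised Gaussian vectors for 1D projections and a QR decomposition for 2D projections; the figure itself may have been produced differently. The helpers random_1d() and random_2d() are hypothetical names.

```r
set.seed(4)
p <- 3   # dimension of the data

# A random 1D projection: a random direction scaled to unit length
random_1d <- function(p) {
  v <- rnorm(p)
  v / sqrt(sum(v^2))
}

# A random 2D projection: two orthonormal columns obtained by QR
random_2d <- function(p) {
  qr.Q(qr(matrix(rnorm(p * 2), ncol = 2)))
}

# Many 1D projections, one per row: all lie on the unit sphere
space_1d <- t(replicate(1000, random_1d(p)))
range(rowSums(space_1d^2))

# A 2D projection written as a single point with p x 2 coordinates
A <- random_2d(p)
round(crossprod(A), 10)   # identity: columns are orthonormal
as.vector(A)
```

Plotting the rows of space_1d in 3D gives the sphere, and a saved tour path can be overlaid on it in the same coordinates.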

A.2 What not to do

A.2.1 Discrete and categorical data

Tour methods are for numerical data, particularly real-valued measurements. If your data is numerical but discrete, it can look artificially clustered. Figure A.2 shows an example. The data is numeric but discrete, so it is ok to examine it in a tour. In this example, observations are overplotted and appear artificially clustered (plot a). It can be helpful to jitter observations, by adding a small amount of noise (plot b). This helps to remove the artificial clustering, but preserves the main pattern, which is the strong linear association. Generally, jittering is a useful tool for working with discrete data, so that you can focus on examining the multivariate association. If the data is categorical, with no natural ordering of categories, the tour is not advised.

Discrete data code
library(tourr)
library(dplyr)   # %>% and mutate()
set.seed(430)
df <- data.frame(x1 = sample(1:6, 107, replace=TRUE)) %>% 
          mutate(x2 = x1 + sample(1:2, 107, replace=TRUE),
                 x3 = x1 - sample(1:2, 107, replace=TRUE),
                 x4 = sample(1:3, 107, replace=TRUE))
animate_xy(df)
render_gif(df,           
           grand_tour(),
           display_xy(),
           gif_file = "gifs/discrete_data.gif",
           frames = 100,
           width = 300, 
           height = 300)

dfj <- df %>%
  mutate(x1 = jitter(x1, 2), 
         x2 = jitter(x2, 2),
         x3 = jitter(x3, 2),
         x4 = jitter(x4, 2))
animate_xy(dfj)
render_gif(dfj,           
           grand_tour(),
           display_xy(),
           gif_file = "gifs/jittered_data.gif",
           frames = 100,
           width = 300, 
           height = 300)
(a) Discrete data
(b) Jittered data
Figure A.2: Discrete data can look like clusters, which is misleading. Adding a small amount of jitter (random number) can help. The noise is not meaningful but it could allow the viewer to focus on linear or non-linear association between variables without being distracted by artificial clustering.

A.2.2 Missing values

Code to handle missing values
library(dplyr)   # filter(), rename(), select(), mutate()
library(naniar)
library(ggplot2)
library(colorspace)
data("oceanbuoys")
ob_p <- oceanbuoys %>%
  filter(year == 1993) %>%
  ggplot(aes(x = air_temp_c,
           y = humidity)) +
     geom_miss_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  theme_minimal() + 
  theme(aspect.ratio=1)
ob_nomiss_below <- oceanbuoys %>%
  filter(year == 1993) %>%
  rename(st = sea_temp_c,
         at = air_temp_c,
         hu = humidity) %>%
  select(st, at, hu) %>%
  rowwise() %>%
  mutate(anymiss = factor(ifelse(naniar:::any_na(c(st, at, hu)), TRUE, FALSE))) %>%
  add_shadow(st, at, hu) %>%
  impute_below_if(.predicate = is.numeric) 
ob_nomiss_mean <- oceanbuoys %>%
  filter(year == 1993) %>%
  rename(st = sea_temp_c,
         at = air_temp_c,
         hu = humidity) %>%
  select(st, at, hu) %>%
  rowwise() %>%
  mutate(anymiss = factor(ifelse(naniar:::any_na(c(st, at, hu)), TRUE, FALSE))) %>%
  add_shadow(st, at, hu) %>%
  impute_mean_if(.predicate = is.numeric) 
ob_p_below <- ob_nomiss_below %>%
  ggplot(aes(x=st, y=hu, colour=anymiss)) +
  geom_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  theme_minimal() + 
  theme(aspect.ratio=1, legend.position = "none")
ob_p_mean <- ob_nomiss_mean %>%
  ggplot(aes(x=st, y=hu, colour=anymiss)) +
  geom_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  theme_minimal() + 
  theme(aspect.ratio=1, legend.position = "none")
Code to make animation
library(tourr)
animate_xy(ob_nomiss_below[,1:3], col=ob_nomiss_below$anymiss)
render_gif(ob_nomiss_below[,1:3],
           grand_tour(),
           display_xy(col=ob_nomiss_below$anymiss), 
           gif_file = "gifs/missing_values1.gif",
           frames = 100,
           width = 300, 
           height = 300)
render_gif(ob_nomiss_mean[,1:3],
           grand_tour(),
           display_xy(col=ob_nomiss_mean$anymiss), 
           gif_file = "gifs/missing_values2.gif",
           frames = 100,
           width = 300, 
           height = 300)

Missing values can also pose a problem for high-dimensional visualisation, but they shouldn’t just be ignored or removed. Methods used in 2D to display missings, such as placing them below the complete data as done in the naniar package (N. Tierney & Cook, 2023a), don’t translate well to high dimensions. Figure A.3 illustrates this. Placing the missings below the complete data leads to artificial clustering of observations (Figure A.3 (b)). It is better to impute the values, and mark them with colour when plotting. The cases are then included in the visualisation so we can assess the multivariate relationships, and also obtain some sense of how these cases should be handled, or imputed. In the example in Figure A.3 (d) we imputed the values simply, using the mean of the complete cases. We can see this is not an ideal approach for imputation for this data because some of the imputed values are outside the domain of the complete cases.

(a) Missings below in 2D
(b) Missings below in high-D
(c) Missings imputed in 2D
(d) Missings imputed in high-D
Figure A.3: Ways to visualise missings for 2D don’t transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). It is better to impute the value, and use colour to indicate that it is originally a missing value (d).

A.2.3 Context such as time and space

We occasionally hear statements like “time is the fourth dimension” or “space is the fifth dimension”. This is not a useful way to think about dimensionality.

If you have data with spatial or temporal context, we recommend that you do not include the time index or the spatial coordinates along with the multiple variables in the tour. Time and space are different types of variables, and should not be combined with the multivariate measurements.

For multivariate temporal data, we recommend using a dependence tour, where one axis is reserved for the time index, and the other axis is used to tour on the multiple variables. For spatial data, we recommend using an image tour, where horizontal and vertical axes are used for spatial coordinates and colour of a tile is used for the tour of multiple variables.
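
For the temporal case, a dependence tour could be set up along the following lines. The simulated series and the object names are illustrative assumptions. The vector pos assigns each column to an axis, here the time index to the horizontal axis and the three measured variables to the vertical axis.

```r
library(tourr)

# Simulated multivariate time series: a time index plus three variables
set.seed(5)
n <- 100
ts_d <- data.frame(time = 1:n,
                   v1 = cumsum(rnorm(n)),
                   v2 = cumsum(rnorm(n)),
                   v3 = cumsum(rnorm(n)))
ts_std <- as.data.frame(scale(ts_d))   # common scale for all columns

# Axis assignment, one entry per column: 1 = horizontal, 2 = vertical
pos <- c(1, 2, 2, 2)
dep_path <- dependence_tour(pos)

if (interactive()) {
  # Time stays on the horizontal axis while the vertical axis
  # tours through 1D projections of v1, v2 and v3
  animate_xy(ts_std, tour_path = dep_path)
}
```

Because time is the only variable assigned to the horizontal axis, it is never mixed with the measurements.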

A.3 Tours in other software

Tours are available in various software packages. For most examples we use the tourr package, but the same purpose could be achieved with other software. We also use some of this software in the book, when the tourr package is not up to the task. For information about these packages, their websites are the best places to start:

  • liminal: to combine tours with (non-linear) dimension reduction algorithms.
  • detourr: animations for {tourr} using htmlwidgets for performance and portability.
  • langevitour: HTML widget that randomly tours projections of a high-dimensional dataset with an animated scatterplot.
  • woylier: alternative method for generating a tour path by interpolating between d-D frames in p-D space rather than d-D planes.
  • spinifex: manual control of dynamic projections of numeric multivariate data.
  • ferrn: extracts key components in the data object collected by the guided tour optimisation, and produces diagnostic plots.

A.4 Supporting software

  • classifly: This package is used heavily for supervised classification.

The explore() function is used to explore the classification model. It will predict the class of a sample of points in the predictor space (.TYPE=simulated), and return this in a data frame with the observed data (.TYPE=actual). The variable .BOUNDARY indicates that a point is within a small distance of the classification boundary, when the value is FALSE. The variable .ADVANTAGE gives an indication of the confidence with which an observation is predicted, so can also be used to select simulated points near the boundary.
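
A minimal sketch of how explore() might be used, assuming a linear discriminant analysis fitted with MASS::lda() to the iris data; both are our choices for illustration. The column names follow the description above, and the number of simulated points is arbitrary.

```r
library(MASS)        # lda()
library(classifly)   # explore()

# Fit a classifier on four numeric predictors
fit <- lda(Species ~ Sepal.Length + Sepal.Width +
             Petal.Length + Petal.Width, data = iris)

# Predict the class on a sample of points in the predictor space,
# returned together with the observed data
set.seed(6)
grid <- explore(fit, iris, n = 1000)
table(grid$.TYPE)

# Keep simulated points flagged as close to the classification
# boundary (.BOUNDARY is FALSE, as described above)
near_boundary <- grid[which(grid$.TYPE == "simulated" & !grid$.BOUNDARY), ]
nrow(near_boundary)
```

The points in near_boundary, together with the actual observations, could then be shown in a tour to examine the shape of the boundary between classes.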