3  Dimension reduction overview

This chapter sets up the concepts related to dimension reduction methods such as principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE), and how the tour can be used to assist with these methods.

3.1 The meaning of dimension

The number of variables, \(p\), is considered to be the dimension of the data. However, the observed data may live in a lower-dimensional subspace, meaning that the observations do not fill out the full \(p\) dimensions. This implicit dimensionality is perceived in a tour through the spread of points. When the points are spread far apart, the data is filling the space. Conversely, when the points “collapse” into a sub-region, the data is only partially filling the space, and dimension reduction to this smaller dimensional space may be worthwhile.

When exploring the implicit dimensionality of multivariate data we are looking for projections where the points do not fill the plotting canvas fully. This indicates that the observed values do not fully populate the \(p\)-dimensional space.

Let’s start with some 2D examples. Figure 3.1 shows three plots of two variables. Plot (a) shows two variables that are strongly linearly associated1: when x1 is low, x2 is also low, and when x1 is high, x2 is also high. This can also be seen in the reduction in spread of points (the “collapse”) in one direction, which makes the data fill less than the full square of the plot. From this we can conclude that the data is not fully 2D. The second step is to infer which variables contribute to this reduction in dimension. The axes for x1 and x2 are drawn extending from \((0,0)\). Because both axes extend out of the cloud of points, in the direction away from the collapse, we can say that the two variables are jointly responsible for the dimension reduction.

Figure 3.1 (b) shows a pair of variables that are not linearly associated. Variable x1 has more spread than x3, but knowing the value of x1 tells us nothing about the possible values of x3. Before running a tour, all variables are typically scaled to have equal spread. The purpose of the tour is to capture association and relationships between the variables, so any univariate differences should be removed ahead of time. Figure 3.1 (c) shows what this looks like when x3 is scaled - the points spread out to fill the full square of the plot.

Code to produce 2D data examples
library(tibble)
set.seed(6045)
x1 <- runif(123)
x2 <- x1 + rnorm(123, sd=0.1)  # strongly linearly associated with x1
x3 <- rnorm(123, sd=0.2)       # independent of x1, with smaller spread
# Standardise x1 and x2; keep both raw and standardised versions of x3
df <- tibble(x1 = (x1-mean(x1))/sd(x1), 
             x2 = (x2-mean(x2))/sd(x2),
             x3, 
             x3scaled = (x3-mean(x3))/sd(x3))
Three scatterplots: (a) points lie close to a straight line in the x=y direction, (b) points lie close to a horizontal line, (c) points spread out in the full plot region. There are no axis labels or scales.
Figure 3.1: Explanation of how dimension reduction is perceived in 2D, relative to variables: (a) Two variables with strong linear association. Both variables contribute to the association, as indicated by their axes extending out from the ‘collapsed’ direction of the points; (b) Two variables with no linear association. But x3 has less variation, so points collapse in this direction; (c) The situation in plot (b) does not arise in a tour because all variables are (usually) scaled. When an axis extends out of a direction where the points are collapsed, it means that this variable is partially responsible for the reduced dimension.

The way variables are scaled can affect the apparent dimensionality. If the variables are scaled together, using global values, some variables may have smaller variance than others. Scaling variables individually shifts the focus to association between variables as the predominant reason for reduced dimension.
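To make the distinction concrete, here is a minimal sketch contrasting the two approaches, using the df simulated above (the names df_global and df_indiv are ours):

x <- as.matrix(df[, c("x1", "x3")])
# Global scaling: one centre and spread for all variables together,
# so x3 keeps its smaller variance, as in Figure 3.1 (b)
df_global <- as.data.frame((x - mean(x))/sd(x))
# Individual scaling: each variable standardised separately, so only
# association between variables remains to be seen, as in Figure 3.1 (c)
df_indiv <- as.data.frame(apply(x, 2, function(v) (v - mean(v))/sd(v)))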

Figure 3.2 illustrates other types of association that could indicate reduced dimensionality. Plot (a) shows a strong nonlinear association. Plots (b) and (c) show substantial regions of the data space that have no observations, which may mean there is some barrier or gap in the data generating process. The L-shape in plot (d) is a pattern where one variable only shows variability when the other is essentially constant, say close to zero: if one variable has a non-zero value, then the other is zero. This is a very strong association pattern, but not one that is captured by correlation. Plots (e) and (f) show other types of association: heterogeneous variance that depends on the subspace, and clustering of observations, respectively. While we might detect these patterns visually, using dimension reduction methods in their presence can be tricky.

Figure 3.2: Other types of association: (a) nonlinear, (b) gap between subspaces, (c) barrier beyond which no values are observed, perhaps a limiting inequality constraint, (d) L-shape where if one variable has a spread of values the other does not, (e) skewness or heterogeneous variance, (f) clustering.
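How some of these patterns might arise is perhaps easiest to see in code. The sketch below generates a few of them; it is illustrative only, not the code used to draw Figure 3.2, and the variable names are ours.

library(tibble)
set.seed(114)
n <- 200
d_assoc <- tibble(
  # (a) nonlinear association
  x1 = runif(n, -1, 1),
  x2 = x1^2 + rnorm(n, sd = 0.05),
  # (b) a gap: no observations in the middle of the range
  x3 = ifelse(runif(n) > 0.5, runif(n, 0.4, 1), runif(n, -1, -0.4)),
  # (d) an L-shape: each variable varies only when the other is near zero
  x4 = c(runif(n/2), abs(rnorm(n/2, sd = 0.02))),
  x5 = c(abs(rnorm(n/2, sd = 0.02)), runif(n/2))
)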

3.2 How to perceive the dimensionality using a tour

Now let’s think about what this looks like with five variables. Figure 3.3 shows a grand tour on five variables, with (a) data that is primarily 2D, (b) data that is primarily 3D and (c) fully 5D data. You can see that in both (a) and (b) the spread of points collapses in some projections, more often in (a). In (c) the data is always spread out in the square, although it does seem to concentrate or pile in the centre. This piling is typical when projecting from high dimensions to low dimensions. The sage tour (Laa et al., 2022) makes a correction for this.

Code to make animated gifs
library(mulgar)
library(tourr)   # render_gif()
data(plane)
data(box)
render_gif(plane,
           grand_tour(), 
           display_xy(),
           gif_file="gifs/plane.gif",
           frames=500,
           width=200,
           height=200)
render_gif(box,
           grand_tour(), 
           display_xy(),
           gif_file="gifs/box.gif",
           frames=500,
           width=200,
           height=200)
# Simulate full cube
library(geozoo)
cube5d <- data.frame(cube.solid.random(p=5, n=300)$points)
colnames(cube5d) <- paste0("x", 1:5)
cube5d <- data.frame(apply(cube5d, 2, function(x) (x-mean(x))/sd(x)))
render_gif(cube5d,
           grand_tour(), 
           display_xy(),
           gif_file="gifs/cube5d.gif",
           frames=500,
           width=200,
           height=200)
Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(a) 2D plane in 5D
Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections, but not as often as in the animation in (a). A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(b) 3D plane in 5D
Animation of sequences of 2D projections shown as scatterplots. You can see points are always spread out fully in the plot space, in all projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(c) 5D plane in 5D
Figure 3.3: Different dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. Notice that the 5D in 5D always fills out the box (although it does concentrate somewhat in the middle, which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D concentrates into a line more often than the 3D in 5D, suggesting that it is lower dimensional.
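The piling in the centre can be corrected in the display itself. Below is a minimal sketch using the sage display, assuming the display_sage() function in the tourr package, which implements the correction of Laa et al. (2022):

# Compare the standard display, where projected points pile in the
# centre, with the sage display, which reweights to undo the piling
# (assumes tourr provides display_sage())
library(tourr)
animate(cube5d, grand_tour(), display_xy())
animate(cube5d, grand_tour(), display_sage())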

The next step is to determine which variables contribute. In the examples just provided, all variables are linearly associated in the 2D and 3D data. You can check this by making a scatterplot matrix, Figure 3.4.

Code for scatterplot matrix
library(GGally)
library(mulgar)
data(plane)
ggscatmat(plane) +
  theme(panel.background = 
          element_rect(colour="black", fill=NA),
    axis.text = element_blank(),
    axis.ticks = element_blank())
A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlation printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation. The remaining pairs of variables have no association.
Figure 3.4: Scatterplot matrix of plane data. You can see that x1-x3 are strongly linearly associated, and so are x4 and x5. When you watch the tour of this data, any time it collapses into a line only (x1, x2, x3) or (x4, x5) should be contributing strongly. When combinations of x1 and x4 or x5 show, the data should be spread out.

To make an example where not all variables contribute, we have added two additional variables to the plane data set, which are purely noise.

# Add two pure noise dimensions to the plane
plane_noise <- plane
plane_noise$x6 <- rnorm(100)
plane_noise$x7 <- rnorm(100)
plane_noise <- data.frame(apply(plane_noise, 2, 
    function(x) (x-mean(x))/sd(x)))
Code
ggduo(plane_noise, columnsX = 1:5, columnsY = 6:7, 
    types = list(continuous = "points")) +
  theme(aspect.ratio=1,
    panel.background = 
          element_rect(colour="black", fill=NA),
    axis.text = element_blank(),
    axis.ticks = element_blank())
Two rows of scatterplots showing x6 and x7 against x1-x5. The points are spread out in the full plotting region, although x6 has one point with an unusually low value.
Figure 3.5: Scatterplots showing two additional noise variables that are not associated with any of the first five variables.

Now we have 2D structure in 7D, but only five of the variables contribute to the 2D structure, that is, five of the variables are linearly related with each other. The other two variables (x6, x7) are not linearly related to any of the others.

The data is viewed with a grand tour in Figure 3.6. We can still see the concentration of points along a line in some projections, which tells us that the data is not fully 7D. If you look closely at the variable axes you will see that the collapsing to a line only occurs when any of x1-x5 contribute strongly in the direction orthogonal to it. This does not happen when x6 or x7 contribute strongly to a projection - the data is always expanded to fill much of the space. That tells us that x6 and x7 don’t substantially contribute to the dimension reduction, that is, they are not linearly related to the other variables.

Code to generate animation
library(ggplot2)
library(plotly)
library(htmlwidgets)
library(tourr)   # basis_random(), save_history(), interpolate()
library(mulgar)  # render_anim()

set.seed(78)
b <- basis_random(7, 2)
pn_t <- tourr::save_history(plane_noise, 
                    tour_path = grand_tour(),
                    start = b,
                    max_bases = 8)
pn_t <- interpolate(pn_t, 0.1)
pn_anim <- render_anim(plane_noise,
                         frames=pn_t)

pn_gp <- ggplot() +
     geom_path(data=pn_anim$circle, 
               aes(x=c1, y=c2,
                   frame=frame), linewidth=0.1) +
     geom_segment(data=pn_anim$axes, 
                  aes(x=x1, y=y1, 
                      xend=x2, yend=y2, 
                      frame=frame), 
                  linewidth=0.1) +
     geom_text(data=pn_anim$axes, 
               aes(x=x2, y=y2, 
                   frame=frame, 
                   label=axis_labels), 
               size=5) +
     geom_point(data=pn_anim$frames, 
                aes(x=P1, y=P2, 
                    frame=frame), 
                alpha=0.8) +
     xlim(-1,1) + ylim(-1,1) +
     coord_equal() +
     theme_bw() +
     theme(axis.text=element_blank(),
         axis.title=element_blank(),
         axis.ticks=element_blank(),
         panel.grid=element_blank())
pn_tour <- ggplotly(pn_gp,
                        width=500,
                        height=550) |>
       animation_button(label="Go") |>
       animation_slider(len=0.8, x=0.5,
                        xanchor="center") |>
       animation_opts(easing="linear", 
                      transition = 0)

htmlwidgets::saveWidget(pn_tour,
          file="html/plane_noise.html",
          selfcontained = TRUE)
Figure 3.6: Grand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when any of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). This tells us that x6 and x7 are not linearly associated with the other variables, but x1-x5 are associated with each other.

To determine which variables are responsible for the reduced dimension, look for the axes that extend out of the point cloud. These variables contribute to the direction in which variation is smallest, and thus indicate where dimension reduction is possible.

The simulated data here is very simple, and what we have learned from the tour could also be learned from principal component analysis. However, if there are small complications, such as outliers or nonlinear relationships, that might not be visible from principal component analysis, the tour can help you to see them.
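As a quick check on these simple examples, PCA on the plane data should concentrate the variance in the first two principal components, matching the 2D structure seen in the tour. A minimal sketch:

# PCA on the plane data: the first two principal components should
# account for most of the variance, and the standard deviations of
# the later components should be small
library(mulgar)
data(plane)
plane_pca <- prcomp(plane)
summary(plane_pca)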

Figure 3.7 and Figure 3.8(a) show example data with outliers, and Figure 3.8(b) shows data with non-linear relationships.

Code
# Add several outliers to the plane_noise data
plane_noise_outliers <- plane_noise
plane_noise_outliers[101,] <- c(2, 2, -2, 0, 0, 0, 0)
plane_noise_outliers[102,] <- c(0, 0, 0,-2, -2, 0, 0)
Code for scatterplot matrix
ggscatmat(plane_noise_outliers, columns = 1:5) +
  theme(aspect.ratio=1,
    panel.background = 
          element_rect(colour="black", fill=NA),
    axis.text = element_blank(),
    axis.ticks = element_blank())
A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlation printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation, with an outlier in the corner of the plot. The remaining pairs of variables have no association, and thus also no outliers.
Figure 3.7: Scatterplot matrix of the plane with noise data, with two added outliers in variables with strong correlation.
Code to generate animated gif
render_gif(plane_noise_outliers,          
           grand_tour(), 
           display_xy(),
           gif_file="gifs/pn_outliers.gif",
           frames=500,
           width=200,
           height=200)

data(plane_nonlin)
set.seed(508)
render_gif(plane_nonlin,          
           grand_tour(), 
           display_xy(),
           gif_file="gifs/plane_nonlin.gif",
           frames=500,
           width=400,
           height=400)
Animation showing scatterplots of 2D projections from 7D. The points sometimes appear to be a plane viewed from the side, with two single points further away. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(a) Outliers
Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be lying on a curve in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(b) Non-linear relationship
Figure 3.8: Examples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in some projections. The two outliers can also be seen to have different movement patterns, moving faster and in different directions than the other points during the tour. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour.

Exercises

  1. Multicollinearity is when the predictors for a model are strongly linearly associated. It can adversely affect the fitting of most models, because many possible models may be equally good. Variable importance might be masked by correlated variables, and confidence intervals generated for linear models might be too wide. Check for multicollinearity or other associations between the predictors in:
    1. 2001 Australian election data
    2. 2016 Australian election data
  2. Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example; a starter sketch is given after this list.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?
  3. The following code shows how to hide a point in a four-dimensional space, so that it is not visible in any of the plots of two variables. Generate both d and d_r and confirm that the point is visible in a scatterplot matrix of d, but not in the scatterplot matrix of d_r. Also confirm that it is visible in both data sets when you use a tour.
Code
library(tidyverse)
library(tourr)
library(GGally)
set.seed(946)
d <- tibble(x1=runif(200, -1, 1), 
            x2=runif(200, -1, 1), 
            x3=runif(200, -1, 1))
d <- d |>
  mutate(x4 = x3 + runif(200, -0.1, 0.1))
# outlier is visible in d
d <- bind_rows(d, c(x1=0, x2=0, x3=-0.5, x4=0.5))

# Point is hiding in d_r
# Rotate in the (x1, x3) and (x2, x4) planes. Because mutate()
# evaluates expressions sequentially, compute from copies of the
# original columns so that the transformation is a genuine rotation.
d_r <- d |>
  mutate(x1_o = x1, x2_o = x2, x3_o = x3, x4_o = x4) |>
  mutate(x1 = cos(pi/6)*x1_o + sin(pi/6)*x3_o,
         x3 = -sin(pi/6)*x1_o + cos(pi/6)*x3_o,
         x2 = cos(pi/6)*x2_o + sin(pi/6)*x4_o,
         x4 = -sin(pi/6)*x2_o + cos(pi/6)*x4_o) |>
  select(x1, x2, x3, x4)
  4. Examine each of the challenge data sets c1, c2, …, c7 from the mulgar package for signs of the observed values not filling out the full \(p\) dimensions.

  5. The data sets assoc1, assoc2, assoc3 have other types of association. Can you detect what the associations are in each set?

  6. The data sets anomaly1, anomaly2, anomaly3, anomaly4, anomaly5 all have single anomalies. For there to be an anomaly in a data set, there must be some association between the variables, and the anomaly doesn’t conform to this association pattern. Can you find them?

  7. Copulas are commonly used to define the covariance relationship between pairs of variables, for fixed marginal distributions. There are several R packages that enable simulating multivariate data using copula methods. The 5D copula datasets in the mulgar package are simulated using the covsim package from four different copula models: clayton, joe, frank, gauss. Each has normal marginal distributions.

    a. Is the sample generated using a Gaussian copula similar to a sample from a 5D multivariate normal distribution? (You can use the mulgar::rmvn() function with the sample variance-covariance matrix of the copnorm data to simulate a multivariate normal sample. Then create a combined data set with a variable indicating the source of the sample, and examine it in a grand tour.)

    b. Repeat a. to compare the sample from a Clayton copula with the 5D multivariate normal sample.
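For exercise 2, a starter for the sampling might look like the following; the equicorrelation structure, sample size, and object names are our own choices:

library(mvtnorm)
library(tourr)
set.seed(1071)
p <- 5
# Common pairwise correlation; try values near 0 and near 1.
# Note: a common negative correlation needs r > -1/(p-1) = -0.25
# for the variance-covariance matrix to be positive definite.
r <- 0.9
vc <- matrix(r, nrow = p, ncol = p)
diag(vc) <- 1
d_mvn <- as.data.frame(rmvnorm(200, mean = rep(0, p), sigma = vc))
colnames(d_mvn) <- paste0("x", 1:p)
animate_xy(d_mvn)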


  1. It is generally better to use associated than correlated. Correlation is a statistical quantity, measuring linear association. The term associated can be prefaced with the type of association, such as linear or non-linear.↩︎