3  Overview

This chapter focuses on methods for reducing dimension, and how the tour¹ can be used to assist with common methods such as principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbour embedding (t-SNE), and factor analysis.

Dimension is perceived in a tour using the spread of points. When the points are spread far apart, the data is filling the space. Conversely, when the points “collapse” into a sub-region, the data is only partially filling the space, and reducing to this smaller dimensional space may be worthwhile.

When points do not fill the plotting canvas fully, it means that the data lives in a lower-dimensional space. This low-dimensional space might be linear or non-linear, with the latter being much harder to define and capture.

Let’s start with some 2D examples. You need at least two variables to be able to talk about association between variables. Figure 3.1 shows three plots of two variables. Plot (a) shows two variables that are strongly linearly associated², because when x1 is low, x2 is also low, and when x1 is high, x2 is also high. This can also be seen by the reduction in spread of points (or “collapse”) in one direction, which makes the data fill less than the full square of the plot. From this we can conclude that the data is not fully 2D. The second step is to infer which variables contribute to this reduction in dimension. The axes for x1 and x2 are drawn extending from (0,0), and because they both extend out of the cloud of points, in the direction away from the collapse of points, we can say that they are jointly responsible for the dimension reduction.

Figure 3.1 (b) shows a pair of variables that are not linearly associated. Variable x1 is more varied than x3, but knowing the value of x1 tells us nothing about possible values of x3. Before running a tour, all variables are typically scaled to have equal spread. The purpose of the tour is to capture association and relationships between the variables, so any univariate differences should be removed ahead of time. Figure 3.1 (c) shows what this looks like when x3 is scaled: the points spread out to fill the full square of the plot.

Code to produce 2D data examples
library(tibble)
set.seed(6045)
# x1 and x2 are strongly linearly associated; x3 is unrelated noise
x1 <- runif(123)
x2 <- x1 + rnorm(123, sd=0.1)
x3 <- rnorm(123, sd=0.2)
# Standardise x1 and x2; keep x3 both raw (plot b) and scaled (plot c)
df <- tibble(x1 = (x1-mean(x1))/sd(x1), 
             x2 = (x2-mean(x2))/sd(x2),
             x3, 
             x3scaled = (x3-mean(x3))/sd(x3))
Three scatterplots: (a) points lie close to a straight line in the x=y direction, (b) points lie close to a horizontal line, (c) points spread out in the full plot region. There are no axis labels or scales.
Figure 3.1: Explanation of how dimension reduction is perceived in 2D, relative to variables: (a) Two variables with strong linear association. Both variables contribute to the association, as indicated by their axes extending out from the ‘collapsed’ direction of the points; (b) Two variables with no linear association. But x3 has less variation, so points collapse in this direction; (c) The situation in plot (b) does not arise in a tour because all variables are (usually) scaled. When an axis extends out of a direction where the points are collapsed, it means that this variable is partially responsible for the reduced dimension.
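
A quick numerical check of the collapse in plot (a) is to compare the spread of the data projected onto the x=y direction with the spread in the orthogonal direction. This is a minimal sketch, assuming the df generated in the code above:

proj_along  <- drop(as.matrix(df[, c("x1", "x2")]) %*% (c(1, 1)/sqrt(2)))
proj_across <- drop(as.matrix(df[, c("x1", "x2")]) %*% (c(1, -1)/sqrt(2)))
# Spread is large along x=y and small across it, so the data is close to 1D
sd(proj_along)
sd(proj_across)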

Now let’s think about what this looks like with five variables. Figure 3.2 shows a grand tour on five variables, with (a) data that is primarily 2D, (b) data that is primarily 3D and (c) fully 5D data. You can see that in both (a) and (b) the spread of points collapses in some projections, and it happens more often in (a). In (c) the data is always spread out in the square, although it does seem to concentrate or pile in the centre. This piling is typical when projecting from high dimensions to low dimensions. The sage tour (Laa et al., 2022) makes a correction for this.

Code to make animated gifs
library(mulgar)
library(tourr)
data(plane)
data(box)
# Grand tour of the 2D plane in 5D
render_gif(plane,
           grand_tour(), 
           display_xy(),
           gif_file="gifs/plane.gif",
           frames=500,
           width=200,
           height=200)
# Grand tour of the 3D box in 5D
render_gif(box,
           grand_tour(), 
           display_xy(),
           gif_file="gifs/box.gif",
           frames=500,
           width=200,
           height=200)
# Simulate a full 5D cube and standardise each variable
library(geozoo)
cube5d <- data.frame(cube.solid.random(p=5, n=300)$points)
colnames(cube5d) <- paste0("x", 1:5)
cube5d <- data.frame(apply(cube5d, 2, function(x) (x-mean(x))/sd(x)))
render_gif(cube5d,
           grand_tour(), 
           display_xy(),
           gif_file="gifs/cube5d.gif",
           frames=500,
           width=200,
           height=200)
Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(a) 2D plane in 5D
Animation of sequences of 2D projections shown as scatterplots. You can see points collapsing into a thick straight line in various projections, but not as often as in the animation in (a). A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(b) 3D plane in 5D
Animation of sequences of 2D projections shown as scatterplots. You can see points are always spread out fully in the plot space, in all projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(c) 5D plane in 5D
Figure 3.2: Different dimensional planes - 2D, 3D, 5D - displayed in a grand tour projecting into 2D. Notice that the 5D in 5D always fills out the box (although it does concentrate some in the middle, which is typical when projecting from high to low dimensions). Also you can see that the 2D in 5D concentrates into a line more than the 3D in 5D. This suggests that it is lower dimensional.
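
If the piling in the centre is a concern, the same 5D cube could also be viewed with the sage display, which reweights the projections to counter this crowding. This is a minimal sketch, assuming the display_sage() function in the tourr package and the cube5d data simulated above:

library(tourr)
# Grand tour of the full 5D cube, with the sage correction for central piling
animate(cube5d, grand_tour(), display_sage())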

The next step is to determine which variables contribute. In the 2D and 3D examples just provided, all of the variables contribute to the linear associations. You can check this by making a scatterplot matrix, Figure 3.3.

Code for scatterplot matrix
library(GGally)
library(mulgar)
data(plane)
ggscatmat(plane) +
  theme(panel.background = 
          element_rect(colour="black", fill=NA),
    axis.text = element_blank(),
    axis.ticks = element_blank())
A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlation printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation. The remaining pairs of variables have no association.
Figure 3.3: Scatterplot matrix of plane data. You can see that x1-x3 are strongly linearly associated, and also x4 and x5. When you watch the tour of this data, any time the data collapses into a line you should see only (x1, x2, x3) or (x4, x5). When combinations of x1 and x4 or x5 show, the data should be spread out.
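
A quick numerical companion to the scatterplot matrix is the correlation matrix. Based on the associations described above, large values should only appear within (x1, x2, x3) and within (x4, x5):

# Correlation matrix of the plane data, rounded for readability
round(cor(plane), 2)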

To make an example where not all variables contribute, we have added two additional variables to the plane data set, which are purely noise.

# Add two pure noise dimensions to the plane
plane_noise <- plane
plane_noise$x6 <- rnorm(100)
plane_noise$x7 <- rnorm(100)
plane_noise <- data.frame(apply(plane_noise, 2, 
    function(x) (x-mean(x))/sd(x)))
ggduo(plane_noise, columnsX = 1:5, columnsY = 6:7, 
    types = list(continuous = "points")) +
  theme(aspect.ratio=1,
    panel.background = 
          element_rect(colour="black", fill=NA),
    axis.text = element_blank(),
    axis.ticks = element_blank())
Two rows of scatterplots showing x6 and x7 against x1-x5. The points are spread out in the full plotting region, although x6 has one point with an unusually low value.
Figure 3.4: Scatterplots showing two additional noise variables that are not associated with any of the first five variables.

Now we have 2D structure in 7D, but only five of the variables contribute to the 2D structure, that is, five of the variables are linearly related with each other. The other two variables (x6, x7) are not linearly related to any of the others.
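
This can be checked numerically as well: the correlations of x6 and x7 with the other variables should all be close to zero. This is a minimal check, assuming the plane_noise data created above:

# Correlations of the two noise variables with all seven variables;
# values near zero for x1-x5 indicate no linear association
round(cor(plane_noise)[, c("x6", "x7")], 2)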

The data is viewed with a grand tour in Figure 3.5. We can still see the concentration of points along a line in some projections, which tells us that the data is not fully 7D. If you look closely at the variable axes you will see that the collapsing to a line only occurs when some of x1-x5 contribute strongly in the direction orthogonal to the line. This does not happen when x6 or x7 contribute strongly to a projection: the data is always expanded to fill much of the space. That tells us that x6 and x7 don’t substantially contribute to the dimension reduction, that is, they are not linearly related to the other variables.

Code to generate animation
library(ggplot2)
library(plotly)
library(htmlwidgets)
library(tourr)
library(mulgar)

set.seed(78)
b <- basis_random(7, 2)
# Generate a short grand tour path, starting from a random basis,
# and interpolate smoothly between the target bases
pn_t <- tourr::save_history(plane_noise, 
                    tour_path = grand_tour(),
                    start = b,
                    max_bases = 8)
pn_t <- interpolate(pn_t, 0.1)
# Project the data into every frame of the tour path, ready for plotting
pn_anim <- render_anim(plane_noise,
                         frames=pn_t)

pn_gp <- ggplot() +
     geom_path(data=pn_anim$circle, 
               aes(x=c1, y=c2,
                   frame=frame), linewidth=0.1) +
     geom_segment(data=pn_anim$axes, 
                  aes(x=x1, y=y1, 
                      xend=x2, yend=y2, 
                      frame=frame), 
                  linewidth=0.1) +
     geom_text(data=pn_anim$axes, 
               aes(x=x2, y=y2, 
                   frame=frame, 
                   label=axis_labels), 
               size=5) +
     geom_point(data=pn_anim$frames, 
                aes(x=P1, y=P2, 
                    frame=frame), 
                alpha=0.8) +
     xlim(-1,1) + ylim(-1,1) +
     coord_equal() +
     theme_bw() +
     theme(axis.text=element_blank(),
         axis.title=element_blank(),
         axis.ticks=element_blank(),
         panel.grid=element_blank())
pn_tour <- ggplotly(pn_gp,
                        width=500,
                        height=550) %>%
       animation_button(label="Go") %>%
       animation_slider(len=0.8, x=0.5,
                        xanchor="center") %>%
       animation_opts(easing="linear", 
                      transition = 0)

htmlwidgets::saveWidget(pn_tour,
          file="html/plane_noise.html",
          selfcontained = TRUE)
Figure 3.5: Grand tour of the plane with two additional dimensions of pure noise. The collapsing of the points indicates that this is not fully 7D. This only happens when some of x1-x5 are contributing strongly (frame 49 x4, x5; frame 79 x1; frame 115 x2, x3). If x6 or x7 are contributing strongly the data is spread out fully (frames 27, 96). This tells us that x6 and x7 are not linearly associated with the other variables, while some of x1-x5 are associated with each other.

To determine which variables are responsible for the reduced dimension, look for the axes that extend out of the collapsed point cloud. These variables combine to produce the direction of smaller variation in the observations, and thus indicate the dimension reduction.

The simulated data here is very simple, and what we have learned from the tour could also be learned from principal component analysis. However, if there are small complications, such as outliers or non-linear relationships, these might not be visible from principal component analysis, and the tour can help you to see them.
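
As a point of comparison, here is what principal component analysis reports for the plane data. This is a minimal sketch using prcomp(); because the data is close to a 2D plane, the variance should concentrate in the first two components, matching what the tour shows:

library(mulgar)
data(plane)
plane_pca <- prcomp(plane)
# Proportion of variance explained by each principal component;
# the first two should account for most of the variation
summary(plane_pca)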

Figure 3.6 and Figure 3.7 (a) show example data with outliers, and Figure 3.7 (b) shows data with non-linear relationships.

Code for scatterplot matrix
# Add several outliers to the plane_noise data
plane_noise_outliers <- plane_noise
plane_noise_outliers[101,] <- c(2, 2, -2, 0, 0, 0, 0)
plane_noise_outliers[102,] <- c(0, 0, 0,-2, -2, 0, 0)

ggscatmat(plane_noise_outliers, columns = 1:5) +
  theme(aspect.ratio=1,
    panel.background = 
          element_rect(colour="black", fill=NA),
    axis.text = element_blank(),
    axis.ticks = element_blank())
A five-by-five scatterplot matrix, with scatterplots in the lower triangle, correlation printed in the upper triangle and density plots shown on the diagonal. Plots of x1 vs x2, x1 vs x3, x2 vs x3, and x4 vs x5 have strong positive or negative correlation, with an outlier in the corner of the plot. The remaining pairs of variables have no association, and thus also no outliers.
Figure 3.6: Scatterplot matrix of the plane with noise data, with two added outliers in variables with strong correlation.
Code to generate animated gif
# Grand tour of the plane data with noise variables and two added outliers
render_gif(plane_noise_outliers,          
           grand_tour(), 
           display_xy(),
           gif_file="gifs/pn_outliers.gif",
           frames=500,
           width=200,
           height=200)

# Grand tour of data with a non-linear relationship between variables
data(plane_nonlin)
render_gif(plane_nonlin,          
           grand_tour(), 
           display_xy(),
           gif_file="gifs/plane_nonlin.gif",
           frames=500,
           width=200,
           height=200)
Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be a plane viewed from the side, with two single points further away. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(a) Outliers
Animation showing scatterplots of 2D projections from 5D. The points sometimes appear to be lying on a curve in various projections. A circle with line segments indicates the projection coefficients for each variable for all projections viewed.
(b) Non-linear relationship
Figure 3.7: Examples of different types of dimensionality issues: outliers (a) and non-linearity (b). In (a) you can see two points far from the others in some projections. The two points can also be seen to have different movement patterns, moving faster and in different directions than the other points during the tour. Outliers will affect detection of reduced dimension, but they can be ignored when assessing dimensionality with the tour. In (b) there is a non-linear relationship between several variables, primarily with x3. Non-linear relationships may not be easily captured by other techniques but are often visible with the tour.
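
To illustrate the last point, principal component analysis of the non-linear example summarises the variance in the data but says nothing about the curvature that is visible in the tour. This is a minimal sketch, assuming the plane_nonlin data loaded above:

# PCA only measures linear structure, so the curved relationship seen
# in the tour is not described by these components
plane_nonlin_pca <- prcomp(plane_nonlin)
summary(plane_nonlin_pca)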

Exercises

  1. Multicollinearity is when the predictors for a model are strongly linearly associated. It can adversely affect the fitting of most models, because many possible models may be equally good. Variable importance might be masked by correlated variables, and confidence intervals generated for linear models might be too wide. Check for multicollinearity or other associations between the predictors in:
    1. 2001 Australian election data
    2. 2016 Australian election data
  2. Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example; a starting sketch is given after the exercise code below.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?
  3. The following code shows how to hide a point in a four-dimensional space, so that it is not visible in any of the plots of two variables. Generate both d and d_r and confirm that the point is visible in a scatterplot matrix of d, but not in the scatterplot matrix of d_r. Also confirm that it is visible in both data sets when you use a tour.
Code
library(tidyverse)
library(tourr)
library(GGally)
set.seed(946)
d <- tibble(x1=runif(200, -1, 1), 
            x2=runif(200, -1, 1), 
            x3=runif(200, -1, 1))
d <- d %>%
  mutate(x4 = x3 + runif(200, -0.1, 0.1))
# outlier is visible in d
d <- bind_rows(d, c(x1=0, x2=0, x3=-0.5, x4=0.5))

# Point is hiding in d_r
d_r <- d %>%
  mutate(x1 = cos(pi/6)*x1 + sin(pi/6)*x3,
         x3 = -sin(pi/6)*x1 + cos(pi/6)*x3,
         x2 = cos(pi/6)*x2 + sin(pi/6)*x4,
         x4 = -sin(pi/6)*x2 + cos(pi/6)*x4)
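
A minimal starting sketch for exercise 2, assuming the mvtnorm and tourr packages; the variance-covariance matrix vc and the sample size are only examples to adapt:

library(mvtnorm)
library(tourr)
# Equicorrelation variance-covariance matrix: try values of rho
# from near 0 to near 0.9
rho <- 0.9
vc <- matrix(rho, nrow = 5, ncol = 5)
diag(vc) <- 1
s <- rmvnorm(500, sigma = vc)
colnames(s) <- paste0("x", 1:5)
# Flipping the sign of one variable, e.g. s[, 1] <- -s[, 1], gives
# strong negative correlations with the other variables
animate_xy(s)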

  1. Note that the animated tours from this chapter can be viewed at https://dicook.github.io/mulgar_book/3-intro-dimred.html.

  2. It is generally better to use associated than correlated. Correlation is a statistical quantity, measuring linear association. The term associated can be prefaced with the type of association, such as linear or non-linear.