5  Non-linear dimension reduction

Non-linear dimension reduction (NLDR) aims to find a single low-dimensional representation of the high-dimensional data that shows the main features of the data. If there are separated clusters present then it might be a layout where the clusters are all distinct, in a way that a single linear projection could not reveal. For observations falling on a low-dimensional non-linear manifold in high dimensions the NLDR might unfold or unroll it so that they are represented in a plane where the distances are similar to their distance along the manifold.

Most techniques only require an interpoint similarity or distance matrix as the main ingredient, rather than the full \(p\)-dimensional data. A classic example where only this information is the morse code data reported by Rothkopf (1957), and available in the xgobi software (Swayne et al., 1998). The data contains the counts of the number of times that a letter is mistaken for another letter collected in an experimental study. However, here we focus on problems where the full \(p\)-dimensional data is available, so we can also compare structure perceived using the tour on the high-dimensional space, relative to structure revealed in the low-dimensional embedding.

A common myth is that non-linear dimension reduction captures non-linear patterns in the high-dimensional data. It may or may not do this. The term means that the methods transform the data non-linearly into a useful (or not) visual representation.

5.1 Classical methods

In statistics, methods for non-linear dimension reduction arise with multidimensional scaling (MDS) (Kruskal, 1964a). Classically, MDS minimises some function of the difference between two interpoint distance matrices, the distance between points in the high-dimensions, and in the low-dimensional representations using a stress function (\(C_{L2}\)):

\[ C_{L2} = \left(\sum_{i, j=1; i\neq j}^n (d_p(i,j) - d_k(i,j))^2\right)^{1/2} \] where \((d_p(i,j))\) are the distances between all pairs of points in \(p\)-dimensions, and \(d_k(i,j)\) is the distance between the points in the low-dimensional (\(k\)) space. PCA is a special case of MDS, in that, the first two PCs provide the solution to the above equation if distance is Euclidean. Each PC is a linear projection, but generally MDS can provide non-linear transformations to represent unusual high-dimensional patterns.

The choices for defining \(d_p\) are many, including ones that are locally, using say \(k\) nearest neighbours, or globally computed. Distances can also be transformed using some function, including ranks instead of actual distance. Another approach modifies the stress function the \(L_2\)-norm, by replacing the 2 with an arbitrary power. Others yet use transformations of the distances. A good resource for learning about MDS and the many choices is Borg & Groenen (2005).

The methods isomap (Tenenbaum et al., 2000) and local linear embedding (LLE) are two methods where local distance is used, and they were designed to unwrap data from a manifold. Figure 5.1 illustrates how the results on the classic swiss roll data can differ according to method, in useful ways. The observations lie on a 2D non-linear manifold in 3D. The data is a manufactured example introduced to illustrate the ability of isomap and LLE to unwrap the swiss roll and lay out the points in a rectangle as though one was travelling along the surface.

Code for spiral plots
library(Rdimtools)
library(colorspace)
library(cardinalR)
library(tsibble)
library(ggplot2)
library(dplyr)
library(patchwork)
library(ggthemes)
n_obs <- 1652  
set.seed(259)
swiss <- swiss_roll(n_obs, num_noise = 0)
# Standardising ruins structure
# swiss <- apply(swiss, 2, function(x) (x-mean(x))/sd(x))
swiss[,3] <- swiss[,3] * 3 # This scale produces more recognisable results
# Compute distance from (0,0) in first two coordinates - spiral
swiss <- cbind(swiss, sqrt(swiss[,1]^2 +
                           swiss[,2]^2))
colnames(swiss) <- c("x", "y", "z", "d")
swiss_tbl <- as_tibble(swiss)
swiss_pca <- prcomp(swiss[,1:3], scale=FALSE)
swiss_tbl <- bind_cols(swiss_tbl, as_tibble(swiss_pca$x))
spiral1 <- ggplot(swiss_tbl, 
                  aes(x=PC2, 
                      y=PC3, 
                      colour=d)) +
  geom_point() +
  scale_colour_continuous_divergingx(palette="Zissou 1",
       mid=median(swiss_tbl$d)) + 
  theme_minimal() +
  theme(aspect.ratio=1, 
        axis.text = element_blank(),
        legend.position = "none")

swiss_iso <- do.isomap(swiss[,1:3],
  ndim=2, type=c("knn", 10), 
  weight=FALSE)$Y
colnames(swiss_iso) <- c("iso1", "iso2")
swiss_iso <- as_tibble(swiss_iso) |>
  bind_cols(swiss_tbl)
spiral2 <- ggplot(swiss_iso, 
                  aes(x=iso1, 
                      y=iso2, 
                      colour=d)) + 
  geom_point() +
  scale_colour_continuous_divergingx(palette="Zissou 1",
       mid=median(swiss_iso$d)) + 
  theme_minimal() +
  theme(aspect.ratio=1, 
        axis.text = element_blank(),
        legend.position = "none")

#spiral1 + spiral2 + plot_layout(ncol=2)

tour

This is an untitled chart has x-axis 'PC2' and y-axis 'PC3' with no labels or legend. Colour is used to show a variable d. The chart is a set of 1652 solid circle points laid out in a spiral pattern, with colours starting with one end of the rainbow smoothly transitioning the the other end of the rainbow scale at the other end of the spiral.

PCA

This is an untitled chart has x-axis 'PC2' and y-axis 'PC3' with no labels or legend. Colour is used to show a variable d. The chart is a set of 1652 solid circle points laid out in a slightly squashed rectangle, with colours starting with one end of the rainbow on the left and smoothly transitioning the the other end of the rainbow scale at the right.

isomap
Figure 5.1: The classic swiss roll data constructed to illustrate isomap, as shown in (a) a tour, (b) lowest two principal components, (c) NLDR provided by isomap. The spiral can be discovered by many methods, including different principal compoents depending on the scaling of the data, or using projection pursuit, or nonlinear MDS. Being able to unwrap the spiral and lay out the points along a plane is achieved by isomap and LLE with careful choice of parameters. Colour indicates distance along this spiral.

The swiss roll data is interesting in the sense that from a visual perspective it is interesting to be able to discover the spiral structure. Particularly this would be a more challenging if the 2D spiral was in data with more than three variables, and all the additional ones were noise (like z in this example). Methods like projection pursuit, PCA or MDS can discover the spiral if all variables are measured on the same scale (say -1, 1) but not standardised because the variance of the spiral shape is different from that of noise. Being able to discover the manifold (here, a 2D surface matching a swiss roll) can be useful for conducting further analysis. The points are now organised according to distance relative to the manifold on which they live. This layout is like taming a lion, it feels like an achievement, but realistically it is very difficult to achieve from these methods: data scale choice or a different number of \(k\) nearest neighbours, changes the layout dramatically. Organising the points into distance along the spiral could have equally been achieved by discovering the spiral, and computing radial distance, as done to produce the colouring of points in Figure 5.1. This approach would also extend beyond 3D, to any additional number of noise dimensions.

5.2 Contemporary approaches

Code to generate the 2D non-linear representation
library(mulgar)
library(Rtsne)
library(uwot)
library(ggplot2)
library(patchwork)
set.seed(44)
cnl_tsne <- Rtsne(clusters_nonlin)
cnl_umap <- umap(clusters_nonlin)
n1 <- ggplot(as.data.frame(cnl_tsne$Y), aes(x=V1, y=V2)) +
  geom_point() + 
  xlab("tsne1") +
  ylab("tsne2") +
  ggtitle("(a) t-SNE") +
  theme_minimal() + 
  theme(aspect.ratio=1, 
        axis.text = element_blank())
n2 <- ggplot(as.data.frame(cnl_umap), aes(x=V1, y=V2)) +
  geom_point() + 
  xlab("umap1") +
  ylab("umap2") +
  ggtitle("(b) UMAP") +
  theme_minimal() + 
  theme(aspect.ratio=1, 
        axis.text = element_blank())
n1 + n2
This chart has two plots titled '(a) t-SNE', with x-axis 'tsne1' and y-axis 'tsne2', and '(b) UMAP' with x-axis 'umap1' anf y-axis 'umap2'. The chart '(a)' is a set of 1268 solid circle points arranged with a strip going from bottom let to top right, two C-shapes above and below the line and two small concetration of dots at the middle bottom and top. The points in chart '(b)' are arranged like a tilted smiley face.
Figure 5.2: Two non-linear embeddings of the non-linear clusters data: (a) t-SNE, (b) UMAP. One suggests five clusters and the other four, and also disagree on the cluster shapes.

Popular methods in current use for NLDR include t-SNE (Maaten & Hinton, 2008) and UMAP (McInnes et al., 2018). These approaches use a different approach for comparing the high-dimensional and low-dimensional distances, based on Kullback-Leibler divergence:

\[ C_{KL} = \sum_{i,j} p_{ij}\log\left(\frac{p_{ij}}{q_{ij}}\right) \] where \(i,j\) indicate two observations in the data, and \(p_{ij}, q_{ij}\) are the probability distributions of the distances in \(p\) and \(k\) dimensions, respectively. To obtain \(p_{ij}\) the interpoint distances are computed, and normalised by dividing by \(\sum_{i,j} p_{ij}\), as is done similarly to obtain \(q_{ij}\). A further step in t-SNE is transforming the distances through the CDF of a specified distribution (Gaussian in \(p\)-D and \(t\)-distribution in \(k\)-D) to reflect differences is volume between the high and low dimensional spaces, akin to the transformation used in the sage tour (Laa et al. (2022)). (See Figure 1.5 illustrating the crowding effect.) The normalisation effectively means small distances, like distances between points that are in a cluster, even if clusters are bigger or smaller, become relatively closer in comparison with all the large distances.

In practice, the optimisation is conducted with a likelihood function:

\[ C_{LV} = \sum_{i \neq j} p_{ij} \log\left(w_{ij}\right) + \gamma \sum_{i \neq j} \log\left(w_{ij}\right) \] where \(w_{ij}\) are the unnormalised similarities in the \(k\)-dimensional representation, and the second component weighted by \(\gamma\) adds a repulsive force to push points apart. UMAP optimises a variation of \(C_{LV}\). Both t-SNE and UMAP are designed with cluster structure in mind, and may provide useful low-dimensional representations of clustering in high-dimensions.

Figure 5.2 shows two NLDR views of the clusters_nonlin data set from the mulgar package. Both suggest that there are four clusters, and that some clusters are non-linearly shaped. They disagree on the type of non-linear pattern, where t-SNE represents one cluster as a wavy-shape and UMAP both have a simple parabolic shape.

Animating plot of four variables, where different scatterplots of the porjections show two small spherical clusters, one C-shape cluster and and S-shaped stripe of points.
Figure 5.3: Grand tour of the nonlinear clusters data set, shows four clusters. Two are very small and spherical in shape. One is large, and has a sine wave shape, and the other is fairly small with a bent rod shape.

The full 4D data is shown with a grand tour in Figure 5.3. The four clusters suggested by the NLDR methods can be seen. We also get a better sense of the relative size and proximity of the clusters. There are two small spherical clusters, one quite close to the end of the large sine wave cluster. The fourth cluster is relatively small, and has a slight curve, like a bent rod. The t-SNE representation is slightly more accurate than the UMAP representation. We would expect that the wavy cluster is the sine wave seen in the tour.

NLDR can provide useful low-dimensional summaries of high-dimensional structure but you need to check whether it is a sensible and accurate representation by comparing with what is perceived from a tour.

5.3 Assessing an NLDR layout

Figure 5.2 shows that NLDR can produce useful low-dimensional summaries of structure in high-dimensional data. However, it can be a frustrating exercise because very different representations can result depending on the parameter choices, and even the random number seeding the fit. (You can check this by changing the set.seed in the code above, and by changing from the default parameters.) Also, it may not be possible to represent the high-dimensional structures faithfully in low dimensions.

Diagnosing a layout is an important first step to determine if it can be appropriately used in further analysis. The approach for doing this date back to the methods discussed in Buja et al. (2008). The NLDR view needs to be interactively linked to other plots of the data, and especially to a tour with the specific purpose to determine the faithfulness to the structure present in high dimensions. For example, with the data in Figure 5.2, we would want to know which of the two curved clusters in the UMAP representation correspond to the sine wave cluster.

5.3.1 Using liminal

Figure 5.4 shows how the NLDR plot can be linked to a tour view, using the liminal package, to better understand how well the structure of the data is represented. Here we learn that the smile in the UMAP embedding is the small bent rod cluster, and that the unibrow is the sine wave.

library(liminal)
umap_df <- data.frame(umapX = cnl_umap[, 1],
                      umapY = cnl_umap[, 2])
limn_tour_link(
  umap_df,
  clusters_nonlin,
  cols = x1:x4
)
Screenshot of an interface labeled 'embed + tour' displaying two side-by-side panels with abstract black curved lines on a white background. The left panel has a shaded vertical area, while the right panel shows a smaller, more condensed version of the curved line. The interface includes buttons at the bottom for 'Play,' 'Pause,' 'Restart,' and 'Controls,' with a 'Done' button in the top right corner and a 'Cancel' button in the top left corner.
(a) Smile matches bent rod.
(b) Unibrow matches sine wave.
Figure 5.4: Two screenshots from liminal showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster. The unibrow matches to the sine wave cluster.

5.3.2 Using detourr

Figure 5.5 shows how the linking is achieved using detourr. It uses a shared data object, as made possible by the crosstalk package, and the UMAP view is made interactive using plotly.

library(detourr)
library(dplyr)
library(crosstalk)
library(plotly)
umap_df <- data.frame(umapX = cnl_umap[, 1],
                      umapY = cnl_umap[, 2])
cnl_df <- bind_cols(clusters_nonlin, umap_df)
shared_cnl <- SharedData$new(cnl_df)

detour_plot <- detour(shared_cnl, tour_aes(
  projection = starts_with("x"))) |>
    tour_path(grand_tour(2), 
                    max_bases=50, fps = 60) |>
       show_scatter(alpha = 0.7, axes = FALSE,
                    width = "100%", height = "450px")

umap_plot <- plot_ly(shared_cnl,
                    x = ~umapX, 
                    y = ~umapY,
                    color = I("black"),
                    height = 450) %>%
    highlight(on = "plotly_selected", 
              off = "plotly_doubleclick") %>%
    add_trace(type = "scatter", 
              mode = "markers")

bscols(
     detour_plot, umap_plot,
     widths = c(5, 6)
 )
Screenshot of a data visualization interface displaying two scatter plots side by side. The left plot contains black and gray dots, with a dense black curved cluster near the top and a lighter scattered distribution below. The right plot is labeled with 'umapX' and 'umapY' axes and shows a more structured arrangement of points, with a distinct curved black cluster on the right and a dispersed gray cluster on the left. The interface includes toolbar icons on the left for rotation, movement, selection, and color adjustments, as well as a playback slider at the bottom.
Figure 5.5: Screenshot from detourr showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster.

5.4 Example: fake_trees

Figure 5.6 shows a more complex example, using the fake_trees data. We know that the 10D data has a main branch, and 9 branches (clusters) attached to it, based on our explorations in the earlier chapters. The t-SNE view, where points are coloured by the known branch ids, is very helpful for seeing the linear branch structure.

What we can’t tell is that there is a main branch from which all of the others extend. We also can’t tell which of the clusters corresponds to this branch. Linking the plot with a tour helps with this. Although, not shown in the sequence of snapshots in Figure 5.6, the main branch is actually the dark blue cluster, which is separated into three pieces by t-SNE.

Code to run liminal on the fake trees data
library(liminal)
library(Rtsne)
data(fake_trees)
set.seed(2020)
tsne <- Rtsne::Rtsne(
  dplyr::select(fake_trees,
                dplyr::starts_with("dim")))
tsne_df <- data.frame(tsneX = tsne$Y[, 1],
                      tsneY = tsne$Y[, 2])
limn_tour_link(
  tsne_df,
  fake_trees,
  cols = dim1:dim10,
  color = branches
)
(a) Linked views of t-SNE dimension reduction with a tour of the fake trees data. The t-SNE view clearly shows ten 1D non-linear clusters, while the tour of the full 100 variables suggests a lot more variation in the data, and less difference between clusters.
(b) Focus on the green cluster which is split by t-SNE. The shape as viewed in many linear projections shown by the tour shows that it is a single curved cluster. The split is an artifact of the t-SNE mapping.
(c) Focus on the purple cluster which splits the green cluster in the t-SNE view. The tour shows that these two clusters are distinct, but are close in one neighbourhood of the 100D space. The close proximity in the t-SNE view is reasonable, though.
Figure 5.6: Three snapshots of using the liminal linked views to explore how t-SNE has summarised the fake_trees data in 2D.

The t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches.

Exercises

  1. Change the seed using set.seed and re-run the the code used for Figure 5.2. The resulting layout changes, right? What parts are persistent between different layouts, (e.g. always four clusters, always two small circular clusters, elongated cluster curves in different ways between layouts, …)?

  2. Examine the effect of various choices when making a 2D layout of the swiss roll data, using isomap and LLE:

  1. Standardise the variables.

  2. Change the number of nearest neighbours.

  1. Use t-SNE and UMAP to layout the swiss roll data. How does this compare with isomap and LLE layouts?

  2. This question uses the penguins_sub data

  1. Generate a 2D representation using t-SNE. Plot the points mapping the colour to species. What is most surprising? (Hint: Are the three species represented by three distinct clusters?)
  2. Re-do the t-SNE representation with different parameter choices, including using different random seeds. Are the results different each time, or do you think that they could be considered to be equivalent?
  3. Use liminal or detourr to link the t-SNE representation to a tour of the penguins. Highlight the points that have been placed in an awkward position by t-SNE from others in their species. Watch them relative to the others in their species in the tour view, and think about whether there is any rationale for the awkward placement.
  4. Try again using UMAP to make the 2D representation, and use liminal or detourr to link with a tour to explore the result.
  1. Conduct your best t-SNE and UMAP representations of the aflw data. Compare and contrast what is learned relative to a tour or the principal component analysis.

  2. The cardinalR package can be used to generate various high-dimensional shapes. Generate data with 5 irregularly shaped clusters, centering them at different points in five dimensions. Check what this looks like using a grand tour. If you were to sketch a 2D layout that describes your data, what would it look like? Use PCA, t-SNE and UMAP to make 2D layouts. Can you find a choice of parameters that comes close to your idealised sketch?

  3. This question uses the multicluster data.

  1. Use the tour first. How many clusters do you see?

  2. Use UMAP to create a 2D layout. How well does this capture the clusters in the data?

  1. Use UMAP to create a 2D layout of the sketches_train data, V1 to V784. Plot this with points coloured by word (what the person was asked to sketch).

Project

Gene expressions measured as scRNA-Seq of 2622 human peripheral blood mononuclear cells data is available from the Seurat R package Satija et al. (2015). The paper web site has code to extract and pre-process the data, which follow the tutorial at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html. The processed data, containing the first 50 PCs is provided with the book, as pbmc_pca_50.rds.

The original paper (Z. Chen et al., 2024) used UMAP on the first 15 PCs to find a representation of the data to illustrate the clustering. They used the default settings of the RunUMAP() function in Seurat, without setting a seed.

Generate the t-SNE and UMAP representations of the first 9 PCs of data, using their default settings. They should be quite different. (We use 9 PCs because the scree plot in the data pre-processing suggests that 15 is too many.) Based on your examination of the data in a tour, which method yields the more accurate representation? Explain what the structure in the 2D is relative to that seen in the tour.