5  Non-linear dimension reduction

5.1 Explanation of NLDR methods

Non-linear dimension reduction (NLDR) aims to find a low-dimensional representation of high-dimensional data that shows the main features of the data. In statistics, it dates back to Kruskal's (1964a) work on multidimensional scaling (MDS). Some techniques require only an interpoint similarity or distance matrix as the main ingredient, rather than the full data. Here we focus on the case where the full data is available, so that structure perceived using the tour on the high-dimensional space can be compared with structure revealed in the low-dimensional embedding.

There are many methods available for generating non-linear low-dimensional representations of the data. MDS is a classical technique that minimises the difference between two interpoint distance matrices: the distances between points in the high-dimensional space, and the distances between the same points in the low-dimensional representation. A good resource for learning about MDS is Borg & Groenen (2005).
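The comparison of the two distance matrices can be sketched with base R. This is only an illustration, not the code used for the figures in this chapter: the data is a small simulated stand-in (a sine wave in 4D with noise), and `cmdscale()` computes classical (metric) MDS, which with Euclidean distances is a linear method. Non-metric variants, such as `MASS::isoMDS()`, relax this.

```r
# Classical MDS on simulated stand-in data (an assumption, not the book's data)
set.seed(101)
n <- 200
t <- runif(n, -pi, pi)
# 4D data: a sine wave in the first two variables, small noise in the others
X <- cbind(x1 = t,
           x2 = sin(t),
           x3 = rnorm(n, sd = 0.1),
           x4 = rnorm(n, sd = 0.1))

d_high <- dist(X)               # interpoint distances in 4D
mds <- cmdscale(d_high, k = 2)  # 2D configuration from classical MDS
d_low <- dist(mds)              # interpoint distances in the 2D layout

# Points close to the red line are pairs whose distance is well preserved
plot(as.vector(d_high), as.vector(d_low), pch = ".",
     xlab = "Distance in 4D", ylab = "Distance in 2D")
abline(0, 1, col = "red")

# A single-number summary of the agreement between the two matrices
dist_cor <- cor(as.vector(d_high), as.vector(d_low))
dist_cor
```

A scatterplot of one set of distances against the other is a simple diagnostic: the further the points fall from the line, the more the 2D layout distorts the distances in the original space.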

Code to generate the 2D non-linear representation
library(mulgar)
library(Rtsne)
library(uwot)
library(ggplot2)
library(patchwork)
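# Both fits involve randomness, so set the seed to make them reproducible
set.seed(42)
# Fit t-SNE and UMAP with their default parameters
cnl_tsne <- Rtsne(clusters_nonlin)
cnl_umap <- umap(clusters_nonlin)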
n1 <- ggplot(as.data.frame(cnl_tsne$Y), aes(x=V1, y=V2)) +
  geom_point() + 
  ggtitle("(a) t-SNE") +
  theme_minimal() + 
  theme(aspect.ratio=1)
n2 <- ggplot(as.data.frame(cnl_umap), aes(x=V1, y=V2)) +
  geom_point() + 
  ggtitle("(b) UMAP") +
  theme_minimal() + 
  theme(aspect.ratio=1)
n1 + n2
Figure 5.1: Two non-linear embeddings of the non-linear clusters data: (a) t-SNE, (b) UMAP. Both suggest four clusters, with two being non-linear in some form.

Figure 5.1 shows two NLDR views of the clusters_nonlin data set from the mulgar package. Both suggest that there are four clusters, and that some clusters are non-linearly shaped. They disagree on the type of non-linear pattern: t-SNE represents one cluster as a wavy shape, whereas in UMAP both curved clusters have a simple parabolic shape. Popular methods in current use include t-SNE (Maaten & Hinton, 2008), UMAP (McInnes et al., 2018) and PHATE (Moon et al., 2019).

Code to create animated gif
library(tourr)
render_gif(clusters_nonlin, 
           grand_tour(),
           display_xy(),
           gif_file = "gifs/clusters_nonlin.gif",
           frames = 500,
           width = 300, 
           height = 300)
Figure 5.2: Grand tour of the nonlinear clusters data set, showing four clusters. Two are very small and spherical in shape. One is large and has a sine wave shape, and the other is fairly small with a bent rod shape.


Figure 5.3: Two frames from a grand tour of the nonlinear clusters data set, showing four clusters. Two are very small and spherical in shape. One is large and has a sine wave shape, and the other is fairly small with a bent rod shape.

The full 4D data is shown with a grand tour in Figure 5.2. The four clusters suggested by the NLDR methods can be seen. We also get a better sense of the relative size and proximity of the clusters. There are two small spherical clusters, one quite close to the end of the large sine wave cluster. The fourth cluster is relatively small, and has a slight curve, like a bent rod. The t-SNE representation is slightly more accurate than the UMAP representation. We would expect that the wavy cluster is the sine wave seen in the tour.
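Each frame of a grand tour is a 2D linear projection of the data onto an orthonormal basis. The sketch below builds one such frame by hand with base R, to make clear what a single tour view shows. It is an illustration under assumptions: the data is a simulated 4D stand-in (`as.matrix(clusters_nonlin)` could be substituted for `X`), and `random_basis()` is a hypothetical helper written for this example, not a function from tourr.

```r
# One "frame" of a tour: project 4D data onto a random orthonormal 2D basis
set.seed(202)
n <- 300
t <- runif(n, -pi, pi)
# Simulated stand-in data: a sine wave in 4D with small noise
X <- cbind(x1 = t,
           x2 = sin(t),
           x3 = rnorm(n, sd = 0.1),
           x4 = rnorm(n, sd = 0.1))

# Hypothetical helper: orthonormalise a random p x d matrix with a QR step
random_basis <- function(p, d = 2) {
  qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))
}

A <- random_basis(ncol(X))  # 4 x 2 projection basis
proj <- X %*% A             # n x 2 projected data, one tour frame

plot(proj, asp = 1, pch = 16, cex = 0.6,
     xlab = "Projection 1", ylab = "Projection 2")
```

Because every frame is a linear projection, shapes are never bent or torn, only viewed from different directions. This is why the tour serves as a reference against which a non-linear embedding can be checked.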

NLDR can provide useful low-dimensional summaries of high-dimensional structure, but you need to check whether it is a sensible and accurate representation by comparing it with what is perceived from a tour.

5.2 Assessing reliability of the NLDR representation

NLDR can produce useful low-dimensional summaries of structure in high-dimensional data, like those shown in Figure 5.1. However, there are numerous pitfalls. The fitting procedure can produce very different representations depending on the parameter choices, and even on the random number seed used for the fit. (You can check this by changing the set.seed in the code above, and by changing from the default parameters.) Also, it may not be possible to represent the high-dimensional structures faithfully in low dimensions. For these reasons, one needs to connect the NLDR view with a tour of the data, to help assess its usefulness and accuracy. For example, with this data, we would want to know which of the two curved clusters in the UMAP representation corresponds to the sine wave cluster.
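One way to carry out the check suggested above is to refit t-SNE with a different seed and with a different perplexity, and compare the layouts side by side. This is a minimal sketch: `fit_tsne()` is a hypothetical helper written for this example, and the alternative seed (2024) and perplexity (5) are arbitrary choices rather than recommendations. The default perplexity in `Rtsne()` is 30.

```r
library(mulgar)
library(Rtsne)

# Hypothetical helper: fit t-SNE for a given seed and perplexity,
# returning only the 2D layout
fit_tsne <- function(seed, perplexity = 30) {
  set.seed(seed)
  Rtsne(clusters_nonlin, perplexity = perplexity)$Y
}

y_a <- fit_tsne(seed = 42)                  # as in the code above
y_b <- fit_tsne(seed = 2024)                # different seed only
y_c <- fit_tsne(seed = 42, perplexity = 5)  # different perplexity only

# Show the three layouts side by side in square panels
op <- par(mfrow = c(1, 3), pty = "s")
plot(y_a, pch = 16, cex = 0.5, xlab = "", ylab = "",
     main = "seed 42, perplexity 30")
plot(y_b, pch = 16, cex = 0.5, xlab = "", ylab = "",
     main = "seed 2024, perplexity 30")
plot(y_c, pch = 16, cex = 0.5, xlab = "", ylab = "",
     main = "seed 42, perplexity 5")
par(op)
```

Rotations and reflections of a layout are not meaningful differences. What matters is whether the number, shape and relative position of the clusters change between fits.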

5.2.1 Using liminal

Figure 5.4 shows how the NLDR plot can be linked to a tour view, using the liminal package, to better understand how well the structure of the data is represented. Here we learn that the smile in the UMAP embedding is the small bent rod cluster, and that the unibrow is the sine wave.

library(liminal)
umap_df <- data.frame(umapX = cnl_umap[, 1],
                      umapY = cnl_umap[, 2])
limn_tour_link(
  umap_df,
  clusters_nonlin,
  cols = x1:x4
)
(a) Smile matches bent rod.
(b) Unibrow matches sine wave.
Figure 5.4: Two screenshots from liminal showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster. The unibrow corresponds to the sine wave cluster.

5.2.2 Using detourr

Figure 5.5 shows how the linking is achieved using detourr. It uses a shared data object, as made possible by the crosstalk package, and the UMAP view is made interactive using plotly.

library(detourr)
library(dplyr)
library(crosstalk)
library(plotly)
umap_df <- data.frame(umapX = cnl_umap[, 1],
                      umapY = cnl_umap[, 2])
cnl_df <- bind_cols(clusters_nonlin, umap_df)
shared_cnl <- SharedData$new(cnl_df)

detour_plot <- detour(shared_cnl, tour_aes(
  projection = starts_with("x"))) |>
  tour_path(grand_tour(2),
            max_bases = 50, fps = 60) |>
  show_scatter(alpha = 0.7, axes = FALSE,
               width = "100%", height = "450px")

umap_plot <- plot_ly(shared_cnl,
                    x = ~umapX, 
                    y = ~umapY,
                    color = I("black"),
                    height = 450) %>%
    highlight(on = "plotly_selected", 
              off = "plotly_doubleclick") %>%
    add_trace(type = "scatter", 
              mode = "markers")

bscols(
  detour_plot, umap_plot,
  widths = c(5, 6)
)
Figure 5.5: Screenshot from detourr showing which clusters match between the UMAP representation and the tour animation. The smile corresponds to the small bent rod cluster.

5.3 Example: fake_trees

Figure 5.6 shows a more complex example, using the fake_trees data. We know that the 10D data has a main branch, and 9 branches (clusters) attached to it, based on our explorations in the earlier chapters. The t-SNE view, where points are coloured by the known branch ids, is very helpful for seeing the linear branch structure.

What we can’t tell is that there is a main branch from which all of the others extend. We also can’t tell which of the clusters corresponds to this branch. Linking the plot with a tour helps with this. Although not shown in the sequence of snapshots in Figure 5.6, the main branch is actually the dark blue cluster, which is separated into three pieces by t-SNE.

Code to run liminal on the fake trees data
library(liminal)
library(Rtsne)
data(fake_trees)
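set.seed(2020)
# Fit t-SNE to every column whose name starts with "dim"
tsne <- Rtsne::Rtsne(dplyr::select(fake_trees, dplyr::starts_with("dim")))
tsne_df <- data.frame(tsneX = tsne$Y[, 1],
                      tsneY = tsne$Y[, 2])
# Link the t-SNE layout to a tour of dim1 to dim10, coloured by branch
limn_tour_link(
  tsne_df,
  fake_trees,
  cols = dim1:dim10,
  color = branches
)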
(a) Linked views of t-SNE dimension reduction with a tour of the fake trees data. The t-SNE view clearly shows ten 1D non-linear clusters, while the tour of the full 100 variables suggests a lot more variation in the data, and less difference between clusters.
(b) Focus on the green cluster which is split by t-SNE. The shape as viewed in many linear projections shown by the tour shows that it is a single curved cluster. The split is an artifact of the t-SNE mapping.
(c) Focus on the purple cluster which splits the green cluster in the t-SNE view. The tour shows that these two clusters are distinct, but are close in one neighbourhood of the 100D space. The close proximity in the t-SNE view is reasonable, though.
Figure 5.6: Three snapshots of using the liminal linked views to explore how t-SNE has summarised the fake_trees data in 2D.

The t-SNE representation clearly shows the linear structures of the data, but viewing this 10D data with the tour shows that t-SNE makes several inaccurate breaks of some of the branches.

Exercises

  1. Using the penguins_sub data, generate a 2D representation using t-SNE. Plot the points, mapping the colour to species. What is most surprising? (Hint: Are the three species represented by three distinct clusters?)
  2. Re-do the t-SNE representation with different parameter choices. Are the results different each time, or could they be considered to be equivalent?
  3. Use liminal or detourr to link the t-SNE representation to a tour of the penguins. Highlight the points that t-SNE has placed in an awkward position, away from others in their species. Watch them relative to the others in their species in the tour view, and think about whether there is any rationale for the awkward placement.
  4. Use UMAP to make the 2D representation, and use liminal or detourr to link with a tour to explore the result.
  5. Produce your best t-SNE and UMAP representations of the aflw data. Compare and contrast what is learned relative to a tour on the principal components.