Visualising the Geometry of High Dimensions
We are going to see that we can gain intuition for structure in high dimensions through visualisation.
It doesn’t mean that it’s easy. It doesn’t mean that visualisation is used alone. It means that (high-dimensional) visualisation is an important part of your toolbox, especially to allow discovery of what we don’t know.
Tours of high-dimensional data are like examining shadows (projections), with slices (sections) used to see through a shadow.
Increasing dimension adds an additional orthogonal axis.
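As a minimal sketch of adding one orthogonal axis at a time, the vertices of a p-dimensional cube can be generated in base R and then projected to 2D to give a shadow. The helper names `cube_vertices` and `random_basis` are illustrative, not from any package.

```r
# Vertices of the unit cube in p dimensions: every combination of 0/1
cube_vertices <- function(p) {
  as.matrix(expand.grid(rep(list(c(0, 1)), p)))
}

# Random orthonormal p x d basis via a QR decomposition (illustrative helper)
random_basis <- function(p, d = 2) {
  qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))
}

set.seed(1)
# Each extra dimension doubles the number of vertices
for (p in 2:5) {
  cat(p, "D cube:", nrow(cube_vertices(p)), "vertices\n")
}

# Project the 4D cube into 2D: the "shadow" of the cube
V4 <- cube_vertices(4)
A <- random_basis(4, 2)
shadow <- V4 %*% A
if (interactive()) plot(shadow, asp = 1, pch = 16)
```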
If you want more high-dimensional shapes there is an R package, geozoo, which will generate cubes, spheres, simplices, Möbius strips, tori, Boy's surface, Klein bottles, cones, various polytopes, …
And read or watch Flatland: A Romance of Many Dimensions (1884) by Edwin Abbott.
Data is 2D:
Projection is 1D:
Notice that the projection coefficients vary between -1 and 1, so all possible values are shown during the tour.
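A rough base R sketch of what a 1D tour of 2D data computes: each projection vector is a point on the unit circle, so its coefficients stay between -1 and 1. The two-cluster data here is simulated for illustration and is not the data shown in the slides.

```r
set.seed(2)
# Illustrative 2D data: two clusters (an assumption, not the slide data)
X <- rbind(
  matrix(rnorm(100, mean = -1, sd = 0.2), ncol = 2),
  matrix(rnorm(100, mean =  1, sd = 0.2), ncol = 2)
)

# A 1D projection vector is a point on the unit circle
theta <- seq(0, pi, length.out = 7)
A <- rbind(cos(theta), sin(theta))  # 2 x 7: one column per projection

# Each column of Y holds the 1D "shadow" of the data for one angle
Y <- X %*% A

# Look at one of the shadows
if (interactive()) hist(Y[, 3], main = "1D projection", xlab = "")
```

Stepping through the columns of `Y` in order gives a crude, discretised version of the animation.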
Watching the 1D shadows we can see:
What does the 2D data look like? Can you sketch it?
Data is 4D:
Projection is 2D:
How many clusters do you see?
Species explains the three clusters.
Tours have two main components: How to move over the space, and how to display the projected data.
How should you plot your projected data?
James, A. T. and Constantine, A. G. (1974) Generalized Jacobi Polynomials as Spherical Functions of the Grassmann Manifold, https://doi.org/10.1112/plms/s3-29.1.174
Grand tour: see from all sides
Guided tour: Steer towards the most interesting features.
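To make "interesting" concrete, here is a small base R sketch of one projection pursuit index, the holes index, written from its usual definition. It assumes standardised data, and the simulated clusters are illustrative. The guided tour would search over bases to increase such an index; this sketch only compares two fixed projections.

```r
# Holes index for a d-dimensional projection Y (rows are points).
# Larger values mean fewer points near the centre of the projection.
holes_index <- function(Y) {
  d <- ncol(Y)
  (1 - mean(exp(-0.5 * rowSums(Y^2)))) / (1 - exp(-d / 2))
}

set.seed(3)
n <- 1000
X <- cbind(
  x1 = c(rnorm(n / 2, -3), rnorm(n / 2, 3)),  # two clusters along x1
  x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n)
)
X <- scale(X)  # centre and scale each variable

A_clusters <- diag(4)[, c(1, 2)]  # projection onto (x1, x2)
A_noise    <- diag(4)[, c(3, 4)]  # projection onto (x3, x4)

idx_clusters <- holes_index(X %*% A_clusters)
idx_noise    <- holes_index(X %*% A_noise)
c(clusters = idx_clusters, noise = idx_noise)
```

With tourr loaded, something like `animate_xy(X, guided_tour(holes()))` runs the actual guided tour on the same data.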
Utilise distance from the projection plane to make the slice, and shift centre of projection plane.
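The slice idea can be sketched in base R: compute each point's orthogonal distance from the projection plane and keep only the points within a chosen thickness. The function name `dist_from_plane` and the thickness `h` are illustrative choices, not the tourr implementation.

```r
# Orthogonal distance of each row of X from the plane spanned by the
# orthonormal basis A and passing through the point `centre`
dist_from_plane <- function(X, A, centre = rep(0, ncol(X))) {
  Xc <- sweep(X, 2, centre)          # shift so the plane passes through `centre`
  resid <- Xc - Xc %*% A %*% t(A)    # component orthogonal to the plane
  sqrt(rowSums(resid^2))
}

set.seed(4)
X <- matrix(runif(500 * 4, -1, 1), ncol = 4)
A <- diag(4)[, 1:2]                  # project onto the first two variables
h <- 0.3                             # slice thickness (illustrative value)

d_orth <- dist_from_plane(X, A)
in_slice <- d_orth < h
Y <- X %*% A
if (interactive()) {
  plot(Y, col = ifelse(in_slice, "black", "grey85"), pch = 16, asp = 1)
}
```

Changing `centre` moves the slice through the data, which is the "shift centre of projection plane" step.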
As the number of variables increases, points concentrate more in the centre of the projection. This is great for studying the distribution of means (Central Limit Theorem) but bad for visualising high-dimensional data, because it can obscure interesting structure.
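A quick simulation illustrates the concentration, assuming points spread uniformly inside a p-dimensional ball. The sampler `runif_ball` is an illustrative helper and the radius 0.5 is an arbitrary cut-off.

```r
# Sample n points uniformly inside the unit ball in p dimensions
runif_ball <- function(n, p) {
  Z <- matrix(rnorm(n * p), ncol = p)
  Z <- Z / sqrt(rowSums(Z^2))        # uniform on the sphere surface
  Z * runif(n)^(1 / p)               # push inward to fill the ball
}

set.seed(5)
ps <- c(2, 10, 50)
frac_central <- sapply(ps, function(p) {
  X <- runif_ball(5000, p)
  Y <- X[, 1:2]                      # a 2D projection (first two axes)
  mean(sqrt(rowSums(Y^2)) < 0.5)     # share of points near the centre
})
names(frac_central) <- paste0("p=", ps)
frac_central
```

The share of projected points near the centre grows with p, even though the points fill the ball evenly in the full space.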
TARGET BASIS (would show the dog if we could find it)
Givens interpolation ends at the requested frame, whereas geodesic interpolation arrives at the plane but is frame-agnostic, which is problematic for optimisation with the guided tour.
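The difference between a plane and a frame can be checked numerically. In this base R sketch, two bases span the same plane but are rotated within it, and an orientation-dependent index (here simply the variance along the first projected axis, chosen for illustration) gives different values for each.

```r
set.seed(6)
p <- 4
A <- qr.Q(qr(matrix(rnorm(p * 2), ncol = 2)))   # orthonormal 4 x 2 frame

# Rotate the frame within its own plane by 60 degrees
phi <- pi / 3
R <- matrix(c(cos(phi), sin(phi), -sin(phi), cos(phi)), ncol = 2)
B <- A %*% R

# Same plane: the projection matrices A A' and B B' are identical
same_plane <- isTRUE(all.equal(A %*% t(A), B %*% t(B)))
# Different frame: the bases themselves differ
same_frame <- isTRUE(all.equal(A, B))

# An index that depends on orientation within the plane changes with the frame
X <- matrix(rnorm(200 * p), ncol = p) %*% diag(c(3, 1, 1, 1))
idx <- function(basis) var((X %*% basis)[, 1])
c(idx_A = idx(A), idx_B = idx(B))
```

An interpolation that only guarantees arrival at the plane may therefore land on a frame with a different index value than the one the optimiser asked for.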
If you want to discover and mark the clusters you see, you can use the detourr package to spin and brush points. Here’s a live demo. Hopefully this works.
Best projection provided by the guided tour, separating three species.
Removing flipper length
Removing bill length
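library(tidyverse)
library(tourr)
library(GGally)

# Simulate 200 points, uniform on (-1, 1) in x1, x2 and x3
set.seed(946)
d <- tibble(x1 = runif(200, -1, 1),
            x2 = runif(200, -1, 1),
            x3 = runif(200, -1, 1))

# x4 is x3 plus a small amount of noise
d <- d %>%
  mutate(x4 = x3 + runif(200, -0.1, 0.1))

# Add one point that breaks the x3/x4 relationship
d <- bind_rows(d, c(x1 = 0, x2 = 0, x3 = -0.5, x4 = 0.5))

# Mix x1 with x3, and x2 with x4, using an angle of pi/6.
# Note: mutate() evaluates sequentially, so x3 and x4 are computed
# from the already-updated x1 and x2.
d_r <- d %>%
  mutate(x1 = cos(pi/6)*x1 + sin(pi/6)*x3,
         x3 = -sin(pi/6)*x1 + cos(pi/6)*x3,
         x2 = cos(pi/6)*x2 + sin(pi/6)*x4,
         x4 = -sin(pi/6)*x2 + cos(pi/6)*x4)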
For example, when we teach regression, we overlay the fitted model on the data: MODEL IN THE DATA SPACE.
A residual plot is DATA IN THE MODEL SPACE. When we go beyond 2D, it’s considered too hard to show the model in the data space. It isn’t!
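As a small illustration of the two views with two predictors, this base R sketch builds both: residuals against fitted values (data in the model space), and the fitted plane evaluated on a grid and stacked with the observations (model in the data space). The data is simulated purely for illustration.

```r
set.seed(7)
n <- 100
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.2)  # simulated, illustrative
fit <- lm(y ~ x1 + x2, data = dat)

# Data in the model space: residuals against fitted values
model_space <- data.frame(fitted = fitted(fit), resid = resid(fit))

# Model in the data space: evaluate the fitted plane on a grid of predictors,
# then stack it with the observed data so both can be viewed together
grid <- expand.grid(x1 = seq(0, 1, length.out = 10),
                    x2 = seq(0, 1, length.out = 10))
grid$y <- predict(fit, newdata = grid)
data_space <- rbind(cbind(grid, type = "model"),
                    cbind(dat[, c("x1", "x2", "y")], type = "data"))
```

The numeric columns of `data_space` can then be shown in a tour, coloured by `type`, so the fitted plane is seen among the points.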
Wickham et al. (2015) https://doi.org/10.1002/sam.11271
Principal component analysis
NLDR: t-Stochastic neighbourhood embedding
Data in the model space
Model in the data space
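library(mulgar)
library(tourr)

# Assumes p_pca is a PCA fit of penguins_sub[,1:4] (e.g. from prcomp()),
# created earlier.
# Represent the PCA model as points and edges; s scales its size
p_pca_m <- pca_model(p_pca, s=2.2)

# Stack the model points on top of the data, and colour them differently
p_pca_m_d <- rbind(p_pca_m$points, penguins_sub[,1:4])
p_pca_m_d_clr <- c(rep("#EC5C00", 4),
                   rep("black", nrow(penguins_sub)))

# View model and data together in a grand tour
animate_xy(p_pca_m_d, edges=p_pca_m$edges,
           axes="bottomleft",
           col=p_pca_m_d_clr,
           edges.col="#EC5C00",
           edges.width=3)

# Save the same tour as an animated gif
render_gif(p_pca_m_d,
           grand_tour(),
           display_xy(half_range=4.2,
                      col=p_pca_m_d_clr,
                      edges=p_pca_m$edges,
                      edges.col="#EC5C00",
                      edges.width=3),
           gif_file="gifs/p_pca_model.gif",
           frames=500,
           width=400,
           height=400,
           loop=FALSE)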
Data in the model space
Model in the data space
https://doi.org/10.48550/arXiv.2506.22051
The slice tour is especially useful for exploring classification models and comparing the boundaries produced by different models. (The same penguins data is used here.)
Linear discriminant analysis
Classification tree
Linear discriminant analysis
Classification tree
Best model: four-cluster VEE
Three-cluster EEE
Interactivity: Compare cluster models
DEMO
DATA 1: projections
DATA 2: cluster labels
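library(crosstalk)
library(plotly)
library(viridis)
library(detourr)

# Assumes penguins_cl holds the penguins variables (bl to bm), the cluster
# labels (cl_w) and jittered labels (cl_mc_j, cl_w_j), created earlier.
# Share one data object so brushing links the two plots
p_cl_shared <- SharedData$new(penguins_cl)

# DATA 1: grand tour of the projections, coloured by cluster label
detour_plot <- detour(p_cl_shared, tour_aes(
    projection = bl:bm,
    colour = cl_w)) |>
  tour_path(grand_tour(2),
            max_bases=50, fps = 60) |>
  show_scatter(alpha = 0.7, axes = FALSE,
               width = "100%", height = "450px")

# DATA 2: jittered confusion matrix comparing the two sets of cluster labels
conf_mat <- plot_ly(p_cl_shared,
                    x = ~cl_mc_j,
                    y = ~cl_w_j,
                    color = ~cl_w,
                    colors = viridis_pal(option = "D")(3),
                    height = 450) |>
  highlight(on = "plotly_selected",
            off = "plotly_doubleclick") |>
  add_trace(type = "scatter",
            mode = "markers")

# Place the linked plots side by side
bscols(
  detour_plot, conf_mat,
  widths = c(5, 6)
)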
Liver function (6D) among a sample of patients (all women).
Liver function (6D) among a sample of aging patients.
Example: Fashion MNIST
10 fashion items, 60,000 training images of 28x28 pixels
Model fitted as described in the Keras tutorial.
Single hidden layer with 128 nodes, which reduces the 28x28 = 784-dimensional space to a 128-dimensional space.
What does this dimension reduction do for the classification?
Principal component analysis is the usual way to construct a smaller number of dimensions for viewing the data.
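To show the shapes involved, here is a base R sketch of a single hidden layer followed by PCA of its activations. The inputs and weights are random placeholders, not the fitted Keras model, and the ReLU activation and the choice of five components are assumptions for illustration.

```r
set.seed(8)
n <- 200
# Placeholder inputs and weights: random stand-ins, NOT the fitted model
X <- matrix(runif(n * 784), nrow = n)            # n flattened 28 x 28 images
W <- matrix(rnorm(784 * 128, sd = 0.05), nrow = 784)
b <- rep(0, 128)

relu <- function(z) pmax(z, 0)                   # assumed activation function
H <- relu(sweep(X %*% W, 2, b, "+"))             # n x 128 hidden activations

# Reduce the 128-D activations to a handful of principal components
pca <- prcomp(H)
H_pc <- pca$x[, 1:5]                             # first five PCs, small enough to tour
dim(H_pc)
```

With a real model, `H` would be replaced by the activations extracted from the trained hidden layer.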
Feedforward back-propagation model
Input space
Activations
If you have more than three components in a compositional data set, the data falls inside a simplex of more than two dimensions.
Each component forms one vertex of the simplex. Points
Helps to understand uncertainty in predictions more than is possible with a confusion matrix.
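The simplex described above can be constructed directly. This base R sketch maps k-part compositions (for example, predicted class probabilities) into k-1 dimensions using a normalised Helmert basis. The helper `simplex_basis` and the simulated compositions are illustrative.

```r
# Orthonormal basis for the (k-1)-D subspace orthogonal to the vector of ones
simplex_basis <- function(k) {
  H <- contr.helmert(k)
  sweep(H, 2, sqrt(colSums(H^2)), "/")
}

set.seed(9)
k <- 4
# Illustrative compositions: rows are non-negative and sum to 1
P <- matrix(rexp(50 * k), ncol = k)
P <- P / rowSums(P)

B <- simplex_basis(k)
Y <- P %*% B                # 50 x 3 coordinates inside the simplex
vertices <- diag(k) %*% B   # one vertex per component

# All pairwise vertex distances are equal, so this is a regular simplex
dist(vertices)
```

For k = 4 the result is a tetrahedron in 3D. Larger k gives a higher-dimensional simplex that can be viewed with a tour.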
The tourr package provides the algorithms to generate tour paths of projection bases, the ability to create new tours, and a variety of display methods for drawing the projections.
Tours provide the ability to do statistics with visual help.
These paths of projections can be generated off-line and used with other software.
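One way this might look, as a hedged sketch: `save_history()` from tourr records a short grand tour path on the flea data that ships with the package, and the bases are then flattened to a CSV file. The output file name and the long-table layout are illustrative choices.

```r
library(tourr)

set.seed(10)
X <- as.matrix(flea[, 1:6])

# Generate a short grand tour path off-line: an array of 6 x 2 bases
path <- save_history(X, tour_path = grand_tour(2), max_bases = 5)

# Flatten the bases into a long table that other software can read
n_bases <- dim(path)[3]
bases_long <- do.call(rbind, lapply(seq_len(n_bases), function(i) {
  B <- matrix(as.numeric(path[, , i]), nrow = ncol(X))
  data.frame(basis = i, variable = colnames(X), V1 = B[, 1], V2 = B[, 2])
}))

out_file <- file.path(tempdir(), "tour_bases.csv")  # hypothetical output path
write.csv(bases_long, out_file, row.names = FALSE)
```

Other software can then read the file, multiply the data by each basis, and draw the resulting projections.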
Elegant interactivity solutions exist in detourr, liminal, langevitour and lionfish, but they need to be developed further.
Please use these tools 😃
Slides made in Quarto, with code included.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Tarntanya Aug 2025 https://dicook.github.io/Adelaide-colloquium/slides.html