We are going to see that we can gain intuition for structure in high dimensions through visualisation.
It doesn’t mean that it’s easy. It doesn’t mean that visualisation is used alone. It means that (high-dimensional) visualisation is an important part of your toolbox, especially to allow discovery of what we don’t know.
Tours of high-dimensional data are like examining the shadows (projections) of an object, with slices/sections used to see through a shadow.
Increasing dimension adds an additional orthogonal axis.
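As a small illustration of adding one orthogonal axis at a time, here is a base-R sketch that builds the vertices and edges of a p-dimensional cube. The helper name `cube_wireframe()` is made up for this example and is not part of any package.

```r
# Vertices and edges of a p-dimensional unit cube, in base R.
# cube_wireframe() is a hypothetical helper name, not a package function.
cube_wireframe <- function(p) {
  # Each vertex is a 0/1 pattern; every extra axis doubles the vertex count.
  points <- as.matrix(expand.grid(rep(list(0:1), p)))
  colnames(points) <- paste0("x", seq_len(p))
  # Two vertices are joined when they differ in exactly one coordinate.
  pairs <- t(combn(nrow(points), 2))
  n_diff <- rowSums(points[pairs[, 1], , drop = FALSE] !=
                    points[pairs[, 2], , drop = FALSE])
  list(points = points, edges = pairs[n_diff == 1, , drop = FALSE])
}

cube3 <- cube_wireframe(3)   # 8 vertices, 12 edges
cube4 <- cube_wireframe(4)   # 16 vertices, 32 edges
c(vertices = nrow(cube4$points), edges = nrow(cube4$edges))
```

The `points` matrix and two-column `edges` matrix have the shape that wireframe displays typically expect, so the 4D cube can be passed on to a tour display.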
If you want more high-dimensional shapes, the R package geozoo will generate cubes, spheres, simplices, Möbius strips, tori, Boy's surface, Klein bottles, cones, various polytopes, …
Also read or watch Flatland: A Romance of Many Dimensions (1884) by Edwin Abbott.
Data
Projection
Projected data
Data is 2D:
Projection is 1D:
Notice that the values of the projection coefficients vary between -1 and 1. All possible values are shown during the tour.
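A minimal base-R sketch of a 1D projection of 2D data may help here. The data are simulated for illustration (they are not the data on the slide), and the projection is parameterised by an angle, which is one simple way to step through all 1D projections.

```r
set.seed(1)
# Simulated 2D data, purely for illustration.
X <- cbind(x1 = runif(100, -1, 1), x2 = runif(100, -1, 1))

# A 1D projection of 2D data is a unit vector; parameterise it by an angle.
theta <- seq(0, pi, length.out = 50)
A <- rbind(cos(theta), sin(theta))   # 2 x 50, one projection per column

# Each column holds the projected data (the 1D shadow) for one projection.
proj <- X %*% A                      # 100 x 50

range(A)        # coefficients stay within [-1, 1]
colSums(A^2)    # every projection vector has length 1
```

Because each projection vector has unit length, no single coefficient can exceed 1 in absolute value.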
Watching the 1D shadows, we can see:
What does the 2D data look like? Can you sketch it?
The 2D data
Data is 3D:
Projection is 2D:
Notice that the values of the projection coefficients vary between -1 and 1. All possible values are shown during the tour.
See:
Data is 4D:
Projection is 2D:
How many clusters do you see?
1D paths in 3D space
2D paths in 3D space
Grand tour: see from all sides
Guided tour: Steer towards the most interesting features.
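A grand tour moves between randomly chosen projection planes, while a guided tour picks targets that improve an index of interest. The base-R sketch below shows only the first ingredient, drawing a random orthonormal basis; `random_basis()` is a hypothetical helper written for this example, and the tourr package has its own implementation.

```r
# Draw a random p x d orthonormal basis: fill a matrix with Gaussian noise,
# then orthonormalise its columns with a QR decomposition.
# random_basis() is a hypothetical helper name for this sketch.
random_basis <- function(p, d = 2) {
  qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))
}

set.seed(2)
A <- random_basis(p = 4, d = 2)
round(crossprod(A), 10)   # 2 x 2 identity: the columns are orthonormal
```

Multiplying a data matrix by such a basis (`X %*% A`) gives one 2D view; a tour interpolates smoothly between a sequence of them.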
Avoid being a blind man inspecting the elephant
Principal component analysis
NLDR: t-distributed stochastic neighbour embedding (t-SNE)
Data in the model space
Model in the data space
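library(mulgar)
library(tourr)   # animate_xy(), render_gif(), grand_tour(), display_xy()

# Assumption: p_pca is not defined on this slide; it is taken here to be
# the PCA of the four penguin measurements.
p_pca <- prcomp(penguins_sub[,1:4])

# Build the 2D PCA model (a plane), scaled by s, as points plus edges
p_pca_m <- pca_model(p_pca, s=2.2)
# Stack model points and data so both appear in the same tour
p_pca_m_d <- rbind(p_pca_m$points, penguins_sub[,1:4])
animate_xy(p_pca_m_d, edges=p_pca_m$edges,
           axes="bottomleft",
           edges.col="#E7950F",
           edges.width=3)
# Save a grand tour of the same data and model as a GIF
render_gif(p_pca_m_d,
           grand_tour(),
           display_xy(half_range=4.2,
                      edges=p_pca_m$edges,
                      edges.col="#E7950F",
                      edges.width=3),
           gif_file="gifs/p_pca_model.gif",
           frames=500,
           width=400,
           height=400,
           loop=FALSE)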
Data in the model space
Model in the data space
???
Stay tuned for new work to appear next year
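library(tidyverse)
library(tourr)
library(GGally)
set.seed(946)
# Three independent uniform variables on (-1, 1)
d <- tibble(x1=runif(200, -1, 1),
            x2=runif(200, -1, 1),
            x3=runif(200, -1, 1))
# x4 is x3 plus a little noise, so x3 and x4 are strongly associated
d <- d %>%
  mutate(x4 = x3 + runif(200, -0.1, 0.1))
# Add one point that breaks the x3-x4 association
d <- bind_rows(d, c(x1=0, x2=0, x3=-0.5, x4=0.5))
# Mix x1 with x3 and x2 with x4 using an angle of pi/6.
# Note: mutate() evaluates in order, so the x3 line uses the already
# updated x1 (and x4 the updated x2); this is a linear mix of the
# variables rather than an exact rotation.
d_r <- d %>%
  mutate(x1 = cos(pi/6)*x1 + sin(pi/6)*x3,
         x3 = -sin(pi/6)*x1 + cos(pi/6)*x3,
         x2 = cos(pi/6)*x2 + sin(pi/6)*x4,
         x4 = -sin(pi/6)*x2 + cos(pi/6)*x4)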
How should you plot your projected data?
Use the orthogonal distance from the projection plane to define the slice, and shift the centre of the projection plane to move the slice.
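The slice idea can be sketched in base R as follows. The helper `slice_points()`, its arguments and the simulated data are all assumptions made for this illustration; they are not the implementation used for the slides.

```r
# slice_points() is a hypothetical helper: flag points within distance h
# of the projection plane spanned by the orthonormal columns of A,
# with the plane passing through `centre`.
slice_points <- function(X, A, h, centre = rep(0, ncol(X))) {
  Xc <- sweep(X, 2, centre)                  # shift the plane to the centre
  in_plane <- Xc %*% A %*% t(A)              # component lying in the plane
  dist <- sqrt(rowSums((Xc - in_plane)^2))   # orthogonal distance
  dist < h
}

set.seed(3)
X <- matrix(runif(500 * 4, -1, 1), ncol = 4)   # simulated 4D data
A <- diag(4)[, 1:2]                            # the plane of x1 and x2
inside <- slice_points(X, A, h = 0.3)
proj <- X %*% A     # 2D projection; highlight the rows where `inside` is TRUE
mean(inside)        # fraction of points that fall in the slice
```

Points outside the slice are usually drawn faintly or dropped, so that the display shows a section through the data rather than its full shadow.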
As the number of variables increases, projected points concentrate near the centre, possibly obscuring important structure.
A radial transformation expands the centre of the projection, giving the sage display.
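The concentration effect can be checked with a small simulation. This base-R sketch samples points uniformly inside a unit ball (an assumed data shape, chosen for illustration) and measures how far from the centre they land in a 2D projection; it does not implement the sage transformation itself.

```r
# Sample n points uniformly inside the unit ball in p dimensions:
# a random direction scaled by a radius with density proportional to r^(p-1).
runif_ball <- function(n, p) {
  z <- matrix(rnorm(n * p), ncol = p)
  z / sqrt(rowSums(z^2)) * runif(n)^(1 / p)
}

# Median distance from the centre after projecting onto the first two axes.
median_radius <- function(p, n = 5000) {
  X <- runif_ball(n, p)
  median(sqrt(rowSums(X[, 1:2]^2)))
}

set.seed(4)
sapply(c(3, 10, 50), median_radius)   # shrinks as p grows
```

The shrinking median radius is the piling up near the centre that the transformation is designed to undo.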
TARGET BASIS (would show the dog, if we could find it)
Givens interpolation ends at the requested frame, whereas geodesic interpolation arrives only at the target plane. Geodesic interpolation is frame-agnostic, and that is problematic for optimisation using the guided tour.
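The difference between a frame and a plane can be made concrete with a short base-R sketch. The data, the rotation angle and the toy "index" (variance of the first plotted coordinate) are assumptions chosen only to illustrate the point.

```r
set.seed(5)
# Simulated 4D data with unequal spread, purely for illustration.
X <- matrix(rnorm(200 * 4), ncol = 4) %*% diag(c(3, 1, 1, 1))

A <- qr.Q(qr(matrix(rnorm(8), nrow = 4)))      # an orthonormal 4 x 2 frame
phi <- pi / 3
R <- matrix(c(cos(phi), sin(phi), -sin(phi), cos(phi)), nrow = 2)
B <- A %*% R                                   # same plane, rotated frame

# Same plane: the projection matrices onto the plane are identical.
all.equal(A %*% t(A), B %*% t(B))              # TRUE

# Different frame: the plotted coordinates are rotated within the plane,
# so an index that depends on orientation (here, the variance of the
# first plotted coordinate) can differ between the two frames.
c(frameA = var((X %*% A)[, 1]), frameB = var((X %*% B)[, 1]))
```

An interpolation that only guarantees arrival at the plane may therefore end at a view with a different index value from the one the optimiser asked for.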
If you want to discover and mark the clusters you see, you can use the detourr package to spin and brush points. Here’s a live demo. Hopefully this works.
Best projection provided by the guided tour, separating three species.
Removing flipper length
Removing bill length
Projection
Slice
This is especially useful for exploring classification models and comparing the boundaries produced by different models. (The same penguins data is used here.)
Linear discriminant analysis
Classification tree
Best model: four-cluster VEE
Three-cluster EEE
Convex hulls are often used to summarise clusters in 2D. It is possible to view these in high-d, too.
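For the 2D case, base R's `chull()` is enough. The sketch below uses two simulated clusters (not the penguins data); in higher dimensions a different routine is needed, for example `convhulln()` from the geometry package, which is not shown here.

```r
set.seed(6)
# Two simulated 2D clusters, purely for illustration.
d <- data.frame(
  x = c(rnorm(50, 0), rnorm(50, 4)),
  y = c(rnorm(50, 0), rnorm(50, 4)),
  cl = rep(c("a", "b"), each = 50)
)

# chull() returns the row indices of the hull vertices, in clockwise order.
hulls <- lapply(split(d, d$cl), function(g) g[chull(g$x, g$y), ])

plot(d$x, d$y, col = factor(d$cl), pch = 16)
for (h in hulls) polygon(h$x, h$y, border = "grey30")
```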
cl_w | cl_mc = 1 | cl_mc = 2 | cl_mc = 3
---|---|---|---
1 | 149 | 8 | 0
2 | 0 | 0 | 119
3 | 0 | 57 | 0
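library(crosstalk)
library(detourr)   # provides detour(), tour_aes(), tour_path(), show_scatter()
library(plotly)
library(viridis)

# penguins_cl is assumed to hold the penguin measurements (bl to bm) plus
# the cluster labels cl_w and cl_mc and the plotting columns cl_w_j, cl_mc_j.
# Sharing it through crosstalk links selections between the two plots.
p_cl_shared <- SharedData$new(penguins_cl)

# Left: grand tour of the measurements, coloured by cl_w
detour_plot <- detour(p_cl_shared, tour_aes(
    projection = bl:bm,
    colour = cl_w)) |>
  tour_path(grand_tour(2),
            max_bases = 50, fps = 60) |>
  show_scatter(alpha = 0.7, axes = FALSE,
               width = "100%", height = "450px")

# Right: scatterplot version of the confusion matrix of the two clusterings
conf_mat <- plot_ly(p_cl_shared,
                    x = ~cl_mc_j,
                    y = ~cl_w_j,
                    color = ~cl_w,
                    colors = viridis_pal(option = "D")(3),
                    height = 450) |>
  highlight(on = "plotly_selected",
            off = "plotly_doubleclick") |>
  add_trace(type = "scatter",
            mode = "markers")

# Place the two linked plots side by side
bscols(
  detour_plot, conf_mat,
  widths = c(5, 6)
)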
Market segmentation data typically has NO separated clusters. It is a partitioning.
Three different 2D data sets. What is a useful partition?
Here we show the model in the data space so we can see where it is partitioning the “blob”.
This is what the model looks like in only one variable at a time. You can’t see where it is partitioning.
Australian Vacation Activities
Austrian Winter Activities
Using a guided tour, there is some hint of the partitioning when looking at all clusters, but there is too much overlap.
Focus on two clusters only.
First find the separation, then examine the combination of variables.
Cluster 6 consists of tourists who like going to health facilities, excursions and drinking wine.
# A tibble: 5 × 2
act proj
<chr> <dbl>
1 using.health.facilities 0.255
2 heurigen 0.119
3 going.to.a.spa 0.0885
4 organized.excursions 0.0841
5 excursions 0.0728
Cluster 3 consists of tourists who very much like going to a disco or bar, with some interest in alpine activities and theatre/opera.
# A tibble: 5 × 2
act proj
<chr> <dbl>
1 snowboarding -0.103
2 ski.touring -0.106
3 theater.opera -0.119
4 alpine.skiing -0.214
5 going.to.discos.bars -0.889
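Listings like the two above can be obtained by sorting the coefficients of a 1D projection. The base-R sketch below assumes both listings come from the same projection vector and copies only the ten values shown there, not the full vector.

```r
# Coefficients copied from the two listings above (only the ten shown,
# not the full projection vector).
proj <- c(using.health.facilities = 0.255, heurigen = 0.119,
          going.to.a.spa = 0.0885, organized.excursions = 0.0841,
          excursions = 0.0728, snowboarding = -0.103, ski.touring = -0.106,
          theater.opera = -0.119, alpine.skiing = -0.214,
          going.to.discos.bars = -0.889)

coefs <- data.frame(act = names(proj), proj = unname(proj))
ord <- coefs[order(coefs$proj, decreasing = TRUE), ]
head(ord, 5)   # largest positive coefficients (the cluster 6 listing)
tail(ord, 5)   # largest negative coefficients (the cluster 3 listing)
```

Variables with coefficients near zero contribute little to the separation, so only the two ends of the sorted list need to be read.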
The tourr package provides the algorithms that generate tour paths, and also lets you create new tours and different displays.
There are elegant interactive solutions in detourr, liminal, langevitour and lionfish, but they need to be developed further.
Please use these tools 😃
Slides made in Quarto, with code included. Available at https://dicook.github.io/MPSS/slides.html.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
MPSS 2024