We are going to see that we can gain intuition for structure in high dimensions through visualisation.
It doesn’t mean that it’s easy. It doesn’t mean that visualisation is used alone. It means that (high-dimensional) visualisation is an important part of your toolbox, especially to allow discovery of what we don’t know.
Tours of high-dimensional data are like examining shadows (projections), with slices (sections) used to see through a shadow.
Each increase in dimension adds one more orthogonal axis.
If you want more high-dimensional shapes there is an R package, geozoo, which will generate cubes, spheres, simplices, Möbius strips, tori, Boy's surface, Klein bottles, cones, various polytopes, …
Also read or watch Flatland: A Romance of Many Dimensions (1884) by Edwin Abbott.
Data
Projection
Projected data
Data is 2D:
Projection is 1D:
Notice that the values of the projection coefficients vary between -1 and 1. All possible values are shown during the tour.
Watching the 1D shadows we can see:
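A minimal base-R sketch of what each frame of this tour computes, assuming the 2D data is held in a matrix and the 1D projection is the unit vector (cos θ, sin θ). The names `X`, `thetas` and `proj_1d` are illustrative and not from the slides.

```r
set.seed(1)
# Illustrative 2D data: 100 points, two columns
X <- cbind(x1 = rnorm(100), x2 = rnorm(100))

# A 1D projection of 2D data is a unit vector a = (cos(theta), sin(theta))
proj_1d <- function(theta) c(cos(theta), sin(theta))

# Sweep the angle: each theta corresponds to one frame of the tour
thetas <- seq(0, pi, length.out = 50)
A <- t(sapply(thetas, proj_1d))   # 50 x 2, one projection per row

# Projected data for a single frame: an n x 1 matrix, the "shadow"
y <- X %*% proj_1d(pi / 4)

# Coefficients always lie in [-1, 1] and every projection has length 1
range(A)
rowSums(A^2)
```

Because the projection vector has unit length, no coefficient can leave the interval [-1, 1], which is why the tour only ever shows values in that range.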
What does the 2D data look like? Can you sketch it?
The 2D data
Data is 3D:
Projection is 2D:
Notice that the values of the projection coefficients vary between -1 and 1. All possible values are shown during the tour.
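For a 2D projection of 3D data the projection becomes a 3 x 2 matrix with orthonormal columns. The sketch below, in base R, builds one such matrix by orthonormalising random numbers with a QR decomposition; the data and the names `X`, `A` and `Y` are made up for illustration.

```r
set.seed(2)
X <- matrix(rnorm(300), ncol = 3)           # illustrative 3D data, 100 x 3

# Orthonormalise a random 3 x 2 matrix with QR to get a projection basis
A <- qr.Q(qr(matrix(rnorm(6), nrow = 3)))   # 3 x 2, columns orthonormal

Y <- X %*% A                                # projected data, 100 x 2

crossprod(A)   # approximately the 2 x 2 identity
range(A)       # every coefficient is within [-1, 1]
```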
See:
Data is 4D:
Projection is 2D:
How many clusters do you see?
1D paths in 3D space
2D paths in 3D space
Grand tour: see from all sides
Guided tour: Steer towards the most interesting features.
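A minimal sketch of both kinds of tour using tourr, on simulated data (the matrix `X` and its two-cluster structure are made up for illustration). `save_history()` records the bases a tour visits without drawing anything; the animated calls are wrapped in `interactive()` so they only run at the console.

```r
library(tourr)

set.seed(7)
# Illustrative 4D data: two clusters separated along the first variable
X <- rbind(matrix(rnorm(200), ncol = 4),
           matrix(rnorm(200), ncol = 4) + rep(c(4, 0, 0, 0), each = 50))
colnames(X) <- paste0("x", 1:4)

# Grand tour: record a few randomly chosen 2D target bases
h <- save_history(X, grand_tour(2), max_bases = 5)
dim(h)   # variables x projection dimension x number of bases

if (interactive()) {
  animate_xy(X, grand_tour(2))          # see the data from all sides
  animate_xy(X, guided_tour(holes()))   # steer towards "holey" projections
}
```

The holes index is only one choice of projection pursuit index; which index counts as "interesting" depends on the structure you hope to find.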
Avoid being a blind man inspecting the elephant
Principal component analysis
NLDR: t-distributed stochastic neighbour embedding (t-SNE)
How should you plot your projected data?
Use the distance from the projection plane to define the slice, and shift the centre of the projection plane to move the slice through the data.
As the number of variables increases, projected points concentrate near the centre, possibly obscuring important structure.
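One way to read this in code: compute each point's orthogonal distance from the projection plane and keep the points closer than some thickness. This is a base-R sketch under that assumption, not the tourr implementation; `slice_dist`, `centre` and the thickness `h` are hypothetical names and values.

```r
set.seed(3)
p <- 4
X <- matrix(runif(500 * p, -1, 1), ncol = p)    # illustrative 4D data
A <- qr.Q(qr(matrix(rnorm(p * 2), nrow = p)))   # p x 2 orthonormal basis

# Orthogonal distance of each point from the plane spanned by A,
# where the plane passes through the point `centre`
slice_dist <- function(X, A, centre = rep(0, ncol(X))) {
  Xc <- sweep(X, 2, centre)          # shift so the plane passes through `centre`
  resid <- Xc - Xc %*% A %*% t(A)    # component orthogonal to the plane
  sqrt(rowSums(resid^2))
}

h <- 0.5                                   # slice thickness (assumed value)
in_slice <- slice_dist(X, A) < h
Y <- X %*% A                               # all projected points
Y_slice <- Y[in_slice, , drop = FALSE]     # points shown inside the slice

# Shifting the centre selects a different slice with the same orientation
in_slice_shifted <- slice_dist(X, A, centre = c(0.5, 0, 0, 0)) < h
```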
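A small simulation can make the concentration effect concrete. The sketch below samples points uniformly inside a unit ball in p dimensions, projects them onto two of the axes, and reports the median distance from the centre of the projection; the helper names `runif_ball` and `median_proj_radius` are illustrative.

```r
set.seed(4)
# Sample n points uniformly inside the unit ball in p dimensions
runif_ball <- function(n, p) {
  z <- matrix(rnorm(n * p), ncol = p)
  z <- z / sqrt(rowSums(z^2))     # uniform on the sphere surface
  z * runif(n)^(1 / p)            # rescale radii for uniform volume
}

# Median radius of the points after projecting onto the first two axes
median_proj_radius <- function(p, n = 2000) {
  Y <- runif_ball(n, p)[, 1:2]
  median(sqrt(rowSums(Y^2)))
}

ps <- c(3, 10, 50, 100)
radii <- sapply(ps, median_proj_radius)
names(radii) <- paste0("p=", ps)
round(radii, 2)
```

The median radius shrinks as p grows, so most points pile up in the middle of the 2D display even though they are spread throughout the ball.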
Transformation expands the centre to make a sage display.
Givens interpolation ends at the requested frame, whereas geodesic interpolation arrives at the requested plane but is frame-agnostic, which is problematic for optimisation using the guided tour.
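The distinction between a frame and a plane can be sketched in base R. Two bases that differ only by a rotation within the plane span the same plane, yet an index that depends on orientation gives them different values. The index `idx` below is a made-up example chosen to be orientation-dependent; it is not one of the tourr indexes.

```r
set.seed(5)
p <- 5
A <- qr.Q(qr(matrix(rnorm(p * 2), nrow = p)))   # a target frame, p x 2

# Rotate the frame within its own plane by 40 degrees
phi <- 40 * pi / 180
R <- matrix(c(cos(phi), sin(phi), -sin(phi), cos(phi)), nrow = 2)
B <- A %*% R                                    # a different frame

# Same plane: the projection matrices onto the plane agree
max(abs(A %*% t(A) - B %*% t(B)))   # approximately 0

# Different frame: the bases themselves do not agree
max(abs(A - B))                     # clearly non-zero

# Illustrative orientation-dependent index:
# variance of the first projected coordinate only
X <- matrix(rnorm(200 * p), ncol = p) %*% diag(c(3, 1, 1, 1, 1))
idx <- function(basis) var((X %*% basis)[, 1])
c(idx(A), idx(B))
```

If the optimiser proposes frame `A` but the interpolation lands on `B`, the index value actually reached can differ from the one that was proposed.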
If you want to discover and mark the clusters you see, you can use the detourr package to spin and brush points. Here’s a live demo. Hopefully this works.
cl_w | cl_mc: 1 | cl_mc: 2 | cl_mc: 3
---|---|---|---
1 | 149 | 8 | 0
2 | 0 | 0 | 119
3 | 0 | 57 | 0
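library(detourr)    # detour(), tour_aes(), tour_path(), show_scatter()
library(tourr)      # grand_tour()
library(crosstalk)  # SharedData, bscols()
library(plotly)
library(viridis)
# Share the data so brushing in one widget highlights in the other
p_cl_shared <- SharedData$new(penguins_cl)
# Grand tour of the variables bl to bm, coloured by the cl_w clusters
detour_plot <- detour(p_cl_shared, tour_aes(
    projection = bl:bm,
    colour = cl_w)) |>
  tour_path(grand_tour(2),
    max_bases = 50, fps = 60) |>
  show_scatter(alpha = 0.7, axes = FALSE,
    width = "100%", height = "450px")
# Jittered confusion matrix of the two clustering solutions
conf_mat <- plot_ly(p_cl_shared,
    x = ~cl_mc_j,
    y = ~cl_w_j,
    color = ~cl_w,
    colors = viridis_pal(option = "D")(3),
    height = 450) |>
  highlight(on = "plotly_selected",
    off = "plotly_doubleclick") |>
  add_trace(type = "scatter",
    mode = "markers")
# Place the tour and the confusion matrix side by side
bscols(
  detour_plot, conf_mat,
  widths = c(5, 6)
)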
Best projection provided by the guided tour, separating three species.
Removing flipper length
Removing bill length
Projection
Slice
This is especially useful for exploring classification models, comparing boundaries produced by different models. (The same penguins data used here.)
Linear discriminant analysis
Classification tree
Data in the model space
Model in the data space
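library(mulgar)   # pca_model()
library(tourr)    # animate_xy(), render_gif(), grand_tour(), display_xy()
# Assumes p_pca (a PCA fit of the four penguin measurements) and
# penguins_sub were created earlier in the talk
# Generate points and edges that represent the PCA model
p_pca_m <- pca_model(p_pca, s=2.2)
# Stack the model points on the data so both are toured together
p_pca_m_d <- rbind(p_pca_m$points, penguins_sub[,1:4])
# Interactive view: model drawn as orange edges over the data
animate_xy(p_pca_m_d, edges=p_pca_m$edges,
           axes="bottomleft",
           edges.col="#E7950F",
           edges.width=3)
# Save the same tour as an animated GIF
render_gif(p_pca_m_d,
           grand_tour(),
           display_xy(half_range=4.2,
                      edges=p_pca_m$edges,
                      edges.col="#E7950F",
                      edges.width=3),
           gif_file="gifs/p_pca_model.gif",
           frames=500,
           width=400,
           height=400,
           loop=FALSE)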
Data in the model space
Model in the data space
See Jayani Lakshika’s talk, Fri 11am: IPS12
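library(tidyverse)
library(tourr)
library(GGally)
set.seed(946)
# Three independent uniform variables
d <- tibble(x1=runif(200, -1, 1),
            x2=runif(200, -1, 1),
            x3=runif(200, -1, 1))
# x4 is x3 plus a small amount of noise, so x3 and x4 are strongly associated
d <- d %>%
  mutate(x4 = x3 + runif(200, -0.1, 0.1))
# Add one point that breaks the x3-x4 association
d <- bind_rows(d, c(x1=0, x2=0, x3=-0.5, x4=0.5))
# Mix the variables by pi/6 so the anomaly is hidden in the original axes.
# Note: mutate() evaluates sequentially, so x3 is computed from the
# already updated x1, and x4 from the already updated x2.
d_r <- d %>%
  mutate(x1 = cos(pi/6)*x1 + sin(pi/6)*x3,
         x3 = -sin(pi/6)*x1 + cos(pi/6)*x3,
         x2 = cos(pi/6)*x2 + sin(pi/6)*x4,
         x4 = -sin(pi/6)*x2 + cos(pi/6)*x4)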
The tourr package provides the algorithms to generate tour paths, and can be extended with new tours and different displays.
detourr, liminal and langevitour offer elegant interactivity, but need to be developed further.
Talks at this conference to learn more:
Please use these tools at home 😃
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Best model: four-cluster VEE
Three-cluster EEE
Convex hulls are often used to summarise clusters in 2D. It is possible to view these in high-d, too.
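For the 2D case, a minimal base-R sketch using `chull()` from grDevices on two simulated clusters (the data frame `d` and its labels are made up for illustration). For more than two dimensions a different routine is needed; `convhulln()` in the geometry package is one option.

```r
set.seed(6)
# Two illustrative clusters in 2D
d <- data.frame(
  x = c(rnorm(50, 0), rnorm(50, 4)),
  y = c(rnorm(50, 0), rnorm(50, 4)),
  cl = rep(c("a", "b"), each = 50)
)

# chull() returns the row indices of the points on the hull, in order
hulls <- lapply(split(d, d$cl), function(g) g[chull(g$x, g$y), ])

# Draw the points and outline each cluster with its convex hull
plot(d$x, d$y, col = factor(d$cl), pch = 16)
for (h in hulls) polygon(h$x, h$y, border = "grey30")
```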