Dianne Cook

Econometrics and Business Statistics

Monash University


We are going to see that we can gain intuition for structure in high dimensions through visualisation.

It doesn’t mean that it’s easy. It doesn’t mean that visualisation is used alone. It means that (high-dimensional) visualisation is an important part of your toolbox, especially to allow discovery of what *we don’t know*.

- Using a tour to see into high dimensions
- Algorithms in the tourr package
- New developments in recent years
- Examples of usage
- Future research directions
- Other talks at this conference

Tours of high-dimensional data are like examining the shadows (projections)

(and slices/sections to see through a shadow)

Increasing dimension adds an additional orthogonal axis.
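As a rough illustration of adding an orthogonal axis, the sketch below builds the corners and edges of a $p$-dimensional unit cube in base R. The helper name `cube_wireframe` is hypothetical and is not a function from any package mentioned in these slides.

```r
# Corners and edges of a p-dimensional unit cube (base R only).
# cube_wireframe() is a hypothetical helper written for this example.
cube_wireframe <- function(p) {
  # All 2^p corners: every combination of 0/1 across p axes
  vertices <- as.matrix(expand.grid(rep(list(0:1), p)))
  colnames(vertices) <- paste0("x", seq_len(p))
  # Two corners share an edge when they differ in exactly one coordinate
  n_diff <- as.matrix(dist(vertices, method = "manhattan"))
  idx <- which(n_diff == 1 & upper.tri(n_diff), arr.ind = TRUE)
  list(points = vertices, edges = unname(idx))
}

# Each extra axis doubles the number of corners
sapply(1:4, function(p) nrow(cube_wireframe(p)$points))  # 2 4 8 16
sapply(1:4, function(p) nrow(cube_wireframe(p)$edges))   # 1 4 12 32
```

The `points` and `edges` pair is the kind of wireframe that can be passed to a tour display to watch the cube rotate.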

If you want more high-dimensional shapes, the R package geozoo will generate cubes, spheres, simplices, Möbius strips, tori, Boy's surface, Klein bottles, cones, various polytopes, …

And read or watch Flatland: A Romance of Many Dimensions (1884) by Edwin Abbott.

Data

$\begin{eqnarray*} X_{~n\times p} = [X_{~1}~X_{~2}~\dots~X_{~p}]_{~n\times p} = \left[ \begin{array}{cccc} x_{~11} & x_{~12} & \dots & x_{~1p} \\ x_{~21} & x_{~22} & \dots & x_{~2p}\\ \vdots & \vdots & & \vdots \\ x_{~n1} & x_{~n2} & \dots & x_{~np} \end{array} \right]_{~n\times p} \end{eqnarray*}$

Projection

$\begin{eqnarray*} A_{~p\times d} = \left[ \begin{array}{cccc} a_{~11} & a_{~12} & \dots & a_{~1d} \\ a_{~21} & a_{~22} & \dots & a_{~2d}\\ \vdots & \vdots & & \vdots \\ a_{~p1} & a_{~p2} & \dots & a_{~pd} \end{array} \right]_{~p\times d} \end{eqnarray*}$

Projected data

$\begin{eqnarray*} Y_{~n\times d} = XA = \left[ \begin{array}{cccc} y_{~11} & y_{~12} & \dots & y_{~1d} \\ y_{~21} & y_{~22} & \dots & y_{~2d}\\ \vdots & \vdots & & \vdots \\ y_{~n1} & y_{~n2} & \dots & y_{~nd} \end{array} \right]_{~n\times d} \end{eqnarray*}$
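A minimal sketch of the notation above, using simulated data and base R only. The sizes ($n=100$, $p=5$, $d=2$) and the random orthonormal $A$ are assumptions chosen for illustration.

```r
set.seed(2023)
n <- 100; p <- 5; d <- 2

# Data matrix X (n x p), simulated here purely for illustration
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Projection matrix A (p x d): orthonormalise a random matrix with QR
A <- qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))

# Projected data Y = XA (n x d)
Y <- X %*% A

dim(Y)                   # 100 2
round(crossprod(A), 10)  # columns of A are orthonormal: 2 x 2 identity
```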

Data is 2D: $~~p=2$

Projection is 1D: $~~d=1$

$\begin{eqnarray*} A_{~2\times 1} = \left[ \begin{array}{c} a_{~11} \\ a_{~21}\\ \end{array} \right]_{~2\times 1} \end{eqnarray*}$

Notice that the values of $A$ vary between -1 and 1. All possible values are shown during the tour.

$\begin{eqnarray*} A = \left[ \begin{array}{c} 1 \\ 0\\ \end{array} \right] ~~~~~~~~~~~~~~~~ A = \left[ \begin{array}{c} 0.7 \\ 0.7\\ \end{array} \right] ~~~~~~~~~~~~~~~~ A = \left[ \begin{array}{c} 0.7 \\ -0.7\\ \end{array} \right] \end{eqnarray*}$

Watching the 1D shadows we can see:

- unimodality
- bimodality, there are two clusters.
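To make this concrete, here is a small sketch with simulated 2D data (an assumption, not the data shown on the slide): two clusters that look like one group in one 1D projection and like two groups in another.

```r
set.seed(1)
# Simulated 2D data: two clusters separated along the direction (1, -1)
X <- rbind(
  cbind(rnorm(100, -2, 0.5), rnorm(100,  2, 0.5)),
  cbind(rnorm(100,  2, 0.5), rnorm(100, -2, 0.5))
)

# Three 1D projection vectors, each of unit length
A_list <- list(
  x1_only  = c(1, 0),
  diagonal = c(1, 1) / sqrt(2),   # roughly (0.7, 0.7)
  anti     = c(1, -1) / sqrt(2)   # roughly (0.7, -0.7)
)

# Project the data onto each vector
proj <- lapply(A_list, function(a) as.vector(X %*% a))

# The spread is small where the clusters overlap (diagonal)
# and large where they separate (anti)
sapply(proj, sd)

# Density estimates of each 1D shadow
op <- par(mfrow = c(1, 3))
for (nm in names(proj)) plot(density(proj[[nm]]), main = nm)
par(op)
```

In the `diagonal` projection both clusters land on top of each other, so the density is unimodal; in the `anti` projection the two modes are clearly separated.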

What does the 2D data look like? Can you sketch it?


The 2D data

Data is 3D: $p=3$

Projection is 2D: $d=2$

$\begin{eqnarray*} A_{~3\times 2} = \left[ \begin{array}{cc} a_{~11} & a_{~12} \\ a_{~21} & a_{~22}\\ a_{~31} & a_{~32}\\ \end{array} \right]_{~3\times 2} \end{eqnarray*}$

Notice that the values of $A$ vary between -1 and 1. All possible values are shown during the tour.

See:

- circular shapes
- some transparency, reveals middle
- hole in some projections
- no clustering

Data is 4D: $p=4$

Projection is 2D: $d=2$

$\begin{eqnarray*} A_{~4\times 2} = \left[ \begin{array}{cc} a_{~11} & a_{~12} \\ a_{~21} & a_{~22}\\ a_{~31} & a_{~32}\\ a_{~41} & a_{~42}\\ \end{array} \right]_{~4\times 2} \end{eqnarray*}$

How many clusters do you see?

- three, right?
- one separated, and two very close,
- and they each have an elliptical shape.

- do you also see an outlier or two?

1D paths in 3D space

2D paths in 3D space

Grand tour: see from all sides

Guided tour: Steer towards the most interesting features.
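A short sketch of both tour types with the tourr package, using its built-in `flea` data as a stand-in (an assumption; the slides use other data). The animations only run in an interactive session.

```r
library(tourr)

set.seed(2023)
X <- flea[, 1:6]  # six numeric variables from the flea data shipped with tourr

# Record a few target bases of a grand tour without opening a display
grand_path <- save_history(X, grand_tour(d = 2), max_bases = 5)
dim(grand_path)[1:2]  # each basis is p x d, here 6 x 2

if (interactive()) {
  # Grand tour: random target planes, to see the data from all sides
  animate_xy(X, grand_tour(d = 2))
  # Guided tour: target planes chosen to increase a projection pursuit
  # index, here the holes index
  animate_xy(X, guided_tour(holes(), d = 2))
}
```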

Avoid being a blind man inspecting the elephant

Principal component analysis

NLDR: t-distributed stochastic neighbour embedding (t-SNE)

*Choice of target planes*

- **grand**: random
- **guided**: objective function
- **local**: nearby
- **little**: marginals
- **manual/radial**: specific variable

*Interpolation between them*

- **geodesic**: plane to plane
- **Givens**: frame/basis to frame/basis

*How should you plot your projected data?*

- **1D**: density, dotplot, histogram
- **2D**: scatterplot, density2D, **sage**, **pca**, **slice**
- **3D**: stereo
- **kD**: parallel coordinates, scatterplot matrix
- **1D+spatial**: image

- **interactivity**: detourr, liminal, langevitour
- **slice/section**: explore shape of models
- **manual/radial tour**: explore sensitivity of structure to particular variables
- **sage**: correct for piling
- **Givens interpolation**: frame to frame

Utilise distance from the projection plane to make the slice, and shift centre of projection plane.
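The slicing idea can be sketched in base R as follows. The helper `slice_select` and the thickness `h` are hypothetical names for this example; this is not the implementation used in tourr.

```r
# Mark points that lie within distance h of the projection plane spanned by
# the orthonormal columns of A, after shifting the plane to pass through
# `centre`. slice_select() is a hypothetical helper for illustration.
slice_select <- function(X, A, centre = colMeans(X), h = 0.5) {
  Xc <- sweep(X, 2, centre)                # move the slice centre to the origin
  resid <- Xc - Xc %*% A %*% t(A)          # part of each point orthogonal to the plane
  dist_orth <- sqrt(rowSums(resid^2))      # orthogonal distance to the plane
  list(inside = dist_orth < h, dist = dist_orth, proj = Xc %*% A)
}

set.seed(1)
X <- matrix(runif(500 * 4, -1, 1), ncol = 4)    # simulated 4D cube of points
A <- qr.Q(qr(matrix(rnorm(4 * 2), ncol = 2)))   # random 2D projection plane
s <- slice_select(X, A, centre = rep(0, 4), h = 0.3)

mean(s$inside)  # fraction of points inside the slice

# Points inside the slice are drawn solid, the rest faintly
plot(s$proj, col = ifelse(s$inside, "black", "grey85"), pch = 16,
     xlab = "P1", ylab = "P2", asp = 1)
```

Changing `centre` moves the slice through the data, which is how different sections of a shape can be examined.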

As the number of variables increases, projected points concentrate in the centre, possibly obscuring important structure.

Transformation expands the centre to make a sage display.
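The piling effect that the sage display corrects for can be seen in a small simulation. This sketch only demonstrates the concentration and does not implement the sage transformation; `runif_ball` is a hypothetical helper.

```r
set.seed(1)

# Sample n points uniformly inside the unit ball in p dimensions
# (hypothetical helper written for this example)
runif_ball <- function(n, p) {
  z <- matrix(rnorm(n * p), ncol = p)
  z <- z / sqrt(rowSums(z^2))   # uniform directions on the sphere surface
  z * runif(n)^(1 / p)          # radii scaled so the ball is filled uniformly
}

# Mean distance from the centre in a 2D projection, for increasing p
dims <- c(3, 10, 50)
mean_radius_2d <- sapply(dims, function(p) {
  y <- runif_ball(2000, p)[, 1:2]   # project onto the first two axes
  mean(sqrt(rowSums(y^2)))
})
names(mean_radius_2d) <- paste0("p=", dims)
mean_radius_2d  # shrinks as p grows: points pile up near the centre
```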

Givens interpolation ends at the requested frame, whereas geodesic interpolation arrives at the requested plane but is frame-agnostic, which is problematic for optimisation using the guided tour.
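The difference between a plane and a frame can be checked numerically. In this base R sketch (the matrices are simulated, an assumption for illustration), two different frames span the same plane, so a plane-to-plane interpolation cannot tell them apart.

```r
set.seed(1)

# A random orthonormal frame A for a 2D plane in 4D
A <- qr.Q(qr(matrix(rnorm(4 * 2), ncol = 2)))

# Rotate the frame within its own plane by 60 degrees
theta <- pi / 3
R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), ncol = 2)
B <- A %*% R

# Same plane: the projection matrices A A' and B B' agree
same_plane <- isTRUE(all.equal(A %*% t(A), B %*% t(B)))
# Different frame: the bases themselves do not
same_frame <- isTRUE(all.equal(A, B))

c(same_plane = same_plane, same_frame = same_frame)  # TRUE FALSE
```

An index whose value changes under rotation within the plane would give different values for `A` and `B`, which is why arriving at the plane alone is not enough for the optimiser.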

If you want to discover and mark the clusters you see, you can use the detourr package to spin and brush points. Here’s a live demo. Hopefully this works.

cl_w | cl_mc = 1 | cl_mc = 2 | cl_mc = 3 |
---|---|---|---|
1 | 149 | 8 | 0 |
2 | 0 | 0 | 119 |
3 | 0 | 57 | 0 |

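```
library(crosstalk)
library(detourr)
library(plotly)
library(viridis)

# penguins_cl is assumed to hold the measured variables (columns bl to bm),
# the cluster labels cl_w and cl_mc, and the plotting columns cl_w_j and cl_mc_j
p_cl_shared <- SharedData$new(penguins_cl)

# Grand tour of the measured variables, coloured by the cl_w clustering
detour_plot <- detour(p_cl_shared, tour_aes(
    projection = bl:bm,
    colour = cl_w)) |>
  tour_path(grand_tour(2),
    max_bases = 50, fps = 60) |>
  show_scatter(alpha = 0.7, axes = FALSE,
    width = "100%", height = "450px")

# Scatterplot comparing the two clusterings, linked for brushing
conf_mat <- plot_ly(p_cl_shared,
    x = ~cl_mc_j,
    y = ~cl_w_j,
    color = ~cl_w,
    colors = viridis_pal(option = "D")(3),
    height = 450) |>
  highlight(on = "plotly_selected",
    off = "plotly_doubleclick") |>
  add_trace(type = "scatter",
    mode = "markers")

# Show the two linked displays side by side
bscols(
  detour_plot, conf_mat,
  widths = c(5, 6)
)
```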

Best projection provided by the guided tour, separating three species.

Removing flipper length

Removing bill length

Projection

Slice

This is especially useful for exploring classification models, **comparing boundaries** produced by different models. (The same penguins data is used here.)

Linear discriminant analysis

Classification tree

Data in the model space

Model in the data space

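```
library(mulgar)
library(tourr)

# p_pca (a fitted PCA) and penguins_sub (the penguins data, with the four
# measured variables in columns 1 to 4) are assumed to exist already

# Generate points and edges representing the 2D PCA model plane
p_pca_m <- pca_model(p_pca, s = 2.2)

# Stack the model points on top of the data so both are shown together
p_pca_m_d <- rbind(p_pca_m$points, penguins_sub[, 1:4])

# View the model in the data space with a grand tour
animate_xy(p_pca_m_d, edges = p_pca_m$edges,
           axes = "bottomleft",
           edges.col = "#E7950F",
           edges.width = 3)

# Save the same tour as an animated gif
render_gif(p_pca_m_d,
           grand_tour(),
           display_xy(half_range = 4.2,
                      edges = p_pca_m$edges,
                      edges.col = "#E7950F",
                      edges.width = 3),
           gif_file = "gifs/p_pca_model.gif",
           frames = 500,
           width = 400,
           height = 400,
           loop = FALSE)
```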

Data in the model space

Model in the data space

???

See Jayani Lakshika’s talk, Fri 11am: IPS12

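```
library(tidyverse)
library(tourr)
library(GGally)

set.seed(946)
# Three independent uniform variables
d <- tibble(x1 = runif(200, -1, 1),
            x2 = runif(200, -1, 1),
            x3 = runif(200, -1, 1))
# x4 is strongly associated with x3
d <- d %>%
  mutate(x4 = x3 + runif(200, -0.1, 0.1))
# Add one point that breaks the x3, x4 relationship
d <- bind_rows(d, tibble(x1 = 0, x2 = 0, x3 = -0.5, x4 = 0.5))
# Mix the variables pairwise using a 30 degree angle.
# Note: mutate() evaluates sequentially, so the x3 and x4 lines
# use the already-updated x1 and x2.
d_r <- d %>%
  mutate(x1 = cos(pi/6)*x1 + sin(pi/6)*x3,
         x3 = -sin(pi/6)*x1 + cos(pi/6)*x3,
         x2 = cos(pi/6)*x2 + sin(pi/6)*x4,
         x4 = -sin(pi/6)*x2 + cos(pi/6)*x4)
```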


The tourr package provides the algorithms to generate tour paths, and a framework for creating new tours and new displays.

- Stopping, pausing, going back
- Zooming in, focus on subsets
- Linking between multiple displays

Elegant interactivity solutions exist in detourr, liminal and langevitour, but need to be developed further.

- Better integration with model objects
- Specialist design for different models
- Integrating other guidance, explainability metrics

Talks at this conference to learn more:

- Fri 11am: IPS12: Visualising high-dimensional and complex data (Paul Harrison, Jayani Lakshika)
- Fri 1:30pm: CPS09: Visualising Complex data & Anomaly Detection (Janith Wanniarachchi - XAI)

*Please use these tools at home* 😃

- Cook and Laa (2023) Interactively exploring high-dimensional data and models in R
- Slides made in Quarto
- Get a copy of slides at

https://github.com/dicook/IASC-ARS-2023

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Best model: four-cluster VEE

Three-cluster EEE

Convex hulls are often used to summarise clusters in 2D. It is possible to view these in high-d, too.
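One possible way to do this, sketched under the assumption that the geometry package is installed: `convhulln()` computes the hull in the full space, and its facets are converted to an edge list that a tour display can draw. The data here is simulated, and this is not necessarily how the figures in these slides were made.

```r
library(geometry)  # assumed installed; provides convhulln()

set.seed(1)
x <- matrix(rnorm(200 * 4), ncol = 4)  # simulated 4D cluster
x[1, ] <- 0  # a point deep inside the cloud, so it should not be on the hull

# Each row of `facets` holds the indices of the 4 points spanning one facet
facets <- convhulln(x)

# Turn facets into a two-column edge list: all vertex pairs within a facet
pairs <- combn(ncol(facets), 2)
edges <- do.call(rbind, lapply(seq_len(ncol(pairs)),
                               function(k) facets[, pairs[, k]]))
edges <- unique(t(apply(edges, 1, sort)))  # drop edges shared between facets

hull_vertices <- sort(unique(as.vector(facets)))
length(hull_vertices)  # number of points on the 4D hull

if (interactive()) {
  # View the hull wireframe with the data in a grand tour
  tourr::animate_xy(x, edges = edges)
}
```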