time | topic |
3:00-3:15 | More on dimension reduction |
3:15-4:00 | Understanding clusters in data using visualisation |
4:00-4:40 | Building better classification models with visual input |
4:40-5:00 | Bring your own data! |
Session 2
Find a low-dimensional layout of points that approximates the distances between the points in high dimensions. The purpose is a useful representation that reveals high-dimensional patterns, such as clusters.
Multidimensional scaling (MDS) is the original approach:
\[ \mbox{Stress}_D(x_1, ..., x_n) = \left(\sum_{i, j=1; i\neq j}^n (d_{ij} - d_k(i,j))^2\right)^{1/2} \] where \(D\) is an \(n\times n\) matrix of distances \((d_{ij})\) between all pairs of points, and \(d_k(i,j)\) is the distance between the points in the low-dimensional space.
PCA is a special case of MDS: classical MDS applied to Euclidean distances gives the same layout as PCA. The result from PCA is a linear projection, whereas MDS in general can produce a non-linear transformation.
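As a minimal sketch of the stress criterion and of the link to PCA, the code below uses base R only. The simulated data and the object names (`X`, `mds`, `pca`) are assumptions for illustration, not part of the workshop material, and `cmdscale()` is classical MDS, which is one MDS variant rather than a direct stress minimiser.

```r
# Simulated data (an assumption for illustration): 50 points in 5D
set.seed(1)
X <- matrix(rnorm(50 * 5), ncol = 5)
D <- dist(X)                       # high-dimensional distances d_ij

# Classical MDS layout in k = 2 dimensions
mds <- cmdscale(D, k = 2)

# Stress of the layout. dist() stores each pair once, so the sum is
# doubled to match the sum over all i != j in the formula.
stress <- sqrt(2 * sum((as.vector(D) - as.vector(dist(mds)))^2))

# PCA scores on the first two principal components of the same data
pca <- prcomp(X)$x[, 1:2]

# The two layouts agree up to the sign of each axis
max_diff <- max(abs(abs(mds) - abs(pca)))
```

Here `max_diff` is zero up to rounding error, which is the sense in which PCA is a special case of MDS.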
Many variations on this idea, known as non-linear dimension reduction (NLDR), have been developed, including t-SNE and UMAP.
NLDR can be useful but it can also make some misleading representations.
Tour animation of the same data
Which is the best representation, t-SNE or UMAP, of this 9D data?
You can use this code to read the data and view in a tour:
The ideal is a set of neatly separated clusters, but this is rarely encountered in data.
Objective is to organize the cases into groups that are similar in some way. You need a measure of similarity (or distance).
To decide on a best result, you need to see how it divides the data into clusters. Cluster statistics, such as the dendrogram, cluster summaries or the gap statistic, might all look good even when the result is bad. You need to see the model in the data space!
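The workflow described above, from distance measure to cluster labels, can be sketched with base R. The simulated two-group data, the choice of Euclidean distance on standardised variables, and Ward linkage are all assumptions made for illustration.

```r
# Simulated data (an assumption): two groups of 40 cases in 4D
set.seed(2)
X <- rbind(matrix(rnorm(40 * 4), ncol = 4),
           matrix(rnorm(40 * 4, mean = 4), ncol = 4))

# Distance between cases, computed on standardised variables
d <- dist(scale(X))

# Hierarchical clustering with Ward linkage, cut into two clusters
hc <- hclust(d, method = "ward.D2")
cl <- cutree(hc, k = 2)

# Cluster sizes: a summary only, it does not show where the split falls
table(cl)
```

The table of cluster sizes is a summary statistic only. To see how the solution divides the data, colour the points by `cl` and view them in the data space, for example with `tourr::animate_xy(X, col = cl)` in an interactive session.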
Model-based clustering fits a multivariate normal mixture model to the data.
\[ \Sigma_k = \lambda_kD_kA_kD_k^\top, ~~~k=1, \dots, g \]
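where \(\Sigma_k\) is the variance-covariance matrix of cluster \(k\),
\(g\) is the number of clusters,
\(D_k\) is an orthogonal matrix describing the orientation of cluster \(k\),
\(A_k\) is a diagonal matrix describing its shape, that is, the relative variance along each principal direction,
\(\lambda_k\) is a scalar giving its overall size.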
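To make the decomposition concrete, here is a small 2D sketch in base R that builds one covariance matrix from the three components and then recovers them with an eigendecomposition. The numerical values of `lambda`, `theta` and `A` are arbitrary choices for illustration.

```r
lambda <- 2                        # overall size
theta  <- pi / 6                   # rotation angle (arbitrary choice)

# Orientation: a 2D rotation matrix, which is orthogonal
D <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), nrow = 2)

# Shape: diagonal, with determinant 1
A <- diag(c(4, 1 / 4))

# Sigma = lambda * D A D^T
Sigma <- lambda * D %*% A %*% t(D)

# The eigenvalues of Sigma are lambda * diag(A),
# and the eigenvectors give the orientation back (up to sign)
e <- eigen(Sigma)
e$values
```

The eigenvalues are 8 and 0.5, that is, `lambda` times the diagonal of `A`. Constraining which of the three components are shared across clusters gives the different model families, such as the EEE and VEE fits shown in these slides.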
Clustering this data. What do you expect?
Can we assume the shape of the clusters is elliptical?
Four-cluster VEE
Three-cluster EEE
Models (ellipses) are overlaid on the data. Which is the best fit?
Extract the estimated model parameters
Use the spin-and-brush approach to extract the clusters from the c1 data.
Linear discriminant analysis is the ideal classifier for this data.
A random forest is the simplest classifier to fit for complicated boundaries. It is built from multiple trees generated by randomly sampling the cases and the variables. The random sampling (with replacement) of cases has the fortunate effect of creating a training (“in-bag”) and a test (“out-of-bag”) sample for each tree computed. Its most useful by-products are diagnostics that help us to assess the model: the votes, the measure of variable importance, and the proximity matrix.
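The in-bag and out-of-bag split can be illustrated with a single bootstrap sample in base R. This is a sketch of the sampling step only, with an arbitrary `n`, and not of how any particular random forest package stores its samples.

```r
set.seed(3)
n <- 1000                          # number of cases (arbitrary)

# Bootstrap sample of case indices, as drawn for one tree
in_bag <- sample(n, n, replace = TRUE)

# Cases never drawn form the out-of-bag sample for that tree
oob <- setdiff(seq_len(n), in_bag)

# Fraction out-of-bag, compared with its expected value (1 - 1/n)^n
oob_frac <- length(oob) / n
expected <- (1 - 1 / n)^n
```

The expected fraction is close to exp(-1), roughly 37%, so each tree has about a third of the cases available as a test sample.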
randomForest(formula = cause ~ ., data = bushfires_sub, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of error rate: 11%
Confusion matrix:
            accident arson burning_off lightning class.error
accident          73     3           0        62       0.471
arson             11     8           1        17       0.784
burning_off        3     0           3         3       0.667
lightning         14     0           0       823       0.017
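The class errors and the overall error rate follow directly from the confusion matrix. The sketch below re-enters the printed counts by hand in base R, with observed classes in rows and predicted classes in columns. The object names are assumptions for illustration.

```r
classes <- c("accident", "arson", "burning_off", "lightning")

# Counts copied from the printed confusion matrix
cm <- matrix(c(73, 3, 0,  62,
               11, 8, 1,  17,
                3, 0, 3,   3,
               14, 0, 0, 823),
             nrow = 4, byrow = TRUE,
             dimnames = list(observed = classes, predicted = classes))

# Per-class error: proportion of each observed class that is misclassified
class_error <- 1 - diag(cm) / rowSums(cm)
round(class_error, 3)

# Overall error: proportion of all cases that are misclassified
oob_error <- 1 - sum(diag(cm)) / sum(cm)
round(oob_error, 2)
```

The overall rate of 11% is dominated by the 837 lightning cases, which are classified well. The three smaller classes have errors between 47% and 78%, which the headline figure hides.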
The votes matrix can be considered to be predictive probabilities, where the values for each observation sum to 1. With 3 classes the votes lie in a 2D triangle. With 4 or more classes they lie in a higher-dimensional simplex, which can be examined in a tour.
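The 3-class case can be sketched in base R: because each row sums to 1, the votes lie in a plane, and an orthonormal Helmert basis maps them to 2D without changing any distances. The votes here are simulated, and `contr.helmert()` is used as a base R stand-in for a Helmert matrix, so this illustrates the idea rather than reproducing the workshop code.

```r
# Hypothetical votes for 3 classes: each row is rescaled to sum to 1
set.seed(4)
raw   <- matrix(rexp(20 * 3), ncol = 3)
votes <- raw / rowSums(raw)

# Orthonormal basis for the plane orthogonal to (1, 1, 1),
# built from Helmert contrasts with each column scaled to length 1
H <- contr.helmert(3)
H <- sweep(H, 2, sqrt(colSums(H^2)), "/")

# 2D coordinates of the votes, inside a triangle
tri <- votes %*% H

# Distances between observations are unchanged by the projection
dist_diff <- max(abs(as.vector(dist(votes)) - as.vector(dist(tri))))
```

The same construction with 4 classes gives 3D coordinates inside a tetrahedron, which is why a tour is needed to examine it.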
Votes matrix for bushfire model fit
# Packages assumed by this code; bushfires_rf is the fitted random forest
library(dplyr)
library(geozoo)
library(tourr)

# Create votes matrix data
bushfires_rf_votes <- bushfires_rf$votes %>%
  as_tibble() %>%
  mutate(cause = bushfires_sub$cause)

# Project 4D into 3D
proj <- t(geozoo::f_helmert(4)[-1, ])
b_rf_v_p <- as.matrix(bushfires_rf_votes[, 1:4]) %*% proj
colnames(b_rf_v_p) <- c("x1", "x2", "x3")
b_rf_v_p <- b_rf_v_p %>%
  as.data.frame() %>%
  mutate(cause = bushfires_sub$cause)

# Add simplex vertices, with an empty class label
simp <- simplex(p = 3)
sp <- data.frame(simp$points)
colnames(sp) <- c("x1", "x2", "x3")
sp$cause <- ""
b_rf_v_p_s <- bind_rows(sp, b_rf_v_p) %>%
  mutate(cause = factor(cause))

# Label only the four vertices; observations are left unlabelled
labels <- c("accident", "arson",
            "burning_off", "lightning",
            rep("", nrow(b_rf_v_p)))

# View the votes and the simplex edges in a tour
animate_xy(b_rf_v_p_s[, 1:3], col = b_rf_v_p_s$cause,
           axes = "off", half_range = 1.3,
           edges = as.matrix(simp$edges),
           obs_labels = labels)
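# Packages assumed by this code; p_lda, p_tidy and p_tidy_std must already exist
library(dplyr)
library(crosstalk)
library(detourr)
library(plotly)
library(RColorBrewer)

# Add predicted species, and jittered class codes for plotting
p_cl <- p_tidy_std |>
  mutate(pspecies = predict(p_lda, p_tidy)$class) |>
  dplyr::select(bl:bm, species, pspecies) |>
  mutate(sp_jit = jitter(as.numeric(species), 0.5),
         psp_jit = jitter(as.numeric(pspecies), 0.5))

# Share the data so that selections link the two plots
p_cl_shared <- SharedData$new(p_cl)

# Tour of the measurement variables.
# Assumption: a 2D grand tour path; the source shows only max_bases and fps.
detour_plot <- detour(p_cl_shared, tour_aes(
  projection = bl:bm,
  colour = species)) |>
  tour_path(grand_tour(2),
            max_bases = 50, fps = 60) |>
  show_scatter(alpha = 0.9, axes = FALSE,
               width = "100%", height = "450px",
               palette = brewer.pal(3, "Dark2"))

# Jittered confusion matrix: predicted (x) against observed (y) species
conf_mat <- plot_ly(p_cl_shared,
                    x = ~psp_jit,
                    y = ~sp_jit,
                    color = ~species,
                    colors = brewer.pal(3, "Dark2"),
                    height = 450) |>
  highlight(on = "plotly_selected",
            off = "plotly_doubleclick") |>
  add_trace(type = "scatter",
            mode = "markers")

# Show the plots side by side.
# Assumption: crosstalk::bscols() is the layout call; the source shows only its arguments.
bscols(
  detour_plot, conf_mat,
  widths = c(5, 6)
)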
Explore the misclassifications in the random forest fit of the penguins data, using the code provided in slides2.R.
Explore and compare the boundaries of different models using the slice tour.
Dissect and explore the operation of a neural network.
If you don’t have your own data, try the risk_MSA data, which contains survey responses on six types of risk (recreational, health, career, financial, safety and social) perceived in relation to Australian tourism. The data was collected in 2015 and is used in Dolnicar et al. (2018).
All of the material presented today comes from
Cook and Laa (2024) Interactively exploring high-dimensional data and models in R
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.