class: center, middle, inverse, title-slide .title[ # Touring multivariate data ] .subtitle[ ## SISBID 2024
https://github.com/dicook/SISBID
] .author[ ### Di Cook (dicook@monash.edu) <br> Heike Hofmann (hhofmann4@unl.edu) <br> Susan Vanderplas (susan.vanderplas@unl.edu
) ] .date[ ### 08/14-16/2024 ] --- Penguins data: See https://allisonhorst.github.io/palmerpenguins/ for more details. <br> <br> <table> <tr> <td width="40%"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Adélie_Penguin.jpg/320px-Adélie_Penguin.jpg" width="100%" /> </td> <td width="30%"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Pygoscelis_papua_-Jougla_Point%2C_Wiencke_Island%2C_Palmer_Archipelago_-adults_and_chicks-8.jpg/273px-Pygoscelis_papua_-Jougla_Point%2C_Wiencke_Island%2C_Palmer_Archipelago_-adults_and_chicks-8.jpg" width="100%" /> </td> <td width="30%"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/A_chinstrap_penguin_%28Pygoscelis_antarcticus%29_on_Deception_Island_in_Antarctica.jpg/201px-A_chinstrap_penguin_%28Pygoscelis_antarcticus%29_on_Deception_Island_in_Antarctica.jpg" width="90%" /> </td> </tr> <tr> <td> Adélie .footnote[[Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Adélie_Penguin.jpg/320px-Adélie_Penguin.jpg)] </td> <td> Gentoo .footnote[[Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Pygoscelis_papua_-Jougla_Point%2C_Wiencke_Island%2C_Palmer_Archipelago_-adults_and_chicks-8.jpg/273px-Pygoscelis_papua_-Jougla_Point%2C_Wiencke_Island%2C_Palmer_Archipelago_-adults_and_chicks-8.jpg)] </td> <td> Chinstrap .footnote[[Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/A_chinstrap_penguin_%28Pygoscelis_antarcticus%29_on_Deception_Island_in_Antarctica.jpg/201px-A_chinstrap_penguin_%28Pygoscelis_antarcticus%29_on_Deception_Island_in_Antarctica.jpg)]</td> </tr> </table> --- .pull-left[ .small[ ``` r ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, colour=species, shape=species)) + geom_point(alpha=0.7, size=2) + scale_colour_discrete_divergingx(palette = "Zissou 1") + theme(aspect.ratio=1, legend.position="bottom") ``` ] ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-3-1.png" 
width="100%" style="display: block; margin: auto;" /> ] --- class: inverse middle center # Our first tour
What patterns do you see?
---

.pull-left[

``` r
# Pre-process the data: short names, complete cases, standardised variables
penguins_std <- penguins %>%
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g) %>%
  select(species, bl:bm) %>%
  na.omit() %>%
  mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))
```

``` r
# Run the tour
clrs <- divergingx_hcl(3, palette="Zissou 1")
# Index colours by the filtered data so they stay aligned after na.omit()
col <- clrs[as.numeric(penguins_std$species)]
animate_xy(penguins_std[,2:5], col=col, axes="off", fps=15)
```

]

.pull-right[
<img src="penguins2d.gif" width="100%">
]

---
class: inverse middle

# What did you see?

- clusters ✅

--

- outliers ✅

--

- linear dependence ✅

--

- elliptical clusters with slightly different shapes ✅

--

- separated elliptical clusters with slightly different shapes ✅

--

---

# Which shows better separation?

.pull-left[
<img src="penguins2d.gif" width="80%">
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# What is a tour?

.pull-left[
A grand tour is by definition a movie of low-dimensional projections constructed in such a way that it comes arbitrarily close to showing all possible low-dimensional projections; in other words, a grand tour is a space-filling curve in the manifold of low-dimensional projections of high-dimensional data spaces.

<img src="images/hands.png" width="80%">
]

.pull-right[
`\({\mathbf x}_i \in \mathcal{R}^p\)` is the `\(i^{th}\)` data vector.

`\(F\)` is a `\(p\times d\)` orthonormal basis, `\(F'F=I_d\)`, where `\(d\)` is the projection dimension.

The projection of `\({\mathbf x}_i\)` onto `\(F\)` is `\({\mathbf y}_i=F'{\mathbf x}_i\)`.

The tour is indexed by time, `\(F(t)\)`, where `\(t\in [a, z]\)`. The starting and target frames are denoted as `\(F_a = F(a), F_z=F(z)\)`.

The animation of the projected data is given by a path `\({\mathbf y}_i(t)=F'(t){\mathbf x}_i\)`.
]

---

# Geodesic interpolation between planes

.pull-left[
The tour is indexed by time, `\(F(t)\)`, where `\(t\in [a, z]\)`.
The starting and target frames are denoted as `\(F_a = F(a), F_z=F(z)\)`.

The animation of the projected data is given by a path `\({\mathbf y}_i(t)=F'(t){\mathbf x}_i\)`.
]

.pull-right[
<img src="images/geodesic.png" width="120%">
]

---
class: inverse middle center

# Reading axes - interpretation

Length and direction of axes relative to the pattern of interest

---

<img src="images/reading_axes.001.png" width="100%">

---

<img src="images/reading_axes.002.png" width="100%">

---

# Reading axes - interpretation

<iframe src="penguins.html" width="800" height="500" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe>

---

.pull-left[

``` r
ggplot(penguins,
   aes(x=flipper_length_mm,
       y=bill_depth_mm,
       colour=species,
       shape=species)) +
  geom_point(alpha=0.7, size=2) +
  scale_colour_discrete_divergingx(palette = "Zissou 1") +
  theme(aspect.ratio=1, legend.position="bottom")
```

<img src="index_files/figure-html/runthis13-1.png" width="90%" style="display: block; margin: auto;" />

Gentoo separates from the others in a contrast of fl and bd
]

.pull-right[

``` r
ggplot(penguins,
   aes(x=bill_length_mm,
       y=body_mass_g,
       colour=species,
       shape=species)) +
  geom_point(alpha=0.7, size=2) +
  scale_colour_discrete_divergingx(palette = "Zissou 1") +
  theme(aspect.ratio=1, legend.position="bottom")
```

<img src="index_files/figure-html/runthis14-1.png" width="90%" style="display: block; margin: auto;" />

Chinstrap separates from the others in a contrast of bl and bm
]

---
class: inverse middle left

There may be multiple and different combinations of variables that reveal similar structure. ☹️

The tour can help to discover these, too. 😂

---

# Other tour types

- .orange[guided]: follows the optimisation path for a projection pursuit index.
- .orange[little]: interpolates between all variables.
- .orange[local]: rocks back and forth from a given projection, so shows all possible projections within a radius.
- .orange[dependence]: two independent 1D tours
- .orange[frozen]: fixes some variable coefficients, others vary freely.
- .orange[manual]: control the coefficient of one variable, to examine the sensitivity of structure to this variable. (In the .orange[spinifex] package)
- .orange[slice]: use a section instead of a projection.
- .orange[sage]: transform a 2D projection, to avoid data piling.

---
class: inverse middle center

# guided tour

new target bases are chosen using a projection pursuit index function

---

`$$\mathop{\text{maximize}}_{F} g(F'x) ~~~\text{ subject to } F \text{ being orthonormal}$$`

.font_small[
- `holes`: This is an inverse Gaussian filter, which is optimised when there is not much data in the centre of the projection, i.e. a "hole" or donut shape in 2D.
- `central mass`: The opposite of holes, high density in the centre of the projection, and often "outliers" on the edges.
- `LDA`/`PDA`: An index based on linear discriminant analysis (and its penalised version), optimised by projections where the named classes are most separated.
]

---

.pull-left[
Grand

<img src="penguins2d.gif" width="80%">

.small[
Might accidentally see best separation
]
]

.pull-right[
Guided, using LDA index

<img src="penguins2d_guided.gif" width="80%">

.small[
Moves to the best separation
]
]

---
class: inverse middle center

# manual tour

control the coefficient of one variable, reduce it to zero, increase it to 1, maintaining orthonormality

---

# Manual tour

.pull-left[
- start from best projection, given by projection pursuit
- bl contribution controlled
- if bl is removed from the projection, Adélie and Chinstrap are mixed
- bl is important for Adélie
]

.pull-right[
<img src="penguins_manual_bl.gif" width="90%">
]

---

# Manual tour

.pull-left[
- start from best projection, given by projection pursuit
- fl contribution controlled
- clusters are less separated when fl is fully contributing
- fl is important, in small amounts, for Gentoo
]

.pull-right[
<img src="penguins_manual_fl.gif" width="90%">
]

---

# Local tour

.pull-left[
Rocks from and to a given projection, in order to observe the neighbourhood
]

.pull-right[
<img src="penguins2d_local.gif" width="90%">
]

---

# Projection dimension and displays

.pull-left[
<img src="penguins1d.gif" width="90%">
]

.pull-right[
<img src="penguins2d_dens.gif" width="90%">
]

---
class: inverse middle

# Your turn

Using the sample code from the tourr package, check how many clusters are in the example data.

``` r
library(tourr)
data(flea)
?animate_xy
# On a Mac, you may need to start a quartz graphics window
# quartz()
# On Windows, you may need to start an X11 graphics window
# X11()
animate_xy(flea[, 1:6])
# If you want to use your RStudio graphics window, it might show up better
# if you reduce the frame rate for drawing.
animate_xy(flea[, 1:6], fps=10)
```
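
To connect the animation back to the projection definition, here is a minimal sketch (not part of the original exercise) that computes a single 2D projection of the flea data by hand. It relies only on `basis_random()` from tourr, which the deck also uses when saving a tour path. The names `flea_std`, `basis` and `proj`, and the seed, are made up for illustration.

``` r
library(tourr)
data(flea)

# Standardise the six numeric flea variables (columns 1:6, as in animate_xy above)
flea_std <- scale(as.matrix(flea[, 1:6]))

# Draw one random p x d orthonormal basis, here p = 6 and d = 2
set.seed(2024)  # arbitrary seed, chosen only for reproducibility
basis <- basis_random(6, 2)

# Orthonormality check: F'F should be the 2 x 2 identity (up to rounding)
round(crossprod(basis), 10)

# One frame of a tour: each row of proj is y_i = F'x_i
proj <- flea_std %*% basis
plot(proj, asp = 1, xlab = "P1", ylab = "P2")
```

A tour animates a smooth sequence of such bases, so this static plot corresponds to one frame of `animate_xy()`.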
---

# Saving and sharing: Animated gif

.pull-left[

``` r
render_gif(
  penguins_std[,2:5],
  grand_tour(),
  display_xy(col=col, axes="bottomleft"),
  file="penguins2d.gif",
  frames=100,
  width=300,
  height=300)
```

]

.pull-right[
<img src="penguins2d.gif" width="80%">
]

---

# Saving and sharing: Animation

.pull-left[

``` r
set.seed(209)
b <- basis_random(4, 2)
penguins_pct <- tourr::save_history(penguins_std[,2:5],
                    tour_path = grand_tour(),
                    start = b,
                    max_bases = 5)
save(penguins_pct, file="data/p_tour_path.rda")
penguins_pcti <- interpolate(penguins_pct, 0.2)
penguins_anim <- render_anim(penguins_std,
                             vars = 2:5,
                             frames=penguins_pcti,
                             obs_labels=penguins_std$species)
```

]

.pull-right[
<iframe width="550" height="600" src="../../html/penguins.html" title="Animation of the penguins tour with interactive labelling."></iframe>
]

---

Code to draw this plot is complicated.

.pull-left[

``` r
penguins_gp <- ggplot() +
  geom_path(data=penguins_anim$circle,
            aes(x=c1, y=c2, frame=frame),
            linewidth=0.1) +
  geom_segment(data=penguins_anim$axes,
               aes(x=x1, y=y1,
                   xend=x2, yend=y2,
                   frame=frame),
               linewidth=0.1) +
  geom_text(data=penguins_anim$axes,
            aes(x=x2, y=y2,
                frame=frame,
                label=axis_labels),
            size=5) +
  geom_point(data=penguins_anim$frames,
             aes(x=P1, y=P2,
                 colour=species,
                 frame=frame,
                 label=obs_labels),
             alpha=0.8) +
```

]

.pull-right[

``` r
  xlim(-1,1) + ylim(-1,1) +
  scale_colour_discrete_divergingx(palette = "Zissou 1") +
  coord_equal() +
  theme_bw() +
  theme(legend.position = "none",
        axis.text=element_blank(),
        axis.title=element_blank(),
        axis.ticks=element_blank(),
        panel.grid=element_blank())
penguins_tour <- ggplotly(penguins_gp,
                          width=500,
                          height=550) %>%
  animation_button(label="Go") %>%
  animation_slider(len=0.8, x=0.5,
                   xanchor="center") %>%
  animation_opts(easing="linear",
                 transition = 0)
penguins_tour
htmlwidgets::saveWidget(penguins_tour,
                        file="html/penguins.html",
                        selfcontained = TRUE)
```

]

---

# Saving and sharing: Single frame

.pull-left[

``` r
load(here::here("data/p_tour_path.rda"))
penguins_pcti <- interpolate(penguins_pct, 0.2)
f27 <- matrix(penguins_pcti[,,27], ncol=2)
p27 <- render_proj(penguins_std[,2:5], f27,
                   obs_labels=penguins_std$species)
```

Draw it with ggplot, and possibly pass to plotly.
]

.pull-right[
]

---

# Resources

- [GGobi web site](http://www.ggobi.org), [ggobi book](http://www.ggobi.org/book)
- Emerson et al (2013) The Generalized Pairs Plot, Journal of Computational and Graphical Statistics, 22:1, 79-91
- [Natalia da Silva](http://natydasilva.com/) [PPForest](https://cran.r-project.org/web/packages/PPforest/index.html) and [shiny app](https://natydasilva.shinyapps.io/shinyV03/).
- Wickham et al (2011) [tourr: An R Package for Exploring Multivariate Data with Projections](https://www.jstatsoft.org/article/view/v040i02/v40i02.pdf) and the R package [tourr](https://cran.r-project.org/web/packages/tourr/index.html)
- Schloerke et al (2016) [Escape from Boxland](https://journal.r-project.org/archive/2016/RJ-2016-044/index.html), [the web site zoo](http://schloerke.com/geozoo/) and the R package [geozoo](https://cran.r-project.org/web/packages/geozoo/index.html)
- Spyrison and Cook (2020). spinifex: Manual Tours, Manual Control of Dynamic Projections of Numeric Multivariate Data. https://CRAN.R-project.org/package=spinifex
- Stuart Lee [liminal](https://github.com/sa-lee/liminal) New tools to do linked brushing between tours and PCA/tSNE/PDS views

---

# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.