class: center, middle, inverse, title-slide .title[ # Multivariate data plots ] .subtitle[ ## SISBID 2024
https://github.com/dicook/SISBID
] .author[ ### Di Cook (
dicook@monash.edu
)
Heike Hofmann (
hhofmann4@unl.edu
)
Susan Vanderplas (
susan.vanderplas@unl.edu
) ] .date[ ### 08/14-16/2024 ] --- class: inverse middle # Your turn - What is multivariate data? - What makes multivariate analysis different from univariate analysis?
−
+
01
:
00
??? - data is multivariate, if we have more information than a single aspect for each entity/person/experimental unit. - multivariate analysis takes relationships between these different aspects into account. --- # Main types of plots - __pairwise plots__: explore association between pairs of variables - __parallel coordinate plots__: use parallel axes to lay out many variables on a page - __heatmaps__: represent data value using colour, present as a coloured table - __tours__: sequence of projections of high-dimensional data, good for examining shape and distribution between many variables --- # Scatterplot matrix: GGally .pull-left[The basic plot plot for multivariate data is a scatterplot matrix. There are two functions available in the GGally package: `ggpairs`. ``` r # Make a simple scatterplot matrix of the new penguins data stdd <- function(x) (x-mean(x))/sd(x) penguins <- penguins |> filter(!is.na(bill_length_mm)) |> rename(bl = bill_length_mm, bd = bill_depth_mm, fl = flipper_length_mm, bm = body_mass_g) |> mutate_at(vars(bl:bm), stdd) ggpairs(penguins, columns=c(3:6)) ``` *What do we learn?* - clustering - linear dependence ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" /> ] --- .pull-left[ **Add some colour for groups** ``` r # Re-make mapping colour to species (class) ggpairs(penguins, columns=c(3:6), ggplot2::aes(colour=species)) + scale_color_viridis_d(option = "plasma", begin=0.2, end=0.8) + scale_fill_viridis_d(option = "plasma", begin=0.2, end=0.8) ``` <br> <br> <br> *What do we learn?* - clustering is due to the class variable ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" /> ] --- .pull-left[ Only show correlation.
This is dangerous!
Only appropriate if correlation is a good summary of the association, or if you have a large number of variables that you need to scan quickly. ``` r # Look at one species only adelie <- penguins |> filter(species == "Adelie") |> select(bl:bm) ggcorr(adelie) ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" /> ] --- .pull-left[ Only show correlation, using a corrgram. Again, this is dangerous, and only useful to get a broad overview of association that is suitably summarised by correlation. ``` r # install.packages("corrgram") library(corrgram) corrgram(adelie, lower.panel=corrgram::panel.ellipse) ``` The `corrgram` package has numerous scatterplot matrix capabilities. ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" /> ] --- # Large sample size .pull-left[ ``` r # Data downloaded from https://archive.ics.uci.edu/dataset/401/gene+expression+cancer+rna+seq # This chunk takes some time to run, so evaluated off-line tcga <- data.table(read.csv(here("data/TCGA-PANCAN-HiSeq-801x20531/data.csv"))) tcga_t <- t(as.matrix(tcga[,2:20532])) colnames(tcga_t) <- tcga$X tcga_t_pc <- prcomp(tcga_t, scale = FALSE)$x ggally_hexbin <- function (data, mapping, ...) { p <- ggplot(data = data, mapping = mapping) + geom_hex(binwidth=20, ...) p } ggpairs(tcga_t_pc, columns=c(1:4), lower = list(continuous = "hexbin")) + scale_fill_gradient(trans="log", low="#E24C80", high="#FDF6B5") ``` ] .pull-right[ ![](ggpairs-hexbin.png) ] --- # Generalized pairs plot .pull-left[ The pairs plot can also incorporate non-numerical variables, and different types of two variable plots. ``` r # Matrix plot when variables are not numeric data(australia_PISA2012) australia_PISA2012 <- australia_PISA2012 |> mutate(desk = factor(desk), room = factor(room), study = factor(study), computer = factor(computer), software = factor(software), internet = factor(internet), literature = factor(literature), poetry= factor(poetry), art = factor(art), textbook = factor(textbook), dictionary = factor(dictionary), dishwasher = factor(dishwasher)) australia_PISA2012 |> filter(!is.na(dishwasher)) |> ggpairs(columns=c(3, 15, 16, 21, 26)) ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ``` r # Modify the defaults, set the transparency of points since there is a lot of data australia_PISA2012 |> filter(!is.na(dishwasher)) |> ggpairs(columns=c(3, 15, 16, 21, 26), lower = list(continuous = wrap("points", alpha=0.05))) ``` <img src="index_files/figure-html/generalised pairs plot enhance plots-1.png" width="432" style="display: block; margin: auto;" /> --- .pull-left[ ``` r # Make a special style of plot to put in the matrix my_fn <- function(data, mapping, method="loess", ...){ p <- ggplot(data = data, mapping = mapping) + geom_point(alpha=0.2, size=1) + geom_smooth(method="lm", ...) p } ``` ``` r australia_PISA2012 |> filter(!is.na(dishwasher)) |> ggpairs(columns=c(3, 15, 16, 21, 26), lower = list(continuous = my_fn)) ``` *What do we learn?* - moderate increase in all scores as more time is spent on homework - test scores all have a very regular bivariate normal shape - is this simulated data? yes. - having a dishwasher in the household corresponds to small increase in homework time - very little but slight increase in scores with a dishwasher in household ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" /> ] --- class: inverse middle # Your turn Re-make the plot with - side-by-side boxplots on the lower triangle, for the combo variables, - and the density plots in the upper triangle.
−
+
08
:
00
--- # Regression setting .pull-left[ An alternative pairs plot that better matches this sort of data, where there is a response variable and explanatory variables. For this data, plot house price against all other variables. ``` r housing <- read_csv(here::here("data/housing.csv")) |> mutate(date = dmy(date)) |> mutate(year = year(date)) |> filter(year == 2016) |> filter(!is.na(bedroom2), !is.na(price)) |> filter(bedroom2 < 7, bathroom < 5) |> mutate(bedroom2 = factor(bedroom2), bathroom = factor(bathroom)) ``` ] .pull-right[ ``` r ggduo(housing[, c(4,3,8,10,11)], columnsX = 2:5, columnsY = 1, aes(colour=type, fill=type), types = list(continuous = wrap("smooth", alpha = 0.10))) ``` <img src="index_files/figure-html/make a regression style pairs plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Parallel coordinate plots .pull-left[ ``` r library(ggpcp) library(colorspace) penguins |> pcp_select(species, bl:bm) |> pcp_arrange() |> ggplot(aes_pcp()) + geom_pcp(aes(colour=species)) + geom_pcp_boxes() + geom_pcp_labels() + scale_colour_discrete_divergingx(palette = "Zissou 1") + theme_pcp() + theme(legend.position = "none") ``` Axes are parallel, observations are connecting lines. ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-12-1.png" width="432" style="display: block; margin: auto;" /> ] --- # PCP: large sample size .pull-left[ ``` r ggplot() + geom_ribbon(data = dframe, aes(x=pcp_x, ymin = lower, ymax = upper, group = level), alpha=0.5) + geom_pcp_axes(data=tcga_t_pc_pcp_sub, aes_pcp()) + geom_pcp_boxes(data=tcga_t_pc_pcp_sub, aes_pcp(), boxwidth = 0.1) + geom_pcp(data=tcga_t_pc_pcp_sub, aes_pcp(), colour="orange") + theme_pcp() ``` With large data, aggregate to get an overview, and select some observations to show. ] .pull-right[ ![](ggpcp-ribbon.png) ] --- # Resources - [Cook and Laa (2024)](https://dicook.github.io/mulgar_book/) - Emerson et al (2013) The Generalized Pairs Plot, Journal of Computational and Graphical Statistics, 22:1, 79-91 - [Natalia da Silva](http://natydasilva.com/) [PPForest](https://cran.r-project.org/web/packages/PPforest/index.html) and [shiny app](https://natydasilva.shinyapps.io/shinyV03/). - Eunkyung Lee [PPtreeViz](https://www.jstatsoft.org/article/view/v083i08) - Wickham, Cook, Hofmann (2015) [Visualising Statistical Models: Removing the blindfold](http://dicook.org/publication/blindfold_2015/) - Cook and Swayne (2007) [Interactive and Dynamic Graphics for Data Analysis](http://ggobi.org/book/) - Wickham et al (2011) [tourr: An R Package for Exploring Multivariate Data with Projections](https://www.jstatsoft.org/article/view/v040i02/v40i02.pdf) and the R package [tourr](https://cran.r-project.org/web/packages/tourr/index.html) - Schloerke et al (2016) [Escape from Boxland](https://journal.r-project.org/archive/2016/RJ-2016-044/index.html), [the web site zoo](http://schloerke.com/geozoo/) and the R package [geozoo](https://cran.r-project.org/web/packages/geozoo/index.html) --- # Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.