Multivariate data plots

class: center, middle, inverse, title-slide

.title[
# Multivariate data plots
]
.subtitle[
## SISBID 2024 <br> <a href="https://github.com/dicook/SISBID" class="uri">https://github.com/dicook/SISBID</a>
]
.author[
### Di Cook (<a href="mailto:dicook@monash.edu" class="email">dicook@monash.edu</a>) <br> Heike Hofmann (<a href="mailto:hhofmann4@unl.edu" class="email">hhofmann4@unl.edu</a>) <br> Susan Vanderplas (<a href="mailto:susan.vanderplas@unl.edu" class="email">susan.vanderplas@unl.edu</a>)
]
.date[
### 08/14-16/2024
]

---

class: inverse middle 
# Your turn

- What is multivariate data?
- What makes multivariate analysis different from univariate analysis?

???

- data is multivariate, if we have more information than a single aspect for each entity/person/experimental unit.
- multivariate analysis takes relationships between these different aspects into account.

---
# Main types of plots

- __pairwise plots__: explore association between pairs of variables
- __parallel coordinate plots__: use parallel axes to lay out many variables on a page
- __heatmaps__: represent data value using colour, present as a coloured table
- __tours__: sequence of projections of high-dimensional data, good for examining shape and distribution between many variables

---
# Scatterplot matrix: GGally

.pull-left[The basic plot plot for multivariate data is a scatterplot matrix. There are two functions available in the GGally package: `ggpairs`.

``` r
# Make a simple scatterplot matrix of the new penguins data
stdd <- function(x) (x-mean(x))/sd(x)
penguins <- penguins |> 
  filter(!is.na(bill_length_mm)) |>
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g) |>
  mutate_at(vars(bl:bm), stdd)
ggpairs(penguins, columns=c(3:6)) 
```

*What do we learn?*

- clustering
- linear dependence
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" />
]

---

.pull-left[
**Add some colour for groups**

``` r
# Re-make mapping colour to species (class)
ggpairs(penguins, columns=c(3:6), 
        ggplot2::aes(colour=species)) +
  scale_color_viridis_d(option = "plasma", begin=0.2, end=0.8) +
  scale_fill_viridis_d(option = "plasma", begin=0.2, end=0.8)
```

<br>
<br>
<br>
*What do we learn?*

- clustering is due to the class variable
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" />
]

---

.pull-left[
Only show correlation. <span class=" faa-float animated " style=" display: -moz-inline-stack; display: inline-block; transform: rotate(0deg);">This is dangerous!</span> Only appropriate if correlation is a good summary of the association, or if you have a large number of variables that you need to scan quickly.

``` r
# Look at one species only
adelie <- penguins |> 
  filter(species == "Adelie") |>
  select(bl:bm)
ggcorr(adelie)
```

]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />
]

---

.pull-left[
Only show correlation, using a corrgram. Again, this is dangerous, and only useful to get a broad overview of association that is suitably summarised by correlation.

``` r
# install.packages("corrgram")
library(corrgram)
corrgram(adelie, 
  lower.panel=corrgram::panel.ellipse)
```

The `corrgram` package has numerous scatterplot matrix capabilities.

]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# Large sample size

.pull-left[

``` r
# Data downloaded from https://archive.ics.uci.edu/dataset/401/gene+expression+cancer+rna+seq
# This chunk takes some time to run, so evaluated off-line
tcga <- data.table(read.csv(here("data/TCGA-PANCAN-HiSeq-801x20531/data.csv")))
tcga_t <- t(as.matrix(tcga[,2:20532]))
colnames(tcga_t) <- tcga$X
tcga_t_pc <- prcomp(tcga_t, scale = FALSE)$x
ggally_hexbin <- function (data, mapping, ...)  {
    p <- ggplot(data = data, mapping = mapping) + geom_hex(binwidth=20, ...)
    p
}
ggpairs(tcga_t_pc, columns=c(1:4),
        lower = list(continuous = "hexbin")) +
  scale_fill_gradient(trans="log", 
    low="#E24C80", high="#FDF6B5")
```
]
.pull-right[
![](ggpairs-hexbin.png)
]

---
# Generalized pairs plot

.pull-left[
The pairs plot can also incorporate non-numerical variables, and different types of two variable plots.

``` r
# Matrix plot when variables are not numeric
data(australia_PISA2012)
australia_PISA2012 <- australia_PISA2012 |>
  mutate(desk = factor(desk), 
         room = factor(room),
         study = factor(study), 
         computer = factor(computer),
         software = factor(software), 
         internet = factor(internet),
         literature = factor(literature), 
         poetry= factor(poetry),
         art = factor(art), 
         textbook = factor(textbook),
         dictionary = factor(dictionary),
         dishwasher = factor(dishwasher))
australia_PISA2012 |> 
  filter(!is.na(dishwasher)) |> 
  ggpairs(columns=c(3, 15, 16, 21, 26))
```
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" />
]

---

``` r
# Modify the defaults, set the transparency of points since there is a lot of data
australia_PISA2012 |> 
  filter(!is.na(dishwasher)) |> 
  ggpairs(columns=c(3, 15, 16, 21, 26), 
          lower = list(continuous = wrap("points", alpha=0.05)))
```

---

.pull-left[

``` r
# Make a special style of plot to put in the matrix
my_fn <- function(data, mapping, method="loess", ...){
      p <- ggplot(data = data, mapping = mapping) + 
      geom_point(alpha=0.2, size=1) + 
      geom_smooth(method="lm", ...)
      p
}
```

``` r
australia_PISA2012 |> 
  filter(!is.na(dishwasher)) |> 
  ggpairs(columns=c(3, 15, 16, 21, 26), 
          lower = list(continuous = my_fn))
```

*What do we learn?*

- moderate increase in all scores as more time is spent on homework
- test scores all have a very regular bivariate normal shape - is this simulated data? yes.
- having a dishwasher in the household corresponds to small increase in homework time
- very little but slight increase in scores with a dishwasher in household

]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" />
]

---
class: inverse middle

# Your turn

Re-make the plot with

- side-by-side boxplots on the lower triangle, for the combo variables, 
- and the density plots in the upper triangle.

---
# Regression setting

.pull-left[
An alternative pairs plot that better matches this sort of data, where there is a response variable and explanatory variables. For this data, plot house price against all other variables.

``` r
housing <- read_csv(here::here("data/housing.csv")) |>
  mutate(date = dmy(date)) |>
  mutate(year = year(date)) |>
  filter(year == 2016) |>
  filter(!is.na(bedroom2), !is.na(price)) |>
  filter(bedroom2 < 7, bathroom < 5) |>
  mutate(bedroom2 = factor(bedroom2), 
         bathroom = factor(bathroom)) 
```
]
.pull-right[

``` r
ggduo(housing[, c(4,3,8,10,11)], 
      columnsX = 2:5, columnsY = 1, 
      aes(colour=type, fill=type), 
      types = list(continuous = 
                     wrap("smooth", 
                       alpha = 0.10)))
```

<img src="index_files/figure-html/make a regression style pairs plot-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# Parallel coordinate plots

.pull-left[

``` r
library(ggpcp)
library(colorspace)
penguins |>
  pcp_select(species, bl:bm) |>
  pcp_arrange() |>
  ggplot(aes_pcp()) +
    geom_pcp(aes(colour=species)) +
    geom_pcp_boxes() +
    geom_pcp_labels() +
    scale_colour_discrete_divergingx(palette = "Zissou 1") +
    theme_pcp() +
    theme(legend.position = "none")
```

Axes are parallel, observations are connecting lines.

]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-12-1.png" width="432" style="display: block; margin: auto;" />
]

---
# PCP: large sample size

.pull-left[

``` r
ggplot() +
    geom_ribbon(data = dframe, 
              aes(x=pcp_x, ymin = lower, ymax = upper,
              group = level), alpha=0.5) +
    geom_pcp_axes(data=tcga_t_pc_pcp_sub, 
             aes_pcp()) +
    geom_pcp_boxes(data=tcga_t_pc_pcp_sub, 
             aes_pcp(), boxwidth = 0.1) +
    geom_pcp(data=tcga_t_pc_pcp_sub, 
             aes_pcp(), colour="orange") +
    theme_pcp()
```

With large data, aggregate to get an overview, and select some observations to show.

]
.pull-right[
![](ggpcp-ribbon.png)
]

---
# Resources

- [Cook and Laa (2024)](https://dicook.github.io/mulgar_book/)
- Emerson et al (2013) The Generalized Pairs Plot, Journal of Computational and Graphical Statistics, 22:1, 79-91
- [Natalia da Silva](http://natydasilva.com/) [PPForest](https://cran.r-project.org/web/packages/PPforest/index.html) and [shiny app](https://natydasilva.shinyapps.io/shinyV03/).
- Eunkyung Lee [PPtreeViz](https://www.jstatsoft.org/article/view/v083i08)
- Wickham, Cook, Hofmann (2015) [Visualising Statistical Models: Removing the blindfold](http://dicook.org/publication/blindfold_2015/)
- Cook and Swayne (2007) [Interactive and Dynamic Graphics for Data Analysis](http://ggobi.org/book/)
- Wickham et al (2011) [tourr: An R Package for Exploring Multivariate Data with Projections](https://www.jstatsoft.org/article/view/v040i02/v40i02.pdf) and the R package [tourr](https://cran.r-project.org/web/packages/tourr/index.html)
- Schloerke et al (2016) [Escape from Boxland](https://journal.r-project.org/archive/2016/RJ-2016-044/index.html), [the web site zoo](http://schloerke.com/geozoo/) and the R package [geozoo](https://cran.r-project.org/web/packages/geozoo/index.html)

---
# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.