ggpairs(penguins_std, columns=c(2:5))
Multivariate data plots
SISBID 2025
https://github.com/dicook/SISBID
Your turn
- What is multivariate data?
- What makes multivariate analysis different from univariate analysis?
- data is multivariate if we record more than one aspect (variable) for each entity, person, or experimental unit.
- multivariate analysis takes the relationships between these different aspects into account.
Main types of plots
- pairwise plots: explore association between pairs of variables
- parallel coordinate plots: use parallel axes to lay out many variables on a page
- heatmaps: represent data value using colour, present as a coloured table
- tours: sequence of projections of high-dimensional data, good for examining shape and distribution between many variables
Scatterplot matrix: GGally
Scatterplot matrix
Heatmaps ⚠️
# install.packages("superheat")
library(superheat)
superheat(penguins_std[,2:5],
  pretty.order.rows = TRUE,
  pretty.order.cols = TRUE)
How many clusters do you see?
Mapping numeric values to colour is sub-optimal, as we know from the hierarchy.
With heatmaps of multivariate data it is possible to miss clusters that are there, and also to imagine clusters that don’t exist.
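To make the second risk concrete, here is a minimal base R sketch (not the superheat call used above). It draws a matrix of pure noise twice, once in its original row order and once with rows reordered by hierarchical clustering, which is the kind of ordering that, as far as the superheat documentation describes, `pretty.order.rows` requests. The simulated matrix `noise` and the choice of 100 rows are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated data with NO cluster structure by construction (illustrative only)
noise <- matrix(rnorm(100 * 4), ncol = 4)

# Row order from hierarchical clustering of the noise
ord <- hclust(dist(noise))$order

# Same values, two row orders: compare how "blocky" each panel looks
op <- par(mfrow = c(1, 2))
image(t(noise), main = "original row order", axes = FALSE)
image(t(noise[ord, ]), main = "rows ordered by hclust", axes = FALSE)
par(op)
```

Any bands of similar colour in the reordered panel are produced by the ordering, not by real groups, since the data were generated without any.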
Heatmaps of correlation
Corrgrams
- can be dangerous ⚠️
- useful for a broad overview IF correlation is a good summary
The corrgram package has numerous correlation display capabilities.
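The "IF correlation is a good summary" caveat can be checked with Anscombe's quartet, which ships with base R as `anscombe`. This sketch is an illustration added here, not part of the corrgram package: the four x/y pairs have almost the same correlation (about 0.82) but very different shapes, so a correlation display would colour all four identically.

```r
# Correlation for each of the four x/y pairs in Anscombe's quartet
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r, 3)

# Plot the four pairs to see what the single number hides
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("r =", round(r[i], 3)))
}
par(op)
```

Looking at the scatterplots first, and the correlation display second, is the safer order.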
Large sample size
# Data downloaded from https://archive.ics.uci.edu/dataset/401/gene+expression+cancer+rna+seq
# This chunk takes some time to run, so evaluated off-line
if (!file.exists(here("data", "TCGA-PANCAN-HiSeq-801x20531", "data.csv"))) {
  download.file(url = "https://archive.ics.uci.edu/static/public/401/gene+expression+cancer+rna+seq.zip",
    destfile = here::here("data", "TCGA-PANCAN-HiSeq-801x20531.zip"), mode = "wb")
  unzip(here::here("data", "TCGA-PANCAN-HiSeq-801x20531.zip"),
    exdir = here::here("data/TCGA-PANCAN-HiSeq-801x20531/"))
  # Untar into folder
  untar(here::here("data/TCGA-PANCAN-HiSeq-801x20531/TCGA-PANCAN-HiSeq-801x20531.tar.gz"),
    exdir = here("data"))
}
tcga <- tibble(read.csv(here("data", "TCGA-PANCAN-HiSeq-801x20531", "data.csv")))
# Transpose so that genes are rows and samples are columns
tcga_t <- t(as.matrix(tcga[,2:20532]))
colnames(tcga_t) <- tcga$X
# Principal component scores, one row per gene
tcga_t_pc <- prcomp(tcga_t, scale = FALSE)$x
# Custom panel function: ggpairs looks up "hexbin" as ggally_hexbin
ggally_hexbin <- function (data, mapping, ...) {
  p <- ggplot(data = data, mapping = mapping) + geom_hex(binwidth=20, ...)
  p
}
ggpairs(tcga_t_pc, columns=c(1:4),
  lower = list(continuous = "hexbin")) +
  scale_fill_gradient(trans="log",
    low="#E24C80", high="#FDF6B5")
Generalized pairs plot
The pairs plot can also incorporate non-numerical variables, and use different types of two-variable plots depending on the types of the variables being paired.
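As a rough sketch of how this looks in GGally, the call below uses the built-in `iris` data (an assumption for illustration, not the data shown on these slides), which has four numeric variables and one categorical variable. The `upper` and `lower` arguments pick the panel type separately for continuous pairs and for "combo" pairs (one numeric, one categorical variable).

```r
library(ggplot2)
library(GGally)

# iris stands in for the slide data: 4 numeric columns plus a factor (Species)
p <- ggpairs(
  iris,
  mapping = aes(colour = Species),
  # upper triangle: 2D density contours for numeric pairs, boxplots for combos
  upper = list(continuous = "density", combo = "box_no_facet"),
  # lower triangle: scatterplots for numeric pairs, faceted histograms for combos
  lower = list(continuous = "points", combo = "facethist")
)
p
```

Swapping the entries between `upper` and `lower` moves a panel type to the other triangle, so the same pattern covers most customisations.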
Generalized Pairs Plots
Customized Generalized Pairs Plots
What do we learn?
- moderate increase in all scores as more time is spent on homework
- test scores all have a very regular bivariate normal shape - is this simulated data? yes.
- having a dishwasher in the household corresponds to a small increase in homework time
- scores increase only very slightly with a dishwasher in the household
Your turn
Re-make the plot with
- side-by-side boxplots on the lower triangle, for the combo variables,
- and the density plots in the upper triangle.
Regression setting
housing <- read_csv(here::here("data/housing.csv")) |>
  mutate(date = dmy(date)) |>
  mutate(year = year(date)) |>
  filter(year == 2016) |>
  filter(!is.na(bedroom2), !is.na(price)) |>
  filter(bedroom2 < 7, bathroom < 5) |>
  mutate(bedroom2 = factor(bedroom2),
    bathroom = factor(bathroom))
# Response (column 1 of the subset) on the rows, predictors on the columns
ggduo(housing[, c(4,3,8,10,11)],
  columnsX = 2:5, columnsY = 1,
  mapping = aes(colour=type, fill=type),
  types = list(continuous =
    wrap("smooth",
      alpha = 0.10)))
Parallel coordinate plots
# install.packages("ggpcp")
library(ggpcp)
penguins_std |>
  pcp_select(species, bl:bm) |>
  pcp_arrange() |>
  ggplot(aes_pcp()) +
  geom_pcp(aes(colour=species)) +
  geom_pcp_boxes() +
  geom_pcp_labels() +
  scale_colour_discrete_divergingx(
    palette = "Zissou 1") +
  theme_pcp() +
  theme(legend.position = "none")
Axes are parallel, observations are connecting lines.
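To see the construction without any plotting package, the following base R sketch builds a bare-bones parallel coordinate plot by hand. It uses the built-in `iris` data and min-max scaling, both of which are assumptions for illustration and not necessarily what ggpcp does internally: each variable gets its own vertical axis, and each observation is drawn as one line across the axes.

```r
# Min-max scale each numeric variable to [0, 1] so the axes are comparable
X <- iris[, 1:4]
X01 <- as.data.frame(lapply(X, function(v) (v - min(v)) / (max(v) - min(v))))

# matplot() draws one line per column of its input, so transpose:
# variables run along the x axis, observations become the lines
matplot(t(as.matrix(X01)), type = "l", lty = 1,
        col = as.integer(iris$Species),
        xaxt = "n", xlab = "", ylab = "scaled value")

# Label the parallel axes and draw them as vertical lines
axis(1, at = seq_along(X01), labels = names(X01))
abline(v = seq_along(X01), col = "grey40")
```

The order of the axes matters for what patterns are visible, since only adjacent variables are directly connected.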
PCP: large sample size
ggplot() +
geom_ribbon(data = dframe, aes(x=pcp_x, ymin = lower, ymax = upper, group = level), alpha=0.5) +
geom_pcp_axes(data=tcga_t_pc_pcp_sub, aes_pcp()) +
geom_pcp_boxes(data=tcga_t_pc_pcp_sub, aes_pcp(), boxwidth = 0.1) +
geom_pcp(data=tcga_t_pc_pcp_sub, aes_pcp(), colour="orange") +
theme_pcp()
With large data, aggregate to get an overview, and select some observations to show.
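The objects `dframe` and `tcga_t_pc_pcp_sub` are not defined on the slide, so the following is only a guess at one way such inputs might be prepared. It computes central quantile bands on each axis, using the column names that the `geom_ribbon()` call expects (`pcp_x`, `lower`, `upper`, `level`), and draws a small random subset of observations. The simulated matrix `pcs`, the band levels, and the subset size of 20 are all hypothetical.

```r
set.seed(2025)

# Hypothetical stand-in for a large matrix of principal component scores
pcs <- matrix(rnorm(5000 * 4), ncol = 4,
              dimnames = list(NULL, paste0("PC", 1:4)))

# Central bands to summarise: 50% and 90% of observations on each axis
band_levels <- c(0.5, 0.9)

# One row per axis and band, with the band's lower and upper quantiles
dframe <- do.call(rbind, lapply(band_levels, function(lv) {
  data.frame(
    pcp_x = seq_len(ncol(pcs)),
    lower = apply(pcs, 2, quantile, probs = (1 - lv) / 2),
    upper = apply(pcs, 2, quantile, probs = 1 - (1 - lv) / 2),
    level = lv,
    row.names = NULL
  )
}))

# A small random subset of observations to overlay as individual lines
pcs_sub <- pcs[sample(nrow(pcs), 20), ]
```

The bands here are computed separately on each axis, so they summarise the marginal distributions only, and they would need to be computed on the same scale as the plotted lines for the overlay to line up.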
Big biological data
The Bioconductor package bigPint has tools for working with larger amounts of data, as seen in RNA-Seq experiments. It has variations of scatterplot matrices and parallel coordinate plots, and supports interactivity with these displays.
More info at https://lindsayrutter.github.io/bigPint/
Resources
- Cook and Laa (2025)
- Emerson et al (2013) The Generalized Pairs Plot, Journal of Computational and Graphical Statistics, 22:1, 79-91
- Natalia da Silva PPForest and shiny app.
- Eunkyung Lee PPtreeViz
- Wickham, Cook, Hofmann (2015) Visualising Statistical Models: Removing the blindfold
- Cook and Swayne (2007) Interactive and Dynamic Graphics for Data Analysis
- Wickham et al (2011) tourr: An R Package for Exploring Multivariate Data with Projections and the R package tourr
- Schloerke et al (2016) Escape from Boxland, the web site zoo and geozoo
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.