Tutorial
Take a data plot and make it better

Dianne Cook
Monash University

Welcome 👋🏼

Thanks for joining to learn about making data plots today.

About the instructors:

🦘 Di is a Professor of Statistics. She has more than 30 years of experience in research and teaching of data visualisation, and in open source software development.
🐨 Jayani is a final year PhD student. She is working on methods to help decide on the best nonlinear low dimensional representation of high dimensional data, and is the author of several R packages.
🏛️ We are both in Econometrics and Business Statistics, at Monash University.

🧩 Feel free to ask questions any time. 🤔

🎯 The objectives for today are:

Build your knowledge of cognitive perception principles for good graphics
Recognise elements of a current design that can be improved
Develop coding skills to implement improved design

Load these libraries to get started:

library(tidyverse)
library(colorspace)
library(patchwork)
library(broom)
library(palmerpenguins)
library(ggbeeswarm)
library(vcd)
library(nullabor)
library(MASS)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(dplyr::rename)
conflicts_prefer(dplyr::mutate)
conflicts_prefer(dplyr::summarise)

Session 1: Principles and tools

Outline

time	topic
5	Outline
10	Tidy data
15	Grammar of graphics
15	Guided exercises
15	Cognitive principles
15	Guided exercises
15	Identifying poor elements
30	BREAK

Tidy data

Tidy data (1/5)

Illustrations from Julia Lowndes and Allison Horst

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.
Each table contains one data set.
Long form makes it easier to reshape in many different ways
Wider forms are common for analysis

Long form: one measured value per row. All other variables are descriptors (key variables)

Widest form: all measured values for an entity are in a single row.
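As a minimal sketch of the two forms, the toy data below (made-up counts, not from the WHO file) is moved from long to wide and back with tidyr. The object names are illustrative only.

```r
library(tidyr)
library(tibble)

# Made-up counts for two years and two sexes (illustrative only)
long <- tribble(
  ~year, ~sex, ~count,
  2011,  "m",  10,
  2011,  "f",  7,
  2012,  "m",  12,
  2012,  "f",  9
)

# Wide form: all measured values for a year sit in a single row
wide <- long |>
  pivot_wider(names_from = sex, values_from = count)

# Back to long form: one measured value per row
long_again <- wide |>
  pivot_longer(c(m, f), names_to = "sex", values_to = "count")
```

In the long form, year and sex act as the key variables that describe each count.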

Tidy format (2/5)

This WHO Tuberculosis Notifications data is not in tidy format. The first step is to determine what the variables are.

Code

tb <- read_csv("data/TB_notifications_2023-08-21.csv") |>
  filter(country == "Australia", year > 1996, year < 2013) |>
  select(year, contains("new_sp")) 
glimpse(tb)

Rows: 16
Columns: 22
$ year         <dbl> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012
$ new_sp       <dbl> 226, 203, 285, 251, 228, 210, 113, 285, 241, 269, 281, 299, 267, 274, 301, 290
$ new_sp_m04   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 0, NA, 0, 0, 0, 2
$ new_sp_m514  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 3, NA, 3, 2, 2, 1
$ new_sp_m014  <dbl> 1, 0, 0, 3, 1, 1, 0, 0, 0, 1, 3, 2, 3, 2, 2, 3
$ new_sp_m1524 <dbl> 8, 11, 13, 16, 23, 15, 14, 18, 32, 33, 30, 46, 30, 42, 38, 26
$ new_sp_m2534 <dbl> 24, 22, 40, 35, 20, 20, 10, 16, 27, 35, 33, 33, 37, 33, 44, 40
$ new_sp_m3544 <dbl> 18, 18, 54, 25, 18, 26, 2, 17, 23, 23, 20, 20, 16, 22, 26, 17
$ new_sp_m4554 <dbl> 13, 13, 52, 24, 18, 19, 11, 15, 11, 21, 15, 27, 24, 25, 19, 25
$ new_sp_m5564 <dbl> 17, 15, 37, 19, 13, 13, 5, 11, 12, 16, 14, 23, 12, 9, 12, 16
$ new_sp_m65   <dbl> 28, 31, 49, 49, 35, 34, 30, 32, 30, 43, 37, 42, 34, 27, 37, 37
$ new_sp_mu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0
$ new_sp_f04   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 0, NA, 1, 1, 2, 0
$ new_sp_f514  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 4, NA, 3, 3, 1, 1
$ new_sp_f014  <dbl> 0, 2, 0, 0, 1, 0, 0, 0, 2, 2, 4, 3, 4, 4, 3, 1
$ new_sp_f1524 <dbl> 10, 19, 10, 15, 21, 15, 9, 6, 18, 18, 26, 27, 31, 36, 26, 27
$ new_sp_f2534 <dbl> 15, 24, 16, 19, 27, 21, 13, 17, 26, 27, 37, 32, 27, 43, 40, 48
$ new_sp_f3544 <dbl> 9, 15, 18, 12, 16, 15, 3, 5, 11, 14, 20, 14, 14, 12, 23, 15
$ new_sp_f4554 <dbl> 5, 8, 6, 15, 7, 6, 5, 7, 10, 7, 12, 6, 12, 2, 7, 11
$ new_sp_f5564 <dbl> 10, 2, 2, 5, 8, 4, 4, 3, 6, 9, 7, 11, 11, 5, 7, 9
$ new_sp_f65   <dbl> 12, 24, 26, 14, 20, 23, 7, 19, 14, 21, 23, 10, 12, 12, 17, 15
$ new_sp_fu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0

Variables are:

year
sex
age category

Tidy data (3/5)

Steps to wrangle to tidy form:

Select only the variables containing sex and age counts
Pivot into long form
Extract variables from names (sexage column)
Tidy age codes

Is count a variable?

# A tibble: 12 × 4
    year sex   age   count
   <dbl> <chr> <fct> <dbl>
 1  1997 m     0-14      1
 2  1997 m     15-24     8
 3  1997 m     25-34    24
 4  1997 m     35-44    18
 5  1997 m     45-54    13
 6  1997 m     55-64    17
 7  1997 m     > 65     28
 8  1997 f     0-14      0
 9  1997 f     15-24    10
10  1997 f     25-34    15
11  1997 f     35-44     9
12  1997 f     45-54     5

Code

tb_tidy <- tb |>
  select(-new_sp, -new_sp_m04, -new_sp_m514, 
                  -new_sp_f04, -new_sp_f514) |> 
  pivot_longer(starts_with("new_sp"), 
    names_to = "sexage", 
    values_to = "count") |>
  mutate(sexage = str_remove(sexage, "new_sp_")) |>
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  ) |>
  filter(age != "u") |>
  mutate(age = fct_recode(age, "0-14" = "014",
                          "15-24" = "1524",
                          "25-34" = "2534",
                          "35-44" = "3544",
                          "45-54" = "4554",
                          "55-64" = "5564",
                          "> 65" = "65"))
tb_tidy |> slice_head(n=12)
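The pivot and the split of the column names can also be done in one step. This is an alternative sketch, not the approach used in the slides: it relies on the `names_pattern` argument of `pivot_longer()` and runs on a small hand-typed extract (1997 and 1998, two age groups) so that it works without the CSV file. The name `tb_small` is hypothetical.

```r
library(tidyr)
library(tibble)

# Small extract typed from the glimpse() output (1997-1998, two age groups)
tb_small <- tribble(
  ~year, ~new_sp_m014, ~new_sp_m1524, ~new_sp_f014, ~new_sp_f1524,
  1997,  1,            8,             0,            10,
  1998,  0,            11,            2,            19
)

# The regular expression captures sex (m or f) and the age code (digits)
tb_small_tidy <- tb_small |>
  pivot_longer(
    starts_with("new_sp_"),
    names_to = c("sex", "age"),
    names_pattern = "new_sp_([mf])(\\d+)",
    values_to = "count"
  )
```

The age codes would still need recoding into readable labels, as in the code above.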

Why do it? (4/5)

Illustrations from Julia Lowndes and Allison Horst

Tidy data is the starting point for statistical analysis, and data visualisation.

Read more from tidy paper and wrangling paper.

Tidy data = statistical data (5/5)

\[\begin{align} X = \left[ \begin{array}{cccc} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{array} \right] \end{align}\]

Variables \(x_1, x_2, \dots, x_p\) are in the columns, and the \(n\) observations are in the rows.

Graphics built on tidy data fit nicely with your statistical analysis too.
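A small sketch of this correspondence, using made-up numbers: a tidy data frame converts directly to the data matrix, with observations in rows and variables in columns.

```r
# Toy tidy data: n = 3 observations (rows), p = 2 variables (columns)
d <- data.frame(x1 = c(1.2, 0.7, 3.1), x2 = c(10, 12, 9))

# The same data as a numeric matrix X
X <- as.matrix(d)

dim(X)       # number of observations, number of variables
colMeans(X)  # per-variable summaries operate on columns
```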

Grammatical descriptions for plots

Grammar (1/5)

A grammar of graphics maps the variables from a tidy data set to elements of the plot.

It’s like having the DNA rather than a species name, so you know how the plots are related to each other.

Same script can be applied to different data.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE> +
  <THEME>

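# Total count per year, summed over sex and age
tb_yr <- tb_tidy |>
  group_by(year) |>
  summarise(count = sum(count, na.rm=TRUE)) 
# Same data and mapping, two different geoms
gg1 <- ggplot(tb_yr, 
  aes(x=year, y=count)) +
  geom_col() +
  ylim(c(0, 350))
gg2 <- ggplot(tb_yr, 
  aes(x=year, y=count)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  ylim(c(0, 350))
# Stack the two plots vertically (patchwork)
gg1 + gg2 + plot_layout(ncol=1)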

These plots examine the relationship between TB incidence and time as years.
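To illustrate that the same script can be applied to different data, here is a minimal sketch that wraps one plot description in a function and applies it to two built-in R data sets. The helper `trend_plot()` is hypothetical, written for this illustration only.

```r
library(ggplot2)

# Hypothetical helper: the grammar is fixed, only the data and the
# variables mapped to x and y change
trend_plot <- function(data, x, y) {
  ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(se = FALSE)
}

# Same script, two different data sets that ship with R
p_cars <- trend_plot(cars, speed, dist)
p_pressure <- trend_plot(pressure, temperature, pressure)
```

Printing `p_cars` or `p_pressure` draws the plot.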

Grammar and variables (2/5)

Variables:

Democrat: true or false
Margin of the vote: 0-80

DATA: electoral
MAPPING: x = Democrat, y = Margin
GEOM: boxplot (calculates five number summary, and displays as boxplot)

library(nullabor)
data(electoral)
polls <- electoral$polls
ggplot(polls) +
  geom_boxplot(aes(x=Democrat, 
                   y=Margin)) +
  xlab("democrat") + 
  scale_y_continuous("margin (%)", 
    breaks=seq(0, 100, 20),
    limits=c(0,100)) +
  theme(aspect.ratio = 1.2, 
        panel.grid.major.x = element_blank())

This plot compares the distribution of the margin in the vote percentage, between states with a democratic majority party and the rest, for the polling data.

Variables:

year
count (is it a variable? Yes, for plotting purposes it is)
age

DATA: tb_tidy
MAPPING: x = year, y = count, colour = age
GEOM: smooth, with method lm (linear model)

tb_tidy |>
  filter(age %in% c("45-54", "55-64"),
         sex == "f") |>
  ggplot() + 
    geom_smooth(aes(x=year, 
                  y=count,
                  colour=age), 
              se=F, 
              method="lm") +
    scale_color_discrete_divergingx(palette="Geyser") +
    scale_x_continuous("year", 
      breaks = seq(1998, 2012, 4), 
      labels = c("98", "02", "06", "10")) +
    theme(aspect.ratio = 0.8, 
      axis.text = element_text(size=10))

This plot compares the linear trend in TB incidence over time between 45-54 and 55-64 year old females.

Plot from grammar (3/5)

Here is the grammatical description

DATA: tb_tidy, 2012
MAPPING: x=age, fill=sex
STAT: count
POSITION: stack
GEOM: bar

tb_tidy |>
  filter(year == 2012) |>
  ggplot() + 
  geom_bar(aes(x=age, 
               weight=count,
               fill=sex),
           alpha=0.8) +
  scale_fill_discrete_divergingx(palette="Geyser") +
  theme_bw() +
  theme(aspect.ratio = 0.8, 
    axis.text = element_text(size=10))

See documentation: geom_bar, after_stat(count)

This plot examines the relationship between TB incidence and age, and sex (although it’s almost impossible from this arrangement to assess this last relationship).

Here is the grammatical description

DATA: tb_tidy, 2012
MAPPING: x=age, fill=sex
STAT: proportion
POSITION: fill
GEOM: bar

tb_tidy |>
  filter(year == 2012) |>
  ggplot() + 
  geom_bar(aes(x=age, 
               weight=count,
               fill=sex),
           position="fill", alpha=0.8) +
  scale_fill_discrete_divergingx(palette="Geyser") +
  ylab("proportion") +
  theme_bw() +
  theme(aspect.ratio = 0.8, 
    axis.text = element_text(size=10))

See documentation: geom_bar, after_stat(prop)

This plot examines the relationship between TB incidence and age, and sex, focusing on the proportion of each sex within each age group.
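The proportions displayed with position = "fill" can be reproduced by hand. This is a minimal sketch: the 2012 counts are typed in from the last entry of each column in the `glimpse()` output, so it runs without the CSV file, and `tb_2012` is a hypothetical name.

```r
library(dplyr)
library(tibble)

# 2012 counts by age group and sex, copied from the glimpse() output
tb_2012 <- tibble(
  age = rep(c("0-14", "15-24", "25-34", "35-44",
              "45-54", "55-64", "> 65"), times = 2),
  sex = rep(c("m", "f"), each = 7),
  count = c(3, 26, 40, 17, 25, 16, 37,   # males
            1, 27, 48, 15, 11, 9, 15)    # females
)

# Within each age group, the proportion contributed by each sex
tb_2012_prop <- tb_2012 |>
  group_by(age) |>
  mutate(prop = count / sum(count)) |>
  ungroup()
```

Each age group's proportions sum to 1, which is why the group totals are no longer visible in the filled barchart.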

Make the data do the work for your visualisation (4/5)

❌

# A tibble: 10 × 4
    year age       m     f
   <dbl> <fct> <dbl> <dbl>
 1  1997 0-14      1     0
 2  1997 15-24     8    10
 3  1997 25-34    24    15
 4  1997 35-44    18     9
 5  1997 45-54    13     5
 6  1997 55-64    17    10
 7  1997 > 65     28    12
 8  1998 0-14      0     2
 9  1998 15-24    11    19
10  1998 25-34    22    24

Levels of the variable sex have been split into different columns.

tb_bad |> 
  ggplot() + 
    geom_point(aes(x=year, y=m), colour = "#A39000") +
    geom_point(aes(x=year, y=f), colour = "#93B3FE")
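The extracted slides do not show how `tb_bad` was created. One plausible construction, offered only as an assumption, is to pivot the tidy data wider so that the levels of sex become columns. The sketch uses six hand-typed rows from 1997 so that it runs on its own.

```r
library(tidyr)
library(tibble)

# Six rows of the tidy data for 1997, typed from the printed tibble
tb_tidy_small <- tribble(
  ~year, ~sex, ~age,    ~count,
  1997,  "m",  "0-14",  1,
  1997,  "m",  "15-24", 8,
  1997,  "m",  "25-34", 24,
  1997,  "f",  "0-14",  0,
  1997,  "f",  "15-24", 10,
  1997,  "f",  "25-34", 15
)

# Assumed construction of the untidy form: one column per level of sex
tb_bad_small <- tb_tidy_small |>
  pivot_wider(names_from = sex, values_from = count)
```

With this layout each sex needs its own layer and a hand-picked colour, and no legend is produced.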

✅

# A tibble: 10 × 4
    year sex   age   count
   <dbl> <chr> <fct> <dbl>
 1  1997 m     0-14      1
 2  1997 m     15-24     8
 3  1997 m     25-34    24
 4  1997 m     35-44    18
 5  1997 m     45-54    13
 6  1997 m     55-64    17
 7  1997 m     > 65     28
 8  1997 f     0-14      0
 9  1997 f     15-24    10
10  1997 f     25-34    15

The variable sex is mapped to colour, and the plotting software handles the different levels. A palette can then be used to handle the colour mapping appropriately.

tb_tidy |> 
  ggplot() + 
    geom_point(aes(x=year, 
                   y=count, 
                   colour=sex))

Guided exercises

Exercise 1

Data on World Development Indicators (WDI) from World Bank.

Rows: 4,793
Columns: 23
$ `Country Name`  <chr> "Afghanistan", "Afghanistan", "Afg…
$ `Country Code`  <chr> "AFG", "AFG", "AFG", "AFG", "AFG",…
$ `Series Name`   <chr> "Access to clean fuels and technol…
$ `Series Code`   <chr> "EG.CFT.ACCS.ZS", "EG.CFT.ACCS.RU.…
$ `2004 [YR2004]` <chr> "10.5", "1.9", "45.3", "NA", "20.1…
$ `2005 [YR2005]` <chr> "11.9", "2.4", "50.2", "NA", "20.1…
$ `2006 [YR2006]` <chr> "13.5", "3", "54.7", "NA", "17.011…
$ `2007 [YR2007]` <chr> "15.1", "3.6", "59.2", "NA", "13.0…
$ `2008 [YR2008]` <chr> "16.6", "4.3", "62.9", "NA", "6.76…
$ `2009 [YR2009]` <chr> "18.3", "5.1", "66.4", "NA", "4.52…
$ `2010 [YR2010]` <chr> "19.9", "5.9", "69.4", "NA", "3.12…
$ `2011 [YR2011]` <chr> "21.3", "7", "72", "NA", "2.512463…
$ `2012 [YR2012]` <chr> "22.9", "8", "74.3", "NA", "2.9476…
$ `2013 [YR2013]` <chr> "24.5", "9", "76.1", "NA", "3.4903…
$ `2014 [YR2014]` <chr> "26.1", "10.2", "78", "NA", "3.474…
$ `2015 [YR2015]` <chr> "27.6", "11.4", "79.5", "NA", "3.5…
$ `2016 [YR2016]` <chr> "28.8", "12.6", "80.5", "NA", "4.3…
$ `2017 [YR2017]` <chr> "30.3", "13.5", "81.6", "NA", "NA"…
$ `2018 [YR2018]` <chr> "31.4", "14.5", "82.6", "NA", "NA"…
$ `2019 [YR2019]` <chr> "32.6", "15.6", "83.2", "NA", "NA"…
$ `2020 [YR2020]` <chr> "33.8", "16.4", "83.8", "NA", "NA"…
$ `2021 [YR2021]` <chr> "34.9", "17.4", "84.5", "NA", "NA"…
$ `2022 [YR2022]` <chr> "36.1", "18.5", "85", "NA", "NA", …

What are the variables?
What are the steps needed to wrangle it into tidy form?

Variables:

country: name and code
indicator 1 to \(p\): name and code
year

Wrangling steps:

Pre-process by creating a country dictionary table with unique Country Name and Country Code, and an indicator dictionary table with unique Series Name and Series Code. Keep only the code columns in the main data.
Pivot year to long form: country, year, indicator and value
Clean up year text
Convert character to numeric, where needed

Optionally, pivot to wide form with indicators in separate columns.

Exercise 2

Make a plot to examine how EG.CFT.ACCS.ZS (“Access to clean fuels and technologies for cooking (% of population)”) changes over time, by country.

tidying data

wdi_country <- wdi |>
  select(`Country Name`, `Country Code`) |>
  distinct()
wdi_indicator <- wdi |>
  select(`Series Name`, `Series Code`) |>
  distinct()
wdi_tidy <- wdi |>
  select(`Country Code`, `Series Code`, 
    `2004 [YR2004]`:`2022 [YR2022]`) |>
  pivot_longer(cols=`2004 [YR2004]`:`2022 [YR2022]`, 
    names_to="year", values_to="value") |>
  rename(country = `Country Code`,
         indicator = `Series Code`) |>
  mutate(value = as.numeric(value)) |>
  mutate(year = str_sub(year, 1, 4)) |>
  mutate(year = as.numeric(year))
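The year clean-up and the numeric conversion can also be folded into the pivot. This is an alternative sketch, not the slides' approach: it uses the `names_transform` and `values_transform` arguments of `pivot_longer()` on a two-row, two-year extract typed from the `glimpse()` output. The name `wdi_small` is hypothetical.

```r
library(tidyr)
library(tibble)

# Two indicators for Afghanistan, two years, values stored as text
wdi_small <- tribble(
  ~`Country Code`, ~`Series Code`,      ~`2004 [YR2004]`, ~`2005 [YR2005]`,
  "AFG",           "EG.CFT.ACCS.ZS",    "10.5",           "11.9",
  "AFG",           "EG.CFT.ACCS.RU.ZS", "1.9",            "2.4"
)

wdi_small_tidy <- wdi_small |>
  pivot_longer(
    cols = -c(`Country Code`, `Series Code`),
    names_to = "year",
    values_to = "value",
    # keep the first four characters of the column name as the year
    names_transform = list(year = function(x) as.numeric(substr(x, 1, 4))),
    # convert the text values to numbers
    values_transform = list(value = as.numeric)
  )
```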

DATA: wdi_tidy, EG.CFT.ACCS.ZS
MAPPING: x=year, y=value, group=country
GEOM: line

wdi_tidy |>
  filter(indicator == "EG.CFT.ACCS.ZS") |>
  ggplot() +
    geom_line(aes(x=year, 
                  y=value, 
                  group=country),
              alpha = 0.5) +
    xlab("") + ylab("Access to clean fuel") +
    theme_minimal()

Exercise 3

With the proportion barchart we can read the proportion, but we have lost the size of each category. One way to fix this is to use a mosaic plot, which maps the width of the columns to count.

# geom_mosaic() comes from the ggmosaic package
library(ggmosaic)
tb_tidy |>
  filter(year == 2012) |>
  ggplot() + 
  # ggmosaic expects the x variable wrapped in product()
  geom_mosaic(aes(x=product(age), 
                 weight=count,
                 fill=sex)) +
  scale_fill_discrete_divergingx(palette="Geyser") +
  scale_y_continuous("proportion", breaks=seq(0,1,0.25)) +
  #theme_bw() +
  theme(aspect.ratio = 0.6, 
    axis.text = element_text(size=10))

Cognitive perception principles

Hierarchy of mappings (1/15)

Cleveland and McGill (1984)

Illustrations made by Emi Tanaka

Hierarchy of mappings (2/15)

The ranking is based on the accuracy with which readers recovered the numerical values.

Position - common scale (BEST)
Position - nonaligned scale
Length, direction, angle
Area
Volume, curvature
Shading, color (WORST)

Primary mapping used in common plots

scatterplot, barchart
side-by-side boxplot, stacked barchart
piechart, rose plot, gauge plot, donut, wind direction map, starplot
treemap, bubble chart, mosaicplot
chernoff face
choropleth map
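A minimal sketch of the top and the middle of the hierarchy: the same made-up values are mapped first to position along a common scale (barchart) and then to angle (piechart). The data are invented for illustration.

```r
library(ggplot2)

# Made-up values for five categories that are close to each other
shares <- data.frame(
  group = c("A", "B", "C", "D", "E"),
  value = c(23, 21, 20, 19, 17)
)

# Position along a common scale: small differences are readable
p_bar <- ggplot(shares, aes(x = group, y = value)) +
  geom_col()

# Same values mapped to angle: differences are harder to judge
p_pie <- ggplot(shares, aes(x = "", y = value, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")
```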

Proximity (3/15)

Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.

Change blindness (4/15)

Making comparisons across plots requires the eye to jump from one focal point to another. It may result in not noticing differences.

Change blindness (5/15)

Help the reader remember what the pattern is in the other panels by under-plotting all of the data in each panel.

Too many colours, too busy
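Under-plotting can be sketched with a second layer whose data has the facetting variable removed, so that it is repeated in every panel. The example uses the built-in `iris` data as a stand-in. It is not the data shown on the slide.

```r
library(ggplot2)

# Background data without the facetting variable, so every panel
# receives all of the points
iris_all <- iris[, c("Sepal.Length", "Sepal.Width")]

p_shadow <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(data = iris_all, colour = "grey80") +  # all data, in grey
  geom_point(aes(colour = Species)) +               # this panel's data
  facet_wrap(~Species) +
  theme(legend.position = "none")
```

Each panel then shows its own group in colour on top of the full data in grey.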

Pre-attentive (6/15)

Can you find the odd one out?

Is it easier now?

Colour palettes should match variable type (7/15)

There are three basic choices of palettes:

qualitative
sequential
diverging
(rainbow)
(palindrome) SKIPPED

Which one you choose depends on:

the data values
what you want to emphasize
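As a sketch of the three basic palette types, the colorspace package can generate one of each. The palette names below are examples chosen for illustration, not recommendations from the slides.

```r
library(colorspace)

# Qualitative: distinct hues for unordered categories
pal_qual <- qualitative_hcl(4, palette = "Dark 3")

# Sequential: light to dark for values ordered from low to high
pal_seq <- sequential_hcl(7, palette = "Blues 3")

# Diverging: two hues that meet at a neutral midpoint
pal_div <- diverging_hcl(7, palette = "Blue-Red")

# Draw the three palettes side by side for comparison
swatchplot(Qualitative = pal_qual, Sequential = pal_seq, Diverging = pal_div)
```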

Resources for exploring color:

Rainbow palettes (8/15)

❌ Jet rainbow palette

Code

library(vital)
library(viridis)
am <- aus_mortality |> 
  filter(State == "Victoria", 
         Sex != "total", 
         Year < 1980, 
         Age < 90) 

ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_color_gradientn(colours = rainbow(10)) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Produces false detail, banding and color blindness ambiguity.

✅ viridis palettes

Code

ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_colour_gradientn(colors = viridis_pal(option = "turbo")(10)[10:1]) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

The viridis palettes have a uniform scale and match a grey scale ladder. The turbo palette alleviates the problems of the Jet rainbow palette.

Rainbow palettes (9/15)

❌ Jet rainbow palette

Code

ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_color_gradientn(colours = deutan(rainbow(10))) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Produces false detail, banding and ambiguity.

✅ viridis palettes

Code

ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_colour_gradientn(colors = deutan(viridis_pal(option = "turbo")(10)[10:1])) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Under simulated deuteranopia (deutan()), the colours remain readable and still follow the scale.

Transforming, e.g. colour scales (10/15)

If the variable mapped to colour has a right-skewed distribution, consider transforming it using a log or a square root.

This is the same data, where count has been transformed using square root.

Code

ggplot(as_tibble(Titanic), 
       aes(x=interaction(Sex, Age),
           y=interaction(Class, Survived), 
           fill=n)) +
  geom_tile() +
  xlab("Sex, Age") +
  ylab("Class, Survived") +
  scale_fill_continuous_sequential(
    palette = "Terrain", 
    trans="sqrt")
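A rough numeric sketch of why the transformation helps: after rescaling to the 0-1 range that the colour scale covers, the raw counts crowd near the low end, while the square-root counts are more spread out. The helper `rescale01()` is hypothetical and only base R is used.

```r
# Cell counts from the built-in Titanic table
n <- as.data.frame(Titanic)$Freq

# Hypothetical helper: map values linearly onto [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

raw_pos <- rescale01(n)         # position on an untransformed colour scale
sqrt_pos <- rescale01(sqrt(n))  # position on a square-root colour scale

# The typical cell sits further along the scale after the transformation
median(raw_pos)
median(sqrt_pos)
```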

Order categorical variables by the statistic (11/15)

❌ Default: alphabetical

Code

load("data/student_means.rda")
student_means_sub <- student_means |>
  filter(country %in% c("SGP", "KOR", "POL", "DEU", "NOR", "IRL", "GBR", "IDN", "AUS", "NZL", "USA", "TUR", "PHL", "MAR", "URY", "CHL", "COL", "CAN"))
ggplot(student_means_sub, aes(x=country, y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  theme(aspect.ratio = 2)

Full scale of the score (0 to 1000)

Code

ggplot(student_means_sub, aes(x=country, y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  ylim(c(0, 1000)) +
  theme(aspect.ratio = 2)

✅ Order by statistic

Code

ggplot(student_means_sub, 
       aes(x=fct_reorder(country, math), 
           y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  ylim(c(0, 1000)) +
  theme(aspect.ratio = 2)
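What `fct_reorder()` changes is the order of the factor levels, which in turn sets the order along the axis. A minimal sketch on made-up scores (not the values in `student_means`):

```r
library(forcats)

# Made-up mean scores for three countries (illustrative only)
scores <- data.frame(
  country = c("AUS", "SGP", "IDN"),
  math = c(500, 570, 380)
)

# Default ordering of a factor is alphabetical
levels(factor(scores$country))
# "AUS" "IDN" "SGP"

# Reordered by the statistic, from lowest to highest
levels(fct_reorder(scores$country, scores$math))
# "IDN" "AUS" "SGP"
```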

Do the calculation for the reader (12/15)

Code

data(anorexia, package="MASS")
ggplot(data=anorexia, 
  aes(x=Prewt, 
      y=Postwt, 
      colour=Treat)) + 
  coord_equal() +
  xlim(c(70, 110)) + 
  ylim(c(70, 110)) +
  xlab("Pre-treatment weight (lbs)") +  
  ylab("Post-treatment weight (lbs)") +
  geom_abline(intercept=0, slope=1,  
    colour="grey80", linewidth=1.25) + 
  geom_density2d() + 
  geom_point(size=3) +
  facet_grid(.~Treat) +
  theme(legend.position = "none")

Before and after treatment weight for anorexia patients
Three different treatments
Need to read the difference relative to a 45\(^\circ\) line

Code

ggplot(data=anorexia, 
  aes(x=Prewt, colour=Treat,
    y=(Postwt-Prewt)/Prewt*100)) + 
  xlab("Pre-treatment weight (lbs)") +  
  ylab("Percent increase in weight") +
  geom_hline(yintercept=0, linewidth=1.25, 
    colour="grey80") + 
  geom_point(size=3) +   
  facet_grid(.~Treat) +
  theme(aspect.ratio=1, legend.position = "none")

Compute the difference
Compare difference relative to before weight
Before weight is used as the baseline
EASIER to read the difference above and below a horizontal line
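The same calculation can be carried one step further into a summary per treatment. This is a sketch of one possible follow-up, not something shown on the slides. `MASS::anorexia` is referenced with its namespace so that MASS does not mask dplyr functions.

```r
library(dplyr)

# Percent change in weight, using the before weight as the baseline,
# then averaged within each treatment group
anorexia_change <- MASS::anorexia |>
  mutate(pct_change = (Postwt - Prewt) / Prewt * 100) |>
  group_by(Treat) |>
  summarise(mean_pct_change = mean(pct_change), n = n())

anorexia_change
```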

Aspect ratio (13/15)

❌ Wrong aspect ratio

The default aspect ratio in most plots is rectangular.

If you want to compare two quantities, including assessing correlation, the aspect ratio should be square.

Two ways to achieve this with ggplot2:

theme(aspect.ratio=1) PREFERRED
coord_equal()
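The two options behave differently, which this minimal sketch on the anorexia data illustrates. `theme(aspect.ratio = 1)` makes the panel square whatever the data ranges are. `coord_equal()` makes one unit on the x axis the same length as one unit on the y axis, so the panel is square only when both axes span the same range.

```r
library(ggplot2)

base <- ggplot(MASS::anorexia, aes(x = Prewt, y = Postwt)) +
  geom_point()

# Square panel, regardless of the ranges of the two variables
p_square_panel <- base + theme(aspect.ratio = 1)

# Equal unit lengths on both axes; panel shape follows the data ranges
p_equal_units <- base + coord_equal()
```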

Aspect ratio (14/15)

Lines should be, on average, at 45\(^\circ\).

To read and compare trend
To examine seasonality in time series

Summary and more (15/15)

Items that are primary elements of a plot:

colour
trend line (?)

Organising items:

place items to compare, close to each other
control the ordering, to make patterns easier to read
align axes for comparison across plots

Conventions:

time on horizontal
connecting dots
text horizontal
audience: academic, report, journalism

Calculations:

transformations to symmetry
do calculations for the reader
appropriate aspect ratio

Backgrounds:

Axes and text should sit in the background to be examined only when needing to interpret
Data elements should be pre-attentive, first items seen

Don’t repeat yourself: no units on each tick mark (e.g. %)

Data pre-processing:

to create mapping of variables
beware missing information

Exercise 1

Variables are:

party
vote count (but it’s a summary statistic)

Compare the counts for the party, or the proportion of votes obtained by the different parties

mapping of count to angle is sub-optimal
party not ordered by vote
with so many groups, colour is not distinguishable, so one cannot match to legend

Exercise 2

Variables:

Democrat: true or false
Margin of the vote: 0-80

This plot compares the distribution of the margin in the vote percentage, between states with a democratic majority party and the rest, for the polling data.

library(nullabor)
data(electoral)
polls <- electoral$polls
ggplot(polls) +
  geom_boxplot(aes(x=Democrat, 
                   y=Margin)) +
  xlab("democrat") + 
  scale_y_continuous("margin (%)", 
    breaks=seq(0, 100, 20),
    limits=c(0,100)) +
  theme(aspect.ratio = 1.2, 
        panel.grid.major.x = element_blank())

What is good:

numeric values mapped to position along a line
easy to compare medians, quartiles, and fences

Things that might be fixed:

Only 5 numbers, plus outliers
Is the dot on the FALSE boxplot really an outlier?
Not sure how many observations are summarised by the 5 numbers

Exercise 3

Source: El Tiempo, Bogota, 13 June 2025

Just the top plot

Variables are:

year (2022-2025)
month (Jan-May)
Number of murders

How is the number of murders changing over these years, and how does it relate to month?

Primary comparison is month, but it should be years.

The plot suffers from change blindness, which makes the yearly change hard to perceive.

Exercise 4

Variables are:

pollutant
value

Distribution of the values separately for each pollutant.

Pollutant is mapped to both x axis and to facet!

Wasted space more than anything else. Should be separate plots for each pollutant, focusing on the distribution of each.

It is not change blindness because we are not trying to compare these distributions.

Exercise 5

Variables are:

pollutant
value
location

Compare the distribution of the values for each pollutant between locations.

Good things:

value mapped to position along one axis
location mapped to x axis, so distributions can be compared

Distributions are mostly skewed, so comparison can only be made on a few observations, not the bulk of observations. Need to transform most of the pollutant values, probably using a log scale.

Exercise 6

Variables are:

year
rating
course
number of students completing the rating

Extra information: benchmark rating values (grey, green, red)

Examine the student rating trend over time, and compare these across units.

Aspect ratio to perceive trend
Rating mapped to length of a bar, instead of point along an axis
Cannot compare units
Ordering of units, could go from highest overall ratings to lowest
Semester information is mapped to be primary comparison for some units
Number of students submitting ratings is mapped to text

Also, number of students submitting ratings is not calibrated by number of students in the unit.

End of session 1

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
