Tutorial
Take a data plot and make it better

Dianne Cook
Monash University

Welcome 👋🏼

Thanks for joining to learn about making data plots today.


About the instructors:

🦘 Di is a Professor of Statistics. She has more than 30 years of experience in researching and teaching data visualisation, and in open source software development.
🐨 Jayani is a final year PhD student. She is working on methods to help decide on the best nonlinear low dimensional representation of high dimensional data, and is the author of several R packages.
🏛️ We are both in Econometrics and Business Statistics, at Monash University.

🧩 Feel free to ask questions any time. 🤔


🎯 The objectives for today are:

  1. Build your knowledge of cognitive perception principles for good graphics
  2. Recognise elements of a current design that can be improved
  3. Develop coding skills to implement improved design
Load these libraries to get started:
library(tidyverse)
library(colorspace)
library(patchwork)
library(broom)
library(palmerpenguins)
library(ggbeeswarm)
library(vcd)
library(nullabor)
library(MASS)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(dplyr::rename)
conflicts_prefer(dplyr::mutate)
conflicts_prefer(dplyr::summarise)

Session 1: Principles and tools

Outline

time topic
5 Outline
10 Tidy data
15 Grammar of graphics
15 Guided exercises
15 Cognitive principles
15 Guided exercises
15 Identifying poor elements
30 BREAK

Tidy data

Tidy data (1/5)

Illustrations from Julia Lowndes and Allison Horst

  • Each variable is a column; each column is a variable.

  • Each observation is a row; each row is an observation.

  • Each value is a cell; each cell is a single value.

  • Each table contains one data set.

  • Long form makes it easier to reshape in many different ways

  • Wider forms are common for analysis

Long form: one measured value per row. All other variables are descriptors (key variables)

Widest form: all measured values for an entity are in a single row.
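
To make the two forms concrete, here is a minimal sketch using a toy table. The four counts are copied from the 1997 TB output shown in this tutorial; the object names long, wide and long2 are placeholders chosen for this example.

```r
library(tidyr)
library(tibble)

# Toy long-form table: one measured value (count) per row
long <- tribble(
  ~year, ~sex, ~age,    ~count,
  1997,  "m",  "0-14",   1,
  1997,  "m",  "15-24",  8,
  1997,  "f",  "0-14",   0,
  1997,  "f",  "15-24", 10
)

# Widest form: all measured values for a year sit in a single row
wide <- pivot_wider(long,
  names_from = c(sex, age),
  values_from = count)

# And back to long form, splitting the column names on "_"
long2 <- pivot_longer(wide,
  cols = -year,
  names_to = c("sex", "age"),
  names_sep = "_",
  values_to = "count")
```

The long form is the one that maps most directly onto plot aesthetics, because sex and age are ordinary columns.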

Tidy format (2/5)

This WHO tuberculosis notifications data set is not in tidy format. The first step is to determine what the variables are.

Code
tb <- read_csv("data/TB_notifications_2023-08-21.csv") |>
  filter(country == "Australia", year > 1996, year < 2013) |>
  select(year, contains("new_sp")) 
glimpse(tb)
Rows: 16
Columns: 22
$ year         <dbl> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012
$ new_sp       <dbl> 226, 203, 285, 251, 228, 210, 113, 285, 241, 269, 281, 299, 267, 274, 301, 290
$ new_sp_m04   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 0, NA, 0, 0, 0, 2
$ new_sp_m514  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 3, NA, 3, 2, 2, 1
$ new_sp_m014  <dbl> 1, 0, 0, 3, 1, 1, 0, 0, 0, 1, 3, 2, 3, 2, 2, 3
$ new_sp_m1524 <dbl> 8, 11, 13, 16, 23, 15, 14, 18, 32, 33, 30, 46, 30, 42, 38, 26
$ new_sp_m2534 <dbl> 24, 22, 40, 35, 20, 20, 10, 16, 27, 35, 33, 33, 37, 33, 44, 40
$ new_sp_m3544 <dbl> 18, 18, 54, 25, 18, 26, 2, 17, 23, 23, 20, 20, 16, 22, 26, 17
$ new_sp_m4554 <dbl> 13, 13, 52, 24, 18, 19, 11, 15, 11, 21, 15, 27, 24, 25, 19, 25
$ new_sp_m5564 <dbl> 17, 15, 37, 19, 13, 13, 5, 11, 12, 16, 14, 23, 12, 9, 12, 16
$ new_sp_m65   <dbl> 28, 31, 49, 49, 35, 34, 30, 32, 30, 43, 37, 42, 34, 27, 37, 37
$ new_sp_mu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0
$ new_sp_f04   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 0, NA, 1, 1, 2, 0
$ new_sp_f514  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 4, NA, 3, 3, 1, 1
$ new_sp_f014  <dbl> 0, 2, 0, 0, 1, 0, 0, 0, 2, 2, 4, 3, 4, 4, 3, 1
$ new_sp_f1524 <dbl> 10, 19, 10, 15, 21, 15, 9, 6, 18, 18, 26, 27, 31, 36, 26, 27
$ new_sp_f2534 <dbl> 15, 24, 16, 19, 27, 21, 13, 17, 26, 27, 37, 32, 27, 43, 40, 48
$ new_sp_f3544 <dbl> 9, 15, 18, 12, 16, 15, 3, 5, 11, 14, 20, 14, 14, 12, 23, 15
$ new_sp_f4554 <dbl> 5, 8, 6, 15, 7, 6, 5, 7, 10, 7, 12, 6, 12, 2, 7, 11
$ new_sp_f5564 <dbl> 10, 2, 2, 5, 8, 4, 4, 3, 6, 9, 7, 11, 11, 5, 7, 9
$ new_sp_f65   <dbl> 12, 24, 26, 14, 20, 23, 7, 19, 14, 21, 23, 10, 12, 12, 17, 15
$ new_sp_fu    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0
  • year
  • sex
  • age category

Tidy data (3/5)

Steps to wrangle to tidy form:

  1. Select only the variables containing sex and age counts
  2. Pivot into long form
  3. Extract variables from names (sexage column)
  4. Tidy age codes

Is count a variable?

# A tibble: 12 × 4
    year sex   age   count
   <dbl> <chr> <fct> <dbl>
 1  1997 m     0-14      1
 2  1997 m     15-24     8
 3  1997 m     25-34    24
 4  1997 m     35-44    18
 5  1997 m     45-54    13
 6  1997 m     55-64    17
 7  1997 m     > 65     28
 8  1997 f     0-14      0
 9  1997 f     15-24    10
10  1997 f     25-34    15
11  1997 f     35-44     9
12  1997 f     45-54     5
Code
tb_tidy <- tb |>
  select(-new_sp, -new_sp_m04, -new_sp_m514, 
                  -new_sp_f04, -new_sp_f514) |> 
  pivot_longer(starts_with("new_sp"), 
    names_to = "sexage", 
    values_to = "count") |>
  mutate(sexage = str_remove(sexage, "new_sp_")) |>
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  ) |>
  filter(age != "u") |>
  mutate(age = fct_recode(age, "0-14" = "014",
                          "15-24" = "1524",
                          "25-34" = "2534",
                          "35-44" = "3544",
                          "45-54" = "4554",
                          "55-64" = "5564",
                          "> 65" = "65"))
tb_tidy |> slice_head(n=12)

Why do it? (4/5)

Illustrations from Julia Lowndes and Allison Horst

Tidy data is the starting point for statistical analysis, and data visualisation.


Read more in the tidy data paper and the data wrangling paper.

Tidy data = statistical data (5/5)



\[ X = \left[ \begin{array}{cccc} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{array} \right] \]

Variables \(x_1, x_2, \dots, x_p\) are in the columns, and the \(n\) observations are in the rows.


Graphics built on tidy data fit nicely with your statistical analysis too.

Grammatical descriptions for plots

Grammar (1/5)

A grammar of graphics maps the variables from a tidy data set to elements of the plot.

It’s like having the DNA rather than a species name, so you know how the plots are related to each other.

The same script can be applied to different data.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE> +
  <THEME>
tb_yr <- tb_tidy |>
  group_by(year) |>
  summarise(count = sum(count, na.rm=TRUE)) 
gg1 <- ggplot(tb_yr, 
  aes(x=year, y=count)) +
  geom_col() +
  ylim(c(0, 350))
gg2 <- ggplot(tb_yr, 
  aes(x=year, y=count)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  ylim(c(0, 350))
gg1 + gg2 + plot_layout(ncol=1)

These plots examine the relationship between TB incidence and time as years.

Grammar and variables (2/5)

  • Democrat: true or false
  • Margin of the vote: 0-80
DATA: electoral
MAPPING: x = Democrat, y = Margin
GEOM: boxplot (calculates five number summary, and displays as boxplot)
library(nullabor)
data(electoral)
polls <- electoral$polls
ggplot(polls) +
  geom_boxplot(aes(x=Democrat, 
                   y=Margin)) +
  xlab("democrat") + 
  scale_y_continuous("margin (%)", 
    breaks=seq(0, 100, 20),
    limits=c(0,100)) +
  theme(aspect.ratio = 1.2, 
        panel.grid.major.x = element_blank())

This plot compares the distribution of the margin in the vote percentage, between states with a democratic majority party and the rest, for the polling data.


  • year
  • count ?? yes, for plotting purposes, this is a variable
  • age
DATA: tb_tidy
MAPPING: x = year, y = count, colour = age
GEOM: smooth, with method lm (linear model)
tb_tidy |>
  filter(age %in% c("45-54", "55-64"),
         sex == "f") |>
  ggplot() + 
    geom_smooth(aes(x=year, 
                  y=count,
                  colour=age), 
              se=FALSE, 
              method="lm") +
    scale_color_discrete_divergingx(palette="Geyser") +
    scale_x_continuous("year", 
      breaks = seq(1998, 2012, 4), 
      labels = c("98", "02", "06", "10")) +
    theme(aspect.ratio = 0.8, 
      axis.text = element_text(size=10))

This plot compares the linear trend in TB incidence over time between 45-54 and 55-64 year old females.

Plot from grammar (3/5)

Here is the grammatical description

DATA: tb_tidy, 2012
MAPPING: x=age, fill=sex
STAT: count
POSITION: stack
GEOM: bar

tb_tidy |>
  filter(year == 2012) |>
  ggplot() + 
  geom_bar(aes(x=age, 
               weight=count,
               fill=sex),
           alpha=0.8) +
  scale_fill_discrete_divergingx(palette="Geyser") +
  theme_bw() +
  theme(aspect.ratio = 0.8, 
    axis.text = element_text(size=10))

See documentation: geom_bar, after_stat(count)

This plot examines the relationship between TB incidence and age, and sex (although it’s almost impossible from this arrangement to assess this last relationship).

Here is the grammatical description

DATA: tb_tidy, 2012
MAPPING: x=age, fill=sex
STAT: proportion
POSITION: fill
GEOM: bar

tb_tidy |>
  filter(year == 2012) |>
  ggplot() + 
  geom_bar(aes(x=age, 
               weight=count,
               fill=sex),
           position="fill", alpha=0.8) +
  scale_fill_discrete_divergingx(palette="Geyser") +
  ylab("proportion") +
  theme_bw() +
  theme(aspect.ratio = 0.8, 
    axis.text = element_text(size=10))

See documentation: geom_bar, after_stat(prop)

This plot examines the relationship between TB incidence and age, and sex, focusing on the proportion of each sex within each age group.

Make the data do the work for your visualisation (4/5)

# A tibble: 10 × 4
    year age       m     f
   <dbl> <fct> <dbl> <dbl>
 1  1997 0-14      1     0
 2  1997 15-24     8    10
 3  1997 25-34    24    15
 4  1997 35-44    18     9
 5  1997 45-54    13     5
 6  1997 55-64    17    10
 7  1997 > 65     28    12
 8  1998 0-14      0     2
 9  1998 15-24    11    19
10  1998 25-34    22    24

Levels of the variable sex have been split into different columns.

tb_bad |> 
  ggplot() + 
    geom_point(aes(x=year, y=m), colour = "#A39000") +
    geom_point(aes(x=year, y=f), colour = "#93B3FE")
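
The object tb_bad is not created in the code shown in this tutorial. One way it could be built, assuming it is simply the tidy data with the levels of sex spread into separate columns, is sketched here on a few rows copied from the 1997 output. The name tb_small is a stand-in used only for this example.

```r
library(tidyr)
library(tibble)

# A few rows of the tidy data, copied from the printed 1997 output
tb_small <- tribble(
  ~year, ~sex, ~age,    ~count,
  1997,  "m",  "0-14",   1,
  1997,  "m",  "15-24",  8,
  1997,  "f",  "0-14",   0,
  1997,  "f",  "15-24", 10
)

# Assumed construction: split the levels of sex into columns m and f
tb_bad <- tb_small |>
  pivot_wider(names_from = sex, values_from = count)
```

With this layout each sex needs its own geom_point() call and a hand-picked colour, which is exactly the extra work that the tidy form avoids.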

# A tibble: 10 × 4
    year sex   age   count
   <dbl> <chr> <fct> <dbl>
 1  1997 m     0-14      1
 2  1997 m     15-24     8
 3  1997 m     25-34    24
 4  1997 m     35-44    18
 5  1997 m     45-54    13
 6  1997 m     55-64    17
 7  1997 m     > 65     28
 8  1997 f     0-14      0
 9  1997 f     15-24    10
10  1997 f     25-34    15

The variable sex is mapped to colour, and the plotting software handles the different levels. Palettes can then be used to handle the colour mapping appropriately.

tb_tidy |> 
  ggplot() + 
    geom_point(aes(x=year, 
                   y=count, 
                   colour=sex))

Guided exercises

Exercise 1

Data on World Development Indicators (WDI) from the World Bank.


Rows: 4,793
Columns: 23
$ `Country Name`  <chr> "Afghanistan", "Afghanistan", "Afg…
$ `Country Code`  <chr> "AFG", "AFG", "AFG", "AFG", "AFG",…
$ `Series Name`   <chr> "Access to clean fuels and technol…
$ `Series Code`   <chr> "EG.CFT.ACCS.ZS", "EG.CFT.ACCS.RU.…
$ `2004 [YR2004]` <chr> "10.5", "1.9", "45.3", "NA", "20.1…
$ `2005 [YR2005]` <chr> "11.9", "2.4", "50.2", "NA", "20.1…
$ `2006 [YR2006]` <chr> "13.5", "3", "54.7", "NA", "17.011…
$ `2007 [YR2007]` <chr> "15.1", "3.6", "59.2", "NA", "13.0…
$ `2008 [YR2008]` <chr> "16.6", "4.3", "62.9", "NA", "6.76…
$ `2009 [YR2009]` <chr> "18.3", "5.1", "66.4", "NA", "4.52…
$ `2010 [YR2010]` <chr> "19.9", "5.9", "69.4", "NA", "3.12…
$ `2011 [YR2011]` <chr> "21.3", "7", "72", "NA", "2.512463…
$ `2012 [YR2012]` <chr> "22.9", "8", "74.3", "NA", "2.9476…
$ `2013 [YR2013]` <chr> "24.5", "9", "76.1", "NA", "3.4903…
$ `2014 [YR2014]` <chr> "26.1", "10.2", "78", "NA", "3.474…
$ `2015 [YR2015]` <chr> "27.6", "11.4", "79.5", "NA", "3.5…
$ `2016 [YR2016]` <chr> "28.8", "12.6", "80.5", "NA", "4.3…
$ `2017 [YR2017]` <chr> "30.3", "13.5", "81.6", "NA", "NA"…
$ `2018 [YR2018]` <chr> "31.4", "14.5", "82.6", "NA", "NA"…
$ `2019 [YR2019]` <chr> "32.6", "15.6", "83.2", "NA", "NA"…
$ `2020 [YR2020]` <chr> "33.8", "16.4", "83.8", "NA", "NA"…
$ `2021 [YR2021]` <chr> "34.9", "17.4", "84.5", "NA", "NA"…
$ `2022 [YR2022]` <chr> "36.1", "18.5", "85", "NA", "NA", …
  • What are the variables?
  • What are the steps needed to wrangle it into tidy form?
  • country: name and code
  • indicator 1 to \(p\): name and code
  • year

Pre-process, by creating a country dictionary table with unique Country Name and Country Code and an indicator dictionary table with unique Series Name and Series Code. Keep only the code columns in main data.

  1. Pivot year to long form: country, year, indicator and value
  2. Clean up year text
  3. Convert character to numeric, where needed

The data could also be pivoted to wide form, with indicators in separate columns.
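
A minimal sketch of that last step, on a toy table. The EG.CFT.ACCS.ZS values are the first two Afghanistan entries printed above; IND_B and its values are hypothetical placeholders, as is the name wdi_toy.

```r
library(tidyr)
library(tibble)

# Toy long-form data: country, indicator, year, value
wdi_toy <- tribble(
  ~country, ~indicator,       ~year, ~value,
  "AFG",    "EG.CFT.ACCS.ZS", 2004,  10.5,
  "AFG",    "IND_B",          2004,   1.0,   # hypothetical indicator and value
  "AFG",    "EG.CFT.ACCS.ZS", 2005,  11.9,
  "AFG",    "IND_B",          2005,   2.0    # hypothetical indicator and value
)

# One column per indicator, one row per country and year
wdi_wide <- pivot_wider(wdi_toy,
  names_from = indicator,
  values_from = value)
```

The wide form is convenient when two indicators are to be plotted against each other, for example in a scatterplot.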

Exercise 2

Make a plot to examine how EG.CFT.ACCS.ZS (“Access to clean fuels and technologies for cooking (% of population)”) changes over time for each country.

tidying data
wdi_country <- wdi |>
  select(`Country Name`, `Country Code`) |>
  distinct()
wdi_indicator <- wdi |>
  select(`Series Name`, `Series Code`) |>
  distinct()
wdi_tidy <- wdi |>
  select(`Country Code`, `Series Code`, 
    `2004 [YR2004]`:`2022 [YR2022]`) |>
  pivot_longer(cols=`2004 [YR2004]`:`2022 [YR2022]`, 
    names_to="year", values_to="value") |>
  rename(country = `Country Code`,
         indicator = `Series Code`) |>
  mutate(value = as.numeric(value)) |>
  mutate(year = str_sub(year, 1, 4)) |>
  mutate(year = as.numeric(year))
DATA: wdi_tidy, EG.CFT.ACCS.ZS
MAPPING: x=year, y=value, group=country
GEOM: line
wdi_tidy |>
  filter(indicator == "EG.CFT.ACCS.ZS") |>
  ggplot() +
    geom_line(aes(x=year, 
                  y=value, 
                  group=country),
              alpha = 0.5) +
    xlab("") + ylab("Access to clean fuel") +
    theme_minimal()

Exercise 3

We can read the proportion, but we have lost the size of each category. The way to fix this is to use a mosaic plot, which maps the width of the columns to count.

# geom_mosaic() comes from the ggmosaic package
library(ggmosaic)
tb_tidy |>
  filter(year == 2012) |>
  ggplot() + 
  # ggmosaic expects the x variable to be wrapped in product()
  geom_mosaic(aes(x=product(age), 
                 weight=count,
                 fill=sex)) +
  scale_fill_discrete_divergingx(palette="Geyser") +
  scale_y_continuous("proportion", breaks=seq(0,1,0.25)) +
  #theme_bw() +
  theme(aspect.ratio = 0.6, 
    axis.text = element_text(size=10))

Cognitive perception principles

Hierarchy of mappings (1/15)



Cleveland and McGill (1984)



Illustrations made by Emi Tanaka

Hierarchy of mappings (2/15)

The ranking is based on how accurately readers could recover the numerical values from each mapping.

  1. Position - common scale (BEST)
  2. Position - nonaligned scale
  3. Length, direction, angle
  4. Area
  5. Volume, curvature
  6. Shading, color (WORST)

Primary mapping used in common plots

  1. scatterplot, barchart
  2. side-by-side boxplot, stacked barchart
  3. piechart, rose plot, gauge plot, donut, wind direction map, starplot
  4. treemap, bubble chart, mosaicplot
  5. chernoff face
  6. choropleth map
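
To see the difference between the top and the middle of the hierarchy, the sketch below draws the same toy counts twice: once mapped to position along a common scale, and once mapped to angle. The data values are made up for illustration and deliberately close together.

```r
library(ggplot2)

# Toy counts for five categories (made-up values, chosen to be similar)
d <- data.frame(
  group = c("A", "B", "C", "D", "E"),
  count = c(23, 21, 19, 20, 17)
)

# Position along a common scale: small differences can be read off
p_bar <- ggplot(d, aes(x = group, y = count)) +
  geom_col()

# Angle: the same counts as a pie chart, where differences are harder to judge
p_pie <- ggplot(d, aes(x = "", y = count, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()
```

Print p_bar and p_pie side by side and try to rank the five groups from each plot.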

Proximity (3/15)

Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.

Change blindness (4/15)

Making comparisons across plots requires the eye to jump from one focal point to another. It may result in not noticing differences.


Change blindness (5/15)


Help the reader remember what the pattern is in other panels by under-plotting all of the data in each panel.
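
One way to under-plot in ggplot2 is to add a background layer whose data does not contain the facetting variable, so that it is drawn in every panel. This is a sketch on simulated data; the names d and d_all and the three groups are hypothetical.

```r
library(ggplot2)

# Simulated data with three groups (hypothetical, for illustration only)
set.seed(2025)
d <- data.frame(
  x = rnorm(150),
  group = rep(c("a", "b", "c"), each = 50)
)
d$y <- d$x * rep(c(-1, 0, 1), each = 50) + rnorm(150, sd = 0.5)

# Copy of the data without the facetting variable: repeated in every panel
d_all <- d[, c("x", "y")]

p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = d_all, colour = "grey80") +  # all points, in the background
  geom_point(colour = "black") +                 # this panel's points on top
  facet_wrap(~group) +
  theme(aspect.ratio = 1)
```

Using a single muted colour for the background layer keeps the display from becoming too busy.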

Too many colours, too busy

Pre-attentive (6/15)

Can you find the odd one out?

Is it easier now?

Colour palettes should match variable type (7/15)

There are three basic choices of palettes:

  • qualitative
  • sequential
  • diverging
  • (rainbow)
  • (palindrome) SKIPPED

Which one you choose depends on the

  • data values
  • and what to emphasize
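
As a small sketch, the colorspace package loaded at the start provides a generator for each of the three basic palette types. The palette names and the numbers of colours used here are arbitrary choices for illustration.

```r
library(colorspace)

# Qualitative: unordered categories
pal_qual <- qualitative_hcl(4, palette = "Dark 3")

# Sequential: ordered values running from low to high
pal_seq <- sequential_hcl(7, palette = "Blues 3")

# Diverging: values on either side of a meaningful midpoint
pal_div <- diverging_hcl(7, palette = "Blue-Red")

# Display the three palettes as swatches
swatchplot(
  "qualitative" = pal_qual,
  "sequential"  = pal_seq,
  "diverging"   = pal_div
)
```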

Resources for exploring color:

rainbow palettes (8/15)

Jet rainbow palette

Code
library(vital)
library(viridis)
am <- aus_mortality |> 
  filter(State == "Victoria", 
         Sex != "total", 
         Year < 1980, 
         Age < 90) 

ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_color_gradientn(colours = rainbow(10)) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Produces false detail, banding and color blindness ambiguity.

viridis palettes

Code
ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_colour_gradientn(colors = viridis_pal(option = "turbo")(10)[10:1]) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Viridis palettes have a uniform scale and match the grey scale ladder. The turbo palette alleviates the problems of the Jet rainbow palette.

rainbow palettes (9/15)

Jet rainbow palette

Code
ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_color_gradientn(colours = deutan(rainbow(10))) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Produces false detail, banding and ambiguity.

viridis palettes

Code
ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_colour_gradientn(colors = deutan(viridis_pal(option = "turbo")(10)[10:1])) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Colors still readable and following scale.

Transforming, e.g. colour scales (10/15)

If the variable mapped to colour has a right-skewed distribution, consider transforming it using a log or a square root.


This is the same data, where count has been transformed using square root.

Code
ggplot(as_tibble(Titanic), 
       aes(x=interaction(Sex, Age),
           y=interaction(Class, Survived), 
           fill=n)) +
  geom_tile() +
  xlab("Sex, Age") +
  ylab("Class, Survived") +
  scale_fill_continuous_sequential(
    palette = "Terrain", 
    trans="sqrt")

Order categorical variables by the statistic (11/15)

❌ Default: alphabetical
Code
load("data/student_means.rda")
student_means_sub <- student_means |>
  filter(country %in% c("SGP", "KOR", "POL", "DEU", "NOR", "IRL", "GBR", "IDN", "AUS", "NZL", "USA", "TUR", "PHL", "MAR", "URY", "CHL", "COL", "CAN"))
ggplot(student_means_sub, aes(x=country, y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  theme(aspect.ratio = 2)

Full range of the scale (0 to 1000)
Code
ggplot(student_means_sub, aes(x=country, y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  ylim(c(0, 1000)) +
  theme(aspect.ratio = 2)

✅ Order by statistic
Code
ggplot(student_means_sub, 
       aes(x=fct_reorder(country, math), 
           y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  ylim(c(0, 1000)) +
  theme(aspect.ratio = 2)

Read more about OECD PISA

Do the calculation for the reader (12/15)

Code
data(anorexia, package="MASS")
ggplot(data=anorexia, 
  aes(x=Prewt, 
      y=Postwt, 
        colour=Treat)) + 
  coord_equal() +
  xlim(c(70, 110)) + 
  ylim(c(70, 110)) +
  xlab("Pre-treatment weight (lbs)") +  
  ylab("Post-treatment weight (lbs)") +
  geom_abline(intercept=0, slope=1,  
    colour="grey80", linewidth=1.25) + 
  geom_density2d() + 
  geom_point(size=3) +
  facet_grid(.~Treat) +
  theme(legend.position = "none")

  • Before and after treatment weight for anorexia patients
  • Three different treatments
  • Need to read the difference relative to a 45\(^\circ\) line
Code
ggplot(data=anorexia, 
  aes(x=Prewt, colour=Treat,
    y=(Postwt-Prewt)/Prewt*100)) + 
  xlab("Pre-treatment weight (lbs)") +  
  ylab("Percent increase in weight") +
  geom_hline(yintercept=0, linewidth=1.25, 
    colour="grey80") + 
  geom_point(size=3) +   
  facet_grid(.~Treat) +
  theme(aspect.ratio=1, legend.position = "none")

  • Compute the difference
  • Compare difference relative to before weight
  • Before weight is used as the baseline
  • EASIER to read the difference above and below a horizontal line

Aspect ratio (13/15)

❌ Wrong aspect ratio


The default aspect ratio in most plots is rectangular.



If you want to compare two quantities, including assessing correlation, the aspect ratio should be square.



Two ways to achieve this with ggplot2:

  • theme(aspect.ratio=1) PREFERRED
  • coord_equal()
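
The two options behave differently when the variables are on different scales, which this sketch on simulated data tries to show. The data and object names are hypothetical.

```r
library(ggplot2)

# Simulated pair of variables on very different scales (hypothetical data)
set.seed(1)
d <- data.frame(x = rnorm(100, mean = 50, sd = 10))
d$y <- 0.02 * d$x + rnorm(100, sd = 0.2)

# Square plotting region, whatever the units of x and y
p_square <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  theme(aspect.ratio = 1)

# One unit of x is drawn the same length as one unit of y,
# so the panel is only square if the two ranges happen to match
p_equal <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  coord_equal()
```

Here p_equal is drawn as a wide, flat strip because x spans a far larger range than y, while p_square stays square. That is one likely reason for preferring theme(aspect.ratio=1) when the variables have different units.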

Aspect ratio (14/15)

Lines should be on average 45\(^\circ\).

  • To read and compare trend
  • To examine seasonality in time series
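
ggplot2 has no built-in option for this, so the sketch below computes an aspect ratio by hand. It assumes the "median absolute slope" version of the rule: pick the height to width ratio at which the median absolute slope of the line segments is 45 degrees on the page. The helper bank_45() is hypothetical and written only for this example.

```r
library(ggplot2)

# Hypothetical helper: aspect ratio (height / width) at which the median
# absolute slope of the line segments is 45 degrees on the page
bank_45 <- function(x, y) {
  dx <- diff(x) / diff(range(x))   # horizontal steps, as a fraction of the x range
  dy <- diff(y) / diff(range(y))   # vertical steps, as a fraction of the y range
  slopes <- abs(dy / dx)
  slopes <- slopes[is.finite(slopes) & slopes > 0]
  1 / median(slopes)
}

# Built-in yearly sunspot series as an example of a cyclic time series
d <- data.frame(time = as.numeric(time(sunspot.year)),
                count = as.numeric(sunspot.year))
ar <- bank_45(d$time, d$count)

p <- ggplot(d, aes(x = time, y = count)) +
  geom_line() +
  theme(aspect.ratio = ar)
```

For a cyclic series like this one the computed ratio is well below 1, giving a wide, short plot in which the rise and fall of each cycle can be compared.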

Summary and more (15/15)

Items that are primary elements of a plot:

  • colour
  • trend line (?)

Organising items:

  • place items to compare, close to each other
  • control the ordering, to make patterns easier to read
  • align axes for comparison across plots

Conventions:

  • time on horizontal
  • connecting dots
  • text horizontal
  • audience: academic, report, journalism

Calculations:

  • transformations to symmetry
  • do calculations for the reader
  • appropriate aspect ratio

Backgrounds:

  • Axes and text should sit in the background to be examined only when needing to interpret
  • Data elements should be pre-attentive, first items seen

Don’t repeat yourself: no units on each tick mark (e.g. %)

Data pre-processing:

  • to create mapping of variables
  • beware missing information

Exercise 1

Variables are:

  • party
  • vote count (but it’s a summary statistic)

Compare the counts for the party, or the proportion of votes obtained by the different parties

  • mapping of count to angle is sub-optimal
  • party not ordered by vote
  • with so many groups, colour is not distinguishable, so one cannot match to legend

Exercise 2

Variables:

  • Democrat: true or false
  • Margin of the vote: 0-80

This plot compares the distribution of the margin in the vote percentage, between states with a democratic majority party and the rest, for the polling data.

library(nullabor)
data(electoral)
polls <- electoral$polls
ggplot(polls) +
  geom_boxplot(aes(x=Democrat, 
                   y=Margin)) +
  xlab("democrat") + 
  scale_y_continuous("margin (%)", 
    breaks=seq(0, 100, 20),
    limits=c(0,100)) +
  theme(aspect.ratio = 1.2, 
        panel.grid.major.x = element_blank())

What is good:

  • numeric values mapped to position along a line
  • easy to compare medians, quartiles, and fences

Things that might be fixed:

  • Only 5 numbers, plus outliers
  • Is the dot on the FALSE boxplot really an outlier?
  • Not sure how many observations are summarised by the 5 numbers
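
One possible fix, sketched on simulated data rather than the real polls, is to draw every observation over a muted boxplot using geom_quasirandom() from the ggbeeswarm package loaded at the start. The data frame polls_sim, its group sizes and its values are hypothetical stand-ins.

```r
library(ggplot2)
library(ggbeeswarm)

# Simulated stand-in for the polling data (hypothetical values and group sizes)
set.seed(42)
polls_sim <- data.frame(
  Democrat = rep(c(FALSE, TRUE), times = c(20, 31)),
  Margin = c(rexp(20, rate = 1 / 15), rexp(31, rate = 1 / 20))
)

p <- ggplot(polls_sim, aes(x = Democrat, y = Margin)) +
  geom_boxplot(outlier.shape = NA, colour = "grey60") +  # summary in the background
  geom_quasirandom(width = 0.2) +                        # every observation shown
  xlab("democrat") +
  ylab("margin (%)") +
  theme(aspect.ratio = 1.2)
```

Showing the points makes the number of observations in each group visible, and lets the reader judge whether a flagged outlier really stands apart.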

Exercise 3

Source: El Tiempo, Bogota, 13 June 2025

Just the top plot

Variables are:

  • year (2022-2025)
  • month (Jan-May)
  • Number of murders

How is the number of murders changing over these years, and in relation to month?

Primary comparison is month, but it should be years.

The plot suffers from change blindness, which makes the yearly change hard to perceive.

Exercise 4

Variables are:

  • pollutant
  • value

Distribution of the values separately for each pollutant.

Pollutant is mapped to both x axis and to facet!

Wasted space more than anything else. Should be separate plots for each pollutant, focusing on the distribution of each.

It is not change blindness because we are not trying to compare these distributions.

Exercise 5

Variables are:

  • pollutant
  • value
  • location

Compare the distribution of the values for each pollutant between locations.

Good things:

  • value mapped to position along one axis
  • location mapped to x axis, so distributions can be compared

Distributions are mostly skewed, so comparison can only be made on a few observations, not the bulk of observations. Need to transform most of the pollutant values, probably using a log scale.
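
A minimal sketch of the suggested transformation, using simulated right-skewed values in place of the pollutant data. The locations and values are hypothetical.

```r
library(ggplot2)

# Simulated right-skewed values at two locations (hypothetical data)
set.seed(7)
d <- data.frame(
  location = rep(c("site A", "site B"), each = 200),
  value = c(rlnorm(200, meanlog = 1, sdlog = 1),
            rlnorm(200, meanlog = 1.5, sdlog = 1))
)

# Raw scale: a few large values dominate the comparison
p_raw <- ggplot(d, aes(x = location, y = value)) +
  geom_boxplot()

# Log scale: the bulk of both distributions can be compared
p_log <- p_raw + scale_y_log10()
```

On the log scale the boxes are roughly symmetric, so the shift between the two locations can be read from the whole distribution and not only from the largest values.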

Exercise 6

Variables are:

  • year
  • rating
  • course
  • number of students completing the rating

Extra information: benchmark rating values (grey, green, red)

Examine the student rating trend over time, and compare these across units.

  • Aspect ratio to perceive trend
  • Rating mapped to length of a bar, instead of point along an axis
  • Cannot compare units
  • Ordering of units, could go from highest overall ratings to lowest
  • Semester information is mapped to be primary comparison for some units
  • Number of students submitting ratings is mapped to text

Also, number of students submitting ratings is not calibrated by number of students in the unit.

End of session 1

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.