Tutorial: Take a data plot and make it better

Dianne Cook
Monash University

Session 2: Practice, polish and significance

Outline

time topic
15 Fixing the plot design
10 Guided exercises
20 Styling and theming
15 Is the pattern visible and real?
30 Making YOUR plot better

Fixing the plot design

Example 1

What are the errors in this plot?

Variables are:

  • Number of TVs in household
  • Average math score
  • Country

We are examining the relationship between average math score and number of televisions in the household across 6 countries.

Code
load("data/student_tv.rda")
student_tv |>
  filter(year == 2022) |>
  filter(!is.na(television)) |>
  ggplot(aes(x=television, y=math, group=country)) +
    geom_line() +
    geom_point() +
    facet_wrap(~country, ncol=3) +
    ylab("average math") 

Perceiving trend is distracted by

  • different average values
  • ordering of countries

Reasonable aspects:

  • Aspect ratio for examining trend
  • Missing value category removed
Code
s_tv_2012 <- student_tv |>
  filter(year == 2022) |>
  filter(!is.na(television)) 
ggplot(s_tv_2012, 
      aes(x=television, 
          y=math, 
          group=country)) +
    geom_line() +
    geom_point() +
    facet_wrap(~country, ncol=3, scale="free") +
    ylab("average math") 

Code
slope <- function(y, x) {
  coef(lsfit(x, y))[2]
}
s_tv_trend <- s_tv_2012 |>
  mutate(tv_num = as.numeric(television)) |>
  group_by(country) |>
  summarise(s = slope(tv_num, math))
s_tv_2012 <- s_tv_2012 |>
  mutate(country = factor(country, 
    levels = s_tv_trend$country[order(s_tv_trend$s)]))
ggplot(s_tv_2012, 
      aes(x=television, 
          y=math, 
          group=country)) +
    geom_line() +
    geom_point() +
    facet_wrap(~country, ncol=3, scale="free") +
    ylab("average math") 

Example 2

Code
tb_aus_idn <- read_csv("data/TB_notifications_2023-08-21.csv") |>
  filter(iso3 %in% c("AUS", "IDN", "KOR")) |> 
  select(year, iso3, c_newinc) |>
  pivot_wider(names_from = iso3, values_from = c_newinc) |>
  mutate_at(vars(AUS:KOR), function(x) x/max(x, na.rm=TRUE)) |>
  pivot_longer(AUS:KOR, names_to = "iso3", 
    values_to = "rel_count")

tb_aus_idn |>
    ggplot(aes(x=year, y=rel_count, fill=iso3)) +
      geom_col(position = "dodge") +
  scale_fill_discrete_divergingx(palette = "Zissou 1") +
  xlab("") + ylab("Relative count") +
  theme(aspect.ratio = 0.5, 
        legend.title = element_blank())

Variables are: year, count, country

Mapping: x=year, y=count, colour=country

Geom: bar

Message: Compare change in TB incidence over time in different countries

  • mapping of count is to a length
  • proximity suggests compare country in year
  • map count to position along an axis
  • arrange countries, and use geoms, to make trend the primary element
Code
tb_aus_idn |>
    ggplot(aes(x=year, y=rel_count)) +
      geom_point() +
      geom_smooth(se=F, colour = "#E87700") +
      xlab("") + ylab("Relative count") +
      facet_wrap(~iso3, ncol=1) +
      theme(aspect.ratio = 0.6)

Guided exercises



What is wrong with these plots, and how would you fix them?




Some answers
# 1: soften grid, resolve axis overlap
# 2: put line over points
# 3: remove repeats '000
# 4: aspect ratio=1 to read association
# 5: re-map variables
process data
tb <- read_csv("data/TB_notifications_2023-08-21.csv") |>
  filter(country == "Australia", year > 1996, year < 2013) |>
  select(year, contains("new_sp")) 
tb_tidy <- tb |>
  select(-new_sp, -new_sp_m04, -new_sp_m514, 
                  -new_sp_f04, -new_sp_f514) |> 
  pivot_longer(starts_with("new_sp"), 
    names_to = "sexage", 
    values_to = "count") |>
  mutate(sexage = str_remove(sexage, "new_sp_")) |>
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  ) |>
  filter(age != "u") |>
  mutate(age = fct_recode(age, "0-14" = "014",
                          "15-24" = "1524",
                          "15-24" = "1524",
                          "25-34" = "2534",
                          "35-44" = "3544",
                          "45-54" = "4554",
                          "55-64" = "5564",
                          "> 65" = "65"))
plotting code
tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x=year, 
           y=count, 
           colour=sex)) + 
  geom_point() +
  geom_smooth(se=F) +
  facet_wrap(~age, ncol = 3) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 1)) +
  theme(axis.text = element_text(size="10"),
        panel.grid.major = element_line(color="black")) 

plotting code
tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x=year, 
           y=count, 
           colour=sex)) + 
  geom_smooth(se=F) +
  geom_point() +
  facet_wrap(~age, ncol = 3) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 2), 
    labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
  theme(axis.text = element_text(size="10",
                    color="grey80"),
        axis.title = element_text(size="10",
                    color="grey80"),
        panel.grid.major =   
           element_line(color="white"),
        panel.background =
           element_rect(fill="grey90", 
                        colour = "grey80")) 

plotting code
gapminder |> 
  filter (year == 2007) |>
  ggplot(aes(x=lifeExp, 
             y=gdpPercap,
             label=country,
             colour=continent)) +
  geom_point() +
  scale_colour_discrete_divergingx(palette = "Zissou 1")

plotting code
gapminder |> 
  filter (year == 2007) |>
  ggplot(aes(x=lifeExp, 
             y=gdpPercap,
             label=country)) +
  geom_point(colour = "#3B99B1") +
  scale_y_log10("gdpPercap ('000)",
                breaks = seq(0, 50000, 10000), 
                labels = seq(0, 50, 10)) +
  theme(aspect.ratio = 0.5)

plotting code
gapminder |> 
  filter (year == 2007) |>
  ggplot(aes(x=as.numeric(country),
             y=gdpPercap,
             fill=lifeExp)) +
  geom_col() + xlab("country") +
  scale_fill_distiller(palette = "RdPu", trans="log10") +    
  scale_y_log10("gdpPercap ('000)",
                breaks = seq(0, 50000, 10000), 
                labels = seq(0, 50, 10)) +
  theme(aspect.ratio = 0.5)

Styling and theming

Styling 1/7

The BBC cookbook has good basic advice for journalism and reports. The work of Amanda Cox has been instrumental in the NY Times data visualisations.

The Royal Statistical Society provides a Best Practices for Data Visualisation with its own RSSthemes package for ggplot2, for academic publications.



The default ggplot2 theme theme_grey() is designed to give the data plot the same ink strength on the page as the surrounding text.

Australia has maintained its status as a low-incidence tuberculosis (TB) country over the past decade, with notification rates that have remained relatively stable despite global fluctuations in TB burden. The country’s TB epidemiology presents a unique profile characterized by consistent low domestic transmission and a disease burden heavily concentrated among overseas-born populations.

Australia reports approximately 1300 cases of TB per year and has a TB case notification rate of 5.5 cases per 100,000 population, though recent data suggests this rate has shown slight variations. Incidence of tuberculosis (per 100,000 people) in Australia was reported at 6.2 in 2023, indicating a modest increase from historical averages. In 2015 this was 5.3 per 100,000 population per year, corresponding to 1,244 individual notifications, demonstrating the relatively stable nature of TB incidence in the country. The consistency of these figures over the decade reflects Australia’s effective TB control measures and robust public health surveillance systems. This rate has essentially remained unchanged since the mid-1980s, however a slight increase in rates has been observed since 2003, suggesting a gradual but measurable trend that health authorities continue to monitor closely.

Fig 1. TB incidence in Australia 1980-2021. Initially incidence dropped but it has been steadily climbing in the recent two decades. Note that, counts are not population adjusted.

A defining characteristic of Australia’s TB epidemiology is the overwhelming concentration of cases among overseas-born populations. Between 88% to 95% of TB cases in Australia have been reported in the overseas-born population, highlighting the critical role of migration patterns in shaping the country’s TB landscape. This demographic distribution has remained consistent throughout the past decade and represents one of the most significant epidemiological features of TB in Australia. The Australian-born population experiences markedly lower TB rates, with specific risk factors identified in vulnerable groups. Research from Victoria reveals that the most common risk factor in the 0–14 year age group was a household contact with tuberculosis (85.1%), followed by having a parent from a high tuberculosis incidence country (70.2%). These findings underscore the importance of contact tracing and screening programs, particularly for children in households with overseas-born parents from high-burden countries.

Overall themes 2/7

The ggthemes package supplements the handful available in ggplot2.

These are convenient definitions of the array of style choices for a data plot that include background, position and sizing of title and axis text, legend position and arrangement, axes, ticks and grid lines, extra space at plot edges, …

Code
library(patchwork)
p1 <- tb_aus_p + theme_grey() +
  theme(aspect.ratio = 0.5) + ggtitle("default")
p2 <- tb_aus_p + theme_minimal() +
  theme(aspect.ratio = 0.5) + ggtitle("minimal")
p3 <- tb_aus_p + theme_tufte() +
  theme(aspect.ratio = 0.5) + ggtitle("tufte")
p4 <- tb_aus_p + theme_economist() +
  theme(aspect.ratio = 0.5) + ggtitle("economist")
p1 + p2 + p3 + p4 + plot_layout(ncol=2)

Setting theme elements 3/7

A theme for all plots was specified for these two slide decks.

theme_set(ggthemes::theme_gdocs(base_size = 14) +
  theme(plot.background = 
        element_rect(fill = 'transparent', colour = NA),
        axis.line.x = element_line(color = "black", 
                                   linewidth = 0.4),
        axis.line.y = element_line(color = "black", 
                                   linewidth = 0.4),
        panel.grid.major = element_line(color = "grey90"),
        axis.ticks = element_line(color = "black"),
        plot.title.position = "plot",
        plot.title = element_text(size = 14),
        panel.background  = 
          element_rect(fill = 'transparent', colour = "black"),
        legend.background = 
          element_rect(fill = 'transparent', colour = NA),
        legend.key        = 
          element_rect(fill = 'transparent', colour = NA)
  ) 
)

Basic theme: theme_gdocs()

  • Text size adjusted
  • Background set
  • Axes lines adjusted
  • Grid lines colour changed
  • Title position changed

Colour palettes 4/7

The colorspace package has the most comprehensive set of palettes and tools for assessing, or creating new palettes.

If you have mapped the variables correctly, changing the colours only requires using the assignment of a new palette.

Code
tb_age <- tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x = year, 
             y = count, 
             colour = age)) +
  geom_point() +
  geom_smooth(se=F) +
  facet_wrap(~sex, ncol = 2) +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 2), 
    labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
  theme(axis.text = element_text(size="10")) 
tb_age  

Code
tb_age +
  scale_color_discrete_divergingx(palette="Zissou 1") 

Code
tb_age +
  scale_color_discrete_divergingx(palette="Geyser") 

Code
tb_age +
  scale_color_discrete_sequential(palette="OrYel") 

Colorblind proofing 5/7

The colorspace package has several functions deutan(), protan(), tritan() which simulates color vision deficiencies.

You need to create the changed colour palette outside the plotting code, and then apply it manually.

Also the package dichromat has tools to simulate color deficiencies.

Code
clrs <- deutan(scales::hue_pal()(6))
tb_age + scale_colour_manual("", values = clrs)

Code
clrs <- deutan(divergingx_hcl(6, "Zissou 1"))
tb_age +
  scale_colour_manual("", values = clrs)

Code
clrs <- deutan(divergingx_hcl(6, "Geyser"))
tb_age +
  scale_colour_manual("", values = clrs)

Code
clrs <- deutan(sequential_hcl(6, "OrYel"))
tb_age +
  scale_colour_manual("", values = clrs)

Creating ensemble of plots 6/7

One plot is often not enough, for a report or an academic publication.

Creating and arranging is an art.

The packages patchwork and cowplot can help with layout.

For the tuberculosis data, suppose we want one overview plot, and then separate displays showing sex and age effects.

Code
ov <- tb_tidy |>
  group_by(year) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, y=count)) +
    geom_col() +
    annotate("text", x=1996, y=320, label="A", size=8) +
    xlim(c(1996, 2013))
sex <- tb_tidy |>
  group_by(year, sex) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, weight=count, fill=sex)) +
    geom_bar(position="fill") +
    scale_fill_discrete_divergingx(palette = "TealRose",
      rev=TRUE) +
    ylab("proportion") +
    annotate("text", x=1996, y=0.95, label="B", size=8) +
    xlim(c(1996, 2013))
age <- tb_tidy |>
  group_by(year, age) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, weight=count, fill=age)) +
    geom_bar(position="fill") +
    scale_fill_discrete_sequential(palette = "Sunset") +
    ylab("proportion") +
    annotate("text", x=1996, y=0.95, label="C", size=8) +
    xlim(c(1996, 2013))

ov + sex/age + plot_layout(widths=c(2,1))

Saving the plot 7/7

The output format dramatically affects what can be done with the plot. The primary formats for publications are:

  • pdf
    • Academic articles typically should use pdf figures, which is mixed format, but typically high-resolution that can be accurately resized.
    • Files can be large if there are many observations and hence many points.
    • Doesn’t display in html documents, like slides or websites natively.
  • svg: scalable vector graphics, maintains full resolution and all elements, but there is less support for this format in many applications.
  • png, jpg
    • Articles in MS Office, because they don’t handle other formats!
    • Higher resolution can be obtained by setting size or dpi.

See ggplot2 book on saving plots for more details on using ggsave().

Code
set.seed(1130)
d <- tibble(x = runif(10000, -1, 1),
            y = x + x^2 + rnorm(10000))
p <- ggplot(d, aes(x=x, y=y)) +
  geom_point(alpha=0.3) +
  geom_smooth(colour="#006dae", se=F, linewidth=2)
ggsave("images/test.pdf", p, width=1000, height=800, units="px")
ggsave("images/test.png", p, width=1000, height=800, units="px")
ggsave("images/test-low.png", p, width=250, height=200, units="px", dpi=75)

PDF (no show in slides)

PNG - high

PNG - low

Is the pattern visible and real?

Checking for patterns 1/2

Should you report a pattern, when it is not really there?



Check for spurious patterns by making comparison plots, where you know that there is is no relationship.

Comparisons made by randomising the values of air_tmax.

Checking for patterns 2/2

Should you report a pattern, when it is not really there?



Check for spurious patterns by making comparison plots, where you know that there is is no relationship.

Comparisons made by randomising the values of air_tmin.

Better design

The same procedure can be used to compare different plot designs.

If the real plot is detected faster and more often amongst a page of decoys, using one design in comparison to another, then that design is better.

Making your own plot better

Let’s take the plunge!

  1. Share a plot that you have, that you’d like to improve
  2. Brainstorm the elements: variables, message
  3. Map out the aspects that might be improved
  4. Start making improvements one by one

We have a breakout room, for attendees online, for

  • one-on-one help with an instructor
  • or working together with other attendees

Cheerio 👋🏼

Thanks for joining to learn about making data plots today.



These are the resources that we have used today, that might be useful for you in the future.

End of session 2

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.