Tutorial: Take a data plot and make it better

Dianne Cook
Monash University

Session 2: Practice, polish and significance

Outline

time	topic
15	Fixing the plot design
10	Guided exercises
20	Styling and theming
15	Is the pattern visible and real?
30	Making YOUR plot better

Fixing the plot design

Example 1

What are the errors in this plot?

message
R
errors
fix 1
fix 2

Variables are:

Number of TVs in household
Average math score
Country

We are examining the relationship between average math score and number of televisions in the household across 6 countries.

Code

load("data/student_tv.rda")
student_tv |>
  filter(year == 2022) |>
  filter(!is.na(television)) |>
  ggplot(aes(x=television, y=math, group=country)) +
    geom_line() +
    geom_point() +
    facet_wrap(~country, ncol=3) +
    ylab("average math")

Perceiving trend is distracted by

different average values
ordering of countries

Reasonable aspects:

Aspect ratio for examining trend
Missing value category removed

Code

s_tv_2012 <- student_tv |>
  filter(year == 2022) |>
  filter(!is.na(television)) 
ggplot(s_tv_2012, 
      aes(x=television, 
          y=math, 
          group=country)) +
    geom_line() +
    geom_point() +
    facet_wrap(~country, ncol=3, scale="free") +
    ylab("average math")

Code

slope <- function(y, x) {
  coef(lsfit(x, y))[2]
}
s_tv_trend <- s_tv_2012 |>
  mutate(tv_num = as.numeric(television)) |>
  group_by(country) |>
  summarise(s = slope(tv_num, math))
s_tv_2012 <- s_tv_2012 |>
  mutate(country = factor(country, 
    levels = s_tv_trend$country[order(s_tv_trend$s)]))
ggplot(s_tv_2012, 
      aes(x=television, 
          y=math, 
          group=country)) +
    geom_line() +
    geom_point() +
    facet_wrap(~country, ncol=3, scale="free") +
    ylab("average math")

Example 2

Code

tb_aus_idn <- read_csv("data/TB_notifications_2023-08-21.csv") |>
  filter(iso3 %in% c("AUS", "IDN", "KOR")) |> 
  select(year, iso3, c_newinc) |>
  pivot_wider(names_from = iso3, values_from = c_newinc) |>
  mutate_at(vars(AUS:KOR), function(x) x/max(x, na.rm=TRUE)) |>
  pivot_longer(AUS:KOR, names_to = "iso3", 
    values_to = "rel_count")

tb_aus_idn |>
    ggplot(aes(x=year, y=rel_count, fill=iso3)) +
      geom_col(position = "dodge") +
  scale_fill_discrete_divergingx(palette = "Zissou 1") +
  xlab("") + ylab("Relative count") +
  theme(aspect.ratio = 0.5, 
        legend.title = element_blank())

message
errors
fix
R

Variables are: year, count, country

Mapping: x=year, y=count, colour=country

Geom: bar

Message: Compare change in TB incidence over time in different countries

mapping of count is to a length
proximity suggests compare country in year

map count to position along an axis
arrange countries, and use geoms, to make trend the primary element

Code

tb_aus_idn |>
    ggplot(aes(x=year, y=rel_count)) +
      geom_point() +
      geom_smooth(se=F, colour = "#E87700") +
      xlab("") + ylab("Relative count") +
      facet_wrap(~iso3, ncol=1) +
      theme(aspect.ratio = 0.6)

Guided exercises

What is wrong with these plots, and how would you fix them?

Some answers

# 1: soften grid, resolve axis overlap
# 2: put line over points
# 3: remove repeats '000
# 4: aspect ratio=1 to read association
# 5: re-map variables

process data

tb <- read_csv("data/TB_notifications_2023-08-21.csv") |>
  filter(country == "Australia", year > 1996, year < 2013) |>
  select(year, contains("new_sp")) 
tb_tidy <- tb |>
  select(-new_sp, -new_sp_m04, -new_sp_m514, 
                  -new_sp_f04, -new_sp_f514) |> 
  pivot_longer(starts_with("new_sp"), 
    names_to = "sexage", 
    values_to = "count") |>
  mutate(sexage = str_remove(sexage, "new_sp_")) |>
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  ) |>
  filter(age != "u") |>
  mutate(age = fct_recode(age, "0-14" = "014",
                          "15-24" = "1524",
                          "15-24" = "1524",
                          "25-34" = "2534",
                          "35-44" = "3544",
                          "45-54" = "4554",
                          "55-64" = "5564",
                          "> 65" = "65"))

plotting code

tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x=year, 
           y=count, 
           colour=sex)) + 
  geom_point() +
  geom_smooth(se=F) +
  facet_wrap(~age, ncol = 3) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 1)) +
  theme(axis.text = element_text(size="10"),
        panel.grid.major = element_line(color="black"))

plotting code

tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x=year, 
           y=count, 
           colour=sex)) + 
  geom_smooth(se=F) +
  geom_point() +
  facet_wrap(~age, ncol = 3) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 2), 
    labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
  theme(axis.text = element_text(size="10",
                    color="grey80"),
        axis.title = element_text(size="10",
                    color="grey80"),
        panel.grid.major =   
           element_line(color="white"),
        panel.background =
           element_rect(fill="grey90", 
                        colour = "grey80"))

plotting code

gapminder |> 
  filter (year == 2007) |>
  ggplot(aes(x=lifeExp, 
             y=gdpPercap,
             label=country,
             colour=continent)) +
  geom_point() +
  scale_colour_discrete_divergingx(palette = "Zissou 1")

plotting code

gapminder |> 
  filter (year == 2007) |>
  ggplot(aes(x=lifeExp, 
             y=gdpPercap,
             label=country)) +
  geom_point(colour = "#3B99B1") +
  scale_y_log10("gdpPercap ('000)",
                breaks = seq(0, 50000, 10000), 
                labels = seq(0, 50, 10)) +
  theme(aspect.ratio = 0.5)

plotting code

gapminder |> 
  filter (year == 2007) |>
  ggplot(aes(x=as.numeric(country),
             y=gdpPercap,
             fill=lifeExp)) +
  geom_col() + xlab("country") +
  scale_fill_distiller(palette = "RdPu", trans="log10") +    
  scale_y_log10("gdpPercap ('000)",
                breaks = seq(0, 50000, 10000), 
                labels = seq(0, 50, 10)) +
  theme(aspect.ratio = 0.5)

Styling and theming

Styling 1/7

The BBC cookbook has good basic advice for journalism and reports. The work of Amanda Cox has been instrumental in the NY Times data visualisations.

The Royal Statistical Society provides a Best Practices for Data Visualisation with its own RSSthemes package for ggplot2, for academic publications.

The default ggplot2 theme theme_grey() is designed to give the data plot the same ink strength on the page as the surrounding text.

Australia has maintained its status as a low-incidence tuberculosis (TB) country over the past decade, with notification rates that have remained relatively stable despite global fluctuations in TB burden. The country’s TB epidemiology presents a unique profile characterized by consistent low domestic transmission and a disease burden heavily concentrated among overseas-born populations.

Australia reports approximately 1300 cases of TB per year and has a TB case notification rate of 5.5 cases per 100,000 population, though recent data suggests this rate has shown slight variations. Incidence of tuberculosis (per 100,000 people) in Australia was reported at 6.2 in 2023, indicating a modest increase from historical averages. In 2015 this was 5.3 per 100,000 population per year, corresponding to 1,244 individual notifications, demonstrating the relatively stable nature of TB incidence in the country. The consistency of these figures over the decade reflects Australia’s effective TB control measures and robust public health surveillance systems. This rate has essentially remained unchanged since the mid-1980s, however a slight increase in rates has been observed since 2003, suggesting a gradual but measurable trend that health authorities continue to monitor closely.

Fig 1. TB incidence in Australia 1980-2021. Initially incidence dropped but it has been steadily climbing in the recent two decades. Note that, counts are not population adjusted.

A defining characteristic of Australia’s TB epidemiology is the overwhelming concentration of cases among overseas-born populations. Between 88% to 95% of TB cases in Australia have been reported in the overseas-born population, highlighting the critical role of migration patterns in shaping the country’s TB landscape. This demographic distribution has remained consistent throughout the past decade and represents one of the most significant epidemiological features of TB in Australia. The Australian-born population experiences markedly lower TB rates, with specific risk factors identified in vulnerable groups. Research from Victoria reveals that the most common risk factor in the 0–14 year age group was a household contact with tuberculosis (85.1%), followed by having a parent from a high tuberculosis incidence country (70.2%). These findings underscore the importance of contact tracing and screening programs, particularly for children in households with overseas-born parents from high-burden countries.

Overall themes 2/7

The ggthemes package supplements the handful available in ggplot2.

These are convenient definitions of the array of style choices for a data plot that include background, position and sizing of title and axis text, legend position and arrangement, axes, ticks and grid lines, extra space at plot edges, …

Code

library(patchwork)
p1 <- tb_aus_p + theme_grey() +
  theme(aspect.ratio = 0.5) + ggtitle("default")
p2 <- tb_aus_p + theme_minimal() +
  theme(aspect.ratio = 0.5) + ggtitle("minimal")
p3 <- tb_aus_p + theme_tufte() +
  theme(aspect.ratio = 0.5) + ggtitle("tufte")
p4 <- tb_aus_p + theme_economist() +
  theme(aspect.ratio = 0.5) + ggtitle("economist")
p1 + p2 + p3 + p4 + plot_layout(ncol=2)

Setting theme elements 3/7

A theme for all plots was specified for these two slide decks.

theme_set(ggthemes::theme_gdocs(base_size = 14) +
  theme(plot.background = 
        element_rect(fill = 'transparent', colour = NA),
        axis.line.x = element_line(color = "black", 
                                   linewidth = 0.4),
        axis.line.y = element_line(color = "black", 
                                   linewidth = 0.4),
        panel.grid.major = element_line(color = "grey90"),
        axis.ticks = element_line(color = "black"),
        plot.title.position = "plot",
        plot.title = element_text(size = 14),
        panel.background  = 
          element_rect(fill = 'transparent', colour = "black"),
        legend.background = 
          element_rect(fill = 'transparent', colour = NA),
        legend.key        = 
          element_rect(fill = 'transparent', colour = NA)
  ) 
)

Basic theme: theme_gdocs()

Text size adjusted
Background set
Axes lines adjusted
Grid lines colour changed
Title position changed

Colour palettes 4/7

The colorspace package has the most comprehensive set of palettes and tools for assessing, or creating new palettes.

If you have mapped the variables correctly, changing the colours only requires using the assignment of a new palette.

Default
1
2
3

Code

tb_age <- tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x = year, 
             y = count, 
             colour = age)) +
  geom_point() +
  geom_smooth(se=F) +
  facet_wrap(~sex, ncol = 2) +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 2), 
    labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
  theme(axis.text = element_text(size="10")) 
tb_age

Code

tb_age +
  scale_color_discrete_divergingx(palette="Zissou 1")

Code

tb_age +
  scale_color_discrete_divergingx(palette="Geyser")

Code

tb_age +
  scale_color_discrete_sequential(palette="OrYel")

Colorblind proofing 5/7

The colorspace package has several functions deutan(), protan(), tritan() which simulates color vision deficiencies.

You need to create the changed colour palette outside the plotting code, and then apply it manually.

Also the package dichromat has tools to simulate color deficiencies.

Default
1
2
3

Code

clrs <- deutan(scales::hue_pal()(6))
tb_age + scale_colour_manual("", values = clrs)

Code

clrs <- deutan(divergingx_hcl(6, "Zissou 1"))
tb_age +
  scale_colour_manual("", values = clrs)

Code

clrs <- deutan(divergingx_hcl(6, "Geyser"))
tb_age +
  scale_colour_manual("", values = clrs)

Code

clrs <- deutan(sequential_hcl(6, "OrYel"))
tb_age +
  scale_colour_manual("", values = clrs)

Creating ensemble of plots 6/7

One plot is often not enough, for a report or an academic publication.

Creating and arranging is an art.

The packages patchwork and cowplot can help with layout.

For the tuberculosis data, suppose we want one overview plot, and then separate displays showing sex and age effects.

Code

ov <- tb_tidy |>
  group_by(year) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, y=count)) +
    geom_col() +
    annotate("text", x=1996, y=320, label="A", size=8) +
    xlim(c(1996, 2013))
sex <- tb_tidy |>
  group_by(year, sex) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, weight=count, fill=sex)) +
    geom_bar(position="fill") +
    scale_fill_discrete_divergingx(palette = "TealRose",
      rev=TRUE) +
    ylab("proportion") +
    annotate("text", x=1996, y=0.95, label="B", size=8) +
    xlim(c(1996, 2013))
age <- tb_tidy |>
  group_by(year, age) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, weight=count, fill=age)) +
    geom_bar(position="fill") +
    scale_fill_discrete_sequential(palette = "Sunset") +
    ylab("proportion") +
    annotate("text", x=1996, y=0.95, label="C", size=8) +
    xlim(c(1996, 2013))

ov + sex/age + plot_layout(widths=c(2,1))

Saving the plot 7/7

The output format dramatically affects what can be done with the plot. The primary formats for publications are:

pdf
- Academic articles typically should use pdf figures, which is mixed format, but typically high-resolution that can be accurately resized.
- Files can be large if there are many observations and hence many points.
- Doesn’t display in html documents, like slides or websites natively.
svg: scalable vector graphics, maintains full resolution and all elements, but there is less support for this format in many applications.
png, jpg
- Articles in MS Office, because they don’t handle other formats!
- Higher resolution can be obtained by setting size or dpi.

See ggplot2 book on saving plots for more details on using ggsave().

Code

set.seed(1130)
d <- tibble(x = runif(10000, -1, 1),
            y = x + x^2 + rnorm(10000))
p <- ggplot(d, aes(x=x, y=y)) +
  geom_point(alpha=0.3) +
  geom_smooth(colour="#006dae", se=F, linewidth=2)
ggsave("images/test.pdf", p, width=1000, height=800, units="px")
ggsave("images/test.png", p, width=1000, height=800, units="px")
ggsave("images/test-low.png", p, width=250, height=200, units="px", dpi=75)

PDF (no show in slides)

PNG - high

PNG - low

Is the pattern visible and real?

Checking for patterns 1/2

Should you report a pattern, when it is not really there?

Check for spurious patterns by making comparison plots, where you know that there is is no relationship.

Which is different?
Actual plot

Comparisons made by randomising the values of air_tmax.

Checking for patterns 2/2

Should you report a pattern, when it is not really there?

Check for spurious patterns by making comparison plots, where you know that there is is no relationship.

Which is different?
Actual plot

Comparisons made by randomising the values of air_tmin.

Better design

The same procedure can be used to compare different plot designs.

If the real plot is detected faster and more often amongst a page of decoys, using one design in comparison to another, then that design is better.

Which is different?
Design 1
Design 2

Making your own plot better

Let’s take the plunge!

Share a plot that you have, that you’d like to improve
Brainstorm the elements: variables, message
Map out the aspects that might be improved
Start making improvements one by one

We have a breakout room, for attendees online, for

one-on-one help with an instructor
or working together with other attendees

Cheerio 👋🏼

Thanks for joining to learn about making data plots today.

These are the resources that we have used today, that might be useful for you in the future.

End of session 2

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.