A Grammar of Graphics

SISBID 2025
https://github.com/dicook/SISBID

TWO MINUTE CHALLENGE 🔮 👽 👼

Write down as many types of data plots as you can think of.

Go to menti.com and enter code 2979 2396

You've got 2 minutes!

What is a data plot?

  • data
  • aesthetics: mapping of variables to graphical elements
  • geom: type of plot structure to use
  • transformations: log scale, …
  • layers: multiple geoms, multiple data sets, annotation
  • facets: show subsets in different plots
  • themes: modifying style
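
As a rough sketch, each element in the list above can be matched to one line of ggplot2 code. The data frame `tb_toy` and its values are hypothetical, invented only so the example runs on its own; they are not WHO figures.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration (not real TB counts)
tb_toy <- expand.grid(
  year = 2000:2004,
  sex = c("f", "m"),
  age_group = c("15-24", "25-34"),
  stringsAsFactors = FALSE
)
tb_toy$count <- seq(10, by = 5, length.out = nrow(tb_toy))

p <- ggplot(tb_toy,                                     # data
            aes(x = year, y = count, colour = sex)) +   # aesthetics
  geom_point() +                                        # geom
  geom_line() +                                         # layer: a second geom
  scale_y_log10() +                                     # transformation
  facet_wrap(~age_group) +                              # facets
  theme_bw()                                            # theme

p
```

Removing or swapping any single line changes one element of the plot and leaves the rest of the specification alone.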

Why?

  • With the grammar, a data plot becomes a statistic.

  • It is a functional mapping from variable to graphical element. Then we can do statistics on charts!

  • With a grammar, we don't have individual animals in the zoo, we have the genetic code that says how one plot is related to another plot.

Elements of the grammar

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

7 key elements:

  • DATA
  • GEOM_FUNCTION
  • MAPPINGS
  • STAT
  • POSITION
  • COORDINATE_FUNCTION
  • FACET_FUNCTION
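
One possible way to fill in the template, with every placeholder written out explicitly. The data frame `tb_toy` is a hypothetical stand-in with invented values, and `position = "stack"` and `coord_cartesian()` are simply the defaults spelled out.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration (not real TB counts)
tb_toy <- expand.grid(
  year = 2000:2004,
  sex = c("f", "m"),
  age_group = c("15-24", "25-34"),
  stringsAsFactors = FALSE
)
tb_toy$count <- seq(10, by = 5, length.out = nrow(tb_toy))

p <- ggplot(data = tb_toy) +                           # DATA
  geom_bar(                                            # GEOM_FUNCTION
    mapping = aes(x = year, y = count, fill = sex),    # MAPPINGS
    stat = "identity",                                 # STAT
    position = "stack"                                 # POSITION
  ) +
  coord_cartesian() +                                  # COORDINATE_FUNCTION
  facet_grid(~age_group)                               # FACET_FUNCTION

p
```

In everyday use most of these are left at their defaults, which is why typical ggplot2 code is shorter than the full template.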

Example: Tuberculosis data

(Current) TB case notifications data from WHO.
Also available via R package getTBinR.

ggplot(tb_us, aes(x = year, 
                  y = count, 
                  fill = sex)) +
  geom_bar(stat = "identity") +
  facet_grid(~ age_group) 

TWO MINUTE CHALLENGE 🔮 👽 👼

Go to menti.com and enter code 2979 2396

  • What do you learn about tuberculosis incidence in the USA from this plot?
  • Give three changes to the plot that would improve it.

Results 🔮 👽 👼

Fix the plot

Manually selected fill colors; theme with white background for better contrast

# This uses a color blind friendly scale
ggplot(tb_us, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat="identity") + 
  facet_grid(~age_group)  + 
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) + 
  theme_bw() 

Color deficiency friendly color schemes

Compare males and females

ggplot(tb_us, aes(x=year, y=count, fill=sex)) +  
  geom_bar(stat="identity", position="fill") + 
  ylab("proportion") + 
  facet_grid(~age_group) +  
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) 

🔮 👽 👼 TWO MINUTE CHALLENGE

  • What do we learn about the data that is different from the previous plot?
  • What is easier and what is harder or impossible to learn from this arrangement?

Separate plots

# Make separate plots for males and females, focus on counts by category
ggplot(tb_us, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat="identity") +
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) + 
  facet_grid(sex~age_group) + 
  theme_bw()

Make a pie

# How to make a pie instead of a barchart - not straightforward
ggplot(tb_us, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat="identity") + 
  facet_grid(sex~age_group) + 
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) +
  coord_polar() + 
  theme_bw()

This isn't a pie, it's a rose plot!

Stacked bar

# Step 1 to make the pie
ggplot(tb_us, aes(x = 1, y = count, fill = factor(year))) +
  geom_bar(stat="identity", position="fill") + 
  facet_grid(sex~age_group) +
  scale_fill_viridis_d("", option="inferno") 

Pie chart

# Now we have a pie, note the mapping of variables
# and the modification to the coord_polar
ggplot(tb_us, aes(x = 1, y = count, fill = factor(year))) + 
  geom_bar(stat="identity", position="fill") + 
  facet_grid(sex~age_group) +
  scale_fill_viridis_d("", option="inferno") +
  coord_polar(theta = "y") 

🔮 👽 👼 TWO MINUTE CHALLENGE

  • What are the pros and cons of using the pie chart for this data?
  • Would it be better if the pies used age for the segments, and were faceted by year (and sex)?

Go to menti.com and enter code 2979 2396

Line plot vs barchart

ggplot(tb_us, aes(x=year, y=count, colour=sex)) +
  geom_line() + geom_point() +
  facet_grid(~age_group) +
  scale_colour_manual("Sex", values = c("#DC3220", "#005AB5")) +
  ylim(c(0,NA)) +
  theme_bw()

  • We can read counts for both sexes
  • Males and females can be directly compared
  • Temporal trend is visible

Line plot vs barchart

tb_us |> group_by(year, age_group) |> 
  summarise(p = count[sex=="m"]/sum(count), .groups = "drop") |>
  ggplot(aes(x=year, y=p)) +
  geom_hline(yintercept = 0.50, colour="grey50", linewidth=2) +
  geom_line() + geom_point() +
  facet_grid(~age_group) +
  ylab("Proportion of Males") +
  theme_bw()

  • Attention is forced to proportion of males
  • Direct comparison of counts within year and age
  • Equal proportion guideline provides a baseline for comparison
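
To see what the `summarise()` step computes, here is a small worked example. The data frame `tb_toy` is hypothetical, with counts invented to keep the arithmetic easy, and it assumes `sex` is coded `"f"`/`"m"` as in the code above.

```r
library(dplyr)

# Hypothetical toy data: one age group, two years, invented counts
tb_toy <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  age_group = "15-24",
  sex = c("f", "m", "f", "m"),
  count = c(40, 60, 30, 90)
)

# Within each year and age group: male count divided by total count
prop_m <- tb_toy |>
  group_by(year, age_group) |>
  summarise(p = count[sex == "m"] / sum(count), .groups = "drop")

prop_m$p
# 60 / (40 + 60) = 0.60 and 90 / (30 + 90) = 0.75
```

The expression relies on each year and age group having exactly one row with `sex == "m"`; a group with no male row would need separate handling.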

Your turn

Make sure you can make all the TB plots just shown. If you have extra time, try to:

  • Facet by sex, and make line plots of counts by age.
  • Show the points only, and overlay a linear model fit.
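
One possible starting point for the two extra tasks, offered as a sketch and not as the intended solution. It reads the first task as one line per age group over time, which is only one interpretation. The data frame `tb_toy` is hypothetical, with invented counts, and stands in for `tb_us`.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration (not real TB counts)
tb_toy <- expand.grid(
  year = 2000:2009,
  sex = c("f", "m"),
  age_group = c("15-24", "25-34", "35-44"),
  stringsAsFactors = FALSE
)
tb_toy$count <- with(
  tb_toy,
  round(200 - 10 * (year - 2000) + 20 * (sex == "m") + 15 * sin(year))
)

# Task 1: facet by sex, one line of counts per age group
p_lines <- ggplot(tb_toy, aes(x = year, y = count, colour = age_group)) +
  geom_line() +
  facet_grid(~sex) +
  theme_bw()

# Task 2: points only, with a linear model fit overlaid for each sex
p_lm <- ggplot(tb_toy, aes(x = year, y = count, colour = sex)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_grid(~age_group) +
  scale_colour_manual("Sex", values = c("#DC3220", "#005AB5")) +
  theme_bw()

p_lines
p_lm
```

Replacing `tb_toy` with `tb_us` should give the workshop versions, provided the column names match.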


Resources