ggplot(tb_us, aes(x = year,
                  y = count,
                  fill = sex)) +
  geom_bar(stat = "identity") +
  facet_grid(~ age)
A Grammar of Graphics
SISBID 2025
https://github.com/dicook/SISBID
TWO MINUTE CHALLENGE
Write down as many types of data plots as you can think of.
Go to menti.com and enter code 2979 2396
You've got 2 minutes!
What is a data plot?
- data
- aesthetics: mapping of variables to graphical elements
- geom: type of plot structure to use
- transformations: log scale, …
- layers: multiple geoms, multiple data sets, annotation
- facets: show subsets in different plots
- themes: modifying style
Why?
With the grammar, a data plot becomes a statistic.
It is a functional mapping from variable to graphical element. Then we can do statistics on charts!
With a grammar, we don't have individual animals in the zoo, we have the genetic code that says how one plot is related to another plot.
Elements of the grammar
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
7 key elements:
- DATA
- GEOM_FUNCTION
- MAPPINGS
- STAT
- POSITION
- COORDINATE_FUNCTION
- FACET_FUNCTION
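As a minimal sketch of how the seven elements fit together, the template can be filled in for the TB bar chart. The `tb_us` data frame built here is a made-up stand-in with placeholder counts (not real WHO data), and the sex codes "m" and "f" are an assumption about how the real data are coded.

```r
library(ggplot2)

# Made-up stand-in for tb_us: counts are placeholders, NOT real WHO data
tb_us <- expand.grid(
  year = 2000:2005,
  age  = c("15-24", "25-34", "35-44"),
  sex  = c("m", "f"),
  stringsAsFactors = FALSE
)
tb_us$count <- 100 + 10 * seq_len(nrow(tb_us))

p <- ggplot(data = tb_us) +                          # DATA
  geom_bar(                                          # GEOM_FUNCTION
    mapping  = aes(x = year, y = count, fill = sex), # MAPPINGS
    stat     = "identity",                           # STAT: use counts as given
    position = "stack"                               # POSITION: default for bars
  ) +
  coord_cartesian() +                                # COORDINATE_FUNCTION
  facet_grid(~ age)                                  # FACET_FUNCTION
```

Printing `p` draws the plot. `position = "stack"` and `coord_cartesian()` are the defaults, spelled out only to show where each element sits.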
Example: Tuberculosis data
(Current) TB case notifications data from WHO.
Also available via the R package getTBinR.
TWO MINUTE CHALLENGE
Go to menti.com and enter code 2979 2396
- What do you learn about tuberculosis incidence in the USA from this plot?
- Give three changes to the plot that would improve it.
Results
- Incidence is declining, in all age groups, except possibly 15-24
- Much higher incidence for over 65s and 35-44 year olds
- We do not know the underlying population size
- There appears to be a structural change around 2008. Either a recording change or a policy change?
- Missing information for 1998 (no longer true)
- Cannot compare counts for male/female using stacked bars, maybe fill to 100% to focus on proportion
- The colour scheme is not colour-blind safe; switch to a better palette
- Axis labels, and tick marks?
Fix the plot
Manually selected fill colors; theme with white background for better contrast
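A hedged sketch of one way to apply these fixes. The two hex codes are taken from the Okabe-Ito colour-blind-friendly palette and are my choice, not necessarily the colours used on the slide. The axis labels are also illustrative, and `tb_us` is a made-up stand-in with placeholder counts.

```r
library(ggplot2)

# Made-up stand-in for tb_us: counts are placeholders, NOT real WHO data
tb_us <- expand.grid(
  year = 2000:2005,
  age  = c("15-24", "25-34", "35-44"),
  sex  = c("m", "f"),
  stringsAsFactors = FALSE
)
tb_us$count <- 100 + 10 * seq_len(nrow(tb_us))

p_fixed <- ggplot(tb_us, aes(x = year, y = count, fill = sex)) +
  geom_bar(stat = "identity") +
  facet_grid(~ age) +
  # Manually selected fills (assumed palette: Okabe-Ito orange and blue)
  scale_fill_manual(values = c(f = "#E69F00", m = "#0072B2")) +
  # Explicit axis and legend labels (illustrative wording)
  labs(x = "Year", y = "Count", fill = "Sex") +
  # White panel background for better contrast
  theme_bw()
```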
Compare males and females
TWO MINUTE CHALLENGE
- What do we learn about the data that is different from the previous plot?
- What is easier and what is harder or impossible to learn from this arrangement?
- Focus is now on proportions of male and female each year, within age group
- Proportions are similar across years
- Roughly equal proportions at young and old age groups, more male incidence in middle years
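The shift of focus from counts to proportions can be sketched by changing only the position element to `"fill"`, which rescales each stacked bar to a total of 1. This assumes the plot was made that way. `tb_us` here is a made-up stand-in with placeholder counts.

```r
library(ggplot2)

# Made-up stand-in for tb_us: counts are placeholders, NOT real WHO data
tb_us <- expand.grid(
  year = 2000:2005,
  age  = c("15-24", "25-34", "35-44"),
  sex  = c("m", "f"),
  stringsAsFactors = FALSE
)
tb_us$count <- 100 + 10 * seq_len(nrow(tb_us))

p_fill <- ggplot(tb_us, aes(x = year, y = count, fill = sex)) +
  # position = "fill" stacks the bars and normalises each to height 1
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(~ age) +
  ylab("proportion")
```

Because every bar now has the same height, the absolute counts can no longer be read from the plot.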
Separate plots
- Counts are generally higher for males than females
- There are very few female cases in the middle years
- Perhaps something of an older male outbreak in 2007-8, and possibly a young female outbreak in the same years
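One way to separate the sexes, sketched under the assumption that sex is moved into the rows of the facet grid: each sex gets its own row of panels, so bars start from a common baseline. `tb_us` is a made-up stand-in with placeholder counts.

```r
library(ggplot2)

# Made-up stand-in for tb_us: counts are placeholders, NOT real WHO data
tb_us <- expand.grid(
  year = 2000:2005,
  age  = c("15-24", "25-34", "35-44"),
  sex  = c("m", "f"),
  stringsAsFactors = FALSE
)
tb_us$count <- 100 + 10 * seq_len(nrow(tb_us))

p_sep <- ggplot(tb_us, aes(x = year, y = count, fill = sex)) +
  geom_bar(stat = "identity") +
  # Rows are sexes, columns are age groups
  facet_grid(sex ~ age)
```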
Make a pie
This isn't a pie, it's a rose plot!
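A sketch of why a rose plot appears: adding `coord_polar()` to the bar chart maps x (year) to the angle by default, so each bar becomes a wedge whose radius encodes the count. `tb_us` is a made-up stand-in with placeholder counts.

```r
library(ggplot2)

# Made-up stand-in for tb_us: counts are placeholders, NOT real WHO data
tb_us <- expand.grid(
  year = 2000:2005,
  age  = c("15-24", "25-34", "35-44"),
  sex  = c("m", "f"),
  stringsAsFactors = FALSE
)
tb_us$count <- 100 + 10 * seq_len(nrow(tb_us))

p_rose <- ggplot(tb_us, aes(x = year, y = count, fill = sex)) +
  geom_bar(stat = "identity") +
  facet_grid(~ age) +
  # Default theta = "x": year goes around the circle, count is the radius
  coord_polar()
```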
Stacked bar
Pie chart
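A hedged sketch of the step from stacked bar to pie chart: a pie is a single stacked bar per panel drawn in polar coordinates with the angle mapped to y. Using year as the segments and faceting by sex and age is an assumption about the slide's plot. `tb_us` is a made-up stand-in with placeholder counts.

```r
library(ggplot2)

# Made-up stand-in for tb_us: counts are placeholders, NOT real WHO data
tb_us <- expand.grid(
  year = 2000:2005,
  age  = c("15-24", "25-34", "35-44"),
  sex  = c("m", "f"),
  stringsAsFactors = FALSE
)
tb_us$count <- 100 + 10 * seq_len(nrow(tb_us))

# One stacked bar per panel: constant x, segments coloured by year
p_bar <- ggplot(tb_us, aes(x = 1, y = count, fill = factor(year))) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(sex ~ age)

# Same plot in polar coordinates with the angle mapped to y: a pie chart
p_pie <- p_bar + coord_polar(theta = "y")
```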
TWO MINUTE CHALLENGE
- What are the pros, and cons, of using the pie chart for this data?
- Would it be better if the pies used age for the segments, and facetted by year (and sex)?
Go to menti.com and enter code 2979 2396
Line plot vs barchart
- We can read counts for both sexes
- Males and females can be directly compared
- Temporal trend is visible
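A minimal sketch of a line plot with these properties, assuming sex is mapped to colour so both series share one panel per age group. `tb_us` is a made-up stand-in with placeholder counts.

```r
library(ggplot2)

# Made-up stand-in for tb_us: counts are placeholders, NOT real WHO data
tb_us <- expand.grid(
  year = 2000:2005,
  age  = c("15-24", "25-34", "35-44"),
  sex  = c("m", "f"),
  stringsAsFactors = FALSE
)
tb_us$count <- 100 + 10 * seq_len(nrow(tb_us))

p_line <- ggplot(tb_us, aes(x = year, y = count, colour = sex)) +
  geom_line() +    # one line per sex, grouped automatically by colour
  geom_point() +   # mark the yearly observations
  facet_grid(~ age)
```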
Line plot vs barchart
- Attention is forced to proportion of males
- Direct comparison of counts within year and age
- Equal proportion guideline provides a baseline for comparison
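One possible way to draw such a plot, sketched with base R: compute the proportion of male cases per year and age group, then add a horizontal guideline at 0.5. The helper names `tb_prop` and `prop_male` are hypothetical, and `tb_us` is a made-up stand-in with placeholder counts.

```r
library(ggplot2)

# Made-up stand-in for tb_us: counts are placeholders, NOT real WHO data
tb_us <- expand.grid(
  year = 2000:2005,
  age  = c("15-24", "25-34", "35-44"),
  sex  = c("m", "f"),
  stringsAsFactors = FALSE
)
tb_us$count <- 100 + 10 * seq_len(nrow(tb_us))

# Put male and female counts side by side for each year and age group
tb_prop <- merge(
  tb_us[tb_us$sex == "m", ],
  tb_us[tb_us$sex == "f", ],
  by = c("year", "age"),
  suffixes = c("_m", "_f")
)
tb_prop$prop_male <- tb_prop$count_m / (tb_prop$count_m + tb_prop$count_f)

p_prop <- ggplot(tb_prop, aes(x = year, y = prop_male)) +
  geom_hline(yintercept = 0.5, colour = "grey60") +  # equal-proportion guideline
  geom_line() +
  geom_point() +
  facet_grid(~ age) +
  ylim(0, 1)
```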
Your turn
Make sure you can make all the TB plots just shown. If you have extra time, try to:
- Facet by sex, and make line plots of counts against age.
- Show the points only, and overlay a linear model fit.
Resources
- posit cheatsheets
- ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham
- ggplot2 web site
- R Graphics Cookbook, Winston Chang
- Data Visualization, Kieran Healy
- Data Visualization with R, Rob Kabacoff
- Fundamentals of Data Visualization, Claus O. Wilke
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.