class: center, middle, inverse, title-slide .title[ # A Grammar of Graphics ] .subtitle[ ## SISBID 2024
https://github.com/dicook/SISBID
] .author[ ### Di Cook (
dicook@monash.edu
)
Heike Hofmann (
hhofmann4@unl.edu
)
Susan Vanderplas (
susan.vanderplas@unl.edu
) ] .date[ ### 08/14-16/2024 ] --- class: inverse middle 🔮 👽 👼 **TWO MINUTE CHALLENGE** <br> .large[Write down as many types of data plots/charts that you can think of.] Go to menti.com and enter code 4651 9428 <img src="images/mentimeter_qr_code.png" width=200> <br> 🕐
You've got 2 minutes!
--- <div style='position: relative; padding-bottom: 56.25%; padding-top: 35px; height: 0; overflow: hidden;'><iframe sandbox='allow-scripts allow-same-origin allow-presentation' allowfullscreen='true' allowtransparency='true' frameborder='0' height='315' src='https://www.mentimeter.com/app/presentation/c7464477c3f1274f23886cf21c41ec89/ad3e75b80c75/embed' style='position: absolute; top: 0; left: 0; width: 100%; height: 100%;' width='420'></iframe></div> --- # What is a data plot? - data - **aesthetics: mapping of variables to graphical elements** - geom: type of plot structure to use - transformations: log scale, ... - layers: multiple geoms, multiple data sets, annotation - facets: show subsets in different plots - themes: modifying style # Why? With the grammar, a data plot becomes a statistic. It is a functional mapping from variable to graphical element. Then we can do statistics on data plots. With a grammar, we don't have individual animals in the zoo, we have the genetic code that says how one plot is related to another plot. --- # Elements of the grammar .pull-left[ <img src="images/ggplot2.png" width="20%" /> ``` ggplot(data = <DATA>) + <GEOM_FUNCTION>( mapping = aes(<MAPPINGS>), stat = <STAT>, position = <POSITION> ) + <COORDINATE_FUNCTION> + <FACET_FUNCTION> ``` ] .pull-right[ The 7 key elements of the grammar are: - **DATA** - **GEOM_FUNCTION** - **MAPPINGS** - STAT - POSITION - COORDINATE_FUNCTION - FACET_FUNCTION ] --- # Example: Tuberculosis data (Current) tuberculosis data from [WHO](http://www.who.int/tb/country/data/download/en/), the case notifications table. Also can download with R package, [getTBinR](https://github.com/seabbs/getTBinR). ``` r ggplot(tb_us, aes(x = year, y = count, fill = sex)) + geom_bar(stat = "identity") + facet_grid(~ age) ``` <img src="index_files/figure-html/make a bar chart of US TB incidence-1.png" width="60%" style="display: block; margin: auto;" /> 🔮 👽 👼 **TWO MINUTE CHALLENGE** Go to menti.com and enter code 4651 9428 - What do you learn about tuberculosis incidence in the USA from this plot? - Give three changes to the plot that would improve it. --- <div style='position: relative; padding-bottom: 56.25%; padding-top: 35px; height: 0; overflow: hidden;'><iframe sandbox='allow-scripts allow-same-origin allow-presentation' allowfullscreen='true' allowtransparency='true' frameborder='0' height='315' src='https://www.mentimeter.com/app/presentation/c7464477c3f1274f23886cf21c41ec89/ad3e75b80c75/embed' style='position: absolute; top: 0; left: 0; width: 100%; height: 100%;' width='420'></iframe></div> ??? - Incidence is declining, in all age groups, except possibly 15-24 - Much higher incidence for over 65s and 35-44 year olds - We do not know the underlying population size - There appears to be a structural change around 2008. Either a recording change or a policy change? - Missing information for 1998 # no longer true - Cannot compare counts for male/female using stacked bars, maybe fill to 100% to focus on proportion - Colour scheme is not color blind proof, switch to better palette - Axis labels, and tick marks? --- # Fix the plot ``` r # This uses a color blind friendly scale ggplot(tb_us, aes(x=year, y=count, fill=sex)) + geom_bar(stat="identity") + * facet_grid(~age_group) + * scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) ``` <img src="index_files/figure-html/colour and axes fixes-1.png" width="80%" style="display: block; margin: auto;" /> .bottom[Colorblindness-friendly color schemes, https://davidmathlogic.com/colorblind/#%23005AB5-%23DC3220] --- # Compare males and females ``` r # Fill the bars, note the small change to the code ggplot(tb_us, aes(x=year, y=count, fill=sex)) + * geom_bar(stat="identity", position="fill") + facet_grid(~age_group) + scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) + * ylab("proportion") ``` <img src="index_files/figure-html/compare proportions of males/females-1.png" width="70%" style="display: block; margin: auto;" /> 🔮 👽 👼 **TWO MINUTE CHALLENGE** - What do we learn about the data that is different from the previous plot? - What is easier, what is harder or impossible to learn from this arrangement? ??? - Focus is now on proportions of male and female each year, within age group - Proportions are similar across year - Roughly equal proportions at young and old age groups, more male incidence in middle years --- # Separate plots .pull-left[ ``` r # Make separate plots for males and females, focus on counts by category ggplot(tb_us, aes(x=year, y=count, fill=sex)) + geom_bar(stat="identity") + * facet_grid(sex~age_group) + scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) ``` <img src="index_files/figure-html/compare counts of males/females-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ 🔮 👽 👼 **TWO MINUTE CHALLENGE** - What do we learn about the data that is different from the previous plot? - What is easier, what is harder or impossible to learn from this arrangement? ] --- # Make a pie ``` r # How to make a pie instead of a barchart - not straight forward ggplot(tb_us, aes(x=year, y=count, fill=sex)) + geom_bar(stat="identity") + facet_grid(sex~age_group) + scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) + * coord_polar() ``` <img src="index_files/figure-html/rose plot of males/females-1.png" width="60%" style="display: block; margin: auto;" /> Not a pie, a [rose plot](https://datavizcatalogue.com/methods/nightingale_rose_chart.html)! --- # Stacked bar ``` r # Step 1 to make the pie ggplot(tb_us, aes(x = 1, y = count, fill = factor(year))) + * geom_bar(stat="identity", position="fill") + facet_grid(sex~age_group) + scale_fill_viridis_d("", option="inferno") ``` <img src="index_files/figure-html/stacked barchart of males/females-1.png" width="60%" style="display: block; margin: auto;" /> --- # Pie chart ``` r # Now we have a pie, note the mapping of variables # and the modification to the coord_polar ggplot(tb_us, aes(x = 1, y = count, fill = factor(year))) + geom_bar(stat="identity", position="fill") + facet_grid(sex~age_group) + scale_fill_viridis_d("", option="inferno") + * coord_polar(theta = "y") ``` <img src="index_files/figure-html/pie chart of males/females-1.png" width="60%" style="display: block; margin: auto;" /> --- 🔮 👽 👼 **TWO MINUTE CHALLENGE** - What are the pros, and cons, of using the pie chart for this data? - Would it be better if the pies used age for the segments, and facetted by year (and sex)? Go to menti.com and enter code 4651 9428 --- # Line plot vs barchart <img src="index_files/figure-html/use a line plot instead of bar-1.png" width="80%" style="display: block; margin: auto;" /> A line plot allows reading the counts for the two sexes. The counts are displayed in the same plot, and males and females can be directly compared as well as temporal trend. --- # Line plot vs barchart <img src="index_files/figure-html/use a line plot of proportions-1.png" width="80%" style="display: block; margin: auto;" /> Computing proportion and displaying only this, forces the attention to be on the one quantity. Proportion here is computed separately for year and age, which means we are directly comparing the counts within these subsets. Adding the guideline to indicate equal proportions is an important baseline. --- class: inverse middle # Your turn Make sure you can make all the TB plots just shown. If you have extra time, try to: - Facet by gender, and make line plots of counts of age. - Show the points only, and overlay a linear model fit.
−
+
07
:
00
--- # Resources - [posit cheatsheets](https://posit.co/resources/cheatsheets/) - [ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham](https://ggplot2-book.org), [web site](https://ggplot2.tidyverse.org) - [R Graphics Cookbook, Winston Chang](http://www.cookbook-r.com/Graphs/) - [Data Visualization, Kieran Healy](https://socviz.co) - [Data Visualization with R, Rob Kabacoff](https://rkabacoff.github.io/datavis/index.html) - [Fundamentals of Data Visualization, Claus O. Wilke](https://serialmentor.com/dataviz/) --- # Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.