class: center, middle, inverse, title-slide .title[ # Making a mess again - with the data ] .subtitle[ ## SISBID 2024
https://github.com/dicook/SISBID
] .author[ ### Di Cook (dicook@monash.edu), Heike Hofmann (hhofmann4@unl.edu), Susan Vanderplas (susan.vanderplas@unl.edu
) ] .date[ ### 08/14-16/2024 ]

---
class: inverse middle

# Your turn

Warmup: Turn the `french_fries` data from wide format into a long format with variables `type` and `rating`.

```
# A tibble: 6 × 9
  time  treatment subject   rep potato buttery grassy rancid painty
  <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 1     1         3           1    2.9     0      0      0      5.5
2 1     1         3           2   14       0      0      1.1    0
3 1     1         10          1   11       6.4    0      0      0
4 1     1         10          2    9.9     5.9    2.9    2.2    0
5 1     1         15          1    1.2     0.1    0      1.1    5.1
6 1     1         15          2    8.8     3      3.6    1.5    2.3
```
---
class: middle, center, inverse

<h1>What would we like to find out about the french fries data set?</h1>

Go to menti.com, enter the code 4651 9428, and let us know!

<img src="images/mentimeter_qr_code.png" width = 200>

---

# What would we like to know?

- Is the design complete?
- Are replicates like each other?
- How do the ratings on the different scales differ?
- Are raters giving different scores on average?
- Do ratings change over the weeks?

Each of these questions involves different summaries of the data.

---

# Pivot french fries to long

``` r
ff_long <- french_fries %>%
  pivot_longer(potato:painty, names_to = "type", values_to = "rating")

head(ff_long)
# A tibble: 6 × 6
  time  treatment subject   rep type    rating
  <fct> <fct>     <fct>   <dbl> <chr>    <dbl>
1 1     1         3           1 potato     2.9
2 1     1         3           1 buttery    0
3 1     1         3           1 grassy     0
4 1     1         3           1 rancid     0
5 1     1         3           1 painty     5.5
6 1     1         3           2 potato    14
```

---

# Pivot long to wide

In certain applications, we may wish to take a long dataset and pivot it to a wide dataset (perhaps for display in a table).

This was previously called "spreading" the data.

Examples:

- Are replicates like each other?
  - we want to compare rep 1 values to rep 2 values
- How do the ratings on the different scales differ?
  - we want to compare ratings across different scales
- Are raters giving different scores on average?
  - we want to compare ratings across different raters
- Do ratings change over the weeks?
  - we want to compare ratings across different weeks

---

# Pivot to wide form

We use the **pivot_wider** function from `tidyr` to introduce variables with comparable values:

``` r
head(ff_long, 3)
# A tibble: 3 × 6
  time  treatment subject   rep type    rating
  <fct> <fct>     <fct>   <dbl> <chr>    <dbl>
1 1     1         3           1 potato     2.9
2 1     1         3           1 buttery    0
3 1     1         3           1 grassy     0

french_fries_weeks <- ff_long %>%
  pivot_wider(names_from = "time", values_from = "rating")

head(french_fries_weeks)
# A tibble: 6 × 14
  treatment subject   rep type      `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
  <fct>     <fct>   <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1         3           1 potato    2.9   9    11.8  13.6  14     0.4   2.9   3.5   1.1    NA
2 1         3           1 buttery   0     0.3   0.2   0.1   0.3   1.2   0     0.5   0.4    NA
3 1         3           1 grassy    0     0.1   0     0     0     0     0     1.3   0      NA
4 1         3           1 rancid    0     5.8   6     1.7   0     0     0     0     0      NA
5 1         3           1 painty    5.5   0.3   0     0     1.7   9.5   5.5   3.8   7      NA
6 1         3           2 potato   14     5.5   7.8   5.3  12.9   3.3   0.8   0.6   2.5    NA
```

---

# Pivot to wide form

``` r
head(french_fries_weeks)
# A tibble: 6 × 14
  treatment subject   rep type      `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
  <fct>     <fct>   <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1         3           1 potato    2.9   9    11.8  13.6  14     0.4   2.9   3.5   1.1    NA
2 1         3           1 buttery   0     0.3   0.2   0.1   0.3   1.2   0     0.5   0.4    NA
3 1         3           1 grassy    0     0.1   0     0     0     0     0     1.3   0      NA
4 1         3           1 rancid    0     5.8   6     1.7   0     0     0     0     0      NA
5 1         3           1 painty    5.5   0.3   0     0     1.7   9.5   5.5   3.8   7      NA
6 1         3           2 potato   14     5.5   7.8   5.3  12.9   3.3   0.8   0.6   2.5    NA
```

`pivot_wider` introduces one new variable for each level of the variable in `names_from` and fills in its values from the variable in `values_from`.

posit cheatsheet: https://raw.githubusercontent.com/rstudio/cheatsheets/main/tidyr.pdf

---

# Comparing ratings from different weeks

Plot Week 1 against Week 9 in a scatterplot:

``` r
french_fries_weeks %>%
  ggplot(aes(x = `1`, y = `9`)) +
  geom_point()
```

<img src="index_files/figure-html/week 1 vs week 9-1.png" style="display: block; margin: auto;" />

Note the use of backticks for variable names that contain special characters or start with a number.

---
class: inverse middle

# Your turn: Do the replicates look like each other?

Tackle this by plotting the replicates against each other using a scatterplot.

You will need to first convert the data into long form, and then spread the replicates into separate columns, one per replicate.
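
One possible way to approach the two steps above, sketched on a tiny stand-in data set rather than the full `french_fries` data. The data frame `toy_fries` and the column prefix `rep_` are assumptions made for this illustration; the values are the first four rows shown in the warmup, reduced to two scales.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for french_fries: first four warmup rows, two scales only
toy_fries <- data.frame(
  time    = factor(c(1, 1, 1, 1)),
  subject = factor(c(3, 3, 10, 10)),
  rep     = c(1, 2, 1, 2),
  potato  = c(2.9, 14.0, 11.0, 9.9),
  buttery = c(0.0, 0.0, 6.4, 5.9)
)

# Step 1: wide to long, one row per time/subject/rep/scale
toy_long <- toy_fries %>%
  pivot_longer(potato:buttery, names_to = "type", values_to = "rating")

# Step 2: long to wide by replicate, giving columns rep_1 and rep_2
toy_reps <- toy_long %>%
  pivot_wider(names_from = rep, values_from = rating, names_prefix = "rep_")

# Scatterplot of replicate 1 against replicate 2;
# points near the identity line indicate replicates that agree
p <- ggplot(toy_reps, aes(x = rep_1, y = rep_2)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  facet_wrap(~type)
```

Using `names_prefix` avoids column names that start with a digit, so no backticks are needed in `aes()`.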
---

# Are ratings similar on different scales?

- Are ratings similar on the different scales: potato'y, buttery, grassy, rancid and painty?

- We need to pivot the data into long form, and make plots faceted by the scale.

--

``` r
ff.m <- french_fries %>%
  pivot_longer(-(time:rep), names_to="type", values_to="rating")

head(ff.m)
# A tibble: 6 × 6
  time  treatment subject   rep type    rating
  <fct> <fct>     <fct>   <dbl> <chr>    <dbl>
1 1     1         3           1 potato     2.9
2 1     1         3           1 buttery    0
3 1     1         3           1 grassy     0
4 1     1         3           1 rancid     0
5 1     1         3           1 painty     5.5
6 1     1         3           2 potato    14
```

---

``` r
ff.m <- french_fries %>%
  pivot_longer(-(time:rep), names_to="type", values_to="rating")

head(ff.m)
# A tibble: 6 × 6
  time  treatment subject   rep type    rating
  <fct> <fct>     <fct>   <dbl> <chr>    <dbl>
1 1     1         3           1 potato     2.9
2 1     1         3           1 buttery    0
3 1     1         3           1 grassy     0
4 1     1         3           1 rancid     0
5 1     1         3           1 painty     5.5
6 1     1         3           2 potato    14
```

``` r
ggplot(data=ff.m, aes(x=rating)) +
  geom_histogram(binwidth=2) +
  facet_wrap(~type, ncol=5)
```

<img src="index_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Side-by-side boxplots

``` r
ggplot(data=ff.m, aes(x=type, y=rating, fill=type)) +
  geom_boxplot()
```

<img src="index_files/figure-html/side-by-Side boxplots-1.png" style="display: block; margin: auto;" />

---
class: inverse middle

# Your turn: What is the correlation between scales?

Tackle this problem by creating a wide form of the data by type of scale.

`cor` allows you to create a correlation matrix. Look into the help `?cor` to get rid of `NA` values in the result.

Draw a scatterplot of the two scales with the highest (positive or negative) correlation value.
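
A minimal sketch of how `NA` values affect `cor`, using a small stand-in data set rather than the full data. The data frame `toy_long` is an assumption made for this illustration: it holds the six warmup rows for three scales, with one `rancid` rating deliberately set to `NA`. The option `use = "pairwise.complete.obs"` is one of several choices documented in `?cor`; whether it is the right one depends on the analysis.

```r
library(tidyr)

# Hypothetical long-form stand-in: six warmup rows, three scales
toy_long <- data.frame(
  subject = rep(c(3, 3, 10, 10, 15, 15), each = 3),
  rep     = rep(c(1, 2, 1, 2, 1, 2), each = 3),
  type    = rep(c("potato", "buttery", "rancid"), times = 6),
  rating  = c( 2.9, 0.0, 0.0,
              14.0, 0.0, 1.1,
              11.0, 6.4, NA,   # deliberately blanked to show the effect of NA
               9.9, 5.9, 2.2,
               1.2, 0.1, 1.1,
               8.8, 3.0, 1.5)
)

# Wide form by scale: one column per type
toy_wide <- toy_long %>%
  pivot_wider(names_from = type, values_from = rating)

scales <- c("potato", "buttery", "rancid")

# Default behaviour: any pair involving a column with NA yields NA
cor_default <- cor(toy_wide[, scales])

# Pairwise deletion: each pair uses the rows where both values are present
cor_pairwise <- cor(toy_wide[, scales], use = "pairwise.complete.obs")
```

Pairs without missing values, such as `potato` and `buttery` here, get the same correlation under both settings.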
---

# Ratings by week

How do ratings change over the weeks? Again, we use the long form of the data and plot:

``` r
ff.m$time <- as.numeric(ff.m$time)

ggplot(data=ff.m, aes(x=time, y=rating, colour=type)) +
  geom_point(size=.75) +
  geom_smooth() +
  facet_wrap(~type)
```

<img src="index_files/figure-html/ratings by week-1.png" style="display: block; margin: auto;" />

---
class: inverse middle

# Your turn: Modelling ratings over time & different scales?

Find a linear model describing the average rating depending on the week (time) and type of scale as shown below. Which form of the dataset should we use?

Challenge: can you plot the fitted lines from the model?

<img src="index_files/figure-html/ratings by week again-1.png" style="display: block; margin: auto;" />
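
One possible model, sketched on a tiny long-form stand-in rather than the full data. The data frame `toy_long` is an assumption made for this illustration: it holds the week 1 to 4 ratings for `potato` and `buttery` from the first two rows of `french_fries_weeks` shown earlier. The interaction `time * type` gives each scale its own intercept and slope; other model choices are equally plausible.

```r
# Hypothetical long-form stand-in with time already numeric
toy_long <- data.frame(
  time   = rep(1:4, times = 2),
  type   = rep(c("potato", "buttery"), each = 4),
  rating = c(2.9, 9.0, 11.8, 13.6,
             0.0, 0.3,  0.2,  0.1)
)

# Linear model with a separate line (intercept and slope) per scale
fit <- lm(rating ~ time * type, data = toy_long)

# Store the fitted values next to the observations for plotting
toy_long$fitted <- predict(fit)
```

With `ggplot2` loaded, the fitted lines could then be drawn by adding `geom_line(aes(y = fitted))` to a plot of `rating` against `time`, faceted or coloured by `type`.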
---

# Resources

- [posit cheatsheets](https://posit.co/resources/cheatsheets/)
- [Wickham (2007) Reshaping data](https://www.jstatsoft.org/article/view/v021i12)
- [R for Data Science (Wickham & Grolemund), chapter 9](https://r4ds.had.co.nz/wrangle-intro.html)
- [Telling Stories with Data (Alexander), chapters 9 & 10](https://tellingstorieswithdata.com/09-clean_and_prepare.html)

---

# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.