Making a mess again - with the data

class: center, middle, inverse, title-slide

.title[
# Making a mess again - with the data
]
.subtitle[
## SISBID 2024 <br> <a href="https://github.com/dicook/SISBID" class="uri">https://github.com/dicook/SISBID</a>
]
.author[
### Di Cook (<a href="mailto:dicook@monash.edu" class="email">dicook@monash.edu</a>) <br> Heike Hofmann (<a href="mailto:hhofmann4@unl.edu" class="email">hhofmann4@unl.edu</a>) <br> Susan Vanderplas (<a href="mailto:susan.vanderplas@unl.edu" class="email">susan.vanderplas@unl.edu</a>)
]
.date[
### 08/14-16/2024
]

---

class: inverse middle 
# Your turn

Warmup:

Turn the `french_fries` data from wide format into a long format with variables `type` and `rating`.

```
# A tibble: 6 × 9
  time  treatment subject   rep potato buttery grassy rancid painty
  <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 1     1         3           1    2.9     0      0      0      5.5
2 1     1         3           2   14       0      0      1.1    0  
3 1     1         10          1   11       6.4    0      0      0  
4 1     1         10          2    9.9     5.9    2.9    2.2    0  
5 1     1         15          1    1.2     0.1    0      1.1    5.1
6 1     1         15          2    8.8     3      3.6    1.5    2.3
```

---
class: middle, center, inverse

<h1>What would we like to find out about the french fries data set?</h1>

Go to menti.com, enter the code 4651 9428, and let us know!

---
# What would we like to know?
  
- Is the design complete?
- Are replicates like each other?
- How do the ratings on the different scales differ?
- Are raters giving different scores on average?
- Do ratings change over the weeks?
  
Each of these questions involves different summaries of the data.
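As a minimal sketch of one such summary, the first question ("Is the design complete?") can be approached by counting records per subject and week. The `toy` data below is a hypothetical stand-in for `french_fries`, with made-up values and one subject deliberately missing a week; it is not the real data.

``` r
library(dplyr)
library(tidyr)

# Hypothetical toy data: subject "s2" has no records in week 2
toy <- tibble(
  time    = c(1, 1, 2, 1, 1),
  subject = c("s1", "s1", "s1", "s2", "s2"),
  rep     = c(1, 2, 1, 1, 2),
  potato  = c(2.9, 14, 9, 11, 9.9)
)

# Count records per subject and week, then make the missing
# subject/week combinations explicit with a count of zero
design <- toy %>%
  count(subject, time) %>%
  complete(subject, time, fill = list(n = 0L))

design
```

In a complete design every subject/week cell would have the same count; cells with a smaller count (or zero) point to missing replicates or missing sessions.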

---
# Pivot french fries to long

``` r
ff_long <- french_fries %>% 
  pivot_longer(potato:painty, names_to = "type", values_to = "rating")

head(ff_long)
# A tibble: 6 × 6
  time  treatment subject   rep type    rating
  <fct> <fct>     <fct>   <dbl> <chr>    <dbl>
1 1     1         3           1 potato     2.9
2 1     1         3           1 buttery    0  
3 1     1         3           1 grassy     0  
4 1     1         3           1 rancid     0  
5 1     1         3           1 painty     5.5
6 1     1         3           2 potato    14  
```

---
# Pivot long to wide
  
In certain applications, we may wish to take a long dataset and pivot it to a wide dataset (perhaps displaying in a table).

This used to be called "spreading" the data, after the older `spread()` function in `tidyr`.

Examples:

- Are replicates like each other? - we want to compare rep 1 values to rep 2 values
- How do the ratings on the different scales differ? - we want to compare ratings across different scales
- Are raters giving different scores on average? - we want to compare ratings across different raters
- Do ratings change over the weeks? - we want to compare ratings across different weeks

---
# Pivot to wide form
  
We use the **pivot_wider** function from `tidyr` to introduce variables with comparable values:

``` r
head(ff_long, 3)
# A tibble: 3 × 6
  time  treatment subject   rep type    rating
  <fct> <fct>     <fct>   <dbl> <chr>    <dbl>
1 1     1         3           1 potato     2.9
2 1     1         3           1 buttery    0  
3 1     1         3           1 grassy     0

french_fries_weeks <- ff_long %>% 
  pivot_wider(names_from = "time", values_from = "rating")

head(french_fries_weeks)
# A tibble: 6 × 14
  treatment subject   rep type      `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
  <fct>     <fct>   <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1         3           1 potato    2.9   9    11.8  13.6  14     0.4   2.9   3.5   1.1    NA
2 1         3           1 buttery   0     0.3   0.2   0.1   0.3   1.2   0     0.5   0.4    NA
3 1         3           1 grassy    0     0.1   0     0     0     0     0     1.3   0      NA
4 1         3           1 rancid    0     5.8   6     1.7   0     0     0     0     0      NA
5 1         3           1 painty    5.5   0.3   0     0     1.7   9.5   5.5   3.8   7      NA
6 1         3           2 potato   14     5.5   7.8   5.3  12.9   3.3   0.8   0.6   2.5    NA
```

---
# Pivot to wide form

``` r
head(french_fries_weeks)
# A tibble: 6 × 14
  treatment subject   rep type      `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
  <fct>     <fct>   <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1         3           1 potato    2.9   9    11.8  13.6  14     0.4   2.9   3.5   1.1    NA
2 1         3           1 buttery   0     0.3   0.2   0.1   0.3   1.2   0     0.5   0.4    NA
3 1         3           1 grassy    0     0.1   0     0     0     0     0     1.3   0      NA
4 1         3           1 rancid    0     5.8   6     1.7   0     0     0     0     0      NA
5 1         3           1 painty    5.5   0.3   0     0     1.7   9.5   5.5   3.8   7      NA
6 1         3           2 potato   14     5.5   7.8   5.3  12.9   3.3   0.8   0.6   2.5    NA
```

`pivot_wider` introduces one new variable for each level of the variable in `names_from`, and fills in values from the variable in `values_from`.

posit cheatsheet: https://raw.githubusercontent.com/rstudio/cheatsheets/main/tidyr.pdf
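
A minimal sketch of this behaviour on a hypothetical three-row tibble (the values are made up for illustration): each distinct value of `time` becomes a column, and any subject/time combination that is absent from the long data is filled with `NA`, which is why the `10` column above contains `NA`.

``` r
library(dplyr)
library(tidyr)

# Hypothetical long data: subject "s2" has no rating for time 2
long <- tribble(
  ~subject, ~time, ~rating,
  "s1",     1,     2.9,
  "s1",     2,     9.0,
  "s2",     1,     11.0
)

# One new column per distinct value of `time`, filled from `rating`
wide <- long %>%
  pivot_wider(names_from = time, values_from = rating)

wide
```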

---
# Comparing ratings from different weeks

Plot Week 1 against Week 9 in a scatterplot:

``` r
french_fries_weeks %>%
  ggplot(aes(x = `1`, y = `9`)) + geom_point()
```

Note the backticks: they are needed for column names that are not syntactic R names, such as names that start with a digit or contain special characters.
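
One possible way to avoid backticks altogether is the `names_prefix` argument of `pivot_wider`, which prepends a string to every new column name. The sketch below uses a hypothetical two-row tibble and the made-up prefix `"week_"`.

``` r
library(dplyr)
library(tidyr)

# Hypothetical long data with ratings in weeks 1 and 9
long <- tibble(
  subject = c("s1", "s1"),
  time    = c(1, 9),
  rating  = c(2.9, 1.1)
)

# Prefix the new column names so they are syntactic R names
wide <- long %>%
  pivot_wider(names_from = time, values_from = rating,
              names_prefix = "week_")

names(wide)
```

The resulting columns can then be used as `aes(x = week_1, y = week_9)` without quoting.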

---
class: inverse middle 
# Your turn: Do the replicates look like each other?

Tackle this by plotting the replicates against each other using a scatterplot.

You will first need to convert the data into long form, and then pivot the replicates into separate columns, one column per replicate.
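
As a hint, here is a sketch of the second step on hypothetical toy data that is already in long form (the values and the `rep_` prefix are assumptions for illustration, not the exercise solution on the real data):

``` r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical long data: three subjects, two replicates each
toy_long <- tibble(
  subject = rep(c("s1", "s2", "s3"), each = 2),
  rep     = rep(c(1, 2), times = 3),
  rating  = c(2.9, 14, 11, 9.9, 1.2, 8.8)
)

# One column per replicate: rep_1 and rep_2
toy_reps <- toy_long %>%
  pivot_wider(names_from = rep, values_from = rating,
              names_prefix = "rep_")

# Points near the identity line indicate replicates that agree
p <- ggplot(toy_reps, aes(x = rep_1, y = rep_2)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0)
p
```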

---
# Are ratings similar on different scales?
  
- Are ratings similar on the different scales: potato-y, buttery, grassy, rancid and painty?
- We pivot the data into long form and make plots faceted by scale.

``` r
ff.m <- french_fries %>% 
  pivot_longer(-(time:rep), names_to = "type", values_to = "rating")
head(ff.m)
# A tibble: 6 × 6
  time  treatment subject   rep type    rating
  <fct> <fct>     <fct>   <dbl> <chr>    <dbl>
1 1     1         3           1 potato     2.9
2 1     1         3           1 buttery    0  
3 1     1         3           1 grassy     0  
4 1     1         3           1 rancid     0  
5 1     1         3           1 painty     5.5
6 1     1         3           2 potato    14  
```

---

``` r
ggplot(data = ff.m, aes(x = rating)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~type, ncol = 5)
```

---
# Side-by-side boxplots

``` r
ggplot(data = ff.m, aes(x = type, y = rating, fill = type)) +
  geom_boxplot()
```

---
class: inverse middle 
# Your turn: What is the correlation between scales?

Tackle this problem by creating a wide form of the data by type of scale.

`cor` allows you to create a correlation matrix. Look into the help `?cor` to get rid of `NA` values in the result.

Draw a scatterplot of two scales with the highest (positive or negative) correlation value.
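
As a hint for the `NA` issue, the sketch below shows the `use` argument of `cor` on hypothetical toy data with two scales and one missing rating (the `id` column and all values are made up for illustration):

``` r
library(dplyr)
library(tidyr)

# Hypothetical long data: four tastings, two scales, one missing rating
toy_long <- tibble(
  id     = rep(1:4, each = 2),
  type   = rep(c("potato", "rancid"), times = 4),
  rating = c(10, 1, 8, 4, 6, NA, 2, 9)
)

# Wide form: one column per scale
toy_wide <- toy_long %>%
  pivot_wider(names_from = type, values_from = rating)

# Default behaviour: any NA in a column propagates to the result
cor_default <- cor(select(toy_wide, potato, rancid))

# Use only the pairs of observations where both values are present
cor_mat <- cor(select(toy_wide, potato, rancid),
               use = "pairwise.complete.obs")
cor_mat
```

`use = "complete.obs"` is an alternative that drops every row containing any `NA` before computing all correlations.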

---
# Ratings by week

How do ratings change over the weeks?
Again, we use the long form of the data and plot:

``` r
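# time is a factor: convert via character so we get the week labels,
# not the internal level codes
ff.m$time <- as.numeric(as.character(ff.m$time))
ggplot(data = ff.m, aes(x = time, y = rating, colour = type)) +
  geom_point(size = .75) +
  geom_smooth() +
  facet_wrap(~type)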
```

---
class: inverse middle 
# Your turn: Modelling ratings over time & different scales?

Find a linear model describing the average rating depending on the week (time) and type of scale.

Which form of the dataset should we use?

Challenge: can you plot the fitted lines from the model?
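
One possible starting point, sketched on hypothetical toy data in long form: the interaction formula `rating ~ time * type` is an assumption (it gives each scale its own intercept and slope), not necessarily the model intended here.

``` r
library(dplyr)
library(ggplot2)

# Hypothetical long data: two scales rated over four weeks
toy_long <- tibble(
  time   = rep(1:4, times = 2),
  type   = rep(c("potato", "rancid"), each = 4),
  rating = c(10, 9, 7, 6, 1, 2, 4, 5)
)

# Assumed model: separate intercept and slope for each scale
fit <- lm(rating ~ time * type, data = toy_long)

# Add the fitted values and draw them as one line per scale
toy_long$fitted <- fitted(fit)
p <- ggplot(toy_long, aes(x = time, y = rating, colour = type)) +
  geom_point() +
  geom_line(aes(y = fitted))
p
```

The long form is convenient here because `type` enters the model as a single categorical variable.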

---
# Resources

- [posit cheatsheets](https://posit.co/resources/cheatsheets/)
- [Wickham (2007) Reshaping data](https://www.jstatsoft.org/article/view/v021i12)
- [R for Data Science (Wickham & Grolemund), chapter 9](https://r4ds.had.co.nz/wrangle-intro.html)
- [Telling Stories with Data (Alexander), chapters 9 & 10](https://tellingstorieswithdata.com/09-clean_and_prepare.html)

---
# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.