Wrangling Data and Models

class: center, middle, inverse, title-slide

.title[
# Wrangling Data and Models
]
.subtitle[
## SISBID 2024 <br> <a href="https://github.com/dicook/SISBID" class="uri">https://github.com/dicook/SISBID</a>
]
.author[
### Di Cook (<a href="mailto:dicook@monash.edu" class="email">dicook@monash.edu</a>) <br> Heike Hofmann (<a href="mailto:hhofmann4@unl.edu" class="email">hhofmann4@unl.edu</a>) <br> Susan Vanderplas (<a href="mailto:susan.vanderplas@unl.edu" class="email">susan.vanderplas@unl.edu</a>)
]
.date[
### 08/14-16/2024
]

---

# Outline

- `broom`, `broom.mixed`: tidily extract elements from model summaries
- `dplyr`:  motivation, functions, chaining

---
# Models and model output

Functions such as `lm`, `glm`, `lmer`, ... allow us to fit models

e.g. fit french fry rating with respect to all designed main effects:

``` r

ff_long <- french_fries |> pivot_longer(potato:painty, names_to = "type", values_to = "rating")
ff_lm <- lm(rating~type+treatment+time+subject, 
data=ff_long)
```

---

`summary`, `str`, `resid`, `fitted`, `coef`, ... allow us to extract different parts of a model for a linear model. Other model functions work differently ... major headaches!

``` r
summary(ff_lm)

Call:
lm(formula = rating ~ type + treatment + time + subject, data = ff_long)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0729 -1.9674 -0.4644  1.5202 10.2192

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.19851    0.25792   4.647 3.50e-06 ***
typegrassy  -1.15513    0.15572  -7.418 1.49e-13 ***
typepainty   0.70055    0.15577   4.497 7.11e-06 ***
typepotato   5.13322    0.15572  32.964  < 2e-16 ***
typerancid   2.03293    0.15572  13.055  < 2e-16 ***
treatment2  -0.05128    0.12051  -0.426   0.6705    
treatment3  -0.04546    0.12056  -0.377   0.7062    
time2        0.10194    0.21613   0.472   0.6372    
time3       -0.01361    0.21613  -0.063   0.9498    
time4       -0.10250    0.21613  -0.474   0.6353    
time5       -0.13301    0.21690  -0.613   0.5398    
time6       -0.04611    0.21613  -0.213   0.8311    
time7       -0.24621    0.21628  -1.138   0.2550    
time8       -0.11626    0.21658  -0.537   0.5914    
time9        0.13188    0.22783   0.579   0.5627    
time10       0.54635    0.22783   2.398   0.0165 *  
subject10    1.71451    0.24389   7.030 2.48e-12 ***
subject15   -0.35911    0.24489  -1.466   0.1426    
subject16    0.47519    0.24408   1.947   0.0516 .  
subject19    2.01651    0.24389   8.268  < 2e-16 ***
subject31    1.49284    0.25098   5.948 2.98e-09 ***
subject51    1.87284    0.24389   7.679 2.07e-14 ***
subject52    0.19484    0.24389   0.799   0.4244    
subject63    0.96051    0.24389   3.938 8.37e-05 ***
subject78   -0.58283    0.24389  -2.390   0.0169 *  
subject79   -0.53702    0.25027  -2.146   0.0320 *  
subject86    0.43543    0.25098   1.735   0.0828 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.9 on 3444 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.3995,	Adjusted R-squared:  0.395 
F-statistic: 88.13 on 26 and 3444 DF,  p-value: < 2.2e-16
```

---
# Tidying model output

The package `broom` gets model results into a tidy format at different levels

One value for each:

- model: `broom::glance`
- coefficient in the model: `broom::tidy`
- value in the dataset: `broom::augment`

---
# Goodness of fit statistics: `glance`

``` r
glance(ff_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     0.400         0.395  2.90      88.1       0    26 -8607. 17270. 17442.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---
# Model estimates: `tidy`

``` r
ff_lm_tidy <- tidy(ff_lm)
glimpse(ff_lm_tidy)
Rows: 27
Columns: 5
$ term      <chr> "(Intercept)", "typegrassy", "typepainty", "typepotato", "ty…
$ estimate  <dbl> 1.19851228, -1.15513007, 0.70054547, 5.13321525, 2.03292748,…
$ std.error <dbl> 0.2579150, 0.1557200, 0.1557749, 0.1557200, 0.1557200, 0.120…
$ statistic <dbl> 4.64692727, -7.41799461, 4.49716394, 32.96439433, 13.0550191…
$ p.value   <dbl> 3.495385e-06, 1.489614e-13, 7.113736e-06, 2.275767e-207, 4.6…
```

---
# Model diagnostics: `augment`

``` r
ff_lm_all <- augment(ff_lm)
glimpse(ff_lm_all)
Rows: 3,471
Columns: 12
$ .rownames  <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "1…
$ rating     <dbl> 2.9, 0.0, 0.0, 0.0, 5.5, 14.0, 0.0, 0.0, 1.1, 0.0, 11.0, 6.…
$ type       <chr> "potato", "buttery", "grassy", "rancid", "painty", "potato"…
$ treatment  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ time       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ subject    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 10, 10, 10, 10, 10, 10, 10, 1…
$ .fitted    <dbl> 6.33172754, 1.19851228, 0.04338221, 3.23143977, 1.89905775,…
$ .resid     <dbl> -3.43172754, -1.19851228, -0.04338221, -3.23143977, 3.60094…
$ .hat       <dbl> 0.007910376, 0.007911506, 0.007910376, 0.007910376, 0.00790…
$ .sigma     <dbl> 2.899486, 2.900008, 2.900081, 2.899554, 2.899426, 2.897111,…
$ .cooksd    <dbl> 4.169299e-04, 5.086106e-05, 6.662863e-08, 3.696831e-04, 4.5…
$ .std.resid <dbl> -1.18820206, -0.41497338, -0.01502067, -1.11885438, 1.24679…
```

---
# Residual plot

``` r
ggplot(ff_lm_all, aes(x=.fitted, y=.resid)) + geom_point()
```

---
# Adding random effects

Add random intercepts for each subject

``` r
library(lme4)
fries_lmer <- lmer(rating~type+treatment+time+subject + ( 1 | subject ), 
data = ff_long)
```

---
class: inverse middle 
# Your turn

The `broom.mixed` package provides similar functionality to mixed effects models as `broom` does for fixed effects models

- Try out the functionality of `broom.mixed` on each level: model, parameters, and data
- Augment the `ff_long` data with the model diagnostics
- Plot the residuals from the random effects model
- Plot fitted values against observed values

---
# dplyr verbs

There are five primary dplyr **verbs**, representing distinct data analysis tasks:

- `filter`: Remove the rows of a data frame, producing subsets
- `arrange`: Reorder the rows of a data frame
- `select`: Select particular columns of a data frame
- `mutate`: Add new columns that are functions of existing columns
- `summarise`: Create collapsed summaries of a data frame
- (`group_by`: Introduce structure to a data frame)
 
 
---
# Filter

select a subset of the observations (horizontal selection)

``` r
load(here::here("data/french_fries.rda"))
french_fries |>
*   filter(subject == 3, time == 1)
# A tibble: 6 × 9
  time  treatment subject   rep potato buttery grassy rancid painty
* <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 1     1         3           1    2.9     0      0      0      5.5
2 1     1         3           2   14       0      0      1.1    0  
3 1     2         3           1   13.9     0      0      3.9    0  
4 1     2         3           2   13.4     0.1    0      1.5    0  
5 1     3         3           1   14.1     0      0      1.1    0  
6 1     3         3           2    9.5     0      0.6    2.8    0  
```

---
# Arrange

order the observations (hierarchically)

``` r
french_fries |>
*   arrange(desc(rancid)) |>
    head()
# A tibble: 6 × 9
  time  treatment subject   rep potato buttery grassy rancid painty
  <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 9     2         51          1    7.3     2.3      0   14.9    0.1
2 10    1         86          2    0.7     0        0   14.3   13.1
3 5     2         63          1    4.4     0        0   13.8    0.6
4 9     2         63          1    1.8     0        0   13.7   12.3
5 5     2         19          2    5.5     4.7      0   13.4    4.6
6 4     3         63          1    5.6     0        0   13.3    4.4
```

---
# Select

select a subset of the variables (vertical selection)

``` r
french_fries |>
*   select(time, treatment, subject, rep, potato) |>
    head()
# A tibble: 6 × 5
  time  treatment subject   rep potato
  <fct> <fct>     <fct>   <dbl>  <dbl>
1 1     1         3           1    2.9
2 1     1         3           2   14  
3 1     1         10          1   11  
4 1     1         10          2    9.9
5 1     1         15          1    1.2
6 1     1         15          2    8.8
```

---
# Summarise

summarize observations into a (set of) one-number statistic(s):

``` r
french_fries |>
*   summarise(
      mean_rancid = mean(rancid, na.rm=TRUE), 
      sd_rancid = sd(rancid, na.rm = TRUE)
*     )
# A tibble: 1 × 2
  mean_rancid sd_rancid
        <dbl>     <dbl>
1        3.85      3.78
```

---
# Summarise and group_by

``` r
french_fries |>
    group_by(time, treatment) |>
    summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
# A tibble: 30 × 4
# Groups:   time [10]
   time  treatment mean_rancid sd_rancid
   <fct> <fct>           <dbl>     <dbl>
 1 1     1                2.76      3.21
 2 1     2                1.72      2.71
 3 1     3                2.6       3.20
 4 2     1                3.9       4.37
 5 2     2                2.14      3.12
 6 2     3                2.50      3.38
 7 3     1                4.65      3.93
 8 3     2                2.90      3.77
 9 3     3                3.6       3.59
10 4     1                2.08      2.39
# ℹ 20 more rows
```

---
# Let's use these tools

to answer these french fry experiment questions:

- Is the design complete?
- Are replicates like each other? 
- How do the ratings on the different scales differ?
- Are raters giving different scores on average?
- Do ratings change over the weeks?

---
# Completeness 
If the data is complete it should be 12 x 10 x 3 x 2, that is, 6 records for each person. (Assuming that each person rated on all scales.)

To check this we want to tabulate the number of records for each subject, time and treatment. This means select appropriate columns, tabulate, count and spread it out to give a nice table.

---
class: inverse middle 
# Your turn

How many values do we have for each subject? Check the help for function `?n`

<div class="countdown" id="timer_72b1a429" data-update-every="1" tabindex="0" style="right:0;bottom:0;">
<div class="countdown-controls"><button class="countdown-bump-down">−</button><button class="countdown-bump-up">+</button></div>
<code class="countdown-time"><span class="countdown-digits minutes">02</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
---
# French Fries - completeness

`n()`

``` r
french_fries |> 
  group_by(subject) |> 
  summarize(n = n()) 
# A tibble: 12 × 2
   subject     n
   <fct>   <int>
 1 3          54
 2 10         60
 3 15         60
 4 16         60
 5 19         60
 6 31         54
 7 51         60
 8 52         60
 9 63         60
10 78         60
11 79         54
12 86         54
```

---
# Other nice short cuts

instead of `group_by(subject) |> summarize(n = n())` we can use:

- `group_by(subject) |> tally()`
- `count(subject)`

---
# Counts for subject by time

``` r
french_fries |>
  na.omit() |>
  count(subject, time) |>
  pivot_wider(names_from="time", values_from="n")
# A tibble: 12 × 11
   subject   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
   <fct>   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 3           6     6     6     6     6     6     6     6     6    NA
 2 10          6     6     6     6     6     6     6     6     6     6
 3 15          6     6     6     6     5     6     6     6     6     6
 4 16          6     6     6     6     6     6     6     5     6     6
 5 19          6     6     6     6     6     6     6     6     6     6
 6 31          6     6     6     6     6     6     6     6    NA     6
 7 51          6     6     6     6     6     6     6     6     6     6
 8 52          6     6     6     6     6     6     6     6     6     6
 9 63          6     6     6     6     6     6     6     6     6     6
10 78          6     6     6     6     6     6     6     6     6     6
11 79          6     6     6     6     6     6     5     4     6    NA
12 86          6     6     6     6     6     6     6     6    NA     6
```

---
# How do scores change over time?

``` r
ggplot(data=ff_long, aes(x=time, y=rating, colour=treatment)) +
  geom_point() +
  facet_grid(subject~type) 
```

Plot on next slide...very messy!

---
class: hide-count
count: false

---
class: inverse
# Your Turn

Get summary of ratings over replicates and connect the dots for a picture as below:

---
# References

- [Posit cheatsheets](https://posit.co/resources/cheatsheets/)
- [Wickham (2007) Reshaping data](https://www.jstatsoft.org/article/view/v021i12)
- [broom vignettes, David Robinson](https://cran.r-project.org/web/packages/broom/vignettes/broom.html)
- [Wickham (2011) Split-Apply-Combine](https://www.jstatsoft.org/article/view/v040i01)

---
# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.