Tidying your data

class: center, middle, inverse, title-slide

.title[
# Tidying your data
]
.subtitle[
## SISBID 2024 <br> <a href="https://github.com/dicook/SISBID" class="uri">https://github.com/dicook/SISBID</a>
]
.author[
### Di Cook (<a href="mailto:dicook@monash.edu" class="email">dicook@monash.edu</a>) <br> Heike Hofmann (<a href="mailto:hhofmann4@unl.edu" class="email">hhofmann4@unl.edu</a>) <br> Susan Vanderplas (<a href="mailto:susan.vanderplas@unl.edu" class="email">susan.vanderplas@unl.edu</a>)
]
.date[
### 08/14-16/2024
]

---

# Using `tidyr`, `dplyr`

- Writing readable code using **pipes**
- What is **tidy data**? Why do you want tidy data? Getting your data into tidy form using tidyr.
- Reading different **data formats**
- String operations, working with **text**

---
# The pipe operator `%>%` or `|>`

- read as `then` 
- `x %>% f(y)` and `x |> f(y)` is the same as `f(x, y)`
- `%>%` is part of the `dplyr` package (or more precisely, the `magrittr` package), 
`|>` is part of base R 
- pipes allow the code to be *read* like a sequence of operations

Example:

``` r
tb <- read_csv(here::here("data/TB_notifications_2020-07-01.csv"))
tb |>                                  # first we get the tb data
  filter(year == 2016) |>              # then we focus on just the year 2016
  group_by(country) |>                 # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm=TRUE)   # to create a summary of all new cases
    ) |> 
  arrange(desc(cases))                  # then we sort countries to show highest number new cases first
```

There are some subtle differences between the base pipe `|>` and the tidy pipe `%>%`, see  [here](https://towardsdatascience.com/understanding-the-native-r-pipe-98dea6d8b61b) for details.

---
.pull-left[
<img src="https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/6e297111-efb7-4555-93d2-085071d1853c_rw_1920.jpg?h=f1951b118eaddaaab8a572422bc16566" width="100%">

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by [Julia Lowndes and Allison Horst](https://allisonhorst.com/other-r-fun)

]

.pull-right[
# What is **tidy data**?

- What do we expect tidy data to look like?
- Couple of exercises to look closer at what we find in practice

]

---
class: inverse middle 
# Your turn 1

What are the variables?

```
# A tibble: 6 × 4
  Inst                     AvNumPubs AvNumCits PctCompletion
  <chr>                        <dbl>     <dbl>         <dbl>
1 ARIZONA STATE UNIVERSITY      0.9       1.57          31.7
2 AUBURN UNIVERSITY             0.79      0.64          44.4
3 BOSTON COLLEGE                0.51      1.03          46.8
4 BOSTON UNIVERSITY             0.49      2.66          34.2
5 BRANDEIS UNIVERSITY           0.3       3.03          48.7
6 BROWN UNIVERSITY              0.84      2.31          54.6
```

---
class: inverse middle 
# Your turn 2

What's in the column names of this data? What are the experimental units? What are the measured variables?

```
# A tibble: 3 × 12
  id     `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
  <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
1 Gene 1      2.18     2.20       4.20     2.63       5.06       4.54       5.53
2 Gene 2      1.46     0.585      1.86     0.515      2.88       1.36       2.96
3 Gene 3      2.03     0.870      3.28     0.533      4.63       2.18       5.56
# ℹ 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>,
#   `WM-12.R4` <dbl>
```

---
class: inverse middle 
# Your turn 3

What are the variables? What are the records?

```
           V1   V2 V3   V4  V5  V9 V13 V17 V21 V25 V29 V33 V37 V41 V45 V49 V53
1 ASN00086282 1970  7 TMAX 141 124 113 123 148 149 139 153 123 108 119 112 126
2 ASN00086282 1970  7 TMIN  80  63  36  57  69  47  84  78  49  42  48  56  51
3 ASN00086282 1970  7 PRCP   3  30   0   0  36   3   0   0  10  23   3   0   5
4 ASN00086282 1970  8 TMAX 145 128 150 122 109 112 116 142 166 127 117 127 159
5 ASN00086282 1970  8 TMIN  50  61  75  67  41  51  48  -7  56  62  47  33  67
6 ASN00086282 1970  8 PRCP   0  66   0  53  13   3   8   0   0   0   3   5   0
  V57 V61 V65 V69 V73 V77 V81 V85 V89 V93 V97
1 112 115 133 134 126 104 143 141 134 117 142
2  36  44  39  40  58  15  33  51  74  39  66
3   0   0   0   0   0   8   0  18   0   0   0
4 143 114  65 113 125 129 147 161 168 178 161
5  84  11  41  18  50  22  28  74  94  73  88
6   0  64   3  99  36   8   0   0   0   8  36
```

---
class: inverse middle 
# Your turn 4

What are the variables? What are the experimental units?

```
# A tibble: 6 × 22
  iso2   year  m_04 m_514 m_014 m_1524 m_2534 m_3544 m_4554 m_5564  m_65   m_u
  <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
1 ZW     2003    NA    NA   133    874   3048   2228    981    367   205    NA
2 ZW     2004    NA    NA   187    833   2908   2298   1056    366   198    NA
3 ZW     2005    NA    NA   210    837   2264   1855    762    295   656    NA
4 ZW     2006    NA    NA   215    736   2391   1939    896    348   199    NA
5 ZW     2007     6   132   138    500   3693      0    716    292   153    NA
6 ZW     2008    NA    NA   127    614      0   3316    704    263   185     0
# ℹ 10 more variables: f_04 <dbl>, f_514 <dbl>, f_014 <dbl>, f_1524 <dbl>,
#   f_2534 <dbl>, f_3544 <dbl>, f_4554 <dbl>, f_5564 <dbl>, f_65 <dbl>,
#   f_u <dbl>
```

---
class: inverse middle 
# Your turn 5

```
# A tibble: 4 × 9
  time  treatment subject   rep potato buttery grassy rancid painty
  <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 1     1         3           1    2.9     0      0      0      5.5
2 1     1         3           2   14       0      0      1.1    0  
3 1     1         10          1   11       6.4    0      0      0  
4 1     1         10          2    9.9     5.9    2.9    2.2    0  
```

---
# Messy data patterns

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns:

- Column headers are not just variable names, but also contain values
- Variables are stored in both rows and columns, contingency table format
- One type of experimental unit stored in multiple tables
- Dates in many different formats

---
# What is tidy data?

- Each observation forms a row
- Each variable forms a column
- Data is contained in a single table

- *Long form* makes it easier to reshape in many different ways
- *Wider forms* are common for analysis

Long form: **one** measured value per row. All other variables are descriptors (key variables)

Widest form: **all** measured values for an entity are in a single row.

---
background-image: url(lego.png)
background-size: 100%
background-position: 50% 5%

---
background-image: url(playmobile.png)
background-size: 100%
background-position: 50% 5%

---

![](https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/209ee1eb-fd83-4e4c-9bed-82ae43b0f3e9_rw_3840.jpg?h=b00690a8a5eef3c901d6609cfefee987)

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by [Julia Lowndes and Allison Horst](https://allisonhorst.com/other-r-fun)

---
# Tidy verbs

- `pivot_longer`: specify the **keys** (identifiers) and the **values** (measures) to make long form (used to be called melting or gathering)
- `pivot_wider`: variables in columns (used to be called casting or spreading)
- `separate`/`unite`: split and combine columns
- `nest`/`unnest`: working with list variables

---
# Pivot to long form

.pull-left[

`pivot_longer(data, cols, names_to = "name", values_to = "value", ...)`

`data |> pivot_longer(cols, names_to = "name", values_to = "value", ...)`

- `pivot_longer` turns a wide format into a long format
- two new variables are introduced (in key-value format): **name** and **value**
- `col` defines which variables should be combined

The animation on the right is from Garrick Aden-Buie's `tidyexplain` package.

[tidyr cheat sheet](https://rstudio.github.io/cheatsheets/tidyr.pdf)
]

--
.pull-right[

<img src="https://github.com/gadenbuie/tidyexplain/raw/main/images/tidyr-pivoting.gif" width = 450>
]

---
# Pivoting an example

``` r
# wide format
dframe
  id trtA trtB
1  1  2.5   45
2  2  4.6   35

# long format
dframe |> pivot_longer(trtA:trtB, names_to="treatment", values_to="outcome")
# A tibble: 4 × 3
     id treatment outcome
  <int> <chr>       <dbl>
1     1 trtA          2.5
2     1 trtB         45  
3     2 trtA          4.6
4     2 trtB         35  
```

---
# Variable Selectors

`data |> pivot_longer(cols, names_to = "key", values_to = "value", ...)`

`cols` argument identifies variables that should be combined.

**Selectors** can be used to identify variables: variables can be identified by **name**, by **position**, by a range (using `:`), by a pattern, or by a combination of all.

Pattern selectors:

`starts_with(match, ignore.case = TRUE, vars = NULL)`

other select functions: `ends_with`, `contains`, `matches`.

For more details, see `?tidyselect::language`

---
# TB notifications
  
new notifications of TB have the form `new_sp_<sex><age group>`:

``` r
read_csv(here::here("data/TB_notifications_2020-07-01.csv")) |> 
  dplyr::select(country, iso3, year, starts_with("new_sp_")) |>
  na.omit() |>
  head()
# A tibble: 6 × 23
  country     iso3   year new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524
  <chr>       <chr> <dbl>      <dbl>       <dbl>       <dbl>        <dbl>
1 Afghanistan AFG    2010          4         193         197          986
2 Afghanistan AFG    2012          0         188         188         1116
3 Albania     ALB    2005          0           0           0           26
4 Albania     ALB    2006          1           4           5           24
5 Albania     ALB    2007          0           0           0           19
6 Albania     ALB    2009          0           0           0           22
# ℹ 16 more variables: new_sp_m2534 <dbl>, new_sp_m3544 <dbl>,
#   new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>,
#   new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
#   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
#   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sp_fu <dbl>
```

---
# TB notifications (pivot_longer)
  
create two new variables:

- variable `name` (contains all variable names starting with `new_sp_`) and 
- variable `value` with the values of the selected variables

``` r
tb1 <- read_csv(here::here("data/TB_notifications_2020-07-01.csv")) |> 
  dplyr::select(country, iso3, year, starts_with("new_sp_")) |>
  pivot_longer(starts_with("new_sp_"))

tb1 |> na.omit() |> head()
# A tibble: 6 × 5
  country     iso3   year name         value
  <chr>       <chr> <dbl> <chr>        <dbl>
1 Afghanistan AFG    1997 new_sp_m014      0
2 Afghanistan AFG    1997 new_sp_m1524    10
3 Afghanistan AFG    1997 new_sp_m2534     6
4 Afghanistan AFG    1997 new_sp_m3544     3
5 Afghanistan AFG    1997 new_sp_m4554     5
6 Afghanistan AFG    1997 new_sp_m5564     2
```

---
# Separate columns
  
`separate_wider_delim (data, col, delim, names, ...)`

`data |> separate_wider_delim (col, delim, names, ...)`

- split column `col` from data frame `frame` into a set of columns as specified in `names`
- `delim` is the delimiter at which we split into columns, splitting separator.

---
# TB notification (separate)
  
Work on `name`:

``` r
tb2 <- tb1 |>
  separate_wider_delim(name, delim = "_", names=c("foo_new", "foo_sp", "sexage"))

tb2 |> na.omit() |> head()
# A tibble: 6 × 7
  country     iso3   year foo_new foo_sp sexage value
  <chr>       <chr> <dbl> <chr>   <chr>  <chr>  <dbl>
1 Afghanistan AFG    1997 new     sp     m014       0
2 Afghanistan AFG    1997 new     sp     m1524     10
3 Afghanistan AFG    1997 new     sp     m2534      6
4 Afghanistan AFG    1997 new     sp     m3544      3
5 Afghanistan AFG    1997 new     sp     m4554      5
6 Afghanistan AFG    1997 new     sp     m5564      2
```

---
# Separate columns
  
`separate_wider_position (data, col, widths, ...)`

`data %>% separate_wider_position(col, widths, ...)`

- split column `col` from data frame `frame` into a set of columns as specified in `widths`
- `widths` is named numeric vector where the names become column names; unnamed components will be matched but not included.

---
# TB notification (separate again)
  
Now split `sexage` into first character (m/f) and rest.

``` r
tb3 <- tb2 %>% dplyr::select(-starts_with("foo")) |> # remove the `foo` variables
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  )

tb3 |> na.omit() |> head()
# A tibble: 6 × 6
  country     iso3   year sex   age   value
  <chr>       <chr> <dbl> <chr> <chr> <dbl>
1 Afghanistan AFG    1997 m     014       0
2 Afghanistan AFG    1997 m     1524     10
3 Afghanistan AFG    1997 m     2534      6
4 Afghanistan AFG    1997 m     3544      3
5 Afghanistan AFG    1997 m     4554      5
6 Afghanistan AFG    1997 m     5564      2
```

---
class: inverse middle 
# Your turn

Read the genes data from folder `data`. Column names contain data and are kind of messy.

``` r
genes <- read_csv(here::here("data/genes.csv"))

names(genes)
 [1] "id"       "WI-6.R1"  "WI-6.R2"  "WI-6.R4"  "WM-6.R1"  "WM-6.R2" 
 [7] "WI-12.R1" "WI-12.R2" "WI-12.R4" "WM-12.R1" "WM-12.R2" "WM-12.R4"
```

Produce the data frame called `gtidy` as shown below:

``` r
head(gtidy)
# A tibble: 6 × 5
  id     trt   time  rep    expr
  <chr>  <chr> <chr> <chr> <dbl>
1 Gene 1 I     6     1      2.18
2 Gene 1 I     6     2      2.20
3 Gene 1 I     6     4      4.20
4 Gene 1 M     6     1      2.63
5 Gene 1 M     6     2      5.06
6 Gene 1 I     12    1      4.54
```

Break into zoom rooms, clock is ticking 😄
<div class="countdown" id="timer_ee393796" data-update-every="1" tabindex="0" style="right:0;bottom:0;">
<div class="countdown-controls"><button class="countdown-bump-down">−</button><button class="countdown-bump-up">+</button></div>
<code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---
# Plot the genes data overlaid with group means

.pull-left[

``` r
gmean <- gtidy |> 
  group_by(id, trt, time) |> 
  summarise(expr = mean(expr))
gtidy |> 
  ggplot(aes(x = trt, y = expr, colour=time)) +
  geom_point() +
  geom_line(data = gmean, aes(group = time)) +
  facet_wrap(~id) +
  scale_colour_brewer("", palette="Set1")
```
]
.pull-right[
<img src="index_files/figure-html/plot the genes data overlais with group means-1.png" style="display: block; margin: auto;" />
]
---
# Resources

- [The tidy tools manifesto](https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html)
- [Posit cheatsheets](https://posit.co/resources/cheatsheets/)
- [Wickham (2007) Reshaping data](https://www.jstatsoft.org/article/view/v021i12)
- [broom vignettes, David Robinson](https://cran.r-project.org/web/packages/broom/vignettes/broom.html)

---
# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.