class: center, middle, inverse, title-slide .title[ # Tidying your data ] .subtitle[ ## SISBID 2024
https://github.com/dicook/SISBID
] .author[ ### Di Cook (
dicook@monash.edu
)
Heike Hofmann (
hhofmann4@unl.edu
)
Susan Vanderplas (
susan.vanderplas@unl.edu
) ] .date[ ### 08/14-16/2024 ] --- # Using `tidyr`, `dplyr` <img src="images/tidyr.png" width="10%" /> <img src="images/dplyr.png" width="10%" /> - Writing readable code using **pipes** - What is **tidy data**? Why do you want tidy data? Getting your data into tidy form using tidyr. - Reading different **data formats** - String operations, working with **text** --- # The pipe operator `%>%` or `|>` - read as `then` - `x %>% f(y)` and `x |> f(y)` is the same as `f(x, y)` - `%>%` is part of the `dplyr` package (or more precisely, the `magrittr` package), `|>` is part of base R - pipes allow the code to be *read* like a sequence of operations Example: ``` r tb <- read_csv(here::here("data/TB_notifications_2020-07-01.csv")) tb |> # first we get the tb data filter(year == 2016) |> # then we focus on just the year 2016 group_by(country) |> # then we group by country summarize( cases = sum(c_newinc, na.rm=TRUE) # to create a summary of all new cases ) |> arrange(desc(cases)) # then we sort countries to show highest number new cases first ``` There are some subtle differences between the base pipe `|>` and the tidy pipe `%>%`, see [here](https://towardsdatascience.com/understanding-the-native-r-pipe-98dea6d8b61b) for details. --- .pull-left[ <img src="https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/6e297111-efb7-4555-93d2-085071d1853c_rw_1920.jpg?h=f1951b118eaddaaab8a572422bc16566" width="100%"> Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by [Julia Lowndes and Allison Horst](https://allisonhorst.com/other-r-fun) ] .pull-right[ # What is **tidy data**? - What do we expect tidy data to look like? - Couple of exercises to look closer at what we find in practice ] --- class: inverse middle # Your turn 1 What are the variables? ``` # A tibble: 6 × 4 Inst AvNumPubs AvNumCits PctCompletion <chr> <dbl> <dbl> <dbl> 1 ARIZONA STATE UNIVERSITY 0.9 1.57 31.7 2 AUBURN UNIVERSITY 0.79 0.64 44.4 3 BOSTON COLLEGE 0.51 1.03 46.8 4 BOSTON UNIVERSITY 0.49 2.66 34.2 5 BRANDEIS UNIVERSITY 0.3 3.03 48.7 6 BROWN UNIVERSITY 0.84 2.31 54.6 ``` --- class: inverse middle # Your turn 2 What's in the column names of this data? What are the experimental units? What are the measured variables? ``` # A tibble: 3 × 12 id `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2` <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Gene 1 2.18 2.20 4.20 2.63 5.06 4.54 5.53 2 Gene 2 1.46 0.585 1.86 0.515 2.88 1.36 2.96 3 Gene 3 2.03 0.870 3.28 0.533 4.63 2.18 5.56 # ℹ 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>, # `WM-12.R4` <dbl> ``` --- class: inverse middle # Your turn 3 What are the variables? What are the records? ``` V1 V2 V3 V4 V5 V9 V13 V17 V21 V25 V29 V33 V37 V41 V45 V49 V53 1 ASN00086282 1970 7 TMAX 141 124 113 123 148 149 139 153 123 108 119 112 126 2 ASN00086282 1970 7 TMIN 80 63 36 57 69 47 84 78 49 42 48 56 51 3 ASN00086282 1970 7 PRCP 3 30 0 0 36 3 0 0 10 23 3 0 5 4 ASN00086282 1970 8 TMAX 145 128 150 122 109 112 116 142 166 127 117 127 159 5 ASN00086282 1970 8 TMIN 50 61 75 67 41 51 48 -7 56 62 47 33 67 6 ASN00086282 1970 8 PRCP 0 66 0 53 13 3 8 0 0 0 3 5 0 V57 V61 V65 V69 V73 V77 V81 V85 V89 V93 V97 1 112 115 133 134 126 104 143 141 134 117 142 2 36 44 39 40 58 15 33 51 74 39 66 3 0 0 0 0 0 8 0 18 0 0 0 4 143 114 65 113 125 129 147 161 168 178 161 5 84 11 41 18 50 22 28 74 94 73 88 6 0 64 3 99 36 8 0 0 0 8 36 ``` --- class: inverse middle # Your turn 4 What are the variables? What are the experimental units? ``` # A tibble: 6 × 22 iso2 year m_04 m_514 m_014 m_1524 m_2534 m_3544 m_4554 m_5564 m_65 m_u <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 ZW 2003 NA NA 133 874 3048 2228 981 367 205 NA 2 ZW 2004 NA NA 187 833 2908 2298 1056 366 198 NA 3 ZW 2005 NA NA 210 837 2264 1855 762 295 656 NA 4 ZW 2006 NA NA 215 736 2391 1939 896 348 199 NA 5 ZW 2007 6 132 138 500 3693 0 716 292 153 NA 6 ZW 2008 NA NA 127 614 0 3316 704 263 185 0 # ℹ 10 more variables: f_04 <dbl>, f_514 <dbl>, f_014 <dbl>, f_1524 <dbl>, # f_2534 <dbl>, f_3544 <dbl>, f_4554 <dbl>, f_5564 <dbl>, f_65 <dbl>, # f_u <dbl> ``` --- class: inverse middle # Your turn 5 <!--10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:--> ``` # A tibble: 4 × 9 time treatment subject rep potato buttery grassy rancid painty <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 1 3 1 2.9 0 0 0 5.5 2 1 1 3 2 14 0 0 1.1 0 3 1 1 10 1 11 6.4 0 0 0 4 1 1 10 2 9.9 5.9 2.9 2.2 0 ``` <!--What is the experimental unit? What are the factors of the experiment? What was measured? What do you want to know?--> --- # Messy data patterns There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns: - Column headers are not just variable names, but also contain values - Variables are stored in both rows and columns, contingency table format - One type of experimental unit stored in multiple tables - Dates in many different formats --- # What is tidy data? - Each observation forms a row - Each variable forms a column - Data is contained in a single table - *Long form* makes it easier to reshape in many different ways - *Wider forms* are common for analysis Long form: **one** measured value per row. All other variables are descriptors (key variables) Widest form: **all** measured values for an entity are in a single row. --- background-image: url(lego.png) background-size: 100% background-position: 50% 5% --- background-image: url(playmobile.png) background-size: 100% background-position: 50% 5% --- ![](https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/209ee1eb-fd83-4e4c-9bed-82ae43b0f3e9_rw_3840.jpg?h=b00690a8a5eef3c901d6609cfefee987) Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by [Julia Lowndes and Allison Horst](https://allisonhorst.com/other-r-fun) --- # Tidy verbs - `pivot_longer`: specify the **keys** (identifiers) and the **values** (measures) to make long form (used to be called melting or gathering) - `pivot_wider`: variables in columns (used to be called casting or spreading) - `separate`/`unite`: split and combine columns - `nest`/`unnest`: working with list variables --- # Pivot to long form .pull-left[ `pivot_longer(data, cols, names_to = "name", values_to = "value", ...)` `data |> pivot_longer(cols, names_to = "name", values_to = "value", ...)` - `pivot_longer` turns a wide format into a long format - two new variables are introduced (in key-value format): **name** and **value** - `col` defines which variables should be combined The animation on the right is from Garrick Aden-Buie's `tidyexplain` package. [tidyr cheat sheet](https://rstudio.github.io/cheatsheets/tidyr.pdf) ] -- .pull-right[ <img src="https://github.com/gadenbuie/tidyexplain/raw/main/images/tidyr-pivoting.gif" width = 450> ] --- # Pivoting an example ``` r # wide format dframe id trtA trtB 1 1 2.5 45 2 2 4.6 35 # long format dframe |> pivot_longer(trtA:trtB, names_to="treatment", values_to="outcome") # A tibble: 4 × 3 id treatment outcome <int> <chr> <dbl> 1 1 trtA 2.5 2 1 trtB 45 3 2 trtA 4.6 4 2 trtB 35 ``` --- # Variable Selectors `data |> pivot_longer(cols, names_to = "key", values_to = "value", ...)` `cols` argument identifies variables that should be combined. **Selectors** can be used to identify variables: variables can be identified by **name**, by **position**, by a range (using `:`), by a pattern, or by a combination of all. Pattern selectors: `starts_with(match, ignore.case = TRUE, vars = NULL)` other select functions: `ends_with`, `contains`, `matches`. For more details, see `?tidyselect::language` --- # TB notifications new notifications of TB have the form `new_sp_<sex><age group>`: ``` r read_csv(here::here("data/TB_notifications_2020-07-01.csv")) |> dplyr::select(country, iso3, year, starts_with("new_sp_")) |> na.omit() |> head() # A tibble: 6 × 23 country iso3 year new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Afghanistan AFG 2010 4 193 197 986 2 Afghanistan AFG 2012 0 188 188 1116 3 Albania ALB 2005 0 0 0 26 4 Albania ALB 2006 1 4 5 24 5 Albania ALB 2007 0 0 0 19 6 Albania ALB 2009 0 0 0 22 # ℹ 16 more variables: new_sp_m2534 <dbl>, new_sp_m3544 <dbl>, # new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>, # new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>, # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, # new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sp_fu <dbl> ``` --- # TB notifications (pivot_longer) create two new variables: - variable `name` (contains all variable names starting with `new_sp_`) and - variable `value` with the values of the selected variables ``` r tb1 <- read_csv(here::here("data/TB_notifications_2020-07-01.csv")) |> dplyr::select(country, iso3, year, starts_with("new_sp_")) |> pivot_longer(starts_with("new_sp_")) tb1 |> na.omit() |> head() # A tibble: 6 × 5 country iso3 year name value <chr> <chr> <dbl> <chr> <dbl> 1 Afghanistan AFG 1997 new_sp_m014 0 2 Afghanistan AFG 1997 new_sp_m1524 10 3 Afghanistan AFG 1997 new_sp_m2534 6 4 Afghanistan AFG 1997 new_sp_m3544 3 5 Afghanistan AFG 1997 new_sp_m4554 5 6 Afghanistan AFG 1997 new_sp_m5564 2 ``` --- # Separate columns `separate_wider_delim (data, col, delim, names, ...)` `data |> separate_wider_delim (col, delim, names, ...)` - split column `col` from data frame `frame` into a set of columns as specified in `names` - `delim` is the delimiter at which we split into columns, splitting separator. --- # TB notification (separate) Work on `name`: ``` r tb2 <- tb1 |> separate_wider_delim(name, delim = "_", names=c("foo_new", "foo_sp", "sexage")) tb2 |> na.omit() |> head() # A tibble: 6 × 7 country iso3 year foo_new foo_sp sexage value <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> 1 Afghanistan AFG 1997 new sp m014 0 2 Afghanistan AFG 1997 new sp m1524 10 3 Afghanistan AFG 1997 new sp m2534 6 4 Afghanistan AFG 1997 new sp m3544 3 5 Afghanistan AFG 1997 new sp m4554 5 6 Afghanistan AFG 1997 new sp m5564 2 ``` --- # Separate columns `separate_wider_position (data, col, widths, ...)` `data %>% separate_wider_position(col, widths, ...)` - split column `col` from data frame `frame` into a set of columns as specified in `widths` - `widths` is named numeric vector where the names become column names; unnamed components will be matched but not included. --- # TB notification (separate again) Now split `sexage` into first character (m/f) and rest. ``` r tb3 <- tb2 %>% dplyr::select(-starts_with("foo")) |> # remove the `foo` variables separate_wider_position( sexage, widths = c(sex = 1, age = 4), too_few = "align_start" ) tb3 |> na.omit() |> head() # A tibble: 6 × 6 country iso3 year sex age value <chr> <chr> <dbl> <chr> <chr> <dbl> 1 Afghanistan AFG 1997 m 014 0 2 Afghanistan AFG 1997 m 1524 10 3 Afghanistan AFG 1997 m 2534 6 4 Afghanistan AFG 1997 m 3544 3 5 Afghanistan AFG 1997 m 4554 5 6 Afghanistan AFG 1997 m 5564 2 ``` --- class: inverse middle # Your turn Read the genes data from folder `data`. Column names contain data and are kind of messy. ``` r genes <- read_csv(here::here("data/genes.csv")) names(genes) [1] "id" "WI-6.R1" "WI-6.R2" "WI-6.R4" "WM-6.R1" "WM-6.R2" [7] "WI-12.R1" "WI-12.R2" "WI-12.R4" "WM-12.R1" "WM-12.R2" "WM-12.R4" ``` Produce the data frame called `gtidy` as shown below: ``` r head(gtidy) # A tibble: 6 × 5 id trt time rep expr <chr> <chr> <chr> <chr> <dbl> 1 Gene 1 I 6 1 2.18 2 Gene 1 I 6 2 2.20 3 Gene 1 I 6 4 4.20 4 Gene 1 M 6 1 2.63 5 Gene 1 M 6 2 5.06 6 Gene 1 I 12 1 4.54 ``` Break into zoom rooms, clock is ticking 😄
−
+
10
:
00
--- # Plot the genes data overlaid with group means .pull-left[ ``` r gmean <- gtidy |> group_by(id, trt, time) |> summarise(expr = mean(expr)) gtidy |> ggplot(aes(x = trt, y = expr, colour=time)) + geom_point() + geom_line(data = gmean, aes(group = time)) + facet_wrap(~id) + scale_colour_brewer("", palette="Set1") ``` ] .pull-right[ <img src="index_files/figure-html/plot the genes data overlais with group means-1.png" style="display: block; margin: auto;" /> ] --- # Resources - [The tidy tools manifesto](https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html) - [Posit cheatsheets](https://posit.co/resources/cheatsheets/) - [Wickham (2007) Reshaping data](https://www.jstatsoft.org/article/view/v021i12) - [broom vignettes, David Robinson](https://cran.r-project.org/web/packages/broom/vignettes/broom.html) --- # Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.