tb <- read_csv(here::here("data/TB_notifications_2025-07-22.csv"))

tb %>%                                   # first we get the tb data
  filter(year == 2023) %>%               # then we focus on the most recent year
  group_by(country) %>%                  # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm = TRUE)  # to create a summary of all new cases
  ) %>%
  arrange(desc(cases))                   # then we sort countries to show highest number of new cases first
tb <- read_csv(here::here("data/TB_notifications_2025-07-22.csv"))

tb |>                                    # first we get the tb data
  filter(year == 2023) |>                # then we focus on the most recent year
  group_by(country) |>                   # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm = TRUE)  # to create a summary of all new cases
  ) |>
  arrange(desc(cases))                   # then we sort countries to show highest number of new cases first
# A tibble: 215 × 2
   country                            cases
   <chr>                              <dbl>
 1 India                            2382714
 2 Indonesia                         804836
 3 Philippines                       575770
 4 China                             564918
 5 Pakistan                          475761
 6 Nigeria                           367250
 7 Bangladesh                        302813
 8 Democratic Republic of the Congo  258069
 9 South Africa                      211810
10 Ethiopia                          134873
# ℹ 205 more rows
What is tidy data?
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
What do we expect tidy data to look like?
Maybe easier: what are the sources of messiness?
🔮 👽 👼 TWO MINUTE CHALLENGE
What are aspects of messiness in data that you have encountered?
# A tibble: 6 × 4
  Inst                     AvNumPubs AvNumCits PctCompletion
  <chr>                        <dbl>     <dbl>         <dbl>
1 ARIZONA STATE UNIVERSITY      0.9       1.57          31.7
2 AUBURN UNIVERSITY             0.79      0.64          44.4
3 BOSTON COLLEGE                0.51      1.03          46.8
4 BOSTON UNIVERSITY             0.49      2.66          34.2
5 BRANDEIS UNIVERSITY           0.3       3.03          48.7
6 BROWN UNIVERSITY              0.84      2.31          54.6
What’s in the column names of this data? What are the experimental units? What are the measured variables?
A 10-week sensory experiment: 12 individuals assessed the taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?). The fries were fried in one of 3 different oils, and the experiment was replicated twice.
What is the experimental unit? What are the factors of the experiment? What was measured? What do you want to know?
Messy data patterns
There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns:
Column headers are not just variable names, but also contain values
Variables are stored in both rows and columns, contingency table format
One type of experimental unit stored in multiple tables
Dates in many different formats
Tidy Data Conventions
Data is contained in a single table
Each observation forms a row (no data info in column names)
Each variable forms a column (no mashup of multiple pieces of information)
Long and Wide
Long form: one measured value per row. All other variables are descriptors (key variables) - good for modelling, terrible for most other analyses, e.g. correlation matrix
Widest form: all measured values for an entity are in a single row.
Wide form: measurements are arranged by some of the descriptors in columns (for direct comparisons)
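A minimal sketch of the same measurements in long and in wide form. The tibble `long` and its values are invented for illustration; `pivot_wider()` and `pivot_longer()` are the tidyr functions that convert between the two forms.

```r
library(tidyr)
library(tibble)

# Toy data (hypothetical): long form, one measured value per row
long <- tibble(
  id    = c(1, 1, 2, 2),
  sex   = c("m", "f", "m", "f"),
  value = c(10, 12, 7, 9)
)

# Wide form: one column per level of `sex`, handy for direct comparisons
wide <- long |>
  pivot_wider(names_from = sex, values_from = value)

# And back to long form again
back <- wide |>
  pivot_longer(c(m, f), names_to = "sex", values_to = "value")

wide
```

In `wide`, the m and f values for one id sit side by side in a single row, so a comparison such as `m - f` is a plain column operation.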
Illustrations from the Openscapes blog: Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
Tidy verbs
pivot_longer: get information out of names into columns
pivot_wider: make columns of observed data for levels of design variables (for comparisons)
separate/unite: split and combine columns
nest/unnest: make/unmake variables into sub-data frames of a list variable
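A rough sketch of `unite()`, `nest()` and `unnest()` on a made-up table (the tibble `df` and its values are assumptions for illustration only):

```r
library(tidyr)
library(tibble)

# Toy data (hypothetical)
df <- tibble(
  id    = c(1, 1, 2),
  sex   = c("m", "f", "m"),
  age   = c("014", "1524", "014"),
  value = c(10, 12, 7)
)

# unite: paste sex and age together into a single column
united <- df |>
  unite("sexage", sex, age, sep = "")

# nest: one row per id; the other columns move into a list-column `data`
nested <- united |>
  nest(data = c(sexage, value))

# unnest: expand the list-column back into regular rows
flat <- nested |>
  unnest(data)
```

Here `nested` has one row per `id`, and each element of `nested$data` is itself a small data frame.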
Pivot to long form
data |> pivot_longer(cols, names_to = "name", values_to = "value", ...)
pivot_longer turns a wide format into a long format
two new variables are introduced (in key-value format): name and value
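A minimal sketch on a made-up table. The tibble `wide` and its values are invented for illustration; the column names only loosely mimic the TB notification columns.

```r
library(tidyr)
library(tibble)

# Toy data (hypothetical): sex and age group are hidden in the column names
wide <- tibble(
  country     = c("A", "B"),
  year        = c(2000, 2000),
  new_sp_m014 = c(1, 4),
  new_sp_f014 = c(2, 5)
)

long <- wide |>
  pivot_longer(
    cols      = starts_with("new_sp"),  # the columns to stack
    names_to  = "name",                 # old column names go here
    values_to = "value"                 # cell values go here
  )

long
```

Each of the two original rows becomes two rows in `long`, one per stacked column, with the former column name stored in `name`.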
data |> separate_wider_delim(col, delim, names, ...)
split column col from data frame data into a set of columns as specified in names
delim is the delimiter (separator) at which col is split into columns.
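A small sketch of the call on toy data (the tibble `long` and the new column names are made up for illustration; `separate_wider_delim()` needs tidyr 1.3.0 or later):

```r
library(tidyr)
library(tibble)

# Toy data (hypothetical): three pieces of information glued together by "_"
long <- tibble(
  name  = c("new_sp_m014", "new_sp_f1524"),
  value = c(1, 2)
)

parts <- long |>
  separate_wider_delim(
    name,
    delim = "_",
    names = c("status", "method", "sexage")
  )

parts
```

The column `name` is replaced in place by the three new columns; `value` is left untouched.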
Separate TB notifications
Work on name:
tb2 <- tb1 |>
  separate_wider_delim(
    name,
    delim = "_",
    names = c("toss_new", "toss_sp", "sexage")
  )

tb2 |> na.omit() |> head()
# A tibble: 6 × 7
  country     iso3   year toss_new toss_sp sexage value
  <chr>       <chr> <dbl> <chr>    <chr>   <chr>  <dbl>
1 Afghanistan AFG    1997 new      sp      m014       0
2 Afghanistan AFG    1997 new      sp      m1524     10
3 Afghanistan AFG    1997 new      sp      m2534      6
4 Afghanistan AFG    1997 new      sp      m3544      3
5 Afghanistan AFG    1997 new      sp      m4554      5
6 Afghanistan AFG    1997 new      sp      m5564      2
Separate columns
data %>% separate_wider_position(col, widths, ...)
split column col from data frame data into a set of columns specified in widths
widths is a named numeric vector where the names become column names; unnamed components will be matched but not included.
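A minimal sketch on toy data (the tibble `d` is invented for illustration). The second width is set to the longest age code, and `too_few = "align_start"` is used so that shorter strings such as "m014" are still accepted.

```r
library(tidyr)
library(tibble)

# Toy data (hypothetical): first character is sex, the rest is the age group
d <- tibble(
  sexage = c("m014", "f1524"),
  value  = c(1, 2)
)

pieces <- d |>
  separate_wider_position(
    sexage,
    widths  = c(sex = 1, age = 4),  # names become the new column names
    too_few = "align_start"         # tolerate strings shorter than 1 + 4 characters
  )

pieces
```

Without `too_few`, a string shorter than the total width would raise an error, since the default is `too_few = "error"`.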
Separate TB notifications again
Now split sexage into first character (m/f) and rest.
tb3 <- tb2 |>
  dplyr::select(-starts_with("toss")) |>  # remove the `toss` variables
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  )

tb3 |> na.omit() |> head()
# A tibble: 6 × 6
  country     iso3   year sex   age   value
  <chr>       <chr> <dbl> <chr> <chr> <dbl>
1 Afghanistan AFG    1997 m     014       0
2 Afghanistan AFG    1997 m     1524     10
3 Afghanistan AFG    1997 m     2534      6
4 Afghanistan AFG    1997 m     3544      3
5 Afghanistan AFG    1997 m     4554      5
6 Afghanistan AFG    1997 m     5564      2
Your turn
Read the genes data from the data folder. The column names contain data and are somewhat messy.
Produce the data frame called gtidy as shown below:
head(gtidy)
# A tibble: 6 × 5
  id     trt   time  rep    expr
  <chr>  <chr> <chr> <chr> <dbl>
1 Gene 1 I     6     1      2.18
2 Gene 1 I     6     2      2.20
3 Gene 1 I     6     4      4.20
4 Gene 1 M     6     1      2.63
5 Gene 1 M     6     2      5.06
6 Gene 1 I     12    1      4.54