Appendix B — Data

This chapter describes the data sets used throughout the book, as listed in Table B.1.

Table B.1: List of data sets and their sources used in the book examples.
Name | Description | Source | Analysis
aflw | Player statistics from the AFLW | mulgar | clustering, dimension reduction
bushfires | Multivariate spatio-temporal data for locations of bushfires | mulgar | clustering, classification with RF
Australian election data | Socioeconomic characteristics of Australian electorates | https://github.com/jforbes14/eechidna-paper | dimension reduction, multicollinearity
penguins | Measurements of four physical characteristics of three species of penguins | https://allisonhorst.github.io/palmerpenguins/ | classification, clustering
pisa | OECD programme for international student assessment data | learningtower | dimension reduction, regression
sketches | Google's Quickdraw data | mulgar | neural networks, classification
multicluster | Simulated data used to show various cluster examples | mulgar | clustering
fake trees | Simulated data showing branching structure | mulgar | clustering, dimension reduction
plane and box | Simulated data showing hyper-planes | mulgar | dimension reduction
cluster | Simulated data with various clustering | mulgar | clustering
c1-c7 | Simulated data with various clustering, challenge data | mulgar | clustering
fashion MNIST | Collection of apparel images | https://github.com/zalandoresearch/fashion-mnist | classification

B.1 Australian Football League Women

Description

The aflw data is from the 2021 Women’s Australian Football League. These are average player statistics across the season, with game statistics provided by the fitzRoy package. If you are new to the game of AFL, there is a nice explanation on Wikipedia.

Variables

Rows: 381
Columns: 35
$ id              <chr> "CD_I1001678", "CD_I1001679", "CD_I1001681", "CD_I1001…
$ given_name      <chr> "Jordan", "Brianna", "Jodie", "Ebony", "Emma", "Pepa",…
$ surname         <chr> "Zanchetta", "Green", "Hicks", "Antonio", "King", "Ran…
$ number          <int> 2, 3, 5, 12, 60, 21, 22, 23, 35, 14, 3, 8, 16, 12, 19,…
$ team            <chr> "Brisbane Lions", "West Coast Eagles", "GWS Giants", "…
$ position        <chr> "INT", "INT", "HFFR", "WL", "RK", "BPL", "INT", "INT",…
$ time_pct        <dbl> 63.00000, 61.25000, 76.50000, 74.90000, 85.10000, 77.4…
$ goals           <dbl> 0.0000000, 0.0000000, 0.0000000, 0.1000000, 0.6000000,…
$ behinds         <dbl> 0.0000000, 0.0000000, 0.5000000, 0.4000000, 0.4000000,…
$ kicks           <dbl> 5.000000, 2.500000, 3.750000, 8.800000, 4.100000, 3.22…
$ handballs       <dbl> 2.500000, 3.750000, 3.000000, 3.600000, 2.700000, 2.22…
$ disposals       <dbl> 7.500000, 6.250000, 6.750000, 12.400000, 6.800000, 5.4…
$ marks           <dbl> 1.5000000, 0.2500000, 1.0000000, 3.7000000, 2.2000000,…
$ bounces         <dbl> 0.0000000, 0.0000000, 0.0000000, 0.6000000, 0.1000000,…
$ tackles         <dbl> 3.000000, 2.250000, 2.250000, 3.900000, 2.000000, 1.77…
$ contested       <dbl> 3.500000, 2.250000, 3.500000, 5.700000, 4.400000, 2.66…
$ uncontested     <dbl> 3.500000, 4.500000, 3.000000, 7.000000, 2.800000, 1.77…
$ possessions     <dbl> 7.000000, 6.750000, 6.500000, 12.700000, 7.200000, 4.4…
$ marks_in50      <dbl> 1.0000000, 0.0000000, 0.2500000, 0.5000000, 0.9000000,…
$ contested_marks <dbl> 1.0000000, 0.0000000, 0.0000000, 0.4000000, 1.2000000,…
$ hitouts         <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 19.4000000…
$ one_pct         <dbl> 0.0000000, 1.5000000, 0.5000000, 1.2000000, 2.6000000,…
$ disposal        <dbl> 60.25000, 67.15000, 37.20000, 65.96000, 61.72000, 66.8…
$ clangers        <dbl> 2.000000, 0.500000, 2.500000, 3.100000, 2.400000, 1.33…
$ frees_for       <dbl> 1.0000000, 0.5000000, 0.2500000, 2.5000000, 0.5000000,…
$ frees_against   <dbl> 1.0000000, 0.5000000, 1.2500000, 1.3000000, 1.1000000,…
$ rebounds_in50   <dbl> 0.0000000, 0.5000000, 0.2500000, 1.1000000, 0.0000000,…
$ assists         <dbl> 0.00000000, 0.00000000, 0.00000000, 0.20000000, 0.2000…
$ accuracy        <dbl> 0.00000, 0.00000, 0.00000, 5.00000, 30.00000, 0.00000,…
$ turnovers       <dbl> 1.500000, 1.000000, 2.500000, 4.000000, 1.700000, 1.22…
$ intercepts      <dbl> 2.0000000, 2.0000000, 0.5000000, 5.3000000, 1.3000000,…
$ tackles_in50    <dbl> 0.5000000, 0.0000000, 0.7500000, 0.5000000, 0.5000000,…
$ shots           <dbl> 0.5000000, 0.0000000, 0.7500000, 1.0000000, 1.2000000,…
$ metres          <dbl> 72.50000, 58.50000, 76.00000, 225.90000, 89.80000, 76.…
$ clearances      <dbl> 0.5000000, 0.2500000, 1.2500000, 0.4000000, 0.9000000,…

Purpose

The primary analysis is to summarise the variation using principal component analysis, which gives information about relationships between the statistics, or skill sets common among players. One might also be tempted to cluster the players, but there are no obvious clusters, so this could be frustrating. At best one could partition the players into groups, while recognising that there are no absolutely distinct and separated groups.
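
As a minimal sketch of this analysis, the code below runs a principal component analysis on the player statistics. It assumes the aflw data is loaded from the mulgar package and that columns 7 to 35 (time_pct through clearances in the listing above) hold the numeric statistics. Scaling each statistic to unit variance is an assumption made here because the statistics are measured in different units.

```r
library(mulgar)
data("aflw")

# Columns 7-35 are the numeric player statistics (time_pct ... clearances)
aflw_stats <- aflw[, 7:35]

# Assumption: standardise each statistic, since units differ (counts, %, metres)
aflw_pca <- prcomp(aflw_stats, center = TRUE, scale. = TRUE)

# Proportion of variance explained by the first five components
round(summary(aflw_pca)$importance[2, 1:5], 3)

# Loadings on the first two components show which statistics vary together
round(aflw_pca$rotation[, 1:2], 2)
```

Statistics with large loadings of the same sign on a component tend to be high or low together across players.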

Source

See the information provided with the fitzRoy package.

Pre-processing

The code for downloading and pre-processing the data is available at the mulgar website in the data-raw folder. The data provided by the fitzRoy package was pre-processed to reduce the variables to only those that relate to player skills and performance. Transforming some of the variables may be useful to make them less skewed.
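
To illustrate how a transformation can reduce skewness, the sketch below compares a simulated right-skewed variable before and after square root and log transformations. The simulated variable is a hypothetical stand-in for a per-game average, not a column of aflw, and the skewness() helper is defined here for illustration only.

```r
set.seed(2021)

# Hypothetical stand-in for a right-skewed per-game average
stat <- rexp(381, rate = 2)

# Simple moment-based skewness (helper defined for this example)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw  <- skewness(stat)
skew_sqrt <- skewness(sqrt(stat))
skew_log  <- skewness(log1p(stat))  # log1p handles zeros safely

round(c(raw = skew_raw, sqrt = skew_sqrt, log1p = skew_log), 2)
```

Which transformation suits a given statistic depends on its distribution, so it is worth checking each variable separately.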

B.2 Bushfires

Description

This data was collated by Weihao (Patrick) Li as part of his Honours research at Monash University. It contains fire ignitions as detected from satellite hotspots and processed using the spotoroo package, augmented with measurements on weather, vegetation, and proximity to human activity. The cause variable is predicted based on historical fire ignition data collected by Country Fire Authority personnel.

Variables

Rows: 1,021
Columns: 60
$ id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ lon           <dbl> 141.1300, 141.3000, 141.4800, 147.1600, 148.1050, 144.18…
$ lat           <dbl> -37.13000, -37.65000, -37.35000, -37.85000, -37.57999, -…
$ time          <date> 2019-10-01, 2019-10-01, 2019-10-02, 2019-10-02, 2019-10…
$ FOR_CODE      <dbl> 41, 41, 91, 44, 0, 44, 0, 102, 0, 91, 45, 41, 45, 45, 45…
$ FOR_TYPE      <chr> "Eucalypt Medium Woodland", "Eucalypt Medium Woodland", …
$ FOR_CAT       <chr> "Native forest", "Native forest", "Commercial plantation…
$ COVER         <dbl> 1, 1, 4, 2, 6, 2, 6, 5, 6, 4, 2, 1, 2, 2, 2, 2, 6, 6, 6,…
$ HEIGHT        <dbl> 2, 2, 4, 2, 6, 2, 6, 5, 6, 4, 3, 2, 3, 3, 3, 2, 6, 6, 6,…
$ FOREST        <dbl> 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,…
$ rf            <dbl> 0.0, 0.0, 15.4, 4.8, 6.0, 11.6, 11.6, 0.6, 0.2, 0.6, 0.0…
$ arf7          <dbl> 5.0857143, 2.4000000, 2.4000000, 0.7142857, 0.8571429, 1…
$ arf14         <dbl> 2.8142857, 1.7428571, 1.8000000, 1.6714286, 1.5714286, 1…
$ arf28         <dbl> 1.9785714, 1.5357143, 1.5357143, 3.7857143, 1.9000000, 1…
$ arf60         <dbl> 2.3033333, 1.7966667, 1.7966667, 4.0000000, 2.5333333, 1…
$ arf90         <dbl> 1.2566667, 1.0150000, 1.0150000, 2.9600000, 2.1783333, 1…
$ arf180        <dbl> 0.9355556, 0.8444444, 0.8444444, 2.3588889, 1.7866667, 1…
$ arf360        <dbl> 1.3644444, 1.5255556, 1.5255556, 1.7272222, 1.4716667, 1…
$ arf720        <dbl> 1.3011111, 1.5213889, 1.5213889, 1.7111111, 1.5394444, 1…
$ se            <dbl> 3.8, 4.6, 14.2, 23.7, 23.8, 16.8, 18.0, 12.9, 14.7, 12.9…
$ ase7          <dbl> 18.02857, 18.50000, 21.41429, 23.08571, 23.11429, 22.014…
$ ase14         <dbl> 17.03571, 17.44286, 18.03571, 19.17143, 18.45714, 18.628…
$ ase28         <dbl> 19.32857, 18.47500, 19.33929, 18.23571, 16.86071, 19.375…
$ ase60         <dbl> 20.38644, 19.99153, 20.39492, 19.90847, 19.26780, 20.449…
$ ase90         <dbl> 22.54118, 21.93193, 22.04370, 20.59328, 20.04538, 21.809…
$ ase180        <dbl> 20.79106, 19.93966, 19.99385, 19.11006, 18.66760, 19.810…
$ ase360        <dbl> 15.55153, 14.83259, 14.87883, 14.69276, 14.44318, 14.755…
$ ase720        <dbl> 15.52350, 14.75049, 14.77427, 14.53463, 14.32656, 14.540…
$ maxt          <dbl> 21.3, 17.8, 15.4, 20.8, 19.8, 15.8, 19.5, 12.6, 18.8, 12…
$ amaxt7        <dbl> 22.38571, 20.44286, 22.21429, 24.21429, 23.14286, 21.671…
$ amaxt14       <dbl> 21.42857, 19.72857, 19.86429, 21.80000, 20.89286, 19.578…
$ amaxt28       <dbl> 20.71071, 19.10000, 19.18929, 19.75000, 19.05714, 18.885…
$ amaxt60       <dbl> 24.02667, 22.28000, 22.38667, 22.93167, 22.12000, 21.031…
$ amaxt90       <dbl> 27.07750, 25.77667, 25.89833, 24.93667, 23.93750, 23.164…
$ amaxt180      <dbl> 26.92000, 25.92722, 25.98500, 24.84056, 23.95389, 23.343…
$ amaxt360      <dbl> 21.55389, 20.79778, 20.81333, 20.21972, 19.99389, 19.505…
$ amaxt720      <dbl> 21.47750, 20.57222, 20.57694, 20.13153, 20.03875, 19.650…
$ mint          <dbl> 9.6, 9.0, 7.3, 7.7, 8.3, 8.3, 6.1, 5.9, 7.4, 5.9, 6.9, 7…
$ amint7        <dbl> 9.042857, 7.971429, 9.171429, 10.328571, 11.200000, 10.6…
$ amint14       <dbl> 9.928571, 9.235714, 9.421429, 10.007143, 10.900000, 10.7…
$ amint28       <dbl> 8.417857, 7.560714, 7.353571, 8.671429, 9.575000, 10.060…
$ amint60       <dbl> 11.156667, 9.903333, 9.971667, 10.971667, 11.975000, 12.…
$ amint90       <dbl> 11.96667, 10.81250, 10.87833, 12.49000, 13.46167, 13.638…
$ amint180      <dbl> 11.96778, 11.01056, 11.02000, 12.41944, 13.42500, 13.695…
$ amint360      <dbl> 9.130556, 8.459722, 8.448333, 9.588611, 10.456389, 11.03…
$ amint720      <dbl> 8.854861, 8.266250, 8.254028, 9.674861, 10.517083, 10.96…
$ dist_cfa      <dbl> 9442.206, 6322.438, 7957.374, 7790.785, 10692.055, 6054.…
$ dist_camp     <dbl> 50966.485, 6592.893, 31767.235, 8816.272, 15339.702, 941…
$ ws            <dbl> 1.263783, 1.263783, 1.456564, 5.424445, 4.219751, 4.1769…
$ aws_m0        <dbl> 2.644795, 2.644795, 2.644795, 5.008369, 3.947659, 5.2316…
$ aws_m1        <dbl> 2.559202, 2.559202, 2.559202, 5.229680, 4.027398, 4.9704…
$ aws_m3        <dbl> 2.446211, 2.446211, 2.446211, 5.386005, 3.708622, 5.3045…
$ aws_m6        <dbl> 2.144843, 2.144843, 2.144843, 5.132617, 3.389890, 5.0355…
$ aws_m12       <dbl> 2.545008, 2.545008, 2.548953, 5.045297, 3.698736, 5.2341…
$ aws_m24       <dbl> 2.580671, 2.580671, 2.584047, 5.081100, 3.745286, 5.2522…
$ dist_road     <dbl> 498.75145, 102.22032, 1217.22446, 281.69151, 215.56176, …
$ log_dist_cfa  <dbl> 9.152945, 8.751860, 8.981854, 8.960697, 9.277256, 8.7084…
$ log_dist_camp <dbl> 10.838924, 8.793748, 10.366191, 9.084354, 9.638200, 9.15…
$ log_dist_road <dbl> 6.212108, 4.627130, 7.104329, 5.640813, 5.373247, 5.0047…
$ cause         <chr> "lightning", "lightning", "lightning", "lightning", "lig…

Purpose

The primary goal is to predict the cause of the bushfire using the weather and distance from human activity variables provided.

Source

Collated data was part of Weihao Li’s Honours thesis, which is not publicly available. The hotspots data was collected from P-Tree System (2020), climate data was taken from the Australian Bureau of Meteorology using the bomrang package (Sparks et al., 2020), wind data from McVicar (2011) and Iowa State University (2020), vegetation data from Australian Bureau of Agricultural and Resource Economics and Sciences (2018), distance from roads calculated using OpenStreetMap contributors (2020), CFA stations from Department of Environment, Land, Water & Planning (2020a), and campsites from Department of Environment, Land, Water & Planning (2020b). The cause was predicted from training data provided by Department of Environment, Land, Water & Planning (2019).

Pre-processing

The 60 variables are too many to view with a tour, so the data should be pre-processed using principal component analysis. The categorical variables FOR_TYPE and FOR_CAT are removed. It would be possible to keep these if they were converted to dummy (binary) variables.
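
The conversion to dummy variables mentioned above could look something like the following sketch. The small data frame is a hypothetical stand-in that borrows a few values from the listing above; the real data has 1,021 rows and 60 columns.

```r
# Hypothetical stand-in rows with one categorical and two numeric variables
fires <- data.frame(
  FOR_CAT = c("Native forest", "Commercial plantation",
              "Native forest", "Commercial plantation"),
  rf      = c(0.0, 15.4, 4.8, 0.6),
  maxt    = c(21.3, 15.4, 20.8, 12.6)
)

# One 0/1 indicator column per category level
# (the "- 1" removes the intercept, so no level is dropped)
dummies <- model.matrix(~ FOR_CAT - 1, data = fires)

# Combine the indicators with the numeric variables, then reduce with PCA
fires_num <- cbind(fires[, c("rf", "maxt")], dummies)
fires_pca <- prcomp(fires_num, center = TRUE, scale. = TRUE)

colnames(fires_num)
```

Keeping an indicator for every level makes the indicator columns linearly dependent, which is harmless for principal component analysis but would need care in a regression model.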

B.3 Australian election data

Description

This is data from a study on the relationship between voting patterns and socio-demographic characteristics of Australian electorates reported in Forbes et al. (2020). These are the predictor variables upon which voting percentages are modelled. There are two years of data in oz_election_2001 and oz_election_2016.

Variables

library(dplyr)  # provides glimpse()
load("data/oz_election_2001.rda")
load("data/oz_election_2016.rda")
glimpse(oz_election_2001)

Purpose

The tour is used to check for multicollinearity among the predictors, which might adversely affect the linear model fit.
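
A numerical check can complement the visual one. The sketch below is not the tour itself: it computes variance inflation factors from the inverse of the correlation matrix, using simulated predictors as a hypothetical stand-in for the election variables.

```r
set.seed(2016)
n <- 150

# Hypothetical stand-in predictors; x3 is nearly a linear
# combination of x1 and x2, so it is collinear with them
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.7 * x1 + 0.7 * x2 + rnorm(n, sd = 0.1)
x4 <- rnorm(n)
X <- cbind(x1, x2, x3, x4)

# Variance inflation factors are the diagonal of the
# inverse of the predictor correlation matrix
vif <- diag(solve(cor(X)))
round(vif, 1)
```

Large values flag predictors that are well explained by the others; in a tour the same structure appears as the points collapsing into a lower-dimensional shape in some projections.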

Source

The data was compiled from the Australian Electoral Commission (AEC) and the Australian Bureau of Statistics (ABS). Code to construct the data, and the original data, are available at https://github.com/jforbes14/eechidna-paper.

Pre-processing

Considerable pre-processing was done to produce these data sets. The original data was wrangled into tidy form, some variables were log transformed to reduce skewness, and a subset of variables was chosen.

B.4 Palmer penguins

Code
library(palmerpenguins)
library(dplyr)  # provides %>%, select(), mutate(), rename()
penguins <- penguins %>%
  na.omit() # 11 observations out of 344 removed
# use only vars of interest, and standardise
# them for easier interpretation
penguins_sub <- penguins %>% 
  select(bill_length_mm,
         bill_depth_mm,
         flipper_length_mm,
         body_mass_g,
         species, 
         sex) %>% 
  mutate(across(where(is.numeric),  ~ scale(.)[,1])) %>%
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g)
save(penguins_sub, file="data/penguins_sub.rda")

Description

This data measures four physical characteristics of three species of penguins.

Variables

Name Description
bl a number denoting bill length (millimeters)
bd a number denoting bill depth (millimeters)
fl an integer denoting flipper length (millimeters)
bm an integer denoting body mass (grams)
species a factor denoting penguin species (Adélie, Chinstrap and Gentoo)

Purpose

The primary goal is to find a combination of the four variables where the three species are distinct. This is also a useful data set to illustrate cluster analysis.
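
One way to look for such a combination is linear discriminant analysis, sketched below with lda() from the MASS package. This is an illustrative choice of method rather than the only one, and it works from the original palmerpenguins column names instead of the shortened ones.

```r
library(palmerpenguins)
library(MASS)

# Complete cases only (333 of the 344 penguins)
p <- as.data.frame(na.omit(penguins))

# Standardise the four physical measurements
vars <- c("bill_length_mm", "bill_depth_mm",
          "flipper_length_mm", "body_mass_g")
p[, vars] <- scale(p[, vars])

# Linear discriminant analysis finds linear combinations of the
# variables that best separate the three species
fit <- lda(species ~ bill_length_mm + bill_depth_mm +
             flipper_length_mm + body_mass_g, data = p)

# Coefficients of the two discriminant directions
round(fit$scaling, 2)

# Cross-tabulate fitted against actual species
table(predicted = predict(fit)$class, actual = p$species)
```

With three species there are at most two discriminant directions, so the separation can be examined in a single two-dimensional plot of the discriminant scores.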

Source

Details of the penguins data can be found at https://allisonhorst.github.io/palmerpenguins/, and Horst et al. (2022) is the package source.

Pre-processing

The data is loaded from the palmerpenguins package. The four physical measurement variables and the species are selected, and the penguins with missing values are removed. Variables are standardised, and their names are shortened.

library(palmerpenguins)
library(dplyr)  # provides %>%, mutate(), rename()
penguins <- penguins %>%
  na.omit() # 11 observations out of 344 removed
# use only vars of interest, and standardise
# them for easier interpretation
penguins_sub <- penguins[,c(3:6, 1)] %>% 
  mutate(across(where(is.numeric),  ~ scale(.)[,1])) %>%
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g) %>%
  as.data.frame()
save(penguins_sub, file="data/penguins_sub.rda")

B.5 Program for International Student Assessment

Description

The pisa data contains plausible scores for math, reading and science of Australian and Indonesian students from the 2018 testing cycle. The plausible scores are simulated from a model fitted to the original data, to preserve privacy of the students.

Variables

Name Description
CNT country, either AUS for Australia or IDN for Indonesia
PV1MATH-PV10MATH plausible scores for math
PV1READ-PV10READ plausible scores for reading
PV1SCIE-PV10SCIE plausible scores for science

Purpose

Primarily this data is useful as an example for dimension reduction.

Source

The full data is available from https://www.oecd.org/pisa/. There are records of the student test scores, along with survey data from the students, their households and their schools.

Pre-processing

The data was reduced to country and the plausible scores, and filtered to the two countries. It may be helpful to know that the SPSS format data was used, and was read into R using the read_sav() function in the haven package.
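
The reduction described above can be sketched as follows. The small data frame is a hypothetical stand-in with made-up values; with the real data, the object returned by read_sav() would take its place. The survey_item column is an invented example of a column that is dropped.

```r
# Hypothetical stand-in for the student data (made-up values)
students <- data.frame(
  CNT         = c("AUS", "IDN", "NZL", "AUS"),
  PV1MATH     = c(510.2, 380.5, 495.1, 470.8),
  PV1READ     = c(520.7, 371.3, 505.9, 488.4),
  PV1SCIE     = c(515.0, 396.2, 508.3, 479.6),
  survey_item = c(1, 2, 1, 2)
)

# Keep only the two countries of interest
keep_rows <- students$CNT %in% c("AUS", "IDN")

# Keep the country code and the plausible score columns,
# which are named PV<k>MATH, PV<k>READ and PV<k>SCIE
keep_cols <- names(students) == "CNT" |
  grepl("^PV[0-9]+(MATH|READ|SCIE)$", names(students))

pisa_small <- students[keep_rows, keep_cols]
pisa_small
```

Matching on the column name pattern avoids typing all thirty plausible score names by hand.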

B.6 Sketches

Description

This data is a subset of images from https://quickdraw.withgoogle.com. The subset was created using the quickdraw R package at https://huizezhang-sherry.github.io/quickdraw/. It has 6 different groups: banana, boomerang, cactus, flip flops, kangaroo. Each image is 28x28 pixels. The sketches_train data can be used to train a classification model, and the unlabelled sketches_test can be used for prediction.

Variables

Name Description
V1-V784 grey scale 0-255
word what the person was asked to draw, NA in the test data
id unique id for each sketch

Purpose

Primarily this data is useful as an example for supervised classification, and also dimension reduction.

Source

The full data is available from https://quickdraw.withgoogle.com.

Pre-processing

It is typically useful to pre-process this data into principal components. This code can also be useful for plotting one of the sketches in a recognisable form:

library(mulgar)
library(ggplot2)
data("sketches_train")
set.seed(77)
x <- sketches_train[sample(1:nrow(sketches_train), 1), ]
xm <- data.frame(gry=t(as.matrix(x[,1:784])),
        x=rep(1:28, 28),
        y=rep(28:1, rep(28, 28)))
ggplot(xm, aes(x=x, y=y, fill=gry)) +
  geom_tile() +
  scale_fill_gradientn(colors = gray.colors(256, 
                                     start = 0, 
                                     end = 1, 
                                     rev = TRUE )) +
  ggtitle(x$word) +
  theme_void() + 
    theme(legend.position="none")
Figure B.1: One of the sketches in the subset of training data.
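
The pre-processing into principal components mentioned above might be done along the following lines. This is a sketch that assumes the 784 pixel columns come first, as in the plotting code. The pixels are centred but not scaled, an assumption made here because pixels that are blank in every sketch have zero variance and cannot be standardised.

```r
library(mulgar)
data("sketches_train")

# Pixel values are in the first 784 columns (V1-V784)
pixels <- sketches_train[, 1:784]

# Centre only: constant pixels would break scaling to unit variance
sketches_pca <- prcomp(pixels, center = TRUE, scale. = FALSE)

# Cumulative proportion of variance explained
var_explained <- cumsum(sketches_pca$sdev^2) / sum(sketches_pca$sdev^2)
round(head(var_explained, 10), 3)

# Scores on the first few components can replace the 784 pixel
# columns as inputs to a classifier or a tour
sketches_pc <- as.data.frame(sketches_pca$x[, 1:5])
sketches_pc$word <- sketches_train$word
```

How many components to keep is a judgment call; five are kept here purely for illustration.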

B.7 multicluster

Code
library(mulgar)
data("multicluster")

Description

This data has 10 numeric variables, and a class variable labelling groups.

Variables

Name Description
group cluster label
x1-x10 numeric variables

Purpose

The primary goal is to find the different clusters.

Source

This data is originally from http://ifs.tuwien.ac.at/dm/download/multiChallenge-matrix.txt, and was provided as a challenge for non-linear dimension reduction. It was used as an example in Lee, Laa, and Cook (2023) https://doi.org/10.52933/jdssv.v2i3.
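
As a starting point for finding the clusters, the sketch below applies hierarchical clustering and compares the result with the provided labels. The choice of Ward linkage on standardised variables is an assumption made for illustration, and the number of clusters is taken from the group variable instead of being estimated.

```r
library(mulgar)
data("multicluster")

# Use only the numeric variables x1-x10
x <- multicluster[, paste0("x", 1:10)]

# Assumption: Ward linkage on Euclidean distances of standardised variables
hc <- hclust(dist(scale(x)), method = "ward.D2")

# Cut the tree into as many clusters as there are labelled groups
k <- length(unique(multicluster$group))
cl <- cutree(hc, k = k)

# Compare the clustering solution with the provided labels
table(cluster = cl, group = multicluster$group)
```

A table with one dominant cell in each row and column indicates that the clustering recovers the labelled groups.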

B.8 clusters, clusters_nonlin, simple_clusters

Code
library(mulgar)
data("clusters")
data("clusters_nonlin")
data("simple_clusters")

Description

These data sets have varying numbers of numeric variables, and a class variable labelling the clusters.

Variables

Name Description
x1-x5 numeric variables
cl cluster label

Purpose

The primary goal is to find the different clusters.

Source

Simulated using the code in the simulate.R file of the data-raw directory of the mulgar package.

B.9 plane, plane_nonlin, box

Code
library(mulgar)
data("plane")
data("plane_nonlin")
data("box")

Description

These data sets have varying numbers of numeric variables.

Variables

Name Description
x1-x5 numeric variables

Purpose

The primary goal is to understand how many dimensions the data spreads out in.

Source

Simulated using the code in the simulate.R file of the data-raw directory of the mulgar package.
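
To show how the number of dimensions can be read from a principal component analysis, the sketch below simulates a two-dimensional plane embedded in five dimensions. This is a hypothetical stand-in written for this example, not the code in simulate.R, and the 0.1 threshold is an arbitrary cut-off chosen to separate the signal from the small added noise.

```r
set.seed(5)
n <- 500

# Two underlying dimensions
z <- matrix(rnorm(n * 2), ncol = 2)

# Fixed linear map from two to five dimensions
a <- rbind(c(1, 0, 1, 0,  1),
           c(0, 1, 0, 1, -1))

# Embed the plane in 5D and add a small amount of noise
x <- z %*% a + matrix(rnorm(n * 5, sd = 0.01), ncol = 5)
colnames(x) <- paste0("x", 1:5)

pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Two standard deviations are large; the remaining three are near zero
round(pca$sdev, 3)

# Count the components that carry meaningful variation
n_dims <- sum(pca$sdev > 0.1)
n_dims
```

For data like plane_nonlin, where the relationship is not linear, the principal component standard deviations can overstate the dimension, which is why viewing the data with a tour is also useful.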

B.10 Additional data used in the book

Table C.1 lists additional data available on the book web site at https://dicook.github.io/mulgar_book/data.