Name | Description | Source | Analysis |
---|---|---|---|
aflw | Player statistics from the AFLW | mulgar | clustering, dimension reduction |
bushfires | Multivariate spatio-temporal data for locations of bushfires | mulgar | clustering, classification with RF |
Australian election data | Socioeconomic characteristics of Australian electorates | https://github.com/jforbes14/eechidna-paper | dimension reduction, multicollinearity |
penguins | Measurements of four physical characteristics of three species of penguins | https://allisonhorst.github.io/palmerpenguins/ | classification, clustering |
pisa | OECD programme for international student assessment data | learningtower | dimension reduction, regression |
sketches | Google's Quickdraw data | mulgar | neural networks, classification |
multicluster | Simulated data used to show various cluster examples | mulgar | clustering |
fake trees | Simulated data showing branching structure | mulgar | clustering, dimension reduction |
plane and box | Simulated data showing hyper-planes | mulgar | dimension reduction |
cluster | Simulated data with various clustering | mulgar | clustering |
c1-c7 | Simulated data with various clustering, challenge data | mulgar | clustering |
fashion MNIST | Collection of apparel images | https://github.com/zalandoresearch/fashion-mnist | classification |
Appendix B — Data
This chapter describes the datasets used throughout the book as listed in Table B.1.
B.1 Australian Football League Women
Description
The aflw data is from the 2021 Women’s Australian Football League. These are average player statistics across the season, with game statistics provided by the fitzRoy package. If you are new to the game of AFL, there is a nice explanation on Wikipedia.
Variables
Rows: 381
Columns: 35
$ id <chr> "CD_I1001678", "CD_I1001679", "CD_I1001681", "CD_I1001…
$ given_name <chr> "Jordan", "Brianna", "Jodie", "Ebony", "Emma", "Pepa",…
$ surname <chr> "Zanchetta", "Green", "Hicks", "Antonio", "King", "Ran…
$ number <int> 2, 3, 5, 12, 60, 21, 22, 23, 35, 14, 3, 8, 16, 12, 19,…
$ team <chr> "Brisbane Lions", "West Coast Eagles", "GWS Giants", "…
$ position <chr> "INT", "INT", "HFFR", "WL", "RK", "BPL", "INT", "INT",…
$ time_pct <dbl> 63.00000, 61.25000, 76.50000, 74.90000, 85.10000, 77.4…
$ goals <dbl> 0.0000000, 0.0000000, 0.0000000, 0.1000000, 0.6000000,…
$ behinds <dbl> 0.0000000, 0.0000000, 0.5000000, 0.4000000, 0.4000000,…
$ kicks <dbl> 5.000000, 2.500000, 3.750000, 8.800000, 4.100000, 3.22…
$ handballs <dbl> 2.500000, 3.750000, 3.000000, 3.600000, 2.700000, 2.22…
$ disposals <dbl> 7.500000, 6.250000, 6.750000, 12.400000, 6.800000, 5.4…
$ marks <dbl> 1.5000000, 0.2500000, 1.0000000, 3.7000000, 2.2000000,…
$ bounces <dbl> 0.0000000, 0.0000000, 0.0000000, 0.6000000, 0.1000000,…
$ tackles <dbl> 3.000000, 2.250000, 2.250000, 3.900000, 2.000000, 1.77…
$ contested <dbl> 3.500000, 2.250000, 3.500000, 5.700000, 4.400000, 2.66…
$ uncontested <dbl> 3.500000, 4.500000, 3.000000, 7.000000, 2.800000, 1.77…
$ possessions <dbl> 7.000000, 6.750000, 6.500000, 12.700000, 7.200000, 4.4…
$ marks_in50 <dbl> 1.0000000, 0.0000000, 0.2500000, 0.5000000, 0.9000000,…
$ contested_marks <dbl> 1.0000000, 0.0000000, 0.0000000, 0.4000000, 1.2000000,…
$ hitouts <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 19.4000000…
$ one_pct <dbl> 0.0000000, 1.5000000, 0.5000000, 1.2000000, 2.6000000,…
$ disposal <dbl> 60.25000, 67.15000, 37.20000, 65.96000, 61.72000, 66.8…
$ clangers <dbl> 2.000000, 0.500000, 2.500000, 3.100000, 2.400000, 1.33…
$ frees_for <dbl> 1.0000000, 0.5000000, 0.2500000, 2.5000000, 0.5000000,…
$ frees_against <dbl> 1.0000000, 0.5000000, 1.2500000, 1.3000000, 1.1000000,…
$ rebounds_in50 <dbl> 0.0000000, 0.5000000, 0.2500000, 1.1000000, 0.0000000,…
$ assists <dbl> 0.00000000, 0.00000000, 0.00000000, 0.20000000, 0.2000…
$ accuracy <dbl> 0.00000, 0.00000, 0.00000, 5.00000, 30.00000, 0.00000,…
$ turnovers <dbl> 1.500000, 1.000000, 2.500000, 4.000000, 1.700000, 1.22…
$ intercepts <dbl> 2.0000000, 2.0000000, 0.5000000, 5.3000000, 1.3000000,…
$ tackles_in50 <dbl> 0.5000000, 0.0000000, 0.7500000, 0.5000000, 0.5000000,…
$ shots <dbl> 0.5000000, 0.0000000, 0.7500000, 1.0000000, 1.2000000,…
$ metres <dbl> 72.50000, 58.50000, 76.00000, 225.90000, 89.80000, 76.…
$ clearances <dbl> 0.5000000, 0.2500000, 1.2500000, 0.4000000, 0.9000000,…
Purpose
The primary analysis is to summarise the variation using principal component analysis, which gives information about relationships between the statistics, or skill sets common in players. One might also be tempted to cluster the players, but there are no obvious clusters, so this could be frustrating. At best one could partition the players into groups, while recognising that there are no absolutely distinct and separated groups.
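As a minimal sketch of this kind of summary, the code below runs a principal component analysis with base R's prcomp() on a small simulated stand-in for player statistics. The helper pca_summary() and the toy data are assumptions made for illustration; they are not taken from the book or from the mulgar package.

```r
# Hypothetical helper (not from the book): PCA on the numeric columns of a
# data frame, returning the proportion of variance explained per component.
pca_summary <- function(d) {
  num <- d[, vapply(d, is.numeric, logical(1)), drop = FALSE]
  pca <- prcomp(num, scale. = TRUE)  # standardise: statistics are on different scales
  var_prop <- pca$sdev^2 / sum(pca$sdev^2)
  list(pca = pca, var_prop = var_prop, cum_prop = cumsum(var_prop))
}

# Simulated stand-in for season-average player statistics (made-up values)
set.seed(1)
n <- 200
skill <- rnorm(n)
toy <- data.frame(
  player    = paste0("p", seq_len(n)),
  kicks     = 5 + 2 * skill + rnorm(n, sd = 0.5),
  handballs = 3 + 1.5 * skill + rnorm(n, sd = 0.5),
  tackles   = 2 + rnorm(n)
)

res <- pca_summary(toy)
round(res$var_prop, 2)  # proportion of variance per component
res$pca$rotation[, 1]   # loadings of the first component
```

With the real data one would probably pass only the performance statistics, for example the columns from time_pct to clearances in the listing above, because identifier-like integer columns such as number would otherwise be treated as numeric variables.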
Source
See the information provided with the fitzRoy package.
Pre-processing
The code for downloading and pre-processing the data is available at the mulgar website in the data-raw folder. The data provided by the fitzRoy package was pre-processed to reduce the variables to only those that relate to player skills and performance. It is possible that using some transformations on the variables would be useful to make them less skewed.
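To illustrate what such a transformation can do, here is a small sketch that compares the sample skewness of a right-skewed variable before and after a square root and a log transformation. The skewness() helper and the simulated values are assumptions for illustration only; the variable is not the real hitouts column.

```r
# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated right-skewed statistic (made-up values, not the aflw data)
set.seed(5)
stat <- rexp(300, rate = 0.5)

skew_raw  <- skewness(stat)
skew_sqrt <- skewness(sqrt(stat))
skew_log  <- skewness(log1p(stat))  # log(1 + x) copes with zeros

round(c(raw = skew_raw, sqrt = skew_sqrt, log1p = skew_log), 2)
```

Which transformation suits a given variable is a judgement call; log1p() is used here rather than log() because many player statistics contain zeros.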
B.2 Bushfires
Description
This data was collated by Weihao (Patrick) Li as part of his Honours research at Monash University. It contains fire ignitions as detected from satellite hotspots, and processed using the spotoroo package, augmented with measurements on weather, vegetation, and proximity to human activity. The cause variable is predicted based on historical fire ignition data collected by Country Fire Authority personnel.
Variables
Rows: 1,021
Columns: 60
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ lon <dbl> 141.1300, 141.3000, 141.4800, 147.1600, 148.1050, 144.18…
$ lat <dbl> -37.13000, -37.65000, -37.35000, -37.85000, -37.57999, -…
$ time <date> 2019-10-01, 2019-10-01, 2019-10-02, 2019-10-02, 2019-10…
$ FOR_CODE <dbl> 41, 41, 91, 44, 0, 44, 0, 102, 0, 91, 45, 41, 45, 45, 45…
$ FOR_TYPE <chr> "Eucalypt Medium Woodland", "Eucalypt Medium Woodland", …
$ FOR_CAT <chr> "Native forest", "Native forest", "Commercial plantation…
$ COVER <dbl> 1, 1, 4, 2, 6, 2, 6, 5, 6, 4, 2, 1, 2, 2, 2, 2, 6, 6, 6,…
$ HEIGHT <dbl> 2, 2, 4, 2, 6, 2, 6, 5, 6, 4, 3, 2, 3, 3, 3, 2, 6, 6, 6,…
$ FOREST <dbl> 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,…
$ rf <dbl> 0.0, 0.0, 15.4, 4.8, 6.0, 11.6, 11.6, 0.6, 0.2, 0.6, 0.0…
$ arf7 <dbl> 5.0857143, 2.4000000, 2.4000000, 0.7142857, 0.8571429, 1…
$ arf14 <dbl> 2.8142857, 1.7428571, 1.8000000, 1.6714286, 1.5714286, 1…
$ arf28 <dbl> 1.9785714, 1.5357143, 1.5357143, 3.7857143, 1.9000000, 1…
$ arf60 <dbl> 2.3033333, 1.7966667, 1.7966667, 4.0000000, 2.5333333, 1…
$ arf90 <dbl> 1.2566667, 1.0150000, 1.0150000, 2.9600000, 2.1783333, 1…
$ arf180 <dbl> 0.9355556, 0.8444444, 0.8444444, 2.3588889, 1.7866667, 1…
$ arf360 <dbl> 1.3644444, 1.5255556, 1.5255556, 1.7272222, 1.4716667, 1…
$ arf720 <dbl> 1.3011111, 1.5213889, 1.5213889, 1.7111111, 1.5394444, 1…
$ se <dbl> 3.8, 4.6, 14.2, 23.7, 23.8, 16.8, 18.0, 12.9, 14.7, 12.9…
$ ase7 <dbl> 18.02857, 18.50000, 21.41429, 23.08571, 23.11429, 22.014…
$ ase14 <dbl> 17.03571, 17.44286, 18.03571, 19.17143, 18.45714, 18.628…
$ ase28 <dbl> 19.32857, 18.47500, 19.33929, 18.23571, 16.86071, 19.375…
$ ase60 <dbl> 20.38644, 19.99153, 20.39492, 19.90847, 19.26780, 20.449…
$ ase90 <dbl> 22.54118, 21.93193, 22.04370, 20.59328, 20.04538, 21.809…
$ ase180 <dbl> 20.79106, 19.93966, 19.99385, 19.11006, 18.66760, 19.810…
$ ase360 <dbl> 15.55153, 14.83259, 14.87883, 14.69276, 14.44318, 14.755…
$ ase720 <dbl> 15.52350, 14.75049, 14.77427, 14.53463, 14.32656, 14.540…
$ maxt <dbl> 21.3, 17.8, 15.4, 20.8, 19.8, 15.8, 19.5, 12.6, 18.8, 12…
$ amaxt7 <dbl> 22.38571, 20.44286, 22.21429, 24.21429, 23.14286, 21.671…
$ amaxt14 <dbl> 21.42857, 19.72857, 19.86429, 21.80000, 20.89286, 19.578…
$ amaxt28 <dbl> 20.71071, 19.10000, 19.18929, 19.75000, 19.05714, 18.885…
$ amaxt60 <dbl> 24.02667, 22.28000, 22.38667, 22.93167, 22.12000, 21.031…
$ amaxt90 <dbl> 27.07750, 25.77667, 25.89833, 24.93667, 23.93750, 23.164…
$ amaxt180 <dbl> 26.92000, 25.92722, 25.98500, 24.84056, 23.95389, 23.343…
$ amaxt360 <dbl> 21.55389, 20.79778, 20.81333, 20.21972, 19.99389, 19.505…
$ amaxt720 <dbl> 21.47750, 20.57222, 20.57694, 20.13153, 20.03875, 19.650…
$ mint <dbl> 9.6, 9.0, 7.3, 7.7, 8.3, 8.3, 6.1, 5.9, 7.4, 5.9, 6.9, 7…
$ amint7 <dbl> 9.042857, 7.971429, 9.171429, 10.328571, 11.200000, 10.6…
$ amint14 <dbl> 9.928571, 9.235714, 9.421429, 10.007143, 10.900000, 10.7…
$ amint28 <dbl> 8.417857, 7.560714, 7.353571, 8.671429, 9.575000, 10.060…
$ amint60 <dbl> 11.156667, 9.903333, 9.971667, 10.971667, 11.975000, 12.…
$ amint90 <dbl> 11.96667, 10.81250, 10.87833, 12.49000, 13.46167, 13.638…
$ amint180 <dbl> 11.96778, 11.01056, 11.02000, 12.41944, 13.42500, 13.695…
$ amint360 <dbl> 9.130556, 8.459722, 8.448333, 9.588611, 10.456389, 11.03…
$ amint720 <dbl> 8.854861, 8.266250, 8.254028, 9.674861, 10.517083, 10.96…
$ dist_cfa <dbl> 9442.206, 6322.438, 7957.374, 7790.785, 10692.055, 6054.…
$ dist_camp <dbl> 50966.485, 6592.893, 31767.235, 8816.272, 15339.702, 941…
$ ws <dbl> 1.263783, 1.263783, 1.456564, 5.424445, 4.219751, 4.1769…
$ aws_m0 <dbl> 2.644795, 2.644795, 2.644795, 5.008369, 3.947659, 5.2316…
$ aws_m1 <dbl> 2.559202, 2.559202, 2.559202, 5.229680, 4.027398, 4.9704…
$ aws_m3 <dbl> 2.446211, 2.446211, 2.446211, 5.386005, 3.708622, 5.3045…
$ aws_m6 <dbl> 2.144843, 2.144843, 2.144843, 5.132617, 3.389890, 5.0355…
$ aws_m12 <dbl> 2.545008, 2.545008, 2.548953, 5.045297, 3.698736, 5.2341…
$ aws_m24 <dbl> 2.580671, 2.580671, 2.584047, 5.081100, 3.745286, 5.2522…
$ dist_road <dbl> 498.75145, 102.22032, 1217.22446, 281.69151, 215.56176, …
$ log_dist_cfa <dbl> 9.152945, 8.751860, 8.981854, 8.960697, 9.277256, 8.7084…
$ log_dist_camp <dbl> 10.838924, 8.793748, 10.366191, 9.084354, 9.638200, 9.15…
$ log_dist_road <dbl> 6.212108, 4.627130, 7.104329, 5.640813, 5.373247, 5.0047…
$ cause <chr> "lightning", "lightning", "lightning", "lightning", "lig…
Purpose
The primary goal is to predict the cause of the bushfire using the weather and distance from human activity variables provided.
Source
Collated data was part of Weihao Li’s Honours thesis, which is not publicly available. The hotspots data was collected from P-Tree System (2020), climate data was taken from the Australian Bureau of Meteorology using the bomrang package (Sparks et al., 2020), wind data from McVicar (2011) and Iowa State University (2020), vegetation data from Australian Bureau of Agricultural and Resource Economics and Sciences (2018), distance from roads calculated using OpenStreetMap contributors (2020), CFA stations from Department of Environment, Land, Water & Planning (2020a), and campsites from Department of Environment, Land, Water & Planning (2020b). The cause was predicted from training data provided by Department of Environment, Land, Water & Planning (2019).
Pre-processing
The 60 variables are too many to view with a tour, so the data should be pre-processed using principal component analysis. The categorical variables FOR_TYPE and FOR_CAT are removed. It would be possible to keep these if they were converted to dummy (binary) variables.
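The two options for the categorical variables can be sketched as follows, using a small simulated stand-in rather than the real bushfires data. The values, the third category level "Other" and the choice of three components are assumptions for illustration.

```r
# Simulated stand-in with two numeric variables and one categorical variable
# (made-up values; "Other" is a hypothetical level)
set.seed(2)
n <- 60
toy <- data.frame(
  rf      = rexp(n),
  maxt    = rnorm(n, mean = 20, sd = 3),
  FOR_CAT = factor(rep(c("Native forest", "Commercial plantation", "Other"),
                       length.out = n))
)

# Option 1: drop the categorical variables before PCA
num_only <- toy[, vapply(toy, is.numeric, logical(1)), drop = FALSE]

# Option 2: keep them as dummy (binary) variables, one column per level
dummies <- model.matrix(~ FOR_CAT - 1, data = toy)
with_dummies <- cbind(num_only, dummies)

# PCA on the standardised variables; keep the first few components for a tour
pca <- prcomp(with_dummies, scale. = TRUE)
pcs <- pca$x[, 1:3]
head(pcs)
```

Because the dummy columns for one factor always sum to one, the last principal component has essentially zero variance, which is harmless here.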
B.3 Australian election data
Description
This is data from a study on the relationship between voting patterns and socio-demographic characteristics of Australian electorates reported in Forbes et al. (2020). These are the predictor variables upon which voting percentages are modelled. There are two years of data in oz_election_2001 and oz_election_2016.
Variables
Purpose
The tour is used to check for multicollinearity between predictors, which might adversely affect the linear model fit.
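A numerical check can complement the visual one. The sketch below computes variance inflation factors (VIFs) from the inverse of the correlation matrix, using simulated predictors. This is a swapped-in technique, not the tour-based check the book uses, and the variable names are made up.

```r
# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated
set.seed(3)
n <- 150
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X <- cbind(x1 = x1, x2 = x2, x3 = x3)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element
# of the inverse correlation matrix
vif <- setNames(diag(solve(cor(X))), colnames(X))
round(vif, 1)
```

Large values for x1 and x2 flag the near-linear dependence between them, while the value for x3 stays close to 1.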
Source
The data was compiled from the Australian Electoral Commission (AEC) and the Australian Bureau of Statistics (ABS). Code to construct the data, and the original data are available at https://github.com/jforbes14/eechidna-paper.
Pre-processing
Considerable pre-processing was done to produce these data sets. The original data was wrangled into tidy form, some variables were log transformed to reduce skewness, and a subset of variables was chosen.
B.4 Palmer penguins
library(palmerpenguins)
library(dplyr) # provides %>%, select(), mutate(), across(), rename()
penguins <- penguins %>%
  na.omit() # 11 observations out of 344 removed
# use only vars of interest, and standardise
# them for easier interpretation
penguins_sub <- penguins %>%
  select(bill_length_mm,
         bill_depth_mm,
         flipper_length_mm,
         body_mass_g,
         species,
         sex) %>%
  mutate(across(where(is.numeric), ~ scale(.)[,1])) %>%
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g)
save(penguins_sub, file="data/penguins_sub.rda")
Description
This data contains measurements of four physical characteristics of three species of penguins.
Variables
Name | Description |
---|---|
bl | a number denoting bill length (millimeters) |
bd | a number denoting bill depth (millimeters) |
fl | an integer denoting flipper length (millimeters) |
bm | an integer denoting body mass (grams) |
species | a factor denoting penguin species (Adélie, Chinstrap and Gentoo) |
Purpose
The primary goal is to find a combination of the four variables where the three species are distinct. This is also a useful data set to illustrate cluster analysis.
Source
Details of the penguins data can be found at https://allisonhorst.github.io/palmerpenguins/, and Horst et al. (2022) is the package source.
Pre-processing
The data is loaded from the palmerpenguins package. The four physical measurement variables and the species are selected, and the penguins with missing values are removed. Variables are standardised, and their names are shortened.
library(palmerpenguins)
library(dplyr) # provides %>%, mutate(), across(), rename()
penguins <- penguins %>%
  na.omit() # 11 observations out of 344 removed
# use only vars of interest, and standardise
# them for easier interpretation
penguins_sub <- penguins[,c(3:6, 1)] %>%
  mutate(across(where(is.numeric), ~ scale(.)[,1])) %>%
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g) %>%
  as.data.frame()
save(penguins_sub, file="data/penguins_sub.rda")
B.5 Programme for International Student Assessment
Description
The pisa data contains plausible scores for math, reading and science of Australian and Indonesian students from the 2018 testing cycle. The plausible scores are simulated from a model fitted to the original data, to preserve privacy of the students.
Variables
Name | Description |
---|---|
CNT | country, either AUS for Australia or IDN for Indonesia |
PV1MATH-PV10MATH | plausible scores for math |
PV1READ-PV10READ | plausible scores for reading |
PV1SCIE-PV10SCIE | plausible scores for science |
Purpose
Primarily this data is useful as an example for dimension reduction.
Source
The full data is available from https://www.oecd.org/pisa/. There are records of the student test scores, along with survey data from the students, their households and their schools.
Pre-processing
The data was reduced to country and the plausible scores, and filtered to the two countries. It may be helpful to know that the SPSS format data was used, and was read into R using the read_sav() function in the haven package.
B.6 Sketches
Description
This data is a subset of images from https://quickdraw.withgoogle.com. The subset was created using the quickdraw R package at https://huizezhang-sherry.github.io/quickdraw/. It has 6 different groups, including banana, boomerang, cactus, flip flops and kangaroo. Each image is 28x28 pixels. The sketches_train data would be used to train a classification model, and the unlabelled sketches_test data can be used for prediction.
Variables
Name | Description |
---|---|
V1-V784 | grey scale 0-255 |
word | what the person was asked to draw, NA in the test data |
id | unique id for each sketch |
Purpose
Primarily this data is useful as an example for supervised classification, and also dimension reduction.
Source
The full data is available from https://quickdraw.withgoogle.com.
Pre-processing
It is typically useful to pre-process this data into principal components. This code can also be useful for plotting one of the sketches in a recognisable form:
library(mulgar)
library(ggplot2)
data("sketches_train")
set.seed(77)
# pick one sketch at random
x <- sketches_train[sample(1:nrow(sketches_train), 1), ]
# reshape the 784 pixel values into long form with grid coordinates
xm <- data.frame(gry=t(as.matrix(x[,1:784])),
                 x=rep(1:28, 28),
                 y=rep(28:1, rep(28, 28)))
# draw the pixels as tiles, darker for larger grey values
ggplot(xm, aes(x=x, y=y, fill=gry)) +
  geom_tile() +
  scale_fill_gradientn(colors = gray.colors(256,
                                            start = 0,
                                            end = 1,
                                            rev = TRUE )) +
  ggtitle(x$word) +
  theme_void() +
  theme(legend.position="none")
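For the pre-processing into principal components mentioned above, a minimal sketch might look like the following. It uses simulated 28x28 images rather than the real sketches, assumes pixels are stored row by row as the plotting code suggests, and the choice of 10 components is arbitrary.

```r
# Simulated stand-in for image data: n images of 784 pixels, where only a
# central 10x10 patch varies and the border pixels are always 0 (made-up values)
set.seed(6)
n <- 40
pix <- matrix(0, nrow = n, ncol = 784)
grid <- expand.grid(col = 10:19, row = 10:19)
idx <- (grid$row - 1) * 28 + grid$col
pix[, idx] <- sample(0:255, n * length(idx), replace = TRUE)

# Constant pixels carry no information, and would make prcomp() fail
# if scale. = TRUE were requested, so drop them first
keep <- apply(pix, 2, var) > 0

# Pixels share a common scale, so centre without rescaling
pca <- prcomp(pix[, keep], center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:10]  # reduced representation of each image
dim(scores)
```

The scores matrix can then stand in for the 784 pixel columns when fitting a classifier or running a tour.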
B.7 multicluster
Description
This data has 10 numeric variables, and a class variable labelling groups.
Variables
Name | Description |
---|---|
group | cluster label |
x1-x10 | numeric variables |
Purpose
The primary goal is to find the different clusters.
Source
This data is originally from http://ifs.tuwien.ac.at/dm/download/multiChallenge-matrix.txt, and was provided as a challenge for non-linear dimension reduction. It was used as an example in Lee, Laa and Cook (2023), https://doi.org/10.52933/jdssv.v2i3.
B.8 clusters, clusters_nonlin, simple_clusters
Description
These data sets have varying numbers of numeric variables, and a class variable labelling the clusters.
Variables
Name | Description |
---|---|
x1-x5 | numeric variables |
cl | cluster label |
Purpose
The primary goal is to find the different clusters.
Source
Simulated using the code in the simulate.R file of the data-raw directory of the mulgar package.
B.9 plane, plane_nonlin, box
Description
These data sets have varying numbers of numeric variables.
Variables
Name | Description |
---|---|
x1-x5 | numeric variables |
Purpose
The primary goal is to understand how many dimensions the data spreads out in.
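One way to make this concrete is to look at how many principal components have non-negligible variance. The sketch below simulates its own 2D plane embedded in five dimensions; it does not use the mulgar data or its simulation code, and the tolerance is an arbitrary choice.

```r
# Simulate points on a 2D plane embedded in 5D (made-up example)
set.seed(4)
n <- 100
u <- matrix(runif(n * 2, min = -1, max = 1), ncol = 2)  # 2D coordinates
B <- matrix(rnorm(2 * 5), nrow = 2)                     # random embedding into 5D
X <- u %*% B
colnames(X) <- paste0("x", 1:5)

# Proportion of variance along each principal component
pca <- prcomp(X)
var_prop <- pca$sdev^2 / sum(pca$sdev^2)
round(var_prop, 3)

# Count the components with non-negligible variance
n_dims <- sum(var_prop > 1e-8)
n_dims
```

For data with noise or non-linear structure the drop in variance is less clean, so the count should be read alongside a visual check.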
Source
Simulated using the code in the simulate.R file of the data-raw directory of the mulgar package.
B.10 Additional data used in the book
Table C.1 lists additional data available on the book web site at https://dicook.github.io/mulgar_book/data.
Description | Link |
---|---|
Saved 2D tour path for the aflw data | <a href='https://dicook.github.io/mulgar_book/data/aflw_pct.rda'> aflw_pct.rda </a> |
Saved clusters of the penguins data from detourr | <a href='https://dicook.github.io/mulgar_book/data/detourr_penguins.csv'> detourr_penguins.csv </a> |
Saved clusters of the fake trees data from detourr | <a href='https://dicook.github.io/mulgar_book/data/fake_trees_sb.csv'> fake_trees_sb.csv </a> |
Tidied penguins data | <a href='https://dicook.github.io/mulgar_book/data/penguins_sub.rda'> penguins_sub.rda </a> |
Saved 2D tour path for penguins data | <a href='https://dicook.github.io/mulgar_book/data/penguins_tour_path.rda'> penguins_tour_path.rda </a> |
risk survey | <a href='https://dicook.github.io/mulgar_book/data/risk_MSA.rds'> risk_MSA.rds </a> |
penguins NN model | <a href='https://dicook.github.io/mulgar_book/data/penguins_cnn'> penguins_cnn </a> |
fashion MNIST NN model | <a href='https://dicook.github.io/mulgar_book/data/fashion_cnn'> fashion_cnn </a> |
penguins SHAP values | <a href='https://dicook.github.io/mulgar_book/data/p_exp_sv.rda'> p_exp_sv.rda </a> |