Interactively exploring high-dimensional data and models in R

Author

Dianne Cook and Ursula Laa

Published

March 25, 2024

Preface

It is important to visualise your data because you might discover things that you could never have anticipated. Although there are many resources available for data visualisation, there are few comprehensive resources on high-dimensional data visualisation. High-dimensional (or multivariate) data arises when many different things are measured for each observation. While we can learn many things from plotting with 1D and 2D or 3D methods there are likely more structures hidden in the higher dimensions. This book provides guidance on visualising high-dimensional data and models using linear projections, with R.

High-dimensional data spaces are fascinating places. You may think that there’s a lot of ways to plot one or two variables, and a lot of types of patterns that can be found. You might use a density plot and see skewness or a dot plot to find outliers. A scatterplot of two variables might reveal a non-linear relationship or a barrier beyond which no observations exist. We don’t as yet have so many different choices of plot types for high-dimensions, but these types of patterns are also what we seek in scatterplots of high-dimensional data. The additional dimensions can clarify these patterns, that clusters are likely to be more distinct. Observations that did not appear to be very different can be seen to be lonely anomalies in high-dimensions, that no other observations have quite the same combination of values.

What’s in this book?

The book can be divided into these parts:

  • Introduction: Here we introduce you to high-dimensional spaces, how they can be visualised, and notation that is useful for describing methods in later chapters.
  • Dimension reduction: This part covers linear and non-linear dimension reduction. It includes ways to help decide on the number of dimensions needed to summarise the high dimensional data, whether linear dimension reduction is appropriate, detecting problems that might affect the dimension reduction, and examining how well or badly a non-linear dimension reduction is representing the data.
  • Cluster analysis: This part described methods for finding groups in data. Although it includes an explanation of a purely graphical approach, it is mostly on using graphics in association with numerical clustering algorithms. There are explanations of assessing the suitability of different numerical techniques for extracting clusters, based on the data shapes, evaluating the clustering result, and showing the solutions in high dimensions.
  • Classification: This part describes methods for exploring known groups in the data. You’ll learn how to check model assumptions, to help decide if a method is suited to the data, examine classification boundaries and explore where errors arise.

In each of these parts an emphasis is also showing your model with your data in the high dimensional space.

Our hopes are that you will come away with understanding the importance of plotting your high dimensional data as a regular step in your statistical or machine learning analyses. There are many examples of what you might miss if you don’t plot the data. Effective use of graphics goes hand-in-hand with analytical techniques. With high dimensions visualisation is a challenge but it is fascinating, and leads to many surprising moments.

Audience

High-dimensional data arises in many fields such as biology, social sciences, finance, and more. Anyone who is doing exploratory data analysis and model fitting for more than two variables will benefit from learning how to effectively visualise high-dimensions. This book will be useful for students and teachers of multivariate data analysis and machine learning, and researchers, data analysts, and industry professionals who work in these areas.

How to use the book?

The book provides explanations and plots accompanied by R code. We would hope that you run the code to explore the examples as you read the explanations. The chapters are organised by types of analysis and focus on how to use the high-dimensional visualisation to complement the commonly used analytical methods. An overview of the primary high-dimensional visualisation methods discussed in the book and how to get started is provided in the toolbox chapter in the Appendix.

What should I know before reading this book?

The examples assume that you already use R, and have a working knowledge of base R and tidyverse way of thinking about data analysis. It also assumes that you have some knowledge of statistical methods, and some experience with machine learning methods.

If you feel like you need build up your skills in these areas in preparation for working through this book, these are our recommended resources:

We will assume you know how to plot your data and models in 2D. Our material starts from 2D and beyond.

Setting up your workflow

To get started set up your computer with the current versions of R and ideally also with Rstudio Desktop.

In addition, we have made an R package to share the data and functions used in this book, called mulgar.12

install.packages("mulgar", dependencies=TRUE)
# or the development version
devtools::install_github("dicook/mulgar")

To get a copy of the code and data used and an RStudio project to get started, you can download with this code:

book_url <- "https://dicook.github.io/mulgar_book/code_and_data.zip"
usethis::use_zip(url=book_url)

You will be able to click on the mulgar_book.Rproj to get started with the code.

Suggestion, feedback or error?

We welcome suggestions, feedback or details of errors. You can report them as an issue at the Github repo for this book.

Please make a small reproducible example and report the error encountered. Reproducible examples have these components:

  • a small amount of data
  • small amount of code that generates the error
  • copy of the error message that was generated

Citing

Please use this text and bibtex for citing the book:

Cook D., Laa, U. (2024) Interactively exploringhigh-dimensional data and models in R, https://dicook.github.io/mulgar_book/, accessed on YYYY/MM/DD. 

@misc{cook-laa,
  title = {Interactively exploringhigh-dimensional data and models in R},
  author = {Dianne Cook and Ursula Laa},
  year = 2024,
  url = {https://dicook.github.io/mulgar_book/},
  note = {accessed  on YYYY/MM/DD}
}

License

Creative Commons License
The online version of this book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


  1. Mulga is a type of Australian habitat composed of woodland or open forest dominated by the mulga tree. Massive clearing of mulga led to the vast wheat fields of Western Australia. Here mulgar is an acronym for MULtivariate Graphical Analysis with R.↩︎

  2. Photo of mulga tree taken by L. G. Cook.↩︎