2  Notation conventions and R objects

The data can be considered to be a matrix of numbers with the columns corresponding to variables, and the rows correspond to observations. It can be helpful to write this in mathematical notation, like:

\[\begin{eqnarray*} X_{n\times p} = [X_1~X_2~\dots~X_p]_{n\times p} = \left[ \begin{array}{cccc} X_{11} & X_{12} & \dots & X_{1p} \\ X_{21} & X_{22} & \dots & X_{2p}\\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \dots & X_{np} \end{array} \right]_{n\times p} \end{eqnarray*}\]

where \(X\) indicates the the \(n\times p\) data matrix, \(X_j\) indicates variable \(j, j=1, \dots, p\) and \(X_{ij}\) indicates the value \(j^{th}\) variable of the \(i^{th}\) observation. (It can be confusing to distinguish whether one is referring to the observation or a variable, because \(X_i\) is used to indicate observation also. When this is done it is usually accompanied by qualifying words such as observation \(X_3\), or variable \(X_3\).)

Having notation is helpful for concise explanations of different methods, to explain how data is scaled, processed and projected for various tasks, and how different quantities are calculated from the data.

When there is a response variable(s), it is common to consider \(X\) to be the predictors, and use \(Y\) to indicate the response variable(s). \(Y\) could be a matrix, also, and would be \(n\times q\), where commonly \(q=1\). \(Y\) could be numeric or categorical, and this would change how it is handled with visualisation.

To make a low-dimensional projection (shadow) of the data, we need a projection matrix:

\[\begin{eqnarray*} A_{p\times d} = \left[ \begin{array}{cccc} A_{11} & A_{12} & \dots & A_{1d} \\ A_{21} & A_{22} & \dots & A_{2d}\\ \vdots & \vdots & & \vdots \\ A_{p1} & A_{p2} & \dots & A_{pd} \end{array} \right]_{p\times d} \end{eqnarray*}\]

\(A\) should be an orthonormal matrix, which means that the \(\sum_{j=1}^p A_{jk}^2=1, k=1, \dots, d\) (columns represent vectors of length 1) and \(\sum_{j=1}^p A_{jk}A_{jl}=0, k,l=1, \dots, d; k\neq l\) (columns represent vectors that are orthogonal to each other). In matrix notation, this can be written as \(A^{\top}A = I_d\).

Then the projected data is written as:

\[\begin{eqnarray*} Y_{n\times d} = XA = \left[ \begin{array}{cccc} y_{11} & y_{12} & \dots & y_{1d} \\ y_{21} & y_{22} & \dots & y_{2d}\\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \dots & y_{nd} \end{array} \right]_{n\times d} \end{eqnarray*}\]

where \(y_{ij} = \sum_{k=1}^p X_{ik}A_{kj}\). Note that we are using \(Y\) as the projected data here, as well as it possibly being used for a response variable. Where necessary, this will be clarified with words in the text, when notation is used in explanations later.

When using R, if we only have the data corresponding to \(X\) it makes sense to use a matrix object. However, if the response variable is included and it is categorical, then we might use a data.frame or a tibble which can accommodate non-numerical values. Then to work with the data, we can use the base R methods:

X <- matrix(c(1.1, 1.3, 1.4, 1.2, 
              2.7, 2.6, 2.4, 2.5, 
              3.5, 3.4, 3.2, 3.6), 
            ncol=4, byrow=TRUE)
X
     [,1] [,2] [,3] [,4]
[1,]  1.1  1.3  1.4  1.2
[2,]  2.7  2.6  2.4  2.5
[3,]  3.5  3.4  3.2  3.6

which is a data matrix with \(n=3, p=4\) and to extract a column (variable):

X[,2]
[1] 1.3 2.6 3.4

or a row (observation):

X[2,]
[1] 2.7 2.6 2.4 2.5

or an individual cell (value):

X[3,2]
[1] 3.4

To make a projection we need an orthonormal matrix:

A <- matrix(c(0.707,0.707,0,0,0,0,0.707,0.707), ncol=2, byrow=FALSE)
A
      [,1]  [,2]
[1,] 0.707 0.000
[2,] 0.707 0.000
[3,] 0.000 0.707
[4,] 0.000 0.707

You can check that it is orthonormal by

sum(A[,1]^2)
[1] 0.999698
sum(A[,1]*A[,2])
[1] 0

and make a projection using matrix multiplication:

X %*% A
       [,1]   [,2]
[1,] 1.6968 1.8382
[2,] 3.7471 3.4643
[3,] 4.8783 4.8076

The seemingly magical number 0.707 used above and to create the projection in Figure 1.1 arises from normalising a vector with equal contributions from each variable, (1, 1). Dividing by sqrt(2) gives (0.707, 0.707).

The notation convention used throughout the book is:

n = number of observations
p = number of variables, dimension of data
d = dimension of the projection
g = number of groups, in classification
X = data matrix

Exercises

  1. Generate a matrix \(A\) with \(p=5\) (rows) and \(d=2\) (columns), where each value is randomly drawn from a standard normal distribution. Extract the element at row 3 and column 1.
  2. We will interpret \(A\) as a projection matrix and therefore it needs to be orthonormalised. Use the function tourr::orthonormalise to do this, and explicitly check that each column is normalised and that the two columns are orthogonal now. Which dimensions contribute most to the projection for your \(A\)?
  3. Use matrix multiplication to calculate the projection of the mulgar::clusters data onto the 2D plane defined by \(A\). Make a scatterplot of the projected data. Can you identify clustering in this view?