Chapter 3 Initial Data Analysis
In Chapter 2, we mentioned that data analysis includes initial data analysis (IDA). There are various definitions of IDA, much like there are numerous definitions for EDA. Some people would be practising IDA without explicitly realising that it is IDA. Or in other cases, a different name is used to describe the same process, such as Chatfield (1985) referring to IDA also as “initial examination of data” and Cox & Snell (1981) as “preliminary data anlysis” and Rao (1983) as “cross-examination of data.” So what is IDA?
The two main objectives for initial data analysis are:
- data description, and
- model formulation.
IDA differs from the main analysis (i.e. usually fitting the model, conducting significance tests, making inferences or predictions). IDA is often unreported in the data analysis reports or scientific papers due to it being “uninteresting” or “obvious.” The role of the main analysis is to answer the intended question(s) that the data were collected for. Sometimes IDA alone is sufficient.
3.1 Data description
Data description should be one of the first steps in the data analysis to assess the structure and quality of the data. We refer them to occasionally as data sniffing or data scrutinizing. These include using common or domain knowledge to check if the recorded data have sensible values. E.g. Are positive values, e.g. height and weight, recorded as positive values with a plausible range? If the data are counts, are the recorded values contain noninteger values? For compositional data, do the values add up to 100% (or 1)? If not is that a measurement error or due to rounding? Or is another variable missing?
In addition, numerical or graphical summaries may reveal that there is unwanted structure in the data. E.g., Does the treatment group have different demographic characteristics to the control group? Does the distribution of the data imply violations of assumptions for the main analysis?
Data sniffing or data scrutinizing is a process that you get better at with practice and have familiarity with the domain area.
Aside from checking the data structure or data quality, it’s important to check how the data are understood by the computer, i.e. checking for data type is also important. E.g.,
- Was the date read in as character?
- Was a factor read in as numeric?
3.1.1 Checking the data type
Example 1: Consider the following data stored as an excel sheet.
We read this data into R as below.
Below is a print out of the data object. Are there any issues here?
## # A tibble: 5 x 4
## id
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## # … with 3 more
## # variables:
## # date <dttm>,
## # loc <chr>,
## # temp <dbl>
In the United States, it is common to use the date format MM/DD/YYYY while the rest of the world commonly use DD/MM/YYYY or YYYY/MM/DD. It is highly probable that the dates are 1st-5th March and not 3rd of Jan-May. You can validate data with external sources, e.g. say the temperature at New York during the two choices suggest that the dates are 1st-5th March.
The benefit of working with data grounded in the real world process is that there are generally means to sanity check. You can robustify your workflow by ensuring that you have an explicit check for the expected data type (and values) in your code.
In the code below, we write our expected types and further coerce some data types to what we want.
## # A tibble: 5 x 4
## id date
## <fct> <date>
## 1 1 2010-03-01
## 2 2 2010-03-02
## 3 3 2010-03-03
## 4 4 2010-03-04
## 5 5 2010-03-05
## # … with 2 more
## # variables:
## # loc <chr>,
## # temp <dbl>
## # A tibble: 5 x 4
## id date
## <fct> <date>
## 1 1 2010-03-01
## 2 2 2010-03-02
## 3 3 2010-03-03
## 4 4 2010-03-04
## 5 5 2010-03-05
## # … with 2 more
## # variables:
## # loc <chr>,
## # temp <dbl>
The checks (or coercions) ensure that even if the data are updated, you can have some confidence that any data type error will be picked up before further analysis.
3.1.2 Checking the data quality
Numerical or graphical summaries, or even just eye-balling the data, helps to uncover some data quality issues. Do you see any issues for the data below?
## # A tibble: 9 x 4
## id date
## <fct> <date>
## 1 1 2010-03-01
## 2 2 2010-03-02
## 3 3 2010-03-03
## 4 4 2010-03-04
## 5 5 2010-03-05
## 6 6 2020-03-01
## 7 7 2020-03-02
## 8 8 2020-03-03
## 9 9 2020-03-04
## # … with 2 more
## # variables:
## # loc <chr>,
## # temp <dbl>
There is a missing value in loc
. Temperature is in Farenheit for New York but Celsius in Melbourne. You can find this out again using external sources.
Consider the soybean study conducted in Brazil (cite Lehner 2016). This study collected multiple traits of soybeans grown at multiple locations over years.
Inspecting the data reveals that there are a number of missing values for sclerotia and these values do not necessary appear missing at random. We could check if observational units that are missing values for sclerotia, have different yield say.
We could also compare the new data with old data.
3.1.3 Check on data collection method
Next we study the data from ABS that shows the total number of people employed in a given month from February 1976 to December 2019 using the original time series.
Do you notice anything?
Why do you think the number of people employed is going up each year?
There’s a suspicious change in August numbers from 2014.
A potential explanation for this is that there was a change in the survey from 2014. Also see https://robjhyndman.com/hyndsight/abs-seasonal-adjustment-2/
Check if the data collection method has been consistent.
3.1.4 Check for experimental data
For experimental data, there would generally be some descriptions that include the experimental layout and any randomisation process of controlled variables (e.g. treatments) to units.
Consider the experiment below. The experiment tests the effects of 9 fertilizer treatments on the yield of brussel sprouts on a field laid out in a rectangular array of 6 rows and 8 columns.
High sulphur and high manure seems to best for the yield of brussel sprouts. Any issues here?
3.2 Model formulation
Note: there are variety of ways to do IDA and you don’t need to prescribe to what we show you.
3.2.1 Review: Linear models in R
Consider the bivariate data cars
were it contains the variables dist
(distance travelled) and speed
(speed of the car) with the data shown in Figure 3.1.
thes models can be fitted in R as follows:
which is the same as
and for \(i = 1, ... 50\) mathematically can be written as
\[y_i = \beta_0 + \beta_1 x_i + e_i,\] where:
- \(y_i\) and \(x_i\) are the speed (in mph) and stopping distance (in ft), respectively of the \(i\)-th car,
- \(\beta_0\) and \(\beta_1\) are intercept and slope, respectively, and
- \(e_i\) is the random error, typically assuming that \(e_i \sim N(0, \sigma^2)\).
3.3 Model formulation
- Say, we are interested in characterising the price of the diamond in terms of its carat.
Looking at this plot, would you fit a linear model with formula price ~ 1 + carat
? What about price ~ poly(carat, 2)
which is the same as fitting:
\[y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + e_i.\] Do we need to modify the assumption of the error distribution? Should we make some transformation before modelling? Are there other candidate models? By the way do you notice a wall like structure for prices at 1, 1.5 and 2 carat?
Notice that we did no formal statistical inference as we initially try to formulate the model. The goal of the main analysis is to characterise the price of a diamond by its carat. This may involve:
- formal inference for model selection,
- justification of the selected “final” model, and
- fitting the final model.
There may be in fact far more models considered but discarded at the IDA stage. These discarded models are hardly ever reported. Consequently, majority of reported statistics give a distorted view and it’s important to remind yourself what might not be reported.
“All models are approximate and tentative; approximate in the sense that no model is exactly true and tentative in that they may be modified in the light of further data”
– Chatfield (1985)
“All models are wrong but some are useful”
– George Box
A wheat breeding trial to test 107 varieties (also called genotype) is conducted in a field experiment laid out in a rectangular array with 22 rows and 15 columns.
3.3.1 Experimental Design
The experiment employs what is referred to as a randomised complete block design (RCBD) (technically it is near-complete and not exactly RCBD due to check varieties have double the replicates of test varieties).
RCBD means that:
- the there are equal number of replicates for each treatment (here it is gen);
- each treatment appears exactly once in each block;
- the blocks are of the same size; and
- each treatment are randomised within block.
In agricultural field experiments, blocks are formed spatially by grouping plots within contiguous areas (called rep
here).
The boundaries of blocks may be chosen arbitrary.
- In the main analysis, people would commonly analyse this using what is called two-way ANOVA model (with no interaction effect).
- The two-way ANOVA model has the form
yield = mean + block + treatment + error
- So for this data,
<- lm(yield ~ 1 + rep + gen,
fit data = gilmour.serpentine)
summary(fit)
##
## Call:
## lm(formula = yield ~ 1 + rep + gen, data = gilmour.serpentine)
##
## Residuals:
## Min 1Q
## -245.070 -69.695
## Median 3Q
## -1.182 71.427
## Max
## 250.652
##
## Coefficients:
## Estimate
## (Intercept) 720.248
## repR2 96.100
## repR3 -129.845
## gen(WqKPWmH*3Ag 24.333
## genAMERY -93.333
## genANGAS -132.667
## genAROONA -153.667
## genBATAVIA -175.333
## genBD231 -70.333
## genBEULAH -173.667
## genBLADE -270.000
## genBT_SCHOMBURG -49.000
## genCADOUX -223.333
## genCONDOR -124.333
## genCORRIGIN -217.667
## genCUNNINGHAM -254.667
## genDGR/MNX-9-9e -47.667
## genDOLLARBIRD -200.667
## genEXCALIBUR -55.000
## genGOROKE -141.667
## genHALBERD -53.333
## genHOUTMAN -209.333
## genJANZ -214.667
## genK2011-5* -87.333
## genKATUNGA -110.333
## genKIATA -165.667
## genKITE -180.000
## genKULIN -91.000
## genLARK -336.333
## genLOWAN -152.333
## genM4997 -146.000
## genM5075 -194.667
## genM5097 -102.667
## genMACHETE -231.333
## genMEERING -247.667
## genMOLINEUX -165.667
## genOSPREY -162.000
## genOUYEN -136.667
## genOXLEY -221.667
## genPELSART -200.333
## genPEROUSE -283.667
## genRAC655 -112.667
## genRAC655'S' -113.667
## genRAC696 -3.667
## genRAC710 -51.000
## genRAC750 -77.333
## genRAC759 -42.000
## genRAC772 5.000
## genRAC777 -172.333
## genRAC779 3.667
## genRAC787 -118.000
## genRAC791 -72.667
## genRAC792 -102.333
## genRAC798 -1.667
## genRAC804 -45.000
## genRAC805 -43.000
## genRAC806 -35.333
## genRAC807 -91.333
## genRAC808 -54.000
## genRAC809 -43.333
## genRAC810 -131.667
## genRAC811 42.333
## genRAC812 -94.000
## genRAC813 -83.333
## genRAC814 -72.333
## genRAC815 -111.000
## genRAC816 -66.333
## genRAC817 -100.000
## genRAC818 -107.000
## genRAC819 -121.333
## genRAC820 -1.000
## genRAC821 -98.333
## genROSELLA -184.333
## genSCHOMBURGK -132.333
## genSHRIKE -128.000
## genSPEAR -254.667
## genSTILETTO -157.000
## genSUNBRI -218.333
## genSUNFIELD -206.667
## genSUNLAND -182.667
## genSWIFT -197.000
## genTASMAN -161.000
## genTATIARA -64.333
## genTINCURRIN -19.000
## genTRIDENT -132.667
## genVF299 -66.333
## genVF300 -111.667
## genVF302 -108.333
## genVF508 11.667
## genVF519 -1.000
## genVF655 -160.167
## genVF664 -106.667
## genVG127 -109.667
## genVG503 -43.000
## genVG506 -108.667
## genVG701 -19.333
## genVG714 -108.333
## genVG878 52.333
## genWARBLER -217.000
## genWI216 4.000
## genWI221 -17.333
## genWI231 -218.333
## genWI232 -56.333
## genWILGOYNE -131.000
## genWW1402 -117.333
## genWW1477 -185.667
## genWW1831 -86.667
## genWYUNA -176.667
## genYARRALINKA -245.000
## Std. Error
## (Intercept) 67.335
## repR2 15.585
## repR3 15.585
## gen(WqKPWmH*3Ag 94.372
## genAMERY 94.372
## genANGAS 94.372
## genAROONA 94.372
## genBATAVIA 94.372
## genBD231 94.372
## genBEULAH 94.372
## genBLADE 94.372
## genBT_SCHOMBURG 94.372
## genCADOUX 94.372
## genCONDOR 94.372
## genCORRIGIN 94.372
## genCUNNINGHAM 94.372
## genDGR/MNX-9-9e 94.372
## genDOLLARBIRD 94.372
## genEXCALIBUR 94.372
## genGOROKE 94.372
## genHALBERD 94.372
## genHOUTMAN 94.372
## genJANZ 94.372
## genK2011-5* 94.372
## genKATUNGA 94.372
## genKIATA 94.372
## genKITE 94.372
## genKULIN 94.372
## genLARK 94.372
## genLOWAN 94.372
## genM4997 94.372
## genM5075 94.372
## genM5097 94.372
## genMACHETE 94.372
## genMEERING 94.372
## genMOLINEUX 94.372
## genOSPREY 94.372
## genOUYEN 94.372
## genOXLEY 94.372
## genPELSART 94.372
## genPEROUSE 94.372
## genRAC655 94.372
## genRAC655'S' 94.372
## genRAC696 94.372
## genRAC710 94.372
## genRAC750 94.372
## genRAC759 94.372
## genRAC772 94.372
## genRAC777 94.372
## genRAC779 94.372
## genRAC787 94.372
## genRAC791 94.372
## genRAC792 94.372
## genRAC798 94.372
## genRAC804 94.372
## genRAC805 94.372
## genRAC806 94.372
## genRAC807 94.372
## genRAC808 94.372
## genRAC809 94.372
## genRAC810 94.372
## genRAC811 94.372
## genRAC812 94.372
## genRAC813 94.372
## genRAC814 94.372
## genRAC815 94.372
## genRAC816 94.372
## genRAC817 94.372
## genRAC818 94.372
## genRAC819 94.372
## genRAC820 94.372
## genRAC821 94.372
## genROSELLA 94.372
## genSCHOMBURGK 94.372
## genSHRIKE 94.372
## genSPEAR 94.372
## genSTILETTO 94.372
## genSUNBRI 94.372
## genSUNFIELD 94.372
## genSUNLAND 94.372
## genSWIFT 94.372
## genTASMAN 94.372
## genTATIARA 94.372
## genTINCURRIN 81.728
## genTRIDENT 94.372
## genVF299 94.372
## genVF300 94.372
## genVF302 94.372
## genVF508 94.372
## genVF519 94.372
## genVF655 81.728
## genVF664 94.372
## genVG127 94.372
## genVG503 94.372
## genVG506 94.372
## genVG701 94.372
## genVG714 94.372
## genVG878 94.372
## genWARBLER 94.372
## genWI216 94.372
## genWI221 94.372
## genWI231 94.372
## genWI232 94.372
## genWILGOYNE 94.372
## genWW1402 94.372
## genWW1477 81.728
## genWW1831 94.372
## genWYUNA 94.372
## genYARRALINKA 94.372
## t value
## (Intercept) 10.697
## repR2 6.166
## repR3 -8.331
## gen(WqKPWmH*3Ag 0.258
## genAMERY -0.989
## genANGAS -1.406
## genAROONA -1.628
## genBATAVIA -1.858
## genBD231 -0.745
## genBEULAH -1.840
## genBLADE -2.861
## genBT_SCHOMBURG -0.519
## genCADOUX -2.367
## genCONDOR -1.317
## genCORRIGIN -2.306
## genCUNNINGHAM -2.699
## genDGR/MNX-9-9e -0.505
## genDOLLARBIRD -2.126
## genEXCALIBUR -0.583
## genGOROKE -1.501
## genHALBERD -0.565
## genHOUTMAN -2.218
## genJANZ -2.275
## genK2011-5* -0.925
## genKATUNGA -1.169
## genKIATA -1.755
## genKITE -1.907
## genKULIN -0.964
## genLARK -3.564
## genLOWAN -1.614
## genM4997 -1.547
## genM5075 -2.063
## genM5097 -1.088
## genMACHETE -2.451
## genMEERING -2.624
## genMOLINEUX -1.755
## genOSPREY -1.717
## genOUYEN -1.448
## genOXLEY -2.349
## genPELSART -2.123
## genPEROUSE -3.006
## genRAC655 -1.194
## genRAC655'S' -1.204
## genRAC696 -0.039
## genRAC710 -0.540
## genRAC750 -0.819
## genRAC759 -0.445
## genRAC772 0.053
## genRAC777 -1.826
## genRAC779 0.039
## genRAC787 -1.250
## genRAC791 -0.770
## genRAC792 -1.084
## genRAC798 -0.018
## genRAC804 -0.477
## genRAC805 -0.456
## genRAC806 -0.374
## genRAC807 -0.968
## genRAC808 -0.572
## genRAC809 -0.459
## genRAC810 -1.395
## genRAC811 0.449
## genRAC812 -0.996
## genRAC813 -0.883
## genRAC814 -0.766
## genRAC815 -1.176
## genRAC816 -0.703
## genRAC817 -1.060
## genRAC818 -1.134
## genRAC819 -1.286
## genRAC820 -0.011
## genRAC821 -1.042
## genROSELLA -1.953
## genSCHOMBURGK -1.402
## genSHRIKE -1.356
## genSPEAR -2.699
## genSTILETTO -1.664
## genSUNBRI -2.314
## genSUNFIELD -2.190
## genSUNLAND -1.936
## genSWIFT -2.087
## genTASMAN -1.706
## genTATIARA -0.682
## genTINCURRIN -0.232
## genTRIDENT -1.406
## genVF299 -0.703
## genVF300 -1.183
## genVF302 -1.148
## genVF508 0.124
## genVF519 -0.011
## genVF655 -1.960
## genVF664 -1.130
## genVG127 -1.162
## genVG503 -0.456
## genVG506 -1.151
## genVG701 -0.205
## genVG714 -1.148
## genVG878 0.555
## genWARBLER -2.299
## genWI216 0.042
## genWI221 -0.184
## genWI231 -2.314
## genWI232 -0.597
## genWILGOYNE -1.388
## genWW1402 -1.243
## genWW1477 -2.272
## genWW1831 -0.918
## genWYUNA -1.872
## genYARRALINKA -2.596
## Pr(>|t|)
## (Intercept) < 2e-16
## repR2 3.29e-09
## repR3 8.44e-15
## gen(WqKPWmH*3Ag 0.796766
## genAMERY 0.323747
## genANGAS 0.161192
## genAROONA 0.104884
## genBATAVIA 0.064513
## genBD231 0.456895
## genBEULAH 0.067074
## genBLADE 0.004628
## genBT_SCHOMBURG 0.604125
## genCADOUX 0.018820
## genCONDOR 0.189041
## genCORRIGIN 0.022010
## genCUNNINGHAM 0.007502
## genDGR/MNX-9-9e 0.613996
## genDOLLARBIRD 0.034584
## genEXCALIBUR 0.560621
## genGOROKE 0.134743
## genHALBERD 0.572551
## genHOUTMAN 0.027560
## genJANZ 0.023884
## genK2011-5* 0.355758
## genKATUNGA 0.243609
## genKIATA 0.080565
## genKITE 0.057772
## genKULIN 0.335964
## genLARK 0.000448
## genLOWAN 0.107915
## genM4997 0.123277
## genM5075 0.040304
## genM5097 0.277826
## genMACHETE 0.015010
## genMEERING 0.009286
## genMOLINEUX 0.080565
## genOSPREY 0.087451
## genOUYEN 0.148986
## genOXLEY 0.019713
## genPELSART 0.034882
## genPEROUSE 0.002955
## genRAC655 0.233813
## genRAC655'S' 0.229702
## genRAC696 0.969042
## genRAC710 0.589455
## genRAC750 0.413410
## genRAC759 0.656721
## genRAC772 0.957794
## genRAC777 0.069183
## genRAC779 0.969042
## genRAC787 0.212486
## genRAC791 0.442120
## genRAC792 0.279385
## genRAC798 0.985926
## genRAC804 0.633949
## genRAC805 0.649093
## genRAC806 0.708462
## genRAC807 0.334201
## genRAC808 0.567765
## genRAC809 0.646559
## genRAC810 0.164359
## genRAC811 0.654174
## genRAC812 0.320310
## genRAC813 0.378179
## genRAC814 0.444214
## genRAC815 0.240781
## genRAC816 0.482862
## genRAC817 0.290466
## genRAC818 0.258101
## genRAC819 0.199895
## genRAC820 0.991555
## genRAC821 0.298560
## genROSELLA 0.052050
## genSCHOMBURGK 0.162242
## genSHRIKE 0.176376
## genSPEAR 0.007502
## genSTILETTO 0.097603
## genSUNBRI 0.021612
## genSUNFIELD 0.029576
## genSUNLAND 0.054192
## genSWIFT 0.037990
## genTASMAN 0.089410
## genTATIARA 0.496142
## genTINCURRIN 0.816382
## genTRIDENT 0.161192
## genVF299 0.482862
## genVF300 0.237976
## genVF302 0.252234
## genVF508 0.901725
## genVF519 0.991555
## genVF655 0.051283
## genVF664 0.259583
## genVG127 0.246460
## genVG503 0.649093
## genVG506 0.250782
## genVG701 0.837867
## genVG714 0.252234
## genVG878 0.579767
## genWARBLER 0.022415
## genWI216 0.966230
## genWI221 0.854440
## genWI231 0.021612
## genWI232 0.551165
## genWILGOYNE 0.166496
## genWW1402 0.215071
## genWW1477 0.024064
## genWW1831 0.359435
## genWYUNA 0.062524
## genYARRALINKA 0.010061
##
## (Intercept) ***
## repR2 ***
## repR3 ***
## gen(WqKPWmH*3Ag
## genAMERY
## genANGAS
## genAROONA
## genBATAVIA .
## genBD231
## genBEULAH .
## genBLADE **
## genBT_SCHOMBURG
## genCADOUX *
## genCONDOR
## genCORRIGIN *
## genCUNNINGHAM **
## genDGR/MNX-9-9e
## genDOLLARBIRD *
## genEXCALIBUR
## genGOROKE
## genHALBERD
## genHOUTMAN *
## genJANZ *
## genK2011-5*
## genKATUNGA
## genKIATA .
## genKITE .
## genKULIN
## genLARK ***
## genLOWAN
## genM4997
## genM5075 *
## genM5097
## genMACHETE *
## genMEERING **
## genMOLINEUX .
## genOSPREY .
## genOUYEN
## genOXLEY *
## genPELSART *
## genPEROUSE **
## genRAC655
## genRAC655'S'
## genRAC696
## genRAC710
## genRAC750
## genRAC759
## genRAC772
## genRAC777 .
## genRAC779
## genRAC787
## genRAC791
## genRAC792
## genRAC798
## genRAC804
## genRAC805
## genRAC806
## genRAC807
## genRAC808
## genRAC809
## genRAC810
## genRAC811
## genRAC812
## genRAC813
## genRAC814
## genRAC815
## genRAC816
## genRAC817
## genRAC818
## genRAC819
## genRAC820
## genRAC821
## genROSELLA .
## genSCHOMBURGK
## genSHRIKE
## genSPEAR **
## genSTILETTO .
## genSUNBRI *
## genSUNFIELD *
## genSUNLAND .
## genSWIFT *
## genTASMAN .
## genTATIARA
## genTINCURRIN
## genTRIDENT
## genVF299
## genVF300
## genVF302
## genVF508
## genVF519
## genVF655 .
## genVF664
## genVG127
## genVG503
## genVG506
## genVG701
## genVG714
## genVG878
## genWARBLER *
## genWI216
## genWI221
## genWI231 *
## genWI232
## genWILGOYNE
## genWW1402
## genWW1477 *
## genWW1831
## genWYUNA .
## genYARRALINKA *
## ---
## Signif. codes:
## 0 '***' 0.001
## '**' 0.01 '*'
## 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.6 on 221 degrees of freedom
## Multiple R-squared: 0.6226, Adjusted R-squared: 0.4381
## F-statistic: 3.375 on 108 and 221 DF, p-value: 1.081e-14
Do you notice anything from below?
- It’s well known in agricultural field trials that spatial variations are introduced in traits; this could be because of the fertility trend, management practices or other reasons.
- In the IDA stage, you investigate to identify these spatial variations - you cannot just simply fit a two-way ANOVA model!
Teaching of Statistics should provide a more balanced blend of IDA and inference
– Chatfield (1985)
Yet there is still very little emphasis of it in teaching and also at times in practice. So don’t forget to do IDA!
3.4 Summary
- Initial data analysis (IDA) is a model-focussed exploration of data with two main objectives:
- data description including scrutinizing for data quality, and
- model formulation without any formal statistical inference.
- IDA hardly sees the limelight even if it’s the very foundation of what the main analysis is built on.