Chapter 3 Initial Data Analysis

In Chapter 2, we mentioned that data analysis includes initial data analysis (IDA). There are various definitions of IDA, much like there are numerous definitions for EDA. Some people would be practising IDA without explicitly realising that it is IDA. Or in other cases, a different name is used to describe the same process, such as Chatfield (1985) referring to IDA also as “initial examination of data” and Cox & Snell (1981) as “preliminary data anlysis” and Rao (1983) as “cross-examination of data.” So what is IDA?

The two main objectives for initial data analysis are:

data description, and
model formulation.

IDA differs from the main analysis (i.e. usually fitting the model, conducting significance tests, making inferences or predictions). IDA is often unreported in the data analysis reports or scientific papers due to it being “uninteresting” or “obvious.” The role of the main analysis is to answer the intended question(s) that the data were collected for. Sometimes IDA alone is sufficient.

3.1 Data description

Data description should be one of the first steps in the data analysis to assess the structure and quality of the data. We refer them to occasionally as data sniffing or data scrutinizing. These include using common or domain knowledge to check if the recorded data have sensible values. E.g. Are positive values, e.g. height and weight, recorded as positive values with a plausible range? If the data are counts, are the recorded values contain noninteger values? For compositional data, do the values add up to 100% (or 1)? If not is that a measurement error or due to rounding? Or is another variable missing?

In addition, numerical or graphical summaries may reveal that there is unwanted structure in the data. E.g., Does the treatment group have different demographic characteristics to the control group? Does the distribution of the data imply violations of assumptions for the main analysis?

Data sniffing or data scrutinizing is a process that you get better at with practice and have familiarity with the domain area.

Aside from checking the data structure or data quality, it’s important to check how the data are understood by the computer, i.e. checking for data type is also important. E.g.,

Was the date read in as character?
Was a factor read in as numeric?

3.1.1 Checking the data type

Example 1: Consider the following data stored as an excel sheet.

We read this data into R as below.

Below is a print out of the data object. Are there any issues here?

## # A tibble: 5 x 4
##      id
##   <dbl>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## # … with 3 more
## #   variables:
## #   date <dttm>,
## #   loc <chr>,
## #   temp <dbl>

In the United States, it is common to use the date format MM/DD/YYYY while the rest of the world commonly use DD/MM/YYYY or YYYY/MM/DD. It is highly probable that the dates are 1st-5th March and not 3rd of Jan-May. You can validate data with external sources, e.g. say the temperature at New York during the two choices suggest that the dates are 1st-5th March.

The benefit of working with data grounded in the real world process is that there are generally means to sanity check. You can robustify your workflow by ensuring that you have an explicit check for the expected data type (and values) in your code.

In the code below, we write our expected types and further coerce some data types to what we want.

## # A tibble: 5 x 4
##   id    date      
##   <fct> <date>    
## 1 1     2010-03-01
## 2 2     2010-03-02
## 3 3     2010-03-03
## 4 4     2010-03-04
## 5 5     2010-03-05
## # … with 2 more
## #   variables:
## #   loc <chr>,
## #   temp <dbl>

## # A tibble: 5 x 4
##   id    date      
##   <fct> <date>    
## 1 1     2010-03-01
## 2 2     2010-03-02
## 3 3     2010-03-03
## 4 4     2010-03-04
## 5 5     2010-03-05
## # … with 2 more
## #   variables:
## #   loc <chr>,
## #   temp <dbl>

The checks (or coercions) ensure that even if the data are updated, you can have some confidence that any data type error will be picked up before further analysis.

3.1.2 Checking the data quality

Numerical or graphical summaries, or even just eye-balling the data, helps to uncover some data quality issues. Do you see any issues for the data below?

## # A tibble: 9 x 4
##   id    date      
##   <fct> <date>    
## 1 1     2010-03-01
## 2 2     2010-03-02
## 3 3     2010-03-03
## 4 4     2010-03-04
## 5 5     2010-03-05
## 6 6     2020-03-01
## 7 7     2020-03-02
## 8 8     2020-03-03
## 9 9     2020-03-04
## # … with 2 more
## #   variables:
## #   loc <chr>,
## #   temp <dbl>

There is a missing value in loc. Temperature is in Farenheit for New York but Celsius in Melbourne. You can find this out again using external sources.

Consider the soybean study conducted in Brazil (cite Lehner 2016). This study collected multiple traits of soybeans grown at multiple locations over years.

Inspecting the data reveals that there are a number of missing values for sclerotia and these values do not necessary appear missing at random. We could check if observational units that are missing values for sclerotia, have different yield say.

We could also compare the new data with old data.

3.1.3 Check on data collection method

Next we study the data from ABS that shows the total number of people employed in a given month from February 1976 to December 2019 using the original time series.

Do you notice anything?

Why do you think the number of people employed is going up each year?

There’s a suspicious change in August numbers from 2014.

A potential explanation for this is that there was a change in the survey from 2014. Also see https://robjhyndman.com/hyndsight/abs-seasonal-adjustment-2/

Check if the data collection method has been consistent.

3.1.4 Check for experimental data

For experimental data, there would generally be some descriptions that include the experimental layout and any randomisation process of controlled variables (e.g. treatments) to units.

Consider the experiment below. The experiment tests the effects of 9 fertilizer treatments on the yield of brussel sprouts on a field laid out in a rectangular array of 6 rows and 8 columns.

High sulphur and high manure seems to best for the yield of brussel sprouts. Any issues here?

3.2 Model formulation

Note: there are variety of ways to do IDA and you don’t need to prescribe to what we show you.

3.2.1 Review: Linear models in R

Consider the bivariate data cars were it contains the variables dist (distance travelled) and speed (speed of the car) with the data shown in Figure 3.1.

Figure 3.1: The plot of the speed of cars vs. the distance travelled.

thes models can be fitted in R as follows:

which is the same as

and for \(i = 1, ... 50\) mathematically can be written as

\[y_i = \beta_0 + \beta_1 x_i + e_i,\] where:

\(y_i\) and \(x_i\) are the speed (in mph) and stopping distance (in ft), respectively of the \(i\)-th car,
\(\beta_0\) and \(\beta_1\) are intercept and slope, respectively, and
\(e_i\) is the random error, typically assuming that \(e_i \sim N(0, \sigma^2)\).

3.3 Model formulation

Say, we are interested in characterising the price of the diamond in terms of its carat.

Looking at this plot, would you fit a linear model with formula price ~ 1 + carat? What about price ~ poly(carat, 2) which is the same as fitting:

\[y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + e_i.\] Do we need to modify the assumption of the error distribution? Should we make some transformation before modelling? Are there other candidate models? By the way do you notice a wall like structure for prices at 1, 1.5 and 2 carat?

Notice that we did no formal statistical inference as we initially try to formulate the model. The goal of the main analysis is to characterise the price of a diamond by its carat. This may involve:

formal inference for model selection,
justification of the selected “final” model, and
fitting the final model.

There may be in fact far more models considered but discarded at the IDA stage. These discarded models are hardly ever reported. Consequently, majority of reported statistics give a distorted view and it’s important to remind yourself what might not be reported.

“All models are approximate and tentative; approximate in the sense that no model is exactly true and tentative in that they may be modified in the light of further data”
– Chatfield (1985)

“All models are wrong but some are useful”
– George Box

A wheat breeding trial to test 107 varieties (also called genotype) is conducted in a field experiment laid out in a rectangular array with 22 rows and 15 columns.

3.3.1 Experimental Design

The experiment employs what is referred to as a randomised complete block design (RCBD) (technically it is near-complete and not exactly RCBD due to check varieties have double the replicates of test varieties).

RCBD means that:

the there are equal number of replicates for each treatment (here it is gen);
each treatment appears exactly once in each block;
the blocks are of the same size; and
each treatment are randomised within block.

In agricultural field experiments, blocks are formed spatially by grouping plots within contiguous areas (called rep here).

The boundaries of blocks may be chosen arbitrary.

In the main analysis, people would commonly analyse this using what is called two-way ANOVA model (with no interaction effect).
The two-way ANOVA model has the form yield = mean + block + treatment + error
So for this data,

fit <- lm(yield ~ 1 + rep + gen, 
          data = gilmour.serpentine)

summary(fit)

## 
## Call:
## lm(formula = yield ~ 1 + rep + gen, data = gilmour.serpentine)
## 
## Residuals:
##      Min       1Q 
## -245.070  -69.695 
##   Median       3Q 
##   -1.182   71.427 
##      Max 
##  250.652 
## 
## Coefficients:
##                 Estimate
## (Intercept)      720.248
## repR2             96.100
## repR3           -129.845
## gen(WqKPWmH*3Ag   24.333
## genAMERY         -93.333
## genANGAS        -132.667
## genAROONA       -153.667
## genBATAVIA      -175.333
## genBD231         -70.333
## genBEULAH       -173.667
## genBLADE        -270.000
## genBT_SCHOMBURG  -49.000
## genCADOUX       -223.333
## genCONDOR       -124.333
## genCORRIGIN     -217.667
## genCUNNINGHAM   -254.667
## genDGR/MNX-9-9e  -47.667
## genDOLLARBIRD   -200.667
## genEXCALIBUR     -55.000
## genGOROKE       -141.667
## genHALBERD       -53.333
## genHOUTMAN      -209.333
## genJANZ         -214.667
## genK2011-5*      -87.333
## genKATUNGA      -110.333
## genKIATA        -165.667
## genKITE         -180.000
## genKULIN         -91.000
## genLARK         -336.333
## genLOWAN        -152.333
## genM4997        -146.000
## genM5075        -194.667
## genM5097        -102.667
## genMACHETE      -231.333
## genMEERING      -247.667
## genMOLINEUX     -165.667
## genOSPREY       -162.000
## genOUYEN        -136.667
## genOXLEY        -221.667
## genPELSART      -200.333
## genPEROUSE      -283.667
## genRAC655       -112.667
## genRAC655'S'    -113.667
## genRAC696         -3.667
## genRAC710        -51.000
## genRAC750        -77.333
## genRAC759        -42.000
## genRAC772          5.000
## genRAC777       -172.333
## genRAC779          3.667
## genRAC787       -118.000
## genRAC791        -72.667
## genRAC792       -102.333
## genRAC798         -1.667
## genRAC804        -45.000
## genRAC805        -43.000
## genRAC806        -35.333
## genRAC807        -91.333
## genRAC808        -54.000
## genRAC809        -43.333
## genRAC810       -131.667
## genRAC811         42.333
## genRAC812        -94.000
## genRAC813        -83.333
## genRAC814        -72.333
## genRAC815       -111.000
## genRAC816        -66.333
## genRAC817       -100.000
## genRAC818       -107.000
## genRAC819       -121.333
## genRAC820         -1.000
## genRAC821        -98.333
## genROSELLA      -184.333
## genSCHOMBURGK   -132.333
## genSHRIKE       -128.000
## genSPEAR        -254.667
## genSTILETTO     -157.000
## genSUNBRI       -218.333
## genSUNFIELD     -206.667
## genSUNLAND      -182.667
## genSWIFT        -197.000
## genTASMAN       -161.000
## genTATIARA       -64.333
## genTINCURRIN     -19.000
## genTRIDENT      -132.667
## genVF299         -66.333
## genVF300        -111.667
## genVF302        -108.333
## genVF508          11.667
## genVF519          -1.000
## genVF655        -160.167
## genVF664        -106.667
## genVG127        -109.667
## genVG503         -43.000
## genVG506        -108.667
## genVG701         -19.333
## genVG714        -108.333
## genVG878          52.333
## genWARBLER      -217.000
## genWI216           4.000
## genWI221         -17.333
## genWI231        -218.333
## genWI232         -56.333
## genWILGOYNE     -131.000
## genWW1402       -117.333
## genWW1477       -185.667
## genWW1831        -86.667
## genWYUNA        -176.667
## genYARRALINKA   -245.000
##                 Std. Error
## (Intercept)         67.335
## repR2               15.585
## repR3               15.585
## gen(WqKPWmH*3Ag     94.372
## genAMERY            94.372
## genANGAS            94.372
## genAROONA           94.372
## genBATAVIA          94.372
## genBD231            94.372
## genBEULAH           94.372
## genBLADE            94.372
## genBT_SCHOMBURG     94.372
## genCADOUX           94.372
## genCONDOR           94.372
## genCORRIGIN         94.372
## genCUNNINGHAM       94.372
## genDGR/MNX-9-9e     94.372
## genDOLLARBIRD       94.372
## genEXCALIBUR        94.372
## genGOROKE           94.372
## genHALBERD          94.372
## genHOUTMAN          94.372
## genJANZ             94.372
## genK2011-5*         94.372
## genKATUNGA          94.372
## genKIATA            94.372
## genKITE             94.372
## genKULIN            94.372
## genLARK             94.372
## genLOWAN            94.372
## genM4997            94.372
## genM5075            94.372
## genM5097            94.372
## genMACHETE          94.372
## genMEERING          94.372
## genMOLINEUX         94.372
## genOSPREY           94.372
## genOUYEN            94.372
## genOXLEY            94.372
## genPELSART          94.372
## genPEROUSE          94.372
## genRAC655           94.372
## genRAC655'S'        94.372
## genRAC696           94.372
## genRAC710           94.372
## genRAC750           94.372
## genRAC759           94.372
## genRAC772           94.372
## genRAC777           94.372
## genRAC779           94.372
## genRAC787           94.372
## genRAC791           94.372
## genRAC792           94.372
## genRAC798           94.372
## genRAC804           94.372
## genRAC805           94.372
## genRAC806           94.372
## genRAC807           94.372
## genRAC808           94.372
## genRAC809           94.372
## genRAC810           94.372
## genRAC811           94.372
## genRAC812           94.372
## genRAC813           94.372
## genRAC814           94.372
## genRAC815           94.372
## genRAC816           94.372
## genRAC817           94.372
## genRAC818           94.372
## genRAC819           94.372
## genRAC820           94.372
## genRAC821           94.372
## genROSELLA          94.372
## genSCHOMBURGK       94.372
## genSHRIKE           94.372
## genSPEAR            94.372
## genSTILETTO         94.372
## genSUNBRI           94.372
## genSUNFIELD         94.372
## genSUNLAND          94.372
## genSWIFT            94.372
## genTASMAN           94.372
## genTATIARA          94.372
## genTINCURRIN        81.728
## genTRIDENT          94.372
## genVF299            94.372
## genVF300            94.372
## genVF302            94.372
## genVF508            94.372
## genVF519            94.372
## genVF655            81.728
## genVF664            94.372
## genVG127            94.372
## genVG503            94.372
## genVG506            94.372
## genVG701            94.372
## genVG714            94.372
## genVG878            94.372
## genWARBLER          94.372
## genWI216            94.372
## genWI221            94.372
## genWI231            94.372
## genWI232            94.372
## genWILGOYNE         94.372
## genWW1402           94.372
## genWW1477           81.728
## genWW1831           94.372
## genWYUNA            94.372
## genYARRALINKA       94.372
##                 t value
## (Intercept)      10.697
## repR2             6.166
## repR3            -8.331
## gen(WqKPWmH*3Ag   0.258
## genAMERY         -0.989
## genANGAS         -1.406
## genAROONA        -1.628
## genBATAVIA       -1.858
## genBD231         -0.745
## genBEULAH        -1.840
## genBLADE         -2.861
## genBT_SCHOMBURG  -0.519
## genCADOUX        -2.367
## genCONDOR        -1.317
## genCORRIGIN      -2.306
## genCUNNINGHAM    -2.699
## genDGR/MNX-9-9e  -0.505
## genDOLLARBIRD    -2.126
## genEXCALIBUR     -0.583
## genGOROKE        -1.501
## genHALBERD       -0.565
## genHOUTMAN       -2.218
## genJANZ          -2.275
## genK2011-5*      -0.925
## genKATUNGA       -1.169
## genKIATA         -1.755
## genKITE          -1.907
## genKULIN         -0.964
## genLARK          -3.564
## genLOWAN         -1.614
## genM4997         -1.547
## genM5075         -2.063
## genM5097         -1.088
## genMACHETE       -2.451
## genMEERING       -2.624
## genMOLINEUX      -1.755
## genOSPREY        -1.717
## genOUYEN         -1.448
## genOXLEY         -2.349
## genPELSART       -2.123
## genPEROUSE       -3.006
## genRAC655        -1.194
## genRAC655'S'     -1.204
## genRAC696        -0.039
## genRAC710        -0.540
## genRAC750        -0.819
## genRAC759        -0.445
## genRAC772         0.053
## genRAC777        -1.826
## genRAC779         0.039
## genRAC787        -1.250
## genRAC791        -0.770
## genRAC792        -1.084
## genRAC798        -0.018
## genRAC804        -0.477
## genRAC805        -0.456
## genRAC806        -0.374
## genRAC807        -0.968
## genRAC808        -0.572
## genRAC809        -0.459
## genRAC810        -1.395
## genRAC811         0.449
## genRAC812        -0.996
## genRAC813        -0.883
## genRAC814        -0.766
## genRAC815        -1.176
## genRAC816        -0.703
## genRAC817        -1.060
## genRAC818        -1.134
## genRAC819        -1.286
## genRAC820        -0.011
## genRAC821        -1.042
## genROSELLA       -1.953
## genSCHOMBURGK    -1.402
## genSHRIKE        -1.356
## genSPEAR         -2.699
## genSTILETTO      -1.664
## genSUNBRI        -2.314
## genSUNFIELD      -2.190
## genSUNLAND       -1.936
## genSWIFT         -2.087
## genTASMAN        -1.706
## genTATIARA       -0.682
## genTINCURRIN     -0.232
## genTRIDENT       -1.406
## genVF299         -0.703
## genVF300         -1.183
## genVF302         -1.148
## genVF508          0.124
## genVF519         -0.011
## genVF655         -1.960
## genVF664         -1.130
## genVG127         -1.162
## genVG503         -0.456
## genVG506         -1.151
## genVG701         -0.205
## genVG714         -1.148
## genVG878          0.555
## genWARBLER       -2.299
## genWI216          0.042
## genWI221         -0.184
## genWI231         -2.314
## genWI232         -0.597
## genWILGOYNE      -1.388
## genWW1402        -1.243
## genWW1477        -2.272
## genWW1831        -0.918
## genWYUNA         -1.872
## genYARRALINKA    -2.596
##                 Pr(>|t|)
## (Intercept)      < 2e-16
## repR2           3.29e-09
## repR3           8.44e-15
## gen(WqKPWmH*3Ag 0.796766
## genAMERY        0.323747
## genANGAS        0.161192
## genAROONA       0.104884
## genBATAVIA      0.064513
## genBD231        0.456895
## genBEULAH       0.067074
## genBLADE        0.004628
## genBT_SCHOMBURG 0.604125
## genCADOUX       0.018820
## genCONDOR       0.189041
## genCORRIGIN     0.022010
## genCUNNINGHAM   0.007502
## genDGR/MNX-9-9e 0.613996
## genDOLLARBIRD   0.034584
## genEXCALIBUR    0.560621
## genGOROKE       0.134743
## genHALBERD      0.572551
## genHOUTMAN      0.027560
## genJANZ         0.023884
## genK2011-5*     0.355758
## genKATUNGA      0.243609
## genKIATA        0.080565
## genKITE         0.057772
## genKULIN        0.335964
## genLARK         0.000448
## genLOWAN        0.107915
## genM4997        0.123277
## genM5075        0.040304
## genM5097        0.277826
## genMACHETE      0.015010
## genMEERING      0.009286
## genMOLINEUX     0.080565
## genOSPREY       0.087451
## genOUYEN        0.148986
## genOXLEY        0.019713
## genPELSART      0.034882
## genPEROUSE      0.002955
## genRAC655       0.233813
## genRAC655'S'    0.229702
## genRAC696       0.969042
## genRAC710       0.589455
## genRAC750       0.413410
## genRAC759       0.656721
## genRAC772       0.957794
## genRAC777       0.069183
## genRAC779       0.969042
## genRAC787       0.212486
## genRAC791       0.442120
## genRAC792       0.279385
## genRAC798       0.985926
## genRAC804       0.633949
## genRAC805       0.649093
## genRAC806       0.708462
## genRAC807       0.334201
## genRAC808       0.567765
## genRAC809       0.646559
## genRAC810       0.164359
## genRAC811       0.654174
## genRAC812       0.320310
## genRAC813       0.378179
## genRAC814       0.444214
## genRAC815       0.240781
## genRAC816       0.482862
## genRAC817       0.290466
## genRAC818       0.258101
## genRAC819       0.199895
## genRAC820       0.991555
## genRAC821       0.298560
## genROSELLA      0.052050
## genSCHOMBURGK   0.162242
## genSHRIKE       0.176376
## genSPEAR        0.007502
## genSTILETTO     0.097603
## genSUNBRI       0.021612
## genSUNFIELD     0.029576
## genSUNLAND      0.054192
## genSWIFT        0.037990
## genTASMAN       0.089410
## genTATIARA      0.496142
## genTINCURRIN    0.816382
## genTRIDENT      0.161192
## genVF299        0.482862
## genVF300        0.237976
## genVF302        0.252234
## genVF508        0.901725
## genVF519        0.991555
## genVF655        0.051283
## genVF664        0.259583
## genVG127        0.246460
## genVG503        0.649093
## genVG506        0.250782
## genVG701        0.837867
## genVG714        0.252234
## genVG878        0.579767
## genWARBLER      0.022415
## genWI216        0.966230
## genWI221        0.854440
## genWI231        0.021612
## genWI232        0.551165
## genWILGOYNE     0.166496
## genWW1402       0.215071
## genWW1477       0.024064
## genWW1831       0.359435
## genWYUNA        0.062524
## genYARRALINKA   0.010061
##                    
## (Intercept)     ***
## repR2           ***
## repR3           ***
## gen(WqKPWmH*3Ag    
## genAMERY           
## genANGAS           
## genAROONA          
## genBATAVIA      .  
## genBD231           
## genBEULAH       .  
## genBLADE        ** 
## genBT_SCHOMBURG    
## genCADOUX       *  
## genCONDOR          
## genCORRIGIN     *  
## genCUNNINGHAM   ** 
## genDGR/MNX-9-9e    
## genDOLLARBIRD   *  
## genEXCALIBUR       
## genGOROKE          
## genHALBERD         
## genHOUTMAN      *  
## genJANZ         *  
## genK2011-5*        
## genKATUNGA         
## genKIATA        .  
## genKITE         .  
## genKULIN           
## genLARK         ***
## genLOWAN           
## genM4997           
## genM5075        *  
## genM5097           
## genMACHETE      *  
## genMEERING      ** 
## genMOLINEUX     .  
## genOSPREY       .  
## genOUYEN           
## genOXLEY        *  
## genPELSART      *  
## genPEROUSE      ** 
## genRAC655          
## genRAC655'S'       
## genRAC696          
## genRAC710          
## genRAC750          
## genRAC759          
## genRAC772          
## genRAC777       .  
## genRAC779          
## genRAC787          
## genRAC791          
## genRAC792          
## genRAC798          
## genRAC804          
## genRAC805          
## genRAC806          
## genRAC807          
## genRAC808          
## genRAC809          
## genRAC810          
## genRAC811          
## genRAC812          
## genRAC813          
## genRAC814          
## genRAC815          
## genRAC816          
## genRAC817          
## genRAC818          
## genRAC819          
## genRAC820          
## genRAC821          
## genROSELLA      .  
## genSCHOMBURGK      
## genSHRIKE          
## genSPEAR        ** 
## genSTILETTO     .  
## genSUNBRI       *  
## genSUNFIELD     *  
## genSUNLAND      .  
## genSWIFT        *  
## genTASMAN       .  
## genTATIARA         
## genTINCURRIN       
## genTRIDENT         
## genVF299           
## genVF300           
## genVF302           
## genVF508           
## genVF519           
## genVF655        .  
## genVF664           
## genVG127           
## genVG503           
## genVG506           
## genVG701           
## genVG714           
## genVG878           
## genWARBLER      *  
## genWI216           
## genWI221           
## genWI231        *  
## genWI232           
## genWILGOYNE        
## genWW1402          
## genWW1477       *  
## genWW1831          
## genWYUNA        .  
## genYARRALINKA   *  
## ---
## Signif. codes:  
##   0 '***' 0.001
##   '**' 0.01 '*'
##   0.05 '.' 0.1 '  ' 1
## 
## Residual standard error: 115.6 on 221 degrees of freedom
## Multiple R-squared:  0.6226, Adjusted R-squared:  0.4381 
## F-statistic: 3.375 on 108 and 221 DF,  p-value: 1.081e-14

Do you notice anything from below?

It’s well known in agricultural field trials that spatial variations are introduced in traits; this could be because of the fertility trend, management practices or other reasons.
In the IDA stage, you investigate to identify these spatial variations - you cannot just simply fit a two-way ANOVA model!

Teaching of Statistics should provide a more balanced blend of IDA and inference
– Chatfield (1985)

Yet there is still very little emphasis of it in teaching and also at times in practice. So don’t forget to do IDA!

3.4 Summary

Initial data analysis (IDA) is a model-focussed exploration of data with two main objectives:
- data description including scrutinizing for data quality, and
- model formulation without any formal statistical inference.
IDA hardly sees the limelight even if it’s the very foundation of what the main analysis is built on.

References

Chatfield, C. 1985. “The Initial Examination of Data.” Journal of the Royal Statistical Society. Series A 148 (3): 214–53.