Is there any pattern in the residuals that indicates a problem with the model fit?
Do we need to change the model specification?
Is there a difference between the species?
Which is the best display to answer: what is the distribution of thyroid cancer across Australia?
Reading data plots is subjective.
Making decisions based on data visualisations is common, and there we need to be objective.
It is possible, and here is how we do that …
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
How would you get this data into tidy form?
First get the data in tidy form
# Uses dplyr, tidyr and stringr (all part of the tidyverse)
tb_aus_sa <- tb_aus |>
  filter(year > 2012) |>
  select(iso3, year,
         newrel_f014:newrel_f65,
         newrel_m014:newrel_m65) |>
  # One row per country, year and sex/age group
  pivot_longer(cols = newrel_f014:newrel_m65,
               names_to = "sex_age",
               values_to = "count") |>
  filter(!is.na(count)) |>
  # "newrel_f014" splits into "newrel" (discarded later) and "f014"
  separate(sex_age, into = c("stuff", "sex_age")) |>
  mutate(sex = str_sub(sex_age, 1, 1),
         age = str_sub(sex_age, 2, str_length(sex_age))) |>
  mutate(age = case_when(
    age == "014" ~ "0-14",
    age == "1524" ~ "15-24",
    age == "2534" ~ "25-34",
    age == "3544" ~ "35-44",
    age == "4554" ~ "45-54",
    age == "5564" ~ "55-64",
    age == "65" ~ "65")) |>
  select(iso3, year, sex, age, count)
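As a smaller illustration of the same reshaping idea, the sketch below uses a made-up two-column data frame (the values are invented for illustration and are not from the TB data) and splits sex and age in one step with the `names_pattern` argument of `pivot_longer()`. It is an alternative under those assumptions, not the approach used above.

```r
library(tidyr)

# Toy wide data: invented counts, only the column naming mimics the TB data
toy_wide <- data.frame(
  iso3 = "AUS",
  year = 2013L,
  newrel_f014 = 5,
  newrel_m014 = 7
)

# Each regex group in names_pattern fills one of the names_to columns
toy_long <- pivot_longer(
  toy_wide,
  cols = starts_with("newrel_"),
  names_to = c("sex", "age"),
  names_pattern = "newrel_([fm])(\\d+)",
  values_to = "count"
)

toy_long
```

Each row of `toy_long` is now one observation (country, year, sex, age group), and each column is one variable.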
\[X = \left[ \begin{array}{rrrr} X_{~1} & X_{~2} & ... & X_{~p} \end{array} \right] \\ = \left[ \begin{array}{rrrr} X_{~11} & X_{~12} & ... & X_{~1p} \\ X_{~21} & X_{~22} & ... & X_{~2p} \\ \vdots & \vdots & \ddots& \vdots \\ X_{~n1} & X_{~n2} & ... & X_{~np} \end{array} \right]\]
In ggplot2, the variables from tidy data are explicitly mapped to elements of the plot, using aesthetics.
Basic Mappings
x and y to plot points in a two-dimensional space
color, fill to render as a color scale
size maps variable to size of object
shape maps variable to different shapes
Depending on the geom, different mappings are possible: xmin, xend, linetype, alpha, stroke, weight …
Facets
Variables are used to subset (or condition)
Layers
Different data can be mapped onto the same plot, e.g. observations and means
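A minimal sketch of mappings, facets and layers together, using the built-in `mtcars` data as a stand-in (it is not one of the data sets in these slides). Observations are drawn as points, and a second layer with its own data adds the group means.

```r
library(ggplot2)

# Second data set for the extra layer: mean mpg for each cylinder group
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                              # layer 1: observations
  geom_hline(data = group_means,              # layer 2: means, different data
             aes(yintercept = mpg),
             colour = "red") +
  facet_wrap(~cyl)                            # facet: condition on cyl

p
```

Because `group_means` also contains `cyl`, each facet shows only the mean line for its own group.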
How are variables mapped to create this plot?
# scale_fill_discrete_divergingx() is from the colorspace package
ggplot(tb_aus_sa,
       aes(x = year,
           y = count,
           fill = sex)) +
  geom_col(position = "fill") +
  facet_wrap(~age, ncol = 7) +
  ylab("") +
  scale_fill_discrete_divergingx(palette = "ArmyRose") +
  scale_x_continuous("year",
                     breaks = seq(2013, 2021, 2),
                     labels = c("13", "15", "17", "19", "21")) +
  theme(legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_blank(),
        axis.text = element_text(size = 10))  # size is numeric, not a string
Observations: Roughly equal proportions of males and females, with higher incidence among males in the older age groups. No clear temporal trend.
ggplot(tb_aus_sa,
       aes(x = year,
           y = count,
           colour = sex)) +
  geom_point() +
  geom_smooth(se = FALSE, alpha = 0.7) +
  facet_grid(sex~age, scales = "free_y") +
  ylab("count") +
  scale_colour_discrete_divergingx(palette = "ArmyRose") +
  scale_x_continuous("year",
                     breaks = seq(2013, 2021, 2),
                     labels = c("13", "15", "17", "19", "21")) +
  theme(legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_blank(),
        axis.text = element_text(size = 10))  # size is numeric, not a string
A small increasing temporal trend is present in the younger age groups, for both males and females. The older groups show one too, although the counts are much smaller.
Cleveland and McGill (1984)
Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.
Making comparisons across plots requires the eye to jump from one focal point to another. It may result in not noticing differences.
For comparison of different patterns, consider the scale. Typically the scale should be the SAME in each plot.
What do you see?
✗ non-linearity
✓ heteroskedasticity
✗ outliers/anomalies
✓ non-normality
✗ fitted value distribution is uniform
Are you sure?
What do you see?
There is a difference between the groups
✓ location
✗ shape
✓ outliers/anomalies
Are you sure?
What do you see?
There are clusters of high and low temperature in different parts of the region.
✓ clusters
✓ outliers/anomalies
Are you sure?
What is the null hypothesis?
There is no relationship between residuals and fitted values. This is \(H_o\).
Alternative hypothesis, \(H_a\):
There is some relationship, which might be
\(H_o\): There is no relationship between residuals and fitted values.
How would you generate null samples?
Break any association by
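One way to break an association is to permute one of the variables. The sketch below builds a lineup by hand in base R to show the idea: the data is placed at a random position and every other panel gets a shuffled copy of one column. `make_lineup()` is a hypothetical helper written for this illustration, not a nullabor function, and `iris` is a stand-in data set.

```r
# Hypothetical helper (not part of nullabor): lineup by permuting one column
make_lineup <- function(data, var, m = 15) {
  pos <- sample(m, 1)                      # random position of the data plot
  panels <- lapply(seq_len(m), function(i) {
    d <- data
    if (i != pos) {
      d[[var]] <- sample(d[[var]])         # shuffling breaks any association
    }
    d$.sample <- i                         # panel identifier, used for faceting
    d
  })
  out <- do.call(rbind, panels)
  attr(out, "pos") <- pos                  # remember where the data plot is
  out
}

set.seed(241)
lineup_df <- make_lineup(iris, "Species", m = 15)
```

Permuting keeps the marginal distribution of every variable intact and changes only the relationship between the shuffled variable and the rest, which is what the null hypothesis describes.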
# lineup() and null_permute() are from the nullabor package
set.seed(241)
ggplot(lineup(null_permute("species"), penguins, n = 15),
       aes(x = flipper_length_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point(alpha = 0.8) +
  facet_wrap(~.sample, ncol = 5) +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_blank(),
        panel.grid.major = element_blank())
If 10 people are shown this lineup and all 10 pick plot 2, which is the data plot, the \(p\)-value will be effectively 0.
Generally, we can compute the probability that the data plot is chosen by \(x\) out of \(K\) observers, shown a lineup of \(m\) plots, using a simulation approach that extends from a binomial distribution, with \(p=1/m\).
This means we would reject \(H_o\) and conclude that there is a difference in the distribution of bill length and flipper length between the species of penguins.
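As a rough sketch of the calculation, the plain binomial tail probability can be computed in base R. This is a simplification: it assumes observers choose independently and ignores the simulation-based adjustment mentioned above. `lineup_pvalue()` is a name made up for this illustration.

```r
# P(X >= x), where X ~ Binomial(K, 1/m): the chance that at least x of K
# observers pick the data plot when all m plots are equally likely choices
lineup_pvalue <- function(x, K, m) {
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
}

p_all <- lineup_pvalue(x = 10, K = 10, m = 15)  # all ten pick the data plot
p_one <- lineup_pvalue(x = 1, K = 10, m = 15)   # only one picks the data plot

p_all
p_one
```

With all ten observers choosing the data plot the probability equals \((1/15)^{10}\), which is tiny; with a single observer choosing it the probability is roughly one half.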
data(wasps)
set.seed(258)
wasps_l <- lineup(null_permute("Group"), wasps[,-1], n = 15)
wasps_l <- wasps_l |>
  mutate(LD1 = NA, LD2 = NA)
# Fit LDA separately to each panel and store the first two discriminants
for (i in unique(wasps_l$.sample)) {
  x <- filter(wasps_l, .sample == i)
  xlda <- MASS::lda(Group~., data = x[,1:42])
  xp <- predict(xlda, x, dimen = 2)$x  # dispatches to MASS's predict method
  wasps_l$LD1[wasps_l$.sample == i] <- xp[,1]
  wasps_l$LD2[wasps_l$.sample == i] <- xp[,2]
}
ggplot(wasps_l,
       aes(x = LD1,
           y = LD2,
           color = Group)) +
  geom_point(alpha = 0.8) +
  facet_wrap(~.sample, ncol = 5) +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_blank(),
        panel.grid.major = element_blank())
If 10 people are shown this lineup and only 1 picks the data plot (position 6), the \(p\)-value will be large.
This means we would NOT reject \(H_o\): there is no evidence of a difference in the distribution of the groups.
Which plot is the most different?
Plot description was:
In particular, the researcher is interested to know if star temperature has a skewed distribution.
\(H_o: X\sim exp(\widehat{\lambda})\)
\(H_a:\) it has a different distribution.
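Under this \(H_o\), null samples can be drawn by simulating from an exponential distribution with the rate estimated from the data. The sketch below uses simulated stand-in values, not the star temperatures, and the choice of 19 null samples is arbitrary.

```r
set.seed(1)

# Stand-in data (NOT the star temperature data)
x <- rexp(200, rate = 0.5)

# Maximum likelihood estimate of the exponential rate
lambda_hat <- 1 / mean(x)

# Each column is one null sample of the same size as the data
nulls <- replicate(19, rexp(length(x), rate = lambda_hat))

dim(nulls)
```

Plotting the data among these 19 null samples, using the same display for every panel, gives a lineup of 20 plots.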
No peeking!
Which plot is the most different?
Raise your hand when you have chosen.
This is the pair of plot designs we are evaluating.
Compute signal strength:
No peeking!
Which plot is the most different?
Raise your hand when you have chosen.
This is the pair of plot designs we are evaluating. Comparing colour palettes used for the spatial distribution of temperature.
Compute signal strength:
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.