Advanced plots and inference

class: center, middle, inverse, title-slide

.title[
# Advanced plots and inference
]
.subtitle[
## SISBID 2024 <a href="https://github.com/dicook/SISBID" class="uri">https://github.com/dicook/SISBID</a>
]
.author[
### Di Cook (<a href="mailto:dicook@monash.edu" class="email">dicook@monash.edu</a>) Heike Hofmann (<a href="mailto:hhofmann4@unl.edu" class="email">hhofmann4@unl.edu</a>) Susan Vanderplas (<a href="mailto:susan.vanderplas@unl.edu" class="email">susan.vanderplas@unl.edu</a>)
]
.date[
### 08/14-16/2024
]

---

# Tidy data and random variables

- The concept of tidy data matches elementary statistics
- Tabular form puts variables in columns and observations in rows
- Not all tabular data is in this form
- This is the point of tidy data

`$$X = \left[ \begin{array}{rrrr}
           X_1 & X_2 & ... & X_p 
           \end{array} \right] \\
  = \left[ \begin{array}{rrrr}
           X_{11} & X_{12} & ... & X_{1p} \\
           X_{21} & X_{22} & ... & X_{2p} \\
           \vdots & \vdots & \ddots& \vdots \\
           X_{n1} & X_{n2} & ... & X_{np}
           \end{array} \right]$$`

- We might even make assumptions about the distribution of each variable, e.g. `$X_1 \sim N(0,1), ~~X_2 \sim \text{Exp}(1) ...$`

---
# Grammar of graphics and statistics

- A statistic is a function on the values of items in a sample, e.g. for `$n$` iid random variates `$\bar{X}_1=\sum_{i=1}^n X_{i1}$`, `$s_1^2=\frac{1}{n-1}\sum_{i=1}^n(X_{i1}-\bar{X}_1)^2$`
- We study the behaviour of the statistic over all possible samples of size `$n$`. 
- The grammar of graphics is the mapping of (random) variables to graphical elements, making plots of data into statistics

---
# What is inference?

Inferring that what we see in the data at hand holds more broadly in life, society and the world.

.pull-left[
**Why do we need it for graphics?**

Here's an example tweeted by David Robinson:

Based on an analysis in [Tick Tock blog, by Graham Tierney](https://ticktocksaythehandsoftheclock.wordpress.com/2018/01/11/capitals-and-good-governance/)
]
.pull-right[
*Below is a simple scatterplot of the two variables of interest. A slight negative slope is observed, but it does not look very large. There are a lot of states whose capitals are less than 5% of the total population. The two outliers are Hawaii (government rank 33 and capital population 25%) and Arizona (government rank 26 and capital population 23%). Without those two the downward trend (an improvement in ranking) would be much stronger.*

*I ran linear regressions of government rank on the percentage of each state’s population living in the capital city, state population (in 100,000s), and state GDP (in $100,000s).... The coefficient is not significant for any regression at the traditional 5% level.*

*... I'm not convinced that the lack of significance is itself significant.*
]

---
# To do *statistical* inference

You need a:

- null hypothesis, and alternative
- statistic computed on the data
- reference distribution on which to measure the statistic: if its extreme on this scale you would reject the null

# Inference with *data plots*

You need a:

- plot description, as provided by the grammar (which is a statistic) - this would prescribe the null hypothesis
- null generating mechanism, e.g. permutation, simulation from a distribution or model
- human vision, to examine array of plots and decide if any are different from the others

---
# Some examples

Here are several plot descriptions. What would be the null hypothesis in each?

.pull-left[

A
```
ggplot(data) + geom_point(aes(x=x1, y=x2))
```

B
```
ggplot(data) + geom_point(aes(x=x1, y=x2, colour=cl))
```
]
.pull-right[
C
```
ggplot(data) + geom_histogram(aes(x=x1))
```

D
```
ggplot(data) + geom_boxplot(aes(x=cl, y=x1))
```
]

.orange[Which plot definition would most match to a null hypothesis stating **there is no difference in the distribution between the groups?**]

---
# Some examples

Here are several plot descriptions. What would be the null hypothesis in each?

.pull-left[

A
`$H_o:$` no association between `x1` and `x2`

B
`$H_o:$` no difference between levels of `cl`

]
.pull-right[
C
`$H_o:$` the distribution of `x1` is XXX

D
`$H_o:$` no difference in the distribution of `x1` between levels of `cl`
]
---

# Let's do it

.pull-left[
<img src="images/nullabor_hex.png" style="width: 20%" />

Example from the nullabor package. The data plot is embedded randomly in a field of null plots, this is a **lineup**. Can you see which one is different?

When you run the example yourself, you get a `decrypt` code line, that you run after deciding on a plot to print the location of the data plot amongst the nulls.

- plot is a scatterplot, null hypothesis is *there is no association between the two variables mapped to the x, y axes*
- null generating mechanism: permutation

]
.pull-right[
<img src="index_files/figure-html/lineup 1-1.png" width="576" style="display: block; margin: auto;" />

]
---
.pull-left[
# Lineup

Embed the data plot in a field of null plots

Ask: Which plot is the most different?
]
.pull-right[
# Null-generating mechanisms

- Permutation: randomizing the order of one of the variables breaks association, but keeps marginal distributions the same
- Simulation: from a given distribution, or model. Assumption is that the data comes from that model

# Evaluation

- Compute `$p$`-value
- Power `$=$` signal strength
]
---
# p-values

Suppose `$x$` individuals selected the data plot from a lineup of `$m$` plots, shown to `$K$` independent observers, then simplistically we can think about the probability of this happening, if the data plot is from the same distribution as the null plots. Assuming that all plots in a lineup are equally likely to be selected, this yields a binomial formula:

`$$P(X\geq x) = \sum_{i=x}^{K} \binom{K}{i} \left(\frac{1}{m}\right)^i\left(\frac{m-1}{m}\right)^{K-i}$$`

For `$x=4, K=17, m=20$`

```
     x simulated       binom
[1,] 4    0.0225 0.008800605
```

Assuming some null plots are visually more salient than others, we can introduce parameter `$\alpha$` to get to more realistic `$p$`-values:

```
alpha = 0.01 alpha = 0.15    alpha = 1 
  0.03397441   0.25942593   0.47220162 
```
`$$P(X \geq x) = \sum_{i = x}^{K} \binom{K}{x} \frac{1}{B(\alpha, (m-1)\alpha)}\cdot B(x+\alpha, K-x+(m-1)\alpha)$$`,

where `$B(.,.)$` is the Beta function. Large values of `$\alpha$` indicate that several null plots are 'interesting'. Generally, `$\alpha = 0.15$` indicates that 1 or 2 null plots are interesting enough for viewers to pick them. 
---

**Evaluating goodness-of-fit with a residual plot**

- plot is a residual vs fitted scatterplot, null hypothesis is *there is no association between the two statistics*
- null generating mechanism: residual rotation

---
**Assessing temporal dependence**

- plot is a lineplot, null hypothesis is *there is no temporal trend*
- null generating mechanism: simulation from a time series model

---
# Let's talk TB

There are many aspects of this plot, this is what we said earlier:

- *Across all ages, and years, the proportion of males having TB is higher than females*
- These proportions tend to be higher in the middle age groups, for all years.
- Relatively similar proportions across years.

---
# Null hypothesis

The plot is constructed by plotting count against year, separately by age group, and coloured by sex.

- By colouring by sex, we make this a primary comparison
- Proportion of sex, conditional on age group and year is the query being addressed by this plot.

*Null hypothesis*: TB incidence is spread equally among men and women, regardless of age and year.
*Alternative hypothesis*: It isn't.

---

``` r
# Make expanded rows of categorical variables
# matching the counts of aggregated data.
# sex needs to be converted to 0, 1 to 
# match binomial simulations
tb_us_long <- expandRows(tb_us, "count")
tb_us_long <- tb_us_long |>
 mutate(sex01 = ifelse(sex=="m", 0, 1)) |>
 select(-sex)

# Generate a lineup of three, randomly choose one of the
# positions to place true data.
# Compute counts again.
pos = sample(1:3, 1)
l <- lineup(null_dist(var="sex01", dist="binom", 
 list(size=1, p=0.5)), 
 true=tb_us_long, n=3, pos=pos)
l <- l |>
 group_by(.sample, year, age) |>
 dplyr::count(sex01)
```

---

---
# A more complicated null

*Null hypothesis*: TB incidence is has the same proportion of men and women, regardless of age and year.
*Alternative hypothesis*: It isn't.

``` r
# Compute proportion across all data
tbl <- tb_us |> group_by(sex) |> summarise(count=sum(count))
tbl
# A tibble: 2 × 2
 sex count
 <chr> <dbl>
1 f 25915
2 m 55640
p <- tbl$count[1]/sum(tbl$count)

pos = sample(1:3, 1)
l <- lineup(null_dist(var="sex01", dist="binom", 
 list(size=1, p=p)), 
 true=tb_us_long, n=3, pos=pos)
l <- l |>
 group_by(.sample, year, age) |>
 dplyr::count(sex01)
```

---

---
class: inverse middle

# Danger zone

- The null hypothesis is determined based on the plot type
--

- It is not based on the structure seen in a data set

---
# A map lineup example

Cancer incidence across the US 2010-2014, all cancer types, per 100k. Data from American Cancer Society, https://cancerstatisticscenter.cancer.org. Does one map show a spatial trend?

---
background-image: \url(https://pbs.twimg.com/profile_images/1092451626781163523/0YzJMi-8_400x400.jpg)
background-size: 12%
background-position: 100% 0%
# Your turn

.pull-left[

Time to play Sesame Street, with your study group.

Each person can

1. run this code, 
2. look at the lineup of plots, and choose which plot shows the most separation between classes (DON'T LOOK AT ANYONE ELSE's!)
3. then run the `decrypt` line, produced when the `lineup` function was called
4. tally up the number of your group members who picked the data plot, this is `x`
5. use the `pvisual` function to compute the p-value, `K=` the number of people in your group, `m=12`
]

.pull-right[

``` r
library(MASS)
data(wasps)
lda_pred <- function(x) {
 d <- predict(lda(Group~., 
 data=x[,-43]))$x[,1:2] |>
 as_tibble() |>
 mutate(Group = x$Group)
 return(d)
}
wasps_lineup <- lineup(null_permute('Group'), 
 wasps[,-1], n=12) |>
 as_tibble()
wasps_lineup_lda <- wasps_lineup |>
 split(.$.sample) |>
 map_df(~lda_pred(.)) |>
 mutate(.sample = wasps_lineup$.sample)
ggplot(wasps_lineup_lda, aes(x=LD1, y=LD2, 
 colour=Group)) + 
 geom_point() +
 facet_wrap(~.sample, ncol=4) +
 scale_colour_brewer(palette="Dark2") +
 theme(legend.position="none")
```
]

.orange[Enter the p-value in the chat!]
---
# Resources

- VanderPlas S., Röttger Chr., Cook D., Hofmann H.: Statistical Significance Calculations for Scenarios in Visual Inference, STAT, 2020, doi:http://dx.doi.org/10.1002/sta4.337.
- Hofmann, H., Follett, L., Majumder, M. and Cook, D. (2012) Graphical Tests for Power Comparison of Competing Designs, http://doi.ieeecomputersociety.org/10.1109/TVCG.2012.230.
- Wickham, H., Cook, D., Hofmann, H. and Buja, A. (2010) Graphical Inference for Infovis,  http://doi.ieeecomputersociety.org/10.1109/TVCG.2010.161. 
- Sievert, C. (2019) Interactive web-based data visualization with R, plotly, and shiny,  https://plotly-r.com/index.html
---
# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.