SISBID Data Visualization – Visual perception and effective plot construction

Game: Which plot wears it better?

Coming up: 2 different plots of 2012 TB incidence (i.e. newly diagnosed cases) in Kenya, based on two variables:

tb_kn |> 
  filter(year == 2012) |> 
  dplyr::select(sex, age, count) |>
  head()

# A tibble: 6 × 3
  sex   age   count
  <chr> <chr> <dbl>
1 m     15-24  4893
2 m     25-34  8149
3 m     35-44  5302
4 m     45-54  2493
5 m     55-64  1099
6 m     65+     669

In arrangement A, separate plots are made for age, and sex is mapped to the x axis.
Conversely, in arrangement B, separate plots are made for sex, and age is mapped to the x axis.
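The two arrangements can be sketched with `facet_wrap()`. This is only an illustration, not the code behind the slides: the 2012 rows for females are not printed here, so the sketch uses the 1995 rows from the table shown under "Three Variables" (the four age groups for which both sexes appear). The names `tb_small`, `plot_a` and `plot_b` are made up for this example, and bars are an assumed choice of geom.

```r
library(ggplot2)
library(tibble)

# Rows copied from the 1995 table printed in these notes
# (only the age groups for which both sexes are shown)
tb_small <- tribble(
  ~sex, ~age,    ~count,
  "m",  "15-24", 2072,
  "m",  "25-34", 3073,
  "m",  "35-44", 1675,
  "m",  "45-54",  920,
  "f",  "15-24", 1802,
  "f",  "25-34", 1759,
  "f",  "35-44",  741,
  "f",  "45-54",  411
)

# Arrangement A: one panel per age group, sex on the x axis
plot_a <- ggplot(tb_small, aes(x = sex, y = count, fill = sex)) +
  geom_col() +
  facet_wrap(~age, ncol = 4)

# Arrangement B: one panel per sex, age on the x axis
plot_b <- ggplot(tb_small, aes(x = age, y = count, fill = age)) +
  geom_col() +
  facet_wrap(~sex, ncol = 2)

plot_a
plot_b
```

In arrangement A the two sexes sit side by side within each panel, so the male versus female comparison is the close one; in arrangement B the age groups are adjacent instead.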

At which age(s) are the counts for males and females relatively the same?

Which plot makes this question easier to answer?

TWO MINUTE CHALLENGE 🔮 👽 👼

At which age(s) are the counts relatively similar across sex?

Which plot makes this easier? What do we learn from each? What’s the focus? What’s easy? What’s harder?

TWO MINUTE CHALLENGE 🔮 👽 👼

Write out a question that would be easier to answer from arrangement B.


Three Variables

Next, we have two different plots of TB incidence in Kenya, based on three variables:

tb_kn |> select(year, sex, age, count) |> head(10)

# A tibble: 10 × 4
    year sex   age   count
   <dbl> <chr> <chr> <dbl>
 1  1995 m     15-24  2072
 2  1995 m     25-34  3073
 3  1995 m     35-44  1675
 4  1995 m     45-54   920
 5  1995 m     55-64   485
 6  1995 m     65+     296
 7  1995 f     15-24  1802
 8  1995 f     25-34  1759
 9  1995 f     35-44   741
10  1995 f     45-54   411

In plot type A, a line plot of counts is drawn separately by age and sex, and year is mapped to the x axis.
Conversely, in plot type B, counts are stacked into a bar chart, drawn separately by age and sex, and year is mapped to the x axis.

Is the trend for females generally decreasing over time? Which plot makes this easier?

TWO MINUTE CHALLENGE 🔮 👽 👼

Which type of plot makes it easier to answer this question?

Is the trend for females generally decreasing over time?


TWO MINUTE CHALLENGE 🔮 👽 👼

What are the pros and cons of each way of displaying the same information? Should specific limits on axes be made?

Should the limits of the y axis in plot A include 0 (zero)?


TWO MINUTE CHALLENGE 🔮 👽 👼

Plot A shows the proportion as a line plot.
Plot B shows stacked bars scaled to 100% for females and males.

Is there an age effect in the proportion of incidence by gender? Is there a temporal trend in the proportions?
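A rough sketch of how such proportions might be computed and drawn. It is not the code behind the slides: it covers a single year only, using the 1995 rows printed earlier in these notes, so it can show an age effect but not a temporal trend. The names `tb_small`, `tb_prop`, `plot_fill` and `plot_line` are invented for the example.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Rows copied from the 1995 table printed in these notes
tb_small <- tribble(
  ~sex, ~age,    ~count,
  "m",  "15-24", 2072,
  "m",  "25-34", 3073,
  "m",  "35-44", 1675,
  "m",  "45-54",  920,
  "f",  "15-24", 1802,
  "f",  "25-34", 1759,
  "f",  "35-44",  741,
  "f",  "45-54",  411
)

# Share of each age group's cases contributed by each sex
tb_prop <- tb_small |>
  group_by(age) |>
  mutate(prop = count / sum(count)) |>
  ungroup()

# Stacked bars scaled to 100%: position = "fill" rescales each bar to sum to 1
plot_fill <- ggplot(tb_small, aes(x = age, y = count, fill = sex)) +
  geom_col(position = "fill") +
  ylab("proportion")

# The male share as points joined by a line, with 0.5 as a reference
plot_line <- tb_prop |>
  filter(sex == "m") |>
  ggplot(aes(x = age, y = prop, group = 1)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  ylim(0, 1) +
  ylab("proportion male")

plot_fill
plot_line
```

The filled bars encode the proportion as a length that starts at 0 or 1, while the line plot encodes it as a position against a common scale with an explicit 0.5 reference.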


Perceptual principles

Hierarchy of mappings
Pre-attentive: some elements are noticed before you even realise it.
Color palettes: qualitative, sequential, diverging.
Proximity: Place elements for primary comparison close together.
Change blindness: When focus is interrupted differences may not be noticed.

Hierarchy of mappings

Position - common scale (BEST)
Position - nonaligned scale
Length, direction, angle
Area
Volume, curvature
Shading, color (WORST)

(Cleveland and McGill, 1984; Heer and Bostock, 2010)

The hierarchy of mappings is drawn primarily from research by Cleveland & McGill in the 1980s on basic perception.

It’s important to know that this hierarchy applies to estimation accuracy, but not necessarily to speed, accuracy of the “gist” of the plot, or relative magnitude judgments.

It’s much easier to compare aligned quantities – think of two bars placed next to each other, where the bottom is aligned to the axis - you’re essentially just comparing the position of the top of the bar. When the bars are no longer aligned, as in a stacked bar chart, it’s a bit harder to estimate the size of the bar correctly – this is one example of a nonaligned scale; another would be making comparisons between facets that don’t share a scale.

Then comes a 3-way tie between length, angle, and direction – these are easy enough to see, but their magnitude is harder to estimate.

As we increase the dimensionality of the geometric object, we lose accuracy – area is less accurate than length, and volume is less accurate than area. This is one really good reason not to add extra dimensions to your bar charts (looking at you, MS Excel!)

Finally, we have color and shading. Remember, this is for estimation accuracy – color and shading are both useful, but it’s much harder to get an exact numerical estimate from the legend. Sometimes, people add a third dimension to a plot using color, and this hierarchy tells you that if you’re going to do that, you should map the least important numerical variable to color and save the important ones for the \(x\) and \(y\) axes, which use position.
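To make the two ends of the hierarchy concrete, here is a small sketch that encodes the same six numbers once by position on a common scale and once by color. The counts are the 1995 male rows from the table printed earlier in these notes; the object names and the choice of `geom_tile()` with the "Blues" palette are assumptions made for illustration.

```r
library(ggplot2)
library(tibble)

# Male counts for 1995, copied from the table printed in these notes
tb_m <- tibble(
  age   = c("15-24", "25-34", "35-44", "45-54", "55-64", "65+"),
  count = c(2072, 3073, 1675, 920, 485, 296)
)

# Position on a common scale: each count is read off the shared y axis
p_position <- ggplot(tb_m, aes(x = age, y = count)) +
  geom_point(size = 3)

# Color/shading: each count has to be decoded from the legend
p_colour <- ggplot(tb_m, aes(x = age, y = 1, fill = count)) +
  geom_tile() +
  scale_fill_distiller(palette = "Blues", direction = 1) +
  theme(
    axis.title.y = element_blank(),
    axis.text.y  = element_blank(),
    axis.ticks.y = element_blank()
  )

p_position
p_colour
```

Try estimating the count for the 35-44 group from each plot: the ordering is visible in both, but a numerical estimate is much easier to read from the position plot.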

TWO MINUTE CHALLENGE 🔮 👽 👼

Come up with a plot type for each of the mappings.

Position - common scale (BEST)
Position - nonaligned scale
Length, direction, angle
Area
Volume, curvature
Shading, color (WORST)

(Cleveland and McGill, 1984; Heer and Bostock, 2010)


Color palettes

library(RColorBrewer)  # provides display.brewer.all()
display.brewer.all()

Sequential,
Diverging,
Qualitative

Color Brewer annotates palettes with attributes.


Next, we should talk about color – when to use it, what type of scale you should use for different types of variables, and how to ensure that your audience can perceive your color scale.

The ColorBrewer project is helpful, because it provides some useful scales as well as attributes for those scales. It’s good to realize that these attributes were assessed for maps – they may work well enough for other types of charts, but they may not. Everything has limitations.

ColorBrewer helpfully divides its palettes into different types:

Sequential, a scale that increases the saturation (amount of color) to show magnitude
Diverging, a scale that has two directions of saturation, shown using different hues. This type of scale is useful for showing e.g. temperature, deviation from the mean, etc., where the direction is just as relevant as the value.
Qualitative scales are used for categorical variables. With qualitative scales, it’s important to keep the number of categories low enough that we can remember what color matches what value.
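The palette types and attributes can also be inspected from R. The sketch below uses the `brewer.pal.info` data frame and the `brewer.pal()` function from the RColorBrewer package; the three palettes chosen and the variable names are just for illustration.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame shipped with RColorBrewer;
# its row names are palette names and `category` is seq, div or qual
pals <- c("Blues", "PRGn", "Set1")
brewer.pal.info[pals, ]

# Extract five colours (as hex strings) from one palette of each type
seq_cols  <- brewer.pal(5, "Blues")
div_cols  <- brewer.pal(5, "PRGn")
qual_cols <- brewer.pal(5, "Set1")

# Display only one family of palettes at a time
display.brewer.all(type = "seq")
```

The `maxcolors` column is a useful reminder of the point about qualitative scales: each palette only offers a limited number of distinguishable colours.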

Sequential

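library(ggplot2)  # ggplot(), diamonds data
library(dplyr)    # sample_n()

# Sample 1000 diamonds to reduce overplotting
dsamp <- diamonds |>
  sample_n(1000)
# Wrapping the assignment in parentheses also prints the plot
(d <- ggplot(
  dsamp, aes(carat, price)) +
  geom_point(aes(
    colour = clarity)))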

Emphasize one side of the spectrum
viridis package palette
- maps to uniform grey scale

Sequential

d + scale_colour_brewer(direction = -1)

Default brewer sequential scale, blues.
Focus is on the dark blue.

Diverging

d + scale_colour_brewer(palette="PRGn")

Emphasize both ends, high AND low
De-emphasize middle

Qualitative

d + scale_colour_brewer(palette="Set1")

Map qualitative variables to most differentiated set of colors.

It’s possible to have too many colours to perceive differences.

TWO MINUTE CHALLENGE 🔮 👽 👼

Of the previous four colour schemes on the same data, which would be the most appropriate? Why?

viridis
ColorBrewer sequential Blues
ColorBrewer Diverging PRGn
ColorBrewer Categorical Set1


Color blind-proofing

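library(scales)     # provides hue_pal()
library(dichromat)  # provides dichromat()

clrs <- hue_pal()(9)  # original colours: the default ggplot2 hues
d +
  scale_colour_manual("", values = clrs) +
  theme(legend.position = "none")

clrs <- dichromat(hue_pal()(9))  # same hues under simulated colour blindness
d + 
  scale_colour_manual("", values = clrs) + 
  theme(legend.position = "none")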

Online checking tool Coblis: upload an image and it will re-map the colours to simulate different colour perception issues.
The package colorblind has color blind friendly palettes (Susan: but the colours are awful 😭).
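Another option, sketched here as a suggestion rather than something taken from the slides, is to use ColorBrewer's own annotation: the `colorblind` column of `brewer.pal.info` in RColorBrewer flags palettes considered colour blind friendly. The name `cb_ok` is made up for this example.

```r
library(RColorBrewer)

# Keep only the palettes that ColorBrewer flags as colour blind friendly
cb_ok <- brewer.pal.info[brewer.pal.info$colorblind, ]

# Names of the flagged qualitative palettes
rownames(cb_ok[cb_ok$category == "qual", ])

# Display only the flagged palettes
display.brewer.all(colorblindFriendly = TRUE)
```

A palette chosen this way can be passed to `scale_colour_brewer(palette = ...)` in the same way as the palettes used above.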

Color blind Simulation

Original colours

Color blind view

Pre-attentive

Can you find the odd one out?

Pre-attentive

Is it easier now?

Proximity

Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.

Mapping and proximity

Same proximity is used, but different geoms.

Which is better to determine the relative ratios of males to females by age?

Mapping and proximity

Same proximity is used, but different geoms.

Which is better to determine the relative ratios of ages by sex?

Change blindness

ggplot(dsamp, aes(x=carat, y=price, colour = clarity)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  scale_color_brewer(palette="Set1") +
  facet_wrap(~clarity, ncol=4)

Which has the steeper slope, VS1 or VS2?

Change blindness

Making comparisons across plots requires the eye to jump from one focal point to another.

It may result in not noticing differences.

ggplot(dsamp, aes(x=carat, y=price, 
                  colour = clarity)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  scale_color_brewer(palette="Set1")
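If the eye cannot settle the VS1 versus VS2 question from the faceted plot, a numerical check is one way to confirm what the overlaid plot suggests. The sketch below fits a straight line per clarity grade with `lm()`; this is a simplification, since the plots above use the default `geom_smooth()` smoother rather than a linear fit, and it uses the full `diamonds` data instead of the random sample. The helper `slope_for()` is hypothetical.

```r
library(ggplot2)  # provides the diamonds data

# Slope of a straight-line fit of price on carat for one clarity grade
slope_for <- function(grade) {
  fit <- lm(price ~ carat, data = subset(diamonds, clarity == grade))
  unname(coef(fit)["carat"])
}

slopes <- c(VS1 = slope_for("VS1"), VS2 = slope_for("VS2"))
slopes
```

The two slopes are in dollars per carat, so their difference can be judged directly instead of by jumping between panels.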

Core principles

Make a plot of your data!
- The hierarchy matters if the structure is weak or differences between groups are small.
Knowing how to use proximity is a valuable and rare skill
Use of colour: don’t overuse it
- Too many colours
- Mapping a continuous variable to colour to add another dimension

Core principles

Show the data!
- Statistics are good if there’s too much data
- Always plot the data for yourself to see the variability
One plot is never enough
- Plot the data in different ways
- Understand the relationships between variables

Your turn

This builds on the exercise from the previous session.

Using your choice of country, for example, Australia, make a set of plots to explore the TB incidence among males relative to females over different age groups for 2012.
Choose your best plot to answer this question: Is there a higher incidence of TB among younger women in 2012?


Resources

Claus Wilke, Fundamentals of Data Visualization
Naomi Robbins, Creating More Effective Graphs
Cleveland, McGill (1984) Graphical perception: Theory, experimentation
Heer, Bostock (2010) Crowdsourcing graphical perception
Antony Unwin, Graphical Data Analysis with R
Wagemans et al. (2012) A Century of Gestalt Psychology in Visual Perception:
- I. Perceptual Grouping and Figure-Ground Organization
- II. Conceptual and Theoretical Foundations
Wickham (2013) Graphical criticism
VanderPlas, Goluch, Hofmann (2019) Framed! Reproducing & Revisiting 150 y/o Charts
VanderPlas, Hofmann (2015) Signs of the Sine Illusion

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.