Interactively Visualizing Multivariate Market Segmentation Using the R Package Lionfish

Dianne Cook
Econometrics and Business Statistics
Monash University
Joint with Ursula Laa and Matthias Medl, BOKU

Fritz’s work



It turns out that Fritz implemented a tour:

library(flexclust)
randomTour(iris[,1:4], axiscol=2:5)



Today’s work extends it with better algorithms for choosing which projections to show, and with interactive graphics, implemented in Python, where plots are linked.

photo of Fritz Leisch

Motivation

You can get any result you want when clustering data.

Yes and no; ideally, no.


Market segmentation tends to carve a “blob” of data into chunks using clustering algorithms. We argue that:

  • Clustering follows the shape of the data according to mathematical rules
  • Algorithms have favourites and quirks, which are replicable and repeatable

Whole apple and knife on cutting board. Apple in two halves and knife on cutting board. Apple sliced into eighths and knife on cutting board.

Whole banana and knife on cutting board. Banana in half and knife on cutting board. Banana cut into eight coin-shaped pieces and knife on cutting board.

Objective



Learn about the shape of the data and how a clustering has carved up the data …



… by using tours - linear projections of high-dimensional data.
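The core operation behind a tour can be sketched in a few lines of base R. This is a minimal illustration, not lionfish or tourr code: simulated data stands in for a real 6D dataset, and a QR decomposition plays the role tourr’s basis_random() plays in producing an orthonormal 2D basis.

```r
set.seed(42)
p <- 6                                  # number of variables
X <- matrix(rnorm(100 * p), ncol = p)   # stand-in for 6D data
B <- matrix(rnorm(p * 2), ncol = 2)     # random p x 2 matrix
A <- qr.Q(qr(B))                        # orthonormalise the columns
Y <- X %*% A                            # one 2D projection: one tour frame
crossprod(A)                            # ~ 2x2 identity: A is orthonormal
```

A tour animates a smoothly interpolated sequence of such bases A, so the viewer sees the high-dimensional shape from many angles.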

Quick quiz

This is how we tend to visualise cluster results.

Side-by-side histograms with x axes labelled 'V1' and 'V2'. Bars are segmented into four colours: red, orange, blue, green. In both histograms the colours appear in roughly that order along the axis.

How does this clustering result carve up 2D data? What does the data look like?

Is it easier now?

Scatterplot with axes labelled 'V1' and 'V2'. Points are one of four colours: red, orange, blue, green. The points form a strong positive linear association, which is partitioned along this axis into red, orange, blue, green sections.

Now we can see the clustering has partitioned the blob.

Try again

This is how we tend to visualise cluster results.

Side-by-side histograms with x axes labelled 'V1' and 'V2'. Bars are segmented into four colours: red, orange, blue, green. The order of colours differs between the two histograms: red takes medium values on V1 but low values on V2; orange takes medium values on V1 but high values on V2; blue takes high values on V1 but medium values on V2; green takes low values on V1 and medium values on V2.

How does this clustering result carve up 2D data? What does the data look like?

Is it easier now?

Scatterplot with axes labelled 'V1' and 'V2'. Points are one of four colours: red, orange, blue, green. The points form a blob with no association, which is partitioned along into four quadrants of red, orange, blue, green sections.

Now we can see the clustering has partitioned the blob.

Searching for the partitions in high dimensions

Example: Risk Taking

  • Survey of 563 Australian tourists, see Dolnicar S, Grün B, Leisch F (2018)
  • Six different types of risks: recreational, health, career, financial, social and safety
  • Rated on a scale from 1 (never) to 5 (very often)

Step 1: understand the shape of the data

Code
# Step 1: get a sense of the data
library(lionfish)
library(tourr)
data("risk")
colnames(risk) <- c("Rec", "Hea", "Car", "Fin", "Saf", "Soc")

animate_xy(risk)
set.seed(201)
render_gif(risk,
           grand_tour(),
           display_xy(col = "#6C26AC"),
           start = basis_random(6, 2),
           gif_file = "gifs/risk_gt.gif",
           apf = 1/20,
           frames = 400,
           width = 400,
           height = 400)

Apple in two halves and knife on cutting board.

Banana cut into eight coin-shaped pieces and knife on cutting board.

Animation showing 2D projections of 6D data as scatterplots of purple points. There is a circle with line segments radiating from the centre which represent the projection coefficients of each 2D projection shown. The patterns that can be seen are circular in many projections, and sometimes elongated, almost elliptical, with higher density at one end and lower density at the other. We can also see discrete lines of points because each variable is ordinal; this can be ignored, since it is not important structure for understanding the association between variables.

A single 2D projection of 6D data shown as a scatterplot of purple points. A purple sketch roughs out the shape, which is like a pear. The variables contributing most to this projection are Soc, Rec and Hea.

A single 2D projection of 6D data shown as a scatterplot of purple points. A purple sketch roughs out the shape, which is like a rhombus. All six variables contribute to this projection in different directions.

Software: lionfish

  • R package that works with implementations of clustering algorithms, and with the tourr package to generate tour paths
  • Python interface using Tkinter and matplotlib for the GUI and the interactive graphics
  • matplotlib enables fast rendering and interactivity for linked brushing and manual tours

Finding the partitions

  1. Run the clustering
  2. Run a guided tour with the LDA index to find projections that best separate the clusters
  3. Manual tour to refine view of partitions
Code
# Initialise python environment
init_env()

library(tibble)
library(dplyr)

risk_d  <- apply(risk, 2, function(x) (x-mean(x))/sd(x))

# Three clusters
nc <- 3
set.seed(1145)
r_km <- kmeans(risk_d, centers=nc,
               iter.max = 500, nstart = 5)

r_km_d <- risk_d |>
  as_tibble() |>
  mutate(cl = factor(r_km$cluster)) |>
  bind_cols(model.matrix(~ as.factor(r_km$cluster) - 1)) 
colnames(r_km_d)[(ncol(r_km_d)-nc+1):ncol(r_km_d)] <- paste0("cluster", 1:nc)
r_km_d <- r_km_d |>
  mutate_at(vars(contains("cluster")), function(x) x+1)

clusters <- r_km_d$cl

set.seed(110)
guided_tour_history <- save_history(risk_d,
    tour_path = guided_tour(lda_pp(clusters)))

half_range <- max(sqrt(rowSums(risk_d^2)))
feature_names <- colnames(risk_d)
cluster_names <- LETTERS[1:nc] 

clusters <- as.numeric(as.character(clusters))

obj1 <- list(type="2d_tour", obj=guided_tour_history)

risk_d <- data.matrix(risk_d)
interactive_tour(data=risk_d,
                 plot_objects=list(obj1),
                 feature_names=feature_names,
                 half_range=half_range,
                 n_plot_cols=2,
                 preselection=clusters,
                 preselection_names=cluster_names,
                 n_subsets=nc,
                 display_size=6)

k-means with k = 2, 3, 4, 5 slices along the main spread, and then the middle

Screenshot of the lionfish interface showing the two cluster result. The projection shown is where there is a fairly clean line separating the two clusters.

Screenshot of the lionfish interface showing the four cluster result. The projection shown is where there is a fairly clear separation of the four clusters, along the direction of the largest spread of points, but the fattest part of the pear shape has been divided into two. So the four clusters are spread along the main direction of spread, with two side-by-side in the fat part of the pear.

Screenshot of the lionfish interface showing the three cluster result. The projection shown is where there is a fairly clear separation of the three clusters, along the direction of the largest spread of points.

Screenshot of the lionfish interface showing the five cluster result. The projection shown is where there is a fairly clear separation of the clusters, along the direction of the largest spread of points, but the fattest part of the pear shape has been divided into three. The clusters at the bottom and the top of the pear shape have been faded so we can focus on the three clusters in the fat part of the pear.

Search for meaning

Code
library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)

risk_d  <- apply(risk, 2, function(x) (x-mean(x))/sd(x))

# Three clusters
nc <- 3
set.seed(1145)
r_km <- kmeans(risk_d, centers=nc,
               iter.max = 500, nstart = 5)

r_km_d <- risk |>
  as_tibble() |>
  mutate(cl = factor(r_km$cluster))

r_km_d |> 
  pivot_longer(Rec:Soc, names_to = "var", values_to = "val") |>
  ggplot(aes(x=val, fill=cl)) +
    geom_bar() +
    facet_wrap(~var, ncol=3, scales="free_y") +
    scale_fill_manual(values = c("#377EB8", "#FF7F00", "#4DAF4A")) +
    xlab("") + ylab("") +
    theme_minimal() +
    theme(legend.position = "none",
          axis.text = element_blank())

Histograms of the six variables laid out in a 2x3 matrix. Bars are filled by cluster colour. All six look very similar, with lots of orange at low values, green at medium values, and blue at high values.



All the activities contribute to the segmentation into three clusters.

Summary

You can find lionfish at https://mmedl94.github.io/lionfish/.

  • Link multiple displays
  • Interactively select points and clusters
  • Visualize the partitions with various tour types


Clustering is a geometric operation, and using the tour you should always be able to see how the observations have been grouped.


Final teaser: clustering algorithms don’t see gaps; they see pairwise distances. We see gaps, and can be shocked when an algorithm groups across one.
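That teaser shows up in a small simulated example (base-R kmeans on hypothetical data; the exact solution can depend on the seed): two long parallel stripes with a clear gap between them, where k-means with k = 2 prefers a left/right cut that groups across the gap, because that cut reduces within-cluster variance far more than separating stripe from stripe would.

```r
set.seed(1)
x <- runif(400, -10, 10)                  # long spread in x
stripe <- rep(c(-1.5, 1.5), each = 200)   # two stripes, gap of ~3 in y
d <- cbind(x = x, y = stripe + rnorm(400, sd = 0.1))
km <- kmeans(d, centers = 2, nstart = 10)
# Each k-means cluster contains points from BOTH stripes,
# i.e. the partition crosses the visible gap:
table(cluster = km$cluster, stripe = stripe)
```

A tour, or here even a single scatterplot, makes the mismatch between the visible gap and the fitted partition immediately obvious.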

References and acknowledgements

Slides made in Quarto, with code included.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.