Simpson’s Paradox and segmentation: why analysis is crucial

What is Simpson’s Paradox?

Simpson’s Paradox refers to a data phenomenon where a trend that appears in each of several groups reverses when the groups are combined and the data is studied as a whole. In analytics, understanding this paradox is vital, since it can completely change the insights drawn from the data.

Graph representing Simpson's paradox

Figure 1 shows two series of data (light blue and orange), each with a clear pattern. However, when the two series are pooled, the line of best fit reverses direction (dark blue); a person on the black dotted line could be assigned any of the three intersecting values, depending on which line is used. This has a huge impact on forecasting and reporting. Having no visibility of the series is the same as not segmenting the data.
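To make the reversal concrete, here is a minimal sketch in Python. The two series are made-up numbers, not the data behind Figure 1, and `slope` is a small helper written for this illustration. Each series trends upwards on its own, yet the pooled line of best fit slopes downwards because one series sits at low x and high y while the other sits at high x and low y.

```python
def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return covariance / variance


# Hypothetical data: within each series, y rises as x rises.
series_a = [(1, 10), (2, 11), (3, 12), (4, 13)]  # low x, high y
series_b = [(6, 2), (7, 3), (8, 4), (9, 5)]      # high x, low y

slope_a = slope(series_a)
slope_b = slope(series_b)
slope_pooled = slope(series_a + series_b)  # the series labels are ignored here

print(f"series A slope: {slope_a:+.2f}")
print(f"series B slope: {slope_b:+.2f}")
print(f"pooled slope:   {slope_pooled:+.2f}")
```

Fitting the pooled data is exactly what happens when the series label is unavailable or ignored, which is why the unsegmented view can point the wrong way.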

Simpson’s Paradox and segmentation: gender bias at UC Berkeley

A classic example of Simpson’s Paradox comes from an investigation into gender bias in admissions at the University of California, Berkeley. Overall, 44% of male applicants were admitted, compared with 35% of female applicants. Surely such an obvious difference in success rates is evidence of favouritism on the part of the university?

Well, maybe. Dig a little deeper, however, and another possibility is revealed.

The table below holds the data for the largest 6 departments at UC Berkeley:

A table showing the largest 6 departments at UC Berkeley when illustrating Simpson's Paradox

Here, the data shows that at department level, there’s actually very little evidence of bias towards men.

In fact, the largest ‘departmental bias’ occurs in department A, where the female admission rate is 20 percentage points higher than the male rate. Further analysis indicates that men have a better overall success rate because they apply to more lenient departments. As the table indicates, departments A and B have much higher admission rates than the other four. These attract just 7% of female applications (among the six largest departments), compared to 51% of male ones.
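The arithmetic behind this can be checked with a short Python sketch. The counts below are the commonly circulated figures for the six largest departments (the same ones distributed as R's `UCBAdmissions` dataset), not values copied from the table in this post, and the helper names are made up for this illustration.

```python
# (admitted, applicants) for the six largest departments.
# Assumption: commonly cited counts, not transcribed from this post's table.
admissions = {
    "A": {"men": (512, 825), "women": (89, 108)},
    "B": {"men": (353, 560), "women": (17, 25)},
    "C": {"men": (120, 325), "women": (202, 593)},
    "D": {"men": (138, 417), "women": (131, 375)},
    "E": {"men": (53, 191), "women": (94, 393)},
    "F": {"men": (22, 373), "women": (24, 341)},
}


def rate(admitted, applicants):
    """Admission rate for one group in one department."""
    return admitted / applicants


def pooled_rate(group):
    """Admission rate for a group with all six departments combined."""
    admitted = sum(dept[group][0] for dept in admissions.values())
    applicants = sum(dept[group][1] for dept in admissions.values())
    return admitted / applicants


def share_applying_to(group, departments):
    """Share of a group's applications that went to the given departments."""
    chosen = sum(admissions[d][group][1] for d in departments)
    total = sum(dept[group][1] for dept in admissions.values())
    return chosen / total


for dept, counts in admissions.items():
    print(f"{dept}: men {rate(*counts['men']):.0%}, women {rate(*counts['women']):.0%}")

print(f"Pooled: men {pooled_rate('men'):.0%}, women {pooled_rate('women'):.0%}")
print(f"Applications to A and B: men {share_applying_to('men', ['A', 'B']):.0%}, "
      f"women {share_applying_to('women', ['A', 'B']):.0%}")
```

The pooled rates here cover only these six departments, so they will not match the university-wide 44% and 35% exactly, but the direction of the gap is the same and the application shares reproduce the 7% and 51% quoted above.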

Another well-known example of Simpson’s Paradox is in the comparison of baseball batting averages, whereby a player can have a higher batting average than another across many seasons, but a lower one in each individual season.
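A two-season version of the batting example can be sketched in Python. The hits and at-bats below are the figures usually quoted for Derek Jeter and David Justice in 1995 and 1996; treat them as illustrative and check them against the reference before reusing them. The `average` helper is made up for this illustration.

```python
# (hits, at_bats) per season.
# Assumption: the figures usually quoted for this example, shown for illustration.
seasons = {
    1995: {"Jeter": (12, 48), "Justice": (104, 411)},
    1996: {"Jeter": (183, 582), "Justice": (45, 140)},
}


def average(hits, at_bats):
    """Batting average: hits divided by at-bats."""
    return hits / at_bats


def combined_average(player):
    """Batting average with both seasons pooled."""
    hits = sum(season[player][0] for season in seasons.values())
    at_bats = sum(season[player][1] for season in seasons.values())
    return hits / at_bats


for year, players in seasons.items():
    print(year, {name: round(average(*stats), 3) for name, stats in players.items()})

print("combined", {name: round(combined_average(name), 3) for name in ("Jeter", "Justice")})
```

One player leads in each season, the other leads once the seasons are pooled, because most of each player's at-bats fall in a different season.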

Simpson’s Paradox and segmentation in business

I know what you’re thinking: how does the relationship between Simpson’s Paradox and segmentation apply in business situations?

Recently, I did some reporting for an arts venue, focusing on how their sales had grown and what had driven that change.

Diagram showing that the rise in sales was due to an increase in ticket prices

Fig. 2 shows that the rise in sales was due to an increase in ticket prices.

Larger group sizes and frequency growth offset a decline in customer numbers. (We'll come back to that frequency growth shortly.) On this view, the obvious way to improve sales growth would be to reverse the customer decline.

I then segmented the data so the venue could use bespoke targeting on different customer types. Each segment would receive tailored communications to improve its sales growth, which in turn would improve total sales growth. Customers were split into New, Lapsed and Active, and Active customers were grouped into past frequency bands 1 to 3 (in increasing order of frequency). The sales growth chart at segment level looked very different:

Diagram showing frequency growth declining for every customer segment

Fig. 3 shows that frequency had declined for every customer segment. Yet in Fig. 2 we saw that frequency had grown for the customer base as a whole. The apparent contradiction is due to a change in mix: a larger proportion of customers were Active and in higher frequency bands, and these groups naturally have higher frequencies.

Table of results - Simpson's Paradox
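The mix effect described above can be reproduced with a few invented numbers. The Python sketch below uses hypothetical customer counts and average frequencies, not the venue's data, and `overall_frequency` is a helper written for this illustration. Every segment's frequency falls, but the overall frequency rises because more customers sit in the higher bands.

```python
# Hypothetical (customers, average frequency) per segment; not the venue's figures.
last_year = {
    "New": (500, 1.2),
    "Lapsed": (300, 1.5),
    "Active band 1": (100, 2.0),
    "Active band 2": (60, 3.5),
    "Active band 3": (40, 6.0),
}
this_year = {
    "New": (300, 1.1),
    "Lapsed": (200, 1.4),
    "Active band 1": (180, 1.9),
    "Active band 2": (120, 3.3),
    "Active band 3": (100, 5.6),
}


def overall_frequency(segments):
    """Customer-weighted average frequency across all segments."""
    total_visits = sum(customers * freq for customers, freq in segments.values())
    total_customers = sum(customers for customers, _ in segments.values())
    return total_visits / total_customers


for name in last_year:
    before, after = last_year[name][1], this_year[name][1]
    print(f"{name}: {before:.1f} -> {after:.1f}")

print(f"Overall: {overall_frequency(last_year):.2f} -> {overall_frequency(this_year):.2f}")
```

Comparing the segment-level changes with the overall change in a report like this is a quick way to spot when a headline figure is being driven by mix rather than by behaviour.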

Conclusion

Segmenting the data for the arts venue revealed an occurrence of Simpson's Paradox. The total view suggested that frequency wasn't a concern for the venue; in fact, frequency was declining in every segment. Moving forward, the venue should act to counteract this trend.

Without segmenting the data to reveal underlying patterns, this information would have been invisible. The venue would only have found out about the low-level frequency decline much later, at a point when the trend would have been much harder to reverse.

References
1. https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf
2. https://en.wikipedia.org/wiki/Simpson%27s_paradox
3. Ken Ross. "A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans." Pi Press, 2004. ISBN 0-13-147990-3. pp. 12–13
