Irina – Simpson’s Paradox and segmentation: why analysis is crucial

What is Simpson’s Paradox?

Simpson’s Paradox refers to a data phenomenon where a trend that appears within each of several groups reverses when the groups are combined and studied as a whole. In analytics, understanding this paradox is vital, since the conclusion drawn from the same data can flip depending on whether it is viewed in total or by segment.

Figure 1 shows two series of data (light blue and orange), each with a clear pattern. However, when the data is pooled, the line of best fit reverses (dark blue). A data point at the position marked by the black dotted line could take any of the three values where the fitted lines cross it, depending on which fit is used. This has a huge impact on forecasting and reporting. Not being able to see the individual series is equivalent to not segmenting the data.
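The reversal can be reproduced with a few lines of Python. This is a minimal sketch, not the data behind Figure 1: the points in `series_a` and `series_b` are hypothetical and hand-picked, and `slope` is an illustrative helper that computes an ordinary least-squares slope.

```python
def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical points, chosen only to illustrate the effect.
series_a = [(1, 10), (2, 11), (3, 12), (4, 13)]  # rising, low x and high y
series_b = [(6, 2), (7, 3), (8, 4), (9, 5)]      # rising, high x and low y

slope_a = slope(series_a)
slope_b = slope(series_b)
slope_pooled = slope(series_a + series_b)        # fit with the series label ignored

print(f"series A slope: {slope_a:+.2f}")
print(f"series B slope: {slope_b:+.2f}")
print(f"pooled slope:   {slope_pooled:+.2f}")
```

Both series rise on their own, but the pooled fit falls, because the series with the higher values sits at the lower end of the x-axis.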

Simpson’s Paradox and segmentation: gender bias at UC Berkeley

A classic example of Simpson’s Paradox is an investigation into gender bias among students applying to the University of California, Berkeley. Overall, 44% of male applicants were admitted, compared with 35% of female applicants. Surely such an obvious difference in success rates is evidence of favouritism on the part of the university?

Well, maybe. Dig a little deeper, however, and another possibility is revealed.

The table below holds the data for the six largest departments at UC Berkeley:

Here, the data shows that at department level, there’s actually very little evidence of bias towards men.

In fact, the largest ‘departmental bias’ occurs in department A, where the female admission rate is 20 percentage points higher than the male rate. Further analysis indicates that men have a better overall success rate because they apply to more lenient departments. As the table indicates, departments A and B have much higher admission rates than the other four. These two departments attract just 7% of female applications (among the six largest departments), compared to 51% of male ones.
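The arithmetic behind this can be checked with a short sketch. The applicant counts and admission rates below are an assumption: they are the figures commonly reported from the Bickel et al. study for the six largest departments, with rates rounded to whole percentages. The names `departments`, `pooled_rate` and `share_in` are illustrative. Because only six departments are included, the pooled rates differ slightly from the university-wide 44% and 35% quoted above.

```python
# Assumed figures, as commonly reported from Bickel et al. (1975), rounded rates.
# dept: (men_applicants, men_rate, women_applicants, women_rate)
departments = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (373, 0.06, 341, 0.07),
}


def pooled_rate(rows):
    """Admission rate across departments: total admitted / total applicants."""
    applicants = sum(n for n, _ in rows)
    admitted = sum(n * rate for n, rate in rows)
    return admitted / applicants


def share_in(depts, column):
    """Share of one gender's applications that went to the given departments."""
    total = sum(row[column] for row in departments.values())
    return sum(departments[d][column] for d in depts) / total


men_overall = pooled_rate([(m, mr) for m, mr, _, _ in departments.values()])
women_overall = pooled_rate([(w, wr) for _, _, w, wr in departments.values()])

# Departments where the female admission rate is the higher of the two.
women_ahead = [d for d, (_, mr, _, wr) in departments.items() if wr > mr]

men_share_ab = share_in(["A", "B"], 0)    # column 0 holds male applicant counts
women_share_ab = share_in(["A", "B"], 2)  # column 2 holds female applicant counts

print(f"pooled admission rate, men:   {men_overall:.1%}")
print(f"pooled admission rate, women: {women_overall:.1%}")
print(f"departments where women do better: {women_ahead}")
print(f"share of applications to A and B: men {men_share_ab:.0%}, women {women_share_ab:.0%}")
```

On these figures, women have the higher admission rate in four of the six departments, yet the pooled rate favours men, because roughly half of male applications went to the two most lenient departments.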

Another well-known example of Simpson’s Paradox is in the comparison of baseball batting averages, whereby a player can have a higher batting average than another across many seasons, but a lower one in each individual season.
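A widely cited two-season instance compares Derek Jeter and David Justice in 1995 and 1996. The sketch below uses the hits and at-bats as they are commonly reported for that example; treat the figures as an assumption rather than something verified here. The helper names `average` and `combined` are illustrative.

```python
# (hits, at_bats) per season, as commonly reported for this example.
jeter = {1995: (12, 48), 1996: (183, 582)}
justice = {1995: (104, 411), 1996: (45, 140)}


def average(hits, at_bats):
    """Batting average for a single season."""
    return hits / at_bats


def combined(seasons):
    """Batting average over all seasons: total hits / total at-bats."""
    hits = sum(h for h, _ in seasons.values())
    at_bats = sum(ab for _, ab in seasons.values())
    return hits / at_bats


for year in (1995, 1996):
    print(f"{year}: Jeter {average(*jeter[year]):.3f}, Justice {average(*justice[year]):.3f}")
print(f"combined: Jeter {combined(jeter):.3f}, Justice {combined(justice):.3f}")
```

Justice has the higher average in each season, but Jeter has the higher combined average, because most of Jeter's at-bats fall in the season where both players hit better.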

Simpson’s Paradox and segmentation in business

I know what you’re thinking. How does the relationship between Simpson’s Paradox and segmentation apply in business situations?

Recently, I did some reporting for an arts venue which focused on how their sales have grown and what has driven that change.

Fig. 2 shows that the rise in sales was due to an increase in ticket prices.

Larger group sizes and frequency growth offset a decline in customer numbers (we’ll come back to that frequency growth shortly). On this view, the obvious way to improve sales growth would be to reverse the customer decline.

I then segmented the data so the venue could target different customer types with tailored communications. Improving each segment’s sales growth would, in turn, improve total sales growth. Customers were split into New, Lapsed and Active, and Active customers were further grouped into past frequency bands 1 to 3 (in increasing order of frequency). The sales growth chart at segment level looked very different:

Fig. 3 shows that frequency had declined in every customer segment, whereas Fig. 2 showed that frequency had grown for the customer base as a whole. The difference is due to a shift in mix: the proportion of customers who were Active and in the higher frequency bands had increased, and these groups naturally have higher frequencies.
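A small worked example shows how a change in mix alone can produce this pattern. Every number below is hypothetical and chosen purely for illustration; none of it is the venue's data. `overall_frequency` is an illustrative helper that weights each segment's frequency by its customer count.

```python
# Hypothetical figures: {segment: (customers, average frequency per customer)}.
last_year = {
    "New": (500, 1.2),
    "Lapsed": (300, 1.5),
    "Active band 1": (100, 2.0),
    "Active band 2": (60, 3.5),
    "Active band 3": (40, 6.0),
}
this_year = {
    "New": (300, 1.1),
    "Lapsed": (200, 1.4),
    "Active band 1": (150, 1.9),
    "Active band 2": (150, 3.3),
    "Active band 3": (100, 5.7),
}


def overall_frequency(segments):
    """Customer-weighted average frequency: total visits / total customers."""
    customers = sum(n for n, _ in segments.values())
    visits = sum(n * freq for n, freq in segments.values())
    return visits / customers


# Change in frequency within each segment (negative means a decline).
segment_change = {
    name: this_year[name][1] - last_year[name][1] for name in last_year
}
overall_change = overall_frequency(this_year) - overall_frequency(last_year)

for name, change in segment_change.items():
    print(f"{name:<14} {change:+.2f}")
print(f"{'Overall':<14} {overall_change:+.2f}")
```

In this sketch frequency falls in every segment, but the overall figure rises because a larger share of customers sits in the higher frequency bands. Total customer numbers also fall, as they did in the total view.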

Conclusion

Segmenting the arts venue’s data revealed an instance of Simpson’s Paradox. The total view suggested that frequency wasn’t a concern for the venue; in fact, frequency was declining in every segment. Moving forward, the venue should act to counteract this trend.

Without segmenting the data to reveal underlying patterns, this information would have been invisible. The venue would only have found out about the segment-level decline in frequency much later, at a point when the trend would have been much harder to reverse.

References
1. https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf
2. https://en.wikipedia.org/wiki/Simpson%27s_paradox
3. Ken Ross. “A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans”. Pi Press, 2004. ISBN 0-13-147990-3, pp. 12–13.
