Today I’m going to talk about Simpson’s Paradox. I’ll demonstrate it with a simple example:
I decide to challenge my daughter to a Solitaire competition. The aim is to find out who is the best Solitaire player. For five days, we’re each going to play games in the evening and report our scores to declare a winner.
On Monday, I play ten games and win four of them. This gives me a success rate of 40% (4/10). My daughter plays seven games and wins three. Her win percentage is 42.86% (3/7).
She wins the first day.
On Tuesday, I play nine games and win eight. My win percentage is 88.89% (8/9). My daughter plays four games and wins all four of them to achieve a perfect 100% (4/4).
She wins the second day.
On Wednesday, I play seven games and win six. My win percentage is 85.71% (6/7). My daughter plays two games and wins both of them, again achieving a perfect score of 100% (2/2).
She wins the third day.
Daily competitions continue, and each day, my daughter wins. Here are the results in tabular format:
| | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|
| Me | 4/10 (40.00%) | 8/9 (88.89%) | 6/7 (85.71%) | 9/10 (90.00%) | 19/30 (63.33%) |
| Daughter | 3/7 (42.86%) | 4/4 (100.00%) | 2/2 (100.00%) | 2/2 (100.00%) | 11/17 (64.71%) |
As you can see, my daughter has won every single daily challenge. So this must make her the overall champion, right?
Well, something interesting happens if we total up all five days. I have played 66 games in total and won 46 of them. This gives me an overall win percentage of 69.70% (46/66).
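| | Monday | Tuesday | Wednesday | Thursday | Friday | Total |
|---|---|---|---|---|---|---|
| Me | 4/10 (40.00%) | 8/9 (88.89%) | 6/7 (85.71%) | 9/10 (90.00%) | 19/30 (63.33%) | 46/66 (69.70%) |
| Daughter | 3/7 (42.86%) | 4/4 (100.00%) | 2/2 (100.00%) | 2/2 (100.00%) | 11/17 (64.71%) | 22/32 (68.75%) |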
My daughter has played a total of 32 games and won 22 of them. This gives her an overall percentage of 68.75% (22/32). This is lower than my percentage!
My daughter has won every individual day, but when all the days are combined, I end up winning.
That does not make sense. How can I lose every individual day yet still be the winner of the overall competition?
This is Simpson’s Paradox.
OK, what’s going on here? (Take your time and go back and check the arithmetic above. There is no funny business going on.) As we will see later, the 'issue' is that my daughter and I played a different number of games on each day.
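If you would rather let a computer check the arithmetic, here is a minimal Python sketch. The win/game counts are copied from the table above; the function names (`rate`, `overall`, `weighted_average`) are just illustrative choices, not anything standard. It also shows that the overall percentage is a weighted average of the daily percentages, where each day is weighted by its share of the games played, which is why uneven game counts matter.

```python
from fractions import Fraction

# (wins, games) per day, Monday to Friday, copied from the table above
me = [(4, 10), (8, 9), (6, 7), (9, 10), (19, 30)]
daughter = [(3, 7), (4, 4), (2, 2), (2, 2), (11, 17)]


def rate(wins, games):
    """Exact win rate for a single day."""
    return Fraction(wins, games)


def overall(results):
    """Total wins divided by total games across all days."""
    total_wins = sum(w for w, _ in results)
    total_games = sum(g for _, g in results)
    return Fraction(total_wins, total_games)


def weighted_average(results):
    """Daily rates weighted by each day's share of the games played."""
    total_games = sum(g for _, g in results)
    return sum(Fraction(g, total_games) * rate(w, g) for w, g in results)


# Who has the higher win rate on each individual day?
daily_winners = [
    "daughter" if rate(*d) > rate(*m) else "me"
    for m, d in zip(me, daughter)
]

print(daily_winners)
print(f"me: {float(overall(me)):.4f}, daughter: {float(overall(daughter)):.4f}")
# The overall rate is exactly the games-weighted average of the daily rates.
print(weighted_average(me) == overall(me))
```

My big Friday (30 games at 63.33%) and my daughter's big Friday (17 games at 64.71%) dominate their respective weighted averages in different proportions, and that is where the reversal comes from.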
Let’s look at another example:
In professional baseball, batting averages are recorded for every player. If two players have very different numbers of at-bats in different years, then one player can have a higher batting average than the other in each of two separate years, yet when both years are combined the result is inverted, and he ends up with the lower batting average!
This situation happened in 1995-1996 between the players Derek Jeter and David Justice:
In both 1995 and 1996, individually, David had a higher batting average than Derek. However, when both years' data are combined, David has the lower batting average.
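Here is a small Python sketch of that comparison. It uses the hits and at-bats quoted elsewhere in this article (12/48 and 183/582 for Derek, 104/411 and 45/140 for David); the helper names `average` and `combined` are my own illustrative choices.

```python
# (hits, at_bats) per season, using the figures quoted in this article
jeter = {1995: (12, 48), 1996: (183, 582)}
justice = {1995: (104, 411), 1996: (45, 140)}


def average(hits, at_bats):
    """Batting average for a single season."""
    return hits / at_bats


def combined(seasons):
    """Batting average over all seasons: total hits / total at-bats."""
    hits = sum(h for h, _ in seasons.values())
    at_bats = sum(ab for _, ab in seasons.values())
    return hits / at_bats


for year in (1995, 1996):
    print(
        year,
        f"Jeter {average(*jeter[year]):.3f}",
        f"Justice {average(*justice[year]):.3f}",
    )
print(
    "combined",
    f"Jeter {combined(jeter):.3f}",
    f"Justice {combined(justice):.3f}",
)
```

David is ahead in each season taken alone, but Derek's combined figure (195/630) beats David's (149/551), because most of Derek's at-bats came in his stronger year and most of David's came in his weaker one.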
Interestingly, in this case the 'discrepancy' carried on into the next season:
This phenomenon was noted by Karl Pearson in 1899 and by Udny Yule in 1903, but because of a paper written by Edward Simpson in 1951, it was given the name Simpson's Paradox by Colin Blyth in 1972.
It's not really a paradox; the mathematics is not lying or changing. It's just that if you compare only the percentages, you miss the important variable of sample size.
To more correctly compare things you should really normalize the data to get to the same denominator (sample size).
Looking at Derek Jeter's results for 1995, we see that he only went to bat 48 times. David Justice went to bat 411 times in the same year. To compare Derek to David more accurately, we should scale up (normalize) Derek's hits so that they have the same number of at-bats. (How many hits would we expect Derek to get if he'd been at bat the same number of times as David?)
Derek batted 12/48, which is 0.250. If we assume he continued at exactly that skill level, then had he gone to bat 411 times we would expect him to have 0.250 × 411 = 102.75 hits, i.e. 102.75/411.
Similarly, in 1996 the opposite occurred, with David only being at bat 140 times, compared with 582 times for Derek.
We can scale up David's ratio of 45/140 to what it would be if he had been at bat 582 times: (45/140) × 582 = 187.07 hits, i.e. 187.07/582.
Let's look at this effect on the data:
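| Player | 1995 calculation | 1995 normalized | 1995 avg | 1996 calculation | 1996 normalized | 1996 avg | Combined | Combined avg |
|---|---|---|---|---|---|---|---|---|
| Derek Jeter | (12/48) × 411 | 102.75/411 | .250 | 183/582 | 183/582 | .314 | 285.75/993 | .288 |
| David Justice | 104/411 | 104/411 | .253 | (45/140) × 582 | 187.07/582 | .321 | 291.07/993 | .293 |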
Now we can see that, once the data is normalized so that each player has the same denominator, the result no longer flips, and David has the higher average the entire time. A similarly consistent result would have been obtained for the Solitaire results had they been scaled so that both of us had played the same number of games each day.
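The normalization above can be sketched in a few lines of Python. This is only an illustration of the scaling described in this article, assuming each player would keep exactly the same average over the larger number of at-bats; `scaled_hits` and `normalized_combined` are hypothetical helper names.

```python
# (hits, at_bats) per season, as quoted above
jeter = {1995: (12, 48), 1996: (183, 582)}
justice = {1995: (104, 411), 1996: (45, 140)}


def scaled_hits(hits, at_bats, target_at_bats):
    """Hits expected at the same average over target_at_bats."""
    return hits / at_bats * target_at_bats


def normalized_combined(player, other):
    """Combined average after scaling each season to the larger at-bat count."""
    total_hits = 0.0
    total_at_bats = 0
    for year in player:
        hits, at_bats = player[year]
        # Use the bigger of the two players' at-bat counts as the common denominator.
        target = max(at_bats, other[year][1])
        total_hits += scaled_hits(hits, at_bats, target)
        total_at_bats += target
    return total_hits / total_at_bats


print(f"Jeter   {normalized_combined(jeter, justice):.3f}")
print(f"Justice {normalized_combined(justice, jeter):.3f}")
```

With a shared denominator of 993 at-bats (411 + 582), the combined figures agree with the season-by-season ordering.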
One of the most famous real-life examples of Simpson's paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there. The admission figures for the fall of 1973 showed that men applying were more likely than women to be admitted, and the difference was "so large that it was unlikely to be due to chance".
Here are the total figures. At first glance it does look pretty damning:
However, as we've learned, combining statistics with different denominators can lead to bogus interpretations.
In fact, when examining the individual departments, it can be seen that no department was significantly biased against women.
Among the six largest departments (listed below), there was even a "small but statistically significant bias" in favour of women.
A research paper into the issue concluded that women tended to apply to competitive departments with low rates of admission even among qualified applicants (such as in the English Department), whereas men tended to apply to less-competitive departments with high rates of admission among the qualified applicants (such as in engineering and chemistry).
So, the issue was not about gender bias. Sure, we still need to do a better job of encouraging more women to apply (fewer women than men applied in total), but it was not about a bias against the gender of those who applied.
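To see how the mix of departments alone can produce this effect, here is a toy Python sketch. The numbers are entirely made up for illustration and are NOT the Berkeley figures; the two department labels and the helper names are hypothetical as well.

```python
# HYPOTHETICAL numbers, invented purely for illustration; NOT the Berkeley data.
# department -> {group: (admitted, applicants)}
applications = {
    "less competitive": {"men": (80, 100), "women": (18, 20)},
    "more competitive": {"men": (4, 20), "women": (30, 100)},
}


def admit_rate(admitted, applicants):
    """Admission rate within a single department."""
    return admitted / applicants


def overall_rate(group):
    """Admission rate for a group across all departments combined."""
    admitted = sum(dept[group][0] for dept in applications.values())
    applicants = sum(dept[group][1] for dept in applications.values())
    return admitted / applicants


for name, dept in applications.items():
    print(name, {g: f"{admit_rate(*dept[g]):.0%}" for g in dept})
print("overall", {g: f"{overall_rate(g):.0%}" for g in ("men", "women")})
```

In this made-up example women have the higher admission rate in both departments, yet the lower rate overall, simply because most of them applied to the department that admits few people of either gender.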