Delving Deep into the Causation vs Correlation in Data Science Analogy

Article
By MathCo Team
November 23, 2020 4 minute read

The growing popularity of data science has sparked a debate about how we use the words correlation and causation, one that treats the two as distinct ideas rather than as complementary ones. In common usage, a strong correlation between two events has often been taken as proof that one influences the other, in short, as causation.

However, a data scientist would tell you that inferring cause and effect from a strong correlation alone might be a case of counting your chickens before they hatch. In fact, a significant correlation opens the door to investigating whether a cause-and-effect relationship exists between the correlated events. Let’s analyse the concept of correlation from the perspective of both “Sense” and “Science” to try to reach a conclusion.

The “Sense” of Correlation: In English, the word correlation means “a mutual relationship or connection between two or more things”. Sometimes this connection is between the events or things themselves, and sometimes it runs through a third party or event related to each. Our intuitive understanding of the mutual relationship between events forms the “sense” of correlation. For example, when we say that high-speed driving is strongly correlated with road accidents, we intuitively make sense of this relationship and conclude that high-speed driving is a major cause of road accidents.

The Curious Case of Beers and Diapers: Back in the early 1990s, analysts at a popular retail chain found something peculiar in the sales of beer and diapers on Friday nights. It turned out that the two items were most often sold together in the same basket. In short, there was a very strong positive correlation between the sale of beer and the sale of diapers. Now this does not make “sense” at all, right? How can the sale of baby diapers be strongly correlated with the sale of beer?

This shows why it is important to analyse whether the sale of one item depends in some way on the sale of the other. Analysing this interdependence to understand cause and effect is the study of causation.

The “Science” of Correlation: As stated earlier, correlation measures the strength of the relationship between events or things. Let’s dig into the science of it to understand the concept more fully. Mathematically, correlation is a statistical measure of the strength of the relationship between variables. If you think that’s too much maths, let’s look at the formula to calculate correlation:
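A common choice, and the one assumed here because the article does not name a specific measure, is the Pearson correlation coefficient for n paired observations (x_i, y_i):

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here \bar{x} and \bar{y} are the sample means of the two variables. The value of r lies between -1 and +1: values near +1 mean the two variables rise together, values near -1 mean one rises as the other falls, and values near 0 mean there is little linear relationship.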

Hold that thought before the formula starts to look scary. In simpler terms, correlation measures how two events move relative to each other. Now let us apply this science to the earlier example of beer and diapers. The correlation between the sale of diapers and the sale of beer showed that as the sale of baby diapers increased, so did the sale of beer; that is, the relative movement of the two was strongly positive. However, the analysts dug further to understand whether the sale of one was influencing the other. It turned out that the sale of baby diapers did influence the sale of beer. Apparently, there was an increase in young fathers shopping for baby diapers, and as they rushed to the store to get diapers they ended up buying a few pints of beer as well, just to unwind. Thus, the sale of baby diapers did cause the sale of beer.
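To make the idea of relative movement concrete, here is a minimal Python sketch that computes the Pearson correlation coefficient (an assumed choice of measure) by hand. The sales figures and the names `pearson_r`, `diaper_sales`, and `beer_sales` are hypothetical, invented for illustration; they are not the retailer's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how the two series move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much each series moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return co_movement / math.sqrt(spread_x * spread_y)


# Hypothetical Friday-night unit sales over six weeks (made-up numbers).
diaper_sales = [20, 25, 31, 38, 44, 52]
beer_sales = [41, 48, 60, 71, 85, 99]

r = pearson_r(diaper_sales, beer_sales)
print(f"Correlation between diaper and beer sales: {r:.3f}")
```

A value of r close to +1 says only that the two series rise together. It does not say which one, if either, is driving the other.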

Testing for Correlation Precedes Testing for Causation: Correlation helps us understand a relationship and its strength; however, to establish that one event influences the other, we need to go a step further and test for causation. The “sense” of a correlation between events might be obvious, or it might not seem logical at all; it is the “science” of correlation and causation that helps substantiate whether that sense is true. Correlation does not automatically imply causation; causation needs to be established.
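As a rough illustration of why causation has to be established separately, the sketch below simulates two events that never influence each other but are both driven by a third factor, the situation described earlier where the connection runs through a third party. Everything here is an assumption made for the example: the data is randomly generated, the names `third_factor`, `event_a`, `event_b`, and `residuals` are hypothetical, and removing a straight-line fit is only one simple check, not a full causal test.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


def residuals(ys, zs):
    """What is left of ys after removing a least-squares straight-line fit on zs."""
    mean_y = sum(ys) / len(ys)
    mean_z = sum(zs) / len(zs)
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Simulated data: the third factor drives both events; neither event affects the other.
third_factor = [rng.gauss(0, 1) for _ in range(n)]
event_a = [z + rng.gauss(0, 0.5) for z in third_factor]
event_b = [z + rng.gauss(0, 0.5) for z in third_factor]

# Raw correlation between the two events.
r_raw = pearson_r(event_a, event_b)

# Correlation once the third factor's contribution is removed from both events.
r_partial = pearson_r(residuals(event_a, third_factor),
                      residuals(event_b, third_factor))

print(f"raw correlation:               {r_raw:.2f}")
print(f"after removing third factor:   {r_partial:.2f}")
```

In this simulation the raw correlation is strong, yet it largely disappears once the third factor is accounted for, so the strong correlation alone would have been a poor basis for claiming that one event causes the other.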