*Correlation does not imply causation* is a mantra of modern data science. It is probably worthwhile at this point to define the terms correlation, imply, and (harder) causation.

### Correlation

For the purposes of this piece, it is sufficient to say that if we measure and record values of variables $x$ and $y$, and they appear to have a straight-line relationship, then the correlation is a measure of how close the data is to being on a straight line. For example, consider the following data:

*The variables $x$ and $y$ have a strong correlation.*

### Causation

Causality is a deep philosophical notion, but, for the purposes of this piece, if there is a relationship between variables $x$ and $y$ such that for each value of $x$ there is a single value of $y$, then we say that *$y$ is a function of $x$*: $x$ is the cause and $y$ is the effect.

In this case, we write $y = f(x)$, said *$y$ is a function of $x$*. This is a causal relationship between $x$ and $y$. (As an example which shows why this definition is only useful for the purposes of this piece, consider the relationship between the number of days after January 1 and the sales on that day: for each number of days there is a single value of sales, so sales is indeed a *function* of the number of days, but the number of days does not *cause* the sales.)

If the relationship between $x$ and $y$ is of the form:

$y = mx + c$,

for some constants $m$ and $c$, then the graph of $y$ against $x$ is a straight line. When we measure and record values of $x$ and the corresponding values of $y$, this straight-line relationship may not be observed *perfectly*. For example, suppose that the above variables $x$ and $y$ are related exactly by a line of this form, for particular values of $m$ and $c$, so that there is a causal relation. If this were the case, surely the data should all lie on a straight line and we would have *perfect* correlation?

The easiest explanation of why this does not happen is that every measurement carries an error, so a measurement of $x$ and $y$ does not yield the exact values but rather the exact values plus an error (which will be positive for overestimates and negative for underestimates). For example, a measurement might yield $x + \varepsilon_x$ and $y + \varepsilon_y$. These errors, $\varepsilon_x$ and $\varepsilon_y$, can, by coincidence, cancel each other out, and we still get a point on the line if:

$y + \varepsilon_y = m(x + \varepsilon_x) + c$,

that is, if $\varepsilon_y = m\varepsilon_x$. However, this only happens when the fluke occurs. Therefore, in general, the *observed* relationship between $x$ and $y$ takes the form:

$y = mx + c + \varepsilon$,

where $\varepsilon$ is a random variable. When we make assumptions about $\varepsilon$, we can go off and try to find the *line of best fit* given the error-strewn measurements of $x$ and $y$.

The usual situation is that, if there is a causal relationship between $x$ and $y$ of the form $y = mx + c$, and the measurement errors behave reasonably well, then the measurements of $x$ and $y$, if plotted, will fit well to a straight line and so will be highly correlated.
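To make this concrete, here is a minimal simulation sketch. The slope, intercept, noise level, and sample size are illustrative assumptions, not values from this piece, and the helper names `pearson` and `noisy_line_sample` are made up for the example. It generates points on an exact line, adds independent Gaussian measurement errors to both coordinates, and computes the Pearson correlation of the measured values.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def noisy_line_sample(m, c, n, noise_sd, seed=0):
    """Measure n points of an exact line y = m*x + c with errors in both x and y."""
    rng = random.Random(seed)
    true_x = [rng.uniform(0, 10) for _ in range(n)]
    # Each measurement is the exact value plus an independent error.
    measured_x = [x + rng.gauss(0, noise_sd) for x in true_x]
    measured_y = [m * x + c + rng.gauss(0, noise_sd) for x in true_x]
    return measured_x, measured_y


# Illustrative (assumed) parameters: a causal straight-line relationship.
xs, ys = noisy_line_sample(m=2.0, c=3.0, n=200, noise_sd=0.5)
r = pearson(xs, ys)
print(f"Pearson correlation of the measurements: {r:.3f}")
```

With well-behaved errors that are small relative to the spread of the data, the correlation comes out close to, but strictly below, 1: high correlation without a perfect straight line.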

### Implication

The everyday *implies* means something quite different from the logical *implies*.

The logical implies connects two statements. One statement $p$ implies another statement $q$ if, whenever $p$ is true, then $q$ is true also:

$p$ implies $q$ if $q$ is true whenever $p$ is true.

If the statements are $p$ and $q$ then we write $p \Rightarrow q$ for $p$ implies $q$.

For example, consider the statements:

$p$: Today is Friday

$q$: Today is a Weekday

We have that $p \Rightarrow q$ because if today is Friday then today is a Weekday. We do not have $q \Rightarrow p$ because $p$ is not always true when $q$ is true (e.g. Thursday).
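The Friday/Weekday example can be checked mechanically by testing each statement on every day of the week. This is only a sketch: the helper `implies` and the predicate names are hypothetical, introduced purely for illustration.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
WEEKDAYS = set(DAYS[:5])


def is_friday(day):
    return day == "Friday"


def is_weekday(day):
    return day in WEEKDAYS


def implies(antecedent, consequent, cases):
    """True if the consequent holds in every case where the antecedent holds."""
    return all(consequent(case) for case in cases if antecedent(case))


friday_implies_weekday = implies(is_friday, is_weekday, DAYS)
weekday_implies_friday = implies(is_weekday, is_friday, DAYS)
# Days that are weekdays but not Friday: each one refutes the converse.
counterexamples = [d for d in DAYS if is_weekday(d) and not is_friday(d)]

print(friday_implies_weekday)  # True
print(weekday_implies_friday)  # False
print(counterexamples)         # ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
```

A single counterexample such as Thursday is enough to show that an implication fails.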

So, does correlation imply causation? Is it always the case that whenever $x$ and $y$ are correlated, $x$ is the cause and $y$ is the effect, that is, $y = f(x)$? The answer is no.

There are examples of variables $x$ and $y$ such that $x$ and $y$ have very high correlation but there is no causal link between $x$ and $y$.

It is the purpose of this piece to show that such situations are inevitable.

### Ramsey Theory

The basic principle of Ramsey Theory (for Data Science) is as follows:

With enough data, order and patterns are inevitable.

#### Example 1: The Pigeonhole Principle

If objects are being placed into boxes, and the number of objects exceeds the number of boxes, then at least one box contains at least two objects.

For example, in a school with over 366 students, at least two will share a birthday. Indeed in a school with over 732 students, at least three will share a birthday.
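The two school figures follow from the general form of the principle: to force some box to hold at least $k$ objects you need one more than $(k-1)$ times the number of boxes. A small sketch, assuming 366 possible birthdays (including February 29) and using made-up helper names:

```python
def min_objects_to_force(k, boxes):
    """Smallest number of objects guaranteeing that some box holds at least k."""
    return (k - 1) * boxes + 1


def guaranteed_in_fullest_box(objects, boxes):
    """Number of objects the fullest box must hold, however they are placed."""
    return -(-objects // boxes)  # ceiling division


BIRTHDAYS = 366  # assumption: every calendar date, including February 29

print(min_objects_to_force(2, BIRTHDAYS))  # 367, i.e. "over 366 students"
print(min_objects_to_force(3, BIRTHDAYS))  # 733, i.e. "over 732 students"
```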

The famous Birthday Problem asks the question:

What is the probability that, in a group of $n$ randomly selected people, at least two will share a birthday?

In particular, what is the smallest value of $n$ such that this probability exceeds 50%?

With enough data, patterns are *likely*. Indeed, with only 23 people, there are 253 pairs of people, and this is enough pairs to make a shared birthday *likely*.
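The answer to the Birthday Problem can be computed directly. The sketch below makes the standard simplifying assumptions of 365 equally likely birthdays and independent people; the function names are illustrative.

```python
from math import comb


def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    prob_all_distinct = 1.0
    for i in range(n):
        # Person i must avoid the i birthdays already taken.
        prob_all_distinct *= (days - i) / days
    return 1.0 - prob_all_distinct


def smallest_group(threshold=0.5, days=365):
    """Smallest group size whose shared-birthday probability exceeds threshold."""
    n = 1
    while prob_shared_birthday(n, days) <= threshold:
        n += 1
    return n


n = smallest_group()
print(n)           # 23
print(comb(n, 2))  # 253 pairs of people
print(round(prob_shared_birthday(n), 4))
```

The number of pairs grows roughly like the square of the group size, which is why so few people are needed.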

With enough data, patterns are *likely*: with even more data, patterns are *inevitable*. If you want to read about *likely* order and patterns look at e.g. the Texas Sharpshooter Fallacy.

We are talking about *inevitable*.

#### Example 2: The Party Problem (Ramsey’s Theorem)

For any numbers $r$ (#mutual friends) and $s$ (#mutual strangers), there is a number $R(r,s)$ such that in a party with $R(r,s)$ guests, it is *inevitable* that there will be either $r$ pairwise mutual friends or $s$ pairwise mutual strangers.

Consider the problem with $r = 3$ and $s = 3$. How many guests do we need so that it is inevitable there is either a critical mass of friends (three mutual friends), or three mutual strangers? The answer to this is the Ramsey number $R(3,3)$.

We certainly require $R(3,3)$ to be larger than three. If you invite two people who do not know each other to your house, the three of ye are not mutual friends, and neither are ye mutual strangers.

If you have four people, you are still not guaranteed three friends or three strangers. For example, suppose you invite two couples to your house, two couples who do not know each other. Suppose you head out just before they come. In this party there is no trio of mutual friends nor mutual strangers.

If you have five people, it is possible that every person knows two people and doesn’t know two people. Imagine five people sitting around a table, such that each person doesn’t know the person to their immediate left or right, but does know the other two people. In this party there is again no trio of mutual friends and no trio of mutual strangers.

Now with six people. Think of yourself at this party. There are five other people at the party, and you are either friends or strangers with each of the five.

Now because there are five people, either three or more are your friends, or three or more are your strangers: it is impossible that you have at most two friends and at most two strangers, as that would account for only four people.

Join up you and (at least) three of your friends/strangers:

If these three friends (/strangers) are mutual strangers (/friends), they form a group of three mutual strangers (/friends), and so we are done.

If they are not mutual strangers (/friends), then at least a pair of them are mutual friends (/strangers), and so you and these other two form a group of three mutual friends (/strangers).
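The argument can also be confirmed by brute force. The sketch below (function names are illustrative) tries every possible assignment of friend/stranger to each pair of guests. It checks that some party of five has no trio of mutual friends or mutual strangers, while every party of six has one.

```python
from itertools import combinations, product


def has_mono_trio(n, relation):
    """relation maps each pair (i, j), i < j, to True (friends) or False (strangers).

    Returns True if some three guests are all mutual friends or all mutual strangers.
    """
    return any(
        relation[(a, b)] == relation[(a, c)] == relation[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )


def every_party_has_trio(n):
    """Check all 2**(number of pairs) friend/stranger assignments for n guests."""
    pairs = list(combinations(range(n), 2))
    return all(
        has_mono_trio(n, dict(zip(pairs, choice)))
        for choice in product([True, False], repeat=len(pairs))
    )


# The round-table party: strangers with immediate neighbours, friends otherwise.
round_table = {
    (i, j): (j - i) % 5 not in (1, 4)
    for i, j in combinations(range(5), 2)
}

print(has_mono_trio(5, round_table))  # False: five guests are not enough
print(every_party_has_trio(5))        # False
print(every_party_has_trio(6))        # True: with six guests a trio is inevitable
```

For six guests there are 15 pairs and so 32,768 possible parties, which is small enough to check exhaustively.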

#### Example 3: Equal Weeks

Consider a random selection of 1025 people who are surveyed over one week, Monday to Friday. At the end of each day they are asked, “*On a continuous scale of 0-10, how did you feel today?*”:

- *Strongly Positive: 7.5 – 10*
- *Slightly Positive: 5 – 7.5*
- *Slightly Negative: 2.5 – 5*
- *Strongly Negative: 0 – 2.5*

The surveyor looks at all the answers and sees that Mary and John have had a very similar week:

The Pearson correlation coefficient between these two data sets is 0.984: under almost any interpretation this is considered a very high correlation.

Suppose the surveyor gets Mary and John in the room:

- Mary: *Yeah I felt great on Monday… bit of a lull on Tuesday but another great day on Wednesday. I had a disaster on Thursday but then Friday was a great day: a great end of the week.*
- John: *Yes I had exactly the same kind of week… that’s really strange…*

Perhaps Mary and John go on to try and rationalise this correlation. What is causing this?

Well, nothing necessarily. There *might* be a reason but *a priori* there might be none. If we categorise the data into the four bands above, we see that both Mary and John have had exactly the same week on this level:

It isn’t difficult to show that there are $4^5 = 1024$ possible weeks: four categories for each of five days. Therefore, by the Pigeonhole Principle, because there are 1025 people surveyed, there are at least two people surveyed who have had the exact same week in this sense.

So while it looks like Mary and John might have had that type of week for some particular reason, in fact, even if all 1025 survey participants answer randomly, it is *inevitable* that a pair of participants will have weeks that correlate in the sense that they will agree on the broad classification of Positive/Negative; Strong/Slight.
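Here is a small simulation sketch of this. The participants answer uniformly at random (an assumption made purely for illustration), each score is put into one of the four bands, and we search for two people with the same categorised week. Boundary scores such as exactly 5 are assigned to the upper band, which is also an assumption, and the function names are made up for the example.

```python
import random

CATEGORIES = ["Strongly Negative", "Slightly Negative",
              "Slightly Positive", "Strongly Positive"]


def categorise(score):
    """Map a score in [0, 10] to its band; a boundary score goes to the upper band."""
    return CATEGORIES[min(int(score // 2.5), 3)]


def random_survey(people=1025, days=5, seed=0):
    """Every participant answers uniformly at random on each day."""
    rng = random.Random(seed)
    return [[rng.uniform(0, 10) for _ in range(days)] for _ in range(people)]


def find_matching_pair(weeks):
    """Return indices of two people whose categorised weeks are identical, or None."""
    seen = {}
    for person, week in enumerate(weeks):
        key = tuple(categorise(score) for score in week)
        if key in seen:
            return seen[key], person
        seen[key] = person
    return None


weeks = random_survey()
first, second = find_matching_pair(weeks)
print(len(CATEGORIES) ** 5)  # 1024 possible categorised weeks
print([categorise(s) for s in weeks[first]])
print([categorise(s) for s in weeks[second]])
```

Because there are 1025 participants and only 1024 possible categorised weeks, `find_matching_pair` can never return `None` here, whatever the seed.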

This is why “Big Data” has a Ramsey Problem:

With enough data, order and patterns are inevitable.
