"Correlation does not imply causation" is a mantra of modern data science. Before going further, it is worth defining the terms correlation, imply, and (harder) causation.

### Correlation

For the purposes of this piece, it is sufficient to say that if we measure and record values of variables $x$ and $y$, and they appear to have a straight-line relationship, then the correlation is a measure of how close the data is to being on a straight line. For example, consider the following data:

The variables $y$ and $x$ have a strong correlation.
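To make "how close the data is to being on a straight line" concrete, here is a minimal sketch that computes the Pearson correlation coefficient, assuming that is the measure of correlation intended here. The function name `pearson_r` and the sample values are made up for illustration; they are not the data referred to above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each observation from its mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: points scattered closely around the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

A value of `r` near $1$ or $-1$ means the points lie almost exactly on a straight line; a value near $0$ means there is little straight-line relationship.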

### Causation

Causality is a deep philosophical notion, but, for the purposes of this piece, if there is a relationship between variables $y$ and $x$ such that for each value of $x$ there is a single value of $y$, then we say that $y$ is a function of $x$: $x$ is the cause and $y$ is the effect.

In this case, we write $y=f(x)$, read "$y$ is a function of $x$", and treat this as a causal relationship between $x$ and $y$. (This definition is only useful for the purposes of this piece. As a counterexample, consider the sales $S$ on the day $t$ days after January 1: for each value of $t$ there is a single value of $S$, so $S$ is indeed a function of $t$, but $t$ does not cause $S$.)
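The working definition above, that each value of $x$ has a single value of $y$, can be checked mechanically on recorded data. The sketch below is only an illustration: the helper name `is_function_of` and the sales figures are hypothetical.

```python
def is_function_of(pairs):
    """Return True if every x value in pairs maps to exactly one y value."""
    seen = {}
    for x, y in pairs:
        if x in seen and seen[x] != y:
            # The same x was recorded with two different y values
            return False
        seen[x] = y
    return True

# Hypothetical daily sales: (t, S) = (days after January 1, sales on that day)
sales = [(0, 120), (1, 95), (2, 130), (3, 95)]
sales_is_function = is_function_of(sales)

# Two different y values recorded for x = 1, so y is not a function of x
conflicting = [(1, 2), (1, 3)]
conflicting_is_function = is_function_of(conflicting)

print(sales_is_function, conflicting_is_function)
```

The sales data passes the check even though the day count $t$ does not cause the sales $S$, which is why the definition is only a convenience for this piece.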