Technical Tuesday: Correlation — getting a grip on the most misunderstood concept in financial markets
Correlation is a rather tricky subject. By visual inspection alone, what would you say is the correlation between the returns of security 1 and security 2 in the chart on the left, and between security 1 and security 3 in the chart on the right?
For the chart on the left the Pearson correlation is -1, i.e. a perfect negative linear correlation. For the chart on the right the correlation of the returns of these two securities is zero — i.e. totally uncorrelated.
Some readers may have guessed that there is a strong positive correlation in both cases. How can intuition be so misleading when eyeballing these charts? Well, let’s explore the reasons and the data I used to generate the time series.
As you can see in the below Python code block, I generated normally distributed random numbers for return series 1 (ret_1) and then multiplied these random numbers by -1 to generate return series 2 (ret_2). Hence for each observation (e.g. trading day) ret_1 would do exactly the opposite of ret_2 — indicating a perfect negative correlation.
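A minimal sketch of how such return series could be generated with NumPy is shown here. The seeds, the standard normal parameters and the sample size of 8000 are assumptions for illustration, not necessarily the values behind the charts; ret_3, which is described next, is included so the block is complete.

```python
import numpy as np

n_obs = 8000  # assumed number of simulated trading days

# Separate seeded generators make the draws reproducible (seed values are arbitrary)
rng_1 = np.random.default_rng(1)
rng_3 = np.random.default_rng(3)

ret_1 = rng_1.normal(loc=0.0, scale=1.0, size=n_obs)  # random "returns"
ret_2 = -1.0 * ret_1                                   # exact mirror image of ret_1
ret_3 = rng_3.normal(loc=0.0, scale=1.0, size=n_obs)  # independent random "returns"

corr_12 = np.corrcoef(ret_1, ret_2)[0, 1]
corr_13 = np.corrcoef(ret_1, ret_3)[0, 1]
print(f"corr(ret_1, ret_2) = {corr_12:.3f}")
print(f"corr(ret_1, ret_3) = {corr_13:.3f}")
```

The first coefficient is -1 by construction; the second should be close to zero, although not exactly zero in a finite sample.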
Return series 3 (ret_3) is another completely random set of numbers drawn from a normal distribution. The seed numbers in the code block for ret_1 and ret_3 ensure you can reproduce the exact same results when running the code, and using a different seed for each makes sure ret_1 and ret_3 are different, independent draws from the same distribution. If both are completely random, what is the correlation we would expect to find between the two? None, of course (in a finite sample the estimate will be close to zero rather than exactly zero). Looking at the scatter plots we can see the negative correlation in the chart on the left and the complete lack of a linear relationship on the right:
Why is it so easy to be fooled into thinking the time series shown at the beginning of the article are positively correlated? First, I have a confession to make: this was set up as a bit of a trick question, i.e. showing a chart of stock prices but asking for the correlation between the returns of the securities in question. The distinction between returns and prices is an important one: while we often visually inspect price time series, it is returns we should be comparing when computing correlations. Prices are almost always non-stationary while returns are often fairly close to stationary, a very desirable statistical property, as will be explained a bit later on.
Let’s look at the next code block below. To arrive at a “price”-like time series we take the random returns from before and add an intercept of 10 and 150 respectively as starting points for our “prices”. Each day’s return is added to the intercept together with a linear trend, i.e. the day number multiplied by a trend factor of 0.15. This is how we end up with the simulated price time series (sec_1, sec_2 and sec_3) from the start. It should be obvious from this description and from the code why the “prices” for security 1 (sec_1) and security 2 (sec_2) look positively correlated: I added the same trend to the random returns of both. The same is true for the correlation between sec_1 and sec_3.
So the real correlation (or dependence) is that all three “prices” depend on the same trend. Since the trend is entirely based on multiplying each trading day, from day 1 to 8000, by 0.15, it is clear that sec_1, sec_2 and sec_3 depend on the x-axis, i.e. time, and not on each other. A series that trends like this is also strongly autocorrelated (serially correlated): each value is very close to the one before it.
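One way this construction could look in code is sketched here. The exact formula (intercept plus day number times 0.15 plus that day's return), the seeds, and which security gets which intercept are assumptions; only the intercepts 10 and 150, the trend factor 0.15 and the 8000 days come from the description.

```python
import numpy as np

n_obs = 8000
trend_factor = 0.15
days = np.arange(1, n_obs + 1)

# Same style of random returns as before (seeds are arbitrary)
ret_1 = np.random.default_rng(1).normal(size=n_obs)
ret_2 = -ret_1
ret_3 = np.random.default_rng(3).normal(size=n_obs)

# Assumed construction: intercept + linear time trend + that day's return
sec_1 = 10 + trend_factor * days + ret_1
sec_2 = 150 + trend_factor * days + ret_2
sec_3 = 150 + trend_factor * days + ret_3

corr_ret_12 = np.corrcoef(ret_1, ret_2)[0, 1]
corr_sec_12 = np.corrcoef(sec_1, sec_2)[0, 1]
corr_sec_13 = np.corrcoef(sec_1, sec_3)[0, 1]
print(f"returns 1 vs 2: {corr_ret_12:.3f}")
print(f"prices  1 vs 2: {corr_sec_12:.3f}")
print(f"prices  1 vs 3: {corr_sec_13:.3f}")
```

Under these assumptions the shared trend dominates the small random component, so the "price" correlations come out very close to +1 even though the underlying returns are perfectly negatively correlated or independent.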
The importance of being stationary
Why is this important? When we build models and theories we need to explore the different variables that are involved in the phenomenon we’re trying to explain or predict. It is important to establish whether two variables are really correlated with each other or whether both are really just dependent on a third factor such as time, because otherwise we just generate lots of spurious correlations that lead to meaningless models. Many statistical approaches rely on stationary data, a formal way of saying that the distribution of a variable does not change over time. Sometimes a picture says more than a thousand words and the following one explaining stationarity fits that category.
Making predictions based on models that are built with nonstationary data — where the mean and variance are not well-defined — is generally a poor idea. If the mean and variance are time-dependent then correlations with other variables are usually not stable. But how do we get stationary data? We started with generating random returns. Let’s compare that return data (ret_1) with our simulated price data (sec_1).
Looking at the return histogram on the left we can observe that the mean and variance of the distribution look fairly steady. However, the mean of the price series (sec_1) on the right is all over the shop — very far from stationary.
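A rough numerical version of the same comparison is to split each series in half and compare the mean and variance of the two halves. This is only an informal check, not a stationarity test, and the price construction below (intercept 10, trend 0.15 per day, arbitrary seed) is an assumption for illustration.

```python
import numpy as np

n_obs = 8000
ret_1 = np.random.default_rng(1).normal(size=n_obs)
sec_1 = 10 + 0.15 * np.arange(1, n_obs + 1) + ret_1  # assumed price construction


def half_stats(series):
    """Return (mean, variance) for the first and the second half of a series."""
    first, second = np.array_split(series, 2)
    return (first.mean(), first.var()), (second.mean(), second.var())


ret_first, ret_second = half_stats(ret_1)
sec_first, sec_second = half_stats(sec_1)

print("returns: first half", ret_first, "second half", ret_second)
print("prices:  first half", sec_first, "second half", sec_second)
```

For the returns, both halves should show a mean near 0 and a variance near 1. For the trending prices, the mean of the second half is several hundred units higher than that of the first.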
We can back this up with a formal method, the Durbin-Watson test. A result close to 2 indicates no autocorrelation, values towards 0 imply positive serial correlation and values towards 4 negative serial correlation. Our simulated returns show a result of 1.996 (i.e. no serial correlation) while our simulated prices exhibit a strong positive autocorrelation with a result of zero.
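The Durbin-Watson statistic is simple enough to compute by hand: the sum of squared differences between consecutive values divided by the sum of squared values. The sketch below does that with NumPy on simulated data; the seed and the price construction are assumptions, so the numbers will not match the 1.996 quoted above exactly. The statsmodels package also provides a `durbin_watson` function in `statsmodels.stats.stattools`.

```python
import numpy as np


def durbin_watson_stat(x):
    """Sum of squared successive differences divided by the sum of squares."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.diff(x) ** 2) / np.sum(x ** 2)


n_obs = 8000
ret_1 = np.random.default_rng(1).normal(size=n_obs)
sec_1 = 10 + 0.15 * np.arange(1, n_obs + 1) + ret_1  # assumed price construction

dw_ret = durbin_watson_stat(ret_1)
dw_sec = durbin_watson_stat(sec_1)
print(f"Durbin-Watson, returns: {dw_ret:.3f}")
print(f"Durbin-Watson, prices:  {dw_sec:.6f}")
```

The returns should land near 2 and the trending prices near 0.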
Hence we have additional evidence that comparing our “prices” visually leads to the false impression of a strong positive correlation between the two series, because they are both dependent on time. As a rule in financial time series analysis we need to check for correlations between (stationary) returns instead of prices.
Understanding the Pearson correlation coefficient
We have discussed “correlation” so far without specifying what exactly it means. There are many ways of measuring dependence between variables and the most commonly used one comes from traditional statistics: the Pearson correlation coefficient. It measures the degree of linear dependence between two variables. There are also the Spearman rank correlation, distance correlation and a whole host of other co-dependence metrics from the area of information theory. I will discuss those in a future Technical Tuesday.
The coefficient’s values range between -1 (meaning perfect negative correlation, as in our example above with ret_1 and ret_2) and 1 (meaning perfect positive correlation). Zero correlation means there is no linear relationship between the variables, as is the case between ret_1 and ret_3. I’m including the formula for calculating the Pearson correlation here (taken from Wikipedia) for completeness although I think there are more intuitive ways of explaining and understanding it.
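For reference, the sample Pearson correlation coefficient is commonly written as follows. This is the standard textbook form; the symbols (x_i and y_i for the observations, x̄ and ȳ for their means, n for the number of observations) are a notational assumption.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```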
If you look at the numerator of the above formula you may recognise that this is how we calculate the covariance between two variables. And the denominator corresponds to the product of the standard deviations. So the formula can be rearranged as follows:
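In the usual notation (an assumption as far as the symbols go), with cov(x, y) the covariance and σ_x, σ_y the standard deviations, the rearranged form reads:

```latex
r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y},
\qquad
\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```

σ_y is defined in the same way as σ_x. Whether one divides by n or by n - 1 makes no difference to the coefficient, as long as the same convention is used for the covariance and both standard deviations, because the factor cancels.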
How to best understand covariance intuitively? Let’s explore this a bit further with a visual approach. For each of the four charts below, we go through the same set of steps:
1) First, we determine the mean of the x-axis and the mean of the y-axis. In the chart on the top left-hand side that is just over 6 for x (blue line) and roughly 3.5 for y (black).
2) Second, we calculate the differences between x and the mean of x and the same for y and the mean of y. Then we multiply these differences, and end up with the green and red rectangles. The rectangle will be green if the two differences are both positive or both negative, so that their product is positive (e.g. -1 × -0.5 = +0.5). Otherwise the rectangle will be coloured red.
3) Finally, we add up all the green and red areas and divide the result by the number of observations. The result is the covariance between x and y, which is unbounded.
We take the covariance and divide it by the product of the standard deviations of x and y to standardise the result to a range of -1 to +1. And that is the Pearson correlation coefficient.
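The three steps and the final standardisation can be sketched in a few lines of NumPy. The data points are made up purely for illustration and are not the ones shown in the charts.

```python
import numpy as np

# Small made-up data set, purely for illustration
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 10.0])
y = np.array([1.5, 2.0, 3.5, 3.0, 5.0, 6.0])

# Step 1: the means of x and y
x_mean, y_mean = x.mean(), y.mean()

# Step 2: deviations from the means and their products (signed rectangle areas)
dx = x - x_mean
dy = y - y_mean
areas = dx * dy
green_area = areas[areas > 0].sum()   # deviations with the same sign
red_area = -areas[areas < 0].sum()    # deviations with opposite signs

# Step 3: net area divided by the number of observations = covariance
covariance = (green_area - red_area) / len(x)

# Standardise by the product of the (population) standard deviations
pearson = covariance / (x.std() * y.std())

print(f"covariance = {covariance:.4f}")
print(f"pearson    = {pearson:.4f}")
print(f"np.corrcoef gives {np.corrcoef(x, y)[0, 1]:.4f}")
```

Dividing by the number of observations matches NumPy's default `std()`, which also divides by n, so the hand-computed coefficient agrees with `np.corrcoef`.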
The four charts below show different outcomes of fairly clear positive and negative correlations in the top half and weaker positive and negative ones in the bottom half.
We have seen visually and in the formulas that we used the differences of x and y compared to their respective variable averages. That means we actually look at the deviations from the means and not at variations between the raw data themselves. Effectively the Pearson correlation coefficient tells us if x and y are below or above their average at the same time.
Linear correlation in practice
Let’s look at some real financial market data. We take the first five tickers of Dow Jones components and download price data from 2015 until early 2021 from the Yahoo Finance API. (I covered how to access this API in a previous Technical Tuesday.) Let’s look at the scatter plots between prices first. It appears that there is some linear co-dependence between stocks but as we know these are almost certainly spurious relationships because prices are non-stationary.
Now we look at the same set of tickers, but use daily percentage returns as inputs for the scatter plots. Suddenly the apparent correlations are almost entirely gone. Since the data includes the huge volatility of 2020 there are quite a few more outliers than one would normally expect, and the Pearson correlation coefficient is quite sensitive to outliers. Those outliers can give a misleading impression of more of a correlation than there actually is. When considering the vast majority of returns (the big blobs in the centre of each chart) we can hardly see a linear relationship.
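The outlier effect is easy to reproduce with synthetic data. The sketch below is not based on the Dow Jones data; it uses two independent random series (arbitrary seed) and then appends five extreme observations that happen to coincide in both.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Two independent series: no true relationship between them
x = rng.normal(size=1000)
y = rng.normal(size=1000)
r_clean = np.corrcoef(x, y)[0, 1]

# Append five shared extreme observations (e.g. a handful of crash-like days)
x_out = np.concatenate([x, np.full(5, 10.0)])
y_out = np.concatenate([y, np.full(5, 10.0)])
r_outliers = np.corrcoef(x_out, y_out)[0, 1]

print(f"without outliers: {r_clean:.3f}")
print(f"with 5 outliers:  {r_outliers:.3f}")
```

Five points out of roughly a thousand are enough to move the coefficient from around zero to a clearly positive value, even though the bulk of the data is unrelated.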
That is not to say that an apparent price correlation cannot also contain a meaningful linear relationship between returns. The physical gold ETF “GLD” and the gold mining company ETF “GDX” are a case in point. Both price data and daily return data show a fairly clear positive correlation.
Such scatter plots and/or widely used correlation heatmaps can be useful to gauge relationships. However, it is important to remember that financial markets are dynamic and always change, and so do correlations. To incorporate change over time it is often a good idea to look at rolling metrics instead of average static ones. That is what we can observe in the next chart: the 60-day rolling correlation between the daily returns of the two ETFs mentioned. Most of the time the correlation coefficient is between 0.7 and 0.9, i.e. a strong, positive relationship. In early 2020, during the COVID/lockdown crisis, the correlation broke down somewhat but still remained in positive territory around +0.5 before recovering during the summer of that year.
Despite its widespread use and popularity the Pearson correlation coefficient has its drawbacks and significant limitations. As the description states, it picks up only linear relationships but not non-linear ones. Moreover, it is very sensitive to outliers. Finally, it doesn’t tell us anything about causation between variables, but is often used imprecisely to imply that. (As an aside, the field of causal inference is super interesting and I recommend books by Judea Pearl on the subject.)
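A rolling correlation of this kind can be computed with pandas in one line. The sketch below uses synthetic daily returns rather than the actual GLD and GDX data; the names `ret_a` and `ret_b`, the seed and the volatility parameters are all hypothetical and chosen so that the two series share a common driver.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
n_days = 500

# Two synthetic daily return series driven by a shared factor plus own noise
common = rng.normal(scale=0.01, size=n_days)
ret_a = pd.Series(common + rng.normal(scale=0.005, size=n_days), name="etf_a")
ret_b = pd.Series(common + rng.normal(scale=0.005, size=n_days), name="etf_b")

# Static correlation over the whole sample vs. a 60-day rolling window
static_corr = ret_a.corr(ret_b)
rolling_corr = ret_a.rolling(window=60).corr(ret_b)

print(f"static correlation: {static_corr:.3f}")
print(rolling_corr.dropna().describe())
```

The first 59 values of the rolling series are NaN because a full 60-day window is not yet available. With real price data one would first convert prices to returns, for example with `pct_change()`, before applying the rolling window.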
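The linearity limitation can be illustrated with a tiny synthetic example (not market data): y below is completely determined by x, yet the Pearson coefficient is essentially zero because the relationship is a symmetric parabola rather than a line.

```python
import numpy as np

# y depends perfectly on x, but not linearly
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation between x and x**2: {r:.6f}")
```

A coefficient near zero therefore only rules out a linear relationship, not dependence in general.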
In one of the next Technical Tuesdays I will focus on alternatives to the Pearson correlation coefficient that remedy some of its drawbacks and hope that today’s edition explained the subject in a practical and tangible way.