A more intuitive explanation of the correlation

Someone asks on Reddit:

Can someone intuitively explain the correlation formula?

I know what the Cov(X,Y) means. It tells you if the relationship between the variables X and Y is positive or negative (although I must admit I dont really know what the actual number means, I only look the the sign). I know what the standard deviation of X & Y means. Its the average distance of every observed variable from its mean. But I cannot for the life of me understand how Cov(X,Y)/stdX*StdY gives a number between -1&1 which tells me the strength of the relationship between X & Y.

When I see the formula Cov(X,Y)/stdX*StdY, I think to myself: “Ok, I’m taking some number, which basically only tells me if the relationship between X & Y is positive or negative, and then dividing that number with the average distance of every observation X from its mean * the average distance of every observation Y from its mean. And this then someone always gives me a number between -1 and 1. I just don’t understand how this makes sense. Can someone try to explain?? Thanks!

This post is a repost of my answer there.

The covariance of X and Y, Cov(X, Y), is expressed in the product of the units of X and Y, so its size depends on how much X and Y each vary. That’s why the number is hard to interpret apart from its sign. By dividing by sd(X) and sd(Y), we standardize the relationship, so that it is bounded between -1 and 1.

It is easier to understand the correlation if we standardize the numbers before doing the calculation, that is, convert each value to a Z score by subtracting the mean and dividing by the standard deviation. The correlation is a measure of how consistently the two Z scores go in the same direction and to the same relative degree. The correlation does not depend on which units we used.

E.g. suppose we have the data:
X = 5, 10, 15, 10, 5, 10
Y = 1, 2, 4, 1, 1, 3

The standard deviations (which are related to, but not the same as, the mean absolute distance to the mean) are:
sd(X) ≈ 3.8
sd(Y) ≈ 1.3
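
The two standard deviations can be reproduced with a short sketch in plain Python. The helper names `sample_sd` and `mean_abs_distance` are chosen here for illustration, and the sketch assumes the sample standard deviation (dividing by n - 1), which is what the rounded values above are consistent with. The mean absolute distance is printed alongside for comparison.

```python
from math import sqrt
from statistics import mean

X = [5, 10, 15, 10, 5, 10]
Y = [1, 2, 4, 1, 1, 3]

def sample_sd(values):
    """Sample standard deviation: squared deviations divided by n - 1."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def mean_abs_distance(values):
    """Average absolute distance of each value from the mean."""
    m = mean(values)
    return sum(abs(v - m) for v in values) / len(values)

sd_x, sd_y = sample_sd(X), sample_sd(Y)
mad_x, mad_y = mean_abs_distance(X), mean_abs_distance(Y)

print(round(sd_x, 1), round(sd_y, 1))    # 3.8 1.3
print(round(mad_x, 1), round(mad_y, 1))  # 2.8 1.0
```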

Their covariance, Cov(X, Y), is 4.

This number does not make much sense, but we see that it is positive, so the linear relationship is positive.
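
As a rough sketch of where the 4 comes from: the covariance is the sum of the products of the paired deviations from the means, divided by n - 1 (the sample version is assumed here). The variable names are illustrative only.

```python
from statistics import mean

X = [5, 10, 15, 10, 5, 10]
Y = [1, 2, 4, 1, 1, 3]

mean_x, mean_y = mean(X), mean(Y)

# Product of the two deviations from the mean, one per (x, y) pair
deviation_products = [(x - mean_x) * (y - mean_y) for x, y in zip(X, Y)]

# Sample covariance: sum of the products divided by n - 1
cov_xy = sum(deviation_products) / (len(X) - 1)

print(round(cov_xy, 2))  # 4.0
```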

If we divide it by the standard deviations, we get:
Cov(X, Y)/(sd(X)*sd(Y)) = cor(X, Y) ≈ 0.84

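This number is bounded between -1 and 1 because Cov(X, Y) is on the same scale as sd(X)*sd(Y), and its absolute value can never exceed sd(X)*sd(Y) (this follows from the Cauchy-Schwarz inequality).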
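
The "same scale" point can be checked with a small sketch. Here X is multiplied by 100, as if it had been recorded in a unit 100 times smaller (an assumption made purely for illustration). The covariance grows by the same factor as sd(X), so the ratio stays put. The helper functions `covariance` and `correlation` are written out by hand and are not taken from any particular library.

```python
from statistics import mean, stdev

def covariance(a, b):
    """Sample covariance (n - 1 in the denominator)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def correlation(a, b):
    """Covariance divided by the product of the sample standard deviations."""
    return covariance(a, b) / (stdev(a) * stdev(b))

X = [5, 10, 15, 10, 5, 10]
Y = [1, 2, 4, 1, 1, 3]
X_scaled = [100 * x for x in X]  # hypothetical change of units

# The covariance changes with the units, the correlation does not
print(round(covariance(X, Y), 2), round(covariance(X_scaled, Y), 2))    # 4.0 400.0
print(round(correlation(X, Y), 2), round(correlation(X_scaled, Y), 2))  # 0.84 0.84

# The extremes: a variable against itself, and against its own negative
print(round(correlation(X, X), 2), round(correlation(X, [-x for x in X]), 2))  # 1.0 -1.0
```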

To see it more clearly, it is better to standardize the X and Y values first. Then we get:
X_std = -1.11, 0.22, 1.55, 0.22, -1.11, 0.22
Y_std = -0.79, 0.00, 1.58, -0.79, -0.79, 0.79
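
A minimal sketch of the standardization step, assuming the sample standard deviation (n - 1) is used as in the numbers above. The function name `z_scores` is illustrative.

```python
from statistics import mean, stdev

X = [5, 10, 15, 10, 5, 10]
Y = [1, 2, 4, 1, 1, 3]

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

X_std = z_scores(X)
Y_std = z_scores(Y)

print([round(z, 2) for z in X_std])  # [-1.11, 0.22, 1.55, 0.22, -1.11, 0.22]
print([round(z, 2) for z in Y_std])  # [-0.79, 0.0, 1.58, -0.79, -0.79, 0.79]
```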

Now we can more easily see the pattern. The low/high values of X tend to go with the low/high values of Y. The standardization puts them on the same scale.

One can calculate the correlation in a more intuitive way using the Z scores. We begin by multiplying each pair of Z values. Because negative times negative gives positive and so does positive times positive, the pairs that go in the same direction have positive products. Pairs that go in opposite directions have negative products, and a pair has a product of zero when one of its values sits exactly at the mean.

The pairwise products are:
X_std*Y_std = 0.88, 0.00, 2.45, -0.18, 0.88, 0.18

Thus, we see that they tend to be positive. If we sum them, we get ≈ 4.2. This number is also not interpretable because it depends on the number of pairs we summed. Had we used more pairs, the sum would tend to be larger. To adjust for this, we divide by the number of pairs minus 1 (the degrees of freedom). We have 6 pairs, so we divide by 5. Thus, cor(X, Y) ≈ 4.2/5 ≈ 0.84, which matches the value obtained from the covariance.
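
The whole Z score route can be sketched in a few lines of plain Python: standardize, multiply pairwise, sum, and divide by n - 1. As before, the sample standard deviation is assumed and the names are illustrative.

```python
from statistics import mean, stdev

X = [5, 10, 15, 10, 5, 10]
Y = [1, 2, 4, 1, 1, 3]

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# One product per pair of Z scores
z_products = [zx * zy for zx, zy in zip(z_scores(X), z_scores(Y))]
print([round(p, 2) for p in z_products])  # [0.88, 0.0, 2.45, -0.18, 0.88, 0.18]

total = sum(z_products)
cor_xy = total / (len(X) - 1)  # divide by the number of pairs minus 1

print(round(total, 1))   # 4.2
print(round(cor_xy, 2))  # 0.84
```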
