One of the first things students learn in statistics is the "correlation coefficient" r, which measures the strength of the relationship between two variables. The formula given in most textbooks is something like the following:
where x and y are the data sets we're trying to measure the correlation of.
This formula can be useful, but also has some major disadvantages. It's complex, hard to remember, and gives students almost no insight into what the correlation coefficient is really measuring. In this post I'll explain an alternative way of thinking about "r" as the cosine of the angle between two vectors. This is the "linear algebra view" of the correlation coefficient.
A Different View of Correlation
The idea behind the correlation coefficient is that we want a standard measure of how "related" two data sets x and y are. Rather than thinking about data sets, imagine instead that we place our x and y data into two vectors u and v. These will be two n-dimensional arrows pointing through space. The question is: how "similar" are these arrows to each other? As we'll see below, the answer is given by the correlation coefficient between them.
The figure below illustrates the idea of measuring the "similarity" of two vectors v1 and v2. In the figure, the vectors are separated by an angle theta. A pretty good measure of how "similar" they are is the cosine of theta. Think about what cosine is doing. If both v1 and v2 point in roughly the same direction, the cosine of theta will be about 1. If they point in opposite directions, it will be -1. And if they are perpendicular or orthogonal, it will be 0. In this way, the cosine of theta fits our intuition pretty well about what it means for two vectors to be "correlated" with each other.
What is the cosine of theta in the figure? From the geometry of right triangles, recall that the cosine of an angle is the ratio of the length of the adjacent side to the length of the hypotenuse. In the figure, we form a right triangle by projecting the vector v1 down onto v2. This gives us a new vector p. The cosine of theta is then given by:
Now suppose we're interested in the correlation between two data sets x and y. Imagine we normalize x and y by subtracting from each data point the mean of the data set. Let's call these new normalized data sets u and v. So we have:
The question is, how "correlated" or "similar" are these vectors u and v to each other in space? That is, what is the cosine of the angle between u and v? This is simple: from the formula derived above, the cosine is given by:
But since and , this means the cosine of theta is just the correlation coefficient between the two vectors u and v, or:
From this perspective, the correlation coefficient has an elegant geometric interpretation. If two data sets are positively correlated, they should roughly "point in the same direction" when placed into n-dimensional vectors. If they're uncorrelated, they should point in directions that are orthogonal to each other. And if they're negatively correlated, they should point in roughly opposite directions.
The cosine of the angle between two vectors nicely fits that intuition about correlation. So it's no surprise the two ideas are ultimately the same thing -- a much simpler interpretation of "r" than the usual textbook formulas.