In this lesson, we'll look at how the variance of a random variable is used to calculate covariance and correlation, two key statistical measures for quantifying associations between random variables. Based on these measures, we can identify whether two variables are associated with each other, and to what extent. This lesson will help you develop a conceptual understanding of these measures, perform the necessary calculations, and learn some precautions to take when using them.
You will be able to:
- Understand and calculate covariance and correlation between two random variables
- Visualize and interpret the results of covariance and correlation
- Explain what is meant by the popular phrase "correlation does not imply causation"
Earlier in the course, we introduced variance (represented by $\sigma^2$) as a measure of the spread of a random variable. The formula for calculating variance is shown below:

$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}$$

Where:
- $x_i$ represents an individual data point
- $\mu$ is the mean of the data points
- $n$ is the total number of data points
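As a quick sketch, the population variance formula above can be computed in plain Python (the data values here are made up for demonstration):

```python
def variance(data):
    """Population variance: the average squared deviation from the mean."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n

# Made-up example: mean is 5, average squared deviation is 4
print(variance([2, 4, 4, 4, 5, 5, 7, 9]))  # → 4.0
```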
Let's take this further and see how this relates to covariance.
Imagine calculating the variance of two random variables together to get an idea of how they change in tandem, considering all values of both variables. In statistics, when we try to figure out how two random variables tend to vary together, we are effectively talking about the covariance between these variables, which helps us identify how the two are related to one another.
As we will see later in the course, covariance calculation plays a major role in a number of advanced machine learning techniques, such as dimensionality reduction and predictive modeling.
If we have two random variables $X$ and $Y$, their covariance is calculated as:

$$\sigma_{XY} = \frac{\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)}{n}$$

Where:
- $\sigma_{XY}$ = covariance between $X$ and $Y$
- $x_i$ = $i$th element of variable $X$
- $y_i$ = $i$th element of variable $Y$
- $n$ = number of data points ($n$ must be the same for $X$ and $Y$)
- $\mu_x$ = mean of the independent variable $X$
- $\mu_y$ = mean of the dependent variable $Y$
We see that the formula above multiplies the deviations of corresponding elements of $X$ and $Y$ from their respective means and then averages these products, i.e., it measures how the two variables vary together. Hence the term co-variance.
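The covariance formula translates almost directly into Python. A minimal sketch with made-up data (here $Y$ rises together with $X$, so the covariance comes out positive):

```python
def covariance(xs, ys):
    """Population covariance: the average product of paired deviations from the means."""
    assert len(xs) == len(ys), "n must be the same for X and Y"
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n

# Made-up data: ys increases with xs, so the covariance is positive
print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))  # → 2.5
```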
A general proof for covariance is available here*
Covariance values can range from negative infinity to positive infinity.

- A positive covariance indicates that higher-than-average values of one variable tend to pair with higher-than-average values of the other variable.
- A negative covariance indicates that higher-than-average values of one variable tend to pair with lower-than-average values of the other variable.
- A covariance of zero, or close to zero, indicates no linear relationship between the two variables.
The main shortcoming of covariance is that it keeps the scale of the variables, i.e., its magnitude depends on the units of measurement, which makes covariances hard to interpret and compare across datasets. This is where we need correlation.
Correlation is calculated by standardizing covariance by a measure of variability in the data (the standard deviations of the variables). This produces a quantity that has an intuitive interpretation and a consistent scale. We have seen that covariance uses a formulation that depends solely on the units of $X$ and $Y$; correlation removes this dependence.
The term "correlation" refers to a statistical relationship or association between variables. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example:
- Sales might increase when the marketing department spends more on TV advertisements
- Customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer, e.g. location, age group, gender etc.
- Social media activity and website clicks might be associated with the revenue that a digital publisher makes
Correlation is the first step to understanding these relationships and subsequently building better business and statistical models.
When two random variables are correlated, a change in one variable is accompanied by a measurable change in the values of the other variable.
In data science practice, while checking for associations between variables, we typically look at correlation rather than covariance when comparing variables. Correlation is more interpretable, since it does not depend on the scale of either random variable involved.
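To see why correlation travels better than covariance, here is a small sketch (with made-up numbers) showing that converting temperatures from Celsius to Fahrenheit changes the covariance with a second variable but leaves the correlation untouched:

```python
import numpy as np

temps_c = np.array([10.0, 15.0, 20.0, 25.0])    # made-up temperatures in °C
sales = np.array([200.0, 300.0, 420.0, 500.0])  # made-up sales in dollars

temps_f = temps_c * 9 / 5 + 32  # the same data in different units

# Population covariance depends on the units: it scales by the 9/5 factor...
cov_c = np.cov(temps_c, sales, bias=True)[0, 1]
cov_f = np.cov(temps_f, sales, bias=True)[0, 1]
print(cov_c, cov_f)

# ...but the Pearson correlation is identical for both unit choices
r_c = np.corrcoef(temps_c, sales)[0, 1]
r_f = np.corrcoef(temps_f, sales)[0, 1]
print(r_c, r_f)
```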
The Pearson Correlation Coefficient, denoted $r$, is the most commonly used measure of correlation.
*Note: There are a number of other correlation coefficients, but for now, we will focus on the Pearson correlation as it is the go-to correlation measure for most needs.*
Pearson Correlation ($r$) is calculated using the following formula:

$$r = \frac{\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2}\sqrt{\sum_{i=1}^{n}(y_i - \mu_y)^2}}$$

Just like in the case of covariance:
- $x_i$ = $i$th element of variable $X$
- $y_i$ = $i$th element of variable $Y$
- $n$ = number of data points ($n$ must be the same for $X$ and $Y$)
- $\mu_x$ = mean of the independent variable $X$
- $\mu_y$ = mean of the dependent variable $Y$
- $r$ = calculated Pearson correlation
In terms of variance , we can see that we are effectively measuring the variance of both variables together, normalized by their standard deviations. A detailed mathematical insight into this equation is available in this paper
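This formula can be sketched directly in Python: the numerator is the (unnormalized) covariance, and the denominator normalizes by the spread of each variable. The data in the usage examples is made up:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mu_x) ** 2 for x in xs)  # sum of squared deviations of X
    ss_y = sum((y - mu_y) ** 2 for y in ys)  # sum of squared deviations of Y
    return cov / sqrt(ss_x * ss_y)

# Perfectly linear made-up data gives the extreme values of r
print(pearson_r([1, 2, 3], [10, 20, 30]))  # → 1.0
print(pearson_r([1, 2, 3], [3, 2, 1]))     # → -1.0
```

Note that the $n$ factors in the covariance and the standard deviations cancel, which is why the function can skip dividing by $n$ entirely.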
The correlation formula shown above always gives values in the range between -1 and 1.
We often see patterns or relationships in scatter plots. Pearson correlation, which is a linear measure, can be assessed through a scatter plot by inspecting the "linearity of association" between two variables.
If two variables have a correlation of +0.9, a change in one variable is accompanied by a nearly proportional change in the other. A correlation value of -0.9 means that a change in one variable is accompanied by an opposite change in the other. A Pearson correlation near 0 indicates little or no linear relationship.
Here are some examples of Pearson correlation calculations shown as scatter plots.
Imagine we have collected data for 12 days on daily ice cream sales and average temperature on that day for a small ice cream shop. We want to see if these two variables are associated with each other in any way. Here is the data:
| Temp (°C) | Ice Cream Sales |
|-----------|-----------------|
| 14.2°     | $215            |
| 16.4°     | $325            |
| 11.9°     | $185            |
| 15.2°     | $332            |
| 18.5°     | $406            |
| 22.1°     | $522            |
| 19.4°     | $412            |
| 25.1°     | $614            |
| 23.4°     | $544            |
| 18.1°     | $421            |
| 22.6°     | $445            |
| 17.2°     | $408            |
And here is the same data as a Scatter Plot:
We can easily see that hotter weather and higher sales go together. The relationship is strong but not perfect. The correlation for this example is 0.9575, which indicates a very strong positive relationship.
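We can check this number ourselves by running the table's data through NumPy:

```python
import numpy as np

# Daily temperature (°C) and ice cream sales ($) from the table above
temps = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 22.1,
                  19.4, 25.1, 23.4, 18.1, 22.6, 17.2])
sales = np.array([215, 325, 185, 332, 406, 522,
                  412, 614, 544, 421, 445, 408])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(temps, sales)[0, 1]
print(round(r, 4))  # → 0.9575
```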
We have often heard that "correlation is not causation" or "correlation does not imply causation". But what do we mean by saying this?
Causation goes a step further than correlation: it means that a change in the value of one variable *causes* a change in the value of another variable, i.e., one variable makes the other happen. This is also referred to as cause and effect.
Let's try to understand this with an example.
Consider Hidden Factors
Suppose that for the above ice cream sales example, we now have some extra data on the number of homicide cases in New York. Out of curiosity, we plot sales numbers vs. the homicide rate as a scatter plot and see that these two are also related to each other. Mind blown... Is ice cream turning people into murderers?
For our example, it's actually the weather that is the hidden factor behind both of these events: it drives the rise in both ice cream sales and homicides. In summer, people usually go out, enjoy the sunny days, and cool off with ice cream. And when it's sunny, more people are outside, so there is a wider selection of victims for predators. There is no causal relationship between ice cream and the homicide rate; sunny weather brings both factors together. We can say that ice cream sales and the homicide rate each have a causal relationship with the weather.
This is reflected in the image below:
After finding a correlation, we shouldn't draw conclusions too quickly. Instead, we should take time to look for other underlying factors, as correlation is just the first step: find the hidden factors, verify whether they are correct, and only then conclude.
Here is another (rather funny) example of how correlation analysis may lead to implausible findings.
So is the relationship above causal, a true cause-and-effect relationship? What do you think?
The internet is full of other funny correlations; do a quick Google search and you'll come across lots of them. So the key takeaway here is that covariance and correlation analysis should be used with care. A high correlation may hint at some sort of relationship between variables, but it should not be taken as a cause-and-effect scenario.
IMPORTANT NOTE: The variance, covariance, and correlation formulas above consider the data to be a complete population. When working with a sample, divide by $n - 1$ instead of $n$ (Bessel's correction) to obtain unbiased estimates of variance and covariance.
In this lesson, we looked at the variance of a random variable as a measure of deviation from the mean. We saw how this measure can be used first to calculate covariance, and then correlation, to analyze how a change in one variable relates to a change in another. We looked at the formulas for calculating these measures, and you were provided with mathematical proofs of these formulas. Next, we'll see how we can use correlation analysis to run a regression analysis and, later, how covariance calculations help us with dimensionality reduction.
- Correlation is dimensionless, i.e. it is a unit-free measure of the relationship between variables.
- Correlation is a normalized form of covariance and always lies in the range [-1, 1]
- Correlation is not causation