GithubHelp home page GithubHelp logo

dsc-correlation-covariance-nyc-ds-060319's Introduction

Covariance and Correlation

Introduction

In this lesson, you'll learn how the variance of a variable is used to calculate covariance and correlation as key measures used in statistics to find causal relationships between variables. Based on these measures, you can find out if two variables are associated with each other, and to what extent. This lesson will help you develop a conceptual understanding, necessary calculations, and some precautions while using these measures.

Objectives

You will be able to:

  • Distinguish between correlation and causation
  • Interpret correlation and its relationship to covariance
  • Calculate covariance and correlation

Variance

In an earlier lesson, you learned about variance (represented by $\sigma^2$) as a measure of dispersion for continuous variables from its expected mean value. Let's quickly revisit this, as variance formula plays a key role while calculating covariance and correlation measures.

The formula for calculating variance as shown below:

$$\sigma^2 = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i-\mu)^2$$

  • $x$ represents an individual data points
  • $\mu $ is the mean of the data points
  • $n$ is the total number of data points

Example: Let's consider the monthly ice cream sales for a small ice cream vendor in NYC (January-May):

Month Amount
January 100
February 200
March 300
April 500
May 900

You can calculate that, with a mean of 400 over these 5 months, the variance of the sales amounts is 80,000 (and the standard deviation is 282.84).

Now, let's take this further and see how this relates to covariance.

Covariance

In some cases, you'll want to look at two variables to get an idea about how they change together. In statistics, when trying to figure out how two variables vary together, you can use the covariance between these variables.

Covariance calculation plays a major role in a number of advanced machine learning algorithms like dimensionality reduction, predictive analyses, etc.

Calculating Covariance

If you have $X$ and $Y$, two variables having $n$ elements each. You can calculate covariance ($\sigma_{xy}$) between these two variables by using the formula:

$$\sigma_{XY} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)$$

  • $\sigma_{XY}$ = Covariance between $X$ and $Y$
  • $x_i$ = ith element of variable $X$
  • $y_i$ = ith element of variable $Y$
  • $n$ = number of data points ($n$ must be same for $X$ and $Y$)
  • $\mu_x$ = mean of the independent variable $X$
  • $\mu_y$ = mean of the dependent variable $Y$

You can see that the formula calculates the variance of $X$ and $Y$ by multiplying the variance of each of their corresponding elements. Hence the term covariance.

Back to our ice cream sales example:

Imagine there is another ice cream vendor, Vendor B. Let's calculate the covariance between our previous Vendor A and this new Vendor B.

Ice cream sales for a small ice cream vendor in NYC (January-May):

Month Vendor A Vendor B
January 100 220
February 200 400
March 300 600
April 500 900
May 900 1300

Recall that Vendor A sold on average 400, and Vendor B sells on average 684, considerably higher.

Using this along with the covariance formula, you can compute that the covariance between ice cream Vendor A and B sales is 106800. Now the question is, what does that mean?

Interpreting Covariance Values

Covariance values range from positive infinity to negative infinity.

  • A positive covariance indicates that two variables are positively related

  • A negative covariance indicates that two variables are inversely related

  • A covariance equal or close to 0 indicates that there is no linear relationship between two variables

Looking back at our ice cream example, it is no surprise that you found a positive covariance there. While the sales amounts are different, it is clear that when sales went up for Vendor A, they also went up for Vendor B.

You'll immediately notice the main shortcoming of covariance: because it ranges between negative and positive infinity, it only tells us when two variables are inversely or positively related. It doesn't tell us anything about to what extent they are inversely or positively related. This makes interpretation difficult and comparing covariances to each other impossible.

Assume you have four different variables a, b, p, and q. Next, let's say you can have $ Cov(a, b) = 3.6$ and $Cov(p, q) = 9$. These covariance values tell us that these pairs are positively associated, but it is difficult to tell whether the relationship between a and b is stronger than p and q without looking at the means and distributions of these variables.

This is where the concept of correlation comes in.

Correlation is calculated by standardizing covariance by some measure of variability in the data. It produces a quantity that has intuitive interpretations and consistent scale.

Let's consider another ice cream Vendor C to reinforce this idea. The mean sales amount for this vendor is 1,120, and the covariance between Vendor A and C is 132,000, which is considerably higher than the covariance of 106,800 which we've seen before. Does this mean the positive relationship between A and C is stronger? We need the correlation here to answer this question!

Month Vendor A Vendor B Vendor C
January 100 220 900
February 200 400 800
March 300 600 500
April 500 900 1400
May 900 1300 2000

Correlation

Covariance uses a formulation that only depends on the units of $X$ and $Y$ variables. When doing data science, it is often not appropriate to use covariance as such because different experiments may contain underlying data measured in different units.

Because of this, it is important to normalize this degree of variation into a standard unit, with interpretable results, independent of the units of data. You can achieve this with a derived normalized measure, called correlation.

The term "correlation" refers to a relationship or association between variables. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example:

  • Sales might increase when the marketing department spends more on TV advertisements
  • Customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer, e.g. location, age group, gender etc.
  • Social media activity and website clicks might be associated with revenue that a digital publisher makes. etc.

Correlation is the first step to understanding these relationships and subsequently building better business and statistical models.

Pearson's Correlation Coefficient ( $r$ )

Pearson Correlation Coefficient, $r$, also called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. This coefficient quantifies the degree to which a relationship between two variables can be described by a line.

Note: There are a number other correlation coefficients, but for now, we will focus on Pearson correlation as it is the go-to correlation measure for most needs. If someone says "correlation" without further clarification, you can assume that they mean Pearson correlation.

Calculating Correlation Coefficient

Pearson Correlation ($r$) is calculated using following formula :

$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$

So just like in the case of covariance, $X$ and $Y$ are two variables having $n$ elements each.

  • $x_i$ = ith element of variable $X$
  • $y_i$ = ith element of variable $Y$
  • $n$ = number of data points ($n$ must be same for $X$ and $Y$)
  • $\mu_x$ = mean of the independent variable $X$
  • $\mu_y$ = mean of the dependent variable $Y$
  • $r$ = Calculated Pearson Correlation

Interpreting Correlation Values

The Pearson Correlation formula always gives values in a range between -1 and 1

You'll often see patterns or relationships in scatter plots. Pearson correlation, which is a linear measure can be identified through a scatter plot by inspecting the "linearity of association" between two variables.

If two variables have a correlation of +0.9, this means the change in one item results in an almost similar change to another item. A correlation value of -0.9 means that the change is one variable results in an opposite change in the other variable. A Pearson correlation near 0 would be no effect.

Here are some example of Pearson correlation calculations as scatter plots.

** Let's go back to our ice cream example**:

Month Vendor A Vendor B Vendor C
January 100 220 900
February 200 400 800
March 300 600 500
April 500 900 1400
May 900 1300 2000

Recall that $Cov(A, C)$ was higher than $Cov(A, B)$. Using the formula for correlation coefficient, vendors A and B have a correlation coefficient of 0.9888 whereas vendors A and C have a lower correlation coefficient of 0.8858!

Ice cream and the weather:

Let's take a deep dive into an ice cream vendor's sales data. Now, we're going to look at the relationship between the average daily temperature and the daily profit on 12 days in April (the degrees are on the Celsius scale):

Temp (°C) Ice Cream Sales (USD)
14.2 215
16.4 325
11.9 185
15.2 332
18.5 406
22.1 522
19.4 412
25.1 614
23.4 544
18.1 421
22.6 445
17.2 408

And here is the same data as a scatter plot:

Just by looking at the plot, you can easily see that hotter weather goes along with higher ice cream sales. The relationship is good but not perfect. The correlation for this example is 0.9575 which indicates a very strong positive relationship.

Correlation Is Not Causation

You may have come across the saying: “correlation is not causation” or “correlation does not imply causation”.

What do we mean by this?

Causation takes correlation a step further.

Causation is when any change in the value of one variable leads to a change in the value of another variable, which means one variable causes the change in another. This is also referred to as cause and effect.

Let's try to understand this with an example.

Consider Hidden Factors

Suppose for the above ice cream sales example, we now have some extra data on the number of homicide cases in New York. Out of curiosity, we analyze sales numbers vs. homicide rate as scatter plot and see that these two are also related to each other. Mind blown... Is ice cream turning people into murderers?

Example 1: Ice cream sales are correlated with the number of homicides in New York

For our example, it's actually the weather as a hidden factor that is causing both these events. It is actually causing the rise in ice cream sales and homicides. As in summer people usually go out, enjoy a nice sunny day and chill themselves with ice creams. So when it’s sunny, a wide range of people are outside and there is a wider selection of victims for predators. There is no causal relationship between the ice cream and rate of homicide, sunny weather is bringing both the factors together. We can say that ice cream sales and homicide rate have a causal relationship with weather.

This is reflected in the image below:

Just after finding the correlation, we shouldn't draw the conclusion too quickly. Instead, we should take time to find other underlying factors as correlation is just the first step. Find the hidden factors, verify if they are correct and then conclude.

Here are some other (rather funny) examples of how correlation analysis may lead to inconceivable findings.

Example 2: Number of Nicholas Cage movie releases correlates with people drowning in pools

So, how about this chart?

The internet is full of other funny and weird correlations. You'll come across other ones if you do a quick Google search. This particular correlation between Nicolas Cage movies and drowning accidents was found on this website. The key takeaway here is that covariance and correlation analysis should be used with care. The thing is that nowadays, you can find data on a wide variety of subjects. This means that if you search long enough, you'll find a high correlation somewhere between two completely unrelated subjects. This phenomenon is also referred to as "the law of large numbers".

Example 3: Our very own ice cream example

Our very own ice cream example in this lesson is a good example of why correlation does not automatically imply causation.

The sales increase for Vendor A did not cause a sales increase for vendor B. The hidden factor here was (clearly) the weather.

What is important to note is here that, although correlation does not imply causation, IF there is a causal effect, there generally will be a high correlation as well. In other words:

If one factor has an effect on another one (causation), there generally will be a high correlation as well. Yet, this doesn't mean that a high correlation means that there is causation.

Key Takeaways

  • Covariance is dimensionless, i.e. it is a unit-free measure of the relationship between variables
  • Correlation is a normalized form of covariance and ranges between [-1,1]
  • Correlation is not causation

IMPORTANT NOTE: The variance formula used in this lesson considers $X$ and $Y$ to be population variables. For samples, we divide by $n-1$, instead of $n$ due to the principle called Bessel's correction. Here is a mathematical proof for this. We will visit this point again later in the course to explore it in greater detail.

Summary

In this lesson, you reviewed how to compute the variance of variables as a measure of deviation from the mean. Next, you learned how this measure can be used to first calculate covariance, and the correlation to analyze how two variables change together. You learned the formulas for calculating these measures and how they can be applied to real-world situations.

dsc-correlation-covariance-nyc-ds-060319's People

Contributors

fpolchow avatar hoffm386 avatar loredirick avatar mathymitchell avatar shakeelraja avatar sumedh10 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.