Correlation and regression


At the end of the lecture students should be able to:


One of the simplest questions we can ask about two variables is: "Is there a linear relationship between them?"

The (Pearson) correlation coefficient, r, is a measure of linear association between two continuous variables. It is a measure of how well the data fit a straight line.

-1 ≤ r ≤ 1
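As a pure-Python sketch (not part of the lecture), r can be computed from the sums of squares and cross-products; the helper name pearson_r is illustrative:

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # cross-products
    sxx = sum((a - mean_x) ** 2 for a in x)                       # sum of squares, x
    syy = sum((b - mean_y) ** 2 for b in y)                       # sum of squares, y
    return sxy / sqrt(sxx * syy)

# Data lying exactly on a straight line with positive slope give r = 1
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0
```

In practice a library routine (e.g. statistics.correlation in the Python standard library) would be used instead.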

Figure: scatterplots showing data sets with different correlation coefficients.

We can also perform a hypothesis test; the null hypothesis is that there is no correlation between the two variables (i.e. the population correlation coefficient, ρ, is zero).
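Under the null hypothesis ρ = 0, the usual test statistic is t = r√(n − 2) / √(1 − r²), which follows a t distribution with n − 2 degrees of freedom. A minimal sketch (the sample values r = 0.6, n = 27 are made up for illustration):

```python
from math import sqrt

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, based on a sample correlation r
    from n pairs; under H0 it has a t distribution with n - 2 df."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# e.g. an observed r = 0.6 from n = 27 pairs
print(t_statistic(0.6, 27))  # 3.75
```

The resulting value would then be compared against the t distribution with n − 2 degrees of freedom to obtain a p-value.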


Other points to note:

Correlation should not be used when:

Two examples of when not to use a correlation coefficient:

(a) When there is a non-linear relationship; (b) when distinct subgroups are present. In both of these examples the correlation coefficient quoted is spurious.
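The non-linear case (a) can be illustrated numerically: data following a perfect quadratic relationship can have a Pearson r of exactly zero, even though the variables are completely dependent. A small sketch (the helper pearson_r is illustrative, not from the lecture):

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation coefficient (illustrative helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# A perfect but non-linear relationship: y = (x - 3)^2, symmetric about x = 3
x = [1, 2, 3, 4, 5]
y = [(v - 3) ** 2 for v in x]
print(pearson_r(x, y))  # 0.0 despite perfect dependence
```

This is why a scatterplot should always be inspected before quoting a correlation coefficient.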

Spurious correlations crop up all the time:

Spearman rank correlation coefficient

A (Pearson) correlation coefficient should only be calculated between two normally distributed random variables.

The Spearman rank correlation coefficient, rs, can be calculated for non-normally distributed variables. Because it is based only on the ranks of the observations, outliers do not affect it unduly.
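Spearman's rs is simply the Pearson coefficient computed on the ranks of the data. A sketch (assuming no tied values, which would need mid-ranks; all helper names are illustrative):

```python
from math import sqrt

def ranks(values):
    """1-based ranks of the observations; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# The extreme x value (an outlier) does not disturb rs:
# the data are perfectly monotone, so rs = 1
print(spearman_rs([1, 2, 3, 4, 100], [1, 2, 3, 4, 5]))  # 1.0
```

By contrast, Pearson's r on the raw values above would be pulled away from 1 by the single outlier.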

Linear regression

Why are variables correlated?

If two variables, A and B, are correlated then there are four possibilities: