Correlation and regression
Objectives
At the end of the lecture students should be able to:
- Interpret a correlation coefficient
- Be aware of some of the basic problems associated with the use of correlation coefficients
- Know when it is appropriate to use a correlation coefficient and when a regression technique should be used
Correlation
One of the simplest questions we can ask about two variables is: "Is there a linear relationship between them?"
The (Pearson) correlation coefficient, r, is a measure of linear association between two continuous variables. It is a measure of how well the data fit a straight line.
1 => r >= -1
- If r > 0 we have a Positive correlation
- If r < 0 we have a Negative correlation
- If r = 0 we have No correlation
Scatterplots showing data sets with different correlations
We can also perform a hypothesis test, the null hypothesis is that there is no correlation between the two variables (i.e. the population correlation coefficient, r, is zero).
Note:
- It is harder to spot a correlation close to zero, but these are the ones we come across most often
Other points to note:
- The correlation coefficient is unaffected by units of measurement
- Correlations of less than 0·7 should be interpreted cautiously
- Correlation does not imply causation
- Overall, I do not find correlation to be a very useful technique
Correlation should not be used when:
- There is a non-linear relationship between variables
- The are outliers
- There are distinct sub-groups
For example healthy controls with diseased cases - If the values of one of the variables is determined in advance
e.g. Picking the doses of a drug in an experiment measuring its effect
Two examples of when not to use a correlation coefficient:
(a) When there is a non-linear relationship; (b) when distinct subgroups are present. In both of these examples the correlation coefficient quoted is spurious.
Spurious correlations crop up all the time:
- The price of petrol shows a positive correlation with the divorce rate over time
- Number of deaths from heart attacks in a population rises with incidence of long-sightedness over time
- Maximum daily air temperature and number of deaths of cattle were positively correlated during March 2001
- If we repeatedly measure two variables on the same individual over a period of time e.g. a child's height and ability to read, then we will tend to see a correlation
Spearman rank correlation coefficient
A (Pearson) correlation coefficient should be only calculated between two normally distributed random variables
The Spearman rank correlation coefficient, rs, can be calculated for non-normally distributed variables. As this coefficient is based only on the ranks of observations outliers do not affect it unduly.
Linear regression
- Regression analysis fits the best line to the observed data and allows us to make predictions about one variable from the values of the other.
- One variable (the independent variable) is assumed to predict the other (the dependent), the results are not the same if we swap the variables.
- The values of the independent variable may be selected.
- The values do not have to be normally distributed
- There are other assumptions and requirements of a regression analysis. (The relationships is approximately linear; the residuals have to be normally distributed etc.)
- Regression analysis is best carried out under the guidance of a statistician
Why are variables correlated?
If two variables, A and B, are correlated then there are four possibilities:
- The result occurred by chance
- A influences ('causes') B (or, B influences A. Not the same thing!)
- A and B are influenced by some other variable(s). This can happen in two ways:
- C may 'cause' both A and B e.g. increased consumption of sugar increases the number of caries a person has and increases their weight. Does more weight cause more caries?
- A may lead to an increase in C which 'causes' B e.g. low income may increase chance of smoking which increases chance of death from lung cancer. Does low income cause lung-cancer?