# Correlation and regression

#### Objectives

At the end of the lecture students should be able to:

- Interpret a correlation coefficient
- Be aware of some of the basic problems associated with the use of correlation coefficients
- Know when it is appropriate to use a correlation coefficient and when a regression technique should be used

## Correlation

One of the simplest questions we can ask about two variables is: "Is there a linear relationship between them?"

The (Pearson) correlation coefficient, **r**, is a measure of linear association between two continuous variables. It is a measure of how well the data fit a straight line.

**1 => r >= -1**

- If
**r > 0**we have a**Positive**correlation - If
**r < 0**we have a**Negative**correlation - If
**r = 0**we have**No**correlation

*Scatterplots showing data sets with different correlations*

We can also perform a hypothesis test, the null hypothesis is that there is no correlation between the two variables (i.e. the population correlation coefficient, **r**, is zero).

Note:

- It is harder to spot a correlation close to zero, but these are the ones we come across most often

Other points to note:

- The correlation coefficient is unaffected by units of measurement
- Correlations of less than 0·7 should be interpreted cautiously
- Correlation does not imply causation
- Overall, I do not find correlation to be a very useful technique

Correlation should not be used when:

- There is a non-linear relationship between variables
- The are outliers
- There are distinct sub-groups

For example healthy controls with diseased cases - If the values of one of the variables is determined in advance

e.g. Picking the doses of a drug in an experiment measuring its effect

Two examples of when not to use a correlation coefficient:

*(a) When there is a non-linear relationship; (b) when distinct subgroups are present. In both of these examples the correlation coefficient quoted is spurious.*

Spurious correlations crop up all the time:

- The price of petrol shows a positive correlation with the divorce rate over time
- Number of deaths from heart attacks in a population rises with incidence of long-sightedness over time
- Maximum daily air temperature and number of deaths of cattle were positively correlated during March 2001
- If we repeatedly measure two variables on the same individual over a period of time e.g. a child's height and ability to read, then we will tend to see a correlation

#### Spearman rank correlation coefficient

A (Pearson) correlation coefficient should be only calculated between two normally distributed random variables

The Spearman rank correlation coefficient, **r _{s}**, can be calculated for non-normally distributed variables. As this coefficient is based only on the ranks of observations outliers do not affect it unduly.

## Linear regression

- Regression analysis fits the best line to the observed data and allows us to make predictions about one variable from the values of the other.
- One variable (the independent variable) is assumed to predict the other (the dependent), the results are not the same if we swap the variables.
- The values of the independent variable may be selected.
- The values do not have to be normally distributed
- There are other assumptions and requirements of a regression analysis. (The relationships is approximately linear; the residuals have to be normally distributed etc.)
- Regression analysis is best carried out under the guidance of a statistician

## Why are variables correlated?

If two variables, **A** and **B**, are correlated then there are four possibilities:

- The result occurred by chance
**A**influences ('causes')**B**(or,**B**influences**A**. Not the same thing!)**A**and**B**are influenced by some other variable(s). This can happen in two ways:**C**may 'cause' both**A**and**B**e.g. increased consumption of sugar increases the number of caries a person has and increases their weight. Does more weight cause more caries?**A**may lead to an increase in**C**which 'causes'**B**e.g. low income may increase chance of smoking which increases chance of death from lung cancer. Does low income cause lung-cancer?