# Sampling a population

#### Objectives

At the end of the lecture, and having completed the exercises, students should be able to:

- Show a basic awareness of why it is important for dentists to have some knowledge of statistics
- Describe the difference between a population and a sample and know how to take acceptable samples from populations

### Population

A population contains every member of a defined group of interest.

We might define a population as "all children aged between five and
ten with caries living in Leeds". A particular characteristic (or
variable) of the population that we wish to know about is called a
*population parameter*. If we want to know how often they brush their
teeth we could ask every child with caries in this age group how often they
brush their teeth and calculate the average*; *average number of times a
day that teeth are brushed* is thus the population parameter. This is
clearly impractical so we study a *sample* of them.

\* Statisticians do not really like the word
*average* as it is not precise enough - there are many different
sorts of average. We will be looking at this more closely in the next lecture.

### Sample

A sample is the section of a population that we actually study.

We might decide to select 50 children aged between five and ten in Leeds
with caries and ask them how often they brush their teeth. (The value of a
particular characteristic of a sample is called the *sample
statistic*.) If the average number of times these children brushed their
teeth was 1.7 then we might conclude that children in Leeds aged between
five and ten brush their teeth on average *about* 1.7 times a
day.

### Descriptive statistics

Descriptive statistics are the techniques we use to *describe* the
main features of a sample. In the example above we *described* the
average number of times the children in the *sample* brushed their
teeth.
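
The arithmetic behind this is just the sample mean. A minimal sketch in Python, using invented brushing counts for ten hypothetical children (the 50-child figure above is not reproduced here):

```python
from statistics import mean

# Invented daily brushing counts for ten hypothetical sampled children
brushing_counts = [2, 1, 2, 1, 2, 2, 1, 3, 2, 1]

# The sample statistic: average number of brushings per day
sample_mean = mean(brushing_counts)
print(sample_mean)  # 1.7
```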

### Inferential statistics

Statistical inference is the process of using the value of a sample statistic to make an informed guess about the value of a population parameter.

In the above example we used the value of the sample statistic
*average number of times a day that teeth are brushed* to make an
informed guess of the value of the population parameter *average number
of times a day that teeth are brushed*.

### Sample selection

For the process of statistical inference to be valid we must ensure that
we take a *representative sample* of our population.

Whatever method of sample selection we use, it is vital that the method is described.

How do we know if the characteristics of a sample we take match the
characteristics of the population we are sampling? The short answer is
*we don't*. We can, however, take steps that make it as likely
as possible that the sample will be *representative* of the
population. Two simple and effective methods of doing this are making sure
the sample size is large and making sure it is randomly selected.

A large sample is more likely to be representative of a population than a small one. Think of extreme cases: if we want to know the average height of the population and we select just one person and measure their height, it is unlikely to be close to the population average. If we took 1,000,000 people, measured their heights and took the average, this figure would be likely to be close to the population average.
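
This can be illustrated with a small simulation. The sketch below assumes a population whose heights follow a normal distribution with mean 170 cm and standard deviation 10 cm; the numbers are illustrative, not real data:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

POP_MEAN, POP_SD = 170.0, 10.0  # assumed population mean and SD of height (cm)

def sample_mean_height(n):
    """Draw n heights from the assumed population and return their average."""
    heights = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    return sum(heights) / n

# A single measurement is often several centimetres from the population mean;
# the mean of a very large sample is typically within a fraction of a centimetre.
print(abs(sample_mean_height(1) - POP_MEAN))
print(abs(sample_mean_height(100_000) - POP_MEAN))
```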

We will be looking at the effects of different sample sizes in more depth later on in the course so I will not go into it further now.

### Random allocation

The type of sample most likely to be representative is a random sample. A random sample is one where each member of the population has an equal chance of being selected for the sample.

Random selection is best for two reasons - it eliminates bias and statistical theory is based on the idea of random sampling.

We can randomly allocate patients by a number of methods: using tables of random numbers, a computerised random number generator, sealed envelopes etc.

Selecting a sample randomly from a population is very important because:

- **It tends to eliminate unknown sources of bias.** If we are aware of a potential source of bias in a study comparing treatments we might decide to select carefully to avoid it, but in all studies on humans there will inevitably be unknown factors affecting the outcome we are looking at. If we have selected our subjects randomly then even unknown factors will tend to even out between treatment groups.
- **The researcher's preconceptions and biases do not influence the choice of subjects.** If a researcher selects subjects in any systematic way there will always be the suspicion that subjects have been allocated to groups in a way that affects the results. Perhaps the researcher puts the subjects who are more ill onto the treatment they think will be most effective; this can happen quite unconsciously and will certainly affect the results. A similar effect might arise if you ask patients to volunteer for a new treatment, since these patients are unlikely to be representative of all patients.
- **Statistical theory is based on the idea of random sampling.** All the statistical methods you will be using to analyse your results are based on the premise that your sample was randomly selected. If it was not, the answers you get could be very wrong.

In practice the most common methods of random allocation are tables of random numbers and computerised random number generators.

If I were conducting a study looking at two treatments,
**A** and **B**, then one way I could allocate
patients to treatment groups would be by using a table of random numbers.
The following set of random numbers came from a popular statistics textbook
(most statistics textbooks have them):

**65246356854282020026**

I could allocate patients to treatment **A** if the number
were odd and **B** if it were even. This would result in
successive patients being allocated in the sequence:

**BABBBAABBABBBBBBBBBB**

Randomly selected numbers often seem to have patterns in them, like long
runs of the same number. This is not a problem if we are conducting a large
study - everything evens out over time. If the above study had stopped after
recruiting 20 patients, however, then we would have had four patients on
treatment **A** and sixteen on **B**. This would not be a
very good basis for comparing the two treatments.
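
The odd/even rule above can be checked in a couple of lines (a Python sketch using the same digit string):

```python
digits = "65246356854282020026"  # the random digits quoted above

# Odd digit -> treatment A, even digit -> treatment B
allocation = "".join("A" if int(d) % 2 else "B" for d in digits)

print(allocation)                                    # BABBBAABBABBBBBBBBBB
print(allocation.count("A"), allocation.count("B"))  # 4 16
```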

### Block randomisation

In small studies we want to avoid the severe imbalance in the numbers going into each group that may result from purely random allocation of subjects to groups.

One method that can keep groups more or less the same size but doesn't
depart too far from perfect randomisation is **block
randomisation**. Using this method we recruit in blocks of a certain
size, say four. We ensure that in each block of four two patients get
treatment **A** and two treatment **B**, but
within each block of four the allocation is random. So the first block may
be **ABAB**, the second **ABBA**, the third
**BBAA** and so on. In this way, no matter at what point the
study stops, one group will never have more than two patients (half
the block size) more than the other.
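
A minimal sketch of block randomisation with a block size of four (the function name is my own):

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def block_randomise(n_blocks, block=("A", "A", "B", "B")):
    """Concatenate n_blocks blocks, each a random permutation of two As and two Bs."""
    allocations = []
    for _ in range(n_blocks):
        b = list(block)
        random.shuffle(b)  # random order within the block
        allocations.extend(b)
    return allocations

seq = block_randomise(5)  # allocations for 20 patients
print("".join(seq))
# However early the study stops, the group sizes differ by at most two.
```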


### Stratified randomisation

A further problem in small studies is that we may be aware of a potential confounding variable. For example, if we are looking at the effect of two different toothbrushes on plaque formation we also know that smoking has a great effect on plaque. In a large study this does not matter - the number of smokers using each toothbrush will end up more or less the same. In a small study we may well see an imbalance in the number of smokers great enough to influence (confound) the result.

One way round this is **stratified randomisation**. We have
separate randomisation sequences for smokers and non-smokers. (Both these
sequences should be block randomised, otherwise we lose the benefit of
stratification - the groups could still end up unbalanced.) It is possible
to stratify on more than one variable; for example, we could stratify on
smoking and use of floss in the above study. We would then have four
randomisation sequences to prepare (smoking flossers, non-smoking flossers,
smoking non-flossers and non-smoking non-flossers). Increasing the number of
stratification variables quickly becomes unworkable in practice as we at
least double the number of groups with every additional stratifying
variable.
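
Stratified randomisation can then be sketched as one block-randomised sequence per stratum (the stratum names and block size below are illustrative):

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def block_randomise(n_blocks, block=("A", "A", "B", "B")):
    """One balanced, block-randomised allocation sequence."""
    out = []
    for _ in range(n_blocks):
        b = list(block)
        random.shuffle(b)
        out.extend(b)
    return out

# A separate, independently block-randomised sequence for each stratum
strata = ["smoker", "non-smoker"]
sequences = {s: block_randomise(3) for s in strata}

for s in strata:
    print(s, "".join(sequences[s]))  # each sequence is balanced within itself
```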

### Blinding

Ideally neither the subject nor the researcher should know which
treatment the subject is receiving - a *double-blind* study. If this cannot
be achieved then either the subject or the investigator should not know
which treatment is being given - a *single-blind* study.

### Matching

It is sometimes beneficial to match subjects in the two groups on major
characteristics (other than the one being investigated!) which may affect
(*confound*) the outcome.