Sampling a population

Objectives

At the end of the lecture and having completed the exercises students should be able to:

Show a basic awareness of why it is important for dentists to have some knowledge of statistics
Describe the difference between a population and a sample and know how to take acceptable samples from populations

Population

A population contains every member of a defined group of interest.

We might define a population as "all children aged between five and ten with caries living in Leeds". A particular characteristic (or variable) of the population that we wish to know about is called a population parameter. If we want to know how often they brush their teeth we could ask every child with caries in this age group how often they brush their teeth and calculate the average*; average number of times a day that teeth are brushed is thus the population parameter. This is clearly impractical so we study a sample of them.

* Statisticians do not really like the word average as it is not precise enough - there are many different sorts of average. We will be looking at this more closely in the next lecture.

Sample

A sample is the section of a population that we actually study.

We might decide to select 50 children aged between five and ten in Leeds with caries and ask them how often they brush their teeth. (The value of a particular characteristic of a sample is called the sample statistic.) If the average number of times these children brushed their teeth was 1.7 then we might conclude that children with caries in Leeds aged between five and ten brush their teeth on average about 1.7 times a day.

Descriptive statistics

Descriptive statistics are the techniques we use to describe the main features of a sample. In the example above we described the average number of times the children in the sample brushed their teeth.

Inferential statistics

Statistical inference is the process of using the value of a sample statistic to make an informed guess about the value of a population parameter.

In the above example we used the value of the sample statistic (the average number of times a day that teeth are brushed by the children in the sample) to make an informed guess at the value of the population parameter (the average number of times a day that teeth are brushed by all the children in the population).
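The relationship between a population parameter and a sample statistic can be illustrated with a small simulation. This is only a sketch: the population of 10,000 children, the brushing counts and their weights are all invented for illustration, and in a real study the population value would not be known.

```python
import random
import statistics

# Hypothetical population: daily brushing counts for 10,000 children.
# The values and their weights are invented for illustration only.
rng = random.Random(1)
population = rng.choices([0, 1, 2, 3], weights=[5, 35, 45, 15], k=10_000)

# Population parameter: normally unknown; we can compute it here only
# because the population is simulated.
population_mean = statistics.mean(population)

# Sample statistic: the mean of a random sample of 50 children.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean; inference is the step of using the former as an estimate of the latter.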

Sample selection

For the process of statistical inference to be valid we must ensure that we take a representative sample of our population.

Whatever method of sample selection we use it is vital that the method is described.

How do we know if the characteristics of a sample we take match the characteristics of the population we are sampling? The short answer is we don't. We can, however, take steps that make it as likely as possible that the sample will be representative of the population. Two simple and effective methods of doing this are making sure the sample size is large and making sure it is randomly selected.

A large sample is more likely to be representative of a population than a small one. Think of extreme cases. If we want to know the average height of the population and we select just one person and measure their height, it is unlikely to be close to the population average. If we took 1,000,000 people, measured their heights and took the average, this figure would be likely to be close to the population average.
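A rough simulation shows the same thing. The population below is hypothetical (100,000 heights drawn from a normal distribution with an assumed mean of 170 cm and standard deviation of 10 cm), and the function name typical_error is mine, not a standard one.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 100,000 heights in cm
# (mean 170 and standard deviation 10 are assumed, not real data).
population = [rng.gauss(170, 10) for _ in range(100_000)]
population_mean = statistics.mean(population)

def typical_error(sample_size, repeats=200):
    """Average absolute gap between the sample mean and the population mean."""
    gaps = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)
        gaps.append(abs(statistics.mean(sample) - population_mean))
    return statistics.mean(gaps)

for n in (1, 10, 100, 1000):
    print(f"n = {n:>4}: typical error = {typical_error(n):.2f} cm")
```

The typical error should shrink as the sample size grows, which is the point made above.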

We will be looking at the effects of different sample sizes in more depth later on in the course so I will not go into it further now.

Random allocation

The type of sample most likely to be representative is a random sample. A random sample is one where each member of the population has an equal chance of being selected for the sample.

Random selection is best for two reasons - it eliminates bias and statistical theory is based on the idea of random sampling.

We can randomly allocate patients by a number of methods: using tables of random numbers, a computerised random number generator, sealed envelopes etc.

Selecting a sample randomly from a population is very important because:

It tends to eliminate unknown sources of bias

If we are aware of a potential source of bias in a study comparing treatments we might decide to select carefully to avoid it. In all studies on humans there will inevitably be unknown factors affecting the outcome we are looking at. If we have selected our subjects randomly then even unknown factors will tend to even out between treatment groups.

The researcher's preconceptions and biases do not influence the choice of subjects

If a researcher selects subjects in any systematic way there will always be the suspicion that subjects have been allocated to groups in a way that affects the results. Perhaps the researcher puts the subjects who are more ill onto the treatment they think will be most effective. This can happen quite unconsciously and will certainly affect the results. A similar effect might arise if you ask patients to volunteer for a new treatment. These patients are unlikely to be representative of all patients.

Statistical theory is based on the idea of random sampling.

All the statistical methods you will be using to analyse your results are based on the premise that your sample was randomly selected. If it wasn't randomly selected then the answers you get could be very wrong.

A random sample is one where each member of the population has an equal chance of being selected for the sample. We can randomly allocate patients by a number of methods; the most common are tables of random numbers and computerised random number generators.

If I were conducting a study looking at two treatments, A and B, then one way I could allocate patients to treatment groups would be by using a table of random numbers. The following set of random numbers came from a popular statistics textbook (most statistics textbooks have them):

65246356854282020026

I could allocate patients to treatment A if the number were odd and B if it were even. This would result in successive patients being allocated in the sequence:

BABBBAABBABBBBBBBBBB

Randomly selected numbers often seem to have patterns in them, like long runs of the same number. This is not a problem if we are conducting a large study, because everything evens out over time. If the above study had stopped after recruiting 20 patients then we would have had four patients on treatment A and sixteen on B. This would not be a very good basis for comparing the two treatments.
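The odd/even rule and the resulting imbalance can be checked with a few lines of code. This is just a sketch of the rule described above, applied to the same twenty digits.

```python
from collections import Counter

digits = "65246356854282020026"  # the random digits quoted above

# Odd digit -> treatment A, even digit -> treatment B.
allocation = "".join("A" if int(d) % 2 else "B" for d in digits)
counts = Counter(allocation)

print(allocation)                # BABBBAABBABBBBBBBBBB
print(counts["A"], counts["B"])  # 4 16
```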

Block randomisation

In small studies we want to avoid the severe imbalance in the numbers going into each group that may result from purely random allocation of subjects to groups.

One method that can keep groups more or less the same size but doesn't depart too far from perfect randomisation is block randomisation. Using this method we recruit in blocks of a certain size, say four. We ensure that in each block of four two patients get treatment A and two get treatment B, but within each block of four the allocation is random. So the first block may be ABAB, the second ABBA, the third BBAA and so on. In this way, no matter at what point the study stops, one group will never exceed the other by more than two patients (half the block size).
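A minimal sketch of block randomisation follows. The function name block_randomise and its parameters are my own choices for illustration, and a real trial would normally use dedicated randomisation software rather than a script like this.

```python
import random

def block_randomise(n_patients, block_size=4, treatments=("A", "B"), seed=None):
    """Return a list of allocations built from shuffled, balanced blocks."""
    if block_size % len(treatments) != 0:
        raise ValueError("block_size must be a multiple of the number of treatments")
    rng = random.Random(seed)
    per_arm = block_size // len(treatments)
    allocations = []
    while len(allocations) < n_patients:
        block = list(treatments) * per_arm  # equal numbers of each treatment
        rng.shuffle(block)                  # random order within the block
        allocations.extend(block)
    return allocations[:n_patients]         # the study may stop mid-block

sequence = block_randomise(20, block_size=4, seed=3)
print("".join(sequence))
```

With 20 patients and blocks of four, the two groups end up with exactly ten patients each, and at no point during recruitment do they differ by more than two.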


Stratified randomisation

A further problem in small studies is that we may be aware of a potential confounding variable. For example, if we are looking at the effect of two different toothbrushes on plaque formation we also know that smoking has a great effect on plaque. In a large study this does not matter - the number of smokers using each toothbrush will end up more or less the same. In a small study we may well see an imbalance in the number of smokers great enough to influence (confound) the result.

One way round this is stratified randomisation. We have separate randomisation sequences for smokers and non-smokers. (Both these sequences should be block randomised, otherwise we lose the benefit of stratification - the groups could still end up unbalanced.) It is possible to stratify on more than one variable; for example, we could stratify on smoking and use of floss in the above study. We would then have four sets of randomisation to perform (smoking flossers, non-smoking flossers, smoking non-flossers and non-smoking non-flossers). Increasing the number of stratification variables quickly becomes unworkable in practice as we double (at least) the number of groups with every additional stratifying variable.
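Stratified randomisation with block-randomised sequences might be sketched as below. The function names, the patient list and the block size of four are all assumptions made for the example; only one stratifying variable (smoking) is used.

```python
import random

def balanced_block(rng, block_size=4):
    """One shuffled block with equal numbers of A and B."""
    block = ["A", "B"] * (block_size // 2)
    rng.shuffle(block)
    return block

def stratified_allocate(patients, block_size=4, seed=None):
    """Allocate (patient_id, stratum) pairs using a separate
    block-randomised sequence for each stratum.
    Returns a dict mapping patient_id to treatment."""
    rng = random.Random(seed)
    pending = {}  # stratum -> unused allocations from its current block
    result = {}
    for patient_id, stratum in patients:
        if not pending.get(stratum):  # start a new block when one runs out
            pending[stratum] = balanced_block(rng, block_size)
        result[patient_id] = pending[stratum].pop()
    return result

# Hypothetical recruitment order: (patient id, smoking status).
patients = [(1, "smoker"), (2, "non-smoker"), (3, "non-smoker"), (4, "smoker"),
            (5, "smoker"), (6, "non-smoker"), (7, "smoker"), (8, "non-smoker")]
allocation = stratified_allocate(patients, seed=7)
for patient_id, stratum in patients:
    print(patient_id, stratum, allocation[patient_id])
```

Because each stratum has its own blocks, the four smokers are split two and two between the treatments, and so are the four non-smokers, whatever order the patients arrive in.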

Blinding

Ideally neither the subject nor the researcher should know which treatment the subject is receiving - a double-blind study. If this can't be achieved then at least one of them, either the subject or the investigator, should not know which treatment is being given - a single-blind study.

Matching

It is sometimes beneficial to match subjects in the two groups on major characteristics (other than the one being investigated!) which may affect (confound) the outcome.