Experimental design: intergroup variability example
Suppose I want to determine whether my new Java compiler produces programs that
execute more quickly than those produced by earlier compilers. This seems very
straightforward: compile the same program using both compilers, then execute both
versions while timing them. That appears to be all there is to it, but there
are two problems.
- It tells us nothing about performance on programs other than the test
program. This is our old friend generalization again. A better approach might
be to run a number of test programs with different functions, and add together
their execution times.
- In practice, the same program run on the same computer will not always take
the same time to execute. This is because modern operating systems are
multitasking, and it is difficult to predict what else the computer might be
doing while running my test programs.
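The second problem is easy to observe directly. The following is a minimal sketch, not a rigorous benchmark: `time_runs` and `example_task` are hypothetical names, and `example_task` is only a stand-in for executing one compiled test program.

```python
import time


def time_runs(task, repetitions=5):
    """Return the wall-clock duration, in seconds, of each call to task()."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def example_task():
    # Hypothetical workload standing in for one compiled test program.
    return sum(i * i for i in range(100_000))


# The same task on the same machine will normally give slightly
# different durations on each repetition.
for duration in time_runs(example_task):
    print(f"{duration:.6f} s")
```

The printed durations will usually differ from run to run, which is exactly the within-group variability discussed here.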
As an illustration, suppose I run my set of test programs with the two
compilers, and get times of 10 seconds and 20 seconds for the two systems. Then
I rerun the first compiler's version three times and get timings of 10.1, 10.2
and 9.7 seconds. The repeated runs differ by at most 0.5 seconds, while the two
compilers differ by 10 seconds, so it seems very clear that the two compilers
have very different performances.
But suppose that we are looking at small refinements to a compiler, perhaps
giving a speed improvement of about five percent, or roughly 0.5 seconds on a
10-second run. Now the within-group variability is quite similar to the
intergroup variability, and it is very difficult to determine whether one
compiler is really faster than the other.
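One rough way to put numbers on this is to express the difference between the two compilers in units of the run-to-run spread. The sketch below uses the three timings quoted above. The 9.5-second figure is an assumed value representing a five percent improvement, and `difference_in_spreads` is a hypothetical helper, not a standard statistical test.

```python
import statistics

# Repeated timings for the first compiler, in seconds (from the text).
first_compiler_runs = [10.1, 10.2, 9.7]

# Sample standard deviation as a simple measure of within-group variability.
within_spread = statistics.stdev(first_compiler_runs)


def difference_in_spreads(mean_a, mean_b, spread):
    """Express the gap between two mean times in units of the spread."""
    return abs(mean_a - mean_b) / spread


# 10 s versus 20 s: the gap is many times larger than the spread.
large = difference_in_spreads(10.0, 20.0, within_spread)

# 10 s versus an assumed 9.5 s (five percent faster): the gap is
# comparable to the spread itself.
small = difference_in_spreads(10.0, 9.5, within_spread)

print(f"spread: {within_spread:.2f} s")
print(f"10 s vs 20 s:  {large:.1f} spreads apart")
print(f"10 s vs 9.5 s: {small:.1f} spreads apart")
```

With only three runs the estimate of the spread is itself very uncertain, so this is an illustration of the idea rather than a basis for a conclusion.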
In order to get back to the situation where the intergroup variability is
greater than the within-group variability, we could proceed in one or both of
these ways.
- Select a different set of test programs that is likely to
enhance the differences between the two compilers.
- Reduce the effect of within-group variability by increasing the number of
repetitions of the tests and comparing the average times. This is an example of
increasing the sample size.
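The benefit of repetition can be sketched with the standard error of the mean, which is the spread divided by the square root of the number of runs. The example below assumes that the runs are independent and that the spread estimated from the three timings given earlier stays the same as more runs are added. `standard_error` is a hypothetical helper name.

```python
import math
import statistics

# Repeated timings for one compiler, in seconds (from the text).
timings = [10.1, 10.2, 9.7]
spread = statistics.stdev(timings)


def standard_error(spread, repetitions):
    """Uncertainty of the mean of `repetitions` independent runs."""
    return spread / math.sqrt(repetitions)


# The uncertainty of the average time shrinks as the repetitions increase.
for n in (3, 10, 30, 100):
    print(f"{n:>3} runs: standard error about {standard_error(spread, n):.3f} s")
```

Because of the square root, halving the uncertainty takes about four times as many runs, so repetition alone becomes expensive when the difference being sought is small.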
In experiments involving human beings, a large source of within-group
variability is the natural variation between different human subjects. If your
test groups comprise people of different ages, genders, ethnic backgrounds,
etc., then you can expect a great deal of variability in everything you
measure. On the other hand, if you don't have a mixture of these properties,
you can expect bias instead.
The textbook solution to this problem is to make comparisons by testing two
different things on the same group of human subjects. This largely eliminates
inter-subject variability, but since people can't do two things at the same
time, one test must come before the other, and this can introduce a time bias
instead. The textbook solution to this second problem is to use two groups of
people, who take the tests in opposite orders. This is called a crossover
experiment. Crossover experiments have been the standard way of conducting
comparison experiments using human subjects for about 50 years. However, a
source of error called the 'carryover effect', in which the influence of the
first test persists into the second, has led to a re-evaluation of the
usefulness of this technique in the last few years. These are complex and
subtle issues, and anyone considering carrying out a substantial crossover
experiment needs to consult a competent statistician (or become one!).
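The assignment step of a simple two-treatment crossover experiment can be sketched as follows. This is only an illustration of splitting subjects at random into the two orders. The names `assign_crossover`, "A" and "B" are hypothetical, and the sketch does nothing to detect or correct carryover effects.

```python
import random


def assign_crossover(subjects, seed=None):
    """Randomly split subjects into two groups that take the tests in
    opposite orders: ("A", "B") for one group, ("B", "A") for the other."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    schedule = {}
    for subject in shuffled[:half]:
        schedule[subject] = ("A", "B")
    for subject in shuffled[half:]:
        schedule[subject] = ("B", "A")
    return schedule


# Hypothetical subject identifiers.
schedule = assign_crossover(["s1", "s2", "s3", "s4", "s5", "s6"], seed=1)
for subject, order in sorted(schedule.items()):
    print(subject, "->", " then ".join(order))
```

Random assignment to the two orders means that any effect of taking a test first or second falls, on average, equally on both of the things being compared.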