Experimental design: comments on comparison
Comparability issues in the Bodgett and Scarper experiment
In the experiment conducted by Bodgett and Scarper, the end-users of their
software were asked to assign ratings to the system. They used a numerical
scale in an attempt to give some credibility to the results (we can average '9'
and '10' but we can't average 'good' and 'excellent'). The main problem,
however, is that it is very difficult to interpret the numerical results
obtained. What does an 'ease of use' rating of '8 out of 10' mean? We have no
way of judging what rating would have been obtained by a comparable system (if
there is one). The experimenters can reduce this criticism to a degree by
setting questions that explicitly ask for a comparison against other systems of
which the experimental subjects may be aware. Even then, they run the risk that
the subjects may not remember the other systems well enough to comment (this is
a source of bias), or may give misleading answers in an (unconscious) attempt
to please the experimenter.
The experimenters could instead have explicitly required their subjects to
carry out equivalent tasks using a number of different systems. If the
subjects still rated the new system more highly than the others, that result
would be more credible than the original one. Moreover, we may be able to
assign a numerical value to the degree by which the systems differ.
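
As a rough illustration of how such a numerical value might be obtained, the
sketch below computes the mean per-subject difference in rating between two
systems, together with its standard error. The function name
paired_difference and the ratings are hypothetical and invented purely for
illustration; they are not data from the Bodgett and Scarper experiment. The
sketch also assumes that every subject rated both systems on the same scale.

```python
from math import sqrt
from statistics import mean, stdev


def paired_difference(ratings_new, ratings_other):
    """Return (mean difference, standard error) of per-subject ratings.

    Position i in both lists must hold the ratings given by the same
    subject, so that each difference is taken within one person.
    """
    if len(ratings_new) != len(ratings_other):
        raise ValueError("each subject must rate both systems")
    if len(ratings_new) < 2:
        raise ValueError("need at least two subjects")
    diffs = [a - b for a, b in zip(ratings_new, ratings_other)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


# Made-up ratings out of 10 from six hypothetical subjects.
new_system = [8, 9, 7, 8, 9, 6]
other_system = [6, 7, 7, 5, 8, 6]

diff, std_err = paired_difference(new_system, other_system)
print(f"mean difference = {diff:.2f}, standard error = {std_err:.2f}")
```

A mean difference that is large relative to its standard error suggests that
the preference is not just noise, although with so few subjects any such
conclusion should be treated with caution.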
Because the experimental subjects are human, and therefore sources of unwanted
variability, the experimenters should attempt to minimize this variability by
comparing the impressions of the same subjects with the different systems at
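different times. In this case we have to be careful of 'learning' effects (the
subjects perform better over time because they improve at the task). There may
be a good case for a crossover experiment here, in which different groups of
subjects use the systems in different orders, so that learning does not
systematically benefit whichever system is used last.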
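
A minimal sketch of a two-system, two-period crossover follows. It assumes
two systems, labelled A and B, and assumes that any learning effect is the
same whichever system is used first (that is, there is no carryover specific
to one system). The names assign_sequences and crossover_effect are
hypothetical, and the scores in the usage example are invented. Half the
subjects use A then B, and the other half use B then A. Taking the
period-one-minus-period-two difference in each group and halving the gap
between the two group means cancels the learning effect.

```python
import random
from statistics import mean


def assign_sequences(subjects, seed=0):
    """Randomly split subjects into the two orders 'AB' and 'BA'."""
    shuffled = list(subjects)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return {"AB": shuffled[:half], "BA": shuffled[half:]}


def crossover_effect(ab_scores, ba_scores):
    """Estimate the A-minus-B effect from (period 1, period 2) score pairs.

    ab_scores: pairs from subjects who used A first, then B.
    ba_scores: pairs from subjects who used B first, then A.
    """
    d_ab = mean([p1 - p2 for p1, p2 in ab_scores])  # (A - B) + (period 1 - period 2)
    d_ba = mean([p1 - p2 for p1, p2 in ba_scores])  # (B - A) + (period 1 - period 2)
    return (d_ab - d_ba) / 2  # the period (learning) term cancels


groups = assign_sequences(["s1", "s2", "s3", "s4"])

# Invented scores: A is worth 2 points more than B, and every subject
# gains 1 point in the second period through learning.
effect = crossover_effect(
    ab_scores=[(7, 6), (5, 4)],
    ba_scores=[(5, 8), (4, 7)],
)
print(groups, effect)
```

In the invented data the naive within-subject comparison would be distorted by
the learning gain, whereas the crossover estimate recovers the 2-point
difference that was built in.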