Introduction
A major characteristic of a Delphi study is the asking of questions. The
questioning is iterative, and is intended to find a consensus, but it is still
the asking of questions. Devising questions is a more complex task than it at
first sight appears. There are many pitfalls for the unwary, and there is
extensive research addressing these methodological pitfalls,
from the very early exploratory days (Thorndike 1920) to more
modern large-scale question testing (Schuman and Presser 1996).
Much of this research is widely cited: sometimes in textbooks on
research methods or questionnaire design (Festinger and Katz 1965,
Oppenheim 1966), sometimes in experimental studies or meta-analyses
(Kasten and Weintraub 1999, Hoyt and Kerns 1999,
Engelhard and Stone 1998, Wilson et al 1993). As asking questions is
a major part of any Delphi study, such work is relevant to Delphi
research, and in this paper we concentrate on the design of
questions and potential answers.
Given what we have known for many years about the vagaries of
language use (Belson 1981 and 1986), question ordering (Schuman and
Presser 1996), and question answering (Thorndike 1920, Kahneman 1982,
Slovic and Tversky 1982, inter alia), it is important that detailed
consideration is given to such issues, and that this consideration is
reported. Question design is
NURSE RESEARCHER VOLUME 8 NUMBER 4
Iteration or Repetition
Repetition of an identical question in consecutive rounds in an attempt to
generate consensus is not always the correct thing to do. In a
study on childhood cancers, for example, there were clearly two
dimensions to the problem. One of these posed the question: how
many families would benefit if a particular policy were implemented?
This is a clear question, and it is worth asking in its own right. If there
is consensus that all or most families would benefit from a certain
policy, there is probably merit in pursuing that policy. We called this
the breadth question. However, there is a second dimension which
needs to be teased out: the answer to the question, how much
would each family benefit if policy X were introduced? This we
called the depth question. Cross-tabulating these two dimensions
clarifies the policy decision which the study is supposed to inform. We
present below possible sets of results in the form of a 2 x 2 table.
Table 1: Envisaging multiple dimensions in a policy-oriented Delphi study

                              Number of beneficiaries
                              High            Low
Degree of benefit   High
                    Low
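The cross-tabulation described above can be sketched in a few lines of code. This is a minimal illustration only, using hypothetical panel judgements and Python's standard library; the ratings and their layout are assumptions, not data from the study.

```python
from collections import Counter

# Hypothetical panel judgements for one policy: each tuple is
# (number of beneficiaries, degree of benefit), each rated High or Low.
judgements = [
    ("High", "High"), ("High", "Low"), ("High", "High"),
    ("Low", "Low"), ("High", "High"), ("Low", "High"),
]

# Cross-tabulate the two dimensions into the 2 x 2 table.
table = Counter(judgements)

print("                    Degree of benefit")
print("Beneficiaries       High    Low")
for breadth in ("High", "Low"):
    row = [table[(breadth, depth)] for depth in ("High", "Low")]
    print(f"{breadth:<18} {row[0]:>5} {row[1]:>6}")
```

Reading the table cell by cell then makes the policy implication explicit: a policy rated High on both dimensions is a clear candidate for pursuit.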
1 2 3 - 4 - 5 6 7      Negative-middling-positive
1 2 - 3 4 5 - 6 7      Unenthusiastic-positive-enthusiastic
and probably many more. At the stage of analysis, one can obtain
subtle differences in judgements between statements and between
individual judges. An often unacknowledged weakness of short scales is
that any measurement of changes in judgement (e.g. before and after
some event, or between rounds of a Delphi study) leaves very little space
for respondents to move into. Each step between scale points is large, and
more psychological effort is needed to make it. Thus, even the apparently
simple task of choosing the number of points on the scale needs thought.
Labelling the points is also not without its problems. We usually try to
turn the end-aversion phenomenon into a virtue. We regularly label the
extreme points on the scale (1 and 7) with strong labels such as
'Extremely strongly agree', 'In all cases', 'Never', 'Literally under no
circumstances', and the like. The reasoning behind this is that respondents
are unlikely to choose such labels, but when they do make such choices,
they really mean what they say, and we can thus have more confidence in
the extreme judgements made, whether positive or negative.
We also recommend leaving some points unlabelled, on the grounds
that respondents may genuinely waver between labelled points: they
may not be able to agree exactly with our labels, and we thus
leave room for intermediate judgements. Commonly points two, four
and six will be unlabelled, although there is frequently good reason to
label point four, because of the variety of ways in which people may
make a middling judgement ('I don't know', 'It varies from case to
case', 'It varies from setting to setting').
Whatever answering scheme one adopts, it should be tested. There is
no point in writing down a scale with labels and hoping that people
will understand the words in exactly the same way that you do. Any
Likert-type scale should be piloted on a population similar to the
final panel. This should include ensuring that the words are
comprehensible and apparently have the meaning which was intended.
In drawing up statements for Thurstone scales (in which a panel judges
the emotional strength of each statement), we have found that one
obtains an acceptable level of consensus on only about 10 per cent of
the statements. In one study, we had to start with 252 statements to end
up with 22 on which there was adequate agreement on the emotional
tone which they exhibited (Moseley et al 1998).
Equally important is to estimate the response frequency, i.e. to see
how many people check each value. If no one uses points 1 to 4, then
you have effectively only a three-point scale, and all the advantages of
discrimination are lost. This is a particular problem if they cluster at
one might ask people how much time, how much money, how many staff, or
whatever, are required for each policy. One then takes these, and the
researcher, rather than the respondent, makes the comparison. We used
this method in a study of the importance of various aspects of the
UKCC Scope of Professional Practice document (UKCC 1992).
However, it worked only because we had developed a computer
program which allowed the respondent to input the number of hours
(or whatever) devoted to a given activity, which checked the data (e.g.
that they were not working hundreds of hours per week), let them
modify their input, and then finally displayed the rankings implicit in
their absolute data. Despite the obvious utility of this method, and the
richness of the data which it produces, on training, economic and time
grounds it is probably not practicable for most Delphi studies, and
cannot be anonymous. However, its existence does show that there are
sound ways of conducting a Delphi study which are not conventional,
and that it is important to consider all alternatives.
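The idea behind the program described above (validate the respondent's absolute figures, then show them the ranking those figures imply) can be sketched briefly. The original program is not published, so every name and threshold below is an assumption made purely for illustration.

```python
# Illustrative sketch only: the program described in the text is not
# published, so the function names and the plausibility limit are assumed.

MAX_HOURS_PER_WEEK = 100  # reject implausible weekly totals (assumed threshold)

def check_hours(hours_by_activity):
    """Return a list of validation problems (empty if the data look sane)."""
    problems = []
    for activity, hours in hours_by_activity.items():
        if hours < 0:
            problems.append(f"{activity}: negative hours")
    if sum(hours_by_activity.values()) > MAX_HOURS_PER_WEEK:
        problems.append("total exceeds a plausible working week")
    return problems

def implied_ranking(hours_by_activity):
    """Rank activities by the absolute hours devoted to them (1 = most)."""
    ordered = sorted(hours_by_activity, key=hours_by_activity.get, reverse=True)
    return {activity: rank for rank, activity in enumerate(ordered, start=1)}

# Example: one respondent's (hypothetical) weekly hours.
hours = {"direct care": 20, "paperwork": 12, "training": 4}
assert check_hours(hours) == []   # data pass the plausibility check
print(implied_ranking(hours))     # {'direct care': 1, 'paperwork': 2, 'training': 3}
```

The point of the design is that the respondent supplies only absolute quantities; the ranking is derived by the researcher's software, not asked for directly.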
all to 'The worst pain that I could imagine'. Note that there are no
intermediate labels. In a Delphi study, they could be labelled, say, from 'Of
no use at all' up to 'Absolutely essential'. These scales have several
advantages:
- they make minimal use of words, thus overcoming problems of
linguistic interpretation
- they are simple to complete (one merely puts a pencil mark on the
line)
- they can be read and interpreted automatically by a scanning device,
thus avoiding the errors of human data input
- they produce data which are at the ratio level of measurement, and
which are therefore suitable for a wider range of statistical
manipulations than are safely usable with ordinal data
- they give both order and distance
- in the pain domain at least, there is considerable evidence that they
are reliable and valid (Coll 2000).
Visual analogue scales have potential advantages, and we mention
them here in the hope that other researchers will experiment with them
as part of their own Delphi studies.
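Scoring a visual analogue scale is simple arithmetic: the distance of the respondent's mark from the left-hand anchor, divided by the length of the printed line, gives a ratio-level score. A minimal sketch, with hypothetical measurements in millimetres (the function name and the 100 mm line are assumptions for illustration):

```python
def vas_score(mark_mm, line_length_mm=100.0):
    """Convert a pencil-mark position on a visual analogue scale
    to a 0-100 ratio-level score (hypothetical helper)."""
    if not 0 <= mark_mm <= line_length_mm:
        raise ValueError("mark lies outside the printed line")
    return 100.0 * mark_mm / line_length_mm

# Hypothetical marks (mm from the left anchor) for three respondents.
marks = [12.5, 50.0, 88.0]
scores = [vas_score(m) for m in marks]
print(scores)  # [12.5, 50.0, 88.0] on a 100 mm line
```

Because the scores are ratio-level, differences and ratios between respondents (or between rounds) are meaningful in a way that ordinal scale points are not.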
Summary
The Delphi approach has been widely used in a variety of fields. It has
strengths in overcoming many of the social and psychological
problems associated with opinion and attitude measurement. It can be
seen at first sight as a single technique, with a fixed method which all
future researchers could follow. However, even at the simplest level, it
is really an approach rather than a method. It has a general shape
(Generate statements, Formulate a question (or questions), Undertake
round 1, Analyse the responses, Undertake round 2, and so on).
However, what one does at each stage can legitimately vary.
- there is no reason why one must restrict oneself to the panel
members as a source of the statements in the first place. Other
sources are possible and legitimate, although we would regard these
as additional to, rather than in place of, round 1
- the researcher does not have to have only one question about one
References

Baddeley A (1994) The magical number 7: still magic after all these years. Psychological Review, 101, 2, 353-356.

Belson WA (1981) The Design and Understanding of Survey Questions. Aldershot, Gower.

Belson WA (1986) Validity in Survey Research. London, Gower.

Butterworth T (1991) Nursing in Europe: A Delphi Survey. Manchester, Dept of Nursing, University of Manchester.

Butterworth T, Bishop V (1995) Identifying the characteristics of optimum practice: findings from a survey of practice experts in nursing, midwifery and health visiting. Journal of Advanced Nursing, 22, 24-32.

Coll AM (2000) Quality of Life following Day Surgery. Unpublished PhD thesis. Glamorgan, University of Glamorgan.

Engelhard G, Stone GE (1998) Evaluating the quality of ratings from standard-setting judges. Educational and Psychological Measurement, 58, 2, 176-196.

Festinger L, Katz D (1965) Research Methods in the Behavioral Sciences. New York, Holt, Rinehart and Winston.

Hoyt WT, Kerns MD (1999) Magnitude and moderators of bias in observer ratings: a meta-analysis. Psychological Methods, 4, 4, 403-424.

Kahneman D et al (1982) Judgement under Uncertainty: Heuristics and Biases. Cambridge, Cambridge University Press, Part IV.

Kasten R, Weintraub Z (1999) Rating errors and rating accuracy: a field experiment. Human Performance, 1, 2, 137-153.

Mead DM (1993) The Development of Primary Nursing in NHS Care Giving Institutions in Wales. Unpublished PhD thesis. University of Wales.

Mead DM, Moseley LG (1994) Automating ward feedback: a tentative first step. Journal of Clinical Nursing, 3, 347-354.

Miller G (1956) The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63, 2, 81-87.

Moseley LG et al (1997) Can feedback be individualised, useful, and economical? International Journal of Nursing Studies, 34, 4, 285-294.

Moseley LG et al (1998) Experience of, Knowledge of, and Opinions about, Computerised Decision Support Systems among Health Care Clinicians in Wales. Report No C/96/1/029 to the Wales Office of Research and Development for Health and Social Care.

Oppenheim AN (1966) Questionnaire Design and Attitude Measurement. London, Heinemann.