The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements
Purpose. This article examines and illustrates the use and interpretation of the kappa statistic in musculoskeletal research. Summary of Key Points. The reliability of clinicians' ratings is an important consideration in areas such as diagnosis and the interpretation of examination findings. Often, these ratings lie on a nominal or an ordinal scale. For such data, the kappa coefficient is an appropriate measure of reliability. Kappa is defined, in both weighted and unweighted forms, and its use is illustrated with examples from musculoskeletal research. Factors that can influence the magnitude of kappa (prevalence, bias, and nonindependent ratings) are discussed, and ways of evaluating the magnitude of an obtained kappa are considered. The issue of statistical testing of kappa is considered, including the use of confidence intervals, and appropriate sample sizes for reliability studies using kappa are tabulated. Conclusions. The article concludes with recommendations for the use and interpretation of kappa. [Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther. 2005;85:257-268.]
In musculoskeletal practice and research, there is frequently a need to determine the reliability of measurements made by clinicians; reliability here means the extent to which clinicians agree in their ratings, not merely the extent to which their ratings are associated or correlated. Defined as such, 2 types of reliability exist: (1) agreement between ratings made by 2 or more clinicians (interrater reliability) and (2) agreement between ratings made by the same clinician on 2 or more occasions (intrarater reliability).

In some cases, the ratings in question are on a continuous scale, such as joint range of motion or distance walked in 6 minutes. In other instances, however, clinicians' judgments are in relation to discrete categories. These categories may be nominal (eg, "present," "absent") or ordinal (eg, "mild," "moderate," "severe"); in each case, the categories are mutually exclusive and collectively exhaustive, so that each case falls into one, and only one, category. A number of recent studies have used such data to examine interrater or intrarater reliability in relation to: clinical diagnoses or classifications,1-6 assessment findings,7-9 and radiographic signs.10-12 These data require specific statistical methods to assess reliability, and the kappa (κ) statistic is commonly used for this purpose. This article will define and illustrate the kappa coefficient and will examine some potentially problematic issues connected with its use and interpretation. Sample size requirements, which previously were not readily available in the literature, also are provided.

Nature and Purpose of the Kappa Statistic
A common example of a situation in which a researcher may want to assess agreement on a nominal scale is to determine the presence or absence of some disease or condition. This agreement could be determined in situations in which 2 researchers or clinicians have used the same examination tool or different tools to determine the diagnosis. One way of gauging the agreement between 2 clinicians is to calculate overall percentage of agreement (calculated over all paired ratings) or effective percentage of agreement (calculated over those paired ratings where at least one clinician diagnoses presence of the disease). Although these calculations provide a measure of agreement, neither takes into account the agreement that would be expected purely by chance. If clinicians agree purely by chance, they are not really "agreeing" at all; only agreement beyond that expected by chance can be considered "true" agreement. Kappa is such a measure of "true" agreement. It indicates the proportion of agreement beyond that expected by chance, that is, the achieved beyond-chance agreement as a proportion of the possible beyond-chance agreement. It takes the form:

κ = (observed agreement − chance agreement) / (1 − chance agreement)   (1)

In terms of symbols, this is:

κ = (P_o − P_c) / (1 − P_c)   (2)

where P_o is the proportion of observed agreements and P_c is the proportion of agreements expected by chance.

The simplest use of kappa is for the situation in which 2 clinicians each provide a single rating of the same patient, or where a clinician provides 2 ratings of the same patient, representing interrater and intrarater reliability, respectively. Kappa also can be adapted for more than one rating per patient from each of 2 clinicians, or for situations where more than 2 clinicians rate each patient or where each clinician may not rate every patient.18 In this article, however, our focus will be on the simple situation where 2 raters give an independent single rating for each patient or where a single rater provides 2 ratings for each patient. Here, the concern is with how well these ratings agree, not with their relationship with some "gold standard" or "true" diagnosis.

The data for paired ratings on a 2-category nominal scale are usually displayed in a 2 × 2 contingency table, with the notation indicated in Table 1. This table shows data from 2 clinicians who assessed 39 patients in relation to the relevance of lateral shift, according to the McKenzie method of low back pain assessment. Cells a and d indicate, respectively, the numbers of patients for whom both clinicians agree on the relevance or nonrelevance of lateral shift. Cells b and c indicate the numbers of patients for whom the 2 clinicians disagree.
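Kappa as defined in Equations 1 and 2 can be computed either from the agreement proportions or from the four cell counts of a 2 × 2 table. The sketch below is illustrative (it is not code from the article); the construction of P_c from the marginal totals is the conventional one, and the proportions in the example are those reported for the Table 1 data.

```python
def kappa_from_proportions(p_o, p_c):
    """Equation 2: kappa from the observed (p_o) and chance (p_c) agreement proportions."""
    return (p_o - p_c) / (1 - p_c)

def kappa_2x2(a, b, c, d):
    """Kappa from the cells of a 2 x 2 table (Table 1 notation), with chance
    agreement obtained in the conventional way from the marginal totals."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return kappa_from_proportions(p_o, p_c)

# Proportions reported for the Table 1 data (relevance of lateral shift, 39 patients)
print(round(kappa_from_proportions(0.8462, 0.5385), 2))  # 0.67
```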
J Sim, PhD, is Professor, Primary Care Sciences Research Centre, Keele University, Keele, Staffordshire ST5 5BG, United Kingdom (j.sim@keele.ac.uk). Address all correspondence to Dr Sim.
CC Wright, BSc, is Principal Lecturer, School of Health and Social Sciences, Coventry University, Coventry, United Kingdom.
258 . Sim and Wright Physical Therapy . Volume 85 . Number 3 . March 2005
Table 1.
Diagnostic Assessments of Relevance of Lateral Shift by 2 Clinicians, From Kilpikoski et al9 (κ = .67)

For the data in Table 1, kappa is:

κ = (.8462 − .5385) / (1 − .5385) = .67   (5)
[Figure. Components of agreement between 2 raters: A. Observed agreement; B. Chance agreement; C. Achieved agreement beyond chance; D. Potential agreement beyond chance.]

…agreements are of only a single category, the quadratic weighted kappa (.67) is higher than the unweighted kappa (.55). Different weighting schemes will produce different values of weighted kappa on the same data; for example, linear weighting gives a kappa of .61 for the data in Table 2.
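Weighted kappa under different schemes can be sketched as follows. The 3 × 3 table of ratings here is hypothetical (Tables 2 and 3 are not reproduced above), and the agreement weights w = 1 − (|i − j|/(k − 1))^q, giving linear weighting for q = 1 and quadratic weighting for q = 2, are one common choice rather than necessarily the scheme used in the article.

```python
def weighted_kappa(table, q=1):
    """Weighted kappa with agreement weights w_ij = 1 - (|i - j| / (k - 1)) ** q.
    q=1 gives linear weights, q=2 quadratic weights."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    w = [[1 - (abs(i - j) / (k - 1)) ** q for j in range(k)] for i in range(k)]
    p_o = sum(w[i][j] * table[i][j] for i in range(k) for j in range(k)) / n
    p_c = sum(w[i][j] * row_tot[i] * col_tot[j] for i in range(k) for j in range(k)) / n ** 2
    return (p_o - p_c) / (1 - p_c)

ratings = [[20, 5, 1],   # hypothetical paired ratings on a 3-category ordinal scale
           [4, 15, 3],
           [1, 2, 9]]
print(round(weighted_kappa(ratings, q=1), 2))  # 0.63 (linear)
print(round(weighted_kappa(ratings, q=2), 2))  # 0.69 (quadratic)
```

Because most disagreements in this hypothetical table are only a single category apart, quadratic weighting yields the higher value, mirroring the pattern described in the text.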
Table 4.
(A) Assessment of the Presence of Lateral Shift, From Kilpikoski et al9 (κ = .18); (B) the Same Data Adjusted to Give Equal Agreements in Cells a and d, and Thus a Low Prevalence Index (κ = .54)

(A)                        Clinician 2
                      Present   Absent   Total
Clinician 1  Present  a: 28     b: 3      31
             Absent   c: 6      d: 2       8
Total                 34        5         39

(B)                        Clinician 2
                      Present   Absent   Total
Clinician 1  Present  a: 15     b: 3      18
             Absent   c: 6      d: 15     21
Total                 21        18        39
representing agreement (unity), effectively treating this source of disagreement as an agreement, while leaving unchanged the weights for remaining sources of disagreement. The alteration that this produces in the value of kappa serves to quantify the effect of the identified disagreement on the overall agreement, and all possible sources of disagreement can be compared in this way. Returning to the data in Table 3, if we weight as agreements those instances in which the raters disagreed between derangement and dysfunction syndromes (cells b and d), kappa rises from .46 without weighting to .50 with weighting. If alternatively we apply agreement weighting to disagreements between dysfunctional and postural syndromes (cells f and h), kappa rises more markedly to .55. As the disagreement between dysfunctional and postural syndromes produces the greater increase in kappa, it can be seen to contribute more to the overall disagreement than that between derangement and dysfunctional syndromes. This finding might indicate that differences between postural and dysfunctional syndromes are more difficult to determine than differences between derangement and dysfunctional syndromes. This information might lead to retraining of the raters or rewording of examination protocols.

In theory, kappa can be applied to ordinal categories derived from continuous data. For example, joint ranges of motion, measured in degrees, could be placed into 4 categories: "unrestricted," "slightly restricted," "moderately restricted," and "highly restricted." However, the results from such an analysis will depend largely on the choice of the category limits. As this choice is in many cases arbitrary, the value of kappa produced may have little meaning. Furthermore, this procedure involves needless sacrifice of information in the original scale and will normally give rise to a loss of statistical power; thus, it is preferable to analyze the reliability of data obtained with the original continuous scale using other methods, such as the intraclass correlation coefficient, the standard error of measurement,34 or the bias and limits of agreement.35

Determinants of the Magnitude of Kappa
As previously noted, the magnitude of the kappa coefficient represents the proportion of agreement greater than that expected by chance. The interpretation of the coefficient, however, is not so straightforward, as there are other factors that can influence the magnitude of the coefficient or the interpretation that can be placed on a given magnitude. Among those factors that can influence the magnitude of kappa are prevalence, bias, and nonindependence of ratings.

Prevalence
The kappa coefficient is influenced by the prevalence of the attribute (eg, a disease or clinical sign). For a situation in which raters choose between classifying cases as either positive or negative in respect to such an attribute, a prevalence effect exists when the proportion of agreements on the positive classification differs from that of the negative classification. This can be expressed by the prevalence index. Using the notation from Table 1, this is:

prevalence index = |a − d| / n   (6)

where |a − d| is the absolute value of the difference between the frequencies of these cells (ie, ignoring the sign) and n is the number of paired ratings.

If the prevalence index is high (ie, the prevalence of a positive rating is either very high or very low), chance agreement is also high and kappa is reduced accordingly. This can be shown by considering further data from Kilpikoski et al9 on the presence or absence of lateral shift (Tab. 4). In Table 4A, the prevalence index is high:

prevalence index = |28 − 2| / 39 = .67   (7)

The proportion of chance agreement, therefore, also is relatively high (.72), and the value of kappa is .18. In Table 4B, where the cell frequencies have been adjusted to give a low prevalence index, kappa rises to .54.
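The prevalence effect can be reproduced numerically from the two panels of Table 4: both have the same observed agreement (30/39), but the high prevalence index in panel A depresses kappa. The helper below is an illustrative sketch, not code from the article.

```python
def kappa_and_prevalence(a, b, c, d):
    """Return (kappa, prevalence index) for a 2 x 2 table in Table 1 notation."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_o - p_c) / (1 - p_c)
    prevalence_index = abs(a - d) / n   # Equation 6
    return kappa, prevalence_index

print(kappa_and_prevalence(28, 3, 6, 2))   # Table 4A: kappa about .18, prevalence index about .67
print(kappa_and_prevalence(15, 3, 6, 15))  # Table 4B: kappa about .54, prevalence index 0
```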
Table 5.
(A) Contingency Table Showing Nearly Symmetrical Disagreements in Cells b and c, and Thus a Low Bias Index (κ = .12); (B) Contingency Table With Asymmetrical Disagreements in Cells b and c, and Thus a Higher Bias Index (κ = .20)a

(A)                        Clinician 2
                      Present   Absent   Total
Clinician 1  Present  a: 29     b: 21     50
             Absent   c: 23     d: 27     50
Total                 52        48       100

(B)                        Clinician 2
                      Present   Absent   Total
Clinician 1  Present  a: 29     b: 6      35
             Absent   c: 38     d: 27     65
Total                 67        33       100

a Hypothetical data for diagnoses of spondylolisthesis ("present" or "absent") by 2 clinicians.
Table 6.
(A) Data Reported by Kilpikoski et al9 for Judgments of Directional Preference by 2 Clinicians (κ = .54); (B) Cell Frequencies Adjusted to Minimize Prevalence and Bias Effects, Giving a Prevalence-Adjusted Bias-Adjusted κ of .79

(A)                        Clinician 2
                      Present   Absent   Total
Clinician 1  Present  a: 32     b: 1      33
             Absent   c: 3      d: 3       6
Total                 35        4         39

(B)                        Clinician 2
                      Present   Absent   Total
Clinician 1  Present  a: 18     b: 2      20
             Absent   c: 2      d: 17     19
Total                 20        19        39
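The prevalence-adjusted bias-adjusted kappa (PABAK) for the Table 6 data can be verified with a short sketch. As described in the "Adjusting Kappa" discussion, cells a and d are replaced by their mean and cells b and c by theirs; for a 2 × 2 table this reduces algebraically to 2P_o − 1, because the adjusted marginal totals all equal n/2 and chance agreement becomes .5. The function is illustrative, not code from the article.

```python
def pabak_2x2(a, b, c, d):
    """Prevalence-adjusted bias-adjusted kappa for a 2 x 2 table.
    Replacing a and d with their mean, and b and c with theirs, equalizes
    the marginal totals, so chance agreement is .5 and PABAK = 2 * p_o - 1."""
    n = a + b + c + d
    p_o = (a + d) / n
    return 2 * p_o - 1

# Table 6A (directional preference, kappa = .54)
print(round(pabak_2x2(32, 1, 3, 3), 2))  # 0.79, matching the PABAK reported for Table 6B
```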
ratings are independent. This requires the patients or subjects to be independent (so that any individual can contribute only one paired rating) and ratings to be independent (so that each observer should generate a rating without knowledge, and thus without influence, of the other observer's rating). The fact that ratings are related in the sense of pertaining to the same case, however, does not contravene the assumption of independence.

The kappa coefficient, therefore, is not appropriate for a situation in which one observer is required to either confirm or disconfirm a known previous rating from another observer. In such a situation, agreement on the underlying attribute is contaminated by agreement on the assessment of that attribute, and the magnitude of kappa is liable to be inflated. Equally, as with all measures of intratester reliability, ratings on the first testing may sometimes influence those given on the second occasion, which will threaten the assumption of independence. In this way, apparent agreement may reflect more a recollection of the previous decision than a genuine judgment as to the appropriate classification. In a situation in which the clinician is doubtful as to the appropriate classification, this recollection may sway the decision in favor of agreement rather than disagreement with the previous decision. Thus, "agreements" that represent a decision to classify in the same way will be added to agreements on the actual attribute. This will tend to increase the value of kappa.

Accordingly, studies of either interrater or intrarater reliability should be designed in such a way that ratings are, as far as possible, independent; otherwise, kappa values may be inappropriately inflated. Equally, where a study appears not to have preserved independence between ratings, kappa should be interpreted cautiously. Strictly, there will always be some degree of dependence between ratings in an intrarater study. Various strategies can be used, however, to minimize this dependence. The time interval between repeat ratings is important. If the interval is too short, the rater might remember the previously recorded rating; if the interval is too long, then the attribute under examination might have changed. Streiner and Norman33 stated that an interval of 2 to 14 days is usual, but this will depend on the attribute being measured. Stability of the attribute being rated is crucial to the period between repeated ratings. Thus, trait attributes pose fewer problems for intrarater assessment (because longer periods of time may be left between ratings) than state attributes, which are more labile. Some suggestions to overcome the bias due to memory include: having as long a time period as possible between repeat examinations, blinding raters to their first rating (although this might be easier with numerical data than with diagnostic categories), and different random ordering of patients or subjects on each rating occasion and for each rater.

Adjusting Kappa
Because both prevalence and bias play a part in determining the magnitude of the kappa coefficient, some statisticians have devised adjustments to take account of these influences.36 Kappa can be adjusted for high or low prevalence by computing the average of cells a and d and substituting this value for the actual values in those cells. Similarly, an adjustment for bias is achieved by substituting the mean of cells b and c for those actual cell values. The kappa coefficient that results is referred to as PABAK (prevalence-adjusted bias-adjusted kappa). Table 6A shows data from Kilpikoski et al9 for assessments of directional preference (ie, the direction of movement that reduces or abolishes pain) in patients evaluated according to the McKenzie system; kappa for these data is .54. When the cell frequencies are adjusted to minimize prevalence and bias, this gives the cell values shown in Table 6B, with a PABAK of .79.

Hoehler is critical of the use of PABAK because he believes that the effects of bias and prevalence on the magnitude of kappa are themselves informative and should not be adjusted for and thereby disregarded. Thus, the PABAK could be considered to generate a value for kappa that does not relate to the situation in which the original ratings were made. Table 6B represents very different diagnostic behavior from Table 6A,
as indicated by the change in the marginal totals. Furthermore, all of the frequencies in the cells have changed between Table 6A and Table 6B.

Therefore, the PABAK coefficient on its own is uninformative, because it relates to a hypothetical situation in which no prevalence or bias effects are present. However, if PABAK is presented in addition to, rather than in place of, the obtained value of kappa, its use may be considered appropriate, because it gives an indication of the likely effects of prevalence and bias alongside the true value of kappa derived from the specific measurement context studied. Cicchetti and Feinstein argued, in a similar vein to Hoehler, that the effects of prevalence and bias "penalize" the value of kappa in an appropriate manner. However, they also stated that a single "omnibus" value of kappa is difficult to interpret, especially when trying to diagnose the possible cause of an apparent lack of agreement. Byrt et al36 recommended that the prevalence index and bias index should be given alongside kappa, and other authors have suggested that the separate proportions of positive and negative agreements should be quoted as a means of alerting the reader to the possibility of prevalence or bias effects. Similarly, Gjørup suggested that kappa values should be accompanied by the original data in a contingency table.

Interpreting the Magnitude of Kappa
Landis and Koch proposed the following standards for strength of agreement for the kappa coefficient: <0 = poor, .01-.20 = slight, .21-.40 = fair, .41-.60 = moderate, .61-.80 = substantial, and .81-1 = almost perfect. Similar formulations exist, but with slightly different descriptors. The choice of such benchmarks, however, is inevitably arbitrary, and the effects of prevalence and bias on kappa must be considered when judging its magnitude. In addition, the magnitude of kappa is influenced by factors such as the weighting applied and the number of categories in the measurement scale. When weighted kappa is used, the choice of weighting scheme will affect its magnitude (Appendix). The larger the number of scale categories, the greater the potential for disagreement, with the result that unweighted kappa will be lower with many categories than with few. If quadratic weighting is used, however, kappa increases with the number of categories, and this is most marked in the range from 2 to 5 categories. For linear weighting, kappa varies much less with the number of categories than for quadratic weighting, and may increase or decrease with the number of categories, depending on the distribution of the underlying trait. Caution, therefore, should be exercised when comparing the magnitude of kappa across variables that have different prevalence or bias, that are measured on dissimilar scales, or across situations in which different weighting schemes have been applied to kappa.

It has been suggested that interpretation of kappa is assisted by also reporting the maximum value it could attain for the set of data concerned. To calculate the maximum attainable kappa (κmax), the proportions of positive and negative judgments by each clinician (ie, the marginal totals) are taken as fixed, and the distribution of paired ratings (ie, the cell frequencies a, b, c, and d in Tab. 1) is then adjusted so as to represent the greatest possible agreement. Table 7 illustrates this process, using data from a study of therapists' examination of passive cervical intervertebral motion.7 Ratings were on a 2-point scale (ie, "stiffness"/"no stiffness"). Table 7 shows that clinician 1 judged stiffness to be present in 3 subjects, whereas clinician 2 arrived at a figure of 9. Thus, the maximum possible agreement on stiffness is limited to 3 subjects, rather than the actual figure of 2. Similarly, clinician 1 judged 57 subjects to have no stiffness, compared with 51 subjects judged by clinician 2 to have no stiffness; therefore, for "no stiffness," the maximum agreement possible is 51 subjects, rather than 50. That is, the maximum possible agreement for either presence or absence of the disease is the smaller of the marginal totals in each case. The remaining 6 ratings (60 − [3 + 51] = 6) are allocated to the cells that represent disagreement, in order to maintain the marginal totals; thus, these ratings are allocated to cell c. For these data, κmax is .46, compared with a kappa of .28.

Table 7.
Data on Assessments of Stiffness at C1-2 From Smedmark et al7,a

                            Clinician 2
                       Stiffness   No stiffness   Total
Clinician 1
  Stiffness             2 (3)       1 (0)            3
  No stiffness          7 (6)      50 (51)          57
Total                   9          51               60

a Figures in the cells represent the observed ratings; those in parentheses are the ratings that would secure maximum agreement given the marginal totals. Observed κ = .28; maximum attainable κ = .46. Smedmark et al did not specify the distribution of the 8 disagreements across the off-diagonal cells, but the figures in the table correspond to their reported κ.

In contrast to the PABAK, κmax serves to gauge the strength of agreement while preserving the proportions of positive ratings demonstrated by each clinician. In effect, it provides a reference value for kappa that preserves the individual clinician's overall propensity to diagnose a condition or select a rating (within the restraints imposed by the marginal totals f1, f2, g1, and g2 in Tab. 1). In some situations, equal marginal totals are not necessarily to be anticipated, owing to recognized pre-existing biases, such as when comparing observers
with different levels of experience, tools or assessment protocols with inherently different sensitivity, or examiners who can be expected to utilize classification criteria of different stringency. These a priori sources of disagreement will give rise to different marginal totals and will, in turn, be reflected in κmax.

For a given reliability study, the difference between kappa and 1 indicates the total unachieved agreement beyond chance. The difference between kappa and κmax, however, indicates the unachieved agreement beyond chance within the constraints of the marginal totals. Accordingly, the difference between κmax and 1 shows the effect on agreement of imbalance in the marginal totals. Thus, κmax reflects the extent to which the raters' ability to agree is constrained by pre-existing factors that tend to produce unequal marginal totals, such as differences in their diagnostic propensities or dissimilar sensitivity in the tools they are using. This provides useful information.

In a practical situation in which ratings are compared across clinicians, agreement will usually be better than that expected by chance, and specifying a zero value for kappa in the null hypothesis is therefore not very meaningful. Thus, the value in the null hypothesis should usually be set at a higher level (eg, to determine whether the population value of kappa is greater than .40, on the basis that any value lower than this might be considered clinically "unacceptable"). The minimum acceptable value of kappa will depend on the clinical context. For example, the distress and risk assessment method (DRAM)58 can be used to identify patients with low back pain who are at risk for poor outcome. If this categorization determines whether or not such patients are enrolled in a management program, the minimal acceptable level of agreement on this categorization might be set fairly high, so that decisions on patient management are made with a high degree of consistency. If agreement on the test were particularly important and the kappa were lower than acceptable, it might mean that clinicians need more training in the testing technique or the protocol needs to be reworded. When a value greater than zero is specified for kappa in the null hypothesis, a 2-tailed test is preferable to a 1-tailed test. This is because there is no theoretical reason to assume that the reliability of a test's results or a diagnosis will necessarily be superior to a stated threshold for clinical importance. One-tailed tests should be reserved for occasions when testing a null hypothesis that kappa is zero.

The statistical hypothesis test provides only binary information: Is there sufficient evidence that the population value of kappa is greater than .40, or not? A more useful approach is to construct a confidence interval around the sample estimate of kappa, using the standard error of kappa (Appendix) and the z score corresponding to the desired level of confidence. The confidence interval will indicate a range of plausible values for the "true" value of kappa, with a stated level of confidence. As with hypothesis testing, however, it may make sense to evaluate the lower limit of the confidence interval against a clinically meaningful minimum magnitude, such as .40, rather than against a zero value. Taking the data in Table 6A, for the obtained kappa of .54, the 2-sided 95% confidence interval is given by:

Prior to undertaking a reliability study, a sample size calculation should be performed so that a study has a stated probability of detecting a statistically significant kappa coefficient or of providing a confidence interval of a desired width. Table 8 gives the minimum number of participants required to detect a kappa coefficient as statistically significant, with various values of the proportion of positive ratings made on a dichotomous variable by 2 raters, specifically, (f1 + g1)/2n in terms of the notation in Table 1. A number of points should be noted in relation to this table. First, the sample sizes given assume no bias between raters. Second, except where the value of kappa stated in the null hypothesis is zero, sample size requirements are greatest when the proportion of positive ratings is either high or low. Third, given that the minimum value of kappa deemed to be clinically important will depend on the measurement context, in addition to a null value of zero, non-zero null values between .40 and .70 have been included in Table 8. Finally, following earlier comments on 1- and 2-tailed tests, the figures given are for 2-tailed tests at a significance level of .05, except where the value of kappa in the null hypothesis is zero, when figures for a 1-tailed test also are given.
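The confidence-interval and sample-size reasoning in this section can be sketched numerically. The standard error used below, SE ≈ sqrt(P_o(1 − P_o)/(n(1 − P_c)²)), is one common large-sample approximation; it is not necessarily the formula in the article's Appendix (which is not reproduced here), so the limits and the sample size obtained are illustrative only.

```python
import math

def kappa_with_ci(a, b, c, d, z=1.96):
    """Kappa for a 2 x 2 table with an approximate 2-sided confidence interval,
    using the large-sample SE approximation sqrt(p_o*(1-p_o) / (n*(1-p_c)**2))."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_o - p_c) / (1 - p_c)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_c) ** 2))
    return kappa, kappa - z * se, kappa + z * se

def n_for_ci_half_width(p_o, p_c, half_width, z=1.96):
    """Approximate number of paired ratings needed for a CI half-width of at
    most half_width, obtained by inverting the same SE approximation."""
    return math.ceil((z / half_width) ** 2 * p_o * (1 - p_o) / (1 - p_c) ** 2)

k, lo, hi = kappa_with_ci(32, 1, 3, 3)          # Table 6A data, kappa = .54
print(round(k, 2), round(lo, 2), round(hi, 2))  # a wide interval with only 39 patients
print(n_for_ci_half_width(35 / 39, 1179 / 1521, 0.20))  # n for a +/- .20 interval
```

With only 39 patients, the lower limit falls well below a .40 threshold even though kappa itself is .54, which is exactly why an advance sample-size calculation of the kind tabulated in Table 8 is worthwhile.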
Table 8.
The Number of Subjects Required in a 2-Rater Study to Detect a Statistically Significant κ (P ≤ .05) on a Dichotomous Variable, With Either 80% or 90% Power, at Various Proportions of Positive Diagnoses, and Assuming the Null Hypothesis Value of Kappa to Be .00, .40, .50, .60, or .70
When seeking to optimize sample size, the investigator needs to choose the appropriate balance between the number of raters examining each subject and the number of subjects. In some instances, it is more practical to increase the number of raters rather than increase the number of subjects. However, according to Shoukri, when seeking to detect a kappa of .40 or greater on a dichotomous variable, it is not advantageous to use more than 3 raters per subject; it can be shown that, for a fixed number of observations, increasing the number of raters beyond 3 has little effect on the power of hypothesis tests or the width of confidence intervals. Therefore, increasing the number of subjects is the more effective strategy for maximizing power.

Conclusions
If used and interpreted appropriately, the kappa coefficient provides valuable information on the reliability of data obtained with diagnostic and other procedures used in musculoskeletal practice. We conclude with the following recommendations:

• Alongside the obtained value of kappa, report the bias and prevalence.
• Relate the magnitude of the kappa to the maximum attainable kappa for the contingency table concerned, as well as to 1; this provides an indication of the effect of imbalance in the marginal totals on the magnitude of kappa.
• Construct a confidence interval around the obtained value of kappa, to reflect sampling error.
• Test the significance of kappa against a value that represents a minimum acceptable level of agreement, rather than against zero, thereby testing whether its plausible values lie above an "acceptable" threshold.
• Use weighted kappa on scales that are ordinal in their original form, but avoid its use on interval/ratio scales collapsed into ordinal categories.
• Be cautious when comparing the magnitude of kappa across variables that have different prevalence or bias, or that are measured on different scales.

References
1 Toussaint R, Gawlik CS, Rehder U, Ruther W. Sacroiliac joint diagnostics in the Hamburg Construction Workers Study. J Manipulative Physiol Ther. 1999;22:139-143.
2 Fritz JM, George S. The use of a classification approach to identify subgroups of patients with acute low back pain. Spine. 2000;25:106-114.
3 Riddle DL, Freburger JK. Evaluation of the presence of sacroiliac joint region dysfunction using a combination of tests: a multicenter intertester reliability study. Phys Ther. 2002;82:772-781.
4 Petersen T, Olsen S, Laslett M, Thorsen H, et al. Inter-tester reliability of a new diagnostic classification system for patients with non-specific low back pain. Aust J Physiother. 2004;50:85-91.
5 Fjellner A, Bexander C, Faleij R, Strender LE. Interexaminer reliability in physical examination of the cervical spine. J Manipulative Physiol Ther. 1999;22:511-516.
6 Hawk C, Phongphua C, Bleecker J, Swank L, et al. Preliminary study of the reliability of assessment procedures for indications for chiropractic adjustments of the lumbar spine. J Manipulative Physiol Ther. 1999;22:382-389.
7 Smedmark V, Wallin M, Arvidsson I. Inter-examiner reliability in assessing passive intervertebral motion of the cervical spine. Man Ther. 2000;5:97-101.
8 Hayes KW, Petersen CM. Reliability of assessing end-feel and pain and resistance sequences in subjects with painful shoulders and knees. J Orthop Sports Phys Ther. 2001;31:432-445.
9 Kilpikoski S, Airaksinen O, Kankaanpaa M, et al. Interexaminer reliability of low back pain assessment using the McKenzie method. Spine. 2002;27:E207-E214.
10 Hannes B, Karlmeinrad G, Ogon M, Martin K. Multisurgeon assessment of coronal pattern classification systems for adolescent idiopathic scoliosis. Spine. 2002;27:762-767.
11 Speciale AC, Pietrobon R, Urban CW, et al. Observer variability in assessing lumbar spinal stenosis severity on magnetic resonance imaging and its relation to cross-sectional spinal canal area. Spine. 2002;27:1082-1086.
12 Richards BS, Sucato DJ, Konigsberg DE, Ouellet JA. Comparison of reliability between the Lenke and King classification systems for adolescent idiopathic scoliosis using radiographs that were not pre-
18 Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378-382.
19 Fritz JM, Wainner RS. Examining diagnostic tests: an evidence-based perspective. Phys Ther. 2001;81:1546-1564.
20 Feinstein AR, Cicchetti DV. High agreement but low kappa, I: the problems of two paradoxes. J Clin Epidemiol. 1990;43:543-549.
21 Bartko JJ, Carpenter WT. On the methods and theory of reliability. J Nerv Ment Dis. 1976;163:307-317.
22 Fleiss JL, Nee JCM, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychol Bull. 1979;86:974-977.
23 Hartmann DP. Considerations in the choice of interobserver reliability estimates. J Appl Behav Anal. 1977;10:103-116.
24 Rigby AS. Statistical methods in epidemiology, V: towards an understanding of the kappa coefficient. Disabil Rehabil. 2000;22:339-344.
25 Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull. 1968;70:213-220.
26 Lantz C. Application and evaluation of the kappa statistic in the design and interpretation of chiropractic clinical research. J Manipulative Physiol Ther. 1997;20:521-528.
27 McKenzie RA. The Lumbar Spine: Mechanical Diagnosis and Therapy. Waikanae, New Zealand: Spinal Publications Ltd; 1981.
28 Kraemer HC, Periyakoil VS, Noda A. Kappa coefficients in medical research. Stat Med. 2002;21:2109-2129.
29 Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ. 1992;304:1491-1494.
30 Donner A, Eliasziw M. Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics. 1994;50:550-555.
31 Bartfay E, Donner A. The effect of collapsing multinomial data when assessing agreement. Int J Epidemiol. 2000;29:1070-1075.
32 Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol. 1987;126:161-169.
33 Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 3rd ed. Oxford, England: Oxford University Press; 2003.
34 Stratford PW, Goldsmith GH. Use of the standard error as a reliability index: an applied example using elbow flexor strength. Phys Ther. 1997;77:745-750.
35 Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307-310.
36 Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46:423-429.
measured. Spine. 2003;28:l 148-1157.
37 Banneijee M, FieldingJ. Interpreting kappa values for two-obsen'er
13 Sim J, Wright CC. Research in Health Care: Concepts, Designs and nursing diagnosis data. Res Nurs Health. 1997;20:465-470.
Methods. Cheltenham, England: Nelson Thornes; 2000.
38 Thompson WD, Walter SD. A reappraisal of the kappa coefficient.
14 Cohen J. A coefficient of agreement for nominal scales. Educ Psychol / Clin Epidemiol. 1988;41:949-958.
Meas. 1960;20:37-46.
39 Shoukri MM. Measures of Interobserver Agreement. Boca Raton, Fla:
15 Daly LE, Bourke GJ. Interpretation and Uses of Medical Statistics. 5th ed. Chapman & Hall/CRG; 2004.
Oxford, England: Blackwell Science; 2000.
40 Brennan RL, Prediger DJ. Coefficient kappa: some uses, misuses,
16 Conger AJ. Integration and generalization of kappas for multiple and alternatives. Educ Psychol Meas. 1981;41:687-699.
raters. Psychol Bull. 1980;88:322-328.
41 Hoehler FK. Bias and prevalence effects on kappa viewed in terms
17 Haley SM, OsbergJS. Kappa coefficient calculation using multiple of sensitivity and specificity. J Clin Epidemiol. 2000;5S:499-503.
ratings per subject: a special communication. Phys Ther. 1989;69:
42 Cicchetti DV, Feinstein AR. High agreement but low kappa, II:
970-974.
resolving the paradoxes. J Clin Epidemiol. 1990;43:551-558.
Physical Therapy . Volume 85 . Number 3 . March 2005 Sim and Wright . 267
43 Lantz C, Nebenzahl E. Behavior and interpretation of the kappa statistic: resolution of the two paradoxes. J Clin Epidemiol. 1996;49:431-434.

44 Gjørup T. The kappa coefficient and the prevalence of a diagnosis. Methods Inf Med. 1988;27:184-186.

45 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.

46 Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley & Sons Inc; 1981.

47 Altman DG. Practical Statistics for Medical Research. London, England: Chapman & Hall/CRC; 1991.

48 Shrout PE. Measurement reliability and agreement in psychiatry. Stat Methods Med Res. 1998;7:301-317.

49 Dunn G. Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors. London, England: Edward Arnold; 1989.

50 Brenner H, Kliebsch U. Dependence of weighted kappa coefficients on the number of categories. Epidemiology. 1996;7:199-202.

51 Haas M. Statistical methodology for reliability studies. J Manipulative Physiol Ther. 1991;14:119-132.

52 Soeken KL, Prescott PA. Issues in the use of kappa to estimate reliability. Med Care. 1986;24:733-741.

53 Knight K, Hiller ML, Simpson DD, Broome KM. The validity of self-reported cocaine use in a criminal justice treatment sample. Am J Drug Alcohol Abuse. 1998;24:647-660.

54 Posner KL, Sampson PD, Caplan RA, et al. Measuring interrater reliability among multiple raters: an example of methods for nominal data. Stat Med. 1990;9:1103-1115.

55 Petersen IS. Using the kappa coefficient as a measure of reliability or reproducibility. Chest. 1998;114:946-947.

56 Main CJ, Wood PJ, Hollis S, et al. The distress and risk assessment method: a simple patient classification to identify distress and evaluate the risk of poor outcome. Spine. 1992;17:42-51.

57 Sim J, Reid N. Statistical inference by confidence intervals: issues of interpretation and utilization. Phys Ther. 1999;79:186-195.

58 Flack VF, Afifi AA, Lachenbruch PA, Schouten HJA. Sample size determinations for the two rater kappa statistic. Psychometrika. 1988;53:321-325.

59 Donner A, Eliasziw M. A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Stat Med. 1992;11:1511-1519.

60 Walter SD, Eliasziw M, Donner A. Sample size and optimal designs for reliability studies. Stat Med. 1998;17:101-110.

Appendix.
Weighted Kappa

Formula for weighted kappa (κw):

    κw = (Σwfo − Σwfc) / (N − Σwfc)

where Σwfo is the sum of the weighted observed frequencies in the cells of the contingency table, Σwfc is the sum of the weighted frequencies expected by chance in the cells of the contingency table, and N is the total number of paired ratings.

Linear and quadratic weights are calculated as follows:

    linear weight = 1 − |i − j| / (k − 1)

    quadratic weight = 1 − (i − j)² / (k − 1)²

where i − j is the difference between the row category on the scale and the column category on the scale (the number of categories of disagreement), for the cell concerned, and k is the number of points on the scale.

The weights for unweighted, linear weighted, and quadratic weighted kappas for agreement on a 4-point ordinal scale are:

                                  Unweighted   Linear weights   Quadratic weights
    No disagreement                    1             1                 1
    Disagreement by 1 category         0            .67               .89
    Disagreement by 2 categories       0            .33               .56
    Disagreement by 3 categories       0             0                 0

This method of weighting is based on agreement; a method of weighting based on disagreement also can be defined.

A number of widely available computer programs will calculate κ and κw through standard options. SPSS 12 will calculate κ (but not κw) and performs a statistical test against a null value of zero. Stata 8 and SAS 8 calculate both κ and κw and perform a statistical test against a null value of zero for each of these statistics. PEPI 4 calculates both κ and κw, and provides a value of κmax for each of these statistics. This program performs a statistical test against a null value of zero (and, if the observed value of κ or κw is greater than .40, against a null value of .40 also), together with 90%, 95%, and 99% 2-sided confidence intervals. Additionally, the prevalence-adjusted bias-adjusted kappa (PABAK) is calculated, and a McNemar test for bias is performed. Various other stand-alone programs are available, and macros can be used to perform additional functions related to κ for some of these programs.
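As an illustration of the Appendix formulas, the following is a minimal sketch (not code from the article, nor from any of the packages named above) that computes κ and κw directly from a k × k contingency table of 2 raters' counts. The function name weighted_kappa and the scheme argument are illustrative labels introduced here.

```python
# Sketch of (weighted) kappa from a k x k contingency table of counts
# for 2 raters, using the agreement-based weights in the Appendix:
# unweighted (1 on the diagonal, 0 elsewhere),
# linear w = 1 - |i-j|/(k-1), quadratic w = 1 - (i-j)^2/(k-1)^2.

def weighted_kappa(table, scheme="unweighted"):
    k = len(table)                      # number of points on the scale
    n = sum(sum(row) for row in table)  # total number of paired ratings
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        if scheme == "linear":
            return 1 - abs(i - j) / (k - 1)
        if scheme == "quadratic":
            return 1 - (i - j) ** 2 / (k - 1) ** 2
        return 1.0 if i == j else 0.0   # unweighted kappa

    # Sum of weighted observed frequencies, and of weighted frequencies
    # expected by chance (row total x column total / n for each cell).
    sum_wfo = sum(w(i, j) * table[i][j]
                  for i in range(k) for j in range(k))
    sum_wfc = sum(w(i, j) * row_tot[i] * col_tot[j] / n
                  for i in range(k) for j in range(k))

    # kappa_w = (sum(w * fo) - sum(w * fc)) / (N - sum(w * fc))
    return (sum_wfo - sum_wfc) / (n - sum_wfc)

# Example: 2 raters each classify 50 patients as positive/negative.
ratings = [[20, 5],
           [10, 15]]
print(weighted_kappa(ratings))  # → 0.4
```

With agreement-based weights, the weighted and unweighted statistics coincide for a dichotomous (2 × 2) rating, and κw = 1 whenever all counts fall on the diagonal, whatever the weighting scheme.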