

This is the preprint copy of an article that was subsequently published in Health Services and Outcomes
Research Methodology, 2017, 17, 127–143. http://dx.doi.org/10.1007/s10742-016-0156-6. The text of the
two documents is very close, but not identical.

Intraclass Correlation Coefficients:

Clearing the Air, Extending Some Cautions, and Making Some Requests

Robert Trevethan, PhD

Independent academic researcher and author1

Abstract Intraclass correlation coefficients (ICCs) are frequently employed in health science research, often to assess intrarater and interrater reliability. In many cases,
insufficient details are provided about these ICCs and there seem to be
misunderstandings about their selection and how they should be interpreted. This article
is intended primarily to provide a clear, accessible description of ICCs, including how
they should be selected, interpreted, and reported. Emphasis is given to areas where
researchers seem to encounter the greatest conceptual difficulties and to exhibit the
greatest misconceptions. Two extended examples are used to support the points being
made. Major additional aims of this article are to raise the awareness of authors,
reviewers, and editors concerning the importance of using appropriate ICCs, and to
encourage them to ensure that complete and accurate information about ICCs is reported
in journal articles. Failure to do so perpetuates a risk that incorrect decisions might be
made about matters that are of crucial importance for people’s health.

1 Author contact: robertrevethan@gmail.com

Keywords: Intraclass correlation coefficient · ICC · Rater reliability ·


Reliability · Toe-brachial index

1 Introduction
In both research and clinical settings within the health sciences, consistency of
measurement can be of utmost importance. Problems arise if measuring devices
produce inconsistent readings because of instrumentation errors, if a particular rater
produces different readings of a stable phenomenon at different time points, or if
different raters produce discrepant readings among each other. Inconsistent
measurement cannot be ignored because accurate measurement is often needed to
determine such things as appropriate referrals, the extent of disease progression,
whether people might be amenable to treatment or might be best served with
particular kinds of interventions, and whether people have been responsive to
interventions. Consistency of measurement is also necessary to prevent under- and
over-diagnosing, and these are difficult to avoid if the bases of assessment fluctuate
unpredictably. Consistency therefore can be an important and problematic issue for
researchers and clinicians. It can also be fascinating.

This article is concerned primarily with the procedures for assessing the
consistency of raters within themselves at different points in time (sometimes almost
cotemporaneous) and between different raters, usually referred to as intra- and inter-
rater reliability respectively. One of the most commonly used means of assessing
these two kinds of reliability is the intraclass correlation coefficient (ICC) and it is
used widely in some contexts, for example, in orthopedics (Lee et al. 2012) and
podiatry (Wrobel and Armstrong 2008). Unfortunately there seems to be a degree of
persistent confusion, as well as misinformation, about ICCs (Rankin and Stokes
1998) and attempts to understand them are not helped by some theorists and
statisticians (e.g. Carrasco and Jover 2003; Fleiss 1986; Hayen et al. 2007; McGraw
and Wong 1996; Shrout and Fleiss 1979) describing ICCs in a way that, while
authoritative, accurate, and necessary, is probably too dauntingly sophisticated for
many researchers. Furthermore, the conventionally accepted way that these authors
categorize ICCs does not always conform with what appear to be the categorizations
in statistics software packages. Information from other sources, including the
Internet, is often incomplete, confusing, contradictory, and sometimes patently
incorrect. This article is intended to provide general, informative, and accessible
clarification for researchers, particularly for those who do not have a strong
background in statistics, concerning how ICCs should be selected and interpreted,
and to indicate the need for researchers to provide clear information about what ICCs
they use. While describing ICCs, extended information is presented for areas where
researchers seem to experience the greatest conceptual difficulties and exhibit the
greatest misconceptions.

Several features and constraints of this article, some already foreshadowed, are
offered before providing that description. First, although focus will be directed
primarily to contexts involving consistency within and between raters, at times
reference will be made to intraparticipant and instrumentation consistency. Second,
there are many different contexts beyond intrarater and interrater reliability in which
ICCs can be appropriately used, and only some of those contexts are referred to here.
Third, other ways in which reliability is assessed will not be dealt with. These
include Bland-Altman plots and limits of agreement as well as latent variable
modeling approaches (see Raykov et al. 2013), some of which can be used in parallel
with, or instead of, ICC analyses. Like the ICC, these analyses require either equal
interval or ratio data. Analyses that are suitable for categorical data, such as Cohen’s
kappa, are also not included. Fourth, examples will be derived from podiatry, but
there are many other health and social sciences, as well as disciplines such as
education, in which ICCs are applicable. Fifth, procedures within the Statistical
Package for the Social Sciences (SPSS®; IBM, USA) for obtaining ICCs will be
referred to at times because SPSS is frequently used for analyzing data within the
health and social sciences when ICCs are sought, ICCs can be easily produced within
that package (point and click), and its output concerning ICCs is clearly formatted.
Minimal reference will be made to other software packages such as R, SAS, and
Stata because the focus of this article is on the general nature, selection,
interpretation, and reporting of ICCs, not on how they can be obtained in different
packages. Sixth, because of the focus of this article, some issues such as
determination of adequate sample sizes are not considered. Finally, even readers who
find statistics distasteful and unfathomable might find it helpful to read this article in
conjunction with information about ICCs available in a text such as that by Portney
and Watkins (2009). The information there and in this article should complement
each other, but perhaps the material in one or the other will resonate more helpfully
with particular readers and may do so to a greater or lesser extent at different times.

2 The nature of ICCs


There are a number of different formulas for calculating ICCs. Choice of a formula
depends on the purpose and design of a particular study (or part of a study), how the
study was conducted, and the number of measurements taken to represent a single
reading for purposes of analysis. In order to indicate some of the decisions behind the
nature of a particular ICC and its underlying formula, ICCs are typically expressed
with two numbers in parentheses, for example (1,1), according to the categorization
system that emerged from the foundational work of Shrout and Fleiss (1979) and
McGraw and Wong (1996). The first number refers to the model of an ICC, and the
second number to its form. Entries in Table 1 provide abbreviated information about
both of these as well as information about a third category, sometimes referred to as
type. Informative and adequate in-text indications about ICCs can be economically
expressed in several ways, for example as “ICC(m,f) with Type” where m refers to
model and f to form, as used subsequently within this article, but also with
superscripted or subscripted formats, as in “ICC^(m,f) with Type” or “ICC_(m,f) with Type”
respectively. The different kinds of models, forms, and types of ICCs are described
in the next three sections.

2.1 ICC models


There are three ICC models, each of which differs in terms of where the sources of
statistical variability are believed to lie. These sources of variability all include the
participants in a study and sometimes, but not always, the raters. If instruments are
used, variability from them, known as instrumentation or measurement error, is also
taken into account. Despite the impression given by many researchers that the only
source of variability is their raters, all sources of variation can be part of the
statistical landscape, and these sources of variability are the basis for choosing a
particular model.

Table 1 Categories of ICC, research contexts and outcomes, and SPSS options

Model

1  A range of different raters assess different participants, and there is no match
   between raters and participants. This situation is infrequent. This model will
   usually produce lower ICCs than do the other two models.
   SPSS option: third option, “1-way random”

2  The same raters assess all participants, and theoretically the raters are regarded
   as being randomly selected, as are the particular participants. Furthermore, the
   raters are assumed to be representative of other raters, and it is assumed that
   those other raters would produce a similar kind of consistency regardless of the
   participants. The main feature of this model is intended generalization to other
   situations. This model is particularly appropriate when different raters’
   consistency in using a particular instrument is being assessed. It usually produces
   ICCs that lie between those of the other two models, but sometimes they are the same
   as those in Model 3.
   SPSS option: second option, “2-way random”

3  The same raters assess all participants, but the raters are the only raters of
   interest for current purposes (the specific study setting) and it is expected that
   they would exhibit similar agreement/consistency with a different set of
   participants. This model applies to many situations, including those used to assess
   intrarater reliability and test-retest reliability. It will usually produce the
   highest ICCs.
   SPSS option: first option, “2-way mixed”

Form

1  On any occasion, only one reading/measurement is taken by each rater from each
   participant for purposes of analysis.
   SPSS option: “Single measures”

2  Two readings are taken from each participant (either two readings by a single
   rater, or a single reading by each of two raters), and those two readings are
   averaged.
   SPSS option: “Average measures”

k  k readings are taken from each participant by a single rater, or k raters take a
   reading from each participant—not necessarily, but usually, on the same
   occasion—and those k readings are averaged. Any form greater than 1 will produce
   higher ICCs than those under Form 1.
   SPSS option: “Average measures”

Type

Consistency  This option assesses the extent to which the sequence of scores
   corresponds across data sets, i.e. whether the same scores tend to be consistently
   at the top, middle, or bottom of the data sets.
   SPSS option: “Consistency”

Absolute  This option assesses not only the sequence of scores in different data sets
   but also the extent to which the data are similar according to their magnitude. It
   produces lower ICCs than does the consistency option.
   SPSS option: “Absolute”

It might not always be easy to determine which model should be sought from, or
which is provided by, a particular software package. For example, as will be evident
from the model entries and their associated SPSS options in Table 1, some confusion can arise in making the
appropriate choice when using SPSS. This is because, for Model on the statistics
screen where ICCs are available in SPSS, the options are in reverse order to the
conventional categorizations referred to above. The default, and therefore first,
option in SPSS is the conventional Model 3, and the third option is the conventional
Model 1. Because of that reversal, there is a risk that researchers might regard Model
1 (the third option in SPSS) as the most desirable option simply because, by
convention, the most desirable option is usually designated as “number one”. That
this is misguided should quickly become apparent in what follows.

In many, perhaps most, research situations, Model 3, i.e. ICC(3,f), is the most
appropriate and therefore it is dealt with first here. This model pertains when
researchers want to assess whether the specific raters in their specific study are
consistent, either within themselves (intrarater reliability) or between each other
(interrater reliability). It is important to stress that no other raters are of interest
within this model. In this sense, the raters are “fixed”. The formulas on which this
model is predicated are therefore constructed specifically on the assumption that
generalizability to other raters is not being sought. In essence, the researchers do not
intend to claim that other raters are necessarily likely to produce the same level of
agreement that their specific raters did. It is assumed, however, that the specific
raters would probably produce a similar amount of intrarater and/or interrater
agreement regardless of which participants they were rating. In this context, the
participants are regarded as having been randomly selected even if, in reality, they
were not.

Because of the way in which the above scenario is conceptualized, in the formulas
that pertain to Model 3 the main variation among the measurements is conceived of
as coming from two identifiable sources: the raters who are regarded as fixed, and
the participants who are regarded as random. This means that these two sources of
variation are different—i.e. mixed—in nature (one fixed, the other random), and
thus, as can be seen in Table 1, this model is commonly referred to as a 2-way mixed
model. It is based on a two-way analysis of variance (ANOVA).

Logically, this is the only model appropriate for assessing intrarater reliability
because each single rater is the only rater of interest, thus “fixed”, in the context of
intrarater reliability. It is also the most appropriate model for test-retest reliability
(Brozek and Alexander 1947; Müller and Büttner 1994). Furthermore, because the
raters are the only raters of interest under Model 3, this model would be appropriate
in a pilot study where the extent of rater consistency was being assessed prior to the
same rater(s) being used in a substantive study (Hayen et al. 2007). In this case,
generalization concerning raters would still not be a consideration because the raters,
by being the same in both the pilot and substantive studies, would be regarded as
fixed. In other cases, however, Model 3 would be used in a post hoc fashion in order
to determine the extent to which raters already had, or had not, been consistent in a
particular, restricted, context.

Model 2, i.e. ICC(2,f), is also likely to be appropriate in many research situations.


It is the second option in SPSS and differs from Model 3 in terms of intended
generalizability. For Model 2, unlike Model 3, generalization is intended. This model
applies when the researchers are interested in the performance of their raters in the
belief that those raters might be typical of other raters and those other raters would
therefore produce similar amounts of consistency under similar circumstances. For
example, the researchers might be initially interested in testing whether two or more
raters are likely to produce similar ratings to each other when using a particular piece
of equipment but with an interest in whether their findings might apply more widely
if that equipment was used by other raters. They might want, for example, to
generalize the findings from a specific research setting across to clinical settings, or
from one clinical setting to other clinical settings. In these cases, the raters are not
considered, as they were in Model 3, to be fixed. Instead, they are regarded as having
been drawn from a pool of (presumably similar) raters, and for statistical purposes
they are conceived of as being random despite the fact that they are probably selected
for quite specific reasons, not the least of which might be their sheer availability.
That aside, in this situation the prospect of variation is regarded as coming from two
random sources, the raters and the participants, each of which can be clearly
identified in the statistical formulas. Not surprisingly, therefore, this is referred to as
a 2-way random model because each of the two identified sources of variation comes
from a different “direction” (thus 2-way) and they are both regarded as random in the
statistical environment. Like Model 3, it is based on a two-way ANOVA.

Model 1, i.e. ICC(1,f), is appropriate for research situations that are usually more
complex than are the situations that apply to the other two models. This model might
be appropriate in a variety of contexts, but in practice those contexts arise only
infrequently. In essence, one set of ratings is produced by one group of raters on
some of the participants, and another set of ratings is produced by a different group
of raters on a different group of participants. This could seem unnecessarily
complicated and to open up a range of problems, but it need not be disconcertingly
and disadvantageously unorganized, and it might be necessary when large data sets
are to be analyzed and a number of raters are needed to accomplish the task. For
example, as pointed out by Landers (2015), if two readings were to be taken from
each of 2,000 people, it might be necessary to employ 10 raters, each of whom was
assigned to take two readings from each of 200 people, a total of 400 readings for
each rater. In cases such as this, it is not possible to distinguish clearly where the
variation in scores might be coming from (the inconsistencies caused by different
raters and different participants cannot be separated from each other), so
conceptually the variation is regarded as coming from one single general direction.
Because of this, and because both raters and participants are regarded as random, it is
referred to as a 1-way random model. This model is based on a one-way ANOVA.
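
To make the link between the three models and their underlying ANOVAs concrete, the following minimal Python sketch (an illustration added here, not code from the article or from SPSS) computes single-measures ICCs from the mean squares of a participants-by-raters matrix, using the estimators given by Shrout and Fleiss (1979) and McGraw and Wong (1996). The function and variable names, and the example data, are invented for the illustration.

```python
import numpy as np

def icc_single(x):
    """Single-measures ICCs for an n-participants x k-raters (or occasions) array.

    Estimators follow Shrout and Fleiss (1979) and McGraw and Wong (1996);
    this is an illustrative sketch only.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()          # between participants
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()          # between raters
    ss_total = ((x - grand) ** 2).sum()

    msr = ss_rows / (n - 1)                                      # participants
    msc = ss_cols / (k - 1)                                      # raters
    msw = (ss_total - ss_rows) / (n * (k - 1))                   # one-way residual
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))   # two-way residual

    absolute = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return {
        # Model 1 (one-way random): absolute agreement only
        "ICC(1,1)": (msr - msw) / (msr + (k - 1) * msw),
        # Models 2 and 3, absolute agreement: the point estimator is the same,
        # which is consistent with the identical Model 2 and Model 3 values
        # reported later in Table 2
        "ICC(2,1) absolute": absolute,
        "ICC(3,1) absolute": absolute,
        # Consistency (available under Models 2 and 3)
        "ICC(3,1) consistency": (msr - mse) / (msr + (k - 1) * mse),
    }

# Invented example: 5 participants, each measured once by 2 raters
ratings = [[7, 9], [5, 6], [8, 8], [4, 6], [6, 8]]
for name, value in icc_single(ratings).items():
    print(f"{name}: {value:.2f}")
```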

2.2 ICC forms


There are essentially only two forms of ICC: single measures and average measures.
These are also indicated in Table 1. The discussion about forms that follows is
extended because it is easy for researchers to be confused about form, and selecting
the wrong form can result in inappropriately low or exaggeratedly high ICCs being
reported as well as confusion for those who read accounts of research.
Form 1, i.e. ICC(m,1), refers to the most common situation. Here, as suggested by
the integer used to represent it, single readings are used as the basis for analysis. For
example, for each participant a single blood pressure reading by one rater might be
compared with a single reading from the same rater on a subsequent occasion, with
the readings taken perhaps only a few minutes apart. This would provide the basis for
a typical investigation in which intrarater reliability was assessed. Alternatively, for
each participant a single blood pressure reading by one rater might be compared with
a single reading from a different rater, both readings again being taken within a few
minutes of each other. This would provide the basis for a typical investigation in
which interrater, as opposed to intrarater, reliability was assessed. In both cases,
however, the form would be 1 because single readings were carried forward for
analysis.

Any number > 1 in the form indicates two things: first, simply that scores had
been averaged to produce the data for subsequent analysis (i.e. averaging of the data
had occurred before entry into the ICC procedure), and, second, the specific number
of scores that were averaged to produce each data point. For example, in a variation
of the above situation, each rater might take two readings from each participant and
those two readings could be averaged for each rater prior to the ICC analysis. In that
case, the form would be 2, as in ICC(m,2), to indicate not only that averaging was
employed (as pointed out above, any number > 1 signals this) but also, more
specifically, that it was two readings that were averaged in each case. In this
situation, incidentally, interrater reliability could be assessed. If each rater had taken
three readings and averaged them, the form would be 3, represented as ICC(m,3), and
so on. The final row of entries for Form in Table 1 contains the letter k to indicate
whatever number of readings had been averaged in a particular context.

Scanning the text of a number of research articles and trying to match the form of
the ICC used by researchers (many researchers, incidentally, do not indicate either
the model or the form they used) suggests that researchers often confuse whether or
not the data were averaged—the appropriate focus—with the number of raters in
their study—an inappropriate focus. Sometimes, for example, if single scores were
obtained from two raters, the form is incorrectly indicated as being 2; if the single
scores were obtained from three raters the form is incorrectly indicated as 3. And so
on. This misperception by researchers occurs so frequently that it is worth
emphasizing as being incorrect. McGraw and Wong (1996) make it clear that Form
does not refer to the number of raters (no matter how many raters are involved) by
stating that it refers to such things as “the average rating for k judges, the average
score for a k-item test, or the average weight of k litter-mates” (p. 33, italics added).

In most contexts, the form will be 1 because data points are seldom averaged
when consistency between data is being assessed. This is probably because there is
usually no need to engage in a process of averaging scores. As Shrout and Fleiss
(1979, p. 426) pointed out:

Only occasionally is the choice of a mean rating as the unit of analysis based on substantive grounds. An example of a substantive choice is the
investigation of the decisions (ratings) of a team of physicians … . More
typically, an investigator decides to use a mean as a unit of analysis
because the individual rating is too unreliable.

Hayen et al. (2007) emphasize a caveat that is alluded to in the above quotation, namely that averages should not be used when determining ICCs unless averaged ratings will also be used in the situations to which a particular study’s results are expected to apply.

Clearly, sometimes researchers might believe that taking the average of two or
more scores from each rater, or perhaps an average of two or more raters’ scores, is
justified because it would improve reliability. In those cases, the researchers should
include appropriate procedures in the design of their study and should indicate their
having done so by using the relevant integer (the number higher than 1 that
corresponds to the number of measurements that were averaged) to represent Form,
substituting it for the letter k. For example, if one rater took two measurements on
one occasion, which were then averaged, and two measurements on a second
occasion, which were also averaged, and the two sets of averaged measurements
were submitted for the ICC calculation, the form would be 2, as in (m,2). If three,
rather than two, measurements were taken by the same rater on both occasions, and
each set of three measurements was averaged, the form would be 3, as in (m,3). In
neither of these situations should the form be indicated as 1 in the mistaken belief
that it reflects the readings having been made by only one rater.

The process of averaging can occur in a number of different ways and for a
number of different purposes, and sometimes those purposes might move away from
the context of rater reliability. This means that researchers need to be particularly
mindful not only about the form they use and report, but also about exactly what they
are likely to be measuring in a particular context. For example, if for each
participant, three different raters measured blood pressure on one day and those three
readings were averaged, and the same process was repeated a week later, an ICC
could be calculated on the set of averages from the first three readings relative to the
set of averages from the second three readings. In that case, the form would be 3 to
indicate that three readings had been averaged to provide the paired set of scores for
each occasion. However, it is important to note that here, intraparticipant
consistency, rather than either intra- or inter-rater consistency (reliability), would be
the prime focus of assessment because any distinction within or between the raters
would have been lost in the process of averaging.

Although Form refers to whether single or averaged scores were entered into an
ICC analysis, not to the number of raters involved, under some circumstances the
number representing Form will correspond to the number of raters. In the above
example, the readings of three raters were averaged at one time point and again at a
second time point, and an ICC across the two time points was obtained. Therefore,
both the number of raters and the form would be 3. However, that number
representing Form would refer to the number of measurements that were averaged,
not to there having been three raters. The fact that the two numbers are the same is,
in a sense, coincidental. In a context of interrater reliability, if two raters each took
two measurements, and for each of those raters their own two measurements were
averaged before being submitted to an ICC analysis, the form would be 2—again not
because of the number of raters but because of the number of measurements that
were used to produce the averages.

A related important point to note about forms is that a number to represent the
number of raters in a situation is never entered into any of the formulas for an ICC.
Only sets of scores are submitted for analysis. For example, in the usual case of
assessing intrarater reliability, two sets of single readings would be entered for
analysis, one set for each point in time that readings were taken, or perhaps three sets
of single readings would be entered if a rater’s consistency over three time points
was being assessed. In both of those cases, the form would be 1 as long as none of
the entries that were submitted for analysis had been averaged. The number 1 would
reflect the absence of averaging (i.e. use of single readings), not the existence of only
one rater. If each of four raters took a single measure from each participant and those
measures were not averaged, an admittedly unusual situation, the form would still be
1 even though four sets of data were used for computation purposes. The researchers
would simply need to keep in mind that the ICC refers to readings taken from four
people in that case. The main point to be noted here is that the number of sets of data
that are submitted for an ICC analysis does not necessarily bear any relationship with
the number of raters. Nor, of course, is the number of data sets entered for analysis
necessarily related to the form of an ICC.

Because averaging is likely to remove some of the variability among scores, sets
of data based on average scores will almost inevitably be more similar to each other
than will sets based on single scores. In order to reflect this, the formulas behind the
calculation of ICCs generate output in which the ICCs based on average measures—
any form > 1, thus ICCs (m,k)—will always be higher than are the ICCs based on
single measures, i.e. ICCs (m,1). This is an important point to note because,
regardless of what model or type had been selected, the ICCs that result from both
single and averaged processes might be closely juxtaposed in computer output. For
example, in SPSS the former are in a row immediately above the latter. Because of
this proximity, it may be tempting for researchers whose data comprise single
measures to invalidly choose the ICC based on averaged scores. If they are
unfamiliar with the meaning of form in ICCs, they might assume incorrectly that
their statistics package employs an undisclosed process of averaging to produce more
refined or accurate output, and thus the higher ICC is made available. Furthermore,
inappropriately choosing the averaged output makes results look more impressive for
both researchers and their colleagues, and journals are more likely to accept
impressive results—the well-known positive publication bias (see Franco et al.
2014)—so it might be an attractive course of action. Choosing results from averaged
output is also a particularly easy trap to fall into for researchers assessing interrater
reliability if they believe that the form of an ICC is based on the number of raters
rather than on the number of scores that had been averaged. Because all interrater
studies have two or more raters, those researchers will first believe that the form is
inevitably > 1, namely the number of raters being assessed. They might then assume,
again mistakenly, that averaging is something that the statistics package applies to
scores after they have been entered into an ICC analysis. The final step for these
researchers is to assume that it is therefore appropriate to report the form as the
number of raters in their study and valid to report the ICC value that is provided in
the averaged measures section of the computer output. This, however, is almost
never appropriate because, as mentioned above, ratings are seldom averaged to
produce the data that are fed into an ICC analysis. It also results in confusion for
those who read an account of the research because they cannot be certain whether
any discrepancy related to Form occurred because the researchers did not understand
the nature of Form in ICCs or did not describe the design of their research
adequately.

One additional and possibly confusing feature of an ICC’s form should be noted in
conclusion. As established above, for Form there are only two choices (single and
average), different formulas apply to each of those choices, and the different
formulas produce the specific output from a statistics package. However, within the
average procedure, the same formula is applied regardless of the number of scores
used to generate each average. The second integer in parentheses, if > 1, is therefore
used merely to indicate, or confirm, the number of items used when obtaining
averages: It serves primarily as information. Therefore, researchers should be
vigilant in taking their findings from the computer output section referring to single
measures if their form is 1, and from the computer output section referring to average
measures if their form is anything > 1.
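
To illustrate what form does and does not refer to, the brief sketch below (invented data, not from the article) prepares the same hypothetical readings in two ways: entering one reading per rater gives form 1, ICC(m,1), whereas averaging each rater’s two readings before the analysis gives form 2, ICC(m,2). The number of raters never enters the calculation; only the prepared columns do.

```python
import numpy as np

# Hypothetical readings: two raters, each taking two readings from four participants
rater_a = np.array([[0.61, 0.63], [0.72, 0.70], [0.55, 0.59], [0.80, 0.78]])
rater_b = np.array([[0.64, 0.66], [0.69, 0.73], [0.60, 0.58], [0.77, 0.81]])

# Form 1: a single reading per rater (here, the first) is carried forward
form1 = np.column_stack([rater_a[:, 0], rater_b[:, 0]])

# Form 2: each rater's two readings are averaged *before* the ICC analysis
form2 = np.column_stack([rater_a.mean(axis=1), rater_b.mean(axis=1)])

print("Form 1 matrix (single measures):\n", form1)
print("Form 2 matrix (averaged measures):\n", form2)
# Either matrix would then be submitted to the ICC procedure (for example, the
# icc_single helper sketched in Sect. 2.1); the reported form records how the
# data were prepared, not how many raters there were.
```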

2.3 ICC types


The three different models (1, 2, or 3) and two basic forms (one form relating to
single measurements, and the other relating to the average of two or more
measurements) produce six different kinds of ICC: two forms for each of the three
models. There is a third aspect of ICCs that must be taken into account—and it
results in even more ICC versions than the six that are produced by combinations of
model and form. This aspect has been labeled Type by Lee et al. (2012) and concerns
whether an ICC is based on what is referred to as consistency, versus absolute,
agreement. Abbreviated information about each of these types is provided at the
bottom of Table 1. In SPSS, the option for selecting the ICC type is provided on the
same screen that is available for choosing the model, and, whatever software package
is used, the type should be deliberately identified and an appropriate choice should
be made. Both types (consistency or absolute) are available within Models 2 and 3,
but, for statistical reasons, only the absolute option is relevant for Model 1.
Therefore, Type brings the total number of ICC versions to 10.

In most contexts, the consistency option will produce the Pearson’s product
moment correlation coefficient and therefore it negates the reason for ICCs’
existence. This option merely reflects the extent to which two sets of scores have a
similar sequence when both are ordered from smallest to largest. For example, the
scores 10, 20, 30, and 40 have the same sequence as the scores 20, 30, 40, and 50.
They are perfectly correlated and would yield a Pearson’s correlation of 1.00.
However, this does not indicate the extent to which one set might be systematically
higher or lower than the other. In this case, the second set is consistently 10 higher
than the first, so the two sets are different from each other—something that the
correlation coefficient does not reflect. The ICC is designed, when Type is used
appropriately, to overcome this problem by taking into account the extent to which scores are similar in magnitude, i.e. how close they are to being identical. This
is what occurs under the absolute agreement option and it is why that option should
almost always be chosen when obtaining ICCs. Under most circumstances, the ICCs
produced by the absolute option will be smaller than are the ICCs produced by the
consistency option, so again it is tempting for researchers to choose the latter option
if they favor reporting high ICCs whether doing so is valid or not. Unfortunately, consistency agreement is the default option in most versions of SPSS, so that removes some “protection” against producing ICCs that are undesirably inflated.
Again, researchers who assume that the default options in SPSS are the more
desirable might be misled regarding this aspect of ICC selection.
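
The contrast can be made numerical with the scores used above (10, 20, 30, 40 versus 20, 30, 40, 50). The short sketch below (an added illustration using the two-way mean-square estimators noted in Sect. 2.1) shows that Pearson’s r and the consistency ICC are both 1.00 for these scores, whereas the absolute-agreement ICC drops to about 0.77 because of the constant difference of 10.

```python
import numpy as np

# The example from the text: the second data set is always 10 higher than the first
x = np.array([[10, 20], [20, 30], [30, 40], [40, 50]], dtype=float)
n, k = x.shape
grand = x.mean()

ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
ss_total = ((x - grand) ** 2).sum()
msr = ss_rows / (n - 1)
msc = ss_cols / (k - 1)
mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

pearson = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
consistency = (msr - mse) / (msr + (k - 1) * mse)
absolute = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

print(f"Pearson r:       {pearson:.2f}")      # 1.00 -- sequence only
print(f"ICC consistency: {consistency:.2f}")  # 1.00 -- ignores the constant offset
print(f"ICC absolute:    {absolute:.2f}")     # ~0.77 -- penalizes the offset of 10
```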

2.4 Concluding remarks about models, forms, and types


There is some inconsistency among authorities, and indeed within computer-
generated output, concerning the relative size of ICCs produced by the different
models. For example, Landers (2015) has claimed that Model 1 always produces the
smallest ICCs, but Shrout and Fleiss (1979) stated that it does so only “on average”.
Furthermore, although Portney and Watkins (2009) stated that ICCs produced within
Model 2 are usually smaller than are those for Model 3, a slightly different angle was
offered by Shrout and Fleiss who indicated that sometimes these models produce
identical output, and another, more extreme, view was offered by Hallgren (2012)
who stated that SPSS always produces identical output for Models 2 and 3. The fact
that Models 2 and 3 can readily, if not always, produce identical output will be
demonstrated in the example in Section 3 below.

Computer output almost always differs depending on whether the type is consistency or absolute agreement, and whether the form
is 1 (single measures) or > 1 (averaged measures). These differences might not be
negligible. Krebs (1986) cited research in which the differences might be 20-fold
despite being based on the same data. The differences between ICCs might be
dependent on such things as the number of measurements being analyzed, the extent
to which the data sets are matched (i.e. the extent of agreement/disagreement
between scores), and the amount of variability within each set of measurements. The
extent to which assumptions for parametric tests, such as normality of data
distributions and absence of outliers, are met, as well as the number of raters and
range of scores, might also play a part (Müller and Büttner 1994). This is a field that
is ripe for systematic research, and one that inevitably requires statistical
sophistication to unravel adequately.

Because of the differences that might be produced for different data sets, it is
important that researchers choose their models correctly, decide whether they should
use consistency or absolute agreement for their analyses (i.e. the type of ICC), and
then report the correct output according to the form that applies. This should ensure
that their procedures and conclusions are in order because each of these choices
could have consequences for the ICC values that are most appropriate. In addition,
within the text of their publications, authors should clearly indicate which model(s),
form(s), and type(s) they used because doing so assures consumers of the research
that the correct ICCs had, hopefully, been applied. Providing complete information
about ICCs also indicates succinctly what assumptions about generalizability had
been made in choosing a particular ICC analysis, whether the data had been averaged
prior to ICC analysis, and whether the results were appropriately interpreted in terms
of the ICC on which they were based. For example, by indicating that the ICCs apply
to Model 2, researchers signal that they intend their results to be generalizable
beyond their particular raters and are entitled to interpret their results in light of that,
whereas in Model 3 the results are not intended to be generalized—even if the results
might have wider application. The first example that follows supports and illustrates
much of what has been presented above and provides additional insights about ICCs.
The subsequent example provides further insights, particularly concerning the
interpretation of ICCs.

3 Example 1: Producing and examining ICCs


The example in this section demonstrates that it is possible to obtain a variety of
results for the same set of data depending on the selections made for analysis, that
researchers need to make the appropriate choices when determining what to seek
from a software package, and that care must be taken to interpret the output
appropriately. All of these aims can be demonstrated by using unpublished data from
the doctoral research by McAra (2015). As part of that research, a single toe-brachial
index (TBI) reading taken from one foot was compared with a single TBI reading
from the other foot—both readings made by the same tester within 10 to 15 minutes
of each other. The readings chosen for analysis were the lowest (or equal lowest) of
three from each foot on the basis that they were least likely of the three to be
distorted by measurement artifacts. The data were subsequently organized so that the
first reading of each pair was from the foot with the lower (or in some cases equal
lowest) TBI reading.

These data were analyzed in 10 different ways in SPSS to yield results that
represent all available combinations of model, form, and type. The results are
summarized in Table 2. There it is obvious that, as indicated above, it is possible to
obtain identical, similar, or substantially different output for the 10 different versions of ICC with the same data. In this case, six different ICCs were produced. They range
from 0.51 to 0.87. As predicted by some authorities, Model 1 had the lowest ICCs,
and Models 2 and 3 were identical. The consistency option always produced higher
ICCs than did the absolute option, and, as pointed out above, its yielding higher
values should not be surprising given that the consistency option focuses on
sequencing of data rather than both sequencing and size. In that light, it is interesting
that the Pearson product moment correlation coefficient (calculated separately) is
equal to the ICC of 0.78, the second-highest ICC, produced with single measures and
consistency agreement in both Models 2 and 3. The only ICCs that exceeded the Pearson correlation were those that assumed the data had been averaged.
Entries in Table 2 also confirm the expectation that, other things being equal, ICCs
based on single measures are lower than are those based on averaged measures.

Table 2 ICCs produced by the same data depending on model, form, and type

                     Model 1                Model 2                Model 3
                     (1-way random)         (2-way random)         (2-way mixed)
Type (agreement)     Single     Average     Single     Average     Single     Average
                     (1,1)      (1,k)       (2,1)      (2,k)       (3,1)      (3,k)

Consistency          Not applicable         0.78       0.87        0.78       0.87
Absolute             0.51       0.68        0.57       0.73        0.57       0.73

In this context, the appropriate ICC is (3,1) with absolute agreement because the
tester was the same (fixed) for all readings and the readings were not averaged. As a
result, the appropriate ICC for these data is the second lowest, namely 0.57.
However, with inappropriate decision making and an uninformed desire to claim a
high degree of similarity between the two feet, the highest value of 0.87 (in both
Models 2 and 3) could quite readily have been selected. Doing that might be
regarded as justifiable because it was supported by a statistical process. However,
this would have been misguided for two reasons: consistency in agreement should
not have been sought, and the readings had not been averaged.

The scatter plot in Figure 1, based on the data referred to above, is also
informative for present purposes. It provides a visual representation of this particular
set of data and therefore helps to determine whether an ICC as low as 0.57 should be
accepted as likely to reflect the data accurately, or whether an ICC as high as 0.87
might be more acceptable. The data points and line of best fit in that scatter plot
indicate that there is a general tendency for TBIs on one foot to correspond with
those on the other foot in that a low score on one foot is usually associated with a
low score on the other foot, and vice versa. (There are some outliers, as is the case in
many data sets.) At a quick glance, there is a moderately strong relationship between
the paired readings. The apparent strength of this relationship is supported by the
already-noted high Pearson correlation coefficient of 0.78 and the ICC of 0.87, both
of which could be regarded as appropriate given the pattern in the data. A closer
inspection of the scatter plot, however, reveals that although TBIs on one foot were
sometimes equal to TBIs on the other foot (thus the rising diagonal pattern of data
points from the bottom of the plot), the readings on the X-axis are generally
noticeably lower than are those on the Y-axis. This is demonstrated most clearly by
the line of best fit intersecting the Y-axis at approximately 0.28 when the X-axis is at
approximately 0.08, notionally producing a substantial TBI interfoot difference of
0.20 at the lowest point. Because the TBI differences at higher levels were not as
great, the average difference across all 97 readings was only 0.12, however.
Nevertheless, the differences ranged from 0.00 to 0.45, and 18% of these differences
were larger than 0.20. Differences of this magnitude in TBI values are not trivial
(they were significant at p < 0.001), so the Pearson correlation coefficient of 0.78 and
the ICC of 0.87 are obviously misleading. The difference in readings between the two feet is sufficiently large that the “legitimate” ICC(3,1) of 0.57 is therefore likely to be a much more accurate indication of the extent of similarity. It is certainly more so than are the ICCs in Models 2 and 3 in Table 2 that were based merely on simply ordering the data from lowest to highest (the consistency solutions) or on data that were assumed to have been averaged.

[Figure 1 about here: scatter plot; Y-axis: TBI from foot with higher TBI reading; X-axis: TBI from foot with lower TBI reading]

Figure 1 Scatter plot of TBI readings from each foot within 10 to 15 minutes of each other. These were the lowest of three readings on each foot. The associated Pearson correlation coefficient is 0.78, and the ICC is 0.57. The ascending diagonal “bottom” line of entries is a result of the foot on the X-axis having been designated the lower TBI foot; therefore, by definition there can be no values on the Y-axis that are lower than the values on the X-axis, but there were a small number of readings that were equal for both feet.
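
As a side note, the interfoot difference summaries reported above (the mean difference, the range of differences, and the proportion exceeding 0.20) are simple descriptive checks. The sketch below shows one way to obtain them; the arrays are invented stand-ins, not the McAra data.

```python
import numpy as np

# Invented stand-in for 97 paired TBI readings (lower-TBI foot vs the other foot)
rng = np.random.default_rng(1)
lower_foot = rng.uniform(0.1, 0.9, 97)
higher_foot = lower_foot + rng.uniform(0.0, 0.3, 97)

diffs = higher_foot - lower_foot
print(f"mean interfoot difference: {diffs.mean():.2f}")
print(f"range of differences: {diffs.min():.2f} to {diffs.max():.2f}")
print(f"% of differences > 0.20: {100 * (diffs > 0.20).mean():.0f}%")
```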

An additional important component of ICC interpretation involves inspection of confidence intervals. In this case, the 95% confidence interval ranged from 0.0 to
0.82 and was therefore obviously negatively skewed given the obtained ICC of 0.57.
This indicates, very informatively, that in repeated testing with similar samples, in
95% of cases some of the samples’ ICCs might extend upward to 0.82, but in more
samples they are likely to extend down toward zero. This provides further
confirmation that the ICC value of 0.57 indicates an appropriately low degree of
similarity between the two sets of scores.
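
The intervals referred to here are those reported by SPSS, which are based on F distributions. For readers without SPSS, a rough way to inspect interval width is a percentile bootstrap over participants, sketched below; this is a different (resampling) technique, not the one used in the article, and the data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def icc_a1(x):
    """Two-way, single-measures, absolute-agreement ICC (see the Sect. 2.1 sketch)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Invented paired readings standing in for 97 participants measured twice
base = rng.normal(0.60, 0.15, 97)
data = np.column_stack([base, base + rng.normal(0.10, 0.08, 97)])

point = icc_a1(data)
boot = [icc_a1(data[rng.integers(0, len(data), len(data))]) for _ in range(2000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"ICC = {point:.2f}, 95% bootstrap percentile CI [{low:.2f}, {high:.2f}]")
```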

Apart from demonstrating that researchers should make choices advisedly when
seeking ICC analyses and selecting entries from computer output, this example
reveals that the possibility of incorrect reporting is increased if researchers do not
inspect their data and confidence intervals carefully and that incorrect reporting
might create inaccurate and unfounded impressions for both themselves and others.
This is pursued in the next example.

4 Example 2: Extending the interpretation of ICC values


This second example demonstrates that once ICC values have been obtained,
problems related to interpretation of values might occur. The main source of these
problems is that there are different sets of criteria for describing ICCs. Of these
criteria, the two most common are those of Fleiss (1986) and Portney and Watkins
(2009). According to Fleiss, ICCs < 0.40 are poor, those from 0.40 to 0.75 are fair to
good, and those > 0.75 are excellent. The criteria proposed by Portney and Watkins
are more conservative, particularly at the upper end, with < 0.75 poor to moderate,
≥ 0.75 good, and > 0.90 “reasonable for clinical measurements”. The difference is
most noticeable in that researchers who use the Fleiss categories will regard all ICCs
above 0.75 as excellent, whereas Portney and Watkins characterize ICCs from 0.75
up to 0.90 more reservedly as merely good, and they avoid use of the word excellent
altogether.
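
For readers who want the two sets of criteria side by side, the small sketch below translates an ICC value into the verbal descriptor each scheme would attach to it. The thresholds follow the wording above as literally as possible; how values falling exactly on a boundary such as 0.75 or 0.90 should be treated is an assumption, since the sources are not explicit about every edge case.

```python
def describe_icc(icc):
    """Verbal descriptors for an ICC under the two criterion sets cited in the text.

    Boundary handling (e.g., exactly 0.75 or 0.90) is an assumption.
    """
    if icc < 0.40:
        fleiss = "poor"
    elif icc <= 0.75:
        fleiss = "fair to good"
    else:
        fleiss = "excellent"

    if icc < 0.75:
        portney_watkins = "poor to moderate"
    elif icc <= 0.90:
        portney_watkins = "good"
    else:
        portney_watkins = "reasonable for clinical measurements"
    return fleiss, portney_watkins

for value in (0.57, 0.80, 0.92):
    f, pw = describe_icc(value)
    print(f"ICC {value:.2f}: Fleiss -> {f}; Portney & Watkins -> {pw}")
```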

These two sets of criteria seem to have acquired an unquestioned status despite
Fleiss asserting that “no universally applicable standards are possible for what
constitutes poor, fair, or good reliability” and Portney and Watkins stating that their
categories should be regarded as guidelines only and that the specific context of a
study should be taken into account. Nevertheless, for clinical situations, Nunnally
and Bernstein (1994) recommend that ICCs should attain at least 0.90, thus
concurring with Portney and Watkins’ general guidelines with regard to clinical
measurements.

Whether the Fleiss or Portney and Watkins set of categories is preferable for
clinical settings can be put to the test by again using data from the McAra (2015)
doctoral research. A different part of those data is used in this instance because of the
additional insights that can be drawn from them. In this case, the data comprise the
first two of three readings taken within approximately 1 min of each other from the
foot with the lower (or one of the feet with an equally low) TBI reading. All TBI
readings were obtained with automated devices and the toe-measurement cuffs
(occlusion and sensor) were kept in position for all three readings of a set. Data for
97 people are represented in the scatter plot in Figure 2. Although there is only one
major outlier, an initial glance suggests that most data do not depart noticeably from
the line of best fit. Furthermore, the line of best fit intersects with the Y-axis (at 0.20)
at much the same location on the X-axis (approximately 0.18), and many of the
readings are close to each other: For example, the left-most data point represents
readings of approximately 0.25 and 0.22, and the right-most point represents
readings of approximately 0.80 and 0.90. Therefore, the data appear to be well
matched. However, some of the “moderate” outliers are quite large. In the lower left
quadrant of the scatter plot, for example, there is an outlier with paired readings of
approximately 0.19 and 0.38, and another outlier is in the upper right quadrant where
the paired readings are approximately 0.83 and 0.63. For these TBIs, therefore, there
are obviously some major discrepancies between the two readings despite most of
them not departing far from the line of best fit. (The point at which particular
measurements depart sufficiently from other data to be regarded as outliers, whether
outliers should be removed from data sets prior to substantive analyses, and whether
outliers are likely to produce misleading ICCs rather than ICCs from which
appropriate conclusions can be drawn, are important issues, but they extend beyond
the focus of this article.)

[Figure 2 about here: scatter plot of the first TBI reading against the second TBI reading]

Figure 2 Scatter plot of the first two of three TBI readings. These readings were
taken approximately one minute apart from each other on the foot with the lower
(or one of the feet with an equally low) TBI, and the associated ICC is 0.80.

For the data in Figure 2, an ICC(3,1) absolute agreement is again appropriate, this
time because there was only one rater who took each of the single readings. The
ICC(3,1) absolute agreement value for these data is 0.80. The initial question that
arises is how this scatter plot should be described verbally, as opposed to statistically.
As indicated above, the answer should be determined by the context within which the
research applies—in this case the diagnostic and predictive situations that researchers
and clinicians would face within podiatry. Therefore, it is arguable that the Portney
and Watkins criteria, with their focus on clinical contexts, would be more applicable
than would the Fleiss categories. This means that the relationship between the
readings would be described as good. The scatter plot alone reveals that an ICC of
0.80 represents a relationship between two variables that is well below what seems
appropriate for describing it as excellent, which would be the case if the Fleiss
criteria were used. In essence, there should not be as much discrepancy between the
TBIs as is evident in Figure 2 if a claim is to be made that TBIs are highly similar to
each other, let alone carry an excellent degree of similarity. This is supported by the
95% confidence interval, which in this case extends from 0.72 to 0.86, suggesting
that, although ICCs from similar samples would differ little from the obtained ICC of
0.80, it is unlikely that any of those ICCs would be as high as the value of 0.90 that
would qualify them as being acceptable within clinical contexts.

Two main points emerge from the above analysis. First, again it seems important
to inspect data and confidence intervals carefully to gain a sense of the conclusions
that can be most appropriately drawn from them, including what reservations might
need to be entertained. Second, a downscaling should occur in the claims that some
researchers make about their results if those results are intended to apply to clinical
settings. Although there are temptations not only to seek and disseminate ICC values that are the highest that statistics software produces but also to interpret those values with a categorization scheme that is generous or enhancing, in some research the
word excellent should be converted to become merely good, and descriptions of fair
and good should become poor to moderate. This is not merely a superficial
downward shift in the nature of the adjectives. The descriptors used could have
important consequences concerning the claims that researchers make of their results
and the impressions that consumers of research, including policy makers, are
encouraged to accept.

The above is not intended to imply that the Fleiss categories should be discarded.
As indicated above, different criteria might apply in different contexts. Fleiss’s
categories are likely to be appropriate for data in some disciplines and contexts, for
example, where strong associations are not anticipated because of the range of
variables that are expected to influence human thinking, physiological functioning,
or behavior.

There is an additional important point that can be drawn from the above analysis.
Because the measurement cuffs remained in place for each pair of readings, the rater
was not a source of variation between the readings. Instead, the only sources of
variability between the pairs of readings could have been either variation from within
each participant, or instrumentation variation/error. Most studies of intrarater and
interrater reliability do not seem to take these sources of variation into account, yet
they almost inevitably play some role. The ICC of 0.80 in this case is a clear
demonstration of that. The readings in the above scenario have no relationship with
either intra- or inter-rater reliability. However, if the same readings had been made
by a single rater at two points in time, or two raters on the same occasion, it might be
tempting for researchers to interpret the ICC solely in terms of intrarater reliability in
the former case or interrater reliability in the latter. These are clearly limited
perspectives.

5 Conclusion
Thirty years ago, Krebs (1986) admonished researchers to “declare” their ICC types.
More recently, Kottner et al. (2011) did the same, as did Lee et al. (2012), who found
that only 5% of the studies they reviewed contained sufficient information about the
ICCs that had been used within them. Clearly, Krebs’ original call has not been
heeded. One of the main purposes of this article, therefore, is to encourage
researchers to reveal the nature of their ICC analyses to a much greater extent, and
also to encourage reviewers and editors to require full and accurate information from
authors.

An inspection of five podiatric studies (Romanos et al. 2010; Scanlon et al. 2012;
Sonter, Chuter, and Casey 2015; Sonter, Sadler, and Chuter 2015; Tehan et al. 2015)
in which rater reliability had been assessed for TBIs is revealing. In three of these
studies there is no indication about either the model or form on which the ICCs were
calculated, and in the two remaining studies where ICC information is provided the
researchers used a form greater than 1 without any evidence in the text that the
relevant TBI readings had been averaged. This suggests that they reported ICCs that
were based on average, not single, measures and therefore that their reported ICCs
would have been inflated. Four of the five studies used the Fleiss categories for
interpreting the values and therefore made what might be regarded as exaggerated
claims about the extent of similarity in their data if they intended their research to
apply to clinical settings—which presumably they did. In none of these studies do
the researchers indicate whether their ICCs were based on consistency or absolute
agreement, and only two of the five sets of researchers acknowledge that participant
variability or instrumentation error might be part of the complete picture.

The unsatisfactory state of affairs that these studies reveal can be remedied only if
researchers, reviewers, and editors are more aware about the nature of ICCs and the
importance of authors providing complete information when reporting them.
Reviewers and editors might also need to be less beguiled by words such as excellent
when ICCs are described. Kottner et al. (2011) not only call for more informative
descriptions of the procedures used within reliability studies and more complete
reporting of results, but also propose a comprehensive set of requirements that
authors might be expected to abide by when research involving rater reliability is
reported. Implementing these requirements can only be beneficial. Some journals
will no longer accept articles that fail to adhere to basic statistical conventions and
are also showing a greater willingness to accept articles that contain “nonresults”. In
line with these recommendations and policies, it might be necessary to require
adequate reporting of ICCs and their associated confidence intervals before an article
is accepted for publication and to have a greater appreciation and acceptance of
research that appears to be unimpressive because of its low ICC results. Unless these
procedures and practices are put in place, there is a perpetuated risk that incorrect
decisions might be made about matters that are of crucial importance for people’s
health.

Compliance with ethical standards

Financial disclosure No financial support was received for this article.

Conflicts of interest None.

Human and animal rights This article does not contain any studies with human

participants performed by the author.



Acknowledgements

Michela Betta, Sylvia McAra, Melissa Nott, Rod Pope, Therese Schmid, and two
anonymous reviewers provided valuable comments and feedback concerning earlier
drafts of this article. Associate Professor Pope also engaged in thought-provoking
interchanges, drew my attention to literature about ICCs that I had not been aware of,
and shared computer output that provided confirming insights. Dr McAra granted
permission to use data that had been acquired as part of her doctoral research.

References
Brozek, J., Alexander, H.: Components of variance and the consistency of repeated
measurements. Res Quart 18, 152–166 (1947)

Carrasco, J.L., Jover, L.: Estimating the generalized concordance correlation coefficient
through variance components. Biometrics 59, 849–858 (2003)

Fleiss, J.: The Design and Analysis of Clinical Experiments. John Wiley and Sons,
New York (1986)

Franco, A., Malhotra, N., Simonovits, G.: Publication bias in the social sciences:
unlocking the file drawer. Science 345, 1502–1505 (2014).
doi:10.1126/science.1255484

Hallgren, K.A.: Computing inter-rater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol 8(1), 23–34 (2012)

Hayen, A., Dennis, R., Finch, C.: Determining the intra- and inter-observer reliability of
screening tools used in sports injury research. J Sci Med Sport 10, 201–210 (2007).
doi:10.1016/j.jsams.2006.09.002

IBM: SPSS Statistics for Windows, Version 20.0. IBM Corp., Armonk, NY (2011)

Kottner, J., Audigé, L., Brorson, S., Donner, A., Gajewski, B.J., Hróbjartsson, A.,
Roberts, C., Shoukri, M., Streiner, D.L.: Guidelines for reporting reliability and
agreement studies (GRRAS) were proposed. J Clin Epidemiol 64, 96–106 (2011).
doi: 10.1016/j.jclinepi.2010.03.002

Krebs, D.E.: Declare your ICC type. Phys Ther 66, 1431 (1986)

Landers, R.N.: Computing intraclass correlations (ICC) as estimates of interrater reliability in SPSS. The Winnower 2:e143518.81744 (2015). doi:10.15200/winn.143518.81744

Lee, K.M., Lee, J., Chung, C.Y., Ahn, S., Sung, K.H., Kim, T.W., Lee, H.J., Park, M.S.:
Pitfalls and important issues in testing reliability using intraclass correlation
coefficients in orthopaedic research. Clin Orthop Surg 4, 149–155 (2012).
doi:10.4055/cios.2012.4.2.149

McAra, S.J.: Glyceryl trinitrate and toe-brachial indexes in pedal ischaemia. Unpublished PhD thesis. Charles Sturt University, Albury, Australia (2015). Available from http://primo.unilinc.edu.au/CSU:CSU_ALL:dtl_csu80101

McGraw, K.O., Wong, S.P.: Forming inferences about some intraclass correlation
coefficients. Psychological Methods 1, 30–46 (1996)

Müller, R., Büttner, P.: A critical discussion of intraclass correlation coefficients. Stat
Med 13, 2465–2476 (1994)

Nunnally, J.C., Bernstein, I.H.: Psychometric Theory, 3rd ed. McGraw Hill, New York
(1994)

Portney, L.G., Watkins, M.P.: Foundations of Clinical Research: Applications to Practice, 3rd ed. Pearson Education, Upper Saddle River, NJ (2009)

Rankin, G., Stokes, M.: Reliability of assessment tools in rehabilitation: an illustration of appropriate statistical analyses. Clin Rehabil 12, 187–199 (1998). http://dx.doi.org/10.1191/026921598672178340

Romanos, M.T., Raspovic, A., Perrin, B.M.: The reliability of toe systolic pressure and
the toe brachial index in patients with diabetes. J Foot Ankle Res 3, 31 (2010).
doi:10.1186/1757-1146-3-31

Raykov, T., Dimitrov, D.M., von Eye, A., Marcoulides, G.A.: Interrater agreement evaluation: a latent variable modeling approach. Educ Psychol Meas 73, 512–531 (2013). doi:10.1177/0013164412449016

Scanlon, C., Park, K., Mapletoft, D., Begg, L., Burns, J.: Interrater and intrarater
reliability of photoplethysmography for measuring toe blood pressure and toe-
brachial index in people with diabetes mellitus. J Foot Ankle Res 5, 13 (2012).
doi:10.1186/1757-1146-5-13

Shrout, P.E., Fleiss, J.L.: Intraclass correlations: uses in assessing rater reliability.
Psychol Bull 86, 420–428 (1979)

Sonter, J., Chuter, V., Casey, S.: Intratester and intertester reliability of toe pressure
measurements in people with and without diabetes performed by podiatric
physicians. J Am Podiatr Med Assoc 105, 201–208 (2015). doi:10.7547/0003-0538-105.3.201

Sonter, J., Sadler, S., Chuter, V.: Inter-rater reliability of automated devices for
measurement of toe systolic blood pressure and the toe brachial index. Blood Press
Monit 20, 47–51 (2015). doi:10.1097/MBP.0000000000000083

Tehan, P.E., Bray, A., Chuter, V.H.: Non-invasive vascular assessment in the foot with
diabetes: sensitivity and specificity of the ankle brachial index, toe brachial index
and continuous wave Doppler for detecting peripheral arterial disease. J Diabetes
Complicat 30, 155–160 (2015). http://dx.doi.org/10.1016/j.jdiacomp.2015.07.019

Wrobel, J.S., Armstrong, D.G.: Reliability and validity of current physical examination
techniques of the foot and ankle. J Am Podiatr Med Assoc 98, 197–206 (2008)
