
Measures of Accuracy of Screening Tests

Introduction

Screening can be defined as the application of a medical procedure or test to a defined population in order to identify people who are in the pre-clinical phase of a particular disease, or who as yet have no signs or symptoms of it, for the purpose of determining their likelihood of having the disease. The screening procedure itself does not diagnose the illness, but a positive result flags the person for further evaluation with subsequent diagnostic tests or procedures. The goal of screening is to detect disease in its earliest phase, when treatment is usually more successful, so that proper treatment or management can reduce morbidity or mortality from the disease. Some common examples of screening tests are the Pap smear, mammogram, clinical breast examination, blood pressure measurement, cholesterol level, eye examination/vision test, and urinalysis.

Screening vs. Diagnostic Tests

Screening tests differ from diagnostic tests in meaning, application and use. Diagnostic tests are usually performed on individuals with a symptom or sign of an illness, whereas screening tests are applied to apparently healthy individuals with no such symptoms or signs. A screening test is usually applied to a large population at once, whereas a diagnostic test is applied to a single patient at a time. Diagnostic tests are more accurate and more expensive than screening tests, which are less accurate and less expensive. A diagnostic test provides a basis for initiating treatment, whereas a screening test does not.

Natural History of Disease

The application and effectiveness of a screening test or programme depend upon the natural course of the disease. Screening is not very effective from a public health point of view if the natural course of the disease is short, as in acute illness, where the detectable pre-clinical (latent) period is very brief.

Criteria for a Screening Programme


1. Life-threatening diseases and those known to have serious and irreversible consequences if not treated early are appropriate for screening: for example, a life-threatening disease such as lung cancer, or a disease with irreversible consequences such as hypothyroidism.
2. Treatment of diseases at their earlier stages should be more effective than treatment begun
after the development of symptoms.

3. The prevalence of the detectable preclinical phase of disease has to be high among the
population screened. High prevalence would reduce the relative costs of the screening program
and increase positive predictive value.
4. A screening program that finds diseases that occur less often could only benefit few
individuals. Such a program might prevent some deaths. While preventing even one death is
important, given limited resources, a more cost-effective program for diseases that are more
common should be given a higher priority, because it will help more people.
5. In some cases though, screening for low prevalence diseases is also cost effective, if the cost
of screening is less than the cost of care if the disease is not detected early.
6. A suitable screening test must be available. Suitability criteria include adequate sensitivity and specificity, low cost, ease of administration, safety, minimal discomfort upon administration, and acceptability to both patients and practitioners.
7. There must also be appropriate follow-up of those individuals with positive screening results
to ensure thorough diagnostic testing occurs.

Characteristics of the Disease and the Screening Test


1. The given health condition or disease should be an important health problem.
2. There should be a treatment for the disease or condition.
3. There should be proper facilities for diagnosis and treatment after screening.
4. There should be a latent or asymptomatic stage of the disease.
5. The natural history of the disease should be adequately understood.
6. There should be an agreed policy on whom to treat.
7. There should be a continuous process of case-finding in the population, not just a once-and-for-all project.
8. The test used should be sensitive.

9. The test should be inexpensive.
10. The test should be minimally invasive and cause little or no pain or discomfort.
11. The test should be easy to administer and socially acceptable.
12. The test should be reliable, i.e. give consistent results on repeated testing.
13. The test should be valid, i.e. able to distinguish between diseased and non-diseased people.
Test Reliability (Consistency)

A screening test is considered reliable if it gives consistent results with repeated tests.
Variability in the measurement can be the result of physiologic variation or the result of
variables related to the method of testing. For example, if one were using a sphygmomanometer
to measure blood pressure repeatedly over time in a single individual, the results might vary
depending on:

Biological variability (BP normally varies within an individual).
Instrument variability (is the sphygmomanometer itself reliable?).
Intra-observer variability (does a given tester perform the test the same way each time?).
Inter-observer variability (do different testers perform the test the same way as each other?).

The reliability of any test can potentially be affected by one or more of these factors.

Test Validity (Accuracy)

Validity is the ability of a test to correctly measure what it intends to measure: it should correctly identify diseased and non-diseased persons. In diseased persons it should give a positive result, and in non-diseased persons it should give a negative result. The validity of a test can be assessed if the test results can be compared either with a true measure of the physiologic, biochemical, or pathologic state of the disease, or with the occurrence of disease progression or a disease complication that the test result seeks to predict (1).

The diagnostic accuracy of a screening test answers the following question: how well does this test discriminate between the two conditions of interest, e.g. diseased versus healthy, or two stages of a disease? This discriminative ability can be quantified by the measures of diagnostic accuracy:
Sensitivity and specificity
Positive and negative predictive values (PPV, NPV)
Likelihood ratios
Area under the ROC curve (AUC)

Different measures of diagnostic accuracy relate to the different aspects of diagnostic


procedure. Some measures are used to assess the discriminative property of the test; others are
used to assess its predictive ability (2). While discriminative measures are mostly used for health policy decisions, predictive measures are most useful in predicting the probability of a

disease in an individual (3). Furthermore, it should be noted that measures of test performance are not fixed indicators of test quality. Measures of diagnostic accuracy are
very sensitive to the characteristics of the population in which the test accuracy is evaluated.
Some measures largely depend on the disease prevalence, while others are highly sensitive to
the spectrum of the disease in the studied population. It is therefore of utmost importance to
know how to interpret them as well as when and under what conditions to use them (4).

A 2 x 2 table, or contingency table, is also used when testing the validity of a screening test,
but note that this is a different contingency table than the ones used for summarizing cohort
studies, randomized clinical trials, and case-control studies. The 2 x 2 table below shows the
results of the evaluation of a screening test for diseased and non-diseased subjects.

                              Gold Standard
Test Result        Diseased        Not Diseased        Total
Test Positive      a (TP)          b (FP)              a + b
Test Negative      c (FN)          d (TN)              c + d
Totals             a + c           b + d               a + b + c + d = N

The contingency table for evaluating a screening test lists the true disease status in the columns,
and the observed screening test results are listed in the rows.

The table above shows the results of a screening test. There are a + c subjects who are ultimately found to have the disease, and b + d subjects who remained free of disease during the study. Among the a + c subjects with disease, a have a positive screening test (TP, true positives), but c have negative tests (FN, false negatives). Among the b + d subjects without disease, d have negative screening tests (TN, true negatives), but b incorrectly have positive screening tests (FP, false positives).

Based on the outcomes recorded in the contingency table above, we can define the different measures of diagnostic accuracy of the test.

Sensitivity and Specificity

A. Sensitivity: It is defined as the ability of the test to correctly identify diseased subjects as test positive. It is the conditional probability P(T+ | D+) of getting a positive test result (T+) in diseased subjects (D+). Hence, it reflects the potential of a test to recognise subjects with the disease. Numerically, it is estimated as the conditional probability

Sensitivity = P(T+ | D+) = a / (a + c).

It is usually expressed as a percentage; for example, a sensitivity of 80% means that 80 out of every 100 diseased subjects test positive.

B. Specificity: It is defined as the ability of the test to correctly identify healthy or non-diseased subjects as test negative. It is the conditional probability P(T- | D-) of getting a negative test result (T-) in non-diseased subjects (D-). Hence, it reflects the potential of a test to recognise subjects without the disease. Numerically, it is estimated as the conditional probability

Specificity = P(T- | D-) = d / (b + d).

It is usually expressed as a percentage; for example, a specificity of 80% means that 80 out of every 100 non-diseased subjects test negative.
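The two definitions translate directly into code from the cell counts of the contingency table. The sketch below is illustrative only; the counts are hypothetical, not data from this chapter.

```python
# Sensitivity and specificity from a 2 x 2 screening table,
# using the cell names defined above: a = TP, b = FP, c = FN, d = TN.

def sensitivity(a, c):
    """P(T+ | D+): proportion of diseased subjects who test positive."""
    return a / (a + c)

def specificity(b, d):
    """P(T- | D-): proportion of non-diseased subjects who test negative."""
    return d / (b + d)

# Hypothetical counts: 80 of 100 diseased test positive,
# 80 of 100 non-diseased test negative.
print(sensitivity(80, 20))   # 0.8, i.e. 80%
print(specificity(20, 80))   # 0.8, i.e. 80%
```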

It was long a common notion that neither sensitivity nor specificity depends on, or is influenced by, the disease prevalence, implying that estimates obtained in one study population can be transferred directly to another population with a different prevalence. In practice, however, sensitivity and specificity often vary with prevalence, most likely through mechanisms that affect both prevalence and test performance, such as the patient spectrum (5). Investigators should therefore consider the intended use of the test when designing a study of test accuracy, and specify the inclusion criteria that define the study population accordingly (6).

Along with sensitivity and specificity, accuracy is an important indicator of the diagnostic ability of a screening test. Accuracy is the proportion of true results, whether true positive or true negative, in a population; it measures the overall degree to which the test classifies subjects correctly. Numerically,

Accuracy = (TP + TN) / (TP + FP + FN + TN) = (a + d) / (a + b + c + d).

In addition to the equation shown above, accuracy can be obtained from sensitivity and specificity when the prevalence is known. Prevalence is the probability of disease in the population at a given time:

Accuracy = (sensitivity) x (prevalence) + (specificity) x (1 - prevalence).

The numerical value of accuracy represents the proportion of correct results (both true positives and true negatives) in the selected population; an accuracy of 99% means that 99% of the time the test result, whether positive or negative, is correct. However, the equation above shows that accuracy depends not only on sensitivity and specificity but also on how common the disease is in the selected population. For a rare condition, accuracy is dominated by specificity, so a test can show high overall accuracy even when its sensitivity is poor and most cases are missed. Accuracy therefore needs to be interpreted cautiously (7).
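The prevalence form of the accuracy equation can be checked with a short sketch. The sensitivity, specificity and prevalence values below are hypothetical, chosen to show how a rare disease lets overall accuracy stay high even when sensitivity is poor.

```python
def accuracy(sens, spec, prev):
    """Accuracy = sensitivity x prevalence + specificity x (1 - prevalence)."""
    return sens * prev + spec * (1 - prev)

# Hypothetical rare disease: sensitivity 50%, specificity 99%, prevalence 1%.
# Half the diseased subjects are missed, yet overall accuracy is 98.5%.
print(round(accuracy(0.50, 0.99, 0.01), 4))  # 0.9851
```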

Predictive Value

The validity of a test can also be expressed as the extent to which being categorized as positive or negative actually predicts the presence or absence of the disease, i.e. the ability of a test to predict the presence of disease among those who test positive and the absence of disease among those who test negative.

Positive Predictive Value (PPV): It is the proportion of those with a positive test who have the disease, i.e. the probability that a subject has the disease given a positive screening test result. In terms of Bayes' theorem, it is expressed as

                        P(T+|D+) P(D+)
PPV = P(D+|T+) = ------------------------------------------
                  P(T+|D+) P(D+) + P(T+|D-) P(D-)

                          Sensitivity x Prevalence
    = ---------------------------------------------------------------------
      Sensitivity x Prevalence + (1 - Specificity) x (1 - Prevalence)

    = a / (a + b)

PPV depends on the sensitivity and specificity of the test and on the prevalence of disease in the population. For a given sensitivity and specificity, the PPV increases as the prevalence of disease in the population increases.

Let us consider a screening test with sensitivity of 80% and specificity of 90% used in populations of 10,000 individuals with 5%, 10% and 15% prevalence of disease respectively. We see that the PPV of a test with the same sensitivity and specificity increases as the prevalence of disease increases (Table-2).

Table-2 : 2 x 2 Contingency Tables with increasing prevalence

With 5 % Prevalence
Test D+ D- Total PPV
Result
T+ 400 950 1350 29.6%
T- 100 8550 8650
Total 500 9500 10000
With 10 % Prevalence
D+ D- Total
T+ 800 900 1700 47.05%
T- 200 8100 8300
Total 1000 9000 10000
With 15 % Prevalence
D+ D- Total
T+ 1200 850 2050 58.53%
T- 300 7650 7950
Total 1500 8500 10000
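Table-2 can be reproduced from the Bayes form of the PPV formula. The sketch below is illustrative; its output agrees with the table's PPV column up to rounding in the last decimal place.

```python
def ppv(sens, spec, prev):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sens * prev              # P(T+ and D+)
    false_pos = (1 - spec) * (1 - prev) # P(T+ and D-)
    return true_pos / (true_pos + false_pos)

# Sensitivity 80%, specificity 90%, prevalence 5%, 10% and 15% (as in Table-2).
for prev in (0.05, 0.10, 0.15):
    print(f"prevalence {prev:.0%}: PPV = {ppv(0.80, 0.90, prev):.2%}")
```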

Now consider a screening test with sensitivity of 80% and prevalence of 10% but with varying specificity of 80%, 90% and 95% respectively. We see that the PPV of a test with the same sensitivity and prevalence increases as the specificity of the test increases (Table-2A).

Table-2A : 2 x 2 Contingency Tables with increasing Specificity

With 80 % Specificity
Test D+ D- Total PPV
Result
T+ 800 1800 2600 30.76%
T- 200 7200 7400
Total 1000 9000 10000
With 90 % Specificity
D+ D- Total
T+ 800 900 1700 47.05%
T- 200 8100 8300
Total 1000 9000 10000
With 95 % Specificity
D+ D- Total
T+ 800 450 1250 64.0%
T- 200 8550 8750
Total 1000 9000 10000

Finally, consider a screening test with specificity of 90% and prevalence of 10% but with varying sensitivity of 80%, 90% and 95% respectively. We see that the PPV of a test with the same specificity and prevalence increases as the sensitivity of the test increases (Table-2B).

Table-2B : 2 x 2 Contingency Tables with increasing Sensitivity

With 80 % Sensitivity
Test D+ D- Total PPV
Result
T+ 800 900 1700 47.05%
T- 200 8100 8300
Total 1000 9000 10000
With 90 % Sensitivity
D+ D- Total
T+ 900 900 1800 50.0%
T- 100 8100 8200
Total 1000 9000 10000
With 95 % Sensitivity
D+ D- Total
T+ 950 900 1850 51.35%
T- 50 8100 8150
Total 1000 9000 10000

From Tables 2, 2A and 2B we can see that the PPV rises more rapidly with increasing specificity of the test and increasing prevalence of the disease than with increasing sensitivity. Hence, the PPV is influenced more by the specificity of the test and the prevalence of the disease.

Thus, for a screening test with a given sensitivity and specificity, the rarer the disease, the lower the PPV. In this sense, PPV serves as a crude measure of relative cost efficiency: it reflects the ratio of the screening program's benefit or yield (number of TP) to the cost of misdiagnoses (FPs + FNs) for a given number of screened subjects.

Further, because PPV is more sensitive to changes in specificity than to changes in sensitivity, we can do more to improve the efficiency of a screening program, especially for a rare disease, by increasing the specificity of the test than by increasing its sensitivity.

A PPV of 50% indicates that the chance of having the disease among those who tested positive is 50%.
Negative Predictive Value (NPV): It is the proportion of those with a negative test who do not have the disease in question, i.e. the probability that a subject is non-diseased given a negative screening test result. In terms of Bayes' theorem, it is expressed as

                        P(T-|D-) P(D-)
NPV = P(D-|T-) = ------------------------------------------
                  P(T-|D-) P(D-) + P(T-|D+) P(D+)

                       Specificity x (1 - Prevalence)
    = ---------------------------------------------------------------------
      Specificity x (1 - Prevalence) + (1 - Sensitivity) x Prevalence

    = d / (c + d)

An NPV very close to 1 indicates that testing negative is reassuring as to the absence of disease and that rescreening may not be worthwhile. If the NPV falls short of 1 by an amount comparable with the pre-clinical disease prevalence, much of the pre-clinical disease pool will be missed by the screening program. A low NPV is more likely to result from poor sensitivity than from poor specificity; hence a screening test with high sensitivity will improve the NPV.
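The Bayes form of the NPV can be sketched in the same way as the PPV. Holding specificity at 90% and prevalence at 10% while raising sensitivity reproduces the NPV column of Table-3A up to rounding.

```python
def npv(sens, spec, prev):
    """Negative predictive value via Bayes' theorem."""
    true_neg = spec * (1 - prev)    # P(T- and D-)
    false_neg = (1 - sens) * prev   # P(T- and D+)
    return true_neg / (true_neg + false_neg)

# Specificity 90%, prevalence 10%, sensitivity rising from 80% to 95%:
# NPV climbs from about 97.6% to about 99.4%.
for sens in (0.80, 0.90, 0.95):
    print(f"sensitivity {sens:.0%}: NPV = {npv(sens, 0.90, 0.10):.2%}")
```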

Consider a screening test with specificity of 90% and prevalence of 10% but with varying sensitivity of 80%, 90% and 95% respectively. We see that the NPV of a test with the same specificity and prevalence increases as the sensitivity of the test increases (Table-3A).

Table-3A : 2 x 2 Contingency Tables with increasing Sensitivity

With 80 % Sensitivity
Test D+ D- Total NPV
Result
T+ 800 900 1700
T- 200 8100 8300 97.59%
Total 1000 9000 10000
With 90 % Sensitivity
D+ D- Total
T+ 900 900 1800
T- 100 8100 8200 98.78%
Total 1000 9000 10000
With 95 % Sensitivity
D+ D- Total
T+ 950 900 1850
T- 50 8100 8150 99.39%
Total 1000 9000 10000

Now consider a screening test with sensitivity of 80% and prevalence of 10% but with varying specificity of 80%, 90% and 95% respectively. We see that the NPV of a test with the same sensitivity and prevalence increases only slightly as the specificity of the test increases (Table-3B): the NPV is not very sensitive to increases in specificity.

Table-3B : 2 x 2 Contingency Tables with increasing Specificity

With 80 % Specificity
Test D+ D- Total NPV
Result
T+ 800 1800 2600
T- 200 7200 7400 97.3%
Total 1000 9000 10000
With 90 % Specificity
D+ D- Total
T+ 800 900 1700
T- 200 8100 8300 97.6%
Total 1000 9000 10000
With 95 % Specificity
D+ D- Total
T+ 800 450 1250
T- 200 8550 8750 97.7%
Total 1000 9000 10000

Finally, consider a screening test with sensitivity of 80% and specificity of 90% used in populations of 10,000 individuals with 5%, 10% and 15% prevalence of disease respectively. We see that the NPV of a test with the same sensitivity and specificity decreases as the prevalence of disease increases (Table-3C).

Table-3C : 2 x 2 Contingency Tables with increasing prevalence

With 5 % Prevalence
Test D+ D- Total NPV
Result
T+ 400 950 1350
T- 100 8550 8650 98.8%
Total 500 9500 10000
With 10 % Prevalence
D+ D- Total
T+ 800 900 1700
T- 200 8100 8300 97.6%
Total 1000 9000 10000
With 15 % Prevalence
D+ D- Total
T+ 1200 850 2050
T- 300 7650 7950 96.2%
Total 1500 8500 10000

Thus, the positive predictive value of a screening program can be improved by restricting the program to people at high risk, that is, those with a relatively high prevalence of preclinical disease, or by screening at a lower frequency so that the prevalence of preclinical disease in the target population is maintained at a higher level. Either approach leads to some overall loss of the value of screening, since fewer cases are detected and treated early (8).

Example 1: The following are the results of Pap smear and cervical biopsy performed on 600 patients attending a gynaecology OPD in a hospital. Study the 2 x 2 table and answer the questions below.

                         Cervical Biopsy
Test Result       Cancer      No Cancer      Total
Positive          96          250            346
Negative          4           250            254
Total             100         500            600

Calculate the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV).

Sensitivity = (96/100) x 100 = 96%
Specificity = (250/500) x 100 = 50%
PPV = (96/346) x 100 = 27.75%
NPV = (250/254) x 100 = 98.43%
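The four answers can be verified directly from the cell counts of the Pap smear table:

```python
# Cell counts from the Pap smear / cervical biopsy table above.
a, b, c, d = 96, 250, 4, 250   # TP, FP, FN, TN

sens = a / (a + c)   # 96/100  = 0.96
spec = d / (b + d)   # 250/500 = 0.50
ppv  = a / (a + b)   # 96/346  ~ 0.2775
npv  = d / (c + d)   # 250/254 ~ 0.9843

print(f"Sensitivity {sens:.2%}, Specificity {spec:.2%}, "
      f"PPV {ppv:.2%}, NPV {npv:.2%}")
```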

Example 2: The sensitivity of a particular home pregnancy test is 80%. If the test is used by a group of women of whom 1/3 are actually pregnant, and the positive predictive value is 50%, what is the specificity of the test?

Solution: We are given Sensitivity = 80%, PPV = 50% and Prevalence = 1/3 = 33.3%.

PPV is given by the formula

                          Sensitivity x Prevalence
PPV = ---------------------------------------------------------------------
      Sensitivity x Prevalence + (1 - Specificity) x (1 - Prevalence)

Substituting the given values and solving, we get

Specificity = 60%.
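The algebra can be checked by inverting the PPV formula to solve for specificity; the function name below is my own, introduced only for this sketch.

```python
def specificity_from_ppv(sens, prev, ppv):
    """Solve PPV = sens*prev / (sens*prev + (1-spec)*(1-prev)) for spec."""
    return 1 - sens * prev * (1 - ppv) / (ppv * (1 - prev))

# Sensitivity 80%, prevalence 1/3, PPV 50% (Example 2).
print(round(specificity_from_ppv(0.80, 1/3, 0.50), 6))  # 0.6, i.e. 60%
```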


Likelihood ratio (LR)

The likelihood ratio is a very useful measure of diagnostic accuracy. It is defined as the ratio of the probability of a given test result in subjects with a certain state or disease to the probability of the same result in subjects without the disease. As such, the LR directly links the pre-test and post-test probability of disease in a specific patient (9).

Simply put, the LR tells us how many times more likely a particular test result is in subjects with the disease than in those without the disease. When the two probabilities are equal, the test is of no value and its LR = 1.

The likelihood ratio for a positive test result (LR+) tells us how much more likely a positive test result is to occur in subjects with the disease than in those without the disease. Numerically,

          P(T+|D+)        Sensitivity
LR+ = -------------- = -------------------
          P(T+|D-)      (1 - Specificity)

LR+ is usually greater than 1 because a positive test result is more likely to occur in subjects with the disease than in subjects without the disease.

LR+ is the best indicator for ruling in a diagnosis: the higher the LR+, the more indicative the test is of the disease. Good diagnostic tests have LR+ > 10, and their positive results contribute significantly to the diagnosis.

The likelihood ratio for a negative test result (LR-) is the ratio of the probability that a negative result occurs in subjects with the disease to the probability that the same result occurs in subjects without the disease. Therefore, LR- tells us how much less likely a negative test result is to occur in a diseased patient than in a subject without the disease.

          P(T-|D+)       (1 - Sensitivity)
LR- = -------------- = -------------------
          P(T-|D-)          Specificity

LR- is usually less than 1 because a negative test result is less likely to occur in subjects with the disease than in subjects without it.
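Both ratios follow directly from sensitivity and specificity. As a sketch, applying them to the Pap smear figures of Example 1 (sensitivity 96%, specificity 50%):

```python
def lr_positive(sens, spec):
    """LR+ = sensitivity / (1 - specificity)."""
    return sens / (1 - spec)

def lr_negative(sens, spec):
    """LR- = (1 - sensitivity) / specificity."""
    return (1 - sens) / spec

# Pap smear test from Example 1: sensitivity 96%, specificity 50%.
print(round(lr_positive(0.96, 0.50), 2))  # 1.92
print(round(lr_negative(0.96, 0.50), 2))  # 0.08
```

An LR+ of 1.92 is far below the LR+ > 10 of a good ruling-in test, while the LR- of 0.08 shows a negative Pap smear substantially lowers the probability of disease.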

Area under the ROC curve (AUC)

All the above indicators apply when the outcome of a screening test is a binary variable, i.e. either positive or negative. In many screening tests the outcome is a continuous variable, such as the prostate-specific antigen (PSA) test for prostate cancer, in which a test value below 4.0 is considered normal and a value above 4.0 abnormal. Clearly there will be patients with PSA values below 4.0 who have the disease (false negatives) and patients with values above 4.0 who do not (false positives). Receiver operating characteristic (ROC) curves are used in medicine to determine a suitable cut-off value for such a clinical test.

The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test; they also depend on the definition of what constitutes an abnormal result. Imagine an idealized graph showing the numbers of patients with and without a disease arranged according to the value of a diagnostic test. The two distributions overlap: the test (like most) does not distinguish normal from diseased with 100% accuracy, and the area of overlap indicates where the test cannot separate the two groups. In practice, we choose a cut-off point above which we consider the test abnormal and below which we consider it normal. The position of the cut-off point determines the numbers of true positives, true negatives, false positives and false negatives. We may wish to use different cut-off points in different clinical situations if we want to minimize one particular type of erroneous result.

Assume that there are two groups of men and that, by using a gold-standard technique, one group is known to be normal (negative), i.e. not to have prostate cancer, and the other is known to have prostate cancer (positive). A blood measurement of prostate-specific antigen is made in all men and used to test for the disease. The test will find some, but not all, of the abnormal men to have the disease. ROC curve analysis of the PSA test finds a cut-off value that will, in some way, minimize the numbers of false positives and false negatives; minimizing the false positives and false negatives is the same as maximizing the sensitivity and specificity.
The receiver operating characteristic (ROC) curve is a plot that displays the full trade-off between sensitivity (true positive rate) and 1 - specificity (false positive rate) across a series of cut-off points. The area under the ROC curve is considered an effective measure of the inherent validity of a diagnostic test. The curve is useful for:

(i) Evaluating the discriminatory ability of a test to correctly identify diseased and non-diseased subjects
(ii) Finding the optimal cut-off point that least misclassifies diseased and non-diseased subjects
(iii) Comparing the efficacy of two or more medical tests for assessing the same disease
(iv) Comparing two or more observers measuring the same test (inter-observer variability).

Non-parametric and parametric methods to obtain area under the ROC curve

Statistical software provides non-parametric and parametric methods for obtaining the area
under ROC curve. The user has to make a choice. The following details may help.

Non-parametric methods are distribution-free, and the resulting area under the ROC curve is called empirical. The first such method uses the trapezoidal rule. If sensitivity and specificity are denoted by Sn and Sp respectively, the ROC curve joins the points (1 - Sp, Sn) obtained at each observed value of the continuous test; dropping vertical lines from these points to the x-axis forms a series of trapezoids whose areas can easily be calculated and summed. Another non-parametric method uses the Mann-Whitney statistic, also known as the Wilcoxon rank-sum statistic or the c-index, to calculate the area. These two non-parametric estimates of the AUC have been shown to be equivalent (10).

Parametric methods are used when the statistical distribution of test values in the diseased and non-diseased groups is known. The binormal model is commonly used for this purpose; it is applicable when the test values in both the diseased and the non-diseased follow a normal distribution. If the data are actually binormal, or a transformation such as log, square or Box-Cox makes them binormal, the relevant parameters can easily be estimated from the means and variances of the test values in diseased and non-diseased subjects. For details, see (9, 11).

The choice of method for calculating the AUC for continuous test values depends largely on the availability of statistical software. The binormal method produces a smooth ROC curve from which further statistics can easily be calculated, but it gives biased results when the data are degenerate or the distribution is bimodal (12-13). When software for both parametric and non-parametric methods is available, conclusions should be based on the method that yields greater precision in the estimate of inherent validity, namely the AUC.

Examples of ROC curve

Patients with Suspected Hypothyroidism: Consider the following data on patients with suspected hypothyroidism (14). T4 and TSH values were measured in ambulatory patients with suspected hypothyroidism, and the TSH value was used as the gold standard for determining which patients were truly hypothyroid.

T4 value       Hypothyroid     Euthyroid
5 or less      18              1
5.1 - 7        7               17
7.1 - 9        4               36
9 or more      3               39
Totals         32              93

Notice that these authors found considerable overlap in T4 values among the hypothyroid and
euthyroid patients. Further, the lower the T4 value, the more likely the patients are to be
hypothyroid.

Of a total of 125 subjects, 32 are known to be hypothyroid and 93 are known to have normal
thyroid function. All subjects are assessed with respect to T4 (thyroxine) levels, and then sorted
among the four ordinal categories: T4<5.1, T4=5.1 to 7.0, T4=7.1 to 9.0, and T4>9.0. Of the
19 subjects with T4 levels lower than 5.1, 18 were in fact hypothyroid while only 1 was
euthyroid. Thus, if a T4 of 5 or less were taken as an indication of hypothyroidism, this measure
would yield 18 true positives and 1 false positive, with a true-positive rate (sensitivity) of
18/32=.5625 and a false-positive rate (1-specificity) of 1/93=.0108.

                      Observed Frequencies            Cumulative Rates
T4 Diagnostic     Euthyroid       Hypothyroid     Euthyroid       Hypothyroid
Level             (False Pos.)    (True Pos.)     (FP Rate)       (TP Rate)
<5.1              1               18              .0108           .5625
5.1-7.0           17              7               .1935           .7813
7.1-9.0           36              4               .5806           .9063
>9.0              39              3               1.0             1.0
Totals            93              32

Similarly, 7 of the hypothyroid subjects and 17 of the euthyroid subjects had T4 levels between 5.1 and 7.0. Thus, if any value of T4 less than 7.1 were taken as an indication of hypothyroidism, this measure would yield 18+7=25 true positives and 1+17=18 false positives, with a true-positive rate of 25/32=.7813 and a false-positive rate of 18/93=.1935. And so on for the other diagnostic levels, T4=7.1 to 9.0 and T4>9.0.

For the present example there are k=4 diagnostic levels, so a smooth ROC curve is fitted through the first three of these bivariate (false-positive rate, true-positive rate) pairs, as shown in Graph A.

The area under the fitted T4 ROC curve is 0.872. T4 would therefore be considered "good" at separating hypothyroid from euthyroid patients.
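The empirical (trapezoidal) AUC for the grouped T4 data can be computed directly from the counts, as a sketch of the non-parametric method described earlier. It comes out near 0.852, slightly below the 0.872 obtained above from the fitted curve, which is expected since the fitted curve smooths over only four observed points.

```python
# Counts per T4 category, ordered from lowest to highest T4
# (lower T4 is more suspicious for hypothyroidism, so we accumulate upward).
hypothyroid = [18, 7, 4, 3]
euthyroid   = [1, 17, 36, 39]

n_dis, n_non = sum(hypothyroid), sum(euthyroid)   # 32 and 93

# Build the (false positive rate, true positive rate) points cumulatively,
# reproducing the "Cumulative Rates" columns of the table above.
points = [(0.0, 0.0)]
fp = tp = 0
for h, e in zip(hypothyroid, euthyroid):
    tp += h
    fp += e
    points.append((fp / n_non, tp / n_dis))

# Trapezoidal rule over consecutive ROC points gives the empirical AUC.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(round(points[1][0], 4), points[1][1])  # 0.0108 0.5625, first table row
print(round(auc, 3))                         # 0.852
```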

Interpretation of ROC curve

The total area under the ROC curve is a single index of the performance of a test. The larger the AUC, the better the overall ability of the test to correctly identify diseased and non-diseased subjects. Equal AUCs for two tests represent similar overall performance, but this does not necessarily mean that the two curves are identical; they may cross each other.

Figure 1 depicts three different ROC curves. Considering the area under the curve, test A is better than both B and C, and its curve is closer to perfect discrimination; test B has good validity and test C moderate validity.

Figure 1: Three ROC curves with different areas under the curve

The accuracy of the test depends on how well the test separates the group being tested into
those with and without the disease in question. Accuracy is measured by the area under the
ROC curve. An area of 1 represents a perfect test; an area of .5 represents a worthless test. A
rough guide for classifying the accuracy of a diagnostic test is the traditional academic point
system:

.90-1 = excellent (A)


.80-.90 = good (B)
.70-.80 = fair (C)
.60-.70 = poor (D)
.50-.60 = fail (F)

Screening is very important from a public health point of view, as it helps identify disease before symptoms appear so that people can consult physicians for diagnosis and treatment. The success of any screening program at reducing morbidity and mortality depends on various factors, such as the interrelations between the disease experience of the target population, the characteristics of the screening procedures, and the effectiveness of the methods for treating disease early.

References:

1. Weiss NS. Clinical Epidemiology. Chapter 32 in: Rothman KJ, Greenland S, Lash TL, editors. Modern Epidemiology, Third Edition.

2. Irwig L, Bossuyt P, Glasziou P, Gatsonis C, Lijmer J. Designing studies to ensure that estimates of test accuracy are transferable. BMJ. 2002;324(7338):669-71.

3. Raslich MA, Markert RJ, Stutes SA. Selecting and interpreting diagnostic tests. Biochemia Medica 2007;17(2):139-270.

4. Šimundić AM. Measures of diagnostic accuracy: basic definitions. Department of Molecular Diagnostics, University Department of Chemistry, Sestre milosrdnice University Hospital, Zagreb, Croatia. Accessed on 12/02/2017 at www.ifcc.org/ifccfiles/docs/190404200805.pdf, page 2.

5. Leeflang MM, Rutjes AW, Reitsma JB, Hooft L, Bossuyt PM. Variation of a test's sensitivity and specificity with disease prevalence. CMAJ. 2013;185(11):E537-44.

6. Irwig L, Bossuyt P, Glasziou P, et al. Designing studies to ensure that estimates of test
accuracy are transferable. BMJ. 2002; 324:669-71.

7. Wen Zhu, Nancy Zeng, Ning Wang. Sensitivity, Specificity, Accuracy, Associated
Confidence Interval and ROC Analysis with Practical SAS Implementations. NESUG
2010, Health Care and Life Sciences.

8. Morrison AS. Screening. Chapter 25 in: Rothman KJ, Greenland S, editors. Modern Epidemiology, Second Edition.

9. Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ. 2004;329(7458):168-9.

10. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating
characteristic (ROC) curve. Radiology 1982; 143:29-36.

11. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: John Wiley and Sons, Inc, 2002.

12. Faraggi D, Reiser B. Estimation of the area under the ROC curve. Stat Med 2002;21:3093-3106.

13. Hajian-Tilaki KO, Hanley JA, Joseph L, Collet JP. A comparison of parametric and nonparametric approaches to ROC analysis of quantitative diagnostic tests. Med Decis Making 1997;17:94-102.

14. Goldstein BJ, Mushlin AI. Use of a single thyroxine test to evaluate ambulatory medical patients for suspected hypothyroidism. J Gen Intern Med. 1987 Jan-Feb;2(1):20-4.

