DOI 10.1007/s00198-009-1169-6
ORIGINAL ARTICLE
Received: 4 August 2009 / Accepted: 29 December 2009 / Published online: 5 February 2010
© International Osteoporosis Foundation and National Osteoporosis Foundation 2010
Abstract
Summary Morphometric methods of vertebral fracture
diagnosis lack specificity. We used detailed shape and
image texture model parameters to improve the specificity
of quantitative fracture identification. Two radiologists
visually classified all vertebrae for system training and
evaluation. The vertebral endplates were located by a semi-automatic segmentation method to obtain classifier inputs.
Introduction Vertebral fractures are common osteoporotic
fractures, but current quantitative detection methods (morphometry) lack specificity. We used detailed shape and
texture information to develop more specific quantitative
classifiers of vertebral fracture to improve the objectivity of
vertebral fracture diagnosis. These classifiers require a
detailed segmentation of the vertebral endplate, and so we
investigated the use of semi-automated segmentation
methods as part of the diagnosis.
Methods The vertebrae in a training set of 360 dual energy
X-ray absorptiometry images were manually segmented.
The shape and image texture of vertebrae were statistically
M. G. Roberts (*) · E. M. B. Pacheco · R. Mohankumar · T. F. Cootes
Imaging Science and Biomedical Engineering,
University of Manchester,
Stopford Building, Oxford Road,
Manchester M13 9PT, UK
e-mail: martin.roberts@manchester.ac.uk
E. M. B. Pacheco
Department of Radiology, Faculty of Medical Sciences,
State University of Campinas (Unicamp),
Campinas, Brazil
Introduction
J. E. Adams
Clinical Radiology, Manchester Royal Infirmary,
Oxford Road,
Manchester M13 9WL, UK
Clinical context
variance proportion to obtain the best classifier generalisation on manually segmented data. The optimisation was
conducted using a randomised resampling scheme, to avoid
a biased estimate of the true population performance.
Technical details are given in [33]. This resulted in
reductions from the 90% of texture variance used in [21]
to variance proportions of 75%, 70% and 85% for the mid-thoracic, lower thoracic and lumbar spine respectively. The
relatively lower proportion in the mid-thoracic region may be due
to diaphragm motion artefacts introducing spurious texture
modes.
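To illustrate what retaining a given proportion of texture variance means in practice, the following minimal Python sketch picks the smallest number of leading model modes whose eigenvalues account for the requested proportion. The eigenvalue spectrum and the function name `modes_for_variance` are hypothetical and chosen only for illustration; this is not the randomised optimisation procedure described in [33].

```python
from itertools import accumulate


def modes_for_variance(eigenvalues, proportion):
    """Return the smallest number of leading modes whose eigenvalues
    sum to at least `proportion` of the total variance."""
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive total")
    target = proportion * total
    for count, cumulative in enumerate(accumulate(ordered), start=1):
        # Small tolerance guards against floating-point rounding.
        if cumulative >= target - 1e-12:
            return count
    return len(ordered)


# Hypothetical eigenvalue spectrum (illustrative values only).
example_eigenvalues = [40.0, 25.0, 15.0, 10.0, 5.0, 3.0, 2.0]

for p in (0.90, 0.75):
    print(f"{p:.0%} of variance -> {modes_for_variance(example_eigenvalues, p)} modes")
```

Lowering the retained proportion discards the smallest modes first, which is where noise-like variation such as motion artefacts would be expected to concentrate.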
Classification experiments with semi-automatic segmentations
Leave-one-out cross-validation tests were performed over
the entire dataset to determine classifier performance on
semi-automatically segmented data. Each image was
selected in turn to be the test image. The appearance
models were then built using the remaining data (excluding
the test image), and the linear discriminant classifiers
were trained using the manual segmentations to obtain the
classifier inputs, and the radiologists' gold standard
vertebral classification as the desired classifier outputs.
Any vertebra classed as deformed but not fractured was
pooled with the normal vertebrae when training the
classifiers.
With this set of trained models and associated classifiers,
the software loaded the previously recorded semi-automatic
segmentations of the test image. For each vertebra in this
test image the software looped through each of the 10 semi-automatic segmentations in turn, and fitted the appearance
models to the test image (given the segmentation) to obtain
the appropriate classifier inputs. We recorded the classifier's measure of the likelihood that the vertebra is
fractured. The fracture likelihood estimate was recorded,
rather than a dichotomous fractured/not fractured result, so
that ROC curves can be derived by varying the detection
threshold used on this likelihood estimate. The results thus
obtained for the semi-automatic segmentations implicitly
include user imprecision effects.
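The derivation of ROC points from recorded likelihood estimates can be sketched as follows. This is a minimal Python illustration under stated assumptions: the scores and labels are invented, the function names are ours, and a vertebra is called fractured when its score is at or above the threshold. It is not the software used in the study.

```python
def roc_points(scores, labels):
    """Return (FPR, sensitivity) pairs obtained by sweeping the detection
    threshold over every distinct score, from strictest to most lenient."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one fractured and one unfractured case")
    points = [(0.0, 0.0)]  # threshold above every score: nothing detected
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def sensitivity_at_fpr(points, max_fpr):
    """Best sensitivity among operating points whose FPR does not exceed max_fpr."""
    return max(sens for fpr, sens in points if fpr <= max_fpr)


# Hypothetical fracture likelihoods and gold standard labels (1 = fractured).
example_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
example_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

example_points = roc_points(example_scores, example_labels)
for fpr, sens in example_points:
    print(f"FPR={fpr:.3f}  sensitivity={sens:.3f}")
```

Reading off `sensitivity_at_fpr(example_points, 0.05)` corresponds to the kind of fixed-FPR operating point reported in the tables.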
The same classifiers were applied to the gold standard
manual segmentation of each image. This gives an upper bound on practical performance, and indicates to what extent
classification errors result from segmentation errors.
Statistical comparisons between classifiers
The specificity of the appearance classifier was compared
with the morphometric algorithm at several sensitivity values.
We compared a set of points around the desirable sensitivity
region rather than ROC curve area, because statistics for the
whole curve area generally involve operating regions of little
practical relevance (e.g. unrealistically low sensitivity or high
false positive rate (FPR)). Also, the fact that the ROC curves
are generated by pooling slightly different classifiers over a
leave-one-out train/test cycle leads to some complications in
the underlying statistics for ROC curve area. As there are
multiple segmentations of each image (to model operator
precision), the samples are not all independent. We formulated
the comparison using the signs test; although not a very
powerful test, it can be easily employed with such data. The
number of false positives (varying between 0 and 10) is
calculated over the segmentation set of each unfractured test
image for the pair of classifiers being compared to give
$N^{1}_{FP}$ and $N^{2}_{FP}$. Any image pairings where these totals are equal are
discarded, leaving $m$ comparison pairs, over which the count
$w$ of the number of cases in which $N^{1}_{FP} > N^{2}_{FP}$ is computed.
Even though the underlying results for a particular classifier
are not independent, we still have a set of
matched groups, and so under the null hypothesis that there
is no difference between the classifiers, $w$ should be binomially
distributed, with $w \sim B(m, 0.5)$.
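As a worked illustration of this test, the sketch below computes the number of standard deviations by which the count $w$ departs from its null mean $m/2$, using the Gaussian approximation to $B(m, 0.5)$, together with an exact two-sided binomial p-value. The function name `sign_test` is ours, and the exact p-value is an addition for illustration; the paper itself reports only the standardised statistic. The two example pairs, 28/34 and 39/73, are taken from the figures listed under Table 5.

```python
from math import comb, sqrt


def sign_test(w, m):
    """Sign test for w 'wins' out of m untied pairs under H0: w ~ B(m, 0.5).

    Returns (z, p) where z is the Gaussian-approximation statistic
    (w - m/2) / (sqrt(m)/2) and p is the exact two-sided binomial p-value.
    """
    if m <= 0 or not 0 <= w <= m:
        raise ValueError("require m > 0 and 0 <= w <= m")
    z = (w - m / 2) / (sqrt(m) / 2)
    # Exact p-value: double the tail at least as extreme as observed, capped at 1.
    k = max(w, m - w)
    upper_tail = sum(comb(m, i) for i in range(k, m + 1)) / 2 ** m
    p = min(1.0, 2 * upper_tail)
    return z, p


for w, m in ((28, 34), (39, 73)):
    z, p = sign_test(w, m)
    print(f"w={w}, m={m}: z={z:.2f}, exact two-sided p={p:.4g}")
```

The first pair gives a statistic of about 3.77 and the second about 0.59, so only the first exceeds the 1.96 threshold for significance at the 5% level.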
Reproducibility
The only variable factor in the otherwise automatic process
is exactly where the user clicks on the centre of each
vertebra, when initialising the AAM. Because we simulated
the precision of this process, and ran multiple segmentations per image, we can estimate the reproducibility of the
overall method. Each randomised initialisation results in a
slightly different set of classifier estimates of vertebral
fracture probabilities. To translate this to an equivalent
reproducibility, we need to specify a threshold on these
probabilities. We computed (as part of calculating the ROC
curves) detection thresholds suitable for 1%, 2.5% and 5%
false positive rates. For a selected threshold we can then
treat the implied decision for each of the 10 image
segmentations as 10 different classifications, and then
calculate their overall concordance using the standard
Fleiss multi-rater Kappa statistic [34].
The mean absolute deviation of the fracture probability
between the 10 estimates was also calculated, and compared with the rates of change of sensitivity and specificity
with respect to detection threshold.
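The concordance calculation can be sketched as follows: each of the 10 segmentations is treated as a separate rater, the fracture probabilities are thresholded into decisions, and the standard Fleiss kappa formula is applied to the resulting counts. This is a minimal Python sketch; the probabilities, the threshold of 0.5 and the function names are hypothetical and are not taken from the study.

```python
def decisions_to_counts(probabilities, threshold):
    """probabilities[i] holds the fracture probabilities for vertebra i,
    one per segmentation. Returns per-vertebra [not fractured, fractured] counts."""
    counts = []
    for probs in probabilities:
        n_fractured = sum(1 for p in probs if p >= threshold)
        counts.append([len(probs) - n_fractured, n_fractured])
    return counts


def fleiss_kappa(counts):
    """Fleiss multi-rater kappa. counts[i][j] is the number of raters that
    assigned subject i to category j; every row must sum to the same total."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number (>= 2) of ratings")
    # Observed agreement, averaged over subjects.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_subjects * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one category is used
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical probabilities for four vertebrae, 10 segmentations each.
example_probabilities = [
    [0.02] * 10,
    [0.97] * 10,
    [0.10] * 10,
    [0.30] * 8 + [0.60] * 2,  # two segmentations cross the threshold
]
example_counts = decisions_to_counts(example_probabilities, threshold=0.5)
print(example_counts)
print(round(fleiss_kappa(example_counts), 4))
```

In this toy example a single borderline vertebra, on which two of the 10 segmentations disagree with the rest, is enough to pull kappa noticeably below 1.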
Results
Classifier comparisons for manually segmented data
As an upper limit on performance the results obtained by
reapplying the classifiers to the gold standard manually
segmented points are presented.
Table 1 shows the false positive rates obtained with the
various classifiers at sensitivities of 0.90 and 0.95 for the
mid-thoracic, lower thoracic and lumbar vertebrae. Conversely, Table 2 shows sensitivities at FPRs of
2.5% and 5%. Overall classifier performance is summarised
in Fig. 5a, which shows the ROC curves for the appearance
classifier, and for the morphometric method.
The appearance-based classifier dominates the ROC
curves in each of the three spinal regions over the operating
points of practical interest. In the thoracic spine, the use of
appearance model classifiers allows a large improvement in
specificity at the 95% sensitivity point. The FPR falls from
around 20% with current morphometric methods to about
4%, whilst in the lumbar spine the FPR is roughly halved (10.3% to
5.6%). If we fix the required FPR at 5% and compare
sensitivity (Table 2), there is an overall sensitivity gain of
around 7%.
At a sensitivity of 95% on the manually segmented data, an
overall FPR of 5.1% using appearance classifiers was
achieved, whereas this increases to 18.5% using the morphometric method. At 5.1% FPR, the breakdown of 95%
sensitivity by fracture grade is: 85.4% sensitivity against
grade 1 fractures, 98.5% for grade 2 and 100% for grade 3
fractures. The grade 1 sensitivity compares to 67.4% for the
morphometric method at the same specificity, a sensitivity
increase of 18% compared to morphometry.
Classifier comparisons for semi-automatic segmentations
Figure 5b shows the overall ROC curves obtained from the
semi-automatic segmentations, with all three spinal regions
combined. The combined appearance classifier ROC curve
appears better than height-based morphometry, but this
overall ROC curve masks certain performance differences
in the different regions of the spine. Table 3 shows the
classifier sensitivities at FPR of 2.5% and 5% for the three
spinal regions; Table 4 shows the corresponding sensitivities for the three fracture grades.
Overall the sensitivity of the appearance classifier at 5%
FPR is reduced from around 95% (Table 2) with ideal
(manual) segmentation to 88% (Table 4) with the semi-automatic segmentation, a loss in sensitivity of 7% due to
segmentation errors. The appearance classifier always gives
better sensitivity than the morphometric method, with
typical improvements of around 10%. At 2.5% FPR, for
grade 1 fractures, there is an improvement in sensitivity of
around 19% by using the appearance classifier; or at 5%
FPR the improvement is around 13%. The lower thoracic
region shows less improvement than the other regions,
especially at higher sensitivity. This may be because there
are often diaphragm artefacts in this region, which may bias
the fitting of the appearance model's underlying texture
model, as well as potentially inducing segmentation errors.
At 5% FPR, the reduction in sensitivity (from ideal
manual segmentation) induced by segmentation error in the
Table 1 False positive rates (%) for manually segmented vertebrae at sensitivities of 90% and 95%, comparing the appearance and morphometric algorithms

Spinal region   Classifier     90% sensitivity   95% sensitivity
T7–T9           Appearance     2.4               3.2
                Morphometric   6.7               20.7
T10–T12         Appearance     2.5               4.7
                Morphometric   5.4               19.6
L1–L4           Appearance     2.0               5.6
                Morphometric   6.1               10.3

Table 2 Percentage of manually segmented fractured vertebrae correctly classified with two false positive rates, comparing the appearance and morphometric algorithms

Spinal region   Classifier     2.5% FPR   5% FPR
T7–T9           Appearance     90.2       95.7
                Morphometric   83.7       88.0
T10–T12         Appearance     92.1       95.0
                Morphometric   83.2       87.1
L1–L4           Appearance     91.2       94.4
                Morphometric   80.0       88.0

Discussion
The results obtained when classifying on the basis of a
semi-automatic segmentation are not as good as with the
gold standard manual segmentation, but still promising. At
a FPR (per vertebra) of 5%, the overall sensitivity is 88%
for the appearance classifier, compared to 79% for standard
morphometric methods, with a more substantial increase
when severe (grade 3) fractures are excluded.
The fact that the method appears to miss some obvious
(4.4% at 5% FPR) grade 3 fractures may seem counterintuitive, but these are caused by AAM segmentation
failure [22]. Occasional gross segmentation failures can
also be manually corrected by moving a few points (e.g. the
four corner points), and then running the AAM algorithm
again.
Smyth et al. [27] reported on results for classifying
vertebrae using shape parameters, but the dataset was small
and confined to lumbar vertebrae. We have extended the
method by using appearance parameters, which implicitly
adds edge structure information to the detailed shape
description. De Bruijne et al. [35] propose a Neighbour-
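Table 3 Classifier sensitivities (%) for semi-automatically segmented vertebrae at false positive rates of 2.5% and 5% in the three spinal regions

Spinal region   Classifier     2.5% FPR   5% FPR
T7–T9           Appearance     72.6       84.1
                Morphometric   62.1       73.0
T10–T12         Appearance     81.1       88.9
                Morphometric   70.3       85.4
L1–L4           Appearance     86.7       90.4
                Morphometric   68.0       78.2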
Table 4 Percentage of semi-automatically segmented fractured vertebrae correctly detected for each fracture grade at two false positive rates, comparing the appearance and morphometric algorithms

Fracture grade   Classifier     2.5% FPR   5% FPR
Grade 1          Appearance     61.6       74.3
                 Morphometric   42.7       61.1
Grade 2          Appearance     86.9       92.6
                 Morphometric   69.7       80.2
Grade 3          Appearance     92.4       95.6
                 Morphometric   87.9       93.3
Overall          Appearance     81.2       88.2
                 Morphometric   67.5       78.7
Conclusions
We have developed linear discriminants for detecting vertebral
fracture using appearance parameters. Given a gold standard
manual segmentation, the appearance-based classifier can
almost quarter the false positive rate when operating at 95%
sensitivity compared with traditional quantitative morphometric methods. Using AAM-based semi-automatic segmentation
a sensitivity (per vertebra) of 88% can be obtained at 5% false
positive rate, compared to 79% with morphometry. Larger
sensitivity gains are observed with grade 1 fractures. The
reproducibility of the method is excellent. We believe the
feasibility of a computer-assisted diagnosis system for vertebral fracture has been established for VFA images, although
there are intrinsic limitations of VFA image quality. The CAD
system could be used clinically as part of a triage approach,
and also has the potential to be used in large-scale epidemiological studies or pharmaceutical trials. Work is underway to
extend the method so it can be applied to radiographs.

Table 5 Signs test results for specificity comparisons between classifiers for the semi-automatic segmentations

                Sensitivity
Spinal region   70%            75%            80%            85%               90%
T7–T9           3.77 (28/34)   4.66 (40/51)   5.11 (59/74)   7.89 (110/130)    5.66 (140/200)
T10–T12         3.27 (18/21)   3.40 (23/28)   2.65 (21/28)   1.34 (27/45)      0.59 (39/73)
L1–L4           4.63 (36/42)   4.14 (39/49)   6.56 (70/81)   10.91 (144/153)   13.58 (223/237)

The Signs test statistics are shown for the comparison between specificities of the appearance classifier and the morphometric algorithm at the
same indicated sensitivities. These comparisons are for the semi-automatic segmentations. The bracketed pair of figures gives: (1) the
number of (unfractured) vertebrae in which the morphometric classifier gave more false positive results than the appearance classifier over the set
of 10 semi-automatic segmentations per image; (2) the total number of unfractured vertebrae where the number of false positive results was
different for the two classifiers. Under the null hypothesis of no difference this pair should follow a binomial distribution with p=0.5. The
figure preceding the brackets shows the number of standard deviations away from the null hypothesis mean that the result lies. Under a reasonable Gaussian
approximation (note np>10 in all cases) the results are significant at the 5% level where this exceeds 1.96.
Acknowledgements The authors wish to thank Mr Stephen Capener
(SC) who performed the manual annotation of the vertebrae on the clinical
VFA images, and the team at the University of Sheffield (Professor R.
Eastell, Dr. L. Ferrar and Dr. G. Jiang) for initial guidance on the ABQ
method. The work was funded through a grant from the UK Arthritis
Research Council (ARC) (grant no. 17644), with earlier foundation work
having been funded by grants from the Central Manchester University
Hospitals NHS Foundation Trust (CMFC) Research Endowment Fund.
Conflicts of interest None.
References
1. Cooper C, O'Neill T, Silman A (1993) The epidemiology of vertebral fractures. European Vertebral Osteoporosis Study Group. Bone 14(Suppl 1):S89–S97
2. Melton LJ III, Atkinson EJ, Cooper C et al (1999) Vertebral fractures predict subsequent fractures. Osteoporos Int 10(3):214–221
3. Black DM, Arden NK, Palermo L et al (1999) Prevalent vertebral deformities predict hip fractures and new vertebral deformities but not wrist fractures; Study of Osteoporotic Fractures Research Group. J Bone Miner Res 14(5):821–828
4. Ensrud KE, Nevitt MC, Palermo L et al (1999) What proportion of incident morphometric vertebral fractures are clinically diagnosed and vice versa? J Bone Miner Res 14(S1):S138
5. Gehlbach SH, Bigelow C, Heimisdottir M et al (2000) Recognition of vertebral fractures in a clinical setting. Osteoporos Int 11(7):577–582
6. Delmas PD, van de Langerijt L, Watts NB et al (2005) Underdiagnosis of vertebral fractures is a worldwide problem: the IMPACT study. J Bone Miner Res 20(4):557–563
7. Ferrar L, Jiang G, Adams J, Eastell R (2005) Identification of vertebral fractures: an update. Osteoporos Int 16(7):717–728
8. Guermazi A, Mohr A, Grigorian M et al (2002) Identification of vertebral fractures in osteoporosis. Semin Musculoskelet Radiol 6(3):241–252
9. Genant HK, Wu CY, van Kuijk C et al (1993) Vertebral fracture assessment using a semi-quantitative technique. J Bone Miner Res 8(9):1137–1148
10. Black DM, Cummings SR, Stone K et al (1991) A new approach to defining normal vertebral dimensions. J Bone Miner Res 6(8):883–892
11. Eastell R, Cedel SL, Wahner HW et al (1991) Classification of vertebral fractures. J Bone Miner Res 6(3):207–215
12. McCloskey E, Spector T, Eyres K et al (1993) The assessment of vertebral deformity: a method for use in population studies and clinical trials. Osteoporos Int 3(3):138–147
13. Guglielmi G, Diacinti D, van Kuijk C et al (2008) Vertebral morphometry: current methods and recent advances. Eur Radiol 18(7):1484–1496
14. Wu CY, Li J, Jergas M et al (1995) Comparison of semiquantitative and quantitative techniques for the assessment of prevalent and incident vertebral fractures. Osteoporos Int 5(5):354–370