
Osteoporos Int (2010) 21:2037–2046

DOI 10.1007/s00198-009-1169-6

ORIGINAL ARTICLE

Detection of vertebral fractures in DXA VFA images
using statistical models of appearance and a semi-automatic
segmentation

M. G. Roberts & E. M. B. Pacheco & R. Mohankumar &
T. F. Cootes & J. E. Adams

Received: 4 August 2009 / Accepted: 29 December 2009 / Published online: 5 February 2010
© International Osteoporosis Foundation and National Osteoporosis Foundation 2010

Abstract
Summary Morphometric methods of vertebral fracture
diagnosis lack specificity. We used detailed shape and
image texture model parameters to improve the specificity
of quantitative fracture identification. Two radiologists
visually classified all vertebrae for system training and
evaluation. The vertebral endplates were located by a semi-automatic segmentation method to obtain classifier inputs.
Introduction Vertebral fractures are common osteoporotic
fractures, but current quantitative detection methods (morphometry) lack specificity. We used detailed shape and
texture information to develop more specific quantitative
classifiers of vertebral fracture to improve the objectivity of
vertebral fracture diagnosis. These classifiers require a
detailed segmentation of the vertebral endplate, and so we
investigated the use of semi-automated segmentation
methods as part of the diagnosis.
Methods The vertebrae in a training set of 360 dual energy
X-ray absorptiometry images were manually segmented.
The shape and image texture of vertebrae were statistically
modelled using Appearance Models. The vertebrae were
given a gold standard classification by two radiologists.
Linear discriminant classifiers to detect fractures were
trained on the vertebral appearance model parameters.
Classifier performance was evaluated by cross-validation
for manual and semi-automatic segmentations, the latter
derived using Active Appearance Models (AAM). Results
were compared with a morphometric algorithm using the
signs test.
Results With manual segmentation, the false positive rates
(FPR) at 95% sensitivity were: 5% (appearance) and 18%
(morphometry). With semi-automatic segmentations the
sensitivities at 5% FPR were: 88% (appearance) and 79%
(morphometry).
Conclusion Specificity and sensitivity are improved by
using an appearance-based classifier compared to standard
height ratio morphometry. An overall sensitivity loss of 7%
occurs (at 95% specificity) when using a semi-automatic
(AAM) segmentation compared to expert annotation, due to
segmentation error. However, the classifier sensitivity is
still adequate for a computer-assisted diagnosis system for
vertebral fracture, especially if used in a triage approach.
Keywords Active appearance model · Computer-assisted
diagnosis · DXA · Osteoporosis · Vertebral fracture
assessment

M. G. Roberts (*), E. M. B. Pacheco, R. Mohankumar,
T. F. Cootes
Imaging Science and Biomedical Engineering,
University of Manchester,
Stopford Building, Oxford Road,
Manchester M13 9PT, UK
e-mail: martin.roberts@manchester.ac.uk

E. M. B. Pacheco
Department of Radiology, Faculty of Medical Sciences,
State University of Campinas (Unicamp),
Campinas, Brazil

J. E. Adams
Clinical Radiology, Manchester Royal Infirmary,
Oxford Road,
Manchester M13 9WL, UK

Introduction

Clinical context

Osteoporosis is a progressive skeletal disease characterised
by low bone mass and structural deterioration of bone
tissue, leading to bone fragility and an increased susceptibility
to fractures, particularly of the hip, spine and wrist.
Vertebral fractures are the hallmark of osteoporosis, and are
associated with significant morbidity and mortality [1]. The
presence of vertebral fractures is a powerful predictor of
future fractures in the spine (×5) and at the hip (×2) [2, 3].
Although vertebral fractures may be symptomatic, between
30% and 75% may be asymptomatic, and so may not come
to clinical attention [4]. There is also evidence that a
significant number (30–50% or more) of vertebral fractures
present on radiographs are not being reported, or if they are
reported their significance is not recognised [5, 6].
Vertebral fracture detection
There is no precise definition of exactly what constitutes a
vertebral fracture, but a variety of methods for diagnosing
and describing such fractures have been developed [7, 8].
These include semi-quantitative methods involving some
subjective judgment by an expert radiologist or trained
reader [9], and fully quantitative morphometric methods
[10–13]. The latter require the manual annotation of six (or
more) points on each vertebra, which is time consuming
and may be imprecise.
Current quantitative methods [10–13] (six-point morphometry) rely on using only a small number of height
measurements, and assess reductions in anterior, middle
and posterior heights of vertebral bodies. However, these
methods have been shown to lack specificity [14, 15].
Subtle shape information is lost by using only three heights
for each vertebra, and there is no use of changes in image
texture which are typically present with endplate fracture.
Some of this information is used in semi-quantitative
approaches [14]. Therefore the Genant et al. semi-quantitative (SQ) method [9] has become almost a de facto
gold standard for fracture assessment, but there still
remains significant subjectivity, particularly for mild (grade
1) fractures. The subjectivity problem is discussed at length
by Jiang et al. who propose an algorithm-based qualitative
method (ABQ) [16].
Assessment for vertebral fractures is predominantly
made from conventional radiographs (or computed radiography). However, technical improvements in dual energy
X-ray absorptiometry (DXA) scanners (fan X-ray beam
with bank of detectors) have resulted in faster scanning and
improved spatial resolution, enabling vertebral fracture
assessment (VFA) to be made from DXA spine images
[17, 18]. DXA scanners are also becoming available in
units other than radiology departments, so that the
diagnosis of vertebral fractures is being made by personnel
not necessarily trained in image interpretation. Therefore, it
is desirable to define a quantitative approach which can
capture some of the more subtle information used in expert
visual assessment and diagnosis of vertebral fractures.

Our aim is to define more reliable quantitative fracture
classification methods based on a complete definition of
the vertebra's shape, and the texture within a sampling
profile centred on the endplate. In this study, we have used
compact statistical models of vertebral appearance. The
appearance model [19, 20] captures information about
shape and surrounding image texture. We trained linear
discriminant fracture classifiers using the appearance
model parameters. This method requires an accurate and
detailed segmentation of the vertebral endplate. In previous work [21] we reported on the use of such classifiers
applied to VFA (dual energy) images, but that study used
manually segmented vertebrae. Although this established
the principle of the technique, it would be too laborious to
require a detailed manual segmentation in practice. So in
this study, we applied techniques of semi-automatic
vertebral segmentation [22], and then ran the classifiers
on the computer-derived segmentations. We extended the
models used in [22] by adding corner detecting features
which improve the segmentation accuracy, and we also
optimised the complexity of the appearance models used
as input to the classification system, by adjusting the
proportion of the overall sample texture variance that was
modelled. The result is a form of computer-aided
diagnosis (CAD) system for vertebral fracture.

Material and methods

Data: DXA VFA images
For our study, we have used dual (as opposed to single)
energy VFA images acquired on DXA scanners (QDR 4500
Delphi or Discovery DXA scanners, Hologic Inc, Bedford,
MA, USA). Dual energy images were used, as the thoracic
vertebrae are generally more clearly visualised than on
single energy images. The VFA images are noisier and of
lower spatial resolution than spinal radiographs, but they
have several advantages. DXA involves a substantially
lower ionising radiation dose [23], avoids the projectional
parallax effects of radiographs and has the whole spine on a
single image. Good agreement is obtained between expert
reading of VFA images and radiographs [18, 24, 25]. A
typical image is shown in Fig. 1.
The dataset comprised 360 anonymised VFA images,
250 of which had been obtained from two previous studies
[26, 27] with institutional ethical approval. To these were
added another 110 anonymised images acquired from
patients aged 65 years and older referred for bone
densitometry, and for whom the referring physician had
requested VFA and who fulfilled the clinical guideline
criteria for performing VFA [28]. The models used for the
semi-automatic segmentation and classification algorithms
cover the lumbar spine starting at L4 and continue up the
thoracic spine to vertebra T7. Vertebrae were only assessed
up to T7, as many of the early images in the study (acquired
from [27]) were of poor quality above T7, with the
vertebrae being often obscured by the scapulae. Recent
improvements in both VFA imaging quality and patient
positioning protocols should allow further vertebrae T6–T4
to be assessed.

Fig. 1 Lateral VFA image of a spine displaying some features
of osteoporosis. There is a wedge fracture of T12 and an
endplate fracture of T7, and also T5 (but T5 was not analysed in
this study)

Expert classification of vertebrae
Vertebrae were first independently classified by two
radiologists using the ABQ method [7, 16, 29] to
differentiate vertebral fractures (Fig. 2) from other causes
of vertebral deformities (e.g. short vertebral height in which
the endplate is still intact and crisp in outline). The ABQ
method tends to produce a lower number of fractures than
SQ or morphometric techniques, as mild deformities are
given other diagnoses [16, 29] (e.g. developmental abnormality, spondylosis). Good agreement is obtained between
ABQ diagnoses using radiographs and VFA images [30].
Having performed the initial diagnoses independently,
the two radiologists then compared readings, and where
there was a discrepancy they reached a consensus judgment.
Fractured vertebrae were assigned a grade according
to the Genant et al. grading system [9]. Vertebrae were
classed as either: normal, deformed (but not fractured),
fractured (with grades 1–3) or not visualised. Those
vertebrae classed as deformed all display height loss in
excess of 15%. Of the 3,600 vertebrae viewed, poor image
quality resulted in 90 (2.5%) vertebrae not being adequately
visualised; these were excluded from the analysis. Of the
remaining 3,510 vertebrae, 2,969 (84.6%) were classed as
normal, and there were 354 (10.1%) vertebral fractures (97
mild, 147 moderate and 110 severe). There were 187
(5.3%) non-fracture vertebral deformities (often due to
degenerative disc disease), approximately double the
number of mild fractures, and so there was scope for
confusion between mild fractures and deformity.

Fig. 2 Fractured vertebrae in DXA images (between T10 and T12);
note the multi-edged structure. There are faintly visible cortical rim
remnants giving rise to secondary edge structure in addition to the
diffuse edges of the collapsed central endplates. These are particularly
evident (indicated by arrows) below the inferior fractured depressed
endplate of T11 (also indicated), and above the superior fractured
depressed endplate of T10 (indicated). The fractured depressed
endplates also have a more diffuse banded appearance than the crisper
edge of normal endplates

Segmentation and appearance models
We used a variant on the Active Appearance Model (AAM)
[19, 20] to provide a semi-automatic segmentation method.
The AAM is an efficient scheme for fitting the parameters
of an appearance model (Fig. 3). The latter is a statistical
model derived from a training set of images, which captures
how the shape of an object of interest and the image texture
within (or surrounding) that shape vary across a training
set. A specific set of appearance parameters is a compact
definition of both the shape of an object in an image, and
the texture around (and/or within) the image patch covered
by that shape. Texture can mean simple grey level pixel
values, or other descriptors such as image gradients.

Fig. 3 Appearance model of three vertebrae centred on L1 with the
grey level texture sampled within the convex hull of the shape,
showing the mean appearance (centre) and variation of ±2.5 standard
deviations in the first appearance parameter. This basic form of
appearance model is illustrated due to ease of visualisation, but in this
study we used a texture model of the edge structure along a profile
centred on the shape, see Fig. 4 and [21]

Appearance models require a segmentation of the
modelled structures across a set of training images. The
images had already been manually annotated using an in-house tool [21]. A semi-automatic segmentation was also
obtained for each image by running leave-8-out tests
(thus avoiding bias in the estimate of performance on new
images), using the AAM-based method of [22]. This
method uses a set of overlapping AAMs consisting of
triplets of vertebrae. Figure 4 shows an AAM segmentation
of T9 with all the modelled points indicated, and a subset of
the image texture sampling profiles.
In [22], the texture models used by the AAM were based
on the renormalised image gradient along profiles normal to
the shape boundary. In this study, we enhanced the
underlying texture models by adding measures of corner
and commensurate edge strength based on the Harris corner
detector, as computed in [31]. These were sampled only for
a subset of points around the vertebral corners, and the join
to the pedicle (which also triggers corner detectors).
Secondly, we optimised the profile lengths, resulting in a
sampling range up to 8 mm for lumbar vertebrae (from
6 mm in [22]), with the thoracic profile lengths scaled at
each vertebral level by mean vertebral size compared to L1.
Thirdly, we introduced alternative fractured initialisations
on the central vertebra of each triplet sub-model, and thus
ran each AAM from several start-points, selecting the best-fitting solution (i.e. least residual sum of squares) as the one
to impose. Complete segmentation accuracy results will not
be given, as the focus of the paper is the application to
classification, but in summary, the mean point-to-line
accuracy for fractured vertebrae improved to 0.97 mm
compared to 1.24 mm in [22].

AAM initialisation
AAMs are local search methods, and so need to be
initialised somewhere near the sought object. Therefore,
when used in an associated prototype clinical tool, the
clinician clicks with the mouse pointer on the approximate
centre of each vertebra in turn. On each test image, the user
initialisation was simulated by using the known vertebral
centres implied by the manually marked points. To model
operator precision, Gaussian random offsets were added in
the x,y directions, with 2 and 3 mm standard deviations
(SD) respectively. This randomised initialisation was
repeated 10 times for each image. The resulting semi-automatic segmentations were saved and then used as
inputs to the classification algorithm. There was no manual
adjustment of the AAM segmentations. We calculated the
precision of the AAM solutions over the 10 replications, by
calculating the mean point-to-curve error from the mean
shape.
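
The simulated initialisation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `simulate_clicks`, the coordinate layout and the fixed seed are hypothetical, and only the standard deviations (2 mm in x, 3 mm in y) and the 10 replications are taken from the text.

```python
import random


def simulate_clicks(centres_mm, sd_x=2.0, sd_y=3.0, n_reps=10, seed=0):
    """Return n_reps perturbed copies of the vertebral centres.

    centres_mm: list of (x, y) centre coordinates in mm, one per vertebra
    (in the study these were implied by the manually marked points).
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    replications = []
    for _ in range(n_reps):
        # Independent zero-mean Gaussian offsets in x and y model the operator
        perturbed = [
            (x + rng.gauss(0.0, sd_x), y + rng.gauss(0.0, sd_y))
            for (x, y) in centres_mm
        ]
        replications.append(perturbed)
    return replications


# Hypothetical centres for ten vertebrae (L4 up to T7), spaced 25 mm apart
centres = [(50.0, 20.0 + 25.0 * i) for i in range(10)]
reps = simulate_clicks(centres)
```

Each element of `reps` would then seed one AAM search, giving the 10 segmentations per image that are passed to the classifier.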

Fig. 4 Points used in the Active Appearance Model with a subset of
the texture sampling profiles along which image gradient is computed

Operator precision
We also validated these operator precision assumptions,
using a subset of 100 images which were initialised by two
radiologists clicking on the centre of each vertebra. We
determined the root mean square (RMS) point difference
between the two initialisations (averaged over all vertebrae), and then calculated an equivalent zero mean
population Gaussian with SD equal to this RMS precision
divided by $\sqrt{2}$. The equivalent SDs were found to be lower
than those assumed (1.2, 1.4 mm), but we have retained the
original more pessimistic assumption to allow for situations
where the method might be used by a less skilled operator.
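
The conversion from inter-operator RMS difference to an equivalent single-operator SD can be illustrated with a short sketch. The factor of the square root of 2 arises because the difference of two independent errors, each with standard deviation s, has standard deviation s times the square root of 2. The function name `equivalent_sd` and the per-axis treatment are assumptions made for illustration; the text does not state how the two axes were handled.

```python
import math


def equivalent_sd(clicks_a, clicks_b):
    """Per-axis SD of a zero-mean Gaussian click error implied by two operators.

    clicks_a, clicks_b: equal-length lists of (x, y) clicks in mm, one pair
    per vertebra, from operator A and operator B respectively.
    """
    n = len(clicks_a)
    sq_x = sum((ax - bx) ** 2 for (ax, _), (bx, _) in zip(clicks_a, clicks_b))
    sq_y = sum((ay - by) ** 2 for (_, ay), (_, by) in zip(clicks_a, clicks_b))
    rms_x = math.sqrt(sq_x / n)
    rms_y = math.sqrt(sq_y / n)
    # RMS difference between two operators = sqrt(2) x single-operator SD
    return rms_x / math.sqrt(2), rms_y / math.sqrt(2)


# Hypothetical clicks: operator B differs from A by +2 mm and -2 mm in x
sd_x, sd_y = equivalent_sd([(0.0, 0.0), (0.0, 0.0)], [(2.0, 0.0), (-2.0, 0.0)])
```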
Classification method
The classification method is described in detail in [21]. In
summary, three vertebral appearance models and associated
(linear discriminant) classifiers are constructed by pooling
data from nearby vertebrae for: the lumbar spine (L4 to L1),
lower thoracic spine (T12–T10) and mid-thoracic spine (T9
to T7). The set of points and texture sampling profiles used
in each vertebra are similar to those used for AAM
segmentation (Fig. 4), except that the points connecting
the inferior posterior corner to the pedicle are not used, and
no corner features are sampled. Given a specific image and
vertebral segmentation, the corresponding appearance
model can be refitted to the given shape and its surrounding
image texture. This provides a set of appearance parameters
that are the inputs to the classifier. The appearance model
describes shape in a normalised (size-free) reference frame,
so the McCloskey crush ratio [12] is used as an additional
parameter to give a relative measure of vertebral scale. The
crush ratio is defined as
$r_{\mathrm{crush}} = H_p / H_p^{\mathrm{pred}}$, where $H_p$ is the vertebra's posterior
height, and $H_p^{\mathrm{pred}}$ is its predicted height given the posterior
heights of four neighbouring vertebrae. We used VFA-specific
height data obtained from [32] for this prediction.
For a comparison with what might be achieved using the
more standard morphometric height ratios we also calculated the mid-height and wedge ratios:
$r_{\mathrm{mid}} = H_m / H_p$, $r_{\mathrm{wedge}} = H_a / H_p$, where $H_m$ and $H_a$ are the
middle and anterior heights, respectively. A hybrid morphometric method was employed, thresholding $r_{\mathrm{mid}}$ and
$r_{\mathrm{wedge}}$ as in the Eastell-Melton [11] method, together with
$r_{\mathrm{crush}}$. This is simpler than the full McCloskey method, but
retains its advantage of using a single crush ratio derived in
a manner which should avoid bias from fractured neighbours. We calculated the mean and standard deviation of
each ratio over the sub-sample of normal vertebrae. A
vertebra was then classified as fractured if any of the three
ratios was more than X standard deviations below the normal
mean. X was varied to generate receiver operating characteristic (ROC) curves. We refer to this method below
as the Eastell-McCloskey hybrid.
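
The thresholding rule of the hybrid morphometric method can be sketched as below. This is an illustrative reading of the description above, not the authors' code: the function names and the example heights are hypothetical, and the predicted posterior height is simply passed in rather than derived from neighbouring vertebrae and the reference data of [32].

```python
from statistics import mean, stdev

RATIO_NAMES = ("wedge", "mid", "crush")


def height_ratios(ha, hm, hp, hp_pred):
    """Wedge, mid-height and crush ratios from anterior, middle and posterior heights."""
    return {"wedge": ha / hp, "mid": hm / hp, "crush": hp / hp_pred}


def fit_normal_stats(normal_ratios):
    """Mean and SD of each ratio over the sub-sample of normal vertebrae."""
    stats = {}
    for name in RATIO_NAMES:
        values = [r[name] for r in normal_ratios]
        stats[name] = (mean(values), stdev(values))
    return stats


def is_fractured(ratios, stats, x):
    """Fractured if any ratio lies more than x SDs below its normal mean."""
    return any(ratios[n] < stats[n][0] - x * stats[n][1] for n in RATIO_NAMES)


# Hypothetical heights in mm: (anterior, middle, posterior, predicted posterior)
normals = [
    height_ratios(24.0, 23.5, 25.0, 25.0),
    height_ratios(25.0, 24.0, 25.5, 25.0),
    height_ratios(23.0, 23.0, 24.5, 25.0),
    height_ratios(24.5, 24.0, 25.0, 25.5),
]
stats = fit_normal_stats(normals)
wedged = height_ratios(15.0, 18.0, 25.0, 25.0)
intact = height_ratios(24.5, 24.0, 25.0, 25.0)
```

Sweeping `x` over a range of values and recording the resulting sensitivity and false positive rate at each value traces out the ROC curve for this method.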
One parameter used when building the vertebral appearance models is what proportion of the texture variance in
the training set to include. This determines how many
appearance parameters are used in the model. If too large a
variance proportion is used, an over-complex model is
produced. This could lead to poor generalisation of the
classifiers, due to over-training on noise. In contrast,
including too low a variance proportion can lose useful
information. During this study we therefore optimised the
variance proportion to obtain the best classifier generalisation on manually segmented data. The optimisation was
conducted using a randomised resampling scheme, to avoid
a biased estimate of the true population performance.
Technical details are given in [33]. This resulted in
reductions from the 90% of texture variance used in [21]
to variance proportions of 75%, 70% and 85% for the mid-thoracic, lower thoracic and lumbar spine respectively. The
relatively lower proportion in the mid-thoracic may be due
to diaphragm motion artefacts introducing spurious texture
modes.
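
How a variance proportion determines the number of retained texture modes can be shown with a small sketch. It assumes the texture model is built by principal component analysis, so that each mode has an associated eigenvalue (variance); the function name and the eigenvalues below are hypothetical and chosen only for illustration.

```python
def modes_for_variance(eigenvalues, proportion):
    """Smallest number of leading modes whose summed variance reaches
    the requested proportion of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative >= proportion * total:
            return count
    return len(ordered)


# Hypothetical texture-mode variances (arbitrary units, summing to 100)
example_eigenvalues = [40.0, 20.0, 15.0, 10.0, 8.0, 4.0, 2.0, 1.0]
n_modes_90 = modes_for_variance(example_eigenvalues, 0.90)
n_modes_70 = modes_for_variance(example_eigenvalues, 0.70)
```

Lowering the proportion drops the trailing low-variance modes, which are the ones most likely to represent noise rather than anatomical variation.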
Classification experiments with semi-automatic segmentations
Leave-one-out cross-validation tests were performed over
the entire dataset to determine classifier performance on
semi-automatically segmented data. Each image was
selected in turn to be the test image. The appearance
models were then built using the remaining data (excluding
the test image); and then the linear discriminant classifiers
were trained using the manual segmentations to obtain the
classifier inputs, and the radiologists' gold standard
vertebral classification for the desired classifier outputs.
Any vertebra classed as deformed but not fractured was
pooled with the normal vertebrae when training the
classifiers.
With this set of trained models and associated classifiers,
the software loaded the previously recorded semi-automatic
segmentations of the test image. For each vertebra in this
test image the software looped through each of the 10 semi-automatic segmentations in turn, and fitted the appearance
models to the test image (given the segmentation) to obtain
the appropriate classifier inputs. We recorded the classifier's measure of the likelihood that the vertebra is
fractured. The fracture likelihood estimate is recorded,
rather than a dichotomous fractured/not fractured result, so
that ROC curves can be derived by varying the detection
threshold used on this likelihood estimate. The results thus
obtained for the semi-automatic segmentations implicitly
include user imprecision effects.
The same classifiers were applied to the gold standard
manual segmentation of each image. This gives an upper bound on practical performance, and indicates to what extent
classification errors result from segmentation errors.
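
Reading an operating point off the ROC curve from the recorded likelihood estimates can be sketched as follows. This is a generic illustration rather than the software used in the study; the function name, the scores and the rule "call fractured when the score exceeds the threshold" are assumptions.

```python
def sensitivity_at_fpr(scores, labels, target_fpr):
    """Highest sensitivity achievable with a false positive rate <= target_fpr.

    scores: fracture likelihood estimates, one per vertebra
    labels: True for fractured vertebrae, False otherwise
    """
    negatives = sorted((s for s, y in zip(scores, labels) if not y), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y]
    # Number of false positives we may tolerate (rounded down, so conservative)
    allowed_fp = int(target_fpr * len(negatives))
    if allowed_fp < len(negatives):
        threshold = negatives[allowed_fp]
    else:
        threshold = float("-inf")
    # A vertebra is called fractured when its score exceeds the threshold
    detected = sum(1 for s in positives if s > threshold)
    return detected / len(positives)


# Hypothetical scores: four unfractured and three fractured vertebrae
example_scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.95, 0.25]
example_labels = [False, False, False, False, True, True, True]
sens_25 = sensitivity_at_fpr(example_scores, example_labels, 0.25)
sens_0 = sensitivity_at_fpr(example_scores, example_labels, 0.0)
```

Repeating the calculation over a grid of target false positive rates gives the full curve; pooling the scores from every leave-one-out cycle before doing so corresponds to the pooled ROC curves described in the text.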
Statistical comparisons between classifiers
The specificity of the appearance classifier was compared
with the morphometric algorithm at several sensitivity values.
We compared a set of points around the desirable sensitivity
region rather than ROC curve area, because statistics for the
whole curve area generally involve operating regions of little
practical relevance (e.g. unrealistically low sensitivity or high
false positive rate (FPR)). Also, the fact that the ROC curves
are generated by pooling slightly different classifiers over a
leave-one-out train/test cycle leads to some complications in
the underlying statistics for ROC curve area. As there are
multiple segmentations of each image (to model operator
precision), the samples are not all independent. We formulated
the comparison using the signs test; although not a very
powerful test, it can be easily employed with such data. The
number of false positives (varying between 0 and 10) is
calculated over the segmentation set of each unfractured test
image for the pair of classifiers being compared to give
$N^{1}_{FP}$, $N^{2}_{FP}$. Any image pairings where these totals are equal are
discarded, leaving m comparison pairs, over which the count
w of the number of cases in which $N^{1}_{FP} > N^{2}_{FP}$ is computed.
Even though the underlying results for a particular classifier
are not independent, nevertheless, we still have a set of
matched groups, and so under the null hypothesis that there
is no difference in classifiers w should be binomially
distributed, with $w \sim B(m, 0.5)$.
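
The paired comparison above can be sketched in a few lines. The function name and the example counts are hypothetical. The returned z value is the number of standard deviations by which w exceeds its expected value m/2 under the null hypothesis (the usual Gaussian approximation to the binomial), and the exact one-sided binomial tail probability is included for reference.

```python
import math


def sign_test(fp_first, fp_second):
    """Sign test on per-image false positive counts for two classifiers.

    Returns (m, w, z, p) where m is the number of untied image pairs,
    w the number of pairs in which the first classifier has more false
    positives, z the standardised excess of w over m/2, and p the exact
    one-sided probability P(W >= w) for W ~ B(m, 0.5).
    """
    pairs = [(a, b) for a, b in zip(fp_first, fp_second) if a != b]
    m = len(pairs)
    w = sum(1 for a, b in pairs if a > b)
    z = (w - 0.5 * m) / (0.5 * math.sqrt(m)) if m else 0.0
    p = sum(math.comb(m, k) for k in range(w, m + 1)) / 2 ** m
    return m, w, z, p


# Hypothetical false positive counts (0 to 10 per image) for eight images
fp_morphometric = [3, 0, 5, 2, 2, 7, 1, 4]
fp_appearance = [1, 0, 2, 2, 3, 1, 0, 0]
m, w, z, p = sign_test(fp_morphometric, fp_appearance)
```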
Reproducibility
The only variable factor in the otherwise automatic process
is exactly where the user clicks on the centre of each
vertebra, when initialising the AAM. Because we simulated
the precision of this process, and ran multiple segmentations per image, we can estimate the reproducibility of the
overall method. Each randomised initialisation results in a
slightly different set of classifier estimates of vertebral
fracture probabilities. To translate this to an equivalent
reproducibility, we need to specify a threshold on these
probabilities. We computed (as part of calculating the ROC
curves) detection thresholds suitable for 1%, 2.5% and 5%
false positive rates. For a selected threshold we can then
treat the implied decision for each of the 10 image
segmentations as 10 different classifications, and then
calculate their overall concordance using the standard
Fleiss multi-rater Kappa statistic [34].
The mean absolute deviation of the fracture probability
between the 10 estimates was also calculated, and compared with the rates of change of sensitivity and specificity
with respect to detection threshold.
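
Treating the 10 segmentations of each vertebra as 10 raters, the concordance calculation can be sketched with the standard Fleiss formula. The function name and the example counts are hypothetical; each row holds the number of segmentations that led to a "not fractured" and a "fractured" decision for one vertebra at the chosen threshold.

```python
def fleiss_kappa(counts):
    """Fleiss multi-rater kappa.

    counts[i][j] is the number of raters assigning subject i to category j.
    Every subject must have the same total number of raters. The statistic
    is undefined (division by zero) if all ratings fall in a single category.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Overall proportion of ratings falling in each category
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each subject
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_subj) / n_subjects
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical decisions for four vertebrae, 10 segmentations each:
# columns are (not fractured, fractured)
example_counts = [[10, 0], [0, 10], [9, 1], [10, 0]]
kappa = fleiss_kappa(example_counts)
```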

Results
Classifier comparisons for manually segmented data
As an upper limit on performance the results obtained by
reapplying the classifiers to the gold standard manually
segmented points are presented.
Table 1 shows the false positive rates obtained with the
various classifiers at sensitivities of 0.90 and 0.95 for the
mid-thoracic, lower thoracic and lumbar vertebrae, respectively. Conversely, Table 2 shows sensitivities for FPR
2.5% and 5%. Overall classifier performance is summarised
in Fig. 5a, which shows the ROC curves for the appearance
classifier, and for the morphometric method.
The appearance-based classifier dominates the ROC
curves in each of the three spinal regions over the operating
points of practical interest. In the thoracic spine, the use of
appearance model classifiers allows a large improvement in
specificity at the 95% sensitivity point. The FPR falls from
around 20% with current morphometric methods to about
4%, whilst in the lumbar the FPR is halved (10.3% to
5.6%). If we fix the required FPR at 5% and compare
sensitivity (Table 2), there is an overall sensitivity gain of
around 7%.
At a sensitivity of 95% on the manually segmented data, an
overall FPR of 5.1% using appearance classifiers was
achieved, whereas this increases to 18.5% using the morphometric method. At 5.1% FPR, the breakdown of 95%
sensitivity by fracture grade is: 85.4% sensitivity against
grade 1 fractures, 98.5% for grade 2 and 100% for grade 3
fractures. The grade 1 sensitivity compares to 67.4% for the
morphometric method at the same specificity, a sensitivity
increase of 18% compared to morphometry.
Classifier comparisons for semi-automatic segmentations
Figure 5b shows the overall ROC curves obtained from the
semi-automatic segmentations, with all three spinal regions
combined. The combined appearance classifier ROC curve
appears better than height-based morphometry, but this
overall ROC curve masks certain performance differences
in the different regions of the spine. Table 3 shows the
classifier sensitivities at FPR of 2.5% and 5% for the three
spinal regions; Table 4 shows the corresponding sensitivities for the three fracture grades.
Overall the sensitivity of the appearance classifier at 5%
FPR is reduced from around 95% (Table 2) with ideal
(manual) segmentation to 88% (Table 4) with the semi-automatic segmentation, a loss in sensitivity of 7% due to
segmentation errors. The appearance classifier always gives
better sensitivity than the morphometric method, with
typical improvements of around 10%. At 2.5% FPR, for
grade 1 fractures, there is an improvement in sensitivity of
around 19% by using the appearance classifier; or at 5%
FPR the improvement is around 13%. The lower thoracic
region shows less improvement than the other regions
especially at higher sensitivity. This may be because there
are often diaphragm artefacts in this region, which may bias
the fitting of the appearance model's underlying texture
model, as well as potentially inducing segmentation errors.
At 5% FPR, the reduction in sensitivity (from ideal
manual segmentation) induced by segmentation error in the
semi-automatic method is about 11% for grade 1 fractures,
6% for grade 2 fractures and 4% for grade 3.
Table 5 gives the results of carrying out the signs test,
comparing false positive results between the appearance
and morphometric classifiers, for sensitivities in the range
70–90%. This table shows the number of standard deviations above the (null hypothesis) expected value of the
count in which the morphometric method gives more false
positives than the appearance classifier. Using the standard
Gaussian approximation to the Binomial distribution would
allow this to be converted to a significance value, but it is
obvious that most of the values are larger than the 5%
significance value of around 1.96. The exceptions are for
the sensitivity values of 85–90% in the lower thoracic
spine, for which no significant difference in FPR can be
detected by the signs test. This is in contrast to results for
high (95%) sensitivity with the gold standard manual
segmentation (Table 1 and [21]), and may be caused by a
general degradation due to segmentation error.
Reproducibility
The mean AAM segmentation point precision was
0.097 mm, and this translates to excellent reproducibility,
which also means that the segmentation is not sensitive to
assumptions made about the precision of the manual
initialisation. The kappa statistics for the resulting classification depend on whereabouts on the ROC curve it is
desired to operate, but vary between 0.91 and 0.96 in the
specificity range considered (95–99% specificity). For
comparison, inter-radiologist concordance using SQ in
[14] lay between 0.80 and 0.81; a value exceeding 0.8 is
normally regarded as indicating almost perfect agreement.
The mean absolute deviation of the fracture probability
estimate between different initialisations was $3.4 \times 10^{-3}$. For
comparison the mean rate of change in detection threshold
per percentage sensitivity (in the 80–90% region) was
$5.2 \times 10^{-3}$. Similarly, the mean rate of change in detection
threshold per percentage point of specificity in the 95–99%
region is 0.02. So the initialisation precision translates
to a final sensitivity/specificity precision of around 0.65%/0.16%,
respectively, in practical ROC curve regions.

Table 1 False positive rates at various sensitivities for manual
segmentations

Spinal region   Classifier     FPR at 90% sensitivity (%)   FPR at 95% sensitivity (%)
T7–T9           Appearance     2.4                          3.2
                Morphometric   6.7                          20.7
T10–T12         Appearance     2.5                          4.7
                Morphometric   5.4                          19.6
L1–L4           Appearance     2.0                          5.6
                Morphometric   6.1                          10.3

False positive rates (%) in 90–95% sensitivity regions summarised by
spinal region for manual segmentations, comparing the appearance and
morphometric algorithms

Table 2 Fracture detection sensitivities for manual segmentations,
given false positive rate

Spinal region   Classifier     Sensitivity at FPR 2.5% (%)   Sensitivity at FPR 5% (%)
T7–T9           Appearance     90.2                          95.7
                Morphometric   83.7                          88.0
T10–T12         Appearance     92.1                          95.0
                Morphometric   83.2                          87.1
L1–L4           Appearance     91.2                          94.4
                Morphometric   80.0                          88.0

Percentage of manually segmented fractured vertebrae correctly
classified with two false positive rates, comparing the appearance and
morphometric algorithms

Discussion
The results obtained when classifying on the basis of a
semi-automatic segmentation are not as good as with the
gold standard manual segmentation, but still promising. At
a FPR (per vertebra) of 5%, the overall sensitivity is 88%
for the appearance classifier, compared to 79% for standard
morphometric methods, with a more substantial increase
when severe (grade 3) fractures are excluded.
The fact that the method appears to miss some obvious
(4.4% at 5% FPR) grade 3 fractures may seem counterintuitive, but these are caused by AAM segmentation
failure [22]. Occasional gross segmentation failures can
also be manually corrected by moving a few points (e.g. the
four corner points), and then running the AAM algorithm
again.
Smyth et al. [27] reported on results for classifying
vertebrae using shape parameters, but the dataset was small
and confined to lumbar vertebrae. We have extended the
method by using appearance parameters, which implicitly
adds edge structure information to the detailed shape
description. De Bruijne et al. [35] propose a Neighbour-

2044

Osteoporos Int (2010) 21:20372046

Fig. 5 a ROC curves for fracture


detection (all data combined),
using gold standard manual
segmentations, comparing the
appearance parameter linear discriminant with the morphometric
method; b corresponding ROC
curves using semi-automatic
(Active Appearance Model
derived) segmentations

Conditional Shape Model. This predicts the expected shape


given several neighbouring vertebrae, and uses the total
deviation from the predicted shape to classify a vertebra.
This method employs additional prior information about the
interrelations between vertebrae, but by using only shape
the method may be more prone than appearance-based
classifiers to false positives on non-fracture deformities.
Table 3 Fracture detection sensitivities for semi-automatic segmentations, given false positive rate

Spinal region   Classifier     Sensitivity at FPR 2.5% (%)   Sensitivity at FPR 5% (%)
T7–T9           Appearance     72.6                          84.1
                Morphometric   62.1                          73.0
T10–T12         Appearance     81.1                          88.9
                Morphometric   70.3                          85.4
L1–L4           Appearance     86.7                          90.4
                Morphometric   68.0                          78.2

Percentage of semi-automatically segmented fractured vertebrae correctly classified with two false positive
rates, comparing the appearance and morphometric algorithms

Wu et al. [14] derived sensitivity/specificity figures for
three radiologists using the semi-quantitative method, compared
to a gold standard derived by a consensus reading
involving also a fourth radiologist expert in the SQ method.
The median sensitivity was 88% with a specificity of 98%
(so FPR of 2%). The overall sensitivity with the appearance
classifier at this FPR is 88.7% when using manually
segmented data, strikingly similar to that of an experienced
radiologist using SQ on radiographs. Our gold standard is
less rigorous than that of [14], and we were using poorer
quality VFA images. Nevertheless, the concordance of our
appearance classifier with expert reading can be configured
to be comparable to that of inter-radiologist concordance
using SQ if a good segmentation is available. When a
semi-automatic segmentation is used, the equivalent specificity of
our method is degraded to 95%.
The specificity improvements obtained from the appearance
classifier applied to the manually segmented data
indicate that the underlying appearance model may capture
some more subtle discriminating features than the simple
height ratio method; for example, information about the edge
structure of the vertebral endplate and detailed shape
information. This may enable some of the false positive short
vertebral height wedge deformities, typical of the mid-thoracic
region, to be rejected by the appearance classifier.
Our development of more sophisticated quantitative
methods also creates the possibility of replacing the current
grading system (grades 1–3) with a more continuous
measure. This might make it quicker to detect progression
of existing fractures in longitudinal studies.

Table 4 Fracture detection sensitivities for different fracture grades using semi-automatic segmentations, given false positive rates

Fracture grade   Classifier     Sensitivity at FPR 2.5% (%)   Sensitivity at FPR 5% (%)
Grade 1          Appearance     61.6                          74.3
                 Morphometric   42.7                          61.1
Grade 2          Appearance     86.9                          92.6
                 Morphometric   69.7                          80.2
Grade 3          Appearance     92.4                          95.6
                 Morphometric   87.9                          93.3
Overall          Appearance     81.2                          88.2
                 Morphometric   67.5                          78.7

Percentage of semi-automatically segmented fractured vertebrae
correctly detected for each fracture grade at two false positive
rates, comparing the appearance and morphometric algorithms


There were several limitations to this study. Firstly,
vertebrae were only assessed up to T7. Secondly, there were
no accompanying radiographs to allow a more rigorous
assessment of true fracture status. Thirdly, our dataset
contained images from a single scanner manufacturer.
Finally, the older images have substantially poorer resolution
than is available on modern DXA scanners (1 mm per
line pair compared with 0.35 mm for a modern DXA
scanner), so our results will tend to underestimate the
performance achievable on current scanners.

Conclusions
We have developed linear discriminants for detecting vertebral
fracture using appearance parameters. Given a gold standard
manual segmentation, the appearance-based classifier can
almost quarter the false positive rate when operating at 95%
sensitivity compared with traditional quantitative morphometric
methods. Using AAM-based semi-automatic segmentation
a sensitivity (per vertebra) of 88% can be obtained at 5% false
positive rate, compared to 79% with morphometry. Larger
sensitivity gains are observed with grade 1 fractures. The
reproducibility of the method is excellent. We believe the
feasibility of a computer-assisted diagnosis system for vertebral
fracture has been established for VFA images, although
there are intrinsic limitations of VFA image quality. The CAD
system could be used clinically as part of a triage approach,
and also has the potential to be used in large-scale epidemiological
studies or pharmaceutical trials. Work is underway to
extend the method so it can be applied to radiographs.

Table 5 Signs test results for specificity comparisons between classifiers for the semi-automatic segmentations

Spinal region   Sensitivity
                70%            75%            80%            85%               90%
T7–T9           3.77 (28/34)   4.66 (40/51)   5.11 (59/74)   7.89 (110/130)    5.66 (140/200)
T10–T12         3.27 (18/21)   3.40 (23/28)   2.65 (21/28)   1.34 (27/45)      0.59 (39/73)
L1–L4           4.63 (36/42)   4.14 (39/49)   6.56 (70/81)   10.91 (144/153)   13.58 (223/237)

The Signs test statistics are shown for the comparison between specificities of the appearance classifier and the morphometric algorithm at the
same indicated sensitivities. These comparisons are for the semi-automatic segmentations. The bracketed pair of figures gives: (1) the
number of (unfractured) vertebrae in which the morphometric classifier gave more false positive results than the appearance classifier over the set
of 10 semi-automatic segmentations per image; (2) the total number of unfractured vertebrae where the number of false positive results was
different for the two classifiers. Under the null hypothesis of no difference this pair should follow a binomial distribution with p=0.5. The unbracketed
figure shows the number of standard deviations away from the null hypothesis mean that the result lies. Under a reasonable Gaussian
approximation (note np>10 in all cases) the results are significant at the 5% level where this exceeds 1.96
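
The Signs test statistic described in the Table 5 footnote is the Gaussian approximation to a binomial count with p=0.5: the number of standard deviations by which the observed count exceeds the null mean. A minimal sketch of that calculation is given below; the function name is ours, and the example pair (28/34) is the T7–T9 entry at 70% sensitivity.

```python
import math


def signs_test_statistic(morphometric_worse, discordant):
    """Standard deviations from the null mean for a sign test.

    morphometric_worse: vertebrae where the morphometric classifier gave more
        false positives than the appearance classifier.
    discordant: vertebrae where the two classifiers differed at all.
    Under the null hypothesis the first count is Binomial(discordant, 0.5).
    """
    null_mean = 0.5 * discordant
    null_sd = math.sqrt(discordant * 0.5 * 0.5)
    return (morphometric_worse - null_mean) / null_sd


z = signs_test_statistic(28, 34)
significant_at_5_percent = z > 1.96  # threshold quoted in the footnote
print(round(z, 2), significant_at_5_percent)
```

The same function applied to the T10–T12 entry at 90% sensitivity (39/73) gives a value well below 1.96, which matches the one cell in that row where the difference is not significant.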
Acknowledgements The authors wish to thank Mr Stephen Capener
(SC) who performed the manual annotation of the vertebrae on the clinical
VFA images, and the team at the University of Sheffield (Professor R.
Eastell, Dr. L. Ferrar and Dr. G. Jiang) for initial guidance on the ABQ
method. The work was funded through a grant from the UK Arthritis
Research Council (ARC) (grant no. 17644), with earlier foundation work
having been funded by grants from the Central Manchester University
Hospitals NHS Foundation Trust (CMFC) Research Endowment Fund.
Conflicts of interest None.

References
1. Cooper C, O'Neill T, Silman A (1993) The epidemiology of
vertebral fractures. European Vertebral Osteoporosis Study Group.
Bone 14(Suppl 1):S89–S97
2. Melton LJ III, Atkinson EJ, Cooper C et al (1999) Vertebral fractures
predict subsequent fractures. Osteoporos Int 10(3):214–221
3. Black DM, Arden NK, Palermo L et al (1999) Prevalent vertebral
deformities predict hip fractures and new vertebral deformities but
not wrist fractures; Study of Osteoporotic Fractures Research
Group. J Bone Miner Res 14(5):821–828
4. Ensrud KE, Nevitt MC, Palermo L et al (1999) What proportion
of incident morphometric vertebral fractures are clinically
diagnosed and vice versa? J Bone Miner Res 14(S1):S138
5. Gehlbach SH, Bigelow C, Heimisdottir M et al (2000) Recognition
of vertebral fractures in a clinical setting. Osteoporos Int
11(7):577–582
6. Delmas PD, van de Langerijt L, Watts NB et al (2005)
Underdiagnosis of vertebral fractures is a worldwide problem: the
IMPACT study. J Bone Miner Res 20(4):557–563
7. Ferrar L, Jiang G, Adams J, Eastell R (2005) Identification of
vertebral fractures: an update. Osteoporos Int 16(7):717–728
8. Guermazi A, Mohr A, Grigorian M et al (2002) Identification of
vertebral fractures in osteoporosis. Semin Musculoskelet Radiol
6(3):241–252
9. Genant HK, Wu CY, van Kuijk C et al (1993) Vertebral fracture
assessment using a semi-quantitative technique. J Bone Miner Res
8(9):1137–1148
10. Black DM, Cummings SR, Stone K et al (1991) A new approach
to defining normal vertebral dimensions. J Bone Miner Res
6(8):883–892
11. Eastell R, Cedel SL, Wahner HW et al (1991) Classification of
vertebral fractures. J Bone Miner Res 6(3):207–215
12. McCloskey E, Spector T, Eyres K et al (1993) The assessment of
vertebral deformity: a method for use in population studies and
clinical trials. Osteoporos Int 3(3):138–147
13. Guglielmi G, Diacinti D, van Kuijk C et al (2008) Vertebral
morphometry: current methods and recent advances. Eur Radiol
18(7):1484–1496
14. Wu CY, Li J, Jergas M et al (1995) Comparison of semiquantitative
and quantitative techniques for the assessment of prevalent
and incident vertebral fractures. Osteoporos Int 5(5):354–370

15. Black D, Palermo L, Nevitt MC et al (1995) Comparison of
methods for defining prevalent vertebral deformities: the study of
osteoporotic fractures. J Bone Miner Res 10(6):890–902
16. Jiang G, Eastell R, Barrington NA, Ferrar L (2004) Comparison of
methods for the visual identification of prevalent vertebral fracture
in osteoporosis. Osteoporos Int 15(11):887–896
17. Rea JA, Steiger P, Blake GM et al (1998) Optimizing data
acquisition and analysis of morphometric X-ray absorptiometry.
Osteoporos Int 8(2):177–183
18. Rea JA, Li J, Blake GM et al (2000) Visual assessment of vertebral
deformity by X-ray absorptiometry: a highly predictive method to
exclude vertebral deformity. Osteoporos Int 11(8):660–668
19. Cootes TF, Edwards GJ, Taylor CJ (1998) Active appearance
models. In: Burkhardt H, Neumann B (eds) Proc. 5th European
conference on computer vision. Springer, Heidelberg, pp 484–498
20. Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance
models. IEEE Transactions on Pattern Analysis and Machine
Intelligence 23(6):681–685
21. Roberts MG, Cootes TF, Pacheco E, Adams JE (2007) Quantitative
vertebral fracture detection on DXA images using shape and
appearance models. Acad Radiol 14(10):1166–1178
22. Roberts MG, Cootes TF, Adams JE (2006) Vertebral morphometry:
semiautomatic determination of detailed shape from dual-energy
X-ray absorptiometry images using active appearance
models. Invest Radiol 41(12):849–859
23. Blake GM, Rea JA, Fogelman I (1997) Vertebral morphometry
studies using dual-energy X-ray absorptiometry. Semin Nucl Med
27(3):276–290
24. Ferrar L, Jiang G, Eastell R, Peel NF (2003) Visual identification
of vertebral fractures in osteoporosis: using morphometric x-ray
absorptiometry. J Bone Miner Res 18(5):933–938
25. Binkley N, Krueger D, Gangnon R et al (2005) Lateral vertebral
assessment: a valuable technique to detect clinically significant
vertebral fractures. Osteoporos Int 16(12):1513–1518
26. McCloskey E, Selby P, de Takats D et al (2001) Effects of
clodronate on vertebral fracture risk in osteoporosis: a 1-year
interim analysis. Bone 28(3):310–315
27. Smyth PP, Taylor CJ, Adams JE (1999) Vertebral shape: automatic
measurement with active shape models. Radiology 211(2):571–578
28. Vokes T, Bachman D, Baim S et al (2006) Vertebral
fracture assessment: the 2005 ISCD Official Positions. J Clin
Densitom 9(1):37–46
29. Ferrar L, Jiang G, Armbrecht G et al (2007) Is short vertebral
height always an osteoporotic fracture? The osteoporosis and
ultrasound study (OPUS). Bone 41:5–12
30. Ferrar L, Jiang G, Clowes GA et al (2008) Comparison of
densitometric and radiographic vertebral fracture assessment using
the algorithm-based qualitative (ABQ) method in postmenopausal
women at low and high risk of fracture. J Bone Miner Res
23(1):103–111
31. Scott IM, Cootes TF, Taylor CJ (2003) Improving active
appearance model matching using local image structure. In:
Proceedings of 18th Conference on Information Processing in
Medical Imaging 258–269
32. Rea J, Steiger P, Blake GM et al (1998) Morphometric X-ray
absorptiometry: reference data for vertebral dimensions. J Bone
Miner Res 13(3):464–474
33. Roberts MG (2008) Automatic detection and classification of
vertebral fracture using statistical models of appearance. PhD
Thesis, University of Manchester
34. Fleiss JL (1971) Measuring nominal scale agreement among many
raters. Psychol Bull 76(5):378–382
35. de Bruijne M, Lund M, Tanko L et al (2007) Quantitative
vertebral morphometry using neighbour-conditional shape
models. Med Image Anal 11(5):503–512
