Shape and Scale in Detecting Disease Clusters

University of Iowa
Iowa Research Online

Theses and Dissertations
2008
Shape and scale in detecting disease clusters

Soumya Mazumdar
University of Iowa
Copyright 2008 Soumya Mazumdar

This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/208
Recommended Citation
Mazumdar, Soumya. "Shape and scale in detecting disease clusters." PhD (Doctor of Philosophy) thesis, University of Iowa, 2008.
http://ir.uiowa.edu/etd/208.
SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS
by
Soumya Mazumdar
An Abstract
Of a thesis submitted in partial fulfillment
of the requirements for the Doctor of
Philosophy degree in Geography
in the Graduate College of
The University of Iowa
December 2008
Thesis Supervisor: Professor Gerard Rushton
ABSTRACT
This dissertation offers a new cluster detection method. This method looks at the
cluster detection problem from a new perspective. I change the question "What do real
clusters look like?" to the questions "What do spurious clusters look like?" and "How
do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be
identified from their geographical characteristics. These are related to the spatial
distribution of people at risk, the shape and scale of the geographic units used to
aggregate these people, the shape and scale of the spatial configurations that the disease
mapping or cluster detection method may impose on the data, and the shape and scale of
the area of increased risk. The statistical testing process may also create spurious clusters.
I propose that the problem of spurious clusters can be resolved using a computational
geographic approach. I argue that Monte Carlo simulations can be used to estimate the
patterns of spurious clusters in any situation of interest given knowledge of the first three
of these four determinants of spurious clusters. Then, given these determinants, where
real measurements of disease or mortality are known, it is possible to show those areas of
increased risk that are true clusters as opposed to those that are spurious clusters. This
distinction is made in a three dimensional "signature space", with shape, size and rate as the
three axes. The extent of similarity (or dissimilarity) of a cluster to the simulated spurious
clusters influences whether it can be recovered. The experiments show that this method
is successful in detecting clusters. This method can also predict with reasonable certainty
which clusters can be recovered, and which cannot. I compare this method with
Rogerson's Score statistic method. These comparisons expose the weaknesses of
Rogerson's method. Finally, these two methods and the Spatial Scan Statistic are applied
to searching for possible clusters of prostate cancer incidence in Iowa. The implications
of the findings are discussed.
SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS
by
Soumya Mazumdar
A thesis submitted in partial fulfillment
of the requirements for the Doctor of
Philosophy degree in Geography
in the Graduate College of
The University of Iowa
December 2008
Thesis Supervisor: Professor Gerard Rushton
Graduate College
Iowa City, Iowa
CERTIFICATE OF APPROVAL
_______________________
PH.D. THESIS
_______________
This is to certify that the Ph.D. thesis of
Soumya Mazumdar
has been approved by the Examining Committee
for the thesis requirement for the Doctor of Philosophy
degree in Geography at the December 2008 graduation.
Thesis Committee: ___________________________________
Gerard Rushton, Thesis Supervisor
___________________________________
David Bennett
___________________________________
Naresh Kumar
___________________________________
Marc Linderman
___________________________________
Dale Zimmerman
ACKNOWLEDGMENTS
I would like to acknowledge the help I have received during the course of my stay
in Iowa. I would like to thank Dr Rushton for supervising my research. I would also like
to thank my committee members for their contributions. The last four years of my life
have been emotionally challenging for me. I thank the great masters before us who have
helped me through. I am thankful to the writings of M. Scott Peck, Viktor Frankl, Swami
Vivekananda, and the yogic practices of Sri Sri Ravishankar @ Art of Living Foundation.
I would also like to thank my family members, especially my mom, mishtimashi and late
Dr Mazumdar for their support. Thanks are also due to all my friends and well wishers.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1. DETECTING CLUSTERS OF DISEASE: INVESTIGATING SPURIOUS CLUSTERS
    1.1 Statement of Purpose
    1.2 Introduction
    1.3 Organization of the dissertation
    1.4 Review of existing methods of cluster detection
        1.4.1 Map data without further geographic processing
            1.4.1.1 Methods that do not smooth the data
            1.4.1.2 Methods that smooth the data
        1.4.2 Methods that pre-process the data before calculating and/or testing for significant disease risk
            1.4.2.1 Non combinatorial approaches
            1.4.2.2 Combinatorial approaches
            1.4.2.3 Hybrid approaches
        1.4.3 Significance testing and spurious clusters
        1.4.4 Identifying spurious clusters and distinguishing true clusters from spurious clusters
            1.4.4.1 The spatial distribution of the locations of people in the map
            1.4.4.2 The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas
        1.4.5 Identifying spurious clusters and distinguishing true clusters from spurious clusters
        1.4.6 Why use size, shape and rate
2. THE SHAPE SIZE SENSITIVE (S.S.S) METHOD FOR DETECTING DISEASE CLUSTERS
    2.1 Theoretical foundations of the S.S.S method
    2.2 Hypothesis testing
    2.3 The simulated dataset
        2.3.1 Hypothetical study area and population
        2.3.2 Hypothetical case population
        2.3.3 Datasets under the null hypothesis of no clustering
        2.3.4 Extracting the cluster candidates
        2.3.5 Datasets under the alternative hypothesis of clustering
            2.3.5.1 Rationale behind the choice of these configurations of synthetic clusters
    2.4 Rogerson's Score Statistic
        2.4.1 Theory
    2.5 Diagnostics
    2.6 Computational Scheme
    2.7 Results
    2.8 Discussions and future directions
3. INVESTIGATING THE SPATIAL PATTERNS OF PROSTATE CANCER IN IOWA
    3.1 Background
    3.2 Methods
    3.3 Results
    3.4 Discussion
    3.5 Conclusion
    3.6 Contribution that this dissertation makes to the geography literature
REFERENCES
LIST OF TABLES
Table
2.1 Hold one validation for null hypothesis.
2.2 Hold one validation for alternative hypothesis.
2.3 Summary statistics of the simulated 3675 spurious clusters.
2.4 Shape, size, risk (signature) and the ability to recover simulated clusters.
2.5 The table illustrates the average sensitivity (ability to detect a cluster when it exists) and specificity (ability to classify an area that is not a cluster as such).
2.6 This table compares the sensitivity and specificity with which clusters are recovered by the S.S.S method and Rogerson's method. The higher the sensitivity, the better the cluster is recovered.
2.7 Cluster recovery using only rates and only shapes.
2.8 How do true clusters differ in shape and size from spurious clusters.
LIST OF FIGURES
Figure
1.1 This figure displays the statistical significance of accidents per square kilometer (a p-map over densities), where accidents have been randomly scattered across the study area. A 30 meter grid was laid over the entire study area and a 600 meter filter was used to estimate the accident densities. The black areas are significant noisy clusters.
1.2 This figure displays a spurious cluster detected by Duczmal's Simulated Annealing based SaTScan method. This cluster has a high, statistically significant likelihood value.
1.3 In the geographic area, 42 people are distributed over a uniform grid. Each circle represents an individual. They are color coded white to indicate that they are healthy.
1.4 A noise or spurious cluster generating process operates at the scale of the entire geographical area. No person is at a greater risk of disease than any other. All people are at a risk of 0.24. Diseased people are randomly distributed over the map. These diseased people are color coded black to indicate a diseased state.
1.5 A boundary is drawn around those people who are diseased. This represents our gerrymandered cluster. Note the highly irregular and large shape of the cluster.
1.6 In contrast to Figure 1.4, a cluster generating process operates on this geographic area. The cluster generating process predisposes the people living in the area bounded by the dotted lines to a greater risk than other areas of the map. These people are at a risk of 0.56. In one realization of the process, a cluster of 10 people are therefore diseased in this area.
1.7 The cluster is then enclosed within a boundary. Note the relatively regular shape of the cluster (compared to a random distribution of diseased people).
1.8 People are distributed non uniformly over space.
1.9 The entire geographic space is subject to the same risk (0.24) noise generating process. The resulting 10 diseased people and the gerrymandered cluster are shown.
1.10 The cluster generating process in Figure 1.6 operates on the inhomogeneously distributed population. The risk elevation is the same as in Figure 1.6 (0.56). This causes 8 people to fall ill from an at-risk population of 14.
1.11 The estimated cluster shape and size is very different from what the shape and size of the cluster is in reality (the dotted line in Figure 1.10). It is also very different from what was obtained for a homogeneous distribution of people in Figure 1.6.
1.12 Now a cluster generating process operates on this space. The white river within the dotted lines is the area of excess risk. People living within this area are at an excess risk of disease.
1.13 Assuming an inhomogeneous distribution of people as in Figure 1.8 and a risk elevation of 0.71, we see that a certain number of people (10) within the area of excess risk are diseased.
1.14 The gerrymandered cluster now encloses the diseased people. Note the highly irregular and large shape of this cluster.
1.15 Two cluster generating processes of circular shape and risk elevation of 0.75 operate on a homogeneous distribution of people.
1.16 The clusters that are estimated from this have the same triangular shape. This is highly unlikely in reality.
1.17 In this example a slightly larger area of increased risk is considered than in the earlier example. 6 people in each of the two clusters are subject to a risk of 0.5, which results in 3 of them becoming cases/falling ill.
1.18 The clusters that are generated have very different shapes. In fact the larger the area of increased risk, the greater the number of possible shapes and sizes of the estimated cluster.
1.19 In this example people are inhomogeneously distributed. The same cluster generating process as in Figure 1.15 gives rise to two circular areas of increased risk where the risk elevation is 0.5.
1.20 The two clusters generated have very different shapes. There is no configuration of cases within the clusters for which two estimated clusters could have the same shape.
2.1 Using echelons to extract cluster candidates.
2.2 A set of 50,000 cardiovascular disease mortality cases are randomly distributed by population weights to each of 942 ZCTAs in the state of Iowa. A pattern is then extracted using Spatial Filtering. The pattern is binarized, and the resulting polygon cluster candidates are extracted using a GIS.
2.3 An example set of spurious cluster signatures S(ZN) in signature space.
2.4 An example set of spurious cluster signatures S(ZN) in signature space with a few candidate clusters (grey squares).
2.5 Bounding rectangle for elliptical footprint.
2.6 Flowchart of the S.S.S method.
2.7 Population distribution of ZCTAs in Iowa, 2000.
2.8 This figure displays the computational process used to create the simulated dataset. Each bin is labeled as k and has a specific size. For the simulations in this research n=942.
2.9 The simulated datasets follow a multinomial distribution.
2.10 Summary of shapes of simulated spurious clusters, frequency and cumulative frequency.
2.11 Summary of sizes of simulated spurious clusters, frequency and cumulative frequency.
2.12 Summary of rates of simulated spurious clusters, frequency and cumulative frequency.
2.13 Characteristics of the four clusters simulated under the alternative hypothesis.
2.14 Cluster detection diagnostics (the key to the numbers is in the text).
2.15 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-4. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 62% of the true cluster pattern, while the Score statistic is able to identify 20%.
2.16 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-3. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 98% of the true cluster pattern, while the Score statistic is able to identify 91%.
3.1 Spatial patterns of prostate cancer incidence (1999-2004) in Iowa.
3.2 Cluster of prostate cancer incidence in Iowa, detected by the S.S.S method.
3.3 Cluster detected by SaTScan when the geometry of the cluster is assumed to be ellipsoidal.
3.4 Cluster detected by SaTScan when the geometry of the cluster is assumed to be circular.
3.5 Large secondary cluster with low elevation in risk detected by Kulldorff's SaTScan when the geometry of the cluster is assumed to be elliptical.
3.6 ZCTAs in Iowa with a significant value of Rogerson's Score statistic.
3.7 Expected number of cases in ZCTAs: Entire Iowa versus areas with a significant value of Rogerson's Score statistic.
3.8 ZCTAs in the North West Iowa cluster of high prostate cancer incidence.
3.9 County boundaries with ZCTAs in the North West Iowa cluster of high prostate cancer incidence.
3.10 Change in mortality and incidence rates from 1990-2004 in five counties (Dickinson, Clay, Buena-Vista, Emmet and Clay Counties) in the cluster. The expected counts for the particular year (1990, 1991...2000) are calculated using 2000 census population for the local area, and incidence/mortality information for the state of Iowa (same procedure as indirect standardization).
3.11 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Dickinson County for the years 1990-2004.
3.12 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Clay County for the years 1990-2004.
CHAPTER 1: DETECTING CLUSTERS OF DISEASE: INVESTIGATING
SPURIOUS CLUSTERS
1.1 Statement of Purpose

This dissertation offers a new cluster detection method. This method looks at the
cluster detection problem from a new perspective. I change the question "What do real
clusters look like?" to the questions "What do spurious clusters look like?" and "How
do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be
identified from their geographical characteristics. These are related to the spatial
distribution of people at risk, the shape and scale of the geographic units used to
aggregate these people, the shape and scale of the spatial configurations that the disease
mapping or cluster detection method may impose on the data, and the shape and scale of
the area of increased risk. The statistical testing process may also create spurious clusters.
I propose that the problem of spurious clusters can be resolved using a computational
geographic [1] approach. I argue that Monte Carlo simulations can be used to estimate
the patterns of spurious clusters in any situation of interest given knowledge of the first
three of these four determinants of spurious clusters. Then, given these determinants,
where real measurements of disease or mortality are known, it is possible to show those
areas of increased risk that are true clusters as opposed to those that are spurious clusters.
The extent of similarity (or dissimilarity) of a cluster to the simulated spurious clusters
influences whether it can be recovered. The experiments show that this method is
successful in detecting clusters. This method can also predict with reasonable certainty
which clusters can be recovered, and which cannot. I compare this method with
Rogerson's Score statistic method [2]. These comparisons expose the weaknesses of
Rogerson's method. Finally, these two methods and the Spatial Scan Statistic [3] are
applied to searching for possible clusters of prostate cancer incidence in Iowa. The
implications of the findings are discussed.
1.2 Introduction
Disease mapping has a long history. From John Snow's cholera map to the
"intelligent agents" [4] of the present century, disease mapping has
progressed with developments in science, especially Geographical Information Systems
(G.I.S) and epidemiology. Some of the first disease maps were simple dot maps
indicating the location of disease cases. These gave way to maps of statistical summaries
known as "thematic maps". These maps convey more information than simple dot maps
and are therefore powerful exploratory and decision making tools. For example, when
mortality maps of lung cancer for the United States were made in the 1960s, high rates
were found in areas of the Eastern Seaboard [5, 6]. Later, these high rates were attributed
to exposure to asbestos among shipyard workers in these areas. A disease map can thus
be used to map spatial variations in disease risk. A decision maker can ask "Is a person
living in a given area at a greater risk of disease than a person living in another area?" or
"In which areas of the map do people have the greatest risk of disease?" In the disease
mapping literature the problem of finding areas of excess risk is often called "cluster
detection", a cluster being defined as "A geographically bounded group of occurrences of
sufficient size and concentration to be unlikely to have occurred by chance" [7] or, in
plain English, a geographic area of high disease risk. Detecting a geographical cluster is
therefore the spatial analogue of statistical clustering [8], where the aim is to find
things that are near in statistical space instead of geographical space.
While investigating the causal factors (or etiology) of areas of increased risk is
important, there are other important applications of these methods. Public health agencies
are often interested in allocating resources to areas with an increased burden of disease
[9, 10]. Cluster detection methods are used to identify such areas. Sometimes,
environmental policy is formulated on the basis of such studies. In
one instance, the Vatican was taken to task for operating radio transmitters at illegal
frequencies after studies showed an increased risk of cancer among people living close to
these transmitters [11, 12]. Note that policies are often formulated on the basis of
evidence that an increased risk exists even though the etiological basis for the increased
risk may not have been established. An interesting extension to etiological research is that
the presence of spatial clusters of increased risk could also be used to prove the existence
of disease risk factors that are spatially non random. For example, it has been claimed
that clusters of autism in California prove the existence of risk factors that are not related
to genetics or the vaccine hypothesis¹ (barring selective migration) [13]. Many public
health agencies maintain "on the fly" cluster investigation infrastructure to address
cluster related enquiries [14].
¹ The vaccine hypothesis is that exposure to thimerosal, a mercury based additive in
vaccines, is a risk factor for autism.
A number of methods exist that can be used to delineate clusters. A persistent
problem with many of these methods is that areas not at high risk are identified as
clusters. Some convenient terms for such false positives are "noise" [15], "noisy
clusters" or "spurious clusters" [16-19]. In this research I develop a method to detect and
adjust for the occurrence of spurious clusters in cluster detection studies.
The cluster detection literature identifies at least three types of spurious clusters.
The first arises when the estimate of risk in an area is based on a small number of people
[15]. These estimates of risk are unreliable, and therefore an area with an apparently
high rate may not have a significant excess risk. A number of solutions to this problem
exist [20-26]. The second type of spurious cluster stems from statistical issues in the
cluster detection method. For example, failing to adjust for multiple hypothesis testing
may give rise to spurious clusters [18, 27]. This problem is an area of active research [28].
Kulldorff's SaTScan method resolves this problem by adopting a likelihood based
hypothesis testing framework [3].
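The first type of spurious cluster, an unreliable risk estimate based on a small number of
people, can be illustrated with a short simulation. The sketch below is only an illustration
under stated assumptions: the risk of 0.05, the two population sizes and the function names
are hypothetical and are not taken from the studies cited above. It applies the same
underlying risk to a small and a large area and compares how much the observed rate varies
from one random realization to the next.

```python
import math
import random
from statistics import pstdev


def simulated_rates(population, risk, n_draws, rng):
    """Observed rates in one area over repeated realizations of the same risk."""
    rates = []
    for _ in range(n_draws):
        cases = sum(1 for _ in range(population) if rng.random() < risk)
        rates.append(cases / population)
    return rates


def binomial_standard_error(population, risk):
    """Theoretical standard error of a rate estimated from `population` people."""
    return math.sqrt(risk * (1 - risk) / population)


rng = random.Random(0)   # fixed seed so the illustration is repeatable
RISK = 0.05              # hypothetical risk, identical in both areas
small_area = simulated_rates(50, RISK, 300, rng)
large_area = simulated_rates(2000, RISK, 300, rng)

print("spread of rates, 50 people:  ", round(pstdev(small_area), 4))
print("spread of rates, 2000 people:", round(pstdev(large_area), 4))
print("highest rate, 50 people:     ", max(small_area))
print("highest rate, 2000 people:   ", max(large_area))
```

Although no area is at elevated risk, the small area occasionally shows a rate far above
0.05, which is how a small population can produce an apparent cluster by chance alone.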
The third type of spurious cluster is created by a mismatch between the scale and spatial
structure of the process that generates the cluster and the scale and spatial structure used
to measure that process. The scale and spatial structure, or spatial form, of the cluster
search process (which measures or samples the underlying data) can generate spurious
clusters. Unlike the other sources of spurious clusters, very little research exists on this
form of noise. There are a number of reasons for this. Until recently, the computational
power available to researchers for cluster detection problems was limited. A cluster can
have any geometry or spatial form in reality. However, limited computational
power confined researchers to searching for clusters within a small range of spatial forms.
For instance, it is a common strategy to search for circular clusters. This strategy was
adopted by some of the first cluster search methods [27], and remains common today
[29]. If the real cluster is not circular in shape, then the power to detect it
is greatly reduced. But a limited search also implies that the likelihood of
mismatch between the circles and the underlying true cluster is also limited (given that
the spatial form of this true cluster is unknown). In contrast, if the cluster search
incorporates a number of different spatial forms, then the likelihood of mismatch
increases. Since computational power is no longer a limiting factor, some researchers
have developed "shape free" disease cluster detection methods. These methods, which draw
from the work of geographers in the 1960s and 70s [30], measure spatial attributes (like
disease counts or rates) at a large number of possible shapes, sizes and scales. The
measured spatial attributes, or some functions of the attributes, are used to decide if an
area of a given shape and size at a given scale is a cluster or not. For example, Duczmal's
[31] scan assigns a likelihood value to each cluster it finds, where the likelihood is a
function of attributes such as the observed number of cases in the cluster. The candidates
with the highest likelihood are most likely to be clusters. These methods thus promise to
seek out the true clusters, no matter what their spatial form. However, this also means
that at some shape and scale, noise or spurious clusters will be detected. These spatial
forms will represent a mismatch between the shape and scale of the process that
generated the pattern and the shape and scale of the process being used to detect it. The
closest analogy is what is known in the disease mapping literature as the "Texas
Sharpshooter Effect". If a shotgun is fired at a wall,
then the wall is splattered with seemingly random bullet holes. At the scale of the wall,
the process is random. However, it is always possible to draw targets a posteriori around
the bullet holes. The act of drawing a target is similar to searching for a cluster at a scale
different from the scale at which the original process occurred (the entire wall).
Duczmal's search procedure thus often finds clusters that are spurious. Such spurious
clusters will be found by any method that offers even the least amount of geometric
freedom to the cluster search. In fact, these spurious clusters have been found even when
the search is limited to circular geometries (for example, see Kulldorff [32]). Tackling
this problem therefore requires a) a thorough understanding of what gives rise to these
spurious clusters and b) a method to solve or, at the very least, manage this
problem. This dissertation is an attempt at both.
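The Texas Sharpshooter Effect can be reproduced with a few lines of simulation. The sketch
below is an illustration only, not any of the methods cited above: the unit-square study
area, the 100 cases, the square search window and all function names are assumptions made
for the example. Cases are scattered completely at random, and the count in a window fixed
in advance is compared with the largest count found by searching a grid of windows after the
fact.

```python
import random


def count_in_window(points, cx, cy, half):
    """Number of points inside the square window centered on (cx, cy)."""
    return sum(1 for x, y in points if abs(x - cx) <= half and abs(y - cy) <= half)


def scan_max(points, half, steps):
    """Largest count over a grid of overlapping windows: the target drawn after shooting."""
    centers = [half + i * (1 - 2 * half) / (steps - 1) for i in range(steps)]
    return max(count_in_window(points, cx, cy, half) for cx in centers for cy in centers)


rng = random.Random(42)                 # fixed seed so the illustration is repeatable
N_CASES, HALF, STEPS, N_MAPS = 100, 0.1, 9, 200
expected = N_CASES * (2 * HALF) ** 2    # cases expected in any one window under randomness

fixed_counts, scanned_counts = [], []
for _ in range(N_MAPS):
    cases = [(rng.random(), rng.random()) for _ in range(N_CASES)]
    fixed_counts.append(count_in_window(cases, 0.5, 0.5, HALF))   # window chosen in advance
    scanned_counts.append(scan_max(cases, HALF, STEPS))           # window chosen after looking

mean_fixed = sum(fixed_counts) / N_MAPS
mean_scanned = sum(scanned_counts) / N_MAPS
print(f"expected per window: {expected:.1f}")
print(f"mean count, window fixed in advance: {mean_fixed:.2f}")
print(f"mean of the maximum over {STEPS * STEPS} windows: {mean_scanned:.2f}")
```

The window chosen in advance averages close to the expected count, while the best window
found by searching is consistently well above it even though the process is random at the
scale of the whole map. Allowing more shapes and sizes in the search would only widen this
gap.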
It is clear that an understanding of this problem requires an understanding of the scale
and shape of the spurious cluster or noise generating process. The shape, size and risk
elevation of a cluster, whether spurious or real, are unique to each and every disease
mapping/cluster detection situation. The characteristics (shape, size and risk elevation) of
a cluster depend on: a) the cluster generating process, especially the shape and size of
the area of excess risk, b) the spatial distribution of people over space and c) the scale at
which the spatial data are aggregated [19]. These factors are unique to each disease
mapping situation, and they are responsible for creating spurious
clusters. Two conclusions follow: 1) every disease
mapping situation has a unique noise or spurious cluster signature, and 2) it is not
possible to guess this signature a priori. However, this signature may be computed as
explained below.
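As a rough indication of how such a signature might be computed, the sketch below simulates
maps under a null model in which no area has elevated risk. It is a minimal illustration
under stated assumptions, not the procedure developed in Chapter 2: it assumes that cases
are allocated to areas at random in proportion to population (a multinomial model), and the
five populations, the case total and the function name are hypothetical. Only the rate axis
of the signature is shown, summarized here by the highest relative risk on each simulated
map.

```python
import random
from collections import Counter


def simulate_null_counts(populations, total_cases, rng):
    """Allocate cases to areas at random, with probability proportional to population."""
    areas = range(len(populations))
    draws = rng.choices(areas, weights=populations, k=total_cases)
    counts = Counter(draws)
    return [counts.get(i, 0) for i in areas]


rng = random.Random(1)                       # fixed seed so the illustration is repeatable
populations = [1200, 300, 4500, 800, 2200]   # hypothetical areas
TOTAL_CASES = 90
overall_rate = TOTAL_CASES / sum(populations)

null_maps = [simulate_null_counts(populations, TOTAL_CASES, rng) for _ in range(1000)]

# Highest relative risk on each simulated map: how extreme noise alone can look.
max_relative_risk = [
    max(cases / (pop * overall_rate) for cases, pop in zip(counts, populations))
    for counts in null_maps
]
mean_max = sum(max_relative_risk) / len(max_relative_risk)
print(f"mean of the highest relative risk under the null: {mean_max:.2f}")
```

Because the simulation uses the study area's own populations and areal units, the resulting
distribution is specific to that situation, which is the sense in which the signature cannot
be guessed in advance but can be computed.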
Since each disease mapping situation has a unique noise or spurious cluster
signature, it follows that in every disease mapping situation there will be some clusters
which will be hard to detect. These clusters will be in some ways similar to the spurious or
noisy clusters. This issue, the issue of "recoverability", has only just started being
discussed in the disease mapping literature [33, 34]. The method I describe incorporates the
following features. First, it extracts cluster candidates using an exploratory approach.
Second, shape, size and rate are used to distinguish true clusters from spurious clusters.
Third, the method incorporates recoverability of clusters into the analyses. The researcher
is able to know (computationally) a priori what spatial forms of clusters are recoverable.
The method utilizes computational geography and two fundamental geographic aspects of
clusters, shape and size, to analyze the recoverability of clusters and to separate true
clusters from spurious clusters. This dissertation diverges from the traditional
disease clustering literature in taking shape and size into consideration. Traditionally only
the rate at a given location or some function of the rate is used to separate a true cluster
from a spurious one. Since the method incorporates the shape and size of the cluster in its
analysis, I call it the Shape, Size Sensitive disease cluster detection method or the S.S.S
method. The S.S.S method is tested and validated on simulated data. This method
demonstrates the power of computational geography over traditional methods [35]. The
ideas and methods developed and tested in this dissertation are either new, or have been
discussed only in scant detail in the literature. Yet, they are fundamental to geography
and disease mapping. This research thus makes an important contribution to the disease
mapping literature.
1.3 Organization of the dissertation

In this chapter (Chapter 1) I discuss how various disease mapping and cluster
detection techniques approach the problem of spurious clusters. I then argue that these
methods do not address the issue of spurious clusters adequately. I suggest that a
geographical approach can help us better understand the problem and explain how
geography gives rise to spurious clusters. Then, having understood the geographical
bases for spurious clusters I propose a geographically sensitive disease cluster detection
method. I explain this method, the Shape Size Sensitive (S.S.S) method, in Chapter 2.
Then, using simulated data, I test the sensitivity of this method. I also compare the
performance of the S.S.S method with Rogerson's Score statistic method for detecting
disease clusters. The final, short chapter is Chapter 3. Here I use the S.S.S method,
Rogerson's Score statistic and Kulldorff's Spatial Scan Statistic to investigate the spatial
patterns of prostate cancer risk in Iowa. The implications of the findings are discussed.
1.4 Review of existing methods of cluster
detection
All disease mapping and cluster detection approaches share a common goal. This
is to uncover the underlying pattern of risk. These methods calculate statistics such as rates or
likelihoods, which serve as measures of risk. The "patterns" on a map are obtained by
mapping either these statistics, or those areas that cross some threshold of the calculated
statistic. When the second procedure is followed, that is, when the rate or the likelihood of an
area having an excess risk is statistically tested, the method is often called a cluster
detection method. Most cluster detection methods test a large number of areas which
could possibly be clusters. These are called "candidate clusters" [31, 36] or "cluster
candidates". If a cluster passes the statistical test, but demarcates an area where no
cluster exists in reality, then it is a "noisy cluster" [31] or "spurious cluster" [16-19]. The
term "true cluster" may be used to indicate geographic areas of excess risk. It is also
possible that a true cluster is suppressed by the cluster detection process. In the disease
cluster detection literature this problem is usually not discussed separately, but forms an
integral part of the spurious cluster detection problem. Spurious clusters may be created
at various stages in the disease mapping/cluster detection process. The first step in
applying a cluster detection method is to collect spatial data. These data may come pre-aggregated into administrative regions, or they may come in individual form [37, 38].
If the data are in the individual form, they need to be processed and aggregated
such that summary statistics may be gleaned from them and the summary statistics
mapped. The process of aggregation may create spurious clusters. One solution is to use
the individual level data to search for clusters [39]. While a number of methods will work
with both aggregated and individual level data, there are very few methods that have
been developed exclusively for individual level data [40, 41]. With better quality data
becoming increasingly available, such analyses will become more common [37, 42]. The
majority of disease mapping situations start with aggregated data, and summary statistics
are calculated from these datasets. When the summary statistics are calculated based on a
small base population (also called a small support size), these statistical estimates
are likely to be unreliable. This is the "small number problem". Some methods carry out
a process called "smoothing", where information from neighboring regions is used to
obtain a better estimate of the mapped statistic for a given region. This, to some extent,
alleviates the problem of spurious clusters created from small numbers. The statistical
testing procedure could also create spurious clusters. If multiple hypothesis tests are
carried out without adjustment, then this process may also give rise to spurious clusters. In a
famous example, Openshaw [27] carried out multiple hypothesis tests when searching for
leukemia clusters in Northern England. Whenever a test was significant, a circle was
drawn. Some of these circles were spurious clusters, and would not have existed if
adjustments for multiple testing were carried out. Sometimes, using the wrong reference
distribution may also create spurious clusters. Conversely, using overly conservative
multiple testing correction techniques may suppress true clusters [28]. Waller and
Gotway [4] write of situations where for a Poisson reference distribution, it is not
possible to distinguish a lack of fit to the Poisson distribution (spurious cluster) from a
rejection of the null hypothesis (true cluster). This is an area of active statistical research,
and some new and innovative solutions have been proposed to these problems [43, 44].
Kulldorff's SaTScan method uses a likelihood based hypothesis testing framework to
solve the problem of multiple testing [3]. Instead of testing multiple hypotheses, this
method tests only one hypothesis. This hypothesis test is carried out on the cluster
candidate that is most likely to be a cluster. The likelihood is a statistical function
that is calculated under the assumption that the observed data conform to certain known
distributions (e.g., Poisson or binomial).
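The effect of unadjusted multiple testing can be illustrated with a small simulation. The Python sketch below is only an illustration under stated assumptions, not a reconstruction of any study cited here: the number of areas, the expected count per area and the significance level are hypothetical values, and every area is pure noise (Poisson counts with no excess risk). Any area flagged by the unadjusted test is therefore a spurious cluster by construction; a Bonferroni adjustment is shown for comparison.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def poisson_upper_tail(observed, mean):
    """Return P(X >= observed) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k) for k in range(observed))
    return max(0.0, 1.0 - below)


def count_false_alarms(n_areas=500, expected=5.0, alpha=0.05, seed=1):
    """Test every area separately under pure noise; all 'significant' areas are spurious."""
    rng = random.Random(seed)
    p_values = [poisson_upper_tail(poisson_sample(rng, expected), expected)
                for _ in range(n_areas)]
    unadjusted = sum(p <= alpha for p in p_values)
    bonferroni = sum(p <= alpha / n_areas for p in p_values)
    return unadjusted, bonferroni


unadjusted, bonferroni = count_false_alarms()
print("areas flagged without adjustment:", unadjusted)
print("areas flagged with Bonferroni adjustment:", bonferroni)
```

The Bonferroni threshold can only flag a subset of the areas flagged without adjustment, which also shows why an overly conservative correction may suppress true clusters.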
There still remains the third source of spurious clusters. Unlike the first two, there
is little research on this source of spurious clusters. This is when spurious clusters are
created from mismatch between the process that generates the disease map patterns, and
the processes used to recover the patterns. This mismatch could arise when the data are
aggregated to administrative regions, or to other shapes and scales by the method of
analysis. In this section I discuss the various methods of cluster
detection in the context of their ability to handle this problem. Among the various methods
available, some offer the opportunity of multiscalar analysis. In these methods,
the data may be geographically rescaled. While these methods geographically process the
data before mapping patterns, other methods treat geographic
boundaries as inviolable. The latter attempt to expose the underlying risk pattern by
mapping summary statistics within existing geographic boundaries without any further
geographic processing of the data.
1.4.1 Map data without further geographic
processing
In these methods the geographic boundaries of regions are left as they are;
however, various statistical manipulations are carried out on the data. Some researchers
prefer to call this group of methods disease mapping methods [45]. As I discussed
earlier, these methods can again be subdivided into two groups: methods that smooth the
data and methods that do not smooth the data.
1.4.1.1 Methods that do not smooth the data
The vast majority of disease maps are maps of raw rates, where the number of
cases per unit population within existing geographic regions such as counties or states is
mapped [46]. Another approach is a "map of probabilities" [47, 48], where instead of
mapping a rate, the probability of observing the rate within a geographic region is
mapped. Mapping raw rates is often problematic when the rates are based on small base
populations [15]. The maps thus produced are likely to display noisy (small number
problem) patterns.
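A short calculation makes the small number problem concrete. The sketch below assumes a simple binomial model for the case count in a region, which is an assumption of this illustration and not a model prescribed in the text; the risk and population values are hypothetical. Under that model the standard error of a raw rate is sqrt(p(1 - p)/N), so the same underlying risk yields far less reliable rates in small populations.

```python
import math


def rate_standard_error(risk, population):
    """Standard error of a raw rate (cases / population) under a binomial model."""
    return math.sqrt(risk * (1.0 - risk) / population)


risk = 0.001  # hypothetical common risk: 1 case per 1,000 people
for population in (500, 5_000, 500_000):
    se = rate_standard_error(risk, population)
    # The relative error is the standard error expressed as a multiple of the true risk.
    print(f"population={population:>7}  expected cases={risk * population:8.1f}  "
          f"relative error={se / risk:6.2f}")
```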
1.4.1.2 Methods that smooth the data
In these methods various statistical manipulations are used to "smooth" the rates
in each region while at the same time keeping the geographic boundaries intact.
Information from neighboring regions is used to stabilize the rates in a given region.
Some examples of this approach can be found in the Bayesian disease mapping literature
[23, 24]. Other examples are the method of moving averages and headbanging [20,
22]. These methods are not very successful in dealing with the problem of spurious
clusters. A study by Kafadar [22] has shown that many of the popular smoothers such as
headbanging and empirical Bayes are unable to detect true patterns in the data or have
problems with spurious patterns or clusters. Some of the methods smooth the data
by averaging rates over kernels or filters. For example, Sabel et al. [49] investigate rates
of Amyotrophic Lateral Sclerosis (Lou Gehrig's disease) incidence in Finland by
smoothing rates using Gaussian kernels. Another method is Rogerson's Local Score
statistic [2, 4, 50]. In this method the deviations from the expected rate are smoothed
using Gaussian kernels. As with other methods, if the rates are based on small numbers,
then smoothing these unreliable rates may create spurious clusters. I use Rogerson's
Score statistic in my research and therefore this method is discussed in detail in later
sections. Spurious clusters are often created by these methods. First, because these
methods map the rates based on small areas before smoothing them, they are prone to the
small number problem. Second, these methods do not in any way attempt to deal with the
problem of spurious clusters from spatial mismatch discussed earlier. Third, the statistical
tests that these methods carry out may not be able to distinguish spurious clusters from
true clusters. For example, there is no consensus on what the correct reference
distribution is for Rogerson's Score statistic [2, 4, 50].
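The idea of smoothing deviations with a Gaussian kernel can be sketched as follows. This is a generic kernel-weighted average written for illustration; it is not claimed to be the exact formulation of Rogerson's Score statistic, and the centroids, counts, standardization by the square root of the expected count, and bandwidth are all hypothetical choices.

```python
import math


def gaussian_smooth(points, values, bandwidth):
    """Return the Gaussian-kernel weighted average of `values` at each location in `points`."""
    smoothed = []
    for xi, yi in points:
        weights = [math.exp(-((xi - xj) ** 2 + (yi - yj) ** 2) / (2.0 * bandwidth ** 2))
                   for xj, yj in points]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed


# Hypothetical data: four regions along a line, each with 3 expected cases.
centroids = [(0, 0), (1, 0), (2, 0), (3, 0)]
observed = [4, 1, 2, 9]
expected = [3.0, 3.0, 3.0, 3.0]
deviations = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]
print(gaussian_smooth(centroids, deviations, bandwidth=1.0))
```

Because the inputs are deviations based on only three expected cases per region, the smoothed surface inherits their instability, which is the weakness described above.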
A separate group of methods that often smooth the data are local measures of
spatial similarity. These methods, which are also known as LISA (Local Indicators of
Spatial Autocorrelation) [51], address the question: "How similar is the risk at a given
small area to that of its neighbors?" The greater the similarity, the higher the likelihood
that the small area belongs to (or is) a cluster. Some of the LISA statistics are local
Moran's I and local Geary's C [50-54]. Since the underlying philosophy of this approach
is that things nearer are more similar than things farther away [55], the implicit definition
of scale here is the distance at which this similarity is manifested. Thus a process that acts
at a large scale may cause more similarity among immediately neighboring local areas than
processes that work at a smaller scale. As with other methods, if the statistics are calculated
on small areas, they could be unreliable. The reference distributions of LISA statistics are
often not known [4] and the scale at which a process operates is not investigated before
LISA statistics are calculated. Any of these factors could lead to the creation of spurious
clusters.
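As an illustration of a LISA statistic, the sketch below computes local Moran's I in its common form, I_i = (z_i / m2) * sum_j w_ij z_j, where z_i is the deviation of area i from the mean and m2 is the mean squared deviation. The row-standardized binary adjacency weights and the four-area example are assumptions made for this illustration only.

```python
def local_morans_i(values, neighbors):
    """Local Moran's I for each area, using row-standardized binary adjacency weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # mean squared deviation; zero if all values are equal
    result = []
    for i in range(n):
        # Spatial lag: the average deviation among the neighbors of area i.
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
        result.append(z[i] / m2 * lag)
    return result


# Hypothetical rates for four areas arranged in a chain: 0 - 1 - 2 - 3.
rates = [1.0, 1.0, 5.0, 5.0]
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(local_morans_i(rates, chain))
```

The end areas, whose single neighbor has the same rate, score highest, while the two middle areas, each with one similar and one dissimilar neighbor, score zero. The result depends entirely on the chosen neighborhood, which is the implicit definition of scale noted above.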
1.4.2 Methods that pre-process the data
before calculating and/or testing for significant
disease risk
These methods allow the modification of geographic boundaries to extract the
underlying risk surface and/or to find which area has the greatest excess in disease risk.
One group of methods, often called density estimation methods [56], simply ignore
existing geographic boundaries. Drawing from the "field" theory of geographic
phenomena [20], they consider that disease risk patterns are continuous in nature and that
they do not change or stop abruptly at geographic boundaries. When appropriately used,
these methods provide the opportunity to control the spatial basis of support, and thus the
scale of the analysis [57, 58]. The other group of methods draw from concepts of "region
building" which were developed by geographers [30]. One approach to building regions
is to coalesce groups of areas to build aggregate regions. These methods attempt to find
that combination of areas which has the greatest likelihood of being a zone of high
disease risk. A third group of methods combine concepts of region building methods with
the first group of methods or with methods discussed in the last section. The ability of all
these methods is limited by the scale of the data. Often the data come aggregated into
small areas and the analysis must be carried out at scales equal to or greater than the scale of
aggregation. Nevertheless, these methods are better equipped than other methods to
control the shape and the scale of the data, and this gives them an edge over other
methods when dealing with the problem of spurious clusters.
1.4.2.1 Non combinatorial approaches

These methods ignore geographic boundaries and attempt to extract the
underlying patterns of risk. They often lay a uniform grid over the map area and measure
the statistic of interest at each grid point. Irrespective of whether the data are aggregated
or not, a value can be obtained at each grid point. While there are a number of approaches
to calculating the statistic at each grid point [21], a simple and common approach is to
"filter" the data using circular spatial filters [3, 9, 21, 27]. Some methods map the statistic
calculated at each grid point [9] while others do not [3]. These circles can be of fixed or
varying sizes. However, since these filters are of a certain shape, they bias the cluster
search. The bias is in favor of detecting clusters of, or similar to, the shape of the filter
(circles in this case). Statistically, the clusters that are of the shape of the filter have a
higher power of detection than clusters of other shapes. This approach therefore
overcomes the limitation outlined in the methods discussed earlier, but is limited in its
treatment of geographic shape. Ellipses and other geometric shapes have also been
studied [29, 59]. One of the methods, based on Rushton's Adaptive DMap [9], maps
rates at grid points using adaptive filters and interpolates these with an IDW (Inverse
Distance Weighting) interpolation algorithm. The adaptive filter [58, 60] ensures that the
rates are based on the same number of people or the same support size. Thus, unlike the
LISA methods, all statistics are equally reliable. Also, the use of an adaptive filter
ensures that the scale of the analysis can be precisely controlled. The Inverse Distance
Weighting Algorithm used for creating the final pattern was also found by Kafadar [22]
to be the least noisy of all smoothing/interpolation methods. Thus, by allowing
multiscalar analysis and relative freedom of cluster shape (clusters don't have to conform to
geographic boundaries), and by using a robust interpolation technique, Rushton's Adaptive
Filtering method is best suited for dealing with the problem of spurious clusters from
mismatch between the process and analysis scales. I use this method in my analyses.
Another important density estimation method is Kulldorff's SaTScan [3]. While the
DMap method maps the extracted pattern, and is therefore good for visualizing and
exploring the underlying pattern, SaTScan can be used to map only those areas that are
significant clusters. SaTScan has found wide acceptance in the public health community
because of its ability to account for the multiple hypothesis testing problem and because of its robust,
freely available software. Some of the recent developments in the disease clustering
literature have followed the combinatorial approaches that I discuss next, and their
method of choice has been based on the Spatial Scan Statistic method of cluster
detection. Since multiple testing is an issue with these combinatorial approaches, the
Spatial Scan Statistic is a reasonable choice. Since I use the Spatial Scan Statistic in
Chapter 3 to investigate clusters of prostate cancer in northwest Iowa, some of the
details of the Spatial Scan Statistic are provided next.
The scan statistic originated as a one dimensional test. Its objective was to test if a
one dimensional point process is purely random. The one dimensional scan
statistic was extended by Kulldorff into the spatial domain [3]. The spatial scan statistic
moves a circle across the study area. The circle is centered on a centroid. The centroid
could be the location of a single individual for unaggregated data, the centroid of a census
tract (for example) for aggregated data, or one of a set of grid points. Kulldorff (1997) [3]
states: "The zone defined by a circle consists of all individuals in those cells whose
centroids lie inside the circle and each zone is uniquely identified by these individuals.
Thus, although the number of circles is infinite the number of zones will be finite. For
unaggregated data the zones are perfectly circular, that is, the individuals in the zone are
exactly those located within a defining circle. With data aggregated into census districts,
a zone may have irregular boundaries that depend on the size and the shape of the several
contiguous census districts it includes." The Spatial Scan Statistic is implemented
through the freely available software SaTScan [32]. The methodology of the Spatial Scan
Statistic is explained as follows. The method involves two steps: 1. confounder
adjustment and 2. hypothesis testing.
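The statement that infinitely many circles produce only finitely many zones can be illustrated with a short enumeration. The sketch below is a simplified illustration, not SaTScan's implementation: it assumes circles are centered only on area centroids, grows each circle through the distances to the other centroids, and records each distinct set of enclosed areas once. The three centroids are hypothetical.

```python
import math


def circular_zones(centroids):
    """Enumerate the distinct zones (sets of area indices) defined by centroid-centered circles."""
    zones = set()
    for cx, cy in centroids:
        dists = [math.hypot(x - cx, y - cy) for x, y in centroids]
        # Only radii equal to a centroid distance can change the membership of the circle.
        for radius in sorted(set(dists)):
            zones.add(frozenset(k for k, d in enumerate(dists) if d <= radius))
    return zones


# Hypothetical centroids of three areas lying on a line.
centroids = [(0, 0), (1, 0), (3, 0)]
zones = circular_zones(centroids)
print(len(zones), sorted(sorted(z) for z in zones))
```

For these three areas the circles define six zones, one fewer than the seven non-empty subsets: the two end areas cannot form a zone without the middle one. This is the sense in which a circular filter restricts the shapes that can be detected.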
In disease cluster detection studies, known risk factors or "confounders" are
adjusted for before the cluster detection algorithm is implemented. For example, it
is known that age is associated with prostate cancer. It may be desirable to remove the
effect of age from the analyses, such that the clusters that are detected reflect the presence
of other, yet unknown, risk factors. The confounder adjustment procedure that SaTScan
utilizes is known as the indirect standardization method. It is as follows:
If
$e_i$ = expected number of cases in local area/ZCTA $i$ after confounder adjustment,
$n_i$ = observed number of cases in local area/ZCTA $i$ after confounder adjustment,
$r$ = a specific confounder group, for example the age group 45-65 years (the sum below runs over all confounder groups),
$n_r$ = total number of cases in $G$ in age group $r$,
$N_r$ = total number of people in $G$ in age group $r$,
$N_{ir}$ = total number of people in local area $i$ of $G$, in age group $r$,
then the confounder adjustment procedure is:

$$e_i = \sum_{r} \left[ \frac{n_r}{N_r} \, N_{ir} \right]$$
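Indirect standardization can be sketched in a few lines. The code below reads the procedure as applying region-wide, group-specific rates to each area's population, that is, e_i = sum over r of (n_r / N_r) * N_ir, where N_r (the total population of G in group r) is computed here by summing N_ir over areas. The function name, the two age groups and all counts are hypothetical.

```python
def expected_cases(cases_by_group, population_by_area_group):
    """Indirectly standardized expected cases: e_i = sum_r (n_r / N_r) * N_ir."""
    n_groups = len(cases_by_group)
    # N_r: total population of the whole region in confounder group r.
    group_totals = [sum(area[r] for area in population_by_area_group)
                    for r in range(n_groups)]
    return [sum(cases_by_group[r] / group_totals[r] * area[r] for r in range(n_groups))
            for area in population_by_area_group]


# Hypothetical data: two age groups (columns) and three local areas (rows).
cases_by_group = [10, 40]                               # n_r
population = [[1000, 200], [3000, 300], [1000, 500]]    # N_ir
e = expected_cases(cases_by_group, population)
print(e, sum(e))
```

With these figures the third area, which has the oldest population, receives the largest expected count even though it is not the most populous, and the expected counts sum to the total number of observed cases.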
The adjusted numbers of cases are then used to test the hypothesis that a given local
area/ZCTA i has an excess risk or belongs to a cluster. The hypothesis testing procedure is
explained next. The Spatial Scan Statistic tests the hypothesis that a given area of the map
(for example a collection of ZCTAs) has a greater (or lesser) risk than the rest of the
ZCTAs in the entire geographic region G.
Let Zj be the jth cluster candidate in Z, the collection of k possible clusters in G.
If R(inside, j) is the risk inside Zj and R(outside, j) is the risk outside Zj, then the null
and alternative hypotheses are:
H0: R(inside, j) = R(outside, j)
H1: R(inside, j) > R(outside, j)
The observed number of cases nj inside (or outside) a cluster candidate is assumed to be
Poisson Distributed, and a function of the expected number of cases in the cluster ej and
the risk R(inside, j) .
Let $n = \sum_{i=1}^{k} \sum_{r} N_{ir}$
$n_j \sim \mathrm{Poisson}\left[ e_j \cdot R(\mathrm{inside}, j) \right]$
The likelihood ratio that is formed from these null and alternative hypotheses is as
follows:
Likelihood ratio = Likelihood (R(inside, j) > R(outside, j) ) / Likelihood(R(inside, j) = R(outside, j) )

This likelihood ratio can be solved and written in the logarithmic form as follows:
Log Likelihood Ratio or LLRj = (nj ln (nj/ ej)) + ((n- nj) ln [(n- nj)/(n- ej)])
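The log likelihood ratio above can be evaluated directly. In the sketch below, n_j and e_j are the observed and expected cases inside the candidate and n is the total appearing in the formula, taken here to be the total number of cases in the region, which is an assumption of this illustration. Returning zero when the candidate has no excess (n_j <= e_j) is also an assumption, made to match the one-sided alternative hypothesis; the numerical values are hypothetical.

```python
import math


def log_likelihood_ratio(n_j, e_j, n):
    """LLR_j = n_j ln(n_j / e_j) + (n - n_j) ln[(n - n_j) / (n - e_j)] for an excess-risk candidate."""
    if n_j <= e_j:
        return 0.0  # assumption: candidates without excess risk carry no evidence for H1
    inside = n_j * math.log(n_j / e_j)
    # When every case lies inside the candidate the outside term vanishes (0 * ln 0 is taken as 0).
    outside = (n - n_j) * math.log((n - n_j) / (n - e_j)) if n_j < n else 0.0
    return inside + outside


# Hypothetical candidate: 20 observed against 10 expected cases, out of 100 cases in total.
print(log_likelihood_ratio(20, 10.0, 100))
```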
The significance of the log likelihood ratio is tested using a Monte Carlo
hypothesis test. The SaTScan program carries out a user-specified number of Monte
Carlo randomizations of the data and tests for the presence of a cluster at the 0.001 % (the percentage can be user
specified too) significance level. A p value is reported. This is
calculated as "p value = Rank of LLR / (1 + #simulation)." Note that the spatial scan
statistic procedure does not adjust for multiple testing in the traditional sense, for example
by carrying out a Bonferroni or other multiple testing adjustment procedure. Instead, it
avoids the problem of testing multiple hypotheses by concentrating on those cluster
candidates that are most likely to be true clusters (and thus have the highest log likelihood
value). Also note that the Spatial Scan Statistic procedure explained above is the spatial
Poisson model, which is the model used in disease mapping. There are numerous other
modifications to the Spatial Scan Statistic procedure [29].
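The Monte Carlo rank test can be sketched as follows. This is a simplified illustration and not SaTScan's code: it assumes that each replicate redistributes the total number of cases among areas with probabilities proportional to their expected counts, that the test statistic is the maximum log likelihood ratio over a fixed list of candidate zones, and that ties count against the observed value. The zones and counts are hypothetical.

```python
import math
import random


def llr(n_j, e_j, n):
    """Log likelihood ratio for an excess-risk candidate (zero when there is no excess)."""
    if n_j <= e_j:
        return 0.0
    outside = (n - n_j) * math.log((n - n_j) / (n - e_j)) if n_j < n else 0.0
    return n_j * math.log(n_j / e_j) + outside


def max_llr(cases, expected, zones):
    """Largest LLR over all candidate zones: the statistic of the most likely cluster."""
    n = sum(cases)
    return max(llr(sum(cases[i] for i in z), sum(expected[i] for i in z), n) for z in zones)


def monte_carlo_p_value(cases, expected, zones, n_sim=999, seed=0):
    """p value = rank of the observed maximum LLR / (1 + number of simulations)."""
    rng = random.Random(seed)
    n = sum(cases)
    observed = max_llr(cases, expected, zones)
    rank = 1  # the observed data set counts as one replicate
    for _ in range(n_sim):
        simulated = [0] * len(cases)
        for area in rng.choices(range(len(cases)), weights=expected, k=n):
            simulated[area] += 1
        if max_llr(simulated, expected, zones) >= observed:
            rank += 1
    return rank / (n_sim + 1)


# Hypothetical map: six areas on a line; zones are single areas and adjacent pairs.
cases = [2, 3, 12, 11, 3, 2]
expected = [5.5] * 6
zones = [(i,) for i in range(6)] + [(i, i + 1) for i in range(5)]
print(monte_carlo_p_value(cases, expected, zones, n_sim=199))
```

Because the maximum is taken over all zones in every replicate, a single test is carried out on the most likely cluster, which is how the procedure sidesteps a separate adjustment for multiple testing.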
1.4.2.2 Combinatorial Approaches

Some geographers are interested in creating or building regions [30, 61-64].
Regions are built up by assigning small areas to groups such that they fulfill certain
criteria. Regional geographers have called this the "assignment problem". Small areas
are assigned to regions such that a certain attribute of the region is optimized [30, 62].
Sometimes, the problem could involve maximizing the variation in an attribute of the
newly built region as a proportion of the variation within the entire map [30, 65]. The
general question in this approach is "What combination of areas will optimize a given
objective?" In the disease mapping context, disease risk or the likelihood of risk can be
maximized. An example in the disease mapping context was investigated by Alvanides
[61]. A similar strategy was also suggested (but not implemented) by Rushton [66].
These ideas were implemented in computer programs first by Openshaw [64] and later by
other researchers [63, 67, 68]. Independently, Duczmal suggested a similar solution to
finding disease clusters of any shape. He operationally achieved this by maximizing the
Spatial Scan Statistic likelihood function over possible combinations of areas. While it is
sometimes possible to look at all possible combinations/collections of areas, for most
realistic geographical areas this is not possible (for example, see Cliff and Haggett [62]).
Neither are there theoretical solutions to the problem. In operations research, such
problems are called NP-complete. This means that, for a collection of n areas, no algorithm is
known that solves the problem in polynomial computer time. Heuristics are used to solve such
problems. Duczmal uses the Simulated Annealing (SA) and Genetic Algorithm (GA)
heuristics in his research [31, 69]. An important aspect of these methods is that they
provide enormous freedom of analysis of shape and scale. The analysis scale and shape
vary across a multitude of combinations. Thus instead of asking the question "Is there a
cluster at a given scale of the following shape?" these methods demand: "Find clusters
of any shape at any scale." This makes these methods immensely powerful. But this
strength also brings about a weakness. If spurious clusters are created from a mismatch
between the process and analysis scales and shapes, and if a large number of scales and
shapes are evaluated by this analysis method, then it follows that noisy clusters will
almost always be detected by these methods alongside genuine or true clusters. At the
end of this section we shall see an example of this. The next section discusses some of
the modifications that researchers have proposed to these methods. These modifications
offer better power of detecting clusters.
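A rough count shows why exhaustive enumeration is impractical and why so many shapes and scales end up being evaluated. The sketch below counts every non-empty collection of n areas, 2^n - 1. This ignores the requirement that the areas of a cluster be contiguous, so it is only an upper bound on the number of candidate clusters and is offered as an illustration, not as a statement about any particular algorithm.

```python
def count_candidate_sets(n_areas):
    """Number of non-empty collections of n areas (contiguity ignored): 2**n - 1."""
    return 2 ** n_areas - 1


for n_areas in (3, 10, 50, 245):
    print(f"{n_areas:>3} areas -> {count_candidate_sets(n_areas):.3e} possible collections")
```

Even if only a minute fraction of these collections is contiguous, a map of a few hundred areas offers far more candidate shapes than can be examined one by one, which is why heuristics are used.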
1.4.2.3 Hybrid Approaches
These approaches combine some of the strategies of the non-combinatorial
approaches with a combinatorial search. Some examples are the approaches proposed by
Patil and Taillie [70], Tango [71] and Yiannakoulias [36]. Tango proposed that the search
begin with a circular cluster as a "seed", but that regions adjacent to the circular cluster then
be coalesced with it and the resulting hybrid be tested as a possible cluster. With every
level of adjacency enumerated the problem becomes computationally more complex, and
therefore in his example Tango suggested that three levels of adjacency be tested. Patil
and Taillie's [70] approach restricts the search space to areas with the
highest rates, which Patil and Taillie call the "upper level sets". These methods provide
interesting extensions to the combinatorial shape-free methods of cluster search.
We are now in a position to summarize the various methods discussed. All the
methods outlined above have one singular goal: To extract the underlying pattern of
significant excess risk. Some methods are good at mapping the entire pattern [9], while
others are good at testing for significant excess risk [3]. In the next section, I discuss how
problems with significance testing can introduce spurious clusters.
1.4.3 Significance Testing and Spurious
Clusters
In general all methods, at some point, address the following question: "Of all the
candidate clusters in the pattern of risk (whether mapped or not), which clusters are true
clusters?" Each candidate cluster has a specific risk elevation, a size, and a shape.
Traditionally most "cluster detection" techniques have used some function of the risk
elevation or rate of a given area to decide if the area is a true cluster. The question that is
asked is "How likely are we to observe this risk elevation or rate in this area if the
underlying process is noise?" If the probability is small then the area is considered a cluster.
The distribution of risks/rates under the process of noise is also known as the reference
distribution. Traditionally, the reference distribution is normatively chosen. Some
choices are the normal distribution [2, 50], the chi-squared distribution [2, 50], the
Poisson distribution [3] and the Gumbel distribution [43]. However, using such
distributions is problematic. If the populations are small, the normal distribution cannot
be used. It is often hard to distinguish a lack of fit to the chi-squared distribution from
a genuine deviation from the chi-squared distribution (indicating clustering) [4]. A more
robust method of achieving this is to use a Monte Carlo simulation approach to
empirically determine the reference distribution. Methodologically this may be achieved
by simulating a series of maps, in each of which noise is the underlying process. Multiple
Monte-Carlo simulations of the data are used to mimic the noise process. If the observed
risk elevation (or some function of the risk value such as the rate) for the area is
significantly different from the ones in the simulated maps, then the area is considered to
be a cluster. However, Monte Carlo simulations do not guarantee that spurious clusters
will not be detected. Steenberghen et al. [72] carried out an experiment that illustrates
this problem. This is displayed in Fig 1.1. Fig 1.1 is a map in which simulated locations
of traffic accidents (points) were randomly scattered [72], filtered using 600 meter filters,
the density of points estimated, the resulting clusters tested for significance and the level
of significance displayed (also known as a p-map). If areas which show 0.025 %
significance are called clusters, the black shapes in Figure 1.1 are spurious clusters.
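An experiment of this kind can be imitated on a small scale. The sketch below is a loose illustration and not a reproduction of the study by Steenberghen et al.: it scatters points at random in a unit square, counts the points inside a circular filter at each grid point, builds a Monte Carlo reference distribution for every grid point from further random scatters, and reports how many grid points would be called significant. The number of points, grid size, filter radius, number of simulations and threshold are all hypothetical.

```python
import random


def filter_counts(points, grid, radius):
    """Count the points falling inside a circular filter centered on each grid point."""
    r2 = radius * radius
    return [sum((px - gx) ** 2 + (py - gy) ** 2 <= r2 for px, py in points)
            for gx, gy in grid]


def noise_p_map(n_points=60, grid_size=6, radius=0.15, n_sim=199, seed=3):
    """Monte Carlo p value at every grid point when the 'observed' map is itself pure noise."""
    rng = random.Random(seed)
    step = 1.0 / grid_size
    grid = [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(grid_size) for j in range(grid_size)]

    def scatter():
        return [(rng.random(), rng.random()) for _ in range(n_points)]

    observed = filter_counts(scatter(), grid, radius)
    rank = [1] * len(grid)  # the observed map counts as one replicate
    for _ in range(n_sim):
        simulated = filter_counts(scatter(), grid, radius)
        for g, (sim, obs) in enumerate(zip(simulated, observed)):
            if sim >= obs:
                rank[g] += 1
    return [r / (n_sim + 1) for r in rank]


p_values = noise_p_map()
flagged = sum(p <= 0.025 for p in p_values)
print(flagged, "of", len(p_values), "grid points flagged on a map of pure noise")
```

Any grid point flagged here is spurious by construction, since the observed map was generated by the same noise process as the reference maps.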
Some methods attempt to tackle this problem with a combination of both Monte
Carlo and normative statistical techniques. Examples are Duczmal's and Kulldorff's
methods. Duczmal's method [3, 31, 43, 69, 73] (which derives from Kulldorff's method)
generates a large number of irregular cluster candidates. For each candidate the rate is
calculated. The rate is then fed into a function known as a likelihood function to yield a
likelihood value of the cluster candidate being a true cluster. This value is divided by
the likelihood of the cluster candidate not being a true cluster. This ratio is known as the
likelihood ratio. The likelihood ratios for all cluster candidates are calculated. The
cluster candidates with the highest ratios are the most likely clusters. Multiple Monte
Carlo simulations are carried out, and the rates at all the candidate clusters calculated.
Again, the rates are fed into the likelihood function, thus generating a reference
distribution of likelihood ratios for each cluster candidate. The likelihood ratio value of
the cluster candidate is compared with the reference distribution to decide if the cluster
candidate is a true cluster. However when Duczmal applied this approach to some of his
data, problems with this approach were dramatically exposed. In one of his studies
Duczmal [31] simulated breast cancer cases and randomly distributed them over 245
counties in New England (Fig 1.2). When he instructed his Simulated Annealing (SA)
SaTScan based irregular cluster search algorithm to search for clusters, one of the clusters
that it found was a large and extremely irregular cluster encompassing 122 counties, and
enclosing a large percentage of the randomly scattered cases. This cluster is an example
of a noisy cluster. The noise generating process (random distribution of cases) operated at
the scale of 245 counties (aggregated). The shape of the area at which this process
operated is the shape of the New England region that we see in Fig 1.2. At this scale and
shape, the process generates noise. However, if this process is studied at the scale of an
aggregation of 122 counties and at the shape that follows the darker (orange if your copy
of this document is in color) shaded counties in Figure 1.2, then, a noisy or spurious
cluster is generated. It is known that the process that generated this cluster is noise.
This example thus illustrates a situation where spurious clusters are created from a
mismatch between the scale and shape of the process that generates the cluster and the
scale and the shape imposed by the method of analysis. Duczmal [31] noted that this
noisy cluster was large in size and extremely irregular in shape. Duczmal [73] suggests
that large and irregular clusters like the one found in his study (above) are likely to be
spurious. He and some other researchers [36] therefore incorporate a penalty for
irregularity of shape in the cluster search algorithm. The extent of this penalty is decided
on a priori knowledge of the shape of the cluster. Therefore, if researchers believe that
the clusters in an area are likely to be circular, they place a high penalty on clusters that
are not circular in shape, and vice versa. The spurious cluster detected by Duczmal's
method and the proposed solution raise some important questions. Is this spurious
cluster, large and irregular with a high risk/rate elevation, an artifact of his particular
method, or is it possible that if a cluster detection method is given freedom of shape and
size then such clusters are likely to be detected? We note that the shape and size of the
spurious clusters in Fig 1.1 are different from the shape and size of Duczmal's spurious
cluster. Thus not all spurious clusters are large and irregular.
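A penalty for irregular shape can be sketched with a simple compactness measure. The code below uses the isoperimetric quotient 4 * pi * area / perimeter^2, which equals 1 for a circle and approaches 0 for very elongated or ragged shapes, and multiplies a likelihood ratio by this quotient raised to a chosen power. Both the measure and the form of the penalty are assumptions made for illustration; they are not claimed to be the exact penalty used in the methods cited above, and the polygons and values are hypothetical.

```python
import math


def polygon_area_perimeter(vertices):
    """Area (shoelace formula) and perimeter of a simple polygon given as (x, y) vertices."""
    twice_area, perimeter = 0.0, 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter


def compactness(vertices):
    """Isoperimetric quotient: 1 for a circle, smaller for irregular or elongated shapes."""
    area, perimeter = polygon_area_perimeter(vertices)
    return 4.0 * math.pi * area / perimeter ** 2


def penalized_llr(llr, vertices, strength=1.0):
    """Down-weight a candidate's LLR by its compactness; strength=0 applies no penalty."""
    return llr * compactness(vertices) ** strength


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
sliver = [(0, 0), (10, 0), (10, 0.1), (0, 0.1)]  # same area as the square, very elongated
print(penalized_llr(10.0, square), penalized_llr(10.0, sliver))
```

The strength parameter plays the role of the a priori belief about shape: a large value favors compact clusters and can suppress a true cluster that happens to be elongated.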
Duczmal's problem has reintroduced the otherwise rarely discussed issue of shape
and size in the disease cluster detection literature [69, 74, 75]. Risk elevation is just one
possible characteristic of a cluster. McCullagh [76] states: "In map analysis, features of
prime importance may be size, shape, orientation and spacing". It is possible for clusters
of different shapes and sizes to have the same risk elevation. It is also possible for
clusters of the same shape and size to have different risk elevations. The first objective of
any cluster search should therefore be to distinguish spurious or noisy clusters from
everything else. The risk or rate value of a possible cluster alone is not sufficient to make
this distinction. The shape and size of the cluster must also be factored in, when
considering if a cluster is a true cluster. Duczmal proposes a solution that makes certain a
priori assumptions about the shape and size of a cluster. This solution is interesting.
However, the problem of spurious clusters may be approached from a different angle.
Instead of asking the question "What is the shape of a true cluster?", which is what these
methods do, and which is hard if not impossible to answer, the
question that should be asked is "What is the shape of a spurious cluster?". Unlike the
first question, this is easier to answer. This is because the shape of a spurious cluster,
unlike that of a true cluster, can be mined a posteriori from the data. To know how this can be
done, we first need to understand how spurious clusters are generated in the first place.
Thus, in the chapter that follows I discuss in depth the phenomenon of noise and the
creation of spurious clusters.
1.4.4 Identifying spurious clusters and
distinguishing true clusters from spurious
clusters
Spurious clusters enclose noise. Across disciplines noise is defined as "... a
random and unpredictable signal" [77]. By this definition, if the nature of the signal is
known, then noise can be detected and filtered out. For example in a satellite image, it
may be known that certain frequencies are the signal frequencies and therefore a spectral
analysis and subsequent filtering may help remove the undesirable noise. In a satellite
image the signal has a physical existence. For example, infrared radiation emitted by
vegetation can be measured with certain instruments. In contrast, in mapping disease the
signal cannot be physically measured. The signal is conceptual and has to be estimated
from the available data. Some geographers and statisticians attempt to tackle the problem
by developing statistical models that attempt to separate signal from noise [21, 23, 78-80].
Perhaps a better approach to understanding signal and noise in a disease map is to
understand the physical process that gives rise to the signal (as in a satellite signal). It is
known that in a disease map, the observed patterns are the result of underlying processes.
The observed patterns are patterns obtained from mapping statistical summaries of
disease outcomes. For example, a map of patterns of cholera mortality in England could
be displaying the number of cholera deaths per unit population in each county. The
outcome in this case is cholera mortality which is the outcome of a disease process. Since
cholera is a communicable disease it is possible that the spread of cholera can be modeled
as a contact network process [81]. There exist many other spatially explicit disease
processes (see footnote 2). For example, patterns of disease could be the result of processes that reflect
an underlying lack of access to healthcare [10, 56, 82-84]. Whatever the specific process
may be, these processes have a common trait in having a spatial form [85], and this
means that they predispose some areas of the map to have a greater risk than any other.
It is also possible that the underlying process does not cause any region of the
map to have a greater risk than any other. Since a disease case may appear at any point on
the map by random chance, by the earlier definition of noise, this is a noise generating
process. A cluster defined by enclosing some of these disease cases is a spurious cluster.
On any given map disease patterns can be the result of one or more processes. It could be
the result of one process that generates clusters and another process that generates noise.
The challenge therefore, is to distinguish the areas of a pattern that are the result of a
cluster generating process from those that are not. Also, given a disease process that
generates patterns on a map, a number of other factors also influence the patterns we
actually observe.
2 It is important to distinguish between a spatially explicit disease process and a
spatial disease process. Some scientists attempt to model diseases as purely spatial processes.
Examples of this can be seen from the cellular automata based disease modeling literature. No
disease process is purely spatial and therefore such models are misleading.
Given a cluster generating process, the following factors influence the
pattern that is then extracted:
1. The spatial distribution of the locations of people in the map.
2. The shape and size of the geographic units that are used to aggregate individuals
into discrete small areas.
3. The shape and size of the spatial configuration that the disease mapping or cluster
detection method may impose on the data (in addition to 2).
Understanding these factors is essential to understanding noise and spurious
clusters. I discuss this next.
1.4.4.1 The spatial distribution of the locations
of people in the map
A cluster generating process causes an area of the map to have a greater risk than
other areas of the map. Cluster detection methods seek to estimate the shape, size and risk
elevation of the area of increased risk using the locations of people as proxy sample sites.
A representative spatial sample of the area of risk would be a uniform grid [86]. People
are never distributed uniformly over space; instead, a likely spatial distribution consists
of dense settlements interspaced with sparsely populated areas. This creates a challenge
in estimating the true shape of the cluster. As I illustrate in Figures 1.3 to 1.11, a
cluster that in reality has a uniform shape may be estimated as having a highly irregular
shape because of the way people are distributed over space [75]. The shape of the actual
area of increased risk or true cluster created by the cluster generating process also
influences the shape of the cluster that is finally estimated. If the shape of the true cluster
is highly irregular, it is quite likely that the shape of the cluster that is estimated is also
highly irregular, but the converse may also be true! This is illustrated in Figures 1.12 to
1.14. Another phenomenon long observed by geographers is that the same risk process
may give birth to different shaped clusters in different areas of the map or, in more
general terms, the same cluster generating process may give rise to different patterns
[87]. While the shape of the original area of the increased risk or true cluster may be the
same in two areas and the spatial distribution of the people may be the same, it is not
necessary that the pattern of people who are diseased (and who are not) will be the same
in both areas. This means that the shape of the estimated area of increased risk will not be
the same in both areas. This is further complicated by the fact that people are almost
never distributed similarly over space in two different regions (Figures 1.15 to 1.20).
First, for the purposes of understanding this issue, let us assume the highly
improbable situation that people are uniformly distributed over space. Let the distribution
be over a uniform grid. Figure 1.3 illustrates the situation. Next, let us consider that out
of the 42 people in the region, 10 are afflicted by some disease. However, we assume that
the process that causes disease is a noise generating process. Therefore, we expect
diseased people (or cases) to be randomly distributed over the region among 42 people as
shown in figure 1.4. A convex hull boundary of these cases is seen in Figure 1.5. In
contrast, if there is a cluster generating process, we would expect the diseased people to
be clustered together. Figure 1.6 illustrates such a situation. People enclosed within a
dotted area of increased risk are diseased, the risk being 0.24 (the risk in other areas
being 0). We observe in Figure 1.6 one realization of the risk process, so 10 people are
diseased. Figure 1.7 displays the convex hull boundary of this cluster of diseased
people. The smooth and regular shape of this cluster is in sharp contrast to the irregular
cluster shape that we observe in Figure 1.5. Since it is highly unlikely that people will be
uniformly distributed over space, Figure 1.8 illustrates the more realistic possibility of
people being non-uniformly distributed over space. If the entire geographic area in Figure
1.8 is subject to a risk, we expect some people to become diseased (again, one realization
of the process). Figure 1.9 illustrates this and the boundary that demarcates the cluster.
The shape of the cluster is very different from what was obtained in Figure 1.5. An
area of increased risk on such a heterogeneously distributed population gives rise to
clusters of unpredictable shapes (Figures 1.10 and 1.11). These examples show how the
spatial distribution of the people affects the shape and size of the risk surface detected.
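The contrast between the noise generating and cluster generating processes on a uniform grid can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to draw the figures: the 7 x 6 grid layout, the 4 x 3 block standing in for the dotted area of increased risk, the fixed count of 10 cases and the random seed are all hypothetical choices made here.

```python
import random

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0

# 42 people on a uniform grid (the 7 x 6 layout is an assumption).
people = [(x, y) for x in range(7) for y in range(6)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Noise generating process: 10 cases placed at random among all 42 people.
noise_cases = rng.sample(people, 10)

# Cluster generating process: cases confined to a compact block of the grid
# (a hypothetical 4 x 3 area of increased risk holding 12 people).
risk_area = [(x, y) for x in range(1, 5) for y in range(1, 4)]
cluster_cases = rng.sample(risk_area, 10)

noise_hull_area = polygon_area(convex_hull(noise_cases))
cluster_hull_area = polygon_area(convex_hull(cluster_cases))
print("hull area, noise process:  ", noise_hull_area)
print("hull area, cluster process:", cluster_hull_area)
```

The hull of the clustered cases can never exceed the 3 x 2 extent of the assumed risk area, whereas the hull of the randomly placed cases is free to spread over the whole grid.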
From these examples it may seem that for a given distribution of people over
space, a cluster generating process gives rise to patterns on a map that are regular
compared to the shapes generated by a noise generating process. Indeed, some scientists
use measures of regularity of a cluster's shape to distinguish a true cluster from a
spurious cluster [73]. However, people are never distributed uniformly over
geographic space. Next, we see how this affects the shape and size of the cluster detected.
In the example I have discussed I assumed that the cluster generating process gives rise to
a very regularly shaped area of increased risk (the area within the dotted line). In reality
this may not be true. The area of increased risk may have a very irregular shape. Some
examples of geographic features that can be areas of increased risk are rivers, roads,
underground groundwater streams, plumes of aerial pollution or a combination of some
of these. We therefore observe that the shape and size of a cluster cannot be predicted
a priori and is unique to the risk elevation of the cluster generating process and the spatial
distribution of the people. Another aspect of a cluster generating process is that the same
process can give rise to different shaped clusters in different regions of the map. This can
happen even if people are uniformly distributed. Figures 1.15 to 1.20 illustrate this.
From the discussion and the examples, we can conclude that both the spatial
distribution of people and the shape and size of the area of increased risk, have an
important bearing on the shape and size of the cluster that is finally detected. The area of
increased risk or the true cluster may have a very different spatial configuration from
the cluster that is detected. Parts of the true cluster may be suppressed or spurious areas
of increased risk may arise. Spurious clusters are created from the method used to
measure the outcome of the process of clustering. By definition, the method uses a scale
and (or) shape of measurement that is dependent on the spatial distribution of people.
Since this distribution is not representative of the underlying area of increased risk, there
is a mismatch between the measurement shape/scale and the process shape/scale. While
the above examples are with individual level data, the conclusions drawn can be
generalized to aggregated data. The act of data aggregation itself could introduce noise
over and above the problem of heterogeneously distributed people. This is discussed in
the next section.
1.4.4.2 The scale and spatial configuration
of the geographic units that are used to
aggregate data into discrete small areas
In the geography literature the term scale is used to refer to three different kinds
of scales, two of which are of relevance here. The first is the phenomenon scale, or the
scale at which a spatial process operates. The second is the analysis scale, the scale at
which data are aggregated for measurement and analysis [88]. When a phenomenon such
as a disease operates at a given scale, its outcome is often registered as heterogeneity in
disease rates at that scale [89]. Geographers have often attempted to find the scale at
which a process operates [90]. Two well known methods are the use of spectral analysis
[65] and variogram [91] modeling. The latter approach is often used in the health
geography literature. Studies in China have shown that esophageal and liver cancers
operate at scales of less than 150 km while stomach cancers operate at scales of less than
90 km [91]. In Sweden substance related disorders operate at scales of less than 3 km [92].
Unfortunately, the scale at which a given process operates is not known in most
geographic studies. A geographer attempts to study a process by collecting and analyzing
spatial data. This process involves analysis through the calculation of statistical
summaries of data aggregated at an appropriate scale. When the process scale is not
known there is every possibility of a mismatch between the process scale and the analysis
scale. This mismatch or misalignment arises from two sources. First, geographic data are
often aggregated into discrete units for purposes different from the analyses for
which they are being used. These units of aggregation could differ in shape and scale
from the process scale and shape. As Haining states in "Conceptual models of spatial
variation" [93]: "... This might be referred to as process-induced spatial heterogeneity. This
source of heterogeneity may be compounded in the case of regional data by measuring
attributes through spatial units of different size. This might be referred to as
measurement-induced heterogeneity because it is a product of how attributes are
observed and measured." A second source of mismatch is from the spatial structures that a
disease mapping or cluster detection method imposes on the data. For example, spatial
filtering [9, 10] and Spatial Scan Statistic based methods calculate summary statistics
by aggregating data along circular filters. In the geography literature the problems that
arise from spatial mismatch are grouped under the MAUP, or Modifiable Areal Unit
Problem [91, 94]. MAUP phenomena are in turn grouped under two broad subgroups:
the zone effect and the scale effect. The creation of spurious heterogeneity or destruction
of true heterogeneity with changing scales is a manifestation of the scale effect. If the
scale is kept fixed but the shape of the zones of aggregation are changed, then the zone
effect is likely to be seen. Geographic data aggregated to administrative units often
display both the zone and scale effects of MAUP. Aggregating data has a smoothing
effect on disease rates [95], and therefore clusters at scales smaller than the scale of
aggregation could be missed, when analyses are done using these data. Conversely, if the
scale of aggregation is smaller than the process scale, then noisy clusters could be
detected. A recent study by Ozonoff et al. [19] demonstrated that when individual level
data are aggregated and a Spatial Scan Statistic cluster search method is used on the data,
noise increases with increasing levels of aggregation. Therefore, analysis and
process scales interact in complex ways to create noisy clusters and suppress true clusters.
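The smoothing effect of aggregation described above can be made concrete with a small numerical sketch. The strip of eight small areas, their populations and their case counts are hypothetical values chosen only to show how a local excess is diluted when small areas are merged into larger units; they are not data from any study cited here.

```python
def rates(cases, population):
    """Rate in each unit: cases divided by people at risk."""
    return [c / p for c, p in zip(cases, population)]

def aggregate(values, block):
    """Sum consecutive runs of `block` small areas into one larger unit."""
    return [sum(values[i:i + block]) for i in range(0, len(values), block)]

# Hypothetical strip of 8 small areas with 100 people each; one small area
# (index 2) carries a local excess of cases.
population = [100] * 8
cases = [2, 2, 12, 2, 2, 2, 2, 2]

fine = rates(cases, population)
coarse = rates(aggregate(cases, 4), aggregate(population, 4))

print("small-area rates:", fine)
print("aggregated rates:", coarse)
print("peak rate before and after aggregation:", max(fine), max(coarse))
```

In this toy case the peak rate of 0.12 in a single small area falls to 0.045 once four areas are merged, so a cluster operating at a scale smaller than the unit of aggregation becomes much harder to see.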
We can conclude from our discussions above, that a number of complex factors
influence the shape, size and the risk elevation of the clusters that are detected and the
spurious clusters created. These factors are dependent on the spatial distribution of the
people and the process and analysis scales. It is not possible to make a priori assumptions
about these factors, and it is certainly not possible to predict the shape of a noisy cluster a
priori. What approach is then appropriate if the spurious clusters have to be separated
from the true clusters? The section that follows answers this question.
1.4.5 Identifying the "noisy" or spurious
components of the pattern
A reasonable cluster detection technique should take into consideration not only
the risk elevation but also the shape and size of the cluster. I propose a spatially enabled
computational process that uses these attributes of a cluster, to identify the signature of
spurious clusters from patterns on a disease map. Earlier, I introduced the idea that a
pattern is the outcome of a process. Analyzing a pattern or the components of a pattern
such as individual clusters may yield clues about the underlying process. A map of
disease patterns represents one realization of the underlying process. It may not be
possible to draw conclusions on the process that generated the pattern or components of
the pattern by analyzing just one map. However, if multiple maps were available,
representing multiple realizations of the process, then analyzing the patterns may yield
clues about the underlying process. A classic example of this approach can be found in
Hagerstrand's paper [96], in which he simulates multiple maps assuming an
underlying process. He then compares maps of empirical data with the maps that he has
simulated to draw conclusions about the validity with which he represents the process in
his model. Another example can be seen from Diggle [97]. Therefore, if maps were
created using a known process, then analysis of the simulated patterns on the maps would
yield clues on the "signature" of that particular process. Once this "signature" is known,
then the pattern could imply (or not imply) the existence of this process. More
specifically, this scheme can help identify a "signature" for spurious clusters. These
signatures can then be used to distinguish clusters that are spurious from clusters that are
"true", in any given pattern of disease risk. Shape, size and risk elevation are part of this
"signature". For example, the signature of spurious clusters in Duczmal's [73] method
was that these clusters were large in size and had irregular shapes. The next chapter is
devoted to the method I have developed based on these ideas. The method is first
described, then tested and validated on simulated data.
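The idea of learning the signature of spurious clusters from many simulated maps can be sketched as follows. This is a minimal illustration, not the method developed in the next chapter: the population layout, the number of cases, the distance-linking rule used as a stand-in for cluster extraction, and the use of group size alone as the recorded attribute are all assumptions made for this example.

```python
import random

def largest_case_group(case_points, link_distance):
    """Size of the largest group of cases chained together by links no
    longer than link_distance (a simple stand-in for cluster extraction)."""
    unvisited = set(case_points)
    best = 0
    while unvisited:
        stack = [unvisited.pop()]
        size = 0
        while stack:
            px, py = stack.pop()
            size += 1
            near = [q for q in unvisited
                    if (q[0] - px) ** 2 + (q[1] - py) ** 2 <= link_distance ** 2]
            for q in near:
                unvisited.remove(q)
                stack.append(q)
        best = max(best, size)
    return best

def spurious_size_distribution(people, n_cases, link_distance, n_maps, rng):
    """Simulate n_maps realizations of a noise generating process and record
    the largest case group found on each simulated map."""
    sizes = []
    for _ in range(n_maps):
        cases = rng.sample(people, n_cases)  # every person equally at risk
        sizes.append(largest_case_group(cases, link_distance))
    return sorted(sizes)

rng = random.Random(1)
# Hypothetical non-uniform population: a dense settlement plus scattered people.
people = [(rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)) for _ in range(60)]
people += [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(40)]

null_sizes = spurious_size_distribution(people, n_cases=20, link_distance=0.5,
                                        n_maps=500, rng=rng)
threshold = null_sizes[int(0.95 * len(null_sizes))]  # empirical 95th percentile
print("largest spurious group at the 95th percentile:", threshold)
```

A candidate cluster on the observed map would then be compared against this simulated reference distribution; the same loop could record shape and rate alongside size.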
1.4.6 Why use size, shape and rate
The reason I add the dimensions of size and shape, in addition to rate, is to
characterize the reference space in which spurious clusters are located. I know from
theory (as discussed in this chapter) that spurious clusters arise differently to the extent
that the numbers of people at risk, in relation to the overall relative risk of the disease,
differ across the space. When people are distributed uniformly in space, the average
number and average size of spurious clusters in that space can be determined from
theory. As Schinazi [98] shows, deterministic statistics can be used to determine the
chance of finding a given number of clusters with a rate higher or lower than the expected
rate. However, when people at risk are distributed non-uniformly in space, the equivalent
number is more difficult to determine directly from theory. The same theory still applies;
it is just more difficult to implement in the case of non-uniform distribution of people at
risk. For this reason, I use Monte Carlo simulation to discover the rate, size, shape space
in which typical spurious clusters lie, given the particular distribution of people at risk
and the particular overall relative risk of the disease in the study area in question. In his
seminal paper King [85] states: "The mathematics of stochastic spatial processes have
proven to be extremely complex and it is perhaps not surprising that alternative
approaches to study these processes have been sought. In the analysis of any system,
simulation represents a lower level of abstraction than the formal mathematical analysis,
and this technique has been applied to geographic research." In this research I use shape,
size and rate to distinguish real clusters from spurious ones. Since the probability of
disease in a cluster is higher than in a non-cluster, we expect the rate, which is an estimate
of this probability, to be higher in a cluster. Conversely, if people with higher probabilities
of disease are grouped together or spatially clustered, rather than randomly scattered about
the map, we expect a higher degree of spatial autocorrelation in the former situation. We
would then expect the size of true clusters to be larger than any spurious clusters created
by noise. The causative agent for this increased spatial autocorrelation could be
environmental toxins or social and behavioral factors. There is a vast literature on the
social and environmental causes of increased risks [99], a complete discussion of which
is out of the scope of this dissertation. Nevertheless, I briefly discuss some of these
agents of increased risk. As I discussed in the introduction chapter, disease mapping owes
its beginnings to infectious diseases such as cholera and smallpox. Infectious agents such
as bacteria or viruses are often transmitted through close physical contact. It is therefore
not surprising that infectious diseases such as cholera [100] and yellow fever [101] have
served as some of the best example cases of disease clusters. A collection of such cases is
positively autocorrelated compared to a random distribution of cases. Conversely a high
spatial autocorrelation of disease X in space could indicate an infectious etiology for that
disease. One would expect the clusters thus formed to be contiguous and large as opposed
to a random allocation of cases. Other causative factors of diseases are environmental
toxins. Environmental toxins tend to follow certain physical features or attributes of the
environment. People residing within or close to these features are at an increased risk of
disease compared to others because of a differential exposure to these toxins. Some
examples of physical attribute/toxin pairs are: rivers and fungicides [102], radio antennae
and electromagnetic wave plumes [12], farms and pesticides [103], cranberry bushes
and pesticides [104], Concentrated Animal Feeding Operations [105] and dust plumes, and
canals and assorted chemical wastes (the famous Love Canal) [106]. It is expected that
these toxins act across contiguous local areas. The elevations of risk caused by
such agents are over a large area as opposed to any risk caused by spatially random
events.
While physical toxins may cause an increased risk regime, the social environment
may also cause the same effect. Public health researchers discuss the context and
composition of the social environment [78, 80, 99, 107, 108]. If a number of individuals
practicing high risk behaviors compose a neighborhood they could end up reinforcing
each other's behaviors. This could result in a cluster of disease cases created by the
compositional effect of high risk individuals living together [107]. If a number of high
risk individuals are living together, they form a cluster. This cluster would naturally be
larger than isolated individuals or even families practicing high risk behaviors. In contrast
to this compositional process, if a certain neighborhood has poor access to services,
then the access context of this neighborhood causes the people living in it to have a
higher risk of disease. Some of the examples of access/outcome pairs from the literature
are access to prenatal care clinics and birth outcomes [83], general accessibility and
health risk factors [109], access to radiation clinics and choice of therapy [110], access to
health resources and late stage colorectal cancer [10]. Network distance, Euclidean
distance or some function thereof are used to quantify access. It is not possible for
immediately neighboring individuals to have different accessibilities. One therefore finds
clusters of high or low accessibility, which translates to larger clusters than would arise at random.
While we would expect the sizes of true clusters to be larger than the sizes of
spurious clusters, there is a small but finite probability that by random chance some
spurious clusters will be larger than the true clusters. Also, the shapes of true clusters will
have a greater degree of freedom than the shapes of spurious clusters. For example, the
shapes of true clusters could follow a road or river network, in which case they will be
extremely irregular. Conversely, they could be regular or circular. The shapes of
spurious clusters on the other hand are constrained by the particular geographic aspects of
the data such as level of aggregation and spatial distribution of people as discussed in this
chapter. Therefore, we can expect the shape, size and rate of true clusters to be different
from the shapes, sizes and rates of spurious clusters. The question of whether each of
these dimensions contributes to the power to discriminate between spurious clusters and
true clusters is an empirical question that can be answered. In my simulations, I hold rate
constant across synthetic clusters (horizontal axis in Figure 2.13: clusters 1 and 2, clusters
3 and 4) when changing shape and size, and conversely change rate when keeping shape
and size constant (vertical axis in Figure 2.13: clusters 1 and 3, clusters 2 and 4). I also
address the empirical question of how much each of these dimensions contributes to the
overall sensitivity if information on the other dimensions is withdrawn. The theoretical
reasons, however, for expecting the dimensions of size and shape to contribute to the
ability to separate spurious clusters from true clusters remain; viz. the shape and size of
spurious clusters in any area depend on the spatial distribution of the people at risk in the
area. If this spatial distribution changes, the patterns of spurious clusters change.
Therefore, size and shape as well as rate are part of the signatures of spurious clusters in a
particular region. The ability to measure size and shape using GIS methods thus becomes
an important part of the methodology for distinguishing true from spurious clusters in any
area. At this stage we are in a position to revisit the definition of clusters. Knox's
definition of a cluster is "A geographically bounded group of occurrences of sufficient
size and concentration to be unlikely to have occurred by chance". The vast majority of
the disease clustering literature interprets "unlikely to have occurred by chance" as the
unlikeliness of the estimated risk of disease in the cluster only to have occurred by
chance. As shown by Duczmal [31] and Ozonoff [19], this interpretation is vulnerable to the
problem of spurious clusters. Thus, two different clusters with different geographical
bounds but the same rate would be evaluated similarly. This dissertation does not
redefine Knox's definition of clustering. It interprets it in a geographically meaningful
manner by including shape and size along with rate in cluster interpretation in a
three-dimensional computational space.
Figure 1.1: This figure displays the statistical significance of accidents per square
kilometer (a pp- map over densities), where accidents have been randomly
scattered across the study area. A 30 meter grid was laid over the entire study
area and a 600 meter filter was used to estimate the accident densities. The
black areas are significant noisy clusters.
Note: Reproduced from Steenbergehen, Thomas and Wetts (2005) [72].
Figure 1.2: This figure displays a spurious cluster detected by Duczmal's Simulated
Annealing based SaTScan method. This cluster has a high, statistically
significant likelihood value.
Note: Reproduced from Duczmal, Kulldorff and Huang (2006) [73].
Figure 1.3: In the geographic area, 42 people are distributed over a uniform grid. Each
circle represents an individual. They are color coded white to indicate that
they are healthy.
Figure 1.4: A noise or spurious cluster generating process operates at the scale of the
entire geographical area. No person is at a greater risk of disease than any
other. All people are at a risk of 0.24. Diseased people are randomly distributed
over the map. These diseased people are color coded black to indicate a
diseased state.
Figure 1.5: A boundary is drawn around those people who are diseased. This represents
our gerrymandered cluster. Note the highly irregular and large shape of the
cluster.
Figure 1.6: In contrast to Figure 1.4, a cluster generating process operates on this geographic
area. The cluster generating process predisposes the people living in the area
bounded by the dotted lines to a greater risk than other areas of the map. These
people are at a risk of 0.56. In one realization of the process, a cluster of 10
people are therefore diseased in this area.
Figure 1.7: The cluster is then enclosed within a boundary. Note the relatively regular
shape of the cluster (compared to a random distribution of diseased people).
Figure 1.8: People are distributed non uniformly over space.
Figure 1.9: The entire geographic space is subject to the same risk (0.24) noise
generating process. The resulting 10 diseased people and the gerrymandered
cluster are shown.
Figure 1.10: The cluster generating process in Figure 1.6 operates on the
inhomogeneously distributed population. The risk elevation is the same as in
Figure 1.6 (0.56). This causes 8 people to fall ill from an at-risk population of
14.
Figure 1.11: The estimated cluster shape and size is very different from what the shape and size
of the cluster is in reality (the dotted line in Figure 1.10). It is also very
different from what was obtained for a homogeneous distribution of people in
Figure 1.6.
Figure 1.12: Now a cluster generating process operates on this space. The white river
within the dotted lines is the area of excess risk. People living within this area
are at an excess risk of disease.
Figure 1.13: Assuming an inhomogeneous distribution of people as in Figure 1.8 and a
risk elevation of 0.71, we see that a certain number of people (10) within the
area of excess risk are diseased.
Figure 1.14: The gerrymandered cluster now encloses the diseased people. Note the
highly irregular and large shape of this cluster.
Figure 1.15: Two cluster generating processes of circular shape and risk elevation of
0.75 operate on a homogeneous distribution of people.
Figure 1.16: The clusters that are estimated from this have the same triangular shape.
This is highly unlikely in reality.
Figure 1.17: In this example a slightly larger area of increased risk is considered than in
the earlier example. 6 people in each of the two clusters are subject to a risk
of 0.5, which results in 3 of them becoming cases (falling ill).
Figure 1.18: The clusters that are generated have very different shapes. In fact the
larger the area of increased risk, the greater the number of possible shapes and
sizes of the estimated cluster.
Figure 1.19: In this example people are inhomogeneously distributed. The same cluster
generating process in Figure 1.15 gives rise to two circular areas of increased
risk where the risk elevation is 0.5.
Figure 1.20: The two clusters generated have very different shapes. There is no
configuration of cases within the clusters for which two estimated clusters
could have the same shape.
CHAPTER 2: THE SHAPE SIZE SENSITIVE (S.S.S) METHOD
FOR DETECTING DISEASE CLUSTERS
In this chapter I describe and then test the S.S.S method for the detection of
disease clusters. In the section that immediately follows, I discuss the theory, hypothesis
testing framework and the algorithm that implements the method. The method is also
compared with a second method, Rogerson's Score test method. The results from the two
methods are compared, and implications are discussed.
2.1 Theoretical foundations of the
S.S.S method
A pattern on a map consists of a number of possible candidate clusters. A map
may be that of a region such as the state of Iowa. Let us denote this region as G. The
region may be comprised of X local areas, G = [A1, A2, ..., AX] (see footnote 3). An example
of a local area A1 is a ZCTA (ZIP Code Tabulation Area). If there are k
candidate clusters on the map, then this set can be denoted as Z = [Z1, Z2, ..., Zk]. These
candidate clusters have different properties for different methods of detecting disease
clusters. For example, in methods that do not geographically process the data, clusters
follow local area boundaries, and the set of all candidate clusters comprises the universe,
or Z = G. Each candidate cluster Zi, i = 1 to k, is a collection of some discrete local areas Ai.
However, in methods that process geographic or spatial data, especially density
estimation methods like spatial filtering, these properties do not necessarily apply. The
set Z of candidate clusters can also be divided into two complementary subsets: a set of
true clusters and a set of noisy clusters. Let these be ZT and ZN respectively, where ZT ∪
ZN = Z. Of course, the sets ZT and ZN are not known a priori to the researcher. Note that
these cluster candidates can be extracted using any method that offers relative freedom of
shape and size of cluster candidates. In this research, I use Rushton's Spatial Filtering and
echelons, but the approach can be used, for example, with Duczmal's [31] method or
Kulldorff's SaTScan [3] analyses.
3 The terminology in this section is similar to that of Duczmal, Kulldorff and Huang
(2006) [73].
Each cluster candidate could either be a true cluster or it could be a spurious
cluster. It is not known a priori, if a candidate cluster is true or spurious. Each candidate
cluster has a shape K(Z), size S(Z) and a third attribute R(Z) that provides a measure of
the risk at the cluster. A measure of shape, K(Z), is the compactness or regularity of
the cluster's geometry. Compactness is measured as
$K(Z) = 4\pi \, \mathrm{Area}(Z) / \mathrm{Perimeter}(Z)^2$ [73]. A circle is perfectly regular and has a compactness of 1. The
compactness value tends to 0 as the geometry becomes less regular. Size S(Z) is
the area of Z, Area(Z). R(Z) is the statistical measure of clustering. Thus R(Z) could be
a likelihood statistic, as in Kulldorff's SaTScan, a rate statistic in Spatial Filtering, or a
measure of spatial autocorrelation as in the LISA (Local Indicators of Spatial
Autocorrelation) methods. R(Z) in most methods is a continuous variable, and this is the
attribute that is used to create patterns on a map. If the cluster candidates Zi have to be
extracted from the pattern, then R(Z) has to be discretized. Since clusters are discrete
geographic entities with strict boundaries, an appropriate method of extracting
geographically bounded discrete entities from a continuous surface (such as a map
pattern) is required.
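The compactness measure defined above can be illustrated with a minimal sketch. It assumes that a cluster candidate is available as a simple polygon given by an ordered list of (x, y) vertices; the function names and the example shapes are hypothetical and are not part of the software used in this research.

```python
import math


def polygon_area(vertices):
    """Unsigned area of a simple polygon (shoelace formula)."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def polygon_perimeter(vertices):
    """Sum of the edge lengths, closing the ring back to the first vertex."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))


def compactness(vertices):
    """K(Z) = 4 * pi * Area(Z) / Perimeter(Z)**2."""
    return 4.0 * math.pi * polygon_area(vertices) / polygon_perimeter(vertices) ** 2


# Hypothetical shapes: a unit square, a long thin strip, and a 720-sided
# polygon that approximates a circle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
strip = [(0, 0), (10, 0), (10, 0.1), (0, 0.1)]
circle = [(math.cos(2 * math.pi * k / 720), math.sin(2 * math.pi * k / 720))
          for k in range(720)]

for name, shape in (("square", square), ("strip", strip), ("circle", circle)):
    print(name, round(compactness(shape), 4))
```

The near-circular polygon scores close to 1 and the thin strip scores close to 0, which matches the behaviour of K(Z) described in the text.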
One approach to do this is to use echelons [111], where, the continuous surface
is divided into a hierarchy of topographic features such as peaks, ridges and saddles.
While there are other approaches to extracting cluster candidates [31] that are irregular in
shape, the echelons approach when used with a continuous surface, has two strengths.
First, the surface is continuous, thus the cluster candidates that are extracted do not have
to conform to underlying geographic boundaries. Second, it offers an
exploratory way of selecting cluster candidates. While selecting cluster candidates
on the basis of results of an earlier search is not recommended, since this leads to
selection bias, it is possible to look at a smoothed surface of risk, and decide on what
level of smoothing or what level of echelons are appropriate for the research. It is not
absolutely necessary to make these decisions, and a brute force approach using multiple
echelons and multiple filter sizes will work. Nevertheless, this approach does offer the
opportunity to make these a-priori selections if required.
Discrete entities are created from the intersection of this three dimensional
topography with a horizontal cutoff plane called the level in the echelons literature. The
level of this cutoff plane can be such that only the peaks or highs are revealed above
it, or it could be lowered, such that features at lesser altitudes are exposed. In my analysis
I use one level or threshold. This has the effect of binarizing the continuous surface into
highs and lows. Thus, in a map of smoothed rates, all cluster candidates that show rates
greater than the threshold rate in the region will be considered as possible cluster
candidates (see Figure 2.1). Raising the threshold has the effect of showing only the
peaks and ridges as cluster candidates, while lowering the threshold has the opposite
effect. A powerful method for the detection of disease clusters can weed out the false
positives and thus offer high sensitivity, when applied on such a map. In later versions of
this method, multiple levels of echelons will be used. One echelon serves as a good
starting point.
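The single-level extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the smoothed rates are held in a small rectangular grid, "high" cells are grouped by 4-neighbour adjacency (an assumption; the echelons software may delineate features differently), and the function name, the example surface and the threshold are hypothetical.

```python
from collections import deque


def extract_candidates(rates, threshold):
    """Binarize a gridded rate surface at `threshold` and group the cells
    above it into 4-connected components, one per cluster candidate."""
    n_rows, n_cols = len(rates), len(rates[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    candidates = []
    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or rates[r][c] <= threshold:
                continue
            # Breadth-first flood fill over neighbouring "high" cells.
            cells, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                i, j = queue.popleft()
                cells.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < n_rows and 0 <= nj < n_cols
                            and not seen[ni][nj] and rates[ni][nj] > threshold):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            # Rate of the candidate: average of the grid-point rates inside it.
            mean_rate = sum(rates[i][j] for i, j in cells) / len(cells)
            candidates.append({"cells": cells, "n_cells": len(cells),
                               "mean_rate": mean_rate})
    return candidates


# Hypothetical smoothed-rate surface on a 3 x 4 grid.
surface = [
    [0.9, 1.2, 1.3, 0.8],
    [0.7, 1.0, 1.4, 0.9],
    [1.5, 0.8, 0.9, 1.0],
]
candidates = extract_candidates(surface, threshold=1.1)
for cand in candidates:
    print(cand["n_cells"], round(cand["mean_rate"], 3))
```

On a uniform grid the size of a candidate could be approximated as the number of cells multiplied by the area of one cell. Raising the threshold in the call above leaves only the highest cells as candidates.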
We can therefore define a threshold T and an indicator t, such that:
If R(Z) > T, then t = 1 and R(Z) = R(Z) * t
If R(Z) < T, then t = 0 and R(Z) = R(Z) * t
where T is the threshold that decides beyond what value of the statistic R(Z) we
consider clusters to exist. Once the cluster candidates are extracted, the average rate at the
cluster candidate can be calculated. The average rate is calculated as the observed number
of cases divided by the expected number of cases. Alternatively, if a fine, uniform grid is
applied to the study area and the rate at each grid point is known, then the rate in a
candidate cluster is equal to the average of the rates at the grid points that lie within the
candidate cluster. Let us call this $\bar{R}(Z)$. These attributes [K(Z), S(Z) and $\bar{R}(Z)$]
comprise the "signature" S(Z) of a cluster. My hypothesis is that the signatures of clusters
that are spurious will be different from the signatures of true clusters. A disease map will
yield many possible cluster candidates. If all the cluster candidates are spurious then:
$Z_j \in Z_N \;\; \forall j$, and $Z_T = \emptyset$
For any candidate that is a true cluster:
$Z_j \in Z_T$
I assume that a cluster is a true cluster if its signature is not classified as the signature
of a spurious cluster. Or,
$Z_j \in Z_T$ if $S(Z_j) \notin S(Z_N)$ and consequently $S(Z_j) \in S(Z_T)$
This classification can be done on the basis of a decision rule D,
D: $S(Z_j) \to S(Z)$, such that if there are no true clusters then $S(Z_j) \in S(Z_N)$,
while if there is a cluster that is a true cluster then $S(Z_j) \in S(Z_T)$. To decide if
any given cluster signature belongs to a set of noisy or spurious cluster signatures, we
first need a reference set of spurious cluster signatures. This reference set must be
computationally generated. Thus for example, if we are analyzing cluster candidates in a
map of cardiovascular disease mortality by ZCTA in Iowa, then the noisy clusters must
be extracted from simulated datasets of cardiovascular disease mortality by ZCTA in
Iowa. The simulated datasets must be created under a noise or spurious cluster generating
process. Thus no person in a ZCTA should have a greater risk of dying of cardiovascular
disease than in any other ZCTA (details of data simulation are in a later section). The
number of cases that are simulated should be equal to the number of cases in the real
dataset.
Thus one map, $Sm_1$, created under a noise generating process generates, say,
$Sm_1(n)$ spurious clusters $Z^N_1, Z^N_2, \ldots, Z^N_{Sm_1(n)}$ when the map is analyzed with a
cluster detection algorithm such as spatial filtering. M simulations provide a
valid reference set of noisy clusters
$[(Z^N_1, \ldots, Z^N_{Sm_1(n)}), (Z^N_1, \ldots, Z^N_{Sm_2(n)}), \ldots, (Z^N_1, \ldots, Z^N_{Sm_M(n)})] = Z_N$.
The reference set of signatures is thus
$[(S(Z^N_1), \ldots, S(Z^N_{Sm_1(n)})), (S(Z^N_1), \ldots, S(Z^N_{Sm_2(n)})), \ldots, (S(Z^N_1), \ldots, S(Z^N_{Sm_M(n)}))] = S(Z_N)$.
Figure 2.2 displays the signatures of spurious
clusters on a map of Iowa. The noisy data are an allocation of 50,000 (footnote 4) cardiovascular
mortality cases among the 942 ZCTAs in Iowa. The process used to extract the clusters
was Rushton's Adaptive Filtering [9]. The output of this process is a surface of smoothed
rate statistics. The statistic was binarized such that areas with rates greater than the
echelon level 1.1 are coded as black. Each individual polygon that has been coded as
black is a noisy cluster. This one map provides one set of reference noisy signatures, say
$(Z^N_1, \ldots, Z^N_{Sm_1(n)})$. M sets of similar maps make $Z_N$.
Once the reference distribution of simulated spurious signatures S(ZN ) is
obtained, the important questions are "What do these spurious signatures look like?" and,
given candidate clusters Zj, j = 1 to k, and their signatures, "How do we differentiate these
from S(ZN)?" The first question is addressed by exploring the signatures S(ZN). The
shapes, sizes and rates of the clusters in this set are explored. We can expect clusters that
have a similar shape, size and rate as these spurious clusters to be the least likely to be
recovered.
4 This number is consistent with real epidemiological data. This is discussed in detail in
the data simulation section later.
To differentiate signatures of true clusters from signatures of spurious clusters,
we can describe a three dimensional signature space [K(Z), S(Z) and $\bar{R}(Z)$] where shape
K(Z) is on the X axis, size S(Z) is on the Y axis and rate $\bar{R}(Z)$ is on the Z axis. In this
signature space is a set of noise signatures S(ZN). The signature set for M = 50, which is
the reference distribution S(ZN) of spurious cluster signatures, can be seen in Figure 2.3.
This signature set of spurious clusters is used as a reference set in this
research. The signatures occupy a certain region of the signature space.
Thus, if S(ZN) were to be enclosed by a boundary in signature space, then all
candidate clusters that are spurious should be enclosed within this boundary. In contrast,
all candidate clusters that are true would lie outside this boundary. This boundary thus
defines a rejection region in three dimensional space. The rejection region can be defined
in a Monte Carlo Hypothesis testing framework. This is discussed in the next section.
2.2 Hypothesis testing
I test the hypothesis that any of (as opposed to one of) the candidate clusters
(Zj,j=1to k) is a true cluster. The signature of a candidate cluster is compared to a reference
distribution S(ZN ) of simulated noisy cluster signatures to decide if the candidate cluster
is noisy. The theoretical distributional properties of the signatures of spurious clusters are
not known (and could be a subject for future research). For example, while there is
substantial research on how rates and likelihoods are distributed [69, 112, 113] there is
little research on the trivariate distribution of rates, shapes and sizes of clusters in
different geographies. This creates a challenge in testing any hypothesis.
It is known, for example, that in the bivariate case, a normally distributed variable
will have a footprint that can be approximated by an ellipse [114]. The exact
configuration of the ellipse depends on how the two variables are correlated. It is possible
to draw a rectangle that encloses this ellipse. Irrespective of how the correlation varies,
this rectangle will always enclose the ellipse. The ellipse assumption is true only for
normal distributions. Without any available research, it is not possible to assume whether
the bivariate distributions of shapes and sizes or rates and sizes (for example) of cluster
candidates are normally distributed. Also, even if they are normally distributed in this
specific disease mapping situation, there is no reason to assume that they would be so, in
other situations. However, it would still be possible to enclose these bivariate
distributions in a bounding rectangle. The extremums of shape and size can be used to
define the bounds of such a rectangle. If hypotheses were being tested using ellipses, then
data points residing in the outer band of the ellipse (the outer 0.001 or 0.05, for example)
would be considered significant. With a bounding rectangle a similar band can be created. Note that the
rectangle is more conservative than the ellipse (refer to Figure 2.5).
Rectangular confidence intervals have been discussed in the literature and
recommended for use [115]. Also note that the rectangle is easily used in a Monte Carlo
hypothesis testing framework. Instead of using the location of a data point in space, its
rank among a set of simulated data points can be used as an indicator of its significance.
A Monte Carlo approach to hypothesis testing is therefore appropriate for these analyses. The rank of
the shape, size or rate of a candidate cluster in a list of candidate clusters and simulated
spurious clusters can be easily calculated. A natural extension to the rectangle in two
dimensions is the hyper-rectangle or cuboid [116] in three dimensions, when the data are
three dimensional (shape, size and rate), with the extremums of each of these defining the
bounding rectangle. Making the bounding rectangle bigger will not change the results.
The Monte Carlo ranks remain the same. The hyper rectangle is therefore a visualization
tool rather than a hypothesis testing device, since the hypothesis test in itself is non
parametric.
Under the null hypothesis the signatures of all the candidate clusters S(Zj) are
equal to the mean or median of the signatures of noisy clusters. Since the signature is a
trivariate variable (shape, size, rate), the mean or median of the signature is the mean or
median of the shape, size and rate.
$H_0: S(Z_j) = [S_{mean}(Z_N), K_{mean}(Z_N), \bar{R}_{mean}(Z_N)]$ for all $j$
$K_{mean}(Z_N)$: the mean cluster compactness (shape) among all signatures in $S(Z_N)$
$S_{mean}(Z_N)$: the mean cluster size among all signatures in $S(Z_N)$
$\bar{R}_{mean}(Z_N)$: the mean rate over all cluster mean rates $\bar{R}(Z_N)$ in $S(Z_N)$
Under the alternative hypothesis there is at least one candidate cluster that is a true
cluster. Thus, the signature of at least one candidate cluster is not equal to the mean:
$H_1: S(Z_j) \neq [S_{mean}(Z_N), K_{mean}(Z_N), \bar{R}_{mean}(Z_N)]$ for at least one $j$
The p-value of each of the variables shape, size and rate, for any given candidate, is
calculated as the relative rank of the candidate's value among the values of all cluster
candidates and simulated spurious clusters. Thus, for example, the p-value of shape (pshape)
for cluster candidate j is Rank
(K(Zj)) / (Total number of cluster candidates + Total number of simulated spurious
clusters). If the hypothesis test is at the α level, then a cluster candidate is significant if
pshape < α/6, because the level α is shared among three variables, each tested on two sides.
A cluster candidate is significant if any of shape, size or rate is
significant. I simulate 50 datasets under the null hypothesis and 20 datasets under the
alternative hypothesis. In this Monte Carlo hypothesis test 50 datasets under the null
hypothesis are being compared with datasets under the alternative hypothesis. 50 datasets
do not imply that there are 50 data points under the null hypothesis. This is because
each of the 50 datasets generates a large number (on average around 70) of spurious
cluster candidates, or data points, in three dimensional shape, size and rate space. Thus the
comparison dataset under the null hypothesis has 3675 and not 50 data points.
Nevertheless, the question arises whether this number (3675 data points from 50 datasets) is sufficient to satisfy
the requirements of creating the reference null distribution. I thus cross validated my
simulations through a hold-one-out cross validation process. In this method of cross
validation a statistic, such as the mean, is measured for all datasets, then one dataset is
randomly removed and the mean is measured again. If the mean does not change, then
the number of simulations is sufficient for hypothesis testing. Table 2.1 below
summarizes the results. For the alternative hypothesis, the test for validation is slightly
different. A stable number of datasets is one over which the mean sensitivity and specificity
values converge. Table 2.2 illustrates the results. Next, I shall illustrate the hypothesis
testing procedure with an example. Let us suppose that a
dataset simulated under the alternative hypothesis (Section 2.3) for Cluster 2 is being
tested for true clusters. The comparison dataset is a dataset of spurious cluster
candidates that has been simulated under the null hypothesis (Section 2.3). There are a
total of 3675 data points in this comparison dataset. Each datapoint has a shape, size and
rate value. The test dataset has 45 datapoints. Each datapoint has a shape, a size and a
rate. The first step in the hypothesis test is to merge the two datasets (comparison and test
datasets) together. After merging the datasets we have a total of 3720 datapoints. Next,
we rank the datapoints by their shape value on the shape axis, by their size value on the
size axis and by their rate value on the rate axis. If we are testing the hypothesis at the 0.01 level,
then we expect 3720 * 0.01 = 37.2, rounded down to a multiple of six, that is 36, datapoints to be rejected.
Because three axes (shape, size and rate) are tested, each axis has to contribute 12
datapoints to the rejection region. Since this is a two sided test, each side of each axis
contributes six datapoints. For this specific dataset there are specific cutoffs for shape, size
and rate. The cutoffs for shape are 0.71 and 0.045, for size 5711.521 and 1.193, and for rate
1.207 and 1. Out of the 36 datapoints that are rejected, four are from the test dataset; the rest
are from the reference dataset and are thus discarded. Of the four, three are rejected on
their shape values (0.73, 0.73, 0.72). One datapoint is rejected both on its shape (0.55)
and its rate (1.22).
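The merge-and-rank procedure in the example above can be sketched as follows. This is a minimal illustration and not the code used in this research: signatures are assumed to be (shape, size, rate) tuples, the function name is hypothetical, the reference signatures are an evenly spaced synthetic set rather than simulated spurious clusters, and ties are broken by sort order.

```python
AXES = ("shape", "size", "rate")


def sss_rank_test(reference, test, alpha=0.01):
    """Merge reference and test signatures, rank them on each axis, and
    return (per_tail, flagged). `flagged` maps the index of a test candidate
    to the axes on which it falls in one of the two rejection tails."""
    merged = [(sig, None) for sig in reference]
    merged += [(sig, i) for i, sig in enumerate(test)]
    n_total = len(merged)
    # Three axes, two tails each: the rejected datapoints are split six ways,
    # e.g. 3720 * 0.01 = 37.2 -> 37 -> 6 datapoints per tail (36 in total).
    per_tail = int(alpha * n_total) // 6
    flagged = {}
    for axis, name in enumerate(AXES):
        order = sorted(range(n_total), key=lambda k: merged[k][0][axis])
        tails = order[:per_tail] + order[n_total - per_tail:]
        for k in tails:
            test_index = merged[k][1]
            if test_index is not None:  # reference datapoints are discarded
                flagged.setdefault(test_index, []).append(name)
    return per_tail, flagged


# Synthetic reference set of 3675 signatures (assumed values, for illustration).
n_ref = 3675
reference = [(0.1 + 0.6 * i / (n_ref - 1),
              1.0 + 499.0 * i / (n_ref - 1),
              1.0 + 0.2 * i / (n_ref - 1)) for i in range(n_ref)]
# 44 unremarkable test candidates and one that is extreme on every axis.
test = [(0.4, 100.0, 1.1)] * 44 + [(0.95, 6000.0, 1.5)]

per_tail, flagged = sss_rank_test(reference, test, alpha=0.01)
print(per_tail, flagged)
```

Because the cutoffs are ranks within the merged data, they adapt to whatever reference set is supplied, which is the property the hyper-rectangle is meant to visualize.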
In these analyses the test and the reference datasets are merged during the
hypothesis test. An alternative to this would be to take each datapoint from the test
dataset and test it separately against the reference datasets. There are a number of
drawbacks to this approach. The first is the problem of multiple testing. Carrying out
multiple tests would introduce statistical noise in the dataset. Adjustments for multiple
testing would make the test ultra conservative. Simultaneous testing is the best approach
in this context. There is also another advantage of using the merge procedure. It is known
that some of the datapoints in the test datasets are spurious clusters or are in some ways
similar to the spurious clusters. When merged with the reference dataset, and ranked as
explained above, these spurious cluster datapoints join the reference dataset or become
part of the reference population against which those clusters that are most different from
spurious clusters are compared. This increases the power to distinguish or discriminate
the true clusters from spurious clusters. This makes the test slightly more conservative,
since potentially true clusters may not be rejected. The number of cluster candidates in a
given geographic situation is unknown. It is important that the number of datapoints in
the reference dataset be large. This is easily achieved by carrying out an adequate number
of simulations, and testing for validation as above. It is important to calibrate the
rejection region to match the structure of the signature set created by the process that
generated the spurious clusters. The structure of this signature set will vary from one
disease mapping situation to another. Using a hyper rectangle and a non parametric
method of hypothesis testing offers this ability. Computational geography [1] thus allows
us to use a rejection region appropriate to the situation at hand, instead of assuming a
normative truth about the distribution of spurious cluster signatures. This ability to
adapt the rejection region to the local geography is the strength of the S.S.S method. The
flowchart in Figure 2.6 summarizes the S.S.S method.
The S.S.S method extracts the signature of spurious clusters for a given
geography. By the theory on which this method is based, if the true clusters that one is
attempting to recover, have a signature that is very similar to the signature of the spurious
clusters, then it is unlikely that this signature will be recovered. However, if the signature
of the true clusters is different, that is the shape, size and/or the rate is different from
the shape, size and/or rate of the spurious clusters, then these clusters are very likely to be
recovered. The simulated data that I describe next are used to test these ideas. The S.S.S
method is also compared with an existing method of disease cluster detection: Rogerson's
Score statistic.
2.3 The simulated dataset
It is a standard practice in the literature [32, 75, 113] to test any new method for
the detection of disease clusters against data simulated under conditions of clustering and
no clustering. The ability of a cluster detection method to correctly classify regions into
cluster and not cluster is quantified by measures of sensitivity, specificity, false
positive percent and false negative percent. Thus, data are simulated in this research. The
data consist of 50 datasets simulated under the null hypothesis of no clustering and 80
(20 x 4) datasets simulated under the alternative hypothesis of having clusters. Four different
clusters (20 datasets each) are simulated under the alternative hypothesis to reflect
different configurations of possible true clusters. The hypothetical study area and the
datasets are described in detail next.
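The four measures named above can be computed as in the following minimal sketch. It assumes that classification is evaluated per local area (for example per ZCTA), that false positive percent is taken relative to the areas truly outside the cluster, and that false negative percent is taken relative to the areas truly inside it. These definitions, the function name and the example sets are assumptions made for illustration.

```python
def classification_measures(all_areas, true_cluster_areas, detected_areas):
    """Compare the areas flagged by a method with the areas that truly
    belong to a simulated cluster."""
    universe = set(all_areas)
    truth = set(true_cluster_areas)
    detected = set(detected_areas)

    tp = len(truth & detected)              # cluster areas that were flagged
    fn = len(truth - detected)              # cluster areas that were missed
    fp = len(detected - truth)              # non-cluster areas that were flagged
    tn = len(universe - truth - detected)   # non-cluster areas left unflagged

    def ratio(num, den):
        # None when the measure is undefined (e.g. no true cluster areas).
        return num / den if den else None

    def percent(value):
        return None if value is None else 100.0 * value

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "false_positive_percent": percent(ratio(fp, fp + tn)),
        "false_negative_percent": percent(ratio(fn, fn + tp)),
    }


# Hypothetical example: ten areas, four in the true cluster, three flagged.
measures = classification_measures(range(10), {0, 1, 2, 3}, {2, 3, 4})
print(measures)
```

For a dataset simulated under the null hypothesis there are no true cluster areas, so sensitivity is undefined and the sketch returns None for it.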
2.3.1 Hypothetical study area and
population
My analysis is based on the geographical area of the state of Iowa. The size of this
geographic area is approximately 240 miles * 360 miles and the population is around
2,892,853. I simulate groups/counts of people at the small area level. The small areas are
ZCTAs (Zip Code Tabulation Areas). There are a total of 942 ZCTAs in Iowa. The
advantages of using a geographical area such as Iowa are manifold. The state has a relatively
homogeneously distributed population with low population densities in most areas, but
small and densely populated urban areas. Also, there are existing cancer and
birth-defect datasets for the state [10]. In Chapter 3 of this dissertation, the S.S.S method,
Rogerson's Score statistic and the Spatial Scan Statistic [3] are applied to an existing
dataset of prostate cancer incidence in Iowa (at ZCTA level geography). Figure 2.7 is a
choropleth map of ZCTAs in Iowa by population.
2.3.2 Hypothetical case population
The simulated cases are of deaths from cardiovascular disease. Cardiovascular
diseases (CVD) or diseases of the heart [117] include ICD-10 codes I00-I99 (diseases
of the circulatory system) and a number of other related disorders (ICD-10 codes I00-I09, I11, I13, I20-I51). Together they contributed to 26,897 deaths in the year 2000 in
Iowa [117]. In my study I simulate 50,000 cases of cardiovascular disease deaths
(mortality) in Iowa. This accounts for approximately two years of observed CVD deaths.
2.3.3 Datasets under the null hypothesis
of no clustering
The hypothetical case populations are simulated by distributing cases among the
ZCTAs weighted by populations. The methodology that was used to generate the
simulated datasets is simple. A computational array was created with 942 bins where
each bin represents a ZCTA. The size of each bin is proportional to the population in the
ZCTA. Our task is to allocate a certain number (50,000) of cases to these bins in
proportion to their populations. If we visualize each case as a dart, then the cases are
allocated by randomly throwing darts at the array. Once all darts are exhausted, the
number of darts or cases in each bin is summed. This sum represents the total number of
cases in the particular ZCTA. One dataset is thus created from this allocation of a finite
number of cases to bins (note that it is the total number of cases that is fixed, not the
number in each bin; some bins may receive no cases through this
process, and that is fine). If the process is repeated 50 times, 50 simulated datasets are
obtained. Figure 2.8 illustrates the process.
The philosophy behind this is that the process that gives rise to the case
population is "noise", or that no ZCTA is at a greater risk of having a case than any other
after accounting for the relative differences in populations in the ZCTAs. Therefore from
a common pot of a fixed number of cases, each time a case is drawn, a decision has to be
made on which ZCTA shall receive the case. Each ZCTA has a probability of receiving a
case proportional to the number of people in the ZCTA. Once all cases are allocated, a
dataset is ready. This procedure, if replicated M times, gives M datasets. Each
of the M datasets is expected to have a different spatial distribution of cases. It
can be theoretically proved that the resulting case distribution follows a multinomial
distribution. Figure 2.9 shows the proof. For this analysis 50 datasets were simulated.
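The dart-throwing allocation described above can be sketched as follows. This is an illustration only: the ZCTA labels and populations are made up, the number of cases is reduced to keep the example small, and the function name and random seed are hypothetical.

```python
import random
from collections import Counter


def simulate_null_dataset(populations, n_cases, rng):
    """Allocate a fixed number of cases to areas, each case landing in an
    area with probability proportional to that area's population."""
    areas = list(populations)
    weights = [populations[a] for a in areas]
    draws = rng.choices(areas, weights=weights, k=n_cases)  # one "dart" per case
    counts = Counter(draws)
    # Areas that received no darts are kept, with a count of zero.
    return {a: counts.get(a, 0) for a in areas}


# Hypothetical populations for four areas (the study uses 942 ZCTAs).
populations = {"ZCTA_A": 12000, "ZCTA_B": 3000, "ZCTA_C": 500, "ZCTA_D": 40}
rng = random.Random(2008)  # fixed seed so the sketch is repeatable

# 50 datasets under the null hypothesis, each with the same total case count.
datasets = [simulate_null_dataset(populations, 1000, rng) for _ in range(50)]
print(datasets[0])
```

Each call draws the case counts jointly from a multinomial distribution with probabilities proportional to population, so the total is fixed while individual areas vary from dataset to dataset.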
It is important to note, that a number of risk factors affect the outcome simulated.
Some of these risk factors are non spatial. For example, for heart disease, age is a
possible risk factor. The observed patterns of risk on a map reflect the outcome of these
risk factors over and above spatial risk factors. If the simulated patterns do not reflect the
underlying age and sex distribution, then comparing these with the observed patterns is
incorrect. Thus, if the S.S.S method is used in a real life epidemiological situation then, it
is important that these covariates be adjusted for, and that the at-risk population be used
to create the simulated patterns. In Chapter 3 the S.S.S method is used to investigate
prostate cancer clusters in Iowa. The population used to simulate the reference spurious
clusters is men (sex) over the age of 45 (age). The observed patterns are adjusted to the
underlying age distribution. Since Rushton's spatial filtering is used, covariates are
adjusted at the stage when the rates are calculated at each grid point. This approach can
be extended to include any number of risk factors, in a logistic regression or multilevel
regression framework. For a detailed discussion on this topic, see Banerjee (2004) or
Klassen et al., (2005) [56, 118, 119].
In the simulations that are carried out in this chapter the above considerations are
not important. The purpose of the simulations in this chapter is to test the ability of the
S.S.S method to recover certain simulated patterns, and to compare the S.S.S method
with Rogerson's Score statistic method of cluster detection. While a dataset simulated
using the population at risk may be more realistic, this choice does not
affect the results of analyses carried out on the simulated data. The aspatial risk factors
are considered to be the same and fixed in the datasets simulated under the null and
alternative hypotheses.
2.3.4 Extracting the cluster candidates
For each of the datasets simulated above, patterns of risk were extracted using
adaptive spatial filtering. A uniform 2.5 mile grid is used. The denominator size or filter
size was set at 6600 people, which is around 114 expected cases
(6600 * 50,000 / 2,892,853 ≈ 114). From the smoothed
patterns echelons were used (as explained in Chapter 1) to extract the cluster candidates.
The echelon level was set at 1.058. This level is approximately equal to the mean rate in
the simulated noisy patterns. Choosing an echelon at a low altitude (for example the
minima) could increase the possibility of detecting false positives. Choosing an echelon
that is too high could increase the possibility of a Type II error. A median or a mean is thus
a reasonable choice for one echelon level. All cluster candidates that cross this mean
threshold are considered spurious clusters. The end result of the process explained above
is a set of spurious cluster signatures, for the geography of my choice (ZCTAs in Iowa,
2000 population). Each spurious cluster has a signature. This signature is comprised of
a shape (expressed as a value of compactness), a size (in square miles) and a rate. Recall
that this is S(ZN ) the signature set of reference spurious clusters. We are now in a
position to explore this. In the figures that follow, these shapes, sizes and rates of the
spurious cluster signatures are summarized. There are a total of 3675 spurious clusters.
The bar charts in Figures 2.10, 2.11 and 2.12 summarize the shapes, the size and the rates
of the spurious clusters. Table 2.3 provides further statistical summaries on the spurious
clusters.
The datasets that are described next are simulated under the alternative hypothesis
that clusters exist. Recall, that S.S.S compares the shapes, sizes and rates of true cluster
candidates (like the ones that are simulated in the next section), with the shapes, sizes and
rates of clusters simulated under the null hypothesis (which were discussed in this
section). If the theory underlying the S.S.S method is correct, then the greater the
difference in the signature of the spurious clusters and any given cluster candidate, the
more likely the cluster candidate is a true cluster. In the next section I describe datasets
that are simulated under the alternative hypothesis. In these datasets, I carefully control
the shape, size and the rate of the simulated clusters with varying degrees of difference
from the signatures of spurious clusters.
2.3.5 Datasets under the alternative hypothesis
of clustering
20 (x 4) datasets are simulated under the alternative hypothesis that a cluster
exists. The patterns extracted from these map datasets will yield the true clusters ZT. The
procedure is similar to the one followed in the last section. The only difference here is
that people living in some of the ZCTAs are placed at a higher risk of disease than people
living in other ZCTAs. Four different clusters are simulated. Figure 2.15 summarizes the
characteristics of the clusters simulated. We can also call these the four clustering
situations.
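One way to place the people in some ZCTAs at a higher risk is to multiply the allocation weight of those ZCTAs by a relative risk before the cases are distributed. The sketch below illustrates this under that assumption; the exact mechanism used to elevate risk in this research may differ, and the function name, area labels, populations and relative risk are hypothetical.

```python
import random


def simulate_clustered_dataset(populations, cluster_areas, relative_risk,
                               n_cases, rng):
    """Allocate cases in proportion to population, with the weight of the
    areas inside the cluster multiplied by `relative_risk`."""
    areas = list(populations)
    weights = [populations[a] * (relative_risk if a in cluster_areas else 1.0)
               for a in areas]
    counts = dict.fromkeys(areas, 0)
    for area in rng.choices(areas, weights=weights, k=n_cases):
        counts[area] += 1
    return counts


# Hypothetical setting: four equally populated areas, one at three times the risk.
populations = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}
rng = random.Random(42)
counts = simulate_clustered_dataset(populations, {"A"}, 3.0, 6000, rng)
print(counts)
```

Setting the relative risk to 1 recovers the allocation used under the null hypothesis, so the same routine can serve both sets of simulations.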
2.3.5.1 Rationale behind the choice of these
configurations of synthetic clusters
The shapes, sizes and risk elevations of the synthetic clusters were chosen to test
the S.S.S method and its underlying theory. If the shape, size and the risk elevation of the
clusters are similar to the shape, size and rates of the spurious cluster as explained in the
last section, then the S.S.S method will not be able to detect these clusters. Conversely, if
the shape, size and the rates of the clusters simulated under the alternative hypothesis are
different from the shape, size and rates of the spurious clusters, then, S.S.S will be able to
detect these clusters. The former clusters are thus less likely to be recovered than the
latter ones.
Clusters 1 and 2 were chosen such that their shape and size mimic the shape and
size of a spurious cluster (refer to Figure 2.13 and Table 2.4 for a summary and
comparison). They are
composed of four small areas of increased risk. Also, the spatial forms or geometries of the four areas of increased risk were
chosen from cluster candidates of one of the fifty patterns simulated under the null
hypothesis.
While both Clusters 1 and 2 have a similar shape and size to spurious clusters,
Cluster 1 has a risk elevation that is many times higher than that of Cluster 2. Cluster 2 is thus
the least likely to be recovered of all clusters because its shape and size are similar to those
of a spurious cluster in the given geometry, and its risk elevation (1.25) is similar to that
of spurious clusters (1.1). Cluster 1 has better recoverability because it has a rate that is
different from that of spurious clusters even though its shape and size are similar to theirs.
In contrast, Cluster 3 has a shape, size and risk elevation that are very different from those
of spurious clusters. It is a contiguous aggregation of 111 ZCTAs in the state. It is
therefore the most recoverable of all the clusters.
Cluster 4 has a shape and size different from the spurious clusters, but its risk
elevation is similar to that seen in spurious clusters. Table 2.4 summarizes these ideas. It
illustrates the shapes, sizes and rates of the four simulated clusters, and also the mean
shape, size and rate of the spurious clusters. Every cluster is ranked according to the
extent it can be recovered. The cluster which is most similar to the spurious clusters
(Cluster 2) is the least recoverable, while Cluster 3, which is different from the spurious
clusters, is the most recoverable.
Note that the geography of Clusters 3 and 4, could pose a significant challenge to
the traditional disease cluster detection methods. Many methods of detecting clusters of
disease coalesce neighboring local areas to gain statistical power [2, 120]. In Clusters 3
and 4, this opportunity is not available. Nevertheless, Cluster 3 has a high elevation in
risk. This may increase power of detection. However, Cluster 4 has neither a high
elevation of risk, nor an agglomeration of local areas with a slightly high elevation of
risk. It is therefore likely that S.S.S will be able to detect Cluster 4 with success while
other methods will fail to do so. Another question of interest is the relative recoverability
of Clusters 1 and 4. Cluster 1 has a shape and size similar to spurious clusters but a risk
which is very different (3 times higher than the mean rate at spurious clusters). Cluster 4
has a risk similar to the mean rate of spurious clusters but a shape and size that are
different from them. It is not known how, in this specific example, the interplay of rate,
size and shape will affect the relative recoverability of Cluster 1 and Cluster 4. In later
sections, the S.S.S method is applied to the simulated data, and the postulates that I have
discussed above are tested. The S.S.S method is also compared with another method of
disease cluster detection, Rogerson's Score statistic.
If a limited number of synthetic clusters are evaluated, it can be argued that the
synthetic clusters were cherry-picked such that power evaluations of a given method
(S.S.S in this case) will be successful. It is therefore important that the synthetic clusters
that are used in any analysis reflect the possible local geographies within a given region.
In the context of disease mapping this would imply that the synthetic clusters reproduce
the population densities or the density of controls in the local areas in the region. This can
either be done by simulating multiple synthetic clusters, each of which covers a different
population or control density regime, or by simulating one or two clusters that cover a
number of possible control density regimes. If the second option is chosen then the
clusters may have to be large in size to encompass different local areas. This strategy was
followed in the design of Clusters 3 and 4 in these analyses. Alternatively, multiple
clusters may be used in a single simulation to cover the different local areas. This
approach was used in Clusters 1 and 2. To keep the power evaluations conservative, it is
advisable to err on the side of caution, and choose local areas for synthetic clusters with
population densities lower than what exists in the region. Clusters in areas with higher
population densities are easier to detect since such areas offer greater power. Almost any
cluster detection method can detect clusters in densely populated regions, but most cluster
detection methods have difficulty detecting clusters in areas with small numbers of
controls [32]. While Clusters 1 and 2
cover predominantly rural areas, they also include relatively urban areas such as the towns
of Walcott and Carroll. Similarly, Clusters 3 and 4 include large towns such as Council Bluffs,
Marshalltown, Fort Dodge and Davenport. The median population density in all ZCTAs
in Iowa is 22 people per square mile, while in simulated Clusters 1, 2, 3 and 4 it is 18
people per square mile. As is shown in Figure 2.14, the population densities in all the
simulated clusters in this study are lower than in Iowa as a whole. Thus, while the number
of clusters that are simulated is limited in this study, the population regimes that the
clusters cover are not. While the simulated clusters cover areas which are slightly less
densely populated than areas in Iowa, the ability to detect such areas is a more difficult
test of sensitivity for any cluster detection method than the alternative of testing for
clusters in densely populated urban areas. The cluster recovery process is free to recover
any fraction of a given simulated cluster. While the number of clusters that are
simulated is limited in this validation test to four, there is no limit to the number
of parts of these clusters that can be successfully recovered using a given cluster detection
method. Thus, if there are N ZCTAs in a given synthetic cluster, the cluster detection
method has the opportunity to recover 2^N possible combinations of ZCTAs from the
synthetic cluster. If on top of this we account for the fact that the S.S.S method can detect
clusters across ZCTA boundaries, there is an unlimited number of possible clusters that can
be detected. In spite of these unlimited possibilities the empirical results in Section 2.7
show that the clusters that are detected have some common attributes. The number of
simulations that are used in these analyses is thus sufficient to derive meaningful
generalizations. This and the fact that the simulated clusters are a conservative and
reasonable representation of the local geographies in Iowa justifies the choice of
simulated clusters in these analyses.
2.4 Rogerson's Score Statistic
Rogerson's Score statistic is a focused test [16] that has also been used as a local
cluster testing method by Waller [4] and Rogerson [2, 50, 121] to study spatial
patterns of leukemia in New York. A focused testing method is used to test for
excess risk around a given point or area, while a local testing method is used to test for
an excess of risk in a local area. The Score statistic can be implemented with the freely
available software GeoSurveillance [121]. The power of the score statistic as a local
cluster detection tool has not been tested, and this study is a first attempt at this.
Rogerson's Score statistic is set up to test one local area at a time. If the test were to be
used repeatedly on a number of local areas, the statistical problem of multiple testing
would come into play. These power tests address the question of whether, in a realistic
disease mapping situation, multiple testing is an issue with Rogerson's Score test. The
theory of the Score test is explained next.
2.4.1 Theory
Rogerson's local score statistic maps a smoothed value based on the difference
between the observed and expected counts of cases in a given region. To test for a raised
incidence or prevalence around region i, the statistic that is mapped is as follows:
$\text{Adjusted } U_i = \sum_{h=1}^{X} \frac{W_{ih}}{\sqrt{E_h}} \, (x_h - E_h)$
where
$U_i$ is the value of the score statistic for local area/ZCTA $A_i$,
$W_{ih}$ is a weight parameter that decides the extent of smoothing that will be applied to the above statistic,
$X$ is the total number of local areas/ZCTAs,
$x_h$ is the total number of cases in local area/ZCTA $h$,
$E_h$ is the expected number of cases in local area $h$,
$S(G)$ is the area of the geographic region in the map, for example, the state of Iowa, and
$d_{ih}$ is the distance from the ith local area to the hth one.

The weights $W_{ih}$ are normalized before use and are built from a Gaussian kernel in which
squared distances are rescaled by the average area of a local area, $S(G)/X$:
$W_{ih} \propto \exp\left( -\frac{d_{ih}^{2}}{2\sigma^{2}} \cdot \frac{X}{S(G)} \right)$
The expected number of cases is
$E_h = (N_h / N) \, n$
= (Total number of people in ZCTA / Total number of people) * (Total number of cases),
with $i, h = 1, 2, \ldots, X$.
$\sigma$ is the bandwidth parameter, and this decides the extent of smoothing that will
be applied. Since this method smooths the statistic calculated above with a Gaussian
kernel, the size of the kernel will decide the extent of smoothing that is applied to the
data. If $\sigma = 1$, then the size of one standard deviation of the kernel is equal to the average
distance between the centroids of all ZCTAs or local areas in the study area. The mapped
statistic for each local area is tested for significance by comparing with a normal
reference distribution [2], with mean zero. In this research Rogerson's Score statistic is
compared with the S.S.S method and the S.S.S method is applied to the simulated data
described in the earlier sections. Therefore, the performances of these methods need to be
quantified. This is done using diagnostic measures that are summarized in the next
section.
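The calculation described in Section 2.4.1 can be sketched in a few lines of Python. This is a minimal illustration and not the GeoSurveillance implementation: the function names (gaussian_weights, expected_cases, local_score) and the use of sigma for the bandwidth are assumptions made here, and the constant in front of the kernel and the normalization of the weights are left out. The output is therefore an unnormalized score and should not be compared with a normal reference distribution as it stands.

```python
import numpy as np


def gaussian_weights(coords, region_area, sigma=1.0):
    """Unnormalized Gaussian kernel weights between local-area centroids.

    coords      : (X, 2) array of centroid coordinates
    region_area : S(G), the area of the whole study region
    sigma       : bandwidth (assumed symbol); sigma = 1 means one kernel
                  standard deviation equals the average spacing sqrt(S(G)/X)
    """
    coords = np.asarray(coords, dtype=float)
    n_areas = coords.shape[0]                        # X in the text
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)                    # squared distances d_ih^2
    d2_scaled = d2 * n_areas / region_area           # rescale by X / S(G)
    return np.exp(-d2_scaled / (2.0 * sigma ** 2))


def expected_cases(population, cases):
    """E_h = (people in area h / all people) * (all cases)."""
    population = np.asarray(population, dtype=float)
    return population / population.sum() * np.sum(cases)


def local_score(cases, population, weights):
    """Smoothed sum of (x_h - E_h) / sqrt(E_h) around every local area i."""
    cases = np.asarray(cases, dtype=float)
    expected = expected_cases(population, cases)
    standardized = (cases - expected) / np.sqrt(expected)
    return weights @ standardized


if __name__ == "__main__":
    # Three hypothetical local areas on a line, equal populations.
    w = gaussian_weights([(0, 0), (1, 0), (2, 0)], region_area=3.0, sigma=1.0)
    print(local_score([20, 5, 5], [100, 100, 100], w))
```

In this toy example the first area has an excess of cases, so its score is positive, while the third area, which is surrounded by deficits, receives a negative score.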
2.5 Diagnostics
Power, sensitivity and specificity are metrics used to evaluate the quality of any
method for the detection of disease clusters [122, 123]. These metrics are also used to
compare different methods. Figure 2.15 (and the key below) summarizes the diagnostic
measures that are used.
The area that is a true cluster but has not been identified as a true cluster by the
cluster detection method. This area as a percentage of the area within the true
cluster (black oval) is a measure of the false negative percent of the cluster
detection test.
The area on the map that is not a cluster, but has been wrongly classified by the
cluster detection method as a true cluster. This area, as a percentage of the area of
the map (rectangle minus black oval) that is not a cluster is the false positive
percent.
The area of the map that is a cluster and has been correctly classified as being a
cluster by the cluster detection method. This area as a percentage of the area of
the true cluster (black oval) is a measure of the sensitivity of the cluster detection
method.
The area of the map that is not a cluster and has been correctly classified as not
being a cluster by the cluster detection method. This area as a percentage of the
area that is not a cluster (which is the rectangle minus the black oval) is a measure
of the specificity of the cluster detection method.
The above diagnostics are accepted measures of the quality of a disease clustering
test [122]. A good test has high values of sensitivity and specificity and low false negative
and false positive scores. Most researchers report sensitivity and specificity values,
because the other two diagnostic measures can be calculated with ease from the
sensitivity and specificity measures. These diagnostics were calculated for the two
methods. Note that these measures have been developed for cluster detection techniques
that detect clusters that follow administrative boundaries. A region (such as a ZCTA)
either in its entirety belongs to a cluster or does not. The measures are based on counts
and percentages of administrative regions that lie within or do not lie within clusters.
Thus, these diagnostics are easily calculated for Rogerson's Score statistic method.
However, grid-based smoothing methods such as Spatial Filtering cut across regional
boundaries. These metrics are thus slightly modified for the S.S.S method. Instead of
using a binary count of whether or not a ZCTA is included in a cluster candidate, the
percentage of the area of a ZCTA that is within (or not within) a cluster is used.
Sensitivity and specificity can be calculated for any one map. However, in simulation
experiments like this one, there will be some variation from one map to another.
Therefore, the diagnostics that are reported are averaged over all the simulated maps. In
this study, for each cluster (1,2,3,4) there are 20 simulated datasets. Sensitivity and
specificity are thus averaged over 20 maps.
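The area-based version of these diagnostics can be sketched as follows. This is an illustration under stated assumptions rather than the author's VBA code: the names map_diagnostics and average_diagnostics are hypothetical, the true cluster is assumed to be made up of whole ZCTAs, and detected_fraction is the share of each ZCTA's area that falls inside a detected cluster (0 or 1 for a method that follows ZCTA boundaries, any value in between for a grid-based method).

```python
from statistics import mean


def map_diagnostics(areas, in_true_cluster, detected_fraction):
    """Area-weighted diagnostics for one simulated map.

    areas             : area of each ZCTA
    in_true_cluster   : True if the ZCTA belongs to the simulated cluster
    detected_fraction : share (0..1) of the ZCTA's area inside a detected cluster
    """
    true_area = other_area = hit_area = false_hit_area = 0.0
    for area, is_true, fraction in zip(areas, in_true_cluster, detected_fraction):
        if is_true:
            true_area += area
            hit_area += area * fraction          # correctly detected cluster area
        else:
            other_area += area
            false_hit_area += area * fraction    # non-cluster area flagged as cluster
    sensitivity = hit_area / true_area
    specificity = 1.0 - false_hit_area / other_area
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative": 1.0 - sensitivity,
        "false_positive": 1.0 - specificity,
    }


def average_diagnostics(per_map):
    """Average each diagnostic over all simulated maps."""
    keys = per_map[0].keys()
    return {key: mean(m[key] for m in per_map) for key in keys}


if __name__ == "__main__":
    areas = [10, 10, 20, 20]
    truth = [True, True, False, False]
    maps = [
        map_diagnostics(areas, truth, [1.0, 0.5, 0.25, 0.0]),
        map_diagnostics(areas, truth, [1.0, 1.0, 0.0, 0.0]),
    ]
    print(average_diagnostics(maps))
```

The false negative and false positive percents fall out as the complements of sensitivity and specificity, which is why only the latter two usually need to be reported.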
2.6 Computational Scheme
The DMap Filtering routine is realized using a VBA-Excel program written by the
author. Rogerson's Score statistic is available in Rogerson's GeoSurveillance software
[121]. All the GIS functions (including convex hull) are realized using ArcGIS 9.1 [124].
Monte Carlo hypothesis testing is achieved in VBA-Excel.
2.7 Results
Figure 2.16 displays an example cluster detected by S.S.S and Rogerson's Score
statistic for Cluster-4. Cluster-4 is the cluster in which the shape, size and geometry of
the cluster pose a challenge for traditional cluster detection methods. It was predicted
earlier that S.S.S would show greater sensitivity than Rogerson's Score statistic in
detecting Cluster-4. This is observed in the results. In the dataset shown, S.S.S is about
three times as sensitive (sensitivity being the ability to detect a cluster given that it
exists) as Rogerson's Score statistic (56% versus 18% of the true cluster recovered);
averaged over all 20 datasets, the sensitivities are 58% and 27% respectively.
Cluster-3 was a relatively easier cluster detection scenario. The risk elevation at
the cluster was 3.0. Rogerson's method and S.S.S are equally successful in
detecting this cluster. Figure 2.17 displays the clusters detected by the two methods for
one of the 20 simulated datasets. Tables 2.5 and 2.6 present the summary diagnostic
statistics for the two cluster detection methods. The average sensitivity (the ability to
detect a cluster when it exists) for S.S.S is 95%. This implies that on average S.S.S is
able to detect 95% of the simulated cluster (Cluster-3). Clusters-1 and 2 were simulated
to test the underlying theory (or assumption), that cluster candidates that resemble
spurious clusters are less likely to be recovered by S.S.S. Cluster-2 has the greatest
resemblance to the spurious clusters and is therefore the least recoverable. These results
confirm this. The average sensitivity (or the ability to detect a cluster if it exists) is
around 33%. In contrast S.S.S offers an average sensitivity of around 83% with Cluster-1.
The sensitivities with which the clusters are recovered are thus almost exactly as predicted
by the theory. Cluster-3 is the most recoverable, while Cluster-2 is the least recoverable.
It was predicted that Cluster-1 (which has shape and size similar to spurious clusters) and
Cluster-4 (which had a rate similar to spurious clusters) would have medium
recoverability. In these simulations Cluster-1 is better recovered than Cluster-4. S.S.S
shows a sensitivity of 83% with Cluster-1, while it shows a sensitivity of 58% with
Cluster-4. However, lower specificity (the ability to classify areas that do not have an
excess of risk as such) is obtained with Cluster-1 than with Cluster-4. Tables 2.5 and 2.6
summarize the predicted ability to recover a cluster along with the sensitivity and
specificity with which it was recovered. The higher the sensitivity and specificity, the
better the cluster is recovered.
Score statistic was also used to detect Clusters-1 and 2 (Table . The average
sensitivity obtained was 96% with Cluster-1 and 87% with Cluster-2. The comparable
sensitivities for Cluster-1 and Cluster-2 with S.S.S are 83% and 33% respectively. It is
important to interpret these results in the right context. The computational approach [1]
that S.S.S uses makes it possible to predict which clusters are most likely to be
recovered. Thus, from the simulations discussed earlier, I predicted that Cluster-2 would
be the hardest to recover. Similarly, since both Cluster-1 and Cluster-2 had shapes and
sizes similar to those of spurious clusters, they were considered to be less likely to be
recovered. In contrast, Rogerson's method does not offer such predictive abilities.
For example, Rogerson's method is not able to reasonably detect Cluster-4 (sensitivity
27%). If Cluster-4 had existed in a real epidemiological situation, and Rogerson's
method had been applied to it, then the interpretation of the results would be challenging.
Is the result of Rogerson's method showing no clusters (in areas where there are
clusters) to be interpreted as a failure of Rogerson's method or as the
non-existence of a cluster? Without any additional evidence, the researcher may conclude that
the latter is true, with significant negative consequences. In contrast, if Cluster-2 had
existed in reality and S.S.S had been applied to it, it would have been known from the
simulations that clusters of certain shapes, sizes and rates would be hard to detect with
S.S.S. It is also prudent to systematically carry out the analyses at multiple
scales. This could be achieved by using geographic data at the individual level, or the
analysis scale can be changed by using a different filter size. For an example see Rushton
et al. [125]. S.S.S thus empowers the researcher with a priori knowledge based on
computational geography, in contrast to blind approaches like Rogerson's Score statistic.
These experiments are designed to empirically test the contribution that each of
the axes (shape, size and rate) makes to the overall sensitivity of these analyses. I have
argued earlier that using a multidimensional cluster signature space should offer better
power than using simply a rate or likelihood statistic. Simulated clusters 1 and 3 have the
same risk elevation and clusters 2 and 4 also have the same risk elevation. If risk
elevation were the only deciding factor in the recoverability of a cluster, or if the shape
and size axes did not matter, then it would be expected that the same, or similar
sensitivity will be obtained for clusters 1 and 3, and for clusters 2 and 4. However, this is
not the case. Clusters 3 and 4 are recovered better than clusters 1 and 2, respectively. The
reason why these clusters are better recovered is that clusters 3 and 4 have shapes and sizes
that are markedly different from the shape and size of Clusters 1 and 2 (and, concurrently,
of spurious clusters). It is not that rate is unimportant. For clusters with shapes and sizes
similar to those of spurious clusters, an increase in risk elevation from 1.25 to 3.0 raises
sensitivity by 50 percentage points (from 33% for Cluster 2 to 83% for Cluster 1); for
clusters with shapes and sizes unlike those of spurious clusters, it raises sensitivity by 37
percentage points (from 58% for Cluster 4 to 95% for Cluster 3). Also, since the patterns are
created and clusters extracted based on rates, the rate axis is indispensable. However, as
these simulations show, gains in sensitivity are to be had from incorporating shape and
size in the analyses. Which of these axes, shape or size, is contributing to the overall
sensitivity, and in what proportion? I calculated sensitivity values using information on
just one axis while withholding information on the other two axes. For example, a
datapoint found significant on the basis of rate, shape and rate, size and rate, or all of
shape, size and rate in the earlier table would be found significant in these analyses for
the rate axis only. The results of these analyses show that all three axes are important. Shape
and rate contribute most to the significance of candidate clusters. Tables 2.8 and 2.9
illustrate these results. The clusters that are significant generally have more regular
shapes (with the exception of cluster 3) and higher rates than spurious clusters. While
size does not contribute to the significance (with the exception of cluster 3) of candidate
clusters, the sizes of candidate clusters are different from the sizes of spurious clusters.
As Table 2.8 illustrates, the median size of the significant candidate clusters is greater
than the median size of spurious clusters. Also observe from Table 2.9 that for clusters 2
and 4, the shape axis is contributing significantly to the sensitivity of the test. These are
the clusters where the rate is very similar to that of spurious clusters.
Clusters that have high rates are easily recovered. If the rate is high enough the
pattern is extracted almost exactly as it was simulated (Cluster 3). For clusters that do
not have a high rate, shape acts as a back-up sensitivity engine. Thus, for example, in
Cluster 4, the rate is similar to that of the background noise. When this pattern is
recovered, it breaks up into multiple cluster candidates that may or may not have rates
higher than the background. However, these candidates are more regular in shape than
the background noise. I do not know the exact mechanism that causes these clusters to be
more regular than the background noise or spurious clusters; nevertheless the empirical
evidence is there. Note that the shape axis does cause an increase in false positives, but this
may be outweighed by the increase in sensitivity. The costs and benefits would have to be
judged individually for every specific public health situation (see discussion).
Nevertheless, I can at least claim that the two dimensional signature space with rate and
shape, is better than the traditional uni-dimensional rate axis. Should the size axis be
discarded? Perhaps not. While in these experiments size was not significant (in all but
cluster 3), the average size of clusters is still larger than that of non clusters. I therefore argue
that a three dimensional signature space should be used in distinguishing clusters from
non clusters.
These simulations thus provide empirical evidence to support the theory that a
three dimensional computation space is best for disease cluster detection. The empirical
evidence conforms to the theory discussed earlier: the sizes of clusters are larger than those
of non clusters, and the shapes of clusters are different from those of non clusters.
2.8 Discussions and future directions

Recent discussions in the disease clustering literature have brought to the fore
questions about the geographical aspects of disease clusters. Issues of cluster shape and
size, and recoverability are being investigated [31, 33, 34, 43, 69, 73]. Computational
systems are being developed that address the question of recoverability, or "How well
could a cluster be identified, if it were there?" While the answer to this question was once
taken for granted (that the clusters are fully recoverable under any situation), there is
increasing acknowledgement that methods to address these issues need to be developed.
As Jacquez states, two of the major deficiencies of geographic studies of disease
clusters are that they often assume clusters have a specific shape (e.g. circle or ellipse)
and do not evaluate statistical power using the geography, at-risk population,
demographics, covariates and numbers of observed cases of the cancer under
investigation [33]. In this research I have argued that recoverability of a cluster depends
on the disease mapping situation at hand. Depending on the geography of a region,
certain clusters will be less recoverable than others. Clusters that are less recoverable will
resemble noisy or spurious clusters. A computational approach is outlined where for
any given disease mapping situation the nature of these spurious clusters can be mined
from the data. This knowledge can be used to predict the ability to recover any given
cluster. True clusters are compared with spurious clusters in a three dimensional
computational space that incorporates shape, size and rate. Shape and size are
fundamental to geographical analysis [39, 64, 74, 76, 87, 88, 90, 91, 95, 126-129]. While
the use of rate as a means of distinguishing real clusters from spurious ones is well
established in the disease cluster literature, there is less documented evidence of the use
and utility of shape and size in this literature.
Nevertheless there has been some limited attempt to study shape and size by other
researchers. Duczmal [31, 43, 69, 73] can be credited with having made the empirical
observation that clusters have a different shape and size than non-clusters. Duczmal and
other researchers [31, 69, 71, 73] discovered that clusters in Duczmal's Simulated
Annealing Cluster search method are of a large size and irregular shape when data are
simulated under conditions of no clusters being present, at a census tract level geography.
The researchers thus considered clusters with this signature as being spurious and put
a penalty on any cluster candidates that resembled this signature [69, 73]. Other
researchers have attempted to compare the size of clusters detected by different cluster
detection methods [130]. However, the shape of their clusters was limited to circles, and
the cluster detection methods used different estimates of risk (log likelihood ratios versus
rates). Thus, while there is some evidence of research involving shape and size in the
disease cluster detection literature, and there is some empirical evidence to show that the
shape and size of clusters are different from non clusters, there has been no systematic
effort, like the one outlined in this research, to utilize size and shape along with rate in
cluster detection. The empirical observations in this research support the observations
made by Duczmal [31, 69, 71, 73] that the shape and size of clusters are different from
non clusters. I show that incorporating the shape axis in addition to rate can greatly
increase the sensitivity of the cluster detection method. This increase in sensitivity comes
with some cost in the form of false positives. The size axis is found to be important, but
not enough to add extra sensitivity to the test. One may choose to either abandon this
axis, or at least do more empirical research to test the reliability of this axis in other
geographical situations.
As in any other disease mapping/cluster detection situation, it is important to take
the public health implications of any cluster search into consideration. While it is
desirable to have a cluster detection technique that is conservative, the public has a
right to know if there is an increased risk of disease in their neighborhood. We have
observed in these analyses that incorporating the shape axis along with rate can increase
the sensitivity of the S.S.S method. However, it also somewhat increases the number of
false positives. This is a limitation of this method. The ultimate decision as to whether
one should be conservative or not depends on the public health implications of the
decisions. While it may be necessary to have the highest sensitivity possible (even at the
cost of specificity/increased false positives) for contagious or fatal diseases, a
conservative approach could serve better for diseases with smaller public health burdens.
An important question in this context is the elevation of risk that is important enough to
need a public health intervention. For instance, a small elevation in risk in a common
disease like prostate cancer, where an intervention may not have a desirable risk-return
tradeoff can be ignored without consequence. On the other hand, a small elevation in risk
in a rare, highly contagious and/or non-endemic disease like non-Hodgkin's lymphoma,
West Nile or Ebola may need immediate emergency intervention.
Increasing or decreasing the filter size has an effect on the patterns that are
recovered. A larger filter size implies that a smoother pattern will be recovered while a
smaller filter size creates a pattern with greater spatial variations. These changes in the
patterns will be manifested in the cluster candidates that are extracted. However, this effect
will be manifested both in the simulated noise datasets and in the real dataset in the S.S.S
methodology. Thus, while increasing or decreasing the filter size does affect the results
that are obtained from the S.S.S methodology, it does not affect their accuracy. The
adaptive filters are larger in areas with lower population densities. However, this does not
directly translate to a possibility of having larger clusters in rural areas. First, the risk
elevation at the region may not be above the cutoff threshold to create candidate
clusters. Second, the noise generating process may create reference spurious clusters
that have a size larger or similar to the large candidate clusters extracted from rural
areas. If a significant cluster is found, then it is likely that this cluster exists. In this
context it is important to take into consideration some of the limitations that this method
has. The conclusions that one draws from using this method are dependent on the pattern
of spurious clusters that are simulated for comparison. The pattern is unique to a
particular geography and it is therefore good practice to simulate the pattern every time
the method is ported to a new geographic area. Sometimes, however, it may be possible to
use the same geography to create the simulated patterns. For example, if prostate cancer
incidence is studied at the ZCTA level in Iowa, it may be possible to use the patterns
obtained in Iowa, in areas with similar geography (northern Missouri, Nebraska), when
studying the same disease. However, for the sake of accuracy it is recommended that a
new set of simulations be done for each and every new geographic situation.
There are a number of interesting extensions to this research. Some of these can be
inspired by the work of Patil [70], Duczmal [31, 43, 69, 70, 73] and Boscoe [131]. For
example, instead of using the rates as an indicator of risk, it is possible to use the spatial
scan likelihood ratios. This method can be extended to be used with multiple echelons
instead of one. This may improve the sensitivity of this method. Interesting methods of
visualizing the results of these analyses can also be developed. One approach could be to
use nested clusters of any shape with colors or shades representative of their level of
significance (a p-map [9] of a different type) or their risk elevations [131].
A limited number of spatial forms of synthetic clusters are tested in this research.
While it is possible to extend this testing to multiple spatial forms, it is important to
understand the computationally intensive nature of the problem. We may be tempted to
believe that the shapes, sizes and rates of clusters can be represented in three dimensional
space by a three dimensional grid, with each grid point representing a possible synthetic
cluster. The power of the S.S.S (and other) methods can then be tested against these
synthetic clusters. Unfortunately this belief is misleading. This is because the
relationship between the three dimensional attribute space and geographic space is not one
to one for the shape parameter, which is represented by a compactness value. There is an
infinite, or certainly a large, number of possible shapes that can have the same value of
compactness. The problem thus, while not entirely intractable, is certainly
computationally complex. The limited number of cluster shapes and sizes that are tested
in this research a) are realistic and b) cover a reasonable spectrum of possible cluster
spatial forms. A survey of the spatial forms of clusters used in the existing
cluster diagnostic literature shows either a complete lack of realistic cluster forms [29,
31, 36] or a test of just one possible form [71]. The methods that are proposed in this
research are only as good as the data that go into the analyses. The coarser the resolution
of the data, the less likely that local variations will be detected. While ZCTAs are
sufficient to demonstrate the methods proposed in this research, individual level data may
serve the purpose better in real epidemiological situations. These analyses can easily be
extended to individual level data.
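The point made above, that many different shapes share one compactness value, can be illustrated with a small sketch. It assumes, purely for illustration, that compactness is measured by the isoperimetric quotient 4*pi*area/perimeter^2; the compactness measure actually used in this research may differ, and the function names and the four test polygons are hypothetical.

```python
import math


def area_and_perimeter(vertices):
    """Shoelace area and perimeter of a simple polygon given as (x, y) vertices."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(vertices)
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter


def compactness(vertices):
    """Assumed measure: 4*pi*area / perimeter^2 (equals 1 for a circle)."""
    area, perimeter = area_and_perimeter(vertices)
    return 4.0 * math.pi * area / perimeter ** 2


# Four polygons built from four unit squares each (all have area 4).
SHAPES = {
    "bar": [(0, 0), (4, 0), (4, 1), (0, 1)],
    "L": [(0, 0), (3, 0), (3, 1), (1, 1), (1, 2), (0, 2)],
    "T": [(0, 0), (3, 0), (3, 1), (2, 1), (2, 2), (1, 2), (1, 1), (0, 1)],
    "square": [(0, 0), (2, 0), (2, 2), (0, 2)],
}

if __name__ == "__main__":
    for name, vertices in SHAPES.items():
        print(name, round(compactness(vertices), 4))
```

The bar, the L and the T all have area 4 and perimeter 10, so under this measure they map to the same point on the shape axis even though they are clearly different geographic forms; only the square differs.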
Rogerson's Score statistic performs reasonably in these simulations. If multiple
testing were a problem, then a number of spurious clusters would have been detected.
This would have decreased the specificity of this method. However, this was not the case.
This method suffers from one weakness (apart from being blind as discussed), which
was not exposed in this chapter, but becomes clear in Chapter-3. This weakness is the
inability of this method to address the small number problem. While in these simulations
relatively large base populations were used, in the next chapter, the base populations are
segmented by age. Thus, some ZCTAs have small populations, and the score statistic
calculates rates based on these populations. This creates noisy or spurious clusters.
While a large number of methods for the detection of disease clusters exist, very
few methods offer the ability to manipulate the shape and size of the candidate clusters.
In the last few years, a variety of methods have attempted to address this issue, almost all
of which are based on Kulldorff's Spatial Scan Statistic method of statistical testing.
Duczmal's method, which has been shown to be powerful [31, 43], is also based on the
Spatial Scan Statistic. This research has been one of the few attempts to suggest an
alternate approach. Unlike the SaTScan based approach that assumes normative truths
about the distributional characteristics of the data, this approach is relatively
positivistic. Deriving its strengths from computational geography [35], this
approach adapts the statistical testing of candidate clusters, to the specific disease
clustering situation in question.
Figure 2.1: Using echelons to extract cluster candidates.
Figure 2.2: A set of 50,000 cardiovascular disease mortality cases are randomly
distributed by population weights to each of 942 ZCTAs in the state of Iowa.
A pattern is then extracted using Spatial Filtering. The pattern is binarized,
and the resulting polygon cluster candidates are extracted using a GIS.
Figure 2.3: An example set of spurious cluster signatures S(ZN ) in signature space.
Figure 2.4: An example set of spurious cluster signatures S(ZN ) in signature space with
a few candidate clusters (grey squares).
Figure 2.5: Bounding rectangle for elliptical footprint.
Figure 2.6: Flowchart of the S.S.S method.
Figure 2.7: Population distribution of ZCTAs in Iowa, 2000.
[Figure 2.8 diagram. Labels: Uniform distribution of PRNG; bins k1, k2, ..., kn; range of k2 from 0.15 to 0.42; Cumulative weight of k regions.]
Figure 2.8: This figure displays the computational process used to create the
simulated dataset. Each bin is labeled as k and has a specific size. For the simulations in
this research n=942.
Note: Reproduced from Kumar, N. and A. Bragdon, 2008. Pseudo Random Number
Generators for Simulating Randomness in Geographic Space, manuscript, Dept of
Geography, The University of Iowa, Iowa City, IA 52242.
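The binning process in Figure 2.8 amounts to drawing a uniform random number and finding the bin of cumulative population weight into which it falls. A minimal sketch of that idea is given below. It is not the VBA-Excel routine used in this research; the function name allocate_cases and the use of Python's random and bisect modules are assumptions made for illustration.

```python
import bisect
import random
from itertools import accumulate


def allocate_cases(populations, n_cases, seed=None):
    """Distribute n_cases over regions in proportion to population weight.

    Each region k owns a bin on [0, 1) whose width is its share of the
    total population; a uniform draw is assigned to the bin it lands in.
    """
    rng = random.Random(seed)
    total = sum(populations)
    # Upper edge of every bin on the cumulative-weight axis.
    cumulative = list(accumulate(p / total for p in populations))
    cumulative[-1] = 1.0  # guard against floating-point round-off
    counts = [0] * len(populations)
    for _ in range(n_cases):
        u = rng.random()                          # uniform draw in [0, 1)
        k = bisect.bisect_right(cumulative, u)    # first bin whose upper edge exceeds u
        counts[k] += 1
    return counts


if __name__ == "__main__":
    # Three hypothetical regions; the last one is unpopulated.
    print(allocate_cases([900, 100, 0], 1000, seed=1))
```

Because every case is an independent draw with probabilities equal to the population weights, the resulting vector of counts follows a multinomial distribution, and regions with zero population never receive a case.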
Figure 2.9: Proof: The simulated datasets follow a multinomial distribution.
Figure 2.10: Summary of shapes of simulated spurious clusters, frequency and
cumulative frequency.
Figure 2.11: Summary of sizes of simulated spurious clusters, frequency and cumulative
frequency.
Figure 2.12: Summary of rates of simulated spurious clusters.
Figure 2.13: Characteristics of the four clusters simulated under the alternative
hypothesis.
Figure 2.14: Population densities in simulated clusters compared to population densities
in Iowa.
Figure 2.15: Cluster detection diagnostics (the key to the numbers is in the text).
Figure 2.16: Patterns detected by the Score statistic and the S.S.S method for one dataset
among 20 datasets simulated for cluster-4. The true cluster pattern can be
seen in the inset. In this particular dataset S.S.S is able to identify 56% of the true
cluster pattern, while the Score statistic is able to identify 18%.
Figure 2.17: Patterns detected by the Score statistic and the S.S.S method for one dataset
among 20 datasets simulated for cluster-3. The true cluster pattern can be
seen in the inset. In this particular dataset S.S.S is able to identify 98% of the
true cluster pattern, while the Score statistic is able to identify 92%.
Attribute    50 datasets    49 datasets
Shape        0.26           0.26
Size         260.78         260.55
Rate         1.06           1.06
Table 2.1: Hold-one-out validation for the null hypothesis.
                  Sensitivity         Specificity
Total datasets    20        19        20        19
Cluster 1         83%       82%       92%       92%
Cluster 2         33%       30%       95%       97%
Cluster 3         95%       95%       90%       90%
Cluster 4         58%       60%       99%       99%
Table 2.2: Hold-one-out validation for the alternative hypothesis.
         Mean      Median    Minimum    Maximum
Shape    0.26      0.27      0.03       0.44
Size     260.78    12.52     0.88       6584.63
Rate     1.06      1.05      1.00       1.52
Table 2.3: Summary statistics of the simulated 3675 spurious clusters.
Simulated cluster    Shape (compactness)       Size (in square miles)            Risk    How recoverable is this cluster? (hypothesized rank of the extent to which the cluster will be recovered by S.S.S)
Cluster 1            0.20, 0.25, 0.39, 0.15    108.77, 179.19, 630.00, 311.32    3.00    Medium recoverability
Cluster 2            0.20, 0.25, 0.39, 0.15    108.77, 179.19, 630.00, 311.32    1.25    Hardest to recover
Cluster 3            0.01                      9762.00                           3.00    Easiest to recover
Cluster 4            0.01                      9762.00                           1.25    Medium recoverability
Mean shape, size and rate of spurious clusters: 0.27, 260.78, 1.1
Table 2.4: Shape, size, risk (signature) and the ability to recover simulated clusters.
             Average        Average        Average false    Average false    Predicted
             sensitivity    specificity    positive rate    negative rate    recoverability
Cluster 1    83%            92%            8%               17%              Medium
Cluster 2    33%            95%            5%               67%              Hardest
Cluster 3    95%            90%            10%              5%               Easiest
Cluster 4    58%            99%            1%               42%              Medium
Table 2.5: The table illustrates the average sensitivity (ability to detect a cluster when it
exists) and specificity (ability to classify an area that is not a cluster as such).
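The four diagnostics reported in Table 2.5 are linked by simple identities, which the
following sketch makes explicit. It is an illustration only: the function name and the
counts passed to it are hypothetical, and the unit being classified (for example a ZCTA
or a unit of area) is left open.

```python
def diagnostics(true_pos, false_pos, true_neg, false_neg):
    """Return sensitivity, specificity and the two error rates as fractions."""
    # Share of the true cluster that is correctly marked as cluster.
    sensitivity = true_pos / (true_pos + false_neg)
    # Share of the non-cluster area that is correctly left unmarked.
    specificity = true_neg / (true_neg + false_pos)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity,
        "false_negative_rate": 1.0 - sensitivity,
    }


# Hypothetical counts chosen to reproduce the Cluster 1 row of Table 2.5.
cluster_1 = diagnostics(true_pos=83, false_pos=8, true_neg=92, false_neg=17)
```

The false positive rate is the complement of specificity and the false negative rate is
the complement of sensitivity, which is why each row of Table 2.5 sums to 100% within
each pair.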
           S.S.S (averages)                                           Local Score statistic (averages)
Cluster    Sensitivity  Specificity  False pos.   False neg.          Sensitivity  Specificity  False pos.   False neg.
                                     rate         rate                                          rate         rate
3          95%          90%          10%          5%                  95%          92%          8%           5%
4          58%          99%          1%           42%                 27%          98%          2%           63%
Table 2.6: This table compares the sensitivity and specificity with which clusters
are recovered by S.S.S and Rogerson's method. The higher the
sensitivity, the better the cluster is recovered.
             Sensitivity              Specificity              False Positive
             Rate        Shape        Rate        Shape        Rate        Shape
             axis only   axis only    axis only   axis only    axis only   axis only
Cluster 1    82%         80%          89%         88%          11%         12%
Cluster 2    3%          31%          95%         83%          5%          17%
Cluster 3    95%         95%          90%         90%          10%         10%
Cluster 4    23%         56%          96%         90%          4%          10%
Table 2.7: Cluster recovery using only rates and only shapes.
                      Mean shape    Median size
Spurious clusters     0.27          12.52

Cluster candidates    Mean shape of significant    Median size of significant
                      cluster candidates           cluster candidates
Cluster 1             0.69                         220.29
Cluster 2             0.61                         19.76
Cluster 3             0.0246                       13,311.85
Cluster 4             0.58                         24.10
Table 2.8: How do true clusters differ in shape and size from spurious clusters?
CHAPTER 3: INVESTIGATING THE SPATIAL PATTERNS OF
PROSTATE CANCER IN IOWA
In this chapter I investigate the spatial patterns of prostate cancer incidence in
Iowa. I begin with a discussion of the spatial epidemiology of prostate cancer and of how
cluster detection techniques and geographic analyses have advanced this research. I then
apply three cluster detection techniques to investigate possible prostate cancer clusters in
the state of Iowa: the S.S.S (Shape Size Sensitive) method developed in this dissertation,
Kulldorff's Spatial Scan Statistic and Rogerson's Score test. The implications of the
results are then discussed.
3.1 Background
Prostate cancer is the most common cancer among men (excluding skin cancer) in
the United States. It is also the second leading cause of cancer related deaths [132]. It is
estimated that 186,320 men will be diagnosed with and 28,660 men will die of cancer of
the prostate in 2008 [133]. Other developed countries fare similarly. Prostate cancer is the
most common cancer among men in the UK and accounts for 24% of all new cancers
detected [134] there.
The risk of an individual having prostate cancer rises exponentially with age.
Almost 80% of all prostate cancers that are diagnosed are in men above the age of 65.
Estimates from the SEER (Surveillance, Epidemiology, and End Results) program indicate
that US men under the age of 65 had an incidence rate of 56.8 per 100,000, while the same
statistic for the 65 plus age group is 974.7 [135]. Other risk factors for prostate cancer
are race and family history. Some risk factors of prostate cancer for which the evidence is
inconsistent are diet, occupation (with farmers being at an increased risk), obesity, low
vitamin D intake, sexually transmitted diseases, diabetes, smoking and physical inactivity
[135].
Prostate cancer also shows marked geographic variability. There are substantial
geographical variations in the mortality, incidence, treatment and survival patterns of
prostate cancer [131, 136]. Because of these variations, a number of GIS-based
techniques have been used to study these
spatial patterns [46, 119, 131, 137, 138]. Since the validity of the spatial patterns
depends on the quality of the underlying spatial data [37, 42], researchers have
investigated the role of scale and geocoding quality in the spatial patterns of this disease
[125, 139, 140].
Historically, there is a higher risk of death from prostate cancer for people living
in the Northern Plains than in the rest of the US [141, 142]. There is a higher risk
associated with a rural residence than an urban one [141, 142]. Some associations have
been hypothesized between various agricultural pesticides and prostate cancer. For
instance, pesticide applicators have a relative risk 1.4 times the general population [103].
Some studies are attempting to address these concerns by studying the risk of prostate
(and other) cancers in a large cohort of agricultural workers and their families [103, 143].
Iowa is a large agricultural state, and prostate cancer accounts for 26.3% (the
highest percentage of all cancers) of new cancers detected in Iowa [144]. While there are
some maps of prostate cancer incidence from states with SEER registries (of which Iowa
is one), and these data can be obtained from certain sources [145], no record exists of any
attempt to systematically search for areas with excess risk of prostate cancer in Iowa.
Maps for the ICCCC (Iowa Consortium for Comprehensive Cancer Control) are one step
in this direction [146]. Some of these maps show smoothed rates of prostate cancer
incidence and mortality in Iowa. Smoothed maps created from adaptive filters are more
representative of the underlying variations in risk than simple choropleth maps [2, 58,
60]. The rates mapped on choropleth maps can be inaccurate if they are calculated
from a small base population [58, 60, 146]. While these maps are powerful
exploratory tools, some of the patterns observed on them could be spurious and
could have arisen by chance. These patterns need to be statistically tested to find the
likelihood of their having arisen by chance. If this likelihood is small, then it is very
likely that the patterns are real. Cluster detection techniques [45, 147] were developed
with this objective in mind [146]. The objective of this study is to address the question
"Are there any areas of Iowa with significant excess prostate cancer incidence risk?" or
"Are there any clusters of prostate cancer incidence in Iowa?" If there are any clusters,
how is their existence to be interpreted? Note that the objective of this study is to find
areas with excess risk of prostate cancer incidence, as opposed to excess prostate cancer
incidence per se. Risk is a dynamic and unobserved quantity that the researcher wishes
to estimate [4]. The incidence of prostate cancer can be high in a given area, but this
does not necessarily imply an excess of risk. For example, if the incidence rates are based
on small numbers, then risk estimates based on these rates may not be representative of
the true value of the risk. The methods that are used in this research, such as Kulldorff's
Spatial Scan Statistic [3], the adaptive filtering method [58] and the S.S.S method, report
values that can be directly interpreted as risk. These methods are discussed in the next
section.
3.2 Methods
The data examined are newly diagnosed cases of prostate cancer for ages above
45, geocoded to ZIP Codes, for the years 1999-2004 in the state of Iowa. Subsequently
they were geocoded to ZCTAs (ZIP Code Tabulation Areas). The data were obtained
from the State Health Registry of Iowa (S.H.R.I) [148], which is one of the 18 SEER
registries in the United States, through a data sharing agreement. The University of Iowa
IRB (Institutional Review Board) approved the use of this data (the IRB application
number is 200805730).
The ZCTA boundaries were obtained from the US Census website [149]. For all
three methods applied to the data, the cancer rates were age-standardized
using the indirect standardization method. Three age groups were used in the
standardization procedure: 45-64, 65-84 and 85+. These age groups are consistent with
those used by previous researchers [142]. The numbers of cases in the three groups were
3944, 7071 and 2193 respectively, for a total of 13,208 prostate cancer cases
in Iowa.
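Indirect standardization with these three age groups can be sketched as follows. The
statewide case counts are those reported above; the statewide and local populations are
hypothetical placeholders, as are the function names, and the sketch is an illustration
rather than the procedure as implemented in this research.

```python
AGE_GROUPS = ("45-64", "65-84", "85+")

# Statewide prostate cancer cases by age group, 1999-2004 (from the text).
STATE_CASES = {"45-64": 3944, "65-84": 7071, "85+": 2193}

# Hypothetical statewide male populations by age group (placeholders only).
STATE_POP = {"45-64": 600000, "65-84": 350000, "85+": 60000}


def expected_cases(area_pop, state_cases=STATE_CASES, state_pop=STATE_POP):
    """Apply the statewide age-specific rates to a local population."""
    return sum(
        area_pop[g] * state_cases[g] / state_pop[g] for g in AGE_GROUPS
    )


def standardized_ratio(observed, area_pop):
    """Observed cases divided by indirectly standardized expected cases."""
    return observed / expected_cases(area_pop)


# A hypothetical ZCTA holding exactly 1% of each statewide age group.
zcta_pop = {"45-64": 6000, "65-84": 3500, "85+": 600}
zcta_expected = expected_cases(zcta_pop)
```

A local area with the same age structure as the state receives its population share of
the 13,208 cases; an older area receives more expected cases, so a high crude rate there
does not by itself indicate excess risk.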
There are many cluster detection methods [45, 150], some of which were
discussed in Chapter 1 of this dissertation. One approach in the public health
community is to use a battery of methods to search for excesses of risk on a map [130,
151-153]. If a number of methods agree that certain areas of a map have an excess of
risk, it is likely that these excesses are real. Two of the best examples are the breast
cancer clusters around Cape Cod and the Long Island breast cancer cluster. In the former,
researchers found clusters using GAM (Generalized Additive Modeling), Bonetti and
Pagano's M statistic and Kulldorff's SaTScan [153, 154]. Breast cancer clusters were
found on Long Island, New York, using Kulldorff's SaTScan [155] and LISA methods
[156].
The three methods used in this research are 1) the S.S.S method, 2)
Rogerson's Score statistic [50] and 3) Kulldorff's SaTScan [3]. The three methods were
chosen because they are based on different principles. The S.S.S method and Rogerson's
Score statistic were described in the last chapter, so they are not explained in detail in
this section. Rogerson's Score statistic is a localized version of a focused test. For every
ZCTA on the map, it looks for an excess of risk around the ZCTA. This excess risk is
calculated as a weighted sum of the differences between the observed and expected rates.
This is summarized as a normalized statistic that is then compared with a standard normal
distribution. In these analyses all tests were carried out at the 0.01 level. The test is
repeated for all ZCTAs. Note that because the method calculates rates at individual
ZCTAs, it may detect spurious clusters where the number of expected cases, or the
population at risk, is small. Rogerson's Score statistic is implemented in the
freely downloadable GeoSurveillance software [121].
The S.S.S method is a new method proposed by the author that compares a cluster
candidate with the signature of the background noise, or spurious clusters, to decide if
the candidate is a true cluster. Like most cluster detection methods, the S.S.S method
addresses the question of the likelihood of a given cluster candidate having arisen by
chance. But, unlike most methods, S.S.S uses the shape, size and rate of a cluster (instead
of just the rate) to infer if a given cluster candidate is a true cluster. In this research, the
cluster candidates are extracted by using echelons [70, 111] and surfaces from the
adaptive spatial filtering approach [58]. The process is simple. A horizontal plane at a
risk level of choice (in this case 1.2) is intersected with the three-dimensional surface
created from adaptive spatial filtering of the real data. The risk level represents the mean
rate in all the simulated cluster candidates. This yields numerous two-dimensional
irregular cluster candidates. At the same time a number of datasets are simulated using
the same geography and the same number of cases, but with the risk at each ZCTA
proportional to the population at risk. When the age standardization procedure is used to
allocate disease cases to the ZCTAs, the number of cases that a ZCTA receives depends
on the population structure of that particular ZCTA. Thus a ZCTA with a large
proportion of 85+ people will receive proportionally more cases for that age group.
Computationally we can imagine three strings instead of the one seen in Figure 2.8, one
string for each age group. The final number of cases is the sum of the cases received from
each age group. This process causes the numerators (numbers of cases) to change
differentially from the denominators (populations). The older the population structure of
an area, the greater the relative risk. Cluster candidates are extracted from these datasets
as they are for the real data. These "reference" or "spurious" cluster candidates are
compared with the candidates from the real data using their shape, size and rate. If any of
the cluster candidates from the real data are found significant, they are marked out as true
clusters. These true clusters can be mapped for further exploration. While the S.S.S
method is used in this research with cluster candidates extracted using echelons and
adaptive filtering, in principle it can be extended to cluster candidates extracted using any
other method.
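The plane-intersection step can be illustrated on a gridded surface. The sketch below is
not the GIS and echelon workflow used in this research; it assumes a hypothetical raster
of smoothed relative risks, treats cells strictly above the chosen level as inside a
candidate, and joins cells that share an edge. All names and values are illustrative.

```python
from collections import deque


def extract_candidates(surface, level=1.2):
    """Return one list of (row, col) cells per connected region above level."""
    rows, cols = len(surface), len(surface[0])
    seen = [[False] * cols for _ in range(rows)]
    candidates = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or surface[r][c] <= level:
                continue
            # Breadth-first search collects every edge-connected cell that
            # also lies above the plane.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and surface[ny][nx] > level:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            candidates.append(region)
    return candidates


# Hypothetical 3 x 4 surface of smoothed relative risks.
demo_surface = [
    [1.0, 1.3, 1.0, 1.0],
    [1.0, 1.4, 1.0, 1.25],
    [1.0, 1.0, 1.0, 1.3],
]
demo_candidates = extract_candidates(demo_surface, level=1.2)
```

Each region returned this way is one cluster candidate whose shape, size and rate can
then be measured; the same function would be applied unchanged to every simulated
surface to build the reference set of spurious candidates.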
The third method used in this research is the SaTScan method, which uses density
estimation and a likelihood-based testing approach [3, 26]. Kulldorff's Spatial Scan
Statistic is widely used in the public health community to detect areas of excess risk of
disease on a map [92, 130, 131, 155, 157]. The method uses overlapping circles of
increasing radii centered on the centroids of local areas (ZCTAs). For each circle, the
relative risk and likelihood ratio are calculated based on the cumulative observed and
expected cases contained within it. Relative risk is a measure of the increased or
decreased risk associated with being in a particular circle relative to the state, and is the
ratio of observed cases to expected cases. In contrast, the likelihood ratio in a circle is a
measure of how the incidence rate within a circle differs from the rate outside the circle
and can be calculated as follows:
LLR = (O ln (O/E)) + ((n-O) ln [(n-O)/(n-E)])
where
LLR represents the logarithm of the likelihood ratio, O is observed cases, E is
expected cases, and n is the total number of cases in the entire region (Iowa). This
formula assumes that disease events are distributed as a Poisson random variable. The
likelihood ratios are compared to the results of a Monte Carlo simulation of the data, and
each circle is thus assigned a p-value according to its likelihood rank. Circles with
p-values less than 0.01 or 0.05 can be considered significant. Recent modifications to the
SaTScan methodology allow for the search of both circular and elliptic clusters [29].
Both circular and elliptic cluster searches were carried out in this research. The largest
cluster size was set at 50% of the population. The Spatial Scan Statistic is implemented
by the SaTScan software, which is freely available.
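The likelihood ratio formula and the rank-based p-value can be sketched as follows. This
is an illustration and not the SaTScan implementation: the function names are
hypothetical, the rule that circles without an excess score zero is an assumption made
for a search restricted to high rates, and the p-value assumes that each simulated
dataset contributes its single largest LLR.

```python
import math


def log_likelihood_ratio(observed, expected, total):
    """LLR = O ln(O/E) + (n - O) ln[(n - O)/(n - E)] for one circle."""
    if observed <= expected:
        return 0.0  # assumption: only circles with an excess are scored
    inside = observed * math.log(observed / expected)
    rest = total - observed
    outside = rest * math.log(rest / (total - expected)) if rest > 0 else 0.0
    return inside + outside


def monte_carlo_p_value(observed_llr, simulated_max_llrs):
    """Rank of the observed LLR among the simulated maxima, as a p-value."""
    rank = 1 + sum(1 for s in simulated_max_llrs if s >= observed_llr)
    return rank / (len(simulated_max_llrs) + 1)


# Counts quoted in Section 3.3 for the elliptic primary cluster.
primary_llr = log_likelihood_ratio(observed=631, expected=446, total=13208)
```

With 999 simulated datasets, an observed LLR that exceeds every simulated maximum
receives a p-value of 1/1000, the smallest value the procedure can report.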
The three methods (S.S.S, the Spatial Scan Statistic and Rogerson's Score statistic)
were used to investigate whether any excesses of risk of prostate cancer incidence exist
in Iowa. The results of these investigations are explained next. There is considerable
agreement among the methods on the location and size of the excesses of prostate cancer
incidence risk in Iowa.
3.3 Results
I start with exploratory spatial analyses of the data. Adaptive filters were used to
smooth the data to observe the underlying variations in risk. Figure 3.1 displays this map.
Each rate is based on 397 expected cases on a 2.5 mile grid. With this number of
expected cases the smallest detectable difference in relative risk is around 10% [4, 158].
This difference is sufficient for these analyses (for example, see Alavanja et al. [103]). A
larger filter size would allow smaller risk differences to be detected, but would also
smooth out the spatial variations [25]. A smaller filter size decreases the ability to
reliably detect small differences in risk. For example, if 30 expected cases were used as
the filter size, the smallest detectable difference in risk would be 36%. The
analyses using spatially adaptive filters show that some areas in North West Iowa have
risks 30% greater than normal. There are also isolated patches of high risk (20% or 30%
above normal) in East Central Iowa.
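The two detectable differences quoted above (about 10% at 397 expected cases and 36%
at 30) are reproduced by the approximation 1.96 divided by the square root of the
expected count. That this is the exact formula behind the cited figures is an assumption;
the sketch only shows that it matches them, and the function name is hypothetical.

```python
import math


def smallest_detectable_difference(expected_cases, z=1.96):
    """Approximate smallest detectable relative-risk difference.

    Assumes a Poisson count, for which the standard error of the ratio of
    observed to expected cases is about 1/sqrt(expected), and a two-sided
    95% level (z = 1.96).
    """
    return z / math.sqrt(expected_cases)


filter_397 = smallest_detectable_difference(397)
filter_30 = smallest_detectable_difference(30)
```

Under this approximation, halving the detectable difference requires four times as many
expected cases, which is the trade-off between statistical stability and spatial detail
described above.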
The S.S.S method detected one cluster in North West Iowa (Figure 3.2). The risk
elevation at the cluster is 1.3. The cluster encompasses a total of 47 ZCTAs. A majority
rule (>50% area) was used to express the cluster in ZCTA geography. The observed
number of cases in the cluster is 604 and the expected number of cases in the
cluster is 451. No other significant clusters were detected by S.S.S. Since the cluster
candidates for the S.S.S method were derived from the adaptively filtered surface shown
in Figure 3.1, there is some concordance on the location of areas of excess risk in the two
maps.
The results of Kulldorff's SaTScan analyses can be seen in Figures 3.3 to 3.5. Two
significant clusters were detected by these analyses: a primary cluster and a secondary
cluster. Figure 3.3 shows the primary cluster detected by this method assuming an
elliptical geometry. Figure 3.4 shows the same results with a circular geometry. With a
circular geometry, the observed number of cases in this cluster is 814 while 606 were
expected. The relative risk is 1.36. The cluster that is detected when an elliptical
geometry is assumed has a more compact shape and a higher relative risk. The number
of expected cases is 446, the number of observed cases is 631, and the relative risk is
1.44, that is, 44% higher than normal. Note that the numbers of observed and expected
cases are similar to those obtained from S.S.S.
The SaTScan analyses also detected a secondary cluster (with both elliptic and
circular analyses) in Eastern Iowa. However, unlike the primary cluster, this cluster has a
lower relative risk of 1.1 and includes 3471 cases. The expected number of cases is 3220.
This cluster thus includes almost a quarter of all the expected cases in Iowa. It is likely
that this cluster is an agglomeration of areas with relatively small elevations of risk. The
large size of the cluster gives the test enough power for it to be statistically significant.
The secondary cluster found by the elliptic analyses is displayed in Figure 3.5.
The results of the analyses using Rogerson's local Score statistic can be seen in
Figure 3.6. Rogerson's method does not report one value of risk for a cluster because the
rates are calculated individually for each ZCTA. The ZCTAs that have a significant value
of the Score statistic are mapped.
Rogerson's Score statistic detects a number of isolated clusters that are not
detected by either S.S.S or SaTScan. There is thus the possibility that these clusters
are spurious. Such spurious clusters could arise from problems with the statistical
testing procedure. Since at least 942 (the number of ZCTAs) separate tests are carried out,
about nine ZCTAs can be expected to be classified as clusters at the 0.01 level
when they are not so in reality. However, the number of ZCTAs that appear in the
isolated clusters is far greater than nine (a conservative estimate is ninety). These
spurious clusters are therefore not artifacts of multiple testing. Instead they arise from the
small number problem. Figure 3.7 illustrates this. The figure shows that the clusters
detected by Rogerson's method have a smaller number of expected cases than the
average. For example, 77% of the ZCTAs found to be clusters by Rogerson's method
have fewer than 10 expected cases, compared to 70% of all ZCTAs in Iowa. Note that the
problem of unstable rates does not arise with Kulldorff's SaTScan and S.S.S. These
methods use adaptive filters, and therefore the statistics they calculate are based on a
reliable number of expected cases. Since in both these cases only one cluster was found by
each method, an equivalent of Figure 3.7 cannot be made for them.
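The multiple-testing argument can be checked with a short calculation. The sketch below
treats the 942 tests as independent, which is an assumption (neighbouring ZCTAs share
data through the spatial weights), and the function names are hypothetical.

```python
import math


def expected_false_positives(n_tests, alpha):
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least(k, n_tests, alpha):
    """P(at least k false positives) for independent tests (binomial tail)."""
    log_a, log_b = math.log(alpha), math.log(1.0 - alpha)
    tail = 0.0
    for i in range(k, n_tests + 1):
        # Log of the binomial probability of exactly i false positives.
        log_pmf = (
            math.lgamma(n_tests + 1)
            - math.lgamma(i + 1)
            - math.lgamma(n_tests - i + 1)
            + i * log_a
            + (n_tests - i) * log_b
        )
        tail += math.exp(log_pmf)
    return tail


about_nine = expected_false_positives(942, 0.01)
ninety_by_chance = prob_at_least(90, 942, 0.01)
```

Under the independence assumption, about nine false positives are expected and the
chance of ninety or more is negligible, which is consistent with the conclusion that
multiple testing alone does not account for the isolated clusters.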
Nevertheless, Rogerson's method does demarcate the same area in Northwest
Iowa as a cluster of prostate cancer incidence as SaTScan and S.S.S do.
We are now in a position to address the question: what areas of Iowa are found to be at
an excess risk of prostate cancer incidence by all three methods? This question can be
addressed with a simple GIS intersect operation, which shows the areas that are common
to the clusters found by S.S.S, Rogerson's method and the Spatial Scan Statistic.
Figure 3.8 displays these areas. Figure 3.9 displays the same map as Figure 3.8 along
with the county boundaries.
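When every cluster is expressed as a set of ZCTA identifiers, the intersect operation
reduces to a set intersection. The sketch below is an illustration of that idea and not
the GIS operation itself; the function name and the identifiers are hypothetical
placeholders rather than real ZCTA codes.

```python
def common_areas(*cluster_sets):
    """Return the areas flagged by every method (requires at least one set)."""
    return set.intersection(*(set(s) for s in cluster_sets))


# Hypothetical ZCTA identifiers flagged by each method.
sss_cluster = {"Z001", "Z002", "Z003", "Z004"}
scan_cluster = {"Z002", "Z003", "Z004", "Z005"}
score_cluster = {"Z003", "Z004", "Z090", "Z091"}

agreed = common_areas(sss_cluster, scan_cluster, score_cluster)
```

Isolated ZCTAs flagged by only one method, such as the hypothetical Z090 and Z091,
drop out of the result, so the intersection keeps only the area on which all methods
agree.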
Interpretation of clusters of any disease poses a challenge [45, 140]. It is especially
challenging to interpret clusters of diseases that have undergone radical changes in the
means and methods of diagnosis and detection [89]. Prostate cancer is one such disease.
In 1986 PSA (Prostate Specific Antigen) testing was introduced in the United States. This
test is affordable and easy (though it has a high false positive rate), and was adopted
throughout the United States. The adoption was not uniform either temporally or
spatially. Thus variations in patterns of prostate cancer risk can reflect underlying
variations in rates of screening rather than variations from a spatially varying etiologic
factor [89, 119, 137, 139]. While there are no definitive methods of disentangling the
effects of screening from other factors such as access to healthcare and any number of
intervening factors [119, 159], there are certain epidemiological indicators of the effects
of high screening uptake. Areas with relatively high screening uptake show high
incidence rates with a migration towards early stages [119, 160]. A number of studies
have shown spikes in prostate cancer incidence after the adoption of PSA testing with
relatively small changes in mortality [137, 160-163]. Figure 3.10 displays the change in
incidence and mortality rates in the counties that have more than 60% of their areas
within the cluster. The mortality and incidence rates for these counties were aggregated to
calculate the rates. The data are from a GIS-based database maintained by the University
of Kentucky and Kentucky Cancer Registry [145]. The same dataset was queried to
extract directly standardized rates for two counties within the cluster, and also for the
state of Iowa. These rates can be seen in Figures 3.11 and 3.12. All these graphs are
consistent in the observation that there were sharp increases in incidence during the late
nineties (1995-97) in counties within the cluster.
Also, these areas may show a lower than usual mortality rate. Maps from the
ICCCC [146], which display rates of late stage cancer and mortality for a similar period,
can be used to interpret this cluster. These maps show that the areas marked as a cluster
in this research have a much lower than expected occurrence of late stage cases. They also
show that these areas have very low mortality rates. It is therefore likely that there are
cohorts with high incidence rates in this area. The possible reasons for this are the
in-migration of aged individuals who receive their first diagnosis in the area, an etiologic
agent that has suddenly become prevalent, or an unusually heavy uptake of prostate
cancer screening in this area relative to other areas.
3.4 Discussion
Three cluster detection methods were used to study the spatial patterns of prostate
cancer incidence in Iowa. A cluster of excess risk was found in North Western Iowa. This
finding is consistent across all the methods. On further investigation it was found that
the rates of prostate cancer incidence increased rapidly in the late nineties within the
cluster, compared to the state of Iowa. There is also evidence suggesting that there is an
excess of early stage prostate cancers in the area. It is therefore possible that the
observed prostate cancer cluster in Iowa is an artifact of increased prostate cancer
screening (perhaps PSA screening), which is picking up latent cases of prostate cancer.
This phenomenon peaked in the late nineties in this area (compared to the early nineties in
the rest of the US [161]). However, this peak of increased early stage detection has not
subsided to levels comparable to the rest of the state, and therefore a cluster of prostate
cancer incidence is found to exist in the area. Mortality rates for the cluster region show
small decreases over the same period of time in which incidence
rates increase. Changing diagnostic regimes often contribute to increased incidence and
consequently prevalence rates. This phenomenon has been observed in diseases as
diverse as breast cancer [164] and autism [165]. It is feasible for such a diagnostic regime
change to play out in a geographically localized area, as in the Tyrol prostate cancer
study [160] or in rural Iowa. These regions show similar temporal patterns of incidence
and mortality change. Nevertheless, further research is required to pinpoint the exact
causes of excess cancer risk within this cluster. One possible approach is to
map the temporal trend of prostate cancer stages in this region. A changed
diagnostic regime causes an increase in early stage cancers [160]. Other alternatives are to
look into various etiological agents, or to study cohorts of people residing in this area
[103, 143].
This cluster is highly unlikely to have occurred by chance, since three different
methods have identified the presence of this cluster. Independent confirmation from
multiple methods decreases the likelihood that the observed cluster is a false positive.
While the likelihood of this cluster being a false positive is low, it is possible that other
clusters, with lower elevations in risk or extremely small clusters with high elevations in
risk are not detected by the methods used in this research. The use of data aggregated at
the ZCTA level could be a contributing factor to this problem.
3.5 Conclusion
Patterns of prostate cancer risk were investigated over the state of Iowa. An
excess of prostate cancer risk was found in an area in the North West of the state. Further
investigations led to indications of the possibility that this cluster could be caused by an
increased uptake of screening compared to the rest of the state in the period 1999-2004.
However, further investigations would be required to verify this claim. The section that
follows next summarizes the work in terms of its contribution to the geographic literature.
3.6 Contribution that this dissertation makes to the geography literature
Traditionally geographers have contributed greatly to the disease mapping and
disease cluster detection literature [9, 27]. The contributions range from choropleth maps
[46, 166] to density estimation techniques [27, 42, 46, 166], to name a few. The disease
cluster detection literature has developed rapidly in the last decade with the availability
of cheap computational power. This has made it possible to search
for clusters of various spatial forms. While this is a positive development, it has also
brought into focus the problem of "noise" or "spurious clusters". The problem of
spurious clusters is inherently geographic in nature, and for any given cluster detection
problem there are different degrees of recoverability for clusters of different shapes, sizes
and risk elevations. The disease cluster literature has started addressing questions about
the recoverability of disease clusters and the power to detect them [33, 34, 69].
Recoverability, or the extent to which a given cluster can be recovered, is related to issues
of shape and scale [39, 64, 74, 87, 88, 90, 91, 95, 127-129]. In this dissertation I proposed
a new method of cluster detection (S.S.S or Shape Size Sensitive method) that utilizes the
shape, size and rate of clusters to predict the ability of the method to detect clusters.
Shape, size and rate are used to distinguish spurious or "noisy" clusters from true clusters.
This method differs from other methods in first addressing the question of identifying the
characteristics of spurious clusters. The clusters that differ most from these spurious
clusters are most likely to be true and thus recovered. Shape, size and rate are used in a
three-dimensional computational space to distinguish spurious clusters from true clusters.
It is shown empirically that the inclusion of these axes improves the ability to distinguish
true clusters from spurious ones. The method is compared with an existing method of
cluster detection (Rogerson's Score statistic). Results show that the S.S.S method is
successfully able to predict the extent to which it is able to detect any given cluster. For
those clusters that it is able to detect, it is at least as powerful as Rogerson's method.
Unlike the existing geography literature, this research also makes use of a realistic set of
synthetic disease clusters to test the robustness of the proposed method.
The S.S.S method is also used to study patterns of prostate cancer incidence in
Iowa along with Rogerson's method and Kulldorff's Spatial Scan Statistic. The three
methods show remarkable convergence on the location of a cluster (and the risk elevation
at it) in North Western Iowa. The causative factors of this cluster need to be investigated
further. This dissertation thus makes an important contribution to the small but significant
health geography literature on the spatial patterns of prostate cancer [119, 125, 137].
Rogerson's method is affected by the problem of unstable rates or small numbers. The
Spatial Scan Statistic method and the method used to extract cluster candidates in S.S.S
are robust to small numbers. This research reinforces the assertion that some geographers
have made [94]: instead of accepting geographic boundaries as they are, space
should be manipulated according to the needs of the specific research problem at hand
[64, 67]. This research also utilizes the approach of echelons in disease mapping.
These analyses show that geography is important. They show that shape and size
are important aspects of disease cluster detection. Yet these aspects have long been
ignored by the disease mapping community. This research demonstrates that taking the
geographic aspects of disease clusters into account can greatly increase the effectiveness
of such analyses.
Figure 3.1: Spatial patterns of prostate cancer incidence (1999-2004) in Iowa.

Figure 3.2: Cluster of prostate cancer incidence in Iowa, detected by the S.S.S method.

Figure 3.3: Cluster detected by SaTScan when the geometry of the cluster is assumed to
be ellipsoidal.

Figure 3.4: Cluster detected by SaTScan when the geometry of the cluster is assumed to
be circular.

Figure 3.5: Large secondary cluster with low elevation in risk detected by Kulldorff's
SaTScan when the geometry of the cluster is assumed to be elliptical.

Figure 3.6: ZCTAs in Iowa with a significant value of Rogerson's Score statistic.

Figure 3.7: Expected number of cases in ZCTAs: Entire Iowa versus areas with a
significant value of Rogerson's Score statistic.

Figure 3.8: ZCTAs in the North West Iowa cluster of high prostate cancer incidence.

Figure 3.9: County boundaries with ZCTAs in the North West Iowa cluster of high
prostate cancer incidence.

Figure 3.10: Change in mortality and incidence rates from 1990-2004 in five counties
(Dickinson, Clay, Buena-Vista, Emmet and Clay Counties) in the cluster. The
expected counts for the particular year (1990, 1991 ... 2000) are calculated
using 2000 census population for the local area, and incidence/mortality
information for the state of Iowa (same procedure as indirect standardization).

Figure 3.11: Variations in the directly standardized incidence and mortality rate in Iowa,
and incidence of prostate cancer in Dickinson County for the years 1990-2004.

Figure 3.12: Variations in the directly standardized incidence and mortality rate in Iowa,
and incidence of prostate cancer in Clay County for the years 1990-2004.

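The expected counts mentioned in the caption of Figure 3.10 follow the logic of indirect standardization: statewide rates are applied to the local population. The sketch below illustrates that arithmetic only. It assumes the rates are stratified by age group, which the caption does not state, and the function name, the age groups and all numbers are hypothetical rather than taken from the Iowa data.

```python
def expected_count(state_cases, state_pop, local_pop):
    """Expected cases in a local area under statewide rates.

    Each argument maps a stratum (here a hypothetical age group) to a count.
    The statewide rate of each stratum is applied to the local population
    of that stratum, and the results are summed.
    """
    return sum(
        state_cases[age] * local_pop[age] / state_pop[age]
        for age in local_pop
    )


# Illustrative numbers only, not taken from the Iowa data.
state_cases = {"50-64": 200, "65+": 600}
state_pop = {"50-64": 100_000, "65+": 60_000}
local_pop = {"50-64": 1_000, "65+": 500}

expected = expected_count(state_cases, state_pop, local_pop)
observed = 9  # hypothetical observed count for the local area
ratio = observed / expected  # standardized incidence (or mortality) ratio
```

A ratio above 1 would indicate more cases in the local area than the statewide rates predict for a population of that size and composition.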
REFERENCES
1.
Openshaw S: Towards a more computationally minded scientific human

geography. Environment and Planning A 1998, 30:317-332.
2.
Rogerson PA: A set of associated statistical tests for spatial clustering.

Environmental and Ecological Statistics 2005, 12:275-288.
3.
Kulldorff M: Spatial Scan Statistic. Communications in Statistics: Theory and

Methods 1997, 26:1481-1496.
4.
Waller L, Gotway CA: Applied spatial statistics for public health data. New
Jersey: John Wiley and Sons; 2004.
5.
Blot WJ, Morris LE, Stroube R, Tagnon I, Fraumeni JF: Lung cancer and
laryngeal cancers in relation to shipyard employment in coastal Virginia.
Journal of the National Cancer Institute 1980, 65(3):571-575.
6.
Mapping disease: The evolution of spatial epidemiology

[http://www.spatialepidemiology.com/special/history/disease.html]
7.
Knox EG: Detection of clusters. In Methodology of enquiries into disease

clustering. Edited by Elliott P. London: Small Area Health Statistics Unit;
1989:17-20.
8.
Arthur D, Vassilvitskii S: k-means++ the advantages of careful seeding. In

Symposium on Discrete Algorithms (SODA). Edited by None. New Orleans,
Louisiana; 2007.
9.
Rushton G, Lolonis P: Exploratory spatial data analysis of birth defects in an

urban population. Statistics in Medicine 1996, 15:717-726.
10.
Rushton G, Peleg I, Bannerjee A, Smith G, West M: Analyzing geographic

patterns of disease incidence: rates of late-stage colorectal cancer in Iowa.
Journal of Medical Systems 2004, 28(3):223-235.
11.
Hellemans A: Sins of transmission? In IEEE Spectrum online. Volume 1866.

2005.
12.
Michelozzi P, Capon A, Kirchmayer U, Forastiere F, Biggeri A, Barca A, Perucci
CA: Adult and childhood leukemia near a high-power radio station in
Rome, Italy. American Journal of Epidemiology 2002, 155(12):1096-1103.
13.
Bearman P, King M: Early thoughts on the autism epidemic. In Institute for

Quantitative Social Science. Cambridge, MA; 2008.
14.
Small Area Health Statistics Unit [http://www.sahsu.org]
15.
Rushton G: Public health, GIS and spatial analytic tools. Annual Reviews of
Public Health 2003, 24:43-56.
16.
Besag J, Newell J: The detection of clusters in rare diseases. Journal of the
Royal Statistical Society: A 1991, 154:143-155.
17.
Whittemore AS, Friend N, Brown BW, Holly EA: A test to detect clusters of
disease. Biometrika 1987, 74(3).
18.
Openshaw S, Charlton M, Craft A: Searching for leukaemia clusters Using a

Geographical Analysis Machine. Journal of the Regional Science Association
1988, 64:95-106.
19.
Ozonoff A, Jeffery C, Manjourides J, Forsberg White L, Pagano M: Effect of
spatial resolution on cluster detection: a simulation study. International
Journal of Health Geographics 2007, 6(52):1-7.
20.
Burrough PA, McDonnell RA: Principles of Geographical Information Systems.
Oxford: Oxford University Press; 1998.
21.
Goovaerts P: Geostatistical analysis of disease data: estimation of cancer

mortality risk from empirical frequencies using Poisson Kriging.
International Journal of Health Geographics 2005, 14:4-31.
22.
Kafadar K: Choosing among two-dimensional smoothers in practice.

Computational Statistics and Data Analysis 1994, 18:419-439.
23.
Leyland AH, Davies CA: Empirical Bayes methods for disease mapping.
Statistical Methods in Medical Research 2005, 14:17-34.
24.
Richardson S, Best NG: Bayesian hierarchical models in ecological analysis of

health-environment effects. Environmetrics 2003, 14:129-147.
25.
Silverman BW: Choosing the window width when estimating a density.

Biometrika 1978, 65:1-11.
26.
Silverman BW: Density estimation for statistics and data analysis. Boca-Raton,
Florida.: Chapman and Hall/CRC; 1986.
27.
Openshaw S: A Mark 1 Geographical Analysis Machine for the automated

analysis of point data sets. International Journal of Geographical Information
Systems 1987, 1(4):335-358.
28.
Castro M, Singer BH: Controlling the False Discovery Rate: A new

application to account for multiple and dependent tests in local statistics of
spatial association. Geographical Analysis 2006, 38:180-208.
29.
Kulldorff M, Huang L, Pickle L, Duczmal L: An elliptic scan statistic. Statistics

in Medicine 2006, 25(22):3929-3943.
30.
Haggett P, Cliff AD, Frey A: Location Analysis in Human Geography. Second

edition. New York: John Wiley and Sons; 1977.
31.
Duczmal L, Assuncao R: A simulated annealing strategy for the detection of
arbitrarily shaped spatial clusters. Computational Statistics and Data
Analysis 2004, 45:269-286.
32.
Kulldorff M, Tango T, Park PJ: Power comparisons for disease clustering tests.
Computational Statistics and Data Analysis 2003, 42:665-684.
33.
Jacquez GM: The Methodological and Logistical Problems of Disease
Clustering. In DIMACS/DyDAn Workshop: Investigation of disease
clusters: Transitioning to the 21st century and beyond. Edited by Lawson A,
Wartenberg D. Piscataway, New Jersey: DIMACS; 2008.
34.
Jacquez GM: Are disease cluster investigations biased towards the false
positives? The shape of things to come. In DIMACS/DyDAn Workshop:
Investigation of disease clusters: Transitioning to the 21st century and beyond.
Edited by Lawson A, Wartenberg D. Piscataway, New Jersey: DIMACS; 2008.
35.
Openshaw S: Towards a computationally minded scientific human

geography. Environment and Planning A 1998, 30:317-332.
36.
Yiannakoulias N, Rosychuk RJ, Hodgson J: Adaptations for finding irregularly
shaped disease clusters. International Journal of Health Geographics 2007(6):128.
37.
Boscoe FP, Ward MH, Reynolds P: Current practices in spatial analysis of

cancer data: data characteristics and data sources for geographic studies of
cancer. International Journal of Health Geographics 2004, 3(28).
38.
Haining R: Exploratory spatial data analysis: conceptual models. In Spatial
Data Analysis: Theory and Practice. Edited by Haining R. Cambridge: Cambridge
University Press; 2003:181-187.
39.
Rushton G, Cai Q: Geocoding cancer: the relevance of scale. In Association of

American Geographers Annual Meeting: 2005; Denver, CO; 2005.
40.
Dematte C, Molinari N, Daurs J: Arbitrarily shaped multiple spatial cluster
detection for case event data. Computational Statistics & Data Analysis 2007,
51:3931-3945.
41.
Dematte C, Molinari N, Daurs J: SPATCLUS: An R package for
arbitrarily shaped multiple spatial cluster detection for case event data.
Computer Methods and Programs in Biomedicine 2006, 84(1):42-49.
42.
Rushton G, Armstrong MP, Gittler JD, Greene BR, Pavlik CE, West MM,
Zimmerman DL: Geocoding in cancer research: A review. American Journal of
Preventive Medicine 2006, 30(2S):S16-S24.
43.
Duczmal L, Cancado ALF, Takahashi RHC: Delineation of irregularly shaped

disease clusters through multiobjective optimization. Journal of
Computational and Graphical Statistics 2008, 17(1):243-262.
44.
McLaughlin CC: Effects of randomization methods on statistical inference in

disease cluster detection. Health and Place 2007, 13:152-163.
45.
Elliott P, Wartenberg D: Spatial epidemiology: current approaches and future

challenges. Environmental Health Perspectives 2004, 112:998-1006.
46.
Brewer C: Basic mapping principles for visualizing cancer data using

Geographic Information Systems (GIS). American Journal of Preventive
Medicine 2006, 30(2S):S25-S36.
47.
Choynowski M: Map based on probabilities. Journal of the American Statistical

Association 1959, 54:384-388.
48.
Cliff AD, Haggett P: Atlas of disease distributions. Oxford: Blackwell; 1988.
49.
Sabel CE, Boyle PJ, Löytönen M, Gatrell AC, Jokelainen M, Flowerdew R,
Maasilta P: Spatial clustering of amyotrophic lateral sclerosis in Finland at
place of birth and place of death. American Journal of Epidemiology 2003,
157:898-905.
50.
Rogerson PA: Statistical methods for the detection of spatial clustering in
case-control data. Statistics in Medicine 2006, 25:811-823.
51.
Anselin L: Local Indicators of spatial association. Geographical Analysis 1995,
27(2):93-116.
52.
Geary RC: The contiguity ratio and statistical mapping. The Incorporated
Statistician 1954, 5:115-145.
53.
Getis A, Ord JK: The analysis of spatial association by use of distance

statistics. Geographical Analysis 1992, 24(3):189-206.
54.
Moran PAP: Notes on continuous stochastic phenomena. Biometrika 1950,
37:17-23.
55.
Tobler W: A computer movie simulating urban growth in the Detroit region.

Economic Geography 1970, 46:234-240.
56.
Bannerjee A: Temporal changes in the spatial patterns of disease rates
incorporating known risk factors. Unpublished PhD Dissertation. University of
Iowa, Department of Geography; 2004.
57.
Cai Q: Controlling the spatial support in disease mapping. In Association of

American Geographers Annual Meeting. Edited by None. San Francisco, CA;
2007.
58.
Tiwari C: Using spatially adaptive filters to map late-stage colorectal cancer

Incidence in Iowa. In Developments in Spatial Data Handling. Edited by Fisher
P. Berlin: Springer Verlag; 2005.
59.
Bath PA, Craigs CC, Maheswaran R, Raymond J, Willett P: Use of graph theory
to identify patterns of deprivation and high morbidity and mortality in
public health data sets. Journal of the American Medical Informatics
Association 2005, 12(6):630-641.
60.
Talbot TO, Kulldorff M, Forand SP, Haley VB: Evaluation of spatial filters to
create smoothed maps of health data. Statistics in Medicine 2000, 19:2399-2408.
61.
Alvanides S, Openshaw S, Macgill J: Zone design as a spatial analysis tool. In

Modeling scale in geographical information science. Edited by Tate NJ, Atkinson
PM. New Jersey: John Wiley and Sons; 2001:141-157.
62.
Cliff AD, Haggett P: On the efficiency of alternative aggregations in
region-building problems. Environment and Planning 1970, 2:285-294.
63.
Cockings S, Martin D: Zone design for environment and health studies using
pre-aggregated data. Social Science and Medicine 2005, 60:2729-2742.
64.
Openshaw S: A geographical solution to scale and aggregation problems in

region-building, partitioning and spatial modelling. Transactions of the
Institute of British Geographers 1977, 2(4).
65.
Moellering H, Tobler WR: Geographical variance. Geographical Analysis 1972, 4:35-50.
66.
Rushton G: Can Spatial Optimization Methods be used to Identify Disease
Clusters? In The Kohn Colloquium. Department of Geography, University of Iowa;
2003.
67.
Martin D: Developing the automated zoning procedure to reconcile

incompatible zoning systems. International Journal of Geographical
Information Science 2003, 17(2):181-196.
68.
Flowerdew R, Manley DJ, Sabel CE: Neighbourhood effects on health: Does it

matter where you draw the boundaries? Social Science & Medicine 2008,
66:1241-1255.
69.
Duczmal L, Cancado ALF, Takahashi RHC: What is the true shape of a disease
cluster? The multi-objective genetic scan. Advances in Disease Surveillance
2007, 50.
70.
Patil GP, Taille C: Upper level set scan statistics for detecting arbitrarily
shaped hotspots. Journal of Computational and Graphical Statistics 2006,
15:428-442.
71.
Tango T, Takahashi K: A flexibly shaped scan statistic for detecting clusters.

International Journal of Health Geographics 2005, 4(11):1-15.
72.
Steenberghen T, Thomas I, Wets D: Innovative spatial analysis techniques for

traffic safety. Volume 141. Brussels: Belgian Science Policy Institute; 2005.
73.
Duczmal L, Kulldorff M, Huang L: Evaluation of spatial scan statistics for

irregularly shaped clusters. Journal of the American Statistical Association
2006, 15(2):428-442.
74.
Puett RC, Lawson AB, Clark AB, Aldrich TE, Porter DE, Feigley CE, Herbert
JR: Scale and shape issues in focused cluster power for count data.
75.
Smith G: Disease cluster detection methods: The impact of choice of shape on
the power of statistical tests. Iowa City: University of Iowa; 2004.
76.
McCullagh MJ, Davis JC: Optical analysis of two-dimensional patterns. Annals

of the Association of American Geographers 1972, 62(4):561-577.
77.
Lathi BP: Modern digital and analog communication Systems. New-York: Oxford
University Press; 1998.
78.
Diez-Roux AV: Multilevel analysis in public health research. Annual Review of

Public Health 2000, 21:171-192.
79.
Fotheringham SA, Brunsdon C, Charlton M: Geographically weighted

regression:The analysis of spatially varying relationships. Chichester: Wiley;
2002:2.
80.
Kawachi I, Berkman L: Multilevel methods for public health research. In

Neighbourhoods and Health. Oxford,U.K: Oxford University Press; 2003:64-111.
81.
Burns J, Hatt C, Brooks C, Keefauver E, Wells EV, Shuchman R, Wilson ML:
Visualization and simulation of disease outbreaks: spatially-explicit
applications using disease surveillance data. In ESRI Users conference. Edited
by None. San Diego, CA: ESRI; 2006.
82.
McLafferty S: GIS and health care. Annual Review of Public Health 2003,
24:25-42.
83.
McLafferty S, Grady S: Prenatal care need and access:A GIS analysis. Journal
of Medical Systems 2004, 28(5):321-333.
84.
Wang F, McLafferty S, Escamilla V, Luo L: Late-stage breast cancer diagnosis
and health care access in Illinois. The Professional Geographer 2008, 60(1):54-69.
85.
King LJ: The analysis of spatial form and Its relation to geographic theory.
Annals of the Association of American Geographers 1969, 59(3):573-595.
86.
Griffith DA: Effective geographic sample size in the presence of spatial

autocorrelation. Annals of the Association of American Geographers 2005, 95
(4):740-760.
87.
Harvey DW: Pattern, process, and the scale problem in geographical

research. Transaction of the Institute of British Geographers 1968, 45(68):71-78.
88.
Montello DR: Scale in Geography. In International Encyclopedia of the Social

and Behavioral Sciences. Edited by Smelser NJ, Baltes PB. Oxford: Pergamon
Press; 2001.
89.
Yiannakoulias N, Svenson LW, Schopflocher DP: Diagnostic uncertainty and
medical geography: what are we mapping? The Canadian Geographer / Le
Geographe canadien 2005, 49(3):291-300.
90.
Goodchild MF: Models of scale and scales of modelling. In Modelling scale in

geographical information science. Edited by Tate NJ, Atkinson PM. Chichester:
John Wiley and Sons; 2001.
91.
Lam N, Quattrochi DA: Scale, resolution, and fractal analysis. Professional
Geographer 1992, 44(1).
92.
Chaix B, Leyland AH, Sabel CE, Chauvin P, Rastam L, Kristersson H, Merlo J:
Spatial clustering of mental disorders and associated characteristics of the
neighbourhood context in Malmö, Sweden, in 2001. Journal of Epidemiology
and Community Health 2001, 60:427-435.
93.
Haining R: Spatial data analysis for the social and environmental sciences.
Cambridge: Cambridge University Press; 1990.
94.
Openshaw S, Taylor PJ: A million or so correlation coefficients: Three
experiments in the modifiable areal unit problem. In Statistical Applications in
the Spatial Sciences. Edited by Wrigley N. London: Pion; 1979:127-144.
95.
Fotheringham AS, Brunsdon C, Charlton M: Scale issues in geographically
weighted regression. In Modelling scale in geographical information science.
Edited by Tate NJ, Atkinson PM. Chichester: John Wiley and Sons; 2001.
96.
Hagerstrand T: A monte carlo approach to diffusion. In Spatial Analysis. Edited

by Berry B, Marble D. Englewood Cliffs: Prentice Hall; 1968:368-384.
97.
Diggle P: Binary mosaics and the spatial pattern of heather. Biometrika 1981,
373(3):531-539.
98.
Schinazi RB: The probability of a cancer cluster due to chance alone.

Statistics in Medicine 2000, 19(16):2195-2198.
99.
McKinlay JB, Marceau LD: A tale of 3 tails. American Journal of Public Health
1996, 89:295-298.
100.
Tufte E: Visual explanations: Images and quantities, evidence and narrative.
Cheshire: Graphics Press; 1997.
101.
Stevenson L: Putting disease on the map: the early use of spot maps in the
study of yellow fever. Journal of the History of Medicine 1965, 20:227-261.
102.
Verkasalo PK, Kokki E, Pukkala E, Vartiainen T, Kiviranta H, Penttinen A,

Pekkanen J: Cancer risk near a polluted river in Finland. Environmental
Health Perspectives 2004, 112(9):1026-1031.
103.
Alavanja MCR, Samanic C, Dosemeci M, Lubin J, Tarone R, Lynch CF, Knott C,

Thomas K, Hoppin JA, Barker J et al: Use of agricultural pesticides and
prostate cancer risk in the agricultural health study cohort. American Journal
of Epidemiology 2003, 157:800-814.
104.
Brody JG, Vorhees DJ, Melly SJ, Swedis SR, Drivas PJ, Rudel RA: Using GIS
and historical records to reconstruct residential exposure to large-scale
pesticide application. Journal of Exposure Analysis and Environmental
Epidemiology 2002, 12:64-80.
105.
Mazumdar S, Rushton G, Smith BJ, Zimmerman DL, Donham KJ: Geocoding
accuracy and the recovery of relationships between environmental exposures
and health. International Journal of Health Geographics 2008, 7(13):1-15.
106.
CEH: Love canal follow-up health study. New York: New York State
Department of Health; 2006.
107.
Kawachi I, Berkman L: Neighbourhoods and Health. New-York: Oxford

University Press; 2003.
108.
Krieger N, Chen JT, Waterman PD, Soobader MJ, Subramanian SV, Carson R:
Choosing area based socioeconomic measures to monitor social inequalities
in low birthweight and childhood lead poisoning: The public health
disparities Geocoding Project (U.S.). Journal of Epidemiology and Community
Health 2003, 57:186-199.
109.
Pickle LW: Within state geographic patterns of health insurance coverage

and health risk factors in the United States. American Journal of Preventive
Medicine 2002, 22(2):75-83.
110.
Schroen AT, Brenin DR, Kelly MD, Knaus WA, Slingluff CL: Impact of patient
distance to radiation therapy on mastectomy use in early-stage breast cancer
patients. Journal of Clinical Oncology 2005, 23(28):7074-7080.
111.
Myers WL, Kurihara K, Patil GP, Vraney R: Finding upper-level sets in cellular
surface data using echelons and SaTScan. Environmental and Ecological
Statistics 2006, 13:379-390.
112.
Lawson AB, Browne WJ, Vidal Rodeiro CL: Disease mapping with WinBUGS and
MLwiN. Chichester: John Wiley & Sons Ltd; 2003.
113.
Richardson S, Thomson A, Best N, Elliot P: Interpreting posterior relative risk

estimates in disease mapping studies. Environmental Health Perspectives 2004,
112(9):1016-1024.
114.
Stevens J: Applied multivariate statistics for the social sciences. Boca Raton:
Taylor and Francis; 2002.
115.
Sidak Z: Rectangular confidence regions for the means of multivariate

normal distributions. Journal of the American Statistical Association 1967,
62(318):626-633.
116.
James F: Monte Carlo theory and practice. Reports on the progress in physics
1980, 43:1145-1189.
117.
NCHS: Deaths, percent of total deaths, and death rates for the 15 leading
causes of death: United States and each state, 1999-2004. In DataWareHouse.
Volume LCWK9:. 2004 edition.: Department of Health and Human Services;
2004:1-26.
118.
Banerjee A: Temporal changes in the spatial pattern of disease rates

incorporating known risk factors. Social Science & Medicine 2007, 65(1):7-19.
119.
Klassen AC, Kulldorff M, Curriero F: Geographic clustering of prostate cancer
grade and stage at diagnosis, before and after adjustment of risk factors.
International Journal of Health Geographics 2005, 4(1).
120.
Waller LA, Hill EG, Rudd RA: The geography of power: statistical
performance of tests of clusters and clustering in heterogeneous populations.
Statistics in Medicine 2006, 25(5):853-865.
121.
Lee G, Yamada I, Rogerson PA: GeoSurveillance. 1.1 edition. Buffalo: NCGIA

(National Center for Geographic Information and Analysis). Department of
Geography. State University of New York, Buffalo.; 2006.
122.
Lawson M, Lawson AB: Cluster detection diagnostics for small area health
data: With reference to evaluation of local likelihood models. Statistics in
Medicine 2006, 25:771-786.
123.
Goovaerts P, Gebreab S: How does Poisson kriging compare to the popular

BYM model for mapping disease risks? International Journal of Health
Geographics 2008, 7(6):1-50.
124.
ArcGIS. 9.1 edition. Redlands: ESRI; 2007.
125.
Rushton G, Armstrong MP, Gittler J, Greene BR, Pavlik CE, West MM,
Zimmerman DL: Geocoding health data: The use of geographic codes in cancer
prevention and control, research and practice Boca Raton, Florida: CRC Press.
Taylor and Francis; 2008.
126.
Rushton G: Spatial pattern of grocery purchases by the Iowa rural population.
Iowa City: University of Iowa; 1966.
127.
Stone K: Scale, scale, scale? Economic Geography 1968, 44(2).
128.
Tate NJ, Atkinson PM: Modelling scale in Geographical Information Science.

In Modelling scale in Geographical Information Science. Edited by Tate NJ,
Atkinson PM. Chichester: John Wiley and Sons; 2002.
129.
Watson MK: The scale problem in human geography. Geographiska Annaler,

Series B 1978, 60(1):36-47.
130.
Ozendrol E, Williams BL, Kang SY, Magsumbol MS: Comparison of spatial
scan statistic and spatial filtering in estimating low birth weight clusters.
131.
Boscoe FP, McLaughlin C, Schymura MJ, Kielb CL: Visualization of the
spatial scan statistic using nested circles. Health and Place 2003, 9:273-277.
132.
Incidence and mortality rate trends. Bethesda, Maryland: Office of Science

Planning and Assessment (OSPA): National Cancer Institute; 2007.
133.
Surveillance Epidemiology and End Results: Cancer stats fact sheet: cancer
of the Prostate [http://seer.cancer.gov/statfacts/html/prost.html]
134.
Cancer research UK [www.cancerresearchuk.org]
135.
Hsing AW, Chokkalingam AP: Prostate cancer epidemiology. Frontiers in

Bioscience 2006, 11:1388-1413.
136.
Seidman CS: An introduction to prostate cancer and Geographic Information

Systems. American Journal of Preventive Medicine 2006, 30(2S):S1-S2.
137.
Rogerson PA, Sinha G, Han D: Recent changes in the spatial pattern of
prostate cancer in the U.S. American Journal of Preventive Medicine 2006,
30(2S).
138.
Gregorio DI, Kulldorff MT, Sheehan J, Samociuk H: Geographic distribution of
prostate cancer incidence in the era of PSA testing, Connecticut, 1984 to
1998. Urology 2004, 63(1).
139.
Gregorio DI, Samociuk H, DeChello L, Swede H: Effects of study area size on
geographic characterizations of health events: Prostate cancer incidence in
Southern New England, USA, 1994-1998. International Journal of Health
Geographics 2006, 5(8).
140.
Oliver MN, Matthews KA, Siadaty M, Hauck FR, Pickle LW: Geographic bias
related to geocoding in epidemiologic studies. International Journal of Health
141.
Jemal A, Kulldorff M, Devesa SS, Hayes RB, Fraumeni JF: A geographic

analysis of prostate cancer mortality in the United States. International
Journal of Cancer 2002, 101:168-174.
142.
Rusiecki JA, Kulldorff M, Nuckols JR, Song C, Ward MH: Geographically
based Investigation of prostate cancer mortality in four U.S. northern plain
states. American Journal of Preventive Medicine 2006, 30(2S):S101-108.
143.
Agricultural Health Study [http://aghealth.nci.nih.gov/]
144.
West MM, Lynch CF, McKeen KM, Olson DB: 2007 Cancer in Iowa report:
Iowa's progress toward cancer mortality goals for the year 2010. Iowa City:
State Health Registry of Iowa, University of Iowa; 2007.
145.
Cancer incidence and mortality data [http://cancer-rates.info/]
146.
Iowa Consortium for Comprehensive Cancer Control Cancer Maps Site

[http://www.uiowa.edu/iowacancermaps/]
147.
Olsen SF, Martuzzi M, Elliott P: Cluster analysis and disease mapping- why,
when and how? A step by step guide. British Medical Journal 1996:313-866.
148.
State Health Registry of Iowa, Iowa Cancer Registry [http://www.publichealth.uiowa.edu/shri/]
149.
American Factfinder [http://www.census.gov]
150.
Mazumdar S: Cluster detection an application to demographic
characterization for cancer screening potential evaluation. MSc Dissertation.
University of Leicester, Department of Geography; 2003.
151.
Goovaerts P, Gebreab S: How does Poisson kriging compare to the popular

BYM model for mapping disease risks? International Journal of Health
152.
Wheeler DC: A comparison of spatial clustering and cluster detection
techniques for childhood leukemia incidence in Ohio, 1996-2003.
International Journal of Health Geographics 2007, 6(13).
153.
Paulu C, Aschengrau A, Ozonoff D: Exploring associations between residential
location and breast cancer incidence in a case-control study. Environmental
Health Perspectives 2002, 110:471-478.
154.
Ozonoff A, Webster T, Vieira V, Weinberg J, Ozonoff D, Aschengrau A: Cluster

detection methods applied to the Upper Cape Cod cancer data. Environmental
Health Perspectives 2005, 4(19).
155.
Kulldorff M, Feuer EJ, Miller BA, Freedman LS: Breast cancer in northeastern
United States: A geographical analysis. American Journal of Epidemiology
1997, 146:161-170.
156.
Jacquez GM, Greiling DA: Local clustering in breast, lung and colorectal
cancer in Long Island, New York. International Journal of Health Geographics
2003, 2(3):1-12.
157.
Hjalmars U: Childhood leukemia in Sweden: Using GIS and a spatial scan

statistic for cluster detection. Statistics in Medicine 1996, 15:707-715.
158.
Cai Q: Mapping disease risk using spatial filtering methods. PhD dissertation.
University of Iowa, Geography; 2007.
159.
Shaw PA, Etzioni R, Zeliadt SB, Mariotto A, Karnofski K, Penson DF, Weiss NS,
Feuer EJ: An ecologic study of prostate-specific antigen screening and
prostate cancer mortality in nine geographic areas of the United States.
American Journal of Epidemiology 2004, 160(11):1059-1069.
160.
Bartsch G, Horninger W, Klocker H, Reissigl A, Oberaigner W, Schonitzer D:

Prostate cancer mortality after introduction of prostate-specific antigen mass
screening in the Federal State of Tyrol, Austria. Urology 2001, 58:417-424.
161.
Coldman AJ, Phillips N, Pickles TA: Trends in prostate cancer incidence and
mortality: an analysis of mortality change by screening intensity. Canadian
Medical Association Journal 2003, 168(1).
162.
Crocetti E, Ciatto S, Zappa M: Prostate Cancer: Different Incidence But Not

Mortality Trends Within Two Areas of Tuscany, Italy. Journal of the National
Cancer Institute 2001, 93(11).
163.
McDavid K, MAb JL, Fulton JP, Tonita J, Thompsona TD: Prostate cancer
incidence and mortality rates and trends in the United States and Canada.
Public health reports 2004, 119.
164.
Woodward WA, Strom EA, Tucker SL: Changes in the American Joint
Committee on Cancer staging for breast cancer dramatically affect
stage-specific survival. Journal of Clinical Oncology 2003, 21:3244-3248.
165.
Shattuck PT: The contribution of diagnostic substitution to the growing
administrative prevalence of autism in US special education. Pediatrics 2006,
117(4):1028-1037.
166.
Categorization of Hypotheses Generated from Epidemiological Maps

[www.biomedware.com/conferences/Health_disparities_conference.html]

