Article

Flexible-Elliptical Spatial Scan Method

1
Department of Mathematics, Clarkson University, Potsdam, NY 13699, USA
2
Department of Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
3
National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
*
Author to whom correspondence should be addressed.
Mathematics 2023, 11(17), 3627; https://doi.org/10.3390/math11173627
Submission received: 24 July 2023 / Revised: 17 August 2023 / Accepted: 18 August 2023 / Published: 22 August 2023
(This article belongs to the Special Issue Statistical Analysis: Theory, Methods and Applications)

Abstract

The detection of disease clusters in spatial data analysis plays a crucial role in public health. While the circular scan method is widely utilized for this purpose, accurately identifying non-circular (irregular) clusters remains challenging and reduces detection accuracy. To overcome this limitation, various extensions have been proposed to effectively detect arbitrarily shaped clusters. In this paper, we combine the strengths of two well-known methods, the flexible and elliptic scan methods, which are specifically designed for detecting irregularly shaped clusters. We leverage the unique characteristics of these methods to create candidate zones capable of accurately detecting irregularly shaped clusters, along with a modified likelihood ratio test statistic. By inheriting the advantages of the flexible and elliptic methods, our proposed approach represents a practical addition to the existing repertoire of spatial data analysis techniques.

1. Introduction

In public health, surveillance procedures that identify disease clusters play an important role in controlling and preventing disease outbreaks. Numerous methods can be used for detecting clustering and clusters. For detecting spatial autocorrelation, methods such as Moran’s I [1] and Geary’s c [2] are commonly used. These methods quantify a global property over the entire study area and indicate whether response values are more similar than they would be under the null hypothesis that no spatial autocorrelation is present. Therefore, Moran’s I and Geary’s c are global indices of spatial autocorrelation and can be used in situations such as regression analysis when we want to check whether uncorrelated error assumptions are satisfied or as evidence of clustering across the entire study area. In order to detect local spatial clusters, other methods were proposed, e.g., the cluster evaluation permutation procedure [3], the Besag–Newell method [4], and the circular spatial scan method [5,6] and its related extensions.
The circular spatial scan method [5,6] has gained remarkable popularity for finding local clusters compared to the aforementioned methods due to its computational efficiency and its power to detect disease clusters. This method is characterized by (i) the set of candidate zones to be scanned and (ii) the likelihood ratio test (LRT) statistic for each candidate zone. The capability of the spatial scan method in detecting disease clusters inspired other researchers to propose extensions to improve its accuracy, specifically for detecting non-circular (irregularly shaped) clusters. The circular scan method and its extensions, generally, scan the entire study area and identify the candidate zones that obtain the largest value of an LRT statistic.
There are many different approaches for constructing the set of candidate zones and for computing the LRT statistic. Tango and Takahashi [7,8] proposed the flexible scan method, in which non-circular clusters can be detected more accurately by forming a set of candidate zones from a set of connected regions satisfying certain constraints. In the flexible scan method, each connected candidate zone is enclosed within a circle comprised of a pre-specified set of nearest neighbors. Candidate zones coming from the connected regions within the circle may not be large enough (or flexible enough) to include highly irregular and long candidate zones. Additionally, the computational cost of this method becomes increasingly great as the size of the circle is expanded, which may preclude more arbitrarily shaped candidate zones from being considered [7]. The flexible scan method has recently been used to detect high- and low-risk clusters of COVID-19 incidence in Florida [9], high-risk clusters of La Crosse virus disease in the Appalachian region of the United States [10], and high-risk clusters of thyroid cancer incidence in Fukushima, Japan [11].
Kulldorff et al. [12] proposed the elliptic scan method, which includes elliptical candidate zones along with circular ones. Elliptical candidate zones allow the method to detect non-circular clusters with different shapes and different angles when ellipses rotate around their centers. The elliptic method indeed uses a variety of elliptical shapes and angles to identify irregularly shaped clusters; however, its final results are conditional on the selected shapes. As such, the set of elliptical zones may not have enough versatility to cover non-elliptical clusters. The elliptic scan method has recently been used to identify high-risk clusters of paratuberculosis in sheep and goats in southern Spain [13], high- and low-risk clusters of breast and cervical cancer-related mortality in Brazil, and clusters of high nontuberculous mycobacteria infection risk for persons with cystic fibrosis in United States counties [14].
Another extension of the circular scan method is the minimum spanning tree method proposed by Assunção et al. [15], which attempts to construct candidate zones based on the regions that result in the largest LRT statistic. The minimum spanning tree algorithm may detect abnormal clusters that have a star-like shape because a new region can be added to a current candidate zone regardless of whether the LRT increases or decreases in relation to the current candidate zone. This tendency to detect star-shaped clusters is called the “octopus effect”. Costa et al. [16] extended the minimum spanning tree algorithm by imposing early stopping criteria on the method. Specifically, a new region can only be added to the current candidate zone if it increases the current LRT statistic value. Moreover, in order to avoid the octopus effect, Costa et al. [16] proposed additional stopping criteria, specifically, selecting only the regions that share at least two connections with the current candidate zone. A problem with these methods (and also the elliptic and flexible scan methods) is that adding a low-risk region to an existing zone can increase the LRT of the new zone. Philosophically, it seems unwise to include a low-risk region in a cluster, e.g., a region with low standardized mortality ratio (SMR), where SMR is the ratio of observed to expected cases in a region.
In this study, we propose the flexible–elliptical scan method, which combines the flexible and elliptic scan methods to address their respective limitations and leverage their advantages. Our approach involves modifying the set of candidate zones and the likelihood ratio test statistics. We compare the performance of the proposed flexible–elliptical method with the established elliptic and rflex scan methods for identifying irregularly shaped disease clusters. This evaluation includes benchmark data sets comprising 56 diverse irregularly shaped cluster models, as well as real-world data sets such as the northeastern United States and NTM data. Our findings demonstrate a balanced integration between the flexible and elliptic scan methods in accurately detecting irregularly shaped clusters in disease surveillance. The flexible–elliptical method exhibits better flexibility, inheriting the capabilities of the rflex and elliptic methods, particularly in constructing the set of candidate zones. The proposed method offers a streamlined and straightforward approach, eliminating the need for tuning parameters and providing a more adaptable solution to capture irregular cluster shapes.
The structure of this paper is as follows. In Section 2, we describe the methodology of the circular scan method, the elliptic scan method, and the restricted flexible scan method and then propose a new flexible–elliptical scan method. In Section 3, we benchmark the performance of these scan methods and outline the results using simulated data sets based on the breast cancer mortality of the northeastern United States made available by Kulldorff [6]. In Section 4, we apply these methods to identifying clusters of the northeastern United States data set [17,18]. Additionally, in Section 5, we apply these methods to identifying and comparing clusters of nontuberculous mycobacterial (NTM) cases in Colorado. In Section 6, we draw specific conclusions about the proposed methodology from our study. Finally, in Section 7, we more broadly discuss the strengths and weaknesses of the proposed methodology and comment on future work.

2. Methods

Consider a geographical map (study area) that is partitioned into $N$ regions (e.g., zip codes). Each region is represented by its centroid $i$, $i = 1, \ldots, N$, which is a geographical location inside the region. For each region, we know (i) the population size, $n_i$, and (ii) the number of cases, $Y_i$. Let $Z$ denote a candidate zone that is formed from the union of one or more (typically connected) regions. Let $\mathcal{Z}$ be the set of candidate zones. Each $Z \in \mathcal{Z}$ is a potential cluster for which we believe the risk of developing disease inside $Z$ is higher than the risk of developing disease outside $Z$. Let $p$ denote the risk of developing disease inside $Z$. Let $q$ denote the risk of developing disease outside $Z$. Therefore, under the null hypothesis of no clustering, $p = q$ for all $Z \in \mathcal{Z}$ (the complete list of notation can be found at the end of the paper before Appendix A). The alternative hypothesis states that there is at least one cluster in the study area, i.e., there is at least one $Z \in \mathcal{Z}$ such that $p > q$. More formally,
$$H_0: p = q \ \text{for all}\ Z \in \mathcal{Z} \quad \text{versus} \quad H_1: p > q \ \text{for some}\ Z \in \mathcal{Z}.$$
In general, the scan methodologies described in this paper are characterized by (i) the set of candidate zones to be scanned, $\mathcal{Z}$, and (ii) the LRT statistic, $\lambda$. We will use different subscripts after $\mathcal{Z}$ to indicate the specific method used to construct the set of candidate zones, such as $\mathcal{Z}_c$, $\mathcal{Z}_e$, and $\mathcal{Z}_f$. Additionally, the LRT statistics used for different scan methods are indicated by superscripts after $\lambda$, such as $\lambda^c$, $\lambda^e$, and $\lambda^f$.
We now define a number of statistics that are common to the methods we discuss. Let $y_+ = \sum_{i=1}^{N} Y_i$ denote the total number of cases and $n_+ = \sum_{i=1}^{N} n_i$ denote the total population over the entire study area. For a candidate zone $Z$, let $y_{in} = \sum_{i \in Z} Y_i$ denote the observed number of cases inside $Z$ and $n_{in} = \sum_{i \in Z} n_i$ denote the population size inside $Z$. The expected number of cases inside $Z$ is denoted by $E_{in}$. Assuming that the risk is constant across all regions, the expected number of cases inside $Z$ is $E_{in} = n_{in} y_+ / n_+$. Alternatively, we can use other approaches such as generalized linear models to estimate the expected number of cases in each region [19]. Additionally, we let $y_{out} = y_+ - y_{in}$ denote the observed number of cases outside $Z$, $n_{out} = n_+ - n_{in}$ denote the population size outside $Z$, and $E_{out} = y_+ - E_{in}$ denote the expected number of cases outside $Z$.
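As a minimal sketch of these definitions, the following Python function computes the observed and expected counts inside and outside a zone under the constant-risk assumption. The function name, the list-based data layout, and the toy numbers are illustrative assumptions, not part of the original study.

```python
def zone_statistics(cases, population, zone):
    """Return (y_in, E_in, y_out, E_out) for a zone given as region indices.

    Assumes constant risk, so that E_in = n_in * y_plus / n_plus.
    """
    y_plus = sum(cases)            # total number of cases
    n_plus = sum(population)       # total population
    y_in = sum(cases[i] for i in zone)
    n_in = sum(population[i] for i in zone)
    e_in = n_in * y_plus / n_plus  # expected cases inside the zone
    return y_in, e_in, y_plus - y_in, y_plus - e_in


# Hypothetical toy data: four regions, zone made of regions 0 and 1.
cases = [2, 8, 5, 5]
population = [100, 200, 300, 400]
y_in, e_in, y_out, e_out = zone_statistics(cases, population, zone=[0, 1])
```

With these toy numbers, the zone holds 10 observed cases against 6 expected, so its observed-to-expected ratio exceeds that of the remaining regions.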
We discuss the circular, elliptic, flexible, restricted flexible, and the proposed flexible–elliptical scan methods below. Additional discussion of the former methods can be found in French et al. [20].

2.1. The Circular Scan Method

The circular scan method [5,6] overlays a circular window on each centroid $i$ in the study area. To create a sequence of candidate zones, we successively add the nearest regions to the starting region until some percentage of the total population is reached. This percentage of the total population can be set by the user (the default value is 50%) or can be estimated using the Gini [21] or elbow method [22]. We then repeat this process for all centroids in the study area to construct $\mathcal{Z}_c$.
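The construction of the circular candidate zones can be sketched as follows. This is an illustration under stated assumptions: centroids are planar coordinates, distances are Euclidean, ties are broken by region index, and the name `circular_zones` is hypothetical. Production software may handle ties and distances differently.

```python
import math


def circular_zones(coords, population, max_pop_frac=0.5):
    """Collect candidate zones by growing a circle around every centroid.

    Regions are added in order of distance from the starting centroid
    until the next region would push the zone past the population cap.
    """
    n_plus = sum(population)
    zones = set()
    for xi, yi in coords:
        # Sort regions by distance to the current centre (ties by index).
        order = sorted(
            range(len(coords)),
            key=lambda j: (math.hypot(coords[j][0] - xi, coords[j][1] - yi), j),
        )
        zone, pop = [], 0
        for j in order:
            if pop + population[j] > max_pop_frac * n_plus:
                break
            pop += population[j]
            zone.append(j)
            zones.add(frozenset(zone))  # duplicates across centres collapse
    return zones


# Hypothetical toy map: four equally populated regions on a line.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
population = [10, 10, 10, 10]
zones = circular_zones(coords, population)
```

Storing zones as `frozenset` objects removes duplicates that arise when windows centered on different regions cover the same set of regions.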
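Kulldorff [6] modeled the case counts, $Y_i$, using either (i) a Binomial or (ii) a Poisson distribution in order to derive the LRT statistic $\lambda^c$. The case counts are modeled as
$$Y_i \overset{\text{indep.}}{\sim} \text{Poisson}(n_i p) \ \text{if}\ i \in Z, \quad \text{and} \quad Y_i \overset{\text{indep.}}{\sim} \text{Poisson}(n_i q) \ \text{if}\ i \notin Z,$$
or
$$Y_i \overset{\text{indep.}}{\sim} \text{Binomial}(n_i, p) \ \text{if}\ i \in Z, \quad \text{and} \quad Y_i \overset{\text{indep.}}{\sim} \text{Binomial}(n_i, q) \ \text{if}\ i \notin Z.$$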
Assuming a Poisson distribution for the case counts $Y_i$, the likelihood function of a fixed candidate zone $Z$ in terms of disease risk parameters $p$ and $q$ is
$$L_P(Z, p, q) = \prod_{i \in Z} \frac{e^{-n_i p} (n_i p)^{Y_i}}{Y_i!} \prod_{i \notin Z} \frac{e^{-n_i q} (n_i q)^{Y_i}}{Y_i!},$$
and Kulldorff [6] derived the LRT statistic for the Poisson case counts as
$$\lambda_Z^c = \frac{\sup_{p > q} L_P(Z, p, q)}{\sup_{p = q} L_P(Z, p, q)} = \frac{\left(\frac{y_{in}}{n_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{n_{out}}\right)^{y_{out}}}{\left(\frac{y_+}{n_+}\right)^{y_+}} \, I\!\left(\frac{y_{in}}{n_{in}} > \frac{y_{out}}{n_{out}}\right) = \left(\frac{y_{in}}{E_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{E_{out}}\right)^{y_{out}} I\!\left(\frac{y_{in}}{E_{in}} > \frac{y_{out}}{E_{out}}\right), \tag{4}$$
where $I(\cdot)$ is an indicator function.
The LRT statistic in Equation (4) has subscript $Z$ to indicate that the LRT statistic is computed for a specific zone $Z \in \mathcal{Z}_c$. The circular scan method proceeds by computing the LRT statistic in Equation (4) for each candidate zone $Z \in \mathcal{Z}_c$. The candidate zone that attains the maximum LRT statistic is known as the most likely cluster (MLC). Therefore, the LRT statistic value for the MLC is computed as
$$\lambda^c = \sup_{Z \in \mathcal{Z}_c} \lambda_Z^c.$$
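Assuming a Binomial distribution for the case counts $Y_i$, the likelihood function of a fixed candidate zone $Z$ in terms of disease risk parameters $p$ and $q$ is
$$L_B(Z, p, q) = \prod_{i \in Z} \binom{n_i}{Y_i} p^{Y_i} (1 - p)^{n_i - Y_i} \prod_{i \notin Z} \binom{n_i}{Y_i} q^{Y_i} (1 - q)^{n_i - Y_i},$$
and Kulldorff [6] derived the LRT statistic for the Binomial case counts as
$$\lambda_Z^c = \frac{\sup_{p > q} L_B(Z, p, q)}{\sup_{p = q} L_B(Z, p, q)} = \frac{\left(\frac{y_{in}}{n_{in}}\right)^{y_{in}} \left(\frac{n_{in} - y_{in}}{n_{in}}\right)^{n_{in} - y_{in}} \left(\frac{y_{out}}{n_{out}}\right)^{y_{out}} \left(\frac{n_{out} - y_{out}}{n_{out}}\right)^{n_{out} - y_{out}}}{\left(\frac{y_+}{n_+}\right)^{y_+} \left(\frac{n_+ - y_+}{n_+}\right)^{n_+ - y_+}} \, I\!\left(\frac{y_{in}}{n_{in}} > \frac{y_{out}}{n_{out}}\right). \tag{6}$$
The LRT statistic value for the MLC is computed as
$$\lambda^c = \sup_{Z \in \mathcal{Z}_c} \lambda_Z^c.$$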
The derivation of the LRT statistic for Poisson and Binomial case counts can be found in Appendix A.
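The Poisson statistic in Equation (4) and the search for the MLC can be sketched in Python as below. The sketch works on the log scale for numerical stability and returns negative infinity when the indicator in Equation (4) is zero, so such a zone can never be the maximizer. The function names, the tuple representation of zones, and the toy data are assumptions made for illustration; the sketch also assumes that no zone covers the entire study area.

```python
import math


def log_lrt_poisson(y_in, e_in, y_plus):
    """Log of the Poisson LRT statistic for one zone (Equation (4))."""
    y_out, e_out = y_plus - y_in, y_plus - e_in
    if y_in / e_in <= y_out / e_out:
        return float("-inf")  # indicator is zero: the statistic itself is 0
    # By convention 0 * log(0) = 0 when no cases fall outside the zone.
    term_out = y_out * math.log(y_out / e_out) if y_out > 0 else 0.0
    return y_in * math.log(y_in / e_in) + term_out


def most_likely_cluster(cases, population, zones):
    """Return the zone with the largest log-LRT and that value."""
    y_plus, n_plus = sum(cases), sum(population)

    def score(zone):
        y_in = sum(cases[i] for i in zone)
        e_in = sum(population[i] for i in zone) * y_plus / n_plus
        return log_lrt_poisson(y_in, e_in, y_plus)

    best = max(zones, key=score)
    return best, score(best)


# Hypothetical toy data and candidate zones.
cases = [2, 8, 5, 5]
population = [100, 200, 300, 400]
zones = [(0,), (1,), (0, 1), (2, 3)]
best_zone, best_stat = most_likely_cluster(cases, population, zones)
```

In this toy example, region 1 alone is the MLC: adding region 0, whose observed count equals its expected count, lowers the statistic.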
The “second MLC” is the candidate zone that attains the second highest value of $\lambda^c$ while not overlapping the MLC. Similarly, the “third MLC” and “fourth MLC” can be computed. We use the Monte Carlo method described in ref. [23] (p. 126) to assess the significance of the MLC (or the secondary MLCs). In short, data sets are simulated under the null hypothesis, the test statistic of the MLC is determined for each simulated data set, and the test statistics for the simulated data sets are used to compute a Monte Carlo p-value for the test statistic associated with each candidate zone.
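A minimal sketch of such a Monte Carlo test is given below. It assumes that null data sets are generated by allocating the observed total number of cases to regions with probabilities proportional to population, and that the p-value is the rank-based estimate $(1 + \#\{T_{sim} \geq T_{obs}\}) / (n_{sim} + 1)$. Both choices are common but are assumptions here, and `max_stat` is a hypothetical user-supplied function that returns the MLC statistic for a vector of simulated case counts.

```python
import random


def monte_carlo_p_value(observed_stat, population, total_cases, max_stat,
                        nsim=99, seed=0):
    """Estimate a Monte Carlo p-value for an observed MLC statistic."""
    rng = random.Random(seed)
    regions = list(range(len(population)))
    exceed = 0
    for _ in range(nsim):
        # Simulate under the null: each case lands in region i with
        # probability n_i / n_plus.
        sim_cases = [0] * len(population)
        for i in rng.choices(regions, weights=population, k=total_cases):
            sim_cases[i] += 1
        if max_stat(sim_cases) >= observed_stat:
            exceed += 1
    return (1 + exceed) / (nsim + 1)


# Degenerate illustration with a constant statistic (hypothetical).
p_small = monte_carlo_p_value(1.0, [100, 200, 300], 20, lambda sim: 0.0)
p_large = monte_carlo_p_value(-1.0, [100, 200, 300], 20, lambda sim: 0.0)
```

With 99 simulations, the smallest attainable p-value is 0.01, which is why the number of simulations limits the resolution of the test.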

2.2. The Elliptic Scan Method

As discussed in the previous section, the circular scan method uses circular windows to construct the set of candidate zones. Therefore, this method is ineffective for detecting non-circular clusters. In order to resolve this limitation, Kulldorff et al. [12] proposed the elliptic scan method, which modifies the set of candidate zones $\mathcal{Z}_c$.
In the elliptic scan method, the set $\mathcal{Z}_e$ consists of many overlapping ellipses; each ellipse is characterized by (i) the x-coordinate and y-coordinate of its origin $i$, (ii) its shape $s$, (iii) its angle $\phi$, and (iv) its population size. The shape $s \geq 1$ of an ellipse is defined as the ratio of the length of the major axis to the length of the minor axis. A window with $s = 1$ is a special case of an ellipse that represents a circle, and as $s$ gets larger, the ellipse becomes narrower and longer. The collection of ellipse shapes recommended by Kulldorff et al. [12] is $s = 1, 1.5, 2, 3, 4, 5, 6, 8, 10, 15, 20, 30, 60, 120$. The parameter $\phi$ is the angle between the major axis and the x axis. Figure 1 displays an ellipse and its associated parameters.
For a fixed center, shape $s$, and population size, we can define the set of angles $\phi$ such that a new ellipse overlaps at least 70% of the previous ellipse. To construct a set of candidate zones, $\mathcal{Z}_e$, for a region with a fixed center located at $(x, y)$, shape $s$, and angle $\phi$, we successively enlarge the size of the ellipse (though shape $s$ is fixed) until the stopping criterion is met, which is typically including no more than 50% of the total population in the ellipse. Each time a new centroid falls inside the ellipse, a new candidate zone is created by taking the union of all regions with a centroid inside the ellipse. We repeat this process for all different user-specified combinations of centers, shapes, and angles.
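One way to sketch the enlargement of an ellipse with fixed center, shape, and angle is to compute, for every centroid, the semi-minor axis of the smallest such ellipse that contains it, and then add regions in that order. The sketch below assumes planar coordinates and, for simplicity, stops after a fixed number of regions rather than at a population cap; the names `elliptical_distance` and `elliptic_zones` are hypothetical.

```python
import math


def elliptical_distance(point, center, s, phi):
    """Semi-minor axis of the smallest ellipse (shape s, angle phi,
    centered at `center`) whose boundary passes through `point`."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    # Rotate so that the major axis lies along the first coordinate.
    u = dx * math.cos(phi) + dy * math.sin(phi)
    v = -dx * math.sin(phi) + dy * math.cos(phi)
    return math.hypot(u / s, v)


def elliptic_zones(coords, center_index, s, phi, max_regions):
    """Nested candidate zones obtained by enlarging one ellipse."""
    center = coords[center_index]
    order = sorted(
        range(len(coords)),
        key=lambda j: (elliptical_distance(coords[j], center, s, phi), j),
    )
    # Every time a new centroid enters the ellipse, a new zone is formed.
    return [tuple(order[:m]) for m in range(1, max_regions + 1)]


# Hypothetical centroids; region 0 is the center.
coords = [(0, 0), (2, 0), (0, 1), (3, 0)]
elongated = elliptic_zones(coords, 0, s=4.0, phi=0.0, max_regions=3)
circular = elliptic_zones(coords, 0, s=1.0, phi=0.0, max_regions=3)
```

With $s = 4$ and the major axis along the x axis, the regions at (2, 0) and (3, 0) enter before the region at (0, 1), whereas the circular window ($s = 1$) picks up (0, 1) first.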
To conduct hypothesis testing, the statistics $\lambda_Z^c$ in both Equations (4) and (6) can be used as LRT statistics. However, using these unpenalized statistics may cause the detection of impractically long and narrow ellipses. Thus, Kulldorff et al. [12] suggested an eccentricity penalty function that penalizes very thin clusters. The eccentricity penalty is $\left(\frac{4s}{(s+1)^2}\right)^{\gamma}$, where $s$ is the shape of the cluster and $\gamma \geq 0$ is a tuning parameter. Therefore, the likelihood ratio test statistic for Poisson case counts in the elliptic scan method is given by
$$\lambda^e = \sup_{Z \in \mathcal{Z}_e} \left(\frac{y_{in}}{E_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{E_{out}}\right)^{y_{out}} I\!\left(\frac{y_{in}}{E_{in}} > \frac{y_{out}}{E_{out}}\right) \left(\frac{4s}{(s+1)^2}\right)^{\gamma}. \tag{8}$$
When $s = 1$ or $\gamma = 0$, there is no penalty. For a fixed $s > 1$, as $\gamma$ gets larger, a larger penalty is imposed on the model. Similarly, for a fixed $\gamma > 0$, as $s$ gets larger, a larger penalty is imposed on the model, so long and narrow clusters are less likely to be detected. When $\gamma \to \infty$, penalties for non-circular clusters are very large and only circular clusters can be detected. The same penalty function can be used for the Binomial case counts LRT given in Equation (6). In the following sections, we focus only on the Poisson case counts. However, any LRT statistic modification can be applied to the Binomial case counts as well.
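The behavior of the eccentricity penalty can be checked numerically with a short sketch. The function names are illustrative assumptions; on the log scale, the penalty simply adds $\gamma \log\{4s/(s+1)^2\}$ to the log of the unpenalized statistic.

```python
import math


def eccentricity_penalty(s, gamma):
    """Penalty (4s / (s + 1)^2)^gamma for an ellipse of shape s >= 1."""
    return (4.0 * s / (s + 1.0) ** 2) ** gamma


def penalized_log_lrt(log_lrt, s, gamma):
    """Log of the penalized statistic for one elliptical zone."""
    return log_lrt + gamma * math.log(4.0 * s / (s + 1.0) ** 2)


no_penalty_circle = eccentricity_penalty(1.0, 5.0)  # s = 1: no penalty
no_penalty_gamma = eccentricity_penalty(4.0, 0.0)   # gamma = 0: no penalty
thin_ellipse = eccentricity_penalty(4.0, 1.0)       # 16 / 25
```

For example, an ellipse with $s = 4$ and $\gamma = 1$ has its statistic multiplied by $16/25 = 0.64$, whereas a circle is never penalized.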
The elliptic scan method is relatively fast, powerful, and suited for moderately irregular clusters. However, the elliptic scan method also has many unknown parameters such as shape $s$, angle $\phi$, population size, and tuning parameter $\gamma$ that should be specified by users. For real data sets in which the true clusters are unknown, picking the right parameters is not simple, and using different parameters has a significant impact on the final results and decisions. Furthermore, because the set $\mathcal{Z}_e$ includes only ellipses, the elliptic method is unable to detect highly irregular cluster shapes, e.g., star-shaped clusters.

2.3. The Flexible Scan Method

The flexible spatial scan method proposed by Tango and Takahashi [7] is able to detect non-circular clusters by exhaustively searching all of the connected candidate zones within neighborhoods that include up to $K$ regions. Given $K$, for every region $i \in \{1, \ldots, N\}$, the set of candidate zones $\mathcal{Z}_f$ is the union of all connected subsets among the $K$ nearest neighbors of $i$ that include region $i$. The algorithm that Tango and Takahashi [7] proposed for constructing the connected regions within a circle with radius $K$ is as follows:
  1. For each region $i \in \{1, \ldots, N\}$, define the set $W_i = \{i, i_1, \ldots, i_k\}$ such that $i_k$ is the $k$th nearest region to the region $i$.
  2. Let $Z$ be a set in the power set of $W_i$ (i.e., $Z \in \mathcal{P}(W_i)$) that includes region $i$. Therefore, $Z$ is a set that has at most $k + 1$ regions including centroid $i$. For example, $Z = \{i, i_2, i_8, i_5, \ldots, i_{k'}\}$, where $k' \leq k$.
  3. Split the set $Z$ into two subsets $Z_1 = \{i\}$ and $Z_1' = Z \setminus Z_1$.
  4. Split the set $Z_1'$ into two subsets $Z_2$ and $Z_2'$ such that $Z_2$ contains all the regions of $Z_1'$ that are connected to the set $Z_1$, and $Z_2'$ contains all the regions that are not connected to $Z_1$. The process continues until either $Z_j$ or $Z_j'$ becomes a null set for some $j \in \mathbb{N}$.
  5. $Z$ in Step 2 is a connected set of regions if $Z_j'$ in Step 4 becomes a null set first; otherwise, $Z$ is disconnected.
  6. If $Z$ in Step 5 is a connected set, it will be added to $\mathcal{Z}_f$.
  7. Repeat Steps 1 through 6 for all regions $i$ and all sets $Z \in \mathcal{P}(W_i)$.
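The connectivity check and the exhaustive enumeration described in the steps above can be sketched as follows. The sketch assumes that adjacency is supplied as a dictionary mapping each region to the set of regions it borders, and the names `is_connected` and `flexible_zones` are hypothetical. It enumerates every subset explicitly, so it is only practical for small neighborhoods.

```python
from itertools import combinations


def is_connected(zone, center, adjacency):
    """Repeatedly split off the regions adjacent to the latest layer.

    `frontier` plays the role of Z_j and `remaining` the role of Z_j'.
    The zone is connected if `remaining` empties before `frontier` does.
    """
    frontier = {center}
    remaining = set(zone) - frontier
    while frontier and remaining:
        frontier = {r for r in remaining if adjacency[r] & frontier}
        remaining -= frontier
    return not remaining


def flexible_zones(center, neighbors, adjacency):
    """All connected subsets of {center} + neighbors that contain center."""
    zones = []
    for size in range(len(neighbors) + 1):
        for extra in combinations(neighbors, size):
            zone = (center, *extra)
            if is_connected(zone, center, adjacency):
                zones.append(frozenset(zone))
    return zones


# Hypothetical map: regions 0-1-2 form a chain and region 3 is isolated.
adjacency = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
zones = flexible_zones(0, [1, 2, 3], adjacency)
```

Of the eight subsets that contain region 0, only {0}, {0, 1}, and {0, 1, 2} are connected, which illustrates both the pruning effect of the connectivity constraint and the exponential number of subsets that must be inspected.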
Once the set of candidate zones $\mathcal{Z}_f$ is formed, the LRT statistic $\lambda_Z^c$ in Equation (4) (for the Poisson case counts) is calculated for each $Z \in \mathcal{Z}_f$, and the one that attains the maximum is the MLC. Compared to the circular and elliptic scan methods, this method can detect highly irregular clusters within small neighborhood sizes. Since the number of candidate zones increases exponentially as a function of $K$, this method is not computationally feasible for large $K$, such as $K \geq 30$ [7]. Additionally, in those situations where the true cluster is circular, the flexible method tends to detect clusters larger than the true cluster. In the next section, we describe the restricted flexible scan method, which attempts to address these limitations.

2.4. The Restricted Flexible Scan Method

Due to the computational inefficiency of the flexible scan method, Tango and Takahashi [8] proposed the restricted flexible (rflex) scan method to decrease the computation time needed for detecting larger clusters. In order to avoid adding low-risk regions to the set of candidate zones, for each candidate zone $Z \in \mathcal{Z}_f$, Tango and Takahashi proposed the following restricted likelihood ratio, which takes the risk of each individual region into account:
$$\lambda_Z^r = \left(\frac{y_{in}}{E_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{E_{out}}\right)^{y_{out}} I\!\left(\frac{y_{in}}{E_{in}} > \frac{y_{out}}{E_{out}}\right) \prod_{i \in Z} I(p_i < \alpha_1), \tag{9}$$
where $\alpha_1$ is a pre-specified significance level and $p_i$ is the middle p-value given by
$$p_i = P(Y_i \geq y_i + 1) + \frac{1}{2} P(Y_i = y_i),$$
where $y_i$ is the observed case count for region $i$, $Y_i \sim \text{Poisson}(n_i r)$, and $r = y_+ / n_+$ is an estimate of the constant risk. For a low-risk region $i \in Z$, the indicator function $I(p_i < \alpha_1)$ is zero, and the entire candidate zone $Z$ is then considered insignificant, meaning that it will be removed from the set of candidate zones $\mathcal{Z}_f$. Removing low-risk zones from the set of candidate zones $\mathcal{Z}_f$ makes the computational load lighter than that of the original method. Tango and Takahashi [8] provided the following guidance regarding the choice of $\alpha_1$:
  • $0.10 \leq \alpha_1 < 0.20$ for detecting small clusters,
  • $0.20 \leq \alpha_1 < 0.30$ for detecting small to medium clusters,
  • $0.30 \leq \alpha_1 < 0.40$ for detecting large clusters.
The tuning parameter $\alpha_1$ must be specified by the user and directly impacts the results and performance of the restricted method. Moreover, even though the restricted flexible method has a lighter computational load than the original flexible method, it may still be computationally demanding for large $\alpha_1$.
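The middle p-value and the resulting restriction on candidate zones can be sketched with the standard library alone. The function names and the toy data are assumptions for illustration, and the Poisson probabilities are computed directly from the probability mass function rather than through a statistics package.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale (mu > 0)."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def middle_p_value(y, mu):
    """P(Y >= y + 1) + 0.5 * P(Y = y) for Y ~ Poisson(mu)."""
    cdf = sum(poisson_pmf(k, mu) for k in range(y + 1))
    return (1.0 - cdf) + 0.5 * poisson_pmf(y, mu)


def passes_rflex_restriction(zone, cases, population, alpha1):
    """True if every region in the zone has middle p-value below alpha1."""
    r = sum(cases) / sum(population)  # estimate of the constant risk
    return all(
        middle_p_value(cases[i], population[i] * r) < alpha1 for i in zone
    )


# Hypothetical toy data.
cases = [2, 8, 5, 5]
population = [100, 200, 300, 400]
keep_single = passes_rflex_restriction([1], cases, population, alpha1=0.2)
keep_pair = passes_rflex_restriction([0, 1], cases, population, alpha1=0.2)
```

Here region 1 (8 cases against 4 expected) has a small middle p-value and is retained, while the zone that also contains region 0 (2 cases against 2 expected) is discarded because region 0 fails the restriction.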

2.5. The Flexible-Elliptical Scan Method

We now describe the flexible–elliptical scan method. The flexible–elliptical method is characterized by (i) the set of candidate zones $\mathcal{Z}_{fe}$ (the subscript “fe” stands for flexible–elliptical) and (ii) the LRT statistic $\lambda^{fe}$. Since Tango and Takahashi [7,8] create candidate zones from subsets of connected regions in concentric circles having $K$ regions, highly irregular and long clusters may be difficult to detect unless $K$ becomes large. More specifically, $K$ might need to be very large before the irregular cluster is contained in a concentric circle of $K$ nearest neighbors. Furthermore, the set of elliptic candidate zones $\mathcal{Z}_e$ is not versatile enough to cover non-elliptical clusters. To form a larger and more flexible set of candidate zones, we construct the set of candidate zones based on the set of all connected subsets within the elliptical windows. In other words, for a fixed region $i$, fixed shape $s$, and fixed angle $\phi$, we first sequentially enlarge the ellipse until a stopping criterion is met; then, inside the largest ellipse, we find all connected subsets that include region $i$.
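The construction just described can be sketched by combining an elliptical ordering of centroids with an exhaustive search for connected subsets. This is an illustration under stated assumptions, not the authors' implementation: coordinates are planar, the stopping criterion is a fixed number of regions `k`, adjacency is a dictionary of neighbor sets, and all function names are hypothetical.

```python
import math
from itertools import combinations


def elliptical_distance(point, center, s, phi):
    """Semi-minor axis of the smallest ellipse through `point`."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    u = dx * math.cos(phi) + dy * math.sin(phi)
    v = -dx * math.sin(phi) + dy * math.cos(phi)
    return math.hypot(u / s, v)


def is_connected(zone, center, adjacency):
    """Layer-by-layer search outward from the center region."""
    frontier, remaining = {center}, set(zone) - {center}
    while frontier and remaining:
        frontier = {r for r in remaining if adjacency[r] & frontier}
        remaining -= frontier
    return not remaining


def flexible_elliptical_zones(coords, adjacency, center, s, phi, k):
    """Connected subsets (containing center) of the k regions that lie
    inside the largest ellipse of shape s and angle phi."""
    order = sorted(
        range(len(coords)),
        key=lambda j: (elliptical_distance(coords[j], coords[center], s, phi), j),
    )
    others = [j for j in order[:k] if j != center]
    zones = set()
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            if is_connected((center, *extra), center, adjacency):
                zones.add(frozenset((center, *extra)))
    return zones


# Hypothetical map with four regions; region 0 is the center.
coords = [(0, 0), (2, 0), (0, 1), (3, 0)]
adjacency = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}}
long_zones = flexible_elliptical_zones(coords, adjacency, 0, 4.0, 0.0, k=3)
round_zones = flexible_elliptical_zones(coords, adjacency, 0, 1.0, 0.0, k=3)
```

The elongated window reaches the chain of regions 0, 1, and 3 with only three regions in the window, which a circular window of the same size cannot do.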
The circular and elliptic scan methods tend to detect clusters larger than the true cluster because their candidate zones absorb low-risk regions as they become larger. In order to eliminate low-risk regions from $\mathcal{Z}_{fe}$, we adjust the LRT statistic in Equation (4) so that a region is only included in a candidate zone if its standardized mortality ratio (SMR) is greater than 1. More formally, $Z$ remains in $\mathcal{Z}_{fe}$ if $Y_i / E_i > 1$ for all $i \in Z$; however, $Z$ is removed from $\mathcal{Z}_{fe}$ if $Y_i / E_i \leq 1$ for some $i \in Z$. Thus, we specify the LRT statistic for the flexible–elliptical method as
$$\lambda_Z^{fe} = \left(\frac{y_{in}}{E_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{E_{out}}\right)^{y_{out}} I\!\left(\frac{y_{in}}{E_{in}} > \frac{y_{out}}{E_{out}}\right) \prod_{i \in Z} I\!\left(\frac{Y_i}{E_i} > 1\right). \tag{11}$$
Considering Equation (11), if even one region $i \in Z$ does not have more observed cases than expected, then the product $\prod_{i \in Z} I(Y_i / E_i > 1)$ becomes zero and the entire candidate zone $Z$ is removed from $\mathcal{Z}_{fe}$. Removing low-risk candidate zones from the set $\mathcal{Z}_f$ will reduce the computation time compared to the unrestricted method. Additionally, the flexible–elliptical method may consider fewer candidate zones than the rflex method when $\alpha_1$ is relatively large (e.g., $\alpha_1 \geq 0.40$), making it faster to apply. Unlike the restricted LRT statistic $\lambda^r$ in Equation (9), which requires an additional unknown parameter $\alpha_1$ in the model, the proposed LRT statistic in Equation (11) does not require any additional tuning parameter. This is helpful because the size of the true cluster is unknown, making it difficult to choose an appropriate $\alpha_1$.
We can also use a different adjustment to the LRT statistic $\lambda_Z^{fe}$ in Equation (11) to eliminate low-risk regions. To accomplish that, let $g_i = Y_i / E_i$, $i \in \{1, \ldots, N\}$, denote the empirical region rate and $\bar{g} = \frac{1}{N} \sum_{i=1}^{N} g_i$. Instead of using the multiplier $\prod_{i \in Z} I(Y_i / E_i > 1)$ in Equation (11), we can use $\prod_{i \in Z} I(g_i > \bar{g})$. For the data sets used in the simulation study section (Section 3), we get almost identical results. Due to this, in Section 3, we only provide the results of the flexible–elliptical method when using the LRT statistic in Equation (11).
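The SMR-based restriction and its mean-based alternative can be sketched as a simple filter over candidate zones. The function names and toy data are illustrative assumptions; expected counts are computed under constant risk.

```python
def region_smr(cases, population):
    """SMR_i = Y_i / E_i with E_i = n_i * y_plus / n_plus."""
    y_plus, n_plus = sum(cases), sum(population)
    return [y * n_plus / (n * y_plus) for y, n in zip(cases, population)]


def filter_zones(zones, smr, threshold=1.0):
    """Keep only zones in which every region has SMR above the threshold."""
    return [z for z in zones if all(smr[i] > threshold for i in z)]


# Hypothetical toy data and candidate zones.
cases = [2, 8, 5, 5]
population = [100, 200, 300, 400]
zones = [(0,), (1,), (0, 1), (1, 2)]

smr = region_smr(cases, population)                  # [1.0, 2.0, 0.83.., 0.625]
kept_smr = filter_zones(zones, smr)                  # threshold SMR > 1
g_bar = sum(smr) / len(smr)                          # mean empirical rate
kept_mean = filter_zones(zones, smr, threshold=g_bar)
```

Both rules keep only the zone consisting of region 1 in this toy example. Region 0 has an SMR of exactly 1 and is therefore excluded by the strict inequality.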

3. Simulation Study

To assess the performance of the flexible–elliptical scan method, we compare its results to those of the elliptic and rflex scan methods using non-circular benchmark data sets provided by Duczmal et al. [24]. The benchmark data sets are simulated based on the female breast cancer mortality in the $N = 245$ counties (or county equivalents) in the northeastern United States during the years 1988–1992 [25]. Eleven clustering models “a” through “k” are generated such that the total number of cases across the study area is $y_+ = 600$ among $n_+ = 29{,}535{,}210$ people at risk. Figure 2 illustrates clustering models “a”–“i”. Cluster “j” is the union of “g” and “h”. Cluster “k” is the union of “g”, “h”, and “i”. For each clustering model mentioned above, 10,000 different data sets are generated. Additionally, 99,999 data sets are simulated under the null hypothesis of no clustering. These benchmark data sets are available in the neastbenchmark R package, which can be installed from https://github.com/jfrench/neastbenchmark (accessed on 7 July 2023).
To have a more extensive comparison, we generated 45 irregularly shaped clustering models based on circular benchmark data sets provided by Kulldorff et al. [25]. Three different sets of irregularly shaped clustering models are generated: iurban (irregularly shaped urban clustering models), irural (irregularly shaped rural clustering models), and imixed (irregularly shaped clustering models that mix urban and rural regions). Each clustering model contains 2–16 regions (counties). For each clustering model mentioned above, 10,000 different data sets are generated. The last three plots of Figure 2 illustrate nine of these 45 clustering models.
To evaluate how well a cluster identified by each scan method matches the true cluster, different performance measures can be used [7,16]. We compare the methods in terms of their sensitivity, positive predictive value (PPV), and misclassification as described below. Let $z$ and $\hat{z}$ denote the true cluster and the detected cluster, respectively. Let $n(X)$ be the population inside any zone $X$. Sensitivity is the proportion of the population of the true cluster that is covered by the detected cluster and is computed as
$$\text{sensitivity} = \frac{n(z \cap \hat{z})}{n(z)}.$$
PPV is the proportion of the population of the detected cluster that is covered by the true cluster and is computed as
$$\text{PPV} = \frac{n(z \cap \hat{z})}{n(\hat{z})}.$$
Misclassification is the proportion of the total population that is not correctly categorized and is computed as
$$\text{misclassification} = \frac{n\left[(z \cup \hat{z}) \cap (z \cap \hat{z})^{c}\right]}{n_+}.$$
Ideally, we want to see sensitivity and PPV equal to 1 and misclassification equal to 0.
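These three measures translate directly into set operations on region indices. The sketch below is illustrative only: the function name and toy data are assumptions, and both the true and the detected cluster are assumed to be non-empty.

```python
def performance(true_zone, detected_zone, population):
    """Population-weighted sensitivity, PPV, and misclassification."""
    z, z_hat = set(true_zone), set(detected_zone)

    def pop(regions):
        return sum(population[i] for i in regions)

    overlap = pop(z & z_hat)
    return {
        "sensitivity": overlap / pop(z),
        "ppv": overlap / pop(z_hat),
        # Regions in exactly one of the two clusters are misclassified.
        "misclassification": pop(z ^ z_hat) / sum(population),
    }


# Hypothetical example: true cluster {0, 1}, detected cluster {1, 2}.
population = [100, 200, 300, 400]
result = performance({0, 1}, {1, 2}, population)
```

In this example, the detected cluster covers two thirds of the true cluster's population (sensitivity), 40% of the detected population is truly in the cluster (PPV), and 40% of the total population is misclassified.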
We compare the performance of the flexible–elliptical, rflex (for both tuning parameters $\alpha_1 = 0.2$ and $\alpha_1 = 0.3$), and elliptic scan methods in terms of the average sensitivity, PPV, and misclassification. The smerc R package [26] was used to apply the elliptic and rflex methods to the benchmark data sets. Each scan method was applied to 1000 simulated data sets for each of the 56 clustering models. To keep the set of candidate zones more comparable for all three scan methods, the stopping criterion for the size of the scanning windows was set to $K$ nearest neighbors. That is, for each starting region $i$, a maximum of $K - 1$ nearest neighbors can be added. The rflex scan method used the middle p-values. Both the elliptic and the flexible–elliptical methods used the default shape and angle values used in the SaTScan™ software [27]. More specifically, the shapes are $s = 1, 1.5, 2, 3, 4, 5$, and the numbers of angles associated with these shapes are $1, 4, 6, 9, 12, 15$, respectively. Therefore, for each region $i$, $1 + 4 + 6 + 9 + 12 + 15 = 47$ different elliptical windows are considered; each of the 47 elliptical windows is then enlarged until $K = 20$ regions are included. The elliptic, rflex, and flexible–elliptical methods identified the clusters using the corresponding version of the LRT statistic in Equations (8), (9), and (11), respectively.
Figure 3 presents box plots of the average sensitivity, PPV, and misclassification of each method across all 56 clustering models. Table S1 in the Supplementary Materials provides complete results for the simulation study. Overall, the sensitivity of the elliptic method is higher than that of the other methods. This may be attributed to the tendency of the elliptic method to detect clusters that are larger than the true clusters: a larger detected cluster captures more true positives, which raises sensitivity, but at the cost of potentially including false positives. In contrast, the flexible–elliptical method demonstrates more consistent sensitivity and PPV across the clustering models. Unlike the rflex method, whose results vary with the chosen value of $\alpha_1$, the flexible–elliptical method requires no such choice, and it does not suffer from detecting unnecessarily large clusters like the elliptic method. While the flexible–elliptical method may not always surpass the rflex and elliptic methods individually, its performance is more stable than that of the other two methods. Its sensitivity is similar to that of the rflex method with $\alpha_1 = 0.3$. In terms of PPV, the rflex method with $\alpha_1 = 0.2$ and the flexible–elliptical method have the highest average values among the tested methods, which underscores the effectiveness of the flexible–elliptical method in identifying true clusters while limiting false positives. The elliptic method, on the other hand, exhibits a relatively lower PPV. Regarding misclassification, the results indicate similar average levels across all clustering models for each method.
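To make the trade-off between sensitivity and PPV concrete, the following Python sketch computes the three measures for a toy study area. The population-weighted definitions used here are an assumption made for illustration (the paper's own definitions appear in its earlier equations, which are not reproduced in this section), and the region ids and populations are hypothetical.

```python
def detection_metrics(true_cluster, detected, population):
    """Population-weighted sensitivity, PPV, and misclassification.

    true_cluster, detected: sets of region ids.
    population: dict mapping every region id in the study area to its population.
    Assumed definitions:
      sensitivity       = pop(detected & true) / pop(true)
      PPV               = pop(detected & true) / pop(detected)
      misclassification = pop(detected ^ true) / total population
    """
    def pop(regions):
        return sum(population[r] for r in regions)

    overlap = pop(true_cluster & detected)
    sensitivity = overlap / pop(true_cluster)
    ppv = overlap / pop(detected) if detected else 0.0
    misclassification = pop(true_cluster ^ detected) / pop(population)
    return sensitivity, ppv, misclassification


# Hypothetical study area with four regions; the true cluster is {1, 2}.
population = {1: 100, 2: 200, 3: 300, 4: 400}
partial = detection_metrics({1, 2}, {2, 3}, population)
oversized = detection_metrics({1, 2}, {1, 2, 3, 4}, population)
print(partial, oversized)
```

In this toy case, the oversized detection reaches a sensitivity of 1 but a lower PPV than the partial detection, which mirrors the behavior described above for methods that tend to report clusters larger than the true one.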

4. Application to Northeastern United States Data

We now detect clusters of breast cancer mortality cases in the northeastern United States during the years 1988–1992. This data set is the inspiration for the simulated data examined in the previous section. We compare the clusters identified by the elliptic, rflex, and flexible–elliptical scan methods. The total number of observed breast cancer mortality cases, aggregated across the years 1988–1992, is $y_+$ = 58,943. The population of each region used in this analysis is the 1990 U.S. census estimate, with the total number of persons at risk being $n_+$ = 29,535,210. More information related to the northeastern data set can be found in Kulldorff et al. [25].
Figure 4 provides choropleth maps of the case count (left panel) and SMR (right panel) for each region in the northeastern data set. The number of cases per region ranged from 2 to 2169, with a median of 86 cases. The SMR of each region is computed as $\mathrm{SMR}_i = Y_i / E_i$, where $E_i$ is the population size of region $i$ multiplied by the constant risk $y_+ / n_+$. The SMRs of the regions ranged from 0.33 to 1.81. While the case count plot in the left panel of Figure 4 does show patterns of large case counts, it is not clear whether this pattern is unusual because the plot does not account for the population size of each region. The SMR plot in the right panel of Figure 4 does not indicate any systematic patterns of high SMRs. Therefore, spatial scan methods must be applied to this data set to identify clusters.
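The SMR calculation can be sketched in a few lines of Python. The totals are those reported above for the northeastern data set; the single region used in the example (86 cases in a population of 40,000) is hypothetical and serves only to show the arithmetic.

```python
Y_PLUS = 58_943      # total breast cancer deaths, 1988-1992 (from the text)
N_PLUS = 29_535_210  # total population at risk (from the text)


def smr(cases, population, y_plus=Y_PLUS, n_plus=N_PLUS):
    """Standardized mortality ratio Y_i / E_i under constant risk."""
    constant_risk = y_plus / n_plus        # overall risk y_+ / n_+
    expected = population * constant_risk  # E_i = n_i * y_+ / n_+
    return cases / expected


# Hypothetical region: 86 cases among 40,000 residents.
example_smr = smr(cases=86, population=40_000)
print(round(example_smr, 2))
```

An SMR above 1 means that the region has more cases than expected under constant risk, but a single elevated value says nothing about statistical significance, which is why the scan methods are needed.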
The northeastern data were analyzed using the previously discussed scan methods, each of which identified different clusters. The maximum number of regions allowed in each candidate zone was set to $K = 20$. The default values of $s$ and $\phi$ given in Section 3 were used for the elliptic and flexible–elliptical methods. For the rflex method, which relies on the middle p-value, two tuning parameter values, $\alpha_1 = 0.2$ and $\alpha_1 = 0.3$, were considered. For the elliptic method, $\gamma = 0$ was used for the penalty function in Equation (8).
Figure 5 displays the clusters detected by each scan method. The rflex method identifies seven clusters using $\alpha_1 = 0.2$ and eight clusters using $\alpha_1 = 0.3$. The elliptic and flexible–elliptical scan methods each detect six clusters. A summary of the significant clusters found at level $\alpha = 0.05$ is given in Table 1.
The flexible–elliptical method exhibits several key properties that are worth highlighting. Notably, the clusters detected by this method encompass a larger number of cases, on average, than those detected by the elliptic and rflex methods. Furthermore, the clusters identified by the flexible–elliptical method tend to have the largest population at risk, indicating their significance in terms of potential public health impact. While the rflex methods tend to yield clusters with higher SMR values, the flexible–elliptical method yields slightly smaller SMR values, reflecting its ability to capture clusters with more precise risk estimates. In contrast, the elliptic method, which has a tendency to include low-risk regions, yields the lowest mean SMR values among the methods even though its population at risk is not as high as that of the flexible–elliptical method. This again shows a balancing of the advantages of both the rflex and elliptic methods. For example, consider Cluster 1 detected by the flexible–elliptical method. This cluster encompasses the largest population at risk compared to other clusters. Intriguingly, the rflex method detects two smaller clusters, namely Clusters 2 and 4, which, when combined, form a subset of Cluster 1. This example demonstrates that the flexible–elliptical method is capable of identifying more extensive and impactful clusters when compared to multiple smaller clusters detected by the rflex method.
Furthermore, Cluster 1 detected by the flexible–elliptical method was split into two separate clusters, namely Cluster 1 and Cluster 4, by the elliptic method. This could be due to the ability of the flexible–elliptical method to detect clusters of various shapes and sizes, making it more flexible and realistic in capturing different types of clusters. In contrast, the elliptic method tends to identify more compact, elliptical clusters. Almost all of the clusters detected by the elliptic method in Figure 5 have elliptical shapes, which might be unlikely in reality. It is important to note that, since the data set is real, definitive conclusions regarding the nature of the clusters cannot be made. However, the results from the proposed flexible–elliptical method demonstrate its ability to provide more diverse and versatile cluster configurations while maintaining a high number of cases, SMR values, and populations at risk. This method strikes a balance between the characteristics of the rflex and elliptic methods, offering a more comprehensive approach to cluster detection and potentially yielding more meaningful and interpretable results.

5. Application to NTM Data

To provide a more extensive comparison, we also analyze nontuberculous mycobacterial (NTM) patient data and compare the disease clusters identified by the three spatial scan approaches discussed above. NTM data were obtained from the National Jewish Health (NJH) Hospital electronic medical record database in Denver, Colorado. All patients (those with cystic fibrosis and those without) who had sought treatment at NJH, had a diagnosis of NTM infection (i.e., at least one positive culture), and resided in Colorado from February 2008 through January 2018 were included in this data set, totaling $y_+ = 822$ patients. Since NTM is considered a rare disease, we aggregated patient data over a 10-year period and tabulated patient data for each zip code tabulation area (ZCTA). We used the total population of Colorado as determined by the 2010 US Census, fixed at $n_+$ = 5,029,374 people. Given that the incubation period of NTM is not currently understood, we did not have a reliable time of disease onset variable, and therefore, we could not consider a temporal analysis to identify disease clusters. The use of this data set was approved by the NJH Institutional Review Board (HS-3148).
Figure 6 displays the significant NTM clusters detected by each scan method at significance level $\alpha = 0.05$. To compute the p-value, 999 null data sets were simulated under the constant risk hypothesis. The rflex method with $\alpha_1 = 0.2$ and $\alpha_1 = 0.3$ identified the same Cluster 1, but the rflex method with $\alpha_1 = 0.3$ includes additional regions in Cluster 2 compared to the rflex method with $\alpha_1 = 0.2$. For Cluster 1, the flexible–elliptical method included a longer, narrower set of zip codes than the rflex methods. For Cluster 2, the rflex methods differed from the flexible–elliptical method by only one zip code. The elliptic method detected the largest clusters among all of the methods tested. The elliptic method detected Cluster 1 zip codes in the same location as the other methods but covered a larger area of zip codes. For Cluster 1, all methods identified some variation of zip codes from the center to the eastern end of the city of Denver and in suburban regions south of Denver. For Cluster 2, the elliptic method identified a much larger cluster than the other methods. In Cluster 2, all methods included zip codes located in the city of Arvada. The rflex methods and the flexible–elliptical method also included zip codes located in Boulder County. The elliptic method did not include the Boulder zip codes but included the Arvada zip codes. Cluster 2, as identified by the elliptic method, extended farther west into the Rocky Mountains. All methods detected Cluster 3, which consists of one zip code in Pitkin County with only 20 residents.
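The Monte Carlo p-value mentioned above can be sketched as follows in Python. Two assumptions are made for illustration: null data sets are generated by allocating the observed total number of cases to regions with probability proportional to population (a common way to simulate constant risk), and the p-value is the usual rank-based estimate. The function names and the toy inputs are hypothetical, and the scan statistic itself is left out.

```python
import random


def simulate_null_counts(populations, total_cases, rng):
    """Spread total_cases over regions with probability proportional to population.

    Assumption: null data sets condition on the observed total number of
    cases, so every region shares the same risk y_+ / n_+.
    """
    counts = [0] * len(populations)
    for region in rng.choices(range(len(populations)), weights=populations, k=total_cases):
        counts[region] += 1
    return counts


def monte_carlo_p_value(observed_stat, null_stats):
    """(1 + number of null statistics >= observed) / (1 + number of null data sets)."""
    at_least = sum(1 for t in null_stats if t >= observed_stat)
    return (1 + at_least) / (1 + len(null_stats))


rng = random.Random(1)
null_counts = simulate_null_counts([500, 1500, 3000], total_cases=40, rng=rng)
# If the observed statistic exceeds all 999 null statistics, p is 1/1000.
smallest_p = monte_carlo_p_value(10.0, [1.0] * 999)
print(sum(null_counts), smallest_p)
```

With 999 null data sets, the smallest attainable p-value under this estimate is 0.001, so a cluster is significant at $\alpha = 0.05$ when its statistic ranks among the 50 largest of the 1000 values (observed plus simulated).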
NTM are commonly found in water, and the hypotheses surrounding NTM exposure and acquisition focus on municipal water supplies. The water supply for zip codes located in Cluster 1 comes from different regions along the Western Slope than for zip codes located in Cluster 2 (for clusters identified by the rflex and flexible–elliptical methods). Recent studies have demonstrated an association between a trace metal, molybdenum, in the raw water supply and NTM infection risk in Colorado [28,29]. Regions that supply water to zip codes in Clusters 1 and 2 have naturally occurring molybdenum in high abundance, as evidenced by the fact that large molybdenum mines are located in these regions. These regions with high molybdenum concentrations are located within Cluster 2 identified by the elliptic method.
The rflex method and the flexible–elliptical method present zip codes in Boulder County and the city of Arvada as part of Cluster 2. The Boulder zip codes receive their water supply from sources that are different from the Arvada zip codes. Since the Boulder zip codes were not identified in the elliptic method cluster, this identification may lead us to further examine those regions.
The elliptic method, because it typically exhibits greater sensitivity, tends to generate clusters that include more zip codes and cover a larger geographic area. However, its detected clusters tend to have lower PPV, as not all zip codes within the detected clusters are likely to be high risk. The rflex and flexible–elliptical methods typically have greater PPV, so they are likely more useful for identifying the highest-risk regions within the true cluster. Given that its detected clusters tend to be larger, the elliptic method may provide more opportunities for hypothesis generation in the initial stages of data exploration, while the flexible–elliptical method possibly focuses on the highest-risk regions of each cluster.

6. Conclusions

Our simulation study revealed that the elliptic method generally displayed higher sensitivity than the other scan methods. This heightened sensitivity is attributed to the elliptic method’s tendency to detect larger clusters, increasing the chances of capturing the true cluster. In contrast, the rflex method with $\alpha_1 = 0.2$ exhibited the lowest sensitivity, likely due to the elimination of moderate-rate regions by the middle p-value. The sensitivity of the flexible–elliptical method closely aligned with that of the rflex method with $\alpha_1 = 0.3$, indicating comparable performance in identifying irregularly shaped clusters. Importantly, the proposed flexible–elliptical method moderates the trade-off between cluster size and accuracy without relying on any specific tuning parameter, providing a more flexible and versatile approach to capturing the true cluster.
The simulation study also revealed that, on average, the flexible–elliptical method demonstrated a better performance based on PPV. The PPV of the rflex method with $\alpha_1 = 0.2$ was comparable to that of the flexible–elliptical method, but the rflex method with $\alpha_1 = 0.3$ resulted in lower PPV. The elliptic method had the lowest PPV values, which again could be attributed to its tendency to detect clusters larger than the true clusters. A high PPV indicates more accurate and reliable cluster identification, which has significant implications for the precision and validity of cluster detection studies, because a larger proportion of each detected cluster then coincides with the true cluster. This further highlights the flexible–elliptical method as a versatile approach that maintains high accuracy in cluster detection while avoiding both dependence on a tuning parameter and the detection of excessively large clusters.
Notably, the performance of the rflex method is somewhat inconsistent between the two $\alpha_1$ levels, 0.2 and 0.3. In practice, this implies that the effectiveness of the rflex method depends on the chosen value of $\alpha_1$ and the specific characteristics of the clustering model. For instance, for the clustering model “irural05”, the sensitivity ranges from 0.61 for $\alpha_1 = 0.20$ to 0.70 for $\alpha_1 = 0.30$, while the PPV ranges from 0.79 to 0.74. Similarly, for the clustering model “iurban08”, the sensitivity ranges from 0.59 for $\alpha_1 = 0.20$ to 0.71 for $\alpha_1 = 0.30$, while the PPV ranges from 0.81 to 0.78.
The performance in terms of misclassification was generally comparable across all methods. This similarity arose from the definition of misclassification in Equation (14), where the denominator represented the total population at risk. The breast cancer data set in Section 3 had a large total population size of $n_+$ = 29,535,210, contributing to the similarity in misclassification rates. However, it is worth noting that the proposed flexible–elliptical method demonstrated improved performance in certain clustering models, such as models “j”, “k”, and “iurban13”.

7. Discussion

In this study, we proposed the flexible–elliptical scan method, which combined the flexible and elliptic scan methods to address their respective limitations and leverage their advantages. Our approach involved modifying the set of candidate zones and the likelihood ratio test statistics. We thoroughly compared the performance of the proposed flexible–elliptical method with the elliptic and rflex methods for identifying irregularly shaped disease clusters. This evaluation included benchmark data sets comprising 56 diverse irregularly shaped cluster models as well as real-world data sets related to breast cancer mortality and NTM cases. Our findings demonstrated a balanced performance between the flexible and elliptic scan methods in accurately detecting irregularly shaped clusters in disease surveillance.
The flexible–elliptical method exhibited flexibility, inheriting the capabilities of the rflex and elliptic methods, particularly in constructing the set of candidate zones. The elliptic method often struggled to identify clusters with highly irregular shapes, limiting its effectiveness in capturing complex disease patterns. Similarly, the rflex method faced challenges in detecting very long and narrow clusters due to its reliance on circular windows and the user-defined tuning parameter $\alpha_1$. By incorporating the strengths of these two methods, the flexible–elliptical method demonstrated a more adaptable approach to candidate zone construction, enabling it to capture highly non-circular clusters, as shown in Figure 7. This heightened flexibility allowed for the detection of a broader range of cluster shapes, rendering the flexible–elliptical method a valuable tool in identifying irregular disease clusters and leveraging the advantages of both the elliptic and rflex methods.
While the rflex method’s performance can vary depending on the chosen tuning parameter values, the proposed flexible–elliptical method eliminates the need for such parameter adjustments. The flexible–elliptical method does not depend on tuning parameters, which helps ensure consistent and reliable cluster detection outcomes. While the rflex method with tuning parameters $\alpha_1 = 0.2$ and $\alpha_1 = 0.3$ exhibited relatively good sensitivity and PPV, a closer examination reveals that $\alpha_1 = 0.2$ yielded a superior PPV, whereas $\alpha_1 = 0.3$ achieved better sensitivity (Figure 3). Moreover, the number of significant clusters can be influenced by the choice of tuning parameter (e.g., Figure 5). On the other hand, the elliptic method imposed an eccentricity penalty on the likelihood ratio test statistic, which required another tuning parameter. By adjusting the tuning parameter, the elliptic method avoided detecting very narrow and long clusters. In the proposed flexible–elliptical method, no eccentricity penalties have been used. First, we considered not only elliptical windows but also connected regions inside them. Second, we filtered out windows having low-risk regions. Therefore, even if a very long and narrow cluster is obtained, an additional penalty is not required because we include only high-risk regions in each cluster. An example of such a cluster can be found in the bottom-right panel of Figure 7, which is a very long cluster, as it should be.
The flexible–elliptical method avoids including low-risk regions, which could potentially be an advantage, but it may split a large cluster as a result. For example, consider two large significant clusters that are connected by a single region, and that region is a low-risk region. In this situation, the flexible–elliptical method presumably detects one of them. It is possible that the other cluster is detected as a secondary cluster, but it is not guaranteed. It is important to note that there were some situations where the elliptic method detected clusters containing disconnected regions. For example, in clustering models such as Cluster “c” in Figure 2, the nearest neighbors are not necessarily connected, and elliptical windows may include disconnected regions. Another example is Cluster 1 detected by the elliptic method in Figure 6. Unlike the elliptic method, the flexible–elliptical method disconnects regions systematically. This can be a limitation of the proposed flexible–elliptical method, and the method could be extended to take other criteria into account rather than removing a region only because it is a low-risk region. Similar to the algorithms proposed by Costa et al. [16], we may avoid eliminating those low-risk regions by imposing specific geographic proximity criteria. For example, consider a current window that involves only high-risk regions. We can let a low-risk region be added to this current window if the region has two connections (borders) and increases the current likelihood ratio test statistic value. Furthermore, although the proposed method is relatively simple, it is possible to impose additional restrictions on the regions to further enhance speed and accuracy in cluster detection.
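The suggested extension can be sketched as a simple acceptance rule. The Python function below is one possible reading of that idea, not an implemented part of the proposed method: it interprets "two connections (borders)" as bordering at least two regions already in the window, and the function name, the `neighbors` adjacency structure, and the toy statistic are all hypothetical.

```python
def can_add_low_risk_region(region, window, neighbors, statistic, min_borders=2):
    """Decide whether a low-risk region may join the current window.

    region: id of the low-risk candidate region.
    window: set of region ids currently in the window (all high risk).
    neighbors: dict mapping each region id to the set of regions it borders.
    statistic: callable returning the test statistic of a set of regions.
    """
    shared_borders = len(neighbors[region] & window)
    if shared_borders < min_borders:
        return False  # not enough borders with the current window
    # Accept only if the statistic of the enlarged window is strictly larger.
    return statistic(window | {region}) > statistic(window)


# Toy adjacency: region "c" borders both "a" and "b"; region "d" borders only "c".
neighbors = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
window = {"a", "b"}
toy_statistic = len  # stand-in for a likelihood ratio statistic
print(can_add_low_risk_region("c", window, neighbors, toy_statistic))  # True
print(can_add_low_risk_region("d", window, neighbors, toy_statistic))  # False
```

A rule of this kind would let a single low-risk "bridge" region join two high-risk areas, at the price of one extra statistic evaluation per candidate region.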
In summary, the proposed method combines two well-known methods for detecting irregularly shaped clusters, taking advantage of their individual strengths and achieving a balanced approach. The flexible–elliptical method inherits the favorable features of both the elliptic and rflex methods. It demonstrates a better positive predictive value (PPV) than the elliptic method and a PPV comparable to that of the rflex method with $\alpha_1 = 0.2$. Notably, the flexible–elliptical method does not rely on the tuning parameter $\alpha_1$, offering a more streamlined and straightforward approach. The construction of the set of candidate zones in the proposed method provides greater flexibility compared to the rflex method, allowing for improved adaptability to irregular cluster shapes.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math11173627/s1.

Author Contributions

Writing—review and editing, M.M., J.P.F. and E.M.L.; Methodology, M.M. and J.P.F.; Software, M.M. and J.P.F.; Data curation, E.M.L.; Conceptualization, E.M.L. All authors have read and agreed to the published version of the manuscript.

Funding

Both Mohammad Meysami and Joshua P. French were partially supported by NSF award 1915277.

Data Availability Statement

The simulation-related data are available in the neastbenchmark R package available for download at https://github.com/jfrench/neastbenchmark. The breast cancer data are available in the smerc R package. The NTM data are not publicly available at this time.

Acknowledgments

We thank the anonymous referees for their comments, which have helped us to improve the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

$N$: Number of regions in the study area.
$i$: Centroid of each region.
$Y_i$: Number of cases in region $i$.
$n_i$: Population size of region $i$.
$E_i$: Expected number of cases in region $i$ (under the null).
$\theta_i$: Risk of developing the disease in region $i$.
$y_+$: Total number of cases in the study area, i.e., $\sum_{i=1}^{N} Y_i = y_+$.
$n_+$: Total population of the study area, i.e., $\sum_{i=1}^{N} n_i = n_+$.
$\mathcal{Z}$: Set of all candidate zones.
$Z$: A candidate zone.
$y_{in}$: Number of cases inside the candidate zone $Z$, i.e., $\sum_{i \in Z} y_i = y_{in}$.
$n_{in}$: Population size inside the candidate zone $Z$, i.e., $\sum_{i \in Z} n_i = n_{in}$.
$E_{in}$: Expected number of cases inside the candidate zone $Z$, i.e., $\sum_{i \in Z} E_i = E_{in}$.
$y_{out}$: Number of cases outside the candidate zone $Z$, i.e., $\sum_{i \notin Z} y_i = y_{out}$.
$n_{out}$: Population size outside the candidate zone $Z$, i.e., $\sum_{i \notin Z} n_i = n_{out}$.
$E_{out}$: Expected number of cases outside the candidate zone $Z$, i.e., $\sum_{i \notin Z} E_i = E_{out}$.

Appendix A

In this appendix we derive the likelihood ratio test statistic when the case counts are modeled by a Poisson or a Binomial random variable. Assume:
  • The risk of disease for all regions $i \in Z$ is $p$. That is, $\theta_i = p$ for all $i \in Z$.
  • The risk of disease for all regions $i \notin Z$ is $q$. That is, $\theta_i = q$ for all $i \notin Z$.

Appendix A.1. Poisson Case Counts

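Let $Y_i \overset{\text{indep.}}{\sim} \text{Poisson}(n_i \theta_i)$. Thus, the likelihood function is
$$L_P(Z, \theta_i) = \prod_{i=1}^{N} \frac{e^{-n_i \theta_i} (n_i \theta_i)^{Y_i}}{Y_i!}.$$
Under the alternative hypothesis that at least one cluster exists, the likelihood function can be written as:
$$
\begin{aligned}
L_P(Z, p, q) &= \prod_{i \in Z} \frac{e^{-n_i p} (n_i p)^{Y_i}}{Y_i!} \prod_{i \notin Z} \frac{e^{-n_i q} (n_i q)^{Y_i}}{Y_i!} \\
&= \frac{e^{-p \sum_{i \in Z} n_i} \, p^{\sum_{i \in Z} Y_i} \prod_{i \in Z} n_i^{Y_i}}{\prod_{i \in Z} Y_i!} \cdot \frac{e^{-q \sum_{i \notin Z} n_i} \, q^{\sum_{i \notin Z} Y_i} \prod_{i \notin Z} n_i^{Y_i}}{\prod_{i \notin Z} Y_i!} \\
&= e^{-p n_{in}} p^{y_{in}} e^{-q n_{out}} q^{y_{out}} \, \frac{\prod_{i \in Z} n_i^{Y_i} \prod_{i \notin Z} n_i^{Y_i}}{\prod_{i \in Z} Y_i! \prod_{i \notin Z} Y_i!} \quad \Big(\text{because } \sum_{i \in Z} n_i = n_{in} \text{ and } \sum_{i \notin Z} n_i = n_{out}\Big) \\
&= C \, e^{-p n_{in}} p^{y_{in}} e^{-q n_{out}} q^{y_{out}},
\end{aligned}
$$
where $C = \dfrac{\prod_{i \in Z} n_i^{Y_i} \prod_{i \notin Z} n_i^{Y_i}}{\prod_{i \in Z} Y_i! \prod_{i \notin Z} Y_i!}$.
Take the logarithm of $L_P(Z, p, q)$, denoted $l_P(Z, p, q)$:
$$l_P(Z, p, q) = \log C - p n_{in} + y_{in} \log p - q n_{out} + y_{out} \log q.$$
Differentiate with respect to $p$ and set equal to zero to find the maximum.
$$
\begin{aligned}
\frac{\partial}{\partial p} l_P(Z, p, q) &= -n_{in} + y_{in} \frac{1}{p} \overset{\text{set}}{=} 0 \;\Rightarrow\; \hat{p} = \frac{y_{in}}{n_{in}}, \\
\frac{\partial^2}{\partial p^2} l_P(Z, p, q) &= -y_{in} \frac{1}{p^2} < 0 \;\Rightarrow\; \hat{p} = \frac{y_{in}}{n_{in}} \text{ is a maximum}.
\end{aligned}
$$
Similarly, $\hat{q} = \dfrac{y_{out}}{n_{out}}$.
Under the null hypothesis of no clustering, $p$ is equal to $q$. Thus,
$$l_P(Z, p = q) = \log C - p (n_{in} + n_{out}) + (y_{in} + y_{out}) \log p \quad (\text{we know } n_{in} + n_{out} = n_+ \text{ and } y_{in} + y_{out} = y_+),$$
$$\frac{\partial}{\partial p} l_P(Z, p = q) = -n_+ + y_+ \frac{1}{p} \overset{\text{set}}{=} 0 \;\Rightarrow\; \hat{p} = \hat{q} = \frac{y_+}{n_+}.$$
Therefore, the likelihood ratio test statistic can be written as
$$
\begin{aligned}
\lambda_Z^c &= \frac{\sup_{p > q} L_P(Z, p, q)}{\sup_{p = q} L_P(Z, p = q)} \\
&= \frac{C \, e^{-\frac{y_{in}}{n_{in}} n_{in}} \left(\frac{y_{in}}{n_{in}}\right)^{y_{in}} e^{-\frac{y_{out}}{n_{out}} n_{out}} \left(\frac{y_{out}}{n_{out}}\right)^{y_{out}}}{C \, e^{-\frac{y_+}{n_+} n_+} \left(\frac{y_+}{n_+}\right)^{y_+}} \\
&= \frac{e^{-(y_{in} + y_{out})} \left(\frac{y_{in}}{n_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{n_{out}}\right)^{y_{out}}}{e^{-y_+} \left(\frac{y_+}{n_+}\right)^{y_+}} \quad \big(\text{where } e^{-(y_{in} + y_{out})} = e^{-y_+}\big) \\
&= \frac{\left(\frac{y_{in}}{n_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{n_{out}}\right)^{y_{out}}}{\left(\frac{y_+}{n_+}\right)^{y_+}}
= \left(\frac{y_{in}}{\frac{y_+}{n_+} n_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{\frac{y_+}{n_+} n_{out}}\right)^{y_{out}} \\
&= \left(\frac{y_{in}}{E_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{E_{out}}\right)^{y_{out}} \quad \Big(\text{because } \frac{y_+}{n_+} n_{in} = E_{in} \text{ and } \frac{y_+}{n_+} n_{out} = E_{out}\Big).
\end{aligned}
$$
No cluster is a “hotspot” unless $\dfrac{y_{in}}{E_{in}} > \dfrac{y_{out}}{E_{out}}$, so we include the indicator function to simplify the computations. Thus,
$$\lambda_Z^c = \frac{\sup_{p > q} L_P(Z, p, q)}{\sup_{p = q} L_P(Z, p = q)} = \left(\frac{y_{in}}{E_{in}}\right)^{y_{in}} \left(\frac{y_{out}}{E_{out}}\right)^{y_{out}} I\left(\frac{y_{in}}{E_{in}} > \frac{y_{out}}{E_{out}}\right).$$
By taking the maximum over all $Z \in \mathcal{Z}$, the likelihood ratio test statistic for the most likely cluster is obtained. That is,
$$\lambda^c = \sup_{Z \in \mathcal{Z}_c} \lambda_Z^c.$$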
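As a numerical illustration of the Poisson statistic derived above, the following Python sketch evaluates $\log \lambda_Z^c$ for a single candidate zone. Working on the log scale avoids overflow when case counts are large. The function names are hypothetical, and returning 0 for zones that fail the hotspot condition is a convention chosen here (it corresponds to a ratio of 1 rather than the 0 produced by the indicator); either way such zones cannot be the most likely cluster, because hotspot zones have a ratio of at least 1.

```python
import math


def _xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def poisson_log_lrt(y_in, e_in, y_plus):
    """Log of (y_in/E_in)^y_in * (y_out/E_out)^y_out for one candidate zone.

    Assumes the expected counts are scaled so that E_in + E_out = y_plus,
    as in the derivation (E_i = n_i * y_+ / n_+), with 0 < e_in < y_plus.
    """
    y_out = y_plus - y_in
    e_out = y_plus - e_in
    if y_in / e_in <= y_out / e_out:
        return 0.0  # not a hotspot (chosen convention, see lead-in)
    return _xlogy(y_in, y_in / e_in) + _xlogy(y_out, y_out / e_out)


# Hypothetical zone: 30 observed cases where 20 are expected, out of 1000 in total.
log_lambda = poisson_log_lrt(y_in=30, e_in=20.0, y_plus=1000)
print(log_lambda)
```

The statistic of the most likely cluster is then the maximum of this quantity over all candidate zones.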

Appendix A.2. Binomial Case Counts

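Let $Y_i \overset{\text{indep.}}{\sim} \text{Binomial}(n_i, \theta_i)$. Thus, the likelihood function is
$$L_B(Z, \theta_i) = \prod_{i=1}^{N} \binom{n_i}{Y_i} \theta_i^{Y_i} (1 - \theta_i)^{n_i - Y_i}.$$
Under the alternative hypothesis that at least one cluster exists, the likelihood function can be written as:
$$
\begin{aligned}
L_B(Z, p, q) &= \prod_{i \in Z} \binom{n_i}{Y_i} p^{Y_i} (1 - p)^{n_i - Y_i} \prod_{i \notin Z} \binom{n_i}{Y_i} q^{Y_i} (1 - q)^{n_i - Y_i} \\
&= \prod_{i \in Z} \binom{n_i}{Y_i} \, p^{\sum_{i \in Z} Y_i} (1 - p)^{\sum_{i \in Z} (n_i - Y_i)} \prod_{i \notin Z} \binom{n_i}{Y_i} \, q^{\sum_{i \notin Z} Y_i} (1 - q)^{\sum_{i \notin Z} (n_i - Y_i)} \\
&= C \, p^{y_{in}} (1 - p)^{n_{in} - y_{in}} q^{y_{out}} (1 - q)^{n_{out} - y_{out}},
\end{aligned}
$$
where $C = \prod_{i \in Z} \binom{n_i}{Y_i} \cdot \prod_{i \notin Z} \binom{n_i}{Y_i}$.
Take the logarithm of $L_B(Z, p, q)$, denoted $l_B(Z, p, q)$:
$$l_B(Z, p, q) = \log C + y_{in} \log p + (n_{in} - y_{in}) \log(1 - p) + y_{out} \log q + (n_{out} - y_{out}) \log(1 - q).$$
Differentiate with respect to $p$ and set equal to zero to find the maximum.
$$\frac{\partial}{\partial p} l_B(Z, p, q) = y_{in} \frac{1}{p} - (n_{in} - y_{in}) \frac{1}{1 - p} \overset{\text{set}}{=} 0 \;\Rightarrow\; \hat{p} = \frac{y_{in}}{n_{in}}. \quad \text{Similarly, } \hat{q} = \frac{y_{out}}{n_{out}}.$$
Under the null hypothesis of no clustering, $p$ is equal to $q$. Thus,
$$l_B(Z, p = q) = \log C + (y_{in} + y_{out}) \log p + \big(n_{in} + n_{out} - (y_{in} + y_{out})\big) \log(1 - p),$$
$$\frac{\partial}{\partial p} l_B(Z, p = q) = y_+ \frac{1}{p} - (n_+ - y_+) \frac{1}{1 - p} \overset{\text{set}}{=} 0 \;\Rightarrow\; \hat{p} = \hat{q} = \frac{y_+}{n_+}.$$
Therefore, the likelihood ratio test statistic can be written as
$$
\begin{aligned}
\lambda_Z^c &= \frac{\sup_{p > q} L_B(Z, p, q)}{\sup_{p = q} L_B(Z, p = q)} \\
&= \frac{C \left(\frac{y_{in}}{n_{in}}\right)^{y_{in}} \left(1 - \frac{y_{in}}{n_{in}}\right)^{n_{in} - y_{in}} \left(\frac{y_{out}}{n_{out}}\right)^{y_{out}} \left(1 - \frac{y_{out}}{n_{out}}\right)^{n_{out} - y_{out}}}{C \left(\frac{y_+}{n_+}\right)^{y_+} \left(1 - \frac{y_+}{n_+}\right)^{n_+ - y_+}} \\
&= \frac{\left(\frac{y_{in}}{n_{in}}\right)^{y_{in}} \left(\frac{n_{in} - y_{in}}{n_{in}}\right)^{n_{in} - y_{in}} \left(\frac{y_{out}}{n_{out}}\right)^{y_{out}} \left(\frac{n_{out} - y_{out}}{n_{out}}\right)^{n_{out} - y_{out}}}{\left(\frac{y_+}{n_+}\right)^{y_+} \left(\frac{n_+ - y_+}{n_+}\right)^{n_+ - y_+}}.
\end{aligned}
$$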
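No cluster is a hotspot unless $\hat{p} > \hat{q}$, that is, unless $\dfrac{y_{in}}{n_{in}} > \dfrac{y_{out}}{n_{out}}$, so we include the indicator function to simplify the computations. Thus,
$$\lambda_Z^c = \frac{\left(\frac{y_{in}}{n_{in}}\right)^{y_{in}} \left(\frac{n_{in} - y_{in}}{n_{in}}\right)^{n_{in} - y_{in}} \left(\frac{y_{out}}{n_{out}}\right)^{y_{out}} \left(\frac{n_{out} - y_{out}}{n_{out}}\right)^{n_{out} - y_{out}}}{\left(\frac{y_+}{n_+}\right)^{y_+} \left(\frac{n_+ - y_+}{n_+}\right)^{n_+ - y_+}} \, I\left(\frac{y_{in}}{n_{in}} > \frac{y_{out}}{n_{out}}\right).$$
By taking the maximum over all $Z \in \mathcal{Z}$, the likelihood ratio test statistic for the most likely cluster is obtained. That is,
$$\lambda^c = \sup_{Z \in \mathcal{Z}_c} \lambda_Z^c.$$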

References

  1. Moran, P.A. Notes on continuous stochastic phenomena. Biometrika 1950, 37, 17–23.
  2. Geary, R.C. The contiguity ratio and statistical mapping. Inc. Stat. 1954, 5, 115–146.
  3. Turnbull, B.W.; Iwano, E.J.; Burnett, W.S.; Howe, H.L.; Clark, L.C. Monitoring for clusters of disease: Application to leukemia incidence in upstate New York. Am. J. Epidemiol. 1990, 132 (Suppl. S1), 136–143.
  4. Besag, J.; Newell, J. The detection of clusters in rare diseases. J. R. Stat. Soc. Ser. Stat. Soc. 1991, 154, 143–155.
  5. Kulldorff, M.; Nagarwalla, N. Spatial disease clusters: Detection and inference. Stat. Med. 1995, 14, 799–810.
  6. Kulldorff, M. A spatial scan statistic. Commun. Stat. Theory Methods 1997, 26, 1481–1496.
  7. Tango, T.; Takahashi, K. A flexibly shaped spatial scan statistic for detecting clusters. Int. J. Health Geogr. 2005, 4, 11.
  8. Tango, T.; Takahashi, K. A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters. Stat. Med. 2012, 31, 4207–4218.
  9. Khan, M.M.; Odoi, E.W. Geographic disparities in COVID-19 testing and outcomes in Florida. BMC Public Health 2023, 23, 1–13.
  10. Day, C.A.; Odoi, A.; Trout Fryxell, R. Geographically persistent clusters of La Crosse virus disease in the Appalachian region of the United States from 2003 to 2021. PLoS Neglected Trop. Dis. 2023, 17, e0011065.
  11. Nakaya, T.; Takahashi, K.; Takahashi, H.; Yasumura, S.; Ohira, T.; Shimura, H.; Suzuki, S.; Suzuki, S.; Iwadate, M.; Yokoya, S.; et al. Revisiting the geographical distribution of thyroid cancer incidence in Fukushima Prefecture: Analysis of data from the second-and third-round thyroid ultrasound examination. J. Epidemiol. 2022, 32, S76–S83.
  12. Kulldorff, M.; Huang, L.; Pickle, L.; Duczmal, L. An elliptic spatial scan statistic. Stat. Med. 2006, 25, 3929–3943.
  13. Jiménez-Martín, D.; García-Bocanegra, I.; Risalde, M.A.; Fernández-Molera, V.; Jiménez-Ruiz, S.; Isla, J.; Cano-Terriza, D. Epidemiology of paratuberculosis in sheep and goats in southern Spain. Prev. Vet. Med. 2022, 202, 105637.
  14. Mercaldo, R.A.; Marshall, J.E.; Prevots, D.R.; Lipner, E.M.; French, J.P. Detecting clusters of high nontuberculous mycobacteria infection risk for persons with cystic fibrosis—An analysis of U.S. counties. Tuberculosis 2023, 138, 102296.
  15. Assunção, R.; Costa, M.; Tavares, A.; Ferreira, S. Fast detection of arbitrarily shaped disease clusters. Stat. Med. 2006, 25, 723–742.
  16. Costa, M.A.; Assunção, R.M.; Kulldorff, M. Constrained spanning tree algorithms for irregularly-shaped spatial clustering. Comput. Stat. Data Anal. 2012, 56, 1771–1783.
  17. Waller, L.A.; Turnbull, B.W.; Clark, L.C.; Nasca, P. Chronic disease surveillance and testing of clustering of disease and exposure: Application to leukemia incidence and TCE-contaminated dumpsites in upstate New York. Environmetrics 1992, 3, 281–300.
  18. Waller, L.A.; Turnbull, B.W.; Clark, L.C.; Nasca, P. Spatial pattern analyses to detect rare disease clusters. Case Stud. Biometry 1994, 3, 23.
  19. Moraga, P. Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny; Chapman and Hall/CRC: Boca Raton, FL, USA, 2019.
  20. French, J.P.; Meysami, M.; Hall, L.M.; Weaver, N.E.; Nguyen, M.C.; Panter, L. A comparison of spatial scan methods for cluster detection. J. Stat. Comput. Simul. 2022, 92, 3343–3372.
  21. Han, J.; Zhu, L.; Kulldorff, M.; Hostovich, S.; Stinchcomb, D.G.; Tatalovich, Z.; Lewis, D.R.; Feuer, E.J. Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics. Int. J. Health Geogr. 2016, 15, 1–11.
  22. Meysami, M.; French, J.P.; Lipner, E.M. Estimating the optimal population upper bound for scan methods in retrospective disease surveillance. Biom. J. 2021, 63, 1633–1651.
  23. Waller, L.A.; Gotway, C.A. Applied Spatial Statistics for Public Health Data; John Wiley & Sons: Hoboken, NJ, USA, 2004; Volume 368.
  24. Duczmal, L.; Kulldorff, M.; Huang, L. Evaluation of spatial scan statistics for irregularly shaped clusters. J. Comput. Graph. Stat. 2006, 15, 428–442.
  25. Kulldorff, M.; Tango, T.; Park, P.J. Power comparisons for disease clustering tests. Comput. Stat. Data Anal. 2003, 42, 665–684.
  26. French, J.P. Smerc: Statistical Methods for Regional Counts, R Package Version 1.4; 2021. Available online: https://cran.r-project.org/package=smerc (accessed on 20 August 2023).
  27. Kulldorff, M. SaTScan, Version 9.7; 2021. Available online: https://satscan.org (accessed on 7 July 2021).
  28. Lipner, E.M.; Crooks, J.L.; French, J.; Strong, M.; Nick, J.A.; Prevots, D.R. Nontuberculous mycobacterial infection and environmental molybdenum in persons with cystic fibrosis: A case-control study in Colorado. J. Expo. Sci. Environ. Epidemiol. 2021, 32, 289–294.
  29. Lipner, E.M.; French, J.; Bern, C.R.; Walton-Day, K.; Knox, D.; Strong, M.; Prevots, D.R.; Crooks, J.L. Nontuberculous Mycobacterial Disease and Molybdenum in Colorado Watersheds. Int. J. Environ. Res. Public Health 2020, 17, 3854.
Figure 1. A study area composed of 19 polygonal regions. The centroid of each region is indicated by a dot. The dashed-line ellipse, which encloses a collection of regions, is centered at centroid $i$ with angle $\phi$, minor axis $a$, and major axis $b$. This ellipse is a potential candidate zone $Z \in \mathcal{Z}_e$. The elliptic scan method starts with a single centroid $i$ and expands the ellipse until a new centroid is absorbed; a new candidate zone is created each time a new centroid is absorbed. For each region $i$, and for fixed shape $s = b/a$ and angle $\phi$, this process continues until a stopping criterion is met (by default, when 50% of the population is contained within the ellipse).
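The zone construction described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `elliptic_zones`, its arguments, the convention that the angle is measured from the x-axis to the major axis, and the choice to stop before the population bound is exceeded are all assumptions made for the example.

```python
import math

def elliptic_zones(centroids, populations, center, shape, angle, max_pop_frac=0.5):
    """Nested candidate zones for one centre, one shape s = b/a, and one angle.

    Hypothetical helper: `angle` is assumed to be the rotation (in radians)
    of the major axis from the x-axis.
    """
    cx, cy = centroids[center]
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    scale = []
    for j, (x, y) in enumerate(centroids):
        dx, dy = x - cx, y - cy
        # Coordinates of centroid j along the major (u) and minor (v) axes.
        u = dx * cos_t + dy * sin_t
        v = -dx * sin_t + dy * cos_t
        # Smallest minor semi-axis a for which the ellipse
        # (u / (s * a))^2 + (v / a)^2 <= 1 contains centroid j.
        scale.append((math.hypot(u / shape, v), j))
    scale.sort()  # order in which the growing ellipse absorbs centroids

    limit = max_pop_frac * sum(populations)
    zones, members, pop = [], [], 0.0
    for _, j in scale:
        pop += populations[j]
        if pop > limit:  # assumed stopping rule: population bound exceeded
            break
        members.append(j)
        zones.append(tuple(members))  # one new zone per absorbed centroid
    return zones

# Hypothetical centroids and populations, for illustration only.
centroids = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.5), (5.0, 5.0)]
populations = [1, 1, 1, 1]
print(elliptic_zones(centroids, populations, center=0, shape=2.0, angle=0.0))
print(elliptic_zones(centroids, populations, center=0, shape=1.0, angle=0.0))
```

With `shape=1.0` the ellipse reduces to a circle, so the two calls show how an elongated window can absorb a different neighbour first. A full scan would repeat this over every centre and over a grid of shapes and angles.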
Figure 2. Illustration of several of the imixed, iurban, irural, and “a”–“k” simulated benchmark data sets based on breast cancer mortality in the northeastern United States. For example, iurban10 displays an urban cluster containing 10 regions, and imixed04 illustrates a mixed cluster (a mix of urban and rural regions) containing four regions.
Figure 3. Box plots of the average (a) sensitivity, (b) PPV, and (c) misclassification for all 56 cluster models.
Figure 4. Breast cancer mortality cases (a) and SMR (b) for the Northeastern United States data.
Figure 5. A map of an eight-county region in upstate New York. The significant clusters identified by the rflex method (for both tuning parameter values, $\alpha_1 = 0.2$ and $\alpha_1 = 0.3$), the flexible–elliptical method, and the elliptic method are shown. In each map, the first and last clusters are the most significant and least significant clusters, respectively. The level of significance is $\alpha = 0.05$. Cluster 1 is the most likely cluster (MLC), with each successive cluster having a lower LRT statistic.
Figure 6. A map of the significant NTM clusters identified by the rflex, elliptic, and flexible–elliptical scan methods. In each map, the first and last clusters are the most significant and the least significant clusters, respectively. The level of significance is $\alpha = 0.05$. Cluster 1 is the most likely cluster (MLC), with each successive cluster having a lower LRT statistic.
Figure 7. Two examples of non-circular clusters detected by different scan methods. Left plots: clustering model imixed12 detected by the elliptic method and the flexible–elliptical method. Right plots: clustering model “a” detected by the rflex method with $\alpha_1 = 0.2$ and the flexible–elliptical method.
Table 1. Significant clusters detected by the rflex method (both $\alpha_1 = 0.2$ and $\alpha_1 = 0.3$), the flexible–elliptical (flex-ellip) method, and the elliptic method. Monte Carlo p-values were computed using 999 null data sets generated under the constant risk hypothesis, and significance was assessed at the $\alpha = 0.05$ level.
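| Method | Cluster | Population | Cases | Expected | SMR | p-Value |
|---|---|---|---|---|---|---|
| rflex ($\alpha_1 = 0.2$) | Cluster 1 | 1,922,489 | 4525 | 3836 | 1.18 | 0.001 |
| | Cluster 2 | 2,232,866 | 5150 | 4456 | 1.16 | 0.001 |
| | Cluster 3 | 920,991 | 2248 | 1838 | 1.22 | 0.001 |
| | Cluster 4 | 228,322 | 643 | 455 | 1.41 | 0.001 |
| | Cluster 5 | 660,581 | 1537 | 1318 | 1.17 | 0.001 |
| | Cluster 6 | 507,044 | 1201 | 1011 | 1.19 | 0.001 |
| | Cluster 7 | 104,057 | 291 | 207 | 1.40 | 0.003 |
| | | mean = 939,478 | mean = 2228 | | mean = 1.25 | |
| rflex ($\alpha_1 = 0.3$) | Cluster 1 | 1,922,489 | 4525 | 3836 | 1.18 | 0.001 |
| | Cluster 2 | 2,232,866 | 5150 | 4456 | 1.16 | 0.001 |
| | Cluster 3 | 920,991 | 2248 | 1838 | 1.22 | 0.001 |
| | Cluster 4 | 228,322 | 643 | 455 | 1.41 | 0.001 |
| | Cluster 5 | 660,581 | 1537 | 1318 | 1.17 | 0.001 |
| | Cluster 6 | 507,044 | 1201 | 1011 | 1.19 | 0.001 |
| | Cluster 7 | 104,057 | 291 | 207 | 1.40 | 0.004 |
| | Cluster 8 | 470,397 | 1084 | 938 | 1.15 | 0.041 |
| | | mean = 880,843 | mean = 2085 | | mean = 1.23 | |
| flex-ellip | Cluster 1 | 3,256,369 | 7480 | 6498 | 1.15 | 0.001 |
| | Cluster 2 | 2,062,671 | 4853 | 4116 | 1.18 | 0.001 |
| | Cluster 3 | 920,991 | 2248 | 1838 | 1.22 | 0.001 |
| | Cluster 4 | 1,673,793 | 3703 | 3340 | 1.11 | 0.001 |
| | Cluster 5 | 507,044 | 1201 | 1011 | 1.19 | 0.004 |
| | Cluster 6 | 104,057 | 291 | 207 | 1.40 | 0.009 |
| | | mean = 1,420,821 | mean = 3296 | | mean = 1.21 | |
| elliptic | Cluster 1 | 1,917,315 | 4517 | 3826 | 1.18 | 0.001 |
| | Cluster 2 | 1,701,906 | 3979 | 3396 | 1.17 | 0.001 |
| | Cluster 3 | 1,102,261 | 2598 | 2199 | 1.18 | 0.001 |
| | Cluster 4 | 1,841,814 | 4062 | 3675 | 1.11 | 0.001 |
| | Cluster 5 | 889,355 | 2035 | 1774 | 1.15 | 0.002 |
| | Cluster 6 | 635,396 | 1480 | 1268 | 1.17 | 0.002 |
| | | mean = 1,348,008 | mean = 3112 | | mean = 1.16 | |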
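The SMR and p-value columns of Table 1 can be illustrated with a short Python sketch. It assumes the usual definitions: the SMR is the ratio of observed to expected cases, and the Monte Carlo p-value is the rank-based proportion (R + 1)/(N + 1), where R is the number of null statistics at least as large as the observed one. The function names and the null statistics below are hypothetical, and the article's own software may differ in its details.

```python
def smr(cases, expected):
    """Standardized mortality ratio: observed cases over expected cases."""
    return cases / expected

def monte_carlo_p_value(observed_stat, null_stats):
    """Rank-based Monte Carlo p-value (assumed formula: (R + 1) / (N + 1))."""
    exceed = sum(1 for t in null_stats if t >= observed_stat)
    return (exceed + 1) / (len(null_stats) + 1)

# Cases and expected counts for Cluster 1 of rflex, taken from Table 1.
print(round(smr(4525, 3836), 2))

# Hypothetical null statistics: none of the 999 values reaches the observed one.
print(monte_carlo_p_value(10.0, [0.0] * 999))
```

Under this formula, 999 null data sets cannot produce a p-value below 1/1000, which is consistent with 0.001 being the smallest p-value reported in the table.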

Meysami, M.; French, J.P.; Lipner, E.M. Flexible-Elliptical Spatial Scan Method. Mathematics 2023, 11, 3627. https://doi.org/10.3390/math11173627
