Fuzzy Entropy-Based Spatial Hotspot Reliability

Clustering techniques are used in spatial hotspot analysis to detect hotspots as areas on the map. An extension of the Fuzzy C-means clustering algorithm has been applied to locate hotspots on the map as circular areas; it represents a good trade-off between accuracy in detecting the hotspot shape and computational complexity. However, this method does not measure the reliability of the detected hotspots, so it does not tell us how reliably the circular area corresponding to a detected cluster identifies a hotspot. A measure of hotspot reliability is crucial for the decision maker to assess the need for action on the area circumscribed by the hotspots. We propose a method based on De Luca and Termini's fuzzy entropy that uses this extension of the Fuzzy C-means algorithm and measures the reliability of the detected hotspots. We test our method on a disease analysis problem in which hotspots corresponding to the areas where most oto-laryngo-pharyngeal patients reside, within the province of Naples, Italy, are detected as circular areas. The results show a dependency between the reliability of a hotspot and the fluctuation of the membership degrees of the data points to that hotspot.


Introduction
Hotspot detection is an emerging spatial analysis technique that detects the areas in which events representing a certain phenomenon occur with greater intensity (hotspots) and follows their spatial distribution and displacement over time. Clustering techniques have been proposed by various researchers to locate hotspots in the study area for many problems. For example, in crime analysis, they are used to locate as hotspots the areas with the greatest presence and frequency of criminal events in urban contexts; in disease analysis, they are used to evaluate the formation and displacement of disease strains over time; in the monitoring of natural and environmental disasters, such as the development of fires in wooded areas in summer, they are applied to analyze where, and with what frequency and intensity, natural and malicious fires develop.
Clustering algorithms such as kernel density estimation, K-means, and Fuzzy C-means (FCM) have been proposed by several authors to detect hotspots in various spatial analysis problems.
Kernel density algorithms can detect even hotspots of irregular geometric shape, but they are computationally more expensive than K-means and FCM; on the other hand, K-means and FCM detect only cluster centers and are less robust to noise in the data. Furthermore, in K-means and FCM, the number of clusters must be fixed in advance, and validity measures must be used to evaluate the suitable number of clusters.
In [21], a new hotspot detection technique is proposed, based on an extension of FCM called Extended Fuzzy C-means (EFCM for short) [22]. EFCM detects clusters as hyperspheres in the feature space; the number of clusters need not be set a priori, as it is obtained by merging the most similar clusters during each iteration. In [21], the authors show that EFCM can approximate the shape of hotspots on the map and is robust to noise and outliers. The EFCM hotspot detection method has been applied in disease analysis [23,24] and in earthquake disaster analysis [25].
One of the main needs in hotspot detection is to evaluate the reliability of the results by measuring how significant the detected hotspots are. EFCM detects circular hotspots on the map but gives no information about their reliability. This assessment is sometimes interpretative; it is left to the expert, who assesses whether the analyzed event persists more frequently in the region where the hotspot was detected. An effective measure of the reliability of a hotspot is critical to understanding how accurately the area in which the analyzed phenomenon occurs has been located, in order to monitor it and follow its movements over time. Currently, no hotspot detection method proposed in the literature allows us to measure the reliability of the detected hotspots; such a quantitative assessment is crucial because it would allow the decision maker to evaluate how accurate the geographic location and areal size of a hotspot are.
In this research we propose a measure of the reliability of the hotspots detected by running the EFCM algorithm, in which the De Luca & Termini fuzzy entropy [26,27] is applied; the reliability of the hotspot is higher if the fuzzy entropy measured for the corresponding cluster is lower.
Recently, measures of fuzzy entropy of clustering in FCM have been proposed in [28,29]. Following [28], in this work each fuzzy cluster constitutes a fuzzy set in the domain of the data points and a measure of the fuzzy entropy is applied to each fuzzy cluster to evaluate its fuzziness.
If $H(A_i)$ is the fuzziness of the $i$th cluster, normalized in the interval [0,1], we assign to the corresponding hotspot $A_i$ a reliability given by:

$$R(A_i) = 1 - H(A_i) \tag{1}$$

The reliability of $A_i$ is normalized in the interval [0,1]. It equals 1 when the fuzziness of $A_i$ is null (the cluster is a non-empty crisp set) and 0 when the fuzziness is maximum (all the data points belong to the cluster with membership degree 1/2).
We implemented our method in a GIS framework in which the EFCM-based hotspot detection algorithm is encapsulated. After executing EFCM, the detected hotspots are shown as circles on the map and the reliability of each hotspot is calculated as in (1). Finally, the hotspot reliability map is constructed.
In the next section, the EFCM algorithm and the De Luca & Termini fuzzy entropy are introduced. Our framework is described in Section 3. Section 4 shows the results of our tests. Finally, concluding considerations are given in Section 5.

The EFCM Algorithm
The EFCM algorithm [22] is a variation of FCM in which the cluster prototypes are hyperspheres in the feature space; each cluster is characterized by its center vector and its radius.
Let $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^n$ be a set of $N$ data points in the $n$-dimensional feature space $\mathbb{R}^n$, where $x_j = (x_{j1}, \dots, x_{jn})$. Let $V = \{v_1, \dots, v_C\} \subset \mathbb{R}^n$ be the set of centers of the $C$ clusters. Let $U$ be the $C \times N$ partition matrix, where $u_{ij}$ is the membership degree of the $j$th data point $x_j$ to the $i$th cluster $v_i$. Let $r = \{r_1, \dots, r_C\}$ be the set of radii of the $C$ clusters. EFCM minimizes the following objective function:

$$J(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m}\, \delta_{ij}^{2} \tag{2}$$

where $m$ is the fuzzifier parameter and $\delta_{ij}$, interpreted as the distance between the $i$th cluster and the $j$th data point, is given by:

$$\delta_{ij} = \max\left(0,\; d_{ij} - r_i\right) \tag{3}$$

In (3), $d_{ij}$ is the Euclidean distance between the center of the $i$th cluster and the $j$th data point, and $r_i$ is the radius of the $i$th cluster.
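The distance to a hyperspherical prototype can be illustrated with a minimal Python sketch. It assumes that the distance between a cluster and a data point is the Euclidean distance to the center reduced by the radius and truncated at zero, so that every point inside the sphere has distance 0; the function names `volume_distance` and `objective` are hypothetical and are not taken from [22].

```python
import math


def volume_distance(x, center, radius):
    """Distance from data point x to a hyperspherical cluster prototype.

    Assumption: delta = max(0, d - r), where d is the Euclidean distance
    from x to the cluster center and r is the cluster radius.
    """
    d = math.dist(x, center)
    return max(0.0, d - radius)


def objective(points, centers, radii, U, m=2.0):
    """Sum over clusters i and points j of u_ij^m * delta_ij^2."""
    total = 0.0
    for i, (v, r) in enumerate(zip(centers, radii)):
        for j, x in enumerate(points):
            total += (U[i][j] ** m) * volume_distance(x, v, r) ** 2
    return total
```

Under this assumption, a point lying inside a prototype contributes nothing to the objective function, whatever its membership degree.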
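If $P_i$ is the covariance matrix of the $i$th cluster:

$$P_i = \frac{\sum_{j=1}^{N} u_{ij}^{m}\,(x_j - v_i)(x_j - v_i)^{T}}{\sum_{j=1}^{N} u_{ij}^{m}} \tag{4}$$

then the radius $r_i$ of the $i$th cluster is given by:

$$r_i = \sqrt{|P_i|^{1/n}} \tag{5}$$

Applying the Lagrange multiplier method to the objective function, we obtain the solutions for $V$ and $U$:

$$v_i = \frac{\sum_{j=1}^{N} u_{ij}^{m}\, x_j}{\sum_{j=1}^{N} u_{ij}^{m}} \tag{6}$$

$$u_{ij} = \begin{cases} \dfrac{1}{\sum_{k=1}^{C} \left(\delta_{ij}/\delta_{kj}\right)^{2/(m-1)}} & \text{if } \varphi_j = 0 \\ 0 & \text{if } \delta_{ij} > 0 \text{ and } \varphi_j > 0 \\ 1/\varphi_j & \text{if } \delta_{ij} = 0 \text{ and } \varphi_j > 0 \end{cases} \tag{7}$$

where $\varphi_j$ is the number of clusters whose distance $\delta_{ij}$, $i = 1, \dots, C$, from the $j$th data point is equal to 0.
In [22], with the aim of ensuring the separation between clusters, the radius of the $i$th cluster calculated at the $t$th iteration, $r_i^{(t)}$, is multiplied by the factor $\beta^{(t)}/C^{(t-1)}$, where $C^{(t-1)}$ is the number of clusters detected during the previous iteration and $\beta^{(t)}$ is a parameter defined recursively in [22] from the value $\beta^{(t-1)}$ calculated at the previous iteration, with $\beta^{(0)} = 1$. The optimal number of clusters is found by merging, at each iteration, the two most similar clusters, under some conditions.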
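The role of the count of zero-distance clusters in the membership update can be sketched as follows. This is a minimal illustration under stated assumptions: when the data point lies inside one or more prototypes, its membership is shared equally among them and is 0 for all other clusters; otherwise the usual FCM update is applied to the distances. The function name `update_memberships` is hypothetical.

```python
def update_memberships(deltas, m=2.0):
    """Membership degrees of one data point to C clusters.

    deltas[i] is the distance from the point to the ith volume prototype.
    Assumption: if phi > 0 prototypes contain the point (distance 0),
    each of them receives membership 1/phi and the others receive 0;
    otherwise the standard FCM update is used.
    """
    inside = [d == 0.0 for d in deltas]
    phi = sum(inside)
    if phi > 0:
        return [1.0 / phi if flag else 0.0 for flag in inside]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in deltas)
        for d_i in deltas
    ]
```

In both branches the memberships of a data point sum to 1, as required for a fuzzy partition.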
The similarity between two clusters is measured by the inclusion index $S_{ik}$ defined in [22]; the similarity cluster matrix $S$ is a symmetric matrix. Let $S^{(t)}$ be the similarity cluster matrix calculated at the $t$th iteration, and let $i^*$ and $k^*$ be the indices of the two most similar clusters, with similarity $S^{*(t)}$; these two clusters are merged if their similarity is greater than an adaptive similarity threshold $\alpha^{(t)} = 1/(C^{(t)} - 1)$ and the absolute difference $|S^{*(t)} - S^{*(t-1)}|$ is less than an error $\eta$. When two clusters are merged, the number of clusters is reduced by one unit and the membership degrees of the merged cluster are given by:

$$u_{i^*j} \leftarrow u_{i^*j} + u_{k^*j}, \quad j = 1, \dots, N \tag{10}$$

after which the $k^*$th row is removed from $U$.
The EFCM algorithm (Algorithm 1) is described in the following pseudocode.

Algorithm 1: EFCM.
Set m, ε, η, and the initial number of clusters C(0)
β ← 1, S* ← 0, S*prev ← 1
Initialize randomly the partition matrix U and the centers v_i
Repeat
    For i = 1 to C //calculate centers and radii of clusters
        Calculate the center v_i of the ith cluster
        Calculate the radius r_i of the ith cluster
        r_i ← r_i · β/C //rescale the radius of the ith cluster
    For i = 1 to C //calculate the new partition matrix
        For j = 1 to N
            Calculate the membership degree component u_ij
    For i = 1 to C − 1 //find the two most similar clusters
        For k = i + 1 to C
            If S_ik > S* Then
                S* ← S_ik, i* ← i, k* ← k
    If S* > α and |S* − S*prev| < η Then //merge the two most similar clusters
        For j = 1 to N
            u_i*j ← u_i*j + u_k*j
        Remove the k*th row from U
        C ← C − 1
    Update β as in [22]
    S*prev ← S*, S* ← 0
Until the change in U between two consecutive iterations is less than ε
Return the partition matrix and the volume prototypes of the final C clusters
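The merging step can be sketched in Python as follows. The sketch takes the similarity matrix as an input, so it makes no assumption about how the inclusion index is computed; the function name `merge_most_similar` and its arguments are hypothetical, and the merge itself (adding the two membership rows and dropping one of them) is an assumption consistent with the row removal in the pseudocode.

```python
def merge_most_similar(U, S, s_prev, eta):
    """Merge the two most similar clusters when both conditions hold.

    U: C x N partition matrix (list of rows); S: C x C symmetric
    similarity matrix; s_prev: largest similarity found at the previous
    iteration; eta: tolerance on the change of the largest similarity.
    Returns (new partition matrix, largest similarity found).
    """
    C = len(U)
    if C < 2:
        return [row[:] for row in U], 0.0
    s_star, i_star, k_star = -1.0, 0, 1
    for i in range(C - 1):
        for k in range(i + 1, C):
            if S[i][k] > s_star:
                s_star, i_star, k_star = S[i][k], i, k
    alpha = 1.0 / (C - 1)  # adaptive similarity threshold
    if s_star > alpha and abs(s_star - s_prev) < eta:
        merged = [a + b for a, b in zip(U[i_star], U[k_star])]
        new_U = [
            merged if idx == i_star else row[:]
            for idx, row in enumerate(U)
            if idx != k_star
        ]
        return new_U, s_star
    return [row[:] for row in U], s_star
```

Because the two rows are added, each column of the returned matrix still sums to 1 after a merge.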

De Luca & Termini Fuzzy Entropy and the Measure of Fuzziness
Let $F(X) = \{A: X \to [0,1]\}$ be the family of fuzzy sets defined on a universe of discourse $X$. Let $h: [0,1] \to [0,1]$ be a continuous function called the fuzzy entropy function. The following restrictions are required for the function $h$: $h$ is monotonically increasing in $[0, 1/2]$ and monotonically decreasing in $[1/2, 1]$. The fuzzy entropy function has a minimal value of 0 when $u$ is 0 or 1, and a maximum value when $u = 1/2$. De Luca and Termini in [26,27] propose the following fuzzy entropy function:

$$h(u) = -u \log_2 u - (1 - u) \log_2 (1 - u) \tag{11}$$

which has the maximum value 1 when $u = 1/2$; it is called Shannon's function. If $X = \{x_1, x_2, \dots, x_N\}$ is a discrete set, we define the entropy measure of fuzziness of the fuzzy set $A$ as:

$$H(A) = K \sum_{i=1}^{N} h(A(x_i)) \tag{12}$$

where $K$ is a multiplicative constant. If $H(A) = 0$, then for each element $x_i$, $i = 1, \dots, N$, $A(x_i) = 0$ or $A(x_i) = 1$, and $A$ coincides with a subset of the set $X$; if $A(x_i) = 1/2$ for each element $x_i$, $i = 1, \dots, N$, the fuzziness of $A$ is maximum. If $A$ is a crisp set, its fuzziness is null and $H(A) = 0$. The higher the fuzziness of a fuzzy set, the closer the mean membership degree of the elements of $X$ to the fuzzy set is to 1/2.
In [28], the fuzziness measure (12) is used to construct a new validity index, which is applied to evaluate the optimal number of clusters in FCM. If $A_i$ is the $i$th fuzzy cluster, $i = 1, \dots, C$, considered as a fuzzy set, and $u_{ij}$ is the membership degree of the $j$th data point to the $i$th cluster, the authors use the following fuzzy entropy measure of $A_i$:

$$H(A_i) = \frac{1}{N} \sum_{j=1}^{N} h(u_{ij}) \tag{13}$$

where $N$ is the number of data points and the De Luca & Termini fuzzy entropy function (11) is used.
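As a small worked example, consider a fuzzy set $A$ on four elements with membership degrees $(0, 0.5, 1, 0.25)$. These values are invented for illustration, and the constant is assumed to be $K = 1/N = 1/4$, so that the fuzziness is normalized in [0,1]. Using Shannon's function for $h$:

```latex
\begin{aligned}
h(0) &= h(1) = 0, \qquad h(0.5) = 1,\\
h(0.25) &= -0.25\log_2 0.25 - 0.75\log_2 0.75
         \approx 0.5 + 0.3113 = 0.8113,\\
H(A) &= \tfrac{1}{4}\,\bigl(0 + 1 + 0 + 0.8113\bigr) \approx 0.4528 .
\end{aligned}
```

The two crisp elements contribute nothing to the fuzziness, while the element with membership degree 0.5 contributes the maximum possible amount.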

The Proposed Framework
We constructed a GIS-based framework in which EFCM is implemented to detect hotspots and the fuzzy entropy measure (13) is calculated to evaluate the reliability of the detected hotspots. The framework is schematized in Figure 1. The set of spatial events is extracted from the spatial event datasets. Each event is a data point given by its latitude and longitude coordinates. The event extraction functionality builds the set of data points by extracting the event data located in the study area and transforming them into a single system of geographic coordinates.
The set of spatial events is given by a set of $N$ events $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^2$. Each data point is given by a pair $x_j = (x_{j1}, x_{j2})$, $j = 1, \dots, N$, where $x_{j1}$ and $x_{j2}$ are the latitude and longitude coordinates of the place where the event is located.
EFCM is executed to detect $C$ circles as clusters; each cluster is identified by its center $v_i = (v_{i1}, v_{i2})$, $i = 1, \dots, C$, and its radius $r_i$, and the component $u_{ij}$ of the $C \times N$ partition matrix $U$ gives the membership degree of the $j$th data point to the $i$th cluster. The cluster prototypes constitute circular hotspots, which are shown on the map.
The Calculate reliability function computes the reliability of each hotspot by applying Formula (13) to assess the fuzziness of the hotspots. The reliability $R(A_i)$, $i = 1, \dots, C$, assigned to the $i$th hotspot is given by Formula (1); it is a value in the range [0,1].
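The reliability calculation can be sketched in a few lines of Python. The sketch assumes that the fuzziness of a hotspot is the mean of Shannon's function over the membership degrees of all data points and that the reliability is one minus this fuzziness; the function names `shannon` and `hotspot_reliabilities` are hypothetical and do not come from the GIS framework.

```python
import math


def shannon(u):
    """De Luca & Termini entropy function, with h(0) = h(1) = 0."""
    if u <= 0.0 or u >= 1.0:
        return 0.0
    return -u * math.log2(u) - (1.0 - u) * math.log2(1.0 - u)


def hotspot_reliabilities(U):
    """Reliability of each hotspot from the C x N partition matrix U.

    Assumption: fuzziness H_i is the mean of h(u_ij) over the N data
    points, and the reliability is R_i = 1 - H_i.
    """
    reliabilities = []
    for row in U:
        fuzziness = sum(shannon(u) for u in row) / len(row)
        reliabilities.append(1.0 - fuzziness)
    return reliabilities
```

A crisp cluster, with all membership degrees equal to 0 or 1, gets reliability 1, while a cluster with all membership degrees equal to 0.5 gets reliability 0.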
Finally, the reliability thematic map is produced. Below we show the algorithm applied to extract the hotspots and assess their reliabilities (Algorithm 2).

Algorithm 2: Hotspots reliability evaluation.
Execute EFCM (Algorithm 1) on the set of data points to obtain the centers v_i, the radii r_i, and the partition matrix U
For i = 1 to C
    H_i ← 0
    For j = 1 to N
        H_i ← H_i + h(u_ij) //where Equation (11) is applied for the function h(u)
    H_i ← H_i / N
    R_i ← 1 − H_i
Return the centers v_i, the radii r_i, and the reliabilities R_i, i = 1, . . . , C

In the next section we show the obtained results.

Test Results
We tested our framework on a study area given by the district of Naples, Italy, which covers 1171 km². The event dataset was constructed from the places of residence of patients diagnosed with oto-laryngo-pharyngeal disease in the last four years. These data, containing only nonsensitive information, were transmitted by hospitals and medical facilities. An address locator geocoding function was used to georeference the data points.
The event dataset comprises about 4000 data points. The GIS framework was constructed using Esri ArcGIS Desktop 10.8; the EFCM algorithm was implemented in the GIS platform using Python libraries.
After executing EFCM, 24 hotspots were detected and plotted as circles on the map. The thematic map with the hotspots is shown in Figure 2. The detected hotspots have an area between 0.6 and about 9 km². The area and the reliability of each hotspot are shown in Table 1. Figure 3 plots the reliability of the hotspots against their area. It shows that there is no linear dependency between the area of the hotspots and their reliability; the very low value of the coefficient of determination $R^2$ (= 0.128) means that smaller and more compact hotspots do not necessarily have greater reliability.
In Figure 4, the graph analyzes the linear dependency of the reliability on the standard deviation of the membership degrees. The graph shows a linear relationship between standard deviation and reliability, with a fairly high value of the coefficient of determination $R^2$ (= 0.864): this result means that, on average, the greater the fluctuation of the membership degrees of the data points to the hotspot, the greater the fuzzy entropy of the hotspot and, therefore, the lower its reliability.
In the reliability thematic map in Figure 5, the hotspots are grouped into three reliability classes:
- Low, which includes hotspots with reliability less than 0.45
- Mean, which includes hotspots with reliability between 0.45 and 0.6
- High, which includes hotspots with reliability greater than 0.6
This can provide information on the distribution in the study area of hotspots with different reliabilities. Hotspots with low reliability can be interpreted as hotspots whose location and extent are more uncertain; on the contrary, hotspots with high reliability are hotspots whose location and extent have been detected with greater certainty. The map in Figure 5 shows a concentration of hotspots with low reliability within the municipality of Naples; they are located in an area corresponding to the historic center of the city.
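The three reliability classes used for the thematic map can be expressed as a simple threshold rule. In the sketch below, the function name `reliability_class` is hypothetical, and assigning the boundary values 0.45 and 0.6 to the Mean class is an assumption, since the text does not say how ties are handled.

```python
def reliability_class(reliability, low=0.45, high=0.6):
    """Map a reliability value in [0, 1] to Low, Mean, or High.

    Assumption: the boundary values 0.45 and 0.6 belong to Mean.
    """
    if reliability < low:
        return "Low"
    if reliability <= high:
        return "Mean"
    return "High"
```

For example, a hotspot with reliability 0.52 falls in the Mean class.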
We asked a team of expert doctors who analyzed the dataset of the locations of patients diagnosed with the disease to evaluate how accurate the location and width of each hotspot detected on the map were, assigning one of three labels (Low, Mean, or High) to each hotspot. In Table 2, the evaluations of the experts are compared with the results obtained in the thematic map in Figure 5. The results obtained are correlated with the deductions made by the team of experts. The four hotspots rated with Low reliability are also evaluated with Low reliability by the pool of experts, who found it more difficult to locate disease strains in the areas including the historic center of the municipality of Naples, mainly because of a high population density that is, on average, homogeneous across the entire historic city center.

Final Considerations
To assess the reliability of the hotspots detected in spatial analysis, we propose a framework in which the EFCM hotspot detection algorithm is used to detect the hotspots, and De Luca and Termini's fuzzy entropy is applied to measure the reliability of the detected hotspots.
We tested our framework on a disease analysis problem; the results show an approximately linear dependence between the reliability of the detected hotspots and the fluctuation of the membership degrees of the event data points to the corresponding fuzzy clusters. Furthermore, the spatial distribution of the reliability of the detected hotspots corresponds to the assessments made by the pool of experts.
In the future, we intend to adapt and apply our framework on massive event datasets, and to test the proposed method for measuring the reliability of predictions of future locations and displacements of hotspots.