Hierarchical Clustering with Spatial Constraints and Standardized Incidence Ratio in Tuberculosis Data

: In this paper, we propose presenting a solution based on socio-epidemiological variables of tuberculosis, considering a clustering with spatial/geographical constraints; and, determine a value of alpha that increases spatial contiguity without signiﬁcantly deteriorating the quality of the solution based on the variables of interest, i.e. those of the feature space. For the application of Ward’s hierarchical clustering method, two dissimilarity matrices were calculated, the ﬁrst provides the dissimilarities in the feature space calculated from the socio-epidemiological variables D 0 and the second provides the dissimilarities in the calculated constraints space from the geographical distances D 1 , together with an α mixing parameter and the non-uniform weight w assigned to the calculation of the dissimilarity matrix deﬁned by the standardized incidence ratio (SIR) of TB and that contributed signiﬁcantly to the increase in clarity, both from a spatial and socio-epidemiological point of view. The method is shown to be feasible in epidemiological studies in the joint understanding of factors of different dimensions, aggregated from a spatial perspective. It is analysis tool that allows making a better understanding of the socio-epidemiological reality of the municipality.


Introduction
In exploratory data analysis, the statistician often uses clustering and visualization to improve his knowledge of the data [1]. In the viewing, he looks for some clusterings explaining some of the significant characteristics of the data.
The cluster analysis goal consists of distinguishing, in the data set to be analyzed, the groups, called clusters. In this paper, we study the hierarchical clustering (and not partitioning). Hierarchical cluster algorithm groups the data based on the distance between each one and looks for data within a cluster to be the most similar to each other. These groups are disjoint subsets of the data set. They have such a property that data that belong to different clusters differ among themselves much more than the data on the same cluster [2]. The difficulty of choosing the clustering method for grouping a set of n objects into k separate sets and the ideal number of clusters is well frequent among researchers. In Tuberculosis (TB) epidemiology, for instance, this challenge is excellent for being a data-driven approach involving many subjective decisions. However, in some clustering problems, it is relevant to impose constraints on the set of allowed solutions [3]. Contiguity constraints (in space or time) are the most common; they occur when the objects in a cluster required not only to be similar to one other, but also to comprise a contiguous set of objects.
TB still poses a substantial global health threat, with some 10-million new cases per year. In Brazil, the estimate is that the incidence of TB is increasing after many years of decline due to the upward trend in the period of 2016-2018 [4]. TB incidence is disproportionately high among people in poverty [5]. The goal set by the World Health Organization (WHO) is to cure 85% of new bacilliferous TB cases by 2020 [6]; however, as observed in the 2018 data, Brazil (71.4%) falls short of reaching this goal [7]. In the State of Paraíba, the situation is even more critical [8] identified a cure rate of 55% in the studied period (2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016).
The State of Paraíba is composed of 223 municipalities; it has the fourteenth contingent population among Brazil's states with more than 4.018 million inhabitants according to 2019 estimates by the Brazilian Institute of Geography and Statistics [9].
The relationship between TB and social conditions demands an understanding of the dynamics of this aggravation and its occurrence in the territory [10]. This study aims to present a solution based on socio-epidemiological variables of TB with results being easily visualized on a map while using the ClustGeo package. This method uses Ward-like hierarchical clustering with non-Euclidean dissimilarities and non-uniform weights attributed to the standardized incidence ratio (SIR) of TB in the 223 municipalities of Paraíba and the importance of the constraint in the clustering procedure through the parameter α, responsible for controlling the weight of the constraint in the quality of the solution on the variables of interest.
Sometimes we wish to provide disease risk estimates in each of the areas that form partitions of the study region [11]. For instance, to identify changes in morbidity and/or mortality in time or to compare the incidence or prevalence. The standardized incidence ratio (SIR) is one simple measure of disease risk.

Study Design and Data Sources
The data analyzed in this study are notified cases of TB in the 223 municipalities in the State of Paraíba in the period between 2001 and 2018, using a secondary source, through the database, registered in the Notifiable Diseases Information System [12] and made available on the website of the Informatics Department of the Unified Health System (DATASUS). The data are reported cases of TB in the State of Paraíba; the variables are ratios, divided into epidemiological (new cases, cure, male and female deaths) and social variable (active age (20-64) patients with TB). A matrix was also calculated with the geographic distances between the municipalities and the weight w non-uniform attributed to the calculation of the dissimilarity matrix D, as being the standardized incidence ratio (SIR) of TB in the State of Paraíba.
The data were collected between February-May 2020. Statistical analyses were undertaken in R version 3.6.2 [13]. This study was not submitted to the Research Ethics Committee's evaluation as it is a survey of secondary data and does not directly involve human beings.

Constrained Hierarchical Clustering
Usually, the researcher has difficulty of clustering a set of n objects into k disjoint clusters. Soon, many methods proposed finding the best partition according to a homogeneity criterion based on differences, or for a multivariate distribution function mix model. The most common type is the contiguity constraints.
Such constraints occur when the objects in a cluster are required not only to be similar to one other, but also to comprise a contiguous set of objects (municipality), i.e., the contiguity between each pair of objects is given by a matrix C = (c ij ) n×n , where c ij = 1 if the ith and the jth objects are contiguous, and 0 if they are not [3]. An adjacency matrix used to find a connection between the borders of each municipality in the State of Paraíba. Accordingly, two clusters are regarded as contiguous if there are two objects, one from each cluster, which is linked in the contiguity matrix. Several authors in different areas of knowledge have implemented of constrained clustering procedures [14][15][16][17][18][19][20][21]. For instance, Miele et al. [22] proposed a model-based spatially constrained method that embeds the geographical information within an EM regularization framework by adding some constraints to the maximum likelihood estimation of parameters. It is a partitioning method with neighbourhood constraints, while the Ward-like method [3] is a hierarchical clustering (and not partitioning) method, including spatial/geographical constraints (not necessarily neighbourhood constraints) [23].

Ward-Like Hierarchical Clustering
With algorithm similar to Ward, Ward-like is a constrained hierarchical clustering algorithm that optimizes the convex combination D α = (1 − α)D 0 + αD 1 of this criterion calculated with two dissimilarity matrices, D 0 and D 1 beyond a mixing parameter α ∈ [0; 1]. The first dissimilarity matrix D 0 = [d 0,ij ] is constructed from the Manhattan distance matrix between the 223 municipalities performed with the p = 5 variables socio-epidemiological, i.e., the matrix gives the differences in the feature space, and the dissimilarity matrix D 1 = [d 1,ij ] is constructed from the geographical distance between the 223 municipalities, i.e., the matrix D 1 gives the differences in constraint space. The minimized criterion at each stage is a convex combination of the homogeneity criterion calculated with D 0 and the homogeneity criterion calculated with D 1 . The parameter α (the weight of this convex combination) gives the relative importance of D 0 as compared to D 1 . This parameter controls the weight of the constraint on the quality of the solutions, i.e., for a given value of α[0; 1], the mixing parameter α clearly controls the part of pseudo-inertia due to D 0 and D 1 . The mixed pseudo inertia of the cluster C α k is defined as: where µ α k = ∑ i∈C α k w i is the weight of C α k ; d 0;ij and d 1;ij are the normalized dissimilarity between observations i and j in D 0 and D 1 , respectively. For the choice of α we will use two types of spatial constraints (geographical distances and neighborhood contiguity). For the last case, the dissimilarity matrix D 1 will be constructed from the corresponding adjacency matrix A, i.e., D 1 = 1 n − A with 1 n,ij = 1∀(i, j), a ij equal to 1 if municipalities i and j are neighbourhood 0 otherwise, a ii = 1 by convention. When α increases, the homogeneity that is calculated with D 0 decreases; conversely, the homogeneity calculated increases with D 1 . Therefore, the idea is to determine a value of α, which increases the spatial the geographical homogeneity without deteriorating the quality of the solution on the variables of interest too much.
These homogeneities are measurable using the appropriate pseudo within-cluster inertias. To determine a suitable value for the mixing parameter α, let us assume that the dissimilarity matrix D 1 contains geographical distances between n municipalities, whereas the dissimilarity matrix D 0 contains distances that are based on a n × p 0 data matrix X 0 of p 0 socio-epidemiologic variables measured on these n municipalities. Basically, the notion of proportion of the total mixed (pseudo) inertia explained by the 1]. When β = 0, the denominator W 0 (P 1 ) is the (pseudo) total inertia, and the numerator is the (pseudo) within-cluster inertia W0(P α K ), both based on the D 0 dissimilarity matrix. Therefore, the higher the value of the Q 0 (P α K ) criterion, the more homogeneous is the P α K partition from the socio-epidemiological point of view; β = 1, the denominator W 1 (P 1 ) is the total inertia (pseudo) and the numerator is the (pseudo) inertia within the cluster W 1 (P α K ), both based on the D 1 dissimilarity matrix. Ergo, the higher the value of criterion Q 1 (P α K ), the more homogeneous is the partition P α K from the geographical point of view. When β assumes a value of β ∈]0, 1[, the denominator W β (P 1 ) is a total mixed (pseudo) inertia, and it is not easy to interpret in practice and the numerator W β (P α K ) is the mixed (pseudo) inertia within the cluster.
With R package ClustGeo (version 2.0) developed by Chavent et al. [3], it is possible to implement this hierarchical clustering algorithm with geographical constraints and choose the mixing parameter α provided with two types of spatial constraints (geographical distances and neighbourhood contiguity). Let w i be the weight of the ith observation for i = 1, . . . , n. Let D = [d ij ] be a n × n dissimilarity matrix associated with the n observations, where d ij is the dissimilarity measure between observations i and j. The function hclustgeo of the ClustGeo package is a wrapper of the usual function hclust. It performs the hierarchical clustering of Ward.D, using a dissimilarity matrix D (which is an object of the class dist , i.e., an object obtained with the dist function or a dissimilarity matrix transformed into an object of the class dist with the as.dist function) and the weights w = (w 1 , . . . , w n ) of observations as arguments.
Here, the standardized incidence ratio (SIR) Equation (4) of TB in the 223 municipalities of the State of Paraíba will be applied as non-uniform weights; ergo each municipality will have its weight. The sum of the heights in the dendrogram is equal to the total pseudo-inertia of the data set. The formula for pseudo-inertia of the Ward-like method is: where µ k = ∑ i∈C k w i is the weight of C k . The lower the pseudo-inertia I(C k ), the more homogeneous are the observations that belong to the cluster C k . The function hclustgeo is a wrapper of the usual hclust function with the following arguments: (a) distance: D 0 (Manhattan distance). D 0 is the Manhattan distance matrix between the 223 municipalities performed with the p = 5 variables socio-epidemiological; (b) distance: D 1 . The geographic distances between the municipalities; calculating a distance matrix for geographic points using R through packages: sgeostat (version 1.0-27) [24], geosphere (version 1.5-10) [25], and Imap (version 1.32) [26]. These functions calculate distance matrix for geographic for latitude and longitude points of the center of gravity of the municipalities; c) Members: w = SIR i . The sum of the heights in the dendrogram is equal to the total pseudo-inertia of the data set Equation (2). The spirit of the Ward-like hierarchical clustering is to aggregate the two clusters A and B from a given partition P α K+1 in K + 1 clusters, to that the new partition has minimum mixed within-cluster inertia.

Manhattan Distance
We opted for the Manhattan distance, because the Ward method has already been generalized for use over non-Euclidean distances. According to Strauss and Maltitz [27], Ward's clustering algorithm can use it in conjunction with Manhattan distances.
where i and j are the municipalities with k = 1, . . . , n = 223.

Standardized Incidence Ratio
One simple measure of disease risk is the standardized incidence ratio (SIR). For each area i, i = 1, . . . , n = 223, the SIR is defined as the ratio of observed counts to the expected counts.
The expected counts E i represent the total number of TB cases that one would expect if the population of municipality i behaved the way the population of the State of Paraíba behaves. E i can be calculated while using indirect standardization as j is the rate (number of cases divided by population) in stratum j in the standard population, and n (i) j is the population in stratum j of area i.

Results and Discussion
Have been notified 24.258 TB cases in the State of Paraíba from 2001 to 2018. Of this total, 80% were new cases, 65% patients got cured, 46.8% had less than ten years of study, 81.3% were between working age (20-64), and 6,1% mortality, being men (4.2%) and women (1.9%). Clustering approaches are a useful tool to detect patterns in data sets and generate hypotheses regarding potential relationships. Therefore, the role of cluster analysis is to uncover a certain kind of natural structure in the data set [2]. Figure 1 shows the dendrogram of the dissimilarity matrix D 0 , i.e., the differences in the feature space of socio-epidemiological variables, which is the Manhattan distance matrix between the 223 municipalities performed with p = 5 variables socio-epidemiological. To choose the suitable number K of clusters, we focus on the Ward dendrogram based on the p = 5 socio-epidemiological variables, that is using D 0 only. The visual inspection of the dendrogram in Figure 1 suggests retaining K = 5 clusters. The 223 municipalities grouped in their respective clusters according to socio-epidemiological similarity, namely, cluster 1 (42 municipalities), cluster 2 (37), cluster 3 (36), cluster 4 (90), and cluster 5, only 18 municipalities. The partition corresponding to the five clusters is shown on the map presented in Figure 2.
Geographically, we perceive clusters well dispersed according to socio-epidemiological variables; that is, the clusters are not strictly contiguous. The interpretation of clusters according to the initial socio-epidemiological variables is interesting. Figure A1 in Appendix A show the variable boxplots for each cluster (top row). Cluster 1, the female mortality rate is the lowest of all clusters, while male mortality has a higher median. Cluster 2 has a high rate of new cases and cure, and a higher female mortality rate than in other clusters. Cluster 3, people of working age (20-64) has a rate that is higher than the average value of the study area, as well as being higher than in other clusters. Similarly, cluster 4 also has a high rate of new cases and a high average age of TB patients of working age and is also greater than the average value of the study area. Cluster 5, high rates of new cases and people of working age, and the lowest cure rate in all other clusters. We will introduce the matrix D 1 of geographical distances into hclustgeo, i.e., a partition taking into account the geographical constraints in order to obtain geographically more compact clusters. For this, it is necessary that a mixing parameter is selected α to improve the geographical cohesion of the five groups without adversely affecting the socio-epidemiological cohesion. In Figure 3, we have the mixing parameter α ∈ [0, 1] defines the importance of D 0 and D 1 in the clustering process with separate calculations for socio-epidemiologic homogeneity and the geographic cohesion of the partitions obtained for a range of different values of α and the five clusters.
The next plot, Figure 3, shows the choice of α for partition, taking into account the geographical constraints. Figure 3 gives the plot of the proportion of explained pseudo-inertia calculated with D 0 (the socio-epidemiological distances), which is equal to 0.50 when α = 0 and decreases when α increases (in solid black line). On the contrary, the proportion of explained pseudo-inertia calculated with D 1 (the geographical distances) is equal to 0.87 when α = 1 and it decreases when α decreases (dashed line).
The obtaining of the partition taking into account the geographic constraints with the normalized proportion of explained inertias at the bottom of Figure 3 (i.e., Q * 0 (P α K ) and Q * 1 (P α K ), shows the value α that aims to increase the spatial contiguity, as seen in detail in Table 1.
The value of α is a trade-off between the loss of socio-economic homogeneity and the gain of geographic cohesion. When α = 0, the geographical dissimilarities are not taken into account. When α = 1, it is the socio-epidemiologic distances that are not taken into account; the clusters are obtained with the geographical distances only. The plot presented in Figure 3 (bottom) would appear to suggest choosing α = 0.17, which corresponds to a loss of only (1 − 0.61407565 = 38.59%) of socio-epidemiologic with a SIR of each municipality, and 17.75% increase in geographical homogeneity.  The increased geographical cohesion of this partition can be seen in Figure 4. In Figure 4, a gain significative in spatial homogeneity is perceived. Figure A1 presented in Appendix A shows the boxplots of the variables for each cluster of the partition (middle row). Clusters 1, 3, and 4 seem to differentiate among themselves mainly due to the slight increase in deaths of male patients in cluster 4 and greater variation in the cure proportion of cluster 3. Cluster 5 differed from cluster 2 by the slight increase in deaths, with greater proportions for males, whereas cluster 2 has a higher average number of working-age TB patients and greater variation proportion of cure.
The next plot, Figure 5, shows the choice of α for partition, taking into account the neighborhood constraints. At the bottom of Figure 5, the plot of the normalized proportion of explained inertias (i.e., Q 0 (P α K ) and Q 1 (P α K )) suggests retaining α = 0.12 slightly favoring socio-epidemiological homogeneity versus geographical homogeneity.  It remains only to determine this final partition for K = 5 clusters and α = 0.12. Figure 6 provides the corresponding map. Figure 6 shows that the clusters are spatially more compact than those in Figure 5. However, it is known that this approach creates divergences in the adjacency matrix, which gives more importance to the neighborhoods. Thereupon, as the approach is based on soft contiguity restrictions, municipalities that are not neighbours may be in the same clustering, according occurs with the municipalities of Lucena, Coxixola, Congo, Zabelê and Santa Inês in cluster 2. The quality of the partition in Figure 6 is slightly worse than that of the partition in Figure 4, according to the Q 0 criterion (61.41% versus 82.25%). Figure 6. Map of the partition with K = 5 clusters based on the socio-epidemiological distances D 0 and the "neighborhood" distances of the municipalities D 1 with α = 0.12.

Conclusions
When considering spatial/geographical constraints, the hierarchical grouping becomes even more complete in detecting patterns in data sets of different dimensions. According to the weights that are given to the geographical differences in this combination, the solution will have more or less spatially contiguous clusters. Through our results, the non-uniform weights w defined by the standardized incidence ratio (SIR) of TB contributed to the increase in clarity both from a spatial and socio-epidemiological point of view.
Therefore, the application of the Ward-Like method becomes indispensable in understanding the socio-epidemiological reality of the State of Paraíba from a spatial perspective, thus facilitating decisions in the development of public policies and more effective health actions in the fight against tuberculosis.
Future work would be to add new socio-epidemiological variables and, instead of the municipalities, use the Health Regions that are responsible for the organization, planning, and execution of health actions and services in the state of Paraíba.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: DATASUS Department of the Unified Health System IBGE Brazilian Institute of Geography and Statistics SIR standardized incidence ratio TB tuberculosis WHO World Health Organization