Infectious Disease Relational Data Analysis Using String Grammar Non-Euclidean Relational Fuzzy C-Means

Statistical analysis of infectious diseases is becoming more important, especially for the development of prevention policy. This requires epidemiology, the study of the relationship between the occurrence of a disease and who, when, and where. In this paper, we develop the string grammar non-Euclidean relational fuzzy C-means (sgNERF-CM) algorithm to find relationships inside the data from the age, career, and month viewpoints for all provinces in Thailand for dengue fever, influenza, and Hepatitis B virus (HBV) infection. Dunn's index is used to select the best models because of its ability to identify compact and well-separated clusters. We compare the results of the sgNERF-CM algorithm with those of the string grammar relational hard C-means (sgRHCM) algorithm. In addition, their numerical counterparts, i.e., the relational hard C-means (RHCM) and non-Euclidean relational fuzzy C-means (NERF-CM) algorithms, are also included in the comparison. We found that the sgNERF-CM algorithm is better than the numerical counterparts in all cases and better than the sgRHCM algorithm in most cases. From the results, we found that the month-based dataset does not help in relationship-finding, since the diseases tend to occur all year round. People from different age ranges in different regions of Thailand have different numbers of dengue fever infections. The occupations with a higher chance of dengue fever infection are the student and teacher groups in the central, north-east, north, and south regions. Additionally, students in all regions except the central region have a high risk of dengue infection. For the influenza dataset, we found that people aged from more than 1 year to 64 years old have a higher number of influenza infections in every province. Most occupations in all regions have a higher risk of influenza infection. For the HBV dataset, people in all regions aged between 10 and 65 years old have a high risk of contracting the disease. In addition, only the farmer and general contractor groups in all regions have a high chance of contracting HBV.


Introduction
Statistical studies involving infectious diseases have been going on for some time [1]. Some studies model and analyze the development of diseases [1][2][3][4]. Another type of study in infectious diseases is epidemiology, i.e., the study of the frequency of disease and how the frequency differs across groups of people [5][6][7][8][9][10][11][12]. One of the considerations of epidemiology is to look at the relationships inside the data itself. There are many existing methods for analyzing data, including clustering algorithms. However, it has been shown that relational data clustering can find relationships among data better than regular clustering algorithms [13]. Relational data [13] is described by $R = [r_{ij}]_{n \times n}$, where $r_{ij}$ is the relationship between the $i$th and $j$th objects, and $n$ is the number of objects involved. There exist several relational clustering algorithms, e.g., the fuzzy non-metric (FNM) model, the assignment prototype (AP) model, and the relational hard C-means (RHCM), relational fuzzy C-means (RFCM), and non-Euclidean relational fuzzy C-means (NERF-CM) algorithms [13][14][15]. However, these algorithms deal with numerical feature vectors, and the relationship is formed by the pairwise distance between those vectors. Meanwhile, data in healthcare are normally composed of numeric and non-numeric information. Syntactic pattern recognition [16][17][18] is more suitable in this scenario. Although there are a few syntactic clustering algorithms [15,16,18-23] that deal with non-numeric datasets, only our relational clustering algorithm, namely the string grammar relational hard C-means (sgRHCM) algorithm [24], can deal with non-numeric relational datasets. Since there is normally uncertainty in a dataset, it would be better to use string grammar relational fuzzy clustering to cope with the problem. Therefore, in this paper, we introduce a string grammar non-Euclidean relational fuzzy C-means (sgNERF-CM) algorithm. This algorithm is an extension of its numeric counterpart, the NERF-CM algorithm.
In Thailand, reports have been published on the occurrence of infectious diseases [25]. However, there has been no report on the relationship between province and the occurrence of disease based on age, career, and month. In addition, one might want to know whether there is any relationship between different provinces in terms of the number of infections. The collected raw data do not provide this information directly, nor do they show clusters based on the relationship between provinces. One might use a numeric clustering algorithm to find clusters of provinces with similar characteristics based on the occurrences of a disease, but such an algorithm does not cluster based on the disease occurrence relationship among provinces. Moreover, numeric relational clustering algorithms cannot be used directly if the dataset does not contain only numeric values. In that case, string grammar relational clustering might be more appropriate. Finding clusters based on the disease occurrence relationship of provinces might help the country to formulate a good prevention policy, which in turn requires studying the epidemiology of these infectious diseases.
In this paper, we study dengue fever, influenza, and Hepatitis B virus (HBV) infection. We use our sgNERF-CM algorithm to analyze these health datasets to see whether there is any relationship between province and the occurrence of each of the three diseases based on age, career, and month. The contribution of the paper is therefore two-fold. First, from a technical perspective, a new algorithm, namely the sgNERF-CM algorithm, is developed. Second, from an application perspective, the new sgNERF-CM algorithm is applied to real-world health datasets containing string grammar data, not numeric data.

String Grammar Non-Euclidean Relational Fuzzy C-Means (sgNERF-CM) Algorithm
We briefly describe the string grammar non-Euclidean relational fuzzy C-means (sgNERF-CM) algorithm here. Let $S = \{s_1, s_2, \ldots, s_N\}$ be a set of $N$ strings [18], each of which is a sequence of symbols (primitives). Suppose $s_k = (x_1 x_2 \ldots x_l)$ is a string of length $l$, where each $x_i$ is a member of a set $\Sigma$ of defined symbols or primitives ($x_i \in \Sigma$ for $i = 1, \ldots, l$). The relationship $r_{ij}$ between input strings $s_i$ and $s_j$ is computed using the Levenshtein distance $\mathrm{Lev}(s_i, s_j)$ [18] (the smallest number of transformations needed to derive one string from the other). The spread transformation parameter ($\beta$) is used to convert non-Euclidean dissimilarity relational data into Euclidean dissimilarity data. This transformation is designed to prevent the failure that can occur when non-Euclidean dissimilarity relational data are used [14]. The sgNERF-CM algorithm is shown below.
Store: Relation matrix $R = [r_{ij}]_{N \times N}$, where $r_{ij} = \mathrm{Lev}(s_i, s_j)$.
Calculate the distance $d_{ik}$ [5].
If $d_{ik} < 0$ for any $i$ and $k$, then calculate the increment of the spread parameter $\beta$, update the distances $d_{ik}$, and update $\beta$.
Update the membership values, with one update rule if $d_{ik} > 0$ for all $i$ and another otherwise.
To check the cluster validity after the algorithm converges, we compute the compactness and separation of the clusters using Dunn's validity index [26,27], a standard cluster validity measure of the goodness of a clustering result. The index is the ratio of the smallest distance between two clusters, $\mathrm{dist}(c_i, c_j)$ for clusters $c_i$ and $c_j$, to the largest cluster diameter, where the diameter $\mathrm{diam}(c_j)$ of cluster $c_j$ is the maximum pairwise distance of the strings in that cluster. The larger the value of Dunn's index, the better the resulting clusters. One might wonder why Dunn's index is used to evaluate the cluster validity in this case, when there are several existing cluster validity methods. The reason is that this index is simple to calculate and can be easily applied to string grammar clustering. Additionally, to date, there is no cluster validity measure for string grammar clustering in the literature.
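As an illustration of the validity check described above, the following minimal sketch computes Dunn's index for a hard labeling of strings. It assumes that the distance between two clusters is the Levenshtein distance of their closest pair of strings (the usual choice in Dunn's index) and that each character is one primitive; the function names `levenshtein` and `dunn_index` and the toy strings are hypothetical and are not taken from the datasets.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Smallest number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dunn_index(strings: List[str], labels: List[int]) -> float:
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters: Dict[int, List[str]] = {}
    for s, lab in zip(strings, labels):
        clusters.setdefault(lab, []).append(s)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("Dunn's index needs at least two clusters")
    # Diameter: maximum pairwise distance of strings inside one cluster.
    max_diam = max(levenshtein(s, t) for g in groups for s in g for t in g)
    # Separation (assumption): distance of the closest pair across two clusters.
    min_sep = min(levenshtein(s, t)
                  for i, gi in enumerate(groups)
                  for gj in groups[i + 1:]
                  for s in gi for t in gj)
    return float("inf") if max_diam == 0 else min_sep / max_diam


# Hypothetical toy strings: two tight groups that are far from each other.
toy = ["1,2,3", "1,2,4", "9,9,9,9", "9,9,9,8"]
good = dunn_index(toy, [0, 0, 1, 1])  # labeling that follows the groups
bad = dunn_index(toy, [0, 1, 0, 1])   # labeling that mixes the groups
print(good, bad)
```

The labeling that follows the two groups yields the larger index, which is the behavior used to select among candidate models.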
To assign a test string $s_t$ to a cluster, compute its relationship distance to each cluster [5]; the string is assigned to the closest cluster.
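The iteration and the test-string assignment described in this section can be sketched as follows. Because the update formulas are not reproduced in the text, this sketch assumes the standard NERF-CM updates of [14] applied to a relation matrix of Levenshtein distances; the fuzzifier `m`, the random initialization, the stopping tolerance, the treatment of zero distances, and the way a test string is scored in `assign` are all assumptions, and the names `nerf_cm` and `assign` are hypothetical.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def _centers(U: Matrix, m: float) -> Matrix:
    """Cluster centers as normalized weight vectors over the n training strings."""
    V = []
    for row in U:
        w = [u ** m for u in row]
        total = sum(w)
        V.append([x / total for x in w])
    return V


def _spread(R: Matrix, k: int, j: int, beta: float) -> float:
    """Entry (k, j) of R_beta: beta is added to every off-diagonal entry of R."""
    return R[k][j] + (beta if j != k else 0.0)


def _half_quad(R: Matrix, v: List[float], beta: float) -> float:
    n = len(v)
    return 0.5 * sum(v[k] * _spread(R, k, j, beta) * v[j]
                     for k in range(n) for j in range(n))


def nerf_cm(R: Matrix, c: int, m: float = 2.0, max_iter: int = 100,
            tol: float = 1e-6, seed: int = 0) -> Tuple[Matrix, Matrix, float]:
    """Return memberships U (c x n), center weights V (c x n), and the final beta."""
    n = len(R)
    rng = random.Random(seed)
    U = [[rng.random() + 1e-3 for _ in range(n)] for _ in range(c)]
    for k in range(n):  # normalize so the memberships of each string sum to 1
        col = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= col
    beta = 0.0
    for _ in range(max_iter):
        V = _centers(U, m)
        # d_ik = (R_beta v_i)_k - v_i^T R_beta v_i / 2
        D = []
        for v in V:
            half = _half_quad(R, v, beta)
            D.append([sum(_spread(R, k, j, beta) * v[j] for j in range(n)) - half
                      for k in range(n)])
        if any(d < 0 for row in D for d in row):
            # Squared norm ||v_i - e_k||^2 = ||v_i||^2 - 2 v_ik + 1
            N = [[sum(x * x for x in v) - 2 * v[k] + 1 for k in range(n)] for v in V]
            delta = max(-2 * D[i][k] / N[i][k]
                        for i in range(c) for k in range(n) if N[i][k] > 1e-12)
            D = [[D[i][k] + 0.5 * delta * N[i][k] for k in range(n)] for i in range(c)]
            beta += delta
        new_U = [[0.0] * n for _ in range(c)]
        for k in range(n):
            zeros = [i for i in range(c) if D[i][k] <= 1e-12]
            if zeros:  # string coincides with a center: share membership among them
                for i in zeros:
                    new_U[i][k] = 1.0 / len(zeros)
            else:
                for i in range(c):
                    new_U[i][k] = 1.0 / sum((D[i][k] / D[j][k]) ** (1.0 / (m - 1))
                                            for j in range(c))
        change = max(abs(new_U[i][k] - U[i][k]) for i in range(c) for k in range(n))
        U = new_U
        if change < tol:
            break
    return U, _centers(U, m), beta


def assign(r_t: List[float], R: Matrix, V: Matrix, beta: float) -> int:
    """Closest cluster for a test string whose distances to the training strings are r_t."""
    dists = [sum((r + beta) * x for r, x in zip(r_t, v)) - _half_quad(R, v, beta)
             for v in V]
    return min(range(len(V)), key=dists.__getitem__)


# Hypothetical relation matrix: strings 0-1 and strings 2-3 form two tight groups.
R = [[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 1], [5, 5, 1, 0]]
U, V, beta = nerf_cm(R, c=2)
labels = [max(range(2), key=lambda i: U[i][k]) for k in range(4)]
test_cluster = assign([1, 1, 5, 5], R, V, beta)  # test string close to strings 0 and 1
print(labels, test_cluster, beta)
```

In this sketch a test string is scored with the same distance expression as a training string, using its vector of Levenshtein distances to the training strings in place of a row of the relation matrix.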

System Description
The system used in this research is shown in Figure 1. Each sample datum is encoded into a string sequence ($s_i$). Then, the relational matrix between all string sequences is computed using the Levenshtein distance $\mathrm{Lev}(s_i, s_j)$ [18]. The sgNERF-CM algorithm is iterated until it converges. The final clusters, based on the disease occurrence relationship of provinces, are then produced. To find which cluster a test sample belongs to, based on the relation of each province, we encode that sample into a string sequence. Then, the relationship distance in Equation (2) is computed, and the test sample is assigned to the closest cluster.

Simulation Results
The dengue fever, influenza, and HBV datasets were collected by the Bureau of Epidemiology, Department of Disease Control, Ministry of Public Health, Thailand (http://www.boe.moph.go.th/boedb/surdata/) (accessed on 23 March 2020). These datasets are reports of the number of suspected infections in different categories, i.e., the number of infected people arranged by age, career, and month in each province in Thailand. The reports are collected by the provincial public health office of each province, government hospitals, and health centers. Although the age and career categories are not as good as other categories in health development, we still apply our algorithm to these categories because the results might help policymakers look at the influence of age or career on infection. We split the datasets into training and blind test datasets as follows. The training dataset for dengue fever is from 2010 and 2012 to 2018, whereas that for influenza is from 2006 to 2018. The reports of HBV from 2006 to 2018 are used as a training dataset. In the training process, the algorithm is applied several times for each parameter setting. The parameter settings are selected by a randomization method. After the algorithm converges, provinces with similar numbers of occurrences are grouped into the same cluster. Then, the best model is selected for the cluster assignment of the blind test dataset. The blind test dataset for each disease and each category is from 2019. Tables 1-3 show examples of input data in each category from the dengue fever dataset. Please note that we intentionally selected these examples to show the variety of data in each category.
We then convert the data into string data by concatenating the numbers in each field, separated by commas, as shown in Table 4. For example, if the 1st to 4th numbers are 30, 4, 5, and 100, then the concatenated string is "30,4,5,100". We train our algorithm on the randomized data from the training dataset. After we select the result with the highest Dunn's index, we test that model on the blind test dataset. The numbers of training and blind test samples for all datasets are shown in Table 5.
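The string conversion above, followed by the construction of the relation matrix, can be sketched as below. This is a minimal illustration that assumes each character of the concatenated string (a digit or a comma) is treated as one primitive; the first row reuses the example values 30, 4, 5, and 100, while the other rows and the function names `encode_row`, `levenshtein`, and `relation_matrix` are hypothetical.

```python
from typing import List, Sequence


def encode_row(counts: Sequence[int]) -> str:
    """Concatenate the counts of one record with commas."""
    return ",".join(str(c) for c in counts)


def levenshtein(a: str, b: str) -> int:
    """Smallest number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def relation_matrix(strings: List[str]) -> List[List[int]]:
    """Pairwise Levenshtein distances between all encoded records."""
    return [[levenshtein(s, t) for t in strings] for s in strings]


# First row: the example from the text; remaining rows are made-up values.
rows = [[30, 4, 5, 100], [30, 4, 6, 100], [3, 2, 3, 2]]
strings = [encode_row(r) for r in rows]
R = relation_matrix(strings)
for s, r in zip(strings, R):
    print(s, r)
```

Records that differ in a single digit end up one edit apart, while records with different magnitudes (and therefore different string lengths) are further apart.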
In the experiment, the number of clusters varies from 2 to 10. To show the ability of the sgNERF-CM algorithm, we also implement the sgRHCM algorithm [24] and their numerical counterparts, i.e., the non-Euclidean relational fuzzy C-means (NERF-CM) algorithm [14] and the relational hard C-means (RHCM) algorithm [15]. The best results for the age, career, and month categories are shown in Tables 6-8, respectively. We can see that Dunn's indices for the sgNERF-CM and sgRHCM algorithms are better than those of their numerical counterparts in all the experiments. The index for the sgNERF-CM algorithm is comparable to or better than that for the sgRHCM algorithm in all the experiments.
The sgNERF-CM algorithm can group provinces if there is any relationship based on each specified category. To show this ability, for the dengue fever dataset, examples of the training samples and blind test samples in each group are shown in Tables 9-11. Examples of clustering results for the influenza dataset are shown in Tables 12-14. Finally, those for the HBV dataset are shown in Tables 15-17. From Table 9, we can see that in clusters 2 and 4, people in the north, central, east, north-east, and south regions aged between 10 and 24 years old have a higher number of dengue fever infections, whereas in clusters 3 and 5, people in the north-east, central, and south regions aged between 10 and 14 years old have a higher number of dengue fever infections.
The provinces that are grouped into the same cluster based on their relationship to the number of disease occurrences in each category are shown in the figures. The figures show that the sgNERF-CM can group provinces based on their relationship to the number of disease occurrences. This is also shown in the following discussion.
Figure 2. The blind test results for the dengue fever dataset for (a) age range, (b) career, and (c) month category using the best models from the sgNERF-CM algorithm.

Examples of clustering results (excerpt):
Dataset | Cluster | Year, Province | String
Training | 2 | 2559, Loei | 22,26,24,19,17,11,19,13,5,10,14,19
Blind Test | 2 | 2562, Si Sa Ket | 29,15,15,12,21,23,10,13,13,16,18,11
Blind Test | 2 | 2562, Surat Thani | 17,16,15,15,16,9,12,17,15,10,17,11
Training | 3 | 2561, Phrae | 3,2,3,2,1,2,3,2,3,3,2,1
Training | 3 | 2559, Phichit | 11,6,7,9,9,5,6,2,9,5,9,2

From Table 10, we can see that the student and teacher groups have a higher number of dengue fever infections in the 3rd cluster (central, north-east, north, and south regions). In the 2nd cluster (central, north-east, north, and south regions), the student group has a higher number of dengue fever infections. However, the number of infections in the student group in the 3rd cluster is higher than that in the 2nd cluster. In the 4th cluster, only the student group in all regions except for the central region has a higher number of infections. From this category, we can see that the student group is more likely to be infected by dengue fever than other occupations. Hence, in this case, a prevention policy can be directed to students in schools, e.g., allocating budget to schools for this particular prevention, providing knowledge to students, giving a recommendation for schools to clean mosquito-breeding habitats, etc. From Table 11, only in the 2nd cluster (north, north-east, west, and south regions) is the number of dengue infections low. In the 1st cluster (central, north-east, east, and south regions), the number of infections is high in all 12 months. In this case, we can see that infections occur in every month of the year in all regions. The prevention policy should therefore be applied all year round in the abovementioned regions. This can be done in the form of a recommendation to clean mosquito-breeding habitats regularly, promotion of awareness of the disease at all times, etc.
From Table 12, we can see that the 2nd and 3rd clusters behave similarly, i.e., people aged 1 to 64 years old have a higher number of influenza infections in every province. However, in the 4th cluster, only 4 provinces have a lower number of infections. From Table 13, we can see from the blind test examples that the student group in all regions has a higher number of influenza infections in the 1st cluster. In the 2nd cluster from the training examples, only the student and unknown groups have a higher number of influenza infections. However, when we look at the blind test results, we can see that most of the samples are grouped into the 1st cluster. This might be because most of the occupations have a high number of infections. Hence, the generated strings are more closely related to most of the samples in the 1st cluster than to those in the 2nd cluster. From Table 14, we can see that the number of infections is high in all the clusters (covering all regions) in all 12 months of the year, meaning that the influenza prevention policy should be implemented in all regions. The policy can be executed in the form of a screening and isolation system, and a recommendation or promotion of the use of sanitary masks in all regions. Additionally, health personnel should emphasize these policies among the student and unknown groups.
From Table 15, we can see that in all clusters, there are higher numbers of HBV infections in a group of people in all regions aged 10 to 65 years old. From Table 16, we can see that the farmer and general contractor groups in all regions have a high number of HBV infections in the 1st cluster. However, in the 2nd cluster, only the general contractor group in all regions has high numbers of HBV infections. To develop a prevention policy, these two occupations should be focused on more than other occupations. Promotion of disease awareness and vaccination of people, especially those in these two occupations, should be embedded into health policy. From Table 17, the month information is not useful. This is because when we look at the 3rd cluster (all regions), the number of infections is low in every month, whereas in the 2nd cluster (all regions), the numbers are high in every month.

Conclusions
To develop health policy, especially in infectious diseases, health data analysis is becoming increasingly important. Epidemiology is the study of finding relationships between occurrences of a disease and other environmental factors (who, when, and where). To analyze infectious disease datasets, we developed the string grammar non-Euclidean relational fuzzy C-means (sgNERF-CM) algorithm to find relationships inside the data from the age, career, and month viewpoints for all provinces in Thailand for dengue fever, influenza, and HBV infection. The input datasets are the reports of the disease occurrences arranged by age, career, and month in each province in Thailand. The developed algorithm is implemented to group provinces based on their relationship of disease occurrences in each category. The cluster results provide additional information to aid health personnel or policymakers to see similarities in each group. This similarity will ultimately help in the development of health policy in the future.
To show the sgNERF-CM algorithm's performance and ability to cope with uncertain data, we compare the results with those of the string grammar relational hard C-means (sgRHCM), the relational hard C-means (RHCM), and the non-Euclidean relational fuzzy C-means (NERF-CM) algorithms. The results show that the sgNERF-CM algorithm is better than the sgRHCM algorithm in most cases, and better than the numerical algorithms in all cases. We selected as the best sgNERF-CM model the one yielding the highest Dunn's index because it indicates the most compact and best-separated clusters. In the blind test process, we found that people from different age ranges in different regions of Thailand have different numbers of dengue fever infections. Student and teacher groups from the central, north-east, north, and south regions have higher chances of being infected by dengue fever. Additionally, the student group in all regions except for the central region has a high risk of dengue infection. In every month, people in the central, north-east, east, and south regions should be made aware of the prevention of dengue fever. For the influenza dataset, we found that people aged 1 to 64 years old have a higher number of influenza infections in every province. Most occupations in all regions have a higher risk of influenza infection. It is not surprising that influenza infection in all regions happens all year round. For the HBV dataset, people in all regions aged between 10 and 65 have a high risk of infection. In addition, only the farmer and general contractor groups in all regions have a high chance of contracting the disease. Again, it is not surprising that the number of infections by month does not contain specific information, since the infections can happen all year round. This paper provides information extracted from the collected infectious disease data. We hope that this information will be beneficial in the development of prevention policy.
For future work, we plan to apply our sgNERF-CM algorithm to extract useful information for other diseases. Additionally, we can further use the cluster results to predict disease development in a given region, age, or occupation.