Principal Component Analysis of Categorized Polytomous Variable-Based Classification of Diabetes and Other Chronic Diseases

Diabetes mellitus, a chronic disease, is assuming epidemic proportions worldwide, so estimating its prevalence is important in all respects. Researchers have introduced various methods, but classification techniques still need improvement. This paper combines a data mining approach with principal component analysis (PCA) on a single platform for polytomous variable-based classification of diabetes mellitus and selected chronic diseases. The PCA results report the eigenvalues and the total variance explained for the principal component (PC) solution. Twelve attributes were analyzed with the aim of capturing the pattern of correlation with as few factors as possible. Usually, factors with large eigenvalues are retained; here, the first five components have eigenvalues large enough to be retained. Their explained variances are 18.9%, 14.0%, 13.6%, 10.3%, and 8.6%, respectively, accounting for approximately 65.3% of the total variance. We further applied K-means clustering using the first two PCs. In addition, correlation results between diabetes mellitus and the selected diseases reveal that diabetes patients are more likely to have kidney problems and hypertension. The study therefore validates the proposed polytomous method for classification techniques. Such a study is important for better assessment in low socio-economic regions around the globe.


Introduction
Diagnosis of chronic diseases is essential in the healthcare field, as these diseases persist for a long time. The major chronic diseases include diabetes, cardiovascular disease, kidney disease, cancer, and stroke. Classification of disease helps in taking precautionary actions, and effective treatment at an initial stage has been found to be helpful for patients [1].
Principal component analysis (PCA), a popular method for data reduction, has been found to be a useful step in classification [2]. PCA and machine learning methods have been applied successfully in medical domains, for example, in the diagnosis of diabetes, therapy, prognosis of breast cancer recurrence, localization of a primary tumor, and diagnosis of thyroid diseases. Polytomization of variables may be applied when categorical variables have multiple outcomes, allowing more subjective assessment and evaluation, as the factor scores obtained from binary variables are linearly related [1][2][3].
As a chronic disease, diabetes mellitus is assuming epidemic proportions worldwide, affecting developed and underdeveloped countries alike [4]. The most recent prevalence figure

Diabetes Mellitus
Diabetes is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot efficiently use the insulin it produces. Insulin is a hormone that controls blood sugar. Hyperglycemia, or elevated blood sugar, is a common effect of uncontrolled diabetes and, over time, leads to severe damage to many of the body's systems [19].
The diabetes conditions include Type I diabetes, also known as "insulin-dependent" (IND) diabetes. It is characterized by beta-cell destruction caused by an autoimmune process, typically leading to absolute insulin deficiency. Ultimately, all IND patients require insulin therapy to maintain normoglycemia.
Type II diabetes is also known as "non-insulin-dependent" (NID) diabetes. It is the most common form of diabetes mellitus and is highly associated with a family history of diabetes, older age, obesity, and lack of exercise. NID comprises 80% to 90% of all cases of diabetes mellitus. Individuals with Type II diabetes primarily exhibit intra-abdominal (visceral) obesity, which is directly related to the presence of insulin resistance, in addition to hypertension and dyslipidemia.
The disorder identified in women who develop IND during pregnancy, and in women with undiagnosed asymptomatic NID discovered during pregnancy, is classified as Gestational Diabetes (GTD) [19,20].
Some low- and middle-income countries use the most straightforward test, which does not require fasting beforehand: a blood glucose level of 200 mg/dl or more probably indicates diabetes but has to be reconfirmed [20].
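The screening rule above can be expressed directly in code. The sketch below encodes only the 200 mg/dl threshold stated in the text; the class and method names are our own illustrative choices, and the rule is a screen, not a diagnosis.

```java
// Minimal sketch of the non-fasting screening rule described above:
// a random blood glucose of 200 mg/dl or more probably indicates
// diabetes, but the result has to be reconfirmed with a second test.
public class RandomGlucoseScreen {
    static boolean probablyDiabetic(double randomGlucoseMgDl) {
        return randomGlucoseMgDl >= 200.0;
    }

    public static void main(String[] args) {
        System.out.println(probablyDiabetic(215.0)); // true -> reconfirm
        System.out.println(probablyDiabetic(140.0)); // false
    }
}
```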

Sampling Procedure
The sampling procedure used in distributing the questionnaires and determining the sample sizes began with probability cluster sampling. The entire population was divided into clusters, which entails selecting a subset of individuals from the populace to estimate the characteristics of the entire population [21]. All the attributes determine one or more properties of the observed subjects, treated as independent individuals.
A nonprobability sampling procedure was also applied. It is a process by which a researcher selects a sample based on subjective judgment rather than random selection; thus, unlike in the probabilistic method, not all members of the population have an equal chance of participating in the study.
The northwestern part of Nigeria comprises seven states. We divided each state into three clusters according to its senatorial zones (i.e., central, north, and south). Government-owned hospitals were chosen in each cluster in six of the states, whereas in Kano State we doubled the number due to its population.

Categorization of Continuous Variables
Continuous variables (CVs) are encountered in many situations; CVs such as age, body mass index, blood glucose level, and blood pressure, among others, were measured. To relate an outcome variable to a single continuous variable, an appropriate regression model is essential.
CVs may be converted into categorical variables by grouping values into two or more categories. Dichotomized continuous variables occur in one of two possible states, which can be labeled zero and one, e.g., "improved/not improved", "completed task/failed to complete the task", etc. When the values are grouped into more than two categories, the variables are referred to as polytomous continuous variables [22].
Several advantages are claimed for dichotomizing continuous variables, but they have no statistical grounds [23]. The most common argument seems to be simplicity: placing all individuals into two groups is widely perceived to simplify statistical analysis greatly and to lead to straightforward interpretation and presentation of results.
The literature shows that dichotomization of variables has been criticized extensively [22][23][24][25][26]. According to the authors of [27], when the real risk increases (or decreases) monotonically with the level of the variable of interest, the apparent spread of risk will increase with the number of groups used. With only two groups, one may seriously underestimate the extent of variation in risk; that is, when individuals are divided into just two groups, considerable variability may be subsumed within each group.
Therefore, the study adopts polytomous grouping of categorical variables, except where they are naturally dichotomous. The procedures involved are presented in the flowchart in Figure 1.

The proposed polytomous categorization procedure was executed in Java code; the process of classifying the variables was implemented as an if-else-if statement algorithm. The algorithm performs the tasks and screens each categorical variable before assigning it to the appropriate group.
The study used seven if-else-if statements for the classification process, successfully categorizing seven different categorical variables. The data flow for the categorized variables of interest is presented below in Figure 2.
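The if-else-if screening described above can be sketched as follows for one continuous attribute, BMI. The study's actual cut-offs are not stated in the text; the standard WHO BMI bands are used here purely for illustration, and the class name is our own.

```java
// Sketch of the if-else-if categorization step described above,
// applied to BMI. The cut-offs are the standard WHO BMI bands,
// used for illustration only; the study's actual grouping rules
// are not given in the text.
public class BmiCategorizer {
    static String categorize(double bmi) {
        if (bmi < 18.5) {
            return "underweight";
        } else if (bmi < 25.0) {
            return "normal";
        } else if (bmi < 30.0) {
            return "overweight";
        } else {
            return "obese";
        }
    }

    public static void main(String[] args) {
        System.out.println(categorize(22.0)); // normal
        System.out.println(categorize(31.5)); // obese
    }
}
```

Each of the seven categorized variables would get its own chain of this form, with cut-offs chosen for that attribute.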

Principal Component Analysis
Principal component analysis is a highly useful multivariable analysis (MVA) technique for summarizing the information contained in a continuous multivariable dataset by reducing the dimension of the dataset without losing useful information. PCA aims to identify hidden patterns in a given dataset, reduce redundancy, and eliminate noise.
PCA works in a highly correlated environment: it selects sets of variables that are highly dependent on each other but, at the same time, are uncorrelated with the other subsets of variables; each such subset is combined to form a factor.
The basic idea here is that these newly formed factors drive the fundamental process through which the variables in the dataset are thought to correlate with each other. The specific goals are to summarize patterns of correlation among observed variables, to reduce a large number of observed variables to a smaller number of factors, and to provide an operational definition of an underlying process in terms of the observed variables [28]. The PCA model expresses each component as a linear combination of the observed variables, for i, j = 1, 2, 3, …, p. In the correlation matrix for the complete data, the sizes of some correlations place constraints on the sizes of others.
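The equation referenced in the text does not survive in this copy. A standard statement of the principal component model, which appears to be what is intended here, is the following (our reconstruction, under the usual assumptions for standardized data):

```latex
% Standard principal component model (assumed; the original equation
% is missing from the text). Each component is a linear combination
% of the p observed, standardized variables:
\[
  Y_i \;=\; a_{i1} X_1 + a_{i2} X_2 + \cdots + a_{ip} X_p
       \;=\; \sum_{j=1}^{p} a_{ij} X_j,
  \qquad i, j = 1, 2, 3, \ldots, p,
\]
% where $Y_i$ is the $i$-th principal component, $X_j$ the $j$-th
% observed variable, and $a_{ij}$ the loading of variable $j$ on
% component $i$. For standardized data the correlation matrix $R$ has
% unit diagonal and eigenvalues $\lambda_1 \ge \cdots \ge \lambda_p$
% with $\sum_{i} \lambda_i = p$, so the proportion of total variance
% explained by the $i$-th component is $\lambda_i / p$.
```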

Software Programming Language
The programming software used for the experiments is R, a free, open-source environment for programming, statistics, and graphics. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand; the name (R) came from their respective first names [29]. It can be accessed online via http://www.r-project.org. In addition, an object-oriented programming language (the Java platform) was used in the process of categorizing the variables.


Result
In this study, combined data mining and PCA techniques were applied to a real-life dataset. Data on 281 patients suffering from diabetes mellitus and related diseases were collected from the northwestern part of Nigeria; the results are presented in the figures and tables below. Table 1 presents descriptive statistics of some attributes and their prevalence among the 281 instances: diabetic condition (DCD) with three different states, average patient weight with respect to DCD, and patient age grouped by DCD.
For the PCA solution, the cumulative percentage of variance explained is given in the last column. The first five (5) components out of twelve (12) have eigenvalues above one (1) and are large enough to be retained. Their explained variances are 18.9%, 14.0%, 13.6%, 10.3%, and 8.6%, respectively, together describing 65.38% of the total variance.
Figure 5, below, represents the data reduction process flow: at the initial stage, twelve variables entered, and at the final stage the dataset was reduced to five variables. Figure 6, below, presents a K-means cluster example for the attribute "Residential suburb", classified as Village, Town, and City for the 281 instances; the same applies to the remaining attributes.
Table 3 presents a correlation matrix for the variables involved in the study, used to check the degree of relationship between them. The leading elements of the principal diagonal are all equal to 1. The variables BMI and WGT have the highest correlation value, 0.81, indicating a strong positive relation. The pairs AGE and GLU, SEX and OCP, MST and OCP, LOA and OCP, and SEX and LOE, with respective correlation values of 0.52, 0.39, 0.33, 0.32, and 0.30, are all positively correlated.
Figure 7 presents a pictorial representation of the correlation matrix; the darker the color, the stronger the relationship, and vice versa.
Table 4, below, presents correlation coefficients for some selected symptoms of diabetes mellitus and of other chronic diseases: the first column represents diabetes mellitus symptoms, and the first row represents symptoms from the four different chronic diseases involved in the study, namely high blood pressure, kidney problems, cardiovascular problems, and eye problems.
It was observed that the symptom (FTG), related to kidney problems and high blood pressure, has a relatively high correlation value of 0.43. The symptom (SOB), related to cardiovascular problems, the symptom (SCE), related to eye problems, and the symptom (NRV), related to high blood pressure, have correlation values of 0.42, 0.37, and 0.30, respectively.
On the other hand, the study recorded negative correlations between some symptoms of the chronic diseases and diabetes mellitus. The symptoms (SFH), related to kidney problems, and (DRV), associated with eye problems, with correlation values of −0.01 and 0.01, respectively, show no relationship.
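The explained-variance arithmetic behind the retained components can be checked directly. In the sketch below, the eigenvalues are back-calculated from the reported percentages (λᵢ = pct × p / 100 with p = 12 standardized variables); they are illustrative reconstructions, not values taken from the original analysis output.

```java
// Kaiser criterion and explained-variance arithmetic for the reported
// PCA solution. The eigenvalues are back-calculated from the reported
// percentages (lambda = pct * p / 100, p = 12 variables) and are
// illustrative, not taken from the original analysis output.
public class ExplainedVariance {
    static double percentExplained(double eigenvalue, int p) {
        // For standardized data the total variance equals p, the number
        // of variables, so each component explains lambda / p of it.
        return 100.0 * eigenvalue / p;
    }

    public static void main(String[] args) {
        int p = 12;
        // All five reconstructed eigenvalues exceed 1 (Kaiser criterion).
        double[] retained = {2.268, 1.680, 1.632, 1.236, 1.032};
        double cumulative = 0.0;
        for (double lambda : retained) {
            cumulative += percentExplained(lambda, p);
        }
        System.out.printf("cumulative = %.1f%%%n", cumulative); // ~65.4%
    }
}
```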

Discussion
This study uses Java code to perform the variable categorization, implementing an if-else-if statement algorithm to avoid loss of vital information from the dataset. The main strength of this study is the use of a real-life dataset from authentic records of Nigerian hospitals, which was used to validate all our results.
The outcomes of the categorized variables in Table 2 indicate that this method can successfully be implemented to classify the variables into meaningful clusters. The cluster evaluation is shown in Figure 6 for more subjective assessment and further clinical investigation. Such a study is also essential for better assessment in low socio-economic regions around the globe.
On the other hand, Table 4 shows that there exists a positive relationship between the diabetes mellitus symptoms and the selected chronic disease symptoms (FTG) and (SOB), with correlations of 0.43 and 0.42, respectively. The experiments were carried out with 281 instances on the R and Java platforms. The only limitation is that the time required to obtain the desired results is directly proportional to the size of the data used, should a different sample size be employed.

Conclusions
The study proposed a polytomous method of categorizing variables. The process was applied to classify all the categorical variables of interest in a real-life dataset. Table 2 shows the eigenvalues and the total variance explained for the principal component (PC) solution.
Twelve (12) variables were involved in the study, and the aim was to capture the pattern of correlation with as few factors as possible. Each resulting eigenvalue corresponds to a different factor, and, usually, factors with large eigenvalues are retained.
It is clear from Table 2 that the first five (5) components out of twelve (12) have eigenvalues above one (1) and are thus large enough to be retained. Their variances are 18.9%, 14.0%, 13.6%, 10.3%, and 8.6%, respectively; together, these five (5) components describe 65.38% of the total variance.
In addition, of the retained PCs, the first two were used in the assessment of K-means clusters for the attribute residential suburb.
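The clustering step on the first two PC scores can be sketched as a plain K-means over 2-D points. The sample points, the fixed initialization, and k = 2 below are invented for illustration; the study clustered its 281 instances on their first two PC scores.

```java
import java.util.Arrays;

// Minimal K-means sketch for 2-D points (e.g., scores on the first two
// PCs). The points, k = 2, and the fixed initial centroids are invented
// for illustration only.
public class KMeans2D {
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dx = p[0] - centroids[c][0], dy = p[1] - centroids[c][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    static int[] cluster(double[][] points, double[][] centroids, int iters) {
        int[] labels = new int[points.length];
        for (int it = 0; it < iters; it++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                labels[i] = nearest(points[i], centroids);
            }
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                double sx = 0, sy = 0;
                int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (labels[i] == c) { sx += points[i][0]; sy += points[i][1]; n++; }
                }
                if (n > 0) { centroids[c][0] = sx / n; centroids[c][1] = sy / n; }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] pcScores = {{0.1, 0.2}, {0.0, -0.1}, {5.0, 5.1}, {5.2, 4.9}};
        double[][] centroids = {{0.0, 0.0}, {5.0, 5.0}};
        int[] labels = cluster(pcScores, centroids, 10);
        System.out.println(Arrays.toString(labels)); // [0, 0, 1, 1]
    }
}
```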
Moreover, the correlation results reveal that diabetes patients are more likely to have kidney or hypertensive problems, with a correlation value of 0.43, than to have eye or cardiovascular issues. Therefore, the study validates the proposed categorization method for classification techniques.

Funding: This study was funded by the Natural Science Foundation of Hebei Province with grant numbers 61572420, 61472341, and 61772449.