A Diabetes Prediction System Based on Incomplete Fused Data Sources

: In recent years, the diabetes population has grown younger. Therefore, it has become a key problem to make a timely and effective prediction of diabetes, especially given a single data source. Meanwhile, there are many data sources of diabetes patients collected around the world, and it is extremely important to integrate these heterogeneous data sources to accurately predict diabetes. For the different data sources used to predict diabetes, the predictors may be different. In other words, some special features exist only in certain data sources, which leads to the problem of missing values. Considering the uncertainty of the missing values within the fused dataset, multiple imputation and a method based on graph representation is used to impute the missing values within the fused dataset. The logistic regression model and stacking strategy are applied for diabetes training and prediction on the fused dataset. It is proved that the idea of combining heterogeneous datasets and imputing the missing values produced in the fusion process can effectively improve the performance of diabetes prediction. In addition, the proposed diabetes prediction method can be further extended to any scenarios where heterogeneous datasets with the same label types and different feature attributes exist.


Introduction
The prevalence of diabetes is rising all over the world. According to the statistics from the International Diabetes Federation (IDF), about 425 million adults worldwide were diagnosed with diabetes in 2017. Following this trend, it is estimated that by 2045, the number of diabetic patients in the world will exceed 629 million [1]. Diabetes is a major cause of blindness, kidney failure, heart attack, stroke and lower-limb amputation, bringing huge inconvenience to the patients and even being life-threatening in severe cases. If it can be detected early and intensive treatment be offered, about half of those diabetes patients can go into remission.
At present, machine learning is an important research direction due to its high efficiency, accuracy, and extraordinary learning speed, and it plays a huge role in many fields, such as computer vision [2], natural language processing [3,4], stock market analysis [5], etc. In the 1980s, machine learning was used to predict diabetes. In 1988, Smith [6] constructed an early neural network model to study diabetes prediction in Pima Indians. He compared and analyzed the results with Logistic Regression (LR) and Linear Perceptron (LP). In 2004, Meiland [7] established a model based on LR to predict the presence of asymptomatic bacteriuria in diabetic women according to clinical indicators such as medical history and white blood cell count. In 2011, Ahmad A [8] compared the performance of decision trees and neural networks in diabetes prediction. In 2013, Kumari [9] used the backpropagation algorithm to provide an effective method for the automated examination of diabetes. In 2017, Maniruzzaman et al. [10] proposed a Gaussian Process (GP)-based model for diabetic classification of existing techniques such as Naive Bayes (NB), Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). In 2018, Swapna and Vinaya Kumar [11] used a deep learning method for detecting diabetes. In their study, Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN) and a combination thereof were adopted to obtain dynamic features and then pipelined to Support Vector Machine (SVM) for classification.
All these studies have improved the prediction performance of diabetes. However, they are only based on a single small dataset. Considering that a single dataset contains limited information and there are many diabetes datasets, an effective method to promote diabetes prediction performance should be combining the multiple heterogeneous datasets. It should be noted that the process of combining datasets is also a featurefusion process. Different datasets contain different feature information, and the features contained in different datasets will be fused when combined with different datasets. For feature fusion, many datasets are multi-dimensional. For example, the characteristics of a flower can be shape, color, and so on. These characteristics determine which breed it belongs to. We can also infer its species by its smell, flowering period, and geographical distribution. Therefore, if we combine these characteristics together, then the prediction of the flower varieties should be more accurate. In fact, this process can be viewed as multi-view learning [12][13][14]. In the process of feature fusion, the common features of heterogeneous datasets are directly integrated. Some specific features will be missed during the fusion process. Thus, some missing-value handling methods are needed to solve this problem and form a complete dataset.
For missing value handling strategies, there are three categories of approaches to deal with missing values. The first category is to remove all samples with missing values [15]. This is simple and intuitive; it will encounter huge problems when a large number of data values are missing. Unlike the first category, the second category chooses to impute the missing values. There are many common imputation methods, including Mean Imputation [16], Random Imputation within Classes [17], Gaussian copula imputation [18], etc. Commonly used ones include Regression Imputation [19], Support Vector Machine Imputation (SVM) [20], Multiple Imputation (MI) [21], Genetic Algorithm [22], graph-based imputation [23], etc. These imputation methods can effectively impute the missing values, but the imputation effect is different. The third category uses the indicator matrix to indicate the position of the missing values in the dataset, ignoring the marked missing values in the subsequent training and prediction process, and only uses the non-missing parts [24,25]. The second category is the most commonly used one among these three categories, and multiple imputation is the most popular one among the methods based on imputation.
As for diabetes prediction, besides the most critical glycemic indicator, different hospitals or institutions will also consider other indicators to help them determine whether a patient has diabetes. These indicators may be differentiated and of different dimensions. In other words, there is heterogeneity between datasets. Since different indicators in the heterogeneous datasets all reflect some information related to diabetes, from the perspective of feature fusion, it is necessary to fuse them together. After fusion, the new dataset contains the characteristics of the original two datasets, which theoretically improves the prediction effect of diabetes. Therefore, in this paper, we will propose a new diabetes prediction system that can combine heterogeneous data sources. The system fused the common and special features within two data sources and missing values occurred during this process. For the missing value handling problem, we adopt multiple imputation and the graph representation learning model to impute the data to feed LR [26] to train and predict diabetes. To further improve the prediction results, we also introduce ensemble learning framework stacking [27,28], and finally further improve the prediction performance.
The main contributions of this paper are summarized as follows: 1. Explore and improve diabetes prediction by fusing two heterogeneous datasets from different sources, which can be generalized to more than two data sources and different application domains.

2.
Taking the uncertainty of the missing values into consideration, graph representation learning is adopted to impute the missing values within the fused dataset.

3.
Compared with the prediction results of the original dataset, the prediction results of the fused dataset are improved, and the stacking strategy further improves the performance.
The rest of this paper is organized as follows: Section 2 elaborates the overall framework of the system, introducing the specific functions of each module. Section 3 demonstrates the experiments, including dataset description, experimental settings, experimental evaluation metrics, experimental results and the analysis. In Section 4, we discuss the experimental results. Finally, Section 5 concludes the paper.

Heterogeneous Data Source Combination System Framework
The system framework in Figure 1 demonstrates the main process to combine data sources and predict diabetes. The whole system consists of three stages: data input, data combination, data training and prediction. Two heterogeneous datasets are fed as input, a new fused dataset is obtained by data combination, and then training and prediction are carried out to output the prediction result of diabetes. The details of these three phases are described below.

Data Input
At this stage, we feed two different datasets to predict diabetes. The features of these two datasets can be divided into common parts and specific parts, and the labels of those samples, being positive or negative, are retained.

Data Analysis
In this section, for the two datasets, we first calculate the correlation between the features of the datasets and the labels, and then analyze which features have a greater impact on the prediction of diabetes. These features are supposed to be extracted as the features of the combined dataset. Essentially, this is the feature selection or feature reduction. In addition, some advanced machine learning methods, such as supervised nonnegative matrix factorization and attribute reduction [29][30][31], can also be adopted.

Data Preprocess
Before combination, the original datasets should be preprocessed to eliminate abnormal samples. For the few missing values in the datasets, we can choose to remove them directly, or we can use common imputation methods (such as mean imputation) to impute the missing values. In short, what we do is ensure that the input data are reasonable.

Data Imputation
After data preprocessing, there should be no missing data in the two original datasets. Then, we start to combine heterogeneous data sources. Let P = {p 1 , p 2 , · · · , p s } be all the feature sets in the first dataset, and Q = {q 1 , q 2 , · · · , q t } be all the feature sets in the second dataset. A ip (i = 1, 2, · · · , m; p = p 1 , p 2 , · · · , p s ) represents the p-th feature corresponding to the i-th sample in the first dataset, B jq (j = 1, 2, · · · , n; q = q 1 , q 2 , · · · , q t ) represents the q-th feature corresponding to the j-th sample in the second dataset. When combining data, let S = P ∩ Q be the feature sets shared by P and Q, and union the sample with the shared features in data A(S) and B(S) to obtain D 1 , i.e., D 1 = A(S) ∪ B(S). Let R = P \ Q be the feature sets that exist in P but do not exist in Q, then set the sample in the second dataset B(R) = NaN, union the sample to obtain D 2 = A(R) ∪ B(R). In the same way, let T = Q \ P, set the sample in the first data A(T) = NaN, and union the sample to obtain As long as there are different features in the two different datasets, there must be missing values in the new dataset. Next, two imputation models will be used to impute missing values.
1. Multiple Regression. The main idea of multiple regression imputation model is to fit multiple regression models to missing variables and complete data variables to predict missing values [32]. For each missing value, we regard the missing variable as the dependent variable and the related variable (other non-missing variables) as the independent variable, perform regression fitting, then use the predicted value as the imputed value. We repeat the above process to generate m datasets, integrate the datasets together, then evaluate the model, and finally output a complete dataset.
2. GRAPE. Traditional missing value imputation methods include simple mean imputation and median imputation, etc., and complex ones such as k-neighbor imputation, regression imputation and so on [33]. However, some imputation methods based on deep learning, such as graph representation learning, are rarely considered to impute missing values. GRAPE is a graph-based representation learning method, which has good performance in feature imputation and label prediction [23]. In the GRAPE framework, feature imputation is transformed into an edge-level prediction task and label prediction into a node-level prediction task according to the bipartite graph model. These tasks are then solved using graph neural networks. The main steps are as follows:

Transform the Dataset into a Graph Structure
The main idea of the GRAPE model is to represent the missing dataset as a bipartite graph. In Figure 2, F j (j = 1, 2, 3) represents the jth feature of the sample, O i (i = 1, 2, 3) represents the ith sample, R(O i , F j ) represents the value corresponding to the jth feature of the ith sample. For example, R(O 1 , F 1 ) = 0.5. Taking O 1 , O 2 , · · · , O i as the nodes on one side of the bipartite graph, and F 1 , F 2 , · · · , F j as the nodes on the other side of the bipartite graph, the nodes on both sides correspond to the edges formed by the connection, which is R(O, F). R(O, F) is the feature value of the corresponding sample. If the feature of the sample is missing, the corresponding connection is missing. The nodes of the bipartite graph are represented in the form of one-shot vectors, which is convenient for subsequent input. The embedding of an element is the vector calculated by propagating the element to the hidden layer. Based on a trained GNN, each element can be transmitted to the hidden layer from its own one-shot representation, and then extract the hidden layer vector as its own embedding [34]. An embedding can be regarded as a new feature vector, which contains information between elements. These embeddings are used as the input of the imputation model to impute missing values.

Use of MLP (Multilayer Perceptron) Model for Data Imputation
The nodes on one side of the bipartite graph are used as the input layer of the MLP, and the nodes on the other side are used as the output layer. Here, the MLP can in fact be viewed as a GNN over a complete graph, where the message function is matrix multiplication. GRAPE extends a simple MLP by allowing it to operate on sparse graphs, enabling it for missing feature imputation tasks by adopting a more complex message computation. For more detail, refer to GRAPE [23].
Through data analysis, data preprocessing and data imputation, a fused complete dataset can be finally obtained. This dataset contains the features extracted from the original two datasets, and each sample has a corresponding feature value. Then we use this dataset for training and prediction.

Data Training and Prediction
In this step, we use the basic model and the stacking method of ensemble learning [35] to train and make prediction. The main idea of the stacking method in the ensemble learning paradigm is as follows. First, the dataset is evenly divided into k parts, any one part is used as the test data, and the remaining k − 1 parts are used as training data. The k basic models are employed to train and predict the data, and then the predicted results are combined to form a single dataset. Finally, a meta-model is used to train this dataset and obtain the final result. We chose five basic models (decision tree, random forest, LDA (linear discriminant analysis), KNN, Naive Bayes) and a meta-model (LR) for training and prediction. Its algorithm is shown in Algorithm 1.
The fused dataset is divided into training and test sets, and then the above models are used for training to predict, respectively, the samples belonging to the first data source in the test set, the samples belonging to the second data source, and all the samples in the test set to obtain the results.

Dataset Description
There are two datasets used in this experiment, namely the Weihai Municipal Hospital(WMH) Diabetes Dataset and the Pima Indian Diabetes Dataset. The WMH dataset comes from a municipal hospital in Weihai City, Shandong Province, China. It records 12 diabetes-related diagnostic indicators of 118 local patients in Weihai, including gender, age, BMI, family history, insulin, blood sugar, serum C-peptide, and so on. The Pima dataset contains diabetes indicators including Pregnancies, Glucose, Insulin, Blood Pressure, Skin Thickness, Diabetes Pedigree Function, BMI, etc. of 768 Pima Indian women with potential diabetes from Phoenix, Arizona [36]. Both datasets contain the diagnosis labels of each patient (ill or not). According to the advice from doctors and some conclusions from previous research [37], six important features (gender, age, BMI, blood glucose, proinsulin and Cp120) are selected from the former dataset and five features (gender, age, BMI, blood glucose and diabetes pedigree) from the latter one to form seven common features. The features shared by the two datasets include gender, age, BMI, and blood glucose. Table 1 is a detailed description of the features.

Experimental Settings
After the dataset fusion process, we selected the multiple imputation and the GRAPE model to fill in the datasets, and obtained two complete fused datasets. The LR model and the stacking model in the ensemble learning framework are selected to train and predict whether the person is diabetic. Five basic models-random forest, decision tree, LDA, KNN, and Naive Bayes-and a meta-model LR are adopted in the stacking model. To verify the effectiveness of the proposed methods, five performance evaluation metrics-accuracy, precision, recall, F1-score, AUC (Area Under Curve)-are used in our experiments, and five-fold cross-validation is adopted to run the experiments. The average results and standard deviations are reported. Cross-validation can decrease the impact brought about by the different distributions of training data and test data. Experiment 1. Use the LR model to train and predict on the original dataset. Experiment 2. Use the multiple imputation model and GRAPE model for data imputation on the fused dataset, and use LR model and stacking model for training and prediction. Train the models on the training set containing both WMH and Pima samples, and then predict the WMH samples and Pima samples in the test set, respectively. Experiment 3. Use the multiple imputation model and GRAPE model for data imputation on the fused dataset, and use LR model and stacking model for training and prediction. Split the fused dataset into the WMH samples and Pima samples, and then train and predict on WMH samples and Pima samples, respectively. Experiment 4. Compare the methods "Complete GRAPE" and "GRAPE". When the GRAPE model is used to impute the missing values, in fact, it will impute all the values including the existing values and the missing values. When we adopt LR to do the training, and use the complete data obtained after GRAPE, we call this method "Complete GRAPE". If the dataset keeps the existing values and just replaces missing values with the imputed values obtained from GRAPE, we call the method employed on this dataset "GRAPE". We will use them to predict the labels of WMH samples, Pima samples, and all samples in the test set. Experiment 5. Compare the methods "LR+GRAPE" and "GRAPE". For the basic model LR, the data after GRAPE imputation is divided into training and test sets, LR is trained on the training set and prediction is performed on the test set. This method is named "LR+GRAPE". It is noted that GRAPE can predict the label in the test set without the help of any additional classification model. In Figure 2, running GRAPE with the label as node, the label corresponding to each sample in the test set will be given. This method is named "GRAPE". We will use them to predict the labels of WMH samples, Pima samples, and all samples in the test set.
Since we use five-fold CV, all the above experiments are run five times, and the mean and standard deviation of the five results are taken as the final result.

Evaluation Metrics
Five evaluation metrics-accuracy, precision, recall, F1-score, and AUC (Area Under Curve)-are adopted in the experimental comparison. Since the first four are based on the confusion matrix, we will introduce the confusion matrix first. Figure 3 is the structure of the confusion matrix. The confusion matrix is a two-dimensional matrix, which is mainly used to evaluate binary classification problems and reflect the difference between the predicted result and the actual result [38]. It can be seen from the matrix that there are two types of category (0 and 1), and the difference between the category predicted by the model and the true categories forms four indicators, respectively. They are true positive (TP), false positive (FP), false negative (FN) and true negative (TN). TP represents the number of samples whose predicted result is 1 and the true result is 1, and FP represents the number of samples whose predicted result is 1 but the actual result is 0. TN represents the number of samples whose predicted result is 0 and the true result is 0, and FN represents the number of samples whose predicted result is 0 but the actual result is 1.
The calculation formula for accuracy is It reflects the proportion of the correctly predicted sample to the total sample. The calculation formula of precision is The calculation formula of recall is It indicates how many samples with positive labels are predicted correctly. Precision and recall are often relative. Sometimes precision is high but recall is low, so F1-score is introduced to provide a trade-off. F1-score is an evaluation index that combines precision and recall. Its calculation formula is It is the harmonic mean of precision and recall. The last one, AUC, refers to the area between the ROC (Receiver Operating Characteristic) curve and the x-axis. It can quantitatively display the classification of the model. Generally, the larger the value of AUC, the better the classification performance on the dataset.
In the context of medicine, especially in clinical decision support, the AUC is often too general in that it assesses all decision thresholds, including unrealistic ones. Conversely, accuracy, precision, recall, positive predictive value, and the F1 score are too specific; they are measured against a single threshold that is optimal for some cases but not for others, which is not fair. Thus, each measure has its limit, all of them will be adopted to evaluate the result. A very recent work [39] describes a deep ROC analysis to measure performance in multiple groups of predicted risk or in groups of TP rate or FP rate. It is interesting that these authors also provide a Python toolkit.

Experimental Results
The results of Experiment 1 and Experiment 2 are shown in Figure 4, Tables 2 and 3. From Table 2, we can find that both the MICE filling model and the GRAPE imputation model have improved the prediction results of WMH samples slightly, and the performance obtained using the stacking model is better than that obtained by the LR model. Regarding the WMH data, it achieves the best performance with an accuracy of 92.5% using the GRAPE model for data imputation and the stacking model for training and prediction. From Table 3, it can be seen that compared to the WMH dataset, Pima's prediction results are slightly improved. On the whole, for Pima dataset, it can be improved slightly using the MICE filling model to impute and using the stacking model to predict.

Accuracy Precision
Recall   For Experiment 3, Figure 5 shows the result comparison of training on fused training samples to predict Pima test samples and separate training on Pima training samples and then predicting Pima test samples. Figure 6 shows the result comparison of training on fused training samples to predict WMH test samples and separately training on WMH samples and then predicting WMH test samples. It can be seen from the results that for both Pima samples and WMH samples, the results of separate training and prediction are generally not as good as those results obtained with training and predicting together.
For Experiment 4, Tables 4 and 5 demonstrate the results of the two methods "GRAPE" and "Complete GRAPE". From these two tables, we can see that it is slightly better to use the imputed dataset obtained from "Complete GRAPE" for WMH dataset than that obtained from "GRAPE" in recall, but slightly worse in precision, F1-score and AUC. For Pima data, it seems that "Complete GRAPE" performs worse than "GRAPE" in most of the evaluation metrics. On all data, "Complete GRAPE" performs slightly better than or as well as "GRAPE". In summary, it seems that "Complete GRAPE" performs as well as "GRAPE". This is indeed reasonable, because the imputed result of the GRAPE model for the position that already exists is very close to the actual value. It indicates that this has little influence on the prediction effect.   For Experiment 5, Figures 7 and 8 show the predicted results of the two methods "GRAPE" and "LR+GRAPE". From the results, we can find that the prediction results obtained from "GRAPE" are slightly better than or as well as that obtained from "LR+GRAPE". Thus, it is better to directly use "GRAPE" to impute and predict than to run LR on the data after "GRAPE" imputation.   To see the boundary of the proposed method, we run the STACK-GRAPE on fused datasets with three missing ratios: 30%, 50% and 80%. For our fused dataset, its missing ratio is about 30%. We assume our fused dataset to be X with the size 986× 7, to generate a dataset with missing ratios 50% and 80%, 20% and 50% of 6902 (986× 7) entries are randomly removed from those observed positions. STACK-GRAPE is then run on these generated datasets. The results are shown in Table 6. From this result, we can see that the performance of STACK-GRAPE decreases when missing ratios increase, but even in the missing ratio 80%, the accuracy is still 0.676. Thus, this method applies to the high missing ratio.  Since ensemble learning framework stacking is used to predict diabetes, to understand the stacking model better, additional experiments with each sub-model are conducted and the results are shown in Table 7. From the results, we can find that Naive Bayes performs the worst, and random forest performs the best among the sub-models. The ensemble model STACK-GRAPE outperforms all the sub-models in the average cases.

Discussion
From the above experiments, it can be seen that the imputation model and the classifier adopted have an impact on the prediction performance of the fused datasets. As for the filling model, the more basic filling models such as mean filling and KNN filling are not suitable for multiple regression imputation. The deep-learning imputation model GRAPE seems the best option to impute the missing values in the fused dataset. GRAPE, a deeplearning imputation model, can act as a classifier besides the imputation model, and it performs well. In addition, the ensemble learning model stacking can boost the performance further, and some previous works [40,41] also verified this conclusion. However, these works cannot deal with missing-value problems, thus comparison cannot be done on prediction on incomplete datasets. In the process of fusing heterogeneous datasets, it is important to choose a suitable filling model. The better the filling model used, the more effective the information contained in the fusion dataset. With such a dataset, coupled with a good prediction model, the final prediction result will be satisfactory.
For the imputation model, besides multiple imputation and GRAPE, several other models are proposed to deal with missing values, such as LSTM [42]. Due to the introduction of a gating mechanism, LSTM is outstanding in the processing of the missing values of time-series problems. Compared with LSTM, GRAPE can deal with any missing case in any data. The idea of combining heterogeneous datasets and imputing the missing values incurred in the combining process is not only applicable to the problem of diabetes prediction, but also to all the disease prediction problems [43], even those outside the medical field. It is also interesting to consider missing-value imputation and diabetes prediction as multitask learning [44].
Although the proposed methods show their effectiveness, there is a limit for them. If the different data sources do not have common features, the proposed method cannot be directly uses. However, in real-world applications, different data sources collected with the same goal normally have some features in common, as well as some special features. Thus, when combining, it is better to make full use of their complementary and consensus information. It can be seen as multi-view learning [45,46] or multi-source learning. Thus, the tools used in those related areas can be borrowed to serve the current goal. In addition, we fused two data sources in this work, and in fact it is easy to extend to more than two data sources.

Conclusions
The prediction performance of this system obtained by combining two heterogeneous diabetes data sources and selecting an appropriate imputation model is better than those obtained on the original datasets. For the WMH dataset, the GRAPE model for imputation and the stacking model for training is the best combination. For the Pima dataset, the MICE model and the stacking model have the best prediction performance. This shows that the idea of combining heterogeneous datasets and imputing the missing values produced in the fusion process is effective to improve diabetes prediction performance. In fact, almost every hospital collected the heterogeneous diabetes datasets and they have not been exploited to serve for diabetes prediction. This paper provided a feasible and effective way to deal with this problem and shows great potential. Moreover, this idea may not only be used in diabetes prediction, but also can apply to any scenarios where heterogeneous datasets with the same label types and different feature attributes exist. In the current work, we did not investigate the impact caused by the different distribution of training data and test data. However, this fact indeed has an impact on the prediction performance. In future work, further research will be conducted. Informed Consent Statement: Patient consent was waived due to research involving the analysis of existing data, where the data will be analyzed such that individual subjects cannot be identified.

Data Availability Statement:
The datasets generated and analyzed in the current study are available from the corresponding author on reasonable request.