# A Fusion-Based Machine Learning Approach for the Prediction of the Onset of Diabetes

## Abstract

## 1. Introduction

- A fusion-based machine learning architecture for the prediction of diabetes has been proposed.
- Two machine learning classifiers, Support Vector Machine (SVM) and Artificial Neural Network (ANN), have been evaluated within the architecture.

## 2. Related Research

## 3. Materials and Methods

#### 3.1. Datasets

#### 3.2. System Architecture

#### 3.2.1. Data Fusion

#### 3.2.2. Pre-Processing

#### 3.2.3. Cross-Fold Validation

#### 3.2.4. Support Vector Machines

If ${x}_{i}$ represents the transferred input vector and ${y}_{i}$ the target value, the SVM becomes a binary classifier in which the class labels take only two values, +1 or −1. The SVM draws an optimal hyper-plane H that separates the data into different classes while maximizing the margin between H and the inputs. The objective function is convex, which is a significant advantage: training an SVM is equivalent to solving a quadratic programming problem, so the solution is unique. In contrast, the Artificial Neural Network (ANN) method requires nonlinear optimization, which may leave the algorithm trapped in local minima.
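For reference, the quadratic programming problem mentioned above is commonly written in the soft-margin form below. This is the standard textbook formulation, offered as an assumption about the setup rather than as the exact objective used in this work. The weight vector $w$, bias $b$, slack variables ${\xi}_{i}$, penalty constant $C$, feature map $\psi$, and sample count $n$ are symbols introduced here for illustration only.

$$\underset{w,b,\xi}{\min}\ \frac{1}{2}{\left\|w\right\|}^{2}+C\sum_{i=1}^{n}{\xi}_{i}\quad \text{subject to}\quad {y}_{i}\left({w}^{T}\psi\left({x}_{i}\right)+b\right)\ge 1-{\xi}_{i},\quad {\xi}_{i}\ge 0,\quad i=1,\dots ,n$$

Because the objective is a convex quadratic and the constraints are linear, every local minimum of this problem is also a global minimum.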

- (i) Linear Kernel: $K\left({x}_{i},{x}_{j}\right)={x}_{i}^{T}{x}_{j}$
- (ii) Radial Basis Function (RBF) Kernel: $K\left({x}_{i},{x}_{j}\right)=\exp\left(-\gamma {\left\|{x}_{i}-{x}_{j}\right\|}^{2}\right)$
- (iii) Polynomial Kernel: $K\left({x}_{i},{x}_{j}\right)={\left(\gamma {x}_{i}^{T}{x}_{j}+r\right)}^{d}$
- (iv) Sigmoid Kernel: $K\left({x}_{i},{x}_{j}\right)=\tanh\left(\gamma {x}_{i}^{T}{x}_{j}+r\right)$

where $r,d\in N$ and $\gamma \in {R}^{+}$ are constants.
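The four kernels listed above can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the function names and the default values of `gamma`, `r`, and `d` are hypothetical choices made for the example and are not settings reported in this work.

```python
import math


def dot(a, b):
    """Inner product x_i^T x_j of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def linear_kernel(xi, xj):
    return dot(xi, xj)


def rbf_kernel(xi, xj, gamma=1.0):
    # exp(-gamma * ||x_i - x_j||^2)
    squared_distance = sum((p - q) ** 2 for p, q in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(xi, xj, gamma=1.0, r=1, d=2):
    # (gamma * x_i^T x_j + r)^d
    return (gamma * dot(xi, xj) + r) ** d


def sigmoid_kernel(xi, xj, gamma=1.0, r=1):
    # tanh(gamma * x_i^T x_j + r)
    return math.tanh(gamma * dot(xi, xj) + r)


# Illustrative vectors (assumed values, not taken from the dataset).
a, b = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(a, b), rbf_kernel(a, b, gamma=0.5))
print(polynomial_kernel(a, b), sigmoid_kernel(a, b, gamma=0.01, r=0))
```

The RBF kernel of a vector with itself is always 1, which makes it a convenient sanity check when experimenting with different values of `gamma`.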

#### 3.2.5. Artificial Neural Networks

#### 3.2.6. Fusion of SVM-ANN

^{th}-classifier, ${\phi}_{i}$ represents the $i^{th}$-class of objects, and ${P}_{i}(x|{\phi}_{j})$ represents the probability of $x$ in the $i^{th}$-classifier given that the $j^{th}$-class of objects occurred. As the proposed objective of the architecture is a two-class output, the posterior probability can be written as:

^{th}-classifier given that the target class of object has occurred, respectively.

^{th}-classifier and the probability of the outlier class given that event $x$ has occurred in the $i^{th}$-classifier. Then, the decision criteria are computed as:

^{th}-class, the fusion rule can be written as:
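To make the quantities in this section concrete, the sketch below computes a two-class (target versus outlier) posterior with Bayes' theorem and then combines the posteriors of two classifiers. The combination used here is a plain mean followed by a threshold, which is an assumption chosen for illustration and should not be read as the fusion rule of the proposed architecture. The function names, the equal priors, the 0.5 threshold, and the likelihood values are likewise hypothetical.

```python
def two_class_posterior(lik_target, lik_outlier, prior_target=0.5):
    """Bayes' theorem for a target/outlier problem: returns P(target | x)."""
    prior_outlier = 1.0 - prior_target
    evidence = lik_target * prior_target + lik_outlier * prior_outlier
    if evidence == 0.0:
        raise ValueError("x has zero likelihood under both classes")
    return lik_target * prior_target / evidence


def fuse_mean(posteriors, threshold=0.5):
    """Average the per-classifier posteriors and threshold the mean.

    The mean rule is an assumed stand-in, not the rule derived in the text.
    Returns the fused posterior and a label in {+1, -1}.
    """
    fused = sum(posteriors) / len(posteriors)
    return fused, (1 if fused >= threshold else -1)


# Hypothetical likelihoods of one sample x under each class, per classifier.
p_svm = two_class_posterior(lik_target=0.30, lik_outlier=0.10)
p_ann = two_class_posterior(lik_target=0.20, lik_outlier=0.30)
fused, label = fuse_mean([p_svm, p_ann])
print(p_svm, p_ann, fused, label)
```

In this example the two classifiers disagree, and the averaged posterior decides the final label.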

## 4. Performance Evaluation

#### 4.1. Performance Evaluation Matrix

Intel® Core™ i3-3217U CPU @ 1.80 GHz PC.

#### 4.2. Performance Results and Discussion

## 5. Conclusions

**Figure 2.** (**a**) Confusion matrix of ANN; (**b**) Confusion matrix of SVM; (**c**) Confusion matrix of Fusion (SVM-ANN).

**Figure 5.** Performance Comparison of the Proposed Approach with existing Models in terms of Accuracy.

Studies | Proposed Methods | Dataset | Findings |
---|---|---|---|
[5] | Logistic Adaptive Network Fuzzy Inference System (LANFIS) | Pima Indians diabetes | Prediction accuracy = 88.05%; Sensitivity = 92.15%; Specificity = 81.63% |
[7] | Hybrid Prediction Model (HPM) + C4.5 | Pima Indians diabetes | Prediction accuracy = 92.38% |
[20] | Artificial Neural Networks (ANN) + General Regression Neural Networks (GRNN) | Pima Indians diabetes | Prediction accuracy = 80% |
[22] | Principal Component Analysis (PCA) + Adaptive Neuro-Fuzzy Inference System (ANFIS) | Pima Indians diabetes | Prediction accuracy = 89.47% |
[23] | Adaptive Network-based Fuzzy System (ANFS) + Levenberg–Marquardt Algorithm | Pima Indians diabetes | Prediction accuracy = 82.30%; Sensitivity = 66.23%; Specificity = 89.78% |
[24] | Least Square Support Vector Machine (LS-SVM) and Generalization Discriminant Analysis (GDA) | Pima Indians diabetes | Classification accuracy = 82.05%; Sensitivity = 83.33%; Specificity = 82.05% |
[25] | Bayesian Network (BN) | Pima Indians diabetes | Prediction accuracy = 72.3% |
[26] | (1) Genetic Algorithm (GA) + K-Nearest Neighbors (GA-KNN); (2) Genetic Algorithm (GA) + Support Vector Machine (GA-SVM) | Pima Indians diabetes | (1) Prediction accuracy = 80.5%; (2) Prediction accuracy = 87.0% |
[31] | Gaussian Hidden Markov Model (GHMM) | CPCSSN clinical dataset | Prediction accuracy = 85.9% |
[32] | Deep Extreme Learning Machine (DELM) | Pima Indians diabetes | Prediction accuracy = 92.8% |
[33] | Gradient Boosted Trees (GBTs) | Canadian AppleTree and the Israeli Maccabi Health Services (MHS) | Prediction accuracy = 92.5% |
Proposed | SVM-ANN | | Prediction accuracy = 94.67%; Sensitivity = 89.23%; Specificity = 97.32% |

S# | Feature Name | Description | Variable Type |
---|---|---|---|
1 | Glucose (F1) | Plasma glucose concentration at 2 h in an oral glucose tolerance test | Real |
2 | Pregnancies (F2) | Number of times pregnant | Integer |
3 | Blood Pressure (F3) | Diastolic blood pressure (mm Hg) | Real |
4 | Skin Thickness (F4) | Triceps skinfold thickness (mm) | Real |
5 | Insulin (F5) | 2-h serum insulin (μU/mL) | Real |
6 | BMI (F6) | Body mass index (weight in kg/(height in m)$^{2}$) | Real |
7 | Diabetes Pedigree Function (F7) | Diabetes Pedigree Function | Real |
8 | Age (F8) | Age (years) | Integer |

1. Begin
2. Input data
3. Apply the data fusion technique
4. Pre-process the data using different techniques
5. Partition the data using the K-fold cross-validation method
6. Classify diabetic and healthy people using SVM and ANN
7. Fuse the outputs of SVM and ANN
8. Compute the performance of the architecture using the evaluation matrix
9. Finish
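The data partitioning step listed above can be sketched with the standard library alone. This is a minimal illustration, assuming shuffled, non-stratified folds; the function name `k_fold_indices`, the default `k=10`, the fixed seed, and the sample count in the usage lines are hypothetical and are not values reported in this work.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Each sample appears in exactly one test fold across the k rounds.
folds = list(k_fold_indices(n_samples=23, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each classifier would then be trained on `train_idx` and scored on `test_idx` in every round, with the per-fold scores averaged at the end.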

Evaluation Matrix | SVM | ANN | Fusion of SVM-ANN |
---|---|---|---|
Accuracy | 88.30% | 93.63% | 94.67% |
Specificity | 93.02% | 97.20% | 97.32% |
Sensitivity | 78.62% | 86.28% | 89.23% |
Precision | 84.58% | 93.77% | 94.19% |
Miss rate | 11.70% | 6.37% | 5.33% |
False Positive Ratio (FPR) | 0.06 | 0.02 | 0.02 |
False Negative Ratio (FNR) | 0.21 | 0.13 | 0.10 |
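The measures in the table above can all be derived from the four counts of a binary confusion matrix, as in the sketch below. Two points are assumptions. First, the reported miss rate equals 100% minus the accuracy in every column (for example, 100% − 88.30% = 11.70% for SVM), so it is computed here as the overall misclassification rate. Second, the counts passed in the usage line are hypothetical and are not the confusion matrices of Figure 2. The function name `evaluate` is also illustrative.

```python
def evaluate(tp, tn, fp, fn):
    """Derive the evaluation measures from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes are present
    and at least one positive prediction was made.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "specificity": tn / (tn + fp),   # true negative rate
        "sensitivity": tp / (tp + fn),   # true positive rate
        "precision": tp / (tp + fp),
        "miss_rate": 1.0 - accuracy,     # assumed: overall misclassification rate
        "fpr": fp / (fp + tn),           # equals 1 - specificity
        "fnr": fn / (fn + tp),           # equals 1 - sensitivity
    }


# Hypothetical counts for illustration only.
metrics = evaluate(tp=80, tn=150, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

By definition FPR = 1 − specificity and FNR = 1 − sensitivity, so these two rows carry no information beyond the specificity and sensitivity rows.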

