Ensemble Deep Learning Models for Heart Disease Classification: A Case Study from Mexico

Abstract: Heart diseases are highly ranked among the leading causes of mortality in the world. They have various types, including vascular, ischemic, and hypertensive heart disease. A large number of medical features are reported for patients in Electronic Health Records (EHR), which allow physicians to diagnose and monitor heart disease. We collected a dataset from Medica Norte Hospital in Mexico that includes 800 records and 141 indicators, such as age, weight, glucose, blood pressure rate, and clinical symptoms. The distribution of the collected records is very unbalanced across the different types of heart disease: 17% of records have hypertensive heart disease, 16% have ischemic heart disease, 7% have mixed heart disease, and 8% have valvular heart disease. Herein, we propose an ensemble-learning framework of different neural network models together with a method of aggregating random under-sampling. To improve the performance of the classification algorithms, we implement a data preprocessing step with feature selection. Experiments were conducted with unidirectional and bidirectional neural network models, and the results showed that an ensemble classifier combining a BiLSTM or BiGRU model with a CNN model had the best classification performance, with accuracy and F1-score between 91% and 96% for the different types of heart disease. These results are competitive and promising for the heart disease dataset. We showed that an ensemble-learning framework based on deep models can overcome the problem of classifying an unbalanced heart disease dataset. Our proposed framework can lead to highly accurate models that are adapted to real clinical data and diagnostic use.


Introduction
Heart disease is among the primary causes of death in the United States. Based on the American Heart Association's most recent statistics, 121.5 million US adults are affected by cardiovascular disease [1]. Identifying heart disease is a challenging task that requires several biological indicators and risk factors, including age, gender, high blood pressure, diabetes, cholesterol level, and many other clinical indicators. Many hospitals and physician practices rely on Electronic Health Record (EHR) systems either for monitoring patients [2] or for detecting anomalies and potential issues, such as extracting important drug events in pharmacovigilance [3], establishing EHR-based public health surveillance [4], and exploiting dynamic features of the handwriting process to support the diagnosis of Parkinson's disease at earlier stages [5].
Data mining has provided effective solutions for several healthcare applications [6] such as medical image segmentation [7], deep patient representations [8], and computer-aided detection (CAD) tools for liver cancer diagnosis [9] and Interstitial Lung Disease (ILD) detection [10]. The complex nature of real medical datasets requires careful management because a prediction error may have serious consequences. Therefore, clinical informatics has been carefully used for analyzing EHR data and accurately classifying diseases based on machine learning algorithms and statistical techniques. For this purpose, recent works have applied classification algorithms such as Decision Trees (DT) and Naive Bayes for heart disease prediction [11], and K-Nearest Neighbors (KNN) for automatic classification of blood pressure [12]. Another work evaluated three types of SVM classifiers for predicting coronary artery disease [13]. An automated diagnosis system was suggested for the identification of heart valve diseases based on Support Vector Machine (SVM) classification of heart sounds [14].
In recent years, neural network models have demonstrated outstanding performance for data prediction and for tackling various classification problems. Deep learning techniques have played a significant role in the healthcare domain for knowledge discovery and disease classification (e.g., heart disease, diabetes, and brain disease) using collected biomedical data, as surveyed in [15,16], which described several types of clinical applications built on deep learning frameworks and also noted some limitations and needed improvements. In particular, several predictive models based on neural networks have been designed for accurately classifying heart disease [17]. In recent works, convolutional neural networks (CNN) have been implemented for identifying different categories of heartbeats in ECG signals [18], and a modified deep convolutional neural network has been utilized for classifying ECG data into normal and abnormal [19]. Recurrent neural networks (RNN) have also been employed for predicting future disease using robust patient representations of EHR [20] and for modeling temporal relations among events in EHR data [21]. Another study used long short-term memory (LSTM) networks for predicting the risk of cardiovascular disease [22], and gated recurrent units (GRU) for vascular disease prediction [23]. Theoretical studies by Yang et al. [24] showed that bidirectional neural network models, such as bidirectional LSTM (BiLSTM) and bidirectional GRU (BiGRU), yield promising results because they overcome limitations on input data flexibility. This has been demonstrated in many Natural Language Processing (NLP) applications, such as accurate context-sensitive prediction [25] and the unified tagging solution suggested by [26] using the BLSTM-RNN algorithm.
Other applications have been studied for heart disease detection, such as sequence labeling for the extraction of medical events from EHR notes using a bidirectional RNN [27], and automatic classification of arrhythmias based on a bidirectional LSTM [28]. Another work presented a new recurrent convolutional neural network (RCNN)-based disease risk assessment from hospital big data [29].
One of the biggest challenges in healthcare analysis is the scarcity of collected EHR data for building accurate predictive models. The strong imbalance of the dataset distribution is another problem for healthcare analysis, and particularly for heart disease classification. To increase the adaptability and the precision of machine learning solutions, ensemble-learning models have been proposed to produce better diagnosis results, for example using the AdaBoost ensemble classifier for heartbeat classification [30] and hybrid models for cardiovascular disease detection [31,32]. Ensembles of neural networks have also been suggested for detecting abnormal phonocardiogram recordings using a feed-forward neural network without segmentation [33]. Another effective solution presented to tackle the combined challenge of class-imbalance learning and real data insufficiency is to balance the data using a new under-sampling scheme with a noise filter [34], or to use a resampling strategy and a time-decayed metric to overcome class imbalance online [35].
In this paper, we introduce an ensemble-learning framework based on neural network models for classifying different types of heart disease. The main objective of this research is to enhance prediction accuracy on an unbalanced heart disease dataset. Our methodology also presents a combined resampling technique of random and averaging under-sampling that adapts to the output class distribution. The conducted experiments provide comparative results between unidirectional and bidirectional neural network models and show that an ensemble-learning classifier combining BiLSTM or BiGRU with CNN has a stronger capability to correctly predict the different types of heart disease.
The rest of the paper is organized as follows. Section 2 discusses existing works for heart disease detection and classification techniques for medical applications. We provide a preliminary overview of the research methods in Section 3. Section 4 presents our proposed approach and the detailed implementation of our methodology. Section 5 describes the experimental dataset, followed by the evaluation metrics, hyperparameters setting, and presents comparative results. Section 6 concludes with a discussion about the proposed methodology results and presents suggestions for future enhancement.

Related Work
Heart disease is a general term that describes any condition affecting the heart and is indicated by chest pain and fatigue, abnormal pulse rate, and many other symptoms. There are many risk factors for diagnosing heart disease. Some risk factors are related to age, sex, and weight. Other risk factors are associated with lifestyle, such as smoking, high blood pressure, and having other diseases like diabetes and obesity. This variety of factors makes it hard for physicians to assess and diagnose the type of heart disease.
Artificial intelligence solutions have been suggested to analyze and classify EHR data for heart disease prediction [36], by designing standard classification models such as support vector machines (SVM), a priori algorithms, decision trees, and hybrid random forest models [37,38]. Heart failure prediction has been modeled by applying machine learning techniques to EHR data, reaching an AUC score of 77% using logistic regression with model selection based on the Bayesian information criterion [39]. Moreover, machine learning techniques have been proven to efficiently classify different types of medical data, such as magneto-cardiograph recordings using k-nearest neighbor and XGBoost techniques [40], or clustering multi-label documents in order to help find co-occurrences of heart disease with other diseases [41,42].
Advanced research in artificial intelligence has produced accurate systems for medical applications and designed computational tools to increase the reliability of predictive models when dealing with sensitive clinical data [43]. A new clinical prediction modeling algorithm has been developed to build heart failure survival prediction models, mostly applied to identify complex patterns in EHR data with diverse predictor-response relationships [44]. In this context, deep learning approaches have been efficiently implemented for cardiology applications [45] and for risk analysis of cardiovascular disease using an auto-encoder algorithm [46]. Another work suggested Multiple Kernel Learning with an Adaptive Neuro-Fuzzy Inference System (MKL with ANFIS) for heart disease diagnosis and produced high sensitivity (98%) and high specificity (99%) for the KEGG Metabolic Reaction Network dataset [47]. Recent applications implemented a convolutional neural network (CNN) and multilayer perceptron (MLP) for fetal heart rate record assessment and reached 85% accuracy [48]; a recurrent neural network (RNN) was also suggested for automatic detection of irregular beating rhythm in records, with 83% accuracy [49]. A long short-term memory (LSTM) network was used for atrial fibrillation classification from diverse electrocardiographic signals and reached 78% accuracy in [50] and a 79% F1 score in [51]. A pediatric heart disease screening application was also solved using a CNN model for the task of automatic structural heart abnormality risk detection from digital phonocardiogram (PCG) signals [52]. Moreover, a bidirectional neural network architecture was introduced to effectively improve the accuracy of heart disease applications using the BiLSTM-Attention algorithm, which reached better results (accuracy of 99.49%) than previously reported in the literature [53].
Other applications of deep learning have been presented in medical imaging, achieving state-of-the-art results [54], and many challenging tasks in biomedicine have been solved using neural networks [55]. A generative adversarial network (GAN) composed of BiLSTM and CNN models was suggested to address the problem of generating synthetic ECG data in order to enhance automated medical-aided diagnosis, and it showed high morphological similarity to real ECG recordings [56]. Other applications employed Natural Language Processing (NLP) to assist doctors in heart disease diagnosis, such as comprehensive learning models from electronic medical data using LSTM [57] and outpatient category classification according to textual content using an attention-based BiLSTM model [58].
Data mining models have been optimized by introducing new concepts like ensemble learning, which improved classification performance. For instance, a predictive ensemble-learning model was developed on different datasets to diagnose and classify the presence or absence of coronary heart disease, achieving promising accuracy that exceeded state-of-the-art results [59]. Ensemble learning was also applied by aggregating the predictions of different classifiers instead of training an individual classifier; one such application predicted coronary heart disease using bagged tree and AdaBoost algorithms [60]. On this basis, ensemble methods based on neural networks have been suggested to create more effective classification models and showed promising classification accuracy in [61,62]. An ensemble-learning model was also proposed for heart failure detection using an LSTM-CNN-based network [63,64].
Several factors might affect the performance of existing classification models applied to real data; one of them is the class imbalance of the training dataset. Models trained on such data are often biased toward the majority class and unable to generalize the learning. Therefore, balancing the data was suggested to produce a better definition of each class while training the models, using methods including SMOTE [65], Edited Nearest Neighbors (ENN), and Tomek links [66]. A selective ensemble-learning framework was proposed for ECG-based heartbeat classification and showed high classification performance on an imbalanced multi-category classification task [67].

Unidirectional Neural Network
Neural network architectures have been proposed as multilayer learning processes that help extract complex features from the input data. Dense neural network architectures were suggested for different types of grid-structured data, such as images and tabulated data. The first common architecture was the Convolutional Neural Network (CNN) [68], which applies convolution operations along the hidden layers and a maximum pooling layer to capture the position of the important features. This model showed high accuracy for computer vision and information retrieval applications. The CNN architecture assumes the independence of the input and output, which causes limitations for many tasks that process data sequentially, such as text translation and speech recognition. Therefore, another architecture, called the Recurrent Neural Network (RNN) [69], introduced recurrent units to preserve the semantic and syntactic dependencies of the data. This architecture added memorization to the learning and recursive propagation between the units. It also added a backpropagation step to update the weights and train parameters along the network.
Since the RNN units are sequentially interconnected, backpropagation requires going backward along the sequence to update the weights. The computation between the units may therefore become progressively larger or smaller, so the gradients may vanish toward zero or explode toward infinity. Consequently, the network cannot memorize information along the sequence, and the prediction may be limited to the most recent inputs. In this case, it is difficult for the model to know which direction to take to improve the loss function. To preserve long-term information and overcome this short-term limitation, another neural network architecture was introduced to solve the vanishing gradients problem of the RNN. The first such model was the long short-term memory (LSTM) network [70], which replaced the single unit with a complex unit composed of an input gate i, an output gate o, a memory cell c, and a forget gate f.
In fact, the LSTM passes the information between the gates: the input gate decides how much of the new input to take in after considering the memory cell, and the forget gate decides how much of the old memory to retain, as detailed in the following Equations (1)-(6):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)   (1)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)   (2)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)   (3)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)   (4)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (5)
h_t = o_t \odot \tanh(c_t)   (6)

Another neural network architecture was suggested to overcome the vanishing or divergent gradients problem of the RNN model. This architecture uses the same technique of gating the hidden state and is called the Gated Recurrent Units (GRU) network [71]. It is an improved version of the unidirectional RNN and uses an update gate u and a reset gate r. The update gate connects the current input and helps the unit determine how much past information to carry forward. The reset gate is the equivalent of the LSTM's forget gate: it decides how much of the past information to forget and is computed with the same formula as the update gate.
A candidate hidden state d was introduced in this network structure, where the current input is combined with the product of the reset gate and the previous hidden state, which determines what information to remove from the previous time steps, as described in the following Equations (7)-(10):

u_t = \sigma(W_u x_t + U_u h_{t-1} + b_u)   (7)
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)   (8)
d_t = \tanh(W_d x_t + U_d (r_t \odot h_{t-1}) + b_d)   (9)
h_t = u_t \odot h_{t-1} + (1 - u_t) \odot d_t   (10)
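As an illustrative sketch (not the paper's implementation), the update/reset/candidate gating scheme just described can be written in a few lines of NumPy; the parameter names W, U, and b are the usual input weights, recurrent weights, and biases:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step: update gate u, reset gate r, candidate state d."""
    W_u, U_u, b_u = params["u"]
    W_r, U_r, b_r = params["r"]
    W_d, U_d, b_d = params["d"]
    u = sigmoid(W_u @ x_t + U_u @ h_prev + b_u)        # how much past state to keep
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)        # how much past state to forget
    d = np.tanh(W_d @ x_t + U_d @ (r * h_prev) + b_d)  # candidate hidden state
    return u * h_prev + (1.0 - u) * d                  # new hidden state

# Tiny example with random weights (illustration only).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = {g: (rng.standard_normal((n_hid, n_in)),
              rng.standard_normal((n_hid, n_hid)),
              np.zeros(n_hid)) for g in ("u", "r", "d")}
h = gru_step(rng.standard_normal(n_in), np.zeros(n_hid), params)
```

Because the new hidden state is a convex combination of the previous state and a tanh-bounded candidate, its entries always stay in [-1, 1] when starting from a zero state.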

Bidirectional Recurrent Neural Network
In the traditional RNN architecture (and its modified architectures, including LSTM and GRU), information can only be passed forward, so each unit depends only on the previous unit. In some applications, such as speech recognition, text auto-completion, and machine translation, every moment carries context information, so predicting the current state requires information from both earlier and later time steps. Therefore, a bidirectional recurrent neural network architecture was introduced to treat all the inputs equally. It is composed of forward and backward hidden states \overrightarrow{h}_t and \overleftarrow{h}_t, as shown in the schematic Figure 1. Such a network may integrate two independent RNN, LSTM, or GRU models, where every input in the sequence is fed in time order to one network of sequential units, and in reverse time order to another network. The outputs of the two opposite-directional networks, forward (f) and backward (b), are concatenated at each time step to generate the final hidden layer h_t, which is used to generate the output layer o_t, as detailed in the following Equations (11)-(13):

\overrightarrow{h}_t = \phi(W_f x_t + U_f \overrightarrow{h}_{t-1} + b_f)   (11)
\overleftarrow{h}_t = \phi(W_b x_t + U_b \overleftarrow{h}_{t+1} + b_b)   (12)
o_t = g(V [\overrightarrow{h}_t ; \overleftarrow{h}_t] + b_o)   (13)

Data Balancing Methods
In real-life binary classification problems, the training dataset can contain a much higher proportion of samples from one class than from the other. A binary classification model trained on such an unbalanced dataset can be biased toward the largest class: the misclassification error is lower for the majority class, and most of the test-set samples end up being classified into the majority class. Such a model requires particular attention in the case of a real medical dataset, as the cost of predicting a disease for a healthy patient is a significant issue.
Balancing a small real dataset is the most common idea to solve the high imbalance of the output class distribution. Random sampling is one of the techniques that was suggested to balance the dataset. It is a non-heuristic method that comes in two ways: random over-sampling and random under-sampling. Random over-sampling aims to balance the dataset through the random replication of the minority class samples; random under-sampling works through the random elimination of the majority class samples. Other heuristic methods were also suggested to overcome the imbalance problem. Tomek links is a method that aims to eliminate the majority class samples according to the distance between a pair of samples [72]. Synthetic Minority Over-sampling (SMOTE) is also a common technique that applies an interpolation between several minority class samples in order to create new artificial minority class samples [73]. Condensed Nearest Neighbor Rule (CNNR) is an algorithm that aims to reclassify samples from the majority class to the minority class using a 1-nearest neighbor classification technique, and it eliminates the distant samples because they are irrelevant for the learning process [74].
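As a minimal sketch of the non-heuristic option described above, random under-sampling can be implemented directly with NumPy by keeping all minority samples and drawing an equal-sized random subset of the majority class:

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Balance a binary dataset by randomly dropping majority-class samples."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if c != minority:
            # Random elimination: keep only n_min majority samples.
            idx = rng.choice(idx, size=n_min, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Example: 90 majority (class 0) vs 10 minority (class 1) samples.
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)
Xb, yb = random_undersample(X, y)
```

In practice, libraries such as imbalanced-learn provide ready-made implementations (e.g., RandomUnderSampler) of this and of the heuristic methods (Tomek links, SMOTE) mentioned above.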

Features Selection Methods
A real dataset usually comes with a large number of features, which provide numerical and categorical indicators for each record. The features determine the quality of the training data, but some of them can be irrelevant or noisy. The following paragraphs detail methods that can be applied to determine the importance of the features with respect to the target variable.

Model-Based Selection Method
This data transformer fits an estimator model on the dataset; after fitting, the estimator provides a coefficient indicating each feature's importance. A feature is selected if its corresponding coefficient is higher than a certain threshold. The threshold value is either specified numerically or via built-in heuristics such as the mean, the median, or a composed value like (n x mean). Concerning the choice of the estimator model, there are different options, including linear models (known as L1-based models) and nonlinear models such as tree-based classifiers like decision trees or random forests.
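A minimal sketch of this transformer using scikit-learn's SelectFromModel with a tree-based estimator and the built-in "mean" threshold (the synthetic data stands in for the EHR table):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in: 200 records, 20 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# The tree-based estimator supplies feature_importances_; keep features
# whose importance exceeds the mean importance ("mean" threshold heuristic).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="mean")
X_reduced = selector.fit_transform(X, y)
```

Swapping the estimator for an L1-penalized linear model (e.g., LogisticRegression with penalty="l1") yields the L1-based variant mentioned above.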

Recursive Feature Elimination Method
This method is an extension of the model-based selection technique. It first fits an estimator to determine initial importance weights for the features, and then removes a portion of the features with the lowest importance weights from the dataset. The recursive feature elimination (RFE) approach recursively ranks the features, prunes the least important ones, and retrains the model on the remaining features until the model reaches a certain accuracy in predicting the target variable. The remaining features with the highest performance are finally selected.
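The recursive procedure above can be sketched with scikit-learn's RFE; the estimator, step size, and number of retained features here are illustrative choices, not the paper's settings:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 records, 20 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Recursively fit the estimator, drop the lowest-ranked feature each round
# (step=1), and refit until only n_features_to_select remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8, step=1)
rfe.fit(X, y)
X_reduced = rfe.transform(X)
```

After fitting, rfe.ranking_ assigns rank 1 to every selected feature and increasing ranks to the eliminated ones, in elimination order.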

Classification Models
We propose an ensemble-learning framework of two different neural network models, built from the basic deep architectures most widely explored for binary classification. Thus, we used CNN, LSTM, GRU, BiLSTM, and BiGRU for our heart disease binary classification task. The first proposed architecture is a CNN model, as represented in Figure 2. It consists of three different blocks of convolutional layers, followed by a dense layer with a ReLU activation function. To avoid overfitting, dropout was applied, which randomly removes units during training; a dropout layer was added after each convolutional layer with a certain rate. Moreover, to improve the model's generalization and further reduce overfitting, we applied L2-regularization to the dense layers of the model, which adds a penalty term to the loss during training. The second architecture is depicted in Figure 3, which illustrates an LSTM model composed of three different blocks of LSTM layers. The initial block uses a recurrent dropout rate that drops a fraction of the units in the linear transformation of the recurrent state. Before the last fully connected layer, we added a dense layer with L2-regularization to penalize the learning during optimization.

Two additional bidirectional neural network models were suggested in our framework: BiLSTM and BiGRU models based on the previous unidirectional LSTM and GRU models. The two architectures stack three blocks of bidirectional LSTM and bidirectional GRU, respectively, and they use a concatenation method by which the outputs of the forward and backward LSTM and GRU are combined. Equivalently to the unidirectional models, BiLSTM and BiGRU contain L2-regularized dense layers, dropout layers, and a final fully connected layer.
The choice of units' number per hidden layer was based on empirical studies and preliminary experiments to select the optimal hidden layer size.
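As a sketch of the described CNN architecture (three convolutional blocks, each followed by dropout, then L2-regularized dense layers), the following tf.keras code illustrates the structure; the filter counts, kernel sizes, and regularization strength are illustrative assumptions, not the paper's exact settings:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cnn(n_features, dropout_rate=0.4):
    """Three Conv1D blocks with dropout, then L2-regularized dense layers."""
    model = tf.keras.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, 3, activation="relu", padding="same"),
        layers.Dropout(dropout_rate),
        layers.Conv1D(32, 3, activation="relu", padding="same"),
        layers.Dropout(dropout_rate),
        layers.Conv1D(16, 3, activation="relu", padding="same"),
        layers.Dropout(dropout_rate),
        layers.Flatten(),
        layers.Dense(32, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-3)),
        layers.Dense(1, activation="sigmoid"),  # binary heart-disease label
    ])
    model.compile(optimizer="adadelta", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn(n_features=60)
```

The LSTM, GRU, BiLSTM, and BiGRU variants follow the same pattern, stacking three recurrent blocks (wrapped in layers.Bidirectional for the bidirectional models) in place of the convolutional blocks.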

Ensemble-Learning Framework
Esfahani et al. designed an ensemble classifier using different machine learning techniques and effectively predicted heart disease on a dataset of 303 patients [31]. The study showed the effectiveness of ensemble-learning models on a small medical dataset, but it was limited to traditional classifiers. Das et al. applied a similar framework by constructing a new neural network from three independent simple neural networks [61,62]. Another study assembled multiple neural network models combining textual, visual, and mixed predictions and achieved better performance on a medical captioning task [75]. However, that technique was based on averaging multiple predictions without taking into account the suitability of each neural network architecture for each class label.
We propose an ensemble-learning framework that trains two independent deep learning models on the same dataset, where the final prediction is the concatenation of the two different predictions, as shown in Figure 5. The framework aggregates two different classifiers, where each classification model is designated to a particular output of the binary class. The two classification models were selected so that each model had a higher classification performance for one of the output classes than the other. Preliminary experiments showed that our designed LSTM, GRU, BiLSTM, and BiGRU models performed better for the majority output class 0 (i.e., not having a heart disease type), whereas the proposed CNN model performed better for the minority output class 1 (i.e., having a heart disease type). However, due to the imbalance of the output class labels, the performance of the individual CNN model on the output class 1 was not high enough. Therefore, we propose a model-averaging method using a random under-sampling technique to balance the data evenly for every target i. The training was conducted over a number of balanced subsets, where the entire dataset was divided randomly into N balanced subsets according to the minority class distribution, as presented in Figure 6. N was chosen so that the division generated an equally distributed representation of the two output classes across all the targets (heart disease types), as presented in Equation (14). The model-averaging method, inspired by the bagging aggregation technique, was then used to train Model 2 (i.e., the CNN model) on the N subsets. Following studies [76,77] that showed improvement when combining predictions, the N single predictions were averaged into a final prediction that represents Prediction 2 in our ensemble-learning framework.
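A minimal sketch of the model-averaging scheme: the majority class is split into roughly minority-sized chunks (assuming N is approximately the majority-to-minority ratio, which is our reading of Equation (14)), a model is trained on each balanced subset, and the N predictions are averaged. A scikit-learn LogisticRegression stands in for the paper's CNN (Model 2); all names and data here are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def averaged_undersampled_model(X, y, seed=0):
    """Train one model per balanced subset, then average predicted probabilities."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)               # minority: has this disease type
    maj_idx = rng.permutation(np.flatnonzero(y == 0))
    n = len(min_idx)
    N = max(1, len(maj_idx) // n)                  # number of balanced subsets
    models = []
    for chunk in np.array_split(maj_idx[: N * n], N):
        idx = np.concatenate([chunk, min_idx])     # balanced subset
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))

    def predict_proba(X_new):
        # Final prediction = average of the N single predictions.
        return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)

    return predict_proba, N

# Example: 160 negative vs 40 positive records with 10 random features.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = np.array([0] * 160 + [1] * 40)
predict, N = averaged_undersampled_model(X, y)
probs = predict(X[:5])
```

Because every subset contains all minority samples paired with a fresh majority chunk, each of the N models sees a balanced class distribution, while no majority information is discarded overall.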

Experimental Results
The described models were trained on the heart disease dataset, where each model predicts the class label probabilities of a given patient record. Each model is trained to minimize the binary cross-entropy loss between the predicted and actual class labels. Experiments were conducted to select the optimal hyperparameters of the models and to compare the models across the subsets of the dataset and their targets. All outlined materials and models were implemented in open-source Python software.

Dataset
A collection of heart disease data was extracted from Medica Norte Hospital. It includes 800 patient records with several biological indicators, including age, gender, systolic and diastolic blood pressure, heart rate, etc., as described in Appendix A. The dataset presents four different types of heart disease, as described in Table 1: CARDIOPATIA_HIPERTENSIVA, CARDIOPATIA_ISQUEMICA, CARDIOPATIA_MIXTA, and CARDIOPATIA_VALVULAR.

Type 1 CARDIOPATIA_HIPERTENSIVA (Hypertensive Heart disease)
It refers to heart problems that occur due to high blood pressure that remains for a long time.

Type 2 CARDIOPATIA_ISQUEMICA (Ischemic Heart disease)
It is also called "coronary artery disease" and "coronary heart disease"; the term applies to cardiac problems caused by narrowed heart arteries.

Type 3 CARDIOPATIA_MIXTA (Mixed Heart disease)
It is known as combined systolic and diastolic heart failure.

Type 4 CARDIOPATIA_VALVULAR (Valvular Heart disease)
It occurs mainly in the elderly and is characterized by a narrowing of the aortic valve opening which increases resistance to blood flow.
Every type is represented with a binary target variable indicating whether a patient is diagnosed with the given type of heart disease or not. Because the types are different, the distribution of each target over the records is proportionally different and significantly unbalanced, as shown in Figure 7, where class 0 (i.e., a patient is not diagnosed with a heart disease type) is the majority class label compared to class 1 (i.e., a patient is diagnosed with a heart disease type).

Such a real dataset requires cleaning and preprocessing before training the binary classification models. Therefore, we removed two records that had missing values. To normalize the dataset, we transformed each categorical variable into a numerical variable using different methods. For example, the "Gender" feature is a categorical variable with no intrinsic ordering of its categories; a plain integer encoding is not suitable for it, because it would let the model assume a natural ordering between the categories, which may result in poor performance. In this case, a one-hot encoding was applied, as shown in Table 2, where the encoding removed the nominal variable "Gender" and created a new binary variable for each unique category: "Gender Female" and "Gender Male".

Moreover, the interval variable "Age" was categorized into subcategories based on proportionate intervals of 20 years. The age variable was thus transformed from 81 distinct values into a five-category feature that was later encoded as a numerical variable, as described in Table 3. The entire dataset was then normalized by applying the Min-Max normalization technique to scale each feature between its minimum and maximum. Hence, the binary/categorical features did not change, while the remaining numerical features were transformed into values between 0 and 1.
Our data collection contains a large number of noisy features, from which a neural network classification model may learn spurious correlations that affect the final performance. Data analysis showed that some features are so under-represented in the data that it is highly unlikely for a neural network model to learn meaningful information from them. Including these irrelevant features may lead to overfitting or may increase the computation inside the network while it determines the importance of the features. Hence, the original dataset was preprocessed with the different feature selection methods described in Section 3 before the data were fed into the neural network models. This generated several subsets with different numbers of features; Table 4 summarizes the three different subsets resulting from each method and their indexes according to Appendix A.

Evaluation Metrics
The experiments were conducted for binary classification, so results are reported using the evaluation metrics accuracy (ACC), precision (P), recall (R), and F1 score. These are computed from the elements of the confusion matrix, true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn), as explained in Table 5, and detailed in the following Equations (15)-(18):

ACC = (tp + tn) / (tp + tn + fp + fn)   (15)
P = tp / (tp + fp)   (16)
R = tp / (tp + fn)   (17)
F1 = 2 \cdot P \cdot R / (P + R)   (18)

To further evaluate the proposed models, we also used the AUC (Area Under the receiver operating characteristic Curve) score, which estimates the probability that a randomly chosen example of class 1 is ranked higher than a randomly chosen example of class 0. In general, the AUC lies between 0.5 and 1; the higher the AUC, the better the classification, with a maximum of 1.
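The four metrics are straightforward to compute from the confusion-matrix counts, as in this short helper (the example counts are arbitrary):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, f1

# Example confusion matrix: 80 tp, 10 fp, 20 fn, 90 tn.
acc, p, r, f1 = binary_metrics(tp=80, fp=10, fn=20, tn=90)
```

With these counts, accuracy is (80 + 90) / 200 = 0.85, precision 8/9, recall 0.8, and F1 = 16/19 (about 0.842).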

Hyperparameters Setting
The proposed classification models have a list of hyperparameters that includes the number of epochs, dropout rate, optimizer, and learning rate, among others. Considering their effect on model performance, only four hyperparameters were selected for tuning. The dataset was randomly split into a training set and a test set according to a 20-80% division. In each experiment, the neural network parameters were fixed while a single hyperparameter was varied. The experiments were reported using the accuracy score and the F1 score.

Number of Epochs
An epoch corresponds to one complete pass of the training set through the network to update the weights. As the number of epochs increases, the model becomes more capable of generalizing; however, using too many epochs may cause overfitting on the training set, so the model performs poorly on the validation or test set. Therefore, finding the optimum number of epochs is important. Figure 8 shows the result of varying the number of epochs for the targets of the original dataset, where the classification performance increases significantly with the number of epochs. The model tends to become stable after epoch 70 for Target 1, 60 for Target 3, and 50 for Targets 2 and 4.

Dropout Rate
Dropout is a regularization technique used to avoid overfitting and ensure model generalization. It randomly ignores a number of layer outputs, so that each training pass is performed with a different subset of units and the effective network capacity is reduced during the training phase. The dropout layer is added with a fraction of the input units to drop during training, within a range from 0 to 1. Figure 9 shows that the model reaches its highest performance with a dropout rate of 0.4 for all targets.
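The mechanism can be sketched as inverted dropout in NumPy, where each unit is zeroed with the selected probability of 0.4 during training and the surviving units are rescaled. This is an illustrative implementation, not the framework's actual layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.4, training=True):
    """Inverted dropout: zero a fraction `rate` of units during training,
    and rescale the survivors so the expected activation is unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate   # keep each unit with prob. 1 - rate
    return x * mask / (1.0 - rate)

activations = np.ones((1000,))
dropped = dropout(activations, rate=0.4)
```

At inference time (`training=False`) the layer is an identity, so no rescaling is needed.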

Optimizer
Optimization algorithms minimize the loss function of the model by updating the weights through gradient descent until it converges to its most accurate outcome. Four popular optimizers were selected to train the model for the different targets: Adam, Adadelta, RMSprop, and SGD. Figure 10 shows that the best optimizer is Adadelta, as it gives the highest classification performance for all targets.

Learning Rate
The optimization of the weights is controlled by the learning rate. Therefore, after selecting Adadelta as the optimizer, its learning rate was varied between 0 and 2. A very large learning rate may cause the model to overshoot the optimal point, while a very small one may make training slow. Figure 11 shows that the classification performance of the model becomes stable when the learning rate is 1. This is consistent with the literature, which recommends keeping the default value of 1, since the optimizer uses an adaptive learning rate.

Figure 11. Relationship between learning rate and performance metrics.

Classification Performance of the Single Model
Before conducting experiments for the proposed ensemble-learning framework, we first report the performance of the single models using the original dataset and the best selected hyperparameters. Table 6 summarizes the empirical results for the different output targets and demonstrates the low performance of the individual models: the F1 and ACC scores reached maxima of around 87% and 81% for Target 1, 81% for Target 2, 87% and 80% for Target 3, and 87% and 82% for Target 4. The AUC scores reached a maximum of 79% for Target 1 and 82% for Target 2 using the BiGRU model, 88% for Target 3 using the BiLSTM model, and 85% for Target 4 using the BiGRU model. The results also show that the MLP model had lower classification performance than the CNN model, and that the bidirectional recurrent neural network models outperformed the unidirectional ones.

Comparison of the Classification Performance for Different Subsets
To select the optimal feature selection method among those proposed, the original dataset and the different generated subsets were compared using one ensemble-learning model combining the CNN and LSTM models. The results for the output targets are reported in Table 7. Table 7 shows that Subset 1 reaches the highest classification performance. It contains the best 60 selected features and performed well overall for the four output targets. This subset was produced by the model-based selection technique using a tree-based (Random Forest) classifier, which determined the importance of the features toward the output targets. Unlike the recursive feature elimination (RFE) method, which iteratively trains the model and eliminates features one by one until a certain accuracy is achieved, our best method filters the important features by training the model only once. Subset 1 includes some features that were not selected by the other methods, such as Glycemic load, Increase in triglyceride concentration, Diabetes mellitus type 2, and Chronic renal insufficiency (see Appendix A). This indicates that these features, missing from the other subsets, are strong indicators for predicting the different heart disease types.
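A single-pass model-based selection of this kind can be sketched with scikit-learn, assuming a random-forest importance ranking and synthetic data standing in for the real 141-feature dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the 141-feature, binary-target dataset.
X, y = make_classification(n_samples=200, n_features=141,
                           n_informative=20, random_state=0)

# Single-pass model-based selection: fit the tree ensemble once and keep
# the 60 features with the highest importances (as for Subset 1).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    max_features=60, threshold=-np.inf)
X_subset = selector.fit_transform(X, y)
```

Setting `threshold=-np.inf` with `max_features` makes the selector keep exactly the top-ranked features, rather than all features above an importance cutoff.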

Comparison of the Classification Performance for Different Models
In order to validate the effectiveness of the proposed ensemble-learning framework, different neural network models were varied and tested for the different targets and compared with an SVM baseline model. Because two models must be selected for the final ensemble-learning model, Table 8 summarizes the experimental results of varying the neural network models in the ensemble-learning framework using the selected Subset 1. Table 8 indicates that the F1 and ACC scores of Targets 1, 2, and 4 are highest using the CNN and BiLSTM models, whereas for Target 3 the combination of the CNN and BiGRU models gives the best F1 and ACC scores. The classification performance of LSTM and GRU is the lowest, with a maximum F1 score of around 89% for Targets 1 and 2 and around 94% for Targets 3 and 4. With BiLSTM and BiGRU, which capture the information more effectively, the F1 and ACC scores reach around 92% and 94%. By using bidirectional recurrent neural network models instead of unidirectional ones, the F1 and ACC scores improve significantly, reaching around 93% and 92% for Target 1, 92% for Target 2, 95% and 96% for Target 3, and 94% and 91% for Target 4. The proposed ensemble-learning framework joins two different models, each designated for one class. The CNN and MLP models performed well for output class 1, while the LSTM, GRU, BiLSTM, and BiGRU models performed well for output class 0. This difference in performance can be explained by the fact that the CNN and MLP models were designed to create a complex feature mapping through the connectivity pattern between neurons, which appears suitable for the minority output class 1, whereas the recurrent neural network models (LSTM, GRU, BiLSTM, and BiGRU) were designed to benefit from the sequential dependencies between the features, which suits the majority output class 0.
The proposed ensemble-learning framework, enhanced by the balancing method, benefits from combining these two characteristics, creating a new model with better performance for both output classes. Experimental results showed that the CNN model outperformed the MLP model, which can be explained by the fact that a CNN is more effective as it can go deeper: its layers are sparsely connected rather than fully connected as in an MLP, and its architecture optimizes the number of parameters by sharing weights while conserving enough information. The BiLSTM and BiGRU models outperformed the unidirectional LSTM and GRU models, validating the benefit of learning from the sequential order of the features to correctly predict the output class. The bidirectional models profit from making the prediction for the current state using information from both earlier and later time steps; compared to the unidirectional models, they exploit the contextual information of the features in both directions. The experiments also showed that our proposed models outperform a standard baseline such as SVM, which performed poorly for all targets with a maximum ACC score of 68%. The comparison of the different models is illustrated in Figure 12, where each subplot shows the AUC plot of one target (i.e., a heart disease type). As illustrated in Figure 12, the SVM baseline has the lowest AUC score, 0.5, compared to the proposed ensemble-learning framework with its different associations of neural network models. This can be explained by the limited capacity of such a simple model to learn the complex patterns needed to classify such a real dataset.
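One simple way to aggregate the two member models, shown here only as an illustrative sketch rather than the paper's exact aggregation rule, is to average their predicted class-1 probabilities and threshold the result.

```python
import numpy as np

def ensemble_predict(p_cnn, p_rnn, threshold=0.5):
    """Average the per-record class-1 probabilities of the two member
    models (e.g., a CNN and a BiLSTM) and threshold the average to
    obtain the final binary prediction."""
    p = (np.asarray(p_cnn) + np.asarray(p_rnn)) / 2.0
    return (p >= threshold).astype(int)

# Hypothetical per-record probabilities from the two members.
labels = ensemble_predict([0.9, 0.2, 0.6], [0.7, 0.4, 0.3])
```

With these toy inputs, the averaged probabilities are 0.8, 0.3, and 0.45, so only the first record is assigned class 1.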
The subplots provide a comparative view of the performance of the different model combinations for each target variable and indicate that our proposed method achieved the highest performance, with a maximum AUC score of 0.97 for Target 1, 0.98 for Target 2, 0.99 for Target 3, and 0.96 for Target 4.

Discussion and Conclusions
In this paper, we showed that binary classification of a heart disease dataset is challenging in terms of size, distribution of records, and number of important features. We demonstrated that feature selection techniques succeeded in properly reducing a large number of features. The goal of this work is to classify four types of heart disease; however, each type has a very unbalanced distribution, due to the large proportion of collected records from patients not diagnosed with any type of disease. Therefore, our proposed framework applies a technique of random under-sampling and averaging in order to generate proportionally balanced datasets for building the classification models.
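The under-sampling step can be sketched as follows, assuming the simple scheme of keeping all minority-class records and drawing several random, equally sized samples of the majority class; each balanced subset would then train one model whose predictions are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_subsets(X, y, n_subsets=5):
    """Build several balanced training subsets: keep every minority-class
    record and randomly under-sample the majority class to the same size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, sampled])
        subsets.append((X[idx], y[idx]))
    return subsets

# Toy data with ~17% positives, as for the hypertensive target.
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.array([1] * 17 + [0] * 83)
subsets = balanced_subsets(X, y)
```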
We tackled the classification problem by designing different neural network models. The proposed methodology framework includes unidirectional neural networks, covering convolutional and recurrent architectures. Drawing on text mining approaches [25], we applied bidirectional recurrent neural networks, which outperformed the unidirectional models. Experiments were conducted to select the best models for every output class label, since some models were more suitable for one class label than for the other. This finding opens the way to intelligently incorporating the models into an ensemble-learning framework that concatenates the prediction results for each class into a general prediction result.
The framework of this work was assessed on different versions of the initial dataset and with different combinations of models. This served as a validation of the feature selection methodology and of the neural network model design. Compared to the works [78,79], our framework does not penalize records in the dataset; instead, it creates an artificial balance and uniformity in the training dataset. The designed models nevertheless achieved high classification performance, despite the quality of our structured dataset.
Moreover, this work does not use a benchmark dataset to validate the proposed methodology, and the classification performance of the suggested models may change noticeably if the size of the dataset or the distribution of records changes. While the feature selection depends on the target variables, the original number of features could be robustly reduced for a larger dataset without involving the independent variables. We presented a comparison of the proposed methodology with SVM as a baseline model, showing its low classification performance against the neural network models.
The significance of this work for the clinical domain rests on the ability of advanced deep learning models to adapt to a small collection of patient records. The suggested methodology framework connects every category of records to the appropriate model and then concatenates their results into a final diagnosis. This approach is able to achieve accurate diagnosis even with an unbalanced distribution of health records.
Future work may include implementing other neural network designs, such as Generative Adversarial Networks (GAN) or attention-based recurrent neural networks. This would give better insight into the prediction behavior for the different types of heart disease. Moreover, converting the binary classification of each output target into a multi-label classification task may be considered for our dataset, in order to address its imbalance while maintaining its structure.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Index of Variables and Descriptions

CARDIOANGIOESCLEROSIS
It consists of an increase in collagen fibers and acid mucopolysaccharides and a decrease in elastic and smooth muscle fibers. These alterations produce an arterial ectasia with loss of elasticity, as can be seen in the aorta of the elderly.

HIPOTIROIDISMO
Occurs when the thyroid gland does not produce enough thyroid hormone to meet the body's needs.

HIPOTIROIDISMO_SUBCLINICO
It is an alteration in the function of the thyroid gland that causes few or very nonspecific symptoms and is detected in a blood test when TSH values are high (above the laboratory reference range) but free T4 is normal.

BRADICARDIA
It consists of the emission, by the sinus node, of fewer than 60 bpm (beats per minute), or its total lack of function, in which case the rhythm is taken over by the atrioventricular node, at approximately 45-55 bpm.

TAQUICARDIA_SINUSAL
It consists of a heart rhythm originating and conducted normally, but with a heart rate higher than usual. It is physiological and occurs due to anxiety, exercise, anemia, alcohol consumption, heart failure, or nicotine. In general, it does not require specific treatment, but the cause must be addressed: quitting tobacco, correcting anemia, etc.

ANGOR_PECTORIS
It is a pain, usually oppressive in character and located in the retrosternal area, caused by insufficient blood (oxygen) supply to the cells of the heart muscle, due to disease of the coronary arteries.

INSUFICIENCIA TRICUSPIDEA
Its most frequent cause is dilatation of the right ventricle. It does not usually produce signs or symptoms, although severe tricuspid regurgitation can cause cervical pulsations, a holosystolic murmur, and heart failure induced by right ventricular dysfunction or atrial fibrillation.

INSUFICIENCIA_MITRAL
Also known as mitral regurgitation, it is a disorder of the mitral valve of the heart, characterized by reflux of blood from the left ventricle to the left atrium during systole.

INSUFICIENCIA_VENOSA_LINFATICA
It is a condition in which the veins have problems returning blood from the legs to the heart.

INSUFICIENCIA_VENOSA_PERIFERICA
It is a disease in which venous return is impaired, especially when standing, and in which venous blood flows back in the direction opposite to normal; in the case of the lower limbs, it circulates from the deep venous system to the superficial one.

FA_PAROXISTICA
It is the most frequent sustained cardiac arrhythmia.

HIPOTENSION_ORTOSTATICA
It is defined as the inability of the body to regulate blood pressure quickly. It is produced by sudden changes in body position (from lying down to standing). It usually lasts a few seconds or minutes and can cause fainting.

HIPOTENSION_POSTURAL
It is a form of low blood pressure that occurs when you stand up after sitting or lying down. Orthostatic hypotension can make you feel dizzy or lightheaded, and you may even faint.

ANSIEDAD
Mental state characterized by great restlessness, intense excitement, and extreme insecurity.

SX_METABOLICO
It is a group of conditions that increase the risk of developing heart disease and type 2 diabetes. These include high blood pressure and high blood glucose (a type of sugar).

ANGIODISPLASIA_INTESTINAL
It is a small malformation that causes dilation and vascular fragility in the colon, resulting in intermittent blood loss from the intestinal tract. The lesions are often multiple and frequently involve the cecum or the ascending colon, although they can occur in other areas.

ANEMIA
A syndrome characterized by an abnormal decrease in the number or size of red blood cells in the blood or in the hemoglobin level.

ANEMIA_CRONICA
It is a type of anemia found in people with certain long-term (chronic) conditions that involve inflammation.

SOPLO
The heart has valves that close with each heartbeat, causing the blood to flow in only one direction. The valves are located between the chambers. Murmurs can happen for many reasons, for example when a valve does not close well and blood flows backward (regurgitation).

TROMBOFLEBITIS
It is an inflammatory process that causes blood clots to form, producing blockages in one or more veins, generally in the legs. The affected vein may be near the surface of the skin (superficial thrombophlebitis) or deep within a muscle (deep vein thrombosis).

TROMBOSIS
Colloquially called a blood clot, it is the final product of the coagulation stage of hemostasis. A thrombus has two components: aggregated platelets and red blood cells that form a plug, and a mesh of cross-linked fibrin protein.

ATEROMATOSIS
It is a multifactorial inflammatory process that affects the wall of the arteries, has a long clinical evolution, and manifests itself at highly advanced stages by causing a cardiovascular event.

ESTEATOSIS
Fatty liver disease.

FIBROSIS_PULMONAR
It is a condition in which the lung tissue scars and, therefore, becomes thick and stiff. This causes difficulty breathing, and the blood may not receive enough oxygen.

COLITIS
It is an inflammatory bowel disease that causes lasting inflammation and ulcers (sores) in the digestive tract.

HIPERCOLESTEROLEMIA
An increase above the normal amount of cholesterol in the blood.

FARINGOAMIGDALITIS
It is an acute infection of the pharynx or the palatine tonsils. Symptoms may include sore throat, dysphagia, cervical lymphadenopathy, and fever. The diagnosis is clinical, complemented by culture or rapid antigen testing.

DIVERTICULAR
It is a condition in which small pouches or sacs form and push outward through weak points in the wall of the colon. Diverticulosis can also cause problems such as diverticular bleeding and diverticulitis.

ANEURISMA_RENAL
The etiology is usually related to fibromuscular dysplasia, arteriosclerosis of the renal artery, which may be congenital, associated with arteritis, or with a traumatic history.

EPOC
Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory disease of the lungs that obstructs the flow of air from the lungs. Symptoms include shortness of breath, cough, mucus production (sputum) and whistling when breathing.

ACALASIA_ESOFAGICA
A ring of muscle surrounds the lower end of the esophagus where it meets the stomach; this is called the lower esophageal sphincter (LES). Normally, this muscle relaxes when you swallow to let food pass into the stomach. In people with achalasia, this muscle ring does not relax well. In addition, the normal muscular activity of the esophagus (peristalsis) is reduced.

CANDIDIASIS_ESOFAGICA
Candidiasis refers to infection of the oral mucosa; Candida is responsible for the majority of oral fungal infections, and C. albicans is the main causative species. The infection can spread to the esophagus, causing esophageal candidiasis.