A Hybrid Intelligent Approach to Predict Discharge Diagnosis in Pediatric Surgical Patients

Abstract: Computer-aided diagnosis is a research area of increasing interest in third-level pediatric hospital care. The effectiveness of surgical treatments improves with accurate and timely information, and machine learning techniques have been employed to assist practitioners in making decisions. In this context, predicting the discharge diagnosis of new incoming patients could make a difference for successful treatments and optimal resource use. In this paper, a computer-aided diagnosis system is proposed to provide statistical information on the discharge diagnosis of a new incoming patient, based on the historical records of previously treated patients. The proposed system was trained and tested using a dataset of 1196 records; the dataset was coded according to the International Classification of Diseases, version 10 (ICD10). Among the processing steps, relevant features for classification were selected using the sequential forward selection wrapper, and outliers were removed using density-based spatial clustering of applications with noise (DBSCAN). Ensembles of decision trees were trained with different strategies, and the highest classification accuracy was obtained with the extreme gradient boosting (XGBoost) algorithm. A 10-fold cross-validation strategy was employed for system evaluation, and performance was compared in terms of accuracy and F-measure. Experimental results showed an average accuracy of 84.62%, and the resulting decision tree, learned from the sample records, made it possible to visualize suitable treatments related to the historical record of patients. According to computer simulations, the proposed classification approach using XGBoost provided higher classification performance than other ensemble approaches; the resulting decision tree can be employed to inform possible paths and risks according to previous experience learned by the system. Finally, the adaptive system may learn from new cases to increase the accuracy of its decisions through incremental learning.


Introduction
Computer-aided diagnosis (CAD) systems have been proposed to solve medicine and biology problems since the late 1950s [1]. In a variety of health institutions, current clinical practice includes the daily use of computer-based tools. However, such systems still present challenges in their clinical, regulatory, and algorithmic aspects [2]. Regarding the algorithmic aspects, recent studies focus on the use of artificial intelligence and machine learning techniques to diagnose diseases based on patients' historical records [1]. Table 1 presents several publications that apply machine learning methods to solve classification problems. Although the results in Table 1 cannot be fairly compared due to differences in experimental settings and classification problems, good results are reported with distinct approaches. Among the classifiers employed in CAD systems, neural network approaches have been employed more often in medical imagery, and other approaches are spread over different applications. A relevant result obtained with AdaBoost reveals a significant reduction in the false-positive classification rate [19]; a decrease in the false detection of breast cancer is relevant to reduce the unnecessary costs derived from supplementary exams. Another interesting result is the use of XGBoost to predict patient revisits based on the historical records of patients [6]. A recent survey on computer-aided diagnosis confirms the no-free-lunch theorem in CAD systems (no single algorithm can be applied to all aspects of CAD) and identifies decision trees (DTs) as representative supervised approaches employed for classification [1]. Deep learning approaches are also employed for classification in CAD systems. Their drawback is that they require large amounts of training data to build accurate models, which is problematic for pediatric patients with a short clinical history.
Nevertheless, the low stability of DTs under small changes in the training set makes this approach hard to tune; the strategy employed to reduce such instability consists of using ensemble learning [16].

Table 1. Representative publications that report the use of machine learning approaches for medical diagnosis.

| Problem | Proposal | Technique | Results | Validation | Reference |
|---|---|---|---|---|---|
| Classification of unhealthy and healthy neonates | The authors introduced a system for classifying unhealthy and healthy neonates in neonatal intensive care units using medical thermography processing and artificial neural networks (ANN). | ANN | Accuracy: 98.42% | 10-fold cross-validation | Savasci et al. [20] |
| Disease diagnosis (tuberculosis, malaria, and intestinal parasites) | The authors introduced a system based on convolutional neural networks (CNN) for disease diagnosis from microscopic images. | CNN | | | Rajagopal [25] |
| Voice pathology detection | The authors presented a method combining density clustering and support vector machines for voice pathology detection. | SVM and DBSCAN | Accuracy: 98.00% | Cross-validation | Amami and Smiti [26] |
| Disease diagnosis (heart disease, Parkinson's disease, and BUPA liver disorder) | A hybrid system for disease diagnosis is proposed, composed of a new method entitled k-medoids clustering-based attribute weighting (kmAW) as a data preprocessing method and an SVM in the classification phase. | SVM and kmAW | Accuracy: 98.95% | 10-fold cross-validation | Peker [9] |
| Predicting patient revisits | This study focuses on the predictive identification of patients frequently revisiting the University of Virginia Health System Emergency Department. The authors proposed the use of the XGBoost algorithm to predict the risk of revisit. | XGBoost | | | [6] |
| | The CAD system is based on feature selection and ensemble learning. Compared with other methods [30], the proposed method significantly reduces the false-positive classification rate. | AdaBoost | AUC: 0.9617 * | 10-fold cross-validation | Lu et al. [19] |

* The receiver operating characteristic (ROC) is a probability curve, and the area under the ROC curve (AUC) represents the degree or measure of separability.
Regarding the limitations of current systems, the works reported in Table 1 provide information about the diagnosis of a given condition; some of them use images to describe the characteristics of some pathologies, and others describe demographic data. However, the works reported do not consider characteristics such as the complications that occurred during treatment and relevant factors related to the patient's comorbidities, which could have led to poor patient outcomes.
This research paper is a retrospective study in which a computer-aided diagnostic system is proposed to predict the discharge diagnosis of pediatric patients undergoing surgical procedures in a third-level pediatric hospital in Peru. According to the hospital requirements, the system predicts one of three discharge diagnoses: (1) deceased; (2) unhealthy; and (3) healthy. These three discharge diagnoses are those employed in the current admission analysis made for every new patient, based on the historical record and new observations. Five modules are proposed to evaluate patients' records. In the first module, the data are encoded. In the second module, missing data are completed using imputation methods. In the third module, relevant features are selected with a wrapper algorithm, and non-relevant records (outliers) are filtered out to reduce noisy samples. In the fourth module, ensembles of decision trees are trained using XGBoost to classify and present the results. In the fifth module, the system is evaluated on a dataset composed of historical records from pediatric patients, following a 10-fold cross-validation process. The proposed approach is compared to distinct classifier ensembles in terms of Accuracy and F-measure.

Materials and Methods
The study was conducted at the Hospital Nacional San Bartolomé in Lima, Perú, a category III-1 hospital that is part of the public health service [31]. Due to the retrospective nature of the study, informed consent was waived. The sections below describe the data from clinical records and the experimental methodology employed to evaluate the performance and properties of the proposed computer-aided diagnosis system.

Data Acquisition
A database with 1205 medical records of pediatric patients was available. Each patient was diagnosed at hospital admission, and their condition was coded according to the International Classification of Diseases, version 10 (ICD10) standard [32]. Additional information in patients' records includes a medical history and a detailed diagnosis before and after hospitalization. Table 2 presents the complete list of characteristics considered for classification. Whereas ordinal features are suitable for thresholding, categorical data are restricted to a number of categories that are represented with names or tags.

Experimental Methodology
The process followed to evaluate the system is summarized in the five phases shown in Figure 2; this process is based on the general structure of a computer-aided diagnostic system (see Figure 1). The first phase encodes the data by assigning numerical labels to the categories. In the second phase, the missing information in the medical records is completed using imputation methods. The third phase filters the data by removing noise and performing a feature selection process prior to classification. The fourth phase is the classification of the data, which includes optimizing the hyperparameters of the selected classification methods and their subsequent training using the cross-validation technique. Finally, the results are evaluated using performance metrics. The five phases of the methodology are detailed in the following sections.

Pre-Processing Data
The preprocessing phase is designed to prepare the medical records to be suitable for the central processing of the proposed system by encoding data and completing missing information.

Data Encoding
The data coding process consists of two steps: (1) reading the input attributes and categorizing their values; (2) creating a dictionary for each attribute by assigning numerical codes to the incoming categorical data. The first step receives input data in text format and extracts the distinct attribute values by eliminating repetitions. In the second step, the attribute values are converted to numerical data so that the classification algorithms can process them. For example, consider an input attribute called "Provenance" with values (Piura, Tumbes, Tumbes, Piura, Comas); applying the first step of the encoding process yields (Piura, Tumbes, Comas), and applying the second step yields (Piura = 1, Tumbes = 2, Comas = 3).
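The two encoding steps can be illustrated with a minimal Python sketch. The function names `build_dictionary` and `encode` are hypothetical and chosen only for this illustration; the sketch assumes that codes are assigned in order of first appearance, as in the "Provenance" example.

```python
def build_dictionary(values):
    """Steps 1 and 2: drop repetitions (keeping first-seen order) and
    assign consecutive integer codes starting at 1."""
    unique_values = list(dict.fromkeys(values))
    return {value: code for code, value in enumerate(unique_values, start=1)}


def encode(values, dictionary):
    """Replace every categorical value with its integer code."""
    return [dictionary[value] for value in values]


provenance = ["Piura", "Tumbes", "Tumbes", "Piura", "Comas"]
provenance_codes = build_dictionary(provenance)
encoded_provenance = encode(provenance, provenance_codes)
```

Keeping one dictionary per attribute makes it possible to encode new incoming records with the same codes that were used for training.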

Data Completion
As a consequence of manual data retrieval, several records were found to be incomplete. The medical records present 8.6% of missing values among the eight numerical features and four categorical features (see Table 2). In the pattern recognition literature, the incomplete data problem is commonly addressed by deleting incomplete cases, using models to estimate the data distribution or classifier parameters, or estimating missing data through imputation methods.
In order to complete missing information from medical records, the following two methods were selected:
1. The K-nearest neighbors imputer employs the K-nearest neighbor (KNN) classifier with Euclidean distance, as shown in Equation (1), to describe the similarity between the incomplete record (x_i) and other records nearby (x_j) [33,34]:

$$\mathrm{dist}(x_i, x_j) = \sqrt{\sum_{m=1}^{n} (x_{im} - x_{jm})^2} \tag{1}$$

where dist(x_i, x_j) is the Euclidean distance between the encoded medical records; and n symbolizes the number of features. The missing value at record i is then estimated as a weighted combination of the nearest records, where k is the number of samples selected; X_k is the input matrix for the k-th record; and W_k is the k-th similarity weight.
2. The simple imputer method employed is based on the mean, as described by Buuren and Groothuis-Oudshoorn [35]. The simple imputer replaces the missing values with the mean value of the missing feature, considering all records in the training data, according to:

$$\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} X_{ij} \tag{4}$$

where values in matrix X are the observations of each feature in the medical record; and N is the number of records used for training.
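Both imputers can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the study: missing values are represented by `None`, distances are computed only over the features observed in both records, and the similarity weights are assumed to be inverse distances, which the text does not specify. The function names and the toy records are hypothetical.

```python
import math


def mean_impute(records):
    """Simple imputer: replace each missing value (None) with the mean of
    the observed values of that feature."""
    n_features = len(records[0])
    means = []
    for j in range(n_features):
        observed = [r[j] for r in records if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_features)]
            for r in records]


def distance(a, b):
    """Euclidean distance over the features observed in both records."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def knn_impute(records, k=2):
    """KNN imputer: estimate each missing value from the k nearest records
    that observe the feature, weighted by inverse distance (an assumption)."""
    completed = [list(r) for r in records]
    for i, record in enumerate(records):
        for j, value in enumerate(record):
            if value is not None:
                continue
            donors = [r for m, r in enumerate(records)
                      if m != i and r[j] is not None]
            donors.sort(key=lambda r: distance(record, r))
            nearest = donors[:k]
            # The small constant avoids a division by zero for identical records.
            weights = [1.0 / (distance(record, r) + 1e-9) for r in nearest]
            completed[i][j] = (sum(w * r[j] for w, r in zip(weights, nearest))
                               / sum(weights))
    return completed


# Toy encoded records; the second record misses its third feature.
records = [
    [0.0, 0.0, 10.0],
    [1.0, 0.0, None],
    [3.0, 0.0, 30.0],
    [9.0, 9.0, 90.0],
]
mean_completed = mean_impute(records)
knn_completed = knn_impute(records, k=2)
```

In this toy example the mean imputer fills the gap with the global mean of the feature, which is pulled up by the distant fourth record, whereas the KNN imputer only uses the two closest records.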

Main Processing
The central processing phase was designed to produce discharge diagnosis predictions from numerical samples prepared in the preprocessing stage.

Data Filtering
Data filtering is a two-step process that includes selecting relevant features for classification and removing noisy data. Feature selection techniques can be divided into three categories: filter, wrapper, and embedded methods. Among these three categories, wrapper methods have the advantage of accounting for feature dependencies and of being guided by classifier performance [36]. Additionally, unlike embedded methods, the classifier can be replaced by any other available once the most relevant features are found.
In the proposed system, the sequential forward selection (SFS) wrapper method was employed to select the most relevant features [37,38]. In order to remove noisy records that are likely to affect classification performance negatively, medical records were filtered using density-based spatial clustering of applications with noise (DBSCAN). The DBSCAN algorithm finds core samples in high-density regions and expands clusters from them, which yields clusters of similar density [26]; records that do not belong to any cluster are treated as noise.
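The two filtering steps can be sketched as follows. This is a simplified illustration, not the code used in the study: `toy_score` is a hypothetical stand-in for the cross-validated performance of a classifier trained on a feature subset, the toy gains are invented for the example, and `dbscan_noise` only identifies the records that DBSCAN would label as noise (records that are neither core samples nor within `eps` of a core sample) without building the clusters. The Manhattan distance is assumed here.

```python
def sequential_forward_selection(features, score, n_select):
    """Greedy wrapper: start from an empty set and repeatedly add the
    feature whose inclusion gives the highest score."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def manhattan(a, b):
    """Minkowski distance with p = 1."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dbscan_noise(points, eps, min_samples, dist=manhattan):
    """Return the indices DBSCAN would label as noise: points that are
    neither core points nor within eps of a core point."""
    # Each neighborhood includes the point itself (distance 0).
    neighbors = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_samples}
    return [i for i, nb in enumerate(neighbors)
            if i not in core and not any(j in core for j in nb)]


# Hypothetical per-feature gains standing in for classifier performance.
toy_gain = {"ICD10": 0.5, "Complications": 0.3, "D_Hospital": 0.1}


def toy_score(subset):
    return sum(toy_gain[f] for f in subset)


selected = sequential_forward_selection(list(toy_gain), toy_score, n_select=2)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
noise = dbscan_noise(points, eps=2, min_samples=3)
filtered = [p for i, p in enumerate(points) if i not in noise]
```

In practice the score function would train and validate the chosen classifier on each candidate subset, which is what makes wrapper methods more expensive than filter methods.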

Classification with XGBoost
XGBoost was employed to train the ensemble of decision trees for discharge diagnosis prediction. Ensembles of classifiers take advantage of the diversity of opinions among weak classifiers to produce more accurate and, commonly, more stable classification performance. A decision tree (DT) is a predictive model commonly used as a weak classifier in ensembles, in which conjunctions of features are represented in the branches, and conclusions or decisions (class labels) are represented in the leaves. XGBoost is commonly employed for training decision trees, where an objective function is defined for the supervised learning of a model [39]. Training consists of searching for the best parameters θ that fit the training data x_i and the labels y_i, see Equation (5):

$$\mathrm{obj}(\theta) = L(\theta) + \Omega(\theta) \tag{5}$$

where L is the training loss; and Ω is the regularization term. L measures how predictive the model is regarding the training data, and its equation is given by Equation (6):

$$L(\theta) = \sum_{i} l(y_i, \hat{y}_i) \tag{6}$$

where l(y_i, ŷ_i) is the loss between the label y_i and the prediction ŷ_i. On the other hand, the regularization term controls the complexity of the model and is given by

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \tag{7}$$

where T is the number of end nodes of the tree, ω is the vector of scores on end nodes, and γ and λ are regularization coefficients. The solution is provided by trees that are constructed sequentially and learn from each tree's predecessors. The learning scheme is called additive training: the functions f_k contain the structure of the tree and the end nodes' scores. Learning the tree structure is hard because all the trees cannot be learned at once; instead, one tree is added at a time, and a prediction value is obtained at each step t as ŷ^(t) according to Equation (8):

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{8}$$

Optimizing performance in decision trees includes finding the maximum depth to prune the trees backward, eliminating losses, and optimizing learning. Other parameters to be considered are the number of trees, the learning rate to prevent overfitting, the percentage of samples used per tree, and the percentage of features used per tree [6,40].
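The additive training scheme can be illustrated with a deliberately simplified sketch. It is not the XGBoost library and not the configuration used in the study: it assumes a single numerical feature with at least two distinct values, trees of depth one (stumps), the squared-error loss l = (y − ŷ)²/2, and γ = 0. Under these assumptions, the regularized score of an end node is the sum of its residuals divided by the number of samples plus λ. The function names and toy data are hypothetical.

```python
def fit_stump(xs, residuals, lam=1.0):
    """Fit a depth-one tree to the residuals. The split maximizes the
    regularized structure score; each end node gets the regularized score
    sum(residuals) / (n + lambda)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        gain = (sum(left) ** 2 / (len(left) + lam)
                + sum(right) ** 2 / (len(right) + lam))
        if best is None or gain > best[0]:
            best = (gain, threshold,
                    sum(left) / (len(left) + lam),
                    sum(right) / (len(right) + lam))
    _, split, left_score, right_score = best
    return lambda x: left_score if x < split else right_score


def boost(xs, ys, n_trees=20, learning_rate=0.5, lam=1.0):
    """Additive training: y_hat(t) = y_hat(t-1) + learning_rate * f_t(x)."""
    trees = []
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        # For the squared-error loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals, lam)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)
                       for p, x in zip(predictions, xs)]

    def predict(x):
        return sum(learning_rate * tree(x) for tree in trees)

    return predict


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
predict = boost(xs, ys)
```

Each new stump only corrects what the previous trees left unexplained, so the training error shrinks geometrically in this toy example; the learning rate and λ slow this down, which is how they help prevent overfitting.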
The GridSearch algorithm was employed to automatically optimize each classifier's hyperparameters on a validation set [41]. The GridSearch algorithm starts by defining a limited set of values for each hyperparameter; then, every combination in the Cartesian product of these sets is evaluated in turn. Additionally, a 10-fold cross-validation process was applied for the statistical comparison of performance [42].
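The exhaustive search over the Cartesian product can be sketched as follows. The grid values and `toy_validation_score` are hypothetical: in a real setting, the evaluation function would train the classifier with the given hyperparameters and return its performance on validation data.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Evaluate every combination in the Cartesian product of the candidate
    values and return the best-scoring combination."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values for two hyperparameters.
grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}


def toy_validation_score(params):
    # Stand-in for validation accuracy, peaking at max_depth=4, learning_rate=0.1.
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


best_params, best_score = grid_search(grid, toy_validation_score)
```

The number of evaluations is the product of the sizes of the candidate sets (nine in this example), which is why the sets must be kept small.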

Performance Evaluation
A 10-fold cross-validation strategy was followed to obtain the average performance and its standard deviation. Performance measures were derived from confusion matrices, which relate real and predicted classes. The main metrics used were Accuracy, Recall, Precision, and F-measure; for a detailed description of these metrics, please refer to the work of Castro et al. [43].
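As an illustration, the four metrics can be derived from a multi-class confusion matrix as sketched below. The labels 1, 2, and 3 follow the three discharge diagnoses; the toy predictions are invented for the example, and reporting the metrics per class is an assumption of this sketch (see [43] for the definitions used in the study).

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are real classes and columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def metrics(matrix):
    """Return the accuracy and a (precision, recall, F-measure) tuple per class."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        per_class.append((precision, recall, f_measure))
    return accuracy, per_class


# Toy example with classes 1 (deceased), 2 (unhealthy), and 3 (healthy).
y_true = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
y_pred = [1, 2, 2, 2, 2, 3, 3, 3, 3, 2]
matrix = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
accuracy, per_class = metrics(matrix)
```

In a 10-fold cross-validation, these values would be computed once per fold and then summarized by their mean and standard deviation.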

Experimental Results
The proposed system was implemented using Python v3.6.8 and the libraries joblib, numpy, pandas, date, pytz, scikit-learn, scipy, and xlrd. The comparison of the system using distinct algorithms was executed on a virtual machine running CentOS Linux 7.0 with 32 processor chips (Intel Xeon (R) CPU 2.60 GHz) and 20 MB of RAM.

Pre-Processing Results
The result of the encoding phase was the assignment of an integer to each categorical feature value. After the imputation of the missing data with the KNN imputer and the simple imputer, the dataset was normalized in order to identify the more accurate method. The distributions of age after data completion with both methods are shown in Figure 3. By comparing both distributions, it can be seen that the KNN imputer provides a distribution that is consistent with the central limit theorem, with most of the area concentrated close to zero. In other words, errors are normally distributed, and meaningful samples can be parameterized. Similar results were observed with the rest of the features, and the rest of the preprocessing was conducted on KNN-imputed data.

Main Processing
With sample records following the distribution of data provided by the KNN imputer, a subset of six features was selected through the SFS wrapper. The selected features were DI, ICD10, Medicine, TT_Medic, Complications, and D_Hospital, producing a precision of up to 82.4%. For data filtering, the best hyperparameters of DBSCAN were explored on validation data using a search strategy based on random candidate combinations (provided by the ParameterSampler function). The hyperparameters found were eps = 4.5285, min_samples = 9, and p = 1, with a cluster cohesion of 87%; as a result, nine records were deleted from the original dataset, leaving 1196 of the 1205 records.
For a fair comparison, the GridSearch algorithm was applied to all classifiers, with data partitioned into 70% of samples for training and 30% for testing. The resulting hyperparameters for each classifier are shown in Table 3. The average performance of each classifier after a 10-fold cross-validation strategy for classifier evaluation and comparison is presented in Table 4, where the numbers in parentheses represent the standard deviation over ten replications of the experiment. Bold numbers highlight the highest performance. Results in Table 4 reveal that the proposed XGBoost algorithm achieves the best performance in terms of both accuracy and F-measure. These results are consistent with those presented in the literature for different applications. For instance, Nguyen et al. [44] mention the XGBoost model as a robust algorithm to build predictive models when applied to predict the environmental effects around a mine.
Further performance analysis can be carried out by examining the confusion matrices for each classifier. The confusion matrices in Figure 4 summarize the predictability of each model: dark gray colors represent high values, whereas light gray colors represent low values. Comparing Figure 4a-f, the highest level of correctly classified samples is provided by XGBoost (Figure 4b).
Here, most of the errors occur when class 1 (deceased patients) is confused with class 2 (unhealthy patients). Although the overlap between classes 1 and 2 in Figure 4b is close to one-third of the decisions, it should be considered that pathologies and complications may appear after the patient is discharged. Similarly, the highest precision in all cases corresponds to class 2, and the particular situation of discharging unhealthy patients requires further analysis by practitioners. Because the highest level of errors occurs between classes 1 and 2, close attention must be paid to every particular case. Although the approaches in Figure 4a,b,e present a dark diagonal (correct class predictions), the diagonal values for XGBoost are consistently higher than those of the other approaches.
Finally, Figure 5 presents the receiver operating characteristic (ROC) curves per class computed for the proposed system designed with XGBoost. In order to construct the ROC curves, the output probabilities for each class were computed, and the area under the ROC curve (AUC) was estimated for each curve using the approximation by rectangles. The lowest AUC was achieved by the blue solid ROC curve, which represents the performance of the system for class 3 (deceased patients), and the operational point closest to the upper-left corner corresponds to a true-positive rate (tpr) of 0.85 and a false-positive rate (fpr) of 0.1. Furthermore, selecting the correct operational point is relevant to tune the system and reduce errors for specific classes.
Figure 5. ROC curves that show the system performance for each class (i.e., discharge diagnosis: deceased, unhealthy, and healthy). The diagonal dashed line represents a random classifier that does not consider any information to make decisions.
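The construction of a per-class ROC curve, the rectangle approximation of the AUC, and the choice of the operational point closest to the upper-left corner can be sketched as follows. The labels and scores are toy values invented for the example (1 marks the class of interest and 0 the remaining classes, in a one-vs-rest setting), and the left-rectangle rule is an assumption about the specific rectangle approximation employed.

```python
def roc_points(labels, scores):
    """Sweep the decision threshold from high to low scores and return the
    (fpr, tpr) points of the ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), reverse=True)
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc_rectangles(points):
    """Left-rectangle rule: width is the step in fpr, height is the tpr at
    the left edge of the step."""
    return sum((x1 - x0) * y0 for (x0, y0), (x1, _) in zip(points, points[1:]))


def closest_to_corner(points):
    """Operational point with the smallest distance to (fpr=0, tpr=1)."""
    return min(points, key=lambda p: p[0] ** 2 + (1.0 - p[1]) ** 2)


labels = [1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
points = roc_points(labels, scores)
auc = auc_rectangles(points)
operating_point = closest_to_corner(points)
```

When no two samples share the same score, the ROC curve is a step function and the rectangle rule gives its exact area; with tied scores it is only an approximation.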

Decision Trees for Computer-Aided Diagnosis
According to Chen and Guestrin [39], the resulting probabilities of decision trees can be computed using the logistic function. As shown in Figure 6, the resulting decision tree made it possible to identify malformations of the digestive tract, in the first place atresia of the esophagus. This condition was identified in the node (ICD10 < 141), which corresponds to the code Q39 (ICD10).
According to Stoll et al. [45], the prognosis of this condition is reserved because the child cannot feed or swallow saliva, which instead passes into the lungs. Treatment is further complicated when the distal segment of the atretic (malformed) esophagus is attached to the trachea: gastric juice then also passes into the lungs, and the child develops pneumonia. The complexity of this pathology is often compounded by premature heart problems and other associated malformations such as anorectal malformations, which explains the reserved prognosis of the child and the urgency of the solution: the subsequent three days are crucial in the treatment.
The nodes (TT_Medic < 506 and TT_Medic < 140) refer to the treatment that has been successful in the historical data. This treatment is as follows: since the patient cannot be fed, the patient must be hydrated through a reliable intravenous route and given prophylactic antibiotics and feeding (total parenteral nutrition, TPN). The catheter used for this purpose is not placed in a peripheral or superficial vein, because such a vein does not tolerate the TPN; it must be placed in a larger or central vein. If a peripheral vein is used, the osmolarity of the solution will inflame and damage it. For this reason, we reference a tunneled central venous catheter (CVC) (TT_Medic < 140) and a probe (TT_Medic < 506) to aspirate the saliva until the patient is operated on. After surgery, the procedure consists of placing a drain in the thorax to check whether the surgical sutures (anastomosis) leak saliva; any leak is corrected until the anastomosis is sound. The probability of dying with these recommendations is 54.5%, since these cases are extremely critical, and even more so when the patient is not a newborn. The complexity of treating these malformations lies in the fact that other conditions are usually present. Likewise, pathologies associated with malformations of the digestive tract belong to the genitourinary category (the N category of ICD10), especially rectal atresia. For this reason, the tree places the node (ICD10 < 100) [46].
Other malformations that interrupt the communication between the mouth and the anus are intestinal atresia and rectal atresia. Suppose that these atresias are not treated on time, that is, within the three days referred to in the tree. In this situation, the probability of complications such as sepsis (generalized infection), multi-organ failure, coagulation disorders, shortness of breath, and an acidotic state will be high. The probability of death increases from 54.5% to 59.4% with respect to the case described above, in which the atresia is treated within three days. Therefore, time is vital; moreover, as happens in third-level hospitals, patients arrive from other cities where they previously received unsuccessful treatments and are therefore very delicate cases.
Another example is shown in the node (TT_Medic < 20). This node reveals that a treatment such as an ostomy (bringing the proximal part of the intestine out to the skin in order to let it drain) involves drainage and complicated infection control. Therefore, it is vital to consider the days of treatment: if they exceed three, a perforation of the intestine may be triggered, and the chances of death increase.

Conclusions
In this paper, a computer-aided diagnosis system was proposed as a tool for accurate and timely diagnosis in a third-level hospital. Based on decision trees, the system considers the historical records of incoming pediatric patients to predict an estimated discharge diagnosis and possible treatments. The proposed CAD system predicts three categories of patient discharge: (1) deceased, (2) unhealthy, and (3) healthy. The CAD system learns from the previous diagnoses, the treatment applied, and the patient's discharge condition in order to assist the pediatric surgeon's decision-making so that the best treatment can be offered. If the patient is discharged with a pathology, it is because the patient requires management or surgical treatment in stages, due to the complexity of the pathology or the need for its correction at a later age. For example, in the case of high anorectal malformations, a procedure is performed at the neonatal stage that allows the child to have a bowel movement through an artificial orifice, and the definitive correction is performed at a later age. Suppose that the patient dies later, before completing the entire treatment. In that case, the factors that led to the patient's poor evolution would have to be recorded and fed back into the system. Therefore, the system must have continuous feedback to improve medical decision support.
The main advantages foreseen from using a prediction system to support the decision-making process include the following. First, a graphical representation of the possible paths from admission to output diagnosis can provide the means for better decisions. Second, the resulting decision tree can be employed to inform the patients' parents of possible paths and risks, according to previous system experience. Third, statistics on previous cases may provide experience-based evidence in the case of legal conflicts. Finally, an adaptive system may learn from new cases to make more accurate decisions as its knowledge improves with experience.
Future work could further improve the results by using XGBoost with ensemble learning (boosting and bagging) methods, which consist of selecting the samples that obtained the least error during the learning of sequentially constructed trees. This distributed learning environment can solve problems beyond billions of examples, making it much more versatile when it comes to treatments as delicate as pediatric ones, and hyperparameters for ensemble learning can be optimized considering a trade-off between accuracy and stability [47]. Additionally, given the continuous entry of novel cases, the system might be adapted to incorporate the information in new sample records. Ensemble learning methodologies have been proven to provide good performance after incremental learning and fusion adaptation. Finally, although the use of the ICD10 standard provides a helpful framework, within the next few years, the system should consider the recent ICD11 codification to include a more accurate diagnosis [48]. Backward compatibility may be resolved by adding a module to translate medical records from ICD10 to ICD11.