Application of Data Mining for the Prediction of Mortality and Occurrence of Complications for Gastric Cancer Patients

Gastric cancer, the development of malignant cells that can grow in any part of the stomach, is one of the most common causes of death worldwide. To increase the survival rate of patients with this condition, it is essential to improve the decision-making process so that treatment strategies are selected better and more efficiently. Nowadays, with the large amount of information available in hospital institutions, it is possible to use data mining algorithms to improve healthcare delivery. Thus, this study, following the CRISP-DM methodology, aims to predict not only the mortality associated with this disease, but also the occurrence of any complication following surgery. A set of classification models was tested and compared in order to improve the prediction accuracy. The study showed that, on the one hand, the J48 algorithm using oversampling is the best technique to predict mortality in gastric cancer patients, with an accuracy of approximately 74%. On the other hand, the random forest algorithm using oversampling presents the best results when predicting the possible occurrence of complications among gastric cancer patients after their in-hospital stays, with an accuracy of approximately 83%.


Introduction
Many aspects that were previously unknown to healthcare professionals are now being revealed by the data generated in healthcare, improving the quality of medical procedures and treatment strategies [1]. Healthcare facilities like hospitals produce large amounts of heterogeneous data every day, drawn from diverse sources, data types and formats. This heterogeneity calls for rigorous inspection of the data in order to assess its quality and identify possible problems that need to be solved. Since the data is so complex, it is practically impossible to analyze it with traditional tools and methods [2]. This complexity calls for more sophisticated techniques that are able to manage the data and produce meaningful knowledge. In this way, healthcare service records can serve as a means of assessing the quality of those services and the patients' satisfaction [3]. Thus, the use of data technologies like data mining (DM) has become essential in healthcare.
DM is a process that refers to the extraction of useful information from vast amounts of data [4]. It is used to find hidden patterns and uncover unknown correlations that are not obvious when observing the data with the naked eye [5]. Thus, DM can greatly benefit the healthcare industry by

State of the Art
In order to provide a deeper understanding of the context and importance of this study, this section provides the general background related to the associated research field. Thus, some concepts like knowledge discovery in databases (KDD), machine learning (ML) and DM are dissected and their association with the healthcare field leads to the introduction of the clinical decision support systems (CDSS).

Knowledge Discovery in Databases
Over the years, the rapid growth of digitization and computerization of processes in health institutions, as well as the large number of transactions that are performed daily in these environments, led to the production and collection of large amounts of data. This exponential increase in the amount of data stored by hospital institutions has raised the need to transform this data into relevant and useful information for the institution, leading to more efficient decision-making processes. This urgent need of extracting knowledge from the growing amount of digital data propelled the use of new computational theories and tools. This area is known as KDD [4,10,11].
According to Fayyad et al. [12], the KDD process begins with the analysis of the application domain and the objectives to be accomplished, and is divided into five phases, represented in Figure 1.
The first step of the process is to choose the data to be mined, which can range from data samples and subsets of variables up to large masses of data. The preprocessing phase aims to eliminate noise, missing values and illegitimate values. The data transformation step depends on the search objective and the algorithm to be applied, because it defines the limitations to be imposed on the database [11,12]. Improving data quality is important for better results, since it ensures better quality in the discovered patterns. After completing the previous phases, DM is applied. This is the most important phase of the KDD process.

Data Mining
DM is the process of using machine learning techniques and statistical and mathematical functions to automatically extract potentially useful information from data in a way that is understandable to users. It can reveal the patterns and relationships among large amounts of data in a single or several data sets. The knowledge achieved can adopt various forms of representation, such as equations, trees or graphs, patterns or correlations [13].
DM methods can be divided into two categories: supervised and unsupervised. Supervised methods are used to predict a value and require the specification of a target attribute. Unsupervised methods, in contrast, are applied to discover the intrinsic structure, patterns, or affinities in the data [14].
The choice of the mining technique to be applied is closely related to the mining task to be performed, as this task defines the relationship between the data, i.e., the model. DM tasks are the types of discovery to perform in a database, that is, the information to extract. To determine which task to solve, it is important to have a good knowledge of the application domain and to know the type of information to obtain. DM includes two main types of techniques: descriptive and predictive. Clustering techniques, which discover information hidden in data, are an example of descriptive techniques. Classification and regression techniques, which derive new information from existing data, are examples of predictive techniques [15][16][17]. The focal point of this paper is predictive techniques, more specifically, classification techniques.
DM has many applications, since it is highly adaptable to distinct businesses and goals. These range from retail stores, hospitals and banks to insurance or airline companies. The knowledge acquired during the DM process can also be used to support decision-making in various settings. In medicine, for example, a correct and rapid analysis of this large volume of data in the diagnosis phase is important for the identification of pathologies.

Clinical Decision Support Systems
To accomplish these goals, CDSS use clinical knowledge that is incorporated into the system, helping professionals to analyze patient data and make decisions. The knowledge used to maintain these systems is often extracted through DM techniques that, as mentioned before, are used to analyze and explore data with the aim of discovering patterns that might be helpful for decision-making [18].

Related Work
The improvement of gastric cancer diagnosis, mortality and complication rates has always been one of the most common themes in the application of DM techniques to healthcare. Thus, some of the existing works were studied prior to the conception of this paper.
Lee [15] applied DM techniques in order to create a prediction process for the occurrence of postoperative complications in gastric cancer patients. Artificial neural networks (ANN) were developed and their results compared with those of the traditional logistic regression (LR) approach, achieving an average correct classification rate of 84.16% with ANN in contrast with 82.4% for LR.
Polaka et al. [16] planned various approaches for diagnosing gastric cancer using the original dataset and datasets with subsets of features. The best results were obtained for the dataset using attribute subsets selected with the wrapper approach. Four different models were tested, where C4.5 obtained an accuracy of 74.7%, as did CART. The RIPPER algorithm produced an accuracy of 73.9%, while the multilayer perceptron got the best results with 79.6%.
Hosein Zadeh et al. [17] used an optimized multivariate imputation by chained equations (MICE) technique to predict the chances of survival in gastric cancer patients. Three different techniques were executed: the first one, which consisted of the application of logistic regression, obtained an accuracy of 63.03%, while the second technique, which used a non-optimized MICE algorithm, achieved an accuracy of 66.14%. Finally, the third approach, with the optimized MICE algorithm, produced results with an accuracy of 72.57%.
Mohammadzadeh et al. [19] carried out a study aimed at developing a decision model for predicting the probability of mortality in gastric cancer patients, also identifying the most important factors influencing the mortality of patients who suffer from this disease. The factors found to affect gastric cancer mortality were diabetes, ethnicity, tobacco, tumor size, surgery, pathological stage, age at diagnosis, exposure to chemical weapons and alcohol consumption. The accuracy of the developed decision tree was 74%.

Methodology
The reference model used during the development of this study was the cross-industry standard process for DM, most commonly known as CRISP-DM.
The CRISP-DM methodology provides a structured approach to planning a DM project and is a six-phase hierarchical process, divided into the following steps: business understanding, data understanding, data preparation, modelling, evaluation, and deployment, as shown in Figure 2 [20,21]. Section 5 describes in detail the application of all these steps in the context of this study.

Methods
In order to analyze and explore the available data and to induce the data mining models (DMM), the chosen ML software was the Waikato Environment for Knowledge Analysis (WEKA). During this study, five modelling techniques were used with WEKA to induce the DM models, namely: random forest (RF), J48, simple logistic (SL), Bayes net (BN) and PART. The study also includes the ensemble techniques Bagging and adaptive boosting (AdaBoost), applied to some of the mentioned algorithms. It is important to note that oversampling using the synthetic minority oversampling technique (SMOTE) was also tested.

Random Forest
RF is an ensemble learning method for classification which operates by constructing a multitude of decision trees. Initially, a bootstrap sample (a random sample obtained with replacement) is selected from the training data with the goal of inducing a decision tree (DT). This step is repeated until an ensemble of DTs is created, each one of them having its own prediction value. The final prediction is achieved by combining the output from all trees, and corresponds to the most frequent output obtained by the ensemble. RF corrects for decision trees' tendency to overfit their training set, making it a very efficient and accurate classifier [22,23].
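The bootstrap-and-vote procedure described above can be sketched in a few lines of Python. This is only an illustration, not the WEKA implementation used in the study: the base learner `fit_stump` is a hypothetical one-split stand-in for a full decision tree, the data values are invented, and a real random forest also samples a random subset of features at each split.

```python
import random
from collections import Counter, defaultdict

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(sample):
    """Hypothetical base learner standing in for a decision tree:
    predict the class whose mean feature value is closest to x."""
    by_label = defaultdict(list)
    for x, label in sample:
        by_label[label].append(x)
    means = {label: sum(xs) / len(xs) for label, xs in by_label.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def fit_forest(rows, n_trees=25, seed=0):
    """Train one base learner per bootstrap replicate."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]

def predict(forest, x):
    """Final prediction = most frequent output across the ensemble."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: (single numeric feature, class label).
data = [(1.0, "survived"), (1.5, "survived"), (2.0, "survived"),
        (7.0, "died"), (8.0, "died"), (9.0, "died")]
forest = fit_forest(data)
print(predict(forest, 0.5), predict(forest, 10.0))
```

Individual learners trained on an unlucky bootstrap replicate can be wrong, which is why the ensemble vote, rather than any single tree, is taken as the prediction.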

J48
The J48 algorithm uses a greedy strategy to induce DTs for further classification. J48 generates decision trees in which each node evaluates the significance of an individual attribute. Decision trees are built from top to bottom by choosing the most appropriate attribute at each step. Once the attribute is chosen, the training data is divided into subgroups corresponding to the different attribute values, and the process is repeated for each subgroup until most of the instances in each subgroup belong to a single class. It is important to note that the J48 classifier implemented by WEKA corresponds to an open source Java implementation of the C4.5 algorithm and is considered one of the most powerful and commonly used DT classifiers [24][25][26].
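As an illustration of what "choosing the most appropriate attribute" can mean, the sketch below scores nominal attributes by gain ratio, the criterion used by C4.5 (this detail is not stated in the text above). The attribute names and rows are invented for the example, and the sketch omits numeric attributes, missing values and pruning, all of which a full C4.5 implementation handles.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, normalised by
    the entropy of the split itself (split information)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

def best_attribute(rows, attributes, target):
    """Greedy step: pick the attribute with the highest gain ratio."""
    return max(attributes, key=lambda a: gain_ratio(rows, a, target))

# Hypothetical rows; attribute names are illustrative only.
rows = [
    {"stage": "early", "sex": "F", "outcome": "survived"},
    {"stage": "early", "sex": "M", "outcome": "survived"},
    {"stage": "late", "sex": "F", "outcome": "died"},
    {"stage": "late", "sex": "M", "outcome": "died"},
]
print(best_attribute(rows, ["stage", "sex"], "outcome"))
```

In this toy example `stage` separates the classes perfectly while `sex` carries no information about the outcome, so the greedy step selects `stage` for the root node.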

Simple Logistic
SL is a classifier for building linear logistic regression models. LogitBoost with simple regression functions as base learners is used for fitting the logistic models. The optimal number of LogitBoost iterations to perform is cross-validated, which leads to automatic attribute selection [27,28].

Bayes Net
BN is a base class for a Bayes network classifier. A Bayesian network is a graphical model of the probabilistic relationships among a set of variables, represented as a directed acyclic graph. The class also provides data structures, such as the network structure and the conditional probability distributions, and facilities common to Bayes network learning algorithms like K2 and B [29,30].

PART
PART is a partial decision tree algorithm that combines the C4.5 and RIPPER algorithms and was developed to try to avoid their respective problems. The main distinguishing feature of the PART algorithm is that, unlike C4.5 and RIPPER, it does not need to perform global optimization to produce appropriate rules. A big advantage of PART is that it adopts the separate-and-conquer strategy: it builds a rule, removes the instances that rule covers, and continues creating rules recursively for the remaining instances until none are left [31,32].

Bagging
Bagging, short for bootstrap aggregating and originally introduced by Breiman, is one of the popular ensemble methods for improving classifiers [33]. Bagging is based on the bootstrapping and aggregating concepts, integrating the benefits of both approaches [34]. In Bagging, the training set is sampled to generate random independent bootstrap replicates. A classifier is then constructed on each replicate, and the classifiers are aggregated by a simple majority vote in the final decision rule [35].

AdaBoost
Freund and Schapire [33] proposed AdaBoost, short for adaptive boosting. This algorithm stands out mainly due to its potential, flexibility, and simplicity of implementation in different scenarios. It is an iterative process that produces a strong classifier consisting of a sequence of weighted classifiers that complement one another. AdaBoost builds its final classifier by sequentially introducing new models that compensate for the instances misclassified in previous iterations [35].

Synthetic Minority Oversampling (SMOTE)
SMOTE is a popular oversampling technique. Rather than simply replicating examples from the minority class, it creates new synthetic samples by interpolating between minority class instances: it randomly selects samples from the minority class and, based on their k nearest neighbours (kNN), generates new ones [36].
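The interpolation idea behind SMOTE can be sketched as follows. This is a minimal illustration for numeric features only, with invented data points and hypothetical parameter names; it is not the WEKA filter used in the study, and it needs at least two minority instances to work.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen instance.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # New point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic

# Invented minority-class points with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority instances, the new samples stay inside the region already occupied by the minority class instead of being exact copies.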

Business Understanding
Cancer affects millions of people all over the world and is one of the biggest threats to people's lives and quality of life. Gastric cancer is one of the most common causes of cancer-related deaths, behind, for example, lung cancer [37]. The prognosis is usually not favorable, since the probability of survival upon diagnosis is less than 30% in Europe [37]. However, in Japan this rate goes up to 90% thanks to early examinations and tumor resections [38].
This malignancy presents no specific symptoms in early stages, which causes delayed diagnoses that lead to the high mortality of patients. In advanced stages, the patient may feel a variety of more serious symptoms, like abdominal pain, indigestion, severe nausea and inexplicable weight loss [38]. By the time these symptoms appear, the cancer has already developed to more dangerous stages. When the tumor is diagnosed, it is often too late for any curative medical procedure to take place. There are various objectives with this study, such as:

• Promote early examinations among the general population in order to avoid late gastric cancer diagnoses that often lead to the patient's death
• Predict the probability of mortality after the surgery
• Predict the occurrence of complications after in-hospital stays for gastric cancer patients
Thus, this study aims to improve many aspects related to gastric cancer and the way it affects the patients' lives. The focus falls on their hospital admissions and possible complications that may occur related or not to the tumor. The procedures performed and the patient's health status after the hospital stay are also subjects of this work.
The first item is related to the healthcare business goals. The improvement of the quality of the medical services provided is one of the most crucial aspects in this industry. This translates into an increment on the survival rates of patients, in this case patients that suffer from gastric cancer.
The rest of the goals listed are related to the objectives inherent to the DM process. Through the application and refinement of DM techniques these objectives will provide a substantial help to healthcare professionals.

Data Understanding
The data used for this study was collected from a Portuguese hospital and is related to patients with gastric cancer. It includes over 60 variables with information about the patients' admission, stay at the hospital, possible complications and the result of the performed procedure related to 154 patients.

Data Preparation
The original dataset had many attributes with high percentages of missing values. Among the numerical variables, half of the attributes had over 45% missing or null values. Such attributes are not useful to study or to feed to ML algorithms, since they offer little to no meaningful information. Consequently, these attributes were removed from the dataset. Moreover, a careful analysis detected extremely similar attributes, some even presenting the same values. In these cases, only one of the attributes was kept in order to avoid redundancy. Some attributes referred to technical aspects of the data extraction, so they were removed from the dataset as well. The categorical attributes were submitted to the same process.
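A minimal sketch of this kind of missing-value filter is shown below. The 45% cut-off is borrowed from the figure quoted above purely for illustration (the exact removal threshold is an assumption), and the column names and values are invented.

```python
def missing_fraction(rows, column):
    """Share of rows in which `column` is absent, None or empty."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)

def drop_sparse_columns(rows, threshold=0.45):
    """Keep only columns whose missing fraction is <= threshold.
    The threshold value is an assumption made for this example."""
    columns = sorted({column for row in rows for column in row})
    keep = [c for c in columns if missing_fraction(rows, c) <= threshold]
    return [{c: row.get(c) for c in keep} for row in rows]

# Invented records; attribute names are hypothetical.
records = [
    {"age": 70, "albumin": None},
    {"age": 64, "albumin": None},
    {"age": None, "albumin": 3.9},
    {"age": 58, "albumin": None},
]
cleaned = drop_sparse_columns(records)
print(cleaned)
```

Here `albumin` is missing in 75% of the records and is dropped, while `age`, missing in 25%, is kept.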
After the data cleaning, three more features were created derived from existing attributes. These new features refer to the number of postoperative complications registered, to the occurrence of complications 30 days after the in-hospital stay and to the death of patients.
The final result was a dataset with 33 features (4 numeric and 29 categorical), referred to as Use Case 1. However, in order to analyze alternative approaches with fewer attributes, three more datasets were created.
The first one (Use Case 2) was created with attribute selection performed by the OneR algorithm, where 19 attributes were selected (1 numeric and 18 categorical). The second dataset (Use Case 3) included a subset of features selected using the Relief algorithm; it was composed of 20 attributes, of which one was numeric and the rest categorical. The features for the third dataset (Use Case 4) were chosen based on Pearson's correlation method; this subset comprised 21 attributes, of which 2 were numeric and 19 categorical. The characteristics of the datasets are summarized in Table 1.

Modeling
The first proposed goal was to predict the mortality of gastric cancer patients that were admitted to the hospital. Based on the health status available, as well as information about the performed surgery and its outcome, the models predict whether the patient is more likely to survive or to pass away. In this case, four datasets (the original, after the data preparation, and three more that resulted from feature selection) were tested. Thus, the classification process included four scenarios with distinct sets of features.
On the other hand, the second goal was to predict the occurrence of complications after hospital stays. In this case, features related to the patients' morbidity and survival, and complications' rank were removed, along with information about the possible existence of complications. These attributes were eliminated in order to ensure an unbiased and correct prediction.
To ensure that the models capture most of the patterns in the data correctly, with low bias and variance, cross validation was used. Cross validation provides ample data to train the model while still leaving data to test it. For this study, 10-fold cross validation was used: the dataset was divided into 10 folds and the holdout method was repeated 10 times, such that each time one of the 10 subsets was used as the test set and the other nine were put together to form the training set.
The classifiers selected for this study were RF, J48, Simple Logistic, BN, and PART. In addition, the AdaBoost and Bagging algorithms were also executed in conjunction with the first three models already mentioned. In this DM approach, specifically when using decision trees, the main criterion for selecting a variable to make a decision was the dependence of that variable on the class variable. There was no differentiation between direct dependence and indirect dependence (mediated by other variables). Such a distinction did not make a difference for classification, because trees based on direct dependence or indirect dependence were very likely to result in the same classification. Nevertheless, if these approaches are used to intervene in a real-world system, indirect dependence may have no impact, while direct dependence can.
Finally, two data approaches were tested: with and without oversampling (using SMOTE). At this stage, the DMMs were constructed using the WEKA software. A DMM can be composed of a target variable (T), a scenario (S), a data mining technique (DMT), a data approach (DA) and a sampling method (SM). Regarding the DM mortality (DM1):
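The fold construction described above can be sketched as a small generator. This is an illustrative assumption of how the splits might be produced, not the WEKA routine used in the study; in particular, the sketch shuffles the instances but does not stratify the folds by class. The sample count of 154 matches the number of patients in the dataset.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 154 patients, 10 folds: every instance is tested exactly once.
splits = list(k_fold_indices(154, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Averaging a metric over the 10 test folds gives the cross-validated estimate; each model is always evaluated on instances it did not see during training.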

Evaluation
Once the modeling phase was concluded, the chosen classifiers were put to the test in order to evaluate and compare their results. The metrics used were accuracy, precision, F-measure and recall. They are defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$PR = \frac{TP}{TP + FP}$$
$$RC = \frac{TP}{TP + FN}$$
$$\text{F-measure} = \frac{2 \cdot PR \cdot RC}{PR + RC}$$
where TP = true positives, TN = true negatives, FP = false positives, FN = false negatives, PR = precision, and RC = recall.
The area under the ROC curve (AUC) metric was also used. ROC is a probability curve, and AUC is a measure of separability that indicates how well the model is able to distinguish between classes.
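The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The counts in the example are invented and do not come from the study; the zero-denominator guards are a convention chosen for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Invented counts for a binary problem with 100 test instances.
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(metrics)
```

With these counts the accuracy is 0.7, the precision 0.8, the recall about 0.667 and the F-measure about 0.727, which shows how a model can have good precision while still missing a third of the positive cases.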
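One way to make the notion of separability concrete is the pairwise interpretation of AUC: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that definition with invented scores; labels are assumed to be coded as 1 (positive) and 0 (negative).

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which
    the positive instance is scored higher (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Invented classifier scores for four instances.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))  # 1.0
```

A value of 0.5 corresponds to random ranking and a value of 1 to perfect separation of the two classes.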
In order to ease the understanding of the results obtained, they are divided by scenarios as shown below.

Mortality Prediction Results
Table 2 presents the results obtained during the classification process using the original dataset for mortality, after the data preparation. It is important to note that for each DM technique two data approaches were tested: with and without oversampling.
Table 3 shows the results obtained for the prediction of gastric cancer patients' mortality using the feature selection technique that evaluates the worth of a feature using the OneR algorithm.
Table 4 presents the results obtained for the prediction of gastric cancer patients' mortality using the feature selection method that evaluates the worth of an attribute using the Relief algorithm.
Table 5 presents the results obtained for the prediction of gastric cancer patients' mortality using the feature selection method that evaluates the worth of attributes using Pearson's correlation.
Table 6 shows the results of the prediction of the occurrence of complications after surgery.

Predict the Mortality of Gastric Cancer Patients
Regarding the first scenario, a first analysis of the results (Figure 3) revealed that the use of oversampling improved them in 13 of the 14 tested DM techniques, as it only worsened them with the PART algorithm. Without the use of oversampling, SL produced the best result: despite presenting the same accuracy as the RF model (68.4564%), it showed better results for the other metrics. The best result for this scenario was obtained using oversampling by the ensemble technique Boosting with J48 using Laplace correction, achieving an accuracy of 73.7968%.
Using a dataset with fewer features than the original one (scenario two), the initial best results were achieved with the SL algorithm, which produced an accuracy of 68.4564%. However, as can be seen in Figure 4, better values were obtained with the usage of oversampling, except when using the SL algorithm (alone and in an ensemble with AdaBoost). The best result obtained was 74.3316%, using the J48 algorithm with oversampling.
In the third scenario, the results obtained from the execution of the selected models showed that initially the best result was obtained by the ensemble technique Boosting with SL, with an accuracy of 69.7987%. Figure 5 shows that the use of oversampling improved the results obtained in 13 of the 14 tested DM techniques, just like in S1, but this time the exception was the SL algorithm. The best accuracy value for this scenario was 73.2620%, obtained using the ensemble technique Bagging with J48 using Laplace correction.
Finally, in the last scenario of mortality prediction, Figure 6 shows that, like in S1, S2 and S3, the use of oversampling also increased the accuracy values in 13 of the 14 applied algorithms. This time the exception was the ensemble technique Boosting with RF, which presented the highest accuracy value (69.5187%) before oversampling but a lower value after it.
With the usage of oversampling, the results that were obtained from the execution of the selected models showed that both PART and the ensemble technique Bagging with J48 using Laplace correction produced the best accuracy of 74.3316%.
When compared to the results obtained with the datasets that were submitted to feature selection methods, it is possible to conclude that the original dataset (S1) produced better overall results for the selected metrics without the use of oversampling, as can be seen in Tables 7-9. However, that observation changed with the application of oversampling, since the results increased in the other scenarios.
Overall, the best result was obtained with the second scenario, i.e., using the attributes selected by the OneR algorithm, which achieved an accuracy of 74.3316%, a precision of 0.744, an F-measure of 0.743, a recall of 0.743 and an AUC of 0.826 using the J48 algorithm and the oversampling data technique. Moreover, the results obtained for AUC highlight two algorithms, BN and RF (although combined with other techniques), with 0.879 being the best AUC result, obtained by Bagging with RF. In general, most of the algorithms obtained an AUC close to 1; a good model has an AUC near 1, which means it has a good measure of separability.
The results obtained were not very high, due to the multitude of reasons that may lead to a patient's death. These include factors not directly linked to the gastric cancer, or deaths that took place because the patient was already in palliative care.
When it comes to the prediction of complications after a hospital stay for gastric cancer, the results obtained were more satisfactory, as can be observed in Figure 7. The reason is that it is considerably easier to anticipate whether a patient will suffer from any complications or disabilities following surgery by observing the health status available. Initially, the best accuracy value was recorded for the J48 algorithm (81.6993%). In contrast to mortality prediction, in this case the use of oversampling only improved the results for 8 of the 14 algorithms. Nevertheless, the best final outcome for predicting complications was obtained using oversampling, achieving an accuracy of 83.2599% with the RF algorithm, which also presented the best AUC value of 0.909.

Summary
The best results achieved with this study, previously described, are summarised in Table 10, where the two defined objectives are represented (DMM1 and DMM2). The obtained results, especially for the mortality prediction, are in accordance with the reviewed literature. Relating the two main objectives of this project, although pertinent, the attribute related to the occurrence of complications after surgery was not considered crucial in predicting mortality. Although the occurrence of complications after surgery may be directly related to mortality, in the case of this dataset most patients survived even after the occurrence of complications.
The use of data mining in the healthcare setting can lead to several outcomes, not strictly related to the classification problems presented here. With this kind of approach, new scientific knowledge can also be gained by understanding the contribution of the selected predictors to the response variable. The variables identified as predictive of prognosis have a well-established relationship with outcomes in patients operated on for gastric cancer, according to the medical literature.
The TNM classification of malignant tumors and the stage, for example, are used by theoretical models and by physicians in practical evaluation and, as expected, are among the top attributes when predicting mortality and complications using data mining. The American Joint Committee on Cancer (AJCC)/TNM classification is widely used among different cancers as a staging score, with implications in terms of prognosis [39], as this study succeeds in confirming.
Other relevant attributes found by this analysis include the reason for seeking medical care, which can vary since the presentation of the disease can go from weight loss, nausea, and other mild symptoms to anemia stigma, vomiting, and hematemesis (upper tract hemorrhage). One should take into account that the presence of symptoms is usually suggestive of a more advanced stage, with a subsequent worse prognosis [40]. Last but not least, we also found that the post-operative recovery is an important predictor of the mortality of this cancer, since patients who require ICU care tend to have a worse prognosis due to being hemodynamically unstable or having several or serious comorbidities [41].

Conclusions and Future Work
In recent years, KDD and, more specifically, DM techniques have become increasingly useful for processing and exploiting medical data. The useful information discovered and the patterns obtained with the application of these methods, which analyze complex and heterogeneous data in real time and draw conclusions from it, can be used by health professionals to determine diagnoses, prognoses and treatments for patients in healthcare organizations.
What was an impossible task in the past is now possible: millions of medical records can be submitted to an algorithm that returns relevant results. There are numerous software packages available to the general public that offer tools for reading, cleaning and processing data, preparing it for the application of algorithms, and even executing and refining the models.
This paper aimed to predict the mortality of gastric cancer patients based on their health status, data about the tumor and surgery information, as well as to predict the possible occurrence of complications following an in-hospital stay, using DM techniques. Considering the various reasons that may lead to the patient's death, it becomes challenging to predict whether the patient might perish or survive. Many aspects that influence this outcome show no direct link to the cancer in question. Many patients, due to late diagnosis, face little to no chance of survival since no curative treatment can treat the tumor. These facts help explain why the best accuracy values obtained were around 74%.
On the other hand, it is simpler to predict if a patient will suffer from complications after their hospital stay, since it is possible to rely more on the data available. Observing the data about the tumor (its localization, stage, size, lymph nodes and metastasis) and analysing the health status of the patient (given by the ASA score) among other factors, the prediction of the occurrence of complications becomes a more straightforward process. Hence, the accuracy obtained for this goal was around 83%.
Future work will consist of obtaining a larger dataset with more relevant data in order to improve the prediction process for both patients' mortality and the occurrence of complications. Other models will also be tested and their results compared with the ones already obtained. Interesting directions for future research would also be to determine whether a causal relationship exists between the different variables in the dataset, and to apply other state-of-the-art machine learning algorithms such as deep neural networks. Finally, it would be interesting to use this work in a CDSS in order to assist healthcare professionals and, consequently, improve healthcare delivery for patients with gastric cancer.

Conflicts of Interest:
The authors declare no conflict of interest.