eGAP: An Evolutionary Game Theoretic Approach to Random Forest Pruning

Abstract: By paving the way to the construction of smart cities, the Internet of Things (IoT) has marked the birth of many smart applications in numerous areas, including healthcare, making it widely available and easily accessible. As a result, smart healthcare applications have been, and are being, developed to use mobile and electronic technology to provide higher-quality diagnosis of diseases, better treatment of patients, and improved quality of life. Since smart healthcare applications concerned with the prediction of healthcare data (diseases, for example) rely on predictive healthcare data analytics, it is imperative for such analytics to be as accurate as possible. In this paper, we exploit supervised machine learning methods in classification and regression to improve the performance of the traditional Random Forest on healthcare datasets, both in terms of accuracy and classification/regression speed, in order to produce an effective and efficient smart healthcare application, which we have termed eGAP. eGAP uses the evolutionary game theoretic approach of replicator dynamics to evolve a Random Forest ensemble, allowing subforests, each represented by a cluster of trees, to grow and shrink by adding and removing trees according to their predictive accuracy. All clusters start with a number of trees equal to the number of trees in the smallest cluster, and cluster growth is performed using trees that were not initially sampled. The speed and accuracy of the proposed method have been demonstrated by an experimental study on 10 classification and 10 regression medical datasets.


Introduction
The advent of the Internet of Things (IoT) led to the construction of smart cities [1], which were mainly developed to provide a high degree of information technology and a comprehensive application of information resources. Consequently, many smart applications have come into existence in many areas, including but not limited to smart energy [2], smart education [3], smart transportation [4], and smart healthcare [5]. For the latter area, which is the one relevant to the research conducted in this paper, many smart healthcare applications have been developed, as outlined in [6]. For such applications, which are concerned with making predictions on new data based on historical data, it is imperative for the predictions to be as accurate as possible to reduce treatment costs, avoid preventable diseases, predict outbreaks of epidemics, avoid preventable deaths, and improve quality of life in general.
Random Forest (RF) is an effective tree-based ensemble machine learning approach for both classification and regression. RF uses bagging and randomisation heuristics to produce a diverse ensemble. Given its early success in showing consistent predictive effectiveness in both classification and regression, RF is the method this paper sets out to improve. The contributions of this paper are as follows:
1. A novel evolutionary game-theoretic approach for pruning Random Forests;
2. A faster Random Forest inference achieved through pruning; and
3. A thorough experimental study on 20 medical datasets covering both classification and regression problems.
This paper is organised as follows. Section 2 presents related work in the area of smart healthcare applications. An overview of Random Forest is given in Section 3. In Section 4, an overview and detailed description of the proposed new method are given. Experiments and results are presented in Section 5. A discussion section then follows in Section 6. The paper is finally concluded with a summary and pointers to future work in Section 7.

Related Work
In the quest to realise smart cities, a number of predictive smart healthcare applications have been developed, as outlined in [6]. The prediction of inpatient violence incidents was investigated in [10] using recurrent neural networks, convolutional neural networks, Naïve Bayes, support vector machines, and decision trees. A study of the prediction of type 2 diabetes and hypertension was conducted by [11] using density-based spatial clustering and synthetic minority over-sampling. The authors in [12] investigated the prediction of biochemical recurrences in patients treated by stereotactic body radiation therapy using the prostate clinical outlook. Using the Kruskal-Wallis test, a regression model, the Cuckoo search optimisation algorithm, and a radial basis function neural network, a model was developed in [13] to determine whether the differences in tuberculosis prevalence rates between income groups are statistically significant.
For better classification and prediction performance for determining severity stage in chronic kidney disease, ref. [14] used Probabilistic Neural Networks (PNNs). In [15], a machine learning analysis, known as KNIME, was used to analyse MRI-derived texture features to predict placenta accreta spectrum in patients with placenta previa. An online Stochastic Gradient Descent (SGD) algorithm with logistic regression is implemented by [16] using Apache Mahout to develop the best scalable diagnosis model. Finally, using three different classifiers: Naive Bayes, C4.5, and Random Forest, ref. [17] investigated the prediction of three diseases, namely, leukemia, lung cancer, and heart disease. Aligned with the work reported in this paper, in [18], the authors proposed a framework for smart healthcare using machine learning and Internet of Things. However, only off-the-shelf shallow learning methods were used including Random Forests. Inspired by this work [18], and other recent work reported in the smart healthcare domain, the research reported in this paper attempts to bring both efficiency and effectiveness to Random Forests, enabling it to contribute to near real-time decision making in smart healthcare. For comprehensive reviews of the adoption of data mining methods in smart applications including healthcare, the reader is referred to [19,20].
As can be seen, there have been numerous attempts over the last couple of decades to use machine learning for computer-aided diagnosis. Moreover, work on smart healthcare has attracted a great deal of interest with the increasing power of handheld devices, particularly smartphones. The work reported in this paper is an attempt in the same direction, where we apply a novel method for pruning Random Forests.

Random Forest: An Overview
Random Forest (RF) is an ensemble learning method used for classification and regression. Developed by Breiman [21] almost two decades ago, the method combines Breiman's bagging sampling approach [22] with the random selection of features, introduced independently by [23,24] and Amit and Geman [25], in order to construct a collection of decision trees with controlled variation. Using bagging, each decision tree in the ensemble is constructed from a sample drawn with replacement from the training data. Statistically, about 63.2% of the instances are expected to appear at least once in such a sample; these are referred to as in-bag instances, and the remaining instances (about 36.8%) are referred to as out-of-bag (OOB) instances.
To enhance diversity, at each node of every tree, the best split feature is selected using a goodness measure (e.g., the Gini index) from a set of randomly selected features (typically √n, where n is the total number of features). Each tree is grown to the largest extent possible and is left unpruned, although in practice a maximum depth is often imposed to prevent trees from exhausting memory on high-dimensional datasets.
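The in-bag/out-of-bag proportions mentioned above follow directly from the bootstrap sampling itself. The following short sketch (our own illustration, not code from the paper) checks the expected in-bag fraction both analytically and by simulation:

```python
import random

def in_bag_fraction(n_instances, seed=0):
    """Draw one bootstrap sample of size n (with replacement) and return
    the fraction of distinct instances that appear at least once."""
    rng = random.Random(seed)
    chosen = {rng.randrange(n_instances) for _ in range(n_instances)}
    return len(chosen) / n_instances

# Analytically: P(instance appears) = 1 - (1 - 1/n)^n, which tends to
# 1 - 1/e ~ 0.632 as n grows, leaving ~36.8% of instances out-of-bag.
n = 10000
analytic = 1 - (1 - 1 / n) ** n
print(round(analytic, 3), round(in_bag_fraction(n), 3))
```

Both values come out close to 0.632, matching the in-bag proportion quoted above.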
Having presented a high level overview of Random Forest, the following section discusses the proposed pruning method.

eGAP: An Evolutionary Game Theoretic Approach to Random Forest Pruning
In this section, a novel approach for pruning the classical RF is presented. It combines two techniques that we have used before to improve the performance of RF: clustering and replicator dynamics. These techniques are discussed next.

Clustering
Clustering was the main technique used in [8,9] for the extreme pruning of random forests. In [26], replicator dynamics was employed on a diversified random forest, with subforests produced from randomised subspaces [27], to evolve the subforests by allowing those with better performance to grow and those with lower performance to shrink. The use of replicator dynamics allowed subspaces of features that interact well for accurate classification to gain trees, while subspaces whose features interact poorly, in comparison to other subspaces, lost trees.
The clustering technique utilises a well-established principle in ensemble classification and regression: ensembles tend to perform better when the individual classifiers in the ensemble exhibit a high level of diversity [28][29][30][31]. By clustering the trees in the original ensemble (which we refer to as parentRF) into groups of similar trees, representatives are selected from each group. Consequently, many redundant trees are eliminated, yielding a much smaller ensemble than the original one. This technique was used in two of our previously published articles [8,9] and showed a notable boost in predictive performance metrics in both classification and regression.
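The representative-selection idea behind this clustering technique can be sketched as follows. The tree identifiers, cluster assignments, and accuracies below are hypothetical illustrations, not the authors' implementation:

```python
def select_representatives(cluster_of, accuracy_of):
    """Keep one tree per cluster (the most accurate one), discarding the
    redundant similar trees -- the pruning idea used in CLUB-DRF [8,9]."""
    best = {}
    for tree, cluster in cluster_of.items():
        if cluster not in best or accuracy_of[tree] > accuracy_of[best[cluster]]:
            best[cluster] = tree
    return sorted(best.values())

# Hypothetical clustering of six trees into three groups of similar trees.
clusters = {"t1": 0, "t2": 0, "t3": 1, "t4": 1, "t5": 2, "t6": 2}
accs = {"t1": 0.81, "t2": 0.85, "t3": 0.78, "t4": 0.74, "t5": 0.90, "t6": 0.88}
print(select_representatives(clusters, accs))  # -> ['t2', 't3', 't5']
```

A forest of six trees is thus pruned to three representatives, one per group of similar trees.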

Replicator Dynamics
Replicator Dynamics (RD) [32] is a simple model of evolution used extensively in evolutionary game theory [33][34][35][36][37][38][39], which explains the term eGAP that we coined for our proposed pruning method. It provides an effective way to represent selection among a population of diverse types. To illustrate how it works, assume that time is divided into discrete intervals and that selection occurs between periods. The proportion of each type in the next period is given by the replicator equation as a function of the type's payoff and its current proportion in the population. Types that score above the average payoff increase in proportion, while types that score below the average payoff decrease in proportion. The amount of increase or decrease depends on a type's proportion in the current population and on its relative payoff.
The most general continuous form of RD is given by the differential equation in Equation (1):

ẋᵢ = xᵢ [ fᵢ(x) − φ(x) ],  where  φ(x) = Σⱼ xⱼ fⱼ(x),    (1)

where the proportion of type i in the population is given by xᵢ, the vector of the distribution of types in the population is given by x = (x₁, ..., xₙ), the fitness of type i, which depends on the population, is given by fᵢ(x), and φ(x) is the average population fitness, calculated as the weighted average of the fitness of the n types in the population.
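A minimal discrete-time sketch of the replicator update, using fixed fitness values for illustration (the function and the numbers are our own, not from the paper):

```python
def replicator_step(x, f):
    """One discrete replicator update: each type's proportion x_i is
    scaled by its fitness f_i relative to the average fitness phi(x)."""
    phi = sum(xi * fi for xi, fi in zip(x, f))  # weighted average fitness
    return [xi * fi / phi for xi, fi in zip(x, f)]

# Three types with fixed fitness values: above-average types grow,
# below-average types shrink, and the proportions always sum to 1.
x = [1 / 3, 1 / 3, 1 / 3]
for _ in range(20):
    x = replicator_step(x, [1.0, 1.2, 0.8])
print([round(xi, 3) for xi in x])
```

After a few iterations the fittest type dominates the population, which is exactly the selection pressure eGAP exerts on its working clusters.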

eGAP Algorithm
By incorporating clustering and replicator dynamics, eGAP produces a pruned RF ensemble using the following procedure. First, a random forest of a large enough number of trees (e.g., 1000 trees) called parentRF is built. Trees in this forest are then clustered into groups of similar trees using an efficient clustering method (e.g., k-means). For clustering purposes, each tree is represented as an ordered list of classification outputs for classification tasks. The dissimilarity between any two trees is calculated using the Hamming distance between the two ordered lists. In regression, each tree is represented by the regressed values, forming a vector of real numbers. Thus, the dissimilarity between any two trees is calculated using the Euclidean or Manhattan distance between their respective vectors.
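The tree representations and dissimilarity measures described above can be sketched as follows (the prediction lists are hypothetical and the helper names are our own):

```python
def hamming(preds_a, preds_b):
    """Classification: number of training instances on which two trees'
    ordered lists of predicted labels disagree."""
    return sum(a != b for a, b in zip(preds_a, preds_b))

def euclidean(preds_a, preds_b):
    """Regression: distance between the trees' vectors of regressed values."""
    return sum((a - b) ** 2 for a, b in zip(preds_a, preds_b)) ** 0.5

tree_a = [0, 1, 1, 0, 1]        # hypothetical per-instance class labels
tree_b = [0, 1, 0, 0, 1]
print(hamming(tree_a, tree_b))  # the two trees disagree on one instance
```

Each tree thereby becomes a point in a prediction space, and any clustering method that accepts such distances (e.g., k-means on the prediction vectors) can group similar trees together.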
The number of trees in the smallest cluster is then used to determine how many trees to draw from each cluster, forming working clusters that all have the same initial size. The remaining trees (if any) are kept in the idle clusters; hence, for each working cluster, there is an idle one. Next, RD is applied on the working clusters, such that no working cluster can grow beyond its original cluster size or shrink to 0. In RD terminology, the trees in each working cluster act as a type, and the working cluster's size is analogous to the proportion of that type. The working cluster's accuracy is analogous to the type's payoff, and the average accuracy of the working clusters is analogous to the average payoff. The discrete intervals mentioned above correspond to loop iterations. At each iteration, the performance of the trees in the working cluster being processed is compared with the performance of the trees in all the working clusters. If it is better, the working cluster grows by taking the best performing tree from the corresponding idle cluster; otherwise, it shrinks by moving its worst performing tree to the corresponding idle cluster. The performance of the trees in the working clusters is measured by percentage accuracy for classification and R squared for regression. Figure 1 demonstrates the main steps involved in the production of eGAP. Putting it all together, the eGAP algorithm can be summarised as follows:
• An RF of 1000 trees is constructed.
• Trees in the forest are clustered into k initial clusters (where k is a multiple of 5 in the range 5-50).
• The number of trees in the smallest cluster is determined (let us refer to it as minTrees).
• This number is used to form the working clusters by drawing minTrees trees from each initial cluster.
• After drawing the trees, the initial clusters form the idle clusters.
• RD is then applied on the working clusters such that no cluster can grow greater than its original size, or shrink to 0.
• Every time a tree is added to a working cluster, the best performing tree in the corresponding idle cluster is chosen and added to the working cluster.
• Every time a tree is removed from a working cluster, the worst performing tree is removed and placed in the corresponding idle cluster.
• After RD is applied on the working clusters, the trees in these clusters are used to populate the ordered list T_eGAP, which represents eGAP.
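The grow/shrink loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the cluster structures, tree identifiers, and accuracies are hypothetical, and per-tree accuracies stand in for the cluster performance evaluation:

```python
def egap_rd(working, idle, accuracy, original_size, iterations):
    """Sketch of the RD loop on working clusters. A cluster whose mean
    tree accuracy beats the all-working-cluster average grows by taking
    the best tree from its idle cluster; otherwise it shrinks by
    returning its worst tree, within the [1, original size] bounds.
    `working`/`idle` map a cluster id to a list of tree ids, and
    `accuracy` maps a tree id to its accuracy."""
    def mean(trees):
        return sum(accuracy[t] for t in trees) / len(trees)

    for _ in range(iterations):
        overall = mean([t for c in working.values() for t in c])
        for cid, trees in working.items():
            if mean(trees) >= overall:
                if idle[cid] and len(trees) < original_size[cid]:
                    best = max(idle[cid], key=accuracy.get)   # grow
                    idle[cid].remove(best)
                    trees.append(best)
            elif len(trees) > 1:                              # never shrink to 0
                worst = min(trees, key=accuracy.get)          # shrink
                trees.remove(worst)
                idle[cid].append(worst)
    return [t for c in working.values() for t in c]           # the pruned forest

# Two working clusters of one tree each, with one idle tree apiece.
working = {0: ["a1"], 1: ["b1"]}
idle = {0: ["a2"], 1: ["b2"]}
acc = {"a1": 0.90, "a2": 0.95, "b1": 0.60, "b2": 0.50}
print(sorted(egap_rd(working, idle, acc, {0: 2, 1: 2}, 3)))  # -> ['a1', 'a2', 'b1']
```

In this toy run the strong cluster absorbs its idle tree while the weak cluster, already at its minimum size of one tree, cannot shrink further.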
More formally, the above algorithm is outlined in Algorithm 1 where T refers to the training dataset, S refers to the size of the parentRF to be created, k refers to the number of clusters to be created, and RDIterations refers to the number of RD iterations that will be applied on the working/active clusters.
The cost of applying eGAP has two components: (1) applying a clustering algorithm on a relatively small dataset (each instance represents a tree/classifier in the ensemble); and (2) re-evaluating the ensemble after each iteration of the replicator dynamics process. Both components are computationally cheap, as no re-training is required.
Having discussed the proposed method for pruning Random Forests (eGAP), the following section provides a detailed account of the thorough experimental study conducted to validate the effectiveness of the method.

Algorithm 1 eGAP algorithm
{User Settings}
input T, S, k, RDIterations
{Process}
Create an empty super ordered list AllPredictions
Create an empty ordered list T_rf to represent parentRF
Create an empty ordered list T_eGAP to represent eGAP
Using the traditional Random Forest algorithm, create T_parentRF of size S
{For each tree in T_parentRF, find its predictions on T and update AllPredictions}
for i = 1 → S do
  AllPredictions = AllPredictions ∪ FindPredictions(T_parentRF.tree(i), T)
end for
Using k-means, cluster AllPredictions into a set of k initial clusters: cluster_1 ... cluster_k
Find the size of the smallest cluster, minSize
Create working clusters (wkClusters) by drawing minSize trees from each initial cluster
Create idle clusters (idleClusters) from the trees remaining in each initial cluster
for i = 1 → RDIterations do
  for j = 1 → k do
    if the trees in wkClusters(j) outperform the average of all working clusters then
      Remove the best performing tree from idleClusters(j)
      Add it to wkClusters(j)
    else
      Remove the worst performing tree from wkClusters(j)
      Add it to idleClusters(j)
    end if
  end for
end for
{Use the trees in wkClusters after applying RD to populate T_eGAP}
for i = 1 → wkClusters do
  Get the next tree from wkClusters(i)
  Add it to T_eGAP
end for
{Output}
An ordered list of trees T_eGAP

Experimental Study and Results
This section serves as the experimental part of the paper and includes a detailed description of the experiments conducted and the results obtained. Before that, the required experimental setup is covered in the next subsection.

Setup
Tables 1 and 2 outline the classification and regression datasets, respectively, that are used in our experiments, showing the number of features and instances in each dataset. Two sources were used for these datasets: the UCI repository [40] and MLData [41]. The configuration of the laptop used in the experiments is depicted in Table 3. Table 4 outlines the metrics reported in the experiments: the first two are common to both classification and regression datasets, the next three apply to the classification datasets, and the last four to the regression datasets.

Experiments and Results
The eGAP algorithm outlined in Algorithm 1 was implemented using the Python programming language utilising its machine learning library Scikit-learn to produce a variety of classification and regression metrics as was outlined in Table 4.
With reference to Algorithm 1, in our experiments, the following values were used for user settings. For both S and RDIterations, 1000 was used. Hence, the size of parentRF was 1000 trees, and 1000 RD iterations were applied to grow/shrink the working clusters during the evolution process. As for the number of clusters k, multiples of 5 in the range 5-50 were used.
The holdout testing method, arguably the simplest validation scheme, was used: each dataset was divided into two sets, training and testing. Two thirds (66%) of the instances were reserved for training and the rest (34%) for testing. The two sets were selected randomly using a uniform distribution, where each instance had the same probability of being selected.
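A minimal sketch of this 66/34 uniform-random holdout split (our own illustration; the paper's experiments use Scikit-learn, which provides the equivalent `train_test_split` utility):

```python
import random

def holdout_split(instances, train_frac=0.66, seed=42):
    """Uniform-random holdout split: each instance is equally likely to
    land in either set; ~66% go to training and the rest to testing."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(range(100))
print(len(train), len(test))  # 66 34
```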
As demonstrated in the next subsection, the performance of eGAP is compared not only with parentRF, but also with a random forest of identical size, termed RF, whose trees are chosen at random from parentRF, and with CLUB-DRF. The latter refers to the method used in [8,9], where clustering was the main technique in the extreme pruning of random forests. Such a comparison between eGAP and CLUB-DRF can shed light on whether RD has the potential to improve performance.
To compare the performance of eGAP with parentRF, RF, and CLUB-DRF, a key performance indicator was used: percentage accuracy for classification and Mean Absolute Error (MAE) for regression. It is highlighted in boldface in Tables 5 and 7 below whenever eGAP performed at least as well as parentRF, RF, and CLUB-DRF. It is worth mentioning that although multiples of 5 clusters in the range 5-50 were generated, as mentioned earlier, only the best performing number of clusters, from which eGAP was generated, is reported for each dataset. These are listed in the second column of Table 5 (for the classification datasets) and Table 7 (for the regression datasets).

Classification
The data in Table 5 compare the performance of eGAP with parentRF, RF, and CLUB-DRF on the classification datasets; percentage accuracy, Area Under Curve (AUC), and F-Measure are reported. Table 5 demonstrates that eGAP outperformed parentRF, RF, and CLUB-DRF on almost all the datasets. As an interesting observation in the first column of this table, 5 clusters appeared to be the mode, meaning it was the best performing number of clusters on the majority of the datasets.
For Inference Time per Instance (ITPI, rounded up to the nearest integer) and pruning level (which only applies to parentRF), the data in Table 6 compare the performance of eGAP with that of parentRF, RF, and CLUB-DRF. In this table, it is interesting to see that, of the 10 datasets, eGAP achieved an extreme pruning level of over 90% relative to parentRF on six datasets, and of less than 90% on four. Consequently, due to its smaller size compared with parentRF, it achieved a faster ITPI on all the datasets, as demonstrated in the third column. In the same table, although eGAP and RF had identical sizes, eGAP interestingly achieved a faster ITPI on eight of the 10 datasets. This is likely due to eGAP having shorter trees, which may be attributed to the RD process, when better performing trees in the idle clusters joined the active clusters; shorter trees tend to overfit the data less and hence perform better. Furthermore, when comparing the performance of eGAP with the CLUB-DRF method [8,9], which only utilised clustering, RD proved effective in improving performance, since eGAP was able to outperform CLUB-DRF on seven datasets. For the three datasets where CLUB-DRF was superior to eGAP, the margin was a negligible fraction of less than 1%.
When comparing the ITPI between eGAP and CLUB-DRF, we saw that CLUB-DRF performed much better as it achieved much faster ITPI. This was expected and came as no surprise since in the CLUB-DRF method, only one representative was selected from each cluster yielding a much smaller ensemble than eGAP. At the conclusion of RD in the eGAP method, depending on how many trees were added to each working cluster, it was very likely that each working cluster would have many trees (recall that after applying RD, the trees in the working clusters form the basis for the trees that eGAP will have as shown in the last for loop in Algorithm 1).

Regression
The data in Table 7 compare the performance of eGAP with parentRF, RF, and CLUB-DRF on the regression datasets; Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R squared are reported. As demonstrated in Table 7, eGAP outperformed parentRF, RF, and CLUB-DRF on almost all the datasets, with the mode being five clusters, as was the case with the classification datasets in Table 5. Table 8 compares the ITPI and pruning level (which only applies to parentRF) of eGAP relative to parentRF, RF, and CLUB-DRF. In this table, note that of the 10 datasets, eGAP achieved an extreme pruning level of over 90% on seven datasets and of less than 90% on three. Consequently, due to its smaller size compared with parentRF, it achieved a faster ITPI on all the datasets, as demonstrated in the third column. When comparing eGAP and RF, although both have identical sizes, eGAP achieved a faster ITPI on eight of the 10 datasets, likely because eGAP had shorter trees, as stated before. Furthermore, as demonstrated in the table, since CLUB-DRF was much smaller in size than eGAP, as discussed before (only one representative was selected from each cluster, yielding a much smaller ensemble), it was expected to have a faster ITPI.

Discussion
In comparing the performance of eGAP with that of the parentRF for both the classification and regression datasets as demonstrated in Tables 5 and 7 respectively, we see that not only has eGAP outperformed parentRF on all the datasets, but even did so significantly for many datasets. Furthermore, as depicted in Tables 6 and 8, high pruning level was achieved on the majority of the datasets yielding an average pruning level of 71.58% for classification and 84.54% for regression. Consequently, faster ITPI was achieved, as demonstrated in these tables, since the number of trees in eGAP was significantly fewer than that of the parentRF. Likewise, eGAP performed consistently well relative to RF and CLUB-DRF as was demonstrated in Tables 5 and 7 respectively. Interestingly, it also achieved faster ITPI on the majority of the datasets as was demonstrated in these tables.
Since clustering was also used in CLUB-DRF, the performance gain achieved by eGAP over CLUB-DRF is likely attributable to RD. This is because, after completing the RD iterations, each working cluster eventually contains high performing trees, as low performing trees are removed in each iteration. This is evidenced by the fact that eGAP was able, in most cases, to outperform parentRF, RF, and CLUB-DRF.
As for the recommended number of clusters, since five clusters recurred as the best performing setting in the experimental section, we recommend this value for best results. Favourable results, both in terms of pruning level and accuracy, were achieved using five clusters, which makes eGAP a resource-efficient method suitable for developing smart healthcare applications. eGAP is a novel method for ensemble pruning that is based on evolutionary game theory. The results reported in this article demonstrate its efficacy in pruning Random Forests as well as improving their predictive performance. It is worth noting that the method is generic and can be used for pruning any ensemble model. Having used the smallest cluster size to determine the size of the initial active clusters in this work, other heuristics can be used to initialise the process; for example, the initial active clusters can be of varying sizes. Such variation of settings may prove beneficial for larger ensemble models.
In the context of the Internet of Things (IoT), eGAP fits well under the umbrella of healthcare in smart cities. This is mainly attributed to its two-fold efficacy relative to the traditional Random Forest: accuracy, and speed owing to its smaller size. These two traits are desirable in any smart application, since they enable users to complete a desired task or action more efficiently. In the context of this paper, eGAP demonstrates its applicability to smart healthcare applications on both classification and regression tasks.

Conclusions and Future Work
In this research article, a new method has been introduced to improve the accuracy and speed of the traditional random forest in both classification and regression. The new method, termed eGAP, produces a pruned version of the traditional random forest that is smaller than the parent random forest from which it was derived and yet performs at least as well. To achieve this, eGAP combines two techniques that have been used before to improve random forests, namely clustering and replicator dynamics. Clustering produces clusters of similar trees based on their predictions on the training dataset. As described in Section 4.3, after forming the idle and working clusters, replicator dynamics is applied on each working cluster to grow/shrink it by comparing its performance with the performance of all the working clusters. The high pruning level achieved, together with the improved performance over the traditional random forest, makes eGAP a resource-efficient method that can form the basis for an ideal smart healthcare application.

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.