Entropy Ensemble Filter: A Modified Bootstrap Aggregating (Bagging) Procedure to Improve Efficiency in Ensemble Model Simulation

Over the past two decades, the Bootstrap AGGregatING (bagging) method has been widely used for improving simulation. The computational cost of this method scales with the size of the ensemble, but excessively reducing the ensemble size comes at the cost of reduced predictive performance. The novel procedure proposed in this study is the Entropy Ensemble Filter (EEF), which uses only the most informative training data sets in the ensemble rather than all ensemble members created by the bagging method. The results of this study indicate the efficiency of the proposed method in application to synthetic data simulation on a sinusoidal signal, a sawtooth signal, and a composite signal. The EEF method can reduce the computational time of simulation by around 50% on average while maintaining predictive performance at the same level as the conventional method, in which all of the ensemble models are used for simulation. The analysis of the error gradient (root mean square error of ensemble averages) shows that using the 40% most informative ensemble members of the set initially defined by the user appears to be most effective.


Introduction
Machine learning is one of the key components of computational intelligence, and its main objective is to use computational methods to become more accurate in predicting outcomes without being explicitly programmed. Machine learning has a wide spectrum of applications in different science disciplines [1][2][3][4][5][6][7][8][9]. Advanced computational methods, including artificial neural networks (ANN), process input data in the context of previous training history on a defined sample database to produce relevant output [7]. To avoid negative effects of over-fitting, an ensemble of models is sometimes used in prediction [10]. In machine learning jargon, an ensemble of models is often referred to as a committee [5].
Bagging (abbreviated from Bootstrap AGGregatING) [11] developed from the idea of bootstrapping [12,13] in statistics. Under bootstrap resampling, data are drawn randomly, with replacement, from a dataset to form a new training dataset, which has the same number of data points as the original dataset. In committee machines, bagging is widely used for its simplicity and efficiency in enhancing the prediction power of individual models, also called experts [11].
Applications have spanned a wide range of fields. Zhu et al. [14] applied the bagging method to the forecasting of tropical cyclone tracks over the South China Sea. Fraz et al. [15] used an ensemble system of bagged and boosted decision trees for retinal blood vessel segmentation. Brenning [16] investigated the performance of bagging in spatial prediction models for landslide hazards. Dietterich [17] compared the effectiveness of bagging, boosting, and randomization methods for constructing ensembles of decision trees. A recurring question in these previous works was: how to choose the ensemble of training data sets for tuning the

Methods: Entropy Ensemble Filter
The philosophy of the EEF method is rooted in using the self-information of a random variable, as defined by Shannon's information theory [19], for selection, to provide direction in the inherent randomness of ensemble models that are created by bootstrapping. In previous work, a weighting of model-generated ensemble members based on relative entropy was used [20] to reflect additional information available after ensemble generation. In this work, the focus is on selecting an ensemble of training datasets before ensemble model tuning (training of the ANNs). It is our hypothesis that if an ensemble of ANN models, or any other machine learning technique, uses the most informative ensemble members for training rather than all bootstrapped ensemble members, it could reduce the computational time substantially without negatively affecting the performance of simulation.
We discuss the EEF algorithm based on Shannon information theory. Shannon quantifies information by calculating the smallest possible number of bits needed, on average, to communicate outcomes of a random variable, e.g., per symbol in a message (here, symbols represent bins in a probability mass function, which are defined with respect to the input data resolution) [18,19,21,22]. The Shannon entropy H, in units of bits (per symbol), of ensemble member m in the bootstrapped dataset (generated in step 1 of Algorithm 1) is given by:
$$H_m(Y) = -\sum_{k=1}^{K} p_{y_k} \log_2 p_{y_k} \qquad (1)$$
where $p_{y_k}$ is the probability of occurrence, within ensemble member m, of the kth possible value of the random variable Y, and K is the total number of discrete values Y can take, i.e., the number of bins in the discretization. This equation gives the entropy in units of "bits" because it uses a logarithm of base 2. Algorithm 1 illustrates the workflow of the EEF method.
The EEF method can assess and cluster the ensemble members to provide the most informative ones for training, selected from the initially generated ensemble. Since model training is by far the most computationally expensive part of the procedure, overall computation time is roughly linear in the number of retained ensemble members, potentially leading to significant savings.
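The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `shannon_entropy` and `eef_select`, the use of NumPy, the choice of computing the bin edges once from the full input signal, and the user-supplied number of retained members `n_keep` are all hypothetical. The default of 10 equal-width bins follows the discretization used in the application section.

```python
import numpy as np

def shannon_entropy(values, bin_edges):
    """Empirical Shannon entropy (in bits) of `values` discretized on fixed bins."""
    counts, _ = np.histogram(values, bins=bin_edges)
    p = counts[counts > 0] / counts.sum()      # drop empty bins: 0 * log2(0) := 0
    return float(-np.sum(p * np.log2(p)))

def eef_select(signal, n_members, n_keep, n_bins=10, seed=0):
    """Bootstrap `n_members` training sets and keep the `n_keep` with highest entropy.

    Assumptions of this sketch: bin edges span the min/max of the full signal,
    and the retained ensemble size `n_keep` is chosen by the caller.
    """
    rng = np.random.default_rng(seed)
    n = len(signal)
    edges = np.linspace(signal.min(), signal.max(), n_bins + 1)
    # Each row holds the indices of one bootstrap sample (drawn with replacement).
    idx = rng.integers(0, n, size=(n_members, n))
    entropies = np.array([shannon_entropy(signal[i], edges) for i in idx])
    keep = np.argsort(entropies)[::-1][:n_keep]  # highest entropy first
    return idx[keep], entropies[keep], entropies

# Usage: 50 bootstrap members of a sampled sinusoid, of which 20 are retained.
signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))
kept_idx, kept_entropy, all_entropy = eef_select(signal, n_members=50, n_keep=20)
print(kept_idx.shape, kept_entropy[:3])
```

Because only the index sets and a histogram per member are computed, this filtering step is cheap compared with training one model per retained member.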

Use ensemble averages instead of individual ensemble models
The rationale for using ensemble averages is that the expected error of the ensemble average is less than or equal to the average expected error of the individual models in the ensemble.
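For squared error, this inequality follows from the convexity of the square function (Jensen's inequality) and can be checked numerically. The snippet below is only an illustration under assumptions: the "model outputs" are hypothetical (a target plus independent Gaussian errors), and mean squared error is assumed as the error measure.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)
target = np.sin(2.0 * np.pi * t)

# Hypothetical model outputs: the target plus independent errors for each of 25 models.
preds = target + rng.normal(0.0, 0.3, size=(25, t.size))

mse_each = np.mean((preds - target) ** 2, axis=1)             # one MSE per model
mse_of_average = np.mean((preds.mean(axis=0) - target) ** 2)  # MSE of the ensemble average

print(f"mean individual MSE:     {mse_each.mean():.4f}")
print(f"MSE of ensemble average: {mse_of_average:.4f}")
```

The second number can never exceed the first, whatever the individual errors are; how much smaller it is depends on how strongly the errors of the individual models are correlated.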

Application: Synthetic Data Simulation
In this section, the EEF method is tested by using synthetic data and artificial neural networks. Le et al. [23] note that "the deep learning community has reported remarkable results taking the synthetic data to train artificial neural networks". We use artificial signals that we corrupt with noise before model training to examine the model's capability to capture the essence of the signal from the noisy signal. In this study, a sinusoidal signal, a non-sinusoidal periodic waveform (sawtooth wave), and a nonperiodic composite signal have been used to create signals that we interpret as a true underlying process (target signal) we wish to simulate. However, these signals are not directly observable for model training; they are corrupted by noise that represents, e.g., measurement error or unknown external influences. These target signals were chosen for the following reasons:

• Sinusoids are ubiquitous in physics because many physical systems that resonate or oscillate produce quasi-sinusoidal motion.

• The performance of the method for simulation of a non-sinusoidal waveform was tested on a sawtooth signal, a classical geometric waveform.

• A composite signal has been used to test the performance of the method for simulation of nonperiodic signals. The signal is composed of upward steps followed by exponential decay functions, which resemble typical behaviour of river flow response to rainfall events.
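The three target signals listed above can be generated along the following lines. This is a sketch under assumptions: the amplitudes, periods, step times, decay constant, noise level, and the helper names `make_signals` and `add_noise` are hypothetical placeholders and are not the values or definitions used in the study.

```python
import numpy as np

def make_signals(n_points=500):
    """Return a time axis and three clean target signals (all parameters are assumed)."""
    t = np.linspace(0.0, 10.0, n_points)
    sinusoid = np.sin(2.0 * np.pi * t / 5.0)
    # Sawtooth with period 2.5 that ramps linearly from -1 to 1.
    sawtooth = 2.0 * (t / 2.5 - np.floor(t / 2.5 + 0.5))
    # Composite: upward steps, each followed by an exponential decay.
    composite = np.zeros_like(t)
    for start, height in [(1.0, 1.0), (4.0, 0.6), (6.5, 1.4)]:
        active = t >= start
        composite[active] += height * np.exp(-(t[active] - start) / 1.0)
    return t, {"sinusoid": sinusoid, "sawtooth": sawtooth, "composite": composite}

def add_noise(signal, sigma, rng):
    """Corrupt a clean signal with zero-mean Gaussian noise of standard deviation sigma."""
    return signal + rng.normal(0.0, sigma, size=signal.shape)

t, signals = make_signals()
rng = np.random.default_rng(0)
noisy = {name: add_noise(s, 0.2, rng) for name, s in signals.items()}
```

Only the noisy versions would be passed on to bootstrapping and model training; the clean signals are kept aside for the final evaluation.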

Procedure
First, random noise with a normal distribution was added to the known sinusoidal, sawtooth, and composite signals (Equations (2)-(4), respectively) to make the noisy signal (Equation (5)) presented in Figures 1-3. The noisy signal was used as an input in the bagging procedure to generate an ensemble of input datasets, referred to as ensemble members. Following the steps described in Algorithm 1, the members chosen by the EEF method are used for training ANNs and subsequently generating the simulation result for each member (Equation (6)).
where T is the number of data points in the signal. Subsequently, a prediction is made using the ensemble average over the selected subset of the ensemble. There are three options for the formation of the subset used in the analysis in this paper: (1) $M_{\mathrm{all}}$: all originally generated ensemble members; (2) $M_{\mathrm{rand}}$: a randomly selected subset of size L (reduced from the original size M); and (3) $M_{\mathrm{EEF}}$: the EEF subset, formed by selecting the top L highest entropy training data sets generated by bootstrapping.
In Equation (7), the case for option 3 is shown.
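The RMSE of the ensemble average in Equation (8), shown for the EEF method, is calculated with respect to the original target signals y (Equations (2)-(4)):
$$\mathrm{RMSE}\left(\bar{y}_{\mathrm{EEF}}\right) = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\bar{y}_{\mathrm{EEF}}(t) - y(t)\right)^{2}} \qquad (8)$$
Here, $\bar{y}_{\mathrm{EEF}}(t)$ denotes the ensemble average over the EEF subset at data point t, and T is the number of data points in the signal.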
The entropy calculations for each ensemble member are performed in a discretized space, where the signals are processed using 10 bins of equal size arranged between the signal's minimum and maximum values. These bin sizes were chosen to strike a balance between being fine enough to capture the distribution of the values in the time series and being coarse enough that enough data points are available per bin to have a representative histogram. The entropies of all training datasets in the ensemble are then calculated by the Shannon entropy equation (Equation (1)). Since entropy is calculated empirically, the method can be applied regardless of the data distribution type. The index of the highest entropy ensemble member found is used to determine the new ensemble size (see Appendix A). Then, the ensemble of training data sets is filtered, and only the highest entropy training data sets are retained.
ANN models were then trained on all bootstrapped noisy data sets retained in the ensemble, and on all original ensemble members for reference. In the experiments, the ANN that was used was a feed-forward multilayer perceptron model (using a hyperbolic tangent activation function) with one input and output layer (the bootstrapped datasets), and 10, 50, and 20 hidden neurons. These were fitted to the bootstrapped noisy sinusoidal, sawtooth, and composite signals, respectively, using the early stopping procedure. For each ensemble, the predictions of the ANNs were averaged to yield an ensemble prediction. The distribution of the ensemble predictions is not forced to any parametric form, and, in general, bagging and our proposed modification are not sensitive to distribution type. The predictions were evaluated by calculating RMSE against the target signal, i.e., the synthetic data before the corruption by noise. Note that the true signal was not available to the ANN during training.
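The overall procedure can be illustrated end to end with a deliberately cheap stand-in model. The sketch below is an assumption-laden illustration, not the experimental setup of this study: a least-squares polynomial fit (`np.polyfit`) replaces the ANN with early stopping, the noise level and the sizes M = 100 and L = 40 are arbitrary choices, and L is fixed by hand rather than derived as in Appendix A. It compares the three subset options by the RMSE of their ensemble averages against the clean target.

```python
import numpy as np

def entropy_bits(values, edges):
    """Empirical Shannon entropy (bits) on fixed bin edges."""
    counts, _ = np.histogram(values, bins=edges)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def fit_predict(t_train, y_train, t_eval, degree=7):
    """Stand-in for ANN training (an assumption): least-squares polynomial fit."""
    coeffs = np.polyfit(t_train, y_train, degree)
    return np.polyval(coeffs, t_eval)

rng = np.random.default_rng(42)
t = np.linspace(-1.0, 1.0, 300)
target = np.sin(2.0 * np.pi * t)                      # clean signal, never used for fitting
noisy = target + rng.normal(0.0, 0.3, size=t.size)    # the only data the models see

M, L = 100, 40                                        # initial and retained ensemble sizes
edges = np.linspace(noisy.min(), noisy.max(), 11)     # 10 equal-width bins
boot = rng.integers(0, t.size, size=(M, t.size))      # bootstrap indices, with replacement
H = np.array([entropy_bits(noisy[i], edges) for i in boot])

subsets = {
    "M_all": np.arange(M),                            # option 1: every member
    "M_rand": rng.choice(M, size=L, replace=False),   # option 2: random subset of size L
    "M_EEF": np.argsort(H)[::-1][:L],                 # option 3: top-L entropy members
}
results = {}
for name, members in subsets.items():
    preds = np.array([fit_predict(t[boot[m]], noisy[boot[m]], t) for m in members])
    results[name] = rmse(preds.mean(axis=0), target)
    print(f"{name}: {len(members)} members, RMSE = {results[name]:.4f}")
```

With such a simple stand-in model, the printed values say nothing about the ANN results reported here; the point is only to show where the entropy filter sits in the workflow and that the number of fits drops from M to L.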
Entropy 2017, 19, 520

Results and Analysis
The variations of information content for each ensemble member training data set for the sinusoidal signal, sawtooth wave, and composite signal are shown in Figures 4-6, respectively. In the figures it is visible that the bootstrapping leads to significant variability in the training dataset entropies.
After the most informative ensemble members are chosen to train ANNs and their outputs have been processed through ensemble averaging, the predictions are plotted in Figures 7-9. For comparison, the conventional bagging method, based on all ensemble members, is used to train a separate ensemble of neural networks. The prediction from these ensemble averages is included in the same figures. As illustrated in Figures 7-9, the simulation results of using all ensemble members and the ones chosen by the EEF method closely resemble each other, which indicates that filtering the ensemble models could be a reliable method.
To gain insight into the trade-off between ensemble size (i.e., computation time) and accuracy in terms of RMSE, an analysis of the error gradient with growing ensemble size was conducted. In this analysis, the decrease in error was compared between the EEF method and conventional bagging with increasing ensemble size. To filter out some of the inherent randomness from the results, the whole process was repeated ten times with different realizations of the random noise, and the resulting RMSEs were averaged over these 10 realizations. The error gradient shows the effect of varying the final ensemble size after selection. The initial ensemble size also plays a role in the prediction accuracy, since selecting from a larger initial pool of ensemble members means higher entropy values in the selection. In current practice, the user decides how many ensemble models are needed for training and tuning the weights in machine learning. Therefore, we show the results of the error gradient analysis for initial ensembles of 100 and 1000 bootstrapped members in Figures 10-12 and 13-15, respectively. The idea of ranking the ensemble by the EEF method and subsequently using it for machine learning shows its advantages in Figures 10 and 13, for the sinusoidal signal. For the other signals, the advantages are mostly in the smallest ensemble sizes, visible in Figures 10-15. The results show that using the 40% most informative ensemble members of the set initially defined by the user appears to be most effective.
An upward jump in RMSE, such as seen for the conventional bagging in Figures 10 and 13, indicates that an ensemble member (training data set) was picked that led to an ANN that does not perform well in prediction, deteriorating the ensemble average when added to the ensemble. The effect of adding such an ensemble member will be larger when the selected ensemble is still small in size, since the relative weight of the new member in the average will be higher. In the entropy-based ordering of the EEF, those ensemble members would also be picked eventually, but generally later in the sequence, when the effect on the total ensemble is small enough not to cause an important upward jump in RMSE. Since the EEF reduces the ensemble size, in many cases some of the poorly performing members will be eliminated from the ensemble. In the limit of using the full ensemble, the EEF and the conventional method converge upon each other (as seen at the extreme right of Figures 10-15), since the full ensembles are identical. The fact that those jumps are not displayed in the EEF results indicates that these poorly performing ANNs are not among the ones trained on the highest entropy training data sets. These are the ones that would typically be retained by the EEF method.
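The error gradient discussed above amounts to the RMSE of a running ensemble average as members are added in a given order. A minimal sketch follows, assuming the member predictions are already available as rows of a NumPy array; the function name `error_gradient` and the random demo data are hypothetical.

```python
import numpy as np

def error_gradient(preds, target, order):
    """RMSE of the running ensemble average as members are added in `order`.

    preds  : array of shape (M, T), one row of predictions per ensemble member
    target : array of shape (T,), the clean target signal
    order  : sequence of member indices, e.g., sorted by decreasing entropy
    """
    ordered = preds[np.asarray(order)]
    sizes = np.arange(1, len(ordered) + 1)
    running_mean = np.cumsum(ordered, axis=0) / sizes[:, None]
    return np.sqrt(np.mean((running_mean - target) ** 2, axis=1))

# Demo with hypothetical predictions: two different orderings of the same members.
rng = np.random.default_rng(0)
target = np.zeros(50)
preds = rng.normal(0.0, 1.0, size=(8, 50))
curve_a = error_gradient(preds, target, np.arange(8))
curve_b = error_gradient(preds, target, np.arange(8)[::-1])
print(curve_a[-1], curve_b[-1])  # identical up to rounding: same full ensemble
```

The two curves differ for small ensemble sizes but end at the same value, which mirrors the convergence of the EEF and conventional curves at full ensemble size.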
Furthermore, the EEF method has been tested with different initial numbers of committee members, as illustrated in Tables A1-A3 (see Appendix A). The results of the sinusoidal signal, sawtooth wave, and composite signal simulations indicate that the EEF method can improve the simulation error by 3% on average for the sinusoidal signal, and roughly maintain error performance at the same level for the sawtooth wave and composite signal. More importantly, empirical testing showed that it can reduce the simulation time by 54%, 56%, and 45% on average, respectively.

Protection against Overfitting
There are several layers in the procedure that offer protection against overfitting. Firstly, it is important to note that the entire prediction procedure never sees the original data set that is tested against, since only the noise-corrupted version of the data is used for training; however, the final evaluation of performance is against the non-noisy original data set.
Secondly, for both compared methods, the individual ensemble member ANNs are trained on bootstraps of these noise-corrupted data. For each individual data set in the selected ensemble, the ANN training uses the standard and well-tested early stopping (also known as stopped training) procedure to prevent overfitting. In this procedure, the data are divided into training and validation data, and training continues until validation performance starts to deteriorate [5].
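The early stopping idea can be sketched generically as follows. This is an illustration under assumptions, not the training routine used in the experiments: a polynomial-feature linear model trained by plain gradient descent stands in for the ANN, and the validation fraction, learning rate, `patience` threshold, and function name are hypothetical choices.

```python
import numpy as np

def train_with_early_stopping(x, y, val_fraction=0.3, lr=0.05,
                              max_epochs=5000, patience=20, seed=0):
    """Gradient descent that stops once validation error stops improving."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    n_val = int(val_fraction * len(x))
    val, tr = perm[:n_val], perm[n_val:]

    def features(u):
        # Stand-in model (an assumption): polynomial features 1, u, ..., u^5.
        return np.vander(u, 6, increasing=True)

    A_tr, A_val = features(x[tr]), features(x[val])
    w = np.zeros(A_tr.shape[1])
    best_w, best_loss, wait = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        grad = 2.0 * A_tr.T @ (A_tr @ w - y[tr]) / len(tr)
        w -= lr * grad
        val_loss = np.mean((A_val @ w - y[val]) ** 2)
        if val_loss < best_loss - 1e-9:
            best_w, best_loss, wait = w.copy(), val_loss, 0   # new best: remember weights
        else:
            wait += 1
            if wait >= patience:                              # validation stopped improving
                break
    return best_w, best_loss, epoch + 1

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=x.size)
weights, val_mse, n_epochs = train_with_early_stopping(x, y)
print(f"stopped after {n_epochs} epochs, validation MSE = {val_mse:.4f}")
```

Returning the weights from the best validation epoch, rather than the last one, is one common variant of the procedure.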
Thirdly, the bagging procedure adds another layer of protection against overfitting, in which the outcomes of several fitted models are averaged, reducing reliance on one single model. As can be seen in Figures 10-15, larger ensemble sizes improve prediction up to a certain ensemble size. Therefore, a trade-off between accuracy and ensemble size exists for smaller ensembles. The EEF method provides a way to reduce ensemble size (and computational cost) with a smaller decrease in performance, or, conversely, to improve performance for fixed small ensemble sizes. In that sense, the EEF method is a Pareto improvement over the conventional method. The EEF selects ensemble members before any model is trained and therefore does not have access to the original signal or predictive performance. In summary, the EEF does not increase overfitting issues compared to conventional bagging, which already has safeguards in place at different levels.

Conclusions
In this article, we introduced a novel procedure to assess and cluster ensemble members for bootstrap aggregating (bagging). Fundamentally, we assert that the EEF method can reduce the computational time of simulation very substantially while maintaining error performance at the same level as the conventional method, in which all of the ensemble models are used for simulation. The idea of ranking and selecting the ensemble with the EEF method and subsequently using the selected members for machine learning shows its advantages in Figures 10 and 13. Figures 10-15 show a clear effect of ensemble size on prediction quality for the smaller ensemble sizes. The positive effects of using the EEF method are most pronounced for the smallest ensemble sizes. The EEF method can be useful for meeting computational power constraints under the continual arrival of new data, which necessitates frequent model updating in atmospheric science. Peng et al. [24] note that computational expense is one of the difficulties in air quality forecasting. Although the results of this study indicate the efficiency of the proposed framework in application to synthetic data simulation, further evaluations of the proposed framework are still necessary, especially in applications to data assimilation problems with real data and numerous observations.

Figure 10. The error gradient analysis for sinusoidal signal and 100 initial bootstrapped ensembles.

Figure 11. The error gradient analysis for sawtooth signal and 100 initial bootstrapped ensembles.

Figure 12. The error gradient analysis for composite signal and 100 initial bootstrapped ensembles.

Figure 13. The error gradient analysis for sinusoidal signal and 1000 initial bootstrapped ensembles.

Figure 14. The error gradient analysis for sawtooth signal and 1000 initial bootstrapped ensembles.

Figure 15. The error gradient analysis for composite signal and 1000 initial bootstrapped ensembles.

Table A2. Sawtooth wave outputs of the EEF method for different initial committee members.