An Optimal Stacked Ensemble Deep Learning Model for Predicting Time-Series Data Using a Genetic Algorithm—An Application for Aerosol Particle Number Concentrations

Time-series prediction is an important area that inspires numerous research disciplines for various applications, including air quality databases. Developing a robust and accurate model for time-series data becomes a challenging task, because it involves training different models and optimization. In this paper, we proposed and tested three machine learning techniques—recurrent neural networks (RNN), heuristic algorithm and ensemble learning—to develop a predictive model for estimating atmospheric particle number concentrations in the form of a time-series database. Here, the RNN included three variants—Long-Short Term Memory, Gated Recurrent Network, and Bi-directional Recurrent Neural Network—with various configurations. A Genetic Algorithm (GA) was then used to find the optimal time-lag in order to enhance the model’s performance. The optimized models were used to construct a stacked ensemble model as well as to perform the final prediction. The results demonstrated that the time-lag value can be optimized by using the heuristic algorithm; consequently, this improved the model prediction accuracy. Further improvement can be achieved by using ensemble learning that combines several models for better performance and more accurate predictions.


Introduction
Time-series data are a set of observations that are measured and collected sequentially through time. The data can be in the form of discrete or continuous values, and they may have an internal structure, such as auto-correlation, trend or seasonal variation. Time-series analysis deals with time-series data to extract and analyze meaningful statistics and other characteristics of the data. Time-series modelling uses past observations of time-series data to develop an appropriate model, in order to make a prediction or forecast. Due to the indispensable importance of time-series modelling, they have been utilized in numerous practical fields, such as financial analysis, medical sciences, energy consumption, engineering, tourism, and environment [1][2][3][4][5][6].
Air quality measurements are an appropriate example of such time-series data [7]. The application of time-series analysis and modelling in this field is beneficial to investigate the links between air pollution and health effects [8,9], as well as climate change [10,11]. The methods have also been used to estimate air pollutants to substitute unavailable measurement instruments, fill missing data, and for forecasting [12][13][14]. A variety of approaches have been utilized in performing those tasks; for example, expert-based and data-driven approaches [15][16][17]. In particular, data-driven approaches have gained popular attention in air pollution research, because they do not typically require deep knowledge in air pollutant dynamics, chemistry composition, and other explanatory variables [18].
Artificial Neural Networks (ANN) have become the most popular method among data-driven approaches for estimating and forecasting air pollution. ANN is also known as a general and reliable function approximator. The availability of powerful and less-complicated computing tools is the main reason behind the popularity of ANNs [19]. One of the ANN methods is the Recurrent Neural Network (RNN), which is a representative method of deep learning and is mainly used in time-series modelling and forecasting. RNN has time-series and non-linear prediction capabilities, because it has a feedback connection that allows the past information to pass and persist. RNN is a powerful tool used in time-series problems.
In general, developing a time-series model that is highly accurate and reliable is a challenging task. Firstly, there are many parameters that should be determined beforehand to setup the RNN. These include the number of layers, number of neurons, and the time-lag value. The selection of the optimal value for each parameter may require the running of several experiments to select the best combination of these tuning parameters and optimize time-series models. In fact, due to computational limitations, it becomes impossible to attempt all possible combinations of parameters to find the optimal set of parameters. Secondly, time-series data often have trends and patterns, and this has to be handled appropriately in order to choose the adequate length of the time-series data to make an accurate prediction. Finally, all neural network models are categorized as black-box models [20], where the relationship between variables is not transparent. Most time-series problems involve on-stationery relationships. These relationships between variables may change in time with irregular patterns. In this case, the training may result in a limited representation of the overall phenomena. The results may be good for the training and testing databases, but might fail for new datasets of the time-series. Consequently, the generalization of the model may not be achieved.
In order to use neural network techniques in air pollution prediction and avoid the previous listed challenges, in this paper, we proposed the use of three Machine Learning (ML) methods (Appendix A)-Ensemble Learning, RNN, and Heuristic Search Algorithm. The aim is to propose a deep learning model that performs better than previous approaches in modelling the non-linear atmospheric particle number concentrations in Amman, Jordan, while targeting high accuracy prediction. The main contributions of this paper are summarized as follows: • Improving the forecasting accuracy for atmospheric time-series data based on the ensemble technique of different RNN methods, where each method makes the prediction with different time-lag value. • Automatic identification of time-lag for the RNN using a heuristic algorithm, which searches for the optimal (or near optimal) solution. • A parallel implementation of the proposed model was applied in order to enhance its performance in terms of computation time without losing the accuracy or reliability.

Database, Handling, and Preprocessing
The aerosol database used in this study was adopted from the measurement (1 August 2016-31 July 2017) carried out at the Aerosol Laboratory, which was located on the third floor of the Department of Physics, University of Jordan [21]. Amman is considered an example of Middle Eastern urban Computers 2020, 9, 89 3 of 26 conditions. The Middle East region attracts the attention of atmospheric modelling researchers, as it serves as a compilation of aerosol particle sources including natural dust and new particle formation. This database consists of particle number concentrations (PN) and local weather conditions (Temperature (T), Absolute Pressure (P), Relative Humidity (RH), Wind Speed (WS) and Wind Direction (WD)). The details of the aerosol measurements campaign and the metrological measurements lie beyond the scope of this paper, and detailed descriptions can be found in previous studies [21,22].
The database was processed into two databases (daily and hourly averages). The daily database contained 366 samples, and the hourly database contained 8784 records. Typically, a time-series database requires some preparation prior to being modelled with ML methods. The pre-processing sub step involves three main phases: Checking, Scaling and Transformation.

•
Checking: The database contained some missing data due to technical and instrumental issues. These missing values were replaced using the imputation method through interpolation. Each database became consistent with the expected number of records. • Scaling: The processed database was normalized because it is preferred to train the model with data values which fall within the same range as the activation function used in the model training. So, the scaling factor is dependent on the activation function. In this case, the data were scaled between 0 and 1 to transform it within the range of the neural network activation function. • Transformation: The data were transformed from a time-series form to a supervised form.
The supervised data contained a sample with input and output components. The input component was a number of prior observations (defined by time-lag value). The output component contained the observation to be forecasted. In another words, the input was the observations from previous time steps, and the output was the observation of the current time step. The observation of the current time step became the input for the next time step, and so on.

Model Setup
The proposed model in this study was comprised of three ML methods: RNN methods (Gated Recurrent Networks (GRU), Long-Short Term Memory (LSTM), and Bi-Directional Recurrent Neural Networks (BRNN)), heuristic algorithm, and ensemble learning technique (stacking). The overall architecture of the proposed model is shown in Figure 1. The reason for choosing these methods can be summarized as follows:

1.
Because it is a time-series forecasting problem, the prediction process can be accomplished using time-series methods such as RNN, which is able to capture temporal dependencies between data to produce more accurate results. Other methods can be used, such as Artificial Neural Network (ANN) and Bayesian Optimization methods, but these methods may cause overfitting as the time-lag feature is not taken into account. In the time-series data, the proper selection of time-lag value can reduce dimensionality of the non-linear data and avoid overfitting.

2.
The proper selection of time-lag can improve model performance and generate a more accurate result. The time-series data can be viewed as a set of equal chunks where the records of each chunk are correlated and consistent. Finding the number of records at each chunk represents time-lag value, which is a difficult problem to solve and requires the running of several experiments to search for the best value that may generate the best prediction accuracy. Repeating this process has a high computational cost and consumes the resources. Using a heuristic algorithm to solve search problems is highly preferable by researchers as it can find the optimal solution (or near optimal) in a reasonable time with less computational cost. GA is used in this paper to enable us to find the optimal time-lag and prevent overfitting, increase model accuracy and generate the prediction with less computational cost compared to other heuristic algorithms.

3.
More accurate results can be achieved through the use of ensemble technique, which combines the result of each single model to improve accuracy. The prediction result of ensemble method is is usually better than the prediction of a single model. Heterogeneous ensemble learning is used in this paper, as different base models are used with different hyperparameter tuning. The proposed model consisted of four main phases: (1) database pre-processing, (2) base model configuration and tuning, (3) GA operation, and (4) ensemble model setting. In the pre-processing phase, the database was first checked, scaled, and transformed. Then it was split into two main parts: training set and testing set. The percentage of the splitting process was 80% training and 20% testing. The training database was used to build a model, where the testing database was used for validation. In the second phase, the 10-fold cross validation method was applied on the training database for each RNN, LSTM, GRU, and BRNN along with the pretraining method. The time-lag value was set to 1, and the best combination of each method hyperparameters was selected. Pretraining involved adding a new hidden layer to the existing model and refitting. The new model would learn the input from the existing hidden layer while keeping the weights of it unchanged. This technique is called layer-wise which allow the adding of new layer at a time. As the solution was aggregated from the local optimal solution, it was called greedy. The optimal parameters of each network were considered where the configurations differ from hourly and daily databases. In the third phase, the GA was applied on each network to find the optimal time-lag after tuning each model (LSTM, GRU and BRNN) to achieve better performance of the model.
The time-series forecasting problem used lag to make a prediction based on the past observations of that lag. The choice of the appropriate time-lag value was an important step to ensure the generation of high-performance predictions in terms of accuracy. Time-lag could capture dependencies between successive time observations in the model. Because there is no general rule that specified how the lag variable can be selected [23], different approaches have been used by researchers in different application domains. The selection of the value that guarantees the optimal solution with high precision and generation is a complex problem that needs to run and repeat many experiments in order to search for the best solution. On the other hand, running these methods requires more time to find the optimal time-lag value that generates the best prediction, which exhausts resources and requires more time, so using a non-deterministic polynomial (NP) is difficult.
In order to solve the above-mentioned issue, a heuristic algorithm was used. The GA is an evolutionary algorithm that has been used by many researchers in the computer science field to solve complex problems in a variety of domains where the solutions of these problems cannot be found in The proposed model consisted of four main phases: (1) database pre-processing, (2) base model configuration and tuning, (3) GA operation, and (4) ensemble model setting. In the pre-processing phase, the database was first checked, scaled, and transformed. Then it was split into two main parts: training set and testing set. The percentage of the splitting process was 80% training and 20% testing. The training database was used to build a model, where the testing database was used for validation. In the second phase, the 10-fold cross validation method was applied on the training database for each RNN, LSTM, GRU, and BRNN along with the pretraining method. The time-lag value was set to 1, and the best combination of each method hyperparameters was selected. Pretraining involved adding a new hidden layer to the existing model and refitting. The new model would learn the input from the existing hidden layer while keeping the weights of it unchanged. This technique is called layer-wise which allow the adding of new layer at a time. As the solution was aggregated from the local optimal solution, it was called greedy. The optimal parameters of each network were considered where the configurations differ from hourly and daily databases. In the third phase, the GA was applied on each network to find the optimal time-lag after tuning each model (LSTM, GRU and BRNN) to achieve better performance of the model.
The time-series forecasting problem used lag to make a prediction based on the past observations of that lag. The choice of the appropriate time-lag value was an important step to ensure the generation of high-performance predictions in terms of accuracy. Time-lag could capture dependencies between successive time observations in the model. Because there is no general rule that specified how the lag variable can be selected [23], different approaches have been used by researchers in different application domains. The selection of the value that guarantees the optimal solution with high precision and generation is a complex problem that needs to run and repeat many experiments in order to search for the best solution. On the other hand, running these methods requires more time to find the optimal time-lag value that generates the best prediction, which exhausts resources and requires more time, so using a non-deterministic polynomial (NP) is difficult.
Computers 2020, 9, 89 5 of 26 In order to solve the above-mentioned issue, a heuristic algorithm was used. The GA is an evolutionary algorithm that has been used by many researchers in the computer science field to solve complex problems in a variety of domains where the solutions of these problems cannot be found in a reasonable timeframe. Because it is easy to implement, is a global heuristic search algorithm, and can generate more accurate results, GA has become one of the most popular algorithms [24]. Previous studies have shown that GA can produce solutions that are near optimal solution, and it has been applied previously to optimize neural network configurations such as network weights and the architecture [25,26]. Comparing the performance of GA with other heuristic algorithms on solving different kinds of complex optimization problems, several studies show that GA performs better in obtaining the optimal solution, and is faster than other methods such as Differential Evolution (DE), Particle Swarm Optimizations (PSO), Ant Colony Optimization, and others [27][28][29][30]. GA has been applied widely in many applications from ML, Mathematics, Business, Robotica, etc., because they have the following advantages in solving complex problems in different domains [31][32][33][34][35]: 1.
Non-deterministic algorithm and can be used for a variety of problems; 3.
Works with the string-coded of the variables rather than the variable itself, so the coding discretizes the search space even if the function might be continuous; 4.
An optimal solution could be generated by GA as it works with more than one solution (population).
In this paper, GA was applied in each RNN model to search for the optimal time-lag value. The training database was split into two sets, training and validation set. The GA was applied on the training set, while the validation set used to estimate the model performance on the time-lag value.
The population size was set to be 50 for both hourly and daily databases. Each solution represented the number of the lag value. The initial population was generated randomly. Next, the solutions were evaluated according to the fitness function. Each database was prepared according to the chosen time-lag value, and the network was trained (LSTM, GRU or BRNN). The results were then generated. MAE was calculated and returned as a fitness score to each current solution. The three GA operations were performed (selection, crossover and mutation), and the process was repeated for a number of iterations. At the end, the solution with the best fitness value score was selected as the best solution. The details of applying GA on each RNN method to find the optimal time-lag value are as follows: • Initial population: A binary vector randomly initialized the use of the uniform distribution defining initial solution. The population size was set to have 50 possible solutions. • Selection: Select the parents from population with the best fitness value. • Crossover: Exchange the variables between selected parents to generate new offspring. One-point crossover was used for that.

•
Mutation: A binary bit flip with probability of 0.1 was applied to the solution pool by randomly swapping bits to achieve diversity on the solutions. • Fitness function: MAE was used to evaluate solution.
The final phase of the proposed model is building the ensemble model, where the output from RNN methods with its different selected time-lag was run in parallel to generate outputs that were then combined with each other to produce the final PN concentration estimation. In this paper, the stacking ensemble model was used. It involved the combination of the predictions from multiple and different models on the same database. The stacking ensemble consisted of two main levels: • Level 0 (base models): Used three models (LSTM, GRU and BRNN) with selected time-lag value proposed by GA. The testing database was divided into 5 folds that were disjointed and of equal size. One fold remained aside, while others were used to train each model. The prediction was then made using holdout fold. More robust results are generated for each model.

•
Level 1 (meta model): Took the output from level 0 as an input to a single model called a meta-learner to produce the prediction output. A new training database was constructed Computers 2020, 9, 89 6 of 26 from the predictions of sub-models where the first column contained predictions generated by LSTM. The second column contained predictions generated by GRU; the third column contained predictions generated by BRNN; and the last column contained the actual expected output. This way, a new stacked database was constructed. A ML algorithm is then used to learn how to best combine the prediction of sub-models. It is the meta-learner that uses a new stacked dataset for training.
To evaluate the effectiveness of the proposed model, three metrics were used: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R 2 ). These metrics can be formulated as follows: where n is the number of observations, y is the vector of the observed values andŷ is the vector of the predicted values. RMSE gives the square root of Mean Square Error (MSE) that measures the average of the square of the error. MAE Measures the average absolute error of the paired predicted values from the original ones. It gives the magnitude of the overall error. R 2 gives an indication of how strong the relationship is between the dependent and independent variables of the model.

Computation Setup and Platform
The work was implemented using Intel (R) Core™ i7-8550U CPU @ 1.80 GHz 1.99 GHz, 8.00 GB of RAM (Intel Corporation, Santa Clara, CA, USA) and Windows 10 version 1909 (Microsoft, Redmond, WA, USA). The application program was written in Python language and executed on JetBrains PyCharm Community Edition 2019.2 ×64. Python provides a set of libraries that can be used to create deep learning model directly, such as Scikit-Learn and TensorFlow. Scikit-Learn library provides a set of supervised and unsupervised learning algorithms that can be used and experimented directly. Table 1 summarizes the toolboxes and packages used.

RNN Models-Evaluation and Performance
Applying 10-fold cross validation into each RNN method (LSTM, GRU and BRNN) results in tuning the network with the optimal set of parameters that generates best prediction. The values of each network parameter found by cross validation search optimization for both hourly and daily databases are given in Table 2. The generated solution from GA was used to train each RNN model and then was validated. The solution that gives the best fitness score was selected as the best time-lag value and was used in the network configuration (Figures 2-4).

RNN Models-Evaluation and Performance
Applying 10-fold cross validation into each RNN method (LSTM, GRU and BRNN) results in tuning the network with the optimal set of parameters that generates best prediction. The values of each network parameter found by cross validation search optimization for both hourly and daily databases are given in Table 2. The generated solution from GA was used to train each RNN model and then was validated. The solution that gives the best fitness score was selected as the best timelag value and was used in the network configuration (Figures 2-4).

RNN Models-Evaluation and Performance
Applying 10-fold cross validation into each RNN method (LSTM, GRU and BRNN) results in tuning the network with the optimal set of parameters that generates best prediction. The values of each network parameter found by cross validation search optimization for both hourly and daily databases are given in Table 2. The generated solution from GA was used to train each RNN model and then was validated. The solution that gives the best fitness score was selected as the best timelag value and was used in the network configuration (Figures 2-4).     Analyzing the results generated from the application of GA on LSTM, GRU and BRNN shows that a different lag value was selected for each method and each database. The performance of each algorithm differs regarding to the running time and accuracy. All methods performed better as the number of the time-lag value was increased. For a small time-lag value, the BRNN performed better and generated a smaller MAE value when compared to LSTM and GRU. The BRNN method also performed better than LSTM and GRU in terms of accuracy (i.e., MAE value) when the running time of this algorithm was extended. The backward and forward run of BRNN enhanced prediction performance, but increased the running time. LSTM and GRU were comparable to one another's, regardless of the time-lag value suggested by the GA to be the optimal value. Table 3 lists the best time-lag value selected by GA for each RNN method along with the running time (in minutes) needed to generate this result. The configuration of each RNN model chosen from the cross-validation process along with an appropriate time-lag value selected by GA was used to construct the stacked ensemble model and generate the final prediction. The three methods were run in parallel using multi-cores in order to enhance running time performance. The testing database was used to train each base model, the outof-sample data generates the predictions. The results of level 0 from the stacked ensemble model are given in Table 4 and shown in Figure 5.
An enhancement in the model performance was achieved in terms of speed by the parallel implementation. It took 25.05 min to finish the training and make a prediction for the three methods. The stacking ensemble technique was preferred when multiple models with different skills were used to train the same database, and each model may behave in a different way.
For both databases (i.e., hourly and daily), the LSTM outperformed GRU and BRNN in terms of RMSE and MAE metrics ( Figure 5). The running time of BRNN was more when compared to LSTM and GRU, due to the forward and backward architecture of BRNN. The GRU method, which had a similar architecture to the LSTM, was less effective than LSTM. This was due to the architecture of the GRU, which had fewer gates than the LSTM. Capturing the time dependency between series observations was not as accurate as with the LSTM method; this result was concluded from the R 2 value of GRU, which was the lowest on the hourly database.  Analyzing the results generated from the application of GA on LSTM, GRU and BRNN shows that a different lag value was selected for each method and each database. The performance of each algorithm differs regarding to the running time and accuracy. All methods performed better as the number of the time-lag value was increased. For a small time-lag value, the BRNN performed better and generated a smaller MAE value when compared to LSTM and GRU. The BRNN method also performed better than LSTM and GRU in terms of accuracy (i.e., MAE value) when the running time of this algorithm was extended. The backward and forward run of BRNN enhanced prediction performance, but increased the running time. LSTM and GRU were comparable to one another's, regardless of the time-lag value suggested by the GA to be the optimal value. Table 3 lists the best time-lag value selected by GA for each RNN method along with the running time (in minutes) needed to generate this result. The configuration of each RNN model chosen from the cross-validation process along with an appropriate time-lag value selected by GA was used to construct the stacked ensemble model and generate the final prediction. The three methods were run in parallel using multi-cores in order to enhance running time performance. The testing database was used to train each base model, the out-of-sample data generates the predictions. The results of level 0 from the stacked ensemble model are given in Table 4 and shown in Figure 5.
An enhancement in the model performance was achieved in terms of speed by the parallel implementation. It took 25.05 min to finish the training and make a prediction for the three methods. The stacking ensemble technique was preferred when multiple models with different skills were used to train the same database, and each model may behave in a different way.
For both databases (i.e., hourly and daily), the LSTM outperformed GRU and BRNN in terms of RMSE and MAE metrics ( Figure 5). The running time of BRNN was more when compared to LSTM and GRU, due to the forward and backward architecture of BRNN. The GRU method, which had a similar architecture to the LSTM, was less effective than LSTM. This was due to the architecture of the GRU, which had fewer gates than the LSTM. Capturing the time dependency between series observations was not as accurate as with the LSTM method; this result was concluded from the R 2 value of GRU, which was the lowest on the hourly database.   The output of each method was used to construct a stacked database. The new database was used to train the ANN model (meta model) and generate the final prediction results. In order to show the enhancement achieved by the ensemble model in terms of prediction accuracy, Table 5 shows the accuracy generated by each base model (LSTM, GRU and BRNN) and the ensemble model for the hourly and daily databases. The ML methods used in this paper were supervising learning methods. These kinds of methods can be clearly analyzed using bias and variance trade-off. As the goal of any supervised learning method is to best estimate the mapping function (f) for the output variables (y) given an input variable (x). The prediction is then generated and the error is estimated. The error can be broken into two parts: bias and variance. The bias error is the prediction error. It reflects the distance between the actual and predicted values, and can be estimated by finding MSA, RMSE, etc. The variance error reflects the amount of change in the prediction accuracy if different training data are used.
The ML algorithm with high variance was highly influenced by the specific training data. This means that these data influenced the choice of parameter types and numbers used in the network training to characterize the mapping function. A good ML algorithm has a low bias and low variance. Through this, it generalized the error and achieved a good prediction performance. The parameterization of the ML algorithm to reduce bias and variance was often difficult, especially when working with non-linear algorithms such as neural networks.
In this paper, the results of modelling time-series data using a heterogenous ensemble learning model that is composed of three different variants of RNN method-LSTM, GRU and BRNN-can be summarized as follows: 1.
The performance of any neural network depends on the parameters tuning. The good selection of each parameter (such as the number of neurons, the number of layers, the number of time-lag, etc.) can either increase or decrease accuracy. Finding the optimal selection of network parameters is an NP problem that requires repeating the experiment by changing one parameter while keeping the other fixed. Determining the set of parameters that influence the network performance depends on the problem domain and the type of data. In this paper, time-series data are used where the RNN method is well-suited to be applied. The number of lags is a very effective parameter that can increase or decrease the model accuracy. The use of a heuristic algorithm in the proposed work searched for the best value of the time-lag to be chosen for each network architecture. This selection guarantees the use of the best value, that produces a robust model and enhances accuracy.

2.
A recurrent neural network is the best method to be used to design a predictive model for time-series data. Many variants of RNN have been developed in the literature where each version has its advantages and disadvantages. The three selected RNN methods used in this paper are LSTM, GRU and BRNN. Three models are designed using each method. The results can be summarized as follows: (a) LSTM model is accurate, robust and efficient. For a large amount of data, LSTM needs more time to generate an accurate prediction.
(b) GRU is simpler than LSTM. On the other hand, it is less accurate, although the time needed to run a GRU model is less than the LSTM running time. (c) BRNN is accurate and robust, but it needs much time to produce a prediction on account of its backward and forward architecture.

3.
The number of the time-lag has a great effect on the algorithm performance for time-series data.
Increasing the time-lag value (window size) increases the running time of the model.

4.
An improvement in the prediction accuracy can be achieved by combination of the result of each method to produce a final one. This is achieved by designing stacked heterogenous ensemble learning that combines the output of each RNN method and uses a new learner to generate the final output. The ensemble learning technique improves accuracy, and the parallel implementation reduces running time. Having three base-learners with a different time-lag number and different architecture reduced error generalization and overfitting, and increased diversity, as shown in Table 5.

5.
The parallel implementation of ensemble method enhances model performance, by reducing time needed to make the prediction.

Comparison with Other and Previous Methods
The model performance and accuracy were compared with previous and other methods. Zaidan et al. [18] proposed a framework using an Artificial Neural Network (ANN) to model PN concentrations in Amman, Jordan (i.e., the same data used here). Different sets and combinations of all weather parameters were the descriptors used. According to their results, the best combination was including the five weather parameters, which were then compared with the results produced in this paper using LSTM, GRU and BRNN with time-lag value equal to 1 ( Table 6). It is evident that our new approach (i.e., LSTM, GRU and BRNN) outperforms the previous one (i.e., ANN). Besides that, ANNs require an optimum network structure with optimum configurations to avoid overfitting and achieve better accuracy; this requires running several epochs, resulting in a long training time. Therefore, it can be said that for a time-series application, it is recommended to use a time-series model based on the use of RNN to achieve better prediction and accuracy. We also compared the current approach with other ensemble approaches, which were based on Bagging and Voting representing heterogeneous and homogeneous ensemble techniques. The experiment was divided into two main phases: (1) six benchmark learning algorithms that were trained over two databases (i.e., hourly and daily), and (2) three ensemble models (bagging, voting and stacking) ( Figure 6). Computers 2020, 9, x FOR PEER REVIEW 12 of 26 In Phase 1, the performance of LSTM, GRU and BRNN was compared with other benchmark ML algorithms. The selected algorithms were Feed-forward Neural Network (FFNN), Time Delay Neural Network (TDNN), and Support Vector Regression (SVR). The results had six ML algorithms that were used as base learners. This main task accomplished through this phase was to prepare each algorithm by tuning it for training. The performance of each algorithm was investigated using a 10fold cross validation and the Greedy layer-wise pretraining technique, where the accuracy of them all was compared. The best three algorithms were used to construct the homogeneous and heterogenous ensemble method (i.e., Phase 2). Here, the time-lag value was not investigated in this experiment and was set to 1; in fact, the effect of choosing a proper time-lag value has been sufficiently investigated using GA. The prediction accuracy results for each of them were evaluated using the same evaluation metrics-RMSE, MAE and R2. In Phase 2, we constructed the ensemble learning algorithms as follows: (1) selected the top three most accurate algorithms in Phase 1; (2) developed a homogeneous ensemble learning method (i.e., Bagging); (3) developed a heterogenous ensemble learning method (i.e., Voting); and (4) the results of the homogenous and heterogenous ensemble models were compared with the current model (i.e., using stacking ensemble). The results generated in Phase 1 are presented in Table 7. The main aim of this phase was to conduct and compare the performance of each base learner algorithm (LSTM, GRU, BRNN, FFNN, TDNN and SVR) over two databases (hourly and daily) in terms of three evaluation metrics (MAE, R 2 and RMSE). Figures  7 and 8 show the prediction scatter graph for PN concentrations using six base learner algorithms over both databases. In Phase 1, the performance of LSTM, GRU and BRNN was compared with other benchmark ML algorithms. The selected algorithms were Feed-forward Neural Network (FFNN), Time Delay Neural Network (TDNN), and Support Vector Regression (SVR). The results had six ML algorithms that were used as base learners. This main task accomplished through this phase was to prepare each algorithm by tuning it for training. The performance of each algorithm was investigated using a 10-fold cross validation and the Greedy layer-wise pretraining technique, where the accuracy of them all was compared. The best three algorithms were used to construct the homogeneous and heterogenous ensemble method (i.e., Phase 2). Here, the time-lag value was not investigated in this experiment and was set to 1; in fact, the effect of choosing a proper time-lag value has been sufficiently investigated using GA. The prediction accuracy results for each of them were evaluated using the same evaluation metrics-RMSE, MAE and R2. In Phase 2, we constructed the ensemble learning algorithms as follows: (1) selected the top three most accurate algorithms in Phase 1; (2) developed a homogeneous ensemble learning method (i.e., Bagging); (3) developed a heterogenous ensemble learning method (i.e., Voting); and (4) the results of the homogenous and heterogenous ensemble models were compared with the current model (i.e., using stacking ensemble). The results generated in Phase 1 are presented in Table 7. The main aim of this phase was to conduct and compare the performance of each base learner algorithm (LSTM, GRU, BRNN, FFNN, TDNN and SVR) over two databases (hourly and daily) in terms of three evaluation metrics (MAE, R 2 and RMSE). Figures 7 and 8 show the prediction scatter graph for PN concentrations using six base learner algorithms over both databases.
According to the models performance in Phase 1, the current method (i.e., three variants of the RNN models) outperformed the other three base learners (i.e., FFNN, TDNN and SVR) for the hourly database (Table 7 and Figure 7). This result shows that LSTM, GRU and BRNN can handle the high dimensionality of the non-linear hourly database and capture correlation between independent variables and the dependent variable (input and output). LSTM, GRU and BRNN can be also used to construct voting and stacking ensemble models. FFNN, TDNN and SVR can be used to construct the bagging ensemble model. As for the daily database (Table 7 and Figure 8), the TDNN method performed the best, and SVR was the worst. Consequently, the best base learners (i.e., TDNN, FFNN and BRNN) were used to construct the heterogeneous ensemble learning model (voting and stacking).
The worst base learners (i.e., SVR, LSTM and GRU) were used to construct the homogeneous ensemble learning model (Bagging). The performance and accuracy using the homogeneous and heterogeneous runs are presented in Table 8 and Figures 9 and 10. Many studies have been carried out for time-series prediction and using ensemble models. When compared to those methods, the current model is robust (Table 9). Table 7. Comparison between the current method and other benchmark Machine Learning methods presented in Phase 1 (Figure 9). Here, RMSE is the root mean square error, MAE is the mean absolute error, and R 2 is the coefficient of determination.

Model
Hourly Daily  Table 8. Comparison between the current method and other ensemble methods presented in Phase 2 ( Figure 9).  Table 9. Analysis of ensemble model used in the literature for time-series prediction.

Related Works Ensemble Method Application Domain Results
Khairalla et al. [36] Stacking heterogeneous ensemble Energy consumption 91.24% Siwek and Osowski [37] The multilayer perceptron, Elman network, radial base function and support vector machine

Conclusions
Time-series prediction is an important area for many applications. Using state-of-the-art ML methods, such as ensemble learning and Recurrent Neural Network (RNN), can improve the prediction accuracy and enhance the results. In this paper, we proposed a new approach that consists of four main phases to predict particle number concentrations in the form of a time-series prediction. The first step was to perform data pre-processing and split into two parts: a training set and a testing set. The training database was used to build the model. The testing database was used to construct the ensemble model. The RNN base learner models were fed by using training data, and each network was configured using 10-fold cross validation. The optimum configuration of each network was considered and tested. Then, the Genetic Algorithm (GA) was applied on each network to find the most optimal time-lag. The final phase of the proposed method was to establish the ensemble model using a testing database, where the output from RNN method with its different selected time-lag was run in parallel to generate outputs that were combined with one another to produce a final PN concentration estimation.
The results demonstrated that LSTM, GRU and BRNN were suitable techniques for time-series modelling, as they were found to be robust and accurate. Among them, LSTM was the best, but requires more time to process a large amount of data. The number of the time-lag values had a great effect on the algorithm performance for time-series data. A different time-lag value was selected by the heuristic algorithm for each learning method. The selected lag value generated the best accurate result. A stacked heterogenous ensemble learning method that combines the output of each RNN method and uses a new learner to generate the final output was constructed. The ensemble learning technique improved accuracy, and the parallel implementation for base learner methods reduced running time.
Having three base-learners with a different time-lag number and different architecture reduced error generalization and overfitting, and increased diversity.
As a recommendation, we state the following points for future research: 1.
An improvement in the prediction model can be achieved by including more data with different features, such as the use of air quality image data. A combination between Convolutional Neural Network and Recurrent Neural Network can provide a more robust and high precision predictive model.

2.
Parallelize neural network models by dividing the single neural network algorithm into tasks where each task can be run in a single core. The running time of the neural network can be reduced, and the performance will be enhanced in terms of speedup.

3.
Various optimization parameters can be tested and investigated to improve accuracy such as the learning function and network structure. heuristic algorithm to guarantee the proper selection of a lag value which improved the accuracy of the model prediction. Three RNN methods were used-LSTM, GRU and BRNN. Each RNN model was optimized and tuned using a cross-validation method. The model that generated the best results was used with its setting parameters. Greedy layer-wise pretraining was used to optimize the building of deep neural network to achieve state-of-the-art performance. An heuristic algorithm was used to investigate the effect of time-lag values in each RNN model. The GA was a search optimization algorithm used to find the optimal solution within a given search space. Each base model used a different time-lag as selected by the GA. This enabled the model to see enough past values that were relevant to the future prediction for more accurate and stable forecasting.
Further improvement in accuracy was achieved by using the ensemble learning technique. This involved the use of multiple models that were trained independently, and the output of all models was combined using a specific technique to improve the accuracy of the prediction task. The ensemble network could be homogeneous or heterogeneous. In the homogeneous ensemble, all models were of the same type but with different structure. The heterogeneous ensemble contained an identical independent model structure where each model was trained with different training data. The optimization of time in terms of computation time was achieved by using parallel implementation for the proposed ensemble learning model, where the three base learners used in the ensemble model were run in parallel on a multicore platform.
In the real-world, time-series forecasting (e.g., air quality and weather) involves complex nonlinear relationships between multiple variables. Traditional forecasting approaches have limitations related to [14]: • Missing data and outliers. For example, some statistical models (e.g., Support Vector Machine (SVM)) are sensitive to missing data and the forecasting accuracy is affected. In practice, ML methods are widely utilized in computer science, statistics, and neural computing. ML have many advantages: they can be used to describe the data characteristics without prior knowledge about their distribution; they depend on the data parameters to model application behavior; they are simpler than statistical methods to adjust; and show reliable performance when applied to non-linear and complex series. ML have some disadvantages: they are black-box models, and they use historical data to learn the stochastic relationship between past and future observations [40].
Neural Network (NN) models are the most popular type of ML method. They are based on a mathematical model that is inspired by the behavior of biological neurons. They are developed to learn the mapping from inputs to outputs over a long sequence of both structured and non-structured data [41]. They can handle missing data, model complex nonlinear relationships, and support multiple inputs.
In general, there are many models of NNs, starting from the Multiple Linear Perceptron (MLP) to the advanced deep learning neural networks that become a powerful tool of ML and artificial intelligence. Recurrent Neural Network (RNN) is the most popular used method for time-series data forecasting. The RNN structure consists of one input layer, one output layer and one hidden layer with one or more feedback loops [42]. The hidden layer in RNN contains states and a memory block where the state of the hidden layer at current time is conditioned with its previous state [43]. The memory block enables RNN to store, remember and process past data for a long period of time. Long-Short Term Memory is a type of RNN that is capable of learning a long dependency in sequence prediction problems [44]. When an RNN model fails to learn the data with a limited time-lag, LSTM is designed to address this problem by using specific units in its internal structure.
The architecture of LSTM consists of one input layer, one output layer, and a series of hidden layers that are recurrently connected and known as memory blocks ( Figure A1). Each block consists of one or more self-recurrent memory cells in order to make it connected to itself, and three main units (input, output and forget gates). In each block, the three gates provide continuous read, write and reset operations; the interactive operations between three gates make LSTM sufficient to solve long-term dependencies that RNN cannot handle.
forecasting. The RNN structure consists of one input layer, one output layer and one hidden layer with one or more feedback loops [42]. The hidden layer in RNN contains states and a memory block where the state of the hidden layer at current time is conditioned with its previous state [43]. The memory block enables RNN to store, remember and process past data for a long period of time. A number of modifications to the original RNN architecture have been developed over the years. Some RNN variants include Long-Short Term Memory (LSTM), Gated Recurrent Network (GRU) and Bidirectional Recurrent Neural Network (BRNN), which are mainly used in the proposed model.

Appendix A.1. Long-Short Term Memory (LSTM)
Long-Short Term Memory is a type of RNN that is capable of learning a long dependency in sequence prediction problems [44]. When an RNN model fails to learn the data with a limited timelag, LSTM is designed to address this problem by using specific units in its internal structure.
The architecture of LSTM consists of one input layer, one output layer, and a series of hidden layers that are recurrently connected and known as memory blocks ( Figure A1). Each block consists of one or more self-recurrent memory cells in order to make it connected to itself, and three main units (input, output and forget gates). In each block, the three gates provide continuous read, write and reset operations; the interactive operations between three gates make LSTM sufficient to solve long-term dependencies that RNN cannot handle. Figure A1. A schematic diagram for the Long Short-Term Memory (LSTM) structure. Here, t is the current time step, x is the input and σ is the activation function. Figure A1. A schematic diagram for the Long Short-Term Memory (LSTM) structure. Here, t is the current time step, x is the input and σ is the activation function.
In the deep learning of neural networks, the gradient vanishing problem occurs when the previous hidden layer's learning speed is slower than that of deeper hidden layers. This decreases the accuracy of the hidden layer learning [45]. The memory cells in LSTM make it better able to learn a long sequence with longer time steps. This makes it suitable to solve problems with time serial issues.
Ensemble learning is another ML method that is used to enhance learning accuracy ( Figure A3). It is based on the use of more than one models (more than one base learner), that are trained with a given database to provide a solution for a given problem with a higher reliability and accuracy than what can be achieved using only one model [49][50][51]. given database to provide a solution for a given problem with a higher reliability and accuracy than what can be achieved using only one model [49][50][51]. Figure A3. Ensemble method framework [52].
There are three major elements in the ensemble technique: • The Training Data (i.e., problem database); • Ensemble Models (i.e., NN methods used as a base learner); • Combination Techniques (i.e., the way to produce the final prediction).
Changing any of the three ensemble elements affects the final prediction results. It is challenging to design the best combination of ensemble learning models to generate the most reliable and accurate results.
• Training Data: The base learner models used in the ensemble method can be trained with the same problem database, or each model can be trained with a subset database that is different from the subset used with other learners. The k-fold cross validation method can be used to Figure A3. Ensemble method framework [52].
There are three major elements in the ensemble technique: • The Training Data (i.e., problem database); • Ensemble Models (i.e., NN methods used as a base learner); • Combination Techniques (i.e., the way to produce the final prediction).