Enhancing PM2.5 Prediction Using NARX-Based Combined CNN and LSTM Hybrid Model

In a world where humanity's interests come first, the environment is flooded with pollutants produced by humans' urgent need for expansion. Air pollution and climate change are side effects of humans' inconsiderate intervention. Particulate matter of 2.5 µm or less in diameter (PM2.5) infiltrates the lungs and heart, causing many respiratory-system diseases. Innovation in air-pollution prediction is a must to protect the environment and its inhabitants, humans included. For that purpose, an enhanced method for predicting PM2.5 within the next hour is introduced in this research work, using a nonlinear autoregression with exogenous input (NARX) model hosting a convolutional neural network (CNN) followed by a long short-term memory (LSTM) neural network. The proposed enhancement was evaluated by several metrics, such as the index of agreement (IA) and the normalised root mean square error (NRMSE). The results indicate that the CNN-LSTM/NARX hybrid model has the lowest NRMSE and the best IA, surpassing state-of-the-art hybrid deep-learning algorithms.


Introduction
During the last century, the human population on Earth has exploded [1]. Thus, for humanity's survival and prosperity, rapid expansion in urbanisation, industry adoption, and transport-system development was inevitable. A direct consequence has been the unprecedented usage of natural resources such as fossil fuels, along with deforestation, resulting in the release of significant amounts of air pollutants into the Earth's atmosphere. Air pollution is defined as the presence of pollutants in the atmosphere that damage human health [2]. The damage inflicted by air pollution is not limited to humans; it extends to all living creatures and the environment.
Moreover, recent research studies show that climate change is directly connected to air pollution [3]. The Earth's atmosphere is afflicted by many contaminants produced by a plethora of anthropogenic sources, such as intense usage and dependence on transportation systems, house-heating systems, and energy generation via fossil fuel combustion, to name a few. This pollution negatively impacts human health, amplifies mortality rates in humans and other species living on Earth, and results in substantial global climate change [4].
The six most significant air pollutants (criteria air pollutants), as defined by the United States Environmental Protection Agency (US-EPA), are suspended particulate matter (PM), nitrogen dioxide (NO2), ground-level ozone (O3), carbon monoxide (CO), sulphur dioxide (SO2), and lead (Pb) [5]. In this research, we focus on predicting suspended particulate matter, the fine particles found suspended in the atmosphere. The reported sources of PM are dust, forest fires, and man-made sources such as manufacturing processes and vehicle emissions, amongst others. There are two main types of PM, based on the approximate size of the particle, as defined by the World Health Organization (WHO): PM10, comprising particles with a diameter of 10 µm or less, and PM2.5, comprising particles with a diameter of 2.5 µm or less.

The main contributions of this research are as follows:
• Evaluating multiple configurations of NARX using CNN-LSTM, LSTM, Extra Trees, and XGBRF.
• Comparing our work to both APNet [19] and NARX LSTM (d8, o1) [20] in terms of IA; the CNN-LSTM/NARX hybrid model produces better results than both.
• Executing our experiments on cities in two separate locations on distant continents (Beijing, China; Manchester, UK) and proving that our hybrid model works well regardless of the location.
The rest of this paper is organised as follows. Section 2, "Related Work", reviews the most recent and relevant published articles on the highlighted topic. Section 3, "Prediction Algorithms", gives the essential background on the algorithms used in our paper. Section 4, "Proposed Algorithm", presents a detailed description of our proposal. Section 5, "Performance Evaluation", introduces the evaluation metrics, describes the datasets used, and presents and discusses the obtained results in detail. Section 6, "Conclusions", summarises the outcomes and remarks from this research.

Related Work
As a result of the popularity and effectiveness of machine-learning and deep-learning methods, many studies use deep learning to predict PM2.5 or PM10. Here, the focus is on presenting recent studies that combine CNN with LSTM to predict air pollutants, showing their advantages and disadvantages. In addition, recent NARX studies are discussed to contrast their work with ours.
In 2018, Huang et al. [19] proposed APNet, a hybrid algorithm to predict PM2.5 by combining LSTM and CNN. They used 24 h of PM2.5, cumulated rain, and cumulated wind speed to forecast PM2.5 for the next hour, using the dataset in [21]. Their approach surpassed the exclusive use of CNN or LSTM and other baseline machine-learning algorithms. Various metrics were used for evaluation, including the Pearson correlation coefficient, root mean square error (RMSE), mean absolute error (MAE), and index of agreement (IA). Although they proved the feasibility of their solution, the algorithm's predictions did not follow the trend of PM2.5 pollution accurately, due mainly to the instability of PM2.5 pollution sources.
Qin et al. [22], in 2019, proposed a combined CNN-LSTM scheme to predict PM2.5 for the next 3 h using the past 24-72 h. They used CNN to extract spatial features from multiple monitoring stations in one city (Shanghai). The resultant feature map was then fed to LSTM for timeseries prediction. Finally, an elastic net fine-tuned the results with the help of stochastic gradient descent to regularise constraints, fix network weights, and solve the over-fitting issue. Their model was evaluated using RMSE and the correlation coefficient, with back propagation (BP), recurrent neural networks (RNN), CNN, and LSTM as baselines for comparison. Their model can process input from many sites in one city; however, they did not verify that it works in other cities.
Another study was carried out in 2020 to predict PM10 in various locations in Turkey [23]. Data were collected in Istanbul between 2014 and 2018 to predict PM10 using 4-, 12-, and 24-h window sizes before the target hour. The study used many parameters to compare and optimise the work, including multiple window sizes, optimisers, loss functions, and batch sizes. Mean absolute error (MAE) and root mean square error (RMSE) measured the performance. They combined data from meteorological and traffic sources and air-pollution stations to assess the effectiveness of adding external sources for better air-quality prediction. Their proposal used a flexible dropout layer whose dropout rate depends on the window size. However, they used all the available data and features, which would incur a high computation cost and a long execution time.
A multivariate CNN-LSTM model was introduced by [24] to forecast the next 24 h of PM 2.5 concentration in Beijing, using data from the past week. CNN extracted air pollution features, whereas LSTM performed the timeseries forecast of the historical input.
Univariate and multivariate editions of CNN-LSTM were examined versus the exclusive use of LSTM. RMSE and MAE were the evaluation metrics. Nevertheless, more metrics, such as IA or R2, could have been used to confirm the correlation of the predicted values with the actual values.
To select a subset of the history of pollutants and related atmospheric conditions, NARX was employed by [20]. They used a NARX neural network to apply LSTM and other algorithms for PM2.5 prediction in the next hour. They used multiple configurations and delays of the external inputs and 24 h of past PM2.5 to show the effect of using a subset of the data for prediction. The results show that using a subset gives better results and less training time. For evaluation, K-Fold was used, splitting the data into ten parts, then using one part for testing and the others as training data in a rotating fashion. This method is not optimal for timeseries problems, as it involves training the model with data that occurred after the test segment. It could cause a data leak [25], since the model is trained with data from the future and then tested on past data.
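For illustration, a minimal sketch of a leak-free alternative follows, using scikit-learn's TimeSeriesSplit (one possible realisation of the timeseries-adapted K-Fold applied later in Section 5); the toy data are hypothetical:

```python
# Sketch: leak-free splitting for timeseries data. Unlike plain KFold,
# TimeSeriesSplit guarantees every training fold precedes its test fold,
# so the model never trains on records that occur after the test segment.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # toy hourly records, oldest first
y = np.arange(100)

for train_idx, test_idx in TimeSeriesSplit(n_splits=10).split(X, y):
    assert train_idx.max() < test_idx.min()   # training always precedes testing
```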
This study focuses on leveraging the merits of CNN as a feature-mapping algorithm and the timeseries prediction capabilities of LSTM guided by the NARX neural network selection power to enhance the prediction accuracy of PM 2.5 . Our approach not only uses less training data by selecting certain past timesteps via NARX, but it also uncovers hidden patterns in data by using dilation in the CNN process. When compared to state-of-the-art methods that used the same dataset, and even another dataset of a city in a distant continent, our approach improved prediction results as measured by multiple metrics.
A summary of related work is presented in Table 1.


Prediction Algorithms
To show the effect of using CNN-LSTM with NARX, a brief introduction to each of the components used and to the baseline models is presented.

Nonlinear Autoregression with Exogenous Input (NARX)
NARX is primarily utilised in timeseries analysis. It represents the nonlinear form of the autoregressive prediction model with external (exogenous) input. The autoregressive part of the model predicts output in a linear fashion based on earlier values. As a result, NARX connects the present value of a timeseries to preceding values of the series and to the current and former values of the driving (external) series. Mapping input data to output can be done via a function. Frequently, that mapping is nonlinear, and any mapping function can be used, such as machine-learning techniques, Gaussian processes, neural networks, a mix of the preceding, or any other mapping function. NARX's general concept is depicted in Figure 1 [26]. The model operates by selecting input features amongst consecutive timesteps t and grouping former timesteps of the external input to be of length q each. Every input feature can be independently deferred by d timesteps. This model suggests choosing how many timesteps to include for each feature by order q and delaying them by d steps. Figure 1 illustrates that concept by incorporating one input feature x_1 using only q_1 timesteps (exogenous order) delayed by d_1. A shadow of another input in Figure 1 clarifies the idea of delay and order for multiple inputs. Likewise, the target data are stacked to represent autoregression of p timesteps (auto order). A Python library, fireTS, has been published [27] to enable any scikit-learn-compatible [28] regressor to serve as the nonlinear mapping function for NARX. Generally, NARX is computed as in [29]:

$$\hat{y}(t+1) = f\big(y(t), \ldots, y(t-p+1),\; x_1(t-d_1), \ldots, x_1(t-d_1-q_1+1),\; \ldots,\; x_m(t-d_m), \ldots, x_m(t-d_m-q_m+1)\big) + e(t)$$

where ŷ is the forecasted value; f(·) represents any nonlinear mapping function; y is the target output at any timestep t; p is the length of target timesteps (autoregression order), specifying how many timesteps of the target to use in the prediction process; x_1, . . . , x_m are the m external input features; q_1, . . . , q_m are the orders associated with each of the exogenous inputs, controlling how many timesteps are captured for each input feature; d_1, . . . , d_m are the delays introduced to each of the m input features; and e(t) is an error term, normally set to a random value but, in our case, set to zero.
It is also worth mentioning that NARX prepares the input for the internal mapping function in a timeseries format. However, sometimes the internal function has requirements to be met before processing the data. For example, LSTM requires input in a 3-D format, and CNN requires data in a 4-D format.
NARX can also predict more steps in the future by using the predicted step and reinserting it into the mapping function to get the next predicted step. NARX has been used by researchers in air-quality prediction [20], evaluating visibility range on air pollution [30], glucose level prediction [27], and data calibration [31]. The main advantage of NARX is that any nonlinear regression function can be used to perform regression on timeseries problems and that there is flexibility in choosing how much history to use. Also, compared to other recurrent networks, NARX converges quicker and takes fewer training cycles [32].
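As a rough illustration of the interface, a minimal fireTS sketch follows; the synthetic data and the estimator choice are assumptions for demonstration only:

```python
# Sketch: NARX via the fireTS library [27], with a scikit-learn regressor
# as the nonlinear mapping function f(.). The orders/delays mirror one
# configuration evaluated later in the paper (p = 24, q_i = 1, d_i = 0).
import numpy as np
from fireTS.models import NARX
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))              # two toy exogenous inputs
y = np.cumsum(rng.normal(size=500)) * 0.1  # toy autocorrelated target

mdl = NARX(ExtraTreesRegressor(n_estimators=50),
           auto_order=24,        # p: 24 h of past target values
           exog_order=[1, 1],    # q_i: timesteps kept per exogenous input
           exog_delay=[0, 0])    # d_i: delay per exogenous input
mdl.fit(X, y)
y_hat = mdl.predict(X, y, step=1)          # one-step-ahead forecast
```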

1-D Convolution Neural Network (1-D CNN)
A convolutional neural network uses a convolution operation through a filter to extract patterns or features from input data. CNN is well known in the image-analysis domain. Nevertheless, CNN has multiple network structures, including 1D CNN, 2D CNN, and 3D CNN [33]. 1D CNN can be efficiently used in timeseries analysis [34], 2D CNN is frequently applied in text and image recognition [35], and 3D CNN is employed in video recognition and medical-image analysis [36]. Therefore, 1D CNN is implemented to further enhance this research's results. A simplified view of how 1D CNN works follows.
The left of Figure 2 represents the multidimensional input timeseries data (features + target), which is convoluted from top to bottom, as shown by the coloured arrows in Figure 2, and the coloured rectangles represent multiple filters. Each filter applies convolution that reduces dimensionality from the input to the convolutional layer. The filter uses dilation to select only coloured cells within each filter instead of all cells. This dilation effectively expands the filter size by inserting holes between adjacent elements. This way, a wider field of view is obtained at the same computational cost. CNN can be combined with LSTM [19,23,24,37,38] or support vector machine (SVM) [39]. CNN acts as a feature mapper to detect patterns inside the data.
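A small sketch below shows the field-of-view effect of dilation in a Keras Conv1D layer; the shapes and rates are illustrative assumptions:

```python
# Sketch: with kernel_size k and dilation_rate r, a 1-D convolution spans
# 1 + (k - 1) * r input steps while still computing only k products,
# widening the field of view at the same computational cost.
import tensorflow as tf

x = tf.random.normal((1, 24, 3))   # (batch, timesteps, features)
conv = tf.keras.layers.Conv1D(filters=8, kernel_size=3,
                              dilation_rate=2, padding='causal')
print(conv(x).shape)               # (1, 24, 8); each output sees 5 input steps
```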


Long Short-Term Memory (LSTM)
Timeseries studies are, in many cases, done best by the LSTM algorithm. It accepts not only the current input but also preceding outcomes. LSTM works by utilising the outcome at time (t − 1) as the input at time (t), in conjunction with the new input at time (t) [40]. Therefore, contrary to the "feedforward networks", 'memory' is accumulated inside the network. This feature is crucial to LSTM as constant information exists about the past sequence itself, not only the outputs [41]. Air contaminants fluctuate over time, and long-term exposures to PM 2.5 are associated with health risks. Throughout lengthy periods, it is evident that the most accurate upcoming air pollution predictor is the earlier air pollution [42].
LSTM is a good model for timeseries prediction because it sustains errors in a gated cell. LSTM is illustrated in Figure 3.

The following equations describe the LSTM forward training process [43]:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{1}$$
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{2}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{3}$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{4}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{5}$$
$$h_t = o_t \odot \tanh(C_t) \tag{6}$$
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \tag{7}$$
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{8}$$

where i_t, f_t, and o_t are the activation functions of the input gate, forget gate, and output gate, respectively; h_t and C_t are the activation vectors for each memory block and cell, respectively; C̃_t is the candidate cell state; and b and W are the bias vector and weight matrix, respectively. Also, tanh(·) represents the tanh function defined in Equation (7), and σ(·) is the sigmoid function, specified in Equation (8).
Since LSTM uses sigmoid and tanh functions, the input data usually must be normalised to the range 0 to 1 to obtain accurate results.
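A one-line sketch of such scaling with scikit-learn follows, on hypothetical toy values; the scaler is fit on training data only to avoid leakage:

```python
# Sketch: scale inputs to [0, 1] before LSTM training, as the sigmoid
# and tanh activations expect.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [200.0], [55.0]])   # toy PM2.5 values
X_test = np.array([[80.0], [120.0]])
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)  # fit on training only
print(scaler.transform(X_test))                 # values mapped into [0, 1]
```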

Extra Trees (ET)
Extra Trees is a machine-learning methodology that solves classification and supervised regression problems via a tree-based ensemble method. Its core idea is to build ensembles of unpruned decision trees using a top-down technique. It constructs completely randomised trees whose structure is independent of the learning sample's output values. The Extra Trees algorithm was developed to compensate for the high-variance errors resulting from using a single decision tree. All decision-tree-based methods, including boosted versions, cannot predict values outside the training-data range [44], so they cannot extrapolate.
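The no-extrapolation property is easy to demonstrate; a toy sketch (with made-up data) follows:

```python
# Sketch: tree ensembles cannot extrapolate. Trained on y = 2x for
# x in [0, 10], the model's prediction at x = 15 stays near the edge
# of the training range instead of reaching 30.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
mdl = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, 2 * X.ravel())
print(mdl.predict([[15.0]]))   # ~20.0, clipped to the training range
```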

Random Forests in XGBoost (XGBRF)
Both XGBoost and Random Forest are well-known decision tree-based algorithms. XGBoost is a boosting algorithm, while Random Forest is a bagging algorithm. As a result, their combination is known as a hybrid ensemble learning model. Random Forest replaces the decision tree as the basic estimator in the GBRF model [45]. The XGBRF regressor is an improved version of the XGBoost regressor.
The XGBRF trains Random Forest decision trees instead of the gradient-boosting decision trees employed directly by the XGBoost regressor and achieves good accuracy on various datasets. The XGBRF takes advantage of both the XGBoost and the Random Forest models to provide high stability and accuracy and avoid the overfitting problem.
Gradient-boosted models and Random Forests share the same model representation and inference techniques; however, their training procedures are distinct. XGBoost can therefore use Random Forests either as a base model for gradient boosting or train a standalone Random Forest directly; XGBRF training focuses on the latter. This technique is a scikit-learn [28] wrapper introduced in the open-source, and still experimental, XGBoost package [46], which implies that the interface may change. XGBRF has been used in many studies, such as [20,47].
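A minimal usage sketch of the wrapper follows; the hyperparameters shown are library defaults or illustrative values, not the paper's tuned settings:

```python
# Sketch: the experimental scikit-learn-style Random Forest wrapper in
# the XGBoost package [46]; per-node column subsampling and row
# subsampling give the Random Forest behaviour.
from xgboost import XGBRFRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
mdl = XGBRFRegressor(n_estimators=100, subsample=0.8, colsample_bynode=0.8)
mdl.fit(X, y)
print(mdl.predict(X[:3]))
```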

Proposed Algorithm
Our proposal wraps CNN-LSTM within the NARX architecture. As shown in Figure 4, selected input features are pre-processed: rows containing invalid data are removed, and the data are normalised as required by the guest nonlinear function, CNN-LSTM. Data are split gradually to feed the algorithm a growing amount of training data in each iteration; this preserves the timeseries relationship by testing only future data using the old training history, as described in Section 5.2.3 (Data Preprocessing before Feeding to ML Algorithms). NARX then refines the input to CNN-LSTM using the auto-order, exogenous order, and delay. CNN remaps features using convolution and dilation and feeds them to an LSTM neural network, which builds a timeseries predictor that learns the temporal relation between features and target.
Our proposed CNN-LSTM NARX architecture can be illustrated in Algorithm 1:

Algorithm 1: CNN-LSTM NARX Architecture Steps
Input: Exogenous input features (meteorological data or other air pollutants) and one auto-input feature, PM2.5.
Output: PM2.5 for the next hour.
Processing:
1. First, preprocessing is done, where data are normalised and invalid data are removed.
2. Data are divided into two sets (training/testing), where training sets always occur before testing sets.
3. For training and testing, NARX selects the specified number of PM2.5 history hours, as defined by the autoregression parameter, for CNN-LSTM, and takes the definite timesteps of the exogenous input features demanded by the exogenous order parameter, applying the specific delay determined by the exogenous delay parameter of NARX.
4. As CNN only accepts data in 4-D, reshaping is done before applying convolution with dilation.
5. Each layer in CNN (Conv 1D, Max Pool 1D, Flatten) is wrapped in a time-distributed layer applying convolution for each timestep in the data.
6. LSTM takes input from the flattened layer to perform learning; then, a tanh dense layer reduces the output, which is further reduced by a linear dense layer to the PM2.5 output (see the sketch after this list).
7. After training is done, testing data are applied to produce the predetermined predictions.
8. The overall system is evaluated using various metrics.
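A minimal Keras sketch consistent with steps 4-6 is shown below. The sub-sequence split of the 24-h window, the filter count, and the layer widths are illustrative assumptions, not the exact configuration of Figures 11 and 12, and the convolution grouping mentioned in Section 5 is omitted for brevity:

```python
# Sketch of the CNN-LSTM core: TimeDistributed CNN layers (step 5) feed
# an LSTM, a tanh dense layer, and a linear dense output (step 6).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv1D, MaxPooling1D,
                                     Flatten, LSTM, Dense)

n_seq, n_steps, n_feats = 4, 6, 8   # hypothetical 4-D shape: 24 h as 4 x 6

model = Sequential([
    TimeDistributed(Conv1D(filters=64, kernel_size=2, dilation_rate=2,
                           activation='relu'),
                    input_shape=(n_seq, n_steps, n_feats)),
    TimeDistributed(MaxPooling1D(pool_size=2)),
    TimeDistributed(Flatten()),
    LSTM(64, activation='tanh'),
    Dense(16, activation='tanh'),    # tanh dense layer reduces the output
    Dense(1, activation='linear'),   # next-hour PM2.5
])
model.compile(optimizer='adam', loss='mae')   # Adam + MAE, as in Section 5
```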

Validation Metrics
To evaluate the prediction model's performance and uncover any possible association between the forecast and actual values, the following metrics are calculated for our experiments.

Coefficient of Determination R 2
This metric estimates the correlation between actual and projected values. It is determined as in [48]:

$$R^2 = \left( \frac{\sum_{i=1}^{n} \left( A_i - \bar{A} \right)\left( P_i - \bar{P} \right)}{\sqrt{\sum_{i=1}^{n} \left( A_i - \bar{A} \right)^2} \, \sqrt{\sum_{i=1}^{n} \left( P_i - \bar{P} \right)^2}} \right)^2$$

where n is the number of data items; P_i and A_i are the projected and actual values, in that order; and P̄ and Ā denote the mean projected and actual values of the pollutant, respectively. R² is a descriptive statistical index; hence, it has no unit of measurement or dimensions, and it ranges from 0 (no correlation) to 1 (complete correlation).

Index of Agreement (IA)
A standardised measure of model forecasting error with values between 0 and 1, IA was proposed in [49]. This metric is defined by:

$$IA = 1 - \frac{\sum_{i=1}^{n} \left( P_i - A_i \right)^2}{\sum_{i=1}^{n} \left( \left| P_i - \bar{A} \right| + \left| A_i - \bar{A} \right| \right)^2}$$

where n is the record count; P_i and A_i are the projected and actual measurements, respectively; and P̄ and Ā symbolise the projected and actual mean values of the target, in turn. In this dimensionless metric, 1 represents total agreement and 0 represents no agreement. It can identify additive and proportional differences in the actual and projected means and variances, but it is overly sensitive to extreme values owing to the squared differences.

Root Mean Square Error (RMSE)
RMSE computes the mean of the squared differences between predicted and actual values and then takes the square root of the result. It is calculated as in [48]:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( P_i - A_i \right)^2}$$

where n is the sample count; A_i and P_i are the actual and predicted data, in that order. RMSE has the same measurement unit as the forecasted or real values, namely µg/m³. The lower the RMSE value, the better the performance of the prediction model.

Normalised Root Mean Square Error (NRMSE)
There are several ways to normalise RMSE; one divides RMSE by the difference between the maximum and the minimum of the actual values. NRMSE allows better comparison of models or datasets with distinct scales. It is computed via [50]:

$$NRMSE = \frac{RMSE}{A_{max} - A_{min}}$$

where A_max and A_min are the maximum and minimum actual values, respectively.
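For reference, the four metrics can be computed directly; a short sketch with toy values follows, using the standard formulations given above:

```python
# Sketch: R2 (squared correlation), IA, RMSE, and NRMSE for actual
# values A and predictions P.
import numpy as np

def metrics(A, P):
    A, P = np.asarray(A, float), np.asarray(P, float)
    rmse = np.sqrt(np.mean((P - A) ** 2))
    nrmse = rmse / (A.max() - A.min())
    r2 = np.corrcoef(A, P)[0, 1] ** 2
    ia = 1 - np.sum((P - A) ** 2) / np.sum(
        (np.abs(P - A.mean()) + np.abs(A - A.mean())) ** 2)
    return {'R2': r2, 'IA': ia, 'RMSE': rmse, 'NRMSE': nrmse}

print(metrics([3, 5, 7, 9], [2.8, 5.2, 7.1, 8.7]))
```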

Beijing, China Dataset
The first dataset utilised comprises air pollution and meteorological data for Beijing, China, between 2010 and 2014 [21], published in the machine-learning repository of the University of California, Irvine (UCI). It includes hourly data for a variety of meteorological conditions, including pressure (hPa), temperature and dew point (°C), cumulated wind speed (m/s), combined wind direction, cumulated snow (hours), and cumulated rain (hours). It also contains the PM2.5 concentration in micrograms per cubic metre (µg/m³). All rows with missing values in the PM2.5 column were removed, and columns specifying the record time were eliminated. The dataset statistics before preprocessing are presented in Table 2.

Manchester, UK Dataset
The second dataset was obtained from the UK-AIR data service [51] in the UK. It represents meteorological and air-pollutant concentrations for Piccadilly station, Manchester, UK, from 2015 to 2019. It comprises hourly averages of a variety of meteorological conditions, including modelled wind direction (M_DIR, degrees), modelled temperature (M_T, °C), and modelled wind speed (M_SPED, m/s). It also covers the hourly average concentrations of several air pollutants, including PM2.5, NO, NO2, and O3 (µg/m³). Table 3 shows the statistics of the dataset before any processing. Because PM2.5 is measured as weight per unit volume, it is always positive or zero; hence, clearing negative values is necessary. There are also missing data in PM2.5 and other features, which implies the need for imputation before running the machine-learning algorithms. Figure 5 illustrates a flowchart of the imputation process applied to every feature in the dataset. A comparison of a sample of the data before and after imputation is illustrated in Figures 6 and 7, respectively. Table 4 shows the statistics of the dataset after the imputation process: the minimum and maximum values are the same as in the original dataset, and most other statistics are very close to the original values.
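A simplified stand-in for this cleaning and imputation stage is sketched below; the exact flow is given in Figure 5, and the interpolation strategy here is an assumption, not the authors' precise procedure:

```python
# Sketch: negative concentrations are physically invalid, so they are
# treated as missing; gaps are then filled by linear interpolation.
import numpy as np
import pandas as pd

def clean_and_impute(df, cols=('PM2.5', 'NO', 'NO2', 'O3')):
    df = df.copy()
    for c in cols:
        df.loc[df[c] < 0, c] = np.nan                      # clear negative values
        df[c] = df[c].interpolate(limit_direction='both')  # fill gaps
    return df
```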

Data Preprocessing before Feeding to ML Algorithms
The datasets were turned into a timeseries-suitable form before being employed in any of the selected prediction algorithms [52]. Data from the previous 24 h were used to forecast PM2.5 for the upcoming hour. This choice was made so that our work could be compared to others that used the same number of look-back hours with the same dataset [19,20]. The transformation was accomplished by shifting recordings up by 24 places. These data were then inserted as columns after the current dataset, and the procedure was iterated recursively to produce the following structure: dataset (t−n), dataset (t−n+1), . . . , dataset (t−1), target feature (t), as in the sample shown in Figure 8. The target feature, shown in the rightmost column of the right table in Figure 8, uses past values of itself and of the other features. It can be noted that the shift operation reduces the number of records by the look-back value used (2 in this case). This form was employed in the algorithms not using NARX. The training and testing samples were split using K-Fold adapted to handle timeseries situations and avoid data leaks [25]. The sampling was done as shown in Figures 9 and 10.
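A compact pandas sketch of this shift-based restructuring for the non-NARX algorithms follows; the column name and helper are hypothetical:

```python
# Sketch: build look_back lagged copies of every column plus the
# current-hour target; rows lost to the shift are dropped, reducing
# the record count by look_back.
import pandas as pd

def to_supervised(df, target='pm2.5', look_back=24):
    frames = [df.shift(i).add_suffix(f'(t-{i})')
              for i in range(look_back, 0, -1)]
    frames.append(df[[target]].rename(columns={target: f'{target}(t)'}))
    return pd.concat(frames, axis=1).dropna()
```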

Results Analysis and Discussion
The proposal was executed on a laptop equipped with an eight-core Intel Core i9-9980HK CPU @ 2.40 GHz (hyperthreading enabled), aided by 32 GB of DDR4 RAM and a GeForce RTX 2060 GPU. The laptop was not entirely dedicated to the experiments, yet only light background work was carried out, mostly to ensure that the training times measured in Tables 5 and 6 were not affected drastically by those tasks. Python 3.8 was used in all our experiments.

Figures 11 and 12 show the layer configuration of CNN-LSTM with NARX (d0, o1) on the seventh iteration, as produced by Python, for Beijing and Manchester, respectively. The seventh iteration was chosen to represent the average case, as it has enough training data but not the whole training data as in the last iteration.

LSTM used tanh as an activation function and utilised the adaptive moment estimation (Adam) optimiser to minimise the loss function (MAE), with a batch size of 72 and 25 epochs. The same LSTM arrangement was used in [20,53]. Every other parameter in each algorithm was left at the default specified by the scikit-learn API [54]. NARX parameters were 24 for the PM2.5 auto-order and four permutations of exogenous delay (ed) and exogenous order (eo) for all features on all algorithms, specifically (0,1), (0,4), (0,24), and (8,1).
To shorten the names, and because the same delay and order are applied to all exogenous inputs, the following figures use the NARX version with the name (dx, oy), where x is the exogenous delay and y is the exogenous order. All tests were executed in parallel on all central processing unit (CPU) cores to improve speed. LSTM and CNN-LSTM were run on the GPU to speed up the training process. The subsequent figures show a 72-h sample of forecast timesteps from our tests versus real values in the seventh iteration for each dataset.
To compare the ranges of PM2.5 in Beijing and Manchester, Figure 13 plots the real values used in the comparisons shown in the following figures. Figures 14-21 compare real PM2.5 values and their predicted counterparts using NARX and non-NARX algorithms during three days of the seventh iteration of the timeseries-split K-Fold for each dataset. The data points show a slight time shift between prediction and actual data. This shift is usually caused by most algorithms being affected by the last value of the target more than by the other inputs in the training process.

Tables 5 and 6 illustrate the metric results averaged over ten timeseries K-Fold iterations for each dataset. The arrow next to each metric shows which direction gives the better result: upward means the higher, the better, and downward means the lower, the better. The number backgrounds form a heat map, where greener is better and redder is worse. The best values are bold and underlined with a single line; the worst values are double-underlined and italic. The evaluation metrics employed were R2, IA, RMSE, and NRMSE. Offline training duration (Ttr) was calculated in Python as the difference between two timestamps taken before and after training. To further shorten the NARX variation name, after each non-NARX algorithm, the NARX version is denoted by (dx, oy), where x is the delay applied to all exogenous inputs and y is the order applied to all exogenous inputs.
Average results illustrated in Tables 5 and 6 are depicted visually in Figures 22-31, respectively, to ease comparison. The worst-value bar is coloured light red, whereas the best is coloured light green. The figures have been sectioned to group each algorithm with its NARX variants.
Generally, all methods give good results, with R2 and IA above 0.92 and 0.97 for the Beijing dataset, and above 0.70 and 0.89 for the Manchester dataset, respectively. As NARX allows selecting how much exogenous input and delay to use in the training and prediction process, the results differ according to these settings. Using NARX with CNN-LSTM gives the best results in almost all variations and across all metrics, especially with low external order (o1, o4), as the overview and close view in Figures 14 and 15 illustrate. Also, in LSTM, as the general and zoomed windows in Figures 16 and 17 show, better results are obtained with lower external orders (o1, o4). This effect is probably due to the memory element used in LSTM, which gets misled if fed an extended amount of data from external inputs that it has already learnt about in previous training. In addition, in the case of CNN, the dilation used along with convolution did capture a hidden relationship between input elements, leading to an even better prediction than the mere usage of LSTM, at the cost of more processing time. Using NARX usually introduces less processing time at lower external orders, as in the case of ET and XGBRF. In the case of CNN-LSTM and LSTM, there is not much speed gain because of the low dimensionality of the data [55]. GPU usage is a must because convolution in CNN-LSTM with grouping is only supported by the GPU implementation in TensorFlow [56].
For the Manchester dataset, RMSE values are generally low because of the low mean (10.42) and standard deviation (9.73) of PM2.5 in that dataset. In addition, there are few sharp transitions from low values to high values or vice versa, as shown in Figure 15 (the change is from 20 to 1), and the shift present in much of the results keeps the differences low. On the other hand, in the Beijing dataset, transitions are much sharper, as in Figure 14 (from 153 to 21), which, along with the shift, causes higher error rates. Table 7 shows an excerpt from the Beijing dataset when the sharp transition happened, coinciding with an increase in cumulated wind speed, especially in the northern direction. As described in [21], northerly wind decreases PM2.5 substantially in all seasons. In contrast, with Extra Trees and XGBRF, the higher the external input order, the better the prediction results. This is primarily because of how these systems work: they build decision trees on the input currently present, and no memory element exists. Additionally, it can be noted from Figure 18 that using NARX has little influence on XGBRF. A possible reason is the low randomness of XGBRF and its low response to external variables. Tables 8 and 9 present the output statistics of the seventh iteration of prediction, including statistics for the training and testing sets. Due to the shifts introduced by NARX, and to align all predictions with real data, testing was cut from 3796 to 3722 rows (Beijing) and from 3984 to 3960 rows (Manchester).
As the previous tables indicate, the best algorithms (in green), CNN-LSTM NARX (d0, o1) for the Beijing dataset and CNN-LSTM NARX (d0, o4) for Manchester, give the closest output statistics to those of the testing set, except for the data towards the maximum. It can also be noted that the red numbers have gone below the lower limit of PM2.5, which is 0. This result indicates the ability of CNN-LSTM and LSTM to extrapolate, or go beyond the limits of, the training and testing data. The delay in the Beijing dataset's CNN-LSTM (d8, o1) resulted in extremities in the minimum and maximum (−15, which is less than 0; and 403, which is more than all other CNN-LSTM or LSTM variants but still less than the testing maximum). On the other hand, Extra Trees and XGBRF tend to interpolate and not go beyond the training limits. In Tables 10 and 11, most maximum values (in bold purple) were close to the testing maximum, except for the underlined cases, which are more than the testing maximum but still less than the training maximum. The minimum (marked in bold blue) tends to produce larger values than the testing minimum, but never less. Although these improvements are not of great magnitude, they would make an increasing difference as more future steps depend on the next predicted step when using the recursive method. This improvement is important because the system will probably use the predicted next hour to build on to obtain the second hour in the future, the third, and further future steps.

Conclusions
In this work, an enhancement of PM2.5 prediction accuracy was proposed and evaluated using a combination of CNN and LSTM based on NARX. The experiments used a 24-h period of PM2.5 concentration in conjunction with meteorological features for Beijing and Manchester to predict the next hour's PM2.5 concentration. Using CNN-LSTM with dilation and grouping of data produced better results than all tested methods, especially with low exogenous order and no delay. Our proposed enhancement produced better results than two state-of-the-art methods on the same dataset: APNet (7.36% error reduction) and LSTM_NARX (d8, o1) (4.78% error reduction) for the Beijing dataset. An examination of the prediction output statistics proved our enhancement to be the closest to the testing statistics and showed CNN-LSTM's extrapolation capabilities.

Future Work
This research can be further expanded to include other data related to air pollution, such as traffic data, which probably contributes even more to prediction than meteorological factors. In addition, more timesteps could be predicted in the future (24 or 72 h, for example) while studying the effects of using NARX in various future prediction methods, including direct and recursive methods. Combining ML methods with a physical model is another way to improve prediction performance. As noted, many parameters are used in this hybrid, and various methods are potential candidates to explore that domain and optimise those parameters, including but not limited to genetic algorithms and swarm-optimisation methods, amongst others.

Data Availability Statement: Publicly available datasets were analysed in this study. The Beijing dataset was obtained from https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data (accessed on 2 March 2022) and the Manchester dataset was obtained from https://uk-air.defra.gov.uk/data/data_selector_service (accessed on 2 March 2022).