A Hybrid Spatiotemporal Deep Model Based on CNN and LSTM for Air Pollution Prediction

: Nowadays, air pollution is an important problem with negative impacts on human health and on the environment. The air pollution forecast can provide important information to all affected sides, and allows appropriate measures to be taken. In order to address the problems of ﬁlling in the missing values in the time series used for air pollution forecasts, the automation of the allocation of optimal subset of input variables, the dependency of the air quality at a particular location on the conditions of the surrounding environment, as well as automation of the model’s optimization, this paper proposes a deep spatiotemporal model based on a 2D convolutional neural network and a long short-term memory the model. The results obtained demonstrate the beneﬁts of including spatial information from as many surrounding stations as possible, as well as using as much historical information as possible.


Introduction
Air pollution is associated with the presence of certain substances in the atmosphere in large enough quantities as to have a negative impact on the flora and fauna, or on the environment in general. Air pollution has a strong negative impact on human health and can cause both acute reactions and various chronic diseases [1]. Pollutants can differ in their chemical composition, properties, and origin. The main air pollutants are gases, such as carbon monoxide (CO), carbon dioxide (CO 2 ), nitrogen dioxide (NO 2 ), sulfur dioxide (SO 2 ), ozone (O 3 ), etc., as well as particulate matters (PM 2.5 and PM 10 ). The main sources of air pollution related to human activity are automobiles, industrial activities, power plants, biomass combustion, and others. Particulate matter (PM) includes liquid or solid particles of various sizes and chemical compositions that are suspended in the atmosphere. Of these, PM 2.5 (particles with an aerodynamic diameter of 2.5 µm or smaller) are of particular importance when it comes to human health due to their ability to penetrate deep into the respiratory tract [2]. In addition to that, these small particles tend to remain suspended in the atmosphere for long periods of time, and can also be transported over long distances [2]. Both short-term and long-term exposure to PM 2.5 , even at low concentrations, are associated with an increased risk of disease and premature death [3]. In addition to being associated with cardiovascular [4] and respiratory diseases [5], PM 2.5 has recently been linked to neurocognitive effects and diabetes, as well as birth outcomes [3]. According to [6] in 2017, ambient and household PM 2.5 pollution was responsible for 4.58 million deaths globally. It is estimated that PM 2.5 lowers global life expectancy by about 1 year [7]. The global economic cost due to deaths from ambient PM 2.5 was estimated at USD 4.09 trillion for 2016 [8], and is expected to increase further as the population ages.
An air quality forecast can provide important information to both government institutions and individuals, and allow them to take appropriate action. Modern models for predicting air pollution can be divided into two main types-deterministic and statistical [9]. Deterministic models simulate atmospheric processes related to the emission, dispersion, transformation and elimination of pollutants, which makes these models computationally heavy. Such models are difficult to develop because they require good knowledge of the involved processes and sources of pollution, and often suffer from inaccurate representations due to approximations and incomplete knowledge. Some of the popular deterministic models are WRF-Chem [10], CMAQ [11,12] and CHIMERE [13][14][15]. Statistical models are based on determining regularities in the data itself, without the need to take into account physical and chemical factors. These models are simpler, lighter, and easier to implement, and when enough historical data are provided, statistical models show better forecast results for specific locations. Popular statistical models for predicting the air quality include artificial neural networks [16][17][18][19][20][21][22][23][24], multiple linear regression [25], autoregressive integrated moving average (ARIMA) [26], support vector machine (SVM) [27], nonlinear regression [28] and random forest [29]. Neural networks (NN) are a popular choice for air pollution forecasting, since they do not require any prior assumptions about the data distribution, and are capable of modeling complex nonlinear processes.
Unlike traditional shallow NN models, deep NN models can find patterns in large and multidimensional datasets, making them particularly suitable for modeling the air quality. Recurrent neural networks perform well in modeling temporal dependencies, but the vanishing gradient problem prevents them from modeling long-term dependencies. Long short-term memory networks (LSTM) [30] address this problem through the use of memory cells that can store and retrieve information, allowing the system to model longer time dependencies and making them popular in the field of air quality modeling [31][32][33][34][35][36]. Another popular deep NN is the convolutional neural network (CNN) [37], which is a specific type of a feedforward NN, in which some of the hidden layers perform the convolution operation. CNN is characterized by some translational invariance, as well as an ability to take advantage of local dependencies in the data. 1D CNNs have been used to predict air pollution [33,38]. Hybrid approaches such as combinations of CNN and LSTM are becoming increasingly popular in the field of air quality modeling [39][40][41][42][43]. The addition of a CNN to LSTM usually leads to better results compared to a pure LSTM models [40,41].
Some important issues need to be addressed in order to successfully develop a statistical model for predicting air pollution. For various reasons, there are missing values in the time series that need to be filled in before the data can be used to train the respective model. There are various strategies for filling in the missing values; among them, linear interpolation and the use of the nearest valid value are very popular due to their simplicity, but regression models and machine learning models can also be used. How appropriate an imputation strategy is depends on the amount of missing data and their characteristics [44]. A very important part of the development of successful statistical models is the selection of input variables [45], as the use of an inappropriate set of features can significantly impair the overall performance of the model.
However, many studies in the field of air pollution forecasting do not use any additional automated technique to find the optimal subset of input variables, instead utilizing all available ones from the data set, which usually leads to the inclusion of redundant information. In addition, the quality of the air at a particular location depends not only on the local conditions, but also to a large extent on the conditions of the surrounding environment, as pollutants released into the atmosphere are dispersed and carried away by air currents. This means that the inclusion of spatial information has the potential to improve the performance of the prediction model. However, few proposed models consider spatial relationships. Moreover, the proposed models are optimized manually, which is not a very effective means of optimization, especially in the absence of significant experience. Using an automated model optimization approach could improve and facilitate the whole procedure of developing an efficient model for air pollution forecasting.
Different machine learning techniques are a good and efficient approach to the prediction of different problems in that domain. For example, the reference evapotranspiration prediction and uncertainty analysis under climate change was carried out using multiple linear regression, multiple non-linear regression, multivariate adaptive regression splines, model tree M5, random forest and least-squares boost [46], the last of which showing better results. In addition, a novel least square support vector machine (LSSVM) model integrated with a gradient-based optimizer (GBO) algorithm is introduced for the assessment of water quality parameters, which has shown good performance among all benchmark datasets and algorithms [47].
In order to take into account the above-mentioned problems in air pollution forecasting, a hybrid CNN-LSTM spatiotemporal deep model for air pollution forecasting has been developed with the automatic selection of input variables and the optimization of model hyperparameters. In addition, a hybrid strategy is here proposed for filling in the missing values in the data.

Missing Data Imputation
The linear interpolation performs well when filling small gaps of missing values, but as the size of the gaps increases, its performance deteriorates [44]. As the research study in [44] shows, the hybridization of linear interpolation with multivariate methods provides better results when dealing with more complex missing data. The authors of [48] explored three possible options for missing data imputation-the classical approach of using the last valid observation, an approach that uses the average for a given time point in a whole month, and an approach utilizing a weighted sum between the first two approaches, with an exponential reduction in the influence of the last valid value as the size of the gap increases. The obtained results reveal that for a short-term forecast (less than 4 h), data imputation with the last valid observation provides the best performance, while the approach using the weighted sum gives the best results for longer forecasts (4 to 8 h).

Selection of Input Variables
Various strategies have been applied in an attempt to find the optimal subset of input variables for air pollution forecasting models. Correlation analysis is used in [42] to select input variables for a CNN-LSTM model predicting the PM 2.5 concentrations. In [24], the authors examine both backward elimination and forward selection to choose the input variables, and the obtained results show that the models that utilize techniques to optimize the used input data perform better than those using all variables. Moreover, the backward elimination technique gives better results than the forward selection. As shown in [22], backward selection based on relative entropy improves the performance by increasing the number of input variables up to a number after which there is no further improvement. Random forest is used in [49] to rank the traffic flow in different directions at a road intersection, and the best combinations of factors together with the previous CO concentration are applied as input into an LSTM model to predict future pollutant concentrations.
The genetic algorithm (GA) is another popular approach to selecting input features. In [50], GA is used to select relevant variables for a hybrid model predicting the PM 10 concentration for the next hour, and the experimental results show that the selection of appropriate features leads to improved performance. GA is used in [51] to select the input variables to a radial basis function network (RBF) predicting the concentration of SO 2 and NO 2 after 24 h. The suggested approach provides better results when compared to an RBF model that uses all input variables and an RBF model that uses principal component analysis for data selection.

Optimization of Model Hyperparameters
The use of a suitable NN architecture for a specific problem is very important to the successful performance of the model. Manual architecture optimization is time consuming, and requires significant prior knowledge, which is why a number of research studies have aimed at suggesting automatic methods to optimize the hyperprameters of the model. A classic strategy for optimizing hyperparameters is grid searching. In [34], grid searching is used to optimize some hyperparameters of an LSTM model predicting the air quality for the next 24 h, such as batch size, number of training epochs, optimizer, learning rate, and type of activation function. Grid searching is also applied in [52] to find the optimal hyperparameters, such as the number of hidden layers, the number of neurons in the hidden layer, the learning rate and the number of epochs of a deep NN for the hourly prediction of ozone concentration. The model comparison with SVM, linear regression and a traditional NN shows better prediction results. Various stochastic algorithms are also used to optimize the hyperparameters of statistical models. In [43], random searching is suggested for the optimization of the architecture of a spatiotemporal CNN-LSTM model for PM 2.5 forecasting. In [19], GA is used to optimize the topology and other hyperparameters of a feedforward NN that predict the monthly concentration of ozone and carbon dioxide in Arosa, Switzerland. Two different models are optimized-one for ozone and one for carbon dioxide, using only data on the relevant pollutant as input. For both types of pollutants, the optimization leads to good models capable of predicting the concentrations for the next month. GA is used in [53] for the optimization of the topology and the learning parameters of a distributed Time Lagged Feed forward Network (TLFN) predicting the maximum concentration of PM 10 for the next day, utilizing data from Kozani in northern Greece that include information about the PM 10 , the meteorological conditions, as well as the weather forecast.

Spatiotemporal Models
As shown in [54], there are spatiotemporal correlations between data at different stations in a network, which is an indication that taking into consideration the spatial information may improve model performance. In [55], a feedforward NN is used to predict the concentration of PM 2.5 for the next 1 to 21 h based on data on the local conditions, as A hybrid model is suggested in [50] that considers the impact of neighboring areas to predict PM 10 for the next hour for six different stations in the province of Nan in Thailand. For each station, a modified depth-first search is used to select data from nearby stations to be used in the forecast based on feedforward NN. The experimental results prove that taking into account the influence of neighboring stations leads to improved model performance.
A spatiotemporal deep NN model for predicting the concentration of PM 2.5 is proposed in [48], which consists of LSTM layers and fully-connected layers, and uses not only local information, but also air quality data for the nearest surrounding stations. The model shows better results compared to a Gradient Boosting Decision Tree and deep feed forward NN.
A hybrid neuronal model is suggested in [56] to predict PM 2.5 for a specific station. The model has two components: an LSTM is applied for local forecasting for individual stations, and a standard three-layer neural network is applied for the final forecast, considering the influence of the surrounding areas by combining the station's forecast with those of its neighbors. Thus, the proposed hybrid model leads to better performance than a standard neural network and an LSTM.
A three-component deep NN model for air quality forecasting is suggested in [57], which takes into account different meteorological patterns, as well as the spatial and temporal dependencies for each of them. In the first component, the data are divided into subsets depending on the meteorological patterns, with multiple models trained for each subset and combined dynamically. The second component discovers spatial correlations between stations using Granger's causality and obtains information of the influence of other stations on the data at a particular station. The last component uses an LSTM to detect temporal dependencies. Based on experimental evaluation using data from 35 stations in Beijing, the obtained results show that each of the components improve the performance of the model, and the model's performance is better compared to traditional regression and machine learning models.
A spatiotemporal model for predicting PM 2.5 is developed in [39], combining multiple 1D-CNNs and a bidirectional LSTM. For each individual station, CNN is trained to extract local dependencies, and then the extracted features from the different stations are concatenated and fed to a bidirectional LSTM that learns spatiotemporal dependencies. The spatiotemporal features are further combined and fed to an additional regression layer to obtain the final forecast. Experimental evaluations based on two different data sets show that the proposed model performs better than other traditional and deep NN models, both for forecasting one step ahead in time and for multiple steps.
A spatiotemporal model for ozone forecasting is developed in [54], consisting of three components-two for temporal modeling and one for spatial. One of the two temporal components models only uses ozone data, while the other utilizes additional data for other pollutants, weather conditions, and weather forecast. Both temporal models employ a Seq2Seq architecture with an attention mechanism using a bidirectional LSTM for the encoder and a Gated Recurrent Unit (GRU) for the decoder. In the third component, the distances between stations are included, and spatial features are extracted through a deep NN. Finally, the two temporal models and the spatial model are combined through a deep NN for the final forecast. The experimental results demonstrate that spatial information inclusion improves the performance of the model, and the utilization of additional information, especially weather forecast data, also improves the performance.
A spatiotemporal model for predicting the concentration of PM 2.5 is proposed in [43]. The model consists of several components, the first being a 3D CNN that extracts spatiotemporal information from historical PM 2.5 concentration data of the station and from the concentration data of the nearest highly correlated neighboring stations. This infor-mation is passed to a stateful LSTM that accounts for long-term temporal dependencies. Additional data (etc.) are added to the features extracted so far, and then the forecast for the concentration of PM 2.5 is obtained through one or more fully connected layers. The experimental results confirm that using additional meteorological and aerosol data, as well as data correlations of neighboring stations, improves the model performance.
A hybrid model for predicting PM 2.5 over the next two days is suggested in [58]. The model consists of three modules-an LSTM for extracting temporal information from the data of the target station, an NN for extracting spatial information from neighboring stations and stations showing the most similar temporal behavior, and a CNN for including information about the terrain surrounding the target station. The outputs of these three networks are concatenated and fed to a two-layer NN that generates the final forecast. The model showed better results than other modern models. The experimental results confirm that the CNN module always improves the performance, while the LSTM module improves the forecast only for the first hour.
In [59], a multi-component spatiotemporal model for air quality forecasting over the next 48 h is developed based on temporal and spatial predictors. The temporal predictor uses linear regression to predict the air quality at a particular station using only local data, such as on air quality, meteorological conditions and weather forecasts. The spatial predictor is an NN that provides a forecast for air quality at a particular station, using data for the air pollution and the weather conditions from various surrounding areas. The two model components generate an air quality forecast for the respective station independently of each other, and then the two separate forecasts are dynamically combined to obtain the final forecast with a Regression Tree (RT), depending on the weather conditions. The last component of the model is an inflection predictor that uses an RT to model abrupt changes in the air pollution, and its prediction is added to the overall forecast in the presence of certain weather conditions. The experimental results show that the use of separate models and their proper combination leads to better results compared to the use of a single model. Adding the inflection predictor improves the prediction of sudden changes in the air quality, without deteriorating the performance of the model. The system is deployed through the Microsoft Azure cloud platform, implemented in Bing Maps China, and used by the Chinese Ministry of Environmental Protection.
An encoder-decoder LSTM model for air quality prediction using a spatiotemporal attention mechanism is proposed in [60]. In addition to air quality data and meteorological data, the model uses factory emissions data, traffic data and spatial data, which include information on points of interest and road network. For example, in the paper [61], the authors propose a model for calculating the traffic flow index for a whole city or for a road segment, using public transportation vehicle data, which can impact air pollution from public transport. A spatial attention mechanism [60] is included, with the encoder taking into account the relative influence of the surrounding areas. A temporal attention mechanism is added to the decoder that takes into account the temporal dependencies in the air quality. Experimental results confirm the better performance of the suggested model compared to other benchmark models, and prove that the addition of the spatial and temporal attention mechanisms further improves the performance.
A lightweight hybrid CNN-LSTM model for the hourly forecast of PM 2.5 concentration optimized for edge devices is develop in [42]. The model consists of two parallel 1D CNN networks with identical architectures. The first CNN considers the local temporal dependencies using pollution and weather data only from the target station, while the second network accounts for the spatiotemporal dependencies of the local and surrounding stations, using information on PM 2.5 concentrations from all 12 stations. The outputs of the two CNN networks are concatenated and fed to an LSTM layer. The model performs better than other deep models. The experimental results show that adding the CNN before the predictor leads to an improvement, and considering the spatiotemporal dependencies further improves the results, in some cases significantly.
An LSTM model for predicting the concentration of PM 2.5 is developed in [62], taking into account spatiotemporal correlations. Data for the hourly concentration of PM 2.5 from 12 stations in Beijing are used. PM 2.5 concentration data for all 12 stations are fed to LSTM layers in order to extract spatiotemporal features, which are further combined with additional meteorological and timestamp data and fed to fully connected layers to obtain the final forecast. The model is experimentally compared with a spatial-temporal deep model (STDL), a time delay neural network (TDNN), an autoregressive moving average (ARMA) model and a support vector regression (SVR) model, providing better forecasts that are further improved when using additional meteorological and timestamp data.

Data
In the field of air pollution forecasting, there is a shortage of research using standard publicly available data sets. To facilitate independent testing as well as comparisons between different models, a publicly available data set is chosen for the experimental testing of the proposed model-Beijing Multi-Site Air-Quality Data Set [63]. The data set includes hourly data from 12 state-controlled air pollution stations and includes information for the concentrations of PM 2.5 , PM 10 , SO 2 , NO 2 , CO and O 3 . The data set also contains information for some meteorological variables such as temperature, pressure, dew point temperature, precipitation, wind direction and wind speed. The data are from Beijing, China and cover the period 1 March 2013-28 February 2017.

Missing Data Imputation
As in many other data sets, the Beijing Multi-Site Air-Quality Data Set has missing values. Although the total amount of missing data for the whole measurement period is relatively small-about 3% of the data per pollutant and about 0.1% per meteorological variable are missing-the data for most pollutants include at least one long gap, which in some cases exceeds 1000 consecutive missing values. These characteristics of the missing data are likely to render inappropriate the use of simple and popular imputation strategies, such as using the last valid observation and linear interpolation.
For this reason, a hybrid approach to fill in the missing data is developed, which uses two strategies depending on the length of the time period with missing values (the size of the gap). Small gaps with missing data are filled by linear interpolation, and for large gaps the missing value is filled by the average between the previous value and the average value for the same hour of the same day and month for other years with available valid values. The threshold of the gap size that determines which strategy to be used is empirically chosen as 20 (among possible values of 5, 10, 15 and 20), i.e., gaps of up to and including 20 consecutive missing values are filled using linear interpolation, and gaps of sizes higher than 21 are filled using the second approach, utilizing data for the same part of the day in other years.
The proposed hybrid approach is applied for missing data imputation for all variables, except the wind direction. The missing data for wind direction are filled in using the last valid observation, because the wind direction variable can only take discrete values (16 in total) representing different directions such as north, south-southeast, etc.

Normalization
Before being used by the model, the data of all variables are normalized. Since the values of the wind direction are strings, these values are converted to numbers before normalization using the strategy suggested in [42] by replacing the string with degrees from 0 to 360. To avoid the influence of outliers the Robust Scaler normalization [64], part of the Scikit-learn package [65] for Python, is used. In this normalization, the median is first subtracted, and then the value is scaled by the interquartile range: where Q 1 , Q 2 and Q 3 are the first, second, and third quartile of x.

Partitioning the Data
The data are divided into training, validation and test chinks. The data from 1 March 2013 to 28 February 2015 are used for training, validation data are from the period 1 March 2015-29 February 2016, and data in the period from 1 March 2016 to 28 February 2017 period are used for testing. The validation and test periods are selected in such a way that they comprise data for at least one full year.

Selection of Input Variables
How good the performance of the model is depends to a high extend on the utilization of as much relevant information as possible and the removal of as much redundant information as possible. Particularly, the selection of appropriate input variables is a very important part of the model's development process. Using too small a number might lead to a model that does not have enough input information to make satisfactory predictions, and using all variables or a large number of variables often leads to the inclusion of redundant information; both cases lead to a deterioration in the performance of the model.
For this reason, the input variables are selected by generating 50 random input configurations and optimizing a model for each of them by random searching, with 50 generated random solutions. Each of the generated models is trained for five epochs using the training set, and is evaluated on the validation set using the mean absolute error (MAE) as a quality metric. The best configuration obtained as a result of the random search is as follows: PM 2.5 , PM 10 , SO 2 , pressure, dew point temperature, precipitation, wind direction and wind speed, i.e., information is used for almost all meteorological variables (excluding the temperature, which is in agreement with [63], wherein when using data from Beijing, including the stations used in this article, the authors found that among these meteorological variables, the temperature had the least impact on PM 2.5 ), but only for some of the pollutants.

The Proposed Spatiotemporal Model for Air Pollution Forecasting
The proposed spatiotemporal model for air pollution forecasting is a combination of a 2D CNN and an LSTM. The model consists of two parts: the first part is the convolutional part, which automatically extracts spatial features that are further fed as input into the LSTM part to model temporal dependencies. The spatiotemporal model predicts the concentration of PM 2.5 for the next time step, i.e., for the next hour, at a target station, taking into account information from the surrounding stations.

Convolutional Neural Network (CNN)
CNNs [37] are inspired by the primary visual cortex, and are some of the most popular deep neural networks. At the heart of the CNN are the convolutional layers and the pooling layers. Each convolutional layer extracts features from the data by convolving filters (convolutional kernels) with a given input: where O l is the output feature map of the lth convolutional layer, a is the activation function, W l are the weights of the convolutional kernel, b l is a bias term and * represents the convolution operator. The convolution of a filter with the layer's input makes it possible for the network to learn local dependencies in the data; it also enforces translational equivariance, and reduces the number of learnable parameters. These feature maps obtained as the output of the convolutional layer can be downsampled by a pooling layer that combines the activations from local regions using the max or the average pooling function. The pooling layer introduces a degree of translational invariance. Multiple convolutional layers can be stacked one after the other, leading to a hierarchical representation of features at different levels of abstraction. The features thus extracted are finally concatenated and passed to the next part of the model, the purpose of which is to turn the features into the required output. Usually, this part consists of fully connected layers, but in the presented research, LSTM layers are used.

Long Short-Term Memory (LSTM)
The LSTM [30] network is a type of recurrent neural network in which the hidden neuron is replaced by a memory block that stores and retrieves information, allowing the system to model long-term dependencies. Each block consists of three gates (input gate, forget gate and output gate), which control the flow of information that enters and leaves the cell: block input: input gate: forget gate: cell state: output gate: block output: where x t is the input vector at time t, W z , W i , W f and W o are input weights, R z , R i , R f and R o are recurrent weights, p i , p f and p o are peephole weights, b z , b i , b f and b o are bias weights, σ is the gate activation function, g is the input activation function, h is the output activation function and represents point-wise multiplication.

Input Data to the Model
The inputs to the CNN are images with a size of 25 × 25 pixels. A separate image is created for each time step and each variable, and each variable used represents a separate CNN channel. Thus, images of the selected input variables from the previous few hours are fed as input to the CNN. In the center of the image is the target station for which the forecast is generated, and around it are located nearby neighboring stations.
The relative positions of the stations as shown in Figure 1 are determined directly by their geographical coordinates, as given in Table 1. The distance between the target station and the other stations, which is used to determine whether a neighbor station is included in the image, is given by the parameter neighborhood size (ν), described in Section 3.4.3.
The values of a specific variable at each included neighbor station are represented as circles with a center at the location of the respective station in the image, and a radius r proportional to the value of the variable for the respective station: where x var is the normalized value of the corresponding variable, s p is the number of pixels at one side of the image (in this case the number is 25) and N st is the number of stations that fall into the neighborhood of the target station (including the target station itself), i.e., the number of stations included in the image. The values of a specific variable at each included neighbor station are represented as circles with a center at the location of the respective station in the image, and a radius r proportional to the value of the variable for the respective station: where xvar is the normalized value of the corresponding variable, sp is the number of pixels at one side of the image (in this case the number is 25) and Nst is the number of stations that fall into the neighborhood of the target station (including the target station itself), i.e., the number of stations included in the image.   The value of a specific pixel in the image represents the number of circles of different stations into which it falls inside. Finally, the pixel values are normalized. Thus, the input data are images centered at the target station for which the forecast is generated, and they encode information not only about the value of a certain variable at a certain moment in time, both at the target station and at neighboring stations, but also about the relative spatial positions of the involved stations. Figure 2 shows an example of an input image with a neighborhood size 0.6 for the variable PM 2.5 at 0 h on 29 June 2015, for the target station Dongsi. The value of a specific pixel in the image represents the number of circles of different stations into which it falls inside. Finally, the pixel values are normalized. Thus, the input data are images centered at the target station for which the forecast is generated, and they encode information not only about the value of a certain variable at a certain moment in time, both at the target station and at neighboring stations, but also about the relative spatial positions of the involved stations. Figure 2 shows an example of an input image with a neighborhood size 0.6 for the variable PM2.5 at 0 h on 29 June 2015, for the target station Dongsi.

Architecture Optimization
One serious shortcoming in many research studies is the use of a trial and error approach to optimize the proposed models. For this reason, in this study, the network architectures as well as other hyperparameters (as described in Section 3.4.3 and given in Table  2) are optimized by a genetic algorithm with a modified crossover operator for solutions with variable lengths. The evolutionary procedure is implemented as described in [66] and [67]. The models are trained on the training data for five epochs, and evaluated on the validation data using MAE. The solutions are always evaluated and the fitness score of a certain solution is the average MAE so far.

Architecture Optimization
One serious shortcoming in many research studies is the use of a trial and error approach to optimize the proposed models. For this reason, in this study, the network architectures as well as other hyperparameters (as described in Section 3.4.3 and given in Table 2) are optimized by a genetic algorithm with a modified crossover operator for solutions with variable lengths. The evolutionary procedure is implemented as described in [66,67]. The models are trained on the training data for five epochs, and evaluated on the validation data using MAE. The solutions are always evaluated and the fitness score of a certain solution is the average MAE so far.

GA Solution Encoding
The GA solution comprises three parts-convolutional, LSTM and global. The convolutional and the LSTM parts of the solution consist of a certain number of layers, each of which is characterized by certain parameters. For convolutional layers, these parameters are the number of kernels, the kernel size and the parameter that defines if the layer should be followed by a max-pooling layer with kernel size 2. For LSTM layers, the parameters are the number of LSTM cells and a parameter that define if a dropout will be used, as well as its p. The global part has two parameters-normalized neighborhood size (ν norm ) and the time lag. The first parameter determines the size of the neighborhood (ν) around the target station for which the forecast is generated, which is calculated as follows: dy = max(y center − y min , y max − y center ) (13) where ν norm is the normalized neighborhood size with possible values from 0 to 1, with value 0 meaning that only the target station is included in the image, and value 1 meaning that all stations are included; x center and y center are the coordinates of the target station; x min and x max are the smallest and largest x coordinates from among all 12 stations, respectively; y min and y max are the smallest and largest y coordinates from among all 12 stations, respectively. The x coordinate is the value of the longitude and y is the value of the latitude, and no transformations of the latitude and longitude are made due to the small size of the covered area. The neighborhood is a square centered on the target station, and the image is generated by all stations the coordinates of which fall within this neighborhood. When the size of the neighborhood is small, only the closest stations to the target station are included in the image, while for higher values more distant stations are also included in the image. The second parameter determines the number of historic data samples used by the model; for example, a time lag of 3 means that the data from the previous 3 h will be used as input information. Table 2 shows the optimized parameters of the network architecture, and the possible values they can have.

Selecting and Training the Best Architecture
The best architectures of the last generation of the evolutionary procedure are selected and then trained on a combination of the training and validation set for 100 epochs, then evaluated on the test set. Due to the stochastic nature of the weight initialization, each training is repeated as 10 independent runs.

Performance Metrics
The following metrics are used to evaluate the performance of the best architectures: mean absolute error (MAE), root mean square error (RMSE) and coefficient of determination (R 2 ).
whereŷ i is the predicted value of the i-th sample, y i is the true value for that sample, n is the total number of samples, and y = ∑ n y i/n.

Experimental Setup
The Keras library is used to implement the models and the Adam algorithm is used in the training, with a learning rate of 0.001, β 1 = 0.9, β 2 = 0.999, ε = 1.0 × 10 −7 and batch size of 16. The training of the network architectures is performed on NVIDIA GeForce RTX 2060. ReLU is used as an activation function not only for the convolutional layers, but also for the LSTM layers. The strides of the convolutional layers are set to (1,1), and the padding to valid. Max-pooling is used with size (2, 2), strides (2, 2) and valid padding. For the evolutionary optimization, the population size is set to 20, the number of generations to 50, and elitism of size 2 is also used. The maximum number of convolutional layers is set to 10, and the maximum number of LSTM layers to 5.
Due to the limited computational resources, experiments are conducted on 3 of the 12 stations-Dongsi, Wanliu and Changping. All these three stations are selected prior to the experiments so that the target stations are surrounded by other stations. A separate model is optimized for each station, and due to the stochastic nature of the evolutionary search, the optimization of the architecture for each station is repeated three times.
Besides the main experiment for architecture optimization, some additional experiments are also conducted, such as forming various ensembles using the already trained models, validating the proposed hybrid missing data imputation strategy, and validating some components of the proposed spatiotemporal model. In order to make correct comparisons with the results of the main experiment, separate models are optimized for each additional experiment involving the spatiotemporal model, using the already outlined procedure. As in the main experiment, the evolutionary optimization is repeated three times. Due to the limited computational resources, the additional experiments are conducted for only one station-Wanliu (unless explicitly stated otherwise).

Experimental Results for Architecture Optimization
The experimental results for the architecture optimization of the three independent experimental runs for each of the three selected stations are given in Tables 3-5. Additionally, the distributions of the performance metrics for each station are depicted as boxplots (including the mean shown as a green triangle) in Figures 3-5. The results prove that architectures have been successfully found for each of the stations. Moreover, the models trained using the best architecture provide good and consistent results, with the best performance being achieved for station Wanliu.       For comparison of the results achieved by the suggested model, the experimental results for the same three stations from other research studies, using the same data set for PM 2.5 forecasting, are given in Tables 6-8. As can be seen, for most of the compared cases, the suggested model does not lag far behind other proposed models according to the MAE, and the results for station Wanliu are comparable to those of some other deep NN models.  15.9 23.9 CNN [33] 13.9 23.4 CNN-LSTM [42] 8.486 14.951 The best architectures obtained from each of the three experimental runs for the three selected stations are given in Tables 9-11. In most of the cases, a large value of the neighborhood size is used, often 1, i.e., all 12 stations are taken into account, and for Wanliu, in every experimental run, the best architecture uses the maximum size. A large time lag (7 or 8) is also often applied. The results obtained demonstrate the benefits of including spatial information from as many surrounding stations as possible, as well as using as much historical information as possible. Most of the models have one convolutional layer. The convolutional layers most often have a relatively small number of kernels and a small kernel size. In about half of the cases, the convolutional layers are followed by a maxpooling layer. After the convolutional part, one or two LSTM layers are utilized. In about half of the cases, dropout regularization is applied to the LSTM layer using relatively small values of p-0.1, 0.2 and 0.3. Table 9. The best architecture for each optimization run on station Dongsi. The first row shows the convolutional part, the second the LSTM, and the third the global part. The optimized parameters are given in Table 2.    Figure 6 shows the predictions made by the model (averaged over all 10 training runs) over the entire test period vs. the actual values for one of the experimental runs for each of the three stations. The results reveal that the models experience some difficulty in predicting peak concentrations, especially when the peaks are very large, and this is most evident for Changping (Figure 6c).  Table 11. The best architecture for each optimization run on station Changping. The first row shows the convolutional part, the second the LSTM, and the third the global part. The optimized parameters are given in Table 2.  Figure 6 shows the predictions made by the model (averaged over all 10 training runs) over the entire test period vs. the actual values for one of the experimental runs for each of the three stations. The results reveal that the models experience some difficulty in predicting peak concentrations, especially when the peaks are very large, and this is most evident for Changping (Figure 6c).

Run
(a)

Using the Trained Models to Form Ensembles
Additional experiments have been conducted with the formation of ensembles from the already trained models from the main experiment for architecture optimization. Two different strategies for inclusion in the ensembles are tested; the first uses the models from the 10 training runs of the best architecture obtained from a single optimization run (i.e., an ensemble of models with the same architecture), and the other uses the k best models from the training runs from each experimental run (i.e., an ensemble including models with different architectures). In all cases, the final forecast is obtained by averaging the individual forecasts of the models in the ensemble.  show the results of the ensembles of models with the same architecture and Tables 15-17 show the results of the ensemble models with different architectures. As can be seen in all cases, the results improve when using an ensemble of models compared to the individual models from the main experiment (Tables 3-5). Table 12. Results for station Dongsi of the ensemble composed only of models obtained from the 10 independent training runs of the best architecture of a single optimization run. With the exception of station Changping, in most cases, an ensemble of k best models from each experimental run gives better results than the strategy of using only models with the same architecture, and the observation is valid for most of the values of k. The comparison between the results of the ensembles with the k best models and the average results (over the three experimental runs) of the ensembles with the same architectures proves that the strategy with the k best models provides better results for all stations.

Run
The overall evaluation of the results indicates that the strategy with the k best models shows better results; for station Dongsi, the best results using that strategy are achieved using k = 3, while for Wanliu they are obtained using k = 8, and for Changping using k = 2. A possible explanation for the higher value of k for Wanliu could be that in the main experiment of architecture optimization, the results for this station are the most consistent, which allows the inclusion of a larger number of models in the ensemble without deteriorating its overall performance.
Comparing the results with those of other research studies using the same data set and predicting PM 2.5 (Tables 6-8) demonstrates that for stations Dongsi and Changping, the ensembles have MAE values comparable to most models from [33], but fall behind according to the RMSE, which means that the models probably make big errors more often. However, on the other hand, it should be noted that the forecasts reported in [33] are made on a daily basis, while in the current study, the data in the used dataset are gathered hourly, which may offer an explanation for the lower RMSE. For Wanliu, the ensembles surpass the results achieved in [33] when MAE is used as a metric, and for RMSE the results surpass two of the models-CNN and LSTM, and are slightly behind the other two.

Validation of the Hybrid Strategy for Missing Data Imputation
Two types of experiments have been performed to validate the proposed strategy for missing data imputation. For the first type, random artificial gaps are generated in the data of each variable (except for the wind direction, which is not included in this experiment), i.e., valid values are removed from randomly selected time intervals. In this way, an additional~5% of the valid values have been removed for each time series. The gaps sizes are generated using a geometric distribution with p = 0.05. Three strategies are used to fill in the missing values-linear interpolation (LI), the average between the previous value and the average value for the same hour of the same day and month for other years (MTP), and a hybrid approach (HS) that is a combination of the first two. The performance of the strategies is evaluated by R 2 only on the simulated gaps, i.e., the original missing values are not taken into account in the evaluation.
Due to the stochastic nature of the gap generation, each strategy is evaluated for each variable by 100 independent runs with different simulated gaps. The Kruskal-Wallis (KW) test is used to check for differences between the strategies. The performances of the three strategies applied to data from the three selected stations (Dongsi, Wanliu and Changping) are presented in Tables 18-20. The presented values of p in the tables have not been corrected for multiple comparisons.
As seen from the results, taking into account the multiple comparisons (11 variables per station), in most cases there is no significant difference between the linear interpolation and the hybrid strategy for the pollutant data (except for ozone, where the hybrid strategy shows better results for two of the stations).  Table 19. Results of the three missing data imputation strategies for station Wanliu. R 2 is the average R 2 of the 100 independent evaluations.  The MTP strategy gives the weakest results for all pollutants except ozone, where it outperforms both linear interpolation and the hybrid strategy. Regarding the meteorological variables, there is no significant difference in the performances of the three strategies for precipitation and wind speed in almost all cases. The MTP strategy performs best compared to the other two for the temperature, and shows the weakest results for pressure and dew point temperature. Compared to the other two strategies, linear interpolation performs worst for temperature, but shows the best results for pressure and dew point temperature. For these variables, the hybrid strategy gives results between those of the other two strategies. Out of all variables, linear interpolation and the hybrid strategy show the weakest results for precipitation and wind speed, which may be due to the complex and chaotic nature of these two phenomena. The results reveal that the hybrid strategy is not inferior to the linear interpolation strategy in most cases, and even surpasses it for some specific variables, such as ozone and temperature. For these variables, the seasonal patterns are characterized by less noise than other variables in the selected data set, and the hybrid strategy manages to account better for the seasonal dependencies.
In the second type of experiments, the missing value imputation strategies are evaluated by the performance of models trained with the filled data. Two additional experiments are performed; the first one utilizes models optimized using data imputation by linear interpolation, and the second one utilizes models using data imputation with the MTP strategy. These two experiments are performed using the data from station Changping due to the presence of a larger number of large gaps compared to the data at station Wanliu, which makes Changping a better choice for the testing of the different imputation strategies. Tables 21 and 22 show the results of the two above described experiments. Figures 7 and 8 depict the distributions of the different performance metrics as a boxplot. The results obtained for the training runs from all experimental runs are compared with those for the main experiment that uses the hybrid strategy (Table 5) via a KW test. No significant differences can be observed between the hybrid strategy and the linear interpolation (for all metrics p is above 0.05). A possible explanation for this observation, given the results of the first type of experiments (Tables [18][19][20], is that the variables for which the hybrid strategy has an advantage over linear interpolation (ozone and temperature) are not selected by the initial random search as input variables. The hybrid strategy produces better MAE compared to the MTP strategy (p = 0.0014).  [18][19][20], is that the variables for which the hybrid strategy has an advantage over linear interpolation (ozone and temperature) are not selected by the initial random search as input variables. The hybrid strategy produces better MAE compared to the MTP strategy (p = 0.0014). (a)

Selection of Input Variables by Random Search
In order to evaluate whether the selection of input variables by random searching leads to improved model performance, two additional experiments are accomplished. In the first experiment, the models use only PM2.5 as input, and in the second experiment all 12 available variables are used as input. The results of these two experiments are given in Tables 23 and 24. Figures 9 and 10 show the distributions of the different performance metrics as boxplots. As can be seen, the employment of additional variables besides the

Selection of Input Variables by Random Search
In order to evaluate whether the selection of input variables by random searching leads to improved model performance, two additional experiments are accomplished. In the first experiment, the models use only PM 2.5 as input, and in the second experiment all 12 available variables are used as input. The results of these two experiments are given in Tables 23 and 24. Figures 9 and 10 show the distributions of the different performance metrics as boxplots. As can be seen, the employment of additional variables besides the historical data of the predicted variable (PM 2.5 in this case) improves the performance of the model (p much less than 0.05 for all metrics). historical data of the predicted variable (PM2.5 in this case) improves the performance of the model (p much less than 0.05 for all metrics).  The use of all variables produces similar results to those in the main experiment (Table 4) (p > 0.5 for all metrics), but the full training of the models takes a significantly longer time (p = 2.28 × 10 −07 , one full training when using all variables takes an average of 1.84 h compared to an average of 0.84 h when the input variables are selected by random search).

Inclusion of Spatial Information
In order to evaluate the influence of the inclusion of spatial information on the performance of the model, an experiment is carried out in which models with a fixed neighborhood size of 0 are evolved, i.e., data from other neighboring stations are not taken into account, and only data from the target station are used. The results are given in Table 25, and the comparison with the results of the main experiment (Table 4) confirms that the inclusion of spatial information significantly improves the performance of the model (p much less than 0.05 for all metrics). Figure 11 shows the distributions of the different performance metrics as a boxplot.

Conclusions
Air pollution has a negative impact on human health and the environment, thus necessitating the development of effective air quality forecasting systems that would allow appropriate measures to be taken in a timely manner.
The suggested deep spatiotemporal model for predicting air pollution with the automatic selection of input variables uses a genetic algorithm for the optimization of network architecture and hyperparameters. In addition, a hybrid strategy for filling in the missing values in time series is proposed. The experimental evaluation using a publicly available data set demonstrates that even with limited computational resources, the evolved architectures provide good and consistent results. The inclusion of the trained models in different ensembles further improves the results, bringing them closer to those obtained with some modern deep models using the same data set, and in some cases even showing superior performance.
The hybrid strategy for missing value imputation has advantages over linear interpolation, especially for data with very clear seasonality, without lagging behind in many other cases. The experimental results also reveal that random searching is a simple and effective strategy for selecting the input variables. Moreover, the inclusion of spatial information significantly improves the predictive results obtained with the models.
Future work will be aimed at improving the spatial component of the model by making the size of the neighborhood window dynamic and dependent on the local meteorological conditions.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.