1. Introduction
With the rapid development of industry, public concern about air pollution has grown considerably. The concentrations of various pollutant gases and solid particles, such as CO and other airborne pollutants, affect human health and sustainable development around the world. In South Korea, research on air pollution is very important and has constantly been viewed as a key topic in environmental protection [1]. As depicted in Figure 1, air pollution has several significant factors, which can be classified into two categories: primary factors and secondary factors. The primary factors are essentially air pollutant sources such as solid particles, coal burning, traffic volumes, and manufacturing emissions; each of these sources has a different spatial distribution and temporal pattern. The secondary factors, on the other hand, are mainly composed of meteorological information, topography, and time. There is an increasing demand for predicting future air quality, because people who know the air quality in advance can take precautions to avoid getting sick. Air quality forecasting is also highly significant to any government's emergency management, since it provides time to implement appropriate emergency measures to mitigate atmospheric pollution, such as limiting the production and emissions of heavily polluting enterprises and restricting motor vehicles. However, air quality prediction is a complex task, and improving prediction accuracy while reducing training time is an urgent and challenging problem in the field of air pollution prevention.
Over the past ten years, several researchers have devoted effort to the air quality forecasting topic. These studies have mainly focused on two types of modeling for air quality prediction: knowledge-based models and data-driven models. Knowledge-based models rely on chemical and physical assumptions to represent the transportation and transformation of air pollution particles. Many knowledge-based models have been proposed in the literature; however, their successful application requires a solid background in atmospheric and environmental science. Furthermore, when such a model is applied in a different situation, the chemical and transportation rules may change, which generally leads the model to inaccurate results. To address this problem, some researchers implemented statistical prediction methods such as the autoregressive integrated moving average (ARIMA) [2], hidden semi-Markov models (HSMMs) [3], and the least absolute shrinkage and selection operator (LASSO) model [4]. However, statistical prediction methods, which rely on mathematical logic and regression analysis, have two significant shortcomings: (1) low accuracy; and (2) high time and energy consumption, mainly caused by the analysis of long-term historical monitoring data. In summary, statistical prediction methods [5] are workable for air quality forecasting, but the diversity of air pollution factors makes it difficult for them to achieve good prediction accuracy.
Recently, with the explosion of big data and artificial intelligence, data-driven approaches to air pollution modeling have been considered and applied in several forecasting systems. Air quality prediction methods based on machine learning have overcome some of the shortcomings of the older statistical and numerical prediction methods mentioned above and have become the mainstream of air quality prediction research. So far, machine-learning-based air quality prediction methods have achieved good results. For instance, Wang et al. [6] proposed an online support vector machine (SVM) model to predict air pollutant concentrations in the Hong Kong downtown area; they performed a comparative experiment between a conventional SVM and the online SVM and demonstrated the effectiveness of their model. In study [7], the authors presented a case study of air pollution prediction in the city of Murcia, focusing on the prediction of the ozone (O3) level. They used several shallow machine learning models, among which Random Forest performed the best. Recently, study [8] proposed an air pollution forecasting approach using four advanced regression techniques (Decision Tree Regression, Random Forest Regression, Gradient Boosting Regression, and Artificial Neural Network (ANN) Multi-Layer Perceptron Regression) and presented a comparative study to determine the best model for accurately predicting air quality with reference to data size and processing time. The experimental results showed that Random Forest Regression was the best model, performing well for data sets of varying size, location, and characteristics.
Deep learning (DL) has been widely applied in big data analysis to solve problems related to object recognition [9], image classification [10], speech recognition [10], time series forecasting [10], and so on. The advent of deep learning technologies has remarkably enhanced the accuracy and efficiency of air quality prediction. Deep learning is currently the most popular data-driven method [9], as it can automatically extract and learn the inherent features of various air quality data. A wide range of papers applying deep learning to air pollution prediction have achieved good results. Among them, Zhao et al. [11] proposed a deep learning model called the long short-term memory fully connected (LSTM-FC) neural network to predict particle concentrations at specific monitoring stations over 48 hours. The authors used historical air quality data, meteorological data, and weather forecast data as input for their prediction model, evaluated the proposed approach on a dataset containing records of 36 air quality monitoring stations, and compared it against an ANN model and an LSTM model on the same dataset. Qi et al. [12] proposed a hybrid deep learning model that embeds graph convolutional networks within long short-term memory networks (GC-LSTM) to model and predict the spatiotemporal variation of particle concentrations. The authors constructed historical observations as spatiotemporal graph series, with historical air quality variables, meteorological factors, spatial terms, and temporal attributes defined as graph signals. For evaluation, they compared their model with state-of-the-art approaches over different time intervals, and the proposed model achieved the best prediction performance. Wen et al. [13] proposed a convolutional long short-term memory neural network to predict particle concentration levels; spatiotemporal features were extracted through the combination of a convolutional neural network (CNN) and an LSTM network, and meteorological and aerosol data were integrated to improve the performance of the model. Similarly, Huang et al. [14] developed a model that integrates a CNN and an LSTM for particle concentration prediction, evaluating it with four measurement indexes (Mean Absolute Error, Root Mean Square Error, Pearson correlation coefficient, and Index of Agreement). Bai et al. [15] proposed a stacked autoencoder model combining seasonal analysis and deep feature learning to predict hourly particle concentrations; they evaluated their model on a dataset collected from three environmental monitoring stations in Beijing, and the results demonstrated the effectiveness of the proposed approach. Wang et al. [16] used a hybrid deep learning model based on a CNN and seq2seq, where the CNN layer extracts the spatial correlation among different stations and the seq2seq network captures the temporal relationship for the final prediction.
As we can see, many hybrid models have recently been applied to the air quality forecasting task, and some of them perform well. However, these models suffer from two significant problems:
The first problem is very slow training speed, since the huge amount of data coming from the different air quality monitoring stations is trained using a centralized deep learning architecture. In the worst cases, these models need retraining because of degradation caused by the variation of the data distribution over time. Time and resource consumption during the training step is therefore a major challenge in air quality forecasting;
The second shortcoming is that these approaches do not account for the noise usually present in air quality and meteorological data, which affects, to a certain extent, the accuracy and performance of their predictions, since they are unable to extract suitable features and information from pollution gas and meteorological data.
Considering these challenges, in this research we propose a deep learning model based on a convolutional Bi-LSTM autoencoder framework for air quality forecasting. The proposed model is trained by using a distributed architecture called data parallelism. The main contributions of this paper are as follows:
Study of current state-of-the-art machine learning and deep learning approaches for air quality forecasting;
Design and implementation of a distributed deep learning approach based on two-stage feature extraction for air quality prediction. In the first stage, a stacked autoencoder extracts useful features and information from pollution gas data and meteorological data. In the second stage, considering the properties of multivariate time series air quality data, we use a one-dimensional convolutional layer (1D-CNN) to extract the local pattern features and the deep spatial correlation features from the air quality particle data. Although the CNN model is widely used in object recognition and image processing, its one-dimensional variant can also be applied to time series forecasting tasks. Finally, the extracted features are interpreted by a Bi-LSTM layer across time steps to make the final prediction;
Evaluation of the proposed approach based on two specific phases. In the first phase we train our deep learning framework within a centralized architecture with a single training server and we compare it against ten state-of-the-art models. In the second phase we use a distributed deep learning architecture called data parallelism to train the proposed framework on several training workers to optimize its accuracy and its training time.
The remaining parts of this paper are organized as follows: Section 2 presents the data collection and the feature correlation. Section 3 introduces our deep learning framework. Section 4 describes the experiments. Finally, Section 5 discusses the conclusion and future work.
3. Proposed Deep Learning Architecture
Several researchers have studied hybrid deep learning models, which are usually effective for improving the performance of typical deep learning algorithms. In the proposed architecture, we combine a stacked autoencoder, a convolutional neural network, and a bi-directional LSTM layer to predict the concentration levels of the two target particle types based on data collected in the Busan metropolitan city. The proposed deep learning framework is based on two stages of deep feature extraction: a 1D-CNN extracts suitable features from the particle time-series data, and a stacked autoencoder encodes the key patterns of the meteorological and gas-related features. Figure 7 shows the architecture of the proposed model.
3.1. 1D-CNN for Deep Feature Extraction on Particle Data
The CNN model is widely used in the object recognition and image processing area, but its one-dimensional variant can also be applied to time series forecasting tasks. In our study, the CNN takes the air quality data in one-dimensional form, with the data shaped in order of sequential time instants. Moreover, considering the properties of multivariate time series air quality data, we leverage the strength of the one-dimensional CNN (1D-CNN) to extract the local pattern features and the deep spatial correlation features of the 20 air quality monitoring stations present in the Busan metropolitan city. A standard CNN model has four layers: input, convolutional, pooling, and output layers. A typical convolution process can be represented by the following equation:

x^{l} = f(w^{l} * x^{l-1} + b^{l})

where x^{l} and x^{l-1} symbolize the output patterns of the l-th and (l − 1)-th layers, w^{l} and b^{l} are, respectively, the weight and the bias of the l-th layer, and f is the activation function. To reduce the dimension of the data, the CNN implements a pooling layer after the convolutional layer to improve the model configuration. The pooling layer selects useful data from the input layer. The combination of convolutional and pooling layers can be represented by the following equation:

x^{l} = pool(f(w^{l} * x^{l-1} + b^{l}))

where pool(·) denotes the pooling operation.
To represent the spatial-temporal features of the particle data among all the monitoring stations, we pre-trained several one-dimensional CNN layers to select local recurring features and deep spatial similarity features of the multiple patterns observed at different stations. Unlike image processing, which uses two-dimensional image pixels as the CNN's input, multiple one-dimensional series are fed into the first part of our deep learning framework.
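To make the two equations above concrete, the following minimal sketch implements a single-channel 1D convolution with a ReLU activation followed by non-overlapping max pooling in pure Python. This is an illustration, not the paper's implementation: the function names, the kernel w = [0.5, −0.5] (a local difference filter), and the sample readings are all assumptions for demonstration.

```python
def conv1d(x, w, b):
    """Valid 1D convolution of sequence x with kernel w and bias b,
    followed by a ReLU activation: out_i = max(0, sum_j x[i+j]*w[j] + b)."""
    k = len(w)
    out = []
    for i in range(len(x) - k + 1):
        s = sum(x[i + j] * w[j] for j in range(k)) + b
        out.append(max(0.0, s))  # ReLU as the activation f
    return out

def max_pool1d(x, size):
    """Non-overlapping max pooling with window length `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

# Hypothetical hourly particle readings from one station
readings = [10.0, 12.0, 30.0, 28.0, 11.0, 9.0, 40.0, 38.0]
features = conv1d(readings, w=[0.5, -0.5], b=0.0)  # local change detector
pooled = max_pool1d(features, size=2)              # keep strongest responses
```

Stacking several such convolution-pooling pairs, one per input series, is the one-dimensional analogue of the image-processing CNN described above.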
3.2. Using Stacked Autoencoders to Encode Gas Features and Meteorological Patterns
Several indicators are important in the air quality forecasting task. Predicting particle concentrations without gas features and meteorological information may result in poor accuracy and therefore poor decision-making, which is why the large majority of available air quality datasets are based not only on particulate matter but also on meteorological and gas-related data. Accordingly, in this study we took the meteorological and gas features into consideration to improve the accuracy of our model when predicting either particle type, and we used a stacked autoencoder network to encode the information from these features. An autoencoder is a kind of neural network, typically with one hidden layer, that aims to set the objective value equal to the input: it compresses the input into a latent-space representation and then reconstructs the output from that representation. Autoencoder networks are unsupervised models comprising two major processes, the encoder process and the decoder process, which allow the network to learn more abstract features in an unsupervised way. In this paper, we build a vector representation of the meteorological and gas information and use it for the final particle prediction.
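The encode-reconstruct idea can be sketched as a single forward pass with linear layers. This is a toy illustration, not the paper's trained network: the weight matrices, the 4-to-2 compression, and the sample input (two gas readings plus two meteorological readings, already normalized) are assumptions; a real autoencoder would learn W_enc and W_dec by minimizing the reconstruction error shown here.

```python
def matvec(W, x):
    """Multiply matrix W (given as a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def autoencode(x, W_enc, W_dec):
    """Encode x into a latent code z, then decode z back to x_hat."""
    z = matvec(W_enc, x)      # encoder: latent-space representation
    x_hat = matvec(W_dec, z)  # decoder: reconstruction of the input
    return z, x_hat

def mse(x, x_hat):
    """Mean squared reconstruction error (the training objective)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

# Hypothetical normalized inputs: two gas readings, temperature, humidity
x = [0.2, 0.4, 0.6, 0.8]
W_enc = [[0.5, 0.5, 0.0, 0.0],    # 4 inputs compressed to 2 latent units
         [0.0, 0.0, 0.5, 0.5]]
W_dec = [[1.0, 0.0], [1.0, 0.0],  # 2 latent units expanded back to 4
         [0.0, 1.0], [0.0, 1.0]]
z, x_hat = autoencode(x, W_enc, W_dec)
```

In the proposed framework, the latent vector z (rather than the raw gas and meteorological readings) is what the downstream layers consume.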
3.3. Implementation of a Stacked Bi-LSTM to Capture Temporal Dependencies by Considering Forward and Backward LSTM Directions Simultaneously
Traditional methods such as ARIMA and some shallow learning algorithms have suffered poor performance in forecasting tasks due to the fact that they do not consider the long-term dependence of time series data. The LSTM network, as presented in Figure 8, has so far been the best solution to overcome this shortcoming. In an LSTM model, units can teach the network when to forget historical data and when to update the memory units through a new input architecture. The basic structure of an LSTM memory unit is composed of three essential gates, namely the input, forget, and output gates, as presented in Figure 8. The gates include a sigmoid layer and a pointwise multiplication operation and can control the data flow of the LSTM to prevent gradient explosion. The input gate determines how many of the new features will be retained, the output gate decides what data will be delivered, and the forget gate regulates what content will be discarded from the previous states. The memory units hold a form of historical information controlled by the three gates. The conventional LSTM block computing process uses the following equations (in the standard notation, where σ is the sigmoid function and ⊙ denotes element-wise multiplication):

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)

The input gate i_t determines the values that need to be updated and updates the memory cell c_t. The forget gate f_t identifies the information to be forgotten at time t − 1 from the output value h_{t-1}. The output gate o_t and memory cell c_t determine the information that can be output, giving the output value h_t.
One drawback of the traditional LSTM is that it only proceeds in a unidirectional way, which may cause the loss of significant information when extracting deep suitable features. It is therefore important to utilize both directions of the network to generate more important features. The goal is to split the state neurons of a standard LSTM into a part responsible for the backward states and a part responsible for the forward states. Outputs from forward states are not connected to inputs of backward states, and vice versa. The Bi-LSTM combines the hidden LSTM states of opposite directions into the same output, so the output layer can obtain information from both future and previous states. In this paper, we applied a Bi-LSTM, as shown in Figure 9, to capture the temporal dependencies of the particle data in both directions. The Bi-LSTM runs forward and backward LSTMs simultaneously through two independent hidden layers, and their outputs are concatenated to compute the Bi-LSTM output. In the standard formulation, the hidden states of the forward and backward layers are computed as follows:

h→_t = LSTM(x_t, h→_{t-1})
h←_t = LSTM(x_t, h←_{t+1})
y_t = [h→_t ; h←_t]

where h→_t and h←_t are the forward and backward hidden states at time step t and y_t is their concatenation.
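The bi-directional mechanism can be illustrated with a toy recurrence standing in for the full LSTM cell (this is a sketch, not the paper's model; the smoothing rule h_t = αh_{t-1} + (1 − α)x_t, the function names, and the α value are illustrative assumptions). One pass reads the sequence forward, one reads it backward, and the per-step hidden states are concatenated so each output sees both past and future context:

```python
def run_rnn(seq, alpha=0.5):
    """Toy recurrent cell: h_t = alpha * h_{t-1} + (1 - alpha) * x_t."""
    h, states = 0.0, []
    for x in seq:
        h = alpha * h + (1 - alpha) * x
        states.append(h)
    return states

def bidirectional(seq):
    """Pair the forward hidden state with the backward hidden state
    at each time step, mimicking the Bi-LSTM concatenation y_t."""
    fwd = run_rnn(seq)                                  # left-to-right pass
    bwd = list(reversed(run_rnn(list(reversed(seq)))))  # right-to-left pass
    return list(zip(fwd, bwd))

out = bidirectional([1.0, 2.0, 3.0])
```

Note how the backward component of the first output already reflects the later values 2.0 and 3.0, which a unidirectional pass could never see at that step.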
3.4. Using a Distributed Deep Learning Architecture to Train the Proposed Approach
To the best of our knowledge, a distributed deep learning training architecture has never been used to address the problem of air quality forecasting, in particular to reduce the training time and the memory consumption. Our framework was implemented using a distributed deep learning scheme called data parallelism: we distributed the historical air quality data and the meteorological data across multiple pipelines or nodes before training. Algorithm 1 describes the training process of the proposed framework:
Algorithm 1 Distributed training process
1: Initialize the coordinator node parameters
2: Define the number of training nodes N
3: Partition the dataset into N shards
4: for each monitoring station do
5:     datashard ← dataset/N
6: end for
7: for each node i ∈ {1, 2, 3, …, N} do
8:     LocalTrain(Model[parameters], datashard)
9:     ∇fi ← Backpropagation()  // send gradient to coordinator node
10: end for
11: UpdateMode ← Asynchronous()  // model replicas are asynchronously aggregated via peer-to-peer communication with the coordinator node
12: Aggregate from all nodes: ∇f ← (1/N) Σi ∇fi
13: for each node i ∈ {1, 2, 3, …, N} do
14:     CoordinatorNode.Push(ParametersUpdate)
15: end for
To reduce the training time and achieve significant computing performance, we used a distributed training process called data parallelism, in which n training workers optimize a central model by processing n different shards (partitions) of the dataset in parallel. In this setting, we distributed n model replicas over n processing training nodes, so that every node held one model replica. Each worker then trained its local replica using the assigned data shard. A parameter server (the coordinator node) was responsible for aggregating model updates and serving parameter requests coming from the different workers. The final flow of the proposed approach is presented in Figure 10.
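The data-parallel update loop can be sketched in a few lines. This is a toy simulation, not the paper's code, and for clarity it aggregates synchronously (Algorithm 1 above aggregates asynchronously): the "model" is a single parameter w fitted to minimize the mean squared error (w − y)², each worker computes the gradient on its own shard, and the coordinator averages the gradients and pushes the updated parameter back to every replica. All names, the learning rate, and the sample data are illustrative assumptions.

```python
def local_gradient(w, shard):
    """Worker step: gradient of the mean of (w - y)^2 over its shard."""
    return sum(2 * (w - y) for y in shard) / len(shard)

def data_parallel_step(w, dataset, num_nodes, lr=0.1):
    """One update round: shard the data, let each worker compute a local
    gradient, average at the coordinator, and push the new parameter."""
    shard_size = len(dataset) // num_nodes
    shards = [dataset[i * shard_size:(i + 1) * shard_size]
              for i in range(num_nodes)]
    grads = [local_gradient(w, shard) for shard in shards]  # workers
    grad = sum(grads) / num_nodes                           # coordinator
    return w - lr * grad                                    # pushed back

data = [1.0, 2.0, 3.0, 4.0]  # hypothetical per-station observations
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, data, num_nodes=2)
```

Because the averaged gradient equals the gradient over the full dataset, the sharded run converges to the same optimum (here the data mean, 2.5) as a centralized run, while each worker only ever touches its own partition.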
5. Conclusions
In this paper we proposed a convolutional Bi-LSTM autoencoder model for air quality prediction in the Busan metropolitan city. The proposed approach combines the historical particle time series data with encoded meteorological and gas pollution data to perform the prediction. We used a two-stage feature extraction method to allow our model to perform well in any atmospheric condition (stable or unstable): in the first stage a stacked autoencoder extracts useful features and information from the pollution gas and meteorological data, and in the second stage a convolutional layer extracts deep spatial correlation features from the particle data. The training process was based on a distributed learning method, namely data parallelism, in which several training workers each trained a model replica on one partition of the air quality dataset. A coordinator node was responsible for aggregating model updates and serving parameter requests coming from the different workers. The experiments were conducted on one year of air quality data collected in Busan. The training data and the testing data accounted for 80% and 20% of the dataset, respectively, leaving roughly two and a half months of testing data. The experimental results showed the superiority of the proposed model over ten state-of-the-art models: it recorded the lowest error rate when predicting both particle types, registering 5.07, 6.93, and 18.27 for the MAE, RMSE, and SMAPE evaluation metrics on one particle type and 5.83, 7.22, and 17.27 on the other. We also found that the distributed training method can significantly reduce the training time, which can help the government implement appropriate emergency measures to mitigate atmospheric pollution in a short period of time.