A Hybrid Deep Learning Model to Forecast Particulate Matter Concentration Levels in Seoul, South Korea

Abstract: Both long- and short-term exposure to high concentrations of airborne particulate matter (PM) severely affects human health. Many countries now regulate PM concentrations. Early-warning systems based on PM concentration levels are urgently required to allow countermeasures that reduce harm and loss. Previous studies sought to establish accurate, efficient predictive models, and many machine-learning methods are used for air pollution forecasting. The long short-term memory (LSTM) and gated recurrent unit (GRU) methods, both typical deep-learning methods, predict PM levels reliably but have some limitations. In this paper, the authors propose novel hybrid models that combine the strengths of two types of deep-learning methods. Moreover, the authors compare the hybrid deep-learning methods (convolutional neural network (CNN)–LSTM and CNN–GRU) with the stand-alone methods (LSTM and GRU) in terms of predicting PM concentrations at 39 stations in Seoul. Hourly air pollution data and meteorological data from January 2015 to December 2018 were used to train these models. The experimental results confirmed that the proposed models could predict PM concentrations for the next 7 days. Hybrid models outperformed single models in five randomly selected areas, with the lowest root mean square error (RMSE) and mean absolute error (MAE) values for both PM10 and PM2.5. For PM10 prediction in Gangnam, the RMSE was 1.688 and the MAE was 1.161. Of the hybrid models, the CNN–GRU model predicted PM10 better for all selected stations, while the CNN–LSTM model performed better in predicting PM2.5.

In this paper, the authors focus on these methods to explore the predictive utility of stand-alone and hybrid models in forecasting PM concentrations in Seoul, South Korea. The proposed hybrid methods require less training time and achieve higher prediction accuracy than the single models. Section 2 presents the material relevant to model training. Section 3 explains the proposed models and the training workflows. Section 4 deals with hyperparameter tuning and describes the experiments in detail. Section 5 compares the predictions and model performances, and Section 6 contains the conclusions and future research directions.

Observation Stations
The Korean Ministry of Environment installs and operates air pollution monitoring stations nationwide to monitor the nation's air pollution status, trends, and whether air quality standards are achieved. In this paper, 39 observation stations in Seoul were selected for the study of PM concentration forecasting. Figure 1 shows the locations of the air pollution monitoring stations, which are of two types. The 25 city air monitoring stations, distributed across the 25 districts of Seoul, are marked in red, and the green markers represent the roadside monitoring stations. Table 1 shows the detailed location information of 10 of the 39 monitoring stations. All these stations collect air pollution data every hour of the day, and all data are provided on the AirKorea website (http://www.airkorea.or.kr) by the Korean Environmental Corporation. The collected air pollution data include PM10, PM2.5, O3, CO, SO2, and NO2. The authors used PM10 and PM2.5 data collected from 1 January 2015 to 31 December 2018 for this study because PM2.5 data have only been released since 2015. The two types of monitoring stations were treated in the same way, and only hourly PM10 and PM2.5 concentrations were used in this paper.
Atmosphere 2020, 11, x FOR PEER REVIEW 3 of 20

Data Description
Input variables that are significant for predictive reliability were used. Prior air pollutant concentrations are essential. As local meteorological data are strongly correlated with pollutant concentrations, the authors also used meteorological data for training. The Seoul meteorological data collected from 1 January 2015 to 31 December 2018 were provided by the Korea Meteorological Administration website (http://data.kma.go.kr). Many factors affect local PM accumulation or dissipation, including temperature, wind, and rain. Temperature is important when predicting PM concentrations [27], as high temperatures are associated with stable high-pressure atmospheric conditions, which favor PM accumulation. Low relative humidity and low temperature are correlated with locally high PM2.5 concentrations [28]. Wind transports PM horizontally [29], and low wind speeds tend to be associated with high PM concentrations because wind directly affects PM dispersion [30]. Rain removes PM and dust from the air. Thus, PM concentrations are usually lower in summer, because of frequent and heavy rain, and higher in winter, which has fewer rainy days.
Meteorological datasets contain several features. Wind direction ranges from 0° to 360°, and wind speed follows the Beaufort wind scale from 0 to 12 to represent how fast the wind is blowing. Following Reference [31], the southern and eastern wind components were derived from the wind direction α and the wind speed w using a periodic cosine function.
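The decomposition of wind into two components can be illustrated with a minimal sketch. It assumes the common convention of projecting the wind speed w onto two orthogonal axes using the cosine and sine of the direction α. The exact formula and axis assignment of Reference [31] are not reproduced in this text, so the function name `wind_components` and the mapping to wind_x and wind_y are illustrative assumptions.

```python
import math
from typing import Tuple


def wind_components(direction_deg: float, speed: float) -> Tuple[float, float]:
    """Project wind speed onto two orthogonal axes.

    Assumed convention (not confirmed by the source):
    wind_x = w * cos(alpha), wind_y = w * sin(alpha).
    """
    alpha = math.radians(direction_deg)  # degrees -> radians
    wind_x = speed * math.cos(alpha)
    wind_y = speed * math.sin(alpha)
    return wind_x, wind_y
```

With wind speeds on the 0 to 12 scale, both components produced this way stay within −12 to 12, which is consistent with the input ranges reported for wind_x and wind_y.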
To explore the correlations between inputs, the authors computed Pearson correlations between the nine training inputs of the Gangnam dataset (Table 2). Correlation coefficients range from −1 to 1: a positive value indicates a positive correlation between two factors, a negative value indicates a negative correlation, and a larger absolute value indicates a stronger correlation. PM10 and PM2.5 levels were highly correlated. Using the original hourly data, the correlation between the two columns, as seen in Table 2, was 0.7763. When the authors resampled the original data into weekly means, the correlation between the weekly values was 0.8017. Thus, an element of hidden correlation was present, which deep-learning models can capture. Figure 2 shows the trends of PM concentrations. The two data columns were resampled to weekly values, making it easier to see changes. In this case, the levels of the two PM types exhibited similar trends.
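The comparison between hourly and weekly correlations can be sketched as follows. This is a plain-Python illustration, not the authors' code: `pearson` implements the standard Pearson coefficient, and `block_means` is a hypothetical helper that averages consecutive blocks of 168 hourly values as a stand-in for weekly resampling.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def block_means(values, block=168):
    """Average consecutive, non-overlapping blocks (168 h = one week)."""
    return [
        sum(values[i:i + block]) / block
        for i in range(0, len(values) - block + 1, block)
    ]
```

Calling `pearson(pm10, pm25)` on the hourly columns and then `pearson(block_means(pm10), block_means(pm25))` on the weekly means reproduces the kind of comparison described above; averaging smooths hour-to-hour noise, which is one plausible reason the weekly coefficient is higher.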
Temperature, wind speed, and rain were negatively correlated with air pollutant concentrations, which facilitates neural network (NN) learning. Thus, forecast PM10 levels were affected by PM2.5 and meteorological data.
Air pollution data and meteorological data were combined into one dataset for each station in Seoul. Each dataset contains nine hourly input features: PM10 concentration, PM2.5 concentration, temperature at the local station, sky condition at the local station, rainfall, relative humidity, rain condition, wind_x, and wind_y. These nine features were used as inputs for training the models in this paper.
All training models featured nine input variables, including PM2.5 and PM10 concentrations. Each dataset was generated by one monitoring station and contained 4 years of data, from 2015 to 2018. Each dataset was divided into training (70%), validation (27%), and test (3%) datasets: 1018 days for training, 392 days for validation, and 43 days for testing. The trained models were tuned using the validation datasets, and the predictions were compared with the observed values. Deep-learning models may become confused if values are missing or abnormal, which compromises the outputs. Historical data may be missing; when this occurred, the researchers input a value of 0. Some data may also have been recorded incorrectly (machine or human error). For example, on several days, PM10 concentrations exceeded 1000 µg/m3 (isolated, extremely high values with no connection to the preceding or following data) or were less than 0 µg/m3 (physically impossible); these values were replaced by 0. The input ranges are listed in Table 3. PM2.5 and PM10 concentrations were within the normal range, and the temperature ranged from −25 to 45 °C. Wind_x (southern wind), a floating-point value calculated from wind speed and wind direction, ranged from −12 to 12, as did wind_y (eastern wind). For rain status, 0 indicates no precipitation, 1 rain, 2 sleet, and 3 snow. For sky condition, 1–4 indicate sunny, partly cloudy, cloudy, and dark skies, respectively.
Table 3. Training data and model parameters.
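The cleaning rule and the 70/27/3 split described above can be sketched in a few lines. The helper names are hypothetical, and the split is assumed to be chronological (earliest data for training, latest for testing), which the text does not state explicitly.

```python
def clean_pm(values, upper=1000.0):
    """Replace missing (None), negative, or implausibly high readings with 0."""
    return [
        0.0 if v is None or v < 0 or v > upper else float(v)
        for v in values
    ]


def chronological_split(rows, train=0.70, val=0.27):
    """Split time-ordered rows into training, validation, and test parts.

    Assumption: the order of rows is preserved, so the test part
    contains the most recent observations.
    """
    n = len(rows)
    n_train = round(n * train)
    n_val = round(n * val)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_val],
        rows[n_train + n_val:],
    )
```

Replacing bad readings with 0 follows the rule stated in the text; interpolation between neighboring hours would be an alternative, but it is not what the authors describe.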
The models feature various cell activation functions. The sigmoid and tanh functions may become saturated, rendering the outputs nearly constant. Thus, the input data require normalization before being fed forward into the training model, so that the outputs are not saturated [32]. The authors used the MinMaxScaler to normalize the training data, so all features were scaled to the range 0 to 1.
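As an illustration of this normalization, the sketch below applies the min-max formula (x − min) / (max − min) column by column, which is what a MinMaxScaler with a target range of 0 to 1 computes. It is written in plain Python with hypothetical helper names. Fitting the minima and ranges on the training rows only, and reusing them for the validation and test rows, is an assumption about good practice rather than a detail given in the text.

```python
def fit_min_max(rows):
    """Return per-column minima and ranges computed from the training rows."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    # A constant column has range 0; use 1.0 to avoid division by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return mins, spans


def apply_min_max(rows, mins, spans):
    """Scale each column with previously fitted minima and ranges."""
    return [
        [(v - lo) / span for v, lo, span in zip(row, mins, spans)]
        for row in rows
    ]
```

Values in the validation or test data that fall outside the training range map slightly outside 0 to 1, which is expected when the scaler is fitted on the training data alone.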

The Long Short-Term Recurrent Unit
The structure of a single recurrent LSTM unit is shown below (Figure 3).
These equations show that LSTM cells in recurrent layers process data forward. i refers to the input operation, o to the output operation, and f to the forget-gate operation. t is the current time, and t − 1 is the previous time. h stands for the hidden state and C for the cell state, W and b are the weight and bias vectors, σ is the sigmoid activation function, and tanh is the hyperbolic tangent activation function. Equation (3) describes the forget gate, which decides how much state data to preserve. Equation (4) shows how the input gate determines the values to be updated, and Equations (5) and (6) describe how the cell state is updated to the new cell state C_t. Equation (7) determines what proportion of the cell state will serve as the output, and this is multiplied by the cell state filtered by the hyperbolic tangent function to generate the final output.
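For reference, the widely used formulation of the LSTM unit, which matches the description above, is given here. The symbols follow the definitions in the text; the candidate state \tilde{C}_t, the element-wise product \odot, and the assignment of equation numbers (3) to (7) are assumptions inferred from the surrounding description.

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{3} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{4} \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{5} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{6} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t \odot \tanh\left(C_t\right) \tag{7}
\end{align}
```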

The Gated Recurrent Unit
The structure of a GRU is shown in Figure 4, and the formulae below indicate how the recurrent unit processes sequence data forward.
In these formulae, x denotes the input vector, h is the output vector, z is the update gate vector, r is the reset gate vector, w and b are the weight and bias, respectively, and t is the time. Like the LSTM unit, the GRU uses gates to control how data are processed forward through the unit. The principal differences between the two methods lie in their gates and weights. A GRU has two gates, the update gate and the reset gate. The update gate performs functions similar to the forget and input gates of the LSTM, and the reset gate decides how much past information to forget. Equation (8) shows how the update gate balances new information against information generated by previous activations. Equation (9) shows how the reset gate determines how much of the previous activation enters the candidate activation. Equations (10) and (11) combine the candidate state with the previous output and filter the data to obtain the output of the current state.
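A single forward step of a GRU can be sketched in plain Python as follows. This is a minimal illustration of one common GRU formulation, in which the new state is (1 − z) times the previous state plus z times the candidate state. Some references swap the roles of z and 1 − z, and which form the paper's equations use is not recoverable from this text. The parameter dictionary `p` and its keys are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def gru_step(x, h_prev, p):
    """One GRU time step; x and h_prev are lists, p holds weights and biases."""
    xh = x + h_prev  # concatenation [x_t, h_{t-1}]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], xh), p["bz"])]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], xh), p["br"])]  # reset gate
    # The reset gate scales the previous state before the candidate is computed.
    xrh = x + [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], xrh), p["bh"])]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
```

With all weights and biases set to zero, both gates equal 0.5 and the candidate state is 0, so the step simply halves the previous state; this makes a convenient sanity check.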


Convolutional Layers
A one-dimensional convolutional neural network (1DConvNet) can handle local patterns in time-series sequences. In time-series forecasting, time is treated as a spatial dimension (similar to the height or width of a two-dimensional image), which suits the present context. Identical input transformations are performed on all extracted patches, so a specific pattern learned at one position can be recognized at a different position. Figure 5 shows the processing of a single-feature input over time by the 1DConvNet. The window size used for sequence processing can be predefined, and fragments of the sequence are learned. These learned subsequences can then be identified wherever they occur in the overall sequence. 'Max pooling' reduces the lengths of the input sequences while the CNN learns their critical parameters. In this paper, the researchers applied the 1DConvNet in the proposed models, as introduced in Section 3.
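The two operations described above can be illustrated for a single-feature sequence with a minimal plain-Python sketch. The function names are illustrative; `conv1d` slides one kernel over the sequence without padding (the cross-correlation form used by deep-learning frameworks), and `max_pool1d` keeps the maximum of each non-overlapping window.

```python
def conv1d(seq, kernel, bias=0.0):
    """Slide a kernel over a 1D sequence (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(w * seq[i + j] for j, w in enumerate(kernel)) + bias
        for i in range(len(seq) - k + 1)
    ]


def max_pool1d(seq, size=2):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


# The same kernel responds to the same local pattern at two positions.
signal = [0, 1, 0, 0, 0, 1, 0]
response = conv1d(signal, [0, 1, 0])
print(response)  # [1, 0, 0, 0, 1]
```

Because the kernel weights are shared across positions, the peak pattern is detected both at the start and at the end of the sequence, and pooling then shortens the resulting feature sequence.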

Hybrid Models
In this paper, hybrid CNN-LSTM and CNN-GRU models are used to predict local PM10 and PM2.5 concentrations. The authors assume that the one-dimensional convolutional layer can recognize local patterns, as is characteristic of the 1DConvNet, while the recurrent layers capture patterns useful for forecasting. By combining convolutional layers and RNN layers, the hybrid models were expected to capture hidden patterns and deliver reliable predictions.
The structure of the CNN-LSTM model is illustrated in Figure 6. The model features four layers with different neuron types and numbers. The data are fed into a one-dimensional convolutional layer; an LSTM layer is stacked on top of that layer, followed by a fully connected dense layer, and the top layer is the output layer with two neurons.
To allow among-model comparisons, the CNN-GRU model has a similar structure. The model processes nine input variables. The first layer is a one-dimensional convolutional layer with 64 neurons, the second layer is a recurrent layer with 32 GRUs, and the third and final layers are dense. The principal difference between the two models is the inner structure of the recurrent layer.
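To make the layer stack concrete, the following sketch traces output shapes and counts trainable parameters for both hybrid variants. It is not the authors' implementation. The 64 convolutional units, 32 recurrent units, nine inputs, two outputs, and the 24 h input window come from the text. The kernel size of 3, the absence of padding, and the 16-unit dense layer are assumptions chosen only for illustration. The parameter formulas are the textbook ones; framework implementations may add extra bias terms.

```python
def conv1d_params(in_channels, filters, kernel):
    return (kernel * in_channels + 1) * filters


def lstm_params(in_dim, units):
    # Four weight sets: forget, input, candidate, and output.
    return 4 * (units * (in_dim + units) + units)


def gru_params(in_dim, units):
    # Three weight sets: update, reset, and candidate.
    return 3 * (units * (in_dim + units) + units)


def dense_params(in_dim, units):
    return (in_dim + 1) * units


def hybrid_summary(recurrent, lookback=24, features=9, filters=64,
                   kernel=3, rnn_units=32, dense_units=16, outputs=2):
    """Return (layer name, output shape, parameter count) for each layer."""
    rnn = {"lstm": lstm_params, "gru": gru_params}[recurrent]
    steps = lookback - kernel + 1  # no padding, stride 1
    return [
        ("conv1d", (steps, filters), conv1d_params(features, filters, kernel)),
        (recurrent, (rnn_units,), rnn(filters, rnn_units)),
        ("dense", (dense_units,), dense_params(rnn_units, dense_units)),
        ("output", (outputs,), dense_params(dense_units, outputs)),
    ]


for name in ("lstm", "gru"):
    for layer in hybrid_summary(name):
        print(name, layer)
```

Under these assumptions the two variants differ only in the recurrent layer, where the GRU needs three weight sets instead of four.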


General Workflow
In this paper, the authors explored how well four models predicted air pollutant concentrations in several regions of Seoul. The training workflow is illustrated in Figure 7. The steps are:
1. Data choice. Good-quality training data are essential. For each specific model, the data type and attributes must be chosen carefully. The authors used pollution data and meteorological data to train the models, combining several training factors (hourly particulate concentrations and meteorological data) into one dataset and aligning the different types of data at the same time points.
2. Data preprocessing. The data were preprocessed via different methods to generate inputs to the NNs. The specific preprocessing methods used met the requirements of the training models. For example, the first layer of (the input to) the LSTM model was an LSTM layer.
3. Model training. The models use input data to learn hidden features. The training model structures differ. For the LSTM model in Figure 7, the green layer with recurrent units is an LSTM layer, the white layer is fully connected, and the last layer is the output layer. Training requires tuning of the various hyperparameters that affect training and model performance.
4. Hyperparameter tuning. During model training, many hyperparameters must be defined or modified to optimize predictions. Model hyperparameters differ, and all models were optimized before comparison.
5. Output generation. After training, the best model was identified, and the success of training was evaluated by inputting test data. Both the PM10 and PM2.5 concentrations served as outputs. Thus, the output layers featured two neurons.
6. Model comparison. All models were trained to generate predictions. The authors used all models to predict air pollution concentrations in the same area and then identified the most suitable model.


Evaluation Methods
In this paper, root mean square errors (RMSEs) and mean absolute errors (MAEs) were calculated when comparing predictions with actual values. Smaller values indicate better performance. The RMSE gives relatively high weight to large errors. The formulae are:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where $n$ is the number of samples in the test set, $y_i$ is the $i$-th predicted value, and $\hat{y}_i$ is the corresponding real value.
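The two metrics are straightforward to compute; the following plain-Python sketch (with illustrative function names) also shows why the RMSE is more sensitive to large errors than the MAE.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def mae(predicted, observed):
    """Mean absolute error between two equally long sequences."""
    n = len(predicted)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / n


# One large miss among otherwise perfect predictions.
pred = [1.0, 2.0, 3.0, 10.0]
real = [1.0, 2.0, 3.0, 2.0]
print(rmse(pred, real), mae(pred, real))  # 4.0 2.0
```

In this example a single error of 8 yields an MAE of 2 but an RMSE of 4, because squaring amplifies the largest deviation before averaging.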

Model Tuning
The model structure must be defined before performance can be compared. Many NN factors set before training affect performance; when one factor changes, the resulting prediction changes. These factors, termed hyperparameters, must be carefully chosen. A model is optimized via hyperparameter tuning. General hyperparameters include the number of layers in a neural network, the number of nodes in each layer, the learning rate, and the layer activation and loss functions. Initially, the authors optimized the GRU model and ensured that the hyperparameters of the other models were similar to the GRU values. For example, a simple GRU model with four layers and 32 neurons in the recurrent layers was used, and the columns containing PM10 and PM2.5 data were shifted by a given length, called the 'lookback'. The lookback length was tuned because it can dramatically affect predictions: the researchers changed the lookback while holding the other hyperparameters fixed, as shown in Table 4. When the lookback increased, the predictions either failed to improve or deteriorated monotonically. A long lookback (72 h) compromised performance, as irrelevant data were included. With a too-short lookback (<12 h), the results were unstable, perhaps because previous data were lacking; nonetheless, the researchers used these data. In Table 4, both PM10 and PM2.5 predictions had the lowest error when the lookback was set to 24 h. Therefore, the researchers used 24 h as the lookback for these models.
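The role of the lookback can be illustrated with a sliding-window sketch. The function `make_windows` is a hypothetical helper, not the authors' code: each sample consists of the previous `lookback` hourly rows as input and the PM values a given number of hours ahead as target. The assumption that the PM10 and PM2.5 columns sit at indices 0 and 1 is for illustration only.

```python
def make_windows(rows, lookback=24, horizon=1, target_cols=(0, 1)):
    """Build (input window, target) pairs from time-ordered feature rows."""
    inputs, targets = [], []
    for end in range(lookback, len(rows) - horizon + 1):
        inputs.append(rows[end - lookback:end])    # the last `lookback` hours
        future = rows[end + horizon - 1]           # row `horizon` hours ahead
        targets.append([future[c] for c in target_cols])
    return inputs, targets


# Toy data: 30 hourly rows with three features each.
rows = [[t, 10 * t, 0.5] for t in range(30)]
X, y = make_windows(rows, lookback=24)
print(len(X), len(X[0]), y[0])  # 6 24 [24, 240]
```

A longer lookback gives each sample more history but fewer samples and more potentially irrelevant inputs, which is the trade-off explored in Table 4.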
To find a suitable number of layers and number of neurons in each layer, the researchers first tuned a single model and then applied the layer structure of the tuned model to the other models. Table 5 shows the prediction results obtained when tuning the GRU model. When tuning layer and neuron numbers, the other hyperparameters were fixed.
To explore the number of layers, each column of Table 5 corresponds to a setting in which all layers have the same number of units, and italic values mark the best case in each column. Comparing the prediction results of different layer numbers with the same unit numbers, the four-layer structure has seven best cases for PM10 and PM2.5 overall, the five-layer structure has five best cases, and the three-layer structure did not perform well. Thus, the researchers chose the four-layer and five-layer structures for further experiments with the GRU model. These experiments kept the number of layers fixed and increased the number of neurons in each layer from 16 to 512, as shown in Table 5. Bold values mark the best case in each row. For the four-layer structure, 64 units performed best for PM10 and 16 units for PM2.5. For the five-layer structure, 32 units performed best for both PM10 and PM2.5 prediction. When the layer number was held constant and the neuron number increased, the model did not improve, while training time and resource consumption grew. Thus, the maximum neuron number in each layer was set to 64. The researchers chose 64, 32, and 16 as the numbers of neurons for further experiments.
Other hyperparameters (number of epochs, learning rate, layer drop rate, activation and loss functions, and weight initialization scheme) were similarly tuned. The models featured CNN, RNN, and fully connected layers. The authors tuned the single GRU and LSTM models and then built hybrid models with the same layer and unit numbers. All models were trained using the same hardware, software, and dataset; all models ran in TensorFlow on an Nvidia Quadro 4-core P4000 GPU (graphics processing unit) and an Intel Xeon 3.3 GHz CPU (central processing unit). The layers of the various models had different attributes, the details of which are listed in Table 6. The LSTM model featured RNN layers with internal LSTM units. The GRU model was similar to the LSTM model, but the recurrent units differed. The hybrid models featured one convolutional and several recurrent layers.
The GRU model featured four layers, the first and second of which were recurrent, with GRU inner units and the neuron numbers listed above. The third layer was fully connected, and the last (output) layer had two neurons. The structure and (certain) attributes of the LSTM model were similar to those of the GRU model. The principal difference lay in the inner structure of the recurrent layers, as the LSTM model employed LSTM units. In the CNN-GRU model, the first layer was one-dimensional convolutional, and the second was recurrent with GRUs. The last two layers were fully connected. The principal difference between the hybrid models was that the second layer of the CNN-LSTM model was recurrent with LSTM units rather than with GRUs. For all training models, the authors principally used the mean squared error (MSE) as the loss function for optimizing parameters. The Adam optimization algorithm was employed with the learning rate set to 0.01. Early stopping was used to halt training when the validation loss had not improved for three epochs.
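The early stopping rule can be sketched independently of any framework. The function below is a hypothetical illustration of a patience of three epochs: it scans a list of validation losses and returns the epoch at which training would stop. How the authors implemented the rule in TensorFlow is not stated in the text.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; if that never happens,
    the last epoch is returned.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


losses = [0.90, 0.70, 0.60, 0.61, 0.65, 0.62, 0.50]
print(early_stopping_epoch(losses))  # 6
```

In the example, the best loss is reached at epoch 3 and the next three epochs bring no improvement, so training stops at epoch 6 and never sees the lower loss at epoch 7; this is the usual trade-off of a short patience.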

Results
The four models were used to forecast up to 15 days of PM concentrations in Seoul. All models were trained using the same datasets. As forecast length increased, accuracy decreased. Table 7 shows the predictions of the GRU model for one station. The 15-day predictions remained reliable, but the most dependable forecasts in this paper extend up to 7 days. Tables 8 and 9 list the 7-day PM10 and PM2.5 predictions for five Seoul stations derived using the four models. The five stations were Gangnam, Songpa, Seocho, Gangseo, and Geumcheon. All models were trained using the Gangnam dataset. To explore model versatility, four of the remaining 38 stations were randomly selected, and 7-day PM concentrations were predicted.
Table 8 shows the PM10 predictions. Of the single models, the LSTM model performed better than the GRU model at all five stations. The hybrid models were better than the single models at all five stations, and the CNN-GRU model outperformed all other models at every station for PM10 prediction. The CNN-GRU model predicted PM10 best at the Gangseo and Geumcheon stations; the other models also performed better at these two stations than at the other stations.
Table 9 shows the PM2.5 predictions. The single models performed similarly at the five stations. The GRU model better predicted the Gangseo and Geumcheon observations, whereas the LSTM model was better than the GRU model for Gangnam, Songpa, and Seocho. Both single models performed better at the Gangseo and Geumcheon stations. The hybrid models outperformed the single models at all stations, and the CNN-LSTM model generally predicted PM2.5 better than the CNN-GRU model. The CNN-GRU model predicted better at the Songpa and Geumcheon stations than at the other stations, while the CNN-LSTM model performed better at the Songpa and Seocho stations.
Figures 8-11 below illustrate the 5-day Gangnam-gu predictions of all models. Each figure compares the PM 10 predictions (up), and PM 2.5 predictions (down) yielded by a specific model. The solid blue lines are the real data, and the dashed orange lines show the predicted concentrations. The y-axis is the PM 10 or PM 2.5 concentration, and the x-axis shows time in hours. The GRU model generally well-predicted PM trends. In terms of PM 10 predictions, the GRU model failed to predict the highest and lowest concentrations over the 5 days. This is of great concern in the sense that such errors compromise early warning. In terms of PM 2.5 predictions, the GRU model predicted high-level pollution episodes, but not periods of low PM.
Compared to the GRU model, the LSTM model better predicted PM 10 levels. As shown below, the LSTM predictions were reliable and tracked changing PM concentrations well. The LSTM model well-predicted the highest PM 10 and PM 2.5 concentrations but poorly predicted the lowest concentrations.
The CNN-GRU model well-predicted the highest and lowest concentrations. The PM 10 predictions reliably outperformed those of other models, and the predictions for the lowest PM 2.5 levels were fairly reliable, though not perfect.
The CNN-LSTM model reliably predicted the levels of both PM types. For PM 2.5 in particular, this model outperformed the other models and could be used for early warning of high-level PM concentrations. The PM 10 predictions were generally reliable, though not perfect.
Figure 12 shows the PM 10 predictions of all models over 3 days. The solid blue line shows the real data. Generally, the GRU model exhibited the poorest match to the real data, while the other models were reliable. The hybrid models generally performed better than the single models.
In terms of predicting PM 2.5 levels, both the GRU and LSTM models were weak in predicting the future highest and lowest levels. All models simulated changing concentrations. The two hybrid models forecast the extreme episodes and generally outperformed the single models. The CNN-GRU model best predicted PM 10 levels, and the CNN-LSTM model best predicted PM 2.5 levels.
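The model comparison above rests on how closely each predicted series tracks the observed one, which the paper quantifies with RMSE and MAE. The following is a minimal sketch of that comparison, not the authors' evaluation code. The function names, the model labels (`model_a`, `model_b`), and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def mae(observed, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rank_models(observed, predictions):
    """Return (name, rmse, mae) tuples sorted by ascending RMSE."""
    scores = [
        (name, rmse(observed, pred), mae(observed, pred))
        for name, pred in predictions.items()
    ]
    return sorted(scores, key=lambda score: score[1])


# Hypothetical hourly concentrations; these are NOT values from the study.
observed = [30.0, 42.0, 55.0, 38.0]
predictions = {
    "model_a": [31.0, 40.0, 52.0, 39.0],
    "model_b": [35.0, 47.0, 48.0, 33.0],
}

for name, r, m in rank_models(observed, predictions):
    print(f"{name}: RMSE={r:.3f} MAE={m:.3f}")
```

Ranking by RMSE penalizes large misses at concentration peaks more heavily than MAE does, so reporting both gives a fuller picture of how a model handles extreme episodes.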

Conclusions
In this paper, four predictive models were compared in terms of their ability to forecast air pollution several days ahead in different areas of Seoul. All models were trained using the same dataset and the same software and hardware. The principal contributions of this study are as follows:
(1) The two hybrid models that combined convolutional and recurrent layers yielded reliable predictions 15 days in advance.
(2) The LSTM model, which is structurally similar to the GRU model, performed better than the GRU model.
(3) The CNN-GRU and CNN-LSTM hybrid models performed better than the single models.
(4) The CNN-GRU hybrid model better predicted PM 10 levels, and the CNN-LSTM model better predicted PM 2.5 levels.
(5) Meteorological data (auxiliary variables) improved the training accuracy of all models.
The new models forecast PM 2.5 levels better than PM 10 levels. In future research, the authors will apply these models to other cities and explore the seasonality and spatiotemporal characteristics of the datasets to optimize forecast accuracy. Other hybrid models, such as fuzzy neural and other neural network models, will also be applied to optimize the proposed methodology. Additional data sources, such as related air pollutants, will be used to improve the models and examine other regions.