Applying PCA to Deep Learning Forecasting Models for Predicting PM2.5

Abstract: Fine particulate matter (PM2.5) is one of the main air pollution problems that occur in major cities around the world. A country's PM2.5 can be affected not only by domestic factors but also by the air quality of neighboring countries. Therefore, forecasting PM2.5 requires collecting data from outside the country as well as from within, which is necessary for policies and plans. A data set with many variables but a relatively small number of observations can cause a dimensionality problem and limit the performance of a deep learning model. This study used five years of daily data to predict PM2.5 concentrations in eight Korean cities through deep learning models. PM2.5 data of China were collected and used as input variables, and principal component analysis (PCA) was applied to solve the dimensionality problem. The deep learning models used were a recurrent neural network (RNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM). The performance of the models with and without PCA was compared using root-mean-square error (RMSE) and mean absolute error (MAE). As a result, the application of PCA in LSTM and BiLSTM, but not in the RNN, showed better performance: decreases of up to 16.6% and 33.3% in RMSE and MAE values, respectively. The results indicate that applying PCA in deep learning time series prediction can yield practical performance improvements, even with a small number of observations. It also provides a more accurate basis for the establishment of PM2.5 reduction policy in the country.


Introduction
Fine particulate matter (PM2.5) indicates particles with an aerodynamic diameter of 2.5 μm or less. It is not a specific chemical, such as sulfur oxides (SOx) or nitrogen oxides (NOx), but a mixture of particles of varying sizes, components, and shapes. Typical substances that form PM2.5 include elemental carbon (EC), organic carbon (OC), NOx, volatile organic compounds (VOCs), ozone (O3), ammonia (NH3), SOx, condensate particles, metal particles, mineral particles, etc. Because of its small size, it penetrates the body through the respiratory tract, causing inflammation or damaging organs [1]. The WHO considers PM2.5 a major environmental risk factor that causes cardiovascular and respiratory diseases and various cancers [2]. Figure 1 shows the effects of PM2.5 on the body [3].
Korea's PM2.5 concentration was the highest among the 37 OECD (Organization for Economic Co-operation and Development) countries in 2019 [4], and studies have shown that it has a negative effect on people's health. Han et al. [5] stated that 1763 early deaths in Seoul in 2015 were closely related to PM2.5. Hwang et al. [6] explained that, when the average annual concentration of PM2.5 in Seoul increases by 10 μg/m³, the risk of death from PM2.5-related diseases (ischemic heart disease, chronic obstructive lung disease, lung cancer, and cerebrovascular diseases) increases by 13.9%. This is in line with the major causes of death for Koreans in 2019. Statistics Korea shows that cancer (158.2 deaths per 100,000 people), cardiovascular diseases (60.4 deaths per 100,000 people), and pneumonia (45.1 deaths per 100,000 people) are the three major causes of death [7]. This suggests that PM2.5 is highly correlated with the main causes of death for Koreans. The Korean government is making great efforts to reduce PM2.5 concentration to protect people's health. The government has divided the crisis into three stages according to the current status and prediction of PM2.5 concentration and has devised a manual for local governments for each stage of action. The government also aims to reduce the annual average concentration of PM2.5 by 35% compared to 2016 by establishing a five-year plan for PM2.5 concentration reduction. To achieve this purpose, the government selected 15 major tasks by evaluating their reduction potential, cost effectiveness, linkage with other policies, and social impact. These tasks are implemented by each local government [8]. Table 1 shows Korea's crisis stage standard for PM2.5 concentration, which reflects both the current concentration of PM2.5 and future forecast values. It suggests that the accurate prediction of PM2.5 concentration is needed in both the short and long terms.
In this regard, several studies have conducted air quality prediction using deep learning methods with domestic data (wind speed, NO2, SO2, temperature, etc.) in Korea, and new deep learning models have been developed that show high performance in air quality prediction [9,10]. However, foreign factors should also be considered in predicting PM2.5 concentration in Korea, as the concentration of PM2.5 in the Shandong region of China has been found to affect Korea's PM2.5 concentration [11]. Because China's past PM2.5 concentration data are composed of daily data, Korea's data should also be organized on a daily basis for deep learning PM2.5 prediction. This data composition can cause a "curse of dimensionality" due to the small number of observations compared to variables, which can reduce the performance of the model. This study aims to show that applying principal component analysis (PCA) to deep learning time series prediction models for PM2.5 (a recurrent neural network (RNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM)) can result in better performance, by comparing the root-mean-square error (RMSE) and mean absolute error (MAE) with the same models without PCA application.

Previous Research
Several studies have shown the association of PM2.5 with lung and cardiovascular disease (CVD). Wang et al. [12] reported that CVD is one of the main mortality factors among elderly people. It was found that ambient PM2.5 concentration is related to several CVDs by linking PM2.5 exposure and CVD through multiple pathophysiological mechanisms. César et al. [13] showed that exposure to PM2.5 can cause hospitalizations for pneumonia and asthma in children younger than 10 years of age, through an ecological time series study and a generalized additive model of Poisson regression. Kim et al. [14] reported associations of short-term PM2.5 exposure with acute upper respiratory infection and bronchitis among children aged 0-4 years, through a difference-in-differences approach generalized to multiple spatial units (regions) and time periods (days) with distributed lag non-linear models. Vinikoor-Imler et al. [15] studied the relationship between PM2.5 concentration, lung cancer incidence, and mortality by linear regression and concluded that there is a possible association between them. Choe et al. [16] estimated the effect of changes in PM2.5 emissions on the probabilities of outpatient visits and hospitalization due to respiratory diseases through Probit and Tobit models: if PM2.5 emissions change by 1%, the probability of a visit due to respiratory diseases increases by 0.755% to 1.216%, and the probability of hospitalization increases by 0.150% to 0.197%.
The need for PM prediction research is emerging, and various studies on PM prediction are underway. Zev Ross et al. [17] developed a land use regression model to predict PM2.5 in New York City and showed that urbanization factors such as traffic volume and population density have high explanatory power in predicting PM2.5. Rob Beelen et al. [18] compared the performance of ordinary kriging, universal kriging, and regression mapping in developing EU-wide maps of air pollution and showed that universal kriging performs better in mapping NO2, PM10, and O3. Vikas Singh et al. [19] suggested a cokriging-based approach and interpolated PM at locations not observed in the monitoring network, using the results of a deterministic chemical transport model (CTM) simulation as a secondary variable. The results showed that the proposed method provides flexibility in collecting fine particulate data.
Other studies have shown examples of predicting PM2.5 through machine learning and deep learning. Zhao et al. [20] predicted the PM2.5 contamination of stations in Beijing using long short-term memory-fully connected (LSTM-FC), LSTM, and an artificial neural network (ANN) with historical air quality data, meteorological data, weather forecast data, and day-of-the-week data. They showed that the LSTM-FC model outperforms LSTM and the ANN, with MAE = 23.97-50.13. Another study [22] used XGBoost (XGB), the light gradient boosting machine (LGBM), the gated recurrent unit (GRU), convolutional neural network-LSTM (CNN-LSTM), BiLSTM, and LSTM to predict PM2.5 concentration at eight sites in Seoul and Gwangju with community multiscale air quality (CMAQ) data. The result showed that LSTM performs best, with MAE = 3.5847 μg/m³, RMSE = 4.8292 μg/m³, R² = 0.8989, and IA = 0.9368 of the mean over all sites.
The RNN, LSTM, and BiLSTM models were used in this study because previous studies have shown that deep learning sequence models perform better in prediction. As in previous studies, local weather and air quality data were used as predictive input variables for PM2.5. The regional data of China, which were found to affect PM2.5 in Korea, were also used as predictive input variables. Figure 2 shows the spatial range of the research. A total of eight cities in Korea were selected for analysis. Of the eight cities, six are metropolitan cities (Busan, Daejeon, Daegu, Gwangju, Incheon, and Ulsan) representing each province, one is the capital city (Seoul), and one is the most populous city (Wonju) in the province without a metropolitan city. In each city, daily air quality data (PM2.5, SO2, O3, NO2, and CO) [23] and meteorological data (temperature, wind speed, wind direction, humidity, precipitation, etc.) [24] were collected in consideration of the internal factors of PM2.5 generation. Air quality data were collected within 5 km of each city's meteorological observatory. Figure 3 shows that winds in Korea blow mainly from the north and west. As a result, the air quality of Korea can be directly and indirectly affected by the air quality of China, a country located to the west and north. Figure 4 [25] also shows the concentration of PM2.5 in Korea and China at the same time before and after the outbreak of COVID-19. According to Bao et al. [26], the lockdown of Chinese factories after the COVID-19 outbreak actually improved Chinese air quality. Considering this, together with the direction of the wind in Korea, we can see that the air quality of Korea is highly affected by the air quality in China. Accordingly, daily PM2.5 concentrations in 55 areas of China close to Korea were selected as input variables in this study, including the PM2.5 concentration in Shandong province, which was found to increase PM2.5 concentration in Korea.

Data Preprocessing
All variables have a time range from 1 January 2015 to 31 December 2019 and were collected as daily data. Some variables contain missing values, which were imputed by the exponentially weighted moving average (EWMA) using the imputeTS package of the R software [27]. The EWMA gives higher weights to the latest data, reducing the weight of older values. The formula for EWMA imputation suggested by Hunter [28] is as follows:

ŷ_{t+1} = ŷ_t + α e_t,  e_t = y_t − ŷ_t, (1)

where ŷ_t is the predicted value at time t, y_t is the observed value at time t, e_t is the observed error at time t, and α is a constant weight between zero and one. The higher the value of α is, the less the prediction reflects past data. The collected data are visualized in Appendix A (Figures A1-A3). Each variable takes values in a different range due to differences in units of measurement and in characteristics within each region. In the case of the Chinese data, the concentration of PM2.5 in each city appears roughly constant over time, but some cities have outliers. If one variable has a relatively greater value, or a wider range of values, than the others, it can have a disproportionate impact on the predicted value, regardless of the predictive importance of the variable. To solve this problem, the range of the variables should be adjusted through normalization. In this study, maximum-minimum normalization was carried out on all data of each city, as shown in the following equation:

x' = (x − x_min) / (x_max − x_min).

Because the wind direction data were collected as 16 cardinal points, they were label encoded to transform the direction data into numerical data.
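As a minimal illustration of these two preprocessing steps, EWMA imputation and max-min normalization can be sketched in Python (the study itself used the imputeTS package in R; the function names and the starting-value convention here are illustrative):

```python
import numpy as np

def ewma_impute(series, alpha=0.3):
    """Fill missing values (NaN) with the running EWMA prediction,
    mirroring Hunter's scheme y_hat(t+1) = y_hat(t) + alpha * e(t).
    Assumes the series starts with an observed value."""
    filled = np.asarray(series, dtype=float).copy()
    level = None  # running EWMA prediction
    for t in range(len(filled)):
        if np.isnan(filled[t]):
            filled[t] = level  # impute with the prediction so far
        level = filled[t] if level is None else alpha * filled[t] + (1 - alpha) * level
    return filled

def minmax_scale(x):
    """Max-min normalization: map each column onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
```

A higher alpha weights recent observations more heavily, matching the description of Equation (1).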

Variable Correlation Analysis
As mentioned above, the prediction target of this study is the concentration of PM2.5. The efficiency of forecast results in deep learning and machine learning depends on the correlation between the dependent and independent variables; it is important to add independent variables with a strong negative or positive correlation with the dependent variable. In addition, correlation results are necessary for data analysis because they provide a basis for determining the influence of each independent variable on the dependent variable. In this study, the Pearson correlation coefficient was calculated, which is expressed through the covariance and standard deviations of the variables. For observation vectors X = (x_1, …, x_n) and Y = (y_1, …, y_n):

r_{XY} = Cov(X, Y) / (σ_X σ_Y) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ( √(Σ_{i=1}^{n} (x_i − x̄)²) √(Σ_{i=1}^{n} (y_i − ȳ)²) ).

Each element in the correlation matrix has a value between −1 and 1: a value greater than 0 indicates a positive correlation, and a value less than 0 indicates a negative correlation. The correlation matrix is symmetric, and all of its diagonal elements are 1, i.e., r_{ij} = r_{ji} and r_{ii} = 1 for i, j ∈ {1, 2, …, p}. Figure 6 visualizes the correlation between PM2.5 concentration and the eight most correlated factors in Seoul, Korea. Appendix A Tables A2-A9 show the correlation between PM2.5 concentrations and the meteorological and air quality factors of each city in Korea. Overall, the factors with a strong positive correlation with PM2.5 are the air quality factors, except for O3. PM2.5 also shows a positive correlation with local air pressure (LAP), sea-level pressure (SP), wind direction, and relative humidity. Conversely, temperature, wind speed, O3, wind flow sum (the distance that the air flows; the Korea Meteorological Administration produces a day-to-day, 24 h wind flow sum), and daily precipitation were found to have a negative correlation with PM2.5 concentrations.
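For concreteness, the coefficient can be computed directly from this definition (a sketch; variable names are illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# agrees with numpy's built-in correlation matrix
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])
assert np.isclose(pearson_r(x, y), np.corrcoef(x, y)[0, 1])
```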
However, for variables with a relatively weak correlation with PM2.5, the sign of the correlation changed depending on the region. As Table 2 shows, the correlations between the PM2.5 concentrations of the Chinese cities and those of the Korean cities lie overall between 0.13 and 0.55. Comparing this with the factors inside the Korean cities, we can see that the PM2.5 concentration of each Chinese city is as strongly related to the PM2.5 concentration in Korea as the air quality data inside the city itself. This suggests that China's PM2.5 concentration could be an important independent variable in predicting PM2.5 concentrations in Korea.

PCA
PCA reduces dimensionality through linear combinations of variables that retain a high share of the overall variability of the data, explaining variation in high-dimensional data in low dimensions. A vector of p variables X = (X_1, …, X_p)ᵀ with covariance matrix Σ can have at most p principal components. The first principal component is the linear combination

z_1 = a_1ᵀ X, (9)

and, because it is a linear combination of X, its variance can be expressed as

Var(z_1) = a_1ᵀ Σ a_1. (10)

PCA has to preserve the variance of the original data as much as possible, so Equation (10) should be maximized. Therefore, generating the principal component becomes the problem of finding the a_1 that maximizes a_1ᵀ Σ a_1 under the condition a_1ᵀ a_1 = 1. Applying Lagrange's multiplier method to Equation (10) gives

L = a_1ᵀ Σ a_1 − λ (a_1ᵀ a_1 − 1). (11)

Partially differentiating Equation (11) with respect to a_1 and setting the result to zero,

∂L/∂a_1 = 2 Σ a_1 − 2 λ a_1 = 0, (12)

yields

Σ a_1 = λ a_1. (13)

Equation (13) shows that λ is an eigenvalue of Σ and a_1 is the corresponding eigenvector. As a result, the linear combination that maximizes Equation (10), i.e., the principal component, can be expressed as Equation (9). In addition, Equation (10), the variance of the principal component, reduces to a_1ᵀ Σ a_1 = λ a_1ᵀ a_1 = λ under the condition a_1ᵀ a_1 = 1. Therefore, for a vector of p variables, the i-th principal component is

z_i = a_iᵀ X, (15)

and its variance is

Var(z_i) = λ_i, (16)

where λ_i is the i-th largest eigenvalue of Σ. Subsequently, the number of principal components is conventionally selected so that the retained components account for more than 80% to 90% of the total variance. That is, the number of components i out of p is chosen such that

(λ_1 + ⋯ + λ_i) / (λ_1 + ⋯ + λ_p) ≥ 0.8 to 0.9. (17)
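This derivation reduces to an eigendecomposition of the covariance matrix, which can be verified numerically (a sketch on synthetic data; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # 100 observations of p = 4 variables
Xc = X - X.mean(axis=0)            # center the data

# Solve Sigma a = lambda a for the covariance matrix Sigma
Sigma = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]  # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The i-th principal component z_i = a_i^T X has variance lambda_i
Z = Xc @ eigvecs
assert np.allclose(np.var(Z, axis=0, ddof=1), eigvals)

# Cumulative proportion of variance explained: the selection criterion
explained = np.cumsum(eigvals) / eigvals.sum()
```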

RNN
The RNN is a deep learning model for processing sequence data, such as stock charts [30], music [31], and natural language [32]. It remembers the state entered at the previous time point (t − 1) through the hidden layer and passes the hidden layer state at the current time point (t) to the next time point (t + 1). That is, the state at the previous time point affects the state at the present time point, and the state at the present time point affects the state at the next time point. This procedure is repeated until the result values become optimized; hence the name "recurrent neural network." Figure 8b is the unrolled and inner structure of Figure 8a. In Equations (18)-(20), x_t is the input and h_t is the hidden state at time t, W_ij is the weight from layer i to layer j, and b is the bias in each layer:

a_t = W_xh x_t + W_hh h_{t−1} + b_h, (18)
h_t = tanh(a_t), (19)
ŷ_t = W_hy h_t + b_y. (20)

In Equation (21), E_t is the loss at time t, and y_t and ŷ_t are the actual and predicted values, respectively, at time point t:

E_t = (y_t − ŷ_t)². (21)

The RNN model shares the weights and biases across all time points and circulates the input data to output the results. Model training is repeated until the loss value is minimized by gradient descent on the loss function, using information from previous time steps; at the same time, the weights are updated to find the optimum values. This is called backpropagation through time (BPTT), and in an RNN it can be expressed as follows [33]:

∂E/∂W_hh = Σ_t ∂E_t/∂W_hh, (22)
∂E_t/∂W_hh = Σ_{k=1}^{t} (∂E_t/∂ŷ_t)(∂ŷ_t/∂h_t)(Π_{j=k+1}^{t} ∂h_j/∂h_{j−1})(∂h_k/∂W_hh), (23)
∂h_j/∂h_{j−1} = W_hhᵀ diag(1 − h_j²), where each element of 1 − h_j² lies in (0, 1]. (24)
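The unrolled recurrence amounts to a short loop with weights shared across time steps (a minimal numpy sketch; weight names such as W_xh are illustrative):

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Vanilla RNN forward pass with weights shared across time:
       h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h),  y_t = W_hy h_t + b_y."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # state carried to the next step
        outputs.append(W_hy @ h + b_y)
    return np.array(outputs), h
```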

LSTM and BiLSTM
In an RNN, tanh is used as an activation function to train the model in a non-linear way. However, the RNN's BPTT suffers from a long-term dependency problem caused by the "vanishing gradient" problem: the gradient (the weight update rate) vanishes as factors less than 1 (the derivative of the tanh function with respect to h_t) are multiplied repeatedly. Thus, the state at a relatively distant past time point has almost no effect on the output at the present time point. As a result, the model relies only on short-term data and is limited in achieving the best performance. To solve this problem, Hochreiter et al. [34] proposed the LSTM model. Figure 9 shows the internal structure of LSTM and its process. LSTM adds to the RNN model a forget gate (f_t), an input gate (i_t), a candidate cell state (c̃_t), a cell state conveyed across time points (c_t), and an output gate (o_t). In particular, the cell state c_t, which runs through all time points, greatly contributes to solving the long-term dependency problem. The order of the gates and the internal algorithm can be explained by the following process. Equation (25), the output of the forget gate, determines whether the historical state is forgotten based on the combination of x_t and h_{t−1}:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f). (25)

The output of this step is converted to a number between 0 and 1 by the sigmoid function and multiplied by c_{t−1} (the memory of past data, i.e., the historical state) to determine how much past data to preserve or forget: a value of 0 indicates forgetting, and 1 indicates fully memorizing past data. Equations (26) and (27) compute the input gate and the candidate cell state, which determine how much new information is written to the cell state:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i), (26)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c). (27)

BiLSTM is a variant of the bidirectional RNN proposed by Schuster et al. [35]. Figure 10 shows an example of applying the bidirectional approach to sentence learning: if sentence (A) is fed to the model and "went" is set as the target, (B) predicts in a forward way, and (C) predicts in both a forward and a backward way.
Whereas LSTM uses only the historical state to predict the value at time point t, bidirectional LSTM predicts the value at time point t by adding an LSTM layer that reads the data from the future state backward. The computations within the model are the same as those of LSTM, and both LSTM and BiLSTM update their weights during training in the same way as an RNN [36].
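A single LSTM step, with the forget, input, and output gates described above, can be sketched as follows (the standard LSTM cell equations; weight shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step on the concatenated vector [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)        # forget gate: how much of c_{t-1} to keep
    i = sigmoid(W_i @ z + b_i)        # input gate: how much new content to write
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate cell state
    c = f * c_prev + i * c_tilde      # cell state: the long-term memory path
    o = sigmoid(W_o @ z + b_o)        # output gate
    h = o * np.tanh(c)                # hidden state passed to the next step
    return h, c
```

A BiLSTM simply runs one such cell forward and a second cell backward over the sequence and concatenates the two hidden states.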

Evaluation Model Performance
In this study, the mean absolute error (MAE) and root-mean-square error (RMSE) were used as evaluation indicators to compare the performance of each model with and without PCA application. The indicators are calculated as follows:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,
RMSE = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ),

where y_i is the actual value, ŷ_i is the predicted value, and n is the number of observations.
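The two indicators can be computed directly (a minimal sketch):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root-mean-square error; penalizes large errors more than MAE."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```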

Workflow
The flow of this study is divided into four stages: data collection, data preprocessing, prediction, and evaluation ( Figure 11). The application of PCA is used in the data preprocessing stage, aiming to reduce the number of variables and increase the performance of model predictions. Thus, the data preprocessing stage was divided into two cases. Case 1 was set as a prediction without a PCA application, and Case 2 was set as a prediction with the PCA application. Afterwards, each case will be compared using evaluation indicators (MAE and RMSE).
Figure 11. Workflow of the PCA application deep learning model for predicting PM2.5.

PC Selection
For each city, PCA was performed on the input variables, excluding the dependent variable, PM2.5. The variance of each city's data was explained by a relatively small number of principal components, and five principal components were selected in all cities. This reduced the number of input variables to about 1/16. Appendix A Tables A10-A17 show the results of the PCA for each city, and Table 3 shows how much of the overall variation of each city the five principal components describe.
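With scikit-learn, this selection can be reproduced by fitting a five-component PCA and inspecting the explained variance (a sketch on synthetic data with a low-dimensional structure; the shapes mimic roughly five years of daily rows and a comparable number of input variables, but all numbers are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic stand-in: 1826 daily rows driven by 5 latent factors plus noise
latent = rng.normal(size=(1826, 5))
X = latent @ rng.normal(size=(5, 80)) + 0.1 * rng.normal(size=(1826, 80))

pca = PCA(n_components=5)          # keep five components, as in this study
scores = pca.fit_transform(X)      # reduced inputs for the deep learning models

print(scores.shape)                                # (1826, 5)
print(pca.explained_variance_ratio_.cumsum()[-1])  # share of variance retained
```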

Setup and Case Comparison
China's daily PM2.5 concentration and Korea's air quality and meteorological data were collected from 1 January 2015 to 31 December 2019 to predict the concentration of PM2.5 in eight Korean cities. In total, 85% of the collected data were allocated to the train set and 15% to the test set. Regarding model details, the three models have 256 units in the layer, a tanh activation function, 200 epochs, a batch size of 64, and an adaptive moment estimation (ADAM) optimizer [37]. To avoid overfitting, 30% of the train set was designated as a validation set, and 30% dropout regularization was used between the input layer and the output layer. Additionally, early stopping, one of the callback functions of Keras, was applied to stop training before the 200-epoch limit once optimal learning had been achieved. Figure 12 shows the predicted and actual values of PM2.5 for each case and model in Seoul. Figures A4-A10 show the PM2.5 concentration prediction of each city other than Seoul. Unlike LSTM and BiLSTM, the RNN appears to have output average values for all time periods and shows relatively low predictive power in both Case 1 and Case 2. The RNN without PCA seems to follow the trend more closely and shows relatively higher performance than the RNN with PCA. However, although there are differences between cities, LSTM and BiLSTM follow the trend relatively well, regardless of whether PCA is applied. Furthermore, in all cities, PCA application corrects the difference between the predicted and actual values that exists when PCA is not applied. It also appears to produce more accurate results in predicting peak values. Tables 4 and 5 are numerical representations of these visual results. As noted above, the reduction in dimension leads to relatively low performance in the RNN in all cities, except for Daegu in terms of the MAE.
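Under the hyperparameters listed above, a Case 2 model for one city might be assembled as follows (a Keras sketch; the one-step input window and the exact layer arrangement are assumptions, not the authors' published code):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed input window: 1 time step of the 5 principal-component inputs
timesteps, n_features = 1, 5

model = keras.Sequential([
    layers.Input(shape=(timesteps, n_features)),
    layers.Bidirectional(layers.LSTM(256, activation="tanh")),  # 256 units, tanh
    layers.Dropout(0.3),   # 30% dropout between the recurrent and output layers
    layers.Dense(1),       # next-day PM2.5 concentration
])
model.compile(optimizer="adam", loss="mse")

# Early stopping halts training before the 200-epoch limit (patience is assumed)
early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model.fit(X_train, y_train, epochs=200, batch_size=64,
#           validation_split=0.3, callbacks=[early_stop])
```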
This means that, for RNNs, reducing the number of variables does not help model learning; rather, providing a large amount of information within a short period can lead to better performance, given the model's reliance on short-term information. Instead of the RNN, which lacks overall accuracy, the results that should be considered are those of LSTM and BiLSTM. Unlike the RNN, applying PCA to LSTM and BiLSTM yielded better RMSE and MAE results, consistent with the visual results. The cities ordered by performance are: Busan > Daejeon > Gwangju > Daegu > Seoul > Ulsan > Wonju > Incheon, while ordered by improvement in MAE and RMSE they are: Busan > Incheon > Gwangju > Seoul > Ulsan > Daegu > Daejeon > Wonju. LSTM showed high performance in Daejeon, Daegu, and Busan, while BiLSTM performed better in the remaining cities. The differences in performance and in performance improvement across cities make it worth considering which characteristics of each city cause regional differences in the performance of the same model, and which model performs better depending on regional characteristics. Such studies are expected to require multidisciplinary considerations.

Conclusions
Performance degradation due to the curse of dimensionality can occur in deep learning and machine learning. We proposed a PCA-applied model to solve this problem, and through performance comparison with a non-PCA model, we showed that PCA applications produce better results in deep learning time series prediction. Such a performance improvement technique can be a way to increase the efficiency of the government system by providing better forecasts as a basis for issuing crisis alerts and establishing air pollution reduction policies in the future.
As the correlation analysis shows, the concentration of PM2.5 in China appears to have positive correlations with the concentration of PM2.5 in Korea, indicating that we have to consider China's air pollution factors in predicting the concentration of PM2.5 in Korea. It also suggests that the ongoing joint research between Korea and China [38] justifies the setup of real-time air pollution databases between the two countries.
However, while PCA application can improve model performance, the results show relatively weak predictions of the minimum and maximum PM2.5 concentrations for each city. This appears to be caused by the small number of observations (daily rather than hourly observations). It is expected that future joint cross-border research will yield better performance by collecting many more observations. Some meteorological data in each Korean city showed a relatively weak correlation with PM2.5 concentration, so it seems necessary to find variables that have causality or a strong correlation through fields of study other than deep learning. For example, if spatial factors (spatial homogeneity, autocorrelation, etc.) of Chinese and Korean cities are added to the model as input variables, the model is expected to perform better by learning both the temporal and spatial features of the data. This research will continue to maximize the prediction performance of deep learning models by collecting more observations and optimizing models, applying new algorithms, and adding other variables that have causality with PM2.5 concentration in terms of econometrics and spatial econometrics.

Acknowledgments: The open access fee was supported by the Research Institute for Agriculture and Life Sciences, Seoul National University.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A

Table A1 shows the acronym list for Tables A2-A17 and Figures A1-A3. Tables A2-A9 show the correlation coefficients between the meteorological and air quality factors and PM2.5 concentrations in each city. Figure A2. The meteorological data of Seoul (wind). Figure A3. The meteorological data of Seoul (temperature, relative humidity, precipitation). Figure A4. The PM2.5 prediction in Gwangju for the two cases. Figure A5. The PM2.5 prediction in Daegu for the two cases. Figure A6. The PM2.5 prediction in Daejeon for the two cases. Figure A7. The PM2.5 prediction in Busan for the two cases. Figure A8. The PM2.5 prediction in Ulsan for the two cases. Figure A9. The PM2.5 prediction in Wonju for the two cases. Figure A10. The PM2.5 prediction in Incheon for the two cases.