Monitoring of PM2.5 Concentrations by Learning from Multi-Weather Sensors

This paper aims to monitor the ambient level of particulate matter less than 2.5 μm (PM2.5) by learning from multi-weather sensors. Over the past decade, China has established a high-density network of automatic weather stations. In contrast, the number of PM monitors is much smaller than the number of weather stations. Since the haze process is closely related to the variation of meteorological parameters, it is possible and promising to calculate the concentration of PM2.5 by studying the data from weather sensors. Here, we use three machine learning methods, namely multivariate linear regression, multivariate nonlinear regression, and neural network, in order to monitor PM2.5 by exploring the data of multi-weather sensors. The results show that the multivariate linear regression method has the root mean square error (RMSE) of 24.6756 μg/m3 with a correlation coefficient of 0.6281, by referring to the ground truth of PM2.5 time series data; and the multivariate nonlinear regression method has the RMSE of 24.9191 μg/m3 with a correlation coefficient of 0.6184, while the neural network based method has the best performance, of which the RMSE of PM2.5 estimates is 15.6391 μg/m3 with the correlation coefficient of 0.8701.


Introduction
Particulate matter (PM) is a kind of atmospheric aerosol, occurring as minute solid particles or liquid droplets suspended in air [1]. PM is mainly of anthropogenic origin, derived from industrial, home heating and cooking, and transportation sources, while most natural sources are relatively less important [2]. PM less than 10 µm (PM 10 ) and PM less than 2.5 µm (PM 2.5 ) consist of a number of components such as sulfate, nitrate, ammonium, elemental carbon, organic carbon, and soil or dust particles. Fine particles, PM 1.0 (aerodynamic diameter of less than 1.0 µm), carry toxic trace elements such as Se, S, V, Cu, Fe, Pb, As, Cd, Ni, Zn, and Mn [3]. Scientific studies reveal that PM poses risks to the anthroposphere, atmosphere, hydrosphere, biosphere, and lithosphere [2][3][4]. The negative impact of PM can be summarized as follows. First, it affects human health [5][6][7]. PM 2.5 and PM 10 generally pass through the nose and throat and can even enter the lungs; fine particles, PM 1.0 , and smaller particles are able to penetrate into the human respiratory and circulatory systems, resulting in adverse health effects [4]. Second, PM episodes reduce visibility and lead to climate change [8]. PM is the main cause of reduced visibility (haze) in the world. It suppresses convection and precipitation by both radiative and micro-physical effects, changes lighting phenomenon in different regions, weakens the hydrological cycle, and leads to less fresh water and nutrient imbalance in coastal waters and large river basins [4]. Third, particle pollution and acid rain make lakes and streams acidic, damage sensitive forests, farm crops, stone, soil buildings and other materials, and deplete the nutrients in soil, affecting the diversity of ecosystems [9]. PM cycle. In addition, visibility also decreases in the presence of rain.
Thus, it is necessary to supplement the visibility observation using precipitation measurements, in order to decide whether PM episodes really happen depending on the visibility measurements.
As addressed above, since meteorological parameters can be affected by PM episodes, it is possible and promising to measure the concentration of PM by studying the data from weather sensors, even though automatic weather stations (AWS) were not originally designed for PM observation. There is a high-density network of AWS in China, and these stations work 24/7 under all weather conditions. Because of the absence of dense urban PM monitoring networks, values observed at a 'central monitor' have frequently been considered representative of ambient pollutant levels within a metropolitan area. With the recent growth of the high-density AWS network, we have an opportunity to measure PM more accurately based on AWS data. However, little work has been done to calculate PM concentration by using meteorological parameters. Our previous study using hidden Markov models to quantify PM concentrations has yielded some encouraging results [31]. In this paper, we aim to use three machine learning methods, namely multivariate linear regression, multivariate nonlinear regression, and neural network, to retrieve PM concentrations by learning from the data of multi-weather sensors.

Materials
Observations of PM 2.5 and meteorological parameters were collected from January 2014 to June 2014 at the National Xiamen weather station, Fujian, China. The PM monitoring station was installed in the same standard observation site as the automatic multi-weather sensors. The meteorological parameters and PM 2.5 data were collected at a frequency of once per hour and processed by using the world meteorological data quality control standard. The main meteorological parameters are visibility, wind direction, wind speed, temperature, relative humidity, atmospheric pressure, and hourly rainfall rate. Figure 1 shows variations of PM 2.5 concentrations and meteorological parameters at the National Xiamen weather station during the coordinated observation period. For better visualization, we divided the PM 2.5 data and meteorological parameters into two dimensions according to 24 h per day, as shown in Figure 2.
In order to study the relationship between PM 2.5 and meteorological parameters, we computed Pearson's linear correlation between them. Pearson's correlation coefficient is widely used to measure the degree of linear correlation between two quantitative variables [32]. Given N samples of two variables x and y, the coefficient $r_{xy}$ is calculated as

$$r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

where $x_i$ and $y_i$ are the ith sample points, and $\bar{x}$ and $\bar{y}$ are the sample means. The results are reported in Table 1. PM 2.5 has a high correlation with visibility, wind direction, wind speed, and relative humidity, and a low correlation with air temperature, atmospheric pressure, and rainfall rate. Therefore, we use the meteorological parameters with high correlation coefficients, namely visibility, wind direction, wind speed, and relative humidity, in order to develop the multivariate regression model.
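As a minimal sketch of the screening step described above, Pearson's coefficient can be computed directly from its definition. The function name `pearson_r` and the sample values are illustrative assumptions; they are not the Xiamen observations or the authors' code.

```python
import math

def pearson_r(x, y):
    """Pearson's linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the sample means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(ss_x * ss_y)

# Made-up values for illustration only: PM2.5 falls linearly as visibility rises.
visibility_km = [2.0, 4.0, 6.0, 8.0, 10.0]
pm25 = [90.0, 70.0, 50.0, 30.0, 10.0]
r = pearson_r(visibility_km, pm25)
print(f"r = {r:.4f}")
```

A strongly negative value, as in this toy series, mirrors the sign of the visibility correlation discussed in the text; real hourly data would of course give a weaker relationship.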
Interestingly, we also found that the performance of the neural network based method can be improved by using the remaining meteorological parameters, even though air temperature, atmospheric pressure, and rainfall rate have low correlation with PM 2.5 . Thus, all seven meteorological parameters were used in the neural network method.

Multivariate Linear Regression
Denote the ith observation of PM 2.5 , visibility, wind direction, wind speed, and relative humidity as $y_i$, $X_1^i$, $X_2^i$, $X_3^i$, and $X_4^i$, respectively. We predict PM 2.5 via the model

$$\hat{y}_i = a_0 + a_1 X_1^i + a_2 X_2^i + a_3 X_3^i + a_4 X_4^i$$

where $a_0$ is the intercept. Including $a_0$ in the vector of coefficients $a = (a_0, a_1, a_2, a_3, a_4)^T$ and defining $x_i = (1, X_1^i, X_2^i, X_3^i, X_4^i)^T$, the model can be written in vector form as an inner product

$$\hat{y}_i = a^T x_i$$

The optimal vector of coefficients a can be obtained by minimizing the distance between the predictions and the ground truth data. Using the Euclidean distance, the solution for the vector of coefficients can be formulated as

$$a^{*} = \arg\min_{a} \lVert y - X a \rVert_2^2 \tag{4}$$

where y is the vector of PM 2.5 data in the training set and X is the matrix whose ith row is $x_i^T$.
The least squares method was applied to fit the above model, and the solution is given by using the Moore-Penrose inverse operation as

$$a = (X^T X)^{-1} X^T y \tag{5}$$
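The closed-form least squares solution can be sketched as follows. This is a minimal illustration that solves the normal equations $(X^T X)a = X^T y$ by Gaussian elimination, which coincides with the Moore-Penrose solution when X has full column rank. The helper names and the synthetic two-feature data are assumptions made for the example, not the paper's data or code.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; no unique least squares solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def fit_linear_regression(features, targets):
    """Return [a0, a1, ..., ap] solving the normal equations (X^T X) a = X^T y."""
    X = [[1.0] + list(row) for row in features]  # the leading 1 carries the intercept a0
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * t for row, t in zip(X, targets)) for i in range(p)]
    return solve_linear_system(XtX, Xty)

# Synthetic data generated from y = 5 + 2*x1 - 3*x2 (illustrative only).
features = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 7.0]]
targets = [5.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in features]
coeffs = fit_linear_regression(features, targets)
print([round(a, 6) for a in coeffs])
```

In practice a library routine based on QR or SVD is numerically preferable to forming $X^T X$ explicitly; the explicit form is used here only because it matches the formula in the text.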

Multivariate Nonlinear Regression
The physical principle behind our proposed nonlinear regression model is that atmospheric optical visibility decays exponentially as the PM concentration increases. In addition, as the wind speed increases, it becomes easier for the particulate matter to disperse, and the concentration becomes smaller. Figure 3 shows this physical nonlinear relationship. In the nonlinear regression model, γ k and β k are the kth coefficients.
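In this approach, we pick the coefficients γ and β to minimize the residual sum of squares as the cost function

$$J(\theta) = \sum_{i=1}^{N} \left( y_i - \hat{y}_i(\theta) \right)^2$$

where N is the total number of data samples, $\hat{y}_i(\theta)$ is the model prediction for the ith sample, and θ = (γ, β) are the parameters.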
The above problem was solved by using the Nelder-Mead simplex method [33,34]. Each parameter vector θ represents one vertex of the simplex. The major procedures of the Nelder-Mead simplex method include the order, reflection, expansion, contraction, and shrink operations. Algorithm 1 summarizes the detailed procedures for fitting the multivariate nonlinear regression model.
Algorithm 1. Fitting the multivariate nonlinear regression model with the Nelder-Mead simplex method.
Order: sort the three vertices of the simplex, θ 1 , θ 2 , θ 3 , by their cost function values.
Expansion: calculate the expanded vertex, then go to step 1.
Contraction: calculate the contracted vertex, then go to step 1.
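The order, reflection, expansion, contraction, and shrink operations can be sketched in a few lines. This is a simplified textbook variant with the commonly used coefficients (reflection 1, expansion 2, contraction 0.5, shrink 0.5), not a reproduction of Algorithm 1. The one-variable exponential model `gamma * exp(-beta * visibility)` and the synthetic data are hypothetical stand-ins chosen so that the simplex has three vertices; they are not the paper's regression model.

```python
import math

def nelder_mead(cost, start, step=0.5, max_iter=1000, tol=1e-14):
    """Minimize cost(theta) with a simplified Nelder-Mead simplex method."""
    n = len(start)
    # Initial simplex: the start point plus one perturbed vertex per dimension.
    simplex = [list(start)]
    for i in range(n):
        vertex = list(start)
        vertex[i] += step
        simplex.append(vertex)
    for _ in range(max_iter):
        # Order: sort vertices from best (lowest cost) to worst.
        simplex.sort(key=cost)
        best, worst = simplex[0], simplex[-1]
        if abs(cost(worst) - cost(best)) < tol:
            break
        # Centroid of all vertices except the worst one.
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]
        # Reflection: mirror the worst vertex through the centroid.
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if cost(best) <= cost(reflected) < cost(simplex[-2]):
            simplex[-1] = reflected
        elif cost(reflected) < cost(best):
            # Expansion: try moving further along the reflection direction.
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if cost(expanded) < cost(reflected) else reflected
        else:
            # Contraction: pull the worst vertex halfway toward the centroid.
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if cost(contracted) < cost(worst):
                simplex[-1] = contracted
            else:
                # Shrink: move every other vertex halfway toward the best one.
                simplex = [best] + [
                    [b + 0.5 * (vi - b) for b, vi in zip(best, v)] for v in simplex[1:]
                ]
    simplex.sort(key=cost)
    return simplex[0]

# Hypothetical model and synthetic data (true gamma = 2.0, beta = 0.5).
visibility = [0.0, 1.0, 2.0, 3.0, 4.0]
pm = [2.0 * math.exp(-0.5 * v) for v in visibility]

def cost(theta):
    """Residual sum of squares for the hypothetical exponential model."""
    gamma, beta = theta
    try:
        return sum((y - gamma * math.exp(-beta * v)) ** 2 for v, y in zip(visibility, pm))
    except OverflowError:
        return float("inf")  # treat numerically exploding trial points as infinitely bad

start = [1.0, 1.0]
theta_hat = nelder_mead(cost, start)
print("estimated (gamma, beta):", theta_hat)
```

Because the method only compares cost values, it needs no gradients, which is convenient for models such as exponentials whose cost surface is poorly scaled; the price is slower convergence as the number of parameters grows.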

Neural Network
Neural networks are particularly well suited to nonlinear fitting problems, because a network with enough elements (called neurons) can fit any data with arbitrary precision [35,36]. A multilayer perceptron (MLP) network [37] is applied to explore the nonlinear regression for PM 2.5 . Figure 4 shows a conceptualized structure of the two-layer feed-forward network that is used for predicting PM concentrations.
The proposed two-layer feed-forward neural network includes a sigmoid hidden layer and an affine transformation output layer. Assuming that the number of neurons in the hidden layer is $m_h$, the input-output function can be formulated as

$$f(x; \theta) = b + v^T\, \mathrm{sigmoid}(c + W x)$$

where x is the vector of input meteorological parameters, b is the output offset, v is the weight vector for the output layer, $\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$ is applied element-wise, c is the offset vector for the hidden layer, W is the weight matrix for the hidden layer, and the parameters are θ = (b, c, v, W). As mentioned above, all seven meteorological parameters were input to the neural network. Thus, the dimensions of the parameters can be determined as

$$b \in \mathbb{R}, \quad c \in \mathbb{R}^{m_h}, \quad v \in \mathbb{R}^{m_h}, \quad W \in \mathbb{R}^{m_h \times 7}$$

To avoid over-fitting the data, we applied the Bayesian regularization method to train the network [38,39]. Firstly, the sum of squares of the network weights is defined as

$$E_W = \sum_{j=1}^{N_W} w_j^2$$

where $w_j$ are the $N_W$ weights of the network, and the contribution of this term is controlled by the parameter α. In addition, the sum of squared errors is given by

$$E_D = \sum_{i=1}^{N_D} \left( y_i - f(x_i; \theta) \right)^2$$

where $N_D$ is the number of training samples, and the contribution of this term is controlled by the parameter β. From the perspective of the Bayesian framework, the weights of the neural network are considered as stochastic variables.
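The input-output function of such a two-layer network can be illustrated with a short forward pass. This is a minimal sketch assuming the form b + vᵀ sigmoid(c + W x) with seven inputs and 16 hidden neurons; the random weights and the input vector are placeholders, not trained values or real observations.

```python
import math
import random

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, b, c, v, W):
    """Two-layer network output: b + v^T sigmoid(c + W x)."""
    hidden = [
        sigmoid(c_j + sum(w_jk * x_k for w_jk, x_k in zip(W_j, x)))
        for c_j, W_j in zip(c, W)
    ]
    return b + sum(v_j * h_j for v_j, h_j in zip(v, hidden))

def count_parameters(n_inputs, n_hidden):
    """Number of scalars in theta = (b, c, v, W)."""
    return 1 + n_hidden + n_hidden + n_hidden * n_inputs

n_inputs, n_hidden = 7, 16  # seven meteorological inputs, 16 hidden neurons
rng = random.Random(0)
W = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
c = [0.0] * n_hidden
v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b = 0.0
x = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4, 0.2]  # placeholder standardized inputs
y_hat = forward(x, b, c, v, W)
n_params = count_parameters(n_inputs, n_hidden)
print("output:", y_hat, "parameters:", n_params)
```

With seven inputs and 16 hidden neurons the parameter count is 1 + 16 + 16 + 16 × 7 = 145, which gives a sense of the model size that the regularization has to control.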
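According to Bayes' rule, the probability density function of the parameters for a neural network M can be formulated as

$$P(\theta \mid D, \alpha, \beta, M) = \frac{P(D \mid \theta, \alpha, \beta, M)\, P(\theta \mid \alpha, \beta, M)}{P(D \mid \alpha, \beta, M)}$$

where D denotes the training data. Specifically, the likelihood function does not depend on the regularizer α once the parameters θ are known, and the prior function does not depend on the parameter β that regularizes the data term [38]. Therefore, the above equation can be simplified as

$$P(\theta \mid D, \alpha, \beta, M) = \frac{P(D \mid \theta, \beta, M)\, P(\theta \mid \alpha, M)}{P(D \mid \alpha, \beta, M)} \tag{12}$$

Assuming that the prior of the parameters θ and the training data are Gaussian distributed, the probability density functions can be represented as

$$P(D \mid \theta, \beta, M) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D), \qquad Z_D(\beta) = \left( \frac{\pi}{\beta} \right)^{N_D/2} \tag{13}$$

where $N_D$ is the total number of training data samples. In addition,

$$P(\theta \mid \alpha, M) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W), \qquad Z_W(\alpha) = \left( \frac{\pi}{\alpha} \right)^{N_W/2} \tag{14}$$

where $N_W$ is the number of the weights. The optimal parameters of the neural network can be obtained by maximizing the posterior probability (Equation (12)). Substituting Equations (13) and (14) into Equation (12), we can obtain

$$P(\theta \mid D, \alpha, \beta, M) = \frac{1}{Z_F(\alpha, \beta)} \exp\left( -(\beta E_D + \alpha E_W) \right)$$

where $Z_F(\alpha, \beta) = Z_D(\beta)\, Z_W(\alpha)\, P(D \mid \alpha, \beta, M)$ is the normalization factor. According to the above derivation, maximizing the posterior probability is equivalent to minimizing the regularized cost function

$$F(\theta) = \beta E_D + \alpha E_W$$

Here, we use the Gauss-Newton approximation for Bayesian regularization [39]. More details of Bayesian regularization for neural networks can also be found in [40][41][42][43].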
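To make the regularized objective concrete, the sketch below evaluates a cost of the form F = β E_D + α E_W for fixed values of α and β. The function names and numbers are illustrative assumptions. In Bayesian regularization proper, α and β are re-estimated during training rather than fixed by hand, and that step is not shown here.

```python
def sum_squared_errors(predictions, targets):
    """E_D: sum of squared errors between predictions and training targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))

def sum_squared_weights(weights):
    """E_W: sum of squares of the network weights."""
    return sum(w ** 2 for w in weights)

def regularized_cost(predictions, targets, weights, alpha, beta):
    """F = beta * E_D + alpha * E_W."""
    e_d = sum_squared_errors(predictions, targets)
    e_w = sum_squared_weights(weights)
    return beta * e_d + alpha * e_w

# Illustrative numbers only.
predictions = [12.0, 30.0, 45.0]
targets = [10.0, 33.0, 45.0]
weights = [0.5, -1.0, 2.0]
F = regularized_cost(predictions, targets, weights, alpha=0.1, beta=1.0)
print("F =", F)
```

Here E_D = 13 and E_W = 5.25, so F = 13 + 0.1 × 5.25 = 13.525; raising α relative to β penalizes large weights more heavily and pushes the fit toward smoother functions.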

Models Training
After solving Equation (4) by least squares, the multivariate linear regression model was determined. Next, we describe the training process of the multivariate nonlinear regression model. As shown in Figure 5, the cost function value decreased rapidly during the first 300 iterations of Algorithm 1.
When the number of iterations reached 900, the performance of the algorithm approached saturation. Finally, the multivariate nonlinear regression model was completely fitted after 1081 iterations.
In the training of neural network models, validation is not required due to the use of the Bayesian regularization method. Therefore, the PM data and meteorological parameters were randomly divided into two sets: 60% for training and 40% for a completely independent test. These two sets together comprise 70% of all data, meaning that 42% of all data was used for training. Validation is often considered a form of regularization that balances under-fitting against over-fitting. Interestingly, the Bayesian regularization method has its own form of validation built into the approach [38,39], so no separate validation set is used, since the purpose of validation is to see whether the error on the validation set gets better or worse as training progresses. The error of Bayesian regularization is based not only on how the model behaves on the dataset, but also on the size of the weights in the hidden layers. The larger the weights, the larger the error. Thus, throughout the training process, the hidden layer may never be allowed to explore larger weights, even if larger weights may result in a global minimum.
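The random 60%/40% division into training and test sets can be sketched as below. The function name, the fixed seed, and the sample count of 1000 are assumptions for illustration; the paper does not state how the random division was implemented.

```python
import random

def split_indices(n_samples, train_fraction=0.6, seed=0):
    """Randomly split sample indices into disjoint training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]

# Hypothetical number of hourly samples.
train_idx, test_idx = split_indices(1000)
print(len(train_idx), "training samples,", len(test_idx), "test samples")
```

Note that a purely random split of hourly time series places neighbouring hours in both sets; a split by contiguous blocks of time would be a stricter test of generalization.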
Sensors 2020, xx, 5
We averaged the training and test performance over 100 experiments on models with the different numbers of hidden neurons, and the results are shown in Figure 6. These results support the issues mentioned above. While increasing the number of hidden neurons can improve the performance of training, it does not reduce the error for testing. Therefore, by considering the training and test performance together, we set the number of hidden neurons as 16, which can produce satisfactory performance.

Predictions of PM 2.5 Concentrations
Using these three trained models, the meteorological parameters were input to obtain the estimated PM concentrations, which are shown in Figures 7 and 8. From the plots, it can be seen that both the linear and nonlinear multivariate regression based methods can estimate the slowly varying part of the PM changes, but the details of the rapid changes cannot be estimated precisely. In contrast, the neural network-based method, with good nonlinear learning capability, captured the changes in PM concentration more completely and accurately.
To further evaluate the performance of the three machine learning algorithms, Pearson's linear correlation was computed between the model outputs and the ground truth data. The results are reported in Figure 9 and Table 2. With reference to the ground truth PM time series data, the multivariate linear regression method has a root mean square error (RMSE) of 24.6756 µg/m³ with a correlation coefficient of 0.6281, and the multivariate nonlinear regression method has an RMSE of 24.9191 µg/m³ with a correlation coefficient of 0.6184, while the neural network based method has the best performance, with an RMSE of 15.6391 µg/m³ and a correlation coefficient of 0.8701.
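For reference, the RMSE used in this comparison can be computed as in the following minimal sketch. The function name and the four sample values are illustrative assumptions, not the values behind Table 2.

```python
import math

def rmse(estimates, reference):
    """Root mean square error between model estimates and reference values."""
    if len(estimates) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up PM2.5 values in µg/m³ for illustration.
estimated_pm = [12.0, 24.0, 30.0, 40.0]
reference_pm = [15.0, 20.0, 30.0, 40.0]
error = rmse(estimated_pm, reference_pm)
print(f"RMSE = {error:.4f}")
```

RMSE is expressed in the units of the measurement and is sensitive to large individual errors, whereas the correlation coefficient measures only how well the estimates track the reference, so the two metrics are reported together.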

Discussion
This paper attempts to estimate the concentration of PM 2.5 from meteorological parameters using three machine learning models to answer the question of whether it can be estimated and with what accuracy.
From Table 1, the correlation between PM 2.5 concentration and visibility is −0.5639. The correlation coefficients between the estimated PM 2.5 value and the PM 2.5 reference value are 0.6281 and 0.6184 for the multivariate linear and nonlinear regression models, respectively (see Table 2). This indicates that these two models improve the accuracy of PM 2.5 estimation only slightly beyond what visibility alone provides. The main reason is that the nonlinear relationship between PM and meteorological parameters such as visibility, wind speed, wind direction, and humidity is complicated.
The estimation accuracy of the PM 2.5 is greatly improved by the neural network model, with a correlation coefficient of 0.8701, which is better than our previous results by using the hidden Markov models [31]. The results demonstrate the ability of the neural network model to learn nonlinear relationships.
It is very interesting to note that, although the correlations between PM 2.5 and atmospheric pressure, rainfall rate, and temperature are very low (see Table 1), the use of these meteorological parameters is critical to the performance of the neural network. Therefore, we further conducted extensive experiments to investigate the effects of using different meteorological parameter inputs on the RMSE and correlation coefficients of the three machine learning model estimates.
As shown in Table 3, the performance of the neural network approach increases significantly as more meteorological parameters are included. The performance of the linear regression model increases slightly as parameters are added; however, the performance of the nonlinear regression model decreases considerably. In machine learning, a very important issue is the under-fitting and over-fitting of data. Unfortunately, the minimization of the cost function for multivariate regression models suffers from poor generalization [40]. Therefore, the performance of the multivariate regression models for PM concentration prediction is limited. Our future work will introduce a regularization approach to further improve the performance of the multivariate regression models. In contrast, Bayesian regularization has some promising advantages for the training of neural network models [40]. Bayesian regularization does not require a validation dataset; the method itself evaluates the evidence, which keeps the model from being overtrained. In addition, the Bayesian regularization method introduces Occam's razor, which automatically penalizes overly complex models and makes the model hard to overfit [38,39]. One limitation of this study is that the PM and meteorological data cover only a six-month period. Our future work will collect data for longer periods to further evaluate the performance of the model, and consider other types of neural network models to further improve the accuracy of PM concentration predictions.

Conclusions
This paper demonstrates the potential of using multi-weather sensors to monitor PM 2.5 concentrations. The accuracy of the PM 2.5 estimates was studied by comparing three classical machine learning methods. The results show that the neural network-based approach outperforms both the multivariate linear and nonlinear regression approaches, achieving a root mean square error of 15.6391 µg/m³ and a correlation coefficient of 0.8701. This study suggests that PM concentrations can be estimated in real time from the high-density network of automatic weather stations. Machine learning methods using data from these automatic weather stations can provide new insights into mapping PM concentrations, and may make a valuable contribution to our understanding of the particle distribution and its cycle. Our future work will include acquiring more data and using other types of neural network models to further improve the accuracy of PM predictions.
Author Contributions: Conceptualization, Z.X. and Y.W.; methodology, Z.X. and Y.W.; writing-editing and revisions, Z.X. and Y.W.; All authors have read and agreed to the published version of the manuscript.