A Gaussian Process Method with Uncertainty Quantification for Air Quality Monitoring

The monitoring and forecasting of particulate matter (e.g., PM2.5) and gaseous pollutants (e.g., NO, NO2, and SO2) is of significant importance, as they have adverse impacts on human health. However, model performance can easily degrade due to measurement noise, environmental variability, and other factors. This paper proposes a general solution to analyse how the noise level of measurements and the hyperparameters of a Gaussian process model affect the prediction accuracy and uncertainty, with a comparative case study on the prediction of atmospheric pollutant concentrations in Sheffield, UK, and Peshawar, Pakistan. The Neumann series is exploited to approximate the matrix inverse involved in the Gaussian process approach. This enables us to derive a theoretical relationship between any independent variable (e.g., the measurement noise level or a hyperparameter of the Gaussian process method) and the prediction uncertainty and accuracy. In addition, it helps us to gain insight into how these independent variables affect the evidence lower bound (ELBO) of the algorithm. The theoretical results are verified by applying a Gaussian process approach and its sparse variants to air quality forecasting.


Introduction
It is generally believed that urban areas provide better economic, political, and social opportunities than rural areas. As a result, more and more people are migrating to urban areas. At present, more than fifty percent of people worldwide live in urban areas, and this percentage is increasing with time. This has led to several environmental issues in large cities, such as air pollution [1].
Landrigan reported that air pollution caused 6.4 million deaths worldwide in 2015 [2]. According to World Health Organization (WHO) statistical data, three million premature deaths were caused by air pollution worldwide in 2012 [3]. Air pollution has a strong link with dementia, causing 850,000 people to suffer from dementia in the UK [4]. Children growing up in residential houses near busy roads and junctions have a much higher risk of developing various respiratory diseases, including asthma, due to high levels of traffic-related air pollution in these cities. Appendix B gives the WHO criteria for air pollutants. Appendix C gives the approximate derivatives of the GP kernel.

Gaussian Processes
Given a set of training data D = {(x_i, y_i), i = 1, ..., n}, where x_i ∈ X is the input and y_i ∈ R is the observation, we can determine a GP model f(·) to predict y_* for a new input x_*. For instance, when the output is one-dimensional, the GP model is formulated as

y = f(x) + ε,   f(x) ~ GP(m(x), k(x, x')),

where m: X → R is the mean function defined as

m(x) = E[f(x)],

and k: X × X → R is the kernel function [18] defined as

k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))],

where ε is the additive, independent, identically distributed Gaussian measurement noise with variance σ², and E denotes the mathematical expectation operation. Given that each x_i is a D × 1 vector, the n inputs can be aggregated into a matrix X_{D×n}, or briefly X, with the corresponding output vector y_{n×1}, or y. Similarly, the function values at the test inputs X_*, with dimensions D × N, are denoted f_*, and the joint distribution of y and f_* is

[y; f_*] ~ N(0, [K_nn + σ²I, K_nN; K_Nn, K_NN]),   (4)

where I represents the identity matrix. K_nn + σ²I is the n × n prior covariance matrix of y with entries K_ij = k(x_i, x_j) + σ²δ_ij, where δ_ij is one iff i = j and zero otherwise, and x_i and x_j are column vectors from X. The matrix K_NN denotes the N × N prior covariance matrix of f_* with entries K_ij = k(x_i, x_j), where x_i and x_j are column vectors from X_*. The matrices K_Nn and K_nN satisfy K_Nn = K_nNᵀ, and the entries of the N × n prior covariance matrix of f_* and y are K_ij = k(x_i, x_j), where x_i is a column vector from X_* and x_j is a column vector from X.
By deriving the conditional distribution of f_* from (4), where the prior mean is set to zero for simplicity [20], we obtain the predictive posterior at the new inputs X_* as

f_* | X, y, X_* ~ N(f̄_*, cov(f_*)),   (5)

where

f̄_* = K_Nn (K_nn + σ²I)^{-1} y   (6)

is the prediction at X_*, and

cov(f_*) = K_NN − K_Nn (K_nn + σ²I)^{-1} K_nN   (7)

denotes the covariance of f_*. The hyperparameters θ incorporated in the mean and covariance functions underpin the predictive performance of GP models, and they are usually estimated by maximising the logarithm of the marginal likelihood

log p(y|X) = −(1/2) yᵀ (K_nn + σ²I)^{-1} y − (1/2) log|K_nn + σ²I| − (n/2) log 2π.   (8)
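As a concrete illustration of Equations (6) and (7), the following is a minimal sketch of GP prediction in Python, assuming an SE kernel; the function names `se_kernel` and `gp_predict` are illustrative, not from the paper:

```python
import numpy as np

def se_kernel(X1, X2, s_f=1.0, l=1.0):
    """SE kernel k(x, x') = s_f^2 exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return s_f**2 * np.exp(-d2 / (2 * l**2))

def gp_predict(X, y, X_star, s_f=1.0, l=1.0, sigma2=0.1):
    """Predictive mean and covariance of Equations (6) and (7)."""
    K = se_kernel(X, X, s_f, l)               # K_nn
    K_s = se_kernel(X_star, X, s_f, l)        # K_Nn
    K_ss = se_kernel(X_star, X_star, s_f, l)  # K_NN
    A = K + sigma2 * np.eye(len(X))           # K_nn + sigma^2 I
    mean = K_s @ np.linalg.solve(A, y)        # Eq. (6)
    cov = K_ss - K_s @ np.linalg.solve(A, K_s.T)  # Eq. (7)
    return mean, cov
```

With a near-zero noise variance, the posterior mean interpolates the training targets, which is a quick sanity check of the implementation.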

Neumann Series Approximation
Given a matrix inverse A^{-1}, it can be expanded as the following Neumann series [21]

A^{-1} = Σ_{n=0}^{∞} (X^{-1}(X − A))^n X^{-1},   (9)

which holds if lim_{n→∞} (I − X^{-1}A)^n = 0 is satisfied. In our case, suppose

A = D_A + E_A,   (10)

where D_A is the main diagonal of A and E_A is the hollow (off-diagonal) part. If we substitute X in Equation (9) by D_A, we get

A^{-1} = Σ_{n=0}^{∞} (−D_A^{-1} E_A)^n D_A^{-1},   (11)

which is guaranteed to converge when lim_{n→∞} (−D_A^{-1} E_A)^n = 0. We investigated the convergence condition in [17], where we proved that if A is diagonally dominant, the Neumann series approximates A^{-1} both quickly and accurately. In case A is not diagonally dominant, we also provided a way in [17] to convert it into a diagonally dominant matrix, such that A^{-1} can still be approximated by the Neumann series. When the Neumann series in (11) converges, we can approximate A^{-1} with only the first L terms. The L-term approximation is computed as

Ã_L^{-1} = Σ_{n=0}^{L−1} (−D_A^{-1} E_A)^n D_A^{-1}.   (12)

For instance, when L = 1, 2, 3, we have the approximations

Ã_1^{-1} = D_A^{-1},
Ã_2^{-1} = D_A^{-1} − D_A^{-1} E_A D_A^{-1},
Ã_3^{-1} = D_A^{-1} − D_A^{-1} E_A D_A^{-1} + D_A^{-1} E_A D_A^{-1} E_A D_A^{-1}.
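The L-term truncation can be sketched as follows (a minimal illustration, not the implementation used in [17]):

```python
import numpy as np

def neumann_inverse(A, L):
    """L-term Neumann approximation of A^{-1} with the splitting A = D_A + E_A:
    A^{-1} ~= sum_{n=0}^{L-1} (-D_A^{-1} E_A)^n D_A^{-1}."""
    D_inv = np.diag(1.0 / np.diag(A))   # D_A^{-1}
    E = A - np.diag(np.diag(A))         # hollow part E_A
    T = -D_inv @ E                      # iteration matrix -D_A^{-1} E_A
    approx = np.zeros_like(A)
    term = np.eye(len(A))               # T^0
    for _ in range(L):
        approx += term @ D_inv          # add T^n D_A^{-1}
        term = term @ T
    return approx
```

For a diagonally dominant matrix the approximation error shrinks as L grows, matching the convergence condition above.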

Uncertainty in Measurements
It is intuitive that noisy measurements would result in less accurate predictions, just as a poor model would. However, this is not directly apparent from Equations (6) and (7). We show in detail how the measurement noise affects the prediction accuracy.
From Equations (6) and (7), we can see that the measurement noise affects the prediction and the covariance by adding a term σ²_n I to the prior covariance K, in comparison to the noise-free scenario [20]. From the way they originate, we know that both K and σ²_n I are symmetric. Then, a matrix P exists such that

K = P D_K P^{-1},

where D_K is a diagonal matrix with the eigenvalues of K along its diagonal. As σ²_n I is itself diagonal and commutes with any invertible P, we have σ²_n I = P σ²_n I P^{-1}.
Therefore, we have the partial derivative of Equation (6) with respect to σ²_n as

∂f̄_*/∂σ²_n = −K_* P (D_K + σ²_n I)^{-2} P^{-1} y.   (16)

The element-wise form of Equation (16) can therefore be obtained as

[∂f̄_*/∂σ²_n]_o = −Σ_h Σ_i Σ_j k_oh p_hj p_ij y_i / Λ_j,   (17)

where Λ_j = (λ_j + σ²_n)², and p_hj and p_ij are the entries of P indexed by the j-th column and the h-th and i-th rows, respectively. k_oh is the entry in the o-th row and h-th column of K_*, y_i is the i-th element of y, and o = 1, ..., s indexes the elements of the partial derivative.
We can see that the sign of Equation (17) is determined by p_hj and p_ij. This is because y can be transformed to be either positive or negative by a linear transformation, which poses no issue for the GP model. When we impose no constraints on p_hj and p_ij, Equation (17) can take any real value, indicating that f̄_* is multimodal with respect to σ²_n; that is, different values of σ²_n can lead to the same f̄_*. In such cases, it is difficult to investigate how σ²_n affects the prediction accuracy. In this paper, to facilitate the study of the monotonicity of f̄_*, we constrain p_hj and p_ij to satisfy the sign condition in Equation (18). Then f̄_* is monotonic in σ²_n. This means that changes of σ²_n can cause arbitrarily large or small predictions, whereas a robust method should bound the prediction errors regardless of how σ²_n varies. Similarly, the partial derivative of Equation (7) with respect to σ²_n is

∂cov(f_*)/∂σ²_n = K_* P (D_K + σ²_n I)^{-2} P^{-1} K_*ᵀ,   (19)

where we denote the m × n matrix K_* P as

K_* P = [p_1, p_2, ..., p_n],

with p_i an m × 1 vector and i = 1, ..., n. As the uncertainty is indicated by the diagonal elements, we only show how these elements change with respect to σ²_n. The diagonal elements are given as

Σ'_jj = [diag(∂cov(f_*)/∂σ²_n)]_j = Σ_{i=1}^{n} p_{i,j}² / Λ_i,

with diag(·) denoting the diagonal elements of a matrix and p_{i,j} the j-th element of p_i. We see that Σ'_jj ≥ 0 holds for j = 1, ..., m, which implies that cov(f_*) is non-decreasing as σ²_n increases. This means that an increase of the measurement noise level causes a non-decreasing prediction uncertainty.
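The non-decreasing behaviour of the predictive variance can be checked numerically; the sketch below, with an assumed unit-length-scale SE kernel and arbitrary inputs, evaluates diag(cov(f_*)) for increasing σ²_n:

```python
import numpy as np

def predictive_cov_diag(K, K_s, K_ss, sigma2):
    """diag cov(f*) = diag(K_NN - K_Nn (K_nn + sigma^2 I)^{-1} K_nN)."""
    A = K + sigma2 * np.eye(len(K))
    return np.diag(K_ss - K_s @ np.linalg.solve(A, K_s.T))

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(8, 1))    # training inputs
Xs = rng.uniform(0, 3, size=(4, 1))   # test inputs

def k(A, B):
    # SE kernel with s_f = l = 1 (an assumption for this illustration)
    return np.exp(-((A[:, None, 0] - B[None, :, 0]) ** 2) / 2)

K, K_s, K_ss = k(X, X), k(Xs, X), k(Xs, Xs)
# Evaluate the predictive variances for a growing noise level
diags = [predictive_cov_diag(K, K_s, K_ss, s2) for s2 in (0.1, 0.5, 2.0, 10.0)]
```

Each successive variance vector dominates the previous one element-wise, which is exactly the monotonicity implied by Σ'_jj ≥ 0.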

Uncertainty in Hyperparameters
Another factor that affects the prediction of a GP model is the hyperparameters. In Gaussian processes, the posterior in Equation (5) is used for prediction, while the marginal likelihood is used for hyperparameter selection [18]. The log marginal likelihood in Equation (22) is usually optimised to determine the hyperparameters for a specified kernel function,

log p(y|X, θ) = −(1/2) yᵀ (K + σ²_n I)^{-1} y − (1/2) log|K + σ²_n I| − (n/2) log 2π.   (22)

However, the log marginal likelihood can be non-convex with respect to the hyperparameters, which implies that the optimisation may not converge to the global maximum [22]. A common remedy is to sample multiple starting points from a prior distribution and then choose the best set of hyperparameters according to the optima of the log marginal likelihood. Assume θ = {θ_1, θ_2, ..., θ_s} is the hyperparameter set, with θ_s denoting the s-th element; then the derivative of log p(y|X) with respect to θ_s is

∂ log p(y|X)/∂θ_s = (1/2) tr((ααᵀ − (K + σ²_n I)^{-1}) ∂K/∂θ_s),   (23)

where α = (K + σ²_n I)^{-1} y, and tr(·) denotes the trace of a matrix. The log marginal likelihood is often multimodal, which is why a fair few initialisations are used when conducting the optimisation. Chen et al. show that the optimisation process with various initialisations can result in different hyperparameters [22]. Nevertheless, the performance (prediction accuracy) with regard to the standardised root mean square error does not change much. However, the authors do not show how the variation of hyperparameters affects the prediction uncertainty [22].
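The gradient in Equation (23) can be validated against finite differences; the sketch below assumes an SE kernel with the length-scale l playing the role of θ_s:

```python
import numpy as np

def logml_and_grad(K, dK, y, sigma2):
    """Log marginal likelihood of Eq. (22) and its gradient of Eq. (23):
    d/dtheta log p = 0.5 tr((alpha alpha^T - A^{-1}) dK/dtheta), A = K + sigma2 I."""
    n = len(y)
    A = K + sigma2 * np.eye(n)
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(A, y)
    # log|A| = 2 * sum(log diag(L)), so -0.5 log|A| = -sum(log diag(L))
    logml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - np.linalg.inv(A)) @ dK)
    return logml, grad
```

A central finite difference over l should agree with the analytic gradient to high precision, which is a standard check before plugging the gradient into an optimiser.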
An intuitive explanation for different hyperparameters resulting in similar predictions is that the prediction in Equation (6) is itself non-monotonic with respect to the hyperparameters. To demonstrate this, a direct way is to examine how the derivative of (6) with respect to any hyperparameter θ_s ∈ θ changes, and ultimately how it affects the prediction accuracy and uncertainty. The derivatives of f̄_* and cov(f_*) with respect to θ_s are

∂f̄_*/∂θ_s = (∂K_*/∂θ_s)(K + σ²_n I)^{-1} y − K_* (K + σ²_n I)^{-1} (∂K/∂θ_s)(K + σ²_n I)^{-1} y,   (24)

∂cov(f_*)/∂θ_s = ∂K_{**}/∂θ_s − (∂K_*/∂θ_s)(K + σ²_n I)^{-1} K_*ᵀ + K_* (K + σ²_n I)^{-1} (∂K/∂θ_s)(K + σ²_n I)^{-1} K_*ᵀ − K_* (K + σ²_n I)^{-1} (∂K_*ᵀ/∂θ_s).   (25)

We can see that Equations (24) and (25) both involve computing (K + σ²_n I)^{-1}, which becomes enormously complex as the dimension increases. In this paper, we focus on investigating how hyperparameters affect the predictive accuracy and uncertainty in general. Therefore, we use the Neumann series to approximate the inverse [21].

Derivatives Approximation with Neumann Series
The approximation accuracy and computational complexity of the Neumann series vary with L. This has been studied in [21,23], as well as in our previous work [17]. Since this paper aims at providing a way to quantify the uncertainties involved in GPs, we choose the 2-term approximation as an example to carry out the derivations. Substituting the 2-term approximation into Equations (24) and (25) yields Equations (26) and (27). Due to the simple structure of the matrices D_A and E_A, we can obtain the element-wise form of Equation (26) as Equation (28). Similarly, the element-wise form of Equation (27) is Equation (29), where o = 1, ..., m denotes the o-th output, d_ji is the entry in the j-th row and i-th column of the approximated inverse Ã^{-1}, and k_oj and k_oi are the entries in the o-th row, j-th and i-th columns of K_*, respectively. When the kernel function is determined, Equations (26)-(29) can be used for GP uncertainty quantification.
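The quality of the 2-term approximation can be illustrated on a small, diagonally dominant matrix (hypothetical numbers, chosen so the Neumann series converges):

```python
import numpy as np

# A stands in for K + sigma_n^2 I; diagonal dominance ensures convergence.
A = np.array([[2.0, 0.3, 0.1],
              [0.3, 2.0, 0.3],
              [0.1, 0.3, 2.0]])
D_inv = np.diag(1.0 / np.diag(A))     # D_A^{-1}
E = A - np.diag(np.diag(A))           # hollow part E_A

inv_1 = D_inv                          # 1-term approximation
inv_2 = D_inv - D_inv @ E @ D_inv      # 2-term approximation used in (26)-(29)
exact = np.linalg.inv(A)
```

The 2-term error is markedly smaller than the 1-term error, which is why the 2-term truncation is a reasonable working compromise in the derivations above.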

Impacts of Noise Level and Hyperparameters on ELBO and UBML
The minimisation of the Kullback-Leibler divergence KL[q(f, u) ‖ p(f, u|y)] is equivalent to maximising the ELBO [18,24], as shown in Equation (30), where G_n = G_xx + σ²_n I and t = Tr(K_xx − G_xx). Combining it with the UBML in Equation (31), an interval can be given to quantify the uncertainty in the marginal likelihood.
This paper, however, focuses on investigating how the ELBO and UBML change with respect to σ²_n only, because studying how they change with respect to the kernel hyperparameters involves multiple Neumann series approximations, which would make the analysis less convincing. We leave this as an open problem for future study. The derivatives of Equations (30) and (31) with respect to σ²_n are given in Equations (32) and (33). Figure 1 shows how σ²_n affects the ELBO and UBML. We set σ²_n to increase from 0.1 to 200.0 with a step of 0.01, and both the ELBO and UBML are recorded at each step. From the figure, we can see that when σ²_n is small (σ²_n ∈ [0.1, 1.5]), the ELBO increases at varying speeds, while the UBML fluctuates as its derivative jumps between positive and negative. When σ²_n is in [1.5, 3.0], the ELBO still increases, but much more slowly; in comparison, the UBML keeps decreasing at a diminishing rate. The decrease of the UBML means that as σ²_n increases, even though the ELBO may still increase, the maximum it can reach (which is the UBML) can decrease. When σ²_n ∈ [3.0, 20.0], the ELBO starts to decrease at σ²_n ≈ 3.2, while the UBML keeps decreasing. This means that as σ²_n increases, both the ELBO and UBML decrease, indicating that the model becomes less and less able to explain the data. As σ²_n increases further (σ²_n ∈ [20.0, 200.0]), the decreasing speeds of the ELBO and UBML become similar and approach zero. This means that the UBML and ELBO both converge and together define an interval for the marginal likelihood, which, however, can correspond to non-optimal hyperparameters. Our conclusion is that when σ²_n increases, the UBML tends to decrease, which lowers the maximum that the ELBO can reach. The ELBO, on the other hand, is robust to changes of σ²_n, as it keeps increasing while σ²_n is below ∼3.2. However, once σ²_n exceeds this threshold, the ELBO begins to decrease, indicating that the GP model becomes less and less reliable.
Nevertheless, both the ELBO and UBML converge even when σ²_n becomes very large, although we can no longer trust the model at that point.
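The qualitative behaviour of the ELBO under increasing σ²_n can be reproduced with a small sketch. It assumes the collapsed variational bound of [24] with a Nyström-type G_xx = K_xm K_mm^{-1} K_mx; this is an assumption about the exact form of Equation (30), and the inducing inputs and kernel settings below are illustrative:

```python
import numpy as np

def elbo(y, K_nn, K_nm, K_mm, sigma2):
    """Collapsed variational lower bound (Titsias-style):
    log N(y | 0, Q + sigma2 I) - Tr(K_nn - Q) / (2 sigma2),
    with the Nystrom approximation Q = K_nm K_mm^{-1} K_mn."""
    n = len(y)
    Q = K_nm @ np.linalg.solve(K_mm, K_nm.T)
    C = Q + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    fit = -0.5 * y @ np.linalg.solve(C, y)
    norm = -0.5 * (logdet + n * np.log(2 * np.pi))
    trace_term = -np.trace(K_nn - Q) / (2 * sigma2)
    return fit + norm + trace_term

x = np.linspace(0, 3, 20)
z = np.linspace(0, 3, 5)                   # inducing inputs (assumed)
k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / 2)
K_nn, K_nm = k(x, x), k(x, z)
K_mm = k(z, z) + 1e-8 * np.eye(5)          # jitter for numerical stability
y = np.sin(2 * x)
vals = [elbo(y, K_nn, K_nm, K_mm, s2) for s2 in (0.5, 3.0, 20.0, 200.0)]
```

Sweeping σ²_n in this way shows the bound eventually decreasing for large noise variances, mirroring the trend discussed for Figure 1.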

Experiments and Analysis
To verify that the proposed solution can help identify the impacts of σ²_n and θ on the prediction accuracy and uncertainty of the GP model and its sparse variants, such as the fully independent training conditional (FITC) [25] and variational free energy (VFE) [24] models, we conduct various experiments on air quality data collected from Sheffield, UK, and Peshawar, Pakistan (see Appendix A), during the three-week period 24 June-14 July 2019; the weeks are denoted as W1, W2, and W3 hereafter. The data were collected with digital sensors, called AQMesh pods, at 15 min intervals. Though the sensors can measure the concentrations of quite a few atmospheric pollutants, here we only analyse the concentrations of NO, NO_2, SO_2, and PM_2.5. Figure 2 shows the raw data. We can see directly that the air quality of Sheffield is much better than that of Peshawar on average. In particular, during daytime, concentrations of NO_2 and PM_2.5 in Peshawar exceed the WHO criteria (see Appendix B), while those in Sheffield are much lower than the criteria. Being a postindustrial city, Sheffield has improved its air quality significantly, and this experience can be shared to help cities like Peshawar improve theirs. Figures 3 and 4 show the forecasting results of GPs, FITC, and VFE for Sheffield and Peshawar, with 3σ confidence intervals (denoted as Conf in the figures) indicated by the shaded area. We can see that the full GP model reports the best results in general, in terms of the absolute error between predictions and measurements (denoted as Meas in the figures). However, the performance of all the models varies across pollutant types and cities. This is one of the reasons why investigating how the measurement noise level and hyperparameters affect prediction accuracy and uncertainty is necessary. To make the results more convincing, we normalise the data from both cities for the uncertainty quantification studies.

Impacts of Measurement Noise Level and Hyperparameters
To demonstrate how the noise level σ²_n and the hyperparameters affect prediction accuracy and uncertainty, three sets of experiments are conducted. This paper adopts the squared exponential (SE) kernel, with hyperparameters s_f and l; the analytical derivations can be found in Appendix C. The prediction accuracy is measured by the root mean square error (RMSE), as shown in Equation (34), while the uncertainty is measured by the ½σ confidence bound. The configurations of the experiments are as follows.
Experiment 1: Impacts of σ²_n on prediction accuracy and uncertainty. Both s_f and l are fixed at their optimised values. σ²_n varies from 0.1 through to 20.0. NO, NO_2, SO_2, and PM_2.5 data from both cities are processed. Six inducing points are applied to both FITC and VFE.
Experiment 2: Impacts of s_f on prediction accuracy and uncertainty. l is set to the optimised value. s_f varies from 0.1 through to 30.0. σ²_n is set to 0.5 and 1.5, respectively. NO data from both cities are processed. Six inducing points are applied to both FITC and VFE.

Experiment 3: Impacts of l on prediction accuracy and uncertainty. s_f is set to the optimised value. l varies from 0.1 through to 30.0. σ²_n is set to 0.5 and 1.5, respectively. NO data from both cities are processed. Six inducing points are applied to both FITC and VFE.
RMSE = √( (1/Num) Σ_{i=1}^{Num} (y_i − ŷ_i)² ),   (34)

where y_i is the ground truth value, ŷ_i represents the predicted mean, and Num is the number of samples in the testing set. Figures 5 and 6 show the results of Experiment 1. To make the results more distinguishable, the horizontal axes of the figures are set to log(σ²_n). We can see from Figure 5 that when σ²_n is small, GPs perform the best in general, while the performance of FITC and VFE varies. We can also observe that as σ²_n keeps increasing, the RMSE becomes very large for all methods and pollutants. Similar results can be observed in Figure 6. Both comply with our theoretical conclusions, despite the Neumann series being used to approximate the matrix inverse. We also notice that σ²_n has a more significant impact on the Sheffield data, as the RMSE increases earlier after log(σ²_n) reaches zero. From Figure 6b,c, we also see that the uncertainty bounds of the Sheffield data are greater after log(σ²_n) reaches zero. We think the reason is that the Sheffield data are generally less periodic than the Peshawar data (see Figure 2), which influences the performance of the models. Figure 7 shows the results of Experiment 2. According to our theoretical results, the impact of s_f on the uncertainty should become greater as s_f increases. This is verified by the results shown in Figure 7b,d. Our theoretical results also suggest that the variation of s_f should not affect the prediction accuracy. We can see from Figure 7a,c that when s_f is small, it does affect the prediction accuracy, but once it exceeds a certain value, the impact becomes negligible. Considering the Neumann series approximation, we would say that the experimental results comply with the theoretical conclusion.
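The RMSE of Equation (34) is straightforward to compute:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (34): sqrt(mean((y_i - y_hat_i)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```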

The results of Experiment 3 are shown in Figure 8. We can see that when l is small, both the RMSE and the uncertainty bounds change rapidly, while after l exceeds a certain value, both converge. This again complies with our theoretical conclusions and simulation results. We should also note from Figures 7 and 8 that increasing s_f tends to increase the uncertainty, whereas increasing l tends to decrease it. Taking both into consideration, an optimised uncertainty bound can be obtained.

Impacts of Noise Level on ELBO and UBML
We also conduct an experiment to demonstrate how the noise level σ²_n affects the ELBO and UBML. In this experiment, we set σ²_n to vary from 0.5 to 4.5. The results are shown in Figure 9. To make the results distinguishable, we set the vertical axes to log(−ELBO/UBML); to make the logarithm work, we reverse the signs of both the ELBO and UBML, which is why the ELBO appears 'greater' than the UBML in Figure 9. The full GP model is trained with σ²_n set to {1, 7, 13, 19, 25, 31, 37, 43, 49} to obtain nine sets of hyperparameters. For each set, we then let σ²_n vary from 0.5 to 4.5. The darker the colour in Figure 9, the smaller the σ²_n used for model training. We can see that, in general, a greater σ²_n slows down the convergence of both the ELBO and UBML during model training. Once the model is trained, increasing σ²_n lowers the UBML, which is the maximum that the ELBO can reach. This implies that increasing σ²_n can cause a sparse GP model to fail, as the ELBO is central to determining a sparse GP model. The experimental results again comply with our theoretical conclusions.

Conclusions
This paper proposes a general method to investigate how the performance variation of a Gaussian process model can be attributed to factors such as the hyperparameters and measurement noise. The method is demonstrated by applying it to particulate matter (e.g., PM_2.5) and gaseous pollutant (e.g., NO, NO_2, and SO_2) data from both Sheffield, UK, and Peshawar, Pakistan. Experimental results show that the proposed method provides insights into how measurement noise and hyperparameters affect the prediction performance of a Gaussian process. The results align with the analytical derivations, which are enabled by adopting the Neumann series to approximate the matrix inversions in Gaussian process models. The theoretical findings and experimental results combined demonstrate that the proposed method can generate air quality forecasting results while providing a way to link uncertainties in measurements and hyperparameters with the forecasting results. This will help with forecasting performance analysis when the measurement noise level or the model hyperparameters vary, making the method more general.

Appendix A. The Air Quality Monitoring Sites

Peshawar is predominantly hot during summer (May to mid-July), with an average maximum temperature of 40 °C, followed by the monsoon and a cold winter.
Local vehicular emissions, fossil fuel energy plants, and industrial processes are the significant sources of air pollution in Peshawar. Wind direction and wind speed also play a crucial role in transboundary pollution build-up. Furthermore, at this site, the distribution and dispersion of air pollution are further impacted by the nearby buildings and the proximity to Grand Trunk Road, which create a built-up street canyon environment in which pollution, generated primarily by the increasing nearby traffic, accumulates.
The air quality monitoring sensor (AQMS) was installed at the University of Peshawar's Physics Department Building (see Figure A1), at a height of 6 m above ground level. It is described as an urban background site.
Sheffield (53°23′ N, 1°28′ W) is a geographically diverse city in the county of South Yorkshire, UK. Built on several hills, it sits at elevations of 29-500 m above sea level and covers a total area of 367.9 km² with a growing population of 582,506. Sheffield is claimed by the local city council to be the "greenest city" in England. It enjoys a temperate climate, with July the hottest month at an average maximum temperature of 20.8 °C.
The air pollution in the city is primarily due to both road transport and industry, and to a lesser extent, fossil fuel-run processes, such as energy supply and commercial or domestic heating systems (for example, wood burners).
The AQMS is installed at a height of 2.5 m from the elevated ground surface in the playground of Hunter's Bar Infants School (see Figure A2), which lies in close proximity to a busy roundabout at the intersection of Ecclesall Road, Brocco Bank, Sharrow Vale Road, and Junction Road; thus, traffic is the primary source of pollution. It is also described as an urban background site. In our case, the AQMSs are commercial low-cost AQMesh sensor nodes, deployed at the two sites in Peshawar and Sheffield. A "black box" post-calibration is applied to the data by the manufacturer to eliminate the impacts of humidity and temperature on the sensors and to eliminate cross-sensitivity. The data are aggregated and sampled every 15 min, transferred to the cloud-based AQMesh database via integrated standard GPRS communication, and then accessed through the dedicated API.

Appendix B. The WHO Concentration Criteria for Pollutants
All data from 'WHO Air quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide' [26].

Appendix C. Approximated Derivatives of SE Kernel
By specifying a kernel function, we can obtain analytical forms of Equations (28) and (29) immediately. In this paper, we adopt the widely used SE kernel as an example:

k(x_i, x_j) = s_f² exp(−‖x_i − x_j‖² / (2 l²)).   (A1)
Two hyperparameters are involved, i.e., the signal variance s_f and the length-scale l. Equations (A2) and (A3) show the expectation (prediction mean) partial derivative (EPD) and the covariance partial derivative (CPD) with respect to s_f, which involve terms of the form

(∂k_oj/∂s_f) d_ji k_oi + k_oj (∂d_ji/∂s_f) k_oi − k_oj d_ji (∂k_oi/∂s_f).
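The analytical derivatives of the SE kernel with respect to s_f and l can be cross-checked by finite differences (a sketch; `se_and_grads` is an illustrative name, not from the paper):

```python
import numpy as np

def se_and_grads(r2, s_f, l):
    """SE kernel k = s_f^2 exp(-r2 / (2 l^2)) for squared distance r2,
    and its partial derivatives with respect to s_f and l."""
    k = s_f**2 * np.exp(-r2 / (2 * l**2))
    dk_dsf = 2 * k / s_f          # = 2 s_f exp(-r2 / (2 l^2))
    dk_dl = k * r2 / l**3         # chain rule on the exponent
    return k, dk_dsf, dk_dl
```

Both partials agree with central finite differences, confirming the analytical forms used in the EPD/CPD expressions.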