Probabilistic Deep Learning to Quantify Uncertainty in Air Quality Forecasting

Data-driven forecasts of air quality have recently achieved more accurate short-term predictions. However, despite their success, most of the current data-driven solutions lack proper quantifications of model uncertainty that communicate how much to trust the forecasts. Recently, several practical tools to estimate uncertainty have been developed in probabilistic deep learning. However, there have not been empirical applications and extensive comparisons of these tools in the domain of air quality forecasts. Therefore, this work applies state-of-the-art techniques of uncertainty quantification in a real-world setting of air quality forecasts. Through extensive experiments, we describe training probabilistic models and evaluate their predictive uncertainties based on empirical performance, reliability of confidence estimate, and practical applicability. We also propose improving these models using “free” adversarial training and exploiting temporal and spatial correlation inherent in air quality data. Our experiments demonstrate that the proposed models perform better than previous works in quantifying uncertainty in data-driven air quality forecasts. Overall, Bayesian neural networks provide a more reliable uncertainty estimate but can be challenging to implement and scale. Other scalable methods, such as deep ensemble, Monte Carlo (MC) dropout, and stochastic weight averaging-Gaussian (SWAG), can perform well if applied correctly but with different tradeoffs and slight variations in performance metrics. Finally, our results show the practical impact of uncertainty estimation and demonstrate that, indeed, probabilistic models are more suitable for making informed decisions.


Introduction
Monitoring and forecasting real-world phenomena are fundamental use cases for many practical applications in the Internet of Things (IoT). For example, policymakers in municipalities can use forecasts of ambient air quality to make decisions about actions, such as informing the public or starting emission-reduction measures. The problem is that forecasts are both uncertain and intended for human interpretation, so decisions about specific actions should take forecast confidence into account. For instance, it may be best to only start a costly initiative for cleaning streets of dust when the forecast of air pollutants exceeds a certain threshold and the reported confidence is high. Therefore, quantifying the predictive confidence is crucial for learning, providing, and interpreting reliable forecasting models.
Progress in probabilistic machine learning [1] and, more recently, in probabilistic deep learning led to the development of practical tools to estimate uncertainty about models and predictions [2][3][4][5][6][7]. These tools have been successfully used in various domains, such as computer vision [8,9], language modeling [10,11], machine translation [12] and autonomous driving [13]. All of these tools address quantifying predictive uncertainty but differ in techniques, approximations, and assumptions when representing and manipulating uncertainty.
The successful application of probabilistic models to real human problems requires us to bridge the gap from the theory of these disparate approaches to practical concerns. In particular, we need an empirical evaluation and comparison of these many techniques. We want to know how they perform with respect to prediction accuracy, uncertainty quantification, and other requirements. Specifically, we want to know the reliability of their confidence estimate, meaning if they are actually more accurate and trustworthy when their confidence is high. This is practically important for policymakers to make risk-informed decisions. Such an empirical evaluation, especially in the domain of air quality, is currently lacking.
Existing studies already incorporate aspects of uncertainty into their data-driven forecasts, but they often only quantify aleatoric uncertainty. This uncertainty is inherent in the observed data and can be quantified by a distribution over the model's output using softmax or Gaussian. There is also epistemic uncertainty that we can and should quantify as well. This type of uncertainty represents how much the model does not know, for instance, in regions of the input space with little data. Probabilistic models can capture both types of uncertainties, which makes them practically appealing since they convey more information about the reliability of the forecast. They also provide a wider area of control over the decision boundary and the risk profile of a decision. For instance, Figure 1 illustrates the potential of quantifying both types of uncertainties in decision making. It shows the decision F1 score as a function of normalized aleatoric and epistemic confidence thresholds in a probabilistic model. Each point on the surface represents the resulting score of accepting the model predictions only where its confidence is above specific thresholds (τ1 and τ2). Expressing both uncertainty dimensions gives more control regarding false positives and false negatives and their associated risks. We will come back to this Figure in Section 5.

In this work, we develop a set of deep probabilistic models for air quality forecasting that quantify both aleatoric and epistemic uncertainties and study how to represent and manipulate their predictive uncertainties. Our contributions are the following:
• We conduct a broad empirical comparison and exploratory assessment of state-of-the-art techniques in deep probabilistic learning applied to air quality forecasting. Through exhaustive experiments, we describe training these models and evaluating their predictive uncertainties using various metrics for regression and classification tasks.
• We improve uncertainty estimation using adversarial training to smooth the conditional output distribution locally around training data points.
• We apply uncertainty-aware models that exploit the temporal and spatial correlation inherent in air quality data using recurrent and graph neural networks.
• We introduce a new state-of-the-art example for air quality forecasting by defining the problem setup and selecting proper input features and models.
Our results show that all the considered probabilistic models perform well on our application, with slight variations in different performance metrics. Bayesian neural networks perform slightly better in proper scoring rules that measure the quality of probabilistic predictions. In addition, we show that smoothing predictive distributions by adversarial training improves metrics that punish incorrect, overconfident predictions, especially in regression tasks, since the forecasted phenomena are inherently smooth.
The rest of this paper is organized as follows: In Section 2, we outline the related work in uncertainty estimation and air quality forecasting. We then define the problem setup and introduce non-probabilistic baselines and evaluation metrics in Section 3. In Section 4, we present deep probabilistic models applied to the air quality forecast, describe how we train and evaluate them, and show how to improve their uncertainty estimates. We close with a comparative analysis of the selected models in Section 5 and a conclusion in Section 6. Code and dataset are available at https://github.com/Abdulmajid-Murad/deep_probabilistic_forecast (accessed on 27 November 2021).

Related Work
Forecasting ambient air quality is crucial to support decision-making in urban management. Therefore, a sizeable body of work addresses building air quality forecasting models. For example, MACC (Monitoring Atmospheric Composition and Climate) [14] is a European project that provides air quality analysis and forecasting services for the European continent. It uses physics-based modeling that combines transport, chemistry, and satellite data to provide a multi-model ensemble forecast of atmospheric composition (daily forecasts with hourly outputs of 10 chemical species/aerosols). Walker et al. [15] used the output of MACC ensemble-based probabilistic forecasting of air pollution and performed statistical post-processing, calibrating the predictive distribution using a Box-Cox transformation to correct the skewness of air pollution data. Additionally, they discussed model selection and verification using Akaike and Bayesian information criteria. To obtain the ensemble forecast from MACC, they introduced stochastic perturbations to the emissions. Garaud et al. [16] performed a posterior calibration of multi-model ensembles using a mixed optimization algorithm to extract a sub-ensemble.
The official air quality forecasting service in Norway [17] uses Gaussian dispersion modeling that provides a 2-day hourly forecast with high-resolution coverage (between 250 and 50 m grid) over the entire country [18,19]. The forecast is based on weather conditions, polluting emissions, and terrain. In particular, it uses the chemical transport model uEMEP (urban European Monitoring and Evaluation Program) [18] and a road dust emission model [20]. Denby et al. [21] analyze the accuracy of the Norwegian air quality forecasting by comparing model calculations with measurements. They show that the model's forecast of particle dust is marginally better than the persistence forecast and offer some explanations for why the model calculations deviate from the observations.
Although physics-based models [18,19,22] can provide long-range air pollution information, they require significant domain knowledge and complex modeling of dispersion, chemical transport, and meteorological processes. Additionally, physics-based models involve structural uncertainty, have low spatial resolution, and do not capture abrupt, short-term changes in air pollution. Data-driven modeling based on historical data [23][24][25] can complement physics-based modeling by learning directly from air quality measurements and providing more reliable short-term predictions. For example, Lepperod et al. [23] deployed stationary and mobile micro-sensor devices and used Narrowband IoT (NB-IoT) to aggregate air quality data from these sensors. Then, they applied machine learning methods to predict air quality in the next 48 h using observations of sensors' measurements, traffic, and weather data. Zhou et al. [25] used long short-term memory (LSTM) networks to forecast multi-step time-series of air quality, while Mokhtari et al. [26] proposed combining a convolutional neural network (CNN) with LSTM for air quality prediction and quantified the uncertainty using quantile regression and MC dropout. Tao et al. [27] used 1D CNNs and a bidirectional gated recurrent unit (GRU) for a short-term forecast of fine air particles. Pucer et al. [28] used Gaussian Processes (GP) to forecast daily air-pollutant levels, and Aznarte et al. [29] proposed using quantile regression for probabilistic forecasting of extreme nitrogen dioxide (NO2) pollution.
Most of the current data-driven forecasts give point predictions of a deterministic nature. Thus, they lack useful estimates of their predictive uncertainty that convey more information about how much to trust the forecast. Recently, quantifying prediction uncertainty has garnered increasing attention in machine learning fields, including deep learning. Bayesian methods are among the most used approaches for uncertainty estimation in neural networks. Given a prior distribution over the parameters, Bayesian methods use training data to compute a posterior distribution. Using the obtained distribution, we can easily quantify the predictive uncertainty. This approach has been extended to neural networks, theoretically allowing for the accuracy of modern prediction methods with valid uncertainty measures; however, modern neural networks contain many parameters, and obtaining explicit posterior densities through Bayesian inference is computationally intractable. Instead, there exist a variety of approximation methods that estimate the posterior distributions. These methods can be decomposed into three main categories, either based on variational inference [3,30,31], Markov chain Monte Carlo (MCMC) [32][33][34], or Laplace approximation [2,35]. In this paper, we will use variational Bayesian methods and approximate Bayesian inference methods.

Problem Setup
We are trying to build a model for forecasting air quality trends at pre-defined locations and for a specified forecast horizon. In our case study, we want to forecast the level of microscopic particles in the air, known as particulate matter (PM). These are inhalable particles of two types: coarse particles with a diameter less than 10 µm (PM 10 ) and fine particles with a diameter less than 2.5 µm (PM 2.5 ). The forecast predicts pollutant levels for the next 24 h at four monitoring stations in the city of Trondheim, as shown in Figure 2. The stakeholders, policymakers of the municipality, would like to estimate whether the concentration of air particles exceeds certain thresholds following the Common Air Quality Index (CAQI) used in Europe [36], as shown in Table 1. This can be achieved through value regression or by directly classifying threshold exceedance levels. We will explore probabilistic models that forecast air quality values and predict threshold exceedance events. Using probabilistic models provides more qualitative information since decision-making largely depends on the credibility intervals of specific predictions. The forecast is based on explanatory variables, such as historical air quality measurements, meteorological data, traffic, and street-cleaning reports from the municipality.

Although the CAQI (Table 1) specifies five levels of air pollutants, the air quality in the city of Trondheim is usually at the Very Low level and rarely exceeds the Low level. For example, Figure 3 shows the air quality level over one year at a representative monitoring station (Elgeseter). Therefore, instead of a multinomial classification task (with five classes), we transform the problem into a threshold exceedance forecast task in which we try to predict the points in time where the air quality exceeds the Very Low level. Appendix C contains more details on the class imbalance of air quality data due to its right-skewed (asymmetric) distributions.

The air quality dataset we use is a part of the official measuring network in Europe. Specifically, we use the open database of air quality measurements offered by the Norwegian Institute for Air Research (NILU) [37]. The meteorological data are based on historical weather and climate data offered by the Norwegian Meteorological Institute [38]. The traffic data are based on aggregated traffic volumes offered by the Norwegian Public Roads Administration [39]. A more detailed description of the used datasets can be found in Appendix A.

Epistemic and Aleatoric Uncertainty
Before quantifying the predictive uncertainty, it is worth distinguishing the different sources of uncertainty and the appropriate actions to reduce them. The first source is model or epistemic uncertainty, which is uncertainty in the model parameters in regions of the input space with little data (i.e., data sparsity). This type of uncertainty can be reduced given enough data. By estimating the epistemic uncertainty of a model, we can obtain its confidence interval (CI). The second source of uncertainty is data or aleatoric uncertainty. It is essentially noise inherent in the observations (i.e., input-dependent) due to either sensor noise or entropy in the true data-generating process. By estimating the aleatoric and epistemic uncertainties, we can obtain the prediction interval (PI) [9,40]. Accordingly, prediction intervals are wider than confidence intervals. The third source of uncertainty is model misspecification, i.e., uncertainty about the general structure of the model, such as the model type, the number of nodes, and the number of layers. It is also related to the bias-variance tradeoff [41].

Non-Probabilistic Baselines
As a baseline for comparison and to motivate the use of evaluation metrics, we test the performance of non-probabilistic models, such as persistence, XGBoost, and gradient boosting models. A simple baseline predictor is a persistence forecast, which uses the diurnal patterns of the observations. To predict a value in the future, we use the value observed 24 h earlier. Figure 4 shows the results of the persistence model when forecasting the PM-value over one month (January 2020). Suppose ŷ_t ∈ R is the forecast value, while y_t ∈ R is the true observed value at time t; then we can evaluate the aggregated accuracy over a time period T using the root-mean-square error:

RMSE = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t - y_t)^2}    (1)

Next, we evaluate the performance of XGBoost (eXtreme Gradient Boosting) [42] as a non-probabilistic baseline in our problem setting. XGBoost uses the same model representation and inference as Random forests (i.e., gradient-boosted decision trees) but has a different training mechanism since it uses the second-order approximation of the training objective. We train the model over one year of data (2019) and test its performance over one month (January 2020) of a 24-h forecast horizon. Figure 5 shows the results of the XGBoost model in one representative monitoring station. The results show an improved prediction accuracy compared to the persistence forecast, which illustrates the value of a learned predictor for time-series forecasting of air quality. One advantage of using an XGBoost model is the feasibility of retrieving the feature importance by assigning a score to each input feature, which indicates how useful that feature is when making a prediction, thus contributing to prediction interpretation. Figure 6 shows the feature importance of the XGBoost forecast shown in Figure 5 (a more detailed description of the features can be found in Appendix A).

For the threshold exceedance prediction task, we can use XGBoost to train a binary classifier to predict the probability of threshold exceedance. By using a proper scoring rule [43], such as the cross-entropy, as a training criterion, we obtain predictive probabilities. Notably, these probabilities do not represent the model (epistemic) uncertainty. Rather, they represent data (aleatoric) uncertainty. In addition to cross-entropy, we need to evaluate performance using another metric that the model was not optimizing. We can use the Brier score [44], which is a strictly proper scoring rule [43] and a metric to evaluate the reliability of predictive probabilities. Suppose at a time t, the true class label is represented by o_t (1 when the pollutant level exceeds the threshold, 0 if not), while the predicted probability is represented by p̂_t ∈ [0, 1]. Then we calculate the cross-entropy (CE) and Brier score (BS) as follows:

CE = -\frac{1}{T} \sum_{t=1}^{T} \left[ o_t \log \hat{p}_t + (1 - o_t) \log(1 - \hat{p}_t) \right]    (2)

BS = \frac{1}{T} \sum_{t=1}^{T} (\hat{p}_t - o_t)^2    (3)

Both metrics heavily punish overconfident, incorrect predictions. Cross-entropy uses exponential punishment (heavily emphasizes tail probabilities) since it is a negative log-likelihood loss. Thus, it is sensitive to the predicted probabilities of the infrequent class (i.e., threshold exceedance events). In contrast, the Brier score uses quadratic punishment since it is a mean square error in the probability space. Thus, it treats the predicted probabilities of both classes equally.
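As an illustration of this baseline, the following sketch trains an XGBoost regressor and classifier and computes the metrics of Equations (1)-(3). The data arrays, hyperparameters, and variable names are placeholder assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch of the non-probabilistic XGBoost baseline and the metrics in
# Equations (1)-(3). Synthetic arrays stand in for the real features/targets,
# and all hyperparameters are illustrative assumptions.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))                      # placeholder feature matrix
y = 5 * X[:, 0] + rng.normal(size=1000)              # placeholder PM values
o = (y > 5).astype(int)                              # placeholder exceedance labels
X_tr, X_te = X[:800], X[800:]
y_tr, y_te, o_tr, o_te = y[:800], y[800:], o[:800], o[800:]

def rmse(y_true, y_pred):                            # Equation (1)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def cross_entropy(o_true, p_hat, eps=1e-12):         # Equation (2)
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(o_true * np.log(p_hat) + (1 - o_true) * np.log(1 - p_hat))

def brier_score(o_true, p_hat):                      # Equation (3)
    return np.mean((p_hat - o_true) ** 2)

reg = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
reg.fit(X_tr, y_tr)
print("RMSE:", rmse(y_te, reg.predict(X_te)))

clf = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05)
clf.fit(X_tr, o_tr)
p_hat = clf.predict_proba(X_te)[:, 1]                # aleatoric exceedance probability
print("CE:", cross_entropy(o_te, p_hat), "BS:", brier_score(o_te, p_hat))
```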
Additionally, we can use the commonly used metrics for classification tasks [45] to evaluate the performance of a deterministic prediction. By converting the predictive probabilities into predictive class labels, we can calculate the number of true positives tp, false positives fp, and false negatives fn. Then, we can use the metrics of Precision : R → [0, 1], Recall : R → [0, 1], and F1 : R → [0, 1] score to evaluate the threshold exceedance prediction as follows:

Precision = \frac{tp}{tp + fp}    (4)

Recall = \frac{tp}{tp + fn}    (5)

F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}    (6)

Figure 7 shows the results of the XGBoost model when predicting the threshold exceedance probability. The blue dots indicate the points in time (hours) when air pollutants actually exceeded the threshold level. The red line represents the predicted probability in percentages. A higher predicted probability indicates that the model considers a threshold exceedance event more likely at that point in time.

Quantile Regression
Using an XGBoost model leads to an improved prediction accuracy ( Figure 5), but with a caveat that it only produces point estimates of the mean. To address this, we can use quantile regression [46] to estimate a specific percentile (i.e., quantile) of the target variable. The quantile regression estimates the conditional quantile by minimizing the absolute residuals. In contrast, regular regression estimates the conditional mean by minimizing the least squared residuals (focuses on central tendency). The advantages of using quantile regression are that it focuses more on dispersion or variability, does not assume homoscedasticity in the target variable (i.e., having the same variance independent of the input variable), and is more robust against outliers. We use a Gradient Tree Boosting model [47] to estimate the 5% and 95% percentiles, as shown in Figure 8.
Given a predicted lower bound L̂_t ∈ R and upper bound Û_t ∈ R, we can assess the quality of the generated prediction interval using metrics such as the Prediction Interval Coverage Probability (PICP : R → [0, 1]) and the Mean Prediction Interval Width (MPIW : R → [0, ∞)):

PICP = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}(\hat{L}_t \leq y_t \leq \hat{U}_t)    (7)

MPIW = \frac{1}{T} \sum_{t=1}^{T} (\hat{U}_t - \hat{L}_t)    (8)

where \mathbb{1} is the Heaviside step (indicator) function. PICP indicates how often the prediction interval captures the true values, ranging from 0 (all outside) to 1 (all inside). Intuitively, we seek a model with a narrow prediction interval (i.e., smaller MPIW) while capturing the observed data points (i.e., larger PICP). For example, when forecasting PM 2.5 pollutants in Figure 8, PICP = 0.61 indicates that in 61% of the time-steps, the observed values are inside the predicted intervals, as compared to 72% when forecasting PM 10 .
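The sketch below shows one way to produce the 5% and 95% quantile forecasts with scikit-learn's gradient boosting and to compute PICP and MPIW (Equations (7) and (8)). The synthetic data and model settings are illustrative assumptions, not our exact setup.

```python
# Sketch of a quantile-regression baseline and the PICP/MPIW metrics using
# scikit-learn's GradientBoostingRegressor with the quantile loss.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] + 0.5 * rng.normal(size=500)
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# One model per quantile: 5% lower bound and 95% upper bound.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X_tr, y_tr)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X_tr, y_tr)
L_hat, U_hat = lower.predict(X_te), upper.predict(X_te)

def picp(y_true, L, U):
    """Fraction of observations falling inside the predicted interval."""
    return np.mean((y_true >= L) & (y_true <= U))

def mpiw(L, U):
    """Average width of the predicted interval."""
    return np.mean(U - L)

print(f"PICP={picp(y_te, L_hat, U_hat):.2f}, MPIW={mpiw(L_hat, U_hat):.2f}")
```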

Deep Probabilistic Forecast
This section explores probabilistic models for air quality forecasting and describes how to quantify their predictive uncertainty. We assume we have a training dataset D = {x_t, y_t}_{t=1}^{T} with an input feature x_t ∈ R^D and a corresponding observation y_t ∈ R for each monitoring station. For the time-series forecasting task, we estimate the aleatoric uncertainty by employing the Mean-Variance Estimation method [48], in which we have a neural network with two output nodes (for every individual time-series) that estimates the mean μ̂_t ∈ R and variance σ̂²_t ∈ (0, ∞) of the target probability distribution. Additionally, we use the negative log-likelihood (NLL : (0, ∞) → (0, ∞)) as a training criterion since the RMSE does not capture the predictive uncertainty. By treating the observed value y_t as a sample from the target distribution, we can calculate the NLL as follows:

NLL = \frac{1}{T} \sum_{t=1}^{T} \left[ \frac{1}{2} \log(2\pi \hat{\sigma}^2_t) + \frac{(y_t - \hat{\mu}_t)^2}{2 \hat{\sigma}^2_t} \right]    (9)

Notably, we treat the target distributions as having heteroscedastic uncertainty (i.e., non-constant variance) and diagonal covariance matrices, which simplifies the training criterion of multivariate time-series. For the threshold exceedance prediction task, we estimate the aleatoric uncertainty by having a neural network that outputs the predictive probability and is trained with the cross-entropy criterion.
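As a concrete illustration of the Mean-Variance Estimation setup, the following PyTorch sketch defines a network with mean and log-variance heads and the Gaussian NLL of Equation (9). Layer sizes and variable names are assumptions for illustration only.

```python
# Minimal sketch of a Mean-Variance Estimation network and the heteroscedastic
# Gaussian NLL training criterion (Equation (9)); sizes are illustrative.
import math
import torch
import torch.nn as nn

class MeanVarianceNet(nn.Module):
    def __init__(self, n_features, n_series):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                                      nn.Linear(128, 64), nn.ReLU())
        self.mean = nn.Linear(64, n_series)       # mu_t for each time-series
        self.log_var = nn.Linear(64, n_series)    # log sigma^2_t keeps the variance positive

    def forward(self, x):
        h = self.backbone(x)
        return self.mean(h), self.log_var(h)

def gaussian_nll(y, mu, log_var):
    """Heteroscedastic Gaussian NLL with diagonal covariance."""
    return (0.5 * (math.log(2 * math.pi) + log_var
                   + (y - mu) ** 2 / log_var.exp())).mean()

# Toy usage with random data standing in for the air quality features.
model = MeanVarianceNet(n_features=16, n_series=2)
x, y = torch.randn(32, 16), torch.randn(32, 2)
mu, log_var = model(x)
loss = gaussian_nll(y, mu, log_var)
loss.backward()
```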
Although epistemic uncertainty is model-dependent, it is essentially estimated by measuring the dispersion in predictions when running several inference steps over a specific data point. For the PM-value regression task, every single forward pass (i) outputs a normal distribution (with mean μ̂_{t,i} and variance σ̂²_{t,i}). Thus, multiple forward passes result in a (uniformly weighted) mixture of normal distributions with the mean μ̂_{t,mix} ∈ R and variance σ̂²_{t,mix} ∈ (0, ∞) calculated as follows:

\hat{\mu}_{t,mix} = \frac{1}{M} \sum_{i=1}^{M} \hat{\mu}_{t,i}

\hat{\sigma}^2_{t,mix} = \frac{1}{M} \sum_{i=1}^{M} \left( \hat{\sigma}^2_{t,i} + \hat{\mu}^2_{t,i} \right) - \hat{\mu}^2_{t,mix}

Given a predicted mean and variance, we can then estimate a point prediction ŷ_t = μ̂_{t,mix}, a lower bound L̂_t = μ̂_{t,mix} − z σ̂_{t,mix}, and an upper bound Û_t = μ̂_{t,mix} + z σ̂_{t,mix}, where z is the standard score of a normal distribution. For example, for a 95% prediction interval (i.e., P(L̂_t < y_t < Û_t) = 0.95), we use z = 1.96. For the threshold exceedance prediction task, every single forward pass outputs a predictive probability. Thus, we can combine predictions from multiple forward passes by simply averaging the predicted probabilities.
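A minimal sketch of this combination step is given below: the mixture mean and variance are computed from M stochastic forward passes, and a 95% prediction interval is derived with z = 1.96. The arrays are synthetic placeholders.

```python
# Combine M stochastic forward passes into a uniformly weighted Gaussian mixture.
import numpy as np

def combine_passes(mus, vars_):
    """mus, vars_: arrays of shape (M, T) from M forward passes."""
    mu_mix = mus.mean(axis=0)
    # aleatoric term (mean of variances) + epistemic term (dispersion of means)
    var_mix = vars_.mean(axis=0) + (mus ** 2).mean(axis=0) - mu_mix ** 2
    return mu_mix, var_mix

M, T = 1000, 24
mus = np.random.normal(10.0, 1.0, size=(M, T))       # placeholder per-pass means
vars_ = np.random.uniform(0.5, 1.5, size=(M, T))     # placeholder per-pass variances
mu_mix, var_mix = combine_passes(mus, vars_)
z = 1.96                                             # standard score for a 95% interval
lower = mu_mix - z * np.sqrt(var_mix)
upper = mu_mix + z * np.sqrt(var_mix)
```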
For empirical evaluation, we use the NLL and cross-entropy as evaluation metrics. Additionally, we use the Continuous Ranked Probability Score (CRPS) [43]. It is a widely used metric to evaluate probabilistic forecasts that generalizes the MAE to a probabilistic setting. CRPS measures the difference in the cumulative distribution function (CDF) between the forecast and the true observation. It can be derived analytically for parametric distributions or estimated using samples if the CDF is unknown (e.g., originating from VI or MCMC). Given a forecast CDF F and the empirical CDF of a scalar observation y:

CRPS(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}(x \geq y) \right)^2 dx
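When the predictive distribution is (approximately) Gaussian, the CRPS has a well-known closed form, sketched below; for sample-based posteriors it would instead be estimated from the samples. This is an illustrative helper, not the exact evaluation code of our experiments.

```python
# Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) against observations y.
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """Vectorized CRPS for Gaussian predictive distributions."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

print(crps_gaussian(np.array([12.0]), mu=np.array([10.0]), sigma=np.array([2.0])))
```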

Bayesian Neural Networks (BNNs)
In BNNs, we use Bayesian methods for inferring a posterior distribution over the weights p(w) rather than being constrained to weights of fixed values [2]. These weight distributions are parameterized by trainable variables θ. For example, the trainable variable can represent the mean and variance of a Gaussian distribution θ = (µ, σ 2 ), from which the weights can be sampled w ∼ N (µ, σ 2 ).
The objective of training is to calculate the posterior, but obtaining explicit posterior densities through Bayesian inference is intractable. Additionally, using Markov chain Monte Carlo (MCMC) to estimate the posterior can be computationally prohibitive. Instead, we can leverage new techniques in variational inference [49,50] that make BNNs computationally feasible by using a variational approximation q(w|θ) to the posterior. Thereby, the goal of learning is to find the variational parameters θ that minimize the Kullback-Leibler (KL) divergence between the variational approximation q(w|θ) and the true posterior distribution given the training data, p(w|D). This can be achieved by minimizing the negative variational lower bound of the marginal likelihood [30]:

\mathcal{L}(\theta) = \mathrm{KL}\left[ q(w|\theta) \,\|\, P(w) \right] - \mathbb{E}_{q(w|\theta)}\left[ \log P(\mathcal{D}|w) \right]

Assuming a Gaussian prior and posterior, we can compute the KL divergence in closed form:

\mathrm{KL}\left[ q(w|\theta) \,\|\, P(w) \right] = \sum_{j} \left[ \log \frac{\sigma_{p,j}}{\sigma_{q,j}} + \frac{\sigma_{q,j}^2 + (\mu_{q,j} - \mu_{p,j})^2}{2 \sigma_{p,j}^2} - \frac{1}{2} \right]

Additionally, the distribution over activations will also be Gaussian. Thus, we can take advantage of the local reparameterization trick [51], in which we sample from the distribution over activations rather than sampling the weights individually. Consequently, we reduce the variance of the stochastic gradients, resulting in faster and more stable training. However, to allow for more flexibility and adaptation to a wide range of situations, we can use non-Gaussian distributions over the weights and Monte Carlo gradients, as proposed by [3]:

\mathcal{L}(\theta) \approx \sum_{i=1}^{M} \left[ \log q(w_i|\theta) - \log P(w_i) - \log P(\mathcal{D}|w_i) \right]

where w_i is the i-th Monte Carlo sample drawn from the variational posterior q(w|θ), and − log P(D|w_i) is the NLL for time-series forecasting (Equation (9)) or the cross-entropy for the threshold exceedance prediction (Equation (2)). Our implementation uses Laplace distributions as priors and Gaussians as approximate posteriors in time-series forecasting, resulting in better empirical performance. For threshold exceedance prediction, using Gaussians for both the priors and posteriors and using the local reparameterization trick results in better performance. During inference, we sample the weight distributions and perform a forward pass to obtain a prediction. We use (M = 1000) samples for every data point to estimate the model's uncertainty. This corresponds to sampling from an infinite ensemble of neural networks. Therefore, combining the outputs from different samples gives information about the model's uncertainty.
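To make the variational training loop concrete, the sketch below implements a single mean-field Bayesian linear layer with the reparameterization trick, a closed-form Gaussian KL term, and a one-sample Monte Carlo estimate of the objective for the threshold exceedance task. It uses a Gaussian prior for simplicity (rather than the Laplace prior of our regression setup), and the layer sizes and KL weighting are illustrative assumptions.

```python
# Rough sketch of a mean-field Bayesian linear layer trained in the spirit of
# Bayes by Backprop [3]; the N(0, 1) prior and all sizes are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -4.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -4.0))

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)                 # keeps posterior std positive
        b_sigma = F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)   # reparameterized weight sample
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        self.kl = self._kl(self.w_mu, w_sigma) + self._kl(self.b_mu, b_sigma)
        return F.linear(x, w, b)

    @staticmethod
    def _kl(mu, sigma, prior_sigma=1.0):
        # closed-form KL between N(mu, sigma^2) and the N(0, prior_sigma^2) prior
        return (math.log(prior_sigma) - sigma.log()
                + (sigma ** 2 + mu ** 2) / (2 * prior_sigma ** 2) - 0.5).sum()

# Toy training step for threshold exceedance classification (cross-entropy + KL).
layer = BayesianLinear(16, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
x, o = torch.randn(64, 16), torch.randint(0, 2, (64, 1)).float()
logits = layer(x)
loss = F.binary_cross_entropy_with_logits(logits, o) + layer.kl / 10_000  # KL weight is illustrative
loss.backward()
opt.step()
```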
We train the BNN model using training data of one year (2019). Figure 9 shows the learning curves for the PM-value regression task. Then we evaluate the model using data of one month (January 2020). Figures 10 and 11 show the results of PM-value regression and threshold exceedance classification in one representative monitoring station. Table 2 shows a summary of performance results in all monitoring stations. The arrows alongside the metrics indicate which direction is better for that specific metric. Additionally, Appendix D.1 shows the performance results in all monitoring stations.

A less efficient approach to predicting threshold exceedance probability is to transform the time-series forecast into a binary prediction. Figure 12 shows the results of a BNN model. The result demonstrates that transforming PM-value regression into threshold exceedance classification leads to a loss in performance.

Standard Neural Networks with MC Dropout
Although BNNs are more flexible in reasoning about the model's uncertainty with Bayesian analysis, they are computationally less efficient and take longer to converge than standard (non-Bayesian) neural networks. Additionally, they require double the number of parameters at deployment compared to neural networks of the same size. A possible solution is to gracefully prune the weights with the lowest signal-to-noise ratio [3], but this leads to a loss in uncertainty information. Therefore, it would be more convenient to obtain uncertainty directly from standard neural networks. One simple approach we can use is Monte Carlo dropout as an approximate Bayesian method for representing model uncertainty. As shown in [4], MC dropout can be interpreted as performing variational inference. More specifically, it is mathematically equivalent to an approximation of a probabilistic deep Gaussian process.
Essentially, we train a standard neural network model with dropout. Then we keep the dropout during inference and run multiple forward passes using the same data input. This corresponds to sampling nodes, which is equivalent to sampling from ensembles of neural networks [52]. By measuring the spread in predictions, we estimate the predictive uncertainty. In our implementations, we train a standard neural network model with a 50% dropout rate. Then we evaluate it with 50% dropout and (M = 1000) samples. Figures 13 and 14 show the results of the PM-value regression and threshold exceedance classification in one representative monitoring station. Table 3 shows a summary of performance results in all monitoring stations, while Appendix D.2 contains the corresponding figures.
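The inference procedure can be sketched as follows: dropout layers are kept active at test time, and the prediction spread over repeated forward passes is used as the epistemic uncertainty. The toy network and sample counts are assumptions for illustration.

```python
# Sketch of MC dropout at inference time: keep dropout active and average over
# M stochastic forward passes; model and data are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.5),
                      nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.5),
                      nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=1000):
    model.eval()
    for m in model.modules():          # re-enable only the dropout layers
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)   # predictive mean and epistemic spread

mean, std = mc_dropout_predict(model, torch.randn(8, 16), n_samples=100)
```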

Deep Ensembles
An established method to improve performance is to train an ensemble of neural networks with different configurations [53]. Additionally, many probabilistic models can be interpreted as ensemble methods. For example, sampling the weights of BNNs is equivalent to sampling from an infinite ensemble of networks, while sampling the nodes in MC dropout compares to sampling from a finite ensemble [52]. Remarkably, we can also use (deep) ensembles to estimate the predictive uncertainty in neural network models [5].
Essentially, we train multiple neural networks with different parameter initialization on the same data. The stochastic optimization and random initialization ensure that the trained networks are sufficiently independent. During inference, we run a forward pass on the multiple neural networks using the same data input. We estimate the uncertainty by measuring the dispersion in predictions resulting from the multiple neural networks. In our implementation, we train an ensemble of 10 networks. Figures 15 and 16 show the results of PM-value regression and threshold exceedance classification in one representative monitoring station. Table 4 shows a summary of performance results in all monitoring stations, while Appendix D.3 contains the corresponding figures.
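A minimal sketch of this procedure is shown below: ten identically structured networks are trained from different random initializations on the same data, and their predictions are pooled at inference time. The toy data, architecture, and training length are illustrative assumptions.

```python
# Sketch of a deep ensemble: several networks, same data, different random seeds.
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

x, y = torch.randn(256, 16), torch.randn(256, 1)
ensemble = []
for seed in range(10):                      # 10 members, as in our implementation
    torch.manual_seed(seed)                 # different initialization per member
    net = make_net()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(100):                    # shortened training loop for illustration
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    ensemble.append(net)

with torch.no_grad():
    preds = torch.stack([net(x) for net in ensemble])
mean, epistemic_std = preds.mean(dim=0), preds.std(dim=0)
```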
The main drawback of the deep ensemble method is the computational cost and the number of parameters required at run time, which in our case is an order of magnitude higher than for a standard neural network. One solution is to use knowledge distillation [54] to compress the knowledge from the ensemble into a single model, but this also leads to a loss in uncertainty information.

Recurrent Neural Network with MC Dropout
While standard neural networks are powerful models at representational learning, they do not exploit the inherent temporal correlation in air quality data since they act only on static, fixed contextual windows. To address this shortcoming, we can use recurrent neural networks (RNNs), which have cyclic connections from previous time steps, to learn the temporal dynamics of sequential data. Specifically, a hidden state from the last time step is stored and used in addition to the input to generate the current state and output.
One class of RNNs is the long short-term memory (LSTM), which is used extensively in sequence modeling tasks, such as modeling language [55], forecasting weather [56] and traffic [57], recognizing human activity [58], and recently forecasting COVID-19 transmission [59]. An LSTM has gated memory cells to control how much information to forget from previous states and how much information to use to update current states [60].
A simple approach to capture uncertainty within an LSTM model is to use Monte Carlo dropout to approximate Bayesian inference [6]. To implement dropout, we apply a mask that randomly drops some network units with their input, output, and recurrent connections [61]. Our implementation trains a model of two LSTM layers with a 50% dropout rate and evaluates it with 50% dropout and (M = 1000) samples. Figures 17 and 18 show the results of the PM-value regression and threshold exceedance classification in one representative monitoring station. Table 5 shows a summary of performance results in all monitoring stations, while Appendix D.4 contains the corresponding figures.
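A simplified sketch of such an LSTM forecaster with MC dropout is given below. Note that PyTorch's built-in LSTM dropout is applied only between stacked layers; the variational, per-sequence dropout masks of [61] would require a custom recurrent cell. All sizes here are illustrative assumptions.

```python
# Simplified LSTM forecaster with dropout kept active at test time for MC sampling.
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_features, hidden=64, horizon=24):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, dropout=0.5)
        self.dropout = nn.Dropout(0.5)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(self.dropout(out[:, -1]))   # forecast from the last hidden state

model = LSTMForecaster(n_features=16)
model.train()                             # keep dropout active for MC sampling
with torch.no_grad():
    samples = torch.stack([model(torch.randn(4, 48, 16)) for _ in range(100)])
mean, epistemic_std = samples.mean(dim=0), samples.std(dim=0)
```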

Graph Neural Networks with MC Dropout
By using RNNs, we can capture the temporal (i.e., intra-series) correlations in the time-series of air quality data. However, we also need to exploit the inherent structural (i.e., inter-series) correlations between multiple sensing stations. We can use Graph Neural Networks (GNNs), which operate on graph-structured data, to address this. In essence, GNNs update the features of a graph node by aggregating the features of its adjacent nodes; in our case, a node can be a single monitoring station. In the end, GNNs apply a shared layer on each node to obtain a prediction for each node.
In our setting, we assume each sensing station to be a node in a weighted and directed graph, represented by a learnable adjacency matrix. We use a slight variation of the GNN suggested by Cao et al. [62] to forecast multivariate time-series of air quality. The main idea is to learn a correlation graph directly from data (i.e., learn a graph of sensor nodes without a pre-defined topology) and then learn the structural and temporal correlations in the frequency domain. To capture the temporal correlations, we use a layer of 1D convolution and three sub-layers of Gated Linear Units (GLUs) [63], while we use Graph Convolutional Networks (GCNs) [64] to capture the structural correlations. In the end, we use two shared layers of fully connected neural networks to predict each node's output.
We apply MC dropout to the GLUs, the GCNs, and the fully connected layers to estimate the model's uncertainty [65]. We train the model with a 50% dropout rate and evaluate it with 50% dropout and (M = 1000) samples. Figures 19 and 20 show the results of the PM-value regression and threshold exceedance classification in one representative monitoring station. Table 6 shows a summary of performance results in all monitoring stations, while Appendix D.5 contains the corresponding figures.

Stochastic Weight Averaging-Gaussian (SWAG)
An alternative approach to estimating uncertainty in a neural network is to approximate (Gaussian) posterior distributions over the weights using the geometric information in the trajectory of a stochastic optimizer. This approach is named Stochastic Weight Averaging-Gaussian (SWAG) [7]. Notably, SWAG does not optimize the approximate distributions directly, as in BNNs. Instead, it estimates the mean by calculating a running average of the weights traversed by the optimizer with a modified learning rate schedule (stochastic weight averaging [66]). In addition, SWAG approximates the covariance by a diagonal term plus a low-rank deviation matrix, using information from a running average of the second moment of the traversed weights.
During inference, we run multiple forward passes using the same data input while drawing samples of the weights from the approximate posterior. By measuring the spread in predictions, we estimate the predictive uncertainty. In our implementations, we train a simple feed-forward neural network model with SWAG and evaluate it with (M = 1000) samples. Figures 21 and 22 show the results of the PM-value regression and threshold exceedance classification in one representative monitoring station. Table 7 shows a summary of performance results in all monitoring stations, and Appendix D.6 contains the corresponding figures.
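The following sketch illustrates the diagonal part of SWAG: running first and second moments of the flattened weights are collected along the optimizer trajectory and then used to sample weights at test time. The full method of [7] additionally maintains a low-rank deviation matrix and a modified learning-rate schedule; the toy model and collection schedule here are assumptions.

```python
# Simplified diagonal-SWAG sketch: collect weight moments along the SGD trajectory,
# then sample weights from the resulting Gaussian at inference time.
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
x, y = torch.randn(256, 16), torch.randn(256, 1)

mean = torch.zeros_like(parameters_to_vector(model.parameters()))
sq_mean = torch.zeros_like(mean)
n_collected = 0
for epoch in range(50):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()
    if epoch >= 25:                                    # collect after a burn-in period
        w = parameters_to_vector(model.parameters()).detach()
        mean = (n_collected * mean + w) / (n_collected + 1)
        sq_mean = (n_collected * sq_mean + w ** 2) / (n_collected + 1)
        n_collected += 1

var = torch.clamp(sq_mean - mean ** 2, min=1e-8)       # diagonal posterior variance
preds = []
for _ in range(100):                                   # MC samples from the SWAG posterior
    vector_to_parameters(mean + var.sqrt() * torch.randn_like(mean), model.parameters())
    with torch.no_grad():
        preds.append(model(x))
preds = torch.stack(preds)                             # spread over samples = epistemic uncertainty
```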

Improving Uncertainty Estimation with Adversarial Training
Generally, it is desirable to have a model with a smooth conditional output distribution with respect to its input because most measured phenomena are inherently smooth. This idea of distributional smoothing has been used as a regularization technique by encouraging a model to be less overconfident, for example, using label smoothing [67,68] or virtual adversarial training [69].
For uncertainty estimation, distributional smoothing can improve the quality of the predictive uncertainty, depending on the direction of smoothing. For example, smoothing along a random direction can be ineffective, while smoothing in all directions is computationally expensive. Lakshminarayanan et al. [5] propose using adversarial training with the fast gradient sign method [70] to smooth the predictive distribution along the direction where the loss is high. Qin et al. [71] investigate the relationship between adversarial robustness and predictive uncertainty. They show that inputs that are sensitive to adversarial perturbations are more likely to have unreliable predictive uncertainty. Based on this insight, they propose a new training approach that smooths training labels based on their input adversarial robustness.
In this paper, we instead propose using the "free" adversarial training method [72], which recycles the gradient information from regular training to quickly generate adversarial data. Thus, we locally smooth the prediction distribution along the adversarial direction at virtually no additional cost. Figure 23 illustrates the improvement in uncertainty estimation when using adversarial training in the PM-value regression task, while Figure 24 shows the corresponding improvement in the threshold exceedance classification task (using MC dropout as an example). Generally, we observe that adversarial training improves the NLL and CE (i.e., makes less overconfident predictions) with negligible effects on other metrics. This, of course, depends on the size of the adversarial perturbation, which can be tuned accordingly. We observe that increasing the perturbation size can have adverse effects on the accuracy metrics, which is expected [73]. We also observe that adversarial training led to more improvements in the PM-value regression than in the threshold exceedance classification. This is because threshold exceedance events are rare, and smoothing the predictive distribution does not help catch these events.
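The sketch below conveys the core idea in a simplified form: the input gradient that is already computed during the regular backward pass is recycled to build an adversarially perturbed example, whose loss is added before the weight update. This is a schematic illustration in the spirit of [5,72] rather than the exact minibatch-replay schedule of the "free" method; the perturbation size and toy model are assumptions.

```python
# Schematic adversarial smoothing step: recycle the input gradient from the
# regular backward pass to perturb the inputs, then add the smoothed loss.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
eps = 0.01                                            # perturbation size (tunable)

x, y = torch.randn(64, 16), torch.randn(64, 1)
for _ in range(10):
    x_adv = x.clone().requires_grad_(True)
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_adv), y)
    loss.backward()                                   # gradients for weights and inputs
    x_pert = (x + eps * x_adv.grad.sign()).detach()   # reuse the input gradient (FGSM direction)
    loss_adv = nn.functional.mse_loss(model(x_pert), y)
    loss_adv.backward()                               # accumulate gradients from the smoothed loss
    opt.step()
```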

Discussion
In this section, we investigate some of the implications of the study. In particular, we perform a comparative analysis to evaluate the selected probabilistic models based on empirical performance, reliability of confidence estimate, and practical applicability. Then we close by investigating the practical impact of uncertainty quantification on decision-making.

Empirical Performance
In Section 4, we summarized the performance of each model in a tabular format. Here, we compare all the models according to their empirical performance in a single monitoring station. Figure 25 shows a comparative summary of empirical performance in the PM-value regression task. We observe that all models perform consistently well with slight variations. The BNN model performs better in metrics that assess the quality of a probabilistic forecast (i.e., in CRPS and NLL). This is expected since BNNs provide the closest approximation to Bayesian inference, while other models provide only a crude approximation. Interestingly, the GNNs with MC dropout perform very closely to BNNs in CRPS and NLL. By scoring better in CRPS, BNNs also score better in accuracy metrics (i.e., RMSE) since the CRPS generalizes the MAE to a probabilistic setting.
The PICP and MPIW are conflicting metrics that simultaneously assess the quality of the generated prediction interval. For example, by increasing the width of a prediction interval (higher MPIW), more values will be inside the predicted intervals (higher PICP). Thus, we observe that BNNs perform well in PICP but poorly in MPIW.

Figure 26 shows a comparative summary of empirical performance in the threshold exceedance classification task. We observe that the BNN model performs better in metrics that measure the quality of probabilistic predictions (scoring rules): Brier score and cross-entropy. We also observe that the performance is inconclusive in metrics intended for deterministic classification (F1, precision, recall). This shows that these metrics are not appropriate for probabilistic prediction. Additionally, these metrics are biased by class imbalance since threshold exceedance is a rare event.

Figure 26. Comparison of empirical performance of the selected probabilistic models in the threshold exceedance classification task. The comparison is according to five performance metrics (left to right): Brier score, cross-entropy, F1 score, precision, and recall. Blue highlights the best performance, while red highlights the worst performance.

Reliability of Confidence Estimate
In Section 5.1, we evaluate the selected models using metrics of accuracy and predictive probabilities separately. However, for decision-making, it is crucial to avoid overconfident, incorrect predictions. Therefore, evaluating the reliability of a confidence estimate is indispensable when selecting probabilistic models. One approach to evaluate reliability is to measure the amount of loss (or the number of incorrect predictions) a model makes when its confidence is above a certain threshold. This is a slight variation of the accuracy-versus-confidence technique suggested by Lakshminarayanan et al. [5] for the classification task.
For the threshold classification task, the confidence interval is bounded between 0 and 1 (0 ≤ CI ≤ 1). Thus, we can define a model confidence to be: confidence = 1 − CI and a confidence threshold 0 ≤ τ ≤ 1. Then we plot the number of incorrect predictions the model makes when its confidence is above τ. We expect the curve to be monotonically decreasing for a reliable confidence estimate since a rational model has fewer incorrect predictions at high confidence. Additionally, we plot the total number of predictions (in our case, number of hours in a month of test set) as a function of τ. This curve is monotonically decreasing, but the decreasing rate indicates the amount of confidence in a model. The decreasing rate would be high in a model with lower confidence since it makes fewer predictions with high confidence. Figure 27 shows plots of loss vs. confidence and count vs. confidence in the threshold classification task. We observe that the BNN model is more reliable by making fewer incorrect predictions at high confidence but that it has the lowest confidence. Appendix B shows the results in all monitoring stations.
For the PM-value regression task, the confidence interval is unbounded (0 < CI < ∞). However, to compare models, we can normalize and calculate their relative confidence to be bounded between 0 and 1. Then we plot the regression loss as a function of the confidence threshold τ. Figure 28 shows the plots of loss vs. confidence and count vs. confidence in the PM-value regression task. We observe that the selected models produce rational behaviors and that the BNN model is more reliable but has the lowest confidence. The curves are smooth since the loss is continuous in the regression task.

Risk-informed Decisions
The primary motivation of quantifying epistemic uncertainty is to represent how much the model does not know. Evaluating specific decision policies is out of the scope of this paper, as they depend on the specific costs of countermeasures or the cost of high pollution values. Nonetheless, to show the value of probabilistic models, we investigate the practical impact on decision-making using a case of urban management. Recently, the Norwegian government has proposed strict regulations for particle dust thresholds. That means if the particle dust exceeds a specific threshold, the government will impose a penalty on the municipality for violating pollution regulations. Therefore, the municipality has to decide whether to implement countermeasures in advance if a forecasting model predicts threshold exceedance events.
The challenge is that not all forecasts are correct and occasionally result in false positives and false negatives. A false positive is associated with implementing unnecessary countermeasures, while a false negative is associated with violating pollution regulations. The costs of these two scenarios are context-dependent. Therefore, it is helpful to have a forecasting model that also provides potential flexibility and tradeoffs depending on the cost of false positives and false negatives.
Non-probabilistic models already offer some aspect of this tradeoff by leveraging the aleatoric uncertainty. For example, we can adjust the probability threshold in a binary classification, i.e., implement countermeasures, only if the predictive probability is above τ 1 . Figure 30a shows the decision score (in terms of F1, precision, and recall) as a function of aleatoric confidence in a non-probabilistic model. Accepting the model decision even when its predictive probability is low will result in more false positives and thus higher costs of unnecessary countermeasures. On the other hand, accepting the model decisions only when its predictive probability is high will result in more false negatives, thus increasing the risk of violating pollution regulations.
In probabilistic models, we can also leverage epistemic uncertainty to obtain a higher degree of tradeoffs. That means we implement countermeasures only if the predictive probability is above τ 1 and the model confidence is above a certain threshold τ 2 . Figure 30b shows the resulting decision score as a function of aleatoric and epistemic confidence in a probabilistic model. We use a BNN model as a representative example. To allow a fair comparison, we use a non-probabilistic model with the same architecture and trained with the same conditions. We observe that the probabilistic model provides a wider area of control over the risk profile based on the costs of false positives versus false negatives. In this case, the probabilistic model scores better over a wider range of τ 1 and τ 2 than a non-probabilistic model. This confirms that probabilistic models are more suitable for making more informed and risk-aware decisions.
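A schematic of this double-threshold decision rule is sketched below: predictions are accepted only when the epistemic confidence exceeds τ2, and exceedance is declared only when the predicted probability exceeds τ1; the F1 score is then computed over the accepted predictions. The synthetic labels, probabilities, and confidences are placeholders for illustration.

```python
# Illustrative sweep over the aleatoric threshold tau1 and the epistemic
# confidence threshold tau2; all data below are synthetic placeholders.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
o = rng.integers(0, 2, size=500)                               # true exceedance labels
p = np.clip(0.7 * o + rng.normal(0.2, 0.2, size=500), 0, 1)    # mean predicted probability
conf = rng.uniform(0.5, 1.0, size=500)                          # normalized epistemic confidence

scores = {}
for tau1 in np.linspace(0.1, 0.9, 9):
    for tau2 in np.linspace(0.1, 0.9, 9):
        accepted = conf >= tau2                 # act only on sufficiently confident predictions
        if accepted.sum() == 0:
            continue
        decision = (p[accepted] >= tau1).astype(int)
        scores[(round(tau1, 1), round(tau2, 1))] = f1_score(o[accepted], decision, zero_division=0)

best = max(scores, key=scores.get)
print("best (tau1, tau2):", best, "F1 =", round(scores[best], 3))
```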

Practical Applicability
In Section 5.1, we observe that a BNN model performs better in a variety of metrics. Further, BNNs offer an elegant approach to represent model uncertainty without strong assumptions. However, they are computationally demanding, require more parameters, are sensitive to hyperparameter choices, and take longer to converge than non-Bayesian neural networks. Thus, BNNs are difficult to work with, especially on larger datasets. Additionally, the nature of variational approximation and the choice of the prior or posterior distributions bias the predictive uncertainty.
From a practical perspective, MC dropout, SWAG, and deep ensembles are convenient and scalable approaches to obtain uncertainty directly from standard neural networks without changing the optimization. Additionally, they have fewer hyperparameters and can be applied independently of the underlying network architecture (such as standard neural networks, LSTMs, and GNNs). However, these approaches rely on strong assumptions and provide only a crude approximation of Bayesian inference. They also have high computational costs during inference (proportional to the number of samples M). Further, the uncertainty is estimated by measuring the diversity of predictions, which depends on the network's architecture and size, the properties of the training data, and the initialization parameters. Accordingly, there is no guarantee that they produce diverse predictions for an uncertain input.
The deep ensemble is a convenient non-Bayesian approach that captures aspects of multi-modality. It is better than a bootstrap ensemble, in which independent models are trained on independently resampled datasets. The deep ensemble trains every model using the whole dataset since stochastic optimization and random initialization make the models sufficiently independent. However, training a deep ensemble is more expensive than SWAG or MC dropout since we need to train and store multiple full models.
The underlying assumption used in MC dropout is that it approximates a probabilistic deep Gaussian process. This assumption is based on the idea that a single-layer neural network tends to converge to a Gaussian process in the limit of an infinite number of hidden units [74]. Our results show that MC dropout provides less reliable uncertainty estimates (Figures 27 and 28), which confirms the results recently presented in other domains, such as computer vision [75] and molecular property prediction [76].
The related works that address quantifying uncertainty in data-driven air quality forecasts use either quantile regression (QR) [29], Gaussian processes (GP) [28], or ConvLSTM with MC dropout [26]. Since we have different datasets and data inputs, we have reimplemented these methods in our problem setting to compare them to the proposed models. We use Gradient Tree Boosting [47] for quantile regression and GPyTorch [77] for Gaussian processes. Table 8 summarizes the comparison results in the PM-value regression task in one representative monitoring station. We observe that the proposed models perform better, especially in metrics that assess the quality of a probabilistic forecast (i.e., CRPS and NLL). Quantile regression only estimates the conditional quantiles and not a full predictive distribution. Thus, we can only assess the quality of its prediction interval and not a probabilistic forecast. Gaussian processes offer a Bayesian formalism to reason about uncertainty. However, they have the highest computational cost since they require an inversion of the kernel matrix, which has an asymptotic complexity of O(n^3), where n is the number of training examples.

Conclusions
This work presents a broad empirical evaluation of the relevant state-of-the-art deep probabilistic models applied in air quality forecasting. Through extensive experiments, we describe training these models and evaluating their predictive uncertainties using various metrics for regression and classification tasks. We introduce a new state-of-the-art example for air quality forecasting by defining the problem setup and selecting proper input features and models. Then, we apply uncertainty-aware models that exploit the temporal and spatial correlation inherent in air quality data using recurrent and graph neural networks. We propose improving uncertainty estimation using "free" adversarial training to locally smooth the prediction distribution along the adversarial direction with virtually no additional cost. Finally, we show how data-driven probabilistic models can improve the current practice using a real-world example of air quality forecasting in Norway. Particularly, we show the practical impact of uncertainty quantification and demonstrate that probabilistic models are more suitable for making informed decisions.
The results show that the proposed models perform better than previous works in quantifying uncertainty in data-driven air quality forecasts. BNNs provide a more reliable uncertainty estimate but with challenging practical applicability. Additionally, the results demonstrate that MC Dropout, SWAG, and deep ensemble have less reliable uncertainty estimates, but they are more convenient in practical applicability. Our contribution hereof is not about specific model selection but more about navigating the larger design space and the different tradeoffs offered by various probabilistic models.
Future work includes exploring hybrid air quality forecasting that combines physics-based and data-driven probabilistic models. Additionally, future work should address uncertainty estimation under out-of-domain or seasonal variations in air quality forecasting. In our experiments, we used a reasonable search space for hyperparameters and network architectures. Thus, future work could target an exhaustive search of hyperparameters, additional datasets, and models (e.g., transformers).


Appendix A. Datasets
This section shows additional figures describing the datasets used: air quality measurements, meteorological data, traffic, and street-cleaning reports from the municipality.

Appendix A.2. Weather Data
We use weather data observations at four monitoring stations in the city of Trondheim (Voll, Sverreborg, Gloshaugen, Lade). These observations include air temperature, relative humidity, precipitation, air pressure, wind speed, wind direction, snow thickness, and duration of sunshine. Figure A2 summarizes all these observations.

Appendix A.3. Traffic Data
We also used traffic volume as an input feature for the forecasting models. We used traffic volume data recorded at eight streets of Trondheim. Figure A3 shows the traffic volume recorded over two years. The figure shows a clear correlation of traffic between streets. It also shows the drop in traffic during the pandemic lockdown in Trondheim (after 12 March 2020). We also used street-cleaning data for the main streets of Trondheim, as shown in Figure A4. These data are reported by the municipality and include the time periods during which street cleaning took place.

Appendix B. Reliability of Confidence Estimate: Additional Plots
This section presents additional plots comparing confidence reliability in all monitoring stations. Figure A5 shows the comparison of confidence reliability in the PM-value regression task, while Figure A6 shows it in the threshold exceedance classification task.

Figure A5. Comparison of confidence reliability for the selected probabilistic models in the PM-value regression task in all monitoring stations.

Figure A6. Comparison of confidence reliability for the selected probabilistic models in the threshold exceedance classification task in all monitoring stations.

Appendix C. Justification for Threshold-Exceedance Classification
Recently, the Norwegian government has proposed even stricter regulations for particle dust thresholds. Therefore, we want to estimate if the concentration of air particles exceeds certain thresholds following the Common Air Quality Index (CAQI) used in Europe [36], as shown in Table 1.
The CAQI specifies five levels of air pollutants, but the air quality in the city of Trondheim is usually at Very Low and rarely exceeds the Medium level. Figure A8 shows the air quality level over one year in all monitoring stations. Generally, air quality data have right-skewed distributions, with the degree of skewness (asymmetry) differing among different cities. Figure A7 shows histograms that approximately represent the distribution of air quality at four monitoring stations in the city of Trondheim. The figure clearly shows heavily right-skewed distributions, in which higher CAQI classes are under-represented. Therefore, instead of a multinomial classification task (with five classes), we transform the problem into a threshold exceedance forecast task in which we try to predict the points in time where the air quality exceeds the Very Low level.

Appendix D. Experimental Details of Deep Probabilistic Forecasting Models
In this section, we show the results of the PM-value regression and threshold exceedance classification in all monitoring stations using the corresponding probabilistic model.