# Probabilistic Deep Learning to Quantify Uncertainty in Air Quality Forecasting


## Abstract


## 1. Introduction

- We conduct a broad empirical comparison and exploratory assessment of state-of-the-art deep probabilistic learning techniques applied to air quality forecasting. Through extensive experiments, we describe how these models are trained and how their predictive uncertainty is evaluated using a range of metrics for regression and classification tasks.
- We improve uncertainty estimation using adversarial training to smooth the conditional output distribution locally around training data points.
- We apply uncertainty-aware models that exploit the temporal and spatial correlation inherent in air quality data using recurrent and graph neural networks.
- We introduce a new state-of-the-art example for air quality forecasting by defining the problem setup and selecting suitable input features and models.

## 2. Related Work

## 3. Air Quality Prediction, Base Models and Metrics

#### 3.1. Problem Setup

#### 3.2. Epistemic and Aleatoric Uncertainty

#### 3.3. Non-Probabilistic Baselines

#### 3.4. Quantile Regression

## 4. Deep Probabilistic Forecast

#### 4.1. Bayesian Neural Networks (BNNs)

#### 4.2. Standard Neural Networks with MC Dropout

#### 4.3. Deep Ensembles

#### 4.4. Recurrent Neural Network with MC Dropout

#### 4.5. Graph Neural Networks with MC Dropout

#### 4.6. Stochastic Weight Averaging–Gaussian (SWAG)

#### 4.7. Improving Uncertainty Estimation with Adversarial Training

## 5. Discussion

#### 5.1. Empirical Performance

#### 5.2. Reliability of Confidence Estimate

#### 5.3. Risk-informed Decisions

#### 5.4. Practical Applicability

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
MC | Monte Carlo |
SWAG | Stochastic weight averaging-Gaussian |
IoT | Internet of Things |
MACC | Monitoring Atmospheric Composition and Climate |
uEMEP | urban European Monitoring and Evaluation Program |
NB-IoT | Narrowband-IoT |
LSTM | Long short-term memory |
CNN | Convolutional neural network |
GRU | Gated recurrent unit |
GP | Gaussian process |
$\mathrm{NO}_{2}$ | Nitrogen dioxide |
MCMC | Markov chain Monte Carlo |
$\mathrm{PM}_{10}$ | Coarse particulate matter of diameter less than 10 $\mu\mathrm{m}$ |
$\mathrm{PM}_{2.5}$ | Fine particulate matter of diameter less than 2.5 $\mu\mathrm{m}$ |
CAQI | Common air quality index |
NILU | Norwegian Institute for Air Research |
CI | Confidence interval |
PI | Prediction interval |
XGBoost | eXtreme Gradient Boosting |
QR | Quantile regression |
RMSE | Root-mean-square error |
CE | Cross entropy |
BS | Brier score |
PICP | Prediction interval coverage probability |
MPIW | Mean prediction interval width |
NLL | Negative log-likelihood |
CRPS | Continuous ranked probability score |
CDF | Cumulative distribution function |
MAE | Mean absolute error |
BNN | Bayesian neural network |
KL | Kullback–Leibler divergence |
GNN | Graph neural network |
GLU | Gated linear units |
GCN | Graph convolutional network |

## Appendix A. Datasets

#### Appendix A.1. Air Quality Data

**Figure A1.** Air quality data of $\mathrm{PM}_{2.5}$ and $\mathrm{PM}_{10}$, measured over two years at four different sensing stations in the city of Trondheim. These data are offered by the Norwegian Institute for Air Research (NILU) (https://www.nilu.com/open-data/, accessed on 27 November 2021).

#### Appendix A.2. Weather Data

**Figure A2.** Weather data observations over two years at four monitoring stations in the city of Trondheim (Voll, Sverreborg, Gloshaugen, Lade). These data are offered by the Norwegian Meteorological Institute (https://frost.met.no, accessed on 27 November 2021).

#### Appendix A.3. Traffic Data

**Figure A3.** Traffic volume recorded at eight streets of Trondheim over two years. These data are offered by the Norwegian Public Roads Administration (https://www.vegvesen.no/trafikkdata/start/om-api, accessed on 27 November 2021).

**Figure A4.** Duration of street-cleaning operations on the main streets of Trondheim, as reported by the municipality.

## Appendix B. Reliability of Confidence Estimate: Additional Plots

**Figure A5.** Comparison of confidence reliability for the selected probabilistic models in the PM-value regression task in all monitoring stations.

**Figure A6.** Comparison of confidence reliability for the selected probabilistic models in the threshold exceedance classification task in all monitoring stations.

## Appendix C. Justification for Threshold-Exceedance Classification

**Figure A7.** Histograms that approximately represent the distribution of air quality at four monitoring stations in the city of Trondheim. The air quality data come from heavily right-skewed distributions, in which higher CAQI classes are under-represented.

**Figure A8.** Air quality in the city of Trondheim over one year in all monitoring stations. The data are decomposed into the five CAQI levels of air pollutants. Air quality is usually at the Very Low level and rarely exceeds the Medium level.

## Appendix D. Experimental Details of Deep Probabilistic Forecasting Models

#### Appendix D.1. Bayesian Neural Networks (BNNs)

#### Appendix D.2. Standard NNs with MC Dropout

**Figure A10.** Deep probabilistic forecast of air quality using NNs with MC dropout at four monitoring stations.
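As a minimal sketch of how MC dropout yields a predictive distribution, the snippet below keeps dropout active at prediction time and summarizes repeated stochastic forward passes by their mean and standard deviation. The tiny two-layer network, its weights, the dropout rate, and the helper name `mc_dropout_predict` are illustrative assumptions, not the architecture or settings used in the experiments.

```python
import math
import random

def mc_dropout_predict(x, w_hidden, w_out, p_drop=0.5, n_samples=200, seed=0):
    """Return (mean, std) of the output over stochastic forward passes."""
    rng = random.Random(seed)
    keep = 1.0 - p_drop
    outputs = []
    for _ in range(n_samples):
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w_hidden]
        # Dropout stays active at prediction time: a fresh Bernoulli mask per pass,
        # with inverted-dropout scaling so the expected activation is unchanged.
        hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
        outputs.append(sum(wo * h for wo, h in zip(w_out, hidden)))
    mean = sum(outputs) / n_samples
    var = sum((o - mean) ** 2 for o in outputs) / n_samples
    return mean, math.sqrt(var)

# Hypothetical weights and input, chosen only for illustration.
w_hidden = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
w_out = [1.0, -0.5, 0.7]
mean, std = mc_dropout_predict([1.0, 2.0], w_hidden, w_out)
print(f"predictive mean={mean:.3f}, std={std:.3f}")
```

The spread across passes is what serves as the model (epistemic) uncertainty estimate; with `p_drop=0.0` every pass is identical and the spread collapses to zero.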

#### Appendix D.3. Deep Ensemble

**Figure A11.** Deep probabilistic forecast of air quality using deep ensemble at four monitoring stations.
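One common way to turn the outputs of ensemble members into a single forecast is to treat them as a uniformly weighted Gaussian mixture. The sketch below assumes each member returns a mean and a variance; the function name `combine_ensemble` and the numbers are hypothetical. The mixture variance splits into an average member variance (aleatoric part) and the variance of the member means (epistemic part).

```python
def combine_ensemble(means, variances):
    """Combine per-member Gaussian outputs into mixture mean and variance parts."""
    m = len(means)
    mix_mean = sum(means) / m
    # Average of the predicted noise variances: data (aleatoric) uncertainty.
    aleatoric = sum(variances) / m
    # Disagreement between member means: model (epistemic) uncertainty.
    epistemic = sum((mu - mix_mean) ** 2 for mu in means) / m
    return mix_mean, aleatoric, epistemic

# Hypothetical outputs of three ensemble members for one time step.
mix_mean, aleatoric, epistemic = combine_ensemble([10.0, 12.0, 14.0], [4.0, 5.0, 3.0])
total_variance = aleatoric + epistemic
print(mix_mean, aleatoric, epistemic, total_variance)
```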

#### Appendix D.4. LSTM with MC Dropout

**Figure A12.** Deep probabilistic forecast of air quality using LSTM with MC dropout at four monitoring stations.

#### Appendix D.5. GNNs with MC Dropout

**Figure A13.** Deep probabilistic forecast of air quality using GNNs with MC dropout at four monitoring stations.

#### Appendix D.6. SWAG

**Figure A14.** Deep probabilistic forecast of air quality using a SWAG model at four monitoring stations.


**Figure 1.** Illustrative example: Decision F1 score as a function of normalized aleatoric and epistemic confidence thresholds.

**Figure 3.** Air quality level over one year at one representative monitoring station in Trondheim, where the air pollutant is commonly at a Very Low level and rarely exceeds Low.

**Figure 4.** Persistence forecast of air pollutant over one month at one representative monitoring station.

**Figure 5.** PM-value regression using the XGBoost model over one month at one representative monitoring station.

**Figure 6.** Feature importance of a trained XGBoost model, indicating how useful each input feature is when making a prediction.

**Figure 7.** Predicting the threshold exceedance probability of the air pollutant level using an XGBoost model at one representative monitoring station.

**Figure 8.** Air quality prediction interval using quantile regression of a Gradient Tree Boosting model.
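Quantile regression is usually trained with the pinball (quantile) loss, and a prediction interval is then formed from a lower and an upper quantile model. The following sketch shows the loss only; the data and the quantile levels 0.1 and 0.9 are illustrative assumptions, not the settings used for the figure above.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

# Hypothetical observations and predictions.
y_true = [10.0, 20.0]
y_pred = [12.0, 15.0]
loss_upper = pinball_loss(y_true, y_pred, q=0.9)
loss_lower = pinball_loss(y_true, y_pred, q=0.1)
print(loss_upper, loss_lower)
```

Because a high quantile level penalizes under-prediction more heavily, the same predictions score worse as an upper quantile than as a lower one in this example.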

**Figure 9.** Learning curve of training a BNN model to forecast PM-values. (**Left:**) negative log-likelihood loss; (**Center:**) KL loss estimated using MC sampling; (**Right:**) learning rate of exponential decay.

**Figure 10.** Probabilistic forecasting of multivariate time-series of air quality using a BNN model at one representative monitoring station.

**Figure 12.** Predicting threshold exceedance probability by transforming PM-value regression into binary predictions.
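A simple way to read the transformation in Figure 12 is as an upper-tail probability of the predictive distribution. The sketch below assumes a Gaussian predictive distribution with mean `mu` and standard deviation `sigma`; the function name and the threshold value of 30 are hypothetical choices for illustration.

```python
import math

def exceedance_probability(mu, sigma, threshold):
    """P(y > threshold) for y ~ Normal(mu, sigma^2)."""
    z = (threshold - mu) / sigma
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - cdf

# Hypothetical forecast: mean 20, standard deviation 5, threshold 30.
p_exceed = exceedance_probability(mu=20.0, sigma=5.0, threshold=30.0)
print(f"P(exceedance) = {p_exceed:.4f}")
```

A wider predictive distribution raises the exceedance probability for the same mean, which is how the regression uncertainty carries over into the binary prediction.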

**Figure 13.** Probabilistic forecasting of multivariate time-series air quality using a standard neural network model with MC dropout.

**Figure 14.** Predicting the threshold exceedance probability of air pollutants level using a standard neural network with MC dropout.

**Figure 16.** Predicting threshold exceedance probability of air pollutants level using a deep ensemble.

**Figure 17.** Probabilistic forecasting of multivariate time-series air quality using an LSTM model with MC dropout.

**Figure 18.** Predicting threshold exceedance probability of air pollutants level using an LSTM model with MC dropout.

**Figure 19.** Probabilistic forecasting of multivariate time-series air quality using a GNN model with MC dropout.

**Figure 20.** Predicting threshold exceedance probability of air pollutants level using a GNN model with MC dropout.

**Figure 23.** Comparison of uncertainty estimation in PM-value regression when training (**top**) without adversarial training versus (**bottom**) with adversarial training. Using adversarial training leads to a smoother predictive distribution and thus a lower NLL (less overconfident predictions).
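To make the idea of adversarial training concrete, the sketch below uses the fast gradient sign method on a linear model with a fixed-variance Gaussian likelihood, so that the input gradient of the NLL can be written by hand. The linear model, the perturbation size `eps`, and all function names are assumptions made for illustration; the models in this work are neural networks, and the exact perturbation scheme used in the experiments is not reproduced here.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under Normal(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

def predict_mean(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fgsm_example(x, y, w, b, sigma, eps):
    """Perturb x by eps in the direction that increases the NLL."""
    mu = predict_mean(x, w, b)
    # For a linear mean, d NLL / d x_i = -(y - mu) / sigma^2 * w_i.
    grad = [-(y - mu) / sigma ** 2 * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def adversarial_loss(x, y, w, b, sigma, eps):
    """Training loss: NLL on the clean input plus NLL on the perturbed input."""
    x_adv = fgsm_example(x, y, w, b, sigma, eps)
    clean = gaussian_nll(y, predict_mean(x, w, b), sigma)
    perturbed = gaussian_nll(y, predict_mean(x_adv, w, b), sigma)
    return clean + perturbed

# Hypothetical data point and parameters.
x, y, w, b, sigma, eps = [1.0, 2.0], 3.0, [0.5, -1.0], 0.1, 1.0, 0.1
x_adv = fgsm_example(x, y, w, b, sigma, eps)
print(x_adv, adversarial_loss(x, y, w, b, sigma, eps))
```

Minimizing the combined loss encourages the predictive distribution to stay reasonable in a small neighbourhood of each training point, which is the local smoothing effect described in the caption.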

**Figure 24.** Comparison of uncertainty estimation in threshold exceedance classification when training (**top**) without adversarial training versus (**bottom**) with adversarial training. Using adversarial training leads to a smoother predictive distribution and thus a lower CE (less overconfident predictions).

**Figure 25.** Comparison of empirical performance of the selected probabilistic models in the PM-value regression task. The comparison is according to five performance metrics (left to right): CRPS, NLL, RMSE, PICP, and MPIW. Blue highlights the best performance, while red highlights the worst performance. The arrows alongside the metrics indicate which direction is better for that specific metric.
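For reference, the five regression metrics named in Figure 25 can be sketched as follows. The NLL and CRPS functions assume a Gaussian predictive distribution (the CRPS uses the standard closed form for a Gaussian); the example values are hypothetical, and any normalization applied in the reported results is not reproduced here.

```python
import math

def picp(y_true, lower, upper):
    """Fraction of observations that fall inside their prediction interval."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)

def mpiw(lower, upper):
    """Mean width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

def rmse(y_true, y_pred):
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true))

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under Normal(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast for observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

# Hypothetical observations and prediction intervals.
y_true = [10.0, 12.0, 30.0]
lower = [8.0, 9.0, 15.0]
upper = [12.0, 14.0, 25.0]
coverage = picp(y_true, lower, upper)
width = mpiw(lower, upper)
print(coverage, width, gaussian_crps(0.0, 0.0, 1.0))
```

PICP and MPIW should be read together: wide intervals trivially achieve high coverage, so a good model combines high PICP with low MPIW.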

**Figure 26.** Comparison of empirical performance of the selected probabilistic models in the threshold exceedance classification task. The comparison is according to five performance metrics (left to right): Brier score, cross-entropy, F1 score, precision, and recall. Blue highlights the best performance, while red highlights the worst performance.
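The classification metrics in Figure 26 can be sketched in the same spirit. The decision threshold of 0.5, the clipping constant, and the convention of returning 0 when a ratio is undefined are assumptions of this example; the labels and probabilities are hypothetical.

```python
import math

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and binary outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def precision_recall_f1(y_true, p_pred, threshold=0.5):
    y_hat = [1 if p >= threshold else 0 for p in p_pred]
    tp = sum(1 for y, h in zip(y_true, y_hat) if y == 1 and h == 1)
    fp = sum(1 for y, h in zip(y_true, y_hat) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_hat) if y == 1 and h == 0)
    # Undefined ratios are reported as 0 (an assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels (1 = threshold exceeded) and predicted probabilities.
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.4, 0.6]
brier = brier_score(y_true, p_pred)
ce = cross_entropy(y_true, p_pred)
precision, recall, f1 = precision_recall_f1(y_true, p_pred)
print(brier, ce, precision, recall, f1)
```

The Brier score and cross-entropy evaluate the probabilities themselves, whereas precision, recall, and F1 depend on the chosen decision threshold.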

**Figure 27.** Comparison of confidence reliability for the selected probabilistic models in the threshold exceedance task. (**Left:**) loss versus confidence. (**Right:**) count versus confidence. The selected models are rational, which means that their loss-versus-confidence curves are monotonically decreasing.

**Figure 28.** Comparison of confidence reliability for the selected probabilistic models in the PM-value regression task. (**Left:**) loss versus confidence. (**Right:**) count versus confidence.

**Figure 29.** Impact of adversarial training on predictive uncertainty in PM-value regression, using deep ensemble as an example. (**Left:**) loss versus confidence. (**Right:**) count versus confidence.

**Figure 30.** Comparison of decision score in non-probabilistic and probabilistic models. (**a**) Decision score in a non-probabilistic model as a function of class probability threshold ($\tau_{1}$ corresponding to aleatoric confidence). (**b**) Decision score in a probabilistic model as a function of both class probability threshold ($\tau_{1}$) and model confidence threshold ($\tau_{2}$ corresponding to epistemic uncertainty).
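One possible reading of the two-threshold decision in Figure 30 is sketched below. It assumes that each forecast comes with several sampled exceedance probabilities (for example, from repeated stochastic forward passes), that the class probability is their mean, and that model confidence is one minus their standard deviation normalized by its maximum of 0.5. This confidence definition, the function names, and the data are hypothetical and may differ from the rule used to produce the figure.

```python
from statistics import mean, pstdev

def decide(prob_samples, tau1, tau2):
    """Raise an alert only if the probability AND the model confidence are high enough."""
    p = mean(prob_samples)
    # The standard deviation of values in [0, 1] is at most 0.5.
    confidence = 1.0 - pstdev(prob_samples) / 0.5
    return 1 if (p >= tau1 and confidence >= tau2) else 0

def f1_score(y_true, y_pred):
    tp = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 1)
    fp = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical sampled exceedance probabilities for three forecasts.
samples = [[0.90, 0.85, 0.95], [0.90, 0.10, 0.80], [0.20, 0.10, 0.15]]
y_true = [1, 0, 0]

strict = [decide(s, tau1=0.5, tau2=0.8) for s in samples]
loose = [decide(s, tau1=0.5, tau2=0.0) for s in samples]
print(strict, f1_score(y_true, strict))
print(loose, f1_score(y_true, loose))
```

In this toy case the second forecast has a high mean probability but strong disagreement between samples, so the confidence threshold suppresses a false alarm that the probability threshold alone would let through.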

Index | $\mathrm{PM}_{10}$ ($\mu\mathrm{g/m^{3}}$) | $\mathrm{PM}_{2.5}$ ($\mu\mathrm{g/m^{3}}$) |
---|---|---|
Very Low | 0–25 | 0–15 |
Low | 25–50 | 15–30 |
Medium | 50–90 | 30–55 |
High | 90–180 | 55–110 |
Very High | >180 | >110 |

**Table 2.** Summary of performance results when forecasting the PM-value and threshold exceedance using a BNN model. Columns RMSE to NLL report PM-value regression; columns Brier to CE report threshold exceedance classification.

Station | Particulate | RMSE↓ | PICP↑ | MPIW↓ | CRPS↓ | NLL↓ | Brier↓ | Precision↑ | Recall↑ | F1↑ | CE↓ |
---|---|---|---|---|---|---|---|---|---|---|---|
Bakke kirke | $\mathrm{PM}_{2.5}$ | 4.81 | 0.99 | 17.62 | 0.51 | 1.29 | 0.04 | 1.00 | 0.44 | 0.61 | 0.13 |
Bakke kirke | $\mathrm{PM}_{10}$ | 5.86 | 0.94 | 26.12 | 0.50 | 1.28 | 0.03 | 1.00 | 0.30 | 0.47 | 0.09 |
E6-Tiller | $\mathrm{PM}_{2.5}$ | 3.77 | 0.92 | 13.25 | 0.54 | 1.39 | 0.02 | 0.00 | 0.00 | 0.00 | 0.08 |
E6-Tiller | $\mathrm{PM}_{10}$ | 9.40 | 0.92 | 34.18 | 0.48 | 1.26 | 0.06 | 0.00 | 0.00 | 0.00 | 0.23 |
Elgeseter | $\mathrm{PM}_{2.5}$ | 3.93 | 0.91 | 12.79 | 0.53 | 1.36 | 0.03 | 0.88 | 0.42 | 0.56 | 0.12 |
Elgeseter | $\mathrm{PM}_{10}$ | 5.17 | 0.90 | 25.07 | 0.47 | 1.28 | 0.03 | 0.55 | 0.19 | 0.29 | 0.12 |
Torvet | $\mathrm{PM}_{2.5}$ | 4.07 | 0.90 | 10.83 | 0.48 | 1.30 | 0.03 | 0.75 | 0.46 | 0.57 | 0.13 |
Torvet | $\mathrm{PM}_{10}$ | 5.25 | 0.93 | 18.47 | 0.43 | 1.17 | 0.03 | 0.50 | 0.23 | 0.32 | 0.10 |

**Table 3.** Summary of performance results when forecasting the PM-value and threshold exceedance using a standard neural network with MC dropout. Columns RMSE to NLL report PM-value regression; columns Brier to CE report threshold exceedance classification.

Station | Particulate | RMSE↓ | PICP↑ | MPIW↓ | CRPS↓ | NLL↓ | Brier↓ | Precision↑ | Recall↑ | F1↑ | CE↓ |
---|---|---|---|---|---|---|---|---|---|---|---|
Bakke kirke | $\mathrm{PM}_{2.5}$ | 5.34 | 0.69 | 9.40 | 0.60 | 2.30 | 0.04 | 0.65 | 0.51 | 0.57 | 0.31 |
Bakke kirke | $\mathrm{PM}_{10}$ | 6.42 | 0.66 | 12.45 | 0.59 | 3.35 | 0.03 | 0.67 | 0.48 | 0.56 | 0.10 |
E6-Tiller | $\mathrm{PM}_{2.5}$ | 3.75 | 0.72 | 7.26 | 0.60 | 2.24 | 0.01 | 0.00 | 0.00 | 0.00 | 0.24 |
E6-Tiller | $\mathrm{PM}_{10}$ | 9.49 | 0.71 | 16.62 | 0.51 | 2.30 | 0.07 | 0.18 | 0.04 | 0.06 | 0.57 |
Elgeseter | $\mathrm{PM}_{2.5}$ | 4.43 | 0.70 | 7.29 | 0.57 | 2.12 | 0.05 | 0.57 | 0.38 | 0.45 | 0.43 |
Elgeseter | $\mathrm{PM}_{10}$ | 5.59 | 0.69 | 12.11 | 0.51 | 2.38 | 0.04 | 0.37 | 0.32 | 0.34 | 0.17 |
Torvet | $\mathrm{PM}_{2.5}$ | 4.60 | 0.55 | 5.26 | 0.57 | 2.91 | 0.04 | 0.68 | 0.44 | 0.53 | 0.33 |
Torvet | $\mathrm{PM}_{10}$ | 5.63 | 0.62 | 8.94 | 0.51 | 2.51 | 0.03 | 0.56 | 0.35 | 0.43 | 0.14 |

**Table 4.** Summary of performance results when forecasting the PM-value and threshold exceedance using a deep ensemble. Columns RMSE to NLL report PM-value regression; columns Brier to CE report threshold exceedance classification.

Station | Particulate | RMSE↓ | PICP↑ | MPIW↓ | CRPS↓ | NLL↓ | Brier↓ | Precision↑ | Recall↑ | F1↑ | CE↓ |
---|---|---|---|---|---|---|---|---|---|---|---|
Bakke kirke | $\mathrm{PM}_{2.5}$ | 5.29 | 0.77 | 11.65 | 0.57 | 1.67 | 0.05 | 0.69 | 0.53 | 0.60 | 0.55 |
Bakke kirke | $\mathrm{PM}_{10}$ | 6.21 | 0.70 | 14.00 | 0.57 | 2.46 | 0.03 | 0.60 | 0.36 | 0.45 | 0.26 |
E6-Tiller | $\mathrm{PM}_{2.5}$ | 3.78 | 0.77 | 8.46 | 0.58 | 1.84 | 0.01 | 0.00 | 0.00 | 0.00 | 0.34 |
E6-Tiller | $\mathrm{PM}_{10}$ | 9.44 | 0.72 | 16.07 | 0.50 | 2.14 | 0.07 | 0.31 | 0.08 | 0.12 | 1.16 |
Elgeseter | $\mathrm{PM}_{2.5}$ | 4.46 | 0.71 | 7.99 | 0.58 | 2.00 | 0.05 | 0.68 | 0.28 | 0.40 | 0.66 |
Elgeseter | $\mathrm{PM}_{10}$ | 5.53 | 0.69 | 12.47 | 0.52 | 2.48 | 0.04 | 0.45 | 0.32 | 0.38 | 0.35 |
Torvet | $\mathrm{PM}_{2.5}$ | 4.45 | 0.57 | 5.13 | 0.56 | 2.66 | 0.04 | 0.73 | 0.31 | 0.43 | 0.55 |
Torvet | $\mathrm{PM}_{10}$ | 5.39 | 0.64 | 8.68 | 0.49 | 2.19 | 0.03 | 0.62 | 0.19 | 0.29 | 0.30 |

**Table 5.** Summary of performance results when forecasting the PM-value or threshold exceedance using an LSTM model with MC dropout.

Station | Particulate | RMSE↓ | PICP↑ | MPIW↓ | CRPS↓ | NLL↓ | Brier↓ | Precision↑ | Recall↑ | F1↑ | CE↓ |
---|---|---|---|---|---|---|---|---|---|---|---|
Bakke kirke | $P{M}_{2.5}$ | 5.01 | 0.88 | 14.01 | 0.53 | 1.47 | 0.05 | 0.66 | 0.53 | 0.58 | 0.25 |
Bakke kirke | $P{M}_{10}$ | 6.25 | 0.82 | 19.28 | 0.54 | 1.78 | 0.03 | 0.59 | 0.48 | 0.53 | 0.14 |
E6-Tiller | $P{M}_{2.5}$ | 3.90 | 0.72 | 7.45 | 0.62 | 2.31 | 0.02 | 0.00 | 0.00 | 0.00 | 0.11 |
E6-Tiller | $P{M}_{10}$ | 9.68 | 0.74 | 18.96 | 0.53 | 2.03 | 0.08 | 0.24 | 0.12 | 0.16 | 0.43 |
Elgeseter | $P{M}_{2.5}$ | 4.32 | 0.72 | 8.91 | 0.59 | 2.10 | 0.05 | 0.58 | 0.40 | 0.47 | 0.28 |
Elgeseter | $P{M}_{10}$ | 5.98 | 0.73 | 15.14 | 0.55 | 2.64 | 0.05 | 0.30 | 0.29 | 0.30 | 0.24 |
Torvet | $P{M}_{2.5}$ | 4.19 | 0.56 | 6.88 | 0.58 | 4.79 | 0.05 | 0.58 | 0.42 | 0.49 | 0.30 |
Torvet | $P{M}_{10}$ | 5.81 | 0.61 | 11.33 | 0.54 | 4.03 | 0.03 | 0.43 | 0.35 | 0.38 | 0.1 |

Columns RMSE to NLL: PM-value regression. Columns Brier to CE: threshold exceedance classification.

**Table 6.** Summary of performance results when forecasting the PM-value or threshold exceedance using a GNN model with MC dropout.

Station | Particulate | RMSE↓ | PICP↑ | MPIW↓ | CRPS↓ | NLL↓ | Brier↓ | Precision↑ | Recall↑ | F1↑ | CE↓ |
---|---|---|---|---|---|---|---|---|---|---|---|
Bakke kirke | $P{M}_{2.5}$ | 4.70 | 0.88 | 12.33 | 0.52 | 1.41 | 0.05 | 0.61 | 0.53 | 0.56 | 0.21 |
Bakke kirke | $P{M}_{10}$ | 6.26 | 0.79 | 16.10 | 0.54 | 1.83 | 0.03 | 0.43 | 0.36 | 0.39 | 0.11 |
E6-Tiller | $P{M}_{2.5}$ | 3.80 | 0.83 | 9.14 | 0.57 | 1.60 | 0.02 | 0.00 | 0.00 | 0.00 | 0.11 |
E6-Tiller | $P{M}_{10}$ | 9.46 | 0.80 | 19.89 | 0.48 | 1.59 | 0.07 | 0.19 | 0.06 | 0.09 | 0.35 |
Elgeseter | $P{M}_{2.5}$ | 3.98 | 0.83 | 9.37 | 0.54 | 1.51 | 0.04 | 0.65 | 0.45 | 0.53 | 0.19 |
Elgeseter | $P{M}_{10}$ | 5.80 | 0.79 | 15.07 | 0.50 | 1.60 | 0.04 | 0.35 | 0.23 | 0.27 | 0.17 |
Torvet | $P{M}_{2.5}$ | 4.27 | 0.68 | 6.19 | 0.50 | 2.04 | 0.05 | 0.55 | 0.46 | 0.50 | 0.22 |
Torvet | $P{M}_{10}$ | 5.55 | 0.70 | 10.39 | 0.47 | 1.83 | 0.03 | 0.36 | 0.35 | 0.35 | 0.11 |

Columns RMSE to NLL: PM-value regression. Columns Brier to CE: threshold exceedance classification.

Station | Particulate | RMSE↓ | PICP↑ | MPIW↓ | CRPS↓ | NLL↓ | Brier↓ | Precision↑ | Recall↑ | F1↑ | CE↓ |
---|---|---|---|---|---|---|---|---|---|---|---|
Bakke kirke | $P{M}_{2.5}$ | 5.51 | 0.79 | 13.13 | 0.58 | 1.64 | 0.04 | 0.66 | 0.64 | 0.65 | 0.20 |
Bakke kirke | $P{M}_{10}$ | 6.66 | 0.78 | 17.95 | 0.57 | 2.03 | 0.04 | 0.49 | 0.61 | 0.54 | 0.12 |
E6-Tiller | $P{M}_{2.5}$ | 3.76 | 0.79 | 9.25 | 0.59 | 1.82 | 0.01 | 0.00 | 0.00 | 0.00 | 0.10 |
E6-Tiller | $P{M}_{10}$ | 9.35 | 0.82 | 21.28 | 0.49 | 1.73 | 0.08 | 0.19 | 0.08 | 0.11 | 0.49 |
Elgeseter | $P{M}_{2.5}$ | 4.53 | 0.73 | 9.33 | 0.59 | 1.97 | 0.04 | 0.60 | 0.45 | 0.52 | 0.21 |
Elgeseter | $P{M}_{10}$ | 5.76 | 0.76 | 16.61 | 0.53 | 1.96 | 0.04 | 0.37 | 0.45 | 0.41 | 0.18 |
Torvet | $P{M}_{2.5}$ | 4.58 | 0.79 | 10.33 | 0.54 | 1.63 | 0.04 | 0.67 | 0.50 | 0.57 | 0.20 |
Torvet | $P{M}_{10}$ | 5.62 | 0.71 | 12.48 | 0.50 | 1.76 | 0.03 | 0.50 | 0.42 | 0.46 | 0.13 |

Columns RMSE to NLL: PM-value regression. Columns Brier to CE: threshold exceedance classification.

**Table 8.** Comparison of previous works and the proposed models for quantifying uncertainty in data-driven air quality forecasting.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Murad, A.; Kraemer, F.A.; Bach, K.; Taylor, G.
Probabilistic Deep Learning to Quantify Uncertainty in Air Quality Forecasting. *Sensors* **2021**, *21*, 8009.
https://doi.org/10.3390/s21238009
