Special Issue "Probabilistic Methods for Deep Learning"

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (1 October 2021) | Viewed by 14860

Special Issue Editors

Dr. Eric Nalisnick
Guest Editor
Amsterdam Machine Learning Lab, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
Interests: statistical machine learning; probabilistic modeling; Bayesian statistics; deep learning; generative models; approximate inference
Dr. Dustin Tran
Guest Editor
Google Brain, Mountain View, CA, USA
Interests: statistical machine learning; Bayesian statistics; deep learning; uncertainty; robustness

Special Issue Information

Dear colleagues,

The umbrella of techniques known as deep learning has had empirical success across a variety of predictive modeling tasks. Consequently, there is hope that deep learning can catalyze progress in medicine, the sciences, and other domains of consequence. Yet, many deep learning techniques are ill-equipped for these new settings in which safety and transparency are crucial for their success. For instance, neural networks have been shown to be overconfident, which could lead to them being unduly trusted to make a medical diagnosis.

Combining deep learning with probabilistic and statistical methodologies is one potential way to overcome, or at least ameliorate, these shortcomings. A probabilistic approach can quantify a network’s uncertainty, allowing for more informed downstream decision making. Of course, this is a non-trivial pursuit, as deep learning incurs computational and analytical difficulties that do not plague more traditional models. Adapting deep learning so that its robustness and uncertainty can be quantified without sacrificing predictive power is an open and challenging problem.

In this Special Issue, we aim to highlight work at the intersection of deep learning, probabilistic modeling, and statistical inference. In particular, we welcome work on Bayesian neural networks, deep latent variable models, deep ensembles, networks with statistical guarantees (e.g., via conformal inference), and probabilistic understanding of neural networks (e.g., via infinite limits).

Dr. Eric Nalisnick
Dr. Dustin Tran
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • neural networks
  • probabilistic modeling
  • Bayesian statistics
  • statistical inference
  • uncertainty quantification
  • robustness

Published Papers (11 papers)


Research

Article
Perfect Density Models Cannot Guarantee Anomaly Detection
Entropy 2021, 23(12), 1690; https://doi.org/10.3390/e23121690 - 16 Dec 2021
Cited by 12 | Viewed by 1217
Abstract
Thanks to the tractability of their likelihood, several deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities through the lens of reparametrization and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for anomaly detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.
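
The reparametrization argument in this abstract can be illustrated with a one-dimensional sketch. This is not the paper's own experiment: it assumes a standard normal variable and uses its CDF as the invertible map, and all function names are illustrative. By the change-of-variables formula, the density of y = f(x) at f(x) is p_X(x) / |f'(x)|, so two points with very different densities in x can receive identical densities in y.

```python
import math


def normal_pdf(x):
    """Density of a standard normal variable at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def pushforward_density(x, pdf, map_derivative):
    """Density of y = f(x) evaluated at f(x), via change of variables.

    p_Y(f(x)) = p_X(x) / |f'(x)| for an invertible, differentiable map f.
    """
    return pdf(x) / abs(map_derivative(x))


# Assumed map: f is the standard normal CDF, whose derivative is normal_pdf.
typical, unusual = 0.0, 3.0

# In the original coordinates the "unusual" point has much lower density.
print(normal_pdf(typical), normal_pdf(unusual))

# After the reparametrization both points have the same density.
print(pushforward_density(typical, normal_pdf, normal_pdf),
      pushforward_density(unusual, normal_pdf, normal_pdf))
```

In this toy case the transformed variable is uniform on (0, 1), so a density threshold in the new coordinates cannot separate the two points, even though the underlying distribution is unchanged.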
(This article belongs to the Special Issue Probabilistic Methods for Deep Learning)

Article
Gradient Regularization as Approximate Variational Inference
Entropy 2021, 23(12), 1629; https://doi.org/10.3390/e23121629 - 03 Dec 2021
Viewed by 889
Abstract
We developed Variational Laplace for Bayesian neural networks (BNNs), which exploits a local approximation of the curvature of the likelihood to estimate the ELBO without the need for stochastic sampling of the neural-network weights. The Variational Laplace objective is simple to evaluate, as it is the log-likelihood plus weight-decay, plus a squared-gradient regularizer. Variational Laplace gave better test performance and expected calibration errors than maximum a posteriori inference and standard sampling-based variational inference, despite using the same variational approximate posterior. Finally, we emphasize the care needed in benchmarking standard VI, as there is a risk of stopping before the variance parameters have converged. We show that early stopping can be avoided by increasing the learning rate for the variance parameters.

Article
Empirical Frequentist Coverage of Deep Learning Uncertainty Quantification Procedures
Entropy 2021, 23(12), 1608; https://doi.org/10.3390/e23121608 - 30 Nov 2021
Cited by 4 | Viewed by 1046
Abstract
Uncertainty quantification for complex deep learning models is increasingly important as these techniques see growing use in high-stakes, real-world settings. Currently, the quality of a model’s uncertainty is evaluated using point-prediction metrics, such as the negative log-likelihood (NLL), expected calibration error (ECE) or the Brier score on held-out data. Marginal coverage of prediction intervals or sets, a well-known concept in the statistical literature, is an intuitive alternative to these metrics but has yet to be systematically studied for many popular uncertainty quantification techniques for deep learning models. With marginal coverage and the complementary notion of the width of a prediction interval, downstream users of deployed machine learning models can better understand uncertainty quantification both on a global dataset level and on a per-sample basis. In this study, we provide the first large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks. We find that, in general, some methods do achieve desirable coverage properties on in-distribution samples, but that coverage is not maintained on out-of-distribution data. Our results demonstrate the failings of current uncertainty quantification techniques as dataset shift increases and reinforce coverage as an important metric in developing models for real-world applications.
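
As a rough illustration of the two quantities this abstract refers to, the sketch below computes empirical marginal coverage and mean interval width for regression intervals. It is a minimal example under stated assumptions, not the evaluation code from the paper: the function name, the closed-interval convention, and the sample numbers are all hypothetical.

```python
def coverage_and_width(targets, lowers, uppers):
    """Return (fraction of targets inside their interval, mean interval width).

    Assumes closed intervals [lower, upper], one per held-out target.
    """
    if not targets or not (len(targets) == len(lowers) == len(uppers)):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(targets, lowers, uppers))
    mean_width = sum(hi - lo for lo, hi in zip(lowers, uppers)) / len(targets)
    return hits / len(targets), mean_width


# Hypothetical held-out targets and prediction intervals from some model.
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.5, 3.0]
upper = [1.5, 2.5, 4.5, 5.0]

coverage, width = coverage_and_width(y_true, lower, upper)
print(coverage, width)
```

Comparing the empirical coverage with the nominal level (for example 0.90), on both in-distribution and shifted data, is the kind of check the abstract describes; width is reported alongside because very wide intervals can achieve high coverage trivially.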

Article
Minimum Message Length in Hybrid ARMA and LSTM Model Forecasting
Entropy 2021, 23(12), 1601; https://doi.org/10.3390/e23121601 - 29 Nov 2021
Cited by 2 | Viewed by 1029
Abstract
Modeling and analysis of time series are important in applications including economics, engineering, environmental science and social science. Selecting the best time series model with accurate parameters in forecasting is a challenging objective for scientists and academic researchers. Hybrid models combining neural networks and traditional Autoregressive Moving Average (ARMA) models are being used to improve the accuracy of modeling and forecasting time series. Most of the existing time series models are selected by information-theoretic approaches, such as AIC, BIC, and HQ. This paper revisits a model selection technique based on Minimum Message Length (MML) and investigates its use in hybrid time series analysis. MML is a Bayesian information-theoretic approach and has been used in selecting the best ARMA model. We utilize the long short-term memory (LSTM) approach to construct a hybrid ARMA-LSTM model and show that MML performs better than AIC, BIC, and HQ in selecting the model, both in the traditional ARMA models (without LSTM) and with hybrid ARMA-LSTM models. These results held on simulated data and both real-world datasets that we considered. We also develop a simple MML ARIMA model.
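
For context, the baseline criteria named in this abstract can be sketched as follows, assuming HQ denotes the Hannan-Quinn criterion and using the standard textbook forms. The MML criterion itself is not reproduced here, since the abstract does not give its formula; the function name and the candidate models with their log-likelihoods are hypothetical.

```python
import math


def information_criteria(log_likelihood, num_params, num_obs):
    """Standard AIC, BIC and Hannan-Quinn scores (lower is better)."""
    if num_obs <= 1:
        raise ValueError("num_obs must be greater than 1")
    fit_term = -2.0 * log_likelihood
    return {
        "AIC": 2.0 * num_params + fit_term,
        "BIC": num_params * math.log(num_obs) + fit_term,
        "HQ": 2.0 * num_params * math.log(math.log(num_obs)) + fit_term,
    }


# Hypothetical fitted models: name -> (maximized log-likelihood, parameter count).
candidates = {"ARMA(1,0)": (-520.0, 2), "ARMA(2,1)": (-510.0, 4)}
num_obs = 200

for criterion in ("AIC", "BIC", "HQ"):
    best = min(
        candidates,
        key=lambda name: information_criteria(*candidates[name], num_obs)[criterion],
    )
    print(criterion, "selects", best)
```

All three criteria trade goodness of fit against parameter count and differ only in the penalty term.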

Article
History Marginalization Improves Forecasting in Variational Recurrent Neural Networks
Entropy 2021, 23(12), 1563; https://doi.org/10.3390/e23121563 - 24 Nov 2021
Cited by 1 | Viewed by 825
Abstract
Deep probabilistic time series forecasting models have become an integral part of machine learning. While several powerful generative models have been proposed, we provide evidence that their associated inference models are oftentimes too limited and cause the generative model to predict mode-averaged dynamics. Mode-averaging is problematic since many real-world sequences are highly multi-modal, and their averaged dynamics are unphysical (e.g., predicted taxi trajectories might run through buildings on the street map). To better capture multi-modality, we develop variational dynamic mixtures (VDM): a new variational family to infer sequential latent variables. The VDM approximate posterior at each time step is a mixture density network, whose parameters come from propagating multiple samples through a recurrent architecture. This results in an expressive multi-modal posterior approximation. In an empirical study, we show that VDM outperforms competing approaches on highly multi-modal datasets from different domains.

Article
Winsorization for Robust Bayesian Neural Networks
Entropy 2021, 23(11), 1546; https://doi.org/10.3390/e23111546 - 20 Nov 2021
Cited by 3 | Viewed by 719
Abstract
With the advent of big data and the popularity of black-box deep learning methods, it is imperative to address the robustness of neural networks to noise and outliers. We propose the use of Winsorization to recover model performance when the data may have outliers and other aberrant observations. We provide a comparative analysis of several probabilistic artificial intelligence and machine learning techniques for supervised learning case studies. Broadly, Winsorization is a versatile technique for accounting for outliers in data. However, different probabilistic machine learning techniques have different levels of efficiency when used on outlier-prone data, with or without Winsorization. We observe that Gaussian processes are extremely vulnerable to outliers, while deep learning techniques in general are more robust.
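
Winsorization, in its usual form, clips extreme values to chosen lower and upper quantiles rather than discarding them. The sketch below shows that idea on a plain list; it is an illustrative assumption about the general technique, not the preprocessing used in the paper, and the simple rank-based cutoff rule and the function name are choices made here for brevity.

```python
def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to the empirical lower_q and upper_q quantiles.

    Cutoffs are taken from the sorted data by truncated rank, which is one of
    several common quantile conventions.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= lower_q <= upper_q <= 1.0:
        raise ValueError("require 0 <= lower_q <= upper_q <= 1")
    ordered = sorted(values)
    last = len(ordered) - 1
    low_cut = ordered[int(lower_q * last)]
    high_cut = ordered[int(upper_q * last)]
    # Values beyond a cutoff are replaced by the cutoff, not removed.
    return [min(max(v, low_cut), high_cut) for v in values]


# Hypothetical measurements with a single gross outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(winsorize(data, lower_q=0.1, upper_q=0.9))
```

Because the sample size is preserved, the clipped data can be passed to any downstream model unchanged in shape, which is what makes the technique easy to compare across learners.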

Article
Conditional Deep Gaussian Processes: Multi-Fidelity Kernel Learning
Entropy 2021, 23(11), 1545; https://doi.org/10.3390/e23111545 - 20 Nov 2021
Cited by 3 | Viewed by 841
Abstract
Deep Gaussian Processes (DGPs) were proposed as an expressive Bayesian model capable of a mathematically grounded estimation of uncertainty. The expressivity of DGPs results not only from their compositional character but also from the distribution propagation within the hierarchy. Recently, it was pointed out that the hierarchical structure of DGPs is well suited to modeling multi-fidelity regression, in which one is provided sparse observations with high precision and plenty of low-fidelity observations. We propose the conditional DGP model in which the latent GPs are directly supported by the fixed lower-fidelity data. Then the moment matching method is applied to approximate the marginal prior of the conditional DGP with a GP. The obtained effective kernels are implicit functions of the lower-fidelity data, manifesting the expressivity contributed by distribution propagation within the hierarchy. The hyperparameters are learned by optimizing the approximate marginal likelihood. Experiments with synthetic and high-dimensional data show comparable performance against other multi-fidelity regression methods, variational inference, and multi-output GP. We conclude that, with the low-fidelity data and the hierarchical DGP structure, the effective kernel encodes the inductive bias for the true function, allowing compositional freedom.

Article
Sampling the Variational Posterior with Local Refinement
Entropy 2021, 23(11), 1475; https://doi.org/10.3390/e23111475 - 08 Nov 2021
Viewed by 665
Abstract
Variational inference is an optimization-based method for approximating the posterior distribution of the parameters in Bayesian probabilistic models. A key challenge of variational inference is to approximate the posterior with a distribution that is computationally tractable yet sufficiently expressive. We propose a novel method for generating samples from a highly flexible variational approximation. The method starts with a coarse initial approximation and generates samples by refining it in selected, local regions. This allows the samples to capture dependencies and multi-modality in the posterior, even when these are absent from the initial approximation. We demonstrate theoretically that our method always improves the quality of the approximation (as measured by the evidence lower bound). In experiments, our method consistently outperforms recent variational inference methods in terms of log-likelihood and ELBO across three example tasks: the Eight-Schools example (an inference task in a hierarchical model), training a ResNet-20 (Bayesian inference in a large neural network), and the Mushroom task (posterior sampling in a contextual bandit problem).

Article
Ensemble Neuroevolution-Based Approach for Multivariate Time Series Anomaly Detection
Entropy 2021, 23(11), 1466; https://doi.org/10.3390/e23111466 - 06 Nov 2021
Cited by 3 | Viewed by 1042
Abstract
Multivariate time series anomaly detection is a widespread problem in the field of failure prevention. Fast prevention means lower repair costs and losses. The number of sensors in novel industrial systems makes the anomaly detection process quite difficult for humans. Algorithms that automate the process of detecting anomalies are crucial in modern failure prevention systems. Therefore, many machine learning models have been designed to address this problem. Mostly, they are autoencoder-based architectures with some generative adversarial elements. This work shows a framework that incorporates neuroevolution methods to boost the anomaly detection scores of new and already known models. The presented approach adapts evolution strategies for evolving an ensemble model, in which every single model works on a subgroup of data sensors. The next goal of neuroevolution is to optimize the architecture and hyperparameters such as the window size, the number of layers, and the layer depths. The proposed framework shows that it is possible to boost most anomaly detection deep learning models in a reasonable time and a fully automated mode. We ran tests on the SWAT and WADI datasets. To the best of our knowledge, this is the first approach in which an ensemble deep learning anomaly detection model is built in a fully automatic way using a neuroevolution strategy.

Article
Conditional Deep Gaussian Processes: Empirical Bayes Hyperdata Learning
Entropy 2021, 23(11), 1387; https://doi.org/10.3390/e23111387 - 23 Oct 2021
Cited by 1 | Viewed by 867
Abstract
It is desirable to combine the expressive power of deep learning with Gaussian processes (GPs) in one expressive Bayesian learning model. Deep kernel learning showed success by using a deep network for feature extraction and a GP as the function model. Recently, it was suggested that, despite training with the marginal likelihood, the deterministic nature of a feature extractor might lead to overfitting, and replacement with a Bayesian network seemed to cure it. Here, we propose the conditional deep Gaussian process (DGP) in which the intermediate GPs in hierarchical composition are supported by the hyperdata and the exposed GP remains zero mean. Motivated by the inducing points in sparse GP, the hyperdata also play the role of function supports, but are hyperparameters rather than random variables. It follows our previous moment matching approach to approximate the marginal prior for conditional DGP with a GP carrying an effective kernel. Thus, as in empirical Bayes, the hyperdata are learned by optimizing the approximate marginal likelihood, which implicitly depends on the hyperdata via the kernel. We show the equivalence with deep kernel learning in the limit of dense hyperdata in latent space. However, the conditional DGP and the corresponding approximate inference enjoy the benefit of being more Bayesian than deep kernel learning. Preliminary extrapolation results demonstrate expressive power from the depth of hierarchy by exploiting the exact covariance and hyperdata learning, in comparison with GP kernel composition, DGP variational inference and deep kernel learning. We also address the non-Gaussian aspect of our model as well as a way of upgrading to a full Bayes inference.

Article
Self-Supervised Variational Auto-Encoders
Entropy 2021, 23(6), 747; https://doi.org/10.3390/e23060747 - 14 Jun 2021
Cited by 3 | Viewed by 1700
Abstract
Density estimation, compression, and data generation are crucial tasks in artificial intelligence. Variational Auto-Encoders (VAEs) constitute a single framework to achieve these goals. Here, we present a novel class of generative models, called self-supervised Variational Auto-Encoder (selfVAE), which utilizes deterministic and discrete transformations of data. This class of models allows both conditional and unconditional sampling while simplifying the objective function. First, we use a single self-supervised transformation as a latent variable, where the transformation is either downscaling or edge detection. Next, we consider a hierarchical architecture, i.e., multiple transformations, and we show its benefits compared to the VAE. The flexibility of selfVAE in data reconstruction finds a particularly interesting use case in data compression tasks, where we can trade off memory for better data quality and vice versa. We present the performance of our approach on three benchmark image datasets (Cifar10, Imagenette64, and CelebA).
