Classification and Uncertainty Quantification of Corrupted Data using Semi-Supervised Autoencoders

Parametric and non-parametric classifiers often have to deal with real-world data, where corruptions such as noise, occlusions, and blur are unavoidable and pose significant challenges. We present a probabilistic approach to classify strongly corrupted data and quantify uncertainty, even though the model has only been trained with uncorrupted data. A semi-supervised autoencoder trained on uncorrupted data is the underlying architecture. We use the decoding part as a generative model for realistic data and extend it by convolutions, masking, and additive Gaussian noise to describe imperfections. This constitutes a statistical inference task in terms of the optimal latent space activations of the underlying uncorrupted datum. We solve this problem approximately with Metric Gaussian Variational Inference (MGVI). The supervision of the autoencoder's latent space allows us to classify corrupted data directly under uncertainty with the statistically inferred latent space activations. Furthermore, we demonstrate that the model uncertainty strongly depends on whether the classification is correct or wrong, setting a basis for a statistical "lie detector" of the classification. Independent of that, we show that the generative model can optimally restore the uncorrupted datum by decoding the inferred latent space activations.


Introduction and Motivation
Many real-world applications of data-driven classifiers, e.g., neural networks, involve corruptions that pose significant challenges to pretrained classifiers. Often, the corruption must be included in the training data and thus be known before training. For instance, noise (e.g., due to sensor imperfections) and convolutions (e.g., due to lens flares or unfocused images) are inevitable in image processing systems and may occur spontaneously and irregularly. The same holds for masking, which may occur when a foreign object occludes the actual object of interest (e.g., water droplets, dirt, or scratches on the camera lens). Hence, we aimed to answer the following question in this paper: How can we classify corrupted data with a parametric classifier without imposing any constraints on the training data? As classifying corrupted data naturally demands a measure of uncertainty for validation, we included both model uncertainty $\delta_m$ and reconstruction uncertainty $\delta_r$ in the classification. We refer to $\delta_m$ as the model's confidence in the classification itself. In contrast, we refer to $\delta_r$ as the confidence in the process of reconstructing the latent space activations given some corrupted datum. An overview of the proposed method is illustrated in Figure 1.

Figure 1. From left to right: ground truth image $x$ in the data space; corrupted image $d$ in the data space (random masking $m$, Gaussian blur $C$, additive white Gaussian noise $n$); posterior mean $\bar{H}$ in the latent space with reconstruction uncertainty $\delta_r$; model uncertainty $\delta_m$; and the restored image $g(\bar{H})$ (decoded posterior mean) in the data space. We included the encoding of the uncorrupted data $f(x)$ (illustrated by the shaded white bars in the third column). Top row: data sample from the MNIST dataset (ground truth label: 4). Bottom row: data sample from the Fashion-MNIST dataset (ground truth label: 2 (pullover)). We can classify $d$ using the posterior mean $\bar{H}$, as the autoencoder's latent space is supervised (note the highlighted maximum activation responsible for classification). We are able to classify and quantify the model uncertainty $\delta_m$ with the Mahalanobis distance in the latent space (note the highlighted minimum activation responsible for classification). For the Fashion-MNIST example, the strong overlap of the $1\sigma$ error bars of $\delta_r$ across different classes indicates that no reliable and confident classification is possible due to heavy corruption.

Methodology Overview and Related Work
To address the challenge of classification and uncertainty quantification of corrupted data, we propose the following core approach, illustrated in Figure 2.
1. In the first step, we trained a supervised autoencoder [1] that is (a) capable of classifying the input data with its latent space activations $h$ and (b) capable of decoding the (supervised) latent space activations to generate higher-dimensional data that should be identical to the input. Apart from constraints (a) and (b), we did not impose any further restrictions on the autoencoder.
2. In the second step, we decoupled the decoder $g$ from the autoencoder and treated it as a fixed generative function. In the following steps, $g$ is neither retrained nor otherwise modified.
3. In the third step, we included $g$ in an additive white Gaussian noise (AWGN) channel model, $d = m\,C\,g(h) + n$. In addition to the noise $n$, this channel model involves heavy corruption such as convolution $C$ and masking $m$.
4. In the final step, we approximated the posterior probability distribution $P(h|d)$ in the latent space and derived its mean and standard deviation, corresponding optimally to some uncorrupted datum $g(h)$, given the corrupted datum $d$.

Due to supervision in the latent space, this reconstruction enables a direct classification of $d$, including model and reconstruction uncertainty quantification, even though the decoding function was trained on uncorrupted data. We used a set of samples $H$ from the approximate posterior probability distribution to determine the sample mean, $\mathrm{mean}(H) = \bar{H}$, as well as the set's reconstruction uncertainty $\delta_r$, given by the sample standard deviation $\mathrm{std}(H)$. Samples are statistically inferred by Metric Gaussian Variational Inference (MGVI) [2]. In addition to the reconstruction uncertainty $\delta_r$, we determined the model uncertainty by calculating the Mahalanobis distance (M-distance) in the latent space representation, in a way slightly different from [3]. Here, we distinguish between reconstruction uncertainty $\delta_r$ and model uncertainty $\delta_m$: the former evaluates the confidence in the process of inferring $h$, and the latter evaluates the confidence in the classification given by the supervised latent space.

Similar to our approach, references [4,5] showed that reconstructing the latent space of a corrupted datum by posterior inference with generative models [6][7][8] can lead to an optimal image restoration with uncertainty quantification. These methods do not, however, focus on classifying the corrupted datum in the latent space, nor do they use supervised autoencoder structures. In the field of quantifying uncertainties of classifications, several methods exist.
Predominantly, Bayesian neural networks (BNNs) [9] and Monte Carlo dropout (MC-dropout) [10] have recently shown success. More recently, Evidential Deep Learning (EDL) [11] was introduced as yet another probabilistic method to quantify classification uncertainty. The latter two methods are compared to our method in Section 3. Finally, various methods to perform image restoration exist in the literature, such as the well-known denoising autoencoder [12]. These conventional methods require prior knowledge of the corruption, which must be included in the training data.

Generative Model and Bayesian Inference with Neural Networks
The first step of our method is to train a supervised autoencoder. The autoencoder involves the encoding function $f$ (mapping data $x \in \mathbb{R}^p$ to the latent space representation with activations $h \in \mathbb{R}^z$, $z \in \mathbb{N}$), as well as the decoding function $g$ (mapping $h$ to the data space representation $\hat{x} \in \mathbb{R}^p$, $p \in \mathbb{N}$, $p > z$). The parameters of $f: \mathbb{R}^p \to \mathbb{R}^z$ and $g: \mathbb{R}^z \to \mathbb{R}^p$ are optimized via a combination of two loss terms, $L_{gf}$ (representing the reconstruction loss in the data space) and $L_f$ (representing the classification loss in the latent space):

$$L_{\mathrm{SAE}} = L_{gf}(x, \hat{x}) + L_f(h_j, y), \tag{1}$$

where $j \le z$ denotes the number of activations in the latent space $h$ that are supervised. After normalizing all data samples to the range $[0, 1]$, we used the corresponding cross-entropy for each respective loss term to penalize false classifications in the latent space and inaccurate reconstructions in the data space. Note that the loss term $L_f(h_j, y)$ processes the latent space activations $h_j$ with the softmax function (i.e., $L_{\mathrm{SAE}} = L_{gf}(x, \hat{x}) + L_f(\mathrm{softmax}(h_j), y)$). The softmax function yields values in $[0, 1]$, which can be penalized against the one-hot-encoded labels $y$. The softmax function is not included as an activation function in our neural network, where the latent space $h$ is activated linearly; see Section 3 for details. We minimized the general loss function of Equation (1) using the Adam optimizer [13] (test accuracy of the encoding function $f$: 98.6% on MNIST and 89.4% on Fashion-MNIST). Once the training procedure converged, we decoupled the decoding function $g$ from the autoencoder.

Without loss of generality, we then used an AWGN model including the nonlinearity $g(h)$, which additionally involves masking $m$ and convolution $C$ on $g$:

$$d = m\,C\,g(h) + n. \tag{2}$$

Additive white Gaussian noise, $n \in \mathbb{R}^p$ with $n \sim \mathcal{N}(0, \Sigma_n)$, is applied to the decoded latent space signal $g(h)$, which yields the corrupted data $d \in \mathbb{R}^p$.
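As an illustration of the channel model $d = m\,C\,g(h) + n$, the following is a minimal NumPy sketch of how a clean image could be corrupted. It is not the authors' implementation: the helper names (`gaussian_kernel`, `blur`, `corrupt`), the zero padding at the image border, and the reading of the noise level `alpha` as a per-pixel variance are assumptions made for this example only.

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.0):
    """Normalized 2-D Gaussian blur kernel of odd width `size` (hypothetical helper)."""
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(image, kernel):
    """Linear blur operator C: same-size filtering with zero padding."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

def corrupt(clean, mask, kernel, alpha, rng):
    """d = m * C(clean) + n, with n ~ N(0, alpha) per pixel.

    Assumption: alpha is interpreted as the noise variance.
    """
    noise = rng.normal(0.0, np.sqrt(alpha), size=clean.shape)
    return mask * blur(clean, kernel) + noise

# Usage with a random stand-in for a decoded image g(h).
rng = np.random.default_rng(0)
clean = rng.random((28, 28))
mask = np.ones((28, 28))
mask[10:18, 10:18] = 0.0          # window masking
kernel = gaussian_kernel(7, 1.5)
d = corrupt(clean, mask, kernel, alpha=0.1, rng=rng)
```

Both masking and blurring are linear in the image, so the only nonlinearity in the forward model is the decoder itself.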
Note that the implementation applies the reparametrization trick [14] in the form $h = A\xi + \mu_h$. In addition to AWGN, we included the corruptions of masking $m$ and convolution $C$, which are both linear operations. Since we are interested in reconstructing the latent space activation $h$ from $d$ alongside uncertainty quantification, the goal is to determine the posterior distribution $P(h|d) \propto P(d|h)\,P(h)$. With a Gaussian prior on $h$ with mean $\mu_h$ and covariance $\Sigma_h$, the log-probability distribution reads, up to a constant independent of $h$,

$$\ln P(h|d) = -\tfrac{1}{2}\,\big(d - m\,C\,g(h)\big)^T \Sigma_n^{-1} \big(d - m\,C\,g(h)\big) - \tfrac{1}{2}\,(h - \mu_h)^T \Sigma_h^{-1} (h - \mu_h) + \mathrm{const}, \tag{3}$$

where $(\cdot)^T$ denotes the matrix transpose. Since we are ultimately interested in the analytically intractable mean of $h$, $\langle h \rangle_{P(h|d)} = \int h\,P(h|d)\,\mathrm{d}h$, we approximately determined the mean and the variance of $P(h|d)$ by applying MGVI. Similar to other variational inference methods [14,15], MGVI approximates the distribution by a simpler, but tractable distribution from within a variational family, $Q(h)$. The parameters of $Q(h)$, i.e., mean $\eta$ and covariance $\Delta$, are obtained by minimizing the Kullback-Leibler divergence between $Q(h)$ and $P(h|d)$, which is equivalent to maximizing the variational lower bound. The size of a full variational covariance, however, scales quadratically with the number of latent variables. Taking this limitation into account, we employed MGVI, which locally approximates the target distribution using the inverse Fisher metric as an uncertainty estimate around the variational mean $\eta$, for which we optimize. The approximation is represented by an ensemble of samples $H = \{\tilde{h}_1, \tilde{h}_2, \dots, \tilde{h}_n\}$ with $\tilde{h} \in \mathbb{R}^z$, which we used for our analysis; $\tilde{h}$ refers to an inferred sample. We here call $\bar{H}$ the posterior mean and $\delta_r$ the posterior standard deviation, or the reconstruction uncertainty.
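To make the inference target concrete, the sketch below evaluates the negative log-posterior $-\ln P(h|d)$ (up to a constant) for the model $P(h|d) \propto P(d|h)\,P(h)$. It only evaluates the objective and does not perform MGVI. Several things are assumptions for illustration: isotropic noise $\Sigma_n = \texttt{noise\_var} \cdot I$, a Gaussian prior with mean `mu_h` and inverse covariance `sigma_h_inv`, and a random toy decoder `g` standing in for the trained network; the blur $C$ is taken as the identity for brevity.

```python
import numpy as np

def neg_log_posterior(h, d, forward, noise_var, mu_h, sigma_h_inv):
    """-ln P(h|d) up to an h-independent constant.

    Assumes d = forward(h) + n with n ~ N(0, noise_var * I)
    and a Gaussian prior h ~ N(mu_h, Sigma_h).
    """
    residual = d - forward(h)
    likelihood_term = 0.5 * np.sum(residual ** 2) / noise_var
    dh = h - mu_h
    prior_term = 0.5 * dh @ sigma_h_inv @ dh
    return likelihood_term + prior_term

# Toy setup: random "decoder" weights, NOT a trained model.
rng = np.random.default_rng(1)
z, p = 10, 784
W = rng.normal(size=(p, z)) / np.sqrt(z)

def g(h):
    """Hypothetical decoder: linear map followed by a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(W @ h)))

mask = (rng.random(p) > 0.3).astype(float)

def forward(h):
    """Forward model m * g(h); the blur C is the identity in this sketch."""
    return mask * g(h)

h_true = rng.normal(size=z)
d = forward(h_true) + rng.normal(0.0, np.sqrt(0.1), size=p)

mu_h, sigma_h_inv = np.zeros(z), np.eye(z)
e_true = neg_log_posterior(h_true, d, forward, 0.1, mu_h, sigma_h_inv)
e_far = neg_log_posterior(h_true + 5.0, d, forward, 0.1, mu_h, sigma_h_inv)
```

In this toy setting, `e_true` is smaller than `e_far`, i.e., the latent vector that generated the data is far more probable under the posterior than a strongly shifted one. An inference scheme such as MGVI explores this landscape around its optimum to produce the sample set $H$.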

Classification and Uncertainty Quantification
The supervision of the latent space allows us to classify the input $d$ in a straightforward manner by evaluating the sample mean and sample standard deviation of the set $H$. While the sample mean of the set, $\mathrm{mean}(H) = \bar{H}$, gives the most likely class, the sample standard deviation reflects the reconstruction uncertainty $\delta_r$ of the latent space posterior distribution. $\delta_r$ depends on the type and magnitude of the corruption, as well as on the prior probability distribution we included in the channel model (Equation (3)). We visualized this dependency with various experiments; see Figure 3.
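The evaluation of the sample set described above can be sketched in a few lines of NumPy. The function name `summarize_samples`, the synthetic samples, and the simple $1\sigma$ overlap check between the two strongest activations are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def summarize_samples(H):
    """Summarize posterior samples H of shape (n_samples, z).

    Returns the sample mean, the reconstruction uncertainty (sample
    standard deviation), the predicted class (largest mean activation),
    and a flag telling whether the 1-sigma bars of the two strongest
    activations overlap.
    """
    H = np.asarray(H, dtype=float)
    mean = H.mean(axis=0)
    delta_r = H.std(axis=0, ddof=1)
    ranked = np.argsort(mean)[::-1]
    top, runner_up = int(ranked[0]), int(ranked[1])
    overlap = mean[top] - delta_r[top] <= mean[runner_up] + delta_r[runner_up]
    return mean, delta_r, top, bool(overlap)

# Synthetic samples: class 4 is clearly the strongest activation.
rng = np.random.default_rng(0)
centre = np.zeros(10)
centre[4] = 3.0
H = centre + 0.2 * rng.normal(size=(50, 10))

mean, delta_r, predicted, ambiguous = summarize_samples(H)
```

An overlap flag of this kind is one simple way to express numerically what Figure 1 shows for the Fashion-MNIST example, where overlapping error bars indicate that no confident classification is possible.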
Since we are additionally interested in the uncertainty of the model, $\delta_m$, we evaluated the M-distance of all samples in $H$ to every class conditional distribution in the latent space (see arrows in Figure 4). We initially determined the parameters of these class conditional distributions by passing the uncorrupted data samples from an independent dataset $X_{\mathrm{Val}}$ (i.e., independent of training and testing; see Section 3) through the encoder $f$. We then determined the class conditional distribution closest to a single sample $\tilde{h}$, which corresponds to the most likely class. The absolute value of the M-distance to the closest class conditional distribution serves as a measure of the model uncertainty $\delta_m$. In this work, all class conditional distributions in the latent space were assumed to follow multivariate Gaussian distributions with covariance $\Sigma_i$ and mean $\mu_i$. This implementation differs slightly from [3], where it was shown that the Mahalanobis distance is not only an accurate classifier in this context, but also a reliable out-of-distribution detector reflecting the model uncertainty. Reference [3] used tied covariance matrices, whereas our method uses an individual covariance matrix for each class conditional distribution.
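A minimal sketch of the M-distance classification with individual class covariances could look as follows. The function names and the synthetic three-class latent space are assumptions for illustration; in the setting described above, `latents` would hold the encoder outputs $f(x)$ for the validation set.

```python
import numpy as np

def fit_class_gaussians(latents, labels, n_classes):
    """Estimate (mean, inverse covariance) per class from encoded validation data."""
    params = []
    for c in range(n_classes):
        pts = latents[labels == c]
        mu = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False)      # individual covariance per class
        params.append((mu, np.linalg.inv(cov)))
    return params

def mahalanobis_classify(h, params):
    """Return (closest class, M-distance to that class) for one latent sample h."""
    dists = np.array([
        np.sqrt((h - mu) @ cov_inv @ (h - mu)) for mu, cov_inv in params
    ])
    k = int(np.argmin(dists))
    return k, float(dists[k])

# Synthetic latent space with three well-separated classes.
rng = np.random.default_rng(0)
centers = 4.0 * np.eye(3)
labels = np.repeat(np.arange(3), 200)
latents = centers[labels] + rng.normal(size=(600, 3))

params = fit_class_gaussians(latents, labels, n_classes=3)
k_near, dm_near = mahalanobis_classify(centers[1], params)
k_mid, dm_mid = mahalanobis_classify(np.array([2.0, 2.0, 0.0]), params)
```

A sample lying between two class clusters (`dm_mid`) yields a larger distance to its closest class than a sample at a cluster centre (`dm_near`), which is the behaviour that makes the distance usable as a model uncertainty.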

Experiments
To validate our method, we conducted several experiments on the MNIST [16] and the Fashion-MNIST [17] datasets (for implementation details and code, see https://github.com/pjoppich/corrupted_data_classification). We evaluated the performance on various corruption types and magnitudes and performed a comparison to MC-dropout [10] and EDL [11]. The following architecture was used for the supervised autoencoder (we used the same architecture for both datasets): A feedforward neural network was built with input dimension 784, where layers {0}-{2} and {4}-{7} use the SeLU activation function [18], layer {3} uses a linear activation, and layer {8} uses a sigmoid activation. Note that, in our case, for simplicity, the number of latent space dimensions $z$ is equal to the number of supervised classes $j$, although $j \le z$ holds generally. We split each dataset into three subsets: $X_{\mathrm{Train}}$ ($48 \cdot 10^3$ samples, used for training), $X_{\mathrm{Test}}$ ($10 \cdot 10^3$ samples, used for testing and experiments), and $X_{\mathrm{Val}}$ ($12 \cdot 10^3$ samples, used for determining $\Sigma_h$ and $\Sigma_{C_1}, \dots, \Sigma_{C_K}$). We used the MGVI implementation of NIFTy [19] to perform the inference.

Classification
We visualize experiments (1)-(3) in Figure 3. In the first experiment (1), we used the proposed method to classify data from an independent test set of the MNIST dataset that were corrupted by different noise levels. We denote by $\alpha$ the noise level of $n$, with $n \sim \mathcal{N}(0, \alpha)$. We compared the accuracy of our method to the baseline of processing the corrupted data through the encoder of the pretrained autoencoder. We show that we significantly improved the accuracy of classifying corrupted data in comparison to the straightforward classification by $f(d)$. For the second experiment (2), we used the same data samples as for (1), except that we now additionally corrupted the data with window masking at a constant noise level of $\alpha = 0.1$. Again, we compared the accuracy of our method to the baseline of processing the same data samples through the encoder. In the third experiment (3), we corrupted the data by convolving them with a Gaussian blur kernel with a filter size of $7 \times 7$ and different magnitudes $\gamma$ at a constant noise level of $\alpha = 0.1$.
Experiments (1), (2), and (3) led to the following conclusions:
• The reconstruction uncertainty $\delta_r^{\mathrm{True}}$ of correct classifications is approximately equal to the reconstruction uncertainty $\delta_r^{\mathrm{False}}$ of wrong classifications. This behavior indicates that the correctness of the classification does not influence the reconstruction uncertainty $\delta_r$, providing evidence that $\delta_r$ is independent of the model uncertainty $\delta_m$.
• As opposed to $\delta_r$, the model uncertainty $\delta_m$ strongly depends on whether the corrupted datum is classified correctly or wrongly: $\delta_m$ is significantly and consistently higher for false classifications than for true classifications. This characteristic sets the basis for a statistical "lie detector" of classification (see Section 3.2). Fields of application could be the validation of neural networks in, e.g., medical imaging and other safety-critical applications.
• Classifying corrupted data through the decoder (see $\mathrm{ACC}_g$ in Figure 3) rather than the encoder (see $\mathrm{ACC}_f$), with a suitable channel model considering the corruption, significantly improved the model's accuracy without the necessity of retraining the autoencoder. The accuracy improved notably for high levels of all corruption types. Corruption by convolution had catastrophic consequences for classifying data in a straightforward manner through the encoder $f$, while this type of corruption seemed to have only a minor impact on our method.
• Both uncertainties $\delta_r$ and $\delta_m$ rose with increasing levels of corruption.

Detection of False Classifications
Finally, in experiment (4) (see Figure 5), we validated the model uncertainty of our method by introducing the Uncertainty-based Receiver Operating Characteristics (U-ROC) curve of detecting false classifications with the M-distance. We evaluated the binary classification task of the two classes "The neural network correctly classifies a corrupted datum" (positive class) and "The neural network falsely classifies a corrupted datum" (negative class). Based on the model uncertainty of our method, we aimed to predict the two classes without further knowledge, providing the initially proposed "lie detector". The U-ROC curve was built from the true positive rate and the false positive rate. We compared our U-ROC curve with the U-ROC curves of the MC-dropout method [10] and of EDL [11], feeding all methods with the identical input of a datum corrupted by noise at $\alpha \in \{0.1, 0.5, 1.0\}$. We drew the following conclusions from experiment (4), Figure 5:
• Our method seemed to outperform MC-dropout and EDL in detecting false classifications given the same input data samples for $\alpha = 0.1$ and $\alpha = 0.5$. One reason for this might be that the M-distance serves as a reliable out-of-distribution detector, exploiting the inherent latent space structure of uncorrupted data as a reference, as opposed to MC-dropout and EDL. For $\alpha = 1.0$, both EDL and our method outperformed MC-dropout, while the Area Under the Curve (AUC) of EDL was largest. Here, it should be noted that EDL cannot classify the corrupted data at this noise level (accuracy: 8.9%), resulting in only a few samples to test the cases of true positives and false positives.
• All three methods provided reliable results for detecting false classifications at low noise levels. The model uncertainty $\delta_m$ reflects the confidence of the classification, i.e., a high value of $\delta_m$ correlates empirically with a higher probability of false classification.
• The U-ROC curve combined with the accuracy indicates that EDL seemed to overestimate uncertainties, leading to a very robust U-ROC curve for high noise levels, but simultaneously to a severe drop in accuracy in the presence of data corruption.
We observed comparable results on Fashion-MNIST data.
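The construction of a U-ROC curve from per-sample uncertainties can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation code of the experiments: the function name `u_roc` is hypothetical, a sample is predicted as "correctly classified" when its uncertainty falls below a sweeping threshold, ties between uncertainty values are not treated specially, and both classes must be present.

```python
import numpy as np

def u_roc(uncertainty, correct):
    """ROC curve for predicting 'classification is correct' from low uncertainty.

    uncertainty: per-sample model uncertainty (e.g., the M-distance).
    correct:     per-sample flag, True if the classifier was right (positive class).
    Returns false positive rates, true positive rates, and the AUC.
    """
    order = np.argsort(uncertainty)                 # most confident samples first
    correct = np.asarray(correct, dtype=bool)[order]
    tp = np.cumsum(correct)                         # correct samples accepted so far
    fp = np.cumsum(~correct)                        # wrong samples accepted so far
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))  # trapezoidal rule
    return fpr, tpr, auc

# Perfectly informative uncertainty: all correct samples have the lowest values.
unc = np.array([0.1, 0.2, 0.3, 0.9, 1.0])
fpr, tpr, auc_good = u_roc(unc, [True, True, True, False, False])

# Completely misleading uncertainty: the wrong samples look most confident.
_, _, auc_bad = u_roc(unc, [False, False, True, True, True])
```

An AUC close to 1 means the uncertainty separates correct from false classifications well, while an AUC near 0.5 means it carries no information about correctness.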

Requirements and Summary
Our proposed methodology to classify a corrupted datum $d$, including uncertainty quantification, requires the following inputs in addition to $d$:
• $m$, $C$: Without loss of generality, we assumed corruption by masking and convolution, represented by $m$ and $C$ in the AWGN channel model of Equation (2). In real-world applications, $C$ can often be derived from the image processing system in use. Algorithms to detect possibly occluding objects (represented by the masking $m$) are given by, e.g., [20].
• $\Sigma_n$: Noise covariance matrix of the AWGN, $n \sim \mathcal{N}(0, \Sigma_n)$, which is applied additively to the data. Among others, the methodology published by [21] enables the derivation of $\Sigma_n$ given noisy data $d$.
• $\Sigma_h$: Sample covariance matrix of all (uncorrupted) latent space activations processed by the encoding function $f$. We assumed that an autoencoder can represent an inherent, lower-dimensional structure of the data in its latent space and that this lower-dimensional structure sufficiently follows a multivariate Gaussian probability distribution.
In summary, our approach was able to classify heavily corrupted data with parametric classifiers. The method does not require corrupted data for training. As we built our procedure on a probabilistic architecture, we quantified the classification and the model uncertainties, allowing for a reliable detection of false classifications. We see our method as a highly flexible and robust framework that can be applied to any generative neural network to improve performance on corrupted data significantly. If the generative neural network comes with a supervised encoded space, it can classify the data directly. We showed that the M-distance can independently be used to classify data. The limitations of our method are that the corruption type needs to be modeled and that the computational cost is higher than that of MC-dropout and EDL (mainly due to the approximation of the posterior probability distribution; Step 4 in Section 2.1).