Latent-Insensitive autoencoders for Anomaly Detection

Reconstruction-based approaches to anomaly detection tend to fall short when applied to complex datasets with target classes that possess high inter-class variance. Similar to the idea of self-taught learning used in transfer learning, many domains are rich with similar unlabelled datasets that could be leveraged as a proxy for out-of-distribution samples. In this paper we introduce Latent-Insensitive autoencoder (LIS-AE) where unlabeled data from a similar domain is utilized as negative examples to shape the latent layer (bottleneck) of a regular autoencoder such that it is only capable of reconstructing one task. We provide theoretical justification for the proposed training process and loss functions along with an extensive ablation study highlighting important aspects of our model. We test our model in multiple anomaly detection settings presenting quantitative and qualitative analysis showcasing the significant performance improvement of our model for anomaly detection tasks.


INTRODUCTION
Anomaly detection is a classical machine learning field concerned with distinguishing in-distribution from out-of-distribution samples, with applications in numerous fields [1,34]. Unlike traditional multi-label classification, where the goal is to find decision boundaries between classes present in a given dataset, the goal of anomaly detection is to find one-versus-all boundaries for classes that are not in the dataset, which is significantly more challenging. Autoencoders [3] have been used extensively for anomaly detection [6,34,35] under the assumption that the reconstruction error incurred by anomalies is higher than that of normal samples [11,36]. However, it has been observed that this assumption might not hold, since standard autoencoders can generalize well even to anomalies [9,36]. In practice, this issue becomes more relevant in two important settings: when the normal data is relatively complex and therefore requires a high latent dimension for good reconstruction, and when anomalies share similar compositional features and come from a domain close to that of the normal data [7]. To mitigate these issues we present the Latent-Insensitive autoencoder (LIS-AE), a new class of autoencoders where the training process is carried out in two phases. In the first phase, the model simply reconstructs the input as a standard autoencoder. In the second phase, the entire model except the latent layer is "frozen", and we train the model in a way that forces the latent layer to keep reconstructing only the target task. We use the concept of a negative dataset from one-class classification [31], whereby an auxiliary dataset of non-examples from similar domains is used as a proxy for out-of-distribution samples. We change the training objective such that the autoencoder keeps its low reconstruction error on the target dataset while pushing the error on the negative dataset to exceed a certain value.
In some cases, minimizing and maximizing the reconstruction loss at the same time becomes contradictory, especially for negative classes that are very similar to the target class. To resolve this issue we introduce another variant with a modified first-phase loss that ensures that the input of the latent layer is linearly separable for positive and negative examples during the second phase. This linearly-separable variant (LinSep-LIS-AE) almost always performs better than directly using LIS-AE. The architecture, training process, theoretical analysis, and experiments are discussed in detail in the following sections.

RELATED WORK
Many reconstruction-based anomaly detection approaches have been proposed, starting with classical methods such as PCA [21]. Robust PCA mitigates the outlier sensitivity of PCA by decomposing the data matrix into the sum of a low-rank matrix and a sparse matrix, using the nuclear norm and the $\ell_1$ norm as convex relaxations of the objective [5]. Autoencoders address the fact that PCA only considers linear relations in feature space by introducing nonlinearities and benefiting from multiple layers of representation [4]. We elaborate further on the shortcomings of PCA and autoencoders in the theoretical section and use them to motivate our approach.
arXiv:2110.13101v2 [cs.LG] 14 Nov 2021
Other methods try to improve on base autoencoders by endowing the latent code with particular properties. The VAE [2] does so by constraining the latent code to follow a prior distribution (usually normal), which also allows sampling from the decoder. However, in the context of anomaly detection, it introduces scaling issues, since minimizing the KL-divergence for the high latent dimensions required by complex tasks is quite challenging. Another approach is Replicator Neural Networks (RepNN) [12], an autoencoder with a staircase activation function positioned on the output of the bottleneck layer (latent layer). This is mainly used to quantize the latent code into a number of discrete values, which also aids in forming clusters [32]. Unfortunately, a discrete staircase function is non-differentiable, which prevents learning via backpropagation. Instead, a differentiable approximation involving a sum of hyperbolic tangent functions (tanh) is used in place of the discrete staircase function. However, as discussed in [30], despite the theoretical appeal of a quantized latent code obtained via smooth approximation, in practice such an activation function makes it significantly more difficult for the gradient signal to flow. We also note that increasing the number of levels in the aforementioned tanh-sum approximation presents a significant overhead during training and testing, since the activation functions have to be computed for each batch; moreover, it suffers from scaling issues similar to those of the VAE. Another approach is memorizing the normality of a given dataset using a Memory-augmented autoencoder [9]. This approach limits the effective space of possible latent codes by constructing a memory module that takes the output of the encoder as an address and passes to the decoder the most relevant memory items from a stored reservoir of prototypical patterns learned during training.
Other, non-reconstruction-based approaches include one-class classification, which is tightly connected to anomaly detection in the sense that both problems are concerned with finding one-versus-all boundaries. One-Class SVM (OC-SVM) is a variation of the classical SVM algorithm [8] where the objective is to find a hyperplane that best separates samples from outliers [27]. Support Vector Data Description (SVDD) [29] tries to find a circumscribing hypersphere that contains all samples while having an optimal margin for outliers. It is worth noting that for kernels where $k(x, x) = 1$, such as RBF and Laplacian, OC-SVM and OC-SVDD learn identical decision functions [16]. To address the lack of representation learning and the poor computational scalability of OC-SVM and OC-SVDD, Deep SVDD (OC-DSVDD) employs a deep neural network that learns useful representations while mapping outputs to a hypersphere of minimum volume [24]. However, due to its sole reliance on optimizing for minimum volume, this approach is prone to hypersphere collapse, which leads to uninformative features [22].
Other approaches have been proposed where an auxiliary dataset of non-examples (negative dataset) is drawn from similar domains as a proxy for the otherwise intractable complement of the target class. In [31], a collection called the "Universum" allows learning useful representations of the problem domain by maximizing the number of contradictions on an equivalence class. Similar to OC-DSVDD, [22] leverages a labeled dataset from a close domain to fine-tune two pre-trained CNNs in order to learn good new features. The goodness of these features is quantified by the compactness (intra-class variance) for the target class and the descriptiveness (cross-entropy) for the labeled dataset. Despite avoiding hypersphere collapse and outperforming OC-SVDD, this approach requires two pre-trained neural networks and a large labeled dataset along with the target dataset. Another approach that also makes use of a large auxiliary dataset is Outlier Exposure (OE) [13], a supervised approach that trains a standard neural network classifier while exposing it to a diverse set of non-examples, on which the output of the classifier is optimized to follow a uniform distribution using another cross-entropy loss.

PROPOSED METHOD

Architecture
An undercomplete deep autoencoder is a type of unsupervised feed-forward neural network that learns a lower-dimensional feature representation of a particular dataset by reconstructing the input at the output layer. To prevent autoencoders from converging to trivial solutions such as the identity mapping, a bottleneck layer with output z is used whose dimension is less than the dimension of the input x. The forward pass is computed as:
$$z = f(\mathcal{E}(x)), \qquad \hat{x} = \mathcal{D}(z),$$
where x is the input, $f$ is the bottleneck layer, and $\mathcal{E}$ and $\mathcal{D}$ are convolutional neural networks representing the encoder and the decoder modules respectively. Typically, such models are trained to minimize the $L_2$-norm of the difference between the input and the reconstructed output, $\|x - \hat{x}\|_2$. As previously discussed, the choice of the activation function of z plays an important role in anomaly detection. Activation functions that quantize the latent code or encourage forming clusters are preferable. In our experiments, we find that confining the latent code to values in [−1, 1] with a tanh activation function, as we maximize the loss over the negative dataset during the latent-shaping phase, has a regularizing effect. We also note that unbounded activation functions such as ReLU tend to perform poorly.
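A minimal sketch of this forward pass follows. It is only illustrative: the paper uses convolutional encoder and decoder networks, whereas here `enc_w`, `lat_w`, and `dec_w` are hypothetical dense weight matrices standing in for the encoder, the bottleneck layer, and the decoder, chosen so that the example needs no external libraries. The tanh on the bottleneck keeps every latent value in [−1, 1].

```python
import math


def matvec(weights, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def forward(x, enc_w, lat_w, dec_w):
    """Return (z, x_hat) for one input vector x."""
    h = matvec(enc_w, x)                          # encoder features
    z = [math.tanh(a) for a in matvec(lat_w, h)]  # bounded latent code
    x_hat = matvec(dec_w, z)                      # reconstruction
    return z, x_hat


def l2_error(x, x_hat):
    """L2 norm of the difference between input and reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


# Toy weights (assumed values, for illustration only).
enc_w = [[1.0, 0.0], [0.0, 1.0]]
lat_w = [[0.5, 0.5]]
dec_w = [[1.0], [1.0]]

z, x_hat = forward([1.0, 2.0], enc_w, lat_w, dec_w)
error = l2_error([1.0, 2.0], x_hat)
```

In a two-phase scheme of the kind described in this paper, only `lat_w` would be updated during the second phase, while `enc_w` and `dec_w` stay fixed.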

Terminology
Positive Dataset ($D^+$): This is the dataset that contains the normal class(es), for example, the plane class from CIFAR-10.
Negative Dataset ($D^-$): This is a secondary unlabeled dataset containing negative examples from a similar domain as $D^+$. The choice of $D^-$ depends on $D^+$. For example, if $D^+$ is the digit 0 from MNIST, $D^-$ might be random strokes or another dataset with similar features such as Omniglot [28]. It is important to note that the model should not be tested on $D^-$, since this violates the assumption of not knowing anomalies.
Anomaly Dataset ($D^a$): This is a test dataset that contains classes that are neither in $D^+$ nor in $D^-$.
Feature Extraction Phase: This is the first phase of training. The model is simply trained to reconstruct its input.
Latent-Shaping Phase: This is the second phase of training. The encoder and decoder networks are frozen and only the latent layer is active.

Training for Anomaly Detection
Given a dataset $D^+$ and a negative dataset $D^-$ from a similar domain to $D^+$, we divide the training process into two phases. The first phase reconstructs samples from $D^+$ by minimizing the loss function $L = \|x^+ - \hat{x}^+\|_2$ until convergence, where $x^+$ is the input drawn from $D^+$ and $\hat{x}^+$ is the output of the autoencoder. In the second phase, we freeze the model except for the latent layer and minimize the following loss function:
$$L = \|x^+ - \hat{x}^+\|_2 + \gamma \max\big(0,\ \beta - \|x^- - \hat{x}^-\|_2\big),$$
where $x^-$ is a sampled batch from $D^-$, $\hat{x}^-$ is its reconstruction, $\gamma$ is a hyperparameter that controls the effect of the two parts of the loss function, and $\beta$ is another hyperparameter indicating that we are satisfied if the reconstruction error $\|x^- - \hat{x}^-\|_2$ of the negative dataset exceeds a certain value.
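The second-phase objective can be sketched as follows. The hinge form `max(0, beta - error)`, the names `gamma` and `beta`, and the averaging over each batch are assumptions made for illustration. They encode the description above (keep the positive error low, stop pushing the negative error once it exceeds a target value) rather than reproduce the authors' exact implementation.

```python
import math


def l2_error(x, x_hat):
    """L2 norm of the difference between an input and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


def latent_shaping_loss(pos, pos_hat, neg, neg_hat, gamma=1.0, beta=5.0):
    """Second-phase loss for one positive batch and one negative batch.

    pos/neg are lists of input vectors; pos_hat/neg_hat are their
    reconstructions. gamma weights the negative term and beta is the
    target error for negatives (both are hypothetical defaults).
    """
    # Keep reconstructing the positive (normal) samples well.
    pos_term = sum(l2_error(x, xh) for x, xh in zip(pos, pos_hat)) / len(pos)
    # Penalize negatives only while their error is still below beta.
    neg_term = sum(
        max(0.0, beta - l2_error(x, xh)) for x, xh in zip(neg, neg_hat)
    ) / len(neg)
    return pos_term + gamma * neg_term


# Negative error is exactly 5, so with beta = 5 the negative term vanishes.
loss_satisfied = latent_shaping_loss(
    [[0.0, 0.0]], [[0.0, 0.0]], [[3.0, 4.0]], [[0.0, 0.0]], gamma=0.5, beta=5.0
)
# With beta = 7 the negative error falls short of the target by 2.
loss_unsatisfied = latent_shaping_loss(
    [[0.0, 0.0]], [[0.0, 0.0]], [[3.0, 4.0]], [[0.0, 0.0]], gamma=0.5, beta=7.0
)
```

Because the encoder and decoder are frozen in this phase, gradients of such a loss would only be applied to the latent layer's parameters.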

Predicting Anomalies
We use the reconstruction error $L(x) = \|x - \hat{x}\|_2$ to distinguish between anomalies and normal data, where x is the test sample and $\hat{x}$ is the reconstructed output. More specifically, we set a threshold such that if $L(x)$ exceeds it, the input is considered anomalous.
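As a small illustration of this decision rule, the sketch below flags every sample whose reconstruction error is above a given threshold. The function names and the example values are hypothetical, and how the threshold is chosen is left open here.

```python
import math


def reconstruction_error(x, x_hat):
    """L2 norm of the difference between a sample and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


def flag_anomalies(samples, reconstructions, threshold):
    """Return one boolean per sample: True if it is considered anomalous."""
    return [
        reconstruction_error(x, x_hat) > threshold
        for x, x_hat in zip(samples, reconstructions)
    ]


# First sample is reconstructed perfectly, second one has error 5.
flags = flag_anomalies(
    samples=[[0.0, 0.0], [3.0, 4.0]],
    reconstructions=[[0.0, 0.0], [0.0, 0.0]],
    threshold=1.0,
)
```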

THEORETICAL JUSTIFICATION

Formulation
In this section, we present theoretical justification for the reasoning behind selective freezing and the second-phase loss function. We would like to show that the process described in algorithm 1 implies that the reconstruction loss for a latent-insensitive autoencoder ($L_{\mathrm{LIS}}$) remains equivalent to the reconstruction loss of a standard autoencoder ($L_{\mathrm{AE}}$) for normal (positive) samples but is larger for anomalies. More formally, under certain assumptions on the negative dataset ($D^-$), $L_{\mathrm{LIS}}(x^+) = L_{\mathrm{AE}}(x^+)$ and $L_{\mathrm{LIS}}(x^a) \ge L_{\mathrm{AE}}(x^a)$, where $x^a$ is an anomalous sample. From the optimality of autoencoders [4], we know that, absent any non-linear activation functions, a linear autoencoder corresponds to singular value decomposition (SVD); henceforth, we use SVD interchangeably with linear autoencoders. Given an $n \times m$ data matrix $X^+$, we decompose $\mathbb{R}^n$ into $X_{\parallel} \oplus X_{\perp}$, where $X_{\parallel}$ is the column space of $X^+$ and $X_{\perp}$ is its orthogonal complement, the null space of $X^{+\top}$.
We further decompose $X^+$ using SVD:
$$X^+ = U \Sigma V^\top,$$
where U and V are orthonormal matrices and $\Sigma$ is a diagonal matrix such that $\Sigma = [\sigma_1 \ldots \sigma_r \mid 0]$. However, in practice it is rarely separated this neatly, especially when dealing with a large number of samples of a high-dimensional dataset; therefore, we resort to reduced SVD, where we take the first $r$ columns of U, with the caveat that the choice of $r$ is a hyperparameter.
We write $U = [U_r \mid U_{n-r}]$, and from the Eckart-Young low-rank approximation theorem, the columns of $U_r$ approximately span $X_{\parallel}$ and the columns of $U_{n-r}$ approximately span $X_{\perp}$. A linear autoencoder with an $r$-dimensional latent layer is equivalent to the following transform:
$$\hat{x} = U_r U_r^\top x,$$
where $U_r$ and $U_r^\top$ represent the decoder and the encoder respectively. Furthermore, any data point $x \in \mathbb{R}^n$ can be represented as $x = U_r z + U_{n-r} c$, where c and z are $(n - r)$- and $r$-dimensional real vectors. By orthonormality, we have the following identities: $U_r^\top U_r = I_r$ and $U_r^\top U_{n-r} = 0$, where $I_r$ is the $r \times r$ identity matrix. As a shorthand, we write $\|\cdot\|$ instead of $\|\cdot\|_2^2$. Using these two identities, we rewrite the reconstruction loss $\|\hat{x} - x\|$ as follows:
$$\|\hat{x} - x\| = \|U_r U_r^\top (U_r z + U_{n-r} c) - U_r z - U_{n-r} c\| = \|U_{n-r} c\| = \|c\|.$$
We note that the loss function is agnostic to the nature of c and is only concerned with its magnitude. The assumption for anomaly detection under this setting is that $\|c^a\| > \|c^+\|$, where $c^a$ and $c^+$ correspond to the orthogonal components of anomalies and positive data respectively. We posit that while this agnosticism is desirable for potential generality, it is not optimal for anomaly detection; hence, we modify the loss score to depend on the nature of c:
$$L_{\mathrm{LIS}}(x) = \|c\| + \|B c\|,$$
where B is an $r \times (n - r)$ matrix such that the loss is small for normal data but large for anomalies. In other words, we want $\|B c^+\| = 0$ and $\|B c^a\|$ to be large. We define $C_{\parallel}$ as an orthonormal basis for the column space of $C^+$ and $C_{\perp}$ as an orthonormal basis for its orthogonal complement, where $C^+$ is the matrix of all positive orthogonal components $c^+$. We decompose $C_{\perp}$ further into $C^-_{\perp}$ and $C^a_{\perp}$, where the columns of $C^-_{\perp}$ are the basis vectors of the column space of $C^-$ that are not in $C_{\parallel}$, and the columns of $C^a_{\perp}$ are the remaining columns of $C_{\perp}$.
Thus, any $c \in \mathbb{R}^{n-r}$ can be written as $c = C_{\parallel} p + C^-_{\perp} q + C^a_{\perp} s$, where p, q and s are real vectors.
Despite the fact that we do not have access to $c^a$, we can utilize other negative examples from a similar domain and use $c^-$ as a proxy for $c^a$. Since $c^a = C_{\parallel} p + C^-_{\perp} q + C^a_{\perp} s$, maximizing $\|B C^-_{\perp}\|$ implies maximizing $\|B c^a\|$, assuming that $\|C^-_{\perp} q\| \neq 0$. The latter assumption hinges on the fact that $X^-$ is from a similar domain. Therefore, we end up with the goal of finding B such that $\|B c^+\| = 0$ and $\|B c^-\|$ is large.
$$B^* = \arg\min_B \; \|B c^+\| - \gamma \|B c^-\|,$$
where $\gamma$ controls the importance of the second term. Multiplying by $U_r$ gives
$$B^* = \arg\min_B \; \|U_r B c^+\| - \gamma \|U_r B c^-\|,$$
since orthonormal transformations preserve the dot product. Adding the terms $\|c^+\|$ and $-\gamma \|c^-\|$, which do not depend on B, and writing $E := U_r + U_{n-r} B^\top$ so that $\|U_r E^\top x - x\| = \|c\| + \|U_r B c\|$, we obtain
$$E^* = \arg\min_E \; \|U_r E^\top x^+ - x^+\| - \gamma \|U_r E^\top x^- - x^-\|,$$
since adding a constant to argmin does not affect the objective.
In practice, we cannot maximize $\|U_r E^\top x^- - x^-\|$ indefinitely, and we are satisfied if it reaches a certain large $\beta$:
$$\min_E \; \|U_r E^\top x^+ - x^+\| + \gamma \max\big(0,\ \beta - \|U_r E^\top x^- - x^-\|\big).$$
We notice that in order for this to work, the decoder $U_r$ has to be known and remain fixed (frozen). This suggests a two-phase training where we first compute the decoder and encoder networks, and in the second phase the decoder is fixed while the encoder E is modified using the new loss. In fig. 4, a linear version of LIS-AE is trained on digit-8 from MNIST with Omniglot as a negative dataset. We perform orthogonal decomposition on each input by projecting it onto the digit-8 subspace to get its projection and orthogonal vectors. We then feed each vector separately to a regular linear AE and a linear LIS-AE. We observe that the regular autoencoder outputs zero images for the orthogonal part of each sample regardless of the class it belongs to. LIS-AE, however, behaves differently for the normal class than for anomalous classes.
We also notice that orthogonal projections do not form a semantically meaningful representation in pixel space. In order to gain a better representation we use a deep AE. For this non-linear case, we treat the middle part of the network as an inner linear autoencoder which operates on a more semantically meaningful transformed version of the data. This suggests a stacked autoencoder architecture where another loss term for the inner autoencoder is added in the first phase to constrain the output of the layer after the latent layer. However, in our experiments we observed that adding these loss terms was not necessary and that a loss similar to the linear case produced similar results, since we are only considering reconstruction scores of the outer model. Therefore, we keep the entire network frozen except for the latent layer while directly minimizing the loss in eq. 4 as before.

Intuition
For concreteness, we consider the following simple, supervised case. Given a dataset $X^+$ such that each $x^+ \in X^+$ has the form $x^+ = (x^+, y^+, 0)^\top$, we notice that most of the variance in the data is along the x-axis. Training a linear autoencoder with latent dimension $r = 1$ results in $D = (1\ 0\ 0)^\top$ and $E = (1\ 0\ 0)$, where D and E are the decoder and encoder networks respectively.
Given input $x = (x, y, z)^\top$, the loss score is $L = \|x - \hat{x}\|^2 = y^2 + z^2$. Training a LIS-AE on negative samples that have only nonzero values along the z-axis, we end up with the same D and a modified $\hat{E} = (1\ 0\ k)$, where $k$ is a large number, and then $\hat{x} = (x + kz,\ 0,\ 0)^\top$. For an anomaly $x^a = (x^a, y^a, z^a)^\top$ where $y^a, z^a \neq 0$, the new loss scores for $x^+$ and $x^a$ are:
$$L(x^+) = (y^+)^2, \qquad L(x^a) = (y^a)^2 + (1 + k^2)(z^a)^2.$$
In the case of regular Linear-AE (PCA), given $x^+ = (x^+, y^+, 0)$, each point $(x, y, z)$ on the cylinder $y^2 + z^2 = (y^+)^2$ receives the same loss score as $x^+$ and is therefore indistinguishable from it. In the case of LIS-AE, the corresponding set is the elliptic cylinder $y^2 + (1 + k^2) z^2 = (y^+)^2$, and since $k$ is a large number, the cross-section of the cylinder is squashed in the z dimension, resulting in a heavily penalized loss in the z dimension but a regular loss in the y dimension. In this case, the two samples become indistinguishable only for very small values of $z$. We note that the new $\hat{E}$ is merely a rotated and stretched version of the old E in the xz-plane. Thus, we can think of linear LIS-AE as a regular PCA with its eigenvectors (columns of $U_r$) stretched and tilted in the directions of the orthogonal complement of the eigenspace. This is done in a way that keeps the column space of normal examples invariant under the new transformation $U_r E^\top$. By itself, this formulation remains ill-posed, since there are infinitely many solutions that do not necessarily help with anomaly detection. More formally, given $E := U_r + U_{n-r} B^\top$, we can choose any matrix B such that $\|B c^+\| = 0$. However, this does not guarantee any advantage for anomaly detection on similar data. Even worse, in practice this modification process might result in slightly worse performance if done arbitrarily, since the model usually has to sacrifice some extreme samples from the normal data to balance the two losses. Thus, the negative dataset is used to properly determine the directions of the tilt, and the hyperparameters ($\gamma$ and $\beta$) determine the importance and amount of stretching (or shrinking) while changing the normal case as little as possible. For deep LIS-AE, the same analogy holds, albeit in a latent space.
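The three-dimensional example can be checked numerically with the sketch below. The stretch factor `K = 10` and the two sample points are assumed values picked for illustration; the encoder and decoder are the 1-dimensional linear maps from the example, with the modified encoder taken to have a large entry along the z-axis.

```python
def reconstruct(x, encoder, decoder):
    """Linear autoencoder with a 1-D latent: z = encoder . x, x_hat = decoder * z."""
    z = sum(e * xi for e, xi in zip(encoder, x))
    return [d * z for d in decoder]


def loss(x, encoder, decoder):
    """Squared reconstruction error of x."""
    x_hat = reconstruct(x, encoder, decoder)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


D = [1.0, 0.0, 0.0]      # decoder, shared by both models
E = [1.0, 0.0, 0.0]      # regular linear AE (PCA) encoder
K = 10.0                 # hypothetical "large number"
E_HAT = [1.0, 0.0, K]    # encoder after latent shaping

x_pos = [2.0, 0.5, 0.0]   # normal sample: no z component
x_anom = [2.0, 0.3, 0.4]  # anomaly on the same cylinder y^2 + z^2 = 0.25

pca_pos, pca_anom = loss(x_pos, E, D), loss(x_anom, E, D)
lis_pos, lis_anom = loss(x_pos, E_HAT, D), loss(x_anom, E_HAT, D)
```

Under the regular encoder both samples score about 0.25 and cannot be told apart. Under the modified encoder the normal sample keeps its score, while the anomaly's score grows to about 16.25, that is, y² + (1 + K²) z².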
Deep architectures are not only useful for learning good representation, but can learn a non-linear transformation with useful properties for our objective such as linear separability of negative and positive samples. By adding a standard binary cross-entropy loss before the non-linear activation of the latent layer during the

Anomaly Detection
In this section, we test LIS-AE for anomaly detection on image data in unsupervised settings. Given a standard classification dataset, we group a set of classes together into a new dataset and consider it the "normal" dataset. The remaining classes that are in neither the normal nor the negative dataset are considered anomalies. During training, our model is presented only with the normal dataset and the additional negative dataset. We evaluate the performance on test data comprising both the "normal" and "anomalous" groups. For MNIST and Fashion-MNIST, the encoder network consists of two convolutional layers with LeakyReLU non-linearities followed by a fully-connected bottleneck layer with a tanh activation function. The decoder network consists of a fully-connected layer followed by a LeakyReLU, two deconvolution layers with LeakyReLU activation functions, and a final convolution layer with a sigmoid at the output. For SVHN and CIFAR-10 we use larger latent layers and higher-capacity networks of the same depth. It is worth noting that, compared to other hyperparameters, the choice of latent layer size has the largest effect on performance for all models. We report the best-performing latent dimension for all models.
In table (1), we compare LIS-AE with several autoencoder-based anomaly detection models as baselines, all of which share the exact same architecture. It is worth noting that the most direct comparison is between LIS-AE and AE, since not only do they have the same architecture, they also have the exact same encoder and decoder weights, and their performance is simply measured before and after the latent-shaping phase. We use a different variant of RepNN with a sigmoid activation function $\sigma(x) = 1/(1 + \exp(-x))$ placed before the tanh staircase function approximation described in section 2. This is mainly used because "squashing" the input between 0 and 1 before passing it to the staircase function gives a more robust and easier-to-train network. We only report the best results for Sig-RepNN, obtained with 4 activation levels. For anomaly GAN (AnoGAN) [26], we follow the implementation described in [25]. We train a W-GAN [10] with gradient penalty and report performance for two anomaly scores, namely, the encoder-generator reconstruction loss and an additional feature-matching distance score in the discriminator feature space (AnoGAN-FM). For AnoGAN, Linear-AE, AE, VAE, RepNN, MemAE, and LIS-AE, we use the reconstruction error $L(x)$ such that if $L(x)$ exceeds a threshold the input is considered an anomaly. By varying the threshold, we are able to compute the area under the curve (AUC) as a measure of performance. Similarly, for OC-SVDD (equivalently OC-SVM with an RBF kernel) and OC-DSVDD, we vary the inverse length scale and use the predicted class label. For Kernel density estimation (KDE) [20], we vary the threshold over the log-likelihood scores. For Isolation Forest (IF) [18], we vary the threshold over the anomaly score calculated by the Isolation Forest algorithm.
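Sweeping the threshold over all scores traces an ROC curve, and the area under it equals the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal sample, with ties counted as one half. The sketch below computes the AUC through this rank-based equivalence; the function name and the example scores are illustrative assumptions, not values from the experiments.

```python
def roc_auc(normal_scores, anomaly_scores):
    """AUC for anomaly scores where higher means more anomalous.

    Computed as the fraction of (anomaly, normal) pairs in which the
    anomaly has the higher score; ties contribute one half.
    """
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomaly_scores) * len(normal_scores))


# Hypothetical reconstruction errors for normal and anomalous test samples.
auc = roc_auc(normal_scores=[0.10, 0.20, 0.30], anomaly_scores=[0.25, 0.40, 0.50])
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; a sort-based computation would be preferable for large test sets.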
The datasets tested in table (1) are MNIST and Fashion-MNIST. To train LIS-AE on MNIST we use Omniglot [15] as our negative dataset, since it shares similar compositional characteristics with MNIST. Since Omniglot is a relatively small dataset, we diversify the negative examples with various augmentation techniques, namely, Gaussian blurring, random cropping, and horizontal and vertical flipping. We test two settings for MNIST. The first is a 1-class setting where the normal dataset is one particular class and the rest of the dataset classes are considered anomalies. The process is repeated for all classes and the average AUC over the 10 classes is reported. The second is a 2-class setting where the normal dataset consists of two classes and the remaining classes are considered anomalies. For example, the first task contains digits 0 and 1 and the remaining digits are considered anomalies, the second task contains digits 2 and 3, and so forth. This setting is more challenging since there is more than one class present in the normal dataset. For Fashion-MNIST, the choice of negative examples is different. We use the next class as the negative dataset and we do not include it with the anomalies (i.e. the remaining classes) at test time.
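The task splits described above can be written down explicitly. The sketch below assumes classes are indexed 0 to 9; the function names are hypothetical, and wrapping the "next class" of the last class around to class 0 is an assumption, since the text does not say how that case is handled.

```python
def one_class_tasks(num_classes=10):
    """1-class setting: one normal class, all other classes are anomalies."""
    return [
        {"normal": [c], "anomalies": [k for k in range(num_classes) if k != c]}
        for c in range(num_classes)
    ]


def two_class_tasks(num_classes=10):
    """2-class setting: consecutive pairs (0, 1), (2, 3), ... are normal."""
    tasks = []
    for first in range(0, num_classes, 2):
        normal = [first, first + 1]
        anomalies = [k for k in range(num_classes) if k not in normal]
        tasks.append({"normal": normal, "anomalies": anomalies})
    return tasks


def next_class_negative_tasks(num_classes=10):
    """Next class as negative dataset, excluded from the test anomalies."""
    tasks = []
    for c in range(num_classes):
        negative = (c + 1) % num_classes  # assumption: wrap around after the last class
        anomalies = [k for k in range(num_classes) if k not in (c, negative)]
        tasks.append({"normal": [c], "negative": [negative], "anomalies": anomalies})
    return tasks
```

Keeping the negative class out of the anomaly list mirrors the requirement that the model is never tested on the negative dataset.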
We note that LIS-AE achieves superior performance to all compared approaches. However, we also notice that these settings are comparatively easy and all tested models performed adequately, including classical non-deep approaches. In (4), we see that the standard AE is prone to generalizing well to other classes, which is not a desired property for anomaly detection. In contrast, LIS-AE only reconstructs normal data faithfully, which translates to the large performance gap we see in figure (3). We also notice that despite CIFAR-10 being more complex than SVHN, most reconstruction-based models perform better on CIFAR-10 than on SVHN. This is due to the fact that the difference between SVHN classes in terms of reconstruction is not as large, since they share similar compositional features and often appear within samples of other classes, while CIFAR-10 classes vary significantly (e.g. digit-2 and digit-3 on a wall vs. truck and bird).

Ablation
In this section, we investigate the effect of the nature of the negative dataset and the linear separability of positive and negative examples. In table (3) we train LIS-AE on different negative and positive datasets. Similar to table (2), we split each positive dataset into two datasets and follow the same settings as before, with the exception of the "None" and "Supervised" cases. The "None" case indicates that no negative dataset is used. When the same dataset is used for both positive and negative datasets, the positive data starts with class 0 and the negative dataset consists of classes 5 to 9, where the outliers are classes 1 to 4. This process is repeated for all 10 classes present in each dataset and the average AUC is reported. Overall, using a negative dataset resulted in a significant increase in performance in every case except for two important cases, namely, when Fashion-MNIST and CIFAR-10 were used as negative datasets for MNIST and SVHN respectively. This could be explained by the fact that the model was not capable of reconstructing Fashion-MNIST and CIFAR-10 classes in the first place. Moreover, shaping the latent layer in a way that maximizes the loss for Fashion-MNIST and CIFAR-10 classes does not guarantee any advantage for anomaly detection of similar digit classes present in MNIST and SVHN. This, coupled with the fact that the process in practice forces the model to ignore some samples from the normal dataset to balance the two losses, results in the performance degradation we observe in these two cases.
Table (4) is an excerpt of the complete table in the appendices, where we examine the effect of each class present in the negative dataset on anomaly detection performance for other test classes from the CIFAR-10 dataset. We split CIFAR-10 into two separate datasets: the first split is used for selecting classes as negative datasets and the other split is used as outliers. For each class in CIFAR-10 we train eight models in different settings. The first setting is None, where we train a standard autoencoder with no negative examples as the base model. The remaining seven settings differ in the second phase: we select one class as our negative dataset and test the model performance on each individual class from the outlier dataset. The combined setting is similar to the setting described in section 5.1, where we combine all negative classes into one 5-class negative dataset. Note that these classes are not the same as the classes in the outlier test dataset except for the final setting, which is an upper-bound supervised setting where the negative dataset comprises the classes that are in the outlier dataset except for the positive class. This process is then repeated for all 10 classes in CIFAR-10. Overall, we observe a significant performance increase over the base model, with the general trend that negative classes significantly increase anomaly detection performance for similar outliers. For example, the dog class drastically improves performance on the cat class but not so much on the plane class. However, we also notice important exceptions. When the horse class is used as the negative dataset for the car class, we notice a significant performance increase for the relatively similar deer class, as expected. However, when the horse class is used as the negative dataset for the deer class itself, the performance does not improve as in the first case and even degrades for the car class.
Other notable examples of this observation can be found in the appendices where, for instance, the dog class improves performance on cat outliers but causes noticeable degradation when used as the negative dataset for the cat class itself. The gained performance in the first case is due to the fact that these classes share similar compositional features and backgrounds. In the second case, however, the same property makes it difficult to balance the minimization and maximization losses during the latent-shaping phase. For example, car and truck images are so similar in this scenario that minimizing and maximizing the loss at the same time becomes contradictory. As posited in section 4.2, we mitigate this issue by adding a binary cross-entropy loss while training in the first phase to ensure that the input of the latent layer is linearly separable for positive and negative examples. Notice that unlike other approaches [13,22], this does not require a labeled positive or negative dataset and relies only on the fact that we have two distinct datasets. This linear separability makes the second phase of training relatively easier and less contradictory. In table (5), we see that LinSep-LIS-AE mitigates this issue for the

CONCLUSION
In this paper we introduced a novel autoencoder-based model called the Latent-Insensitive autoencoder (LIS-AE). With the help of negative samples drawn from a domain similar to that of the normal data, we tune the weights of the bottleneck part of a standard autoencoder such that the resulting model is able to reconstruct the target task while penalizing anomalous samples. We also presented theoretical justification for the reasoning behind our two-phase training process and the latent-shaping loss function, along with a more powerful variant. Multiple ablation studies were conducted to explain the effect of negative classes and highlight other important aspects of our model. We tested our model in a variety of anomaly detection settings with multiple datasets of varying degrees of complexity. Experimental results showed significant performance improvement over the compared methods. Future research will focus on possible ways of synthesizing negative examples for domains with limited data. We also hope to further study and employ various manifold learning approaches for latent space representation.

ACKNOWLEDGEMENT
Artem Lenskiy was funded by Our Health in Our Hands (OHIOH), a strategic initiative of the Australian National University, which aims to transform healthcare by developing new personalised health technologies and solutions in collaboration with patients, clinicians, and health care providers.

A EFFECT OF INDIVIDUAL CLASSES AS NEGATIVE EXAMPLES
As discussed in section 5.2, we examine the effect of each class present in the negative dataset on anomaly detection performance for other test classes from the CIFAR-10 dataset. The first table shows results for standard LIS-AE while the second table shows
[Figure: Fashion-MNIST classes as positive datasets. Left: outliers as negative dataset (Supervised). Right: Omniglot as negative dataset.]