Extended Autoencoder for Novelty Detection with Reconstruction along Projection Pathway

Featured Application: Potential applications of novelty detection include detecting the causes of process defects in the manufacturing industry, identifying unique opinions in natural language processing, and detecting credit card theft for credit card companies.

Abstract: Recently, novelty detection with reconstruction along projection pathway (RaPP) has made progress toward leveraging hidden activation values. RaPP compares the input and its autoencoder reconstruction in hidden spaces to detect novelty samples. Nevertheless, traditional autoencoders have not yet begun to fully exploit this method. In this paper, we propose a new model, the Extended Autoencoder Model, that adds an adversarial component to the autoencoder to take full advantage of RaPP. The adversarial component matches the latent variables of the reconstructed input to the latent variables of the original input to detect novelty samples with high hidden reconstruction errors. The proposed model can be combined with variants of the autoencoder, such as a variational autoencoder or adversarial autoencoder. The effectiveness of the proposed model was evaluated across various novelty detection datasets. Our results demonstrated that extended autoencoders are capable of outperforming conventional autoencoders in detecting novelties using the RaPP method.


Introduction
Novelty detection is the task of recognizing data points that are different from normal data [1]. Novelty detection applies to diverse domains, including intrusion detection, fraud detection, malware detection, medical anomaly detection, industrial anomaly detection, video surveillance, and other numerous fields [2]. More than a decade ago, deep autoencoders were used successfully to detect novelty samples based on reconstruction error [3]. Reconstruction error is the distance between the original input and its autoencoder reconstruction. Autoencoders compress the input into a lower-dimensional projection and then reconstruct the output from this representation. Reconstruction-based methods assume that novelties cannot be effectively reconstructed from low-dimensional projections [4]. Thus, the data samples with high reconstruction error can be detected as novelties. However, reconstruction-based methods have a limitation in that they do not leverage the information available along the projection pathway of deep autoencoders. To extract information in hidden spaces, we can directly compare the outputs of hidden layers of the encoder and the corresponding outputs of hidden layers in the decoder using a symmetric autoencoder. The outputs of the hidden layers in the encoder are referred to as hidden activations, and the outputs of hidden layers in the decoder are hidden reconstructions. However, this direct way of computing hidden reconstruction error provides no meaningful information, as the autoencoder does not impose any correspondence between members of encoding-decoding pairs during training.
Recently, reconstruction along projection pathway (RaPP) was introduced as a way of detecting novelty samples using the information in hidden spaces [5], providing an indirect way of computing hidden reconstruction error. RaPP carries out novelty detection by comparing the hidden activations of the input with those of the reconstructed input. In other words, RaPP replaces the hidden reconstructions of the original input with the hidden activations of the reconstructed input. This replacement is valid because the hidden activations of the reconstructed input are equivalent to the corresponding hidden reconstructions of the original input, under the strict condition that the autoencoder can perfectly reconstruct the input. Without any further training or changes to the autoencoder, RaPP outperforms ordinary autoencoder-based reconstruction methods. However, the RaPP method can be further improved.
In reconstruction-based novelty detection methods, autoencoders are trained to minimize reconstruction errors with normal samples. To take full advantage of the RaPP method, we propose that autoencoders be trained to minimize hidden reconstruction errors by matching the hidden activations of the reconstructed input to the hidden activations of the original input during training. To this end, we borrow the concept of the "adversarial autoencoder" (AAE) [6], in which adversarial training [7] is used to match the encoding distribution with an arbitrary prior distribution. Figure 1 depicts the AAE architecture. The discriminator tries to distinguish whether a sample arises from the latent variables or from a prior distribution. At the same time, the encoder is updated to fool the discriminator. We use adversarial training to match the latent variables of the reconstructed input to the latent variables of the original input. A discriminator is then added on top of the autoencoder to create an extended autoencoder. Whereas the ordinary autoencoder only focuses on the input space to minimize the reconstruction error, our extended autoencoder investigates both the input space and the hidden space simultaneously. A visualization [8] of the latent-variable values of the standard autoencoder and the extended autoencoder during training with normal samples shows the difference: in the standard autoencoder, the latent-variable values of the original input do not match well with those of the reconstructed input. By contrast, the two types of latent-variable values tend to overlap in the extended autoencoder, because the extended autoencoder lowers hidden reconstruction errors. Relative to normal samples, novelty samples can be expected to have larger hidden reconstruction errors as well as larger reconstruction errors, as the autoencoders learn only normal data.
We also show that because our method drives the decoder to be the inverse of the encoder, we can compute hidden reconstruction errors without strict assumptions about the autoencoder. Our proposed extended autoencoder model outperformed standard autoencoders on various popular datasets from Kaggle and the University of California at Irvine (UCI) repository. In addition, we demonstrate the efficiency of the extended autoencoders on the MNIST (Modified National Institute of Standards and Technology) and Fashion-MNIST (F-MNIST) datasets, where the extended autoencoders provide similar performance with 50 times fewer epochs than the standard autoencoders.


Related Work
With the advent of deep learning, deep autoencoder-based novelty detection algorithms have been widely researched. Under normal operation, a deep autoencoder uses the reconstruction error to measure the novelty score; that is, the autoencoder is trained to minimize the reconstruction error given the data samples. In doing so, the autoencoder learns a mapping function that reconstructs the normal data with a small reconstruction error. However, the ability to accurately reconstruct the novelty samples is limited, as the novelty samples do not resemble the data used for training. Therefore, novelty samples are detected according to a high reconstruction error. If the novelties seem to come from a specific distribution, generative models such as variational autoencoders (VAE) [9] are suitable for detecting novelties. Because VAEs encode the input as a distribution over the latent space instead of as a single point, a VAE can outperform an ordinary autoencoder in novelty detection based on reconstruction error [10].
Adversarial autoencoders (AAEs) can also be used for novelty detection. The advantage of AAE over VAE is that adversarial training drives the encoding distribution to match an arbitrary prior distribution. Due to this advantage, there are many applications of AAE for novelty detection. For example, to improve the interpretability of detected anomalies, a mixture of Gaussians is chosen as the prior distribution [11]; the Gaussian mixture partitions the latent space into semantic regions, providing a holistic view of the latent space. One-class novelty detection using generative adversarial networks (OCGAN) [12] uses adversarial training to ensure that the latent space resembles a uniform distribution; in this bounded latent space, OCGAN finds informative negative samples to improve novelty detection. Additionally, generative probabilistic novelty detection (GPND) [13] improves novelty detection by making the computation of the novelty probability feasible. GPND introduces adversarial training to match the output of the decoder with the real data distribution. This adversarial training reduces blurriness and adds additional detail to the generated images. From this perspective, adversarial training can be introduced to conventional autoencoders, making them more suitable for RaPP applications.

Preliminaries of RaPP
RaPP, a method for detecting novelty samples by exploiting hidden activation values, compares the input data and the reconstructed data, similar to ordinary reconstruction-based methods; however, RaPP also extends these comparisons to hidden spaces. An intuitive application of this idea would be to compare the activation in hidden space and the corresponding reconstruction in that hidden space; however, this direct comparison has no effect due to the lack of an explicit correspondence between members of encoding-decoding pairs of hidden layers during training. However, RaPP can compute the hidden reconstructions indirectly, using the hidden activations of the reconstructed input. This indirect computation is depicted in Figure 3.
Hidden activations of the reconstructed input are shown to be equivalent to the corresponding hidden reconstructions of the original input.
For this computation of hidden reconstruction, RaPP assumes the existence of abstract decoders, which are inverse functions of the corresponding encoders. An autoencoder A = f ∘ g is a pretrained neural network, where g and f represent the encoder and the decoder, l is the number of hidden layers of each, g = g_l ∘ ··· ∘ g_2 ∘ g_1, and f = f_1 ∘ f_2 ∘ ··· ∘ f_l. The autoencoder is assumed to completely reconstruct the input x, i.e., x = A(x). Partial compositions of g are written g_{:i} = g_i ∘ ··· ∘ g_1, so that g = g_{:l}. The i-th hidden activation of the reconstructed input is defined by

ĥ_i(x) = (g_{:i} ∘ A)(x).

Let l be the number of hidden layers, and let us assume that there exists an abstract decoder f̂ = f̂_1 ∘ ··· ∘ f̂_l such that (i) x = (f̂ ∘ g)(x) and (ii) f̂_i = g_i^{-1} for every layer i. The first condition indicates that the abstract decoder f̂ can, like the decoder f, perfectly reconstruct the input x. The second condition states that f̂_i is the inverse function of g_i. Then, writing f̂_{i+1:} = f̂_{i+1} ∘ ··· ∘ f̂_l, the i-th hidden reconstruction is defined by

h̃_i(x) = (f̂_{i+1:} ∘ g)(x).

The i-th hidden reconstruction h̃_i(x) is equal to the i-th hidden activation of the reconstructed input ĥ_i(x). Using x = A(x) = (f̂ ∘ g)(x) and f̂_1 ∘ ··· ∘ f̂_i = g_{:i}^{-1}:

ĥ_i(x) = (g_{:i} ∘ A)(x) = (g_{:i} ∘ f̂ ∘ g)(x) = (g_{:i} ∘ g_{:i}^{-1} ∘ f̂_{i+1:} ∘ g)(x) = (f̂_{i+1:} ∘ g)(x) = h̃_i(x).   (1)

This equation implies that there is no need to implement f̂ to compute the hidden reconstruction. Therefore, the hidden reconstruction computation is possible with only g and f, which are already accessible.
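This equivalence can be checked numerically in an idealized setting. The sketch below is an illustrative assumption, not the paper's model: a linear "autoencoder" whose encoder layers are orthogonal matrices, so the abstract decoder layers f̂_i = g_i^{-1} = g_iᵀ exist exactly and the autoencoder reconstructs perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy invertible "autoencoder": each encoder layer g_i is an orthogonal
# matrix, so f̂_i = g_i^{-1} = g_i^T exists exactly. Real autoencoders are
# nonlinear and only approximately invertible on the data manifold.
def random_orthogonal(n, rng):
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

dim, n_layers = 8, 3
G = [random_orthogonal(dim, rng) for _ in range(n_layers)]   # g_1 .. g_l
F_hat = [g.T for g in G]                                     # f̂_i = g_i^{-1}

def g_partial(x, i):
    """g_{:i} = g_i ∘ ... ∘ g_1 (first i encoder layers)."""
    for g in G[:i]:
        x = g @ x
    return x

def f_hat_from(z, i):
    """f̂_{i+1:} = f̂_{i+1} ∘ ... ∘ f̂_l (decode the latent down to level i)."""
    for fh in reversed(F_hat[i:]):
        z = fh @ z
    return z

x = rng.normal(size=dim)
z = g_partial(x, n_layers)      # latent g(x)
x_rec = f_hat_from(z, 0)        # A(x) = f̂(g(x)) = x in this idealized case

for i in range(1, n_layers):
    h_hat = g_partial(x_rec, i)         # hidden activation of the reconstruction
    h_tilde = f_hat_from(z, i)          # i-th hidden reconstruction
    assert np.allclose(h_hat, h_tilde)  # the equivalence above holds layer-wise
```

The orthogonality assumption is exactly what fails for a trained nonlinear autoencoder; the equivalence then holds only approximately, which motivates the extended model proposed below.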
Three steps are involved in detecting novelty samples using the RaPP method. The first step is to train the autoencoder only with normal samples. After the autoencoder is trained, the next step is to project the input and its autoencoder reconstruction onto the hidden spaces to obtain pairs of activation values. Finally, the novelty score is calculated by aggregating the differences between these paired hidden activations. Before aggregation, the differences must be normalized to alleviate the dependency between the hidden layers. To normalize them, RaPP uses the normalized aggregation along pathway (NAP) metric, which is based on the Mahalanobis distance. Experiments have shown that RaPP outperforms reconstruction-based novelty detection for various datasets.
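The three steps can be sketched as follows, with a stand-in (untrained, linear) autoencoder and an SVD-based Mahalanobis-style whitening standing in for the exact NAP computation; the layer sizes and the tanh activation are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in symmetric autoencoder: in practice these weights come from
# step 1 (training on normal samples only).
sizes = [16, 8, 4]
g_W = [rng.normal(scale=0.3, size=(sizes[i + 1], sizes[i])) for i in range(2)]
f_W = [rng.normal(scale=0.3, size=(sizes[i], sizes[i + 1])) for i in range(2)]

def encode_path(x):
    """Step 2: hidden activations [h_1, ..., h_l] along the encoder."""
    hs = []
    for W in g_W:
        x = np.tanh(W @ x)
        hs.append(x)
    return hs

def reconstruct(x):
    z = encode_path(x)[-1]
    for W in reversed(f_W):
        z = np.tanh(W @ z)
    return z

def pathway_diffs(X):
    """Concatenated differences h_i(x) - h_i(A(x)) for each sample."""
    rows = []
    for x in X:
        d = [a - b for a, b in zip(encode_path(x), encode_path(reconstruct(x)))]
        rows.append(np.concatenate(d))
    return np.array(rows)

def nap_scores(X_train, X_test):
    """Step 3: Mahalanobis-style normalization of the differences via SVD."""
    D = pathway_diffs(X_train)
    mu = D.mean(axis=0)
    _, s, Vt = np.linalg.svd(D - mu, full_matrices=False)
    D_test = (pathway_diffs(X_test) - mu) @ Vt.T / (s + 1e-8)
    return np.linalg.norm(D_test, axis=1)   # one novelty score per sample

X_train = rng.normal(size=(100, 16))
scores = nap_scores(X_train, rng.normal(size=(5, 16)))
assert scores.shape == (5,) and np.all(scores >= 0)
```

The whitening here is correct up to a constant factor that does not affect the ranking of scores, which is all AUROC-based evaluation needs.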

The Proposed Model: Extended Autoencoders
Just as reconstruction-based methods minimize reconstruction errors, the RaPP method can be further improved by minimizing hidden reconstruction errors with a mean squared error loss. However, when an autoencoder architecture is deep, this requires at least as many loss functions as there are layers. It is not advisable to constrain all deep hidden activations at once, as the original purpose of the autoencoder would be lost. Thus, we focus only on the last hidden activations, the so-called latent variables. Latent variables are usually treated as normally distributed continuous variables [14]. To reflect a distance measure between probability distributions, we introduce an adversarial training loss to the autoencoder.
The core idea of our proposal comes mainly from AAE. In AAE, the discriminator treats samples from a prior distribution as "true" and samples from the latent variables as "false." While the encoder is trained to fool the discriminator, the latent variables become similar to the prior distribution. We apply this concept to our extended autoencoder. Specifically, a discriminator is added on top of the autoencoder to distinguish latent variables of the input from the latent variables of the reconstructed input. Concurrently, the decoder is updated to generate the reconstructed input, which can fool the discriminator. Therefore, the decoder can reconstruct the input to close the distance between the latent variables of the input and its reconstruction. Figure 4 depicts the extended autoencoder architecture. Algorithm 1 summarizes the training methodology of the extended autoencoder. In the sections below, we first describe the architecture and the training procedure of the proposed model in detail. We also applied our extended autoencoder model to autoencoder variants, such as VAE and AAE. Second, we explain how the extended autoencoder can improve novelty detection with RaPP. Computation of the hidden reconstruction is tractable by assuming the existence of an abstract decoder f̂. On the premise that the autoencoder can copy the input x perfectly, there is no need to implement the abstract decoder. However, in practice, the abstract decoder should be implemented because the assumption of a perfect autoencoder no longer holds.

Architecture of Extended Autoencoder
In this section, we describe the architecture of the extended autoencoder and the training procedure. The extended autoencoder consists of an autoencoder and a discriminator. Reconstruction loss and adversarial training loss are described in Equations (3) and (4).
The first loss function is the mean squared error, which corresponds to the distance between the input and its reconstruction:

L_mse = E_x ||x − f(g(x))||².   (3)

We jointly optimize the encoder g and the decoder f to minimize the L_mse loss function. The second loss function is the adversarial training loss that matches the latent variables of the reconstructed input to the latent variables of the original input. The discriminator D_l tries to distinguish whether a sample arises from the latent variables of the input, g(x), or from the latent variables of the reconstructed input, g(f(g(x))):

L_adv = E_x[log D_l(g(x))] + E_x[log(1 − D_l(g(f(g(x)))))].   (4)

The decoder is trained to fool the discriminator D_l, and the encoder and the decoder can be obtained by optimizing the following full objective:

min_{g,f} max_{D_l} L_mse + L_adv.

The extended autoencoder can be combined with all variants of the autoencoder; the associated loss functions depend on the variant employed. For example, the extended autoencoder can be combined with the VAE to create an extended VAE. Figure 5 depicts the architecture of the extended VAE, which has three loss functions: L_mse, L_KL, and L_adv. In this case, the second loss function is the Kullback–Leibler divergence loss, which is used to confer continuity in the latent space; L_KL has a closed form in the special case of a Gaussian latent. The VAE encoder consists of two neural networks, g_µ and g_σ: the output of g_µ is the mean, and the output of g_σ is the log-variance. Using the reparameterization trick,

g(x) = g_µ(x) + e^{(1/2) g_σ(x)} · z, z ∼ N(0, 1),

the encoder output g(x) can model the standard Gaussian distribution while remaining compatible with backpropagation. The remaining parts are the same as in the basic extended autoencoder. The final objective function of the extended VAE is:

min_{g,f} max_{D_l} L_mse + L_KL + L_adv.

The extended autoencoder can also be combined with the AAE. Figure 6 depicts the architecture of the extended AAE, which has three loss functions: L_mse, L_adv,z, and L_adv,l. Basically, the discriminator D_z tries to distinguish whether a sample arises from the latent variables or from a prior distribution; at the same time, the encoder is updated to fool the discriminator D_z. In a similar way, the discriminator D_l tries to distinguish whether a sample arises from the latent variables of the input or from the latent variables of the reconstructed input, and the decoder is trained to fool the discriminator D_l. Therefore, the encoder and the decoder can be obtained by optimizing the following full objective:

min_{g,f} max_{D_z, D_l} L_mse + L_adv,z + L_adv,l.

Figure 6. Architecture of the extended AAE, which is based on the AAE architecture, with the addition of discriminator D_l. The dotted encoder is not updated during the training.
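A minimal numerical sketch of the loss computations described above, using stand-in linear maps for g and f and a logistic discriminator D_l; the shapes and the weighting factor lam are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Stand-in parameters (a trained model would supply these).
W_g = rng.normal(scale=0.2, size=(4, 16))   # encoder g
W_f = rng.normal(scale=0.2, size=(16, 4))   # decoder f
w_d = rng.normal(scale=0.2, size=4)         # discriminator D_l on the latent space

x = rng.normal(size=(32, 16))               # a batch of normal samples
z_real = x @ W_g.T                          # g(x)
x_rec = z_real @ W_f.T                      # f(g(x))
z_fake = x_rec @ W_g.T                      # g(f(g(x)))

L_mse = np.mean((x - x_rec) ** 2)           # reconstruction loss (Eq. 3)

# Adversarial loss (Eq. 4): D_l labels g(x) "true" and g(A(x)) "false";
# the decoder is updated so that D_l accepts g(A(x)).
p_real, p_fake = sigmoid(z_real @ w_d), sigmoid(z_fake @ w_d)
L_disc = -np.mean(np.log(p_real + 1e-8) + np.log(1 - p_fake + 1e-8))
L_gen = -np.mean(np.log(p_fake + 1e-8))    # decoder's side of the game

lam = 0.1                                   # assumed weighting of the two terms
L_total = L_mse + lam * L_gen               # objective for the encoder/decoder step
assert np.isfinite(L_total) and L_disc > 0
```

In an actual training loop (Algorithm 1), the discriminator step minimizes L_disc and the autoencoder step minimizes L_total, alternating per batch.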


Implementation of the Approximate f̂
In this section, we discuss why the abstract decoder f̂ should be implemented to improve the RaPP method, and we show how adversarial training can force f to be an approximate f̂. To use the RaPP method, the hidden reconstruction of the input must be equal to the hidden activation of the reconstructed input. For this computation, a necessary underlying assumption is that the autoencoder is able to perfectly copy the input x. Practically, however, this assumption is not acceptable. The autoencoder has constraints on its network; thus, it is not capable of perfectly reconstructing the input. The aim of the autoencoder is not to be the identity function but to learn the compressed representation of the input. Because f(g(x)) and f̂(g(x)) each only approximately copy the input x in their own way, f̂(g(x)) = f(g(x)) is no longer valid. RaPP still works because the autoencoder at least attempts to copy the input x. However, we expect that using the abstract decoder f̂, rather than the decoder f, will improve RaPP performance. The implementation of f̂ can be difficult even though neural networks are highly flexible frameworks: the number of criteria required grows with the number of layers, because all deep encoder-decoder pairs must be forced to be inverses of each other. As such, autoencoder learning is susceptible to failure given such extensive criteria. To avoid this failure, we focus only on the last layer and extend this condition to the whole encoder-decoder system to implement the approximate f̂.
Since we only consider the last layer, the second condition of f̂ can be expressed as f̂_l = g_l^{-1}, i.e., (g_l ∘ f̂_l)(z) = z. As the second condition of f̂ holds that f̂_i = g_i^{-1} for every i, it is also true that f̂ = g^{-1}; thus, we can substitute (g ∘ f̂) for (g_l ∘ f̂_l) in this equation, giving (g ∘ f̂)(z) = z. In addition, the extended autoencoder can be based on a configuration in which a generative autoencoder is trained to match the encoding distribution with the prior distribution. In this paper, we set the encoding distribution g(x) to the standard Gaussian distribution z ∼ N(0, 1). Finally, we can define an approximate f̂ such that

g(x) = (g ∘ f̂ ∘ g)(x).

This equation states that the latent variables of the input (the left side of the equation) are equal to the latent variables of the reconstructed input (the right side of the equation). To implement an approximate f̂ that satisfies this equation, we introduce an adversarial criterion to force f to be the approximate f̂, as follows:

min_f max_{D_l} E_x[log D_l(g(x))] + E_x[log(1 − D_l((g ∘ f ∘ g)(x)))].

As the decoder f is updated to confuse the discriminator D_l, the decoder f approximates the abstract decoder f̂. This means that the latent space of the reconstructed input becomes similar to the latent space of the original input. For novelties, however, the hidden reconstruction errors are not reduced, as the autoencoder is trained only with normal samples. Therefore, we can detect novelties better using the RaPP method.

Experiments
In this section, we evaluate our proposed extended autoencoder model based on the RaPP method. We use three popular autoencoders: the AE, VAE, and AAE, as baseline models. Our extended autoencoder is also applied to these baseline models. Thus, our comparison included six models in total: the standard versions of the AE, VAE, and AAE and the extended versions of AE, VAE, and AAE.

Datasets and Problem Setups
The novelty datasets used for novelty detection were collected from Kaggle and the UCI repository. We also carried out novelty detection on the popular benchmark datasets MNIST and F-MNIST. The details of the datasets are described in Table 1. Given that MI-F and MI-V actually come from the same dataset, they share the same features; we treat them as two datasets because there are two columns that can be used as a novelty class (i.e., machine completed and passed visual inspection). Some datasets (including MI-F, MI-V, EOPT, NASA, and RARM) have only two classes, a normal class and a novelty class. The others (including STL, OTTO, SNSR, MNIST, and F-MNIST) have more than two classes. If there are more than two classes, the performance varies depending on which class is assumed to be the novelty. For a reliable experiment, each class should play the role of the normal class once; in other words, we assigned a single class as the normal class and the remaining classes as the novelty class, performed novelty detection as many times as there are classes, and averaged the results. For example, MNIST has 10 classes, from "0" to "9", so we performed novelty detection 10 times, assigning each class as the normal class in turn. As a result, 10 detection results are generated, and their average is reported as the final output of a single trial.
We selected a semi-supervised learning approach for novelty detection [15]. Thus, we provided only normal samples during the training phase and used both normal and novelty samples during the testing phase. Half of each test set consisted of normal samples, and the other half of novelty samples. After training the autoencoder, we used the RaPP method to calculate the novelty score by normalizing and aggregating the differences between the hidden activation values of an input and those of its autoencoder reconstruction. With this novelty score, we evaluated novelty detection performance using the area under the receiver operating characteristic curve (AUROC) [16]. To mitigate random variation during training, we report the AUROC averaged over 30 trials for the novelty datasets and 5 trials for the benchmark datasets.
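The scoring and evaluation steps can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: it uses the simple sum-of-squared-differences aggregation over layers (RaPP's SAP variant; the normalized NAP variant additionally whitens the per-layer differences), a generic rank-based AUROC, and toy two-layer activations.

```python
import numpy as np

def sap_score(hidden_acts, hidden_recons):
    """SAP-style novelty score: sum over layers of the squared L2 distance
    between hidden activations of the input and the corresponding hidden
    reconstructions. Each list element is an (n_samples, dim) array."""
    return sum(np.sum((h - r) ** 2, axis=1)
               for h, r in zip(hidden_acts, hidden_recons))

def auroc(normal_scores, novelty_scores):
    """Rank-based AUROC: probability that a random novelty sample scores
    higher than a random normal sample (ties count as 0.5)."""
    normal = np.asarray(normal_scores, dtype=float)[:, None]
    novelty = np.asarray(novelty_scores, dtype=float)[None, :]
    return float((novelty > normal).mean() + 0.5 * (novelty == normal).mean())
```

A score of 1.0 means every novelty sample received a higher novelty score than every normal sample; 0.5 corresponds to chance-level detection.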

Implementation Details
All autoencoders have a symmetric architecture in which the encoder and the decoder each consist of 10 layers. If an autoencoder had an additional component, such as a discriminator, that component was also set to 10 layers. The latent space dimension was determined by the number of principal components that explain at least 90% of the variance for the novelty datasets. For the benchmark datasets, the latent space dimension was set to 20, and we considered only the hidden spaces; thus, we did not compute the reconstruction error in the input space. We applied Leaky-ReLU [17] activation and batch normalization [18] to all layers except the last. The Adam optimization algorithm [19] was used for training, and the model with the lowest validation loss was selected as the best model. There was no difference in implementation between the standard and the extended autoencoders, except that the extended autoencoders had an additional discriminator.
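The PCA-based rule for choosing the latent dimension can be sketched as follows. This is a minimal sketch, assuming the 90% threshold is applied to the cumulative explained-variance ratio of the principal components; the toy data are illustrative.

```python
import numpy as np

def latent_dim_by_pca(X, threshold=0.90):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches the threshold."""
    Xc = X - X.mean(axis=0)                      # center the data
    # Singular values of the centered data matrix give the PCA spectrum:
    # squared singular values are proportional to component variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, threshold) + 1)

# Toy data: almost all variance lies along the first coordinate,
# so a single principal component already explains over 90%.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) * np.array([10.0, 1.0, 0.1, 0.1, 0.1])
dim = latent_dim_by_pca(X)
```

The resulting `dim` would then be used as the bottleneck width of the autoencoder for that dataset.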

Results
In this section, our extended autoencoders for the AE, VAE, and AAE are referred to as XAE, XVAE, and XAAE, respectively. Table 2 summarizes the RaPP performance of the standard autoencoders and our extended autoencoders trained for 100 epochs on the novelty datasets. In Table 2, the best score for each dataset is underlined. The extended autoencoders provided the best results for all datasets except MI-F and MI-V. Given that MI-F and MI-V are derived from the same dataset, the extended autoencoders achieved the best results on six of the seven datasets. This empirical result demonstrates that the extended autoencoder is generally suitable for detecting novelties using the RaPP method. Note, however, that these results are not necessarily the best achievable, since the hyperparameters, including the number of hidden layers, were not optimized for each dataset. In this respect, we investigated the performance of RaPP while varying the number of hidden layers on the MI-F and MI-V datasets; the results are presented in Tables 3 and 4. The extended autoencoder XAE achieves the best score with 2 hidden layers on MI-F and with 6 hidden layers on MI-V. Table 5 summarizes the RaPP performance on the MNIST and F-MNIST image datasets. The extended autoencoders required only two epochs to achieve performance similar to that of the standard autoencoders trained for 100 epochs. For the F-MNIST dataset, the extended AAE (trained for only two epochs) outperformed all standard autoencoders trained for 100 epochs, and for the MNIST dataset, the proposed model showed almost the same performance as the standard autoencoders. Of course, because the execution time of a single epoch varies across autoencoder methods, the wall-clock time must also be compared. Figure 7 shows the execution time, in seconds, of a single training epoch.
As seen in the figure, a single epoch of each extended autoencoder takes approximately twice as long as an epoch of its corresponding standard autoencoder; in other words, the time required for two epochs of an extended autoencoder roughly equals the time required for four epochs of a standard autoencoder. Overall, the extended autoencoders outperform the standard autoencoders within a short training time on both the MNIST and F-MNIST datasets. Unfortunately, this high efficiency was not observed for the other Kaggle and UCI datasets in Table 2. This training efficiency is especially important when concept drift is a possibility [20], as the ability to adapt quickly depends on it. Additionally, we found that the performance of the extended autoencoders tended to decrease as the number of epochs increased. This issue may be caused by difficulties associated with GAN training. Mode collapse [21] is a well-known problem that occurs when the generator learns to map several different input values to the same output point. In our case, the decoder was trained to reconstruct only images that were capable of deceiving the discriminator. As future work, improved techniques for training GANs [22] can be applied to prevent mode collapse on image datasets.

Conclusions
In this work, novelty detection with RaPP was improved using extended autoencoders that focus on the hidden activations. To reduce the hidden reconstruction error for normal samples, we introduced an adversarial network into the autoencoder. This additional adversarial network ensures that the latent variables of the reconstructed input match the latent variables of the original input. In this process, the decoder approaches the abstract decoder, which enables computation of the hidden reconstructions. Therefore, we no longer need to assume that the autoencoder can completely reconstruct the input. Our empirical results showed that, when using the RaPP method for novelty detection, the extended autoencoders outperformed the standard autoencoders on diverse novelty detection datasets from Kaggle and the UCI repository and improved training efficiency on the benchmark datasets.
In addition to modifying the autoencoder, the RaPP method itself can be improved in future work. Here, we selected the model with the lowest reconstruction error on the validation dataset as the best model. This effectively provided an early-stopping mechanism that prevents the model from overfitting the training dataset. However, the hidden reconstruction error should also be included in the model selection criterion, as RaPP leverages not only the input-space reconstruction error but also the hidden reconstruction errors.