IWGAN: Anomaly Detection in Airport Based on Improved Wasserstein Generative Adversarial Network †



Introduction
With recent advances in artificial intelligence (AI), several scholars have begun to invest in deep learning research. AI represents the use of mathematical methods to enable computers to mimic human intelligence by performing inferential, predictive, perceptive, and social tasks. AI has previously exhibited varying degrees of success in image recognition [1], voice recognition [2], natural language processing [3], expert systems [4], and automatic planning [5]. A common, everyday example of AI is the automatic driving system, which encompasses image recognition, object detection, target recognition, semantic segmentation, and other imaging technologies [6].
Owing to the exceptional performance achieved by deep learning and AI, many deep learning models have recently been developed [7,8]. The defining characteristic of deep learning is the use of an artificial neural network architecture that is trained and then used for prediction, classification, identification, and other tasks. The architecture of a neural network loosely resembles that of the human brain in that it comprises interconnected neurons. These architectures are highly diverse, with examples including the multilayer perceptron (MLP) [9], feedforward neural network (FF) [10], recurrent neural network [11], long short-term memory [11], autoencoders (AE) [12], variational AE (VAE) [13], denoising AE (DAE), convolutional neural networks (CNN) [14], and generative adversarial networks (GANs) [15]. GANs, which are typically used to generate images, were proposed by Ian Goodfellow et al. in 2014 [15], representing a breakthrough in unsupervised learning with neural networks and rapidly becoming a prevalent research topic. Current applications of GANs include style transfer [16] and mammography, where they are used to generate highly accurate digital breast images [17].
Although deep learning has had an overwhelmingly positive impact on data classification, detection, and analysis, two challenging factors frequently arise in the anomaly detection task: insufficient data in general and insufficient abnormal data in particular during the training phase. These issues are difficult to address with supervised learning. The two-stream neural network proposed by Waseem Ullah et al. [18] follows a two-stage learning strategy. First, a lightweight convolutional neural network is employed on resource-constrained IoT devices to classify events as normal or abnormal. Second, the abnormal data are passed through a bi-directional long short-term memory (BD-LSTM) network to classify the type of anomaly. In this study, a GAN and an AE were adopted to learn the latent distribution of the data and to generate a new dataset. The resulting increase in the amount and variety of data is expected to improve the model's anomaly detection performance. The proposed method, which integrates the Wasserstein-GAN (WGAN) and Skip-GANomaly models to distinguish between normal and abnormal images, is called the Improved Wasserstein Skip-Connection GAN (IWGAN). In the experimental stage, we evaluated different hyper-parameters (the activation function, learning rate, decay rate, number of discriminator training iterations, and method of label smoothing) to identify the optimal combination. Consequently, the IWGAN can generate high-quality images, thereby increasing the success rate of abnormal image detection. The contributions of this study are as follows:
• The proposed IWGAN model, which combines WGAN and Skip-GANomaly, resolves the issues posed by training difficulty and mode collapse.
• We optimized the training parameters of IWGAN, namely the activation layer (LeakyReLU), learning rate decay, number of discriminator training iterations, and label smoothing.
• The proposed model was evaluated using the Fréchet Inception Distance (FID) and Area Under Curve (AUC) values. The experimental results indicate superior performance to that of existing models, such as U-Net, GAN, WGAN, GANomaly, and Skip-GANomaly.
The remainder of this paper is organized as follows. Section 2 serves as a summary of the related work. In Section 3, we discuss the proposed network's overall architecture, as well as the methods used to overcome the training challenges inherent to GANs. Section 4 presents the experimental results and a discussion of our work. Finally, Section 5 concludes the paper.

AutoEncoder, AE
Although the AE was proposed in 1988, computation on high-dimensional data was complex and difficult to optimize at the time. In 2006, Hinton et al. [19] employed gradient descent as an optimization tool to produce an abstract representation of the original sample features, thereby improving feature dimensionality reduction. Since then, the AE method has attracted considerable scholarly attention. The defining characteristic of an AE is the use of two networks: an encoder and a decoder. The encoder compresses the image to reduce dimensionality while retaining the main features, and the decoder restores the image to its original form. At the end of training, the AE obtains a low-dimensional vector representing the input data in the hidden layer. The optimization objective is to minimize the gap between the input and reconstructed images. The overall process is therefore an unsupervised learning method for representing the features of input images. Waseem Ullah et al. [20] used an autoencoder to extract spatially optimal features and forward them to an echo state network to obtain a single spatiotemporal information-aware feature vector. This feature vector is then fused with 3D convolution features to form an intelligent dual-stream convolutional neural network-based framework for anomaly detection. Their study shows that autoencoders can effectively learn features of anomalous events. U-Net [21,22], an AE variant proposed by Ronneberger et al. in 2015, is regarded as one of the best models for image segmentation in biomedical imaging. U-Net is based on a CNN framework that classifies each pixel, and its defining feature is its U-shaped architecture. We adopted the AE to learn the distribution of latent data, generate a new dataset, and compare it with the experimental dataset.

Generative Adversarial Network, GAN
The GAN [23] is an unsupervised method that constructs a model through two neural networks: a generator and a discriminator. The underlying concept, along with its many variations, represents one of the most innovative ideas in machine learning over the last decade. GANs are most commonly used to generate images, as in the case of CycleGAN [24] for style conversion, GauGAN [25] for automatic painting, DeepFake for face swapping, and HoloGAN [26] for full-angle image generation. There are also applications in the medical [2], semiconductor [27], astronomy [28], fashion advertising, and other major fields, giving GANs an extensive scope of applicability. Within the GAN architecture, the discriminator learns to distinguish between authentic and forged images, whereas the generator attempts to generate forged images to deceive the discriminator. The two networks are trained in alternation. The present study examined the task of using a GAN for anomaly detection.
There are several problems associated with the original GAN architecture. If the discriminator is over-trained, the generator's gradient vanishes, rendering the generator useless. Conversely, if the discriminator is poorly trained, the generator's gradient is inaccurate. Thus, the overall network operates as intended only if the discriminator is trained to just the right degree, which is difficult to ensure. Another problem inherent to the conventional GAN architecture is potential mode collapse due to a suboptimal loss function. WGAN [29,30] can effectively mitigate these problems by replacing the loss function with the smoother Earth-Mover (EM) distance. WGAN introduces the following changes: (1) the sigmoid function is removed from the last layer of the discriminator; (2) the losses of the generator and discriminator no longer take the logarithm; (3) after each update, the discriminator weights are clipped to a fixed range bounded by a maximum and minimum value; and (4) momentum-based optimizers are avoided, with optimizers such as RMSProp [31] and SGD [31] being used instead.
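The four WGAN changes listed above can be illustrated with a minimal, framework-free sketch. The function names and the clipping constant `c=0.01` are illustrative assumptions and are not taken from this paper; the scores are raw (no sigmoid) critic outputs.

```python
def critic_loss(real_scores, fake_scores):
    """WGAN critic loss: mean score on fakes minus mean score on reals (no log)."""
    mean_real = sum(real_scores) / len(real_scores)
    mean_fake = sum(fake_scores) / len(fake_scores)
    return mean_fake - mean_real


def generator_loss(fake_scores):
    """WGAN generator loss: negative mean critic score on generated samples."""
    return -sum(fake_scores) / len(fake_scores)


def clip_weights(weights, c=0.01):
    """Clamp every weight into [-c, c]; c is an assumed, illustrative bound."""
    return [max(-c, min(c, w)) for w in weights]


if __name__ == "__main__":
    real = [0.9, 1.1, 1.0]    # raw critic outputs on real images
    fake = [-0.2, 0.1, 0.1]   # raw critic outputs on generated images
    print(critic_loss(real, fake))
    print(generator_loss(fake))
    print(clip_weights([0.5, -0.003, -0.2]))
```

In an actual training loop, `clip_weights` would be applied to the critic parameters after every critic update, and the updates themselves would use a non-momentum optimizer such as RMSProp.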

GANs in Anomaly Detection
Schlegl et al. [32] developed AnoGAN, which is based on a deep convolutional GAN, to learn the distribution of normal data and thereby localize anomalies. Zenati et al. [33] improved the GAN encoder by introducing an extra discriminator to ensure cycle consistency. Fast Unsupervised Anomaly Detection with GAN (F-AnoGAN) [34] builds on AnoGAN and WGAN to achieve anomaly detection. GANomaly [35], which employs the AE and GAN architectures, comprises four sub-models: an encoder, a decoder, a discriminator, and an additional encoder. Accordingly, GANomaly uses three loss functions (adversarial, contextual, and latent) to model the distribution of normal images, thereby learning to distinguish between normal and abnormal data, as shown in Equation (1). Skip-GANomaly [36] employs the same principle as GANomaly, except that skip connections are added to the generator, the extra encoder is removed, and the last convolutional layer of the discriminator serves as the encoder. The loss function of Skip-GANomaly is similar to that of GANomaly.
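$$\mathcal{L} = \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{con}\,\mathcal{L}_{con} + \lambda_{lat}\,\mathcal{L}_{lat} \tag{1}$$
where $\lambda_{adv}$, $\lambda_{con}$, and $\lambda_{lat}$ are the weights of the adversarial, contextual, and latent losses, respectively.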
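The weighted combination of the adversarial, contextual, and latent losses can be sketched as follows. This is only an illustration: the text does not state how each individual term is computed, so the mean absolute error (for a contextual term) and the mean squared error (for a latent term) used here are assumptions, as are the function names and the default weights of 1.0.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def l2_distance(a, b):
    """Mean squared difference between two equally sized flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(l_adv, l_con, l_lat, w_adv=1.0, w_con=1.0, w_lat=1.0):
    """Weighted sum of the three loss terms; the weights are placeholders."""
    return w_adv * l_adv + w_con * l_con + w_lat * l_lat


if __name__ == "__main__":
    x = [0.0, 1.0, 0.5]        # flattened input image (toy values)
    x_hat = [0.1, 0.8, 0.5]    # flattened reconstruction (toy values)
    f_x = [0.2, 0.4]           # discriminator features of the input (toy values)
    f_x_hat = [0.1, 0.5]       # discriminator features of the reconstruction
    l_con = l1_distance(x, x_hat)
    l_lat = l2_distance(f_x, f_x_hat)
    print(total_loss(l_adv=0.3, l_con=l_con, l_lat=l_lat))
```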

Proposed Model
This section discusses the proposed network architecture and core technologies, as well as details regarding the internal architecture of each subnetwork. Unlike the traditional GAN architecture, IWGAN fuses WGAN with Skip-GANomaly in a single network structure.

Generator Architecture
The generator can be divided into two parts joined by skip connections: an encoder and a decoder. The skip connections bridge the shallow features of the encoder with the deep features of the decoder, allowing the decoder to reuse encoder features when restoring the image after convolution, which ensures a higher-quality restoration. Within each block of the network, a batch normalization layer [37] is used so that the network's gradients do not vanish easily. LeakyReLU [38,39] was adopted as each network's activation layer in place of ReLU [40], and Adam [41] was employed as the optimizer. A dynamic learning rate strategy was implemented to halve the learning rate at training iterations 500, 750, 875, and 950. The overall generator architecture is illustrated in Figure 1.
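One way to read the dynamic learning rate strategy is as a milestone schedule in which the rate is halved each time one of the listed iterations is reached. The sketch below follows that reading; the base rate `2e-4` and the function name are hypothetical, since the paper does not report them here.

```python
MILESTONES = (500, 750, 875, 950)  # iterations named in the text


def learning_rate(iteration, base_lr=2e-4, milestones=MILESTONES):
    """Halve base_lr once for every milestone that has been reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * (0.5 ** passed)


if __name__ == "__main__":
    for it in (0, 499, 500, 750, 875, 950):
        print(it, learning_rate(it))
```

Because the gaps between milestones shrink (250, 125, 75 iterations), the rate drops more frequently toward the end of training.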

Discriminator Architecture
The discriminator architecture developed in this study is identical to that of the encoder, with the addition of two fully-connected layers. The first of these layers is connected to a global max pooling layer (GlobalMaxPooling2D) and has 100 units used for feature extraction. Its output is used to measure the feature similarity between the original and generated images when optimizing the model loss. The second fully-connected layer consists of one neuron, which outputs the decision of whether an image is real or generated. The primary difference between the proposed architecture and that of WGAN is that the sigmoid layer is replaced by the LeakyReLU layer, as shown in Figure 2. In addition, we adopted the loss function used in Skip-GANomaly, which combines three loss values: adversarial, contextual, and latent. Adversarial loss is used to increase the reconstruction ability regarding normal images, contextual loss guides the model to learn contextual information and sufficiently capture the data distribution, and latent loss helps generate realistic and contextually similar images. After training, the model is able to correctly reconstruct normal samples and incurs a high loss when reconstructing abnormal samples, thereby improving the effectiveness of anomaly detection.

Image Normalization
We normalized all the images' pixels from a [0, 255] range to a [−1, 1] range, as the neural network performs a weighted inner product on each pixel of the input image during forward propagation. A wider range leads to a significant increase in computational time, causing the model to converge slowly during backpropagation. Another reason behind normalization is the distance between image samples. If the range of feature points per pixel is particularly wide, the result may be inaccurate. Therefore, the pixels were normalized [42] to improve the model accuracy.
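The mapping from [0, 255] to [−1, 1] is a simple affine rescaling. A minimal sketch on nested lists (the function names are illustrative):

```python
def normalize_pixel(p):
    """Map a pixel value from [0, 255] to [-1, 1]."""
    return p / 127.5 - 1.0


def normalize_image(image):
    """Apply normalize_pixel to every entry of a 2D list of pixel values."""
    return [[normalize_pixel(p) for p in row] for row in image]


if __name__ == "__main__":
    print(normalize_image([[0, 255], [51, 204]]))
```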

Unilateral Label Smoothing
Hard labels may cause overfitting during training, particularly when the number of training samples is relatively small. Label smoothing [43] can enhance the model generalizability, alleviate the problem of overfitting, and serve as a preventive measure against noise. Furthermore, it increases the amount of feature information learned by the model, which is beneficial for distinguishing relationships between classes within the data. Szegedy et al. [44] demonstrated this method's effectiveness for classification by using a weighted average of the hard labels and a uniform distribution over the labels as soft labels. As a method to improve the performance of neural networks and avoid overconfidence of the discriminator on real samples, this approach has proven useful across many models. Therefore, the present study adopted and evaluated this approach.
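The difference between unilateral (one-sided) and bilateral (two-sided) label smoothing can be sketched as follows. The smoothed targets 0.9 and 0.1 are common illustrative choices and are assumptions; the values used in this study are not stated in the text.

```python
def smooth_labels(labels, one_sided=True, real_target=0.9, fake_target=0.1):
    """Replace hard labels (1 = real, 0 = generated) with soft targets.

    One-sided smoothing softens only the real label; two-sided smoothing
    softens both. The target values are assumed, not taken from the paper.
    """
    smoothed = []
    for y in labels:
        if y == 1:
            smoothed.append(real_target)
        else:
            smoothed.append(0.0 if one_sided else fake_target)
    return smoothed


if __name__ == "__main__":
    print(smooth_labels([1, 0, 1]))                   # unilateral
    print(smooth_labels([1, 0, 1], one_sided=False))  # bilateral
```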

Proposed Architecture
The proposed architecture integrates WGAN and Skip-GANomaly, as shown in Figure 3. WGAN avoids several of the training challenges inherent to the conventional GAN and uses the smoothness of the EM distance to address the vanishing gradient issue. In addition to producing satisfactory results for the anomaly detection task, Skip-GANomaly uses three loss functions to improve the generator's performance in identifying anomalous objects [36]. We also converted the hard labels of the GAN into smooth soft labels and reduced the optimizer's learning rate at specific iterations of the training process to determine the optimal weights of the neural network. Furthermore, LeakyReLU was applied in each activation layer to prevent gradient vanishing. To detect anomalous data, we used the anomaly score proposed in [32,33], with which each new image x is classified as normal or abnormal. The anomaly score is defined by Equation (2).
where λ is the weight controlling the relative importance of the score terms, S(x) is the anomaly score function, and R(x) is the reconstruction score that measures the contextual similarity between the input and generated images. For the test dataset D_test, we obtain the anomaly score vector S = {S_i : S(x_i), x_i ∈ D_test}. Finally, we scale the anomaly scores S_i to the probabilistic range from 0 to 1. The hyperparameters are set to the same values as in [36].
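The scoring and scaling steps can be sketched as below. Two assumptions are made that are not confirmed by this text: the score is taken to be a convex combination of the reconstruction score R(x) and a second, latent score (here called `latent_score`), following the general form used in GANomaly-style models, and the scaling to [0, 1] is taken to be min-max scaling. The default `lam=0.9` is a placeholder.

```python
def anomaly_score(recon_score, latent_score, lam=0.9):
    """Assumed form: S(x) = lam * R(x) + (1 - lam) * latent_score(x)."""
    return lam * recon_score + (1.0 - lam) * latent_score


def min_max_scale(scores):
    """Scale a list of scores to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


if __name__ == "__main__":
    # Toy (R(x), latent score) pairs for three test images.
    raw = [anomaly_score(r, l) for r, l in [(0.1, 0.2), (0.8, 0.9), (0.3, 0.1)]]
    print(min_max_scale(raw))
```

After scaling, a threshold on the scores (or a sweep over thresholds for the ROC curve) separates normal from abnormal images.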

Environment Setup and Evaluation Metrics
The experimental environment was a Windows 10 computer with an Intel(R) Core(TM) i7-8500 CPU @ 3.20 GHz and 16.0 GB of memory. All the programs were written in Python 3.7.
An output value is referred to as a true positive (TP) if it is correctly predicted to be positive, whereas a false positive (FP) occurs when a value is incorrectly predicted to be positive. Conversely, a false negative (FN) is a value incorrectly predicted to be negative, whereas a true negative (TN) is a value correctly predicted to be negative. Precision, recall, and F1-score were used as the evaluation indices in this experiment. In addition, the AUC of the receiver operating characteristic (ROC) curve was calculated using the true positive rate (TPR) and false positive rate (FPR).
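For reference, the metrics named above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function names and the zero-denominator convention (returning 0.0) are choices made for this example.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are detected (equals the TPR)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def false_positive_rate(fp, tn):
    """Fraction of actual negatives that are wrongly flagged as positive."""
    return fp / (fp + tn) if (fp + tn) else 0.0


if __name__ == "__main__":
    tp, fp, fn, tn = 8, 2, 2, 8  # toy counts
    print(precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn))
    print(false_positive_rate(fp, tn))
```

The ROC curve is obtained by plotting the TPR against the FPR while sweeping the decision threshold over the anomaly scores, and the AUC is the area under that curve.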

Dataset
Our experiments were conducted on the GDXray+ [45] database, which comprises more than 19,407 X-ray images and is used for research and educational purposes only. The dataset encompasses five types of X-ray images: castings, welds, luggage, natural objects, and environments. Only the luggage category was considered in this study. This category includes 8150 X-ray images in a total of 77 series, covering objects such as pocket knives, pistols, and razor blades, as shown in Figure 4.

Data Augmentation
To diversify the data, an augmentation method was applied [46], wherein the original images were randomly rotated, offset in the horizontal or vertical direction, sheared, zoomed in or out, and horizontally flipped. To avoid information loss during augmentation, any missing areas of the images were filled via nearest-neighbor filling, wherein the nearest pixel's value was used as the supplementary pixel. Either the GAN or the augmentation method can be used to generate new data and thereby improve the performance of the deep neural network.
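Two of the augmentation steps, horizontal flipping and horizontal offsetting with nearest-pixel filling, can be illustrated on a small 2D list. This is a simplified sketch, not the pipeline used in the study: rotation, shearing, and zooming are omitted, and the function names and `max_shift` parameter are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2D list of pixel values."""
    return [row[::-1] for row in image]


def shift_horizontal_nearest(image, dx):
    """Shift columns by dx (positive = right); vacated columns take the
    value of the nearest original pixel in the same row."""
    width = len(image[0])
    return [
        [row[min(width - 1, max(0, x - dx))] for x in range(width)]
        for row in image
    ]


def augment(image, rng, max_shift=1):
    """Randomly flip, then randomly shift by up to max_shift pixels."""
    out = horizontal_flip(image) if rng.random() < 0.5 else image
    return shift_horizontal_nearest(out, rng.randint(-max_shift, max_shift))


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6]]
    print(shift_horizontal_nearest(img, 1))
    print(augment(img, random.Random(0)))
```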

Difference between ReLU and LeakyReLU
This study evaluated the effectiveness of ReLU and LeakyReLU in the proposed model, under a fixed learning rate and the use of hard labels. Using ReLU as the activation layer, the images generated over 900 epochs exhibited high quality in the later stages, although the model performed poorly in the early stages of training. The results are shown in Figures 5 and 6. The anomaly scores were calculated using the Skip-GANomaly evaluation method after training. The test set contained both anomalous and normal images. The scatter points shown in Figure 7 illustrate the distribution of anomaly scores for all the images. Here, the red dots represent anomalous data and the blue dots represent normal data. Using LeakyReLU as the activation function, the model exhibited a similar increase in quality over 900 epochs.
The distribution of abnormal images and the distribution density of normal images are shown in Figure 8. LeakyReLU evidently performed better than ReLU. Both activation layers were used for ten training sessions to draw a box-and-whisker plot to ensure reliability, as shown in Figure 9. Table 1 shows all the AUC values corresponding to the two activation layers after 10 training cycles. Although LeakyReLU is slightly less stable than ReLU, it achieved higher maximum values, indicating superior performance.

Difference in Learning Rate
To evaluate whether learning rate decay [47] is effective within the proposed model, decay ratios of 0.1 and 0.5 were evaluated, with the ReLU activation layer and hard labels. As shown in Figure 10, the learning rate decayed by a factor of 0.1 every 100 iterations to avoid overshooting weights close to the optimal convergence point. Under this decay ratio, the images generated over 900 epochs indicate that the proposed model performs poorly in the early stages of the training process. However, when comparing these results with Figure 5, the learning rate decay evidently yields improved performance, as shown in Figure 11. A decay ratio of 0.5 applied every 100 iterations is shown in Figure 12. Although the network performed poorly in the early stages of the training process, it successfully generated high-quality images in later stages, as shown in Figure 13. The distribution density plots corresponding to the different decay ratios are shown in Figure 14.
To ensure reliability, the AUC values of ten training sessions at decay ratios of 0.1 and 0.5 were drawn as a box-and-whisker plot. The decay ratio of 0.1 yields superior results, as shown in Figure 15.
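The step decay compared in this experiment multiplies the learning rate by a fixed ratio every 100 iterations. A minimal sketch follows; the base rate of 1.0 used in the demonstration and the function name are illustrative assumptions.

```python
def step_decay(iteration, base_lr, ratio, step=100):
    """Multiply base_lr by `ratio` once for every `step` completed iterations."""
    return base_lr * (ratio ** (iteration // step))


if __name__ == "__main__":
    for ratio in (0.1, 0.5):
        print(ratio, [step_decay(it, 1.0, ratio) for it in (0, 100, 200, 300)])
```

With a ratio of 0.1 the rate shrinks by three orders of magnitude within 300 iterations, whereas a ratio of 0.5 reduces it only to one eighth over the same span.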

Training Iterations for Discriminators
We evaluated the most effective number of discriminator training iterations per generator update by comparing settings of 3 and 5. The five-iteration setup is the value specified in the WGAN paper. Both experiments used a ReLU activation layer, fixed learning rate, and hard labels. With three training iterations, the network performed poorly in the early stages of the training process. However, the generated images were of a higher quality in the later stages. With five training iterations, although the network still performed poorly in the early training stages, it improved on its performance with three iterations. Likewise, the network generated higher-quality images at later stages. Both iteration settings were used in 20 training sessions to draw box-and-whisker plots. The results indicate that the use of five training iterations yielded substantially improved performance, as shown in Figure 16.

Smoothing Label
This section evaluates the impact of label smoothing in the proposed model. The experiment was performed using unilateral and bilateral label smoothing, with a ReLU activation layer and fixed learning rate being used for both settings. Under unilateral label smoothing, the images generated over 900 epochs indicate that the network performed poorly in the early stages of training. However, the performance improved over the case of using simple hard labels, and higher-quality images were generated in later stages. Under bilateral label smoothing, although the network still performed poorly in the early stages, there was a clear improvement. Likewise, excellent-quality images were generated in the later stages. Table 2 lists all the AUC values over 10 training sessions with unilateral and bilateral label smoothing. According to the AUC values in Table 2, unilateral label smoothing outperforms bilateral label smoothing.

Discussion and Analysis
Although the parametric analysis discussed in the previous sections is not fully interpretable, each parameter had an impact on the overall performance during training. The box-and-whisker plots corresponding to different activation layers indicate that although ReLU is more stable than LeakyReLU, the latter produced superior maximum and average AUC values. The use of a decaying learning rate has been demonstrated to stabilize the model accuracy, with a rate of 0.1 yielding superior performance to a rate of 0.5. The discriminator was trained repeatedly to effectively reduce false positives, with five iterations of training demonstrating superior results over three iterations of training. Furthermore, the dataset labels were smoothed, with unilateral smooth labeling producing superior results compared to bilateral or hard labeling. Under optimal parameters, although the images generated in the first training epoch are of a somewhat higher quality, they are still not sufficiently good. However, high-quality images were generated after 900 epochs, as shown in Figure 17.

Evaluation
Two evaluation methods are generally used to evaluate the quality and diversity of images generated by GANs: the inception score (IS) [48] and the FID [49]. However, IS has the disadvantage that certain types of images lead to an incorrect score. Accordingly, we employed the FID to evaluate the distance between the data distributions of the real and generated images, wherein a smaller FID value indicates higher quality and diversity of the generated images. We evaluated the quality of images generated using U-Net, GAN, WGAN, and GANomaly, as summarized in Table 3. The proposed model exhibits a significant improvement in performance. Table 4 lists the AUC values and F1 scores obtained by each model, likewise indicating the superior performance of IWGAN. To evaluate computational cost, this paper compares the number of floating-point operations (FLOPs), as shown in Table 5. The method proposed in this paper requires a large amount of computation in execution. Compared with Skip-GANomaly, IWGAN requires an additional 20% of computing resources. However, IWGAN improves on Skip-GANomaly by 38.5% and 19% in terms of the FID and F1-score, respectively. When resources are unconstrained, the method proposed in this paper can therefore obtain better anomaly detection results.
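For reference, the FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images. The symbols below are notation chosen for this note: $\mu_r, \Sigma_r$ denote the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The first term penalizes a shift between the feature means and the second a mismatch between the feature covariances, which is why a smaller value reflects both higher fidelity and better diversity.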

Conclusions
The anomaly detection task involves two major challenges: insufficient data and insufficient abnormal data. This paper proposes the IWGAN network, whose architecture combines WGAN and Skip-GANomaly. The WGAN subnetwork mitigates the issues of training difficulty and mode collapse, while the excellent detection ability of Skip-GANomaly addresses the insufficient data problem. In addition, we found an optimal combination of training parameters for IWGAN: the LeakyReLU activation layer, a learning rate decay ratio of 0.1, five discriminator training iterations, and unilateral label smoothing. The proposed model was evaluated using the FID, with experimental results exhibiting significant improvements in performance. The AUC value of the overall network for discriminating the gun samples in GDXray+ reached an average of approximately 0.95, which is excellent for a single generative network. In future work, we will introduce an attention mechanism to extend the proposed model's applicability to other anomaly detection tasks. In addition, we plan to adopt neural architecture search for automatic hyper-parameter optimization (HPO).