Applied Sciences
  • Article
  • Open Access

17 November 2021

Spatio-Temporal Deep Learning-Based Methods for Defect Detection: An Industrial Application Study Case

1 Institute of Computing (IComp), Federal University of Amazonas (UFAM), Manaus 69080-900, Brazil
2 Institute and Center for Development and Research in Software Technology (ICTS), Manaus 69080-900, Brazil
3 TPV Technology Limited, Manaus 69080-900, Brazil
* Author to whom correspondence should be addressed.
This article belongs to the Topic Applied Computer Vision and Pattern Recognition

Abstract

Data-driven methods—particularly machine learning techniques—are expected to play a key role in the headway of Industry 4.0. One increasingly popular application in this context is the use of anomaly detection to test manufactured goods in assembly lines. In this work, we compare supervised, semi/weakly-supervised, and unsupervised strategies to detect anomalous sequences in video samples which may be indicative of defective televisions assembled in a factory. We compare 3D autoencoders, convolutional neural networks, and generative adversarial networks (GANs) with data collected in a laboratory. Our methodology to simulate anomalies commonly found in TV devices is discussed in this paper. We also propose an approach to generate anomalous sequences similar to those produced by a defective device as part of our GAN approach. Our results show that autoencoders perform poorly when trained with only non-anomalous data—which is important because class imbalance in industrial applications is typically skewed towards the non-anomalous class. However, we show that fine-tuning the GAN is a feasible approach to overcome this problem, achieving results comparable to those of supervised methods.

1. Introduction

Intelligent models are expected to play a key role in accomplishing the goals of Industry 4.0, which include the evolution of traditional manufacturing systems into intelligent, automated systems. In this context, research on machine and deep learning has rapidly come to dominate applications within the industrial field, especially in the current second level of Industry 4.0, which is data and intelligence driven [1]. Despite this apparent success, machine learning-based solutions deployed in real industrial applications are still few and mostly conducted by a small group of predominantly large companies [2]. According to Bertolini et al. [2], production planning and control and defect analysis are examples of emerging research topics that are already attracting significant academic and industrial interest, which is expected to continue growing in the coming years.
Particularly in problems involving defect identification and classification, visual quality inspection is an important research topic, and images are among the most common types of data dealt with. Several studies have proposed solutions supported by automated image recognition using machine learning for defect detection, such as identifying material defects in the selective laser melting of metal powders [3] and classifying defects in semiconductor fabrication using scanning electron microscope images [4]. However, we note that the literature has given significantly less attention to the investigation of this kind of problem using video data.
Although different aspects addressed in works that investigate defect identification from images may be extremely useful when dealing with video, the latter poses unique challenges, especially when considering the spatio-temporal patterns of input video sequences. In addition, video data are difficult to represent and model due to their high dimensionality, the presence of noise, and the fact that each video segment may represent a wide variety of events.
The problem investigated in this paper involves defect identification from video data. In the assembly line of a TV set manufacturer, the TV’s digital decoders must be tested to ensure that no defects occur. For instance, some possible defects are: (1) mosaic—characterized by artifacts of a geometric pattern which may partly or completely block the video’s frames; (2) freeze—corresponding to consecutive repetitions of the same frame; (3) frame loss—temporal jumps that skip more than a single frame at a time; and (4) black screen—when the complete darkening of the screen replaces one or more frames. In order to better illustrate these defects, Figure 1 shows two different frame sequences, one with no anomalies (Figure 1a) and another presenting all four defects (Figure 1b).
Figure 1. Example of a regular sequence of frames (a) and an anomalous sequence with simulations of four types of defects (b).
Detecting defects in video may be considered a special case of video anomaly detection, since the objective of this task is usually to discriminate positive events from negative and rare ones. An anomaly is usually an outlier, a non-standard piece of data, such as defects in surfaces [5]. Anomaly detection may be employed in a wide range of applications, such as the identification of noisy signals [6], disease classification [7,8,9], and pest control through environment surveillance [10,11]. Deep learning-based methods are considered the state-of-the-art in video anomaly detection [12]. According to Nayak et al. [12], among the categories of deep learning methods used for video anomaly detection, the most widely used are supervised and semi-supervised learning. In the first category, convolutional neural networks (CNNs) built with spatio-temporal layers—e.g., convolutional 3D or two-stream CNNs—are successfully used as video descriptors to provide discriminative information when labeled data are available. In the second, spatio-temporal autoencoders, based for instance on the convolutional long short-term memory (ConvLSTM) architecture, are very popular. These models are trained in a one-class classification (OCC) fashion: they are typically trained to reconstruct normal videos with high precision, and when presented with anomalies, they produce high reconstruction errors. This reconstruction error may then be used to decide whether the input video is an anomaly or not.
Considering semi-supervised learning, more recent attention in the literature has focused on adversarial training, adopting a generative adversarial network (GAN) for anomaly detection [13]. In a typical GAN procedure, the generator provides fake samples and passes them to the discriminator, which focuses on distinguishing fake from real samples. The discriminator is trained to be as precise as possible in assigning correct labels to both real and fake samples, while the generator learns to provide fake samples realistic enough to confound the discriminator. In the context of anomaly detection, however, the process is slightly different. The common procedure is to train one standard GAN using the OCC approach. Hence, only non-anomalous samples are used, leading the GAN to learn a mapping from the latent space representation to the samples. Therefore, the generator learns how to generate normal samples. At test time, since each sample needs to be mapped to the latent space, a high discrimination score is expected when an anomalous sample is encoded, while a low score is likely to be indicative of a non-anomalous sample. It is important to mention that some recent studies present methods that use both GANs and autoencoders for anomaly detection, as in [14].
Semi-supervised approaches are particularly interesting for defect detection because they exploit the fact that normal video instances are usually largely available in real applications, while collecting defect data with sufficient variety, volume, and quality is generally costly and time-consuming. However, the lack of anomalous samples for validation may be a drawback to this strategy. GANs can mitigate this problem as they may successfully generate anomalous samples [13]. Some examples of works proposing GANs to solve the imbalanced data problem in the manufacturing domain are [15,16,17]. In terms of videos, acquiring enough anomalous data is even harder. Consequently, common supervised deep learning neural network training tasks cannot be carried out. Instead, GANs may be used to generate the samples of anomalous videos.
In this work, we propose a novel GAN-based anomaly detection model which learns to generate samples of anomalous video in a semi-supervised way. The proposed method requires only normal data and a few instances of the anomalous class in the training process. Unlike the traditional GAN, whose generator component is composed of fractionally-strided (transposed) convolutional layers and whose loss is calculated by taking into account the discriminator's classification, we propose a custom generator component that produces synthetic anomalous instances from normal instances. This is performed by inserting anomalies into normal video instances using transformations such as Gaussian noise, temporal jumps and freezing. The discriminator loss values are used as an adjustment factor by the generator: the lower the discriminator loss, the smoother the defects produced by the generator. As a consequence, the video samples used to train the discriminator become increasingly “harder” as the discriminator's loss reduces during training. We do so because, as pointed out in [18], it is important that the discriminator and the generator learn simultaneously; otherwise, the discriminator does not have ample gradients with which to update its weights. Similarly, the generator must steadily produce harder anomalies, otherwise there would effectively be no generator competing against the discriminator.
In addition, we introduce in this paper three novel video datasets that simulate real-world industry-oriented failures. The datasets provide video-level annotations, i.e., a video is labeled as normal or anomalous, but the timestamps of the anomalies within each video are unknown. We compare the results achieved by our proposed GAN against five different models. More specifically, four supervised methods: a custom 3D CNN, 3D ResNet-34 [19], Mobile Video Networks (MoViNet-A2) [20], and Convolution 3D (C3D) [21]; and one semi-supervised method: an autoencoder model composed of residual blocks, ConvLSTM, and convolutional layers, recently proposed in [22].
The remainder of this paper is organized as follows. A short review of the state-of-the-art involving spatio-temporal deep learning-based methods for video anomaly detection is presented in Section 2. An in-depth description of the datasets created and the GAN proposed, as well as the additional models investigated in this paper, is provided in Section 3. Section 4 describes experiments and results. Finally, conclusions and future work are presented in Section 5.

3. Materials and Methods

We describe in this section the datasets generated to evaluate the models employed in this work. Then, we present the proposed GAN as well as other investigated methods.

3.1. Datasets

We introduce in this paper three novel video datasets that simulate real-world industry-oriented failures. The datasets are called 60frames, DenserGlitch and The_1R—listed in their chronological order of acquisition and complexity. All datasets aim to represent the real use-case scenario; among them, 60frames represents the most controlled environment, since its instances were captured frame by frame. The instances of the remaining datasets were obtained in real time as the videos were played on one of the devices used.
All three datasets contain instances from two classes: (1) regular video segments; and (2) anomalous ones. All samples were generated from a single base video composed of 2399 frames sampled at 30 frames per second, leading to roughly 79 s of duration. The videos were collected in a controlled environment similar to the one used in the assembly line producing TV sets. To increase the variability of the data, two different television manufacturers and screen sizes were used, and the camera was moved around several times during the capture. It must be noted that none of the datasets contain any real anomaly. The defects were simulated based on observations of real anomalies. The details pertaining to each dataset are presented below.

3.1.1. 60frames

A Basler acA2500-14uc USB 3.0 camera equipped with a C125-0418-5M-P 4 mm lens was used to generate the instances for this dataset. Each frame of the original video was displayed on the television's screen and captured by the camera before changing to the next frame. Given that there were originally 2399 frames, 47,980 frames were collected for each class, since 10 camera positions were used for each of the two devices (32-inch and 43-inch screen sizes). The original resolution of the captured images is 2590 × 1942. However, the following preprocessing steps were conducted: crop—to preserve only the pixels inside the screen, obtaining an ROI of roughly 1152 × 824 pixels; and resize to 224 × 224. Only one type of anomaly (mosaic) was simulated before the instances' capture. Each capture alternated between original frames and frames with the simulated mosaic so that, later on, while constructing the videos, it was possible to insert other defects at any desired position, allowing full control over where the anomalies would be placed and how long they would last.
Defective frames with mosaic were simulated using the glitch-this module, version 1.0.2, randomly varying the glitch intensity between levels 2 and 10. Freezes (produced by adding multiple copies of a given frame) and darkened frames are examples of additional defects simulated after the frames' capture. In addition, the defects' duration was randomly chosen according to a normal distribution with a mean of 12 and a standard deviation of 4, which determines the number of defective frames for each anomaly event. Finally, the produced frames provided 6240 video segments—among which precisely 4680 are anomalous instances (75%) and 1560 are normal (25%). Each video is composed of 60 frames, in order to increase the number of examples for each class, and the last 30 frames of each sample overlap with the first 30 frames of the next one. Because of this overlap, the dataset was partitioned into training, validation and test sets taking into account the frames shared by different videos from the same camera position, so that all instances collected with a given camera position were placed in the same partition.
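As an illustration of this simulation procedure, the sketch below shows how such defects could be produced: mosaic frames via glitch-this, defect lengths drawn from a normal distribution with mean 12 and standard deviation 4, and freezes and black screens inserted by frame replacement. It is a minimal sketch, not the exact scripts used to build the dataset; the helper names and the details of the glitch-this call are assumptions.

```python
# Minimal sketch of the 60frames defect simulation (illustrative, not the
# authors' exact pipeline). Helper names are hypothetical.
import random
import numpy as np
from glitch_this import ImageGlitcher  # basic API of the glitch-this module

glitcher = ImageGlitcher()

def mosaic_frame(frame_img):
    """Apply a mosaic-like glitch with a random intensity between 2 and 10."""
    return glitcher.glitch_image(frame_img, glitch_amount=random.randint(2, 10))

def defect_length(mean=12.0, std=4.0):
    """Number of consecutive defective frames for one anomaly event."""
    return max(1, int(round(np.random.normal(mean, std))))

def insert_freeze(frames, start):
    """Simulate a freeze by repeating the frame at position `start`."""
    n = defect_length()
    return frames[:start] + [frames[start]] * n + frames[start + n:]

def insert_black_screen(frames, start, black_frame):
    """Simulate a black screen by replacing frames with a darkened frame."""
    n = defect_length()
    return frames[:start] + [black_frame] * n + frames[start + n:]
```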

3.1.2. DenserGlitch

For this dataset, a Basler acA1300-200uc camera was employed due to the need to collect all frames using 30 frames per second as the sample rate. As a consequence of this change, the resolution of the captured frames decreased to 1280 × 1024 . The mosaic simulation procedure is also different since it is based on real anomalous instances obtained through experiments with signal attenuation performed to induce defects. Figure 2 shows a comparison between real mosaic anomalies (Figure 2a) and those simulated on 60frames (Figure 2b) and DenserGlitch (Figure 2c) datasets, respectively.
Figure 2. Examples of anomalous frames showing the real mosaic (a) and the simulated mosaic in the 60frames (b) and DenserGlitch (c) datasets, respectively.
The same preprocessing steps performed for the previous dataset were also conducted for this second dataset. A total of 40 video segments of 10 s each was produced. The video segments were balanced, i.e., 20 presented no anomalies in any frame while the remaining 20 presented mosaic in all frames. These videos were recorded while being displayed on the screen of a 43-inch device, allowing the generation of 360 video segments composed of 60 frames obtained with 50% overlap between the frames of subsequent samples. When compared to the 60frames dataset, the mosaic simulation was expected to be more realistic in DenserGlitch. In addition, since the frames were captured by recording the video while it was displayed, even with a camera operating at the same sample rate as the video, slight frame rate oscillations can break the synchronization so that the camera captures the transition between two different frames of the video. This phenomenon may also occur in a real scenario. However, this dataset presents two disadvantages: (1) it is significantly smaller than the other two datasets; and (2) all anomalous videos contain defects in every frame, which makes the defect detection problem less challenging.
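For reference, the 50% overlap segmentation can be expressed as a simple sliding window over the captured frames; the snippet below is a sketch assuming the frames are stored in a list.

```python
# Sketch: split a long capture into 60-frame segments with a stride of
# 30 frames (50% overlap between subsequent samples).
def sliding_windows(frames, segment_len=60, stride=30):
    return [frames[i:i + segment_len]
            for i in range(0, len(frames) - segment_len + 1, stride)]
```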

3.1.3. The_1R

The video capture process for this dataset is similar to the process used to generate the DenserGlitch dataset, including the same camera. Again, two different screen sizes and 10 different camera positions per device were used, providing 78 video segments for each class per camera position. The samples were generated using digital image processing before the data capture. Therefore, 60-frame video segments with anomalies added as needed were produced, allowing the generation of 3120 samples (1560 per class). The 78 anomalous segments produced for each camera position were equally split among the four types of defects. In addition, the number of defects in a single video was defined according to the following proportions: 46% of the samples contained 1 defect, 31% contained 2 defects, 15% contained 3 defects and 8% contained 4 defects. These proportions were empirically defined based on the observed frequency of each defect in the production line.
To try to prevent possible bias towards a specific type of anomaly, samples with 2 or 3 defects were evenly split among every possible combination of anomalies. The defects' duration was determined by a normal distribution (μ = 12, σ = 6) truncated at 1. Random values were generated according to this distribution every time a defect was added, independently of its type or the number of anomalies in a given sample. When adding a defect to a sample, the starting frame of the anomaly was randomly chosen among the positions that could accommodate the defect's full length, as determined by the normal distribution.
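The placement procedure described above can be summarized by the short sketch below, which draws the number of defects per sample from the stated proportions and each defect's span from the truncated normal distribution; the function names are illustrative.

```python
# Sketch of The_1R defect placement: the number of defects per sample follows
# the 46/31/15/8% proportions; each defect length ~ N(mu=12, sigma=6) truncated
# at 1; the start frame is chosen uniformly among positions that fit the defect.
import numpy as np

def sample_num_defects():
    return int(np.random.choice([1, 2, 3, 4], p=[0.46, 0.31, 0.15, 0.08]))

def sample_defect_span(num_frames, mu=12.0, sigma=6.0):
    length = max(1, int(round(np.random.normal(mu, sigma))))  # truncate at 1
    length = min(length, num_frames)
    start = np.random.randint(0, num_frames - length + 1)
    return start, length
```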
The anomalies in this dataset were simulated as follows: mosaics as in DenserGlitch; and freezes and black screens as in the 60frames dataset, except that all defects were introduced before the videos' capture. This way, neighboring frames within a freeze or black screen are slightly different from each other due to tiny oscillations of the camera and other external factors not controlled during the video capture process. Frame loss was also simulated by skipping some intermediate frames of the videos, producing temporal jumps. During the capture, a colorful screen was used to tag the start and end of each 60-frame segment inside the larger video that was displayed. This strategy allowed the automatic segmentation of the long captured video into a series of shorter instances. Additionally, camera sample rate oscillations led to unintended frame loss events in some instances. As a result, the total length of the samples was reduced to 55 frames to enable using most of the generated instances.

3.2. The Proposed 3D-GAN

Six different models were investigated in this paper: the proposed 3D-GAN; a custom 3D-CNN; a spatio-temporal autoencoder; 3D ResNet-34; C3D; and MoViNet-A2. The 3D ResNet and C3D models are considered state-of-the-art methods [34], while MoViNet-A2 is a computation- and memory-efficient network recently proposed to cope with streaming video. Our proposed GAN is detailed in this section, and the next section provides a short summary of the baselines.
We propose a solution using a 3D-GAN whose custom generator is designed to generate anomalous videos. The discriminator component, on the other hand, learns only from normal-class data. This way, the proposed method simultaneously generates anomalous samples and performs anomaly detection. Therefore, instead of the traditional fake vs. real adversarial competition, the discriminator learns to recognize anomalous instances generated from real ones.
When analyzing the results achieved by the classical generator network of a GAN, we observed that the generated instances were not similar enough to real anomalies. Based on this analysis, we decided to build a custom, non-neural generator to produce anomalous videos from normal ones. This is performed by inserting anomalies into normal video instances using two groups of transformations: (1) spatial; and (2) temporal. The first group is composed of the following transformations:
  • Gaussian noise;
  • Salt-and-pepper noise;
  • Poisson noise;
  • Failure in a color channel;
  • Defective pixels on display;
  • Jitter;
  • Digital channel packet loss.
It is important to note that these defects are mainly observed as spatial features. However, since temporal anomalies can occur in the digital channel problem, the generator was also responsible for generating the temporal-based defects below:
  • Freezing;
  • Temporal jumps;
  • Black screen;
  • Glitch between frames.
All transformations employed by our generator have parameters. For instance, the Gaussian noise depends on two parameters: mean and variance. Taking into account that the mean can be considered zero or simply removed, the variance is the only parameter that controls the noise intensity, i.e., the higher the variance, the higher the severity of the noise. Another example is the black screen transformation, whose parameter is the number of black frames to be inserted into the video. In this case, the number of black frames defines the severity of the anomaly. As explained in the next paragraphs, the transformation parameters are dynamically adjusted according to the loss provided by the discriminator network.
Figure 3 shows the learning process of the model. In each cycle, a batch of normal videos is randomly selected to be fed to the discriminator. The same batch of normal videos is also provided to the generator, which applies temporal and spatial transformations to each normal video in order to generate anomalous instances. The generator provides anomalous videos using the discriminator loss function as an adjustment factor to better generate these instances. Hence, parameters that determine the severity of the anomaly, such as the number of anomalous frames for the black screen transformation and the variance of the Gaussian noise, vary with the discriminator loss. For instance, the variance of the Gaussian noise decreases together with the discriminator loss. Thus, the better the discriminator, the smoother the generated anomalies will be and the more difficult they will be to detect. For each type of anomaly, upper and lower limits were defined. Therefore, both the anomaly types and their parameters are experimentally adjustable.
Figure 3. Architecture of the proposed GAN during training. The generator takes normal video samples and creates anomalous ones. Each training batch contains normal samples and anomalous samples—75% of which come from the generator and 25% of which come from real data. At each step of the training, the loss of the discriminator is used to update the discriminator’s weights and also as a parameter to tune the generator.
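To make this coupling concrete, the sketch below maps the current discriminator loss to a transformation severity clipped to per-anomaly bounds. The linear mapping and the specific bounds are illustrative assumptions, not the exact values used in our experiments.

```python
# Sketch: transformation severity as a monotone function of the discriminator
# loss, clipped to per-anomaly bounds (bounds and mapping are illustrative).
import numpy as np

SEVERITY_BOUNDS = {
    "gaussian_variance": (0.001, 0.05),  # variance of additive Gaussian noise
    "black_frames": (1, 20),             # number of black frames inserted
}

def severity_from_loss(disc_loss, low, high, loss_scale=1.0):
    """Lower discriminator loss -> subtler (harder to detect) anomalies."""
    t = float(np.clip(disc_loss / loss_scale, 0.0, 1.0))
    return low + t * (high - low)

def add_gaussian_noise(video, disc_loss):
    """`video` is a float array in [0, 1] with shape (T, H, W, C)."""
    lo, hi = SEVERITY_BOUNDS["gaussian_variance"]
    var = severity_from_loss(disc_loss, lo, hi)
    noise = np.random.normal(0.0, np.sqrt(var), size=video.shape)
    return np.clip(video + noise, 0.0, 1.0)
```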
In order to incorporate features that are only provided by a physical camera, as similar as possible to the actual device used in real tests, a small number of real captured anomalous instances (equivalent to 25% of the number of normal instances) was used in our method. When a real anomalous instance was used, a Gaussian smoothing filter was applied before the instance was fed to the discriminator, to prevent possible over-adjustment to the differences between real captured instances and generated instances. Our preliminary experiments indicated that this process allows the model to achieve better generalization. In addition, based on observations conducted in the assembly line, the generator chooses among spatial transformations, temporal transformations, and real anomalous instances according to the following fixed distribution: 50% spatial transformations; 30% temporal transformations; and 20% real anomalous instances. The transformations within each group are randomly chosen.
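A minimal sketch of this sampling, and of the smoothing applied to real anomalous instances, is shown below; the two transform functions are simplified placeholders for the spatial and temporal operations listed earlier, with the severity adjustment omitted for brevity.

```python
# Sketch of the anomalous-sample selection: 50% spatial transformations,
# 30% temporal transformations, 20% real captured anomalies (Gaussian-smoothed).
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_transform(video, disc_loss):
    """Placeholder spatial anomaly: additive Gaussian noise (fixed severity)."""
    return np.clip(video + np.random.normal(0.0, 0.05, video.shape), 0.0, 1.0)

def temporal_transform(video, disc_loss):
    """Placeholder temporal anomaly: freeze the first frame for a few steps."""
    frozen = video.copy()
    frozen[1:6] = frozen[0]
    return frozen

def make_anomalous_sample(normal_video, real_anomalies, disc_loss):
    choice = np.random.choice(["spatial", "temporal", "real"], p=[0.5, 0.3, 0.2])
    if choice == "spatial":
        return spatial_transform(normal_video, disc_loss)
    if choice == "temporal":
        return temporal_transform(normal_video, disc_loss)
    # Real captured anomaly: smooth spatially to soften camera-specific detail.
    sample = real_anomalies[np.random.randint(len(real_anomalies))]
    return gaussian_filter(sample, sigma=(0, 1, 1, 0))  # axes (T, H, W, C)
```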
As mentioned before, the artificial generator uses the discriminator loss function as an adjustment factor, generating “more complex” anomalous videos as the discriminator learns. Figure 4 shows examples of anomalous frames generated by applying the Gaussian noise transformation to a normal frame. The original normal frame is shown in Figure 4a. Since the noise severity is defined according to the discriminator loss, the anomalous frame in (c) shows highly severe noise as a result of the high loss provided by the discriminator, whilst the noise becomes smoother and more realistic as the discriminator loss is reduced.
Figure 4. Two illustrative examples of anomalous frames obtained by the proposed generator using Gaussian noise applied to the normal frame in (a). The noise severity is higher at the beginning of the discriminator learning process, as shown in (c). It becomes smoother and more realistic as the discriminator loss decreases, as observed in (b).
In terms of the discriminator component, Table 1 summarizes its architecture. This network is composed of 3 convolutional layers with Leaky ReLU as the activation function, which allows a small, non-zero gradient when the unit is not active. We also added a dropout layer with a 0.3 dropout rate to mitigate overfitting. Since fully connected layers are prone to overfitting, which hampers the generalization ability of the overall network [35], we also applied GlobalMaxPooling3D to take the maximum value of each feature map. This results in a vector that is fed directly into the sigmoid layer.
Table 1. Architecture of the customized discriminator component of the 3D-GAN designed to detect anomalies in video data.
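For reference, a Keras sketch consistent with this description is given below. The filter counts, kernel sizes, and strides are assumptions, since the exact values from Table 1 are not reproduced here.

```python
# Hedged Keras sketch of the discriminator: three Conv3D layers with LeakyReLU,
# dropout (rate 0.3), GlobalMaxPooling3D, and a sigmoid output. Filter counts,
# kernel sizes, and strides are assumptions.
from tensorflow.keras import layers, models

def build_discriminator(input_shape=(60, 224, 224, 3)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv3D(32, kernel_size=3, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Conv3D(64, kernel_size=3, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Conv3D(128, kernel_size=3, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.GlobalMaxPooling3D(),
        layers.Dense(1, activation="sigmoid"),
    ])
```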

3.3. Baselines

Four supervised approaches were investigated in this paper: (1) a 3D-CNN we customized for this specific application; and (2) three pre-trained methods: 3D ResNet-34, C3D and MoViNet-A2. The first model is described in the next section, and the remaining CNNs are summarized in Section 3.3.3. Moreover, we also employed a semi-supervised method: the autoencoder described in Section 3.3.2.

3.3.1. Customized 3D-CNN

The architecture defined for the 3D-CNN is summarized in Figure 5. It is composed of five 3D convolution layers, each followed by a max-pooling and a batch normalization layer. Although the literature is not clear on whether dropout and/or batch normalization is preferable when optimizing a model for generalization, our experiments pointed to batch normalization layers as the better option. It is worth mentioning that similar results were observed in [36]: their results indicate that batch normalization improved model accuracy without considerably increasing the training time, while the opposite was observed when using dropout layers, which incurred a reduction in model accuracy when evaluating anomalies. These authors concluded that batch normalization layers allowed the model to increase its performance.
Figure 5. Architecture of the customized 3D-CNN designed to detect anomalies in video data.
The weights were initialized using Glorot normal initialization [37], and the convolution layers employed the rectified linear unit (ReLU) as the activation function. After the last convolutional layer, global average pooling 3D is carried out over the temporal data. In this process, a final normalization of the feature maps is performed so that each feature map is converted into a single value. Then, a dense layer is added after the Global Average Pooling 3D layer. Next, dropout is used to reduce overfitting. In addition, since this is a binary problem, the last layer employs sigmoid as the activation function. Therefore, the prediction reaches values close to 1 when an anomaly is detected and close to 0 otherwise. Finally, the remaining hyperparameters are: cross entropy as the loss function and Adam—with a learning rate of 10⁻⁴—as the optimization algorithm. All hyperparameters were experimentally tuned.
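A hedged Keras sketch matching this description is shown below; the filter progression, dense width, and dropout rate are assumptions, since Figure 5's exact values are not reproduced here.

```python
# Sketch of the customized 3D-CNN: five Conv3D blocks (ReLU, Glorot normal
# init), each followed by max pooling and batch normalization, then
# GlobalAveragePooling3D, a dense layer, dropout, and a sigmoid output;
# trained with Adam (lr = 1e-4) and binary cross entropy.
from tensorflow.keras import layers, models, optimizers

def build_custom_3dcnn(input_shape=(60, 224, 224, 3)):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (16, 32, 64, 128, 256):            # assumed progression
        model.add(layers.Conv3D(filters, kernel_size=3, padding="same",
                                activation="relu",
                                kernel_initializer="glorot_normal"))
        model.add(layers.MaxPooling3D(pool_size=2))
        model.add(layers.BatchNormalization())
    model.add(layers.GlobalAveragePooling3D())
    model.add(layers.Dense(64, activation="relu"))     # assumed width
    model.add(layers.Dropout(0.5))                     # assumed rate
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```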

3.3.2. Autoencoder

A semi-supervised learning approach was also investigated in this paper. Specifically, we employed the spatio-temporal residual autoencoder (R-STAE), recently proposed in [22]. As described in Section 2.2, this architecture is composed of ConvAE and ConvLSTM layers, with the addition of residual blocks that result in lower reconstruction errors.
To employ this model with our datasets, the input size was modified to 224 × 224. In addition, Adam with a learning rate of 10⁻⁴ was the optimizer employed, as for the customized 3D-CNN. The mean squared error (MSE) was used to calculate the average squared difference between the reconstructed and real frames. This method assigns a label to the input instance based on the normality score calculated using the equation below, also employed in [22].
normscore = 1 − (MSE − min(MSE)) / max(MSE)   (1)
The normality score values are expected to be higher for normal instances and lower for the anomalous ones. Hence, the score taken from Equation (1) is compared against an empirically defined threshold.
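For clarity, the score computation and the thresholding step can be sketched as follows, where `mse` holds the per-instance reconstruction errors and `threshold` is the empirically defined value mentioned above.

```python
# Sketch of Equation (1) and the decision rule: higher normality scores are
# expected for normal instances, so scores below the threshold are flagged
# as anomalous.
import numpy as np

def normality_scores(mse: np.ndarray) -> np.ndarray:
    return 1.0 - (mse - mse.min()) / mse.max()

def predict_anomalies(mse: np.ndarray, threshold: float) -> np.ndarray:
    return normality_scores(mse) < threshold  # True -> flagged as anomalous
```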

3.3.3. Pre-Trained 3D-CNN

  • 3D ResNet: The 3D version of ResNet-34 [19] was chosen as one of the baselines in this paper. The input size, optimizer, and learning rate were the same as those used in the previous approaches. The network was built with squeeze-and-excitation [38] layers, which improve the network's representational capacity with a small computational overhead and without increasing the network depth.
  • C3D: This is a generic 3D-CNN proposed in [21] to learn video features by modeling temporal information using pre-trained 3D convolution and 3D pooling operations. It has 8 (3D) convolution layers and 5 pooling layers. The C3D network was originally designed for action recognition; however, it is also very effective in other tasks, such as anomaly detection in surveillance videos [34]. In this work, the same hyperparameters employed for the previous methods were used with this model: input size = 224 × 224; learning rate = 10⁻⁴; and Adam as the optimizer.
  • MoViNet-A2: This is a member of a family of computation- and memory-efficient 3D CNNs (from A0 to A5) recently proposed in [20]. The first three models (MoViNets A0, A1, and A2) are lightweight methods that can be used on mobile devices. The model employed in this work is MoViNet-A2, since the input dimension used with the previous baseline models matches its training resolution (224 × 224). Given its focus on the trade-off between efficiency and accuracy, MoViNet-A2 is expected to achieve lower accuracy than the other supervised CNNs investigated in this paper.

4. Experimental Results

Our experiments can be split into two main series. In the first series, all six investigated deep learning models are compared using the three datasets generated in this work. The second series compares the same six models on an external dataset. The objective of this second series was to study the behavior of the investigated methods when trained and tested on a related but different dataset. This analysis is especially important to test the generalizability as well as the transferability of the representations learned by the generator component proposed in this paper. First, however, we present a description of our experimental protocol.

4.1. Experimental Protocol

The three datasets proposed in this paper are described in Section 3.1. For each dataset, 30% of the samples were reserved as the test partition, while the remaining instances were further divided into training and validation partitions with, respectively, 40% and 30% of the total samples. The total number of examples for each dataset is summarized in Table 2. The selection of samples for each partition was not random because subsequent videos from a given camera position share frames. Considering that 10 different camera positions were used for each device in each dataset, data from different positions were divided into partitions as follows: samples from three positions of each device for the test set; four other positions for the training set; and the remaining three for the validation set. This guarantees that all instances from a given camera position are always in the same partition, preventing information leaks from the training data to the testing data. Exploiting this strategy, the experimental results were validated using k-fold validation (k = 3), choosing different sets of camera positions for each data partition.
Table 2. Amount of instances per partition for each of the produced datasets.
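The sketch below illustrates this position-aware partitioning, assuming every instance carries an integer camera-position label; the particular position sets assigned to each fold are rotated outside the function.

```python
# Sketch of the camera-position-aware split: every instance from a given
# camera position lands in exactly one partition (4 positions per device for
# training, 3 for validation, 3 for testing), rotated across the k = 3 folds.
import numpy as np

def split_by_position(samples, positions, train_pos, val_pos, test_pos):
    positions = np.asarray(positions)
    parts = {"train": train_pos, "val": val_pos, "test": test_pos}
    return {name: [s for s, p in zip(samples, positions) if p in set(pos)]
            for name, pos in parts.items()}
```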
The external dataset investigated in the second series of experiments was the University of Houston Camera Tampering Detection Dataset (UHCTD) [39], proposed to test camera tampering detection methods. Tampering corresponds to an unauthorized or an accidental change in the view of a surveillance camera. In the UHCTD dataset, this includes covering, defocusing and intentionally moving cameras with malicious intent such as committing theft or property damage. In our work, the tampering instances are grouped to form the anomalous class.
The dataset consists of 576 instances of tampering induced over 288 h of video captured by two surveillance cameras whose resolutions are 2048 × 1536 and 1280 × 960, respectively. To maintain the same experimental protocol adopted for the other datasets investigated in this work, the videos were split into 55-frame segments with a 224 × 224 spatial resolution. However, the data partitioning proposed by the UHCTD authors [39] was maintained, using the traditional hold-out validation strategy. Hence, after sampling, 30,252 samples of normal videos and 10,572 samples of anomalous videos were obtained to compose the training set. For the test set, 60,437 and 21,140 instances composed the normal and abnormal classes, respectively. In both sets, the class distribution is nearly 3:1, with anomalies comprising the minority class.

4.2. Comparing Deep Learning Approaches

The experiments conducted to compare the six investigated approaches achieved the results summarized in Table 3. Area under the ROC curve (AUC), accuracy, precision and recall were calculated to determine the performance of the approaches. In addition, the number of processed frames per second (FPS) was also reported to compare the time cost. Values in bold indicate the best result in each dataset.
Table 3. Comparison among the six deep learning 3D methods experimented upon three datasets.
Although the mosaic simulation in the DenserGlitch dataset was expected to be more realistic, all approaches except the autoencoder reached the highest possible performance on this dataset. The justification for this high performance on DenserGlitch, when compared to the results reached on the other datasets, is that DenserGlitch is less complex in nature, as it presents only one type of anomaly (mosaic) and the anomalous samples present a mosaic in all frames. For the two datasets presenting temporal and spatial defects, the proposed 3D-GAN achieved a better area under the curve and higher accuracy and recall than the pre-trained supervised methods and the customized 3D-CNN (cust. 3D-CNN) on the 60frames dataset.
In terms of the 1R dataset, 3D ResNet was superior, while 3D-GAN, C3D and cust. 3D-CNN provided equivalent performances. It is interesting to observe the results attained by MoViNet-A2: despite being a lightweight architecture, it shows in Table 3 a 12% accuracy reduction on the 1R dataset compared to the best-performing model (3D ResNet), and on the 60frames dataset the accuracy difference is much larger. It is important to mention that larger MoViNets (A3–A5) would probably reduce this difference, since it has been shown that MoViNet-A5 attains state-of-the-art results in different applications [20].
These results show that the proposed custom generator module is consistently beneficial for the anomaly detection task using the 3D-GAN, since this approach achieved high performance in all datasets even when using only 25% of the labeled anomalous instances, whilst 3D ResNet, C3D, MoViNet and the cust. 3D-CNN were trained using 100% of the labeled anomalous instances. This is a significant competitive advantage, as it avoids the need for a large set of manually labeled anomalous data.
The results achieved by the autoencoder, on the other hand, were significantly worse than those of the other five methods. Although this was expected, since the autoencoder is an unsupervised approach trained only on normal data, we believe its performance may still be improved by better tuning the threshold value. While this method is strongly dependent on such a hyperparameter, finding an appropriate value without any validation samples of the anomalous class is very challenging. In addition, this parameter is entirely problem dependent.
In terms of time cost, the customized 3D-CNN most likely establishes the best trade-off between accuracy and time cost. This method reached the highest FPS on the 60frames and 1R datasets, whilst the 3D-GAN was better on the DenserGlitch dataset. In contrast, the autoencoder achieved the worst FPS values on all three datasets. The MoViNet-A2 model achieved intermediate FPS values; this is because each MoViNet variant is defined by its input image resolution and frame rate, and the larger MoViNet models are able to reach higher FPS.

4.3. Generalization Analysis

The first series of experiments has shown that the generator module proposed in this work allowed the 3D-GAN to attain results comparable to, and sometimes better than, the supervised methods, despite using only 25% of the labeled anomalous instances. However, it is not a general-purpose generator. This may constitute a limitation, as the generator cannot easily fit a wide range of problems. In order to analyze this aspect, in this second series, we conducted experiments using the external UHCTD dataset.
The results shown in Table 4 indicate a clear performance ordering among the CNN-based methods investigated on the UHCTD dataset. C3D reached the highest AUC, accuracy and precision. 3D ResNet was slightly better than the customized 3D-CNN, except in recall. Finally, MoViNet was 8% worse in AUC than C3D; on the other hand, it provided the highest recall among the CNN-based methods. All supervised methods outperformed the 3D-GAN model, while, again, the autoencoder provided the worst results. As expected, the reason for the significantly degraded performance of the 3D-GAN on UHCTD is probably the problem dependency of the custom generator. Especially noteworthy is the recall rate obtained by the 3D-GAN model, which was the lowest among all investigated models; this result indicates how unsuccessful the 3D-GAN was in identifying all anomalies. In terms of FPS, the customized 3D-CNN was again the best approach.
Table 4. Comparison among the six experimented deep learning 3D methods on the external UHCTD dataset.
However, if the generator is customized for a specific application, e.g., by modifying the defects used to generate the anomalous instances, the final performance is expected to improve. Taking into account that the 3D-GAN method used in this paper is semi-supervised, treating all the available data as normal and requiring only a few anomalous instances to customize and train the generator, the 3D-GAN approach is highly recommended.

5. Conclusions

In this paper, we propose a GAN-based anomaly detection model with a custom generator component that generates samples of videos presenting spatio-temporal defects. The two components (discriminator and generator) are adversarially trained so that the model simultaneously generates anomalies and learns to perform anomaly detection. The proposed method requires only normal data and a few instances of the anomalous class in the training process. We applied our method to three datasets with defects related to TV digital decoding, which were introduced in this paper to simulate real-world industry-oriented failures. Moreover, we investigated the generalizability of the representations learned by the proposed generator on an external dataset created to test camera tampering detection methods. Finally, we compared the results to a custom 3D CNN model, an autoencoder model and three pre-trained CNNs, namely C3D, MoViNet and 3D ResNet.
The main results of our experiments demonstrated that the proposed generator helped the 3D-GAN model to achieve performance comparable to that of the supervised state-of-the-art 3D ResNet and C3D, as well as the lightweight MoViNet, without an increased time cost. However, due to the problem dependency of the proposed custom generator, we observed a significantly degraded performance of the 3D-GAN on the external dataset. In the future, we will try to extend our approach into a more general-purpose method for irregularity detection.

Author Contributions

Conceptualization, E.M.d.S., L.A., R.G., A.S.J., A.P. and R.J.S.B.; methodology, L.A.d.S., E.M.d.S., L.A., R.G. and O.A.d.S.; software, L.A.d.S., N.S.F., L.A., M.V. and A.S.J.; investigation, L.A.d.S., N.S.F., L.A., D.F. and A.S.J.; data curation, L.A.d.S., N.S.F., L.A., D.F., M.V., A.S.J., C.F.S.C., R.J.S.B. and O.A.d.S.; writing—original draft preparation, L.A.d.S., E.M.d.S., N.S.F., L.A. and M.V.; writing—review and editing, E.M.d.S., L.A., R.G., A.P., R.J.S.B., A.S.C. and O.A.d.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The results presented in this paper were sponsored by ENVISION Indústria de Produtos Eletrônicos LTDA under the terms of Brazilian federal law No. 8.387/91 (SUFRAMA). This research was conducted by partners: UFAM/ICOMP, ICTS and ENVISION/TPV.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rauch, E. Industry 4.0+: The Next Level of Intelligent and Self-optimizing Factories. In Advances in Design, Simulation and Manufacturing III, Proceedings of the 3rd International Conference on Design, Simulation, Manufacturing: The Innovation Exchange, DSMIE-2020, Kharkiv, Ukraine, 9–12 June 2020; Springer: Cham, Switzerland, 2020; pp. 176–186.
  2. Bertolini, M.; Mezzogori, D.; Neroni, M.; Zammori, F. Machine Learning for industrial applications: A comprehensive literature review. Expert Syst. Appl. 2021, 175, 114820.
  3. Caggiano, A.; Zhang, J.; Alfieri, V.; Caiazzo, F.; Gao, R.; Teti, R. Machine learning-based image processing for on-line defect recognition in additive manufacturing. CIRP Ann. 2019, 68, 451–454.
  4. Imoto, K.; Nakai, T.; Ike, T.; Haruki, K.; Sato, Y. A CNN-Based Transfer Learning Method for Defect Classification in Semiconductor Manufacturing. IEEE Trans. Semicond. Manuf. 2019, 32, 455–459.
  5. Liu, K.; Li, A.; Wen, X.; Chen, H.; Yang, P. Steel Surface Defect Detection Using GAN and One-Class Classifier. In Proceedings of the 25th International Conference on Automation and Computing (ICAC), Lancaster, UK, 5–7 September 2019; pp. 1–6.
  6. Fu, Q.; Wang, H. A Novel Deep Learning System with Data Augmentation for Machine Fault Diagnosis from Vibration Signals. Appl. Sci. 2020, 10, 5765.
  7. Motamed, S.; Rogalla, P.; Khalvati, F. Data Augmentation Using Generative Adversarial Networks (GANs) for GAN-Based Detection of Pneumonia and COVID-19 in Chest X-ray Images. arXiv 2021, arXiv:2006.03622.
  8. Chaudhari, P.; Agrawal, H.; Kotecha, K. Data augmentation using MG-GAN for improved cancer classification on gene expression data. Soft Comput. 2020, 24, 11381–11391.
  9. Sandfort, V.; Yan, K.; Pickhardt, P.J.; Summers, R.M. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Sci. Rep. 2019, 9, 1–9.
  10. Lu, C.Y.; Rustia, D.J.A.; Lin, T.T. Generative adversarial network based image augmentation for insect pest classification enhancement. IFAC-PapersOnLine 2019, 52, 1–5.
  11. Dai, Q.; Cheng, X.; Qiao, Y.; Zhang, Y. Agricultural Pest Super-Resolution and Identification With Attention Enhanced Residual and Dense Fusion Generative and Adversarial Network. IEEE Access 2020, 8, 81943–81959.
  12. Nayak, R.; Pati, U.C.; Das, S.K. A comprehensive review on deep learning-based methods for video anomaly detection. Image Vis. Comput. 2021, 106, 104078.
  13. PourReza, M.; Mohammadi, B.; Khaki, M.; Bouindour, S.; Snoussi, H.; Sabokrou, M. G2D: Generate to Detect Anomaly. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021.
  14. Kim, J.; Jeong, K.; Choi, H.; Seo, K. GAN-Based Anomaly Detection In Imbalance Problems. In European Conference on Computer Vision, Proceedings of the Computer Vision—ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Bartoli, A., Fusiello, A., Eds.; Springer: Cham, Switzerland, 2020; pp. 128–145.
  15. Cabrera, D.; Sancho, F.; Long, J.; Sánchez, R.V.; Zhang, S.; Cerrada, M.; Li, C. Generative adversarial networks selection approach for extremely imbalanced fault diagnosis of reciprocating machinery. IEEE Access 2019, 7, 70643–70653.
  16. Mao, W.; Liu, Y.; Ding, L.; Li, Y. Imbalanced fault diagnosis of rolling bearing based on generative adversarial network: A comparative study. IEEE Access 2019, 7, 9515–9530.
  17. Zhou, F.; Yang, S.; Fujita, H.; Chen, D.; Wen, C. Deep learning fault diagnosis method based on global optimization GAN for unbalanced data. Knowl.-Based Syst. 2020, 187, 104837.
  18. Arjovsky, M.; Bottou, L. Towards principled methods for training generative adversarial networks. arXiv 2017, arXiv:1701.04862.
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  20. Kondratyuk, D.; Yuan, L.; Li, Y.; Zhang, L.; Tan, M.; Brown, M.; Gong, B. MoViNets: Mobile Video Networks for Efficient Video Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021.
  21. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. arXiv 2014, arXiv:1412.0767.
  22. Deepak, K.; Chandrakala, S. Residual spatiotemporal autoencoder for unsupervised video anomaly detection. Signal Image Video Process. 2021, 15, 215–222.
  23. Chong, Y.S.; Tay, Y.H. Abnormal Event Detection in Videos using Spatiotemporal Autoencoder. arXiv 2017, arXiv:1701.01546.
  24. Sha, L.; Zhiwen, Y.; Kan, X.; Jinli, Z.; Honggang, D. An improved two-stream CNN method for abnormal behavior detection. J. Phys. Conf. Ser. 2020, 1617, 012064.
  25. Mehmood, A. Abnormal Behavior Detection in Uncrowded Videos with Two-Stream 3D Convolutional Neural Networks. Appl. Sci. 2020, 11, 3523.
  26. Lin, W.; Gao, J.; Wang, Q.; Li, X. Learning to detect anomaly events in crowd scenes from synthetic data. Neurocomputing 2021, 436, 248–259.
  27. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning Temporal Regularity in Video Sequences. arXiv 2016, arXiv:1604.04574.
  28. Duman, E.; Erdem, O.A. Anomaly Detection in Videos Using Optical Flow and Convolutional Autoencoder. IEEE Access 2019, 7, 183914–183923.
  29. Peres, R.S.; Azevedo, M.; Araújo, S.O.; Guedes, M.; Miranda, F.; Barata, J. Generative Adversarial Networks for Data Augmentation in Structural Adhesive Inspection. Appl. Sci. 2021, 11, 3086.
  30. Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. Training generative adversarial networks with limited data. arXiv 2020, arXiv:2006.06676.
  31. Wan, Q.; Li, Y.; Cui, H.; Feng, Z. 3D-Mask-GAN: Unsupervised Single-View 3D Object Reconstruction. In Proceedings of the 6th International Conference on Behavioral, Economic and Socio-Cultural Computing (BESC), Beijing, China, 28–30 October 2019; pp. 1–6.
  32. Yu, C.; Wang, Y. 3D-Scene-GAN: Three-dimensional Scene Reconstruction with Generative Adversarial Networks. 2018. Available online: https://openreview.net/forum?id=SkNEsmJwf (accessed on 2 November 2021).
  33. Loper, M.M.; Black, M.J. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 154–169.
  34. Liu, K.; Ma, H. Exploring Background-Bias for Anomaly Detection in Surveillance Videos. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1490–1499.
  35. Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv 2013, arXiv:1312.4400.
  36. Garbin, C.; Zhu, X.; Marques, O. Dropout vs. batch normalization: An empirical study of their impact to deep learning. Multimed. Tools Appl. 2020, 79, 12777–12815.
  37. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. Proc. Track 2010, 9, 249–256.
  38. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2019, arXiv:1709.01507.
  39. Mantini, P.; Shah, S.K. UHCTD: A Comprehensive Dataset for Camera Tampering Detection. In Proceedings of the 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019; pp. 1–8.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
