Combining Background Subtraction and Convolutional Neural Network for Anomaly Detection in Pumping-Unit Surveillance

Abstract: Background subtraction plays a fundamental role in anomaly detection for video surveillance: it is able to tell where moving objects are in the video scene. Regrettably, the regularly rotating pumping unit is treated as an abnormal object by background-subtraction methods in pumping-unit surveillance. As an excellent classifier, a deep convolutional neural network is able to tell what those objects are. Therefore, we combined background subtraction and a convolutional neural network to perform anomaly detection for pumping-unit surveillance. In the proposed method, background subtraction was first applied to extract moving objects. Then, a clustering method was adopted for extracting different object types from the many movement-foreground objects, which represent only a few typical targets. Finally, non-pumping-unit objects were identified as abnormal objects by the trained classification network. The experimental results demonstrate that the proposed method can detect abnormal objects in a pumping-unit scene with high accuracy.


Introduction
Anomaly detection in video surveillance has become a public focus. It is an unsupervised learning task that refers to the problem of identifying abnormal patterns or motions in video data [1][2][3]. One of the most effective and frequently used approaches to anomaly detection is to adopt background-subtraction methods in video surveillance. Over the past couple of decades, diverse background-subtraction methods have been presented by researchers to identify foreground objects in videos [4][5][6]. The main idea of background-subtraction algorithms is to build a background model [7], compare the current frame against the background model, and then detect moving objects according to their differences. There are some representative methods. For instance, Stauffer and Grimson proposed a Gaussian mixture model (GMM) for background modeling in cases of dynamic scenes, illumination changes, shaking trees, and so on [8]. Makantasis et al. estimated the thermal responses of each pixel of thermal imagery as a mixture of Gaussians by a Bayesian approach [9]. Barnich et al. applied random aggregation to background extraction and proposed the ViBe (visual background extractor) method [10]. In building a sample-based estimation of the background and updating the background model, ViBe uses a novel random selection strategy through which information can propagate between neighboring pixels [11,12]. Elgammal et al. presented a nonparametric method based on kernel-density estimation (KDE) [13]. In this method, it is not necessary to estimate parameters because the model depends on previously observed pixel values, and there is no need to store the complete data. KDE has been commonly applied to vision processing, especially in cases where the underlying density is unknown. Hofmann et al. proposed the pixel-based adaptive segmenter (PBAS) in 2012 [14]. This algorithm, a nonparametric pixel-based model, combines the advantages of ViBe while making some improvements.
It has realized nonparameter moving-object detection, and it is robust to slow illumination variation. St-Charles et al. proposed self-balanced sensitivity segmenter (SuBSENSE), which uses the principle of sample consistency and a feedback mechanism, which means that this background model can adapt to the diversity of complex backgrounds [15].
These existing background-subtraction methods have been used to detect foreground objects in many applications with good performance. However, in pumping-unit surveillance, the rotating pumping unit is judged as a foreground object when a traditional background-subtraction method is used for anomaly detection. Because the traditional background-subtraction method cannot remove the interference of the rotating pumping unit, the purpose of anomaly monitoring in video surveillance is defeated. On the other hand, intelligent monitoring systems should be capable of detecting unknown object types or unusual scenarios, whereas traditional background-subtraction methods can only provide the regions of abnormal objects without giving their specific category. Thus, the regions of interest, which are extracted from the image background by background-subtraction methods, need further processing.
In recent years, deep learning has made remarkable achievements in the field of computer vision. Deep learning is widely used in image recognition, object detection, and classification [16,17], and has achieved state-of-the-art results in those fields. GoogLeNet [18] is a deep convolutional neural network (CNN) [19]-based system that has been used in object recognition.
In this paper, we combined background subtraction and a CNN for anomaly detection in pumping-unit surveillance. In the proposed method, the background-subtraction method is used to extract motion objects in scenes, and a CNN identifies motion objects. A large quantity of samples is needed to train a deep CNN, but in practical application, it is always hard to provide enough samples. Therefore, a pretrained fine-tuned CNN was used in the proposed method.
The rest of this paper is organized as follows. Section 2 gives a brief introduction of pumping-unit surveillance. Section 3 presents the details of the proposed method. Section 4 shows the experiments on surveillance videos of the pumping unit to verify the validity and feasibility of the proposed method. Finally, conclusions are given in Section 5.

Problem of Pumping-Unit Surveillance
When a background-subtraction method is used for abnormal detection in a pumping-unit scene, the rotating pumping unit is extracted as a foreground object. As shown in Figure 1, the pumping unit is detected as a foreground object just as the vehicle is. It is worth noting that several parts of the pumping unit are detected as the foreground rather than the whole pumping unit. In a normal situation, the rotating pumping unit should not be regarded as an abnormal object. To detect abnormal scenarios, the moving pumping unit should be regarded as part of the background. Therefore, simply using background subtraction is not suitable for abnormal detection in a pumping-unit scene. The problem of pumping-unit surveillance is to detect real abnormal objects, and to recognize and classify those objects. Figure 2 shows the outline of pumping-unit surveillance. Pumping units, vehicles, and pedestrians in pumping-unit scenes should be correctly identified and classified.

Proposed Method
In this section, an intelligent method of pumping-unit surveillance is presented in detail. The surveillance system adopts a centralized distributed architecture. Figure 3 shows the framework of the proposed method, including the training and detection phases. In the front-end processors, the input frame of each pumping-unit monitoring scene is processed by a background-subtraction method to extract moving foreground objects. In the back-end processor, these objects are grouped by a clustering technique and then fed into the pretrained GoogLeNet [18]. Transfer learning is used to retrain GoogLeNet. In this way, the classification network used for the classification and recognition of foreground objects is completed.


Moving-Object Extraction
Background subtraction is the basis of subsequent abnormal detection. In the training phase, the segmentation result obtained by background subtraction is used as a label mask. In the detection phase, the foreground object obtained by background subtraction is used as the input of subsequent recognition and classification. In this way, only the foreground objects need to be judged and classified, rather than recognizing the whole image with a sliding window. Therefore, computation can be reduced and processing speed improved. The advantage of this method is that it is unsupervised; consequently, the performance of background subtraction directly affects classification accuracy. In this paper, SuBSENSE [20], a state-of-the-art unsupervised background-subtraction method, was adopted for extracting the foreground objects in the video. SuBSENSE is a pixel-level background-subtraction algorithm. Its basic idea is to use color and texture features to first detect moving objects, and then to introduce feedback control that adaptively updates the parameters of the background model using the obtained rough segmentation results, so as to achieve better detection. Foreground F can be obtained after video frame I is processed by SuBSENSE:

$$F(i,j)=\begin{cases}1, & \text{if } I(i,j) \text{ is a foreground pixel,}\\ 0, & \text{otherwise,}\end{cases}$$

where i and j are the position coordinates of the pixels. After obtaining the foreground pixels, the connected-component labeling method is used to locate and mark each connected region in the image, so as to obtain the foreground targets O [21]:

$$O=\{\,R \mid R \text{ is a connected region of } F,\ |R|>n\,\},$$

where n is the least number of pixels in a connected region; in this paper, we set n = 150, namely, only connected regions with more than 150 pixels were regarded as foreground objects.
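The connected-component step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes F is already the binary mask produced by a background subtractor, uses SciPy's labeling routine in place of the (unspecified) method of [21], and the function name `extract_objects` is our own.

```python
import numpy as np
from scipy import ndimage

def extract_objects(F, min_pixels=150):
    """Label connected regions of a binary foreground mask F and keep
    only those with more than `min_pixels` pixels (n = 150 in the paper).
    Returns a bounding box (top, left, bottom, right) per kept region."""
    labels, num_regions = ndimage.label(F)  # default 4-connectivity
    objects = []
    for region_id in range(1, num_regions + 1):
        mask = labels == region_id
        if mask.sum() > min_pixels:
            ys, xs = np.nonzero(mask)
            objects.append((int(ys.min()), int(xs.min()),
                            int(ys.max()), int(xs.max())))
    return objects

# Toy mask: one 20x20 block (400 px, kept) and one isolated pixel (dropped).
F = np.zeros((50, 50), dtype=int)
F[5:25, 5:25] = 1
F[40, 40] = 1
boxes = extract_objects(F)
```

Each surviving bounding box can then be cropped from the original frame and passed on to the classifier, while isolated specks of a few pixels are discarded as noise.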

Clustering and Labeling
In the training phase, a large number of moving objects O are extracted by the background-subtraction method. According to prior knowledge, these objects have two characteristics: (1) the objects are numerous; (2) they fall into few categories.
Several parts of the pumping unit are detected as foreground targets, which are classified into the same category. The moving objects that need to be recognized in the pumping-unit monitoring site are divided into three categories: pumping unit, vehicle, and pedestrian. There are many kinds of clustering algorithms that are used to deal with data-structure partition [22][23][24]. In this paper, foreground objects are subdivided into several subcategories by a hierarchical clustering algorithm, and then these subcategories are divided into pumping unit, vehicle, and pedestrian through human intervention, which are used as the training data of GoogLeNet.
Strategies for hierarchical clustering generally fall into two types, agglomerative and divisive [25]. This clustering method uses data linkage criteria to repeatedly merge or split the data, building a hierarchy of clusters. The clustering process is as follows: (1) Assuming that the set of foreground moving objects O = {o_1, o_2, ..., o_k} has k samples, each object in O is resized to a resolution of 224 × 224.
(2) Samples are aggregated by a bottom-up approach, and Euclidean distance is chosen as the similarity measurement between samples:

$$d(o_i, o_j) = \left\| o_i - o_j \right\|_2, \quad i, j = 1, 2, \cdots, k.$$

The linkage criterion uses the average distance between all pairs of objects in any two clusters:

$$d(r, s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} d(o_{ri}, o_{sj}),$$

where r and s are clusters and n_r and n_s are the numbers of objects in clusters r and s, respectively. Similarly, o_ri and o_sj are the ith and jth objects in clusters r and s, respectively.
(3) The pedestrian and vehicle categories in hierarchical clustering are selected separately, and the other categories are classified as part of the pumping-unit category. Figure 4 shows the clustering process of foreground objects.
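The clustering procedure in steps (1)-(3) can be sketched with SciPy's agglomerative clustering, which supports exactly the Euclidean metric and average-linkage criterion defined above. This is an illustrative sketch on toy 2-D points rather than real 224 × 224 object crops; in practice each o_i would be a flattened, resized foreground image.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for the object samples: two well-separated groups,
# e.g. "pedestrian-like" vs. "vehicle-like" feature vectors (hypothetical data).
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(10, 2))
group_b = rng.normal(5.0, 0.1, size=(10, 2))
X = np.vstack([group_a, group_b])

# Agglomerative (bottom-up) clustering with Euclidean distance and
# average linkage -- the same criteria as Equations above.
Z = linkage(X, method="average", metric="euclidean")

# Cut the hierarchy into a chosen number of subcategories; a human then
# maps subcategories to pumping unit / vehicle / pedestrian.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The number of clusters at the cut is a free choice; the paper over-segments into several subcategories first and merges them manually, which is more robust than forcing exactly three clusters.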

Transfer Learning
In traditional machine learning, the training set and test set are required to be in the same feature space and have the same data distribution. However, this demand is not satisfied in many cases, unless plenty of time and effort are spent labeling massive amounts of data. Transfer learning is a branch of machine learning. It can apply knowledge learned from existing data to new problems, which helps avoid many data-labeling efforts. As deep learning develops quickly, transfer learning is increasingly combined with neural networks. In this paper, we used parameter-based transfer learning to address the lack of abundant labeled pumping-unit image samples.
In the classification application of pumping-unit monitoring, it is very time consuming to train a new neural network from scratch, and the training data are not rich enough to train a deep neural network with strong generalization ability. To address this problem, transfer learning is desirable. For the past few years, transfer learning has been widely applied in various fields [26,27]. Pretrained models are usually based on large datasets, which effectively expands our training data, makes the model more robust, improves generalization ability, and saves training time. The weights of the pretrained network are used for initialization and then fine-tuned on the new data. Compared with retraining the network weights from scratch, this method can achieve better accuracy.

GoogLeNet is a pretrained convolutional neural network; it was trained on ImageNet [28], which contains over a million images. In this paper, GoogLeNet was retrained on pumping-unit data to classify the objects extracted in the pumping-unit scene. Figure 5 shows the architecture of the fine-tuned GoogLeNet. The last three layers of GoogLeNet are replaced with a fully connected layer, a softmax layer, and a classification output layer. These three layers combine the general features of the objects extracted by the network, and convert them into probabilities over the different category labels. The size of the final fully connected layer was set to 3, the same as the number of object categories in the pumping data. Then, the earlier layers in the network were frozen, that is, in subsequent training, the learning rate of these layers was set to 0 and their weight parameters were kept unchanged. Freezing earlier layers not only speeds up training, but also prevents overfitting on the pumping data. In this paper, the layers before inception 5a were frozen, and the layers after it were retrained. The loss function is the cross-entropy loss, and an L2 regularization term on the weights was added to alleviate overfitting. Thus, the objective function was as follows:

$$w^{*} = \arg\min_{w}\left(-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij}\log y_{ij} + \frac{\lambda}{2}\left\| w \right\|^{2}\right),$$

where m is the number of samples, n is the number of classes, t_ij is the indicator that the ith sample belongs to the jth class, w is the weight vector, and λ is the regularization factor. y_ij is the value from the softmax function, which is the output of sample i for class j:

$$y_{ij} = \frac{\exp(z_{ij})}{\sum_{l=1}^{n}\exp(z_{il})},$$

where z_ij is the input to the softmax layer for sample i and class j.
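As a sanity check of the objective, the softmax and the regularized cross-entropy loss described above can be written directly in NumPy. This is a didactic sketch of the formulas only, not the retraining code; in the paper the loss is minimized over GoogLeNet's unfrozen layers.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax: y_ij = exp(z_ij) / sum_l exp(z_il)."""
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def objective(z, t, w, lam):
    """Cross-entropy loss with L2 weight regularization:
    J = -(1/m) * sum_ij t_ij * log(y_ij) + (lam/2) * ||w||^2."""
    m = z.shape[0]                 # number of samples
    y = softmax(z)                 # predicted class probabilities
    cross_entropy = -np.sum(t * np.log(y)) / m
    return cross_entropy + 0.5 * lam * np.sum(w ** 2)

# One confident, correct prediction: loss is near zero without regularization.
z = np.array([[10.0, 0.0, 0.0]])   # logits for 3 classes (pump/person/vehicle)
t = np.array([[1.0, 0.0, 0.0]])    # one-hot true label
w = np.array([1.0])                # stand-in weight vector
loss_plain = objective(z, t, w, lam=0.0)
loss_reg = objective(z, t, w, lam=2.0)  # adds (2/2)*||w||^2 = 1.0
```

Setting λ > 0 strictly adds (λ/2)‖w‖² to the loss, which is what discourages large weights during fine-tuning.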

Experiments

In this section, four surveillance videos of pumping units were used to test the performance of the proposed method. Table 1 shows the details of these video datasets. Several performance indicators were used to quantitatively evaluate the performance of the classification model [29]:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{Precision}=\frac{TP}{TP+FP},\qquad \mathrm{Recall}=\frac{TP}{TP+FN},$$

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. The higher the value of these indicators, the better the performance of the classification model.
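Assuming the indicators are the usual ones derived from the TP, TN, FP, and FN counts (accuracy, precision, recall, and the F1 score that combines the last two), they can be computed as follows; the function name is ours.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard indicators from binary confusion counts.

    accuracy  = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = harmonic mean of precision and recall
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one class (not the paper's actual numbers):
acc, prec, rec, f1 = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

For the three-class problem these per-class indicators are computed one-versus-rest and then averaged over the repeated runs.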

Foreground Detection
The input video frame was segmented into foreground and background by the SuBSENSE algorithm, and multiple foreground objects were extracted. SuBSENSE [15,20] combines the color and local binary-similarity pattern features to detect moving objects. This method outperformed all previously tested state-of-the-art unsupervised methods on the CDnet [30] dataset. As a famous benchmark dataset, CDnet provides ground truths for all video frames that range over diverse detection challenges such as dynamic background and various lighting conditions. Based on its excellent performance, SuBSENSE was used to extract the moving objects. Figure 6 presents the results of background subtraction. As can intuitively be seen, the segmentation results of SuBSENSE outperformed other methods. In foreground detection, several parts of the pumping unit were normally detected as the foreground rather than the whole pumping unit. The reason is that pumping units have a large scale along with periodic rotation in surveillance scenes. Some parts of the pumping unit are judged as background by background-subtraction methods.

Pumping-unit surveillance is a long-term task; therefore, the background-subtraction method has to address changing light conditions. In order to further verify the foreground-extraction ability of the background-subtraction method under changing light conditions, a long-term video was tested. Figure 7 shows the background-subtraction results under varying light conditions. As can intuitively be seen, the detected foreground region of the pumping unit is insensitive to the changing light. The experimental results show that SuBSENSE is able to eliminate the interference caused by gradual changes in lighting conditions.

Object Classification

Through the clustering method mentioned in Section 3.2, the foreground objects were classified into three categories: pumping unit, person, and vehicle. In total, 1200 images were randomly selected as the image dataset to train and verify the performance of the classification network, including 500 images of the pumping unit, 500 person images, and 200 vehicle images. In the monitoring video, there were a large number of foreground objects but a small number of typical targets, which means that each category of targets appeared repeatedly. Thirty percent of the images in the image dataset were randomly selected as the training set, and the remaining 70% were used as the testing set. The training process of the classification network is shown in Figure 8. The model tends to converge after 50 training iterations, and the trained model achieves high accuracy and low loss.

The classification network obtained by fine-tuning GoogLeNet was used for moving-object detection in the pumping-unit monitoring scene. Figure 9 shows the classifications of the moving objects in the scene identified by the classification network. After the moving objects are recognized and classified, the pumping unit is not regarded as an abnormal object, while persons and vehicles are output as abnormal objects. If there is no moving pumping unit among the detected foreground objects, it means that the pumping unit has stopped working, and an abnormal alarm should be given.

To evaluate the proposed method, histogram of oriented gradients (HOG) features and a multiclass support vector machine (SVM) classifier were used for comparative experiments. SVM is a classical classification method, while HOG is a feature descriptor used for object detection in computer vision and image processing; it forms features by computing statistics of oriented gradients in local areas of the image. HOG features combined with SVM classifiers have been widely used in image recognition [31]. The confusion matrices of the retrained network and the SVM are presented in Figures 10 and 11, respectively. The classification results for the three classes are listed in Table 2. To ensure confidence in the experimental results, the experiment was repeated 10 times, and the average values of each metric are reported. The overall accuracy of the proposed method was 0.9988, while that of the SVM was 0.9500. In the application of pumping-unit monitoring, the performance of the proposed method was obviously better than that of the classical SVM with HOG features.

Conclusions
On-site monitoring of pumping units is a typical monitoring scene in which there is interference from periodically moving objects. The traditional background-subtraction method cannot satisfy the requirements of anomaly monitoring in this scenario. In the proposed method, background subtraction extracts possible abnormal targets. The pretrained CNN has strong generalization and transferability, and only needs a small number of samples and limited computing resources for retraining. After being trained by transfer learning, the network can be used to detect abnormal targets in a pumping-unit scene. The experimental results show that the proposed method can identify real foreground objects with high accuracy.
