Ghost Detection and Removal Based on Two-Layer Background Model and Histogram Similarity

Detecting and removing ghosts is an important challenge for moving object detection because ghosts, once formed, persist indefinitely and degrade overall detection performance. To deal with this issue, we first classified ghosts into two categories according to how they were formed. Then, a sample-based two-layer background model and the histogram similarity of ghost areas were proposed to detect and remove the two types of ghosts, respectively. Furthermore, three important parameters in the two-layer model, i.e., the distance threshold, the similarity threshold of the local binary similarity pattern (LBSP), and the time sub-sampling factor, were automatically determined from the spatial-temporal information of each pixel so as to adapt rapidly to scene changes. Experimental results on the CDnet 2014 dataset demonstrated that our proposed algorithm not only effectively eliminated ghost areas, but was also superior to state-of-the-art approaches in terms of overall performance.


Introduction
With the widespread application of surveillance cameras, a huge amount of video data is generated every day. Methods to automatically and quickly extract information of interest from a video sequence have been extensively studied for decades [1][2][3][4][5]. Among these research works, change detection is a fundamental first step for higher-level computer vision applications such as video surveillance, pedestrian and vehicle tracking, and anomalous behavior recognition. Background subtraction (BS) is one of the most widely used techniques in change detection, and its performance depends mainly on the background modeling method. To date, popular background modeling methods can be classified into four categories: GMM-based [6][7][8][9], sample-based [10][11][12][13], clustering-based [14][15][16][17], and artificial neural network-based [18][19][20][21][22]. Each model has its own advantages and disadvantages. For example, the GMM-based model can handle multimodal distributions, but background pixels do not always follow a Gaussian distribution, and parameter estimation is difficult. The sample-based model shows superiority in speed, yet it cannot efficiently handle dynamic background and noise. The clustering-based model is robust to noise, but only performs well in scenarios without substantial background changes. The artificial neural network-based model can achieve good performance, but requires prior training or manual intervention.
In fact, robust change detection in real surveillance applications still faces great challenges in complex outdoor scenes [23][24][25], such as illumination changes, camera motion, ghost removal, camouflaged object detection, dynamic background suppression, and so on. In particular, ghost detection and removal are rarely discussed in existing methods. This can be attributed to the following two difficulties. First of all, the initialization method of the background model is relatively simple.
The rest of the paper is organized as follows. Section 2 describes our proposed algorithm in detail. The experimental evaluation is presented in Section 3. Finally, Section 4 gives the conclusions.

Methodology
We describe our proposed approach from four aspects. First, we present the sample-based two-layer background model to classify background and foreground in Section 2.1. This model not only suppressed ghosts caused by intermittent motion objects, but also reduced false positives caused by periodic motion background. Second, we show the detection and removal process of ghosts caused by incorrect model initialization based on the histogram similarity and feedback scheme in Section 2.2. Third, we describe how to update the two-layer background model in Section 2.3. Finally, we adaptively determine three important parameters according to spatial-temporal characteristics of the scene itself in Section 2.4.

Sample-Based Two-Layer Background Model and Background/Foreground Classification
Due to the use of the neighborhood diffusion mechanism in sample-based background subtraction, foreground objects that remain motionless for a long time are incorporated into the background. Later, when the foreground object moves away, its initial position is detected as foreground (a ghost). This is because the true background samples are deleted after being maintained only a short time in the model update process. In order to retain background samples for a long time, the methods of [10,11] increased the number of background samples. However, performance degrades when the number of samples exceeds 50. The method of [12] used the feature of the current observation to replace the sample with the minimum weight. However, it cannot quickly adapt to environmental changes.
Unlike the above methods, we presented a sample-based two-layer background model in this paper: the main model BG_a(x) and the candidate model BG_c(x). BG_a(x) could adapt to scene changes. BG_c(x) was composed of the background samples replaced in BG_a(x); thus, the lifespan of background samples was extended by BG_c(x). Specifically, each pixel x was modeled by a set of N_a sample values bg_{a,k}(x) (k = 1, 2, ..., N_a) and a set of N_c candidate sample values bg_{c,k}(x) (k = 1, 2, ..., N_c):

BG_a(x) = { bg_{a,1}(x), bg_{a,2}(x), ..., bg_{a,N_a}(x) }    (1)

BG_c(x) = { bg_{c,1}(x), bg_{c,2}(x), ..., bg_{c,N_c}(x) }    (2)

Similar to SuBSENSE, we also utilized color information and the local binary similarity pattern (LBSP) feature to construct the background model. That is, bg_{a,k}(x) and bg_{c,k}(x) are defined by a six-tuple:

bg_{u,k}(x) = { i_{u,R,k}(x), i_{u,G,k}(x), i_{u,B,k}(x), intra-LBSP_{u,R,k}(x), intra-LBSP_{u,G,k}(x), intra-LBSP_{u,B,k}(x) }    (3)

where u ∈ {a, c}; i_{u,R,k}(x), i_{u,G,k}(x), and i_{u,B,k}(x) are the color intensities of the R, G, and B channels at location x, respectively; and intra-LBSP_{u,R,k}(x), intra-LBSP_{u,G,k}(x), and intra-LBSP_{u,B,k}(x) are the intra-LBSP texture features [11], which can be defined as

intra-LBSP_{u,C,k}(x) = Σ_{p ∈ N(x)} s( i_{u,C,k}(p), i_{u,C,k}(x) ) · 2^p,  with s(a, b) = 1 if |a − b| ≤ g_{u,C,k}(x), and 0 otherwise    (4)

Here, C ∈ {R, G, B} and p indexes the neighboring pixels of x. i_{u,C,k}(x) is the reference value of the intra-LBSP descriptor; i_{u,C,k}(x) and i_{u,C,k}(p) both come from the current frame. g_{u,C,k}(x) is the internal similarity threshold of LBSP, which is discussed in detail in Section 2.4.1.
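The per-pixel bookkeeping of the two-layer model can be sketched as follows. This is an illustrative sketch, not the authors' code: the sample counts N_A and N_C, the plain-list storage, and the FIFO eviction of old candidates are all our own assumptions.

```python
N_A, N_C = 35, 15  # hypothetical sample counts, not the paper's values

class TwoLayerPixelModel:
    """One pixel's two-layer sample model: main set BG_a, candidate set BG_c."""

    def __init__(self, init_samples):
        # main model: seeded (e.g., from the 5x5 neighborhood of frame 1)
        self.bg_a = list(init_samples)[:N_A]
        # candidate model: starts empty (all-zero samples in the paper)
        self.bg_c = []

    def retire(self, sample):
        """Keep a sample replaced in BG_a alive by moving it into BG_c."""
        if len(self.bg_c) >= N_C:
            self.bg_c.pop(0)          # drop the oldest candidate
        self.bg_c.append(sample)
```

The key design point is that `retire` extends a background sample's lifespan instead of deleting it, which is what later lets intermittently visible background re-match the model.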
It is worth mentioning that, at the initial time, bg_{a,k}(x) (k = 1, 2, ..., N_a) were randomly and independently selected from the color and intra-LBSP features of the 5×5 neighborhood pixels of x, while bg_{c,k}(x) (k = 1, 2, ..., N_c) were set to 0. Meanwhile, similar to SuBSENSE, we also added an inter-LBSP descriptor to suppress shadows when the color values of the current frame did not match those of the background samples. The inter-LBSP feature is defined in the same way as the intra-LBSP feature in Equation (4), except that the reference value i_{u,C,k}(x) comes from the color intensity of the background sample rather than the current frame.
When a new input frame I_t (t ≥ 2) arrives at time t, each pixel x was first classified as foreground (f_t(x) = 1) or background (f_t(x) = 0) by matching I_t(x) against its background sample set BG_a(x):

f_t(x) = 0 if #{ k | dist( I_t(x), bg_{a,k}(x) ) ≤ R(x) } ≥ #_min, and f_t(x) = 1 otherwise

Here, # represents the number of elements in the collection, and #_min is the minimal number of matches required for a background classification. i_{R,t}(x), i_{G,t}(x), and i_{B,t}(x) are the color intensities at location x at time t, and intra-LBSP_{R,t}(x), intra-LBSP_{G,t}(x), and intra-LBSP_{B,t}(x) are the intra-LBSP texture features. dist(·, ·) is a distance function between the current observation and a given background sample, which includes two elements: color distance and texture distance. Ham(·, ·) represents the Hamming distance used for the texture element. R(x) is the distance threshold, which includes two elements: the color threshold R_color(x) and the LBSP texture distance threshold R_lbsp(x). The discussion of R(x) is postponed to Section 2.4.2.
Then, if the pixel x was classified as foreground, I_t(x) was further compared with the candidate background model BG_c(x). When I_t(x) matched BG_c(x), it indicated that the pixel x had previously been judged as background; thus, the pixel x was considered a ghost caused by a removed foreground object. The final output segmentation map fg_t can be obtained by

fg_t(x) = 0 if f_t(x) = 1 and I_t(x) matches BG_c(x); fg_t(x) = f_t(x) otherwise    (11)
Figure 2 shows the detection results of five methods on the "sofa" #1742 and "traffic" #1376 frames from the CDnet 2014 dataset [27]. The description of this dataset is postponed to Section 3. The #1742 frame in the sofa sequence included three objects: a light-yellow box, a white plastic bag, and a briefcase. The light-yellow box was a static object on the floor which was then moved onto the sofa. A ghost was left on the floor (marked using a red ◇) in the detection results of SuBSENSE [11] and SWCD [13]. However, our two-layer model not only suppressed ghosts, but also detected the camouflaged static foreground object (marked using a green 〇 in Figure 2) because of the adaptive distance and LBSP thresholds described in Section 2.4. Moreover, periodic background motion often occurred because of camera jitter, as shown in the traffic sequence in Figure 2, producing many false positives (marked using a purple 〇 in Figure 2). Compared with SuBSENSE, PAWCS [16], WeSamBE [12], and SWCD, our model effectively removed these false positive detections. The reason is that the periodic background motion made background samples appear intermittently. In our proposed method, the earlier background samples were stored in the candidate background model, so the dynamic background could be suppressed when Equation (11) was executed. In the other methods, however, these earlier samples could have been deleted, leaving the current observation unmatched with the background model.
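The two-stage test can be sketched for a single pixel as follows. This is a hedged sketch under our own assumptions, not the authors' implementation: the 6-tuple layout (R, G, B, then three LBSP codes stored as integers), the per-channel absolute color difference, and the values of `r_color`, `r_lbsp`, and `n_min` are all illustrative.

```python
def hamming(a, b):
    """Hamming distance between two LBSP codes stored as integers."""
    return bin(a ^ b).count("1")

def matches(obs, sample, r_color, r_lbsp):
    """Match one observation against one background sample (6-tuples)."""
    color_ok = all(abs(obs[i] - sample[i]) <= r_color for i in range(3))
    lbsp_ok = all(hamming(obs[i], sample[i]) <= r_lbsp for i in range(3, 6))
    return color_ok and lbsp_ok

def segment(obs, bg_a, bg_c, r_color, r_lbsp, n_min=2):
    """Return 0 (background) or 1 (foreground) for one pixel."""
    if sum(matches(obs, s, r_color, r_lbsp) for s in bg_a) >= n_min:
        return 0  # matched the main model: ordinary background
    if sum(matches(obs, s, r_color, r_lbsp) for s in bg_c) >= n_min:
        return 0  # matched the candidate model: ghost of a removed object
    return 1      # no match anywhere: foreground
```

An observation that matches only the candidate set is still labeled background, which is exactly how ghosts of removed objects (and intermittently visible dynamic background) are suppressed.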


Detection and Removal of the Second Type of Ghost
The two-layer background model could only remove ghosts caused by the first situation and could not do anything about the second kind of ghosts mentioned in Section 1, since the newly revealed background did not match with BG a (x) and BG c (x). In order to eliminate these ghosts quickly, some literatures [10,11] have increased the neighborhood diffusion rate. However, long-term static foreground objects were also incorporated into the background. In this paper, we eliminated the second kind of ghosts based on feedback mechanism and the histogram similarity. The method was as follows.
Taking the "tunnelExit_0_35fps" video sequence as an example, as shown in Figure 3, we analyzed the formation process of ghosts caused by an object that existed in the first frame. There was a static blue minibus from #000001 to #001683 (marked using a red □) in this video sequence. It started moving from #001684, and a few foreground pixels were detected in the region where the blue minibus was located. Then, foreground pixels continued to increase until the moving object and its ghost separated, as shown in #001685, #001686, and #001687. Finally, a stable ghost region was formed in the #001688 and #001689 frames (marked using a red □). Of course, this process is only a necessary condition for the formation of ghosts, because the same behavior can also occur when a normally moving object becomes motionless. However, as shown in Figure 4d-f, the histograms of the ghost region (marked using a red □ in Figure 4c) had high similarity between the #000001 and #001684 frames. On the contrary, the histogram similarity of a static foreground region is usually very low between the first frame and the frame in which the object appears, because of the difference between background and object. Thus, we could utilize this characteristic to distinguish the two cases. Here, we needed to solve the following four issues.
First, it was necessary to determine how to obtain the stable region. The connected foreground region W_t(x), where pixel x was located at time t, was extracted by 8-neighborhood diffusion. The stable foreground region Y_t(x) was obtained once the difference in the number of pixels of the connected region among three adjacent frames was less than the specified threshold.
where Reg_t(x) is defined as the number of pixels in the connected region. We experimentally set the threshold th_re to 10%. Second, it was necessary to determine the frame number at which the static foreground started moving. We constructed a counter FCMT to count the number of times each pixel was continuously identified as a foreground pixel; FN was then used to record the frame number t at which each pixel x started to move, i.e., the frame at which FCMT became equal to 1. Third, it was necessary to compute the histogram similarity of the stable region between the first frame and the frame forming the ghost. We used the MDPA histogram distance [28] to compute the histogram similarity of the connected regions at location x:

D(H_1, H_2) = Σ_{m=0}^{M−1} | Σ_{j=0}^{m} ( H_1(j) − H_2(j) ) |
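The MDPA distance admits a simple prefix-sum implementation. The sketch below is a generic version for equal-mass histograms (our own code, not the authors'); how the paper normalizes the histograms before comparing against T_hist is not specified here, so the function is kept generic.

```python
def mdpa(h1, h2):
    """MDPA distance between two equal-length histograms:
    the sum over bins of the absolute cumulative (prefix-sum) difference."""
    assert len(h1) == len(h2)
    cum, dist = 0, 0
    for a, b in zip(h1, h2):
        cum += a - b        # running difference of mass up to this bin
        dist += abs(cum)    # cost of shifting that mass to later bins
    return dist
```

Unlike a bin-wise difference, MDPA charges mass by how far it must move between bins, so two histograms that differ only by a small intensity shift stay close.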
Here, C ∈ {R, G, B}; H^C_1(x) and H^C_{FN(x)}(x) denote the histograms of the connected region at location x in the first frame and in the frame where the ghost starts to appear, respectively. The color histogram of each of the R, G, and B channels was quantized into M (M = 64) bins.
If the two histograms were similar, the connected region at location x was considered a ghost.
Here, T_hist was the histogram similarity threshold and was set to 3 experimentally in this paper. Finally, the ghosts were removed: when a pixel was considered a ghost (G_t(x) = 1), it was classified as background, and its background model was reinitialized using the color and intra-LBSP features of the 5×5 neighborhood pixels of the current frame. As shown in Figure 5, our proposed method could effectively remove ghosts caused by incorrect model initialization, whereas a ghost was left (marked using a red □) in the detection results of SuBSENSE.


Background Model Update
In complicated practical scenarios, background changes (e.g., gradual or sudden illumination change, camera jitter, PTZ motion) often occur. Therefore, it was necessary to update the background model to adapt to scene changes after background/foreground classification. In this paper, a conservative strategy was used to select the pixels to be updated, and a random replacement policy was used to select the samples to be replaced. The update process of our two-layer background model (BG_a(x) and BG_c(x)) was as follows.
First, BG_a(x) was updated using the current observation. When a new pixel x in the current frame was classified as background, a randomly selected background sample bg_{a,k}(x) from BG_a(x) had a 1/φ(x) probability of being replaced by the features I_t(x) of the current observation. Meanwhile, at y, a neighbor of x, a randomly picked sample bg_{a,i}(y) from BG_a(y) was also replaced by I_t(x). Here, φ(x) is the time subsampling factor, and rand(0, φ(x)) is a function which returns a random number between 0 and φ(x).
Then, the candidate background model BG_c(x) was updated with the sample replaced in BG_a(x). Specifically, if the difference between the replaced background sample bg_{a,k}(x) and I_t(x) was larger than the distance threshold R(x), bg_{a,k}(x) was stored in BG_c(x) before it was replaced:

bg_{c,j}(x) ← bg_{a,k}(x), if dist( bg_{a,k}(x), I_t(x) ) > R(x)    (19)

It is worth mentioning that the conservative update strategy can cause deadlocks in which false positives (i.e., ghosts) are difficult to eliminate, since only the pixels marked as background are updated. However, our candidate background model could solve the deadlock problem effectively because bg_{a,k}(x) was not actually deleted but stored in BG_c(x), as shown in Equation (19). The current observation may not match the samples in BG_a(x) when an intermittently moving object is removed, but it matches the samples in BG_c(x). Thus, the current observation was still classified as background, and no ghosts were left.
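The stochastic update step can be sketched as below. This is our own simplification, not the authors' code: samples are scalars for brevity, the neighbor-diffusion step is omitted, and the candidate set uses an assumed FIFO eviction with capacity `n_c`.

```python
import random

def update_pixel(obs, bg_a, bg_c, phi, R, n_c=15):
    """Conservative stochastic update of one background-classified pixel.

    With probability 1/phi, replace a random main-model sample with the
    current observation; if the replaced sample differs from obs by more
    than R, retire it into the candidate model (Eq. (19)-style rule).
    """
    if random.randrange(phi) == 0:          # fires with probability 1/phi
        k = random.randrange(len(bg_a))     # random sample to replace
        old = bg_a[k]
        bg_a[k] = obs
        if abs(old - obs) > R:              # sample still informative:
            if len(bg_c) >= n_c:
                bg_c.pop(0)                 # drop the oldest candidate
            bg_c.append(old)                # keep it in the candidate model
```

Setting `phi = 1` forces an update on every call, which is convenient for testing; in practice φ(x) > 1 slows the model's memory decay.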

Parameter Analysis
As stated above, our proposed approach involved three important parameters: Similarity threshold of LBSP g u,C,k (x), distance threshold R(x), and time subsampling factor φ(x). We analyze them in detail in this section.

Similarity Threshold of LBSP
In SuBSENSE, g_{u,C,k}(x) was initialized to g_r · i_{u,C,k}(x), and g_r was set to 0.3 experimentally. Thus, the initial value of g_{u,C,k}(x) depends only on the intensity of the pixel x and not on its location, so it is a global threshold. It is obviously unreasonable to use the same threshold in different scenarios, or in different regions of the same scenario that happen to have the same i_{u,C,k}(x). This case is illustrated by the pixels (451, 637) and (166, 653) of the #000001 frame from the "fall" video sequence of the CDnet 2014 dataset in Figure 6. The pixel (451, 637) is located in a static region, while the pixel (166, 653) is located in a dynamic region. Both of their reference intensities (located in the red □) are 112, so the thresholds of both pixels are set to 33 (0.3 × 112 ≈ 33) at the initial time. In fact, the threshold of the pixel (451, 637) should be set to a smaller value (e.g., 15) to detect the horizontal texture in "154", "141", and "131" (marked using the blue □), while the threshold of the pixel (166, 653) should be set to a larger value (e.g., 50) to suppress the dynamic background. Although g_{u,C,k}(x) is automatically regulated over time in SuBSENSE based on the texture magnitude of the analyzed scene, giving scenes with little texture a smaller threshold than cluttered scenes, the regulation is based on frame-level rather than pixel-level texture magnitude. Thus, g_{u,C,k}(x) remains a global threshold and cannot reflect the characteristics of a pixel or a local area.

As analyzed above, g_{u,C,k}(x) should vary with different scenarios or different regions. In this paper, we adaptively computed g_{u,C,k}(x) based on the mean squared error of the background samples. First, the background samples at location x integrate the spatial-temporal local information of the pixel, since some of them come from the features of previous frames and the others come from the features of neighboring pixels. Then, the distribution of the background sample set in high-contrast regions (e.g., swaying trees, rippling water) is more dispersed than that in low-contrast regions (e.g., road, wall). Therefore, a dynamic background region has a large mean squared error (MSE), while a static background region produces a small value. Since foreground pixels or noise can be incorporated into the background model during the model update, the background samples with large differences (i.e., the maximum and minimum sample values) should not participate in the calculation.
Thus, g_{u,C,k}(x) is defined as

g_{u,C,k}(x) = h_{a,C}(x),  k = 1, 2, ..., N_u    (21)

where

h_{a,C}(x) = (1/n_a) Σ_{q=1}^{n_a} ( i_{a,C,q}(x) − bḡ_{a,C}(x) )²    (22)

Here, n_a is the number of background samples in the set S_a at location x, and bḡ_{a,C}(x) is their mean color value. bg_{a,C,min}(x) and bg_{a,C,max}(x) are the minimum and maximum color sample values, which are excluded from S_a. th_1 is the disturbance threshold and was set to 3 experimentally. g_{u,C,k}(x) was limited to the interval [3, 30]. It is worth mentioning that the main model and the candidate model utilized the same similarity thresholds in this paper, since the main model could better reflect the change of scene.
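A minimal sketch of the Equation (22) idea follows, assuming the outlier rule simply drops the minimum and maximum color samples and the resulting MSE is clamped to [3, 30]; the function name and these details are illustrative, not the authors' implementation.

```python
def adaptive_lbsp_threshold(samples, lo=3.0, hi=30.0):
    """Per-pixel LBSP similarity threshold from the spread of color samples.

    Drops the extreme (min/max) samples as likely outliers, computes the
    MSE of the rest around their mean, and clamps the result to [lo, hi].
    """
    s = sorted(samples)
    core = s[1:-1] if len(s) > 2 else s          # exclude min and max
    mean = sum(core) / len(core)
    mse = sum((v - mean) ** 2 for v in core) / len(core)
    return max(lo, min(hi, mse))                 # clamp to [3, 30]
```

Dropping the extremes is what keeps a single foreground or noise sample from inflating the threshold in a static flat region, while genuinely dynamic regions still produce a large (clamped) value.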
Consider the pixel (37, 102) on the #001430 frame from the "highway" video sequence and the pixel (194, 73) on the #001030 frame from the "sofa" video sequence in Figure 7. Figure 7a,b show the locations of the two pixels in the original frames: the former lies in a dynamic region and the latter in a static region. Figure 7c-h show the color and intra-LBSP features of the background sample set on the blue channel obtained by three methods: our proposed algorithm with and without outliers, and SuBSENSE. Here, the intra-LBSP features were quantified as the number of "1" bits. Table 1 lists the MSEs of the color and LBSP features in the background sample set for the three methods. It is not hard to see from Figure 7f-h that our proposed method obtained richer LBSP texture features at location (194, 73) than SuBSENSE, which is also reflected by a larger LBSP MSE in Table 1. This was beneficial for detecting camouflaged static foreground objects. Then, comparing Figure 7d,e with Figure 7g,h, the color and intra-LBSP features at location (37, 102) were more widely distributed than those at (194, 73) under both our proposed algorithm with outliers and SuBSENSE, so the similarity threshold at location (37, 102) should be set to a larger value. However, the difference between the color MSEs at the two locations was small, because some outliers (i.e., foreground pixels, noise) were included in the background sample set. This easily causes the camouflaged object to be missed, as demonstrated by the missed box in Figure 7o,p (marked by a red □). Thus, these outliers were excluded in our final algorithm (see Equation (22)) in order to detect the camouflaged object (see Figure 7n). Moreover, since the distribution of the background sample set was more dispersed at location (37, 102), the difference between the color MSEs with and without outliers was not obvious there. Therefore, outliers mainly influenced static flat regions.
In addition, although our proposed algorithm without outliers had a smaller color MSE (21.57) than SuBSENSE (24.55) at pixel (37, 102), the dynamic background (marked by a yellow □) was still suppressed, as shown in Figure 7j. This is attributed to our two-layer background model, since the periodic-motion background samples were stored in the candidate model and could match the current pixel.


Distance Threshold
The distance threshold R(x) is an extremely important parameter which adjusts the precision and sensitivity of the background model to local changes. At the initial time, the color threshold R_color(x) is set to 30 in SuBSENSE [11], 23 in WeSamBE [12], and 35 in SWCD [13], and the texture threshold R_lbsp(x) is set to 3 for all experimental scenarios. They are then automatically adjusted over time based on the historical detection results of each pixel. In fact, the initial values R⁰_color(x) and R⁰_lbsp(x) directly influence the final segmentation results, and it is unreasonable to set them to the same value for all scenes. Instead, they should be set to large values in scenes with highly dynamic background and rich texture to reduce false positives, and to small values in static, weakly textured scenes to increase true positives. In this paper, we initialized this parameter according to the dynamic range, background dynamics, and texture complexity of each scene. The dynamic range of a scene reflects the distribution of pixel values: a narrow dynamic range means that the distribution is relatively concentrated and the differences between pixels are small, so a small distance threshold should be selected to detect moving objects. The background dynamics represent background changes, which can be local (e.g., trees swaying in the wind) or global (e.g., camera motion); to ensure that changing background is not detected as foreground, a large distance threshold should be set. Rich texture can improve identification of the foreground, but it increases false positives in background regions with complex texture. This is a tradeoff, so we moderately increased the distance threshold in textured regions.
First, the distribution of the color histogram was used to measure the dynamic range of an image (Equation (25)). If most of the color values of an image are concentrated in a few bins, the dynamic range is small. Here, h and w denote the height and width of the input image, respectively, and H_cc(i) (i = 0, 1, 2, . . . , 255; cc ∈ {L, a, b}) is defined as the frequency of the ith gray level in the Lab color space. K_cc and X_cc represent the proportion of pixels and the number of gray levels with higher frequencies, respectively. Thus, the larger K_cc is and the smaller X_cc is, the more concentrated the distribution of pixel values is. RI effectively reflects the dynamic range of a scene, and its value lies in the interval [0.3, 1.3] in most scenarios.
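The histogram-concentration idea above can be sketched in code. Since Equation (25) is not reproduced here, the way K_cc and X_cc are folded into a single score (fixing the covered pixel proportion and counting the bins needed to reach it) is our assumption, and `dynamic_range_index` is a hypothetical helper name:

```python
import numpy as np

def dynamic_range_index(image_lab, k_top=0.9):
    """Sketch of a dynamic-range measure from histogram concentration.

    For each Lab channel, X_cc counts the smallest number of histogram
    bins needed to cover a fixed proportion k_top of all pixels; a small
    X_cc means the pixel values concentrate in a few bins, i.e., a narrow
    dynamic range. Averaging the normalized bin counts over channels is
    only one plausible combination, not the paper's Equation (25).
    """
    scores = []
    n_pixels = image_lab.shape[0] * image_lab.shape[1]
    for c in range(image_lab.shape[2]):
        hist, _ = np.histogram(image_lab[:, :, c], bins=256, range=(0, 256))
        freq = np.sort(hist)[::-1] / n_pixels          # descending frequencies
        x_cc = int(np.searchsorted(np.cumsum(freq), k_top)) + 1
        scores.append(x_cc / 256.0)                    # normalized bin count
    return float(np.mean(scores))                      # larger => wider dynamic range
```

A flat image yields a score near 1/256, while a uniformly noisy image approaches k_top, matching the intuition that concentrated histograms signal a narrow dynamic range.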
Second, the change of the background is measured by the mean absolute difference M_0 between the first two frames. Generally, there are few moving objects in the first two frames, so M_0 represents the background dynamics of the scene: M_0 is almost 0 in a static scenario and large when dynamic background elements are present or the camera moves.
Here, In_1 and In_2 are the intensities of the first and second frames, respectively. Third, the texture complexity of the scene is measured using the mean value L_0 of the Laplacian texture feature of the first frame. A large L_0 indicates that the scene has strong texture.
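Both scene statistics are simple to compute; below is a minimal sketch (the function names are ours), with a standard 4-neighbor Laplacian kernel standing in for the paper's Laplacian texture feature:

```python
import numpy as np

def background_dynamics(frame1, frame2):
    """M_0: mean absolute intensity difference between the first two frames.
    Close to 0 for a static scene; large when the background moves."""
    return float(np.mean(np.abs(frame1.astype(np.float64)
                                - frame2.astype(np.float64))))

def texture_complexity(frame):
    """L_0: mean absolute Laplacian response of the first frame, computed
    here with the standard 4-neighbor Laplacian stencil."""
    f = frame.astype(np.float64)
    lap = (-4.0 * f[1:-1, 1:-1]
           + f[:-2, 1:-1] + f[2:, 1:-1]
           + f[1:-1, :-2] + f[1:-1, 2:])
    return float(np.mean(np.abs(lap)))
```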
where L_0 ∈ [10, 60] on the CDnet 2014 dataset. Finally, the initial distance threshold was obtained from these three measures. Here, K_0, a, and b are user-defined parameters. Comprehensively considering the results reported by the authors of [11][12][13], we bounded R0_color(x) and R0_lbsp(x) to the intervals [10, 40] and [1, 6], respectively, to suit most practical environments, and we set K_0 = 10, a = 10, and b = 10 in this paper. The modified initial R(x) achieved robust detection against environmental changes.
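One plausible instantiation of this initialization, under the assumption of a linear combination of RI, M_0, and L_0 clipped to the stated intervals [10, 40] and [1, 6]; the exact functional form of the paper's equation and the LBSP scaling are hypothetical:

```python
def initial_thresholds(ri, m0, l0, k0=10.0, a=10.0, b=10.0):
    """Hypothetical sketch: combine the dynamic-range index RI, the
    background dynamics M_0, and the texture complexity L_0 with the
    user-defined weights K_0, a, b, then bound the results to the
    intervals [10, 40] and [1, 6] stated in the paper."""
    def clamp(v, lo, hi):
        return max(lo, min(hi, v))
    m0_norm = min(m0 / 255.0, 1.0)                  # background dynamics in [0, 1]
    l0_norm = clamp((l0 - 10.0) / 50.0, 0.0, 1.0)   # L_0 typically in [10, 60]
    r0_color = clamp(k0 * ri + a * m0_norm + b * l0_norm, 10.0, 40.0)
    r0_lbsp = clamp(r0_color * 6.0 / 40.0, 1.0, 6.0)  # texture threshold, rescaled
    return r0_color, r0_lbsp
```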
Next, the distance thresholds need to be updated after the background/foreground segmentation of each test frame to adapt to gradual background changes. In Reference [11], the threshold was adjusted according to two important indexes: background dynamics and the local segmentation noise level.
The background dynamics of pixel x at time t are measured by a recursive moving average D_min(x). Here, α is the learning rate; α_ST (= 1/25) and α_LT (= 1/100) are the short-term and long-term learning rates, respectively, in SuBSENSE. d_t(x) is the minimal normalized color-LBSP distance between all samples in BG_a(x) and I_t(x). Therefore, D_min(x) ≈ 0 in a completely static background region, and D_min(x) ≈ 1 in dynamic regions and foreground object regions.
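The recursive moving average itself is a one-line update; a sketch with the SuBSENSE learning rates quoted above (the decision of when to apply the short-term versus the long-term rate is left to the caller):

```python
ALPHA_ST = 1.0 / 25.0   # short-term learning rate (SuBSENSE)
ALPHA_LT = 1.0 / 100.0  # long-term learning rate (SuBSENSE)

def update_dmin(dmin, d_t, alpha):
    """One step of the recursive moving average
    D_min(x) <- (1 - alpha) * D_min(x) + alpha * d_t(x)."""
    return (1.0 - alpha) * dmin + alpha * d_t
```

Feeding d_t(x) ≈ 0 (static region) drives D_min(x) toward 0, while d_t(x) ≈ 1 (dynamic or foreground region) drives it toward 1, as the text describes.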
The local segmentation noise level is measured by an accumulator v(x) of blinking pixels (pixels alternately marked as foreground and background over time). Here, ⊕ denotes the XOR operation, and the increment parameter v_incr and decrement parameter v_decr are 1 and 0.1, respectively. v(x) converges to 0 for a stable pixel and grows to a large positive value for constantly changing pixels.
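A minimal sketch of the accumulator update, using the stated v_incr = 1 and v_decr = 0.1 (the function name is ours):

```python
def update_blink_accumulator(v, seg_t, seg_prev, v_incr=1.0, v_decr=0.1):
    """Blinking-pixel accumulator: a pixel whose label flips between
    foreground (1) and background (0) across consecutive frames (XOR of
    the two segmentation bits) is treated as segmentation noise and
    increases v(x); otherwise v(x) decays toward 0."""
    if seg_t ^ seg_prev:               # label flipped -> blinking pixel
        return v + v_incr
    return max(0.0, v - v_decr)        # stable label -> decay toward 0
```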
Based on D_min(x) and v(x), the distance threshold factor r(x) and the distance threshold R(x) were updated for each new frame according to Equations (36)-(38). Unlike Reference [11], we fused the LBSP similarity threshold h_a,C(x) into the update of R_color(x). Since h_a,C(x) is large in dynamic background regions and small in static background regions, the improved distance threshold responds quickly to environmental changes and accelerates the convergence of the algorithm.
where r(x) was initialized to 1 and the weight β was set to a small value (0.1 in this paper). h_a,R(x), h_a,G(x), and h_a,B(x) were updated by Equation (22).
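Since Equations (36)-(38) are not reproduced in this excerpt, the following sketch only illustrates the feedback idea: a SuBSENSE-style grow/shrink rule for r(x) driven by D_min(x) and v(x), with the LBSP similarity threshold h_a,C(x) fused into R_color(x) with weight β. The growth and decay constants are placeholders, not the paper's values:

```python
def update_distance_thresholds(r, dmin, v, h_a,
                               r0_color=30.0, r0_lbsp=3.0, beta=0.1):
    """Hedged sketch of the threshold feedback. r(x) grows while the local
    background is dynamic or blinking and shrinks once the region is
    stable; h_a,C(x) is then fused into R_color(x) with weight beta."""
    if r < (1.0 + 2.0 * dmin) ** 2:            # dynamic/noisy region: expand
        r += 0.01 * v
    else:                                      # stable region: contract
        r = max(1.0, r - 0.01 / max(v, 0.1))
    r_color = r0_color * (r + beta * h_a)      # fuse LBSP similarity threshold
    r_lbsp = r0_lbsp * r
    return r, r_color, r_lbsp
```

Because h_a,C(x) is large in dynamic regions, the fused R_color(x) rises faster there than the factor r(x) alone would allow, which is the convergence-acceleration effect described above.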

Time Subsampling Factor
The time subsampling factor φ(x) is another important parameter in sample-based detection algorithms; it controls the update speed of the background model. A small φ(x) updates the model frequently, so slowly moving objects are assimilated into the background, generating false negatives. Conversely, a large φ(x) makes the background model adapt to background changes slowly, so ghosts are not eliminated for a long time, generating false positives.
φ(x) was initialized to 2, limited to the interval [2, ∞), and updated for each new frame. Meanwhile, to avoid assimilating camouflaged foreground pixels into the background model, edge information was used to regulate the time subsampling factor during neighbor diffusion. More precisely, there is often strong texture information at the border between background and foreground regions, so neighborhood diffusion should slow down at the boundary. That is, we used a large φ(y) to update the background model of pixel y (a neighbor of x) when pixel x was classified as background (fg_t(x) = 0) and the Laplacian texture feature L(y) was larger than a user-defined threshold th_2. Here, m_0 should take a slightly large value to slow the diffusion; it was set to 5 experimentally in this paper.
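The edge-aware slowdown of neighbor diffusion can be sketched as follows; th_2 is the user-defined edge threshold from the text, and the exact way m_0 enlarges φ(y) (a simple multiplication here) is an assumption:

```python
def diffusion_subsampling(phi_y, fg_x, laplacian_y, th2, m0=5.0):
    """Edge-aware neighbor diffusion, sketched: when pixel x is background
    (fg_x == 0) but its neighbor y lies on a strong texture edge
    (L(y) > th2), the effective time subsampling factor for updating y's
    model is enlarged by m0, so diffusion slows down at object boundaries.
    phi(x) is bounded below by 2, per the text."""
    if fg_x == 0 and laplacian_y > th2:
        return max(2.0, m0 * phi_y)    # strong edge: slow down diffusion
    return max(2.0, phi_y)             # elsewhere: keep the usual factor
```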

Dataset and Evaluation Metrics
In order to evaluate the performance of the proposed method, we selected the CDnet dataset [27] as the test dataset. Compared with other datasets (e.g., Wallflower, PETS), the CDnet dataset has two merits. One is the variety of scenarios. The earlier version (the CDnet 2012 dataset) offers 31 real-world scenes (more than 88,000 frames) classified into six video categories: baseline, camera jitter (CJ), dynamic background (DB), intermittent object motion (IOM), shadow, and thermal. In 2014, the dataset was expanded to 53 videos (nearly 160,000 frames, 11 categories). The 22 newly added videos are divided into five categories: bad weather (BW), low frame rate (LF), night videos (NV), PTZ, and turbulence, which pose greater challenges. The expanded dataset covers almost all challenges in change detection. The other merit is the labeled ground-truth masks, which make comparisons between the proposed method and other methods more reliable.
The results were compared and quantified by the following seven metrics [27]: Recall (Re), Specificity (Sp), False Positive Rate (FPR), False Negative Rate (FNR), Percentage of Wrong Classifications (PWC), Precision (Pr), and F-Measure. These are the official metrics for testing the effectiveness of change-detection algorithms.
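From a pixel-level confusion matrix (TP, FP, TN, FN), the seven official metrics are computed as:

```python
def cdnet_metrics(tp, fp, tn, fn):
    """The seven official CDnet metrics from a pixel-level confusion matrix."""
    re = tp / (tp + fn)                             # Recall
    sp = tn / (tn + fp)                             # Specificity
    fpr = fp / (fp + tn)                            # False Positive Rate
    fnr = fn / (tp + fn)                            # False Negative Rate
    pwc = 100.0 * (fp + fn) / (tp + fp + tn + fn)   # Percentage of Wrong Classifications
    pr = tp / (tp + fp)                             # Precision
    fm = 2.0 * pr * re / (pr + re)                  # F-Measure
    return {"Re": re, "Sp": sp, "FPR": fpr, "FNR": fnr,
            "PWC": pwc, "Pr": pr, "FM": fm}
```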

How to Determine N a and N c
We introduced two important parameters of the two-layer background model in Section 2.1: N_a (the number of background samples in the main model) and N_c (the number of candidate background samples). N_a was set to 50, 25, and 35 in SuBSENSE, WeSamBE, and SWCD, respectively; N_c is the parameter introduced in this paper. To determine the two parameters, the relationship among N_a, N_c, and the average F-Measure was analyzed on the CDnet 2014 dataset. We varied N_a and N_c over the interval [0, 50] in increments of 5 to remain consistent with the above-mentioned literature. Here, we only list the results of the bad weather, camera jitter, dynamic background, and low frame rate categories in Tables 2-5, since these four categories represent four typical scenarios: bad weather scenes contain heavy noise and a narrow dynamic range, camera jitter involves periodic, global background motion, dynamic background involves local background motion, and low frame rate scenes exhibit large displacements of moving objects. The blue entries indicate the better results in Tables 2-5. The F-Measure score first improved and then degraded as N_a or N_c increased; therefore, a bigger value is not always better for N_a or N_c, since too many background samples cause the model to be overfitted. By comparing N_c = 0 and N_c ≠ 0 in Tables 2-5, it is not difficult to find that the candidate background model improved the F-Measure score. Taking the camera jitter category as an example, the F-Measure score with N_a = 20 and N_c = 10 was improved by about 2.3% over N_a = 30 and N_c = 0. For scenes with background motion, such as camera jitter and dynamic background, a large N_a was required to store diverse background samples, and a relatively small N_c was set to rapidly adapt to the changing scene. For example, the optimal F-Measure scores were concentrated at N_a = 45 and N_c = 25 in Tables 2 and 3.
Moreover, it can be seen from Table 4 that N_a should be small in relatively static scenes, while a larger N_c should be selected to adapt to a slowly changing background and to retain background samples for a long time for reuse. For example, the F-Measure score was better with N_a = 15 and N_c = 40 in the low frame rate category. In addition, a small N_c was needed in scenes with a narrow dynamic range, as shown in Table 5, because the differences among pixels are not obvious. Consequently, N_a and N_c were determined according to the background dynamics and the dynamic range of the scene. Here, W_p(x) is the p × p neighborhood of x (p = 21 in this paper), f_of(q) is the detection result of the second frame obtained with an optical flow method, and R_dy(x) reflects the strength of the background motion. The value of R_dy(x) lies in the interval [0, 0.5] in most scenarios. According to Tables 2-5, the performance was better when 15 ≤ N_a ≤ 40 and 5 ≤ N_c ≤ 50. Thus, we set m_1 = 10, m_2 = 50, m_3 = 10, and m_4 = 50.
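A sketch of R_dy(x) as the mean optical-flow foreground ratio over the p × p window W_p(x), together with a hypothetical linear mapping from R_dy to (N_a, N_c) that merely respects the reported ranges; the paper's exact equations are not reproduced above, so `sample_counts` in particular is an assumption:

```python
import numpy as np

def background_motion_strength(fof, x, y, p=21):
    """R_dy(x): fraction of pixels flagged as moving by the optical-flow
    result f_of within the p x p neighborhood W_p(x), clipped at borders."""
    h = p // 2
    win = fof[max(0, y - h):y + h + 1, max(0, x - h):x + h + 1]
    return float(np.mean(win))

def sample_counts(r_dy, m1=10.0, m2=50.0, m3=10.0, m4=50.0):
    """Hypothetical linear mapping: strong background motion asks for more
    main-model samples N_a and fewer candidate samples N_c. This form only
    respects the ranges 15 <= N_a <= 40 and 5 <= N_c <= 50 noted in the
    text; it is not the paper's equation."""
    n_a = int(round(m1 + m2 * r_dy))
    n_c = int(round(m3 + m4 * (0.5 - r_dy)))   # R_dy in [0, 0.5] in most scenes
    return n_a, n_c
```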

Threshold Performance Analysis
In order to analyze the effect of the improved thresholds in the proposed method, we list the detection results of the following four cases in Table 6: (1) using the distance threshold, LBSP similarity threshold, and time subsampling factor of SuBSENSE; (2) using the LBSP similarity threshold and time subsampling factor of SuBSENSE together with the improved distance threshold; (3) using the time subsampling factor of SuBSENSE together with the improved distance threshold and LBSP similarity threshold; and (4) using the improved distance threshold, LBSP similarity threshold, and time subsampling factor. It can be seen from Table 6 that Re and the average F-Measure score improved continuously from case (1) to case (4). Meanwhile, by comparing cases (1) and (2) and cases (2) and (3) in Table 6, it is not hard to find that the average F-Measure improved considerably after using the improved distance threshold or the improved LBSP similarity threshold. The reason is that the two thresholds are adaptively determined according to the changes of the scene and region itself, that is, a large threshold in regions with a fast-changing background and a small threshold in static and low-contrast regions. In addition, the average F-Measure did not obviously improve after using the improved time subsampling factor, as shown by cases (3) and (4), because the time subsampling factor is only slightly adjusted.

Ghost Removal
The critical ghosts appeared in the intermittent object motion (IOM) category and in the "tunnelExit_0_35fps" video sequence of the low frame rate category of the CDnet 2014 dataset. In Section 2, we discussed the "sofa" and "tunnelExit_0_35fps" video sequences in Figures 2 and 3, where it can be seen that our proposed method removed the ghosts. To analyze the advantages and disadvantages of our proposed method, we tested more video sequences in Figure 8, including the "abandonedBox," "parking," and "winterDriveway" scenarios in the IOM category. In the "abandonedBox" scenario, a red box lies on the road from the first frame and starts to move at frame #2446. In the "parking" scenario, a white car stands in the parking lot from the first frame and is driven away at frame #1334. Both the red box and the car were used to model the background, so in most detection methods ghosts often occur (marked in pink) after the objects are moved away. However, our proposed algorithm effectively removed these ghosts. It is well known that no algorithm is omnipotent: our proposed method could not remove ghosts when the background changed, as shown in the "winterDriveway" scenario in Figure 8. In future work, we will extend our method to adapt to background changes.

Average Performance on CDnet2014 Dataset
In this section, we demonstrate the effectiveness of our proposed method by comparing its Re, Sp, FPR, FNR, PWC, Pr, and F-Measure with those of state-of-the-art change detection approaches. The changedetection.net (CDnet) website reports detailed segmentation results and evaluation data for dozens of change-detection algorithms. In this paper, we systematically compared the proposed method with several related state-of-the-art methods: SuBSENSE, PAWCS, WeSamBE, SWCD, SharedModel [6], and BSUV-Net [18]. First, the average performance of these algorithms is summarized in Table 7. By observing Table 7, we can see that our proposed method had the highest Recall (0.8456) and the lowest FNR (0.1544). In particular, the Recall obtained by our method outperformed the second-best method (SuBSENSE) by about 3.3%. The Precision achieved by our method ranked second. Meanwhile, the F-Measure (0.7898) improved by 3% compared to SWCD and was even better than that of BSUV-Net. Thus, the proposed method is competitive with the best methods. Note that in Table 7, red-bold entries indicate the best results in a given column, and blue-bold entries indicate the second-best results. Furthermore, the F-Measure of each category is presented in Table 8. The proposed method gave superior results in the bad weather, camera jitter, intermittent object motion, low frame rate, and turbulence categories. In particular, the F-Measure of the camera jitter, low frame rate, and turbulence categories increased by about 4.3%, 5.5%, and 11%, respectively, compared to the second-best method. For the dynamic background and shadow categories, our method ranked second. However, our method did not perform well in the night video and PTZ categories, because the proposed algorithm can only handle background movements within a small range. In Table 8, red-bold entries indicate the best results in a given row, and blue-bold entries indicate the second-best results.
Finally, some visual results for various video sequences are shown in Figure 9. From top to bottom, the sequences are skating, badminton, fall, tramstop, turnpike_0_5fps, backdoor, and turbulence0, drawn from the bad weather, camera jitter, dynamic background, intermittent object motion, low frame rate, shadow, and turbulence categories, respectively. From frame #1141 of the badminton sequence, frame #3189 of the fall sequence, and frame #2580 of the turbulence0 sequence, we can see that our proposed algorithm was less sensitive to dynamic background, camera jitter (periodic background motion), and noise than the other methods because of its large distance threshold. In the "tramstop" scenario, a red box was put on the road at frame #1030 and then kept motionless. Sample-based background subtraction often incorporates such static objects slowly into the background; however, our proposed algorithm still retained part of the foreground object after 2000 frames (shown in green). Meanwhile, as seen from frame #917 of the skating sequence, our proposed algorithm detected the camouflaged person (marked in pink). In addition, our proposed algorithm eliminated weak shadows (marked in red). Therefore, our proposed algorithm was effective in suppressing dynamic background, removing ghosts, and detecting camouflaged objects.

Processing Speed
In this paper, all algorithms ran on a 3.7 GHz AMD Ryzen processor (Advanced Micro Devices, Inc., Sunnyvale, CA, USA; sourced in Chengdu, China) with 16 GB RAM. The experimental code was written and compiled in VS2015 with OpenCV 3.0. Table 9 lists the runtimes of our proposed algorithm and SuBSENSE on three video sequences with different resolutions. Our proposed algorithm was slower than SuBSENSE by about one-fifth on "highway," one-fifth on "skating," and one-third on "fall," because Equations (12) and (22) in our algorithm require extra computation time. The time complexity on "fall" was higher than on "highway" and "skating" because a larger number of background samples was used on "fall." In future work, we will modify the update mechanism in Equation (22) and the connected-region computation strategy to reduce the overall computing time of the algorithm.

Conclusions
In this paper, we proposed a ghost detection and removal method using a sample-based two-layer background model and histogram similarity, which removes the ghosts caused by incorrect model initialization and intermittent object motion. In addition, the added candidate background model decreases false positive detections caused by periodic background motion because it extends the lifespan of background samples. We also modified the color and texture distance thresholds, the internal similarity threshold of the LBSP feature, and the time subsampling factor according to the characteristics of the scene and region; the improved parameters help suppress dynamic background and detect camouflaged objects. Our proposed algorithm proved effective in comparison with other state-of-the-art methods. However, it is not suitable for scenarios with large-scale background changes, such as the night video and PTZ categories, which we will address in future work.