In this section, we evaluate the performance of our novel method ASBS and compare it to that of the original BGS algorithm and of the original SBS method [30]. First, in Section 4.1, we present our evaluation methodology. This comprises the choice of a dataset along with the evaluation metric, as well as all necessary implementation details about ASBS, such as how we compute the semantics and how we choose the values of the different thresholds. In Section 4.2, we evaluate ASBS when combined with state-of-the-art BGS algorithms. Section 4.3 is devoted to a possible variant of ASBS that includes a feedback mechanism applicable to any conservative BGS algorithm. Finally, we discuss the computation time of ASBS in Section 4.4.

#### 4.1. Evaluation Methodology

For the quantitative evaluation, we chose the CDNet 2014 dataset [12], which is composed of 53 video sequences captured in various environmental conditions, such as bad weather, dynamic backgrounds, and night conditions, as well as different video acquisition conditions, such as PTZ and low frame rate cameras. This challenging dataset is widely employed within the background subtraction community and currently serves as the reference dataset for comparing state-of-the-art BGS techniques.

We compare performances on this dataset according to the overall ${F}_{1}$ score, which is one of the most widely used performance scores for this dataset. For each video, ${F}_{1}$ is computed by:

$$ {F}_{1} = \frac{2\,TP}{2\,TP + FP + FN}, $$

where $TP$ (true positives) is the number of foreground pixels correctly classified, $FP$ (false positives) the number of background pixels incorrectly classified, and $FN$ (false negatives) the number of foreground pixels incorrectly classified. The overall ${F}_{1}$ score on the entire dataset is obtained by first averaging the ${F}_{1}$ scores over the videos, then over the categories, according to the common practice of CDNet [12]. Note that this averaging introduces inconsistencies between overall scores, which can be avoided by using summarization instead, as described in Reference [34]; however, to allow a fair comparison with the other BGS algorithms, we decided to stick to the original practice of Reference [12] for our experiments.

We compute the semantics as in Reference [30], that is, with the semantic segmentation network PSPNet [25] trained on the ADE20K dataset [35] (using the public implementation [36]). The network outputs a vector containing 150 real numbers for each pixel, where each number is associated with a particular object class within a set of 150 mutually exclusive classes. The semantic probability estimate ${p}_{S,t}(x,y)$ is computed by applying a softmax function to this vector and summing the values obtained for the classes that belong to a subset of classes relevant for motion detection. We use the same subset of classes as in Reference [30] (person, car, cushion, box, boot, boat, bus, truck, bottle, van, bag, and bicycle), whose elements correspond to moving objects of the CDNet 2014 dataset.
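As an illustration, this per-pixel computation can be sketched as follows, assuming NumPy and using placeholder class indices (the actual ADE20K ids of the relevant classes are not reproduced here):

```python
import numpy as np

# Hypothetical sketch of the semantic probability estimate p_S,t(x, y): the
# 150 per-pixel PSPNet outputs are turned into a probability distribution by
# a softmax, and the probabilities of the classes relevant for motion
# detection are summed. The indices below are illustrative, not real ADE20K ids.
RELEVANT_CLASSES = [12, 20, 39]  # e.g., person, car, bus (placeholder indices)

def semantic_probability(logits):
    """logits: array of shape (150, H, W) holding the raw network outputs."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=0, keepdims=True)             # softmax over classes
    return probs[RELEVANT_CLASSES].sum(axis=0)            # p_S,t for each pixel

p_s = semantic_probability(np.random.randn(150, 4, 4))
assert p_s.shape == (4, 4) and (p_s >= 0).all() and (p_s <= 1).all()
```

With all-zero logits, the softmax is uniform and each relevant class contributes $1/150$ to the sum.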

For dealing with missing semantics, since the possibilities for combining spatial and temporal sampling schemes are endless, we restricted the study to the case of a temporal sub-sampling of one semantic frame per X original frames; this sub-sampling factor is referred to as $X:1$ hereafter. In other scenarios, semantics could be obtained at a variable frame rate or for variable regions of interest, or even with a mix of these sub-sampling schemes.

The four thresholds are chosen as follows. For each BGS algorithm, we optimize the thresholds $({\tau}_{\mathrm{BG}},{\tau}_{\mathrm{FG}})$ of SBS with a grid search to maximize its overall ${F}_{1}$ score. Then, in a second step, we freeze the optimal thresholds $({\tau}_{\mathrm{BG}}^{*},{\tau}_{\mathrm{FG}}^{*})$ found by the first grid search and optimize the thresholds $({\tau}_{A},{\tau}_{B})$ of ASBS with a second grid search for each pair (BGS algorithm, $X:1$), again maximizing the overall ${F}_{1}$ score. Such a methodology allows a fair comparison between SBS and ASBS, as the two techniques use the same common parameters $({\tau}_{\mathrm{BG}}^{*},{\tau}_{\mathrm{FG}}^{*})$ and ASBS is compared to an optimal SBS method. Note that the $\alpha$ parameter is chosen as in Reference [30].
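The two-stage optimization can be sketched as follows; the threshold grids and the score functions `evaluate_sbs` and `evaluate_asbs` are hypothetical stand-ins for the actual overall ${F}_{1}$ evaluation pipeline:

```python
import itertools

# Hedged sketch of the two-stage grid search described above. The grids and
# score functions are placeholders, not the real evaluation.
def grid_search(score_fn, grid_a, grid_b):
    """Return the pair from grid_a x grid_b that maximizes score_fn."""
    return max(itertools.product(grid_a, grid_b),
               key=lambda pair: score_fn(*pair))

grid = [i / 10 for i in range(11)]

# Stage 1: optimize (tau_BG, tau_FG) for SBS.
def evaluate_sbs(tau_bg, tau_fg):          # placeholder overall F1 score
    return -(tau_bg - 0.3) ** 2 - (tau_fg - 0.8) ** 2

tau_bg_star, tau_fg_star = grid_search(evaluate_sbs, grid, grid)

# Stage 2: freeze (tau_BG*, tau_FG*) and optimize (tau_A, tau_B) for ASBS.
def evaluate_asbs(tau_a, tau_b):           # placeholder overall F1 score
    return -(tau_a - 0.5) ** 2 - (tau_b - 0.6) ** 2

tau_a_star, tau_b_star = grid_search(evaluate_asbs, grid, grid)
```

Freezing the first pair before searching the second keeps the comparison fair: both SBS and ASBS share the same $({\tau}_{\mathrm{BG}}^{*},{\tau}_{\mathrm{FG}}^{*})$.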

The segmentation maps of the BGS algorithms are either taken directly from the CDNet 2014 website (when no feedback mechanism is applied) or computed using the public implementations available in Reference [37] for ViBe [5] and Reference [38] for SuBSENSE [6] (when the feedback mechanism of Section 4.3 is applied).

#### 4.2. Performances of ASBS

A comparison of the performances obtained with SBS and ASBS for four state-of-the-art BGS algorithms (IUTIS-5 [8], PAWCS [7], SuBSENSE [6], and WebSamBe [39]) and for different sub-sampling factors is provided in Figure 4. For the comparison with SBS, we used two naive heuristics for dealing with missing semantic frames as, otherwise, the evaluation would be performed on a subset of the original images, as illustrated in Figure 1. The first heuristic simply copies ${B}_{t}$ into ${D}_{t}$ for frames with missing semantics. The second heuristic uses the last available semantic frame ${S}_{t}$ in order to still apply $\mathrm{rule}\phantom{\rule{0.166667em}{0ex}}1$ and $\mathrm{rule}\phantom{\rule{0.166667em}{0ex}}2$ even when no up-to-date semantic frames are available. Let us note that this last naive heuristic corresponds to using ASBS with ${\tau}_{A}$ and ${\tau}_{B}$ chosen large enough so that the condition on the color of each pixel is always satisfied.
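The per-frame logic of the two heuristics can be sketched as follows, with `apply_sbs_rules` a trivial stand-in for rules 1 and 2 and all names hypothetical:

```python
# Illustrative sketch of the two naive heuristics used for frames without
# semantics when evaluating SBS under an X:1 temporal sub-sampling.
def apply_sbs_rules(b_t, s_t):
    # Placeholder: the real rules 1 and 2 correct B_t pixel-wise using S_t.
    return [max(b, s) for b, s in zip(b_t, s_t)]

def decide_frame(t, b_t, last_s, x, heuristic):
    """Return the final segmentation map D_t for frame index t."""
    if t % x == 0:                      # a fresh semantic frame is available
        return apply_sbs_rules(b_t, last_s)
    if heuristic == "copy_bt":          # heuristic 1: D_t = B_t
        return b_t
    if heuristic == "repeat_st":        # heuristic 2: reuse the stale S_t
        return apply_sbs_rules(b_t, last_s)
    raise ValueError(heuristic)
```

The second branch makes explicit why repeating ${S}_{t}$ equals ASBS with very large ${\tau}_{A}$, ${\tau}_{B}$: the semantic correction is applied unconditionally.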

As can be seen, the performances of ASBS decrease much more slowly than those of SBS as the semantic frame rate decreases and, therefore, remain much closer to those of the ideal case (SBS with all semantic maps computed, that is, SBS 1:1), meaning that ASBS provides better decisions for frames without semantics.

A second observation can be made concerning the heuristic repeating ${S}_{t}$. Its performances become worse than those of the original BGS algorithm for semantic frame rates lower than 1 out of 5 frames, but they are better than those of SBS repeating ${B}_{t}$ for high semantic frame rates. This observation emphasizes the importance of checking the color feature, as done with ASBS, instead of blindly repeating the corrections induced by semantics. The performances for lower frame rates are not represented for the sake of figure clarity, but they keep decreasing linearly to very low values. For example, in the case of IUTIS-5, the performance drops to $0.67$ at $25:1$. In the rest of the paper, when discussing the performances of SBS at different frame rates, we only consider the heuristic copying ${B}_{t}$, as it is the one that behaves best given our experimental setup. Finally, it can be seen that, on average, ASBS with 1 frame of semantics out of 25 frames (ASBS $25:1$) performs as well as SBS, with copy of ${B}_{t}$, with 1 frame of semantics out of 2 frames (SBS $2:1$).

In Figure 5, we also compare the effects of SBS (with ${B}_{t}$ copied into ${D}_{t}$ for frames with missing semantics) and ASBS for different BGS algorithms by looking at their performances in the mean ROC space of CDNet 2014 (the ROC space where the false and true foreground rates are computed according to the rules of Reference [12]). The points represent the performances of the different BGS algorithms whose segmentation maps can be downloaded from the dataset website. The arrows represent the effects of SBS and ASBS for a temporal sub-sampling factor of $5:1$. This choice is motivated by the fact that it corresponds to the frame rate at which PSPNet can produce the segmentation maps on a GeForce GTX Titan X GPU. We observe that SBS improves the performances, but only marginally, whereas ASBS moves the performances much closer to the oracle (upper left corner).

To better appreciate the positive impact of our strategy for replacing semantics, we also provide a comparative analysis of the ${F}_{1}$ score computed only on the frames without semantics. We evaluate the relative improvement of the ${F}_{1}$ score of ASBS, SBS, and the second heuristic (SBS with copies of ${S}_{t}$) compared to the original BGS algorithm (which is equivalent to the first heuristic, SBS with copies of ${B}_{t}$). In Figure 6, we present our analysis on a per-category basis, in the same fashion as in Reference [30]. As shown, the performances of ASBS are close to those of SBS for almost all categories, indicating that our substitute for semantics is adequate. We can also observe that the second heuristic does not perform well and often degrades the results compared to the original BGS algorithm. In this figure, SBS appears to fail for two categories: “night videos” and “thermal”. This results from the ineffectiveness of PSPNet on videos of these categories, as the network is not trained on such image types. Interestingly, ASBS is less impacted than SBS because it refrains from copying some wrong decisions enforced by semantics.

Finally, in Figure 7, we provide the evolution of the optimal parameters ${\tau}_{A}$ and ${\tau}_{B}$ with the temporal sub-sampling factor (in the case of PAWCS). The optimal values decrease with the sub-sampling factor, implying that the matching condition on colors becomes tighter or, in other words, that $\mathrm{rule}\phantom{\rule{0.166667em}{0ex}}A$ and $\mathrm{rule}\phantom{\rule{0.166667em}{0ex}}B$ should be activated less frequently for lower semantic frame rates, as a consequence of the presence of more outdated colors in the color map for later images.

#### 4.3. A Feedback Mechanism for SBS and ASBS

The methods SBS and ASBS are designed to be combined with a BGS algorithm to improve the quality of the final segmentation, but they do not affect the decisions taken by the BGS algorithm itself. In this section, we explore possibilities to embed semantics inside the BGS algorithm, which would otherwise remain blind to semantics. Obviously, this requires crafting modifications specific to a particular algorithm or family of algorithms, which can be effortful, as explained hereinafter.

The backbone of many BGS algorithms is composed of three main parts. First, an internal model of the background is kept in memory, for instance in the form of color samples or other types of features. Second, the input frame is compared to this model via a distance function to classify pixels as background or foreground. Third, the background model is updated to account for changes in the background over time.
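As a rough sketch of this three-part backbone, loosely inspired by sample-based algorithms such as ViBe (all constants and names here are illustrative, not the actual algorithms' parameters):

```python
import random

# Hedged sketch of the generic backbone: (1) a background model of color
# samples kept in memory, (2) a distance-based classification of each pixel,
# (3) a conservative model update.
N_SAMPLES, MATCH_THRESHOLD, MIN_MATCHES = 20, 20, 2
BG, FG = 0, 1

def classify(pixel, model):
    """Part 2: compare the input pixel to the background model samples."""
    matches = sum(abs(pixel - s) < MATCH_THRESHOLD for s in model)
    return BG if matches >= MIN_MATCHES else FG

def update(pixel, label, model):
    """Part 3: conservative update; foreground pixels never enter the model."""
    if label == BG:
        model[random.randrange(len(model))] = pixel

# Part 1: the per-pixel model is a list of color samples kept in memory.
model = [128] * N_SAMPLES
label = classify(130, model)   # close to the stored samples
update(130, label, model)
```

This minimal structure is what the feedback mechanism below plugs into: only the updating mask of part 3 changes.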

A first possibility to embed semantics inside the BGS algorithm is to include semantics directly in a joint background model integrating color and semantic features. This requires formulating the relationships that could exist between them and designing a distance function accounting for these relationships, which is not trivial. Therefore, we propose a second way of doing so, by incorporating semantics during the update, which is straightforward for algorithms whose model updating policy is conservative (as introduced in Reference [5]). For those algorithms, the background model in pixel $(x,y)$ may be updated if ${B}_{t}(x,y)=\mathrm{BG}$, but it is always left unchanged if ${B}_{t}(x,y)=\mathrm{FG}$, which prevents the background model from being corrupted with foreground features. In other words, the segmentation map ${B}_{t}$ serves as an updating mask. As the map ${D}_{t}$ produced by SBS or ASBS is an improved version of ${B}_{t}$, we can advantageously use ${D}_{t}$ instead of ${B}_{t}$ to update the background model, as illustrated in Figure 8. This introduces a semantic feedback which improves the internal background model and, consequently, the next segmentation map ${B}_{t+1}$, whether or not semantics is computed.
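A minimal sketch of this feedback, assuming a sample-based conservative algorithm and hypothetical names; the only change with respect to the original update is which map is passed as the updating mask:

```python
import random

# Hedged sketch of the feedback mechanism: the improved map D_t, rather than
# B_t, decides which pixels may update the background model.
BG, FG = 0, 1

def conservative_update(frame, updating_map, model, n_samples=20):
    """Update per-pixel sample models only where the updating map says BG."""
    for (x, y), label in updating_map.items():
        if label == BG:  # conservative policy: FG pixels never enter the model
            model[(x, y)][random.randrange(n_samples)] = frame[(x, y)]

# Without feedback:  conservative_update(frame, b_t, model)
# With feedback:     conservative_update(frame, d_t, model)   # D_t from ASBS
```

Since only the mask argument changes, the feedback indeed amounts to a pointer replacement, which is consistent with the negligible ${\Delta}_{F}$ reported in Section 4.4.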

To appreciate the benefit of a semantic feedback, we performed experiments for two well-known conservative BGS algorithms, ViBe and SuBSENSE, using the code made available by the authors (see Reference [37] for ViBe and Reference [38] for SuBSENSE). Let us note that the performances for SuBSENSE are slightly lower than those reported in Figure 4, as there are small discrepancies between the performances reported on the CDNet website and those obtained with the available source code.

Figure 9 (left column) reports the results of ASBS with the feedback mechanism on ViBe and SuBSENSE, and compares them to the original algorithm and the SBS method. Two main observations can be made. First, as for the results of the previous section, SBS and ASBS both improve the performances even when the semantic frame rate is low. Also, ASBS always performs better. Second, including the feedback always improves the performances for both SBS and ASBS, and for both BGS algorithms. In the case of ViBe, the performance is much better when the feedback is included. For SuBSENSE, the performance is also improved, but only marginally. This might be due to the fact that ViBe has a very straightforward way of computing the update of the background model while SuBSENSE uses varying internal parameters and heuristics, calculated adaptively. It is thus more difficult to interpret the impact of a better updating map on SuBSENSE than it is on ViBe.

We also investigated to what extent the feedback provides better updating maps to the BGS algorithm. For conservative algorithms, this means that, internally, the background model is built with better features. This can be assessed using the output classification map ${B}_{t}$.

For that purpose, we compared the original BGS algorithm and the direct output (that is, ${B}_{t}$ in Figure 8) of the feedback method when the updating map is replaced by the ${D}_{t}$ obtained by either SBS or ASBS. As can be seen in Figure 9 (right column), using the semantic feedback always improves the BGS algorithm, whether the updating map is obtained from SBS or ASBS. This means that the internal background model of the BGS algorithm is always enhanced and that, consequently, a feedback helps the BGS algorithm to take better decisions.

Finally, let us note that ViBe, which is a real-time BGS algorithm, combined with semantics provided at a real-time rate (about 1 out of 5 frames) and with the feedback from ASBS, has a mean ${F}_{1}$ performance of $0.746$, which is the same performance as the original SuBSENSE algorithm that is not real time [33]. This corresponds to the performance of RT-SBS presented in Reference [31]. Our method can thus help real-time algorithms reach the performances of the top unsupervised BGS algorithms while meeting the real-time constraint, which is a huge advantage in practice. We illustrate our two novel methods, ASBS and the feedback, in Figure 10 on one video of each category of the CDNet 2014 dataset, using ViBe as the BGS algorithm.

One last possible refinement would consist in adapting the updating rate of the background model according to a rule map similar to that of ASBS. More specifically, if ${B}_{t}(x,y)=\mathrm{FG}$ and ${D}_{t}(x,y)=\mathrm{BG}$, we could assume that the internal background model in pixel $(x,y)$ is inadequate and, consequently, increase the updating rate for that pixel. Tests performed on ViBe showed that the performances are improved by this strategy. However, this updating rate adaptation has to be tailored to each BGS algorithm specifically; therefore, we did not consider this final refinement in our experiments. We only evaluated the impact of the feedback mechanism on BGS algorithms with a conservative updating policy and avoided any particular refinement that would have biased the evaluation.
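This refinement could be sketched as follows; the rate values are illustrative placeholders, not tuned parameters:

```python
# Hedged sketch of the last refinement: when B_t flags a pixel as foreground
# but D_t corrects it to background, the model there is likely inadequate, so
# its updating rate is raised. Both rate constants are illustrative.
BG, FG = 0, 1
BASE_RATE, BOOSTED_RATE = 1 / 16, 1 / 2   # probability of updating a pixel

def updating_rate(b_label, d_label):
    """Return the per-pixel model updating rate from the two labels."""
    if b_label == FG and d_label == BG:   # semantics contradicts the model
        return BOOSTED_RATE
    return BASE_RATE
```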

#### 4.4. Time Analysis of ASBS

In this section, we show the timing diagram of ASBS and provide typical values for the different computation durations.

The timing diagram of ASBS with feedback is presented in Figure 11. The inclusion of a feedback has two effects. First, we need to include the feedback time ${\Delta}_{F}$ in the time ${\Delta}_{B}$ needed by the background subtraction algorithm. In our case, as we only substitute the updating map with ${D}_{t}$, the feedback can be implemented as a simple pointer replacement; therefore, ${\Delta}_{F}$ is negligible (in the following, we take ${\Delta}_{F}\simeq 0\phantom{\rule{0.166667em}{0ex}}\mathrm{ms}$). Second, we have to wait for ASBS (or SBS) to finish before starting the background subtraction of the next frame.

Concerning the computation time of BGS algorithms, Roy et al. [33] have provided a reliable estimate of the processing speed of leading unsupervised background subtraction algorithms. They show that the best performing ones are not real time. Only a handful of algorithms are actually real time, such as ViBe, which can operate at about $200\phantom{\rule{0.166667em}{0ex}}\mathrm{fps}$ on the CDNet 2014 dataset, that is, ${\Delta}_{B}=5\phantom{\rule{0.166667em}{0ex}}\mathrm{ms}$. With PSPNet, the semantic frame rate is about 5 to $7\phantom{\rule{0.166667em}{0ex}}\mathrm{fps}$ on an NVIDIA GeForce GTX Titan X GPU, which corresponds to ${\Delta}_{S}\simeq 200\phantom{\rule{0.166667em}{0ex}}\mathrm{ms}$. It means that, for $25\phantom{\rule{0.166667em}{0ex}}\mathrm{fps}$ videos, we have access to semantics about once every 4 to 5 frames. In addition, Table 3 reports the mean execution time per frame ${\Delta}_{D}$ that we observed for SBS and ASBS. These last tests were performed on a single thread running on a single Intel(R) Xeon(R) E5-2698 v4 processor at 2.20 GHz.
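The resulting frame rate follows directly from the per-frame durations; in the small worked example below, the value of ${\Delta}_{D}$ is an assumption chosen for illustration (Table 3 is not reproduced here):

```python
# Worked example of the timing arithmetic: with the feedback, the background
# subtraction of frame t+1 starts only after ASBS has produced D_t, so the
# effective per-frame time is Delta_B + Delta_D (+ Delta_F, taken as 0 ms).
# The Delta_D value used below is an illustrative assumption, not a measurement.
def effective_fps(delta_b_ms, delta_d_ms, delta_f_ms=0.0):
    """Frames per second when BGS and ASBS run sequentially on each frame."""
    return 1000.0 / (delta_b_ms + delta_d_ms + delta_f_ms)

print(effective_fps(5.0, 1.25))  # ViBe at 5 ms plus an assumed 1.25 ms for ASBS
```

With these assumed values, the effective rate is 160 fps, consistent with the order of magnitude discussed for ViBe.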

Thus, in the case of ViBe, we start from a frame rate of about $200\phantom{\rule{0.166667em}{0ex}}\mathrm{fps}$ in its original version and reach about $160\phantom{\rule{0.166667em}{0ex}}\mathrm{fps}$ when using ASBS, which is still real time. This is important because, as shown in Section 4.3, the performance of ViBe with ASBS at a semantic frame rate of 1 out of 5 frames and with feedback is the same as that of SuBSENSE which, alone, runs at a frame rate lower than $25\phantom{\rule{0.166667em}{0ex}}\mathrm{fps}$ [33]. Hence, thanks to ASBS, we can replace BGS algorithms that work well but are too complex to run in real time, and are often difficult to interpret, by the combination of a much simpler BGS algorithm and a processing step based on semantics, regardless of the frame rate of the latter. Furthermore, ASBS is much easier to optimize, as the parameters that we introduce are few in number and easy to interpret. In addition, we could also fine-tune the semantics by selecting a dedicated set of objects to be considered for a scene-specific setup. It is our belief that there is still some margin for further improvements.