This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license.

Low-cost systems that can obtain a high-quality foreground segmentation almost independently of the existing illumination conditions for indoor environments are very desirable, especially for security and surveillance applications. In this paper, a novel foreground segmentation algorithm that uses only a Kinect depth sensor is proposed to satisfy the aforementioned system characteristics. This is achieved by combining a mixture of Gaussians-based background subtraction algorithm with a new Bayesian network that robustly predicts the foreground/background regions between consecutive time steps. The Bayesian network explicitly exploits the intrinsic characteristics of the depth data by means of two dynamic models that estimate the spatial and depth evolution of the foreground/background regions. The most remarkable contribution is the depth-based dynamic model that predicts the changes in the foreground depth distribution between consecutive time steps. This is a key difference with regard to visible imagery, where the color/gray distribution of the foreground is typically assumed to be constant. Experiments carried out on two different depth-based databases demonstrate that the proposed combination of algorithms is able to obtain a more accurate segmentation of the foreground/background than other state-of-the-art approaches.

Video surveillance in indoor environments is an active focus of research because of its high interest for the security industry [

Most of the existing works have been designed to operate with visible imagery (gray or color images). State-of-the-art algorithms have achieved a great performance in the presence of challenging situations, such as changes in illumination, shadows, color camouflage and non-static background regions [

Systems based on IR imagery (near-infrared spectrum, very close to the visible band) have the advantage that they can work in nighttime situations [

Systems based on thermal sensors [

Surveillance systems based on low-cost depth sensors, such as the Microsoft Kinect or the ASUS Xtion, are an excellent alternative. They achieve an excellent tradeoff among the following three aspects: installation and settings, cost and quality of the segmented foreground. The installation and setup are quite simple, since they do not have any specific illumination requirements: they can work independently of the existing illumination in the indoor environment, even in total darkness, as discussed in recent reviews [

In spite of the aforementioned advantages of surveillance systems based on low-cost depth sensors, there are no works that perform a high-quality foreground segmentation using depth data exclusively, to the best of the authors' knowledge. This claim is also confirmed by a recent review [

Although the combination of multiple sources of imagery can improve the foreground segmentation, the cost, installation and complexity of the system are significantly increased. This fact motivates the interest in developing systems based only on depth sensors, which achieve a more appealing tradeoff of the system characteristics and requirements. Although a depth-based foreground segmentation is possible using some of the depth background models involved in the previous works that combine multiple kinds of imagery, the obtained segmentation results are not completely satisfactory. The main reason is that they have applied background subtraction algorithms conceived for color imagery to depth imagery, without specifically addressing the problems of depth sensors: the limited depth range in the acquisition process, the lack of depth information in some regions due to reflections [

In this paper, we propose a combination of two algorithms to obtain a high-quality foreground segmentation using only depth data information acquired by a Kinect sensor (first generation), which is ideal for security/surveillance indoor applications that have to deal with situations of low, unpredictable or no lighting. The first algorithm is the classic MoG algorithm adapted for depth imagery. The second algorithm is based on a Bayesian network, which explicitly exploits the intrinsic characteristics of the depth data. This Bayesian network is able to accurately predict the FG/BG regions between consecutive time steps using two dynamic models, which encode the spatial and depth evolution of the FG/BG regions. The most important contribution of the paper is the proposed depth-based dynamic model. Unlike the case of visible imagery, where the color/gray distribution of the foreground is assumed to be constant (at least in certain periods of time), the depth distribution can significantly change between consecutive time steps, because of the motion of the foreground objects themselves. To the best of the authors' knowledge, there is no proposal in the literature that deals with this problem. In the Kinect-based video surveillance re-identification system presented in [

The organization of the paper is as follows. A general overview of the proposed FG/BG segmentation algorithm is presented in Section 2. The proposed Bayesian network for the estimation of the FG/BG probabilities using spatial and depth correlation properties is described in Section 3, along with the applied approximate inference technique. The results obtained from testing the proposed foreground segmentation algorithm with two public databases are presented in Section 4. Finally, conclusions are drawn in Section 5.

The proposed FG/BG segmentation system consists of three modules (see

The MoG-BS module is based on the algorithm presented in [ ], which models every pixel of the depth image, DI_{t}, with a mixture of Gaussians adapted to depth data. At every time step, it provides a per-pixel FG/BG probability, p_{MoG–BS}(FG_{t}), along with a depth-based representation of the most probable background, B_{t}.

The BN-FBP module also estimates per-pixel probabilities of FG/BG, but using a region-based approach that exploits the spatial and depth correlations of the FG/BG regions in depth imagery across time. The probability of every pixel of being the foreground or background according to this module is represented by p_{BN–FBP}(FG_{t}).

The third and last module computes the final FG/BG segmentation by combining the pixel-wise probabilities, p_{MoG–BS}(FG_{t}) and p_{BN–FBP}(FG_{t}), into a combined probability, p_{comb}(FG_{t}). The binary segmentation is then obtained by thresholding p_{comb}(FG_{t}).
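As an illustration of this combination stage, the following sketch merges two per-pixel foreground probability maps and thresholds the result. The product-based fusion rule and the 0.5 threshold are illustrative assumptions, not the exact expressions used by the system.

```python
import numpy as np

def combine_probabilities(p_mog, p_bn):
    """Combine two per-pixel foreground probability maps.

    A normalized product (log-opinion pool) is assumed here:
    p_comb = p1*p2 / (p1*p2 + (1-p1)*(1-p2)).
    """
    num = p_mog * p_bn
    den = num + (1.0 - p_mog) * (1.0 - p_bn)
    return num / np.clip(den, 1e-12, None)

def segment(p_comb, threshold=0.5):
    """Threshold the combined probability to obtain a binary FG mask."""
    return p_comb >= threshold

# Toy 2x2 probability maps from the two modules.
p_mog = np.array([[0.9, 0.2], [0.6, 0.1]])
p_bn = np.array([[0.8, 0.3], [0.4, 0.2]])
p_comb = combine_probabilities(p_mog, p_bn)
mask = segment(p_comb)
```

Note that the product rule reinforces pixels on which both modules agree, which is the qualitative behavior described above.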

A description of the proposed Bayesian network for the estimation of the FG/BG probabilities is presented in Section 3.1. The derivation of the posterior joint probability density function (pdf) related to the Bayesian network is presented in Section 3.2. The spatial and depth dynamic models involved in the derivation of the previous posterior joint pdf are described in Sections 3.3 and 3.4, and the observation model that relates them to the depth data is presented in Section 3.5. Lastly, the process of inference, which is used to obtain an accurate approximation of the posterior joint pdf and, thus, the desired FG/BG probabilities, is explained in Section 3.6.

A novel Bayesian network for the estimation of the FG/BG probabilities has been designed, which takes advantage of the spatial and depth correlation properties of the depth imagery across time. The goal is to estimate p_{BN–FBP}(FG_{t}), the per-pixel probability of belonging to the foreground or background at the current time step.

The variable FG_{t} is the FG/BG binary image segmentation at the current time step: each of its N_{DI} pixels takes a binary FG/BG value. According to the network, its estimation depends on the previous segmentation, FG_{t−1}, and on the observations derived from the current depth image, DI_{t}.

The variable FG_{t−1} is the estimated FG/BG image segmentation at the previous time step. The relationship between FG_{t} and FG_{t−1} is determined by a spatial dynamic model described in Section 3.3, which predicts the location of the foreground regions at the current time step.

The variable DH_{t} is the set of depth histograms computed from the current depth image, DI_{t}: one histogram, dh_{t}, per local region. It constitutes the observation of the network: it relates the depth data to the hypothesized segmentation, FG_{t}, through an observation model that also involves the predicted FG/BG appearances, FGH_{t} and BGH_{t}.

The predicted depth-based appearances of the FG/BG regions involved in the above observation model are represented by the variables FGH_{t} and BGH_{t}, the sets of depth histograms that model the foreground and background appearance at the current time step. They are predicted from FGH_{t−1} and BGH_{t−1}, which can be computed from the available FG/BG segmentation at the previous time step, FG_{t−1}.

The estimation of FGH_{t} is based on FGH_{t−1} using a depth-based dynamic model for foreground regions described in Section 3.4. The variable FGH_{t−1} is obtained by computing the local depth histograms from the regions of DI_{t−1} that were segmented as the foreground at the previous time step.

Similarly to the estimation of FGH_{t}, the variable BGH_{t} is predicted from BGH_{t−1} using a depth-based dynamic model for background regions described in Section 3.4. The variable BGH_{t−1} is obtained by computing the local depth histograms from the regions of B_{t−1}, the depth-based representation of the most probable background provided by the MoG-BS module. The advantage of using B_{t−1} (instead of DI_{t−1}) is that a more accurate background appearance model is obtained, since B_{t−1} is free of foreground objects.

From a Bayesian perspective, the goal is to estimate the posterior joint pdf, p(FG_{t}, FGH_{t}, BGH_{t}, ST ∣ FG_{t−1}, FGH_{t−1}, BGH_{t−1}, DI_{t}), where ST is the expected shift in depth of the foreground regions. Applying the chain rule and the conditional independencies encoded by the Bayesian network, this posterior can be factorized into the probability terms described below.

The probability term, p(FG_{t} ∣ FG_{t−1}), encodes the prior knowledge about which regions could be FG/BG given the previous FG/BG estimation. Its expression is defined by a spatial dynamic model described in Section 3.3.

The probability terms, p(FGH_{t} ∣ FGH_{t−1}, ST) and p(BGH_{t} ∣ BGH_{t−1}), predict the depth-based appearance of the foreground and background regions between consecutive time steps. The dynamic models involved in the prediction of the depth-based FG/BG appearances are described in Section 3.4.

The last probability term, p(DH_{t} ∣ FG_{t}, FGH_{t}, BGH_{t}), is the observation model: it evaluates the coherence between the observed depth histograms, DH_{t}, and the hypothesized segmentation, FG_{t}, given the predicted appearances, FGH_{t} and BGH_{t}.

Finally, the desired FG/BG probability, p_{BN–FBP}(FG_{t}), is obtained by marginalizing the posterior joint pdf over FGH_{t}, BGH_{t} and ST.

The spatial dynamic model for the foreground regions is based on a proximity concept: pixels close to the foreground regions at the previous time step are likely to belong to the foreground at the current one. Let X_{FG,t−1} be the set of pixels segmented as the foreground at t − 1. The foreground spatial prior pdf of a pixel is obtained by evaluating Gaussian functions, N(x; x_{FG,t−1}, Σ_{spa}), centered at the pixels x_{FG,t−1} ∈ X_{FG,t−1}, where the covariance matrix, Σ_{spa}, encodes the expected spatial displacement of the foreground regions between consecutive time steps.

On the other hand, the background spatial prior pdf of one pixel is just the complementary value of the foreground one:

Given the FG/BG spatial prior pdf of every pixel, the FG/BG prior pdf of the whole image, p(FG_{t} ∣ FG_{t−1}), is computed as the product of the per-pixel priors over the N_{DI} pixels of the image:
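The spatial prior above can be sketched as follows; an isotropic covariance (a single standard deviation, sigma_spa) and a brute-force maximum over the Gaussian kernels are simplifying assumptions for illustration.

```python
import numpy as np

def foreground_spatial_prior(prev_fg_mask, sigma_spa=2.0):
    """Per-pixel foreground spatial prior from the previous FG mask.

    Each pixel's prior is the maximum of Gaussian kernels (isotropic,
    std sigma_spa) centered at the previously-foreground pixels, so
    pixels near the old foreground receive a high prior. Brute-force
    version, for illustration only.
    """
    h, w = prev_fg_mask.shape
    fg_coords = np.argwhere(prev_fg_mask)      # (N, 2) previous FG pixels
    prior = np.zeros((h, w))
    if fg_coords.size == 0:
        return prior
    ys, xs = np.mgrid[0:h, 0:w]
    for (fy, fx) in fg_coords:
        d2 = (ys - fy) ** 2 + (xs - fx) ** 2
        prior = np.maximum(prior, np.exp(-d2 / (2.0 * sigma_spa ** 2)))
    return prior

prev_mask = np.zeros((9, 9), dtype=bool)
prev_mask[4, 4] = True                          # single FG pixel at the center
prior_fg = foreground_spatial_prior(prev_mask)
prior_bg = 1.0 - prior_fg                       # complementary BG prior
```

The complementary background prior in the last line mirrors the relationship stated in the text.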

The depth-based appearance dynamic model for the foreground regions is based on the following concept: the depth values of foreground regions between two consecutive time steps are assumed to be close to each other. Thus, the prediction of the appearance of a foreground region, represented by a depth histogram, fgh_{t}, is obtained by shifting the previous histogram, fgh_{t−1}, along the depth axis: fgh_{t}(d) = fgh_{t−1}(d − ST), where d indexes the histogram bins (of resolution Δ_{FGH}) and ST is the depth displacement of the region.

On the other hand, the expected depth displacement of a foreground region, ST, is modeled by a zero-mean Gaussian:

The prediction of the appearance of the whole foreground, represented by the set of depth histograms, FGH_{t}, is computed as the product of the predictions of its individual regions:
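A minimal sketch of this foreground prediction follows: the previous histogram is shifted along the depth axis by a displacement drawn from a zero-mean Gaussian. The bin width and the standard deviation of the shift are illustrative values, not the system's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_histogram(hist, shift_bins):
    """Shift a depth histogram along the depth axis by shift_bins bins,
    filling the vacated bins with zeros (mass leaving the range is lost)."""
    out = np.zeros_like(hist)
    n = len(hist)
    if shift_bins >= 0:
        out[shift_bins:] = hist[:n - shift_bins]
    else:
        out[:n + shift_bins] = hist[-shift_bins:]
    return out

def predict_fg_histogram(hist_prev, sigma_st=0.10, bin_width=0.05):
    """Predict the foreground histogram at t by shifting the one at t-1
    with a depth displacement drawn from a zero-mean Gaussian (meters)."""
    st = rng.normal(0.0, sigma_st)              # sampled depth shift ST
    return shift_histogram(hist_prev, int(round(st / bin_width)))

h_prev = np.array([0.0, 1.0, 0.0, 0.0])
h_shifted = shift_histogram(h_prev, 1)          # moves mass one bin deeper
```

Shifting the whole histogram preserves the region's depth profile while letting it move closer to or further from the sensor, which is exactly the assumption stated above.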

The prediction of the appearance of the background, represented by the set of depth histograms, BGH_{t}, assumes that the background appearance remains approximately constant between consecutive time steps: every histogram, bgh_{t}, is predicted to stay close to bgh_{t−1}, allowing only small variations of the order of the bin resolution, Δ_{BGH}.

The observation model evaluates the degree of agreement/coherence between the set of depth histograms, DH_{t}, computed from the current depth image and the predicted FG/BG appearances, FGH_{t} and BGH_{t}, given a hypothesized segmentation, FG_{t}. For every local region, the observed histogram, dh_{t}, is compared with the predicted foreground or background histogram (depending on the hypothesized FG/BG label of the region) by means of a similarity measure between histograms, whose value is mapped to a likelihood through an exponential function of bandwidth σ_{H}.

Finally, the pdf, p(DH_{t} ∣ FG_{t}, FGH_{t}, BGH_{t}), of the whole depth image is computed as the product of the likelihoods of its local regions.
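A sketch of such a region-level likelihood is given below, assuming, for illustration, a Bhattacharyya-based similarity (the paper's exact similarity measure is not reproduced here); sigma_h plays the role of the bandwidth σ_{H}.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient between two normalized histograms (1 = identical)."""
    return float(np.sum(np.sqrt(h1 * h2)))

def region_likelihood(dh_obs, h_pred, sigma_h=0.2):
    """Likelihood of an observed region histogram under a predicted FG or BG
    appearance: an exponential of the Bhattacharyya distance, with bandwidth
    sigma_h (an illustrative choice, not the paper's exact expression)."""
    dist = np.sqrt(max(0.0, 1.0 - bhattacharyya(dh_obs, h_pred)))
    return float(np.exp(-dist ** 2 / (2.0 * sigma_h ** 2)))

h_obs = np.array([0.5, 0.5, 0.0])
lik_same = region_likelihood(h_obs, np.array([0.5, 0.5, 0.0]))  # identical
lik_diff = region_likelihood(h_obs, np.array([0.0, 0.0, 1.0]))  # disjoint
```

An identical predicted histogram yields the maximum likelihood, while a disjoint one is heavily penalized, matching the agreement/coherence behavior described above.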

The expression of the posterior joint pdf, p(FG_{t}, FGH_{t}, BGH_{t}, ST ∣ FG_{t−1}, FGH_{t−1}, BGH_{t−1}, DI_{t}), cannot be computed in closed form, and therefore, it is approximated by means of the sampling-based inference procedure described below.

The probability term, p_{HIER}(FGH_{t}, BGH_{t}, ST ∣ ·), is approximated by a set of N_{p} particles, each one containing a joint sample of the foreground appearance, the background appearance and the depth shift, (FGH_{t}^{(p)}, BGH_{t}^{(p)}, ST^{(p)}), with p = 1, …, N_{p}.

For every particle, p: (1) draw a depth-shift sample, ST^{(p)}, from the zero-mean Gaussian, p(ST); (2) conditioned on ST^{(p)}, draw a sample of the foreground appearance, FGH_{t}^{(p)}, from p(FGH_{t} ∣ FGH_{t−1}, ST^{(p)}) = ∏_{i} p(fgh_{t,i} ∣ fgh_{t−1,i}, ST^{(p)}), where i indexes the local foreground regions; (3) draw a sample of the background appearance, BGH_{t}^{(p)}, from p(BGH_{t} ∣ BGH_{t−1}) = ∏_{j} p(bgh_{t,j} ∣ bgh_{t−1,j}), where j indexes the local background regions.

Conditioned on a drawn joint sample, (FGH_{t}^{(p)}, BGH_{t}^{(p)}, ST^{(p)}), the FG/BG probability of every segmentation hypothesis, p_{GRID}(FG_{t}^{(p)(q)}), is computed as:

As a result, the posterior joint pdf can be expressed as:

Finally, the desired FG/BG probabilities, p_{BN–FBP}(FG_{t}), are obtained by marginalizing the approximated posterior over FGH_{t}, BGH_{t} and ST.
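The particle-based procedure above can be sketched for a single region as follows; the grid of segmentation hypotheses and the per-region bookkeeping are omitted, and the Bhattacharyya weighting is an illustrative choice rather than the paper's exact observation model.

```python
import numpy as np

rng = np.random.default_rng(1)

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient between two normalized histograms."""
    return float(np.sum(np.sqrt(h1 * h2)))

def shift_hist(hist, k):
    """Shift a histogram by k bins, zero-filling the vacated bins."""
    out = np.zeros_like(hist)
    n = len(hist)
    if k >= 0:
        out[k:] = hist[:n - k]
    else:
        out[:n + k] = hist[-k:]
    return out

def particle_fg_probability(dh_obs, fgh_prev, bgh_prev,
                            n_particles=200, sigma_st_bins=1.0):
    """Approximate the FG probability of one region with a particle set:
    every particle draws a depth shift, predicts the FG histogram by
    shifting fgh_prev, and weights the FG/BG hypotheses by their
    agreement with the observed histogram dh_obs."""
    n = len(fgh_prev)
    w_fg = w_bg = 0.0
    for _ in range(n_particles):
        k = int(np.clip(round(rng.normal(0.0, sigma_st_bins)), -n, n))
        w_fg += bhattacharyya(dh_obs, shift_hist(fgh_prev, k))  # FG dynamics
        w_bg += bhattacharyya(dh_obs, bgh_prev)                 # BG assumed stable
    return w_fg / (w_fg + w_bg + 1e-12)

fgh_prev = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # foreground was at bin 1
bgh_prev = np.array([0.0, 0.0, 0.0, 0.0, 1.0])  # background sits at bin 4
dh_obs = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # object moved one bin deeper
p_fg = particle_fg_probability(dh_obs, fgh_prev, bgh_prev)
```

Because some particles sample the correct one-bin shift, the foreground hypothesis explains the moved object far better than the static background, so the region is confidently labeled foreground.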

The proposed FG/BG segmentation system is tested and compared with other state-of-the-art algorithms using two different depth-based datasets. The first one was presented in [

The metrics used to perform the evaluation of the algorithms are: false positive rate (FPR), which represents the fraction of background pixels that are incorrectly marked as foreground; false negative rate (FNR), which represents the fraction of foreground pixels that are incorrectly marked as background; total error (TE), which represents the total number of misclassified pixels normalized with respect to the image size; and one similarity measure, S.

To rank the accuracy of the analyzed methods, the overall metric proposed in [ ] is used: for every evaluation metric, each method receives a rank, R_{i}, according to its relative performance with respect to the other methods.

The final overall metric, R, is computed by averaging, for each method, the per-sequence ranks, R_{sq}, obtained from the ranks, R_{i}, across the different evaluation metrics.
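The evaluation metrics can be sketched as below; the similarity measure is assumed here to be the Jaccard index, S = TP/(TP + FP + FN), a common choice for this type of evaluation, not necessarily the paper's exact definition.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """FPR, FNR and TE (all in %) and similarity S for binary masks.

    S is assumed here to be the Jaccard index TP/(TP + FP + FN)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    fpr = 100.0 * fp / max(fp + tn, 1)      # % of BG pixels marked FG
    fnr = 100.0 * fn / max(fn + tp, 1)      # % of FG pixels marked BG
    te = 100.0 * (fp + fn) / pred.size      # % of misclassified pixels
    s = tp / max(tp + fp + fn, 1)           # Jaccard similarity
    return fpr, fnr, te, s

gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True                          # 4 true FG pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True                        # 6 predicted FG pixels (2 extra)
fpr, fnr, te, s = segmentation_metrics(pred, gt)
```

In this toy example, all true foreground pixels are detected (FNR = 0), while two background pixels are wrongly marked as foreground, which drives the FPR, TE and S values.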

The performance of the proposed method, referred to from now on as BayesNet, is compared with other state-of-the-art background subtraction techniques: the pixel-based adaptive segmenter (PBAS), ViBe, a self-organizing map-based algorithm (SOM), the mixture of Gaussians of Zivkovic (MoG_{Ziv}) and its depth-adapted version (MoG_{D}), all of them applied to the depth imagery.

For the comparison, the following parameters of the BayesNet algorithm have been fixed for all of the test sequences: the spatial covariance matrix, Σ_{spa}; the depth-histogram resolutions, Δ_{FGH} and Δ_{BGH}; the bandwidth of the observation model, σ_{H}; and the number of particles, N_{p}.

Regarding the setting of the parameters used by the other algorithms, the following strategy has been adopted. Initially, the parameters that the authors selected as optimal in their original papers have been used. Furthermore, other configurations have been taken into account, which were found to be optimal in reviews or other works containing comparisons among algorithms (as in the case of the change detection challenge). Later, those parameters have been refined to maximize the performance of each algorithm on the evaluated sequences.

The results of the different algorithms using the

Notice that the depth-adapted MoG_{D} consistently obtains better ranks than the original MoG_{Ziv}, which confirms the benefit of explicitly modeling the characteristics of the depth data.

Some qualitative FG/BG segmentation results are shown in the corresponding figures, where the segmentations obtained by BayesNet can be visually compared with those of MoG_{D} and the other algorithms.

Similarly, additional qualitative results compare the behavior of BayesNet and MoG_{D} on the other test sequences.

Finally, further qualitative results compare BayesNet with MoG_{D} and the remaining algorithms.

In this subsection, several operational and practical issues are addressed, such as the computational cost, the relationship between the MoG-BS and BN-FBP modules, the robustness to several factors (missing and noisy depth measurements, camera jitter, intermittent motion and the viewpoint change of foreground objects) and the initialization.

The computational cost has been calculated as the mean value of the processing time of the algorithm using two different image sizes: 320 × 240 and 640 × 480 pixels. The computer used for the tests had an Intel Core i7-3540M processor at 3 GHz and 12 GB of RAM. The obtained mean values have been 432 ms and 1,106 ms for the first and second image sizes, respectively. Notice that the algorithm is currently a prototype implemented in MATLAB without any specific code optimization, and therefore, the aforementioned processing times can be decreased by either optimizing the MATLAB-based implementation, programming a C/C++ implementation, or even programming a graphical processing unit (GPU)-based implementation. Of special interest is the last choice: the structure of the BN-FBP module allows for an efficient implementation in a GPU, because the inference is based on a particle filtering technique, in which the computation relative to each particle can be performed in parallel. A similar analysis applies to the implementations of the compared algorithms, such as MoG_{D} and MoG_{Ziv}.

Taking into account that the MoG-BS module is essentially the MoG_{D} algorithm, the improvement of the complete system with respect to MoG_{D} can be attributed to the BN-FBP module and its region-based prediction of the FG/BG regions.

Regarding the noisy depth measurements, the robustness is due to the region-level processing performed by the BN-FBP module. Working with regions instead of pixels allows for the consideration of more data to make inferences and, thus, to be less sensitive to noise. Specifically, this behavior is achieved by working with histograms of depth regions, rather than individual pixel values. On the other hand, the quadratic relationship between the measured depth and the noise is taken into account in both system modules. In the MoG-BS module, the depth model parameters are selected as follows: given the mean value of each Gaussian of the mixture, its variance is adjusted according to the aforementioned quadratic relationship [
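The quadratic noise model can be sketched as follows; the coefficients are illustrative values for a structured-light depth sensor, not the exact ones used by the system.

```python
def depth_noise_std(z, a=0.0012, b=0.0019):
    """Axial depth-noise standard deviation (meters) as a function of the
    measured depth z (meters), using the quadratic model sigma = a + b*z^2.
    The coefficients are illustrative values for a structured-light sensor."""
    return a + b * z * z

def mog_variance_for_depth(z, k=3.0):
    """Variance assigned to a MoG depth mode centered at depth z, scaled by
    a factor k so that the mode tolerates k-sigma fluctuations."""
    s = k * depth_noise_std(z)
    return s * s

near = depth_noise_std(1.0)   # noise at 1 m
far = depth_noise_std(4.0)    # noise at 4 m: grows quadratically with depth
```

Tying each Gaussian's variance to its mean depth in this way makes distant, noisy measurements less likely to be misclassified as foreground.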

Due to the previously adapted noise processing, the operational range of the Kinect sensor can be extended beyond the recommended operational conditions: from the usual 7–8 m up to 11–12 m, a range that is usually discarded because of its low signal-to-noise ratio.

Regarding the missing depth measurements, the region-level processing is also the key. Inside a region, there can be some pixels without depth assignment, but the characterization of such a region can still be done using the remaining pixel values. In addition, this region-level processing should theoretically provide a natural robustness against camera jitter.

As regards the intermittent motion of background objects (dynamic backgrounds), the algorithm has not been explicitly designed to be robust to this situation, and therefore, a decrease in performance is expected.

The proposed algorithm is also robust to viewpoint changes of foreground objects thanks to the underlying piecewise-linear model used for the depth-based foreground dynamics in the BN-FBP module. Briefly, foreground regions in the previous time step are divided into sub-regions. For every sub-region, different possible depth displacements are calculated for the current time step (as part of the particle filtering procedure). As a result, a 'bag' of possible local regions with different depth structures is available, which covers the actual depth appearance of the foreground, including potential deformable evolutions, such as articulated foreground objects or changes in the viewpoint. This fact can be observed in

The process of initialization is explained below. The first frame (usually free of foreground objects) is used to initialize the probabilistic background model of the MoG algorithm (MoG-BS module). From the second frame on, the MoG-BS module is already able to compute FG/BG segmentations. On the other hand, the BN-FBP module needs the background model and the FG/BG segmentation from the previous time step to estimate the FG/BG segmentation in the current time step. All of these data are already available from the third frame on, thanks to the MoG-BS module. No further considerations are needed for the BN-FBP module, since it is not a temporally recursive model.

A novel algorithm for high-quality foreground segmentation in depth imagery has been proposed, which can operate almost independently of the existing illumination conditions in indoor scenarios. The FG/BG segmentation is carried out by the combination of a MoG-based subtraction algorithm and a Bayesian network-based algorithm. The Bayesian network is able to predict the FG/BG regions between consecutive depth images by explicitly exploiting the intrinsic characteristics of the depth data. For this purpose, two dynamic models that estimate the spatial and depth evolution of the FG/BG are used. Of special interest is the depth-based dynamic model that predicts the depth distribution of the FG/BG objects in consecutive time steps, which are encoded by an appearance model based on the concept of a 'bag of features'. Remarkable results have been obtained in two public depth-based datasets, outperforming other state-of-the-art approaches.

This work has been partially supported by the Ministerio de Economía y Competitividad of the Spanish Government under the project TEC2010-20412 (Enhanced 3DTV). Massimo Camplani would like to acknowledge the European Union and the Universidad Politécnica de Madrid (UPM) for supporting his activities through the Marie Curie-Cofund research grant.

The authors declare no conflict of interest.

Modules of the proposed foreground/background (FG/BG) segmentation system. MoG, mixture of Gaussians.

Proposed Bayesian network for the estimation of the FG/BG probabilities.

Results for frame 420: comparison among BayesNet, MoG_{D} and MoG_{Ziv}.

Results for frame 410: comparison among BayesNet, MoG_{D} and MoG_{Ziv}.

Results for frame 1,069: comparison among BayesNet, MoG_{D} and MoG_{Ziv}.

Results for frame 513: comparison among BayesNet, MoG_{D} and MoG_{Ziv}.

Posterior pdf of an articulated foreground object between two images, without taking into account the background data/model.

Summary of variables and main parameters used in the proposed Bayesian network.

Variable/Parameter | Description
---|---
FG_{t} | FG/BG binary image segmentation at time step t
fg_{t} | Binary value of FG_{t} at a single pixel
DH_{t} | Set of depth histograms computed from DI_{t}
dh_{t} | Depth histogram computed from a local region of DI_{t}
FGH_{t} | Set of depth histograms that models the foreground appearance at time step t
fgh_{t} | Depth histogram of a single local region of FGH_{t}
ST | Expected shift in depth of the foreground regions between consecutive time steps
BGH_{t} | Set of depth histograms that models the background appearance at time step t
bgh_{t} | Depth histogram of a single local region of BGH_{t}
DI_{t} | Depth image at time step t
B_{t} | Depth-based representation of the most probable background obtained from the MoG-BS module at time step t

Detection accuracy obtained by analyzing the first test sequence.

Method | FPR (%) | FNR (%) | TE (%) | S | R
---|---|---|---|---|---
BayesNet | – | 24.63 | 1.19 | – | –
MoG_{D} | 3.65 | 31.70 | – | 0.34 | 2
PBAS | 5.29 | 39.19 | 2.22 | 0.28 | 3.25
Vibe | 33.36 | 34.57 | – | 0.14 | 4.75
SOM | 13.32 | 40.39 | 10.86 | 0.19 | 4.75
MoG_{Ziv} | 6.02 | 41.31 | 2.82 | 0.19 | 4.75

Detection accuracy obtained by analyzing the second test sequence.

Method | FPR (%) | FNR (%) | TE (%) | S | R
---|---|---|---|---|---
BayesNet | – | 26.30 | 1.76 | – | –
MoG_{D} | 5.42 | 31.38 | – | 0.41 | 2.75
PBAS | 6.53 | 29.30 | 3.26 | 0.43 | 3.25
Vibe | 18.73 | 18.26 | – | 0.26 | 4.75
SOM | 9.79 | 38.25 | 5.71 | 0.33 | 5.25
MoG_{Ziv} | 6.51 | 27.43 | 3.51 | 0.39 | 3.50

Detection accuracy obtained by analyzing the third test sequence.

Method | FPR (%) | FNR (%) | TE (%) | S | R
---|---|---|---|---|---
BayesNet | – | 33.21 | 0.98 | – | –
MoG_{D} | 7.20 | 56.99 | – | 0.32 | 4.25
PBAS | 6.01 | 32.03 | 2.64 | – | –
Vibe | 9.02 | 7.88 | – | 0.45 | 4.25
SOM | 7.47 | 23.37 | 5.41 | 0.47 | 3.75
MoG_{Ziv} | 7.05 | 38.82 | 2.94 | 0.42 | 4.25

Detection accuracy obtained by analyzing the fourth test sequence.

Method | FPR (%) | FNR (%) | TE (%) | S | R
---|---|---|---|---|---
BayesNet | 7.86 | – | – | – | –
MoG_{D} | 7.97 | 12.01 | 7.35 | 0.59 | 2.75
PBAS | 14.12 | 61.80 | – | 0.29 | 4.25
Vibe | 19.27 | 6.19 | 21.47 | 0.39 | 4.75
SOM | 13.32 | 49.93 | 7.21 | 0.36 | 3.75
MoG_{Ziv} | 17.63 | 5.68 | 19.64 | 0.41 | 3.75

Detection accuracy obtained by analyzing the fifth test sequence.

Method | FPR (%) | FNR (%) | TE (%) | S | R
---|---|---|---|---|---
BayesNet | 8.13 | – | – | – | –
MoG_{D} | 7.30 | 17.37 | – | 0.55 | 2.25
PBAS | 12.31 | 55.02 | 6.45 | 0.32 | 4.75
Vibe | 14.44 | 7.34 | 15.43 | 0.42 | 5.00
SOM | 10.26 | 38.30 | 6.43 | 0.43 | 3.25
MoG_{Ziv} | 13.49 | 5.33 | 14.59 | 0.43 | 4.00

Final ranking.

Method | R
---|---
BayesNet | –
MoG_{D} | 2.80
PBAS | 3.55
Vibe | 4.70
SOM | 4.15
MoG_{Ziv} | 4.05