Background Subtraction for Moving Object Detection in RGBD Data: A Survey

Abstract: This paper provides a specific perspective on background subtraction for moving object detection, a building block for many computer vision applications and the first relevant step for subsequent recognition, classification, and activity analysis tasks. Since color information is not sufficient for dealing with problems such as light switches or local gradual changes of illumination, shadows cast by the foreground objects, and color camouflage, additional information is needed to deal with these issues. Synchronized depth information acquired by low-cost RGBD sensors is considered in this paper to give evidence about which issues can be solved, but also to highlight new challenges and design opportunities in several applications and research areas.

Background modeling is a critical component of motion detection tasks, and it is essential for most modern video surveillance applications. Usually, color provides most of the information needed to detect the foreground and to solve the basic issues related to this task [1][2][3][4][5]. However, problems such as light switches or local gradual changes of illumination, shadows cast by the foreground objects, and color camouflage due to similar colors of foreground and background regions are still open. The recent broad availability of depth data (from stereo vision to off-the-shelf RGBD sensors, such as time-of-flight and structured light cameras) has opened new ways of dealing with the problem. Indeed, dense depth data provided by RGBD cameras is very attractive for foreground/background segmentation in indoor environments (due to the range camera limitations), since it does not suffer from the above-mentioned issues that affect color-based algorithms. Moreover, depth information is beneficial for detecting moved background objects and reducing their effect.
On the other hand, the use of depth data alone poses several problems that prevent it from ensuring the required efficiency: (a) depth-based segmentation fails in case of depth camouflage, which appears when foreground objects move towards the modeled background; (b) object silhouettes are strongly affected by the high level of depth data noise at object boundaries; (c) depth measurements are not always available for all the image pixels due to multiple reflections, scattering on particular surfaces, or occlusions. All these issues arose with several background modeling approaches based solely on depth, as proposed in [6][7][8][9][10], mainly as building blocks for people-detection and tracking systems [11][12][13][14].
Therefore, many recent methods try to exploit the complementary nature of color and depth information acquired with RGBD sensors. Generally, these methods either extend to RGBD data well-known background models initially designed for color data [15,16] or model the scene background (and sometimes also the foreground) based on color and depth independently and then combine the results on the basis of different criteria [17][18][19][20] (see Section 3).
Several reviews related to RGBD data have been recently presented. In [21], Cruz et al. provide one of the first surveys of academic and industrial research on Kinect and RGBD data, showing the basic principles needed to begin developing applications using Kinect. Greff et al. [8] present a comparison between background subtraction algorithms using depth cameras. In [22], Zhang unravels the intelligent technologies encoded in Kinect, such as sensor calibration, human skeletal tracking, and facial-expression tracking. The paper also demonstrates a prototype system that employs multiple Kinects in an immersive teleconferencing application. In [23], Han et al. present a comprehensive review of recent Kinect-based computer vision algorithms and applications, giving insights on how researchers exploit and improve computer vision algorithms using Kinect. In [24], Camplani et al. survey multiple human tracking in RGBD data.
The present paper aims to provide a comprehensive review of methods which exploit RGBD data for moving object detection based on background subtraction. We do not review methods based only on RGB features, as that would need a dedicated survey of its own and would demand much more space; for RGB-only background subtraction, the reader is referred to the reviews presented in [1][2][3][5]. We provide a brief analysis of the main issues and a concise description of the existing literature. Moreover, we summarize the metrics commonly used for the evaluation of these methods and the datasets that are publicly available. Finally, we provide the most extensive comparison of the existing methods on some datasets.

RGBD Data and Related Issues for Background Subtraction
Color cameras are based on sensors like CCD or CMOS, which provide a reliable representation of the scene with high-resolution images. Background subtraction using this kind of sensor often results in a precise separation between foreground and background, even though well-known scene background modeling challenges for moving object detection must be taken into account [25,26]:
• Bootstrapping: The challenge is to learn a model of the scene background (to be adopted for background subtraction) even when the usual assumption of having a set of training frames empty of foreground objects fails.
• Color Camouflage: When videos include foreground objects whose color is very close to that of the background, it is hard to provide a correct segmentation based only on color.
• Illumination Changes: The challenge is to adapt the color background model to strong or mild illumination changes to achieve an accurate foreground detection.
• Intermittent Motion: The issue is to detect foreground objects even if they stop moving (abandoned objects) or if they were initially stationary and then start moving (removed objects).

• Moving Background: The challenge is to model not only the static background but also slight changes in the background that are not interesting for surveillance, such as waving trees in outdoor videos.
• Color Shadows: The challenge is to discriminate foreground objects from the shadows they cast on the background, which apparently behave as moving objects.
Depth sensors provide partial geometrical information about the scene that can help solve some of the above problems. A depth image, storing for each pixel a depth value proportional to the estimated distance from the device to the corresponding point in the real world, can be obtained with different methods [27]:
1. Stereo vision [28]: This is a passive technique where the depth is derived from the disparity between images captured by a camera pair. Stereo vision systems need to be well calibrated and can fail when the scene is not sufficiently textured. Moreover, algorithms for stereo reconstruction are often computationally expensive. Finally, stereo vision systems cannot work in low light conditions. In this case, infrared (IR) lights can be added to the system, but then the color information is lost, which generates segmentation and matching difficulties.
2. Time-of-Flight (ToF) [29]: ToF cameras are active sensors that determine the per-pixel depth value by measuring the time taken by IR light to travel to the object and back to the camera. A ToF camera provides more accurate depth images than a stereo vision system, but it is very expensive and limited to low image resolution. The measured depth map can be noisy both spatially and temporally, and the noise is content-dependent and hence difficult to remove by traditional filtering methods.
3. Structured light [30]: A structured light sensor consists of an IR emitter and an IR camera. The emitter projects an IR speckle pattern onto the scene; the camera captures the reflected pattern and correlates it against a stored reference pattern on a plane, providing the depth values. Well-known examples include the Microsoft Kinect version 1 (in the following simply named Kinect) and the Asus Xtion Pro Live. These sensors can acquire higher resolution images than a ToF camera at a lower price. The drawback is that depth information is not always well estimated at the object boundaries and for areas too far from or too close to the IR projector. Also, the noise in depth measurements increases quadratically with increasing distance from the sensor [31].
Even though depth data solves some of the previously highlighted background maintenance issues, being independent of scene color and illumination conditions, it suffers from several problems, independent of which technology is used for its estimation. Indeed, as for color data, depth data suffers from bootstrapping, intermittent motion, and moving background. Moreover, challenges specific to depth data include [32,33]:
• Depth Camouflage: When foreground objects are very close in depth to the background, the sensor gives the same depth data values for foreground and background, making it hard to provide a correct segmentation based only on depth.
• Depth Shadows: Similar to the case of color, depth shadows are caused by foreground objects blocking the IR light emitted by the sensor from reaching the background.
• Specular Materials: When the scene includes specular objects, IR rays from a single incoming direction are reflected back in a single outgoing direction, without causing the diffusion needed to obtain depth information.
• Out of Sensor Range: When foreground or background objects are too close to/far from the sensor, the sensor is unable to measure depth, due to its minimum and maximum depth specifications.
In the last three cases, where depth cannot be measured at a given pixel, the sensor returns a special non-value code to indicate its inability to measure depth [32], resulting in an invalid depth value (shown as black pixels in the depth images reported in Figure 1). These invalid values must be suitably handled to exploit depth for background subtraction.
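As a minimal sketch of what handling invalid values can mean in practice, the following Python fragment masks out the holes before any background update. The sentinel value 0, the millimetre units, and the function names are assumptions made for illustration; the actual non-value code depends on the sensor and its driver, and the overwrite-on-valid rule is only one of many possible policies.

```python
# Assumption: 0 is the sensor's "no measurement" code; check the driver in use.
INVALID_DEPTH = 0


def valid_mask(depth_frame, invalid=INVALID_DEPTH):
    """Return a grid that is True where the sensor measured a depth value."""
    return [[d != invalid for d in row] for row in depth_frame]


def refresh_background(background, depth_frame, invalid=INVALID_DEPTH):
    """Overwrite background depth only where the new frame holds a valid value."""
    return [
        [new if new != invalid else old for old, new in zip(bg_row, row)]
        for bg_row, row in zip(background, depth_frame)
    ]


frame = [[1200, 0], [0, 950]]          # depth in millimetres, 0 marks a hole
background = [[1210, 2000], [0, 940]]
mask = valid_mask(frame)
updated = refresh_background(background, frame)
```

Pixels that are invalid in both the frame and the background stay invalid, so later stages still need a rule for them (for example, falling back on color).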

Methods
In the last twenty years, several methods have been proposed for background subtraction exploiting depth data, as an alternative or complement to color data. A summary of background subtraction methods for RGBD videos is given in Table 1. Here, apart from the name of the authors and the related reference (column Authors and Ref.), we report (column Used data) whether they exploit only the depth information (D) or the complementary nature of color and depth information (RGBD). Moreover, we specify (column Depth data) how the considered depth data is acquired (Kinect, ToF cameras, stereo vision devices). Furthermore, we specify (column Model) the type of model adopted for the background, including Codebook [34], Frame difference, Kernel Density Estimation (KDE) [35], Mixture of Gaussians (MoG) [36], Robust Principal Component Analysis (RPCA) [37], Self-Organizing Background Subtraction (SOBS) [38], Single Gaussian [39], Thresholding, ViBe [40], and WiSARD weightless neural network [41]. Finally, we specify (column No. of models) whether they extend to RGBD data well-known background models originally designed for color data (1 model) or model the scene background based on color and depth independently and then combine the results on the basis of different criteria (2 models).
In the following, we provide a brief description of the reviewed methods, presented in chronological order. In the case of research dealing with higher-level systems (e.g., teleconferencing, matting, fall detection, human tracking, gesture recognition, object detection), we limit our attention to background modeling and foreground detection.
Eveland et al. [6] present a method of statistical background modeling for stereo sequences based on the disparity images extracted from stereo pairs. The depth background is modeled by a single Gaussian, similarly to [39], but a selective update prevents the incorporation of foreground objects into the background.
The method proposed by Gordon et al. [42] is an adaptation of the MoG algorithm to color and depth data obtained with a stereo device. Each background pixel is modeled as a mixture of four-dimensional Gaussian distributions: three components are the color data (the YUV color space components), and the fourth one is the depth data. Color and depth features are considered independent, and the same updating strategy of the original MoG algorithm is used to update the distribution parameters. The authors propose a strategy where, for reliable depth data, depth-based decisions bias the color-based ones: if a reliable distribution match is found in the depth component, the color-based matching criterion is relaxed, thus reducing the color camouflage errors. When the stereo matching algorithm is not reliable, the color-based matching criterion is made stricter to avoid problems such as shadows or local illumination changes.
Ivanov et al. [43] propose an approach based on stereo vision, which uses the disparity (estimated offline) to warp one image of the pair into the other one, thus creating a geometric background model. If the color and brightness between corresponding points do not match, the pixels belong either to a foreground object or to an occlusion shadow. The latter case can be further disambiguated using more than two camera views.
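To make the idea of a selective update concrete, the following Python sketch keeps a running single-Gaussian model for one pixel and adapts it only when the new value is classified as background. It is an illustration of the general technique, not the implementation of [6]; the class name, the learning rate `alpha`, and the threshold factor `k` are assumptions.

```python
class GaussianPixelModel:
    """Running single-Gaussian background model for one pixel (illustrative)."""

    def __init__(self, mean, var, alpha=0.05, k=2.5):
        self.mean = mean      # background mean (e.g., disparity or depth)
        self.var = var        # background variance
        self.alpha = alpha    # learning rate (assumed value)
        self.k = k            # match threshold in standard deviations (assumed)

    def is_foreground(self, value):
        return abs(value - self.mean) > self.k * self.var ** 0.5

    def update(self, value):
        """Classify the value, then adapt the model only if it is background."""
        foreground = self.is_foreground(value)
        if not foreground:  # selective update: foreground never leaks in
            diff = value - self.mean
            self.mean += self.alpha * diff
            self.var = (1.0 - self.alpha) * self.var + self.alpha * diff * diff
        return foreground


model = GaussianPixelModel(mean=100.0, var=4.0)
first = model.update(101.0)    # close to the mean: background, model adapts
second = model.update(120.0)   # far from the mean: foreground, model unchanged
```

With a blind (non-selective) update, the second observation would pull the mean towards 120 and a stationary foreground object would gradually be absorbed into the background.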
Harville et al. [16] propose a foreground segmentation method using the YUV color space with the additional depth values estimated by stereo cameras. They adopt four-dimensional MoG models, also modulating the background model learning rate based on scene activity and making color-based segmentation criteria dependent on depth observations.
Kolmogorov et al. [44] describe two algorithms for bi-layer segmentation fusing stereo and color/contrast information, focused on live background substitution for teleconferencing. To segment the foreground, this approach relies on stereo vision, assuming that people participating in the teleconference are close to the camera. Color information is used to cope with stereo occlusion and low-texture regions. The color/contrast model is composed of MoG models for the background and the foreground.
Crabb et al. [45] propose a method for background substitution, a regularly used effect in TV and video production. Thresholding of depth data coming from a ToF camera, using a user-defined threshold, is adopted to generate a trimap (consisting of background, foreground, and uncertain pixels). Alpha matting values for uncertain pixels, mainly at the borders of the segmented objects, are needed for a natural-looking blending of those objects onto a different background. They are obtained by cross-bilateral filtering based on color information.
In [11], Guðmundsson et al. tackle 3D multi-person tracking in smart rooms. They adopt a single Gaussian model for the range data from a two-modal camera rig (consisting of a ToF range camera and an additional higher resolution grayscale camera) for background subtraction.
In [46], Wu et al. present an algorithm for bi-layer segmentation of natural videos in real time using a combination of infrared, color, and edge information. A prime application of this system is in telepresence, where there is a need to remove the background and replace it with a new one. For each frame, the IR image is used to pre-segment the color image using a simple thresholding technique. This pre-segmentation is adopted to initialize a pentamap, which is then used by a graph cuts algorithm to find the final foreground region.
The depth data provided by a ToF camera is used to generate 3D-TV content by Frick et al. [7]. The MoG algorithm is applied to the depth data to obtain foreground regions, which are then excluded by median filtering to improve background depth map accuracy.
In [47], Leens et al. propose a multi-camera system that combines color and depth data, obtained with a low-resolution ToF camera, for video segmentation. The algorithm applies the ViBe algorithm independently to the color and the depth data. The obtained foreground masks are then combined with logical operations and post-processed with morphological operations.
MoG is also adopted in the algorithm proposed by Störmer et al. [48], where depth and infrared data captured by a ToF camera are combined to detect foreground objects. Two independent background models are built, and each pixel is classified as background or foreground only if the matching conditions of the two models agree. Very close or overlapping foreground objects are further separated using a depth gradient-based segmentation.
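The text above does not specify which logical operations are used, so the following Python sketch simply shows the two obvious candidates for merging a color mask and a depth mask: a logical OR (a pixel is foreground if either cue says so) and a logical AND (both cues must agree). The function name and the `mode` parameter are illustrative assumptions, and the morphological post-processing is omitted.

```python
def combine_masks(color_mask, depth_mask, mode="or"):
    """Merge two binary foreground masks pixel by pixel."""
    if mode not in ("or", "and"):
        raise ValueError("mode must be 'or' or 'and'")
    merge = (lambda a, b: a or b) if mode == "or" else (lambda a, b: a and b)
    return [
        [bool(merge(c, d)) for c, d in zip(c_row, d_row)]
        for c_row, d_row in zip(color_mask, depth_mask)
    ]


color_fg = [[1, 0], [0, 0]]   # color misses the second pixel (camouflage)
depth_fg = [[1, 1], [0, 0]]   # depth still detects it
union = combine_masks(color_fg, depth_fg, mode="or")
agreement = combine_masks(color_fg, depth_fg, mode="and")
```

The OR favors recall (it recovers pixels lost to color camouflage), whereas the AND favors precision (it suppresses detections supported by only one cue, such as color shadows).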
Wang et al. [49] propose TofCut, an algorithm that combines color and depth cues in a unified probabilistic fusion framework, with a novel adaptive weighting scheme to control the influence of these two cues intelligently over time. Bilayer segmentation is formulated as a binary labeling problem, whose optimal solution is obtained by minimizing an energy function. The data term evaluates the likelihood of each pixel belonging to the foreground or the background. The contrast term encodes the assumption that segmentation boundaries tend to align with high-contrast edges. Color and depth foreground and background pixels are modeled through MoGs and single Gaussians, respectively, and their weighting factors are adaptively adjusted based on the discriminative capabilities of their models. The algorithm is also used in an automatic matting system [82] to automatically generate foreground masks, and consequently trimaps, to guide alpha matting.
Dondi et al. [50] propose a matting method using the intensity map generated by ToF cameras. It first segments the distance map based on the corresponding values of the intensity map and then applies region growing to the filtered distance map to identify and label pixel clusters. A trimap is obtained by eroding the output to select the foreground, dilating it to select the background, and selecting as indeterminate the remaining contour pixels. The obtained trimap is fed as input to a matting algorithm that refines the result.
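The trimap construction by erosion and dilation can be sketched as follows in plain Python. This is a generic illustration with a 3x3 structuring element, not the procedure of [50]; the label constants and helper names are assumptions. Pixels that survive erosion are taken as certain foreground, pixels outside the dilated mask as certain background, and the band in between as indeterminate.

```python
FG, BG, UNKNOWN = "F", "B", "U"   # illustrative trimap labels


def _window(mask, r, c):
    """Values in the 3x3 neighbourhood of (r, c), clipped at the image border."""
    rows, cols = len(mask), len(mask[0])
    return [
        mask[i][j]
        for i in range(max(0, r - 1), min(rows, r + 2))
        for j in range(max(0, c - 1), min(cols, c + 2))
    ]


def erode(mask):
    return [[all(_window(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def dilate(mask):
    return [[any(_window(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def build_trimap(mask):
    sure_fg = erode(mask)     # shrunken mask: certainly foreground
    reach = dilate(mask)      # grown mask: everything outside is background
    return [
        [FG if sure_fg[r][c] else (UNKNOWN if reach[r][c] else BG)
         for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


# 7x7 mask with a 3x3 foreground square in the middle.
mask = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
        for r in range(7)]
trimap = build_trimap(mask)
```

The width of the indeterminate band is controlled by the size of the structuring element (or the number of erosion and dilation passes), which trades matting effort against robustness to noisy object contours.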
Frick et al. [51] use a thresholding technique to separate the foreground from the background in multiple planes of the video volume, for the generation of 3D-TV content. A subsequent trimap-based refinement using hierarchical graph cuts segmentation is further adopted to reduce the artifacts produced by the depth noise.
Kawabe et al. [52] employ stereo cameras to extract pedestrians. Foreground regions are extracted by MoG-based background subtraction and shadow detection using the color data. Then the moving objects are extracted by thresholding the histogram of depth data, computed by stereo matching.
Mirante et al. [53] exploit the information captured by a multi-sensor system consisting of a stereo camera pair with a ToF range sensor. Motion, retrieved by color and depth frame difference, provides the initial ROI mask. The foreground mask is first extracted by region growing in the depth data, where seeds are obtained from the ROI, and then refined based on color edges. Finally, a trimap is generated, where uncertain values are those along the foreground contours, and are classified based on color in the CIELab color space.
Rougier et al. [54] explore the Kinect sensor for the application of detecting falls in the elderly. For people detection, the system adopts a single Gaussian depth background model.
Schiller and Koch [55] propose an approach to video matting that combines color information with the depth provided by ToF cameras. Depth keying is adopted to segment moving objects based on depth information, comparing the current depth image with a depth background image (constructed by averaging several ToF images). MoG is adopted to segment moving objects based on color information. The two segmentations are weighted using two types of reliability measure for depth measurements: the depth variance and the amplitude image of the ToF camera. The weighted average of the color and depth segmentations is used as the matting alpha value for blending foreground and background, while its thresholding (using a user-defined threshold) is used for evaluating moving object segmentation.
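A weighted average of this kind can be sketched per pixel as below. This is a deliberately simplified illustration, not the method of [55]: it assumes that the two reliability measures have already been condensed into a single depth weight in [0, 1], and the function names and the default threshold of 0.5 are hypothetical.

```python
def fuse_alpha(color_seg, depth_seg, depth_reliability):
    """Blend two soft segmentations (values in [0, 1]) into one alpha value.

    depth_reliability is an assumed scalar weight in [0, 1]: 1 trusts depth
    completely, 0 falls back on color only.
    """
    w = min(max(depth_reliability, 0.0), 1.0)
    return w * depth_seg + (1.0 - w) * color_seg


def is_foreground(alpha, threshold=0.5):
    """Binary decision obtained by thresholding the alpha value."""
    return alpha >= threshold


# Color says foreground, depth says background, and depth is barely reliable.
alpha = fuse_alpha(color_seg=1.0, depth_seg=0.0, depth_reliability=0.25)
decision = is_foreground(alpha)
```

The same alpha serves two purposes, as in the text: used directly, it blends foreground and background for matting; thresholded, it yields a binary mask that can be scored against ground truth.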
Stone and Skubic [56] use only the depth information provided by a Kinect device to extract the foreground. For each pixel, minimum and maximum depth values $d_m$ and $d_M$ are computed from a set of training images to form a background model. For a new frame, each pixel is compared against the background model, and those pixels which lie outside the range
In [9], Han et al. present a human detection and tracking system for a smart environment application. Background subtraction is applied only on the depth images as frame-by-frame difference, assisted by a clustering algorithm that checks the depth continuity of pixels in the neighborhood of foreground pixels. Once the object has been located in the image, visual features are extracted from the RGB image and are then used for tracking the object in successive frames.
In the surveillance system based on the Kinect proposed by Clapés et al. [57], a per-pixel background subtraction technique is presented. The authors propose a background model based on a four-dimensional Gaussian distribution (using color and depth features). Then, user and object candidate regions are detected and recognized using robust statistical approaches.
In the gesture recognition system presented by Mahbub et al. [10], the foreground objects are extracted from the depth data using frame difference.
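Frame difference on depth data, the simplest of the models listed in this section, can be sketched as follows. The threshold, the millimetre units, and the treatment of invalid pixels (coded here as 0 and never flagged as moving) are assumptions for illustration and are not taken from [10].

```python
def depth_frame_difference(previous, current, threshold, invalid=0):
    """Flag pixels whose depth changed by more than `threshold` between frames.

    Pixels that are invalid in either frame are not flagged, since a hole
    appearing or disappearing is not evidence of motion.
    """
    return [
        [p != invalid and c != invalid and abs(c - p) > threshold
         for p, c in zip(p_row, c_row)]
        for p_row, c_row in zip(previous, current)
    ]


previous = [[2000, 2000, 0]]
current = [[2003, 1500, 1500]]   # small noise, real change, hole filled in
moving = depth_frame_difference(previous, current, threshold=50)
```

Because only consecutive frames are compared, an object that stops moving disappears from the mask, which is the intermittent motion issue discussed in the previous section.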
Ottonelli et al. [59] refine the ViBe segmentation of the color data by adding to the obtained foreground mask a compensation factor computed from the color and depth data obtained by a stereo camera.
In the object detection system presented by Zhang et al. [60], background subtraction is achieved by single Gaussian modeling of the depth information provided by a Kinect sensor.
Fernandez-Sanchez et al. [58] adopt Codebook as the background model and consider data captured by Kinect cameras. They analyze two approaches that differ in the depth integration method: the four-dimensional Codebook (CB4D) simply treats depth as a fourth channel of the background model, while the Depth-Extended Codebook (DECB) adds a joint RGBD fusion method directly into the model. They show that the latter achieves better results than the former. In [15], the authors consider stereo disparity data, besides color. To get the best of color and depth features, they extend the DECB algorithm through a post-processing stage for mask fusion (DECB-LF), based on morphological reconstruction using the output of the color-based algorithm.
Braham et al. [61] adopt two background models for depth data, separating valid values (modeled by a single Gaussian model) and invalid values (holes). The Gaussian mean is updated to the maximum valid value, while the standard deviation follows a quadratic relationship with respect to the depth. This leads to a depth-dependent foreground/background threshold that enables the model to adapt automatically to the non-uniform noise of range images.
In [17], Camplani and Salgado propose an approach, named $CL_W$, based on a combination of color and depth classifiers ($CL_C$ and $CL_D$) and the adoption of the MoG model. The combination of classifiers is based on a weighted average that allows adaptively modifying the support of each classifier in the ensemble by considering foreground detections in the previous frames and the depth and color edges. For each pixel, the support of each classifier to the final segmentation result is obtained by considering the global edge-closeness probability and the classification labels obtained in the previous frame. In [62], the authors improve their method, proposing a method named MoG-RegPRE that builds not only pixel-based but also region-based models from depth and color data, and fuses the models in a mixture-of-experts fashion to improve the final foreground detection performance.
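The effect of a depth-dependent threshold can be illustrated with the following Python sketch. It is not the model of [61]: the quadratic coefficients, the factor `k`, the sentinel 0 for holes, and the function names are all hypothetical, and holes are simply skipped rather than given a model of their own.

```python
def noise_sigma(depth_mm, a=1.0, b=0.0, c=2.0e-6):
    """Assumed quadratic noise model: sigma(d) = a + b*d + c*d^2 (millimetres).

    The coefficients are placeholders; real values come from sensor calibration.
    """
    return a + b * depth_mm + c * depth_mm * depth_mm


def classify_and_update(bg_mean, depth_mm, k=3.0, invalid=0):
    """Return (new background mean, foreground flag or None for a hole)."""
    if depth_mm == invalid:
        return bg_mean, None            # hole: no decision in this sketch
    if bg_mean == invalid:
        return depth_mm, False          # first valid value initialises the model
    # Foreground lies in front of the background by more than k sigmas,
    # where sigma grows with the background distance.
    is_fg = depth_mm < bg_mean - k * noise_sigma(bg_mean)
    # The background is assumed to be the farthest valid surface seen so far.
    return max(bg_mean, depth_mm), is_fg


bg, person = classify_and_update(2000, 1900)   # 10 cm in front of a 2 m wall
bg, jitter = classify_and_update(bg, 1990)     # within the noise band
bg, farther = classify_and_update(bg, 2010)    # background pushed back
```

With these placeholder coefficients the tolerance is about 27 mm at 2 m but about 147 mm at 4 m, so distant background pixels are not flagged just because their measurements are noisier.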
Chattopadhyay et al. [63] adopt RGBD streams for recognizing gait patterns of individuals. To extract RGBD information of moving objects, they adopt the SOBS model for color background subtraction and use the obtained foreground masks to extract the depth information of people silhouettes from the registered depth frames.
In [18], Gallego and Pardás present a foreground segmentation system that combines color and depth information captured by a Kinect camera to perform a complete Bayesian segmentation between foreground and background classes. The system adopts a combination of spatial-color and spatial-depth region-based MoG models for the foreground, as well as two color and depth pixel-wise MoG models for the background, in a Logarithmic Opinion Pool decision framework used to combine the likelihoods of each model correctly. A post-processing step based on a trimap analysis is also proposed to correct the precision errors that the depth sensor introduces in the object contour.
The algorithm proposed by Giordano et al. in [64] explicitly models the scene background and foreground with a KDE approach in a quantized x-y-hue-saturation-depth space. Foreground segmentation is achieved by thresholding the log-likelihood ratio over the background and foreground probabilities.
Murgia et al. [65] propose an extension of the Codebook model. Similarly to CB4D [58], it includes depth as a fourth channel of the background model but also applies colorimetric invariants to modify the color aspect of the input images, to give them the aspect they would have under canonical illuminants.
In [66], Song et al. model grayscale color and depth values based on MoG. The combination of the two models is based on the product of the likelihoods of the two models.
Boucher et al. [67] initially exploit depth information to achieve a coarse segmentation, using the middleware of the adopted ASUS Xtion camera. The obtained mask is refined in uncertain areas (mainly object contours) having high background/foreground contrast, locally modeling colors by their mean value.
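A logarithmic opinion pool combines several likelihoods as a weighted product, or equivalently as a weighted sum of log-likelihoods. The sketch below shows the rule in its generic form for one pixel; the weights, the likelihood values, and the decision by direct comparison of the unnormalized pooled scores are illustrative assumptions rather than the formulation of [18].

```python
import math


def log_opinion_pool(likelihoods, weights, floor=1e-300):
    """Weighted sum of log-likelihoods (log of the weighted geometric mean)."""
    return sum(w * math.log(max(p, floor)) for p, w in zip(likelihoods, weights))


def classify(fg_likelihoods, bg_likelihoods, weights):
    """Return True if the pooled foreground score exceeds the background one."""
    fg_score = log_opinion_pool(fg_likelihoods, weights)
    bg_score = log_opinion_pool(bg_likelihoods, weights)
    return fg_score > bg_score


# Likelihoods given as [color model, depth model] for one pixel.
fg = [0.8, 0.3]
bg = [0.2, 0.6]
equal_trust = classify(fg, bg, weights=[0.5, 0.5])
depth_trust = classify(fg, bg, weights=[0.1, 0.9])
```

The example shows how the weights shift the decision: with equal trust the strong color evidence wins, whereas trusting depth more lets the depth model's preference for background prevail.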
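The thresholded log-likelihood ratio can be sketched with a one-dimensional Gaussian kernel density estimate over depth samples only. This is a strong simplification of the five-dimensional quantized space used in [64]; the bandwidth, the zero threshold, the sign convention (foreground over background), and the function names are assumptions.

```python
import math


def kde(value, samples, bandwidth):
    """Gaussian kernel density estimate of `value` from 1D samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((value - s) / bandwidth) ** 2) for s in samples
    )


def is_foreground(value, bg_samples, fg_samples,
                  bandwidth=10.0, threshold=0.0, eps=1e-12):
    """Threshold the log-likelihood ratio log p(v | fg) - log p(v | bg)."""
    llr = (math.log(kde(value, fg_samples, bandwidth) + eps)
           - math.log(kde(value, bg_samples, bandwidth) + eps))
    return llr > threshold


bg_depths = [2000.0, 2005.0, 1995.0]   # wall at about 2 m
fg_depths = [1200.0, 1210.0]           # person at about 1.2 m
near = is_foreground(1205.0, bg_depths, fg_depths)
far = is_foreground(2001.0, bg_depths, fg_depths)
```

The small constant `eps` keeps the logarithm finite when a value lies far from every sample of one class, where the kernel sum underflows to zero.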
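Combining two models through the product of their likelihoods amounts to treating the gray-level and depth observations as independent. A minimal sketch, with hypothetical mixture parameters and function names, follows; it illustrates the combination rule only, not the learning of the mixtures in [66].

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_likelihood(x, components):
    """components: list of (weight, mean, variance), weights summing to 1."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def joint_background_likelihood(gray, depth, gray_mog, depth_mog):
    """Product of the two per-cue likelihoods (independence assumption)."""
    return mixture_likelihood(gray, gray_mog) * mixture_likelihood(depth, depth_mog)


gray_mog = [(0.7, 120.0, 25.0), (0.3, 200.0, 100.0)]   # assumed parameters
depth_mog = [(1.0, 2000.0, 400.0)]                     # assumed parameters

# Same gray level as the background, but once at background depth and once
# half a metre in front of it (a color-camouflaged foreground object).
at_background = joint_background_likelihood(121.0, 2005.0, gray_mog, depth_mog)
camouflaged = joint_background_likelihood(121.0, 1500.0, gray_mog, depth_mog)
```

Because the likelihoods are multiplied, a very low value from either cue is enough to make the joint background likelihood low, which is how depth resolves color camouflage in this scheme.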
Cinque et al. [68] adapt to Kinect data a matting method previously proposed for ToF data. It is based on Otsu thresholding of the depth map and region growing for labeling pixel clusters, assembled to create an alpha map. Edge improvement is obtained by a logical OR of the current map with those of the previous four frames.
Huang et al. [69] propose a post-processing framework based on an initial segmentation obtained solely from depth data. Two post-processing steps are proposed: a foreground hole detection step and an object boundary refinement step. For foreground hole detection, they obtain two weak decisions based on the region color cue and the contour contrast cue, adaptively fused according to their corresponding reliability. For object boundary refinement, they apply a weighted fusion of a motion-probability-weighted temporal prior, color likelihood, and smoothness constraints. Therefore, besides handling challenges such as color camouflage, illumination variations, and shadows, the method maintains spatial and temporal consistency of the obtained segmentation, a fundamental issue for the telepresence target application.
Javed et al. [70] propose the DEOR-PCA (Depth Extended Online RPCA) method for background subtraction using binocular cameras.It consists of four main stages: disparity estimation, background modeling, integration, and spatiotemporal constraints.Initially, the range information is obtained using disparity estimation algorithms on a set of stereo pairs.Then, OR-PCA is applied to each of color left image and related disparity image to model the background, separately.The integration adds low-rank and sparse components obtained via OR-PCA to recover the background model and foreground mask from each image.The reconstructed sparse matrix is then thresholded to get the binary foreground mask.Finally, spatiotemporal constraints are applied to remove from the foreground mask most of the noise due to depth information.
In [71], Nguyen et al. present a method where, as an initial offline step, noise is removed from depth data based on a noise model.Background subtraction is then solved by combining RGB and depth features, both modeled by MoG.The fundamental idea in their combination strategy is that when depth measurement is reliable, the segmentation is mainly based on depth information; otherwise, RGB is used as an alternative.
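The combination strategy described above can be reduced, in its simplest form, to a per-pixel switch between the two decisions. The sketch below assumes that a reliability flag is already available for each depth pixel (how it is derived is not specified here) and uses hypothetical function names; the actual method of [71] is described as relying mainly, not exclusively, on depth where it is reliable.

```python
def fuse_decision(depth_fg, color_fg, depth_reliable):
    """Use the depth decision where depth is reliable, the color one elsewhere."""
    return depth_fg if depth_reliable else color_fg


def fuse_masks(depth_mask, color_mask, reliability_mask):
    """Apply the per-pixel switch to whole masks."""
    return [
        [fuse_decision(d, c, r) for d, c, r in zip(d_row, c_row, r_row)]
        for d_row, c_row, r_row in zip(depth_mask, color_mask, reliability_mask)
    ]


depth_mask = [[True, False, False]]
color_mask = [[False, False, True]]
reliable = [[True, True, False]]      # third pixel: e.g., a depth hole
fused = fuse_masks(depth_mask, color_mask, reliable)
```

In the third pixel the depth decision is ignored because the measurement is flagged as unreliable, so the color model alone determines the label.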
Sun et al. [72] propose a MoG model for color information and a single Gaussian model for depth, together with a color-depth consistency check mechanism driving the updating of the two models.However, experimental results aim at evaluating background estimation, rather than background subtraction.
Tian et al. [73] propose a depth-weighted group-wise PCA-based algorithm, named DG-PCA.The background/foreground separation problem is formulated as a weighted L 2,1 -norm PCA problem with depth-based group sparsity being introduced.Dynamic groups are first generated solely based on depth, and then an iterative solution using depth to define the weights in L 2,1 -norm is developed.The method handles moving cameras through global motion compensation.
In [19], Huang et al. present a method where two separate color and depth background models are based on ViBe, and the two resulting foreground masks are fused by weighted average.The result is further adaptively refined, taking into account multi-cue information (color, depth, and edges) and spatiotemporal consistency (in the neighborhood of foreground pixels in the actual and previous frames).
In [20], Liang et al. propose a method to segment foreground objects based on color and depth data independently, using an existing background subtraction method (in the experiments they choose MoG).They focus on refining the inaccurate results through supervised learning.They extract several features from the source color and depth data in the foreground areas.These features are fed to two independent classifiers (in the experiments they choose random forest [83]) to obtain a better foreground detection.
In [74], Palmero et al. propose a baseline algorithm for human body segmentation using color, depth, and thermal information.To reduce the spatial search space in subsequent steps, the preliminary step is background subtraction, achieved in the depth domain using MoG.
In the method proposed by Chacon et al. [75], named SAFBS (Self-Adapting Fuzzy Background Subtraction), background subtraction is based on two background models for color (in the HSV color space) and depth, providing an initial foreground segmentation by frame differencing. A fuzzy algorithm computes the membership value of each pixel to background or foreground, based on color and depth differences, as well as depth similarity, between the current frame and the background. Temporal and spatial smoothing of the membership values is applied to reduce false alarms due to depth flickering and imprecise measurements around object contours, respectively. The classification result is then employed to update the two background models, using automatically computed learning rates.
De Gregorio and Giordano [76] adapt an existing background modeling method using the WiSARD weightless neural network (WNN) [41] to the domain of RGBD videos. Color and depth video streams are synchronously but separately modeled by WNNs at each pixel, using a set of initial video frames for network training. In the detection phase, classification is interleaved with re-training on current colors whenever pixels are detected as belonging to the background. Finally, the obtained output masks are combined by an OR operator and post-processed by morphological filtering.
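The final combination step (OR of the two masks followed by morphological filtering) can be sketched as follows. The specific filter used in [76] is not detailed here, so a 3x3 binary opening is assumed purely for illustration; all function names are hypothetical.

```python
def mask_or(a, b):
    """Pixel-wise OR of two binary masks given as lists of rows."""
    return [[int(x or y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def _neighborhood(mask, r, c):
    """Values in the 3x3 window around (r, c), clipped at the image border."""
    h, w = len(mask), len(mask[0])
    return [mask[i][j]
            for i in range(max(0, r - 1), min(h, r + 2))
            for j in range(max(0, c - 1), min(w, c + 2))]


def erode(mask):
    """A pixel survives only if its whole (clipped) 3x3 window is foreground."""
    return [[int(all(_neighborhood(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def dilate(mask):
    """A pixel is set if any pixel in its (clipped) 3x3 window is foreground."""
    return [[int(any(_neighborhood(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def opening(mask):
    """Erosion followed by dilation: removes blobs smaller than the window."""
    return dilate(erode(mask))


# Color mask: a 3x3 object. Depth mask: one isolated noisy pixel.
color_mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(6)]
              for r in range(6)]
depth_mask = [[1 if (r, c) == (5, 5) else 0 for c in range(6)]
              for r in range(6)]
cleaned = opening(mask_or(color_mask, depth_mask))
```

The OR keeps every detection from either cue, which favors recall; the opening then discards the isolated pixel while preserving the 3x3 object.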
Javed et al. [77] investigate the performance of an online RPCA-based method, named SRPCA, for moving object detection using RGBD videos. The algorithm consists of three main stages: (i) detection of dynamic images to create an input dynamic sequence by discarding motionless video frames; (ii) computation of spatiotemporal graph Laplacians; and (iii) application of RPCA to incorporate the preceding two steps for the separation of background and foreground components. In the experiments, the algorithm is tested by using only intensity, only RGB, and RGBD features, leading to the surprising conclusion that best results are achieved using only intensity features.
The algorithm proposed by Maddalena and Petrosino [78], named RGBD-SOBS, is based on two background models for color and depth information, exploiting a self-organizing neural background model previously adopted for RGB videos [84]. The resulting color and depth detection masks are combined, not only to achieve the final results but also to better guide the selective model update procedure.
Minematsu et al. [79] propose an algorithm, named SCAD, based on a simple combination of the appearance (color) and depth information. The depth background is obtained using, for each pixel, its farthest depth value along the whole video, thus resulting in a batch algorithm. The likelihood of the appearance background is computed using texture-based and RGB-based background subtraction. To reduce false positives due to illumination changes, SCAD roughly detects foreground objects by using texture-based background subtraction. Then, it performs RGB-based background subtraction to improve the results of texture-based background subtraction. Finally, foreground masks are obtained using graph cuts to optimize an energy function which combines the two likelihoods of the background.
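The farthest-depth background can be sketched in a few lines. This is only an illustration of that single step, not the SCAD implementation: it assumes that missing depth measurements are encoded as 0 and that valid depths are positive, and the threshold `tau` and both function names are hypothetical.

```python
def farthest_depth_background(depth_frames, invalid=0):
    """Per-pixel maximum of the valid depth values over the whole video.

    Batch procedure: all frames must be available before the background
    exists. Pixels never measured keep the `invalid` marker.
    """
    h, w = len(depth_frames[0]), len(depth_frames[0][0])
    background = [[invalid] * w for _ in range(h)]
    for frame in depth_frames:
        for r in range(h):
            for c in range(w):
                d = frame[r][c]
                if d != invalid and d > background[r][c]:
                    background[r][c] = d
    return background


def depth_foreground(frame, background, tau, invalid=0):
    """Mark pixels that are closer than the background by more than tau."""
    return [[int(d != invalid and b != invalid and d < b - tau)
             for d, b in zip(f_row, b_row)]
            for f_row, b_row in zip(frame, background)]


frames = [[[2000, 0]], [[1500, 2200]], [[2000, 2100]]]
bg = farthest_depth_background(frames)
fg = depth_foreground([[1500, 0]], bg, tau=100)
```

Taking the farthest value relies on foreground objects always being closer to the sensor than the background they occlude, which is also why the resulting depth background needs no explicit initialization phase.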
Moyá-Alcover et al. [32] construct a scene background model using KDE with a three-dimensional Gaussian kernel. One of the dimensions models depth information, while the other two model normalized chromaticity coordinates. Missing depth data are modeled using a probabilistic strategy to distinguish pixels that belong to the background model from those which are due to foreground objects. Pixels that cannot be classified as background or foreground are placed in the undefined class. Two different implementations are obtained depending on whether undefined pixels are considered as background (GSM_UB) or foreground (GSM_UF), demonstrating their suitability for scenes where actions happen far from or close to the sensor, respectively.
Trabelsi et al. [80] propose the RGBD-KDE algorithm, also based on a scene background model using KDE, but using a two-dimensional Gaussian kernel. One of the dimensions models depth information, while the other models the intensity (average of RGB components). To reduce computational complexity, the Fast Gaussian Transform is adapted to the problem.
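A KDE background model over (intensity, depth) pairs can be sketched as below. This is a naive illustration, not RGBD-KDE itself: it assumes a product kernel with independent bandwidths `sigma_i` and `sigma_d` (a diagonal covariance), evaluates the sum directly rather than through any fast transform, and uses hypothetical function names.

```python
import math


def gaussian(x, sigma):
    """One-dimensional zero-mean Gaussian density."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def kde_background_prob(pixel, samples, sigma_i, sigma_d):
    """Kernel density estimate of observing `pixel` under the background.

    pixel   -- (intensity, depth) observed in the current frame
    samples -- list of (intensity, depth) background samples for this pixel
    """
    i, d = pixel
    total = sum(gaussian(i - si, sigma_i) * gaussian(d - sd, sigma_d)
                for si, sd in samples)
    return total / len(samples)


samples = [(100.0, 2000.0)]
p_same = kde_background_prob((100.0, 2000.0), samples, sigma_i=10.0, sigma_d=50.0)
p_closer = kde_background_prob((100.0, 1000.0), samples, sigma_i=10.0, sigma_d=50.0)
```

A pixel would be labeled foreground when its estimated probability falls below a threshold. In the example, the second pixel has the same intensity as the background (color camouflage) but a very different depth, so its probability collapses and it would still be detected.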
Zhou et al. [81] construct color and depth models based on ViBe and fuse the results in a weighting mechanism for the model update that relies on depth reliability.

Metrics
The usual way of evaluating the performance of background subtraction algorithms for moving object detection in videos is to compare, pixel by pixel, the computed foreground masks with the corresponding ground truth (GT) foreground masks [26,85] and compute suitable metrics. Metrics frequently adopted for evaluating background subtraction methods in RGBD videos are summarized in Table 2. Here, we report their name (column Name), abbreviation (column Acronym), definition (column Computed as), and whether they should be minimized (↓) or maximized (↑) to have more accurate results (column Better if). All these metrics are defined in terms of the total number of True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) pixels in the whole video. Most of the metrics reported in Table 2 are frequently used for evaluating background subtraction methods in RGB videos [26,85]. The exception is $Si_{\partial\Omega}$, specifically adopted to analyze the errors close to the object boundaries $\partial\Omega$, where depth data is usually very imprecise. In [17], $\partial\Omega$ is defined as an image region made of pixels surrounding the ground truth object boundary and having a distance from it of at most 10 pixels.
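As a sketch, the count-based metrics named in this survey (Recall, Specificity, FPR, FNR, PWC, Precision, and F-measure) can be computed as follows. Table 2 is not reproduced here, so the formulas below are the standard definitions and are assumed to match it; returning `None` for a zero denominator is a choice made for this example, and the function names are hypothetical.

```python
def _ratio(num, den):
    """Return num / den, or None when the ratio is undefined."""
    return num / den if den else None


def bs_metrics(tp, fp, fn, tn):
    """Standard pixel-count metrics for a foreground mask evaluation."""
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    if recall is None or precision is None or precision + recall == 0:
        f_measure = None
    else:
        f_measure = 2.0 * precision * recall / (precision + recall)
    return {
        "Recall": recall,                                      # higher is better
        "Specificity": _ratio(tn, tn + fp),                    # higher is better
        "FPR": _ratio(fp, fp + tn),                            # lower is better
        "FNR": _ratio(fn, tp + fn),                            # lower is better
        "PWC": _ratio(100.0 * (fn + fp), tp + fp + fn + tn),   # lower is better
        "Precision": precision,                                # higher is better
        "F-measure": f_measure,                                # higher is better
    }


regular = bs_metrics(tp=80, fp=20, fn=20, tn=880)
no_foreground = bs_metrics(tp=0, fp=5, fn=0, tn=995)
```

The second call mimics a video whose ground truth contains no foreground pixels: Recall and F-measure come out undefined, while Specificity, FPR, and PWC remain meaningful.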
Where more than one metric is considered, overall metrics to rank the accuracy of the compared methods are also proposed by some authors [17,32], based on the rankings achieved by the methods according to each of the metrics.
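One simple way to build such an overall ranking is to average, for each method, the ranks it obtains under the individual metrics. The sketch below is a generic average-rank scheme and is not claimed to reproduce the exact procedures of [17,32]; the function name, the data layout, and the tie handling (ties are broken by input order) are assumptions of this example.

```python
def average_ranks(scores, higher_is_better):
    """Average per-metric rank of each method (rank 1 is best).

    scores           -- {method: {metric: value}}
    higher_is_better -- {metric: True if the metric should be maximized}
    """
    methods = list(scores)
    totals = {m: 0.0 for m in methods}
    for metric, maximize in higher_is_better.items():
        ordered = sorted(methods, key=lambda m: scores[m][metric],
                         reverse=maximize)
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    n_metrics = len(higher_is_better)
    return {m: totals[m] / n_metrics for m in methods}


# Made-up values for three hypothetical methods, for illustration only.
scores = {
    "A": {"F-measure": 0.90, "PWC": 1.0},
    "B": {"F-measure": 0.80, "PWC": 2.0},
    "C": {"F-measure": 0.85, "PWC": 0.5},
}
ranks = average_ranks(scores, {"F-measure": True, "PWC": False})
```

Lower average rank means better overall standing; in the toy data, methods A and C tie because each wins on one metric.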

Datasets
Several RGBD datasets exist for different tasks, including object detection and tracking, object and scene recognition, human activity analysis, 3D-simultaneous localization and mapping (SLAM), and hand gesture recognition (e.g., see surveys in [24,86,87]). However, depending on the application they have been devised for, they can include single RGBD images instead of videos, or they can supply GTs in the form of bounding boxes, 3D geometries, camera trajectories, 6DOF poses, or dense multi-class labels, rather than GT foreground masks.
In Table 3, we summarize some publicly available RGBD datasets suitable for background subtraction that include videos and, in some cases, GT foreground masks. Specifically, we report their acronym, website, and reference publication (column Name & Refs.), the source for depth data (column Source), whether or not they also provide GT foreground masks (column GT masks), the number of videos they include (column No. of videos), some RGBD background subtraction methods adopting them for their evaluation (column Adopted by), and the main application they have been devised for (column Main application).
The GSM dataset [32] includes seven different sequences designed to test some of the main problems in scene modeling when both color and depth information are used: color camouflage, depth camouflage, color shadows, smooth and sudden illumination changes, and bootstrapping. Each sequence is provided with some hand-labeled GT foreground masks. All the sequences are also included in the SBM-RGBD dataset [33] and accompanied by 56 GT foreground masks.
The Kinect dataset [90] contains nine single person sequences, recorded with a Kinect camera, to show depth and color camouflage situations that are prone to errors in color-depth scenarios.
The MICA-FALL dataset [71] contains RGBD videos for the analysis of human activities, mainly fall detection. Two scenarios are considered for capturing activities that happen at the center field of view of one of the four Kinect sensors or at the cross-view of two or more sensors. Besides color and depth data, accelerometer information and the coordinates of 20 skeleton joints are provided for every frame.
The MULTIVISION dataset consists of two different sets of sequences for the objective evaluation of background subtraction algorithms based on depth information as well as color images. The first set (MULTIVISION Stereo [15]) consists of four sequences recorded by stereo cameras, combined with three different disparity estimation algorithms [103][104][105]. The sequences are devised to test color saturation, color and depth camouflage, color shadows, low lighting, flickering lights, and sudden illumination changes. The second set (MULTIVISION Kinect [58]) consists of four sequences recorded by a Kinect camera, devised to test out of sensor range depth data, color and depth camouflage, flickering lights, and sudden illumination changes. For all the sequences, some frames have been hand-segmented to provide GT foreground masks. The four MULTIVISION Kinect sequences are also included in the SBM-RGBD dataset [33] and accompanied by 294 GT foreground masks.
The Princeton Tracking Benchmark dataset [95] includes 100 videos covering many realistic cases, such as deformable objects, moving camera, different occlusion conditions, and a variety of cluttered backgrounds. The GTs are manual annotations in the form of bounding boxes drawn around the objects on each frame. One of the sequences (namely, sequence bear_front) is also included in the SBM-RGBD dataset [33] and accompanied by 15 GT foreground masks.
The RGB-D Object Detection dataset [17] includes four different sequences of indoor environments, acquired with a Kinect camera, that contain different demanding situations, such as color and depth camouflage or cast shadows. For each sequence, a hand-labeled ground truth is provided to test foreground/background segmentation algorithms. All the sequences, suitably subdivided and reorganized, are also included in the SBM-RGBD dataset [33] and accompanied by more than 1100 GT foreground masks.
The RGB-D People dataset [98] is devoted to evaluating people detection and tracking algorithms for robotics, interactive systems, and intelligent vehicles. It includes more than 3000 RGBD frames acquired in a university hall from three vertically mounted Kinect sensors. The data contains walking and standing persons seen from different orientations and with different levels of occlusions. Regarding the ground truth, all frames are annotated manually to contain bounding boxes in the 2D depth image space and the visibility status of subjects. Unfortunately, the GT foreground masks built and used in [62] are not available.
The SBM-RGBD dataset [33] is a publicly available benchmarking framework specifically designed to evaluate and compare scene background modeling methods for moving object detection on RGBD videos. It involves the most extensive RGBD video dataset ever made for this specific purpose and also includes videos coming from other datasets, namely, GSM [32], MULTIVISION [58], Princeton Tracking Benchmark [95], RGB-D Object Detection dataset [17], and UR Fall Detection Dataset [106,107]. The 33 videos acquired by Kinect cameras span seven categories, selected to include diverse scene background modeling challenges for moving object detection: illumination changes, color and depth camouflage, intermittent motion, out of sensor depth range, color and depth shadows, and bootstrapping. Depth images are already synchronized and registered with the corresponding color images by projecting the depth map onto the color image, allowing a color-depth pixel correspondence. For each sequence, pixels that have no color-depth correspondence (due to the different centers of the color and depth cameras) are signaled in a binary Region-of-Interest (ROI) image and are excluded from the evaluation.
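Excluding pixels outside the ROI amounts to skipping them when the confusion counts are accumulated. The following is a minimal sketch under the assumption that the predicted mask, the GT mask, and the ROI are binary images of equal size stored as lists of rows; it is not the benchmark's own evaluation code, and the function name is hypothetical.

```python
def confusion_counts(pred, gt, roi):
    """Count TP, FP, FN, TN over the pixels where the ROI is nonzero."""
    tp = fp = fn = tn = 0
    for p_row, g_row, r_row in zip(pred, gt, roi):
        for p, g, r in zip(p_row, g_row, r_row):
            if not r:
                continue  # no color-depth correspondence: ignore the pixel
            if p and g:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


pred = [[1, 1, 0, 0]]
gt = [[1, 0, 1, 0]]
roi = [[1, 1, 1, 0]]
counts = confusion_counts(pred, gt, roi)
```

For a whole video, the four counts would be summed over all frames that have a GT mask before any metric is computed.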
Other publicly available RGBD video datasets are worth mentioning, being equipped with pixel-wise GT foreground masks, which are devoted to specific applications.These include the BIWI RGBD-ID dataset [108,109] and the IPG dataset [110,111], targeted to people re-identification, and the VAP Trimodal People Segmentation Dataset [74,112], that contains videos captured by thermal, depth, and color sensors, devoted to human body segmentation.

Comparisons
Due to the public availability of data, GTs, and results obtained by existing background subtraction algorithms handling RGBD data, five of the RGBD datasets described in Section 5 have been adopted by several authors for benchmarking new algorithms for the problem. Here, we summarize and compare the published results.

Comparisons on the MULTIVISION Kinect Dataset
Performance comparisons on the MULTIVISION Kinect dataset are reported in Table 4. Here, values for the DECB and the CB4D algorithms by Fernandez et al. [58] and the Codebook algorithm using only color (CB) and only depth (CB-D) are those reported in [58]. Values for the RGBD-KDE algorithm by Trabelsi et al. [80] and the KDE algorithm using only color (C-KDE) and only depth (D-KDE) are those reported in [80]. Values for the RSBS (Random Sampling-based Background Subtraction) algorithm by Huang et al. [19] and the SAFBS algorithm by Chacon et al. [75] are those reported by their authors. It can be observed that, in general, depth alone (i.e., CB-D and D-KDE) achieves better results than color alone (i.e., CB and C-KDE), being insensitive to illumination variations (e.g., in sequences ChairBox and Hallway) and color camouflage (e.g., in sequence Hallway). The exception is the case of depth camouflage, as in sequence Wall (see Figure 2). For all the sequences, the combined use of color and depth information generally achieves comparable or better performance.

Comparisons on the MULTIVISION Stereo Dataset
Performance comparisons on the MULTIVISION Stereo dataset are reported in Table 5. Here, values for the DECB-LF algorithm by Fernandez et al. [15], the DECB and the CB4D algorithms by Fernandez et al. [58], and for the Codebook algorithm using only color (CB) and only depth (CB-D) are those reported in [15]. Values for the RGBD-KDE algorithm by Trabelsi et al. [80] and the KDE algorithm using only color (C-KDE) and only depth (D-KDE) are those reported in [80]. Values for the DEOR-PCA algorithm by Javed et al. [70] are those reported by the same authors.
It can be observed that, for all the videos, the combined use of both color and depth information (i.e., DEOR-PCA, DECB, DECB-LF, and RGBD-KDE methods) achieves better results than those obtained by color alone (i.e., CB and C-KDE methods) or depth alone (i.e., CB-D and D-KDE methods). Moreover, the difficulty in estimating and discriminating disparities in case of flickering lights (e.g., in LCDScreen and LabDoor videos) and in case of depth camouflage (e.g., in Crossing video) leads depth alone-based methods to obtain worse results as compared to color alone-based methods. Only for sequence Suitcase (see Figure 3), where the main issue is color camouflage, depth alone achieves better results than color alone, due to the high accuracy of the estimated depth information.

Table 5. Performance results of various background subtraction methods on the RGBD videos of the MULTIVISION Stereo dataset, using depth from three disparity estimation algorithms (Var [103], Phase [104], and SGBM [105]) or using only color (RGB). $F_1$ and $\sigma_{F_1}$ are the mean and the standard deviation over four GT masks for each video. In boldface, the best results for each metric and each video.

Comparisons on the RGB-D Object Detection Dataset
Performance comparisons on the RGB-D Object Detection dataset are reported in Table 6. Here, values for the two weak color and depth classifiers (CL_C and CL_D) and the weighted color and depth classifier (CL_W) by Camplani and Salgado [17], the four-dimensional MoG model (MoG4D) by Gordon et al. [42], the combined RGB and depth ViBe model (ViBeRGB+D) by Leens et al. [47], and the combined RGB and depth MoG model (MoGRGB+D) by Stormer et al. [48] are those reported in [17]. Values for the RGBD-KDE algorithm by Trabelsi et al. [80] and the KDE algorithm using only color (C-KDE) and only depth (D-KDE) are those reported in [80]. Values for the AMDF (Adaptive Multi-Cue Decision Fusion) algorithm by Huang et al. [69], the RFBS (Refinement Framework for Background Subtraction) algorithm by Liang et al. [20], the EC-RGBD algorithm by Nguyen et al. [71], the enhanced classifier (MoG-RegPRE) by Camplani et al. [62], the GSM_UB and GSM_UF algorithms by Moyá et al. [32], and the SAFBS algorithm by Chacon et al. [75] are those reported by the related authors.
Good performance can be achieved for color camouflage (ColCamSeq) and shadows (ShSeq), as well as for sequence GenSeq (see Figure 4), which combines different issues (color shadows, color and depth camouflage, and noisy depth data). On the other hand, depth camouflage (DCamSeq) seems to be a problem for most of the methods using depth.

Comparisons on the GSM Dataset
Performance comparisons on the GSM dataset are reported in Table 7. Here, values for the GSM_UB and GSM_UF algorithms by Moyá et al. [32] are those reported on the dataset website. Values for the RGBD-KDE algorithm by Trabelsi et al. [80] and the KDE algorithm using only color (C-KDE) and only depth (D-KDE) are those reported in [80]. It can be observed that the compared methods based on the combination of color and depth information robustly deal with all the issues related to RGBD data: intermittent object motion (Sleeping-ds), illumination changes (TimeOfDay-ds and LightSwitch-ds), color camouflage (Cespatx-ds), depth camouflage (Despatx-ds, see Figure 5), color and depth shadows (Shadows-ds), and bootstrapping (Bootstraping-ds). It should be pointed out that, in the case of the TimeOfDay-ds and Ls-ds sequences, the performance analysis should be based on Specificity, FPR, FNR, and PWC, rather than on the other three metrics. Indeed, these sequences contain no foreground objects at all, since they are meant to verify that no false positives are detected under varying illumination conditions. This leads to having no positive cases in the ground truths and, consequently, to undefined values of Precision, Recall, and F-measure. While for GSM_UB and GSM_UF values in these undefined cases are set to zero, a different handling must have been adopted for the other compared methods.

Comparisons on the SBM-RGBD Dataset
Performance comparisons on the SBM-RGBD dataset are reported in Tables 8 and 9. Here, values for the RGBD-SOBS and RGB-SOBS algorithms by Maddalena and Petrosino [78], the SRPCA algorithm by Javed et al. [77], the AvgM-D algorithm by Li and Wang [113], the Kim algorithm by Younghee Kim [114], the SCAD algorithm by Minematsu et al. [79], the cwisardH+ algorithm by De Gregorio and Giordano [76], and the MFCN algorithm by Zeng et al. [102] are those reported by the related authors. All the performance measures have been computed using the complete set of GTs and are available at http://rgbd2017.na.icar.cnr.it/SBM-RGBDchallengeResults.html [115].
It can be observed that the deep learning-based MFCN algorithm almost always achieves the best results in all the video categories, in terms of all the metrics. This is certainly possible thanks to the availability of such a wide dataset to train the network. Several conclusions can be drawn for each of the considered challenges by observing the remaining results.
Bootstrapping can be a problem when using only color information, especially for selective background subtraction methods (e.g., RGB-SOBS), i.e., those that update the background model using only background information. Indeed, once a foreground object is erroneously included into the background model (e.g., due to inappropriate background initialization or to inaccurate segmentation of foreground objects), it will hardly be removed from the model, continuing to produce false negative results. The problem is even harder if some parts of the background are never shown during the sequences, as happens in most of the videos of the Bootstrapping category. Indeed, in these cases, even the best-performing background initialization methods [116,117] fail and only alternative techniques (e.g., inpainting) can be adopted to recover missing data [118]. Nonetheless, depth information seems to be beneficial for addressing this challenge, as reported in Table 8, where accurate results are achieved by most of the methods that exploit depth information.
As expected, all the methods that exploit depth information achieve high accuracy in case of color camouflage and illumination changes. In the latter case, it should be pointed out that, since this video category includes the two TimeOfDay-ds and Ls-ds sequences of the GSM dataset (without any foreground object), the performance analysis should be based on Specificity, FPR, FNR, and PWC, rather than on the other three metrics (see Section 6.4).
Depth can also be beneficial for detecting and properly handling cases of intermittent motion. Indeed, foreground objects can be easily identified based on their depth, which is lower than that of the background, even when they remain stationary for long time periods. Methods that explicitly exploit this characteristic succeed in handling cases of removed and abandoned objects, achieving high accuracy.
Overall, shadows do not seem to pose a strong challenge to most of the methods. Indeed, depth shadows due to moving objects cause some undefined depth values, generally close to the object contours, but these can be handled based on motion. Color shadows can be handled either by exploiting depth information, which is insensitive to this challenge, or through color shadow detection techniques when only color information is taken into account.
Depth camouflage and out of range (see Figure 6) are among the most challenging issues, at least when information on color is disregarded or not properly combined with depth. Indeed, even though the accuracy of most of the methods is moderately high, several false negatives are produced.

Summary of the Findings and Open Issues
From the reported comparisons, it can be argued that, generally, most of the issues related to RGB data may be solved by accurate depth information, being insensitive to scene color and illumination conditions (color camouflage, illumination changes, and color shadows) and providing geometric information of the scene (bootstrapping and intermittent motion). This does not hold in cases where depth measurements or estimation are not sufficiently accurate. However, the combined use of both color and depth information was shown to achieve better results than those obtained by color alone or depth alone. Indeed, a clever combination of this information enables the exploitation of depth benefits, at the same time overcoming the issues arising from possible depth inaccuracies, by exploiting the complementary color information.
Open issues remain when depth and color information fail to be complementary. As an example, it has been shown that an object moving on a wall can be detected based on its color, rather than its camouflaged depth. However, what if the object has the same color as the wall? Future research directions should certainly investigate these cases.

Conclusions and Future Research Directions
The paper provides a comprehensive review of methods which exploit RGBD data for moving object detection based on background subtraction, a building block for many computer vision applications. The main issues and the existing literature are briefly reviewed. Moreover, the metrics commonly used for the evaluation of these methods and the datasets that are publicly available are summarized. Finally, the most extensive comparison of the existing methods on some datasets is provided, which can serve as a reference for future methods aiming at overcoming the highlighted open issues.

Figure 1 .
Background modeling issues related to depth data (highlighted by red ellipses). (a) Depth camouflage; (b) Depth shadows; (c) Specular materials; (d) Out of sensor range.


2. Time-of-Flight (ToF) [29]: ToF cameras are active sensors that determine the per-pixel depth value by measuring the time taken by IR light to travel to the object and back to the camera. A ToF camera provides more accurate depth images than a stereo vision system, but it is very expensive and limited to low image resolution. The measured depth map can be noisy both spatially and temporally, and noise is content-dependent and hence difficult to remove by traditional filtering methods.
3. Structured light [30]: A structured light sensor consists of an IR emitter and an IR camera. The emitter projects an IR speckle pattern onto the scene; the camera captures the reflected pattern and correlates it against a stored reference pattern on a plane, providing the depth values. Well known examples include the Microsoft Kinect version 1 (in the following simply named Kinect)

Table 1 .
Summary of background subtraction methods for RGBD videos.

Table 2 .
Metrics frequently adopted for evaluating background subtraction methods in RGBD videos.

Table 3 .
Some publicly available RGBD datasets for background subtraction.

Table 4 .
Performance results of various background subtraction methods on the RGBD videos of the MULTIVISION Kinect dataset. $F_1$ and $\sigma_{F_1}$ are the mean and the standard deviation over four GT masks for each video. In boldface, the best results for each metric and each sequence.


Table 6 .
Performance results of various background subtraction methods on the RGBD videos of the RGB-D Object Detection dataset. In boldface, the best results for each metric and each sequence.

Table 7 .
Performance results of various background subtraction methods on the RGBD videos of the GSM dataset. In boldface, the best results for each metric and each sequence.

Table 8 .
Average results of various background subtraction methods for each category of the SBM-RGBD dataset (Part 1). In boldface, the best results for each metric and each category.