1. Introduction
In image processing, background subtraction is widely used in object detection and tracking approaches [1,2,3], with broad applications such as human tracking [4,5], face recognition [6], traffic management [7], and surveillance systems [8,9]. Background subtraction is typically a preprocessing phase used to identify and differentiate the foreground pixels (representing objects of interest) from the background pixels (representing uninteresting information). The background pixels can then be subtracted or removed from the original image, leaving only the foreground pixels, which reduces storage space requirements, reduces computational complexity, and improves the overall performance of downstream image processing techniques. Established 2D background subtraction approaches are based on static background segmentation methods [10], adaptive Gaussian mixture models [11,12], real-time codebook models [13,14], and independent component analysis-based techniques [15,16]. Although advanced 2D background subtraction techniques can handle gradual illumination changes and repetitive movements in the background, they perform poorly in the presence of shadows or foreground regions with colors similar to the background [17]. The release of commercially available and inexpensive depth sensors such as the Microsoft Kinect opened new doors for improved background subtraction techniques because of the availability of additional depth data associated with each pixel of color data. Depth sensor (RGB-D) systems are more robust for background detection problems compared to classic color-based systems because depth data are largely invariant to color, texture, shape, and lighting [18,19]. As a result of the advantages of combined depth and color information, data from RGB-D sensors have also been widely used in background segmentation methods (e.g., [17,20,21]), which have been thoroughly reviewed (see [22] for a comprehensive review of different 3D background subtraction methods and their capabilities in solving common background segmentation challenges).
Background subtraction methods are generally used to analyze individual images and identify the foreground by first estimating a reference background that is developed from historical information obtained from videos or sequences of images. Therefore, the classic application of 2D background subtraction is separating dynamic or moving objects from a relatively static or slow-changing background scene. However, RGB images only contain intensity information and spatial information that is largely restricted to the two dimensions of the image that are perpendicular to the camera’s perspective. Accordingly, identifying the boundaries between objects, or conversely identifying object interactions or contact, is limited mainly to detectable intensity differences. In applications that utilize RGB-D data, interactions or contact between objects and object spatial relationships can be more directly measured.
Furthermore, registration between depth and RGB data allows traditional 2D background subtraction approaches to be supplemented with additional depth-based approaches (e.g., [23,24]), and further allows RGB-D data to be represented as 3D point clouds [25]. For example, reliably identifying and removing static background components such as roads and walls before modeling the background can result in both improved background subtraction and improved foreground segmentation using both 2D and RGB-D data; these improvements, to our knowledge, have only been realized through approaches that require additional data and computation, such as motion estimation between consecutive frames (e.g., [26]). Identifying static background components suffers from the same limitations as modeling the entire background using 2D data, suggesting that little benefit is afforded by first removing these background objects and then modeling the background. However, with RGB-D data, parametrically modeled objects (e.g., planes, spheres, cones, cylinders, and cubes) are far more reliably detectable. As a result, researchers have attempted to segment or remove large planar surfaces (e.g., walls, ceilings, and floors) as a preprocessing or fundamental step before all other algorithms (e.g., [27,28,29]).
In general, large planar surfaces comprise a large percentage of points within each frame of RGB-D data captured in indoor environments. However, outside specific applications that seek to identify significant surfaces (e.g., ground plane detection [30]), large planar surfaces are not often the objects of interest in 3D computer vision applications. Notably, smaller planar surfaces (e.g., tabletops, chair seats and backs, desks) are more likely to be of interest than larger surfaces at the boundaries of the scene. Furthermore, the large bounding surfaces can decrease the performance of 3D computer vision algorithms (e.g., object segmentation and tracking) by cluttering their search space. Therefore, a robust technique for removing points that belong to surfaces at the outer boundaries of the RGB-D data can significantly reduce the search space and bring three main benefits to computer vision systems: improving downstream results, speeding up downstream processes, and reducing the overall size of the point clouds. In the context of background subtraction applications, we refer to these large surfaces, which may include points from multiple planes or objects (e.g., points that represent a wall, window, or curtain) at the extremes of the point clouds, as super-surfaces. Our objective is to develop a robust technique for removing super-surfaces from RGB-D data captured from indoor environments and represented as point clouds. Our intention is that our super-surface removal technique will function as a pre-processing step, improving existing computer vision techniques and reducing storage requirements, without removing any foreground data.
1.1. Related Work
Point cloud data are classified as either organized or unorganized datasets. Organized datasets are represented by a matrix-like structure, where the data (i.e., voxels) are accessible by index, usually according to their spatial or geometric relationship. Unlike organized point clouds, the relationships between adjacent voxels of unorganized datasets are unknown, and the data are simply stored as a one-dimensional, unsorted array. Data from RGB-D sensors are typically stored as organized point clouds, where indices are referenced according to the spatial resolution of the sensor. However, point cloud pre-processing steps such as down-sampling often produce unorganized point clouds. While it is trivial to convert an organized point cloud to an unorganized one, the reverse is much more complicated and costly. Since spatial relationships between voxels are preserved, plane detection is less challenging for organized point clouds. However, computer vision approaches designed for unorganized datasets are universal [31] (i.e., they can also be used for organized datasets), so a robust plane detection approach must work with unorganized datasets and not rely on the spatial relationships of voxels derived from their storage indices. In general, plane segmentation methods fall into three categories: model fitting-based methods, region growing-based methods, and clustering feature-based methods.
1.1.1. Model Fitting-Based Methods
Random Sample Consensus (RANSAC) [32] and the Hough transform [33] are the most commonly used model fitting-based methods for plane segmentation. The Hough transform is a voting technique for identifying objects that can be modeled parametrically, such as lines, planes, and spheres. Every point is transformed into a unique function (e.g., a sinusoid when modeling lines) in a discretized parameter space. Objects of interest can then be extracted by selecting the maximal intersections between the functions in the discretized parameter space, where the spatial tolerance for model fitting (e.g., to compensate for sensor resolution, noise, and object surface variations) can be accommodated by changing the resolution of the parameter space. The Hough transform has been successfully used for 3D plane segmentation in several publications (e.g., [34,35]). Unfortunately, although Hough transform-based methods can robustly segment 3D objects, they necessitate large amounts of memory and significant computational time [36], and their results depend significantly on the proper selection of segmentation parameters [37]. More importantly, the Hough transform is unable to discriminate between voxels that lie within a parameterized model (i.e., inliers) and outside the model (i.e., outliers), since spatial relationships are not preserved in the Hough parameter space. The result is that foreground points belonging to objects that are spatially close to the parameterized background model will often be associated with the model, and ultimately with the background scene.
The RANSAC algorithm begins with a random selection of data points used to estimate the corresponding model parameters (e.g., three points for a plane). Then, the remaining points are examined to determine how many of them are well approximated by the model. Finally, the RANSAC algorithm returns the model with the highest percentage of inliers within a fixed threshold (e.g., the orthogonal distance from the planar model). Many researchers have proposed RANSAC-based algorithms for 3D plane segmentation, such as [37,38,39].
Awwad et al. [37] proposed a RANSAC-based segmentation algorithm that first clusters the data points into small sections based on their normal vectors and then segments the planar surfaces. This implementation of RANSAC prevents the segmentation of spurious surfaces in the presence of parallel-gradual planes such as stairs. Chen et al. [38] developed an improved RANSAC algorithm through a novel localized sampling technique and a region growing-based approach. Their proposed method, intended to segment polyhedral rooftops from noisy Airborne Laser Scanning (ALS) point clouds, is based on the assumption that rooftops comprise only planar primitives. Li et al. [39] proposed an enhanced RANSAC algorithm based on Normal Distribution Transformation (NDT) cells to prevent segmenting spurious planes. The algorithm considers each NDT cell rather than each point. After dividing the data points into a grid of NDT cells, a combination of the RANSAC algorithm and an iterative reweighted least-squares approach fits a plane in each cell. Finally, a connected-component approach extracts large planes and eliminates points that do not belong to planes. Although the proposed method can detect 3D planes more reliably and faster than standard RANSAC, it requires cell-size tuning for different datasets.
According to a performance comparison by Tarsha-Kurdi et al. [40], the RANSAC algorithm outperforms the Hough transform approach for 3D roof plane segmentation in terms of both speed and accuracy. However, RANSAC suffers from spurious plane detection in complex 3D indoor environments [39].
1.1.2. Region Growing-Based Methods
In general, region growing-based methods have two main stages: first, they pick a seed point; then, they merge the neighboring points or voxels that comply with predefined criteria (e.g., similar normal vectors). Several researchers have proposed point-based, voxel-based, and hybrid region growing techniques for 3D point cloud segmentation. Tóvári and Pfeifer [41] proposed a point-based region growing algorithm that merges adjacent points into a seed region based on their normal vectors and distance to the adjusting plane. Nurunnabi et al. [42] utilized the same criteria but with a different seed point selection approach and a better normal vector estimation.
Voxel-based region growing algorithms (e.g., [43,44]) improve the speed and robustness of point-based methods by processing the 3D point clouds voxel-wise. Xiao et al. [45] proposed a 3D plane segmentation method based on a hybrid region growing approach utilizing a subwindow and a single point as growth units. Although their technique is significantly faster than the point-based region growing approach, it was only intended for organized point clouds. Vo et al. [36] proposed a fast plane segmentation technique for urban environments. The method consists of two main stages: first, a coarse segmentation is achieved using an octree-based region growing approach; then, a point-based process refines the results by adding unassigned points into incomplete segments.
Region growing-based methods are easy to employ for 3D plane segmentation, primarily for organized point clouds. However, their results depend on the growing criteria, the seed point selection, and the texture or roughness of the planes [46]. Furthermore, they are not robust to occlusion, point density variation, and noise [47].
1.1.3. Clustering Feature-Based Methods
Clustering feature-based methods adopt a data clustering approach based on characteristics of planar surfaces, such as normal vector attributes. Filin [48] proposed a clustering method based on an attribute vector comprising a point's location, its tangent plane's parameters, and the height difference between the point and its adjacent points. In another work, Filin and Pfeifer [49] computed point features using a slope-adaptive neighborhood system and employed a mode-seeking algorithm to extract clusters. They then extended or merged these clusters with their adjacent points or clusters if they shared analogous standard deviations and surface parameters. Czerniawski et al. [28] applied a simple density-based clustering algorithm to a normal vector space (i.e., a Gaussian sphere). The dense clusters on the Gaussian sphere represent the directions perpendicular to large planes. Zhou et al. [50] proposed a clustering feature-based method for segmenting planes in terrestrial point clouds. First, they created a 4D parameter space using the planes' normal vectors and their distances to the origin. Then, they segmented the planar surfaces by applying the Iso Cluster unsupervised classification method.
Despite the efficiency of clustering feature-based methods, employing multi-dimensional features in large point clouds is computationally intensive [36]. Furthermore, these methods are sensitive to noise and outliers [51]. Moreover, clustering segmentation approaches cannot reliably segment edge points, as these points may have different feature vectors compared to the surface points.
1.2. Contributions
Several 3D plane segmentation methods can satisfactorily detect different planar surfaces for various computer vision applications. However, to our knowledge, no approaches have been developed specifically for bounding surface removal, particularly in complex environments: environments that are cluttered, and where the placement of a depth sensor is not ideal. Additionally, existing segmentation approaches generally segment foreground points that belong to parametrically modeled objects of interest (e.g., planes, spheres, cones, cylinders, and cubes), rather than with the intention of removing background points belonging to the bounding surfaces. Therefore, existing approaches can easily remove critical foreground objects (or portions of foreground objects), significantly impacting the segmentation accuracy of semantic information. To overcome these limitations, we propose a method of removing background bounding surfaces (i.e., super-surfaces, such as walls, windows, curtains, and floor). Our novel method is particularly suited to more challenging and cluttered indoor environments, where differentiating between foreground and background points is complicated. Accordingly, our objective is to develop a robust background super-surface removal method that can support a wide range of sensor heights relative to the ground (i.e., support varied sensor perspectives) for organized and unorganized point clouds. Additionally, our approach must ensure that foreground objects, and points belonging to those objects, are preserved during super-surface removal.
Our method significantly reduces the search space, and it can considerably reduce the size of 3D datasets, depending on the number and size of the super-surfaces in each point cloud. Furthermore, when used as a preprocessing step, our approach can improve the results and the running time of different 3D computer vision methods such as object recognition and tracking algorithms. The remainder of this paper is organized as follows. In the next section, we describe our proposed 3D super-surface removal method. In Section 3, we provide our experimental results and the evaluation of our proposed method. In Section 4, we present our discussion and future work, followed by conclusions in Section 5.
2. The Iterative Region-Based RANSAC
Our iterative region-based RANSAC (IR-RANSAC) method has five main steps, as illustrated in Figure 1. We begin with two preprocessing techniques, first down-sampling the raw point cloud and then removing noisy or outlying points in the depth map. Second, we divide the point cloud space into four overlapping local regions based on the current view of the sensor. Third, we segment a base plane in each of the four local regions. Fourth, we apply an iterative plane removal technique to all four local regions, segmenting and removing the super-surfaces. Finally, we cluster the remaining point cloud using the geometric relationship between groups of points, resulting in a final point cloud comprised only of clustered objects of interest.
2.1. Downsampling and Denoising
Since input point clouds are generally large due to the significant number of 3D points and associated color information, a downsampling method with low computational complexity can significantly reduce the running time of point cloud processing algorithms. Downsampling is typically achieved using either a random downsampling method [52] or a voxelized grid approach [53]. Although the former is more efficient, the latter better preserves the shape of the point cloud and exploits the geometric relationship of the underlying samples. Since we are predominantly concerned with preserving the underlying points that represent the true geometry of objects in the scene, we utilized a voxelized grid approach [53] that returns the centroid of all the points in each 3D voxel grid, with a leaf size of 0.1 cm. In this way, the downsampled point clouds still reflect the structure and maintain the geometric properties of the original point cloud while reducing the total number of points that need to be processed and stored.
Removing noisy points is a critical point cloud preprocessing task. Noisy or spurious points have two significant impacts on our approach. First, a noisy point cloud with false or spurious data points, including points outside of a scene's real boundaries (see Figure 2 for an example), can lead to an incorrect measurement of the overall bounding box containing the point cloud, resulting in the definition of incorrect local regions in our subsequent processing steps. Second, noisy points within the point cloud itself will effectively skew or change the geometry of the true objects. We therefore utilized a statistical outlier removal approach [54] that examines the k nearest neighbors of each point in each captured point cloud: if the average distance from a point to its k nearest neighbors is more than one standard deviation above the mean neighbor distance over the point cloud, the point is considered an outlier and removed. In this way, we remove points that are dissimilar from other points in their neighborhood. Together, these preprocessing steps decrease the number of points in the point cloud, reducing the downstream processing time and increasing the accuracy of our process.
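A hedged sketch of this statistical outlier removal step is shown below: a point is dropped when its mean distance to its k nearest neighbours exceeds the global mean by more than `std_ratio` standard deviations. The values of `k` and `std_ratio` are illustrative defaults, not values taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_statistical_outliers(points: np.ndarray, k: int = 20,
                                std_ratio: float = 1.0) -> np.ndarray:
    tree = cKDTree(points)
    # Query k + 1 neighbours because the closest neighbour of every point is itself.
    dists, _ = tree.query(points, k=k + 1)
    mean_dist = dists[:, 1:].mean(axis=1)            # mean distance to the k true neighbours
    threshold = mean_dist.mean() + std_ratio * mean_dist.std()
    return points[mean_dist <= threshold]            # keep only inlier points
```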
2.2. Local Region Determination
Dividing our captured point clouds into four local regions of interest, based on the properties of our indoor environments, reduces the possibility of detecting foreground planes, increases computational efficiency, and leverages the likely spatial location of potential bounding surfaces. In this way, we exploit knowledge of the scene based on the known sensor perspective, while allowing for surface locations to vary relative to each other in different rooms. Further, these regions help ensure that foreground objects that may appear planar in composition (e.g., tables, beds) are preserved and differentiated from background bounding surfaces.
We partition the point cloud space into four overlapping local regions based on the current view of the sensor. First, we find the bounding values of the downsampled and denoised point cloud, where the minimum and maximum coordinates along each axis are the Euclidean extrema of the bounding box enclosing the point cloud, and the ranges between them are the Euclidean dimensions of the bounding box. Using the ranges defined in Table 1, we then determine the four local regions (see Figure 3 for a sample visualization of the local regions). We will use these four regions to identify, segment, and remove potential super-surfaces in each region.
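As an illustration only, the sketch below computes the bounding box and carves out four overlapping axis-aligned slabs (back, left, right, bottom). The 0.3 fraction of each dimension is a placeholder rather than the actual ranges from Table 1, and the axis convention (z as the optical axis, +y up) is an assumption about the sensor frame.

```python
import numpy as np

def local_regions(points: np.ndarray, frac: float = 0.3) -> dict:
    lo, hi = points.min(axis=0), points.max(axis=0)   # bounding-box extrema
    dx, dy, dz = hi - lo                              # Euclidean dimensions of the box
    return {
        "back":   points[points[:, 2] >= hi[2] - frac * dz],  # far slab along the optical (z) axis
        "left":   points[points[:, 0] <= lo[0] + frac * dx],
        "right":  points[points[:, 0] >= hi[0] - frac * dx],
        "bottom": points[points[:, 1] <= lo[1] + frac * dy],  # assumes +y points up in the sensor frame
    }
```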
Since our approach must be independent of any prior knowledge about the geometry of the indoor environment and both the location and perspective of the sensor, the four initial local regions may not include all the points that are actually part of the super-surfaces (e.g., Figure 3d, where parts of the floor are not included within the local region). Selecting larger initial regions would increase the likelihood that all true points are within the regions, but would also increase the likelihood of including points belonging to foreground objects near the super-surfaces (e.g., beds and sofas). To resolve this issue, we implement conservative local regions and extend these four regions after base plane segmentation (see Section 2.4).
2.3. Base Plane Segmentation
We utilize the RANSAC algorithm [32] to segment the largest planes with a specific orientation in each of the local regions. All segmented plane candidates containing more points than a learned threshold, defined as a proportion of the total number of points in the point cloud, are verified as base planes and stored for use in the next step (Section 2.4). Planes containing fewer points than this threshold may be associated with key objects or small bounding planes and are dealt with in subsequent processing steps. Because the threshold is set as a proportion of the total points, it is adaptive to the size of the point cloud.
The RANSAC algorithm iteratively and randomly samples three voxels as a minimal subset to generate a hypothesis plane. These three points define two vectors, and their cross product is the normal vector n = (a, b, c) of the hypothesis plane ax + by + cz + d = 0. Therefore, the three plane parameters (a, b, c) are computed directly, and d can be solved by substituting one of the sampled points. In each iteration, the algorithm computes the distance between all the remaining data points and the plane and then counts the number of points within a distance threshold d_t of the plane. Finally, RANSAC returns the plane with the highest percentage of inliers.
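A compact sketch of one such RANSAC plane search is given below: fit a plane to three random points, then count inliers within the distance threshold d_t named above. The function and variable names are ours, not the paper's.

```python
import numpy as np

def ransac_plane(points: np.ndarray, n_trials: int, d_t: float,
                 rng: np.random.Generator):
    """Return (plane [a, b, c, d], boolean inlier mask) for the best hypothesis."""
    best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_trials):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)            # normal of the hypothesis plane
        norm = np.linalg.norm(n)
        if norm < 1e-9:                           # degenerate (collinear) sample
            continue
        n /= norm
        d = -n.dot(p1)                            # plane: n.x + d = 0
        inliers = np.abs(points @ n + d) <= d_t   # orthogonal distance test
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = np.append(n, d), inliers
    return best_model, best_inliers
```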
We add an orientation constraint to the standard RANSAC algorithm (orientation-based RANSAC) so that we assign priority to segmented planes with the highest percentage of inliers that have an expected orientation relative to the local regions. To do this, we define an initial reference vector, aligned with the sensor axes, for each of the back, left, right, and bottom regions (Figure 4). Further, we define a maximum allowable angular variation (in degrees) between the normal vectors of the planes and our reference vectors to allow for sensor perspective variations.
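The orientation constraint amounts to a simple angle test, sketched below: a candidate plane is accepted only if its normal lies within `max_angle_deg` of the region's reference vector. The reference vectors depend on the sensor's axis convention and are passed in by the caller; the names are illustrative.

```python
import numpy as np

def orientation_ok(plane_normal: np.ndarray, reference: np.ndarray,
                   max_angle_deg: float) -> bool:
    # Absolute value because the sign of a plane normal is ambiguous.
    cosang = abs(np.dot(plane_normal, reference)
                 / (np.linalg.norm(plane_normal) * np.linalg.norm(reference)))
    angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
    return angle <= max_angle_deg
```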
The maximum number of iterations N required for convergence of the RANSAC algorithm can be approximated as in Equation (1) [55]. Convergence depends on the number of samples s (s = 3 for plane fitting), the target success probability p (e.g., p = 0.99), and the outlier ratio ε. Because there is no prior knowledge about the underlying outlier ratio, it is difficult to approximate the number of RANSAC iterations in advance. Based on our experimental results, and due to the iterative design of IR-RANSAC, the algorithm works appropriately when the maximum number of trials N is set to a fixed proportion of the data points within each local region. In this study, we set N to 2% of the data points (e.g., if the back region contains 60,000 points, the maximum number of trials will be set to 1200). Increasing the success probability p and the number of trials N improves the robustness of the output at the expense of additional computation:

N = log(1 − p) / log(1 − (1 − ε)^s)   (1)
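A small worked example of this standard bound, under the assumption of a 50% outlier ratio, is shown below; the parameter values are illustrative.

```python
import math

def max_trials(p: float, outlier_ratio: float, s: int = 3) -> int:
    """Equation (1): trials needed to hit an all-inlier sample with probability p."""
    return math.ceil(math.log(1 - p) / math.log(1 - (1 - outlier_ratio) ** s))

print(max_trials(p=0.99, outlier_ratio=0.5))   # ~35 trials for a 50% outlier ratio
```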
2.4. Iterative Plane Removal
In a complex indoor environment, bounding surfaces such as walls, windows, and curtains are difficult to fit to a single plane. Increasing the distance threshold d_t includes more points near the bounding planes, but simultaneously increases the chance of including data from important objects (e.g., the human body) within the extended threshold. Furthermore, the input point cloud can be unorganized, which means that nearest-neighbor operations, such as region growing, are not very efficient for segmenting the rest of the super-surfaces. We introduce a novel iterative plane removal technique to segment and remove super-surfaces from a point cloud while minimizing the likelihood of including points that belong to foreground objects.
First, we remove the verified base planes associated with each local region. Then, we expand the local regions according to the ranges in Table 2 to completely encompass the areas containing the super-surfaces. Next, we apply the orientation-based RANSAC in each of the extended regions iteratively. The number of iterations depends on the complexity of the indoor environment; based on our experimental results, three iterations are adequate for a challenging indoor environment. In each iteration, segmented planes must be parallel to the base plane of the current region. Hence, we utilize the normal vectors of the base planes as the reference vectors, and we set the maximum allowable angular variation to 5°. Finally, because employing the orientation-based RANSAC in a larger region increases the probability of a false segmentation, we validate the segmented planes in each iteration.
The segmented planes are validated based on their distance D from their base planes, computed as D = |a·x0 + b·y0 + c·z0 + d| / sqrt(a² + b² + c²), where (a, b, c, d) are the parameters of the base plane and (x0, y0, z0) are the coordinates of a point on the segmented plane. To make the technique robust to high levels of noise, we substitute the distance of a single point to the base plane with the mean of all the segmented points' distances from the base plane.
If this mean distance is less than a threshold (specified in cm), the plane is removed from its region. Otherwise, the plane is not part of the super-surfaces and is temporarily removed from the remaining point cloud. There are two advantages to temporarily removing a falsely segmented plane. First, it prevents RANSAC from segmenting the false plane again. Second, it reduces the current region for the next iteration.
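The validation test can be sketched as a few lines of code: compare the mean orthogonal distance of the segmented points to the base plane against a distance threshold. The function name and threshold value are ours, not the paper's.

```python
import numpy as np

def close_to_base(plane_pts: np.ndarray, base_plane: np.ndarray,
                  dist_thresh: float) -> bool:
    """True if the mean point-to-base-plane distance is within `dist_thresh`."""
    a, b, c, d = base_plane
    dists = np.abs(plane_pts @ np.array([a, b, c]) + d) / np.linalg.norm([a, b, c])
    return dists.mean() <= dist_thresh        # mean distance for robustness to noise
```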
Figure 5 illustrates the output of the iterative plane removal in each iteration when applied to the back region of a point cloud. The green planes are verified and eliminated from the back region, as shown in Figure 5b,c,e. However, the segmented red plane is not verified and is temporarily removed from the point cloud, as shown in Figure 5d.
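Pulling the pieces of this subsection together, the skeleton below sketches iterative plane removal in one extended region, reusing the ransac_plane, orientation_ok, and close_to_base helpers sketched earlier. The control flow follows the description above (trials set to 2% of the region's points, a 5° angular tolerance, three iterations), but the structure and names are our assumptions, not the authors' implementation.

```python
import numpy as np

def iterative_plane_removal(region: np.ndarray, base_plane: np.ndarray,
                            d_t: float, dist_thresh: float,
                            n_iterations: int = 3, rng=None) -> np.ndarray:
    if rng is None:
        rng = np.random.default_rng()
    held_out = []                                           # temporarily removed false planes
    for _ in range(n_iterations):
        if len(region) < 3:
            break
        n_trials = max(1, int(0.02 * len(region)))          # 2% of the region's points
        model, inliers = ransac_plane(region, n_trials, d_t, rng)
        if model is None or not orientation_ok(model[:3], base_plane[:3], 5.0):
            continue
        plane_pts = region[inliers]
        region = region[~inliers]                           # take the plane out either way
        if not close_to_base(plane_pts, base_plane, dist_thresh):
            held_out.append(plane_pts)                      # false plane: removed only temporarily
    if held_out:
        region = np.vstack([region] + held_out)             # restore temporarily removed planes
    return region
```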
2.5. Euclidean Clustering Removal
In this step, we cluster the remaining point cloud based on Euclidean distance to remove irrelevant small segments and keep the objects of interest. First, we compute the Euclidean distance between each point and its neighbors. Then, we group neighboring points into a cluster if the distance between any point in an object and an adjacent point is less than a distance threshold (in cm), finishing when all the clusters are determined. Finally, we remove all small clusters containing fewer points than a minimum cluster size.
Figure 6 illustrates an example of Euclidean clustering removal following the iterative plane removal.
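A hedged sketch of this final clustering step using a KD-tree flood fill is given below: neighbouring points closer than `eps` join the same cluster, and clusters with fewer than `min_points` members are discarded. Both parameters correspond to the distance and minimum-size thresholds described above, but their values and the function name are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def euclidean_cluster_filter(points: np.ndarray, eps: float,
                             min_points: int) -> np.ndarray:
    tree = cKDTree(points)
    labels = np.full(len(points), -1, dtype=int)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = current
        while stack:                                        # flood-fill the connected component
            idx = stack.pop()
            for nb in tree.query_ball_point(points[idx], eps):
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    _, counts = np.unique(labels, return_counts=True)
    keep = np.isin(labels, np.flatnonzero(counts >= min_points))
    return points[keep]                                     # only clusters of interest remain
```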
4. Discussion
Our evaluation results for all four evaluation metrics achieved with IR-RANSAC (the first column of Figure 10) were higher and much more consistent than those obtained using standard RANSAC (the second column of Figure 10). Additionally, almost all four scores for IR-RANSAC had a lower standard deviation compared to those of standard RANSAC. Overall, all of our evaluation results for IR-RANSAC were statistically significantly better than those of standard RANSAC using a two-sample t-test. Most notably, the F1 score, which represents the overall performance of the approaches, was higher for IR-RANSAC than for standard RANSAC.
In all experiments, our proposed IR-RANSAC method obtained average values above 92% for specificity, 96% for precision, 90% for recall, and 94% for F1 score. Comparably, standard RANSAC achieved average values between 33% and 85%, 71% and 92%, 78% and 92%, and 74% and 90% for specificity, precision, recall, and F1 score, respectively. As illustrated in the second column of Figure 10, there are also many sharp fluctuations in the standard RANSAC evaluation results. The F1 score fluctuated from 82% to 95% and from 76% to 86% for Room-1 and Room-3, respectively. However, it remained nearly steady at 74% for the Room-2 dataset. The standard RANSAC approach demonstrated very low specificity for Room-2 and Room-3, which contain many pieces of planar furniture.
IR-RANSAC takes about three times as long to execute as the standard RANSAC approach on the same datasets. This is expected, though, since our IR-RANSAC method invokes the RANSAC algorithm four times more often than the standard RANSAC plane removal. Theoretically, a faster version of IR-RANSAC with only one iteration in each local region could be implemented; it would be faster than the standard RANSAC approach because it runs the RANSAC algorithm in smaller regions.
Our evaluation results support that IR-RANSAC is a robust and reliable method for removing the bounding super-surfaces of a complex 3D indoor environment, with better performance than traditional RANSAC in all respects except execution time. Our results suggest that IR-RANSAC removes background boundary surfaces effectively without removing foreground data, and can considerably reduce the size of 3D point clouds.
The subjects of future research are speeding up the IR-RANSAC algorithm and improving its results in much more complex 3D indoor environments. To improve our algorithm's results, we need to reduce its reliance on the Euclidean clustering technique, eliminate the small challenging regions that belong to a super-surface but have a different normal vector (e.g., the highlighted regions around the window in Figure 8f), and implement self-adaptive parameters to make the algorithm robust to different indoor environments and sensor data.
5. Conclusions
We have presented a 3D bounding surface removal technique, IR-RANSAC, that is particularly suited to more challenging and cluttered indoor environments. IR-RANSAC supports varied sensor perspectives for organized and unorganized point clouds, and it considerably reduces the size of 3D datasets. Moreover, IR-RANSAC can improve the results and the running time of different 3D computer vision methods by reducing their search space. After downsampling and denoising a point cloud captured from an oblique view, we divide the point cloud space into four overlapping local regions, exploiting knowledge of the current view of the sensor, and segment a base plane in each of the four regions. We then expand our search space around the base plane in each region, and iteratively segment and remove the remaining points belonging to each super-surface. Finally, we cluster the remaining point cloud using the geometric relationship between groups of points, resulting in a final point cloud comprised only of clustered objects of interest. We evaluated the performance of IR-RANSAC in terms of four metrics: specificity, precision, recall, and F1 score, on the three generated datasets acquired from an arbitrary oblique-view location and representing different indoor environments. Our experiments demonstrated that our proposed method is a robust super-surface removal and size reduction technique for complex 3D indoor environments. Experimentally, IR-RANSAC outperformed traditional RANSAC segmentation in all categories, supporting our efforts to prioritize the inclusion of all bounding points in each super-surface, while minimizing inclusion of points that belong to foreground objects.
Our intention was to develop a robust method of bounding surface segmentation—maximizing inclusion of bounding surface points and minimizing inclusion of foreground points. Our experimental data suggest that by conceptualizing bounding surfaces (e.g., walls and floor) as unique and different than other large surfaces that belong to foreground objects, it is possible to improve on methods of segmenting and removing these unwanted bounding surfaces specifically. By removing these bounding surfaces and preserving foreground objects, we considerably reduce the size of the resulting dataset, substantially improving downstream storage and processing.