Spatial Topological Relation Analysis for Cluttered Scenes

Spatial topological relations are the foundation of robot operation planning in unstructured and cluttered scenes. Defining complex relations and dealing with incomplete point clouds captured from object surfaces are the most difficult challenges in spatial topological relation analysis. In this paper, we present a classification of spatial topological relations that divides the intersection space into six parts. To improve accuracy and reduce computing time, convex hulls are used to represent the boundaries of objects, and the spatial topological relations are determined by the categories of the points in the point clouds. We verified our method on the IIIT RGBD dataset and the YCB benchmarks. The results demonstrate a clear improvement in accuracy over previous methods.


Introduction
To perform tasks autonomously in unstructured and cluttered scenes, an intelligent robot should be able to perceive complex spatial information effectively and plan a policy to complete its tasks [1][2][3]. Planning a reasonable operation sequence by analyzing the spatial information can prevent fragile objects from slipping or being crushed [4][5][6][7]. For example, to take a dish out of a pile of dishes with spoons on it, the spoons need to be removed beforehand (Figure 1a); to put a lemon into a bowl, we should take out the cans first (Figure 1b); to stack blocks into a tower, we should build the base first (Figure 1c). These tasks are a great challenge for a robot, since the sequence of operations must be reasonable. However, most current robot tasks are limited to the operation of isolated objects on a plane [8], based on template matching [9] or feature extraction methods [10]. In the scenes mentioned above, the operating space of the robot is limited. The spatial relations between objects are complex, including physical contact, overlap, and occlusion [11]. Unsafe operations may cause fragile objects to fall to the ground, break into pieces, or suffer other unexpected damage [12]. Therefore, it is necessary to analyze the spatial relations between cluttered objects and make appropriate decisions for autonomous and safe robotic manipulation [13]. Spatial topological relations are one of the important theories for describing the relations between objects [14]. They describe the adjacency and association relations among spatial points, lines, and surfaces [15]. A correct understanding of the spatial topological relations between objects is essential for the successful execution of robot actions [16]. The behavior decisions of a robot depend on the current state of the spatial topological relations [17].
On the one hand, the spatial topological relations need to be analyzed accurately so that the robot performs the next operation correctly. On the other hand, for the sake of work efficiency, the decision-making process cannot take too much time. Thus, the analysis of spatial topological relations requires high accuracy and a short computing time. Recently, many works have analyzed spatial topological relations in different ways [18][19][20][21], but three challenges remain in cluttered scenes: (1) because of occlusion, the robot can only obtain a partial point cloud of the object surface from its vision sensors; (2) the relations between objects are complex and difficult to categorize; (3) a small deviation of the vision sensor may cause misclassification.
In this paper, we improve the classification of spatial topological relations by dividing the cluttered space. Based on the distribution of point clouds in the different spaces, the spatial topological relations can be defined as cross, within, partial within, contain, partial contain, touch, and disjoint. These relations can reasonably describe the relation between any two point clouds in space. Meanwhile, the contours of partial point clouds are described by convex hulls, and the directed distance is used to determine the spatial relations of points. The main contributions of this work are as follows:

1. We simplified the widely used model of spatial topological relations and proposed a definition in a particular formalism, which improves the accuracy of spatial topological relation analysis in cluttered scenes.

2. We proposed a method that determines the spatial topological relation from an approximate expression of the object boundary and the spatial relations of points on cluttered objects. A deviation factor is employed to improve the robustness of the algorithm.

Related Works
In the past decades, researchers have done a lot of work on spatial topological relation analysis [22][23][24]. The earliest research was applied in geographic information systems (GIS). Many studies focus on the formalism of spatial topological relations. The 9-intersection model (9IM) proposed by Egenhofer is one of the most widely used methods to represent spatial topological relations [25]. It is based on point set topology and qualitatively describes the topological relations between targets. The 9IM defines the relations between objects as cross, touch, overlap, equal, within, contains, disjoint, and intersects, using the interior, boundary, and exterior of two objects. Based on the 9IM, Clementini [26] extended the model with the dimensions of the intersections, yielding the dimensionally extended 9-intersection model (DE-9IM). Although the 9IM and the DE-9IM can represent the spatial topological relations between objects, they require complete point, line, and surface information of the objects. However, the depth image obtained by robot vision often contains only part of the object surfaces. Without complete point clouds, the 9-intersection model cannot accurately represent the relations between objects.
Many studies focus on feature extraction in cluttered scenes. Silberman [27] proposed an integer program formulation that infers physical support relations by combining several cues, including geometric structure from depth, object attributes, and data-driven priors. Under the Manhattan world assumption, this method can infer simple support relations between objects in a complex indoor scene with cluttered and stacked objects. However, it ignores overlapping situations, which might cause misclassification. On this basis, Panda [28] proposed a mapping inference and linear programming method that extends the support relations between different entities in the scene and infers relation types such as "support from below", "support from the side", or "containment". The support relations are expressed in a tree structure, called the support tree, and the support sequence of objects is obtained by traversing the tree in reverse hierarchical order. This representation is reasonable for scene understanding and provides a research foundation for robot operation planning. Kartmann [29] inferred physically plausible support relations between objects without any prior knowledge of their physical properties (mass distribution and friction coefficient). Through virtual force analysis, the uncertainty of the support relations is taken into account in the prediction. Jia [30] used RGB-D data as input, fitted three-dimensional boxes to the surfaces of objects, extracted features from the bounding box representation, and designed an energy function that evaluates the quality of the segmentation and the stability of the scene based on the support relations. This method represents and classifies objects for 3D scene understanding.
Some research utilized learning methods, such as support vector machines (SVMs) or artificial neural networks (ANNs), to infer the spatial topological relations. Rosman [31] used SVMs for the first time to describe the topology of two-dimensional spatial relations of objects, but their research is only applicable to simple objects without occlusion. Mojtahedzadeh [32] described a fast method to extract the support relations between pairs of objects in contact with each other by using the static balance principle. In addition to SVMs, they also use artificial neural networks (ANNs) and random forests to approximate the probability distribution of the relations between objects. However, this method only considers entities with convex polyhedral shapes (box, cylinder, and barrel), which limits its practical application. Zhuo [33] introduces an approach to infer support relations from a single image by Markov random field (MRF), integer linear programming, and SVMs framework.
To summarize, the 9IM and the DE-9IM require complete surfaces, so they are not suitable for cluttered scenes. The methods based on geometric features pay attention to the nearby point, line, and surface features of an object, but they ignore the overall features of objects. The learning-based methods use the generalization ability of ANNs or SVMs to infer the spatial relations between objects from a large amount of annotated data, but they ignore the physical features of objects. Therefore, adapting the 9IM to cluttered scenes and proposing a method to evaluate the adapted model are the keys to improving the accuracy and adaptability of spatial topological relation analysis.

Methods
Our model of spatial topological relations is based on the space division, including interior, boundary, and exterior, so that the spatial topological relations of any two three-dimensional objects can be described in detail by the intersection of point sets.

Definitions of Spatial Topological Relations
Let A ⊂ R³ be a point cloud obtained from a depth camera. The convex hull of point cloud A is the smallest convex set that contains all the points of A. We present formal definitions of the boundary, interior, and exterior of A as follows:

Definition 1. The boundary of a point cloud A, denoted by ∂A, is the convex hull of A.

Based on the definitions of A°, ∂A, A⁻, and A⁺ (the interior, boundary, exterior, and closure of A, respectively), we have the following proposition:

The spatial topological relations between two point clouds A and B can be described by the relation between A and B°, ∂B, or B⁻, together with the relation between B and A°, ∂A, or A⁻. There are: (1) the parts of A located in the interior of ∂B, denoted by A ∩ B°; (2) the parts of A located on ∂B (A ∩ ∂B); (3) the parts of A located in the exterior of ∂B (A ∩ B⁻); and, symmetrically for B, (4) A° ∩ B, (5) ∂A ∩ B, and (6) A⁻ ∩ B. Therefore, the spatial topological relation from A to B can be represented as a matrix R(A, B):

\[
R(A, B) =
\begin{bmatrix}
A \cap B^{\circ} & A \cap \partial B & A \cap B^{-} \\
A^{\circ} \cap B & \partial A \cap B & A^{-} \cap B
\end{bmatrix}
\tag{1}
\]

which is called the 6-intersection model (6IM). The 6IM considers the intersections between points and boundaries and ignores the intersections between regions. Therefore, the 6IM can be used in scenes where only the point clouds on the surfaces of objects are available. Based on the 6IM, we define all the spatial topological relations from A to B, as shown in Figure 2. The yellow model is point cloud A and the green one is point cloud B. Taking all the circumstances into account, we define 7 spatial topological relations, i.e., cross, within, partial within, contain, partial contain, touch, and disjoint. We give the definition of each spatial topological relation as follows.

If A ∩ B° = ¬∅, then we have A⁺ ∩ B° = ¬∅ because of Proposition 1, which means that the closure of A has a common part with the interior of B. Similarly, A° ∩ B = ¬∅ means that the closure of B has a common part with the interior of A. So, the closures of A and B have common parts with the opposite interiors. In addition, if A ∩ B° = ¬∅, then we have A⁺ ∩ B⁺ = ¬∅, which means that the two point clouds have common closures.
We define that A crosses B if they have common closures and these closures intersect the opposite interiors. Contain is the reverse of within, and we define it by swapping the roles of A and B in Definition 5. Different from Definition 9, at least one of A ∩ ∂B and ∂A ∩ B is empty. So, if the intersection between the boundaries of A and B is not empty and the relation from A to B is not touch, we define it as disjoint.
We realize that the relations do not exist in some cases. To eliminate non-existent relations, we have the following proposition.

Proof. Due to Definition 3 and
The formal definition of the spatial topological relations is given by six different specifications with the values empty (∅), non-empty (¬∅), or arbitrary (*), as shown in Table 1. Each relation except disjoint corresponds to one rule, while disjoint covers three situations. Based on Proposition 2, we can identify the non-existent relations, which are listed in Table 2. Taken together, Tables 1 and 2 cover all cases, so the set of relations is complete.

Table 1. The definition of the spatial topological relations between two point clouds.

Table 2. Non-existent relations.
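As a rough illustration of the 6IM as a data structure, the following Python sketch stores the six intersections as booleans (True meaning non-empty). It assumes that every point of A has already been labeled relative to the convex hull of B, and vice versa. The names `SixIM` and `six_im` and the string labels are hypothetical and are not taken from the paper.

```python
from typing import Iterable, NamedTuple


class SixIM(NamedTuple):
    """Six intersections of the 6IM; True means the intersection is non-empty."""
    a_in_b_interior: bool   # A ∩ B°
    a_on_b_boundary: bool   # A ∩ ∂B
    a_in_b_exterior: bool   # A ∩ B⁻
    b_in_a_interior: bool   # A° ∩ B
    b_on_a_boundary: bool   # ∂A ∩ B
    b_in_a_exterior: bool   # A⁻ ∩ B


LABELS = ("interior", "boundary", "exterior")


def six_im(labels_a_vs_b: Iterable[str], labels_b_vs_a: Iterable[str]) -> SixIM:
    """Build the matrix from per-point labels.

    labels_a_vs_b: label of each point of A relative to the convex hull of B.
    labels_b_vs_a: label of each point of B relative to the convex hull of A.
    """
    seen_a, seen_b = set(labels_a_vs_b), set(labels_b_vs_a)
    return SixIM(*(name in seen for seen in (seen_a, seen_b) for name in LABELS))


# Example: some points of A lie inside the hull of B, the rest outside,
# while every point of B lies outside the hull of A.
r = six_im(["interior", "exterior", "interior"], ["exterior", "exterior"])
print(r)
```

Mapping such a matrix to one of the seven relation names would follow the rules of Table 1, which this sketch deliberately leaves out.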

Classification Criteria of Spatial Topological Relations
To infer the spatial topological relations with the 6IM, the position of every point in the point cloud of one object must be determined relative to the convex hull of the other object. If a point lies in the convex hull of a point cloud, it must also lie in the axis-aligned bounding box (AABB) of that point cloud. Checking whether a point is in the AABB is easy and fast, because it only requires testing whether the point is within the range of the AABB along the three coordinate axes, so this check speeds up the classification of points. If a point is not in the AABB, it is not in the convex hull either. If a point is in the AABB, the next step is to determine whether it is in the convex hull. The AABB itself has been employed to represent the boundary of objects [16]. However, it is inappropriate for spatial topological relation analysis, because the AABB is only a coarse approximation of the object boundary. The convex hull approximates the boundary more closely and performs better than the AABB.
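The AABB pre-check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names and the optional `margin` parameter (a tolerance that could be set to the deviation factor) are assumptions made for this example.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]


def aabb(points: Iterable[Point]) -> Box:
    """Return the (min corner, max corner) of the axis-aligned bounding box."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def in_aabb(p: Point, box: Box, margin: float = 0.0) -> bool:
    """Range test along the three coordinate axes.

    `margin` is a hypothetical tolerance that widens the box slightly.
    """
    lo, hi = box
    return all(lo[i] - margin <= p[i] <= hi[i] + margin for i in range(3))


cloud_b = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (0.5, 1.0, 1.5)]
box_b = aabb(cloud_b)

# A point outside the AABB cannot be inside the convex hull, so the
# more expensive per-face test can be skipped for it.
for a in [(0.5, 1.0, 1.0), (1.5, 1.0, 1.0)]:
    print(a, "needs hull test" if in_aabb(a, box_b) else "outside hull")
```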
Take a cube-shaped convex hull as an example, as shown in Figure 3. Let a_i (i = 1, ..., 5) be the points of point cloud A, and let b_j (j = 1, ..., 8) be the vertices of the convex hull of point cloud B. Every three points b_k1, b_k2, b_k3 (k = 1, ..., 12) chosen from b_j constitute a triangular face of the convex hull, and n_k is an outer normal vector of that face. To determine the spatial topological relation between a point and the convex hull, we iterate over the faces of the convex hull and determine whether the point is on the negative or positive side of each face. The classical method from computational geometry is employed to determine whether a point is inside a convex hull [34]: a_i is inside the convex hull if n_k·(a_i − b_k1) < 0 for all k, outside the convex hull if n_k·(a_i − b_k1) > 0 for some k, and on the boundary of the convex hull if n_k·(a_i − b_k1) ≤ 0 for all k with equality occurring at least once. Based on the distance formula, we define the directed distance from a point a_i to the plane b_k1 b_k2 b_k3 as

\[
d_{i,k} = \frac{\vec{n}_k \cdot (a_i - b_{k1})}{\lVert \vec{n}_k \rVert}
\tag{2}
\]

where n_k is an outer normal vector of the plane with ‖n_k‖ > 0, and d_{i,k} is the signed distance from point a_i to the plane b_k1 b_k2 b_k3. With the directed distance, the classical method is equivalent to

\[
a_i \text{ is }
\begin{cases}
\text{an interior point}, & \forall k \in K,\ d_{i,k} < 0 \\
\text{a boundary point}, & \forall k \in K,\ d_{i,k} \le 0 \ \text{and}\ \exists k \in K,\ d_{i,k} = 0 \\
\text{an exterior point}, & \exists k \in K,\ d_{i,k} > 0
\end{cases}
\tag{3}
\]

where K is the set of face indices. In the example of the cube-shaped convex hull, K = {k : k = 1, ..., 12}.
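The classical test can be illustrated on the cube example with a short, dependency-free Python sketch. It assumes that the hull is given as a list of triangles and that a point strictly inside the hull (here the centroid) is available to orient the normals outward. All function names are hypothetical, and a unit cube stands in for the convex hull of B.

```python
from typing import List, Tuple

Vec = Tuple[float, float, float]
Triangle = Tuple[Vec, Vec, Vec]


def sub(u: Vec, v: Vec) -> Vec:
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def dot(u: Vec, v: Vec) -> float:
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def cross(u: Vec, v: Vec) -> Vec:
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def directed_distance(a: Vec, tri: Triangle, inner: Vec) -> float:
    """Signed distance from a to the plane of tri; positive on the outer side."""
    b1, b2, b3 = tri
    n = cross(sub(b2, b1), sub(b3, b1))
    if dot(n, sub(inner, b1)) > 0:          # flip so that n points outward
        n = (-n[0], -n[1], -n[2])
    return dot(n, sub(a, b1)) / dot(n, n) ** 0.5


def classify_classical(a: Vec, faces: List[Triangle], inner: Vec) -> str:
    d = [directed_distance(a, f, inner) for f in faces]
    if any(x > 0 for x in d):
        return "exterior"
    if any(x == 0 for x in d):
        return "boundary"
    return "interior"


def unit_cube_faces() -> List[Triangle]:
    """The 12 triangular faces of the unit cube [0, 1]^3."""
    faces = []
    for axis in range(3):
        for val in (0.0, 1.0):
            corners = []
            for s, t in ((0, 0), (1, 0), (1, 1), (0, 1)):
                p = [0.0, 0.0, 0.0]
                p[axis] = val
                p[(axis + 1) % 3] = float(s)
                p[(axis + 2) % 3] = float(t)
                corners.append((p[0], p[1], p[2]))
            faces.append((corners[0], corners[1], corners[2]))
            faces.append((corners[0], corners[2], corners[3]))
    return faces


FACES = unit_cube_faces()
CENTER = (0.5, 0.5, 0.5)
for point in [(0.5, 0.5, 0.5), (0.5, 0.5, 1.0), (0.5, 0.5, 0.999), (0.5, 0.5, 1.5)]:
    print(point, classify_classical(point, FACES, CENTER))
```

Note that the point (0.5, 0.5, 0.999), which lies only 0.001 below the top face, is labeled interior by the exact test. This sensitivity to small deviations is what a tolerance on the boundary condition is meant to address.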
All the points of point cloud A can be classified by formula (3), as shown in Figure 4a. Based on the above descriptions, we relax the condition for boundary points by introducing a deviation factor to improve the robustness of our algorithm. By extending the condition from d_{i,k} = 0 to |d_{i,k}| ≤ δ, we have:

\[
a_i \text{ is }
\begin{cases}
\text{an interior point}, & \forall k \in K,\ d_{i,k} < -\delta \\
\text{a boundary point}, & \forall k \in K,\ d_{i,k} \le \delta \ \text{and}\ \exists k \in K,\ \lvert d_{i,k} \rvert \le \delta \\
\text{an exterior point}, & \exists k \in K,\ d_{i,k} > \delta
\end{cases}
\tag{4}
\]

where δ is the deviation factor. Generally, δ is a small positive value; the larger it is, the more points are classified as boundary points. Therefore, δ is usually set equal to the deviation of the point clouds.
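A minimal Python sketch of the relaxed test follows. It assumes the criterion reads: exterior if some d_{i,k} > δ, boundary if all d_{i,k} ≤ δ and at least one |d_{i,k}| ≤ δ, and interior otherwise. The function name is hypothetical, and the input is the list of directed distances from one point to all faces of the hull.

```python
from typing import Sequence


def classify_with_deviation(distances: Sequence[float], delta: float) -> str:
    """Classify one point from its directed distances to all hull faces.

    delta is the deviation factor; delta = 0 reproduces the exact test.
    """
    if any(d > delta for d in distances):
        return "exterior"                      # clearly outside some face
    if any(abs(d) <= delta for d in distances):
        return "boundary"                      # within delta of some face
    return "interior"                          # deeper than delta everywhere


# Directed distances of two noisy points near the top face of a unit cube:
# one slightly below the face, one slightly above it.
just_inside = [-0.5, -0.5, -0.5, -0.5, -0.999, -0.001]
just_outside = [-0.5, -0.5, -0.5, -0.5, -1.001, 0.001]

for name, d in (("just inside", just_inside), ("just outside", just_outside)):
    print(name, classify_with_deviation(d, 0.0), classify_with_deviation(d, 0.005))
```

With δ = 0 the two noisy points fall into different classes, whereas with δ = 0.005 both are treated as boundary points, which matches the intent of the deviation factor.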
With the relaxed criterion, the points a_i of point cloud A are classified as shown in Figure 4b. The classifications of a_1, a_2, and a_3 are the same as with the classical method. Unlike with the classical method, a_4 and a_5 are classified as boundary points, as they should be. If a point a_i is an interior point of the convex hull of B, then A ∩ B° = ¬∅. If a_i is a boundary point, then A ∩ ∂B = ¬∅. Otherwise, if a_i is an exterior point, then A ∩ B⁻ = ¬∅. We traverse the points of point cloud A and stop the loop as soon as A ∩ B°, A ∩ ∂B, and A ∩ B⁻ are all non-empty, because the remaining calculations cannot change the result. In this way, the spatial topological relations can be decided by the 6IM. The whole approach is described in Algorithm 1.
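The traversal with early termination can be sketched as follows. This is an illustrative reading of the loop described above, not a transcription of Algorithm 1. The function `row_of_six_im` and the toy one-dimensional classifier are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

P = TypeVar("P")


def row_of_six_im(points: Iterable[P],
                  classify: Callable[[P], str]) -> Tuple[Dict[str, bool], int]:
    """Fill the A-to-B row of the 6IM and report how many points were visited.

    classify(p) must return "interior", "boundary", or "exterior" for a point
    of A relative to the convex hull of B.
    """
    found = {"interior": False, "boundary": False, "exterior": False}
    visited = 0
    for p in points:
        visited += 1
        found[classify(p)] = True
        if all(found.values()):
            break  # all three intersections are non-empty; nothing can change
    return found, visited


def toy_classify(x: float) -> str:
    """Toy 1D stand-in for the hull test: the 'hull' is the interval [-1, 1]."""
    if abs(x) < 1.0:
        return "interior"
    if abs(x) == 1.0:
        return "boundary"
    return "exterior"


row, visited = row_of_six_im([0.0, 1.0, 2.0, 0.3, 5.0], toy_classify)
print(row, visited)
```

In the example, the loop stops after the third point because all three intersections are already non-empty, so the remaining two points are never classified.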

Experimental Results
To verify the accuracy and speed of the spatial topological relation analysis method described in Section 3, we conducted a series of experiments on point clouds generated from the International Institute of Information Technology (IIIT) RGBD dataset and the Yale-CMU-Berkeley (YCB) benchmarks.

IIIT RGBD Dataset
The IIIT RGBD dataset contains seven scenes with different types of physical interactions between objects, such as support from below, support from the side, and inclusion [35]. Because the objects occlude each other, each RGBD image captures only part of the object surfaces. Each RGBD image is segmented by semantic annotation. We reconstructed the point clouds of the objects from the RGBD images with the point cloud registration method [36]. Additionally, the convex hull of each point cloud was obtained with the Quickhull method [37].

YCB Benchmarks
The YCB benchmarks are designed for robot manipulation. The model set contains different kinds of objects, such as food, tools, and kitchen items [38]. Each object has the corresponding 3D model reconstructed from the merged point clouds with high precision. We chose several objects, made 7 scenes, and removed outliers by Point Cloud Library (PCL).

Results
The reconstructed point clouds of the IIIT RGBD dataset are shown in Figure 5a. The point cloud of the ground has little effect on the result, so only the part near the objects is retained. We then use a filter to remove outlier points. Thanks to the semantic annotation, each object can be reconstructed individually, so all the objects are separated from each other. We display all objects, including the ground, in Figure 5a. Scene 1 and scene 2 are similar: one box lies on the ground, with a second box leaning on it and a third box placed upright on it. In scene 3, one box lies on the ground with two books placed on it. Because the images were taken near the corner of the walls, photos could not be taken from some perspectives, and the images are insufficient for complete 3D reconstruction. As a result, the point clouds of the box and the books are incomplete. In scene 4, a box lies on the ground, and a hollow jar is placed on it, with a rod-like object inserted into the jar. Because the IIIT RGBD dataset provides too few RGBD images of this scene, only a small part of the objects can be reconstructed. In the 3D reconstruction of scene 4, the rod-like object seems to levitate in the air. Such a large missing area of the point cloud may cause the analysis to fail. In scene 5, a solid box leans on a hollow box, and a bar is placed in the hollow box. Unlike scene 4, scene 5 has plenty of RGBD images taken from multiple perspectives, so its point clouds are relatively complete. Scene 6 is quite cluttered, with five objects crowded into a limited space: a box is placed horizontally on the ground, three boxes lean on it, and one box is placed vertically on it. Scene 7 is similar to scene 6 but contains four objects: one box is placed horizontally on the ground, two boxes lean on it, and another box stands isolated on the ground.
Although scene 6 and scene 7 are cluttered and the objects occlude each other, the images cover all perspectives sufficiently, so we can reconstruct most of the point clouds on the surfaces of the objects.
Because all the RGBD images are taken by a depth camera, only the surfaces of the objects are captured. Although the point cloud of each object can be obtained separately, the 3D reconstruction is fragmentary and only contains surface points. The convex hulls of the point clouds are shown in Figure 5b. Because the point clouds are incomplete, these convex hulls are subsets of the convex hulls of the complete point clouds.
Despite the obstacles mentioned above, our method can still analyze the spatial location of points. With the method in Section 3.2, the boundary points and interior points of each point cloud were classified, as shown in Figure 5c. The boundary points and interior points between different objects are drawn in different colors. With the definitions in Section 3.1, we can decide the spatial topological relations between objects, as shown in Figure 5d. A red line represents touch, and the direction of the arrow represents the direction of the relation, A → B. A green line represents partial contain. The relation disjoint is so common that we omit it from the visualization. In scenes 1 and 2, the results show that the box lying on the ground touches the ground and all the other boxes. In scene 3, the two books touch each other and both touch the box. In scene 4, the box touches both the hollow jar and the ground. However, because of the missing point cloud data, only a small part of the hollow jar can be reconstructed. As a result, the point cloud of the rod-like object is far away from that of the hollow jar, and the spatial topological relation between them, which should be partial within, is misjudged as disjoint. In scene 5, the hollow box touches the ground and partially contains the blue bar. The solid box touches the ground, but the relation between the hollow box and the solid box, which should be touch, is misjudged as disjoint. The reason is that the point cloud is so sparse at the contact surface that few points of the solid box are located in the convex hull of the hollow box. Although scenes 6 and 7 are quite cluttered and messy, our method performed well on them: all the relations obtained by our method are identical to the ground truth. In scene 6, the lying box touches all the other boxes. In scene 7, the isolated box is disjoint from the other boxes.
We compared our method with the feature extraction method, the learning method, and the AABB method in terms of accuracy and computing time. Accuracy is the number of correctly classified relations divided by the number of all relations in one scene. A classification is counted as correct if it agrees with the ground truth from the dataset. In the feature extraction method [35], the definition of spatial topological relations differs from ours, so we map both "support from below" and "support from the side" to touch. If the relation is "containment", we count it as correct whether the ground truth is partial contain or contain. For the learning method [31], we obtained the contact point networks of the point clouds with SVMs and classified the relations with the k-means method. The AABB method uses the AABB instead of the convex hull to represent the boundary of objects and is otherwise the same as our method. The computing time is the time needed to analyze all the relations in one scene. The results are shown in Table 3. In every scene, our method performed clearly better than the feature extraction method [35] in both accuracy and computing time. There are 55 relations in the 7 scenes. Our method correctly analyzed 53 relations and took 131.8 s in total. By comparison, the feature extraction method, the learning method, and the AABB method correctly analyzed 41, 40, and 36 relations and took 1478.2, 231.1, and 17.7 s, respectively. The average accuracy of our method on the IIIT RGBD dataset is 96.4%, which is 21.9 percentage points higher than the 74.5% of the feature extraction method, 23.7 percentage points higher than the 72.7% of the learning method, and 30.9 percentage points higher than the 65.5% of the AABB method. In addition, the average time of our method is 2.4 s per relation, which is faster than the 26.9 s of the feature extraction method and the 4.2 s of the learning method.
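The averages quoted above can be reproduced from the totals with a few lines of Python. The sketch assumes that the average time is the total time divided by the 55 relations, which is the reading that matches the reported 2.4 s, 26.9 s, and 4.2 s. The dictionary layout and the function name are only illustrative.

```python
# Totals over the 7 IIIT RGBD scenes, as reported in the text:
# method -> (correctly classified relations, total computing time in seconds)
TOTAL_RELATIONS = 55
RESULTS = {
    "ours": (53, 131.8),
    "feature extraction": (41, 1478.2),
    "learning": (40, 231.1),
    "AABB": (36, 17.7),
}


def summarize(correct: int, seconds: float, total: int = TOTAL_RELATIONS):
    """Return (accuracy in percent, average seconds per relation), 1 decimal."""
    accuracy = round(100.0 * correct / total, 1)
    avg_time = round(seconds / total, 1)
    return accuracy, avg_time


for name, (correct, seconds) in RESULTS.items():
    acc, avg = summarize(correct, seconds)
    print(f"{name}: {acc}% accuracy, {avg} s per relation")
```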
Although our method is slower than the AABB method, its accuracy is much higher. Unlike the IIIT RGBD dataset, the YCB benchmarks provide dense, high-resolution point clouds of objects. All the 3D point clouds are reconstructed by precise stitching, as shown in Figure 6a. In scene 8, a strawberry and a lemon are in a bowl placed on the table, and a master chef can stands far from them. In scene 9, a mustard bottle, a sugar box, and a tomato soup can are placed on the table, and a lemon lies on the tomato soup can. Scene 10 is cluttered: a tuna fish can and a gelatin box are very close but do not touch each other, a banana is placed on the tuna fish can, and a chips can and a potted meat can are located away from the other objects. In scene 11, a mug, a bowl, and a tomato soup can are placed close together on the table, and the bowl contains an orange. In scene 12, there is a plate on the table with a bear and a potted meat can on it, and a tomato soup can stands isolated. Scene 13 is a typical kitchen case: a stack of plates is placed on the table, and a bowl is placed on top of the stack. In scene 14, two bananas are placed roughly parallel on the table, with a plum and a lemon beside them. All the point clouds of the objects in each scene are dense and contain few noise points, which benefits the spatial topological relation analysis. Another advantage is that the YCB benchmarks contain a variety of relations, which makes them suitable for verifying our method. The convex hulls of the point clouds are shown in Figure 6b. Because the density and the number of points of the point clouds in the YCB benchmarks are much larger than those in the IIIT RGBD dataset, the convex hulls in the YCB benchmarks are more complete. The boundary points and interior points of each point cloud are classified, as shown in Figure 6c.
We found that the numbers of boundary points and interior points are much larger than in the IIIT RGBD dataset. The reason is that the YCB benchmarks contain a rich variety of spatial topological relations. The spatial topological relations between objects are shown in Figure 6d. A red line represents touch, and the direction of the arrow represents the direction of the relation, A → B. The cyan, green, blue, magenta, and yellow lines represent cross, within, partial within, contain, and partial contain, respectively. In scene 8, our result shows that the strawberry is within the bowl and the lemon is partial within the bowl, which is consistent with the ground truth. In scene 9, the mustard bottle, the tomato soup can, and the sugar box touch the table, and the lemon touches the tomato soup can. Scene 10 is a special case. Most of the relations, except the relation between the banana and the gelatin box, are classified correctly. The misclassification mainly stems from human common sense: the banana is not a container, so the relation is generally not regarded as contain. However, based on the definition of the 6IM, the relation is determined as partial contain. In scene 11, the orange is partial within the bowl, and the other objects touch the table. Scene 12 is similar to scene 11: the potted meat can and the bear are partial within the plate, which touches the table together with the tomato soup can. In scene 13, the plate at the bottom of the stack touches the table, and each of the other objects, from top to bottom, is partial within the object below it. In scene 14, the relation between the two bananas is complex, so it is determined as cross. For the same reason as in scene 10, the relation between the plum and the banana, which should be touch, is misidentified as partial within.
We also compared our method with the feature extraction method, the learning method, and the AABB method in terms of accuracy and computing time, as shown in Table 4. As on the IIIT RGBD dataset, our method is significantly more accurate than the other methods, and it is faster than the feature extraction method and the learning method. There are 75 relations in the 7 scenes. Our method correctly analyzed 71 relations and took 131.7 s in total. By comparison, the feature extraction method, the learning method, and the AABB method correctly analyzed 64, 59, and 57 relations and took a total of 3002.4, 175.6, and 10.9 s, respectively. The number of relations and the number of points of each object in the YCB benchmarks are larger than those in the IIIT RGBD dataset, so our method and the feature extraction method took much more time to calculate the spatial topological relations. The average accuracy of our method on the YCB benchmarks is 94.7%, which is 9.3 percentage points higher than the 85.3% of the feature extraction method, 16.0 percentage points higher than the 78.7% of the learning method, and 18.7 percentage points higher than the 76.0% of the AABB method. In addition, the average time of our method is 1.8 s, which is significantly faster than the 40.0 s of the feature extraction method and the 23.4 s of the learning method. Although the AABB method is very fast, its accuracy is the lowest among these methods.

Discussion
The results confirm that our method is more accurate than the other methods. The reason is that our method effectively classifies the relations between objects and uses the overall features of the point cloud to analyze the spatial topological relations. Currently, spatial topological relations are mainly defined by the intersections of points, lines, and regions. However, in cluttered scenes, the point clouds of objects are incomplete and their shapes are unpredictable. Meanwhile, only the point clouds on the surfaces of objects can be perceived by vision sensors. Without obvious region features, current spatial topological relation methods cannot work well. Our method uses convex hulls to represent the boundaries of objects. Since convex hulls contain the boundary features of point clouds, our method improves the accuracy and saves computing time in the interaction analysis between point clouds. In addition, we proposed the deviation factor to improve the robustness of our method. Although our method based on convex hulls performs a certain degree of region interpolation, it is not suitable for scenes in which large parts of the point clouds are missing. Methods based on inferring object stability may help to further improve the accuracy in these extreme scenes.

Conclusions
In summary, we have proposed the 6IM to describe the spatial topological relations in cluttered scenes, together with a method to classify them by calculating the relations between points and convex hulls. Unlike other methods, our method takes the convex hulls as approximate representations of the object boundaries.
Thanks to the reasonable definition and calculation process, our method is suitable for cluttered scenes with partial, hollow, and complex point clouds. The speed and the accuracy of our method were verified on the IIIT RGBD dataset and the YCB benchmarks, on which it improves on the other methods in every scene.
In the future, we will use stability inference to improve the accuracy of the spatial topological relation analysis in scenes with severely incomplete point clouds. Based on the spatial topological relation analysis, we will design a robot grasping strategy to realize automatic object sorting in cluttered scenes.