Object-Independent Grasping in Heavy Clutter

Abstract: When grasping objects in a cluttered environment, a key challenge is to find appropriate poses to grasp effectively. Accordingly, several grasping algorithms based on artificial neural networks have been developed recently. However, these methods require large amounts of data for learning and have high computational costs. Therefore, we propose a depth difference image-based bin-picking (DBP) algorithm that does not use a neural network. DBP predicts the grasp pose from the object and its surroundings, which are obtained through depth filtering and clustering. The object region is estimated by the density-based spatial clustering of applications with noise (DBSCAN) algorithm, and a depth difference image (DDI) that represents the depth difference between adjacent areas is defined. To validate the performance of the DBP scheme, bin-picking experiments were conducted on 45 different objects, along with bin-picking experiments in heavy clutter. DBP exhibited success rates of 78.6% and 83.3%, respectively. In addition, DBP required a computational time of approximately 1.4 s for each attempt.


Introduction
For a robot to successfully grasp a target object in a cluttered environment, where many objects are stacked in a small space such as a box, the gripper must not collide with surrounding objects or the walls of the box. Owing to the recent development of artificial neural networks (ANNs), grasping algorithms can provide excellent performance if sufficient data are provided for learning [1][2][3]. However, learning often requires several robots and devices to process the vast amount of data needed. Additionally, in cases where the target objects change frequently (such as in the logistics industry), ANN-based grasping algorithms have to be retrained, which is inefficient [4]. Therefore, an algorithm that allows the robot to grasp unknown objects without excessive learning is necessary.
Many grasping algorithms use either a geometry-based or a data-driven method. The former is a traditional approach in which the grasp pose is estimated by predicting the exact three-dimensional (3D) position of an object [5,6] or by matching the 3D point cloud to known 3D computer-aided design (CAD) models [7][8][9]. Applying this method to a new object is cumbersome, because an accurate CAD model is needed and cannot always be obtained; consequently, the estimated pose of the target object may be inaccurate. For this reason, methods that estimate the pose of objects in 3D environments without CAD models have recently been proposed [10]. Although geometry-based grasping methods are often used when the CAD models of manufactured objects are available, the logistics industry is unlikely to have CAD models of its products. Thus, these methods are rarely applicable in logistics.
In contrast, in the data-driven method, the grasp poses of the objects are estimated using an ANN-based learning scheme. This method generally has a higher success rate than the traditional geometry-based methods. In this method, RGB images [11,12], depth images [3,13], or both [14] can be used. However, a data-driven method requires a large amount of manually labeled data.

To solve these issues, methods such as obtaining the data from simulations [13] and using generative adversarial networks (GANs) [15] have been proposed. However, models trained in simulation may have low success rates because of the gap between simulation and reality, and the GAN-based method requires a long training time [16] and is difficult to train. There are also reinforcement learning-based schemes. Similar to deep learning-based methods, these perform well when properly trained, but their learning requires an enormous amount of data. For instance, Google collected more than 0.9 million grasp samples over several months using 14 robots [1,2]; with a single robot, this would have taken years. Furthermore, even when such data are collected, the algorithm can hardly operate in environments different from those in which the data were obtained.
Herein, we propose a bin-picking scheme based on the depth difference image (DDI), which estimates the graspability by analyzing the space around the object to be grasped. By DDI-based bin picking (DBP), a robot with a two-finger gripper can grasp unknown objects in a cluttered environment. This does not require a learning process (which requires a substantial amount of data) or CAD models of the target objects. Therefore, the most significant contribution of this study is to provide a generalized grasp solution that does not need prior information, including CAD models and training data. DBP consists of a grasp candidate generator, a grasp pose evaluator, and a grasp pose decider. The grasp candidate generator considers the shape of an object and the surrounding space, generating a group of candidates for the robot to attempt grasping. The grasp pose evaluator determines the most appropriate grasp candidate using a Gaussian mixture model (GMM) and the DDI. The grasp pose decider obtains the final grasp pose by adjusting that determined by the grasp pose evaluator. Experiments involving the bin picking of different objects in a two-dimensional (2D) clutter and of one type of object in a heavy clutter revealed that this method is effective.
The remainder of this paper is organized as follows. In Section 2, the overall structure of DBP and the individual modules are described in detail. Section 3 presents the experimental results and Section 4 analyzes the experimental results. Finally, Section 5 presents the conclusions.

DBP
The DBP structure used in this study is shown in Figure 1. DBP consists of three elements: a grasp candidate generator, a grasp pose evaluator, and a grasp pose decider. The grasp candidate generator processes the image obtained by a depth sensor and generates a group of grasp pose candidates. The grasp pose evaluator selects the most appropriate candidate among those obtained from the grasp candidate generator by analyzing the shape of the target object and the surrounding space. The grasp pose decider adjusts the grasp pose to obtain a more appropriate one. Using the foregoing procedure, robotic grasping can be performed without learning using devices such as a graphics processing unit (GPU).

Grasp Candidate Generator
The grasp candidate generator provides grasp poses that are likely to lead to a successful object grasping. In this, three processes are involved: depth filtering, region clustering, and grasp candidate generation.
First, depth filtering discards most of the depth image, keeping only the lowest p% of the depth values, i.e., the points closest to the camera. This is because objects located higher in the pile appear closer to the camera of the robot and are generally easier to grasp.
Then, region clustering divides the filtered depth image into several object regions. Here, the filtered p% of the depth image is clustered using the density-based spatial clustering of applications with noise (DBSCAN) algorithm. Figure 2 shows an example of DBSCAN. If a circle of radius ε is drawn around point A or F, at least five points fall in the circle; therefore, A and F are called core points, and because they lie in the same circle, they belong to the same cluster. When points B, C, D, E, and G are used as centers instead, fewer than five points fall in the circle; thus, these are called border points. Point H is never included in any circle; thus, it is called a noise point. DBSCAN does not require the number of clusters to be set in advance, and it can detect clusters of arbitrary geometric shape as well as outliers. In this study, ε was set to 10 pixels, and the minimum number of samples for one cluster was set to 5. However, these parameters can be changed depending on the environment, e.g., the number and shape of the target objects and the resolution of the depth sensor.
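The two steps above can be sketched as follows, assuming scikit-learn's DBSCAN implementation and a depth image in which smaller values are closer to the camera; the function name, arguments, and defaults are our own illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_object_regions(depth, p=0.1, eps=10, min_samples=5):
    """Depth filtering + region clustering: keep only the lowest p% of
    depth values (points closest to the camera) and cluster their pixel
    coordinates with DBSCAN."""
    threshold = np.quantile(depth, p)           # cut-off for the lowest p%
    rows, cols = np.nonzero(depth <= threshold)
    pts = np.column_stack([rows, cols])
    # eps is in pixels; a label of -1 marks noise points
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    return pts, labels
```

Each non-negative label corresponds to one candidate object region, to be passed to the grasp candidate generation step.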
Finally, grasp candidates are generated for each cluster estimated by DBSCAN. Figure 3 shows an example of the whole operation. In the rightmost figure, each cluster has 10 grasp candidates. Each grasp candidate consists of a location, a grasp angle, and a gripper width. The locations are determined from the centroids of the clusters; for example, in Figure 2, the centroid of the cluster is point M. The grasp angles are simply multiples of 180/n degrees. The gripper width is equal to the smaller of the width and height of the rectangle surrounding the cluster. In Figure 3, for example, the gripper width is h0.
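Candidate generation for one cluster can be sketched as below, following the description above (centroid location, angles in multiples of 180/n, width from the axis-aligned bounding rectangle); the function name and interface are assumptions for illustration.

```python
import numpy as np

def grasp_candidates(cluster_pts, n=10):
    """cluster_pts: (N, 2) array of pixel coordinates of one cluster.
    Returns n candidates sharing the cluster centroid and width but
    differing in grasp angle."""
    centroid = cluster_pts.mean(axis=0)              # grasp location
    extent = cluster_pts.max(axis=0) - cluster_pts.min(axis=0)
    width = float(extent.min())                      # smaller rectangle side
    angles = [i * 180.0 / n for i in range(n)]       # angles in degrees
    return [(tuple(centroid), angle, width) for angle in angles]
```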


Grasp Pose Evaluator
The grasp pose evaluator identifies the most appropriate grasp candidate among those provided by the grasp candidate generator considering the object shape and surrounding space. It performs a DDI analysis, GMM analysis, and graspability evaluation through a cost function with three parameters.


DDI
The DDI is computed using the maximum depth difference between adjacent pixels in the depth image. This novel method produces large values at the boundary between the objects and the surrounding environment, and small values in areas exclusively belonging to either of them.
The DDI can be obtained as follows. First, an m × m filter is applied starting from the upper-left corner of the depth image, and the largest difference between the central pixel and the adjacent pixels is used as the output value. Then, the filter moves one pixel to the right, following the sliding-window approach, and the operation is repeated. At the end of each row, the filter moves down to the next row. The corresponding pseudo-code is presented in Algorithm 1 for the case of m = 3, and a sample DDI is shown in Figure 4.
The size m can be any odd number except 1. The DDI does not differ greatly for different m; the main effect is that the output image becomes smaller as m increases. In this study, m was set to 3 so that the size of the resulting image approximates that of the input image, but m can safely be set to another odd number.

Algorithm 1 DDI
As shown in Figure 5, the DDI has large values at the contour of the object, similar to contour extraction, e.g., using a Sobel operator. However, there is a significant difference. Because a Sobel operator outputs only small values (e.g., 0-10), the depth difference cannot be properly represented. In contrast, the DDI can display both the contour of the object and the depth difference between neighboring pixels. This feature is used to estimate the graspability of the grasp candidates.
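The sliding-window computation described above can be sketched in Python as follows; this is a minimal valid-region implementation consistent with the description (the output shrinks by m − 1 pixels overall), and the function name and interface are our assumptions, not the paper's Algorithm 1.

```python
import numpy as np

def ddi(depth, m=3):
    """Depth difference image: for each pixel, output the largest absolute
    depth difference between the window centre and the other pixels in its
    m x m neighbourhood."""
    assert m % 2 == 1 and m > 1, "m must be an odd number except 1"
    r = m // 2
    H, W = depth.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for i in range(r, H - r):
        for j in range(r, W - r):
            window = depth[i - r:i + r + 1, j - r:j + r + 1]
            out[i - r, j - r] = np.abs(window - depth[i, j]).max()
    return out
```

Unlike a Sobel-style edge response, each output value here is an actual depth difference, which is what the graspability evaluation relies on.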

Evaluation Model by GMM
To evaluate the grasp candidates, a model based on the GMM was designed, using three Gaussian models and DDI values corresponding to the grasp candidates, as shown in Figure 6. Additionally, a cost function was developed using the three parameters defined in the following. In Figure 6, the three Gaussian models are ordered according to their x values and denoted 1, 2, and 3. G2 can be interpreted as the area where the object exists, and G1 and G3 can be considered as spaces to the left and right of the object, respectively. According to the Gaussian models obtained from the GMM, the proportion difference, height difference, and width are defined, and the cost function is constructed by multiplying or dividing them.


Figure 6a,b shows the DDI and its profile along the grasp candidate indicated by the red line in Figure 6a. The proportion pi of each Gaussian Gi is its mixing weight in the GMM; thus, p1 + p2 + p3 = 1. For example, in Figure 6, p1 = 0.25, p2 = 0.5, and p3 = 0.25. If p1 and p3 have different values, there is space only on one side (left or right) of the object. Therefore, to select the cases in which both the left and right sides of the object are wide, the proportion difference dp is defined as dp = |p1 − p3| / max(p1, p2, p3), where the difference between p1 and p3 is divided by the largest of p1, p2, and p3 for normalization. Thus, dp indicates whether p1 and p3 are similar. However, even if they are, the spaces may not be large enough. In that case, the height difference, the second evaluation index, is used. The height difference index dh is defined as dh = Dd / max(h1, h2, h3), where the height difference Dd of Figure 6c is divided by the largest of the component heights h1, h2, and h3 for normalization. Thus, dh indicates the depth of the space around the object: a larger dh indicates a deeper space. In summary, dp and dh indicate the width and depth of the space around the object, respectively.
The width index wc is defined from the difference between the x values of G1 and G3 in Figure 6d. Here, xd represents the distance between G1 and G3, and xl represents the maximum opening width of the gripper. The parameter c, the preferred opening width in pixels, is proportional to the width of the gripper and the camera resolution and inversely proportional to the distance between the camera and the objects. Because the opening of the gripper is limited, wc is infinite when xd > xl. Additionally, it is assumed that the gripper has an optimal width to grasp an object; thus, wc is minimized when xd equals c and increases rapidly as xd deviates from this value. The expressions for wc for 0 < xd < c and c < xd < xl were initially designed as linear functions, but they were replaced with exponential functions to optimize the opening width of the gripper.

Evaluation Function
The evaluation function e is defined according to the three foregoing evaluation indices as e = dp · wc / dh, where dp and dh determine whether there is enough space around the object, and wc determines whether the width of the object is appropriate for grasping according to the width of the gripper. Therefore, the evaluation function determines the graspability according to the surrounding space and the shape of the object. Because dp and wc are small and dh is large when the object can be appropriately grasped, the optimal grasp pose corresponds to the smallest e.
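The three indices and the cost can be sketched as below. The dp and dh formulas follow the normalizations described above; the exact exponential form of wc is not given in the text, so a simple |xd − c| / c penalty is used here as a placeholder, and e = dp · wc / dh is our reading of "multiplying or dividing" the indices so that the best grasp minimizes e.

```python
def evaluate(p, h, D_d, x_d, x_l, c):
    """p: mixing proportions (p1, p2, p3); h: component heights;
    D_d: height difference (Figure 6c); x_d: distance between G1 and G3;
    x_l: maximum gripper opening; c: preferred opening width."""
    d_p = abs(p[0] - p[2]) / max(p)   # symmetry of the two side spaces
    d_h = D_d / max(h)                # depth of the surrounding space
    if x_d > x_l:                     # the gripper cannot open this far
        return float("inf")
    w_c = abs(x_d - c) / c            # placeholder width penalty
    return d_p * w_c / d_h
```

Candidates are then ranked by e, and the smallest value wins.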

Grasp Pose Decider
The grasp pose decider consists of the grasp pose modifier and the reaching distance estimator. The grasp pose modifier determines the final grasp pose of the robot, and the reaching distance estimator determines the distance that the robot must travel downward for grasping.
The grasp pose modifier updates the location and width of the gripper. First, the width of the gripper is estimated according to the optimal grasp pose and the shape of the cluster in the depth image. In Figure 7a, the width of the gripper is larger than the target object. To avoid collisions with other objects in the clutter, the grasping width should be reduced, and the location should be modified according to the new grasping width. In the clustered depth image, the width is reduced to fit the boundaries of the cluster, and a new grasping width is obtained by adding a margin to this value. Additionally, as shown in Figure 7b, the midpoint of the newly estimated gripper width is set as the new center position of the gripper.
Next, the reaching distance estimator determines the height at which the robot should approach the object. Because an RGB-D camera can only see one side of the object, the approach distance before grasping must be determined from the partial depth data of the object. The reaching distance is determined from the maximum and minimum depth (hmax and hmin, respectively) of the cluster in the clustered depth image as d = hmax + k(hmax − hmin), where k is a factor that maps the hmax − hmin difference to the reaching distance. Although hmax is the deepest of the filtered points, it is not deep enough to grasp the object, because only the depth information close to the camera remains after filtering. Therefore, for reliable grasping, the robot must reach beyond hmax, which is achieved by introducing k (set to 0.5-1 in this study).
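As a small worked example, the reaching distance can be computed as below, assuming the reconstruction d = hmax + k(hmax − hmin) implied by the description (the robot reaches beyond hmax by a fraction k of the cluster's visible depth span).

```python
def reaching_distance(h_max, h_min, k=0.5):
    """h_max, h_min: deepest and shallowest filtered depth of the cluster;
    k: overshoot factor (0.5-1 in the paper's experiments)."""
    return h_max + k * (h_max - h_min)
```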


Experiments
Several experiments were conducted to determine whether the proposed DBP is effective for grasping objects that have a complex piling structure. For this purpose, the performance of DBP was compared with that of three grasping algorithms. The first one used a random method. In this, the grasp position was the central coordinate of the cluster found by the grasp candidate generator, and the grasp angle was determined randomly. Thus, compared to the DBP method, the grasp position was the same, and the grasp angle was different. The second algorithm was based on principal component analysis (PCA) [17]. In this method, the grasp center position was set to one of the center points of the clusters in the clustered depth image, and the grasp angle was obtained by PCA. Thus, a narrow part of the target object was used for grasping. The third algorithm was based on an ANN [18]. In this algorithm, the ANN received the depth image and used it to estimate the grasp pose. Note that the algorithm did not previously learn the objects to be used in the experiments.
The three algorithms and DBP were tested with different objects both in a 2D cluttered environment and in 3D bin picking. In the 2D clutter, grasping was performed for 20 types of objects in an area delimited by white lines. In 3D bin picking, the target objects were placed in a 390 mm × 480 mm × 250 mm box. In this case, the parameters of the DBP algorithm were p = 0.1, n = 20, c = 80, and k = 0.5. Figure 8 shows the experimental setup, which comprised a UR5 robot, a RealSense D435 RGB-D sensor mounted on the robot arm, and a Robotiq two-finger gripper. The main central processing unit (CPU) was an Intel Core i9-7940X, and the GPU was a GeForce GTX 1080 Ti. Figure 8a shows the 45 different objects used in the experiments. In the 2D cluttered environment shown in Figure 8b, 20 objects were randomly selected among the 45 objects and stacked in the outlined area. In bin picking, as shown in Figure 8c, the 45 objects were used. The objects comprised the Australian Center for Robotic Vision (ACRV, Brisbane, Australia) picking benchmark (APB) [19], the Yale-CMU-Berkeley (YCB) benchmark [20], the World Robot Summit (WRS) 2018 set, and household items.
To test the DBP scheme in heavy clutter, a 330 mm × 450 mm × 260 mm box was filled with small cosmetic containers, as shown in Figure 9a. Without proper consideration of the space around the object, grasping is difficult in this setting. In this experiment, the parameters of the DBP algorithm were p = 0.05, n = 20, and c = 80. In contrast to the previous experiments, the width of the gripper was fixed, because only one type of object was targeted.

Comparisons with Other Algorithms
As seen in Table 1, DBP exhibited the highest success rate in all the experiments. The learning-based methods had a lower success rate than DBP, although their performance was similar across environments. In particular, the PCA-based grasping performed similarly to DBP in the 2D and 3D clutters with different objects, because the gripper rarely collided with other objects during the grasp attempts owing to the large space between the objects in those experiments. The success rate in the heavy clutter of cosmetic containers, where DBP clearly outperformed the other algorithms, supports this interpretation. Thus, the experiments indicated that grasping without collisions is important in heavy-clutter environments. Examples of grasp poses estimated by DBP are shown in Figure 9b. In all the experiments, the PCA-based grasping was twice as fast as DBP, and the ANN-based grasping was in turn 1.16 times faster than the PCA-based one. However, DBP needed only approximately 1.4 s to estimate a grasp pose, which is sufficiently fast for practical applications, because after the robot grasps an object, it needs time to move the object to the designated position anyway. Additionally, if an eye-to-hand camera is used instead of the eye-in-hand camera used in our experiments, the robot can estimate a new grasp pose while moving an object.
In summary, the PCA-based grasping method performed well when the space around the object to be grasped was sufficiently large, and its performance deteriorated otherwise. The performance of the learning-based grasping was similar in all environments but worse than that of DBP. DBP, which considers both the space around the target object and its shape, achieved a good grasping success rate even when the space around the target object was small. Thus, DBP can be applied more widely than the other grasping methods, because its performance is good across different environments.

Causes of Failures
Though the DBP algorithm demonstrated the highest success rate, it showed 16.7% failures in grasping cosmetic containers. These failures are most likely caused by the environmental changes that occur after estimating the grasp pose. When a gripper approaches the object, its fingers often contact the surrounding objects. Such contact is likely to change the surrounding environment and the pose of the target object. Therefore, unless the grasp pose is corrected accordingly, the chance of successful grasping is reduced.
Another cause is related to the top-down grasping path used by most two-finger grippers, in which the gripper first moves over the target object and then descends vertically to grasp it. With this path, grasping objects near the wall is very difficult: because avoiding a collision with the wall has priority over grasping the object in grasp pose estimation, estimating the correct grasp pose is not easy, especially for small bins.

Conclusions
A novel depth difference image-based bin-picking method was proposed for generalized grasping in heavy clutter. We introduced a DDI to analyze the geometry around the object to be grasped in the absence of a CAD model, a graspability evaluation method based on the DDI, and a DBP structure consisting of a grasp candidate generator, a grasp pose evaluator, and a grasp pose decider. The DBP method aims to estimate the optimal grasp pose in a short time with a cost-efficient process, grasping novel objects even in space-constrained environments. The performance of DBP was verified by grasping experiments in 2D clutter and 3D bin-picking environments, in which DBP exhibited better performance than the other grasping methods. In particular, the success rate of DBP in heavy clutter with small objects was 83.3%, approximately 1.4 times higher than that of the other algorithms. Moreover, the computation time was 1.4 s, which is sufficiently fast for industrial and logistics applications.