Article

Deep Reinforcement Learning-Based Robotic Grasping in Clutter and Occlusion

by Marwan Qaid Mohammed 1,*, Lee Chung Kwek 1,*, Shing Chyi Chua 1, Abdulaziz Salamah Aljaloud 2, Arafat Al-Dhaqm 3, Zeyad Ghaleb Al-Mekhlafi 2 and Badiea Abdulkarem Mohammed 2

1 Faculty of Engineering and Technology, Multimedia University (MMU), Ayer Keroh 75450, Melaka, Malaysia
2 College of Computer Science and Engineering, University of Ha’il, Ha’il 81481, Saudi Arabia
3 School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia (UTM), Skudai 81310, Johor, Malaysia
* Authors to whom correspondence should be addressed.
Sustainability 2021, 13(24), 13686; https://doi.org/10.3390/su132413686
Submission received: 16 November 2021 / Revised: 4 December 2021 / Accepted: 8 December 2021 / Published: 10 December 2021

Abstract: In robotic manipulation, object grasping is a basic yet challenging task. Dexterous grasping necessitates intelligent visual observation of the target objects by emphasizing the importance of spatial equivariance to learn the grasping policy. In this paper, two significant challenges associated with robotic grasping in both clutter and occlusion scenarios are addressed. The first challenge is the coordination of push and grasp actions, in which the robot may occasionally fail to disrupt the arrangement of the objects in a well-ordered object scenario. On the other hand, when employed in a randomly cluttered object scenario, the pushing behavior may be less efficient, as many objects are more likely to be pushed out of the workspace. The second challenge is the avoidance of occlusion that occurs when the camera itself is entirely or partially occluded during a grasping action. This paper proposes a multi-view change observation-based approach (MV-COBA) to overcome these two problems. The proposed approach is divided into two parts: (1) using multiple cameras to set up multiple views to address the occlusion issue; and (2) using visual change observation on the basis of the pixel depth difference to address the challenge of coordinating push and grasp actions. According to experimental simulation findings, the proposed approach achieved an average grasp success rate of 83.6%, 86.3%, and 97.8% in the cluttered, well-ordered object, and occlusion scenarios, respectively.

1. Introduction

Object grasping is an important step in a variety of robotic tasks, yet it remains a challenging problem in robotic manipulation [1]. Additionally, perception of the surroundings is a valuable skill that the robot should possess in order to perform robotic tasks with multiple actions (e.g., push or grasp) in clutter using deep reinforcement learning (deep-RL) algorithms [2]. The robot should be able to detect changes in the workspace state by comparing pixel depth differences between the previous and current workspace states; otherwise, an ineffective push that does not actually facilitate grasping leaves the robot struggling to grasp. The second concern is that, in an occluded workspace where intelligent grasp pose estimation is needed to pick up all of the objects, a single view of the workspace might not be sufficient. The drawback of using a single camera during a robotic grasping task is that it may suffer from a lack of visual data due to the single view, making it difficult to precisely predict object locations in an occluded environment. The two problems of robotic grasping in cluttered surroundings and avoidance of occlusion are closely related. This raises the question of whether a multi-view, visual change observation-based execution of grasping tasks can avoid these two problems and improve grasp efficiency in both cluttered and occluded environments.
Data representation with several levels of abstraction is achievable with deep learning (DL) [3]. On the other hand, reinforcement learning (RL) refers to how software agents learn to act in a way that maximizes cumulative reward via trial and error. Combining these two machine learning approaches takes advantage of deep learning’s representation ability to address the reinforcement learning challenge. A typical deep-RL system employs a deep neural network to calculate a non-linear mapping from perceptual inputs to action values, as well as RL signals to update the weights in the networks, generally using backpropagation, to yield better rewards [4]. In robotic grasping, the robot examines the surroundings using RGB-D data and takes the best action possible within the policy. In this paper, deep-RL is used to address robotic grasping in cluttered and occluded environments.
Various studies have been conducted on grasping an object from a single viewpoint using deep-RL. Robot grasping [5,6,7,8,9,10,11,12], for example, has been extensively researched, but the emphasis has primarily been on performing a particular task in clutter with a single viewpoint. In performing various robotic manipulations, several techniques have been used such as cooperative object manipulation [13] and reach-to-grasp-to-place [14,15,16,17,18,19,20]. To perform pattern exploration and grasp identification, Guo et al. [21] proposed a mutual convolutional neural network (CNN). To solve the issue of mixing grasp recognition and object detection with relationship logic in piles of objects, Zhang et al. [22] proposed a multi-task convolution robotic grasping network. Park et al. [23], on the other hand, used a single multi-task deep neural network to learn grasp recognition, object detection, and object relationship reasoning. Currently, the most common input for determining a grasp is a single RGB-D (red green blue with depth) image from a single fixed location, which lacks visual data. When partial views are insufficient to gain a thorough understanding of the target object, having more perspective data is useful. For example, Morrison et al. [24] examined various informative perspectives in order to reduce uncertainty in grasp pose estimation caused by clutter and occlusions. This enables the robot to perform grasping of invisible objects. However, these approaches do not consider whether the camera itself is fully or partially occluded.
Robot grasping, pushing [25], shift-to-grasp [26], and push-to-grasp [27] have been intensively studied, but mostly for performing a specific task. Some approaches have addressed various robotic manipulations. Performing a push or shift action during a grasping task in clutter has been implemented in different studies. Moreover, object pushing or shifting in a workspace can be useful even if the workspace is not cluttered. For instance, the object could be in a position where it cannot be grasped; shifting the object can then facilitate grasping for the robot. In prior studies, the pushing action and the grasping action were synergized as complementary parts on the basis of selecting the maximum Q-value, $a_t = \arg\max(Q_p, Q_g)$, either to clear objects from the table [27], grasp invisible objects using color segmentation [28,29] or the object index number [30], or perform block arrangement tasks [31]. The maximum value strategy can be implemented via Q-learning on two neural networks (NNs) trained in parallel, in which the first NN estimates the push Q-value $Q_p$, whilst the second determines the grasp Q-value $Q_g$ (e.g., at each iteration in state $s_t$, the NNs estimate the Q-values for grasp and push; if the grasp Q-value is low and the push Q-value is high, the push action is executed, and vice versa). Specifically, these challenges are presented as follows. (1) When pushing and grasping are determined based on the maximum predicted Q-values of two primitive actions (i.e., grasp and push) from two neural networks, the robot will recognize the full clutter and prioritize push over grasp, as the grasp Q-values remain low. Limited by the workspace size, the foremost objects will tend to be moved out of the workspace’s borders, where they cannot be grasped by the robot. (2) To train the robot to coordinate between push and grasp (e.g., deciding when and how to push or grasp), many iterations are required, and in some cases, the robot will fail to disassemble closely arranged objects. (3) The robot may struggle to grasp when the camera is either fully or partially occluded, because it suffers from a lack of visual data, making it difficult to precisely predict object locations in occluded environments.
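To make the two-network maximum Q-value strategy described above (and challenge (1)) concrete, the following minimal NumPy sketch shows how an action would be chosen from two separately predicted pixel-wise Q maps; the names and shapes are illustrative, not any specific author's implementation.

```python
import numpy as np

def select_action_max_q(q_push: np.ndarray, q_grasp: np.ndarray):
    """Two separate networks each output a pixel-wise Q map of shape
    (n_rotations, H, W); the primitive whose single best Q-value is higher
    is executed at that pixel and rotation."""
    if q_grasp.max() >= q_push.max():
        return "grasp", np.unravel_index(q_grasp.argmax(), q_grasp.shape)
    return "push", np.unravel_index(q_push.argmax(), q_push.shape)

# Example with random maps: 16 rotations over a 224 x 224 heightmap.
action, (rot, row, col) = select_action_max_q(
    np.random.rand(16, 224, 224), np.random.rand(16, 224, 224))
```

When the grasp Q-values remain low in dense clutter, this rule keeps returning "push", which is exactly the behavior critiqued above.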
This paper proposes a multi-view and change observation-based approach (MV-COBA) that uses two cameras to achieve multiple views of state changes in the workspace and goes on to coordinate grasp and push execution in an effective manner. This approach aims to prevent a lack of visual data due to having a single view and perform effective robotic grasping in various working scenarios. The MV-COBA is proposed to emphasize the additional challenges of grasping in a cluttered environment and occlusion, as a main objective in this paper. The MV-COBA is divided into two parts: In the first part, the proposed approach’s workflow begins with two RGB-D cameras capturing an RGB-D image of the workspace from two separate perspectives. Next, the extracted features are fused and fed to a single grasp network after each captured RGB-D image’s intermediate feature extraction. In the second part, the robot performs a grasp and then observes the previous and current states of the workspace. If no visual change observation is available, then the robot will perform a push action. The main technical contributions of this paper are as follows:
  • Using multiple views to maximize grasp efficiency in both cluttered and occluded environments;
  • Establishing a robust change observation for coordinating the execution of primitive grasp and push actions through a fully self-supervised learning manner;
  • Incorporating a multi-view and change observation-based approach to perform push and grasp actions in wide scenarios;
  • The learning of MV-COBA is entirely self-supervised, and its performance is validated via simulation.
This paper is organized as follows. Section 2 discusses the related works. Section 3 states the research problem definition and the motivation of the proposed approach. Section 4 describes the proposed MV-COBA, including change observation, action representation, problem formulation, and the approach overview. Section 5 presents the simulation setup and training protocol. Section 6 provides the simulation results and discussions. Section 7 presents the conclusion and future work.

2. Related Works

An industrial robot should be able to perceive and interact with its surroundings. The ability to grasp is essential and crucial among many of the basic skills because it can provide tremendous strength to society in performing robotic tasks [2]. Several different approaches have been proposed in recent years to perform grasping in clutter by using deep-RL.

2.1. Single View with Grasp-Only Policy

The grasp-only policy is a strategy for teaching a robot to grasp objects using only the grasp action when no other primitive actions (such as push, shift, and poke) are involved. A single depth image has recently been used to execute grasping, where the depth image is passed to a four-branch convolutional neural network (CNN) with a shared encoder–decoder [32]. Although the active vision approach has been implemented to address the problem of robotic perception [33,34,35], which is considered the key subtask in grasping, active vision is not intensively exploited in addressing the issue of grasping in a cluttered environment. Kopicki et al. [36] addressed the issue of how to improve dexterity in order to grasp novel items viewed from a single perspective. They proposed a single-view mechanism, where the single view similarly works as a direct regression of point clouds. However, the backside view of the object is invisible to the camera, which leads to the absence of some data, increasing the difficulty of grasping the object from the backside view. In addition, a 6-DoF (degree of freedom) grasping of the target object has been performed in a cluttered scene using point cloud-based partial observation [9,37]. Recent studies that focused on multi-finger grasps (e.g., [8,32,38,39]) primarily employed planned precision grasping for known objects. However, grasping in crowded settings with a multi-finger gripper is inefficient, particularly when dealing with retrieving a target object from the clutter, since the multi-finger hands may struggle to find enough space for their fingers to move freely in the workspace. In comparison, other studies have concentrated on robotic grasping of unknown objects adopting parallel-jaw grippers (e.g., [37,40,41,42]). As stated in [43,44], multi-finger grasps remain more problematic than parallel-jaw grasps. However, these policies (grasp-only) might fail to grab items when they are engaged in a well-organized cluttered object situation that requires assistance from other actions (e.g., pushing, shifting) to detach the objects from the arrangement. Furthermore, they do not consider the occlusion challenge, particularly when the camera itself is either partially or fully occluded.

2.2. Suction and Multifunctional Gripper-Based Grasping

Instead of gripper-based grasping, suction-based grasping is an effective mechanism for grasping amid cluttered environments via deep-RL. For pick-and-place tasks, Shao et al. [45] utilized a suction grasp mechanism, rather than a parallel-jaw gripper, to avoid the challenge of grasping objects in a cluttered environment by training via Q-learning on the ResNet and U-net structures. Another study [46] presented a suction grasp point affordance network (SGPA-Net) in which two neural networks (NNs) were trained in series using greedy policy sampling-based Q-learning. However, their approach of connecting two networks and training them with mixed loss produced unsatisfactory results. Furthermore, the authors in [47] attempted to enhance the success rate of suction grasp by utilizing the dataset of [40]. On the other hand, rather than grasp-only, or suction-based grasping, other researchers preferred grasping in cluttered environments by coordinating both grip and suction at the same time using deep-RL, the device for which is referred to as a multifunctional gripper, which is a composite robotic hand with a suction cup. For instance, Zeng et al. [48] created a robotic pick-and-place system that could grasp and detect both known and unknown objects in a cluttered environment via a deep Q-network (DQN). Yen-Chen et al. [49] focused on ‘learning to perceive’ and transferred that knowledge to ‘learning to act’. In addition, the deep graph convolutional network (GCN) model was used to predict suction and gripper end-effector affordances in robotic bin picking [50]. However, these research methods concentrate on grasp-only policies, which may be ineffective in circumstances involving the detachment of well-organized items. Additionally, they neglect the occlusion issue, which happens when the camera is partially or fully obscured.

2.3. Synergizing Two Primitive Actions

Synergizing two primitive actions can be easily performed in clutter once these actions are executed as complementary actions. Synergizing two actions is a fundamental strategy, but it remains a challenging policy to learn. Accordingly, several studies which focused on improving grasping performance have contributed to this field. For instance, visual push-to-grasp (VPG) was proposed in [27] by learning pushing and grasping in a parallel manner using Q-learning with fully convolutional networks (FCNs). In another study, grasping was performed in clutter using a deep Q-critic [28], which learns the annotation of the target objects through FCNs. In addition, the task of singulating an object from its surrounding objects was performed in [51] by using a split-deep Q-network (split-DQN) that learns optimal push policies. Object color detection, which was used in [52], is limited to certain types of objects in training and testing scenarios. Hundt et al. [31] described the schedule for positive task (SPOT) reward and the SPOT-Q RL algorithm, which efficiently learns multistep block manipulation tasks. Another method for promoting object grasping in clutter using DQN was determined in [26] via shifting an object by putting the fingers of a gripper on top of the target object to ensure improvement in the grasping performance. Given depth observation of a scene, an RL-based strategy was used to achieve optimal push policies, as presented in [53]. These policies facilitate grasping of the target objects in cluttered scenes by separating them from the surrounding objects, possibly unseen before during the training phase. However, the authors employed an instance push policy, in which a sole push policy is learned for the recognizable target in clutter via Q-learning.
A goal-conditioned hierarchical RL formulation with a high sample efficiency was presented in [30] to train a push-to-grasp technique for grasping a specific object in clutter. In [54], the visual foresight tree (VFT) method was presented to identify the shortest sequence of actions to retrieve a target object from a densely packed environment. The method combined a deep interactive prediction network (DIPN) for estimating the push action outcomes, and Monte Carlo tree search (MCTS) for selection of the best action. However, the time required by VFT is long due to the large MCTS tree that must be computed. Additionally, it assumes prior knowledge and relies on a target object with a specific color. Alternatively, the DQN-based obstacle rearrangement (DORE) algorithm was proposed [55], and the success rates were checked using intermediate performance tests (IPTs). These approaches, however, are limited to retrieving the target object using identical grid space-based object rearrangement, which cannot be applied to instances that are not modeled by regularly spaced identical grids. These studies concentrated on object retrieval tasks rather than object removal tasks. They assumed prior knowledge and relied on a target object with a specific color.
Furthermore, whereas other methods rely on a segmentation module that must identify the target object, Yang et al. [56] accepted the target object as an image. However, this approach failed in some cases, and the grasp success rate reached only 50% for different reasons, particularly in heavily cluttered scenes, due to the physical limitations of the gripper and the visual similarity of the models in the random dataset, especially in the cluttered setting. A similar concept of image segmentation combined with a visuomotor mechanical search was used in [57] using deep-RL agents that could efficiently uncover a target object obscured by a pile of unknown items. However, the agent repeated the same action without causing any change in the environment. In other cases, the agent moved near the target object but did not interact with the items occluding it.

2.4. Multi-View-Based Grasping

The occlusion scenario has been avoided in some studies through various techniques such as grasping visible objects by using color segmentation [28] and radio frequency perception [52]. Other approaches have employed multiple views from a single camera to increase the probabilities of grasp success in clutter. For example, Morrison et al. [24,58] proposed a multi-view approach that employs active vision to choose insightful perspectives of different grasp pose estimates. A grasping task in a cluttered environment has been performed by using a generative grasping-CNN (GG-CNN). A composite algorithm has been proposed for estimating the pose of a target whose models are derived from multiple views [59]. Other approaches aimed at crowded scenes, in which a robot has to know whether an object is on another object in a pile of objects in order to grasp it successfully, have also been proposed. For instance, a shared CNN has been proposed for discovering and detecting the occluded target [21]. While avoiding the issue of mixing grasp execution and object detection in piles of objects, a multi-task convolution robotic grasping network was proposed in [22]. The authors developed a framework that uses multiple deep neural networks for generating local feature maps, grasp recognition, object detection, and interaction reasoning separately. Instead of multiple tasks, a single task has been proposed to avoid the same issue of mixing grasp execution and object detection [23]. In addition, instance methods of segmentation have mostly been used in studies to detect the target object. For example, a grasping task was achieved in [40] by training the Q-learning policy on pre-trained models (i.e., VGG neural network) by using fully convolutional networks (FCNs). In this work, the authors used multiple views from different sides by capturing RGB-D images.

2.5. The Knowledge Gap

Some difficulties arise as a result of combining the two actions. From the literature, the push motion has been used to accomplish specific goals such as retrieving an object from its cluttered surroundings [28,30,34,54,55,56,57,60,61,62]. This approach assumes prior knowledge and relies on a predefined target object either using color, quick response (QR) code, or barcode objects. In object removal tasks [26,27,63,64,65,66], the pushing and grasping actions have been synergized by using two parallel FCNs, one each for grasping and pushing. The works of [9,27,67] are the closest to our work. Al-Shanoon et al. [9] proposed the Deep Reinforcement Grasp Policy (DRGP). DRGP [9] focuses on performing the grasp-only policy, with no other actions (such as push, shift, and poke) involved. This strategy encounters difficulty especially in a well-organized shape situation where there is no space for the robot’s gripper’s fingers to execute the grasp. The other two papers [27,67] focused on synergizing push and grasp using the maximum Q-value strategy, predicted by two FCNs. This strategy is considered the basis of our work. However, this strategy fails in certain cases because the robot proceeds to push the entire pile of objects, causing the items to be pushed out of the robot’s workspace. In addition, it performs a push movement when it is not necessary, resulting in a series of grasping and pushing actions due to the estimation of grasp points and push direction on separated FCNs.
Another challenge dealing with the pushing behavior is that the push action could be less efficient when it is used in random clutters. The reason behind this issue is that the best action for the robot to take is determined by the highest predicted Q-values. As a result, the robot takes the push action as the best action since grasping Q-values remain low, as reported in [27]. To overcome the aforementioned challenges, we advocate training both the push and the grasp action with the same FCN. The reason is twofold. First, it reduces model complexity. Second, it helps to minimize unnecessary push behavior generated by the maximum Q-value strategy. The push and grasp actions are then coordinated using the observation of the change in the pixel depth in the workspace. Additionally, those who used multiple views (e.g., [24,40]) employed a single camera-based multi-view strategy with no regard to whether the camera is completely or partially obscured, resulting in the robot failing to complete the task. Furthermore, they only performed grasp policies, which they were unable to accomplish when they involved detaching well-organized objects, causing the robot to struggle to complete the tasks. The proposed MV-COBA, using dual cameras for multiple views, as well as change observation for the effective execution of primitive grasp and push actions as complementary parts, is considered to alleviate challenges in clutter and occlusion scenarios. In the next section, the problem description of the aforementioned related works is explained and discussed in detail.

3. Problem Definitions

The work presented in this paper addresses one of the most fundamental problems in robotic manipulation: robotic grasping in cluttered scenarios. Consider a well-organized shape situation where objects are placed close to each other, as illustrated in Figure 1. Such an arrangement leaves no space for the gripper’s fingers to execute a grasp without first detaching the tightly arranged objects. DRGP [9], which teaches a robot to grasp objects using only the grasp action, with no other non-prehensile actions (such as push, shift, and poke), would not be able to perform efficient grasps in this situation. Synergizing non-prehensile and grasp actions is seen as a mechanism for addressing the challenges that grasp-only policies cannot handle. Previous studies have focused on synergizing push and grasp using the maximum Q-value strategy (e.g., [7,27,67]), in which two NNs, one for each action, were trained to predict the best push and grasp Q-values and went on to execute the action with the higher Q-value.
The issues with this learning scheme have been identified: (1) Ineffective push execution, in which the robot pushes the entire pile instead of the targeted object, which results in the need for several pushes in order to detach the objects from their arrangement (see Figure 1). (2) Push is favored over grasp, particularly in heavy clutters, because grasping Q-values remain low, as reported in [27], which results in the entire pile being pushed out of the workspace (see Figure 2). (3) The robot performs the push action when the grasp action should be performed instead. This implies that the robot could grasp without needing to push, and vice versa; however, the maximum value strategy might cause the robot to push when grasping is the best option.
The second challenge is the scenario of occlusion avoidance. Several studies have effectively performed grasping tasks. However, to achieve grasping in clutter, where objects are stacked on top of each other in a randomly cluttered scenario [28], a single-perspective view is insufficient. Additionally, employing multiple views of a single movable camera [24] to increase the probabilities of grasping success in clutter would not be an adequate solution. This strategy suffers from a lack of visual data once the camera itself is obscured, as there is no other camera available to compensate for the loss of the visual sensor. Although multiple views have been implemented to grasp occluded objects [59], there has been no consideration of whether one of the cameras is totally obstructed or both cameras are partially obscured. Additionally, how the two cameras compensate for each other throughout the grasping task to avoid such difficulties remains to be explored. However, executing the grasp action using dual camera-based multiple views as complementary parts to alleviate clutter and occlusion has not been previously considered. Figure 3 demonstrates the occlusion problem, where one of the cameras is fully or partially occluded.

The MV-COBA’s Motivation

The proposed MV-COBA is motivated to offer a significant perspective on the issue of learning to push and grasp in cluttered environments using deep-RL. In this paper, we demonstrate how the MV-COBA allows robotic systems to acquire a range of sophisticated vision-based manipulation abilities (e.g., pushing, grasping) that can be generalized to a number of different scenarios while consuming less data. Since a single-view-based robotic learning strategy is not adequate to overcome the occlusion challenge, the problem is modeled through multiple views using two cameras ($S \in \{\mathrm{RGB\text{-}D}_1, \mathrm{RGB\text{-}D}_2\}$). MV-COBA-based grasping in clutter and occlusion is seen as an alternative approach that can facilitate spatial equivariance in learning robotic manipulation. We hope that by emphasizing the importance of spatial equivariance in learning manipulation, the MV-COBA presented in this paper can inform key design decisions for future manipulation approaches.
In this paper, we introduce an approach based on a deep reinforcement learning framework for learning to grasp objects in clutter. To map visual observations to actions, we employ the Q-learning reinforcement learning framework trained on a fully convolutional network (FCN). The idea is to utilize traditional controllers to perform the primitive actions (e.g., grasp and push), and to use an FCN to learn visual affordance-based grasping or pushing Q-values. In other words, the FCN maps visual observations (e.g., an RGB-D image) to perceived affordances (e.g., confidence scores or action values). For instance, to learn the visual affordances of grasping or pushing Q-values with a parallel-jaw gripper, we can use a fully convolutional network that takes an RGB-D image of the robot workspace (e.g., a workspace filled with objects) as input and outputs a dense pixel-wise map of the probability (Q-value) of picking or pushing success for each observable surface in the image. The system then guides the robot’s end effector, which is equipped with a parallel-jaw gripper, to perform a primitive action at the pixel with the highest estimated affordance value.
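As a rough illustration of this visual-affordance idea, the sketch below uses a deliberately tiny fully convolutional network (a stand-in, not the paper's DenseNet-based model described later) to map a 4-channel RGB-D heightmap to a dense pixel-wise Q map and then pick the pixel with the highest value.

```python
import torch
import torch.nn as nn

class TinyAffordanceFCN(nn.Module):
    """Maps a 4-channel RGB-D heightmap to a dense pixel-wise Q-value map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),   # one Q-value (affordance) per pixel
        )

    def forward(self, x):            # x: [B, 4, H, W]
        return self.net(x)           # Q: [B, 1, H, W]

model = TinyAffordanceFCN()
heightmap = torch.rand(1, 4, 224, 224)            # color + depth heightmap
q_map = model(heightmap)
flat = torch.argmax(q_map).item()                 # pixel with the highest Q-value
row, col = divmod(flat, q_map.shape[-1])
# The end effector would then execute the primitive at the 3D point projected
# from pixel (row, col).
```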
The MV-COBA is divided into two main parts. The first stage is perception, where a single FCN is used to learn visual affordance-based grasping and pushing Q-values. When two FCNs are used to predict the Q-values for pushing and grasping separately, the robot performs the best action based on the highest predicted Q-value. In certain situations, especially densely packed clutter, the push action will be prioritized over the grasp action because the grasp Q-values remain low, which results in unnecessary push behavior. Even when the model was trained for over 7000 iterations, this problem still occurred [68]. To mitigate this issue, we propose training the grasp and push actions on a single shared network. For example, since the grasp and push actions share the same point of prediction, the robot can be enabled to shift a specific object (Figure 4a) rather than the whole pile (Figure 4b). Accordingly, this strategy ensures that the robot can detach the objects in fewer iterations. The second reason why using a single shared FCN can improve grasp performance is that the push behavior, as shown in Figure 2, can be inefficient when dealing with randomly cluttered object scenarios, which become more difficult as the number of objects increases.
Concatenation of multimodal features can provide important and complementary information [69,70]. Thus, in this stage, feature fusion is also used to fuse the features extracted from the multiple views of the raw images to address the occlusion challenge illustrated in Figure 3. The goal of feature fusion is to compensate for the circumstance in which one of the cameras is completely or partially obscured. This type of application could be beneficial in industrial line production when static cameras are employed to observe the robot’s workspace. For instance, if an obstacle obscures one camera’s view of the workspace, the second camera can compensate for the loss of visual data. Similarly, if both of them are partially obscured, they can compensate for each other’s missing visual input.
In the second stage of the MV-COBA, the design stage, pixel depth differencing-based change observation is adopted as a coordinating mechanism to decide which action should be performed. By examining the visual change between the previous and current states of the workspace, the robot learns whether the previous grasp action was successful or introduced a detectable change in the arrangement of the objects. If no visual change is detected, an ineffective grasp has been performed. Based on the principle that a grasp should be attempted by default, whereas a push should be performed only when it is needed, we minimize unnecessary push behavior by performing the push action only when the previous grasp action has failed to produce a detectable change. We address the problem of coordinating the two actions in this paper; existing related works addressed this problem using the maximum value strategy.
Previous works predicted the grasp pose and push direction using separate NNs. This implies that the push NN and the grasp NN take the RGB-D image separately, each predicts its Q-values, and the maximum value between the grasp net prediction and the push net prediction is then selected. However, the object can be grasped without being pushed in certain situations, and vice versa. To replace the maximum value-based strategy for synergizing the two actions, pixel depth differencing is proposed, since it has proven, in our test scenarios, to be more efficient than the maximum value strategy. Although pixel depth differencing may not be the optimal solution, it has been shown to effectively address the aforementioned problem. In contrast to previous works, the robot ensures that pushing is performed only when it is needed, and the same applies to grasping. To achieve these goals, a change observation based on pixel depth differencing is proposed to synergize the two actions.
The methodology is outlined in the following section, which includes the proposed approach workflow, learning policy, and theoretical formula.

4. Methodology

To begin, detailed descriptions of the grasp and push actions are provided to explain how they are executed. Following that, the MV-COBA is introduced to contextualize the approach’s procedure. Next, the procedure for observing the changes based on pixel depth differencing is described. Finally, the formulation problem and reward functions are discussed in detail.

4.1. Change Observation

One of the key issues in visual sensing is change observation (also called change detection), which aids in perceiving any changes on an object’s surface. Change detection, as described by Singh, is the process of noticing differences in the condition of an object or phenomenon over time by monitoring it [71]. Change detection is also described as the process of visually observing the robot’s workspace and detecting the change between the previous and current states. Furthermore, change detection can be divided into three categories [72]: (1) binary change, which utilizes a binary indication for change or non-change areas; (2) triple change mask, which displays the change based on the geometrical label of the status, which may be positive, negative, or non-change; and (3) type change, which utilizes a full change matrix. Additionally, change detection techniques for 3D data are divided into seven categories [73]: geometric comparison, geometry–spectrum analysis, visual analysis, classification, advanced models, geographic information system methods, and other approaches. Each category has its own set of algorithms for detecting changes in 3D data [74]. The current research highlights the benefits of applying change detection to synergize grasp and push motions in grasping tasks. To achieve the study’s objective, geometric comparison based on the Euclidean distance, which underlies height differencing and projection-based difference methods, is used. The basic and direct height-differencing technique [75] is sufficient in our implementation to observe changes in the robot’s workspace and thereby support the change detection process.
Depth Differencing (Image Differencing)
Change detection has mostly been used in applications that track changes on an object’s surface [76], including urban (e.g., building and infrastructure change detection), environmental, ecological (e.g., landslide, forest, and vegetation monitoring), and civil contexts (e.g., construction monitoring). In robotic grasping, change detection is mostly used to detect the objects in the workspace in order to start the robot’s training or testing session. Observing the change in the robot’s workspace as a part of controlling the robot’s grasping performance appears to be a new mechanism that can be used to synergize grasp and push actions. We used the image differencing method, which subtracts the data of the previous state image from those of the current state image (pixel by pixel). In other words, image differencing subtracts the digital number (DN) value of one image from that of another at the same pixel and band, producing a new image. Mathematically, the pixel depth difference (PDD) is obtained by subtracting the depth image of the current state $s_{t+1}$ from the depth image of the previous state $s_t$, as stated in Equation (1).
$$\mathrm{PDD} = s_t - s_{t+1} \qquad (1)$$
The main challenge in using the image differencing technique is the selection of the threshold value. The threshold value varies by application and can only be obtained through a trial-and-error process; we investigated the robot’s workspace with different threshold values and found no specific formula for obtaining it. Increasing the threshold means the robot only detects a considerable (clearly observable) change, whereas reducing the threshold means the robot can detect even a small change that the human eye might not notice. With the selected threshold value, a change observation (CO) is registered if the sum of the PDD exceeds the threshold value $\tau$, as shown in Equation (2).
$$\mathrm{CO} = \sum \mathrm{PDD} > \tau \qquad (2)$$
In this instance, the MV-COBA identifies a CO if the value of the PDD is greater than the threshold value ($\sum \mathrm{PDD} > \tau$). A successful grasping attempt is also considered a CO. Otherwise, no CO exists, as shown in Equation (3).
$$\mathrm{CO} = \begin{cases} \mathrm{True} & \text{if } \sum \mathrm{PDD} > \tau \\ \mathrm{True} & \text{if the grasp succeeds} \\ \mathrm{False} & \text{otherwise} \end{cases} \qquad (3)$$
In the event that no change is detected, the robot executes a push operation. If the push action does not result in a change in the workspace, the robot repeats the push action.
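A minimal sketch of this change-observation test is given below, assuming the depth heightmaps are NumPy arrays and reading "the sum of the PDD" as the summed absolute depth difference; the threshold value shown is purely illustrative, since the paper tunes it by trial and error.

```python
import numpy as np

def change_observed(depth_prev: np.ndarray, depth_curr: np.ndarray,
                    grasp_succeeded: bool, tau: float = 300.0) -> bool:
    """Eqs. (1)-(3): CO is True when the summed depth change exceeds tau,
    or when the previous grasp succeeded; otherwise False."""
    pdd = depth_prev - depth_curr                  # Eq. (1)
    if grasp_succeeded:
        return True                                # successful grasp counts as CO
    return float(np.abs(pdd).sum()) > tau          # Eq. (2), summed |PDD| vs. tau

# If change_observed(...) is False, the controller falls back to a push action
# and repeats the push until a change appears.
```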

4.2. Grasp and Push Action Execution

Each action $a_t$ is represented as a primitive motion $\psi$ to be performed at a 3D position $P$, which is projected from a pixel $\rho_x$ of the heightmap image that represents the state $s_t$ (Equation (4)).
$$a = (\psi, P) \;|\; \psi \in \{\mathrm{grasp}, \mathrm{push}\},\; P \rightarrow \rho_x \in s_t \qquad (4)$$
Two primitive motions to be executed by the robot are grasping and pushing actions. To perform a grasp action, the robot positions the middle point of the gripper’s parallel jaw at P in 1 of 16 orientations. To ensure the robot reaches the desired object, the robot slides its gripper’s fingers down 3 cm before closing its fingers and picks up the object. At a distance of almost 30 cm, the gripper’s fingers are vertically measured against the workspace. The difference between the gripper’s location before and after grasping attempts is compared to 30 cm to determine if a grasp action has been accomplished. This is needed to keep the robot inside the workspace and prevent singulation. As an alternative, when the robot’s fingers are not completely closed, a successful grasping attempt is counted. This signifies that an object stays intact in its gripper’s fingers until it is placed down.
In terms of pushing action, the robot executes a pushing action at the starting position P with a push length of 10 cm in 1 of 16 directions. The robot closes its fingers before moving down to push with the tips of its closed gripper’s fingers. If the robot completes the push length on the workspace, the pushing operation is deemed successful. Motion planning of the robot arm is conducted automatically in both primitive grasp and push actions using a reliable, collision-free inverse kinematic solver.
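The following sketch shows one way such a pixel/rotation selection could be mapped to motion parameters. The 0.002 m pixel size and the 16 × 22.5° discretization follow the paper, whereas the workspace origin, the depth convention, and the names are assumptions made only for illustration.

```python
import numpy as np

PIXEL_SIZE = 0.002                                 # metres per heightmap pixel
WORKSPACE_ORIGIN = np.array([-0.224, -0.224])      # assumed (x, y) of pixel (0, 0)

def pixel_to_primitive(row: int, col: int, rot_idx: int,
                       surface_z: float, primitive: str) -> dict:
    """Map a selected heightmap pixel and rotation index to motion parameters."""
    x, y = WORKSPACE_ORIGIN + PIXEL_SIZE * np.array([col, row])
    theta = np.deg2rad(rot_idx * 22.5)             # 1 of 16 orientations
    if primitive == "grasp":
        # slide the open fingers 3 cm below the observed surface, then close
        return {"type": "grasp", "position": (x, y, surface_z - 0.03),
                "angle": theta}
    # push: start at (x, y) and push 10 cm along theta with closed fingertips
    end = (x + 0.10 * np.cos(theta), y + 0.10 * np.sin(theta))
    return {"type": "push", "start": (x, y, surface_z), "end": end,
            "angle": theta}
```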

4.3. Problem Formulation

The grasping and pushing actions are formulated as a Markov decision process (MDP) [48]. The MDP serves as the theoretical basis for reinforcement learning, since it allows the interaction process of reinforcement learning to be described in probabilistic terms. An MDP can be expressed as a tuple $(S, A, P, r, \gamma)$, which consists of a set of states $S$, a set of actions $A$, a transition probability function $P(s' \mid s, a)$, a reward function $R(s, a)$, and a discount factor $\gamma$. After reaching a state $s_t \in S$ and executing an action $a_t \in A$ according to the current policy $\pi(s_t)$ at every time step $t$, the agent subsequently transits to a new state according to the transition probability $p(s_{t+1} \mid s_t, a_t) \in P$ and receives a corresponding reward $r_t = r(s_t, a_t) \in R$. This process is repeated at each time step. The goal of reinforcement learning is to find an optimal policy $\pi^*: S \rightarrow A$ that maximizes the long-term return, $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$, over the time horizon $T$, taking into account the discount factor $\gamma \in [0, 1]$. The discount factor determines the importance of future rewards at the present state. With a smaller value of $\gamma$, the agent focuses more on optimizing the immediate rewards. However, while performing an action, we cannot be certain of the future states.
One way to determine the best policy $\pi^*$ is to compute the state–action value function first, which is known as Q-learning. Accordingly, we must consider all potential scenarios and derive an estimate of the long-term return $Q^{\pi}(s, a)$ based on the likelihood of a state change, as stated in Equation (5), which is known as the Bellman equation [77]. The state–action value function $Q^{\pi}(s, a)$ is the expected long-term return after performing an action $a_t$ in the present state $s_t$ and then following the policy $\pi$.
$$Q(s_t, a_t) = r_t + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q^{\pi}(s', a') \qquad (5)$$
The action with the highest action value is the output, and it results in an immediate reward. As a result, the policy is improved by choosing the action that has the highest state–action value, as shown in Equation (6) [77]. The agent’s purpose is to choose the optimal action with the highest state–action value, thereby maximizing the action value function, i.e., the expected sum of future rewards. Maximization is achieved by choosing the action with the highest value among all potential actions.
$$\pi^*(s) = \arg\max_{a \in A} Q^{\pi}(s, a) \qquad (6)$$
We propose adopting deep Q-learning, a deep reinforcement learning method that combines fully convolutional networks (FCNs) with Q-learning in order to produce Q-values by fitting a function rather than a Q-table. In this paper, we define the learning process for synergizing two actions (i.e., grasp and push). In previous related work, the two actions were predicted by two separate networks; here, we reformulate the problem with one prediction network $\phi_A$ for both primitive grasp and push actions ($A = \psi \in \{g, p\}$). Firstly, the target FCN calculates $Q(s_t, a_t)$ for each potential action at a given state $s_t$ and finds the highest Q-value using an $\epsilon$-greedy policy. The $\epsilon$-greedy policy attempts to address the exploration–exploitation issue in Q-learning; the value of $\epsilon$ progressively drops throughout the training phase in order to achieve a favorable trade-off between exploration and exploitation. Once an action $a_t$ is executed in the current state $s_t$ based on the $\epsilon$-greedy policy, the action that receives the highest Q-value in the next state is used to calculate the target Q-value $y_i$ via Equation (7), which updates the Q-value for each action by adding the immediate reward $r_t$ provided to the agent at the current state to the discounted Q-value.
$$y_i = r_t + \gamma \max_{a'} Q(s', a') \qquad (7)$$
Then, the action value function is updated for each action in a given state using the temporal difference (TD) method, which aims to achieve an optimal policy toward the estimated TD target. TD learning minimizes the temporal difference error $L_i(\phi_\psi)$ in the Q-values by subtracting the current Q-value $Q(s_t, a_t)$ from the target Q-value $y_i$, as stated in Equation (8).
$$L_i(\phi_\psi) = Q(s_t, a_t) - y_i \qquad (8)$$
We treat the heightmaps as the state, i.e., $s_t \in \{ch_1, ch_2, dh_1, dh_2\}$. Our model’s purpose is to learn the action value function $Q_\psi^{\pi}(s, a)$, which estimates the expected return for a grasping or pushing action $a$ in a state $s$ under a policy $\pi$. However, instead of a single RGB-D camera, we have two RGB-D cameras with two perspectives, which means that we feed two color heightmaps and two depth heightmaps as the state $s_t$. In the proposed approach, the problem is modeled as a discrete MDP. Thus, we observe the robot workspace from multiple views, which means that two states $s_t$ are generated. Accordingly, the state representation can be expressed as the set $s_t \in \{s_t^1, s_t^2\}$, where $s_t^1 \in \{ch_1, dh_1\}$ and $s_t^2 \in \{ch_2, dh_2\}$. In the MDP, the robot performs an action in a state $s_t$, transitions to the next state $s'$, and receives the reward $r(s_t, a_t, s')$. To begin, each $ch$ and $dh$ is rotated into $N$ orientations (e.g., $\theta = 22.5°$) before being sent to the FCN to create $N$ Q-value maps, and then the highest Q-value, $Q_h = \max[Q_1, Q_2, \dots, Q_N]$, is selected. The optimal action is determined from $Q_h$ together with the pixel depth difference (PDD) compared against the threshold $\tau$, as stated in Equation (2). Since we have two state representations, the PDD is obtained for both $s^1$ and $s^2$, as stated in Equation (9). A change is observed if either $\mathrm{PDD}_1 > \tau$ or $\mathrm{PDD}_2 > \tau$.
$$\mathrm{PDD}_1 = s_t^1 - s_{t+1}^1; \qquad \mathrm{PDD}_2 = s_t^2 - s_{t+1}^2 \qquad (9)$$
To instantiate Equation (7), the action is selected using the change observation obtained by subtracting $s_{t+1}$ from $s_t$, as stated in Equation (3). Thus, the action selection is carried out as follows: a grasp ($g$) is selected if $\mathrm{PDD}_1 > \tau$ OR $\mathrm{PDD}_2 > \tau$, and a push ($p$) is selected if $\mathrm{PDD}_1 \le \tau$ AND $\mathrm{PDD}_2 \le \tau$. Equation (7) can then be expressed as Equation (10).
$$y_i = r_t + \gamma \max_{a'} Q(s', a'), \qquad a = \begin{cases} g & \text{if } \mathrm{PDD}_1 > \tau \ \text{OR} \ \mathrm{PDD}_2 > \tau \\ p & \text{if } \mathrm{PDD}_1 \le \tau \ \text{AND} \ \mathrm{PDD}_2 \le \tau \end{cases} \qquad (10)$$
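A compact sketch of the Equation (9)–(10) coordination rule follows, again reading the PDD comparison as a summed absolute depth difference against the trial-and-error threshold; names are illustrative.

```python
import numpy as np

def coordinate_action(pdd_view1: np.ndarray, pdd_view2: np.ndarray,
                      tau: float) -> str:
    """Eq. (10) selection rule: grasp when either view registers a change
    (PDD_1 > tau OR PDD_2 > tau); push only when neither view does."""
    changed_1 = float(np.abs(pdd_view1).sum()) > tau
    changed_2 = float(np.abs(pdd_view2).sum()) > tau
    return "grasp" if (changed_1 or changed_2) else "push"
```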
The primitive motion $\psi$ is parameterized as a vector $(x, y, z, \theta)$, where $(x, y, z)$ denotes the middle position of the gripper and $\theta$ is the rotation of the gripper in the table plane during action execution. In addition, the action value function, which evaluates the quality of each potential action at a given state, is approximated using a fully convolutional network (FCN). To ease the learning of oriented pushing and grasping actions, the input heightmap is rotated into 16 orientations ($\theta$), each corresponding to a push or grasp at a different multiple of 22.5° from the original state, before being sent to the FCN. The retrieved visual features and execution parameters $(x, y, z, \theta)$ are then concatenated by the Q-function network to generate the policy state. The Q-value policy samples the optimal action by mapping the action distribution across the current state representation using the FCN (i.e., the action network $\phi_A$) at $Q(s_t, a_t)$. The primitive action $\psi$ and pixel $\rho_x$ with the greatest Q-value across all 16 pixel-wise Q-value maps constitute the action that maximizes the Q-function (Equation (11)).
$$\arg\max_{a} Q(s_t, a) = \operatorname*{argmax}_{(\psi,\ \rho_x)} \ \phi_A(s_t) \qquad (11)$$
Once the pixel with the best predicted Q-value is obtained, a grasp is performed. During the subsequent iterations, the visual changes in the previous state of the workspace and the current state are examined to determine the suitable action. The next section describes the proposed change observation strategy.
Reward function: This provides the agent with a score indicating how well the algorithm performs in relation to the environment; it denotes how well or poorly the agent is performing. In other words, rewards represent both gains and losses. To mitigate their impact, future rewards are multiplied by the discount factor $\gamma$, as written in Equation (10). The reward function for the MV-COBA is divided into two groups, a grasp reward and a push reward, as follows.
Grasp reward: $r_g(s_t, s_{t+1}) = 0.0$ is assigned for a failed grasp attempt with no change detected in the workspace, $r_g(s_t, s_{t+1}) = 0.5$ is assigned for a failed grasp attempt with a detected change in the workspace, and $r_g(s_t, s_{t+1}) = 1.0$ is assigned for a successful grasp attempt.
Push reward: $r_p(s_t, s_{t+1}) = 0.0$ is assigned for pushes with no change detected in the workspace, and $r_p(s_t, s_{t+1}) = 0.5$ is assigned for pushes with a detected change in the workspace.
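These reward rules translate directly into code; the sketch below is a straightforward transcription (the function and argument names are ours).

```python
def grasp_reward(grasp_succeeded: bool, change_detected: bool) -> float:
    """1.0 for a successful grasp, 0.5 for a failed grasp that still changed
    the workspace, 0.0 otherwise."""
    if grasp_succeeded:
        return 1.0
    return 0.5 if change_detected else 0.0

def push_reward(change_detected: bool) -> float:
    """0.5 for a push that changed the workspace, 0.0 otherwise."""
    return 0.5 if change_detected else 0.0
```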

4.4. MV-COBA Overview

An illustration of the proposed MV-COBA is presented in Figure 5. The MV-COBA works in two stages. The first stage is to take multiple views of the workspace into consideration, which aims to increase the likelihood of grasping by having more complete visual data of the workspace. It helps to alleviate difficulty in generating dexterous grasping points due to visual occlusion in a single-view setting. In the second stage, the robot decides on a push or grasp action based on the highest Q-value (obtained from the first stage) and the observation of the visual change in the workspace.
Firstly, the two fixed-mount RGB-D cameras capture the robot’s specified workspace from two opposing views. Then, all the images (RGB, D) are resized to $d \times d$ and represented as 2D arrays, and two sets of random patches are extracted from them: one set of RGB patches of size $h \times w \times c$ and one set of D patches of size $h \times w \times c$, where $h$, $w$, and $c$ are the height, width, and number of channels, respectively. Each set of patches is then normalized separately. The RGB images have three channels (red, green, blue), and D has one channel. To construct a color heightmap ($ch$) and a depth heightmap ($dh$), the RGB and depth images are orthographically projected in the direction of gravity using the known extrinsic camera parameters. This heightmap image represents the current state of the robot workspace. The workspace has a resolution of 224 × 224 pixels and covers a $0.448^2\ \mathrm{m}^2$ area of the tabletop surface. As a result, each pixel in the 3D workspace represents a $0.002^2\ \mathrm{m}^2$ vertical column of the heightmap.
Secondly, features are retrieved from each heightmap in a convolutional manner with the two trained feature extractor networks (FENs), which are two-layer residual networks [78], where each FEN has two channels. One channel takes the $ch$ patch set as input and extracts color features from the RGB images; the other takes the $dh$ patch set as input and extracts shape features from the depth images. Therefore, the extracted features can be expressed as $\mathrm{EF}_1 = \mathrm{vector}(ch_1, dh_1)$ and $\mathrm{EF}_2 = \mathrm{vector}(ch_2, dh_2)$. Before the extracted features are fed into DenseNet, the set $(ch_1, dh_1)$ and the set $(ch_2, dh_2)$ are concatenated independently. The output of each FEN is represented as a vector $(ch, dh)$, which is fed into DenseNet-121 [79], pre-trained on ImageNet [80], to generate the motion-agnostic features (MAF).
After forward propagation through the FENs and DenseNet, we concatenate their outputs as the final feature vector. These features (i.e., the MAF) are then sent to the action net $\phi_A$, a three-layer residual network followed by bilinear up-sampling, to estimate the Q-values of the grasp and push actions at 16 orientations (different multiples of 22.5°). The action that maximizes the Q-function is the primitive action and pixel $(\psi, \rho_x)$ with the highest Q-value across all 32 pixel-wise Q-value maps. Experience replay [61] is employed to store the agent’s experiences at each time step in a dataset $e_t = (s_t, a_t, r_t, s_{t+1})$, which is pooled across many episodes to form a replay memory.
The primitive push or grasp action is selected based on the highest Q-value obtained. At the same time, the visual change in the workspace is examined by comparing the workspace images in the two subsequent states. The robot performs a push action if no noticeable visual change is detected; otherwise, a grasp action is executed. In self-supervised learning, this process is repeated continuously. The grasp and push actions are trained with a single shared fully convolutional network (FCN); both are predicted by the same prediction network, in contrast to previous works that trained grasp and push actions on two parallel separate networks. Training grasp and push actions on a single network has the advantage of avoiding unnecessary push behavior and preventing objects from being pushed out of the workspace borders.
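The sketch below is a simplified stand-in for this perception stack (the layer sizes, the lightweight extractors, and the single-channel head are illustrative, not the authors' exact FEN/action-net design): two views of color and depth heightmaps are fused channel-wise, passed through a DenseNet-121 trunk, and up-sampled back to a pixel-wise Q map.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiViewQNet(nn.Module):
    """Two views of colour/depth heightmaps -> per-modality feature extractors
    -> channel-wise fusion -> DenseNet-121 trunk -> small action head
    up-sampled back to a pixel-wise Q map."""
    def __init__(self):
        super().__init__()
        # lightweight extractors (shared across the two views in this sketch)
        self.color_fen = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.depth_fen = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        trunk = models.densenet121(weights=None)
        # accept the 32 fused feature channels instead of 3 RGB channels
        trunk.features.conv0 = nn.Conv2d(32, 64, 7, stride=2, padding=3, bias=False)
        self.trunk = trunk.features                 # -> [B, 1024, 7, 7] for 224 input
        self.action_head = nn.Sequential(
            nn.Conv2d(1024, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )

    def forward(self, ch1, dh1, ch2, dh2):
        fused = torch.cat([self.color_fen(ch1), self.depth_fen(dh1),
                           self.color_fen(ch2), self.depth_fen(dh2)], dim=1)
        return self.action_head(self.trunk(fused))  # [B, 1, 224, 224]

net = MultiViewQNet()
q = net(torch.rand(1, 3, 224, 224), torch.rand(1, 1, 224, 224),
        torch.rand(1, 3, 224, 224), torch.rand(1, 1, 224, 224))
```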

5. Simulation of Experiments

The V-REP simulator was employed in this paper to execute the simulation experiments with a UR5 robot equipped with a parallel-jaw gripper. First, the camera took an RGB-D image of the workspace with a resolution of 640 × 480 pixels. Each pixel of the heightmap corresponds to a horizontally oriented physical grasp centered at that point in the scene. Training was conducted via PyTorch on a machine with a 3.7 GHz Intel Core i7-8700HQ CPU and an NVIDIA 1660Ti GPU.

5.1. Baseline Comparisons

For performance comparison, several baseline models as described below were adopted. The results of the MV-COBA and these baseline models are presented and discussed in Section 6.
MV-COBA-2FCNs: This baseline is almost identical to the proposed MV-COBA. However, rather than utilizing a single shared action network $\phi_A$ to predict either a push or a grasp action, the push and grasp predictions are trained on two separate FCNs, $\phi_p$ for pushing and $\phi_g$ for grasping, coordinated by change observation, as illustrated in Figure 6. The first camera provides the visual data for the push prediction $Q_p(s_t \in \{ch_1, dh_1\})$, while the second camera provides the visual data for the grasp prediction $Q_g(s_t \in \{ch_2, dh_2\})$. The goal of this baseline is to determine how the robot executes both actions when each has its own network predictor, as well as to aid in the assessment process. Like MV-COBA, the MV-COBA-2FCNs approach seeks to enhance the probability of grasp success through the synergy of push and grasp actions, and it is organized into two stages: the first is the multi-perspective stage, which employs two cameras and addresses the additional difficulties of grasping in a cluttered setting; the second is the visual change observation stage, which regulates pushing and grasping actions in clutter and in a confined space.
DRGP [9]: Al-Shanoon et al. proposed the Deep Reinforcement Grasp Policy (DRGP), which focuses on performing the grasp-only policy with a single view when no other actions (such as push, shift, and poke) are involved. The difficulty in a well-organized shape situation is that there is no space for the robot’s gripper’s fingers to execute the grasp. The performance improvement of our proposed approach during a grasping task is compared to the DRGP baseline.
VPG [27]: The Visual Pushing for Grasping approach has worked effectively for grasping tasks in cluttered scenarios. However, the performance of push actions occasionally fails in some cases, which forces objects to be pushed outside the robot’s workspace, particularly during randomly cluttered object scenarios.
CPG [67]: Yang et al. developed a sophisticated Q-learning framework for collaborative pushing and grasping (CPG). Their paper proposed a non-maximum suppression policy (policyNMS) for dynamically evaluating pushing and grasping actions by imposing a suppression constraint on unreasonable actions. Additionally, a data-driven pushing reward network called PR-Net was used to determine the degree of separation or aggregation between objects. Even if the environment changes throughout the execution process, the prediction-based determination of the maximum Q-value could occasionally fail to determine the correct action to be executed. Additionally, this method lacks a restriction on the pushing distance, which could cause certain objects to be pushed out of the robot’s workspace border.
MVP [24]: The Multi-View Picking (MVP) baseline selects multiple informative viewpoints for an eye-in-hand camera in order to reduce the uncertainty in grasp pose estimation caused by clutter and occlusion when reaching for a grasp. This enables it to execute grasps that are not visible from the initial view. This baseline is similar to the proposed MV-COBA in its use of multiple views for grasping; however, rather than two fixed cameras, a single movable camera captures the multiple perspectives.
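To make the two-predictor idea in MV-COBA-2FCNs concrete, the following is a simplified sketch in which the layer sizes and the max-Q selection rule are illustrative assumptions rather than the exact architecture of Figure 6: two separate fully convolutional predictors produce pixel-wise Q maps, one for pushing from the first camera and one for grasping from the second.

```python
import torch
import torch.nn as nn

def make_fcn(in_channels=4):
    """Tiny fully convolutional head: RGB-D heightmap -> 1-channel pixel-wise Q map."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 1, 1),
    )

phi_p, phi_g = make_fcn(), make_fcn()       # separate predictors for push and grasp

def select_action(obs_cam1, obs_cam2):
    """obs_cam*: (1, 4, H, W) RGB-D heightmaps from the two views."""
    q_push = phi_p(obs_cam1)                # push Q map from camera 1
    q_grasp = phi_g(obs_cam2)               # grasp Q map from camera 2
    if q_push.max() > q_grasp.max():        # pick the primitive with the higher peak Q value
        return "push", torch.nonzero(q_push == q_push.max())[0]
    return "grasp", torch.nonzero(q_grasp == q_grasp.max())[0]
```

MV-COBA itself replaces the two predictors with a single shared network for both primitives, as described earlier.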

5.2. Training Scenarios

The MV-COBA was trained by self-supervised learning on a simulation platform. During training, a set of ten objects of varying shapes was randomly dropped into the robot's workspace. The robot learned to execute either a grasp or a push action by trial and error. Once all objects had been cleared from the workspace, another set of ten objects was dropped for further training. Experience was collected continuously until the robot completed 3000 training iterations.
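As a rough, runnable illustration of this protocol, the following toy loop uses a trivial stand-in environment rather than the V-REP interface: batches of ten objects are repeatedly dropped and cleared until the iteration budget is reached.

```python
import random

class ToyEnv:
    """Trivial stand-in for the simulated workspace, used only to illustrate the protocol."""
    def __init__(self):
        self.remaining = 0
    def drop_objects(self, n=10):
        self.remaining = n                    # drop a fresh set of ten random objects
    def cleared(self):
        return self.remaining == 0
    def step(self):
        if random.random() < 0.5:             # pretend roughly half of the actions succeed
            self.remaining -= 1

env, iteration, MAX_ITERATIONS = ToyEnv(), 0, 3000
env.drop_objects()
while iteration < MAX_ITERATIONS:
    if env.cleared():
        env.drop_objects()                    # refill the workspace and keep collecting experience
    env.step()                                # real setup: choose push/grasp, then one gradient update
    iteration += 1
```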
For the training stage, the stochastic gradient descent (SGD) optimizer was applied to train the FCN with a learning rate of $10^{-4}$, a momentum of 0.9, and a weight decay of $2^{-5}$. The agent explored using the epsilon-greedy ($\epsilon$-greedy) policy, with $\epsilon$ starting at 0.5 and decaying during the training session. The learning performance of the grasping action improved incrementally over the training iterations. At each iteration, the agent was trained by reducing the temporal difference error ($Loss_i$) using the Huber loss function stated in Equation (12), where $\theta_i$ denotes the neural network parameters at iteration $i$, and $\theta_i^{-}$ denotes the target network parameters held fixed between individual updates. Gradients were passed only through the single pixel $p_x$ and the action network $\phi_A$ value used to compute the predicted value of the executed action; at each iteration, all other pixels were backpropagated with zero loss.
$$
Loss_i =
\begin{cases}
\dfrac{1}{2}\left(Q_{\theta_i}(s_i, a_i) - y_i^{\theta_i^{-}}\right)^2, & \text{if } \left|Q_{\theta_i}(s_i, a_i) - y_i^{\theta_i^{-}}\right| < 1,\\[6pt]
\left|Q_{\theta_i}(s_i, a_i) - y_i^{\theta_i^{-}}\right| - \dfrac{1}{2}, & \text{otherwise}
\end{cases}
\tag{12}
$$
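As a concrete illustration, the following PyTorch sketch computes the loss of Equation (12) at the single executed pixel only, which has the same effect as backpropagating zero loss through all other pixels; the variable names are illustrative, and `torch.nn.functional.smooth_l1_loss` implements the Huber loss with the unit threshold used here.

```python
import torch
import torch.nn.functional as F

def td_loss(q_map, executed_pixel, target_value):
    """q_map: (H, W) predicted Q values for the executed primitive;
    executed_pixel: (row, col) of the executed action p_x; target_value: scalar y_i."""
    r, c = executed_pixel
    prediction = q_map[r, c]
    target = torch.as_tensor(target_value, dtype=prediction.dtype)
    return F.smooth_l1_loss(prediction, target)   # Huber loss with threshold 1, as in Equation (12)

# Optimiser settings reported above:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=2**-5)
```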

5.3. Testing Scenarios

The proposed MV-COBA was tested in a variety of scenarios to assess how well it performs the grasping task. The testing experiment was divided into three sections, each validating a different type of test scenario. The first two scenarios highlight the importance of applying the visual change observation (CO) idea to synergize grasp and push actions, while the third scenario demonstrates the core benefit of employing multiple views to prevent occlusion. The two cameras capturing multiple views of the robot's workspace act as complementary parts of each other: when the first camera is occluded and cannot see the workspace clearly, the second camera compensates for the missing visual data.
The challenge of randomly cluttered objects (test cases 1–6): The proposed approach was tested on six scenarios with a random selection of 20–40 objects cluttered together, as shown in Figure 7.
The challenge of well-ordered objects (test cases 7–12): The proposed approach was tested on six scenarios in which 4–17 objects were manually arranged to reflect challenging configurations, as shown in Figure 7. For instance, some objects were arranged tightly together, leaving no space for grasping.
The challenge of occluded objects (test cases 13–15): The proposed approach was tested on three test cases in which some objects were occluded from one side but visible from the other, and vice versa, as illustrated in Figure 8. Occlusion between the camera and the target objects can degrade the robot's performance and lead to failed grasps. To address this, the robot's workspace was observed from two sides using two cameras.
The first challenge scenario, i.e., randomly cluttered objects, is the most common in the real world. Here the robot may execute grasping effectively even with a grasp-only policy, whereas the other baselines under consideration showed lower efficiency due to frequent grasping failures and unhelpful push behavior. In the second challenge scenario (well-ordered objects), however, accomplishing the grasping task with a grasp-only policy is difficult: the robot frequently fails to grip an object, forcing a new test session to begin without the task being completed. In other words, in the first challenge scenario a robot with a grasp-only policy can complete the grasping task after several attempts, while in the second scenario the arrangement of the objects makes it far more likely to fail. In the third challenge scenario (occluded objects), the occlusion between the camera and the objects may prevent the robot from grasping with only a single view. These three types of scenarios were used to test the ability of MV-COBA to overcome the aforementioned issues. The key goals of these scenarios are as follows: (1) to determine whether using two cameras is more successful than using a single camera when executing grasping tasks in a variety of settings, not only in clutter but also under occlusion; and (2) to determine whether the push action can aid the robot in executing grasping tasks, and whether change observation-based execution of grasp and push actions can complete grasping tasks in a variety of scenarios, not only in the well-ordered object challenge.
These scenarios, which differ from the training scenario described in Section 5.2, were created by hand to represent difficult situations. In some of these test cases, objects were placed close together in positions and orientations from which even an optimal grasping policy would struggle to pick up any of them without first separating them. Furthermore, a single isolated object, separated from the configuration, was placed in the workspace as a checkpoint indicating the readiness of the trained policy; the policy was deemed not ready if this isolated object was not grasped.

5.4. Evaluation Metrics

The proposed MV-COBA and the baseline methods were evaluated on the series of test cases described above. The robot was required to pick up and clear all objects from the workspace. For each test case, five test runs (denoted by n) were performed, with the number of objects in the workspace ranging from 4 to 40. Three evaluation metrics were used to assess the performance of the models; for all of them, higher values are better. These metrics are as follows (a short computation sketch follows the list):
  • The grasp success rate: the ratio of successful grasp attempts to the total number of executed actions over the n test runs of a test case.
  • The action efficiency rate: the ratio of the number of objects to the number of actions executed before completion. It measures how efficiently the model grasps all objects.
  • The completion rate: the average ratio of the number of cleared objects to the total number of objects. It measures the capability of a method to grasp all objects in each test case without failing more than five consecutive actions.
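The following is a small sketch of how these three metrics could be computed for one test case; the data layout and field names are assumptions made for illustration, not the evaluation script used in this work.

```python
def evaluate_test_case(runs):
    """runs: one dict per test run with keys 'successful_grasps', 'total_actions',
    'num_objects', and 'cleared_objects' (field names are illustrative assumptions)."""
    grasp_success = sum(r["successful_grasps"] for r in runs) / sum(r["total_actions"] for r in runs)
    action_efficiency = sum(r["num_objects"] for r in runs) / sum(r["total_actions"] for r in runs)
    completion = sum(r["cleared_objects"] / r["num_objects"] for r in runs) / len(runs)
    return grasp_success, action_efficiency, completion

# Example: five test runs of a 10-object test case, each cleared in 13 actions
runs = [{"successful_grasps": 10, "total_actions": 13, "num_objects": 10, "cleared_objects": 10}] * 5
print(evaluate_test_case(runs))   # -> (0.769..., 0.769..., 1.0)
```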

6. Results and Discussion

This section organizes the findings into training and testing sessions. For the training session, the outcomes of the baseline models and MV-COBA are shown as graphs of the grasp success rate and action efficiency, which indicate how each baseline behaved throughout the training phase and how quickly and efficiently it learned. For the testing session, a series of test cases, each executed for five test runs, is presented, and the performance of the models is evaluated on the grasp success rate, the action efficiency rate, and the completion rate.

6.1. Training Session Findings

The proposed MV-COBA and the other baselines were trained using the same training protocol. Grasping performance was assessed as the percentage of grasp success and action efficiency over every 200 recent grasp attempts (m = 200). For the earlier training trials, i.e., trials i < m, the percentage was scaled by a factor of i/m. The graphs of the grasp success rate and the action efficiency rate over 4000 training iterations are illustrated in Figure 9 and Figure 10, respectively.
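For illustration, the running training-curve metric described above can be computed as in the following sketch (variable names are assumptions).

```python
import numpy as np

def running_grasp_success(outcomes, m=200):
    """outcomes: sequence of 1/0 values, 1 if grasp attempt i succeeded."""
    outcomes = np.asarray(outcomes, dtype=float)
    curve = []
    for i in range(1, len(outcomes) + 1):
        rate = outcomes[max(0, i - m):i].mean()   # success over the most recent m attempts
        if i < m:
            rate *= i / m                         # scale down the early, low-sample estimates
        curve.append(rate)
    return np.array(curve)
```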
The proposed MV-COBA outperformed the other baseline models, with 87.2% grasp success and 82.8% action efficiency. MV-COBA-2FCNs, which has almost the same architecture as MV-COBA but uses two FCNs for push and grasp prediction alongside the two-camera view for visual change detection, showed a comparable performance, with an 83.8% grasp success rate and an 80.2% action efficiency rate. It is worth noting that the proposed MV-COBA helped the robot learn quickly while maintaining a consistent grasping performance: according to Figure 9 and Figure 10, it achieved an 80% grasp success rate within the first 500 training iterations. By contrast, the DRGP baseline reached a grasp success (and action efficiency) rate of 74.1%, and the MVP baseline a grasp success (and action efficiency) rate of 77.2%. The VPG and CPG baselines, which are regarded as having an ideal mechanism for coordinating two primitive actions when executing grasping tasks in a cluttered environment, showed relatively low grasp success and action efficiency rates of 76.7% and 65.2% for VPG, and 71.3% and 59.4% for CPG, respectively. The VPG baseline also needed a substantial number of training iterations to learn a policy that synergizes grasping and pushing behaviors, requiring more than 500 iterations to achieve a 50% grasp success rate. Likewise, the CPG baseline struggled to reach 55% grasp success within the first 1000 iterations, indicating that it requires many iterations to gain experience in the grasping task. The performance of MV-COBA and the other baseline models in terms of the grasp success rate and action efficiency is summarized in Table 1.

6.2. Testing Session Findings

The findings are divided into three sections: (1) the randomly cluttered object challenge; (2) the well-ordered object challenge; and (3) the occluded object challenge. Across all 15 test scenarios illustrated in Figure 7 and Figure 8, the grasping performance of the proposed MV-COBA was compared with that of MV-COBA-2FCNs as well as the earlier works DRGP, MVP, VPG, and CPG. As detailed in Section 5.4, three metrics, the grasp success rate, the action efficiency rate, and the completion rate (or clearance), were used for performance evaluation in the first and second challenge scenarios. The evaluation results for the randomly cluttered and well-ordered object challenge scenarios are tabulated in Table 2 and Table 3, respectively. In the third challenge scenario, only the grasp success rate and the completion rate were considered, as tabulated in Table 4; the action efficiency rate was not calculated because CPG, VPG, and DRGP were unable to clear all of the objects (i.e., achieve completion) in any of the test runs.

6.2.1. Randomly Cluttered Object Challenge

Table 2 shows that the proposed MV-COBA outperformed the other approaches on all three evaluation metrics in all test cases. It achieved a high grasping performance, with an 83.6% average grasp success rate compared to 70.8% for MV-COBA-2FCNs, 56.6% for DRGP, 57.8% for MVP, 65.4% for VPG, and 64.5% for CPG. The grasp success rate affects the two other evaluation metrics, and correspondingly lower action efficiency and completion rates were recorded for all the baseline methods. On average, MV-COBA improved the grasp success rate by about 13% compared with MV-COBA-2FCNs, by about 26% compared with DRGP and MVP, and by about 18% compared with VPG and CPG. CPG and VPG fared poorly in the randomly cluttered object scenarios, with average action efficiency rates of 48.7% and 51.1%, respectively, followed by DRGP at 56.6% and MVP at 57.8%. In contrast, the proposed MV-COBA picked objects from the robot's workspace efficiently under the same challenging scenarios, with an action efficiency rate of 75%. The low action efficiency of CPG and VPG suggests that these policies executed a large number of push actions, many of which were not helpful; in addition, excessive pushes tend to move the foremost objects out of the workspace borders, preventing the policy from completing the grasping task. VPG and CPG showed moderate completion rates of 70.9% and 67.0%, whereas DRGP and MVP showed the lowest completion rates (44.6% and 50.5%, respectively); in contrast, MV-COBA demonstrated a high completion rate of 94%, picking up all objects from the workspace without failing more than five consecutive actions.

6.2.2. Well-Ordered Object Challenge

The evaluation in this part was conducted on the other challenge scenarios, in which the objects were tightly arranged with no space for grasping. Table 3 shows that the proposed MV-COBA performed the grasping task most efficiently, with an 86.3% grasp success rate. MV-COBA also exceeded the average performance of MV-COBA-2FCNs, DRGP, MVP, CPG, and VPG on all other evaluated metrics, with an 80.4% action efficiency rate and a 100% completion rate. As expected, MVP and DRGP showed the worst performance among all the methods tested in the well-ordered object scenarios, with average grasp success (and action efficiency) rates of 35.6% and 42.3%, respectively. These grasp-only policies failed to complete the task in most scenarios, showing completion rates of only 27.2% and 20.6%. Without assistance from a non-prehensile action such as pushing, the grasp-only policies (DRGP and MVP) could not disrupt the arrangement of the blocks and kept struggling to grasp the surrounding objects. Whenever a policy failed for more than five consecutive actions, the simulation restarted with a new test run without the task being finished. MV-COBA-2FCNs, CPG, and VPG recorded a better average performance than the grasp-only policies on all three evaluated metrics.

6.2.3. Occluded Object Challenge

In this part of the experiment, objects were purposely arranged so that some of them were partially or fully occluded from a camera view. Using multiple views of the workspace, the proposed MV-COBA was able to reconstruct the workspace images and thus gain a better understanding of the workspace. As shown in Table 4, the proposed MV-COBA outperformed the baseline approaches by a large margin, achieving an average grasp success rate of 97.8% and a completion rate of 100%. The MV-COBA-2FCNs, DRGP, MVP, CPG, and VPG approaches fared very poorly in the occluded object scenarios; the highest average grasp success rate among them was 44.3%, recorded by MVP. Although MV-COBA-2FCNs employs multiple views from two cameras, its performance was also poor in these situations. The reason is that the first camera provides the visual data of the workspace state for the push prediction while the second camera provides the visual data for the grasp prediction, with no feature fusion of the two cameras' data; the robot therefore acted poorly once one or both cameras were partially or fully blocked. Consequently, all of these approaches failed to complete the task in every scenario, and a 0% completion rate was recorded for each of them, as the robot was unable to locate the hidden objects and thus could not completely clear the workspace.
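One simple way to combine two registered views, given here only as a hedged sketch and not necessarily the reconstruction step used by MV-COBA, is to take the per-cell maximum of the two aligned depth heightmaps, so that a region hidden from one camera is filled in by the other.

```python
import numpy as np

def fuse_heightmaps(hm_cam1, hm_cam2):
    """hm_cam*: HxW height-above-table maps already registered to the same workspace grid.
    Taking the per-cell maximum keeps objects that are occluded in one view but visible in
    the other, so a blocked camera degrades the observation instead of destroying it."""
    return np.maximum(hm_cam1, hm_cam2)
```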
Comparing other recent grasping methods, e.g., [26,27,67], the authors of these studies attempted to break up object arrangements by combining two actions. Their methods, however, have certain limitations. Firstly, they synergize the two actions based on the maximum Q-value, training the grasping and pushing actions on two separate neural networks in parallel and then selecting the maximum value; this strategy occasionally fails to detach objects from the arrangement because of the resulting push behavior. Secondly, their approaches were tested on detaching objects from an arranged shape and did not consider randomly arranged objects. Lastly, they mapped visual observations to actions using just a single view. Works that used multi-view-based grasping (e.g., [24,40]) relied on a single-camera multi-view strategy, with no consideration of the camera itself being fully or partially obscured, which causes the robot to fail to complete the task; furthermore, they performed a grasp-only policy, which cannot execute grasps when the objects are tightly organized, so the robot struggles to complete the tasks. In contrast, the proposed MV-COBA coordinates the grasp and push actions based on change observation rather than the maximum value. Moreover, we trained both grasping and pushing actions with a single shared neural network to avoid unnecessary pushes and to ensure that shift actions were applied to a single object rather than pushing the entire pile of objects. Our proposed MV-COBA also uses multiple views from two cameras placed on opposite sides of the workspace; the goal is to improve the likelihood of a successful grasp and to avoid the occlusion issue when one camera is completely obscured or both cameras are partially occluded, since the two views compensate for each other.
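For intuition, a minimal sketch of the change-observation signal used to coordinate the two actions is given below; the thresholds are illustrative assumptions rather than the values used in this work.

```python
import numpy as np

def scene_changed(prev_heightmap, curr_heightmap, depth_eps=0.01, min_changed_pixels=300):
    """Compare the depth heightmaps before and after an action and return True if enough
    pixels changed height by more than depth_eps metres, i.e., the push actually rearranged
    the objects and a grasp is worth attempting next."""
    changed = np.abs(curr_heightmap - prev_heightmap) > depth_eps
    return int(changed.sum()) >= min_changed_pixels
```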

7. Conclusions and Future Work

One of the challenges in robotics is executing grasping tasks in an unstructured environment. In this paper, the proposed MV-COBA, which combines a dual-camera multi-view approach with a change observation-based approach, showed excellent grasping performance in various test scenarios involving randomly cluttered, well-ordered, and occluded objects. The simulation results show that the proposed MV-COBA can efficiently clear objects from the workspace. The method achieved grasp success rates of 83.6%, 86.3%, and 97.8% in the cluttered, well-ordered, and occlusion scenarios, respectively, indicating its capability to complete grasping tasks successfully. MV-COBA also recorded the highest action efficiency among the tested methods, implying that the policy was effective in synergizing the push and grasp behaviors. Furthermore, the proposed approach demonstrated a significantly high completion rate (94% in the cluttered scenario and 100% in both the well-ordered and occlusion scenarios), even in challenging scenarios. The outstanding grasping performance achieved by MV-COBA shows that the proposed learning policy is effective in overcoming the aforementioned problems, i.e., the coordination of push and grasp actions, and the occlusion that arises from having a single view. However, the proposed approach was tested only in simulation, which is a limitation to consider. In future work, implementation of the proposed approach on hardware would provide strong validation for those interested in conducting further research.

Author Contributions

Conceptualization, M.Q.M., L.C.K. and S.C.C.; formal analysis, L.C.K.; funding acquisition, L.C.K., A.S.A. and B.A.M.; investigation, L.C.K., S.C.C., A.A.-D., Z.G.A.-M. and B.A.M.; methodology, M.Q.M. and L.C.K.; project administration, L.C.K. and S.C.C.; resources, M.Q.M.; software, M.Q.M.; supervision, L.C.K. and S.C.C.; visualization, A.S.A., A.A.-D., Z.G.A.-M. and B.A.M.; writing—original draft, M.Q.M.; writing—review and editing, L.C.K., S.C.C., A.S.A., A.A.-D., Z.G.A.-M. and B.A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Multimedia University (MMU) through the MMU GRA Scheme (MMUI/190004.02) and the MMU Internal Fund (MMUI/210111).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Acknowledgments

The authors are thankful to Multimedia University (MMU) for supporting this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Marwan, Q.M.; Chua, S.C.; Kwek, L.C. Comprehensive Review on Reaching and Grasping of Objects in Robotics. Robotica 2021, 39, 1849–1882. [Google Scholar] [CrossRef]
  2. Mohammed, M.Q.; Chung, K.L.; Chyi, C.S. Review of Deep Reinforcement Learning-Based Object Grasping: Techniques, Open Challenges, and Recommendations. IEEE Access 2020, 8, 178450–178481. [Google Scholar] [CrossRef]
  3. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; Volume 148, pp. 1–162. [Google Scholar]
  4. François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An Introduction to Deep Reinforcement Learning. Found. Trends Mach. Learn. 2018, 11, 219–354. [Google Scholar]
  5. Kumar, N.M.; Mohammed, M.A.; Abdulkareem, K.H.; Damasevicius, R.; Mostafa, S.A.; Maashi, M.S.; Chopra, S.S. Artificial intelligence-based solution for sorting COVID related medical waste streams and supporting data-driven decisions for smart circular economy practice. Process. Saf. Environ. Prot. 2021, 152, 482–494. [Google Scholar] [CrossRef]
  6. Mohammed, M.Q.; Chung, K.L.; Chyi, C.S. Pick and Place Objects in a Cluttered Scene Using Deep Reinforcement Learning. Int. J. Mech. Mechatron. Eng. IJMME 2020, 20, 50–57. [Google Scholar]
  7. Deng, Y.; Guo, X.; Wei, Y.; Lu, K.; Fang, B.; Guo, D.; Liu, H.; Sun, F. Deep Reinforcement Learning for Robotic Pushing and Picking in Cluttered Environment. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 619–626. [Google Scholar] [CrossRef]
  8. Wu, B.; Akinola, I.; Allen, P.K. Pixel-Attentive Policy Gradient for Multi-Fingered Grasping in Cluttered Scenes. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1789–1796. [Google Scholar] [CrossRef] [Green Version]
  9. Al-Shanoon, A.; Lang, H.; Wang, Y.; Zhang, Y.; Hong, W. Learn to grasp unknown objects in robotic manipulation. Intell. Serv. Robot. 2021, 14, 571–582. [Google Scholar] [CrossRef]
  10. Mohammed, M.Q.; Kwek, L.C.; Chua, S.C. Learning Pick to Place Objects using Self-supervised Learning with Minimal Training Resources. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 493–499. [Google Scholar]
  11. Lakhan, A.; Abed Mohammed, M.; Ahmed Ibrahim, D.; Hameed Abdulkareem, K. Bio-Inspired Robotics Enabled Schemes in Blockchain-Fog-Cloud Assisted IoMT Environment. J. King Saud Univ. Comput. Inf. Sci. 2021. [Google Scholar] [CrossRef]
  12. Mostafa, S.A.; Mustapha, A.; Gunasekaran, S.S.; Ahmad, M.S.; Mohammed, M.A.; Parwekar, P.; Kadry, S. An agent architecture for autonomous UAV flight control in object classification and recognition missions. Soft Comput. 2021. [Google Scholar] [CrossRef]
  13. Zhao, T.; Deng, M.; Li, Z.; Hu, Y. Cooperative Manipulation for a Mobile Dual-Arm Robot Using Sequences of Dynamic Movement Primitives. IEEE Trans. Cogn. Dev. Syst. 2020, 12, 18–29. [Google Scholar] [CrossRef]
  14. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2016, arXiv:1509.02971. [Google Scholar]
  15. Heess, N.; Tb, D.; Sriram, S.; Lemmon, J.; Merel, J.; Wayne, G.; Tassa, Y.; Erez, T.; Wang, Z.; Eslami, S.M.A.; et al. Emergence of Locomotion Behaviours in Rich Environments. arXiv 2017, arXiv:170702286v2. [Google Scholar]
  16. Schulman, J.; Eecs, J.; Edu, B.; Abbeel, P.; Cs, P.; Edu, B. Trust Region Policy Optimization. In Proceedings of the 31st International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897. [Google Scholar]
  17. Mnih, V.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.P.; Silver, D. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  18. Bhagat, S.; Banerjee, H. Deep Reinforcement Learning for Soft, Flexible Robots: Brief Review with Impending Challenges. Robotics 2019, 8, 93. [Google Scholar]
  19. Fawzi, H.; Mostafa, S.A.; Ahmed, D.; Alduais, N.; Mohammed, M.A.; Elhoseny, M. TOQO: A new Tillage Operations Quality Optimization model based on parallel and dynamic Decision Support System. J. Clean. Prod. 2021, 316, 128263. [Google Scholar] [CrossRef]
  20. Podder, A.K.; Bukhari, A.A.L.; Islam, S.; Mia, S.; Mohammed, M.A.; Kumar, N.M.; Cengiz, K.; Abdulkareem, K.H. IoT based smart agrotech system for verification of Urban farming parameters. Microprocess Microsyst. 2021, 82, 104025. [Google Scholar] [CrossRef]
  21. Guo, D.; Kong, T.; Sun, F.; Liu, H. Object discovery and grasp detection with a shared convolutional neural network. In Proceedings of the IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2038–2043. [Google Scholar]
  22. Zhang, H.; Lan, X.; Bai, S.; Wan, L.; Yang, C.; Zheng, N. A Multi-task Convolutional Neural Network for Autonomous Robotic Grasping in Object Stacking Scenes. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6435–6442. [Google Scholar]
  23. Park, D.; Seo, Y.; Shin, D.; Choi, J.; Chun, S.Y. A single multi-task deep neural network with post-processing for object detection with reasoning and robotic grasp detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7300–7306. [Google Scholar]
  24. Morrison, D.; Corke, P.; Leitner, J. Multi-View Picking: Next-best-view Reaching for Improved Grasping in Clutter. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8762–8768. [Google Scholar]
  25. Eitel, A.; Hauff, N.; Burgard, W. Learning to Singulate Objects Using a Push Proposal Network. Springer Proc. Adv. Robot. 2020, 10, 405–419. [Google Scholar] [CrossRef] [Green Version]
  26. Berscheid, L.; Meißner, P.; Kröger, T. Robot Learning of Shifting Objects for Grasping in Cluttered Environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 612–618. [Google Scholar]
  27. Zeng, A.; Song, S.; Welker, S.; Lee, J.; Rodriguez, A.; Funkhouser, T. Learning Synergies Between Pushing and Grasping with Self-Supervised Deep Reinforcement Learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4238–4245. [Google Scholar]
  28. Yang, Y.; Liang, H.; Choi, C. A Deep Learning Approach to Grasping the Invisible. IEEE Robot. Autom. Lett. 2020, 5, 2232–2239. [Google Scholar]
  29. Mohammed, M.Q.; Kwek, L.C.; Chua, S.C.; Alandoli, E.A. Color Matching Based Approach for Robotic Grasping. In Proceedings of the 2021 International Congress of Advanced Technology and Engineering (ICOTEN), Taiz, Yemen, 4–5 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
  30. Xu, K.; Yu, H.; Lai, Q.; Wang, Y.; Xiong, R. Efficient learning of goal-oriented push-grasping synergy in clutter. IEEE Robot. Autom. Lett. 2021, 6, 6337–6344. [Google Scholar]
  31. Hundt, A.; Killeen, B.; Greene, N.; Wu, H.; Kwon, H.; Paxton, C.; Hager, G.D. “Good Robot!”: Efficient Reinforcement Learning for Multi-Step Visual Tasks with Sim to Real Transfer. IEEE Robot. Autom. Lett. 2020, 5, 6724–6731. [Google Scholar]
  32. Wu, B.; Akinola, I.; Gupta, A.; Xu, F.; Varley, J.; Watkins-Valls, D.; Allen, P.K. Generative Attention Learning: A “GenerAL” framework for high-performance multi-fingered grasping in clutter. Auton. Robots 2020, 44, 971–990. [Google Scholar]
  33. Wu, K.; Ranasinghe, R.; Dissanayake, G. Active recognition and pose estimation of household objects in clutter. In Proceedings of the IEEE International Conference on Robotics and Automation, Seattle, WA, USA, 26–30 May 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 4230–4237. [Google Scholar]
  34. Novkovic, T.; Pautrat, R.; Furrer, F.; Breyer, M.; Siegwart, R.; Nieto, J. Object Finding in Cluttered Scenes Using Interactive Perception. In Proceedings of the IEEE International Conference on Robotics and Automation, Eth, Autonomous Systems Lab, Zurich, Switzerland, 31 May–31 August 2020; pp. 8338–8344. [Google Scholar] [CrossRef]
  35. Jiang, D.; Wang, H.; Chen, W.; Wu, R. A novel occlusion-free active recognition algorithm for objects in clutter. In Proceedings of the 2016 IEEE International Conference on Robotics and Biomimetics, ROBIO 2016, Qingdao, China, 3–7 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1389–1394. [Google Scholar]
  36. Kopicki, M.S.; Belter, D.; Wyatt, J.L. Learning better generative models for dexterous, single-view grasping of novel objects. Int. J. Robot. Res. 2019, 38, 1246–1267. [Google Scholar] [CrossRef] [Green Version]
  37. Murali, A.; Mousavian, A.; Eppner, C.; Paxton, C.; Fox, D. 6-DOF Grasping for Target-driven Object Manipulation in Clutter. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 6232–6238. [Google Scholar]
  38. Corona, E.; Pumarola, A.; Alenyà, G.; Moreno-Noguer, F.; Rogez, G. GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 5030–5040. [Google Scholar]
  39. Kiatos, M.; Malassiotis, S.; Sarantopoulos, I. A Geometric Approach for Grasping Unknown Objects With Multifingered Hands. IEEE Trans. Robot. 2021, 37, 735–746. [Google Scholar]
  40. Zeng, A.; Yu, K.; Song, S.; Suo, D.; Walker, E.; Rodriguez, A.; Xiao, J. Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 9 May–3 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1383–1386. [Google Scholar]
  41. Chen, X.; Ye, Z.; Sun, J.; Fan, Y.; Hu, F.; Wang, C.; Lu, C. Transferable Active Grasping and Real Embodied Dataset. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3611–3618. [Google Scholar]
  42. Berscheid, L.; Rühr, T.; Kröger, T. Improving Data Efficiency of Self-supervised Learning for Robotic Grasping. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2125–2131. [Google Scholar]
  43. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-Net 2.0: Deep learning to plan Robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems; Department of EECS, University of California: Berkeley, CA, USA, 2017. [Google Scholar] [CrossRef]
  44. Mousavian, A.; Eppner, C.; Fox, D. 6-DOF GraspNet: Variational grasp generation for object manipulation. In Proceedings of the IEEE International Conference on Computer Vision, NVIDIA, Seoul, Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2901–2910. [Google Scholar] [CrossRef] [Green Version]
  45. Shao, Q.; Hu, J.; Wang, W.; Fang, Y.; Liu, W.; Qi, J.; Ma, J. Suction Grasp Region Prediction Using Self-supervised Learning for Object Picking in Dense Clutter. In Proceedings of the 2019 IEEE 5th International Conference on Mechatronics System and Robots (ICMSR), Singapore, 3–5 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7–12. [Google Scholar]
  46. Han, M.; Pan, Z.; Xue, T.; Shao, Q.; Ma, J.; Wang, W. Object-Agnostic Suction Grasp Affordance Detection in Dense Cluster Using Self-Supervised Learning. arXiv 2019, arXiv:190602995v1. [Google Scholar]
  47. Mitash, C.; Bekris, K.E.; Boularias, A. A self-supervised learning system for object detection using physics simulation and multi-view pose estimation. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 545–551. [Google Scholar]
  48. Zeng, A.; Song, S.; Yu, K.-T.; Donlon, E.; Hogan, F.R.; Bauza, M.; Ma, D.; Taylor, O.; Liu, M.; Romo, E.; et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. Int. J. Robot. Res 2019, 3750–3757. [Google Scholar] [CrossRef] [Green Version]
  49. Yen-Chen, L.; Zeng, A.; Song, S.; Isola, P.; Lin, T.-Y. Learning to See before Learning to Act: Visual Pre-training for Manipulation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 7286–7293. [Google Scholar]
  50. Iriondo, A.; Lazkano, E.; Ansuategi, A. Affordance-based grasping point detection using graph convolutional networks for industrial bin-picking applications. Sensors 2021, 21, 816. [Google Scholar] [CrossRef]
  51. Sarantopoulos, I.; Kiatos, M.; Doulgeri, Z.; Malassiotis, S. Split Deep Q-Learning for Robust Object Singulation*. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 6225–6231. [Google Scholar]
  52. Boroushaki, T.; Leng, J.; Clester, I.; Rodriguez, A.; Adib, F. Robotic Grasping of Fully-Occluded Objects using RF Perception. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 923–929. [Google Scholar]
  53. Kiatos, M.; Malassiotis, S. Robust object grasping in clutter via singulation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1596–1600. [Google Scholar]
  54. Huang, B.; Han, S.D.; Yu, J.; Boularias, A. Visual Foresight Tree for Object Retrieval from Clutter with Nonprehensile Rearrangement. IEEE Robot. Autom. Lett. 2021, 7, 231–238. [Google Scholar]
  55. Cheong, S.; Cho, B.Y.; Lee, J.; Lee, J.; Kim, D.H.; Nam, C.; Kim, C.; Park, S. Obstacle rearrangement for robotic manipulation in clutter using a deep Q-network. Intell. Serv. Robot. 2021, 14, 549–561. [Google Scholar] [CrossRef]
  56. Fujita, Y.; Uenishi, K.; Ummadisingu, A.; Nagarajan, P.; Masuda, S.; Castro, M.Y. Distributed Reinforcement Learning of Targeted Grasping with Active Vision for Mobile Manipulators. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October–24 January 2021; IEEE: Piscataway, NJ, USA, 2020; pp. 9712–9719. [Google Scholar]
  57. Kurenkov, A.; Taglic, J.; Kulkarni, R.; Dominguez-Kuhne, M.; Garg, A.; Martin-Martin, R.; Savarese, S. Visuomotor mechanical search: Learning to retrieve target objects in clutter. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 8408–8414. [Google Scholar] [CrossRef]
  58. Morrison, D.; Leitner, J.; Corke, P. Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach. arXiv 2018, arXiv:180405172v2. [Google Scholar]
  59. Yaxin, L.; Yiqian, T.; Ming, Z. An Intelligent Composite Pose Estimation Algorithm Based on 3D Multi-View Templates. In Proceedings of the 2018 3rd IEEE International Conference on Image, Vision and Computing, ICIVC 2018, Chongqing, China, 27–29 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 956–960. [Google Scholar]
  60. Chen, C.; Li, H.; Zhang, X.; Liu, X.; Tan, U. Towards Robotic Picking of Targets with Background Distractors using Deep Reinforcement Learning. In Proceedings of the 2019 WRC Symposium on Advanced Robotics and Automation (WRC SARA), Beijing, China, 21–22 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 166–171. [Google Scholar]
  61. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems; OpenAI: San Francisco, CA, USA, 2017; pp. 5049–5059. [Google Scholar]
  62. Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. arXiv 2018, arXiv:180610293v3. [Google Scholar]
  63. Lu, N.; Lu, T.; Cai, Y.; Wang, S. Active Pushing for Better Grasping in Dense Clutter with Deep Reinforcement Learning. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 Nov. 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1657–1663. [Google Scholar]
  64. Goodrich, B.; Kuefler, A.; Richards, W.D. Depth by Poking: Learning to Estimate Depth from Self-Supervised Grasping. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10466–10472. [Google Scholar]
  65. Yang, Z.; Shang, H. Robotic Pushing and Grasping Knowledge Learning via Attention Deep Q-Learning Network; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Academy for Engineering and Technology, Fudan University: Shanghai, China, 2020; Volume 12274 LNAI, pp. 223–234. [Google Scholar] [CrossRef]
  66. Ni, P.; Zhang, W.; Zhang, H.; Cao, Q. Learning efficient push and grasp policy in a totebox from simulation. Adv. Robot. 2020, 34, 873–887. [Google Scholar]
  67. Yang, Y.; Ni, Z.; Gao, M.; Zhang, J.; Tao, D. Collaborative Pushing and Grasping of Tightly Stacked Objects via Deep Reinforcement Learning. IEEE CAA J. Autom. Sin. 2021, 9, 135–145. [Google Scholar]
  68. Danielczuk, M.; Angelova, A.; Vanhoucke, V.; Goldberg, K. X-Ray: Mechanical search for an occluded object by minimizing support of learned occupancy distributions. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 24 October–24 January 2021; The Autolab at University of California: Berkeley, CA, USA, 2020; pp. 9577–9584. [Google Scholar] [CrossRef]
  69. Wu, Y.; Li, J.; Bai, J. Multiple Classifiers-Based Feature Fusion for RGB-D Object Recognition. Int. J. Pattern Recognit. Artif. Intell. 2017, 31, 1750014. [Google Scholar]
  70. Sajjad, M.; Ullah, A.; Ahmad, J.; Abbas, N.; Rho, S.; Baik, S.W. Integrating salient colors with rotational invariant texture features for image representation in retrieval systems. Multimed. Tools Appl. 2018, 77, 4769–4789. [Google Scholar]
  71. Singh, A. Review Article: Digital change detection techniques using remotely-sensed data. Int. J. Remote Sens. 1989, 10, 989–1003. [Google Scholar]
  72. Qin, R.; Tian, J.; Reinartz, P. 3D change detection—Approaches and applications. ISPRS J. Photogramm. Remote Sens. 2016, 122, 41–56. [Google Scholar]
  73. Lu, D.; Mausel, P.; Brondízio, E.; Moran, E. Change detection techniques. Int. J. Remote Sens. 2004, 25, 2365–2401. [Google Scholar]
  74. Reba, M.; Seto, K.C. A systematic review and assessment of algorithms to detect, characterize, and monitor urban land change. Remote Sens. Environ. 2020, 242, 111739. [Google Scholar]
  75. Iii, A.L. Change detection using image differencing: A study over area surrounding Kumta, India. In Proceedings of the 2017 Second International Conference on Electrical, Computer and Communication Technologies (ICECCT), Coimbatore, India, 22–24 February 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5. [Google Scholar]
  76. Qin, R.; Huang, X.; Gruen, A.; Schmitt, G. Object-Based 3-D Building Change Detection on Multitemporal Stereo Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2125–2137. [Google Scholar]
  77. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; pp. 1–360. [Google Scholar]
  78. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  79. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2261–2269. [Google Scholar]
  80. Fei-Fei, L.; Deng, J.; Li, K. ImageNet: Constructing a large-scale image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
Figure 1. Challenge one of the first problem: learning to push the entire pile of objects in a well-ordered scenario.
Figure 2. Challenge two of the first problem: some objects tend to be pushed out of the workspace in a randomly cluttered scenario.
Figure 3. The challenge of occlusion scenarios.
Figure 4. Push action’s efficiency in detaching objects from their arrangement: (a) shifting a specific object (proposed MV-COBA); (b) pushing the entire pile (VPG policy [27]).
Figure 5. The MV-COBA for executing primitive push and grasp actions.
Figure 6. The MV-COBA-2FCNs approach for executing primitive push and grasp actions.
Figure 7. A series of randomly cluttered object challenge scenarios (test cases 1–6) and well-ordered object challenge scenarios (test cases 7–12).
Figure 8. Challenge scenarios with occluded objects. Case-1: cam1 is completely occluded, whereas cam2 is not. Case-2 and Case-3: both cameras are partially occluded, with some objects detected by cam1 but not visible to cam2, and vice versa.
Figure 9. The performance of MV-COBA compared to that of other baseline models in terms of grasp success rate during the training session.
Figure 10. The performance of MV-COBA compared to that of other baseline models in terms of action efficiency during the training session.
Table 1. Evaluating the performance of baseline models during the training phase.

| Method | Grasp Success (%) | Action Efficiency (%) |
|---|---|---|
| MV-COBA | 87.2 | 82.8 |
| MV-COBA-2FCNs | 83.8 | 80.2 |
| MVP | 77.2 | 77.2 |
| VPG | 76.7 | 65.2 |
| CPG | 71.3 | 59.4 |
| DRGP | 74.1 | 74.1 |
Table 2. Evaluation of randomly cluttered object challenge scenarios.

| Evaluation Mean (%) | Method | Case-1 | Case-2 | Case-3 | Case-4 | Case-5 | Case-6 | Average |
|---|---|---|---|---|---|---|---|---|
| Grasp Success Rate | MV-COBA | 86.0 | 82.2 | 83.4 | 81.6 | 89.8 | 78.4 | 83.6 |
| | MV-COBA-2FCNs | 84.3 | 70.8 | 69.4 | 66.7 | 67.8 | 65.6 | 70.8 |
| | MVP | 61.1 | 61.5 | 66.7 | 57.2 | 45.0 | 55.2 | 57.8 |
| | VPG | 80.1 | 71.6 | 68.3 | 50.0 | 58.6 | 63.8 | 65.4 |
| | CPG | 65.6 | 63.0 | 57.1 | 57.4 | 82.4 | 61.4 | 64.5 |
| | DRGP | 59.4 | 59.4 | 53.2 | 66.7 | 57.6 | 43.2 | 56.6 |
| Action Efficiency | MV-COBA | 79.8 | 78.1 | 76.4 | 74.2 | 70.7 | 70.8 | 75.0 |
| | MV-COBA-2FCNs | 68.0 | 64.1 | 56.0 | 59.3 | 62.9 | 52.3 | 60.4 |
| | MVP | 61.1 | 61.5 | 66.7 | 57.2 | 45.0 | 55.2 | 57.8 |
| | VPG | 61.9 | 58.0 | 57.7 | 40.0 | 41.0 | 47.6 | 51.1 |
| | CPG | 48.9 | 54.7 | 43.8 | 44.3 | 58.3 | 42.0 | 48.7 |
| | DRGP | 59.4 | 59.4 | 53.2 | 66.7 | 57.6 | 43.2 | 56.6 |
| Completion Rate | MV-COBA | 100 | 100 | 81.2 | 100 | 100 | 82.5 | 94.0 |
| | MV-COBA-2FCNs | 100 | 100 | 60.0 | 70.1 | 100 | 50.0 | 80.1 |
| | MVP | 100 | 66.7 | 33.3 | 33.3 | 36.2 | 33.3 | 50.5 |
| | VPG | 100 | 50.0 | 100 | 50.0 | 75.1 | 50.0 | 70.9 |
| | CPG | 100 | 100 | 50.0 | 50.0 | 50.0 | 51.7 | 67.0 |
| | DRGP | 66.7 | 67.7 | 33.3 | 33.3 | 33.3 | 33.3 | 44.6 |
Table 3. Evaluation of well-ordered object challenge scenarios.

| Evaluation Mean (%) | Method | Case-7 | Case-8 | Case-9 | Case-10 | Case-11 | Case-12 | Average |
|---|---|---|---|---|---|---|---|---|
| Grasp Success Rate | MV-COBA | 85.7 | 87.1 | 89.5 | 84.1 | 85.6 | 85.6 | 86.3 |
| | MV-COBA-2FCNs | 82.1 | 85.5 | 77.4 | 76.8 | 63.3 | 73.3 | 76.4 |
| | MVP | 40.6 | 44.7 | 48.9 | 31.1 | 25.8 | 22.7 | 35.6 |
| | VPG | 65.4 | 69.8 | 72.9 | 74.8 | 54.8 | 53.3 | 65.2 |
| | CPG | 65.4 | 69.8 | 72.9 | 74.8 | 54.8 | 43.3 | 63.5 |
| | DRGP | 36.1 | 56.4 | 48.5 | 31.7 | 44.6 | 36.1 | 42.3 |
| Action Efficiency | MV-COBA | 75.0 | 77.4 | 83.0 | 81.1 | 82.7 | 82.7 | 80.4 |
| | MV-COBA-2FCNs | 70.0 | 80.1 | 74.0 | 68.9 | 60.6 | 58.3 | 68.65 |
| | MVP | 40.6 | 44.7 | 48.9 | 31.1 | 25.8 | 22.7 | 35.6 |
| | VPG | 45.9 | 51.1 | 58.3 | 56.7 | 42.1 | 42.7 | 49.5 |
| | CPG | 58.3 | 52.8 | 59.8 | 50.0 | 53.1 | 44.0 | 53.0 |
| | DRGP | 36.1 | 56.4 | 48.5 | 31.7 | 44.6 | 36.1 | 42.3 |
| Completion Rate | MV-COBA | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | MV-COBA-2FCNs | 100 | 100 | 100 | 82.7 | 66.7 | 100 | 91.6 |
| | MVP | 33.3 | 66.7 | 33.3 | 30.0 | 0.0 | 0.0 | 27.2 |
| | VPG | 82.7 | 71.4 | 66.7 | 100 | 76.7 | 75.0 | 78.8 |
| | CPG | 81.5 | 100 | 66.7 | 50.0 | 100 | 100 | 83.1 |
| | DRGP | 30.0 | 30.0 | 0.0 | 0.0 | 33.3 | 30.0 | 20.6 |
Table 4. Evaluation of occluded object challenge scenarios.

| Evaluation Mean (%) | Method | Case-13 | Case-14 | Case-15 | Average |
|---|---|---|---|---|---|
| Grasp Success Rate | MV-COBA | 100 | 93.8 | 100 | 97.8 |
| | MV-COBA-2FCNs | 0.0 | 45.2 | 33.3 | 26.2 |
| | MVP | 0.0 | 45.2 | 43.3 | 44.3 |
| | VPG | 0.0 | 51.7 | 33.3 | 28.3 |
| | CPG | 0.0 | 55.7 | 33.3 | 29.7 |
| | DRGP | 0.0 | 45.7 | 33.3 | 26.3 |
| Completion Rate | MV-COBA | 100 | 100 | 100 | 100.0 |
| | MV-COBA-2FCNs | 0.0 | 0.0 | 0.0 | 0.0 |
| | MVP | 0.0 | 0.0 | 0.0 | 0.0 |
| | VPG | 0.0 | 0.0 | 0.0 | 0.0 |
| | CPG | 0.0 | 0.0 | 0.0 | 0.0 |
| | DRGP | 0.0 | 0.0 | 0.0 | 0.0 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
