An End-to-End Grasping Stability Prediction Network for Multiple Sensors

Featured Application: This work can be applied to the grasping operation of a manipulator, where it helps to predict grasping stability.

Abstract: The output of the tactile sensing array on a gripper can be used to predict grasping stability. Some methods utilize traditional tactile features to make the decision, while more advanced methods build a prediction model with machine learning or deep learning. However, these methods are all limited to a specific sensing array and share two common disadvantages. On the one hand, the models cannot perform well on different sensors. On the other hand, they cannot perform inference on multiple sensors in an end-to-end manner. Thus, we aim to find the internal relationships among different sensors and infer the grasping stability of multiple sensors in an end-to-end way. In this paper, we propose the MM-CNN (mask multi-head convolutional neural network), which can predict grasping stability from the outputs of multiple sensors with a weight sharing mechanism. We train this model and evaluate it on our own collected datasets. It achieves 99.49% and 94.25% prediction accuracy on two different sensing arrays, respectively. In addition, we show that our proposed structure is also applicable to other CNN backbones and can be easily integrated.


Introduction
With the recent developments in computer vision and range sensing, robots can detect objects reliably. However, grasping remains a challenge even when the location and pose of the object are correctly known. The main reason is that predicting the grasping stability of the object in autonomous robotic manipulation tasks is still an important and difficult research topic. Grasping stability is defined as the capacity of a grasp to resist external forces and disturbances. A stable grasp can be viewed as a static equilibrium state of the object that is maintained when the grippers are closed. In the case of stable grasping, the grasped object keeps its static equilibrium state while the manipulator moves. If the grippers do not grasp the object stably during the operation of the manipulator, the system can reach a state in the desired action sequence from which it cannot recover easily.
In comparison, humans can rely on their remarkable tactile sensing capabilities to easily perceive grasping stability [1]. We can quickly identify textures and fine features using our fingertips, and while holding objects we subconsciously control grip forces to prevent slippage. When it comes to grasping with tools, we act as if they are extensions of our limbs, which requires not only dexterous

Dataset Building
The manipulator is usually equipped with grippers to accomplish grasping operations, and the tactile array is assembled on the grippers. To grasp an object, the grippers open to their maximum and then arrive at a suitable location and pose for grasping. After that, the grippers close slowly and stop once the object is held. Thus, determining whether an object is held plays an important role in the grasping task. The human body judges this by perceiving tactile physiological signals rather than by vision. Similarly, in manipulator grasping, the force distribution on the sensing arrays shows more detail and performs better than vision. To research and model the relationship between the output of a tactile sensing array and grasping stability, we build datasets by collecting the outputs and corresponding grasping stability labels of two different tactile sensing arrays.

Kitronyx Tactile Sensing Array
The Ms9723 tactile sensing array is a force distribution measurement product of the Kitronyx company, which supports multi-touch technologies. It consists of 160 units distributed in 16 rows and 10 columns. The output of each unit is a natural number normalized to a maximum of 255. We find that each unit of this array is insensitive to surface forces and has a relatively high lower measurement boundary. As a result, the output array is often largely filled with zeros and the numerical values are relatively small when grasping objects.

Self-Made Tactile Sensing Array
Our tactile sensing array also supports multi-touch and can measure the three-dimensional force applied to it. The sensing material and the electrodes form a sandwich structure, and the electrodes are connected by electrical routings from different rows and columns. Overall, the sensor has 64 units and the area of each unit is 7 × 7 mm to guarantee adequate spatial resolution. It works with multiplexers, a microcontroller unit (MCU), an analog-to-digital converter (ADC), reference resistors, and operational amplifiers to make up the measurement circuit. The measurement principle of the scanning circuit is that the row multiplexer is controlled by the MCU to select one of the rows; the output of each column's operational amplifier is then connected through the column multiplexer to the analog-to-digital converter, and the output voltage is converted to digital signals. As the spatial distribution of this array is more complex than that of the Kitronyx array, it will be discussed in detail later. Compared with the Kitronyx tactile sensing array, our array can perceive smaller forces: its output starts to change when the force is about 0.05 N, while the output of the Kitronyx array starts to change when the force reaches around 1 N. In addition, our array consists of fewer units. The output of each sensing unit is also a natural number, with no normalization applied. The maximum output can reach a numerical value of 2000, and there are few zeros in the output because of the array's lower measurement threshold.

Data Collection and Statistics
Although these two arrays differ in many aspects as mentioned above, they still share some common characteristics. The outputs of both tactile sensing arrays are positively correlated with the magnitude of the force applied to them. This common characteristic makes it possible to provide a unified grasping stability prediction model for these two different sensing arrays. Another common characteristic is that when no force is applied to the array, all units in both arrays output zero. The output of a tactile sensor also has three dimensions representing the height, the width, and the channel, although the channel dimension has a length of one instead of three. Therefore, the output of a whole array is not just a set of individual numerical values but can be regarded as a single-channel image. In addition, objects of various shapes are grasped during data collection, as we aim to propose a robust prediction model; we show our grasping targets in Figure 1. We close the grippers from the maximum opening step by step. After this process, if the grasped object keeps a static equilibrium state during grasping, the label of the corresponding output is assigned 1, which refers to stable grasping. Otherwise, the label is assigned 0, which refers to unstable grasping. Additionally, we try different grasping poses to enlarge the dataset. In total, we collect 1562 samples for the Kitronyx tactile sensing array and 3478 samples for the self-made array. We guarantee that these samples are roughly balanced among different objects and that no different manipulator movement parameters are used during collection. The samples collected from the Kitronyx tactile sensing array form the GKD (grasping Kitronyx dataset), while the 3478 samples from the self-made array form the GSD (grasping self-made dataset). We randomly select around a quarter of the samples from GKD and GSD separately to make up the test sets.
The remaining samples of each dataset form the train set. We make sure that both the train part and the test part of each dataset remain balanced among objects after the split. We then merge GKD and GSD into a new dataset called GMD (grasping merged dataset). The approach to fusion is quite simple: we merge the train sets of GKD and GSD into a new train set and, similarly, merge their test sets. In this way, GMD has 3779 samples in the train set and 1261 samples in the test set. Each sample in the GMD has two labels: one refers to the annotated grasping stability and the other to the source dataset (GKD or GSD) of the sample. More details on the datasets can be seen in Table 1.
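The merge described above can be sketched as follows. This is a minimal illustration, not the authors' code; the tuple layout and the function name `merge_datasets` are assumptions, but the idea of attaching a source label (GKD or GSD) to each sample is taken directly from the text.

```python
import numpy as np

def merge_datasets(gkd, gsd):
    """Merge two per-sensor datasets into GMD, keeping both the
    stability label and a source label (0 = GKD, 1 = GSD).
    Each input is a list of (tactile_array, stability_label) pairs."""
    merged = []
    for arr, stable in gkd:
        merged.append((arr, stable, 0))   # source label 0 -> Kitronyx
    for arr, stable in gsd:
        merged.append((arr, stable, 1))   # source label 1 -> self-made
    return merged

# toy example: two samples per dataset, with each sensor's native shape
gkd = [(np.zeros((16, 10)), 1), (np.zeros((16, 10)), 0)]
gsd = [(np.zeros((12, 12)), 1), (np.zeros((12, 12)), 0)]
gmd = merge_datasets(gkd, gsd)
print(len(gmd))  # 4
```

The same merge is applied separately to the train sets and the test sets, so samples never move between the two splits.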

MM-CNN
We formulate grasping stability prediction as a two-way classification problem. Our initial aim is to find the relationship between different tactile sensors and propose a convenient model that predicts grasping stability for multiple tactile sensing arrays in an end-to-end way. There are two main difficulties to be solved. On the one hand, although the outputs of different arrays are all positively correlated with the magnitude of the force, they vary a lot from each other under the same external force. On the other hand, the model needs to handle inputs of different sizes. As the input sizes are relatively small, we first present a shallow network in Figure 2 as the baseline of our models, and the input data are normalized before being sent to the network. Normalization addresses the different numerical output ranges of different tactile sensors. Next, we extend the framework to a novel end-to-end system with weight sharing for grasping stability prediction.
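The normalization step can be sketched with a per-sensor maximum. This is a hedged illustration, not the paper's exact preprocessing: the value 255 comes from the Kitronyx array's normalized range, and 2000 is the maximum output the paper reports observing for the self-made array.

```python
import numpy as np

def normalize(output, sensor_max):
    """Scale a raw tactile array into [0, 1] using a per-sensor maximum.
    sensor_max is 255 for the Kitronyx array; for the self-made array we
    assume its reported maximum of about 2000."""
    return np.asarray(output, dtype=np.float32) / float(sensor_max)

kitronyx_sample = np.array([[0, 128, 255]])
print(normalize(kitronyx_sample, 255).max())  # 1.0
```

After this step, both sensors feed values in the same [0, 1] range to the shared backbone, which is what makes a single set of convolutional weights plausible for both arrays.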

Spatial Information
In the case of the Kitronyx tactile sensing array, the distribution of the 160 units is shown in Figure 3a. The number on each unit in the figure indicates the order of the array outputs. Thus, it is easy to reshape the 160 output values into a 16 × 10 matrix to recover the spatial information. As for our tactile sensing array, the distribution is more complex; it is illustrated in Figure 3b. A block with solid lines in the figure indicates a sensing unit, while a block with dotted lines indicates a blank space. We put the 64 output values back to their spatial positions and pad the other positions with zeros, which yields a matrix of size 12 × 12. As a comparison, we also directly reshape the 64 values into an 8 × 8 matrix, which loses the spatial information. Experiments show that recovering the spatial information of the sensing array helps in grasping stability prediction.
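The zero-padded recovery for the self-made array can be sketched as below. The `POSITIONS` list here is a placeholder assumption: the real (row, column) layout is the one shown in Figure 3b, which we do not reproduce; only the mechanism (scatter 64 readings into a 12 × 12 grid and zero-pad the rest) follows the text.

```python
import numpy as np

# Hypothetical (row, col) position for each of the 64 units inside the
# 12 x 12 grid; the real layout is the one illustrated in Figure 3b.
POSITIONS = [(r, c) for r in range(12) for c in range(12)][:64]  # placeholder

def recover_spatial(readings, positions=POSITIONS, size=(12, 12)):
    """Place the 64 raw readings at their spatial positions and
    zero-pad every empty cell, yielding a 12 x 12 single-channel map."""
    grid = np.zeros(size, dtype=np.float32)
    for value, (r, c) in zip(readings, positions):
        grid[r, c] = value
    return grid

grid = recover_spatial(np.arange(64))
print(grid.shape)  # (12, 12)
```

The alternative 8 × 8 input is simply `np.asarray(readings).reshape(8, 8)`, which discards the layout information this function preserves.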

Fully-Convolutional Backbone with Global Pooling
A fully-convolutional backbone with global pooling is designed to process inputs of different sizes. As shown in Figure 2, the baseline model is composed of the backbone and a fully-connected classifier. The backbone includes two convolutional layers and two max-pooling layers. The convolutional layers use 3 × 3 kernels with a stride of one, so they do not reduce the spatial resolution. The pooling layers use 2 × 2 kernels with a stride of two, which down-samples the feature map. These two convolutional layers and two max-pooling layers are collectively referred to as the network backbone; they extract features from the input. Two fully-connected layers follow and act as the classifier. This structure is defined as the baseline model, but it cannot deal with inputs of different sizes, as the input of a fully-connected layer must be fixed in size. Therefore, we propose the fully-convolutional backbone with global pooling to extract a fixed-size feature. A global pooling layer is placed after the second max-pooling layer, and it generates a feature embedding of length 128 for inputs of different sizes. We define the feature after the global pooling layer as the tactile representation, which performs well when transferred to the prediction models of different sensors. This is a desirable characteristic, as the model weights no longer need to be trained from scratch, which makes training much easier than in former work.
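The key property of the global pooling layer, a fixed-length embedding regardless of input resolution, can be demonstrated in a few lines of numpy. This sketch assumes global average pooling over channel-last feature maps; the paper does not state whether average or max pooling is used.

```python
import numpy as np

def global_average_pool(feature_map):
    """Average each channel over its spatial dimensions: (H, W, C) -> (C,).
    The embedding length depends only on the number of channels,
    not on the input resolution."""
    return feature_map.mean(axis=(0, 1))

# feature maps of two different spatial sizes, both with 128 channels,
# standing in for the outputs of the backbone on the two sensors
small = np.random.rand(4, 3, 128)
large = np.random.rand(3, 3, 128)
print(global_average_pool(small).shape, global_average_pool(large).shape)
# both (128,)
```

This is exactly why the fully-connected heads downstream can stay fixed: whatever the sensor's array size, they always receive a 128-dimensional tactile representation.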

Multiple Fully-Connected Heads
We propose a structure of multiple fully-connected heads to model grasping stability prediction for each sensor separately, while the heads share the weights of the tactile representation. Generally speaking, different sensors rely on different working principles, so their outputs vary a lot even though they are all positively correlated with the magnitude of the force. It is difficult to model them in a single classifier, so we keep the parameters of the fully-connected heads specific to each sensor, which is effective in maintaining classification accuracy. In this way, the information of the grasping force threshold is encoded in the fully-connected layers, and each sensor has its own fully-connected head. However, this design has the drawback that the number of parameters in the model grows linearly with the number of sensors if weight sharing is not utilized. To reduce the number of learnable parameters, the classification heads share the backbone weights with each other, which experiments show to be feasible and effective. This weight sharing strategy greatly reduces the number of learnable parameters while maintaining the classification accuracies on the datasets of different sensors. Each head outputs a vector of length two, and we concatenate these vectors into a vector of length 2N as the final output of the grasping stability prediction branch, where N is the number of different sensors. This leaves the problem of determining which fully-connected head to use during inference, so we add an additional mask branch.

Mask Branch
We design a mask branch to decide which fully-connected head is used for grasping stability prediction during inference. As shown in Figure 4, the structure of the mask branch is similar to a single-head grasping stability prediction branch. The output of its classifier is a vector of length N, which represents the probability distribution over the N sensors. Then, a binary mask of length 2N is generated from this output according to Equation (1): M_j = 1 if ceil(j/2) = argmax_i O_i, and M_j = 0 otherwise, for j = 1, ..., 2N.
In the equation, the output of the classifier is denoted O and M is the binary mask. For example, consider the case of N = 2. If O_1 ≥ O_2, then M is equal to [1, 1, 0, 0]; M is equal to [0, 0, 1, 1] in the case that O_1 < O_2. We multiply the grasping stability prediction output with the mask output to select the fully-connected head used during inference, and argmax is applied to the product to predict grasping stability. Finally, our model runs in parallel to existing grasping stability prediction systems and thus can be easily integrated by replacing the backbone convolutional neural network.
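The mask-and-multiply selection can be written out in a few lines. This is a sketch of the mechanism described above, not the authors' code; the function name `predict` and the use of raw logits rather than softmax probabilities are assumptions.

```python
import numpy as np

def predict(head_outputs, mask_logits):
    """Combine the grasping stability branch with the mask branch.
    head_outputs: concatenated two-way outputs of the N heads, length 2N.
    mask_logits: the N-way output O of the mask classifier.
    Returns 1 for stable grasping, 0 for unstable."""
    n = len(mask_logits)
    mask = np.zeros(2 * n)
    k = int(np.argmax(mask_logits))   # which sensor the sample comes from
    mask[2 * k:2 * k + 2] = 1.0       # enable only that sensor's head
    masked = np.asarray(head_outputs) * mask
    return int(np.argmax(masked) % 2) # class index within the selected head

# N = 2: the mask branch favors sensor 1 (O_1 >= O_2), so M = [1, 1, 0, 0]
print(predict([0.2, 0.8, 0.9, 0.1], [0.7, 0.3]))  # 1 (stable)
```

With `mask_logits = [0.3, 0.7]` the same head outputs give 0, because the mask switches to the second head, whose two-way output favors the unstable class.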

Implementation Details
As there are two branches in the structure, the training strategy is important for the whole network. During training, we train the grasping stability prediction branch and the mask branch separately. For the grasping stability prediction branch, we first pretrain the tactile representation and one fully-connected classifier with its corresponding dataset (GKD or GSD). In the case of N = 2, we then utilize the pretrained tactile representation and finetune the other fully-connected classifier with the other dataset. As for the mask prediction branch, we train it with the data in GMD. As each batch fed to the network must be uniform in size, batches alternate between samples from GKD and samples from GSD. Fortunately, we find that the mask prediction branch converges well under this strategy. During inference, the output of the grasping stability prediction branch and the output of the mask prediction branch are multiplied together to determine the final prediction in an end-to-end manner.
There are some further details of training and inference. ReLU (rectified linear unit) is used as the activation function to increase the feature expression ability of the model. We select cross-entropy with sigmoid as the classification loss function to optimize the multiple fully-connected heads and the mask prediction branch. Each head in the grasping stability prediction branch is a two-way classifier, while mask prediction is treated as an N-way classification problem. We use a constant learning rate of 0.0005 and a batch size of 32. We select Adam as the optimizer, and the model is implemented in TensorFlow. In addition, there is a dropout of 0.8 after the first fully-connected layer in each branch.

Experiments and Results
We conduct an ablation analysis to demonstrate the advantages of each proposed module in MM-CNN. To make the comparison fair, we use the average and standard deviation of 10 checkpoints after convergence to measure model performance. For our self-made sensing array, there are two optional kinds of input: the 8 × 8 matrix, which loses the spatial information, and the 12 × 12 matrix, which recovers the spatial distribution of the sensing units. We compare the two inputs on the baseline structure and on the structure with the global pooling layer. As shown in Table 2, the input with spatial information performs better. In addition, although the structure with the global pooling layer does not perform quite as well as the baseline model on the GSD, we adopt it because it can process inputs of different sizes; more details on the influence of the global pooling layer are discussed in the following paragraphs. In general, we use the input with spatial information for the self-made sensing array in the following experiments. We train models on the train sets to compare the influence of the global pooling layer on the different test sets. In some experiments, we resize the input to a unified size of 16 × 12 by padding zeros; in this way, the baseline model, which does not include the global pooling layer, can also be trained on the combination of GKD and GSD. On the one hand, this operation has little influence on the experimental results. On the other hand, we also pad inputs to the fixed size of 16 × 12 in cases where padding is not strictly necessary, to make the comparison fair. The results are shown in Table 3.
In addition, we design the mask branch to predict which head to use during inference. The mask prediction branch is trained with the source labels of GMD. As drawn in Figure 5, we analyze the embedding after global pooling in the mask branch. Embeddings of samples from GSD and embeddings of samples from GKD differ a lot, and it is feasible to distinguish them: the mask predictions are all correct on the GKD, and 99.89% of samples are classified correctly on the GSD. We combine this branch with the grasping stability prediction branch to obtain an overall prediction. Although the mask branch classifies one sample of the GSD incorrectly, we find that the two heads in the grasping stability prediction branch make similar predictions on this sample. Therefore, as shown in Table 5, the overall accuracy does not decrease. In general, the whole network keeps the best checkpoint accuracy of (99.49%, 94.25%) on the test sets of GKD and GSD. This result is comparable on both datasets to the separate models (with the result of (99.49%, 94.02%)) in Table 3, while this structure enables end-to-end inference.

Conclusions
We use a convolutional neural network as the baseline to predict grasping stability from the output of a tactile sensing array. To propose a unified tactile representation and an end-to-end prediction system, we extend the baseline framework with a global pooling layer, multiple fully-connected heads, and a mask prediction branch. The global pooling layer handles variable input sizes, and the multiple fully-connected heads handle the different characteristics of the sensors. The feature after the global pooling layer performs well when transferred to the grasping stability prediction of other sensors, and we define it as the tactile representation. In addition, these two optimizations also improve classification accuracy. The mask prediction branch determines which fully-connected head is selected during inference and makes the whole system work in an end-to-end manner. Finally, our proposed MM-CNN achieves a high accuracy of (99.49%, 94.25%) on GKD and GSD.