In this section, we analyse the performance of the learning models discussed in the previous section. They were used to predict stability and direction of slip as learning tasks in robotic manipulation (Figure 10). Firstly, CNN and GCN models are used for binary stability detection, i.e., stable grip and unstable grip. Secondly, LSTM and ConvLSTM models are built to classify the type of slippage in the following cases: lack of stability, translational (up, down, left, right), or rotational (clockwise, anti-clockwise).
4.1. Dataset and Training Methodology
We generated two datasets in order to carry out the experiments: one for the stability task (see Figure 11 (left)) and another for the classification of slippage (see Figure 11 (right)). The first dataset is composed of 51 objects with three different geometries: cylinders, spheres, and boxes. Furthermore, we combined objects made of different materials, such as wood, plastic, and metal, as well as different degrees of stiffness. We recorded more than 5500 grasps, which were split into training and test sets. Part of the data was recorded with the palm oriented at 45° with respect to the horizontal plane, and the rest was divided equally between the side orientation (90° with respect to the horizontal) and the down orientation (totally parallel to the horizontal). These orientations were taken into account when recording the datasets because of the construction of the BioTac: it contains a liquid that is affected by gravity, so different orientations yield different tactile readings. In addition, the numbers of samples representing stable grasps and slippery grasps are similar, so both subsets are balanced. More information, as well as the data, is available at [48]. Training and test sets were recorded following these steps:
Grasp the object: the hand performed a three-fingered grasp that contacted the object, which was lying on a table.
Read the sensors: a single reading was then recorded from each of the sensors simultaneously.
Lift the object: the hand was raised in order to lift the object and check the outcome.
Label the trial: the recorded tactile readings were labelled according to the outcome of the lift with two classes: stable, i.e., the object remained completely static, or slip, i.e., it either fell from the hand or moved within it.
Regarding the second task, we created a dataset with 11 objects different from those included in the previous dataset. We selected the objects by taking into account the stiffness of the material they were made of, their texture (i.e., rough or smooth surface), and the size of the contact surface, since the fingertip is only in partial contact with the surface of small objects. The training was performed on four basic objects that cover various textures, sizes, and degrees of stiffness. In the testing step, we used seven novel objects grouped into three categories: two solid and smooth objects, two small objects with little contact surface, and three objects with rough textures never seen before. In total, we recorded more than 300 sequences of touches. The sequences from the set of four basic objects were used for training and the rest for testing: a first experiment with the two solid objects, a second experiment with the two small objects, and a last experiment with the set of textured objects. The numbers of samples representing the seven directions of slip are similar, so the set is balanced regarding the considered classes. More details, as well as the data, are available at [49]. In order to generate this dataset, we moved each of the objects over a BioTac sensor, producing a movement in each of the directions considered. Each movement lasted three seconds and was carried out at different velocities and with distinct forces. Moreover, each recording used a different part of the object. Nevertheless, a single type of movement from the seven considered classes was performed in each sample. For the stable class, the object was pushed against the sensor without motion.
Finally, training samples for both tasks were scaled to a common range in order to ease the convergence of the neural networks. In addition, we used batch normalisation to improve the stability of the networks. Given that we are using deep learning models with datasets that are not large, the reported test results were obtained by carrying out a testing stage similar to a 5-fold cross-validation in order to avoid overfitting to the training sets. That is, we first found the best configurations using a typical 5-fold cross-validation on the training set. Then, we trained those configurations again after shuffling the training samples and launched predictions on the test samples. This last step of training and testing was repeated five times in order to avoid reporting results achieved on a single pass through the training and testing phases. As a final remark, the sequences of tactile readings used for the detection of the direction of slip hold five consecutive readings in time. The publishing rate of the sensor is 100 Hz; therefore, samples for this task consist of five tactile readings recorded in 50 ms.
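To make this protocol concrete, the following Python sketch outlines the two evaluation stages described above. The build_model factory and the scikit-learn-style fit/score interface are assumptions for illustration, not the exact code used in our experiments:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import shuffle

def cross_validate(build_model, X_train, y_train, n_splits=5):
    # Stage 1: standard 5-fold cross-validation on the training set,
    # used only to select the best model configuration.
    scores = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True).split(X_train):
        model = build_model()
        model.fit(X_train[train_idx], y_train[train_idx])
        scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    return np.mean(scores)

def repeated_test(build_model, X_train, y_train, X_test, y_test, repeats=5):
    # Stage 2: retrain the selected configuration on shuffled training
    # data and evaluate on the held-out test set, repeated five times.
    scores = []
    for _ in range(repeats):
        X_s, y_s = shuffle(X_train, y_train)
        model = build_model()
        model.fit(X_s, y_s)
        scores.append(model.score(X_test, y_test))
    return np.mean(scores), np.std(scores)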
4.2. Tuning of Tactile Images and Tactile Graphs
First, we explored the effects of the three tactile distributions shown in Figure 3. To do so, we trained a basic CNN with a single layer of 32 convolutional 3 × 3 filters followed by a ReLU activation and then a fully-connected layer with 128 ReLUs. The training data were the samples recorded with the side and down orientations from the stability dataset. These samples were selected because we wanted to perform an exploration of the tactile distributions, so a smaller set was preferred. In total, 2549 samples were used to carry out a 5-fold cross-validation. Results are shown in Table 1.
As can be seen, the three distributions achieve similar results on each of the four considered metrics. However, distribution D1 consistently yields higher values on all of them. This shows that distributing the sensing points of the BioTac sensor so that they end up with neighbourhoods similar to those in the actual sensor helps to obtain greater performance. Moreover, D2 and D3 are close to the size of the kernel and there are fewer pixels to work with, forcing the learnt patterns to be less informative as well. Consequently, the following reported results regarding tactile images were achieved using D1.
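For reference, the exploratory CNN described above is simple enough to sketch in a few lines of Keras; the tactile image size below is an assumption, since it depends on the mapping used for each distribution:

import tensorflow as tf

IMG_H, IMG_W = 12, 11  # hypothetical image size; depends on the distribution

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                           input_shape=(IMG_H, IMG_W, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),  # stable vs. slip
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])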
Secondly, we also explored the effects of the connectivity in our tactile graphs. Figure 5 showed a manually generated graph, but we also mentioned the possibility of generating the connections with a k-NN strategy. We tested this option in this experiment: Figure 12 shows the results obtained on the stability dataset by a GCN with five graph layers (the last one with 32 units) and two fully-connected layers with 128 and 2 units. In the figure, manual refers to our hand-made connections and k-NN refers to graphs in which each node is connected to its k nearest neighbours. These results suggest that increasing the number of connections in the graph decreases the performance of the network. A bigger neighbourhood means that the convolution takes more nodes of the graph into account. As a result, as the number of nodes in each convolution increases, the local tactile patterns in each area of the sensor are lost, because the convolution ends up using almost the whole sensor. This could be seen as filtering out local information in favour of checking the general behaviour of the sensor. In contrast, in our manually generated graph, some nodes are connected to just one other node, like those at the borders of the sensor, while others are connected to several nodes in their neighbourhood, like the electrode at the centre. In consequence, there are different degrees of connectivity and thus various levels of importance given to local patterns. Therefore, the results reported in the following sections were achieved using the manual distribution.
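For illustration, the two connectivity strategies can be sketched as follows; the electrode coordinates are placeholders for the actual BioTac SP layout, and the manual edge list shown is only an illustrative fragment:

import numpy as np
from sklearn.neighbors import kneighbors_graph

N_ELECTRODES = 24                            # BioTac SP electrode count
positions = np.random.rand(N_ELECTRODES, 3)  # placeholder 3D electrode positions

# k-NN connectivity: every node is linked to its k nearest neighbours,
# so all nodes end up with (at least) the same degree.
A_knn = kneighbors_graph(positions, n_neighbors=3, mode='connectivity').toarray()
A_knn = np.maximum(A_knn, A_knn.T)           # symmetrise for an undirected graph

# Manual connectivity: hand-picked edges with varying degrees, e.g. border
# electrodes linked to a single neighbour, central ones to several.
manual_edges = [(0, 1), (1, 2), (2, 12)]     # illustrative fragment, not the real list
A_manual = np.zeros((N_ELECTRODES, N_ELECTRODES))
for i, j in manual_edges:
    A_manual[i, j] = A_manual[j, i] = 1.0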
4.3. CNN vs. GCN: Image vs. Graph
We compared CNN and GCN models. Both models were used for binary stability detection, stable grip and unstable grip, from single-touch data. Single-touch data are tactile data obtained while grasping an object but prior to lifting it. For this work, we recorded a total of 5581 tactile samples, distributed in 3 sub-sets depending on the orientation of the robotic hand (palm down 0°, vertical palm 90°, and inclined palm 45°). Our dataset contains tactile data of 51 objects with different properties, such as shape, material, and size. Each tactile sample was manually labelled according to two classes: 50% stable grip and 50% unstable grip. The dataset was then divided into two mutually exclusive sub-sets: 41 objects were used for training and the remaining 10 objects were left for testing. This allowed us to check the generalisation capabilities of the proposed classification methods. Both models, GCN and CNN, were trained by exploring hyper-parameter tuning strategies, such as the grid-search technique [50]. Thus, the reported results were achieved with the best performing models found.
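As an illustration of this tuning stage, a grid search can be written in a few lines of Python; the hyper-parameter values below are examples, not the exact grid we explored:

from sklearn.model_selection import ParameterGrid

def cv_score(params):
    # Placeholder: in practice, this runs the 5-fold cross-validation of
    # Section 4.1 with a model built from the given hyper-parameters.
    return 0.0

best_params, best_score = None, float('-inf')
for params in ParameterGrid({'learning_rate': [1e-2, 1e-3, 1e-4],
                             'batch_size': [32, 64],
                             'dropout': [0.0, 0.25, 0.5]}):
    score = cv_score(params)
    if score > best_score:
        best_params, best_score = params, score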
As can be seen in Table 2, the average results obtained with each of the models for the stability prediction in robotic grasps are very similar in terms of accuracy and F1 score. Nevertheless, the CNN achieves higher precision rates, while the GCN achieves higher recall rates. The greater accuracy achieved by the CNN leads us to think that the CNN and the tactile images are a better solution for this problem than the GCN and the tactile graphs, but the difference is very small (about 1%). In contrast, the F1 score, which is the harmonic average of precision and recall, is 3% greater for the GCN. In terms of recall, the score is 12.9% greater for the GCN, meaning that it produces fewer false negatives. Nevertheless, it is also remarkable that the number of incorrectly classified samples is greater for the GCN and, therefore, its precision is lower than that of the CNN (by 11.8%).
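For reference, the F1 score combines the two metrics as their harmonic mean, F1 = 2 · (precision · recall) / (precision + recall), so a model cannot reach a high F1 score by strongly sacrificing either precision or recall; this explains why the GCN leads on this measure despite its lower precision.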
Another conclusion that we can extract from Table 2 is that the proposed CNN seems more sensitive to changes in the orientation of the robotic hand (the variation is 12.1% in accuracy, 7.8% in precision, 20.2% in recall, and 16.9% in F1 score). Thus, the CNN yields the best scores regardless of the metric with the palm down (0°) but the worst rates with the vertical palm (90°). The reason could be that the CNN learns patterns to detect the stability of the grasp that overfit to the orientation of the fingertips, and therefore to the orientation of the electrodes. This could be due to changes in the location of the pressure values that define the contact: when the orientation changes, the pressure is located at other parts of the tactile image. The CNN thus learns local features within the tactile image and has problems generalising to other orientations. Consequently, the GCN has a much more stable performance than the CNN under any evaluation metric. Thus, graphs seem to be a better representation of the state of the touch sensor at a given time for the classification of stability, since the GCN has been able to learn features in the graph that are not affected by the orientations. As a result, we can affirm that GCNs generalise better than CNNs when recognising stable grasps using tactile data.
From these results, it can be concluded that graphs seem to be a better representation than images when it comes to processing the readings of a non-matrix tactile sensor. A graph can better represent the complex structure of a sensor like the BioTac, whereas a tactile image needs a mapping that does not fully correlate with the real distribution of the sensing points in the sensor. As a consequence, for the problem at hand the GCN yields more robust performance rates, which could mean that the features learnt from the graphs are less affected by changes in orientation. Nevertheless, the CNN achieves higher precision rates, showing that it might be more certain of its detection of a stability pattern, though at the cost of recall. In short, the tactile image seems to be a good option to use along with a CNN if false negatives are not a problem. However, in our case, missing a sign of a possible unstable grasp might result in a broken object. Therefore, we prefer the use of graphs and GCNs for the task of stability prediction with unstructured tactile sensors like the BioTac.
4.4. LSTM vs. ConvLSTM: 1D-Signal vs. 2D-Image Sequence
In this section, we compared LSTM and ConvLSTM models. Both models were used to detect the direction of slip caused by the friction between a robotic finger and a contacting surface under different conditions. Seven classes were generated: four translational movements (up, down, left, right), two rotational movements (clockwise, anti-clockwise), and the stable case. The friction data are saved as temporal sequences of touches, where each touch is composed of the tactile values generated by a BioTac sensor installed on one fingertip. We used one finger instead of three as in Section 4.3 because it makes it easier to draw conclusions about the methodology used.
For this work, we performed several friction movements with 11 types of objects grouped into 4 sub-sets, one of them being the training set and the other three being test sets with different properties: rigid objects with a smooth surface (rigid and smooth), objects with a rough surface and therefore with tactile texture (rough), and small objects with a little contact surface (with little contact). As in Section 4.3, the LSTM and ConvLSTM were trained using the grid-search technique [50] for hyper-parameter tuning in order to obtain a good configuration of both models.
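To make the comparison concrete, both sequence models can be sketched in Keras as follows. The layer sizes are illustrative stand-ins for the configurations found by the grid search, and the input shapes assume 5 consecutive readings per sample, with 24 electrode values (BioTac SP) for the LSTM and small tactile images for the ConvLSTM:

import tensorflow as tf

SEQ_LEN, N_ELECTRODES = 5, 24        # 5 readings in 50 ms, 24 BioTac SP electrodes
IMG_H, IMG_W, N_CLASSES = 12, 11, 7  # hypothetical image size, 7 slip classes

# LSTM over raw electrode sequences (one 24-value vector per reading).
lstm = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(SEQ_LEN, N_ELECTRODES)),
    tf.keras.layers.Dense(N_CLASSES, activation='softmax'),
])

# ConvLSTM over sequences of tactile images (one image per reading).
conv_lstm = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(32, (3, 3),
                               input_shape=(SEQ_LEN, IMG_H, IMG_W, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(N_CLASSES, activation='softmax'),
])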
As can be seen in Table 3, the average results obtained with each of the models for the slippage prediction are very similar across all of the evaluation metrics. The differences are not significant, although they are slightly in favour of the LSTM: 0.6% higher accuracy, 2.5% higher precision, 1% higher recall, and 1.2% higher F1 score. These results are probably due to the processing that the tactile data undergo in order to obtain the tactile images used to train the ConvLSTM. In principle, tactile images should be a better representation because they allow the exploitation of the local connectivity of the electrodes through the convolutional layers of the ConvLSTM, whereas the LSTM was only trained with raw tactile data from the BioTac sensor. Nevertheless, non-matrix sensors do not have a direct correspondence between the 3D positions of the electrodes (sensing cells) and the 2D positions of the pixels in an image. Hence, it is necessary to map the electrodes to an image matrix and assign values to the empty pixels, as described in [17]. Consequently, we have to generate new non-zero values for the pixels without correspondence from the neighbouring pixels, as described in [18]. For this reason, the proposed ConvLSTM is trained with tactile images that contain both real tactile values from the BioTac electrodes and synthetic values generated from the neighbourhood.
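A minimal sketch of this image-generation step, in the spirit of [17,18], projects the electrode positions onto a pixel grid and interpolates the empty pixels from their neighbours; the coordinates below are placeholders for the real electrode layout:

import numpy as np
from scipy.interpolate import griddata

N_ELECTRODES = 24
xy = np.random.rand(N_ELECTRODES, 2)   # placeholder 2D-projected electrode positions
values = np.random.rand(N_ELECTRODES)  # one pressure reading per electrode

# Regular pixel grid covering the sensor surface.
grid_x, grid_y = np.mgrid[0:1:12j, 0:1:11j]

# Pixels holding an electrode keep its real value; the remaining pixels
# receive synthetic values interpolated from the neighbouring electrodes.
image = griddata(xy, values, (grid_x, grid_y), method='linear')
image = np.nan_to_num(image)           # pixels outside the convex hull -> 0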
In any case, an advantage of the ConvLSTM over the LSTM is that it presents a smaller standard deviation. For example, the recall of the ConvLSTM shows a smaller standard deviation than that of the LSTM on each of the three test sets: rigid and smooth, rough, and with little contact. As a conclusion, we can affirm that the spatio-temporal patterns learned by the ConvLSTM from tactile images are more robust and less influenced by the type of contacted surface. Consequently, using the ConvLSTM seems a better choice for this task because its performance varies less with novel objects, while it still achieves competitive average rates.
In general, the ConvLSTM involves a greater number of steps than the LSTM because it needs to transform the raw tactile data from the BioTac sensor into tactile images, although the run-time difference is negligible. In both cases, the run-time does not exceed 1 ms, and the greatest limitation of both methods is the reading time required to gather the raw tactile data. In this work, we used a time window of 50 ms in order to gather 5 consecutive tactile readings. Consequently, the online assessment of this classification is not limited by the processing of the neural networks, which can be optimised using pre-computed weights or by increasing the computational power of the hardware. Instead, the time constraint comes from the 50 ms window required to collect the tactile readings.
To sum up, we found that the LSTM actually achieves higher performance rates than the ConvLSTM, though its standard deviations are much higher. As mentioned before, this could be due to the fact that the LSTM works with the raw readings coming from the sensor, while the ConvLSTM works with artificial tactile images. As a result, it seems natural to think that the performance of the ConvLSTM might be affected by the quality of these images. In our experiments, these tactile images hold some cells with synthetic values, used to fill the whole picture. This could be misleading the learning and, therefore, limiting the performance of the network. In consequence, the LSTM seems a better option for building a direction-of-slip detection system that can deal with an unstable predictor, for example by averaging a set of predictions before giving a final output. In contrast, the ConvLSTM yields slightly worse peak rates, but it is much more stable in its predictions. This could mean that the ConvLSTM does not need to average its predictions over a time window, reducing the risk of losing grip when a slippage is detected and giving a more reliable prediction.
The main limitation of our work is the collection of samples for training the proposed neural networks. All of the tested networks are deep neural networks, which require large datasets in order to guarantee a low probability of overfitting when used for supervised tasks. Since our problems require moving robots, acquiring data can be much more time-consuming than taking pictures for computer vision tasks. This could be overcome by using semi-supervised techniques, so that we do not need to label the whole datasets. Another option would be applying data augmentation techniques, though this requires a previous study given the type of data we are handling (tactile information).
The application of our work to other robots or systems is also constrained by the tactile sensor in use. The BioTac SP is a sensor that can present slightly different behaviours and data ranges from one unit to another. As a consequence, the trained models will only work with our sensors and cannot be transferred to another robot, even if it is equipped with a BioTac sensor. Nevertheless, the current work can still be of great use for other researchers willing to tackle tactile tasks with tactile sensors using these learning models.
Finally, another important limitation of our work is the type of objects being used. More precisely, the stiffness of the objects used for learning highly affects the performance of the models. Generally speaking, any object can be classified into two categories: solid or soft. Training a model with samples coming from solid objects does not generalise to soft objects and vice versa. Tactile sensors behave differently during contact with a soft object, and there are even various degrees of softness, so the tactile patterns for a similar stable grasp or type of slip differ depending on this attribute. Hence, in order to apply learning techniques to tactile tasks, one should bear in mind the kind of objects the system will work with, regarding their stiffness.