Design, Motivation and Evaluation of a Full-Resolution Optical Tactile Sensor

Human skin is capable of sensing various types of forces with high resolution and accuracy. The development of an artificial sense of touch needs to address these properties, while retaining scalability to large surfaces with arbitrary shapes. The vision-based tactile sensor proposed in this article exploits the extremely high resolution of modern image sensors to reconstruct the normal force distribution applied to a soft material, whose deformation is observed on the camera images. By embedding a random pattern within the material, the full resolution of the camera can be exploited. The design and the motivation of the proposed approach are discussed with respect to a simplified elasticity model. An artificial deep neural network is trained on experimental data to perform the tactile sensing task with high accuracy for a specific indenter, and with a spatial resolution and a sensing range comparable to the human fingertip.


Introduction
Relatively small image sensors nowadays provide very high resolutions and wide dynamic ranges. Together with cost effectiveness, these factors have made cameras a comprehensive solution for providing robots with a sense of vision. Moreover, in particular in recent years, the use of machine learning for computer vision problems has contributed to the accomplishment of numerous challenging tasks in the field (see, for example, [1][2][3]).
The benefits of artificial vision systems, from the points of view of both hardware and developed algorithms, can be exploited in different domains, as is the case for robotic tactile sensing systems. Although the fundamental importance of the sense of touch for interacting with the environment has been shown in both humans (see [4]), and robots (see [5]), finding a sensing solution that yields satisfactory performance for various types of interactions and tasks is still an open problem. By using a camera to monitor various physical properties, such as the change in light intensity, which are related to the deformation of a soft material subject to external forces, it is possible to design algorithms which reconstruct the force distribution with high resolution.
This article describes the design of a sensor (shown in Figure 1) that consists of a camera that tracks the movement of spherical markers within a gel, providing an approximation of the strain field inside the material. This information is exploited to reconstruct the normal external force distribution that acts on the surface of the gel. The use of a camera intrinsically renders a very high spatial resolution, given by the extensive number of pixels composing the images. A specific choice of relevant features leverages the full resolution of the camera, with the sensor's spatial resolution not limited by the number of markers.
Moreover, this article provides a theoretical analysis of the elastic model of the material, which evaluates different marker layouts. This analysis indicates that the presence of markers at different depths within the gel yields a higher robustness to errors in the marker tracking, while retaining a small sensor threshold.
The map between the marker displacements and the normal force distribution is modeled with a neural network, which is trained on a vast number of images. These images are collected while an automatically controlled milling machine presses an indenter's tip against the sensor's surface at different locations. During this procedure, ground truth force measurements are provided by a force torque (F/T) sensor. The proposed approach is also discussed with respect to transfer learning applications in the author's work in [6], where preliminary results show that it is possible to greatly reduce the required training data and the training times.
The resulting pipeline runs in real-time on a standard laptop (dual-core, 2.80 GHz), predicting the normal force distribution from the image stream at 60 Hz.

Related Work
As highlighted in [7], among the reasons that have delayed the widespread deployment of tactile sensors on robotic systems compared to vision sensors, the lack of a tactile analog to optical arrays is a result of the inherent complexity of interpreting the information obtained via physical contact.
In the literature, different categories of tactile sensors focus on obtaining this information using different principles (see, for example, [8], where the electrical resistance of an elastomer is related to the pressure exerted on it, and [9], where an array of zinc oxide nanorods generates a voltage signal whose amplitude is proportional to the normal force applied). A detailed description of the main classes of tactile sensors for robotic applications is provided in [10].
Optical (or vision-based) tactile sensors are a large class of devices that exploit various light-related principles, which describe how different properties change with the stress applied to a contact surface. Several examples are described in the literature, based on principles such as photometric stereo [11], total internal reflection [12], and reflected light intensity [13]. Vision-based tactile sensors only marginally affect the observed sensor's surface, and therefore do not alter the softness of the contact area. This is a main requirement for tactile sensors that interact with the environment [7]. In fact, a soft material has the advantage of compliance, friction and conformation to the surfaces it interacts with, which for example are crucial properties for manipulation tasks.
Several approaches proposed for optical tactile sensing are based on tracking a series of markers on the sensor's surface (see [14]). The movement of these markers is directly related to the strain field of the material, and can therefore be used to reconstruct the external force distribution on the surface. For example, in [15], a numerical method which captures a linear elastic model of the material is used to reconstruct the force distribution from marker displacements.
The choice of the marker layout in vision-based tactile sensors is investigated in various works. A symmetrical pattern is exploited in [16] to generalize tactile stimuli to new orientations. In [17], an analytical model approximately reconstructs the force distribution from the displacement of spherical markers. These markers are placed over two different depth layers within the sensor's surface, and this layout shows a better robustness to noise compared to the case in which all markers are placed at the same depth.
To overcome the limitations imposed by the assumptions made on the material model, whose full nonlinear description is highly complex to derive, machine learning approaches have been explored for tactile sensing. By training a learning algorithm with an extensive amount of data, it is possible to reconstruct the force applied to the sensor's surface. For example, in [18], a deep neural network is used to estimate the total contact force that an optical tactile sensor with markers printed on the surface is exerting on several objects. The need for training data is fulfilled in different ways in the literature, for instance with robot manipulators [5] or with other automatic machines [19].
This article presents an optical tactile sensor that is based on tracking spherical markers, which are randomly spread over the entire three-dimensional volume of its soft surface. The sensor presented here does not rely on tracking a specific marker layout (as in [16,17]) and is provided with standard lighting conditions (opposed to [18]), facilitating the deployment to larger surfaces (through multiple cameras) of arbitrary shapes. In fact, the proposed strategy only requires the presence of distinct features, which move according to the material deformation, therefore greatly simplifying manufacture.
Conversely to most of the cited approaches, the features engineered for the proposed supervised learning algorithm can be chosen in a way that the information at all pixels is processed, independent of the pattern choice, therefore actually exploiting the full resolution of the camera. Moreover, the proposed design and strategy can be easily extended to different types of forces and interactions.

Outline
The tactile sensor is presented in Section 2, where its design and the production procedure are discussed. Theoretical analyses of the sensor threshold and robustness are explained in Section 3. Section 4 describes a procedure for training data collection and the related data processing for the generation of ground truth labels. Two feature engineering strategies, both suitable for real-time execution, are presented in Section 5. The learning architecture, which captures a map between the engineered features and the generated ground truth labels, is discussed in Section 6. Results and comparisons are shown in Section 7. Finally, Section 8 draws the conclusions of the article.

Design and Production
To track the spherical markers within the soft material, a fisheye RGB camera (2.24 megapixels ELP USBFHD06H) was placed inside a mold, as shown in Figure 2a, and equipped with a board that controls two rings of LEDs (see Figure 2b), which provide constant illumination. The camera can read image frames up to 100 fps at a resolution of 640 × 480 pixels.
The soft materials employed in the sensor production were all degassed in a vacuum (to remove air bubbles, which form during mixing) and poured through cavities in the mold. The mold was placed on one of its sides (that is, with the camera lens pointing sideways) during production. Three different lids were used to perform the three respective steps of the production procedure, which is explained in the following. The design proposed here differs from the one employed in [6], where a bottom-up procedure yielded imperfections on the sensor's top surface.
A first lid closes the mold, and a relatively stiff transparent silicone layer (ELASTOSIL ® RT 601 RTV-2, mixing ratio 7:1, shore hardness 45A) was then poured into the mold through a side cavity, which is shown in Figure 2b. After the silicone cured, the lid was removed and replaced with the second one, shown (in red) in Figure 2c, which had an indent of 30 × 30 × 4.5 mm. Spherical fluorescent green particles (shown in Figure 3) with a diameter of 500-600 µm were mixed with a very soft silicone gel (Ecoflex™ GEL, mixing ratio 1:1, shore hardness 000-35) and poured into the mold filling the empty indent. After this soft layer was also cured, the second lid was replaced with the third one, shown (in yellow) in Figure 2d, with an indent that left an empty section around the gel of varying thickness (depending on the section) between 1 mm and 1.5 mm. A black silicone layer (ELASTOSIL ® RT 601 RTV-2, mixing ratio 25:1, shore hardness 10A) was then poured through the cavity on the third lid. Figure 4 shows a schematic cross-sectional view of the three silicone layers, and an example of the resulting tactile sensor is shown in Figure 1.  The stiff layer, which was poured first, served as a base for the softer materials that were placed on top of it, and as a spacer between the camera and the region of interest. All the materials employed have comparable refraction indexes, therefore preventing unwanted reflections from the LEDs. The spherical particles have a density that is close to the density of the gel they are mixed with. Together with the viscosity of the material, this is crucial to obtain a homogeneous spread of the markers over the entire depth of the gel. A detailed discussion on this point can be found in Appendix A. The black silicone layer added consistency to the extremely soft gel, which also tended to stick on contact, and provided a shield against external light disturbances. The thickness of the sensor from the base of the camera to the top of the surface is 37 mm.

Motivation
Besides ease of manufacture and portability to surfaces with arbitrary shapes, the approach presented here showed theoretical benefits in a simplified scenario. This section presents first order analyses of two different properties of the proposed sensor: the robustness to noise in the marker displacements and the sensor threshold. Here, the threshold is defined as the minimum force that leads to a detectable change in the camera image. To this purpose, simplified models of the camera and the material were considered. Since the resulting expressions were dependent on the marker distribution and on the displacements observed, Monte Carlo simulations (i.e., based on repeated random sampling) were performed for different external forces and marker layouts, and the results are discussed. In particular, four layout classes were considered: Single layer at 1 mm: The markers were randomly distributed on a single layer (with an horizontal section of 30 mm × 30 mm) placed at a depth of 1 mm from the sensor's surface. Single layer at 2 mm: The markers were randomly distributed on a single layer (with an horizontal section of 30 mm × 30 mm) placed at a depth of 2 mm from the sensor's surface. Single layer at 6 mm: The markers were randomly distributed on a single layer (with an horizontal section of 30 mm × 30 mm) placed at a depth of 6 mm from the sensor's surface.

Homogeneous spread between 1 mm and 6 mm:
The markers were randomly distributed over a depth range of 1-6 mm from the sensor's surface (covering an horizontal section of 30 mm × 30 mm).
In the reminder of this article, vectors are expressed as tuples for ease of notation, with dimension and stacking clear from the context.

Model
A semi-infinite linear elastic half-space model was assumed for the surface material. This model has shown limited loss of accuracy in the force reconstruction task for indentations in the proximity of the origin of the half-space (see [20]), and provides a tractable analytical solution. Although a simple scenario with a single rubber layer was considered, the physical parameters of the conducted simulations were chosen to approximate the first-order behavior of the sensor presented in Section 2.
Three Cartesian axes, x, y and z, defined the world coordinate frame. These were centered at the origin of the half-space, placed on the top surface. The z-axis was positive in the direction entering the material, and the remaining two axes spanned the horizontal surface. Let s j := (x j , y j , z j ) be the position of a marker j before the force is applied, for j = 0, . . . , N m − 1, where N m is the number of markers within the gel. Given a point force applied at the origin (this model can be easily extended to the entire force distribution, as shown in [17]), the three-dimensional displacement ∆s j of the marker j can be derived from the Bousinnesq and Cerruti solutions (see [21]), as, where F ∈ R 3 is the applied force vector, and, where ν and E are the Poisson's ratio and the Young's modulus of the material, respectively, and R j := x 2 j + y 2 j + z 2 j . A pinhole camera model (see [22] (p. 49)) was assumed, with square pixels and the optical center projected at the origin of the image coordinate frame, on the z-axis of the world frame. The image plane was parallel to the sensor's surface.
The displacement ∆p j ∈ R 2 , as observed by the camera and expressed in pixels in the image frame, was computed as the difference between the marker positions in the image after and before the force is applied. This leads to, f is the focal length of the camera (in pixels), d is the distance of the optical center from the origin of the world coordinate frame, and H j,i represents the ith row of the H j matrix, for i = 1, 2, 3. Note that, for small d, which is usually the case for state-of-the-art optical tactile sensors, the difference between K a,j and K b,j is not negligible. The expression in Equation (3) can be rearranged, as explained in Appendix B, as, where P j ∈ R 2×3 depends on the displacement ∆p j . For force sensing applications, once the marker displacement ∆p j is observed in the image, the equality in Equation (6) represents an underdetermined system of two equations with three unknowns (the components of F). To find a unique solution, and to improve the reconstruction error in the presence of noise, more equations can be provided by tracking the remaining markers, yielding, with P ∈ R 2N m ×3 . The force F can then be reconstructed as, where the apex † indicates the Moore-Penrose pseudo-inverse matrix. The matrix P depends on the different ∆p j , therefore the equality in Equation (8) indicates a nonlinear dependence between F and ∆p. The parameters used for the Monte Carlo simulations described in the following subsections are summarized in Table 1. In particular, the Young's modulus was chosen as the linear combination of the resulting values for the soft layers described in Section 2, according to the conversions introduced in [23] (note, however, that this value is only a multiplicative factor in Equation (2), and does not have a large effect on the resulting trends in the simulations).

Robustness to Noise
To a certain extent, the learning algorithm presented in Section 6 aims to approximate a map similar to the one in Equation (8), which relates the observed marker displacements to the applied external force. For the purpose of analyzing how the output of this map reacts to a perturbation δp of the displacements ∆p, a first order linearization was performed around the unperturbed state. Therefore, the force corresponding to the perturbed displacements can be expressed as, where J is the appropriate Jacobian matrix, containing the derivative terms of the map with respect to ∆p. Using the properties of the norm, where the L 2 -norm and the Frobenius norm are used accordingly. The value that κ ∈ R takes is often referred to as the absolute condition number of a function (see [24] (p. 221)), and it provides a bound on the slope at which a function changes given a change in the input. In this analysis, it quantified the sensitivity of the map in Equation (8) to noise in the observed displacements. Note that the noise considered here might stem from both image noise and errors in the estimation of the marker displacements (e.g., errors in the optical flow estimation, see Section 5). The Monte Carlo simulations were performed as follows: (1) N f force samples were randomly drawn. Each of these samples represented a three-dimensional concentrated force vector, which was placed at the origin of the world coordinate frame. (2) For each force sample, N c marker configurations were randomly drawn for each layout class, and the resulting marker displacements were projected to the image frame. (3) Therefore, the resulting value of κ was computed and averaged for the N c different random marker configurations (in the same layout class). Figure 5 shows the average magnitude of κ for different types of forces. For ease of visualization, the results presented here considered the application of pure shear (along one axis) or pure normal force, but similar conclusions were obtained for generic force vectors. The figures indicate how the map in Equation (8) was more robust for configurations with the markers placed closer to the camera, i.e., deeper in the soft material. The layout class with an homogeneous spread of markers between 1 mm and 6 mm showed higher robustness compared to the case of all markers placed at 1 mm or 2 mm depth (but was lower when compared to the class with the markers placed at 6 mm depth). Note that there were two counteracting effects in the model considered, that is, markers closer to the camera exhibited smaller displacements, but these displacements underwent higher amplification when projected to the image.
The sensor's robustness to noise was also evaluated in the same simulation framework by perturbing the observed marker displacements after projecting these to the image frame. Zero-mean Gaussian noise with variance σ 2 p was added to the observed ∆p, and the force was reconstructed using Equation (8). The resulting root mean squared error (RMSE) for varying σ p (averaged over the different simulations) is shown in Figure 6, showing that, considering higher order terms, neglected in the linearization, led to the same trend as in Figure 5. Single layer 1 mm Single layer 2 mm Single layer 6 mm Homogeneous (b) Absolute condition number for normal force Figure 6. Tactile sensing applications are simulated by drawing shear (a) and normal (b) force samples, computing the resulting marker displacements, and projecting these to the image frame. Therefore, Gaussian noise with variance σ 2 p was added to the observed displacements, and the force was reconstructed through the map in Equation (8). The resulting averaged root mean squared error is shown in these plots for different σ p .

Sensor Threshold
The tactile sensor threshold is given by the minimum force that generates a noticeable change in the image, i.e., when a marker is observed at different pixel coordinates. In the worst case scenario (a similar analysis applies to other scenarios), each marker center (or feature) of interest is at the center of a pixel. In this case, a marker needs to move by at least half a pixel to be observed at a different position in the image. Therefore, during the simulations described in Section 3.2, the component-wise maximum among the observed marker displacements was computed (that is, the maximum component of the observed ∆p vector). For each force sample, this quantity was averaged over the N c random marker configurations. The sensor threshold was then estimated as the minimum magnitude of the (shear or normal) force that yields a displacement greater than half a pixel. Table 2 shows the resulting sensor threshold for pure normal and shear forces, respectively, for the different classes of layouts. In the simplified scenario considered, the configurations with the markers placed closer to the sensor's top surface exhibited a smaller threshold, while the layout class with homogeneous spreads of markers at different depths retained a threshold comparable to the layout class with a single layer at 1 mm depth. Summarizing, the analyses presented here show that spreading the markers homogeneously at different depths (in this example between 1 mm and 6 mm) might yield an advantageous trade-off between robustness to noise in the observed marker displacements and sensor threshold. In particular, compared to the case of all markers at a depth of 1 mm, this layout class showed higher robustness to noise while retaining a comparable sensor threshold (which for this example was about 2-3 times smaller than the layout with a single layer at 6 mm depth).

Ground Truth Data
In the context of supervised learning, each data point is associated with a label that represents the ground truth of the quantity of interest. In the approach presented here, a label vector representing the normal force distribution is assigned to the image that the camera captures when this force is applied. In this section, the procedure followed to collect the data, which were then processed and used to train the learning architecture presented in Section 6, is described. The generation of ground truth label vectors is then discussed.

Data Collection
The training data were automatically collected by pressing an indenter's tip against the sensor gel by means of a milling and drilling machine (Fehlmann PICOMAX 56 TOP).
The machine was equipped with a three-axis computer numerical control, which enabled precise motion (in the order of 10 −3 mm) of a spindle in the three-dimensional operating space. A spherical-ended cylindrical indenter (40 mm long, with a diameter of 1.2 mm) was mounted on a state-of-the-art six-axis F/T sensor (ATI Mini 27 Titanium) through a plexiglass plate. The F/T sensor was attached to the spindle and used to measure the force applied by the indenter, which was pressed against the gel at different depths and positions. An image of this setup is shown in Figure 7.
The milling machine outputted a digital signal, which was positive when the needle reached the commanded pressure position. The signal was read by a standard laptop, which was also responsible for reading the F/T sensor data and the camera stream. The data were synchronized on the laptop by extracting the image frames that were recorded (and cropped to 480 × 480 pixels) when the digital signal was positive. These images were then matched to the corresponding normal force value, which was sent by the F/T sensor.

Data Labels
The neural network architecture presented in Section 6 requires the labels to be expressed as real vectors. To this purpose, the sensor surface was divided into n bins, each of these representing a different region. Therefore, a corresponding n-dimensional label vector embedded the force that was applied to each of these regions. Since the force was applied to the gel with the relatively small spherical indenter used for the training data collection, this was simplified as a point force at the center of the contact. The point of application of the force was mapped to one of the n surface bins, and the value of the force read by the F/T sensor was assigned to the label vector component that represented the appropriate bin. All remaining components were then set to zero, to indicate a zero force distribution in those regions. A scheme of this procedure is shown in Figure 8.
Note that the number of surface bins n provides an indication of the spatial resolution of the sensor (measured in millimeters), which is different from the camera resolution that was fixed at 480 × 480 pixels in the scenario considered here. A larger n provides a finer spatial resolution, at the expense of a higher dimensional map between images and force distribution, which is generally more complex to estimate.

Optical Flow Features
Extracting meaningful features by processing the images has the benefit of reducing the required amount of training data and the training times.
In a soft material, the strain field provides information about the force applied to the surface, which generates the material deformation [25]. In the approach presented here, the spherical markers' displacement rendered an approximation of the strain field, which could be obtained by means of optical flow techniques. In fact, the input features to the supervised learning architecture presented in Section 6 were derived from the estimated optical flow.
This section discusses two approaches to the feature engineering problem, based on sparse and dense optical flow, respectively [26].

Sparse Optical Flow
Sparse optical flow methods are based on tracking the movement of a set of keypoints. In particular, the Lucas Kanade algorithm (see [27]) solves the optical flow equations relying on the assumption that pixels in a small neighborhood have the same motion.
For the tactile sensor proposed in this paper, the centers of the fluorescent markers represented a natural choice as keypoints. However, identifying these keypoints required a detection phase with the gel surface at rest. The proposed algorithm is based on the watershed transformation for image segmentation (see [28]). The different steps are explained in Figure 9.   The distance transform (see [29] for the definition) is computed (c), and its local peaks are extracted (d).
The watershed algorithm is finally applied to segment the original image into the different markers. The centers of the resulting contours, in red (e), are chosen as the keypoints to track through Lucas Kanade optical flow in the following frames (f).
Once the keypoints were chosen, they were tracked in the following frames using Lucas Kanade optical flow. The resulting flow was represented as a tuple of magnitude and angle with respect to a defined axis. Similar to the approach explained for generating the label vectors, the image was divided into a uniform grid of m regions. Note that m does not necessarily have to be equal to n. The markers were assigned to the appropriate resulting bins, depending on their location in the image. The features were then defined by the following average quantities, for each image bin i = 1, . . . , m, α avg,i = atan2 where d ij and α ij represent the magnitude and angle tuple relative to the displacement of the marker j in the region i, and N i is the number of markers assigned to the region i. The average defined in Equation (12) is often referred as the circular mean of the angles (see [30] (p. 106) for an explanation). Under the assumptions that the markers were homogeneously distributed over the entire volume of the gel, these average quantities provided an approximation of the optical flow at each of the bin centers. The feature vector for each image was therefore obtained by appropriately stacking the two average quantities for each of the m regions, resulting in a 2m-dimensional feature space. Note that the averages also provided invariance of the features to the marker pattern, provided that the same material and design were retained.

Dense Optical Flow
Compared to sparse optical flow, dense optical flow methods are more accurate, at the expense of a higher need for computational resources [31]. Moreover, they do not rely on a preliminary detection phase, which may introduce further inaccuracies (see, for example, the lower-left part of Figure 9e). In fact, rather than tracking a set of features over subsequent frames, they aim at estimating the motion at each pixel of the image. The Dense Inverse Search (DIS) optical flow algorithm (see [32]) reconstructs the motion through the computation of the flow at different image scales, and exploits an inverse search technique to obtain a considerable increase in speed. Therefore, rather than relying on the detection of keypoints, the DIS algorithm reconstructs the optical flow from any trackable distinct pattern. Moreover, it exploits the full resolution of the camera (distinct from the spatial resolution of the tactile sensor), rendering motion information at each pixel of the image. As a consequence, the spatial resolution of the tactile sensor is not limited by the number of markers. Note that a specific marker layout (e.g., a uniform grid) would not simplify the image processing and the reconstruction of the dense optical flow, and an eventual regular spacing could cause the loss of information at certain pixels. Conversely, the random spread of markers facilitates manufacture and does not impact the deployment of the sensor to applications where a larger contact surface with arbitrary shape is required.
To match the required format of the DIS algorithm, the original image was processed. The green regions in the image were thresholded and the remaining non-green regions were replaced with black pixels, removing external light disturbances and material imperfections. Finally, the resulting image was converted to gray-scale. An example of the masked image and the computed dense optical flow are shown in Figure 10. As in the case of the sparse keypoint tracking, the resulting flow was represented as tuples of magnitude and angle, in this case for each pixel. The pixels were then assigned to m image bins. Redefining N i for the dense optical flow case as the number of pixels assigned to the region i, the averages defined in Equations (11) and (12) were computed, to compose a set of 2m features.

Learning Architecture
The problem of reconstructing the normal force distribution from images can be formulated as a multiple multivariate regression problem, that is, a regression problem where both input features and labels are multi-dimensional vectors. A multi-output deep neural network (DNN) provides a single function approximation to the underlying map, which exploits the interrelations among the different inputs and outputs. Similar to the work in [6], the approach proposed in this article presents a feedforward DNN architecture (see Figure 11), with three fully connected hidden layers of width 1600, which apply a sigmoid function to their outputs. The input data to the network were the 2m (sparse or dense) optical flow features described in Section 5, while the output vectors provided an estimate of the normal force applied at each of the n bins the tactile sensor surface was divided into, as described in Section 4. The architecture weights were trained with RMSProp with Nesterov momentum (known in the literature as Nadam, see [33]), with training batches of 100 samples and a learning rate of 0.001. The solver aimed at minimizing an average root mean squared error (aRMSE), defined according to Borchani et al. [34] as, where y i denote the ith true and predicted label vector component, respectively, for the lth sample in the considered dataset portion (i.e., training, validation, and test sets), which contains N set data points.
Twenty percent of each dataset was retained as a test set, against which the performance was evaluated. Ten percent of the remaining data points were randomly chosen as a validation set. Dropout regularization (at a 10% rate) was employed at each hidden layer during training, which terminated when the loss computed on the validation set did not decrease for 50 consecutive epochs.
Due to the very sparse structure of the label vectors considered in the experiments presented here (see Figure 8), two evaluation metrics were computed on the test set, in addition to the aRMSE, which are more intuitive for this application. For l = 0, 1, . . . , N set − 1, the label component of the ground truth vector where the magnitude of the applied force is maximum was computed, The same quantity was computed for the estimated label vector, Denoting c(k) as the location of the center of a surface bin k on the horizontal plane, a distance metric was introduced, Finally, an error computed on the maximum components of the true and estimated label vectors was defined as, The metric in Equation (16) provides an indication of how close the location of the maximum estimated force was to the real location of the force applied by the indenter described in Section 4. The metric in Equation (17) indicates how accurate (in the magnitude) the estimation of this force was.

Results
The tactile sensor's performance was evaluated on a dataset collected on a test indenter, as described in Section 4. The images (acquired at a resolution of 640 × 480 pixels) were cropped to a region of interest of 480 × 480 pixels, which covered the gel surface. A total of 10,952 data points were recorded by commanding the needle to reach various depths (up to 2 mm, for a maximum force of 1 N) at different positions on the surface, which were defined by an equally-spaced grid (0.75 mm between adjacent points). Note that other ranges of force could be similarly covered by replacing the Ecoflex™ GEL with another material with different hardness.
An example of the prediction of the normal force distribution is shown in Figure 12. The learning architecture was evaluated for different values of m and n, and for features based on sparse and dense optical flow. The results, greatly outperforming the authors' previous work in [6], are shown in Figure 13.
The plots show that, in terms of the aRMSE, which is the loss actually optimized during training, the dense optical flow outperformed the sparse optical flow in estimating the force distribution, in both the finer and coarser spatial resolution cases (determined by n). However, this metric did not provide a reliable comparison across different spatial resolutions, since it was influenced by the zero values in the label vectors, which were considerably more in the finer resolution case.
As shown in Figure 13b, the RMSE mc decreased by increasing the number of averaging regions in the image, and was lower for the dense optical flow case, considering the same number of surface bins. Estimating the force distribution with a finer spatial resolution slightly degraded the performance according to this metric, and might require a different network architecture and more training data, due to the higher dimension of the predicted output.  High accuracy was achieved at each bin, including the corners of the gel, which underwent a deformation that generally differed from the rest of the surface when subject to force (due to boundary effects).
The use of the information at all pixels, provided by the dense optical flow, resulted in a better accuracy in estimating the location of the applied force (indicated by d loc ), compared to the sparse optical flow. In fact, the performance obtained with finer resolution and dense optical flow features was comparable to the case with coarser resolution and sparse optical flow features. Moreover, excessively increasing the number of image bins m might degrade the performance for the sparse optical flow case, since some averaging regions might not be covered by a sufficient number of detected keypoints.   (13), (17) and (16)) for various values of image bins m and surface bins n, and for dense and sparse optical flow (OF) features. The resolution of the ground truth F/T sensor is 0.06 N.

Conclusions
This article has discussed the design of and the motivation for a novel vision-based tactile sensor, which uses a camera to extract information on the force distribution applied to a soft material. Simulation results have shown the benefits of the proposed strategy. The experimental results presented in this paper (see Supplementary video) have shown high accuracy in reconstructing the normal force distribution applied by a small spherical indenter. Although in the experimental setup considered an approximation to a point indenter has been applied, the proposed learning algorithm does not explicitly use knowledge of this information (that is, the point force and the zero valued regions are indistinctly predicted through the same architecture). In fact, the representation of the force distribution introduced in this article (embedded in appropriate label vectors) is suitable to represent generic normal forces, including multi-point contacts and larger indentations. Moreover, this representation can be readily extended to shear forces, by appropriate vector concatenation. Future work will investigate the cases not considered in this article, to address the limitations introduced by the point force approximation. To this purpose, the approach proposed here will need to be evaluated for arbitrary force distributions on a wider dataset, with various indenters and multiple types of contact. These steps will be aimed towards the generalization of the current approach to touch with objects of arbitrary shapes (e.g., for robotic grasping tasks).
Dense optical flow features, whose application in the context of vision-based tactile sensors is novel (to the authors' knowledge), exhibit higher performance compared to (sparse) keypoint tracking and retain fast execution times. The use of these features leverages the randomized marker layout, since (as highlighted in Section 5) the spacing required for regular grid patterns, as for example in [35], might instead cause a decrease in accuracy in the estimation of the dense optical flow.
The spatial resolution (n = 361 corresponds to surface bins with a side of 1.58 mm) and sensing range achieved are comparable to the human skin (see [36] for a reference), and the resulting pipeline runs in real-time at 60 Hz. analysis (which considers η constant). In practice, the dynamic viscosity increase contributes to considerably slow down the particle, further reducing the resulting displacement to a negligible value.