Vision-Based Tactile Sensor Mechanism for the Estimation of Contact Position and Force Distribution Using Deep Learning

This work describes the development of a vision-based tactile sensor system that uses image-based information from the tactile sensor, together with input loads applied through various motions, to train a neural network that estimates tactile contact position, area, and force distribution. The study also addresses pragmatic aspects, such as the choice of thickness and materials for the tactile fingertips and the surface finish. The overall vision-based tactile sensor equipment interacts with an actuating motion controller, a force gauge, and a control PC (personal computer) running LabVIEW software. Image acquisition was carried out using a compact stereo camera setup mounted inside the elastic body to observe and measure the deformation caused by the motion and input load. The vision-based tactile sensor test bench was employed to collect the output contact position, angle, and force distribution produced by randomly selected input loads for motion in the X, Y, and Z directions and for RxRy rotational motion. The retrieved image information, contact position, area, and force distribution from different input loads with specified 3D positions and angles are used for deep learning. A VGG-16 convolutional neural network classification model was modified into a regression network, and transfer learning was applied to suit the regression task of estimating contact position and force distribution. Several experiments were carried out using thick and thin tactile sensors with various shapes, such as circle, square, and hexagon, to validate the predicted contact position, contact area, and force distribution.


Introduction
Vision-based processing has been a part of inference in many interdisciplinary fields of research [1][2][3]. The usage of vision-based tactile sensors in industrial applications has grown over the past two decades with the rise in the standard of imaging sensors [4][5][6]. Usually, tactile sensors perceive the physical aspects of an object, which in turn guides the handling of the object in terms of the strength applied to interact with it [7]. On the contrary, visual sensors, such as cameras, do not interact with objects physically; instead, they retrieve visual cues from the imaging patterns of the objects in various modes [8]. The perceiving capability can be improved by using information, such as visual patterns, adapted force, and contact location, retrieved from the visual sensors without having to interact with the object physically [9]. This can be made possible using deep learning, which trains on the data collected from vision sensors, along with parameters such as contact position and force distribution, to predict these output parameters in the future [10].

Background
The vision-based tactile sensing mechanism is developed using the same scheme, where the camera is mounted inside the elastic tactile sensing fingertip. Whenever an object touches the fingertip, the camera captures the transformed grid pattern used to estimate the contact position and force distribution. The correlation between the input load force, contact position, and the transformed image captured by the camera sensor can be learned across various scenarios [11]. In this way, the vision-based tactile sensor technology removes the need for separate traditional array-type tactile sensor strips, which are usually less durable and prone to breakage and a large signal-processing burden [12]. Furthermore, this type of vision-based tactile sensor is essentially a single-element type with no physical interaction between the sensor and the elastic body. In the worst-case scenario, the elastic part can be replaced if damaged, but the visual sensor always stays safe [13]. Additionally, indirect contact with the elastic body means the signal processing burden is reduced roughly tenfold even if the detection area increases. The image acquisition process in the context of a vision-based tactile sensor can be observed in Figure 1, which depicts the industrial vision-based tactile sensor equipment used in this study, along with the transformed stereo image pair caused by deformations on the elastic body.

The problem statement of this study is to predict the force distribution and contact position parameters using a deep learning network trained on data acquired from the visual-tactile sensor setup.
Deep learning models are commonly trained on classification and detection problems, which are straightforward: class labels and corresponding training samples are used to predict or detect the target class objects. In this study, the models must instead be tailored to estimate continuously varying quantities, such as contact location and force distribution. Therefore, this study focuses on implementing a customized, problem-specific regression model through transfer learning on top of a pre-trained deep learning network architecture. This means the training data has to be collected under diverse conditions, such as various input loads with different object shapes, tactile sensor thicknesses, etc. The collected data then has to be paired with the stereo camera samples (which capture the deformation of the elastic body) as right and left images. This collective data has to be properly handled and pre-processed to train the regression network for better prediction of contact position and force distribution, as shown in Figure 2. The main contributions of this study are:
• employing deep learning for the transfer learning of the VGG16 classification pre-trained network model; and
• validating the vision-based tactile sensor system to examine the estimation of contact position, contact area, and force distribution using thick and thin tactile sensors with various shapes.
The paper is organized as follows. Section 2 thoroughly discusses the previous works and their characteristics regarding the usage of computer vision/deep learning in vision-based tactile sensor technology. Section 3 explains the overall materials and methodologies utilized in this study. All the aspects, such as overall system installation, stereo camera setup, manufacturing, and practical issues, related to the tactile fingertips, deep learning network architecture, and transfer learning methodology are detailed in this section. Section 4 describes the tactile sensor experiments and related evaluation metrics. Section 5 reports the results and related discussions based on the applied deep learning methodology to estimate the tactile contact position and force distribution. Finally, Section 6 concludes the paper with a summary.

Vision-Based Tactile Sensor Technology
The practice of employing camera sensors to estimate the contact position and force distribution has been actively researched over the past decade [15]. The vision sensors are compactly embedded in the tactile sensing mechanism such that the deformations in the elastic body are transformed into tactile force and contact position information [16]. With the increase in the pixel resolution of visual sensors, vision-based tactile sensitivity has also improved. Researchers have employed image processing and computer vision techniques to measure the force and displacement of markers [17]. The patterns on the deformed materials are analyzed using low-level image processing algorithms and support vector machines [18], and some studies have approached the problem of determining the contact force and tactile location from a machine learning perspective [19]. Other studies adopted dynamic vision sensors and depth sensors for tactile sensing [20]. With the accessibility of compact circuit technologies and high-spatial-resolution vision systems, some studies have been able to report 3D displacement in tactile skins [21]. A few other works embedded multiple camera sensors inside the tactile sensor to retrieve the best possible internal tactile force fields [22]. On the other hand, there has been growing enthusiasm for learning-based approaches that employ deep learning for the estimation of tactile information [23]. Vision-based tactile sensing mechanisms can be broadly classified into two approaches: traditional image processing/computer vision-based methods and learning-based methods. In traditional image processing/computer vision methods, various low-level image manipulation techniques are employed to enhance the images retrieved from the deformation source [24]. The traditional methods often operate directly on the images retrieved from the input sensor.
This enables a pipeline that does not require any training data before inference. On the contrary, the learning-based techniques rely heavily on training data to improve performance [25].

Previous Works
In the past decade, several studies have been proposed on using vision-based techniques in tactile sensing mechanisms. Begej et al. [26] pioneered the usage of the vision-based tactile sensor for measuring the contact force and internal reflection. Lepora et al. [27] reported their studies on implementing a super-resolution optical tactile sensor that can localize the contact location as well as measure the contact force. Ito et al. [28] proposed a method to estimate the degree of slippage using a vision-based tactile mechanism with extensive experiments. Yang et al. [29] focused on analyzing the texture of the material using a micro RGB camera in the context of tactile finger instrumentation. A few studies, such as Corradi et al. [30] and Luo et al. [31], used the vision-based tactile mechanism to recognize various objects. There were also remarkable studies by Piacenza et al. [32], which accurately estimated the contact position with indentation depth prediction using visual-tactile sensors.
The work of Johnson et al. [33], demonstrating the measurement of surface texture and shape using their photometric stereo technology, has gained prominence in the field. These studies were later extended to measure normal and shear forces, as reported in Johnson et al. [34] and Yuan et al. [35]. Learning-based methods were employed by researchers such as Kroemer et al. [36] and Meier et al. [37] for the estimation of force exhibited in tactile behavior. In particular, Meier et al. [37] used convolutional neural networks to detect online slip and rotations. Similarly, Chuah et al. [38] used artificial neural networks (ANN) to improve the accuracy of normal and shear force estimation. They employed an automatic data collection procedure to acquire footpad data while moving through various trajectories. The concept of transfer learning helps speed up the process of adapting learning-based mechanisms to vision-based tactile sensing tasks. Many studies, such as References [39][40][41], adopted transfer learning in the context of Convolutional Neural Networks (CNN) to attain better results in determining force and other tactile aspects. The details of the summarized vision-based tactile sensing techniques are stated in Table 1. The proposed system achieves a refresh rate of 30 Hz with a spatial resolution of 2.5 mm; its sensor size is larger than that of References [42,43] because of the stereo camera.

System Installation and Flow Schematic
The system installation employed for the vision-based tactile sensor is a combination of multiple systems, such as a motion actuator with a tactile sensor test bench, a motion controller, and a control personal computer (PC), as depicted in Figure 3a.
• Motion actuator with vision tactile sensor bench: The motion actuators are used in the test bench to facilitate motion along the linear (XYZ) and rotational (R_xR_y) axes. The contact-shaped tool is actuated to make contact with the elastic tactile tip, which has a camera fixed inside it.
• Motion controllers: The motors are controlled by the motion controllers, which act as a bridge between the motion actuators and the control PC. The motion controller takes into account all the parameters, such as force, contact position, and angle, so that the motion produces the desired outcome.
• Control PC: The control PC is a general personal computer with a LabVIEW GUI, which acts as an activity log of the motions and controls and as the data acquisition/processing center for the whole system installation. The training/testing data is collected from the test bench stereo camera setup via a USB port. The LabVIEW software then accumulates the data with the corresponding tactile control parameters for network training/testing.
All these subsystems communicate with one another to realize the overall system flow. The force gauge and the tactile sensor come into contact to produce a deformation on the elastic tactile tip, which is recorded as a pattern by the stereo optical system. This mechanism is controlled and regulated by the motion controller together with a control PC running the processing software. This flow schematic of the visual-tactile sensor mechanism is shown in Figure 3b.

Process of Making Tactile Fingertips
Although there are many ways to make tactile fingertips, this study proceeded with an injection mold technique with a defoaming process, as shown in Figure 4a. Before the injection-molded tips, several 3D printing processes were tried to produce tactile fingertips; however, none of them withstood the stress, and they were torn apart, as shown in Figure 4b. The defoaming process with injection mold structures (upper, lower) helped the tips withstand the elastic stress imposed by repeated force gauging. However, there are some practical issues involved in making the tactile fingertips. One such issue is surface light reflection on the inside of the tactile fingertip. The injection mold process adopted in this study posed this issue of light reflection, as shown in Figure 4d. The major concern is that these light reflections overshadow the deformation patterns inside the tactile fingertips, which could lead to inappropriate optical imagery captured by the stereo camera inside the fingertip. Therefore, sanding was carried out on the mold surface after the injection process to reduce the light reflections, as shown in Figure 4e. The reliability of the tactile fingertips is crucial in this study, as they are repeatedly pressed to collect the force, contact position, and other tactile-based sensor data. Accordingly, the reliability of the tactile tips can be categorized into physical and visual terms.
• Physical: The tactile tip must sustain the repetitive stress and must exhibit the same tactility throughout the sensor data acquisition. However, the inside of the tactile tip often suffers severely from air bubbles. This problem was encountered in this study and was successfully resolved by vacuum degassing the tactile tip during manufacturing. This process is shown in Figure 4f; it efficiently reduced the air bubbles and gave the tactile tips better endurance.
• Visual: The visual reliability of the tactile tip was improved by the marker painting process, shown in Figure 4g, which helped in the visual recognition of deformation patterns. Initially, white paint was used to draw the markers on the surface of the sensor. During the durability test, however, this paint proved incompatible with the tactile sensor rubber material. Therefore, the markers were painted using the same rubber material, but in white for easy recognition.

Tactile Fingertip Sensor Design Aspects
The tactile sensor fingertip specifications considered in this study are stated below in Table 2. Table 2. Design aspects and specifications of the tactile fingertip sensor.

Design Aspects: Specifications
Sensor surface material (including markers): Rubber

In the process of making the tactile fingertips, an ablation study was carried out to analyze certain practical aspects, such as which material should be used to make the tips and what the thickness of the tactile tip should be. These questions were investigated through a proper ablation study in terms of tactile touch sensitivity and tactile stability. A force-displacement characteristic plot was constructed to analyze the effect of shore hardness, i.e., the surface hardness of the material, and the thickness of the material. Shore hardness is often measured using a durometer on a Shore hardness scale, such as the "Shore 00" scale [44]. For example, a very soft material, such as gel, has a Shore hardness of around 5, while a hard rubber, such as a shoe heel, has a Shore hardness of around 100.
• Shore hardness (surface hardness): Tactile materials with a standard thickness t = 1 mm were considered at different Shore hardness values of 40, 60, 70, and 80. The force-displacement characteristic plots can be observed in Figure 5: with increasing force, the tactile tip with Shore hardness 40 is easily displaced and loses its linearity in terms of elasticity, i.e., the tip with Shore hardness 40 is too weak to be used as an elastic body at a force of 1 N. Similarly, with increasing force, the material with Shore hardness 60 shows displacement characteristics similar to the Shore hardness 40 material, but slightly more linear. The comparison between Shore hardness 70 and 80 resulted in choosing Shore hardness 70 as optimal for the study experiments, because Shore hardness 80 is too stiff to act as an elastic material with linearity at the various force steps.

Stereo Camera System
The stereo camera system is fixed at the bottom of the tactile elastic tip to capture the deformations caused by the tactile contact. A stereo system was chosen to acquire better image data from the tactile mechanism. The stereo camera captures both the right and left images of the deformation and transfers the image data to the control PC for training/testing purposes. The design setup of the stereo camera system used in this study is shown in Figure 7 below. The visual-tactile sensor system heavily relies on this stereo camera system for real-time inference. Therefore, the system must be compact, memory-friendly, and power-efficient. The stereo setup used in this study is compact: the stereo baseline between the right and left camera lenses is a mere 10 mm, with an industry-standard resolution of 640 × 480, which is efficient in terms of memory and power consumption. Nevertheless, other image resolutions, such as 1280 × 720 at 30 fps, 640 × 360 at 30 fps, and 320 × 240 at 30 fps, were also examined. The design aspects of the stereo camera system employed in the experiments are stated in Table 3. Table 3. Design aspects and specifications of the stereo camera system.

Deep Learning Methodology
The deep learning-based contact position and force measurement algorithm is divided into six steps, which are shown in Figure 8 and described in detail below. The stereo image pair consisting of the deformation pattern of the elastic tactile tip serves as an input for the algorithm. Both the right and left images have the same deformation pattern but from a different perspective with a baseline of 10 mm in between both imagery. Data handling, pre-processing, and transfer learning are the crucial steps involved in the learning algorithm.

Region-of-Interest (ROI) and Mode Selection
The ROI setting was carried out to manage memory and save GPU processing power. Considering a single 3-channel (RGB) input image of dimensions 640 × 480, the video input from the left and right cameras via the acquisition equipment, in terms of height, width, and channels, is 480, 640, 3. Therefore, it is essential to design a region of interest that suits both the left and right images. Accordingly, a manual ROI area is calculated per the video input specifications and kept the same for the whole stereo pair data. The ROI rows span pixels 24-216, the columns span pixels 47-271, and the cropped area size is (192, 224), which minimizes GPU memory usage. The ROI design is shown in Figure 9a and is the same for both the right and left images. The mode selection is a customized procedure designed to find the best possible input feed for the neural network. It involves selecting the data according to different modes, as shown in Figure 9b, and then feeding them into the neural network as input. Although 4 modes were put forward, only the mode that performed best during training (mode 1) is considered for the inference.
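As a rough illustration (not the authors' code), the fixed ROI crop described above can be sketched in a few lines; the slice bounds below follow the row/column settings in this section:

```python
import numpy as np

# Sketch of the manual ROI crop described above: rows 24-216 and
# columns 47-271 of a 480 x 640 x 3 frame, yielding a (192, 224, 3)
# patch applied identically to both stereo images.
ROW_SLICE = slice(24, 216)   # 216 - 24 = 192 rows
COL_SLICE = slice(47, 271)   # 271 - 47 = 224 columns

def crop_roi(frame: np.ndarray) -> np.ndarray:
    """Crop the fixed ROI from a 480 x 640 RGB frame."""
    return frame[ROW_SLICE, COL_SLICE]

left = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder left image
print(crop_roi(left).shape)                      # (192, 224, 3)
```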

Zero Centering and Scaling
The combination of image data with the coupled tactile sensing data must be well fused and analyzed for the network to learn the insights of the data. Although the stereo video input received from the equipment is pre-processed by cropping to a specific ROI to optimize memory and power, the image data must be further processed so that it can be fused with the tactile data given in terms of force, contact location, contact angle, etc. In other words, for the deep learning network to converge well during the learning process, the uint8 (unsigned integer) [0-255] image is scaled to [0, 1] and normalized to [−1, 1] by zero-centering. Zero centering and scaling are needed because the attributes to be predicted differ in units and ranges, such as displacements along X, Y, Z (in mm), force (in N), and R_a (in degrees). Therefore, zero centering and scaling are essential for the network to learn the insights of the image data in correspondence with the tactile parametric data. The zero-centered and scaled data (x̄_ij) is a function of the original data (x_ij) normalized between the minimum (min_j) and maximum (max_j) points, as shown in:

x̄_ij = 2 × (x_ij − min_j) / (max_j − min_j) − 1,

where i is the data index, j is the attribute index, x_ij is the jth attribute of the ith data, max_j is the maximum value of the jth attribute of the training data, and min_j is the minimum value of the jth attribute of the training data.
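A minimal sketch of this zero-centering and scaling step (assuming the min-max form stated above; the attribute values below are purely illustrative):

```python
import numpy as np

def zero_center_scale(x, min_j, max_j):
    """Scale to [0, 1] using training-set min/max, then zero-center to [-1, 1]."""
    scaled = (np.asarray(x, dtype=float) - min_j) / (max_j - min_j)  # [0, 1]
    return 2.0 * scaled - 1.0                                        # [-1, 1]

# A uint8 pixel range and a force attribute have very different scales:
pixels = zero_center_scale([0, 128, 255], 0.0, 255.0)
forces = zero_center_scale([0.1, 0.5, 1.0], 0.1, 1.0)   # Newtons
print(pixels)   # approx [-1.0, 0.004, 1.0]
print(forces)   # approx [-1.0, -0.111, 1.0]
```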

Network Architecture
The convolutional neural network model used in this study was adapted from the well-known VGG16 structure. The VGG16 model is commonly exploited for high-accuracy object classification tasks in computer vision and AI. However, the task in this study is to predict continuously varying parameters, such as force, contact position, and angle. These are continuous values that cannot be modeled as a classification task. The customized convolutional neural network model consists of a total of 16 deep layers, including the input layer. The input passes through these layers, along with 5 max pooling layers. The first 2 layers of the network consist of 64-channel 3 × 3 convolution filters with stride 1, each followed by batch normalization and a Rectified Linear Unit (ReLU) activation function. Max pooling of size 2 × 2 with stride 2 is used after the second convolution layer; the max pooling used throughout the model has this standard configuration. The next 2 convolution layers use 128-channel 3 × 3 convolution filters with stride 1, followed by batch normalization and ReLU activation. A max pooling layer is used after the fourth convolution layer. The next 3 convolution layers consist of 256-channel 3 × 3 convolution filters with stride 1, followed by batch normalization and ReLU activation. A max pooling layer is used after the seventh convolution layer. The next 6 convolution layers contain 512-channel 3 × 3 convolution filters with stride 1, followed by batch normalization and ReLU activation. Max pooling layers are used after the tenth and thirteenth convolution layers. The last two layers are dense fully connected layers with 4096 units each. To prevent overfitting, a dropout rate of 0.9 was used.
The total output from the fully connected dense layers is used for the regression purpose, as shown in Figure 10.
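The architecture above can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the authors' implementation: the 7-unit regression output for [F, D, X, Y, Z, Rx, Ry] and the (192, 224) input size follow the descriptions elsewhere in this paper.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 conv layers (stride 1), each with batch norm + ReLU, then a 2x2 max pool."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, 2))
    return layers

# 13 convolution layers in blocks of 2, 2, 3, 3, 3 (as described above),
# followed by two 4096-unit dense layers with dropout 0.9 and a 7-unit
# linear regression output.
model = nn.Sequential(
    *vgg_block(3, 64, 2),
    *vgg_block(64, 128, 2),
    *vgg_block(128, 256, 3),
    *vgg_block(256, 512, 3),
    *vgg_block(512, 512, 3),
    nn.Flatten(),
    nn.Linear(512 * 6 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.9),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.9),
    nn.Linear(4096, 7),   # [F, D, X, Y, Z, Rx, Ry]
)

model.eval()                          # deterministic forward pass
x = torch.randn(1, 3, 192, 224)      # one cropped ROI image (192 x 224)
print(model(x).shape)                # torch.Size([1, 7])
```

In a transfer-learning setup along the lines of this section, the convolutional weights would be initialized from a pre-trained VGG-16 and only the dense regression head trained from scratch.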

Contact Area Estimation
The contact area estimation is designed to use the images acquired from the stereo camera to estimate the 2D contact area using a naive computer vision methodology, as depicted in Figure 11. The contact area estimation was put forth to analyze the effect of sensor shapes on the contact area. Similarly, the ground truth of the known sensor tips was employed to investigate the errors in the estimated area. The input frame is used to identify the deformations on the elastic tip, and the keypoints are detected using image processing techniques, such as image segmentation and blob analysis [45]. These keypoints are then used to calculate the radii (r) depending on the shape of the contact tool (l) used. The features are then used as a dataset to apply Gaussian regression to get the contact area, as shown in:

x = [r_1, r_2, r_3, l],

where r_1, r_2, r_3 are the radii from the center to the keypoints; l is the shape of the contact tool, such as circle, square, or hexagon; and x is the feature vector.

CA(x) = h(x)^T β + f(x), with f(x) ∼ GP(0, k(x, x′)),

where f(x) is a function drawn from a zero-mean Gaussian process, h(x) is the transform (basis) function, β is the hyperparameter vector, and f and h are learned in the training process.
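The Gaussian-regression step can be sketched with scikit-learn on synthetic data (entirely illustrative; in the actual pipeline, the keypoint radii are measured from the stereo images and the ground-truth areas come from the known contact tools):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
radii = rng.uniform(1.0, 5.0, size=(50, 3))          # r1, r2, r3 per sample
shape_code = rng.integers(0, 3, size=(50, 1))        # 0=circle, 1=square, 2=hexagon
X = np.hstack([radii, shape_code])                   # feature vector x = [r1, r2, r3, l]

# Synthetic "ground-truth" area: mean radius squared times a shape factor.
y = np.pi * radii.mean(axis=1) ** 2 * (1.0 - 0.05 * shape_code[:, 0])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
gp.fit(X, y)
pred = gp.predict(X[:2])   # predicted contact areas for two samples
print(pred.shape)          # (2,)
```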

Dataset Used
The tactile contact force gauge equipment used for the collection of data is shown in Figure 12. The data retrieved from the equipment is used to construct the training, validation, and testing datasets. The collected data is transferred to the control PC via a USB port and then processed using the LabVIEW GUI on the PC. Figure 12b shows the log of all the sensor data (X, Y, Z, R_x, R_y) recorded simultaneously with the stereo images. The GUI records the timestamp of the data, which is used to fuse the tactile data with the stereo images. Various shaped contact tools were employed in the experiments to obtain the force and contact location. The dataset used in the network training is divided into training, validation, and testing sets, as shown in Table 4. Data01 and Data02 are two splits of the data separated by sensor size (thin, thick). Each split is internally divided into training, validation, and testing. In Table 4, the training, validation, and testing counts are given per point because the data is acquired by applying diverse force levels from 0.1 N to 1 N at intervals of 0.1 N. Per force-applied point, Data01 contains (2 × 3380) training samples, (2 × 1680) validation samples, and (2 × 1690) testing samples; similarly, Data02 contains (2 × 2730) training samples, (2 × 910) validation samples, and (2 × 910) testing samples. In total, 122,200 images were used for training, 51,800 for validation, and 52,000 for testing.
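The per-point counts in Table 4 are consistent with the stated totals, since each count is a stereo pair (factor 2) collected at 10 force levels:

```python
# Sanity check of Table 4: per-point stereo-pair counts x 10 force levels.
force_levels = 10
data01 = {"train": 2 * 3380, "val": 2 * 1680, "test": 2 * 1690}   # thin sensor split
data02 = {"train": 2 * 2730, "val": 2 * 910,  "test": 2 * 910}    # thick sensor split

totals = {k: (data01[k] + data02[k]) * force_levels for k in data01}
print(totals)   # {'train': 122200, 'val': 51800, 'test': 52000}
```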

Training Details
The training was carried out with several aspects incorporated into the data, such as different data splits with various modes under diverse sensor sizes (thin and thick).
The training sessions were carried out on Data01 and Data02 and evaluated using the validation data at each iteration. Validation was used to prevent the network from overfitting, and the best model was then saved as the final trained network. The models were also trained with the thin and thick sensors, with induced forces of 1 N and 10 N, respectively. The model trained on Data01 with mode 1, using the stereo pair (both right and left images) acquired with the thin sensor, exhibited the best accuracy. The graphs in Figure 13 represent various training aspects, such as validations over Force (F), Displacement (D), Position (X, Y, Z), and Rotations (R_x and R_y). The seven charts in Figure 13 are the results of experiments on validation data for the 7 attributes [F, D, X, Y, Z, R_x, R_y]. Avg err is the average error over all 7 attributes, which should be as low as possible for a well-trained model. The three graphs in the bottom row of Figure 13 represent the data loss, regularization term, and total loss in the learning process. Similarly, the analysis of the training process using the Data02 split with Mode1 samples is shown in Figure 14.

Testing Evaluations
The performance evaluations were carried out for all the testing scenarios (contact force, contact position displacement, contact position rotation, contact area estimation) using several metrics, such as error rates, full scale output, average error, etc. The testings were carried out exhaustively using various shaped tools, force levels, sensor sizes, and displacements, as shown in Figure 15.

Testing Scenario-1: Force Distribution Estimation
The testing scenario of the force distribution estimation was carried out 10 times, each time with 10 force steps ranging from 0.1 N to 1 N. The performance of the trained system in predicting the force (in N) is evaluated by the error between the applied force and the estimated force value. The evaluation metric named Full Scale Output (FSO), in %, is calculated to quantify the performance of the predicted force, as shown in:

FSO (%) = ( |F_in − F_pred|_max / |F_in|_max ) × 100,

where F_in is the applied input force in Newtons (N), F_pred is the force predicted by the trained neural network in Newtons (N), |F_in − F_pred|_max is the maximum value of the difference between the actual and predicted force, and |F_in|_max is the maximum value of the applied force.
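As a small sketch (with made-up prediction errors), the FSO metric above can be computed as:

```python
import numpy as np

def fso_percent(f_in, f_pred):
    """Full Scale Output: max |F_in - F_pred| over max |F_in|, in percent."""
    f_in, f_pred = np.asarray(f_in), np.asarray(f_pred)
    return np.max(np.abs(f_in - f_pred)) / np.max(np.abs(f_in)) * 100.0

# One hypothetical test run: 10 force steps from 0.1 N to 1 N.
applied = np.arange(1, 11) * 0.1
errors = np.array([0.01, -0.02, 0.00, 0.01, -0.01, 0.02, 0.00, -0.01, 0.01, 0.02])
print(round(fso_percent(applied, applied + errors), 1))   # 2.0
```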

Testing Scenario-2: Contact Point (Displacement) Estimation along Linear X-axis, Y-axis, and Z-axis
The contact point position (displacement) along the X-axis, Y-axis, and Z-axis is estimated by the trained neural network, and the testing accuracy is calculated from the error between the original and estimated displacements along X, Y, Z. The testing evaluations were carried out as follows:
• Along Z-axis: The force is applied in the Z-direction from 0.1 N to 1 N at 0.1 N intervals, so a total of 10 tests were conducted. The difference between the original position along the Z-axis and the estimated one is recorded as the error, and the average error over 10 tests is calculated to evaluate the prediction performance.
• Along X-axis: For evaluating the displacement along the X-axis, the force is applied in intervals of 0.1 N from 0.1 N to 1 N, with 1-mm displacement steps along the X-axis, keeping the Y-axis displacement at 0. The testing is therefore done for (X = −6 mm ∼ +6 mm, with a 1 mm step interval, 13 points in total, constant Y = 0). The difference between the original position along the X-axis and the estimated one is recorded as the error, and the average error over 13 points is calculated to evaluate the prediction performance.
• Along Y-axis: For evaluating the displacement along the Y-axis, the force is applied in intervals of 0.1 N from 0.1 N to 1 N, with 1-mm displacement steps along the Y-axis, keeping the X-axis displacement at 0. The testing is therefore done for (Y = −6 mm ∼ +6 mm, with a 1 mm step interval, 13 points in total, constant X = 0). The difference between the original position along the Y-axis and the estimated one is recorded as the error, and the average error over 13 points is calculated to evaluate the prediction performance.
The evaluation metric used for these displacements along the XYZ axes is the mean absolute error (MAE), as shown in:

MAE = (1/N) × Σ_{i=1}^{N} |d_orig,i − d_est,i|,

where N is the number of tests/points performed, d_orig are the original displacement values, and d_est are the displacement values estimated by the neural network.
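A brief sketch of this MAE computation over a hypothetical 13-point X-axis sweep (values invented for illustration):

```python
import numpy as np

def mae(d_orig, d_est):
    """Mean absolute error between original and estimated displacements."""
    return np.mean(np.abs(np.asarray(d_orig) - np.asarray(d_est)))

# 13 test points from -6 mm to +6 mm in 1 mm steps, with a constant
# 0.1 mm estimation offset as a stand-in for network error.
x_orig = np.arange(-6, 7, 1, dtype=float)
x_est = x_orig + 0.1
print(round(mae(x_orig, x_est), 6))   # 0.1
```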

Testing Scenario-3: Contact Angle Estimation along Rotational R x R y Axis
The contact angle estimation along the rotational axis R x R y is evaluated over 10 tests in which the force is applied from 0.1 N to 1 N at 0.1 N intervals. The sensor is rotated about the X-axis and Y-axis to a calibrated angle of 45°, which is considered the ground truth for the rotational test scenarios. The mean absolute error (MAE) between the estimated and original angle is calculated, as shown in:

MAE = \frac{1}{N} \sum_{i=1}^{N} \left| R_{a,orig} - R_{a,est} \right|

where N is the number of tests performed, R_{a,orig} is the original angle of 45°, and R_{a,est} is the angle estimated by the neural network. Because the stability of the installation heavily influences performance under rotational motions, a constructive ground truth of 45° was calibrated to mitigate system installation issues.

Testing Scenario-4: 2D Contact Area Estimation
The testing for the 2D contact area estimation was carried out using contact tools of various shapes that are used to contact the elastic tactile tip. The contact area estimates are derived from the Gaussian regression process described earlier. The ground truth (GT) of the contact area is fixed when the tool makes contact, and it is used to calculate the error between the estimated and GT values. The performance is evaluated by the error rate in (%), as shown below:

Error (\%) = \frac{\left| CA_{GT} - CA_{est} \right|}{CA_{GT}} \times 100

where CA_{GT} is the ground-truth contact area, and CA_{est} is the contact area estimated from the Gaussian regression.
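The error-rate computation above amounts to a relative error scaled to percent. A minimal sketch, using the circular-tool ground truth of 78.54 mm² (π·r² with r = 5 mm) and a hypothetical estimate:

```python
def area_error_pct(ca_gt, ca_est):
    """Error rate (%) = |CA_GT - CA_est| / CA_GT * 100."""
    return abs(ca_gt - ca_est) / ca_gt * 100.0

# Circular tool: GT = 78.54 mm^2; 77.50 mm^2 is a hypothetical estimate.
print(round(area_error_pct(78.54, 77.50), 3))  # 1.324
```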

Force Distribution Estimation
The force estimation carried out using the trained network was validated using 10 different tests, each recorded within a force range of 0.1 N ∼ 1 N. The estimation errors were recorded in N and were used to calculate the FSO (%) scores. The force estimation errors recorded over all 10 tests are depicted in Table 5. The overall average across the 10 tests is around 0.022 N, which is accurate enough for the system to rely on the estimations for future predictions. The FSO (%) scores of all 10 tests are plotted in Figure 16, and the average FSO (%) score appeared promising within the force range of 0.1 N ∼ 1 N. The complete data samples, their corresponding estimation errors for each iteration (test), their averages, and the FSO (%) scores are presented in Table A1 in Appendix A.
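Assuming FSO (%) here denotes the estimation error expressed as a percentage of the full-scale output (taken as 1 N, the upper end of the 0.1 N ∼ 1 N test range; this interpretation and the names below are illustrative), the score can be sketched as:

```python
FULL_SCALE_N = 1.0  # assumed full-scale force of the test range (0.1 N ~ 1 N)

def fso_percent(abs_error_n):
    """Express an absolute force error (N) as a percentage of full scale."""
    return abs_error_n / FULL_SCALE_N * 100.0

# The reported average error of ~0.022 N would correspond to ~2.2 %FSO.
print(round(fso_percent(0.022), 1))  # 2.2
```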

Contact Position Estimation w.r.t X,Y,Z Axes
The contact position displacement errors were calculated and estimated for each force value ranging from 0.1 N ∼ 1 N with respect to each ground-truth value in X and in Y spanning −6 mm ∼ +6 mm. The displacement errors covering all the possible test ranges are depicted in Tables A2 and A3 in Appendix A. Table A2 in Appendix A presents the test results in terms of displacement error in the contact position along the X-axis. Similarly, Table A3 in Appendix A presents the test results in terms of displacement error in the contact position along the Y-axis. The average displacement error readings corresponding to the 13-point ground-truth values (−6 mm ∼ +6 mm) over the force range (0.1 N ∼ 1 N) are shown in Figure 17. The contact position displacement error along the Z-axis was calculated by evaluating the estimated Z-axis displacement for each force value ranging from 0.1 N ∼ 1 N. The overall estimation errors w.r.t the force values are depicted in Figure 18. The results of the contact position displacement estimation in the X, Y, Z axes revealed the performance of the network in predicting the position estimates. The estimation error along the X and Y axes is greater than that along the Z-axis. The reason is that motion in X and Y inherently involves Z as well; therefore, even while acquiring X and Y data, the underlying Z data keeps feeding into the system.

Contact Angle Estimation w.r.t Rotational R xy Axis
The contact position estimation in terms of angular displacement was calculated through a series of tests within the force range of 0.1 N ∼ 1 N. The sensor is rotated to a fixed angle of 45°, and the tests were performed with a contact tool touching the inclined sensor. The reason for the calibrated ground-truth fixed angle of 45° is discussed in Appendix A and is depicted clearly in Appendix A, Figure A1a. The trained neural network was able to predict/estimate the angular displacement in the contact position. The results of the estimated displacement w.r.t each force value are depicted in Figure 19.

Contact Area Estimation
The contact area estimation is carried out using the image processing algorithms and Gaussian regression. The estimated contact area is cross-checked against the ground truth corresponding to the various contact tool shapes. The corresponding results are reported in Table 6. Figure 20 illustrates the estimation errors w.r.t the circular tool (GT = 78.54 mm²), the square tool (GT = 100.00 mm²), and the hexagonal tool (GT = 64.95 mm²).
Different numbers of samples were considered for each tool shape during testing: circular (n = 20), square (n = 18), and hexagonal (n = 18). The results suggest that the estimation of the contact area for the hexagonal tool is more prone to errors. Overall, however, the total average error is 1.429% across all the contact tool shapes.

Conclusions
This work reports the usage of deep learning-based visual-tactile sensor technology for the estimation of force distribution, contact position displacement along the X, Y, Z directions, angular displacement along the R x R y direction, and contact area. The current study also reports the design aspects, such as the choice of thickness and materials used for the tactile fingertips, encountered during the development of the tactile sensor. The image acquisition was carried out using a compact stereo camera setup mounted inside the elastic body to observe and measure the amount of deformation caused by the motion and input force. Transfer learning was employed using the VGG16 model as a backbone network. Several tests were conducted to validate the performance of the network in estimating the force, contact position, angle, and area using calibrated ground-truth values over a force range of 0.1 N ∼ 1 N, a position range of −6 mm ∼ +6 mm, and a fixed angular value of 45°. The tests were also carried out using thick and thin tactile sensors with various shapes, such as circle, square, and hexagon, along with their ground-truth areas. The results show that the average estimation errors for force, contact position in X, Y, Z, contact angle, and contact area are 0.022 N, 1.396 mm, 0.973 mm, 0.109 mm, 2.235°, and 1.429%, respectively. Future work should include improvements in handling system stability in terms of tactile sensor sensitivity w.r.t the reference axes and movements in the vicinity. Nevertheless, the results reported in the study demonstrate the significance of the vision-based tactile sensor using deep learning as an inference tool.

Acknowledgments: We thank Nam-Kyu Cho and Kwang-Beom Park from the Smart Sensor Research Center at Korea Electronics Technology Institute, Seongnam, Korea for their resources and technical support in performing image-based tactile sensor repetitive reliability testing.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Supplementary Test Results
The trained neural network was employed on the series of ten tests, and the force distribution estimation errors were recorded, along with the FSO (%) scores. The test results are recorded and presented in Table A1. The contact position displacement errors were estimated using a simple mean absolute error calculation; however, extensive tests were carried out to retrieve these results. The tests involve specific ground-truth (GT) values for X and Y in a range of −6 mm ∼ +6 mm over a force range of 0.1 N ∼ 1 N. For the X-axis tests, the displacement along the X-axis is incremented in 1 mm steps while the Y value is kept at 0 mm, and vice versa for the Y displacement estimation. The neural network's performance in estimating the contact position displacement along the linear X and Y axes is recorded and reported in Tables A2 and A3. The contact position estimation errors in the above two tables reach nearly 4 mm ∼ 6 mm in a few cases. The main reason for this is the sensitivity of the tactile sensor setup w.r.t the workbench; it is slightly dependent on vibrations in the vicinity of the sensor. For instance, when a certain activity, such as walking or jumping, happens around the sensor setup and the applied force is in the lower magnitudes around 0.1 N, vibrations are induced into the system. This eventually causes errors of around 4 mm ∼ 6 mm in the contact position estimates. However, the overall average error of the contact position still remains less than 1.4 mm in X and less than 1 mm in Y, as discussed earlier in Table 5. In addition, the stability of the installation setup heavily influences the performance due to the rotational motions. Therefore, a constructive ground truth of 45° is calibrated so as to prevent the system installation issues, as shown in Figure A1a. Figure A1 gives a glimpse of various aspects, such as dimensions, camera, and use-cases, that might interest developers.