Comparison of Graph Fitting and Sparse Deep Learning Model for Robot Pose Estimation

The paper presents a simple yet robust computer vision system for tracking a robot arm with RGB-D cameras. Tracking here means measuring, in real time, the robot state given by three angles, under known constraints on the robot geometry. The tracking system consists of two parts: image preprocessing and machine learning. In the machine learning part, we compare two approaches: fitting the robot pose to the point cloud and fitting a convolutional neural network model to sparse 3D depth images. The advantage of the presented approach is that the point cloud, transformed to a sparse image, is used directly as the network input, together with sparse convolutional and pooling layers (sparse CNN). The experiments confirm that the robot tracking is performed in real time and with an accuracy comparable to the accuracy of the depth sensor.


Introduction
The main object under study is the uArm Swift Pro [1], a robotic arm with three degrees of freedom (3DOF). Three angles describe the arm state: $\alpha_0$, $\alpha_1$ and $\alpha_2$, as shown in Figure 1. We assume known constraints on the robot geometry: the robotic arm consists of two links of known dimensions. The main part of the observer is the Intel RealSense D-435 depth camera [2]. It consists of two imagers: a depth sensor (enabling depth vision with a range of up to 10 m) and an RGB camera. The research aims to obtain the robot state from RGB-D images as quickly and accurately as possible. The main application of this research may be an additional tool for tracking the robot behaviour in the workspace, e.g., for safety purposes [3]. The second application is to provide a general-purpose tool for tracking skeleton-type objects. In this second case, the robot arm is a convenient test bed for the accuracy of visual tracking because the robot API provides feedback information about the robot arm state. The tracking system consists of image preprocessing and machine learning parts. In the machine learning part, we compare two approaches: fitting the robot pose and fitting a convolutional neural network (CNN). The input images are transformed into a 3D point cloud and then into sparse 3D depth images for the machine learning algorithms. In such a case, the use of sparse convolutional and pooling layers is natural and convenient, and it can also significantly reduce the computation time.

Related Research
The task of tracking the robot arm is far from new. Many tracking systems use depth sensors and RGB-D cameras [3][4][5][6]. The main challenges in pose estimation concern the time and accuracy of estimation. Similar techniques are used for human body tracking [7,8]. For example, in [9], a body tracking system with an accuracy of 4-10 cm is reported, but the reported computation time is about 60 s per frame. In [10], human body tracking
The presented approach uses RGB-D cameras with an additional depth channel. Another possible approach to tracking objects is to apply an event camera (or neuromorphic camera). Compared to RGB-D cameras, such cameras are less popular, harder to obtain and more expensive. With additional preprocessing steps, such cameras can also provide depth estimation [17].
A neuromorphic camera is characterised by a high dynamic range, low power consumption and low bandwidth, as the signal encoding is intrinsically compressed at the acquisition level. The sensor also achieves a very high frame rate. The output of an event camera registers only independent pixels that respond asynchronously to relative contrast changes, not the full array of pixels. Event cameras are also used effectively for robot tracking, enabling tracking at a theoretically higher speed [18,19].
For the D-435 sensor used here, the advantages are low cost, HD resolution and a direct depth channel. The disadvantage is a low frame rate, which is a limitation of the presented approach.
Another restriction of the presented approach is using an exterior camera that sees the entire work area. There are also approaches using an end-effector mounted camera. Such cameras can also be used to recover the robot position [20]. However, generally, such end-effector mounted cameras are used for other tasks such as object recognition or 3D scanning [21].
Sparse signals are a natural representation of images obtained from depth sensors. Sparse layers and inputs are less popular in CNNs than dense ones, but they have also been considered (e.g., [22,23]). Some kinds of sparsity are available in commonly used deep learning libraries. For example, PyTorch [24] provides some functionality for sparse matrix computations. TensorFlow [25] also provides simple operations for sparse signals, but without an implementation of sparse convolutional layers. The functionality of PyTorch for sparse layers may be extended with external libraries, e.g., spconv [26] or SparseConvNet [27]. Recent approaches using convolutional neural networks take a point cloud directly as the input [28][29][30][31][32][33]. Such structures may be used effectively for both segmentation and recognition purposes. In contrast to point-based methods such as [31], we use a volumetric-based method, as proposed in [32]. The main weakness of volumetric-based methods is their computational complexity when using a dense representation.
In our approach, we also put the point clouds into the 3D sparse matrix and use it as sparse input to a sparse layer of a CNN. A CNN-based regression model predicts the state of the robotic arm. The presented solution shows that input images in the form of a point cloud can be handled effectively using existing solutions and tools with sparse layers.
In the following section, we present the experiment's theoretical background and settings.

Materials and Methods
The following section describes the preprocessing details and two considered approaches: fitting the robot pose to the point cloud and fitting the sparse CNN. The whole sequence of the image preprocessing and tracking steps is presented in Figure 2.

Image Preprocessing
Image preprocessing is common to both graph fitting and deep learning. Preprocessing an image consists of the following steps: (1) removing the background from the image, (2) transforming it to a point cloud and (3) unifying the coordinates to make them independent of the camera perspective.
The initial part of preprocessing is removing the background. We use our own clustering-based approach (cf., e.g., [34][35][36]). In our experiments, it worked better than the online learned background subtractor from OpenCV [37]. The k-means algorithm is a simple and efficient algorithm for clustering [38]. The algorithm minimises the inertia criterion:
$$\sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2,$$
where $C$ is a set of clusters (each cluster is represented by its mean $\mu$). The minimisation is performed in two repeating steps: (1) assign each sample $x$ to the closest cluster centre; (2) update the cluster parameters by averaging the coordinates of all of the samples assigned to the cluster.
First, we train the subtractor using k-means with two clusters (k = 2), and we identify the clusters connected with the background (negative centres) and the clusters associated with motion (positive centres). The background and motion masks are fitted on grey-scale images, and the result is applied to the depth image.
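The two repeating k-means steps, applied independently to the grey-level history of every pixel, can be sketched as follows. This is a minimal NumPy illustration and not the implementation used in the experiments; the function name, the min/max initialisation and the rule that the more populated cluster is the background (negative centre) are assumptions made for the example.

```python
import numpy as np

def fit_pixelwise_two_means(frames, n_iter=10):
    """Run k-means with k = 2 on the grey-level history of every pixel.

    frames: array of shape (T, H, W) with grey-scale values.
    Returns (negative, positive): two (H, W) images of cluster centres.
    Assumption: the cluster holding more samples is the background.
    """
    frames = np.asarray(frames, dtype=np.float64)
    n_frames = frames.shape[0]
    # Initialise the two centres with the per-pixel minimum and maximum.
    lo = frames.min(axis=0)
    hi = frames.max(axis=0)
    for _ in range(n_iter):
        # Step 1: assign each sample to the closest centre.
        to_hi = np.abs(frames - hi) < np.abs(frames - lo)
        n_hi = to_hi.sum(axis=0)
        n_lo = n_frames - n_hi
        # Step 2: update each centre as the mean of its samples
        # (an empty cluster keeps its previous centre).
        sum_hi = np.where(to_hi, frames, 0.0).sum(axis=0)
        sum_lo = frames.sum(axis=0) - sum_hi
        hi = np.where(n_hi > 0, sum_hi / np.maximum(n_hi, 1), hi)
        lo = np.where(n_lo > 0, sum_lo / np.maximum(n_lo, 1), lo)
    # Label the centres: the larger cluster is taken as the background.
    to_hi = np.abs(frames - hi) < np.abs(frames - lo)
    background_is_hi = to_hi.sum(axis=0) > n_frames / 2
    negative = np.where(background_is_hi, hi, lo)
    positive = np.where(background_is_hi, lo, hi)
    return negative, positive
```

For a pixel that never changes, both centres coincide, so the difference between the positive and negative images is zero there, which is consistent with treating such a pixel as static background.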
In the next step, we remove the background using a static mask for the positive and negative clusters. We treat a pixel as background when the distance between its negative and positive centres is less than a threshold. The selected threshold depends on the within-cluster variance. When the distance of a pixel from the background mask is greater than the threshold, the pixel is treated as being in motion. The whole procedure is presented in Table 1. An illustration of the procedure and an exemplary result are shown in Figure 3.

Table 1. Background removal procedure.

Fitting the subtractor:
1. For each pixel: do k-means with k = 2.
2. Store and identify the k-means result as negative pixels (background) and positive pixels (motion).
3. Create a background mask by subtracting the positive and negative images and binarise the result to the threshold.

Removing the background:
1. Resize down the input image to the size of the positive and negative images.
2. Select a threshold; points below the threshold are thrown away.
3. Mark as static background pixels those whose distance from the background mask is greater than the motion mask.
4. Resize the image to native resolution.
5. Apply the background mask to the depth image and remove background pixels.
6. Transform the depth image to the 3D point cloud.

Figure 3. (Panels: Negative Centers, Positive Centers.)

After removing the static background, the final step of the image preprocessing is the transformation of the depth image to a point cloud. The 3D point cloud is the input for graph fitting and the sparse CNN. To unify the views from the two cameras, we perform a simple calibration and a projection to a unified coordinate system independent of the camera.
We perform the calibration process in the following way. The robot arm stays in a given position (e.g., $\alpha_0 = 0^\circ$, $\alpha_1 = 90^\circ$, $\alpha_2 = 0^\circ$), and we fit the graph to the point cloud of the arm. Next, based on the coordinates of the graph points, we produce an orthogonal transformation matrix to a coordinate system independent of the camera view. The last step of the preprocessing procedure is projecting the 3D point cloud to the unified coordinate system. Figure 4 shows images obtained at selected preprocessing steps. An exemplary view after projection to the camera-independent coordinate system is shown in Figure 5. After removing the static background and transforming the depth image to a point cloud, we obtain about 3000-5000 points per frame.
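The last two preprocessing steps can be illustrated with a short sketch. The text does not specify how the depth image is back-projected, so the pinhole camera model below is an assumption, as are the function names and the intrinsic parameters `fx`, `fy`, `cx`, `cy`. The rotation matrix and origin stand for the output of the calibration described above.

```python
import numpy as np

def depth_to_point_cloud(depth_mm, fx, fy, cx, cy):
    """Back-project a depth image (in mm, 0 = no data) to an (N, 3) point cloud.

    Assumes a pinhole camera with focal lengths fx, fy and principal
    point (cx, cy), all given in pixels.
    """
    depth_mm = np.asarray(depth_mm, dtype=np.float64)
    v, u = np.nonzero(depth_mm > 0)  # rows (v) and columns (u) with valid depth
    z = depth_mm[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.column_stack([x, y, z])

def to_unified_frame(points, rotation, origin):
    """Express camera-frame points in a camera-independent frame.

    rotation: (3, 3) orthogonal matrix whose rows are the new axes,
              written in camera coordinates.
    origin:   (3,) position of the new origin in camera coordinates.
    """
    points = np.asarray(points, dtype=np.float64)
    return (points - np.asarray(origin)) @ np.asarray(rotation).T
```

Because the matrix is orthogonal, distances between points are preserved, so clouds from both cameras can be compared in the same frame after the projection.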

Graph Fitting
The graph fitting approach presented here consists of two steps. The first step is unsupervised and fits the graph pose to the point cloud. The second step is a supervised correction of the estimated angles using feedback information about the robot states.
Let $G = (E, C)$ be a graph with vertices (nodes) $C$ and edges $E$. The graph fitting procedure aims to fit the graph nodes $C$ to the point cloud $X$ for a given set of edges $E$. The graph fitting task does not require a large number of input points, so the number of points in the point cloud can be reduced in different ways. The simplest methods are resampling or using the k-means algorithm. In the presented approach, we use the k-means algorithm; its computation time is acceptable, and it gives more accurate results. The influence of the number of points used in the k-means algorithm on the accuracy and the computation time is presented in Table 2.
The points remaining after the k-means procedure can be used to find a graph representation of the robot pose. The aforementioned graph $G = (E, C)$ spans the set of nodes $C = \{c_1, \dots, c_k\}$ with edges $E$. The edges define the robot arm structure. We assume that the set of edges $E$, i.e., the structure of the robot skeleton, is known. For a given point cloud $X = \{x_1, x_2, \dots, x_n\}$, by graph fitting we mean a geometrical fitting of the graph structure (spanned on the nodes $C$) to the given points $X$.
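The reduction of the point cloud with k-means mentioned above might look as follows with scikit-learn. This is only a sketch: the function name and the default of 50 representative points are illustrative choices, not the settings reported in Table 2.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_point_cloud(points, n_points=50, seed=0):
    """Replace an (N, 3) point cloud by n_points k-means cluster centres."""
    points = np.asarray(points, dtype=np.float64)
    # n_init=1 keeps the run time low; more restarts trade time for quality.
    km = KMeans(n_clusters=n_points, n_init=1, random_state=seed)
    km.fit(points)
    return km.cluster_centers_
```

Unlike plain resampling, the returned centres are averages of nearby points, which also smooths some of the depth noise before the graph is fitted.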
Many different measures of the closeness of the graph skeleton to the point cloud have been considered for the optimisation procedure. In this paper, we proceed as follows. In the first step, we extend the set of graph nodes $C$ by adding linearly spaced points on the graph's edges. This operation gives an extended set of points $C_{\mathrm{ext}}$. Next, we use a quantity $\mathrm{dist}(C, X)$ as the closeness measure; it is similar to the Hausdorff metric with ∑ instead of max. The computation of $\mathrm{dist}(C, X)$ is illustrated in Figure 6. With such a quality function, any optimisation procedure may be used. The output points of the graph fitting procedure (graph nodes $C$) can be used directly to determine the robot arm state (given by the angles $\alpha_0$, $\alpha_1$, $\alpha_2$) via geometric constraints. However, this is somewhat inaccurate, mainly because the depth sensor provides a view of only one side of the arm. To correct the robot state estimates, we fit a nonlinear regressor that takes the points obtained from the graph fitting procedure as input and the given robot states as output.
We considered two variants of regressors included in the scikit-learn package: a two-layer perceptron (trained with the MSE criterion) and a Huber regressor (trained with an approximated MAE criterion) with a nonlinear part. Our experiments show that the Huber regressor is more robust to outliers. As the nonlinear part, we used RBF kernels with the Nyström method [39]. A detailed comparison of the accuracy of different kernel regressors is presented in Figure 7. As one can see, the Huber regression with nonlinear kernels attains the best results.
Figure 7. A comparison of accuracies for the considered regressors with different kernel transformations (thanks to [40]): MLP (tanh kernel with a linear output layer, using the MSE criterion), SGD (RBF kernels with the Nyström method and SGDRegressor using the MSE criterion), Huber (RBF kernels with the Nyström method and a robust Huber linear regressor).
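As an illustration of a Hausdorff-like closeness measure with sums in place of maxima, the sketch below first builds the extended node set and then adds up nearest-neighbour distances in both directions. The exact form of dist(C, X) is not reproduced here, so the symmetric sum is an assumed reading, and the function names and the number of points per edge are hypothetical.

```python
import numpy as np

def extend_nodes(nodes, edges, n_per_edge=10):
    """Add linearly spaced points along every edge (end points included).

    nodes: (k, 3) array of node coordinates; edges: list of (i, j) index pairs.
    Returns the extended set as an array of shape (len(edges) * n_per_edge, 3).
    """
    nodes = np.asarray(nodes, dtype=np.float64)
    t = np.linspace(0.0, 1.0, n_per_edge)[:, None]
    segments = [(1.0 - t) * nodes[i] + t * nodes[j] for i, j in edges]
    return np.vstack(segments)

def sum_hausdorff_distance(c_ext, points):
    """Sum of nearest-neighbour distances from the cloud to the graph and back.

    Assumed form: the two suprema and the outer maximum of the Hausdorff
    metric are all replaced by sums.
    """
    c_ext = np.asarray(c_ext, dtype=np.float64)
    points = np.asarray(points, dtype=np.float64)
    # Pairwise Euclidean distances, shape (len(c_ext), len(points)).
    d = np.linalg.norm(c_ext[:, None, :] - points[None, :, :], axis=2)
    return d.min(axis=0).sum() + d.min(axis=1).sum()
```

A generic optimiser can then minimise this value over the node coordinates, or over the angles when the link lengths are fixed.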
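A regressor of the Huber type described above can be assembled from standard scikit-learn components. The sketch below is one possible arrangement, not the configuration used in the experiments: the scaling step, the number of Nyström components and the use of `MultiOutputRegressor` (one Huber model per angle, since `HuberRegressor` predicts a single target) are assumptions.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import HuberRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_angle_corrector(n_components=100, gamma=None, seed=0):
    """Map flattened graph-node coordinates to the three robot angles."""
    return make_pipeline(
        StandardScaler(),
        # Approximate RBF feature map (Nystroem method).
        Nystroem(kernel="rbf", gamma=gamma, n_components=n_components,
                 random_state=seed),
        # Robust linear model on top of the kernel features, one per angle.
        MultiOutputRegressor(HuberRegressor(max_iter=1000)),
    )
```

The model is fitted with `fit(nodes_flat, angles)`, where `nodes_flat` holds one row of node coordinates per frame and `angles` holds the corresponding robot states from the API; both names are placeholders.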

Sparse Convolutional Neural Network
The second model is a CNN-based regressor that predicts the robot state using the point cloud obtained from the preprocessing part. The point cloud is transformed into a 3D sparse array using a voxel-based representation. We divide the point cloud space into 4 × 4 × 4 mm cubes and set a nonzero value of the array element if the corresponding cube contains a point from the cloud. As a result, we obtain a sparse array with the dimensions 750 × 750 × 1375, which is the input for the network. Figure 8 presents the complete structure of the sparse convolutional neural network. The output is the prediction of the angles that determine the robot's state. Knowing the robot geometry and using forward kinematics, we can quickly obtain the positions of the nodes of interest.
The last tracking step is smoothing the measurements using the Kalman filter [41], a standard procedure for smoothing signals produced by a linear dynamical system disturbed by normal noise. The procedure assumes a system of the form:
$$x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t,$$
where $x_t$ is the hidden state, $y_t$ is the observation, $A$ is the state transition matrix, $C$ is the observation matrix and $w_t \sim \mathrm{Normal}(0, Q_t)$, $v_t \sim \mathrm{Normal}(0, R_t)$ are random state and observation noises. The filter parameters are fitted using the EM algorithm [42,43], which optimises the log-likelihood criterion with unknown model parameters $(A, C, Q, R)$. The filter predictions are the hidden (or filtered) state signals for a given list of observations. In our experiments, we use the pykalman library [44]. Filtering is the final step, common to the graph fitting and sparse CNN outputs.
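The voxel-based transformation of the point cloud can be sketched in a few lines of NumPy. The voxel size and grid dimensions follow the text; the function name and the grid origin parameter are assumptions, and the sketch stops at the list of occupied voxel indices without building a tensor for any particular library.

```python
import numpy as np

def voxelize(points_mm, origin_mm, voxel_mm=4.0, grid_shape=(750, 750, 1375)):
    """Return the unique integer indices (K, 3) of voxels that contain a point.

    points_mm: (N, 3) point cloud in millimetres.
    origin_mm: (3,) corner of the grid (assumed parameter).
    Points falling outside the grid are dropped.
    """
    points_mm = np.asarray(points_mm, dtype=np.float64)
    idx = np.floor((points_mm - np.asarray(origin_mm)) / voxel_mm).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    # Several points may share a voxel; keep each occupied voxel once.
    return np.unique(idx[inside], axis=0)
```

Only the occupied indices are stored, so a frame with a few thousand points needs a few thousand entries instead of the 750 × 750 × 1375 elements of the dense array. A sparse layer would typically take these indices together with a feature value of one per voxel.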
The following section presents the description and results of the experiments.

Preparation of Experiment
In the experimental part, we prepare the learning and testing sequences, and we train two models on the learning sequence: the sparse CNN and graph fitting. Next, we compare the models on the testing data set. During the experiments, we collected two kinds of data in parallel: (1) data from the cameras (depth and RGB), and (2) feedback information about the robot state (angles and positions) taken directly via the robot API. These two kinds of signals are sampled at different time instants. The information about the robot's state was taken every 200 ms (every data point was marked with a timestamp). Then, the state information was interpolated to the timestamps of the frames received from the cameras (also marked with timestamps).
The training sequence included 4240 images, and it was recorded while the robot performed a motion along a grid of angles. The trajectory was chosen to cover most of the working area. For learning purposes and to avoid over-fitting, we divided the training data set into two parts, for learning and validation, in the proportion 80%:20%. In the learning part, we added a simple data augmentation by adding a certain number of images with salt-and-pepper noise to the data set. The test data set is independent (it contains 980 frames) and was recorded after training. During the testing trajectory, the robot arm drew a sequence of squares. In all cases, we used an observation sequence at 6 FPS to reduce the number of frames. The time of observations was about 700 s for training and 160 s for testing.
In the experiments, we recorded two video sequences from two independent devices placed about 1 metre from the robot arm, with an angle of about 60 degrees between them, as shown in Figure 1 (marked as the Left and Right cameras). The test trajectory was prepared in such a way that, on a particular part of the trajectory, the robot arm was directly in front of the left camera (Figure 1). This is a difficult case to measure, but we wanted to check the robustness of our solution.
We performed the experiments on a machine with an i7-8700 CPU (3.20 GHz, 16 GB RAM) in single-thread mode and without a GPU. We used Python, with the machine learning part in the scikit-learn package [40] and some image processing in the OpenCV library [37]. The convolutional neural networks were developed in PyTorch [24]; using the spconv library [26], we added sparse 3D convolutional and pooling layers to the PyTorch CNN model.
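Aligning the robot states with the camera frames amounts to a one-dimensional interpolation per state component. The sketch below assumes linear interpolation, which the text does not specify, and uses hypothetical names; `np.interp` holds the first or last state constant for frames outside the recorded state interval.

```python
import numpy as np

def interpolate_states(state_times, states, frame_times):
    """Linearly interpolate robot states to the camera frame timestamps.

    state_times: (M,) increasing timestamps of the robot API readings.
    states:      (M, D) state values (e.g., three angles) at those times.
    frame_times: (N,) timestamps of the camera frames.
    Returns an (N, D) array of states aligned with the frames.
    """
    states = np.asarray(states, dtype=np.float64)
    columns = [np.interp(frame_times, state_times, states[:, k])
               for k in range(states.shape[1])]
    return np.column_stack(columns)
```

With states sampled every 200 ms and frames at about 6 FPS, each frame falls between two neighbouring readings, so the interpolated value is a weighted average of those two readings.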
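The data preparation described above can be sketched as follows. The noise amount, the value range, the random split and the function names are assumptions made for the example; the text states only the 80%:20% proportion and the type of noise.

```python
import numpy as np

def salt_and_pepper(image, amount=0.02, low=0, high=255, rng=None):
    """Return a copy of the image with a fraction `amount` of pixels
    set to `low` (pepper) or `high` (salt), half each on average."""
    rng = np.random.default_rng(rng)
    noisy = np.array(image, copy=True)
    r = rng.random(noisy.shape)
    noisy[r < amount / 2] = low
    noisy[r > 1 - amount / 2] = high
    return noisy

def split_train_validation(n_samples, validation_fraction=0.2, rng=None):
    """Randomly split sample indices into learning and validation parts."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(n_samples)
    n_val = int(round(n_samples * validation_fraction))
    return order[n_val:], order[:n_val]
```

For the 4240 training images, a 20% validation part contains 848 frames and the learning part 3392 frames, before any augmented copies are added.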

Results
The following charts present the results and illustrate the solution's quality for predicting the robot state (given by three angles) and the robot position. Figures 9 and 10 present a detailed comparison of estimation of the robot state on learning data for the graph fitting model and the sparse CNN model. Similar comparisons of the test data are presented in Figures 11 and 12. Shaded areas indicate that the robotic arm is in front of the camera; one may expect reduced accuracy in such a region. Additionally, in Figures 13 and 14, we present the accuracy of the prediction of robot position. A qualitative comparison of the accuracy for the whole training trajectory is presented in Figures 15 and 16. A quantitative comparison of the accuracy of the two models for robot state and positions is presented in Table 3.
As expected, the accuracy on the test set for the left camera is a bit lower than that for the right camera. In the first case, the camera is in front of the robot arm. In the second case, the camera looks from the side. The mean error of angle estimation is less than 3 degrees, and the error of the estimated robot position is less than 20 mm. Tables 2 and 4 provide detailed information about the computation time, broken down by successive operations. As we can see, preprocessing, including getting a frame from the camera, takes about 12 ms. For the sparse CNN model, we obtain an observation speed of about 50 FPS, and for the graph fitting model, the speed is about 30 FPS.
Figure 9. Comparison of the angle prediction accuracy for graph fitting and sparse CNN on the training data set, learned on the image from the right camera. The shaded area indicates that the robotic arm is in front of the camera, which may cause a loss of accuracy.
Table 3. The accuracy of the two models, graph fitting and sparse CNN, on the training and test data sets ($\alpha_0$, $\alpha_1$, $\alpha_2$, $x$, $y$, $z$ are mean absolute errors for angles and positions in comparison to the real robot position; $\mathrm{err}_\alpha$ is the averaged value; $\mathrm{err}_{\mathrm{pos}} = \sqrt{x^2 + y^2 + z^2}$ is the error of position).
Figure 11. Comparison of the angle prediction accuracy for graph fitting and sparse CNN on the testing data set, learned on the image from the left camera. The shaded area indicates that the robotic arm is in front of the camera.
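The error measures named in the caption of Table 3 can be computed as in the sketch below. It assumes that the position error is the Euclidean norm of the per-axis mean absolute errors; the function and key names are hypothetical.

```python
import numpy as np

def tracking_errors(angles_pred, angles_true, pos_pred, pos_true):
    """Mean absolute errors for angles and positions, plus summary values.

    angles_*: (N, 3) arrays of the three angles; pos_*: (N, 3) arrays of x, y, z.
    """
    mae_angles = np.mean(np.abs(np.asarray(angles_pred) - np.asarray(angles_true)), axis=0)
    mae_xyz = np.mean(np.abs(np.asarray(pos_pred) - np.asarray(pos_true)), axis=0)
    return {
        "alpha": mae_angles,                       # per-angle MAE
        "err_alpha": float(mae_angles.mean()),     # average over the three angles
        "xyz": mae_xyz,                            # per-axis MAE
        # Assumed form: Euclidean norm of the per-axis errors.
        "err_pos": float(np.sqrt(np.sum(mae_xyz ** 2))),
    }
```

The angle errors are in the units of the inputs (degrees here) and the position errors in millimetres when the positions come from forward kinematics in millimetres.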

Discussion
In the paper, we presented two approaches to tracking a robot pose. The first one uses graph fitting and is an unsupervised approach. We can use it even when we do not know the exact angles. The experiments show that this approach has somewhat better accuracy than the sparse CNN model. The error of the detected angles is about 1.8°-3.3° for the graph fitting approach and 2.6°-3.3° (similar or slightly worse) for the convolutional network (sparse CNN).
The distance measurement accuracy of the D-435 depth sensor is about 2% of the measured distance (as reported in [2]). At the scale of our experiment, this corresponds to about 1-2 cm. As declared by the producer of the uArm Swift Pro, its positioning accuracy is about 0.2 mm. For both models, the accuracy of the measurement of the robot's position is in the range of 10-20 mm. This is at the level of the accuracy of the D-435 sensor (or even a bit below). The experiments show that the presented solution works in real time, with a tracking speed from about 30 FPS (for the graph fitting approach) to 50 FPS (for the sparse CNN model). We should mention that the accuracy and the speed of computations may be too low for feedback control purposes. In this case, the robot's built-in angle sensors are much more accurate and faster. However, the presented approach may be acceptable for some tasks, e.g., trajectory planning or an additional loop for safety purposes. In this paper, we show that we can observe the robot state with a speed and accuracy comparable to the RGB-D camera parameters. The presented approach may also be attractive for higher-DOF cases in which we do not have complete or accurate feedback about the robot's state, e.g., a drone in closed areas or a robotic arm on a mobile platform. However, in such cases, the direct application of sparse CNNs may be inaccurate and may need an additional object detection step.
There is some potential to speed up the solution, especially for the k-means and the graph fitting model. There is also a clear possibility to speed up the prediction of the sparse CNN models. In the experiments, we used only a single CPU thread; using more threads or a GPU can easily speed up the computations. Another possibility to speed up the measurements is to use an event camera. The output of event cameras has a representation similar to the sparse representation used in our approach; thus, it is one possible extension of the presented research.
The graph fitting approach is unsupervised, and we can use it even when we do not have an exact measurement of the angles. The restriction of this method is the computation time for more complicated (higher-DOF) graph structures. The sparse CNN is a purely supervised approach, and it also has some restrictions: for example, in tasks such as human hand or body tracking, or tracking drones in closed areas, labelling the data is an additional challenge. Both methods can be combined by using graph fitting offline to produce the training targets for the CNN. We also showed that the data from depth sensors in the form of a 3D point cloud can be quickly processed in a convolutional network using existing libraries. The only steps needed are transforming the point cloud to a sparse array and the direct use of sparse layers.
Data Availability Statement: The data set was recorded during an experiment in our laboratory. Video sequences and other data used in the experiments are available at https://edysk.zut.edu.pl/index.php/s/yLBiJPH6WGAG2t6 (accessed on 24 August 2022). To replicate the results, please follow https://github.com/jrod12/py_realsense_sensors (accessed on 24 August 2022) and the readme.md file.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: