Depth Estimation of a Deformable Object via a Monocular Camera

Abstract: Depth estimation of 3D deformable objects has become increasingly crucial to various intelligent applications. In this paper, we propose a feature-based approach for accurate depth estimation of a deformable 3D object with a single camera, which reduces the problem of depth estimation to a pose estimation problem. The proposed method first requires a 3D reconstruction of the target object. With this reconstruction as an a priori model, only one monocular image is required afterwards to estimate the target object's depth accurately, regardless of pose changes or deformation of the object. Experiments are conducted on a NAO robot and a human to evaluate the depth estimation accuracy of the proposed method.


Introduction
In human-robot cooperation, deep reinforcement learning (DRL) is used to train the robot to undertake the task. For instance, in a bolt screwing task, the human partner's arms might be obstacles for the robot during the working process (Figure 1). In order to train the robot's obstacle avoidance capability through DRL, one has to prepare a huge number of samples, which are usually hard to generate. In this case, the samples can be generated by 3D reconstruction [1][2][3] of a human worker executing the task. The reconstructed sequence of the human's arms can then be used as moving obstacles to train the obstacle avoidance capability of the robot in a virtual environment (Figure 2).
In the scenarios above, a common prerequisite is accurate pose information of a human or a robot. However, when an object is projected onto the camera plane, its depth information along the optical axis is lost, which can make two objects that are actually far apart look close to each other [4]. Without correct depth information, the pose is estimated incorrectly. Although lasers can be used to provide depth information, they are usually expensive. Low-cost equipment such as the Kinect is a prevailing alternative, but its accuracy is not high enough (the Kinect has an error of at least 4 mm and a dead zone within 0.5 m; additionally, the farther the distance, the less accurate the measurement).
Many previous studies propose different approaches to estimate depth from images. Among them, a few researchers used optimization methods to handle the problem. Ranftl et al. propose a novel motion segmentation method to produce a dense depth map from two consecutive frames captured by a single monocular camera. They segment the optical flow field into a set of motion models, with which the scene is reconstructed by solving a convex program [5]. Smith et al. present a method which estimates depth from a single polarisation image by solving a large, sparse system of linear equations [6]. Karsch proposes a technique that automatically generates plausible depth maps from videos using non-parametric depth sampling [7]. Optimization methods need manually defined constraints to guarantee the accuracy of the resulting depth estimation. Thus, these methods rely on human experience to provide effective constraints.
Approaches that use learning-based methods to model manually extracted features are another promising alternative. Saxena et al. propose a Markov Random Field (MRF) learning algorithm to handle monocular cues such as texture gradients and variations, defocus, etc. They incorporate these cues into a stereo system to obtain depth estimation results [8]. Ma et al. improve the ResNet-50 network by transfer learning to tackle the depth estimation problem from a single image [9]. Haim et al. propose a phase-coded aperture camera for depth estimation. They equip the camera with an optical phase mask to produce unambiguous depth-related color characteristics for the captured image [10]. Gan et al. present a convolutional neural network architecture that pays more attention to the relationships of different image locations and incorporates the absolute and relative features [11]. These methods depend on the extracted features. In some scenarios, these features might not incorporate sufficient depth cues.
Recently, approaches using deep learning to generate depth maps from images have become prevalent. Fu et al. propose a spacing-increasing discretization (SID) strategy to discretize depth and recast depth network learning as an ordinal regression problem [12]. Jiao et al. propose an approach to handle the depth estimation and semantic labeling tasks simultaneously. They present a concept called the attention-driven loss for the network supervision, and a synergy network to learn the relevance between the two tasks [13]. Godard et al. trained an unsupervised deep neural network with binocular stereo data to address the lack of ground truth depth data in traditional depth estimation methods. A novel training loss was proposed for the deep convolutional neural network to perform high-quality single image depth estimation [14]. Inspired by the concept of autoencoders, Garg et al. trained the first convolutional neural network end-to-end from scratch for single view depth estimation in an unsupervised manner [15]. Deep learning methods are powerful tools for regression. However, they usually need expensive platforms to run, and they need many training samples related to the target object for accurate depth estimation.
Different from the aforementioned approaches, the method proposed in this paper estimates depth information by means of 3D reconstruction. First, we circle a camera around the target object to obtain its reconstruction. Afterwards, only a single monocular image is required to accurately estimate the depth information of the object no matter how it moves or deforms in front of the camera. The proposed method is therefore more applicable in scenarios where the target object is not rigid and where accurate depth information is necessary. With the point cloud of a target object (in a certain static pose) reconstructed beforehand, the proposed method can estimate the pose and reconstruct the point cloud of the target object (in its other poses) from a single input RGB image. This work can be used not only on humans and humanoid objects, but on other deformable objects as well.
The remainder of this paper is structured as follows. Sections 2-4 introduce the three modules of the proposed approach. Section 5 presents the overall flow chart of the proposed method. Experimental evaluations on a NAO robot and a human are provided in Section 6. Section 7 concludes the paper.

3D Labeled Reconstruction
This section introduces the a priori model for the target object. The a priori model is the 3D reconstruction (stored in the form of a point cloud [16][17][18]) of the target object in its stationary state, with a SIFT feature vector attached to each cloud point. The a priori model is built in two steps. First, we use a traditional 3D reconstruction approach (for static objects) to reconstruct the target object from multiple images. Second, we use the SIFT algorithm [19] to extract feature vectors from the collected images and attach them to the corresponding 3D points on the reconstructed point cloud.

3D Reconstruction with Multiple Images
Given a static object, we can circle a single camera (whose intrinsic parameter is $f$) around the object to reconstruct a point cloud. Let $N$ be the total number of images captured by the camera for the reconstruction, and let $M$ be the total number of points on the object surface. The orientation and position of the camera with regard to the world frame at the $i$th instant are represented by a matrix $R_i$ and a vector $t_i$, respectively. Denote $P_j = [X_j\ Y_j\ Z_j]^T$ as the $j$th point on the object surface with regard to the world frame, $P^i_j$ as the same $j$th point with regard to the camera frame at the $i$th instant, and $m^i_j$ as the image coordinate of the $j$th point at the $i$th instant (to simplify notation, we define $m^i_j \in \emptyset$ if $P_j$ is occluded from the camera at the $i$th instant). The following can be given [20]: [Equations (1) and (2)]. Let [Equation (3)]. Subsequently, the desired result in this step is the 3D reconstruction $\{P^*_j\}_{j=1\ldots M}$ in the following form: [Equation (4)].
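To make the quantities above concrete, the following Python sketch projects world-frame points into an image under a standard pinhole model and evaluates the summed squared reprojection error that multi-view reconstruction typically minimizes. The convention $P^i_j = R_i P_j + t_i$, the function names `project` and `reprojection_error`, and the use of NaN to mark occluded observations are assumptions made for this illustration; they are not taken from the paper's equations.

```python
import numpy as np

def project(points_world, R, t, f):
    """Project (M, 3) world points to (M, 2) image coordinates.

    Assumed convention: P_cam = R @ P_world + t, then (u, v) = f * (X, Y) / Z.
    """
    points_cam = points_world @ R.T + t          # (M, 3) camera-frame points
    return f * points_cam[:, :2] / points_cam[:, 2:3]

def reprojection_error(points_world, poses, observations, f):
    """Sum of squared pixel residuals over all views.

    poses: list of (R, t) per view; observations: (N, M, 2) array in which
    occluded points are marked with NaN (a choice made for this sketch).
    """
    total = 0.0
    for (R, t), observed in zip(poses, observations):
        residual = project(points_world, R, t, f) - observed
        visible = ~np.isnan(observed).any(axis=1)  # skip occluded points
        total += float(np.sum(residual[visible] ** 2))
    return total

# Two points seen by one camera placed at the world origin, focal length 500.
points = np.array([[0.0, 0.0, 2.0], [0.2, -0.1, 4.0]])
pose = (np.eye(3), np.zeros(3))
pixels = project(points, pose[0], pose[1], 500.0)
error_at_truth = reprojection_error(points, [pose], pixels[None], 500.0)
```

In a full reconstruction, the points and all camera poses would be the unknowns of this error; here the error is only evaluated at the true configuration, where it vanishes.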

SIFT Features to Label the 3D Reconstruction
Given a two-dimensional image $I(x, y)$, the SIFT algorithm [19] is able to extract effective key points through the LoG operator. By computing the gradients in the neighborhood of each key point, a corresponding descriptor vector can be obtained to distinguish the key point. We can therefore use the SIFT algorithm to find a set of feature points (denoted as $\{m^i_s\}_{s=1\ldots S}$) and their corresponding descriptor vectors (denoted as $\{l^i_s\}_{s=1\ldots S}$) for the image captured at the $i$th instant, which jointly yield a two-tuple set denoted as $\{(m^i_s, l^i_s)\}_{s=1\ldots S}$. Applying the same operation to all the images, we finally get $\{(m^i_s, l^i_s)\}^{i=1\ldots N}_{s=1\ldots S_i}$ (where $S_i$ indicates the total number of feature points derived from the SIFT algorithm for the image captured at the $i$th instant).
Subsequently, we need to attach the descriptor vectors to the corresponding 3D points on the surface of the reconstructed point cloud. It can be deduced from Equations (1) and (2) that [Equation (5)]. By Equation (5), we can determine the 3D point $P^i_s$ on the reconstructed point cloud corresponding to the feature point $m^i_s$. Thus, we can acquire a two-tuple set $\{(P^i_s, l^i_s)\}^{i=1\ldots N}_{s=1\ldots S_i}$. To simplify notation, we assign a zero descriptor vector $l^i_j = 0$ to any $P_j$ that is occluded from the camera view at the $i$th instant or whose corresponding $m^i_j$ is not a key point. Therefore, we get $\{(P^i_j, l^i_j)\}^{i=1\ldots N}_{j=1\ldots M}$ after the $N$th instant. The required 3D labeled reconstruction is $\{(P^*_j, \bar{l}_j)\}_{j=1\ldots M}$, where $\bar{l}_j$ represents the average over all the non-zero descriptor vectors (i.e., $\{l^i_j \mid i \le N,\ l^i_j \neq 0\}$) related to the 3D point $P_j$ up to the $N$th sampling instant.
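The averaging of non-zero descriptor vectors can be sketched in a few lines of Python. The array layout (views by points by descriptor dimension) and the function name `average_descriptors` are assumptions chosen for this illustration; real SIFT descriptors have 128 dimensions, whereas the toy example uses 2.

```python
import numpy as np

def average_descriptors(descriptors):
    """Average the non-zero descriptors attached to each 3D point.

    descriptors: (N, M, D) array; an all-zero row means that the point was
    occluded or was not a key point in that view.
    Returns an (M, D) array; points never observed keep a zero descriptor.
    """
    observed = np.any(descriptors != 0, axis=2)              # (N, M) mask
    counts = observed.sum(axis=0)                            # views per point
    sums = (descriptors * observed[..., None]).sum(axis=0)   # (M, D)
    return sums / np.maximum(counts, 1)[:, None]

# Three views, two points: point 0 is a key point in views 0 and 2,
# point 1 is never observed as a key point.
labels = np.zeros((3, 2, 2))
labels[0, 0] = [1.0, 3.0]
labels[2, 0] = [3.0, 5.0]
mean_labels = average_descriptors(labels)
```

Dividing by `np.maximum(counts, 1)` avoids a division by zero for points that were never observed; such points keep a zero label and can be excluded from later matching.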

Skeleton-Based Topological Segmentation
This section introduces how to provide the reconstructed point cloud with a robust topological segmentation, so as to deal with the case where the target object is not rigid. Each sub-point-cloud derived from the topological segmentation is expected to be rigid. Topological segmentation based on the surface information of the target object tends to be influenced by surface noise, which weakens robustness. Therefore, the proposed topological segmentation is executed in two steps. First, we extract the skeleton of the reconstructed point cloud [21] and segment the skeleton based on its curvature. Second, we dilate the sub-skeletons [22] to yield the sub-point-clouds, which are the results of the topological segmentation.
Skeleton extraction and segmentation. Given an object denoted as $\Omega = \{P^*_j\}_{j=1\ldots M}$, we denote $\mathrm{CORE}(\Omega)$ as the set of all the maximally inscribed spheres in $\Omega$, none of which has common points of tangency with the noisy surface. Then, the skeleton of $\Omega$ is denoted as $S(\Omega)$.
After extracting the skeleton of the reconstructed point cloud, we segment the skeleton according to its curvature. Supposing $C(\Omega) \subset S(\Omega)$, an equivalence relation $\sim_C$ induced by $C(\Omega)$ is defined as follows: $\forall p_1, p_2 \in S(\Omega)$, $p_1 \sim_C p_2$ if and only if $p_1$ and $p_2$ are on a curve segment whose ends are two points in $C(\Omega)$ and no other points in $C(\Omega)$ are on the same curve segment.
Thus, the curve segments determined by $C(\Omega)$ are the equivalence classes [23] $S_C(\Omega)$ defined as $S_C(\Omega) = \{[p] : p \in S(\Omega)\} = \{\{q \in S(\Omega) : q \sim_C p\} : p \in S(\Omega)\}$. It is clear that the curve segments in $S_C(\Omega)$ are separated from each other. In this paper, we propose two categories of points, respectively denoted as $C_1(\Omega)$ and $C_2(\Omega)$, such that [Equation (7)].

The First Category $C_1(\Omega)$
Suppose all the points on the skeleton $S(\Omega)$ constitute a set $\{p_k\}_{k=1\ldots K}$ ($K$ is the total number of elements in $S(\Omega)$). We use a set $\{e_{mn}\}_{0<m\le K,\ 0<n\le K}$ to represent the connectivity of each pair of points in $S(\Omega)$. Specifically, $e_{mn} = 1$ if $p_m$ and $p_n$ are adjacent to each other; otherwise, $e_{mn} = 0$. Subsequently, the first category of points $C_1(\Omega)$ is defined as [equation].

The Second Category $C_2(\Omega)$
We define the function of the $c$th curve segment (supposing $C$ curve segments in total) in $S_{C_1}(\Omega)$ as $r_c(u_c)$, where $u_c$ is the arc length parameter of $r_c$ and $u_c \in (0, L_c)$ ($L_c$ is the length of the $c$th curve segment). Then, the Frenet formulas [24] of the $c$th curve are
$$\frac{d\alpha_c}{du_c} = \kappa_c \beta_c, \qquad \frac{d\beta_c}{du_c} = -\kappa_c \alpha_c + \tau_c \gamma_c, \qquad \frac{d\gamma_c}{du_c} = -\tau_c \beta_c,$$
where $\alpha_c$, $\beta_c$ and $\gamma_c$ are, respectively, the unit tangent vector, the unit normal vector and the unit binormal vector of $r_c$; $\kappa_c$ and $\tau_c$ are the curvature and torsion. We construct a quantity $\delta_c$ satisfying $\delta_c \propto \kappa_c$ and $\delta_c \propto \tau_c$, as well as a threshold $\upsilon$. Subsequently, the second category of points $C_2(\Omega)$ is defined as [equation].
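The two categories of cut points can be illustrated with a small Python sketch. Since the defining equations are not available here, the sketch rests on two loudly stated assumptions: the first category is taken to be the skeleton points whose number of neighbors differs from two (end points and branch points), and the second category is approximated by thresholding a discrete curvature (turning angle per unit arc length) along an ordered curve segment, ignoring torsion. The function names are hypothetical.

```python
import numpy as np

def junction_and_end_points(adjacency):
    """Indices of skeleton points whose degree differs from 2.

    adjacency: (K, K) symmetric 0/1 matrix e_mn. Treating these points as the
    first category is an assumption of this sketch.
    """
    degree = adjacency.sum(axis=1)
    return np.flatnonzero(degree != 2)

def high_curvature_points(polyline, threshold):
    """Indices of interior points of an ordered (L, 3) polyline whose turning
    angle per unit arc length exceeds `threshold` (a discrete curvature proxy).
    """
    before = polyline[1:-1] - polyline[:-2]
    after = polyline[2:] - polyline[1:-1]
    len_b = np.linalg.norm(before, axis=1)
    len_a = np.linalg.norm(after, axis=1)
    cos_angle = np.sum(before * after, axis=1) / (len_b * len_a)
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    curvature = angle / (0.5 * (len_b + len_a))
    return np.flatnonzero(curvature > threshold) + 1   # +1: skip first point

# Y-shaped skeleton with edges 0-1, 1-2, 2-3, 2-4.
e = np.zeros((5, 5), dtype=int)
for m, n in [(0, 1), (1, 2), (2, 3), (2, 4)]:
    e[m, n] = e[n, m] = 1
first_category = junction_and_end_points(e)

# L-shaped curve segment with a right-angle bend at index 2.
segment = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [2, 1, 0], [2, 2, 0]], dtype=float)
second_category = high_curvature_points(segment, threshold=1.0)
```

Cutting the skeleton at both sets of points yields sub-skeletons that are free of branches and sharp bends, which is the property the segmentation relies on.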

Skeleton Dilation with Constraints
Supposing $B$ is a structuring element [20] in the form of a subset of $\mathbb{R}^3$, the dilation of $S(\Omega)$ by $B$ after $T$ iterations is defined as [equation]. The dilation operation is executed to segment the reconstructed point cloud by the sub-skeletons. Therefore, the sub-point-clouds should be separated from each other. Additionally, the dilation should stop when reaching the surface of the reconstructed point cloud. Thus, in each iteration $t \in \{1 \ldots T\}$, we remove the points that violate the following two constraints: [Equations (13) and (14)], where $S_{c_1}(\Omega)$ and $S_{c_2}(\Omega)$ represent two distinct sub-skeletons in $S_C(\Omega)$, and $\bar{\Omega}$ is the external space of $\Omega$. Moreover, for the dilation of each sub-skeleton $S_{c_1}(\Omega)$ under the constraints of Equations (13) and (14), the corresponding total number of iterations $T$ satisfies [Equation (15)]. Then, the dilation result of the sub-skeletons in $S_C(\Omega)$ under the constraints of Equations (13)-(15) forms an equivalence relation for the topological segmentation $(\Omega, \mathcal{T})$.
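The effect of the constrained dilation can be approximated without iterating a morphological operator. If every sub-skeleton grows at the same rate with a spherical structuring element and growth stops where two regions meet, each cloud point ends up in the region of the sub-skeleton that reaches it first. The Python sketch below uses this nearest-sub-skeleton labeling as a stand-in for the dilation; it is a substitute technique chosen for brevity, not the procedure defined by Equations (13)-(15), and the function name is hypothetical.

```python
import numpy as np

def label_by_nearest_sub_skeleton(cloud, sub_skeletons):
    """Assign each cloud point to the closest sub-skeleton.

    cloud: (M, 3) array of reconstructed points;
    sub_skeletons: list of (K_c, 3) arrays, one per sub-skeleton.
    Returns an (M,) array of sub-skeleton indices.
    """
    distances = []
    for skeleton in sub_skeletons:
        diff = cloud[:, None, :] - skeleton[None, :, :]        # (M, K_c, 3)
        distances.append(np.linalg.norm(diff, axis=2).min(axis=1))
    return np.argmin(np.stack(distances, axis=1), axis=1)

# Two straight sub-skeletons and three surface points.
upper_arm = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
forearm = np.array([[5.0, 0.0, 0.0], [6.0, 0.0, 0.0]])
surface = np.array([[0.5, 0.1, 0.0], [5.5, -0.1, 0.0], [1.2, 0.0, 0.1]])
part_labels = label_by_nearest_sub_skeleton(surface, [upper_arm, forearm])
```

Because only points of the reconstructed cloud are labeled, the requirement that growth stays inside the object is satisfied trivially, and each point receives exactly one label, so the parts cannot overlap.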

3D Reconstruction at ith Time
This section introduces how to quickly reconstruct the dynamic object with a single RGB camera. For each frame from the camera, we first extract all the feature points. By matching these feature points to those on the reconstructed point cloud, we determine the sub-point-cloud to which each feature point corresponds. Finally, the poses of the sub-point-clouds can be estimated from the correspondences between the feature points on the point cloud and the image feature points. This pose estimation problem can be handled by solving a nonlinear optimization. The reconstruction is therefore reduced to the reorganization of the sub-point-clouds with updated poses.
Denote the image captured at the $i$th instant ($i > N$) as $\{m^i_j\}_{j=1\ldots M}$. Through the SIFT algorithm, we can extract the feature points and their corresponding descriptor vectors. Through Section 2.2, a labeled 3D point cloud is reconstructed, to which the descriptor vectors are attached. Thus, we can use the descriptors from the captured image and from the cloud to find their correspondence, which can be denoted as a two-tuple set [equation]. In this set, each pair consists of a feature point from the image and a cloud point whose descriptor vectors are similar. In addition, [...]
Suppose $\mathcal{T} = \{[p] : p \in \Omega\} = \{\{q \in \Omega : q \sim p\} : p \in \Omega\}$ is the basis for the topology of the space $\Omega = \{P^*_j\}_{j=1\ldots M}$. We can get the basis for the topology (denoted as $\mathcal{T}^i$) of the subset of matched cloud points, and we can further acquire the basis for the topology (denoted as $\mathcal{T}^i_m$) of the set of matched image points. Based on our design, each element in $\mathcal{T}$ is a rigid component of $\Omega$. Thus, when the object represented by $\Omega$ moves stochastically, the points in each element $\mathcal{T}_c \in \mathcal{T}$ undergo the same rigid transformation, i.e., there exists a single pair $(R^i_c, t^i_c)$ for $\mathcal{T}^i_c$ such that [Equation (18)], where $R^i_c$ is a three-dimensional rotation matrix, $t^i_c$ is a three-dimensional translation vector and $R^i_c \mathcal{T}^i_c + t^i_c$ is defined as $\{R^i_c p + t^i_c : p \in \mathcal{T}^i_c\}$, where $p$ is denoted in the form of a column vector.
Defining [equation], since $\mathcal{T}^i_c \subset \mathcal{T}_c$, based on Equation (18), we can get the actual coordinates of $\mathcal{T}^i_c$ at the $i$th instant as [equation]. Then, we can get another expression of $\phi^{-1}(\mathcal{T}^i_c)$ as [equation], where $m^i(\cdot)$ (based on Equation (2)) is an operator that transforms a three-dimensional point to a two-dimensional coordinate related to the camera at the $i$th instant, which is defined as [equation]. Thus, we can compute $(R^i_c, t^i_c)$ for $\mathcal{T}^i_c$ by solving the following optimization problem: [Equation (23)]. Finally, the raw result of the 3D reconstruction $P_j$ [...]
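The correspondence step can be sketched as a nearest-neighbor search in descriptor space followed by grouping the matches by rigid part. In the Python sketch below, the ratio test (a match is kept only if the best distance is clearly smaller than the second best) is a common heuristic for SIFT matching and is an assumption here, as are the function names and the threshold value of 0.8.

```python
import numpy as np

def match_descriptors(image_desc, cloud_desc, ratio=0.8):
    """Return (image_index, cloud_index) pairs that pass a ratio test.

    image_desc: (S, D) descriptors from the new image;
    cloud_desc: (M, D) descriptors attached to the cloud, with M >= 2.
    """
    matches = []
    for s, descriptor in enumerate(image_desc):
        dist = np.linalg.norm(cloud_desc - descriptor, axis=1)
        best, second = np.argsort(dist)[:2]
        if dist[best] < ratio * dist[second]:
            matches.append((s, int(best)))
    return matches

def group_by_part(matches, part_labels):
    """Group matches by the rigid part (segment label) of the cloud point."""
    groups = {}
    for s, j in matches:
        groups.setdefault(int(part_labels[j]), []).append((s, j))
    return groups

# Toy 2-D descriptors: the first image feature clearly matches cloud point 0,
# the second is equally far from all cloud points and is rejected.
cloud_descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
image_descriptors = np.array([[0.9, 0.05], [0.5, 0.5]])
pairs = match_descriptors(image_descriptors, cloud_descriptors)
per_part = group_by_part(pairs, part_labels=np.array([0, 0, 1]))
```

Each group then supplies the 2D-3D correspondences from which the pose of one rigid part is estimated. Cloud points with a zero descriptor would need to be excluded from `cloud_desc` before matching.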

Approach Overview
Specifically, we utilize a single camera to solve the dynamic 3D object reconstruction problem.
The problem can be formulated as follows: given the images $\{m^i_j\}^{i=1\ldots N}_{j=1\ldots M}$ captured within time $N$ (the pose and position of the camera change at each time $i \le N$ so that a static 3D reconstruction of the object can be satisfactorily achieved) and the images captured at time $i > N$, the expected result is the dense reconstruction of the object $\{P^i_j\}^{i>N}_{j=1\ldots M}$ at the same time $i > N$.
Accordingly, we fully utilize the 3D information acquired from the foregoing frames and reduce the dynamic object reconstruction problem to a re-organization problem. The proposed approach therefore incorporates three main steps.
In the first step, we obtain the static 3D reconstruction of the target object in its stationary state. Specifically, the point cloud of the target object is acquired in the first phase through existing static 3D reconstruction methods. Meanwhile, we extract the SIFT features of each image (among the images used for 3D reconstruction) and attach the feature descriptors to the corresponding points on the reconstructed point cloud.
In the second step, we find an appropriate topological segmentation for the reconstructed point cloud such that each topological part moves rigidly during the object motion. Point cloud topological segmentation is itself an open problem, because it is difficult to define the standard for a satisfactory segmentation. In this paper, we convert the point cloud segmentation problem into a skeleton segmentation problem, i.e., the segmentation of the point cloud results from the segmentation of its skeleton. This is based on the observation that the object skeleton is much more stable under perturbation than the object surface. Thus, we can simply segment the skeleton into several sub-skeletons based on its curvature and torsion before dilating each sub-skeleton to determine the corresponding topological part of the point cloud.
In the third step, when capturing a new image, we extract the SIFT features and match them to those attached to the point cloud in the first step. These matched features establish the correspondence between the image and the point cloud. Based on the topological segmentation of the point cloud, the correspondence between the image and each topological part can also be computed. Then, the pose and position of each topological part can be deduced, and the reconstruction result can thus be obtained through a simple re-organization of these topological parts. The whole flow chart of the proposed approach is illustrated in Figure 3.
Construction of rotation matrix. Since the analytic expression of the rotation matrix $R^i_c$ in Equations (18), (20), (23) and (24) is required for the optimization program (following Equation (23)), we utilize the quaternion to construct the rotation matrix. Specifically, a quaternion is in the form of [equation]. Then, the corresponding rotation $R^i_c$ in the optimization problem of Equation (23) is [equation], with a constraint as [equation], which constitutes a simple constrained nonlinear optimization problem.
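The quaternion parameterization and the resulting pose cost for one rigid part can be sketched as follows. The component ordering $(w, x, y, z)$, the 7-vector parameter layout, the pinhole projection with focal length `f`, and the function names are assumptions of this sketch. The unit-norm constraint is handled here by normalizing the quaternion inside the function, which is one common alternative to passing an explicit equality constraint to the solver.

```python
import numpy as np

def quaternion_to_rotation(q):
    """Rotation matrix of quaternion q = (w, x, y, z), normalised first so
    that the unit-norm constraint holds."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def part_reprojection_cost(params, part_points, image_points, f):
    """Squared reprojection error of one rigid part.

    params: 7-vector (w, x, y, z, tx, ty, tz);
    part_points: (K, 3) points of the part in the a priori model;
    image_points: (K, 2) matched image coordinates.
    """
    R = quaternion_to_rotation(params[:4])
    moved = part_points @ R.T + params[4:]
    projected = f * moved[:, :2] / moved[:, 2:3]
    return float(np.sum((projected - image_points) ** 2))

# Synthetic check: rotate a part by 90 degrees about the optical axis, shift
# it, and confirm that the cost vanishes at the true pose.
part = np.array([[0.1, 0.0, 2.0], [0.0, 0.1, 2.0], [-0.1, 0.0, 2.5]])
true_params = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4), 0.05, 0.0, 0.3])
moved_part = part @ quaternion_to_rotation(true_params[:4]).T + true_params[4:]
observed = 500.0 * moved_part[:, :2] / moved_part[:, 2:3]
cost_at_truth = part_reprojection_cost(true_params, part, observed, 500.0)
identity_params = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
cost_at_identity = part_reprojection_cost(identity_params, part, observed, 500.0)
```

A cost of this form could be handed to a general-purpose nonlinear solver such as `scipy.optimize.minimize`, starting from the pose estimated for the previous frame.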

Experiments and Discussion
This section provides the experimental results of the proposed approach. We evaluate the proposed approach with a set of experiments on a NAO robot (Figure 4) and a human being.

Experiments on a NAO Robot
The NAO robot is an autonomous, programmable humanoid robot developed by Aldebaran Robotics (France), with a height of 58 centimetres and 25 degrees of freedom. We use a NAO robot as the target object for 3D reconstruction. During the experiment, the NAO robot continuously changes its pose, so that it behaves as a deformable object. We reconstruct the NAO robot to test the depth accuracy of the proposed algorithm.
This experiment uses a single monocular camera to reconstruct the NAO robot in its dynamic state. To guarantee the effectiveness of the proposed method, several specific processing procedures are listed in Appendices A-C.
Other procedures of the proposed approach in experiments follow the descriptions and formulas in Sections 2-4.
We compare the proposed method with the approaches in [14,25]. The depth estimation result of the proposed method is shown in Figure 5. Note that the proposed method only estimates the depth of the target object, while the previous approaches estimate the depth of the whole image. Therefore, we only compare the depth estimation accuracies over the image regions where the target object appears. Four accuracy metrics are used [26], as shown in Table 1.

Table 1 (columns: metric name, equation). Recoverable entry: Abs Relative Difference.
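The masked evaluation described above can be sketched in Python. The absolute relative difference follows its usual definition, the mean of $|d - d^*| / d^*$ over the evaluated pixels, where $d$ is the predicted depth and $d^*$ the ground truth. The root mean squared error is included only as an assumed example of a companion metric, since the remaining metric names of Table 1 are not available here. Building the mask from pixels with positive ground-truth depth is likewise an assumption.

```python
import numpy as np

def abs_relative_difference(predicted, ground_truth, mask):
    """Mean of |d_pred - d_gt| / d_gt over the masked (object) pixels."""
    pred, gt = predicted[mask], ground_truth[mask]
    return float(np.mean(np.abs(pred - gt) / gt))

def rmse(predicted, ground_truth, mask):
    """Root mean squared error over the masked pixels (assumed companion metric)."""
    pred, gt = predicted[mask], ground_truth[mask]
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

# 2x2 depth maps in metres; the bottom row is background without ground truth.
depth_gt = np.array([[2.0, 4.0], [0.0, 0.0]])
depth_pred = np.array([[2.2, 3.6], [1.0, 1.0]])
object_mask = depth_gt > 0
abs_rel = abs_relative_difference(depth_pred, depth_gt, object_mask)
rms = rmse(depth_pred, depth_gt, object_mask)
```

Restricting both metrics to the mask keeps background predictions of whole-image methods from influencing the comparison.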

Figure 5. The depth estimation results for the NAO robot by the approaches in [14,25] and the proposed method (panels: images, proposed method, approach in [14], approach in [25], ground truth). We use the ground truth to build a mask so as to display only the depth information corresponding to the image region of the NAO robot.
The depth accuracies of the proposed method and the other three approaches are shown in Table 2.
Experiments are also conducted on a human to verify the proposed approach. We produce the 3D reconstruction of the human in a still pose at the very beginning. Subsequently, the human changes his pose and the camera captures the corresponding monocular image for each pose. We use these images to test the proposed approach and the previous algorithms [14,25]. The program used for this experiment is the same as that for the NAO robot experiment.
The depth estimation results of the proposed method and the other approaches are shown in Figure 6. Additionally, the depth accuracies of the proposed method and the other three approaches are shown in Table 3.

Figure 6. The depth estimation results for the human volunteer by the approaches in [14,25] and the proposed method (panels: images, proposed method, approach in [14], approach in [25], ground truth).
Similar to the experiments on the NAO robot, the experimental results validate the effectiveness of the proposed method. The proposed method uses the 3D point cloud reconstructed ahead of time as the a priori model. Subsequently, it relies on the image features to estimate the pose changes of the a priori model, and is therefore capable of estimating the depth information accurately.

Conclusions
In this paper, we propose a feature-based approach to accurately estimate the depth of a deformable object via a monocular camera. The proposed approach needs to reconstruct the target object in its initial pose as an a priori model. Afterwards, only one monocular image is required to accurately estimate the depth of the target object no matter how it changes its pose. Experiments are conducted on a NAO robot and a human to evaluate the accuracy of the proposed approach. In future work, we aim to accurately estimate the depth of deformable objects of the same kind by reconstructing only a single instance of that kind as the a priori model.

Figure 1. The tool used in a human-robot cooperation scenario for screwing a bolt. Since the human and the robot screw the cross tool together, the human arm might be an obstacle to the robot during the bolt screwing process.

Figure 2. 3D reconstruction for training a cooperative robot, which provides training samples and allows the robot to learn to avoid obstacles in a virtual environment.
point from the image and the cloud point, whose descriptor vectors are similar.

Figure 3. The framework of the proposed method.

Figure 4. Experiments on a NAO robot. (a) shows the robot as the target object to evaluate the algorithms; (b) shows the a priori model reconstructed at the beginning.

Table 1. Quantitative comparison on the aspects of accuracy and efficiency.

Table 2. Quantitative comparison on the aspects of accuracy and efficiency.

Table 3. Quantitative comparison on the aspects of accuracy and efficiency.