Learning-Based Pose Estimation of Non-Cooperative Spacecrafts with Uncertainty Prediction

Abstract: Estimation of spacecraft pose is essential for many space missions, such as formation flying, rendezvous, docking, repair, and space debris removal. We propose a learning-based method with uncertainty prediction to estimate the pose of a spacecraft from a monocular image. We first used a spacecraft detection network (SDN) to crop out the rectangular area of the original image that contains the spacecraft. A keypoint detection network (KDN) was then used to detect 11 pre-selected keypoints with distinctive features in the cropped image and to predict their uncertainty. We propose a keypoint selection strategy that automatically selects the keypoints with higher detection accuracy from all detected keypoints. These selected keypoints were used to estimate the 6D pose of the spacecraft with the EPnP algorithm. We evaluated our method on the SPEED dataset. The experiments showed that our method outperforms heatmap-based and regression-based methods, and that our effective uncertainty prediction increases the final precision of the pose estimation.


Introduction
Driven by space missions such as spacecraft maintenance [1], on-orbit docking [2] and space debris removal [3], pose estimation for non-cooperative spacecraft has become an active research topic. Non-cooperative spacecraft are generally those that do not provide effective cooperative information, including malfunctioning or failed satellites, space debris, and opposing spacecraft. In the past, the pose of a spacecraft was usually estimated with high-precision sensors [4][5][6]. However, because of the high cost and power consumption of these sensors, this approach is not applicable to many low-cost spacecraft [7]. Monocular images can provide the key position and orientation information required by the navigation system at low power [8].
In this paper, we focus on estimating the 6D pose of a spacecraft from a monocular image. The main difficulty of this task is the limited amount of pose information available. Moreover, the complex imaging environment in space, such as varying illumination and backgrounds, brings further challenges. Dhome proposed a closed-form, model-based method for recognizing the 6D pose from an image [9]. This method matches all possible 3D model edges to the 2D edges of the captured image one by one and uses soft assign to avoid the computational overload caused by exhaustive enumeration. Following Dhome, Kanani and Petit made partial improvements that increase computational speed and reduce data dependence [10,11]. These methods were initially applied to ground-based robotic navigation and later to satellite-based monocular navigation. However, model-based methods require a large amount of feature matching before the pose can be solved, which makes them difficult to apply in real time [12]. Therefore, non-model-based methods were proposed to estimate the 6D pose. Augenstein and Rock proposed using SIFT-based SLAM to solve the pose of spacecraft [13]. Nevertheless, non-model-based approaches may lose target features under large changes in image conditions or perspective relationships [14]. Pose estimation methods have advanced along with image recognition algorithms. D'Amico proposed a perceptual organization of the edges detected in images with the Sobel and Hough algorithms to solve the pose-initialization problem [15], achieving pose estimation of a fully non-cooperative spacecraft for the first time. However, this method is computationally expensive, difficult to run in real time on onboard hardware, and not robust to illumination conditions [15].
Sharma improved on D'Amico's work by proposing the Sharma-Ventura-D'Amico (SVD) architecture and introducing weak gradient elimination (WGE) to reduce the search space [12]. Sharma's method reduces the computation time and improves the detection accuracy, but it can generate spurious edges when image conditions are poor.
In recent years, with the development of deep learning, especially neural networks, there have been new advances in pose estimation for spacecraft from monocular images. It has been shown that features detected by convolutional neural networks (CNNs) are more accurate and stable than those of traditional methods in computer vision tasks [16]. Therefore, many learning-based methods have been proposed to solve the pose estimation problem [17][18][19][20][21][22][23]. Recently, Chen and Park proposed similar pipelines to estimate the 6D pose of a spacecraft from a monocular image [18,19]. They used CNNs to automatically crop out the part of the image that contains the spacecraft and predicted the 2D pixel coordinates of keypoints from the cropped image. They then used these 2D pixel coordinates and a wireframe model of the spacecraft obtained in advance to estimate the 6D pose. Following their work, we propose a learning-based 6D pose estimation method for spacecraft, with effective uncertainty prediction enabling automatic selection of keypoints for pose estimation. Our main contributions can be summarized as follows:

• We introduce the idea of region detection into the keypoint detection of spacecraft, which can capture the features of keypoints better;
• We achieve effective uncertainty prediction for the detected keypoints, which can be used to automatically eliminate keypoints with low detection accuracy;
• We conduct sufficient experiments on the SPEED dataset [17]. Compared with previous methods, our method can reduce the average error of pose estimation by 53.3% while reducing the number of model parameters.
The rest of this paper is organized as follows. Section 2 briefly introduces previous work on learning-based 6D pose estimation of spacecraft and on keypoint detection. Section 3 details the proposed method. Section 4 presents and benchmarks the experimental results. Finally, Section 5 concludes this work.

Learning-Based Methods
Instead of handcrafting image features to estimate the pose of a spacecraft, learning-based methods use deep learning to extract the features automatically. These methods can be divided into two categories: direct estimation and indirect estimation. Among the direct methods, Sharma [22] used a CNN to extract image features and a fully connected layer to output a 6-dimensional vector as the predicted 6D pose. In Gao's work [21], the prediction of the orientation vector was converted into the regression of a heatmap. Sharma also adopted multi-task learning [20,23] to estimate the 6D pose, completing keypoint prediction, spacecraft detection and image segmentation simultaneously with the pose prediction. Among the indirect methods, Park [18] and Chen [19] first used a CNN to predict the positions of keypoints and then used these keypoints to estimate the pose with the EPnP algorithm [24]. They differ mainly in how the keypoints are detected. Park [18] used the lightweight MobileNetv2 [25] as a backbone to extract features and a fully convolutional network (FCN) [26] to regress the pixel coordinates of keypoints. Chen [19] predicted a heatmap for each keypoint, giving the probability of the keypoint appearing at each pixel coordinate.
Our method also belongs to the indirect estimation category. Unlike [18] and [19], we treat each keypoint as a square region to detect. Although Chen also treated each keypoint as a square region, the size of that region is fixed. We place three square anchors of different sizes at each pixel of the feature map to handle different relative distances to the spacecraft (Figure 1).
Figure 1. Our advantage over Chen [19] in how the region of keypoints is set: (a) Chen [19]; (b) ours. The blue box represents the box containing a keypoint, and the yellow box represents the anchor in our Keypoint Detection Network. When the relative distance to the spacecraft is too small, the fixed region misses part of the key area of the keypoint. Our adaptive region size handles this case better, as described in Section 2.2.

Keypoint Detection
Keypoint detection is a traditional task in computer vision, and surveys have discussed the related methods extensively [27]. We present related work in two main categories: handcrafted and learned detectors.
For handcrafted detectors, Harris [28] and Hessian [29] detectors used first and second order image derivatives to find corners or blobs in images. The more refined keypoint feature can be calculated through some engineered algorithms [30][31][32][33], which seek alternative structures within images to represent the keypoint. MSER [32] segmented and selected stable regions as keypoints, and SIFT [30] looked for blobs over multiple scale levels.
For learned detectors, the progress of learned methods in object detection has encouraged the exploration of similar techniques for keypoint detectors. FAST [34] was one of the first attempts to use machine learning to design a keypoint feature descriptor, and later work improved on this method [31,35,36]. Recently, many methods have been proposed that use CNNs to detect keypoints. TILDE [37] trained multiple patch-wise linear regression models to detect keypoints that are robust under severe weather and illumination changes. Georgakis [38] proposed a pipeline that automatically samples positive and negative pairs of patches from a region proposal network to jointly optimize point detections and their representations. LF-Net [39] estimated the position, scale and orientation of features by jointly optimizing the detector and descriptor.
For the keypoint detection of spacecraft, Park [18] directly used a CNN to regress the 2D coordinates of keypoints. Sharma [23] and Park [23] improved it by introducing multi-task learning. Chen [19] used HRNet [40], a CNN proposed to predict the pose of the human body, to predict keypoint heatmaps from the monocular image. However, he assigned a region of the same size to all keypoints, which is not appropriate across different relative distances. We introduce the idea of region detection into the keypoint detection task for spacecraft, where anchors of different sizes can fit different relative distances (Figure 1). At the same time, an effective uncertainty prediction is introduced for the detected keypoints, enabling end-to-end selection of accurate keypoints.

Method
The overall pipeline of our method is shown in Figure 2. We first selected 22 images from multiple views, manually obtained the 2D coordinates of each keypoint, and used the simulated annealing (SA) algorithm [41] to recover the 3D wireframe model of the spacecraft. For each input image, we first used a spacecraft detection network (SDN) to locate the area containing the spacecraft. The cropped image of the spacecraft was then fed into a keypoint detection network (KDN) to detect the positions of the keypoints. The KDN simultaneously estimates the detection uncertainty of each keypoint. We developed a strategy to select the more accurate keypoints as candidate keypoints. The reconstructed 3D coordinates and predicted 2D coordinates of all candidate keypoints were used to solve the 6D pose of the spacecraft through EPnP [24].

3D Wireframe Model Recovery
Given the intrinsic parameter matrix $K_c$ and the extrinsic parameters $R$ and $T$ of the monocular camera, if the 3D coordinate $p_{3D,k}$ of the $k$-th keypoint in the world coordinate system is known, we can obtain its 2D coordinate in the image. We selected 11 keypoints with good visibility. For each keypoint, we obtained its 2D coordinate manually from 22 images. For the $k$-th keypoint, the sum of the reprojection errors was minimized over the set of images in which that keypoint was visible. The optimal 3D coordinate of each keypoint is obtained by minimizing the objective function in Equation (1), where $R_i$ and $T_i$ represent the known camera extrinsic parameters, $p^h_{2D,i,k}$ represents the 2D coordinate of the $k$-th keypoint in the $i$-th image, and $p^h_{3D,k}$ represents the corresponding 3D coordinate. The superscript $h$ indicates that the point is expressed in homogeneous coordinates. $\lambda_{i,k}$ represents the scaling factor, which also needs to be solved for. $N$ is the number of selected images for the $k$-th keypoint. In the detailed definition of the symbols in Equation (1), $p^i_{**}$ represents an element of the matrix $P_i$, $(u_{i,k}, v_{i,k})$ represents the pixel coordinate of the $k$-th keypoint in the $i$-th image, and $(X_k, Y_k, Z_k)$ represents the 3D coordinate of the $k$-th keypoint in the world coordinate system.
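Under the pinhole camera model, one form of Equation (1) that is consistent with the symbols defined above is sketched below. The squared two-norm, the explicit minimization variables and the identification $P_i = K_c [R_i \mid T_i]$ are assumptions of this sketch, not necessarily the authors' exact expression.

```latex
\min_{p^h_{3D,k},\ \lambda_{1,k},\dots,\lambda_{N,k}}
\sum_{i=1}^{N} \left\| \lambda_{i,k}\, p^h_{2D,i,k} - P_i\, p^h_{3D,k} \right\|_2^2 ,
\qquad P_i = K_c \left[\, R_i \mid T_i \,\right],
\\[4pt]
p^h_{2D,i,k} = \left( u_{i,k},\ v_{i,k},\ 1 \right)^T ,
\qquad
p^h_{3D,k} = \left( X_k,\ Y_k,\ Z_k,\ 1 \right)^T .
```

With noise-free annotations each residual vanishes, which gives $\lambda_{i,k}\, p^h_{2D,i,k} = P_i\, p^h_{3D,k}$ for every image in which the keypoint is visible.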
Due to the presence of noise, no solution can make Equation (1) exactly zero. The most common approach is to use the least squares (LS) method to obtain the optimal solution. According to Equation (1), the $N$ images give $N$ linear equations, from which we can construct an over-determined linear system $As = b$ for $s = (X_k, Y_k, Z_k)^T$, where $A$ is a $2N \times 3$ matrix and $b$ is a $2N \times 1$ vector. The optimal solution is then obtained by LS as $s = (A^T A)^{-1} A^T b$. In this paper, we take into account that the manually chosen 2D coordinates of the keypoints may have different degrees of error in different images. For each keypoint, we therefore used only the 12 of the 22 images that give the smallest value of Equation (1) to obtain its 3D coordinates. We used SA [41] to obtain the 3D coordinates $p^h_{3D,k}$ and the scaling factors $\lambda_{i,k}$, and calculated the value of Equation (1) to select the best 12 images for each keypoint. In Section 4.6, we show that the SA method achieves a better solution than obtaining the optimal solution directly through LS.
After obtaining the wireframe model of the spacecraft, we can obtain the 2D coordinates of the keypoints in each image without manually labeling a large number of images for subsequent tasks.
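The LS baseline described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each image has a known 3×4 projection matrix `P` (taken here as $K_c [R_i \mid T_i]$), eliminates the scaling factor from $\lambda (u, v, 1)^T = P (X, Y, Z, 1)^T$ to obtain two linear equations per image, and solves the normal equations. The function names `project`, `triangulate_ls` and `solve3` are hypothetical.

```python
def project(P, X):
    """Project a 3D point X = (x, y, z) with a 3x4 matrix P; returns (u, v)."""
    xh = (X[0], X[1], X[2], 1.0)
    a, b, c = (sum(P[r][j] * xh[j] for j in range(4)) for r in range(3))
    return a / c, b / c


def solve3(M, y):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [list(M[i]) + [y[i]] for i in range(3)]
    for c in range(3):
        pivot = max(range(c, 3), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, 3):
            f = aug[r][c] / aug[c][c]
            aug[r] = [x - f * z for x, z in zip(aug[r], aug[c])]
    s = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(aug[i][j] * s[j] for j in range(i + 1, 3))
        s[i] = (aug[i][3] - tail) / aug[i][i]
    return s


def triangulate_ls(Ps, uvs):
    """Least-squares 3D keypoint from N views: s = (A^T A)^{-1} A^T b."""
    A, b = [], []
    for P, (u, v) in zip(Ps, uvs):
        # Two rows per image: one for the u coordinate, one for v.
        for coord, row in ((u, 0), (v, 1)):
            A.append([P[row][j] - coord * P[2][j] for j in range(3)])
            b.append(coord * P[2][3] - P[row][3])
    # Normal equations (A^T A) s = A^T b, a 3x3 system.
    M = [[sum(r[i] * r[j] for r in A) for j in range(3)] for i in range(3)]
    y = [sum(r[i] * bi for r, bi in zip(A, b)) for i in range(3)]
    return solve3(M, y)
```

With noisy manual annotations the recovered point depends on which images are included, which is the motivation given above for selecting a subset of 12 images per keypoint.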

Spacecraft Detection Network (SDN)
We used a spacecraft detection network (SDN) to automatically locate the spacecraft. Because a smaller model consumes fewer computing resources, we took the tiny version of YOLOX [42] as our SDN. The 2D bounding-box labels were obtained by projecting the 3D keypoints onto the image using the ground-truth poses. To ensure that the bounding boxes contain the whole spacecraft, we enlarged the boxes by 10% about their centers as our final labels.
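The label construction described above can be sketched as follows. This is an illustration under stated assumptions: the 10% enlargement is interpreted as scaling the box width and height by 1.1 about the box center, and the optional clipping to the image bounds is an addition of this sketch. The function name `enlarged_bbox` is hypothetical.

```python
def enlarged_bbox(points_2d, scale=1.10, image_size=None):
    """Axis-aligned box around projected keypoints, enlarged about its center.

    points_2d  : iterable of (u, v) pixel coordinates of projected keypoints
    scale      : enlargement factor for width and height (assumed 1.10 here)
    image_size : optional (width, height); if given, the box is clipped to it
    Returns (x_min, y_min, x_max, y_max).
    """
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]
    cx, cy = (min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0
    half_w = (max(xs) - min(xs)) / 2.0 * scale
    half_h = (max(ys) - min(ys)) / 2.0 * scale
    box = [cx - half_w, cy - half_h, cx + half_w, cy + half_h]
    if image_size is not None:
        width, height = image_size
        box = [max(0.0, box[0]), max(0.0, box[1]),
               min(float(width), box[2]), min(float(height), box[3])]
    return tuple(box)
```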

Keypoints Detection Network (KDN)
We treated each keypoint as a square region and used an anchor-based method to detect it. Unlike general object detection, which places rectangular boxes of different sizes at each pixel, our detection regions are square, so we placed only three square boxes of different sizes at each pixel to adapt to the different relative distances between the spacecraft and the camera. The framework of the KDN is shown in Figure 3. We used CSPDarknet [43] as the backbone to extract features at three scales from the input image. We used the feature pyramid network (FPN) [44] to complement the features between different scales and obtain refined features. Finally, all features were fed to the detection head for keypoint detection. For detection and classification, we minimized a loss function commonly used in object detection [43], in which $b_i$, $c_i$ and $C_i$ represent the box, keypoint class and confidence predicted by the KDN for the $i$-th image, respectively, and $\tilde{b}_i$, $\tilde{c}_i$ and $\tilde{C}_i$ represent the corresponding labels. $L_{reg}(\cdot)$ represents the MSE loss function, $L_{cls}(\cdot)$ and $L_{conf}(\cdot)$ represent the cross-entropy loss function, and $N$ represents the number of images in each batch. The predicted box $b_i$ and the label $\tilde{b}_i$ are defined by $(x_i, y_i)$, the pixel coordinates of the center point of the predicted box on the image, and by $w_i$ and $h_i$, the width and height of the predicted box, respectively. Symbols with the superscript $\sim$ represent the corresponding labels. $L_{reg}(b_i, \tilde{b}_i)$ is the MSE between the predicted and labeled box parameters. For $L_{cls}(c_i, \tilde{c}_i)$, both the predicted keypoint class $c_i$ and the label $\tilde{c}_i$ are 11-dimensional column vectors. Each element $c_{i,k}$ of $c_i$ represents the probability that the $k$-th keypoint exists in the box, and each element $\tilde{c}_{i,k}$ of $\tilde{c}_i$ represents the corresponding label. $L_{cls}(c_i, \tilde{c}_i)$ takes the form of a cross-entropy loss over these elements.
For the uncertainty prediction, we minimized a loss function $L_{uncertain}(U_i, \tilde{U}_i)$, where $U_i$ represents the predicted uncertainty, i.e., the probability of whether there is a target for each keypoint, and $\tilde{U}_i$ represents the corresponding label. In $L_{uncertain}(U_i, \tilde{U}_i)$, $U_{i,k}$ represents the predicted uncertainty for the $k$-th keypoint, and $\tilde{U}_{i,k}$ represents the corresponding label. The uncertainty label for the $k$-th keypoint is calculated from $IOU(\cdot)$, the intersection-over-union ratio of the predicted box $b_i$ and the ground-truth box $\tilde{b}_i$. $K$ is the number of keypoint classes. The subscript $k$ indicates that the variable is related to the $k$-th keypoint.
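The IOU term used for the uncertainty label can be computed directly from boxes in the (x, y, w, h) center format defined above. The sketch below shows only that building block; how the IOU value is mapped to the uncertainty label is not reproduced here. The function name `iou_xywh` is hypothetical.

```python
def iou_xywh(box_a, box_b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extents along each axis (zero when the boxes do not overlap).
    left = max(ax - aw / 2.0, bx - bw / 2.0)
    right = min(ax + aw / 2.0, bx + bw / 2.0)
    top = max(ay - ah / 2.0, by - bh / 2.0)
    bottom = min(ay + ah / 2.0, by + bh / 2.0)
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0.0 else 0.0
```

Because the keypoint regions are square, `w` and `h` coincide for each box, but the function does not rely on that.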
To guide the KDN toward the joint prediction of classification uncertainty and regression uncertainty, the overall loss function of our KDN combines the detection and classification loss with the uncertainty loss.
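The individual loss terms named in this section can be sketched for a single prediction as follows. This is a hedged illustration, not the authors' training code: the MSE is averaged over the four box parameters, the cross entropy is taken as a per-class binary cross entropy over the 11-dimensional class vector, and the clipping constant `eps` is an assumption added for numerical safety. The weighting of the terms in the overall loss is not reproduced here.

```python
import math


def mse_box_loss(pred_box, label_box):
    """Mean squared error between predicted and labeled (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, label_box)) / len(pred_box)


def bce_loss(probs, labels, eps=1e-7):
    """Binary cross entropy averaged over the elements of a probability vector."""
    total = 0.0
    for p, t in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)
```

The same `bce_loss` form could be applied to the confidence and uncertainty outputs, since the text describes them as probabilities with corresponding labels.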

Pose Estimation
After obtaining the 3D and 2D coordinates of the keypoints, we used EPnP [24] to solve the 6D pose of the spacecraft. To increase the accuracy of pose estimation, we developed a strategy that uses the predicted uncertainty to select the more accurate keypoints. The selection strategy consists of two sub-strategies: uncertainty threshold selection (UTS) and Top K.
For each category of keypoints, the detection with the lowest uncertainty was used as the final detected keypoint of that category. In the UTS strategy, among these eleven detected keypoints, we selected those whose uncertainty was less than a given threshold µ as candidate keypoints. In the Top K strategy, if the number of candidate keypoints was less than five, we directly used the five keypoints with the lowest uncertainty among the eleven detected keypoints as candidate keypoints, since four keypoints may be coplanar, which is detrimental to pose estimation. If the number of candidate keypoints was more than $K_n$, we kept the $K_n$ keypoints with the lowest uncertainty as candidate keypoints. All candidate keypoints were used to solve the 6D pose with EPnP [24]. The above procedure is described in Algorithm 1.

Algorithm 1. Keypoints selection strategy
Require: Keypoints with predicted uncertainty $p_{2D,i,k}$, $U_{i,k}$; uncertainty threshold µ; candidate keypoints set $C$; detected keypoints set $D$; and $K_n$.
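The selection procedure described in the Pose Estimation section can be sketched in Python as follows. This is a minimal reading of the text, not the authors' code: detections are assumed to arrive as `(category, (u, v), uncertainty)` tuples, and the names `select_keypoints`, `k_n` and `min_points` are hypothetical. The defaults µ = 0.5 and $K_n$ = 7 follow the values used in the experiments, and the minimum of five keypoints follows the coplanarity argument above.

```python
def select_keypoints(detections, mu=0.5, k_n=7, min_points=5):
    """Select candidate keypoints by predicted uncertainty.

    detections : list of (category, (u, v), uncertainty)
    Returns the candidate keypoints sorted by increasing uncertainty.
    """
    # Per category, keep the detection with the lowest uncertainty (set D).
    best = {}
    for category, uv, uncertainty in detections:
        if category not in best or uncertainty < best[category][1]:
            best[category] = (uv, uncertainty)
    detected = sorted(
        ((cat, uv, unc) for cat, (uv, unc) in best.items()),
        key=lambda d: d[2],
    )

    # UTS: keep keypoints whose uncertainty is below the threshold (set C).
    candidates = [d for d in detected if d[2] < mu]

    # Top K: fall back to the five most certain keypoints if too few remain,
    # and cap the candidate set at k_n keypoints if too many remain.
    if len(candidates) < min_points:
        candidates = detected[:min_points]
    elif len(candidates) > k_n:
        candidates = candidates[:k_n]
    return candidates
```

The returned 2D points, paired with the matching 3D wireframe coordinates, would then be passed to a PnP solver such as EPnP.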

Datasets and Implementation Details
We evaluated our method on the SPEED dataset [17], which contains 12,000 synthetic satellite images and five real satellite images provided by the Advanced Concepts Team (ACT) of the European Space Agency (ESA) for the 2019 pose estimation challenge [45,46]. Each image was annotated with the extrinsic parameters $R$ and $T$ of the camera. The difficulty of pose estimation varied from image to image, with varying degrees of light intensity, relative distance to the spacecraft, perspective occlusion, and background complexity (Figure 4). From the synthetic images of the SPEED dataset, we randomly selected 10,000 images as the training set and 1000 images as the validation set. The remaining 1000 synthetic images, together with the five real images, were used as the test set.
We took the methods of Park [18] and Chen [19] as our baselines. Their methods share a similar pipeline with ours, and the main difference is how the 2D coordinates of keypoints are predicted. Park [18] used CNNs to directly regress the 2D coordinates of keypoints, which is a regression-based method. Chen [19], in contrast, predicted a heatmap for each keypoint, indicating the probability of the keypoint appearing at each position, which is a heatmap-based method. We introduce the idea of region detection for the prediction of keypoint positions. By comparing our method with these two methods, we aim to show its advantage in the accuracy of 6D pose estimation. To ensure a fair comparison, all three methods used the data augmentation of Park and were trained with the Adaptive Moment Estimation (Adam) optimizer for 300 epochs with a learning rate of 0.001, a batch size of 48, a momentum of 0.9, and a weight decay of $5 \times 10^{-4}$.
In Section 4.3, we set $K_n$ and µ to 7 and 0.5, following Algorithm 1.

Evaluation Metrics
In order to quantitatively evaluate our final pose estimation results, we adopt the evaluation metrics provided by ESA to define the errors of estimation of translation, orientation and 6D pose.
For the $i$-th image, the pose estimation error is the sum of the orientation error $E_{R,i}$ and the translation error $E_{T,i}$, i.e., $E_i = E_{R,i} + E_{T,i}$. The translation error is calculated from $t_i$ and $\tilde{t}_i$, the predicted and real translation vectors, and the orientation error from $q_i$ and $\tilde{q}_i$, the predicted and real orientation vectors. Here $\|\cdot\|_2$ denotes the two-norm of a vector and $\langle\cdot,\cdot\rangle$ denotes the angle between two vectors. The mean pose estimation error on the test set is $\mathrm{mean}E = \frac{1}{N}\sum_{i=1}^{N} E_i$, where $N$ is the number of images in the test set. Similarly, we can calculate the mean and median of the other errors on the test set. We take the six resulting metrics, median$E_T$, median$E_R$, median$E$, mean$E_T$, mean$E_R$ and mean$E$, to evaluate the pose estimation results.
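The metrics can be sketched in Python as follows. The sketch assumes the definitions published for the ESA pose estimation challenge: a translation error normalized by the ground-truth distance, $\|t_i - \tilde{t}_i\|_2 / \|\tilde{t}_i\|_2$, and an orientation error in radians given by the rotation angle between the two unit quaternions, $2\arccos|q_i \cdot \tilde{q}_i|$. The function names are hypothetical.

```python
import math
from statistics import mean, median


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def translation_error(t_pred, t_true):
    """Normalized translation error ||t_pred - t_true|| / ||t_true||."""
    diff = [p - t for p, t in zip(t_pred, t_true)]
    return _norm(diff) / _norm(t_true)


def orientation_error(q_pred, q_true):
    """Rotation angle in radians between two quaternions (sign-invariant)."""
    dot = abs(sum(a * b for a, b in zip(q_pred, q_true)))
    dot /= _norm(q_pred) * _norm(q_true)  # tolerate non-unit inputs
    return 2.0 * math.acos(min(1.0, dot))


def pose_error(t_pred, t_true, q_pred, q_true):
    """Per-image pose error E_i = E_R,i + E_T,i."""
    return orientation_error(q_pred, q_true) + translation_error(t_pred, t_true)


def summarize(errors):
    """Mean and median of a list of per-image errors."""
    return mean(errors), median(errors)
```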

Comparison in Synthetic Images
In this section, we compare the three methods on the 1000 synthetic test images (Table 1). To show that our method maintains high accuracy while reducing the number of parameters, we reduced the size of the feature map output by the backbone by 25% and 50% to obtain the ours-small and ours-nano versions, respectively. Table 1 shows that our method performs much better than Park [18] and Chen [19] on all six metrics, except for the nano version. However, the number of parameters of our nano version is only about one-tenth of that of Chen's method [19], and the nano version is only slightly worse than Chen's on median$E_T$ and median$E$. This means that our nano version achieves comparable estimation accuracy with far less memory. Notably, compared with Chen [19], all three versions of our method reduce both the estimation error and the number of parameters, by up to 53.3% and 89.6%, respectively.

Comparison in Real Images
In this section, we compare the three methods on the five real images (Figure 5 and Table 2). Because of the large domain gap between the training set and the test set, the accuracy of all three methods declined. Some of the estimates from Chen's method [19] are especially poor (Figure 5c). Table 2 shows that the estimation error of our method is still much smaller than that of the other two methods, which indicates that our method generalizes better. Ours-small and ours-nano do worse than Park [18] on three metrics in Table 2. We attribute this to the small number of parameters limiting their generalization ability. However, both ours-small and ours-nano still achieve better estimates than Chen [19].

Performance with Different Background
In this section, we compare the three methods on images with different backgrounds (Figure 6c,d). Among the 1000 synthetic test images, 506 have Earth backgrounds with different degrees of complexity (Figure 4). We divided the test images into two groups, those with Earth backgrounds (EB) and those with pure black backgrounds (BB), and tested the estimation errors of the three methods on each group. Figure 6c,d shows that our method achieves better pose estimation than Park [18] and Chen [19] on both EB and BB.

Performance in Different Relative Distance
In this section, we compare the three methods on images with different relative distances to the spacecraft (Figure 6a,b). In the 1000 test images, we took 100 images as a group to divide the images of the test set into 20 groups in order of relative distance. We plotted the translation error and orientation error curves over the relative distance. Figure 6a,b shows that our method maintains very high prediction accuracy in each distance segment. Park's method [18] has a greater estimation error at both very short and very long distances. This is reasonable: when the spacecraft is very close, part of it often falls outside the camera's field of view, a form of occlusion that is a common challenge for object detection and segmentation [47]. When the spacecraft is far away, its features in the image become coarse, making it more difficult for the keypoint detection module to work well. Chen's method [19] has good translation accuracy in each distance segment, but its orientation error is still affected by very long or very short relative distances. Our method achieves stable translation and orientation accuracy over the full distance range, indicating that it is better able to resist target occlusion and to recover the features of small spacecraft.

Effective Uncertainty Prediction
We conducted an ablation study to show the effectiveness of our uncertainty prediction and keypoints selection strategy. According to Algorithm 1, we can apply the UTS strategy alone or the Top K strategy alone. If neither strategy was applied, we used all eleven keypoints to estimate the pose with EPnP [24]. Here, we set µ and $K_n$ to 0.5 and 7 (Top 7), following the analysis in Section 4.7. Table 3 shows that each strategy separately improves the accuracy of pose estimation, and that our complete keypoints selection strategy achieves the best estimation. We show four cases that demonstrate the effectiveness of our uncertainty prediction and keypoints selection strategy in Figure 7. The lower right corner of each case marks the percentage reduction in the three types of estimation error after removing the detection points in red. Our selection strategy succeeded in selecting accurate keypoints through effective uncertainty prediction, reducing the error of pose estimation.
Figure 7. Uncertainty prediction helps reduce the pose estimation error. The blue points represent the keypoints that we retained for pose estimation, the red points represent the keypoints that we eliminated due to high uncertainty, and the yellow points represent the true positions of the eliminated keypoints. The green dotted lines connect the corresponding yellow and red points.
Although Chen [19] proposed an iterative trial-and-error method to remove some detected keypoints, that method increases the time cost of the entire pose estimation process. Our method achieves the removal through the uncertainty predicted by the network.

Comparison between SA and LS
To verify the advantage of using SA to solve the optimization problem in Equation (1), we recovered a new 3D wireframe model through LS from all 22 images and analyzed the change in accuracy of the three versions of our method. Table 4 shows that all six error metrics of the three versions increase when LS is used to recover the wireframe model. Our SA method helps obtain a more accurate 3D wireframe model under the noise introduced by the manual annotation of the images.

Hyperparameters Analysis
We conducted a hyperparameter analysis to study how $K_n$ and µ in Algorithm 1 affect the pose estimation of our method.
We first analyzed how the choice of $K_n$ affects the performance of our method. Although the EPnP algorithm [24] requires only more than three keypoints, our detection result may contain four coplanar points, which is fatal to EPnP, so we analyzed the change of the three estimation errors as $K_n$ varies from five to eleven (Figure 8). Here we set µ to 0.5. All three versions of our method show the same pattern. As $K_n$ increases from five to seven, the estimation error decreases gradually, which is reasonable because larger point sets introduce redundancy and reduce the sensitivity to noise [24]. However, as $K_n$ increases from seven to eleven, the estimation errors of all three versions increase. We consider that, compared with the top seven keypoints, the errors introduced by the last four keypoints are too large to improve the accuracy, which also supports the validity of our uncertainty prediction.
We then used the full (ours) version to analyze how µ affects the pose estimation of our method. Here, we set $K_n$ to 7, which works best. Figure 9 shows that all three errors decrease as µ decreases, indicating that the uncertainty our KDN predicts for each keypoint is positively correlated with its detection error. We refer to the keypoints screened out by our keypoints selection strategy as refused keypoints. When µ decreases from 1.0 to 0.4, the average number of refused keypoints remains generally unchanged. When µ decreases to 0.2, this number begins to rise rapidly, which means that our method fails to complete the pose estimation for the corresponding images, since the number of available keypoints does not meet the requirement of the EPnP algorithm [24]. Therefore, in practical applications, it is necessary to consider the trade-off between the continuity and accuracy of 6D pose estimation.

Conclusions
In this paper, we proposed a monocular pose estimation framework for space-borne objects, such as spacecraft. Our main contribution is to introduce the idea of region detection into the task of spacecraft keypoint detection and to use the keypoint uncertainty predicted by our KDN to automatically select the keypoints with higher prediction accuracy for estimating the 6D pose of the spacecraft. Our method achieves a 53.3% reduction in pose estimation error while also reducing the number of network parameters.
In future work, we will study how to adaptively choose the value of $K_n$ in the Top K strategy to achieve a more effective trade-off between estimation precision and computational efficiency.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Nomenclature
The following nomenclature is used in this manuscript:
$\tilde{c}_i$    Ground-truth category in the $i$-th image
$C_i$    Predicted confidence in the $i$-th image
$\tilde{C}_i$    Ground-truth confidence in the $i$-th image
$U_i$    Predicted uncertainty in the $i$-th image
$\tilde{U}_i$    Ground-truth uncertainty in the $i$-th image
$K$    The number of keypoint categories
µ    Uncertainty threshold
$C$    Candidate keypoints set
$D$    Detected keypoints set
$K_n$    The number of keypoints used for pose estimation
$q_i$    Predicted orientation in the $i$-th image
$\tilde{q}_i$    Ground-truth orientation in the $i$-th image
$t_i$    Predicted translation in the $i$-th image
$\tilde{t}_i$    Ground-truth translation in the $i$-th image