Weakly Supervised Pose Estimation of Surgical Instrument from a Single Endoscopic Image

Instrument pose estimation is a key demand in computer-aided surgery, and its main challenges lie in two aspects: Firstly, the difficulty of obtaining stable corresponding image feature points due to the instruments’ high refraction and complicated background, and secondly, the lack of labeled pose data. This study aims to tackle the pose estimation problem of surgical instruments in the current endoscope system using a single endoscopic image. More specifically, a weakly supervised method based on the instrument’s image segmentation contour is proposed, with the effective assistance of synthesized endoscopic images. Our method consists of the following three modules: a segmentation module to automatically detect the instrument in the input image, followed by a point inference module to predict the image locations of the implicit feature points of the instrument, and a point back-propagatable Perspective-n-Point module to estimate the pose from the tentative 2D–3D corresponding points. To alleviate the over-reliance on point correspondence accuracy, the local errors of feature point matching and the global inconsistency of the corresponding contours are simultaneously minimized. Our proposed method is validated with both real and synthetic images in comparison with the current state-of-the-art methods.


Introduction
Robot-assisted minimally invasive surgery has become an important and rapidly developing field due to its enormous application potential, and surgical instrument pose estimation is a key demand within it. However, instrument pose estimation faces two significant challenges: the difficulty of obtaining stable feature points for surgical instruments, due to their lack of texture and the high reflectivity of their surface materials, and the absence of labeled pose data for surgical instruments in monocular endoscopic images. Due to these two difficulties, current surgical instrument pose estimation techniques often rely on additional positioning markers or control points, or on binocular or depth cameras to provide annotation information [1,2]. For example, to enhance active research on surgical tool identification and tracking, Lajkó et al. [3] utilize the surgeon's visual feedback loop to compensate for robot surgery inaccuracies. Attanasio et al. [4] identify anatomical structures and provide contextual awareness in the surgical environment, while [5,6] improve robotic surgical precision with the help of surgical phase recognition. Kitaguchi et al. [7] employed a convolutional neural network to achieve high-precision automatic recognition of surgical actions in laparoscopic images. Nguyen et al. [8] used a CNN-LSTM neural network model and an IMU sensor-based automatic evaluation system to classify and regress motion data in the JIGSAWS [9] dataset. Pan et al.
[10] proposed an automatic surgical skill evaluation framework based on visual motion tracking and deep learning. That method introduced a Kernelized Correlation Filter (KCF) to capture key motion signals from the robot. In sum, quite a number of learning-based methods have been applied to related surgical vision tasks. Unlike these other applications, however, endoscopic image-based pose estimation of surgical instruments has special difficulties, as we mentioned at the very beginning of this work: mainly, the lack of stable feature points on surgical instruments, their strong specular reflection, and the lack of labeled training data. To tackle these difficulties, we propose a weakly supervised neural network framework that estimates the pose from the surgical instrument contour in the endoscopic image and uses synthetic image contours and key feature points of the instrument to assist the network training. In particular, a 2D feature point inference module is proposed to predict the corresponding 2D feature points from the segmented instrument contour in the image, and a back-propagatable PnP network module [38] and a differentiable rendering module [39] are used to estimate the pose.
To train our network effectively, we employ a two-phase training strategy. We train the segmentation module separately from the other modules, with the help of TernausNet-16 [40]. Our focus is on training the remaining network modules after the segmentation network, which take segmentation mask images as input and output pose estimates. In summary, the main contributions of this work are as follows: (1) we propose a weakly supervised pose estimation method for surgical instruments from single endoscopic images, assisted by synthesized endoscopic images of surgical instruments; (2) both the local discrepancy errors of the predicted feature points and the global inconsistency of the whole contour are taken into account in the training phase; (3) our proposed method outperforms many state-of-the-art methods on large real and synthetic datasets; and (4) we create a synthetic surgical instrument contour image dataset of about 50,000 samples with corresponding pose ground truth, providing data support for future research.

Materials and Methods
The method proposed in this paper is outlined in the following steps: Firstly, we utilized an offline contour-based 3D reconstruction method to obtain 3D models of surgical instruments and selected 8 implicit feature points by the farthest point sampling method. Then, we imported these 3D models into the Unreal Engine 5 (UE5) virtual engine to generate pose-labeled synthetic segmentation image training data.
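For illustration, the farthest point sampling step can be sketched as below; the function name and the choice of seed vertex are our assumptions, not details from the paper.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedily pick k well-spread points from an (N, 3) vertex array:
    each new point is the one farthest from the points chosen so far."""
    points = np.asarray(points, dtype=float)
    chosen = [0]                                   # arbitrary seed vertex
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))                 # farthest from the chosen set
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]
```

With k = 8, this spreads the eight implicit feature points over the instrument model surface.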
Secondly, to effectively leverage the auto-generated training data, we utilized a surgical instrument image segmentation module to preprocess the real surgical scene image. This module employs a TernausNet-16 structure to infer the foreground mask image of surgical instruments from real surgical scene images.
Finally, our instrument pose estimation network was trained on the auto-generated labeled data. The network consists of cascaded modules: a backbone module consisting of alternately cascaded channel-space tensor self-attention layers and multi-resolution convolution layers to infer heatmap-style features, an implicit point projection coordinate regression module, and a back-propagatable PnP (BPnP) and differentiable rendering pose estimation module.
The overall framework of the proposed method in this paper is shown in Figure 1.
In the following section, the main modules will be elaborated.

Synthetic Training Data Generation
To address the stable feature tracking challenge caused by the lack of texture and the specular reflection of surgical instrument surfaces, our method uses the foreground mask image of an instrument to infer the instrument's pose. With the foreground mask image as the input of the pose estimation network, disturbances caused by drastic changes in image appearance can be eliminated, leading to more robust pose inference. We utilized the UE 5 virtual engine to generate foreground mask image training data.

We used a structure-from-contour-based method to reconstruct the 3D models of three kinds of typical instruments: hooks, clippers, and suction, which are widely used in endoscopic minimally invasive surgery. After the 3D model reconstruction, eight implicit feature points were selected for each 3D model. Then, we imported the 3D models of the instruments into the UE 5 virtual engine and performed uniform sampling within a reasonable range of the SE(3) pose space of the virtual camera. At each virtual camera pose sample point, a surgical instrument's foreground mask image was generated by virtual engine rendering, with the projections of the implicit feature points being calculated and the sampled pose parameters being recorded. We augmented the training dataset by adding noise to the generated images. The comparison between a captured real surgical instrument image and a rendered image with normal material and lighting in the UE 5 virtual engine is shown in Figure 2. Samples of the generated training image data are shown in Figure 3.
We generated training data for three types of instruments: hooks, clippers, and suction. The dataset contains 50,000 samples, and each data sample includes a synthetic foreground mask image, its corresponding pose label, the implicit feature points' 3D coordinates, and their 2D projections on the image. We utilized this dataset to train our model without further manual labeling effort, such that our model could be trained in a weakly supervised manner.
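A minimal sketch of what one generated record contains is given below; the mask rendering itself is done by UE5 and omitted here, and the function and field names are our assumptions.

```python
import numpy as np

def project_points(z, R, t, K):
    """Pinhole projection of (n, 3) model points z under pose (R, t)."""
    cam = z @ R.T + t                  # transform into camera coordinates
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def make_sample(R, t, K, feature_points):
    """One training record: pose label, the implicit feature points'
    3D coordinates, and their 2D projections (mask image omitted)."""
    return {
        "pose": (R, t),
        "points_3d": feature_points,
        "points_2d": project_points(feature_points, R, t, K),
    }
```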

Instrument Segmentation Preprocessing
To better transfer the pose estimation network trained on synthetic data to the pose estimation of surgical instruments in real surgical scenes, we employed a surgical instrument segmentation network module before our main pose estimation network. The TernausNet-16 network, a popular one in the field of image segmentation, was used in this work. TernausNet-16 is an improvement and optimization of the U-Net [41] architecture. Its encoder replaces the original design with VGG16 [42], which features a deeper architecture that can capture richer and more abstract feature representations. This design gives TernausNet-16 stronger feature extraction and transfer learning capabilities, as VGG16 is typically pretrained on large-scale image classification datasets, allowing it to learn rich, general-purpose image features. Additionally, TernausNet-16 employs skip connections and deconvolution operations to enhance feature fusion between the encoder and decoder while retaining U-Net's advantages in image segmentation. To further boost the model's nonlinear capabilities, TernausNet-16 adopts batch normalization and Rectified Linear Unit (ReLU) activation functions, thereby improving its expressiveness and generalization.
By utilizing foreground segmentation, we eliminated the interference caused by image appearance variations. We evaluated the performance of the segmentation module using the Dice coefficient and Jaccard index metrics to ensure that it effectively provided clear and accurate input for the subsequent pose estimation task. Our experiments on real surgical scene data show that our method stably and robustly estimates the pose of surgical instruments, even in the presence of partial errors in the segmentation results.
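The two segmentation metrics mentioned above have standard definitions on binary masks, sketched here with a hypothetical function name:

```python
import numpy as np

def dice_jaccard(pred, gt):
    """Dice coefficient and Jaccard index for two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum())   # 2|A∩B| / (|A|+|B|)
    jaccard = inter / union                        # |A∩B| / |A∪B|
    return dice, jaccard
```

The Jaccard index is the same quantity as the IOU used later for the real-scene evaluation.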
In the main contour-based pose estimation network, we adopted a two-stage framework to estimate instrument pose from surgical instrument foreground mask images. At the first stage, our main framework estimated the 2D projection points of the implicit model feature points; at the second stage, the pose was predicted from the projection points and the contour points of the foreground mask.

Multi-Scale Heatmap Feature Generation
The first part of our network is the feature extraction net, a multi-scale heatmap feature module, which consists of two sub-modules: a multi-resolution convolution module and a channel-space tensor self-attention (CSTSA) net, inspired by the polarized self-attention network [43], as shown in Figure 4.
The multi-resolution convolution module is a four-stage cascaded convolution network that generates both high-resolution and low-resolution features. The first stage consists of a high-resolution convolution branch at one-fourth of the original image resolution. The following three stages gradually expand parallel branches, with the resolution of each newly expanded branch being half of the lowest resolution of the previous stage. The features of different resolutions are then fused into a unified feature before being input to the next stage, to preserve both the high-resolution local features and the low-resolution global features. After the last multi-resolution convolution stage, a feature map is generated by a Softmax layer.
To fully utilize the dependency information between different parts of the intermediate features after the third multi-resolution convolution stage, we leverage a channel-space tensor self-attention network. The channel-space tensor self-attention network treats the intermediate features as a tensor and extracts channel-U attention, channel-V attention, and space-only attention. The channel-U and channel-V attentions act on the spaces spanned by the channel direction vector and one spatial base vector. The architecture of the CSTSA module is shown in Figure 5. The two attention sub-modules extract the dependencies between space and channel, ensuring that our model generates heatmap features that contain rich multi-scale information and fuse spatial and channel correlation information. We set the heatmap loss to supervise our heatmap as follows:

l_h = MSE(m, m*),  (1)

where MSE(·, ·) is the mean squared error function in Equation (1), m is the estimated feature point heatmap, and m* is the ground truth feature point heatmap. With our synthetic dataset, as the ground truths of the projections of the implicit feature points are known, the ground truth feature point heatmap is generated by setting impulses at all the points' ground truth positions and convolving them with a Gaussian kernel with parameter σ = 8 pixels.
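A sketch of this ground-truth heatmap generation follows, writing the Gaussian directly instead of convolving an impulse (equivalent up to normalization); one heatmap channel per feature point is our assumption.

```python
import numpy as np

def gt_heatmap(points_2d, height, width, sigma=8.0):
    """Ground-truth heatmap: a Gaussian bump of std sigma (pixels)
    centred at each feature point's ground-truth projection (u, v)."""
    ys, xs = np.mgrid[0:height, 0:width]
    m = np.zeros((len(points_2d), height, width))
    for i, (u, v) in enumerate(points_2d):
        m[i] = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
    return m
```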


Implicit Feature Point Projection Regression
After the extraction of the feature heatmaps, we utilized a simple implicit feature point projection regression module to predict the coordinates. This module is a differentiable network that regresses the pixel coordinates from the heatmap; its architecture is shown in Figure 6.
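The paper does not spell out the regression module's internals; a common differentiable way to regress pixel coordinates from a heatmap is a soft-argmax, sketched here as an illustrative stand-in.

```python
import numpy as np

def soft_argmax(heatmap):
    """Differentiable coordinate regression: the expected (u, v)
    position under a softmax-normalised heatmap."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())   # stable softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return (xs * p).sum(), (ys * p).sum()  # (u, v) in pixels
```

Unlike a hard argmax, this expectation is smooth in the heatmap values, so gradients can flow back into the feature extraction net.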

Surgical Instrument Pose Estimation
After the inference of the projected implicit feature points, we utilized the back-propagatable PnP (BPnP) to estimate the pose of the surgical instrument, solving the originally non-differentiable implicit function through gradient-based back-propagation. The BPnP loss function is the sum of the reprojection errors of the implicit feature points, as shown in Equation (2):

l_p(x, y) = Σ_i ||r_i||², with r_i = π(z_i, y, K) − x_i,  (2)
where x is the inferred projections of the implicit feature points, x* is the ground truth of the projections, y is the pose parameters, z is the 3D implicit feature points, K is the intrinsic matrix of the camera, r_i is the reprojection error of the i-th implicit feature point, and π is the function projecting a 3D point onto the image plane. We leverage the zero-point theorem to convert the task of solving for the pose parameters by minimizing the reprojection loss into solving the explicit equation derived from the partial derivatives of the reprojection loss with respect to the pose parameters, as shown in Equation (3):

f(x, y) = ∂l_p/∂y = 0.  (3)
According to the implicit function theorem, the partial derivative of the pose parameters with respect to the implicit feature point coordinates can be computed as follows:

∂y/∂x = −[∂f/∂y]⁻¹ [∂f/∂x].  (4)

Thus, the explicit expression of the gradient of the reprojection loss with respect to the network parameters θ is as follows:

dl_p/dθ = (∂l_p/∂y)(∂y/∂x)(∂x/∂θ).  (5)

By minimizing the final loss function, this gradient is back-propagated end-to-end to supervise the network parameters of the multi-scale heatmap feature module and the implicit feature point projection regression module, and finally, the surgical instrument pose is estimated.
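The implicit-function-theorem trick behind BPnP can be checked on a one-dimensional toy problem; the objective f(x, y) = (y − 2x)² below is our illustrative assumption, not the actual BPnP loss.

```python
# Miniature of the BPnP idea: y*(x) minimises f(x, y), so df/dy = 0 at the
# optimum, and the implicit function theorem differentiates the solution
# without differentiating the solver:
#   dy*/dx = -(d2f/dy2)^-1 (d2f/dydx)
# Toy objective: f(x, y) = (y - 2x)**2, whose minimiser is y*(x) = 2x.

def y_star(x):
    return 2.0 * x                 # closed-form minimiser of the toy f

def dy_dx_via_ift(x):
    d2f_dy2 = 2.0                  # second derivative of f in y
    d2f_dydx = -4.0                # mixed second derivative of f
    return -d2f_dydx / d2f_dy2     # implicit-function-theorem gradient
```

The analytic value 2 matches the slope of y*(x) directly, which is exactly what lets the pose solver's output be treated as differentiable in the network.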
To make full use of the mask contour information, we also set a differentiable loss function to enhance global contour consistency; the foreground rendering loss is defined as follows:

l_f = MSE(m_r, m_s),  (6)

where m_r is the rendered foreground image and m_s is the segmented foreground image.
We set the final loss as the weighted sum of the reprojection loss, the foreground consistency loss, and the heatmap loss, as in Equation (7). By minimizing this final loss, we trained our surgical instrument pose estimation network in a weakly supervised manner. Our model utilizes synthetic foreground mask image data generated by a virtual engine as the training set and tackles the problems of stable tracking and the lack of pose annotation through weakly supervised learning. In the real-data experiment, our method is validated to be robust in estimating the surgical instrument pose, even for instruments with slightly different 3D shapes. In Equation (7), the hyperparameter α is set to 0.00005, and β is set to 1.2.
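As a sketch, the final loss combines the three terms; the paper does not state which weight multiplies which term, so the assignment below is our assumption.

```python
def final_loss(l_p, l_h, l_f, alpha=5e-5, beta=1.2):
    """Weighted sum of reprojection (l_p), heatmap (l_h), and foreground
    rendering (l_f) losses. Which weight scales which term is an
    assumption; the paper only reports alpha = 0.00005 and beta = 1.2."""
    return l_p + alpha * l_h + beta * l_f
```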

Results
To verify the validity of our method, experiments on both real surgery scenes and synthetic data were carried out. For the real surgery scene experiments, we tested our method on a self-collected minimally invasive lung lobe resection endoscopic surgery dataset, collected in collaboration with the Renmin Hospital affiliated with Peking University. For the synthetic data experiments, we tested our method on instrument data generated with the UE 5 virtual engine. We compared the results of our method with current state-of-the-art single-view Red Green Blue (RGB) image-based object pose estimation techniques on various popular metrics. To evaluate the impact of each submodule of our method, we conducted an ablation study on both real surgery scene data and synthetic data.

Metrics
The commonly used metrics Average Distance of 3D points (ADD)(S)(0.1 d) [44,45], 2D projection error (5 px) [46], and 2D projection error (3 px) were used for the synthetic data experiments. The intersection over union (IOU) was used for the real-scene data experiments. ADD refers to the average distance between the 3D model points transformed by the ground truth pose and those transformed by the predicted pose. Typically, if the average distance of the model points is less than 10% of the model diameter, the pose estimation is considered correct. ADD-S is a variant for measuring symmetric objects. The 2D projection error refers to the average distance between the 2D feature points obtained by projection under the ground truth pose and those under the predicted pose. The definitions of these metrics are shown in Equations (8)-(11).
ADD = (1/m) Σ_{x_i ∈ M} ||(Rx_i + t) − (R̂x_i + t̂)||,  (8)

ADD-S = (1/m) Σ_{x_i ∈ M} min_{x_j ∈ M} ||(Rx_i + t) − (R̂x_j + t̂)||,  (9)

Proj2D = (1/m) Σ_{x_i ∈ M} ||π(K(Rx_i + t)) − π(K(R̂x_i + t̂))||,  (10)

IOU = |A ∩ B| / |A ∪ B|,  (11)

where x_i is a 3D model point of the object, M is the set of model points, m is their number, R̂ and t̂ are the predicted rotation matrix and translation vector, K is the camera intrinsic matrix, R and t are the ground truth rotation matrix and translation vector, x_j is the nearest corresponding point of the symmetric object, π denotes perspective projection onto the image, and A and B represent the foreground masks of the instrument reprojection and segmentation, respectively.
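The ADD and ADD-S metrics referenced in Equations (8) and (9) can be sketched as follows; function names are illustrative.

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between model points under the ground-truth
    and predicted poses (matched index to index)."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: for symmetric objects, match each ground-truth point to
    its nearest predicted point before averaging."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

A pose counts as correct under ADD(S)(0.1 d) when the returned value is below 10% of the model diameter.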

Implementation Detail
We scaled the input image to a 640 × 480 resolution for both training and testing, as in [47]. We trained the network for 130 epochs. The initial learning rate was set to 2 × 10⁻⁵. After 50 epochs, it was adjusted to 5 × 10⁻⁶; after 100 epochs, to 2 × 10⁻⁶; and after 120 epochs, to 2 × 10⁻⁷. To accelerate iterative training, the first 50 epochs were trained using only the heatmap loss. The following 70 epochs were trained using a combination of the heatmap loss and the reprojection error of the implicit feature points. The final 10 epochs were trained using the final loss. We used Adam as our optimizer, with a momentum of 0.9 and a weight decay of 1 × 10⁻³.
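Reading the reported rates as negative powers of ten (a learning rate of 2 × 10⁵ would be implausible), the step schedule can be sketched as:

```python
def learning_rate(epoch):
    """Step schedule: 2e-5 initially, 5e-6 after epoch 50,
    2e-6 after epoch 100, 2e-7 after epoch 120."""
    if epoch < 50:
        return 2e-5
    if epoch < 100:
        return 5e-6
    if epoch < 120:
        return 2e-6
    return 2e-7
```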
The experimental results are reported as follows:

Real Surgery Scene Data Experiments
On our self-collected lung endoscopic surgery dataset, comprising 2000 image frames extracted from real surgical video recordings, we predicted the pose of three kinds of instruments: hook, suction, and clippers, using our method trained on our synthetic dataset of 50,000 pose-labeled images. Figure 7 illustrates some examples of the reprojected masks of the three kinds of instruments under the poses inferred by our method. In Figure 7, the blue points on the mask are the estimated feature points, and these images show that the reprojected masks closely overlay the instruments, which indicates that the estimated poses are accurate.
As pose ground truth is unavailable for the real-scene data, we compared our method with state-of-the-art methods on the IOU metric. The comparison methods include the following: Segmentation-driven-pose [20], Transpose + RansacPnP [48], Higherhrnet + RansacPnP [49], and PVNet [26]; the results are shown in Table 1. The results show that our method outperforms the others by a significant margin.

Synthetic Data Experiments
In our synthetic dataset experiments, we compared our method with those methods on the ADD(S)(0.1 d), 2D projection error (5 px), and 2D projection error (3 px) metrics, as the synthetic images contain the true pose information. Our dataset contains 50,000 images of 3 kinds of instruments in different poses (roughly the same number of images for each instrument). We trained the compared models on 90% of the data from the dataset and tested them on the remaining 10%. The results of the comparisons are shown in Table 2. The results show that the poses estimated by our method are significantly more accurate on all three metrics.

Ablation Study
In order to investigate the effect of each module of our method, we conducted an ablation study with both real-scene and synthetic data. We employed various combinations of the multi-resolution convolution (MrC) module, the channel-space tensor self-attention (CSTSA) module, the BPnP module, and the differentiable rendering pose estimation (DRPE) module to assess the impact of the different modules; the results are shown in Tables 3 and 4. The ablation study shows that each module has a positive impact on the results. More specifically, the DRPE module contributes the most to the IOU metric in the real experiments. For the synthetic data, all our modules contribute positively to the ADD(S) metric, and DRPE and CSTSA have the greatest impact on the 2D projection metric.

Conclusions and Discussion
In this work, we have proposed a new weakly supervised method for 6D pose estimation of surgical instruments from single endoscopic images. Our method requires only virtual-engine-generated synthetic instrument foreground masks as training data, without extensive manual labeling effort. Our method successfully addresses the common challenges of stable feature tracking and the lack of labeled data in real surgical instrument pose estimation. Experiments on both self-collected real surgical instrument datasets and synthetic data demonstrate that, compared to state-of-the-art methods, our method exhibits enhanced robustness and reliability in pose estimation tasks. Our work offers a new solution for surgical instrument pose estimation and also provides valuable insights for weakly supervised object pose estimation problems in other domains.
Since our method uses mask images for pose estimation, it is less affected by illumination changes and the lack of salient textures on surgical instruments. However, contour-based methods also have inherent weaknesses: for example, how should the implicit feature points on the instrument be chosen so that they are visible under most endoscope poses, and, if some of the model points are invisible, can a visible subset of feature points still reliably predict the pose?

Figure 1 .
Figure 1. The overall framework of our weakly supervised surgical instrument pose estimation method. Three steps are involved. Step 1: synthetic data generation. Step 2: surgical instrument segmentation of real scenes. Step 3: segmentation-based surgical instrument pose estimation.


Figure 2 .
Figure 2. Comparison between a captured real surgical tool image and a rendered image with normal material and lighting in UE 5 virtual engine.


Figure 4 .
Figure 4. The multi-scale heatmap feature module to generate the feature map. The colors of the blocks indicate the feature map size. The yellow feature map is 1/4 the size of the original image, with 32 channels. The orange feature is 1/8 the size of the original image, with 64 channels. The pink feature is 1/16 the size of the original image, with 128 channels. The red feature is 1/32 the size of the original image, with 256 channels.


Figure 6 .
Figure 6. Architecture of the implicit feature point projection regression module.


Figure 7 .
Figure 7. Examples of the reprojected masks of hook, suction, and clippers under the poses inferred by our method.


Table 1 .
Comparison of our method with state-of-the-art methods on the IOU metric.

Table 3 .
Comparison of different combinations of our modules on the ADD(S)(0.1 d), 2D projection(5 px), and (3 px) metrics, with synthetic data.

Table 4 .
Comparison of different combinations of our modules on the IOU metric with real data.