CT-Video Matching for Retrograde Intrarenal Surgery Based on Depth Prediction and Style Transfer

Abstract: Retrograde intrarenal surgery (RIRS) is a minimally invasive endoscopic procedure for the treatment of kidney stones. Traditionally, RIRS is performed by reconstructing a 3D model of the kidney from preoperative CT images in order to locate the kidney stones; the surgeon then finds and removes the stones in the endoscopic video based on experience. However, due to the many branches within the kidney, it can be difficult to relocate each lesion and to ensure that all branches are searched, which may result in the misdiagnosis of some kidney stones. To avoid this situation, we propose a convolutional neural network (CNN)-based method for matching preoperative CT images and intraoperative videos for the navigation of ureteroscopic procedures. First, pairs of synthetic images and depth maps reflecting preoperative information are obtained from a 3D model of the kidney. Then, a style transfer network is introduced to transfer the ureteroscopic images into the style of the synthetic images, from which the associated depth maps can be generated. Finally, the fusion and matching of the depth maps of preoperative images and intraoperative video images are realized based on semantic features. Compared with the traditional CT-video matching method, our method achieves a five-fold improvement in time performance and a 26% improvement in top-10 accuracy.


Introduction
Image-guided endoscopic navigation has been a hot topic in surgical navigation research, as it can provide visual aids to clinicians during interventional surgery. Retrograde intrarenal surgery (RIRS) is one such image-guided procedure for the treatment of kidney stones, performed through a ureteroscope. RIRS has become one of the important methods of treating kidney stones, especially for the removal of large calculi. In traditional RIRS, surgeons usually reconstruct a 3D model of the kidney from preoperative CT to determine the location of the kidney stones and understand the structure of the kidney, and then find and remove the stones through intraoperative endoscopic video images based on experience. Because the kidney has many internal branches, it can be difficult to locate stones and guarantee that all branches have been searched during surgery, which may cause misdiagnosis. It is therefore important to assist the surgeon in RIRS by locating the ureteroscope through image navigation. At present, the fusion of preoperative CT images and intraoperative video images has become a popular solution.
There are mainly two ways to realize the fusion of preoperative CT and intraoperative video: 2D-2D registration [1][2][3] and 3D-3D registration [4][5][6][7]. However, both have limitations in RIRS. Traditional 2D-2D registration methods, based on the global pixel values of grayscale images, are time-consuming. 3D-3D registration methods perform matching or registration after reconstructing a point cloud from video images; however, the accuracy of point cloud reconstruction is heavily affected by kidney stones, water, bubbles, floccules, and other impurities in RIRS, so these methods are not suitable for the RIRS scene.

Materials and Methods
Given the limitations of the traditional methods described above, this paper proposes a new method of fusing preoperative CT and intraoperative video information to solve the CT-video matching problem in ureteroscopic surgical guidance. Inspired by the application of deep learning and depth maps in the field of endoscopic guidance [8,9], we propose to use depth maps as the intermediate connecting medium to match preoperative CT with intraoperative video. Our proposed method can be specified through the following steps.
(1) We first reconstructed a 3D model of the kidney based on CT images. (2) We used a virtual camera to simulate the movement path of a real ureteroscope to generate pairs of images. Each image pair consists of one simulated image (SI) taken by the virtual camera and its corresponding depth map (DM). We refer to this paired dataset of simulated images and depth maps as SI-DM. Based on the SI-DM dataset, we trained a model to predict the depth of simulated images. (3) We trained a model to transfer the style of endoscopic images (EI) into the style of the simulated images mentioned in (2), which allows us to indirectly obtain the endoscopic images' depth maps. (4) Finally, by extracting features of the depth maps from SI and EI and calculating their similarity, we realized CT-video matching, based on the depth maps from (2) and (3), to avoid the misdiagnosis of kidney stones.
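The four steps above can be sketched end-to-end as follows. All function names are hypothetical placeholders, and each trained model is stood in for by a trivial stub, so this is an illustration of the data flow only, not the authors' implementation.

```python
import math

# --- Hypothetical stand-ins for the trained models (names are illustrative) ---

def predict_depth(image):
    """Stub depth-prediction model: maps an image (2D list) to a depth map."""
    return [[1.0 / (1.0 + px) for px in row] for row in image]

def transfer_style(endoscopic_image):
    """Stub style-transfer model: maps a real frame into the simulated domain."""
    return [[min(px, 1.0) for px in row] for row in endoscopic_image]

def extract_features(depth_map):
    """Stub feature extractor: flattens the depth map into a vector."""
    return [px for row in depth_map for px in row]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Steps (1)-(2): preoperative side -- simulated images (SI) and their features
si_dataset = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
si_features = [extract_features(predict_depth(si)) for si in si_dataset]

# Steps (3)-(4): intraoperative side -- real frame -> SI style -> depth -> features
endoscopic_frame = [[0.5, 0.6], [0.7, 0.8]]
ei_features = extract_features(predict_depth(transfer_style(endoscopic_frame)))

# Match the frame against the preoperative feature database
best = min(range(len(si_features)),
           key=lambda i: euclidean(ei_features, si_features[i]))
```

With these toy stubs, the endoscopic frame matches the second simulated image, since the stubbed transfer leaves it identical to that SI entry.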
In this method, both the depth prediction network and the style transfer network are crucial. The style transfer network can transfer data from one modality to another, which means that if the depth maps of one modality are known, the depth maps of the other modality can be predicted indirectly. This approach has been shown to be effective on natural images [10]. Moreover, to avoid using traditional feature extraction methods to compute feature descriptors for matching, we instead use convolutional neural networks (CNNs) to extract deep semantic features of the depth map, which improves time performance.
The main contributions of our paper are as follows:
• We established a corresponding mapping relationship based on the depth map between the white-light ureteroscopic image and the virtual endoscopic image. In other words, we achieved depth prediction from a single white-light endoscopic image.
• We extracted abstract semantic features of the depth maps from ureteroscopic images and from simulated images captured by the virtual camera for CT-video matching. This approach achieves effective matching and significantly reduces computational time.
• The results show that our method achieved a 26% improvement in top-10 matching accuracy and a five-fold improvement in time performance.

Figure 1 shows the flow chart of our approach to navigating the ureteroscope by matching preoperative CT images and intraoperative video frames. In detail, first, a 3D model of the kidney was reconstructed from CT images. Then we generated an SI-DM dataset based on the 3D kidney model and virtual reality to train the depth prediction model. Meanwhile, we introduced a style transfer network to transform ureteroscopic images into simulated-style images resembling the SI dataset. Once we obtained the depth maps of both the ureteroscopic images and the SI dataset, we could extract their depth semantic features, respectively. Finally, based on the depth map features, we realized the CT-video matching task to make sure each branch of the kidney was examined.

Depth Prediction
A depth map was used as the connecting medium to realize the matching of CT and video. A depth map describes spatial geometry, with each pixel value representing the spatial distance of that pixel from the camera; depth maps are applied in virtual reality, 3D reconstruction, and other fields. Many studies on natural images [11][12][13][14][15][16][17] have addressed how to obtain depth information from monocular images, that is, monocular depth prediction, which can be regarded as a pixel-wise regression problem, similar to image segmentation. In this paper, we used the SI-DM dataset obtained from CT to train the depth prediction model, mainly because we cannot obtain the depth maps of real ureteroscopic images. This step is a fundamental part of solving for the depth maps of ureteroscope video images through style transfer. For virtual endoscopic images, a depth map can be recovered from the CT image series directly. Real endoscopic images are first transferred to the virtual endoscopic domain, and their depth maps are then generated by the depth prediction model trained on virtual endoscopic images.
This paper uses an encoder-decoder network structure which was first proposed by Alhashim et al. [17].

Datasets
In this paper, the SI-DM dataset is analogous to RGB-D datasets [18][19][20]. It was generated from preoperative CT image sequences. The whole process is depicted in Figure 2.
(1) We segmented CT images according to grey-scale thresholding and extracted the anatomical sites of interest; (2) We used the Marching Cubes (MC) [21] algorithm to reconstruct the anatomical sites of interest into the kidney 3D model; (3) We extracted the center path of the kidney 3D model and used a virtual camera to simulate the movement path of a real ureteroscope; (4) We obtained the pairwise SI-DM dataset using the 3D object rendering imaging principle.
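As a toy illustration of step (4), the depth map paired with each simulated view records, for every pixel, the distance from the virtual camera to the first surface hit along that pixel's ray. The sketch below assumes an idealized pinhole camera looking down the z-axis at a flat surface; the scene and all names are invented for illustration and are not the paper's rendering pipeline.

```python
import math

def depth_map_for_plane(width, height, focal, plane_z):
    """Render the depth map of the plane z = plane_z as seen by a pinhole
    camera at the origin looking along +z. Each pixel stores the Euclidean
    distance from the camera to the surface point hit by that pixel's ray."""
    cx, cy = width / 2.0, height / 2.0
    depths = []
    for j in range(height):
        row = []
        for i in range(width):
            # Ray direction through pixel (i, j), before normalization
            dx, dy, dz = (i - cx) / focal, (j - cy) / focal, 1.0
            norm = math.sqrt(dx * dx + dy * dy + dz * dz)
            # The ray reaches z = plane_z after travelling plane_z * norm / dz
            row.append(plane_z * norm / dz)
        depths.append(row)
    return depths

dm = depth_map_for_plane(width=4, height=4, focal=2.0, plane_z=10.0)
# The central pixel looks straight ahead, so its distance equals plane_z;
# corner pixels view the plane obliquely, so their distances are larger.
```

In the paper's setting, the surface would be the reconstructed kidney mesh rather than a plane, and the camera pose would come from the extracted center path.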
We collected CT images for 3D model reconstruction from two patients at Shanghai Changhai Hospital. A Philips CT scanner was used with a 1.5 mm scan thickness and a 512 × 512 image resolution. In generating the synthetic data, we treated the points on the center path as the virtual camera's locations and rotated the virtual camera to different angles at each interval point in order to obtain multi-angle virtual endoscopic images. Here, the virtual camera performs the functions of an RGB-D camera and collects the depth map corresponding to each frame of the virtual image. As shown in Table 1, we generated 29,608 SI-DM image pairs, of which 21,429 formed the training set and 8179 the test set.

Loss Function
In order to obtain better model precision, we combined different loss functions in our experiments. In depth prediction tasks, researchers usually apply the L1 and L2 loss functions. However, training with a sole reconstruction loss can cause the model to tend toward an average value, leading to ambiguous outputs. In the work of Alhashim et al. [17], a structural similarity (SSIM) error term was also used and shown to be a good loss term for depth-estimating CNNs; we also tried this loss in this article. In the loss formulas, y represents the ground-truth depth value and ŷ represents the predicted depth value.
We quantitatively evaluated the performance of the model. The quantitative evaluation generally uses threshold accuracy, root mean square error (RMSE), root mean square log error RMSE(log), average log10 error, and relative error (REL) to judge model performance. Usually, in formula (4), the value of the threshold is set to 1.25, 1.25², or 1.25³.
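Written out in their standard form (with $y_i$ the ground-truth depth, $\hat{y}_i$ the predicted depth, and $N$ the number of valid pixels; the equation numbering is assumed to match the original, with the threshold metric as formula (4)), these losses and metrics are:

```latex
% Reconstruction losses
L_{1}(y,\hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left|y_{i}-\hat{y}_{i}\right| \tag{1}

L_{2}(y,\hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2} \tag{2}

L_{\mathrm{SSIM}}(y,\hat{y}) = \frac{1-\mathrm{SSIM}(y,\hat{y})}{2} \tag{3}

% Evaluation metrics
\delta_{k} = \frac{1}{N}\,\#\left\{\, i : \max\!\left(\frac{y_{i}}{\hat{y}_{i}},
             \frac{\hat{y}_{i}}{y_{i}}\right) < 1.25^{k} \right\} \tag{4}

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}

\mathrm{RMSE}(\log) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log y_{i}-\log\hat{y}_{i}\right)^{2}}

\mathrm{log10} = \frac{1}{N}\sum_{i=1}^{N}\left|\log_{10}y_{i}-\log_{10}\hat{y}_{i}\right|

\mathrm{REL} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|y_{i}-\hat{y}_{i}\right|}{y_{i}}
```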

Implementation Details
All implementation and training were done in PyTorch [22], using DenseNet-169 [23] pre-trained on ImageNet [24] to initialize the encoder parameters; the decoder used skip connections and an up-sampling structure. The Adam [25] optimizer with parameter values of β1 = 0.9 and β2 = 0.999 was used. The learning rate was set to 0.001, the number of training epochs to 50, and the batch size to four, and the training input image size was 480 × 640.
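For reference, a single Adam update with the hyper-parameters listed above (β1 = 0.9, β2 = 0.999, learning rate 0.001) can be written in a few lines of plain Python. This is the textbook update rule [25] applied to one scalar parameter, not the authors' training code.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.
    m, v are the running first/second moment estimates; t is the step count."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# On the very first step the bias corrections cancel, so the parameter
# moves by approximately the learning rate regardless of gradient scale.
p, m, v = adam_step(1.0, 0.5, m=0.0, v=0.0, t=1)
```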

Style Transfer
In Section 2.1, we showed how to predict depth maps from simulated virtual endoscopic monocular images (the SI-DM dataset). However, our purpose is to obtain depth maps from real ureteroscopic images. Since video images and virtual images belong to two different domains, migrating data between modalities is a problem of domain adaptation [26]. To achieve this, we introduced a style transfer neural network [27]. Style here refers to the differing data distributions of the two kinds of images, such as color and texture, for the same content. Style transfer refers to retaining the image content while transferring the image from the source style to the target style.
In this paper, we use image style transfer to transfer the real endoscopic images into the virtual endoscopic domain, because depth-image matching within the same data domain is more suitable than cross-domain matching from the perspective of data distribution and image alignment; in this way, the depth prediction model trained in Section 2.1 is also effective for real endoscopic images. There are many methods to realize style transfer [28][29][30][31]. Following [32], we adopted CycleGAN, which can be trained with unpaired data. The structure of the CycleGAN is shown in Figure 3. A represents the real endoscopic image domain (EI), while B represents the simulated image domain (SI). It consists of two discriminators and two generators. The input image Input_A generates Fake_B through Generator A to B, and Fake_B generates Rec_A through Generator B to A. After the two transformations, Rec_A belongs to the A domain, and the model is optimized by comparing the similarity between Input_A and Rec_A; Input_B is processed in the same way.
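The cycle A → B → A described above is enforced by a cycle-consistency loss: Rec_A should reproduce Input_A. A minimal sketch, with toy pixel-scaling functions standing in for the two CNN generators (all names hypothetical):

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy "generators": in CycleGAN these are CNNs; here, simple pixel scalings
# that happen to be exact inverses of each other.
def generator_a_to_b(img):   # EI -> SI style
    return [2.0 * px for px in img]

def generator_b_to_a(img):   # SI -> EI style
    return [0.5 * px for px in img]

input_a = [0.1, 0.4, 0.7]            # Input_A (a flattened real frame)
fake_b = generator_a_to_b(input_a)   # Fake_B, now in the B domain
rec_a = generator_b_to_a(fake_b)     # Rec_A, back in the A domain

# Cycle-consistency loss: zero when the two generators invert each other,
# which is exactly what training pushes the real generators toward.
cycle_loss = l1_loss(input_a, rec_a)
```

The same loss is computed in the opposite direction for Input_B, and both terms are added to the adversarial losses from the two discriminators.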

Datasets
The advantage of the CycleGAN adopted in this paper is that it does not need paired source-target domain images, which simplifies dataset acquisition. We only need to obtain images from the two domains separately, without considering their correspondence, and the numbers of images in the two domains do not even need to be equal (however, to make model training more reliable, the gap between the two dataset sizes should not be too large). The source domain dataset was derived from two videos of clinical ureteroscopic lithotomy at Shanghai Changhai Hospital. To collect high-quality endoscopic images, we removed invalid frames from the surgical videos, as shown in Figure 4. The target domain dataset was the simulated virtual endoscopic images (SI) obtained in Section 2.1. As shown in Table 2, we used 1747 source domain images (EI) and, to ensure that the numbers of images in the two domains were not too different, 2767 target domain images randomly selected from the SI dataset to train the style transfer model.


Figure 4. Examples of typical ureteroscope video images. The first row shows problematic or uninformative frames due to being blurry, kidney stones, or floccules, and the second row displays informative images without any impurity.

Implementation Details
This paper conducted experiments based on CycleGAN. The generator used the ResNet [33] architecture, the discriminator used PatchGAN [31], and the cycle loss function was the L1 loss function. The model training platform was the same as in Section 2.1. The initial learning rate was 0.0002, the Adam optimizer with parameter values of β1 = 0.9 and β2 = 0.999 was used, the batch size was set to 4, and the number of training epochs was 250.

Semantic Feature Matching
After acquiring depth maps of images from the two domains, this section explains how we match real endoscopic images to the corresponding SI images. To improve time performance, we extracted semantic features of the depth maps to match endoscopic images with SI images. We used an auto-encoder [34] network to extract high-level semantic features of each image, encoding each 240 × 320 image into a 3840-dimensional feature vector.
We used the Euclidean distance between vectors to calculate the similarity of semantic features, a common but effective method. In the matching process, we built a semantic feature database from the preoperative depth map dataset and searched it with the intraoperative depth map features to find the best match. CT-video registration was then re-implemented based on the mapping relationship between semantic features and preoperative and intraoperative anatomical positions. The process of semantic feature matching is shown in Figure 5. The valid frames of the intraoperative video were successively passed through the style transfer network, depth prediction, and semantic feature extraction. A similarity analysis was conducted between the computed features and the semantic feature database constructed from the preoperative depth map dataset, and the top-10 best-matching images were output after sorting. When the SI-DM images were acquired in Section 2.1.1, the virtual camera's position and angle corresponding to each simulated image were saved. After obtaining the matched top-10 simulated images, the positions of the frames could be displayed in the 3D model, which reflects whether the match is correct according to the position of the branch. When one or more of the matched top-10 images corresponded to the correct anatomical location, the match was considered successful. Figure 5. The process of semantic feature matching.
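The matching step thus reduces to a nearest-neighbour search under Euclidean distance. A minimal sketch, with a toy 4-dimensional feature database standing in for the 3840-dimensional one (the data and names are invented for illustration):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_matches(query, database, k=10):
    """Return indices of the k database features closest to the query."""
    ranked = sorted(range(len(database)),
                    key=lambda i: euclidean(query, database[i]))
    return ranked[:k]

# Toy preoperative feature database (in the paper: 3840-dim autoencoder codes
# of SI depth maps, one entry per saved virtual-camera pose).
database = [[0.0, 0.0, 0.0, 0.0],
            [1.0, 0.0, 0.0, 0.0],
            [1.0, 1.0, 0.0, 0.0],
            [1.0, 1.0, 1.0, 0.0]]

# Query: the feature vector of an intraoperative frame's predicted depth map.
query = [0.9, 0.1, 0.0, 0.0]
matches = top_k_matches(query, database, k=2)   # nearest entry is index 1
```

Because each database index maps back to a saved virtual-camera pose, the returned indices directly give candidate anatomical positions in the 3D model.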

Results
This section verifies the effectiveness of the proposed methods. First, we examined the results of our depth prediction model. As shown in Table 3, we compared the effects of different loss functions. We found that the choice of loss function had little impact on the performance of the model. The loss and error functions are given in Section 2.1.2, with δ1, δ2, and δ3 set to 1.25, 1.25², and 1.25³, respectively. The accuracy of the model reached 94.1% in depth prediction on the SI-DM dataset using the L2 + SSIM loss. Figure 6 shows some examples from the SI-DM test dataset; the predicted depth maps are generally consistent with the ground truth.
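The threshold-accuracy metric reported in Table 3 can be computed as below; the depth values here are invented toy numbers for illustration, not the paper's data.

```python
import math

def threshold_accuracy(gt, pred, k=1):
    """Fraction of pixels with max(y/ŷ, ŷ/y) < 1.25**k (the δ_k metric)."""
    thr = 1.25 ** k
    ok = sum(1 for y, p in zip(gt, pred) if max(y / p, p / y) < thr)
    return ok / len(gt)

def rmse(gt, pred):
    """Root mean square error over all pixels."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(gt, pred)) / len(gt))

# Toy ground-truth and predicted depth values for four pixels
gt   = [2.0, 4.0, 8.0, 16.0]
pred = [2.2, 3.0, 8.4, 15.0]

acc1 = threshold_accuracy(gt, pred, k=1)   # δ < 1.25: 3 of 4 pixels pass
err = rmse(gt, pred)
```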

Then, we evaluated the effectiveness of style transfer by analyzing the feature distributions of the source domain images (EI), the target domain images (SI), and the images generated by the CycleGAN. We input each of these three types of images into the discriminator of the target domain (Discriminator B in Figure 3, which determines whether an image belongs to the target domain). Each image was denoted by a 1 × 1024 feature vector. We used t-SNE (t-distributed stochastic neighbor embedding) to visualize these features in the same 2D coordinate space (see Figure 7a). The images generated by the CycleGAN were closer to those in the target domain, but because the discriminator has a discriminatory effect, the two feature distributions retain a certain boundary.
Similarly, these images were also input to ResNet-50 for feature vectorization and visualization. The results are shown in Figure 7b; the distributions of the CycleGAN-generated images and the target domain images are similar, with some overlapping areas, which means that a conventional CNN classifier could not correctly distinguish between the two types of images. This proved that we could achieve migration between domain distributions through style transfer. Examples of style transfer results are shown in Figure 8.
In addition, to further verify the effectiveness of the style transfer, we directly input the ureteroscopic images into the depth prediction model; the resulting depth maps are shown in Figure 9b. For comparison, we input the same ureteroscopic images into the depth prediction model after the style transfer and obtained the depth maps shown in Figure 9d. As Figure 9 shows, the depth maps predicted without style transfer performed poorly, demonstrating that style transfer greatly improves the results of depth prediction.
Finally, we evaluated the performance of the matching method. To highlight the advantages of our method, we also applied the traditional 2D-2D matching method to the same data. The traditional method directly computes the similarity between EI and SI at the level of image pixels, with SSIM being the commonly used measure. We compared our method (see Figure 10c) with the traditional 2D-2D method (see Figure 10b); our method showed better performance, with more accurate matching positions. For further analysis, we ran a matching experiment with 1039 ureteroscopy images (EI) against a feature database containing the semantic features of 2077 preoperative depth maps. We analyzed the matching accuracy and matching time for Top-1, Top-5, and Top-10; the experimental results are shown in Table 4. Accuracy is the ratio of correct matches to the total number of matches, and the matching time is the total time to process all images divided by the number of images. Although our method had no advantage for Top-1, it was 16% and 26% more accurate for Top-5 and Top-10, respectively, with a five times improvement in time performance, which shows that our method works better than the traditional method.
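The retrieval and Top-k evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature vectors are synthetic stand-ins for the semantic features of the preoperative depth maps, and cosine similarity is assumed as the comparison measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_matches(query, database, k):
    """Indices of the k database features most similar to the query."""
    ranked = sorted(range(len(database)),
                    key=lambda i: cosine_similarity(query, database[i]),
                    reverse=True)
    return ranked[:k]

def top_k_accuracy(queries, truths, database, k):
    """Fraction of queries whose ground-truth index appears in the top k."""
    hits = sum(1 for q, t in zip(queries, truths)
               if t in top_k_matches(q, database, k))
    return hits / len(queries)

# Toy example: 4 database features; the two queries' true matches are 0 and 2.
db = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
queries = [[0.95, 0.05], [0.05, 0.95]]
truths = [0, 2]
print(top_k_accuracy(queries, truths, db, k=1))  # 1.0 on this toy data
```

In practice the database features would be precomputed once from the preoperative depth maps, so each intraoperative query reduces to one feature extraction plus a similarity ranking, which is what gives the feature-based approach its speed advantage over per-pixel SSIM.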

Discussion
This paper aims to address the difficulty of RIRS caused by the complex lumen structures seen in ureteroscopy. Nevertheless, our method has some limitations. The premise for our method to be effective is that the depth information varies from one anatomical structure to another.
However, in actual clinical images, the depth distributions of different anatomical locations may be similar; in this case, our matching failure rate increases and false-negative results are obtained, i.e., a matching ambiguity problem arises (as shown in Figure 11).
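The ambiguity can be illustrated with a small sketch: if matching relied only on the distribution of depth values (discarding their spatial arrangement), two hypothetical depth maps from different locations with the same value distribution would be indistinguishable. The depth maps below are invented toy data.

```python
def depth_histogram(depth_map, bins=8, d_max=1.0):
    """Normalized histogram of depth values; spatial layout is discarded."""
    counts = [0] * bins
    flat = [d for row in depth_map for d in row]
    for d in flat:
        idx = min(int(d / d_max * bins), bins - 1)
        counts[idx] += 1
    return [c / len(flat) for c in counts]

def histogram_overlap(h1, h2):
    """Histogram intersection in [0, 1]; 1 means identical distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Two toy 2x2 depth maps from *different* locations: the same depth values
# arranged differently, so their distributions coincide exactly.
loc_a = [[0.1, 0.9], [0.9, 0.1]]
loc_b = [[0.9, 0.1], [0.1, 0.9]]
print(histogram_overlap(depth_histogram(loc_a), depth_histogram(loc_b)))
```

Here the overlap is 1.0 even though the two maps differ spatially, which is the failure mode behind the ambiguity: distinct anatomical locations with similar depth statistics can produce near-identical matching scores.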
At the same time, the amount of our data needs to be increased, although the thousands of virtual endoscopy frames generated from our collected CT data are sufficient to train the pre-training-based depth estimation and style transfer models. The evaluation results show that our method significantly surpasses traditional matching methods, and we believe that the proposed matching approach can inspire further exploration of artificial intelligence methods in the field of nephrology surgery. We will collect more diverse data for further experiments in the future.

Conclusions
A CT-video matching method based on depth maps was proposed for the ureteroscopy scene. We applied the method to clinical data and compared it with the traditional 2D-2D registration method. The results show that our method outperforms the traditional method in both accuracy and time performance, with a 26% improvement in Top-10 accuracy and a five times improvement in speed. However, even with this five-fold speedup, our time performance of 1.26 s per image is still far from meeting clinical requirements. In fact, the depth map-based matching method in this paper is not limited to ureteroscopy and can also be considered for other endoscopy scenarios. We believe that, with the continued exploration of deep learning technology, future work can optimize the matching method presented here to achieve a better trade-off between matching accuracy and speed.