Use of Yolo Detection for 3D Pose Tracking of Cardiac Catheters Using Bi-Plane Fluoroscopy

The increasing rate of minimally invasive procedures and the growing prevalence of cardiovascular disease have led to a demand for higher-quality guidance systems for catheter tracking. Traditional methods for catheter tracking, such as detection based on single points and masking techniques, have been limited in their ability to provide accurate pose information. In this paper, we propose a novel deep learning-based method for catheter tracking and pose detection. Our method uses a Yolov5 bounding box neural network with postprocessing to perform landmark detection in four regions of the catheter: the tip, radio-opaque marker, bend, and entry point. This allows us to track the catheter's position and orientation in real time, without the need for additional masking or segmentation techniques. We evaluated our method on fluoroscopic images from two distinct datasets and achieved state-of-the-art results in terms of accuracy and robustness. Our model was able to detect all four landmark features (tip, marker, bend, and entry) used to generate a pose for a catheter with accuracies of 0.285 ± 0.143 mm, 0.261 ± 0.138 mm, 0.424 ± 0.361 mm, and 0.235 ± 0.085 mm, respectively. We believe that our method has the potential to significantly improve the accuracy and efficiency of catheter tracking in medical procedures that utilize bi-plane fluoroscopy guidance.


Introduction
Cardiovascular diseases are the leading cause of death around the world [1]. To reduce the invasiveness of procedures, image-guided procedures have been proposed in various fields, including cardiovascular interventions [2]. Cardiac catheterization in adults, a type of minimally invasive surgery (MIS), is dependent on the accurate tracking and assessment of the spatial positioning of delivery catheters and cardiac devices. This can affect the number of cardiac patients and potential adverse outcomes in these types of MIS or interventions [3]. To this end, augmented and mixed reality technologies have been deployed in fluoroscopy-guided procedures to assist physicians in catheter detection and tracking [4].
Werner Forssmann was among the first to introduce catheter tracking, originally part of interventional radiology. Since then, cardiac interventions have leveraged the direct visualization of catheters on 2D monitors to understand and predict the position of the catheter within the patient's body. However, because fluoroscopy only provides a 2D projection of the catheter position, the physician must view the catheter from multiple angles to fully interpret its 3D position. This limitation has led to the use of ultrasound guidance to provide direct visualization of the catheter relative to cardiac tissue, which is not visible in fluoroscopy.
Recent research has offered a novel direction, based on classic and deep learning techniques, to detect the 3D shape of catheters in ultrasound images. One approach suggests a UNet-3D architecture with the ability to localize the catheter centroid in each frame. This method, however, depends on a diverse training set [5]. In another study, which was based on cardiovascular magnetic resonance (CMR)-guided cardiac catheterization, researchers utilized the T1-overlay technique for improved catheter visualization. This led to higher blood/balloon rating, better anatomy visualization, and improved iCMR guidance for cardiologists [6]. Researchers also designed a convolutional deep neural network to detect the area of interest containing the catheter tip, followed by a color intensity difference detection technique for catheter detection. This method was successful in 94% of long-axis projections; however, it only had 57% success in short-axis projections, requiring the manual identification of the initial position of the catheter [7]. Furthermore, in a study on improving poor spatial resolution in ultrasound images for cardiac interventions, researchers applied a UNet-3D model to estimate the 3D shape of the catheter. An adaptive Kalman filter was used, and the points were fused into the 3D coordinate system. Despite the authors' claims about the accuracy of their approach, they also stated that their results could benefit from multiple outputs provided by the model and the use of a more diverse dataset [6].
In terms of catheter or guidewire tracking for fluoroscopy images [5,8-14], Vernikouskaya et al. designed a convolutional neural network (CNN) with two channels to track the pacing catheter tip as a single-point tracking method [13]. Motion was identified based on the template matching of 2D fluoroscopic images. The training data were generated by tracing a rapid-pace catheter tip; however, heatmap plots indicated that the catheter tip was not the region of interest, with the attention of the CNN spread over the diaphragm [13]. In another study, researchers estimated an external force applied on the tip of a planar catheter through image processing algorithms based on cropping and edge detection, followed by a mathematical catheter representation. Random forest and deep neural networks (DNNs) were used to estimate the force. Despite providing promising results, the authors mentioned the need for a more extensive dataset due to traces of inaccuracy in the validation set [8].
Our research highlights how computer vision and recent machine learning techniques have influenced detection and tracking in medical images. In coordinate regression, where the goal is to predict a fixed number of location coordinates corresponding to points of interest in an input image, various promising results have been presented. This is more obvious in conjunction with augmented reality, image guidance systems, and MIS for cardiac interventions [15]. Two studies published in 2023 [16] and 2021 [17] focus on applying deep learning solutions to detect the position of the tip of a catheter tracked from bi-plane fluoroscopic images. Torabinia et al. designed a U-Net model, trained on masks of the catheter's tip radio-opaque marker. Masking is an image processing technique that hides a partial or full section of the image. The area is then compared to the ground truth segments, which in this case resulted in 83.67% intersection over union (IOU), Dice, or 0.8457 [17]. In 2022 [18], another study was conducted on catheter localization and tip detection using a U-Net model. It was stated that although an 80% reconstruction accuracy was achieved, there were times when the tip was not detected, resulting in a need for extrapolation. Following this study, ref. [16] applied a deep cascaded VGG neural network to detect the tip of the catheter by first learning and locating the tip in a grid and then identifying the local and global tip coordinates. The researchers stated that if incorrectly selected regions are removed, the model is able to provide a 7.36-pixel mean error.
In catheter pose identification, another aspect of our methodology, there have been deep learning-based advancements. Pose estimation enables the localization of multiple landmarks, which are identified in relation to the human pose; it provides posture recognition in an image, enabling the recognition and tracking of human action. To identify the pose, the landmark points are identified and then grouped to provide a valid pose estimation. DeepPose was among the first methods at the intersection of deep learning and human pose estimation. Subsequently, multiple backbone networks, such as AlexNet [19] with extensions including R-CNN, Fast-CNN, and FPN, followed by VGG [20] and ResNet [21], as well as other backbone architectures with ResNet [21], have provided more accurate results, managing to overcome the challenge of vanishing gradients [22]. Some studies have been conducted on knee arthroplasty and the use of convolutional neural networks for pose computation in navigation sensors, reporting a root mean square difference of less than 0.7 mm and 4° for the test results and 0.8 mm and 1.7° for the validation results [23]. Ravigopal et al. reported the Jaccard Index (IOU) on guidewire segmentation [14]. ResNet-50 [21] was able to provide a 0.65 IOU on guidewire shape, although the tip had a 0.51 IOU, which outranked ResNet-18 [21], Squeezenet [24], MobileNetv2 [25], Inception-ResNetv2 [26], and FCN-2s [14]. The algorithm had a 0.99 IOU for the background. Their lowest final error after traversal for tendon stroke was 0.23 mm in Trial 3, for a bending length of 1.02 mm, and for 0.18 mm stage feed in Trial 1.
To address the limitations in the current studies and to provide a basis for catheter detection and tracking, we propose a bounding box backbone. Multiple state-of-the-art bounding box methods have been introduced in recent years, with the "you only look once" (Yolo) [27] algorithm at the forefront of object detection methods. This methodology adopts Darknet as the backbone framework with several adjustments that enable the grid-based detection of annotated bounding boxes [28]. These methods have been found to provide exemplary results for computer-aided detection (CAD) and device detection in the medical field, ranging from the detection of lesions in mammography [28,29], lung nodule diagnosis in CT scans [30], and brain tumor segmentation in MRI [31] to tracking invisible needles in ultrasound sequences [32] and detecting EMT sensors in X-ray images [33]. Concerning mammograms, the Yolo algorithm was able to detect and classify abnormalities and assist radiologists in their early diagnosis of breast cancer. In this regard, the researchers applied Yolo on a fusion of real and synthetic samples for the prediction of the region of interest of mass lesions. Their model was able to achieve a maximum accuracy rate of 93% in mammograms with mass lesions, 88% with calcification lesions, and 95% with architectural distortion lesions in 0.62 s during inference mode [28]. Krumb et al. detected an EMT sensor on X-rays by applying YoloV5 due to its vast success in image segmentation. Through that, they identified the center of the bounding box, which was the sensor, achieving 0.955 mAP50 and 0.512 mAP50-95 by using a network pretrained on MSCOCO and applying it to their 214-sample set [33]. To the best of our knowledge, no studies have applied the Yolo approach to catheter tracking in fluoroscopy images.
In this paper, we introduce a bounding box-based method for catheter tracking in fluoroscopy images, identifying four classes of the catheter: entry point, bend, radio-opaque marker, and tip. To this end, we use Yolo as a state-of-the-art bounding box method to identify the landmark regions that comprise the pose set of the catheter's orientation. Once found, the bounding box centroids are identified to enable an understanding of the catheter's position in relation to the image coordinate system. To summarize, our contributions are listed as follows:
• Improved accuracy by using the Yolo architecture for object detection;
• Deep learning-based bounding box pose estimation of the catheter, including four classes of landmark features, namely catheter tip, radio-opaque marker, bend, and entry, for future use in catheter tracking systems;
• A new diverse catheter dataset with a complete bounding box and representative pixel annotation.

Materials and Methods
In this section, we elaborate on a step-by-step methodology (with a schematic view presented in Figure 1) for detecting four landmark points of the catheter using bounding boxes to extract the catheter's 3D pose. This section will discuss data collection, preprocessing, the Yolov5 deep neural network for bounding box detection, and postprocessing for 3D pose detection.

Data Collection: The catheter image data were collected in two sessions under the physician's supervision, with various operating settings in terms of handling, maneuvering, and position in experiments, and consisted of two datasets. One dataset consisted of 529 fluoroscopic samples, previously gathered [16] with a custom-made 3D-printed heart and a metal spray-painted spine to provide visibility under X-ray, gathered during a mock procedure in the catheterization lab. It should be noted that the 3D-printed heart, the metal spray-painted spine, and the catheter were located within an acrylic box, thus producing additional artifacts in the image that would not be present in clinical images.
The second dataset consisted of 900 fluoroscopic images without the 3D-printed heart and metal spray-painted spine, which was created to enable the assessment of the 3D coordinate generation of the catheter and to generalize the dataset, generating diverse images, since these features may or may not be present in clinical images. This sample set was gathered using the custom-built setup shown in Figure 2.
Images from both datasets had a 512 × 512 resolution. The catheter tip, radio-opaque marker, entry, and bend were clearly visible in all images of both datasets, with the catheter moving along the entire range of the image.
Preprocessing: In the preprocessing phase, the set with 529 samples was excluded from cropping and used as is in the annotation and processing phase; however, in the 900-sample set, the LAO90 and AP images were cropped, as shown in Figure 4, to remove the spheres used for aligning the frames (which will not be used in clinical applications). The results and coordinate conversions to original settings will be addressed in the postprocessing section, where we analyze the bi-plane coordinates to specify the final 3D output in both pixels and millimeters.
Yolo Image Processing: Our method uses a deep learning-based bounding box method known as Yolov5. This algorithm was implemented through the PyTorch library and the data were trained through data loaders in 64 batches on a GPU server without any pretraining. The Yolo [27] architecture was first designed and introduced in 2016 as a new approach to object detection. It resizes images to 448 × 448, and then a single convolutional network predicts the coordinates of multiple bounding boxes and their class probabilities, considering the entire image in real time. The coordinates are converted to their original location based on the original image dimension. The network divides the image into grids, and if the center of the object falls within a grid cell, that cell is identified as the object's region. The Yolo architecture included the use of the DarkNet architecture attached to fully connected neural networks (FCNNs). Each image was divided into n-by-n grids, and bounding boxes were returned for each grid based on the network's learning.
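To make the grid assignment concrete, the following minimal sketch maps the center of an object to the grid cell that contains it. The function name `responsible_cell` and the default grid size `s=7` are assumptions chosen for illustration; they are not taken from our implementation or from the Yolov5 code base.

```python
def responsible_cell(cx, cy, img_w, img_h, s=7):
    """Return (row, col) of the s-by-s grid cell containing the point (cx, cy).

    cx, cy are pixel coordinates of the object's center; s is a hypothetical
    grid size used only for this illustration.
    """
    if not (0 <= cx < img_w and 0 <= cy < img_h):
        raise ValueError("object center lies outside the image")
    col = int(cx * s / img_w)  # horizontal cell index
    row = int(cy * s / img_h)  # vertical cell index
    return row, col


if __name__ == "__main__":
    # Center of a 448 x 448 image falls in the middle cell of a 7 x 7 grid.
    print(responsible_cell(224, 224, 448, 448))
```

In this sketch, only the cell that contains the center is treated as the object's region, which mirrors the description above.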
The backbone of our network contained 182 layers and 7,254,609 parameters. All datasets were divided into three groups, namely training, validation, and test sets, with a 65%, 18%, and 17% sample portion, respectively. This network was implemented in Python 3.10 and torch 2.0.1, and trained on a GPU in Google's Colab. Inference did not require a GPU, and the network could be run on a CPU; a GPU, if used, would decrease the inference time. Prior to training or testing, all training, validation, and test samples were resized to a 416 × 416 resolution and then at the end converted back to their original size. For the combinatory dataset, which was the largest set, we trained the model with 32 batches in 300 epochs in 55 min, 37.5 s; however, the inference time was 0.256 s.
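The conversion from the resized network input back to the original image can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `to_original_coords`, its parameters, and the optional `crop_offset` (the top-left corner of a cropped region in the uncropped image) are hypothetical and only show the arithmetic involved.

```python
def to_original_coords(x, y, net_size, src_w, src_h, crop_offset=(0.0, 0.0)):
    """Map a point from the net_size x net_size network input back to the source image.

    src_w, src_h: size of the image before it was resized for the network.
    crop_offset:  hypothetical (dx, dy) of the crop's top-left corner in the
                  uncropped image; (0, 0) when no cropping was applied.
    """
    x_src = x * src_w / net_size + crop_offset[0]
    y_src = y * src_h / net_size + crop_offset[1]
    return x_src, y_src


if __name__ == "__main__":
    # A point at the center of the 416 x 416 input maps to the center of a 512 x 512 image.
    print(to_original_coords(208, 208, 416, 512, 512))
```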
We conducted three experiments on three training, validation, and test sets. In our first experiment, we only used the 900 paired-sample set. In the second and third experiments, we used the 900 paired-sample dataset consisting of AP and LAO90 planes in combination with the 529 unpaired-sample set and tested the model on combinatory and paired samples; this was considered to provide a more diverse input to the model's learning algorithm. For the third experiment's inference, we included 134 samples from the paired dataset, resized to 416 × 416 resolution, and ran them through the model using the weights saved in the best checkpoint (the respective checkpoint updated during training based on the model's fitness improvement). Afterward, we determined the highest-confidence detection to be saved in a dictionary of results. We identified a confidence threshold for each class (tip, radio-opaque marker, bend, and entry), resulting in 127, 129, 105, and 113 samples, respectively.
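The selection of the highest-confidence detection per class, followed by a per-class threshold, can be sketched as below. The detection format (dictionaries with `cls`, `conf`, and `box` keys), the function name, and the threshold values in the usage example are assumptions for illustration only; the thresholds used in our experiments are not reproduced here.

```python
def best_per_class(detections, thresholds):
    """Keep, for each class, the single detection with the highest confidence.

    detections: list of dicts with keys 'cls' (str), 'conf' (float), 'box' (tuple).
    thresholds: dict mapping class name to a minimum confidence (hypothetical values).
    Detections below their class threshold are discarded.
    """
    best = {}
    for det in detections:
        name = det["cls"]
        if det["conf"] < thresholds.get(name, 0.0):
            continue  # below the class-specific threshold
        if name not in best or det["conf"] > best[name]["conf"]:
            best[name] = det
    return best


if __name__ == "__main__":
    dets = [
        {"cls": "tip", "conf": 0.91, "box": (10, 10, 20, 20)},
        {"cls": "tip", "conf": 0.62, "box": (40, 40, 50, 50)},
        {"cls": "bend", "conf": 0.30, "box": (70, 70, 90, 90)},
    ]
    # Illustrative thresholds only.
    print(best_per_class(dets, {"tip": 0.5, "bend": 0.5}))
```

A class that has no detection above its threshold is simply absent from the result, which is consistent with the per-class sample counts differing from the number of test images.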
Postprocessing: In the final stage, we used all prediction and ground truth labels to identify the center of the bounding boxes, calculate the average and standard deviation of prediction vs. ground truth, and generate the 3D coordinates of the paired LAO90 and AP bi-plane samples in both pixels and millimeters (mm). The 3D coordinates were calculated along the X-Y-Z axes from the points' locations on the AP and LAO90 planes, from a geometric point of view.
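A minimal sketch of this postprocessing step is given below. It rests on assumptions that are not spelled out in the text: the two views are treated as orthogonal parallel projections, the AP image is assumed to supply X and Y, the LAO90 image is assumed to supply Z and a second estimate of Y (averaged here), and `mm_per_px` is a hypothetical isotropic pixel spacing. The actual axis convention and calibration may differ.

```python
def box_center(box):
    """Center (u, v) of a bounding box given as (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def biplane_to_3d(ap_box, lao_box, mm_per_px=None):
    """Combine one landmark's AP and LAO90 boxes into a 3D point.

    Assumed convention (illustrative only): AP gives (X, Y), LAO90 gives (Z, Y).
    If mm_per_px is provided, the result is scaled from pixels to millimeters.
    """
    ap_u, ap_v = box_center(ap_box)
    lao_u, lao_v = box_center(lao_box)
    point = (ap_u, (ap_v + lao_v) / 2.0, lao_u)  # shared vertical axis is averaged
    if mm_per_px is not None:
        point = tuple(c * mm_per_px for c in point)
    return point


if __name__ == "__main__":
    print(biplane_to_3d((10, 20, 30, 40), (100, 22, 120, 42)))
    print(biplane_to_3d((10, 20, 30, 40), (100, 22, 120, 42), mm_per_px=0.5))
```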

Results and Discussion
In this work, we show how the bounding box method, based on the Yolov5 architecture without any augmentation or transfer learning, can improve the accuracy of coordinate regression for the landmark detection of a catheter for each view of bi-plane fluoroscopy. This method involves a single deep neural network followed by a postprocessing technique to detect the 3D location of the entry, bend, marker, and tip from the LAO90 and AP planes, based on the detected bounding boxes. In what follows, we discuss the results for all three experiments and, for each class, its mean and standard deviation in addition to its 3D accuracy. In some samples, no bend was identified, and we therefore excluded those images from the bend results.

Experiment 1-Paired AP LAO90 Dataset
The paired AP-LAO90 dataset consisted of 900 samples representing two views: LAO90 and AP. Each view consisted of 450 images. The images contained the catheter and were cropped in the preprocessing phase to exclude the spheres contained in the image.
The model was able to achieve 0.961 and 0.953 precision and recall for all classes, with a mAP50 of 0.95 and a mAP50-95 of 0.507. The class with the highest mAP50 was the marker, with a precision rate of 1 for the training set, a 0.979 recall rate, and a 0.995 mAP50 score, while the training results for the tip provided the lowest precision, recall, and mAP50, with 0.927, 0.911, and 0.904 values, respectively. The entry and bend ranked in second and third place, respectively, with 0.969 and 0.934 mAP50 scores. For the marker class, we found no significant difference between the threshold-based samples and all test samples. The mean pairwise pixel-based Euclidean distance between ground truths (GTs) and predictions was 1.016 with a 0.661 standard deviation, with a maximum distance of approximately 3 pixels found among all test samples. The threshold provided a 0.999 mean with a standard deviation of 0.648 and the same approximate maximum distance as that found in all test samples. However, for the tip, bend, and entry, the numbers differed more, with the threshold affecting the distance mean, standard deviation, and maximum distance.
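The pairwise distance statistics reported above can be computed along the following lines. This is a minimal sketch: the function name is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not state which estimator was used.

```python
import math
import statistics


def distance_stats(gt_points, pred_points):
    """Mean, standard deviation, and maximum of pairwise Euclidean distances.

    gt_points and pred_points are equally long sequences of coordinate tuples
    (2D pixel centers or 3D points); pairs are matched by index.
    """
    if len(gt_points) != len(pred_points) or not gt_points:
        raise ValueError("inputs must be non-empty and of equal length")
    dists = [math.dist(g, p) for g, p in zip(gt_points, pred_points)]
    # Population standard deviation is an assumption made for this sketch.
    return statistics.mean(dists), statistics.pstdev(dists), max(dists)


if __name__ == "__main__":
    gt = [(0.0, 0.0), (10.0, 10.0)]
    pred = [(3.0, 4.0), (10.0, 10.0)]
    print(distance_stats(gt, pred))
```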

Experiment 2-Combinatory Dataset, Tested on a Combinatory Test Set
In the second experiment, we used the combinatory dataset consisting of 900 paired LAO90 and AP samples and 529 samples recorded with the 3D-printed heart and the spine, achieving a precision of 0.927 and a recall of 0.908 for all classes during training. The mAP50 was 0.913, and the mAP50-95 was 0.467. The class with the highest mAP50 was the marker, with a precision rate of 1 for the training set and a 0.984 recall rate, while the training results for the bend provided the lowest precision, recall, and mAP50, with values of 0.865, 0.801, and 0.848, respectively.

Experiment 3-Combinatory Dataset, Tested Only on the Paired Dataset
As an additional experiment, we utilized the model on only paired samples derived from paired LAO90 and AP views. We analyzed the error for Experiments 2 and 3. The marker bounding box had the highest detection among all boxes, resulting in more samples being identified with their respective bounding box's representative coordinates.

Result Analysis
Table 1 presents the results for each bounding box in the three experiments described in Sections 3.1-3.3, based on the chosen threshold for each class. In Experiment 1, the entry had the lowest average Euclidean distance (0.235 mm in the 3D detection phase), followed by the marker, tip, and lastly the bend class. It should be noted that during training, marker detection had the highest precision, followed by the entry, bend, and tip.
In Experiment 2, we combined both datasets and analyzed the model on the unpaired bi-plane data. In this experiment, the marker class outperformed all other classes, with a 1.009-pixel Euclidean distance between prediction and ground truth (GT). We believe this is due to the way the algorithm finds the grids that best represent each class; in some cases, as the bend occurred prior to entry, the algorithm tried to find some indication of a bend in the sample. When the bend was not present, it was reliably not detected, indicating the specificity of the model. In addition, the algorithm was successful in the detection of the bend in images where the bend angle was small, another indicator of the model's accuracy. The entry had the highest accuracy as the catheter entered the image from the sides, which made it less challenging for the model to learn.
In Experiment 3, we utilized the model trained on both datasets and tested it only on the paired samples. The results indicate that the entry had the lowest distance, followed closely by the marker and tip. The bend was found to have a higher prediction-to-GT distance. It should be noted that the grids were analyzed based on their relationship with other grids, identifying possible effective neighbors and the distinction between the goal class and other classes. By comparing the results of Experiments 1 and 3 in Table 1, it can be inferred that all classes, except for the bend class, benefited from an increase in the size and diversity of the dataset. Two 3D representations of the LAO90 and AP views are provided in Figure 5, with the entry, bend, marker, and tip presented in red, green, blue, and purple bounding boxes, respectively.
As part of our comparative analysis, we experimented with the method proposed by Aghasizade et al. [16] on the same paired dataset. The results are shown in Table 2. For all four classes, our method outperforms the VGG cascaded architecture proposed by Aghasizade et al. [16] in both the average and standard deviation of accuracy, with respect to the 3D Euclidean distance between predictions and labels. In addition, our method is able to provide accurate predictions with an error of less than 0.3 mm on average for the tip, marker, and entry landmark features, and approximately 0.42 mm error for the bend, resulting in four distinct 3D coordinates for catheter pose detection and tracking.

Limitations
Beyond improving the 3D orientation of the catheter, there are limitations that need to be addressed in future work and real-world settings. First and foremost, the acquired datasets are based on 3D-printed models and experimental settings in the lab and do not have realistic clinical backgrounds. This approach provides a first, baseline approach to catheter detection and tracking. A more diverse dataset, including the actual settings in a catheterization procedure with all imaging views and catheter interactions with tissue, is needed to fully validate the performance of the model, but we do expect this approach to be compatible, since imaging views and different poses are not expected to vary significantly from those in the current dataset. Second, the current model should also be applied to other types of catheters and combined with other imaging modalities such as CT and ultrasound to determine whether there is a need for more variety in the dataset, leading to the application of this research to a more diverse dataset and its analysis in real time. Third, there are samples where the image does not contain a bend, either due to the straight pose of the catheter or the bend not being present in the image. In these cases, the mid-point between the marker and entry can simply be calculated analytically and considered an approximation for the bend to fill in the missing class for pose detection. Lastly, if even higher accuracy is required, experiments may benefit from the application of transfer learning and parallel programming on medical images for further analysis and comparison, in addition to using data augmentation techniques. The model can also be exposed to a more complex set of images with a higher rate of artifacts to improve its generalizability.
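The mid-point approximation for a missing bend can be sketched as follows. The pose representation (a dictionary of landmark names mapped to 3D tuples, with `None` for a missing landmark) and the function name are assumptions for illustration.

```python
def fill_missing_bend(pose):
    """Return a copy of the pose with a missing bend replaced by the marker-entry mid-point.

    pose: dict with keys 'tip', 'marker', 'bend', 'entry'; each value is a
    coordinate tuple, or None when the landmark was not detected.
    """
    if pose.get("bend") is not None:
        return dict(pose)  # bend was detected; nothing to approximate
    marker, entry = pose["marker"], pose["entry"]
    midpoint = tuple((m + e) / 2.0 for m, e in zip(marker, entry))
    return dict(pose, bend=midpoint)


if __name__ == "__main__":
    pose = {"tip": (0, 0, 0), "marker": (2, 4, 6), "bend": None, "entry": (4, 8, 10)}
    print(fill_missing_bend(pose)["bend"])
```

The input dictionary is left unmodified, so the detected and approximated poses can be compared side by side.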

Conclusions
In this study, we designed, implemented, and developed a bounding box method based on Yolov5 for catheter detection and tracking. We applied a deep learning processing technique to identify four locations on the catheter: its entry point, bend, radio-opaque marker, and tip. Considering these points, when extracted from bi-plane imaging, a 3D coordinate location was provided for each of them, yielding a pose for the catheter. In the third experiment, our model provided a mean Euclidean distance of 0.235 mm for the entry and 0.261 mm for the marker. Despite all the mentioned limitations, our research provides a novel Yolo-based deep learning algorithm to extract the catheter's pose with no required laborious segmentation masks. Overall, this model provides a significant improvement in the accuracy of marker detection, yielding sub-millimeter 3D positional accuracy for catheter detection and tracking and providing space for further exploration.

Figure 1. Overall schematic view of the 3D pose tracking methodology.

Figure 2. Image of the 3D-printed setup used for collecting the 900 paired-sample dataset in the cardiac cath lab. This dataset consisted of paired samples, including 450 fluoroscopic images from the LAO90 view and 450 fluoroscopic images from the AP view. Two corresponding views of samples in the second dataset alongside the first dataset are presented in Figure 3.

Figure 3. (a,b) Two fluoroscopic samples of the set with 529 samples; (c,d) two paired fluoroscopic samples of the set with 900 samples, LAO90 and AP.

AI 2024, 5, 6
Figure 4. (a,b) The X and Y axes (yellow), entering edge (red dotted line), and crop area (blue dotted line) specified in LAO90 and AP views; "a" represents the distance between the entering edge and crop area, where a > 1.5D, with D being equal to the sphere's diameter; (c,d) two paired fluoroscopic samples of the 529-sample set, AP (c) and LAO90 (d), after preprocessing.

Figure 5. (a,b) Two 3D representations of paired samples from LAO90 and AP planes.

Table 1. Average marker detection accuracy for experimental settings.
