1. Introduction
Heart disease is still the number one cause of death in the US, accounting for 22% of the deaths in 2023 [1]. Heart disease is caused in large part by atherosclerosis, in which cholesterol deposits line the heart arteries, in time occluding them and starving the heart of oxygen. Currently, there exists no medication to reverse atherosclerosis, only medications to prevent heart attacks and strokes.
The occluded arteries can be repaired either via bypass surgery, an invasive procedure that stops the heart and bypasses the clogged arteries, or by coronary angioplasty, a minimally invasive procedure that places stents (little specialized springs) at the occluded locations to keep the arteries open. The coronary angioplasty procedure is performed through catheters inserted in the body through an artery, and is monitored using real-time X-ray called fluoroscopy.
Inside the catheter is a thin wire called the guidewire that is used to penetrate the occluded location and guide different tools, such as a stent or a balloon, to perform the procedure. Because X-rays are harmful in high doses, the energy and duration of the X-ray are kept to a minimum, which results in noisy images and decreased guidewire visibility. For these reasons, fluoroscopy can be considered a noisy imaging sensor.
Automatically finding the guidewire is important for several purposes, such as image augmentation and 2D-3D integration. However, the guidewire can be found at different levels.
The lowest level is guidewire segmentation, where just the pixels of the guidewire are desired to be found. A higher-level task is guidewire localization, where a parametrization of the guidewire is desired to be found, either as a spline or another parametrized curve representation. This task is especially challenging because there might be multiple guidewires present in the image, and a separate curve is required for each of them. At the same time, because of the noisy nature of the fluoroscopy images, large parts of the guidewire might be invisible, and seemingly disparate guidewire segments need to be combined into the same curve based on good continuation.
Finding a parameterization of the guidewire is important for certain tasks, such as guidewire tracking or 2D/3D registration.
To this end, the paper makes the following contributions:
It introduces a guidewire segmentation method that uses two prediction outputs from a residual network or other feature extractor: one that predicts a coarse segmentation directly from the encoder output, and one that refines the coarse segmentation only at the relevant places using a single convolutional layer. In contrast to the UNet or other segmentation methods, this architecture does not have any skip connections, making it simpler and faster to train.
It introduces a method for guidewire localization based on perceptual grouping of the curves extracted from the guidewire segmentation output. The novel perceptual grouping method uses a continuity measure to score which curves might belong to the same guidewire and the Hungarian algorithm to match the curve ends for grouping. In this way, the proposed perceptual grouping method uses the Hungarian algorithm to find the global minimum of a cost function, which is a more principled approach than heuristic-based methods that do not minimize a cost function, or greedy methods that usually cannot find a global minimum.
It performs experiments on two datasets, showing that the proposed segmentation usually outperforms other existing segmentation methods, including the Res-UNet [2], a UNet with residual layers, and the nnU-Net [3], a well-celebrated segmentation method that is the state of the art for many medical imaging segmentation problems.
It also performs localization experiments and extensive ablations on the same datasets, showing that the proposed perceptual grouping obtains competitive localization results with a small average number of curves per image.
The proposed segmentation and localization methods are presented as fundamental research aimed at advancing the knowledge of how to find guidewires, catheters, or other curves in images. More research and evaluation are needed to introduce these methods into clinical practice.
2. Related Work
While there is a substantial body of work on guidewire segmentation, we are aware of only a small number of works on guidewire localization.
2.1. Guidewire Segmentation
The UNet [4] is a U-shaped CNN architecture with an encoder-decoder structure. The encoder has convolutional blocks followed by max-pooling to gradually reduce the spatial resolution of the output while increasing the number of channels. The decoder mirrors the encoder, with the same number of blocks, and uses skip connections to bring information from the corresponding encoder layer, which is combined with the upscaled inputs from the previous block using convolutions.
The Res-UNet [2] is a modified UNet that was introduced for catheter segmentation. As opposed to the standard UNet, which uses plain convolutions for the encoder and decoder blocks, ref. [2] uses residual blocks [5], which are convolutional layers that add the block input to the block output for improved back-propagation.
A simple method based on image processing was introduced in [6] for guidewire segmentation and localization. The segmentation is obtained by applying a Frangi filter [7] and using a k-nearest neighbor classifier to classify pixel patches centered at high-response locations and oriented by the Frangi filter orientation. The method is evaluated on only 8 image sequences and successfully detects the guidewire in 83.4% of the frames.
A steerable CNN was introduced in [8] as the first level of screening for guidewire segmentation, where the CNN's filters were steered to align with the guidewire direction for better accuracy. The paper used pixel patches to predict whether the center pixel is on the guidewire or not, and was focused on obtaining a good precision for 90% recall. In contrast, the proposed segmentation method uses a fully convolutional ResNet that takes a much larger context into account and is able to obtain much better guidewire segmentation results that trade off precision with recall.
A version of UNet was used in [9] for catheter segmentation. The method used a small UNet and transfer learning from synthetic or phantom data to obtain results similar to [2] on a catheter dataset.
In [10], the authors propose a two-phase guidewire segmentation method that uses a neural network to predict a binary indicator of whether overlapping patches contain the guidewire or not, and a UNet to obtain the segmentation result on the patches that are predicted positive. The paper does not specify how to combine the obtained overlapping segmentations, misses details on data augmentation during training, and has no code available. In contrast, the guidewire segmentation part of our work uses a single neural network to simultaneously predict a binary indicator on non-overlapping patches and obtain the final segmentation on the predicted positive patches, without any UNet-like decoder and without any skip connections. Moreover, our architecture is fully convolutional and does not need to extract image patches, gathering larger context and obtaining much better segmentation results, besides being more computationally efficient.
Another UNet architecture, with 12 transformer layers in the bottleneck, was used in [11] for guidewire segmentation. However, the method was evaluated on only 11 image sequences, and instead of segmenting the whole guidewire, the paper only segments the guidewire tip, which is much more visible. Moreover, the method is missing important training details, such as the training loss function and data augmentation information, and there is no code available, which may hinder reproducibility. The authors' follow-up paper [12] introduces a background residual attention layer and multiple frames to obtain even better guidewire tip segmentation results, but has the same reproducibility issues.
Another vision transformer was used in [13], together with a shape-sensitive loss function, to improve the segmentation accuracy of many standard CNN architectures, such as UNet [4], TransUNet [14], and SwinU-Net [15]. This work is complementary to ours, since it introduces a loss function, while our work introduces a novel architecture that is not a UNet. The shape-sensitive loss could in principle also be used together with our architecture to further increase accuracy.
From the guidewire segmentation papers discussed in this section, one can see that the segmentation results vary a lot from dataset to dataset. This is probably due to the variability introduced by the fluoroscopy machines, X-ray intensity, sensor sensitivity, etc., as well as the quality of the annotation. Therefore, in our opinion, guidewire segmentation methods cannot be compared when they are evaluated on different datasets; they can only be compared on the same dataset using the same evaluation measure. In fact, ref. [16] has pointed out that the same conclusion applies to many other medical image segmentation tasks.
In that respect, ref. [17] introduced the CathAction dataset, a dataset of more than 23,000 X-ray images obtained during endovascular interventions on pigs and phantoms. This dataset, together with a more challenging guidewire dataset, will be used in the experiments to evaluate the proposed method and compare it with the state of the art.
2.2. Guidewire Localization
A hierarchical method for guidewire localization was introduced in [18]. The method first detects short segments on the guidewire using a trained object detector. The segments are used as nodes in a weighted graph, where the edge weights are obtained by another classifier. Finally, a curve is obtained as the shortest path in the graph; thus, a single curve is obtained for each image. In contrast, the proposed approach uses a deep CNN to obtain a good segmentation, from which initial curves are extracted and then linked using a matching algorithm and a continuity measure.
Besides the Res-UNet segmentation method described in Section 2.1, ref. [2] also introduced a catheter localization method that extracts a centerline using skeletonization and connected components. The extracted curves are merged into a single curve using heuristics. Our proposed localization method also extracts centerlines using a type of skeletonization, but it constructs the curves as maximal chains of degree-two nodes instead of connected components, which ensures that each obtained curve is a chain of pixels with no bifurcations. Moreover, our method uses the Hungarian algorithm and a measure of curve continuation for merging the curves, and the final number of curves is obtained automatically.
The k-NN-based method from [6] connects the segmented guidewire blocks using a greedy energy minimization algorithm that tries to minimize the sum of distances and cosines of angles between the connected blocks.
3. Method Description
The proposed guidewire localization method is composed of four steps:
Guidewire segmentation, which labels each image pixel as belonging to the guidewire or not.
Initial curve extraction, which takes the segmentation result and returns a number of pixel chains as initial curves.
Perceptual curve grouping, which merges the initial curves into longer curves based on a continuation measure.
Cleanup, an optional step that removes all obtained curves that are shorter than a minimum length $\ell_{\min}$.
The first three steps are illustrated in Figure 1 and will be described in the following subsections.
3.1. Guidewire Segmentation
This step uses a deep CNN to obtain a good guidewire segmentation. The quality of the segmentation will be reflected in the quality of the obtained localization result, which is why we aim for the best possible segmentation.
For that reason, we introduce a novel segmentation method called MSLNet, described below.
Proposed MSLNet Segmentation Architecture
The proposed guidewire segmentation architecture is illustrated in Figure 2. It contains a ResNet (or other type of) feature extractor $f(\cdot)$ and two convolution filters: $w_1$, with $z^2$ output channels, and $w_0$, with one output channel, where $z$ is the patch size used in our experiments.
The MSLNet segmentation method consists of the following steps:
From an image of size $H \times W$, the ResNet is used as an encoder to extract a feature map $F$ of size $h \times w \times c$, with $h = H/z$ and $w = W/z$.
An initial segmentation $S_1$ is obtained from the feature map using the convolution kernel $w_1$, which produces a map of size $h \times w \times z^2$. Each $z^2$-dimensional vector from this map is reshaped to a $z \times z$ patch and placed at the corresponding location in a grid of patches, which together form the initial segmentation $S_1$ of size $H \times W$.
From the feature map $F$, a coarse segmentation $S_0$ of size $h \times w$ is also obtained using the convolution kernel $w_0$.
The final segmentation $S$ is obtained as
$$S = S_1 \cdot \mathbb{1}[r_z(S_0) > 0], \qquad (1)$$
where $\mathbb{1}[\cdot]$ is the indicator function and $r_z(\cdot)$ resizes the input to make it $z$ times larger in each direction, without interpolation, thus $r_z(S_0)(x, y) = S_0(\lfloor x/z \rfloor, \lfloor y/z \rfloor)$.
The whole process is summarized in Algorithm 1 below, where $c$ is the number of channels of the feature map.
Algorithm 1 MSLNet Segmentation. |
Input: Image $I$ of size $H \times W$, feature extractor (ResNet) $f$, filters $w_0$, $w_1$
Output: Binary segmentation $S$ of size $H \times W$
1: Compute $F = f(I)$ of size $h \times w \times c$, where $h = H/z$, $w = W/z$
2: Compute $P = w_1 * F$ of size $h \times w \times z^2$ and reshape each entry $P_{ij}$ to a $z \times z$ patch
3: Obtain initial segmentation $S_1$ of size $H \times W$ by tiling the patches $P_{ij}$ at positions $(iz, jz)$ in $S_1$
4: Compute $S_0 = w_0 * F$ of size $h \times w$
5: Obtain final segmentation $S = S_1 \cdot \mathbb{1}[r_z(S_0) > 0]$, as defined in Equation (1)
|
Observe that this approach requires the input image dimensions to be divisible by $z$. If that is not the case, the image is padded with zeros to make them divisible.
It is worth noting that this architecture directly predicts the segmentation from the encoded representation, without many decoder layers and without skip connections. This reduces the number of trainable parameters and the depth of the CNN, but faces some overfitting issues, which are addressed by the coarse segmentation branch $S_0$.
This approach can be thought of as a Marginal Space Learning (MSL) approach [18], where the marginal space is the space of coarse segmentations $S_0$, which is $z^2$ times smaller than the final segmentation space. Only the $z \times z$ patches corresponding to locations where $S_0 > 0$ are expanded to a fine segmentation; the rest are just set to zero. This is the reason this approach is called MSLNet.
The proposed MSLNet approach is specially designed for segmenting small objects that occupy only a small percentage of the image pixels; for example, the guidewire pixels occupy only about 0.3% of the image pixels. In such cases, standard patch-based and fully convolutional networks might overfit unless trained with sufficient data. MSLNet is better equipped for limited training data because it uses the coarse layer to predict which parts of the image the segmentation should focus on. The coarse layer, predicting just a binary label for each patch, is less prone to overfitting than the fine layer, which predicts the whole segmentation for each patch.
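To make the two-head design concrete, below is a minimal PyTorch sketch of the idea; the ResNet-152 backbone follows the experiments in Section 4, while the patch size z = 32 (the ResNet output stride) and the channel count c = 2048 are properties of that backbone, assumed here for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class MSLNetSketch(nn.Module):
    """Minimal sketch of the MSLNet idea: a fully convolutional encoder,
    a fine head predicting z*z values per cell (tiled into z-by-z patches),
    and a coarse head (one logit per cell) that gates the fine output."""
    def __init__(self, z=32, c=2048):
        super().__init__()
        backbone = torchvision.models.resnet152(weights=None)  # the paper uses ImageNet-pretrained weights
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # output stride 32, c channels
        self.fine = nn.Conv2d(c, z * z, kernel_size=1)    # plays the role of w1
        self.coarse = nn.Conv2d(c, 1, kernel_size=1)      # plays the role of w0
        self.z = z

    def forward(self, x):
        f = self.encoder(x)                         # (B, c, h, w), h = H/z, w = W/z
        s1 = F.pixel_shuffle(self.fine(f), self.z)  # tile the z*z vectors into patches: (B, 1, H, W)
        s0 = self.coarse(f)                         # coarse logits, (B, 1, h, w)
        gate = F.interpolate((s0 > 0).float(), scale_factor=self.z, mode="nearest")  # r_z, no interpolation
        return s1 * gate, s0                        # gated fine logits and coarse logits
```

Note that `pixel_shuffle` implements exactly the reshape-and-tile step of Algorithm 1, and the nearest-neighbor `interpolate` plays the role of $r_z$.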
3.2. Training the MSLNet
Training is carried out end-to-end using a two-term loss function that encourages a good coarse segmentation $S_0$ and a good final segmentation $S$. This is in contrast with [10], where the coarse segmentation and the UNet are trained separately.
The trainable parameters consist of the ResNet feature extractor parameters and the two convolution kernels $w_0$ and $w_1$.
Given a training example with input $I$ and target binary segmentation $T$, the coarse target $T_0$ is first constructed as a binary indicator for the grid of $z \times z$ patches, indicating whether they contain at least one guidewire pixel:
$$T_0(i, j) = \mathbb{1}\Big[\max_{(x, y) \in B_{ij}} T(x, y) > 0\Big],$$
where $B_{ij}$ is the $z \times z$ block of $T$ at position $(i, j)$.
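For illustration, the coarse target can be computed with a max-pooling operation over the $z \times z$ blocks; a minimal sketch (the function and tensor names are ours):

```python
import torch
import torch.nn.functional as F

def coarse_target(t, z=32):
    """Binary indicator per z-by-z block: 1 if the block contains at
    least one guidewire pixel of the binary target t of shape (B, 1, H, W)."""
    return F.max_pool2d(t.float(), kernel_size=z)  # stride defaults to z
```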
After constructing $T_0$, the training loss function for an observation $(I, T)$ has two parts,
$$L(I, T) = L_0(w_0 * f(I), T_0) + L_1(S_1, T),$$
the coarse segmentation loss $L_0$ and the fine segmentation loss $L_1$, where $f$ is the ResNet feature extractor and '$*$' is the convolution operator.
Inspired by [3], who combine the Dice and BCE losses, the coarse segmentation loss is the sum $L_0 = L_{\text{Dice}} + L_{\text{wBCE}}$ of the Dice loss and the weighted BCE loss. The Dice loss is
$$L_{\text{Dice}}(U, T_0) = 1 - \frac{2\sum_i \sigma(u_i)\, t_i + \epsilon}{\sum_i \sigma(u_i) + \sum_i t_i + \epsilon},$$
where the sums are taken over the coarse pixels, $u_i$ are the entries of $U = w_0 * f(I)$, $t_i$ are the entries of $T_0$, the function $\sigma(\cdot)$ is the sigmoid, and $\epsilon$ is a tuning parameter.
The weighted binary cross-entropy (BCE) loss is
$$L_{\text{wBCE}}(U, T_0) = -\frac{1}{2|P|}\sum_{i \in P} \log \sigma(u_i) - \frac{1}{2|N|}\sum_{i \in N} \log(1 - \sigma(u_i)),$$
where $P$ are the positive pixels of the coarse target $T_0$ and $N$ are the negative ones.
The fine segmentation loss is also the sum of the Dice loss and the weighted binary cross-entropy (BCE) loss:
$$L_1(S_1, T) = L_{\text{Dice}}(S_1, T) + L_{\text{wBCE}}(S_1, T), \qquad (7)$$
where here $P$ and $N$ are the positive and negative pixels of the target $T$, restricted to the patches where $T_0 = 1$, with $S$ as defined in Equation (1).
By restricting the fine segmentation loss only to the patches where $T_0 = 1$, we make sure that the training data are more balanced, since in this case the percentage of foreground pixels is much larger than when considering the entire image, where the guidewire occupies only about 0.3% of the pixels.
However, due to inaccuracies in the annotation, the BCE fine segmentation loss might not be the best choice, because it is not very robust to labeling noise. For that reason, we also experimented with replacing $L_{\text{wBCE}}$ with the Lorenz loss [19]:
$$L_{\text{Lor}}(S_1, T) = \frac{1}{2|P|}\sum_{i \in P} \log\big(1 + R(1 - s_i)^2\big) + \frac{1}{2|N|}\sum_{i \in N} \log\big(1 + R(1 + s_i)^2\big), \qquad (8)$$
where $R(\cdot)$ is the ReLU, $s_i$ are the entries of $S_1$, and $P$ and $N$ are the same as for Equation (7). This loss is more robust to labeling noise because it penalizes a mistake less than the BCE loss.
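A sketch of this loss is given below, assuming the margin form $\log(1 + R(1 \mp s_i)^2)$ reconstructed above; the exact form in the original reference [19] may differ in details such as the margin and the class weighting:

```python
import torch

def lorenz_loss(logits, targets):
    """Lorenz-type robust loss on fine-segmentation logits, balanced over
    positive and negative pixels (assumes both classes are present)."""
    y = 2.0 * targets - 1.0                # map {0, 1} labels to {-1, +1}
    margin = torch.relu(1.0 - y * logits)  # hinge-style margin violation
    loss = torch.log1p(margin ** 2)        # Lorenz: grows only logarithmically
    pos, neg = targets > 0.5, targets <= 0.5
    return 0.5 * loss[pos].mean() + 0.5 * loss[neg].mean()
```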
3.3. Initial Curve Extraction
To extract the initial curves, the thresholded segmentation result is processed using the thinning morphological operation, so that each pixel of the obtained output has a small number of neighbors, enabling the extraction of the initial curves as pixel chains. Thinning [20] is an iterative morphological algorithm that is applied to a binary image until convergence and aims to find the centerline of a strip of pixels. In our experiments, we used Matlab's bwmorph with the thinning option and scikit-image's thin, with identical results. We also experimented with two other related morphological operations, skeletonization and medial axis, but observed that thinning obtained slightly better results.
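The thinning step is a one-liner with scikit-image; a minimal sketch on a toy input:

```python
import numpy as np
from skimage.morphology import thin

seg = np.zeros((64, 64), dtype=bool)
seg[30:33, 10:50] = True               # a 3-pixel-thick strip standing in for a segmentation
centerline = thin(seg)                 # iterative thinning until convergence
assert centerline.sum() < seg.sum()    # a one-pixel-wide centerline remains
```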
To extract the pixel chains as curves, the 8-neighbor graph $G = (V, E)$ is first constructed, with $V$ the positive pixels of the thinned segmentation and edges $E$ connecting pixels that are 8-neighbors of each other. On the thinned segmentation result, most nodes of this graph have degree 2, and some have degree 3. Nodes with degree more than 3 are very rare.
The rest of the curve extraction is described in Algorithm 2 below.
Algorithm 2 Initial Curve Extraction. |
Input: Binary segmentation $S$
Output: Set of initial curves $\mathcal{S}$
1: Apply morphological thinning to $S$, obtaining output $T$
2: Construct the 8-neighbor graph $G = (V, E)$ with $V$ the positive pixels of $T$
3: Initialize curve set $\mathcal{S} \leftarrow \emptyset$
4: while there exists a node $i$ of degree 2 do
5: Let $C = (j, i, k)$, where $j, k$ are the two neighbors of $i$
6: while $k$ has degree 2 do
7: if there exists a neighbor $k'$ of $k$ with $k' \notin C$ then
8: $C \leftarrow (C, k')$
9: Set $k \leftarrow k'$
10: end if
11: end while
12: while $j$ has degree 2 do
13: if there exists a neighbor $j'$ of $j$ with $j' \notin C$ then
14: $C \leftarrow (j', C)$
15: Set $j \leftarrow j'$
16: end if
17: end while
18: Add $C$ to $\mathcal{S}$: $\mathcal{S} \leftarrow \mathcal{S} \cup \{C\}$
19: Remove from $V$ all nodes in $C$: $V \leftarrow V \setminus C$, and remove the corresponding edges from $E$
20: end while
|
Lines 6–17 extract the initial curves as maximal chains C containing a node i of degree 2. Observe that because it is a chain, each curve C induces an ordering of its nodes, an ordering that is unique up to its reversal.
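A compact Python sketch of this chain extraction is given below; the helper names are ours, and the implementation favors clarity over speed:

```python
import numpy as np

def extract_chains(thinned):
    """Sketch of Algorithm 2: extract the initial curves as maximal chains
    seeded at degree-2 pixels of a thinned binary image."""
    pixels = set(map(tuple, np.argwhere(thinned)))

    def neighbors(p):
        y, x = p
        return [(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0) and (y + dy, x + dx) in pixels]

    chains = []
    while True:
        seed = next((p for p in pixels if len(neighbors(p)) == 2), None)
        if seed is None:
            break                           # no degree-2 node left
        a, b = neighbors(seed)
        chain = [a, seed, b]
        for at_front in (False, True):      # grow at the back, then at the front
            while True:
                tip = chain[0] if at_front else chain[-1]
                if len(neighbors(tip)) != 2:
                    break                   # chain end: junction or endpoint
                ext = [q for q in neighbors(tip) if q not in chain]
                if not ext:
                    break                   # closed loop
                if at_front:
                    chain.insert(0, ext[0])
                else:
                    chain.append(ext[0])
        chains.append(chain)
        pixels -= set(chain)                # line 19 of Algorithm 2
    return chains
```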
3.4. Perceptual Curve Grouping
Perceptual curve grouping takes the curves extracted in Section 3.3 and merges them into longer curves using a continuation measure. When two curves are merged, the pixel ordering of one of them might need to be reversed to obtain a consistent ordering for the merged curve. The whole perceptual grouping algorithm is described in Algorithm 3, with its components described below.
In Algorithm 3, the curve end directions are estimated using PCA for each curve and are used for the curve continuation measure.
Therefore, for $n$ curves there are $2n$ PCA models, with models $2i-1$ and $2i$ corresponding to curve $C_i$. Each model consists of a mean $\mu$ and a principal direction $d$. Model $2i-1$ is built from the first $k$ points of the curve, as illustrated in Figure 3, while model $2i$ is built from the last $k$ points. If the curve is less than $k$ points long, the PCA models are estimated from all curve points. We used a fixed value of $k$ in all experiments.
The directions are then aligned in step 6 to point outwards from the curve, by making them point towards the respective end of the curve. To align a direction $d$ with mean $\mu$ to point towards an endpoint $p$, first $s = \langle d, p - \mu \rangle$ is computed. If $s \geq 0$, then $d$ is already aligned. If $s < 0$, then the direction is reversed: $d \leftarrow -d$.
The point-direction pairs are checked in line 10 to be within a distance range and an angle alignment. The angle alignment checks that the angles between $d_a$ and $\mu_b - \mu_a$, and between $d_b$ and $\mu_a - \mu_b$, are less than ${\approx}45°$, corresponding to the threshold $a_{\min}$ in line 10 of Algorithm 3.
Algorithm 3 Perceptual Curve Grouping (PCG). |
Input: Curve set $\mathcal{S} = \{C_1, \dots, C_n\}$, parameters $n_{it}$, $k$, $d_{\max}$, $a_{\min}$, $s_{\max}$
1: for $it = 1$ to $n_{it}$ do
2: for $i = 1$ to $n$ do
3: Let $p_1, \dots, p_{m_i}$ be the points of curve $C_i$
4: Compute the PCA model $(\mu_{2i-1}, d_{2i-1})$ from the first $k$ points of $C_i$
5: Compute the PCA model $(\mu_{2i}, d_{2i})$ from the last $k$ points of $C_i$
6: Align $d_{2i-1}$ and $d_{2i}$ with $p_1$ and $p_{m_i}$, respectively
7: end for
8: for all pairs $(a, b)$ of curve ends from different curves do
9: Set $\delta = \mu_b - \mu_a$ and $u = \delta / \|\delta\|$
10: if $\|\delta\| < d_{\max}$ and $\langle d_a, u \rangle > a_{\min}$ and $\langle d_b, -u \rangle > a_{\min}$ then
11: Fit a degree 3 polynomial $f$ between $(\mu_a, d_a)$ and $(\mu_b, d_b)$, with Algorithm A1
12: $M_{ab} \leftarrow$ the continuation cost of $f$
13: else
14: $M_{ab} \leftarrow \infty$
15: end if
16: end for
17: Use the Hungarian algorithm to find a permutation $\sigma$ minimizing $\sum_a M_{a\sigma(a)}$
18: Set $\sigma(a) \leftarrow 0$ for all $a$ such that $M_{a\sigma(a)} > s_{\max}$
19: Validate $\sigma$ into the closest index vector $v$ using Algorithm A2
20: $\mathcal{S} \leftarrow \text{MergeCurves}(\mathcal{S}, v)$ (Algorithm 4), and set $n \leftarrow |\mathcal{S}|$
21: end for
|
For the pairs that pass the check, a continuation measure $M_{ab}$ is computed based on fitting a degree 3 polynomial $f$, as specified in Algorithm A1 and illustrated in Figure 4.
For that, a coordinate system is constructed, centered at $\mu_a$ with the x-axis towards $\mu_b$; thus the x-axis is $u = (\mu_b - \mu_a)/\|\mu_b - \mu_a\|$ and the y-axis is the vector $u^{\perp}$ obtained by rotating $u$ by 90°.
Then a degree three polynomial is fitted analytically to go through $\mu_a$ and $\mu_b$ and be tangent to $d_a$ at $\mu_a$ and to $d_b$ at $\mu_b$, as described in Algorithm A1. One can easily check that $M_{ab} = M_{ba}$, so the continuation matrix $M$ is symmetric.
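A degree 3 polynomial through two points with prescribed end tangents is a cubic Hermite interpolant and can be fitted in closed form. The sketch below does this in the local coordinate system described above and scores the connection by the bending energy of the fitted cubic; this cost is our stand-in, since the exact scoring function of Algorithm A1 is not reproduced here:

```python
import numpy as np

def continuation_cost(mu_a, d_a, mu_b, d_b):
    """Fit a cubic y = f(x) in a frame centered at mu_a with the x-axis
    toward mu_b, passing through both points with the prescribed end
    tangents, and score it by its bending energy."""
    mu_a, mu_b = np.asarray(mu_a, float), np.asarray(mu_b, float)
    d_a, d_b = np.asarray(d_a, float), np.asarray(d_b, float)
    v = mu_b - mu_a
    L = np.linalg.norm(v)
    ex = v / L                                 # local x-axis, toward mu_b
    ey = np.array([-ex[1], ex[0]])             # local y-axis
    sa = np.dot(d_a, ey) / np.dot(d_a, ex)     # slope at x = 0
    sb = np.dot(-d_b, ey) / np.dot(-d_b, ex)   # slope at x = L (d_b points outward)
    # f(x) = sa*x + b*x^2 + c*x^3 with f(0) = f(L) = 0, f'(0) = sa, f'(L) = sb
    c = (sa + sb) / L**2
    b = -(2 * sa + sb) / L
    # closed-form bending energy: integral of f''(x)^2 over [0, L]
    return 4 * b**2 * L + 12 * b * c * L**2 + 12 * c**2 * L**3
```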
The curve ends are matched using the Hungarian algorithm [21], and the matches with cost $M_{a\sigma(a)} > s_{\max}$ are discarded.
The matches are validated so that only pairs such that i is matched to j and j is matched to i are kept, as described in Algorithm A2. This step is essential, since the curve merging step would fail without it.
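The matching and validation steps can be sketched with scipy's `linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm; the encoding below (including the assumption that invalid pairs, such as the two ends of the same curve, have been set to infinity by the caller) follows our reading of Algorithms 3 and A2:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_ends(M, s_max):
    """Match curve ends using the cost matrix M (np.inf for pairs that
    failed the distance/angle check), then keep only mutual matches with
    cost at most s_max. Returns -1 for unmatched ends (0 in the paper's
    1-based notation)."""
    big = 1e9
    cost = np.where(np.isfinite(M), M, big)   # the solver needs finite costs
    rows, cols = linear_sum_assignment(cost)
    sigma = np.full(len(M), -1, dtype=int)
    for a, b in zip(rows, cols):
        if cost[a, b] <= s_max:
            sigma[a] = b
    # validation (Algorithm A2): keep a match only if it is mutual
    return np.array([b if b >= 0 and sigma[b] == a else -1
                     for a, b in enumerate(sigma)])
```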
Then the curves are merged based on the validated endpoint matches, as described in Algorithm 4. The function $R(C)$ reverses the points of a curve $C$.
Algorithm 4 MergeCurves. |
Input: Curves $C_1, \dots, C_n$, validated closest index vector $v$
Output: Merged curves $O$
1: Initialize: for $i = 1$ to $n$ set $u_i \leftarrow 0$; set $O \leftarrow \emptyset$
2: while there exists $i$ with $u_i = 0$ do
3: Select $i$ with $u_i = 0$ and set $u_i \leftarrow 1$, $G \leftarrow C_i$
4: Set $a \leftarrow v(2i)$
5: while $a \neq 0$ and $u_{\lceil a/2 \rceil} = 0$ do
6: Set $j \leftarrow \lceil a/2 \rceil$ and $u_j \leftarrow 1$
7: if $a = 2j - 1$ then (the match is to the start of $C_j$)
8: Set $G \leftarrow (G, C_j)$ and $a \leftarrow v(2j)$
9: else (the match is to the end of $C_j$)
10: Set $G \leftarrow (G, R(C_j))$ and $a \leftarrow v(2j - 1)$
11: end if
12: end while
13: Set $a \leftarrow v(2i - 1)$ (now grow from the start of $C_i$)
14: while $a \neq 0$ and $u_{\lceil a/2 \rceil} = 0$ do
15: Set $j \leftarrow \lceil a/2 \rceil$ and $u_j \leftarrow 1$
16: if $a = 2j$ then (the match is to the end of $C_j$)
17: Set $G \leftarrow (C_j, G)$ and $a \leftarrow v(2j - 1)$
18: else (the match is to the start of $C_j$)
19: Set $G \leftarrow (R(C_j), G)$ and $a \leftarrow v(2j)$
20: end if
21: end while
22: Add curve $G$ to $O$: $O \leftarrow O \cup \{G\}$
23: end while
|
4. Experiments
Experiments are performed on two datasets: a guidewire dataset and the CathAction dataset [17].
The guidewire dataset contains 82 fluoroscopic video sequences recorded during coronary angioplasty procedures, with 871 frames of various sizes. Of the 82 video sequences, 42 were used as training data, containing 433 frames, and 40 as test data, containing 438 frames. Two example images from this dataset are shown in Figure 1 and Figure 2. The guidewire was annotated using splines, so the accuracy is quite good, usually within 1 pixel of the guidewire. However, in places where the guidewire is not visible or has high curvature, the annotation might be more than 1 pixel away. Examples of annotations can be seen as red curves in Figure A1, Figure A3 and Figure A5. There is one annotation for each image, and the annotator might differ between images.
The CathAction dataset [17] contains 23,449 X-ray images obtained from endovascular interventions on animals (pigs) and imaging phantoms. The CathAction dataset divides the images into 18,758 training images, consisting of 4021 animal and 14,737 phantom frames, and 4691 test images, with 1006 animal and 3685 phantom frames. The dataset is annotated for segmentation, using 3–5 pixel thick and very long line segments, so the annotation is not very precise. The catheter pixels and guidewire pixels are annotated with different labels, with examples shown in Figure 5, Figure 6, Figure A2 and Figure A4. As one can see, the guidewire is not annotated inside the catheter. There is one annotation for each image, and the annotator might differ between images.
4.1. Methods Compared and Implementation Details
For segmentation, we compared our proposed MSLNet segmentation approach with the nnU-Net [3], the Res-UNet [2], the SCNN [8], the two-step method from [10], and the hierarchical method from [18].
For localization, we compared with the hierarchical localization method [18] and with [2], as these two were the only methods that output parameterized curves.
For nnU-Net, we used the GitHub package nnunetv2 (https://github.com/MIC-DKFZ/nnUNet, accessed on 16 October 2025) and trained it on our data using the default parameters (batch size 12 for the guidewire dataset and 4 for CathAction, default weight decay, Stochastic Gradient Descent with an initial learning rate of 0.01 decreasing linearly to 0), except that the number of epochs was 300. The network was initialized with the default initialization for network layers built into PyTorch 2.8.0, and no early stopping was used. Ensembling was not used, for a fair comparison.
The MSLNet was also trained as part of the nnU-Net framework, using the same default parameters discussed above, for better segmentation results. This is because the nnU-Net framework offers a rich array of data augmentation transformations that have proven useful in many segmentation applications [16]. The ResNet-152 backbone was initialized with the default weights from PyTorch (pretrained on ImageNet-1k).
For the Res-UNet [2] architecture, we used the authors' code from the GitHub package https://github.com/pambros/CNN-2D-X-Ray-Catheter-Detection (accessed on 16 October 2025), but trained it within the nnU-Net framework, again for better segmentation results. The same training parameters were used as for MSLNet and nnU-Net. We also used the authors' curve grouping code from the same GitHub package to obtain the localization results.
Because the two-step method [10] does not have code available, we used our own implementation of the classification CNN and the segmentation UNet based on the description in the paper. However, we encountered overfitting issues when training these models without data augmentation. For the UNet, without data augmentation, the train $F_1$ was 0.78 and the test $F_1$ was 0.26 on the guidewire dataset. With data augmentation, the train $F_1$ was 0.36 and the test $F_1$ was 0.26. For the binary classification CNN, data augmentation in the form of random translation up to 4 pixels and random rotation up to 15 degrees helped with overfitting, obtaining a train $F_1$ of 0.90 and a test $F_1$ of 0.46 on the guidewire dataset.
For the SCNN [8], we used our own implementation of a four-layer SCNN, and for the Hierarchical method [18], we used a pretrained model.
All experiments were performed on a Core i7 computer with 32 GB RAM and an NVIDIA MSI Gaming GeForce 3090 GPU.
The training and test times for the different methods on the guidewire dataset are summarized in Table 1, where the test times and FLOPS are shown for 512 × 512 images. From Table 1, one can see that MSLNet has the smallest detection time, due to the fact that the ResNet feature extractor is a fully convolutional network that can be applied directly to images of any size. The other competitive methods, such as nnU-Net and Res-UNet, need to crop images of a certain size on which to apply the segmentation and then merge the obtained results into a final segmentation output, which increases the segmentation time.
4.2. Evaluation Measures
The methods are evaluated using precision, recall, $F_1$ scores, the Dice coefficient, IOU (Intersection over Union), and the average Hausdorff distance (AHD). Because some methods only produce a segmentation, separate comparisons are conducted for segmentation and for localization. All results are shown as the average and standard deviation obtained from four independent runs, except for the Hierarchical method from [18], for which we only have a pretrained model.
Annotating a one-pixel-wide guidewire is prone to inaccuracies, which can drastically affect standard measures such as the Dice coefficient or IOU. To see this, one can imagine evaluating a perfect 1-pixel-wide result against a 1-pixel-wide annotation that is one pixel off everywhere. Such a result would have a Dice and IOU of 0, while being visually close to perfect. Our conclusion is that Dice and IOU are very good for evaluating blob-like structures, such as organs, but not for very thin structures, such as the guidewire. The catheter segmentation evaluation is somewhere in the middle: because the catheter is 3–5 pixels wide, the Dice and IOU are less sensitive than for the guidewire evaluation, but they are still sensitive to some extent.
For this reason, besides the Dice and IOU, we also evaluate using precision, recall, and $F_1$ scores, measures that are specifically designed for robustness to such inaccuracies. The precision is defined as the percentage of detected pixels that have an annotated guidewire or catheter pixel at a distance of at most 3 pixels. The recall is defined as the percentage of annotated guidewire pixels that are at a distance of at most 3 pixels from a detected pixel. The $F_1$ score is defined as usual, $F_1 = 2pr/(p + r)$, in terms of the precision $p$ and recall $r$ defined above.
We will also compute the average Hausdorff distance (AHD), which is the average of two measures: the average distance of the detected pixels to the closest annotation pixels, and the average distance of the annotation pixels to the closest detected pixels. This measure is more lenient to annotation inaccuracies, but is not well defined for images where there are no detected pixels.
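Both the tolerance-based measures and the AHD can be computed from two distance transforms; a minimal sketch (function name ours, assuming both masks are non-empty):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def tolerant_prf(pred, gt, tol=3.0):
    """Precision/recall/F1 with a pixel tolerance, plus the average
    Hausdorff distance (AHD), for boolean masks pred and gt."""
    dist_to_gt = distance_transform_edt(~gt)      # distance to the nearest annotated pixel
    dist_to_pred = distance_transform_edt(~pred)  # distance to the nearest detected pixel
    p = (dist_to_gt[pred] <= tol).mean()          # precision over detected pixels
    r = (dist_to_pred[gt] <= tol).mean()          # recall over annotated pixels
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    ahd = 0.5 * (dist_to_gt[pred].mean() + dist_to_pred[gt].mean())
    return p, r, f1, ahd
```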
On the CathAction dataset, the mask annotation is usually several pixels thick, but the method from [18] always outputs a one-pixel-wide result, so the Dice and IOU results are even less relevant for this method on this dataset.
For guidewire localization, the same precision and recall measures are used to evaluate the rasterization of the obtained curves. However, for the CathAction dataset, which does not provide a one-pixel-wide annotation but a several-pixel-wide segmentation, the annotation is first thinned to approximate the location of the guidewire inside the catheter, and this thinned segmentation is used for evaluation. The localization Dice, IOU, and AHD are also evaluated on this thinned segmentation, to be able to compare one-pixel-wide results with one-pixel-wide annotations. The guidewire localization is also evaluated on the average number of curves obtained per image, which should be close to the true average number of curves obtained from the annotation, which, on the guidewire test set, is 1.1. We do not know the average number of curves on the CathAction dataset because it provides only segmentation masks, not curve annotations. Approximating the number of curves using connected components on the CathAction GT masks, we obtained an average of 1.4 curves per image on the test set. However, this might not be an accurate number, as one can see in Figure 5d, where there is only one catheter but the GT mask is broken into two connected components.
4.3. Segmentation Results
The segmentation results are displayed in Table 2 for both datasets. Two-sample t-tests based on the results of the four independent runs were conducted to compare the best results with the other ones, except for the Hierarchical method [18], which is far behind anyway. Based on these t-tests, the best results and the ones that are not significantly worse are shown in bold.
Two MSLNet versions are shown: "MSLNet", trained with the Dice + BCE loss function from Equation (7), and "MSLNet-Lor", trained with the Dice + Lorenz loss from Equation (8). The 95% confidence intervals for the results from Table 2 are shown in Table A1.
From Table 2, MSLNet performs better than the other methods, including the nnU-Net. The Lorenz loss has a strong influence on the results for the guidewire dataset, where the wire is one pixel wide, increasing the $F_1$ score from 89.38 to 92.68, but not for the CathAction dataset, where the catheter is 3–5 pixels wide.
As expected, the nnU-Net [3] performed very well on both datasets, being the second best after MSLNet, followed by the Res-UNet [2]. The other three methods, SCNN [8], Two-phase [10], and Hierarchical [18], are behind by a large margin. The Steerable CNN [8] was designed to only serve as an initial step towards segmentation, using pixel patches to predict whether the center pixel is on the guidewire or not. For that reason, it is not capable of capturing long-range interactions, and it has $F_1$ scores comparable with those of the Two-phase method [10], which also operates on patches. Also, because it uses a Spherical Quadrature Filter [22] response map as a preprocessing step, its output is one pixel thin, so the Dice/IOU scores for the CathAction data are very small and unreliable.
We can also see from Table 2 that some methods reach quite high $F_1$ scores, around 90%, while their Dice/IOU scores are very low. In our opinion, this confirms that the Dice/IOU scores are more sensitive to annotation inaccuracies for wire-like structures than the precision/recall and $F_1$ scores defined in Section 4.2. Moreover, we see that the Dice/IOU scores are higher on the CathAction data than on the guidewire data for some methods with similar $F_1$ scores. This is in line with the fact that the catheter evaluated in the CathAction data is thicker than the guidewire, so the Dice/IOU are less sensitive to annotation errors for the catheter than for the guidewire. Nevertheless, all four measures ($F_1$ score, Dice, IOU, and AHD) tell the same story about how the best segmentation methods compare with each other.
Table 3 shows cross-dataset segmentation results for the models trained on the guidewire dataset and tested on the CathAction data, and vice versa, with 95% confidence intervals shown in Table A1. The testing on CathAction was separated into the animal and phantom data, because the images are very different, with the animal images resembling the guidewire data.
From Table 3, one can see that all four methods trained on the guidewire data and tested on the CathAction animal data performed quite well, with Res-UNet, nnU-Net, and MSLNet performing better in terms of $F_1$ score and AHD, and MSLNet-Lor better in terms of Dice and IOU. On the CathAction phantoms, none of the methods performed well, with MSLNet-Lor being the best in $F_1$ score, Dice, and IOU, and Res-UNet and MSLNet being better in AHD.
Training on the CathAction data and testing on the guidewire data yielded quite poor results for all methods, a sign that the CathAction data are easier than the guidewire data. In this case, MSLNet-Lor was in the top-performing group on all measures, nnU-Net was in the top group for $F_1$, Dice, and IOU, and MSLNet was in the top group for $F_1$ and AHD.
In conclusion, the cross-dataset experiments show that the two MSLNet versions performed very well, with MSLNet being in the top group for $F_1$ and AHD on all datasets.
4.4. Localization Results
The localization results are shown in Table 4, with confidence intervals in Table A2. From Table 4, one can see that the two MSLNet versions obtain the best results in all measures by a large margin, followed by Res-UNet [2] and then by the Hierarchical method [18].
Here, the Dice and IOU scores are even less reliable, because the results and the annotations are both 1 pixel wide (see examples in Figure A3 and Figure A4), and the Dice/IOU scores are extremely sensitive to even 1-pixel discrepancies between the annotation and the localization result.
Looking at the CathAction $F_1$ scores, we notice that the Res-UNet [2] curve grouping method starts with a good segmentation $F_1$ score of 92.48 and obtains a localization $F_1$ score of 53.49, while MSLNet-Lor starts from a slightly smaller segmentation $F_1$ score of 92.26 and obtains a localization $F_1$ of 83.44. A similar but less dramatic phenomenon is observed on the guidewire dataset. This is an indication that our proposed perceptual grouping method does a better job at grouping curves based on good continuation than the method from [2].
The perceptual grouping does a very good job of grouping curve fragments, with one such example shown in Figure 6a. We could find only very few failure cases of the perceptual grouping method; one such case is shown in Figure 6b.
Figure 6. (a): a success example of the perceptual grouping method, where four curves were connected correctly. (b): a failure example. The segmentation result (green) is shown in the top image, and the obtained perceptual grouping result is shown in the bottom image. The annotation is shown in red.
4.5. Evaluation of F1 Pixel Tolerance
The precision, recall, and $F_1$ measures have been evaluated using a 3-pixel tolerance. Table 5 and Figure 7 show the test set $F_1$ values computed using 0–4 pixel tolerances for the top methods: Res-UNet, nnU-Net, MSLNet, and MSLNet-Lor for segmentation, and Res-UNet, MSLNet, and MSLNet-Lor for localization. From Table 5 and Figure 7, one can see that the $F_1$ values for 0-pixel tolerance are very small and rise quickly with the tolerance distance up to 3 pixels. However, the rise from 3- to 4-pixel tolerance is not as large, which justifies using the 3-pixel tolerance in our evaluations.
4.6. Segmentation Ablation Studies
The segmentation ablation studies evaluate the importance of using the MSL training, the form of the loss function $L_0$ for the coarse segmentation, and the form of the loss function $L_1$ for the fine segmentation.
The importance of MSL. In this experiment, we removed the coarse segmentation $S_0$ and its coarse loss function, keeping just the upper path in Figure 2, so that the final segmentation is the initial segmentation $S_1$. The results using the Dice + BCE loss from Equation (7), with and without MSL, are shown in Table 6. From Table 6, one can see that the MSL training is important for the guidewire dataset, but not for the CathAction dataset.
Qualitative examples of the architecture without MSL and of the MSLNet-Lor segmentation and localization are shown in Figure A5.
The form of the coarse segmentation loss function $L_0$. Intuitively, the coarse segmentation loss function should follow the pattern observed by [3], that the Dice + BCE loss is better for segmentation than the individual Dice or BCE losses. Indeed, Table 7 confirms this intuition, with the Dice + BCE loss obtaining higher $F_1$ scores than the Dice or BCE losses alone on the guidewire dataset. The Dice scores are higher when using the Dice loss only, because in this case the Dice is explicitly maximized by that loss. However, the higher Dice score is not reflected in a higher $F_1$ score, so we consider it unreliable.
The form of the fine segmentation loss function $L_1$. For the fine segmentation, we know from [3] that Dice + BCE is better than Dice or BCE alone, so we only compare the Dice + BCE from Equation (7) with Dice + Lorenz, where the Lorenz loss is given in Equation (8). The results are given in Table 8. From Table 8, one can see that the Lorenz loss is important for obtaining higher $F_1$ and Dice scores on the guidewire dataset, which has 1-pixel-wide annotations, but not on the CathAction dataset, which has thicker annotations.
4.7. Localization Ablation Studies
The localization ablation evaluates the importance of the whole perceptual grouping algorithm and of its tuning parameters for the quality of the localization result.
The importance of perceptual grouping. First, we evaluate the importance of the proposed perceptual grouping algorithm. For that, Table 9 shows localization results after initial curve extraction, with and without perceptual grouping, on the guidewire dataset, starting from the MSLNet-Lor segmentation. Also shown are results after the cleanup step of removing short curves (shorter than $\ell_{\min}$ pixels) directly from the extracted curves, without perceptual grouping.
From Table 9, one can see that the extracted initial curves are broken into many pieces, and just removing the short pieces results in a worse localization with more curves than when using the proposed perceptual grouping method.
We can also see that cleanup after perceptual grouping has a minimal influence, and the cleanup step can be removed.
Tuning parameters. The perceptual grouping method has a number of tuning parameters: the number of iterations $n_{it}$, the number of points $k$ used to estimate the endpoint directions, the maximum distance $d_{\max}$ for endpoint matching, the minimum alignment parameter $a_{\min}$, the maximum continuity score $s_{\max}$, and the minimum final curve length $\ell_{\min}$. They are set to fixed default values in all experiments.
The dependence of the obtained result on each of these parameters, while the others are kept at their default values, is shown in Table 10. These experiments are performed on the guidewire dataset, starting from the MSLNet-Lor segmentation. From Table 10, one can see that the $F_1$ score, AHD, and average number of curves depend only slightly on these parameters when their values are in the ranges from the table.
Continuation measure. The continuation measure between nearby curves used in line 12 of Algorithm 3 is based on fitting a polynomial $f$ between the two curves and measuring its cost, as described in Section 3.4. We also experimented with the Bhattacharyya distance (BD), a measure of similarity between distributions.
For that, in the PCA steps 4–5 of Algorithm 3, when we obtain the means and directions $(\mu_a, d_a)$, we also obtain the corresponding singular values. Then we use probabilistic PCA (PPCA) models $N(\mu_a, \Sigma_a)$, with $\Sigma_a$ obtained from the PCA directions and singular values plus an isotropic noise variance $\sigma^2$, to compute a continuation measure based on the Bhattacharyya distance (9) instead of lines 11–12 in Algorithm 3. The Bhattacharyya distance for two Gaussians $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ is
$$D_B = \frac{1}{8}(\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2}\ln \frac{\det \Sigma}{\sqrt{\det \Sigma_1 \det \Sigma_2}}, \qquad (9)$$
where $\Sigma = (\Sigma_1 + \Sigma_2)/2$. Observe that this BD measure has one more tuning parameter than the polynomial measure: the noise variance $\sigma$.
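For reference, the Bhattacharyya distance of Equation (9) can be computed directly from the PPCA means and covariances; a minimal sketch:

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians N(mu1, cov1), N(mu2, cov2)."""
    cov = 0.5 * (cov1 + cov2)
    diff = np.asarray(mu1, float) - np.asarray(mu2, float)
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)   # Mahalanobis-like term
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2
```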
An evaluation of the BD continuation measure with different values of $\sigma$ is shown in Table 11. The other parameters were kept at their default values. From Table 11, one can see that the dependence on the parameters is minimal, but the best $F_1$ score is slightly lower than that obtained by the polynomial measure. The difference is significant: the $p$-value of a paired $t$-test on the difference in $F_1$ values between the two results obtained from the same four MSLNet-Lor segmentation runs is quite low. This confirms that the continuation measure based on the polynomial fit is better than the BD continuation measure.
5. Discussion
The segmentation experiments show that the end-to-end trained segmentation methods obtain better results than other methods that obtain the result in a number of steps that are trained separately. These experiments also indicate that the proposed MSLNet segmentation method obtains competitive results with the other methods evaluated, on both datasets. Moreover, using the Lorenz loss for robustness to annotation imperfections further improves the results for the guidewire dataset, but not for the CathAction data, where the annotation is thicker and the imperfections are not so important.
The localization experiments show that the proposed perceptual grouping method is better than the existing methods evaluated in organizing the segmented pixels into a number of initial curves, together with filling in gaps between the initial curves based on good continuation. The proposed perceptual organization method is less greedy than existing methods because it finds the global minimum of a loss function using the Hungarian algorithm instead of heuristics or greedy loss minimization.
6. Conclusions
This paper introduced a method for guidewire localization based on a perceptual grouping algorithm that groups a set of initial curves into longer curves based on a good continuation measure using PCA models at the curve endpoints. The initial curves are extracted from a guidewire segmentation result.
The paper also introduces a guidewire segmentation method based on a ResNet that directly predicts a coarse segmentation as well as a fine segmentation at promising locations indicated by the coarse segmentation.
Experiments on two datasets show that the proposed method obtains competitive results, usually outperforming existing guidewire segmentation and localization methods.
The perceptual organization method has some weaknesses: it relies on a good guidewire segmentation, and it has six tuning parameters, which could be considered too many. However, we saw in the ablation study that the method is quite robust to the tuning parameters, which can take values in a large range.
In the future, we plan to study deep-learning-based methods for perceptual grouping that can be trained end-to-end, possibly by reinforcement learning.