Automatic Detection and Segmentation of Thrombi in Abdominal Aortic Aneurysms Using a Mask Region-Based Convolutional Neural Network with Optimized Loss Functions

The detection and segmentation of thrombi are essential for monitoring the disease progression of abdominal aortic aneurysms (AAAs) and for patient care and management. Because they can learn complex features, deep convolutional neural networks (CNNs) have recently been introduced to improve thrombus detection and segmentation. However, investigations into the use of CNN methods are in the early stages, and most existing methods focus on the segmentation of thrombi, which only works after the thrombi have been detected. In this work, we propose a fully automated method for the whole process of thrombus detection and segmentation, which is based on the well-established mask region-based convolutional neural network (Mask R-CNN) framework that we improve with optimized loss functions. A combination of complete intersection over union (CIoU) loss and smooth L1 loss was designed for accurate thrombus detection, and thrombus segmentation was improved with a modified focal loss. We evaluated our method on 60 clinically approved patient studies (i.e., computed tomography angiography (CTA) image volume data) using 4-fold cross-validation. Comparisons with multiple state-of-the-art methods suggested the superior performance of our method, which achieved the highest F1 score for thrombus detection (0.9197) and outperformed the other methods on most metrics for thrombus segmentation.


Introduction
An abdominal aortic aneurysm (AAA) is an expansion of the abdominal aorta by more than 50% of its normal diameter, which is caused by weakened arterial walls [1]. AAAs are the 14th most common cause of mortality in the United States and are a significant public health issue [2]. After the age of 50, the incidence of AAAs increases steadily, and AAAs are 2-3 times more likely to occur in males [3,4]. In the majority of cases, AAAs dilate continually without causing any symptoms. However, when blood leaks between the arterial walls or a portion of the artery ruptures, the mortality rate rises to more than 80% [5,6]. To prevent possible leaks or ruptures, one available treatment is a surgical approach known as open aneurysm repair (OAR), while another is a minimally invasive technique known as endovascular aneurysm repair (EVAR) [7]. Open aneurysm repair is a traditional surgical treatment that removes the aneurysm completely through open surgery and replaces it with a synthetic vessel. Endovascular aneurysm repair does not require open surgery; instead, a stent graft is inserted through the vessel to prevent rupture by directing blood flow through the stent graft rather than into the aneurysm. The treatment method is chosen only after carefully considering the anatomic characteristics of the aneurysm and the age, gender, concomitant diseases, and mortality risk of the patient.
After surgical treatment, a synthetic stent graft remains inside the aorta, and damage or fatigue in the graft material can induce leakage, graft migration, or graft twisting, which can lead to rupture or occlusion. As a result, depending on the diameter and state of the thrombus, the patient should be checked every 3 to 12 months [8]. Computed tomography angiography (CTA) is currently the preferred imaging method for diagnosing changes in thrombus volume or ruptures, and trained radiologists manually detect aneurysms and measure the thrombus diameter in each slice that is obtained through CTA [9,10]. However, CTA images are challenging for the following reasons:
• As thrombi have irregular morphologies, precise segmentation is essential;
• Similar intensity values make distinguishing a thrombus from surrounding tissues challenging;
• When the thrombus is obscured by the metal stent graft, it becomes difficult to detect and segment;
• Manual labeling takes a long time, even for expert radiologists, and data are limited.
Figure 1 shows the characteristics that are mentioned above. As a result, graph cuts, level sets, and deformable models, which are traditional methods that use intensity information with shape constraints, are difficult to use for the accurate segmentation of thrombi that have similar intensities to surrounding tissues [11][12][13][14][15][16]. As most of these methods rely heavily on the adjustment of several parameters, their applicability in clinical practice is limited. Trained radiologists can also only detect thrombi with 65% accuracy, and the manual work is time-consuming [17]. Deep convolutional neural networks (DCNNs) have become well known for their remarkable success in computer vision tasks, such as image classification, object detection, and image segmentation [18][19][20]. In addition, deep learning-based methods for thrombus detection and segmentation have also achieved high levels of performance.
However, most of the methods that use deep learning focus on thrombus segmentation rather than thrombus detection. Few studies have obtained segmentation results following thrombus detection, which is far more clinically relevant than segmentation alone.
In this paper, we present a novel method for thrombus detection and segmentation that is based on an improved Mask R-CNN. This method could accurately detect thrombi in CTA image slices and segment thrombi of irregular shapes. Furthermore, the proposed method improved detection performance by introducing a bounding box regression loss function that combined smooth L1 loss [21] and complete intersection over union (CIoU) loss [22]. A weighted binary focal loss was used as the mask loss function, which improved segmentation performance by decreasing mis-segmentation. As a result, the proposed method could help radiologists in terms of overall diagnosis by performing high-accuracy thrombus detection and segmentation while also reducing analysis time.
The rest of this paper is organized as follows. In Section 2, we review the other methods that are related to thrombus detection and segmentation. In Section 3, we describe the proposed method in detail. Next, in Section 4, we describe our dataset, environment settings, evaluation method, and the evaluation results. Finally, our conclusions and directions for future work are explained in Section 5.

Related Works
In order to monitor the growth rates of thrombi, they need to be assessed using various imaging methods and tests, both pre-operatively and post-operatively. A chest X-ray is one of the basic tests for diagnosing asymptomatic aortic aneurysms and is used for initial diagnosis. However, X-rays are not suitable for accurate thrombus detection or follow-up. Magnetic resonance imaging (MRI) has the advantage of being able to determine the condition of the aorta without using a contrast agent, but the examination takes a long time and is difficult to perform when a patient's condition is deteriorating rapidly [23][24][25][26][27]. On the other hand, CTA enables rapid examination, has a high resolution, and can obtain three-dimensional images of thrombi and other important adjacent vascular structures, all of which have led to breakthrough progress in AAA diagnosis.
Thrombus detection and segmentation cause significant challenges due to the size and shape variability of thrombi and in the differentiation of thrombi from surrounding tissues. Several semi-automatic and fully automatic methods have been proposed to solve these problems. As a semi-automatic method, Bruijne et al. [28] presented an active shape model formulation in which landmarks are defined by comparing nearby slices instead of training data. The first slice is manually segmented and then the slice outline of the entire aneurysm is automatically detected. Macia et al. [29] used radial functions that were constrained by a priori knowledge and spatial coherency as a new model-based approach to the semiautomatic segmentation of both the lumen and thrombus of an AAA. User interaction is minimized by defining the two seed points that are contained within the lumen and defining the range of slices of interest. Joldes et al. [30] used a finite element analysis that was based on user-entered parameters, such as the thickness of the AAA wall, the inclusion of the thrombus, and geometry meshing. They created a software system called BioPARR, in which the entire analysis is automatically performed except for the semi-automatic segmentation of the AAA. Lalys et al. [14] proposed a method to first detect the centerlines in order to obtain the initial lumen segmentation, which requires minimal user interaction. The thrombus and lumen are then separated from the surrounding structures using gradient information during the pre-processing step. The final segmentation is performed using a deformable model.
Regarding fully automatic methods, Zheng et al. [31] trained the UNet using only a small dataset and obtained successful segmentation results. They found that overfitting could be avoided through data augmentation with gray value variation and translation. Hong et al. [32] proposed a fully automatic method for the detection and segmentation of aneurysm using a deep belief network-based approach. Wang et al. [33] proposed a novel network that fuses the high-level part of the CT and MRI image networks together based on the UNet architecture. They demonstrated that their fusion model increases the ability to learn the shared representations of multi-modality images. Lopez-Linares et al. [34] presented a two-dimensional automatic method that uses different networks for detection and segmentation. DetectNet is only used for the detection of AAA thrombi, while segmentation is performed using a modified holistically nested edge detection (HED) network. Even more relevant to this paper, Lu et al. [35] presented a three-dimensional algorithm for AAA segmentation for the first time. The detection and segmentation are performed by applying ellipse fitting that is based on a variant of the 3D UNet architecture.
Furthermore, various other methods have also recently achieved state-of-the-art medical image segmentation. The appropriate loss function for each method is combined with the ensemble and transformer methods to demonstrate good overall performances. Nanni et al. [36] proposed encoder-decoder ensemble classifiers that can be used for semantic segmentation and introduced a novel loss function that results from the combination of Dice loss and a structural similarity index (SSIM). Dong et al. [37] presented a pyramid vision transformer backbone as an encoder for the extraction of robust features that has three tight components: a cascaded fusion module (CFM), camouflage identification module (CIM), and similarity aggregation module (SAM). The sum of the IoU and weighted binary cross-entropy loss is used as the loss function.
In this paper, we present a novel loss function for the Mask R-CNN framework. The proposed loss function improves the performance of thrombus detection and segmentation by considering various factors, such as the aspect ratio and overlap area of the predicted bounding box, when used in a two-stage region-based convolutional neural network. Moreover, we trained the architecture using 2D images that were obtained by slicing a limited 3D CTA dataset.

Structure of Mask R-CNN
Our ultimate goal was to precisely detect and segment only the thrombus within CTA volumes that covered a wide anatomical range, from the heart to the pelvis. Our method was based on Mask R-CNN [38], which is one of the most popular frameworks for detection and segmentation, and took into consideration certain difficulties that can occur in CTA images, such as irregular shapes, stent graft occlusions, and the differentiation of thrombi from adjacent tissues.
The Mask R-CNN consisted of several sub-modules, including the backbone network, feature pyramid network (FPN) [39], region proposal network (RPN), classifier network, and mask generation network, as shown in Figure 2. The backbone used ResNet50 [40] to extract the abstract features of CTA images of thrombi through convolution operations. At this stage, feature maps of various sizes were obtained, and the FPN then produced five feature maps by gradually merging the initial feature maps. During the RPN stage, anchors with three different aspect ratios were used to create regions of interest (ROIs), which were the areas where the foreground was likely to be. Each ROI that was generated by the RPN went through the RoIAlign layer, which preserved spatial locations by minimizing the misalignment between the extracted features and the fixed grid sizes, to obtain a refined feature map that was then forwarded to each of two sub-modules. One sub-module performed classification, which distinguished between the background and the class of each ROI, and regression, which predicted the bounding box with reference to the ground truth (GT). The other sub-module was a fully convolutional network (FCN) [41] and was used to predict the segmentation masks for each class. Formally, Mask R-CNN was trained with a multi-task loss function that consisted of the sum of the classification loss $L_{cls}$, bounding box regression loss $L_{reg}$, and segmentation mask loss $L_{mask}$, which was defined as:
$$ L = L_{cls} + L_{reg} + L_{mask} \tag{1} $$
The cross-entropy function was used to compute the classification loss and the smooth L1 loss was used to calculate the bounding box regression loss, which was the same as the loss function in Faster R-CNN [42]. $L_{mask}$ was the average binary cross-entropy, which was defined as:
$$ L_{mask} = -\frac{1}{m^2} \sum_{1 \le i, j \le m} \left[ y_{ij} \log \hat{y}^{k}_{ij} + (1 - y_{ij}) \log\left(1 - \hat{y}^{k}_{ij}\right) \right] \tag{2} $$
where $y_{ij}$ is the ground truth value at pixel $(i, j)$ of the $m \times m$ mask and $\hat{y}^{k}_{ij}$ is the predicted value at the same pixel for the k-th class. The mask predictions for different classes did not compete with each other, so the mask for the k-th class was not affected by the results for the other classes.
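The multi-task loss and the average binary cross-entropy mask loss described above can be illustrated with a minimal pure-Python sketch. The function names, the nested-list mask representation, and the `eps` clamp are assumptions made for this illustration; they are not taken from the authors' implementation.

```python
import math

def mask_bce(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy over an m x m mask.

    y_true: m x m nested list of 0/1 ground truth values.
    y_pred: m x m nested list of predicted probabilities for one class.
    """
    total, count = 0.0, 0
    for row_true, row_pred in zip(y_true, y_pred):
        for y, p in zip(row_true, row_pred):
            p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count

def multi_task_loss(l_cls, l_reg, l_mask):
    """Unweighted sum of the three Mask R-CNN loss terms."""
    return l_cls + l_reg + l_mask

truth = [[1, 0], [0, 1]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(mask_bce(truth, uninformative))  # ln 2 for an all-0.5 prediction
```

Because the loss is computed for a single class channel with a per-pixel sigmoid-style probability, the masks of different classes do not compete, which matches the behavior described in the text.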

Improvement of Loss Function
Specific learning objectives need to be considered after the network topology has been determined. As the objective function of the optimization problem, machine learning generally uses a function that is known as a loss function. This is important in deep learning because it determines how the network parameters are updated for the intended purpose. Setting a reasonable loss function is critical, and tailoring the loss function to each specific purpose can improve the overall performance of the network. As shown in Equation (1), the Mask R-CNN in our study consisted of three loss functions. Among the three loss functions, $L_{reg}$ and $L_{mask}$ were redesigned to take into consideration the heterogeneity of the thrombus morphology and data characteristics. The bounding box regression loss was modified to accurately identify a thrombus from the surrounding tissue. The segmentation loss was changed to reduce the risk of the mis-segmentation of the thrombus. The classification loss was unchanged because the dataset only contained two classes: thrombus and background.
The bounding box regression loss function of the basic Mask R-CNN model is the smooth L1 loss. Compared to the widely used L2 loss, the smooth L1 loss is less sensitive to outliers and is often used as the bounding box loss for two-stage methods, such as Faster R-CNN and Mask R-CNN. The smooth L1 loss is defined as follows:
$$ \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{3} $$
However, the smooth L1 loss treats the four bounding box coordinates as independent variables, so the correlation between them is lost. Because it is not normalized, it also takes high values when the coordinate values are large. IoU-based loss functions were designed to solve some of these problems. Through normalization, a value between 0 and 1 is obtained regardless of the magnitude of the coordinates. However, the IoU loss provides no training signal for non-overlapping bounding boxes because the IoU is 0 whenever there is no overlap. To overcome these shortcomings, generalized IoU (GIoU) [43], distance IoU (DIoU) [22], and complete IoU (CIoU) [22] have been proposed, each of which adds a penalty term to the IoU loss function.
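The contrast between the two base losses can be made concrete with a small sketch. It assumes boxes given as `(x1, y1, x2, y2)` tuples and applies smooth L1 to raw coordinate differences purely for illustration (detection frameworks typically apply it to parameterized offsets); all names and the example boxes are hypothetical.

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

def box_smooth_l1(pred, gt):
    """Sum of smooth L1 over four coordinates, each treated independently."""
    return sum(smooth_l1(p - g) for p, g in zip(pred, gt))

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

gt = (0, 0, 10, 10)
near_miss = (20, 0, 30, 10)
far_miss = (200, 0, 210, 10)
# Smooth L1 grows with the raw coordinate error (it is not normalized)...
print(box_smooth_l1(near_miss, gt), box_smooth_l1(far_miss, gt))
# ...while the IoU loss is 1 for both, so it cannot rank disjoint boxes.
print(1 - iou(near_miss, gt), 1 - iou(far_miss, gt))
```

The two printed pairs show the complementary weaknesses discussed above: the unnormalized loss scales with coordinate magnitude, and the plain IoU loss is flat for any pair of non-overlapping boxes.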
GIoU adds a penalty term that uses the smallest bounding box $E$ that covers both the predicted bounding box $B$ and the ground truth bounding box $B^{gt}$, as in Equation (4):
$$ L_{GIoU} = 1 - IoU + \frac{|E \setminus (B \cup B^{gt})|}{|E|} \tag{4} $$
The GIoU value itself, $IoU - |E \setminus (B \cup B^{gt})| / |E|$, ranges over [−1, 1], so it is not normalized in the same way as IoU. Even when the predicted bounding box does not overlap with the ground truth bounding box, it moves toward the ground truth bounding box in order to reduce the penalty term. However, when the predicted bounding box becomes large enough to include the ground truth bounding box, the penalty term vanishes and the loss operates in the same way as the IoU loss. GIoU is also more sensitive to small offsets for small objects, which slows convergence and reduces performance [44]. DIoU compares the center coordinates of the bounding boxes and adds a distance-based penalty term:
$$ L_{DIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} \tag{5} $$
In Equation (5), $\rho(\cdot)$ is the Euclidean distance, $b$ and $b^{gt}$ are the center points of the predicted and ground truth bounding boxes, and $c$ is the diagonal length of the smallest bounding box that contains both boxes. Compared to GIoU, DIoU converges faster because the distance-based penalty acts in both the horizontal and vertical directions.
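As a rough sketch of the two penalty terms discussed above, the following functions compute the GIoU and DIoU losses in their commonly published forms for axis-aligned boxes `(x1, y1, x2, y2)` with positive area. The helper names and example boxes are hypothetical and chosen only for illustration.

```python
def _overlap(b, g):
    """Return (IoU, union area) for boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(b[2], g[2]) - max(b[0], g[0]))
    ih = max(0.0, min(b[3], g[3]) - max(b[1], g[1]))
    inter = iw * ih
    union = (b[2] - b[0]) * (b[3] - b[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union, union

def _enclosing(b, g):
    """Smallest axis-aligned box that covers both input boxes."""
    return (min(b[0], g[0]), min(b[1], g[1]), max(b[2], g[2]), max(b[3], g[3]))

def giou_loss(b, g):
    """1 - IoU plus the fraction of the enclosing box not covered by the union."""
    iou, union = _overlap(b, g)
    e = _enclosing(b, g)
    e_area = (e[2] - e[0]) * (e[3] - e[1])
    return 1.0 - iou + (e_area - union) / e_area

def diou_loss(b, g):
    """1 - IoU plus squared center distance over squared enclosing diagonal."""
    iou, _ = _overlap(b, g)
    e = _enclosing(b, g)
    c2 = (e[2] - e[0]) ** 2 + (e[3] - e[1]) ** 2
    rho2 = ((b[0] + b[2] - g[0] - g[2]) ** 2 + (b[1] + b[3] - g[1] - g[3]) ** 2) / 4.0
    return 1.0 - iou + rho2 / c2

gt = (0, 0, 1, 1)
for pred in [(2, 0, 3, 1), (5, 0, 6, 1)]:
    # Both losses keep increasing as a disjoint box moves away from the GT.
    print(pred, giou_loss(pred, gt), diou_loss(pred, gt))
```

Unlike the plain IoU loss, both values change with the position of a non-overlapping box, which is what allows the regression to keep moving toward the ground truth.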
To achieve successful bounding box regression, the overlap area, central point distance, and aspect ratio must all be taken into account. The CIoU loss adds a penalty term that is based on the aspect ratio to the DIoU loss, which reduces missing or false detections and improves the accuracy of bounding box detection. CIoU is defined as follows:
$$ L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{6} $$
where $v$ measures the consistency of the aspect ratios and $\alpha$ regulates the balance between the non-overlapping cases and overlapping cases:
$$ v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2 \tag{7} $$
$$ \alpha = \frac{v}{(1 - IoU) + v} \tag{8} $$
However, CIoU loss still has problems with convergence speed and its performance is not ideal. Therefore, we combined the two types of loss function so that each compensates for the weaknesses of the other, which improved thrombus detection performance during network training. Our new loss function, named SCIoU, is a weighted combination of the smooth L1 loss and the CIoU loss (Equation (9)), where δ is a parameter that adjusts the weight of each loss function. The smooth L1 loss offsets the slow convergence of CIoU and concentrates on the regression of the individual bounding box coordinates, whereas the CIoU loss focuses on fine-tuning the predicted bounding box by considering the overlap area, center point distance, and aspect ratio. In addition, we replaced the cross-entropy loss that is used as the mask loss with a weighted binary focal loss to improve thrombus segmentation performance. It should be noted that the mask loss was only calculated using positive sample ROIs.
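The following sketch shows one way the combined loss could be computed. The CIoU part follows the commonly published definition; the combination `delta * smooth_L1 + (1 - delta) * CIoU` is an assumption inferred from the description of δ as a weight between the two losses, not a confirmed reproduction of the authors' formula. Applying smooth L1 to raw coordinates, the function names, and the default `delta=0.2` (the best-performing value reported in the experiments) are likewise illustrative choices.

```python
import math

def smooth_l1(x):
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

def ciou_loss(b, g):
    """CIoU loss for boxes (x1, y1, x2, y2) with positive width and height."""
    w, h = b[2] - b[0], b[3] - b[1]
    wg, hg = g[2] - g[0], g[3] - g[1]
    iw = max(0.0, min(b[2], g[2]) - max(b[0], g[0]))
    ih = max(0.0, min(b[3], g[3]) - max(b[1], g[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter)
    # Squared center distance over squared diagonal of the enclosing box.
    rho2 = ((b[0] + b[2] - g[0] - g[2]) ** 2 + (b[1] + b[3] - g[1] - g[3]) ** 2) / 4.0
    c2 = (max(b[2], g[2]) - min(b[0], g[0])) ** 2 + (max(b[3], g[3]) - min(b[1], g[1])) ** 2
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v

def sciou_loss(b, g, delta=0.2):
    """ASSUMED form: delta * smooth L1 + (1 - delta) * CIoU."""
    l1 = sum(smooth_l1(p - q) for p, q in zip(b, g))
    return delta * l1 + (1.0 - delta) * ciou_loss(b, g)

pred, gt = (-1, -1, 1, 1), (-2, -0.5, 2, 0.5)  # same center, different shape
print(ciou_loss(pred, gt), sciou_loss(pred, gt))
```

In this example the two boxes share a center, so the distance penalty is zero and the CIoU value exceeds `1 - IoU` only because of the aspect-ratio term.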
Focal loss [45] assigns higher weights to difficult and easily misclassified examples and lower weights to easy examples. This idea is expressed mathematically in Equation (10), where $p_t$ denotes the predicted probability of the ground truth class and $\alpha_t$ and $\gamma$ are the hyper-parameters of the loss function:
$$ FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{10} $$
Our modified focal loss, which we called weighted binary focal loss, was applied to binary classification. Instead of the $\alpha_t$ hyper-parameter of the traditional focal loss, the binary ground truth mask was used as a multiplier so that the positive and negative samples were each handled by a separate term. Finally, the positive and negative focal losses were added using weighting parameters, which prevented thrombi with irregular shapes that were obscured by the metal stent graft from being mis-segmented and improved segmentation performance. Our mask loss function (Equation (11)) combined a positive term $L_{pos}$ (Equation (12)), which applied the focal weighting to the pixels of the positive ground truth mask, and a negative term:
$$ L_{neg}(p_t) = -\xi_t (p_t)^{\gamma} \log(1 - p_t) \tag{13} $$
where λ denotes the parameter that controls the function by focusing more on the positive sample, ζ is the positive ground truth mask, and ξ is the negative ground truth mask.
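A minimal sketch of a weighted binary focal loss along these lines is given below. The negative term follows the expression for the negative focal loss quoted in the text, and the positive term mirrors it for the positive mask. The way the two terms are mixed, `lam * L_pos + (1 - lam) * L_neg`, and the default values of `lam` and `gamma` are assumptions made for illustration, since the exact combination is not recoverable from the text.

```python
import math

def weighted_binary_focal_loss(y_true, y_pred, lam=0.75, gamma=2.0, eps=1e-7):
    """Mean weighted binary focal loss over flat lists of mask pixels.

    ASSUMPTION: the positive and negative terms are mixed as
    lam * L_pos + (1 - lam) * L_neg; lam and gamma defaults are illustrative.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        zeta, xi = y, 1 - y              # positive / negative ground truth masks
        l_pos = -zeta * (1.0 - p) ** gamma * math.log(p)
        l_neg = -xi * p ** gamma * math.log(1.0 - p)
        total += lam * l_pos + (1.0 - lam) * l_neg
    return total / len(y_true)

# A missed thrombus pixel (y=1, p=0.1) costs far more than a confident hit.
print(weighted_binary_focal_loss([1], [0.1]), weighted_binary_focal_loss([1], [0.9]))
```

Under this assumed form, raising `lam` increases the penalty for missed thrombus pixels relative to false alarms, which is consistent with the stated aim of focusing more on the positive sample.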

Dataset
In Table 1, we describe the overall characteristics of the dataset. Our dataset consisted of AAA thrombus CTA scan images from 60 unique patients, which were acquired from 2012 to 2020 at Gachon University Gil Medical Center in the Republic of Korea. To the best of our knowledge, it contained the largest number of patients in a post-operation AAA thrombus dataset. Therefore, our dataset was well suited to establishing the robustness of the proposed method to thrombus variability. The CTA volume images were obtained using five different scanners, all of which were manufactured by Siemens: SOMATOM Definition Edge, SOMATOM Definition Flash, SOMATOM Force, SOMATOM Emotion Duo, and Sensation 16. The CTA volume images were converted from the Digital Imaging and Communications in Medicine (DICOM) file format into two-dimensional PNG images of 512 × 512 pixels and covered the body from the heart to below the pelvis. There were 49 to 206 slices for each patient and the axial slice interval ranged from 3 to 5 mm.
The study group consisted of 46 male patients and 14 female patients. The male to female ratio was approximately 3.29:1, which was similar to the statistics of the study in the United States (4:1) [46]. In general, AAAs occur more frequently in the elderly population (over 65 years) [47]; in our dataset, male patients averaged 72 years of age and female patients averaged 78 years of age. Across all patients, the proportion of CTA slice images that had a ground truth thrombus label was 20%, with a standard deviation of 7%. Ground truth thrombus voxels were manually labeled in the axial view by trained radiologists.

3D Quantitative Metrics for Evaluation
The experimental results were evaluated in two aspects: detection performance and segmentation performance. To evaluate and comprehensively verify the thrombus detection performance, we used precision, recall, and F1 score. Precision represented the proportion of predicted positive cases that were truly ground truth positive, and recall represented the proportion of ground truth positive cases that were correctly predicted as positive. A detection was counted as a true positive (TP) when its IoU with the ground truth was 0.5 or higher. The F1 score is the harmonic mean of precision and recall; the greater the F1 score, the better the detection performance of the model. The precision, recall, and F1 score were calculated as follows, where TP, FP, and FN are the numbers of true positive, false positive, and false negative detections:
$$ Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} $$
For segmentation evaluation, region-based and distance-based measures are often used. We computed the total overlap (TO), Dice coefficient, Jaccard index, false negative rate (FN), and false positive rate (FP), as proposed in [48]. At the same voxel resolution, the segmented volume (source, S) that was obtained through our modified Mask R-CNN for each patient was restored to three dimensions and compared to the ground truth (target, T). The total overlap was calculated by dividing the intersection between the thrombus regions $r$ in S and T by the ground truth region in T:
$$ TO = \frac{\sum_r |S_r \cap T_r|}{\sum_r |T_r|} $$
The Dice coefficient was used to calculate the amount of spatial overlap between the two thrombus regions:
$$ Dice = \frac{2 \sum_r |S_r \cap T_r|}{\sum_r \left( |S_r| + |T_r| \right)} $$
The Jaccard index represented the overlap between the source thrombus region and the target thrombus region divided by their union:
$$ Jaccard = \frac{\sum_r |S_r \cap T_r|}{\sum_r |S_r \cup T_r|} $$
A false negative (FN) is a type II error, which meant that ground truth voxels were not segmented:
$$ FN = \frac{\sum_r |T_r \setminus S_r|}{\sum_r |T_r|} $$
A false positive (FP) is a type I error, which meant that voxels were predicted as thrombus even though they were not ground truth voxels:
$$ FP = \frac{\sum_r |S_r \setminus T_r|}{\sum_r |S_r|} $$
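The detection and overlap measures described above can be sketched in a few lines of Python. Representing a segmentation as a set of voxel coordinates, the single-region simplification, and the function names are assumptions made for this illustration; the toy voxel sets are invented.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def detection_scores(tp, fp, fn):
    """Precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)

def overlap_metrics(source, target):
    """Region overlap measures for one thrombus region.

    source: iterable of predicted voxel coordinates (S).
    target: iterable of ground truth voxel coordinates (T).
    """
    s, t = set(source), set(target)
    inter = len(s & t)
    return {
        "total_overlap": inter / len(t),
        "dice": 2 * inter / (len(s) + len(t)),
        "jaccard": inter / len(s | t),
        "false_negative": len(t - s) / len(t),
        "false_positive": len(s - t) / len(s),
    }

# The reported precision and recall reproduce the reported F1 score.
print(round(f1_score(0.8847, 0.9576), 4))  # 0.9197

S = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
T = {(0, 0, 1), (0, 1, 0), (1, 1, 1), (1, 0, 0)}
print(overlap_metrics(S, T))
```

With these definitions the total overlap and the false negative rate of a single volume always sum to 1, which agrees with the reported values of 0.8971 and 0.1029.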

Experiments
All experiments used a 4-fold cross-validation approach to decrease the possibility of biased testing and to make the results more robust. We split our dataset into four folds, then trained the model with three of the folds and tested the model with the remaining fold. We conducted four rounds, each of which used a different test fold, and calculated the average value of the performance metrics across the four rounds. We divided the dataset at the patient level, i.e., no patient appeared in more than one fold.
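A patient-level split of this kind can be sketched as follows. The round-robin assignment of patients to folds and the patient identifiers are hypothetical; the text does not state how patients were assigned to folds, only that no patient was shared between them.

```python
def patient_level_folds(patient_ids, k=4):
    """Yield (train_patients, test_patients) pairs with no patient shared."""
    unique_ids = sorted(set(patient_ids))
    folds = [unique_ids[i::k] for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test

patients = [f"patient_{n:02d}" for n in range(60)]  # hypothetical IDs
splits = list(patient_level_folds(patients, k=4))
for train, test in splits:
    assert not set(train) & set(test)  # slices of one patient never straddle folds
print([len(test) for _, test in splits])  # [15, 15, 15, 15]
```

Splitting by patient rather than by slice matters here because adjacent slices of the same patient are highly correlated, so a slice-level split would leak information from the training set into the test set.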
We implemented the proposed method using Python and the PyTorch library. The experiments were run on an Intel Xeon E5-2620 v4 CPU with a clock speed of 2.10 GHz and an NVIDIA TITAN RTX graphics card (24 GB of memory). We employed the stochastic gradient descent (SGD) optimization method. We empirically set the initial learning rate to 5e-3 and the momentum to 0.9. The code is available on GitHub (Link: https://github.com/AAAimproved-mask-rcnn).

Thrombus Detection Results
Before changing the segmentation loss function, we first selected the bounding box loss function that gave the best regression performance. The hyper-parameter δ, which regulates the ratio between the smooth L1 loss and the CIoU loss, was set to values between 0.1 and 0.9 in order to find the optimal balance, and the performance was evaluated for each value. Figure 3 displays the precision, recall, and F1 score for the different δ values of the SCIoU loss function. When δ was 0.2, the precision was 0.8847, the recall was 0.9576, and the F1 score was 0.9197, which were the highest values for all of the evaluation indicators. Setting δ to 0.8 or 0.9, which placed more weight on the smooth L1 loss, resulted in comparatively low values. Our proposed SCIoU loss function had both the four-variable independent regression advantage of smooth L1 loss and the center point distance and aspect ratio advantages of CIoU loss. Table 2 shows the results when the bounding box regression loss of Mask R-CNN was replaced with other widely used loss functions. Our SCIoU loss function had the highest F1 score and precision compared to the other loss functions, and its recall of 0.9576 was only 0.0005 lower than that of the vanilla Mask R-CNN, which only uses the smooth L1 loss. Figure 4 also shows that each coordinate was appropriately regressed while maintaining the aspect ratio, in contrast to the IoU-based losses, which over-regressed the box relative to the GT. We compared the regression performance of the DeepAAA [35] and DetectNet [34] methods using our dataset. Table 3 shows that, in terms of recall, our proposed loss function outperformed the DeepAAA method, which uses the smooth negative Dice coefficient, and the DetectNet method, which uses the L1 loss. Our predicted bounding box regressed in a highly similar manner to the ground truth bounding box. Additionally, the evaluation showed that SCIoU had a higher precision (0.8847) than the other models. Finally, the F1 score, which represents the balance between precision and recall, also had a high value of 0.9197. These results indicate that the proposed loss function was effective for bounding box regression overall.

Thrombus Segmentation Results
Because thrombi have intensities similar to those of other tissues in medical images, thrombi can be detected intermittently even in images that do not contain thrombi. As a result, from a medical diagnosis perspective, the segmentation performance was evaluated after applying an extraction algorithm to the detected thrombi. DeepAAA evaluated the images that were obtained when the largest diameter was greater than 3 cm by applying ellipse fitting. The vanilla Mask R-CNN and our proposed combined loss Mask R-CNN (CL Mask R-CNN) both followed Lopez-Linares et al. [34] and used a continuous slice-based approach. A thrombus was confirmed to have been found when the number of slice images with a detected thrombus was above a certain threshold for each patient. Since the patients in our dataset had at least eight thrombus images, segmentation was only conducted when more than eight consecutive thrombi were detected. Table 4 shows the detailed thrombus segmentation results for the five evaluation indicators. Compared to the two deep learning-based networks and the vanilla Mask R-CNN, our proposed CL Mask R-CNN achieved the highest segmentation performance of 0.8971 for the total overlap, 0.7163 for the Jaccard index, and 0.8267 for the Dice coefficient. In medical image segmentation, reducing false negatives is recommended over reducing false positives [49]. This is because false negatives can have severe consequences for patient health, whereas false positives can be resolved later by trained radiologists. In this study, our method achieved the lowest false negative rate of 0.1029. Figure 5 visualizes the detected thrombi in 2D images and shows a significant reduction in false negatives. Furthermore, Figure 6, in which the 2D results were converted into 3D images for qualitative evaluation, shows that false negatives decreased and false positives increased.

Table 4. The results of thrombus segmentation when applying a detection algorithm for medical relevance. In our proposed CL Mask R-CNN, both the bounding box regression loss function and the segmentation mask loss function were changed.
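The consecutive-slice rule described in this section can be sketched as a simple run-length check over the per-slice detection results of one patient. The function names, the boolean-list input, and the exposure of the threshold as the `min_run` parameter are assumptions made for illustration; the strict "more than eight" comparison follows the wording of the text.

```python
def longest_consecutive_run(detected):
    """Length of the longest run of consecutive slices with a detection."""
    best = run = 0
    for has_thrombus in detected:
        run = run + 1 if has_thrombus else 0
        best = max(best, run)
    return best

def thrombus_confirmed(detected, min_run=8):
    """True when more than `min_run` consecutive slices contain a detection."""
    return longest_consecutive_run(detected) > min_run

# Isolated detections (likely false alarms) are rejected...
print(thrombus_confirmed([False, True, False, False, True, True, False]))
# ...while a long contiguous block of detections is accepted.
print(thrombus_confirmed([False] * 5 + [True] * 12 + [False] * 5))
```

Filtering on contiguous runs suppresses the sporadic detections that arise from tissues with thrombus-like intensities, because a real thrombus extends across many adjacent axial slices.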

Conclusions and Future Work
Previous research has concentrated on thrombus segmentation rather than thrombus detection, and Claridge et al. reported that only 65% of thrombi were detected by trained radiologists [17]. As such, detection is as crucial as segmentation and, because it is related to patient mortality, missed or incorrect detections should be avoided. Therefore, we proposed changes to the loss functions of the Mask R-CNN for thrombus detection and segmentation in 2D images that were obtained from the CTA dataset. The proposed method was able to detect thrombi correctly while also precisely segmenting thrombi in each CTA image slice. As can be seen in Tables 2 and 3, our method achieved a high detection performance. The segmentation results also outperformed the existing deep learning methods and the false negative rate was significantly reduced. Furthermore, our method has significant clinical value because it could reduce the possibility of thrombi going undetected and support radiologists in making clinical decisions for AAA patients.
The model that was adopted in this paper is Mask R-CNN, which is an instance segmentation framework. Recently, various new methods have been developed, such as Mask Scoring R-CNN, which can improve the performance of Mask R-CNN. In future work, we plan to use these models as backbone networks to achieve high accuracy for thrombus detection and segmentation.