An Enhanced Mask R-CNN Approach for Pulmonary Embolism Detection and Segmentation

Pulmonary embolism (PE) refers to the occlusion of pulmonary arteries by blood clots, posing a mortality risk of approximately 30%. The detection of pulmonary embolism within segmental arteries presents greater challenges compared with larger arteries and is frequently overlooked. In this study, we developed a computational method to automatically identify pulmonary embolism within segmental arteries using computed tomography (CT) images. The system architecture incorporates an enhanced Mask R-CNN deep neural network trained on PE-containing images. This network accurately localizes pulmonary embolisms in CT images and effectively delineates their boundaries. This study involved creating a local data set and evaluating the model predictions against pulmonary embolisms manually identified by expert radiologists. The sensitivity, specificity, accuracy, Dice coefficient, and Jaccard index values were obtained as 96.2%, 93.4%, 96%, 0.95, and 0.89, respectively. The enhanced Mask R-CNN model outperformed the traditional Mask R-CNN and U-Net models. This study underscores the influence of Mask R-CNN's loss function on model performance, providing a basis for the potential improvement of Mask R-CNN models for object detection and segmentation tasks in CT images.


Introduction
Pulmonary embolism (PE) is the obstruction of the pulmonary arteries caused by a blood clot [1]. PE has the third-highest prevalence among cardiovascular illnesses. The disease has a death rate of 30% [2][3][4]. A delay in diagnosing the condition increases the likelihood of impairment and mortality [5]. Early diagnosis is crucial to treating the disease effectively [6,7], with computed tomography pulmonary angiography (CTPA) being the preferred method for diagnosing PE due to its quick and detailed imaging capabilities [8,9]. Blood vessels appear bright in contrast-enhanced CT scans due to the contrast material, while an embolism appears dark because it does not absorb the contrast agent. Figure 1 displays pulmonary embolisms in a high-quality computed tomography scan.
The detection of PE in CTPA images is performed manually by experienced radiologists; it can therefore be time-consuming and sometimes difficult [10]. Some studies have shown a 13% discrepancy between overnight and daytime assessments for the detection of PE [11][12][13]. In addition, in some emergency situations, the rapid and accurate assessment of PE is of great importance [14]. Semi-quantitative methods can be used to measure the degree of vascular occlusion and determine the severity of PE. The most common methods for this purpose are the Mastora score and the Qanadli score, also known as the Vascular Obstruction Index (VOI), both of which are measured by an expert [15,16]. However, there is inconsistency among experts in the application of these methods [17]. Therefore, researchers have turned to computer-assisted systems to detect PE automatically. With the development of technological infrastructure, the PE detection performance of various algorithms has steadily improved. Early studies on pulmonary embolism detection included limited clinical applications and showed poor performance. In these studies, clinical findings were used instead of CT images as reference materials. Initially, feature extraction was performed using simple artificial neural networks [18][19][20]; however, new algorithms have since been developed for PE detection [21,22]. These studies have shown that computer-aided systems are successful in detecting PEs. It has also been shown that such systems can accurately detect small PEs, which may escape the eye of the specialist [22,23]. In one study, PE detection was performed automatically by classifying PE features using the k-Nearest Neighbors (kNN), artificial neural network (ANN), and Support Vector Machine (SVM) algorithms, achieving a sensitivity of 98% [24]. Machine learning algorithms have been widely used in recent years and have achieved high performance. There are many studies in which MR and CT images
have been analyzed with deep learning algorithms, where much higher performance has been obtained [25][26][27]. Through the use of convolutional neural networks, high PE detection performance has been achieved in CTPA images [28,29]. Pham et al. combined natural language processing and machine learning for the diagnosis of thromboembolic disease [30]. CT-based deep learning and automated PE studies face distinct challenges compared with their counterparts focusing on embolism in other locations. For example, PE data represent only a small fraction of images compared with the size of the baseline CT data. There are also signal-to-noise problems when the intravenous contrast injection protocol and the patient breath-hold instructions are not followed [31,32]. For this reason, the correct design of the neural network model and the data set used have a great effect on performance. In particular, it is of great importance that the ground truth in the data set is created correctly. In recent years, significant progress in pulmonary embolism (PE) detection has been achieved through the development of novel algorithms and models. Grenier et al. developed the Hybrid 3D/2D U-Net model, which integrates 3D and 2D deep learning techniques tailored for CT image analysis. This model offers advanced segmentation capabilities to accurately identify pulmonary embolism regions [33]. Wu et al. explored computer-aided PE detection using VoxelNet, a deep learning method designed for the analysis of 3D CT images through the processing of volumetric elements (voxels) in order to automatically detect pulmonary embolisms and other airway obstructions [34]. Furthermore, Khan et al. achieved 88% accuracy with a CNN model based on DenseNet201 when analyzing 9446 CT angiography scans from the RSNA-Kaggle database [35]. Additionally, Vainio et al.
employed transfer learning and Maximum Intensity Projection (MIP) on images from the RSPECT data set to detect chronic PE [36]. These studies highlight the effectiveness of deep learning techniques in improving the detection and diagnosis of pulmonary embolism.
In this study, an enhanced Mask R-CNN method is established for the detection and localization of pulmonary embolism in CT images, and the relationship between the loss function and the performance of the Mask R-CNN algorithm is determined. The results obtained show that the coefficients in the loss function significantly affect the performance of Mask R-CNN. For this study, a local data set containing masks for pulmonary embolism was also created. Details of the proposed method are given in the second part of this article. The results are presented in the third section. The last section includes our conclusions.

Materials
The data set was obtained from the Radiology Department of Kahramanmaraş Sutcu Imam University. The computed tomography images of 50 patients (27 female and 23 male) were used. The ages of the patients ranged between 28 and 95. The images were obtained with a Toshiba Aquilion ONE 320/640-slice instrument. A raw data set with a total of 430 images of 1212 × 1212 pixels in .tif format was created, which included only PE-containing sections. Furthermore, the sections in which a PE was displayed with the largest surface area were used. The CT scans were 8-bit (0-255) gray-level images. An average of 9 PE-containing sections were taken from each patient (min = 8; max = 12). An ethics committee report was obtained for the data set. All of the PEs in the images in this data set were labeled by one of the authors of this study (an expert radiologist with 15 years of experience) using the MATLAB ImageLabeler toolbox. In this study, only patients with pulmonary embolism (PE) and no other structures, such as lymph nodes, were included. Therefore, images were sourced from a limited pool of patients (50). Data augmentation techniques were applied to these images; commonly used data set enhancement methods include image rotation, contrast adjustment, vertical and horizontal flipping, and zooming. Specifically, horizontal and vertical rotations (±90°) and mirroring were employed, resulting in a fivefold increase in the number of images. As a result, 1630 images containing PEs were obtained. A total of 130 images (12 patients) were used for testing, and 1500 images (38 patients) were used for training.
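The rotation and mirroring augmentations described above can be sketched as follows. This is a minimal illustration using nested lists in place of real image arrays; it is not the actual augmentation pipeline used in the study:

```python
def rotate90(img):
    """Rotate an image (a list of pixel rows) 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def mirror(img):
    """Flip an image horizontally (left-right mirroring)."""
    return [row[::-1] for row in img]

def augment(img):
    """Return the original image plus four augmented variants
    (+90 and -90 degree rotations and their mirrors), matching the
    fivefold increase described in the text."""
    rot_cw = rotate90(img)                 # +90 degrees
    rot_ccw = rotate90(rotate90(rot_cw))   # -90 degrees (270 clockwise)
    return [img, rot_cw, rot_ccw, mirror(rot_cw), mirror(rot_ccw)]

sample = [[1, 2],
          [3, 4]]
print(len(augment(sample)))  # 5 images per source image
```

Contrast adjustment and zooming, also mentioned in the text, would be additional per-image transforms in the same style.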

Pre-Processing
The dimensions of the raw images in the data set were 1212 × 1212. PEs were located in the middle region of all images in the raw data set. We obtained a 448 × 448 sub-image from the middle region of each raw image to cover the sections with pulmonary embolisms. Figure 2 shows a raw image and a sample sub-image with a PE.
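The centre-cropping step can be sketched as a small helper. The function below is an illustrative assumption of how the 448 × 448 sub-image might be extracted from each 1212 × 1212 slice, not the authors' actual implementation:

```python
def center_crop(img, size):
    """Crop a size x size sub-image from the centre of img,
    where img is a list of pixel rows."""
    h, w = len(img), len(img[0])
    top = (h - size) // 2
    left = (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

# toy 6x6 "image" cropped to its central 4x4 region
img = [[r * 6 + c for c in range(6)] for r in range(6)]
crop = center_crop(img, 4)
print(len(crop), len(crop[0]))  # 4 4
```

For the study's images the call would be `center_crop(raw_slice, 448)` on each 1212 × 1212 slice.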


Mask R-CNN Architecture

The region-based convolutional neural network (R-CNN) [37], Faster R-CNN [38], and Mask R-CNN [39] approaches have been shown to possess high object detection performance. Unlike the other models, Mask R-CNN performs both detection and segmentation. This network is an extended version of Faster R-CNN, as it includes an extra segmentation branch (i.e., the segmentation mask). There are two phases in Mask R-CNN: in the first stage, feature extraction is performed for the regions; in the second stage, bounding-box detection, class detection, and segmentation are performed according to the extracted features. The Mask R-CNN architecture, which also includes the ResNet50 backbone network [40], is shown in Figure 3.

As can be seen in Figure 3, Mask R-CNN includes a Feature Pyramid Network (FPN) that allows for deep feature extraction. The FPN, designed according to the pyramid concept, is a network structure that excels in speed as well as accuracy. It has a multi-scale feature map and an upstream-downstream network structure. The upstream path is a convolutional neural network for feature extraction; as the number of upstream layers increases, the semantic value increases and high-level structures are detected. Regarding the backbone, as ResNet has a multi-layer structure, its training speed and estimation performance are quite high. In the basic structure of ResNet, there are skip connections between the front and back layers to facilitate back-propagation during deep network training. After the FPN in the proposed model architecture, there is a Region Proposal Network (RPN), a deep convolutional neural network that allows regions containing possible objects to be detected. The RPN takes data of any size as an input and proposes bounding boxes based on the object score. It makes these proposals by sliding a small network over the feature map generated by the convolutional layer. After the RPN, there is ROIAlign (i.e., an ROI alignment layer). This layer performs the same process as ROI pooling but avoids the unwanted quantization offsets that harm segmentation through the use of bilinear interpolation, thus achieving results much faster. This layer re-scales the region dimensions, which are then transmitted to the fully connected layer. Finally, the class and bounding box information for each region are obtained.
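The bilinear interpolation that distinguishes ROIAlign from ROI pooling can be illustrated with a single sampling point. The sketch below shows only the core interpolation at fractional coordinates; a full ROIAlign would average several such samples per output bin:

```python
def bilinear_sample(fmap, y, x):
    """Sample feature map fmap (list of rows) at fractional coordinates
    (y, x) via bilinear interpolation -- the operation that lets ROIAlign
    avoid the quantization offsets of ROI pooling."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

fmap = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear_sample(fmap, 0.5, 0.5))  # 1.5: average of the four neighbours
```

ROI pooling would instead round (0.5, 0.5) to an integer cell, introducing exactly the offset the text describes.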

Enhanced Mask R-CNN Architecture
In this study, the Mask R-CNN structure was modified, and the resulting enhanced Mask R-CNN was trained. In order to improve the pulmonary embolism detection performance, two weighting parameters, λ1 and λ2, were introduced into the loss function, which consists of the sum of the classification loss, positioning loss, and segmentation loss, as shown in Equation (1):

L = L_cls + λ1 · L_box + λ2 · L_mask (1)

where L_cls corresponds to the level at which the classes are detected incorrectly. In multi-object detection, this value should be increased if more than one object is not detected. L_box is an adjustable term for the correct determination of the object's boundaries, which can be increased if the bounding box is incorrectly positioned when the object is detected. On the other hand, λ1 and λ2 weight the positioning and segmentation losses of the object such that the network performance can be enhanced. L_mask indicates how accurately the object is segmented and is expressed as the average binary cross-entropy over the mask pixels:

L_mask = −(1/m²) Σ_{i,j} [y_ij log(ŷ_ij) + (1 − y_ij) log(1 − ŷ_ij)] (2)

where y_ij and ŷ_ij denote the label and predicted value at pixel (i, j) of the m × m mask, respectively.
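Equations (1) and (2) can be sketched in a few lines. The helper names and flattened-mask representation below are illustrative assumptions, not the study's code; the default λ values mirror the best-performing configuration reported in the results (λ1 = 0.9, λ2 = 0.8):

```python
import math

def mask_loss(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy over mask pixels (Equation (2)):
    y_true holds 0/1 labels, y_pred holds predicted probabilities,
    both flattened to 1-D sequences for simplicity."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def total_loss(l_cls, l_box, l_mask, lam1=0.9, lam2=0.8):
    """Weighted loss of Equation (1): L = L_cls + lambda1*L_box + lambda2*L_mask."""
    return l_cls + lam1 * l_box + lam2 * l_mask
```

Setting `lam1=1.0, lam2=1.0` recovers the standard (unweighted) Mask R-CNN loss used as the baseline.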

U-Net
The performance of the enhanced Mask R-CNN algorithm was compared with that of the U-Net algorithm. U-Net is a widely used deep learning model designed for image segmentation tasks, especially in medical imaging. It consists of an encoder, which extracts features from input images using convolutional layers, and a decoder, which reconstructs segmented images based on these features. U-Net is known for its ability to produce detailed segmentation results close to the original resolution of the image, making it effective for precise organ or lesion identification in medical images. The architecture of U-Net, named for its U-like shape, is shown in Figure 4.


There are two parts in this architecture. The first part is known as contraction (encoding) and the second part as expansion (decoding). The encoding part uses a traditional CNN architecture, in which the image size is gradually reduced using convolutional and maximum pooling layers, where the former consist of 3 × 3 filters and the latter consist of 2 × 2 filters. In the decoding part, which is completely symmetric with respect to the first part, the feature map is enlarged step-by-step through deconvolution towards the actual size of the image. Finally, each convolutional layer in the architecture is followed by an activation layer.
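The step-by-step size reduction along the contracting path can be traced with a small calculator. This sketch assumes unpadded 3 × 3 convolutions (each trimming 2 pixels), as in the original U-Net paper; whether the variant used here pads its convolutions is an assumption:

```python
def unet_encoder_sizes(size, depth=4):
    """Track spatial size through the contracting path: at each level,
    two unpadded 3x3 convolutions (each trimming 2 pixels) are followed
    by a 2x2 max-pooling that halves the size."""
    sizes = [size]
    for _ in range(depth):
        size = size - 2 - 2   # two 3x3 'valid' convolutions
        size = size // 2      # 2x2 max pooling
        sizes.append(size)
    return sizes

print(unet_encoder_sizes(572))  # [572, 284, 140, 68, 32] -- the classic U-Net sizes
```

The expansion path mirrors these steps in reverse, doubling the size at each deconvolution.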

Evaluation Metrics
Sensitivity, specificity, accuracy, and the Dice and Jaccard indices were used to assess the performance of Mask R-CNN in detecting pulmonary embolism. Sensitivity refers to the proportion of true PE pixels that are detected; low sensitivity values indicate that true lesions are not adequately detected, while high sensitivity values indicate that the system detects a high proportion of the regions recognized as lesions. As evaluating different criteria together gives more accurate results, the following were considered (see Figure 5): the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) areas of the pixel groups obtained through automatic and manual segmentation. The performance metrics are computed as shown in Equations (3)-(7) below:

Sensitivity = TP / (TP + FN) (3)
Specificity = TN / (TN + FP) (4)
Accuracy = (TP + TN) / (TP + TN + FP + FN) (5)
Dice = 2TP / (2TP + FP + FN) (6)
Jaccard = TP / (TP + FP + FN) (7)
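The five pixel-wise metrics can be computed directly from flattened binary masks; the function below is an illustrative sketch, not the evaluation code used in the study:

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise segmentation metrics from flattened binary masks:
    pred and truth are equal-length sequences of 0/1 values."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
    }

m = segmentation_metrics(pred=[1, 1, 0, 0], truth=[1, 0, 1, 0])
print(m["accuracy"])  # 0.5
```

Note that Dice and Jaccard ignore true negatives, which makes them stricter than accuracy when the lesion occupies only a small fraction of the image.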

Results and Discussion
Thorax CT images obtained from 50 patients with pulmonary embolism were used in this study, and images containing a total of 430 pulmonary embolisms were created by extracting the sections of these images containing pulmonary embolisms. This study was carried out in four stages: data augmentation, pre-processing, PE segmentation, and performance evaluation. The pre-processed images were used as inputs to Mask R-CNN. The images of 36 patients (1505) were used for training, and those of 14 patients (645) were used for testing. Feature extraction for PE detection was performed with the ResNet50 convolutional neural network pre-trained on the COCO data set. Both hold-out validation (with a training/testing ratio of 70%:30%) and 10-fold cross-validation were used for performance evaluation. Figure 6 shows PEs automatically segmented by the proposed system (shown in blue) and those manually segmented by the expert doctor (shown in red). Mask R-CNN performed both detection and segmentation; in the figure, the detection of PEs is shown with yellow bounding boxes.

Results
Thorax CT images obtained from 50 patients with pulmonary embolism were in this study, and images containing a total of 430 pulmonary embolisms were create extracting sections of these images containing pulmonary embolisms.This study wa ried out in four stages: data augmentation, pre-processing, PE segmentation, and pe mance evaluation.The pre-processed images were used as inputs to Mask R-CNN images of 36 patients (1505) were used for training, and those of 14 patients (645) used for testing.Feature extraction for PE detection was performed with the ResN convolutional neural network pre-trained on the COCO data set.Both hold-out valid (with a training/testing ratio of 70%:30%) and 10-fold cross-validation were used for formance evaluation.The manual and automatic segmentation results are shown in different colors in ure 7. The regions identified by the expert but not detected by the system are show red, while those detected as PEs by the system but not by the doctor are shown in g The pixels belonging to PEs detected in both ways are shown in yellow.It can be seen the proposed system could detect PEs with high performances.The manual and automatic segmentation results are shown in different colors in Figure 7.The regions identified by the expert but not detected by the system are shown in red, while those detected as PEs by the system but not by the doctor are shown in green.The pixels belonging to PEs detected in both ways are shown in yellow.It can be seen that the proposed system could detect PEs with high performances.The average sensitivity, specificity, and accuracy values obtained on the test data with the proposed method following hold-out validation are given in Figure 8.The average sensitivity was 96.2%, the specificity was 93.4%, and the accuracy was 96%.It can be seen that the proposed enhanced Mask R-CNN presented a high performance in terms of PE detection.The Dice and Jaccard similarity indices were also calculated in order to determine the 
similarity between automatic and manual PE detection.The minimum and maximum values were found to be 0.95 and 0.97 for the Dice index, respectively, and 0.88 and 0.91 for the Jaccard index, respectively.Figure 9 shows the Dice and Jaccard values obtained on the test images.The mean values obtained on the test data, as shown in Table 1, were 0.96 and 0.90 for the Dice and Jaccard similarity indices, respectively.The performance of the The average sensitivity, specificity, and accuracy values obtained on the test data with the proposed method following hold-out validation are given in Figure 8.The average sensitivity was 96.2%, the specificity was 93.4%, and the accuracy was 96%.It can be seen that the proposed enhanced Mask R-CNN presented a high performance in terms of PE detection.The average sensitivity, specificity, and accuracy values obtained on the test data with the proposed method following hold-out validation are given in Figure 8.The average sensitivity was 96.2%, the specificity was 93.4%, and the accuracy was 96%.It can be seen that the proposed enhanced Mask R-CNN presented a high performance in terms of PE detection.The Dice and Jaccard similarity indices were also calculated in order to determine the similarity between automatic and manual PE detection.The minimum and maximum values were found to be 0.95 and 0.97 for the Dice index, respectively, and 0.88 and 0.91 for the Jaccard index, respectively.Figure 9 shows the Dice and Jaccard values obtained on the test images.The mean values obtained on the test data, as shown in Table 1, were 0.96 and 0.90 for the Dice and Jaccard similarity indices, respectively.The performance of the The Dice and Jaccard similarity indices were also calculated in order to determine the similarity between automatic and manual PE detection.The minimum and maximum values were found to be 0.95 and 0.97 for the Dice index, respectively, and 0.88 and 0.91 for the Jaccard index, respectively.Figure 9 shows the Dice and Jaccard 
values obtained on the test images.The mean values obtained on the test data, as shown in Table 1, were 0.96 and 0.90 for the Dice and Jaccard similarity indices, respectively.The performance of the proposed method was tested using both hold-out validation and 10-fold cross-validation.Table 1 illustrates the superior performance of the enhanced Mask R-CNN, compared with Mask R-CNN, in terms of both Dice and Jaccard scores.The significant improvement observed, particularly in the 10-fold cross-validation results, suggests that the enhanced Mask R-CNN model possesses greater generalizability and reliability.
Diagnostics 2024, 14, x FOR PEER REVIEW

The proposed method was tested using both hold-out validation and 10-fold cross-validation. Table 1 illustrates the superior performance of the enhanced Mask R-CNN, compared with Mask R-CNN, in terms of both Dice and Jaccard scores. The significant improvement observed, particularly in the 10-fold cross-validation results, suggests that the enhanced Mask R-CNN model possesses greater generalizability and reliability. The PE detection performances of the enhanced Mask R-CNN, U-Net, and classical Mask R-CNN models were compared, as shown in Table 2. The results indicate that the enhanced Mask R-CNN provided higher sensitivity, specificity, and accuracy values than the classical Mask R-CNN and U-Net models. In particular, the higher sensitivity achieved by the enhanced Mask R-CNN indicates a greater success rate in detecting lesion pixels compared with the other two methods. The accuracy values obtained under different λ values are given in Table 3, where λ1 is the coefficient of the bounding-box loss (Lbox) and λ2 is the coefficient of the mask loss (Lmask). The results in the table indicate that varying the λ1 and λ2 values together impacted the model's accuracy differently. For instance, when λ1 was held constant at 1 and λ2 was reduced from 1 to 0.8, both the hold-out and 10-fold CV accuracies generally increased. This illustrates that balancing the loss-weighting coefficients can improve the model's performance. Specifically, the optimal performance observed in this study was obtained when λ1 = 0.9 and λ2 = 0.8; the case where λ1 = 1 and λ2 = 1 corresponds to the standard configuration of the classical Mask R-CNN.
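The weighted objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name is ours, and the inclusion of a classification term Lcls follows the standard Mask R-CNN formulation:

```python
def combined_loss(l_cls, l_box, l_mask, lam1=0.9, lam2=0.8):
    """Weighted Mask R-CNN objective: L = L_cls + lam1 * L_box + lam2 * L_mask.

    lam1 = lam2 = 1 recovers the classical Mask R-CNN loss; the best
    setting reported in Table 3 is lam1 = 0.9, lam2 = 0.8.
    """
    return l_cls + lam1 * l_box + lam2 * l_mask


# With unit sub-losses, the classical configuration sums to 3.0,
# while the tuned weights down-weight the box and mask terms.
classical = combined_loss(1.0, 1.0, 1.0, lam1=1.0, lam2=1.0)  # 3.0
tuned = combined_loss(1.0, 1.0, 1.0)  # approximately 2.7
```

Because the total loss is a plain weighted sum, reducing λ2 below 1 shifts gradient magnitude away from the mask head relative to classification, which is the balancing effect Table 3 explores.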

Conclusions
The effectiveness of a deep learning model during training is largely determined by the choice of loss function, an especially critical component in segmentation models such as Mask R-CNN, which must produce accurate pixel-level predictions. Integrating a carefully weighted loss function into the Mask R-CNN architecture significantly enhanced the model's performance compared with the traditional Mask R-CNN and U-Net models. This improvement is reflected in the results reported in this study, where the enhanced Mask R-CNN achieved high sensitivity (96.2%), specificity (93.4%), and accuracy (96%) for the detection of pulmonary embolism within segmental arteries in computed tomography (CT) images. These results underscore the pivotal role of the loss function in optimizing the model's capacity to precisely identify and segment pulmonary embolisms, and they suggest that such refinements of CNN-based models can significantly improve outcomes in object detection and segmentation tasks related to CT images. We hope that this study will inspire research into new loss functions that are particularly effective for specific tasks or applications.
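As a concrete illustration of how the reported metrics are defined (a hypothetical evaluation sketch, not the authors' code), the pixel-level sensitivity, specificity, accuracy, Dice coefficient, and Jaccard index can all be derived from the confusion-matrix counts of a predicted mask against a ground-truth mask:

```python
def segmentation_metrics(pred, truth):
    """Pixel-level metrics from two equal-length binary masks
    (0 = background, 1 = embolism), flattened to plain Python lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return {
        "sensitivity": tp / (tp + fn),        # lesion pixels correctly found
        "specificity": tn / (tn + fp),        # background correctly rejected
        "accuracy": (tp + tn) / len(pred),
        "dice": 2 * tp / (2 * tp + fp + fn),  # overlap measure in [0, 1]
        "jaccard": tp / (tp + fp + fn),       # intersection over union
    }


# Toy 4-pixel example with one TP, one FP, one FN, one TN.
m = segmentation_metrics([1, 1, 0, 0], [1, 0, 1, 0])  # dice = 0.5, jaccard = 1/3
```

Note that Dice always weights the intersection twice, so it is never smaller than the Jaccard index on the same masks, consistent with the 0.95 versus 0.89 values reported here.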

Figure 2. Obtaining the middle region of the image containing PE.

Figure 5. Representative overlap of automatic and manual segmentation when calculating the similarity index performances.

Figure 6 shows the PEs automatically segmented by the proposed system (shown in blue) and those manually segmented by the expert doctor. Mask R-CNN performed both detection and segmentation; in the figure, the detected PEs are shown with yellow bounding boxes.

Figure 6. The automatic and manual segmentation of PE and detection performance: the (A) original images and (B) enlarged images.

Figure 7. (A) Drawings of detected PEs; (B) colored drawings of manually and automatically segmented PEs.

Figure 8. Average PE detection performance values following hold-out validation.


Figure 9. Dice and Jaccard similarity index values for test images.

Table 1. Comparison of segmentation performances of Mask R-CNN and enhanced Mask R-CNN (hold-out validation and 10-fold CV).

Table 2. Comparison of the enhanced Mask R-CNN with U-Net and Mask R-CNN.