Skeleton Segmentation on Bone Scintigraphy for BSI Computation

Bone Scan Index (BSI) is an imaging biomarker for quantifying bone metastasis in cancer. To compute BSI, not only the hotspots (metastases) but also the bones must be segmented. Most related research focuses on binary classification of bone scintigraphy, i.e., whether metastasis is present or not; few studies address pixel-wise segmentation. This study compares three advanced convolutional neural network (CNN) based models for bone segmentation on an in-house dataset. The best model is Mask R-CNN, which reaches a precision, sensitivity, and F1-score of 0.93, 0.87, and 0.90 for prostate cancer patients and 0.92, 0.86, and 0.88 for breast cancer patients, respectively. The results are averages over 10-fold cross-validation, which supports the reliability of clinical use for bone segmentation.


Introduction
Bone is the most common site of metastatic cancer, especially in the advanced and later phases of cancer progression, with breast, prostate, and lung cancers showing the highest incidence rates [1]. Bone metastases can severely impact patients' daily activities and quality of life due to severe pain and associated major complications. The protracted clinical course of bone metastasis poses significant challenges to treatment. Per a 2022 report based on the Taiwan National Health Insurance Research Database [2], prostate cancer ranked sixth among the leading causes of cancer death in Taiwanese men, while breast cancer ranked second among the leading causes of cancer death in Taiwanese women. Diagnostic techniques for bone metastasis currently include bone scintigraphy (BS), X-ray imaging, computed tomography (CT), and magnetic resonance imaging (MRI); BS serves as the most cost-effective early screening method and can diagnose bone metastasis 3 to 6 months earlier than CT or X-ray [3].
Bone metastasis typically affects the central skeleton and the proximal regions of the upper and lower limbs. The central skeleton contains red bone marrow, whose physiological characteristics may contribute to the formation of bone metastases [4]. Physicians often perform a whole-body bone scan (WBBS) to diagnose the presence of bone metastasis. Tc-99m MDP is the radiopharmaceutical injected into a patient's vein; it enters bone cells and deposits with the mineral components within four hours. Consequently, Tc-99m MDP tends to accumulate in areas of active bone formation, resulting in localized increased radiopharmaceutical activity that appears as a "hot spot" on BS and allows physicians to identify bone metastasis [5]. However, BS may suffer from ambiguity owing to confounders such as bone injury, arthritis, and degenerative changes, which cause interpretation challenges. Inexperienced clinical physicians may struggle to make accurate judgments or even misinterpret images.
Bone scan index (BSI) is an imaging biomarker used to quantify the extent of bone metastasis in cancers [6]. BSI is calculated as the ratio of "the number of bone lesions indicating bone metastasis" to "the number of regions with a high incidence of bone metastasis" [7][8][9], as shown in Figure 1. With artificial intelligence, machine learning, and big data, BSI calculation has become more objective, accurate, and diagnostically efficient. BSI's most attractive application is monitoring treatment and prognosis, providing significant clinical value. Armstrong et al. from Duke University introduced the automated bone scan index (aBSI) as an objective imaging parameter [10], which can evaluate the prognosis of metastatic castration-resistant prostate cancer (mCRPC) patients undergoing systemic treatment in clinical trials. In [11,12], manual and automated BSI measurements were highly correlated (ρ = 0.80), and automated BSI scoring demonstrated reproducibility, eliminating the subjectivity of clinical judgment while retaining the same clinical significance as manual BSI scoring. Furthermore, some studies confirmed the utility of aBSI in mCRPC patients [13][14][15], while other studies have begun to explore its application and refinement in other tumors [16].
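As a simplified illustration of this ratio, the sketch below computes a BSI-like value from a binary hotspot mask and a binary skeleton mask. This is an assumption-laden toy: clinical aBSI implementations additionally weight anatomical regions by their skeletal mass fraction, which is omitted here.

```python
import numpy as np

def bone_scan_index(hotspot_mask: np.ndarray, skeleton_mask: np.ndarray) -> float:
    """Simplified BSI: fraction of skeletal pixels covered by metastatic hotspots.

    Both inputs are boolean arrays of identical shape, where True marks
    lesion pixels / skeletal pixels respectively. (Illustrative only;
    region weighting used by clinical aBSI is not modeled.)
    """
    # Only hotspots lying on the segmented skeleton count toward the numerator.
    lesion_on_bone = np.logical_and(hotspot_mask, skeleton_mask)
    skeletal_pixels = skeleton_mask.sum()
    if skeletal_pixels == 0:
        return 0.0
    return float(lesion_on_bone.sum() / skeletal_pixels)

# Toy example: 4 skeletal pixels, 1 of them a hotspot -> BSI = 0.25
skeleton = np.array([[1, 1], [1, 1]], dtype=bool)
hotspot = np.array([[1, 0], [0, 0]], dtype=bool)
print(bone_scan_index(hotspot, skeleton))  # 0.25
```

This makes explicit why both segmentations are needed: the skeleton mask is the denominator of the index, so errors in bone segmentation propagate directly into BSI.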
Generally, computer-assisted diagnosis (CAD) systems that utilize machine learning or neural network (NN) frameworks for calculating BSI on WBBS images can be divided into two parts: lesion segmentation and skeleton segmentation, which respectively reflect the numerator and denominator of the BSI value [17][18][19][20]. Recently, numerous studies [21,22] and related patents [23,24] on lesion segmentation using the NN framework have been conducted.
However, the performance of lesion pixel-wise segmentation has not been thoroughly and rigorously investigated. Similarly, research on skeleton segmentation using deep learning and NN models is scarce [20,25]: a skeleton segmentation approach is mentioned in [20] but lacks comparison with other NN models, and although [25] compared its performance with U-Net, it remained confined to traditional semantic segmentation architectures. Thus, skeleton segmentation using NNs remains insufficiently explored. This paper applies different NN models to skeleton segmentation on WBBS images and investigates their results.
Additionally, we have built a website platform for online skeleton segmentation of WBBS images [Appendix A], which provides effective skeleton segmentation data for further evaluation of BSI.

Materials
In this retrospective study in collaboration with the Department of Nuclear Medicine at China Medical University Hospital, 196 WBBS images of patients with prostate cancer were collected. Among the 196 patients, 110 patients had bone metastasis, and 86 patients had no evidence of bone metastasis. We also collected 163 WBBS images of patients with breast cancer. All of them had bone metastasis. The study was approved by the Institutional Review Board (IRB) and the Hospital Research Ethics Committee (CMUH106-REC2-130) of China Medical University.
The radiopharmaceutical used for WBBS was Tc-99m MDP, and imaging was performed 4 h after intravenous injection. A gamma camera (Millennium MG, Infinia Hawkeye 4, or Discovery NM/CT 670 system; GE Healthcare, Waukesha, WI, USA) was used for planar bone scanning, with a low-energy high-resolution or general-purpose collimator, a matrix size of 1024 × 256, photon energy centered on the 140 keV peak, and a symmetric 20% energy window. The collected bone scan images were in DICOM format with a spatial resolution of 1024 × 512 pixels (composed of anterior-posterior (AP) and posterior-anterior (PA) views), and the intensity of each pixel was stored as a 2-byte unsigned integer (uint16). The images were preprocessed on a dedicated GE Xeleris workstation (GE Medical Systems, Haifa, Israel; version 3.1) before being uploaded to PACS.
A standard WBBS image contains two views: anterior and posterior. The original DICOM images were first converted to PNG format after removing any identifiable information. Following the approach described in [22], pre-processing was performed by normalizing the image size and intensity. Afterwards, the anterior and posterior views were cropped into a single image with a size of 950 × 512, without any scaling or geometric transformations, as shown in Figure 2.
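The crop-and-normalize step can be sketched as follows. The vertical crop offset and the min-max intensity normalization are illustrative assumptions; the paper follows the pre-processing of its reference [22], whose exact parameters are not restated here.

```python
import numpy as np

def preprocess(raw: np.ndarray, out_h: int = 950, out_w: int = 512) -> np.ndarray:
    """Crop a 1024x512 uint16 bone scan (anterior | posterior side by side)
    to 950x512 and normalize intensities to [0, 1].

    The centered vertical crop and min-max normalization are assumptions
    made for illustration only.
    """
    assert raw.shape[0] >= out_h and raw.shape[1] == out_w
    top = (raw.shape[0] - out_h) // 2          # assumed: center crop vertically
    img = raw[top:top + out_h, :].astype(np.float32)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

# Simulated raw scan in the acquisition matrix described above
scan = np.random.randint(0, 65535, size=(1024, 512), dtype=np.uint16)
out = preprocess(scan)
print(out.shape)  # (950, 512)
```

Note that no scaling or geometric transformation is applied, matching the paper's statement that only cropping and intensity normalization are performed.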


Region Definition
To identify the skeletal regions where bone metastases occur most frequently, we consulted two experienced nuclear medicine physicians and established labeling standards; the labels require the approval of both board-certified nuclear medicine physicians. The regions are the skull, spine, chest (including ribs, scapula, and clavicle), humerus (from the proximal end to the midshaft), femurs (from the proximal end to the midshaft), and pelvis.
The positions of the humerus differ significantly across images, as shown in Figure 2. Unlike the femurs, which form a single category, we divide the humerus into four categories: the left and right humerus in the anterior and posterior views, respectively. The reason for doing so is addressed in the Discussion. Furthermore, Tc-99m MDP undergoes renal excretion, which can make the kidneys appear as high-signal areas; in some situations, a kidney may be misclassified as metastasis. To alleviate this problem, we created an extra kidney category to exclude this ambiguity.

Neural Network Architectures
Three neural network architectures were tested: Mask R-CNN [26], Double U-Net [27], and DeepLabv3+ [28]. We used similar hyperparameters across the three models to compare their performance.
The Mask R-CNN architecture shown in Figure 4 comprises four main parts: the backbone, the region proposal network (RPN), RoIAlign, and the head. We used ResNet-50 as the backbone, with a learning rate of 0.005, a batch size of 4, and 100 epochs.

The Double U-Net architecture shown in Figure 5 comprises two sub-networks with dilated convolution, spatial pyramid pooling, and an SE block. It was originally designed for binary classification; we modified it for multi-class classification by giving the output layer of Network 1 a SoftMax activation function. The hyperparameters were a learning rate of 0.0005, a batch size of 4, and 200 epochs (without data augmentation) or 20 epochs (with data augmentation).
The DeepLabv3+ architecture shown in Figure 6 includes an encoder, a decoder, dilated convolution, and depth-wise separable convolution. We used ResNet-50 as the encoder backbone, with a learning rate of 0.0005, a batch size of 4, and 200 epochs.
The learning rate is a hyperparameter used in gradient-based optimization: it determines the step size at which the model updates its parameters during training, and its choice depends on the specific problem. Typically, every model has its own suggested learning rate. In this study, we chose values that balance accuracy and training speed: 0.005 for Mask R-CNN and 0.0005 for Double U-Net and DeepLabv3+.

Image Pre-Processing
The input matrix size for Mask R-CNN was 950 × 512. For Double U-Net and DeepLabv3+, the input was zero-padded to 960 × 512 to satisfy their input-size restrictions. The labels were saved in PNG format as integers ranging from 0 to 10.
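The padding step is a one-liner; the sketch below pads at the bottom and right, which is an assumption (splitting the padding evenly between opposite edges would work just as well, provided the labels are padded identically).

```python
import numpy as np

def pad_to(img: np.ndarray, target_h: int = 960, target_w: int = 512) -> np.ndarray:
    """Zero-pad an image (e.g. 950x512) to the input size required by
    Double U-Net / DeepLabv3+. Padding side (bottom/right) is an assumption."""
    pad_h = target_h - img.shape[0]
    pad_w = target_w - img.shape[1]
    return np.pad(img, ((0, pad_h), (0, pad_w)), mode="constant")

x = np.ones((950, 512), dtype=np.float32)
print(pad_to(x).shape)  # (960, 512)
```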

Evaluations
In this study, the terms true positive (TP), false positive (FP), true negative (TN), and false negative (FN) were defined at the pixel scale. The evaluation metrics used in the experiments were precision (Equation (1)) and sensitivity (Equation (2)), and the overall model evaluation was based on the F1 score (Equation (3)):

Precision = TP/(TP + FP), (1)

Sensitivity = TP/(TP + FN), (2)

F1 score = 2(Precision × Sensitivity)/(Precision + Sensitivity). (3)
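These pixel-wise metrics can be computed directly from binary masks; a minimal NumPy sketch:

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray):
    """Pixel-wise precision, sensitivity, and F1 for binary masks."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return precision, sensitivity, f1

# Toy masks: 1 TP, 1 FP, 1 FN -> precision = sensitivity = F1 = 0.5
pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt   = np.array([[1, 0], [1, 0]], dtype=bool)
p, s, f = pixel_metrics(pred, gt)
print(round(p, 2), round(s, 2), round(f, 2))  # 0.5 0.5 0.5
```

For the multi-class setting used in this paper, the same computation would be repeated per category on the per-class binary masks and then averaged.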

10-Fold Cross-Validation
In this study, all three models were validated with 10-fold cross-validation on two datasets: 196 prostate cancer WBBS images and 163 breast cancer WBBS images. The training:validation:test ratio was 8:1:1. The main goals of this experiment were to compare the performance differences among the networks and to evaluate the impact of prostate and breast cancer WBBS images on network training. The hyperparameters used in the experiment are listed in Table 1; the results are reported in Tables 2 and 3 and compared in Table 4. Qualitative bone segmentation results are shown in Figures 7 and 8.
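A 10-fold split with an 8:1:1 ratio can be generated as follows. The fold-rotation scheme (fold i as test, the next fold as validation) is an assumption for illustration; the paper states only the ratio.

```python
import numpy as np

def ten_fold_splits(n: int, seed: int = 0):
    """Yield (train, val, test) index arrays for 10-fold cross-validation
    with an 8:1:1 assignment: fold i is the test set, fold (i+1) % 10 the
    validation set, and the remaining eight folds the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, 10)
    for i in range(10):
        j = (i + 1) % 10
        test, val = folds[i], folds[j]
        train = np.concatenate([folds[k] for k in range(10) if k not in (i, j)])
        yield train, val, test

# Example with the 196 prostate cancer images
splits = list(ten_fold_splits(196))
print(len(splits))  # 10
train, val, test = splits[0]
print(len(train), len(val), len(test))  # 156 20 20
```

Each image thus appears exactly once as a test sample across the ten folds, so the reported averages cover the entire dataset.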

10-Fold Cross-Validation with Data Augmentation
After the above experiments, we chose Double U-Net for further investigation because it slightly outperformed the others on F1-score. We then fine-tuned the number of epochs to trade off training time against performance and see what best performance we could reach. The training images for prostate cancer and breast cancer were augmented 63-fold using the rotation, scaling, and brightness adjustments described in the Methods. The hyperparameters are in Table 5, and the quantitative results of the 10-fold cross-validation are in Table 6.
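The 63-fold augmentation can be sketched with plain NumPy. The parameter grid below (4 rotations × 4 scales × 4 brightness gains, keeping the first 63 combinations) is an illustrative assumption; the paper states only that rotation, scaling, and brightness adjustment were applied 63 times.

```python
import numpy as np

def rotate_nn(img, deg):
    """Rotate about the image center with nearest-neighbor sampling."""
    th = np.deg2rad(deg)
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, sample the source pixel.
    sy = cy + (ys - cy) * np.cos(th) - (xs - cx) * np.sin(th)
    sx = cx + (ys - cy) * np.sin(th) + (xs - cx) * np.cos(th)
    sy = np.clip(np.rint(sy), 0, h - 1).astype(int)
    sx = np.clip(np.rint(sx), 0, w - 1).astype(int)
    return img[sy, sx]

def scale_nn(img, factor):
    """Zoom about the center with nearest-neighbor sampling, keeping shape."""
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.rint(cy + (ys - cy) / factor), 0, h - 1).astype(int)
    sx = np.clip(np.rint(cx + (xs - cx) / factor), 0, w - 1).astype(int)
    return img[sy, sx]

def augment(img):
    """Generate 63 augmented variants from an assumed 4x4x4 parameter grid."""
    out = []
    for deg in (-6, -2, 2, 6):
        for factor in (0.9, 0.95, 1.05, 1.1):
            for gain in (0.8, 0.9, 1.1, 1.2):
                out.append(np.clip(rotate_nn(scale_nn(img, factor), deg) * gain, 0, 1))
    return out[:63]

variants = augment(np.random.rand(95, 51).astype(np.float32))
print(len(variants))  # 63
```

The label masks would be augmented with the same geometric transforms (but without the brightness gain) so that image and ground truth stay aligned.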

Discussion
This study utilized Mask R-CNN, Double U-Net, and DeepLabv3+ for skeleton segmentation on prostate cancer and breast cancer WBBS images. The quantitative results were obtained via 10-fold cross-validation. Based on these findings, Mask R-CNN exhibited higher precision than Double U-Net by 2.03% in the prostate cancer dataset and 1.84% in the breast cancer dataset, and higher precision than DeepLabv3+ by roughly 3%. To better understand these results, we visualized the predictions, with white representing TP, green representing FP, and red representing FN (as shown in Figures 9 and 10). Mask R-CNN's predictions shifted slightly inward compared to the ground truth (GT), resulting in more FN in the edge regions and only a few FP. Double U-Net's predictions aligned well with the GT along the edges, resulting in slightly fewer FN but more FP. DeepLabv3+ exhibited irregularities along the edges compared to the other two models, leading to noticeable erroneous FP and an overall increase in FP. These findings shed light on the performance of the different models for skeleton segmentation, emphasizing the trade-off between FP and FN. Further improvements can be explored to address the observed limitations, particularly for DeepLabv3+, to enhance its stability and accuracy.
Further investigation of the Mask R-CNN results revealed an increase in false negatives (FN) when predicting smaller categories, such as the humerus and kidneys, as shown in Figure 11a. This could be caused by the following reasons. First, insufficient brightness in the WBBS image may hinder feature detection. The brightness of WBBS images depends on the counts collected by the scintillation crystal, which can be influenced by factors such as patient thickness and radiopharmaceutical activity. When the received counts are insufficient and image brightness is inadequate, deep neural network models may struggle to make accurate judgments or even make errors. Adjusting the image brightness and conducting further tests can help alleviate this situation, as shown in Figure 11b.
Second, abnormal patient positioning in the WBBS image can cause another issue. In a few instances, patient positioning deviates to some extent from standard clinical positions, which poses challenges for CNN prediction, as shown in Figure 12. The degree of deviation is closely related to the patient's clinical condition and is difficult to avoid entirely in clinical practice. While previous studies might manually exclude such misleading images, this study aimed to maintain a dataset that reflects real clinical scenarios, so we did not exclude any case. To enhance the network's ability to predict WBBS images with unusual positioning, future work could employ hard negative mining techniques to improve the model's generalization capability.
Third, the model's insensitivity to features of small objects in WBBS images could also decrease performance. The quantitative results indicated relatively low precision for categories corresponding to smaller objects, such as the upper limbs, femurs, and kidneys, suggesting that Mask R-CNN faces certain difficulties in segmenting smaller regions.
These findings highlight specific challenges encountered during skeleton segmentation, particularly image brightness, abnormal patient positioning, and the segmentation of smaller objects. Addressing these challenges could further improve the performance of the Mask R-CNN model.
On the other hand, we observed that DeepLabv3+ and Double U-Net tended to mix categories, resulting in unstable performance. They did not exhibit the missing-category issue observed in Mask R-CNN, but they experienced problems such as category confusion and masks appearing in unintended areas, with DeepLabv3+ being particularly affected. The issue of category confusion during prediction in semantic segmentation architectures was not explicitly mentioned in [20,25]; however, we did observe this problem in our experiments. Figure 13a shows an incorrect segmentation in the knee area in a Double U-Net result, while Figure 13b depicts category confusion in the upper limbs and head in a DeepLabv3+ result.
This problem stemmed from different network architectures. Mask R-CNN utilizes parallel branch networks to independently determine categories and select the appropriate masks based on individual region-of-interest (ROI). Consequently, different ROIs could be distinguished independently, and masks could be treated as separate entities. In contrast, traditional fully convolutional network (FCN) architectures performed category and mask predictions simultaneously, leading to competition between different categories and masks. Additionally, due to the design of having one category per mask, FCN-based methods could not treat ROIs independently. Another critical factor was using the Sigmoid activation function and average binary cross-entropy loss in the branch networks, which mitigated the adverse effects of cross-category competition encountered in traditional FCN methods. This design yielded excellent instance segmentation results and avoided category overlap or confusion. From the experiments, Mask R-CNN demonstrated itself more suitable for skeleton segmentation in WBBS images than the other two network architectures.
From experiments shown in Tables 2-4, one might think that the models' performance is close to each other, and there might not be a statistically significant difference. It is crucial to consider the context of image segmentation in deep learning. In this task, precision and sensitivity are calculated pixel-wise. Therefore, even a small difference in percentage points can have a significant impact.
In the experiments involving data augmentation, we observed only a slight performance improvement. Because the model already performed reasonably well without augmentation, adding it led to marginal gains. According to the related literature [29], data augmentation helps reduce overfitting at higher learning rates, allowing a model to be trained for more epochs without sacrificing accuracy. Further experiments are warranted to explore the impact of data augmentation in more depth.
A limitation of this study is the scarcity of the original data and the homogeneity of its source. In the future, it is desirable to establish collaborations with other medical centers to acquire cross-center data, thereby improving the performance and generalization ability of the models. Additionally, we investigated only three relatively common network architectures; exploring newer architectures, such as transformer-based networks, would be an attractive research direction. Different nuclear medicine imaging modalities, such as planar imaging and SPECT, differ in the resulting images, and it would be worth investigating whether these differences lead to heterogeneity in model predictions. This is an area for future exploration.

Conclusions
In this study, we investigated three CNN models for bone segmentation of WBBS images. We found that only one model, Mask R-CNN, was well suited to this goal. Double U-Net and DeepLabv3+ suffered from 'category confusion', a mistake a human reader would never make. We examined model performance at the pixel-wise scale. The best Mask R-CNN performance was a precision, sensitivity, and F1-score of 0.93, 0.87, and 0.90 for the prostate cancer dataset and 0.92, 0.86, and 0.88 for the breast cancer dataset, averaged over 10-fold cross-validation.

Informed Consent Statement: Patient consent was waived by the IRB because this is a retrospective study and only images without patient identification were used.
Data Availability Statement: Not applicable.