1. Introduction
Bone is one of the most common sites of metastatic cancer, particularly in the advanced phases of cancer progression; breast, prostate, and lung cancers show the highest incidence rates of bone metastasis [
1]. Bone metastases can severely impact patients’ daily activities and quality of life because of severe pain and associated major complications, and the protracted clinical course of bone metastasis poses significant challenges to treatment. According to a 2022 report based on the Taiwan National Health Insurance Research Database [
2], prostate cancer ranked sixth among the leading causes of cancer death in Taiwanese men, while breast cancer ranked second in Taiwanese women. Current diagnostic techniques for bone metastasis include bone scintigraphy (BS), X-ray imaging, computed tomography (CT), and magnetic resonance imaging (MRI); BS is the most cost-effective early screening method and can detect bone metastasis 3 to 6 months earlier than CT or X-ray [
3].
Bone metastasis typically affects the central skeletal system and the proximal regions of the upper and lower limbs. The central skeletal system contains red bone marrow, which may contribute to the formation of bone metastasis due to its physiological characteristics [
4]. Physicians often perform a whole-body bone scan (WBBS) to diagnose the presence of bone metastasis.
Tc-99m MDP is a radiopharmaceutical injected into a patient’s vein; within about four hours it enters bone cells and deposits with the mineral components. Consequently, Tc-99m MDP tends to accumulate in areas of active bone formation in the affected region, resulting in locally increased radiopharmaceutical activity that appears as a “hot spot” on BS and allows physicians to identify bone metastasis [
5]. However, BS findings can be ambiguous because conditions such as bone injury, arthritis, and degenerative changes also produce increased uptake, which complicates interpretation. Inexperienced clinical physicians may struggle to make accurate judgments or may even misinterpret the images.
Bone scan index (BSI) is an imaging biomarker used to quantify the extent of bone metastasis in cancers [
6]. BSI is calculated as the ratio of “the number of bone lesions indicating bone metastasis” to “the number of regions with a high incidence of bone metastasis” [
7,
8,
9], as shown in
Figure 1. With artificial intelligence, machine learning, and big data, BSI calculation has become more objective, accurate, and diagnostically efficient. BSI’s most attractive application is monitoring treatment and prognosis, providing significant clinical value. Armstrong et al. from Duke University introduced the automated bone scan index (aBSI) as an objective imaging parameter [
10], which can evaluate the prognosis of metastatic castration-resistant prostate cancer (mCRPC) patients undergoing systemic treatment in clinical trials. In [
11,
12], manual and automated BSI measurements were highly correlated (ρ = 0.80), and automated BSI scoring demonstrated reproducibility, eliminating the subjectivity of clinical judgment while retaining the same clinical significance as manual BSI scoring. Furthermore, some studies confirmed the utility of aBSI in mCRPC patients [
13,
14,
15], while other studies have begun to explore its application and refinement in other tumors [
16].
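As a concrete illustration of the ratio defined above, a simplified pixel-wise BSI can be computed from segmentation masks. This is only a hedged sketch, not the aBSI algorithm of [10]: it assumes binary NumPy masks and measures extent in pixel area rather than regional skeletal mass.

```python
import numpy as np

def bone_scan_index(lesion_mask: np.ndarray, skeleton_mask: np.ndarray) -> float:
    """Simplified pixel-wise BSI sketch: the fraction of the segmented
    skeleton occupied by lesions flagged as metastatic. Assumes binary
    masks; real aBSI implementations weight by regional skeletal mass."""
    skeleton_pixels = np.count_nonzero(skeleton_mask)
    if skeleton_pixels == 0:
        return 0.0
    # Only count lesion pixels that fall inside the skeleton mask.
    lesion_pixels = np.count_nonzero(np.logical_and(lesion_mask, skeleton_mask))
    return lesion_pixels / skeleton_pixels
```

In this simplified form, lesion segmentation supplies the numerator and skeleton segmentation the denominator, mirroring the division of labor in CAD systems discussed next.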
Generally, computer-assisted diagnosis (CAD) systems that use machine learning or neural network (NN) frameworks to calculate BSI on WBBS images can be divided into two parts: lesion segmentation and skeleton segmentation, which respectively provide the numerator and denominator of the BSI value [
17,
18,
19,
20]. Recently, numerous studies [
21,
22] and related patents [
23,
24] on lesion segmentation using the NN framework have been conducted. However, the performance of pixel-wise lesion segmentation has not been thoroughly and rigorously investigated. Similarly, research on skeleton segmentation using deep learning and NN models is scarce [
20,
25]: the skeleton segmentation approach mentioned in [
20] lacks comparison with other NN models, and although [
25] compared its performance with U-Net, it remained confined to traditional semantic segmentation network architectures. Thus, skeleton segmentation using NNs remains insufficiently explored. This paper applies different NN models to skeleton segmentation on WBBS images and investigates their results. Additionally, we have built a website platform for online skeleton segmentation of WBBS images [
Appendix A], which provides effective skeleton segmentation data for further evaluation of BSI.
2. Materials and Methods
2.1. Materials
In this retrospective study, conducted in collaboration with the Department of Nuclear Medicine at China Medical University Hospital, 196 WBBS images of patients with prostate cancer were collected. Among these 196 patients, 110 had bone metastasis and 86 had no evidence of bone metastasis. We also collected 163 WBBS images of patients with breast cancer, all of whom had bone metastasis. The study was approved by the Institutional Review Board (IRB) and the Hospital Research Ethics Committee (CMUH106-REC2-130) of China Medical University.
The radiopharmaceutical used for WBBS was Tc-99m MDP, and imaging was performed 4 h after intravenous injection. A Gamma camera (Millennium MG, Infinia Hawkeye 4, or Discovery NM/CT 670 system; GE Healthcare, Waukesha, WI, USA) was used for planar bone scanning, with a low-energy high-resolution or general-purpose collimator, a matrix size of 1024 × 256, a photon energy centered on the 140 keV peak, and a symmetric 20% energy window. The collected bone scan images were in DICOM format, with a spatial resolution of 1024 × 512 pixels (composed of anterior-posterior (AP) and posterior-anterior (PA) views), and the intensity of each pixel was saved as a 2-byte unsigned integer (uint16). The images were preprocessed on a dedicated GE Xeleris workstation (GE Medical Systems, Haifa, Israel; version 3.1) before being uploaded to PACS.
A standard WBBS image contains two views: anterior and posterior. The original DICOM images were first converted to PNG format after removing any identifiable information. Following the approach described in [
22], pre-processing was performed by normalizing the image size and intensity. Afterwards, the anterior and posterior views were cropped into a single 950 × 512 image, without any scaling or geometric transformation, as shown in
Figure 2.
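The pre-processing described above can be sketched as follows. This is a hedged illustration, not the exact pipeline of [22]: the min-max intensity normalization and the crop offsets are assumptions made for the example.

```python
import numpy as np

def preprocess_wbbs(raw: np.ndarray) -> np.ndarray:
    """Sketch: normalize a 1024 x 512 uint16 WBBS frame and crop it to
    950 x 512. The simple min-max normalization and the centered crop
    are illustrative assumptions, not the exact method of [22]."""
    assert raw.shape == (1024, 512) and raw.dtype == np.uint16
    # Intensity normalization from 16-bit counts to 8-bit range.
    lo, hi = raw.min(), raw.max()
    img = ((raw.astype(np.float32) - lo) / max(hi - lo, 1) * 255).astype(np.uint8)
    # Keep 950 rows (illustrative centered crop: drop 37 rows top and bottom).
    top = (1024 - 950) // 2
    return img[top:top + 950, :]
```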
2.2. Region Definition
To identify the skeletal regions where bone metastases occur most frequently, we consulted two experienced, board-certified nuclear medicine physicians and established labeling standards that required the approval of both physicians. The regions are the skull, spine, chest (including ribs, scapula, and clavicle), humerus (proximal to midshaft of the humerus), femurs (proximal to midshaft of the femurs), and pelvis.
The positions of the humerus on images differ significantly, as shown in
Figure 2. Unlike the femurs, which form a single category, we divide the humerus into four categories: the left and right humerus in the anterior and posterior views. The reason for doing so is addressed in the discussion. Furthermore, Tc-99m MDP undergoes renal metabolism, which can make the kidneys appear as high-signal areas that may be misclassified as metastasis. To alleviate this problem, we created an extra kidney category to remove this ambiguity.
In summary, there are in total ten categories (
Figure 3), including the skull, spine, chest (including ribs, scapula, and clavicle), anterior right humerus (AR), anterior left humerus (AL), posterior right humerus (PR), posterior left humerus (PL), femurs (proximal to midshaft of the femurs), pelvis, and kidney.
2.3. Neural Network Architectures
Three different neural network architectures were tested, including Mask R-CNN [
26], Double U-Net [
27], and Deeplabv3 plus [
28]. We used similar hyperparameters across the three models to compare their performance.
The Mask R-CNN architecture shown in
Figure 4 comprises four main parts: the backbone, the region proposal network (RPN), RoIAlign, and the head architecture. We used ResNet-50 as the backbone. The hyperparameters were a learning rate of 0.005, a batch size of 4, and 100 epochs.
The Double U-Net architecture shown in
Figure 5 comprises two sub-networks with dilated convolutions, spatial pyramid pooling, and a squeeze-and-excitation (SE) block. It was originally designed for binary classification; here we modified it for multi-class classification by giving the output layer of Network 1 a SoftMax activation function. The hyperparameters were a learning rate of 0.0005, a batch size of 4, and 200 epochs (without data augmentation) or 20 epochs (with data augmentation).
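The modification just described amounts to replacing a per-pixel Sigmoid with a per-pixel SoftMax across the class channels, so that the class probabilities at each pixel sum to 1. A minimal NumPy sketch, assuming a (classes, height, width) tensor layout:

```python
import numpy as np

def softmax_over_classes(logits: np.ndarray) -> np.ndarray:
    """Per-pixel SoftMax over the class axis (axis 0). This replaces the
    original single-channel Sigmoid output so that the per-pixel class
    probabilities sum to 1, enabling multi-class prediction."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)
```

Applied to an (11, H, W) logit map, the arg-max over axis 0 then yields one category label per pixel.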
The Deeplabv3 plus architecture shown in
Figure 6 includes an encoder, a decoder, dilated convolutions, and depth-wise separable convolutions. We used ResNet-50 as the encoder backbone. The hyperparameters were a learning rate of 0.0005, a batch size of 4, and 200 epochs.
The learning rate is a hyperparameter used in many machine learning algorithms, particularly in gradient-based optimization; it determines the step size at which the model updates its parameters during training. The choice of learning rate depends on the specific problem, and every model typically has its own suggested value. In this study, we chose learning rates that balance accuracy and training speed: Mask R-CNN uses a learning rate of 0.005, while Double U-Net and DeeplabV3 plus use a learning rate of 0.0005.
2.4. Image Pre-Processing
The input matrix size for Mask R-CNN was 950 × 512. For Double U-Net and Deeplabv3 plus, the input was zero-padded to 960 × 512 to satisfy their input-size constraints. The labels were saved in PNG format as integers ranging from 0 to 10.
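The zero-padding step can be sketched as follows; placing all the padding at the bottom of the image is an assumption for illustration.

```python
import numpy as np

def pad_to_960(img: np.ndarray) -> np.ndarray:
    """Zero-pad a 950 x 512 input to 960 x 512 for Double U-Net and
    Deeplabv3 plus (padding placed at the bottom is an assumption)."""
    pad_rows = 960 - img.shape[0]
    return np.pad(img, ((0, pad_rows), (0, 0)), mode="constant")
```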
Augmentation included rotations of −3°, 0°, and 3°, scaling factors of 0.9, 1, and 1.1, and brightness adjustments of 0.8, 0.93, 1.06, 1.19, 1.32, 1.45, 1.58, and 1.7 times. The augmented images had the same matrix size as the originals, and the augmentation produced a 63-fold increase in training data. Augmentation was applied only to the training set.
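The augmentation grid and a simple brightness adjustment can be sketched as follows. This is a hedged illustration: the geometric transforms would be applied with an imaging library, the intensity-scaling form of the brightness adjustment is an assumption, and the subset of grid combinations that yields the reported 63-fold increase is not specified here.

```python
import itertools
import numpy as np

ROTATIONS = (-3, 0, 3)    # degrees
SCALES = (0.9, 1.0, 1.1)  # scaling factors
BRIGHTNESS = (0.8, 0.93, 1.06, 1.19, 1.32, 1.45, 1.58, 1.7)

def augmentation_params():
    """Enumerate the rotation/scale/brightness combinations that could be
    applied to each training image (parameter grid only; actual rotation
    and scaling would be done with an imaging library)."""
    return list(itertools.product(ROTATIONS, SCALES, BRIGHTNESS))

def adjust_brightness(img: np.ndarray, factor: float) -> np.ndarray:
    """Brightness adjustment by simple intensity scaling with clipping
    (an assumed implementation for illustration)."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)
```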
2.5. Evaluations
In this study, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) were defined at the pixel level. The evaluation metrics used in the experiment were precision (Equation (1)) and sensitivity (Equation (2)), and the overall model evaluation was based on the F1 score (Equation (3)).
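Assuming the standard pixel-wise definitions of these metrics, they can be computed for a single binary class as follows:

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray):
    """Pixel-wise precision, sensitivity, and F1 for one binary class,
    assuming the standard definitions:
        precision   = TP / (TP + FP)
        sensitivity = TP / (TP + FN)
        F1          = 2 * P * S / (P + S)
    """
    tp = np.count_nonzero(pred & gt)
    fp = np.count_nonzero(pred & ~gt)
    fn = np.count_nonzero(~pred & gt)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return precision, sensitivity, f1
```

For multi-class evaluation, the same computation is repeated per category on the one-hot masks.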
4. Discussion
This study utilized Mask R-CNN, Double U-Net, and DeeplabV3 plus for skeleton segmentation on prostate cancer and breast cancer WBBS images. The quantitative results were investigated via 10-fold cross-validation. Based on these findings, Mask R-CNN exhibited higher precision than Double U-Net by 2.03% on the prostate cancer dataset and 1.84% on the breast cancer dataset, and higher precision than DeeplabV3 plus by 3.23% and 2.31%, respectively. On the other hand, Double U-Net (90.70% and 88.86%) demonstrated higher sensitivity than Mask R-CNN (87.02% and 85.51%) and DeeplabV3 plus (88.64% and 85.71%). This indicates that Mask R-CNN produced fewer false positives (FP) during prediction, while Double U-Net produced fewer false negatives (FN).
To better understand these results, we visualized the predictions, where white represented TP, green represented FP, and red represented FN (as shown in
Figure 9 and
Figure 10). Mask R-CNN’s predictions shifted inward slightly compared to the ground truth (GT), resulting in more FN in the edge regions and only a few FP. Double U-Net’s predictions aligned well with the GT along the edges, resulting in slightly fewer FN but more FP. DeeplabV3 plus exhibited irregularities along the edges compared to the other two models, leading to noticeable erroneous FP and an overall increase in FP.
These findings shed light on the performance of different models for skeleton segmentation, emphasizing the trade-off between FP and FN. Further improvements can be explored to address the limitations observed, particularly in the case of DeeplabV3 plus, to enhance its stability and accuracy.
Further investigation of Mask R-CNN results revealed an increase in false negatives (FN) when predicting smaller categories, such as the humerus and kidneys, as shown in
Figure 11a. This result could be caused by the following reasons:
First, insufficient brightness in the WBBS image may hinder feature detection. The brightness of WBBS images depends on the counts collected by the scintillation crystal, which can be influenced by factors such as patient thickness and radiopharmaceutical activity. When the received counts are insufficient and image brightness is therefore inadequate, deep neural network models may struggle to make accurate judgments or may even make errors. Adjusting the image brightness and re-testing can help alleviate this situation, as shown in
Figure 11b.
Second, abnormal patient positioning in the WBBS image could cause another issue. In a few instances, the patient’s position deviated to some extent from standard clinical positioning, which posed challenges for CNN prediction, as shown in
Figure 12. The degree of deviation is closely related to the patient’s clinical condition and is difficult to avoid entirely in clinical practice. While previous studies may have manually excluded such misleading images, this study aimed to maintain a dataset that reflects real clinical scenarios; therefore, we did not exclude any cases. To enhance the network’s ability to predict WBBS images with unusual positioning, future work could employ hard negative mining techniques to improve the model’s generalization capability.
Third, the model’s insensitivity to the features of small objects in WBBS images could also decrease performance. Quantitative results indicated relatively low precision for categories corresponding to smaller objects, such as the upper limbs, femurs, and kidneys. This suggests that Mask R-CNN faces certain difficulties in segmenting smaller regions.
These findings highlighted specific challenges encountered during the skeleton segmentation process, particularly related to image brightness, abnormal patient positioning, and the segmentation of smaller objects. Addressing these challenges could improve the performance of the Mask R-CNN model.
On the other hand, DeeplabV3 plus and Double U-Net tended to mix categories, resulting in unstable performance. Neither exhibited the category-missing issue observed in Mask R-CNN, but both experienced category confusion and masks appearing in unintended areas, with DeeplabV3 plus being particularly affected. The issue of category confusion during prediction in semantic segmentation network architectures was not explicitly mentioned in [
20,
25]. However, in our experiments, we did observe this problem.
Figure 13a showed an incorrect segmentation in the knee area in a Double U-Net skeleton segmentation result, while
Figure 13b depicted category confusion in the upper limbs and head in a DeeplabV3 plus skeleton segmentation result.
This problem stems from the different network architectures. Mask R-CNN uses parallel branch networks to determine categories independently and to select the appropriate mask for each region of interest (ROI). Consequently, different ROIs can be distinguished independently, and masks can be treated as separate entities. In contrast, traditional fully convolutional network (FCN) architectures perform category and mask prediction simultaneously, leading to competition between different categories and masks; moreover, because of the one-category-per-mask design, FCN-based methods cannot treat ROIs independently. Another critical factor is Mask R-CNN’s use of a per-class Sigmoid activation function and average binary cross-entropy loss in its mask branch, which avoids the adverse cross-category competition encountered in traditional FCN methods. This design yields excellent instance segmentation results and avoids category overlap or confusion. From the experiments, Mask R-CNN proved more suitable than the other two network architectures for skeleton segmentation in WBBS images.
From experiments shown in
Table 2,
Table 3 and
Table 4, one might think that the models’ performances are close to each other and that there might not be a statistically significant difference. However, it is crucial to consider the context of image segmentation in deep learning: precision and sensitivity are calculated pixel-wise, so even a small difference in percentage points corresponds to a large number of pixels and can have a significant impact.
In the experiments involving data augmentation, we observed only a slight performance improvement. Because the models already performed reasonably well without data augmentation, adding it led to marginal gains. According to the related literature [
29], incorporating data augmentation helps reduce overfitting at higher learning rates, allowing the model to be trained for more epochs without sacrificing accuracy. Further experiments are warranted to explore the impact of data augmentation in more depth.
A limitation of this study is the scarcity of the original data and the homogeneity of its source. In the future, we hope to establish collaborations with other medical centers to acquire cross-center data, thereby improving the performance and generalization ability of the models. Additionally, we investigated only three relatively common network architectures; exploring newer architectures, such as transformer-based networks, would be an attractive research direction. Different nuclear medicine imaging modalities, such as planar imaging and SPECT, produce different images, and it would be worth investigating whether these differences lead to heterogeneity in model predictions. This remains an area for future exploration.