Deep Learning Using Multiple Degrees of Maximum-Intensity Projection for PET/CT Image Classification in Breast Cancer

Deep learning (DL) has recently become a remarkably powerful tool for image processing. However, the usefulness of DL in positron emission tomography (PET)/computed tomography (CT) for breast cancer (BC) has been insufficiently studied. This study investigated whether a DL model using PET maximum-intensity projection (MIP) images at multiple degrees contributes to increased diagnostic accuracy for PET/CT image classification in BC. We retrospectively gathered 400 images from 200 BC and 200 non-BC patients as training data. For each patient, we obtained PET MIP images at four different degrees (0°, 30°, 60°, 90°) and built two DL models using Xception. One DL model diagnosed BC with only the 0-degree MIP image, and the other used all four degrees. After the training phase, our DL models analyzed test data from 50 BC and 50 non-BC patients. Five radiologists interpreted the same test data. Sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) were calculated. Our 4-degree model, 0-degree model, and the radiologists had sensitivities of 96%, 82%, and 80–98% and specificities of 80%, 88%, and 76–92%, respectively. Our 4-degree model had equal or better diagnostic performance compared with the radiologists (AUC = 0.936 and 0.872–0.967, p = 0.036–0.405). A DL model similar to our 4-degree model may help radiologists in their diagnostic work in the future.


Introduction
Breast cancer (BC) is the most common cancer and the second leading cause of cancer-related deaths among women, and its incidence has increased in recent years [1]. Fluorine-18 fluorodeoxyglucose (18F-FDG) positron emission tomography (PET)/computed tomography (CT) is mainly used to search for distant metastases and secondary cancers, perform staging, and monitor the response to therapy [2][3][4].
Moreover, 18F-FDG-PET/CT is accurate for staging and assessing treatment response in a variety of malignancies [2,5]. Indeed, 18F-FDG-PET/CT is routinely used to image the entire body, at least from the mid-orbit to the proximal thighs, including the entire thorax and breast tissue. This coverage has led to the incidental detection of other primary malignancies, including BC. For example, Benveniste et al. [6] reported that 440 incidental breast lesions were identified in 1951 patients who underwent 18F-FDG-PET/CT.
Deep learning (DL) algorithms are rapidly gaining use in medical imaging applications [7]. Convolutional neural networks (CNNs), a class of DL algorithms, have shown excellent performance in recent years for medical image processing, such as for

Data Set
For each patient, we obtained MIP images at 4 different degrees (0°, 30°, 60°, 90°). Table 1 summarizes the number of images and the clinical T categories according to the TNM classification, 8th edition. First, we randomly split the image data into training, validation, and test image sets. For the training and validation phase, we used 400 sets of MIP images (200 BC, 200 non-BC) and labeled them into 2 classes according to the existence of BC. For the test phase, 100 sets of MIP images (50 BC, 50 non-BC) were used. The data used in the test phase were independent and were not used in the training or validation phases.
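A minimal sketch of such a patient-level split follows; the ID ranges, seed, and function name are illustrative placeholders, not the study's actual patient data.

```python
import random

def split_ids(bc_ids, non_bc_ids, n_test=50, seed=42):
    """Hold out an independent test set (n_test BC + n_test non-BC patients)
    and keep the remainder for training/validation. The IDs and seed are
    placeholders for illustration only."""
    rng = random.Random(seed)
    bc, non_bc = list(bc_ids), list(non_bc_ids)
    rng.shuffle(bc)
    rng.shuffle(non_bc)
    test = bc[:n_test] + non_bc[:n_test]
    train_val = bc[n_test:] + non_bc[n_test:]
    return train_val, test

# 250 BC and 250 non-BC patient IDs -> 400 training/validation, 100 test,
# matching the set sizes reported above.
train_val_ids, test_ids = split_ids(range(250), range(250, 500))
```

Splitting at the patient level, before any augmentation, keeps the test set truly independent of the training and validation phases.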

Image Processing
The image sets were further processed and augmented using code written in the Python programming language, version 3.7.0 (https://www.python.org, accessed on 21 July 2021), and the Python imaging library Pillow 3.3.1 (https://pypi.python.org/pypi/Pillow/3.3.1, accessed on 21 July 2021). Image processing was performed separately for the training, validation, and test image sets.
For the training image sets, we cut out the top and bottom of each image (approximately corresponding to the brain and bladder) and performed data augmentation so that the CNN model became robust to the degree of enlargement, rotation, changes in brightness and contrast, horizontal flipping, and partial erasure of the image. Through these processes, 16 image sets were generated from each image set, resulting in a total of 5120 image sets (320 image sets per fold of the 5-fold cross-validation × 16) available for training. For each validation and test image set, as in the training phase, the top and bottom of each image (approximately corresponding to the brain and bladder) were cut out first, and then the central 299 × 299 pixels of the resulting image were cropped.
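As a minimal Pillow sketch of the validation/test preprocessing, the following cuts the top and bottom of an image and center-crops 299 × 299 pixels. The cut fractions are illustrative assumptions, since the exact crop heights are not reported.

```python
from PIL import Image

def preprocess(img, top_frac=0.1, bottom_frac=0.1, size=299):
    """Cut off the top and bottom of a MIP image (roughly the brain and
    bladder) and crop the central size x size region. The top/bottom
    fractions are illustrative placeholders."""
    w, h = img.size
    # Remove the top and bottom strips of the image.
    img = img.crop((0, int(h * top_frac), w, int(h * (1 - bottom_frac))))
    # Center-crop the remaining image to size x size pixels.
    w, h = img.size
    left = (w - size) // 2
    top = (h - size) // 2
    return img.crop((left, top, left + size, top + size))

example = Image.new("L", (400, 700))  # dummy grayscale MIP-sized image
print(preprocess(example).size)       # -> (299, 299)
```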

DL Methods
We performed the whole process on a computer with a GeForce RTX 2080 Ti (NVIDIA, Santa Clara, CA, USA) graphics processing unit, a 3.80-GHz Core i7-10700K (Intel, Santa Clara, CA, USA) central processing unit, and 32 GB of random-access memory. The Python programming language and the PyTorch 1.6.0 framework for neural networks (https://pytorch.org/, accessed on 24 July 2021) were used to build the DL models.
We made 2 DL models based on Xception, whose architecture has 36 convolutional layers forming the feature-extraction base of the network [16]. One model, named the 0-degree model, diagnosed BC from only the 0° PET MIP image. The other, named the 4-degree model, used images at 4 different degrees: the 0°, 30°, 60°, and 90° PET MIP images. First, a pointwise (1 × 1) convolution was applied to the 30° and 60° images to create a combined 30° + 60° image. Second, the 0° PET MIP image, the 30° + 60° image, and the 90° PET MIP image were placed into an RGB image with 3 channels: the red channel for the 0° image, the green channel for the 30° + 60° image, and the blue channel for the 90° image. BC was then diagnosed from this RGB image by Xception. Pointwise convolution is a type of convolution that uses a 1 × 1 kernel, which iterates over every single point [17]. This method reduces the number of channels of the input images, allows multiple images to be trained on at the same time, and can reduce the computational complexity of DL models [18]. Using this technique, we could input 4 images (the 0°, 30°, 60°, and 90° MIPs) into Xception, which requires images composed of 3 channels (Figure 1). For training, image sets were prepared as described in the image-processing section and provided to each CNN. The output data were compared with the teacher data (2 categories: BC or non-BC), and the error was back-propagated to update the parameters of each CNN so that the error between the output data and teacher data was minimized. The CNNs comprise several layers, including convolutional layers, and are widely used for image recognition.
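The channel-fusion step described above can be sketched with PyTorch. The tensors and the untrained 1 × 1 convolution weights below are illustrative placeholders, not the fitted model.

```python
import torch
import torch.nn as nn

# Pointwise (1x1) convolution that fuses the 30-degree and 60-degree MIPs
# (2 input channels) into a single channel; weights are untrained placeholders.
fuse_30_60 = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=1)

# Dummy single-channel 299x299 MIP images (batch of 1).
mip_0 = torch.rand(1, 1, 299, 299)
mip_30 = torch.rand(1, 1, 299, 299)
mip_60 = torch.rand(1, 1, 299, 299)
mip_90 = torch.rand(1, 1, 299, 299)

# Fuse the 30- and 60-degree images, then stack a 3-channel "RGB" input:
# R = 0 degrees, G = 30 + 60 degrees, B = 90 degrees.
mip_30_60 = fuse_30_60(torch.cat([mip_30, mip_60], dim=1))
rgb_input = torch.cat([mip_0, mip_30_60, mip_90], dim=1)  # (1, 3, 299, 299)
```

The resulting 3-channel tensor matches the input shape Xception expects, which is what lets four projections share a three-channel network.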
The CNNs were initialized with the ImageNet (http://www.image-net.org/, accessed on 24 July 2021) pretrained model and fine-tuned to yield better performance. The optimization parameters were as follows: optimizer algorithm = stochastic gradient descent; learning rate = 0.0001, scheduled to decay by a factor of 0.4 every 15 epochs; weight decay = 0.001; and momentum = 0.9. The image sets for the training and validation phase were randomly split into training and validation data at a ratio of 4:1 in each fold, and supervised learning was performed for 30 epochs.
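Under these reported hyperparameters, the optimizer and learning-rate schedule could be set up in PyTorch as follows; the single dummy parameter and the empty loop body are placeholders standing in for the Xception weights and the actual forward/backward passes.

```python
import torch

# Dummy parameter standing in for the Xception weights.
param = torch.nn.Parameter(torch.zeros(1))

# SGD with the reported learning rate, momentum, and weight decay.
optimizer = torch.optim.SGD([param], lr=0.0001, momentum=0.9,
                            weight_decay=0.001)

# Decay the learning rate by a factor of 0.4 every 15 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.4)

for epoch in range(30):  # 30 epochs of supervised learning
    # ... forward pass, loss computation, loss.backward(), optimizer.step() ...
    scheduler.step()     # advance the schedule once per epoch
```

After 30 epochs the learning rate has decayed twice (at epochs 15 and 30), i.e. 0.0001 × 0.4².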
After developing the models, we tested them with independent image sets from 50 BC patients and 50 non-BC patients.

Radiologists' Readout
For this study, 5 radiologists with the following years of experience in breast imaging assessed the data: Readers 1 and 2 had 1 year, Reader 3 had 11 years, Reader 4 had 9 years, and Reader 5 had 8 years of experience. These 5 radiologists blindly evaluated the possibility of the existence of BC (0–100%) in the 0°, 30°, 60°, and 90° MIP DICOM images of the test cases. The radiologists could not refer to the original PET/CT data. None of these images were processed by cutting out the top and bottom of the image, as was done for the DL training, validation, and test phases.

Statistical Analysis
All statistical analysis in this study was performed using the EZR software package, version 1.54 (Saitama Medical Center, Jichi Medical University, Saitama, Japan) [19].
Interobserver agreement was assessed using the Pearson correlation coefficient, interpreted as follows: r = 0, no linear relationship; 0 < r < 1, a positive linear trend; r = 1, a perfect positive linear trend; −1 < r < 0, a negative linear trend; and r = −1, a perfect negative linear trend [20]. Receiver operating characteristic (ROC) analyses were performed to calculate the area under the ROC curve (AUC) for the two CNN models and the five readers, based on the estimated probability of the existence of BC (%). The optimal cut-off value was the one closest to the upper left corner of the ROC curve (i.e., the cut-off with the highest sum of sensitivity and specificity). The DeLong test was used to compare AUCs [21]. A p-value of <0.05 was considered statistically significant.

Results

Table 2 summarizes the interobserver agreement of our 4-degree model, 0-degree model, and the radiologists. Significant interobserver agreement was found between all CNN models and the radiologists (r = 0.563–0.896; p < 0.001), although the agreement between the models and the radiologists (r = 0.563–0.754) was lower than that among the radiologists alone (r = 0.708–0.896).

Table 3 and Figure 2 compare the diagnostic performance of the five readers and the two models. Readers 1, 2, 3, 4, and 5 had sensitivities of 80%, 80%, 90%, 94%, and 98%; specificities of 84%, 92%, 76%, 90%, and 86%; and AUCs of 0.872, 0.891, 0.900, 0.957, and 0.967, respectively. Our 4-degree model showed a sensitivity of 96%, a specificity of 80%, and an AUC of 0.936; our 0-degree model showed a sensitivity of 82%, a specificity of 88%, and an AUC of 0.918. The AUC of our 4-degree model was significantly larger than that of Reader 1 (0.936 vs. 0.872; p = 0.036) and, although the difference was not significant, larger than that of Reader 2 (0.936 vs. 0.891).

In our 4-degree model, there were 10 false-positive (Figure 3) and three false-negative cases (Figure 4). Among the 10 false-positive cases, four had physiological FDG uptake in both (2 cases) or the left (2 cases) mammary glands resembling masses, and four had physiological FDG uptake in both nipples, one of which disappeared in the 30°, 60°, or 90° MIP. Table 4 summarizes the three false-negative cases. In two cases, the lesions showed a maximum standardized uptake value (SUVmax) of 0.9 and 1.2. In the other case, the organs near the breast (heart, liver, spleen, and kidneys) showed an SUVmax of up to 7.375.

In six cases, the 0-degree model made mistakes for which the 4-degree model made the correct diagnosis. In three of these cases, the FDG uptake of the BC was near the nipple, and in another case the FDG uptake of the BC was a non-mass-like lesion (Figure 5).
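The cut-off selection rule used in the ROC analysis (the value maximizing the sum of sensitivity and specificity) can be sketched in plain Python; the scores and labels below are toy values, not study data.

```python
def best_cutoff(scores, labels):
    """Return the cut-off maximizing sensitivity + specificity, i.e. the
    ROC point closest to the upper-left corner. Toy implementation."""
    best, best_sum = None, -1.0
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < c and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < c and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 0)
        sens, spec = tp / (tp + fn), tn / (tn + fp)
        if sens + spec > best_sum:
            best, best_sum = c, sens + spec
    return best

# Toy reader scores (0-100% probability of BC) and ground truth (1 = BC).
scores = [10, 20, 60, 80, 90, 30]
labels = [0, 0, 1, 1, 1, 0]
print(best_cutoff(scores, labels))  # -> 60
```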

Tomography 2022, 8

Figure 5. The right breast cancer is recognizable in the 0° and 90° maximum-intensity projections (MIPs) (black arrows) but is difficult to recognize in the 30° and 60° MIPs due to physiological FDG uptake of other organs.
Discussion

DL technologies are increasingly used in the field of breast imaging, such as mammography [22,23] and ultrasonography [24]. Some of these technologies (e.g., MammoScreen) already support radiologists in diagnosing BC clinically. Raya-Povedano et al. [25] reported that digital mammography screening strategies based on artificial intelligence systems could reduce the radiologists' workload by up to 70%. To our knowledge, however, few software programs using MIP of PET/CT are in clinical use.
The sensitivity and specificity of 18F-FDG-PET/CT for diagnosing primary lesions of BC by radiologists vary from 48% to 96% and from 73% to 100%, respectively [4]. The increase in 18F-FDG-PET/CT use may increase the possibility of detecting incidental breast abnormalities. The use of MIP in 18F-FDG-PET/CT allows the clinician to easily view the whole body; therefore, it is also useful in screening for breast abnormalities.
Our research focused on detecting primary BCs on MIP of 18F-FDG-PET/CT using several DL methods with CNNs to evaluate their diagnostic performance compared with human readers. To our knowledge, this study is the first to compare the diagnostic performance of classifying primary lesions of BC among two CNN models and human readers on MIP of 18F-FDG-PET/CT.
Our 4-degree model showed significantly better results in diagnosing primary BC than one less-experienced radiologist and, although the differences were not significant, also showed better diagnostic performance than another less-experienced radiologist and one expert radiologist. In addition, no significant differences were found between the model and the other two expert radiologists. Based on these results, a DL model like our 4-degree model may reduce the risk of overlooking an incidental but critical breast abnormality, especially when it is used to support a less-experienced radiologist, thereby minimizing the negative effects for patients.
In this study, we examined the interobserver agreement between the CNN models and the radiologists and found significant agreement between them. However, the agreement between the models and the radiologists was lower than that among the radiologists alone. These findings may suggest that, although the radiologists and the CNN models made similar diagnoses, they may have applied different decision criteria. In the future, more accurate models may be developed by visualizing and validating the decision rationale of both the CNN models and the human readers.
Our 4-degree model also showed better, although not statistically significant, diagnostic performance than the 0-degree model. In fact, six cases, including a non-mass-like lesion, were diagnosed correctly only by the 4-degree model. Hosni et al. [26] reported that ensemble methods, which combine a set of single techniques, show better performance in breast image classification. Nobashi et al. [27] also demonstrated that CNN ensembles of multiple images with different axes and window settings outperformed single-image models for brain 18F-FDG-PET scans. Considering these reports and our results, using multiple images may increase diagnostic performance more than using only one image.
In 4 of the 10 false-positive cases of the 4-degree model, it is possible that the model misrecognized normal FDG uptake of one nipple as BC. Because the FDG uptake of the heart is typically higher than that of the nipples, the model may have failed to recognize the nipple overlapping with the heart and presumed that the other nipple was a breast abnormality (Figure 4a). In the other four cases, the model may have misrecognized normal but mass-like FDG uptake of a mammary gland or a nipple as a breast lesion (Figure 4b).
For the three false-negative cases, it is possible that the FDG uptake of the lesions was not sufficiently high (Figure 5a) or that high physiological FDG uptake in other organs led the model to miss the lesions (Figure 5b). For these reasons, the model appeared unable to detect the abnormal FDG uptake. In two of these false-negative cases, the cancer subtype was ductal carcinoma in situ (DCIS); these lesions may have been too small and their FDG uptake too low to be recognized. The remaining case was a small, 8-mm invasive carcinoma of luminal A type with low activity.
This study has several limitations. First, the sample size was small. Second, this was a single-center, retrospective study. Third, we did not consider benign lesions such as fibroadenoma and intraductal papilloma. Fourth, differences in image quality among PET/CT devices may have influenced the diagnostic performance of our DL models. Fifth, only four types of PET MIP images were used for constructing the DL models and for the radiologists' reading. In the future, a large-scale, multicenter, prospective validation study should be performed using a large amount of 18F-FDG-PET/CT data.

Conclusions
Our 4-degree model, which used MIP images at multiple degrees, was significantly more accurate than the diagnosis of an inexperienced radiologist and was comparable to that of three expert radiologists and to the 0-degree model. Therefore, a DL model similar to our 4-degree model may decrease the rate of missed incidental breast findings and may help radiologists in their diagnostic work in the future.