A High-Performance Deep Neural Network Model for BI-RADS Classification of Screening Mammography

Globally, breast cancer ranks first in cancer incidence. Treatment for early-stage breast cancer is highly cost-effective, and the five-year survival rate for stage 0–2 breast cancer exceeds 90%. Screening mammography has been acknowledged as the most reliable way to diagnose breast cancer at an early stage. The Taiwanese government has been urging women without any symptoms, aged between 45 and 69, to have a screening mammogram every two years, which creates a heavy workload for radiologists. In light of this, this paper presents a deep neural network (DNN)-based model as an efficient and reliable tool to assist radiologists with mammographic interpretation. For the first time in the literature, mammograms are completely classified into BI-RADS categories 0, 1, 2, 3, 4A, 4B, 4C and 5. The proposed model was trained using block-based images segmented from a mammogram dataset of our own; a block-based image is applied to the model as an input, and a BI-RADS category is predicted as the output. The performance of this work is demonstrated by an overall accuracy of 94.22%, an average sensitivity of 95.31%, an average specificity of 99.15% and an area under curve (AUC) of 0.9723. When applied to breast cancer screening for Asian women, who are more likely to have dense breasts, this model is expected to give a higher accuracy than others in the literature, since it was trained using mammograms taken from Taiwanese women.


Introduction
Globally, breast cancer ranks first in cancer incidence [1]. A recent report [2] indicates that more than 10,000 Taiwanese women were diagnosed with breast cancer, and more than 2000 died of breast cancer, in 2018. As a matter of fact, treatments for early-stage breast cancer are effective: the 5-year survival rate for stage 0–2 breast cancer exceeds 90%, while it falls below 25% for stage 4 [3]. Screening mammography has been acknowledged as the most reliable way to detect breast cancer at an early stage, particularly in detecting grouped microcalcification lesions. For years, the Taiwanese government has been urging women without any symptoms, aged between 45 and 69, to have a screening mammogram on a biennial basis. A great number of mammograms are collected in a large-scale mammography screening program and need to be interpreted by well-qualified but overloaded radiologists. Hence, there is an unmet need to develop AI models to assist radiologists with mammographic interpretation, and AI model development requires interdisciplinary research that integrates medical science and engineering.
Routine screening mammography consists of the cranio-caudal (CC) view and the mediolateral-oblique (MLO) view of each breast of a woman, that is, the LCC, RCC, LMLO and RMLO views in total. Developed by the American College of Radiology (ACR), the Breast Imaging Reporting and Data System (BI-RADS) [4] lexicon is used to standardize the reporting of mammographic findings, assessment categories and follow-up management, and communication between radiologists and referring physicians can be facilitated accordingly.
BI-RADS classification is frequently used in breast cancer screening. Therefore, there is a clear need to develop AI models for efficient and reliable BI-RADS classification. However, little has been reported on this issue in the literature so far, mainly due to the scarcity of open-access mammogram datasets. For example, breast masses were classified incompletely into BI-RADS categories 2–5 using a computer-aided diagnosis system [13], where merely 300 mammograms were employed as training data and another 200 mammograms were employed as testing data.
Accordingly, this paper presents a deep learning model to address the BI-RADS classification issue. Breast lesions were classified into categories 0, 1, 2, 3, 4A, 4B, 4C and 5, excluding category 6, which indicates a known biopsy-proven malignancy. For the first time in the literature, breast lesions can be completely classified using a deep learning model that was well trained by a mammogram dataset of our own. For the purpose of model training, all the lesions contained in the dataset were labeled and classified by six well-qualified radiologists, as will be detailed below.
It is worth mentioning that this work can provide at least three benefits for medical industries. First, the developed tool can assist radiologists with mammographic interpretation in clinical work and can improve the efficiency of mammogram interpretation as well. Second, the workload of radiologists can be significantly eased, particularly when interpreting mammograms in a large-scale breast cancer screening program. Third, the tool can assist general physicians with interpreting mammograms, given the shortage of radiologists or breast surgeons in most remote areas.
This paper is outlined as follows. Section 2 describes a labeled and annotated mammogram dataset for training purposes. Section 3 presents a deep neural network (DNN)-based model for BI-RADS classification. Experimental results and discussions are given in Section 4. Finally, Section 5 concludes this study.

Materials and Lesion Annotation
Firstly, Table 1 gives the complete BI-RADS categories, together with their respective descriptions and assessments [23]. As can be found therein, category 4 is further subcategorized into categories 4A, 4B and 4C to indicate different levels of malignancy suspicion.
The digital mammogram dataset employed in this work was provided by the E-Da Hospital, Taiwan. The dataset is composed of 5733 mammograms of 1490 patients, including 1434 LCC, 1436 RCC, 1433 LMLO and 1430 RMLO views, acquired between 2004 and 2010. This study was approved by a local institutional review board (EMRP-108-142), and informed consent was waived, simply because there is no personally identifiable data in the dataset; all the personal data were deleted.
To facilitate data preprocessing, an easy-to-use tool was developed exclusively for users to label the lesion in each mammogram. Once the image labeling was completed, an interface, as illustrated in Figure 1, appeared to give users the detailed annotation. In this work, all the lesions in the mammograms were labeled by a total of six qualified radiologists of the E-Da Hospital, and each annotation was saved as a JSON file. For illustrative purposes, Figure 2 gives a BI-RADS category 4C mammogram with a labeled lesion and shows the JSON file that saved the annotation in Figure 1.
Table 2 gives the statistics on the number of lesion annotations. As can be found therein, there is no annotation in BI-RADS category 1, simply because category 1 means that the breast tissue looks healthy, so there was no need to annotate. Additionally, there is a maximum of 8 annotations in a single mammogram and a total of 4557 annotations over all the mammograms in this work.

Table 2. Number of lesion annotations per BI-RADS category.
BI-RADS Category    Number of Annotations
0                   520
1                   0
2                   2125
3                   847
4A                  367
4B                  277
4C                  217
5                   204
Overall             4557
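As noted above, each lesion annotation was saved as a JSON file. Since the paper shows its schema only in Figure 2 and does not publish it, the snippet below merely sketches what one saved annotation might look like; every field name (view, birads, lesion_polygon, radiologist_id) is a hypothetical placeholder, not the labeling tool's actual format.

```python
import json

# Hypothetical single-lesion annotation; all field names here are invented
# for illustration, as the actual schema appears only in Figure 2.
annotation = {
    "view": "LCC",                 # one of LCC, RCC, LMLO, RMLO
    "birads": "4C",                # assessed BI-RADS category
    "lesion_polygon": [[1024, 812], [1090, 805], [1101, 880], [1030, 892]],
    "radiologist_id": "R03",
}

with open("annotation.json", "w") as f:
    json.dump(annotation, f, indent=2)
```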

Methodology and Model
This paper presents a DNN-based model to classify mammograms into categories 0, 1, 2, 3, 4A, 4B, 4C and 5, excluding category 6, since category 6 is reserved for patients already diagnosed with breast cancer. As illustrated in Figure 3, the model was trained using block-based images segmented from the dataset. A block-based image is applied to the model as an input, and a category is assigned as the output. In this manner, the feature maps of the block-based images are correlated with the BI-RADS categories.
The DNN-based model has the following advantages. It was well trained using a multitude of block images, and, for the first time in the literature, mammograms are completely classified into eight BI-RADS categories. Moreover, breast lesions can be reliably located and efficiently classified, allowing radiologists to speed up mammogram interpretation. The training data and the flowchart of the presented model are described as follows.

Block Images as Training Data
As referenced previously, the presented model was trained using a multitude of block-based images of size 224 × 224 pixels. Figure 4 illustrates block images and a lesion contained in a block image. As illustrated in Figure 4a,b, the white portions represent the same view of a breast, and a mammogram is segmented into overlapping block images from right to left and then top to bottom, with a stride of 36 pixels. Furthermore, a block image in which the breast occupies no less than 90% of the block area is chosen as a piece of training data.
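The segmentation just described can be sketched as a sliding-window pass over the mammogram. The following is a minimal illustration, assuming a binary breast mask (e.g., obtained by thresholding) is available; the function and variable names are ours, not the paper's, and the scan order does not affect which blocks are kept.

```python
import numpy as np

def segment_blocks(mammogram, breast_mask, block=224, stride=36, min_breast=0.9):
    """Slide a block x block window over the mammogram with the given stride,
    keeping blocks whose breast-tissue coverage is at least min_breast.
    `breast_mask` is a binary array of the same shape marking breast pixels."""
    blocks = []
    h, w = mammogram.shape
    for top in range(0, h - block + 1, stride):
        for left in range(0, w - block + 1, stride):
            window = breast_mask[top:top + block, left:left + block]
            if window.mean() >= min_breast:  # breast occupies >= 90% of the block
                blocks.append((top, left, mammogram[top:top + block, left:left + block]))
    return blocks

# Toy example: a synthetic 512 x 512 "mammogram" whose left 300 columns are breast.
img = np.random.rand(512, 512).astype(np.float32)
mask = np.zeros((512, 512), dtype=np.uint8)
mask[:, :300] = 1
kept = segment_blocks(img, mask)
```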
As illustrated in Figure 4c, part of a lesion may be contained in a block image. A BI-RADS category is then assigned to the block image according to the ratio of the area of the contained lesion to the area of the block, as categorized below. In Case 1, a block image does not contain any lesion and is accordingly assigned BI-RADS category 1. Otherwise, two quantities, ratioB and ratioL, are respectively defined in Case 2 as

ratioB = AreaL∩B / AreaB    (1)
ratioL = AreaL∩B / AreaL    (2)

where AreaB and AreaL represent the areas of the block image and the lesion, respectively, and AreaL∩B represents the area of the lesion portion contained in the block. Subsequently, if the condition

(ratioB ≥ thrB) or (ratioL ≥ thrL)    (3)

where thrB = thrL = 0.5 are two user-specified thresholds, is true, the block image is classified as the category of the contained lesion.
In Case 3, where there are multiple findings in a block image, the condition in Expression (3) is checked for each finding. If it is satisfied, the block image is assigned the highest category among the qualifying findings in the following hierarchy, from highest to lowest: 5, 4C, 4B, 4A, 0, 3, 2. Otherwise, the block image is assigned BI-RADS category 1. All the block images were divided into two parts, the training and test data, and Table 3 gives the numbers of these data for each BI-RADS category.
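Cases 1–3 can be condensed into a short labeling routine. The sketch below assumes the overlap ratios of each finding with the block have already been computed; the tuple layout and names are illustrative, not the paper's implementation.

```python
# Hierarchy used to resolve blocks with multiple findings (highest first).
HIERARCHY = ["5", "4C", "4B", "4A", "0", "3", "2"]

def label_block(findings, thr_b=0.5, thr_l=0.5):
    """Assign a BI-RADS category to a block.

    `findings` is a list of (category, ratio_b, ratio_l) tuples, where
    ratio_b is the overlap area divided by the block area and ratio_l is
    the overlap area divided by the lesion area.
    """
    # Case 1: no lesion in the block.
    if not findings:
        return "1"
    # Cases 2 and 3: keep findings that satisfy Expression (3).
    qualified = [c for c, rb, rl in findings if rb >= thr_b or rl >= thr_l]
    if not qualified:
        return "1"
    # Pick the highest category in the hierarchy 5 > 4C > 4B > 4A > 0 > 3 > 2.
    return min(qualified, key=HIERARCHY.index)

# A block half-covered by a 4B lesion that also fully contains a small category-2 lesion:
print(label_block([("4B", 0.55, 0.10), ("2", 0.02, 1.0)]))  # -> 4B
```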

Model Architecture
The model was built based on one of the state-of-the-art models, EfficientNet [24]. As illustrated in Figure 5, the model, made up of a Stem, a Body, a Head and an Output module, takes a mammogram block of size 224 × 224 pixels as an input, that is, an input shape of 224 × 224 × 1. In the Stem module, the input image is first normalized to lie between 0 and 1, and feature maps are then extracted using a 3 × 3 convolution layer. Subsequently, high-level feature maps are extracted in the Body module, which consists of 16 mobile inverted bottleneck convolution (MBConv) blocks [25]. Finally, the feature maps are classified in the Head and Output modules.

A Swish activation function [26], expressed as

f(x) = x × sigmoid(x)    (4)

is used in the Activation-Swish block. Compared with ReLU, a Swish activation function improves the performance of a neural network in most cases. Table 4 summarizes all the modules contained in Figure 5.
Figure 6 gives detailed flowcharts of the MBConv-A and B blocks in Figure 5. An MBConv block is mainly composed of an expansion layer, a depthwise layer and a squeeze-and-excitation network (SENet) [27], where Ce = Ci × Re, and Re represents the expansion ratio, as tabulated in Table 4. Accordingly, Cd = Ci if Re = 1, and Cd = Ce otherwise. Additionally, Table 4 gives the kernel size and the stride for each DepthwiseConv. For stride = 1, the output shape of a feature map is equal to its input shape, that is, (Wd, Hd) = (Wi, Hi); for stride = 2, the output shape is half of the input shape. The values of the parameters Wd, Hd and Co can be referenced in Table 4.
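As a quick numerical illustration, the Swish function f(x) = x × sigmoid(x) can be compared against ReLU as follows; this is a standalone sketch, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # f(x) = x * sigmoid(x): smooth and non-monotonic, unlike ReLU,
    # which is max(0, x) and zeroes out all negative inputs.
    return x * sigmoid(x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(np.round(swish(x), 4))   # small negative outputs survive for x < 0
print(np.maximum(0.0, x))      # ReLU discards them entirely
```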
The SENet module is detailed in Figure 7. A feature map is downsized from W × H × C to 1 × 1 × C in the squeeze module. To take an arbitrary-sized feature map as an input, the two fully connected layers are replaced with two convolutional layers with a kernel size of 1 × 1 in the excitation module, and Cs = Ci × Rs, where Ci represents the one in the MBConv block, and Rs represents a user-specified ratio that is set to 0.25. Each channel of the input is weighted non-uniformly by multiplying the input by the output of the excitation module, so as to reflect the significance of each channel feature.
Finally, a categorical cross-entropy loss function was used to train the model with a batch size of 128 over 350 epochs, and a Ranger optimizer [28] was used to improve the training performance. Table 5 lists the development environment of this work.
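The squeeze-excitation-scale pipeline can be sketched in a few lines. On a 1 × 1 × C tensor, a 1 × 1 convolution reduces to a matrix multiply, so plain matrices stand in for the two convolutional layers below; the weights are random, and a ReLU between the two layers is assumed, as in the original SENet design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-excitation over a feature map x of shape (H, W, C).

    w1: (C, Cs) and w2: (Cs, C) play the role of the two 1x1 convolutions.
    """
    # Squeeze: global average pooling, H x W x C -> C
    z = x.mean(axis=(0, 1))
    # Excitation: reduce to Cs = C * Rs channels, then restore to C
    s = sigmoid(np.maximum(0.0, z @ w1) @ w2)   # per-channel weights in (0, 1)
    # Scale: weight each channel of the input non-uniformly
    return x * s

C, Rs = 16, 0.25
Cs = int(C * Rs)                                # squeeze ratio Rs = 0.25
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, C))
y = se_block(x, rng.standard_normal((C, Cs)), rng.standard_normal((Cs, C)))
```

Because the excitation output lies strictly between 0 and 1, every channel of `y` is a damped copy of the corresponding channel of `x`, which is exactly the non-uniform channel weighting described above.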

Experimental Results
A confusion matrix for an eight-class classification system and four performance metrics for each class, including the sensitivity, specificity, precision and F1-score, were evaluated to quantify the model performance. Then, the mean value of each performance metric and the overall accuracy were found.
In Figure 8, an 8 × 8 confusion matrix is used to illustrate how all the performance metrics were evaluated, taking type 6 (BI-RADS category 4B) as an example. True positive (TP) and false positive (FP) counts represent lesions accurately classified and misclassified as category 4B, respectively. Likewise, true negative (TN) and false negative (FN) counts represent lesions accurately classified and misidentified as a category other than category 4B, respectively.
Accordingly, all the performance metrics are given, respectively, by

Sensitivity_k = TPR_k = TP_k / (TP_k + FN_k)    (5)
Specificity_k = TNR_k = TN_k / (TN_k + FP_k)    (6)
Precision_k = PPV_k = TP_k / (TP_k + FP_k)    (7)
F1-score_k = 2 × (Precision_k × Sensitivity_k) / (Precision_k + Sensitivity_k)    (8)

where 1 ≤ k ≤ CNum = 8 indicates that a lesion is classified as the kth category in the hierarchy 0, 1, 2, 3, 4A, 4B, 4C, 5, e.g., category 2 for k = 3. The sensitivity, specificity and precision are also referred to as the true positive rate (TPR), true negative rate (TNR) and positive predictive value (PPV), respectively. The mean value of each performance metric in Equations (5)-(8) and the overall accuracy are respectively given by

Mean_metric = (1 / CNum) × Σ_{k=1}^{CNum} metric_k    (9)
Accuracy = (Σ_{k=1}^{CNum} TP_k) / TNum    (10)

where TNum represents the number of the test data.
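The per-class metrics and overall accuracy defined above can be computed directly from a confusion matrix. The 3 × 3 matrix below is purely illustrative and is not the paper's data; the same routine applies unchanged to the 8 × 8 BI-RADS case.

```python
import numpy as np

def per_class_metrics(cm):
    """Compute per-class sensitivity, specificity, precision and F1-score
    from a confusion matrix cm, where cm[i, j] counts class-i samples
    predicted as class j."""
    total = cm.sum()
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp          # row sum minus diagonal
    fp = cm.sum(axis=0) - tp          # column sum minus diagonal
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)      # TPR
    specificity = tn / (tn + fp)      # TNR
    precision = tp / (tp + fp)        # PPV
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = tp.sum() / total       # overall accuracy
    return sensitivity, specificity, precision, f1, accuracy

# Illustrative 3-class confusion matrix (rows: ground truth, columns: prediction).
cm = np.array([[50, 2, 3],
               [4, 40, 1],
               [2, 2, 46]])
sens, spec, prec, f1, acc = per_class_metrics(cm)
```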
Performance testing was conducted using the 85,683 pieces of test data tabulated in Table 3, leading to the confusion matrix in Figure 9 and the performance metrics in Table 6. Subsequently, a receiver operating characteristic (ROC) curve was plotted for each BI-RADS category in Figure 10, together with the corresponding area under curve (AUC) value. The performance of this work is clearly indicated by an average sensitivity of 95.31%, an average specificity of 99.15%, an average precision of 94.93%, an average F1-score of 95.11%, an average AUC of 97.23% and an overall accuracy of up to 94.22%. In each of the cases of BI-RADS category 0, 4A, 4B, 4C and 5 lesions, the sensitivity, specificity and precision exceeded 98%, 99% and 96%, respectively. This validates that such lesions can be well classified using this work, so that early-stage breast cancer can be diagnosed more accurately.
In the cases of BI-RADS category 2 and 3 lesions, all the performance metrics lay above 92%, slightly below those in the above-referred five cases. The worst performance occurred in the case of BI-RADS category 1, where the sensitivity and precision hit 81.22% and 85.91%, respectively, for the following reason: all the lesion-free block images were classified as BI-RADS category 1, leading to non-distinctive features that were difficult to classify. A deeper investigation revealed that the sensitivity in the BI-RADS category 1 case was actually a function of the thresholds thrB and thrL in Expression (3). This is because a block image classified as BI-RADS category 1 in some cases in fact contained a small portion of a lesion, which had a negative effect on the training of the presented model. Additionally, each of the other performance metrics is also a function of thrB and thrL.
The outperformance of this model is indicated by an overall accuracy of 94.22%, an average sensitivity of 95.31% and an average specificity of 99.15%. As can be found in Figure 11, there is good agreement between the red-framed ground truth and the blocks highlighted in color in each of the mammograms in Figure 11a-f, where the findings were classified as BI-RADS categories 2, 3, 4A, 4B, 4C and 5, respectively.
Finally, Table 7 lists the task and performance comparisons between the presented study and previous studies on breast cancer detection, in order to reveal the contribution of this work. Ave_Sen, Ave_Spe and Acc represent the average sensitivity, average specificity and accuracy, respectively.
Figure 11. Comparisons between findings labeled by radiologists (framed in red) and blocks highlighted in color in the cases of BI-RADS category 2, 3, 4A, 4B, 4C and 5 lesions in (a-f), respectively.

Conclusions
This paper presented a DNN-based model to efficiently and reliably locate and classify breast lesions from mammograms. Block-based images, segmented from collected mammograms, were used to adequately train the model, by which the workload of radiologists can be significantly eased, particularly when interpreting mammograms in a large-scale breast cancer screening program. For the first time in the literature, breast lesions can be completely classified into BI-RADS categories 0, 1, 2, 3, 4A, 4B, 4C and 5. The outperformance of this model was indicated by an overall accuracy of 94.22%, an average sensitivity of 95.31%, an average specificity of 99.15% and an average AUC of 0.9723. When applied to breast cancer screening for Asian women, who are more likely to have dense breasts, this model is expected to give a higher accuracy than others in the literature, since it was trained using mammograms taken from Taiwanese women.
It is worth mentioning that this work can provide three benefits for healthcare industries. First, the developed tool can help radiologists with mammographic interpretation in clinical works and can improve the efficiency of mammogram interpretation as well. Second, the workload of radiologists can be reduced remarkably. Third, the tool can assist general physicians with interpreting mammograms due to a shortage of radiologists or breast surgeons in most remote areas.
As the next step, our team aims to upsize the collected dataset so as to better train the model and advance its generalization ability. In the meantime, we are making continuous efforts to improve the model performance, particularly in the worst-performing BI-RADS category 1 case. Finally, we will test the generalization ability of this model in an inter-hospital project.

Informed Consent Statement:
Informed consent was waived because all personally identifiable data were deleted.

Data Availability Statement:
The data presented in this paper are not publicly available at this time but may be obtained from the corresponding author upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.