Deep Learning Analysis of Mammography for Breast Cancer Risk Prediction in Asian Women

The purpose of this study was to develop a mammography-based deep learning (DL) model for predicting the risk of breast cancer in Asian women. This retrospective study included 287 examinations in 153 women in the cancer group and 736 examinations in 447 women in the negative group, obtained from the databases of two tertiary hospitals between November 2012 and March 2022. All examinations were labeled as either dense breast or nondense breast and then randomly assigned to a training, validation, or test set. Two DL models, referred to as the image-level and examination-level models, were developed. Both models were trained to predict whether a breast would develop cancer, using two datasets: the whole dataset and the dense-only dataset. The performance of the DL models was evaluated using the accuracy, precision, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUC). On the test set, the four scenarios (image-level model with the whole dataset, image-level model with the dense-only dataset, examination-level model with the whole dataset, and examination-level model with the dense-only dataset) achieved AUCs of 0.71, 0.75, 0.66, and 0.67, respectively. Our DL models using mammograms have the potential to predict breast cancer risk in Asian women.


Introduction
Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer deaths in women worldwide [1]. Mammography is the primary imaging modality for breast cancer screening. Randomized controlled trials and incidence-based studies have reported the benefits of routine mammography screening in reducing breast cancer mortality [2,3].
However, a "one-size-fits-all" approach for screening may have the risk of underdiagnosis of breast cancer, especially in women with dense breasts. The sensitivity of mammography depends on breast density, which refers to the amount of fibroglandular tissue compared with that of fatty tissue in the breast [4][5][6][7]. According to the Breast Cancer Surveillance Consortium, the sensitivity of mammography decreased from 86-89% in women with fatty breasts to 62-68% in women with dense breasts [5,8]. More importantly, mammographic breast density is an independent risk factor for breast cancer [9,10]. Studies have shown an association between dense breast tissue and increased risk of breast cancer in both Western and Asian women [11][12][13]. Women with extremely dense breasts are 4-6 times more likely to develop breast cancer compared with those with fatty breasts [5,9,14].
In recent decades, several risk prediction models for breast cancer have been developed [15][16][17][18]. Initial studies were mainly based on clinical risk factors such as age at menarche, age at first childbirth, hormone replacement therapy use, and family history of breast cancer [18,19]. Because more recent studies have suggested that mammographic breast density is the strongest risk factor, breast density has been incorporated into the Gail model and the Tyrer-Cuzick model, which has improved model performance [20,21].
With the development of artificial intelligence (AI), studies have applied a deep learning algorithm to breast cancer risk assessment using information on mammograms [22][23][24][25]. A mammography-based deep learning risk model has shown superior accuracy in predicting breast cancer risk compared with traditional risk models across seven global institutions [22,23]. However, the existing risk prediction models were developed based on predominantly Western populations [17][18][19][20][21][22][24,25], and there are insufficient data on risk assessment models for Asian women. It is well known that the breast density and age-specific incidence of breast cancer differ between women in Asian and Western countries. High breast density is more frequent in Asian women compared with Western women [26,27]. In East Asian women, the peak age of breast cancer is in their 40s and 50s, while in Western women, it is in their 60s and 70s [28].
Given the differences in mammographic and clinical characteristics between Asian and Western women, developing a risk prediction model for Asian women is needed. The purpose of this study was to develop a deep learning model based on mammograms alone for predicting breast cancer risk in Asian women.

Data Collection
This retrospective study was approved by the Institutional Review Board of our institution (IRB No. 2020-02-033-003) and the requirement for written informed consent was waived. We collected consecutive digital mammograms between November 2012 and March 2022 at two tertiary hospitals. We also obtained data on patients with breast cancer from electronic medical records and pathology reports. The cancer group included mammograms before cancer diagnosis in patients with breast cancer diagnosed at each hospital. We excluded mammograms within 1 year before breast cancer diagnosis and women with a prior history of breast cancer in the same breast. However, women with a history of contralateral breast cancer were included. The negative group included all screening mammograms in women who had at least 5-year follow-up. Only mammograms with negative or benign results (Breast Imaging Reporting and Data System [BI-RADS] category 1 or 2) were included based on the mammography report in both groups. If a woman had multiple mammographic examinations during the study period, each examination was independently included as the index mammography. In addition, 17 examinations were excluded due to technical issues in image preprocessing for deep learning.
Finally, our study included 287 examinations in 153 women (hospital A, 189 examinations in 96 women; hospital B, 98 examinations in 57 women) in the cancer group and 736 examinations in 447 women (hospital A, 252 examinations in 161 women; hospital B, 484 examinations in 286 women) in the negative group.

Mammographic Examinations and Data Categorization
Full-field digital mammography was acquired using two different machines (Hologic and GE HealthCare) at two hospitals. Standard mammography included craniocaudal and mediolateral oblique views of each breast. The final assessment category and breast density based on the BI-RADS system were determined based on the mammography report [29]. We classified all mammographic examinations into two categories (dense and nondense) based on the BI-RADS density grading system. The BI-RADS grades 1 or A (almost entirely fatty) and 2 or B (scattered areas of fibroglandular tissue) were considered nondense, and the BI-RADS grades 3 or C (heterogeneously dense) and 4 or D (extremely dense) were considered dense.
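The dense/nondense grouping described above can be expressed as a small helper. This is an illustrative sketch only; the function name and the returned label strings are ours, not part of the study's software:

```python
def classify_density(birads_grade: str) -> str:
    """Map a BI-RADS density grade (1-4 or A-D) to 'nondense' or 'dense'.

    Grades 1/A (almost entirely fatty) and 2/B (scattered fibroglandular
    tissue) are nondense; grades 3/C (heterogeneously dense) and 4/D
    (extremely dense) are dense.
    """
    nondense = {"1", "A", "2", "B"}
    dense = {"3", "C", "4", "D"}
    grade = birads_grade.strip().upper()
    if grade in nondense:
        return "nondense"
    if grade in dense:
        return "dense"
    raise ValueError(f"Unknown BI-RADS density grade: {birads_grade!r}")
```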
All mammograms were saved as digital imaging and communications in medicine (DICOM) images and uploaded to the cloud system for data processing. In the cancer group, only images of the breast where cancer occurred were uploaded for each examination. In the negative group, all four full-field digital mammograms were uploaded. All images were labeled as "cancer" or "negative" and "dense" or "nondense." After image preprocessing, all mammograms were randomly assigned to either training, validation, or test sets. We split the dataset by women, so each woman only contributed mammograms to one set. A flowchart illustrating the construction and distribution of whole datasets is shown in Figure 1.
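The per-woman split described above, in which every examination of a woman lands in exactly one set, can be sketched as follows. The `woman_id` field, the split ratios, and the seed are illustrative assumptions, not the study's actual implementation:

```python
import random

def split_by_woman(exams, ratios=(0.7, 0.15, 0.15), seed=42):
    """Assign all examinations of a woman to the same split.

    exams: list of dicts, each with a 'woman_id' key (hypothetical schema).
    Returns a dict mapping 'train'/'val'/'test' to lists of examinations.
    """
    women = sorted({e["woman_id"] for e in exams})
    rng = random.Random(seed)
    rng.shuffle(women)
    n_train = int(ratios[0] * len(women))
    n_val = int(ratios[1] * len(women))
    assignment = {}
    for i, woman in enumerate(women):
        if i < n_train:
            assignment[woman] = "train"
        elif i < n_train + n_val:
            assignment[woman] = "val"
        else:
            assignment[woman] = "test"
    splits = {"train": [], "val": [], "test": []}
    for e in exams:
        splits[assignment[e["woman_id"]]].append(e)
    return splits
```

Splitting by woman rather than by examination prevents leakage of near-identical mammograms of the same breast across the training and test sets.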

Model Development
In this study, we prepared two separate datasets to investigate whether prior mammograms could predict the risk of breast cancer: the whole dataset, which included all mammograms from women with dense and nondense breasts, and the dense-only dataset, which included mammograms only from women with dense breasts. Deep convolutional neural networks were trained separately on the whole and dense-only datasets. In addition, deep learning models were trained at both the examination level (two images per breast) and the image level.
We used commercially available software (Neuro-X v3.0.1, Neurocle Inc., Seoul, Republic of Korea) to train the two deep learning models, that is, the image-level and examination-level models. Data augmentation was randomly applied during training as follows: image rotation in 90° increments, horizontal and vertical flipping, hue adjustment from -0.1 to 0.1, brightness from -0.12 to 0.12, and contrast from 0.6 to 1.4.
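The augmentation parameter ranges above can be summarized in a small sketch. The actual augmentation was performed inside the Neuro-X software, so the function below only illustrates how parameters in those ranges might be drawn; the dictionary keys are our own naming:

```python
import random

def sample_augmentation(rng=random):
    """Draw one random set of augmentation parameters in the stated ranges."""
    return {
        "rotation_deg": rng.choice([0, 90, 180, 270]),   # 90-degree rotations
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
        "hue": rng.uniform(-0.1, 0.1),
        "brightness": rng.uniform(-0.12, 0.12),
        "contrast": rng.uniform(0.6, 1.4),
    }
```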
All training processes were performed on a single workstation computer with the Windows operating system (Windows 10, Microsoft, 2015) and using the NVIDIA Quadro RTX 8000 with 48 GB of memory (Nvidia Corporation, Santa Clara, CA, USA).

Statistical Analysis
The deep learning models evaluated each image separately, and the individual image results were then combined into the outcome of a mammographic examination. The trained models were analyzed at both the image level and the examination level to evaluate performance. For image-level analysis, the result was considered cancer if the image was predicted to develop breast cancer. For examination-level analysis, the result was considered cancer if at least one image in the examination was predicted to develop breast cancer; otherwise, the result was considered negative. Consequently, four deep learning models were obtained: an image-level model with the whole dataset, an image-level model with the dense-only dataset, an examination-level model with the whole dataset, and an examination-level model with the dense-only dataset. For each scenario, the model performance was evaluated with the accuracy, precision, sensitivity, specificity, F1 score, and area under the receiver operating characteristic (ROC) curve (AUC). A two-sided 95% confidence interval (CI) was obtained for the sensitivity, specificity, and AUC.
Table note: Numbers are raw data with percentages in parentheses. SD = standard deviation, FGT = fibroglandular tissue, DCIS = ductal carcinoma in situ.
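The examination-level aggregation rule and the reported performance metrics can be sketched as follows. This is an illustrative standard-library implementation (AUC and confidence intervals omitted), not the study's actual analysis code:

```python
def aggregate_examination(image_predictions):
    """An examination is 'cancer' if at least one image is predicted cancer."""
    return "cancer" if any(p == "cancer" for p in image_predictions) else "negative"

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, sensitivity, specificity, and F1 score.

    y_true / y_pred: equal-length lists of 1 (cancer) and 0 (negative).
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }
```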

Performance of Risk Prediction Models
We evaluated the performance of the mammography-based deep learning models for predicting breast cancer risk. The performance measures of the risk prediction models on the test sets are summarized in Table 2. The image-level and examination-level models showed similar sensitivity and specificity in the whole dataset.

Discussion
In this study, we developed a deep learning model using full-field digital mammograms to predict breast cancer risk in Asian women with a high proportion of dense breasts. The image-level and examination-level deep learning models were created and trained with whole and dense-only datasets, respectively. On a test set, the mammography-based risk models showed a reasonable performance in predicting the risk of breast cancer (AUC, 0.66-0.75).

The AUC values of our model were comparable with those of existing image-based risk models [22][23][24]. Yala et al. [22] developed a deep learning model that used mammograms in addition to traditional risk factors (a hybrid deep learning model) to assess breast cancer risk. Although the hybrid model was the best model (AUC, 0.70), a deep learning model based on mammograms alone outperformed the Tyrer-Cuzick model (AUC, 0.68 vs. 0.62). This finding suggests that the mammography-based risk model can provide breast cancer risk assessment when traditional risk factor information is unavailable. Compared with the mammography-based risk model (AUC, 0.65-0.74) developed by Eriksson et al. [24], our risk assessment models achieved similar performance (AUC, 0.66-0.75).
The majority of image-based risk models were developed on predominantly white populations, and thus have limitations in predicting risk for Asian women [22,24,25]. In contrast, our risk models were targeted for only Asian women (Korean women). There is a distinct age distribution of breast cancer among Asian women compared with white women. The incidence of breast cancer among Asian women peaks at age 45-49 years, whereas breast cancer incidence peaks among non-Hispanic white women at age 75-79 years [28]. It has also been shown that mammographic parenchymal patterns differ between Asian and Western women [26,30]. In a study of mammography data on more than one million women [30], Asian women had the highest proportion of dense breast tissue compared with other racial groups. High breast density reduces the sensitivity of screening mammography and can increase the incidence of interval breast cancer because overlapping fibroglandular dense tissue can mask a breast lesion [9]. In addition, breast density itself is a strong risk factor for developing breast cancer [10][11][12].
Simulation modeling studies have shown that screening strategies should be personalized based on a woman's age, breast density, and other risk factors [6,31]. Individual women have different needs for breast cancer screening. All women can be assessed for breast cancer risk based on AI-based or traditional risk models, and then be stratified into average-, moderate-, and high-risk groups [32]. Personalized risk-based screening may be more important for average-risk women with dense breasts to identify the potential candidates for supplemental screening and more frequent screening. Lehman et al. [25] compared the performance of a deep learning image-based risk model with traditional risk models in the screening setting, and found that the deep learning score derived from the woman's prior mammogram outperformed traditional risk models in identifying the subgroup of women with higher cancer burden.
In this study, both image-level and examination-level deep learning models performed better on the dense-only dataset compared with the whole dataset in all performance measures except sensitivity. Our study populations had a high percentage of dense breast tissue. In the whole dataset, the proportions of dense breasts were 72.3% (507 of 701) and 70.6% (113 of 160) in the training and validation sets, respectively.
Our study had several limitations. First, this was a retrospective study with a relatively small sample size. Because we used mammography data obtained from only two institutions, the generalizability of our results may be limited. If a deep learning model is trained with a large dataset collected from multiple institutions, the model performance could be further improved. Second, negative or benign results from the mammography report might include missed or subtle cancers. However, we did not include mammograms within 1 year before breast cancer diagnosis in the cancer group. Lastly, our deep learning models were developed based on the patient's mammographic images alone. The predictive accuracy of deep learning risk models can be further improved by incorporating the genetic test result and clinical risk factors such as age, menopausal status, family history of breast cancer, prior benign biopsy, or use of hormone therapy. However, the mammography-based deep learning model has advantages because it does not require knowledge of those risk factors required by traditional risk models.
In conclusion, we developed a deep learning risk model based on mammography data from Asian women. Our deep learning model demonstrated predictive accuracy comparable to existing risk models in breast cancer risk assessment. Further studies are required to validate our model across institutions in larger datasets. The mammography-based risk model has the potential to support effective risk-based screening in clinical practice.

Institutional Review Board Statement: Our Institutional Review Board (IRB) approved this retrospective study (IRB No. 2020-02-033-003).

Informed Consent Statement: The requirement for written informed consent was waived for this retrospective study.
Data Availability Statement: Data will be provided upon request.