3.1. BPE Classification
Borkowski et al. used a transfer learning approach and trained a deep 2D CNN for standardized, automatic BPE classification [4]. This was the first study to attempt to classify BPE qualitatively by machine learning using whole images, similar to a human reader. Their retrospective study included 11,769 single first post-contrast subtraction MRI images obtained at 3 T from 149 patients. A hierarchical approach implemented transfer learning, with the first model detecting slices that image breast tissue and the second model performing BPE classification. Two board-certified radiologists with over 5 years of breast imaging experience annotated the data, and their consensus served as the reference standard (ground truth) for BPE classification. For the BPE model, training, validation, and testing subsets of 87, 25, and 12 patients were randomly selected. For the testing group, a subset of 100 images (25 for each BPE category) was used for the evaluation. Accuracies of 98%, 96%, and 97% for training, validation, and testing, respectively, were achieved for classifying slices demonstrating breast tissue. Mean accuracies of 74%, 75%, and 75% for the training, validation, and real-world datasets, respectively, were achieved for four-class BPE classification. Inter-reader reliability was 0.780 for radiologist 1, 0.679 for radiologist 2, and 0.815 for the 2D CNN model. The kappa coefficients for the agreement between the 2 experts and between each expert and the model were 0.793 ± 0.15, 0.804 ± 0.14, and 0.768 ± 0.16, respectively. The overall accuracy of the breast detection model was 96%, and that of the BPE model was 75%. The accuracies of reader 1 and reader 2 were 69% and 52%, respectively. The reliability of both readers varied across BPE classes, ranging from 0.47 to 0.71 for the first reader and from 0.24 to 0.49 for the second. The reliability of the second reader was lower than that of the first except for the minimal enhancement class. The reliability of the model exceeded that of both readers except for moderate enhancement.
The main limitations of this study were the small sample size and the single-institution design.
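The kappa coefficients quoted for Borkowski et al. express agreement between two raters after correcting for agreement expected by chance. As an illustration only, the following minimal Python sketch computes unweighted Cohen's kappa for two sets of four-class BPE labels. The function name `cohens_kappa` and the example labels are hypothetical, and the sketch does not assume that the study used the unweighted form; a weighted kappa would treat near-misses between adjacent categories differently.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six slices (illustrative, not study data).
reader = ["minimal", "mild", "moderate", "marked", "mild", "minimal"]
model = ["minimal", "mild", "mild", "marked", "mild", "mild"]
kappa = cohens_kappa(reader, model)
print(round(kappa, 2))
```

In this toy example the raters agree on 4 of 6 slices, but part of that agreement is expected by chance, so kappa is lower than the raw agreement of 0.67.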
Eskreis-Winkler et al. developed two deep learning CNN models based on the VGG-19 architecture for automated BPE classification (Slab AI and MIP AI), using standard-of-care radiology report BPE designations and a three-reader averaged consensus of BPE scores as ground truth [5]. The study included 5224 breast MRI examinations, divided into training, validation, and testing sets. On radiology reports, 1286 exams were categorized as high BPE (i.e., marked or moderate) and 3938 as low BPE (i.e., mild or minimal). Slab AI and MIP AI were tested, and cross-validation was performed to evaluate performance. The Slab AI model significantly outperformed the MIP AI model across the full test set (area under the curve of 0.84 vs. 0.79) using the radiology report reference standard. Using the three-reader consensus BPE labels as the reference standard, the Slab AI model significantly outperformed radiology report BPE labels (AUC of 0.96 versus 0.83) for BI-RADS 1 exams. The AI model was significantly more likely to assign “high BPE” to suspicious breast MRIs and significantly less likely to assign “high BPE” to negative breast MRIs compared to the radiologist. Additionally, the AI tool can autopopulate the BPE result into the report, providing increased accuracy while reducing radiologist workload. The strengths of the study are its large sample size with cross-validation and the use of consensus from three experts as ground truth. The authors concluded that fully automated BPE assessments for breast MRI improve accuracy over BPE assessments from radiology reports.
Ha et al. developed a fully automated CNN method for quantification of breast MRI FGT and BPE [6]. They manually segmented and calculated the amount of FGT and BPE to establish ground truth parameters. Then, a novel 3D CNN modified from the standard 2D U-Net architecture was developed and implemented for voxel-wise prediction of whole breast and FGT margins. Cases were separated into training (80%) and test sets (20%). Fivefold cross-validation was performed. In the test set, the fully automated CNN method for quantifying the amount of FGT yielded an accuracy of 0.813 (cross-validation Dice score coefficient) and a Pearson correlation of 0.975. For quantifying the amount of BPE, the CNN method yielded an accuracy of 0.829 and a Pearson correlation of 0.955. Their CNN was able to quantify FGT and BPE within an average of 0.42 s per MRI case.
Nam et al. developed a fully automated machine-learning algorithm for whole breast FGT segmentation and BPE classification [7]. The study included 594 patients in the development set and 200 patients in the test set. Automated segmentation was performed with a 3D V-NET CNN. Manual segmentation of the contralateral breast was performed for the whole breast and FGT regions. BPE was acquired by thresholding using the subtraction of the pre- and postcontrast T1-weighted images and the segmented FGT mask. Three classification models were trained and tested, including conventional 4-way classification of BPE, 3-way classification of BPE (minimal versus mild/moderate/marked), and 2-way classification (minimal/mild versus moderate/marked). Two radiologists independently assessed the categories of FGT and BPE as ground truth. A deep-learning-based algorithm was designed to segment and measure the volume of whole breast and FGT and classify the grade of BPE. Dice similarity coefficients (DSC) and Spearman correlation analysis were used for evaluation. The mean DSC for manual and deep-learning segmentations was 0.85 ± 0.11. The correlation coefficient was 0.93 for FGT volume from manual and deep-learning derived segmentation methods. Overall accuracies of manual segmentation and deep-learning segmentation were comparable for BPE classification, at 66% and 67%, respectively. Binary categorization of BPE grade (minimal/mild vs. moderate/marked) increased overall accuracy to 91.5% and 90.5% for manual and deep-learning segmentation, respectively, with an AUC of 0.93 for both. Segmentation and classification computation time, including pre- and post-processing of data, was under 2 min per exam. This deep-learning-based algorithm demonstrated similar accuracy to radiologists. Limitations include the single-center nature of the study and the use of radiologists’ subjective assessments as ground truth.
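The Dice similarity coefficient used to compare manual and deep-learning segmentations is twice the overlap of two masks divided by the sum of their sizes. The following is a minimal Python sketch, assuming each segmentation is supplied as a flat sequence of 0/1 voxel labels; the function name `dice_coefficient` and the toy masks are hypothetical and are not taken from any of the reviewed studies.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two flat binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled as foreground in both masks.
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * overlap / total


# Toy six-voxel masks (illustrative only).
manual_mask = [1, 1, 0, 0, 1, 0]
automated_mask = [1, 0, 0, 0, 1, 1]
dsc = dice_coefficient(manual_mask, automated_mask)
print(round(dsc, 3))
```

A value of 1 indicates identical masks and 0 indicates no overlap, so a mean DSC of 0.85 corresponds to substantial but not complete voxel-level agreement.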
3.3. BPE for Predicting Risk
Hu et al. evaluated the association of breast cancer with BPE intensity (BPE_I), BPE volume (BPE_V), and the amount of FGT using an automatic quantitative assessment method in breast MRI [10]. The fully automated method was a 3-step process consisting of whole breast segmentation, FGT segmentation, and enhanced FGT segmentation. Subsequent calculations were performed to obtain FGT, BPE_I, and BPE_V. FGT was calculated as the volume ratio of segmented FGT to breast tissue. BPE_V was derived from the volume ratio of enhanced to unenhanced FGT, and BPE_I was derived from the intensity ratio of enhanced FGT. This retrospective study included 132 healthy women (control group), 132 women with benign breast lesions (benign group), and 132 women with breast cancer (cancer group) matched by age and menopausal status. For women with a benign lesion or a cancerous lesion, the contralateral breast was evaluated to avoid lesion-associated effects. Compared with the control group, the cancer group showed a significant difference in BPE_V, with a maximum AUC of 0.715 and 0.684 for the premenopausal and postmenopausal subgroups, respectively. Compared with the benign group, the cancer group also showed a significant difference in BPE_V, with a maximum AUC of 0.622 and 0.633 for the premenopausal and postmenopausal subgroups, respectively. FGT showed no significant difference when the cancer group was compared with either the control group or the benign group. BPE_I showed a slight difference between the cancer group and the control group, but no significant difference between the cancer group and the benign group. The novelty of the study includes automated assessment of BPE, the use of BPE_V, and accounting for menopausal status. The authors concluded that increased BPE_V correlated with a high risk of breast cancer while FGT did not. A limitation of the study was its small sample size.
Lam et al. retrospectively evaluated a matched case–control cohort of 46 high-risk patients, comprising 23 patients with cancer and 23 controls [11]. A single slice from the DCE acquisition at or near the level of the nipple was selected as representative by an experienced radiologist blinded to outcome. If one breast had suspicious findings, the contralateral breast was used. Qualitative BI-RADS BPE categories were obtained from the radiology report. Semiautomated segmentation of FGT was performed. Percent enhancement was calculated for each voxel in FGT, and BPE maps were generated at percent-enhancement thresholds from 10% to 70% in 10% increments. Each BPE map included only voxels equal to or exceeding the threshold. A total of 10 quantitative parameters were calculated for each BPE map. BPE area ratio had the highest AUC (0.76) among the BPE area parameters. The 25th percentile BPE intensity had the highest AUC (0.77) among the BPE intensity parameters for predicting breast cancer risk. BPE integrated intensity, which combines area and mean BPE, had an AUC of 0.78. BPE area, BPE to FGT ratio, and BPE intensity were associated with breast cancer, while FGT area and whole breast area were not. Quantitative BPE parameters yielded higher AUCs than qualitative BPE, though the difference was not statistically significant. Limitations of this study include the small sample size and the use of only one slice, selected by an experienced radiologist, for quantitative BPE analysis.
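The threshold-based BPE maps described by Lam et al. can be illustrated with a short Python sketch. It assumes that percent enhancement is computed per FGT voxel as 100 × (post − pre) / pre with positive pre-contrast signal, and it reports the fraction of FGT voxels at or above each threshold as one plausible reading of a BPE-to-FGT area ratio. The function names and signal values are hypothetical, and the study's exact parameter definitions may differ.

```python
def percent_enhancement(pre, post):
    """Per-voxel percent enhancement relative to the pre-contrast signal."""
    if any(s0 <= 0 for s0 in pre):
        raise ValueError("pre-contrast signal must be positive in every voxel")
    return [100.0 * (s1 - s0) / s0 for s0, s1 in zip(pre, post)]


def bpe_fraction_by_threshold(pre, post, thresholds=range(10, 80, 10)):
    """Fraction of FGT voxels whose enhancement meets each threshold (10% to 70%)."""
    pe = percent_enhancement(pre, post)
    n = len(pe)
    # A voxel belongs to the BPE map when it equals or exceeds the threshold.
    return {t: sum(1 for v in pe if v >= t) / n for t in thresholds}


# Hypothetical signal intensities for four FGT voxels (not study data).
pre_signal = [100.0, 100.0, 100.0, 100.0]
post_signal = [105.0, 120.0, 150.0, 180.0]
fractions = bpe_fraction_by_threshold(pre_signal, post_signal)
for threshold, fraction in fractions.items():
    print(threshold, fraction)
```

Because the maps are nested, the fraction can only stay the same or fall as the threshold rises, which is why the choice of threshold affects how well a BPE parameter separates cases from controls.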
Niell et al. evaluated a population of high-risk patients using a semiautomated 3D segmentation algorithm to calculate quantitative BPE [12]. A total of 95 high-risk women without a personal history of breast cancer underwent MRI between 2010 and 2016, of whom 19 developed breast cancer. These 19 patients comprised the study group, and 76 age-matched controls were then curated, of whom 62 were selected as free of breast cancer within a two-year follow-up period. Patients who developed breast cancer were three times more likely to have mild, moderate, or marked BPE, based on the radiologist’s assigned category. A maximum AUC of 0.62 was achieved using a BPE threshold greater than minimal. This is in keeping with similar studies, as shown in a meta-analysis of seven studies by Thompson et al., in which high-risk women with mild, moderate, or marked BPE demonstrated twice the risk of those with minimal BPE [13]. Quantitative BPE analysis outperformed radiologist classification, with AUC values of 0.85 and 0.84 using first gadolinium-enhanced phase BPE at 30% and 40% enhancement ratio thresholds, respectively. The percent enhancement ratio was determined by averaging the voxels of enhancement, estimating the percentage of breast tissue that enhances above the threshold value relative to the total breast volume (BPE%). This was done in 10% increments to identify the ideal percent-enhancement threshold. FGT was segmented from chest wall/pectoralis, skin, and fat on pre- and post-contrast images to quantify FGT and BPE. The automated process was reviewed and refined by 2 scientists and then reviewed by a breast radiologist. No difference was seen in FGT category, median total breast volume, FGT volume, or FGT % between patients who developed breast cancer and those who did not. Similarly, volumetric BPE did not differ significantly between patients who developed breast cancer and controls in univariate analysis. However, feature pairs that included volume and intensity of BPE showed greater sensitivity, PPV, and Youden index than univariate BPE%. Limitations of this study include the sample size, the use of semiautomated rather than fully automated segmentation, and the inability to determine reproducibility and variability over time, as only the index MRI was used. The results were similar to those of Lam et al., who reported AUCs up to 0.78 using 10 to 40% thresholds on the first post-contrast images with a 2D quantitative method.
Saha et al. examined a cohort of 133 high-risk women in a retrospective study spanning 9 years (2004–2013; 1039 high-risk patients), of whom 46 developed cancer over a 2-year follow-up period and 87 served as controls [14]. Pre-contrast axial fat-saturated and non-fat-saturated sequences and post-contrast axial fat-saturated sequences acquired at 1.5 or 3 T were used for data pre-processing and feature extraction. Automated segmentation of FGT was obtained with a fuzzy C-means clustering method. Quantitative BPE was derived from 2D and 3D images. A team of 5 breast radiologists independently reviewed exams for BPE according to BI-RADS 5th edition. AUC was calculated using leave-one-out cross-validation. Computer models based on automatically extracted features demonstrated higher AUCs than human readers. Machine learning model 1 (BPE features based on the FGT segmentation on the fat-saturated sequence) demonstrated AUCs of 0.52 to 0.73, and machine learning model 2 (BPE features based on the FGT mask on the non-fat-saturated sequence) demonstrated AUCs of 0.60 to 0.79. Mean and median reader AUCs were 0.49 to 0.70 and 0.51 to 0.69, respectively. Two thresholds were tested: minimal versus mild/moderate/marked and minimal/mild versus moderate/marked. Automatic features outperformed subjective readings at both thresholds for breast cancer prediction. Features derived from the FGT mask obtained from T1 non-fat-saturated pre-contrast images performed better than those from the FGT mask obtained from the first post-contrast DCE images. Limitations include the small sample size, the retrospective nature of the study, and the reliance on a single institution.
Vreemann et al. performed a retrospective study that included negative baseline MRI scans of high-risk patients screened for breast cancer at 1.5 or 3 T between 1 January 2003 and 1 January 2014 [15]. Transverse or coronal T1 GRE pre- and post-contrast images were obtained. Deep learning was implemented to quantitatively measure FGT using pre-contrast T1 GRE DCE images and BPE relative enhancement values using pre-contrast and first post-contrast motion-corrected DCE acquisitions. The fraction of FGT was defined as segmented FGT volume divided by total breast volume. The fraction of BPE relative to the volume of FGT was calculated by counting any voxel with a relative enhancement value of over 10% as BPE. Both breasts were averaged for the final BPE and FGT assessment. The same analysis was repeated using relative enhancement cut-off values of 20, 30, 40, and 50%. During the follow-up period, 60 cancers were diagnosed. Logistic regression using forward selection determined whether there were any relationships between FGT, BPE, cancer detection, false-positive recall, and false-positive biopsy. For recalls that did not lead to biopsy, at least 1 year of clinical follow-up was used as ground truth for benignity. This study found that BPE and FGT were not associated with overall breast cancer development in the high-risk population; however, they were associated with false-positive recalls and false-positive biopsies at baseline, an association that disappeared with subsequent screening rounds. Subgroup analysis of patients who developed breast cancer within 2 years of their baseline exam (17 cancers and 1499 baseline MRIs) showed that only FGT was associated with short-term risk. BRCA mutation carriers had lower FGT and BPE and were younger at the baseline scan. Similarly, two earlier studies found that increased mammographic breast density or increased FGT on MRI in BRCA-positive women was not predictive [2,3]. This is in contrast to a study by Holm et al. showing that increased density in average-risk women does impact breast cancer risk [16]. Mitchell et al. reported that increased breast density in BRCA-positive patients conferred increased risk of breast cancer, similar to the relative risk in the general population [17]. Limitations include the retrospective nature of the study, the use of only baseline exams, which does not account for changes in BPE on subsequent exams, and potential selection bias from including only exams that were negative at baseline while also including false-positive exams.
Wang et al. aimed to use deep learning to evaluate possible associations between the quantitative properties of breast parenchyma on baseline MRI scans and breast cancer occurrence among women with extremely dense breasts [18]. The study is a secondary analysis of data obtained from the DENSE trial, in which 4553 women aged 50–75 with BI-RADS category D breast density without abnormality on mammography were followed for 6 years to assess for the development of interval breast cancer. The study was performed at multiple institutions in the Netherlands between December 2011 and January 2016. Image analysis was performed on baseline MRI to segment FGT using an nnU-Net technique. Spatiotemporal characteristics of FGT were classified into 15 quantitative features, and each image-processed MRI was then analyzed by neural network to classify the FGT according to MRI feature. The 15 characteristics were then grouped into 5 principal components, and multivariable Cox proportional hazards regression was performed to assess associations between each principal component and breast cancer occurrence. The study determined that there was a statistically significant association between the volume of enhancing parenchyma on baseline MRI (PC1) and breast cancer occurrence (hazard ratio [HR], 1.09; 95% CI: 1.01, 1.18; p = 0.02). Additionally, when stratified into low, intermediate, and high volume of enhancing parenchyma, there was nearly double the occurrence of breast cancer in the high tertile than in the low tertile (HR, 2.09; 95% CI: 1.25, 3.61; p = 0.005). Other imaging characteristics, such as early contrast uptake of parenchyma, shape of parenchyma, late contrast uptake of parenchyma, and breast density on MRI, were not significantly associated with breast cancer occurrence. Major strengths of this study are its prospective design and large sample size. The study, however, only examines women with extremely dense breasts and is not generalizable to a larger population. Additionally, a follow-up time of 6 years limits the evaluation of late breast cancer development, and no racial or ethnic data were obtained. The study concluded that quantitative parenchymal features on baseline dynamic contrast-enhanced MRI scans are independent predictors of breast cancer occurrence in women with extremely dense breasts.
Watt et al. aimed to use a fully automated quantitative measure of background parenchymal enhancement (BPE) and examine its association with odds of breast cancer development in patients undergoing breast MRI [19]. The study is a prospective multicenter study based in the USA, with a population of patients who had received a breast MRI between November 2010 and July 2017. The study population included both patients undergoing high-risk screening MRI and patients undergoing diagnostic breast MRI as part of workup. Of the 1476 patients in this group, 536 had a subsequent breast cancer occurrence, and 940 were included in the control group. MRI was obtained using the standard protocol at each of the clinical sites. Subsequently, the MRI images were analyzed using a fully automated computational method as detailed by Wei et al. 2020, with scores generated for BPE extent and BPE intensity. Calculated results were compared with the assessment of a single board-certified radiologist blinded to case–control status and clinical data. Multivariable logistic regression was used to assess the association between breast cancer occurrence and tertiles of BPE extent, BPE intensity, FGT volume, and fat volume. This study determined that calculated BPE extent, defined as the proportion of FGT voxels with enhancement of 20% or more, was positively associated with radiologist-determined BI-RADS BPE (rs = 0.54; p < 0.001). In addition, participants with high calculated BPE extent had 74% increased odds of breast cancer (odds ratio, 1.74; 95% CI: 1.23, 2.46) relative to participants with low BPE extent. This study was able to demonstrate the concordance of calculated BPE extent with radiologist-determined BI-RADS BPE, lending credence to the use of automated BPE calculation methods. The study is limited, however, by the disproportionate number of participants in the control group undergoing MRI for high-risk screening, who therefore had a greater burden of known breast cancer risk factors than the case participants did. Overall, this study demonstrates the non-inferiority of calculated BPE scores for clinical use and the increased odds of breast cancer occurrence in those in the study population with high BPE extent.
Zhang et al. studied 80 women with early-stage invasive breast cancer who underwent MRI [20]. Mean age was 51.1 years. All had MRI on a 1.5 or 3 T magnet with pre-contrast and 3 post-contrast phase acquisitions. FGT was automatically extracted from the contralateral breast using a k-means clustering algorithm on the pre-contrast T1-weighted images. Risk stratification by Oncotype Dx Recurrence Score (ODxRS), in which low risk requires no chemotherapy, showed that 46 patients (58%) were low risk and 34 (42%) were intermediate or high risk. Overall BPE score showed no significant difference (p = 0.642). Intermediate/high-risk patients were more likely to be HER2+ (p = 0.029), to have higher nuclear grade (p = 0.030), and less likely to be on hormonal treatment (p = 0.004) relative to the low-risk group. Three measures were calculated, with pre-contrast T1 used as baseline: initial enhancement (IE), the percentage increase in signal from pre-contrast to first post-contrast relative to pre-contrast; overall enhancement (OE), the percentage increase from pre-contrast to last post-contrast relative to pre-contrast; and late enhancement (LE), the percentage increase from first post-contrast to last post-contrast relative to first post-contrast. Pixels with negative signal post-contrast or showing over 300% enhancement were deemed not physiologic and removed. Correlation of BPE with the continuous ODxRS for median initial/overall/late enhancement yielded p-values of 0.07, 0.05, and 0.13, with an AUC p-value of 0.06. BPE in the contralateral breast correlated with ODxRS. The top 10% of BPE showed even stronger correlation for IE, OE, LE, and AUC, yielding p-values of 0.02, 0.02, 0.22, and 0.02, respectively. When ODxRS was dichotomized into low-risk versus intermediate/high-risk, rather than treated as a continuous variable, an even greater association with the mean of the top 10% of pixels for BPE was demonstrated. Limitations of this study are its small sample size, retrospective nature, and focus on a single institution.
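The three enhancement measures used by Zhang et al. can be made concrete with a small Python sketch. It assumes that LE is measured from the first to the last post-contrast phase relative to the first post-contrast signal, and that the non-physiologic filter applies to IE and OE (negative or above 300%) but not to LE, which may legitimately be negative when signal washes out. The function name, argument names, and signal values are hypothetical.

```python
from statistics import median


def enhancement_measures(pre, first_post, last_post, max_pct=300.0):
    """Median IE, OE, and LE (percent) over voxels passing a physiologic filter."""
    ie, oe, le = [], [], []
    for s0, s1, s_last in zip(pre, first_post, last_post):
        if s0 <= 0 or s1 <= 0:
            continue  # a ratio cannot be formed for this voxel
        initial = 100.0 * (s1 - s0) / s0      # pre -> first post, relative to pre
        overall = 100.0 * (s_last - s0) / s0  # pre -> last post, relative to pre
        late = 100.0 * (s_last - s1) / s1     # first post -> last post (assumed)
        # Assumed filter: drop negative or > max_pct initial/overall enhancement.
        if min(initial, overall) < 0 or max(initial, overall) > max_pct:
            continue
        ie.append(initial)
        oe.append(overall)
        le.append(late)
    if not ie:
        raise ValueError("no voxels passed the physiologic filter")
    return {"IE": median(ie), "OE": median(oe), "LE": median(le), "n_kept": len(ie)}


# Hypothetical signals for four FGT voxels (not study data).
pre_t1 = [100.0, 100.0, 100.0, 100.0]
post_first = [120.0, 150.0, 90.0, 500.0]
post_last = [130.0, 140.0, 95.0, 600.0]
measures = enhancement_measures(pre_t1, post_first, post_last)
print(measures)
```

In this toy input the third voxel is dropped for negative enhancement and the fourth for exceeding 300%, so the medians are taken over the two remaining voxels.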
Table 1.
Classifying BPE.
Author (Year) | Data Sources | Data Type | # of Pts | Ground Truth | Pre-Process | ML Method | Training, Validation, Testing | AUC | Accuracy | Specificity | Sensitivity |
---|---|---|---|---|---|---|---|---|---|---|---|
Borkowski (2020) [4] | University Hospital Zurich, Switzerland | 1st post subtraction | 87 training, 25 validation, 12 testing | 4 classes; 2 radiologists | | VGG16 | 70%, 20%, 10% | | 75% | | |
Eskreis-Winkler (2023) [5] | Memorial Sloan Kettering, USA | 1 pre, 3 post, 1st post subtraction FS DCE used as input | High risk, 3705 patients | 2 classes (high/low); Ref #1: report; Ref #2: 3-reader averaged consensus | Slab and MIP; k-means clustering for segmentation | VGG-19 | 77%, 8%, 15% (2 test sets: BR1 and BR4/5) | Ref #1: Slab AI: 0.84; MIP AI: 0.79 | Ref #2: Slab: 94%; Rep: 88% | Slab: 97%, MIP: 86%, Rep: 90% | Slab: 77%, MIP: 57%, Rep: 80% |
Ha (2019) [6] | Columbia University | T1 pre, post, sub | 137 patients | Semiautomated masks inspected by radiologist | | U-Net | 80%, 20%, none | | BPE: 82.9% | | |
Nam (2021) [7] | Seoul St. Mary’s Hospital, S Korea | T2w, DCE | 594 training, 200 testing | 2 classes; 4 classes; 2 radiologists; consensus if disagree | | V-NET | 75%, 25%, none | 0.93 (min versus mild/mod/marked) | 4-class: 67%; 2-class: 91% | | |
Table 2.
Predicting recurrence.
Author (Year) | Data Sources | Data Type | # of Pts | Ground Truth | Pre-Process | ML Method | Training, Validation, Testing | AUC | Accuracy | Specificity | Sensitivity |
---|---|---|---|---|---|---|---|---|---|---|---|
Arefan (2024) [8] | UPMC, University of Pittsburgh School of Medicine, USA | First post | 127 training, 60 testing; unifocal inv BC ER+, node-Oncotype DX recurrence score > 10 yr F/U | Recurrence using BPE and radiomics | | Automated BPE vol (BPE-v) | 68%, 32%, none | 0.79 high versus low/intermed risk Oncotype Dx recurrence score | NPV = 0.97 | | |
Moliere (2019) [9] | Strasbourg University Hospital, Strasbourg, France | Pre, 5 post | 102 patients treated with NAC with pre and post NAC MRIs | Post Rx BPE; 2 independent readers | SA FGT seg; threshold based on fuzzy C-means | 20% threshold to calculate BPE-v | | Pre BPE not predictive of PCR or recurrence | | Post BPE RPR: 84% | Post BPE RCR: 93% |
Table 3.
Association of BPE and breast cancer risk.
Author (Year) | Data Sources | Data Type | # of Pts | Ground Truth | Pre-Process | ML Method | Training, Validation, Testing | AUC or Hazard/Odds Ratio |
---|---|---|---|---|---|---|---|---|
Hu (2021) [10] | Fudan University Shanghai, China | Pre, 3 post | 132 control, 132 benign, 132 cancer | Calculated BPEv | Fuzzy C-means clustering FGT seg | | | 0.68 to 0.71 |
Lam (2019) [11] | U of Washington School of Medicine, USA | pre-, 2 post | high risk 23 cancer 23 ctrls | | SA FGT seg | BPE maps of varying PE thresholds (10/20/..70%) | | 0.78 BPE area 0.70 (BI-RADS) |
Niell (2020) [12] | Moffitt Cancer Center Research Institute, USA | Pre, 4 Post | 95 high risk pts 19 cases 76 controls | | | | 80%, 20%, none | 0.85 for threshold at 30% PE |
Saha (2019) [14] | Duke University School of Medicine, USA | Pre, First Post | high risk 46 cases 87 ctrls | 5 independent breast radiologists | MIP | | Leave one out cross-validation | 0.70 |
Vreemann (2019) [15] | Radboud University Medical Center, Netherlands | Pre, 5 post; 1st post used for BPE | high risk 60 cancer 1473 controls | biopsy-proven cancer or negative on clinical exam | U-net (DL) for FGT segmentation | BPE at cutoffs of 20, 30, 40, 50% relative enh values | | BPE not predictive for breast cancer in high-risk pts. |
Wang (2023) [18] | Multicenter, DENSE clinical trial, Netherlands | Pre, 4 or 5 post | 122 cancer, 4431 ctrls | Breast radiologists | nnUNet for FGT segmentation | Quantitative features including enhancement characteristics | | BPE HR = 1.09 [95% CI: 1.01, 1.18] |
Watt (2023) [19] | Multicenter, IMAGINE clinical trial, USA | Pre, Post, Sub 1st post | 536 cancer 940 ctrls | single board-certified radiologist | | Automated method; proportion of voxels enhancing > 20% | | High BPE OR = 1.74 [95% CI: 1.23, 2.46] |
Zhang (2021) [20] | Memorial Sloan Kettering, USA | Pre, 3 post | 46 low risk 34 interm/high risk by Oncotype Dx. Early-stage invasive BC | Manually by radiologist. Association of BPE with Oncotype DX | k-means clustering algorithm to extract FGT | | | |