Validation of Breast Cancer Risk Models by Race/Ethnicity, Family History and Molecular Subtypes

Simple Summary
Several statistical models exist to predict a person’s risk of breast cancer. Risk assessment models can guide cancer screening approaches by identifying individuals who would benefit from additional screening. In this study, we compared the performance of four models in predicting the 5-year risk of breast cancer in a cohort of women aged 40–84 years who underwent screening mammography at three large health systems. Models showed comparable discrimination (ability to distinguish between cases and non-cases) and calibration (ability to accurately predict risk) overall, with no difference by race. Model discrimination was poorer for some cancer subtypes, and better for women with high BMI. The combined BRCAPRO+BCRAT model had improved calibration and discrimination among women with a family history of breast cancer. Our results can inform risk-based screening approaches by identifying women at a high risk of breast cancer.

Abstract
(1) Background: The purpose of this study is to compare the performance of four breast cancer risk prediction models by race, molecular subtype, family history of breast cancer, age, and BMI. (2) Methods: Using a cohort of women aged 40–84 without prior history of breast cancer who underwent screening mammography from 2006 to 2015, we generated breast cancer risk estimates using the Breast Cancer Risk Assessment Tool (BCRAT), BRCAPRO, Breast Cancer Surveillance Consortium (BCSC) and combined BRCAPRO+BCRAT models. Model calibration and discrimination were compared using observed-to-expected ratios (O/E) and the area under the receiver operating characteristic curve (AUC) among patients with at least five years of follow-up. (3) Results: We observed comparable discrimination and calibration across models. There was no significant difference in model performance between Black and White women. Model discrimination was poorer for HER2+ and triple-negative subtypes compared with ER/PR+HER2−. The BRCAPRO+BCRAT model displayed improved calibration and discrimination compared to BRCAPRO among women with a family history of breast cancer. Across models, discriminatory accuracy was greater among obese than non-obese women. When defining high risk as a 5-year risk of 1.67% or greater, models demonstrated discordance in 2.9% to 19.7% of patients. (4) Conclusions: Our results can inform the implementation of risk assessment and risk-based screening among women undergoing screening mammography.


Introduction
While breast cancer mortality has fallen over the past decade in the U.S., it remains the second leading cause of cancer death among women, with 43,600 breast cancer deaths projected in 2021 [1]. Identifying patients at high risk of developing breast cancer could allow preventive and screening interventions to be targeted to mitigate risk and further reduce mortality. Specifically, risk-based screening approaches that tailor screening initiation, interval, and supplemental screening to individual risk may increase the benefits and reduce the harms of screening. Multiple validated risk assessment models have been developed to quantify an individual woman's risk of developing breast cancer [2]. The Breast Cancer Risk Assessment Tool (BCRAT, also known as the Gail model) utilizes age, race/ethnicity, history of breast biopsy and atypical hyperplasia, first-degree family history of breast cancer, age at menarche, and age at first birth to estimate risk [3][4][5][6]. The Breast Cancer Surveillance Consortium (BCSC) model [7] utilizes age, race/ethnicity, first-degree family history of breast cancer, breast biopsy and benign breast disease, and breast density, as measured by the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS) [8]. The BRCAPRO model [9] was developed in the setting of women undergoing genetic counseling and uses a detailed family history of breast and other cancers to estimate both the risk of carrying a BRCA1/2 mutation and the risk of breast cancer. Recently, the BRCAPRO model was combined with the BCRAT model to create the BRCAPRO+BCRAT model [10], which incorporates both the factors in the BCRAT model and detailed family history.
Our prior work compared the BCRAT, BCSC, and BRCAPRO models in a cohort of women undergoing mammography screening at a single institution [11]. We found comparable moderate discrimination and good calibration of these three models; however, we were unable to assess differences in model performance by race/ethnicity due to the lack of diversity in the cohort. Few studies have evaluated the performance of breast cancer risk models by race/ethnicity. Additionally, while we found poorer predictive accuracy of the models for HER2+ and triple-negative breast cancers, the numbers of cancers of these subtypes were small, limiting our ability to draw strong conclusions.
The purpose of the current study is to compare the performance of breast cancer risk models for use in the setting of mammography screening, and specifically to evaluate model performance among subgroups defined by race/ethnicity, by molecular subtypes, by family history, and by obesity. Additionally, we performed validation of the recently developed BRCAPRO+BCRAT model [10], utilizing three large mammography screening cohorts. We evaluated the performance of BCRAT, BCSC, BRCAPRO, and BRCAPRO+BCRAT models because these models utilize risk factors that are routinely collected during the course of clinical care in these three health systems, and therefore the integration of one or more of these risk models into clinical care for decision making is potentially feasible.

Study Population
We assembled three cohorts of women presenting for screening mammography at Massachusetts General Hospital (MGH), Newton-Wellesley Hospital (NWH), and the University of Pennsylvania Health System (UPenn) from 2006 to 2015, 2006 to 2015, and 2011 to 2015, respectively. All mammograms included were digital or digital breast tomosynthesis... The following patient-reported risk factors were collected via questionnaire at the time of the mammogram: age, race/ethnicity, age at menarche, age at first birth, body mass index (BMI), history of breast biopsy, history of atypical hyperplasia or benign breast findings, and family history of breast cancer. Breast density measurements were pulled from radiology reports in electronic medical records (EMR) and were classified based on the American College of Radiology's Breast Imaging-Reporting and Data System (BI-RADS). The EMR was also used to supplement missing survey information on the history of breast biopsy, atypical hyperplasia, BMI, and history of prior breast cancer. At MGH, the questionnaire asked whether women had ever had "benign tissue removed from the breast", which was considered as evidence of atypical hyperplasia or benign breast findings. At MGH and NWH, we additionally included data from pathology reports on diagnoses of atypical hyperplasia, lobular carcinoma in situ (LCIS), or lobular neoplasia. At UPenn, women were asked whether they had previously been diagnosed with atypical hyperplasia. Additional detail about benign breast conditions was not available for UPenn. Missing BMI specifically was supplemented with the closest measurement from the EMR within 1 year prior to or 6 months after the mammogram. The presence of BRCA1 and BRCA2 pathogenic genetic mutations was also collected from linkage with genetic counseling records.
We included the first mammogram visit over the time period for each woman, using covariates collected at the time of the first mammogram visit in the cohort for analysis (N = 91,094 at MGH; N = 54,032 at NWH; N = 48,035 at UPenn).
Patients with breast cancer prior to screening were excluded, as were patients diagnosed with cancer within 6 months of screening (Figure 1). For MGH and UPenn, patients with breast implants were excluded; information on breast implants was not available for NWH. Patients who did not fall between the ages of 40 and 84 were excluded, as were those with known pathogenic BRCA1 or BRCA2 mutations since the BCRAT and BCSC models are not appropriate for women with these mutations. Patients with less than 5 years of follow-up time and deceased patients with no date of death or no date of the last contact were also excluded. Patients with missing BI-RADS breast density were excluded. These exclusions resulted in final analytic samples of 58,706 patients from MGH, 39,189 patients from NWH, and 24,661 patients from UPenn (Figure 1).

Outcomes
Breast cancer cases were determined through to 31 December 2017, using a combination of hospital cancer registries and state health department cancer registries. At MGH and NWH, breast cancer diagnosis information was obtained from the Massachusetts Cancer Registry, while at UPenn, breast cancer diagnosis information was obtained from Pennsylvania, New Jersey, and Delaware state cancer registries. Invasive breast cancers were categorized into molecular subtypes based on the expression of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). Borderline ER and PR results were considered positive, while borderline HER2 results were considered unknown [12]. Since this analysis predicted a 5-year absolute risk of breast cancer, only patients with invasive breast cancer within 5 years were considered cases. Patients with ductal carcinoma in situ (DCIS) were not considered as cases, since the risk models considered the risk of invasive breast cancer.

We assumed patients with missing data on atypical hyperplasia had no atypical hyperplasia (N = 59,285, 48.37%) and that those with a missing number of biopsy examinations had no such examinations (N = 18,549, 15.22%). Regarding missing menopause status, any person over 55 or anyone who had stopped menstruating was categorized as postmenopausal [13]. Anyone who did not meet this criterion was categorized as premenopausal. For patients with missing, incorrect, or ambiguous family history information (approximately 3% total), we assumed that they had no affected relatives.
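The missing-data rules above can be summarized in a short Python sketch. The study itself used R, and the record layout here (the `RiskFactors` class and its field names) is hypothetical, introduced only for illustration. The sketch handles missing values only; how incorrect or ambiguous family history entries were detected is not described in the text and is not modeled.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RiskFactors:
    """Hypothetical per-patient record; None marks a missing value."""
    age: int
    atypical_hyperplasia: Optional[bool] = None
    n_biopsies: Optional[int] = None
    stopped_menstruating: Optional[bool] = None
    postmenopausal: Optional[bool] = None
    n_affected_relatives: Optional[int] = None


def fill_missing(rf: RiskFactors) -> RiskFactors:
    """Apply the stated assumptions for missing risk-factor data."""
    postmenopausal = rf.postmenopausal
    if postmenopausal is None:
        # Over 55, or reported that menstruation has stopped: postmenopausal.
        # Everyone else with missing status is treated as premenopausal.
        postmenopausal = rf.age > 55 or bool(rf.stopped_menstruating)
    return replace(
        rf,
        # Missing atypical hyperplasia is treated as absent.
        atypical_hyperplasia=bool(rf.atypical_hyperplasia),
        # Missing biopsy count is treated as no biopsies.
        n_biopsies=0 if rf.n_biopsies is None else rf.n_biopsies,
        postmenopausal=postmenopausal,
        # Missing family history is treated as no affected relatives.
        n_affected_relatives=(
            0 if rf.n_affected_relatives is None else rf.n_affected_relatives
        ),
    )
```

Because every rule defaults to the lower-risk value, this kind of imputation can only lower predicted risk for patients whose true values were positive but unrecorded.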
We validated and compared the performance of BCRAT, BCSC, BRCAPRO, and BRCAPRO+BCRAT with respect to the absolute risk of invasive breast cancer. We also compared these models in subgroups defined by race/ethnicity, invasive molecular subtypes (ER/PR+HER2−, ER/PR+HER2+ or ER/PR−HER2+, or ER/PR/HER2−), number of affected first- or second-degree relatives, age (<50 or ≥50 years), and BMI (<30 kg/m² or ≥30 kg/m²). We performed sensitivity analyses evaluating model performance by site and comparing performance for Black and White women at UPenn only, since the majority of Black women were from that site.
Calibration of risk prediction models was evaluated by the observed-to-expected ratio (O/E), where the numerator is the observed count of invasive breast cancer cases that arose within 5 years of the initial mammogram, and the denominator is the expected number of cases predicted by the model, obtained as the sum of the predicted 5-year absolute risk estimates. An O/E ratio of 1 indicates perfect calibration, an O/E ratio >1 indicates that the model under-predicts the true number of cases, and an O/E ratio <1 indicates that the model over-predicts the true number of cases. Calibration curves were plotted as the observed proportion of breast cancer cases versus the predicted proportion of breast cancer cases in each decile of predicted absolute risk. Discrimination was assessed by the area under the receiver operating characteristic curve (AUC). The AUC measures the probability that a prediction model assigns a higher absolute risk score to a case than to a non-case. An AUC of 0.5 indicates that the model performs no better than chance, while an AUC of 1 indicates perfect discrimination. The clinically recognized threshold for elevated 5-year risk of breast cancer for chemoprevention is 1.67% [14,15]. This cutoff was used in calculating the true positive rate (TPR; sensitivity) and the false positive rate (FPR; 1 − specificity). The TPR was calculated as the proportion of cases who were categorized as high-risk by the model, and the FPR was calculated as the proportion of non-cases who were categorized as high-risk by the model.
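As a minimal illustration of these definitions, the following Python sketch computes the O/E ratio, the AUC, and the TPR/FPR at the 1.67% cutoff from a vector of case indicators and a vector of predicted 5-year risks. It is not the study's code (the analyses were run in R); the function names are illustrative, and counting ties as one half in the AUC is an assumption of this sketch.

```python
import numpy as np

HIGH_RISK_CUTOFF = 0.0167  # 5-year absolute risk of 1.67%


def observed_expected_ratio(is_case, predicted_risk):
    """Observed case count divided by the sum of predicted 5-year risks."""
    observed = np.asarray(is_case, dtype=float).sum()
    expected = np.asarray(predicted_risk, dtype=float).sum()
    return observed / expected


def auc(is_case, predicted_risk):
    """Probability that a random case has a higher predicted risk than a
    random non-case; ties count as one half (assumption of this sketch)."""
    is_case = np.asarray(is_case, dtype=bool)
    risk = np.asarray(predicted_risk, dtype=float)
    cases, non_cases = risk[is_case], risk[~is_case]
    # All case/non-case pairs are compared directly; this is quadratic in
    # memory and suited to small illustrations rather than a full cohort.
    greater = (cases[:, None] > non_cases[None, :]).sum()
    ties = (cases[:, None] == non_cases[None, :]).sum()
    return (greater + 0.5 * ties) / (cases.size * non_cases.size)


def tpr_fpr(is_case, predicted_risk, cutoff=HIGH_RISK_CUTOFF):
    """Share of cases and share of non-cases classified as high risk."""
    is_case = np.asarray(is_case, dtype=bool)
    high_risk = np.asarray(predicted_risk, dtype=float) >= cutoff
    return high_risk[is_case].mean(), high_risk[~is_case].mean()
```

With these definitions, a model that predicts 100 cases in a group where 120 occur has an O/E of 1.2, which corresponds to under-prediction.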
To obtain 95% confidence intervals (CI) for the performance metrics, we generated bootstrap samples and calculated metrics for each model in each bootstrapped sample [16]. To test the statistical significance of observed differences in AUCs across models and strata, we calculated the difference in AUC between two models or two strata for each bootstrap replicate, then calculated the test statistic as the observed difference in AUCs squared, divided by the variance of the difference in AUC derived from the bootstrapped samples. Using this statistic, we obtained p-values for the difference in AUC between pairwise comparisons of models and strata using the Chi-square distribution with 1 degree of freedom. All statistical tests were two-sided using an alpha of 0.05. In addition, we adjusted for multiple comparisons (we performed a total of 34 model and strata comparisons) by the Bonferroni method, with p-values less than 0.00147 considered statistically significant after correction (0.05/34 comparisons).
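The bootstrap test for a difference in AUC between two models can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's R implementation: patients are resampled jointly so both models are scored on the same replicate, the confidence interval uses the percentile method, and the number of replicates (`n_boot`) and the seed are arbitrary choices not given in the text.

```python
import math
import numpy as np

BONFERRONI_THRESHOLD = 0.05 / 34  # about 0.00147 for 34 comparisons


def auc(is_case, risk):
    """Pairwise AUC; ties count as one half (assumption of this sketch)."""
    cases, non_cases = risk[is_case], risk[~is_case]
    greater = (cases[:, None] > non_cases[None, :]).sum()
    ties = (cases[:, None] == non_cases[None, :]).sum()
    return (greater + 0.5 * ties) / (cases.size * non_cases.size)


def chi2_1df_pvalue(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom, via the identity P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(statistic / 2.0))


def bootstrap_auc_difference_test(is_case, risk_a, risk_b, n_boot=1000, seed=0):
    """Return (observed difference, 95% CI, test statistic, p-value)."""
    is_case = np.asarray(is_case, dtype=bool)
    risk_a = np.asarray(risk_a, dtype=float)
    risk_b = np.asarray(risk_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = is_case.size
    observed = auc(is_case, risk_a) - auc(is_case, risk_b)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        y = is_case[idx]
        if y.all() or not y.any():
            continue  # AUC is undefined without both cases and non-cases
        diffs.append(auc(y, risk_a[idx]) - auc(y, risk_b[idx]))
    diffs = np.array(diffs)
    # Observed difference squared over the bootstrap variance of the difference.
    statistic = observed ** 2 / diffs.var(ddof=1)
    ci = np.percentile(diffs, [2.5, 97.5])
    return observed, ci, statistic, chi2_1df_pvalue(statistic)
```

A comparison would then be flagged as significant after correction when the returned p-value falls below `BONFERRONI_THRESHOLD`. Comparing two strata rather than two models would require resampling each stratum separately, which this sketch does not cover.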
We also compared the proportion of patients categorized as high vs. low absolute risk, comparing two models at a time and using the cutoff of 1.67% 5-year risk. All analyses were performed using R statistical software (www.R-project.org, accessed on 23 November 2021).
Table 1 displays the descriptive characteristics of the study population by site. Most notably, the NWH population was younger than the MGH and UPenn populations, the UPenn population included nearly 46% Black or African American women, and distributions of age at first birth, breast density, and the proportion of missing data differed across sites. Among the 122,556 women in the study population, 1734 were diagnosed with breast cancer within 5 years of their screening mammogram.

Results
In the full study population (Table 2), the AUCs for the models ranged from 0.590 for BRCAPRO to 0.617 for the BCSC model. We observed significant differences in the AUCs comparing BCRAT and BCSC (p = 0.010), BCRAT and BRCAPRO (p = 0.007), BCRAT and BRCAPRO+BCRAT (p = 0.040), BCSC and BRCAPRO (p < 0.001), and BRCAPRO and BRCAPRO+BCRAT (p < 0.001). The differences in AUC remained statistically significant after Bonferroni correction for the comparisons of BRCAPRO and BCSC (p < 1 × 10⁻⁸) and BRCAPRO and BRCAPRO+BCRAT. O/E ratios ranged from 1.036 for BCRAT to 1.185 for BCSC, the model whose O/E was furthest from 1. The true positive rate ranged from 30.7% for the BCSC model to 37.8% for the BCRAT model, whereas the false positive rate ranged from 18.4% for BCSC to 24.3% for BRCAPRO. The proportion of patients identified as having high 5-year risk differed across models, with 18.6% of patients identified as high risk based on the BCSC model compared with 24.5% based on the BRCAPRO model. Comparable model performance was observed across the three institutions (Appendix A Table A1).
Model performance stratified by race/ethnicity is displayed in Table 3. There were 209 breast cancers diagnosed among Black or African American women. Among Black or African American women, the AUCs ranged from 0.610 for BRCAPRO to 0.644 for BCSC, though the differences in AUC by race did not reach statistical significance. In addition, there was no significant difference in AUCs for any model between Black or African American and White women. O/E ratios ranged from 1.054 for BRCAPRO to 1.316 for BCSC. The numbers of cancers among Hispanic (N = 32) and Asian women (N = 54) in the study population were small, so results should be interpreted with caution.
In the sensitivity analysis limited to women at UPenn, there was no significant difference in the AUC between Black and White women for the BCRAT or BRCAPRO+BCRAT models, but BCSC (p = 0.006) and BRCAPRO (p = 0.049) had significantly higher AUC among Black compared with White women, though the O/E ratio for both Black (O/E = 1.40) and White (O/E = 1.41) women at UPenn indicated under-prediction of cancer cases for BCSC (Appendix C Table A2).
Table 4 displays the performance of the risk models by molecular subtype. Given that the models are not trained to predict subtype-specific risk, we did not assess model calibration. AUCs were lower for triple-negative (N = 132, AUC range 0.564-0.585) and HER2+ cancers (N = 224, AUC range 0.513-0.567) compared with ER/PR+HER2− cancers (N = 1316, AUC range 0.605-0.629). The differences in AUCs between ER/PR+HER2− disease and HER2+ disease were statistically significant (all p-values < 0.009), but differences did not reach statistical significance for ER/PR+HER2− compared with ER/PR/HER2− (triple-negative breast cancer, TNBC). After Bonferroni correction, the difference in AUC for BRCAPRO between ER/PR+HER2− and HER2+ disease remained statistically significant (p < 1 × 10⁻⁵).
Table 5 displays model performance stratified by family history of breast cancer, age, and BMI. For women with a family history, breast cancers were under-predicted for all models (O/E range 1.263-1.509). AUC estimates were significantly higher for patients with family history than without family history for the BCRAT (p = 0.019) and BRCAPRO+BCRAT (p = 0.016) models. These differences were no longer statistically significant after the Bonferroni correction. O/E ratios for all models among patients with a family history were further from 1 than those without a family history, and confidence intervals did not overlap, suggesting cases were under-predicted for patients with a family history (Appendix B Figure A1).
For women under 50, O/E ratios indicated under-prediction of cancers. AUCs did not differ significantly by age. For BMI, all four models had significantly higher AUCs among obese compared with non-obese women (BCRAT p = 0.004, BCSC p < 0.001, BRCAPRO p = 0.001, BRCAPRO+BCRAT p = 0.012). Both the BCSC model and the BRCAPRO model had significantly higher AUC among obese women compared with non-obese women after adjusting for multiple comparisons (BCSC p = 0.000025, BRCAPRO p = 0.0012). However, the under-prediction was more severe for BCSC and BRCAPRO among women with high BMI compared with lower BMI.
Table 6 displays a cross-tabulation of high (≥1.67% 5-year risk) versus low absolute risk classification for all possible pairs of models. Overall, the proportion of discordant classifications ranged from 2.9% for the comparison of BCRAT with BRCAPRO+BCRAT (0.5% + 2.4%) to 19.7% for the comparison of BCSC with BRCAPRO (13.6% + 6.1%). BCSC classified more patients as low risk than other models. Nearly 10% of patients were classified as high risk with the BCRAT model but low risk with BCSC, whereas only 5.7% of patients were classified as low risk on BCRAT and high risk on BCSC. Almost 14% of patients classified as high risk by BRCAPRO were classified as low risk by BCSC, while only 6.1% of patients considered low risk on BRCAPRO were considered high risk with BCSC. Comparing BCRAT and BRCAPRO, 7.8% of women who were high risk on BCRAT were classified as low risk by BRCAPRO, while 11.1% of women who were high risk on BRCAPRO were classified as low risk by BCRAT. The proportions of patients classified differently by the BRCAPRO+BCRAT model compared with the BCRAT model were smaller.
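The pairwise discordance figures above are the sum of the two off-diagonal cells of a 2 × 2 high/low cross-tabulation. A minimal Python sketch, with a hypothetical function name and dictionary keys, shows the calculation for one pair of models:

```python
import numpy as np


def discordance(risk_a, risk_b, cutoff=0.0167):
    """Proportion of patients classified differently by two models when
    high risk is defined as a 5-year risk at or above the cutoff."""
    high_a = np.asarray(risk_a, dtype=float) >= cutoff
    high_b = np.asarray(risk_b, dtype=float) >= cutoff
    high_a_low_b = float(np.mean(high_a & ~high_b))
    low_a_high_b = float(np.mean(~high_a & high_b))
    return {
        "high_a_low_b": high_a_low_b,
        "low_a_high_b": low_a_high_b,
        "discordant": high_a_low_b + low_a_high_b,
    }
```

For the BCSC and BRCAPRO comparison reported above, the two off-diagonal proportions are 13.6% and 6.1%, giving the 19.7% total.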

Discussion
Our study is one of the largest to examine the validity of multiple breast cancer risk prediction models simultaneously in a large and diverse sample of women undergoing mammography screening, allowing comparisons of model performance by race/ethnicity, tumor molecular subtypes, family history, and obesity. We found that model performance was comparable across the four models examined, with moderate discriminatory accuracy, and generally good calibration, consistent with previous estimates [7,17–30]. The BCSC model had the highest AUC but under-predicted the number of cancer cases to a greater degree than the other models. We found no evidence of poorer model performance for Black or African American women compared with White women. AUC estimates were consistently higher for Black or African American women compared to White women, though not statistically significantly different, except among women at UPenn, where the AUC was better for BCSC and BRCAPRO models for Black compared with White women. With the exception of the BRCAPRO model, the O/E ratio was further from 1 for Black or African American women compared to White women, indicating that models under-predicted the number of cases among Black or African American women. Model discriminatory accuracy was poorer for HER2+ than ER/PR+HER2− disease across all models, and also lower for TNBC than ER/PR+HER2− disease; however, the difference was not statistically significant for TNBC. This is the first study to validate the combined BRCAPRO+BCRAT model, which has significantly better discriminatory accuracy than BRCAPRO in this screening population. BCRAT and BRCAPRO+BCRAT had better AUC among women with family history than women without a family history, though O/E ratios were further from 1 for women with family history compared to those without.
Across all models, AUCs were higher for obese women than non-obese women, but the calibration was poorer, with under-prediction of cases among obese women. While overall there was good concordance in the patients identified as high and low absolute risk, we found anywhere from 2.9% to 19.7% discordance in pairwise comparisons of the proportions of patients identified as high versus low risk across the models. These findings add to the literature and provide useful performance metrics to consider when selecting breast cancer risk assessment models, and also highlight the complications when applying risk models clinically and the need for improved risk assessment tools.
The measures of calibration and discrimination of the models observed in our study are generally consistent with existing literature [7,17–30]. This suggests that there is no clear winner in terms of which risk model to use for the purpose of directing screening. However, while we found that the models largely agreed on which patients were high risk, there was variation across models. Interestingly, the BCSC model was more likely to classify women as having low risk when compared with the BCRAT model and the BRCAPRO model. While the choice of model may have only marginal differences at the population level, our results highlight that the choice of model may lead to a different assessment of high-risk status for individual women. In the case of supplemental screening, different recommendations would be made if only one model is used compared to if multiple models are used and an individual is considered high-risk based on any of the applied models.
If the goal is to increase the number of women identified as high risk, we might take the latter approach. If the goal is to minimize the use of resources and unnecessary procedures, we may use only one model or set a higher threshold for high-risk status. Answering such questions is key to implementing a risk-based approach to screening.
Our study found no significant difference in model performance between Black or African American and White women. To our knowledge, ours is the first study to compare the performance of multiple breast cancer risk prediction models simultaneously for White and Black women. AUCs tended to be higher for Black or African American than White women, and calibration tended to be poorer for Black or African American than White women, though p-values for differences in AUCs were not significant, and confidence intervals for O/E ratios for Black or African American and White women were overlapping. Despite our large population, there were only 209 breast cancers diagnosed among Black or African American women, which may limit our power to detect differences. The original BCRAT model was found to underestimate risk among Black or African American women and was therefore updated to better predict breast cancer risk for Black or African American women based on the results of the Women's Contraceptive and Reproductive Experiences (CARE) study [5]. However, even after this update, the BCRAT model was still shown to significantly underestimate risk in a study that included 725 cases and 725 controls from the Black Women's Health Study [31]. Validation of the BCSC model among women undergoing mammography screening showed that risk was under-predicted for non-Hispanic Black women, though the confidence interval of the observed-to-expected ratio overlapped with that for White women [28]. The AUC of the BCSC model among Black women was not reported in that study. While discrimination and calibration were comparable across models for Black or African American and White women in our study, all risk models identified a smaller proportion of Black or African American than White women as high risk. For example, over 21% of White women were identified as having high 5-year risk based on the BCRAT model, compared with 10% of Black or African American women.
This illustrates the fact that using existing risk models to direct supplemental screening will result in fewer Black or African American women than White women qualifying for additional imaging. While historically Black or African American women in the U.S. have had lower breast cancer incidence than White women, breast cancer incidence has increased for Black or African American women over the past 20 years, with incidence rates now very close to those of White women (127.3 vs. 131.6 per 100,000 women) [1]. Future research should closely evaluate the potential effects of using risk models to guide prevention measures on racial disparities in breast cancer.
Our results further suggest that risk models do not identify the risk of triple-negative or HER2+ tumors as well as hormone receptor-positive HER2− tumors. Our results are consistent with an analysis of the Women's Health Initiative, which showed that the BCRAT model predicted ER+ breast cancers, but not ER− breast cancers [21], and with our prior analysis with a smaller sample [11]. Given that HER2+ and TNBCs are more aggressive, future studies should attempt to build subtype-specific risk prediction models. Identification of risk of poor prognosis cancers would allow targeting of prevention strategies such as intensive screening to women at greatest risk for breast cancer death. Thus far, the lower prevalence of both HER2+ and triple-negative tumors has limited the ability to generate subtype-specific models.
BRCAPRO and BRCAPRO+BCRAT use more extensive family history data than the BCSC and BCRAT models. In this screening cohort, approximately 20% of patients had any family history of breast cancer, including first or second-degree relatives. We found that BRCAPRO+BCRAT had significantly better AUC than BRCAPRO and calibration was very similar for both models overall, and among the subset of women with family history. This highlights the need to balance between a potentially small increased accuracy of risk assessment including extended family history, and the additional time and computational burden of collecting and analyzing extensive family history data. It is important to note that detailed family history assessment has the added benefit of identifying patients who may be at high risk for high and moderate penetrance mutations such as BRCA1/2 who may benefit from genetic testing.
We found significantly higher AUC across models among obese women compared with non-obese women, though calibration was poorer for obese women, particularly for the BCSC model. While none of the existing models include BMI as a predictor, the BCSC model does incorporate breast density, which is highly correlated with BMI, with women with higher BMI having lower breast density on average. Therefore, the BCSC model likely downgrades risk for many obese women given their lower breast density, leading to under-prediction of risk. Breast density as coded by the radiologist has been shown to have poor reproducibility [32]. Novel measures of breast density, such as quantitative measures of volumetric breast density from digital breast tomosynthesis, may prove a better risk marker than radiologist-coded breast density, particularly for obese women. A recent study found that the effect of volumetric breast density on breast cancer risk was strongest in overweight and obese women [33], suggesting that risk models may need to incorporate these novel measures and include interaction terms between density and BMI in order to improve risk assessment for obese women. Given that the prevalence of obesity is high and increasing, and that obese women are more likely to be diagnosed with more aggressive disease [34,35], the inclusion of BMI in risk models should be explored, with a closer evaluation of how to best predict risk among women with high BMI.
To our knowledge, this is one of the largest studies to compare multiple breast cancer risk models among women undergoing mammography screening in the U.S. The large and diverse cohort enabled comparisons of model performance by race/ethnicity, molecular subtypes, family history, age, and BMI. Additionally, we were able to evaluate a novel combination of the BRCAPRO and BCRAT model, BRCAPRO+BCRAT, that may be useful for patients with an extensive family history of breast and ovarian cancers. Our findings provide directions for future research by identifying subsets of the population for which existing models may have suboptimal performance.
Several limitations should be noted when interpreting the findings. We were unable to include the IBIS/Tyrer-Cuzick model because we lacked sufficient data on the use of menopausal hormone therapy for input to this model. Due to the small number of Hispanic and Asian/Pacific Islander patients, there were too few cancer outcomes to make meaningful inferences on the performance of the risk models in these racial/ethnic groups. There were 209 breast cancers among Black or African American women, which is also a relatively small sample. Most of the Black or African American patients were screened at one of the three sites, and therefore results for Black or African American patients may be confounded by site-specific measurement biases. In addition, the data on prior biopsy from MGH was incomplete. However, we believe the data on atypical hyperplasia to be reasonably accurate since it incorporated both self-reported data and information extracted from biopsy reports. However, we lacked data on other benign breast conditions, such as non-proliferative benign breast diseases and proliferative benign breast diseases without atypia. While the BCRAT uses atypical hyperplasia, the BCSC model uses benign breast disease, and therefore we may be underestimating risk among patients since we do not have full data on benign breast diseases. This highlights the broader concern, that there is variation in data quality across risk factors and across sites. However, this is the reality when data collected for clinical purposes is utilized for risk assessment. The inclusion of three cohorts with slightly different risk collection instruments may help in reducing the effect of misclassification of risk factors at any one site on the results. We treated patients diagnosed with DCIS as non-cases since the risk models considered focus on the risk of invasive breast cancer. 
However, since DCIS and invasive cancer share common risk factors, this may have contributed to the poorer observed performance of the models. Additionally, the purpose of this study was to evaluate risk models that would be feasible to integrate into clinical care in mammography to support decision-making in the general population. We excluded known BRCA1/2 carriers because the BCRAT and BCSC risk models are specifically designed to predict risk in non-carriers. There were fewer BRCA1/2 carriers at NWH than at MGH and UPenn, which is partly expected, as NWH is a community hospital, whereas MGH and UPenn have large cancer genetics clinics. It is also possible that BRCA1/2 mutation status was less well annotated in health records at NWH, and we may have included some BRCA1/2 mutation carriers in the analysis, which may bias results in unknown ways. However, given the low prevalence of BRCA1/2 mutations in the general population, we expect the magnitude of this bias to be small. We also lacked genetic data to incorporate polygenic risk scores into risk models, which have been shown to improve predictive accuracy. A recent study showed that adding the 313 SNP polygenic risk score to classical breast cancer risk factors improved the AUC from 56% to 64% among women younger than 50 and from 57% to 64% among women aged 50 and older [36]. Polygenic risk scores are not yet integrated into clinical practice to direct supplemental screening, and further studies are needed to direct the implementation of such approaches. Finally, we performed a large number of comparisons, and therefore some of the statistically significant associations may have been the result of chance rather than true underlying differences in performance. We provide Bonferroni-corrected p-values and found that several associations remained statistically significant after correction.
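As a minimal illustration of the multiple-comparison adjustment mentioned above, the sketch below applies a standard Bonferroni correction in Python, multiplying each raw p-value by the number of comparisons and capping the result at 1. The function name `bonferroni_adjust`, the example p-values, and the 0.05 threshold are assumptions made for illustration only; they are not values or code from this study.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values: min(1, p * m) for m comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values (illustrative only, not results from this study).
raw_p = [0.001, 0.02, 0.04, 0.30]

adjusted_p = bonferroni_adjust(raw_p)

# Assumed significance threshold of 0.05, applied to the adjusted p-values.
alpha = 0.05
significant = [p <= alpha for p in adjusted_p]

for raw, adj, sig in zip(raw_p, adjusted_p, significant):
    print(f"raw p = {raw:.3f}, adjusted p = {adj:.3f}, significant = {sig}")
```

With four comparisons, a raw p-value of 0.02 would no longer fall below the assumed threshold after adjustment, which is why only some nominally significant associations survive the correction.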

Conclusions
In summary, our research provides important data on the performance of breast cancer risk assessment models among women undergoing mammography screening in the U.S., including how the models perform in different subsets of the population, such as Black or African American women and women with a family history of breast cancer, and how well they predict breast cancer subtypes. We validate the BRCAPRO+BCRAT model, which may be beneficial for use among women with a family history of breast and ovarian cancer. These findings are useful for developing risk assessment strategies but ultimately point to the need to further improve risk prediction by incorporating additional risk factors, such as quantitative measures of breast density and genetic markers.

Institutional Review Board Statement:
This study was deemed exempt from review by the University of Pennsylvania Institutional Review Board.
Informed Consent Statement:
Patient consent was waived because this study was a retrospective analysis of existing data and therefore posed no more than minimal risk to patients.

Data Availability Statement:
The data underlying this article cannot be shared publicly in order to protect patient privacy. The data may be shared in a de-identified format on reasonable request to the corresponding author.

Conflicts of Interest:
Giovanni Parmigiani is a co-founder and equity holder in Phaeno Biotechnologies, a member of the Scientific Advisory Board of Konica-Minolta Precision Medicine (which includes Ambry Genetics and Invicro), and a consultant for Delfi Diagnostics and Foundation Medicine. Danielle Braun and Giovanni Parmigiani co-lead the BayesMendel lab, which develops and maintains the BayesMendel software package. The package includes a variety of risk assessment tools, including BRCAPRO, PancPRO, MelaPRO, MMRpro, and PanelPRO, and is licensed for commercial use. All licensing revenues are used for software maintenance and upgrades. Neither BayesMendel lab leaders nor members derive personal income from BayesMendel licenses. Danielle Braun and Giovanni Parmigiani are co-inventors of the Ask2me tool, which is commercially licensed. Kevin Hughes receives honoraria from Hologic (surgical implant for radiation planning with breast conservation and wire-free breast biopsy) and Myriad Genetics. Hughes has financial interests in CRA Health (formerly Hughes RiskApps), which was recently sold to Volpara. CRA Health develops risk assessment models/software with a particular focus on breast cancer and colorectal cancer. Hughes is a founder of the company. Hughes is the co-creator of Ask2Me.Org, which is freely available for clinical use and is licensed for commercial use by the Dana Farber Cancer Institute and the MGH. Hughes's interests in CRA Health and Ask2Me.Org were reviewed and are managed by Massachusetts General Hospital and Partners Health Care in accordance with their conflict-of-interest policies. Constance Lehman is co-founder of Clairity, which is developing AI-based risk assessment products. Lehman's interests in Clairity were reviewed and are managed by Massachusetts General Hospital and Partners Health Care in accordance with their conflict-of-interest policies. Emily Conant is on the grant and advisory boards for iCAD, Inc. and Hologic, Inc. 
The remaining authors have no conflicts to disclose. The funders of this study had no role in its design; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

Appendix B
(a)