A Personal Breast Cancer Risk Stratification Model Using Common Variants and Environmental Risk Factors in Japanese Females

Simple Summary: Breast cancer remains the most common cancer in females, warranting the development of new approaches to prevention. One such approach is personalized prevention using genetic risk models. Here, we developed a risk model using both genetic and environmental risk factors. Results showed that a genetic risk score defined by the number of risk alleles for 14 breast cancer risk SNPs clearly stratified breast cancer risk. Moreover, the combination of this genetic risk score model with an environmental risk model that included established environmental risk factors showed a significantly better C-statistic than the environmental risk model alone. This genetic risk score model in combination with the environmental model may be suitable for stratifying individual breast cancer risk, and may form the basis for a new personalized approach to breast cancer prevention.
Abstract: Personalized approaches to prevention based on genetic risk models have been anticipated, and many models for the prediction of individual breast cancer risk have been developed. However, few studies have evaluated personalized risk using both genetic and environmental factors. We developed a risk model incorporating genetic and environmental risk factors in 1319 breast cancer cases and 2094 controls from three case–control studies in Japan. Risk groups were defined based on the number of risk alleles for 14 breast cancer susceptibility loci, namely low (0–10 alleles), moderate (11–16) and high (17+). Environmental risk factors were collected using a self-administered questionnaire and harmonized across studies. Odds ratios (ORs) and C-statistics, calculated using a logistic regression model, were used to evaluate breast cancer susceptibility and model performance. Respective breast cancer ORs in the moderate- and high-risk groups were 1.69 (95% confidence interval, 1.39–2.04) and 3.27 (2.46–4.34) compared with the low-risk group. The C-statistic for the environmental model of 0.616 (0.596–0.636) was significantly improved by combination with the genetic model, to 0.659 (0.640–0.678). This combined genetic and environmental risk model may be suitable for the stratification of individuals by breast cancer risk. New approaches to breast cancer prevention using the model are warranted.


Introduction
Breast cancer is the most common cancer in females, with an estimated global incidence in 2018 of 2,088,849 [1]. Breast cancer is also a leading cause of death worldwide, causing 15.1 million disability-adjusted life years (DALY) in 2016 [2]. Furthermore, incidence is estimated to increase to 3,059,829 cases in 2040 [1]. In Japan also, breast cancer incidence has increased rapidly for the last 30 years and is now the most common cancer in women [3]. This increase is associated with changes in the prevalence of established risk factors in Japanese women, which might be broadly categorized as behavioral and social westernization.
Conventional strategies for breast cancer prevention include control of risk factors and early detection through mammography screening, targeting women in the general population. Given the increased burden of breast cancer, however, the development of new prevention strategies is essential. As a revolutionary approach, personalized prevention or precision prevention has recently been proposed [4][5][6]. A detailed elucidation of individual breast cancer risk might allow personalized intervention in women stratified by risk factors. Several breast cancer risk prediction models have been developed to evaluate individual risk based on lifestyle factors, reproductive factors, family history and clinical factors [7][8][9][10][11]. These are now in clinical use as tools for individual cancer prevention. For example, the American Cancer Society has developed a guideline which recommends MRI as an adjunct to mammography screening for women at high risk, as identified by a risk prediction model [12].
We and others have speculated that feedback on genetic and environmental risk to individuals at high risk might be meaningful for breast cancer prevention [32,33]. In particular, genetic risk assessment in combination with environmental risk assessment might predict breast cancer risk better than either assessment alone. To date, however, only a few studies have attempted to predict individual breast cancer risk using both environmental factors and genetic factors [34][35][36].
Here, we aimed to develop a genetic risk score and integrate it with established risk factors for personalized breast cancer risk assessment in Japanese females.

Subjects
Breast cancer cases and corresponding controls from three hospital-based case-control studies were included in the study. The Nagano study was a multicenter, hospital-based case-control study of breast cancer conducted from May 2001 to September 2005 at four hospitals in Nagano Prefecture. Details of the study have been described previously [37]. Briefly, the case subjects were a consecutive series of women aged 20-74 years with newly diagnosed, histologically confirmed invasive breast cancer who were admitted to one of the four hospitals during the survey period. Of 412 eligible patients, 405 (98%) agreed to participate. Healthy controls were selected from medical checkup examinees in two of the hospitals and confirmed not to have any cancer. One control was matched with each case by age (within three years) and residential area (city or regional area) during the study period. Among potential control subjects, one declined to participate. Consequently, written informed consent was obtained from 405 matched pairs. Thereafter, two subjects refused to provide blood samples, and two declined use of their data outside the Nagano study. A further 12 pairs were excluded owing to a shortage of DNA samples. The Kagoshima study was a hospital-based case-control study conducted in two hospitals in Kagoshima City from May 2010 to March 2012. Cases were female patients with newly diagnosed and histologically confirmed breast cancer, while controls were outpatients undergoing breast cancer screening who were confirmed to be free of malignant disease. Consecutive cases admitted to either hospital during the study period were asked to participate in this study, and the participation rate was 91%. In total, 233 breast cancer cases and 331 controls were analyzed, with written informed consent obtained from all. The Aichi study was conducted between 2001 and 2005 at Aichi Cancer Center Hospital [38,39]. 
This study was conducted within the framework of the Hospital-based Epidemiological Research Program in Aichi Cancer Center (HERPACC2). Cases were first-visit outpatients with histologically confirmed breast cancer during the study period. Controls were first-visit outpatients during the same period who were confirmed to have no malignancy and no history of neoplasia. Controls were selected randomly and matched by age at a case-control ratio of 1:2. All study subjects provided blood samples. Lifestyle factors were collected by self-administered questionnaire.
In total, the present study included 1319 breast cancer cases and 2094 non-cancer controls.

Evaluation of Environmental Risk Factors
Information on known environmental risk factors for breast cancer was collected by self-administered questionnaire in each study. Data from the three studies were harmonized according to common items, and a common categorization of variables was defined. The following variables were considered as environmental risk factors: age at enrollment, body mass index (BMI, <18.5, 18.5-24.9, ≥25), ethanol drinking (never, <23 g/day, ≥23 g/day), cigarette smoking (never, ever), physical activity (yes, no), family history of breast cancer (yes, no), age at menarche (≤12 years old, 13 or 14 years old, ≥15 years old), parity (yes, no), number of children (0, 1-2, 3 or more), age at first birth (<30 years old, ≥30 years old, nulliparous), breastfeeding (yes, no), hormone therapy (yes, no) and menopausal status (premenopausal, postmenopausal). BMI was calculated as the reported weight in kilograms divided by the square of the reported height in meters. Ethanol consumption was estimated from the average number of alcoholic beverages consumed per day. Subjects reporting regular leisure-time exercise at least once per month were classified as physically active. Family history was considered positive if a mother or a sister had ever had breast cancer.
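The BMI formula and the category cut-points listed above can be illustrated with a minimal Python sketch. The function names are hypothetical, and two choices are assumptions rather than details reported in the study: values between 24.9 and 25 are assigned to the middle BMI category, and "never" drinking is represented as zero intake.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by the square of height (m)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the three categories listed in the text.

    Assumption: values in the gap between 24.9 and 25 fall in the middle group.
    """
    if value < 18.5:
        return "<18.5"
    if value < 25:
        return "18.5-24.9"
    return ">=25"


def ethanol_category(grams_per_day: float) -> str:
    """Map daily ethanol intake to never, <23 g/day or >=23 g/day.

    Assumption: never-drinkers are encoded as an intake of zero.
    """
    if grams_per_day <= 0:
        return "never"
    if grams_per_day < 23:
        return "<23 g/day"
    return ">=23 g/day"
```

For instance, a reported weight of 60 kg and height of 1.60 m give a BMI of about 23.4, which falls in the 18.5-24.9 category.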
Genomic DNA was extracted from peripheral blood using a Qiagen FlexiGene DNA Kit (Qiagen, Hilden, Germany) in the Nagano study, a QIAamp DNA Blood Maxi Kit (Qiagen) in the Kagoshima study and a DNA Blood mini kit (Qiagen, Tokyo, Japan) in the Aichi study. A total of 114 loci were genotyped in the study subjects using SNPtype assays (Fluidigm, San Francisco, CA, USA). Of these 114 loci, 11 monomorphic SNPs were excluded, as were 16 SNPs that deviated from Hardy-Weinberg equilibrium (HWE) in at least one of the three populations. In total, 87 SNPs were included in further analysis. The impact of each SNP on breast cancer risk was evaluated by the per-allele odds ratio (OR) and 95% confidence interval (CI) using a logistic regression model adjusted for age at enrollment. The results of the three studies were combined using random effects meta-analysis. SNPs with summary p-values less than 0.05 were selected for risk prediction modeling. Linkage disequilibrium (LD) was calculated for SNPs located within the same gene. LD among candidate loci clustered in the same region was assessed with Haploview 4.2 [42]. Strong LD was defined as a one-sided upper 95% confidence bound on D' of more than 0.98 and a lower 95% confidence bound above 0.7. Within each LD block, only the SNP with the lowest p-value for breast cancer risk was retained. Similarly, SNPs with summary p-values less than 0.10 and 0.30 were also used in sensitivity analyses for genetic risk modeling.
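As an illustration of how per-study estimates might be pooled, the following sketch combines log odds ratios and their standard errors with the DerSimonian-Laird random effects estimator. The choice of this particular estimator is an assumption (the text states only that a random effects meta-analysis was used), and the names `PooledEstimate` and `random_effects_meta` are hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple


class PooledEstimate(NamedTuple):
    odds_ratio: float
    ci_low: float
    ci_high: float
    p_value: float
    tau2: float  # estimated between-study variance


def random_effects_meta(estimates: List[Tuple[float, float]]) -> PooledEstimate:
    """Pool (log OR, standard error) pairs, one per study.

    Assumption: DerSimonian-Laird estimator with a normal-approximation CI.
    """
    if not estimates:
        raise ValueError("at least one study estimate is required")
    weights = [1.0 / se ** 2 for _, se in estimates]
    sum_w = sum(weights)
    fixed_mean = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed effects mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, (y, _) in zip(weights, estimates))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    if len(estimates) < 2 or c <= 0:
        tau2 = 0.0
    else:
        tau2 = max(0.0, (q - (len(estimates) - 1)) / c)
    # Random effects weights add tau2 to each within-study variance.
    re_weights = [1.0 / (se ** 2 + tau2) for _, se in estimates]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, (y, _) in zip(re_weights, estimates)) / sum_re
    se_pooled = math.sqrt(1.0 / sum_re)
    z = pooled / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return PooledEstimate(
        odds_ratio=math.exp(pooled),
        ci_low=math.exp(pooled - 1.96 * se_pooled),
        ci_high=math.exp(pooled + 1.96 * se_pooled),
        p_value=p_value,
        tau2=tau2,
    )
```

Under this sketch, a SNP would pass the primary selection step when the returned `p_value` is below 0.05, and the sensitivity analyses would relax that threshold to 0.10 or 0.30.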
The genetic risk group for breast cancer was defined according to the number of risk alleles carried by each subject. Three risk groups (Low, Moderate, High) were defined by the distribution of risk allele numbers in controls. Approximately 20%, 70% and 10% of controls were assigned to the low-, moderate- and high-risk groups, respectively. Breast cancer susceptibility in each risk group was evaluated by OR and its 95% CI using both crude and adjusted logistic regression models. The crude model was adjusted for age at enrollment only. The adjusted model additionally included the environmental risk factors. ORs in the total population were calculated using the crude and adjusted models with the addition of study site as a covariate. To assess the discriminatory ability of the risk prediction model, the area under the Receiver Operating Characteristic (ROC) curve (AUC), also known as the concordance statistic (C-statistic), was used. The C-statistic of the genetic model for each study population and the total population was calculated using logistic regression models that included the genetic risk score. Similarly, the C-statistic of the environmental model was calculated using logistic regression models that included the environmental risk factors. All variables in the genetic and environmental models were included in the inclusive model. In the ROC curve, the y axis shows sensitivity and the x axis shows the false positive rate, with AUC values ranging from 0.5 to 1. The diagonal line in the ROC plot represents random classification of case and control subjects, with an AUC of 0.5, while an AUC value of 1 corresponds to perfect classification. An AUC value between 0.7 and 0.8 is considered acceptable, while a value greater than 0.8 represents excellent model discrimination [43]. In addition to the genetic risk score model, we also assessed the three-level genetic risk score model and the allelic risk model as sensitivity analyses. The C-statistic of the three-level genetic risk score model was calculated using a logistic regression model that included the low, moderate and high genetic risk groups. The C-statistic of the allelic risk model was calculated using logistic regression models that included the sum of the logarithms of the allelic ORs of the SNPs in the genetic risk models. C-statistic values were compared using the method of DeLong et al. [44]. Calibration of the risk model was assessed by the Hosmer-Lemeshow goodness-of-fit statistic and calibration plots [45]. Subjects were grouped by decile of predicted probability. A significant p-value in the Hosmer-Lemeshow test indicates disagreement between the predicted and observed outcomes. In the calibration plot, the mean predicted probability was plotted against the mean observed probability for each decile. A p-value < 0.05 was defined as the threshold of significance. Statistical analyses were conducted using Stata version 15.2 (StataCorp LP, College Station, TX, USA).
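The two performance measures described above can be sketched in plain Python. This is an illustrative implementation, not the Stata procedure used in the study: `c_statistic` computes the AUC as the proportion of case-control pairs in which the case has the higher predicted probability (ties count one half), and `hosmer_lemeshow_statistic` returns only the chi-square statistic over groups of sorted predictions, without a p-value. Both function names and the equal-size grouping rule are assumptions.

```python
from typing import Sequence


def c_statistic(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the share of concordant case-control pairs (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both cases and controls are required")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


def hosmer_lemeshow_statistic(
    probs: Sequence[float], labels: Sequence[int], groups: int = 10
) -> float:
    """Hosmer-Lemeshow chi-square statistic over groups of sorted predictions.

    Assumption: subjects are split into `groups` bins of near-equal size
    after sorting by predicted probability (deciles when groups=10).
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    statistic = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        expected = sum(probs[i] for i in idx)
        observed = sum(labels[i] for i in idx)
        mean_p = expected / len(idx)
        denominator = len(idx) * mean_p * (1.0 - mean_p)
        if denominator > 0:
            statistic += (observed - expected) ** 2 / denominator
    return statistic
```

A C-statistic of 0.5 from this function corresponds to the diagonal line of the ROC plot, and a Hosmer-Lemeshow statistic near zero indicates that predicted and observed event counts agree within each group.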

Results
The three case-control studies are characterized in Table 1. In total, 1319 cases and 2114 controls were included in the present study, broken down as 389 and 389 from the Nagano study, 233 and 331 from the Kagoshima study and 697 and 1394 from the Aichi study, respectively. Due to matching, age distributions did not differ between cases and controls in the Nagano and Aichi studies, whereas cases in the Kagoshima study were older than controls. The proportion of obese subjects (BMI 25 or more) was similar between cases and controls in the Nagano and Aichi studies, but obesity was more prevalent in cases in the Kagoshima study. Finally, hormone therapy use was more prevalent in controls in the Kagoshima and Aichi studies.
Among 114 genotyped breast cancer susceptibility loci identified by GWAS studies (Table S1), 19 SNPs had statistically significant summary p-values of less than 0.05. Five loci located in 10q26 (rs2981579, rs2981578, rs1219648, rs2420946 and rs2981582) and two in 16q12 (rs3803662 and rs4784227) were in strong LD. Four loci in 10q26 (rs2981578, rs1219648, rs2420946 and rs2981582) and one in 16q12 (rs3803662) were excluded from further analysis. The list of breast cancer susceptibility loci and their allelic ORs is shown in Table 2. Similarly, 22 SNPs with summary p-values of less than 0.10 and 42 SNPs with summary p-values of less than 0.30 were selected for additional genetic risk assessment.
Genetic risk groups were defined according to the risk allele distribution of the 14 SNPs in controls (Figure 1), with those with 0 to 10, 11 to 16 and 17 to 28 risk alleles defined as low-, moderate- and high-risk groups, respectively. Subjects with undetermined alleles were classified as undetermined. Subject proportions in the low-, moderate- and high-risk groups were 23.84%, 69.30% and 6.86%, respectively. Proportions in risk groups in each study's controls were similar to those in the total control subjects. In the crude model, summary ORs of breast cancer in the moderate- and high-risk groups were 1.70 (95% CI, 1.41-2.05) and 3.29 (95% CI, 2.49-4.34), respectively, compared with the low-risk group. The ORs in each study were similar to those in the total population. ORs were similar after adjustment for known breast cancer risk factors.
Figure 2 shows the ROC curves of the genetic, environmental and inclusive risk models in the three study populations and the total population. The C-statistics of the genetic, environmental and inclusive models in the three populations and the total population are shown in Table 3. The C-statistics of the genetic models were 0.605, 0.609, 0.604 and 0.633 in the Nagano, Kagoshima, Aichi and overall populations, respectively. The C-statistics of the inclusive model (combination of genetic and environmental models) in the Nagano, Kagoshima, Aichi and total populations were better than those of the environmental models. The ROC curves in the total population resembled those in the Aichi study, because of the relatively large sample size of the Aichi study. A calibration plot of the inclusive model in the overall population remained close to the ideal calibration line (calibration slope of 1.02 and p for Hosmer-Lemeshow test = 0.506) (Figure S1).
Figure 2. ROC curves of the risk models in the three study populations and (D) the total population. Orange, yellow and navy lines are ROC curves of the Inclusive, Environmental and Genetic models, respectively. Age, body mass index, ethanol drinking, cigarette smoking, physical activity, family history of breast cancer, age at menarche, parity, number of children, age at first delivery, breastfeeding, hormone therapy and menopausal status were adjusted in the Environmental model.
The C-statistics of the genetic, environmental and inclusive risk models stratified by menopausal status are shown in Table S3. The C-statistics were significantly improved with the combination of genetic and environmental models in both premenopausal and postmenopausal females.
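The risk grouping reported in this section (0 to 10, 11 to 16 and 17 to 28 risk alleles across 14 SNPs, with incomplete genotypes left undetermined) can be expressed as a short classification routine. The sketch below is illustrative only: the function name and the encoding of each genotype as a risk allele count of 0, 1 or 2, with `None` for an undetermined call, are assumptions.

```python
from typing import Optional, Sequence

N_SNPS = 14  # number of SNPs in the genetic risk score


def genetic_risk_group(risk_allele_counts: Sequence[Optional[int]]) -> str:
    """Classify a subject from per-SNP risk allele counts (0, 1, 2 or None)."""
    if len(risk_allele_counts) != N_SNPS:
        raise ValueError(f"expected {N_SNPS} genotypes")
    # Any missing genotype makes the total count undefined.
    if any(count is None for count in risk_allele_counts):
        return "undetermined"
    if any(count not in (0, 1, 2) for count in risk_allele_counts):
        raise ValueError("each genotype must carry 0, 1 or 2 risk alleles")
    total = sum(risk_allele_counts)
    if total <= 10:
        return "low"
    if total <= 16:
        return "moderate"
    return "high"
```

A subject heterozygous at all 14 loci carries 14 risk alleles and would be classified as moderate risk under these thresholds.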
To check the validity of the SNP selection in the genetic risk model, genetic risk models with additional SNPs were assessed (Table S4). The C-statistics of genetic risk models that included 14, 22 and 42 SNPs in the total population were 0.633 (95% CI 0.614-0.652), 0.636 (95% CI 0.617-0.655) and 0.636 (95% CI 0.617-0.655), respectively. Accordingly, the C-statistic of the genetic risk model with 14 SNPs was not statistically poorer than that of those with 22 or 42 SNPs, and the inclusion of more SNPs did not improve model performance.
To assess the validity of genetic risk categorization, C-statistics of the three levels of the genetic risk model, genetic risk score model (number of risk alleles) and allelic risk model (summation of logarithmic allelic ORs) were assessed (Table S5). The C-statistics of the genetic risk score models did not significantly differ from those of the allelic risk models.
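The difference between the two continuous scores compared here can be made concrete with a small sketch. The risk score counts risk alleles with equal weight, whereas the allelic risk score weights each allele by the logarithm of its allelic OR. Function names are hypothetical, and the OR values in the usage note are illustrative numbers within the reported range rather than estimates for specific SNPs.

```python
import math
from typing import Sequence


def risk_allele_count(genotypes: Sequence[int]) -> int:
    """Unweighted score: total number of risk alleles across SNPs."""
    return sum(genotypes)


def allelic_risk_score(
    genotypes: Sequence[int], allelic_ors: Sequence[float]
) -> float:
    """Weighted score: sum over SNPs of (risk allele count x log allelic OR)."""
    if len(genotypes) != len(allelic_ors):
        raise ValueError("one allelic OR is required per SNP")
    return sum(g * math.log(odds) for g, odds in zip(genotypes, allelic_ors))
```

For genotypes `[2, 1, 0]` and illustrative allelic ORs `[1.11, 1.46, 1.20]`, the unweighted count is 3, while the weighted score is 2 ln(1.11) + ln(1.46). When all allelic ORs are equal, the two scores are proportional and rank subjects identically.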

Discussion
We established a genetic risk model for breast cancer in subjects from three case-control studies in Japan using 14 risk loci identified in GWASs. The high-risk group, which accounted for 6.86% of the total population, had a 3.27 times higher breast cancer risk than the low-risk group. While the discriminatory ability of the genetic risk model alone was not satisfactory, its combination with an environmental risk model produced significantly improved performance. Further, performance of the combined risk model was consistent between premenopausal and postmenopausal females.
Many GWASs of breast cancer risk seek to elucidate carcinogenic mechanisms. However, these studies have had no direct impact on clinical practice [46]. One reason is the small effect of each locus. In our study also, the effect of each SNP on the risk of breast cancer was small. When aggregated, however, these risk alleles together would likely indicate substantial risk elevation in those in the high-risk group. In previous studies in Japanese females, the impact of genetic risk in the high-risk groups was larger than that of cigarette smoking, alcohol drinking and obesity [47][48][49][50]. Unlike smoking, drinking and obesity, however, genetic risk cannot be modified. Nevertheless, preventive approaches for women with high genetic risk should be considered.
The use of genetic risk stratification for breast cancer prevention should be investigated. Several studies have assessed preventive approaches to genetic risk [51,52]. One possible strategy is personalized breast cancer screening: currently, biennial mammography is recommended for Japanese women aged 40 years or older [53], but screening might be intensified in high-risk individuals. Appropriate frequency, examination modalities, and age of screening initiation by predicted individual breast cancer risk should be investigated. A second potential strategy is lifestyle modification via individual risk feedback. Feedback on genetic risk in combination with education about a healthy lifestyle might induce individuals to modify factors associated with breast cancer risk such as obesity, physical inactivity, alcohol drinking, and cigarette smoking. Appropriate lifestyle modification directly decreases breast cancer risk [54], while a healthy lifestyle attenuates genetic breast cancer risk [55]. However, lifestyle modification is difficult to achieve and sustain, and few studies have reported success in using risk feedback to change lifestyle. In addition, the impact of risk feedback and that of intervention almost certainly differ for each modifiable risk factor. Thus, weight management, physical activity, abstinence from drinking, and smoking cessation should be recommended with appropriate intervention strategies. Novel and personalized risk communications and interventions suitable for lifestyle modification should be investigated.
Compared to the allelic risk model, the risk score model might be expected to have attenuated discriminatory ability, because it assumes that all risk alleles confer the same magnitude of breast cancer risk. Although the allelic ORs of risk loci ranged from 1.11 to 1.46 in the total population, the C-statistics of risk score models were not poorer than those of allelic risk models, indicating that the risk score models could be used in place of the allelic risk models. The number of risk alleles and the three corresponding risk grouping levels are simple to implement and comprehensible for females in general populations, characteristics which would facilitate preventive interventions for breast cancer. The risk model could be applied in other populations, although useful sets of alleles must be assessed in each population. Randomized controlled studies to determine whether genetic risk feedback modifies individual behavior for breast cancer prevention are warranted.
Hundreds of loci associated with breast cancer risk have been identified in GWASs. A previous polygenic risk model based on a large GWAS dataset reported AUCs of 0.603, 0.630 and 0.636 using 77, 313 and 3820 SNPs, respectively [56]. These findings suggested that using more SNPs associated with breast cancer risk might improve model performance; in our present study, however, the inclusion of SNPs with low significance did not improve performance: genetic models with 22 and 42 SNPs offered no significant improvement over that with 14 SNPs. Indeed, three GWASs in Japanese populations identified only 31 loci in 19 regions [57][58][59]. Attempts to further improve the performance of genetic models by adding more SNPs would, therefore, require a larger sample size.
The major strength of this study was its study population. Because we established the risk model using three case-control studies conducted in geographically distant regions, the results are likely to be generalizable to the Japanese population. Nevertheless, the study was based on hospital-based case-control studies, meaning that several methodological limitations exist. First, the self-reported lifestyle factors considered as potential confounding factors might be subject to misclassification and recall bias. Second, selection bias in study subjects is inevitable in hospital-based case-control studies, and external validity should be interpreted carefully. However, distributions of alcohol drinking and cigarette smoking in controls were highly consistent with those in a national survey [60], suggesting that our study population did not differ from the Japanese general population in these regards. Third, we were unable to establish a risk prediction model by tumor subtype as not all cases had information on hormonal status. Against this, however, establishment of a risk model according to tumor subtype would have little meaning in personalized risk assessment for breast cancer. Fourth, there is no consensus on the categorization of high genetic risk. Further studies are required to evaluate reasonable and useful thresholds for genetically high-risk groups.

Conclusions
This genetic risk score model using 14 GWAS-identified loci in combination with environmental factors is able to stratify breast cancer risk. New breast cancer prevention strategies for genetically high-risk populations should be developed.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/cancers13153796/s1, Table S1: List of breast cancer-associated SNPs reported in GWAS studies, Table S2: ORs of genetic risk groups stratified by menopausal status, Figure S1: Calibration plot of the inclusive model in the total population, Table S3: C-statistics of genetic, environmental and inclusive risk models stratified by menopausal status, Table S4: C-statistics of genetic risk score models including 14, 22 and 42 SNPs, Table S5: C-statistics of three levels of genetic risk score model, risk score model and allelic risk model.