Uncovering the Contribution of Moderate-Penetrance Susceptibility Genes to Breast Cancer by Whole-Exome Sequencing and Targeted Enrichment Sequencing of Candidate Genes in Women of European Ancestry

Simple Summary Genetic variants explaining approximately 40% of familial breast cancer risk have been identified, thus leaving a significant fraction of the heritability of this disease still unexplained. The exact nature of this missing fraction is unknown; more extensive sequencing efforts could potentially identify new moderate-penetrance breast cancer risk alleles. The aim of this study was to perform a large-scale whole-exome sequencing study, followed by a targeted validation, in breast cancer patients and healthy women of European descent. We identified 20 novel genes with modest evidence of association (p-value < 0.05) for either overall or subtype-specific breast cancer; however, much larger studies are needed to confirm the exact role of these genes in susceptibility to breast cancer. Abstract Rare variants in at least 10 genes, including BRCA1, BRCA2, PALB2, ATM, and CHEK2, are associated with increased risk of breast cancer; however, these variants, in combination with common variants identified through genome-wide association studies, explain only a fraction of the familial aggregation of the disease. To identify further susceptibility genes, we performed a two-stage whole-exome sequencing study. In the discovery stage, samples from 1528 breast cancer cases enriched for breast cancer susceptibility and 3733 geographically matched unaffected controls were sequenced. Using five different filtering and gene prioritization strategies, 198 genes were selected for further validation. These genes, and a panel of 32 known or suspected breast cancer susceptibility genes, were assessed in a validation set of 6211 cases and 6019 controls for their association with risk of breast cancer overall, and by estrogen receptor (ER) disease subtypes, using gene burden tests applied to loss-of-function and rare missense variants. Twenty genes showed nominal evidence of association (p-value < 0.05) with either overall or subtype-specific breast cancer. Our study had the statistical power to detect susceptibility genes with effect sizes similar to ATM, CHEK2, and PALB2, however, it was underpowered to identify genes in which susceptibility variants are rarer or confer smaller effect sizes. Larger sample sizes would be required in order to identify such genes.


Introduction
Breast cancer is the most common cause of cancer-related death and the most frequently diagnosed cancer among women worldwide [1]. Currently, approximately 40% of familial breast cancer risk is explained by a combination of common low-penetrance variants [2,3], together with coding rarer variants in predisposition genes, such as BRCA1, BRCA2, PALB2, ATM, and CHEK2 that confer higher risks [4]. Thus, a significant fraction of the heritability of this disease is still unexplained. A better understanding of the genetic risk factors contributing to breast cancer can improve risk prediction, and hence better inform screening and prevention strategies, and may also inform understanding of the biology underlying breast cancer predisposition.
While linkage studies have identified high-risk breast cancer susceptibility genes, and genome-wide association studies (GWAS) using single-nucleotide polymorphism (SNP) arrays have identified many common variants conferring modest risks, neither of these approaches are powerful enough to identify rarer variants conferring moderate risk. Several genes associated with moderate risk (two-to four fold) of breast cancer have been identified by candidate gene sequencing approaches. These targeted approaches are limited to our understanding of the pathways involved in breast cancer etiology, and have thus mainly identified new breast cancer susceptibility alleles within genes either coding for proteins that interact with BRCA1 or BRCA2 (e.g., PALB2), or other genes involved in DNA repair processes, such as ATM and CHEK2. Since most genes have not yet been subjected to large-scale sequencing studies, it is possible that other susceptibility genes exist, which would explain some of the residual heritability of breast cancer.
Recently, whole-exome sequencing (WES) has provided a comprehensive, agnostic approach to exploring associations between rarer coding variants and disease. Compared to whole-genome sequencing (WGS), WES has a substantially lower cost and generates results that are more easily interpretable [5]. In the last few years, a number of WES studies have been carried out for breast cancer [6,7]. These studies rely on the sequencing of cases enriched for family history of disease, and matched controls, sometimes in combination with analysis of the segregation of the variants associated with disease in families. However, very few novel susceptibility genes have been robustly identified by these studies. This lack of success may, in part, be due to the lack of statistical power: since deleterious variants in most genes are likely to be very rare, studies need to be very large to detect them. More recently, the largest WES study for overall breast cancer reported increased breast cancer risk associated with ATM, CHEK2, PALB2, and MSH6 [8]. However, while the first three genes are already well established, the association with MSH6 has not been definitively replicated in large, targeted sequencing studies.
In an attempt to overcome these shortcomings, we designed a large-scale two-stage WES study. The discovery step was carried out using WES data from 1528 breast cancer cases enriched for genetic susceptibility to the disease, based on early onset breast cancer, a family history of disease, or bilateral breast cancer, and 3733 geographically matched, unaffected controls. The validation step assessed the associations between the presence of loss-of-function (LoF) and rare missense variants in 198 selected candidate genes and the risk of breast cancer overall, as well as estrogen receptor (ER) disease subtypes, in a replication set of 6211 breast cancer cases and 6019 controls. In addition to these genes, a panel of 32 known or suspected breast cancer susceptibility genes was also included in this targeted enrichment sequencing study.

. Studies and Datasets
For the discovery stage, WES was performed on 1528 breast cancer cases, and the resulting data were compared with whole-exome data obtained from 3733 unaffected controls (3483 unaffected women from four studies, as described below, and 250 unaffected men from GoNL trios) (Table S1). Briefly, breast cancer cases were selected from two clinic-based studies: The German Consortium for Hereditary Breast and Ovarian Cancer (GC-HBOC) (1021 cases) [9], with half of the samples originating from the Cologne region (490 samples) and the other half from the Munich area (531 samples), and The Dutch Familial Bilateral Breast Cancer Study (DFBBCS) (511 cases) [10][11][12]. All subjects had previously tested negative for the presence of BRCA1 and BRCA2 germline mutations in a clinical genetic setting. All subjects from the GC-HBOC had a family history of breast cancer, while all subjects from the DFBBCS had bilateral breast cancer. Control wholeexome data were obtained from four studies: The Rotterdam Study [13], The Genome of the Netherlands Project (GoNL) [14,15], The German Study on Ageing, Cognition, and Dementia (AgeCoDe) [16], and The KORA Study [17]. A more detailed description of these studies and datasets is included in Table S1 and in the Supplementary Methods. All study subjects were recruited based on protocols approved by the Institutional Review Boards at each participating institution, and all subjects provided written informed consent.

Selection of Breast Cancer Cases
Breast cancer cases were selected according to the following criteria: early onset (<50 years of age) and a family history of one or more first-degree relatives with breast cancer diagnosed before 50 years of age, or two relatives diagnosed under the age of 60. A scoring system was established for the selection of families based on the assumption that rare, moderate-risk variants might be expected to confer a higher relative risk at young ages, as is the case for most known susceptibility genes [10]. In addition, the number of affected relatives weighted by their degree of relationship to the case was taken into account. Briefly, families were scored according to the following: (a) For each case in the family, a score of 1 was given if the case was diagnosed at age 50 or older, a score of 1.5 was given if diagnosed between 40-49 years, and a score of 2 if diagnosed at <40 years of age. (b) Cases were then weighted by their degree of relationship to the index case, where a weight of 1 was attributed for the index case, 0.5 for first-degree relatives, and 0.25 for second-degree relatives. (c) For bilateral cases, both cancers were included (thus giving a score of 2). (d) The total score for the family was obtained by the sum of these scores. Carriers of deleterious germline mutations in BRCA1 and BRCA2, and CHEK2 c.1100delC carriers were excluded. Related samples were identified on the basis of genotype data, and the individual with the youngest age of onset was chosen. The score distribution among breast cancer cases was as follows: 0.6% with a score below 2; 16.7% with a score between 2 and <2.5; 33.6% with a score between 2.5 and <3; 25.5% with a score between 3 and <3.5; 14.1% with a score between 3.5 and <4; 9.5% with a score ≥4.

Library Preparation, High-Throughput Sequencing, and Bioinformatics Analysis
Libraries for samples from GC-HBOC cases were prepared using the Agilent SureS-electXT Human All Exon V5 capture kit, while libraries for samples from DFBBCS cases were prepared using NimbleGen SeqCap EZ Exome Library v3.0. (Table S1). Barcoded libraries were sequenced on an Illumina HiSeq2500 for paired-end 100 bp sequencing. Available sequencing data from controls was obtained from other studies. These data were generated using the following capture technologies: NimbleGen SeqCap EZ Exome V2, NimbleGen SeqCap EZ Exome V3, Agilent SureSelect Human All Exon V3 and V5 (Table S1). Whole-exome data from cases and controls were demultiplexed and aligned to the reference genome using BWA-MEM to create aligned BAM files, followed by base quality score recalibration. The analysis of sequencing data from cases and controls was  Figure 1 and in Supplementary Methods. Most variant filtering strategies are similar for all groups, reflecting current best practices.

Aggregation of Gene Lists
Following these gene prioritization analyses, the results from each group underwent manual scrutiny and were then merged into a final list of candidate genes for validation. The selection process first included the top 30-ranked genes from each group. Each group then nominated additional putative breast cancer susceptibility genes according to their analysis and expertise. Finally, the list was augmented by the inclusion of genes that were among the top 500-ranked by at least four groups. Through this selection process, 198 candidate genes identified in the WES were selected for targeted resequencing (Table S2). In addition to these candidate genes, this targeted sequencing enrichment stage also included 32 genes for which there was prior evidence of association with breast cancer risk [4,18]. These 32 genes were selected based on their a priori associations with the disease, and were not considered in the individual gene ranking process.

Aggregation of Gene Lists
Following these gene prioritization analyses, the results from each group underwent manual scrutiny and were then merged into a final list of candidate genes for validation. The selection process first included the top 30-ranked genes from each group. Each group then nominated additional putative breast cancer susceptibility genes according to their analysis and expertise. Finally, the list was augmented by the inclusion of genes that were among the top 500-ranked by at least four groups. Through this selection process, 198 candidate genes identified in the WES were selected for targeted resequencing (Table S2). In addition to these candidate genes, this targeted sequencing enrichment stage also included 32 genes for which there was prior evidence of association with breast cancer risk [4,18]. These 32 genes were selected based on their a priori associations with the disease, and were not considered in the individual gene ranking process.

Sample Sets
A total of 12,655 samples, including 6518 women affected with breast cancer, and 6137 controls, were selected for the validation stage. These were drawn from five sources, namely: GC-HBOC [9] (n = 5966, 3199 cases and 2767 controls); a Dutch validation set comprising samples from two studies, the Amsterdam Breast Cancer Study-Familial (ABCS-F) [10] and the Rotterdam Breast Cancer Study (RBCS) [19] (n = 1941, 979 cases and 962 controls); Studies of Epidemiology and Risk factors in Cancer Heredity (SEARCH) [20][21][22] (n = 2637, 1289 cases and 1348 controls); the Ontario Familial Breast Cancer Registry (OFBCR) [23] (n = 1191, 600 cases and 591 controls) and CARTaGENE (CaG) [24] (n = 920, 451 cases and 469 controls). The GC-HBOC samples did not overlap with those used for the discovery step. A more detailed description of these studies is included in Table S1 and in Supplementary Methods. All study subjects were recruited based on protocols approved by the Institutional Review Boards at each participating institution, and all subjects provided written informed consent.

Power Calculations
To perform power calculations for the replication stage, we assumed a significance level of p-value < 10 −4 . The standard "exome-wide" significance level is p < 2.5 × 10 −6 ; this is equivalent to assuming a prior probability 40-fold higher than an average gene. This is also the level proposed by Easton et al. [4] for assessing the evidence for candidate DNA repair genes included in gene panels. To allow for the oversampling of cases with a family history, bilaterality, or early age at onset, we assumed that the effect size (log odds ratio) was increased by a factor of 1.5. Based on these assumptions, the power exceeded 50% for genes with a carrier frequency >0.005 and an odds ratio (OR) in the population >2, or a carrier frequency >0.001 and an OR > 4. Based on the effect sizes and European allele frequencies [18], the power of our study to detect genes with the same characteristics as ATM, CHEK2, and PALB2 was 0.44, 1.0 and 0.95, respectively. However, the power was low to detect genes with the characteristics of RAD51C (0.0039), RAD51D (0.0025), or BARD1 (0.017).

Library Preparation and High-Throughput Sequencing
Library preparation was performed at the Cologne Center for Genomics, the Center for Familial Breast and Ovarian Cancer, and the Genomics Centre at CHU de Québec-Université Laval Research Center. Sequencing was performed at two centers: the Cologne Center for Genomics, Cologne, Germany, using Illumina HiSeq 4000 sequencers for paired-end 75 bp sequencing (mean coverage 140×); and the Genomics Centre at CHU de Québec-Université Laval Research Center, Quebec City, Canada, using Illumina HiSeq 2500 for paired-end 100 bp sequencing (mean coverage 185×). Additional details on library preparation and quality control procedures are given in the Supplementary Methods.

Bioinformatics and Statistical Analyses
Bioinformatics analyses were performed on the Compute Canada supercomputer, Graham. A pipeline was set up following GATK Best Practices [25]. GATK Unified Genotyper was used for variant calling, and variants were annotated and quality-controlled using VICTOR (variant interpretation for clinical testing or research), and both gene-level and variant-level significances were computed using PERCH (polymorphism evaluation, ranking and classification for heritable traits) [26]. PERCH quantitatively integrated deleteriousness prediction, allele frequency information, rare variant association analysis, biological relevance assessment, and the quality of variant calls. Additional details on the bioinformatics pipeline and variant calling procedure are given in the Supplementary Methods. Following these quality control and filtering steps, data from a total of 12,230 samples were further analyzed, including 6211 breast cancer cases and 6019 controls.
The principal association analysis was a gene burden analysis, in which the genotypes were collapsed into a 0/1 variable for each gene, according to whether or not a deleterious variant was carried. For each gene, variant carriers were defined using the VICTOR software suite. LoF variants were defined as (1) StopGain or FrameShift variants, excluding those affecting only the last 5% of coding sequences (CDS); (2) SpliceSite variants that include coding and non-coding regions; and (3) variants that inhibit transcription or translation. Rare missense variants included in these analyses were those showing a deleteriousness score according to the BayesDel algorithm [26] (Tables S3-S5). A final list of 7437 variants was established, including LoF variants (n = 1696) and rare missense variants (n = 5741). Among these, 25 LoF variants had a minor allele frequency (MAF) between 0.01 and 0.001, and 98.5% of LoF variants had an MAF < 0.001; in contrast, 21 missense variants had an MAF between 0.01 and 0.001, and the remaining 99.6% of missense variants had an MAF < 0.001.
ORs and 95% confidence intervals (CI) associated with carrier status were computed using logistic regression, adjusting for study population as a covariate, and p-values were derived from Wald tests. For analyses performed by individual study, p-values were computed via Fisher's exact tests. All statistical analyses were performed in R v3.6.0.

Results
A final list of 7437 variants including LoF (n = 1696) and rare missense variants (n = 5741) were included in the gene burden analyses. Quantile-quantile (Q-Q) plots are shown in Figure S1A for LoF variants, and Figure S1B for missense variants. When known breast cancer susceptibility genes were excluded, there was no evidence of inflation in the test statistics.

Known or Suspected Breast Cancer Susceptibility Genes
Targeted enrichment sequencing of the panel of 32 known or suspected breast cancer susceptibility genes confirmed the associations of known breast cancer susceptibility genes ATM, CHEK2, and PALB2 ( Figure 2, Tables 1, S6 and S7). It should be noted that these analyses did not include BRCA1 or BRCA2, as our original study design excluded samples from families that carried mutations in either of these genes. The principal association analysis was a gene burden analysis, in which the genotypes were collapsed into a 0/1 variable for each gene, according to whether or not a deleterious variant was carried. For each gene, variant carriers were defined using the VIC-TOR software suite. LoF variants were defined as (1) StopGain or FrameShift variants, excluding those affecting only the last 5% of coding sequences (CDS); (2) SpliceSite variants that include coding and non-coding regions; and (3) variants that inhibit transcription or translation. Rare missense variants included in these analyses were those showing a deleteriousness score according to the BayesDel algorithm [26] (Tables S3-S5). A final list of 7437 variants was established, including LoF variants (n = 1696) and rare missense variants (n = 5741). Among these, 25 LoF variants had a minor allele frequency (MAF) between 0.01 and 0.001, and 98.5% of LoF variants had an MAF < 0.001; in contrast, 21 missense variants had an MAF between 0.01 and 0.001, and the remaining 99.6% of missense variants had an MAF < 0.001.
ORs and 95% confidence intervals (CI) associated with carrier status were computed using logistic regression, adjusting for study population as a covariate, and p-values were derived from Wald tests. For analyses performed by individual study, p-values were computed via Fisher's exact tests. All statistical analyses were performed in R v3.6.0.

Results
A final list of 7437 variants including LoF (n = 1696) and rare missense variants (n = 5741) were included in the gene burden analyses. Quantile-quantile (Q-Q) plots are shown in Figure S1A for LoF variants, and Figure S1B for missense variants. When known breast cancer susceptibility genes were excluded, there was no evidence of inflation in the test statistics.

Known or Suspected Breast Cancer Susceptibility Genes
Targeted enrichment sequencing of the panel of 32 known or suspected breast cancer susceptibility genes confirmed the associations of known breast cancer susceptibility genes ATM, CHEK2, and PALB2 ( Figure 2, Tables 1, S6 and S7). It should be noted that these analyses did not include BRCA1 or BRCA2, as our original study design excluded samples from families that carried mutations in either of these genes.

Rare Missense Variants
As described in the Materials and Methods section, rare missense variants included in these analyses were those showing a deleteriousness score according to the BayesDel algorithm [26] (Tables S3-S5 Table S7). The results of the tumor ER-subtype analyses are shown in Table S7. Among genes that were not associated with overall breast cancer risk, missense variants in BARD1 showed some evidence of association with ER-positive disease (OR = 2.14 (CI 95% 1.13-4.04), p-value = 0.020), while an association with ER-negative breast cancer was observed for missense variants in BRIP1 (OR = 3.78 (CI 95% 1.23-11.55), p-value = 0.020) ( Figure 2, Table S7).

Validation of Candidate Genes Identified at the Discovery Stage
Analysis of the 198 genes selected from the WES analysis identified 20 potential candidate susceptibility genes with evidence of association with breast cancer (p-value < 0.05) (Figure 3, Tables 2 and S8-S10).

Individual LoF and Missense Variant Analysis
As mentioned in the Materials and Methods section, the main association analysis performed in our study was a gene burden analysis. This approach increases the power to detect an association, and has been used by a vast majority of association studies, including those recently published by Dorling et al. [18] and Hu et al. [27]. Nevertheless, it has been shown that single variants can show evidence of association when a weaker signal is obtained at the gene level [28,29]. We therefore examined whether certain individual LoF or missense variants with higher frequencies showed evidence of association, and whether they were driving the observed associations for a given gene. In this analysis, we only examined individual variants observed in at least three carriers, at least two of which were observed in breast cancer cases. As shown in Supplementary Table S11A, a total of seven LoF variants in six different genes showed evidence of association with overall breast cancer (nominal p < 0.05). Three of these variants were observed in known breast cancer susceptibility genes, one in ATM (c.1564_1565delGA, OR = 8.77 (CI 95% 1.11-69.29), p-value = 0.040), which is reported as pathogenic in the ClinVar database [30], and two in CHEK2 With the exception of the ATM gene, PMS2 and RIC1 showed no evidence of association in the gene burden test. Even though these variants showed a deleteriousness score according to the BayesDel algorithm, caution should be taken when interpreting missense variants because their impact may differ greatly depending on their position in the gene (e.g., conserved or functional domains).

Discussion
Our study identified 20 new genes showing modest evidence of an association either with overall disease and/or with ER-positive breast cancer or ER-negative breast cancer (p-value < 0.05). These results are based on the analysis of LoF variants and rare missense variants. Of these twenty, three showed evidence of association with LoF variants for overall breast cancer (ZFAND1, TYRO3, TMEM206/PACC1), and a further six with breast cancer subtypes (DNAH11, PARP2, LAMC3, MTMR11, EPN3, SLC22A10). When missense variants were assessed, three further associations were observed for overall breast cancer (TMEM161A, SIPA1L1, ERCC2), and a further seven were observed for subtypes (RNF175, NCKAP1L, PHAX, SMARCA2, EML5, NTRK3, MED23). One additional association was observed when missense and LoF variants were combined (ABCC2). Detailed information on these 20 genes, including a description of the known functions and pathways in which they are involved, as well as the number of LoF and missense variants observed in the different analyses, are provided in Supplementary Tables S2-S5. It should be noted that none of the genes reached the p < 10 −4 threshold in the replication stage, suggested for candidate genes [4]. Moreover, the number of associations observed at the p < 0.05 level for each analysis is actually less than the number that would be expected by chance (10, given 198 genes). Thus, larger replication studies will be required to confirm or refute these associations.
Among these genes, ZFAND1, TYRO3, and TMEM206/PACC1 showed evidence of association with LoF variants for overall breast cancer. In ZFAND1, one variant (c.139-3_139delCAGG, rs1260212806) explained the majority of the observed association. It is located in a splice site junction (acceptor site). This variant was observed only in the German study, so it could be more frequent in this population. The TMEM206/PACC1 gene is involved in the progression of colorectal cancer by accelerating cell proliferation and promoting cell migration and invasion [31]. Five LoF variants were identified in this gene, one of which was observed in 93% of the carriers (c.631delC, rs532279691). TYRO3 has been found to be upregulated in various cancers, including AML, CML, multiple myeloma, melanoma, as well as uterine endometrial cancers [32][33][34]. Eight LoF variants were identified in TYRO3, including one recurrent variant c.308_308 + 1insCCTGAAGTCA (rs773930671), accounting for 85% of carriers of LoF variants in this gene. Our analysis of missense variants in TYRO3 revealed evidence of a protective effect, which is opposite to that observed for LoF variants. Eleven rare missense variants were identified in our study, the majority only being observed once, and seven of them only observed in controls.
Tumor subtype-specific analyses were performed and, as described below, some evidence of ER-specific associations was observed. It should be noted that ER status was only available for 3572 (57.51%) breast cancer cases (808 ER-negative cases and 2764 ERpositive cases). Tumor subtype analyses of LoF variants revealed evidence of association for ZFAND1, TYRO3, DNAH11, and PARP2 with ER-negative breast cancer, with ZFAND1 showing the strongest association (OR = 2.96 (CI 95% 1.59-5.50), p-value = 6.37 × 10 −4 ). The observed association for this gene is mainly due to the variant rs1260212806, previously mentioned for breast cancer overall. A few reports have linked variants (other than those observed in the current study) in DNAH11 with breast cancer risk, but these reports were based on comparatively small sample sizes (c.2081_2082del (p.Val694Glyfs*2)) [35] and studies involving specific populations (G > A (rs2494938-India)) [36]. More recently, large-scale GWAS studies performed by our group through the Breast Cancer Association Consortium (BCAC) identified a common variant located at the 3 -UTR region of this gene (rs7971) to be statistically significantly associated with breast cancer risk (p-value = 1.9 × 10 −8 ) [3]. Although further validation is needed, these observations seem to indicate that this locus may comprise both common and rare variants, with some evidence of association with breast cancer risk.
On the other hand, evidence of association with ER-positive disease with regard to LoF variants was observed for MTMR11, LAMC3, and EPN3. For MTMR11, the observed association is mainly due to one recurrent variant, c.14_15insG (rs587606143), which accounts for 94% of carriers of a LoF variant in this gene. Interestingly, a locus at 1q12-q21, which includes MTMR11, was identified by GWAS as being associated with mammographic density (rs11205303, OR = 0.73 (95% CI = 0.66-0.80), p = 2.64 × 10 −11 ) [37]. Large-scale GWAS and fine-mapping studies performed in the BCAC have shown that common variants at 1q21.2 are significantly associated with overall and ER-positive breast cancer risk (overall breast cancer p-value = 9 × 10 −14 ; ER + breast cancer p-value = 1 × 10 −12 ) [3,38]. One study has also reported altered expression levels of MTMR11 in breast cancer cells [39]. These observations make this locus an interesting candidate for further validation and characterization with regard to breast cancer. Although a modest association was observed in the current study for SLC22A10, the frequency of rare variants in cases and controls was very low for this gene (Tables S3-S5).
In addition to clinical estrogen receptor subtyping, analyses performed on the molecular subtypes of breast cancer, namely, luminal A, luminal B, HER2-enriched, basal-like, and normal-like, would have been relevant in the context of this analysis, given the recognized heterogeneity of ER-positive and ER-negative tumors. However, the limited number of samples for which we had data did not allow us to perform a meaningful analysis of these molecular subtypes.
A number of genes showed an excess of missense variants (MAF < 0.01) classified as deleterious using bioinformatics in silico methods, without showing a corresponding excess of LoF variants. The best putative candidates in this group were TMEM161A, ERCC2, and SIPA1L1 for overall breast cancer, RNF175 and NCKAP1L for ER-positive disease, and PHAX, SMARCA2, NTRK3, EML5, and MED23 for ER-negative disease. Because the observed effects did not reach the threshold significance level following the adjustment of probability for multiple testing, we cannot exclude the possibility that the excess of missense variants, in the absence of enrichment for LoF variants, may be a false-positive result. However, these findings may also suggest that some of these genes are intolerant of LoF variants. When looking at the best candidate genes for overall breast cancer, pLI (probability of being loss-of-function intolerant) scores and observed/expected (o/e) scores available on gnomAD seem to suggest that this may be the case for SIPAL1, with pLI = 1, o/e = 0.09 (0.05-0.17). Interestingly, we did not observe any LoF variants in this gene in any of our analyses. On the other hand, ERCC2 (pLI = 0, o/e = 0.82 (0.62-1.09)) and TMEM161A (pLI = 0; o/e = 0.53 (0.35-0.82)) scores seem to indicate that these genes are not LoF intolerant. For ER-subtype-specific associations, LoF intolerant scores were observed for NCKAP1L (pL1 = 0.85, o/e = 0.2 (0.13-0.33)), SMARCA2 (pL1 = 1, o/e = 0.12 (0.08-0.2)), and NTRK3 (pL1 = 1, o/e = 0.12 (0.06-0.26)) (https://gnomad.broadinstitute.org, accessed on 9 February 2022). Among these genes, ERCC2 is of particular interest, as it plays an important role in the nucleotide excision repair pathway, and common polymorphisms in this gene have been associated with risk of various types of cancers [40]. The evidence supporting ERCC2 as a more penetrant breast cancer susceptibility gene, however, is not convincing [41].
Our study also included the analysis of 32 well-established or suspected susceptibility genes included in breast cancer gene panels. For many genes in such panels, the evidence of an association with cancer is often weak, and, therefore, further validation is required to confirm their contribution to breast cancer risk [4]. The current validation study successfully detected confirmed breast cancer susceptibility genes such as ATM, CHEK2, and PALB2. The ORs calculated in our study of these frequently mutated genes are similar to those calculated in recent publications [8,18,[42][43][44][45] according to analyses of LoF and rare missense variants. Similar to the current analysis, the study by Dorling et al. [18] did not detect a statistically significant association of missense variants in PALB2 (OR = 0.96 (95% CI 0.87-1.06)) with breast cancer risk. It should be noted, however, that there is a significant overlap of the current study with the Dorling et al. dataset, which may contribute towards explaining these observations. For the remaining suspected susceptibility genes, including genes confirmed to be associated with breast cancer risk or predisposition to specific cancer syndromes, as studied by Lee et al. [46], Dorling et al. [18], and/or Hu et al. [27] (BARD1, CDH1, PTEN, RAD51C, RAD51D, STK11, TP53), the frequencies of LoF and rare missense variants were too low in our study to detect or support an association with breast cancer risk.

Limitations and Weaknesses
It should be noted that this study has some limitations. Firstly, the discovery step involved WES of samples from breast cancer cases followed by analysis of the resulting data with available WES or WGS control data. This study design was motivated by the possibility of taking advantage of several available datasets. However, many factors resulting from this choice may have affected our ability to identify likely candidate genes at the discovery step, such as: (1) The matched control data were derived from different exome enrichment methods. These methods do not have the same efficiency in terms of probe hybridization and exon capture. This may have contributed to uneven exon sequencing depth, which in turn could have affected the identification of variants [47]; (2) For the GoNL control dataset, exome data were extracted from WGS data. Although it is recognized that WGS outperforms WES, and provides a higher proportion of covered transcripts at a lower sequencing coverage and no strand bias [48], the 13× sequencing coverage for these samples may not have been sufficient for the purpose of identifying rare variants. In an attempt to minimize these impacts, several calling scenarios for the joint genotyping of case and control samples, taking into consideration ethnicity and capture technology, were performed in order to ensure that the most comprehensive variety of calling scenarios were covered. Five different variant filtering and prioritization strategies were also performed in an attempt to build the best candidate gene list for the validation step. Despite this, it cannot be excluded that insufficient or non-uniform sequencing coverage across exons between case and controls [49], batch effects, and population or cryptic stratification [50] could have had an impact on variant calling, which may not have been resolved at the bioinformatics level [48]. We also acknowledge the limitations of the currently available in silico tools for the prediction of pathogenic missense variants, and how these limitations could have impacted our gene selection.
Secondly, while the replication stage was quite large, important susceptibility may not have been selected at the discovery stage due to the limitations mentioned above. Moreover, although the validation stage was of a large enough scale to identify genes with effect sizes similar to those of CHEK2, ATM, and PALB2, the power was too low to detect genes with the characteristics of RAD51C, RAD51D, or BARD1. Indeed, the associations with LoF variants in these genes were all non-significant, despite the fact that the estimated ORs (~2) were comparable with those previously reported. Thus, if a susceptibility gene with the characteristics of the latter was among those identified at the discovery stage, the validation step would not have had sufficient power to confirm it. We therefore conclude that it is unlikely that any genes in the replication set with combined LoF frequencies comparable with ATM or CHEK2 are associated with moderate risk, but that moderate-risk genes with lower LoF frequencies could be present.
Finally, our study was based on WES, which does not consider non-coding regulatory regions or large genomic rearrangements. Regulatory variants have been shown to contribute to the heritability of several diseases. In particular, the common susceptibility variants for breast cancer identified through GWAS are mostly located in regulatory regions [51].

Conclusions
We found nominal evidence of association with breast cancer overall for LoF variants in three genes (ZFAND1, TMEM206/PACC1, and TYRO3), with ER-negative breast cancer in two genes, DNAH11 and PARP2, and with ER-positive disease in four genes LAMC3, MTMR11, EPN3, and SLC22A10. Analysis of rare missense variants identified evidence of associations of TMEM161A, SIPA1L1, and ERCC2 with overall breast cancer, RNF175 and NCKAP1L with ER-positive disease, and SLC22A10, PHAX, SMARCA2, EML5, NTRK3, and MED23 with ER-negative disease. Our study design shows that we had the power to detect genes with effect sizes similar to some confirmed breast cancer susceptibility genes, such as ATM and CHEK2. The fact that we did not identify similar novel genes suggests that the remaining breast cancer susceptibility could be explained by lower frequency variants. The outcome of our study thus provides crucial information with regards to the planning of future sequencing efforts.
In order to efficiently identify novel lower frequency variants, larger collaborative sequencing efforts will be needed. The data generated in the current project will most certainly be useful for combining with other similar large-scale breast cancer studies in order to obtain better power to detect such variants in moderate-risk genes.