Genome-Wide Analysis of Sex Disparities in the Genetic Architecture of Lung and Colorectal Cancers

Almost all complex disorders manifest epidemiological and clinical sex disparities, which might partially arise from sex-specific genetic mechanisms. Addressing such differences can be important from a precision medicine perspective, which aims to make medical interventions more personalized and effective. We investigated sex-specific genetic associations with colorectal (CRCa) and lung (LCa) cancers using genome-wide single-nucleotide polymorphism (SNP) data from three independent datasets. The genome-wide association analyses revealed that 33 SNPs were associated with CRCa/LCa at P < 5.0 × 10⁻⁶ in either males or females. Of these, 26 SNPs had sex-specific effects, as their effect sizes were statistically different between the two sexes at a Bonferroni-adjusted significance level of 0.0015. None had proxy SNPs within their ±1 Mb regions, and the closest genes to 32 SNPs were not previously associated with the corresponding cancers. The pathway enrichment analyses demonstrated the associations of 35 pathways with CRCa or LCa, which were mostly implicated in immune system responses, cell cycle, and chromosome stability. The significant pathways were mostly enriched in either males or females. Our findings provide novel insights into the potential sex-specific genetic heterogeneity of CRCa and LCa at the SNP and pathway levels.


Introduction
Sex disparities have long been reported in various malignancies, with most cancers predominantly affecting males and having better survival and lower mortality rates in females [1][2][3][4][5]. Lung (LCa) and colorectal (CRCa) cancers are among the top three most common malignancies in both males and females. They jointly comprised around 25.1% of new cancer cases in men and 17.6% in women in 2018. They were also among the leading causes of cancer-related deaths in 2018, accounting for around 31% and 23% of such deaths in males and females, respectively [6]. A study of the relative risks of different cancer types revealed that LCa and CRCa were among 32 other cancers that had significantly higher incidence rates in men across various geographical regions and gross domestic product (GDP) groups, with average male-to-female incidence rate ratios of 2.08 and 1.33, respectively [3]. Sex has also been suggested as a potential favorable prognostic factor for these cancers, conferring better survival to female patients [2][7][8][9]. LCa and CRCa were reported to have male-to-female mortality ratios of 2.31 and 1.42, respectively, and worse survival in males, with significant male-to-female hazard ratios of 1.17 and 1.08 after adjusting models for the age of subjects and stage of tumors [2].
In addition to the sex-dependent differences in incidence, survival, and mortality rates, LCa and CRCa have displayed some other clinical and histopathological sex disparities in tumor topology, clinical manifestations, aggression potentials, and responses to therapy [10][11][12]. For instance, several studies reported that the female-to-male ratios were >1 and <1 in right- and left-sided CRCa, respectively, which have different clinical in cancer research [5,22]. The evident contributions of genetic mechanisms to such sex disparities warrant further investigations into the sex-specific genetic architecture of LCa and CRCa, in particular due to their potential genetic heterogeneity. Exploring sex-specific genetic contributors to LCa and CRCa may provide more comprehensive insights into their underlying biological processes, which in turn may help implement more effective personalized and sex-specific medical interventions [5,10,22,33]. Searching genome-wide association databases [34,35] shows that the genetic analysis of sex disparities in LCa and CRCa has not received proper attention in previous GWAS. In this study, we performed sex-stratified genome-wide analyses of LCa and CRCa using phenotype and genotype data from three independent datasets to investigate potential sex disparities in the genetic predisposition to these common cancers.

Study Participants
Data from three independent studies were used: the Cardiovascular Health Study (CHS) [36], the Framingham Heart Study (FHS) [37,38], and the Health and Retirement Study (HRS) [39]. In each dataset, the genetic analyses were performed separately in females (i.e., CRCa-F and LCa-F) and males (i.e., CRCa-M and LCa-M). The cases comprised 211 and 237 females as well as 186 and 220 males with CRCa and LCa, respectively. Also, 8382 and 8354 unaffected females and 6312 and 6278 unaffected males were included as controls in the CRCa-F, LCa-F, CRCa-M, and LCa-M analyses, respectively. The cases and controls were identified either by the study researchers (FHS) or by decoding medical diagnoses (CHS) or Medicare claims (HRS) using the International Classification of Diseases codes, Ninth Revision (ICD-9). Our genetic analyses were performed on subjects of Caucasian ancestry, as there were not sufficient samples from other ethnicities. Table S1 (Supplementary File 1) provides summary demographic information for these three datasets. Figure 1 displays an overview of the analysis steps and main findings of our study.

Genotype Data and Quality Control (QC)
Our study made use of ~2 million genotyped and imputed SNPs. The imputation process has been detailed in [40]. Low-quality data were first filtered out, including: (1) SNPs with imputation r² < 0.7, (2) SNPs with minor allele frequencies (MAF) <5%, (3) SNPs/subjects with missing rates >5%, (4) SNPs that deviated from Hardy-Weinberg equilibrium at P < 1.0 × 10⁻⁶, and (5) SNPs and subjects/families with Mendel error rates >2% in the case of FHS, which is a family-based study. QC was performed using the PLINK package [41]. This resulted in ~1.3-1.7 million SNPs in the datasets under consideration (Supplementary File 1: Table S2).
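As an illustration only, the SNP-level filters (1) to (4) can be expressed as a simple predicate. The sketch below is not the PLINK workflow used in the study; the `SnpQc` record, its field names, and the example values are hypothetical, and the family-based Mendel error filter (5) is omitted.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Per-SNP quality metrics (hypothetical record; field names are assumptions)."""
    snp_id: str
    imputation_r2: float  # imputation quality
    maf: float            # minor allele frequency
    missing_rate: float   # fraction of subjects with a missing call
    hwe_p: float          # Hardy-Weinberg equilibrium test P-value


def passes_qc(snp: SnpQc) -> bool:
    """Keep a SNP only if it survives filters (1)-(4) described in the text."""
    return (
        snp.imputation_r2 >= 0.7      # (1) drop imputation r2 < 0.7
        and snp.maf >= 0.05           # (2) drop MAF < 5%
        and snp.missing_rate <= 0.05  # (3) drop missing rate > 5%
        and snp.hwe_p >= 1.0e-6       # (4) drop HWE P < 1.0e-6
    )


# Made-up example records, one per failure mode.
snps = [
    SnpQc("snp_a", 0.95, 0.21, 0.01, 0.40),  # passes all filters
    SnpQc("snp_b", 0.62, 0.30, 0.00, 0.75),  # fails imputation quality
    SnpQc("snp_c", 0.99, 0.03, 0.02, 0.10),  # fails MAF
    SnpQc("snp_d", 0.88, 0.12, 0.01, 1e-9),  # fails HWE
]
kept = [s.snp_id for s in snps if passes_qc(s)]
```

The thresholds are written as the complements of the exclusion rules, so a value exactly on a boundary (e.g., MAF of 5%) is retained.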

Genetic Models
Additive genetic models were fitted using PLINK package [41] to identify the association between SNPs and cancers of interest after adjustment for birth year, smoking history, and body mass index (BMI) of subjects, and the top 3-4 principal components of genotype data obtained by GENESIS R package [42]. To address the risk of inflation of type-I errors due to ignoring family structure [43], SNPs nominally (i.e., P < 0.05) associated with CRCa/LCa in FHS were reanalyzed by fitting generalized linear mixed models (using lme4 R package [44]) which contained family IDs as a random-effects covariate in addition to the fixed-effects covariates stated above [40,45]. The results of GWAS of each cancer from the three datasets under consideration were then combined through an inverse-variance meta-analysis after adjustment for genomic inflation (i.e., λ values). Meta-analysis was performed using GWAMA package [46].
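The combination step can be sketched as a fixed-effects inverse-variance meta-analysis. This is a minimal illustration and not the GWAMA implementation: the function names are hypothetical, and the genomic-control adjustment is assumed to take the common form of multiplying each standard error by √λ when λ > 1. Cochran's Q and I² are included because they are the usual heterogeneity measures reported alongside such estimates.

```python
import math


def gc_adjust_se(se: float, lam: float) -> float:
    """Assumed genomic-control correction: inflate SE by sqrt(lambda) if lambda > 1."""
    return se * math.sqrt(max(lam, 1.0))


def inverse_variance_meta(betas, ses):
    """Fixed-effects inverse-variance meta-analysis of per-study log odds ratios.

    Returns (pooled beta, pooled SE, two-sided P, Cochran's Q, I^2).
    """
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P-value
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return beta, se, p, q, i2


# Made-up per-study estimates and inflation factors for one SNP.
study_betas = [0.50, 0.50, 0.50]
study_ses = [gc_adjust_se(0.20, lam) for lam in (1.00, 0.98, 1.00)]
pooled_beta, pooled_se, pooled_p, q_stat, i2 = inverse_variance_meta(study_betas, study_ses)
```

With identical per-study estimates, as in this toy input, Q and I² are both zero, which corresponds to no observed heterogeneity.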

Discovery and Replication Analyses
We followed a commonly used discovery-replication strategy, considering each of CHS, FHS, and HRS as a discovery set and the other two datasets as its counterpart replication sets. An association signal was considered replicated if a SNP had P < 5.0 × 10⁻⁸ (i.e., genome-wide significance) or 5.0 × 10⁻⁸ ≤ P < 5.0 × 10⁻⁶ (i.e., suggestive significance) [40] in the GWAS of one dataset and P < 0.05 in another dataset, and had consistent directions of association in the discovery and replication sets. The SNPs that were not among the replicated set of SNPs but had significant P-values at genome-wide or suggestive significance levels in the conducted meta-analyses constituted the meta-analysis set of significant SNPs.
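The replication rule can be written as a small decision function. The sketch below is one reading of the criteria stated above; the function names, the result layout (study name mapped to a P-value and effect estimate), and the example numbers are assumptions made for illustration.

```python
GENOME_WIDE = 5.0e-8  # genome-wide significance threshold
SUGGESTIVE = 5.0e-6   # suggestive significance threshold
NOMINAL = 0.05        # replication threshold


def significance_tier(p: float):
    """Classify a discovery P-value; None means it does not qualify."""
    if p < GENOME_WIDE:
        return "genome-wide"
    if p < SUGGESTIVE:
        return "suggestive"
    return None


def is_replicated(p_disc, beta_disc, p_rep, beta_rep) -> bool:
    """Discovery hit, nominal replication, and the same direction of effect."""
    same_direction = beta_disc * beta_rep > 0
    return significance_tier(p_disc) is not None and p_rep < NOMINAL and same_direction


def replicated_across(results: dict) -> bool:
    """Treat each study in turn as discovery and every other study as replication."""
    for disc, (p_d, b_d) in results.items():
        for rep, (p_r, b_r) in results.items():
            if rep != disc and is_replicated(p_d, b_d, p_r, b_r):
                return True
    return False


# Made-up per-study (P-value, log odds ratio) pairs for two SNPs.
snp_consistent = {"CHS": (2.0e-6, 0.40), "FHS": (0.03, 0.20), "HRS": (0.60, -0.10)}
snp_opposite = {"CHS": (2.0e-6, 0.40), "FHS": (0.03, -0.20), "HRS": (0.60, 0.10)}
```

In this toy input, the first SNP is replicated (suggestive in CHS, nominal and concordant in FHS), whereas the second is not because the FHS effect points in the opposite direction.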

Novel Associations
CRCa/LCa-associated SNPs were considered as newly detected cancer variants if they were not associated with CRCa/LCa at P < 5.0 × 10⁻⁶ by previous GWAS available in databases such as GRASP [34] and the NHGRI-EBI GWAS catalog [35]. The LDlink webtool [47] was then used to search for possible proxy variants for the newly detected SNPs in the CEU population (i.e., Utah Residents with Northern and Western European Ancestry). A proxy variant was defined as a SNP that was located within ±1 Mb of a newly detected CRCa/LCa-associated SNP, was in LD with it (i.e., significant χ² in LD test), and was previously associated with the same cancer at P < 5.0 × 10⁻⁶.

Sex-specific Associations
SNPs disparately associated with CRCa/LCa in males and females were further analyzed by contrasting SNP effects between males and females to determine if their effects were sex-specific [48]:

$$\chi^2 = \frac{(b_f - b_m)^2}{se_f^2 + se_m^2}$$

where χ² is the Wald chi-square statistic, b_f and b_m are the SNP effects (i.e., the natural logarithm of odds ratios) in females and males, and se_f and se_m are their standard errors.
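A minimal sketch of this contrast is given below. It assumes the statistic is the squared difference of the two effects divided by the sum of their squared standard errors, referred to a chi-square distribution with one degree of freedom, and it compares the resulting P-value with the Bonferroni-adjusted level of 0.05/33 used in this study. The function name and the example effect sizes are hypothetical.

```python
import math

BONFERRONI_ALPHA = 0.05 / 33  # adjusted level for 33 tested SNPs (about 0.0015)


def sex_difference_test(b_f: float, se_f: float, b_m: float, se_m: float):
    """Wald chi-square test (1 df) for a difference between female and male effects.

    Assumed form: chi2 = (b_f - b_m)^2 / (se_f^2 + se_m^2).
    """
    chi2 = (b_f - b_m) ** 2 / (se_f ** 2 + se_m ** 2)
    # Upper-tail probability of a chi-square variable with one degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Made-up log odds ratios and standard errors for one SNP.
chi2, p_value = sex_difference_test(b_f=0.60, se_f=0.15, b_m=-0.10, se_m=0.15)
is_sex_specific = p_value < BONFERRONI_ALPHA
```

The P-value uses the identity between the one-degree-of-freedom chi-square tail and the complementary error function, so no external statistics library is needed.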

Pathway Enrichment Analysis
Pathway enrichment analyses were performed by the GSA-SNP2 package [49] using compound gene-based P-values, obtained according to the fastBAT method [50,51], to identify potential biological processes associated with the studied cancers in males and females. The canonical pathways from the Broad Institute gene set enrichment analysis (GSEA) [52] were considered as the reference pathways [53][54][55][56]. The significant pathways were determined at false discovery rates (FDR) [57] of 0.025 (CRCa-F and CRCa-M) and 0.05 (LCa-F and LCa-M) to keep the numbers of possible false-positive findings below one in each analyzed cancer.
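To make the FDR reasoning concrete, the sketch below assumes the Benjamini-Hochberg step-up procedure, which the text does not name explicitly, and shows how the expected number of false positives is approximated as the FDR multiplied by the number of significant pathways. The function names and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr):
    """Return the indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank whose P-value falls under its BH threshold.
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def expected_false_positives(n_significant: int, fdr: float) -> float:
    """Approximate expected count of false discoveries among significant results."""
    return n_significant * fdr


# Made-up pathway P-values.
pathway_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg(pathway_p, fdr=0.05)

# With 19 significant pathways at FDR 0.025, or 13 at FDR 0.05,
# the expected number of false positives stays below one.
crca_expected = expected_false_positives(19, 0.025)
lca_expected = expected_false_positives(13, 0.05)
```

Under this approximation, a stricter FDR for the analysis with more significant pathways is what keeps the expected false-positive count below one in each cancer.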

Fixed-Effects Covariates
Smoking history, birth year, and BMI were included as fixed-effects covariates in our GWAS models to address their potential confounding effects on SNP effect estimates, particularly due to their different distributions between males and females (Supplementary File 1: Table S1). Our meta-analyses revealed that smoking history and birth year were associated with CRCa and LCa in both males and females (P < 1.69 × 10⁻²) and BMI was associated with CRCa in both sexes (P < 7.74 × 10⁻³). However, their effects were not statistically different when their odds ratios were compared between males and females (Supplementary File 1: Table S3).

GWAS
The genomic inflation factors (λ) from our genome-wide association analyses (Supplementary File 1: Table S2) indicated the adequacy of population structure control [58]. Table 1 and Table S4 (Supplementary File 1) contain summary and detailed information regarding significant associations detected in our GWAS. We found that five SNPs (i.e., rs7593032, rs11000463, and rs11000467 in LCa-F; and rs9579517 and rs56357430 in CRCa-M) were associated with cancers of interest in a discovery dataset at the suggestive significance level (i.e., 5.0 × 10⁻⁸ ≤ P < 5.0 × 10⁻⁶) and were replicated at P < 0.05 in a replication dataset with the same directions of effects. In addition, there were 28 SNPs which were associated with CRCa or LCa in the conducted meta-analyses (P_META = 3.21 × 10⁻⁷ to 4.98 × 10⁻⁶; P_Q = 1.72 × 10⁻¹ to 9.75 × 10⁻¹; and I² values between 0 and 0.432). As seen in Table 1, there were several genes (i.e., GLRX3 (CRCa-F), PRKG1 (LCa-F), MPHOSPH8 (CRCa-M), LINC02039, MAP7, and GRIK1 (LCa-M)) to which multiple significant association signals were mapped. SNPs mapped to each of these genes were in high LD (0.855 ≤ r² ≤ 1 and D' = 1) with each other in the CEU population [47] (Supplementary File 1: Table S5).
None of the 33 detected SNPs and their corresponding chromosomal regions had significant association signals in both sexes. Of these, 26 SNPs were sex-specific, as their effect sizes (i.e., the natural logarithm of odds ratios) were statistically different between males and females at a Bonferroni-adjusted significance level of 0.0015 (i.e., 0.05/33) when compared by a Wald chi-square test (Table 2).
Table 2 note: + denotes that the SNP did not have sex-specific effects (i.e., P-value ≥ 0.0015 in the chi-square test comparing SNP effects in males and females).

Pathway Enrichment Analysis
Our analyses (Table 3) revealed that 11 and 13 pathways were significantly associated with LCa-F and LCa-M, respectively, at an FDR of 0.05. They were mainly involved in meiosis, chromosome maintenance and telomere/centromere organization, and DNA transcription. Of these, eight pathways were significant in both sexes, while three pathways in females and five pathways in males were specifically enriched in one sex. We also found that 19 pathways were associated with CRCa-M at an FDR of 0.025. They were mainly involved in immune system responses and signal transduction. No pathway was enriched in CRCa-F analyses at FDRs of 0.025 or 0.05; however, two pathways were significant at an FDR of 0.2. These two pathways were among the 19 CRCa-M-associated pathways. The other 17 pathways detected in CRCa-M were male-specific. None of the detected pathways were associated with both CRCa and LCa.
Abbreviations: CRCa = colorectal cancer; LCa = lung cancer; GSEA = Gene Set Enrichment Analysis Platform; KEGG = Kyoto Encyclopedia of Genes and Genomes [53]; BIOCARTA = BIOCARTA pathways [54]; PID = Pathway Interaction Database [55]; REACTOME = REACTOME pathway knowledgebase [56]; Size = number of genes in the pathway; Count = number of enriched genes in the pathway; NS = non-significant.

Discussion
Almost all complex diseases (including many cancer types) have manifested sex disparities in epidemiological and clinical studies (e.g., in incidence/prevalence rates or disease severity) [21], which may be due to hormonal effects, lifestyle risk factors, and genetic mechanisms, among others [5,10,21,22,40]. Investigating sex disparities in the genetic mechanisms underlying complex disorders may have translational impacts on medical interventions and has been stressed by the National Institutes of Health (NIH) [5,10,22,33].
In this study, we analyzed potential sex disparities in the genetic architectures of CRCa and LCa in three independent datasets which, to the best of our knowledge, were not used previously for the study of sex-specific genetic contributions to the cancer phenotypes of interest. Our GWAS revealed replicated association signals of five SNPs at P < 5.0 × 10⁻⁶ that were associated with CRCa in males or LCa in females. In addition, 28 other SNPs were significantly associated with CRCa or LCa at the suggestive significance level in the conducted meta-analyses (Table 1 and Table S4). None of the detected SNPs attained genome-wide significance in our study. This might be due to the insufficient sample sizes of this study or the heterogeneity of SNP effects in the studied cohorts (i.e., CHS, FHS, and HRS). All these 33 SNPs were potentially novel markers for the studied cancers, as their association signals were not reported by previous GWAS and, in addition, there were no proxy CRCa/LCa-associated SNPs within their ±1 Mb flanking regions. It should be noted that the significant associations detected in GWAS do not imply causality. Functional studies are needed to investigate whether the identified SNPs themselves or other variants in nearby chromosomal regions that are in high LD with these index SNPs contribute to the genetic architecture of the studied cancers. A literature review further delineated potential implications of these SNPs, their closest genes, and variants in nearby regions in CRCa and LCa. We found that the closest genes to these SNPs were not associated with the same cancer at P < 5.0 × 10⁻⁶ by previous GWAS [34,35], except for the CSMD1 gene (corresponding to rs13261356 detected in LCa-M), which was previously associated with LCa at the genome-wide significance level [34,59]. Therefore, they can be considered as potentially novel genes for the studied cancers.
However, nine of these genes (i.e., KHDRBS3, PRKG1, ZRANB1, THRB, FSTL4, HTR1E, LINC01377, MIR4675, and GRIK1) were previously implicated in cancers at other sites (i.e., other than those in our study) at P_GWAS < 5.0 × 10⁻⁶ [34,35], and seven genes (i.e., SNX31, KHDRBS3, GLRX3, THRB, MAP7, ATP8B1, and GRIK1) were prognostically linked to other cancers at P < 0.001 (The Human Protein Atlas [60]: www.proteinatlas.org accessed on 2019-2020) (Table 1). In addition, 21 of 33 detected SNPs were located within nine chromosomal regions which were not previously associated with the same cancers at P_GWAS < 5.0 × 10⁻⁶ (i.e., KHDRBS3/8q24.23 and GLRX3/10q26.3 (CRCa-F); PRKG1/10q21.1 (LCa-F); THRB/3p24.2, HTR1E/6q14.3, and MPHOSPH8/13q12.11 (CRCa-M); MAP7/6q23.3, ATP8B1/18q21.31, and GRIK1/21q21.3 (LCa-M)).
Notably, 26 SNPs had sex-specific effects, as they were significant only in males or females and their odds ratios were statistically different between the two sexes (Table 2). None of these SNPs were among or in LD with previously reported sex-linked SNPs [61,62]. The sex-specific SNP associations with CRCa and LCa may advance the understanding of the underlying mechanisms of these common cancers in the two sexes by guiding functional studies in the detected chromosomal regions. Such sex-specific genetic factors may have translational implications in the era of personalized medicine by providing more efficient and cost-effective sex-specific medical interventions.
Our pathway enrichment analyses (Table 3) revealed that several pathways were significantly associated with CRCa and LCa (19 and 16 pathways, respectively). Most of the CRCa-associated pathways were related to immune system functions. Intact and dysfunctional immune system responses were previously implicated in preventing and promoting tumorigenesis of CRCa, respectively [63,64]. The LCa-associated pathways were mostly involved in DNA replication/transcription and chromosome stability, whose potential roles in LCa were previously highlighted [65][66][67]. Sex disparities were also noticed at the pathway level, as most of the significant pathways (i.e., 17 pathways in CRCa and 8 pathways in LCa) were sex-specifically associated with the cancers of interest.
Limitations. Our analyses were generally underpowered for detecting association signals of SNPs with very small effect sizes and/or low MAFs (e.g., <0.05). Analyzing datasets with larger sample sizes would provide more statistical power and may replicate some of the detected disparate associations at genome-wide significance level and discover additional sex-specific associations. In addition, investigating potential sex disparities that may exist in the genetic architecture of different stages and/or histopathologic subtypes of CRCa and LCa may increase our knowledge about the genetic heterogeneity of these common cancers, although this requires sufficiently large sample sizes and availability of the staging and histopathologic data for the analyzed patients.

Conclusions
Our genome-wide analyses revealed associations of 33 SNPs (mapped to 19 genes) with CRCa or LCa at suggestive significance levels, which were significant in either males or females. None of these associations were reported by previous GWAS, and there were no proxy SNPs within ±1 Mb regions of the identified SNPs. Of these, 26 SNPs had sex-specific effects, evidenced by significantly different effect sizes (i.e., the natural logarithm of odds ratios) between the two sexes. Our pathway enrichment analyses revealed that 35 pathways, mainly involved in immune system functions, DNA replication/transcription, and chromosome stability, were associated with the studied cancers. Twenty-five of these pathways were significant in either males or females. The potential sex-specific contributions to the genetic architecture of CRCa and LCa identified in our study provide novel insights into the genetic heterogeneity of these common cancers, although they do not imply causality. Such sex-specific associations, if replicated in independent genome-wide studies and/or corroborated in functional studies, may have translational impacts on the medical interventions in CRCa and LCa.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/genes12050686/s1, Supplementary File 1: Supporting Acknowledgment, Table S1. Demographic information about the analyzed datasets; Table S2. Numbers (and percentages) of analyzed SNPs along with the genomic inflation factors (λ) resulting from our genome-wide association analyses; Table S3. The odds ratios of fixed-effects covariates in females and males from the conducted meta-analyses; Table S4. Cancer-associated SNPs from genome-wide association analyses; Table S5. Linkage disequilibrium measures among SNPs with significant association signals that were mapped to the same gene; Figure S1. Manhattan plot of the genome-wide association analyses of colorectal cancer in females (CRCa-F); Figure S2. QQ plot of the genome-wide association analyses of colorectal cancer in females (CRCa-F); Figure S3. Manhattan plot of the genome-wide association analyses of lung cancer in females (LCa-F); Figure S4. QQ plot of the genome-wide association analyses of lung cancer in females (LCa-F); Figure S5
Funding: This research was supported by Grants from the National Institute on Aging (P01AG043352, R01AG047310, R01AG061853, R01AG065477, and R01AG070488). The funders had no role in study design, data collection and analysis, decision to publish, or manuscript preparation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Institutional Review Board Statement: The CHS, FHS, and HRS studies, whose data were analyzed here, were approved by the corresponding institutional review boards (IRBs) and were conducted after obtaining written informed consent from participants. The authors accessed data upon approval by the Duke University IRB.