Whole Exome-Wide Association Identifies Rare Variants in GALNT9 Associated with Middle Eastern Papillary Thyroid Carcinoma Risk

Simple Summary This study targeted the identification of rare variants in Middle Eastern papillary thyroid carcinoma (PTC) through an exome-wide association study. It was found that the GALNT9 gene was strongly associated with rare inactivating variants. Three genes (TRIM40, ARHGAP23, and SOX4) were associated with rare damaging variants (RDVs). Furthermore, seven genes (VARS1, ZBED9, PRRC2A, VWA7, TRIM31, TRIM40, and COL8A2) were associated with PTC risk. The study unveils some new genes to be potential candidates for PTC predisposition. Abstract Papillary thyroid carcinoma (PTC) is the commonest thyroid cancer. The majority of inherited causes of PTC remain elusive. However, understanding the genetic underpinnings and origins remains a challenging endeavor. An exome-wide association study was performed to identify rare germline variants in coding regions associated with PTC risk in the Middle Eastern population. By analyzing exome-sequencing data from 249 PTC patients (cases) and 1395 individuals without any known cancer (controls), GALNT9 emerged as being strongly associated with rare inactivating variants (RIVs) (4/249 cases vs. 1/1395 controls, OR = 22.75, p = 5.09 × 10−5). Furthermore, three genes, TRIM40, ARHGAP23, and SOX4, were enriched for rare damaging variants (RDVs) at the exome-wide threshold (p < 2.5 × 10−6). An additional seven genes (VARS1, ZBED9, PRRC2A, VWA7, TRIM31, TRIM40, and COL8A2) were associated with a Middle Eastern PTC risk based on the sequence kernel association test (SKAT). This study underscores the potential of GALNT9 and other implicated genes in PTC predisposition, illuminating the need for large collaborations and innovative approaches to understand the genetic heterogeneity of PTC predisposition.


Introduction
Thyroid cancer is the most common endocrine malignancy [1,2].Papillary thyroid carcinoma (PTC) is a particular subtype that is responsible for a significant majority of all thyroid cancers, accounting for approximately 85% of patients [3,4].The incidence of PTC has increased significantly in recent years [5,6].In Saudi Arabia, PTC is of high prevalence and it is the second most common cancer affecting females [7].Although PTC has favorable Cancers 2023, 15, 4235 2 of 10 outcomes, 3-10% of patients demonstrated recurrent disease within the first decade after treatment [8,9].
Heritable genetic factors contribute to the susceptibility of PTC in around 10% of patients who have a family history of the disease in first or second-degree relatives and, in many instances, genetic effect has been shown to extend even beyond the nuclear family [10][11][12][13][14].
Despite the suggestive evidence that PTC is relatively highly heritable, understanding the genetic underpinnings and origins remains a challenging endeavor across diverse ethnic populations.Genetic studies, such as genome-wide association studies (GWAS), have offered some valuable insights, identifying several common genetic variants associated with PTC [15][16][17][18][19].However, these identified variants account for only a modest portion of the total heritability of the disease, leaving a considerable proportion of the genetic architecture of PTC yet to be deciphered.
The advent of next-generation sequencing technologies such as whole-exome sequencing (WES) has been a powerful tool for uncovering the full spectrum of genetic variations that contribute to disease susceptibility.In contrast to GWAS, which primarily focus on common genetic variants, WES has the potential to uncover rare genetic variants that might play a significant role in various cancer susceptibilities [20][21][22].These rare variants, which potentially are more deleterious than their common counterparts, might have been overlooked in previous studies due to their low frequencies.Therefore, WES has the potential to provide a more comprehensive picture of the genetic contributors to a complex disease like PTC, especially in a unique ethnicity where the prevalence of PTC is higher.
In this study, we harnessed the power of WES to identify rare inactivating variants (RIVs) and rare damaging variants (RDVs) associated with PTC risk within the Middle Eastern population.We hope our findings can shed light on the potential roles of several genes including GALNT9 in conferring risk for PTC in this population, thereby contributing to a better understanding of this prevalent cancer.

Patient Selection
Two hundred and forty-nine PTC patients diagnosed between 2006 and 2018 at King Faisal Specialist Hospital and Research Center (Riyadh, Saudi Arabia) and Prince Sultan Military Medical City (Riyadh, Saudi Arabia), with fresh non-tumor tissue and peripheral blood available, were included in the study.Baseline clinico-pathological data were collected from case records and are summarized in Table 1.The histopathological subtypes of PTC were classified based on the fifth edition of the WHO classification for thyroid tumors [23].The staging of PTC was performed using the eighth edition of the American Joint Committee on Cancer (AJCC) staging system [24].Cases were identified based on clinical history followed by fine needle aspiration cytology for confirmation.

DNA Isolation
Genomic DNA was isolated either from fresh non-tumor tissues or peripheral blood utilizing a Gentra DNA isolation kit (Gentra, Minneapolis, MN, USA), following the manufacturer's recommendations as described previously [25].
The GATK's Haplotype Caller was used to call germline variants which were subsequently annotated using Annotate Variation (ANNOVAR) software (https://annovar.openbioinformatics.org/en/latest/,accessed on 2 March 2023) [27].The control population included in our cohort was our in-house data from the exome sequencing of 1395 normal samples sequenced at different points in time.To provide homogeneity, all variants (cases and controls) were annotated simultaneously with ANNOVAR.Only those variants were selected that were in the exonic or splicing region of the gene.Variants were excluded if they had a minor allele frequency of >0.01 in the dbSNP, the National Heart, Lung, and Blood Institute exome sequencing project, 1000 Genomes, the Exome Aggregation Consortium (ExAC), and the Genome Aggregation Database (gnomAD).A variant was considered a true positive if the variant allele frequency (VAF) was at least 20% with a minimum altered reads of 5, and a sequencing depth in the variant location region of ≥20 with a GATK quality score of ≥100.Filtered variants were checked using the Integrated Genomics Viewer to filter out the false positives and artifacts.
Rare inactivating variants (RIVs) are deleterious variants that include frameshift insertion, frameshift deletion, stop-gain, stop-loss, and splice-site variants with allele frequencies < 1% in our control cohort and the ExAC database.On the other hand, rare damaging variants (RDVs) are the types of variants which are either deleterious/inactivating or predicted to be pathogenic/deleterious with a score greater than 0.025, utilizing the M-CAP classifier [21,28].Furthermore, nonsynonymous variants include the variants with allele frequencies less than 0.1 in the ExAC database and our control cohort, including deleterious variants, missense single nucleotide variants, or non-frameshift deletions and insertions [21].
A Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was performed, which is available on the Database for Annotation, Visualization and Integrated Discovery (DAVID) platform [29].The gene list was uploaded with the default parameters for analysis.The results were considered statistically significant if p < 0.05.

Statistical Analysis
A chi square test was used to assess the departure from the Hardy-Weinberg equilibrium using a p-value < 0.001 as the cutoff.Variants were removed if they failed the Hardy-Weinberg equilibrium.
A sequence kernel association test (SKAT), which was implemented in an R package, was utilized for the association analysis.Standard parameters were used to compute the association.The p-values of non-synonymous variants were estimated using efficient resampling methods incorporated in the 'SKATBinary_Single' function.The exome-wide significance level was set at p < 2.5 × 10 −6 , whereas associations with p < 0.001 were labeled as suggestive [21].
To determine the power, simulation methods were applied which are available in the SKAT package under the 'power logistic' function.We repeated the simulation 100 times at different frequencies of variants in cases and odds ratios.The simulations were repeated to determine the power to detect a true association at the exome-wide (p < 2.5 × 10 −6 ) and suggestive thresholds (p < 0.001).A power analysis was also calculated through a power simulation in R with varying samples, effect sizes, and threshold levels of significance.

Results
The median age of the entire cohort of cases was 39.2 years (range: 10-83 years), with a male/female ratio of 1:3.6.The majority of the tumors were of the classical variant (50.2%; 125/249), whilst 47.8% (119/249) of tumors were multifocal.Extrathyroidal extension was noted in 41.8% (104/249) of cases.Regional LN metastasis was noted in 38.2% (95/249) of cases and distant metastasis was present at diagnosis in 6.4% (16/249) (Table 1).The power analysis in our study was conducted utilizing the simulation-based procedure based on the variant distribution observed in our cohort.We found that our study had 80% power to identify a gene significantly enriched for RIVs with an odds ratio of 10 and a carrier rate of 7.5% at the exome-wide statistical significance level (p < 2.5 × 10 −6 ).At the suggestive level (p < 0.001), our study had 80% power to detect an odds ratio of 10 and a carrier rate of 5% (Figure 1).noted in 41.8% (104/249) of cases.Regional LN metastasis was noted in 38.2% (95/249) of cases and distant metastasis was present at diagnosis in 6.4% (16/249) (Table 1).
The power analysis in our study was conducted utilizing the simulation-based procedure based on the variant distribution observed in our cohort.We found that our study had 80% power to identify a gene significantly enriched for RIVs with an odds ratio of 10 and a carrier rate of 7.5% at the exome-wide statistical significance level (p < 2.5 × 10 −6 ).At the suggestive level (p < 0.001), our study had 80% power to detect an odds ratio of 10 and a carrier rate of 5% (Figure 1).In the primary analysis focused on RIVs, GALNT9 had the strongest association (adjusted odds ratio OR = 22.75, p = 5.09 × 10 −5 ), meeting suggestive significance (2.5 × 10 −6 < p < 0.001).RIVs were carried in 4/249 (1.6%) cases compared to 1/1395 (0.1%) control (Table 2).This association was driven by three different variants including chr12: 132690466A>G, a splicing variant observed in two cases and one control (OR = 11.28,p = 0.013).The other two variants were observed in one case each and were absent in controls (OR = 16.84,p = 0.018).(Table 3).No other gene besides GALNT9 had suggestive evidence (p < 0.001) of an association with RIVs.In the primary analysis focused on RIVs, GALNT9 had the strongest association (adjusted odds ratio OR = 22.75, p = 5.09 × 10 −5 ), meeting suggestive significance (2.5 × 10 −6 < p < 0.001).RIVs were carried in 4/249 (1.6%) cases compared to 1/1395 (0.1%) control (Table 2).This association was driven by three different variants including chr12: 132690466A>G, a splicing variant observed in two cases and one control (OR = 11.28,p = 0.013).The other two variants were observed in one case each and were absent in controls (OR = 16.84,p = 0.018).(Table 3).No other gene besides GALNT9 had suggestive evidence (p < 0.001) of an association with RIVs.Furthermore, we utilized the filter-based approach to inspect the top 10 genes selected by counting the number of RIVs in the cases, which ranged from five RIVs in TEKT5 to eight in DNAH14.When comparing the cases with controls, eight genes had significant evidence of association (p < 0.05), including CYP4X1 (p = 0.004), RBM23 (p = 0.020), CRYGN (p = 0.007), TTC23L (p = 0.030), CYP2R1 (p = 0.030), AP3S1, (p = 0.002), TTC23 (p = 0.005), and TEKT5 (p = 0.005) (Table 4).In addition, a KEGG pathway analysis was performed for signaling pathway analysis.No pathways were identified to be significantly associated with an increased risk of developing PTC.
As part of the above exploratory analyses, SKAT analysis was utilized to analyze the cumulative effect of all nonsynonymous variants with an allele frequency of less than 1% in our cohort.Seven genes, valyl-tRNA synthetase 1 (VARS1), Zinc finger BED-type containing 9 (ZBED9), Proline Rich Coiled-Coil 2A (PRRC2A), Von Willebrand Factor A  5).All seven genes harbored at least one individual variant with p < 0.001.Five individual variants had an exome-wide significant association after Bonferroni Correction (Supplementary Table S1).

Discussion
Most genetic susceptibility association studies on PTC to date have focused mainly on common variants, and the majority have also focused on the European population rather than other Asian populations, especially Middle Eastern.However, the proportion of the missing heritability in PTC could be explained by low-frequency rare variants.
Our study underscores the valuable role of whole-exome sequencing in uncovering the genetic underpinning of PTC risk.We shed light on the presence of rare inactivating variants (RIVs), primarily in the GALNT9 gene, which according to our analysis had the strongest association with PTC in our research population.
The GALNT9 gene, as a part of the N-Acetylgalactosaminyl transferase (GALNT) gene family, is involved in the initial stages of O-linked protein glycosylation in the golgi apparatus [30,31].Thus, inactivating variants in the GALNT9 gene, as detected in our study, could disrupt normal glycosylation processes and potentially contribute to oncogenic pathways.However, the precise role of GALNT9 in cancer biology, particularly in PTC, remains under-explored and needs to be the focal point of future research to elucidate the potential role of GALNT9 in cancer susceptibility.
Previous evidence indicates that mutations in GALNT12, another member of the N-Acetylgalactosaminyl transferase gene family, might explain some of the familial CRC cases of unknown etiology, and it showed that deleterious variants in GALNT12 are associated with CRC susceptibility [32,33].In addition, several prior studies have also demonstrated the implications of other GALNT family members in different cancer types [34][35][36][37][38][39][40].This evidence lends credence to our findings and strengthens the hypothesis that GALNT9 may have a similar role in PTC.
The enrichment of rare damaging variants (RDVs) in other genes such as TRIM40, ARHGAP23, and SOX4 further complements our understanding of the heritable genetic landscape of PTC.Each of these genes have previously been implicated in various cancers, including gastrointestinal cancers, soft tissue sarcomas, osteosarcoma and prostate cancer [41][42][43][44][45][46][47][48][49].However, there are a lack of reports on the roles of the TRIM40, ARHGAP23 and SOX4 genes in the development of PTC.
Interestingly, in our attempt to collapse the rare variants into pathways across the genome and look for their associations with risk of PTC, we failed to identify any pathway being significantly associated with increased risk of PTC, using KEGG pathway analysis.
While our study provides an essential step toward comprehending the genetic basis of PTC, it is vital to acknowledge the challenges posed by the genetic heterogeneity of this disease.The rarity of identified variants such as the GALNT9 gene in this unique ethnic population demands larger collaborative efforts to substantiate their roles in PTC risk.The development and application of advanced statistical and bioinformatics methodologies Cancers 2023, 15, 4235 8 of 10 could facilitate more robust genetic associations containing larger study cohorts, which could eventually help in the discovery of more risk-associated variants.

Conclusions
In conclusion, our study found novel genes and rare variants in GALNT9, TRIM40, ARHGAP23, and SOX4 to be associated with PTC in the Middle Eastern population, suggesting potential biomarkers for PTC that are unique to this ethnicity.However, largerscale studies are warranted to validate these findings due to the relatively small number of cases tested.

Figure 1 .
Figure 1.Power analysis for a gene to be enriched in rare inactivating variants based on an empirical procedure.(A) Power at the exome-wide significance level: p < 2.5 × 10 −6 .(B) Power at the suggestive significance level: p < 0.001.

Figure 1 .
Figure 1.Power analysis for a gene to be enriched in rare inactivating variants based on an empirical procedure.(A) Power at the exome-wide significance level: p < 2.5 × 10 −6 .(B) Power at the suggestive significance level: p < 0.001.

Table 2 .
List of genes significantly associated with PTC risk based on rare variant analyses.

Table 2 .
List of genes significantly associated with PTC risk based on rare variant analyses.

Table 3 .
List of rare variants significantly associated with PTC risk.

Table 4 .
Top 10genes by number of cases based on RIV association analysis.

Table 5 .
Significant genes based on non-synonymous variant associations analyzed using SKAT.