Non-Syndromic Cleft Lip with or without Cleft Palate: Genome-Wide Association Study in Europeans Identifies a Suggestive Risk Locus at 16p12.1 and Supports SH3PXD2A as a Clefting Susceptibility Gene

Non-syndromic cleft lip with or without cleft palate (nsCL/P) ranks among the most common human congenital malformations, and has a multifactorial background in which both exogenous and genetic risk factors act in concert. The present report describes a genome-wide association study (GWAS) involving a total of 285 nsCL/P patients and 1212 controls from the Netherlands and Belgium. Twenty of the 40 previously reported nsC/LP susceptibility loci were replicated, which underlined the validity of this sample. SNV-based analysis of the data identified an as yet unreported suggestive locus at chromosome 16p12.1 (p-value of the lead SNV: 4.17 × 10−7). This association was replicated in two of three patient/control replication series (Central European and Yemeni). Gene analysis of the GWAS data prioritized SH3PXD2A at chromosome 10q24.33 as a candidate gene for nsCL/P. To date, support for this gene as a cleft gene has been restricted to data from zebrafish and a knockout mouse model. The present GWAS was the first to implicate SH3PXD2A in non-syndromic cleft formation in humans. In summary, although performed in a relatively small sample, the present GWAS generated novel insights into nsCL/P etiology.


Introduction
Orofacial clefting represents the second most common congenital malformation in humans, after various forms of heart defects all combined [1]. Despite advances in surgical correction, the disorder has lifelong implications for the health and social integration of those affected. Improved understanding of cleft etiology may facilitate development of new preventative measures and therapeutic approaches, and may improve genetic counseling for families at risk.
Clefting can occur either as part of a complex malformation syndrome or as an isolated anomaly, and several cleft subphenotypes have been defined according to the affected anatomical structures. The most frequent subphenotype is non-syndromic cleft lip with or without cleft palate (nsCL/P). In European populations, the estimated prevalence of nsCL/P is around 1:1000 [1]. The etiology of nsCL/P is multifactorial, whereby genetic risk factors, environmental exposures, and potential gene-environment interactions all contribute to disease susceptibility. The estimated contribution of all combined genetic factors to nsCL/P is 90% [2].
Since 1989, diverse genetic approaches have been used to identify genes and pathways underlying nsCL/P, including linkage and candidate gene studies. However, prior to the commencement of the genomics era around a decade ago, extensive research efforts had identified only two common genetic factors that could be considered true nsCL/P-associated risk factors: (1) the regulatory region of the Interferon Regulatory Factor 6 (IRF6), which was identified in a candidate gene association study; and (2) the Forkhead Box E1 (FOXE1) risk locus, which was identified in a meta-analysis of linkage data [3,4]. In the genomics era, new DNA sequencing techniques have enabled whole-exome sequencing, which has led to the identification of potential nsCL/P susceptibility variants in Cadherin 1 (CDH1) and a small number of other genes [5][6][7]. Variants detected to date via exome sequencing have been dominant, heterozygous, and can be important for the respective family as carriers of such variants can be at high risk. However, these are rare findings, and currently explain only a small fraction of patients.
At the time of writing, a total 40 common nsCL/P risk loci are known. However, these account for only a modest proportion of the genetic variance of nsCL/P, e.g., up to 30% of the narrow-sense heritability in the European population [19]. Thus, the existence of further common risk loci must be assumed. Given the tremendous success of the nsCL/P GWAS and follow-up study approach over the past decade, further GWAS appear to be warranted.
The present report describes a GWAS in a medium-sized nsCL/P case/control sample of European ethnicity recruited in the Netherlands and Belgium. In addition to the SNV-wise evaluation of data, gene-based and pathway analyses were performed. In these analyses, genetic marker data were aggregated to the level of whole genes or biological processing to test the joint association of all markers in the gene with the phenotype of interest, thereby increasing statistical power.

Ethics Statement
The study was approved by the Regional Committee on Research Involving Human Subjects Arnhem-Nijmegen, and the Review Board for Clinical Studies of the University Hospital KU Leuven. All study procedures were performed in accordance with the principles of the Declaration of Helsinki. The NBS protocol was approved by the Regional Committee on Research Involving Human Subjects Arnhem-Nijmegen. Written informed consent was obtained from all patients/participants, or their parents/legal guardians in the case of legal minors.

GWAS Patients
Patients were part of the AGORA project (Aetiological Research into Genetic and Occupational/Environmental Risk Factors for Anomalies in Children) [21]. The AGORA project commenced in 2005 and has established a large data and biobank of DNA samples and clinical and questionnaire data from: (i) children with congenital malformations or childhood cancer, (ii) their respective parents, and (iii) controls. Patients are recruited at two sites: (1) the Radboud University Medical Center (Radboudumc) Nijmegen, The Netherlands; and (2) University Hospital KU, Leuven, Belgium. Only patients with nsCL/P were eligible for inclusion in the present study. Patients with other cleft phenotypes, such as an isolated cleft palate, and patients with a possible specific malformation syndrome, intellectual disability, or other anomalies were excluded from the present study.
In the Nijmegen initiative, collection of data and biomaterials from patients with clefting commenced in 2007. Intraoperative blood or saliva samples were collected, and parents were asked to complete a questionnaire and to donate blood or saliva samples. The questionnaire addresses demographics and family history, among other factors. At the beginning of 2012, three "biomaterial donation days" were arranged to collect blood or saliva from clefting patients who had been treated at Radboudumc prior to 2007. At these sessions, blood or saliva samples and questionnaire data were also collected from parents.
In the Leuven initiative, collection of data and biomaterials from patients with clefting commenced in 2010. Blood samples and clinical and questionnaire data (e.g., demographics and family history) were collected from pediatric patients and their parents.
At both Leuven and Nijmegen clinical sites, all pediatric clefting patients underwent a clinical examination by an orthodontist, a maxillofacial surgeon, a plastic surgeon, and a clinical geneticist from the Cleft Lip and Palate team. To determine phenotype classification in the present cohort, data were retrieved from the respective medical charts.
The present analyses were performed using DNA samples from nsCL/P patients who were treated and followed up at: (1) Radboudumc (n = 219); or (2) University Hospital of KU Leuven (n = 66). The initial patient sample therefore comprised 285 nsCL/P patients. All patients were of self-reported European ancestry. Ancestral background in the Dutch and Belgian samples was assessed by identifying the origins of the grandparents and the parents, respectively.

GWAS Controls
Population-based controls (n = 1212) were drawn from the Nijmegen Biomedical Study (NBS), a population-based survey conducted by the Department for Health Evidence and the Department of Laboratory Medicine of the Radboud University Medical Center. This cohort was established to generate a universal reference population for the investigation of genetic variation and lifestyle and environmental exposures for a variety of traits and diseases in case/control studies [22]. A total of 22,451 age-and sex-stratified, randomly selected inhabitants of the municipality of Nijmegen received an invitation to fill out a postal questionnaire on lifestyle and medical history, and to donate an 8.5 mL blood sample in a serum separator tube and a 10 mL EDTA blood sample. The overall response to the questionnaire was 42% (n = 9350), and 69% (n = 6468) of the respondents donated blood samples.

GWAS Genotyping and Quality Control (QC)
Within the AGORA project, standard methods are used to extract DNA from blood collected in EDTA tubes or saliva specimens collected in ORAgene containers (DNA Genotek Inc., Ottawa, Canada). In the present study, genotyping of the 285 nsCL/P patients from AGORA was performed using the Illumina OmniExpressExome Array (Illumina, San Diego, CA, USA). Within the NBS project, DNA was genotyped using Illumina chips. For the present analyses, Illumina's OmniExpress Array (Illumina, San Diego, CA, USA) genotype data for the 1212 NBS participants were available. A total of 718,286 SNVs were present in both patients and controls.
Exclusion criteria for patient and control data were: (i) any discrepancy between documented and genotyped sex; (ii) a call rate of <99%; or (iii) evidence from multidimensional scaling (MDS) of ethnic outlier status ( Figure S1). Un-relatedness between individuals was evaluated using the program KING [23].
Markers were excluded from the analysis if the minor allele frequency (MAF) was <1% or the call rate was <95% in either patients or controls. In addition, markers were excluded if there was deviation from Hardy-Weinberg equilibrium at p < 10 −4 in controls or p < 10 −6 in patients.

Imputation of GWAS Data
The combined post-QC dataset for patients and controls was subjected to imputation using the June 2014 release of the 1000 Genomes Project and the program IMPUTE2 [24]. SNVs with an INFO score ≥ 0.4 and a MAF > 1% were then tested for association using SNPTEST [25]. For the logistic regression analysis, the first five components obtained from MDS were used as covariates.

Genome-Wide Association Analysis
Four patients were excluded due to gender incompatibility (discrepancy between documented and genotyped sex). Two patients and one control sample were excluded due to a call rate < 99%. Five patients and 24 controls were excluded due to relatedness. Fifteen patients were excluded due to ethnic outlier status. Following these QC procedures, 259 patients and 1187 controls remained for further analysis. In total, 8,785,346 imputed SNVs with an INFO score ≥ 0.4 and a MAF > 1% were finally tested. For each SNV passing QC, a logistic regression model was considered, with the additively coded SNV as the predictor variable and the first five MDS coordinates as covariates. The p-value of the likelihood ratio test of no association was then calculated.

Replication Analysis for Two Interesting SNVs
Two SNVs were of particular interest, as they had small p-values and were located in regions not reported to be nsCL/P risk loci. For these two interesting candidate SNVs (rs73145631 at chromosome 12, and rs56383345 at chromosome 16), replication was performed by genotyping these SNVs in three case/control nsCL/P samples: A MassARRAY genotyping assay was designed using the Assay Design Suite Software v 1.0 Software (AGENA Bioscience, San Diego, USA). The genotyping assay contained these two SNVs of interest and 26 SNVs from other projects. SNVs were genotyped using the MassARRAY system and end-point PCR, followed by matrix-assisted laser desorption/ionization time-of-flight (MALDI-ToF) mass spectrometry (AGENA Bioscience, San Diego, USA). Data analysis was performed using the AGENA Spectrodesigner Software package. Individual genotypes were assigned using the AGENA Typer Analysis software. Primers were synthesized at Metabion, Germany (individual primer sequences available upon request). Intra-and interplate duplicates were included for quality control purposes. No genotype inconsistencies were observed.
Association statistics were calculated by applying the Armitage-trend test separately for each sample cohort. For each SNV, relative risks of the three replication cohorts were combined using fixed-effect meta-analysis.

Prioritization of Candidate Genes
To prioritize candidate genes from the present GWAS dataset, SNV data with an INFO score > 0.6 were uploaded into FUMAGWAS (https://fuma.ctglab.nl/) [28]. Gene and gene-set analyses implemented in FUMAGWAS were based on GWAS summary statistics. In the present study, these analyses were performed with MAGMA, a tool that can be used for gene and gene-set analysis from GWAS data [29]. Input SNVs were mapped to 18,644 protein coding genes. Genome wide significance was defined as a p-value of 0.05/18,644 = 2.682 × 10 −6 .

Expression Analysis using SysFACE
Evaluation of expression of a given candidate gene and its neighboring genes was performed using the bioinformatics tool SysFACE (systems tool for craniofacial expression-based gene discovery; https: //bioinformatics.udel.edu/research/sysface). For each gene, expression data for various time-points of murine embryonic development in organs specific for the phenotype under study (maxilla, frontonasal, palate) were identified from microarray-based genome-level gene expression profiles across various mouse embryonic orofacial tissues.

SNV-Based Analysis
A total of 615,168 autosomal markers passed QC for both samples. The genomic inflation factor lambda was 1.044. The Q-Q plot is shown in Figure S2.
In the imputed GWAS data, 228 SNVs at a total of 25 different loci yielded p-values < 10 −5 and an INFO score > 0.8 (Table 1, Table S1, Figure S3).   This locus had not been reported as a recognized nsCL/P risk locus in previous studies. Therefore, rs56383345 was also subjected to replication in the three independent patient/control samples ( Table 2). A nominally significant p-value was obtained for rs56383345 in both the European and the Yemeni sample (0.027 and 0.0099 respectively). The p-value obtained after combination of all three replication samples by meta-analysis was also nominally significant (p = 0.00167). However, no  1 underlined = genome-wide significant SNVs, 2 major allele first, risk allele underlined, 3 only markers with a p-value < 10 −5 are presented. Pos = position, OR = odds ratio. The 25 loci also included an interesting locus at chromosome 16p12.1. This locus gave a suggestive p-value (5 × 10 −8 < p < 10 −5 ), and its lead SNV was rs56383345, with a p-value of 4.17 × 10 −7 (Figure 1).
Genome-wide significance (p < 5 × 10 −8 ) was reached by 63 SNVs at two loci. One of the genome-wide significant SNVs was rs73145631 at 12q23.1, a locus that has not been reported as an nsCL/P risk locus previously. It was subjected to a replication step in three independent case/control samples ( Table 2). In none of the three replication samples, nor after combining the replication data in a meta-analysis was the association for this SNV replicated. The other 62 genome-wide significant SNVs were located at an already well-known nsCL/P susceptibility locus at chromosome 8q24.21. This locus was initially identified in the Central European case/control sample that served as a replication sample for the present study [8]. It was later replicated in many other samples-among others, the Mexican and the Yemeni replication samples that were used for replication in the present study [26,27]. This locus was therefore not included in the replication step of the present study. Table 2.
Replication of two interesting SNVs in three replication samples of different biogeographical backgrounds. This locus had not been reported as a recognized nsCL/P risk locus in previous studies. Therefore, rs56383345 was also subjected to replication in the three independent patient/control samples (Table 2). A nominally significant p-value was obtained for rs56383345 in both the European and the Yemeni sample (0.027 and 0.0099 respectively). The p-value obtained after combination of all three replication samples by meta-analysis was also nominally significant (p = 0.00167). However, no association signal was obtained in the Mexican sample only (p = 0.577). Rs56383345 is located in a 930 kb non-coding region. Evaluation of the 16p12.1 region with the bioinformatics tool SysFACE did not implicate any flanking genes as cleft candidate genes.

Replication of Previously Reported nsCL/P Susceptibility Loci
Of the 40 previously reported nsCL/P susceptibility loci, 20 showed at least a nominally significant p-value (p < 0.05) in the present dataset (Table 3). For three of these loci, the lead SNV from the literature and the lead SNV in the present dataset were identical. At 15 of the 20 loci, a SNV in substantial linkage disequilibrium (LD) (r 2 > 0.6) with the lead SNV from the literature achieved a lower p-value. For two of the 20 loci, the lead SNVs were not in LD with the lead SNV from the literature.

Gene-Based Evaluation and Gene-Set Analysis
For the gene-based evaluation, genome-wide significance was set by FUMAGWAS at p < 2.672 × 10 −6 and was achieved by two different genes: (i) SH3 And PX Domains 2A (SH3PXD2A) at chromosome 10q24.33 (p = 1.82 × 10 −8 ); and (ii) anoctamin 4 (ANO4) at chromosome 12q23.1 (p = 1.46×10 −7 ) (Figure 2, Figure 3, Figure S4, Table S2). To date, neither of these genes has been proposed as a candidate gene for clefting in humans. For ANO4, the SysFACE evaluation revealed no expression at any relevant embryonic time-points in tissues of relevance to lip and palate development in the mouse model ( Figure S5). However, Growth Arrest Specific 2 Like 3 (GAS2L3), which is located upstream of ANO4, is expressed in the murine maxilla at E11.5 to E12.5, and Insulin-Like Growth Factor 1 (IGF1), which is located downstream, is expressed in the murine maxilla at E11.5 to E12.5 and also in the murine palate at E14.5.
Genes with a suggestive p-value (2.672 × 10 −6 < p < 10 −4 ) in the gene analysis included the gene Paired Box 7 (PAX7) at chromosome 1p36.13. PAX7 is a well-known cleft candidate gene ( Table 4). The PAX7 locus has shown genome-wide association in two previous studies, and rare, potentially pathogenic variants located in or near PAX7 have been reported in patients with orofacial clefting [30][31][32][33].      . An unreported suggestive locus at chromosome 12q23.1 (a) Regional association plot showing a genome-wide significant marker upstream of the anoctamin 4 (ANO4) gene. It is of note that in our dataset, there were no SNVs in LD with the lead SNV rs73145631. (b) Regional association plot for the ANO4 gene. Figure 3. An unreported suggestive locus at chromosome 12q23.1 (a) Regional association plot showing a genome-wide significant marker upstream of the anoctamin 4 (ANO4) gene. It is of note that in our dataset, there were no SNVs in LD with the lead SNV rs73145631. (b) Regional association plot for the ANO4 gene. The gene-set analysis implemented in FUMAGWAS revealed twelve gene ontology (GO) gene sets with an adjusted p-value < 0.05 (Table S2). The two gene sets with the lowest p-values were GO_TISSUE_MORPHOGENESIS (adjusted p = 0.0259) and GO_KERATINOCYTE_PROLIFERATION (adjusted p = 0.0259).

Discussion
The present report describes a GWAS performed in nsCL/P patients from the Netherlands and Belgium and unaffected controls from the Netherlands. The sample comprised 259 patients and 1187 controls, and thus represents a medium-sized nsCL/P cohort. Nonetheless, the replication of 20 of the 40 previously reported nsCL/P susceptibility loci-four of which were originally identified in Asian samples-demonstrated the power of the sample. Notably, at the well-established nsCL/P susceptibility locus at chromosome 8q24.21, the analyses identified 62 genome-wide significant SNVs. Among others, the results of the gene-set analysis prioritized GO_TISSUE_MORPHOGENESIS and GO_KERATINOCYTE_PROLIFERATION. This was consistent with the hypothesis that nsCL/P arises from the disturbed proliferation, adhesion, and apoptosis of cells in the facial prominences during embryogenesis, and demonstrated the validity of the present Dutch/Belgian sample [33].
The SNV-based analysis identified a novel locus at chromosome 16p12.1 yielding suggestive evidence of association (lead SNV rs56383345 with a p = 4.17 × 10 −7 ). This association was replicated in two of the three replication series (Central European and Yemeni). The association was not replicated in the Mexican patient/control series. However, the MAF for this SNV in the Mexican series was <2%, so the sample would not be expected to have much power to detect a true association. The SNV rs56383345 maps to a 930 kb non-coding region at 16p12.1, and is not located in any currently known regulatory element [34]. The nearest flanking genes are heparan sulfate-glucosamine 3-sulfotransferase 4 (HS3ST4) upstream and C16orf82 downstream. Neither of these genes, nor any other genes near rs56383345, has any reported role in cleft development. Furthermore, no orofacial clefting has been reported in those patients from the DECIPHER database [35] who have a copy number variant encompassing the new suggestive susceptibility locus or either one of the flanking genes.
Twenty of the 40 recognized nsCL/P susceptibility loci were replicated with at least a nominally significant p-value (p < 0.05) in the present study. At 3 of these 20 replicated loci, the lead SNV reported in the literature and the lead SNV in the present dataset were identical, and at 15 loci a SNV in LD (r 2 > 0.6) with the lead SNV from the literature achieved a smaller p-value. Notably, at two of the 20 replicated loci, namely 1p36 and 3q29, additional peaks with lead SNVs not in LD with the lead SNV from the original literature were identified. This suggested that the original sample and the Dutch/Belgian sample have differing haplotype structures.
In addition to the "typical" single-SNV analyses, the present study involved a gene analysis, which generated several interesting findings. In a gene analysis, genetic marker data are aggregated to the level of whole genes to test the joint associations of all markers in the gene with the phenotype [36]. Previous authors have suggested that this approach represents a potentially more powerful alternative to single-SNV analyses. The gene analysis approach has the advantage of considerably reducing the required number of tests, and renders possible detection of effects consisting of multiple weaker associations, which would otherwise be overlooked.
The present gene analysis prioritized SH3PXD2A at chromosome 10q24.33 as a candidate gene for nsCL/P. SH3PXD2A is a protein-coding gene with 15 exons and a size of 267 kb. The SH3PXD2A gene product is necessary for the formation and function of podosomes, which are structures located on the cellular surface that establish close contact with the extracellular matrix, and are involved in cell migration and matrix degradation [37]. Observations in zebrafish and knockout mice suggest that SH3PXD2A is also a potential risk gene for orofacial clefting [38,39]. In mammals, the primary and secondary palate forms from cranial-neural-crest-cell-derived mesenchymal protuberances, and any alteration in the growth, proliferation, movement, adhesion, or death of cells composing the palatal structures can affect palatal architecture and lead to orofacial clefting. SH3PXD2A encodes TKS5, a scaffold protein shown to be fundamental to zebrafish neural crest cell migration in vivo [38]. However, even stronger support for that SH3PXD2A may be a cleft candidate gene has been generated by Cejudo-Martin et al., who showed that disruption of the mouse Sh3pxd2a gene was associated with complete cleft of the secondary palate in 50-90% of mutant mice [39]. Of note, the fact, that the mouse model in Cejudo-Martin et al. has a cleft of the secondary palate but not a cleft lip, does not speak against this gene being also involved in nsCL/P formation in humans. The present gene analysis provides the first strong support for the involvement of SH3PXD2A in non-syndromic cleft formation in humans.
Gene evaluation of the present GWAS data also prioritized the gene anoctamin 4 (ANO4) at chromosomal band 12q23.1. Interestingly one SNV, rs73145631, located 79 kb upstream of the transcription start site of this gene, achieved genome-wide significance in this dataset. The SNV association signal should be considered independent of the MAGMA gene analysis, which only took into account signals located between the transcription start and stop sites of the gene. The protein product of the candidate gene, ANO4, is a transmembrane protein from the anoctamin family. This protein family plays a key role in diverse physiological functions, including ion transport, phospholipid scrambling, and the regulation of other ion channels. While the ANO1 and ANO2 proteins have been functionally characterized, the roles of other family members, such as ANO4, remain poorly understood and controversial [40]. There is currently no convincing support for ANO4 as a cleft susceptibility gene in the literature or publically available databases on gene expression, such as SysFace. However, it is possible that rather than being attributable to the ANO4 gene itself, this association signal was generated by a regulatory element located intronically in ANO4, which influences the activity of another nearby gene. Notably, two genes in the vicinity of the ANO4 signal are expressed at relevant embryonic time-points in tissues of relevance to lip and palate development in the mouse model: (i) Growth Arrest Specific 2 Like 3 (GAS2L3), located upstream from ANO4; and (ii) Insulin Like Growth Factor 1 (IGF1), located downstream [34]. In the present GWAS, the genome-wide significant association signal for rs73145631 in the ANO4 upstream region could be interpreted as additional independent support for the ANO4 locus, despite being based on the same data. However, for several reasons, these results must be interpreted with caution. First, no supportive association signal for rs73145631 was found in any of the three replication cohorts. Second, interpretation of the GWAS association signal for rs73145631 was hampered by the fact that no SNVs were in LD with this SNV in this dataset. Even though it was not replicated in any of the three replication samples, the finding may be a true association specific to the present Dutch/Belgian sample, or may represent a false positive signal from a single marker.
After Asians, Europeans represent the second most common ethnicity in published nsCL/P association studies. The largest ethnically homogenous GWAS sample to date included 2033 patients of Chinese ancestry [17]. The largest European SNV data set explored to date combined 1158 nsCL/P patients from two large studies in a genome-wide meta-analysis [20]. Compared to such studies, the present sample size was relatively small, but this sample was valid and allowed the generation of interesting results warranting further investigation. The present sample is also of value in terms of future meta-analyses of GWAS data. These meta-analyses will increase statistical power for locus discovery, and facilitate the elucidation of the genetic background of orofacial clefting.

Conclusions
In summary, the present GWAS generated novel insights into the etiology of nonsyndromic orofacial clefting. The gene-based analysis provided strong support for SH3PXD2A as a candidate gene identified in animal models, and the SNV analysis identified two novel suggestive risk loci. The present results demonstrate how even medium-sized, clinically well characterized GWAS samples can improve knowledge of the genetic basis of nsCL/P.  Figure S5: UCSC tracks of ANO4 region including SysFace data; Table S1-S2: