Genetic Risk Assessment of Nonsyndromic Cleft Lip with or without Cleft Palate by Linking Genetic Networks and Deep Learning Models

Recent deep learning algorithms have further improved risk classification capabilities. However, an appropriate feature selection method is required to overcome dimensionality issues in population-based genetic studies. In this Korean case–control study of nonsyndromic cleft lip with or without cleft palate (NSCL/P), we compared the predictive performance of models that were developed by using the genetic-algorithm-optimized neural networks ensemble (GANNE) technique with those models that were generated by eight conventional risk classification methods, including polygenic risk score (PRS), random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), and deep-learning-based artificial neural network (ANN). GANNE, which is capable of automatic input SNP selection, exhibited the highest predictive power, especially in the 10-SNP model (AUC of 88.2%), thus improving the AUC by 23% and 17% compared to PRS and ANN, respectively. Genes mapped with input SNPs that were selected by using a genetic algorithm (GA) were functionally validated for risks of developing NSCL/P in gene ontology and protein–protein interaction (PPI) network analyses. The IRF6 gene, which is most frequently selected via GA, was also a major hub gene in the PPI network. Genes such as RUNX2, MTHFR, PVRL1, TGFB3, and TBX22 significantly contributed to predicting NSCL/P risk. GANNE is an efficient disease risk classification method using a minimum optimal set of SNPs; however, further validation studies are needed to ensure the clinical utility of the model for predicting NSCL/P risk.


Introduction
Orofacial clefts (OC), which are the second most common congenital anomaly with a wide range of etiologies, can occur as an isolated form or as a syndrome. The prevalence of OC varies by region and ethnicity, with the highest incidence being observed in Asian populations [1]. According to a nationwide cohort study, the overall prevalence of OC in Korea was 1.96 per 1000 live births, and approximately 76.45% of all cases occur in the nonsyndromic form. Specifically, cleft lip only (CL), cleft lip with cleft palate (CLP), and cleft palate only (CP) accounted for 26.47%, 26.56%, and 52.97% of total cases, respectively [2]. As CP has been considered a distinct malformation, recent genetic studies have primarily

Genetic Association Analysis for NSCL/P
Four SNPs (rs10790330, rs906830, rs17104928, and rs3917211) demonstrated HWE p-values less than 0.05; however, none of the SNPs showed evidence of deviation from HWE (p > 0.01) in the control data, and the MAFs of all ninety-two SNPs were >1% in both the case and control groups. In the Fisher's exact test, two intronic SNPs of IRF6 in linkage disequilibrium (LD) with a r 2 value of 0.80 (rs2235373 and rs2235371) were found to be significantly associated with NSCL/P (p = 3.5 × 10 −4 and p = 4.5 × 10 −4 , respectively). Moreover, SNPs located near or within five other genes (RUNX2, ARNT, TGFB3, MTHFR, and TCOF1) also showed significant associations in Korean NSCL/P patients (p < 0.05) ( Table 1). After accounting for pairwise LD (r 2 < 0.8, see Table S1), we identified three SNPs that were associated with NSCL/P at the level of p < 0.01, as well as ten SNPs with nominal significance (p < 0.05) and sixteen SNPs with marginal significance (p < 0.1).

Genetic Risk Prediction
The predictive performance of the PRS models for NSCL/P risk increased as the number of SNPs increased (accuracy = 0.676 and AUC = 0.711 for the 92-SNP model). When evaluating the models generated by the six traditional machine learning algorithms, the training accuracies significantly improved to above 95% for the 10-SNP model, especially for four of the ML algorithms. However, the testing accuracies remained in the 60% range. Out of the 18 models categorized by the number of SNPs and the type of machine learning algorithm, the SVM utilizing 10 SNPs demonstrated the highest predictive performance (test accuracy = 0.677, F1 = 0.678, AUC = 0.685). On the other hand, LightGBM demonstrated the lowest predictive performance among the machine learning algorithms (test accuracy = 0.565, F1 = 0.566, AUC = 0.568). We trained the four sets of SNPs by using the ANN deep learning algorithm but did not observe a significant improvement in predictive performance compared to PRS and the machine learning models (test accuracy = 0.63, F1 = 0.65, AUC = 0.71) (Figure 1).

Figure 1.
Comparison of predictive power by using nine models for nonsyndromic cleft lip with or without cleft palate risk. Model performance was measured by the area under the receiver operating curve (AUC) for each risk model in four different sets of single nucleotide polymorphisms (SNPs). (A) models trained on 3 SNPs with p < 0.01, (B) models trained on 10 SNPs with p < 0.05, (C) models trained on 16 SNPs with p < 0.1, and (D) models trained on all 92 SNPs. ADA, adaptive boosting; GANNE, genetic-algorithm-optimized neural networks ensemble; ANN, artificial neural network; PRS, polygenic risk score; RF, random forest; SVM; support vector machine; XGB, extreme gradient boosting; LR, logistic regression; LGBM, light gradient boosting model.
In the current study, we developed a model to improve NSCL/P classification by using the GANNE algorithm. We first prepared a set consisting of the top SNPs that were identified in the genetic association analysis, along with five optimal sets of SNPs that were selected by using GA, to be used as inputs for ANN deep learning. GANNE significantly improved predictive performance across all three SNP settings, especially the best model selected from six sets of ten SNPs (AUC of 88.2%), which increased AUC (∆AUC) by 17%, 23%, and 28.5%, respectively, compared to ANN, PRS, and RF ( Figure 1). Despite the lower weighted F1-score of 0.76 compared to AUC, the 10-SNP GANNE model still demonstrated superior performance when accounting for class imbalance in the binary data. In addition, the test accuracy of the 10-SNP GANNE (74.2%) increased within the range of 6.5% (SVM) to 14.5% (RF) compared to other methods, and it increased by 11.3% compared to the deep-learning-based ANNs. GANNE models with three SNPs and sixteen SNPs exhibited similar test results (accuracy = 0.694, F1 = 0.709, AUC of approximately 0.744), but the 16-SNP GANNE demonstrated better training accuracy than the 3-SNP GANNE ( Figure 1, Table 2). In the current study, we developed a model to improve NSCL/P classification by using the GANNE algorithm. We first prepared a set consisting of the top SNPs that were identified in the genetic association analysis, along with five optimal sets of SNPs that were selected by using GA, to be used as inputs for ANN deep learning. GANNE significantly improved predictive performance across all three SNP settings, especially the best model selected from six sets of ten SNPs (AUC of 88.2%), which increased AUC (∆AUC) by 17%, 23%, and 28.5%, respectively, compared to ANN, PRS, and RF ( Figure 1). Despite the lower weighted F1-score of 0.76 compared to AUC, the 10-SNP GANNE model still demonstrated superior performance when accounting for class imbalance in the binary data. In addition, the test accuracy of the 10-SNP GANNE (74.2%) increased within the range of 6.5% (SVM) to 14.5% (RF) compared to other methods, and it increased by 11.3% compared to the deep-learning-based ANNs. GANNE models with three SNPs and sixteen SNPs exhibited similar test results (accuracy = 0.694, F1 = 0.709, AUC of approximately 0.744), but the 16-SNP GANNE demonstrated better training accuracy than the 3-SNP GANNE ( Figure 1, Table 2). The GANNE utilized 46, 25, and 15 different SNPs that were located in 14, 12, and 8 genes, respectively, at least once for the 3-, 10-, and 16-SNP models. Five SNPs from IRF6 (including rs2013162), rs11204737 (ARNT), rs7715100 (TCOF1), rs16873348 (RUNX2), and rs3917192 (TGFB3) were used in all three SNP models. Among the SNPs that were selected for the 10-SNP GANNE models, rs2013162 (IRF6) was the most potent SNP included in all six sets, followed by rs3917192 (TGFB3) in five sets (Table S2).
To verify the reproducibility of the deep learning models, we performed 100 iterations, and the average of the results in each iteration followed the trend of the best model results for each set of SNPs. As expected, the 10-SNP GANNE model produced the highest accuracy and AUC, even at 100 iterations (average training accuracy = 92.1%, average test accuracy = 65.4%, average test AUC = 75.2%), with the highest AUC of 89.5% (Table 2).
In the PPI network analysis, nine of the twenty candidate genes that were evaluated in this study showed multiple interactions with other genes based on experimental evidence of co-expression. In particular, MSX1 and IRF6 were the most important hubs in this network, and genes such as PAX9, TBX22, RUNX2, TGFB3, and VAX1 also appeared to interact with more than one gene. However, eight genes (TCOF1, NSF, ADH1C, RARA, WNT3, ARNT, ZNF385B, and BCL3) did not show an interaction at a confidence score of 0.45 ( Figure 2). PVRL1, PAX9, and TGFB3) showed significant enrichment in GO:0042476~odontogenesis.
In the PPI network analysis, nine of the twenty candidate genes that were evaluated in this study showed multiple interactions with other genes based on experimental evidence of co-expression. In particular, MSX1 and IRF6 were the most important hubs in this network, and genes such as PAX9, TBX22, RUNX2, TGFB3, and VAX1 also appeared to interact with more than one gene. However, eight genes (TCOF1, NSF, ADH1C, RARA, WNT3, ARNT, ZNF385B, and BCL3) did not show an interaction at a confidence score of 0.45 ( Figure 2).

Discussion
As the discovery of genetic variants associated with complex diseases increases, the demand for personalized health care services using genetic information is also rapidly increasing. To overcome the limitations of regression-based PRS and conventional ML algorithms, artificial intelligence (AI) has recently begun to be applied to risk prediction and the early diagnosis of complex diseases [11]. Unlike traditional machine learning algorithms, deep learning is helpful in solving complex problems with far more parameters but requires a large-scale dataset to avoid overfitting and to generalize results [27]. Therefore, state-of-the-art deep learning algorithms are not widely applied in genomic medicine due to the difficulty of large-scale sample collection.
In the current study, we improved the classification ability for NSCL/P in Korean individuals by performing a deep-learning-based ANN with informative SNPs selected via GA to reduce dimensionality while also increasing test accuracy. GANNE performed best for all three SNP settings compared to the eight conventional methods for risk prediction. In conjunction with the results of the in silico functional analysis, we also demonstrated the possibility of explaining interactions among genetic features, which have been considered a black box in ML applications.
The machine learning algorithms, including GANNE, showed the highest classification accuracy when using 10 SNPs but the performance declined as the number

Discussion
As the discovery of genetic variants associated with complex diseases increases, the demand for personalized health care services using genetic information is also rapidly increasing. To overcome the limitations of regression-based PRS and conventional ML algorithms, artificial intelligence (AI) has recently begun to be applied to risk prediction and the early diagnosis of complex diseases [11]. Unlike traditional machine learning algorithms, deep learning is helpful in solving complex problems with far more parameters but requires a large-scale dataset to avoid overfitting and to generalize results [27]. Therefore, state-of-the-art deep learning algorithms are not widely applied in genomic medicine due to the difficulty of large-scale sample collection.
In the current study, we improved the classification ability for NSCL/P in Korean individuals by performing a deep-learning-based ANN with informative SNPs selected via GA to reduce dimensionality while also increasing test accuracy. GANNE performed best for all three SNP settings compared to the eight conventional methods for risk prediction. In conjunction with the results of the in silico functional analysis, we also demonstrated the possibility of explaining interactions among genetic features, which have been considered a black box in ML applications.
The machine learning algorithms, including GANNE, showed the highest classification accuracy when using 10 SNPs but the performance declined as the number of input SNPs increased. On the other hand, PRS, a widely used method in predicting complex disease risk, exhibited a consistent improvement in its AUC with the addition of more SNPs. Despite the simplicity in implementation, logistic-regression-based methods, such as LR and PRS, may not be effective in dealing with non-linear or highly correlated input data [10]. Our findings underscore the issue of dimensionality, whereby the number of required datasets increases exponentially as the input dimensionality increases when using ML algorithms as genetic risk predictors [24]. Supervised machine learning algorithms, RF and SVM, tend to perform well in high-dimensional data, but are prone to overfitting and are computationally intensive [12]. In this study, we found SVM to be more suitable for the non-linear binary classification task, as it showed better predictive performance (F1 = 0.678) compared to RF (F1 = 0.598). Boosting algorithms, including XGBoost, Adaboost, and LightGBM, are ensemble techniques that combine multiple models with weak predictive performance to form a more potent model [28]. Among the nine classification methods used in this study, LightGBM exhibited the lowest predictive performance. Further studies are necessary to investigate the impact of the strengths and limitations of each ML algorithm on disease risk prediction accuracy.
There have been attempts to improve predictive accuracy by combining results from different SNP models, but most statistical association analyses have limitations in selecting different subsets of SNPs [29]. Although there are 7 trillion possibilities to select a set of 10 SNPs out of 92 SNPs in our dataset, GANNE can efficiently select an optimal set of SNPs by initializing the first population with the best SNPs that were identified in the association analysis.
In particular, the 10-SNP GANNE model showed excellent performance and improved the AUC by 28.6%, 23%, and 17% compared to the RF, PRS, and ANN methods, respectively, by including SNPs that did not show a strong association with NSCL/P, which was likely due to a lack of statistical power. GA selected the SNPs that were significantly associated with NSCL/P while also extracting SNPs (such as rs7103685 in the PVRL1 gene) that did not show significant associations but that were used in four of the six SNP sets (p = 0.46).
Although a further evaluation of gene-gene interactions by using PLINK did not yield statistical significance, a functional protein association network analysis suggests that GA considers functional interactions of genes in SNP selection. The IRF6 gene that was most frequently selected by GA was also a major hub gene in the PPI network, and its association with NSCL/P has been reported in previous studies [30]. However, MSX1, which is another hub gene in the PPI network, was selected by GA in the 16-SNP subset but not in the 10-SNP subset. Moreover, all three SNP markers for the MSX1 gene were not statistically significant in this case-control analysis, but its association with NSCL/P remains controversial with inconsistent results, especially in Asian studies [31][32][33]. GANNE has demonstrated the potential to identify significant interactions among genes when used in conjunction with the PPI network analysis. Due to the fact that there may be valid interactions between SNPs that cannot be detected by using statistical analysis, neural-network-based genetic interaction studies using tools such as class activation mapping or attention modules may be needed in the future [34].
In this study, we demonstrated that GANNE, which is an ensemble neural network with automated feature selection, outperforms existing methods in predicting NSCL/P risk with genotype data by reducing the input dimension of each network through the use of a GA. Although GANNE achieves better generalization and robustness than other classification methods, given the number of samples that were trained in this study, further studies with larger samples are needed to validate the accuracy of the model. In genetic association studies, adjustments for age as a potential confounder are usually unnecessary, as differences in age between cases and controls may be associated with disease outcome but unlikely with genotype [35].

Study Subjects
We evaluated 143 Korean NSCL/P patients (91 males and 52 females) from 258 Korean families with nonsyndromic OC who visited Seoul National University Dental Hospital and SAMSUNG Medical Center. At each hospital, an orthodontist diagnosed the types of NSCL/P in the cases (nine cases with cleft lip only, twenty-six cases with cleft lip and alveolus, and one hundred and eight cases with cleft lip and palate). As a control group, we selected 119 healthy Korean adults without OC (60 males and 59 females) from a community-based cohort that was jointly developed by Hallym University College of Medicine and Chuncheon Sacred Heart Hospital. A trained dentist or clinician interviewed the participants and collected peripheral venous blood samples after obtaining informed written consent. The Institutional Review Board of each institution approved this study protocol. The details of the data collection can be found elsewhere [36,37].

SNP Genotyping
By using literature reviews, we identified nineteen candidate genes, including PAX9 and TGFA, and two chromosomal loci (8q24.21 and 10q250), which have been reported to be associated with NSCL/P in previous studies. By using a web browser known as, 'TAG SNP selection (TagSNP)' (https://snpinfo.niehs.nih.gov/snpinfo/snptag.html) [38], we identified SNP markers that were frequently found in East Asian populations among SNPs located within 2 kb from each of the 5 and 3 ends of the candidate genes. Genomic DNA was isolated from each blood sample by using a commercial DNA extraction kit (Quiagen Inc., Valencia, CA, USA) at the Samsung Biomedical Research Center, and genotype data were generated via SNP Genetics Inc. (Seoul, Republic of Korea) by using VeraCode Technology (Illumina Inc., San Diego, CA, USA). Details of these procedures are presented elsewhere [39].

Genetic Association Analysis
We subsequently analyzed only 92 SNPs in Hardy-Weinberg equilibrium (HWE p-values greater than 0.01) with both genotype and sample call rates greater than 95% and a minor allele frequency (MAF) greater than 1%. After SNP quality control, a pairwise LD was estimated by calculating r 2 via the Haploview program in the control group. The missing genotypes were imputed by considering the calculated LD [40]. We performed Fisher's exact test by using PLINK 1.9 for genetic association analysis [41].

SNP Subset Selection
Based on the statistical significance obtained by the Fisher's exact test, we selected four subsets of SNP markers for the binary classification of NSCL/P risk: three SNPs (p < 0.01), ten SNPs (p < 0.05), sixteen SNPs (p < 0.1), and ninety-two SNPs (all). SNPs in LD (r 2 > 0.8) were excluded (except for the 92-SNP set). Of the 262 samples, we used 200 samples (100 cases and 100 controls) in the training process (180 samples for training and 20 samples for validation) and 62 samples (43 cases and 19 controls) for testing purposes.

Polygenic Risk Score
We calculated the PRS for each jth subject by using the equation , where M is the number of SNP markers, logOR i is the natural logarithmically transformed odds ratio (OR) of the ith susceptibility SNP, and x ij is the count of the risk alleles (0, 1, or 2) at the ith SNP in the jth individual. We performed a logistic regression analysis on the PRS that was calculated to determine case-control status [42].

Traditional Machine Learning Algorithms
We evaluated the risk prediction performance of six commonly used machine learning algorithms: support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), logistic regression (LR), light gradient boosting model (LGBM), and adaptive boosting (ADA). This evaluation was performed by using the Python Scikit-learn package [43].

Artificial Neural Network
To classify NSCL/P cases, we constructed four ANN models for each given set of SNPs by using the Keras package of TensorFlow [44]. Our ANN contains two dense layers followed by a rectified linear unit (ReLU). We set the number of neurons in each layer to 8, 16, 32, and 64 for the 3-, 10-, 16-, and 92-SNP models, respectively. In addition, we constructed a dense output layer with sigmoid activations to classify NSCL/P and utilized the Adam method for optimization with an initial learning rate of 5 × 10 −3 and a decay rate of 10 −5 [45]. We trained each ANN model in the 100 epochs setting and measured the binary cross-entropy loss to evaluate the model performance. At the end of the training, each set was replaced with the best weight with low validation loss and high training accuracy.

Genetic-Algorithm-Optimized Neural Networks Ensemble
Our model implemented the GA that was proposed by Tong and Schierz [25] to extract an optimal set of SNPs for classification, followed by an ensemble of ANN results trained with each optimal set. Total cycles and population size were set to 30, and each population consisted of a fixed number of SNPs. To speed up the identification of the local minima, we initialized one population with a set consisting of the most significant SNPs that were found in the association analysis. The goodness-of-fit of the GA was calculated by adding the training loss and the validation loss. For each of the three settings (three, ten, and sixteen SNPs), we created six sets of SNPs, which consisted of five sets from GA and one set from the association analysis. The six SNP sets were trained on each ANN with the same settings as described above. The final value of the ensemble prediction was the average of the prediction values of multiple neural networks (Figure 3).
the Adam method for optimization with an initial learning rate of 5 × 10 and a decay rate of 10 [45]. We trained each ANN model in the 100 epochs setting and measured the binary cross-entropy loss to evaluate the model performance. At the end of the training, each set was replaced with the best weight with low validation loss and high training accuracy.

Genetic-Algorithm-Optimized Neural Networks Ensemble
Our model implemented the GA that was proposed by Tong and Schierz [25] to extract an optimal set of SNPs for classification, followed by an ensemble of ANN results trained with each optimal set. Total cycles and population size were set to 30, and each population consisted of a fixed number of SNPs. To speed up the identification of the local minima, we initialized one population with a set consisting of the most significant SNPs that were found in the association analysis. The goodness-of-fit of the GA was calculated by adding the training loss and the validation loss. For each of the three settings (three, ten, and sixteen SNPs), we created six sets of SNPs, which consisted of five sets from GA and one set from the association analysis. The six SNP sets were trained on each ANN with the same settings as described above. The final value of the ensemble prediction was the average of the prediction values of multiple neural networks (Figure 3). Overview of the GANNE pipeline for NSCL/P classification. Genotype information for each subject is encoded into a n x 1 input matrix, where n is the total number of SNPs. NN, neural network; SNP, single nucleotide polymorphism; NSCL/P, nonsyndromic cleft lip with or without cleft palate.

Model Evaluation and Validation
As evaluation metrics, we calculated the accuracy, which represents the percentage of correctly classified samples, and the area under the receiver operating characteristic curve (AUC). The performance of the ML and DL models was further evaluated using the weighted average F1-score, which balances precision and recall. To address the potential for variability in the results of the ANN models when trained on a GPU server, we Figure 3. Overview of the GANNE pipeline for NSCL/P classification. Genotype information for each subject is encoded into a n × 1 input matrix, where n is the total number of SNPs. NN, neural network; SNP, single nucleotide polymorphism; NSCL/P, nonsyndromic cleft lip with or without cleft palate.

Model Evaluation and Validation
As evaluation metrics, we calculated the accuracy, which represents the percentage of correctly classified samples, and the area under the receiver operating characteristic curve (AUC). The performance of the ML and DL models was further evaluated using the weighted average F1-score, which balances precision and recall. To address the potential for variability in the results of the ANN models when trained on a GPU server, we repeated the training process 100 times and calculated the average and 95% confidence intervals (95% CIs) of both accuracy and AUC for each model to ensure the reproducibility of the results.

In Silico Functional Analysis
We used the Database for Annotation, Visualization, and Integrated Discovery (DAVID) v6.8 to analyze gene ontology (GO) terms to identify the central function of the SNP markers [46]. We further examined the functional relevance between candidate genes with the protein-protein interaction (PPI) network by using STRING v11 [47].

Conclusions
GANNE, a deep-learning-based approach for disease risk classification, has shown promise in overcoming the sample size limitations of population-based genetic association studies by utilizing genetic algorithms to select the optimal set of SNP markers. Nevertheless, due to the limited sample size in this study, it is necessary to validate the results in larger, independent Korean populations, as well as to conduct comparative analyses of the model performance across different ethnic groups. With further validation studies, this GANNE model will realize its potential in enhancing NSCL/P genetic risk predictions.