Prediction of Early Childhood Caries Based on Single Nucleotide Polymorphisms Using Neural Networks

Background: Several genes and single nucleotide polymorphisms (SNPs) have been associated with early childhood caries. However, they are highly age- and population-dependent and the majority of existing caries prediction models are based on environmental and behavioral factors only and are scarce in infants. Methods: We examined 6 novel and previously analyzed 22 SNPs in the cohort of 95 Polish children (48 caries, 47 caries-free) aged 2–3 years. All polymorphisms were genotyped from DNA extracted from oral epithelium samples. We used Fisher’s exact test, receiver operating characteristic (ROC) curve and uni-/multi-variable logistic regression to test the association of SNPs with the disease, followed by the neural network (NN) analysis. Results: The logistic regression (LogReg) model showed 90% sensitivity and 96% specificity, overall accuracy of 93% (p < 0.0001), and the area under the curve (AUC) was 0.970 (95% CI: 0.912–0.994; p < 0.0001). We found 90.9–98.4% and 73.6–87.2% prediction accuracy in the test and validation predictions, respectively. The strongest predictors were: AMELX_rs17878486 and TUFT1_rs2337360 (in both LogReg and NN), MMP16_rs10429371 (in NN) and ENAM_rs12640848 (in LogReg). Conclusions: A neural network prediction model might be a substantial tool for screening/early preventive treatment of patients at high risk of caries development in early childhood. The knowledge of potential risk status could allow early targeted training in oral hygiene and modifications of eating habits.


Introduction
Dental caries is a chronic, multifactorial and dynamic disease that affects up to 83% of the global population, irrespective of age. Early childhood caries (ECC) is defined as the presence of one or more decayed, missing or filled primary teeth in a child up to 71 months of age [1,2]. In Poland, based on epidemiological studies, 53.8% of 3-year-old children have on average 2.4 affected teeth, and this prevalence is higher than that reported for Western Europe [3,4]. However, even 6-month-old infants have been found to develop caries lesions; therefore, it has been suggested that early caries prevention should start in the 4th month of the mother's pregnancy [5]. Additionally, some mothers have been reluctant to visit clinics regularly [6] and 48% of 3-year-old Polish children have never been to the dentist, although dental care is free [4].
Interestingly, despite the treatment programs and caries prevention for preschoolers, about 15% of young patients would still develop the disease, which suggests an intrinsic, genetic factor influencing the individual host's susceptibility to caries [7]. In addition, patients exposed to the same level of environmental factors and with comparable behavioral factors might present distinct susceptibility to caries lesions [8]. Indeed, biological factors show a stronger association with caries than lifestyle and socio-economic factors, the latter only being responsible for intermediate and distant effects [9]. Genetic and immunological factors have been considered to be more important in enamel defects than eating habits or nutritional deficits, and the overall health status and immunodeficiencies in children have been shown to significantly affect enamel hypomineralization [1]. Over the last decade, several genes and genetic polymorphisms have been identified and associated with dental caries lesions in patients of different ages and ethnicities [1,7,10–13]. However, the majority of the existing caries prediction models lack biological, especially genetic, factors. Selected and validated single nucleotide polymorphisms (SNPs) can be genotyped easily and early in a child's life, and represent a potentially valuable tool for caries risk prediction. This could allow enhanced early targeted prevention for infants and toddlers at greatest risk. As most eating and oral health habits can be changed at any time, this might result in better health and quality of life.
Our previous research [14,15] showed a significant association of genetic polymorphisms with caries in 2-3-year-old Polish children. The aim of this study was to present a caries prediction model based on chosen polymorphisms from all three studies, as predictors, using the artificial neural network approach.

Ethical Issues
The study was reviewed and approved by the Bioethical Committee of the Poznan University of Medical Sciences (resolutions no. 590/13, 605/14, and 727/18).

Study Design
In total, 262 children from four nurseries in central-west Poland were enrolled. In the two previous studies [14,15], we analyzed the differences in the frequencies of alleles and genotypes of 18 single nucleotide polymorphisms (SNPs) in 7 genes in reference to caries experience in the cohort of Polish children. Another six SNPs in 6 genes with a presumed role in caries pathogenesis were analyzed in this study [1,10,16–18]. In brief, we genotyped: rs4547741 in LTF, rs7217186 in ALOX15, rs10429371 in MMP16, rs7096206 in MBL2, rs1884302 in SMAD6 and rs1711437 in MMP20. Additional analyses (including additional statistics and machine learning techniques) were performed by combining SNP testing results with data from the previous studies to determine the predictive capacity of the studied variants for childhood caries. Two variants, i.e., rs2609428 and rs36064169 in ENAM, had homozygous status in all individuals; therefore, they were excluded from statistical analyses. The remaining 22 differentiated SNPs analyzed in this study are presented in Table 1.

Dental Examination
Dental examination was carried out by a trained and calibrated dentist (K.G.), specialist in pediatric dentistry, after calibration by an experienced specialist (M.B.-L.). The intra-examiner agreement was assessed by second dental examination in a group of 10 children after 2 weeks, with a κ of 1.00. Teeth evaluation was performed in the nursery school, with the use of a dental mirror and a probe, in an artificial light. The dentition was not additionally cleaned before examination. Assessment of the teeth concerned the occurrence of carious cavities as well as initial (incipient) caries lesions (non-cavitated lesions, white spot). All tooth surfaces accessible for examination were investigated. Dental examination concerned the occurrence of teeth with carious cavities (dt) and with initial (incipient) caries lesions (non-cavitated, white spot; di), i.e., the stage before cavitation during the process of dental caries development [19]. In accordance with the international standards, the white spot lesions were included in dental caries diagnosis [20,21] as they indicate the susceptibility of an individual to dental caries and are prevalent in primary dentition in children in the first years of life. White spot lesions were easily differentiated from developmental defects of enamel on the clinical ground based on the association between caries lesion and its location on the tooth and the areas of mature plaque [22]. However, when it was impossible or difficult to differentiate the white spot lesions from other changes in some individuals, they were excluded from examination and further analyses. Radiograph of the children's dentition was not taken.
Detailed inclusion and exclusion criteria are presented in Table 2. Out of 262 individuals examined in the study, 48 children (18.3%) were diagnosed with caries and comprised the study group. Out of the remaining 214 subjects we chose 48 sex- and age-matched individuals that comprised the control group. In the study group, there were 23 males (48%) aged 20 to 42 months (mean 30.2 ± 6.2) and 25 females (52%) aged 20 to 40 months (mean 30.8 ± 5.7). One control sample had low DNA concentration and was fully utilized during the two previous studies; therefore, it was not available for genotyping in this study and was excluded from statistical analyses. We intended to use the genotyping and distribution results; therefore, the missing control sample was not replaced by another matching sample from the caries-free cohort in this study. Finally, the control group comprised 47 individuals, including 23 males (49%) aged 20 to 42 months (mean 31.2 ± 5.3) and 24 females (51%) aged 20 to 38 months (mean 28.3 ± 5.4).

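Table 2. Inclusion and exclusion criteria.
Group | Inclusion criterion | Exclusion criterion
Study group | Dental caries present in child's dentition | Lack of dental caries in child's dentition
Control group | Lack of dental caries in child's dentition | Dental caries present in child's dentition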

Biological Samples and Genotyping
Biological material was obtained directly after dental examination. Samples were gathered using buccal swabs, which were provided for each child in sterile packs. The procedure included rubbing the inside of the mouth, at least ten times, on each side of both cheeks in order to scrub epithelial cells with saliva. Subsequently, the swab was put inside a 1.5 mL Eppendorf tube, and the plastic stick was cut off. The tube was placed in a portable fridge at +4 °C until DNA extraction, which was done the same day using the EXTRACTME DNA Swab&Semen Kit (Blirt S.A., Gdansk, Poland) according to the manufacturer's protocol; the extracted DNA was kept at −20 °C for further analyses. We used TaqMan probes (Applied Biosystems, ThermoFisher Scientific, Frederick, MD, USA) and the 7900HT Fast Real-Time PCR System, according to the manufacturer's instructions. For each real-time PCR reaction, we used 10 ng of genomic DNA that was previously extracted and stored at −20 °C. SDS v2.4 software was used to run the analysis and for allele calling.

Statistics
The continuous data were presented as mean ± standard deviation (SD) while the categorical data were counted and presented as numbers. The Kruskal-Wallis test was applied for comparison of continuous data. We used the chi-square test for testing the Hardy-Weinberg equilibrium and Fisher's exact test for estimating differences in allele and genotype frequencies between study subgroups and between study groups and CEU data (samples of Northern and Western European ancestry, from the International HapMap Project). The dominant (AA vs. Aa+aa), over-dominant (Aa vs. AA+aa), recessive (aa vs. AA+Aa) and allelic (A vs. a) models of genetic inheritance were applied [23]. The Cochran-Armitage test for trend, the most common approach in case-control analyses, was used to test the additive model of inheritance [24]. The analyses were run using IBM SPSS Statistics software. Additionally, SHEsis software was applied for haplotype analysis.
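The Hardy-Weinberg check and the genotype groupings described above can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in the study: the function names `hwe_chi_square` and `code_genotype` are hypothetical, the genotype counts in the usage lines are invented for illustration, and the chi-square p-value assumes one degree of freedom.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (1 df)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # frequency of allele A
    q = 1.0 - p                              # frequency of allele a
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-square distribution with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def code_genotype(genotype, minor_allele, model):
    """Encode a two-letter genotype (e.g. 'CT') under one inheritance model."""
    copies = genotype.count(minor_allele)    # 0, 1 or 2 copies of allele a
    if model == "dominant":                  # AA vs. Aa+aa
        return int(copies >= 1)
    if model == "recessive":                 # aa vs. AA+Aa
        return int(copies == 2)
    if model == "over-dominant":             # Aa vs. AA+aa
        return int(copies == 1)
    if model == "additive":                  # number of minor alleles (trend)
        return copies
    raise ValueError(f"unknown model: {model}")

# Hypothetical counts, not taken from the study tables.
chi2, p_value = hwe_chi_square(25, 50, 25)
recessive_code = code_genotype("TT", "T", "recessive")
```

A large p-value from `hwe_chi_square` indicates no detectable departure from equilibrium; the binary codes from `code_genotype` are the kind of grouping that a 2 × 2 Fisher's exact test would then compare between caries and control groups.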

Association and Prediction Analyses
For modeling purposes, two approaches were assessed, namely logistic regression and artificial neural network. Firstly, we used univariable logistic regression to test the impact of individual variables, accompanied by the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) value to evaluate the sensitivity, specificity and discriminatory ability of each factor. Statistical significance in univariable logistic regression, in ROC analysis, and/or in Fisher's exact test for single variables was applied as the inclusion criterion for multivariable logistic regression and neural network analyses. Multivariable logistic regression was run using the enter method and the following characteristics were applied to describe the model: Nagelkerke R² and Cox and Snell R² values to assess how well the model explains the data, the coefficient β and the exponentiated coefficient β (the odds ratio) values to indicate the relationship between each variable and the outcome, the Hosmer and Lemeshow test to determine the goodness of fit of the data with the model, and, lastly, the ROC curve to evaluate the predictive accuracy and the discrimination power of the model. IBM SPSS Statistics software v27.0.1.0 was used for logistic regression and ROC analyses and Statistica v13.3 was employed for neural network modeling.
An artificial neural network is a deep learning approach that, in the image of the human brain, self-learns from experience and adjusts to a situation. Briefly, a neural network (NN) consists of multiple neurons, called nodes, and interconnections among them creating a complex structure, in which information passes imitating the system of real neurons. Each interconnection is given a weight, which determines the strength of the association. A typical network contains input, hidden and output layers. The input layer receives the input signal to be processed, i.e., the data, while the hidden layer performs all the computational processes resulting in an outcome prediction in the output layer. The results are compared to the real observations and the process is repeated until the smallest prediction error is reached. The great advantage of the NN approach is that it enables detecting complex nonlinear associations between variables and an outcome as well as all possible interactions between the variables themselves, using multiple distinct learning algorithms that are adjusted for the data type. We used the following parameters: a typical multilayer perceptron as a network type, 4 to 12 hidden layers (default), the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm as an iterative method, 4 types of activation functions, i.e., linear, logistic, exponential and tanh, and SOS (symbiotic organisms search) error rate type as a training method, that gives the best accuracy [25]. Out of the total data, 70%, 15% and 15% were used as the training, testing and validation sets, respectively. In total, 25 different models were created, out of which the 6 best, based on self-learning, were saved and analyzed in detail.
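The 70%/15%/15% partition described above can be sketched as follows. This is only an illustration in Python, not the Statistica routine used in the study: the function name `split_indices`, the random seed and the rounding rule (integer division, with the remainder assigned to the validation set) are assumptions, and the text does not state whether the original split was stratified by caries status.

```python
import random

def split_indices(n_samples, seed=0):
    """Shuffle sample indices and cut them into 70% / 15% / 15% subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)     # reproducible shuffle (assumed seed)
    n_train = n_samples * 70 // 100          # integer arithmetic avoids float rounding
    n_test = n_samples * 15 // 100
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder goes to validation
    return train, test, validation

# 95 children: 48 with caries and 47 caries-free controls.
train, test, validation = split_indices(95)
```

With only 95 samples, the testing and validation subsets each hold roughly 14 to 15 children under this rule, so a single misclassified child shifts the reported accuracy of such a subset by several percentage points.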

Demographic Data
The caries rate in our cohort of preschool children was 18.3% (48/262) and the number of affected males was comparable with that of affected females (23 vs. 25). To minimize potential demographic stratification, the control samples were matched to the caries samples in reference to age and gender (p = 0.4163 and p = 0.9208, respectively). There was no difference in the number of erupted teeth or their type, i.e., incisors, canines, molars, between the groups (p = 0.3945, p = 0.3250, p = 0.7148, p = 0.1988, respectively). Additionally, the number of erupted teeth and their types did not differ between males and females, in either the caries or the control group.

Genotyping
All markers genotyped in the present research were in Hardy-Weinberg equilibrium. The distribution of genotypes and alleles in six SNPs, in caries and control groups as well as in the study cohort and CEU data are presented in Table 3.
We spotted an almost 26-fold higher occurrence of rs10429371 recessive TT homozygote (p = 0.0262) and a 1.5-fold higher occurrence of the T allele (p = 0.1645) in caries patients in comparison with healthy controls, and the CT heterozygote was 2.5-fold more frequent (p = 0.0320) in the control group in comparison to caries group. In the case of another variant, rs7096206, the recessive GG homozygote and the G allele were 9.6-fold (p = 0.0363) and over 2-fold (p = 0.0180), respectively, more frequent in controls than in caries patients. The frequencies of alleles and genotypes for the rest of the genotyped SNPs did not differ between the groups or the results were statistically insignificant. When the subgroups were compared in reference to gender, none of the results were significant, although there was an almost 2-fold higher occurrence of rs10429371_T and TT variants in males, as well as over 2.5-fold higher occurrence of the wild CC homozygote in females, in caries patients, while no such observation was made for the controls. When our results were compared to the CEU data, we spotted significant differences for rs4547741, rs7217186 and rs10429371, and the latter SNP showed significance both for allele and genotype frequencies, showing the recessive T and TT variants to be, respectively, 8-fold and almost 18-fold more frequent in the CEU group than in our Polish cohort (p < 0.001 for both comparisons) ( Table 3).
The differences in the frequency of the other 18 single nucleotide variants used for further statistical analyses in the present research were discussed in our previous studies, showing the association of recessive ENAM rs12640848_G, recessive AMELX rs17878486_T, wild TUFT1 rs2337360_A and two recessive KLK4 rs2235091_G and rs198969_G variants with caries outcome, and one recessive variant in AMBN, rs34538475_T, with the absence of the disease [14,15].
Table 3. Genotype and allele distribution differences for 6 SNPs genotyped in this study between study subgroups (caries vs. caries-free) and between our cohort and CEU data.
When the Cochran-Armitage test for trend was applied (see Table 4), out of eight variants that were significantly differently distributed between the groups, six showed a significant positive additive model of inheritance: rs7096206, rs12640848, rs17878486, rs2337360, rs2235091 and rs198969. In turn, variants rs10429371 and rs34538475 showed stronger recessive modes of inheritance.

Haplotype Analysis
Although not every variant comprising a haplotype was significant in single variant analysis, the overall haplotype analysis showed some association with the disease or disease-free trait. Briefly, AMBN TC and TT haplotypes (comprising rs34538475 and rs4694075, respectively) were significantly associated with healthy controls, while the GC haplotype showed association with caries. Additionally, strong association with caries was observed for TUFT1 AAA, AAG and AGG haplotypes (comprising rs3790506, rs4970957, rs2337360, respectively), and the significances for the rest of the TUFT1 haplotypes were most probably the result of quite low numbers of dominant and recessive homozygotes in both groups for rs3790506 and rs4970957, respectively, which was reflected by the absence of some haplotype variants in one group and their low frequency in the other. Likewise, significance was observed for rs2235091_rs198969 TC haplotype in MMP20, most probably due to its absence in one of the study subgroups and a low frequency in the other. Both SNPs in KLK4 were significant in single locus analysis; however, only the rs2235091_rs198969 GG haplotype showed significant association with caries, demonstrating that both recessive alleles were essential for the caries effect and, analogically, the AC haplotype was significantly associated with the non-disease trait, showing that both dominant alleles were essential for the protective effect. Two haplotypes in ENAM were significantly associated with the caries and non-caries traits, most possibly due to the middle allele of rs12640848, i.e., the only significant SNP in ENAM in the other statistical tests conducted. Haplotype CAAGA was over twofold more frequent in caries patients, corroborating the A allele as a risk variant, while haplotype CAGGA was almost 2.5-fold more frequent in controls, which confirmed the G allele as a protective variant. The other four SNPs of ENAM did not contribute to the haplotype significance.
No significant results were found for TFIP11 haplotypes. The haplotype distributions are shown in detail in Table 5.
Table 6 includes the results of univariable logistic regression, which identified eight variables, all of them single nucleotide polymorphisms, i.e., rs10429371, rs7096206, rs12640848, rs17878486, rs34538475, rs2337360, rs2235091 and rs198969, as significantly associated with caries outcome. Those eight SNPs were then included in multivariable logistic regression analysis and neural network prediction. The overall multivariable logistic regression model characteristics are shown in Table 7 and information about single variables in the model is shown in Table 8. Briefly, the strongest association with caries was found for rs17878486 in AMELX, rs2337360 in TUFT1 and rs12640848 in ENAM, which remained in the final model at p < 0.05 (p = 0.0008, p = 0.0040, p = 0.0401, respectively). The model evaluation gave 93% correct calls overall, with 90% sensitivity and 96% specificity and a strong level of significance (p < 0.0001). The AUC value of the model (0.970 (95% CI: 0.912-0.994; standard error = 0.014), p < 0.0001) was also high. The ROC curve was constructed by plotting the true caries rate against the false caries rate and the prediction was made using all eight SNPs that were significantly associated with the outcome in single locus analyses. The remaining parameters, e.g., the prediction accuracy and goodness of fit, indicated a high overall performance of the model (see Table 7 and Figure 1).
Figure 1. ROC curve of the multivariable logistic regression model built on all eight SNPs, i.e., rs10429371, rs7096206, rs12640848, rs17878486, rs34538475, rs2337360, rs2235091 and rs198969, showing the true positives (sensitivity) and false positives (100 − specificity) for caries prediction when compared to the actual data. The dashed lines indicate the 95% confidence interval of the ROC curve.
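The relation between the reported sensitivity, specificity and overall accuracy can be checked with a short Python sketch. The confusion-matrix counts below are not reported in the text: they are a hypothetical reconstruction chosen to be consistent with the group sizes (48 caries, 47 controls) and with the rounded percentages given for the logistic regression model. The function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # true positive rate among caries cases
    specificity = tn / (tn + fp)                 # true negative rate among controls
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # all correct calls over all samples
    return sensitivity, specificity, accuracy

# Assumed counts: 43 of 48 caries cases and 45 of 47 controls classified correctly.
sensitivity, specificity, accuracy = classification_metrics(tp=43, fn=5, tn=45, fp=2)
print(f"sensitivity={sensitivity:.1%} specificity={specificity:.1%} accuracy={accuracy:.1%}")
```

Under these assumed counts, the three values round to the 90%, 96% and 93% quoted for the model, which illustrates how overall accuracy is a sample-size-weighted combination of sensitivity and specificity.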

Caries Association and Prediction
Because neural network analysis relies on different mathematical algorithms, all eight variables that performed at p < 0.05 in univariable analyses were also introduced in the NN model, although only three of them held significance in multivariable logistic regression. The overall performance of six neural network prediction models is presented in Table 9. All the models reached a high prediction accuracy, from 90.9% to 98.4% in the test analysis. The best model, namely NN1, which gave the highest prediction accuracy in the test analysis (98.4%), simultaneously gave the lowest rate in the validation analysis (73.6%). The other five models performed relatively well in the validation analysis, i.e., 85.3-87.2%. All four types of activation function algorithms were used in the final prediction models. Only one model, NN3, used the linear function in the hidden layer, while logistic, exponential and tanh functions turned out to be better fitted, as they were used interchangeably in the rest of the models. In fact, the latter two functions support the backpropagation process, and therefore speed up the model's self-training process, which is one of the crucial steps in a multilayer neural network system [26,27]. When analyzing the prediction sensitivity of single markers, described as their usefulness, interestingly, all of the eight markers turned out to be crucial for the prediction models. Figure 2 shows the overall importance of eight markers in six caries prediction models. The sensitivity of prediction allows distinguishing variables that are important from those that do not contribute, or contribute little, to the overall performance of a model, and therefore can be rejected. The greater the ratio of the error after rejection of a variable to the original error, the more sensitive the network model is to the lack of this variable.
In this study, the top three predictors in reference to each single model as well as to the mean value of error rate were: AMELX_rs17878486, TUFT1_rs2337360 and MMP16_rs10429371, while the least important predictor was AMBN_rs34538475. Nevertheless, all eight markers gave an error value above "0", which means that even the weakest one was still important for the caries prediction.
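The sensitivity measure described above, i.e., the error after rejecting a variable relative to the original error, can be sketched in Python as follows. This is an illustration only and not the Statistica implementation: the function name `sensitivity_ratios`, the marker labels and all error values are hypothetical and are not taken from Figure 2.

```python
def sensitivity_ratios(baseline_error, errors_without):
    """Rank predictors by the ratio of error-without-variable to baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    ratios = {name: err / baseline_error for name, err in errors_without.items()}
    # Largest ratio first: the model is most sensitive to losing that variable.
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))

# Hypothetical network errors after removing each marker in turn.
example_errors = {"snp_a": 0.30, "snp_b": 0.12, "snp_c": 0.18}
ranking = sensitivity_ratios(0.10, example_errors)
```

In this toy example `snp_a` ranks first because removing it triples the error, whereas removing `snp_b` changes the error only slightly.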
Genes 2021, 12, x FOR PEER REVIEW

Discussion
We presented a complex analysis of 22 differentiated single nucleotide polymorphisms in prediction of dental caries in primary dentition of children. This study is an extended analysis utilizing additional data from our previous studies [14,15]. Deep learning neural network models for caries prediction were applied in a homogeneous cohort of 2-3-year-old children living in an urban environment under similar cultural conditions.
Previous caries experience, independent of an individual's age, has been reported as the strongest and the most universal risk factor for future caries development [28]. However, it is challenging to assess previous caries experience in infants and toddlers. Most of the previous studies and distinct caries prediction models apply to adults, school children or older preschoolers, while models for toddlers are scarce [29]. Another obstacle in developing a well-performing caries prediction model is the discrepancies among the studies, i.e., imprecise definition of caries phenotype and caries lesions or inconsistency in the terms used by the researchers [9,23,30,31]. Furthermore, the majority of existing caries prediction models are based on demographic and environmental factors and the only biological feature that has been considered is the cariogenic bacteria colonization of the oral cavity.

The first linkage studies in caries were performed in 2008 [32] and the first genome-wide association studies for early childhood caries were published in 2011 [33]. Abbasoglu et al. [10] were the first to correlate environmental and genetic factors in ECC in Turkish 2-5-year-olds, while Lewis et al. [34] considered several single nucleotide polymorphisms; however, they were not included in the final caries prediction meta-analysis. Another technical obstacle to implementing any caries prediction model in practice is an almost total lack of replication and validation studies in independent populations [13,35]. Mejare et al. [28] reviewed 17 studies on caries prediction in preschoolers, defined as children aged <1 to 6 years. None of them used genetic variables as prediction factors, most gave moderate prediction accuracy results and only one study, by Holgerson et al. [36], based on environmental factors and saliva sampling in 2-year-olds, was validated. The persistently high dental caries rate in the general Polish population and the nearly constant prevalence and severity index of the disease in children in recent years [37] should encourage both researchers and clinicians to develop better and validated prediction approaches that might be implemented in practice.
In this study, we obtained relatively high prediction accuracy ranging from 90.9% to 98.4%, depending on the prediction model, using a neural network approach. Additionally, multivariate logistic regression analysis showed accuracy of 93% with high sensitivity and specificity values, i.e., 89.6% and 95.7%, respectively. Results of both approaches were gender- and age-independent, as the two study subgroups were adjusted for both features. The most important predictors/indicators in both assays were AMELX rs17878486 and TUFT1 rs2337360 [15]. In brief, AMELX and TUFT1 are genes that play a crucial role in the enamel formation process and their single nucleotide polymorphisms have previously been attributed to high caries susceptibility, in primary as well as permanent dentition, although the risk allele varied depending on population and age [12,30,38]. Another factor that was significantly associated with caries outcome in both uni- and multivariable logistic regression was ENAM rs12640848. ENAM encodes the enamel-specific protein enamelin, which is a critical factor during enamel maturation. Our results were in agreement with other reports, supporting the presumed role of rs12640848_G as a protective factor in ECC [10,30]. Interestingly, when the SNP was used in this study as a predictor in neural network analysis, it turned out to be an intermediate marker in the scale of importance (Figure 2). This might be explained by the fact that the wild AA and rare GG genotypes were associated with opposite outcomes, i.e., caries and caries-free phenotype, respectively. Conversely, another SNP, i.e., MMP16 rs10429371, was the third most important predictor in NN models, although it did not hold significance as an indicator in multivariate analysis.
This is an interesting finding, since rs10429371 explained over 20% of the trait in single locus analysis and the TT variant was nearly 26 times more frequent in caries individuals when compared to controls, showing a strong recessive pattern of inheritance and suggesting TT as a risk variant for dental caries development in the studied individuals. Additionally, it was one of the variants for which alleles and genotypes were differently distributed in the cohort in this study in comparison to the CEU individuals in favor of wild C and CC variants (Table 3). Matrix metalloproteinases (MMPs) play an important role in early tooth development by regulating ameloblast maturation and formation of enamel. Several types of MMPs have been described to be involved in dentin collagen degradation and dental caries lesion progression [12,34,[39][40][41][42]. MMP16 rs10429371 was previously associated with caries in white adults [34]. Linhartova et al. [41] observed higher incidence of rare T allele in caries children aged 13-15 years, although it did not reach the statistical significance level.
The second variant that showed significant association with caries outcome and was genotyped in this study was rs7096206 in MBL2. Mannose binding protein (MBL2) is an acute phase protein that plays an important role in innate immunity. The study showed that MBL2 polymorphisms are involved in several infectious diseases and the top ranked rs7096206 has been annotated as deleterious to the protein's function and described as one of the most functionally important variants in the gene [42]. Rs7096206_G was found to be a risk factor in Polish 5-year-olds in reference to higher vs. lower caries experience, while it had no effect in 12-year-olds [1]. Interestingly, the same rare variant was significantly associated with no caries experience in 2-3-year-olds in our study, showing an additive model of inheritance. On the other hand, in the former study [1], rs7096206_G was associated with higher caries experience in both age subgroups while in the haplotype with rs1800450_G, and the CG haplotype had the opposite effect. Alike, rs7096206 genotype distribution was insignificant, but in haplotype with rs7501477_T it correlated with caries experience in Saudi 5-13-year-old children [43]. It might suggest that other SNPs could be associated with a more complex pattern of the disease and/or that distinct genetic variants could be involved at different ages. Such differences, i.e., in primary vs. permanent dentition, were also acknowledged by Wang et al. [44]. Likewise, as the rs7096206_G variant is associated with lower MBL2 serum levels predisposing to infections [45], other SNPs could show strong linkage disequilibrium accounting for possible protective mechanisms in the toddlers in our study. Nevertheless, the SNP was not significant in multivariate analysis, despite the drastic OR value, i.e., 3.36 × 10 −9 .
We did not observe any association of other SNPs in MMP genes, i.e., MMP20 rs1711437 genotyped in this study and rs1784418 in the previous study [15], with the caries or caries-free phenotype. MMP20 (enamelysin) is an early protease secreted during enamel development and is involved in both dentin and enamel decomposition [46]. Only Antunes et al. [47] found both variants to be associated with early childhood caries. Yet, the results of other studies of MMP20 SNPs were on the border of significance in 5-year-old Caucasian children with dental caries, or the SNPs were associated more with poor oral hygiene and dietary habits than with the disease itself in 5-14-year-old Caucasians [16,46]. Likewise, rs45447741 in LTF, rs7217186 in ALOX15 and rs1884302 in SMAD6 presented no association with caries experience in this study in any of the statistical tests used.
The remaining variants that were significantly differently distributed in the previous study [15], i.e., AMBN rs34538475 and KLK4 rs2235091 and rs198969, were not significant in multivariate analysis, and their importance in caries prediction models was of a moderate (rs2235091 and rs198969) to low (rs34538475) degree (Figure 2). KLK4 plays an important role in the late stages of enamel development, while AMBN, together with AMELX, is crucial for enamel matrix formation and mineralization. The roles of the abovementioned SNPs in both genes as risk factors in caries development in children were partially in agreement with other authors, depending on the studied population and the children's age [10,12,46]. Additionally, the haplotype analysis showed that alleles of both variants in KLK4 were necessary for the risk (i.e., rs2235091G_rs198969G haplotype) and protective (i.e., rs2235091A_rs198969C haplotype) effect.
It must be emphasized that the differences in association of genetic variants with caries experience and severity between studies occur not only due to the differences between populations but also within one population, even from one individual to another [34]. Single nucleotide polymorphisms are often highly variable between distinct ethnic cohorts and, at least partially, reflect divergence in human phenotypes, including different disease susceptibility or drug response. Dental caries is a highly complex trait, also in reference to environmental factors, i.e., socioeconomic and cultural factors, age, oral hygiene, eating habits, as well as the course of pregnancy and mother-child relations [3,6,46,48–50]. Age appears to be one of the key factors in caries analysis, as the disease experience is differentially defined in reference to the age of patients. Some authors emphasize that each age group should be characterized by other sets of temporal variables [50], with the most important risk factors as follows: in 2-year-olds, allergies and infections before first tooth eruption and intake of drugs during the first 12 months of life; in 3-year-olds, mother's age at the time of pregnancy and smoking during pregnancy [29], consumption of sweetened food during the first 12 months of life and nocturnal drinking of sweet drinks after the 12th month [3]; in 4-year-olds and older children, frequency of tooth brushing, fluoride treatment and mother's education [51].
One of the hypotheses might be that distinct processes, and hence different genes and polymorphisms, play a significant role throughout the stages of a child's growth and development. Since the genetic sequence does not change over the course of life, each stage might be sensitive to distinct variants, including those that shape susceptibility to caries development. Simultaneously, the sensitivity to the secondary environmental factors and the tertiary behavioral factors, responsible for the intermediate and distal effects, respectively, might also change. To a large extent, this can be explained by the order of tooth eruption and by tooth type and number. Caries lesions appear first on incisors and molars, and maxillary teeth are affected more often than mandibular teeth. Caries formation predominantly affects teeth with deep and/or narrow pits and fissures on the occlusal surface, which is closely related to the morphology of primary dentition and molars [4,46,50]. The deeper and narrower the pit, the easier it is for the bacterial plaque to penetrate and adhere to a tooth surface. The higher the number of teeth with deep and narrow pits, together with inappropriate brushing, the greater the predisposition to a higher caries rate, which is especially relevant in preschoolers [50,52,53]. A similar pattern of caries development has been observed in 3-year-old children in other studies [53–55]. In this study, we did not observe differences in the total number of erupted teeth, and the numbers of teeth of each type were similar between affected and caries-free children and between boys and girls in both subgroups (not shown). Although we did not conduct a follow-up study, some authors have pointed out that caries susceptibility varies depending on the tooth morphology and rises sharply 2-3 years after tooth eruption, when posteruptive enamel maturation takes place [4,5].
Therefore, dental caries seems to be a remarkably divergent trait that changes constantly over time, with its occurrence increasing with age. The prevalence of caries experience in our study was 18.3% and was comparable to that in children of the corresponding age of other ethnicities, i.e., Western Europe, Eastern-Southern Asia and Sub-Saharan Africa [3,29,56]. Interestingly, it appeared to be much lower when compared to other studies concerning Polish children with active caries from distinct regions, i.e., 40.8% in the Podlasie region [57], 53.8% in Lower Silesian, Malopolskie and Lubelskie voivodships [4] or from 35% to 56.6% in the general 2-3-year-olds population [3,13,57]. However, according to Werneck et al. [13], the caries rate can reach as high as 85% in preschoolers.
Another risk factor for caries lesion development, closely related to children's age, is the presence and composition of bacterial plaque in the host. The level of colonization by cariogenic bacteria has been considered the strongest risk factor in 3-year-old Polish children [49]; however, it should be emphasized that it might be influenced by other features. Firstly, the activity of Lactobacilli and Mutans Streptococci is higher in preschoolers and in primary dentition in comparison with older children and permanent teeth [10,50,58]; therefore, it might be a biological predisposition of the host. Secondly, the younger the child, the more attention and supervision over teeth brushing should be provided by the parents/guardians, thereby minimizing the bacterial film and establishing good hygiene habits. The mother not brushing a child's teeth was found to be a major risk factor in Mutans Streptococci infections and ECC in Australian children aged 12-72 months but also in 9-month-old Thai children [6,59].
Both genetic and host factors fluctuate with age, and with a broad spectrum of environmental factors, they contribute to the development of caries lesions differently in every individual. Moreover, even children of the same ethnicity that are exposed to the same levels of risk factors might present distinct caries severity index and patterns of the disease [13], which strongly supports the existence of other, intrinsic, i.e., genetic components. Still, depending on the population and its habits, prediction models using the same risk factors might be characterized by distinct final prediction parameters, while comparable prediction features of different models might be determined by distinct input variables [8]. To mention a few, Kalhan et al. reached a high AUC value of 0.81-0.91 for caries prediction in 2-year-olds and 0.79 in 3-year-olds [29], while Fontana et al. [60] obtained an AUC of 0.73 in 4-year-old children. Tamaki et al. [61] came across 33 studies on caries prediction; however, the majority of them used logistic regression analyses and only some authors used additional, advanced machine learning algorithms to compare the results, namely artificial neural network, decision analysis or classification and regression trees [27,50,61–63]. The benefit of logistic regression analysis is that it can be conducted using non-complicated software; however, the procedure is descriptive in its nature and should be applied to assess risk indicators, not predictors. On the other hand, advanced machine learning methods consider the unequal strength of potential markers, so that a weak factor is not hidden by a strong one and the predictive power is more reliable [9,51]. In fact, we observed some differences in the importance level of studied variants between the two approaches, although it is not entirely adequate to compare the results of different algorithms, even when they describe the same features.
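Since the AUC values quoted above are the common currency for comparing prediction models, it may help to recall what the statistic measures: the probability that a randomly chosen affected child receives a higher predicted score than a randomly chosen caries-free child. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not data from any of the cited studies.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case score and a control score count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy example (hypothetical predicted risks for 3 cases and 3 controls).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.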
Still, one of the NN models, NN5, in which the logistic function was applied as the activation function in both the hidden and output layers, gave the highest importance values for all tested predictors, while in the remaining 5 models, based on distinct algorithms, the values were lower but comparable with each other within one model. The NN5 model presented the second-best prediction accuracy in this study; however, using logistic methods only might suggest over-fitting of the model, even in advanced machine learning approaches. Likewise, So and Sham [64] stated that ROC curve analysis is not directly correlated with the disease risk and that even high significance, i.e., low p-values, does not equal good predictive power. Javed et al. [27] obtained 99% accuracy in caries prediction in 6-14-year-old Indian children using neural network modeling, and it was the method of choice in other studies on medical prediction [65][66][67][68].
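To make the description of a network with logistic activation in both the hidden and output layers more concrete, the following sketch shows a single forward pass through such a network. It is an illustration only and not the NN5 model itself: the layer sizes, weights, biases and input vector are hypothetical values chosen for the example.

```python
import math


def logistic(z: float) -> float:
    """Logistic (sigmoid) function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_bias, output_weights, output_bias):
    """One forward pass: input -> logistic hidden layer -> logistic output."""
    hidden = [
        logistic(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_bias)
    ]
    return logistic(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical input: three SNPs coded as allele counts (0, 1 or 2).
x = [0, 1, 2]
# Hypothetical parameters for a network with 2 hidden units.
hidden_weights = [[0.5, -0.3, 0.8], [-0.6, 0.4, 0.1]]
hidden_bias = [0.0, 0.1]
output_weights = [1.2, -0.7]
output_bias = -0.2

risk = forward(x, hidden_weights, hidden_bias, output_weights, output_bias)
```

Because the output unit is also logistic, the result always lies strictly between 0 and 1 and can be read as a predicted probability of the caries class.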
Another issue worth mentioning is the nature of the variables themselves. When studying some environmental, behavioral and biological factors, collinearity might occur, e.g., the presence of dental plaque might be the result of poor oral hygiene, which in turn might be correlated with a lack of guardians' control over the child's brushing [63]. While it is not a problem in advanced modeling approaches, it should be avoided in logistic regression analyses, since it tends to cause over-fitting of the data and spurious results [63,69]. The strength of this study is a homogeneous study group that is sex- and age-adjusted, since both factors are found to be potential confounders in case-control studies [24]. Secondly, the assessment and comparison of the results and performance of two different statistical approaches, adjusted for the data and therefore more reliable, appears to be crucial when implementing the method in clinical practice. The obvious limitation of this study, and a necessary future direction, is a larger study group and validation of the model.
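Collinearity of the kind described above can be screened for before fitting a logistic regression, for instance with the variance inflation factor (VIF). The sketch below covers only the simplest case of two predictors, where VIF = 1 / (1 − r²) and r is their Pearson correlation; it is not part of the analysis in this study, and the variable names and toy values (a plaque score and a brushing-supervision flag) are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)


# Hypothetical toy data: plaque score and whether brushing is supervised.
plaque_score = [3, 2, 3, 1, 0, 1]
supervised = [0, 0, 1, 1, 1, 1]

r = pearson_r(plaque_score, supervised)
vif = vif_two_predictors(plaque_score, supervised)
```

A VIF of 1 means the two predictors are uncorrelated; the further it rises above 1, the more the variance of the regression coefficients is inflated by their overlap.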
The polymorphisms analyzed in this study and the high caries prediction rates indicate a strong genetic component in the course of the disease. Nevertheless, one has to remember the multifactorial nature of caries, that even the best prediction model cannot fully describe real-life scenarios, and that neither solely genetic nor solely environmental factors can completely explain the cause of the disease. Hence, analyses exploring environmental and genetic predictors need to be conducted very carefully. Firstly, some genetic variants might influence behavioral habits, e.g., SNPs in taste genes [11], and non-genetic factors themselves might be correlated with one another, yielding spurious results. Secondly, compared to the environmental impact, if some genetic effect on a disease trait exists, it remains the main and unchangeable element in the disease incidence, especially early in life [70,71]. Even under the influence of the environment, genetic factors combined with non-genetic findings have been replicated only partially and with a lack of statistical power [70]. Although many rare variants or the ones with smaller effect might be missed in purely genetic case-control studies, the subgroup analyses and high homogeneity of the study subgroups greatly improve the prediction [72,73]. Some authors [70,74] have studied and described interesting gene-environment association tests that successively and carefully analyze both components in multiple testing approaches. McAllister et al. [75] have developed novel statistical approaches to detect genetic-non-genetic interactions with regard to different durations of exposure to environmental factors. Therefore, a fully developed caries prediction model undoubtedly requires a complex analysis of multiple factors as well as replication studies before implementation in practice.
Still, the high prediction value of genetic polymorphisms presented in this study constitutes a valuable finding on early caries development that sets the direction for future research.

Conclusions
In conclusion, our study demonstrates neural network models of high accuracy for dental caries prediction in early childhood, based solely on the genetic single nucleotide polymorphisms selected using more basic and less statistically advanced methods. An effective and early prediction of intrinsic risk factors might influence the change in eating habits, improve oral hygiene and other behavioral factors when an individual is at a high risk of developing caries lesions. Early implementation of preventive strategies and a customized early treatment might decrease the risk of caries while improving children's health, the quality of life and self-esteem, as well as reducing the financial burden associated with medical care.