1. Introduction
Advances in information and communication technology (ICT), artificial intelligence (AI), and data analytics have ushered in an era of precision medicine, in which healthcare is increasingly tailored to individual characteristics [1]. This paradigm shift has gradually extended to dentistry, evolving into precision dentistry. This new approach integrates biological, behavioral, and environmental information to predict disease risk and develop personalized preventive strategies. Accordingly, oral healthcare is transitioning from a treatment-oriented approach focused on disease occurrence to a data-driven, predictive, personalized prevention model [2].
Dental caries is a multifactorial disease influenced by the complex interplay of microorganisms, dietary habits, behavioral patterns, and socioeconomic factors [3]. To address the limitations of conventional preventive education that fails to account for such complexity, Featherstone proposed the Caries Management by Risk Assessment (CAMBRA) model, which comprehensively evaluates disease indicators, risk factors, and protective factors [4]. However, CAMBRA has inherent limitations in adequately capturing the complex interactions among risk factors and temporal transitions in caries risk over time [5].
The oral environment and health-related behaviors of preschool children change rapidly due to growth and caregiver involvement. This makes it difficult for single-time-point assessments to fully reflect dynamic changes in caries risk [6]. Continuous follow-up and data-driven approaches are therefore required. Given that caregivers’ behaviors directly influence children’s oral health, participatory mobile health (mHealth) strategies have been shown to be particularly effective [7]. The CAMBRA-kids application, developed based on a Korean-adapted caries risk assessment tool, has been used to promote self-care and behavioral modification in children. However, previous studies have primarily focused on usability evaluation and short-term effectiveness [8]. To date, limited research has examined long-term changes in caries risk categories or the determinants influencing such transitions.
Recently, machine learning-based predictive models have been increasingly applied across healthcare disciplines to elucidate complex relationships among multiple interacting factors [9,10]. Among these methods, Random Forest algorithms, which aggregate multiple decision trees, are widely used. They provide high predictive accuracy and enable a quantitative assessment of relative feature importance. This makes them particularly suitable for clinical prediction research [11]. Despite their strong predictive performance, ensemble models are often criticized as “black-box” systems due to their limited interpretability [12]. To address this limitation, explainable artificial intelligence (XAI) techniques, such as SHapley Additive exPlanations (SHAP), have gained attention. SHAP uses a game-theoretic framework to visualize the contribution of individual variables, thereby enhancing model interpretability and expanding the clinical applicability of complex predictive models [13].
This study aimed to apply the concept of precision dentistry to caries prevention in preschool children by utilizing machine learning techniques to predict caries risk based on data collected through the CAMBRA-kids mobile application and to analyze interactions among associated factors. This approach aims to provide foundational evidence for establishing a long-term, data-driven oral healthcare management framework.
2. Materials and Methods
2.1. Study Design and Ethical Approval
This retrospective cohort study with longitudinal follow-up used secondary data collected through a mobile-based caries management program to identify factors associated with the escalation of dental caries risk in preschool children. It was conducted in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines for cohort studies (see Table S1). The analysis focused on changes in caries risk status from pre-intervention to post-intervention following a 12-month caries management program delivered via the CAMBRA-kids mobile application, which is based on the Caries Management by Risk Assessment (CAMBRA) framework [4].
The dataset used in this study originated from a previously implemented intervention evaluating the effectiveness of the CAMBRA-kids mobile application for children under 5 years of age [8]. The present study represents a secondary analysis of de-identified data collected during that intervention, and no additional data were collected.
The study protocol was approved by the Institutional Review Board of Namseoul University (IRB No. NSU-1041479-007; approval date: 11 June 2019). To ensure participant confidentiality, all personal identifiers were removed prior to analysis. These de-identified data were stored in a secure, password-protected environment accessible only to the investigators. The requirement for informed consent was waived due to the retrospective use of secondary data.
2.2. Study Population and Data Collection
Data were derived from a longitudinal clinical study [8] involving a one-year caries management program via the CAMBRA-kids mobile application. The original intervention targeted preschool children under 5 years of age and was designed to support caries risk management through individualized feedback and behavior modification [8].
Participants were included if caregivers provided informed consent and completed the CAMBRA-kids application with the required entries on their child’s risk and protective factors, and if the child was able to participate throughout the intervention period. Participants were excluded if caregivers did not provide consent or did not download or complete the application, if the child was absent from scheduled program sessions or was transferred during the study period, or if caregiver questionnaire responses were incomplete. For this secondary analysis, only children with complete data at both baseline (T0) and 12-month follow-up (T1) were included. A total of 119 preschool children were included in the final analysis.
Assessments at enrollment (pre-intervention) and at the 12-month follow-up (post-intervention) covered CAMBRA-based caries-related variables (disease indicators, risk factors, and protective factors); quantitative light-induced fluorescence (QLF) measures (ΔF, ΔR30, ΔR70, and ΔR120); oral hygiene status assessed using the Simple Hygiene Score (SHS); salivary flow rate for the assessment of severe dry mouth; and caregiver-related measures such as oral health knowledge and self-efficacy.
2.3. Outcome Definition
The primary outcome was defined as a ‘high-risk transition’, a binary variable indicating whether a child moved to or remained in the ‘high’ or ‘extreme high’ caries risk categories at the 12-month follow-up (T1). Children classified into these categories at T1 were coded as positive cases (event = 1), whereas those in the ‘low’ or ‘moderate’ risk categories were coded as negative cases (event = 0). This binary outcome was used for the subsequent predictive modeling.
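The outcome coding described above can be illustrated with a minimal sketch. The category labels and the function name `code_high_risk_event` are assumptions made for illustration; the source does not describe how the risk categories were stored.

```python
# Hypothetical category labels; the actual data schema is not given in the source.
HIGH_RISK_LABELS = {"high", "extreme high"}


def code_high_risk_event(risk_category_t1: str) -> int:
    """Return 1 if the T1 category is 'high' or 'extreme high', otherwise 0."""
    return int(risk_category_t1.strip().lower() in HIGH_RISK_LABELS)


# Toy values for illustration only (not study data).
toy_categories_t1 = ["low", "moderate", "high", "extreme high"]
toy_events = [code_high_risk_event(c) for c in toy_categories_t1]
```

Under this coding, the event reflects the risk category at T1 regardless of the baseline category, which matches the "moved to or remained in" wording of the definition.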
2.4. Variables and Measurements
Caries-related variables were collected in accordance with CAMBRA guidelines and categorized into disease indicators, risk factors, and protective factors [4,5]. Variable definitions and assessment procedures followed those used in a prior CAMBRA-kids study [8] to ensure consistency and comparability across analyses.
Assessments were performed at baseline and repeated after 12 months. Risk and protective factors, including caregiver-related characteristics, dietary habits, oral hygiene behaviors, fluoride exposure, and preventive practices, were obtained from caregiver responses entered into the CAMBRA-kids mobile application.
Disease indicators were assessed by clinicians through clinical examination and included enamel defects or white spot lesions, cavitated caries lesions, and restorations due to caries. Quantitative light-induced fluorescence (QLF) was additionally used to evaluate enamel demineralization and plaque-related conditions. Plaque accumulation and plaque maturity were assessed using Qraycam™ with dedicated analysis software (QA2 v1.23; Inspektor Research Systems BV, Amsterdam, the Netherlands), whereas enamel demineralization was assessed using Qraypen™ (AIOBIO Co., Ltd., Seoul, Republic of Korea) and Q-ray software (v1.34; AIOBIO Co., Ltd., Seoul, Republic of Korea), with fluorescence loss quantified as the average ΔF (%).
Plaque maturity was evaluated using red fluorescence plaque parameters (ΔR30, ΔR70, and ΔR120), with higher ΔR values indicating greater plaque maturity. All fluorescence images were obtained in a dark environment by a single trained dental hygienist to ensure measurement reliability. Oral hygiene status was assessed using the Simple Hygiene Score (SHS), scored from 0 to 5, and severe dry mouth was assessed by measuring salivary flow rate [8,14,15,16].
2.5. Pre–Post Change Variables
To capture longitudinal changes over the intervention period, change variables (delta variables) were calculated as the difference between the 12-month follow-up (T1) and baseline (T0) values (Δ = T1 − T0). Baseline (T0) values were also included as covariates to adjust for initial differences in caries risk status among participants.
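As a minimal sketch of the change-variable calculation (Δ = T1 − T0), assuming a wide-format table with one column per time point: the column names, the `_t0`/`_t1` suffixes, and the helper `add_delta_columns` are hypothetical and are not taken from the study.

```python
import pandas as pd


def add_delta_columns(df, variables, t0_suffix="_t0", t1_suffix="_t1"):
    """Append delta_<name> = <name>_t1 - <name>_t0 for each listed variable.

    Baseline (T0) columns are left in place so they can serve as covariates.
    """
    out = df.copy()
    for name in variables:
        out[f"delta_{name}"] = out[f"{name}{t1_suffix}"] - out[f"{name}{t0_suffix}"]
    return out


# Toy values for illustration only (not study data).
toy = pd.DataFrame(
    {
        "dF_t0": [8.0, 5.0],
        "dF_t1": [10.0, 5.0],
        "shs_t0": [3, 2],
        "shs_t1": [1, 4],
    }
)
toy = add_delta_columns(toy, ["dF", "shs"])
```

Keeping the T0 columns alongside the deltas mirrors the adjustment for baseline status described above.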
2.6. Data Preprocessing and Feature Selection
Categorical variables, including binary-coded risk factors, protective factors, and disease indicators (0 = absence, 1 = presence), were entered into the model as binary indicators, whereas continuous delta variables were standardized to have a mean of zero and a standard deviation of one to minimize scale-related bias. Changes in caregiver oral health knowledge and self-efficacy demonstrated minimal contribution to predictive performance and were excluded from the final model to improve model parsimony. All analyses were conducted using Python (version 3.12; Python Software Foundation, Wilmington, DE, USA) in the Google Colab environment (Google LLC, Mountain View, CA, USA), with the scikit-learn (version 1.4.2) and SHAP (version 0.44.1) libraries.
2.7. Machine Learning Model Development
A Random Forest classifier consisting of 500 decision trees was used to predict post-intervention caries risk escalation [17,18]. Delta variables were used as primary predictors, and pre-intervention variables were included as covariates. Model development was conducted within a pipeline-based framework incorporating data preprocessing, class-weight adjustment to address class imbalance, and cross-validation. Numerical variables were imputed using the median, whereas categorical variables were imputed using the most frequent value and one-hot encoded. Model development and internal validation were performed using 5-fold stratified cross-validation with shuffled splits (random state = 42). Feature importance and stability were examined using SHAP values, and variables showing stable contributions across models were retained for the final model.
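A pipeline of the kind described above might be sketched as follows with scikit-learn. This is an illustrative reconstruction rather than the study's code: the column names and synthetic data are invented, `class_weight="balanced"` is an assumed form of the class-weight adjustment, and the Random Forest seed is an assumption (the source gives a random state only for the cross-validation splits).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols):
    """Median/mode imputation, scaling, one-hot encoding, then a 500-tree forest."""
    numeric = Pipeline(
        [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
    )
    categorical = Pipeline(
        [
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )
    preprocess = ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)]
    )
    forest = RandomForestClassifier(
        n_estimators=500,
        class_weight="balanced",  # assumed weighting scheme
        random_state=42,  # assumed seed for the forest itself
    )
    return Pipeline([("pre", preprocess), ("rf", forest)])


# Synthetic stand-in data (not study data); 119 rows to mirror the sample size.
rng = np.random.default_rng(0)
n = 119
X = pd.DataFrame(
    {
        "delta_a": rng.normal(size=n),
        "delta_b": rng.normal(size=n),
        "risk_factor": rng.integers(0, 2, size=n),
    }
)
X.loc[::10, "delta_b"] = np.nan  # a few missing values to exercise imputation
score = X["delta_a"] + 0.5 * rng.normal(size=n)
y = (score > np.quantile(score, 0.2)).astype(int).to_numpy()

pipe = build_pipeline(["delta_a", "delta_b"], ["risk_factor"])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Out-of-fold probability of the positive class for every row.
oof_prob = cross_val_predict(pipe, X, y, cv=cv, method="predict_proba")[:, 1]
```

Placing imputation and scaling inside the pipeline means they are refit on each training fold, so no information from the held-out fold leaks into preprocessing.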
2.8. Model Evaluation and Interpretability
Model performance was evaluated using receiver operating characteristic (ROC) and precision–recall (PR) curves. Given the imbalanced nature of the outcome variable, PR curves were used alongside ROC curves to provide a more informative assessment of predictive performance [18]. Area under the ROC curve (AUC) and average precision (AP) were calculated [19]. Out-of-fold predicted probabilities obtained from the cross-validation procedure were used for ROC and PR analyses, and the optimal classification threshold was determined from the precision–recall curve using the threshold that maximized the F1 score.
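The threshold selection described above can be sketched as follows. The helper `f1_optimal_threshold` and the toy arrays are illustrative assumptions; in the study, the inputs would be the observed outcomes and the out-of-fold predicted probabilities.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)


def f1_optimal_threshold(y_true, y_prob):
    """Return the PR-curve threshold that maximizes F1, and that F1 value."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The last (precision=1, recall=0) point has no associated threshold.
    p, r = precision[:-1], recall[:-1]
    denom = p + r
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Toy labels and probabilities for illustration only (not study data).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.2, 0.6, 0.7, 0.8, 0.4, 0.3, 0.9, 0.65])

auc = roc_auc_score(y_true, y_prob)
ap = average_precision_score(y_true, y_prob)
threshold, best_f1 = f1_optimal_threshold(y_true, y_prob)
y_pred = (y_prob >= threshold).astype(int)  # classify at the selected threshold
```

Because F1 combines only precision and recall, a threshold chosen this way does not take specificity into account, which is worth keeping in mind when the positive class is common.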
To improve model interpretability, SHapley Additive exPlanations (SHAP) analysis was performed to quantify the contribution and directional impact of individual pre–post change variables on model predictions. SHAP is a game-theoretic approach that enables transparent interpretation of complex ensemble models [12,13].
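For reference, the Shapley value on which SHAP is based can be written as below. The notation is the conventional one and is an assumption here, not taken from this study: F is the full set of features, S is a subset that excludes feature i, f_S is the model output when only the features in S are known, and φ_0 is the baseline (expected) prediction.

```latex
\[
\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}}
\frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
\Bigl[\, f_{S \cup \{i\}}\bigl(x_{S \cup \{i\}}\bigr) - f_S\bigl(x_S\bigr) \Bigr],
\qquad
f(x) \;=\; \phi_0 + \sum_{i \in F} \phi_i .
\]
```

The second identity (additivity) is what allows each prediction to be decomposed into per-variable contributions with a sign, which is the basis for describing the direction of a variable's effect. Global importance is commonly summarized as the mean of |φ_i| across samples, although the source does not state which aggregation was used.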
3. Results
3.1. Distribution of Disease Indicators, Risk Factors, and Protective Factors at Pre- and Post-Intervention
The distributions of CAMBRA-based disease indicators, risk factors, and protective factors at pre- and post-intervention are presented in Table 1. Frequencies and percentages were calculated to describe changes over the 12-month period.
Among the disease indicators, the proportion of children with restorations present (past caries experience for the child) increased from 33.6% at pre-intervention to 67.2% at post-intervention, whereas the proportion with obvious white spots, decalcifications, or enamel defects decreased from 19.3% to 5.0%. The proportion of children with obvious plaque on the teeth and/or gums that bleed easily remained high at both pre-intervention (85.7%) and post-intervention (84.0%).
Among risk factors, caregiver caries experience and frequent sugar intake decreased from 38.7% to 28.6%. Among protective factors, the proportion of children brushing at least twice daily with fluoride toothpaste increased from 84.9% to 90.8%, fluoride varnish application within the previous 6 months increased from 32.8% to 65.5%, and caregiver xylitol use increased from 67.2% to 86.6%.
3.2. Transitions in Caries Risk Categories from Pre- to Post-Intervention
Transitions in caries risk categories over the 12-month period are illustrated in Figure 1. Among children classified as low-risk at pre-intervention, 28.6% transitioned to high-risk at post-intervention. Among those classified as moderate-risk at pre-intervention, 42.9% transitioned to high-risk. Conversely, 22.2% of children classified as extreme high-risk at pre-intervention transitioned to high-risk at post-intervention.
3.3. Predictive Performance of the Random Forest Model
The Random Forest model demonstrated acceptable discriminative performance for predicting post-intervention caries risk escalation. The ROC curve yielded an AUC of 0.773. Precision–recall analysis showed an average precision (AP) of 0.919.
Using a PR-optimized classification threshold of 0.698, the model achieved an accuracy of 0.798 and a balanced accuracy of 0.701. Precision and recall were 0.892 and 0.856, respectively, while specificity was 0.545. The F1 score at this threshold was 0.874 (Figure 2).
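The relationships among these metrics can be checked with a short calculation. The sketch below uses only the rounded values reported above and the standard definitions of F1 and balanced accuracy; the function names are illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy_from(recall: float, specificity: float) -> float:
    """Mean of sensitivity (recall) and specificity."""
    return (recall + specificity) / 2


# Rounded values as reported in Section 3.3.
precision, recall, specificity = 0.892, 0.856, 0.545

f1 = f1_from(precision, recall)
balanced_accuracy = balanced_accuracy_from(recall, specificity)
print(f"F1 = {f1:.3f}, balanced accuracy = {balanced_accuracy:.4f}")
```

Both results agree with the reported F1 score (0.874) and balanced accuracy (0.701) to within the rounding of the inputs.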
3.4. SHAP-Based Importance of Pre–Post Change Variables
SHAP analysis was performed to evaluate the contribution of the pre–post change variables to model predictions (Figure 3). Among all delta variables, change in fluorescence loss (ΔΔF) showed the highest importance (0.074), followed by change in restored teeth status (ΔD3; 0.068) and change in red fluorescence plaque area at the 70% threshold (ΔΔR70; 0.061).
Increases in ΔΔF, ΔD3, and ΔΔR70 contributed to classification into the high or extreme caries risk group. Among protective factors, change in caregiver caries-free status (ΔP6) demonstrated a moderate contribution, with decreases associated with higher risk classification and increases associated with lower risk classification.
4. Discussion
This study used 12-month longitudinal follow-up data collected through the CAMBRA-kids mobile application to predict transitions to high- or extreme-risk categories for dental caries and identify the determinants of these transitions. A Random Forest machine learning model, combined with SHapley Additive exPlanations (SHAP), was used to quantify the contribution of changes in clinical indicators and behavioral factors to caries risk transitions. This model also provided a clinically interpretable explanation of the predictive process.
Following the intervention, the overall mean levels of disease indicators and risk factors improved; however, a subset of children continued to transition to high or extreme caries risk categories. These findings suggest that although caries risk in early childhood may be reduced at the population level through intervention, risk transitions can persist at the individual level. The oral environment during early childhood is highly dynamic and can change rapidly in response to external factors such as caregiver management behaviors, dietary habits, and fluoride exposure, leading to temporal fluctuations in caries risk [17]. This highlights the limitation of relying solely on average intervention effects to explain individual risk trajectories and supports the necessity of approaches that incorporate pre–post change variables (delta variables). Accordingly, this study moved beyond traditional CAMBRA score-based or cross-sectional analyses [8] by examining how longitudinal biological and clinical changes contribute to caries risk transitions over time.
The Random Forest-based predictive model demonstrated acceptable discriminative performance, achieving an ROC–AUC of 0.773. Notably, despite the pronounced class imbalance in the dataset, with relatively few cases exhibiting transitions to higher-risk categories, the area under the precision–recall (PR) curve was high, with an average precision (AP) of 0.919. This suggests that the model identified true cases of caries risk escalation with high precision. These findings support the use of a machine learning approach integrating clinical indicators with data collected through the CAMBRA-kids mobile platform as a digital healthcare tool for the early prediction of oral health deterioration and the support of preventive interventions in preschool children [20].
The classification threshold, selected by maximizing the F1 score on the precision–recall curve, yielded higher recall than specificity. This balance is consistent with the clinical judgment that failing to identify children at high risk (false negatives) poses a greater potential for harm than overclassifying low-risk children as high-risk (false positives) [21]. The relatively lower specificity observed in the model indicates an increase in false-positive classifications; however, in clinical prediction contexts, sensitivity and specificity are inherently subject to a trade-off. Where the clinical cost of false negatives exceeds that of false positives, previous studies have recommended prioritizing sensitivity [22]. From this perspective, supplementary procedures, including closer follow-up and additional assessments, may mitigate the adverse effects of misclassification in real-world clinical applications.
Analysis of feature importance using SHAP revealed that ΔΔF, ΔD3, and ΔΔR70 were the most influential predictors of transitions to high-risk categories after the intervention. ΔΔF exerted the strongest impact on model predictions and may be considered a clinically meaningful indicator. ΔF denotes enamel fluorescence loss measured using quantitative light-induced fluorescence (QLF), and ΔΔF denotes its change from baseline to the 12-month follow-up. QLF can detect subtle progression of early mineral loss with a high degree of sensitivity [23]. Previous studies have reported that ΔF can distinguish between progression and arrest of demineralization prior to cavitation [24]. In line with these findings, the present results suggest that caries risk transitions do not occur abruptly at the stage of clinically evident lesions but rather emerge along a biological continuum characterized by the cumulative progression of early demineralization over time.
Restored teeth indicate prior caries experience, and ΔD3 captures the change in this disease indicator over the intervention period. The identification of ΔD3 as a key predictor suggests exposure to a high-risk environment in which caries has already developed. Previous cohort studies [16] and analyses of the CAMBRA-students mobile application in adolescent populations [5] have similarly reported that changes in disease indicators, such as prior caries or restoration experience, are strong predictors of subsequent transitions to high-risk categories. The findings of this study indicate that past caries experience functions not merely as a record of treatment history but as an accumulated marker of disease activity and sustained risk exposure.
The emergence of ΔΔR70 as a highly influential variable indicates that increases in caries risk are more closely associated with qualitative changes in dental plaque accumulated over time than with transient fluctuations in oral hygiene status [16]. Together with early demineralization changes (ΔΔF), this finding supports the notion that transitions to high-risk categories occur along a continuum of caries development, involving plaque maturation, increased acid production, mineral loss, and subsequent disease activation. This interpretation is consistent with Spatafora et al., who described caries development as a progressive biological process rather than a discrete clinical event [25].
This study has several limitations. As a retrospective cohort study involving preschool children who participated in a single intervention program, the sample size was limited, and the number of cases exhibiting upward risk transitions was relatively small. This resulted in an imbalanced data structure. Consequently, the selected classification threshold favored recall at the cost of lower specificity. Nevertheless, the study provides both academic and clinical value by moving beyond single-time-point scoring or cross-sectional analyses and by predicting and explaining caries risk transitions using pre–post change variables. Future longitudinal cohort studies with extended follow-up periods are needed to evaluate the external generalizability of change-based predictive models and to assess their applicability across diverse clinical settings.
Based on our findings, we propose several implications for clinical practice and future research. In clinical practice, oral healthcare providers should prioritize the longitudinal monitoring of clinical indicators such as ΔΔF, which reflects mineral loss of the tooth surface, and ΔΔR70, which reflects dental plaque maturity, as these measures may capture cumulative biological changes associated with caries risk progression. Future research should further investigate whether the provision of explainable artificial intelligence (XAI)-based feedback through mobile health platforms can influence caregiver behaviors and improve long-term oral health outcomes across diverse clinical settings.
5. Conclusions
This study used multidimensional data collected through the CAMBRA-kids mobile application to predict which preschool children would transition to high- or extreme-risk categories for dental caries following a 12-month intervention. The study also aimed to identify the key contributing factors underlying such transitions.
The Random Forest-based predictive model demonstrated acceptable discriminative performance, achieving an ROC–AUC of 0.773 and an average precision (AP) of 0.919 in identifying cases of caries risk escalation. The change-based approach, which focuses on pre–post delta variables, was particularly useful in explaining risk transitions at the individual level that are obscured by average intervention effects in conventional point-in-time classifications. SHAP analysis identified the change variables ΔΔF, ΔD3, and ΔΔR70 as the most influential contributors to transitions to high-risk categories. These findings suggest that the escalation of caries risk reflects cumulative biological and clinical changes rather than transient behavioral fluctuations.
These results demonstrate that change-based approaches incorporating explainable artificial intelligence can facilitate the early identification of high-risk children and provide a robust foundation for developing personalized preventive strategies.