1. Introduction
Preterm birth (PTB), defined by the World Health Organization as delivery occurring before 37 completed weeks of gestation [
1], complicates about 10% of the obstetric population and represents a major global health challenge [
2]. The etiology of PTB is broadly classified into two subtypes: spontaneous onset, which encompasses 75–80% of cases arising from either preterm labor or preterm prelabor rupture of membranes, and medically indicated birth, which accounts for the remaining 20–25% and results from obstetric intervention [
3]. The global incidence of PTB has shown a concerning increase in recent years [
4]. The profound clinical impact of prematurity is highlighted by its status as the leading cause of neonatal mortality worldwide, responsible for over one million infant deaths each year and for significant lifelong morbidity among survivors [
5]. In particular, the consequences of PTB include a wide spectrum of severe short- and long-term health complications; immediate neonatal morbidities frequently include respiratory distress syndrome, intraventricular hemorrhage, bronchopulmonary dysplasia, necrotizing enterocolitis, retinopathy of prematurity, periventricular leukomalacia, and sepsis [
6,
7,
8]. For infants who survive these initial challenges, the long-term sequelae can be debilitating, often manifesting as cerebral palsy or other significant cognitive and neurobehavioral deficits [
9,
10]. Beyond the direct health impact on the individual and their family, prematurity imposes a substantial socioeconomic burden. The costs associated with prolonged stays in neonatal intensive care units, recurrent hospitalizations, and the need for specialized long-term follow-up are considerable [
11].
A central challenge in PTB prediction and prevention is its complex and multifactorial etiology. The major risk factors for spontaneous PTB are a prior preterm delivery and a short cervix; multivariable predictive models that integrate these key indicators with other clinical and obstetric characteristics yield superior prognostic value compared with either risk factor alone [
12]. Effective risk stratification for spontaneous PTB in the general obstetric population is of considerable clinical importance, as it facilitates the timely application of prophylactic measures for pregnancies identified as high risk [
13]. In pregnancies at high risk for spontaneous PTB, the primary therapeutic options that have been systematically evaluated include hormonal support with vaginal progesterone, surgical reinforcement via cervical cerclage, and mechanical support with a silicone pessary [
14,
15]. Iatrogenic PTB, on the other hand, is mainly the result of placental dysfunction in the form of preeclampsia (PE), fetal growth restriction (FGR) and stillbirth. Recent data suggest that early aspirin administration in women at high risk for preterm PE may reduce the rate of preterm deliveries in pregnancies with clinical manifestations of placental dysfunction and the overall severity of PTB [
13].
Accurate prediction is the key to optimizing perinatal outcomes in both spontaneous and iatrogenic PTB, by allowing for the timely administration of interventions including antenatal corticosteroids and magnesium sulfate [
16,
17]. It also offers significant logistical benefits, enabling better planning of neonatal intensive care resources and transfer to specialized centers [
10,
18]. The second trimester is widely regarded as a crucial stage for PTB risk assessment. This timeframe offers a balance between achieving greater predictive accuracy and allowing sufficient opportunity for implementing prophylactic measures and stratifying our population into different intensities of care [
12,
19,
20]. To date, one of the most powerful second-trimester models for PTB achieved an Area Under the Curve (AUC) of 0.75; its performance relied on incorporating numerous socioeconomic variables alongside standard medical data, which are not easily accessible, and combined both spontaneous and iatrogenic PTB [
21]. Other relevant studies that utilized more restricted, clinically focused variable sets from the second trimester have reported models with AUC scores below 0.75 [
22]. Furthermore, the prediction of spontaneous PTB has proven particularly challenging, with current models demonstrating limited accuracy with data from both the first [
23] and the second trimester [
24], a fact often attributed to the multifactorial pathophysiology of this condition [
25]. Conversely, while the clinical pathways leading to iatrogenic PTB are often more defined, this subtype has been largely overlooked in prediction research, with a significant scarcity of dedicated models [
26]. Notably, we found only one study that investigated the prediction of iatrogenic PTB, and it was limited to women with a scarred uterus [
27]. This clear gap in the literature, compounded by methodological limitations in many existing studies, underscores the need for robust, subtype-specific models that can be readily translated into clinical practice. Crucially, by treating PTB as a single outcome, previous models may have been limited, as predictors for spontaneous and iatrogenic subtypes can have different or even opposing effects, potentially obscuring important predictive signals.
Therefore, the objective of this study was to develop and internally validate robust predictive models for spontaneous and iatrogenic PTB at <32, <34, and <37 weeks’ gestation, comparing the performance of traditional Logistic Regression with several machine learning algorithms. Our approach utilized a set of readily available and cost-effective variables derived from maternal history and routine second-trimester ultrasound examinations, aiming to provide a practical and effective tool for risk stratification in a contemporary antenatal care setting.
2. Materials and Methods
2.1. Study Design and Setting
This retrospective cohort study included a consecutive sample of women who attended the 3rd Department of Obstetrics and Gynecology, School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Greece, between April 2012 and May 2025. All participants received routine antenatal care within this single tertiary institution, and all ultrasound examinations were performed by fetal medicine specialists accredited by the Fetal Medicine Foundation, London, UK. Data for this study were extracted from dedicated electronic medical records, which encompassed detailed demographic, clinical, and sonographic information.
2.2. Ethical Considerations and Reporting Standards
Informed consent was obtained from all participants, which included permission for their anonymized data to be used for possible future research purposes. As the study involved the retrospective analysis of routinely collected clinical data with no patient intervention, a formal review by an institutional review board was not required. The reporting of this observational study adhered to the principles outlined in the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines [
28]. Furthermore, the development and reporting of the analytical model followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) recommendations [
29].
2.3. Study Population and Data Collection
The study population comprised all women with viable singleton pregnancies who underwent both a first-trimester (11+0 to 13+6 weeks) and a second-trimester (20+0 to 23+6 weeks) ultrasound scan at our department. Exclusion criteria included multiple gestations, pregnancies with known fetal genetic or major structural anomalies, termination of pregnancy, and miscarriage before 23+6 weeks. A standardized protocol was used to systematically collect clinical data, including maternal demographics, anthropometrics, obstetric history, lifestyle factors, and pre-existing medical conditions, which were recorded in the Astraia electronic database.
2.4. Fetal Medicine Assessment
All second-trimester sonographic assessments were performed by Fetal Medicine Foundation certified sonographers, ensuring high consistency in examination protocols and measurement techniques. These examinations involved a standardized fetal anatomy survey and biometry. Fetal head circumference, abdominal circumference, and femur length were measured to calculate an estimated fetal weight (EFW) using the Hadlock formula [
30]. Transabdominal uterine artery (UtA) Doppler velocimetry was conducted according to the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) guidelines [
31]. The pulsatility index (PI) was measured for both the left and right uterine arteries, and the average of these two values was used for analysis. Gestational age was defined based on the crown–rump length measured during the 11–13 weeks’ scan.
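The two derived ultrasound markers described above can be sketched in a few lines. The text does not state which Hadlock equation was applied; the sketch below assumes the widely cited three-parameter version based on head circumference, abdominal circumference, and femur length (all in centimeters). The function names and the example measurements are illustrative and are not taken from the study.

```python
def hadlock_efw_grams(hc_cm: float, ac_cm: float, fl_cm: float) -> float:
    """Estimated fetal weight in grams from HC, AC and FL (all in cm).

    Assumption: the three-parameter Hadlock (1985) equation; the paper
    does not specify which Hadlock variant was used.
    """
    log10_efw = (
        1.326
        - 0.00326 * ac_cm * fl_cm
        + 0.0107 * hc_cm
        + 0.0438 * ac_cm
        + 0.158 * fl_cm
    )
    return 10 ** log10_efw


def mean_uta_pi(left_pi: float, right_pi: float) -> float:
    """Average of the left and right uterine artery pulsatility indices."""
    return (left_pi + right_pi) / 2.0


# Illustrative mid-gestation measurements (hypothetical, not study data).
efw = hadlock_efw_grams(hc_cm=20.0, ac_cm=17.0, fl_cm=3.7)
uta = mean_uta_pi(left_pi=1.10, right_pi=0.90)
print(f"EFW = {efw:.0f} g, mean UtA-PI = {uta:.2f}")
```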
2.5. Investigated Outcomes
This study investigated the prediction of PTB, which was stratified into its two primary clinical subtypes: spontaneous and iatrogenic. Spontaneous PTB was defined as delivery before 37 weeks of gestation resulting from either spontaneous preterm labor or preterm prelabor rupture of membranes. Iatrogenic PTB was defined as a medically indicated delivery before 37 weeks of gestation, accomplished via induction of labor or a planned cesarean section. To evaluate model performance across varying degrees of prematurity, each subtype was further analyzed at three distinct gestational age thresholds: delivery at <37, <34, and <32 completed weeks. This framework resulted in six discrete binary outcomes, for which independent predictive models were developed and assessed.
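As a minimal sketch of how the six binary outcomes could be derived from a delivery record, the example below crosses the two subtypes with the three gestational age thresholds. The record structure, the string coding of onset, and the choice to label deliveries of the other subtype as negatives are assumptions made for illustration; the text does not specify how the other subtype was handled in each model.

```python
from dataclasses import dataclass

THRESHOLDS_WEEKS = (37, 34, 32)
SUBTYPES = ("spontaneous", "iatrogenic")


@dataclass
class Delivery:
    ga_weeks: float  # gestational age at delivery, in weeks
    onset: str       # "spontaneous" or "iatrogenic" (hypothetical coding)


def outcome_labels(delivery: Delivery) -> dict:
    """Return the six binary outcomes (1 = event) for one delivery."""
    labels = {}
    for subtype in SUBTYPES:
        for threshold in THRESHOLDS_WEEKS:
            is_event = delivery.onset == subtype and delivery.ga_weeks < threshold
            labels[f"{subtype}_lt{threshold}"] = int(is_event)
    return labels


# A medically indicated delivery at 33+4 weeks.
example = outcome_labels(Delivery(ga_weeks=33 + 4 / 7, onset="iatrogenic"))
print(example)
```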
2.6. Statistical Analysis
The initial analysis involved a comprehensive exploratory phase. The distributions of all continuous predictor variables were visually assessed using histograms and density plots to determine their normality. Subsequently, a baseline characteristics table was generated to summarize and compare the study population, stratified by delivery outcome (PTB vs. term delivery). For this comparison, normally distributed continuous variables were presented as mean (standard deviation) and compared using Student’s t-test, while non-normally distributed variables were presented as median [interquartile range] and compared using the Mann–Whitney U test. Categorical variables were presented as frequencies and percentages and were compared using the Chi-squared test or Fisher’s exact test, as appropriate.
2.7. Predictor Variables and Modeling Strategy
In our models, we included the following types of data, selected for their clinical significance and availability for second-trimester prediction of PTB:
Maternal characteristics and history: Maternal age, height, pre-pregnancy weight, use of assisted reproductive technology, smoking status, first trimester bleeding, presence of a cervical cerclage, previous cesarean section, pre-existing diabetes, chronic hypertension, thrombophilia, thyroid disorder, a prior loop electrosurgical excision procedure or a large loop excision of the transformation zone procedure, a history of preterm delivery, and multiparity.
Ultrasound markers: Gestational age at examination determined by crown–rump length (CRL), estimated fetal weight (EFW), mean uterine artery pulsatility index (UtA-PI), suspected vasa previa, and cervical length.
2.8. Feature Selection and Model Development
The following multi-stage process was performed independently for every outcome. A robust, data-driven feature selection process was conducted only on the maternal characteristics and history variables to identify the most predictive and parsimonious core set of predictors for each outcome. This process involved the following:
Variable importance ranking: A Random Forest model was trained on all available maternal history predictors using 5-fold cross-validation and down-sampling to generate a stable, ranked list of variables based on their predictive importance.
Iterative subset evaluation: The algorithm then iteratively built and tested new Random Forest models on incrementally larger subsets of the top-ranked predictors, specifically testing subsets. The cross-validated AUC was recorded for each subset size.
Parsimonious selection: The iterative evaluation revealed that the model using the complete set of maternal history variables achieved the highest AUC score. The analysis showed that no smaller, more parsimonious subset of features achieved a performance within 0.5 standard deviations of this top score. Therefore, to maximize the information available to the models, the decision was made to retain all maternal history and characteristic variables for the final model development.
After this exploratory analysis confirmed the utility of the full maternal dataset, the ultrasound markers were added to the model.
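The parsimony rule described in the feature selection steps can be illustrated as follows. This is a sketch under stated assumptions: the standard deviation is taken to be that of the cross-validated AUC of the best-scoring subset, and the AUC values in the example are invented for illustration and are not study results.

```python
def smallest_adequate_subset(cv_results: dict, tolerance_sd: float = 0.5) -> int:
    """Pick the smallest subset size whose mean AUC is within tolerance.

    cv_results maps subset size -> (mean cross-validated AUC, SD of AUC).
    Assumption: the tolerance is expressed in SDs of the best subset's AUC.
    """
    best_size = max(cv_results, key=lambda size: cv_results[size][0])
    best_mean, best_sd = cv_results[best_size]
    cutoff = best_mean - tolerance_sd * best_sd
    eligible = [size for size, (mean_auc, _) in cv_results.items() if mean_auc >= cutoff]
    return min(eligible)


# Hypothetical cross-validation summary: no smaller subset comes within
# 0.5 SD of the full set, so the full set (15 variables) is retained.
cv_results = {5: (0.580, 0.020), 10: (0.590, 0.020), 15: (0.605, 0.015)}
chosen = smallest_adequate_subset(cv_results)
print(chosen)
```

With a looser tolerance the same rule would favor a smaller subset, which is why the tolerance has to be fixed before the results are inspected.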
Dataset preparation and partitioning: For each outcome, the dataset containing the final selected features was prepared. To handle missing data points, k-Nearest Neighbors (k-NN) imputation (k = 5) was employed. This method was chosen to preserve inter-variable relationships by using the full predictor set to inform its estimates. The complete dataset was then partitioned into a training set (80%) and a testing set (20%) using stratified sampling to maintain the outcome distribution in both sets.
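The imputation and partitioning step can be sketched as below, in the order given in the text (imputation first, then an 80/20 stratified split). This is a simplified illustration rather than the implementation used in the study: distances are computed on the features observed in both rows and assume the features are already on comparable scales, and every missing entry is assumed to have at least one donor row. All names and the toy data are hypothetical.

```python
import math
import random


def knn_impute(rows, k=5):
    """Fill None entries with the mean of that feature in the k nearest rows."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which feature j is observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [donor_value for _, donor_value in donors[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so that each class keeps the same share in both sets."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: two features, two missing values (k = 2 for this small example).
rows = [[1.0, 10.0], [1.1, None], [0.9, 9.0], [5.0, 50.0], [5.2, 52.0], [4.9, None]]
filled = knn_impute(rows, k=2)

labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
print(filled[1], filled[5], len(train_idx), len(test_idx))
```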
2.9. Model Training and Validation
A rigorous internal validation process was conducted on the training set using 10-fold cross-validation repeated 3 times. Within this framework, several key steps were performed:
Data pre-processing: Prior to fitting each model, the continuous predictor variables were centered and scaled. This standardization process ensures that all variables are on a comparable scale.
Class imbalance correction: To address the significant class imbalance inherent in predicting PTB, random undersampling was employed within the cross-validation process. This technique prevented the models from developing a bias towards predicting the majority (term) class by training on a balanced subset of data [
25].
Algorithm training: Following these steps, four distinct classification algorithms were trained:
- Multivariable Logistic Regression: A standard Logistic Regression model was fitted using the main effects of the selected predictors.
- Random Forest: Models were trained with 500 trees, and the number of variables sampled at each split (mtry) was tuned via the cross-validation process.
- eXtreme Gradient Boosting (XGBoost): An automated tuning process optimized a suite of key hyperparameters, including the number of boosting rounds, maximum tree depth, and learning rate.
- Single-hidden-layer Artificial Neural Network: The network’s key hyperparameters, the number of neurons in the hidden layer (size) and the weight decay regularization parameter (decay), were tuned via cross-validation.
By applying these steps inside the cross-validation loops, we ensured that hyperparameter tuning and model training were performed robustly, leading to an unbiased estimate of performance on the final held-out test set.
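The resampling logic described in this section can be sketched as follows: stratified 10-fold cross-validation repeated 3 times, with random undersampling and standardization both fitted on the training portion of each split only. The study ran this through an R modeling framework; the Python below is a language-neutral illustration with hypothetical function names and simulated data, and it stops short of fitting any classifier.

```python
import random
import statistics


def repeated_stratified_folds(labels, n_folds=10, n_repeats=3, seed=1):
    """Yield (train_idx, valid_idx) pairs for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            for position, i in enumerate(idx):
                folds[position % n_folds].append(i)
        for f in range(n_folds):
            train = [i for g in range(n_folds) if g != f for i in folds[g]]
            yield train, folds[f]


def undersample(train_idx, labels, rng):
    """Keep all minority cases and an equal-sized random draw of the majority."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def fit_scaler(values):
    """Mean and SD estimated on training data only."""
    return statistics.fmean(values), statistics.stdev(values)


def apply_scaler(values, mean, sd):
    return [(v - mean) / sd for v in values]


# Simulated data: 10% event rate, one continuous predictor.
labels = [1] * 20 + [0] * 180
rng = random.Random(7)
x = [rng.gauss(0.0, 1.0) + 0.5 * y for y in labels]

splits = list(repeated_stratified_folds(labels))
balanced_sizes = []
for train_idx, valid_idx in splits:
    balanced = undersample(train_idx, labels, rng)
    mean, sd = fit_scaler([x[i] for i in balanced])
    # Validation data are scaled with training statistics and never resampled.
    valid_scaled = apply_scaler([x[i] for i in valid_idx], mean, sd)
    balanced_sizes.append(len(balanced))
print(len(splits), set(balanced_sizes))
```

Keeping the undersampling and scaling inside the loop means the validation fold retains the original class distribution, so the cross-validated estimates reflect the prevalence seen in practice.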
2.10. Model Performance Evaluation
The final, trained models were evaluated on the held-out testing set. Model performance was primarily assessed by the AUC with its 95% Confidence Intervals (CI). For the <37 weeks outcome, the statistical significance of the difference in AUC for the logistic regression and the Random Forest model between the spontaneous and iatrogenic outcomes was evaluated using DeLong’s test. Additional performance metrics calculated included accuracy, sensitivity, specificity, Positive Predictive Value (PPV), and the F1-Score. Performance was also assessed by calculating the sensitivity of each model at a fixed specificity of 80%.
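The point estimates listed above can be computed directly from predicted probabilities and observed outcomes, as in the sketch below. The AUC is obtained from its rank-based (Mann-Whitney) definition, and sensitivity at fixed specificity is taken as the highest sensitivity among thresholds whose specificity is at least the target. Confidence intervals and DeLong's test are not reproduced here. The function names, the 0.5 classification threshold, and the toy scores are assumptions for illustration.

```python
def auc(scores, labels):
    """Probability that a random case outranks a random non-case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_at_specificity(scores, labels, target_spec=0.80):
    """Highest sensitivity over thresholds with specificity >= target_spec."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        # A case is predicted positive when its score is >= threshold.
        spec = sum(n < threshold for n in neg) / len(neg)
        sens = sum(p >= threshold for p in pos) / len(pos)
        if spec >= target_spec:
            best = max(best, sens)
    return best


def confusion_metrics(scores, labels, threshold=0.5):
    """Accuracy, sensitivity, specificity, PPV and F1 at one threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    sens = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": sens,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "f1": 2 * ppv * sens / (ppv + sens),
    }


# Toy predictions (hypothetical, not study data).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
area = auc(scores, labels)
sens80 = sensitivity_at_specificity(scores, labels, target_spec=0.80)
metrics = confusion_metrics(scores, labels)
print(round(area, 3), sens80, metrics)
```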
2.11. Model Interpretation
To enhance model interpretability, the final logistic regression and Random Forest models for the <37 weeks outcome were further analyzed using SHapley Additive exPlanations (SHAP) values [
32]. This technique, originating from cooperative game theory, provides a method to fairly attribute the output of a prediction among the different input features by quantifying the marginal contribution of each predictor to the model’s decision for each individual case. The analysis was implemented in R, primarily utilizing the DALEX package to create model-agnostic explainers for the final, trained models using the held-out test data. Subsequently, the predict_parts function was employed to compute the SHAP values for every prediction within the test set.
The resulting SHAP summary plots aggregate these instance-level values to provide a global understanding of feature importance and impact. In these plots, features are ranked vertically by their overall importance, which is calculated as the mean absolute SHAP value across all test set instances. The length of the horizontal bar for each feature represents the average magnitude of its impact on the model’s output probability. The color of the bar indicates the direction of the effect: green bars represent a positive contribution that increases the predicted risk of PTB, while red bars represent a negative contribution that decreases the predicted risk. Finally, the variability of each feature’s impact across the dataset is represented by the interquartile range of its SHAP values, shown as a dark line overlaying each bar.
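To make the attribution and aggregation steps concrete, the sketch below computes exact Shapley values for a toy two-predictor model by enumerating all feature subsets, then ranks features by mean absolute value. It is an illustration only: exact enumeration is feasible for a handful of features, whereas software packages typically estimate these values by sampling. The model coefficients, the feature meanings (first predictor standing in for mean UtA-PI, second for cervical length), and the background rows are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley attribution of predict(x) relative to the background mean."""
    n = len(x)

    def value(subset):
        # Mean prediction with features in `subset` fixed to x and the
        # remaining features taken from each background row in turn.
        total = 0.0
        for b in background:
            z = [x[j] if j in subset else b[j] for j in range(n)]
            total += predict(z)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_importance(predict, rows, background):
    """Global importance: mean absolute Shapley value per feature."""
    per_row = [shapley_values(predict, row, background) for row in rows]
    n = len(rows[0])
    return [sum(abs(p[j]) for p in per_row) / len(per_row) for j in range(n)]


def predict(z):
    # Hypothetical linear risk score, used only to check the arithmetic.
    return 0.1 + 0.3 * z[0] - 0.2 * z[1]


background = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
phi = shapley_values(predict, [2.0, 1.0], background)
importance = mean_abs_importance(predict, background, background)
print(phi, importance)
```

The attributions for one case sum to the difference between that case's prediction and the average prediction over the background rows, which is the property that lets the summary plots be read as contributions to predicted risk.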
All statistical and ML analyses were conducted using R software (Version 4.3.2 or later), with a fixed random seed set for reproducibility. The primary analysis relied on the caret package for its unified modeling framework, with specific models implemented via the randomForest, xgboost, and nnet packages. Additional key packages included dplyr for data manipulation, pROC for receiver operating characteristic (ROC) curve analysis, tableone for descriptive statistics, and ggplot2 for data visualization.
4. Discussion
4.1. Main Findings
This study yielded four principal findings that underscore the distinct nature of spontaneous and iatrogenic PTB. First, our predictive models demonstrated a significantly more robust and reliable performance in predicting iatrogenic PTB compared to spontaneous PTB across all gestational age thresholds. Second, this predictive accuracy was not static; for both subtypes, the models performed consistently better at identifying the risk for earlier and more clinically severe degrees of prematurity. A third key finding was the comparable performance between traditional Logistic Regression and the more complex machine learning models across most prediction tasks. Finally, our use of interpretable machine learning elucidated the divergent pathophysiological pathways driving these outcomes, providing clear insight into the models’ decision-making. The analysis showed that iatrogenic PTB was primarily driven by markers of placental dysfunction, while the prediction of spontaneous PTB was most influenced by a history of the condition and the presence of a short cervix.
4.2. Interpretation of Our Findings
This work is situated within a broad field of research for PTB prediction, which has reported a wide range of performance metrics [
23]. For instance, recent machine learning models developed on large-scale data that combine PTB subtypes report strong predictive performances, with AUCs in the range of 0.73 to 0.75 [
21,
22]. Notably, our model focusing solely on iatrogenic PTB achieves a comparable AUC of 0.764, suggesting that the stronger predictive signals from indicated PTB may be the primary drivers of performance in those combined models. Therefore, a key distinction of our study is the systematic development and comparison of predictive models for iatrogenic and spontaneous PTB as separate endpoints. This approach addresses a significant gap in the literature by avoiding the common pitfall of grouping these distinct clinical entities. By doing so, we ensure that the unique contribution of each variable is accurately captured for each subtype, preventing a situation where a factor’s opposing effects on spontaneous versus iatrogenic PTB might cancel each other out. This demonstrates that an integrated, subtype-specific approach using routinely collected mid-gestation data is a valuable strategy for PTB prediction.
By effectively combining this readily available clinical information, we developed models with immediate clinical applicability that contribute to the emerging era of precision medicine. Our work demonstrates that a robust predictive framework can be achieved even with standard, interpretable models. Notably, all algorithms tested, from traditional Logistic Regression to more complex machine learning methods, yielded comparable predictive performance. This key finding underscores that the clinical value lies in the subtype-specific modeling strategy itself, which better isolates the distinct drivers of each outcome, rather than in the complexity of the algorithm used. A deeper analysis of these findings reveals critical insights, beginning with the marked difference in predictive power between the two PTB subtypes.
4.3. Differential Performance: Spontaneous vs. Iatrogenic Preterm Birth
A striking finding of our study is the significant disparity in model performance between the two primary subtypes of PTB. For PTB before 37 weeks, all models performed substantially better at predicting iatrogenic deliveries (AUCs up to 0.764) compared to spontaneous deliveries (AUCs up to 0.609). This difference was statistically significant and likely reflects the fundamental etiological distinctions between these two clinical entities. Our findings for spontaneous PTB, for delivery <37 weeks, align with the existing literature that underscores the challenge of its prediction. A systematic review and external validation by Meertens et al. found that even promising models for spontaneous PTB performed moderately, with validated AUCs ranging from 0.54 to 0.67 [
23], a range that includes our model’s performance (AUC 0.609).
Spontaneous PTB is recognized as a complex syndrome with a multifactorial and often elusive etiology [
33]. The difficulty in its prediction stems from significant pathophysiologic and genetic heterogeneities [
25]. It is not a condition initiated by a single cause, but an overarching syndrome that can be triggered by diverse factors, which activate distinct but overlapping molecular pathways [
33,
34,
35]. Furthermore, these disease processes can originate in any number of feto-maternal tissues, including the placenta, fetal membranes, or decidua, each responding uniquely to pro-parturition signals. This complex interplay makes it difficult to establish a universal predictive model, as a precise mechanism cannot be identified in most individual cases [
33,
34,
35]. Conversely, iatrogenic PTB is a medically indicated intervention, frequently prompted by maternal or fetal complications, and these conditions are often preceded by well-defined clinical and sonographic markers of placental insufficiency. Key indications, according to the American College of Obstetricians and Gynecologists guidelines, include hypertensive disorders, such as PE with severe features, and poorly controlled diabetes [
36]. Fetal or placental indications often involve conditions like fetal growth restriction with abnormal umbilical artery Doppler flow, placenta previa, or vasa previa [
36].
This is strongly corroborated by our model’s interpretability analysis using SHAP values. The SHAP analysis for iatrogenic PTB < 37 weeks heavily weighted markers of placental function and obstetric history. In both the Logistic Regression and Random Forest models, an increased mean UtA-PI, a direct measure of placental vascular resistance, was a leading predictor for an increased risk of iatrogenic PTB. Abnormal uterine artery Doppler velocimetry is a well-established indicator of impaired placentation and is strongly associated with adverse pregnancy outcomes like PE and FGR, which are leading causes of indicated PTB [
10]. A history of a previous cesarean section was another powerful risk-increasing factor in both models. This finding aligns with its established role as a risk factor for subsequent placental abnormalities and as a marker for prior obstetric complications, which carry a significant risk of recurrence [
37].
In contrast, the prediction of spontaneous PTB < 37 weeks was predominantly driven by factors reflecting a predisposition to preterm labor and cervical incompetence. The SHAP analyses confirmed this, identifying a history of PTB and a reduced cervical length as the most powerful predictors of spontaneous PTB across both models. A history of PTB is a well-documented and significant risk factor, while cervical length is a cornerstone of modern PTB screening [
38]. A shortened cervix is a clear biophysical marker of the premature initiation of the parturition process, and its high importance in the models, particularly the Random Forest model, where it was the top feature, underscores its clinical significance [
38].
A noteworthy finding from the SHAP analysis is the differing interpretation of key variables between the Logistic Regression and Random Forest models, particularly for iatrogenic PTB. For instance, while the Logistic Regression model identified increased maternal age as a risk-increasing factor, the Random Forest model found it to be the most powerful predictor for decreasing risk. This apparent contradiction likely stems from the fundamental differences between the algorithms. Logistic Regression models a linear relationship on the log-odds scale, estimating a single average effect for each variable after adjustment for the others. In contrast, Random Forest, a tree-based ensemble, can capture complex, non-linear interactions. The model may have learned that while older age is correlated with certain risk factors, in the absence of those factors (e.g., no history of cesarean section, normal UtA-PI), it is associated with a lower risk of iatrogenic intervention. This ability to model such nuanced, conditional relationships highlights a key advantage of non-linear models in capturing the intricate interplay of clinical variables.
4.4. Performance Across Gestational Age Cutoffs
Another important observation was the consistent improvement in model performance for earlier gestational age cutoffs. This trend was particularly pronounced for iatrogenic PTB, where the predictive accuracy increased substantially with the severity of prematurity. The best-performing model for iatrogenic delivery <37 weeks achieved an AUC of 0.764 (Random Forest), which rose to 0.806 for delivery <34 weeks (Random Forest), and peaked at an AUC of 0.862 for delivery <32 weeks (Neural Network) (Table 3, Table 5, and Table 7). This strong positive correlation between predictive power and degree of prematurity suggests that the underlying pathological processes leading to very early indicated deliveries are more pronounced and, therefore, more readily detectable by the variables in our models. Severe, early-onset placental dysfunction, for example, is likely to manifest with more extreme deviations in uterine artery Doppler indices and fetal growth parameters, thereby providing a stronger predictive signal.
A similar, albeit less dramatic, trend was observed for spontaneous PTB. The predictive accuracy for spontaneous delivery <37 weeks was modest, with the best model achieving an AUC of 0.609 (Neural Network). However, performance improved to an AUC of 0.678 for delivery <34 weeks (Random Forest) and reached an AUC of 0.749 for the <32 weeks outcome (Random Forest). This progression suggests that while late spontaneous PTBs are difficult to distinguish from term births using mid-gestation data, the most severe cases of spontaneous PTB are better captured by our model’s variables. It is in these earliest predictions that machine learning models appeared to offer a slight advantage. For instance, in predicting spontaneous PTB < 32 weeks, the Random Forest model achieved a higher AUC (0.749) than Logistic Regression (0.685). This may suggest that more complex algorithms have a potential edge in capturing the non-linear patterns and complex associations that are more pronounced in extreme-risk pregnancies, where predictive precision is most vital.
This finding is also supported by previous studies in the area, which show that predictive performance improves as the gestational age threshold decreases [
12,
23]. The improved accuracy for the earliest and most clinically significant deliveries is a key strength of our models, as these are the cases associated with the highest rates of neonatal morbidity and mortality. This is crucial because the risks for adverse outcomes directly increase with the degree of prematurity [
10]. Globally, approximately 15% of all preterm births, around 2 million babies in 2020, occurred before 32 weeks of gestation and required more intensive neonatal care [
2]. Better prediction for these earliest deliveries is paramount, given that direct complications from PTB were the leading cause of child mortality in 2019 [
2].
4.5. Clinical Implications
The development of robust, accessible, and subtype-specific predictive models has significant clinical implications. By integrating multiple data points, these models provide a personalized risk assessment that surpasses the limitations of relying on single risk factors. This distinction is crucial, as the management strategies for women at high risk for each subtype differ substantially. A high predicted risk for spontaneous PTB may prompt interventions like progesterone therapy or cervical cerclage [
13], while a high risk for iatrogenic PTB necessitates intensified surveillance for conditions like PE and FGR.
Furthermore, accurate risk stratification is essential for optimizing perinatal outcomes. Identifying women at high risk for PTB allows for the timely administration of antenatal corticosteroids to promote fetal lung maturity and magnesium sulfate for neuroprotection, interventions proven to reduce neonatal morbidity [
16,
17]. This early risk stratification prompts intensified clinical surveillance, which in turn ensures that these time-sensitive treatments are administered without delay when the clinical signs of imminent delivery emerge. Beyond individual patient care, these predictive tools have important logistical benefits [
18]. Reliable prediction can aid in healthcare resource planning, ensuring the availability of neonatal intensive care unit beds and specialized staff, and facilitating the timely transfer of at-risk mothers to tertiary care centers [
18]. Our models are built on a set of non-invasive variables already collected during routine prenatal care. This provides a foundation for the potential development of easy-to-use, low-cost clinical decision support tools. If externally validated, such tools could be integrated into everyday practice to help refine risk stratification and support a more personalized approach to perinatal medicine.
4.6. Strengths and Limitations
The primary strengths of this study include its novel approach of developing and validating separate predictive models for spontaneous and iatrogenic PTB, its large cohort size, the use of prospectively collected data within a standardized academic protocol, and a robust methodological approach comparing several algorithms, including both traditional regression and machine learning methods, with rigorous internal validation. Furthermore, the use of SHAP for model interpretation provides valuable insights into the clinical drivers of prediction, enhancing the transparency and trustworthiness of the models.
However, the study has several limitations. The retrospective design, although based on prospectively collected data, is one such limitation. The single-center nature of the study may affect the generalizability of our findings to other populations. A significant limitation is the potential for intervention bias. In our cohort, women at high risk for PE received aspirin, and those with a short cervix were treated with progesterone. These interventions may have prevented or delayed some PTBs, potentially leading to an underestimation of our models’ predictive performance in an untreated population. Finally, while our models were robustly validated internally, they have not yet undergone external validation in an independent dataset. This is a significant limitation, as the generalizability of the findings is uncertain until performance is confirmed in different clinical populations. External validation is a crucial next step before widespread clinical implementation can be recommended.