1. Introduction
Preeclampsia (PE) remains a major contributor to maternal and perinatal morbidity, and early identification of women at elevated risk is central to prevention strategies. Contemporary first-trimester screening aims to identify high-risk patients early enough to implement risk-reducing interventions, with the Fetal Medicine Foundation (FMF) approach now widely used worldwide.
The FMF competing-risks framework generates patient-specific risk by combining a prior distribution derived from maternal characteristics with likelihood functions from biophysical and biochemical markers using Bayes’ theorem, thus producing individualized posterior risks for preeclampsia before specified gestational-age thresholds [1]. In the original large cohort of 35,948 singleton pregnancies, combined screening with maternal factors, mean arterial pressure (MAP), uterine artery pulsatility index (UtA-PI) and placental growth factor (PlGF) predicted 75% (95% CI 70–80%) of preterm PE (<37 weeks) and 47% (95% CI 44–51%) of term PE, at a fixed 10% false-positive rate (FPR), whereas maternal factors alone detected 49% (95% CI 43–55%) of preterm and 38% (95% CI 34–41%) of term PE [2].
Subsequent multicenter validation in eight units in Spain (10,110 pregnancies) reported a detection rate of 72.7% (95% CI 62.9–82.6%) for preterm PE at a 10% screen-positive rate when combining maternal characteristics, MAP, UtA-PI and PlGF [3,4]. External validations in Australia (29,609 pregnancies) and China (10,899 pregnancies) found that combined screening using maternal factors, MAP, UtA-PI and biochemical markers detected 71% (95% CI 59–84%) and 65.0% of preterm PE at 10% FPR, respectively, with detection increasing to 72.7% and 76.1% as FPR rose to 15% and 20% in the Chinese cohort [5,6]. Consistent with these data, review articles summarize that the FMF triple test achieves detection rates of about 90% for early PE (<34 weeks) and 75% for preterm PE (<37 weeks) at 10% FPR, markedly outperforming approaches based on maternal risk factors alone [7,8,9].
Machine learning (ML) methods, especially ensemble techniques such as random forest (RF) and extreme gradient boosting (XGBoost), have shown strong predictive performance for preeclampsia, often achieving higher area under the curve (AUC) values than classical regression models. Systematic reviews report that random forest and XGBoost can reach AUCs up to 0.94 and 0.92, respectively, outperforming many traditional models, although heterogeneity across studies and limited external validations remain challenging [10,11,12]. In a large validation cohort of 12,549 pregnancies, XGBoost and logistic regression showed similar discrimination with AUCs of 0.75 (95% CI 0.73–0.78) and 0.76 (95% CI 0.74–0.78), respectively, while random forest had a slightly lower AUC of 0.71 (95% CI 0.69–0.74). Logistic regression also demonstrated superior calibration compared to the ML models [13].
Comparisons between ML models and the FMF competing-risks model are complicated by differences in biomarker quality, population-specific marker distributions, and assay harmonization. An ML model using the same predictors as FMF (maternal factors, MAP, UtA-PI, PlGF) initially performed significantly worse than FMF until retrained with adjustments for PlGF analyzer differences. After retraining, AUCs were comparable for preterm and term PE prediction but remained inferior for any PE [14]. Overall, while ML ensemble methods show promise with high discrimination metrics, simpler models like logistic regression or well-calibrated competing-risks models remain critical for clinical deployment due to their better calibration and interpretability.
In Romania and other countries where the completeness of biochemical markers can be limited, it remains unclear whether commonly deployed ML classifiers trained on locally available data can match the discriminative and clinical-utility performance of FMF risk estimates and whether ML may offer advantages when some biomarkers (notably PlGF) are missing or sparse. We therefore conducted a comparison of ML classifiers (logistic regression, random forest, XGBoost) against FMF a priori and a posteriori risks in a Romanian first-trimester PE screening cohort.
2. Materials and Methods
2.1. Study Design and Population
This retrospective analysis used a database of pregnancies screened at 11–14 weeks’ gestation in a single center. The primary cohort comprised 1583 screened pregnancies, and the prespecified primary analysis excluded aspirin-treated patients (n = 211) to mitigate treatment-related outcome modification because aspirin prophylaxis can influence downstream disease incidence and therefore bias apparent screening performance metrics in “screen-and-treat” settings. A secondary analysis repeated the evaluation, including aspirin-treated pregnancies.
Preeclampsia was defined as the presence of a recorded diagnosis or onset information in any PE-specific clinical field (PE status, gestational age at onset, or recorded preeclampsia diagnosis). The diagnosis was coded as binary.
2.2. FMF Risk Outputs
FMF risk estimates were calculated as (i) prior (a priori) risk and (ii) posterior (a posteriori) risk, reflecting the FMF Bayes-theorem approach that combines a prior distribution from maternal factors with likelihoods from biomarkers to produce patient-specific posterior risks.
2.3. Predictors and Feature Sets
Two feature sets were evaluated, designed to align conceptually with FMF’s a priori versus a posteriori structure:
- A priori (maternal factors): maternal age, body mass index (BMI), previous pregnancies, method of conception, smoking, diabetes, chronic hypertension, maternal family history of PE, and multiparity.
- A posteriori (maternal factors + biomarkers): a priori predictors plus MAP (mean arterial pressure), UtA-PI, and PAPP-A MoM.
2.4. Missing Data Handling
Missingness was quantified for all biomarkers and FMF-derived risk variables. The primary machine-learning analyses used a complete-case approach. To assess the potential impact of missingness, participants included in the complete-case a posteriori analyses were compared with excluded participants using Mann–Whitney U tests for continuous variables and chi-square or Fisher exact tests for categorical variables, as appropriate.
As a sensitivity analysis, multiple imputation using iterative imputation with posterior sampling was performed for retained predictors. Outcome status and FMF risk estimates were not imputed.
2.5. Machine-Learning Models
We trained three ML classifiers for PE prediction (logistic regression, RF, and XGBoost) using Python (version 3.12.13, Python Software Foundation, Beaverton, OR, USA). These algorithms were chosen to represent complementary modeling approaches commonly used in clinical prediction research and to allow comparison between an interpretable linear method and more flexible non-linear ensemble methods.
Logistic regression was fitted with a maximum of 5000 iterations. Random forest models were trained using 500 trees, square-root feature selection at each split, a minimum leaf size of 5, and class weighting to account for the low event rate. XGBoost models were trained with 200 estimators, a maximum tree depth of 4, a learning rate of 0.1, subsampling of 0.8, column sampling of 0.8, a minimum child weight of 5, and a scale-pos-weight parameter calculated from the non-event to event ratio. Because the number of PE events was low relative to the number of predictors, which increases the risk of overfitting and optimistic performance estimation even with repeated cross-validation, the machine-learning analyses were considered exploratory and were interpreted cautiously.
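The settings above can be written down as plain parameter dictionaries. The sketch below is illustrative only: the keys follow the argument names commonly used in scikit-learn and XGBoost, while `class_weight="balanced"` and `colsample_bytree` are assumptions, because the text states only "class weighting" and "column sampling" without naming the exact options. The event counts in the usage lines are the aspirin-excluded a priori cohort reported in the Results.

```python
# Hyperparameters as described in the text, expressed with the argument names
# used by scikit-learn and XGBoost. Items marked "assumed" are not in the source.
LOGISTIC_REGRESSION = {"max_iter": 5000}

RANDOM_FOREST = {
    "n_estimators": 500,
    "max_features": "sqrt",
    "min_samples_leaf": 5,
    "class_weight": "balanced",  # assumed: the text says only "class weighting"
}


def scale_pos_weight(n_events: int, n_total: int) -> float:
    """Return the non-event to event ratio used to up-weight the rare class."""
    if not 0 < n_events < n_total:
        raise ValueError("need at least one event and one non-event")
    return (n_total - n_events) / n_events


def xgboost_params(n_events: int, n_total: int) -> dict:
    """Build the XGBoost settings for a cohort with the given event count."""
    return {
        "n_estimators": 200,
        "max_depth": 4,
        "learning_rate": 0.1,
        "subsample": 0.8,
        "colsample_bytree": 0.8,  # assumed: per-tree column sampling
        "min_child_weight": 5,
        "scale_pos_weight": scale_pos_weight(n_events, n_total),
    }


# 12 PE events among 1372 pregnancies (aspirin-excluded a priori cohort)
params = xgboost_params(n_events=12, n_total=1372)
print(round(params["scale_pos_weight"], 1))
```

In a cross-validated workflow the ratio would normally be recomputed within each training fold; the single cohort-level value here is only meant to show the order of magnitude of the weighting.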
2.6. Cross-Validation and Prediction Generation
We used stratified repeated k-fold cross-validation (5 folds × 10 repeats) to reduce variance in performance estimates. For each ML model, we generated out-of-fold predicted probabilities across all repeats and aggregated them per observation to produce a single cross-validated predicted probability per individual.
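The aggregation step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the study code: `stratified_folds`, `repeated_oof_predictions`, and the stand-in learner `prevalence_model` are hypothetical names, and the toy data are synthetic. Each observation is held out exactly once per repeat, so its final value is the mean of 10 out-of-fold predictions.

```python
import random


def stratified_folds(y, k, rng):
    """Assign each index to one of k folds, spreading each class evenly."""
    fold_of = [0] * len(y)
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of


def repeated_oof_predictions(X, y, fit_predict, k=5, repeats=10, seed=0):
    """Average out-of-fold predictions over repeated stratified k-fold runs."""
    rng = random.Random(seed)
    sums = [0.0] * len(y)
    for _ in range(repeats):
        fold_of = stratified_folds(y, k, rng)
        for fold in range(k):
            train = [i for i in range(len(y)) if fold_of[i] != fold]
            test = [i for i in range(len(y)) if fold_of[i] == fold]
            preds = fit_predict(
                [X[i] for i in train], [y[i] for i in train], [X[i] for i in test]
            )
            for i, p in zip(test, preds):
                sums[i] += p
    # every observation is predicted once per repeat
    return [s / repeats for s in sums]


def prevalence_model(X_train, y_train, X_test):
    """Stand-in learner: predicts the training-fold event rate for everyone."""
    rate = sum(y_train) / len(y_train)
    return [rate] * len(X_test)


# Synthetic toy data: 10 events among 100 observations
y = [1] * 10 + [0] * 90
X = [[float(i)] for i in range(100)]
oof = repeated_oof_predictions(X, y, prevalence_model)
print(len(oof), round(oof[0], 3))
```

Any real classifier can be substituted for the stand-in, provided it is refitted inside each training fold so that no held-out observation influences its own prediction.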
2.7. Performance Measures
We evaluated discrimination using AUC-ROC and compared AUCs between ML and FMF using DeLong tests. Screening performance was summarized using sensitivity at 90% specificity (equivalently 10% FPR), reflecting common reporting in FMF validations and the clinical context of fixed screen-positive rates. We additionally examined sensitivity, specificity, PPV, and NPV at a clinically interpretable FMF threshold (1:100), consistent with the literature.
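As an illustration of the two headline measures, the sketch below computes the AUC as a rank statistic and the detection rate at a fixed false-positive rate. The function names and the toy scores are hypothetical, and the DeLong comparison is not reproduced here. The cutoff convention (screen positive when the score is strictly above the cutoff) is an assumption chosen so that the false-positive rate never exceeds the target.

```python
def auc(scores, labels):
    """AUC as the probability that an event outranks a non-event (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_at_fpr(scores, labels, max_fpr=0.10):
    """Detection rate at the lowest cutoff whose false-positive rate is <= max_fpr."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = sorted((s for s, l in zip(scores, labels) if l == 0), reverse=True)
    allowed = int(max_fpr * len(neg))  # non-events permitted above the cutoff
    cutoff = neg[allowed] if allowed < len(neg) else float("-inf")
    # screen positive means a score strictly above the cutoff
    return sum(p > cutoff for p in pos) / len(pos)


# Synthetic toy data: 20 non-events and 4 events
neg_scores = [i / 100 for i in range(1, 21)]
pos_scores = [0.25, 0.195, 0.045, 0.30]
scores = neg_scores + pos_scores
labels = [0] * 20 + [1] * 4

print(auc(scores, labels))
print(sensitivity_at_fpr(scores, labels, max_fpr=0.10))
```

With very few events the detection rate moves in large steps (one case out of seven is about 14 percentage points), which is worth keeping in mind when reading the detection rates reported below.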
2.8. Calibration and Decision-Curve Analysis
We assessed calibration using decile-based calibration plots comparing predicted risk to observed outcome frequency. Clinical net benefit was evaluated via decision-curve analysis (DCA).
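A decile-based calibration table of the kind described above can be sketched as follows. The function name `calibration_by_decile` and the simulated risks are hypothetical; the example only shows the grouping logic, in which observations are ordered by predicted risk, split into ten equal-sized groups, and the mean predicted risk in each group is set against the observed event rate.

```python
import random


def calibration_by_decile(pred, obs, n_bins=10):
    """Return (mean predicted risk, observed event rate, n) per risk-ordered bin."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    rows = []
    for b in range(n_bins):
        idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        if not idx:
            continue
        mean_pred = sum(pred[i] for i in idx) / len(idx)
        obs_rate = sum(obs[i] for i in idx) / len(idx)
        rows.append((mean_pred, obs_rate, len(idx)))
    return rows


# Simulated, well-calibrated risks between 0.1% and 5% (synthetic data)
rng = random.Random(0)
pred = [rng.uniform(0.001, 0.05) for _ in range(1000)]
obs = [1 if rng.random() < p else 0 for p in pred]

rows = calibration_by_decile(pred, obs)
for mean_pred, obs_rate, n in rows:
    print(f"predicted {mean_pred:.4f}  observed {obs_rate:.4f}  n={n}")
```

When the number of events is smaller than the number of bins, several deciles necessarily contain no events, so the resulting curve is coarse and should be read qualitatively.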
2.9. Feature Attribution
For the a posteriori XGBoost model, we computed SHAP values to quantify feature contributions to individual predictions and summarize global importance.
2.10. Sensitivity Analysis with Aspirin
We repeated the primary analyses, including aspirin-treated pregnancies, because aspirin can alter PE incidence and influence apparent screening performance.
2.11. Sensitivity Analysis with Random Undersampling
Because our primary cohort had low PE prevalence, we conducted a random undersampling analysis to test whether class imbalance was responsible for ML underperformance. In each setting (a priori and a posteriori), we undersampled the majority class (non-PE) to achieve PE:non-PE ratios of 1:5 and 1:10. Each undersampling ratio was repeated 10 times using different random seeds, and model performance was summarized by the median AUC and interquartile range (IQR) across repetitions. Cross-validation splits were adapted to ensure feasibility given the reduced number of PE cases.
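The sampling scheme can be sketched with the standard library alone. The helper names `undersample` and `median_iqr` are hypothetical, the label vector simply mirrors the aspirin-excluded cohort size (12 events, 1360 non-events), and the quartile convention (`method="inclusive"`) is an assumption, since the text does not state which one was used.

```python
import random
import statistics


def undersample(labels, ratio, seed):
    """Keep every event plus `ratio` randomly drawn non-events per event."""
    events = [i for i, l in enumerate(labels) if l == 1]
    non_events = [i for i, l in enumerate(labels) if l == 0]
    n_keep = min(len(non_events), ratio * len(events))
    rng = random.Random(seed)
    return sorted(events + rng.sample(non_events, n_keep))


def median_iqr(values):
    """Median and interquartile range (assumed 'inclusive' quartile method)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


labels = [1] * 12 + [0] * 1360
for ratio in (5, 10):
    subsets = [undersample(labels, ratio, seed) for seed in range(10)]
    print(f"1:{ratio} ->", len(subsets[0]), "observations per repetition")
```

Each of the ten index subsets would then be passed through the same cross-validated modelling step, and the ten resulting AUCs summarized with `median_iqr`.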
2.12. Multiple Imputation Sensitivity Analysis
Multiple imputation was performed for the predictors retained in the models, using iterative imputation with posterior sampling; outcome status and FMF risk estimates were not imputed. Twenty imputed datasets were generated. For each imputed dataset, the same exploratory machine-learning modeling strategy was repeated, and AUC estimates were summarized using the mean, standard deviation, median, and interquartile range across imputations.
3. Results
3.1. Cohort Characteristics
The full cohort comprised 1583 singleton pregnancies screened at 11–14 weeks of gestation, of which 15 developed preeclampsia (0.95%). After exclusion of aspirin-treated pregnancies, 1372 pregnancies remained, with 12 PE events (0.87%). The a posteriori complete-case cohort included 901 pregnancies with 7 PE events (0.78%).
Patients who developed preeclampsia had higher body mass index (median 24.7 vs. 22.1 kg/m², p = 0.004), higher mean arterial pressure (median 93.4 vs. 88.7 mmHg, p = 0.016), and were more likely to have chronic hypertension (25.0% vs. 1.5%, p < 0.001). Uterine artery pulsatility index tended to be higher (median 1.9 vs. 1.6, p = 0.08), and PAPP-A levels tended to be lower (median 0.5 vs. 1.0 MoM, p = 0.09) in affected pregnancies, though neither reached conventional statistical significance (Table 1).
3.2. Missing Data
Table 2 compares participants included in the a posteriori complete-case analysis with those excluded because of missing biomarker or predictor data. Overall, 901 participants with complete data were included in the final a posteriori analysis, whereas 471 participants were excluded due to incomplete information.
The prevalence of preeclampsia did not differ significantly between included and excluded participants (7/901 [0.8%] vs. 5/471 [1.1%], p = 0.8163), suggesting that exclusion due to missing data did not substantially alter the distribution of the primary outcome.
However, several maternal and biomarker-related variables differed significantly between groups. Compared with included participants, excluded participants were slightly younger (median age 29.5 vs. 30.2 years, p = 0.0427), had lower mean arterial pressure (87.84 vs. 89.00 mmHg, p = 0.0305), higher uterine artery pulsatility index values (1.66 vs. 1.57, p = 0.0014), and lower PAPP-A concentrations (0.66 vs. 0.97 MoM, p = 0.0275). In addition, both prior FMF risk estimates (p = 0.0010) and posterior FMF risk estimates (p < 0.001) differed significantly between included and excluded participants.
3.3. A Priori and a Posteriori Model Comparison
Using maternal characteristics alone (age, BMI, parity, conception method, smoking, diabetes, chronic hypertension, family history of preeclampsia, and twin gestation), logistic regression was the best-performing ML algorithm, with a cross-validated AUC of 0.80 ± 0.11 (Table 3). This was numerically lower than the FMF prior risk model (AUC 0.84), though the difference was not statistically significant by the DeLong test (p = 0.34; Figure 1). At a fixed 10% false-positive rate, the FMF model detected 50% of preeclampsia cases compared with 33% for the ML model.
When mean arterial pressure, uterine artery pulsatility index, and PAPP-A were added to the maternal factors, the best-performing ML algorithm was random forest (cross-validated AUC 0.85 ± 0.14; Table 3, Figure 2). The FMF posterior model achieved an AUC of 0.93, but the difference from the ML model did not reach statistical significance (DeLong p = 0.087; bootstrap 95% CI: ML 0.70–0.97 vs. FMF 0.88–0.98). At a 10% false-positive rate, detection rates were 57% for the ML model and 71% for FMF.
3.4. Incremental Biomarker Analysis
Incremental addition of biomarkers to the XGBoost model demonstrated progressive improvement in discrimination: maternal factors alone yielded an AUC of 0.62, which increased to 0.73 with the addition of MAP, and to 0.81 with further addition of uterine artery pulsatility index. The subsequent addition of PAPP-A did not improve performance (AUC 0.79; Figure 3).
3.5. Feature Importance and SHAP Analysis
Model interpretability was assessed using SHAP for the a posteriori XGBoost model to characterize which predictors contributed most to the model output. Figure 4 provides a SHAP summary visualization, and Figure 5 presents the corresponding feature importance visualization for the XGBoost model, supporting inference about which predictors the model relied upon most strongly in the a posteriori feature space.
3.6. Calibration
Calibration was examined by comparing observed PE incidence against predicted risk across deciles of predicted risk for both the best ML model and the FMF posterior risk estimate. Figure 6 displays the calibration curves. Prior evidence indicates that competing-risks/FMF risk predictions may require recalibration to improve agreement between predicted and observed risks in external cohorts.
3.7. Classification at the 1:100 Clinical Threshold and Reclassification
At the clinically used 1:100 risk threshold, the a posteriori FMF model demonstrated a favorable balance of sensitivity (71.4%) and specificity (91.7%), with a screen-positive rate of 8.8%. By contrast, the a posteriori ML model (random forest) achieved 100% sensitivity but at a specificity of 18.0% and a screen-positive rate of 82.1% (Table 4).
At a risk threshold of 1:100, the categorical NRI for the a priori model was +0.11, driven by upward reclassification of events (NRI-events +0.25) that was partially offset by adverse reclassification of non-events (NRI-non-events −0.13; Table 5). The integrated discrimination improvement (IDI) was negative (−0.010), indicating the FMF model maintained superior overall discrimination. For the a posteriori model, the categorical NRI was −0.45, reflecting excessive upward reclassification of non-events by the ML model (NRI-non-events −0.74), though the IDI was positive (+0.09).
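With a single cutoff there are only two risk categories, so the categorical NRI components reduce to differences in sensitivity and specificity between the two models. The sketch below applies that identity to the a posteriori sensitivities and specificities reported at the 1:100 threshold, taking FMF as the reference and the ML model as the comparator. The function name is hypothetical, and the NRI-events value it prints is derived here from the reported percentages rather than quoted from the tables.

```python
def categorical_nri(sens_ref, spec_ref, sens_new, spec_new):
    """Two-category NRI of a new model against a reference at one cutoff.

    Events:     net proportion moved above the cutoff = sens_new - sens_ref
    Non-events: net proportion moved below the cutoff = spec_new - spec_ref
    """
    nri_events = sens_new - sens_ref
    nri_non_events = spec_new - spec_ref
    return nri_events, nri_non_events, nri_events + nri_non_events


# A posteriori setting at 1:100: FMF (reference) vs. random forest (new)
ev, non_ev, total = categorical_nri(
    sens_ref=0.714, spec_ref=0.917, sens_new=1.000, spec_new=0.180
)
print(f"NRI-events {ev:+.2f}, NRI-non-events {non_ev:+.2f}, NRI {total:+.2f}")
```

The gain among the few events is outweighed by the loss among non-events, which reproduces the reported overall NRI of about −0.45.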
3.8. Sensitivity Analysis with Aspirin-Treated Pregnancies Included
Inclusion of the 211 aspirin-treated women did not significantly alter the findings. The a priori ML model (logistic regression) achieved an AUC of 0.79 vs. FMF 0.83 (DeLong p = 0.46); the a posteriori ML model (logistic regression) achieved an AUC of 0.87 vs. FMF 0.90 (DeLong p = 0.44).
Detection rates at 10% FPR were identical between ML and FMF in both scenarios (46.7% a priori; 66.7% a posteriori), suggesting that the inclusion of aspirin-treated patients, in whom the intervention may have attenuated the preeclampsia phenotype, did not introduce meaningful bias.
Figure 7 and Figure 8 show ROC comparisons in a priori and a posteriori settings with aspirin-treated pregnancies included.
3.9. Classification at the 1:100 Clinical Threshold
In the cohort that included aspirin-treated pregnancies, at the 1:100 threshold, the FMF posterior model and logistic regression showed similar sensitivity (77.8%), but the FMF model achieved higher specificity (85.5% vs. 82.1%) and a better overall balance of performance, as reflected by a higher F1-score (Table 6).
In the a priori setting, logistic regression achieved the highest sensitivity (80.0%), outperforming the FMF prior model (40.0%), but at the cost of lower specificity. The FMF model demonstrated higher specificity (91.3%) and a slightly higher F1-score, reflecting fewer false positives (Table 6).
3.10. Sensitivity Analysis with Random Undersampling
To interrogate the influence of severe class imbalance on ML performance, random undersampling analyses were conducted at PE:non-PE ratios of 1:5 and 1:10, with repeated sampling and evaluation to provide stable median AUC estimates (Table 7). The results were consistent with the primary analysis: the FMF model consistently outperformed the ML models. At 1:5 undersampling, the a posteriori FMF AUC was 0.94 (IQR 0.92–0.97) compared with 0.83 for the best ML model (logistic regression).
The distributional effects of undersampling on AUC are visualized in Figure 9, with a compact comparison across sampling strategies shown in Figure 10.
3.11. Multiple Imputation Sensitivity Analysis
To evaluate the potential impact of missing predictor data, a multiple imputation sensitivity analysis was performed using the full aspirin-excluded cohort of 1372 participants, including all 12 PE cases (Table 8).
In the a priori analysis, the FMF prior risk model achieved the highest discriminatory performance, with a mean AUC of 0.8409. Among machine learning models, logistic regression showed the best and most stable performance (mean AUC 0.789), followed by random forest (0.752), whereas XGBoost demonstrated substantially lower performance (0.621). Thus, no clear evidence of superiority of the exploratory ML models over the FMF model was observed.
Similarly, in the a posteriori analysis, the FMF posterior risk model retained the highest discrimination (mean AUC 0.8590). Logistic regression and random forest showed comparable performance (mean AUC 0.7607 and 0.7578, respectively), while XGBoost again demonstrated lower and less stable performance (0.6997). Variability across imputations was greater in the a posteriori models, particularly for XGBoost, suggesting reduced model stability when biomarker information was imputed.
Overall, the multiple imputation sensitivity analysis yielded findings consistent with the primary complete-case analyses, with FMF risk models maintaining higher discriminatory performance than the exploratory machine learning approaches.
These results suggest that the main conclusions were not substantially altered by missing biomarker data. However, given the extremely low number of PE cases (n = 12), the substantial degree of biomarker missingness, and the observed variability across imputations, all machine learning findings should be interpreted cautiously and considered exploratory rather than definitive evidence of comparative model performance.
3.12. Decision Curve Analysis
To assess the clinical utility of each screening strategy beyond discrimination alone, we performed decision curve analysis across a range of clinically relevant risk thresholds (0.5–5%; Table 9 and Figure 11). Both the ML and FMF models provided positive net benefit across most thresholds and consistently outperformed a “treat all” strategy, confirming that risk-stratified screening is preferable to universal prophylaxis in this low-prevalence population.
However, the FMF algorithm demonstrated superior net benefit in most clinically actionable scenarios. In the a posteriori analysis without aspirin, the setting most relevant to contemporary first-trimester screening, the FMF model exceeded the ML model at every threshold examined (0.5–5%), and the ML model’s net benefit fell below zero at thresholds ≥ 1.0%, indicating that at these thresholds the ML model would cause net harm relative to a strategy of treating no one. This finding is consistent with the calibration deficit observed for the random forest model, whose excessive screen-positive rate (82.1% at the 1:100 threshold) translates directly into an unfavorable balance of true and false positives.
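The link between the screen-positive rate and net benefit can be checked with a short calculation. Net benefit at threshold probability p_t is TP/n − (FP/n) × p_t/(1 − p_t). In the sketch below the true- and false-positive counts are not taken from the tables: they are approximate values back-calculated from the reported a posteriori sensitivities, specificities, and cohort size (901 pregnancies, 7 events), so the resulting figures are indicative only.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit = TP/n - FP/n * threshold / (1 - threshold)."""
    return tp / n - (fp / n) * threshold / (1 - threshold)


N, EVENTS = 901, 7
NON_EVENTS = N - EVENTS  # 894

# Approximate counts back-calculated from reported percentages (assumption):
# FMF: sensitivity 71.4% -> 5 of 7 events; specificity 91.7% -> about 74 false positives
# RF:  sensitivity 100%  -> 7 of 7 events; specificity 18.0% -> about 733 false positives
FMF_TP, FMF_FP = 5, 74
RF_TP, RF_FP = 7, 733

threshold = 0.01  # the 1:100 cutoff
nb_fmf = net_benefit(FMF_TP, FMF_FP, N, threshold)
nb_rf = net_benefit(RF_TP, RF_FP, N, threshold)
nb_treat_all = net_benefit(EVENTS, NON_EVENTS, N, threshold)

print(f"FMF {nb_fmf:.5f}  RF {nb_rf:.5f}  treat-all {nb_treat_all:.5f}")
```

Under these assumed counts the random forest captures every event, yet its several hundred false positives push its net benefit slightly below zero at the 1% threshold, while it still remains above the treat-all strategy.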
In the a priori (maternal factors only) analysis, the ML model showed a modest advantage over FMF at lower thresholds (0.5–1.0%), where the cost of missing a case is weighted heavily relative to the cost of unnecessary treatment (Figure 11). At higher thresholds (≥2%), where clinicians accept fewer false positives per detected case, the FMF model was consistently superior. This pattern suggests that logistic regression captured slightly more low-risk true positives than the FMF prior model, but at the expense of substantially more false-positive classifications, a trade-off that becomes unfavorable as the threshold rises.
The sensitivity analysis including aspirin-treated women did not materially change the pattern. In the a posteriori aspirin-inclusive analysis, the ML model showed marginally higher net benefit at intermediate thresholds (2–3%), but the absolute differences were small (net benefit difference < 0.001) and the FMF model remained superior at the most commonly used clinical thresholds of 0.5–1.0%. At the 1:100 (1%) threshold, the FMF model provided greater net benefit than ML in three of the four analysis scenarios (a priori and a posteriori, each with and without aspirin); the exception was the a priori analysis without aspirin.
4. Discussion
In the present study, the FMF competing-risks algorithm showed consistently higher discrimination than three ML classifiers in both screening scenarios that we evaluated, although the differences did not reach statistical significance. In the primary aspirin-excluded cohort, the best-performing ML model using maternal factors only was logistic regression with an AUC of 0.796, whereas the FMF a priori risk achieved an AUC of 0.841 (DeLong p = 0.34). In the a posteriori setting, the best ML model was random forest with an AUC of 0.844, while the FMF posterior risk reached an AUC of 0.929 (DeLong p = 0.087).
The performance gap widened after adding biomarkers (AUC difference 0.045 a priori to 0.085 a posteriori), and the detection rate at 10% FPR consistently favored FMF (50.0% vs. 33.3% a priori; 71.4% vs. 57.1% a posteriori). These patterns remained directionally stable in sensitivity analyses that included aspirin-treated pregnancies and in repeated random undersampling experiments at PE:non-PE ratios of 1:5 and 1:10.
The undersampling sensitivity analyses may have practical clinical implications. Across different class-imbalance scenarios, the FMF algorithm generally maintained numerically higher and more stable discriminatory performance than the exploratory ML models, suggesting greater robustness in this low-prevalence real-world screening setting.
Our findings align with the broader external-validation literature showing that FMF frequently matches or outperforms alternative statistical or ML models when evaluated head-to-head using comparable predictors and endpoints, and that performance differences can remain even after ML retraining or adjustment of biochemical assays.
Our results can be interpreted through the lens of the way competing-risks screening is constructed, particularly in a low-event-rate cohort. With only 12 PE events in the a priori cohort and seven PE events in the a posteriori cohort, the ML algorithms faced a fundamental information bottleneck in estimating stable decision boundaries from the available positive examples. Expressed as events-per-variable, the effective EPV was approximately 12/9 = 1.3 for the a priori feature set and 7/12 = 0.58 for the a posteriori feature set. On the other hand, the FMF approach is explicitly designed as a Bayes theorem–based risk-updating framework that combines a prior distribution derived from maternal factors with likelihood functions derived from biomarkers to generate individualized posterior risks. The widening of the AUC gap in our cohort after adding biomarkers therefore plausibly reflects the FMF model’s ability to extract risk information from biomarkers through parameterized likelihoods that are stable across cohorts, whereas ML models in our setting must infer those relationships de novo from very few events.
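The events-per-variable figures quoted above follow from simple division, shown here for transparency. The function name is hypothetical; the counts are those stated in the text (12 events with 9 maternal predictors, 7 events with 12 predictors once the three biomarkers are added). For context, a frequently cited rule of thumb in regression modelling is roughly 10 events per variable, which both settings fall far short of.

```python
def events_per_variable(n_events: int, n_predictors: int) -> float:
    """Number of outcome events available per candidate predictor."""
    return n_events / n_predictors


a_priori_epv = events_per_variable(n_events=12, n_predictors=9)
a_posteriori_epv = events_per_variable(n_events=7, n_predictors=9 + 3)

print(f"a priori EPV:     {a_priori_epv:.2f}")
print(f"a posteriori EPV: {a_posteriori_epv:.2f}")
```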
Among the three ML algorithms evaluated, logistic regression demonstrated the most robust performance in the aspirin-excluded a priori screening scenario, while random forest was the most accurate in the a posteriori scenario, and XGBoost was consistently the most unstable. This ordering is noteworthy because much of the ML obstetrics literature reported strong performance for boosting methods (including XGBoost) and for ensemble learners in larger datasets, sometimes showing AUCs that exceed 0.90 when complex biomarker panels and large event counts were available [15,16,17].
A temporally validated cohort study on pre-eclampsia prediction using routinely collected maternal characteristics reported that XGBoost and logistic regression had similar discrimination performance, with AUCs of 0.75 (95% CI 0.73–0.78) and 0.76 (95% CI 0.74–0.78), respectively, while random forest performed worse with an AUC of 0.71 (95% CI 0.69–0.74). The logistic regression model showed near-perfect calibration with a slope of 1.02 (95% CI 0.92–1.12), whereas XGBoost and random forest had calibration slopes of 1.15 (95% CI 1.03–1.28) and 0.62 (95% CI 0.54–0.70), respectively, illustrating that model ranking depends on the dataset and setting rather than being universal [13].
Two systematic reviews also indicated no superiority of machine learning algorithms over logistic regression in clinical prediction models, emphasizing that improvements depend on study design and data characteristics [18,19]. The literature further indicates that logistic regression tends to show calibration performance similar to that of some machine learning models, even though this aspect is not frequently evaluated [10,20].
In our cohort, the comparatively poor a priori performance and high variance of XGBoost (0.623 ± 0.175) are compatible with the notion that higher-capacity models can be more sensitive to small effective sample sizes, even when conventional regularization is used, and reinforce why a transparent, multi-algorithm comparison is preferable to reporting only the best-performing ML approach.
Our incremental biomarker analysis further indicated which components of first-trimester screening were most informative in this cohort and why complete biomarker capture is central for fair comparisons between modeling paradigms. In the XGBoost incremental series, discrimination increased from AUC 0.62 using maternal factors alone to 0.73 after adding MAP, then to 0.81 after adding UtA-PI, and then decreased to 0.79 after adding PAPP-A, while the analyzable sample dropped to 901 patients due to missingness. The magnitude and direction of these increments are concordant with our observed cohort-level associations, in which MAP was significantly higher in PE than non-PE pregnancies and UtA-PI and PAPP-A showed trends in expected directions.
These observations are consistent with the literature demonstrating that combined screening by maternal factors, uterine artery pulsatility index, mean arterial pressure, and placental growth factor achieves substantially higher detection rates than maternal factors alone at fixed false-positive rates, and that such combined approaches have very high discrimination in large cohorts when biomarkers are appropriately incorporated [4,8].
To test whether extreme class imbalance could explain the inferior ML performance relative to FMF, we conducted a structured random undersampling sensitivity analysis that increased the apparent PE prevalence while preserving the positive class size. Undersampling did not improve ML performance relative to FMF in either the a priori or the a posteriori setting, and the FMF AUC remained stable across sampling strategies.
This pattern resonates with the prior literature that combines resampling strategies such as SMOTE and random undersampling to optimize classifier performance. For example, SMOTE combined with XGBoost consistently improved F1 scores and showed robust performance across imbalance levels, whereas random undersampling often reduced computational burden but did not necessarily enhance predictive accuracy [21]. Other studies found that hybrid methods such as SMOTE-ENN outperform SMOTE alone by generating synthetic samples while removing noisy data, leading to more stable and reliable models [22,23]. However, undersampling techniques tend to have higher variability in results and may not consistently improve discrimination metrics like AUC compared to oversampling methods, which generally provide more stable improvements [24].
Reclassification metrics in our cohort further supported the conclusion that, even when ML increased sensitivity at a given FMF-oriented threshold, it often did so at the cost of adverse reclassification among non-events. The practical implication is that, in a screening setting in which most pregnancies are unaffected, non-event misclassification can dominate program burden and clinical consequences. In the broader screening literature, this trade-off is widely recognized: some external FMF evaluations in high-risk cohorts report higher specificity with a drawback of lower sensitivity, whereas other cohorts achieve stronger detection at fixed false-positive rates, highlighting that both the baseline algorithm and its local calibration and cutoffs matter for the specificity–sensitivity balance [4,25,26]. Our NRI/IDI findings therefore complement the AUC comparisons by making explicit that, at least at the 1:100 cut-off, the ML models did not provide a favorable net reclassification of the much larger non-event group, even when they improved event capture in the small PE subgroup.
Decision curve analysis in our cohort provided the most direct evidence that the FMF approach retained superior clinical utility across plausible decision thresholds, particularly in the a posteriori scenario where the classification trade-offs were most extreme. FMF provided greater net benefit in three of four scenarios examined across the aspirin-excluded and aspirin-included analyses. The interpretive value of DCA in screening research is illustrated by external validations of FMF-type models in which DCA showed net clinical benefit across commonly used threshold ranges (e.g., 1–12%) [6,27], reinforcing the centrality of decision-analytic evaluation when comparing models that might have similar discrimination but different calibration and threshold behavior. In our study, the negative net benefit for the a posteriori ML model at thresholds ≥ 1% is coherent with its extremely low specificity (18.0%) and high screen-positive rate (82.1%) at the 1:100 threshold, because in DCA the harm of false positives is weighted increasingly as threshold probability increases. In other words, even if a model achieves high sensitivity, it can be clinically counterproductive if it generates too many false positives relative to the acceptable trade-off implied by a given clinical threshold.
Our calibration assessment, although necessarily limited by the small number of PE events, suggested that FMF was better calibrated than the ML model in the a posteriori setting, which is consistent with the broader validation literature in which calibration frequently limits transportability of both FMF and ML models across populations [
14,
27].
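One simple summary of calibration is the observed-to-expected (O/E) ratio, sometimes called calibration-in-the-large. The sketch below is illustrative only: it is not necessarily the calibration method applied in this study, the function name is hypothetical, and the toy values are not study data.

```python
def observed_expected_ratio(y, predicted_risk):
    """Ratio of observed events to the sum of predicted risks.

    A value near 1 indicates that, on average, predicted risks match the
    observed event rate; values below 1 indicate overestimated risk and
    values above 1 indicate underestimated risk.
    """
    expected = sum(predicted_risk)
    if expected == 0:
        raise ValueError("sum of predicted risks is zero")
    return sum(y) / expected


# Hypothetical toy data: 1 observed event, 2 expected events.
oe = observed_expected_ratio([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])
```

With very few events, as in small cohorts, the O/E ratio is itself imprecise and should be read alongside its uncertainty rather than as a point verdict.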
Interpretability analyses in our cohort, based on SHAP/feature importance for the a posteriori XGBoost model, suggested that the model relied most heavily on obstetric history, MAP and UtA-PI rather than on the sparse biochemical information available, and these patterns are consistent with both the FMF structure and many ML PE studies [
4,
28]. In that context, our SHAP pattern can be seen as a model-based reflection of the same physiologic pathways that the FMF competing-risks model encodes explicitly via biomarker likelihoods rather than learning implicitly from limited events.
When we included aspirin-treated pregnancies, the a priori and a posteriori comparisons remained similar, with detection rates at 10% FPR identical between ML and FMF (46.7% a priori; 66.7% a posteriori). These findings suggest that our primary aspirin-excluded analysis did not derive its conclusions solely from treatment-related distortion in the risk–outcome relationship. This interpretation is consistent with external FMF research showing that accounting for aspirin prophylaxis can alter detection rates and PPV and slightly change false-positive rates, and with other cohorts noting that observed performance metrics can already be influenced by aspirin prophylaxis among high-risk women identified through screening programs [
29,
30]. In our data, the stability of the ML–FMF ranking across aspirin-excluded and aspirin-included analyses supports the conclusion that, in low-event cohorts such as ours, FMF risk scores remained the stronger discriminator and the more clinically useful tool across common threshold ranges, even when treatment patterns were incorporated into the analytic dataset.
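The detection rate at a fixed false-positive rate, used throughout these comparisons, can be computed by choosing the risk cut-off from the non-event scores and then counting the events above it. The following is a minimal sketch under that assumption; the function name, the tie-handling rule, and the toy scores are hypothetical and do not reproduce the study results.

```python
def detection_rate_at_fpr(y, scores, fpr=0.10):
    """Detection rate (sensitivity) at a fixed false-positive rate.

    The cut-off is set so that at most `fpr` of non-events score strictly
    above it; the detection rate is the share of events above that cut-off.
    Returns (detection_rate, cutoff).
    """
    neg = sorted((s for yi, s in zip(y, scores) if yi == 0), reverse=True)
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    if not neg or not pos:
        raise ValueError("need at least one event and one non-event")
    # Number of false positives allowed (small epsilon guards float error).
    k = min(int(fpr * len(neg) + 1e-9), len(neg) - 1)
    cutoff = neg[k]  # the (k+1)-th highest non-event score
    detected = sum(s > cutoff for s in pos)
    return detected / len(pos), cutoff


# Hypothetical toy data: 10 non-events scored 1..10 and 3 events.
y = [0] * 10 + [1] * 3
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] + [9.5, 5.0, 12.0]
dr, cutoff = detection_rate_at_fpr(y, scores, fpr=0.10)
```

Here one of ten non-events lies above the cut-off (a 10% FPR) and two of three events are detected. With only a handful of events, each case shifts the detection rate by a large step, which helps explain why different models can report identical values.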
First-trimester combined screening for preeclampsia is increasingly used as a general tool for stratifying placental and uteroplacental risk, rather than solely to identify candidates for aspirin prophylaxis against preterm preeclampsia [
31].
Some studies have shown that women classified as high risk for preterm preeclampsia in the first trimester have a significantly higher hazard of spontaneous preterm and even term birth despite never developing clinical preeclampsia, suggesting that the preeclampsia risk score captures a broader uteroplacental vulnerability relevant to the timing of spontaneous labor [
32,
33].
Because the same maternal, Doppler, and biochemical markers also predict fetal growth restriction and small-for-gestational-age birth, first-trimester preeclampsia screening can inform surveillance strategies for intrapartum fetal compromise, linking early risk assessment to decisions about the intensity of monitoring and the timing and mode of delivery [
34,
35].
Several methodological strengths of our study enhance the credibility of these findings and should be emphasized when positioning our results against a heterogeneous literature. We performed a direct head-to-head comparison of three ML algorithms and the FMF risk estimates on the same individuals in both an a priori (maternal factors only) and an a posteriori (maternal factors plus biomarkers) setting, thereby minimizing confounding due to differing cohorts or outcome definitions. We employed repeated stratified cross-validation (5-fold with 10 repeats) and aggregated out-of-fold probabilities to reduce optimistic bias in the performance estimates. Beyond discrimination, we reported the detection rate at 10% FPR, threshold performance at 1:100, calibration assessment, DCA, and interpretability analyses, reflecting the multifaceted model evaluation that is increasingly recommended in prediction research. Additionally, our explicit undersampling sensitivity analysis directly tested whether class imbalance could explain the ML–FMF gap. This addresses a common methodological critique and links our findings to prior ML studies that use combined resampling strategies, such as SMOTE and random undersampling, to improve classifier performance, while demonstrating that such strategies cannot substitute for additional events in small-outcome cohorts.
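The repeated stratified cross-validation scheme described above can be sketched as follows. This is an illustrative, dependency-free outline rather than the study's pipeline: the helper names, the placeholder prevalence model, the cohort size of 500, and the choice to average out-of-fold probabilities across repeats are all assumptions; only the 5-fold, 10-repeat design and the 12 events are taken from the text.

```python
import random


def stratified_folds(y, k, rng):
    """Split indices into k folds, preserving the class ratio in each fold."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin
    return folds


def repeated_oof(y, fit_predict, k=5, repeats=10, seed=0):
    """Average out-of-fold predictions over repeated stratified k-fold CV.

    fit_predict(train_idx, test_idx) must train on train_idx only and
    return one predicted probability per index in test_idx.
    """
    rng = random.Random(seed)
    total = [0.0] * len(y)
    for _ in range(repeats):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            preds = fit_predict(train_idx, test_idx)
            for i, p in zip(test_idx, preds):
                total[i] += p
    return [t / repeats for t in total]


def prevalence_model(y):
    """Placeholder model: predicts the training-fold event rate for everyone."""
    def fit_predict(train_idx, test_idx):
        rate = sum(y[i] for i in train_idx) / len(train_idx)
        return [rate] * len(test_idx)
    return fit_predict


# Hypothetical cohort: 12 events among 500 pregnancies.
y = [1] * 12 + [0] * 488
folds = stratified_folds(y, 5, random.Random(0))
events_per_fold = sorted(sum(y[i] for i in fold) for fold in folds)
oof = repeated_oof(y, prevalence_model(y))
```

With 12 events and five folds, each test fold holds only two or three events, which illustrates why fold-level estimates are unstable and why predictions are pooled across folds and repeats before metrics are computed.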
Our study also has limitations that are inseparable from the observed performance patterns and that should temper any inference about the absolute superiority of one modeling paradigm over another in different settings. The most consequential limitation is the very low number of PE events: 12 cases in the a priori analysis and 7 cases in the complete-case a posteriori analysis. Such small counts necessarily yield wide confidence intervals around performance metrics and increase the risk that model comparisons are driven by a few influential cases. The restricted statistical power also increased the risk of model instability and overfitting. Consequently, all machine learning analyses should be interpreted as exploratory rather than as definitive comparisons with the established FMF competing-risks algorithm. Although repeated cross-validation, out-of-fold prediction procedures, and multiple sensitivity analyses were performed to improve robustness, these approaches cannot fully compensate for the limitations imposed by the small event count. Therefore, larger multicenter cohorts with substantially higher numbers of PE cases will be necessary to enable reliable external validation and more robust assessment of ML-based screening approaches.
Second, the study was conducted at a single center in Romania, and numerous external validations demonstrate that FMF performance and calibration can differ materially by population, requiring local correction for biomarker medians or analyzers and, in some contexts, recalibration or population-specific cutoffs. Such population dependence is also relevant for ML models, which can be sensitive to measurement distributions and missingness patterns.
Third, the broader prediction literature cautions that even models with good discrimination may have low PPV in the most applicable cohorts and that calibration is frequently not reported, indicating that external validation remains necessary before clinical translation of any ML alternative to an established screening algorithm.
The present findings do not support the replacement of the FMF algorithm by ML approaches in current first-trimester PE screening practice. Rather, the results suggest that established FMF competing-risks models remain robust and clinically reliable in low-prevalence real-world settings. Nevertheless, exploratory ML approaches may still hold future potential, particularly if developed and externally validated in substantially larger multicenter datasets with higher event counts and broader population heterogeneity.
Future work should focus on overcoming the current limitations while combining the strengths of FMF and machine learning approaches. This will require large multicenter studies with enough preeclampsia cases to enable robust training, especially for early-onset and preterm forms, together with improved biomarker completeness and standardization. Hybrid strategies, such as integrating FMF risk scores into ML models or using ML for recalibration, should also be explored to reduce false positives and improve clinical usefulness.
Moreover, future evaluations should jointly assess discrimination, calibration, and clinical utility, recognizing that in low-prevalence settings, meaningful improvements are more likely to come from reducing false positives and increasing net clinical benefit than from small gains in AUC alone.