Article

AI-FRS: An Ensemble-Based AI Decision-Support System for Fetal Risk Prediction in a Mexican Clinical Setting

by Abimael Guzman-Pando 1,*, Bernardo O. Enriquez-Guillen 2,*, Graciela Ramirez-Alonso 3, Javier Camarillo-Cisneros 4, Cesar R. Aguilar-Torres 2 and Luis C. Hinojos-Gallardo 2

1 Artificial Intelligence and Medical Computing Research Laboratory (LIIACOM), Faculty of Medicine and Biomedical Sciences, Autonomous University of Chihuahua, Chihuahua 31125, Mexico
2 Faculty of Medicine and Biomedical Sciences, Autonomous University of Chihuahua, Chihuahua 31125, Mexico
3 Faculty of Engineering, Autonomous University of Chihuahua, Chihuahua 31125, Mexico
4 Computational Chemistry Physics Laboratory, Facultad de Medicina y Ciencias Biomedicas, Universidad Autonoma de Chihuahua (UACH), Chihuahua 31125, Mexico
* Authors to whom correspondence should be addressed.
AI 2026, 7(4), 129; https://doi.org/10.3390/ai7040129
Submission received: 19 January 2026 / Revised: 1 March 2026 / Accepted: 21 March 2026 / Published: 1 April 2026

Abstract

Nearly 2 million stillbirths occur globally each year. These outcomes are often driven by disparities in healthcare access, especially in low- and middle-income countries, where limited resources and shortages of trained medical personnel further increase preventable risks. Addressing these challenges requires not only strengthening healthcare systems but also enhancing intervention strategies. In this context, decision-support systems become essential to dynamically identify at-risk pregnancies and improve fetal outcomes. We therefore propose AI-FRS (Artificial Intelligence–Fetal Risk Prediction System), a decision-support tool for fetal risk prediction designed to classify fetal conditions as healthy or at risk using clinical data from Mexican obstetric patients. AI-FRS is built upon seven distinct machine learning models, systematically evaluated through 127 first-order ensemble combinations using hard voting. To further enhance predictive performance, we assessed 32,752 second-order ensembles, constructed by combining top-performing first-order ensembles across recall, precision, and F1-score metrics. The final selected model, called BSOEM, achieved an F1-score of 0.812, providing a more balanced and robust decision-making framework than individual models or simple ensembles. Additionally, we conducted an interpretability analysis to identify the clinical variables with the greatest contribution to model predictions, strengthening the system’s transparency and potential clinical trust. AI-FRS features a user-friendly interface specifically designed to facilitate adoption by healthcare professionals. Together, these elements provide a fast and clinically applicable AI tool for intrapartum and peripartum risk detection in obstetrics, supporting clinical decision-making and improving fetal health outcomes.

1. Introduction

Neonatal morbidity and mortality remain a constant concern in global public health. According to the World Health Organization (WHO) [1], an estimated 1.9 million stillbirths occurred worldwide in 2021, representing one stillbirth every 17 seconds. In the case of Mexico, the National Institute of Statistics and Geography (INEGI) reported a fetal death rate of 72.2 per 100,000 women of reproductive age in 2022 [2]. Alarmingly, low- and middle-income countries account for approximately 84% of all neonatal deaths, reflecting stark inequalities in healthcare access, quality, and early intervention capacity [3]. Many of these stillbirths are linked to preventable factors such as inadequate prenatal care, untreated maternal infections, hypertensive disorders, and complications during labor [4]. Also, several factors have been associated with increased pregnancy risk, including chronic conditions such as diabetes, hypertension, and other health issues [5,6], as well as lifestyle behaviors like smoking [7], alcohol consumption [8], and inadequate diet or physical activity [9]. Predicting fetal risk during pregnancy is complex, as it involves the interaction of multiple maternal, fetal, or obstetric variables. However, nowadays, artificial intelligence algorithms, such as machine learning models, offer a promising alternative for processing large volumes of data and detecting patterns associated with gestational risks [10,11].
State-of-the-art approaches (e.g., [12,13,14,15]) have demonstrated promising results using cardiotocography (CTG) data. Nevertheless, they often exclude critical maternal and obstetric contexts, which are more commonly available in routine prenatal monitoring. Moreover, CTG is a specialized technique that requires dedicated equipment, continuous monitoring, and trained personnel, making it less accessible, especially in low-resource settings. In contrast, clinical variables like maternal age, gestational age, and medical history are easier to collect through standard consultations or ultrasound exams, which are more widely available in primary care. While other efforts have incorporated clinical data (e.g., [16,17]), they typically fall short in terms of predictive performance and have not explored ensemble-based methodologies. Moreover, there is a notable absence of research conducted using clinical datasets from Mexico, despite the country’s high stillbirth rates. These limitations highlight the need for more advanced, interpretable modeling strategies that integrate diverse clinical variables and are tailored to underrepresented populations.
Building upon that foundation, the present study introduces AI-FRS (Artificial Intelligence–Fetal Risk Prediction System), a clinically oriented decision support system designed to predict fetal risk using both antepartum and intrapartum clinical indicators. The system integrates seven machine learning models and systematically evaluates all 127 possible first-order ensemble combinations through a hard voting strategy. To further enhance performance, interpretability, and metric stability, a total of 32,752 second-order ensemble combinations were explored. The best-performing ensemble achieved an F1-score of 0.812, outperforming individual models and first-order ensembles. Trained and validated on a dataset obtained from a Mexican clinical setting, AI-FRS demonstrates robust and generalizable predictive capability. Furthermore, it was implemented in a user-friendly graphical interface that facilitates adoption by healthcare professionals, enabling real-time, interpretable decision support in obstetric care. The primary novelty of this study lies in the formulation of the Best Second-Order Ensemble Model (BSOEM) architecture, uniquely tailored to handle highly imbalanced tabular clinical datasets. Unlike previous methodologies that rely heavily on continuous cardiotocography (CTG) signals, which require specialized equipment and trained personnel, our system utilizes standard prepartum and intrapartum clinical variables. Furthermore, the exhaustive exploration of 32,752 second-order combinations ensures a mathematically grounded balance between precision and recall. In addition, AI-FRS can be executed on a standard laptop without requiring GPU acceleration. This lightweight deployment profile supports real-time operation and may facilitate translational feasibility, scalability, and long-term maintainability, particularly in low-resource healthcare settings where advanced monitoring technologies and high-performance computing infrastructure are not consistently available.
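The ensemble counts above can be checked combinatorially. The sketch below verifies both figures; the assumption that the 32,752 second-order ensembles arise from all multi-member subsets of 15 pooled top first-order candidates (e.g., five per metric across recall, precision, and F1-score) is our inference for illustration, not a detail stated by the authors:

```python
from math import comb

# First-order ensembles: every non-empty subset of the 7 base models.
n_models = 7
first_order = 2 ** n_models - 1
assert first_order == 127

# Second-order ensembles: 32,752 is consistent with pooling 15 top
# first-order candidates (an assumption) and forming every subset of
# size >= 2, i.e., 2^15 minus the empty set and the 15 singletons.
n_candidates = 15
second_order = sum(comb(n_candidates, k) for k in range(2, n_candidates + 1))
assert second_order == 2 ** n_candidates - n_candidates - 1 == 32752
```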

2. Related Works

In recent years, artificial intelligence (AI) techniques have emerged as promising tools to address challenges in obstetric care. Some studies have applied deep learning methods to fetal diagnosis from medical imaging. For example, Jain et al. [18] proposed a Vision Transformer (ViT)-based model to analyze fetal ultrasound images for automated health assessment and severity prediction. Their dual-head architecture simultaneously classified images into four severity levels: healthy, mild, moderate, and severe, and predicted a continuous severity index. Using a dataset of 500 annotated ultrasound images, the ViT achieved an accuracy of 89% and an F1-score of 0.875 on the validation set, outperforming conventional CNN and SVM models. Yenkikar et al. [19] proposed a Siamese Neural Network (SNN) with few-shot learning capabilities to classify fetal ultrasound images as normal or anomalous using only 767 anomalous samples. The model incorporated hybrid contrastive learning, shared-weight CNN backbones, and dynamic pair sampling to address class imbalance, achieving 98.1% classification accuracy. Few-shot learning was employed as a strategy to mitigate data scarcity, allowing the network to learn discriminative visual representations for fetal anomaly detection even with limited annotated data.
Regarding fetal risk outcome predictions, a few studies have applied machine learning to cardiotocography or CTG recordings, often reporting high performances. For instance, in [12], they used a voting ensemble of Logistic Regression, Random Forest, Gradient Boosting, and XGBoost on the UCI Fetal Health dataset, achieving an accuracy of 99.5% and nearly perfect F1-scores. Similarly, Zannah et al. [13] reported an accuracy of 99.56% and F1 of 96.68% using a combination of Decision Tree, Random Forest, and Gradient Boosting, also on CTG data from Portuguese sources. Nazli et al. [14] used CatBoost, LightGBM, and neural networks, reporting balanced accuracy of 91.3% with class imbalance corrected via SMOTE. In another recent contribution, Mondal et al. [15] aimed to predict the risks of fetal health and categorize them into three classes: pathological, suspect, and normal. The authors evaluated sixteen machine learning models, including Decision Tree, Random Forest, SVM, KNN, XGBoost, CatBoost, LightGBM, and Neural Networks, followed by ensemble techniques such as bagging, stacking, and soft voting. To address the severe class imbalance in the CTG dataset, they applied oversampling and hybrid resampling strategies such as SMOTE and SMOTEENN, with the latter yielding the best overall performance. Their ten-fold cross-validation results showed mean accuracies of 99.60%, 99.72%, and 99.78% for the bagging, stacking, and soft-voting ensembles, respectively, demonstrating the effectiveness of ensemble-based models combined with balanced data preprocessing for robust and reliable fetal risk prediction.
However, these works [12,13,14,15] rely mainly on CTG-derived features, which leaves a gap in fetal risk assessment and can create barriers to broader implementation. CTG monitoring requires expensive equipment and trained personnel, which can limit its use in resource-constrained settings. Moreover, even high-quality tracings cannot capture preexisting maternal conditions or prior obstetric events that profoundly shape risk. Other efforts have incorporated clinical data, with information such as maternal age, existing medical conditions, routine ultrasound measurements, and laboratory findings that are already collected in most prenatal care settings. For example, Abdi et al. [20] developed a prognostic model for predicting preterm birth using eight machine-learning algorithms, including Logistic Regression, Random Forest, XGBoost, LightGBM, and deep neural networks. The dataset comprised 8853 deliveries from the Iranian IMaN Net database, containing demographic, medical, and obstetric features. Their best-performing model, based on Random Forest, achieved an AUC of 0.65 and identified the onset of labor, preeclampsia, placental abruption, and gestational age as the most relevant predictors. In [16], Akbulut et al. evaluated nine classifiers, including perceptrons, Decision Forests, and SVMs, on a small cohort of 96 pregnancies. They used 23 variables and developed an mHealth application for clinical use. However, the small dataset limits its generalization, and the results may be overestimated. Malacova et al. [21] utilized one of the largest datasets, with approximately 952,800 births over 35 years in Western Australia. Their models, including Logistic Regression, Random Forests, and XGBoost, integrated extensive clinical histories: maternal age, smoking, prior obstetric complications, and even family history through data linkage. The best AUC was 0.84, obtained with XGBoost.
Nevertheless, its sensitivity remained low (45%), highlighting the difficulty of reducing false negatives even with large-scale data. Zimmerman et al. [22] worked with a multicenter U.S. cohort (nuMoM2b) comprising over 10,000 nulliparous pregnancies. They used a probabilistic graphical model and logistic regression with variables such as maternal diabetes, hypertension, fetal sex, and Apgar score. While their approach emphasized interpretability and risk interactions, it was limited to 16 variables due to computational constraints and included postnatal features, restricting its prospective use in real-time decision-making. The study in [17] applied Decision Trees, Random Forests, XGBoost, and KNN to a larger dataset of 7166 vaginal deliveries in Iran. They used features like preeclampsia, placental abruption, gestational age category, labor induction, and fetal sex, enabling both pre- and intrapartum prediction. Although the models identified meaningful clinical risk factors, their performance (F1-score of 0.76) indicates room for improvement, and no ensemble learning was employed. Table 1 summarizes previous studies, distinguishing those that use ultrasound images or CTG recordings from those that rely on clinical features. Only a few recent works employ ultrasound images, mainly for fetal anomaly detection or severity estimation, reporting high accuracies but limited generalizability due to small datasets and lack of external validation. CTG-based models consistently achieve higher accuracy and AUC values, but they lack patient-level context and require specialized monitoring equipment. In contrast, clinical-feature studies operate on routinely collected data from different countries, making them more accessible and cost-effective, yet they generally report more modest metrics. Also, ensemble methods are not fully explored in clinical-feature studies.

Key Contributions

In contrast to earlier work, our key contributions are:
  • AI-FRS is one of the first fetal risk prediction systems developed within a Mexican clinical setting, addressing regional disparities in access to diagnostic support technologies through data-driven approaches.
  • The study introduces a second-order ensemble model specifically designed for a highly imbalanced obstetric tabular dataset. Our framework implements an exhaustive combinatorial evaluation of 32,752 ensemble configurations derived from seven base learning algorithms (Artificial Neural Networks, Light Gradient Boosting Machine, Decision Trees, Random Forests, CatBoost, Convolutional Neural Networks, and XGBoost), enabling systematic identification of ensemble structures that optimize precision–recall trade-offs, with particular emphasis on F1-score performance.
  • The proposed framework relies exclusively on routinely available antepartum and intrapartum clinical variables, outperforming previous clinical feature studies, and without requiring the specialized equipment or continuous monitoring of CTG-based approaches.
  • The proposed system aims to identify fetuses at increased risk of neonatal respiratory compromise at birth through the integration of a user-friendly clinical interface and a lightweight inference architecture. Real-time predictions can be performed on a standard laptop without GPU acceleration, ensuring low computational requirements. This deployment profile facilitates accessible and scalable risk assessment in resource-limited healthcare settings while supporting timely, data-informed clinical decision-making.
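The hard-voting rule underlying the first-order ensembles can be sketched as follows; the per-model predictions and the tie-breaking rule toward the risk class are illustrative assumptions, not the authors' implementation:

```python
from itertools import combinations

BASE_MODELS = ["ANN", "CNN", "DT", "RF", "XGB", "CB", "LGBM"]

def hard_vote(votes):
    """Majority vote over binary predictions (1 = risky).
    Ties are broken toward the risk class (an illustrative assumption)."""
    return int(sum(votes) >= len(votes) / 2)

# The 127 first-order ensembles are the non-empty subsets of the base models.
first_order = [s for r in range(1, len(BASE_MODELS) + 1)
               for s in combinations(BASE_MODELS, r)]

# Hypothetical per-model predictions for one patient.
preds = {"ANN": 1, "CNN": 0, "DT": 1, "RF": 1, "XGB": 0, "CB": 0, "LGBM": 1}
ensemble = ("ANN", "DT", "RF", "XGB")
decision = hard_vote([preds[m] for m in ensemble])  # 3 of 4 vote "risky"
```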

3. Materials and Methods

3.1. Dataset Description

The research conducted to acquire the data was an observational and longitudinal study with a quantitative approach, comprising 2486 patient records from the Gynecology and Obstetrics Department of the Central University Hospital in Chihuahua, Mexico, between 15 May 2022 and 30 August 2024. A standardized data collection form was developed and validated by a panel of 12 experts in gynecology and obstetrics to ensure the clinical relevance and consistency of all variables. The study adhered to the ethical standards of the Declaration of Helsinki and was approved by the Ethics Committee of the Faculty of Medicine and Biomedical Sciences, Universidad Autónoma de Chihuahua (UACH), register No. C1-038-21. This adherence underscores our commitment to the ethical management of research data and the protection of participants’ rights and welfare throughout the study process.
The dataset includes 18 clinical features, shown in Table 2, selected after a thorough review of the literature and validation by experts. These variables comprise both categorical and continuous data. Importantly, most of the variables are either directly available or estimable before or during labor, which enhances the clinical applicability of the proposed system. Table 3 summarizes the class distribution, where 2322 cases correspond to healthy pregnancies and 164 cases to risky pregnancies. No preprocessing steps were required, as the dataset was prospectively collected and curated by clinical experts through a structured digital form designed to prevent incomplete or inconsistent entries during data acquisition.
The binary target variable represents the risk of clinically significant neonatal respiratory or hypoxic compromise at birth. The “Risky” label was assigned in real time following neonatal evaluation at delivery. A case was classified as Risky (State 1) when the newborn presented objective evidence of respiratory or hypoxic compromise. Relevant findings considered during the evaluation included a low 5 min Apgar score consistent with perinatal hypoxia, a pathological Silverman–Andersen score indicating respiratory distress, documented neonatal respiratory syndromes (such as respiratory distress syndrome, hyaline membrane disease, or transient tachypnea of the newborn), hypoxia or acidosis, and/or the need for advanced neonatal resuscitation. The final classification was determined through clinical adjudication based on the integration of these standardized neonatal findings. The “Healthy” label (State 0) was assigned when no clinical evidence of respiratory or hypoxic compromise was observed during the same standardized neonatal evaluation. All predictor variables were collected antepartum or intrapartum prior to delivery as part of routine obstetric care. Fetal weight corresponds to the estimated fetal weight obtained through prenatal assessment rather than postnatal birth weight.
Therefore, the proposed model aims to identify fetuses at increased risk of developing neonatal respiratory compromise at birth using antepartum and intrapartum clinical variables available prior to neonatal outcome occurrence.
Figure 1 presents normalized histograms illustrating the distribution of dataset features stratified by fetal risk status (0: healthy, 1: risky). The use of percentage normalization for categorical variables and density estimation for continuous ones allows for a clearer comparison between the two outcome groups, even in the presence of class imbalance. A key observation is the effect of maternal age: while both groups include patients from 15 to 40 years, there is a noticeable increase in fetal risk (State 1) among women over 30 years, suggesting maternal age is a contributing factor to adverse fetal outcomes. Similarly, the distribution of gestational weeks reveals a separation between the two classes, where most healthy pregnancies (State 0) cluster tightly around 36–40 weeks, while risky cases (State 1) show more dispersion and a higher frequency of preterm deliveries (<37 weeks). In smoking, the fetal risk group (State 1) shows a clear increase in frequency in the “Yes” category (value = 2), whereas most healthy pregnancies are concentrated in the “No” category (value = 1). This reinforces the role of behavioral factors like tobacco use as contributors to poor fetal outcomes. A similar, though less pronounced, trend appears in drug use. Contrary to expectations in some clinical scenarios, the bar plot for Event Type (1 = vaginal delivery, 2 = cesarean) reveals that fetal risk appears more frequently in vaginal deliveries. This might suggest that cesarean sections, while often performed in emergencies, may also play a protective role in selected high-risk cases, or that fetal distress is sometimes not detected in time to warrant surgical intervention. Regarding delivery hours, the majority of cases in both groups fall within <12 h (value = 1), but there is a slight increase in longer labors (value = 2) among the risk group, potentially indicating prolonged labor as a complicating factor. Clinical interventions reveal nuanced associations with fetal outcomes. 
Oxytocin and misoprostol, used to induce or enhance labor, are more frequently used in healthy outcomes (State 0), whereas their non-use is more common in risk cases (State 1). This suggests a potential protective role for these medications in managing labor effectively. Conversely, carbetocin, typically used to prevent postpartum hemorrhage, shows higher usage among fetal risk cases, likely reflecting its role in managing complicated deliveries. Similarly, corticosteroids such as betamethasone and dexamethasone, administered to accelerate fetal lung development in anticipation of preterm birth, are more prevalent in healthy outcomes, suggesting they may contribute to reducing fetal risk when used appropriately. Finally, the use of antihypertensives follows a similar trend: pregnancies where these medications were administered show lower fetal risk, reinforcing their potential protective effect against hypertensive complications during pregnancy. Regarding physiological markers, placental maturity grade shows a notable pattern: fetal risk (State 1) is more frequently associated with lower placental maturity levels, particularly grades 0, 1, and 2, while grade 3 (indicating full-term maturity) is predominantly observed in healthy outcomes (State 0). This suggests that immature placentas may be linked to increased fetal risk, potentially due to insufficient oxygen and nutrient exchange in utero, especially in preterm pregnancies. As for the amniotic fluid index (AFI), although the variable appears less discriminative overall, there is a visible trend: fetal risk is more frequent when the AFI is <5 (value = 1), which aligns with clinical definitions of oligohydramnios, a known risk factor. On the other hand, when the AFI is ≥5 (value = 2), the proportion of risk cases decreases. Thus, while AFI alone may not be a strong discriminator, its lower values are clearly associated with increased fetal risk in this dataset. 
The histogram for fetal weight shows that while the healthy group is centered between 3000 and 3500 g, the fetal risk group has a left-shifted distribution, indicating a higher incidence of low birth weight (<2500 g), which is a known risk factor for neonatal complications.

3.2. Exploratory Feature Analysis

To explore the potential relevance of the clinical features used in the dataset, we conducted a two-step exploratory analysis. First, univariate statistical tests were applied individually to each feature: ANOVA [24] for continuous variables and Chi-squared tests [25] for categorical variables. These tests evaluated whether each feature exhibited a statistically significant association (p < 0.05) with the outcome variable. Table 4 summarizes the results, highlighting features that showed statistical significance.
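As an illustration of this univariate screening step, a minimal sketch on synthetic data (the clinical dataset itself is not public; all values below are fabricated for demonstration):

```python
import numpy as np
from scipy.stats import f_oneway, chi2_contingency

rng = np.random.default_rng(0)

# Continuous feature (e.g., fetal weight in grams) split by outcome: one-way ANOVA.
healthy = rng.normal(3300, 400, size=200)   # synthetic values
risky = rng.normal(2700, 500, size=30)
f_stat, p_anova = f_oneway(healthy, risky)

# Categorical feature (e.g., smoking) vs. outcome: Chi-squared test on a
# 2x2 contingency table of counts (rows: No/Yes; columns: healthy/risky).
table = np.array([[1900, 100], [422, 64]])
chi2, p_chi2, dof, expected = chi2_contingency(table)

significant = p_anova < 0.05  # the paper's significance threshold
```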
Second, a Random Forest strategy [26] was employed to estimate multivariate feature importances, capturing the collective and interactive contribution of variables. Figure 2 shows the ranked importance of features, where Fetal Weight, Weeks of Gestation, and Maternal Age emerged as the most influential, aligning with clinical knowledge. Features such as Preeclampsia and Placental Maturity Grade also ranked highly. Interestingly, some variables not significant in the univariate tests (e.g., Carbetocin Use, Amniotic Fluid Index) showed importance in the Random Forest analysis, suggesting predictive value through interactions with other variables. This highlights the potential relevance of preserving all features in future dataset expansions. It is important to note that this analysis was performed on the complete dataset solely for exploratory and interpretative purposes, and was not used for feature selection or model training, thereby avoiding any risk of data leakage. All clinical features were retained for model development, since even variables without univariate statistical significance may contribute through multivariate interactions, as suggested by the Random Forest analysis. Furthermore, certain clinical features (e.g., medication use) are of potential medical relevance and may gain importance as more cases are incorporated in future datasets.
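A sketch of how such multivariate importances can be obtained with scikit-learn, using synthetic stand-in data (the features and labels below are assumptions for illustration, not the study's data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for the 18 clinical features.
X = rng.normal(size=(500, 18))
# Outcome depends mainly on features 0 and 1; the rest are noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_      # impurity-based, normalized to sum to 1
ranking = np.argsort(importances)[::-1]    # most influential feature first
```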

3.3. Models’ Description

To build a robust classification system for maternal and fetal state prediction, we evaluated multiple machine learning models with diverse learning paradigms. We focused on models with different strengths in terms of interpretability and generalization capabilities. The selected models were Artificial Neural Network (ANN), Convolutional Neural Network (CNN), Decision Tree (DT), and the ensemble methods Random Forest (RF), Extreme Gradient Boosting (XGB), CatBoost (CB), and LightGBM (LGBM). We describe each model below:
1. Artificial Neural Network (ANN): The ANN is designed to capture complex non-linear relationships [27]. This capability is achieved through multiple layers of interconnected neurons that allow the network to approximate flexible decision boundaries, making it well-suited for complex classification tasks.
2. Convolutional Neural Network (CNN): The CNN model extends the ANN architecture by adding convolutional layers that learn spatially local patterns in the input data. The CNN builds a hierarchy of increasingly abstract representations, making it particularly effective at recognizing complex feature interactions without manual feature engineering [28].
3. Decision Tree (DT): The DT model [29] builds an interpretable structure by recursively partitioning the feature space based on thresholds that maximize information gain or reduce Gini impurity [30]. Gini impurity quantifies the likelihood of misclassifying a randomly selected sample at a node, with lower values indicating greater class purity. By minimizing this measure, the tree organizes the data into homogeneous groups.
4. Random Forest (RF): RF [31] enhances the robustness of the DT by aggregating the predictions of multiple decision trees, each trained on a different bootstrapped subset of the data with random feature selection at each split. This approach reduces model variance and mitigates overfitting, which is particularly valuable for improving generalization.
5. Extreme Gradient Boosting (XGB): XGB [32] implements a boosting strategy in which trees are sequentially added to correct the residual errors of previous trees. XGB leverages both the first- and second-order gradients of the loss function during optimization. This second-order information enables faster convergence.
6. LightGBM (LGBM): LightGBM [33] follows a boosting framework but adopts a leaf-wise tree growth strategy, unlike traditional level-wise models such as XGB. In a level-wise strategy, all leaves at the same depth are expanded simultaneously, which leads to balanced trees but can introduce unnecessary splits that offer minimal loss reduction. In contrast, the leaf-wise approach selects and splits the leaf with the highest loss reduction at each step. This allows the model to grow trees in the most promising direction, resulting in deeper and more unbalanced trees. While this method can lead to faster convergence, it may also increase the risk of overfitting if the tree depth is not constrained.
7. CatBoost (CB): Categorical Boosting [34], or CB, is also a gradient boosting method, designed to handle categorical features natively. It converts categorical features into numerical representations using target statistics while implementing ordered boosting to prevent target leakage. Target leakage occurs when a model inadvertently learns from information that would not be available at inference time. CB addresses this issue through ordered boosting, which ensures that each model is trained only on data from previous observations.
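The Gini impurity used by the Decision Tree can be computed directly; a minimal, dependency-free sketch:

```python
def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_c p_c^2, where p_c is the fraction
    of node samples in class c. A value of 0 means a perfectly pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

For example, a pure node yields 0.0, while a 50/50 binary node yields the maximum impurity of 0.5; a split is preferred when it lowers the weighted impurity of the children.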
For the ANN, we designed a five-layer network with an input layer of eighteen clinical features, three hidden layers of thirty, twenty, and ten neurons, respectively, and an output layer with two neurons and softmax activation for binary classification. We included a dropout layer after the first hidden layer and another after the second hidden layer. Each hidden layer used a ReLU activation. We tuned both the learning rate and dropout rate to improve generalization.
Regarding the CNN model, it was necessary to reshape each scaled feature vector into a two-dimensional array with a single channel. The model begins with a convolutional layer that applies 64 filters of size 3 × 3 , using ReLU activation and same padding to extract local feature patterns. The resulting feature maps are then flattened into a one-dimensional vector and passed through a fully connected layer of 256 neurons with ReLU activation. A final output layer consists of a single neuron with sigmoid activation, optimized with binary cross-entropy loss and trained using the RMSprop optimizer. We tuned the learning rate to minimize overfitting.
For DT, RF, XGB, CB, and LGBM, key hyperparameters such as maximum depth, number of estimators, learning rate, and minimum samples to split were tuned to achieve the best predictive performance while maintaining model robustness.
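A sketch of this tuning step using scikit-learn's GridSearchCV; the parameter grid, data, and scoring choice are assumptions for illustration, as the exact search spaces are not listed in the paper:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 18))            # synthetic stand-in for the 18 features
y = (X[:, 0] > 0).astype(int)             # synthetic binary outcome

# Hypothetical grid -- the paper does not report the exact search spaces.
param_grid = {"max_depth": [3, 5, 10], "min_samples_split": [2, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",                         # the paper's headline metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
best_params = search.best_params_
```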

3.4. Mathematical Models Formulation and Regularization of Base Learners

To ensure methodological rigor and reproducibility, we formalize the mathematical formulation, loss functions, and regularization mechanisms of the selected base learners.
  • Notation
Let the dataset be
$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N},$$
where $i \in \{1,\dots,N\}$ indexes samples, $\mathbf{x}_i = [x_{i,1},\dots,x_{i,18}]^{\top} \in \mathbb{R}^{18}$ is the normalized clinical feature vector of sample $i$, and $y_i \in \{0,1\}$ is the corresponding binary fetal-risk label. When training neural models with two output neurons, labels are represented in one-hot form $\mathbf{y}_i = [y_{i,1}, y_{i,2}] \in \{0,1\}^{K}$ with $K = 2$ and $\sum_{k=1}^{K} y_{i,k} = 1$, where $y_{i,k} = 1$ indicates membership in class $k \in \{1,2\}$ (with a fixed mapping to $\{0,1\}$). Predicted class probabilities are denoted $\hat{\mathbf{p}}_i = [\hat{p}_{i,1}, \hat{p}_{i,2}] \in (0,1)^{K}$ for the ANN, and $\hat{p}_i \in (0,1)$ for the CNN.
We define the ReLU, sigmoid, and softmax functions as
$$\operatorname{ReLU}(z) = \max(0, z), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \operatorname{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{r=1}^{K} e^{z_r}},$$
where $\mathbf{z} \in \mathbb{R}^{K}$ is a vector of logits and $(\cdot)_k$ denotes its $k$-th component.
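These definitions translate directly into NumPy; a minimal sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0])
probs = softmax(logits)          # a valid probability vector over K = 2 classes
```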

3.4.1. Neural Architectures (ANN and CNN)

We first present the model parameterization (architecture) and then the loss function and regularizers used during optimization.
  • Artificial Neural Network (ANN) architecture.
The implemented ANN is a fully connected network with fixed layer widths $18 \to 30 \to 20 \to 10 \to K$, with $K = 2$ output neurons. For each sample $i$, $\mathbf{h}_i^{(l)}$ denotes the post-activation output of hidden layer $l \in \{1,2,3\}$:
$$\mathbf{h}_i^{(1)} = \operatorname{ReLU}\!\left(\mathbf{W}_1 \mathbf{x}_i + \mathbf{b}_1\right), \qquad \mathbf{W}_1 \in \mathbb{R}^{30 \times 18}, \; \mathbf{b}_1 \in \mathbb{R}^{30},$$
where $\mathbf{W}_1$ and $\mathbf{b}_1$ are trainable weights and biases, respectively.
Dropout after layer 1. Let $\mathbf{r}_i^{(1)} \in \{0,1\}^{30}$ be a random binary mask with independent components $r_{i,j}^{(1)} \sim \operatorname{Bernoulli}(1-p)$ for $j \in \{1,\dots,30\}$, where $p \in (0,1)$ is the dropout rate. The dropout-transformed activation is
$$\tilde{\mathbf{h}}_i^{(1)} = \mathbf{r}_i^{(1)} \odot \mathbf{h}_i^{(1)},$$
where $\odot$ denotes element-wise multiplication.
$$\mathbf{h}_i^{(2)} = \operatorname{ReLU}\!\left(\mathbf{W}_2 \tilde{\mathbf{h}}_i^{(1)} + \mathbf{b}_2\right), \qquad \mathbf{W}_2 \in \mathbb{R}^{20 \times 30}, \; \mathbf{b}_2 \in \mathbb{R}^{20}.$$
Dropout after layer 2. Let $\mathbf{r}_i^{(2)} \in \{0,1\}^{20}$ with $r_{i,j}^{(2)} \sim \operatorname{Bernoulli}(1-p)$ for $j \in \{1,\dots,20\}$. Then
$$\tilde{\mathbf{h}}_i^{(2)} = \mathbf{r}_i^{(2)} \odot \mathbf{h}_i^{(2)},$$
$$\mathbf{h}_i^{(3)} = \operatorname{ReLU}\!\left(\mathbf{W}_3 \tilde{\mathbf{h}}_i^{(2)} + \mathbf{b}_3\right), \qquad \mathbf{W}_3 \in \mathbb{R}^{10 \times 20}, \; \mathbf{b}_3 \in \mathbb{R}^{10},$$
$$\mathbf{o}_i^{\mathrm{ANN}} = \mathbf{W}_4 \mathbf{h}_i^{(3)} + \mathbf{b}_4, \qquad \mathbf{W}_4 \in \mathbb{R}^{K \times 10}, \; \mathbf{b}_4 \in \mathbb{R}^{K},$$
$$\hat{\mathbf{p}}_i^{\mathrm{ANN}} = \operatorname{softmax}\!\left(\mathbf{o}_i^{\mathrm{ANN}}\right), \qquad \hat{\mathbf{p}}_i^{\mathrm{ANN}} \in (0,1)^{K}.$$
Here, $\mathbf{o}_i^{\mathrm{ANN}} \in \mathbb{R}^{K}$ is the ANN logit vector prior to softmax.
Dropout rates. Dropout is applied after the first and second hidden layers with dropout rate $p$ selected via inner cross-validation grid search (e.g., $p \in \{0.1, 0.2\}$).
  • Class weighting (used for ANN and CNN).
To mitigate class imbalance, we use balanced class weights. Let $N_c = \sum_{i=1}^{N} \mathbb{I}(y_i = c)$ be the number of samples in class $c \in \{0, 1\}$, where $\mathbb{I}(\cdot)$ is the indicator function. The class weight is
$$\omega_c = \frac{N}{2 N_c}, \quad c \in \{0, 1\},$$
and the per-sample weight is $\omega_{y_i}$ (i.e., $\omega_{y_i} = \omega_0$ if $y_i = 0$ and $\omega_{y_i} = \omega_1$ if $y_i = 1$).
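The balanced weighting scheme above can be sketched in a few lines of pure Python (the function name is illustrative); note that the two weights average to 1, so the overall loss scale is preserved:

```python
def balanced_class_weights(labels):
    # omega_c = N / (2 * N_c) for binary labels c in {0, 1}:
    # the minority class receives a proportionally larger weight.
    n = len(labels)
    n_pos = sum(labels)
    counts = {0: n - n_pos, 1: n_pos}
    return {c: n / (2.0 * counts[c]) for c in (0, 1)}
```

With 9 healthy and 1 risky sample, for instance, the risky class receives weight 5.0 and the healthy class 10/18, mirroring the 6.6% prevalence problem at a smaller scale.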
  • ANN objective and regularization.
The ANN is optimized by minimizing the weighted categorical cross-entropy loss:
$$J_{\mathrm{ANN}}(\theta_{\mathrm{ANN}}) = -\frac{1}{N} \sum_{i=1}^{N} \omega_{y_i} \sum_{k=1}^{K} y_{i,k} \log \hat{p}_{i,k}^{\mathrm{ANN}},$$
where $\hat{p}_{i,k}^{\mathrm{ANN}}$ is the $k$-th component of $\hat{p}_i^{\mathrm{ANN}}$, and $\theta_{\mathrm{ANN}} = \{W_l, b_l\}_{l=1}^{4}$ denotes all ANN trainable parameters. Optimization is performed with Adam; let $\eta > 0$ denote the learning rate selected by inner cross-validation grid search.
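A minimal pure-Python sketch of this weighted loss (function and argument names are illustrative, not from the implementation):

```python
import math

def weighted_categorical_ce(y_onehot, p_hat, sample_weights):
    # J = -(1/N) * sum_i w_i * sum_k y_{i,k} * log(p_hat_{i,k})
    n = len(y_onehot)
    total = 0.0
    for y, p, w in zip(y_onehot, p_hat, sample_weights):
        total += w * sum(yk * math.log(pk) for yk, pk in zip(y, p))
    return -total / n
```

For a single sample with a uniform prediction over two classes, the loss reduces to $\log 2 \approx 0.693$, the usual cross-entropy baseline.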
  • ANN inference used for evaluation.
For reporting metrics, the predicted class is obtained as:
$$\hat{y}_i^{\mathrm{ANN}} = \arg\max_{k \in \{0, 1\}} \hat{p}_{i,k}^{\mathrm{ANN}},$$
where $\hat{y}_i^{\mathrm{ANN}} \in \{0, 1\}$ is the predicted class index.
  • Convolutional Neural Network (CNN) architecture.
The implemented CNN maps each feature vector $x_i \in \mathbb{R}^{18}$ into a square 2D representation by zero-padding and reshaping. Define
$$s = \left\lceil \sqrt{18} \right\rceil, \qquad P = s^2,$$
so that $s = 5$ and $P = 25$ in our setting. Let $u_i = [u_{i,1}, \ldots, u_{i,P}] \in \mathbb{R}^P$ be the padded vector:
$$u_{i,j} = \begin{cases} x_{i,j}, & 1 \le j \le 18, \\ 0, & 19 \le j \le P, \end{cases}$$
and reshape it into a single-channel “image” $X_i \in \mathbb{R}^{s \times s \times 1}$ as
$$X_i[a, b, 1] = u_{i, (a-1)s + b}, \quad a, b \in \{1, \ldots, s\}.$$
The CNN applies one convolutional layer with 64 filters of size $3 \times 3$ and same padding:
$$F_{i,k} = \mathrm{ReLU}\!\left(X_i * K_k + \beta_k \mathbf{1}\right), \quad k \in \{1, \ldots, 64\},$$
where $K_k \in \mathbb{R}^{3 \times 3 \times 1}$ is the $k$-th convolution kernel, $\beta_k \in \mathbb{R}$ is its bias term, $\mathbf{1} \in \mathbb{R}^{s \times s}$ is an all-ones matrix (broadcast), and $F_{i,k} \in \mathbb{R}^{s \times s}$ is the resulting feature map.
Flattening. The 64 feature maps are concatenated and vectorized into
$$z_i = \mathrm{vec}\!\left(F_{i,1}, \ldots, F_{i,64}\right) \in \mathbb{R}^{64 s^2},$$
so $z_i \in \mathbb{R}^{1600}$ when $s = 5$.
Fully connected layer (256 neurons) and output:
$$h_i^{\mathrm{CNN}} = \mathrm{ReLU}\!\left(A z_i + c\right), \quad A \in \mathbb{R}^{256 \times 64 s^2}, \; c \in \mathbb{R}^{256},$$
$$o_i^{\mathrm{CNN}} = a^{\top} h_i^{\mathrm{CNN}} + c_0, \quad a \in \mathbb{R}^{256}, \; c_0 \in \mathbb{R},$$
$$\hat{p}_i^{\mathrm{CNN}} = \sigma\!\left(o_i^{\mathrm{CNN}}\right), \quad \hat{p}_i^{\mathrm{CNN}} \in (0, 1).$$
Here, $A$ and $c$ denote the trainable weight matrix and bias vector of the fully connected layer, respectively, and $o_i^{\mathrm{CNN}} \in \mathbb{R}$ is the CNN logit prior to the sigmoid.
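The padding-and-reshaping step that turns an 18-feature vector into a 5 × 5 single-channel input can be sketched as follows (a minimal illustration; the function name is not from the implementation):

```python
import math

def to_square_image(x):
    # Zero-pad a feature vector to P = s*s entries, with s = ceil(sqrt(len(x)))
    # (s = 5 and P = 25 for 18 features), then reshape row-wise into s rows.
    s = math.ceil(math.sqrt(len(x)))
    padded = list(x) + [0.0] * (s * s - len(x))
    return [padded[a * s : (a + 1) * s] for a in range(s)]
```

Features 1-18 fill the first three and a half rows; the remaining seven cells of the 5 × 5 grid are zeros.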
  • CNN objective and regularization.
The CNN is optimized by minimizing the weighted binary cross-entropy:
$$J_{\mathrm{CNN}}(\theta_{\mathrm{CNN}}) = -\frac{1}{N} \sum_{i=1}^{N} \omega_{y_i} \left[ y_i \log(\hat{p}_i^{\mathrm{CNN}}) + (1 - y_i) \log(1 - \hat{p}_i^{\mathrm{CNN}}) \right],$$
where $\theta_{\mathrm{CNN}} = \{K_k, \beta_k\}_{k=1}^{64} \cup \{A, c, a, c_0\}$ collects all CNN trainable parameters. Optimization uses RMSprop; let $\eta > 0$ denote the learning rate selected by grid search.
  • Early stopping.
To prevent overfitting, training uses early stopping with patience $\pi = 20$ epochs and restoration of the weights corresponding to the best validation loss.
  • CNN inference threshold used for evaluation.
For metric computation, predicted probabilities are thresholded at $\tau_{\mathrm{CNN}} = 0.9$:
$$\hat{y}_i^{\mathrm{CNN}} = \mathbb{I}\!\left(\hat{p}_i^{\mathrm{CNN}} > \tau_{\mathrm{CNN}}\right), \quad \tau_{\mathrm{CNN}} = 0.9,$$
where $\hat{y}_i^{\mathrm{CNN}} \in \{0, 1\}$ is the predicted label.

3.4.2. Tree-Based Models (DT and RF)

We first define the model structure and splitting criterion of each tree-based learner, and then describe the corresponding regularization mechanisms and hyperparameters.
  • Decision Tree (DT) formulation.
A Decision Tree (DT) recursively partitions the input space using axis-aligned splits. Let $\nu$ denote an internal node of the tree. The set of training indices that reach node $\nu$ is denoted $I_\nu \subseteq \{1, \ldots, N\}$, with cardinality $|I_\nu|$ (i.e., the number of samples at node $\nu$). At node $\nu$, the DT selects a feature index $j \in \{1, \ldots, 18\}$ and a threshold $\tau \in \mathbb{R}$, where $x_{i,j}$ denotes the value of the $j$-th feature of sample $i$. This split induces the left and right child index sets
$$I_\nu^L(j, \tau) = \{ i \in I_\nu \mid x_{i,j} \le \tau \}, \qquad I_\nu^R(j, \tau) = \{ i \in I_\nu \mid x_{i,j} > \tau \}, \qquad I_\nu = I_\nu^L \cup I_\nu^R,$$
where $\nu_L$ and $\nu_R$ denote the left and right child nodes created by this split, corresponding to $I_\nu^L$ and $I_\nu^R$, respectively.
Let $c \in \{0, 1\}$ be a class label. The empirical class proportion at node $\nu$ is defined as
$$\pi_{\nu,c} = \frac{1}{|I_\nu|} \sum_{i \in I_\nu} \mathbb{I}(y_i = c), \quad c \in \{0, 1\},$$
where $y_i \in \{0, 1\}$ is the true label of sample $i$ and $\mathbb{I}(\cdot)$ is the indicator function, i.e., $\mathbb{I}(\mathrm{true}) = 1$ and $\mathbb{I}(\mathrm{false}) = 0$.
The Gini impurity at node $\nu$ is then computed as
$$G(\nu) = 1 - \sum_{c \in \{0, 1\}} \pi_{\nu,c}^2,$$
where lower values indicate higher class purity at node $\nu$.
For a candidate split $(j, \tau)$ at node $\nu$, the impurity decrease (Gini gain) is defined as
$$\Delta G(\nu; j, \tau) = G(\nu) - \frac{|I_\nu^L(j, \tau)|}{|I_\nu|} G(\nu_L) - \frac{|I_\nu^R(j, \tau)|}{|I_\nu|} G(\nu_R),$$
where $|I_\nu^L(j, \tau)|$ and $|I_\nu^R(j, \tau)|$ denote the number of samples assigned to the left and right child nodes, respectively. The DT selects the split
$$(j^*, \tau^*) = \arg\max_{j, \tau} \Delta G(\nu; j, \tau),$$
evaluating candidate thresholds $\tau$ for each feature $j$ according to the implementation.
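The Gini-gain split search can be sketched exhaustively in pure Python (an illustrative sketch for small data, not the library's optimized routine; names are hypothetical):

```python
def gini(labels):
    # G = 1 - sum_c pi_c^2 for binary labels
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def best_split(X, y):
    # Exhaustively search (feature j, threshold tau) maximizing the Gini gain
    # Delta G = G(parent) - |L|/n * G(L) - |R|/n * G(R).
    n = len(y)
    parent = gini(y)
    best = (None, None, -1.0)
    for j in range(len(X[0])):
        for tau in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= tau]
            right = [y[i] for i in range(n) if X[i][j] > tau]
            if not left or not right:
                continue  # degenerate split, skip
            gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
            if gain > best[2]:
                best = (j, tau, gain)
    return best
```

On a perfectly separable one-feature toy set the search recovers the midpoint threshold with the maximum possible gain of 0.5.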
  • DT objective and regularization.
The DT is regularized by restricting its complexity through hyperparameters tuned by an inner 5-fold cross-validation grid search. In particular, DT_maxdepth limits the maximum depth of the tree, and DT_minsamplessplit specifies the minimum number of samples required to further split an internal node. Constraining these values prevents overly deep trees and reduces overfitting.
  • Random Forest (RF) formulation.
Random Forest (RF) aggregates an ensemble of $M$ decision trees $\{T_m\}_{m=1}^{M}$, where $M \in \mathbb{N}$ corresponds to the hyperparameter RF_nestimators (i.e., the number of trees). Each tree $T_m(\cdot)$ is trained on a bootstrap sample of the original dataset. Let $\mathcal{D}^{(m)}$ denote the bootstrap dataset used to train tree $m$:
$$\mathcal{D}^{(m)} \sim \mathrm{Bootstrap}(\mathcal{D}), \quad m = 1, \ldots, M,$$
where $\mathrm{Bootstrap}(\mathcal{D})$ indicates sampling $N$ instances from $\mathcal{D}$ with replacement. Given an input $x_i$, each tree outputs a class prediction $T_m(x_i) \in \{0, 1\}$.
The RF ensemble prediction is obtained via majority voting:
$$\hat{y}_i^{\mathrm{RF}} = \arg\max_{c \in \{0, 1\}} \sum_{m=1}^{M} \mathbb{I}\!\left(T_m(x_i) = c\right),$$
where $\hat{y}_i^{\mathrm{RF}} \in \{0, 1\}$ is the predicted label of sample $i$.
  • RF regularization.
RF reduces variance by averaging multiple decorrelated trees (bagging). Additional regularization is achieved by constraining the complexity of each constituent tree through the tuned hyperparameters RF_maxdepth and RF_minsamplessplit, which prevent full tree growth. Finally, the ensemble size $M$ (RF_nestimators) controls the bias–variance trade-off, where larger $M$ typically stabilizes predictions by reducing variance through averaging.
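The majority-voting step of RF reduces to counting per-tree votes; a minimal sketch (trees are stood in for by plain callables, purely for illustration):

```python
from collections import Counter

def rf_predict(trees, x):
    # arg max over classes of the number of trees voting for that class
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]
```
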

3.4.3. Gradient Boosting Frameworks (XGB, LGBM, and CB)

We first define the additive tree model (parameterization) and then present the implemented-specific regularizers.
  • Additive boosting formulation.
Gradient boosting models construct an additive scoring function (logit) composed of $T$ regression trees. Let $t \in \{1, \ldots, T\}$ index boosting iterations, and let $f_t : \mathbb{R}^{18} \to \mathbb{R}$ denote the regression tree added at iteration $t$. For a given sample $i \in \{1, \ldots, N\}$ with feature vector $x_i \in \mathbb{R}^{18}$, the cumulative score after $T$ iterations is defined as
$$s_i^{(T)} = \sum_{t=1}^{T} \eta \, f_t(x_i),$$
where $s_i^{(T)} \in \mathbb{R}$ is the final logit score. The additive formulation includes the learning-rate factor $\eta$, which scales the contribution of each newly added tree. The predicted probability of class 1 is obtained by applying the sigmoid link function
$$\hat{p}_i = \sigma\!\left(s_i^{(T)}\right) = \frac{1}{1 + e^{-s_i^{(T)}}},$$
where $\hat{p}_i \in (0, 1)$.
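The additive score and its sigmoid link can be sketched directly (trees are represented by placeholder callables for illustration):

```python
import math

def boosted_probability(trees, x, eta):
    # s^(T) = sum_t eta * f_t(x); p_hat = sigmoid(s^(T))
    s = sum(eta * f(x) for f in trees)
    return 1.0 / (1.0 + math.exp(-s))
```
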
  • Regularized boosting objective.
The boosting models learn the functions $\{f_t\}_{t=1}^{T}$ by minimizing a regularized objective. Let $l(\cdot, \cdot)$ denote the binary cross-entropy computed on probabilities:
$$l(y_i, \hat{p}_i) = -\left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right],$$
where $y_i \in \{0, 1\}$ is the true label of sample $i$ and $\hat{p}_i$ is the predicted probability. The global regularized objective is
$$J_{\mathrm{GB}} = \sum_{i=1}^{N} l(y_i, \hat{p}_i) + \sum_{t=1}^{T} \Omega(f_t),$$
where $\Omega(f_t)$ is a structural penalty controlling the complexity of tree $t$. Let $L_t \in \mathbb{N}$ be the number of leaves of tree $f_t$. Each leaf $l \in \{1, \ldots, L_t\}$ is assigned a constant score $w_{t,l} \in \mathbb{R}$, and $f_t(x_i)$ equals the score of the leaf into which $x_i$ falls. The penalty term is defined as
$$\Omega(f_t) = \gamma L_t + \frac{\lambda}{2} \sum_{l=1}^{L_t} w_{t,l}^2,$$
where $\gamma \ge 0$ penalizes the creation of additional leaves (model complexity cost) and $\lambda \ge 0$ controls the $\ell_2$ regularization on the leaf scores.
  • XGBoost (XGB): Second-order optimization and early stopping.
XGBoost fits $f_t(\cdot)$ by minimizing a second-order approximation of the objective around the current partial score. Define the partial score at iteration $(t-1)$ as
$$s_i^{(t-1)} = \sum_{r=1}^{t-1} \eta \, f_r(x_i),$$
and the corresponding probability $\hat{p}_i^{(t-1)} = \sigma(s_i^{(t-1)})$. Using a second-order Taylor expansion of the loss with respect to $s$, the per-iteration objective can be written as
$$\tilde{J}_{\mathrm{XGB}}^{(t)} \approx \sum_{i=1}^{N} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),$$
where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the score $s$, evaluated at $s = s_i^{(t-1)}$:
$$g_i = \left. \frac{\partial \, l(y_i, \sigma(s))}{\partial s} \right|_{s = s_i^{(t-1)}}, \qquad h_i = \left. \frac{\partial^2 \, l(y_i, \sigma(s))}{\partial s^2} \right|_{s = s_i^{(t-1)}}.$$
To reduce overfitting, the maximum depth, number of estimators, and learning rate are tuned during hyperparameter optimization.
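For the binary cross-entropy on the logit, these derivatives have the well-known closed forms $g_i = \hat{p}_i - y_i$ and $h_i = \hat{p}_i (1 - \hat{p}_i)$, which a boosting implementation evaluates per sample; a minimal sketch (function name is illustrative):

```python
import math

def logistic_grad_hess(y, s):
    # With p = sigma(s) and l = -[y*log(p) + (1-y)*log(1-p)]:
    # g = dl/ds = p - y;  h = d2l/ds2 = p * (1 - p)
    p = 1.0 / (1.0 + math.exp(-s))
    return p - y, p * (1.0 - p)
```
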
  • LightGBM (LGBM): Leaf-wise growth and early stopping.
LightGBM uses the same additive formulation in Equations (30) and (31), but differs in how each tree is grown. Let $\mathcal{L}_t$ denote the set of current leaves of the partially grown tree at iteration $t$. For any leaf $l \in \mathcal{L}_t$, let $\Delta J(l) \in \mathbb{R}$ denote the reduction in the objective obtained by splitting that leaf. LightGBM grows trees in a leaf-wise manner by selecting
$$l_t^* = \arg\max_{l \in \mathcal{L}_t} \Delta J(l),$$
that is, it always splits the leaf yielding the maximum objective reduction. As with XGB, the maximum depth, number of estimators, and learning rate are tuned during hyperparameter optimization to reduce overfitting.
  • CatBoost (CB): Categorical handling and validation-based model selection.
CatBoost follows the same regularized boosting objective in Equations (33) and (34). When categorical features are present, CatBoost can internally apply ordered target statistics to prevent target leakage. Let $r \in \{1, \ldots, R\}$ index categorical features and let $c_i^{(r)}$ denote the categorical value of feature $r$ for sample $i$. The ordered target-encoding operator $\mathrm{TE}(\cdot)$ maps $c_i^{(r)}$ to a real value as
$$\mathrm{TE}\!\left(c_i^{(r)}\right) = \frac{\sum_{j < i} \mathbb{I}\!\left(c_j^{(r)} = c_i^{(r)}\right) y_j + \alpha \bar{y}}{\sum_{j < i} \mathbb{I}\!\left(c_j^{(r)} = c_i^{(r)}\right) + \alpha},$$
where (i) $j < i$ enforces that the encoding of sample $i$ only uses label information from samples observed before $i$, (ii) $\alpha > 0$ is a smoothing constant controlling the bias–variance trade-off of the encoding, and (iii) $\bar{y}$ is the global target mean computed on the training set:
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i.$$
This transformation yields an encoded feature vector $\tilde{x}_i \in \mathbb{R}^{18}$, where each categorical component is replaced by its corresponding $\mathrm{TE}(c_i^{(r)})$ value, while continuous components remain unchanged.
For CatBoost, the number of boosting iterations, learning rate, and tree depth were optimized via inner cross-validation, as they directly control the ensemble size, the capacity of each tree, and the contribution of each boosting step, respectively, thereby balancing convergence speed and overfitting.
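The ordered target-statistic idea can be sketched for a single categorical column in one forward pass over a fixed sample order (an illustrative simplification: CatBoost itself uses random permutations and other refinements; names and the choice of $\alpha$ are hypothetical):

```python
def ordered_target_encode(categories, labels, alpha):
    # TE(c_i) = (sum_{j<i, c_j==c_i} y_j + alpha * y_bar) /
    #           (count_{j<i, c_j==c_i}   + alpha)
    # Only samples seen *before* i contribute, preventing target leakage.
    y_bar = sum(labels) / len(labels)
    sums, counts, encoded = {}, {}, []
    for c, y in zip(categories, labels):
        num = sums.get(c, 0.0) + alpha * y_bar
        den = counts.get(c, 0) + alpha
        encoded.append(num / den)
        sums[c] = sums.get(c, 0.0) + y   # update running statistics *after*
        counts[c] = counts.get(c, 0) + 1  # encoding sample i
    return encoded
```

The first occurrence of any category falls back to the smoothed global mean, which is exactly the leakage-free behavior the $j < i$ constraint enforces.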

3.5. Evaluation Metrics

The following metrics were used to evaluate the models, where $TP$ refers to true positives, $TN$ to true negatives, $FP$ to false positives, and $FN$ to false negatives:
  • Accuracy. Measures the overall correctness of the model:
    $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • Precision. Indicates the proportion of correctly identified positive predictions among all predicted positives:
    $$\mathrm{Precision} = \frac{TP}{TP + FP}$$
  • Sensitivity (Recall). Measures the model’s ability to correctly detect positive outcomes:
    $$\mathrm{Recall} = \frac{TP}{TP + FN}$$
  • F1-Score (F-Measure). Combines precision and recall into a single score by calculating their harmonic mean. A value closer to 1 indicates better performance:
    $$\mathrm{F1\text{-}Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
  • Specificity. Measures the proportion of correctly identified negative cases among all actual negative cases:
    $$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
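The five metrics follow directly from the confusion-matrix counts; a minimal sketch (the function name is illustrative, and the division-by-zero edge cases of empty classes are not handled):

```python
def classification_metrics(tp, tn, fp, fn):
    # Compute accuracy, precision, recall, F1, and specificity from counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "specificity": specificity}
```

Note that with heavy class imbalance, accuracy can stay high while recall collapses, which is precisely why the ensemble selection below relies on precision, recall, and F1-score rather than accuracy alone.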

3.6. Base Model Training and Hyperparameter Tuning

Figure 3 illustrates the complete training and stratified nested cross-validation workflow adopted in this study. In each outer fold, the seven base models (ANN, CNN, DT, RF, XGB, CB, and LGBM) are independently trained, resulting in five trained instances per algorithm across the outer folds. Importantly, the held-out outer-fold test sets and their corresponding trained models are stored after training. These trained models are subsequently reused to generate fold-specific predictions required for the ensemble construction process described in Section 3.8 and illustrated in Figure 4. No additional retraining is performed during ensemble evaluation. This explicit separation between base model training (Figure 3) and ensemble selection (Figure 4) clarifies the methodology and prevents the misconception that thousands of ensemble models were independently retrained.
As shown in Figure 3, we employed a stratified nested cross-validation strategy to obtain unbiased estimates of model performance. First, the full dataset was partitioned into five stratified folds, each allocating 60% of the cases to training, 20% to validation, and 20% to testing. Stratified cross-validation was particularly important in this study due to the pronounced class imbalance (6.6% of “risky” cases), as it ensured that each fold preserved the same proportion of risky and healthy samples as the overall dataset [35,36,37]. In our specific case, the number of folds was limited to five to avoid excessive fragmentation of the minority class, ensuring sufficient samples per fold for validation and testing. Within each outer training fold, we fit Z-score normalization parameters only on the raw outer training data before oversampling, and then applied that same scaler, without refitting, to the validation and test splits.
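The leakage-free scaling step admits a short sketch in pure Python (function names are illustrative; the study's implementation presumably uses a library scaler, but the fit-on-train-only discipline is the same):

```python
import math

def fit_zscore(train_rows):
    # Fit per-feature mean and (population) std on the outer TRAINING split only.
    n, d = len(train_rows), len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in train_rows) / n)
            for j in range(d)]
    return means, stds

def apply_zscore(rows, means, stds):
    # Reuse the stored parameters on validation/test rows; never refit there.
    return [[(r[j] - means[j]) / stds[j] for j in range(len(r))] for r in rows]
```
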
We ran an internal five-fold stratified cross-validation hyperparameter grid search on the training data. In this inner loop, 80% of the training subset was used for model fitting, while the remaining 20% was used for internal validation. The exact search grids for each algorithm are listed in Table 5.
ADASYN (Adaptive Synthetic Sampling) [38,39] was applied exclusively to the outer training subset to correct class imbalance prior to hyperparameter optimization, and the selected hyperparameters were subsequently used to refit each model on the augmented training data. ADASYN generates synthetic samples exclusively for the minority class by concentrating on regions where it is underrepresented and more heavily surrounded by majority class instances. This strategy forces the classifier to focus on challenging cases to improve class discrimination. ADASYN was deliberately selected over computationally expensive diffusion models or Bayesian generation algorithms due to its efficiency and suitability for tabular clinical data. ADASYN dynamically shifts the decision boundary toward the difficult-to-learn minority examples without artificially altering the underlying statistical distribution of the majority class, providing a robust solution for our imbalanced dataset.
To further clarify the numerical distribution of samples across splits, Table 6 reports the class-wise counts for the training, validation, and test subsets in each outer fold. The table shows the original sample counts and the corresponding post-ADASYN counts for the training subset only, highlighting that oversampling was applied exclusively to the minority class and only within the training data. Validation and test subsets remained untouched throughout the entire experimental pipeline to prevent data leakage.
Importantly, validation and test subsets were never used for oversampling or for fitting preprocessing parameters. The trained models were subsequently evaluated on their untouched validation and test sets. We reserved the validation fold not only for early stopping in the ANN or CNN to prevent overfitting but also to maintain a consistent checkpoint across all seven algorithms. Using a common validation split allows us to compare model behaviors under identical conditions. Finally, the fully trained models were assessed on the held-out test fold, providing a fair measure of generalization.
Regarding the overfitting mitigation strategy, for tree-based algorithms (RF, XGB, LGBM, CB, and DT), explicit early stopping was unnecessary, since these models incorporate internal mechanisms that naturally limit overfitting. Random Forest controls overfitting by averaging predictions across multiple decision trees. Each tree is trained on a random subset of data and features, ensuring diversity among individual learners and preventing the ensemble from memorizing training noise [26]. XGB, in particular, incorporates an explicitly regularized objective function that penalizes excessive model complexity through both L1 and L2 terms, effectively discouraging overfitting while maintaining flexibility to capture nonlinear relationships [32]. LGBM introduces leaf-wise growth with depth limitation, which, when combined with constraints on the minimum number of samples per leaf and feature subsampling, helps balance accuracy and generalization [33]. CB mitigates overfitting through ordered boosting, symmetric tree construction, and L2 regularization [34]. For standalone DT, model complexity was constrained through hyperparameter tuning of maximum depth and minimum samples per split, limiting growth and improving generalization [29]. This means that we further mitigated overfitting in the tree-based models through an inner-loop grid search for hyperparameter optimization, explicitly tuning parameters such as maximum depth, learning rate, and the number of estimators, as listed in Table 5. These constraints effectively limited tree expansion, ensuring that the models captured only generalizable data patterns. Additionally, the nested stratified cross-validation design and the use of ADASYN to augment training data within the outer folds provided strong safeguards against overfitting. 
Weighted categorical cross-entropy loss and explicit early stopping were reserved only for the ANN and CNN models, where limiting training epochs is critical to prevent overfitting due to their higher representational capacity.
The optimal hyperparameter combinations identified for each machine learning model are shown in Table 7. We report the mean test accuracy, standard deviation, and the corresponding best-performing hyperparameter configuration.

3.7. Experimental Domain Key Points

All models in this work were trained and evaluated within a common experimental domain defined by the same input feature space, outcome, and validation protocol. The predictive task consisted of a binary classification of fetal state (0: healthy, 1: risky) using the set of 18 prepartum and intrapartum clinical variables described in Table 2. The cohort comprised 2486 deliveries, of which 6.6% corresponded to risky cases, as summarized in Table 3. Data partitioning followed a stratified 5-fold cross-validation scheme, with a 60/20/20 split into training, validation, and test subsets in each outer fold, and ADASYN oversampling applied only to the training subsets of the outer folds (Table 6). This configuration ensures that all models are evaluated under identical and clinically realistic conditions in terms of features, outcome, and data splits.
Within this common domain, we deliberately selected seven algorithms with complementary strengths in terms of interpretability and generalization capabilities. The selected models (ANN, CNN, DT, RF, XGB, CB, and LGBM) were chosen to cover a spectrum of learning strategies: (1) interpretable models (DT, RF) to provide clinically meaningful decision paths; (2) boosting-based ensemble learners (XGB, CB, LGBM) known to perform robustly on structured and imbalanced datasets; and (3) neural architectures (ANN, CNN) to capture nonlinear feature interactions. This combination allowed us to explore complementary inductive biases and decision mechanisms within the same experimental domain, which is a key motivation behind the ensemble-based AI-FRS framework.
For each algorithm, the experimental domain was further constrained by its hyperparameter search space (Table 5), defined from commonly used ranges in the machine-learning literature, and explored through inner-loop grid search during stratified cross-validation. These grids explicitly limit model capacity (e.g., maximum tree depth, learning rate, number of estimators, dropout rate), ensuring that each model operates in a practically relevant region of its parameter space while reducing the risk of overfitting and enhancing the robustness of the comparison. For each outer fold, the seven trained base models (ANN, CNN, DT, RF, XGB, CB, and LGBM) were stored together with their corresponding held-out test subsets, and these saved instances were subsequently used to construct and evaluate the different ensemble configurations under strictly out-of-sample conditions.
Finally, it is important to note that the present study was designed as a retrospective, single-center analysis with an imbalanced outcome distribution and static clinical snapshots rather than longitudinal trajectories. These aspects define the methodological scope of the current work; their implications for generalizability and clinical implementation are further discussed in the subsequent Sections.

3.8. Model Ensemble Methodology

In this study, we opted to generate a second-order ensemble for fetal risk prediction. The motivation for implementing the second-order ensemble was to obtain a more robust and balanced model by integrating first-order ensembles optimized for precision, recall, and F1-score. The idea is that by combining complementary ensembles, the proposed second-order ensemble effectively balances the trade-off between false positives and false negatives, resulting in improved F1-score performance and reduced inter-fold variance. This approach aligns with previous studies that have demonstrated how multi-level or stacked ensemble architectures can increase model stability, generalization capacity, and predictive consistency [40,41,42]. Additionally, the exhaustive evaluation of all possible ensemble combinations ensured that the final configuration was not arbitrarily selected but empirically optimized, allowing the discovery of synergistic interactions among diverse models that would not emerge from manual or heuristic selection.
To construct the second-order ensemble, the six best first-order ensembles based on precision, the six best based on recall, and the three best based on F1-score were selected, resulting in fifteen candidate ensembles. This configuration was designed to achieve a balanced representation of complementary optimization objectives while keeping the computational search space manageable. Ensembles optimized for precision and recall capture opposite behaviors in classification, as they prioritize minimizing false positives or false negatives, respectively. Selecting an equal number of each allowed a symmetric evaluation of both tendencies, while the three F1-based ensembles served as reference models that inherently balance both precision and recall, providing stable comparative anchors for the integration process.
Figure 4 presents the complete two-stage ensemble selection framework developed for AI-FRS, detailing how individual base models are progressively aggregated into first-order and second-order hard-voting ensembles within a cross-validation protocol. In the first stage, the seven trained base learners: ANN, CNN, LGBM, DT, RF, CB, and XGB are combined to generate all 127 possible non-empty first-order hard-voting ensemble configurations. Each ensemble is evaluated using the held-out outer-fold test splits (Fold-1 to Fold-5), in which predictions are obtained independently for each fold, and performance metrics, including precision, recall, F1-score, and accuracy, are computed on the corresponding test partition. These fold-specific metrics are then averaged across the five folds. Based on this cross-validated evaluation, 15 top-performing first-order ensembles are selected using complementary optimization criteria: six ensembles prioritized for precision (EP1–EP6), six prioritized for recall (ER1–ER6), and three prioritized for F1-score (EF1–EF3). This selection strategy ensures diversity across different operating points along the sensitivity–specificity spectrum. In the second stage, these 15 first-order ensembles are treated as meta-models and combined again via hard voting to construct 32,752 second-order ensemble configurations. As in the first stage, each configuration is evaluated on the same held-out outer-fold test splits, and performance metrics are averaged across folds to maintain methodological consistency. The ensemble achieving the highest average F1-score across folds is designated as the Best Second-Order Ensemble Model (BSOEM), which constitutes the final predictive engine of the AI-FRS system. This hierarchical selection process explicitly prioritizes stability and balanced performance, ensuring that the final model reflects robust behavior across partitions.
Formally:
Let $\mathcal{M}$ be the set of the seven base models (ANN, CNN, LGBM, DT, RF, CB, and XGB):
$$\mathcal{M} = \{ m_1, m_2, \ldots, m_7 \},$$
where each $m$ represents the saved model for a given fold. We denote $F = \{1, \ldots, 5\}$ as the index set of the five stratified test folds. For each fold $f \in F$, let $D_f^{test}$ be the held-out test set. Then the predictions (a vector of class labels) of model $m_i$ are denoted by:
$$y_{i,f} = m_i(D_f^{test}).$$
  • First-order ensembles
    First-order ensembles were defined for all possible non-empty subsets of the seven models:
    $$E^{1ord} = \{ \epsilon^{1ord} \subseteq \mathcal{M} \mid \epsilon^{1ord} \neq \emptyset \}.$$
    Each model $m$ has two possibilities regarding its inclusion in an ensemble subset $\epsilon^{1ord}$: either included or not. Consequently, for seven distinct base models, the number of possible ensembles of size at least 1 is:
    $$\sum_{k=1}^{7} \binom{7}{k} = 127.$$
    For each ensemble $\epsilon^{1ord} \in E^{1ord}$ and each fold $f$, we define its hard-voting prediction by:
    $$y_{\epsilon^{1ord}, f}(x) = \mathrm{mode}\left( \{ m_i(x) \mid m_i \in \epsilon^{1ord} \} \right), \quad x \in D_f^{test},$$
    where $\mathrm{mode}(\cdot)$ returns the majority vote of the models’ predictions. We evaluate the $j$-th performance metric, with $PM = \{ \text{F1-score}, \text{recall}, \text{precision} \}$, of $\epsilon^{1ord}$ on each fold:
    $$\mathrm{perf}_j(\epsilon^{1ord}, f) = PM_j\left( y_{\epsilon^{1ord}, f}, y_f \right),$$
    where $y_f$ is the ground-truth label vector of fold $f$ against which $y_{\epsilon^{1ord}, f}$ is compared. Then, we average across folds to obtain the overall first-stage performance:
    $$\overline{\mathrm{perf}_j}(\epsilon^{1ord}) = \frac{1}{|F|} \sum_{f=1}^{5} \mathrm{perf}_j(\epsilon^{1ord}, f).$$
    Finally, we selected the top-$k$ ensembles $E_j^{1ord}$ under each metric $j$:
    $$E_j^{1ord} = \mathrm{argtop}_k \left\{ \overline{\mathrm{perf}_j}(\epsilon^{1ord}) \mid \epsilon^{1ord} \in E^{1ord} \right\},$$
    with:
    $$k = \begin{cases} 6, & \text{if } j \in \{\text{Precision}, \text{Recall}\}, \\ 3, & \text{if } j = \text{F1-score}. \end{cases}$$
  • Second-order ensembles
    Having identified our top-$k$ first-order ensembles $E_j^{1ord} = \{ \epsilon_1^{(1ord,j)}, \epsilon_2^{(1ord,j)}, \ldots, \epsilon_k^{(1ord,j)} \}$ for each metric $j$, we proceed to build second-order ensembles by combining those selected first-order groups.
    We define the set of second-order candidates:
    $$E^{2ord} = \left\{ \epsilon^{2ord} \;\middle|\; \epsilon^{2ord} \subseteq \bigcup_{j=1}^{3} E_j^{1ord}, \; |\epsilon^{2ord}| \ge 2 \right\}.$$
    The total number of possible subsets $\epsilon^{2ord}$ of size at least 2 drawn from the 15 distinct first-order ensembles is:
    $$\sum_{r=2}^{15} \binom{15}{r} = 32{,}752.$$
    For each ensemble $\epsilon^{2ord} \in E^{2ord}$ and each fold $f$, we define its hard-voting prediction by:
    $$y_{\epsilon^{2ord}, f}(x) = \mathrm{mode}\left( \{ \epsilon_i^{1ord}(x) \mid \epsilon_i^{1ord} \in \epsilon^{2ord} \} \right), \quad x \in D_f^{test}.$$
    We evaluate the F1-score of $\epsilon^{2ord}$ on each fold:
    $$\mathrm{perF1}(\epsilon^{2ord}, f) = \text{F1-score}\left( y_{\epsilon^{2ord}, f}, y_f \right).$$
    Then, we average across folds to obtain the overall second-stage performance:
    $$\overline{\mathrm{perF1}}(\epsilon^{2ord}) = \frac{1}{|F|} \sum_{f=1}^{5} \mathrm{perF1}(\epsilon^{2ord}, f).$$
    Finally, we selected the best ensemble $E^{2ord*}$ by:
    $$E^{2ord*} = \arg\max \left\{ \overline{\mathrm{perF1}}(\epsilon^{2ord}) \mid \epsilon^{2ord} \in E^{2ord} \right\}.$$
This experimental framework was designed to ensure fair model comparison and robust generalization assessment. All models were trained and evaluated on a common feature space with identical data partitions, and nested cross-validation was used to obtain unbiased performance estimates and control overfitting. Complementary model families and a two-stage ensemble selection procedure were employed to capture heterogeneous predictive patterns and identify the most stable configuration for fetal risk prediction.

4. Results and Discussion

The following section presents and analyzes the results obtained from the seven individual machine learning (ML) models and evaluates the best-performing first-order (BFOEM) and second-order (BSOEM) ensemble models for the fetal state prediction task. The evaluation metrics considered include precision, recall, F1-score, and accuracy.
Table 8 presents the performance metrics of each individual ML model. Among these, the XGB model demonstrated balanced behavior, achieving the highest F1-score ( 0.791 ± 0.032 ) and exhibiting strong recall ( 0.830 ± 0.019 ). Meanwhile, the CNN model achieved the highest accuracy ( 0.953 ± 0.004 ) and precision ( 0.863 ± 0.054 ), although its lower recall ( 0.714 ± 0.020 ) indicates a tendency to miss positive cases more frequently compared to other models.
The RF and CB models also showed robust performances, slightly behind XGB but still competitive. Although the ANN and DT models showed the lowest accuracy values, they still contributed value in ensemble scenarios by offering diverse decision patterns.
To further evaluate whether combining these models would enhance predictive performance, all possible first-order ensemble combinations were assessed. Figure 5 illustrates selected first-order ensembles ranked by descending F1-Score. Although not all ensembles are depicted for clarity, it is evident that approximately the top 30 combinations surpass individual models, achieving F1-Scores above 0.80. Notably, base ML models without ensemble combinations are situated at lower performance levels, with the DT model performing worst.
Subsequently, second-order ensemble combinations were created using the top 15 first-order ensembles (six selected based on precision, six on recall, and three on F1-score). Given the high number of possible combinations (32,752), Figure 6 illustrates a random subset to reveal general performance trends.
It can be noted that approximately 22,500 combinations achieved an F1-Score above 0.80, with around 4400 surpassing 0.81. This pattern indicates limited incremental benefits from stacking already high-performing ensembles. However, evaluating second-order ensembles remains a valuable strategy. In fact, we decided to retain second-order ensemble models because they enhance the interpretability and transparency of the decision-making process of AI-FRS. By displaying the individual outputs of each participating first-order ensemble and then computing a final aggregated risk probability, the model facilitates a clearer understanding of how the final decision is reached.
Additionally, the second-order ensemble approach provides a more stable and balanced trade-off between precision and recall, which is essential in managing imbalanced clinical datasets such as those used for fetal risk prediction.
Table 9 presents detailed results of the best first-order ensemble models (BFOEMs), categorized by precision, recall, and F1-score. The best second-order ensemble model (BSOEM), selected for developing the AI-FRS, is included for comparison. Mean values are presented alongside their 95% confidence intervals (CIs), calculated using the Student's t distribution (t(4, 0.975) = 2.776), allowing a statistically grounded comparison among ensembles. The relatively narrow CIs observed for the BSOEM indicate lower variability and higher reliability of its predictions. In addition, the top three BFOEMs based on F1-score exhibit slightly lower precision, recall, and accuracy than the BSOEM. This performance gap demonstrates that combining top-performing precision- and recall-oriented ensembles within the second-order structure leads to more balanced and stable predictive behavior. Figure 7 further illustrates the stability of the F1-score across folds for the proposed BSOEM compared to the top three first-order ensembles ranked by F1-score. As shown, the BSOEM (blue violin) achieves the best combination of a high mean and low dispersion in F1 values, reflecting its ability to maintain consistent and balanced performance across validation folds. This outcome validates the optimization strategy based on the exhaustive evaluation of all possible ensemble combinations, which leads to a more generalizable configuration.
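The CI computation can be reproduced with a short sketch, assuming five folds (hence 4 degrees of freedom and the critical value 2.776); the fold scores below are illustrative, not the reported ones:

```python
import numpy as np

def t_ci_95(fold_scores, t_crit=2.776):
    """95% CI for the mean over k = 5 folds using Student's t with 4 d.o.f."""
    scores = np.asarray(fold_scores, dtype=float)
    mean = scores.mean()
    # Standard error of the mean uses the sample standard deviation (ddof=1)
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width

# Hypothetical fold-wise F1 values, for illustration only
mean, lo, hi = t_ci_95([0.80, 0.82, 0.79, 0.83, 0.81])
```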
A deeper look at these ensemble combinations reveals consistent performance enhancement patterns. Ensembles involving CNN often excel in precision, while ensembles including RF and XGB notably boost recall. Thus, combining precision-oriented and recall-oriented models achieves a desirable balance, significantly improving the overall predictive performance compared to individual models.
To contextualize the performance and modeling strategy of our BSOEM relative to prior work, Table 10 presents a structured comparison with representative state-of-the-art models.
Ultrasound image-based approaches remain relatively limited in the literature. Models such as [18,19] primarily concentrate on detecting fetal anomalies or estimating their severity based on a small set of images. While these models achieve solid accuracy, they lack extensive validation and generalizability. Additionally, this imaging-based approach tends to be more expensive and less accessible than CTG monitoring, making large-scale adoption in resource-limited settings particularly challenging.
In contrast, many studies have focused on physiological monitoring through CTG. Models such as [12,13,14,15] have reported exceptionally high accuracy and F1-scores using fetal cardiotocography data. However, those studies rely exclusively on signal-derived features from the UCI Portuguese Fetal Health dataset and are inherently limited in scope. They capture fetal heart rate and variability metrics, but cannot account for critical maternal or obstetric factors such as comorbidities or prior complications. This constraint limits their applicability in diverse clinical scenarios where contextual information is essential for accurate decision-making.
Importantly, both the International Federation of Gynecology and Obstetrics (FIGO) and the National Institute for Health and Care Excellence (NICE) guidelines emphasize that CTG monitoring should not be routinely used in low-risk pregnancies, recommending clinical observation and intermittent auscultation as the standard practice for fetal assessment during labor [43,44]. CTG should only be initiated when specific signs of fetal compromise are present, as its interpretation is subject to considerable interobserver variability, low specificity, and high false-positive rates [45,46].
Evidence also indicates that continuous CTG monitoring is associated with an increased rate of obstetric interventions, particularly a rise of approximately 20% in cesarean deliveries, without demonstrating improvement in neonatal morbidity or mortality among low-risk populations [47]. These limitations are especially relevant in middle-income countries such as Mexico, where structural inequities in maternal healthcare persist. National data show that cesarean deliveries already account for over 45% of all births, often driven by non-medical factors and limited adherence to evidence-based monitoring practices [48,49]. Additionally, access to CTG equipment and specialized personnel is markedly unequal: municipalities with high proportions of rural or indigenous populations face restricted availability of second-level hospitals and advanced monitoring technologies [50]. These disparities contribute to a maternal mortality ratio of up to 63 deaths per 100,000 live births in low-income municipalities, compared with 41 in wealthier regions [50].
Given this context, the development of predictive systems that rely solely on routinely available clinical information, such as the BSOEM proposed here, becomes crucial. Our model provides an equitable and scalable solution that leverages a rich set of features that span both prepartum and intrapartum periods, including maternal age, comorbidities, medication use, behavioral risk factors, and pregnancy-specific indicators such as gestational age and fetal weight. It supports intrapartum risk stratification prior to neonatal outcome occurrence, enabling better resource allocation and targeted use of CTG only when clinically justified.
The BSOEM achieved an F1-score of 0.812, accuracy of 0.954, precision of 0.824, and recall of 0.805, using real-world clinical data including maternal health factors, comorbidities, and pregnancy characteristics. Compared to clinical-feature-based models such as [16,17,20,21,22], the BSOEM not only achieves superior performance but also benefits from the advantages of ensemble learning and a broader feature set. While those studies either avoided ensemble techniques or were limited by variable scope or postnatal outcomes, the BSOEM offers a prospective and flexible solution tailored for decision support.
It is important to note that, to the best of our knowledge, this is the first machine learning–based software tool trained with clinical data from Mexico for fetal risk prediction. In contrast to high-resource countries where such decision support systems are increasingly available, access to this type of technology in Mexico remains very limited, making our work one of the first steps toward bridging this gap.

Class-Specific Performance and Clinical Implications

Given the class imbalance and the safety-critical nature of fetal risk prediction, we analyze class-specific errors and fold-wise stability using confusion matrices and class-specific metrics to highlight clinically relevant trade-offs. Figure 8 presents the fold-wise confusion matrices for three representative configurations selected by F1-score: the best single trained model (top row), the best first-order ensemble (middle row), and the best second-order ensemble (bottom row). The best single model (XGB) shows competitive performance in several folds, achieving relatively high true-positive counts for the risky class; however, it exhibits noticeable fold-to-fold variability in false positives. The best first-order ensemble (middle row of Figure 8) reduces variance compared to the single model, exhibiting more consistent confusion patterns across folds. In particular, false positives among healthy cases are generally reduced while comparable detection of risky cases is maintained, reflecting the benefit of aggregating heterogeneous learners and improving robustness under different data partitions. The best second-order ensemble (bottom row) further stabilizes performance across folds, with more homogeneous confusion matrices and consistently higher true-negative counts. This meta-ensemble aggregates multiple first-order ensembles, leading to improved robustness and lower variance under fold resampling. Although some folds exhibit a modest reduction in sensitivity to the risky class compared to the best single model or first-order ensemble, the second-order ensemble demonstrates a more conservative and stable operating behavior, which may be preferable in deployment scenarios where minimizing false alarms in healthy cases is important.
Overall, these results illustrate the progressive variance reduction achieved when moving from a single model to first-order ensembles and then to second-order ensembles, at the cost of a clinically relevant trade-off between sensitivity and specificity.
Table 11 reports fold-wise sensitivity for the risky class, specificity for the healthy class, and precision for the risky class for the best single model (XGB), the best first-order ensemble (BFOEM), and the best second-order ensemble (BSOEM). These metrics provide a class-aware evaluation that is particularly relevant under the strong class imbalance present in fetal risk prediction.
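These three class-aware metrics follow directly from each fold's 2×2 confusion matrix, as sketched below; the matrix layout and the example counts are assumptions for illustration, not values from Table 11:

```python
import numpy as np

def class_metrics(cm):
    """Class-aware metrics from a 2x2 confusion matrix.

    Assumed layout: rows = true class, cols = predicted class,
    index 0 = Healthy, index 1 = Risky.
    """
    tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    sensitivity_risky = tp / (tp + fn)     # recall on the Risky class
    specificity_healthy = tn / (tn + fp)   # true-negative rate on Healthy
    precision_risky = tp / (tp + fp)       # reliability of Risky predictions
    return sensitivity_risky, specificity_healthy, precision_risky

cm = np.array([[930, 20],   # hypothetical fold counts
               [15, 35]])
sens, spec, prec = class_metrics(cm)
```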
The best single model (XGB) achieves the highest average sensitivity to risky cases (0.701), indicating a stronger capability to detect high-risk fetuses and reduce false negatives. However, this gain comes at the cost of the lowest average precision (0.549), reflecting a higher false-alarm rate when predicting the risky class. This trade-off suggests that XGB is more suitable for screening-oriented scenarios, where minimizing missed high-risk cases is prioritized, even if more healthy cases are incorrectly flagged.
The first-order ensemble (BFOEM) offers a more balanced operating point, with slightly lower sensitivity (0.689) but improved precision (0.612) and higher specificity (0.968) compared to the single model. This indicates that aggregating heterogeneous learners reduces false positives while largely preserving the ability to detect risky cases, yielding a better precision–recall trade-off for practical use.
The second-order ensemble (BSOEM) achieves the highest average specificity (0.977) and precision (0.673), substantially reducing false positives and producing more reliable positive predictions. Although its average sensitivity (0.634) is lower than that of XGB and BFOEM, BSOEM exhibits more stable performance across folds, suggesting improved robustness and generalization. Clinically, this conservative behavior may be preferable in confirmatory or resource-constrained settings, where unnecessary interventions should be minimized.
From a clinical perspective, false negatives (i.e., Risky cases misclassified as Healthy) represent the most safety-relevant type of error, as they may reduce anticipatory preparedness for neonatal support at delivery. Although the second-order ensemble shows slightly lower sensitivity than the best single model, it produces fewer false alarms and more consistent performance; importantly, AI-FRS is intended as a decision-support tool rather than an autonomous diagnostic system. The expected trade-off between sensitivity and precision/specificity across model configurations can thus be aligned with clinical priorities (e.g., screening-oriented configurations favoring higher sensitivity versus confirmatory use favoring higher specificity). Class imbalance was addressed through ADASYN-based resampling and a systematic second-order ensemble selection framework balancing recall-, precision-, and F1-optimized configurations. Future extensions could additionally incorporate weighted voting strategies within the second-order ensemble, assigning each member a class-dependent weight proportional to its recall on the Risky class, thereby explicitly penalizing false negatives during aggregation.
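Such a recall-weighted aggregation, proposed above as a future extension rather than the deployed voting rule, could be sketched as follows (function name and example values are hypothetical):

```python
import numpy as np

def recall_weighted_vote(member_preds, risky_recalls, threshold=0.5):
    """Weighted voting: each member's vote is scaled by its recall on the
    Risky class, so members that miss fewer high-risk cases carry more
    weight. Illustrative future extension, not the current hard-voting rule."""
    weights = np.asarray(risky_recalls, dtype=float)
    weights = weights / weights.sum()       # normalize to a convex combination
    stacked = np.vstack(member_preds).astype(float)
    risk_score = weights @ stacked          # weighted fraction voting Risky
    return (risk_score >= threshold).astype(int), risk_score

preds = [np.array([1, 0, 1, 0]), np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])]
labels, scores = recall_weighted_vote(preds, risky_recalls=[0.70, 0.65, 0.60])
```

Lowering `threshold` below 0.5 would further penalize false negatives by flagging a case as Risky on a weighted minority of votes.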
Results Availability. To ensure full transparency of the experimental analysis, all detailed evaluation results for the first-order and second-order ensemble configurations (including fold-wise metrics, confusion matrices, and average performance) are publicly available in the project repository at: https://github.com/AbimaelGP/AI-FRS/tree/main/Results (accessed on 25 February 2026).

5. System Interface Description

To translate the predictive ensemble models into a clinically usable tool, we developed AI-FRS, a machine learning-based decision-support system for estimating maternal and fetal risk. The system was implemented using Python 3.8.0 and the tkinter library (v8.6.12), allowing healthcare professionals to interact with the models through a graphical interface without the need for programming knowledge.
Figure 9 illustrates the processing pipeline and graphical user interface (GUI) of the AI-FRS system. The GUI allows healthcare professionals to input the 18 clinical features used by the BSOEM. Input validation is internally managed through radio buttons and sliders to ensure consistency and prevent data-entry errors. Once all information is entered, the user initiates the prediction process by pressing the Predict button. At this stage, the system transfers the input data to the BSOEM, which computes the fetal risk prediction. The output is displayed both textually and visually—showing the predicted fetal status (Healthy or Risky), a color-coded risk bar, and a console section that transparently reports model votes and the estimated risk probability. Predictions are generated in real time, with an average response time of approximately 0.8 s.
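The console report described above can be illustrated with a minimal sketch; the exact output format of AI-FRS may differ, and the function name and example votes are hypothetical:

```python
def summarize_prediction(member_votes):
    """Aggregate the first-order ensemble votes of the BSOEM into a textual
    report: one line per member plus the aggregated risk probability.
    Illustrative of the console output, not the verbatim AI-FRS format."""
    prob = sum(member_votes) / len(member_votes)
    status = "Risky" if prob >= 0.5 else "Healthy"
    lines = [f"Ensemble {i + 1}: {'Risky' if v else 'Healthy'}"
             for i, v in enumerate(member_votes)]
    lines.append(f"Estimated risk probability: {prob:.2f} -> {status}")
    return "\n".join(lines)

# Seven member votes, matching the seven first-order ensembles in the BSOEM
print(summarize_prediction([1, 0, 1, 1, 0, 1, 1]))
```

Reporting each member's vote alongside the aggregated probability is what allows the clinician to see how the final decision was reached, as discussed in Section 4.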
All predictions were executed on a standard laptop equipped with an Intel® Core™ i7-6820HQ CPU @ 2.70 GHz and 16 GB of RAM, demonstrating that the system operates efficiently without the need for high-performance computing resources. Its rapid response time and low hardware requirements make it suitable for real-time use in clinical settings and point-of-care environments.

6. Clinical Implications and Limitations of AI-FRS

6.1. Clinical Implications

The AI-FRS system, powered by the BSOEM architecture, is designed to integrate into existing hospital information systems (HIS) or to operate as a standalone clinical decision support tool. The intended end-users are attending obstetricians and labor nurses, who can employ the probabilistic outputs as a structured intrapartum risk stratification aid prior to neonatal outcome assessment, rather than as an autonomous diagnostic system.
From a practical clinical perspective, AI-FRS provides a structured and objective framework for peripartum risk assessment using information routinely available before neonatal evaluation. By estimating the probability of neonatal respiratory compromise prior to birth, the system allows obstetric teams to anticipate potential complications, prepare appropriate neonatal support measures, and allocate monitoring resources more efficiently. In resource-constrained environments, such stratification may facilitate prioritization of higher-risk cases, ensuring timely availability of skilled personnel and resuscitation equipment when needed.
Regarding scalability and deployment feasibility, although the training and evaluation phase of the second-order ensemble were computationally demanding, the inference stage is lightweight and highly efficient. Once trained, the model exhibits a minimal memory footprint and can be executed in real time on standard clinical desktop computers or low-cost devices, without requiring specialized hardware such as GPUs. This characteristic supports maintainability and sustainability in low-resource healthcare environments, where advanced monitoring technologies may not be readily available.
Importantly, AI-FRS is conceived as a supportive decision aid embedded within existing clinical workflows. Its outputs are intended to complement clinical judgment and institutional protocols, contributing to more standardized and transparent fetal risk documentation and communication among healthcare professionals, while preserving clinician responsibility for final decision-making.
Finally, while the graphical user interface demonstrates technical feasibility, formal usability and human–computer interaction (HCI) evaluations in real clinical environments have not yet been conducted. Such studies will be essential to assess workflow integration, user trust, and clinical acceptance before routine clinical use.

6.2. Limitations

The dataset utilized in this study was collected from a single public hospital in the state of Chihuahua, Mexico, which may limit its generalization to other populations or healthcare contexts. Nevertheless, this study represents one of the first initiatives in Mexico to develop an artificial intelligence system for fetal risk prediction using real clinical data under strict ethical and institutional supervision. The study population, composed predominantly of indigenous and low-income women, provides valuable insight into an underserved demographic rarely represented in maternal–fetal AI research. The inclusion of this population addresses an important gap in the current literature, as most fetal health AI systems are developed using datasets from high-income countries where the representation of Latin American or indigenous populations remains extremely limited. Therefore, we believe that rather than contributing to bias, this study helps increase global diversity, representation, and inclusivity in maternal–fetal artificial intelligence research. Consistent with Fiorentino et al. [51], we acknowledge that fetal health AI systems may perpetuate demographic and socioeconomic biases. Therefore, the predictions generated by AI-FRS should be interpreted as context-specific and primarily applicable to populations with comparable clinical and socioeconomic characteristics. Future research will focus on expanding this work through multicenter validation across hospitals from different regions of Mexico to further assess and strengthen the model's generalizability.

At the time of this study, no independent external validation dataset was available. Although nested cross-validation was used to estimate internal generalization performance, it does not replace external validation; therefore, the reported results should be interpreted as internally validated within a single healthcare setting.
Although the AI-FRS system demonstrates strong performance and practical usability, its predictions are inherently limited by the characteristics of the training data. Figure 10 presents the feature distribution across the clinical dataset. A notable limitation is the strong class imbalance and sparse representation of certain variable categories. For instance, values corresponding to Preeclampsia: Severe, Diabetes: Yes, and Drug Use: Yes are underrepresented, which may lead to lower predictive performance in real-world scenarios where such cases are more prevalent. Furthermore, several features show low variability or heavily skewed distributions. This lack of distributional diversity restricts the model's exposure during training, causing potential generalization issues when encountering values outside of the most frequent ranges. Consequently, predictions for patients presenting rare but clinically significant profiles may be less reliable. Continuous data collection and periodic model retraining with more diverse cohorts are planned to improve robustness and generalizability.
Another important limitation is that the current version of AI-FRS models fetal risk using static prepartum and intrapartum clinical snapshots rather than longitudinal temporal trajectories. This design simplifies data collection and facilitates integration into routine clinical workflows, but it also implies that the system captures associations rather than causal relationships and cannot describe the temporal evolution of fetal risk across prenatal visits. Future work should explore time-series modeling and longitudinal data integration (e.g., repeated clinical assessments or fetal monitoring signals) to better exploit temporal patterns.
These observations emphasize the importance of cautious interpretation when applying the model in practice. Continuous data collection and retraining with more diverse and balanced datasets will be critical to improving robustness and extending applicability to a wider clinical population.

7. Interpretability Analysis

This interpretability analysis is included as clinical deployment requires not only predictive performance but also transparent reasoning about which variables drive risk estimation. Importantly, these explanations describe associative patterns learned from the data and should not be interpreted as evidence of intrinsic biological causation.
To provide a transparent explanation of the model’s decision process, we applied the SHAP (SHapley Additive exPlanations) framework [52]. SHAP is a method that quantifies the individual contribution of each feature to a model’s output. This provides valuable insights into how each clinical variable influenced the fetal risk prediction.
The SHAP evaluation was conducted using the trained models corresponding to Fold 2, which was selected as a representative high-performing fold for interpretability analysis. The fold-specific Z-score scaler computed from the training subset of Fold 2 was consistently applied to the test data prior to SHAP computation. This ensures that feature attributions are derived from the same standardized feature space used during model training, thereby maintaining methodological consistency and preventing data leakage.
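The leakage-safe scaling workflow can be sketched as follows; synthetic data stands in for the Fold-2 subsets, and the subsequent SHAP call is indicated only as a comment:

```python
import numpy as np

# Illustrative fold-consistent standardization (assumed workflow): the scaler
# statistics come ONLY from the Fold-2 training subset and are then applied
# to the test subset before computing SHAP values, preventing data leakage.
rng = np.random.default_rng(2)
X_train = rng.normal(loc=30.0, scale=4.0, size=(200, 3))  # e.g., age, weeks, weight
X_test = rng.normal(loc=30.0, scale=4.0, size=(50, 3))

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

Z_train = (X_train - mu) / sigma   # space in which the models were fitted
Z_test = (X_test - mu) / sigma     # same transform -> same feature space
# shap.TreeExplainer(model).shap_values(Z_test) would then be computed here.
```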
Figure 10a shows that the most influential features were weeks of gestation, fetal weight, event type, oxytocin use, and maternal age, all of which are clinically coherent indicators of fetal well-being. Specifically, in Figure 10b, lower values of weeks of gestation and fetal weight (blue points) were associated with an increased predicted risk, a pattern that is compatible with known clinical associations with prematurity and growth restriction.

Interestingly, higher values of Event Type (cesarean delivery) and Oxytocin Use were associated with lower predicted risk. This pattern may reflect contextual characteristics of clinical management within the dataset, where cesarean delivery and oxytocin administration occur under structured medical supervision, rather than representing intrinsic protective factors. Maternal age also showed a mild positive contribution to risk, consistent with the higher incidence of obstetric complications at extreme reproductive ages. Additionally, higher values of Amniotic Fluid Index and Placental Maturity Grade were associated with lower predicted risk. This finding aligns with clinical physiology, as adequate amniotic fluid volume reflects proper fetal renal and placental function, while greater placental maturity typically corresponds to late gestational stages, when fetal development is complete and risk is naturally reduced. Carbetocin Use was associated with a lower predicted fetal risk, reflecting that its administration commonly occurs in controlled or elective deliveries with favorable neonatal outcomes.

Regarding Preeclampsia Grade, the model revealed a non-linear relationship with fetal risk. As expected, the absence of preeclampsia was associated with healthy fetal outcomes, while mild preeclampsia corresponded to a higher predicted risk, consistent with its potential to compromise placental function and fetal oxygenation.
Interestingly, some cases classified as severe preeclampsia were associated with lower predicted risk, likely because these severe cases received specialized attention and timely intervention, contributing to favorable predictions.
As expected, Smoking was positively associated with fetal risk. In contrast, variables such as Diabetes, Delivery Hours, Antihypertensive Use, Dexamethasone Administration, Number of Fetuses, and Misoprostol Use were associated with healthy fetal outcomes. Although these associations may appear counterintuitive, they may be influenced by the clinical context represented in the dataset, where these conditions are typically managed within institutional settings under close medical supervision. In such cases, preventive or therapeutic interventions, such as glucose control, antihypertensive therapy, corticosteroid administration to promote fetal lung maturation, and labor induction protocols, tend to mitigate adverse outcomes, resulting in healthy neonatal conditions.
These findings may highlight that the model not only captures physiological relationships but also reflects patterns of medical care and intervention within the studied population. However, the apparent behavior of some features may also be influenced by their limited representation in the dataset. As shown in the histograms of Figure 1, less than 5% of the samples reported medication use (e.g., betamethasone or misoprostol), and cases involving comorbidities were infrequent. Such low variability restricts the model’s ability, and consequently SHAP’s capacity, to infer reliable contribution patterns, which may lead to underestimation of clinically relevant variables. This underscores the importance of enhancing generalization and capturing different management approaches. Therefore, future work should incorporate data from multiple hospitals and regions, encompassing diverse obstetric protocols and clinical criteria to strengthen the robustness and clinical relevance of future versions of the model.
It is critical to emphasize that the feature importance derived from SHAP values reflects associative model behavior rather than intrinsic biological causation. Consequently, these interpretability tools transparently explain how the model reaches its predictions based on correlated features, which fosters clinical trust, but they must not be over-interpreted as causal risk factors.

8. Conclusions

This study presents a clinically grounded machine learning approach for fetal state prediction through the development of ensemble models using real-world Mexican obstetric data. Among the seven individual models evaluated, XGBoost achieved the best F1-score (0.791 ± 0.032), while CNN obtained the highest accuracy (0.953 ± 0.004) and precision (0.863 ± 0.054), albeit with a lower recall (0.714 ± 0.020), reflecting trade-offs that are common in clinical prediction tasks. To mitigate individual model limitations and enhance performance stability, this work explored an extensive ensemble evaluation strategy. A total of 127 first-order ensemble combinations were generated and assessed, from which the 15 top-performing combinations were selected (based on precision, recall, and F1-score) to create 32,752 second-order ensemble configurations. This exhaustive exploration led to the identification of the Best Second-Order Ensemble Model (BSOEM), which strategically combines seven high-performing ensembles. The BSOEM yielded superior performance with an accuracy of 0.954 ± 0.011, precision of 0.824 ± 0.060, recall of 0.805 ± 0.025, and an F1-score of 0.812 ± 0.037, surpassing not only individual models but also all first-order ensembles.
Compared to prior studies relying on fetal CTG data, our model adopts a more clinical approach. While CTG-based models focus on fetal heart rate patterns and variability metrics, they neglect essential maternal factors, comorbidities, and intrapartum interventions. In contrast, our BSOEM incorporates a rich set of 18 clinical variables encompassing maternal age, gestational weeks, behavioral risk indicators (e.g., smoking, drug use), comorbidities (e.g., preeclampsia, diabetes), medication history (e.g., betamethasone, antihypertensives), and birth-related procedures (e.g., oxytocin, carbetocin use). This holistic representation allows for a broader contextual understanding and improved application generalizability, particularly in settings where CTG is not available.
Importantly, when compared to clinical-feature-based models, the BSOEM also demonstrates competitive performance and methodological advantages. For example, some studies used postnatal outcomes or excluded treatment-related variables such as oxytocin or corticosteroid administration. In contrast, the BSOEM integrates both prepartum and intrapartum indicators and leverages ensemble methods for better generalization.
To the best of our knowledge, this work represents the first machine learning–based software framework for fetal risk prediction developed and validated with clinical data from Mexico. While similar decision support tools are beginning to emerge in high-resource settings, their availability in Mexico remains limited. Therefore, the AI-FRS system constitutes an important contribution toward the integration of artificial intelligence into obstetric care in low- and middle-income contexts, where access to advanced monitoring tools is often restricted. Another significant contribution of this study is the methodological framework for ensemble optimization, demonstrating that combining precision-focused and recall-focused models yields more balanced and clinically actionable predictions. The resulting AI-FRS tool also bridges the gap between predictive modeling and clinical usability by offering real-time decision support through a user-friendly, low-latency graphical interface. Nonetheless, the model’s performance remains conditioned by data distribution. Sparse representation of some clinical scenarios (e.g., severe preeclampsia, drug use) may affect reliability in underrepresented subgroups. Thus, ongoing model retraining with broader and more balanced datasets will be crucial for scaling deployment across diverse clinical environments.
While this study focused on well-established and interpretable models (ANN, CNN, DT, RF, XGBoost, LightGBM, and CatBoost), future work will explore emerging tabular architectures (e.g., Kolmogorov–Arnold Networks) on the fetal risk prediction task to further assess their clinical applicability.

Author Contributions

Conceptualization, A.G.-P. and B.O.E.-G.; Methodology, A.G.-P. and B.O.E.-G.; Software, A.G.-P., B.O.E.-G. and J.C.-C.; Validation, A.G.-P., B.O.E.-G. and C.R.A.-T.; Formal analysis, A.G.-P., B.O.E.-G., G.R.-A. and L.C.H.-G.; Investigation, A.G.-P.; Data curation, G.R.-A.; Writing—original draft, A.G.-P., B.O.E.-G., G.R.-A., J.C.-C., C.R.A.-T. and L.C.H.-G.; Visualization, A.G.-P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study protocol was approved by the Research Ethics Committee of the Faculty of Medicine and Biomedical Sciences at Autonomous University of Chihuahua (Universidad Autónoma de Chihuahua), Register number C1-038-21, on 10 February 2022. The study was conducted in accordance with the ethical standards outlined in the Declaration of Helsinki and the institutional guidelines of the aforementioned ethics committee.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. The research involved anonymized clinical data obtained from an observational and longitudinal study, ensuring the protection of participants’ privacy and rights throughout the data collection and analysis process. This study does not constitute a clinical trial and did not involve any prospective intervention or assignment of treatment.

Data Availability Statement

The dataset generated during the current study is not publicly available at this time, as it is undergoing institutional registration and documentation. Once the registration process is complete, the dataset will be made available on the following link: https://github.com/AbimaelGP/AI-FRS (accessed on 25 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. World Health Organization. Trends in Maternal Mortality 2000 to 2020: Estimates by WHO, UNICEF, UNFPA, World Bank Group and UNDESA/Population Division; World Health Organization: Geneva, Switzerland, 2023. [Google Scholar] [CrossRef]
  2. Instituto Nacional de Estadística y Geografía. Estadísticas de Defunciones Fetales (EDF) 2022. 2022. Available online: https://n9.cl/lofjv (accessed on 13 November 2025).
  3. UNICEF. A Neglected Tragedy: The Global Burden of Stillbirths—Report of the UN Inter-Agency Group for Child Mortality Estimation. 2020. Available online: https://data.unicef.org/resources/a-neglected-tragedy-stillbirth-estimates-report/ (accessed on 5 December 2025).
  4. Lawn, J.E.; Blencowe, H.; Waiswa, P.; Amouzou, A.; Mathers, C.; Hogan, D.; Flenady, V.; Frøen, J.F.; Qureshi, Z.U.; Calderwood, C.; et al. Stillbirths: Rates, risk factors, and acceleration towards 2030. Lancet 2016, 387, 587–603. [Google Scholar] [CrossRef]
  5. Oros Ruiz, M.; Perejón López, D.; Serna Arnaiz, C.; Siscart Viladegut, J.; Baldó, J.À.; Sol, J. Maternal and foetal complications of pregestational and gestational diabetes: A descriptive, retrospective cohort study. Sci. Rep. 2024, 14, 9017. [Google Scholar] [CrossRef]
  6. Wang, S.; Rexrode, K.M.; Florio, A.A.; Rich-Edwards, J.W.; Chavarro, J.E. Maternal Mortality in the United States: Trends and Opportunities for Prevention. Annu. Rev. Med. 2023, 74, 199–216. [Google Scholar] [CrossRef]
  7. Morales-Suárez-Varela, M.; Peraita-Costa, I.; Perales-Marín, A.; Llopis-Morales, A.; Llopis-González, A. Risk of Gestational Diabetes due to Maternal and Partner Smoking. Int. J. Environ. Res. Public Health 2022, 19, 925. [Google Scholar] [CrossRef] [PubMed]
  8. Patra, J.; Bakker, R.; Irving, H.; Jaddoe, V.W.V.; Malini, S.; Rehm, J. Dose-response relationship between alcohol consumption before and during pregnancy and the risks of low birthweight, preterm birth and small for gestational age (SGA)—A systematic review and meta-analyses. BJOG Int. J. Obstet. Gynaecol. 2011, 118, 1411–1421. [Google Scholar] [CrossRef] [PubMed]
  9. Stuebe, A.M.; Oken, E.; Gillman, M.W. Associations of diet and physical activity during pregnancy with risk for excessive gestational weight gain. Am. J. Obstet. Gynecol. 2009, 201, 58.e1–58.e8. [Google Scholar] [CrossRef]
  10. Salinas, M.; Bertini, A.; Osorio, C.; Ibacache, A.; Salas, R.; Pardo, F. Chapter 14—Artificial intelligence for prediction of perinatal health. In Robotics and Artificial Intelligence for Reproductive Medicine; Academic Press: Cambridge, MA, USA, 2026. [Google Scholar] [CrossRef]
  11. Prins, L.I.; Bruin, C.M.; Kornaat, E.M.; Pels, A.; Gordijn, S.J.; Naaktgeboren, C.A.; Ganzevoort, W. Prediction of perinatal mortality in early-onset fetal growth restriction: A post hoc analysis of the Dutch STRIDER trial to predict perinatal mortality in early-onset fetal growth restriction. Eur. J. Obstet. Gynecol. Reprod. Biol. 2025, 304, 23–29. [Google Scholar] [CrossRef] [PubMed]
  12. Kuzu, A.; Santur, Y. Early Diagnosis and Classification of Fetal Health Status from a Fetal Cardiotocography Dataset Using Ensemble Learning. Diagnostics 2023, 13, 2471. [Google Scholar] [CrossRef]
  13. Zannah, T.B.; Tonni, S.I.; Sheakh, M.A.; Tahosin, M.S.; Sarower, A.H.; Begum, M. Comparative performance analysis of ensemble learning methods for fetal health classification. Inform. Med. Unlocked 2025, 56, 101656. [Google Scholar] [CrossRef]
  14. Nazli, I.; Korbeko, E.; Dogru, S.; Kugu, E.; Sahingoz, O.K. Early Detection of Fetal Health Conditions Using Machine Learning for Classifying Imbalanced Cardiotocographic Data. Diagnostics 2025, 15, 1250. [Google Scholar] [CrossRef]
  15. Mondal, S.; Maity, R.; Nag, A.; Ghosh, S. Fetal health risk prediction using ensemble-based machine learning approaches. Knowl. Inf. Syst. 2025, 67, 7227–7261. [Google Scholar] [CrossRef]
  16. Akbulut, A.; Ertugrul, E.; Topcu, V. Fetal health status prediction based on maternal clinical history using machine learning techniques. Comput. Methods Programs Biomed. 2018, 163, 87–100. [Google Scholar] [CrossRef]
  17. Roozbeh, N.; Montazeri, F.; Farashah, M.V.; Mehrnoush, V.; Darsareh, F. Proposing a machine learning-based model for predicting nonreassuring fetal heart. Sci. Rep. 2025, 15, 7812. [Google Scholar] [CrossRef] [PubMed]
  18. Jain, E.; Kaushik, P.; Kukreja, V.; Sakshi; Dogra, A.; Goyal, B. Fetal Diagnostics using Vision Transformer for Enhanced Health and Severity Prediction in Ultrasound Imaging. Curr. Med. Imaging 2025, 21, E15734056360199. [Google Scholar] [CrossRef] [PubMed]
  19. Yenkikar, A.; Singh, V.K.; Tamboli, G.; Charkha, P.; Bodke, S.; Bidwe, R.V.; Bali, M. A multi-modal AI framework integrating Siamese networks and few-shot learning for early fetal health risk assessment. MethodsX 2025, 15, 103618. [Google Scholar] [CrossRef]
  20. Abdi, F.; Roozbeh, N.; Darsareh, F.; Mehrnoush, V.; Farashah, M.S.V.; Montazeri, F. Developing a prognostic model for predicting preterm birth using a machine learning algorithm. BMC Pregnancy Childbirth 2025, 25, 974. [Google Scholar] [CrossRef]
  21. Malacova, E.; Tippaya, S.; Bailey, H.D.; Chai, K.; Farrant, B.M.; Gebremedhin, A.T.; Leonard, H.; Marinovich, M.L.; Nassar, N.; Phatak, A.; et al. Stillbirth risk prediction using machine learning for a large cohort of births from Western Australia, 1980–2015. Sci. Rep. 2020, 10, 5354. [Google Scholar] [CrossRef]
  22. Zimmerman, R.M.; Hernandez, E.J.; Yandell, M.; Tristani-Firouzi, M.; Silver, R.M.; Grobman, W.; Haas, D.; Saade, G.; Steller, J.; Blue, N.R. AI-based analysis of fetal growth restriction in a prospective obstetric cohort quantifies compound risks for perinatal morbidity and mortality and identifies previously unrecognized high risk clinical scenarios. BMC Pregnancy Childbirth 2025, 25, 80. [Google Scholar] [CrossRef] [PubMed]
  23. Hadlock, F.P.; Harrist, R.B.; Sharman, R.S.; Deter, R.L.; Park, S.K. Estimation of fetal weight with the use of head, body, and femur measurements—A prospective study. Am. J. Obstet. Gynecol. 1985, 151, 333–337. [Google Scholar] [CrossRef]
  24. Kim, H.Y. Analysis of variance (ANOVA) comparing means of more than two groups. Restor. Dent. Endod. 2014, 39, 74. [Google Scholar] [CrossRef]
  25. McHugh, M.L. The Chi-square test of independence. Biochem. Medica 2013, 23, 143–149. [Google Scholar] [CrossRef]
  26. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  27. Tsoukalas, L.H.; Uhrig, R.E. Fuzzy and Neural Approaches in Engineering; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1996. [Google Scholar]
  28. Neapolitan, R.E. Neural Networks and Deep Learning. In Artificial Intelligence; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018. [Google Scholar] [CrossRef]
  29. Rokach, L.; Maimon, O. Decision Trees. In Data Mining and Knowledge Discovery Handbook; Springer: Boston, MA, USA, 2005; pp. 165–192. [Google Scholar] [CrossRef]
  30. Tangirala, S. Evaluating the impact of GINI index and information gain on classification using decision tree classifier algorithm. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 612–619. [Google Scholar] [CrossRef]
  31. Biau, G.; Scornet, E. A random forest guided tour. TEST 2016, 25, 197–227. [Google Scholar] [CrossRef]
  32. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
  33. Machado, M.R.; Karray, S.; De Sousa, I.T. LightGBM: An effective decision tree gradient boosting method to predict customer loyalty in the finance industry. In Proceedings of the 14th International Conference on Computer Science and Education (ICCSE); IEEE: Toronto, ON, Canada, 2019; pp. 1111–1116. [Google Scholar] [CrossRef]
  34. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 2–8 December 2018; pp. 1–12. [Google Scholar]
  35. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  36. Scikit-Learn Developers. Cross-Validation: Evaluating Estimator Performance, “Stratified K-Fold”. 2024. Available online: https://scikit-learn.org/stable/modules/cross_validation.html (accessed on 14 October 2025).
  37. Allgaier, J.; Pryss, R. Cross-Validation Visualized: A Narrative Guide to Advanced Methods. Mach. Learn. Knowl. Extr. 2024, 6, 1378–1388. [Google Scholar] [CrossRef]
  38. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the International Joint Conference on Neural Networks (IJCNN); IEEE: Hong Kong, China, 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
  39. Carvalho, M.; Pinho, A.J.; Brás, S. Resampling approaches to handle class imbalance: A review from a data perspective. J. Big Data 2025, 12, 71. [Google Scholar] [CrossRef]
  40. Mohammed, A.; Kora, R. An effective ensemble deep learning framework for text classification. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8825–8837. [Google Scholar] [CrossRef]
  41. Almohimeed, A.; Saad, R.M.; Mostafa, S.; El-Rashidy, N.M.; Farrag, S.; Gaballah, A.; Elaziz, M.A.; El-Sappagh, S.; Saleh, H. Explainable Artificial Intelligence of Multi-Level Stacking Ensemble for Detection of Alzheimer’s Disease Based on Particle Swarm Optimization and the Sub-Scores of Cognitive Biomarkers. IEEE Access 2023, 11, 123173–123193. [Google Scholar] [CrossRef]
  42. Bosch, N.; Shchur, O.; Erickson, N.; Bohlke-Schneider, M.; Turkmen, A.C. Multi-layer Stack Ensembles for Time Series Forecasting. In Proceedings of the AutoML Conference 2025, Methods Track, New York, NY, USA, 8–11 September 2025; pp. 1–10. [Google Scholar]
  43. Ayres-de Campos, D.; Spong, C.Y.; Chandraharan, E. FIGO consensus guidelines on intrapartum fetal monitoring: Cardiotocography. Int. J. Gynecol. Obstet. 2015, 131, 13–24. [Google Scholar] [CrossRef]
  44. NICE. Fetal Monitoring in Labour. NICE Guideline NG229. 2022. Available online: https://www.nice.org.uk/guidance/ng229 (accessed on 10 November 2025).
  45. Lukhele, S.; Mulaudzi, F.M.; Gundo, R. Factors contributing to visual intrapartum cardiotocograph interpretation variation among healthcare professionals: An integrative review. PLoS ONE 2025, 20, e0315761. [Google Scholar] [CrossRef] [PubMed]
  46. Blix, E.; Brurberg, K.G.; Reierth, E.; Reinar, L.M.; Øian, P. ST waveform analysis vs cardiotocography alone for intrapartum fetal monitoring: A systematic review. Acta Obstet. Gynecol. Scand. 2024, 103, 437–448. [Google Scholar] [CrossRef]
  47. Devane, D.; Lalor, J.G.; Daly, S.; McGuire, W.; Smith, V. Cardiotocography versus intermittent auscultation on admission to labour ward for fetal wellbeing. Cochrane Database Syst. Rev. 2017, 1, CD005122. [Google Scholar] [CrossRef]
  48. Uribe-Leitz, T.; Barrero-Castillero, A.; Cervantes-Trejo, A.; Santos, J.M.; de la Rosa-Rabago, A.; Lipsitz, S.R.; Basavilvazo-Rodriguez, M.A.; Shah, N.; Molina, R.L. Trends of caesarean delivery from 2008 to 2017, Mexico. Bull. World Health Organ. 2019, 97, 502–512. [Google Scholar] [CrossRef]
  49. Vázquez Corona, M.; Betrán, A.P.; Bohren, M.A. The portrayal and perceptions of cesarean section in Mexican media Facebook pages: A mixed-methods study. Reprod. Health 2022, 19, 49. [Google Scholar] [CrossRef] [PubMed]
  50. Secretaría de Salud; Organización Panamericana de la Salud (OPS). Primer Informe Sobre Desigualdades en Salud en México; Secretaría de Salud/Observatorio Nacional de Inequidades en Salud: Ciudad de México, Mexico, 2019; p. 68. [Google Scholar]
  51. Fiorentino, M.C.; Moccia, S.; Cosmo, M.D.; Frontoni, E.; Giovanola, B.; Tiribelli, S. Uncovering ethical biases in publicly available fetal ultrasound datasets. npj Digit. Med. 2025, 8, 355. [Google Scholar] [CrossRef] [PubMed]
  52. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774. [Google Scholar]
Figure 1. Normalized histograms of features by fetal state.
Figure 2. Feature importance using Random Forest analysis.
Figure 3. Training and cross-validation strategy for the base learners. Arrows indicate the direction of data flow and processing steps. Solid lines represent the main processing pipeline, while dashed lines denote the outer and inner cross-validation loops.
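The outer/inner loop structure described in this caption can be sketched as plain index bookkeeping. The names below (`k_folds`, `nested_splits`) are ours for illustration, not the authors' implementation; the sketch omits stratification, ADASYN resampling, and the model training that the paper performs inside these loops.

```python
import random

def k_folds(items, k, seed=0):
    """Shuffle a copy of `items` and split it into k disjoint, near-equal folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]

def nested_splits(n_samples, k_outer=5, k_inner=3):
    """Enumerate (train, validation, test) index lists for nested cross-validation.

    Each outer fold is held out once as the test set; the remaining samples
    are re-split by the inner loop into train/validation folds used for
    hyperparameter selection, so the test fold never leaks into tuning.
    """
    splits = []
    for test in k_folds(range(n_samples), k_outer):
        held_in = [i for i in range(n_samples) if i not in set(test)]
        for val in k_folds(held_in, k_inner, seed=1):
            train = [i for i in held_in if i not in set(val)]
            splits.append((train, val, test))
    return splits
```

With 5 outer and 3 inner folds this yields 15 train/validation/test partitions, each covering every sample exactly once.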
Figure 4. Overview of the complete two-stage ensemble selection framework for fetal risk prediction. Arrows indicate the direction of data flow. Colors distinguish ensemble selection criteria, and ellipses (“...”) indicate omitted combinations for visualization purposes.
Figure 5. Performance comparison of first-order ensemble model combinations for the fetal state prediction task on F1-score.
Figure 6. Performance comparison of second-order ensemble model combinations for the fetal state prediction task sorted by F1-score. Ellipses (“...”) indicate omitted labels for visualization purposes.
Figure 7. F1-score distribution across the five outer folds for the best second-order ensemble model, BSOEM, and the top three first-order ensembles based on F1-score. Each violin plot represents the variability and central tendency (mean ± standard deviation) of F1 values across folds. BSOEM achieves the best trade-off between a high mean F1 and fold-to-fold stability compared with the first-order ensembles. Blue lines show variability, and orange dots indicate the mean F1-score.
Figure 8. Fold-wise confusion matrices for the best single model (top row), best first-order ensemble (middle row), and best second-order ensemble (bottom row), selected according to the F1-score analysis shown in Table 8 and Table 9. Numbers indicate counts (TP, TN, FP, FN); correct predictions are highlighted, while misclassifications appear in darker colors.
Figure 9. The process begins with user input of 18 clinical features (e.g., maternal age, gestational weeks, fetal weight, comorbidities, and medication use) through a Graphical User Interface (GUI). The GUI performs internal validation to ensure consistency and then feeds the data into the BSOEM ensemble model for fetal risk prediction. The output interface displays the predicted fetal risk status (Healthy or Risky) accompanied by a color-coded probability bar. A console-style output provides additional transparency, showing the votes from individual models and the computed risk probability.
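The hard-voting aggregation behind the displayed votes and risk probability can be sketched minimally. `hard_vote`, its tie rule, and the interpretation of the probability bar as the fraction of "Risky" votes are our illustrative assumptions, not the deployed AI-FRS code.

```python
from collections import Counter

def hard_vote(votes):
    """Aggregate binary votes (0 = Healthy, 1 = Risky) by simple majority.

    The displayed "risk probability" is approximated as the fraction of
    Risky votes. Ties default to Healthy here, although they cannot occur
    with an odd number of voters, as in the seven-member BSOEM.
    """
    counts = Counter(votes)
    label = 1 if counts[1] > counts[0] else 0
    return label, counts[1] / len(votes)

# Seven constituent ensembles casting votes: 5 of 7 say "Risky".
label, risk_prob = hard_vote([1, 0, 1, 1, 0, 1, 1])
```

Here `label` is 1 (Risky) and `risk_prob` is 5/7 ≈ 0.714, which would drive the color-coded probability bar.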
Figure 10. Global interpretability of the proposed BSOEM using SHAP values. (a) Summary plot showing the individual contribution of each variable on fetal risk prediction. (b) Mean absolute SHAP values indicating the overall feature importance. The analysis highlights that weeks of gestation, fetal weight, event type, oxytocin use, and maternal age are the most influential variables within the predictive structure of the model, and their relevance is broadly consistent with patterns reported in obstetric literature.
Table 1. Related works summary.
Type of Study | Study, Year | ML Models | Metrics | Dataset (Size & Origin) | Number of Features (Examples) | Timing | Location of Data | Cross Validation | Advantages | Opportunity Areas
Ultrasound image-based studies | Jain et al., 2025 [18] | Vision Transformer (ViT) dual-head architecture (classification + severity regression) | Accuracy 89%, F1 0.875, Precision 0.88, Recall 0.87 | 500 fetal ultrasound images (hospital dataset, multi-source) | Image embeddings extracted from 224 × 224 ultrasound images (divided into 16 × 16 patches with positional encoding) | Prepartum | Private clinical sources (unspecified location) | Train/validation split (80/20) | Combined classification and severity prediction; ViT outperformed CNN and SVM | Small dataset; limited to 2D ultrasound and lacks external validation
| Yenkikar et al., 2025 [19] | Siamese Neural Network (SNN) with few-shot learning, hybrid contrastive loss, shared-weight CNN backbone | Accuracy 98.1% | Ultrasound images (12,400 normal from Zenodo + 767 anomalous from YouTube) | Paired image embeddings (normal vs. anomalous) learned through contrastive similarity | Prepartum | Public datasets (Zenodo + YouTube) | Train/validation split (70/30) | Few-shot learning mitigated data scarcity and enabled accurate anomaly detection with limited samples | Limited to binary classification; normal and anomalous images came from different acquisition domains, suggesting the model may have learned device-specific features rather than true anatomical differences
CTG-based studies | Zannah et al., 2025 [13] | Ensemble of Decision Tree, Random Forest and Gradient Boosting | Accuracy 99.56%, AUROC ≈ 0.98, F1 96.68% | ∼3614 CTG records (UCI + Kaggle), Portuguese hospital | 21 (Baseline FHR, accelerations/decelerations, short & long term variability, FHR histogram stats) | Prepartum | Portugal | 20-fold CV | State-of-the-art accuracy on CTG; Local Interpretable Model-agnostic Explanations | Cannot capture broader clinical context (maternal comorbidities, obstetric history, socio-economic factors)
| Nazli et al., 2025 [14] | CatBoost, LightGBM, Random Forest, SVM, ANN, DNN (with SMOTE) | Accuracy 90.7%, Balanced Acc. 91.3% | 2126 CTG records (UCI “Fetal Health” dataset, Portugal) | 21 (Baseline FHR, accelerations, decelerations, uterine contraction count, STV/LTV variability) | Prepartum | Portugal | 5-fold CV | High balanced accuracy; addressed class imbalance with SMOTE; practical interface | Lacks integration of clinical variables (maternal age, comorbidities, prior obstetric events)
| Kuzu & Santur, 2023 [12] | Voting ensemble: Logistic Regression, Random Forest, Gradient Boosting, XGBoost | Accuracy > 99.5%, F1/Precision/Recall ≈ 1.00 | 2126 CTG records (UCI “Fetal Health” dataset) | 21 (Fetal movements, FHR acceleration, deceleration patterns, variability measures) | Prepartum | Portugal | 10-fold CV | Addressed class imbalance with class weighting; ensemble outperforms single models | Black-box decisions lack interpretability; no patient-level clinical data
| Mondal et al., 2025 [15] | Sixteen ML models (Decision Tree, Random Forest, SVM, KNN, XGBoost, CatBoost, LightGBM, ANN) with ensemble methods (Bagging, Stacking, Soft Voting) | Accuracy 99.60%, 99.72%, 99.78%; F1/Precision/Recall ≈ 1.00 | 2126 CTG records (UCI “Fetal Health” dataset) | 21 (Baseline FHR, accelerations, decelerations, uterine contraction count, STV/LTV variability) | Prepartum | Portugal | 10-fold CV | Applied SMOTE and SMOTEENN to address class imbalance (SMOTEENN gave best results); ensemble learning achieved robust and highly stable predictions | Limited interpretability and lack of external validation with real-time CTG data
Clinical feature-based studies | Abdi et al., 2025 [20] | Logistic Regression, Linear Regression, Decision Tree, Random Forest, KNN (permutation-based), XGBoost, LightGBM, Feed-forward DNN | Best AUC of 0.65 with Random Forest | 8853 deliveries (Iranian IMaN Net) | 23 demographic, medical, and obstetric variables (e.g., onset of labor, preeclampsia, placental abruption, gestational age) | Prepartum | Iran | 10-fold CV (70/30 train–test split) | Identified known predictors of preterm birth; a large national registry dataset | Limited predictive power (AUC ≤ 0.65); onset of labor included as predictor variable, not as causal feature
| Roozbeh et al., 2025 [17] | Decision Tree, Random Forest, XGBoost, KNN | AUROC 0.76–0.77; Accuracy 77%; F1 ∼0.76 | 7166 vaginal deliveries ≥ 28 wks (Iranian IMaN Net) | 9 (Preeclampsia, placental abruption, gestational age category, doula support, fetal sex) | Pre/Intrapartum | Iran | 10-fold CV | Application for non-reassuring fetal heart prediction; identified multiple risk factors | Needs performance improvement; no ensemble applied
| Akbulut et al., 2018 [16] | 9 classifiers (Perceptron, Decision Forest, SVM, NN, etc.) | Accuracy 89.5%; AUC ≈ 0.95; F1 ∼0.75 | 96 pregnancies (maternal questionnaire + clinician evaluation) | 23 (Maternal age, gravidity/parity, double/triple screen results, maternal illness history) | Prepartum | Turkey | Only for hyperparameter tuning | Developed mHealth app for patient and clinician use | Very small sample, overfitting risk; no ensemble applied
| Zimmerman et al., 2025 [22] | Probabilistic Graphical Model, Logistic Regression | AUC ≈ 0.83 | 10,038 nulliparas (nuMoM2b US multicenter cohort) | 16 (Gestational age at delivery, maternal diabetes, hypertension, progesterone use, fetal sex, Apgar score) | Pre/Intra/Postpartum | USA | Stratified 5-fold CV | Explainable AI reveals compound risk interactions; large, deeply phenotyped cohort | Limited to 16 variables due to computational constraints; inclusion of postnatal features limits purely prospective use
| Malacova et al., 2020 [21] | Logistic Regression, CART, Random Forest, XGBoost, Neural Net | AUROC up to 0.84; Sensitivity ∼45% at 95% specificity | ∼952,800 births (947,025 live + 5788 stillbirths) in WA | 36 (Maternal age, urbanicity, placenta previa, gestational diabetes, preeclampsia, family history) | Prepartum | Australia | Stratified 10-fold CV | Massive registry data over 35 years; included obstetric & family history | Low sensitivity leads to many false alarms despite CV
Table 2. Variables used in the dataset. The symbol “•” indicates variables available during the antepartum period (prior to the onset of labor), while “▸” denotes variables obtained during the intrapartum period (during labor). Fetal weight corresponds to the estimated fetal weight obtained through prenatal assessment. This can be approximated using ultrasound-based methods such as the standard Hadlock formula [23].
Type | Variable | Data Type | Values
Feature | Maternal Age (•) | Continuous | 12–50
| Weeks of Gestation (•) | Continuous | 0–45
| Smoking (•) | Categorical | 1: No, 2: Yes
| Previous Fetuses (•) | Continuous | 0–5
| Drug Use (•) | Categorical | 1: No, 2: Yes
| Preeclampsia (•) | Categorical | 1: None, 2: Mild, 3: Severe
| Diabetes (•) | Categorical | 1: No, 2: Yes
| Amniotic Fluid Index (•) | Categorical | 1: Less than 5, 2: 5 or more
| Placental Maturity Grade (•) | Categorical | Levels: 0, 1, 2, 3
| Betamethasone Use (•) | Categorical | 1: No, 2: Yes
| Dexamethasone Use (•) | Categorical | 1: No, 2: Yes
| Antihypertensive Use (•) | Categorical | 1: No, 2: Yes
| Fetal Weight (Estimated) (•) | Continuous | 0–5 kg
| Event Type (▸) | Categorical | 1: Vaginal Birth, 2: Cesarean
| Delivery Hours (▸) | Categorical | 1: Less than 12, 2: 12 or more
| Misoprostol Use (▸) | Categorical | 1: No, 2: Yes
| Oxytocin Use (▸) | Categorical | 1: No, 2: Yes
| Carbetocin Use (▸) | Categorical | 1: No, 2: Yes
Target | Neonatal Risk Status | Categorical | 0: Healthy, 1: Risky
Table 3. Distribution of cases according to the classification of fetal state.
Class | Fetal State | Count
0: Healthy | Cases with no clinical evidence of respiratory or hypoxic compromise during standardized neonatal evaluation. | 2322
1: Risky | Cases with objective evidence of neonatal respiratory or hypoxic compromise. | 164
Total | | 2486
Table 4. Relevance analysis of clinical features.
Feature | Test Used | p-Value | Significant (p < 0.05)
Maternal Age | ANOVA | 0.00089 | True
Weeks of Gestation | ANOVA | 6.44335 | True
Smoking | Chi-Squared | 0.00986 | True
Number of Previous Fetuses | Chi-Squared | 4.758 × 10⁻⁵ | True
Drug Use | Chi-Squared | 0.15632 | False
Preeclampsia | Chi-Squared | 1.347 × 10⁻⁹ | True
Diabetes | Chi-Squared | 1.00000 | False
Amniotic Fluid Index | Chi-Squared | 0.77774 | False
Event Type | Chi-Squared | 3.726 × 10⁻⁸ | True
Placental Maturity Grade | Chi-Squared | 2.899 × 10⁻⁷⁶ | True
Delivery Hours | Chi-Squared | 9.318 × 10⁻⁶ | True
Misoprostol Use | Chi-Squared | 0.47744 | False
Oxytocin Use | Chi-Squared | 3.390 × 10⁻⁵ | True
Carbetocin Use | Chi-Squared | 0.61851 | False
Betamethasone Use | Chi-Squared | 0.59433 | False
Dexamethasone Use | Chi-Squared | 2.009 × 10⁻¹³ | True
Antihypertensive Use | Chi-Squared | 5.201 × 10⁻⁶ | True
Fetal Weight | ANOVA | 5.353 × 10⁻¹⁸⁴ | True
Table 5. Model hyperparameters to optimize.
Model | Hyperparameter | Values
ANN | Dropout rate (ANN_drate) | {0.1, 0.2}
| Learning rate (ANN_lr) | {0.001, 0.0005}
CNN | Learning rate (CNN_lr) | {0.001, 0.0003}
DT | Max depth (DT_maxdepth) | {10, 20}
| Min samples split (DT_minsamplesplit) | {2, 5}
RF | Number of estimators (RF_nestimators) | {100, 200}
| Max depth (RF_maxdepth) | {10, 20}
| Min samples split (RF_minsamplesplit) | {2, 5}
XGB | Number of estimators (XGB_nestimators) | {100, 200}
| Max depth (XGB_maxdepth) | {10, 20}
| Learning rate (XGB_learningrate) | {0.01, 0.1}
LGBM | Number of estimators (LGBM_nestimators) | {100, 200}
| Max depth (LGBM_maxdepth) | {10, 20}
| Learning rate (LGBM_learningrate) | {0.01, 0.1}
CB | Iterations (CB_iterations) | {100, 200}
| Tree depth (CB_depth) | {6, 10}
| Learning rate (CB_learningrate) | {0.01, 0.1}
Table 6. Numerical distribution of stratified data partitioning across outer folds (before and after ADASYN).
Class (Total Samples) | Train (60%) | Validation (20%) | Test (20%)
Healthy (0) (n = 2322) | 1393 | 465 | 464
Risky (1) (n = 164) | 99 (original) → 1393 (after ADASYN, Train only) | 32 | 33
Total | 1492 (original) → 2786 (after ADASYN, Train only) | 497 | 497
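The "Train only" resampling shown in the table can be illustrated with a minimal sketch. Note that this stand-in uses random duplication of minority samples rather than ADASYN's synthetic interpolation near hard-to-learn examples, and the function name `oversample_train_only` is ours; the point illustrated is that the validation and test splits are never resampled.

```python
import random

def oversample_train_only(X_train, y_train, minority=1, seed=0):
    """Balance ONLY the training split by duplicating minority-class samples.

    Simplified stand-in for ADASYN: instead of synthesizing new minority
    points, existing ones are randomly duplicated until the class counts
    match. Validation/test splits must be left untouched.
    """
    rng = random.Random(seed)
    majority_n = sum(1 for y in y_train if y != minority)
    minority_pos = [i for i, y in enumerate(y_train) if y == minority]
    n_extra = majority_n - len(minority_pos)
    extra = [rng.choice(minority_pos) for _ in range(n_extra)]
    X_out = list(X_train) + [X_train[i] for i in extra]
    y_out = list(y_train) + [minority] * n_extra
    return X_out, y_out
```

Applied to a training split with 1393 Healthy and 99 Risky samples, this reproduces the 1393/1393 (total 2786) balance reported in the Train column.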
Table 7. The best hyperparameter configuration identified for machine learning models.
Model | Mean Accuracy | Standard Deviation | Best Hyperparameters
ANN | 0.8052 | 0.0703 | ANN_drate = 0.1, ANN_lr = 0.0005
CNN | 0.9189 | 0.0287 | CNN_lr = 0.0003
RF | 0.9401 | 0.0285 | RF_nestimators = 200, RF_maxdepth = 20, RF_minsamplesplit = 2
DT | 0.8940 | 0.0370 | DT_maxdepth = 20, DT_minsamplesplit = 2
XGB | 0.9429 | 0.0269 | XGB_nestimators = 200, XGB_maxdepth = 10, XGB_learningrate = 0.1
CB | 0.9454 | 0.0264 | CB_iterations = 200, CB_depth = 6, CB_learningrate = 0.1
LGBM | 0.9474 | 0.0277 | LGBM_nestimators = 100, LGBM_maxdepth = 10, LGBM_learningrate = 0.1
Table 8. Base ML models’ results for fetal risk prediction. The arrow (↑) marks the F1-score column, by which rows are sorted (best first).
Model | Accuracy | Precision | Recall | F1-Score ↑
XGB | 0.941 ± 0.012 | 0.764 ± 0.039 | 0.830 ± 0.019 | 0.791 ± 0.032
RF | 0.939 ± 0.015 | 0.762 ± 0.046 | 0.817 ± 0.014 | 0.783 ± 0.031
CB | 0.938 ± 0.012 | 0.753 ± 0.038 | 0.825 ± 0.019 | 0.782 ± 0.030
LGBM | 0.938 ± 0.008 | 0.750 ± 0.026 | 0.811 ± 0.024 | 0.775 ± 0.024
CNN | 0.953 ± 0.004 | 0.863 ± 0.054 | 0.714 ± 0.020 | 0.765 ± 0.017
ANN | 0.923 ± 0.023 | 0.719 ± 0.054 | 0.808 ± 0.030 | 0.751 ± 0.047
DT | 0.918 ± 0.018 | 0.703 ± 0.044 | 0.784 ± 0.032 | 0.731 ± 0.036
Table 9. Top first-order ensemble models’ results with 95% confidence intervals (computed using t(4, 0.975) = 2.776) and the mean across metrics = (Accuracy + Precision + Recall + F1-Score)/4.
CombinationAccuracy (95% CI)Precision (95% CI)Recall (95% CI)F1-Score (95% CI)Mean Across Metrics
Top six BFOEM on Precision
(CNN, LGBM) 0.957 ± 0.003 [0.953–0.961] 0.905 ± 0.053 [0.841–0.969] 0.710 ± 0.023 [0.682–0.738] 0.770 ± 0.019 [0.746–0.794] 0.835
(CNN, XGB) 0.957 ± 0.004 [0.952–0.962] 0.901 ± 0.052 [0.838–0.964] 0.716 ± 0.020 [0.691–0.741] 0.775 ± 0.018 [0.753–0.797] 0.837
(CNN, DT) 0.953 ± 0.004 [0.948–0.958] 0.895 ± 0.054 [0.831–0.959] 0.689 ± 0.046 [0.627–0.751] 0.745 ± 0.043 [0.691–0.799] 0.821
(CNN, RF) 0.955 ± 0.003 [0.951–0.959] 0.880 ± 0.044 [0.820–0.940] 0.709 ± 0.021 [0.683–0.735] 0.764 ± 0.015 [0.745–0.783] 0.827
(CNN, CB) 0.955 ± 0.004 [0.950–0.960] 0.878 ± 0.047 [0.819–0.937] 0.715 ± 0.020 [0.690–0.740] 0.769 ± 0.019 [0.745–0.793] 0.829
(ANN, CNN) 0.954 ± 0.005 [0.948–0.960] 0.868 ± 0.056 [0.804–0.932] 0.715 ± 0.020 [0.690–0.740] 0.766 ± 0.019 [0.742–0.790] 0.826
Top six BFOEM on Recall
(CB, RF, XGB) 0.942 ± 0.014 [0.924–0.960] 0.768 ± 0.048 [0.708–0.828] 0.833 ± 0.019 [0.809–0.857] 0.795 ± 0.037 [0.748–0.842] 0.835
(CNN, CB, DT, RF, XGB) 0.947 ± 0.013 [0.931–0.963] 0.788 ± 0.051 [0.726–0.850] 0.830 ± 0.015 [0.811–0.849] 0.805 ± 0.035 [0.761–0.849] 0.843
(ANN, RF, XGB) 0.940 ± 0.018 [0.918–0.962] 0.766 ± 0.053 [0.703–0.829] 0.829 ± 0.023 [0.797–0.861] 0.791 ± 0.042 [0.739–0.843] 0.832
(CNN, CB, DT, LGBM, XGB) 0.946 ± 0.013 [0.930–0.962] 0.782 ± 0.050 [0.721–0.843] 0.829 ± 0.015 [0.810–0.848] 0.801 ± 0.035 [0.757–0.845] 0.840
(ANN, CNN, CB, DT, LGBM, RF, XGB) 0.946 ± 0.013 [0.930–0.962] 0.781 ± 0.048 [0.721–0.841] 0.829 ± 0.014 [0.811–0.847] 0.801 ± 0.033 [0.760–0.842] 0.839
(CNN, CB, LGBM, RF, XGB) 0.944 ± 0.013 [0.928–0.960] 0.777 ± 0.048 [0.717–0.837] 0.829 ± 0.014 [0.811–0.847] 0.798 ± 0.033 [0.757–0.839] 0.837
Top three BFOEM on F1-Score
(CNN, CB, DT, LGBM, RF, XGB) 0.949 ± 0.012 [0.934–0.964] 0.795 ± 0.048 [0.735–0.855] 0.828 ± 0.015 [0.809–0.847] 0.809 ± 0.031 [0.771–0.847] 0.845
(ANN, XGB) 0.951 ± 0.014 [0.933–0.969] 0.808 ± 0.063 [0.730–0.886] 0.812 ± 0.031 [0.774–0.850] 0.808 ± 0.045 [0.752–0.864] 0.845
(CNN, CB, RF, XGB) 0.951 ± 0.012 [0.936–0.966] 0.804 ± 0.056 [0.740–0.868] 0.815 ± 0.020 [0.790–0.840] 0.807 ± 0.034 [0.770–0.844] 0.844
BSOEM (Best second-order ensemble)
((CNN, LGBM), (CNN, XGB), (CNN, DT), (ANN, RF, XGB), (CNN, CB, DT, LGBM, RF, XGB), (ANN, XGB), (CNN, CB, RF, XGB)) 0.954 ± 0.011 [0.940–0.968] 0.824 ± 0.060 [0.749–0.899] 0.805 ± 0.025 [0.774–0.836] 0.812 ± 0.037 [0.765–0.859] 0.849
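The caption of Table 9 states that the 95% confidence intervals use t(4, 0.975) = 2.776, which is consistent with a t-interval over 5 cross-validation folds (4 degrees of freedom). The sketch below shows that computation; the fold scores are illustrative, not the paper's actual per-fold values.

```python
from math import sqrt

import numpy as np

# t quantile for 4 degrees of freedom at the 97.5th percentile,
# as stated in the Table 9 caption.
T_4_975 = 2.776


def ci95(fold_scores):
    """Mean and 95% CI of a metric over cross-validation folds.

    Uses mean +/- t * s / sqrt(n), with s the sample standard
    deviation (ddof=1) and n the number of folds (5 here).
    """
    scores = np.asarray(fold_scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    half = T_4_975 * scores.std(ddof=1) / sqrt(n)
    return mean, mean - half, mean + half


# Illustrative fold accuracies (not the paper's actual values):
mean, lo, hi = ci95([0.95, 0.96, 0.95, 0.96, 0.955])
print(f"{mean:.3f} [{lo:.3f}-{hi:.3f}]")
```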
Table 10. Comparative performance of the BSOEM relative to state-of-the-art approaches for fetal health assessment across different data modalities and prediction tasks.
Study | Data Location | Model Type | Best Reported Metric | Comparative Observations
Image-based (vision) studies
Jain et al. [18] | Not specified | Vision Transformer | Acc 0.89, F1 0.875 | Image-based modeling not directly comparable to clinical tabular prediction
Yenkikar et al. [19] | Not specified | Siamese Neural Network | Acc 0.981 | Detects ultrasound anomalies rather than predicting clinical fetal risk
CTG-based studies
Kuzu & Santur [12] | Portugal | Voting ensemble | Acc > 99.5%, F1 ≈ 1.00 | Relies on continuous CTG monitoring and specialized infrastructure; not directly comparable with the clinical-variable-based BSOEM
Zannah et al. [13] | Portugal | Ensemble models | Acc 0.995, F1 0.966 | (as above)
Nazli et al. [14] | Portugal | CatBoost, LightGBM, NN | Balanced accuracy ≈ 91.3% | (as above)
Mondal et al. [15] | Portugal | Stacked ensembles | Acc 0.996, F1 ≈ 1.00 | (as above)
Clinical feature-based studies
Akbulut et al. [16] | Turkey | Multiple classifiers | Acc 0.895, F1 ≈ 0.75 | The BSOEM uses a more robust ensemble architecture with improved overall predictive performance
Roozbeh et al. [17] | Iran | DT, RF, XGBoost, KNN | Acc 0.77, F1 ≈ 0.76 | (as above)
Abdi [20] | Iran | Random Forest | AUC 0.65 | The BSOEM provides balanced classification performance through ensemble modeling beyond AUC/AUROC-only evaluation
Malacova et al. [21] | Australia | Multiple ML models | AUROC 0.84 | (as above)
Zimmerman et al. [22] | USA | Probabilistic graphical model | AUC ≈ 0.83 | (as above)
Proposed BSOEM | Mexico | Second-order ensemble | Acc 0.954, F1 0.812 | Robust second-order ensemble based on clinical data with balanced and clinically deployable performance
Table 11. Class-specific sensitivity (Risky), specificity (Healthy), and precision (Risky) across folds for the three models.
Fold | XGB (Best Base Model): Sens. / Spec. / Prec. | BFOEM: Sens. / Spec. / Prec. | BSOEM: Sens. / Spec. / Prec.
0 | 0.6970 / 0.9441 / 0.4694 | 0.7273 / 0.9570 / 0.5455 | 0.6667 / 0.9634 / 0.5641
1 | 0.7273 / 0.9612 / 0.5714 | 0.6970 / 0.9698 / 0.6216 | 0.5758 / 0.9784 / 0.6552
2 | 0.7273 / 0.9677 / 0.6154 | 0.6667 / 0.9784 / 0.6875 | 0.6970 / 0.9892 / 0.8214
3 | 0.6970 / 0.9698 / 0.6216 | 0.6970 / 0.9806 / 0.7188 | 0.6061 / 0.9871 / 0.7692
4 | 0.6563 / 0.9484 / 0.4667 | 0.6563 / 0.9527 / 0.4884 | 0.6250 / 0.9656 / 0.5556
Mean | 0.7010 / 0.9582 / 0.5489 | 0.6889 / 0.9677 / 0.6124 | 0.6341 / 0.9767 / 0.6731
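Table 11's class-specific metrics follow from per-fold confusion matrices with "Risky" as the positive class. As a sketch, the counts below are one reconstruction consistent with fold 0 of the XGB column (e.g., sensitivity 0.6970 over 33 Risky cases implies 23 true positives); the actual fold sizes are an assumption, not reported values.

```python
def class_metrics(tp, fn, tn, fp):
    """Sensitivity (Risky), specificity (Healthy), precision (Risky)
    from confusion-matrix counts, with Risky as the positive class."""
    sensitivity = tp / (tp + fn)  # recall on the Risky class
    specificity = tn / (tn + fp)  # recall on the Healthy class
    precision = tp / (tp + fp)    # positive predictive value (Risky)
    return sensitivity, specificity, precision


# Hypothetical fold with 33 Risky and 465 Healthy cases; counts chosen
# to be consistent with fold 0 of the XGB column in Table 11.
sens, spec, prec = class_metrics(tp=23, fn=10, tn=439, fp=26)
print(round(sens, 4), round(spec, 4), round(prec, 4))
# sensitivity ≈ 0.6970, specificity ≈ 0.9441, precision ≈ 0.4694
```

Note how a high specificity can coexist with a low Risky-class precision under class imbalance: even 26 false positives among 465 Healthy cases barely dent specificity but halve the precision.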
