1. Introduction
Maternal mortality and morbidity continue to be significant public health issues in Central America, with Honduras having one of the highest maternal mortality ratios in the area [
1]. Women in rural areas face additional barriers, including geographic isolation, limited transportation, scarce specialized obstetric care, and delayed recognition of high-risk conditions. Hypertensive disorders of pregnancy, including gestational hypertension, preeclampsia, eclampsia, and chronic hypertension with superimposed preeclampsia, are a major preventable cause of adverse maternal and perinatal outcomes [
2,
3]. According to recent ESC (2023–2025) and AHA (2024) guidelines, hypertensive disorders of pregnancy are defined by new-onset systolic blood pressure ≥ 140 mmHg and/or diastolic blood pressure ≥ 90 mmHg after 20 weeks of gestation, with or without proteinuria, or as chronic hypertension preceding pregnancy or detected before 20 weeks [
4,
5,
6,
7]. These disorders encompass a spectrum of clinical entities with distinct pathophysiological mechanisms, ranging from transient gestational hypertension to multiorgan preeclampsia, which is increasingly recognized as a systemic endothelial dysfunction rather than a purely renal disorder. This endothelial and vascular dysfunction underlies later-life cardiovascular remodeling, sympathetic overactivation, and microvascular damage, linking HDP to persistent hypertension and increased cardiovascular risk in the years following delivery [
4,
5,
6,
7]. Beyond pregnancy, women affected by HDP have a two- to four-fold increased lifetime risk of developing chronic hypertension, heart failure, ischemic heart disease, and stroke, underscoring the cardiovascular continuum initiated during pregnancy [
4,
5,
6,
7].
The diagnostic pathway for HDP typically depends on a combination of clinical findings measurable at the point of care, including elevated blood pressure, proteinuria, and cardinal symptoms such as headache, visual disturbances, and epigastric pain [
2,
3]. In well-resourced environments, these clinical indicators are augmented by laboratory tests, including complete blood count, liver enzymes, renal function markers, and, in certain instances, biomarkers such as placental growth factor or soluble fms-like tyrosine kinase-1 [
8]. While these biomarkers have shown promise in tertiary centers, their cost, availability, and interpretive variability currently limit their integration into cardiovascular risk frameworks in low-resource obstetric care [
4,
5,
6,
7].
Nonetheless, such thorough evaluation is predominantly unattainable in rural primary care settings across Honduras, where healthcare providers are compelled to make crucial triage decisions based on insufficient information, frequently lacking immediate access to specialist consultation or advanced laboratory facilities [
Encouraging results have nevertheless been reported in Honduras using recent mHealth technologies [
10].
Recent advancements in clinical artificial intelligence have highlighted the efficacy of transparent, calibrated prediction models in facilitating decision making within resource-limited settings. Current reporting guidelines, such as TRIPOD + AI [
11] and DECIDE-AI [
12], promote stringent evaluation frameworks that emphasize clinical utility, interpretability, and fairness rather than singular performance metrics. These frameworks acknowledge that the implementation of models in real-world clinical environments necessitates not only accuracy in discrimination but also calibration [
13], evaluation of clinical impact via decision curve analysis [
14], and a deliberate examination of potential harms, including misallocation of resources or algorithmic bias that could worsen existing health disparities [
15,
16].
This study addresses a significant gap in maternal health equity by developing and validating a diagnostic decision support model tailored to the resource limitations inherent in rural Honduran healthcare systems. Rather than pursuing maximal predictive accuracy with complex ensemble methods or deep learning architectures that would be difficult to deploy and interpret in primary care settings, we built a parsimonious logistic regression model using only variables routinely collected during standard prenatal visits. This approach reflects the principle that clinical AI systems must be context-appropriate, technically feasible, and interpretable to make a real difference in patient care, particularly in settings with limited technology and specialist expertise.
The study had three main goals. First, to delineate the clinical and demographic characteristics of pregnant women attending a secondary-level facility in rural Honduras and to document the prevalence of hypertensive disorders of pregnancy (HDPs) within this population. Second, to develop and internally validate a parsimonious clinical prediction model for HDPs using only routinely available predictors and to assess its discrimination, calibration, and operational characteristics. Third, to establish a methodological framework for subsequent multicenter validation and clinical impact evaluation, thereby enhancing the evidence base for AI-driven decision support in resource-limited obstetric environments. Strengthening early HDP detection may also represent a scalable pathway to reduce future cardiovascular morbidity among women of reproductive age, a priority increasingly recognized by international cardiology societies [
4,
5,
6,
7].
This work aims to show that meaningful progress in reducing maternal health disparities can be achieved with AI methods that are pragmatic, implementable, and grounded in routine clinical practice, accommodating resource constraints while remaining methodologically sound and clinically relevant. The results shown here are the first step in validating a decision support system that could be used in primary care networks in rural Honduras and other similar areas in Central America. Our strategy is consistent with recent ESC (2023–2025) and AHA (2024) guidance trends that support earlier detection and risk-stratified care in pregnancy, while encouraging context-appropriate implementation in settings with constrained resources. By integrating these principles into maternal care, this study bridges obstetric and cardiovascular prevention, reinforcing the continuum of women’s heart health across reproductive stages [
4,
5,
6,
7].
2. Materials and Methods
2.1. Study Design and Setting
We conducted a cross-sectional diagnostic accuracy study at Hospital General San Felipe, a secondary-level hospital in Tegucigalpa, Honduras, that serves as a referral center for surrounding rural communities and provides obstetric care to a predominantly agricultural population with limited access to higher-level facilities. For more details, see
Appendix A.1.
2.2. Participants and Eligibility Criteria
The inclusion criteria consisted of all pregnant women visiting the facility during the study period who had recorded blood pressure readings and gestational age estimates, as these constituted the essential data elements required for HDP evaluation.
Exclusion criteria were confined to clinical situations that would obscure outcome determination or contravene essential model assumptions. We excluded women with multiple gestations, given their distinct risk profile and management requirements. To prevent outcome misclassification, women with advanced chronic kidney disease or proteinuria clearly attributable to non-pregnancy causes were also excluded. Cases of preexisting chronic hypertension in which the available documentation did not allow differentiation between chronic hypertension alone and superimposed preeclampsia were excluded, as precise outcome assignment is essential for supervised learning. Lastly, any woman who declined to participate when consent was required was excluded; however, the retrospective chart review component permitted exemption from individual consent in all instances (See
Appendix A.2).
2.3. Outcome Definition
The primary outcome was the presence of HDP at the index visit, defined according to local clinical practice standards aligned with international guidelines adapted for resource-limited settings [
2,
3,
9]. HDP was diagnosed on the basis of elevated blood pressure, clinical signs consistent with preeclampsia or gestational hypertension, and, where available, semi-quantitative proteinuria by urine dipstick. Specifically, HDP was diagnosed when systolic blood pressure was ≥ 140 mmHg or diastolic blood pressure ≥ 90 mmHg on at least one measurement, together with either significant proteinuria (dipstick ≥ 2+) or cardinal symptoms such as severe headache unresponsive to usual treatments, visual disturbances (scotomata or photopsia), or epigastric or right upper quadrant pain not attributable to other causes [
17]. More details are given in
Appendix A.3.
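For illustration only, the pragmatic case definition above can be expressed as a simple rule. The following Python sketch uses hypothetical field names and is not the study’s actual abstraction code.

def meets_hdp_criteria(sbp, dbp, proteinuria_dipstick,
                       severe_headache, visual_disturbance, epigastric_pain):
    """Pragmatic HDP rule: SBP >= 140 or DBP >= 90 on at least one reading,
    plus either dipstick proteinuria >= 2+ or at least one cardinal symptom."""
    hypertensive = sbp >= 140 or dbp >= 90
    significant_proteinuria = (proteinuria_dipstick is not None
                               and proteinuria_dipstick >= 2)
    cardinal_symptom = severe_headache or visual_disturbance or epigastric_pain
    return hypertensive and (significant_proteinuria or cardinal_symptom)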
2.4. Predictor Variables
Predictor selection was constrained by data availability within routine clinical documentation at rural secondary-level facilities, reflecting the deliberate pragmatic design prioritizing practicability over theoretical optimality [
11,
12]. We included all variables consistently recorded during standard prenatal care that demonstrated biological plausibility for HDP association based on established pathophysiology [
2,
3]. Several clinically relevant parameters frequently employed in well-resourced settings, including comprehensive metabolic panels, complete blood counts with platelet enumeration, liver function tests, serum uric acid, and lactate dehydrogenase, were unavailable for most patients and thus excluded from modeling. This selective availability reflects resource allocation decisions at the facility level where laboratory testing is ordered judiciously based on clinical indication rather than conducted universally [
9]. While this constraint limits the model’s theoretical discriminative ceiling, it ensures that the resulting tool remains deployable in the target implementation environment without requiring infrastructure investments beyond current capacity. The predictor set therefore represents the intersection of pathophysiological relevance and practical availability rather than exhaustive clinical characterization [
18]. More details are given in
Appendix A.4.
2.5. Data Collection and Quality Assurance
Data were collected from several sources: paper-based clinical charts, delivery logs, and standardized prenatal care forms. Research staff were trained in the data abstraction protocols and participated in calibration exercises with sample charts to ensure consistency. A data dictionary defined each variable, its permitted range, and the coding of categorical variables.
Quality assurance included programmatic range checks, cross-variable consistency verification, and manual review of flagged records. Duplicate entries were identified by patient identifiers and visit dates. Detailed quality assurance protocols are documented in
Supplementary File S2.
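As a non-authoritative illustration of these programmatic checks, the following Python sketch flags out-of-range values, cross-variable inconsistencies, and duplicate visits; the allowed ranges and column names are hypothetical placeholders, not the ranges documented in Supplementary File S2.

import pandas as pd

# Illustrative allowed ranges from a hypothetical data dictionary.
RANGES = {"maternal_age": (12, 55), "gestational_age_wk": (4, 44),
          "sbp": (60, 260), "dbp": (30, 160)}

def flag_quality_issues(df: pd.DataFrame) -> pd.DataFrame:
    issues = pd.DataFrame(index=df.index)
    # Range checks: non-missing values outside the allowed interval are flagged.
    for col, (lo, hi) in RANGES.items():
        issues[f"{col}_out_of_range"] = ~df[col].between(lo, hi) & df[col].notna()
    # Cross-variable consistency: diastolic should not exceed systolic pressure.
    issues["dbp_exceeds_sbp"] = df["dbp"] > df["sbp"]
    # Duplicate entries identified by patient identifier and visit date.
    issues["duplicate_visit"] = df.duplicated(subset=["patient_id", "visit_date"],
                                              keep=False)
    return issues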
2.6. Missing Data Handling
Certain variables frequently exhibited missing data, especially laboratory tests that were ordered selectively according to clinical indication rather than performed universally. We characterized patterns of missingness and performed complete-case analysis for descriptive statistics, while using imputation for model development to make full use of the available data.
For numerical variables, missing values were imputed with the median of the training partition of each cross-validation fold, computed separately per fold to prevent information leakage between training and test partitions. We chose this simple approach over multiple imputation or model-based imputation because of the small sample size and the goal of keeping the final model deployable in clinical settings without complex imputation pipelines. Missing binary variables were imputed with the most frequent category, on the assumption that findings are typically documented when present (See
Appendix A.5).
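A minimal sketch of this fold-wise imputation strategy is shown below, assuming a scikit-learn pipeline so that medians and modes are learned only from each fold’s training partition; the column names are placeholders rather than the study’s actual variable list.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder predictor lists; the real lists are defined in the data dictionary.
numeric_cols = ["maternal_age", "gestational_age_wk", "sbp", "dbp", "bmi"]
binary_cols = ["severe_headache", "visual_disturbance", "epigastric_pain"]

# Because imputers are fitted inside the pipeline, cross-validation refits them
# on each training split, so fold-specific medians/modes never leak into testing.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("bin", SimpleImputer(strategy="most_frequent"), binary_cols),
])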
2.7. Statistical Analysis and Model Development
Our primary analytic approach employed penalized logistic regression, a method well-suited to the moderate sample size, the high-dimensional predictor space relative to outcome events, and the priority placed on interpretability in this application. Given the 27.9% HDP prevalence, we applied inverse class weighting to address class imbalance. See
Appendix A.6.
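For concreteness, one way to specify this estimator in scikit-learn is sketched below; the regularization strength shown is a placeholder rather than the tuned value described in Appendix A.6.

from sklearn.linear_model import LogisticRegression

# L2-penalized logistic regression with inverse class weighting.
# class_weight="balanced" weights each class inversely to its frequency
# (n_samples / (n_classes * class_count)), up-weighting the minority HDP class.
model = LogisticRegression(penalty="l2", C=1.0, class_weight="balanced",
                           solver="lbfgs", max_iter=1000, random_state=42)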
2.8. Model Evaluation Framework
Performance evaluation adhered to a stratified five-fold cross-validation with shuffling to preserve outcome class distribution and to maintain independence between training and testing partitions [
19,
20,
21]. The area under the receiver operating characteristic curve (AUROC) was used to measure discriminative ability [
16]. The area under the precision–recall curve (AUPRC) was used to quantify performance in the positive class. Calibration was assessed using calibration curves with isotonic regression. Formal calibration tests were not performed due to sample size limitations [
22,
23,
24]. The Brier score was used to measure overall prediction accuracy [
25].
The decision threshold was determined by maximizing the F1-score. At this threshold, we calculated sensitivity, specificity, positive predictive value, and negative predictive value with 95% confidence intervals [
26,
27,
28]. A confusion matrix was also derived. The five cross-validation folds were used to compute mean performance metrics and their standard deviations, providing estimates of expected generalization performance. Analyses were conducted in Python 3.12.3 using scikit-learn 1.0, pandas 2.3.3, and matplotlib 3.10.7, with fixed random seeds to ensure computational reproducibility [
29,
30,
31,
32].
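The cross-validated evaluation can be outlined as in the following sketch, which uses synthetic stand-in data of the same size and approximate prevalence as the study cohort; it illustrates the procedure rather than reproducing the exact analysis script.

import numpy as np
from sklearn.datasets import make_classification  # stand-in data for illustration
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder dataset: 147 records, ~28% positive class.
X, y = make_classification(n_samples=147, n_features=20, weights=[0.72],
                           random_state=42)

pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(),
                     LogisticRegression(penalty="l2", class_weight="balanced",
                                        max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

aurocs, auprcs, briers = [], [], []
for train_idx, test_idx in cv.split(X, y):
    pipe.fit(X[train_idx], y[train_idx])
    p = pipe.predict_proba(X[test_idx])[:, 1]
    aurocs.append(roc_auc_score(y[test_idx], p))
    auprcs.append(average_precision_score(y[test_idx], p))
    briers.append(brier_score_loss(y[test_idx], p))

print(f"AUROC {np.mean(aurocs):.3f} ± {np.std(aurocs):.3f}")
print(f"AUPRC {np.mean(auprcs):.3f} ± {np.std(auprcs):.3f}")
print(f"Brier {np.mean(briers):.3f} ± {np.std(briers):.3f}")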
2.9. Sensitivity Analysis
To assess model robustness and evaluate performance with reduced predictor sets, we conducted sensitivity analyses using progressively restricted variable subsets. These analyses tested whether comparable predictive performance could be achieved with fewer predictors, which would further enhance clinical practicability.
First, we developed a reduced model including only the five most clinically accessible predictors: maternal age, gestational age, systolic blood pressure, diastolic blood pressure, and parity. This minimal model requires no laboratory testing and represents variables universally documented during initial triage.
Second, we tested a model with ten predictors, adding the five most readily available laboratory parameters to the minimal set: proteinuria, hematocrit, platelet count, creatinine, and presence of symptoms (headache, visual changes, or epigastric pain). This intermediate model reflects capabilities at facilities with basic laboratory infrastructure.
For each sensitivity analysis, we maintained the same modeling approach (L2 penalized logistic regression with stratified 5-fold cross-validation) and evaluation metrics (AUROC, AUPRC, Brier score, calibration) as the full model. Regularization parameters were reoptimized for each predictor set. Performance was compared descriptively to the full 20-predictor model to quantify the trade-off between model simplicity and predictive accuracy. These analyses inform potential implementation strategies across facilities with varying resource availability.
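The sensitivity analyses rerun the same pipeline and cross-validation protocol on nested predictor subsets, as sketched below with illustrative column names; the data frame and outcome column are hypothetical placeholders.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_subset(df: pd.DataFrame, cols: list[str]) -> str:
    # Same L2-penalized model and stratified 5-fold CV as the full analysis.
    pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(),
                         LogisticRegression(penalty="l2",
                                            class_weight="balanced",
                                            max_iter=1000))
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipe, df[cols], df["hdp"], cv=cv, scoring="roc_auc")
    return f"AUROC {scores.mean():.3f} ± {scores.std():.3f}"

minimal_5 = ["maternal_age", "gestational_age_wk", "sbp", "dbp", "parity"]
intermediate_10 = minimal_5 + ["proteinuria", "hematocrit", "platelets",
                               "creatinine", "any_symptom"]
# Example use with a hypothetical cohort table:
# evaluate_subset(prenatal_df, minimal_5); evaluate_subset(prenatal_df, intermediate_10)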
2.10. Ethical Considerations
This study was executed in complete compliance with the tenets of the most recent revision of the Declaration of Helsinki [
33]. Before data collection, the institutional ethics committee at Hospital General San Felipe reviewed and approved the research protocol. Participants signed an informed consent form, and anonymity was maintained for all data obtained in person and from medical records. Before transfer from the clinical database to the research repository, all direct identifiers were removed and replaced with study-specific numeric codes.
The research database was protected through access restrictions and stored on password-protected servers with access logging. Only study personnel directly involved in data analysis had access to the de-identified dataset. All research team members completed human subjects protection training and signed confidentiality agreements.
The study presented negligible risk to participants, as it entailed solely a retrospective analysis of clinical data gathered during standard care, without any study-specific interventions. Patient management was not altered on the basis of model predictions, as this study constituted initial model validation rather than implementation. The potential benefits, realized mainly at the population level through subsequent model refinement and implementation, include improved diagnostic precision and optimized allocation of maternal care resources in underserved communities.
Results reporting was executed in a manner that prevents individual identification by employing small cell suppression and steering clear of distinct demographic combinations. The research team pledged to promptly disseminate findings via peer-reviewed publication and presentations to local stakeholders, emphasizing the communication of both the prospective advantages and constraints of AI-driven decision support in clear language appropriate for clinicians and community audiences.
3. Results
3.1. Study Population Characteristics
The analytical cohort consisted of 147 pregnant women (41 with HDP, 27.9%; 106 without HDP, 72.1%). This prevalence aligns with regional estimates for Central American populations with limited prenatal care access.
Maternal age ranged from 15 to 42 years, with a mean of 26.0 years (standard deviation [SD] 5.1 years). Women with HDP were somewhat younger on average (mean 24.8 years, SD 5.2) than those without HDP (mean 26.5 years, SD 5.1), although this difference did not reach conventional statistical significance. Gestational age at presentation ranged from 8 to 41 weeks, with a mean of 26.6 weeks (SD 6.5 weeks) for the entire cohort. Mean gestational age was comparable between groups (26.2 weeks in HDP versus 26.7 weeks in non-HDP), indicating that presentations were distributed across all trimesters rather than concentrated at term.
Prepregnancy BMI was recorded for 123 women (83.7%), with a mean of 26.4 kg/m² (SD 5.2). HDP cases showed slightly higher BMI (27.1 vs. 26.1 kg/m²). Prior HDP history was documented in 7.3% of HDP cases versus 3.8% of non-HDP cases. Complete characteristics are in
Table 1.
Before the index evaluation, the number of completed prenatal care visits ranged from 0 to 9, with a mean of 3.4 visits (SD 1.9). Women later diagnosed with HDP averaged fewer visits (mean 3.1, SD 2.1) than those without HDP (mean 3.5, SD 1.8), which may reflect later initiation of care, access barriers, or rapidly progressing disease requiring urgent evaluation.
3.2. Blood Pressure and Clinical Findings
Blood pressure values differed between groups in a manner consistent with HDP diagnostic criteria (
Table 1). A urine dipstick test for semi-quantitative proteinuria was available for 134 women (91.2% of the cohort). The mean value on the ordinal scale from 0 (negative) to 3 (three-plus) was 0.59 (SD 0.83) in the HDP group and 0.36 (SD 0.61) in women without HDP. Proteinuria of one-plus or greater was present in 41.5% of women with HDP versus 16.0% of women without HDP, indicating that proteinuria is a key distinguishing feature while also showing that many HDP cases lacked significant proteinuria at the first visit, consistent with the syndromic nature of the disorder.
Cardinal symptoms exhibited diverse prevalence patterns. Severe headache was reported by 34.1% of women with HDP versus 28.3% of women without HDP. Visual disturbances, such as blurred vision or photopsia, were observed in 19.5% of HDP cases compared with 9.4% of non-HDP cases, a notable albeit modest difference. Epigastric or right upper quadrant pain was reported by 14.6% of women with HDP versus 8.5% of those without. Peripheral edema, although prevalent, occurred at comparable rates in both groups (51.2% with HDP vs. 47.2% without), confirming its limited specificity for preeclampsia diagnosis. See
Table 1.
3.3. Laboratory and Contextual Variables
Laboratory testing was conducted selectively: hemoglobin (53.1% of women), creatinine (30.6%), platelets (35.4%). Mean values showed minimal differences between groups: hemoglobin 11.8 g/dL (SD 1.5), creatinine 0.68 mg/dL (SD 0.21), platelets 231 × 10⁹/L (SD 68), with slight reduction in HDP cases (213 vs. 238 × 10⁹/L).
Low-dose aspirin use, which is recommended for preeclampsia prevention in high-risk women when initiated before 16 weeks of gestation, was documented in only 8.2% of the cohort and 4.9% of the women who developed HDP, suggesting that this evidence-based preventive measure is underutilized even among women who ultimately developed the condition. Distance to the nearest tertiary referral facility was recorded for 89 women, with a median of 35 km (interquartile range 18–52 km), underscoring the geographic access barriers faced in this rural catchment area.
3.4. Model Performance Metrics
The penalized logistic regression model, trained with stratified five-fold cross-validation on the entire cohort and median imputation for missing numerical values, attained a mean AUROC of 0.614 (SD 0.089 across folds) for differentiating between HDP and non-HDP cases. This fair-to-moderate performance indicates that the model ranks patients by risk better than chance, although the risk distributions of the two groups overlap considerably. The AUPRC was 0.461 (SD 0.104), well above the no-skill baseline of 0.279 (the outcome prevalence) but still modest, reflecting the difficulty of achieving high precision and recall simultaneously in this moderately imbalanced setting with overlapping predictor distributions. The receiver operating characteristic curve (
Figure 1) and precision–recall curve (
Figure 2) illustrate the discrimination performance across all threshold values.
Table 2 summarizes all cross-validated performance metrics.
The Brier score, which jointly reflects calibration and discrimination, was 0.253 (SD 0.028); Brier scores range from 0 (perfect) to 1 (worst), so values closer to zero indicate better performance. A naïve model that predicts the overall prevalence for every individual would attain a Brier score of approximately 0.201, so the fitted model does not improve on this null reference by this metric, leaving substantial room for improvement in its probability estimates (Table 2).
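For reference, the quoted null value follows from the expected Brier score of a constant prediction equal to the observed prevalence p = 0.279:
Brier_null = p(1 − p)² + (1 − p)p² = p(1 − p) = 0.279 × 0.721 ≈ 0.201.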
The ROC curve showed a steep initial rise in sensitivity at low false positive rates, followed by a more gradual trade-off at higher thresholds. The precision–recall curve showed that precision remained above 0.35 for recall values up to about 0.60, after which it declined rapidly, indicating that pursuing very high sensitivity would entail many false positive predictions. See
Table 2.
3.5. Calibration Assessment
The calibration plot, which grouped predicted probabilities into deciles and compared mean predictions with observed frequencies, showed reasonable but imperfect agreement between predicted probabilities and actual outcome rates. For predicted probabilities in the lowest bins (0.0–0.3), observed frequencies generally exceeded predictions, indicating slight underestimation of risk in this range, whereas predictions in the middle range (0.3–0.7) closely matched observed frequencies. The calibration plot (
Figure 3) demonstrates the agreement between predicted probabilities and observed outcome frequencies across deciles of predicted risk.
At predicted probabilities above 0.7, the model showed overestimation, though these bins contained few observations with wide confidence intervals. The isotonic-smoothed calibration curve exhibited minor S-shaped deviation from perfect calibration, consistent with models developed on limited samples.
Calibration varied across cross-validation folds (n ≈ 29–30 per fold), with some folds showing closer agreement than others. External validation is necessary to distinguish systematic bias from sampling variability.
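A minimal sketch of this decile-based calibration assessment with isotonic smoothing is shown below, using synthetic stand-in predictions; it illustrates the procedure rather than reproducing the study’s exact script.

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression

# Stand-in labels and predicted probabilities (placeholders for the pooled
# out-of-fold outputs of the study model).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=147)
y_prob = np.clip(y_true * 0.3 + rng.uniform(0, 0.7, size=147), 0, 1)

# Decile bins: observed event fraction vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10,
                                        strategy="quantile")
# Isotonic fit used to draw the smoothed calibration curve.
iso = IsotonicRegression(out_of_bounds="clip").fit(y_prob, y_true)
smoothed = iso.predict(np.linspace(0, 1, 101))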
3.6. Operational Performance at Selected Threshold
To characterize potential clinical utility, we identified the probability threshold that maximized the F1-score across pooled cross-validated predictions, approximately 0.474. At this threshold, the model achieved a sensitivity of 0.561 (correctly identifying 56.1% of actual HDP cases) and a specificity of 0.679 (correctly identifying 67.9% of women without HDP). The positive predictive value was 0.404, meaning that 40.4% of women the model flagged as high risk actually had HDP; the negative predictive value was 0.800, meaning that 80.0% of women the model classified as low risk did not have HDP.
These modest performance characteristics (sensitivity 56.1%, specificity 67.9%) suggest the model would require substantial refinement before clinical deployment. The NPV of 0.80 indicates some potential for risk stratification, though the PPV of 0.404 means most women flagged as high risk would not have HDP. These metrics reflect the model’s current limitations and underscore the need for improvement before any clinical application.
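The threshold selection and operating-point metrics described above can be computed as in the sketch below from pooled out-of-fold predictions; the function is an illustrative outline, not the study’s exact code.

import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_curve

def operating_point(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    # F1-maximizing threshold over the precision-recall curve.
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = np.argmax(f1[:-1])  # final PR point has no associated threshold
    thr = thresholds[best]
    y_pred = (y_prob >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"threshold": thr,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
            "npv": tn / (tn + fn)}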
3.7. Feature Importance and Model Coefficients
Examination of standardized coefficients (
Table 3) reveals predictor contributions. Visual disturbances showed the strongest positive association (coefficient 0.846), followed by prior HDP history (0.530).
The strongest negative predictor was documented low-dose aspirin use, with a coefficient of −0.846. This aligns with the established protective effect of aspirin prophylaxis when initiated appropriately; however, causation cannot be inferred from this observational association. Maternal age exhibited a negative coefficient of −0.479, indicating higher risk among younger women in this sample, which contrasts with literature reporting increased risk at maternal age extremes and may reflect the age distribution and unmeasured confounders in this specific population. The number of prenatal visits showed a modest negative coefficient (−0.253), possibly indicating more rigorous monitoring and risk factor management among women engaged in care, although selection effects cannot be excluded.
Continuous blood pressure variables and semi-quantitative proteinuria, although clinically pivotal for HDP diagnosis, exhibited smaller coefficients than expected (e.g., proteinuria 0.250). This likely reflects interdependence among highly correlated predictors and the regularization penalty’s shrinkage of all coefficients toward zero. Coefficient magnitude in a penalized model therefore cannot be read directly as clinical importance, since shrinkage affects all coefficients and disproportionately affects correlated predictors.
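For transparency, standardized coefficients of the kind reported in Table 3 can be obtained as sketched below: predictors are scaled to unit variance before fitting, so each coefficient expresses the change in log-odds per standard deviation. The snippet is illustrative and assumes a predictor DataFrame X and outcome vector y.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def standardized_coefficients(X: pd.DataFrame, y) -> pd.Series:
    # Scaling before fitting makes coefficients comparable across predictors.
    pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(),
                         LogisticRegression(penalty="l2",
                                            class_weight="balanced",
                                            max_iter=1000))
    pipe.fit(X, y)
    coefs = pipe[-1].coef_.ravel()
    return pd.Series(coefs, index=X.columns).sort_values(key=abs,
                                                         ascending=False)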
4. Discussion
4.1. Main Findings
This study demonstrates that a simple clinical prediction model using routinely available variables achieves modest discrimination for HDP detection (AUROC 0.614) in a rural Honduran hospital. While this performance falls below thresholds typically required for high-stakes clinical decision making, it represents initial feasibility evidence that structured risk assessment is possible with limited resources. Substantial refinement through external validation and model improvement will be necessary before any clinical application. See
Supplementary Materials for details.
The model achieved NPV of 0.80 and PPV of 0.40 at the optimal threshold. These modest metrics indicate that 20% of women classified as low risk would nonetheless have HDP (an unacceptable miss rate for standalone use) and that 60% of high-risk classifications would be false alarms. These performance characteristics highlight the need for substantial improvement before considering clinical deployment. Calibration analysis showed reasonable agreement between predicted and observed probabilities in mid-range predictions, with deviations in extreme categories likely reflecting sample size limitations [
13,
16].
4.2. Comparison with Existing Literature
Prior research on clinical prediction models for preeclampsia has primarily focused on high-resource environments, where first-trimester biomarker screening using placental growth factor, pregnancy-associated plasma protein A, and specialized ultrasound assessments can achieve AUROCs of 0.75–0.90 for early-onset preeclampsia [
8,
21]. Such comprehensive early prediction enables aspirin prophylaxis initiation before 16 weeks’ gestation in high-risk women most likely to benefit [
34]. However, these approaches remain largely inaccessible in low- and middle-income countries due to cost, infrastructure requirements, and specialized training needs.
Studies on clinical prediction models using routinely available data in resource-limited settings report similar performance to our findings [
25,
35,
36]. A 2019 study from sub-Saharan Africa using clinical variables and basic laboratory tests reported AUROC 0.68 [
25]. A South Asian study using only blood pressure, BMI, and medical history achieved AUROC 0.62 [
Furthermore, the performance improvements observed with alternative modeling approaches were not statistically significant (DeLong test
p > 0.15 for all comparisons [
37]), likely reflecting the fundamental limitation that predictor variables capture overlapping pathophysiological information.
Alternative modeling approaches (random forests, gradient boosting) showed minimal non-significant performance improvements in exploratory analyses (details in
Supplementary Table S6) while substantially reducing interpretability and increasing computational requirements. For deployment in resource-constrained settings where clinicians require transparent risk estimates, a well-calibrated linear model offers greater practical value than marginally more accurate but opaque methods.
These findings suggest inherent performance ceilings when relying solely on conventional clinical variables, likely reflecting preeclampsia’s heterogeneous etiology and complex interactions among maternal, placental, and fetal factors not fully captured by readily measurable predictors. Our study contributes to existing literature by demonstrating feasibility in Central America, an underrepresented region in prediction modeling research, while employing rigorous methodological standards including penalized regression, stringent cross-validation, and calibration assessment.
4.3. Clinical and Public Health Implications
The model’s NPV of 0.80 offers meaningful information gain but does not support using it as a definitive rule-out tool, as a 20% miss rate is unacceptable for a potentially life-threatening condition [
14,
16]. Instead, it should guide risk stratification to adjust monitoring intensity [
11,
12]: women below the threshold continue standard prenatal care, whereas those above it receive enhanced surveillance (e.g., more frequent visits, home blood pressure monitoring, earlier referral). This tiered strategy maintains safety and allocates resources to higher-risk cases. The model serves as decision support that complements, not replaces, clinical judgment, with protocols allowing provider override when warranted [
38].
Translating this prediction model into routine clinical practice requires careful consideration of implementation workflows, technological infrastructure, and healthcare provider training in rural Honduran facilities. We propose a phased implementation strategy aligned with established frameworks for clinical AI deployment [
11,
12].
Before considering clinical application, this model requires external validation in independent cohorts, comparison with existing triage approaches, and demonstration of improved clinical outcomes. The modest discrimination (AUROC 0.61) and limited sample size necessitate substantial refinement. Future implementation research should evaluate clinical utility through decision curve analysis and prospective trials comparing model-guided versus standard care.
From a cardiovascular perspective, improving early detection of hypertensive disorders of pregnancy may also yield long-term benefits beyond obstetric care. Current ESC (2025) and AHA (2024) guidelines recognize HDP as an independent risk factor for chronic hypertension, heart failure, ischemic heart disease, and stroke later in life. Therefore, structured identification of at-risk women within pregnancy represents an opportunity for early cardiovascular risk stratification and prevention. Integrating postpartum cardiovascular follow-up and lifestyle or pharmacologic interventions for women with a history of HDP could significantly reduce long-term cardiovascular morbidity, aligning with the global shift toward life-course cardiovascular health promotion in women.
4.4. Limitations
Several limitations affect interpretation of these findings. The cross-sectional design limits temporal inference and prevents evaluating true predictive performance for future HDP onset; clinically useful forecasting would require prospective, longitudinal data. The sample size (
n = 147), although adequate for an initial parsimonious model, produces imprecise performance estimates and limits subgroup assessment, as reflected in wide confidence intervals. Outcome classification relied on routine clinical diagnoses rather than standardized research criteria, introducing potential misclassification and bias; missing laboratory data and median imputation further constrain accuracy. The single-center design restricts generalizability, as model performance may vary across settings with different patient characteristics, measurement quality, and documentation practices, underscoring the need for multicenter and temporal validation. Technical evaluation alone was performed; the study did not assess clinical impact, net benefit, or effects on workflow, which are required before determining practical utility. Additionally, key predictors such as dipstick proteinuria suffer from substantial measurement error, particularly in resource-limited environments [
39,
40], attenuating associations and limiting discrimination [
16]. Although higher-quality quantitative proteinuria measures would improve prediction [
3], they remain infeasible in many rural Honduran facilities, highlighting persistent trade-offs between predictive quality and operational feasibility [
9]. Despite these constraints, the results support the feasibility of developing simple prediction models for maternal risk assessment in low-resource settings.
4.5. Future Research Directions
External validation in independent cohorts across multiple facilities is essential to assess generalizability and refine performance. Prospective studies should evaluate model performance across clinically relevant subgroups (maternal age categories, parity, gestational age ranges, comorbidity status) and geographic settings.
Incorporating longitudinal data with serial measurements may improve predictive performance by capturing disease trajectory patterns. Decision curve analysis would provide more complete assessment of clinical utility across threshold probabilities compared to discrimination metrics alone. Implementation research should evaluate real-world acceptability, adoption rates, and clinical outcomes through prospective trials. Cost-effectiveness analysis comparing model-guided versus standard triage would inform resource allocation decisions. These investigations will determine whether structured risk assessment improves outcomes relative to current practice in resource-limited settings.
4.6. Model Architecture Considerations and Performance Context
The AUROC of 0.614 achieved by our model merits careful contextualization within both the clinical implementation setting and methodological constraints. While this performance falls below conventional thresholds for high-stakes clinical decision making in well-resourced environments (typically AUROC ≥ 0.75), it represents meaningful discriminative capacity relative to unstructured clinical judgment and no-model alternatives commonly employed in rural primary care settings. Recent meta-analyses of clinical prediction models for HDP reveal substantial heterogeneity in reported performance, with AUROC values ranging from 0.58 to 0.92 depending on predictor availability, outcome definition, and study design [
36,
41].
Studies utilizing only clinical variables without specialized biomarkers consistently report AUROCs of 0.62–0.70 [
35,
36], suggesting an inherent ceiling to predictive performance when relying on routinely available data. Our model’s performance aligns with this established range while employing particularly stringent internal validation methodology. Crucially, discriminative performance (AUROC) represents only one dimension of clinical utility. Our model demonstrates three characteristics that enhance practical value despite moderate discrimination: first, reasonable calibration across mid-range predicted risks (
Figure 3) means that predicted probabilities broadly correspond to observed risks, helping clinicians make risk-proportionate decisions. Second, the negative predictive value (0.80) supports risk stratification by identifying lower-risk women who may continue routine care with standard monitoring, although, as noted above, it is insufficient for definitive rule-out. Third, the model’s transparency and simplicity facilitate clinical understanding, trust building, and integration into existing workflows, factors increasingly recognized as prerequisites for successful AI implementation [
11,
12].
Alternative approaches such as deep learning architectures or elaborate feature engineering might incrementally improve discrimination but would fundamentally compromise use in our target setting. Such models require computational infrastructure, technical expertise for maintenance, and complex input pipelines that are incompatible with rural health systems where electricity and internet connectivity remain intermittent. Moreover, the “black box” nature of complex models undermines clinician confidence and inhibits the shared decision making that represents best practice in maternal health [
9]. We emphasize that this initial model should not be viewed as a definitive diagnostic tool but rather as a structured risk assessment framework intended to complement existing clinical processes. Future iterations incorporating longitudinal data, additional centers, and refined predictors may achieve superior discrimination; the current model offers systematic evidence integration and explicit probability quantification, although its advantage over unaided clinical judgment remains to be demonstrated formally, as discussed in Section 4.7.
4.7. Comparison with Standard Clinical Practice
An important consideration for evaluating the clinical value of any prediction model involves comparison with existing decision-making approaches rather than solely reporting absolute performance metrics [
14,
16]. In the absence of a structured prediction model, triage decisions at rural Honduran facilities typically rely on threshold-based rules applied to individual variables, most commonly a single blood pressure measurement ≥ 140/90 mmHg combined with clinician gestalt regarding symptom severity and patient reliability for follow-up. This heuristic approach essentially employs blood pressure as a univariate classifier without formal probability quantification or systematic integration of additional risk factors.
Our model offers several potential advantages over this standard practice. First, it explicitly integrates multiple sources of information (blood pressure, proteinuria, symptoms, maternal characteristics, obstetric history) through weighted coefficients that reflect their relative contributions to HDP risk, rather than treating each variable as an independent criterion [
18]. Second, it provides calibrated probability estimates that enable proportionate responses scaled to risk magnitude [
13], rather than binary high-risk/not-high-risk classifications. Third, it reduces inter-provider variability by standardizing risk assessment across clinicians with differing experience levels and training backgrounds [
42,
43,
44].
However, we acknowledge the absence of formal comparative analysis quantifying performance improvement relative to simple blood pressure thresholds or other intuitive decision rules. Such analysis would require prospective data collection capturing both model predictions and clinician judgments for identical cases, followed by comparison of diagnostic accuracy, referral patterns, and ultimately clinical outcomes [
12]. This represents an essential next step for establishing incremental value beyond current practice and will be incorporated into planned implementation research. Until such comparative evidence exists, claims regarding superiority over standard clinical practice remain hypothetical despite the model’s theoretical advantages in information integration and risk quantification.