1. Introduction
In modern AI research, there is a growing demand for methods that can extract nonlinear, multivariate patterns from high-dimensional biomedical data while retaining interpretability. This is particularly important in clinical and physiological contexts, where black-box models may limit translation. Explainable AI approaches are increasingly used to bridge this gap, enabling accurate predictions and, in parallel, promoting a deeper understanding of the factors driving model outputs.
A relevant example is the regulation of hormone balance in postmenopausal women. The underlying data are high-dimensional and complex: numerous factors are intertwined nonlinearly and vary substantially between individuals, which hides relationships that conventional statistical methods cannot capture. Applying interpretable AI to this problem is therefore well suited both to demonstrating the effectiveness of such methods and to generating new clinical insights. The medical background to this problem is described in detail below.
From the onset of puberty through menopause, progesterone and estradiol interact in complex ways (sometimes independently, at times synergistically, but most often antagonistically) to regulate physiological processes and promote health across the reproductive lifespan. Acting in synergy, progesterone enhances estradiol’s cardioprotective role in reducing exercise-induced myocardial ischemia [1], illustrating how the coordinated signaling between these two reproductive hormones from the steroid hormone biosynthesis pathway supports physiological homeostasis. Even when acting in opposition, their dynamic interplay can be essential for the coordination of normal physiological function. One of the most well-characterized examples of functional antagonism is progesterone’s regulatory role in the endometrium: following estradiol-induced proliferation in the follicular phase, progesterone acts to stabilize the endometrial lining in the luteal phase by suppressing further proliferation while, in parallel, promoting the secretion of proteins, lipids, and growth factors necessary for implantation and tissue homeostasis [2]. Taken together, progesterone modulates estrogen-dependent processes by attenuating, amplifying, or even mimicking them. Given their interdependence, examining the progesterone-to-estradiol (P4:E2) ratio is more informative than evaluating each hormone independently.
Disruption of the tightly regulated progesterone antagonism, whether due to abnormal systemic levels of estradiol and progesterone or to cellular-level dysfunction, can have widespread consequences for reproductive and systemic health. According to the widely accepted unopposed estrogen theory, first described by Key et al. (1988) [3], estrogen that is not opposed by an adequate progesterone concentration can exert unregulated mitogenic effects, leading to excessive endometrial proliferation and, ultimately, the development of endometrial hyperplasia and adenocarcinoma. This concept has informed therapeutic strategies that leverage progesterone’s antiproliferative effects on the endometrium. Although progesterone treatment is not indicated for all forms of endometrial cancer, it is incorporated into the management of complex atypical hyperplasia and clinical stage 1A low-grade endometrial tumors in patients who are not surgical candidates. In such cases, reversal of endometrial hyperplasia can be observed in as little as 10 weeks following the initiation of treatment [4].
Epidemiological studies investigating the association of circulating estradiol and progesterone levels with endometrial cancer have been difficult to execute, largely due to historical reliance on immunoassay technologies that lack the specificity and sensitivity to accurately differentiate steroid hormones. The more recent adoption of mass spectrometry has overcome these limitations by offering highly specific, sensitive, and reproducible hormone quantification, making it the preferred method in both research and clinical settings. Using this approach, and in line with the strong biological premise discussed above, recent findings indicate that pre-diagnostic levels of progesterone relative to estradiol in postmenopausal women are inversely associated with endometrial cancer risk [5].
Notably, endogenous progesterone appears to play divergent roles in relation to estradiol, reducing risk in the endometrium but increasing it in the breast. The biological basis for progesterone’s enhancement of estradiol-mediated risk in breast cancer lies in the ability of estrogen to “prime” the expression of progesterone receptors, thereby amplifying the carcinogenic potential of progesterone to induce cell proliferation, activate stem cells, and promote angiogenesis and branching via intermediary cell-intrinsic (or paracrine) mediators. This enhanced progesterone signaling creates a favorable environment for breast cancer development and progression [6]. The divergent role of progesterone in endometrial versus breast cancer risk underscores the importance of contextualizing hormonal balance within specific disease pathways.
Given the growing recognition of the P4:E2 ratio as a biologically meaningful marker of endometrial and breast cancer risk, this study aimed to identify its predictors among postmenopausal women in the United States. Leveraging data from the National Health and Nutrition Examination Survey (NHANES), the approach implemented here uniquely combined two major methodological strengths. First, it relied upon NHANES-implemented gold-standard mass spectrometry—specifically, isotope dilution liquid chromatography–tandem mass spectrometry (ID LC-MS/MS)—for the measurement of circulating hormone concentrations with high specificity and sensitivity, thus overcoming the limitations of traditional immunoassay-based approaches. Second, it employed a supervised machine learning framework to model the relationship between the P4:E2 ratio and a broad array of features spanning hormonal, demographic, dietary, and inflammatory domains. This approach enabled the identification of complex, potentially nonlinear relationships while ensuring rigorous model validation through cross-validation and performance benchmarking.
In addition to modeling the P4:E2 ratio, estradiol and progesterone were analyzed as individual outcomes to disentangle the distinct pathways governing each hormone. While the ratio offers a useful integrative marker, its components may be influenced by partially independent biological processes where estradiol and progesterone exhibit different temporal dynamics and sources of production. Modeling estradiol and progesterone separately allowed for the detection of unique predictors and clarified whether shared or divergent mechanisms underlie their ratio.
Taken together, the use of high-resolution hormone quantification and interpretable machine learning models positioned this study to advance our understanding of hormonal regulation in postmenopausal women and generate hypotheses for future clinical and epidemiological investigations.
4. Discussion
In this study of 1902 postmenopausal women from the NHANES dataset, we developed a machine learning model to identify key predictors of the P4:E2 ratio, an emerging risk factor for endometrial and breast cancer. The model achieved an R² of 0.298 on the test set, indicating that approximately 30% of the variance in the log-transformed P4:E2 ratio could be explained by the selected predictors. We found that FSH, waist circumference, and CRP were the most influential predictors, followed by total cholesterol, LH, and intake of dietary fat, protein, and sugar. Additional models revealed that FSH and waist circumference primarily predicted estradiol levels, while progesterone was more strongly influenced by cholesterol and LH. These findings offer new insights into the hormonal, metabolic, and lifestyle correlates of the P4:E2 ratio and provide a foundation for future work aimed at understanding its role in postmenopausal health and disease risk.
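To make the "variance explained" reading concrete, the following minimal Python sketch computes the coefficient of determination as one minus the ratio of the residual to the total sum of squares. The function name `r_squared` and all numeric values are hypothetical illustrations; they are not NHANES data and not the study's code.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variance around the mean (times n).
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical log-transformed P4:E2 values and model predictions.
observed = [0.2, 0.9, 1.4, 2.1, 2.9]
predicted = [0.8, 1.1, 1.5, 1.8, 2.3]
score = r_squared(observed, predicted)
print(f"R^2 = {score:.3f}")
```

A model that always predicts the mean of the observed values scores 0 under this definition, which is why an R² near 0.3 corresponds to roughly 30% of the variance being explained.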
The observed positive association between FSH and the P4:E2 ratio is biologically consistent with known endocrine adaptations to menopause [7]. As ovarian function declines, circulating levels of both progesterone and estradiol decrease, resulting in diminished negative feedback on the hypothalamic–pituitary–gonadal axis. Notably, progesterone production stabilizes at low levels by the onset of menopause, whereas estradiol continues to be produced in more variable quantities, with progressively lower and more stable levels post menopause. As a result, with the progression to post menopause, the P4:E2 ratio increases, leading to compensatory increases in FSH. This relationship is visualized in the SHAP dependence plot (Figure 2), which reveals a strong, nonlinear positive association between FSH and the predicted log-transformed P4:E2 ratio. SHAP values increase sharply with FSH concentrations up to approximately 75 mIU/mL, after which the curve plateaus, indicating an inflection point beyond which additional increases in FSH contribute minimally to the model’s output. This plateau may reflect a biological “ceiling effect,” wherein estradiol levels have already reached minimal postmenopausal values, thereby limiting the predictive utility of further increases in FSH.
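The way a plateau appears in a SHAP dependence plot can be illustrated with a small, self-contained sketch. The code below computes exact Shapley values by enumerating feature coalitions for a toy model whose FSH effect saturates at 75 mIU/mL. Everything in it is an assumption made for illustration: `toy_model`, its coefficients, and the background rows are hypothetical, and exact enumeration stands in for whatever estimator the study's SHAP implementation used.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, List, Set

Row = Dict[str, float]


def toy_model(row: Row) -> float:
    """Hypothetical predictor of the log P4:E2 ratio (not the study's model)."""
    fsh_effect = 0.02 * min(row["fsh"], 75.0)  # rises, then flat beyond 75 mIU/mL
    waist_effect = -0.01 * row["waist"]        # larger waist lowers the ratio
    return fsh_effect + waist_effect


def coalition_value(model: Callable[[Row], float], x: Row,
                    background: List[Row], subset: Set[str]) -> float:
    """Mean prediction with features in `subset` taken from x, the rest from background."""
    total = 0.0
    for b in background:
        mixed = {k: (x[k] if k in subset else b[k]) for k in x}
        total += model(mixed)
    return total / len(background)


def shapley_values(model: Callable[[Row], float], x: Row,
                   background: List[Row]) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition of the other features."""
    features = list(x)
    n = len(features)
    phi: Dict[str, float] = {}
    for f in features:
        others = [g for g in features if g != f]
        value = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = (coalition_value(model, x, background, s | {f})
                        - coalition_value(model, x, background, s))
                value += weight * gain
        phi[f] = value
    return phi


# Hypothetical background sample (FSH in mIU/mL, waist circumference in cm).
background = [
    {"fsh": 30.0, "waist": 85.0},
    {"fsh": 60.0, "waist": 95.0},
    {"fsh": 90.0, "waist": 105.0},
    {"fsh": 120.0, "waist": 115.0},
]

# Dependence curve: Shapley value of FSH as FSH varies, waist held at 95 cm.
fsh_grid = [25.0, 50.0, 75.0, 100.0, 125.0]
dependence = {
    fsh: shapley_values(toy_model, {"fsh": fsh, "waist": 95.0}, background)["fsh"]
    for fsh in fsh_grid
}
for fsh, phi_fsh in dependence.items():
    print(f"FSH={fsh:6.1f}  SHAP(FSH)={phi_fsh:+.3f}")
```

In this toy setting the FSH attribution rises up to 75 mIU/mL and is identical for all higher values, which is the kind of flat segment described above. Enumeration scales exponentially with the number of features, so it is practical only for small illustrations like this one.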
Waist circumference emerged as a key feature influencing the P4:E2 ratio, with SHAP dependence plots suggesting that this effect was primarily driven by estradiol elevations associated with adiposity, while progesterone contributed a waist circumference–restricted effect that emerged only beyond higher thresholds of central adiposity. In the estradiol model, SHAP values increased steadily with waist circumference, reflecting the well-established role of adipose tissue as a site of peripheral estrogen biosynthesis via aromatization [8]. This adiposity-related rise in estradiol exerts downward pressure on the P4:E2 ratio by disproportionately elevating estradiol relative to progesterone. Notably, starting at waist circumferences of approximately 100 cm, a modest progesterone decline also emerged, suggesting that at this level of adiposity, the influence of central adiposity may extend to both hormones, with progesterone dynamics contributing a secondary, waist circumference–dependent effect that reinforces the downward slope of the ratio.
C-reactive protein (CRP) emerged as one of the strongest non-hormonal predictors of the P4:E2 ratio, with SHAP dependence plots revealing a sharp decline in the ratio at lower CRP concentrations, followed by a plateau beyond approximately 5 mg/L. In the estradiol model, SHAP values increased steeply below this threshold and then stabilized, indicating a nonlinear relationship. The effects of estradiol on inflammation are context-dependent and can be pro- or anti-inflammatory depending on the cytokine profile, immune cell type, and estrogen receptor expression patterns [12]. Pro-inflammatory actions of estradiol, as suggested in the present study assessing the hormone’s association with CRP, are mediated through estrogen receptor signaling pathways that activate transcription factors, particularly in immune and endothelial cells [9]. The pattern observed in the present study underscores the importance of accounting for the concentration-sensitive interactions between estrogenic activity and inflammatory signaling in postmenopausal physiology, especially considering the altered distribution and function of estrogen receptor subtypes that occur with aging. In fact, prior studies examining the CRP–estradiol associations in postmenopausal women yielded mixed results, with some reporting a positive association and others finding no significant relationship (reviewed in [13]). The use of machine learning in the present analysis enabled the detection of threshold-dependent, positive nonlinear associations that help to reconcile these discrepancies and offer a more nuanced understanding of inflammation–estradiol dynamics.
Total cholesterol exhibited a nonlinear, predominantly positive association in both the P4:E2 and progesterone models. The relationship between estradiol and total cholesterol was relatively weak and more linear, suggesting a limited role for cholesterol in estradiol regulation. The divergence in the association between the two reproductive hormones (i.e., estradiol and progesterone) and total cholesterol likely reflects differences in their positions within the steroid biosynthesis pathway, with progesterone situated upstream and closer to cholesterol than estradiol. In the P4:E2 ratio model, SHAP values increased notably between approximately 140 and 200 mg/dL, with a plateau observed at higher concentrations, indicating a threshold-dependent effect on the ratio. The progesterone model revealed a strong and pronounced positive association, with SHAP values rising steeply between 120 and 220 mg/dL before stabilizing, highlighting cholesterol as a key metabolic predictor of progesterone levels (and the higher P4:E2 ratio) in postmenopausal women.
The waist circumference, CRP, and cholesterol findings may have practical relevance for clinical screening and risk stratification in postmenopausal women. Waist circumference, CRP, and total cholesterol are measurable in clinical settings, and they may serve as indirect markers of hormonal imbalance, particularly in relation to the P4:E2 ratio, which plays a key role in regulating tissue-specific estrogenic effects. The nonlinear rise in estradiol with increasing waist circumference reinforces the contribution of central adiposity to extragonadal estrogen production, while the concurrent decline in progesterone at higher adiposity thresholds may further shift the hormonal balance toward an estrogen-dominant profile. This shift is clinically relevant given the P4:E2 ratio’s influence on downstream outcomes in hormone-sensitive tissues. Similarly, the threshold-dependent associations with CRP and cholesterol underscore how inflammatory and metabolic states can modulate both components of the ratio, either by driving estrogen synthesis or influencing upstream steroidogenic pathways that affect progesterone availability. Recognizing these interrelationships could guide personalized strategies to improve metabolic and inflammatory health as a means of indirectly promoting more favorable hormonal dynamics and, in turn, lowering disease risk.
Although ovulatory cycles cease after menopause, the pulsatile release of LH often mirrors that of FSH, albeit with a smaller secretory amplitude [14]. The nonlinear relationship between LH and the P4:E2 ratio appears to reflect distinct, and at times opposing, contributions from estradiol and progesterone. In the present study, the stimulatory effect of LH on progesterone output plateaued at approximately 40 mIU/mL, contributing to an increase in the ratio within this range. Conversely, estradiol demonstrated a continuous positive association with LH across the entire range of values, exerting a countervailing influence that tempered the rise in the ratio up to 40 mIU/mL. Beyond this threshold, the persistent rise in estradiol, combined with the plateauing of progesterone, drove the ratio downward.
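The rise-then-fall pattern follows from simple arithmetic on the log scale, since log(P4/E2) = log(P4) - log(E2). The sketch below uses two hypothetical curves, one that plateaus at 40 mIU/mL and one that keeps rising, and subtracts them. The slopes are invented for illustration, and the identity holds for the hormones themselves; SHAP values from three separately fitted models need not subtract this exactly.

```python
def progesterone_component(lh: float) -> float:
    """Hypothetical log-progesterone contribution: rises with LH, flat beyond 40 mIU/mL."""
    return 0.03 * min(lh, 40.0)


def estradiol_component(lh: float) -> float:
    """Hypothetical log-estradiol contribution: keeps rising across the LH range."""
    return 0.01 * lh


def log_ratio_component(lh: float) -> float:
    # log(P4 / E2) = log(P4) - log(E2), so the two contributions subtract.
    return progesterone_component(lh) - estradiol_component(lh)


lh_grid = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
curve = [log_ratio_component(lh) for lh in lh_grid]
peak_lh = lh_grid[curve.index(max(curve))]

for lh, value in zip(lh_grid, curve):
    print(f"LH={lh:5.1f} mIU/mL  log-ratio contribution={value:+.2f}")
print("peak at LH =", peak_lh)
```

Below the plateau the steeper progesterone slope dominates and the ratio climbs; above it only the estradiol term keeps changing, so the ratio declines.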
Carbohydrate intake emerged as the most consistent and meaningful dietary contributor across the three models. It showed a positive association with estradiol, particularly within the ~100 to ~250 g/day range, beyond which the effect plateaued. A similar but more attenuated pattern was observed in the P4:E2 ratio model, with SHAP values increasing gradually and leveling off beyond ~200 g/day. In the progesterone model, the association with carbohydrate intake was relatively flat, with only minor positive effects observed at lower intake levels, indicating a less consistent relationship. The remaining dietary measures exhibited weaker, inconsistent, or minimal effects on hormonal outcomes. They tended to show nonlinear but shallow SHAP profiles, often centering near zero or demonstrating fluctuating associations that lacked clear thresholds or sustained impact across models.
The SHAP dependence plot for age at menarche in the P4:E2 ratio model revealed a U-shaped pattern, with a modest decline in the ratio observed between approximately ages 10 and 13, followed by a steady increase beyond this range. This shape appears to reflect contrasting associations in the component hormone models: in the estradiol model, earlier age at menarche was associated with higher estradiol levels, while in the progesterone model, a positive association emerged at later menarche ages. These opposing trends resulted in a biphasic effect on the ratio, where the influence of estradiol predominates at younger menarcheal ages, lowering the ratio, and progesterone’s influence becomes more apparent at later ages, pushing the ratio upward. This composite pattern underscores how developmental timing may impart lasting effects on postmenopausal hormonal balance through divergent effects of individual steroid hormones. The SHAP dependence plots for age across the three models (estradiol, progesterone, and the P4:E2 ratio) show largely modest and inconsistent effects, suggesting limited explanatory value of chronological age alone in postmenopausal hormone variability.
Estrone sulfate serves as a circulating estrogen reservoir that can be converted to bioactive estradiol via the intermediate estrone conversion step. This process appears to be limited in postmenopausal women, as the SHAP dependence plots for the estrone sulfate/estrone ratio reveal only subtle associations across the estradiol and P4:E2 models. In the estradiol model, there was a mild nonlinear relationship, with SHAP values increasing slightly at low estrone sulfate/estrone ratio values, followed by a plateau, suggesting a modest positive influence of a higher sulfate-to-parent hormone balance on estradiol levels. Similarly, the P4:E2 ratio model exhibited minimal SHAP variation across the estrone sulfate/estrone ratio range, indicating limited influence on the P4:E2 ratio itself. These findings suggest that while estrone sulfate may contribute to estradiol availability, its impact is not strong enough to meaningfully affect the balance between progesterone and estradiol in the postmenopausal context. One possible explanation for its weaker predictive value is that the conversion of estrone sulfate to estradiol is a dynamic, context-dependent process that may not exert a substantial influence under stable, non-cyclic hormonal conditions typical of menopause.
Having examined the individual SHAP dependence patterns above, a more integrated understanding takes shape regarding the interplay between global feature importance and context-specific hormonal dynamics. Estradiol consistently emerged as the dominant hormonal driver across the examined features, demonstrating the highest SHAP magnitudes and most pronounced associations in both the individual estradiol model and the P4:E2 ratio model (Table 3). However, although the overall SHAP magnitude for the progesterone model was modest, the hormone nonetheless exerted a meaningful influence on the P4:E2 ratio in specific contexts. This was particularly evident for features such as total cholesterol, LH, and waist circumference, where SHAP dependence plots showed that progesterone altered the shape and direction of the ratio’s response. These findings underscore the importance of considering biological relevance alongside global model performance metrics. Progesterone’s sensitivity to upstream metabolic and gonadotropic signals, even if less predictive in isolation, can meaningfully modulate hormonal balance, particularly in systems modeled as ratios. Thus, while estradiol was the dominant driver of SHAP variance in most cases, progesterone’s context-specific contributions add interpretive depth to mechanistic inferences.
The mechanistic insights presented here align with epidemiological evidence linking distinct patterns of progesterone and estradiol concentrations to hormone-sensitive cancer risk in postmenopausal women. Notably, endogenous progesterone appears to play divergent roles in relation to estradiol, reducing risk in the endometrium but increasing it in the breast. In a case–cohort study nested within the Breast and Bone Follow-up to the Fracture Intervention Trial examining endometrial cancer incidence in relation to progesterone to estradiol ratio in postmenopausal women during a 12-year follow-up, Trabert et al. (2021) [5] reported that postmenopausal women with high estradiol and low progesterone had the highest risk of developing endometrial cancer, while those with higher progesterone levels exhibited reduced risk. In contrast, in the same cohort, analysis of 405 incident breast cancer cases revealed that elevated progesterone concentrations were associated with an increased risk of invasive breast cancer, particularly when estradiol levels were also high [15]. Together, these data underscore the complex and tissue-specific roles of progesterone in hormone-sensitive cancers, exerting protective effects in the endometrium while potentially promoting tumorigenesis in the breast. Indeed, although both clinical and epidemiological studies support a synergistic role of estradiol and progesterone in elevating breast cancer risk, disentangling their individual contributions remains challenging due to the partial dependence of progesterone receptor transcription on estrogen receptor α-mediated signaling [16]. Thus, evaluating their combined hormonal interaction may be more informative than attempting to isolate independent effects [16].
The divergent role of progesterone in endometrial versus breast cancer risk underscores the importance of contextualizing hormonal balance within specific biological outcomes and disease pathways, reinforcing the need for mechanistically informed, tissue-targeted research in postmenopausal women. In this regard, the present study’s feature-level SHAP modeling offers a framework for disentangling the nuanced, context-dependent effects of individual hormones. As an example, the analysis of waist circumference revealed how central adiposity may elevate estradiol and reduce progesterone in a threshold-dependent manner (i.e., waist circumference > 100 cm) to promote endometrial proliferation while potentially mitigating breast carcinogenesis.
Despite offering valuable insights into the determinants of the P4:E2 ratio, several limitations should be noted. The cross-sectional nature of the NHANES dataset limits causal inference, as the temporal ordering between predictors and hormone levels cannot be established. Additionally, because NHANES is a U.S.-based survey, the generalizability of these findings to other populations may be limited by differences in genetics, lifestyle, healthcare access, and environmental exposures. Replication in diverse cohorts and longitudinal designs will be important for validating and extending these findings. Although the P4:E2 model explained a moderate proportion of variance (R² = 0.298), the progesterone model demonstrated low predictive performance (R² = 0.022). Progesterone levels in postmenopausal women are typically low and somewhat stable, resulting in limited variance for the model to predict. This restricted range inherently constrains predictive power. In addition, unmeasured factors may also influence progesterone variability, such as residual adrenal activity, enzymatic pathways involved in steroidogenesis and metabolism, and behavioral factors (e.g., sleep) that were not captured in the NHANES dataset. Finally, while SHAP was selected for its strong theoretical foundation and ability to provide both global and local explanations, the method is not without limitations. Specifically, SHAP values can be affected by multicollinearity among features, and they may introduce variance due to sampling during estimation. Although we did not perform a full sensitivity analysis comparing SHAP to other XAI methods, the key contributors identified, such as FSH and waist circumference, are supported by strong biological plausibility and consistent patterns in the dependence plots.
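The sampling-variance caveat can be illustrated with a minimal sketch of a permutation-sampling Shapley estimator. It averages the marginal gain of one feature over random feature orderings and random background rows, and repeats the estimate under different random seeds to show how the spread shrinks as the number of samples grows. The model, its interaction term, the data, and the sample sizes are all hypothetical; this is not the estimator used in the study.

```python
import random
from statistics import mean, pstdev
from typing import Callable, Dict, List

Row = Dict[str, float]


def toy_model(row: Row) -> float:
    """Hypothetical model with a waist x CRP interaction (not the study's model)."""
    return (0.02 * min(row["fsh"], 75.0)
            - 0.01 * row["waist"]
            - 0.0005 * row["waist"] * row["crp"])


def sampled_shapley(model: Callable[[Row], float], x: Row, background: List[Row],
                    feature: str, n_samples: int, rng: random.Random) -> float:
    """Monte Carlo Shapley estimate for one feature via random orderings."""
    names = list(x)
    gains = []
    for _ in range(n_samples):
        order = names[:]
        rng.shuffle(order)
        b = rng.choice(background)
        # Features ordered before `feature` take the instance's values;
        # everything else is filled in from the sampled background row.
        before = set(order[:order.index(feature)])
        without = {k: (x[k] if k in before else b[k]) for k in names}
        with_f = dict(without)
        with_f[feature] = x[feature]
        gains.append(model(with_f) - model(without))
    return mean(gains)


# Hypothetical instance and background (FSH mIU/mL, waist cm, CRP mg/L).
x = {"fsh": 80.0, "waist": 100.0, "crp": 6.0}
background = [
    {"fsh": 30.0, "waist": 85.0, "crp": 1.0},
    {"fsh": 60.0, "waist": 95.0, "crp": 2.0},
    {"fsh": 90.0, "waist": 105.0, "crp": 4.0},
    {"fsh": 120.0, "waist": 115.0, "crp": 8.0},
]

# Repeat the estimate with different seeds at two sample sizes.
small_estimates = [sampled_shapley(toy_model, x, background, "crp", 20, random.Random(s))
                   for s in range(30)]
large_estimates = [sampled_shapley(toy_model, x, background, "crp", 2000, random.Random(s))
                   for s in range(30)]
spread_small = pstdev(small_estimates)
spread_large = pstdev(large_estimates)
print(f"spread with   20 samples: {spread_small:.4f}")
print(f"spread with 2000 samples: {spread_large:.4f}")
```

Both settings target the same underlying attribution, but the estimate with fewer samples varies more from seed to seed, which is one reason to report the estimation settings alongside SHAP results.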