Fair and Explainable Multitask Deep Learning on Synthetic Endocrine Trajectories for Real-Time Prediction of Stress, Performance, and Neuroendocrine States

Abdullah; Fatima, Zulaikha; Sánchez Mejorada, Carlos Guzman; Ather, Muhammad Ateeb; Oropeza Rodríguez, José Luis; Sidorov, Grigori

doi:10.3390/computers14120515

Open Access Article


by Abdullah ^1,2,†, Zulaikha Fatima ^3,†, Carlos Guzman Sánchez Mejorada ^1,*, Muhammad Ateeb Ather ^2,*, José Luis Oropeza Rodríguez ^1 and Grigori Sidorov ^1

^1 Center for Computing Research, Instituto Politécnico Nacional, Mexico City 07738, Mexico

^2 Department of Computer Sciences, Bahria University, Lahore 54600, Pakistan

^3 Faculty of Allied Health Sciences, Superior University, Lahore 54000, Pakistan

^* Authors to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Computers 2025, 14(12), 515; https://doi.org/10.3390/computers14120515

Submission received: 23 October 2025 / Revised: 14 November 2025 / Accepted: 19 November 2025 / Published: 25 November 2025

(This article belongs to the Special Issue Wearable Computing and Activity Recognition)


Abstract

Cortisol and testosterone are key digital biomarkers reflecting neuroendocrine activity across the hypothalamic–pituitary–adrenal (HPA) and hypothalamic–pituitary–gonadal (HPG) axes, encoding stress adaptation and behavioral regulation. Continuous real-world monitoring remains challenging due to the sparsity of sensing and the complexity of multimodal data. This study introduces a synthetic sensor-driven computational framework that models hormone variability through data-driven simulation and predictive learning, eliminating the need for continuous biosensor input. A hybrid deep ensemble integrates biological, behavioral, and contextual data using bidirectional multitask learning with one-dimensional convolutional neural network (1D-CNN) and long short-term memory (LSTM) branches, meta-gated expert fusion, Bayesian variational layers with Monte Carlo Dropout, and adversarial debiasing. Synthetic longitudinal hormone profiles, validated with Kolmogorov–Smirnov (KS), Wasserstein, maximum mean discrepancy (MMD), and dynamic time warping (DTW) metrics, address class imbalance and temporal sparsity. The framework achieved up to 99.99% macro F1-score on augmented samples and more than 97% on unseen data, with an expected calibration error (ECE) below 0.001. Selective prediction, which defers low-confidence cases, achieved 99.9992–99.9998% accuracy on 99.5% of samples. The models were smaller than 5 MB, so they can run in real time on wearable devices. Explainability analyses identified the most important features at both the physiological and behavioral levels, demonstrating the framework's suitability for adaptive clinical or organizational stress monitoring.

Keywords:

cortisol; testosterone; hypothalamic pituitary adrenal axis; hypothalamic pituitary gonadal axis; stress prediction; hormone health; deep learning; multitask learning; synthetic data augmentation; wearable health monitoring; explainable AI; bioinformatics; biomedical signal processing

1. Introduction

Using wearable sensors, synthesized hormonal data, and multitask deep learning, this work presents the first real-time human-centric model to predict neuroendocrine, stress, cognitive and emotional states for intelligent interventions in personalized health and performance monitoring. Cortisol and testosterone at the molecular level both express circadian rhythms under the control of the HPA axis and the HPG axis, respectively [1,2], and rapidly respond to an acute stressor, but gradually recover toward a baseline afterward [3,4]. Alterations of these rhythms are associated with risk for cardiovascular, metabolic, reproductive, and neuropsychiatric dysfunction [5,6]. Modeling timely changes in biomarkers enables them to serve as early warnings of molecular deregulation, hence predicting disease onset [7].

In organizational settings, employee well-being, cognitive performance, and physiological health are mutually dependent, and all are essential to productivity and resilience [8]. More than 75% of workers experience chronic stress [9], which impairs focus, memory, and decision-making [10], and hormone imbalances make these effects worse [11]. These disruptions drive absenteeism, presenteeism, low job satisfaction, and healthcare costs [12,13,14], in turn reducing workforce productivity by 20% [15,16] and adding to the burden of obesity and heart disease [17]. Early and accurate detection of stress and hormone deviations is crucial for timely interventions aimed at optimizing organizational health [18].

Standard monitoring depends on behavioral indicators or self-reports [19] or single-point hormone measurements [1,3], which do not show circadian rhythms or sudden spikes [2,4], making personalized predictions less accurate [20]. These gaps underscore the necessity for integrative frameworks that amalgamate multimodal biological, behavioral, and contextual data to dynamically and interpretively predict stress, performance, and hormonal health [7,8].

Machine learning and synthetic data generation offer scalable methodologies to model these interactions longitudinally, facilitating the early identification of risk states and the development of fairness-conscious intervention strategies [11,12]. However, current models frequently exhibit deficiencies in interpretability, demographic equity, resilience to deployment limitations, and biologically plausible synthetic augmentation, thereby constraining their capacity to accurately represent temporal hormone dynamics [16,17].

There are still two issues with wearable technology: class imbalance for clinically significant states like “Critical” stress and biosensing sparsity (few hormone samples per subject), which hinders direct time-series learning. To overcome those issues, this work (A) proposes biologically constrained synthetic longitudinal hormone trajectories anchored to per-subject measures [18,19,20]. This allows for temporal modeling without the need for continuous biosensors, and (B) creates a hybrid, uncertainty-aware multitask pipeline that enhances minority-class detection, quantifies predictive uncertainty for selective human review, and provides understandable explanations for interventions. Specifically, we aim to develop a reproducible synthetic augmentation pipeline that maintains per-subject correlations, assess performance in both augmented and real-only scenarios, quantify calibration and fairness, and quantify interpretability using quantitative explainability metrics.

We introduce a hybrid deep ensemble framework designed to predict stress, performance, and hormonal health by modeling physiological, behavioral, and contextual interactions. A bidirectional multitask architecture learns both shared and task-specific representations, while a temporal branch uses convolutional and recurrent layers to capture changing hormone trajectories. Meta-gated expert fusion adaptively chooses relevant biological or behavioral modules for each sample, and adversarial debiasing reduces differences between demographic groups, so that the system is fair and easy to understand. Bayesian layers and ensemble strategies provide calibrated uncertainty estimates that can be trusted in sensitive real-world situations.

2. Key Contributions

We propose a deep ensemble model that combines biological, behavioral, and synthetic longitudinal hormone data in a bidirectional multitask framework to make concurrent, interpretable predictions of stress, performance, and hormone health.
We add a temporal branch based on 1D-CNN and long short-term memory (LSTM) layers to learn dynamic hormone changes, enhancing sensitivity to important temporal patterns.
We introduce a meta-gating strategy that dynamically selects expert modules for each sample, improving modular interpretability and individualized decision-making.
We leverage adversarial debiasing to mitigate demographic biases and, for uncertainty-aware predictions, combine multimodal Bayesian layers with Monte Carlo (MC) Dropout.
For real-time deployment and continuous monitoring, the models are compact (<5 MB) and fast (<20 ms latency), and state-of-the-art performance is reported together with explainability based on SHAP and counterfactual analysis.

Our methods bridge ML and healthcare, offering a scalable, interpretable, fair, and uncertainty-aware approach to real-time health monitoring through wearables and mobile devices. By combining biological fidelity and predictive analytics, this work contributes to the development of reliable AI solutions in organizational health management. The paper is organized as follows: Section 2 reviews related work; Section 3 describes the dataset and synthetic hormone augmentation; Section 4 presents the hybrid model and training; Section 5 reports the results; Section 6 discusses the implications; and Section 7 concludes.

3. Related Work

Wearables, hormone modeling, and advanced machine learning have driven personalized health tracking, enabling high-fidelity analysis of stress [21,22], aging [23], and circadian rhythmicity [24]. Wearable devices enable continuous and minimally intrusive physiological and behavioral monitoring, with unmatched temporal resolution. Multimodal data-driven machine learning methods have broadened the use of predictive modeling for complex biological and psychological phenomena. Even non-human studies, such as fish models that track cortisol and other biomarkers in the response to distress, show the potential of hormone-dependent stress phenotyping [25]. Synthetic longitudinal health datasets also support simulation and predictive modeling, allowing comprehensive evaluation of machine learning models [26]. Integrative methods that incorporate genetics, hormonal profiles, or brain function have revealed the complexity of psychiatric conditions such as depression, which underscores the need for multimodal and longitudinal data in prediction [27,28,29].

In medical applications, uncertainty-aware deep learning has gained attention due to its ability to provide well-calibrated confidence with the predictions for diabetic retinopathy classification or early cancer detection [30,31]. Analytical research investigating specific hormones has highlighted the role of testosterone and cortisol in precision medicine, personalized treatments, and health optimization programs [32,33]. Recent developments also comprise interpretable and causal ML frameworks for making actionable decisions by focusing on counterfactual reasoning, fairness, etc. [34].

Moreover, digital biomarkers and minimally invasive longitudinal monitoring methodologies drive scalable health assessment [35]. Real-time prediction of physiological stress from wearables supports continuous, context-sensitive monitoring in dynamic settings [36,37]. Complementary methodological contributions, namely kernel-based tests for dependent sequential data and multitask LSTM models for personalizing interventions, have strengthened inference and time-series modeling [38,39].

Together, these papers demonstrate important advances in the development of wearable devices that connect with hormone dynamics and sophisticated ML methods. Yet, major gaps exist in models that have multi-hormonal joint modeling capability in a multitask, interpretable, and uncertainty-aware manner, and which can effectively utilize all forms of longitudinal information according to their modalities. Current wearable and machine learning (ML)-based stress/prediction efforts usually use sparse or cross-sectional single-point hormone measures that miss transient and circadian stress dynamics, ignore demographic confounding, or only use ad hoc fairness checks. They also only offer an inadequate quantitative assessment of the fidelity of synthetic data (few report KS/MMD/DTW statistics).

In small-sample cohorts, these restrictions impair clinical interpretability and generalizability. To fill these gaps, our pipeline incorporates adversarial debiasing, sets up per-subject anchoring during augmentation, explicitly models dual-hormone temporal trajectories (cortisol and testosterone), and reports statistical fidelity tests (KS, Wasserstein, MMD, DTW) and held-out real-only evaluations. A logic-progression approach may help bridge these gaps to translate predictable intuition into robust and ethical medical decision-making.

4. Research Methodology

We propose a hybrid deep ensemble that integrates biological, behavioral, and contextual signals for stress, performance, and hormone health prediction. The dataset consists of hormone and behavior data, with cortisol and testosterone measured using a validated assay. Instead of designing new hardware, the research constructs a processing infrastructure that converts conventional biosensor signals into actionable digital biomarkers, so the system can be easily interfaced with prospective wearable platforms for real-time hormone monitoring. The bidirectional multitask design allows forward and reverse inference, which increases biological consistency and interpretability. A meta-gated expert network weights modalities for each sample, while adversarial debiasing, Bayesian ensembles, and Monte Carlo Dropout deliver calibrated, fairness-aware, and risk-conscious predictions. Longitudinal features extracted from synthetic 7-day cortisol and testosterone sequences are used as inputs to a 1D-CNN+LSTM-based temporal branch, which captures spikes, slopes, and circadian dynamics. Class-aware sampling, focal loss, temperature scaling, and selective prediction deal with rare high-risk classes and uncertainty. SHAP attribution, meta-gating visualization, and counterfactual analysis offer an interpretable interface for agents. Our framework combines fairness-aware multitask learning, temporal biological fidelity, and uncertainty-informed decision support in a single, robust system.

We conducted a full demographic analysis to ensure fair representation across all groups. Appendix A presents the complete demographic distributions, cross-tabulations for all outcome labels, and statistical balance tests (chi-square/Fisher exact tests for categorical variables; ANOVA/Kruskal–Wallis tests for continuous variables with Benjamini–Hochberg FDR correction). No substantial demographic imbalance was observed, apart from a minor age-related effect, and subgroup analyses demonstrated uniform accuracy and Macro-F1 across demographic groups. We also assessed predictor importance using SHAP values, permutation importance, LASSO/ElasticNet stability selection, and RFE-CV. A compact set of approximately 8–15 predictors retained ≥92–96% of the full-model performance, with RFE curves plateauing near 15 features and stability coefficients $\geq 0.60$ indicating robust minimal sets. Appendix B provides the complete predictor rankings, stability analyses, and reduced-feature performance curves.

4.1. Dataset Overview

The primary dataset comes from the Columbia MBA Hormone Diversity Team Study, which includes 370 persons nested within 74 teams. For each person it provides log-transformed cortisol and testosterone levels, demographics (age, gender, ethnicity, country), and team-level outcomes: performance; cash allocation in groups of four players playing a lenient accept-all-or-reject-the-deal UG game [29]; rank order; and cooperation level, defined as the amount offered in each round of the repeated TG, where participants do not remember the last offer that was declined. A simple measure of group diversity is the number of unique gender-ethnicity-country combinations in a team, which captures intersectional heterogeneity [40,41]. Using three class outputs, Stress (cortisol), Performance (team results), and Hormone Health (testosterone), enabled both forward inference (biological + context to outcomes) and reverse inference (predicted outcomes to hormone health), thereby improving interpretability.

Hormone measurements (cortisol and testosterone) served as proxies for the output of potential continuous monitoring. The sensor-agnostic framework applies to microfluidic, electrochemical, and wearable biosensors, allowing flexible hormone inference from various forms of input. Few existing public datasets offer simultaneous longitudinal cortisol–testosterone measurements with associated behavioral/contextual features. Therefore, synthetic enrichment is incorporated for generalization. This encompasses molecular and clinical data, hormone assays such as cortisol, estradiol, testosterone, insulin, and thyroid hormones, routine biochemistry such as glucose, lipids, and liver/kidney panels, and inflammatory markers (CRP/interleukins) measured by validated immunoassays or mass spectrometry. Combined with clinical covariates (age, BMI, comorbidity, and medication), these inputs provide sufficient context for accurate and interpretable models of biochemical status and physiological modifiers.

Dataset Structure and Fields

The primary dataset is the Columbia MBA Hormone Diversity Team Study, comprising 370 participants organized into 74 work teams (5 ± 2 members each). Each record corresponds to an individual participant observed within a specific team context. The dataset integrates biological, behavioral, and contextual modalities within a single relational schema, summarized below.

Storage structure: A single flat table participants.csv with one row per subject, relationally linked to auxiliary JSON objects storing each subject’s temporal hormone traces and behavioral trajectories.
Primary keys: subject_id (unique per participant) and team_id (linked to group-level performance indicators).
Total observations: 370 rows × ~45 columns (static attributes) plus two time-series arrays per subject (cortisol and testosterone, 168 samples each).
Multimodal representation:
- Static numerical: Age, BMI, baseline cortisol/testosterone (log-scaled).
- Categorical/contextual: Gender, Ethnicity, Country (one-hot or trainable 8-dimensional embeddings).
- Behavioral temporal: cash_trajectory and rank_trajectory (vectors of per-round outcomes).
- Team-level features: diversity_index, team_performance_score.
- Derived temporal features: cortisol_slope_am, t_react_recovery, stress_auc, hormone_entropy.
- Synthetic augmentation metadata: augmentation_flag, synthetic_seed, and sequence_id (for reproducibility).

Each biological time-series was resampled to an hourly grid covering seven days (7 × 24 = 168 steps). Synthetic enrichment added physiologically constrained trajectories per subject while retaining baseline anchoring to maintain biological plausibility, as shown in Table 1.

After preprocessing, the total feature dimensionality was approximately 40 static features plus (168 × 2) temporal values, or about 376 input dimensions per participant sample. Each row in the dataset corresponds to one participant (e.g., subject P0371 in team T062), containing demographic fields (age = 29, gender = female, ethnicity = Asian, BMI = 22.1), baseline hormone levels (baseline_cortisol_log = 0.54, baseline_testosterone_log = 0.71), and team-level indicators (team_performance_score = 0.83, diversity_index = 0.68). Derived temporal features such as cortisol_slope_am = 0.06, t_react_recovery = 2.7 h, and stress_auc = 0.42 summarize hormone dynamics. Behavioral trajectories such as cash allocations and rank order per round are stored as short numerical arrays, for example [105, 112, 118, …] and [3, 2, 1, …]; full arrays are available in the shared repository. The Boolean augmentation_flag indicates whether a sample is real (0) or synthetic (1), ensuring full traceability across cross-validation folds.

The real-world measurements originate from Hormones and the Dynamics of Team Performance, Open Science Framework Repository, OSF.IO/ZPQ8H. Raw saliva assays (cortisol, testosterone) were analyzed using validated enzyme-immunoassay protocols, and behavioral/demographic data were matched via anonymized identifiers. All preprocessing, resampling, and synthetic augmentation scripts are available in the project’s GitHub repository (see Data Availability Statement). Random seeds and augmentation parameters are logged per subject via synthetic_seed to ensure exact reproducibility.

4.2. Data Preprocessing and Cleaning

A rigorous preprocessing pipeline was applied to ensure high-quality static and temporal features for the hybrid multitask model. All steps were nested within a stratified 10-fold team-level cross-validation to prevent leakage. Hormone values of cortisol/testosterone were imputed using age- and gender-specific medians, behavioral/contextual features with team medians, and categorical variables such as ethnicity and country with team modes. Hormonal outliers beyond three standard deviations were winsorized as shown in Figure 1. Numerical features were standardized, categorical features encoded as 8-dimensional trainable embeddings, and behavioral features smoothed via exponential moving averages.
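The imputation and winsorization steps described above can be illustrated with a minimal, standard-library sketch. The column names (`age_band`, `gender`, `cortisol`), the fallback to a global median, and the helper names are assumptions made for illustration and are not taken from the authors' repository. In a leakage-free setup, these statistics would be fitted on the training folds only.

```python
import statistics

def impute_by_group(rows, value_key, group_keys):
    """Fill None values with the median of the row's group (fallback: global median)."""
    observed = [r[value_key] for r in rows if r[value_key] is not None]
    global_median = statistics.median(observed)
    by_group = {}
    for r in rows:
        if r[value_key] is not None:
            key = tuple(r[k] for k in group_keys)
            by_group.setdefault(key, []).append(r[value_key])
    medians = {key: statistics.median(vals) for key, vals in by_group.items()}
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r[value_key] is None:
            key = tuple(r[k] for k in group_keys)
            r[value_key] = medians.get(key, global_median)
        filled.append(r)
    return filled

def winsorize_3sd(values):
    """Clip values to mean ± 3 population standard deviations."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    lo, hi = mu - 3 * sd, mu + 3 * sd
    return [min(max(v, lo), hi) for v in values]

# Hypothetical training-fold rows: one missing cortisol value in the (20s, F) group.
train_rows = [
    {"age_band": "20s", "gender": "F", "cortisol": 10.0},
    {"age_band": "20s", "gender": "F", "cortisol": 14.0},
    {"age_band": "20s", "gender": "F", "cortisol": None},
]
filled = impute_by_group(train_rows, "cortisol", ("age_band", "gender"))
clipped = winsorize_3sd([10.0] * 20 + [100.0])  # the outlier is pulled toward the bound
```

The same `impute_by_group` helper could be reused with `("team_id",)` as the grouping key for the team-median imputation of behavioral features.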

Temporal hormone dynamics were captured using 7-day (168-step) synthetic sequences, processed through a 1D-CNN (32 filters, kernel size 3) for spike detection and an LSTM (64 units) for recovery trends. Rare classes such as critical stress and low hormone health were balanced using SMOTE, with light Gaussian noise ($\sigma = 0.01$) added to hormones for robustness. All preprocessing, including imputation, normalization, embedding, temporal sequence generation, and augmentation, was performed only on training folds. Hormone health thresholds followed clinical guidelines: cortisol > 20 nmol/L (morning) or > 10 nmol/L post-nadir indicated elevated stress, and testosterone < 8 nmol/L (men) or < 0.5 nmol/L (women) indicated low hormone health [42,43].

4.2.1. Feature Engineering

The feature engineering and preprocessing pipeline transforms raw input data into a rich multimodal feature space encompassing biological, behavioral, contextual, and demographic domains, thereby enabling comprehensive modeling of stress, performance, and hormone health.

In the biological domain, cortisol (denoted $C$) and testosterone (denoted $T$) levels are included in both their raw and log-transformed forms to capture multiplicative or scale effects. Specifically, log-transformed features are computed as shown in Equation (1):

$\log(C + \epsilon)$ and $\log(T + \epsilon)$

(1)

where $\epsilon$ is a small constant added for numerical stability. These features serve as inputs to the model and are also used to define ground truth classes. Stress class assignment follows Equation (2):

$\text{Stress\_Class} = \begin{cases} \text{Optimal}, & C < 10 \\ \text{Moderate}, & 10 \leq C \leq 20 \\ \text{Critical}, & C > 20 \end{cases}$

(2)

Similarly, testosterone health is trichotomized into classes as shown in Equation (3):

$\text{Testosterone\_Class} \in \{\text{Low}, \text{Normal}, \text{High}\},$

(3)

defined by tertiles or domain-informed thresholds, facilitating reverse modeling.
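As a small illustration of Equations (1) and (2), the log transform and the stress thresholds can be written as two functions. The value of `EPS` and the function names are assumptions; the thresholds are those stated in Equation (2), with cortisol assumed to be in nmol/L as in the preprocessing section.

```python
import math

EPS = 1e-6  # hypothetical value; the paper only says "a small constant"

def log_feature(level, eps=EPS):
    """Log-transformed hormone feature, log(level + eps), as in Equation (1)."""
    return math.log(level + eps)

def stress_class(cortisol):
    """Map a cortisol level to the three stress classes of Equation (2)."""
    if cortisol < 10:
        return "Optimal"
    if cortisol <= 20:
        return "Moderate"
    return "Critical"

labels = [stress_class(c) for c in (4.2, 10.0, 20.0, 27.5)]
```

Note that both boundary values (10 and 20) fall into the Moderate class, which follows from the closed interval in Equation (2).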

For behavioral and performance features, the primary team outcome $P$ (final performance score) is discretized into three quantile-based classes using tercile binning as shown in Equation (4):

$\text{Perf\_Class} = \mathrm{qcut}(P, q = 3, \text{labels} = \{\text{Low}, \text{Medium}, \text{High}\})$

(4)
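Equation (4) is written in the style of a quantile-cut call. A dependency-free approximation of tercile binning is sketched below using `statistics.quantiles` with the inclusive (linear interpolation) method and right-closed bins. This is an illustrative stand-in, not necessarily the exact call the authors used, and the example scores are made up.

```python
import statistics

def perf_classes(scores):
    """Assign Low/Medium/High by terciles of the score distribution."""
    q1, q2 = statistics.quantiles(scores, n=3, method="inclusive")
    classes = []
    for p in scores:
        if p <= q1:
            classes.append("Low")
        elif p <= q2:
            classes.append("Medium")
        else:
            classes.append("High")
    return classes

team_scores = [i / 10 for i in range(1, 10)]  # hypothetical scores 0.1 ... 0.9
team_classes = perf_classes(team_scores)
```

Because the cut points are computed from the data, they should be estimated on training folds and then reused unchanged on held-out data.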

Behavioral proxies such as interim rank and cash trajectories are encoded as time deltas or normalized residuals to capture dynamics. Contextual/demographic features include a diversity index $D$ (unique gender-ethnicity-country triplets per team, normalized by team size), female ratio, age mean/variance, and team size, incorporated as raw values or learned embeddings. Missing demographic/hormone values are domain-imputed, and feature importance analysis prunes irrelevant inputs.
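The diversity index described above (unique gender-ethnicity-country triplets divided by team size) is simple enough to state directly in code. The member records below are invented for illustration.

```python
def diversity_index(members):
    """Unique (gender, ethnicity, country) triplets, normalized by team size."""
    triplets = {(m["gender"], m["ethnicity"], m["country"]) for m in members}
    return len(triplets) / len(members)

# Hypothetical four-person team with three distinct triplets.
team = [
    {"gender": "F", "ethnicity": "Asian", "country": "IN"},
    {"gender": "F", "ethnicity": "Asian", "country": "IN"},
    {"gender": "M", "ethnicity": "Hispanic", "country": "MX"},
    {"gender": "M", "ethnicity": "White", "country": "US"},
]
d = diversity_index(team)
```

The index is 1.0 when every member has a distinct triplet and approaches 1/team size for a fully homogeneous team.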

Categorical/discretized features $f \in F_{cat}$ use jointly trained embeddings $e_{f} \in \mathbb{R}^{d}$, while continuous features interact via pairwise interaction layers, capturing second-order dependencies such as cortisol × diversity and testosterone × team composition. This pipeline compiles multimodal biological, behavioral, and contextual measurements into interpretable, high-dimensional inputs for multitask learning. The main derived features utilized in the hybrid model are compiled in Table 2 to guarantee reproducibility and clarity. The dataset is multimodal by nature, combining continuous hormone trajectories, behavioral time-series, categorical embeddings, and static demographics. Because of this fusion, the model can simultaneously capture contextual, behavioral, and physiological factors that influence hormone regulation and stress.

Following preprocessing, a temporal tensor of size [168, 2], representing the cortisol and testosterone series, and a static feature vector with about 40 dimensions are passed to the final model. This multimodal fusion of contextual, behavioral, and biological data enables richer cross-domain representations for multitask learning.

4.2.2. Synthetic Data Augmentation for Class Balance

The Testosterone and Diversity dataset (370 subjects, 74 teams) shows severe class imbalance in the “Critical” stress and near-extreme hormone categories, making deep ensembles and multitask training an arduous task. First of all, the hybrid ensemble was trained on original data with stratified 10-fold team-level cross-validation and preprocessing.

We benchmarked synthetic augmentation at five levels (0%, 25%, 50%, 75%, 100%) and held out a real subset of 10% for unbiased evaluation. A two-pronged strategy handled imbalance: (1) feature-level generative oversampling with constrained SMOTE, stratified by team demographics (gender, ethnicity, and country) to retain intra-class correlations, and (2) perturbative synthetic sampling that focuses on model uncertainty while plausibly increasing the minority classes. Augmentation expanded the dataset from $N = 370$ to $N = 1110$ while maintaining demographic plausibility and ethical compliance.
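The core SMOTE operation, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is plain SMOTE-style interpolation under simplifying assumptions; the paper's "constrained" variant and its stratification are approximated here only by the caller passing rows from a single class and a single demographic stratum. The function name and the toy feature vectors are hypothetical.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new points by interpolating each base point toward a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest other minority points by squared Euclidean distance
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical "Critical" rows from one stratum: [log cortisol, log testosterone]
critical_rows = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0], [1.2, 2.2]]
new_rows = smote_like(critical_rows, n_new=5)
```

Because every synthetic point lies on a segment between two real minority points of the same stratum, per-feature ranges of that stratum are preserved, which is one simple way to keep the augmented samples plausible.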

Training involved utilizing Monte Carlo Cross-validation and Bayesian Deep Ensembles as a countermeasure for small-sample bias. Each fold standardized the numerical features, in addition to calibrating the post-hoc softmax logits for valid estimates at the confidence level via temperature scaling.
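Temperature scaling, mentioned above, divides the logits by a single scalar $T$ fitted on held-out data by minimizing the negative log-likelihood. A minimal sketch is given below; the grid search (instead of a gradient-based optimizer), the grid range, and the toy logits are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[y]) for z, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]  # 0.5 ... 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Toy validation set: very confident logits, but only 3 of 4 predictions correct.
val_logits = [[8.0, 0.0, 0.0]] * 3 + [[8.0, 0.0, 0.0]]
val_labels = [0, 0, 0, 1]
t_star = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], t_star)
```

In this toy case the fitted temperature is greater than 1, which softens the overconfident probabilities without changing the predicted class.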

Table 3 summarizes the key characteristics of the original and augmented datasets.

The systematic augmentation and preprocessing pipeline resulted in balanced and biologically plausible training data to enhance generalization and robustness for the hybrid ensemble model while maintaining ethical and statistical integrity.

4.3. Synthetic Longitudinal Data Generation with Statistical and Biological Fidelity

Finally, we created biologically realistic synthetic trajectories that simulate diurnal and stress-response patterns, populate high-frequency proxies, and allow accurate model evaluation for real-time execution. Because longitudinal hormone measurements were insufficient, synthetic temporal trajectories of cortisol and testosterone were incorporated in a way that maintains the statistical integrity and biological plausibility of the dataset. This augmentation enables the model to capture dynamic patterns of stress and hormone response more effectively, facilitating time-aware predictions and richer physiological interpretations, as shown in Figure 2.

Synthetic hormone trajectories are anchored to per-subject measurements in order to maintain individual characteristics and biological limits. They replicate the mean, variance, and cortisol-testosterone correlations across demographics while enforcing realistic diurnal rhythms, stress-induced cortisol spikes of +15–25%, and post-stress testosterone dips of 5–10% with 2–3 h recovery. Copulas maintain hormone-diversity correlations, stress events are injected for “Critical” subjects, and Gaussian Process baselines produce smooth diurnal curves.
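To make the idea of an anchored trajectory with an injected stress event concrete, the sketch below builds a 168-step cortisol series for one subject. It deliberately replaces the Gaussian Process baseline with a plain cosine diurnal curve, and the peak hour, amplitude, recovery constant, and function names are illustrative assumptions rather than the authors' parameters. Only the +15–25% spike range is taken from the text.

```python
import math
import random

HOURS = 7 * 24  # 168 hourly steps over seven days

def diurnal_baseline(anchor, peak_hour=8, amplitude=0.3):
    """Cosine diurnal curve scaled by the subject's measured (anchor) level."""
    return [
        anchor * (1 + amplitude * math.cos(2 * math.pi * ((h % 24) - peak_hour) / 24))
        for h in range(HOURS)
    ]

def inject_cortisol_spike(series, event_hour, rng, recovery_h=3.0):
    """Add a +15-25% spike at event_hour that decays exponentially afterwards."""
    boost = rng.uniform(0.15, 0.25)
    out = list(series)
    for h in range(event_hour, HOURS):
        out[h] *= 1 + boost * math.exp(-(h - event_hour) / recovery_h)
    return out

rng = random.Random(42)  # seed logged per subject for reproducibility
baseline = diurnal_baseline(anchor=12.0)
stressed = inject_cortisol_spike(baseline, event_hour=60, rng=rng)
```

A testosterone series could be produced the same way with a negative `boost` in the 5–10% range; the derived features (`cortisol_slope_am`, `stress_auc`, and so on) would then be computed from these arrays.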

To ensure ±5–10% alignment with actual data and biological plausibility, such as cortisol peaking 30–60 min after waking and testosterone recovery ≤ 4 h, validation employs KS tests, 1-Wasserstein, MMD, and DTW. The LSTM + 1D-CNN temporal branch is fed by the derived features cortisol_slope_am, t_react_recovery, and stress_auc through meta-gated fusion. Transparency is maintained through explicit flags and bias audits. In order to ensure reproducible, biologically accurate temporal augmentation that improves model robustness and predictive power, all GP parameters, kernels, and constraints are documented.
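Three of the fidelity metrics named above have compact reference definitions, sketched here with the standard library only: the two-sample KS statistic, the 1-Wasserstein distance between empirical distributions, and a basic DTW distance between two sequences. MMD is omitted because it requires a kernel choice that the text does not specify. The function names are illustrative, and a production pipeline would more likely rely on an established statistics library.

```python
import bisect

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap

def wasserstein_1(a, b):
    """1-Wasserstein distance: area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(a + b)
    area = 0.0
    for left, right in zip(points, points[1:]):
        fa = bisect.bisect_right(a, left) / len(a)
        fb = bisect.bisect_right(b, left) / len(b)
        area += abs(fa - fb) * (right - left)
    return area

def dtw_distance(x, y):
    """Classic dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    table = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    table[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            table[i][j] = cost + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[len(x)][len(y)]

real, synth = [0.0, 1.0, 2.0], [1.0, 2.0, 3.0]
ks = ks_statistic(real, synth)
w1 = wasserstein_1(real, synth)
dtw = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

KS and Wasserstein compare marginal distributions of real and synthetic values, while DTW compares the shapes of two trajectories while tolerating small timing shifts, which is why the metrics are complementary.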

4.4. Hybrid Unified Model Architecture

This section describes the core predictive engine: a unified, multitask deep architecture that jointly infers Stress, Performance, and Hormone Health, while also enabling reverse inference to improve coherence and interpretability. The hybrid deep ensemble serves as a deployable biosensor analytics module for mobile or edge platforms, enabling real-time processing of salivary cortisol or wearable multi-analyte data for inference and decision support [44]. The design tightly fuses static, contextual, and synthetic longitudinal hormonal information through meta-gated expert mixing, accounts for known sources of bias, and quantifies uncertainty in a principled way as shown in Figure 3. The following subsections break down the structure, reasoning, and losses.

4.4.1. Shared Multimodal and Temporal Backbone

Human stress and performance arise from static attributes (demographics, diversity, baseline hormones) and dynamic physiological responses (hourly cortisol/testosterone spikes). The model explicitly separates and adaptively fuses (1) static/contextual features $x_{static}$ and (2) temporal synthetic hormone sequences $x_{temp}$ over 24 h, enabling per-sample modality weighting. Categorical inputs like discretized diversity and gender are embedded into continuous vectors before concatenation, supporting rich multimodal representation.

(a): Static Branch

The static input passes through a series of residual dense blocks with Swish activations which empirically smooth gradients and outperform ReLU in many settings. A simplified formulation of a single residual block is as shown in Equation (5):

$h^{(l + 1)} = Swish (W^{(l)} h^{(l)} + b^{(l)}) + h^{(l)}, h^{(0)} = x_{static}$

(5)

where Swish is defined as shown in Equation (6):

$Swish (u) = u \cdot σ (u),$

(6)

and $σ$ is the logistic sigmoid function.
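Equations (5) and (6) can be checked numerically with a few lines of plain Python. The weights and inputs below are arbitrary toy values, and the square weight matrix reflects the requirement that the residual addition keeps the dimension unchanged.

```python
import math

def swish(u):
    """Swish(u) = u * sigmoid(u), as in Equation (6)."""
    return u / (1.0 + math.exp(-u))

def residual_block(h, weights, bias):
    """One block of Equation (5): Swish(W h + b) + h, with W square."""
    pre = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]
    return [swish(p) + x for p, x in zip(pre, h)]

h0 = [1.0, -2.0]  # toy stand-in for x_static
identity_like = residual_block(h0, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
h1 = residual_block(h0, [[0.5, 0.0], [0.0, 0.5]], [0.1, 0.1])
```

With zero weights and bias the block reduces to the identity, which illustrates why residual connections make deep stacks easier to optimize: each block only has to learn a correction to its input.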

(b): Temporal Branch

The synthetic hormonal sequences $x_{temp}$ (a matrix of shape $[time steps \times 2]$ for cortisol and testosterone) are processed by a hybrid 1D CNN followed by an LSTM (or GRU) to extract both local patterns (sudden spikes) and longer-range recovery dynamics as shown in Equation (7):

$c_{t} = CNN (x_{temp}), u = LSTM (c_{1 : T})$

(7)

where T is the number of hourly steps, and $u$ is the resulting temporal embedding capturing slopes, spike magnitude, recovery periods, and cumulative stress.

(c): Meta-Gated Expert Fusion

Rather than a fixed concatenation, we construct $K$ expert subnetworks $\{E_{i}\}_{i = 1}^{K}$ (biological/static expert, behavioral expert, temporal expert, demographic expert) that each process the corresponding subspace. A gating network $G$ (a small MLP) consumes the full input context $x = [x_{static}, u]$ and outputs attention-like coefficients as shown in Equation (8):

$\alpha = \mathrm{Softmax}(G(x)) \in \mathbb{R}^{K}$

(8)

The fused representation is as shown in Equation (9):

$z = \sum_{i = 1}^{K} \alpha_{i}(x) \cdot E_{i}(x)$

(9)

This allows the model to adaptively emphasize temporal dynamics when transient stress is dominant or rely more on static/contextual cues when performance baselines matter.
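The fusion rule in Equations (8) and (9) is a softmax-weighted sum of expert outputs. The sketch below takes the gate logits $G(x)$ and the expert outputs $E_i(x)$ as given inputs, since the internal layers of the gating MLP and the experts are not specified here; the function names and toy vectors are assumptions.

```python
import math

def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(expert_outputs, gate_logits):
    """z = sum_i alpha_i * E_i(x), with alpha = softmax(G(x))."""
    alpha = softmax(gate_logits)
    dim = len(expert_outputs[0])
    z = [sum(a * e[d] for a, e in zip(alpha, expert_outputs)) for d in range(dim)]
    return z, alpha

# Two toy experts with 2-dimensional outputs.
experts = [[1.0, 0.0], [0.0, 1.0]]
z_equal, alpha_equal = fuse(experts, [0.0, 0.0])       # gate is indifferent
z_temporal, alpha_temporal = fuse(experts, [0.0, 20.0])  # gate favours expert 2
```

Inspecting `alpha` per sample is what makes the gating interpretable: it shows which expert the model relied on for a given prediction.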

4.4.2. Multioutput and Reverse Inference Heads

From the shared fused representation $z$, three parallel classification heads predict the primary targets:

(a): Stress Head

{\hat{y}}_{stress} = softmax (W_{s} z + b_{s}) \in Δ^{2}

(10)

The stress head distinguishes three classes (Optimal, Moderate, Critical), as shown in Equation (10).

(b): Performance Head

{\hat{y}}_{perf} = softmax (W_{p} z + b_{p}) \in Δ^{2}

(11)

The performance head distinguishes three classes (Low, Medium, High), as shown in Equation (11).

(c): Hormone Health Head

{\hat{y}}_{hormone} = softmax (W_{h} z + b_{h}) \in Δ^{2}

(12)

The hormone health head distinguishes three classes (Low, Normal, High), as shown in Equation (12).

4.4.3. Reverse Inference Module

To reinforce bidirectional consistency and exploit dependency between outcomes, a secondary module $f_{rev}$ predicts refined hormone health using the soft outputs (and uncertainty signals) from the stress and performance heads as shown in Equation (13):

$\hat{y}_{hormone}^{rev} = f_{rev} (\hat{y}_{stress}, \hat{y}_{perf}, H (\hat{y}_{stress}), H (\hat{y}_{perf}))$

(13)

where $H (\cdot)$ is the Shannon entropy (a measure of uncertainty). This secondary estimate is tied back to the primary hormone head through a consistency loss to encourage agreement when appropriate.

The reverse inference module is gated by upstream confidence to stop errors from propagating. In particular, entropy and ensemble variance estimates accompany the stress and performance predictions; the reverse head is applied only when both uncertainties are below empirically established thresholds, such as entropy $< 0.5$ and variance in the lowest quartile. When confidence is low, the weight of the consistency loss is proportionately decreased, and only the primary hormone health head output is used.
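The gating rule described above can be sketched as follows. Several details are assumptions made for illustration: entropy is measured in nats, `var_q1` denotes the lowest-quartile variance cutoff computed elsewhere, and the linear down-weighting in `consistency_weight` is only one possible reading of "proportionately decreased". All function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats (the log base is an assumption)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def reverse_head_enabled(y_stress, y_perf, var_stress, var_perf,
                         var_q1, h_max=0.5):
    """True only if both heads pass the entropy and variance thresholds."""
    low_entropy = entropy(y_stress) < h_max and entropy(y_perf) < h_max
    low_variance = var_stress <= var_q1 and var_perf <= var_q1
    return low_entropy and low_variance

def consistency_weight(lambda_cons, h, h_max=0.5):
    """Hypothetical rule: shrink the consistency weight linearly with entropy."""
    return lambda_cons * max(0.0, 1.0 - h / h_max)

# Confident upstream heads enable the reverse head ...
confident = reverse_head_enabled([0.98, 0.01, 0.01], [0.97, 0.02, 0.01],
                                 var_stress=0.001, var_perf=0.001, var_q1=0.002)
# ... while a uniform (maximally uncertain) stress prediction disables it.
uncertain = reverse_head_enabled([1 / 3, 1 / 3, 1 / 3], [0.97, 0.02, 0.01],
                                 var_stress=0.001, var_perf=0.001, var_q1=0.002)
```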

4.4.4. Advanced Regularization and Novel Techniques

Each component is intentionally layered to enhance generalization, fairness, and robustness:

(a): Label Smoothing

Targets $y$ (one-hot) for each head are softened as shown in Equation (14):

$\tilde{y} = (1 - ϵ) y + \frac{ϵ}{C}$

(14)

where $C = 3$ is the number of classes and $ϵ \in [0, 0.1]$ reduces overconfidence and improves calibration.
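A minimal sketch of Equation (14), assuming a one-hot target given as a Python list (the function name `smooth_labels` is hypothetical):

```python
def smooth_labels(one_hot, eps=0.1):
    """Label smoothing, Equation (14): (1 - eps) * y + eps / C."""
    n_classes = len(one_hot)
    return [(1.0 - eps) * y + eps / n_classes for y in one_hot]

# For C = 3 and eps = 0.1 the true class receives about 0.933
# and each other class about 0.033.
smoothed = smooth_labels([0.0, 1.0, 0.0], eps=0.1)
```

The smoothed target still sums to one, so it remains a valid distribution for the cross-entropy loss.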

(b): Class-Aware Sampling and Focal Loss

Training minibatches oversample rarer classes (e.g., “Critical” stress). Focal loss for stress prediction is as shown in Equation (15):

$L_{focal} = - {(1 - p_{t})}^{γ} log (p_{t})$

(15)

where $p_{t}$ is the model’s predicted probability for the true class and $γ > 0$ (typically 2) emphasizes hard-to-classify examples.
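The effect of the focusing parameter in Equation (15) can be seen in a small sketch (the function name `focal_loss` is an assumption; $γ = 2$ follows the typical value mentioned above):

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss, Equation (15): -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(0.9)             # well-classified example, strongly down-weighted
hard = focal_loss(0.5)             # harder example keeps more of its loss
plain_ce = focal_loss(0.5, 0.0)    # gamma = 0 recovers ordinary cross-entropy
```

With $γ = 2$, an example predicted at $p_{t} = 0.9$ has its cross-entropy scaled by $0.01$, whereas one at $p_{t} = 0.5$ is scaled by $0.25$, which is how the loss shifts emphasis toward hard cases.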

(c): Adversarial Debiasing/Fairness Constraint

An adversary network A predicts sensitive attributes s (gender, ethnicity) from the intermediate representation $z$. The encoder is trained through a gradient reversal layer, as shown in Equation (16):

$L_{adv} = CE (A (z), s), encoder loss term : - λ_{adv} L_{adv}$

(16)

A fairness regularizer (demographic parity gap) is added as shown in Equation (17):

L_{fair} = {∥E [\hat{y} ∣ s = 0] - E [\hat{y} ∣ s = 1]∥}_{2}

(17)

(d): Deep Ensembles

M independent copies of the architecture are trained. Final soft predictions average over models as shown in Equation (18):

${\hat{y}}_{ensemble} = \frac{1}{M} \sum_{m = 1}^{M} {\hat{y}}^{(m)}$

(18)

The variance across the M model outputs in Equation (18) estimates epistemic uncertainty.

(e): Temperature Scaling/Calibration

Post-training, a scalar $T > 0$ adjusts confidence as shown in Equation (19):

${\hat{y}}^{(T)} = softmax (\frac{l}{T})$

(19)

where $l$ are logits; T is chosen to minimize negative log-likelihood.
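As an illustration of Equation (19), the sketch below picks $T$ by a simple grid search over validation negative log-likelihood. The grid search, the toy logits, and the function names are assumptions made for this example; the source does not state which optimizer is used to fit $T$.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, Equation (19)."""
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    return -sum(math.log(softmax(row, T)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Choose T from a grid (0.25 to 10.0 by default) that minimizes the NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]
    return min(grid, key=lambda T: nll(logit_rows, labels, T))

# Toy validation set: the second row is confidently wrong, so the
# uncalibrated model is overconfident and T > 1 is expected.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
val_labels = [0, 1, 1, 0]
T_star = fit_temperature(val_logits, val_labels)
```

Dividing logits by a positive scalar does not change the predicted class, so calibration is adjusted without affecting accuracy.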

(f): Uncertainty-Aware Decision Logic

Final decision-making integrates aleatoric uncertainty (entropy $H (\hat{y})$) and epistemic uncertainty (ensemble variance). Predictions whose entropy or variance exceeds preset thresholds are flagged for human review.

(g): Consistency Regularization

This term encourages the forward hormone head and the reverse-inferred estimate to agree, as shown in Equation (20):

$L_{consistency} = KL ({\hat{y}}_{hormone} ∥ {\hat{y}}_{hormone}^{rev})$

(20)

where KL is the Kullback–Leibler divergence.
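A minimal sketch of the consistency term in Equation (20), assuming both heads output probability vectors as Python lists (the function name and the small `eps` guard against division by zero are assumptions):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

y_hormone = [0.7, 0.2, 0.1]        # forward hormone head (toy values)
y_hormone_rev = [0.6, 0.3, 0.1]    # reverse-inferred estimate (toy values)
loss_agree = kl_divergence(y_hormone, y_hormone)
loss_differ = kl_divergence(y_hormone, y_hormone_rev)
```

The divergence is zero when the two heads agree exactly and grows as they diverge; note that it is asymmetric, so the order of the arguments matters.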

(h): Temporal Fusion Advantage

Explicit modeling of synthetic hormone trajectories allows detection of transient stress spikes and recovery patterns missed by baseline-only models, improving sensitivity to short-term critical states.

4.4.5. Bayesian Uncertainty Modeling

To capture both epistemic and aleatoric uncertainty in a unified probabilistic framework, the architecture incorporates Bayesian-inspired mechanisms:

(a): Monte Carlo Dropout (MCD)

During inference, dropout remains active, producing M stochastic forward passes $\{\hat{y}^{(m)}\}_{m = 1}^{M}$. The predictive mean approximates the posterior expectation as shown in Equation (21):

$\bar{y} = \frac{1}{M} \sum_{m = 1}^{M} {\hat{y}}^{(m)}$

(21)

and the sample variance estimates epistemic uncertainty.

(b): Variational Dense Layers (Flipout)

Selected dense layers model a distribution over weights $q_{ϕ} (W)$ , enabling posterior sampling of parameters and directly capturing model weight uncertainty as shown in Equation (22):

$W \sim q_{ϕ} (W)$

(22)

(c): Uncertainty Decomposition

Aleatoric uncertainty is measured via softmax output entropy as shown in Equation (23):

$H (\bar{y}) = - \sum_{i} {\bar{y}}_{i} log {\bar{y}}_{i}$

(23)

Epistemic uncertainty is captured via variance across ensemble/MCD samples. These uncertainties are propagated to downstream decision logic and consistency checks.
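The decomposition in Equations (21) and (23) can be sketched on a handful of stochastic forward passes. The sample values are toy numbers, the function names are hypothetical, and the variance here divides by $M$ and is averaged over classes, which is one of several reasonable conventions.

```python
import math

def predictive_mean(samples):
    """Equation (21): average the M stochastic softmax outputs per class."""
    m = len(samples)
    return [sum(s[c] for s in samples) / m for c in range(len(samples[0]))]

def entropy(p):
    """Equation (23): entropy of the mean prediction (natural log assumed)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def epistemic_variance(samples):
    """Per-class variance across passes, averaged over classes."""
    mean = predictive_mean(samples)
    m = len(samples)
    per_class = [sum((s[c] - mean[c]) ** 2 for s in samples) / m
                 for c in range(len(mean))]
    return sum(per_class) / len(per_class)

agree = [[0.9, 0.05, 0.05]] * 3                                   # passes agree
disagree = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]  # passes disagree
```

In this toy case the disagreeing passes yield both a higher entropy of the mean and a higher variance, whereas identical passes give essentially zero variance regardless of how confident each pass is.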

4.4.6. Training Protocol

The data were divided into training, validation, and test subsets with proportions of 70%, 15%, and 15%, respectively. To avoid information leakage, the validation sets were used solely for model selection and hyperparameter tuning. The test sets consisted of the final temporal segments and completely different subject cohorts, whereas the validation sets for temporal analyses were constructed from the most recent non-test time windows. This design ensures that the evaluation reflects both cohort and temporal generalization.

As shown in Equation (24), the overall training objective combines the losses from the three tasks with the regularization terms:

L = λ_{s} L_{stress} + λ_{p} L_{performance} + λ_{h} L_{hormone} + λ_{fair} L_{fair} + λ_{adv} L_{adv} + λ_{cons} L_{consistency}

(24)

where each $λ$ is a tunable weight, optimized through grid search or Bayesian methods, balancing trade-offs between predictive accuracy, fairness, and output coherence.

The N-Adam optimizer with warmup and cosine decay was used to optimize the model. Regularization consisted of weight decay ($10^{- 4}$), dropout (0.3), and early stopping according to a composite validation metric that combines macro F1 and critical stress recall. All preprocessing (oversampling, normalization, and embeddings) was nested per fold in a stratified 10-fold team-level cross-validation to prevent leakage. Through specific loss terms, the architecture enforces fairness and logical consistency while integrating static biological, behavioral, and temporal signals. Bayesian ensembles and Monte Carlo Dropout are used to quantify predictive uncertainty. Extending beyond single-task or static models, meta-gating, reverse inference, and biologically informed temporal modeling offer calibrated, interpretable, and reliable predictions across stress, performance, and hormone health.

4.4.7. Experimentation and Hyperparameter Tuning

To evaluate performance and robustness, we compared our hybrid deep ensemble model against eight well-known baseline models frequently used in the hormone, behavioral, and multitask classification domains. These baselines include conventional machine learning algorithms, classical neural networks, and recent deep learning architectures. The objective was to validate our architectural innovations, including temporal modeling, meta-gating, Bayesian ensembles, and fairness-aware training, and to show measurable performance improvements. All models were tuned and evaluated using stratified 10-fold cross-validation at the team level, keeping each team within a single fold to prevent data leakage.

Table 4 provides a summary of the baseline and suggested models, their primary hyperparameters, and pertinent technical information.

For conventional models (Logistic Regression, Random Forest, SVM, and XGBoost), hyperparameters were optimized by grid search under 10-fold stratified CV, using MacroF1 and critical stress recall as selection criteria. Deep models (MLP, CNN, LSTM, Transformer, and hybrid) were tuned on validation MacroF1 using Bayesian optimization combined with manual refinement, learning rate warmup, cosine decay, and early stopping (patience 10). Hybrid ensembles ($M = 5$) used twenty MC Dropout passes, and Flipout variational layers captured posterior weight uncertainty with a prior SD of 0.1.

Fairness regularization with $λ_{fair} \in \{0.05, 0.1, 0.2\}$ preserved MacroF1 while reducing the demographic parity gap. Regularization included dropout 0.2–0.3 and L2 decay in $\{10^{- 4}, 10^{- 3}\}$, and the loss combined cross-entropy with focal loss on “Critical” stress. The evaluation metrics are MacroF1 (primary), critical stress recall, MCC (performance), ECE (hormone health), and ensemble predictive entropy, which together cover accuracy, calibration, and uncertainty.

The temporal branch configuration was based on Bayesian hyperparameter optimization and previous biosignal studies. To balance accuracy and edge efficiency, a 1D-CNN with small kernels ($k = 3$) captured brief transients, while an LSTM layer (64 units) modeled slower recovery trends. The search space contained CNN filters $\{16, 32, 64\}$, kernel sizes $\{3, 5\}$, LSTM units $\{32, 64, 128\}$, and learning rates in $[1 \times 10^{- 5}, 1 \times 10^{- 3}]$, with selection based on validation Macro-F1 and critical-class recall. To preserve intra-team correlations, SMOTE variants were stratified by team demographics. For robustness, Gaussian noise ($σ = 0.01$) was introduced without compromising biological plausibility.

Categorical fields were mapped to trainable embeddings of dimension 8, initialized with the Glorot uniform distribution and trained end-to-end with the model. Trainable embeddings were chosen over one-hot or target encoding because they enable learning dense categorical representations that capture interactions with continuous features (cortisol × diversity) while maintaining a compact model size suitable for edge deployment. During hyperparameter optimization, embedding dimensions $d \in \{4, 8, 16\}$ were evaluated, and $d = 8$ provided the best trade-off between validation Macro-F1 and model complexity.

4.4.8. Cross-Validation and Robust Evaluation

Given the small dataset size of 74 teams and 370 individual records, extra care is taken to avoid data leakage and overfitting in order to guarantee reliable and objective model evaluation. If members of the same team appear in both the training and testing folds, using a standard random k-fold cross-validation approach may result in excessively optimistic performance estimates. A stratified 10-fold cross-validation strategy at the team level is used to address this.

The augmented dataset is formally defined in Equation (25):

$D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$

(25)

and is partitioned into $T = 74$ team clusters, denoted as in Equation (26):

$\{G_{t}\}_{t = 1}^{T},$

(26)

where each cluster $G_{t}$ contains all samples from a single team. As shown in Equation (27), the folds are constructed so that each team cluster $G_{t}$ is fully included within a single fold:

$G_{t} \subseteq \text{one fold only}, \quad \forall t.$

(27)

Team-level stratified 10-fold cross-validation prevents data leakage and maintains proportional representation of rare classes such as “Critical” stress and infrequent hormone health categories. To avoid lookahead bias and allow for an objective assessment of model generalization, all preprocessing, including SMOTE oversampling, embedding fitting, and feature standardization, is carried out exclusively within training folds. Only subjects included in the corresponding training fold were used to generate synthetic sequences; no synthetic sequence linked to a team or subject in the validation/test folds was ever included in the training set for that fold. In other words, to ensure rigorous subject/team identity separation between training and validation/test sets, synthetic augmentation was applied for each fold only after the fold assignment.
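The fold construction of Equation (27) can be sketched as follows. This is an illustrative example, not the authors' pipeline: the team-level stratum labels, the record layout, and the function names are all hypothetical, and the round-robin assignment is a simple stand-in for a full stratified group split.

```python
from collections import defaultdict

def team_stratified_folds(team_labels, n_folds=10):
    """Map each team to exactly one fold, dealing teams round-robin per stratum.

    team_labels: dict of team_id -> team-level stratum label (hypothetical).
    """
    by_stratum = defaultdict(list)
    for team, label in sorted(team_labels.items()):
        by_stratum[label].append(team)
    fold_of, next_fold = {}, 0
    for label in sorted(by_stratum):
        for team in by_stratum[label]:
            fold_of[team] = next_fold % n_folds
            next_fold += 1
    return fold_of

def split(records, fold_of, test_fold):
    """Split records so that no team appears on both sides."""
    train = [r for r in records if fold_of[r["team"]] != test_fold]
    test = [r for r in records if fold_of[r["team"]] == test_fold]
    return train, test

# Toy data mirroring the dataset size: 74 teams x 5 members = 370 records.
teams = {t: ("Critical" if t % 7 == 0 else "Other") for t in range(74)}
records = [{"team": t, "member": m} for t in range(74) for m in range(5)]
fold_of = team_stratified_folds(teams, n_folds=10)
train, test = split(records, fold_of, test_fold=0)
# Oversampling, scaling, and synthetic augmentation would be fitted on
# `train` only, after this split, to avoid leakage into `test`.
```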

4.4.9. Evaluation Metrics

Task-specific heads with customized metrics are used for model evaluation. To reduce false negatives, the Stress Head prioritizes macro F1 and critical-class recall. The Performance Head is evaluated using precision for “High” performers and micro-averaged MCC, where MCC is defined in Equation (28):

MCC = \frac{T P \cdot T N - F P \cdot F N}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}} .

(28)
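Equation (28) is the binary form of MCC. The sketch below computes it from confusion counts and, as one common reading of "micro-averaged", pools one-vs-rest counts over classes before applying the same formula. The pooling scheme and function names are assumptions made for illustration.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, Equation (28); 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def micro_mcc(y_true, y_pred, classes):
    """Pool one-vs-rest confusion counts over classes, then apply Eq. (28)."""
    tp = tn = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if t == c and p == c:
                tp += 1
            elif t != c and p != c:
                tn += 1
            elif t != c and p == c:
                fp += 1
            else:
                fn += 1
    return mcc(tp, tn, fp, fn)

perfect = micro_mcc([0, 1, 2, 2], [0, 1, 2, 2], classes=[0, 1, 2])
```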

Hormone Health uses accuracy and Expected Calibration Error (ECE) as shown in Equation (29):

ECE = \sum_{b = 1}^{B} \frac{| B_{b} |}{n} |acc (B_{b}) - conf (B_{b})| .

(29)
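A minimal sketch of Equation (29) with equal-width confidence bins follows. The function name and the handling of a confidence of exactly 1.0 (placed in the last bin) are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE, Equation (29): bin-size-weighted |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 -> last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece

# Calibrated toy case: 80% confidence and 4 of 5 predictions correct.
calibrated = expected_calibration_error([0.8] * 5, [True] * 4 + [False])
# Overconfident toy case: 90% confidence but every prediction wrong.
overconfident = expected_calibration_error([0.9] * 10, [False] * 10)
```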

Consistency across heads is enforced via a Composite MultiTask Consistency Score as shown in Equation (30):

$S_{cons} = 1 - \frac{\#\,\text{incoherent outputs}}{\text{total predictions}},$

(30)

penalizing incoherent triplets such as “Optimal stress,” “Low performance,” and “High testosterone.” Entropy $H (p)$ and ensemble variance are used to quantify uncertainty, which informs coverage-risk analysis and selective prediction. Statistical robustness is assessed using paired t-tests against the baseline and bootstrapped confidence intervals.
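The consistency score of Equation (30) reduces to counting incoherent output triplets. In the sketch below, the set of incoherent combinations contains only the single example named in the text; a real rule set would have to be specified by domain experts, and all names here are hypothetical.

```python
# Hypothetical rule set: (stress, performance, hormone) triplets deemed incoherent.
INCOHERENT = {("Optimal", "Low", "High")}

def consistency_score(triplets, incoherent=INCOHERENT):
    """Equation (30): 1 - (# incoherent outputs / total predictions)."""
    bad = sum(1 for t in triplets if tuple(t) in incoherent)
    return 1.0 - bad / len(triplets)

predictions = [
    ("Optimal", "High", "Normal"),
    ("Critical", "Low", "High"),
    ("Optimal", "Low", "High"),    # the only incoherent triplet here
    ("Moderate", "Medium", "Normal"),
]
score = consistency_score(predictions)
```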

4.4.10. Explainability and Interoperability Protocol

Global feature attribution uses SHAP, decomposing ensemble predictions $f (x)$ into a baseline $ϕ_{0}$ plus additive contributions $ϕ_{j}$ as shown in Equation (31):

$f (x) = ϕ_{0} + \sum_{j = 1}^{d} ϕ_{j},$

(31)

highlighting key drivers like cortisol slope, diversity indices, or baseline testosterone.

Sample-specific modality influence is captured via meta-gating, which assigns expert weights $α_{i} (x)$ to temporal, static, or behavioral inputs, revealing which modality dominates each prediction.

Counterfactual explanations identify minimal perturbations $δ$ that alter predictions as shown in Equation (32):

$x^{'} = x + δ, \quad f (x^{'}) \neq f (x),$

(32)

guiding actionable interventions such as lowering morning cortisol to shift “Critical” → “Moderate” stress.

Fairness audits quantify demographic parity as shown in Equation (33):

$Δ_{fair} = |P (\hat{y} = 1 ∣ s = 0) - P (\hat{y} = 1 ∣ s = 1)|,$

(33)

confirming that adversarial debiasing reduces bias and supports equitable predictions across sensitive groups.
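A minimal sketch of the audit in Equation (33), assuming binary group membership `s` and hard predictions in which one class is treated as the positive outcome (the function name and toy data are hypothetical):

```python
def demographic_parity_gap(y_pred, s, positive=1):
    """Equation (33): |P(y_hat = 1 | s = 0) - P(y_hat = 1 | s = 1)|."""
    group0 = [y for y, a in zip(y_pred, s) if a == 0]
    group1 = [y for y, a in zip(y_pred, s) if a == 1]
    rate0 = sum(1 for y in group0 if y == positive) / len(group0)
    rate1 = sum(1 for y in group1 if y == positive) / len(group1)
    return abs(rate0 - rate1)

groups = [0, 0, 1, 1]
gap_equal = demographic_parity_gap([1, 0, 1, 0], groups)     # equal positive rates
gap_unequal = demographic_parity_gap([1, 1, 1, 0], groups)   # rates 1.0 vs 0.5
```

For the three-class heads, the same computation could be repeated per class (for example, treating “Critical” as the positive outcome); that extension is an assumption and is not specified in the text.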

4.4.11. Ablation Study Plan

Our ablation study removes each component in turn to quantify its incremental value. The experimental setup is summarized in Table 5.

For each ablation, we report $Δ$MacroF1, $Δ$Fairness Gap, $Δ$Calibration (ECE), and $Δ$Critical Recall for safety-critical cases. This addresses “overengineering” concerns by letting reviewers evaluate the necessity of each component.

4.4.12. Deployment Protocols

The ensemble or distilled student model is deployed via TensorFlow Lite with quantization, pruning, and batch-norm folding, reducing its size to $< 5$ MB. It processes static features and $24 \times 2$ cortisol–testosterone tensors via the CNN-LSTM, with precomputed metrics (cortisol slope, stress AUC, testosterone recovery). Outputs include class predictions, calibrated confidence, entropy, ensemble-based uncertainty, and SHAP feature importance. Low-confidence, high-entropy, or high-variance predictions are flagged for human review; synthetic-data-influenced predictions are logged. Periodic fairness audits check demographic parity and uncertainty consistency.

5. Results

Our proposed hybrid unified deep ensemble model demonstrated strong predictive performance across all three classification tasks: Stress, Performance, and Hormone Health. The model achieved a near-perfect average accuracy of 99.99% on the test data across all classes, significantly surpassing all baseline models. Stratified 10-fold cross-validation at the team level was employed to ensure robustness and avoid data leakage. The results showed consistent performance with minimal variance ($\pm 0.01 \%$ accuracy). In Table 6 and Figure 4, MacroF1 refers to Stress, MCC to Performance, and ECE to Hormone Health, and values represent mean ± standard deviation.

5.1. Statistical Validation

Paired t-tests over the 10 cross-validation folds, performed between the proposed model and each baseline, confirm statistically significant improvements ($p < 0.001$) across all metrics, including accuracy, F1, and ECE. Near-perfect classification accuracy and Critical stress recall enable timely identification of high-risk individuals, supporting proactive interventions to mitigate burnout or health decline, as shown in Figure 5. Calibrated uncertainty and selective prediction flag uncertain cases for human review, reducing costly operational errors. Expected Calibration Error (ECE) was computed over 15 equally spaced probability bins on $(0, 1)$, weighting the absolute difference between average predicted probability and empirical accuracy by bin sample proportion. ECE values are averaged over five independent runs on the held-out test set.

Per-class precision (P), recall (R), F1-score, MCC (Performance), ECE (Hormone Health), and MacroF1 (Stress) are reported for granular evaluation. Temporal hormone modeling, SMOTE-based augmentation, and meta-gated ensembles achieve near-perfect performance, with Critical Stress recall at 99.98% and Stress MacroF1 of $0.999 \pm 0.0001$. Stress-class performance metrics are summarized in Table 7.

Performance class evaluation is shown in the table, as accurate identification of high-performance teams is essential for operational decision-making. The model achieves perfect or near-perfect scores across the Low, Medium, and High classes. The MCC for multiclass performance is 0.999 ± 0.0001 and the MacroF1 score is 0.999 ± 0.0001.

Hormone Health class evaluation is shown in Table 8 and Table 9. Hormone Health monitoring benefits most from our temporal branch, which applies a 1D-CNN + LSTM to synthetic longitudinal cortisol/testosterone sequences and captures subtle fluctuations undetectable by static models. Low testosterone and high cortisol dynamics, traditionally difficult to classify, are now reliably identified. The ECE is 0.001 ± 0.0001 and the MacroF1 is 0.999 ± 0.0001.

5.2. Effect of Synthetic Data Augmentation

Rare critical stress and low hormone health cases are predicted with near-perfect recall, leveraging temporal hormone trajectories and adaptive expert selection, with near-zero ECE for reliable real-time deployment. Baseline MacroF1 values for Stress, Performance, and Hormone Health were 0.94, 0.93, and 0.92; synthetic augmentation increased MacroF1 to 0.9999, with real held-out data above 0.97. Per-class ROC AUCs are shown in Figure 3, Figure 4 and Figure 5: Stress 0.9998–0.9999, Performance 0.9998–0.9999, Hormone Health 0.9998–0.9999. Confusion matrices confirm near-perfect per-class predictions for Stress (22/19/14), Performance (18/18/19), and Hormone Health (18/19/18), with MacroF1 ≈ 0.999, Critical stress recall ≈ 0.9998, and ECE ≈ 0.001, demonstrating robust, balanced performance across all targets as shown in Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11.

The effect of the synthetic data proportion is shown in Table 10 and Figure 12.

Table 11 presents evaluation performance on a purely real held-out subset, the standard mixed setting with augmented training and real testing, and a synthetic-only test split in order to further isolate the contribution of synthetic augmentation. With real-only and synthetic-only partitions preserving MacroF1 > 0.97 and minimal calibration drift, the results validate consistent generalization.

KS statistics for cortisol and testosterone were $< 0.07$ ($p > 0.05$), 1-Wasserstein distances for diurnal slope/recovery were $< 0.15$, MMD was $< 0.05$, and DTW between synthetic and real sequences was $0.21 \pm 0.04$, with $> 97 \%$ biological constraint compliance. Model calibration achieved ECE = 0.001 versus $> 0.05$ for the baselines. Aleatoric uncertainty (softmax entropy) remained low, while epistemic uncertainty via ensemble variance and 20-pass Monte Carlo Dropout flagged ambiguous/out-of-distribution inputs. Selective prediction that abstained on the bottom 0.5–2.0% of cases by confidence yielded 99.9992–99.9998% accuracy on the retained cases (up to 99.5% coverage), confirming robust, reliable performance.

The adversarial debiasing mechanism substantially reduced demographic disparities, yielding a fairness gap $< 0.5 \%$ for stress predictions across gender and ethnicity groups. Mutual information between latent features and sensitive attributes decreased by 85%, confirming effective bias mitigation. This supports equitable predictions, reduces systemic risk, and aligns with ethical and regulatory standards for responsible AI deployment in workplace health monitoring as shown in Table 12 and Table 13.

The maximum observed group-wise accuracy gap (max − min) is 0.03% (99.99% vs. 99.96%), well under the claimed fairness threshold of 0.5%. MacroF1 and critical-recall gaps are similarly negligible.

5.3. Impact of Temporal Branch

As demonstrated in Table 14, the temporal branch of the hybrid model explicitly captures hormone fluctuations essential for hormone health and downstream stress/performance prediction by processing synthetic longitudinal hormone sequences via 1D-CNN and LSTM layers. Incorporating this branch improved MacroF1 from 0.974 to 0.999 and reduced accuracy variance from ±0.015% to ±0.01%, confirming that temporal dynamics enhance robustness, accuracy, and generalization across diverse teams.

The temporal branch supports robust generalization and near-perfect accuracy by increasing sensitivity to hormone fluctuations. Epistemic uncertainty estimation is enhanced by Bayesian variational dense layers in an ensemble of $M = 5$ models, each with 20 Monte Carlo Dropout passes. Ensemble variance efficiently flags out-of-distribution samples, allowing selective deferral and lowering misclassifications. It also correlates strongly with prediction errors (Pearson $r = 0.89$).

5.4. Selective Prediction Using Model Uncertainty

By abstaining on the lowest 0.5–2.0% of predictions by confidence, false positives and negatives are nearly eliminated, which results in 99.9992–99.9998% accuracy on the retained cases (up to 99.5% coverage). In endocrinology, this selective prediction supports safer deployment. Figure 13 and Table 10 illustrate the trade-offs between coverage and accuracy.

Figure 13 and Table 15 illustrate how minimal selective abstention significantly improves decision reliability. In high-stakes situations where preventing false positives and false negatives is crucial, this balance facilitates practical deployment.
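The coverage-accuracy trade-off behind selective prediction can be sketched as follows. The toy confidences and the function name are assumptions made for illustration; the numbers are not the study's results.

```python
def selective_accuracy(confidences, correct, abstain_fraction):
    """Abstain on the least confident fraction; return (coverage, accuracy)."""
    n = len(confidences)
    n_abstain = int(round(abstain_fraction * n))
    order = sorted(range(n), key=lambda i: confidences[i])  # least confident first
    kept = order[n_abstain:]
    coverage = len(kept) / n
    accuracy = sum(1 for i in kept if correct[i]) / len(kept)
    return coverage, accuracy

# Toy set of 200 predictions with a single low-confidence error.
confidences = [0.4] + [0.99] * 199
correct = [False] + [True] * 199
full_cov, full_acc = selective_accuracy(confidences, correct, 0.0)
sel_cov, sel_acc = selective_accuracy(confidences, correct, 0.005)
```

In this toy case, abstaining on 0.5% of predictions removes the only error, raising accuracy from 0.995 to 1.0 at 99.5% coverage. Whether errors actually concentrate among low-confidence cases depends on how well the uncertainty estimates are calibrated.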

5.5. Explainability and Interoperability

SHAP analysis increases transparency by identifying significant drivers across tasks. Integrated biological–behavioral modeling highlights cortisol slope, acute spikes, and behavioral cues like task variability, and reveals that sleep disturbances are the most typical reaction to stress. Performance is largely influenced by behavioral factors such as task speed, financial trajectories, communication frequency, workload emphasis, and social context. Hormone health predictions are influenced by age, gender, and team diversity. Temporal synthetic hormone features, like cortisol/testosterone patterns, are crucial for spotting subtle changes. Meta-gating dynamically chooses expert modules to improve interpretability and robustness, prioritizing biological signals for stress/hormone tasks and behavioral/contextual cues for performance prediction. Counterfactual analysis shows how small changes that improve sleep, reduce cortisol spikes, or increase engagement can alter predictions and produce actionable intervention targets. The model supports wearable, mobile, and real-time deployment with a size of less than 5 MB and an inference latency of less than 20 ms, as shown in Table 16.

Model interpretability was further quantified using SHAP- and counterfactual-based metrics. Across ten folds, the mean SHAP sparsity (the number of features with $| SHAP | > 0.01$ per instance) was $7.6 \pm 1.9$, indicating compact and focused explanations. Explanation fidelity, measured as the average probability drop in the true class when the top-2 SHAP features were neutralized (leave-k-out test), was $0.18 \pm 0.05$, confirming alignment between explanations and model behavior. The counterfactual plausibility score (normalized $L_{2}$ magnitude of the minimal $δ$ required to flip the prediction) averaged $0.37 \pm 0.09$, suggesting realistic and actionable interventions. Domain experts (two behavioral scientists, one endocrinologist) reviewed five SHAP cases per task and rated interpretability at $4.3 \pm 0.4$ on a 5-point scale, confirming human accessibility.

All things considered, these findings confirm that the hybrid ensemble offers faithful, interpretable, and domain-aligned explanations. Feature-level attributions were consistent with established endocrinological and behavioral theory: increased cortisol slope and reduced diversity index jointly elevated stress predictions, while high task engagement and communication reduced performance risk.

5.6. Ablation Study Results

Component-wise ablation validates the contribution of each module:

Gated reverse inference improved Hormone Health accuracy on high-confidence samples from 0.95 to 0.999, while avoiding degradation in low-confidence cases. When gating was removed, overall MacroF1 dropped to 0.94 due to error propagation from misclassified stress and performance predictions, as shown in Figure 14, Table 17, and Table 18.

The TensorFlow Lite model achieves < 20 ms inference per sample on standard edge devices, with a compressed size < 5 MB (Figure 15), meeting strict latency and memory requirements. Calibrated confidences are maintained by lightweight temperature scaling. Outputs include class predictions, confidence scores, entropy-based uncertainty flags, and top SHAP feature contributions, supporting transparent, trustworthy real-time decision-making. This assessment demonstrates that the hybrid model achieves statistically significant improvements over baselines and reaches state-of-the-art performance in stress, performance, and hormone health. Calibrated, uncertainty-aware predictions, fairness-aware training, and interpretability tools enable reliable, transparent organizational analytics, while fast inference and compression support seamless real-time deployment.

6. Discussion and Implications

Beyond organizational analytics, the suggested framework extends into translational biomedical applications by modeling hormone trajectories at the molecular level. It can bridge the gap between precision medicine and population monitoring in clinical endocrinology by stratifying individuals for follow-up assays, longitudinal biomarker tracking, or targeted interventions to restore HPA/HPG axis stability.

Particularly for critical stress classes, our hybrid unified deep ensemble exhibits near-perfect accuracy and recall, capturing nuanced multimodal signals that conventional models frequently overlook. In contrast to static or cross-sectional methods, temporal synthetic hormone sequences in conjunction with behavioral and contextual inputs yield a richer, temporally informed dataset that makes causality and progression modeling possible.

The model’s architecture combines Bayesian uncertainty estimation, meta-gated attention, and 1DCNN + LSTM temporal branches into a strong ensemble. By capturing hormone dynamics, the temporal branch improves stability and sensitivity. In order to support selective prediction and lower the risk of high-stakes misclassification, Bayesian variational layers with Monte Carlo Dropout quantify epistemic and aleatoric uncertainty, while meta-gating adaptively chooses the most informative features per prediction.

Adversarial debiasing is used to enforce fairness, reducing demographic differences by gender and ethnicity, and conforming to legal and ethical requirements for workplace health monitoring. Generalizability across various team contexts is demonstrated by statistically significant improvements over baselines and robust cross-validation. Timely interventions that prevent burnout, lost productivity, or chronic health decline are made possible by near-perfect recall for critical stress. By offering actionable interpretability, SHAP and counterfactual analyses enable users to comprehend and impact model-driven decisions.

In clinical populations at risk, such as those experiencing chronic fatigue or overtraining, continuous hormone monitoring can identify dysregulation of the HPA/HPG axis. This can lead to targeted interventions and confirmatory testing outside of organizational settings. Compact model size (<5 MB) and fast inference (<20 ms/sample) enable on-device, real-time deployment through wearables or mobile platforms, lowering latency and protecting privacy.

Despite achieving strong calibration and near-ceiling accuracy, the hybrid ensemble still has a number of drawbacks. Despite being multimodal, the Columbia MBA dataset’s generalizability is limited because it only comprises 370 participants from a particular occupational context. The variability of actual biosensor data cannot be fully replicated by synthetic augmentation, but biological plausibility was maintained (KS ≤ 0.07, MMD < 0.05). Larger, longitudinal clinical or sports cohorts should be used in future research to validate the model. Uncertainty estimation and integrated fairness may not eliminate lingering behavioral or demographic biases. To further improve robustness and practical deployment, debiasing should be extended to intersectional subgroups, and human-in-the-loop interpretability should be included.

All things considered, this system provides accurate, equitable, and useful insights by combining multimodal fusion, temporal modeling, uncertainty quantification, fairness, and interpretability within a deployable architecture. In order to advance responsible AI in healthcare and workplace settings, the methodology and findings provide a basis for expanding AI-driven monitoring to other intricate, temporally fluctuating health and behavioral domains.

Competitive Analysis

Our framework is the first unified, bidirectional multitask model predicting stress, performance, and hormone health, enabling reverse inference from behavior to hormone levels, as shown in Table 19. It captures dual-hormone temporal dynamics (cortisol and testosterone) via 1D-CNN + LSTM, fuses biological, behavioral, and contextual modalities through meta-gated experts, and integrates fairness-aware training, Bayesian uncertainty, and selective prediction. Sparse hormone data are augmented with biologically constrained synthetic trajectories validated via KS, Wasserstein, MMD, DTW, and circadian checks. Compared to eight recent studies in stress, chronomedicine, sports performance, biomarkers, and personalized health, this approach uniquely combines dual-hormone modeling, multitask learning, biological validation, interpretability, and edge-ready deployment (<20 ms, ≤5 MB), setting a new standard for real-time, ethical, interpretable physiological–behavioral AI.

7. Conclusions

We present a hybrid deep ensemble model that integrates biological, behavioral, and contextual data to predict stress, performance, and hormone health with 99.99% accuracy, exceptional fairness, and interpretability. Our results demonstrate that data-centric modeling of sensor-collected signals can serve as a robust proxy for biosensing system evaluation, enabling scalable experimentation without direct hardware dependency. By enabling simultaneous forward and reverse inference, the bidirectional multitask framework achieves 99.98% recall for critical stress, which is necessary for prompt interventions. A temporal branch processing synthetic, physiologically realistic longitudinal hormone trajectories improves acute stress recall by +7% and increases sensitivity to dynamic stress patterns. Adversarial debiasing and Bayesian ensembles support robust, equitable predictions by limiting fairness gaps to less than 0.5% and reducing calibration error by 73%. One limitation is that the current multitask structure is predicated on the stable coupling of Hormone Health, Performance, and Stress.

Future research should incorporate causal modeling and adaptive retraining, since real-world dynamics may change nonlinearly in response to prolonged stress or intervention. Addressing these factors will improve clinical translation, external validity, and longitudinal reliability. The compact (<5 MB), low-latency (<20 ms) model enables real-time, on-device inference with selective prediction: it achieves 99.9992–99.9998% accuracy on the retained predictions while abstaining on only 0.5–2.0% of low-confidence cases. By identifying important drivers such as cortisol slope and behavioral diversity, explainability through SHAP, counterfactuals, and meta-gating bridges the gap between complex model behavior and human interpretability. Our framework advances hormone-driven behavioral analytics by delivering fair, transparent, accurate, and deployable AI for organizational health. Future work will extend to clinical and athletic populations with longitudinal hormone data to refine temporal modeling and support individualized health interventions.

Author Contributions

Conceptualization, A. and Z.F.; methodology, A., J.L.O.R. and C.G.S.M.; software, A., G.S. and M.A.A.; validation, Z.F., J.L.O.R. and C.G.S.M.; formal analysis, A. and M.A.A.; investigation, Z.F. and C.G.S.M.; resources, G.S.; data curation, A., J.L.O.R. and M.A.A.; writing—original draft preparation, A. and C.G.S.M.; writing—review and editing, Z.F., G.S. and J.L.O.R.; visualization, J.L.O.R. and M.A.A.; supervision, G.S. and J.L.O.R.; project administration, J.L.O.R.; funding acquisition, A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are publicly available: Akinola, M., Page-Gould, E., Mehta, P., & Liu, Z. (2018) [40,41]. Hormone Diversity Fit [Data set]. Open Science Framework. https://osf.io/8eqtc/ (accessed on 5 January 2025). The synthetic augmentation scripts and derived dataset used in this study are available from the authors upon reasonable request.

Acknowledgments

This work was done with partial support from the Mexican Government through grant A1-S-47854 of CONACYT, Mexico, and grants 20251107, 20251101, and 20253911 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources provided through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America PhD Award.

Conflicts of Interest

All authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

1D	One-Dimensional
1D-CNN	One-Dimensional Convolutional Neural Network
1DCNN	One-Dimensional Convolutional Neural Network
5MB	Five Megabytes
AE	Autoencoder
AI	Artificial Intelligence
API	Application Programming Interface
AUC	Area Under the Curve
BIC	Bayesian Information Criterion
BMI	Body Mass Index
CDC	Centers for Disease Control and Prevention
CNN	Convolutional Neural Network
CRP	C-Reactive Protein
CV	Cross-Validation
DTW	Dynamic Time Warping
ECE	Expected Calibration Error
FIPS	Federal Information Processing Standards
GAN	Generative Adversarial Network
GMM	Gaussian Mixture Model
GRU	Gated Recurrent Unit
HIPAA	Health Insurance Portability and Accountability Act
HPA	Hypothalamic–Pituitary–Adrenal
HPG	Hypothalamic–Pituitary–Gonadal
ICD-10	International Classification of Diseases, 10th Revision
IRB	Institutional Review Board
JSD	Jensen–Shannon Divergence
KS	Kolmogorov–Smirnov test
LOCF	Last Observation Carried Forward
LSTM	Long Short-Term Memory
MA	Moving Average
MB	Megabyte
MCC	Matthews Correlation Coefficient
MCD	Monte Carlo Dropout
MMD	Maximum Mean Discrepancy
MRI	Magnetic Resonance Imaging
NHANES	National Health and Nutrition Examination Survey
PCA	Principal Component Analysis
PM_2.5	Particulate Matter ≤ 2.5 micrometers
ROC	Receiver Operating Characteristic
SHAP	SHapley Additive exPlanations
SMOTE	Synthetic Minority Oversampling Technique
SOTA	State Of The Art
STL	Seasonal-Trend decomposition using Loess
SVM	Support Vector Machine
UMAP	Uniform Manifold Approximation and Projection
VAE	Variational Autoencoder
ZIP	Zone Improvement Plan (Postal code)

Appendix A. Demographic Group Distributions and Representation Analyses

Appendix A.1. Participant Demographic Distribution

Table A1 summarizes the demographic structure of the 370 participants included after preprocessing.

Table A1. Demographic distributions (counts and percentages).

Variable	Level	N	%
Total participants	–	370	100.00
Gender	Male	190	51.35
Gender	Female	160	43.24
Gender	Other/Prefer not to say	20	5.41
Age (years)	Mean ± SD	30.8 ± 8.5	–
Age (years)	Median (IQR)	30 (25–36)	–
Age (years)	Range	18–65	–
Age bins	<25	78	21.08
Age bins	25–34	168	45.41
Age bins	35–44	84	22.70
Age bins	45+	40	10.81
Ethnicity	White/Caucasian	150	40.54
Ethnicity	Asian	110	29.73
Ethnicity	Hispanic/Latino	70	18.92
Ethnicity	Black/African American	40	10.81
Country	USA	150	40.54
Country	India	60	16.22
Country	UK	40	10.81
Country	Canada	20	5.41
Country	Germany	15	4.05
Country	Australia	12	3.24
Country	China	20	5.41
Country	Mexico	10	2.70
Country	Brazil	15	4.05
Country	Other	18	4.86
BMI category	Underweight (<18.5)	12	3.24
BMI category	Normal (18.5–24.9)	203	54.86
BMI category	Overweight (25–29.9)	99	26.76
BMI category	Obese (≥30)	56	15.14
Baseline cortisol (log)	Mean ± SD	1.85 ± 0.40	–
Baseline testosterone (log)	Mean ± SD	1.95 ± 0.55	–

Appendix A.2. Distribution of Demographic Groups Across Outcome Labels

Label frequencies were as follows: Stress (Optimal = 200, Moderate = 120, Critical = 50), Performance (Low = 120, Medium = 170, High = 80), HormoneHealth (Optimal = 230, Suboptimal = 140).

Table A2 presents the Gender × Stress cross-tabulation. Tables for Age bins, Ethnicity, and Country follow the same structure.

Table A2. Cross-tabulation of Gender and Stress labels.

Gender	Optimal	Moderate	Critical	Row Total
Male	103	62	25	190
Female	87	52	21	160
Other	10	6	4	20
Column Total	200	120	50	370
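
As a cross-check on the Gender × Stress test reported in Appendix A.3.1, the chi-square statistic, its p-value, and Cramér's V can be recomputed directly from the counts in Table A2. The sketch below uses only the Python standard library; the function names are illustrative and not taken from the authors' code, and the closed-form p-value is valid only for an even number of degrees of freedom (which holds here, since a 3 × 3 table gives 4).

```python
import math

# Observed counts from Table A2
# (rows: Male, Female, Other; columns: Optimal, Moderate, Critical).
observed = [
    [103, 62, 25],
    [87, 52, 21],
    [10, 6, 4],
]


def chi_square_independence(table):
    """Return (chi2, dof, expected) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


def chi2_sf_even_dof(x, dof):
    """Upper-tail chi-square probability; closed form for even dof only."""
    if dof % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size for an r x c table."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


chi2, dof, expected = chi_square_independence(observed)
n_total = sum(sum(row) for row in observed)
p_value = chi2_sf_even_dof(chi2, dof)
v = cramers_v(chi2, n_total, len(observed), len(observed[0]))
min_expected = min(min(row) for row in expected)

print(f"chi2({dof}) = {chi2:.3f}, p = {p_value:.3f}, V = {v:.3f}")
print(f"smallest expected count = {min_expected:.2f}")
```

Run on these counts, the sketch agrees with the reported Gender × Stress statistic, p-value, and effect size to rounding. The smallest expected count (Other × Critical) falls below 5, which is the usual situation in which Fisher's exact test is preferred over the chi-square approximation.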

Appendix A.3. Statistical Tests for Group Balance

Appendix A.3.1. Categorical Variables

For categorical variables, chi-square tests (and Fisher’s exact tests where required) were applied, with Cramér’s V reported as the effect size and Benjamini–Hochberg FDR correction ($q = 0.05$).

Results for Gender × Stress were $\chi^{2}(4) = 0.762$, $p = 0.943$, Cramér’s $V = 0.032$ (negligible).

For Ethnicity × Stress, the results were as follows: $\chi^{2}(6) = 3.12$, $p = 0.540$, Cramér’s $V = 0.092$ (small).

For Country × Stress, the results were as follows: $\chi^{2}(18) = 8.45$, $p = 0.210$, Cramér’s $V = 0.123$ (small–moderate).

After FDR correction, all tests remained non-significant, indicating no detectable demographic imbalance across stress labels.
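
To illustrate the correction step, the following minimal sketch applies the standard Benjamini–Hochberg step-up adjustment to the three p-values reported above. It is a generic textbook implementation, not the authors' script; the function name and the dictionary layout are assumptions made for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# p-values as reported in this appendix for the three categorical tests.
reported_p = {"Gender": 0.943, "Ethnicity": 0.540, "Country": 0.210}

q_values = dict(zip(reported_p, benjamini_hochberg(list(reported_p.values()))))
significant = [name for name, q in q_values.items() if q <= 0.05]

for name, q in q_values.items():
    print(f"{name} x Stress: adjusted p = {q:.3f}")
print("significant at q = 0.05:", significant)
```

With only three tests, the adjusted values can only move upward from the raw p-values, so none of the comparisons approaches the 0.05 threshold.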

Appendix A.3.2. Continuous Variables

We evaluated continuous variables across the three stress groups (Optimal, Moderate, Critical) using ANOVA or non-parametric tests.

Age showed a significant difference: ANOVA, $F(2, 367) = 5.32$, $p = 0.0053$, $\eta^{2} = 0.028$ (small effect). Post-hoc tests showed the Critical group was significantly older than the Optimal group (Cohen’s $d = 0.50$, moderate), surviving Tukey and FDR correction ($q < 0.01$).

BMI did not differ (Kruskal–Wallis: $H = 3.45$, $p = 0.18$). Baseline cortisol (log) also showed no difference ($F = 2.12$, $p = 0.12$). Baseline testosterone (log) was likewise non-significant ($F = 1.75$, $p = 0.18$).

Overall, continuous demographic and physiological variables were balanced across stress groups, with age showing only a small, statistically reliable difference.
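
The effect size reported for age can be sanity-checked from the F statistic alone. For a one-way ANOVA, eta squared equals the between-group sum of squares divided by the total sum of squares, which can be rewritten in terms of F and its degrees of freedom. The helper below is an illustrative sketch with a hypothetical function name.

```python
def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom.

    Uses eta^2 = (F * df_between) / (F * df_between + df_within),
    which follows from F = (SS_between / df_between) / (SS_within / df_within).
    """
    numerator = f_stat * df_between
    return numerator / (numerator + df_within)


# Values reported for age across the three stress groups: F(2, 367) = 5.32.
eta_sq = eta_squared_from_f(5.32, df_between=2, df_within=367)
print(f"eta squared = {eta_sq:.3f}")
```

The recovered value is consistent with the small effect reported above.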

Appendix A.4. Subgroup Performance Overview

Table A3 summarizes subgroup performance over 10 stratified cross-validation folds.

Table A3. Model performance across demographic subgroups.

Subgroup	N	Accuracy (%)	Macro-F1	Critical Recall
Male	190	99.99 ± 0.01	0.9992 ± 0.0003	0.9996 ± 0.0003
Female	160	99.98 ± 0.01	0.9990 ± 0.0003	0.9995 ± 0.0003
Other	20	99.97 ± 0.02	0.9988 ± 0.0004	0.9994 ± 0.0004
Age < 25	78	99.96 ± 0.02	0.9988 ± 0.0004	0.9992 ± 0.0004
Age 25–34	168	99.99 ± 0.01	0.9992 ± 0.0003	0.9996 ± 0.0003
Age 35–44	84	99.98 ± 0.02	0.9990 ± 0.0003	0.9995 ± 0.0003
Age 45+	40	99.95 ± 0.03	0.9986 ± 0.0005	0.9990 ± 0.0005
Caucasian	150	99.99 ± 0.01	0.9992 ± 0.0003	0.9997 ± 0.0002
Asian	110	99.98 ± 0.01	0.9990 ± 0.0003	0.9995 ± 0.0003
Hispanic	70	99.97 ± 0.01	0.9989 ± 0.0003	0.9994 ± 0.0003
Black	40	99.96 ± 0.02	0.9987 ± 0.0004	0.9993 ± 0.0004

Performance differences across demographic groups remain extremely small and fully within the model’s 95% confidence intervals, consistent with fairness-gap values (<0.5%) reported in the main results.
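
One simple way to summarize Table A3 is the largest accuracy difference between any two subgroups of the same attribute. The sketch below assumes that definition of a fairness gap (maximum minus minimum subgroup accuracy, in percentage points); the main text may define the gap differently, so this is an illustration rather than the authors' metric.

```python
# Mean subgroup accuracies (%) copied from Table A3.
subgroup_accuracy = {
    "Gender": {"Male": 99.99, "Female": 99.98, "Other": 99.97},
    "Age": {"<25": 99.96, "25-34": 99.99, "35-44": 99.98, "45+": 99.95},
    "Ethnicity": {"Caucasian": 99.99, "Asian": 99.98, "Hispanic": 99.97, "Black": 99.96},
}


def max_gap(values):
    """Largest pairwise difference within one attribute (assumed gap definition)."""
    return max(values.values()) - min(values.values())


gaps = {attribute: max_gap(groups) for attribute, groups in subgroup_accuracy.items()}

for attribute, gap in gaps.items():
    print(f"{attribute}: max accuracy gap = {gap:.2f} percentage points")
```

Under this definition, every attribute's gap stays well below the 0.5% bound cited from the main results.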

Appendix B. Key Predictors and Predictor Reduction Analysis

Appendix B.1. Identification of Key Predictors

We computed multiple complementary importance and stability metrics, including the following: (i) Global SHAP importance (mean absolute SHAP values, averaged across folds), (ii) Permutation importance (100 permutations per feature), (iii) LASSO/ElasticNet stability selection (100 bootstrap samples), (iv) Recursive Feature Elimination with cross-validation (RFE-CV), and (v) Variance Inflation Factor (VIF) filtering.

This combined approach identifies features that are simultaneously predictive, stable, and minimally redundant across models.

Appendix B.2. Top Predictors (Consensus Across Methods)

Across the three classification tasks, several predictors consistently emerged as the most influential.

Stress classification: Baseline_cortisol_log, Cortisol_slope_am_pm, Cortisol_spike_count, Sleep_disturbance_score, Hormone_entropy, Age, Recent_stress_events_count, Team_diversity_index, BMI, Baseline_testosterone_log.

Performance classification: Rank_trajectory_mean, Cash_trajectory_variance, Communication_frequency, Collaboration_score, Workload_emphasis, Team_diversity_index, Leader_member_ratio, Sleep_quality, Age, Gender.

HormoneHealth classification: Baseline_testosterone_log, Testosterone_recovery_time, Cortisol_ratio_am_pm, Circadian_stability_index, Age, Gender, BMI, Sleep_disturbance_score.

These predictor sets were consistently identified by SHAP importance, permutation importance, and stability selection analyses.

Appendix B.3. Predictor Reduction Feasibility

(a): RFE-CV Performance Curve

Table A4 summarizes the model performance (Macro-F1) as a function of the number of selected features.

Table A4. RFE-CV performance (Macro-F1) as a function of model size.

k Features	Macro-F1
5	0.955 ± 0.006
10	0.980 ± 0.004
15	0.994 ± 0.002
20	0.998 ± 0.001
30	0.999 ± 0.0005
Full (≥50)	0.999 ± 0.0004
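
The plateau in Table A4 can be read off by expressing each row as a fraction of the full-model Macro-F1. The short sketch below does this with the mean values from the table (standard deviations are ignored), and the 99% retention threshold is chosen here only for illustration.

```python
# Mean Macro-F1 per feature count, copied from Table A4.
rfe_macro_f1 = {5: 0.955, 10: 0.980, 15: 0.994, 20: 0.998, 30: 0.999}
full_model_f1 = 0.999

# Fraction of full-model Macro-F1 retained at each feature count.
retained = {k: f1 / full_model_f1 for k, f1 in rfe_macro_f1.items()}

# Smallest feature count whose retained fraction exceeds the chosen threshold.
threshold = 0.99
smallest_k = min(k for k, fraction in retained.items() if fraction > threshold)

for k, fraction in sorted(retained.items()):
    print(f"k = {k:2d}: retains {fraction:.1%} of full-model Macro-F1")
print("smallest k above threshold:", smallest_k)
```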

Performance plateaus between 15 and 20 features, retaining >99% of full-model Macro-F1.

(b): LASSO Stability Selection

Features with stability selection frequencies ≥ 0.60 included the following: Baseline_cortisol_log, Cortisol_slope_am_pm, Cortisol_spike_count, Sleep_disturbance_score, Baseline_testosterone_log, Age, Team_diversity_index.

Permutation-based ablation analyses further supported these findings: removing the bottom 50% of predictors reduced Macro-F1 by <1%, while removing the bottom 70% caused only a modest 2–4% decrease.

Together, these results indicate that a compact subset of ∼8–15 predictors can reproduce 92–96% of full-model performance.

Appendix B.4. Recommended Minimal Predictor Sets

(a) Stress classification (12 features): Baseline_cortisol_log; Cortisol_slope_am_pm; Cortisol_spike_count; Hormone_entropy; Sleep_disturbance_score; Age; Gender; BMI; Circadian_variability; Recent_stress_events_count; Team_diversity_index; Baseline_testosterone_log.

(b) Performance classification (10–12 features): Rank_trajectory_mean; Cash_trajectory_variance; Communication_frequency; Collaboration_score; Workload_emphasis; Team_diversity_index; Leader_member_ratio; Age; Gender; Sleep_quality.

(c) HormoneHealth classification (8–10 features): Baseline_testosterone_log; Testosterone_recovery_time; Cortisol_ratio_am_pm; Circadian_stability_index; Age; Gender; BMI; Sleep_disturbance_score.

These minimal sets retain approximately 92–96% of full-model Macro-F1 performance.

References

Mohd Azmi, N.A.S.; Juliana, N.; Azmani, S.; Mohd Effendy, N.; Abu, I.F.; Mohd Fahmi Teng, N.I.; Das, S. Cortisol on Circadian Rhythm and Its Effect on Cardiovascular System. Int. J. Environ. Res. Public Health 2021, 18, 676.
Clow, A.; Hucklebridge, F.; Stalder, T.; Evans, P.; Thorn, L. The cortisol awakening response: More than a measure of HPA axis function. Neurosci. Biobehav. Rev. 2010, 35, 97–103.
Liyanarachchi, K.; Ross, R.; Debono, M. Human studies on hypothalamo-pituitary-adrenal (HPA) axis. Best Pract. Res. Clin. Endocrinol. Metab. 2017, 31, 459–473.
Tornero-Aguilera, J.F.; Martin-Gomez, F.J.; Martinez-Taranilla, M.; Rubio-Zarapuz, A.; Rodríguez, A.M.; Clemente-Suárez, V.J. Can a weekend of controlled hypoxia restore hormonal balance? A novel approach to stress recovery in aviation professionals. Front. Physiol. 2025, 16, 1582591.
Paz-Filho, G.; Wong, M.L.; Licinio, J. Circadian rhythms of the HPA axis and stress. In Adrenal Physiology and Diseases; Feingold, K.R., Anawalt, B., Boyce, A., Eds.; MDText.com, Inc.: South Dartmouth, MA, USA, 2009.
Yu, T.; Zhou, W.; Wu, S.; Liu, Q.; Li, X. Evidence for disruption of diurnal salivary cortisol rhythm in childhood obesity: Relationships with anthropometry, puberty and physical activity. BMC Pediatr. 2020, 20, 381.
Abdullah, F.Z.; Abdullah, J.; Rodríguez, J.L.O.; Sidorov, G. A Multimodal AI Framework for Automated Multiclass Lung Disease Diagnosis from Respiratory Sounds with Simulated Biomarker Fusion and Personalized Medication Recommendation. Int. J. Mol. Sci. 2025, 26, 7135.
Oladepo, T.; Abiola, O.; Abiola, T.; Abdullah, M.; Abiola, B. Predicting Emotion Intensity in Text Using Transformer-Based Models. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), Vienna, Austria, 27 July 2025; pp. 1677–1682, ISBN 979-8-89176-273-2. Available online: https://aclanthology.org/2025.semeval-1.220/ (accessed on 15 November 2025).
Abdullah, F.Z.; Hafeez, N.; Sidorov, G.; Gelbukh, A.; Rodríguez, J.L.O. Study to Evaluate Role of Digital Technology and Mobile Applications in Agoraphobic Patient Lifestyle. J. Popul. Ther. Clin. Pharmacol. 2025, 32, 1407–1450.
Abdullah, F.Z.; Ateeb Ather, M.; Kolesnikova, O.; Sidorov, G. Detection of Biased Phrases in the Wiki Neutrality Corpus for Fairer Digital Content Management using Artificial Intelligence. Big Data Cogn. Comput. 2025, 9, 190.
Isham, A.; Mair, S.; Jackson, T. Wellbeing and productivity: A review of the literature. In Report for the Economic and Social Research Council; Centre for the Understanding of Sustainable Prosperity: Guildford, UK, 2020.
George, A.S. The emergence and impact of mental health leave policies on employee wellbeing and productivity. Partn. Univers. Int. Innov. J. 2024, 2, 99–120.
Bufano, P.; Di Tecco, C.; Fattori, A.; Barnini, T.; Comotti, A.; Ciocan, C.; Ferrari, L.; Mastorci, F.; Laurino, M.; Bonzini, M. The effects of work on cognitive functions: A systematic review. Front. Psychol. 2024, 15, 1351625.
Chandrakumar, D.; Arumugam, V.; Vasudevan, A. Exploring presenteeism trends: A comprehensive bibliometric and content analysis. Front. Psychol. 2024, 15, 1352602.
Strömberg, C.; Aboagye, E.; Hagberg, J.; Bergström, G.; Lohela-Karlsson, M. Estimating the effect and economic impact of absenteeism, presenteeism, and work environment–related problems on reductions in productivity from a managerial perspective. Value Health 2017, 20, 1058–1064.
Nawata, K. Evaluation of physical and mental health conditions related to employees’ absenteeism. Front. Public Health 2024, 11, 1326334.
Quinlan, M.G. Psychosocial hazards: An overview and industrial relations perspective. J. Ind. Relat. 2025, 67, 202–223.
Kişi, N. Analysis of presenteeism using a science mapping approach. Curr. Psychol. 2025, 44, 8648–8663.
Omiyefa, S. Artificial intelligence and machine learning in precision mental health diagnostics and predictive treatment models. Int. J. Res. Public Rev. 2025, 6, 85–99.
Zehra, S.R.; Malik, M. The Cognitive Cost of Multitasking in High-Stress Professions: Implications for Mental Efficiency, Error Rates, and Long-Term Cognitive Health. Crit. Rev. Soc. Sci. Stud. 2025, 3, 2469–2487.
De Zambotti, M.; Goldstein, C.; Cook, J.; Menghini, L.; Altini, M.; Cheng, P.; Robillard, R. State of the science and recommendations for using wearable technology in sleep and circadian research. Sleep 2024, 47, zsad325.
Damaševičius, R.; Jagatheesaperumal, S.K.; Kala, R.N.; Hussain, S.; Alizadehsani, R.; Gorriz, J.M. Deep learning for personalized health monitoring and prediction: A review. Comput. Intell. 2024, 40, e12682.
Seizer, L. Anticipated stress predicts the cortisol awakening response: An intensive longitudinal pilot study. Biol. Psychol. 2024, 192, 108852.
Teixeira, J.E.; Afonso, P.; Schneider, A.; Branquinho, L.; Maio, E.; Ferraz, R.; Forte, P. Player Tracking Data and Psychophysiological Features Associated with Mental Fatigue in U15, U17, and U19 Male Football Players: A Machine Learning Approach. Appl. Sci. 2025, 15, 3718.
Lemos, C.G.; Garcia, B.F.; Marcelo Filho, S.S.; Arango, J.R.; Butzge, A.J.; Shiotsuki, L.; Hashimoto, D.T. Deep learning approach for genetic selection of stress response in the Amazon fish Colossoma macropomum. Aquaculture 2025, 609, 742848.
Griessmaier, I. Generation of Synthetic Longitudinal Data in Healthcare Using a Dementia Cohort. Ph.D. Thesis, Open Repository of the Universities of Applied Sciences, Germany, Finland, 2022.
Heyat, M.B.B.; Akhtar, F.; Munir, F.; Sultana, A.; Muaad, A.Y.; Gul, I.; Wu, K. Unravelling the complexities of depression with medical intelligence: Exploring the interplay of genetics, hormones, and brain function. Complex Intell. Syst. 2024, 10, 5883–5915.
Huang, Z.A.; Hu, Y.; Liu, R.; Xue, X.; Zhu, Z.; Song, L.; Tan, K.C. Federated multi-task learning for joint diagnosis of multiple mental disorders on MRI scans. IEEE Trans. Biomed. Eng. 2022, 70, 1137–1149.
Arango-Argoty, G.; Bikiel, D.E.; Sun, G.J.; Kipkogei, E.; Smith, K.M.; Pro, S.C.; Jacob, E. AI-driven predictive biomarker discovery with contrastive learning to improve clinical trial outcomes. Cancer Cell 2025, 43, 875–890.
Jaskari, J.; Sahlsten, J.; Damoulas, T.; Knoblauch, J.; Särkkä, S.; Kärkkäinen, L.; Kaski, K.K. Uncertainty-aware deep learning methods for robust diabetic retinopathy classification. IEEE Access 2022, 10, 76669–76681.
Shah, S.; Thanki, R.M.; Diwan, A. Artificial Intelligence for Early Detection and Diagnosis of Cervical Cancer; Springer: Berlin/Heidelberg, Germany, 2024.
Yassin, A.; Al-Zoubi, R.M.; Alzubaidi, R.T.; Kamkoum, H.; Zarour, A.A.; Garada, K.; Al-Ansari, A.A. Testosterone and men’s health: An in-depth exploration of their relationship. UroPrecision 2025, 3, 36–46.
Hogue, C.M.; Fry, M.D.; Fry, A.C.; Wineinger, T.O.; Chamberlin, J.M.; Cabarkapa, D.; Eserhaut, D. Psychoneuroendocrine interactions in response to the motivational climate in a sport setting: An experimental investigation. Psychol. Sport Exerc. 2025, 79, 102849.
Hasan, K.S.; Borsha, M.A.A. Actionable and Interpretable ML-Based Early Warning Systems for Divorce Incorporating Causal Inference and Counterfactuals. Preprints 2025.
Lu, J.K.; Wang, W.; Mahadzir, M.D.A.; Poganik, J.R.; Moqri, M.; Herzog, C.; Maier, A.B. Digital biomarkers of ageing for monitoring physiological systems in community-dwelling adults. Lancet Healthy Longev. 2025, 6, 100725.
Lazarou, E.; Exarchos, T.P. Predicting stress levels using physiological data: Real-time stress prediction models utilizing wearable devices. AIMS Neurosci. 2024, 11, 76.
Harnett, N.G.; Fleming, L.L.; Clancy, K.J.; Ressler, K.J.; Rosso, I.M. Affective visual circuit dysfunction in trauma and stress-related disorders. Biol. Psychiatry 2025, 97, 405–416.
Massiani, P.F.; Fiedler, C.; Haverbeck, L.; Solowjow, F.; Trimpe, S. A kernel conditional two-sample test. arXiv 2025, arXiv:2506.03898.
Nasrin, A.; Qian, L.; Obiomon, P.; Dong, X. Enhancing Learning Path Recommendation via Multitask Learning. arXiv 2025, arXiv:2507.05295.
Akinola, M.; Page-Gould, E.; Mehta, P.H.; Liu, Z. Hormone-Diversity Fit: Collective Testosterone Moderates the Effect of Diversity on Group Performance. Psychol. Sci. 2018, 29, 859–867.
Akinola, M.; Page-Gould, E.; Mehta, P.; Liu, Z. Hormone Diversity Fit [Data Set]. Open Science Framework. 2018. Available online: https://osf.io/8eqtc/ (accessed on 5 January 2025).
Nieman, L.K.; Castinetti, F.; Newell-Price, J.; Valassi, E.; Drouin, J.; Takahashi, Y.; Lacroix, A. Cushing syndrome. Nat. Rev. Dis. Prim. 2025, 11, 4.
Narinx, N.; Nyamaah, J.A.; David, K.; Sommers, V.; Walravens, J.; Fiers, T.; Antonio, L. A survey on measurement and reporting of total testosterone, sex hormone-binding globulin and free testosterone in clinical laboratories in Europe. Clin. Chem. Lab. Med. (CCLM) 2025, 63, 1561–1572.
Abdullah; Hafeez, N.; Sardar, K.; Uroosa, F.; Fatima, Z.; Quintero Téllez, R.; Rodríguez, J.L.O. GrowMore: Adaptive Tablet-Based Intervention for Education and Cognitive Rehabilitation in Children with Mild-to-Moderate Intellectual Disabilities. Computers 2025, 14, 495.
Saylam, B.; İncel, Ö.D. Multitask Learning for Mental Health: Depression, Anxiety, Stress (DAS) Using Wearables. Diagnostics 2024, 14, 501.
Kühnel, L.; Schneider, J.; Perrar, I.; Adams, T.; Moazemi, S.; Prasser, F. Synthetic data generation for a longitudinal cohort study—evaluation, method extension and reproduction of published data analysis results. Sci. Rep. 2024, 14, 14412.
Gubin, D.; Weinert, D.; Stefani, O.; Otsuka, K.; Borisenkov, M.; Cornelissen, G. Wearables in Chronomedicine and Interpretation of Circadian Health. Diagnostics 2025, 15, 327.
Qian, H.; Lee, S. A multidimensional prediction model for overtraining risk in youth soccer players: Integrating physiological and psychological markers. J. Sport. Sci. 2025, 43, 1819–1834.
Akram, M.; Adnan, M.; Ali, S.F.; Ahmad, J.; Yousef, A.; Alshalali, T.A.N.; Shaikh, Z.A. Uncertainty-aware diabetic retinopathy detection using deep learning enhanced by Bayesian approaches. Sci. Rep. 2025, 15, 1342.

Figure 1. Proposed data pipeline and preprocessing.

Figure 2. Synthetic longitudinal data generation with statistical and biological fidelity.

Figure 3. Proposed hybrid unified model architecture.

Figure 4. Performance comparison across models for Stress, Performance, and Hormone Health tasks over 10-fold cross-validation.

Figure 5. Calibration error of the proposed model.

Figure 6. AUC curve for stress class.

Figure 7. AUC curve for performance class.

Figure 8. AUC curve for hormone health class.

Figure 9. Confusion matrix for stress class.

Figure 10. Confusion matrix for performance class.

Figure 11. Confusion matrix for hormone health class.

Figure 12. Performance metrics versus synthetic data proportion.

Figure 13. Selective prediction performance.

Figure 14. Ablation study results.

Figure 15. Inference times and model size of proposed model.

Table 1. Feature summary and counts for the Columbia MBA Hormone Diversity Team Study dataset.

Feature Type	Example Variables	Count	Representation/Range
Demographic (static)	age, gender, ethnicity, country, BMI	5	numeric/categorical (embedded)
Biological (static)	baseline_cortisol_log, baseline_testosterone_log	2	float (µg/dL, log-scaled)
Biological (temporal)	cortisol_series[168], testosterone_series[168]	2 × 168	float arrays (per hour)
Derived temporal features	cortisol_slope_am, t_react_recovery, stress_auc, hormone_entropy	4	float
Behavioral temporal	cash_trajectory[10], rank_trajectory[10]	2 × 10	float arrays
Contextual/team features	team_id, diversity_index, team_performance_score	3	numeric/categorical
Synthetic metadata	augmentation_flag, synthetic_seed, sequence_id	3	int/bool

Table 2. Summary of derived features and their types contributing to the multimodal input space.

Feature Category	Example Variables	Representation/Shape
Biological (static)	baseline_cortisol_log, baseline_testosterone_log	float values (µg/dL, log-scaled)
Biological (temporal)	cortisol_7day_series, testosterone_7day_series	[168 × 1] each (hourly)
Derived temporal features	cortisol_slope_am, t_react_recovery, stress_auc	float values
Behavioral time-series	cash_trajectory, rank_trajectory	vector per round
Contextual/embeddings	diversity_index, country, ethnicity, gender	8-d trainable embeddings/float

Table 3. Summary of original and augmented dataset characteristics.

Dataset Characteristic	Original Dataset	Augmented Dataset
Number of Individuals	370	1110
Number of Teams	74	74 (teams unchanged)
Average Team Size	5 (range 3 to 6)	5 (range 3 to 6)
Classes per Task	3-class (Stress, Performance, Hormone Health)	Same
Class Imbalance	Present; few “Critical” cases	Balanced via augmentation
Synthetic Augmentation Method	None	SMOTE variant + perturbative noise injection
Demographic Diversity Preserved	Original only	Maintained with controlled augmentation

Table 4. Baseline and proposed model hyperparameters and technical details.

Model	Key Hyperparameters/Search Space	Additional Technical Details
Logistic Regression (Baseline)	$C \in \{0.01, 0.1, 1, 10\}$	L2 penalty, multiclass (ovr)
Random Forest (Baseline)	Trees = 100, Max depth $\in \{10, 20, \text{None}\}$, Min samples split = 2	Bootstrap, Gini, parallel training
XGBoost (Baseline)	Learning rate = 0.1, Max depth = 6, Estimators = 100, Subsample = 0.8	Early stopping 10 rounds, objective = multi:softprob
SVM (Baseline)	Kernel = RBF, $C \in \{1, 10\}$, gamma = scale/auto	One-vs-one multiclass, probability calibration
MLP (Baseline)	Layers = 2–3, Neurons = {64,128}, LR = 0.001, Batch = 64	ReLU, Adam, Epochs = 100, Dropout = 0.2
CNN (Baseline)	Filters = {32,64}, Kernel = $3 \times 3$ , Dropout = 0.3	2 Conv + Dense layers, Batch = 32, Epochs = 50
LSTM (Baseline)	Units = 64, Layers = 2, Dropout = 0.3, Batch = 32	Fixed sequence length, Epochs = 50
Transformer (Baseline)	Heads = 4, Layers = 2, Embedding dim = 64, Dropout = 0.3	Adam LR = $3 \times 10^{-4}$, Batch = 16, Epochs = 30
Proposed Hybrid Model	Meta-gating experts = 4, Ensemble size $M = 5$, Dropout = 0.3, Bayesian layers	Temporal branch: 1D CNN (32 filters, kernel = 3) + LSTM (64 units), Fairness loss $\lambda_{\text{fair}} = 0.1$, NAdam optimizer (warmup + cosine decay), Batch = 32, Epochs = 80, Early stopping on composite metric, MC Dropout = 20 passes, Variational Dense prior std = 0.1

Table 5. Ablation study configuration.

Experiment	Base Model	Meta Gating	Ensemble	Bayesian Layer	Fairness Loss
Exp1	YES	NO	NO	NO	NO
Exp2	YES	YES	NO	NO	NO
Exp3	YES	YES	YES	NO	NO
Exp4	YES	YES	YES	YES	NO
Exp5 (Full)	YES	YES	YES	YES	YES

Table 6. Model performance comparison for the Stress, Performance, and Hormone Health (HH) tasks over 10-fold cross-validation.

Model	Stress Accuracy (%)	Performance Accuracy (%)	HH Accuracy (%)	MacroF1 (Stress)	MCC (Performance)	ECE (Hormone Health)
Logistic Regression	85.2 ± 1.2	83.9 ± 1.4	82.7 ± 1.3	0.84 ± 0.02	0.80 ± 0.03	0.15 ± 0.01
Random Forest	89.7 ± 1.0	88.4 ± 1.2	87.9 ± 1.1	0.89 ± 0.02	0.86 ± 0.02	0.12 ± 0.01
XGBoost	91.0 ± 0.9	90.1 ± 1.1	89.5 ± 1.0	0.90 ± 0.02	0.88 ± 0.02	0.10 ± 0.01
SVM	90.5 ± 1.1	89.7 ± 1.3	88.9 ± 1.2	0.90 ± 0.02	0.87 ± 0.03	0.11 ± 0.01
MLP	92.3 ± 0.8	91.5 ± 0.9	90.9 ± 0.9	0.92 ± 0.01	0.90 ± 0.01	0.08 ± 0.01
CNN	93.7 ± 0.7	93.1 ± 0.8	92.4 ± 0.8	0.93 ± 0.01	0.92 ± 0.01	0.07 ± 0.01
LSTM	94.2 ± 0.7	93.8 ± 0.7	93.0 ± 0.7	0.94 ± 0.01	0.93 ± 0.01	0.06 ± 0.01
Transformer	95.0 ± 0.6	94.7 ± 0.6	94.2 ± 0.6	0.95 ± 0.01	0.94 ± 0.01	0.05 ± 0.01
Proposed Hybrid Model	99.99 ± 0.01	99.99 ± 0.01	99.99 ± 0.01	0.999 ± 0.0001	0.999 ± 0.0001	0.001 ± 0.0001
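
For readers unfamiliar with the ECE column in Table 6, the sketch below shows one common way to compute expected calibration error with equal-width confidence bins. The bin count, the half-open bin boundaries, and the toy inputs are assumptions for illustration; the binning scheme used to produce the reported values may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bins are [lo, hi); a confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Toy data (hypothetical): 90% confidence with 9 of 10 correct is well calibrated.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])

# Toy data (hypothetical): 95% confidence with only 8 of 10 correct is overconfident.
overconfident = expected_calibration_error([0.95] * 10, [True] * 8 + [False] * 2)

print(f"calibrated ECE = {calibrated:.3f}, overconfident ECE = {overconfident:.3f}")
```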

Table 7. Per-class performance metrics for Stress detection.

Class	Precision	Recall	F1-Score
Optimal	0.9999 ± 0.0001	0.9998 ± 0.0001	0.9998 ± 0.0001
Moderate	0.9998 ± 0.0001	0.9999 ± 0.0001	0.9998 ± 0.0001
Critical	0.9998 ± 0.0001	0.9998 ± 0.0001	0.9999 ± 0.0001

Table 8. Per-class performance metrics for Performance detection.

Class	Precision	Recall	F1-Score
Low	0.9998 ± 0.0001	0.9999 ± 0.0001	0.9998 ± 0.0001
Medium	0.9998 ± 0.0001	0.9998 ± 0.0001	0.9998 ± 0.0001
High	0.9999 ± 0.0001	0.9999 ± 0.0001	0.9999 ± 0.0001

Table 9. Per-class performance metrics for Hormone Health detection.

Class	Precision	Recall	F1-Score
Low	0.9999 ± 0.0001	0.9999 ± 0.0001	0.9999 ± 0.0001
Normal	0.9998 ± 0.0001	0.9999 ± 0.0001	0.9999 ± 0.0001
High	0.9998 ± 0.0001	0.9998 ± 0.0001	0.9998 ± 0.0001

Table 10. Synthetic data proportion with performance metrics.

Synthetic Proportion	Stress MacroF1	Performance MacroF1	Hormone MacroF1	Held-Out Real MacroF1
0%	0.94	0.93	0.92	0.94
25%	0.96	0.95	0.94	0.95
50%	0.98	0.97	0.96	0.96
75%	0.99	0.99	0.98	0.97
100%	0.999	0.999	0.999	0.97

Table 11. Evaluation performance across test partitions: (i) real-only held-out test, (ii) mixed augmented training with real test, and (iii) synthetic-only test. All values represent mean ± standard deviation over ten stratified folds.

Test Partition	Stress MacroF1	Performance MacroF1	Hormone MacroF1	ECE (Hormone)	Critical Stress Recall
Real-only held-out (10%)	0.982 ± 0.004	0.978 ± 0.005	0.975 ± 0.005	0.004 ± 0.001	0.985 ± 0.004
Mixed (augmented train + real test)	0.999 ± 0.0002	0.999 ± 0.0002	0.999 ± 0.0002	0.001 ± 0.0001	0.9998 ± 0.0001
Synthetic-only test	0.988 ± 0.003	0.986 ± 0.004	0.983 ± 0.004	0.003 ± 0.001	0.989 ± 0.003

Table 12. Statistical validation metrics for synthetic versus real hormone trajectories.

Metric	Cortisol	Testosterone	Interpretation	Threshold
KS Statistic	≤0.07 ( $p > 0.05$ )	≤0.07 ( $p > 0.05$ )	Distribution similarity	$p > 0.05$
1-Wasserstein Distance	<0.15	<0.15	Slope/recovery proximity	<0.2
MMD	<0.05	<0.05	Kernel divergence	<0.1
DTW	$0.21 \pm 0.04$	$0.21 \pm 0.04$	Temporal shape similarity	Lower is better
Biological Constraint Compliance	>97%		Valid physiological ranges	>95%

Table 13. Group-wise fairness evaluation for stress prediction (mean ± std over 10 stratified folds). Counts sum to the full dataset (N = 370).

Group	N	Accuracy (%)	MacroF1	Critical Recall
Gender
Male	190	99.99 ± 0.01	0.9992 ± 0.0003	0.9996 ± 0.0003
Female	160	99.98 ± 0.01	0.9990 ± 0.0003	0.9995 ± 0.0003
Other/Prefer not to say	20	99.97 ± 0.02	0.9988 ± 0.0004	0.9994 ± 0.0004
Ethnicity
White/Caucasian	150	99.99 ± 0.01	0.9992 ± 0.0003	0.9997 ± 0.0002
Asian	110	99.98 ± 0.01	0.9990 ± 0.0003	0.9995 ± 0.0003
Hispanic/Latino	70	99.97 ± 0.01	0.9989 ± 0.0003	0.9994 ± 0.0003
Black/African American	40	99.96 ± 0.02	0.9987 ± 0.0004	0.9993 ± 0.0004
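
One simple way to summarize Table 13 is the largest accuracy gap between groups within each attribute. The sketch below uses the mean accuracies and group sizes reported in the table; the helper name `max_gap` and the use of a max-minus-min gap as the summary are assumptions for illustration, not a fairness metric defined by the authors.

```python
# Mean accuracies (%) and group sizes copied from Table 13.
group_accuracy = {
    "Gender": {
        "Male": 99.99,
        "Female": 99.98,
        "Other/Prefer not to say": 99.97,
    },
    "Ethnicity": {
        "White/Caucasian": 99.99,
        "Asian": 99.98,
        "Hispanic/Latino": 99.97,
        "Black/African American": 99.96,
    },
}
group_counts = {
    "Gender": [190, 160, 20],
    "Ethnicity": [150, 110, 70, 40],
}


def max_gap(scores):
    """Largest difference between any two group scores (percentage points)."""
    return max(scores.values()) - min(scores.values())


# Round to two decimals to match the precision reported in the table.
gaps = {attr: round(max_gap(scores), 2) for attr, scores in group_accuracy.items()}
totals = {attr: sum(counts) for attr, counts in group_counts.items()}
```

With the reported means, the gap is 0.02 percentage points across gender groups and 0.03 across ethnicity groups, and both sets of counts sum to N = 370. The smallest groups (N = 20 and N = 40) carry the widest reported standard deviations.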

Table 14. Impact of the temporal branch on model performance.

Model Variant	MacroF1 (Stress)	MCC (Performance)	ECE (Hormone Health)	Std. Dev. (Accuracy)
Without Temporal Branch	0.974	0.949	0.005	±0.015%
Full Model (with Temporal Branch)	0.999	0.999	0.001	±0.01%

Table 15. Selective prediction performance showing trade-off between abstention rate, coverage, and effective accuracy.

Abstention Rate (%)	Coverage (%)	Effective Accuracy (%)
0.0	100.0	99.9937
0.5	99.5	99.9992
1.0	99.0	99.9996
2.0	98.0	99.9998
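
The trade-off in Table 15 follows the usual selective prediction recipe: defer the least confident cases and score only the remainder. The following is a minimal sketch of that recipe, assuming a generic per-sample confidence score; the function name and toy inputs are hypothetical, and how the study derives confidence from its uncertainty estimates is not reproduced here.

```python
def selective_accuracy(confidences, correct, abstention_rate):
    """Abstain on the least-confident fraction, then score the retained cases.

    Returns (coverage, effective_accuracy) as fractions in [0, 1].
    """
    n = len(confidences)
    n_abstain = int(n * abstention_rate)  # floor of the number of deferred cases
    # Indices ordered from least to most confident.
    order = sorted(range(n), key=lambda i: confidences[i])
    kept = order[n_abstain:]
    if not kept:
        raise ValueError("abstention_rate leaves no samples to evaluate")
    coverage = len(kept) / n
    accuracy = sum(1 for i in kept if correct[i]) / len(kept)
    return coverage, accuracy


# Hypothetical toy predictions: the single error is also the least confident case.
toy_conf = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.60, 0.55]
toy_correct = [True] * 9 + [False]
full = selective_accuracy(toy_conf, toy_correct, 0.0)
selective = selective_accuracy(toy_conf, toy_correct, 0.1)
```

Effective accuracy rises with abstention only when errors concentrate among low-confidence predictions, so the gains in Table 15 also reflect how well confidence ranks the errors.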

Table 16. Quantitative interpretability metrics for the hybrid ensemble across tasks.

Metric	Stress Prediction	Performance Prediction	Hormone Health
SHAP Sparsity (features with $|\mathrm{SHAP}| > 0.01$)	$7.4 \pm 2.1$	$7.9 \pm 1.8$	$7.5 \pm 1.6$
Fidelity Drop ($\Delta P_{\text{true}}$, top-2 removed)	$0.19 \pm 0.04$	$0.17 \pm 0.05$	$0.18 \pm 0.05$
Counterfactual Plausibility ($L_{2}$ norm)	$0.35 \pm 0.08$	$0.38 \pm 0.09$	$0.37 \pm 0.10$
Expert Usefulness Rating (1–5)	$4.2 \pm 0.4$	$4.3 \pm 0.4$	$4.4 \pm 0.3$
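
The first two rows of Table 16 reduce to simple per-sample quantities that are then averaged. The sketch below shows one plausible reading, assuming a precomputed attribution vector and true-class probabilities before and after removing the top-ranked features; the function names and all numbers are hypothetical and do not come from the study.

```python
def shap_sparsity(shap_values, threshold=0.01):
    """Number of features whose absolute attribution exceeds the threshold."""
    return sum(1 for value in shap_values if abs(value) > threshold)


def fidelity_drop(p_true_full, p_true_ablated):
    """Drop in true-class probability after removing the top-ranked features."""
    return p_true_full - p_true_ablated


# Hypothetical attribution vector for one sample (not data from the study).
toy_attributions = [0.30, -0.12, 0.004, 0.02, -0.009, 0.0]
toy_sparsity = shap_sparsity(toy_attributions)

# Hypothetical true-class probabilities before and after ablation.
toy_drop = fidelity_drop(0.97, 0.78)
```

A lower sparsity count means fewer features carry the explanation, while a larger fidelity drop means the top-ranked features genuinely drive the prediction.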

Table 17. Component-wise ablation results demonstrating the impact of individual modules on model performance.

Experiment ID	MacroF1 (Stress)	MCC (Performance)	ECE (Hormone Health)	Key Insight
Base MLP	0.83	0.79	0.12	Baseline
+MetaGating	0.89	0.85	0.10	Meta-gating improves modularity
+Ensemble	0.95	0.92	0.07	Ensemble enhances stability
+Bayesian Layers	0.97	0.95	0.04	Improved uncertainty estimation
Full Model	0.999	0.999	0.001	Full model gives the best scores on all three metrics

Table 18. Impact of reverse inference gating on Hormone Health prediction.

Setting	Hormone MacroF1	High-Confidence Subset MacroF1	Low-Confidence Subset MacroF1
Ungated	0.94	0.95	0.88
Gated (ours)	0.999	0.999	0.92

Table 19. Comparative analysis of related studies and proposed framework.

Study (Year)	Dual-Hormone Focus	Multimodal Fusion	Multitask	Reverse Inference	Longitudinal Hormone
2024 [45]	NO	YES	YES	NO	NO
2024 [36]	NO	YES	NO	NO	NO
2024 [46]	NO	NO	NO	NO	YES
2024 [22]	NO	NO	NO	NO	NO
2025 [47]	NO	YES	NO	NO	NO
2025 [48]	NO	YES	NO	NO	NO
2025 [29]	NO	YES	YES	NO	NO
2025 [49]	NO	YES	NO	NO	NO
Our Framework	YES	YES	YES	YES	YES

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Abdullah; Fatima, Z.; Sánchez Mejorada, C.G.; Ather, M.A.; Oropeza Rodríguez, J.L.; Sidorov, G. Fair and Explainable Multitask Deep Learning on Synthetic Endocrine Trajectories for Real-Time Prediction of Stress, Performance, and Neuroendocrine States. Computers 2025, 14, 515. https://doi.org/10.3390/computers14120515

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
