1. Introduction
Psychological stress [
1,
2] is a pervasive driver of morbidity and diminished quality of life, shaping cognition, decision making, and physiological homeostasis [
3]. As consumer wearables and clinical-grade sensors proliferate [
4], there is growing interest in unobtrusive, continuous assessment that can surface early warnings and support timely self-regulation or clinical triage [
5]. Physiological streams such as electrodermal activity (EDA) as a proxy for sympathetic arousal; cardiac measures including heart rate (HR), heart rate variability (HRV), and electrocardiography (ECG); peripheral temperature (TEMP); photoplethysmography (PPG); and tri-axial accelerometry (ACC) provide complementary views on autonomic reactivity and behavioral context [
6]. Yet, these signals are noisy, nonstationary, and person-dependent; they drift with ambient conditions, are confounded by posture and movement, and reflect complex biopsychosocial mechanisms [
7]. The central challenge is to translate heterogeneous, artifact-prone time series into reliable stress estimates that remain faithful under everyday conditions rather than only in controlled laboratory settings [
8,
9,
10].
Data handling and labeling are pivotal in this translation. Real-world datasets often mix high-frequency physiological streams with irregular events, weak proxies of stress (for example, task epochs or self-reports), and heterogeneous taxonomies that vary across studies [
11]. A principled pipeline typically begins with synchronized acquisition and basic hygiene (artifact mitigation, detrending, outlier handling), followed by normalization across sessions or individuals to reduce inter-subject variability. Windowing choices must respect physiological time scales and downstream compute budgets since the temporal granularity of segments constrains both what dynamics can be captured and the latency of any real-time system. Equally important is split discipline: train, validation, and test partitions should be established before any window generation to prevent identity or temporal leakage, and subjectwise protocols are essential when the goal is generalization to unseen users [
12]. Class imbalance is the rule rather than the exception, motivating resampling strategies and balanced optimization, which prevent minority classes from being swamped during training [
13].
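The ordering described above (partition by subject first, generate windows second) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the helper names `split_subjects` and `make_windows`, the six-channel synthetic recordings, the 50% window overlap, and the majority-vote window label are all hypothetical choices.

```python
import numpy as np

def split_subjects(subject_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Assign whole subjects to train/val/test so no identity crosses partitions."""
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(subject_ids))
    n_test = max(1, int(round(test_frac * len(subjects))))
    n_val = max(1, int(round(val_frac * len(subjects))))
    test = {int(s) for s in subjects[:n_test]}
    val = {int(s) for s in subjects[n_test:n_test + n_val]}
    train = {int(s) for s in subjects[n_test + n_val:]}
    return train, val, test

def make_windows(signal, labels, win=32, hop=16):
    """Cut one recording of shape (T, channels) into fixed windows.

    The window label is the majority sample label (an assumed convention).
    The recording must contain at least `win` samples.
    """
    xs, ys = [], []
    for start in range(0, len(signal) - win + 1, hop):
        xs.append(signal[start:start + win])
        ys.append(np.bincount(labels[start:start + win]).argmax())
    return np.stack(xs), np.array(ys)

# Synthetic stand-in data: 15 subjects, 320 samples each, 6 channels (assumed).
rng = np.random.default_rng(1)
recordings = {s: (rng.normal(size=(320, 6)), rng.integers(0, 3, size=320))
              for s in range(15)}

# 1) Partition subjects, 2) only then window each partition separately.
train_s, val_s, test_s = split_subjects(np.array(list(recordings)))
partitions = {}
for name, subjects in (("train", train_s), ("val", val_s), ("test", test_s)):
    parts = [make_windows(*recordings[s]) for s in sorted(subjects)]
    partitions[name] = (np.concatenate([p[0] for p in parts]),
                        np.concatenate([p[1] for p in parts]))
```

Because overlapping windows are generated only after the split, two windows that share samples can never land in different partitions.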
Modeling choices span a spectrum from feature-based classical learning to end-to-end deep architectures. Hand-crafted descriptors (for example, tonic–phasic EDA decomposition, HRV features from inter-beat intervals, and spectral power in physiologically meaningful bands) [
14] remain attractive for their interpretability and modest resource usage, especially when paired with linear models or tree ensembles. Deep approaches leverage 1D convolutions for local temporal patterns, recurrent or temporal-convolutional networks for longer dependencies, and attention mechanisms for context-sensitive weighting across channels and time [
15]. Regardless of the paradigm, deployment on microcontroller-class hardware imposes tight constraints on latency, memory, and energy. This drives interest in model compression (structured pruning, low-rank factorization, weight clustering, knowledge distillation) [
16,
17], quantization-aware and post-training quantization [
18], and kernel-level co-design that exploits device-specific SIMD or DSP instructions. The aim is not only to achieve strong accuracy but to do so within budgets that enable always-on operation without offloading raw bio-signals [
19].
Evaluation therefore must extend beyond headline accuracy. Sound methodology distinguishes between subject-dependent and subject-independent settings [
20], probes robustness under distribution shifts (for example, day-to-day variation, sensor repositioning, or activity contexts) [
21], and reports confusion patterns that reveal asymmetric errors across stress strata. Calibration metrics such as expected calibration error are increasingly relevant when predictions inform behavioral nudges or clinical review; selective prediction or abstention mechanisms can further reduce harm by allowing models to defer when uncertainty is high [
22]. For streaming, embedded use, end-to-end measurements that include sensor I/O, buffering, on-device preprocessing, and inference are needed to quantify true latency and energy per decision, not just isolated forward-pass timings [
23]. Finally, privacy, security, and fairness are first-class concerns: on-device inference limits data exposure; population-level learning can adopt federated or continual strategies with privacy-preserving updates; and audits should check for biases across demographics, occupations, and sensor contexts [
24]. Taken together, these considerations define a rigorous pathway from raw wearable data to trustworthy, resource-aware stress monitoring that can operate quietly, locally, and respectfully in daily life.
In this paper, we present a Micro-Attention CNN Hybrid tailored to real-time stress detection from a compact physiological window, combining efficient 1D and depthwise-separable convolutions with a lightweight micro-attention block that selectively emphasizes clinically informative segments. The network culminates in a compact classifier for globally pooled features, designed for sub-millisecond inference and kilobyte-scale memory on TinyML hardware. We pair this architecture with a leakage-aware preprocessing and evaluation pipeline and report comprehensive results, including calibration and boundary analyses, to illustrate readiness for embedded deployment.
The main contributions of this research are:
Kilobyte-scale TinyML with micro-attention. We introduce a Micro-Attention CNN Hybrid explicitly co-designed for edge deployment that achieves an 88% size reduction down to 1.76 KB with 0.40 ms CPU inference, bringing three-class stress recognition into the sub-2 KB regime, far smaller than prior on-device systems that typically occupy hundreds of kilobytes to megabytes. This substantially tightens the memory/latency envelope for real-time stress detection on microcontrollers.
Selective QAT that preserves attention/BN in full precision. We propose a selective quantization-aware training strategy that quantizes only convolutional and dense layers while intentionally keeping BatchNorm and the micro-attention block in FP32. This preserves the model’s capacity to focus on subtle, boundary-case physiology while still delivering the majority of the latency/size gains—a practical recipe for compressing attention-equipped TinyML models.
Biologically grounded, ultra-short windowing. We justify and use a compact window at 32 Hz (, EDA, HR, TEMP) that matches the time scale of acute stress responses yet remains compute-light, and we detail the end-to-end micro-attention hybrid that exploits this representation with only ∼3.8 K parameters. This balances physiological fidelity with TinyML constraints.
Leakage-aware resampling with neighborhood diagnostics. Beyond standard class rebalancing, we introduce a neighborhood density check to verify that synthetic samples densify plausible regions without collapsing local structure, and we enforce leakage-safe splits before windowing. This provides a principled guardrail against distributional artifacts in highly imbalanced stress data.
Compression robustness and boundary analysis. We systematically characterize how pruning → selective QAT → PTQ affects classwise errors and show that residual mistakes concentrate at the low stress boundary while high global performance is preserved after full compression. This offers rare, fine-grained evidence about ambiguity under aggressive model shrinking in stress recognition.
Deployment-oriented reporting. We report both accuracy and device-realistic metrics (model size and CPU latency) under each compression stage, enabling reproducible trade-off decisions for embedded practitioners targeting strict RAM/latency budgets.
The remainder of this paper is organized as follows:
Section 2 reviews related works on TinyML-based stress recognition and positions our study within the field.
Section 3 presents the proposed methodology, including our data pipeline, model design, and compression strategy.
Section 4 reports the experimental results and analyses, covering evaluation protocols, baseline findings, compression outcomes, and ablation insights, as well as a discussion of limitations.
Finally, Section 5 concludes this paper and outlines directions for future work.
2. Related Works
The deployment of intelligent stress detection systems on resource-constrained edge devices represents a significant advancement in personalized healthcare. This section reviews several pioneering studies that have successfully implemented TinyML models for real-time stress classification, showcasing a diverse range of approaches. These include variations in the physiological signals monitored, the machine learning architectures employed, and the specific hardware platforms targeted.
To begin, Rachakonda et al. [
25] suggested a DNN-integrated edge device for real-time stress level detection within the Internet of Medical Things (IoMT). Their methodology centers on a Deep Neural Network (DNN) deployed on an edge device (a wearable wristband) that processes data from three physiological sensors: temperature, humidity (for sweat), and an accelerometer (for motion). The model was trained and tested using a combined dataset of 26,000 samples, built from real-life datasets like Human Motion Primitives (HMP) and the PAMAP2 Physical Activity Monitoring dataset, with sensor value ranges defined for stress classification. The system achieved a high accuracy between 98.3% and 99.7% across different tests, successfully validating the concept of on-device, real-time stress detection.
Building upon this concept with a focus on data security, Rachakonda et al. [
26] proposed a blockchain-integrated privacy-assured IoMT framework for stress management that analyzes sleeping habits to predict next-day stress levels. Their methodology employs a Fully Connected Neural Network (FCN) deployed on an edge device (a smart pillow) to process physiological data such as heart rate, respiration, snoring, and body temperature during sleep, with all analyzed data securely stored and managed via a private Ethereum blockchain. The model was trained and tested using 15,000 samples from the National Sleep Research Resource (NSRR) dataset. The system achieved a high accuracy of 96% for stress prediction and successfully demonstrated a secure, privacy-preserving data storage and access mechanism using blockchain technology.
In parallel, adopting a strategy to enhance reliability, Gibbs et al. [
27] presented a multimodal context-aware stress recognition system that combines two separate TinyML models on a single, resource-constrained microcontroller (Arduino Nano 33 BLE Sense) to improve reliability by mitigating motion artifacts. Their methodology employs two 1D Convolutional Neural Networks (CNNs): one for Human Activity Recognition (HAR) using accelerometer data to classify users as “active” or “resting”, and a second for stress detection using heart rate and electrodermal activity data that is only triggered during periods of inactivity identified by the first model. The models were trained on the public WISDM dataset for activity recognition and a lab-collected dataset using the Montreal Imaging Stress Task for stress, and they were optimized for deployment using post-training quantization (int8 and float16). The results showed that the system successfully ran on the device, with the HAR and stress models achieving 98% and 88% accuracy, respectively, validating a novel, privacy-preserving approach to real-time, context-aware stress detection.
Shifting the focus to a different physiological modality, Mai et al. [
28] introduced an on-chip mental stress detection system that integrates a wearable behind-the-ear (BTE) EEG device with an embedded tiny Convolutional Neural Network (CNN). Their methodology involves capturing single-channel BTE EEG signals, performing on-chip noise removal and Fast Fourier Transform (FFT)-based signal-to-spectrogram conversion, and classifying stress levels using a compact, quantized CNN model deployed on a microcontroller. The dataset was collected from 15 participants undergoing stress-inducing tasks (Stroop and Mental Arithmetic), resulting in spectrogram images labeled as stress or non-stress. The system achieved high performance, with 95.32% accuracy using 10-fold cross-validation and 91.72% using leave-one-out cross-validation, while maintaining low power consumption and real-time processing capabilities.
Likewise, concentrating on another popular physiological signal, Rostami et al. [
29] developed a real-time stress detection system using a Long Short-Term Memory (LSTM) deep learning model optimized for deployment on resource-constrained microcontrollers via TinyML. Their methodology involved training an LSTM network directly on raw, unprocessed photoplethysmography (PPG) signals from the WESAD dataset, using sliding window segmentation, and then applying model compression techniques, including pruning and post-training quantization, to reduce its size and memory requirements. The optimized model achieved an accuracy of 87.76% on the test set while requiring only 170 KB of RAM, enabling efficient real-time inference on low-power STM32 microcontrollers.
Finally, demonstrating the effectiveness of traditional ML models on ultra-constrained hardware, Abu Samah et al. [
19] designed a TinyML-based stress classification system using a multi-sensor wearable device built around a Raspberry Pi Pico RP2040 microcontroller. Their methodology involved training and comparing several traditional machine learning models on a public dataset of nurses’ physiological data, including acceleration, body temperature, heart rate, and electrodermal activity, using hyperparameter tuning and NearMiss undersampling to handle class imbalance. The optimized XGBoost model achieved 86.0% accuracy for three-class stress detection (no stress, low stress, and high stress) and was successfully deployed on the resource-constrained edge device, occupying only 1.12 MB of flash memory.
Compared with existing TinyML stress detectors, our research advances the state of the art along four practical axes for embedded deployment. (1) Kilobyte-scale model and sub-millisecond latency: Through a staged pipeline (structured pruning → selective QAT → PTQ), our final artifact operates at 1.76 KB with 0.40 ms CPU latency while retaining 98.03% Macro-F1, tightening the memory/latency envelope substantially versus prior on-device systems that either do not report such ultra-small footprints or remain in the hundreds-of-kilobytes range. (2) Micro-attention under selective quantization: We integrate a lightweight self-attention block and explicitly preserve attention and batch-normalization layers in FP32 during QAT to protect boundary sensitivity while still reaping most compression gains, an implementation detail seldom surfaced in prior TinyML stress works. (3) Biologically grounded, ultra-short windowing: The model consumes a (≈1 s at 32 Hz) multimodal window (, EDA, HR, TEMP) that captures acute stress dynamics while staying compute-light for MCUs. (4) Deployment-oriented validation knobs: Beyond accuracy, we analyze compression-stage effects on near-boundary errors (notably in low stress) and show that boundary-aware calibration can recover F1 and push ECE below 0.5% without changing the footprint, offering neutral “safety dials” missing in most embedded reports.
Table 1 situates our system relative to TinyML stress detectors spanning DNN/FCN/CNN/LSTM/Boosted-tree approaches and multiple sensing stacks. In short, (a) prior MCU deployments rarely report both latency and footprint at the kilobyte scale; (b) several emphasize privacy frameworks or context gating but stop short of sub-ms inference or boundary calibration; and (c) RAM/flash budgets are often ≫100 KB, whereas our design demonstrates 1.76 KB with maintained three-class performance.
4. Experimental Results and Analyses
This section presents a comprehensive evaluation of the proposed Micro-Attention CNN Hybrid Architecture and its optimized variants. We systematically examine the baseline model’s performance, analyze the impact of our multi-stage compression pipeline, and validate the final compressed model’s efficacy through rigorous empirical assessment. The experimental analysis encompasses training dynamics, computational efficiency metrics, and classification performance across all stress categories, providing a holistic view of the model’s capabilities before and after optimization for edge deployment.
4.1. Experimental Setup and Evaluation Metrics
This subsection describes the experimental configuration, data partitioning, implementation settings, and evaluation metrics used to assess the proposed stress detection framework in a consistent and reproducible manner.
4.1.1. Implementation Details
The proposed Micro-Attention CNN Hybrid Architecture was implemented using TensorFlow 2.12 with Keras API, TensorFlow Model Optimization, and Keras Tuner, providing a stable and well-documented framework for model development and compression. All experiments were conducted on a local workstation featuring an Intel Core i7 processor and 8 GB of RAM.
During the training phase, the Adam optimizer was employed with default parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-7}$, the Keras defaults) due to its adaptive learning rate properties and efficient convergence characteristics. The model was trained using sparse categorical cross-entropy as the loss function, mathematically defined as $\mathcal{L} = -\log(\hat{p}_{y})$, where $y$ represents the true class label and $\hat{p}_{y}$ denotes the predicted probability for that class. This loss function is well suited to the multi-class stress classification task with integer-encoded labels.
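The loss described above selects the predicted probability of the true class and takes its negative logarithm, averaged over a batch. The following NumPy sketch illustrates that computation; the function name, the clipping constant `eps`, and the toy probabilities are illustrative assumptions and do not reproduce the framework's internal implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean of -log(p[y]) over a batch; y_true holds integer class labels."""
    probs = np.clip(probs, eps, 1.0)                  # guard against log(0)
    picked = probs[np.arange(len(y_true)), y_true]    # probability of the true class
    return float(-np.log(picked).mean())

# Toy batch: three samples, three classes (no / low / high stress).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
y_true = np.array([0, 1, 2])
loss = sparse_categorical_crossentropy(y_true, probs)
```

Integer labels index the probability matrix directly, which is why no one-hot encoding is needed for this loss.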
To prevent overfitting and optimize training efficiency, early stopping was implemented, monitoring the validation loss for signs of deterioration. This callback mechanism terminated training when no improvement was observed for a predefined number of consecutive epochs, thereby conserving computational resources while ensuring the model reached its optimal performance state. The patience was set to 15 epochs for the baseline training, 10 epochs for the fine-tuning phase after pruning, and 8 epochs for the fine-tuning after the selective QAT stage, reflecting the progressively shorter retraining needs. The initial learning rate was set to 0.001 for the baseline training and primary fine-tuning phases, with a reduction to 0.0001 applied during the selective QAT stage to facilitate gentle adaptation to the quantized representations. A batch size of 32 was used throughout all experimental phases, striking an effective balance between computational efficiency and gradient estimation stability.
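The patience rule can be made concrete with a small framework-free sketch. It assumes the common convention that training stops once the monitored validation loss has failed to improve for `patience` consecutive epochs; the function name and the loss history are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a validation-loss history."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never exhausted: training runs to the last epoch.
    return len(val_losses), best_epoch

history = [0.9, 0.7, 0.6, 0.61, 0.62, 0.65, 0.63]
stop, best = early_stopping_epoch(history, patience=3)
```

A smaller patience, such as the values used for the fine-tuning stages, simply ends the run sooner after the last improvement.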
A full leave-one-subject-out validation was not adopted in this study because the available dataset contains only 15 subjects, with marked subject-level imbalance and unequal coverage of the three stress classes across participants. Under such conditions, a strict leave-one-subject-out protocol would produce highly variable folds, and some test folds would not reflect the full difficulty of the three-class problem in a statistically stable way. For this reason, we instead used a leakage-aware evaluation pipeline with strict separation between training, validation, and test data, together with repeated-seed experiments and additional robustness analyses, in order to reduce the risk of optimistic bias while maintaining sufficient data per split for stable optimization and meaningful comparison. Nevertheless, we acknowledge that a stronger subject-independent protocol, such as leave-one-subject-out or grouped subjectwise cross-validation on a larger cohort, would further strengthen the generalization claims and remains an important direction for future work.
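The concern about unstable leave-one-subject-out folds can be checked before committing to a protocol by counting how many test windows of each class every held-out subject would contribute. The sketch below is an assumption-laden illustration with toy subject identifiers and labels; `loso_folds` and `fold_class_coverage` are hypothetical helper names.

```python
import numpy as np

def loso_folds(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) for leave-one-subject-out."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test_mask = subject_ids == s
        yield s, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def fold_class_coverage(subject_ids, labels, n_classes=3):
    """For each held-out subject, count the test windows available per class."""
    labels = np.asarray(labels)
    return {int(s): np.bincount(labels[test_idx], minlength=n_classes)
            for s, _, test_idx in loso_folds(subject_ids)}

# Toy example: subject 1 only ever shows class 0, so its fold cannot
# exercise the three-class problem.
subject_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels      = [0, 1, 2, 0, 0, 0, 1, 2, 2]
coverage = fold_class_coverage(subject_ids, labels)
```

Folds whose coverage vector contains zeros are the ones that would yield undefined or highly variable per-class metrics.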
4.1.2. Evaluation Metrics
To ensure a comprehensive and unbiased assessment of the model’s classification performance across all stress levels, we used a standard set of evaluation metrics derived from the confusion matrix. These metrics provide complementary perspectives on the model’s strengths and weaknesses. The foundational terms required for their calculation are defined first, followed by the key metrics used for both per-class and global performance analysis.
- (1)
Key Terms
$TP_i$ (true positives for class i): Number of instances correctly predicted as belonging to class i.
$FP_i$ (false positives for class i): Number of instances incorrectly predicted as belonging to class i (actually belong to other classes).
$FN_i$ (false negatives for class i): Number of instances that actually belong to class i but were incorrectly predicted as other classes.
$TN_i$ (true negatives for class i): Number of instances correctly predicted as not belonging to class i (belong to other classes).
C: Number of classes.
- (2)
Accuracy
Accuracy serves as the fundamental metric for assessing the overall effectiveness of a classification model. It represents the proportion of total predictions that the model classified correctly, encompassing all classes. Calculated as the sum of true positives and true negatives divided by the total number of samples, it provides a high-level overview of model performance.
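$$\text{Accuracy}_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$$
$$\text{Accuracy}_{\text{macro}} = \frac{1}{C}\sum_{i=1}^{C} \text{Accuracy}_i$$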
- (3)
Precision
Precision, also referred to as Positive Predictive Value, is a critical metric that measures the reliability of a model’s positive predictions for a specific class. It is defined as the ratio of true positive predictions to the total number of instances the model labeled as that class (i.e., true positives plus false positives).
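$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}$$
$$\text{Precision}_{\text{macro}} = \frac{1}{C}\sum_{i=1}^{C} \text{Precision}_i$$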
- (4)
Recall
Recall measures a model’s ability to correctly identify all relevant instances of a given class. It is calculated as the number of true positives divided by the total number of actual positives for that class (i.e., true positives plus false negatives).
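$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$
$$\text{Recall}_{\text{macro}} = \frac{1}{C}\sum_{i=1}^{C} \text{Recall}_i$$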
- (5)
F1-Score
The F1-score is the harmonic mean of precision and recall, providing a single metric that balances the trade-off between these two often-competing measures. While a high precision might come at the cost of a lower recall and vice-versa, the F1-score seeks a balance, making it especially useful for evaluating performance on imbalanced datasets. It gives equal weight to both false positives and false negatives, offering a more robust view of a model’s accuracy than looking at precision or recall in isolation.
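$$\text{F1}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$
$$\text{F1}_{\text{macro}} = \frac{1}{C}\sum_{i=1}^{C} \text{F1}_i$$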
- (6)
Error Rate
The error rate is the direct complement to accuracy, quantifying the overall proportion of incorrect predictions made by the model. It is calculated as the sum of all false positives and false negatives divided by the total number of samples. While accuracy tells you how often the model is right, the error rate tells you how often it is wrong. Analyzing the error rate, both globally and on a per-class basis, helps to quickly identify the scale of misclassification.
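$$\text{ErrorRate}_i = \frac{FP_i + FN_i}{TP_i + TN_i + FP_i + FN_i} = 1 - \text{Accuracy}_i$$
$$\text{ErrorRate}_{\text{macro}} = \frac{1}{C}\sum_{i=1}^{C} \text{ErrorRate}_i$$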
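The metric definitions in this subsection can all be derived from a single confusion matrix. The NumPy sketch below is one possible implementation; the function name `per_class_metrics`, the toy matrix, and the use of an unweighted mean over classes as the global value are assumptions made for illustration.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class metrics from a confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j.
    Assumes every class is predicted at least once (no zero denominators).
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as class i but belonging elsewhere
    fn = cm.sum(axis=1) - tp      # belonging to class i but predicted elsewhere
    tn = total - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "error_rate": (fp + fn) / total,
    }

# Toy three-class matrix (rows: true no / low / high stress).
cm = [[50, 2, 0],
      [3, 44, 3],
      [0, 1, 49]]
per_class = per_class_metrics(cm)
macro = {name: float(values.mean()) for name, values in per_class.items()}
```

In this toy matrix the middle class has the lowest recall because it can be confused with both neighbors, which mirrors the boundary effect discussed for the low stress class.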
4.2. Baseline Model Performance
This subsection first presents the performance of the uncompressed baseline model, providing the main reference point for analyzing the effect of the proposed architecture and the subsequent compression stages.
4.2.1. Training Dynamics and Convergence
The baseline model learns quickly: training accuracy climbs steeply in the first epochs and reaches nearly 100% (see
Figure 4), while validation accuracy also improves, though more gradually, and plateaus slightly below the training curve for much of the run before approaching it toward the end. Correspondingly, training loss falls steadily on a log scale to extremely small values, and validation loss also decreases overall but stays higher than training loss and exhibits noticeably more fluctuations in the middle-to-late epochs. Taken together, these curves show strong fitting of the training data with a modest gap between training and validation performance but overall convergence and fairly good generalization by the final epochs.
This behavior is consistent with the use of dropout and batch normalization as regularization techniques. Dropout randomly deactivates portions of the network during training, which discourages over-reliance on specific neurons and encourages robustness, while batch normalization mitigates internal covariate shift, stabilizing learning and improving convergence. Together, these methods contribute to the model’s stable learning dynamics and its ability to maintain strong performance on unseen data.
4.2.2. Comprehensive Performance Analysis
The baseline model demonstrates an exceptional classification capability across all stress categories, as evidenced by both the confusion matrix (see
Figure 5) and comprehensive performance metrics (see
Table 5). Quantitative analysis reveals outstanding global performance with 99.63% accuracy, 99.63% precision, 99.63% recall, and 99.63% F1-score, indicating near-perfect balance between detection sensitivity and prediction reliability.
The confusion matrix exhibits strong diagonal dominance, with correct classifications overwhelmingly concentrated along the main diagonal. Specifically, the no stress and high stress categories both maintain 99.68–99.69% accuracy, while the low stress class reaches 99.52% accuracy. This consistently high performance across all three stress levels confirms the model’s robust capability to distinguish between different physiological stress manifestations.
Error analysis reveals minimal misclassification patterns, with only marginal confusion observed between adjacent stress categories. The confusion matrix shows negligible cross-category misclassification, particularly between low stress and its neighboring classes, indicating effective capture of the subtle physiological distinctions that differentiate stress levels. The extremely low error rates of 0.31% for no stress and high stress and 0.48% for low stress further validate the model’s precision in stress state identification.
These results collectively demonstrate that the proposed architecture successfully learns discriminative features from multimodal physiological signals, establishing a strong baseline for subsequent compression stages while maintaining exceptional classification performance across all stress categories.
4.2.3. ROC Analysis and Calibration Assessment
To provide a more complete evaluation of the proposed stress detector, we report two complementary diagnostics in addition to the scalar metrics already discussed: multiclass ROC curves and calibration plots. The ROC analysis is computed in a one-versus-rest manner for the three stress classes and is summarized both visually and through classwise and macro-averaged AUC values. Calibration is assessed using reliability diagrams together with the expected calibration error (ECE) and Brier score, which indicate whether predicted probabilities remain trustworthy after model compression.
Figure 6 and
Figure 7 complement the aggregate performance metrics by showing discrimination and probability reliability. In particular, the ROC curves indicate how well each stress class can be separated from the others across decision thresholds, while the calibration plots verify whether the posterior probabilities remain well aligned with empirical correctness. These diagnostics are especially important in stress detection, where not only the final class label but also the reliability of the prediction can influence downstream monitoring and intervention decisions.
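The two calibration summaries named above can be computed from predicted class probabilities alone. The sketch below assumes the common top-label form of ECE with ten equal-width confidence bins and the multiclass Brier score defined as the mean squared distance to the one-hot label; the binning choice and the toy probabilities are illustrative assumptions and may differ from the settings behind the reported figures.

```python
import numpy as np

def expected_calibration_error(probs, y_true, n_bins=10):
    """Top-label ECE: bin-size-weighted mean of |accuracy - confidence|."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap       # in_bin.mean() = fraction of samples in bin
    return float(ece)

def brier_score(probs, y_true):
    """Multiclass Brier score: mean squared distance to the one-hot label."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(y_true)]
    return float(((probs - onehot) ** 2).sum(axis=1).mean())

# Toy predictions for four windows; the second one is confidently wrong.
probs = [[0.95, 0.03, 0.02],
         [0.65, 0.30, 0.05],
         [0.20, 0.75, 0.05],
         [0.05, 0.10, 0.85]]
y_true = [0, 1, 1, 2]
ece = expected_calibration_error(probs, y_true)
brier = brier_score(probs, y_true)
```

Both scores are zero for a perfectly confident and perfectly correct classifier, so lower values indicate more trustworthy probabilities.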
4.3. Model Compression Pipeline Results
This subsection reports the results obtained at each stage of the compression pipeline, highlighting the trade-offs among classification performance, model size, and inference latency.
4.3.1. Structured Pruning Impact Analysis
- (1)
Training Acceleration
Faster training is one of the most significant benefits of pruning, visible as a markedly faster convergence rate. Empirically, the pruned model reached the performance threshold of 99% training accuracy in just 37 epochs, compared with the 90 epochs required by the original model, which corresponds to roughly 59% fewer epochs to reach that threshold. This accelerated convergence dynamic is clearly visible in the learning curves (see
Figure 8); the pruned model’s training accuracy curve exhibits a much steeper ascent, hitting the 99% mark substantially earlier, while its training loss curve plummets toward zero much more rapidly. This confirms a more efficient and stable optimization process.
- (2)
Compression Efficiency and Computational Benefits
The pruning process achieved substantial compression, eliminating 70% of the model’s parameters while maintaining 98% of the original accuracy. This reduction, evidenced by the parameter count dropping from 3827 to just 1148, as shown in Table 6, transformed the model from a dense to a highly sparse architecture. The pruning algorithm identified and removed redundant parameters while preserving the critical connections necessary for performance. This parameter reduction directly translated into computational benefits: the model size shrank from 14.95 KB to 4.5 KB, and the reduction in FLOPs nearly halved inference time, from 1.2 ms to 0.65 ms on CPU.
- (3)
Performance Trade-offs
The pruning process successfully maintained a high degree of structural preservation, as evidenced by the robust global accuracy of 98.27% and strong per-class F1-scores all exceeding 97.9% (see
Table 7). This indicates that the model’s core architectural integrity and its ability to distinguish between all three stress levels remained largely intact despite the aggressive parameter reduction. However, a detailed analysis reveals a key trade-off: a slight but notable increase in classification errors. This is most pronounced in the low stress class, which exhibits the highest error rate (2.12%) and the lowest recall (97.89%). The confusion matrix (see
Figure 9) confirms this, showing that a significant portion of the model’s mistakes occur when actual low stress instances are misclassified as either no stress or high stress. This suggests that the intermediate nature of the low stress class makes it more vulnerable to the slight feature representation loss incurred by pruning, leading to a concentration of residual errors at the class boundaries.
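To make the idea of structured pruning concrete, the sketch below removes whole convolutional filters rather than individual weights. The ranking criterion (L1 norm of each filter), the 30% keep ratio applied to a single layer, and the helper name `prune_filters` are assumptions chosen for illustration; they are not claimed to be the criterion or schedule used in this study.

```python
import numpy as np

def prune_filters(kernel, bias, keep_ratio):
    """Keep the conv filters with the largest L1 norm.

    kernel: array of shape (kernel_size, in_channels, out_channels),
            the layout used by Keras Conv1D layers.
    Returns the reduced kernel, the reduced bias, and the kept filter indices.
    """
    n_out = kernel.shape[-1]
    n_keep = max(1, int(round(keep_ratio * n_out)))
    scores = np.abs(kernel).sum(axis=(0, 1))        # one L1 score per output filter
    keep = np.sort(np.argsort(scores)[-n_keep:])    # strongest filters, original order
    return kernel[..., keep], bias[keep], keep

# Hypothetical layer: kernel size 3, 6 input channels, 16 filters.
rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 6, 16))
bias = np.zeros(16)
small_kernel, small_bias, kept = prune_filters(kernel, bias, keep_ratio=0.3)
```

In a full network, the input-channel axis of the following layer would have to be sliced with the same `kept` indices so that the shapes remain consistent; because whole filters disappear, the saving shows up directly in parameter count and latency.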
4.3.2. Selective Quantization-Aware Training Impact Analysis
- (1)
Selective Quantization Strategy Effectiveness
The selective quantization strategy focused on applying quantization only to the convolutional and dense layers, while keeping the batch normalization and micro-attention layers in full precision (FP32). This design aimed to preserve numerical stability and mitigate accuracy degradation in layers that are highly sensitive to precision loss. During fine-tuning, the model underwent 30 epochs of training with a reduced learning rate of 0.0001 to enable gradual adaptation to the quantized parameters. As shown in
Figure 10, the training and validation curves demonstrate consistent improvement throughout the epochs. The training accuracy quickly converged toward 99%, while the validation accuracy stabilized around 96%, indicating strong generalization performance. Meanwhile, both training and validation losses decreased steadily, reaching values below
by the final epochs, confirming effective convergence without significant overfitting.
The application of the selective QAT further enhanced the computational benefits without altering the parameter count, compressing the model size by an additional 60% down to just 1.8 KB while achieving a further 34% speed improvement to a 0.43 ms inference time (see
Table 8). Cumulatively, the fully optimized model represents an 88% reduction in model size and a 64% acceleration in inference speed compared to the original baseline model.
Numerical stability at the final compressed size of 1.76 kilobytes is maintained through a selective quantization strategy rather than by forcing every component into the same low-precision regime. In particular, convolutional and dense layers are quantized because they account for most of the memory and arithmetic cost, whereas the micro-attention module and normalization-related computations are kept at a higher precision during quantization-aware training because they are more sensitive to small numerical perturbations. This design prevents unstable attention weights, activation collapse, and boundary distortions near the low stress class. In addition, quantization parameters are learned or calibrated on representative training data so that dynamic ranges remain well matched to the observed signal amplitudes, reducing saturation and rounding error. The staged pipeline of structured pruning, selective quantization-aware training, and final post-training quantization also contributes to stability by allowing the model to adapt gradually to reduced precision instead of undergoing a single abrupt conversion. The fact that the final compressed model retains 98.03% Macro-F1 and remains well calibrated indicates that the extreme reduction in size does not result from numerically fragile compression but from preserving precision only where it is most important for reliable decision making.
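The layer-selection rule described above can be illustrated with a minimal sketch. It assumes symmetric per-tensor 8-bit fake quantization and uses hypothetical layer names, tags, and weight values; it shows only the forward rounding step, not the gradient handling of quantization-aware training, and it is not the implementation used in this work.

```python
from typing import Dict, List, Tuple

# Assumed layer tags: only these kinds are quantized; BatchNorm and attention stay in FP32.
QUANTIZED_KINDS = {"conv", "dense"}


def fake_quantize(weights: List[float], bits: int = 8) -> List[float]:
    """Symmetric per-tensor fake quantization: snap to the integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def selective_fake_quantize(
    layers: Dict[str, Tuple[str, List[float]]], bits: int = 8
) -> Dict[str, Tuple[str, List[float]]]:
    """Quantize conv/dense weights and pass every other layer through unchanged."""
    out = {}
    for name, (kind, weights) in layers.items():
        if kind in QUANTIZED_KINDS:
            out[name] = (kind, fake_quantize(weights, bits))
        else:
            out[name] = (kind, list(weights))
    return out


# Hypothetical toy model: names and values are illustrative only.
model = {
    "conv1": ("conv", [0.50, -0.25, 0.123, 0.0]),
    "bn1": ("batchnorm", [1.0, 0.001]),
    "attn": ("attention", [0.3337, -0.7771]),
    "fc": ("dense", [1.0, -0.3333]),
}
quantized = selective_fake_quantize(model)
```

In this sketch the rounding error of a quantized layer is bounded by half of its quantization step, while the normalization and attention parameters are reproduced exactly.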
- (2)
Performance Trade-offs
The implementation of selective QAT successfully maintained the model’s fundamental structural integrity, as evidenced by its preserved global accuracy of 98.05% (as shown in
Table 9) and consistent high performance across all classes. This indicates that the quantization process did not compromise the model’s core architectural ability to perform the classification task. However, a discernible performance trade-off emerged in the form of a slight but measurable increase in overall error, with the global error rate rising to 1.95%. This degradation is primarily concentrated in the low stress class, which exhibits the highest error rate (2.21%) and the lowest F1-score (97.71%) among all categories. Analysis of the confusion matrix (see
Figure 11) confirms this vulnerability, showing a significant number of “low stress” instances being misclassified as both “no stress” and “high stress.” This pattern suggests that the intermediate nature of the low stress class makes its boundaries more susceptible to the precision loss inherent in quantization, leading to a concentration of errors. Despite this, the model’s performance remains robust, confirming selective QAT as a valid strategy for optimization, albeit with a recognized trade-off in the precision of classifying ambiguous stress levels.
- (3)
Which computation contributes most to accuracy, and why?
The main accuracy gain in the proposed architecture comes from the micro-attention computation applied after the convolutional feature extractor, rather than from merely increasing the classifier capacity. The initial 1D and depthwise-separable convolutions are important because they efficiently capture short-range temporal patterns and cross-channel interactions in acceleration, electrodermal activity, heart rate, and skin temperature. However, these convolutional features alone do not fully explain the high classification performance. The most discriminative improvement arises when the micro-attention block reweights the temporal representation so that stress-salient segments receive higher importance than less informative or noisy intervals. This is particularly beneficial for the difficult boundary between low stress and neighboring classes, where subtle physiological changes may be diluted by simple averaging or by purely convolutional processing. In other words, the convolutional layers provide the local descriptors, but the attention mechanism determines where the model should focus within the short signal window, which leads to more accurate class separation. This interpretation is also consistent with our compression strategy: we deliberately preserve the attention module and normalization layers in full precision during selective quantization-aware training because these components are more sensitive to small numerical distortions and contribute disproportionately to the final predictive accuracy.
- (4)
Interpreting the very high Macro-F1 values.
Although the proposed model achieves very high Macro-F1 values, these results should be interpreted within the scope of the present dataset and protocol rather than as evidence that wearable stress recognition is solved under all practical conditions. The evaluation is conducted on a curated public dataset with fixed sensing channels, predefined labels, and a controlled leakage-aware pipeline, which helps reduce some of the variability that is typically encountered in unconstrained deployment. In real-world embedded biomedical systems, performance can be affected by environmental variation, sensor placement, motion artifacts, long-term drift, and hardware-level reliability constraints. This broader challenge has also been emphasized in recent embedded biomedical research, which shows that maintaining high performance outside controlled evaluation settings remains difficult and depends not only on the classifier but also on the long-term stability and reliability of the embedded acquisition chain [
33]. Accordingly, the results reported in this paper should be understood as strong performance under a carefully controlled wearable-sensing benchmark, while further real-world validation remains an important direction for future work.
4.3.3. Final Compressed Model Performance
- (1)
Size and Computational Benefits
The complete compression pipeline achieves remarkable efficiency gains, culminating in a highly optimized model suitable for resource-constrained environments. The final model, processed through pruning, selective QAT, and PTQ, demonstrates exceptional size reduction, compressing from an original 14.95 KB down to just 1.76 KB (see
Table 10), representing an 88% reduction in model size. This compression is accompanied by a 70% reduction in parameters (from 3827 to 1148) and a substantial inference speedup: CPU inference time falls by 67%, from 1.2 ms to 0.4 ms. Most notably, the final model size of 1.76 KB fits within the stringent <2 KB memory budget targeted for microcontroller deployment. This combination of minimal memory footprint and accelerated inference makes the compressed model well suited for edge deployment scenarios, where both storage and computational resources are severely limited, while maintaining the architectural integrity necessary for effective stress classification.
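The reported reductions follow directly from the before/after values quoted above. The short check below recomputes them; the helper name `percent_reduction` is ours and is introduced only for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


size_reduction = percent_reduction(14.95, 1.76)    # model size in KB, about 88.2
param_reduction = percent_reduction(3827, 1148)    # parameter count, about 70.0
latency_reduction = percent_reduction(1.2, 0.4)    # CPU inference time in ms, about 66.7
compression_factor = 14.95 / 1.76                  # about 8.5x smaller
```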
- (2)
Performance Trade-offs
The application of PTQ to the already pruned and selective QAT-optimized model completes the compression pipeline while maintaining largely consistent performance characteristics. The final compressed model achieves a global accuracy of 98.03%, as shown in
Table 11, with balanced precision and recall across all classes, demonstrating that the aggressive compression strategy successfully preserves the model’s fundamental classification capabilities. However, the performance analysis reveals a persistent trade-off pattern concentrated at the low stress level, which exhibits the highest error rate (2.25%) and lowest F1-score (97.68%) among all classes. The confusion matrix (see
Figure 12) confirms this vulnerability, showing significant misclassification between low stress and both adjacent categories, with 220 instances incorrectly classified as no stress and 205 as high stress. This consistent pattern across compression stages suggests that the intermediate nature of the low stress category makes it inherently more susceptible to the cumulative effects of precision loss from both pruning and quantization. Despite this class-specific sensitivity, the final model maintains robust overall performance with less than a 2% global error rate, validating the compression approach as effective for deployment while acknowledging the predictable trade-off in classifying ambiguous stress states.
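For readers who wish to reproduce the classwise quantities discussed here, the following sketch derives precision, recall, and F1 from a confusion matrix. The matrix shown is a hypothetical toy example (rows are true classes, columns are predicted classes), not the matrix of Figure 12.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """Precision, recall, and F1 per class; cm[i][j] counts true class i predicted as j."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[r][k] for r in range(n)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def macro_f1(cm: List[List[int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = [m["f1"] for m in per_class_metrics(cm)]
    return sum(scores) / len(scores)


# Hypothetical counts for classes (no stress, low stress, high stress).
toy_cm = [
    [98, 2, 0],
    [3, 94, 3],
    [0, 2, 98],
]
metrics = per_class_metrics(toy_cm)
overall = macro_f1(toy_cm)
```

In the toy matrix the intermediate class loses samples to both neighbors, so it has the lowest recall and F1, which mirrors the qualitative pattern described in the text.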
- (3)
Energy and Resource Efficiency
The final compressed model demonstrates strong resource efficiency, making it well suited for always-on edge deployment. With a minimal model size of 1.76 KB, it readily fits within the embedded flash memory constraints of microcontrollers, eliminating the need for external storage components. The optimized architecture achieves a CPU inference time of 0.4 ms, well below the 10 ms threshold required for real-time processing on low-power ARM Cortex-M4 processors. This computational efficiency is expected to translate into reduced power consumption, which would enable extended operation on battery-powered devices and make the model practical for continuous stress monitoring in resource-constrained environments; on-device energy profiling is left to future work (Section 4.6.3).
4.4. Ablation Studies
We conduct a suite of focused ablations to isolate how each design choice in the compression-and-deployment pipeline influences discrimination, calibration, latency, and memory, grounding the analysis in the reported baseline and compressed model results while holding the data preprocessing and training protocol fixed. Across ablations A1–A7, we refer to the baseline metrics (Accuracy/Precision/Recall/F1, model size 14.95 KB, CPU latency 1.2 ms) and to the progressively compressed variants (pruned, pruned + QAT, pruned + QAT + PTQ); the pruned + QAT and pruned + QAT + PTQ variants reach 98.05% and 98.03% global accuracy at 1.8 KB and 1.76 KB with latencies of 0.43 ms and 0.4 ms, respectively, and the pruned-model values are those reported in the pruning analysis. The ablations retain the same data normalization, stratified splits, and temporal reshaping pipeline described earlier, such that observed differences arise from compression choices alone.
4.4.1. A1. Stagewise Compression: Baseline → Pruning → QAT → PTQ
We ablate the contribution of each compression stage to discrimination, calibration, memory footprint, and latency, tracing an end-to-end path from the full-capacity baseline to the final deployment-ready model (see Table 12 and Figure 13). Starting from the baseline parameters, we apply the pruning, QAT, and PTQ transforms in sequence, so that each stage operates on the output of the previous one, and evaluate each partial model. This ablation shows a controlled and monotone trade-off consistent with the paper’s results: pruning delivers most of the size/latency gains with a small reduction in accuracy (starting from the 14.95 KB/1.2 ms baseline); selective QAT preserves structure while compounding efficiency (98.05%, 1.8 KB, 0.43 ms); and PTQ provides the final step to the sub-2 KB footprint with negligible additional loss (98.03%, 1.76 KB, 0.4 ms). The calibration signal (ECE) stays low and can be further tightened via temperature scaling, aligning with the observation that the residual gap at the end of the pipeline is dominated by calibration rather than discrimination.
The stagewise path achieves 98.03% Macro-F1 at 1.76 KB and 0.4 ms, i.e., an 88% size reduction and a 67% reduction in inference time compared to the baseline; temperature scaling further reduces ECE on the final artifact.
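Because ECE and temperature scaling recur throughout the ablations, a minimal sketch of both is given here. It assumes equal-width confidence bins and a simple grid search over the temperature that minimizes the negative log-likelihood; the logits and labels are hypothetical, and in practice the temperature would be fitted on held-out validation data rather than on the evaluation samples.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    z = [v / temperature for v in logits]
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def expected_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        correct = 1.0 if p.index(conf) == y else 0.0
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / len(labels) * abs(avg_conf - acc)
    return ece


def nll(logits_list, labels, temperature: float) -> float:
    losses = [-math.log(softmax(l, temperature)[y]) for l, y in zip(logits_list, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logits_list, labels, grid: Optional[List[float]] = None) -> float:
    """Pick the temperature on a grid (0.5 to 5.0) that minimizes the NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logits_list, labels, t))


# Hypothetical overconfident model: three correct predictions, one confident mistake.
logits = [[6.0, 0.0, 0.0], [0.0, 6.0, 0.0], [0.0, 0.0, 6.0], [6.0, 0.0, 0.0]]
labels = [0, 1, 2, 1]

ece_before = expected_calibration_error([softmax(l) for l in logits], labels)
T = fit_temperature(logits, labels)
ece_after = expected_calibration_error([softmax(l, T) for l in logits], labels)
```

Dividing the logits by a single scalar leaves the predicted class unchanged, so this adjustment affects calibration but not accuracy, size, or latency.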
4.4.2. A2. Pruning Depth and Early Convergence
We vary the final sparsity under the cubic schedule used in the pipeline and track both convergence speed and generalization, using the epoch at which a fixed training-accuracy target is reached to quantify acceleration (see Table 13 and Figure 14). Empirically, this epoch count drops from 90 (baseline training dynamics) to 37 at approximately 70% parameter removal (the pruned configuration reported in the paper), matching the steep ascent of training accuracy and the test accuracy observed after pruning. The accuracy–sparsity curve shows a gentle slope at lower sparsity levels and a sharper decline thereafter, consistent with redundancy being excised first in peripheral filters before mid-depth capacity is affected, as follows:
The pruned configuration (approximately 70% sparsity) aligns with the measured convergence acceleration (from 90 to 37 epochs) while keeping the Macro-F1 reduction small; subsequent quantization preserves this balance while driving memory and latency toward the deployment targets.
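To make the notion of a cubic sparsity schedule concrete, the sketch below uses the commonly adopted polynomial form s(t) = s_f + (s_i - s_f)(1 - t/T)^3. The exact schedule and step counts used in the pipeline are not restated here, so the formula, the ten-step horizon, and the function name are assumptions made for illustration.

```python
from typing import List


def cubic_sparsity(step: int, total_steps: int, final_sparsity: float,
                   initial_sparsity: float = 0.0) -> float:
    """Target sparsity at `step` under a cubic (degree-3 polynomial) ramp."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


# Hypothetical ramp to 70% sparsity over 10 pruning steps.
schedule: List[float] = [cubic_sparsity(t, 10, 0.70) for t in range(11)]
```

Under this form most of the pruning happens early (half-way through, the target is already at 0.6125 of the weights), and the final steps remove very little, which is one reason such schedules tend to give smooth late-epoch behavior.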
4.4.3. A3. Pruning Schedule and Thresholding Policy
We compare polynomial (degree 2 vs. 3), cosine, and step schedules, and examine per-layer versus global thresholding. The degree-3 and cosine schedules provide smoother late-epoch behavior, slightly lower ECE, and similar accuracy at fixed sparsity, while per-layer quantiles avoid brittle global thresholds that can over-prune layers with naturally smaller weights. These effects are visible without changing parameter counts or latency, confirming that schedule and thresholding primarily shape the optimization trajectory and calibration rather than raw capacity, as follows:
Per-layer thresholding with a cubic schedule balances stability and plasticity, reproducing the convergence acceleration observed in A2 and yielding slightly better ECE than alternative policies at the same sparsity without compromising the parameter and latency budgets (see Table 14 and Figure 15).
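The difference between global and per-layer thresholding can be seen on a toy example. The sketch below uses element-wise magnitude pruning purely for illustration (the pipeline itself relies on structured pruning), and the layer names and weights are hypothetical.

```python
from typing import Dict, List


def global_mask(layers: Dict[str, List[float]], sparsity: float) -> Dict[str, List[bool]]:
    """One magnitude threshold shared by all layers; True marks a kept weight."""
    mags = sorted(abs(w) for ws in layers.values() for w in ws)
    k = int(sparsity * len(mags))  # number of weights to remove overall
    threshold = mags[k - 1] if k > 0 else -1.0
    return {name: [abs(w) > threshold for w in ws] for name, ws in layers.items()}


def per_layer_mask(layers: Dict[str, List[float]], sparsity: float) -> Dict[str, List[bool]]:
    """A separate quantile threshold for each layer."""
    masks = {}
    for name, ws in layers.items():
        mags = sorted(abs(w) for w in ws)
        k = int(sparsity * len(mags))  # number of weights to remove in this layer
        threshold = mags[k - 1] if k > 0 else -1.0
        masks[name] = [abs(w) > threshold for w in ws]
    return masks


# Hypothetical layers: "narrow" has naturally small weights.
toy_layers = {
    "wide": [0.9, -0.8, 0.7, -0.6],
    "narrow": [0.04, -0.03, 0.02, -0.01],
}
g = global_mask(toy_layers, 0.5)
p = per_layer_mask(toy_layers, 0.5)
```

At 50% sparsity the global threshold removes every weight of the small-magnitude layer and none of the other, whereas the per-layer quantile keeps half of each, which is the brittleness referred to above.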
4.4.4. A4. Selective QAT: Layer Coverage and Stability
We assess which layers to quantize during QAT by comparing coverage patterns at fixed epochs and a learning rate of 0.0001. Keeping BatchNorm and micro-attention in FP32 while quantizing convolutional and dense layers achieves the best stability–efficiency trade-off: training remains smooth, calibration improves relative to quantizing fewer blocks, and accuracy matches the reported 98.05%. Quantizing BN or micro-attention increases loss variance and worsens ECE, reflecting the sensitivity of normalization and attention scaling to quantization noise as follows:
Selective QAT on convolutional and dense layers reproduces the reported Macro-F1 and the 1.8 KB footprint, while temperature scaling gives the best ECE among coverage patterns; the instability surge when quantizing BN or micro-attention motivates keeping those layers in FP32 (see Table 15 and Figure 16).
4.4.5. A5. PTQ Calibration Set Size and Scheme
We study PTQ sensitivity to calibration size C and affine quantizer granularity. Per-channel calibration consistently improves ECE at a fixed footprint, with a larger C further reducing bias between float and quantized activations. These gains materialize without changes to parameter count or latency, indicating that careful calibration is a near-free lever for better-calibrated probabilities at the same accuracy and memory budget, as follows:
The per-channel PTQ at the selected calibration size forms a Pareto point: the final 1.76 KB model attains 98.03% Macro-F1 and the best calibration among PTQ variants, and temperature scaling tightens ECE further without affecting speed or size (see Table 16 and Figure 17).
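The benefit of per-channel granularity can be sketched with a simple affine (scale and zero-point) quantizer calibrated by min/max range. The calibration rule, the two hypothetical channels, and their values are assumptions for illustration and do not reproduce the calibration procedure of the paper.

```python
from typing import List, Tuple


def affine_params(values: List[float], bits: int = 8) -> Tuple[float, int]:
    """Min/max calibration of an unsigned affine quantizer whose range includes zero."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quant_dequant(x: float, scale: float, zero_point: int, bits: int = 8) -> float:
    q = round(x / scale) + zero_point
    q = max(0, min(2 ** bits - 1, q))  # clip to the representable integer range
    return (q - zero_point) * scale


def mean_abs_error(channels: List[List[float]], per_channel: bool, bits: int = 8) -> float:
    """Mean reconstruction error with one quantizer per channel or one shared quantizer."""
    shared = affine_params([v for ch in channels for v in ch], bits)
    total, count = 0.0, 0
    for ch in channels:
        scale, zp = affine_params(ch, bits) if per_channel else shared
        for v in ch:
            total += abs(v - quant_dequant(v, scale, zp, bits))
            count += 1
    return total / count


# Hypothetical calibration samples: one wide-range and one narrow-range channel.
calibration = [
    [-2.0, -1.0, 0.5, 2.0],
    [0.001, 0.002, -0.0015, 0.0005],
]
err_tensor = mean_abs_error(calibration, per_channel=False)
err_channel = mean_abs_error(calibration, per_channel=True)
```

With a shared range the narrow channel collapses to zero after rounding, whereas a per-channel range preserves it; a larger calibration set mainly improves the estimate of each channel's range.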
4.4.6. A6. Bit-Width and Mixed-Precision Profiles
We evaluate uniform and mixed-precision configurations around the selective QAT/PTQ design. Eight-bit quantization preserves the delicate low stress boundary while meeting the sub-2 KB target, six-bit begins to erode recall, and four-bit is too aggressive unless limited to late dense layers. These trends underline that quantization noise is primarily expressed as boundary thickening for the intermediate class, visible as slight drops in low stress F1, as follows:
The selective 8-bit configuration satisfies the sub-2 KB constraint while preserving classwise balance; mixed precision with reduced precision in the dense layers can trade a small low stress penalty for additional byte savings if needed, but uniform 8-bit remains the safest deployment point (see Table 17 and Figure 18).
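The relative coarseness of the bit widths compared here can be quantified with a short calculation. The sketch assumes a symmetric signed quantizer over a unit weight range, which is an illustrative choice rather than the exact quantizer configuration of the paper.

```python
from typing import Dict


def quantization_step(max_abs: float, bits: int) -> float:
    """Step size of a symmetric signed quantizer covering [-max_abs, max_abs]."""
    return max_abs / (2 ** (bits - 1) - 1)


# Step size for a unit weight range at the bit widths discussed above.
steps: Dict[int, float] = {b: quantization_step(1.0, b) for b in (8, 6, 4)}
worst_case_error = {b: s / 2 for b, s in steps.items()}  # rounding error bound per weight
```

Under these assumptions the four-bit grid is roughly 18 times coarser than the eight-bit grid (127/7), which is consistent with the observation that aggressive bit widths first affect the narrow margins around the intermediate class.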
4.4.7. A7. Classwise Robustness: Low Stress Boundary Treatments
We examine simple boundary-aware treatments applied after the final PTQ model to counter the consistent concentration of errors in the low stress category reported in the paper’s confusion analyses. Soft terminal anchoring (with a step tolerance), mild label smoothing, and focal loss during a brief fine-tune leave parameters, size, and latency unchanged, yet slightly improve low stress discrimination and calibration by smoothing decision boundaries where quantization thickens margins, as follows:
The boundary-aware adjustments recover part of the lost low stress F1 and lower ECE further without affecting memory or latency, providing deployment-neutral knobs to sharpen ambiguous transitions while preserving the compact footprint and the 98.03% Macro-F1 performance of the final compressed model (see Table 18 and Figure 19).
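Two of the treatments named above, label smoothing and focal loss, have standard definitions that can be sketched briefly. The smoothing factor, the focusing parameter, and the example probabilities are hypothetical values chosen for illustration, not the settings used in the experiments.

```python
import math
from typing import List, Sequence


def smooth_labels(label: int, n_classes: int, eps: float = 0.1) -> List[float]:
    """Soft target: 1 - eps on the true class plus eps spread uniformly over all classes."""
    off = eps / n_classes
    return [1.0 - eps + off if k == label else off for k in range(n_classes)]


def focal_loss(probs: Sequence[float], label: int, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by (1 - p_true)^gamma, down-weighting easy samples."""
    p_true = probs[label]
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Hypothetical predictions for a true "low stress" sample (class index 1).
easy = [0.05, 0.90, 0.05]      # confidently correct
boundary = [0.30, 0.40, 0.30]  # near the class boundary

soft_target = smooth_labels(1, 3)
loss_easy = focal_loss(easy, 1)
loss_boundary = focal_loss(boundary, 1)
```

With gamma set to zero the focal loss reduces to ordinary cross-entropy; larger values shift the training signal toward near-boundary samples such as the second example.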
4.4.8. A8. Ablation Study: Impact of the Micro-Attention Module
To quantify the contribution of the micro-attention module, we compared the full proposed architecture against two reduced variants under the same preprocessing pipeline, train/validation/test split, optimizer settings, and evaluation protocol. The first variant removes the attention block entirely and directly applies global average pooling after the depthwise-separable convolutional stage. The second variant replaces the learned micro-attention mechanism with uniform temporal averaging, which preserves the same overall pipeline but removes adaptive temporal reweighting. This experiment allows us to isolate whether the performance gain comes from the additional parameters alone or from the attention-based selection of stress-salient temporal regions.
The results in
Table 19 show that the micro-attention module provides a consistent and meaningful gain over both reduced variants. Compared with the model without attention, the full architecture improves both accuracy and Macro-F1 while also reducing the expected calibration error. The gain is especially visible for the low stress class, whose F1 score increases when attention is enabled, indicating that the attention mechanism is particularly useful near the most ambiguous class boundary. This behavior supports the intended role of the module: the convolutional layers extract local temporal patterns, whereas the micro-attention block adaptively emphasizes the most informative stress-related segments and suppresses less relevant fluctuations. Importantly, this improvement is obtained with only a very small latency increase, confirming that the attention block contributes disproportionately to classification quality relative to its computational cost.
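The contrast between learned temporal reweighting and uniform averaging can be sketched as follows. The feature values and attention scores are hypothetical, and the scoring function that produces the scores in the actual micro-attention block is not modeled here; only the pooling step is shown.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [v / total for v in e]


def attention_pool(features: List[List[float]],
                   scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Weighted sum over time steps; features is T x D and scores has length T."""
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


def uniform_pool(features: List[List[float]]) -> List[float]:
    """Plain temporal averaging, i.e., the uniform-weight variant."""
    steps = len(features)
    return [sum(f[d] for f in features) / steps for d in range(len(features[0]))]


# Hypothetical window: three time steps, two feature channels; the last step is salient.
window = [[0.0, 1.0], [1.0, 0.0], [10.0, 5.0]]

flat_pooled, flat_weights = attention_pool(window, [0.0, 0.0, 0.0])
focused_pooled, focused_weights = attention_pool(window, [0.0, 0.0, 4.0])
averaged = uniform_pool(window)
```

When all scores are equal, attention pooling coincides with uniform averaging; when one step receives a higher score, its features dominate the pooled representation instead of being diluted by the average.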
4.4.9. A9. Ablation: Verifying That SMOTE Does Not Artificially Inflate Performance
To verify that the use of SMOTE does not artificially inflate the reported performance, we compared four training strategies under the same train/validation/test split and evaluated all models on the untouched test set with its natural class distribution: (1) no rebalancing, (2) class-weighted loss only, (3) random undersampling of the majority class only, and (4) the proposed hybrid strategy that combines majority undersampling with SMOTE applied only to the training split. In addition, we monitored the neighborhood density diagnostic
introduced in
Section 3.2 to confirm that synthetic minority samples densify plausible local regions rather than creating unrealistic clusters. This protocol ensures that any performance gain cannot come from leakage into validation or test data, and instead reflects whether the training distribution is made more learnable without distorting the true evaluation distribution.
The results in
Table 20 indicate that the performance improvement is not an artifact of SMOTE. First, all gains are measured on an untouched test set, so synthetic samples are never seen during validation or testing. Second, the hybrid rebalancing strategy improves not only overall accuracy but also Macro-F1 and, in particular, the F1 score of the low stress class, which is the most difficult decision boundary in this dataset. If SMOTE were merely inflating performance through unrealistic synthetic duplication, we would expect unstable gains, degraded calibration, or clear signs of neighborhood collapse. Instead, the observed
neighborhood density diagnostic values remain small and positive, indicating that synthetic points densify minority regions without distorting their local structure.
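The core interpolation step of SMOTE, applied to the training split only, can be sketched as below. This is a simplified stand-in with hypothetical two-dimensional samples and a fixed random seed; it omits the majority undersampling step and the neighborhood density diagnostic, and it is not the implementation used in the experiments.

```python
import random
from typing import List


def smote(minority: List[List[float]], n_new: int, k: int = 3, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic points by interpolating toward one of the k nearest neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical minority-class samples taken from the TRAINING split only.
train_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
test_minority = [[0.5, 0.5]]  # held out: never passed to smote()

new_points = smote(train_minority, n_new=5)
augmented_train = train_minority + new_points
```

Each synthetic point lies on a segment between two real training samples, so validation and test data are never touched and retain their natural class distribution.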
4.4.10. A10. Robustness to Environmental Variations, Sensor Noise, and Sensor Drift
To make the source of robustness more explicit, we evaluated the proposed system under controlled perturbation settings that emulate realistic wearable deployment conditions. We considered four regimes: Clean, Noise, Drift, and Combined. The Noise setting injects moderate channelwise Gaussian perturbations together with sparse motion-like spikes after normalization. The Drift setting applies slow per-channel bias and gain changes to emulate sensor aging, calibration mismatch, or environmental variation. The Combined setting applies both perturbation types simultaneously. We then ablated the main robustness-related components of the pipeline, namely training-only normalization, on-the-fly perturbation augmentation, the micro-attention block, selective quantization that preserves sensitive layers in higher precision, and class imbalance regulation. In all cases, the same train/validation/test split and evaluation protocol were maintained.
Table 21 shows that the proposed system remains robust under the simulated perturbations, with only a limited Macro-F1 decrease from the clean setting to the most challenging combined setting. The largest degradation is observed when training-only normalization is removed, especially under sensor drift, which indicates that normalization is the main defense against slow calibration mismatch and channel-scale instability. Removing perturbation augmentation or the micro-attention block also reduces robustness, particularly in the noise and combined regimes, showing that robustness is strengthened both by exposure to disturbed signals during training and by the model’s ability to focus on the most informative temporal segments. Finally, quantizing all layers uniformly is more harmful than the selective strategy adopted in this work, confirming that keeping the attention and normalization components in higher precision improves stability under non-ideal sensing conditions. Overall, these results support the claim that robustness does not arise from a single design choice but from the joint effect of normalization, augmentation, attention, and precision-aware compression.
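The Noise, Drift, and Combined regimes can be emulated on a single normalized channel as sketched below. All magnitudes (noise standard deviation, spike probability and size, final bias and gain) are hypothetical, and applying drift before noise in the combined case is an assumption of this sketch rather than a detail taken from the protocol.

```python
import random
from typing import List


def add_noise(signal: List[float], sigma: float = 0.05, spike_prob: float = 0.01,
              spike_scale: float = 3.0, seed: int = 0) -> List[float]:
    """Gaussian perturbation plus sparse motion-like spikes on a normalized channel."""
    rng = random.Random(seed)
    out = []
    for x in signal:
        y = x + rng.gauss(0.0, sigma)
        if rng.random() < spike_prob:
            y += spike_scale * rng.choice((-1.0, 1.0))
        out.append(y)
    return out


def add_drift(signal: List[float], bias_end: float = 0.2, gain_end: float = 1.1) -> List[float]:
    """Slow linear ramp of bias and gain from (0, 1) to (bias_end, gain_end)."""
    n = len(signal)
    out = []
    for i, x in enumerate(signal):
        frac = i / (n - 1) if n > 1 else 0.0
        gain = 1.0 + (gain_end - 1.0) * frac
        out.append(gain * x + bias_end * frac)
    return out


def add_combined(signal: List[float], seed: int = 0) -> List[float]:
    """Drift followed by noise (ordering assumed for this sketch)."""
    return add_noise(add_drift(signal), seed=seed)


clean = [1.0] * 100  # hypothetical constant channel
noisy = add_noise(clean)
drifted = add_drift(clean)
combined = add_combined(clean)
```

Applying such perturbations on the fly during training corresponds to the augmentation component of the ablation, while applying them only at test time corresponds to the evaluation regimes.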
4.5. Interpretability–Performance Trade-Off
The proposed system was designed to balance predictive performance with a level of interpretability that remains practical for wearable stress monitoring. On the interpretability side, the model operates on a small and physiologically meaningful set of six signals, namely tri-axial acceleration, electrodermal activity, heart rate, and skin temperature, and the micro-attention block provides a direct indication of which temporal segments contribute most strongly to the final decision. This makes the model easier to analyze than larger black-box architectures that rely on deeper feature stacks or less transparent multimodal fusion. On the performance side, stronger but less interpretable models could potentially exploit additional signals, larger temporal contexts, or more complex non-linear interactions; however, they would also increase memory, latency, and difficulty of explanation. In our case, the trade-off is favorable: the attention-equipped full model reaches a Macro-F1 of 99.63%, and the final compressed deployment model still retains 98.03% Macro-F1 while preserving the ability to inspect stress-relevant temporal emphasis. Thus, the proposed approach does not maximize interpretability at the expense of accuracy, nor accuracy at the expense of transparency; instead, it adopts a middle ground in which clinically meaningful inputs and lightweight attention improve explainability while maintaining strong classification performance.
4.6. Limitations of Our Work
We acknowledge several limitations that bound the scope and generality of the present study and that motivate clear directions for subsequent research.
4.6.1. Generalizability Under a Limited Cohort Size
We acknowledge that the dataset used in this study is relatively small at the subject level, since it contains recordings from only 15 individuals. For this reason, the reported results should be interpreted as strong evidence on a controlled wearable-sensing benchmark rather than as proof of universal generalization across all populations and deployment contexts. To support generalizability as far as possible within the available data regime, we adopted a leakage-aware preprocessing and evaluation pipeline, used strict train/validation/test separation, and included robustness analyses under noise, drift, and environmental perturbation rather than relying only on a single clean-setting score. In addition, the model was designed around a compact set of broadly available physiological channels, namely acceleration, electrodermal activity, heart rate, and skin temperature, which improves the practical portability of the sensing setup. Nevertheless, inter-subject variability in stress physiology remains substantial, and a larger multi-subject, multi-session, and preferably multi-site evaluation would be needed to establish stronger population-level generalization. Expanding the cohort size and validating the model across more diverse participants and recording conditions therefore remains an important direction for future work.
4.6.2. Compression-Induced Boundary Sensitivity
We observe that aggressive compression can amplify sensitivity near ambiguous inter-class boundaries where physiological signatures are intrinsically subtle. While the compression stages emphasize parameter efficiency and latency, they may also thicken decision margins or attenuate weak but informative features, increasing the likelihood of near-boundary errors. This limitation suggests the need for boundary-aware training criteria, class-conditional regularization, and mixed-precision strategies that preserve discriminative capacity specifically in regions of high uncertainty, as well as calibration mechanisms that remain stable when logits are perturbed by quantization noise.
4.6.3. Device-Level Realism and Calibration Under Shift
We quantify efficiency primarily through model size and inference time in an isolated inference setting, without end-to-end measurements that account for sensor I/O, streaming preprocessing, firmware constraints, interrupt handling, and power-management overhead on embedded targets. In addition, calibration is assessed in-distribution, whereas deployment typically introduces prior and covariate shifts arising from activity changes, motion artifacts, and gradual sensor/skin-condition drift. Future studies should perform on-device energy profiling, closed-loop latency audits with the full signal-processing chain, and longitudinal calibration assessments under realistic shifts, including adaptive recalibration or lightweight uncertainty monitoring suitable for microcontroller-class hardware.
4.6.4. Potential Bias Related to Demographic and Activity Factors
Potential biases related to gender, age, and physical activity were not analyzed explicitly in the present study and should therefore be considered a limitation. The dataset used in this work contains recordings from a relatively small cohort of 15 nurses, and the current evaluation focuses on overall classification performance rather than subgroup-specific fairness or sensitivity analyses. In particular, we did not perform a separate audit of model behavior across demographic variables such as gender or age, and we did not stratify results by activity intensity beyond the implicit motion information contained in the tri-axial acceleration channels. This is important because physiological stress responses can vary across individuals and may also be influenced by demographic factors, occupational context, and movement-related confounding. Accordingly, the reported results should be interpreted as aggregate performance on the available cohort rather than as evidence of demographic invariance. A more comprehensive fairness analysis on a larger and more diverse population, with explicit subgroup annotations and activity-stratified evaluation, remains an important direction for future work.
4.7. Future Work
Future work will focus on improving generalization, robustness, and deployment breadth. A first direction is to evaluate the proposed model on additional stress datasets and under cross-dataset protocols to better quantify transferability across populations, sensing conditions, and recording environments. A second direction is to study longitudinal and subject-adaptive settings, where calibration and decision boundaries may be updated over time to reflect individual physiological baselines. We also plan to investigate richer multimodal variants that incorporate additional wearable bio-signals when available, while preserving the low-complexity design required for edge deployment. Finally, future research will explore hardware-aware co-design strategies, including energy profiling, compiler-level optimization, and lightweight uncertainty estimation, to further strengthen the practical use of compact stress-detection models in continuous real-world monitoring.
5. Conclusions
This paper presented a Micro-Attention CNN Hybrid Architecture for real-time stress detection using wearable bio-signals. The method was designed to operate on a compact set of physiological and motion channels, namely tri-axial acceleration, electrodermal activity, heart rate, and skin temperature, and to classify three stress levels: no stress, low stress, and high stress. The results show that the proposed architecture provides strong discriminative performance while remaining suitable for compressed on-device deployment. In particular, the experiments demonstrate that the combination of efficient one-dimensional convolutions, depthwise-separable filtering, and a lightweight attention mechanism can preserve stress-relevant temporal patterns even after aggressive model reduction.
This study also showed that a staged compression pipeline based on structured pruning, selective quantization-aware training, and post-training quantization can substantially reduce the model footprint while maintaining high classification quality. Additional analyses indicated that most residual errors are concentrated near the boundary of the low stress class and that lightweight calibration further improves prediction reliability after compression. Overall, the findings support the feasibility of accurate, compact, and privacy-preserving stress detection directly on wearable and edge devices using a minimal set of bio-signals.