4.1. Discussion
This study introduces three voting-based ensemble models—Consistent Precision Band (CPB), Selective Precision Band (SPB), and Optimized Precision Band (OPB)—for emotion recognition from sequential multimodal sensor data. The models operate on subinterval-level predictions and can be combined with different base classifiers, including logistic regression (log_reg), Random Forest (rf), and Random Forest with hyperparameter tuning (rf_ht).
In the binary setting, the proposed ensembles consistently outperform the strongest previously reported baselines across all experimental conditions (mo, mu, and mw), subinterval configurations, and base classifiers. The relative improvement ranges from 4.68% to 26.05%, with an average gain of 12.34%. The largest improvement is observed for SPB with five subintervals in the mu condition using log_reg, while all configurations remain above their corresponding baselines.
When extending the task to three classes, the advantage of the proposed ensembles becomes more pronounced. Improvements range from 11.17% to 37.10%, with an average gain of 21.63%. The highest improvement is achieved by OPB with four subintervals in the mu condition using log_reg, indicating that the proposed voting mechanisms remain effective under increased classification complexity.
The results also highlight the role of ensemble design and subinterval partitioning. In most configurations, SPB achieves the best or near-best performance, particularly with five subintervals, suggesting that incorporating the previous decision improves temporal consistency. OPB, in contrast, achieves the single largest improvement in the three-class setting; its odd total number of votes reduces ties, which likely helps stabilize the final decision. Overall, partitioning each window into finer subintervals and aggregating predictions at the window level substantially improves performance compared with standard window-based approaches.
Deep learning architectures (e.g., LSTM or 1D CNN) were not included as baselines. The dataset used in this study follows a subject-dependent protocol, where a separate model is trained for each participant. As reported in the original dataset description [4], each participant contributes only a limited number of samples per class, resulting in relatively small training sets at the individual level. Under such small-data conditions, deep models are harder to train and more prone to overfitting, particularly high-capacity architectures [35,36]. Moreover, deep learning models typically require greater computational and memory resources, as well as longer training times, than conventional machine learning methods [37]. For these reasons, lower-complexity models were considered more appropriate for the present subject-dependent setting.
In addition, introducing a fundamentally different class of models would confound the evaluation by mixing the effects of the classifier and the ensemble mechanism. This would make it difficult to attribute performance changes specifically to the proposed method. Since the ensembles are model-agnostic, integrating them with deep learning models remains a direction for future work when larger datasets become available.
The study uses the full multimodal feature set (accelerometer, gyroscope, and heart rate) as established in prior work on the same dataset [4], where this combination yielded the highest classification accuracy. Importantly, the proposed ensembles do not perform explicit multimodal fusion at either the feature or model level. Instead, multimodal information is integrated through the shared feature representation used by the base classifier, while the ensemble layer operates at the decision level by aggregating discrete predictions through majority voting. This separation preserves the generality of the framework, allowing it to be applied independently of the sensor configuration.
The findings of this study should be interpreted within a clearly defined experimental scope: the evaluation is subject-dependent, the test sequence is constructed using a class-blocked probabilistic ordering rather than an unstructured natural stream, and the temporal persistence parameter functions as a fixed experimental control rather than an empirically validated assumption. The following discussion outlines the main limitations of the present study and their implications for generalizability, robustness, and real-world deployment.
First, the evaluation follows a subject-dependent protocol in which a separate model is trained and tested for each participant. This design is consistent with prior work [4,11] and enables direct comparison with existing results. However, it does not assess generalization across users. To quantify this gap, we conducted a supplementary leave-one-subject-out (LOSO) evaluation using logistic regression as the base classifier under the binary setting (happy vs. sad). As shown in Table 11, the reproduced baseline accuracies (50.2–51.1%) are consistent with those reported by Quiroz et al. [4] (approximately 51–52%), confirming that subject-independent emotion recognition remains near chance level with the current feature set. Nevertheless, applying the SPB ensemble on top of the same logistic regression classifier still yields a consistent improvement across all three conditions (2.5–4.4% relative uplift), indicating that the proposed voting mechanism provides a measurable benefit even under the most challenging cross-user generalization setting. These results suggest that meaningful subject-independent recognition will require fundamentally different strategies such as domain adaptation or user-invariant feature learning, which remain directions for future work.
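The LOSO protocol and the relative-uplift metric mentioned above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the study: the function names `loso_splits` and `relative_uplift` are hypothetical, and the subject identifiers and accuracy values in the usage lines are placeholders, not results from Table 11.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids[i] is the subject that produced sample i. Each subject is
    held out once; all remaining subjects form the training set.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def relative_uplift(baseline_acc, ensemble_acc):
    """Relative improvement of the ensemble over the baseline, in percent."""
    return 100.0 * (ensemble_acc - baseline_acc) / baseline_acc


# Placeholder data: four samples from three hypothetical subjects.
for subject, train_idx, test_idx in loso_splits(["a", "a", "b", "c"]):
    print(subject, train_idx, test_idx)

# Placeholder accuracies (not taken from the study).
print(relative_uplift(50.0, 51.0))
```

Because the held-out subject never appears in the training indices, any gain measured this way reflects cross-user generalization rather than within-user fitting.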
Second, robustness to noisy or corrupted sensor data has not been explicitly evaluated. In practical scenarios, wearable sensors may produce artifacts due to motion interference, loose contact, or signal dropout. The subinterval-based design may provide a degree of robustness to localized noise, as each subinterval contributes a single vote and the final decision is determined by majority voting. However, controlled experiments with noise injection (e.g., additive noise or subinterval dropout) are required to quantify this effect and are deferred to future work.
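The intuition that majority voting may absorb localized noise can be illustrated with a simple binomial calculation. The sketch below assumes a binary task, an odd number of subinterval votes, and votes that are independently correct with the same probability; these assumptions are introduced here for illustration only and are unlikely to hold exactly for real sensor data, where errors in adjacent subintervals tend to be correlated. The function names and the corruption model (a corrupted subinterval yields a chance-level vote) are hypothetical.

```python
from math import comb


def majority_accuracy(n, q):
    """Probability that a strict majority of n independent binary votes is correct.

    n must be odd so that a strict majority always exists; q is the
    probability that a single vote is correct.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority always exists")
    return sum(comb(n, k) * q**k * (1 - q) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def effective_vote_accuracy(q_clean, corrupt_rate, q_corrupt=0.5):
    """Per-vote accuracy when a fraction of subintervals is corrupted.

    Assumption: a corrupted subinterval produces a vote that is correct
    with probability q_corrupt (chance level by default).
    """
    return (1 - corrupt_rate) * q_clean + corrupt_rate * q_corrupt


# Hypothetical example: 70% per-vote accuracy, 20% of subintervals corrupted.
q = effective_vote_accuracy(0.7, 0.2)
print(q, majority_accuracy(5, q))
```

Under these idealized assumptions the window-level accuracy stays above the per-vote accuracy whenever the latter exceeds 0.5, which is the effect the noise-injection experiments proposed above would need to confirm empirically.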
Third, the dataset was collected under semi-controlled conditions, which may not fully reflect real-world variability. To partially address this limitation, the evaluation protocol incorporates a temporal persistence parameter p (Section 3.2) that simulates realistic sequential dynamics. For the value of p used in the experiments, the expected duration of remaining in the same emotional state is approximately 30–50 s (depending on the window configuration; see Equation (5)), providing a more realistic alternative to i.i.d. evaluation. Nevertheless, real-world environments involve greater variability in behavior and context.
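As a rough illustration of how a persistence parameter translates into an expected dwell time, the sketch below assumes that p is the per-window probability of remaining in the same emotional state, so that the number of consecutive windows spent in a state is geometrically distributed with mean 1/(1 − p). This interpretation is an assumption made for illustration and may differ from the exact form of Equation (5); the function name and the input values are hypothetical.

```python
def expected_dwell_seconds(p, window_seconds):
    """Expected time spent in one state before switching, in seconds.

    Assumption: p is the per-window probability of staying in the current
    state, giving a geometric run length with mean 1 / (1 - p) windows.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return window_seconds / (1.0 - p)


# Hypothetical inputs: stay probability 0.5 with 2-second windows.
print(expected_dwell_seconds(0.5, 2.0))
```

Under this model, longer windows or a larger p both lengthen the expected dwell time, which is consistent with the dependence on window configuration noted above.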
The proposed ensembles add minimal computational overhead, requiring only a majority vote over a small number of predictions. While this design is intentionally lightweight, no quantitative evaluation of inference latency, memory usage, or energy consumption on actual wearable hardware is included in this study; such benchmarking remains necessary before deployment conclusions can be drawn.
While conceptual differences between the proposed ensembles and structured probabilistic sequential models (e.g., HMM, CRF, Kalman filtering) are discussed in Table 1, no direct experimental benchmarking against these methods is conducted in the present study; accordingly, no claims of comparative superiority over such models are made. Validation under naturalistic, free-living conditions remains necessary.
Moreover, in longitudinal deployment scenarios, model drift may arise as a user’s baseline movement patterns and emotional expression characteristics evolve over time, potentially degrading classification performance [38]. Due to their model-agnostic design, the proposed ensembles can be readily integrated with drift-detection and adaptive model-update mechanisms without requiring architectural changes. Nevertheless, the impact of drift on this task has not been explicitly evaluated and is left for future investigation.
Fourth, the emotional taxonomy used in this study is limited to three categories (happy, sad, and neutral). This restriction is inherited from the benchmark dataset of Quiroz et al. [4], which was designed around discrete emotion induction with these specific labels, and does not reflect an architectural limitation of the proposed method. The mathematical formulation of CPB, SPB, and OPB presented in [23] is derived for an arbitrary number of classes C, and the simulation results reported therein provide evidence of effectiveness across different class counts.
The present dataset was selected because it provides detailed per-condition and per-classifier results using lightweight algorithms, enabling the controlled comparison that is central to this study; benchmark datasets that simultaneously satisfy these criteria remain limited in the wearable emotion recognition literature.
As the number of classes increases, the probability of ties in majority voting also increases, since votes are distributed across a larger label space and a strict majority becomes less likely. The proposed models are designed to mitigate this effect. CPB and SPB use an odd number of subintervals, while OPB enforces an odd total number of votes (an even number of subintervals plus one previous-decision vote). An odd vote count guarantees a unique majority in the binary case and reduces the likelihood of ties in multi-class settings. When ties do occur in CPB and OPB, they are resolved by randomly selecting one of the tied classes; in SPB, the previous window decision is used instead.
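The voting and tie-resolution rules described above can be sketched as follows. This is a simplified illustration based only on the description in this section, not the formal definitions given in [23]. Two behaviors are assumptions introduced here: for the first window, where no previous decision exists, SPB falls back to a random choice among the tied classes and OPB votes over the subintervals alone. The function names are hypothetical.

```python
import random
from collections import Counter


def _leaders(votes):
    """Return the labels sharing the highest vote count, sorted for determinism."""
    counts = Counter(votes)
    top = max(counts.values())
    return sorted(label for label, n in counts.items() if n == top)


def cpb(sub_preds, rng=random):
    """Majority vote over subinterval predictions; ties resolved at random."""
    leaders = _leaders(sub_preds)
    return leaders[0] if len(leaders) == 1 else rng.choice(leaders)


def spb(sub_preds, previous=None, rng=random):
    """Majority vote over subinterval predictions; ties resolved by the
    previous window decision (assumption: random choice if none exists)."""
    leaders = _leaders(sub_preds)
    if len(leaders) == 1:
        return leaders[0]
    return previous if previous is not None else rng.choice(leaders)


def opb(sub_preds, previous=None, rng=random):
    """Majority vote over subinterval predictions plus one vote for the
    previous window decision; remaining ties resolved at random."""
    votes = list(sub_preds) + ([previous] if previous is not None else [])
    leaders = _leaders(votes)
    return leaders[0] if len(leaders) == 1 else rng.choice(leaders)


# Hypothetical subinterval predictions for one window.
print(cpb(["happy", "sad", "happy"]))
print(spb(["happy", "sad", "neutral"], previous="sad"))
print(opb(["happy", "happy", "sad", "sad"], previous="sad"))
```

In the three-class SPB example, every label receives one vote, so the previous decision determines the outcome; in the OPB example, the previous-decision vote breaks what would otherwise be a two-way tie among four subintervals.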
A finer-grained taxonomy (for example, the six basic emotions or a continuous valence–arousal representation) would enable a more comprehensive evaluation and may require more advanced tie-resolution strategies, such as weighted voting or confidence-based selection. This is identified as an important direction for future work.
Fifth, the feature engineering pipeline relies on 28 statistical and signal descriptors computed from each subinterval. No feature importance analysis or ablation study was conducted to determine the relative contribution of individual features to the observed performance gains. Such analysis remains an important direction for future work and would help clarify which signal characteristics are most informative for the proposed ensemble mechanism.