Novel Ensemble Models for Enhanced Accuracy in Time Series Classification: Application to Multimodal Emotion Detection

Abdel-Kader Mahmoud, Mohamed Hanafy; Saleh, Sherine Nagy; Shoukry, Amin; Elgamal, Yousry

doi:10.3390/computers15040256

by Mohamed Hanafy Abdel-Kader Mahmoud ¹,*, Sherine Nagy Saleh ², Amin Shoukry ³ and Yousry Elgamal ²

¹ Information Technology Institute (ITI), Ministry of Communications and Information Technology, Giza 12563, Egypt
² Computer Engineering Department, Arab Academy for Science, Technology and Maritime Transport, Alexandria 21937, Egypt
³ Computer and Systems Engineering Department, Alexandria University, Alexandria 21544, Egypt
* Author to whom correspondence should be addressed.

Computers 2026, 15(4), 256; https://doi.org/10.3390/computers15040256

Submission received: 20 February 2026 / Revised: 9 April 2026 / Accepted: 12 April 2026 / Published: 20 April 2026

(This article belongs to the Special Issue Wearable Computing and Activity Recognition)

Abstract

Emotions are fundamental to the human experience and are increasingly analyzed in applications such as marketing, healthcare, and human–computer interaction. Many recent approaches to human emotion recognition rely on deep learning, which typically demands large labeled datasets and substantial computational resources and often suffers from limited interpretability. Applying classical machine-learning methods to sensor time series is more lightweight but may struggle to reach high accuracy, especially when the temporal structure is not explicitly modelled. This paper introduces three subinterval voting-based ensemble models designed for user-specific emotion classification from multimodal time-series data acquired by smartwatch inertial sensors and heart-rate measurements. Each model partitions a time window into subwindows and performs window-level voting, thereby exploiting the temporal consistency of emotional responses while remaining compatible with standard classifiers such as logistic regression and Random Forests (with or without hyperparameter tuning). The models are evaluated on a public smartwatch emotion benchmark dataset under both binary (happy vs. sad) and three-class (happy, sad, neutral) settings. For the binary tasks, the relative accuracy improvement over the corresponding baseline reported in prior work ranges from 4.68% to 26.05%, with a mean gain of 12.34%. For the three-class tasks, improvements range from 11.17% to 37.10%, with a mean gain of 21.63%. Within the evaluated experimental setting, these results show that the proposed subinterval ensembles consistently enhance performance while remaining model-agnostic and compatible with standard user-specific classification pipelines in sensor-based emotion recognition.

Keywords: emotion recognition; user-specific models; wearable sensors; time-series classification; ensemble learning; smartwatch inertial data; heart-rate signals

1. Introduction

Human emotions play a central role in communication, decision-making, and social interaction. They are expressed through a mixture of verbal and non-verbal cues, including facial expressions, voice, posture, and physiological reactions, and they often follow recognizable patterns while still varying considerably across individuals [1]. Automatically recognizing these emotional states is therefore an important goal in affective computing, with applications in mental health monitoring, human–computer interaction, and adaptive user interfaces.

A wide range of signals has been used for emotion recognition, such as electroencephalography (EEG), electrocardiography (ECG), electrodermal activity (EDA), and other peripheral physiological and behavioral measurements [2,3]. In recent years, wearable devices have become a particularly attractive source of emotion-related information, because inertial sensors and heart-rate monitors can unobtrusively capture body movements and physiological responses during everyday activities [4,5]. These wearable signals are naturally represented as multichannel time series and are typically analyzed using sliding windows over time. Recent surveys on multimodal emotion detection and sentiment analysis [6] highlight that fusing complementary modalities, including text, audio, video, and physiological signals, consistently outperforms unimodal approaches, although the optimal fusion strategy remains an active area of research.

Recent work on emotion recognition from physiological time-series signals indicates that deep learning models currently dominate the field. These architectures are highly effective at learning complex nonlinear representations and often achieve strong performance on benchmark datasets. However, when trained from scratch, they can face several practical limitations, including the need for sufficiently large labeled datasets, substantial computational and memory resources, and careful hyperparameter tuning, while typically operating as black-box models with limited interpretability. Although transfer learning and pretraining strategies may partially mitigate label scarcity, robust generalization across users, sessions, and recording setups remains challenging, particularly in wearable- and sensor-based settings where datasets are frequently small, heterogeneous, and collected under varying contexts and conditions (e.g., different stimulus modalities and activity states) [7]. Similar trends and limitations have been reported in recent reviews and studies on EEG-, ECG-, and wearable-based emotion recognition systems, where deep architectures frequently outperform classical models but at the cost of increased complexity and reduced transparency [2,3,5,8]. In related domains, mixture-of-experts architectures have shown that routing inputs to specialized sub-models and aggregating their outputs can achieve parameter-efficient improvements in generalization [9]. The present work follows a related diversity-through-specialization principle, but implements it along the temporal dimension by partitioning each analysis window into subintervals, rather than by learning distinct expert networks.

By contrast, classical machine-learning pipelines based on handcrafted features and shallow classifiers remain competitive in many physiological and wearable-sensing applications. They are easier to train on small datasets, can be deployed on resource-constrained devices, and offer more transparent decision boundaries [10]. In these approaches, the raw time series are segmented into time windows, a set of descriptive features is extracted from each window, and a supervised classifier such as logistic regression or Random Forest is trained on the resulting feature vectors [4,11]. Nevertheless, there is still room for improvement in terms of accuracy and robustness, particularly when the emotional state evolves over time.

A key characteristic of emotional dynamics is that emotions do not typically fluctuate arbitrarily from one second to the next; rather, they unfold as short-lived episodes that remain relatively stable over brief time intervals and involve coordinated experiential, physiological, and behavioral changes [1,12]. In wearable- and movement-based affect recognition, this view is reflected in the widespread use of fixed-length analysis windows of several seconds to a few minutes, implicitly assuming that the underlying affective state is approximately stationary within each window [13,14,15,16]. Many existing machine-learning pipelines, however, still treat successive windows as independent and identically distributed samples, thereby ignoring temporal consistency and cross-window dependencies along the time axis [2,3,4,5,8,13]. Furthermore, most ensemble strategies in emotion recognition and related time-series classification tasks focus on combining multiple classifiers or modalities at the level of entire windows or instances, but they seldom exploit the structured relationships between subwindows within a larger window or between neighboring windows in the sequence [17,18,19].

Several approaches address temporal dependencies in sequential classification. Simple temporal smoothing methods, such as moving average or median filters [17], operate on a sequence of predictions by replacing each decision with a local aggregation of its neighbours. These methods reduce noise in the predicted labels, but they do not introduce new information, as they only reuse existing predictions.

Other approaches, such as Hidden Markov Models (HMMs) [20] and Conditional Random Fields (CRFs) [21], model temporal dependencies by learning transition relationships from labelled data. While effective, they require additional training, introduce model complexity, and rely on predefined assumptions about state transitions.

Similarly, state-space models such as the Kalman filter [22] assume specific dynamics that are not well suited for discrete classification tasks.

The proposed ensemble models differ from these approaches in two main aspects. First, they generate additional decision evidence rather than modifying existing predictions. Each analysis window is divided into multiple subintervals, and the base classifier produces one prediction for each subinterval. These predictions are then combined using majority voting, introducing intra-window decision diversity. Crucially, each subinterval prediction is an independent local decision in the sense that it is computed from its own feature vector, rather than being a filtered or interpolated version of a neighboring prediction. As a result, a single analysis window may produce multiple independent predictions, whereas smoothing methods typically operate on a single prediction per window. Second, the Selective Precision Band (SPB) and Optimized Precision Band (OPB) models introduce temporal information in a simple way: the decision from the previous window is added as a single vote in the current voting process. This differs from methods that rely on learned transitions or filtering equations. Notably, the Consistent Precision Band (CPB) model does not use any temporal information across windows; it relies only on subinterval-level predictions within the same window, which confirms that it is not a temporal smoothing method. The proposed models do not require additional training, do not impose distributional assumptions, and can be applied with any base classifier.

Under the assumptions described in [23], the proposed ensemble models are analytically shown to improve classification accuracy based on probabilistic analysis. This property is not available in standard smoothing or filtering methods. Therefore, the proposed models are better understood as temporal ensemble mechanisms that generate additional decision evidence, rather than as temporal smoothing techniques.

Table 1 summarizes these distinctions across nine criteria.

This work addresses these gaps by proposing three voting-based ensemble models for emotion prediction from continuous sensor signals. The starting point is a generic segmentation–feature–classification pipeline: the multichannel signals from a smartwatch accelerometer, gyroscope, and heart-rate monitor are segmented into fixed-length windows; each window is further divided into subintervals; and a set of spatiotemporal features, together with heart-rate variability measures, is extracted from every subinterval. A standard base classifier (logistic regression or random forest, with and without hyperparameter tuning) is then trained on the subinterval-level feature vectors.

On top of this base pipeline, we propose three ensemble models that leverage subinterval diversity through within-window majority voting. In the SPB and OPB models, a single decision from the previous window is included as an additional vote to support temporal consistency. The Consistent Precision Band (CPB) aggregates the subinterval predictions within each window by simple majority voting. The Selective Precision Band (SPB) extends this idea by including the previous window decision as an additional vote and by using it to resolve ties, thereby favoring temporally consistent sequences. The Optimized Precision Band (OPB) further adjusts the number of subintervals so that the total number of votes is always odd, which analytically reduces the probability of ties while still incorporating the previous decision. These ensembles are designed to be model-agnostic: they can operate with any underlying classifier and any set of features extracted from continuous time signals.

The theoretical foundation of these ensembles is given in our recent work in Pattern Recognition [23].

The contributions of this paper are threefold. First, we propose a simple yet flexible framework that combines subinterval-based feature extraction with voting-based temporal ensembles for emotion recognition from wearable sensor data in a user-specific setting. Second, we formalize three specific ensembles (CPB, SPB, and OPB) that are designed to exploit temporal consistency across adjacent time windows through majority-voting schemes. Third, we conduct an extensive experimental study on a publicly available smartwatch dataset [4], covering three experimental conditions (movie–then–walk, music–then–walk, and music–while–walking), both binary (happy vs. sad) and three-class (happy, sad, neutral) problems, and multiple base classifiers. Within the subject-dependent evaluation considered here, the results show consistent improvements over competitive baselines and previously published models, with relative accuracy gains ranging from around 5% to more than 35%.

2. Proposed Models Design and Implementation

The proposed framework operates in two phases: a training phase and an inference phase. In both phases, the model processes raw multi-channel time-domain signals from the accelerometer, gyroscope, and heart-rate sensor using a window-based representation that is further refined into subintervals.

During the training phase, the raw multi-channel signals are segmented into fixed-length windows, and each window is further divided into n contiguous subintervals. For every subinterval, a feature vector is extracted and used as input to a single base classifier A. These subinterval feature vectors, together with their corresponding labels, are used to train A.

In the inference phase, the same segmentation and feature-extraction pipeline is applied to unseen data. The trained classifier A is evaluated independently on each subinterval within a window, producing n subinterval predictions $y_{1}(t), \dots, y_{n}(t)$. These predictions are then fused at the window level using one of three ensemble schemes: Consistent Precision Band (CPB), Selective Precision Band (SPB), and Optimized Precision Band (OPB). In CPB, the final decision $\hat{y}(t)$ is obtained by simple majority voting over the current subinterval predictions only. In SPB, the previous window decision $\hat{y}(t-1)$ is also included as an additional vote together with the n current predictions, and if a tie still occurs, $\hat{y}(t-1)$ is used to resolve the ambiguity. In OPB, the previous decision $\hat{y}(t-1)$ is again added as an extra vote, but the number of subintervals n is chosen to be even, so that the ensemble operates on an odd total of $n + 1$ votes, which reduces the probability of ties and exploits temporal consistency between consecutive windows.

2.1. Training Phase

As illustrated in Figure 1, the training phase consists of three main steps: data segmentation, feature extraction, and classifier training.

2.1.1. Data Segmentation

In the data-segmentation step, the raw time-domain signals are first smoothed using a median filter with a window length of three samples. This signal-level filtering step constitutes a standard preprocessing procedure that removes impulse noise from the raw sensor readings prior to feature extraction. It operates directly on continuous-valued amplitude measurements and is conceptually distinct from the decision-level voting mechanisms described in Section 2.2, which aggregate discrete classifier predictions.

The filtered accelerometer, gyroscope, and heart-rate signals are then segmented into fixed-length analysis windows using a sliding-window scheme. Each analysis window is partitioned into contiguous subintervals of 1 s duration, where each subinterval contains approximately 24 samples at the average sampling rate of 23.8 Hz. The choice of a 1 s subinterval duration is motivated by prior studies on emotion and activity recognition using inertial sensors, which commonly report that window lengths of approximately one second are effective for classification tasks based on inertial motion signals [23]. Quiroz et al. [4] also adopted a 1 s window on the same dataset.

Larger analysis windows are then constructed by grouping multiple 1 s subintervals. Specifically, window lengths of 3 s and 5 s are used for CPB and SPB (yielding 3 and 5 subintervals, respectively), while window lengths of 2 s and 4 s are used for OPB (yielding 2 and 4 subintervals, respectively). These choices make the number of subintervals odd for CPB and SPB and even for OPB. CPB therefore votes over an odd number of predictions, and OPB reaches an odd total of n + 1 votes once the previous decision is added. In SPB, adding the previous decision to an odd number of subinterval predictions gives an even total, so ties remain possible and are resolved by the previous decision (Section 2.2.2).

For each subinterval, a feature vector is extracted, yielding one record per subinterval; the classifier A is trained on this subinterval-level dataset. For comparison, we also consider a baseline configuration in which the windows are not divided into subintervals. In this case, features are computed from full windows of 1 s, 3 s, and 5 s. This design isolates the contribution of subinterval decomposition and window-level voting from effects that follow only from changing window duration.

This window–subinterval strategy, in which features are extracted separately from each subinterval, has also been adopted in previous work on activity recognition from mobile-phone accelerometer data [24,25].
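The segmentation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the treatment of the two endpoint samples in the median filter, the use of non-overlapping windows, and the synthetic signal are all assumptions made for the example.

```python
from statistics import median

def median_filter3(signal):
    """Length-3 median filter; the two endpoint samples are left unchanged (assumption)."""
    if len(signal) < 3:
        return list(signal)
    out = [signal[0]]
    for i in range(1, len(signal) - 1):
        out.append(median(signal[i - 1:i + 2]))
    out.append(signal[-1])
    return out

def split_subintervals(signal, samples_per_sub=24):
    """Cut a signal into contiguous subintervals of about 1 s; drop an incomplete tail."""
    n_full = len(signal) // samples_per_sub
    return [signal[k * samples_per_sub:(k + 1) * samples_per_sub] for k in range(n_full)]

def group_windows(subintervals, n):
    """Group consecutive subintervals into non-overlapping windows of n subintervals."""
    n_windows = len(subintervals) // n
    return [subintervals[w * n:(w + 1) * n] for w in range(n_windows)]

# Synthetic single-axis signal (about 10 s at 24 samples per second) with one spike.
raw = [float(i % 7) for i in range(240)]
raw[50] = 100.0
filtered = median_filter3(raw)          # the spike at index 50 is suppressed
subs = split_subintervals(filtered)     # ten 1 s subintervals
windows_cpb = group_windows(subs, 3)    # 3 s windows (CPB/SPB): 3 subintervals each
windows_opb = group_windows(subs, 4)    # 4 s windows (OPB): 4 subintervals each
```

Each subinterval would then be passed to the feature-extraction step, so that one feature vector (and later one prediction) is produced per subinterval rather than per window.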

2.1.2. Feature Extraction

Following the feature extraction approach of Quiroz et al. [4], a set of 17 features is computed from each subinterval for each of the six sensor axes (accelerometer: $a_{x}, a_{y}, a_{z}$; gyroscope: $g_{x}, g_{y}, g_{z}$). Let $s = (s_{1}, s_{2}, \dots, s_{N})$ denote the samples of a single-axis signal within a subinterval.

Table 2 lists the 17 per-axis features together with their mathematical definitions and source references. The features span six categories: central tendency (mean, median, mean of absolute values), dispersion (standard deviation, range, quartiles, median absolute deviation), extrema (maximum, minimum), magnitude (root mean square, root sum of squares, sum, sum of absolute values), shape (skewness, kurtosis), and power (energy). These features are extracted independently for each of the six axes, yielding $17 \times 6 = 102$ features per subinterval.

In addition, three orientation angles are computed from the accelerometer mean vector $\mu_{a} = (\mu_{ax}, \mu_{ay}, \mu_{az})$. The angle with each coordinate axis is defined as $\theta_{x} = \arctan\left(\frac{\mu_{ax}}{\sqrt{\mu_{ay}^{2} + \mu_{az}^{2}}}\right)$, with analogous definitions for $\theta_{y}$ and $\theta_{z}$ [26,27]. The standard deviation of the signal magnitude, $\sigma_{m} = \mathrm{SD}\left(\sqrt{a_{x}^{2} + a_{y}^{2} + a_{z}^{2}}\right)$, summarizes overall movement intensity variation.

From the heart-rate (HR) signal, the mean heart rate within each subinterval is used as a feature. The root mean square of successive differences (RMSSD) [27] is also computed to capture short-term heart-rate variability when sufficient measurements are available.
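As a hedged illustration of the orientation, magnitude, and heart-rate features just described, the sketch below computes them with the Python standard library. The function names are hypothetical. The forms of $\theta_{y}$ and $\theta_{z}$ are assumed to follow from $\theta_{x}$ by permuting the axes, the population form of the SD is assumed, and "sufficient measurements" for RMSSD is interpreted here as at least two values.

```python
import math

def orientation_angles(ax, ay, az):
    """Orientation angles (radians) from the mean accelerometer vector."""
    mx, my, mz = (sum(v) / len(v) for v in (ax, ay, az))
    # atan2(num, den) equals arctan(num / den) for den > 0 and stays defined at den == 0.
    theta_x = math.atan2(mx, math.hypot(my, mz))
    theta_y = math.atan2(my, math.hypot(mx, mz))   # assumed analogue of theta_x
    theta_z = math.atan2(mz, math.hypot(mx, my))   # assumed analogue of theta_x
    return theta_x, theta_y, theta_z

def magnitude_sd(ax, ay, az):
    """Population SD of the per-sample acceleration magnitude."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]
    mu = sum(mags) / len(mags)
    return math.sqrt(sum((m - mu) ** 2 for m in mags) / len(mags))

def mean_hr(hr_values):
    """Mean heart rate within a subinterval."""
    return sum(hr_values) / len(hr_values)

def rmssd(values):
    """Root mean square of successive differences; None if fewer than two values."""
    if len(values) < 2:
        return None
    diffs = [b - a for a, b in zip(values, values[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

For example, a device lying flat with gravity along the z-axis gives `theta_z` equal to π/2 and `theta_x` equal to 0 under these definitions.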

The complete feature vector for each subinterval is thus

$$f = [f_{ax}(1{:}17), f_{ay}(1{:}17), f_{az}(1{:}17), f_{gx}(1{:}17), f_{gy}(1{:}17), f_{gz}(1{:}17), \theta_{x}, \theta_{y}, \theta_{z}, \sigma_{m}, \mathrm{HR}],$$

resulting in 107 features per subinterval. The feature vector $f$ is used as input to the base classifier A described in Section 2.1.

Table 2. Summary of the 17 per-axis features extracted from each subinterval. $s = (s_{1}, \dots, s_{N})$ denotes the samples of a single-axis signal; $\mu$ and $\sigma$ denote its mean and standard deviation (SD), respectively.

#	Feature	Definition	Category	Ref.
1	Mean	$\mu = \frac{1}{N} \sum s_{i}$	Central tendency	[28]
2	SD	$\sigma = \sqrt{\frac{1}{N} \sum (s_{i} - \mu)^{2}}$	Dispersion	[29]
3	Max	$\max(s)$	Extrema	[30]
4	Min	$\min(s)$	Extrema	[30]
5	Energy	$E = \sum s_{i}^{2}$	Power	[29]
6	Kurtosis	$\kappa = \frac{1}{N} \sum \left(\frac{s_{i} - \mu}{\sigma}\right)^{4}$	Shape	[31]
7	Skewness	$\gamma = \frac{1}{N} \sum \left(\frac{s_{i} - \mu}{\sigma}\right)^{3}$	Shape	[32]
8	RMS	$\sqrt{\frac{1}{N} \sum s_{i}^{2}}$	Magnitude	[33]
9	RSS	$\sqrt{\sum s_{i}^{2}}$	Magnitude	[33]
10	Sum	$\sum s_{i}$	Magnitude	[28]
11	Sum of absolute values	$\sum \lvert s_{i} \rvert$	Magnitude	[28]
12	Mean of absolute values	$\frac{1}{N} \sum \lvert s_{i} \rvert$	Central tendency	[28]
13	Range	$\max(s) - \min(s)$	Dispersion	[30]
14	Median	$Q_{2}$ (50th percentile)	Central tendency	[32]
15	$Q_{3}$	75th percentile	Dispersion	[32]
16	$Q_{1}$	25th percentile	Dispersion	[32]
17	MAD	$\mathrm{median}(\lvert s_{i} - \mathrm{median}(s) \rvert)$	Dispersion	[29]
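The definitions in Table 2 translate directly into code. The following sketch computes the 17 per-axis features for one single-axis subinterval using only the Python standard library. The function name and dictionary keys are hypothetical, the quartile interpolation rule (`method="inclusive"`) is an assumption because the table does not fix one, and returning 0.0 for the shape features of a constant signal is a placeholder choice.

```python
import math
import statistics

def per_axis_features(s):
    """Return the 17 per-axis features of Table 2 for one single-axis subinterval."""
    n = len(s)
    mu = sum(s) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in s) / n)   # population SD, as in Table 2
    energy = sum(x * x for x in s)
    abs_sum = sum(abs(x) for x in s)
    # Quartiles with linear interpolation; the interpolation rule is an assumption.
    q1, med, q3 = statistics.quantiles(s, n=4, method="inclusive")
    # Skewness and kurtosis are undefined when sd == 0; 0.0 is a placeholder.
    kurt = sum(((x - mu) / sd) ** 4 for x in s) / n if sd > 0 else 0.0
    skew = sum(((x - mu) / sd) ** 3 for x in s) / n if sd > 0 else 0.0
    return {
        "mean": mu, "sd": sd, "max": max(s), "min": min(s), "energy": energy,
        "kurtosis": kurt, "skewness": skew,
        "rms": math.sqrt(energy / n), "rss": math.sqrt(energy),
        "sum": sum(s), "abs_sum": abs_sum, "abs_mean": abs_sum / n,
        "range": max(s) - min(s), "median": med, "q3": q3, "q1": q1,
        "mad": statistics.median(abs(x - med) for x in s),
    }

# Applying the function to each of the six axes gives 6 x 17 = 102 values.
example = per_axis_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

Concatenating the six per-axis results with the three orientation angles, the magnitude SD, and the mean heart rate gives the 107-dimensional vector described in the text.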

2.2. Inference Phase and Voting-Based Ensembles

During the inference phase, the same segmentation and feature-extraction pipeline is applied to unseen data.

For each analysis window t, the trained classifier A is evaluated independently on all n subintervals, producing n subinterval predictions $y_{1}(t), \dots, y_{n}(t)$. These predictions are then combined at the window level using one of three ensemble schemes, as illustrated in Figure 2: Consistent Precision Band (CPB; Figure 2a), Selective Precision Band (SPB; Figure 2b), and Optimized Precision Band (OPB; Figure 2c). In SPB and OPB, a copy of the previous window decision $\hat{y}(t-1)$ is also included as an additional vote in the fusion process.

In all equations, $t = 1, \dots, T$ denotes the index of the current analysis window in the sequence. The term $y_{j}(t)$ refers to the prediction for subinterval j of window t, while $\hat{y}(t)$ denotes the final window-level decision.

2.2.1. Consistent Precision Band (CPB)

As depicted in Figure 2a, in the Consistent Precision Band (CPB) scheme, each window is divided into an odd number n of subintervals. The classifier A outputs n subinterval predictions $y_{1}(t), \dots, y_{n}(t)$, which are aggregated by simple majority voting. The decision rule can be written as

$$\hat{y}(t) = \mathrm{MV}\big(y_{1}(t), \dots, y_{n}(t)\big), \tag{1}$$

where $\mathrm{MV}(\cdot)$ denotes the majority-voting operator. Equation (1) serves as the baseline ensemble rule and does not use any information from previous windows.
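A minimal sketch of the CPB rule in Equation (1) is given below. The function name `cpb_decide` and the integer label coding (+1, −1, and 0 for neutral) are illustrative assumptions. The random choice among tied classes covers the multi-class case, where an odd number of votes does not rule out ties.

```python
import random
from collections import Counter

def cpb_decide(sub_preds, rng=random):
    """Majority vote over the subinterval predictions of one window (Eq. 1)."""
    counts = Counter(sub_preds)
    top = max(counts.values())
    leaders = sorted(c for c, k in counts.items() if k == top)
    # Unique majority: return it. Otherwise pick one of the tied classes at random.
    return leaders[0] if len(leaders) == 1 else rng.choice(leaders)

# Binary example with three 1 s subintervals in a 3 s window.
decision = cpb_decide([+1, +1, -1])
```

In the binary setting with odd n the tie branch is never reached, so the output is deterministic.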

2.2.2. Selective Precision Band (SPB)

The Selective Precision Band (SPB), illustrated in Figure 2b, uses the same window partitioning strategy with an odd number of subintervals, but it also incorporates temporal information by including the previous decision $\hat{y}(t-1)$ as an additional vote. First, an intermediate voting result is computed as

$$z(t) = \mathrm{MV}\big(y_{1}(t), \dots, y_{n}(t), \hat{y}(t-1)\big). \tag{2}$$

The final decision is then defined by

$$\hat{y}(t) = \begin{cases} z(t), & \text{if there is a unique majority class in (2)}, \\ \hat{y}(t-1), & \text{if a tie occurs}. \end{cases} \tag{3}$$

Here, $z(t)$ represents the outcome of the majority vote that already includes the previous decision as an additional vote, while $\hat{y}(t)$ in Equation (3) is reserved for the final window-level decision after the tie-breaking step. This separation keeps the notation clear and makes explicit that the previous decision is given priority only when the current evidence is ambiguous.
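Equations (2) and (3) can be sketched as follows. The function names are hypothetical, and the text does not state how the first window (which has no predecessor) is handled, so the `initial` argument of `spb_sequence` is an assumption introduced only to make the example runnable.

```python
from collections import Counter

def spb_decide(sub_preds, prev_decision):
    """SPB decision for one window: vote with the previous decision, then tie-break."""
    votes = list(sub_preds) + [prev_decision]        # Eq. (2): one extra vote
    counts = Counter(votes)
    top = max(counts.values())
    leaders = [c for c, k in counts.items() if k == top]
    # Eq. (3): unique majority wins; on a tie the previous decision is kept.
    return leaders[0] if len(leaders) == 1 else prev_decision

def spb_sequence(windows, initial):
    """Apply SPB along a sequence of windows; `initial` seeds the first window (assumption)."""
    decisions, prev = [], initial
    for sub_preds in windows:
        prev = spb_decide(sub_preds, prev)
        decisions.append(prev)
    return decisions

# The second window splits 2-2 once the previous decision (+1) is added, so +1 is kept.
sequence = spb_sequence([[1, 1, 1], [1, -1, -1], [-1, -1, -1]], initial=1)
```

The second window shows the effect of the rule: a 1 to 2 subinterval split against the previous class is not enough to change the decision, whereas the unanimous third window is.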

2.2.3. Optimized Precision Band (OPB)

The Optimized Precision Band (OPB) is shown in Figure 2c. It also incorporates the previous decision $\hat{y}(t-1)$ as an additional vote, but it modifies the number of subintervals to further reduce the probability of ties. In OPB, each window is divided into an even number n of subintervals, so that when the previous decision is added, there are $n + 1$ votes in total, which is an odd number. The decision rule is

$$\hat{y}(t) = \mathrm{MV}\big(y_{1}(t), \dots, y_{n}(t), \hat{y}(t-1)\big). \tag{4}$$

Compared to SPB, Equation (4) avoids explicit tie-breaking by construction in the binary case, since the ensemble always operates on an odd number of votes, while still exploiting temporal consistency through the inclusion of $\hat{y}(t-1)$.
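The OPB rule in Equation (4) can be sketched in the same style. The function name and the explicit parity check are illustrative assumptions. The random fallback is included only for multi-class ties, which an odd vote count does not exclude.

```python
import random
from collections import Counter

def opb_decide(sub_preds, prev_decision, rng=random):
    """OPB decision: an even number of subinterval votes plus the previous decision (Eq. 4)."""
    if len(sub_preds) % 2 != 0:
        raise ValueError("OPB expects an even number of subinterval predictions")
    votes = list(sub_preds) + [prev_decision]        # n + 1 votes, an odd total
    counts = Counter(votes)
    top = max(counts.values())
    leaders = sorted(c for c, k in counts.items() if k == top)
    # Binary case: the leader is always unique. Multi-class ties fall back to a random pick.
    return leaders[0] if len(leaders) == 1 else rng.choice(leaders)

# With a 1-1 split among two subintervals, the previous decision settles the vote.
keep_happy = opb_decide([+1, -1], prev_decision=+1)
keep_sad = opb_decide([+1, -1], prev_decision=-1)
```

When the subintervals disagree evenly, the previous decision acts as the deciding vote, which is how the scheme encourages temporal consistency without a separate tie-breaking step.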

3. Experimentation and Results

The effectiveness of the proposed models is assessed using a well-documented dataset for emotion recognition [4]. The same dataset appears in several follow-up studies, for example [11,34], which report accuracy values under different parameter and hyperparameter settings. Using this dataset enables a detailed comparison with prior work and supports a realistic evaluation of the proposed ensembles.

3.1. Dataset

The dataset of Quiroz et al. [4] includes two types of emotion stimuli: audiovisual and audio. The audiovisual stimuli consist of commercial movie clips that target happiness or sadness, following established affective stimulus selection practices. A separate group of participants rates the emotional content of the clips on a Likert scale, and the dataset documentation reports intensity ratings in the range of 5.0 to 6.5 for happy and sad stimuli.

During data collection, participants experience happy, sad, and neutral stimuli, either as movie clips (audiovisual) or as classical music (audio). The protocol defines three conditions: mo (movie before walking), mu (music before walking), and mw (music while walking), and it randomizes the condition order to reduce order effects. Each participant wears a smartwatch and a chest heart-rate monitor. The smartwatch records tri-axial accelerometer and gyroscope data at an average sampling rate of 23.8 Hz, while the heart-rate strap provides beat-to-beat heart rate. The protocol obtains self-reported mood using the PANAS questionnaire before and after each stimulus.

The walking task consists of a 250 m walk along an S-shaped corridor, and the setup monitors behavior and heart rate unobtrusively. The dataset includes 50 participants, and complete physiological data are available for 44 participants, who constitute the evaluation cohort in the present study. For each participant and each condition (mo, mu, mw), the pipeline processes the recorded signals as described in Section 2. The benefit of combining multiple sensor modalities has been demonstrated in prior work on the same dataset [4]. Quiroz et al. [4] compared three sensor configurations: accelerometer only, accelerometer with heart rate, and the full set of accelerometer, gyroscope, and heart rate. Their results showed that the full multimodal configuration consistently achieved the highest classification accuracy across all conditions—for example, in the binary setting, Random Forest accuracy increased from 0.774 (accelerometer only) to 0.822 (accelerometer + heart rate) to 0.854 (all sensors).

In the present study, the full multimodal configuration is adopted as the default, as it represents the strongest baseline pipeline reported on this dataset. The proposed ensembles (CPB, SPB, OPB) are applied on top of this multimodal feature representation and are agnostic to the sensor composition. Investigating the interaction between the ensemble mechanism and different sensor combinations is left for future work.

Quiroz et al. [4] evaluate their classifiers using a within-subject (subject-dependent) protocol. Training and testing data come from the same participant, and a separate personal model is trained for each user. The evaluation uses stratified 10-fold cross-validation (repeated 10 times) for each participant, and the results are summarized across participants and conditions.

3.2. Experimental Setup

The three proposed ensembles (CPB, SPB, and OPB) are implemented in Python 3.11.4 and are evaluated separately for each experimental condition (mo, mu, mw). Two base classifiers serve as the subinterval-level classifier A: a Random Forest classifier and a Logistic Regression classifier. Accuracy values denote means across cross-validation folds and subjects for each condition; an overall mean across conditions is also reported when required.

The selection of base classifiers is intentionally aligned with prior studies on the same benchmark dataset. Specifically, logistic regression and random forest are adopted from Quiroz et al. [4], while random forest with hyperparameter tuning (rf_ht) is adopted from Nur et al. [11]. These models represent strong lightweight classifiers previously reported on this dataset.

The objective of this study is not to introduce a new classifier, but to evaluate the effect of the proposed ensemble mechanisms within the same experimental pipeline used in earlier work. Keeping the base classifiers fixed ensures a fair and controlled comparison: any performance improvement can be attributed to the proposed voting-based ensemble layer rather than to changes in the underlying classifier.

For CPB and SPB, the experiments consider window lengths of 3 s and 5 s. Each window is divided into equal 1 s subintervals, resulting in 3 or 5 subintervals per window. The pipeline extracts features from each 1 s subinterval as described in Section 2.1.1, and the classifier A produces an independent prediction for each subinterval.

The baseline model does not use subintervals: it operates on features computed from full windows of 1 s, 3 s, or 5 s. This design isolates the contribution of subinterval decomposition and window-level voting from effects that follow only from changing window duration.

For OPB, the design uses an even number of subintervals so that adding the previous decision $\hat{y}(t-1)$ yields an odd total number of votes. The experiments use window lengths of 2 s and 4 s divided into 2 and 4 subintervals of 1 s, respectively, and they include $\hat{y}(t-1)$ as an additional vote, as described in Section 2. This choice reduces the likelihood of ties in majority voting and encourages temporal consistency between consecutive windows.

Note that while an odd number of votes guarantees a unique majority in the binary case, ties can still occur in the multi-class case (e.g., a split such as 2–2–1). In CPB and OPB, such ties are resolved by randomly selecting one of the tied classes; in SPB, the previous window decision $\hat{y}(t-1)$ is used instead, consistent with the formulation presented in [23].
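The remark about multi-class ties can be checked by brute force. The sketch below enumerates every possible vote pattern and counts those in which the two most frequent classes share the top count. The function name and label coding are hypothetical, and all patterns are counted equally, which says nothing about how likely each pattern is in practice.

```python
from collections import Counter
from itertools import product

def count_ties(n_votes, classes):
    """Count vote patterns whose top count is shared by at least two classes."""
    ties = 0
    for votes in product(classes, repeat=n_votes):
        ranked = Counter(votes).most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            ties += 1
    return ties

binary_odd = count_ties(5, (-1, +1))        # 0 of 32 patterns
binary_even = count_ties(4, (-1, +1))       # 6 of 16 patterns (2-2 splits)
three_class = count_ties(5, (-1, 0, +1))    # 90 of 243 patterns (2-2-1 splits)
```

With five votes, no binary pattern is tied, while 90 of the 243 three-class patterns are 2–2–1 splits, which is why an explicit tie rule is still needed in the three-class experiments.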

The evaluation follows a class-blocked K-fold cross-validation protocol that applies to both the binary and the three-class setups. Let $I_{c}$ denote the set of sample indices with class label c in the original file order. For each class c, the index set $I_{c}$ is partitioned into $K_{c}$ disjoint contiguous blocks using an order-preserving split (contiguous with respect to the ordered index list $I_{c}$). A K-fold scheme is constructed by selecting one block from one class as the test set and concatenating all remaining blocks (from all classes) as the training set; therefore, each fold holds out a single class-specific block. In the binary setting, the protocol removes the neutral class ($c = 0$) and sets $K_{+1} + K_{-1} = K$. In the three-class setting, the protocol retains the neutral class and sets $K_{+1} + K_{0} + K_{-1} = K$, while preserving the same blocked splitting principle.
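One way to realize the class-blocked split is sketched below. The function names are hypothetical, and the near-equal block sizes and the specific $K_{c}$ values in the example are assumptions, since the text only requires the blocks to be disjoint, contiguous, and order-preserving.

```python
def contiguous_blocks(indices, k):
    """Split an ordered index list into k contiguous blocks of near-equal size."""
    base, extra = divmod(len(indices), k)
    blocks, start = [], 0
    for b in range(k):
        size = base + (1 if b < extra else 0)
        blocks.append(indices[start:start + size])
        start += size
    return blocks

def class_blocked_folds(labels, blocks_per_class):
    """Yield (train_idx, test_idx) pairs; each fold holds out one block of one class.

    Classes missing from `blocks_per_class` are dropped, which is how the neutral
    class (label 0) is removed in the binary setting.
    """
    all_blocks = []
    for c, k_c in blocks_per_class.items():
        idx_c = [i for i, y in enumerate(labels) if y == c]   # I_c in file order
        all_blocks.extend(contiguous_blocks(idx_c, k_c))
    for held_out, test in enumerate(all_blocks):
        train = sorted(i for b, blk in enumerate(all_blocks) if b != held_out for i in blk)
        yield train, test

# Binary example: K_{+1} = 3 and K_{-1} = 3 give K = 6 folds; neutral samples are ignored.
labels = [1] * 6 + [0] * 4 + [-1] * 6
folds = list(class_blocked_folds(labels, {1: 3, -1: 3}))
```

Passing `{1: 3, 0: 3, -1: 3}` instead would retain the neutral class and give nine folds under the same splitting principle.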

Within each fold, the evaluation analyzes sequential behavior in the test stream by constructing an ordered test sequence. Test indices are rearranged into contiguous windows of fixed length n (window_time). A window is retained only if it contains samples from a single ground-truth class. The ordering procedure maintains the current class for the next window with probability p and switches to a different class with probability $1 - p$, subject to having at least n remaining test samples in the selected class. This ordering applies only to the test stream within a fold and does not modify the training data. The evaluation applies the base classifier to each element within a retained window and uses majority voting to produce one window-level decision.
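The ordering procedure can be sketched as follows. This is one interpretation under stated assumptions: the function name `build_test_stream`, the per-class index pools, the random choice of the first class, and the rule that the stream stays in the current class when no other class has n samples left are all illustrative choices, not details confirmed by the source.

```python
import random

def build_test_stream(test_idx_by_class, n, p, seed=0):
    """Arrange test indices into single-class windows of length n.

    After each window the stream keeps the current class with probability p;
    otherwise it switches to another class that still has at least n samples.
    """
    rng = random.Random(seed)
    pools = {c: list(idx) for c, idx in test_idx_by_class.items()}

    def available(exclude=None):
        return [c for c, pool in pools.items() if c != exclude and len(pool) >= n]

    windows = []
    candidates = available()
    current = rng.choice(candidates) if candidates else None
    while current is not None:
        windows.append((current, pools[current][:n]))   # one single-class window
        pools[current] = pools[current][n:]
        can_stay = len(pools[current]) >= n
        others = available(exclude=current)
        if can_stay and (not others or rng.random() < p):
            continue                                     # persist in the same class
        current = rng.choice(others) if others else None
    return windows

stream = build_test_stream({1: list(range(10)), -1: list(range(10, 20))}, n=3, p=0.9)
```

Leftover samples that cannot fill a complete window are discarded in this sketch, and a pool containing a single class simply never switches.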

The parameter p represents the assumed degree of temporal persistence between consecutive analysis windows and functions as an experimental control. In the results presented in this paper, p is fixed at 0.9 for all configurations to reflect short-term persistence while avoiding overly restrictive temporal assumptions. With window length w (in seconds), the expected duration of staying in the same ground-truth class under this ordering is

$$E[D] = \frac{w}{1 - p} \tag{5}$$

Therefore, $p = 0.9$ corresponds to approximately 30 s and 50 s for $w = 3$ s and $w = 5$ s, respectively, which provides a conservative yet realistic persistence setting for sequential wearable-sensor emotion data.
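Equation (5) follows if the number of consecutive same-class windows is treated as geometrically distributed with mean 1/(1 - p). The short sketch below checks the closed form against a simulation; the function names are hypothetical, and the simulation ignores the finite-sample constraint (at least n remaining samples) that the actual ordering procedure imposes.

```python
import random

def expected_duration(w, p):
    """Closed form of Equation (5): expected same-class duration in seconds."""
    return w / (1.0 - p)

def simulated_duration(w, p, trials=20000, seed=0):
    """Monte Carlo estimate: count windows until the first class switch."""
    rng = random.Random(seed)
    total_windows = 0
    for _ in range(trials):
        run = 1                       # the first window of the run
        while rng.random() < p:       # stay in the same class with probability p
            run += 1
        total_windows += run
    return w * total_windows / trials

d3 = expected_duration(3, 0.9)        # about 30 s
d5 = expected_duration(5, 0.9)        # about 50 s
```

Under the same formula, the 2 s and 4 s windows used with OPB would give roughly 20 s and 40 s.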

3.3. Results

3.3.1. Effect of Subinterval Count and Window Size

The boxplot in Figure 3 summarizes the distribution of accuracy values obtained for the three ensembles (CPB, SPB, and OPB) under the different configurations, aggregated over conditions (mo, mu, mw), numbers of classes (C = 2 and C = 3), and the two base classifiers. The plot compares the proposed ensembles to the corresponding baseline models and highlights two main trends.

First, using subintervals consistently improves accuracy over the baseline. Across all models and class configurations, increasing the number of subintervals (e.g., from 3 to 5 for CPB and SPB, or from 2 to 4 for OPB) leads to higher accuracy. This indicates that the subinterval-based representation, combined with voting at the window level, exploits additional temporal information that is not captured when features are computed from a single, longer window. This comparison is particularly informative because both the ensemble models and their corresponding baselines operate on the same amount of raw data. For example, CPB with three subintervals and the baseline with a 3 s window both process the same signal segment. The difference lies in the internal processing: the ensemble divides the window into three subintervals, extracts features from each, produces multiple predictions, and combines them by majority voting, whereas the baseline extracts a single feature vector and produces one prediction.

The consistent improvement in accuracy across all configurations indicates that the gain is due to the within-window ensemble mechanism, rather than additional data or changes in the classifier.
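The contrast between the two processing paths on the same signal segment can be illustrated with a toy example. Everything here is a stand-in: `extract_features` uses only a mean and standard deviation rather than the paper's feature vector, `classify` is a hypothetical threshold rule instead of a trained model, and the 4 samples per second are an arbitrary choice.

```python
from collections import Counter
from statistics import mean, pstdev

def extract_features(samples):
    # Hypothetical stand-in for the per-segment feature vector.
    return (mean(samples), pstdev(samples))

def baseline_predict(window, classify):
    # Baseline: one feature vector and one prediction for the whole window.
    return classify(extract_features(window))

def ensemble_predict(window, n_sub, classify):
    # Ensemble: one prediction per subinterval, then a majority vote.
    # Tie handling is omitted; an odd n_sub avoids ties in the binary case.
    size = len(window) // n_sub
    votes = [classify(extract_features(window[i * size:(i + 1) * size]))
             for i in range(n_sub)]
    return Counter(votes).most_common(1)[0][0], votes

classify = lambda feats: 1 if feats[0] > 0 else -1    # toy threshold classifier

# A 3 s window (assumed 4 samples per second) whose last second is corrupted.
window = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, -3.0, -2.5, -3.5, -3.0]
base_label = baseline_predict(window, classify)
ens_label, ens_votes = ensemble_predict(window, 3, classify)
```

In this constructed case the corrupted last second drags the whole-window mean below zero, so the baseline predicts -1, while two of the three subinterval votes remain +1. The example is contrived to show the mechanism and says nothing about how often such cases occur in the dataset.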

Second, the SPB model generally achieves the best or near-best performance among the three ensembles. It consistently outperforms the CPB model, and it achieves comparable accuracy to the OPB model, which uses an even number of subintervals and an odd total number of votes.

3.3.2. Quantitative Comparison with the Baseline

Table 3 and Table 4 provide a quantitative comparison between the proposed ensembles and the baseline model for different numbers of subintervals and classes. In these tables, “Average Acc.” denotes the mean accuracy obtained by the corresponding ensemble configuration, “Baseline Model” reports the accuracy of the baseline classifier using the same number of classes C and an equivalent window length, and “Uplift” is the relative improvement in accuracy over the baseline.

The results in Table 3 show that both CPB and SPB clearly outperform the baseline across all configurations. For example, in the three-class setting with five subintervals (Subinterval = 5, C = 3), the SPB model achieves an average accuracy of 0.8852, corresponding to a 42.98% uplift over the baseline accuracy of 0.6191, while the CPB model achieves a 37.02% uplift under the same conditions. In the two-class setting (C = 2), increasing the number of subintervals from 3 to 5 improves the uplift from 9.39% to 11.24% for SPB and from 6.44% to 8.10% for CPB.
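The "Uplift" column is a relative improvement, which can be checked against the figures quoted above. The helper name `relative_uplift` is an assumption for the example.

```python
def relative_uplift(accuracy, baseline):
    """Relative accuracy improvement over the baseline, in percent."""
    return 100.0 * (accuracy - baseline) / baseline

# SPB, five subintervals, C = 3: accuracy 0.8852 against a baseline of 0.6191.
spb_uplift = relative_uplift(0.8852, 0.6191)
```

Rounded to two decimals, this reproduces the 42.98% reported for that configuration.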

Table 4 presents the corresponding results for the OPB model. Here, using 2 or 4 subintervals yields uplifts between 8.99% and 43.91% over the baseline, depending on the number of classes. In the three-class case with four subintervals, the OPB model achieves an uplift of 43.91%, which is slightly higher than the best SPB configuration under the same class setting (42.98%). In the two-class case, OPB attains its highest uplift (12.87%) when using four subintervals.

Overall, these comparisons confirm that the proposed subinterval-based ensembles provide substantial and consistent improvements over the baseline model. Among them, SPB stands out as the most robust variant: it outperforms CPB in every configuration of Table 3 and achieves accuracy that is comparable to, or slightly lower than, the best OPB configurations, while relying on a simpler and more intuitive decision rule.

3.3.3. Comparison with Previous Studies

The proposed ensembles were compared with the best-performing models reported in previous studies on the same smartwatch dataset, namely Quiroz et al. [4] and Nur et al. [11]. Both works investigated emotion recognition from wearable sensors for binary (happy vs. sad) and multiclass (happy, sad, neutral) settings. In the tables below, log_reg denotes logistic regression, rf denotes a Random Forest, and rf_ht denotes a Random Forest with hyperparameter tuning as described in [11].

Specifically, baseline accuracies for log_reg and rf are taken from Quiroz et al. [4], whereas baselines for rf_ht are taken from Nur et al. [11]. All reported accuracies (both proposed and baseline) are mean values computed across subjects. Baseline values are extracted from the original tables and were also reproduced by re-running the corresponding configurations on the same dataset. The proposed CPB, SPB, and OPB ensembles use the same feature representations and base classifiers, differing only in subinterval decomposition and window-level voting.

Comparative Analysis of Emotion Classification: Happy and Sad

Table 5 presents the results for Condition 1 (watch the movie, then walk) with two classes (happy and sad). Across all classifier types and numbers of subintervals, the proposed ensembles consistently improve over the previous studies, with accuracy uplifts between about 7% and 18%. The largest relative gain in this condition is obtained by the SPB model with log_reg and five subintervals, which achieves an improvement of 18.47% over the corresponding baseline. SPB generally provides the highest improvements, while CPB and OPB also yield substantial gains, especially at their larger subinterval counts (five for CPB, four for OPB).

Table 6 reports the results for Condition 2 (listen to music, then walk) with two classes. Again, all ensembles outperform the previous models, with uplifts ranging from about 8% to 26%. The highest improvement in this condition is observed for the SPB model with log_reg and five subintervals, yielding a 26.05% increase in accuracy. For both CPB and OPB, the best configurations also occur at their largest subinterval count (five for CPB, four for OPB), confirming the benefit of a finer subinterval decomposition combined with window-level voting.

Table 7 shows the results for Condition 3 (listen to music while walking) with two classes. In this scenario, the overall improvements are more moderate but remain clearly positive, typically between about 4.7% and 13.9%. The largest uplift is obtained by the CPB model with log_reg and five subintervals (13.87%), closely followed by the SPB model with log_reg at five subintervals (13.17%) and the OPB model with log_reg at four subintervals (12.02%). These results indicate that all three ensembles are beneficial in this more challenging condition, with SPB and OPB often reaching accuracy levels close to the best CPB configuration.

Comparative Analysis of Emotion Classification: Happy, Sad and Neutral

Table 8 reports the results for Condition 1 (watch the movie, then walk) with three classes (happy, sad, neutral). All configurations of CPB, SPB, and OPB improve substantially over the previous models, with relative gains between about 11% and 33%. The highest uplift in this condition is achieved by the SPB model with log_reg and five subintervals (33.17%), closely followed by the OPB model with log_reg and four subintervals (32.54%) and the SPB model with rf and five subintervals (31.06%). In this scenario, SPB and OPB provide the largest improvements, while CPB also yields consistent but slightly lower gains.

Table 9 summarizes the results for Condition 2 (listen to music, then walk) with three classes. Here, all ensembles again produce large gains over the previous studies, with accuracy improvements between about 19% and 37%. The OPB model with log_reg and four subintervals achieves the largest uplift of 37.10%, while the SPB model with log_reg and five subintervals attains the second-largest gain (34.26%). These results indicate that in this condition, OPB is slightly stronger than SPB, but both models significantly outperform the earlier approaches.

Finally, Table 10 shows the results for Condition 3 (listen to music while walking) with three classes. In this setting, the improvements are still substantial, although smaller than in Conditions 1 and 2, with uplifts mostly between about 11% and 21%. The best configuration is obtained by the SPB model with log_reg and five subintervals, which achieves an improvement of 21.42%, followed closely by SPB with rf at five subintervals (21.22%) and OPB with rf at four subintervals (19.76%). These results confirm that SPB remains highly competitive in this more demanding scenario, with OPB and CPB also providing consistent gains.

Overall, when performance is measured in terms of relative accuracy improvement over the best configurations from previous studies, all three proposed ensembles (CPB, SPB, and OPB) provide consistent and often substantial gains across all conditions and class configurations. In the binary case, the improvements range from 4.68% to 26.05%, with SPB typically achieving the highest gains. In the three-class case, the improvements range from 11.17% to 37.10%, where SPB is again the strongest on average, and OPB attains the single largest uplift in Condition 2 with three classes.

4. Discussion and Conclusions

4.1. Discussion

This study introduces three voting-based ensemble models—Consistent Precision Band (CPB), Selective Precision Band (SPB), and Optimized Precision Band (OPB)—for emotion recognition from sequential multimodal sensor data. The models operate on subinterval-level predictions and can be combined with different base classifiers, including logistic regression (log_reg), Random Forest (rf), and Random Forest with hyperparameter tuning (rf_ht).

In the binary setting, the proposed ensembles consistently outperform the strongest previously reported baselines across all experimental conditions (mo, mu, and mw), subinterval configurations, and base classifiers. The relative improvement ranges from 4.68% to 26.05%, with an average gain of 12.34%. The largest improvement is observed for SPB with five subintervals in the mu condition using log_reg, while all configurations remain above their corresponding baselines.

When extending the task to three classes, the advantage of the proposed ensembles becomes more pronounced. Improvements range from 11.17% to 37.10%, with an average gain of 21.63%. The highest improvement is achieved by OPB with four subintervals in the mu condition using log_reg, indicating that the proposed voting mechanisms remain effective under increased classification complexity.

The results also highlight the role of ensemble design and subinterval partitioning. In most configurations, SPB achieves the best or near-best performance, particularly with five subintervals, suggesting that incorporating the previous decision improves temporal consistency. OPB, in contrast, achieves the single largest improvement in the three-class setting by enforcing an odd number of votes, thereby reducing ties and stabilizing the final decision. Overall, partitioning each window into finer subintervals and aggregating predictions at the window level substantially improves performance compared with standard window-based approaches.

Deep learning architectures (e.g., LSTM or 1D CNN) were not included as baselines. The dataset used in this study follows a subject-dependent protocol, where a separate model is trained for each participant. As reported in the original dataset description [4], each participant contributes only a limited number of samples per class, resulting in relatively small training sets at the individual level. Under such small-data conditions, training deep models becomes more challenging and increases the risk of overfitting, particularly for high-capacity architectures [35,36]. Moreover, deep learning models typically require greater computational and memory resources, as well as longer training times, than conventional machine learning methods [37]. For these reasons, lower-complexity models were considered more appropriate for the present subject-dependent setting.

In addition, introducing a fundamentally different class of models would confound the evaluation by mixing the effects of the classifier and the ensemble mechanism. This would make it difficult to attribute performance changes specifically to the proposed method. Since the ensembles are model-agnostic, integrating them with deep learning models remains a direction for future work when larger datasets become available.

The study uses the full multimodal feature set (accelerometer, gyroscope, and heart rate) as established in prior work on the same dataset [4], where this combination yielded the highest classification accuracy. Importantly, the proposed ensembles do not perform explicit multimodal fusion at either the feature or model level. Instead, multimodal information is integrated through the shared feature representation used by the base classifier, while the ensemble layer operates at the decision level by aggregating discrete predictions through majority voting. This separation preserves the generality of the framework, allowing it to be applied independently of the sensor configuration.

The findings of this study should be interpreted within a clearly defined experimental scope: the evaluation is subject-dependent, the test sequence is constructed using a class-blocked probabilistic ordering rather than an unstructured natural stream, and the temporal persistence parameter $p = 0.9$ functions as a fixed experimental control rather than an empirically validated assumption. The following discussion outlines the main limitations of the present study and their implications for generalizability, robustness, and real-world deployment.

First, the evaluation follows a subject-dependent protocol in which a separate model is trained and tested for each participant. This design is consistent with prior work [4,11] and enables direct comparison with existing results.

However, it does not assess generalization across users. To quantify this gap, we conducted a supplementary leave-one-subject-out (LOSO) evaluation using logistic regression as the base classifier under the binary setting (happy vs. sad). As shown in Table 11, the reproduced baseline accuracies (50.2–51.1%) are consistent with those reported by Quiroz et al. [4] (approximately 51–52%), confirming that subject-independent emotion recognition remains near chance level with the current feature set. Nevertheless, applying the SPB ensemble on top of the same logistic regression classifier still yields a consistent improvement across all three conditions (2.5–4.4% relative uplift), indicating that the proposed voting mechanism provides a measurable benefit even under the most challenging cross-user generalization setting. These results suggest that meaningful subject-independent recognition will require fundamentally different strategies such as domain adaptation or user-invariant feature learning, which remain directions for future work.
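A leave-one-subject-out split of the kind used in this supplementary evaluation can be sketched in a few lines. The generator name `loso_splits` and the subject identifiers are hypothetical; the sketch covers only the index bookkeeping, not feature extraction or model training.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# One entry per sample, naming the participant it came from.
subject_ids = ["s1", "s1", "s2", "s3", "s3"]
splits = list(loso_splits(subject_ids))
```

Each fold trains on all other participants and tests on the held-out one, so no data from the test subject reaches the classifier.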

Second, robustness to noisy or corrupted sensor data has not been explicitly evaluated. In practical scenarios, wearable sensors may produce artifacts due to motion interference, loose contact, or signal dropout. The subinterval-based design may provide a degree of robustness to localized noise, as each subinterval contributes a single vote and the final decision is determined by majority voting. However, controlled experiments with noise injection (e.g., additive noise or subinterval dropout) are required to quantify this effect and are deferred to future work.

Third, the dataset was collected under semi-controlled conditions, which may not fully reflect real-world variability. To partially address this limitation, the evaluation protocol incorporates a temporal persistence parameter p (Section 3.2) that simulates realistic sequential dynamics. For $p = 0.9$, the expected duration of remaining in the same emotional state is approximately 30–50 s (depending on the window configuration; see Equation (5)), providing a more realistic alternative to i.i.d. evaluation. Nevertheless, real-world environments involve greater variability in behavior and context.

The proposed ensembles have minimal computational overhead—requiring only majority voting over a small number of predictions. While this design is intentionally lightweight, no quantitative evaluation of inference latency, memory usage, or energy consumption on actual wearable hardware is included in this study; such benchmarking remains necessary before deployment conclusions can be drawn.

While conceptual differences between the proposed ensembles and structured probabilistic sequential models (e.g., HMM, CRF, Kalman filtering) are discussed in Table 1, no direct experimental benchmarking against these methods is conducted in the present study; accordingly, no claims of comparative superiority over such models are made. Validation under naturalistic, free-living conditions remains necessary.

Moreover, in longitudinal deployment scenarios, model drift may arise as a user’s baseline movement patterns and emotional expression characteristics evolve over time, potentially degrading classification performance [38]. Due to their model-agnostic design, the proposed ensembles can be readily integrated with drift-detection and adaptive model-update mechanisms without requiring architectural changes. Nevertheless, the impact of drift on this task has not been explicitly evaluated and is left for future investigation.

Fourth, the emotional taxonomy used in this study is limited to three categories (happy, sad, and neutral). This restriction is inherited from the benchmark dataset of Quiroz et al. [4], which was designed around discrete emotion induction with these specific labels, and does not reflect an architectural limitation of the proposed method. The mathematical formulation of CPB, SPB, and OPB presented in [23] is derived for an arbitrary number of classes C, and the simulation results reported therein provide evidence of effectiveness across different class counts. The present dataset was selected because it provides detailed per-condition and per-classifier results using lightweight algorithms, enabling the controlled comparison that is central to this study; benchmark datasets that simultaneously satisfy these criteria remain limited in the wearable emotion recognition literature.

As the number of classes increases, the probability of ties in majority voting also increases, since votes are distributed across a larger label space and a strict majority becomes less likely. The proposed models are designed to mitigate this effect. CPB and SPB use an odd number of subintervals, while OPB enforces an odd total number of votes (even subintervals plus one previous-decision vote), which guarantees a unique majority in the binary case and reduces the likelihood of ties in multi-class settings. When ties do occur in CPB and OPB, they are resolved by randomly selecting one of the tied classes; in SPB, the previous window decision $\hat{y}(t-1)$ is used instead.
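The claim that an odd vote count rules out ties for two classes but not for three can be verified by brute-force enumeration. The helper `count_ties` is a hypothetical name, and treating every vote pattern as equally likely is purely a counting device, not a model of how a real classifier's votes are distributed.

```python
from collections import Counter
from itertools import product

def count_ties(n_votes, n_classes):
    """Count vote patterns whose top count is shared by two or more classes."""
    ties = 0
    for votes in product(range(n_classes), repeat=n_votes):
        counts = Counter(votes).values()
        top = max(counts)
        if sum(1 for c in counts if c == top) > 1:
            ties += 1
    return ties, n_classes ** n_votes

binary_odd = count_ties(5, 2)     # five votes, two classes
binary_even = count_ties(4, 2)    # four votes, two classes
three_class = count_ties(5, 3)    # five votes, three classes
```

Five binary votes never tie, four binary votes tie in 6 of 16 patterns (the 2–2 splits), and five three-class votes tie in 90 of 243 patterns (the 2–2–1 splits).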

A finer-grained taxonomy (for example, the six basic emotions or a continuous valence–arousal representation) would enable a more comprehensive evaluation and may require more advanced tie-resolution strategies, such as weighted voting or confidence-based selection. This is identified as an important direction for future work.

Fifth, the feature engineering pipeline relies on 28 statistical and signal descriptors computed from each subinterval. No feature importance analysis or ablation study was conducted to determine the relative contribution of individual features to the observed performance gains. Such analysis remains an important direction for future work and would help clarify which signal characteristics are most informative for the proposed ensemble mechanism.

4.2. Conclusions and Future Work

Within the evaluated experimental setting, this paper presents three voting-based ensemble models—Consistent Precision Band (CPB), Selective Precision Band (SPB), and Optimized Precision Band (OPB)—for emotion classification from multimodal smartwatch time-series data. Each model partitions analysis windows into subintervals, generates one prediction per subinterval using a standard base classifier, and aggregates these predictions via majority voting, with SPB and OPB incorporating the previous window decision to promote temporal consistency.

Evaluation on a public dataset across three experimental conditions, two class settings, and three base classifiers showed consistent improvements over the evaluated baselines: 4.68–26.05% (mean 12.34%) for binary classification and 11.17–37.10% (mean 21.63%) for three-class classification. SPB achieved the best overall performance in most configurations, while OPB attained the largest gain in the three-class case.

Future work will address the identified limitations within the defined experimental scope along two main directions. On the evaluation side, this includes subject-independent and cross-session protocols, robustness testing under sensor noise and signal dropout, validation with naturalistic free-living data, and exploration of different temporal persistence settings. On the modeling side, future work includes broader emotional taxonomies (including continuous valence–arousal representations) with evaluation of tie-resolution strategies for larger label spaces, exploration of integration with Automated Machine Learning (AutoML) frameworks [39] for systematic base-classifier and hyperparameter selection, integration with transfer-learning or domain-adaptation methods, per-modality ablation studies, alternative temporal priors (e.g., state-space models or hidden Markov models), feature-selection or representation-learning approaches to further improve performance and efficiency, and bias–variance decomposition of the subinterval voting mechanism to assess whether the observed gains reflect variance reduction or a different form of decision diversity.

Author Contributions

All authors contributed to the study conception and design. Material preparation, data analysis and model implementation were performed by M.H.A.-K.M. Supervision, critical revision of the manuscript and methodological guidance were provided by S.N.S., A.S., and Y.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed during the current study are publicly available from the original authors of [4]. Processed data and scripts are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CPB	Consistent Precision Band
SPB	Selective Precision Band
OPB	Optimized Precision Band
EEG	Electroencephalography
ECG	Electrocardiography
EDA	Electrodermal Activity
RMS	Root Mean Square
RMSSD	Root Mean Square of Successive Differences
PANAS	Positive and Negative Affect Schedule

References

Mauss, I.B.; Robinson, M.D. Measures of emotion: A review. Cogn. Emot. 2009, 23, 209–237. [Google Scholar] [CrossRef]
Liu, H.; Zhang, Y.; Li, Y.; Kong, X. Review on emotion recognition based on electroencephalography. Front. Comput. Neurosci. 2021, 15, 758212. [Google Scholar] [CrossRef]
Hasnul, M.A.; Aziz, N.A.A.; Alelyani, S.; Mohana, M.; Aziz, A.A. Electrocardiogram-based emotion recognition systems and their applications in healthcare—A review. Sensors 2021, 21, 5015. [Google Scholar] [CrossRef]
Quiroz, J.C.; Geangu, E.; Yong, M.H. Emotion recognition using smart watch sensor data: Mixed-design study. JMIR Ment. Health 2018, 5, e10153. [Google Scholar] [CrossRef]
Shu, L.; Yu, Y.; Chen, W.; Hua, H.; Li, Q.; Jin, J.; Xu, X. Wearable emotion recognition using heart rate data from a smart bracelet. Sensors 2020, 20, 718. [Google Scholar] [CrossRef] [PubMed]
Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2023, 91, 424–444. [Google Scholar] [CrossRef]
Al-Hadithy, S.S.; Abdalkafor, A.S.; Al-Khateeb, B. Emotion recognition in EEG Signals: Deep and machine learning approaches, challenges, and future directions. Comput. Biol. Med. 2025, 196, 110713. [Google Scholar] [CrossRef]
Tizzano, G.R.; Spezialetti, M.; Rossi, S. A deep learning approach for mood recognition from wearable data. In Proceedings of the 2020 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Bari, Italy, 1 June–1 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–5. [Google Scholar]
Kong, C.; Bao, P.; Yu, Y.; Li, H.; Zheng, Z.; Wang, S.; Kot, A. MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection. IEEE Trans. Dependable Secur. Comput. 2025, 23, 82–96. [Google Scholar] [CrossRef]
Alaeddini, M. Emotion detection in Reddit: Comparative study of machine learning and deep learning techniques. arXiv 2024, arXiv:2411.10328. [Google Scholar] [CrossRef]
Nur, Z.K.; Wijaya, R.; Wulandari, G.S. Optimizing Emotion Recognition with Wearable Sensor Data: Unveiling Patterns in Body Movements and Heart Rate through Random Forest Hyperparameter Tuning. arXiv 2024, arXiv:2408.03958. [Google Scholar] [CrossRef]
Kuppens, P.; Verduyn, P. Emotion dynamics. Curr. Opin. Psychol. 2017, 17, 22–26. [Google Scholar] [CrossRef]
Schmidt, P.; Reiss, A.; Dürichen, R.; Van Laerhoven, K. Wearable-based affect recognition—A review. Sensors 2019, 19, 4079. [Google Scholar] [CrossRef]
Saganowski, S.; Perz, B.; Polak, A.; Kazienko, P. Emotion recognition for everyday life using physiological signals from wearables: A systematic literature review. IEEE Trans. Affect. Comput. 2022, 14, 1876–1897. [Google Scholar] [CrossRef]
Pal, S.; Mukhopadhyay, S.; Suryadevara, N. Development and progress in sensors and technologies for human emotion recognition. Sensors 2021, 21, 5554. [Google Scholar] [CrossRef] [PubMed]
Egger, M.; Ley, M.; Hanke, S. Emotion recognition from physiological signal analysis: A review. Electron. Notes Theor. Comput. Sci. 2019, 343, 35–55. [Google Scholar] [CrossRef]
Dietterich, T.G. Ensemble Methods in Machine Learning. In Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000; Volume 1857, pp. 1–15. [Google Scholar] [CrossRef]
Opitz, D.; Maclin, R. Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Res. 1999, 11, 169–198. [Google Scholar] [CrossRef]
Campagner, A.; Barandas, M.; Folgado, D.; Gamboa, H.; Cabitza, F. Ensemble Predictors: Possibilistic Combination of Conformal Predictors for Multivariate Time Series Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7205–7216. [Google Scholar] [CrossRef]
Rabiner, L.R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Processing. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
Lafferty, J.; McCallum, A.; Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th International Conference on Machine Learning (ICML), Williamstown, MA, USA, 28 June–1 July 2001; pp. 282–289. [Google Scholar]
Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
Hanafy, M.A.M.; Saleh, S.N.; Shoukry, A.; Elgamal, Y. Enhancing Emotion Classification Accuracy from Time-Series Sensor Data Using Ensemble Modeling. SSRN. 2025. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5946334 (accessed on 11 April 2026).
Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. A Public Domain Dataset for Human Activity Recognition Using Smartphones. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 24–26 April 2013. [Google Scholar]
Reyes-Ortiz, J.L.; Oneto, L.; Samà, A.; Parra, X.; Anguita, D. Transition-aware human activity recognition using smartphones. Neurocomputing 2016, 171, 754–767. [Google Scholar] [CrossRef]
Yurtman, A.; Barshan, B. Novel noniterative orientation estimation for wearable motion sensor units acquiring accelerometer, gyroscope, and magnetometer measurements. IEEE Trans. Instrum. Meas. 2020, 69, 3206–3215. [Google Scholar] [CrossRef]
Bourdillon, N.; Yazdani, S.; Vesin, J.M.; Schmitt, L.; Millet, G.P. RMSSD is more sensitive to artifacts than frequency-domain parameters: Implication in athletes’ monitoring. J. Sports Sci. Med. 2022, 21, 260–266. [Google Scholar] [CrossRef]
Mottelson, A.; Hornbæk, K. An affect detection technique using mobile commodity sensors in the wild. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September 2016; Association for Computing Machinery: New York, NY, USA, 2016. [Google Scholar]
Garcia-Ceja, E.; Osmani, V.; Mayora, O. Automatic stress detection in working environments from smartphones’ accelerometer data: A first step. IEEE J. Biomed. Health Inform. 2016, 20, 1053–1060. [Google Scholar] [CrossRef] [PubMed]
Ruensuk, M.; Oh, H.; Cheon, E.; Oakley, I.; Hong, H. Detecting negative emotions during social media use on smartphones. In Proceedings of the Asian CHI Symposium 2019: Emerging HCI Research Collection, Glasgow, UK, 4–9 May 2019; Association for Computing Machinery: New York, NY, USA, 2019. [Google Scholar]
Olsen, A.F.; Torresen, J. Smartphone accelerometer data used for detecting human emotions. In Proceedings of the 2016 3rd International Conference on Systems and Informatics (ICSAI), Shanghai, China, 19–21 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 410–415. [Google Scholar]
Cui, L.; Li, S.; Zhu, T. Emotion detection from natural walking. In Human Centered Computing; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; pp. 23–33. [Google Scholar]
Hashmi, M.A.; Riaz, Q.; Zeeshan, M.; Shahzad, M.; Fraz, M.M. Motion reveal emotions: Identifying emotions from human walk using chest mounted smartphone. IEEE Sens. J. 2020, 20, 13511–13522. [Google Scholar] [CrossRef]
Sujigarasharma, K.; Rathi, R.; Visvanathan, P.; Kanchana, R. Emotion-based human-computer interaction. In Multidisciplinary Applications of Deep Learning-Based Artificial Emotional Intelligence; Chowdhary, C.L., Ed.; IGI Global: Hershey, PA, USA, 2023; pp. 136–150. [Google Scholar]
Brigato, L.; Iocchi, L. A Close Look at Deep Learning with Small Data. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2490–2497. [Google Scholar] [CrossRef]
Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
Nguyen, G.; Dlugolinsky, S.; Bobák, M.; Tran, V.; López García, Á.; Heredia, I.; Malík, P.; Hluchý, L. Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: A survey. Artif. Intell. Rev. 2019, 52, 77–124. [Google Scholar] [CrossRef]
Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; Zhang, G. Learning under Concept Drift: A Review. IEEE Trans. Knowl. Data Eng. 2019, 31, 2346–2363. [Google Scholar] [CrossRef]
He, X.; Zhao, K.; Chu, X. AutoML: A survey of the state-of-the-art. Knowl.-Based Syst. 2021, 212, 106622. [Google Scholar] [CrossRef]

Figure 1. Overview of the training phase. Raw time-domain signals are segmented into windows, each window is divided into subintervals, and a 107-dimensional feature vector is extracted from each subinterval. The resulting feature vectors are then used to train the base classifier A. The arrows indicate the flow of selected training feature vectors to the classifier. The vertical ellipsis indicates repetition across intermediate subintervals or windows omitted for visual simplicity.
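The segmentation described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes non-overlapping windows, equal-length subintervals, and that every subinterval inherits the label of its recording; none of these choices is confirmed by the caption. The function names are hypothetical, and `toy_features` is only a two-value placeholder, not the 107-dimensional feature set used in the paper.

```python
from statistics import mean, pstdev


def split_windows(signal, window_len):
    """Cut a 1-D sequence into consecutive, non-overlapping windows.

    Assumption: trailing samples that do not fill a whole window are dropped.
    """
    n_windows = len(signal) // window_len
    return [signal[i * window_len:(i + 1) * window_len] for i in range(n_windows)]


def split_subintervals(window, n_sub):
    """Divide one window into n_sub equal-length subintervals.

    Samples left over when len(window) is not divisible by n_sub are dropped.
    """
    sub_len = len(window) // n_sub
    if sub_len == 0:
        raise ValueError("window is shorter than the number of subintervals")
    return [window[j * sub_len:(j + 1) * sub_len] for j in range(n_sub)]


def toy_features(subinterval):
    """Placeholder features (mean, population std); NOT the paper's 107 features."""
    return [mean(subinterval), pstdev(subinterval)]


def build_training_set(signal, label, window_len, n_sub):
    """Return one (features, label) pair per subinterval.

    Assumption: each subinterval is labeled with the label of its recording.
    """
    rows = []
    for window in split_windows(signal, window_len):
        for sub in split_subintervals(window, n_sub):
            rows.append((toy_features(sub), label))
    return rows
```

For example, a 25-sample recording with `window_len=10` and `n_sub=3` yields two windows and six training rows, each built from three samples. The resulting rows could then be passed to any standard classifier as the base classifier.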

Figure 2. Architectures of the three proposed voting-based ensembles used in the inference phase. Arrows indicate the workflow between the prediction and voting stages.
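As a rough illustration of the voting stage, the sketch below shows plain majority voting over the subinterval predictions of one window. It is not the CPB, SPB, or OPB rule: those architectures differ in how votes are selected and fed back, and their details are not reproduced here. The function names and the tie-breaking rule (the label seen first wins) are assumptions made for this example.

```python
from collections import Counter


def majority_vote(sub_predictions):
    """Return the most frequent label among the subinterval predictions.

    Assumption: ties are broken in favor of the label that appears first.
    """
    if not sub_predictions:
        raise ValueError("sub_predictions must not be empty")
    counts = Counter(sub_predictions)
    top = max(counts.values())
    return next(label for label in sub_predictions if counts[label] == top)


def window_predictions(per_window_sub_predictions):
    """Apply majority voting to each window's list of subinterval predictions."""
    return [majority_vote(preds) for preds in per_window_sub_predictions]
```

With three subintervals per window, a window whose subinterval predictions are `["happy", "sad", "happy"]` is assigned `"happy"`. Using an odd number of subintervals avoids ties in the binary setting.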

Figure 3. Comparison of model accuracies under different subinterval counts and window sizes for (a) binary classification and (b) three-class classification settings. Gray boxplots represent the distributions of accuracy values, and turquoise circles show the individual plotted values.

Table 1. Comparison between traditional temporal methods and the proposed approaches (CPB, SPB, OPB). “Theoretically proven improvement” refers to analytical guarantees that the method improves classification accuracy over the base classifier under stated assumptions. Theoretical guarantees for the proposed methods are established in [23].

Criterion	Temporal Smoothing	HMM/CRF	Kalman Filter	Proposed Methods (CPB, SPB, OPB)
Creates new predictions	No	No (re-decodes existing observations)	No (re-estimates existing states)	Yes (new window prediction)
Operates on original prediction stream only	Yes	No	No	No
Source of improvement	Filters existing outputs	Learned temporal transitions	State estimation	Within-window diversity + feedback voting
Uses cross-window temporal information	Yes (local neighborhood)	Yes (full sequence)	Yes (recursive)	Partial (single-step feedback)
Requires extra training	No	Yes	Yes	No
Model-agnostic	Yes	No	No	Yes
Distributional assumptions	None	Markov/log-linear	Linear-Gaussian	No explicit assumptions
Theoretically proven improvement	No	No	No	Yes [23]
Added computational cost	Negligible	Moderate–High	Moderate	Negligible

Table 3. Comparison between CPB and SPB models with various subinterval counts and classes.

Proposed Model	Number of Subintervals	Number of Classes (C)	Average Acc.	Baseline Model Acc.	Uplift
CPB Model	3	2	0.9046	0.8499	6.44%
CPB Model	3	3	0.8073	0.5983	34.93%
CPB Model	5	2	0.9399	0.8695	8.10%
CPB Model	5	3	0.8483	0.6191	37.02%
SPB Model	3	2	0.9297	0.8499	9.39%
SPB Model	3	3	0.8270	0.5983	38.22%
SPB Model	5	2	0.9672	0.8695	11.24%
SPB Model	5	3	0.8852	0.6191	42.98%

Table 4. OPB model results with different subinterval counts and classes.

Proposed Model	Number of Subintervals	Number of Classes (C)	Average Acc.	Baseline Model Acc.	Uplift
OPB Model	2	2	0.9263	0.8499	8.99%
OPB Model	2	3	0.8229	0.5983	37.54%
OPB Model	4	2	0.9587	0.8494	12.87%
OPB Model	4	3	0.8809	0.6121	43.91%
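The uplift values in Tables 3 and 4 are consistent with a relative improvement over the baseline accuracy, that is, (proposed − baseline) / baseline expressed as a percentage. The short sketch below reproduces that arithmetic; the function name `relative_uplift` is chosen for this example and does not come from the paper.

```python
def relative_uplift(proposed_acc, baseline_acc):
    """Relative accuracy improvement over the baseline, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (proposed_acc - baseline_acc) / baseline_acc * 100.0


# CPB, 3 subintervals, 2 classes (Table 3): 0.9046 against a baseline of 0.8499
cpb_uplift = round(relative_uplift(0.9046, 0.8499), 2)  # 6.44

# OPB, 4 subintervals, 2 classes (Table 4): 0.9587 against a baseline of 0.8494
opb_uplift = round(relative_uplift(0.9587, 0.8494), 2)  # 12.87
```

Applying the same formula to the two-decimal accuracies in Tables 5 to 10 does not reproduce the listed percentages exactly (for example, 0.89 against 0.82 gives about 8.5% rather than 8.25%), presumably because the published improvements were computed from unrounded accuracies.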

Table 5. Comparison of proposed ensembles vs. previous studies (Condition 1: watch movie, then walk; two classes: happy and sad). Baseline accuracies are taken from Quiroz et al. [4] for log_reg and rf, and from Nur et al. [11] for rf_ht.

Proposed Model	Number of Subintervals	Classifier	Mean Acc. (Proposed)	Mean Acc. (Baseline from Previous Studies)	Accuracy Improvement (%)
CPB Model	3	log_reg	0.89	0.82	8.25%
CPB Model	3	rf	0.93	0.85	8.65%
CPB Model	3	rf_ht	0.94	0.87	7.42%
CPB Model	5	log_reg	0.94	0.82	14.61%
CPB Model	5	rf	0.94	0.85	10.11%
CPB Model	5	rf_ht	0.97	0.87	11.11%
SPB Model	3	rf	0.96	0.85	12.46%
SPB Model	3	rf_ht	0.99	0.87	13.29%
SPB Model	3	log_reg	0.93	0.82	13.76%
SPB Model	5	rf	0.98	0.85	14.70%
SPB Model	5	rf_ht	0.99	0.87	13.66%
SPB Model	5	log_reg	0.97	0.82	18.47%
OPB Model	2	rf	0.94	0.85	9.86%
OPB Model	2	rf_ht	0.97	0.87	10.87%
OPB Model	2	log_reg	0.88	0.82	8.18%
OPB Model	4	rf	0.97	0.85	13.74%
OPB Model	4	rf_ht	1.00	0.87	14.54%
OPB Model	4	log_reg	0.94	0.82	14.69%

Table 6. Comparison of proposed ensembles vs. previous studies (Condition 2: listen to music, then walk; two classes: happy and sad).

Proposed Model	Number of Subintervals	Classifier	Mean Acc. (Proposed)	Mean Acc. (Baseline from Previous Studies)	Accuracy Improvement (%)
CPB Model	3	log_reg	0.84	0.75	12.74%
CPB Model	3	rf	0.89	0.81	10.24%
CPB Model	3	rf_ht	0.90	0.83	8.77%
CPB Model	5	log_reg	0.92	0.75	22.60%
CPB Model	5	rf	0.91	0.81	13.32%
CPB Model	5	rf_ht	0.93	0.83	13.02%
SPB Model	3	rf	0.93	0.81	15.08%
SPB Model	3	rf_ht	0.95	0.83	14.69%
SPB Model	3	log_reg	0.84	0.75	12.43%
SPB Model	5	rf	0.96	0.81	18.91%
SPB Model	5	rf_ht	0.96	0.83	16.63%
SPB Model	5	log_reg	0.94	0.75	26.05%
OPB Model	2	rf	0.90	0.81	12.01%
OPB Model	2	rf_ht	0.93	0.83	13.00%
OPB Model	2	log_reg	0.86	0.75	14.93%
OPB Model	4	rf	0.93	0.81	15.83%
OPB Model	4	rf_ht	0.96	0.83	15.67%
OPB Model	4	log_reg	0.91	0.75	21.13%

Table 7. Comparison of proposed ensembles vs. previous studies (Condition 3: listen to music while walking; two classes: happy and sad).

Proposed Model	Number of Subintervals	Classifier	Mean Acc. (Proposed)	Mean Acc. (Baseline from Previous Studies)	Accuracy Improvement (%)
CPB Model	3	log_reg	0.93	0.85	9.26%
CPB Model	3	rf	0.95	0.89	7.09%
CPB Model	3	rf_ht	0.98	0.90	8.42%
CPB Model	5	log_reg	0.97	0.85	13.87%
CPB Model	5	rf	0.96	0.89	8.24%
CPB Model	5	rf_ht	0.99	0.90	9.50%
SPB Model	3	rf	0.97	0.89	8.87%
SPB Model	3	rf_ht	0.99	0.90	9.40%
SPB Model	3	log_reg	0.94	0.85	11.25%
SPB Model	5	rf	0.99	0.89	11.15%
SPB Model	5	rf_ht	1.00	0.90	10.96%
SPB Model	5	log_reg	0.96	0.85	13.17%
OPB Model	2	rf	0.94	0.89	5.09%
OPB Model	2	rf_ht	0.94	0.90	4.68%
OPB Model	2	log_reg	0.93	0.85	9.44%
OPB Model	4	rf	0.97	0.89	9.03%
OPB Model	4	rf_ht	0.99	0.90	9.65%
OPB Model	4	log_reg	0.95	0.85	12.02%

Table 8. Comparison of proposed ensembles vs. previous studies (Condition 1: watch movie, then walk; three classes: happy, sad, neutral).

Proposed Model	Number of Subintervals	Classifier	Mean Acc. (Proposed)	Mean Acc. (Baseline from Previous Studies)	Accuracy Improvement (%)
CPB Model	3	log_reg	0.77	0.64	20.96%
CPB Model	3	rf	0.84	0.72	15.77%
CPB Model	3	rf_ht	0.85	0.76	11.84%
CPB Model	5	log_reg	0.82	0.64	28.96%
CPB Model	5	rf	0.89	0.72	22.46%
CPB Model	5	rf_ht	0.93	0.76	21.92%
SPB Model	3	rf	0.81	0.72	12.19%
SPB Model	3	rf_ht	0.87	0.76	14.02%
SPB Model	5	rf	0.95	0.72	31.06%
SPB Model	5	rf_ht	0.98	0.76	28.49%
SPB Model	5	log_reg	0.85	0.64	33.17%
OPB Model	2	rf	0.85	0.72	17.51%
OPB Model	2	rf_ht	0.86	0.76	12.84%
OPB Model	2	log_reg	0.77	0.64	20.72%
OPB Model	4	rf	0.91	0.72	25.50%
OPB Model	4	rf_ht	0.96	0.76	25.86%
OPB Model	4	log_reg	0.84	0.64	32.54%

Table 9. Comparison of proposed ensembles vs. previous studies (Condition 2: listen to music, then walk; three classes: happy, sad, neutral).

Proposed Model	Number of Subintervals	Classifier	Mean Acc. (Proposed)	Mean Acc. (Baseline from Previous Studies)	Accuracy Improvement (%)
CPB Model	3	log_reg	0.73	0.59	22.21%
CPB Model	3	rf	0.82	0.69	19.25%
CPB Model	3	rf_ht	0.86	0.72	19.12%
CPB Model	5	log_reg	0.74	0.59	24.70%
CPB Model	5	rf	0.87	0.69	27.17%
CPB Model	5	rf_ht	0.92	0.72	28.23%
SPB Model	3	rf	0.85	0.69	24.77%
SPB Model	3	rf_ht	0.89	0.72	23.40%
SPB Model	3	log_reg	0.74	0.59	25.36%
SPB Model	5	rf	0.91	0.69	32.23%
SPB Model	5	rf_ht	0.94	0.72	30.36%
SPB Model	5	log_reg	0.80	0.59	34.26%
OPB Model	2	rf	0.82	0.69	19.69%
OPB Model	2	rf_ht	0.87	0.72	20.67%
OPB Model	2	log_reg	0.72	0.59	20.91%
OPB Model	4	rf	0.87	0.69	27.63%
OPB Model	4	rf_ht	0.91	0.72	25.61%
OPB Model	4	log_reg	0.81	0.59	37.10%

Table 10. Comparison of proposed ensembles vs. previous studies (Condition 3: listen to music while walking; three classes: happy, sad, neutral).

Proposed Model	Number of Subintervals	Classifier	Mean Acc. (Proposed)	Mean Acc. (Baseline from Previous Studies)	Accuracy Improvement (%)
CPB Model	3	log_reg	0.82	0.71	15.67%
CPB Model	3	rf	0.88	0.78	11.89%
CPB Model	3	rf_ht	0.90	0.81	11.17%
CPB Model	5	log_reg	0.85	0.71	19.59%
CPB Model	5	rf	0.92	0.78	17.97%
CPB Model	5	rf_ht	0.93	0.81	15.53%
SPB Model	3	rf	0.93	0.78	18.68%
SPB Model	3	rf_ht	0.98	0.81	20.74%
SPB Model	3	log_reg	0.81	0.71	14.13%
SPB Model	5	rf	0.95	0.78	21.22%
SPB Model	5	rf_ht	0.97	0.81	19.40%
SPB Model	5	log_reg	0.86	0.71	21.42%
OPB Model	2	rf	0.90	0.78	14.62%
OPB Model	2	rf_ht	0.94	0.81	15.75%
OPB Model	2	log_reg	0.79	0.71	11.68%
OPB Model	4	rf	0.94	0.78	19.76%
OPB Model	4	rf_ht	0.97	0.81	19.47%
OPB Model	4	log_reg	0.85	0.71	19.36%

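Table 11. Leave-one-subject-out (LOSO) evaluation under the binary setting (happy vs. sad) using logistic regression as the base classifier. Baseline accuracies are consistent with those reported by Quiroz et al. [4]. Condition codes: mo = watch movie, then walk; mu = listen to music, then walk; mw = listen to music while walking.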

Condition	Logistic Regression (LOSO)	SPB (LOSO)	Relative Uplift
mo	0.502	0.524	4.38%
mu	0.511	0.525	2.69%
mw	0.490	0.502	2.46%
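The LOSO protocol in Table 11 holds out all data from one subject at a time and trains on the remaining subjects. The sketch below shows one way to generate such splits and average the per-subject accuracies. The function names and the `fit_predict` callback are hypothetical, and averaging per subject (rather than pooling all test predictions) is an assumption, not a detail confirmed by the table.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    subjects = []
    for s in subject_ids:
        if s not in subjects:
            subjects.append(s)  # keep first-seen order
    for held_out in subjects:
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


def loso_mean_accuracy(subject_ids, labels, fit_predict):
    """Average the accuracy obtained on each held-out subject.

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains a
    model on the training indices and returns one label per test index.
    """
    accuracies = []
    for _, train_idx, test_idx in loso_splits(subject_ids):
        predicted = fit_predict(train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Because no sample from the held-out subject is seen during training, this protocol measures cross-subject generalization, in contrast to the user-specific setting of Tables 5 to 10.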


