Top-Confidence Gapped Cross-Entropy for Compact Human Activity Recognition

Alzhrani, Khudran M.

doi:10.3390/app16073394

Open AccessArticle

Top-Confidence Gapped Cross-Entropy for Compact Human Activity Recognition

by

Khudran M. Alzhrani

Computers Department, College of Engineering and Computing in Al-Qunfudhah, Umm Al-Qura University, Makkah 21955, Saudi Arabia

Appl. Sci. 2026, 16(7), 3394; https://doi.org/10.3390/app16073394

Submission received: 16 March 2026 / Revised: 21 March 2026 / Accepted: 24 March 2026 / Published: 31 March 2026

Download

Browse Figures

Versions Notes

Abstract

Human Activity Recognition (HAR) in resource-constrained settings has been studied mainly through architecture design, compression, and deployment, while the role of the training objective has received less attention. This paper introduces Top-Confidence Gapped Cross-Entropy (TCG-CE), a lightweight modification of categorical cross-entropy in which each sample is weighted by the gap between the two most probable predicted classes. TCG-CE adds no trainable parameters and can be used as a drop-in replacement for standard cross-entropy. The method is evaluated on the UCI-HAR and WISDM benchmarks using compact recurrent models, namely TinyRNN, TinyGRU, and TinyLSTM. The evaluation focuses on macro-averaged predictive performance and also reports empirical runtime and memory observations under a fixed execution environment. Across datasets and models, TCG-CE improves balanced classification metrics, with the clearest gains observed on WISDM and in more capacity-limited settings. These results indicate that top-1/top-2 confidence-gap modulation is a practical loss-design strategy for improving macro-level predictive performance in compact HAR classification.

Keywords:

Human Activity Recognition; loss function design; compact models; recurrent neural networks; confidence-gap modulation; Macro-F1; balanced classification

1. Introduction

Human Activity Recognition (HAR) with wearable and mobile sensors is a core task in intelligent sensing systems. It supports applications such as healthcare monitoring, rehabilitation, smart environments, and personal analytics. The widespread use of inertial sensors in smartphones and wearables has made continuous motion sensing widely available. This has sustained strong interest in sensor-based HAR. Recent studies have shown that deep learning can outperform traditional feature-engineered pipelines by learning task-relevant representations directly from sensor data or structured signal windows [1,2,3].

Many HAR systems must operate under constraints related to model size, memory, and computation [1,4]. For this reason, a large part of the literature has focused on architecture design, lightweight models, feature selection, and deployment-oriented optimization. These directions are important. However, architecture is not the only factor that shapes model behavior. The training objective also affects gradient allocation, convergence, class discrimination, and the treatment of easy and ambiguous samples. In compact HAR settings, this makes loss-function design a relevant but less explored direction.

This question is important because HAR often involves ambiguous class boundaries. Some activities have similar motion patterns. Transition segments can also be difficult to classify. These challenges become more important when model capacity is limited. In such settings, treating all training samples in the same way may not be ideal. A training objective that reduces the influence of clearly separated predictions and preserves greater emphasis on ambiguous cases may improve balanced classification behavior.

Related work in machine learning has shown that modified training objectives can improve robustness, discrimination, or optimization behavior without requiring structural changes to the model itself [5,6]. This motivates a simple question for compact HAR: can a lightweight, prediction-derived modification of cross-entropy improve macro-level classification performance under a fixed compact model family?

This paper studies that question through loss design. We propose Top-Confidence Gapped Cross-Entropy (TCG-CE), a modification of categorical cross-entropy in which the per-sample loss is scaled using the gap between the two most probable predicted classes. This top-1/top-2 gap is used as a simple signal of local prediction ambiguity. The proposed loss introduces no additional trainable parameters and can be used as a drop-in replacement for standard cross-entropy in the evaluated training pipeline.

To evaluate the proposed loss under a controlled and practically relevant HAR setting, this study uses a compact recurrent model family composed of TinyRNN, TinyGRU, and TinyLSTM. This choice is motivated by three considerations: First, recurrent architectures remain a well-established backbone family in sensor-based HAR. Second, compact recurrent variants can be implemented with small parameter budgets, which is suitable for lightweight HAR settings. Third, the progression from simple RNN to GRU and LSTM provides a structured increase in recurrent capacity. This makes the model family suitable for controlled comparison of training objectives. The purpose is not to introduce architectural novelty, but to examine the behavior of the proposed loss under a compact and capacity-structured HAR backbone family.

The proposed loss is evaluated on two HAR benchmarks, WISDM and UCI-HAR. These datasets provide different input representations and therefore broaden the evaluation setting. The analysis focuses on macro-averaged predictive performance and also reports empirical runtime and memory measurements as within-study observations under a fixed execution environment.

The main contributions of this work are as follows: First, we introduce TCG-CE, a lightweight modification of categorical cross-entropy based on the top-1/top-2 confidence gap. Second, we evaluate the proposed loss under a compact recurrent HAR model family that enables controlled comparison across different recurrent capacities. Third, we examine the method on two benchmark HAR datasets with different representations. Fourth, we provide a comparative analysis using macro-averaged predictive metrics, effect sizes, statistical significance tests, and win rates. We also report empirical runtime and memory measurements under a fixed execution environment.

Across the evaluated settings, TCG-CE improves macro-level predictive performance, with the clearest gains appearing on WISDM and in more capacity-limited configurations. These results support the view that loss-function design can serve as a practical complementary lever for improving balanced classification behavior in compact HAR.

The remainder of this paper is organized as follows: Section 2 reviews prior work on loss functions for Human Activity Recognition and compact resource-conscious HAR settings. Section 3 presents the proposed Top-Confidence Gapped Cross-Entropy (TCG-CE) loss and the compact recurrent models used for evaluation. Section 4 describes the benchmark datasets and preprocessing procedures. Section 5 reports the experimental results, including predictive performance and empirical runtime analysis. Section 6 discusses the implications and limitations of the findings. Finally, Section 7 concludes the paper.

2. Related Work

2.1. Loss Functions in Multi-Class HAR

Cross-Entropy (CE) is the dominant loss for multi-class neural classification, including Human Activity Recognition (HAR) [7,8]. When combined with a softmax output layer, it provides a probabilistic objective that is well suited to gradient-based optimization. In HAR, CE has been widely used with convolutional and recurrent architectures, including CNNs, LSTMs, and GRUs [7,9,10].

Despite its broad use, CE applies the same functional form to all training samples, which is limiting in sensor-based HAR where the difficulty of individual samples varies substantially. In such settings, the training objective affects how the model distributes emphasis across easy and ambiguous cases. CE is also sensitive to class imbalance, which is common in activity datasets that contain underrepresented or transitional classes [11,12]. Although evaluation metrics are central to fair assessment under imbalance [13], the training loss also plays an important role in shaping the learning dynamics.

2.2. Modified and Robust Cross-Entropy Variants

Several studies have proposed modified forms of CE to improve robustness or alter optimization behavior. Taylor Cross-Entropy uses a Taylor approximation of the logarithmic loss, and was shown to improve robustness under noisy labels [14]. PolyLoss extends this idea by expressing classification losses through weighted polynomial expansions, allowing for CE and focal loss to be viewed within a broader design family [15].

Other variants modify CE to address class imbalance. Class-Balanced Loss reweights samples using the effective number of training examples per class [11]. Generalized Dice Loss and Dilated Balanced Cross-Entropy were developed for highly imbalanced segmentation settings [12,16]. These methods were designed for different tasks, but they show a broader point: changing the loss can alter learning behavior even when the network architecture remains fixed.

Related work has also explored alternative objectives based on class geometry or probabilistic discrepancy. Large-Margin Softmax introduces angular margins to improve intra-class compactness and inter-class separation [17]. KL divergence has also been used in classification-relevant settings as a probabilistic measure of distributional discrepancy [5]. In standard multi-class neural training, KL divergence is commonly applied between the predicted class distribution and the one-hot target distribution. However, it does not explicitly modulate sample contribution according to the relative separation between competing predicted classes.

2.3. Confidence-Aware and Task-Specific Losses

A related line of research introduces confidence or difficulty awareness directly into the loss. Focal Loss down-weights easy samples by scaling CE according to the confidence assigned to the true class [18]. This makes the loss more sensitive to hard examples. However, the modulation is tied to the probability of the ground-truth class rather than to the relative gap between the strongest competing predictions.

In HAR, several task-specific losses have also been explored. Harmonic Loss applies temporally decaying weights in LSTM-based HAR [19]. Semantic Loss introduces contextual and logical constraints through a neuro-symbolic formulation [20]. Other studies incorporate Dynamic Time Warping into the loss for temporal alignment [21], use local losses for layer-wise training [22], or combine regression and classification objectives for zero-shot HAR [23].

These studies show that the training objective can affect predictive behavior in meaningful ways. At the same time, the modulation signals differ. Some methods depend on the confidence of the true class. Others use temporal, semantic, or task-specific structure. In contrast, the present study uses the gap between the two most probable predicted classes as a simple prediction-derived signal of local ambiguity. The goal is not to introduce auxiliary supervision or task-specific constraints, but to examine whether this top-1/top-2 separation can serve as a useful modulation signal within a standard cross-entropy framework.

2.4. Loss Design in Compact HAR Settings

Compact HAR has received sustained attention because many wearable and mobile systems operate under constraints related to memory, computation, and energy [24,25,26,27]. Most work in this area has focused on model architecture and deployment strategy. Representative directions include lightweight deep models [28,29,30], pruning and quantization [31,32,33], sensor selection [34], redundancy reduction in sliding-window pipelines [35], and computation offloading [36,37]. Other studies address continual learning under resource constraints [38] and low-power wearable design [39].

Across this literature, the loss function is usually treated as a fixed component, with standard categorical cross-entropy serving as the default choice in many compact HAR pipelines [31,32]. As a result, compact HAR has been studied far more extensively from the perspective of architecture, compression, and deployment than from the perspective of training-objective design.

This creates the main gap addressed in this paper. The present study examines whether a lightweight modification of cross-entropy can improve macro-level classification performance in compact HAR under a controlled model family. To study this question, the proposed loss is evaluated on a compact recurrent backbone family that provides a lightweight and capacity-structured testbed for comparison across training objectives.

3. Methodology

3.1. Top-Confidence Gapped Cross-Entropy (TCG-CE) Loss Function

This section defines the proposed Top-Confidence Gapped Cross-Entropy (TCG-CE) loss. TCG-CE is a modification of categorical cross-entropy in which each sample is weighted according to the separation between its two most probable predicted classes. The objective is to attenuate the loss contribution of clearly separated predictions while retaining stronger emphasis on ambiguous cases.

Let B denote the batch size and C denote the number of classes. The one-hot encoded labels are written as

Y = [y_{i c}] \in {0, 1}^{B \times C}

, and the predicted class probabilities are written as

P = [p_{i c}] \in {[0, 1]}^{B \times C}

. Each row

p_{i}

is a class-probability vector, typically produced by a softmax output layer, such that

\sum_{c = 1}^{C} p_{i c} = 1, i = 1, \dots, B .

(1)

Two fixed constants are used in the formulation. The first is a clipping constant

ϵ > 0

, used for numerical stability in the logarithm. The second is a scaling constant

γ > 0

, used in the confidence modulation term. In all experiments, these values are fixed to

ϵ = 10^{- 4}

and

γ = 0.99

. The same values are used for all datasets and model configurations. Here,

ϵ = 10^{- 4}

serves as a conservative numerical safeguard after probability clipping so that logarithms remain well-defined. The constant

γ = 0.99

controls the smoothness and strength of the confidence modulation. These values were kept fixed throughout the study so that the proposed loss could be evaluated as a single stable formulation under matched conditions, without introducing additional dataset-specific retuning.

Predicted probabilities are first clipped away from 0 and 1:

{\tilde{p}}_{i c} = clip (p_{i c}, ϵ, 1 - ϵ), \tilde{P} = [{\tilde{p}}_{i c}] \in {[ϵ, 1 - ϵ]}^{B \times C} .

(2)

This guarantees that

log ({\tilde{p}}_{i c})

is finite for all samples and classes.

For each sample i, let

j_{i}^{(1)} = arg max_{c \in {1, \dots, C}} {\tilde{p}}_{i c}

(3)

denote the index of the largest predicted probability, and let

j_{i}^{(2)} = arg max_{c \in {1, \dots, C} ∖ {j_{i}^{(1)}}} {\tilde{p}}_{i c}

(4)

denote the index of the second-largest predicted probability. The corresponding top-two probabilities are

{\tilde{p}}_{i (1)} = {\tilde{p}}_{i, j_{i}^{(1)}}, {\tilde{p}}_{i (2)} = {\tilde{p}}_{i, j_{i}^{(2)}} .

(5)

The central quantity in TCG-CE is the top-confidence gap

Δ_{i} = {\tilde{p}}_{i (1)} - {\tilde{p}}_{i (2)} .

(6)

By construction,

Δ_{i} \in [0, 1]

. Larger values indicate stronger separation between the most probable class and its nearest competitor. Smaller values indicate greater local ambiguity. The use of the top-1/top-2 confidence gap is motivated by the idea that local ambiguity in multi-class prediction is often most directly reflected in the competition between the two most probable classes. When this gap is small, the prediction is less decisive because the model assigns similar probability mass to its two strongest alternatives. When the gap is large, the model exhibits clearer local separation. Unlike entropy, which summarizes uncertainty across the full predictive distribution, the proposed signal focuses specifically on the most competitive decision boundary. Unlike a true-class margin, it is label-agnostic until combined with the cross-entropy term. In this study, the gap is therefore used as a lightweight ambiguity-oriented weighting signal rather than as a theoretically exhaustive uncertainty measure.

A design choice in TCG-CE is to detach this modulation term from the gradient flow:

{\bar{Δ}}_{i} = stopgrad (Δ_{i}) .

(7)

This means that the confidence-gap term acts only as a weighting factor on the loss contribution of each sample.

The detached gap is then mapped to a smooth confidence score:

c_{i} = σ (γ {\bar{Δ}}_{i}), σ (t) = \frac{1}{1 + e^{- t}} .

(8)

This transformation is bounded and monotonic, so

c_{i} \in (0, 1)

. Larger confidence gaps yield larger confidence scores, whereas smaller gaps yield lower confidence scores. The complementary factor

(1 - c_{i})

is then used to scale the loss. The sigmoid mapping was chosen because it provides a smooth, bounded, and monotonic transformation of the detached gap, avoiding abrupt changes in the weighting term while preserving the ordering induced by the confidence gap. The detach operation was used so that the gap functions as a sample-weighting signal rather than introducing an additional gradient path through the top-1/top-2 competition itself. In this way, the optimization remains centered on the cross-entropy objective, while the confidence signal modulates the magnitude of each sample’s contribution. The present study evaluates this fixed stabilized formulation as a complete loss design under matched experimental settings.

The per-sample TCG-CE loss is defined as

ℓ_{i} = - \sum_{c = 1}^{C} y_{i c} log ({\tilde{p}}_{i c}) (1 - c_{i}) .

(9)

Because

y_{i}

is one-hot encoded, only the ground-truth class contributes to the sum. If

t (i)

denotes the true class index for sample i, this simplifies to

ℓ_{i} = - log ({\tilde{p}}_{i, t (i)}) (1 - c_{i}) .

(10)

The mini-batch loss returned to the optimizer is the mean over all samples:

L_{TCG - CE} = \frac{1}{B} \sum_{i = 1}^{B} ℓ_{i} .

(11)

Equations (6)–(11) fully define the proposed loss. The inputs are

(Y, P)

together with the fixed constants

(ϵ, γ)

, and the output is the scalar objective

L_{TCG - CE}

used during optimization.

Operationally, the mini-batch computation proceeds in four steps: First, predicted probabilities are clipped to

[ϵ, 1 - ϵ]

. Second, the largest and second-largest predicted probabilities are identified for each sample, and their difference is computed. Third, this gap is detached from gradient flow and mapped through a sigmoid function to obtain a confidence score. Fourth, the standard cross-entropy term associated with the true class is scaled by

(1 - c_{i})

and averaged across the mini-batch.

In summary, TCG-CE modifies standard categorical cross-entropy by introducing a smooth attenuation factor derived from the top-1/top-2 separation in the predicted class distribution. Samples with smaller separation retain a larger loss contribution. Samples with larger separation are down-weighted. The proposed loss is applied during training, while the evaluated model architectures remain otherwise unchanged.

3.2. Compact Recurrent Models

This study deliberately focuses on compact recurrent backbones because the contribution is a loss-design study rather than an architecture paper. The objective is to examine the behavior of the proposed training loss under lightweight and systematically comparable models while keeping the backbone family controlled across datasets. Recurrent models remain a well-established family in sensor-based HAR, align naturally with sequential sensor windows, and provide a simple capacity ladder through TinyRNN, TinyGRU, and TinyLSTM. This makes them a suitable testbed for isolating the effect of the training objective without introducing additional architectural factors.

Accordingly, three recurrent variants are considered: TinyRNN, TinyGRU, and TinyLSTM. These models were selected because they preserve a common recurrent modeling framework while offering progressively richer gating mechanisms and representational capacity. This progression supports controlled comparison of the proposed loss across increasingly expressive compact recurrent cells while remaining consistent with lightweight HAR settings in which small parameter budgets are desirable.

All architectures share the same overall structure, shown in Figure 1. Each model contains a single recurrent layer with 16 hidden units, followed by a dense layer with 32 units, a dropout layer with rate 0.2, and a softmax output layer. The only architectural difference across the three models is the recurrent cell type itself. No convolutional layers, attention mechanisms, residual connections, or additional temporal aggregation modules are introduced. The hidden dimension, dense-layer size, dropout rate, and output configuration are kept fixed throughout, and no architecture-specific hyperparameter tuning is performed. This fixed-design strategy keeps the comparison centered on the training objective rather than on architecture engineering.

The same compact recurrent family is used for both datasets in order to preserve architectural consistency across the evaluation setting. For WISDM, this choice aligns naturally with the segmented sensor-window representation. For UCI-HAR, the data are provided in engineered-feature form rather than as raw temporal sequences. In this case, the recurrent models are used as a controlled common backbone family for loss-function comparison rather than as a claim of architectural optimality for engineered tabular features. This consistent backbone choice allows the effect of TCG-CE to be examined under matched compact models across both datasets.

4. Datasets

This study evaluates the proposed Top-Confidence Gapped Cross-Entropy (TCG-CE) loss on two widely used benchmark datasets for Human Activity Recognition (HAR): the Wireless Sensor Data Mining (WISDM) dataset and the Human Activity Recognition Using Smartphones (UCI-HAR) dataset. The two datasets differ in sensing modality, preprocessing level, and input representation. This provides a heterogeneous evaluation setting for assessing the proposed loss under different HAR data forms.

4.1. WISDM Activity Recognition Dataset

The WISDM Activity Recognition Dataset (v1.1) [40] was introduced for smartphone-based activity recognition using accelerometer data. It contains raw tri-axial accelerometer readings collected from Android smartphones carried in the front pants pocket during daily activities. Data were recorded at a sampling frequency of 20 Hz in a controlled environment.

The dataset includes recordings from 36 users performing six activities: walking, jogging, walking upstairs, walking downstairs, sitting, and standing. Each raw sample contains three acceleration channels corresponding to the x, y, and z axes. In total, the dataset contains 1,098,208 raw sensor samples. The class distribution is moderately imbalanced, with walking and jogging accounting for the largest share of samples, while sitting and standing are less frequent.

To construct learning-ready samples, the raw sensor stream is segmented into fixed-length overlapping windows. Each window contains 100 consecutive samples, corresponding to 5 s of activity data at the native sampling rate, with a 50% overlap between adjacent windows. This produces 21,963 windows, each represented as a

100 \times 3

time-series matrix. The label of each window is inherited from the corresponding activity segment.

Before segmentation, acceleration signals are standardized using z-score normalization. The same normalization is applied to all three sensor channels. The resulting windowed representation is used as input to the compact recurrent models evaluated in this study.

Table 1 summarizes the main characteristics of the WISDM dataset after window- based preprocessing.

4.2. UCI Human Activity Recognition Dataset

The UCI-HAR dataset [41] was collected from 30 volunteers aged between 19 and 48 years, each performing six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying. Data were acquired using a smartphone worn on the waist and equipped with a tri-axial accelerometer and a tri-axial gyroscope. Signals were sampled at 50 Hz and video-recorded to support accurate labeling.

Unlike WISDM, UCI-HAR is provided in a preprocessed and feature-engineered form. The released feature vectors were extracted by the dataset authors from segmented inertial signals using fixed-width sliding windows of 2.56 s with 50% overlap, corresponding to 128 readings per window. Each window is represented by 561 features derived from accelerometer and gyroscope signals in both the time and frequency domains. These features are normalized and bounded in the range

[- 1, 1]

.

The dataset includes an official subject-based train–test split. Seventy percent of the participants are used for training and 30% for testing, so samples from the same individual do not appear in both subsets. The training set contains 7352 samples and the test set contains 2947 samples, for a total of 10,299 samples. This predefined split is preserved in all experiments.

In this study, each 561-dimensional UCI-HAR feature vector is standardized and then reshaped to an input of length 561 with 1 feature per step in order to maintain compatibility with the compact recurrent model family used throughout the evaluation. This reshaping is a controlled modeling choice adopted to preserve a common backbone family across datasets and support consistent comparison of the loss functions. The original UCI-HAR representation remains an engineered feature vector rather than a native temporal sequence, and the imposed feature ordering is used only for architectural compatibility.

Table 2 summarizes the main characteristics of the UCI-HAR dataset.

Both datasets are public and widely used in the HAR literature. Their complementary characteristics—windowed sensor sequences in WISDM and engineered feature vectors in UCI-HAR—provide a useful evaluation setting for examining the proposed loss across different HAR data representations.

5. Experiments and Results

5.1. Evaluation Protocol

A unified evaluation protocol was used to compare the considered loss functions under matched preprocessing, optimization, and model settings. For each dataset, the same model family, preprocessing pipeline, training budget, and evaluation metrics were used across all loss-function configurations.

For WISDM, which does not provide an official train–test partition, the windowed dataset was divided into training and test subsets using an 80/20 stratified split in order to preserve class proportions. For UCI-HAR, the predefined subject-based split supplied with the dataset was used without modification.

All experiments were executed on a Google Colab CPU runtime (Google LLC, Mountain View, CA, USA) with GPU acceleration disabled. At the time of execution, the cloud runtime reported an Intel(R) Xeon(R) CPU @ 2.20 GHz with 12.67 GB RAM, running Linux 6.6.113+ (x86_64, glibc 2.35). The software environment used Python 3.12.13, NumPy 2.0.2, and TensorFlow 2.19.0. Random seeds were fixed across Python, NumPy, and TensorFlow, and deterministic TensorFlow operations were enabled where supported. Each model was trained for 200 epochs using the Adam optimizer with learning rate 0.001 and batch size 64. A validation split of 20% was used during training, and the checkpoint with the lowest validation loss was selected for final evaluation.

Predictive performance was evaluated primarily using macro-averaged F1-score, which is appropriate for balanced assessment across activity classes. Additional macro-level metrics, including precision, recall, area under the ROC curve (AUC), Jaccard index, accuracy, and Hamming loss, were also computed to provide a broader characterization of model behavior.

For completeness, empirical runtime and memory measurements were also recorded under the adopted execution environment. Inference time was measured as wall-clock latency during model prediction, and memory usage was recorded as process resident set size (RSS) during the inference procedure. These quantities are reported as empirical observations under the adopted execution environment and are included to complement the predictive analysis.

This protocol supports controlled comparison of the considered loss functions under fixed model settings while keeping the primary focus on predictive evaluation.

5.2. Overall Predictive Performance

Table 3 reports the predictive performance of all model–loss function combinations across the UCI-HAR and WISDM datasets. The results are presented using macro-averaged metrics to ensure balanced evaluation across activity classes. The best-performing configurations within each model–dataset block are highlighted for clarity.

On UCI-HAR, TCG-CE achieves the strongest overall results for the TinyGRU and TinyLSTM architectures. In particular, TCG-CE attains macro F1-scores of 0.804 for TinyGRU and 0.827 for TinyLSTM, together with corresponding gains in accuracy, precision, recall, and Jaccard index. These results are also accompanied by lower Hamming loss. Relative to Categorical Cross-Entropy and Taylor Cross-Entropy, the gains are directionally consistent for these two architectures, whereas KL Divergence yields weaker macro-level performance.

For TinyRNN on UCI-HAR, the differences among loss functions are smaller. KL Divergence attains slightly higher accuracy and marginally lower Hamming loss, while the macro F1-scores of TCG-CE, Taylor Cross-Entropy, and KL Divergence are effectively tied. This suggests that, for the smallest recurrent model on this dataset, the effect of the loss is more limited than for the larger recurrent variants.

On WISDM, TCG-CE again shows strong predictive behavior. For TinyGRU and TinyLSTM, it attains the highest or jointly highest accuracy and macro F1-score, reaching 0.926 and 0.925, respectively. These advantages are also reflected in precision, recall, Jaccard index, and Hamming loss, indicating balanced gains rather than improvements confined to a single metric. Although KL Divergence occasionally matches or slightly exceeds AUC values, this does not translate into stronger macro-level classification performance.

The most pronounced gain appears for TinyRNN on WISDM, where TCG-CE achieves a macro F1-score of 0.809 and outperforms all baseline losses by a visible margin. Across the two datasets, the raw results indicate that TCG-CE is generally competitive, with the clearest advantages appearing in compact and capacity-limited settings.

Table 4 summarizes pairwise mean differences, effect sizes (Cohen’s d), and confidence intervals for macro-averaged predictive metrics. These comparisons are computed between TCG-CE and the baseline losses. This analysis complements the raw results by describing both the direction and the magnitude of the observed differences.

Across the primary predictive metrics, TCG-CE shows consistently positive mean differences relative to the baselines. The largest improvements are generally observed against KL Divergence, with mean differences of 0.0237 for macro F1-score and 0.0228 for macro recall. The corresponding effect sizes are small-to-moderate, which suggests that the observed gains are systematic rather than isolated fluctuations.

For accuracy and precision, TCG-CE again shows positive mean shifts, with the largest effects occurring against KL Divergence and Categorical Cross-Entropy. Although the effect sizes remain modest, the consistency of their direction across metrics supports the view that the proposed loss tends to improve balanced classification performance across model–dataset settings.

Macro F1-score, which is the primary metric in this study, shows a mean improvement of 0.0237 over KL Divergence and 0.0154 over Taylor Cross-Entropy. These differences are consistent with the raw performance patterns and support the practical value of TCG-CE for balanced classification.

The AUC results show smaller absolute differences, with the strongest relative gain appearing against Taylor Cross-Entropy. This suggests that improvements in ranking quality are present, but less uniform than the improvements observed for macro-level classification metrics. The Jaccard index and Hamming loss follow the same general pattern as accuracy, precision, recall, and F1-score, which reinforces the internal consistency of the observed predictive trends.

Table 5 reports statistical test results for macro-averaged predictive metrics comparing TCG-CE with each baseline loss across the evaluated model–dataset configurations. Symbols indicate the strength of statistical evidence.

Consistent with the raw predictive results, the strongest statistical evidence appears in the comparisons between TCG-CE and Categorical Cross-Entropy, where several macro-level metrics reach the

p < 0.01

level. Comparisons against Taylor Cross-Entropy also reach conventional significance thresholds for most of the reported predictive metrics. By contrast, comparisons with KL Divergence tend to be marginal, indicating larger average gains but weaker consistency across configurations.

For AUC, none of the comparisons reaches a strong conventional significance level, although marginal trends in favor of TCG-CE are visible. This is consistent with the smaller and less uniform AUC differences observed in the raw and effect-size analyses.

Figure 2 provides a consolidated view of average predictive performance across loss functions, aggregated over models for each dataset. This visualization complements Table 3 by emphasizing dataset-level tendencies and cross-model stability.

Across the reported macro-averaged performance metrics, TCG-CE consistently achieves the highest or near-highest mean values on both UCI-HAR and WISDM. The advantage is especially visible for macro F1-score, precision, recall, and Jaccard index. Categorical Cross-Entropy and Taylor Cross-Entropy remain competitive but generally lower on average, whereas KL Divergence shows greater variability and weaker mean results across several metrics despite occasional strength in AUC.

Overall, the predictive results indicate that TCG-CE provides reliable macro-level improvements across the evaluated datasets and model settings, which supports its use as a competitive training objective for compact HAR classification.

5.3. Empirical Runtime and Memory Measurements

This subsection reports empirical runtime and memory measurements obtained under the Google Colab CPU runtime used in this study for the evaluated model–loss function combinations. These measurements are included as observations and are not the primary basis of the paper’s contribution.

Table 6 reports the runtime and memory measurements for all model–dataset–loss function combinations. For each architecture and dataset, the lowest inference time and memory usage are highlighted.

Under the adopted execution environment, models trained with TCG-CE often showed lower or competitive observed inference-time and RSS values relative to the baselines. On UCI-HAR, TCG-CE attains the lowest inference time and memory usage for TinyGRU and TinyLSTM, while for TinyRNN it attains the lowest memory value and a runtime close to the best observed value. On WISDM, TCG-CE again yields the lowest time and memory values for TinyGRU and TinyLSTM and remains competitive for TinyRNN.

These measurements are reported as empirical observations under the adopted setup. They are not interpreted as establishing the direct causal inference-efficiency advantage of the loss itself.

Table 7 summarizes pairwise mean differences, effect sizes, and confidence intervals for the empirical runtime and memory measurements. Negative mean differences indicate lower values for TCG-CE relative to the corresponding baseline.

For runtime, the largest average reduction appears in the comparison with Categorical Cross-Entropy. Comparisons with KL Divergence and Taylor Cross-Entropy also show lower mean runtime values for TCG-CE. For RSS memory, TCG-CE again shows lower mean values across all baseline comparisons, with the strongest reduction appearing relative to Categorical Cross-Entropy.

Table 8 reports the corresponding statistical test results for the empirical runtime and memory measurements.

The statistical results show more consistent differences for RSS than for runtime. Across the reported comparisons, RSS differences are significant for all three baselines, whereas runtime differences are smaller and more variable.

Figure 3 provides a consolidated view of the average runtime and memory measurements across loss functions and datasets. Hamming loss is also shown on an inverted axis to support joint visual comparison with metrics for which lower values are preferable.

Across both datasets, TCG-CE often shows lower or competitive observed runtime and memory values relative to Categorical Cross-Entropy and Taylor Cross-Entropy, while KL Divergence exhibits greater variability. At the same time, TCG-CE maintains lower Hamming loss on average. Within the adopted setup, the predictive improvements reported earlier are therefore not accompanied by an obvious deterioration in the observed runtime measurements.

Table 9 summarizes win rates across all model–dataset configurations for each reported metric. A win is recorded when a loss function attains the best value for a given metric under a given configuration.

Across the predictive metrics, TCG-CE has the highest win rate by a clear margin. It also attains the highest win rate for runtime and a perfect win rate for RSS among the reported configurations. These summaries further illustrate the consistency of the empirical trends visible in the raw results and pairwise comparisons.

5.4. Performance–Efficiency Trade-Off Analysis

This subsection presents predictive performance together with the empirical runtime and memory measurements in order to summarize the observed operating points under a common execution environment.

Figure 4 visualizes this relationship using slope charts that jointly plot macro F1-score against inference time (top row, logarithmic scale) and RSS memory usage (bottom row), with separate columns for UCI-HAR and WISDM. Within each panel, lines connect models ordered by increasing recurrent complexity (TinyRNN → TinyGRU → TinyLSTM), allowing for the inspection of how the predictive–runtime relationship varies with model capacity.

Across both datasets, TCG-CE tends to occupy favorable observed operating points, combining higher macro F1-scores with lower or competitive runtime and RSS values. On UCI-HAR, this contrast is especially visible relative to Categorical Cross-Entropy, which yields substantially larger runtime values without corresponding predictive gains. On WISDM, TCG-CE again combines the strongest predictive performance for TinyGRU and TinyLSTM with comparatively low measured runtime and memory values.

Figure 5 presents the relationship between macro F1-score and inference time after averaging across datasets for each model architecture. Each point represents the mean predictive–runtime operating point of a loss function for a given architecture, with error bars indicating variability across dataset–model combinations.

For TinyGRU and TinyLSTM, TCG-CE attains the highest average macro F1-score while remaining below most baselines in mean runtime. For TinyRNN, the runtime differences between TCG-CE and Taylor Cross-Entropy are relatively small, but TCG-CE still retains a visible advantage in macro F1-score. These aggregated comparisons suggest that the predictive gains of TCG-CE are not accompanied by worse observed runtime under the adopted measurement setup.

Figure 6 complements this view by plotting macro F1-score against average RSS memory usage for each loss function and model architecture.

For TinyGRU and TinyLSTM, TCG-CE combines the highest average macro F1-score with the lowest average RSS. For TinyRNN, TCG-CE again remains favorable, yielding the highest macro F1-score together with the lowest average RSS among the reported losses. These plots therefore reinforce the descriptive observation that TCG-CE occupies favorable predictive–memory operating points under the adopted execution environment.

Taken together, the experimental results show that TCG-CE provides the most consistent predictive improvements across the evaluated datasets and recurrent architectures. These gains are also accompanied by competitive observed runtime and RSS measurements. Since the evaluated losses are training objectives and do not introduce additional architectural modules into the evaluated models, these runtime and memory results should be interpreted as within-study empirical observations rather than as evidence of guaranteed loss-induced deployment savings.

6. Discussion

The results demonstrate that TCG-CE consistently improves macro-averaged predictive performance in compact HAR classification, with the most pronounced gains observed on macro F1-score, precision, and recall. Across the evaluated settings, the proposed loss most consistently improves macro F1-score, precision, recall, Jaccard index, and Hamming loss relative to the baseline objectives. These gains are especially visible on WISDM and in the more capacity-limited recurrent settings.

A central empirical pattern is that the observed gains of TCG-CE are larger on WISDM than on UCI-HAR. This difference is noteworthy because the two datasets use different input representations. WISDM preserves segmented sensor windows, whereas UCI-HAR is used here in its engineered-feature form and reshaped only to maintain compatibility with the common model family. Under these conditions, the stronger gains on WISDM suggest that the behavior of the proposed loss may depend in part on how much local variation is retained in the input representation. Further analysis of dataset-level factors could help clarify this difference more fully.

The results also indicate that the effect of TCG-CE depends on model capacity. For TinyGRU and TinyLSTM, the proposed loss yields clear improvements across several predictive metrics. For TinyRNN on UCI-HAR, the gains are comparatively modest. This pattern suggests that the loss is not a substitute for representational capacity. Instead, it acts as a complementary training mechanism whose benefit becomes more visible when the underlying model has enough flexibility to make use of more selective gradient emphasis.

From a learning perspective, TCG-CE can be interpreted as a mild ambiguity-aware attenuation of categorical cross-entropy. By reducing the contribution of samples whose most probable class is already well separated from the nearest competitor, the loss preserves relatively greater emphasis on locally ambiguous predictions. In the present experiments, this behavior is associated with stronger balanced classification performance without adding trainable parameters or architectural modules. The results therefore support the practical value of the top-1/top-2 confidence gap as a simple prediction-derived modulation signal.

The compact recurrent model family also provides a useful capacity ladder for interpreting the results. TinyRNN, TinyGRU, and TinyLSTM share the same general architecture while differing in recurrent expressiveness. This makes it possible to examine the proposed loss under matched compact backbones rather than under unrelated architectures. The observed pattern across these models suggests that the proposed loss remains relevant across multiple recurrent capacities, while showing the clearest advantages in settings where compactness and class ambiguity interact more strongly.

A limitation of this study is that UCI-HAR is not inherently sequential in the form used here. Its 561-dimensional feature vectors are engineered descriptors rather than raw temporal windows. Accordingly, the recurrent models on that dataset should be interpreted as a controlled common modeling framework for loss-function comparison rather than as a claim that sequence modeling is the most appropriate architecture for engineered feature vectors. This limitation should be kept in mind when interpreting the smaller gains observed on UCI-HAR.

The empirical runtime and RSS measurements provide an additional descriptive view of model behavior. Models trained with TCG-CE often showed competitive observed runtime and memory values relative to the baselines. However, these measurements should be interpreted carefully. Since the evaluated losses are training objectives and do not introduce additional architectural modules into the evaluated models, the observed runtime and memory differences should be treated as within-study empirical observations under the adopted protocol rather than as guaranteed loss-induced deployment savings.

The comparison with the baseline losses is also informative. Categorical Cross-Entropy remains a strong general-purpose objective, but in the present experiments, it is less competitive on the macro-averaged metrics that are most relevant to balanced activity recognition. Taylor Cross-Entropy is often competitive and in some cases close to TCG-CE, which indicates that modified cross-entropy formulations can provide meaningful benefits over the standard objective. KL Divergence shows weaker and more variable predictive behavior, which suggests that its distribution-matching emphasis is less well aligned with the balanced multi-class recognition goals examined here.

Taken together, the findings indicate that TCG-CE is a competitive loss-function alternative for compact HAR classification when the goal is to improve macro-level predictive performance within a controlled lightweight recurrent model family. The strongest evidence in this study comes from the predictive analysis. Under the adopted evaluation setting, that evidence consistently supports the usefulness of confidence-gap modulation as a lightweight design choice for balanced HAR classification while also indicating that the magnitude of the gains is not uniform across all evaluated cases.

7. Conclusions

This paper investigates loss-function design as a lightweight mechanism for improving compact Human Activity Recognition models. It introduces Top-Confidence Gapped Cross-Entropy (TCG-CE), a modification of categorical cross-entropy based on the gap between the two most probable predicted classes. Experiments on the UCI-HAR and WISDM benchmarks with a compact recurrent model family showed that TCG-CE is a competitive training objective, with the clearest gains appearing in macro-averaged predictive measures such as macro F1-score, precision, recall, Jaccard index, and Hamming loss, particularly on WISDM and in the more capacity-limited settings considered here. Effect-size analysis, statistical testing, and win-rate summaries further supported the consistency of these predictive trends. At the same time, the gains were not uniform across all evaluated cases, and some settings, such as TinyRNN on UCI-HAR, showed only modest differences between TCG-CE and the comparison losses. Models trained with TCG-CE also showed competitive observed runtime and RSS measurements, which are reported as empirical within-study observations rather than as guaranteed deployment-level gains. Overall, the results indicate that, within the evaluated compact recurrent HAR setting, TCG-CE is a practical alternative to standard categorical cross-entropy and that loss-function design can serve as a useful complementary direction for improving balanced predictive performance.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are publicly available.

Conflicts of Interest

The author declares no conflicts of interest.

References

Bian, S.; Liu, M.; Rey, V.F.; Geissler, D.; Lukowicz, P. TinierHAR: Towards Ultra-Lightweight Deep Learning Models for Efficient Human Activity Recognition on Edge Devices. In Proceedings of the 2025 ACM International Symposium on Wearable Computers; Association for Computing Machinery (ACM): New York, NY, USA, 2025; pp. 163–169. [Google Scholar]
Navakauskas, D.; Dumpis, M. Wearable sensor-based human activity recognition: Performance and interpretability of dynamic neural networks. Sensors 2025, 25, 4420. [Google Scholar] [CrossRef] [PubMed]
Sun, Y.; Xu, X.; Tian, X.; Zhou, L.; Li, Y. Efficient human activity recognition: A deep convolutional transformer-based contrastive self-supervised approach using wearable sensors. Eng. Appl. Artif. Intell. 2024, 135, 108705. [Google Scholar] [CrossRef]
Fan, C.; Gao, F. Enhanced human activity recognition using wearable sensors via a hybrid feature selection method. Sensors 2021, 21, 6434. [Google Scholar] [CrossRef] [PubMed]
Al-Shammary, D.; Radhi, M.; Alsaeedi, A.H.; Mahdi, A.M.; Ibaida, A.; Ahmed, K. Efficient ECG classification based on the probabilistic Kullback-Leibler divergence. Inform. Med. Unlocked 2024, 47, 101510. [Google Scholar] [CrossRef]
Seino, T.; Saito, N.; Ogawa, T.; Asamizu, S.; Haseyama, M. Expert–novice level classification using graph convolutional network introducing confidence-aware node-level attention mechanism. Sensors 2024, 24, 3033. [Google Scholar] [CrossRef]
Ronao, C.A.; Cho, S.-B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 2016, 59, 235–244. [Google Scholar] [CrossRef]
Hammerla, N.Y.; Halloran, S.; Plötz, T. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv 2016, arXiv:1604.08880. [Google Scholar]
Ordóñez, F.J.; Roggen, D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 2016, 16, 115. [Google Scholar] [CrossRef]
Murad, A.; Pyun, J.-Y. Deep recurrent neural networks for human activity recognition. Sensors 2017, 17, 2556. [Google Scholar] [CrossRef]
Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 9268–9277. [Google Scholar]
Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Cardoso, M.J. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2017; pp. 240–248. [Google Scholar]
Farhadpour, S.; Warner, T.A.; Maxwell, A.E. Selecting and interpreting multiclass loss and accuracy assessment metrics for classifications with class imbalance: Guidance and best practices. Remote Sens. 2024, 16, 533. [Google Scholar] [CrossRef]
Feng, L.; Shu, S.; Lin, Z.; Lv, F.; Li, L.; An, B. Can cross entropy loss be robust to label noise? In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence; IJCAI: Marina del Rey, CA, USA, 2021; pp. 2206–2212. [Google Scholar]
Leng, Z.; Tan, M.; Liu, C.; Cubuk, E.D.; Shi, X.; Cheng, S.; Anguelov, D. PolyLoss: A polynomial expansion perspective of classification loss functions. arXiv 2022, arXiv:2204.12511. [Google Scholar]
Hosseini, S.M.; Baghshah, M.S. Dilated balanced cross entropy loss for medical image segmentation. BMC Med. Imaging 2026, 26, 113. [Google Scholar] [CrossRef] [PubMed]
Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-margin softmax loss for convolutional neural networks. arXiv 2016, arXiv:1612.02295. [Google Scholar]
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 2980–2988. [Google Scholar]
Hu, Y.; Zhang, X.-Q.; Xu, L.; He, F.X.; Tian, Z.; She, W.; Liu, W. Harmonic loss function for sensor-based human activity recognition based on LSTM recurrent neural networks. IEEE Access 2020, 8, 135617–135627. [Google Scholar] [CrossRef]
Arrotta, L.; Civitarese, G.; Bettini, C. Semantic loss: A new neuro-symbolic approach for context-aware human activity recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2024, 7, 153. [Google Scholar] [CrossRef]
Ram, D.D.; Muthukumaran, U.; Fatima, N.S. Enhanced human action recognition with ensembled DTW loss function in CNN LSTM architecture. In Proceedings of Third International Conference on Sustainable Expert Systems: ICSES 2022; Springer: Singapore, 2023; pp. 491–508. [Google Scholar]
Teng, Q.; Wang, K.; Zhang, L.; He, J. The layer-wise training convolutional neural networks using local loss for sensor-based human activity recognition. IEEE Sens. J. 2020, 20, 7265–7274. [Google Scholar] [CrossRef]
Wu, T.; Chen, Y.; Gu, Y.; Wang, J.; Zhang, S.; Zhanghu, Z. Multi-layer cross loss model for zero-shot human activity recognition. In Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Cham, Switzerland, 2020; pp. 210–221. [Google Scholar]
Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2019, 119, 3–11. [Google Scholar] [CrossRef]
Nweke, H.F.; Teh, Y.W.; Al-Garadi, M.A.; Alo, U.R. Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges. Expert Syst. Appl. 2018, 105, 233–261. [Google Scholar] [CrossRef]
Chen, K.; Zhang, D.; Yao, L.; Guo, B.; Yu, Z.; Liu, Y. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Comput. Surv. 2021, 54, 77. [Google Scholar] [CrossRef]
Gu, F.; Chung, M.-H.; Chignell, M.; Valaee, S.; Zhou, B.; Liu, X. A survey on deep learning for human activity recognition. ACM Comput. Surv. 2021, 54, 163. [Google Scholar] [CrossRef]
Agarwal, P.; Alam, M. A lightweight deep learning model for human activity recognition on edge devices. Procedia Comput. Sci. 2020, 167, 2364–2373. [Google Scholar] [CrossRef]
Agac, S.; Incel, O.D. Resource-efficient, sensor-based human activity recognition with lightweight deep models boosted with attention. Comput. Electr. Eng. 2024, 117, 109274. [Google Scholar] [CrossRef]
Rashid, N.; Demirel, B.U.; Al Faruque, M.A. AHAR: Adaptive CNN for energy-efficient human activity recognition in low-power edge devices. IEEE Internet Things J. 2022, 9, 13041–13051. [Google Scholar] [CrossRef]
Bursa, S.Ö.; Incel, Ö.D.; Alptekin, G.I. Building lightweight deep learning models with TensorFlow Lite for human activity recognition on mobile devices. Ann. Telecommun. 2023, 78, 687–702. [Google Scholar] [CrossRef]
Gupta, S.; Jain, S.; Roy, B.; Deb, A. A TinyML approach to human activity recognition. J. Phys. Conf. Ser. 2022, 2273, 012025. [Google Scholar] [CrossRef]
Sharma, V.; Pau, D.; Cano, J. Efficient Tiny Machine Learning for Human Activity Recognition on Low-Power Edge Devices. In 2024 IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI); IEEE: Piscataway, NJ, USA, 2024; pp. 85–90. [Google Scholar]
Leite, C.F.S.; Xiao, Y. Optimal sensor channel selection for resource-efficient deep activity recognition. In Proceedings of the 20th International Conference on Information Processing in Sensor Networks; ACM: New York, NY, USA, 2021; pp. 371–383. [Google Scholar]
Leite, C.F.S.; Xiao, Y. Improving resource efficiency of deep activity recognition via redundancy reduction. In Proceedings of the 21st International Workshop on Mobile Computing Systems and Applications; ACM: New York, NY, USA, 2020; pp. 33–38. [Google Scholar]
Kalantarian, H.; Sideris, C.; Mortazavi, B.; Alshurafa, N.; Sarrafzadeh, M. Dynamic computation offloading for low-power wearable health monitoring systems. IEEE Trans. Biomed. Eng. 2016, 64, 621–628. [Google Scholar] [CrossRef]
Secerbegovic, A.; Gogic, A.; Suljanovic, N.; Zajc, M.; Mujcic, A. Computational balancing between wearable sensor and smartphone towards energy-efficient remote healthcare monitoring. Adv. Electr. Comput. Eng. 2018, 18, 3–10. [Google Scholar] [CrossRef]
Leite, C.F.S.; Xiao, Y. Resource-efficient continual learning for sensor-based human activity recognition. ACM Trans. Embed. Comput. Syst. 2022, 21, 59. [Google Scholar] [CrossRef]
Tesema, W.; Jimma, W.; Khan, M.I.; Stiens, J.; da Silva, B. A taxonomy of low-power techniques in wearable medical devices for healthcare applications. Electronics 2024, 13, 3097. [Google Scholar] [CrossRef]
Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. Activity recognition using cell phone accelerometers. ACM SIGKDD Explor. Newsl. 2011, 12, 74–82. [Google Scholar] [CrossRef]
Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. A public domain dataset for human activity recognition using smartphones. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 24–26 April 2013; i6doc.com: Bruges, Belgium, 2013; pp. 437–442. [Google Scholar]

Figure 1. Schematic overview of the compact recurrent architectures used in this study. Each model includes a single recurrent layer (SimpleRNN, GRU, or LSTM), followed by a dense layer, a dropout layer, and a softmax output layer. The model family is intentionally kept compact to support controlled comparison across recurrent capacities.

Figure 2. Average performance metrics of loss functions on the UCI and WISDM datasets. Bars denote mean values aggregated over models, with error bars indicating standard deviation across models. Colors represent loss functions and hatch patterns indicate datasets.

Figure 3. Average efficiency metrics of loss functions on the UCI and WISDM datasets. Mean values aggregated over models are shown with standard deviation error bars. Colors indicate loss functions, hatch patterns denote datasets, and the y-axis is inverted for efficiency metrics.

Figure 4. Slope charts illustrating the relationship between predictive performance (F1-macro) and observed runtime or memory measurements across models and datasets. Columns correspond to datasets (UCI and WISDM), while rows show performance versus inference time (top, logarithmic scale) and performance versus RAM usage (bottom). Lines connect models ordered by increasing complexity, with colors denoting loss functions.

Figure 5. Relationship between predictive performance (F1-macro) and inference time for each loss function across TinyGRU, TinyLSTM, and TinyRNN models. Points represent mean values averaged over datasets, with error bars indicating variability across dataset–model combinations.

Figure 6. Relationship between predictive performance (F1-macro) and memory consumption for each loss function across TinyGRU, TinyLSTM, and TinyRNN models. Mean values are shown with standard deviation error bars computed across datasets.

Table 1. Summary of the WISDM activity recognition dataset after window-based preprocessing.

Property	Value
Number of users	36
Number of activities	6
Sensor modality	Tri-axial accelerometer
Sampling frequency	20 Hz
Window length	100 samples (5 s)
Window overlap	50%
Window shape	$100 \times 3$
Total windows	21,963

Table 2. Summary of the UCI-HAR activity recognition dataset.

Property	Value
Number of subjects	30
Number of activities	6
Sensor modalities	Accelerometer and gyroscope
Sampling frequency	50 Hz
Feature dimensionality	561
Window length	2.56 s (128 samples)
Train samples	7352
Test samples	2947
Total samples	10,299

Table 3. Raw performance metrics across models, datasets, and loss functions. Within each model–dataset block, the best value is highlighted in bold. For Accuracy, Precision, Recall, F1, AUC, and Jaccard, higher values are better; for Hamming loss, lower values are better. When values are tied after rounding, all tied best values are bolded.

Dataset	Model	Loss Function	Acc.	Prec.	Rec.	F1	AUC	Jaccard	Hamming
UCI	TinyGRU	TCG-CE	0.812	0.812	0.803	0.804	0.968	0.681	0.188
		Taylor Cross-Entropy	0.799	0.800	0.792	0.794	0.965	0.669	0.201
		KL Divergence	0.740	0.748	0.726	0.722	0.946	0.592	0.260
		Categorical Cross-Entropy	0.785	0.794	0.777	0.781	0.961	0.653	0.215
	TinyLSTM	TCG-CE	0.830	0.831	0.826	0.827	0.963	0.710	0.170
		Taylor Cross-Entropy	0.827	0.829	0.822	0.824	0.957	0.706	0.173
		KL Divergence	0.803	0.804	0.796	0.798	0.956	0.672	0.197
		Categorical Cross-Entropy	0.815	0.814	0.811	0.812	0.959	0.690	0.185
	TinyRNN	TCG-CE	0.759	0.756	0.748	0.750	0.950	0.624	0.241
		Taylor Cross-Entropy	0.759	0.753	0.749	0.750	0.953	0.621	0.241
		KL Divergence	0.761	0.756	0.750	0.750	0.953	0.621	0.239
		Categorical Cross-Entropy	0.753	0.748	0.742	0.743	0.950	0.614	0.247
WISDM	TinyGRU	TCG-CE	0.946	0.928	0.924	0.926	0.992	0.871	0.054
		Taylor Cross-Entropy	0.946	0.922	0.915	0.918	0.986	0.854	0.054
		KL Divergence	0.941	0.919	0.911	0.915	0.992	0.855	0.059
		Categorical Cross-Entropy	0.935	0.906	0.908	0.907	0.991	0.844	0.065
	TinyLSTM	TCG-CE	0.948	0.928	0.922	0.925	0.990	0.869	0.052
		Taylor Cross-Entropy	0.933	0.911	0.898	0.904	0.975	0.834	0.067
		KL Divergence	0.945	0.922	0.915	0.918	0.990	0.857	0.055
		Categorical Cross-Entropy	0.947	0.926	0.923	0.924	0.991	0.867	0.053
	TinyRNN	TCG-CE	0.874	0.812	0.806	0.809	0.974	0.722	0.126
		Taylor Cross-Entropy	0.840	0.767	0.758	0.758	0.941	0.656	0.160
		KL Divergence	0.866	0.802	0.796	0.796	0.972	0.710	0.134
		Categorical Cross-Entropy	0.861	0.787	0.788	0.785	0.973	0.695	0.139

Table 4. Mean differences and effect sizes for macro-averaged metrics. For each metric, the strongest comparison (largest absolute mean difference) is highlighted in bold.

Metric	Comparison	Mean Diff.	Cohen’s d	CI Lower	CI Upper
Accuracy	TCG-CE vs. Taylor Cross-Entropy	0.0111	0.1480	0.0009	0.0213
	TCG-CE vs. KL Divergence	0.0191	0.2307	−0.0032	0.0414
	TCG-CE vs. Categorical Cross-Entropy	0.0124	0.1602	0.0056	0.0192
Precision (Macro)	TCG-CE vs. Taylor Cross-Entropy	0.0142	0.2010	0.0012	0.0272
	TCG-CE vs. KL Divergence	0.0194	0.2638	0.0003	0.0385
	TCG-CE vs. Categorical Cross-Entropy	0.0156	0.2221	0.0085	0.0227
Recall (Macro)	TCG-CE vs. Taylor Cross-Entropy	0.0160	0.2266	0.0017	0.0302
	TCG-CE vs. KL Divergence	0.0228	0.3015	−0.0003	0.0458
	TCG-CE vs. Categorical Cross-Entropy	0.0137	0.1896	0.0062	0.0212
F1 Score (Macro)	TCG-CE vs. Taylor Cross-Entropy	0.0154	0.2155	0.0005	0.0304
	TCG-CE vs. KL Divergence	0.0237	0.3073	−0.0004	0.0478
	TCG-CE vs. Categorical Cross-Entropy	0.0150	0.2071	0.0075	0.0224
AUC Score	TCG-CE vs. Taylor Cross-Entropy	0.0098	0.6121	−0.0002	0.0199
	TCG-CE vs. KL Divergence	0.0045	0.2500	−0.0026	0.0116
	TCG-CE vs. Categorical Cross-Entropy	0.0020	0.1218	−0.0003	0.0043
Jaccard Index	TCG-CE vs. Taylor Cross-Entropy	0.0227	0.2279	0.0036	0.0419
	TCG-CE vs. KL Divergence	0.0284	0.2622	0.0025	0.0542
	TCG-CE vs. Categorical Cross-Entropy	0.0190	0.1852	0.0103	0.0277
Hamming Loss	TCG-CE vs. Taylor Cross-Entropy	−0.0111	−0.1480	−0.0213	−0.0009
	TCG-CE vs. KL Divergence	-0.0191	−0.2307	−0.0414	0.0032
	TCG-CE vs. Categorical Cross-Entropy	−0.0124	−0.1602	−0.0192	−0.0056

Table 5. Statistical significance of macro-averaged metrics comparing TCG-CE with baseline loss functions. Symbols denote strength of evidence (*

p < 0.05

, **

p < 0.01

, † marginal). For each metric, the strongest statistical evidence (lowest p-value) is highlighted in bold.

Table 5. Statistical significance of macro-averaged metrics comparing TCG-CE with baseline loss functions. Symbols denote strength of evidence (*

p < 0.05

, **

p < 0.01

, † marginal). For each metric, the strongest statistical evidence (lowest p-value) is highlighted in bold.

Metric	Comparison	p-Value	Significance
Accuracy	TCG-CE vs. Taylor Cross-Entropy	0.0426	*
	TCG-CE vs. KL Divergence	0.0771	†
	TCG-CE vs. Categorical Cross-Entropy	0.0079	**
Precision (Macro)	TCG-CE vs. Taylor Cross-Entropy	0.0422	*
	TCG-CE vs. KL Divergence	0.0512	†
	TCG-CE vs. Categorical Cross-Entropy	0.0038	**
Recall (Macro)	TCG-CE vs. Taylor Cross-Entropy	0.0396	*
	TCG-CE vs. KL Divergence	0.0552	†
	TCG-CE vs. Categorical Cross-Entropy	0.0080	**
F1 Score (Macro)	TCG-CE vs. Taylor Cross-Entropy	0.0496	*
	TCG-CE vs. KL Divergence	0.0561	†
	TCG-CE vs. Categorical Cross-Entropy	0.0055	**
AUC Score	TCG-CE vs. Taylor Cross-Entropy	0.0568	†
	TCG-CE vs. KL Divergence	0.1356
	TCG-CE vs. Categorical Cross-Entropy	0.0714	†
Jaccard Index	TCG-CE vs. Taylor Cross-Entropy	0.0338	*
	TCG-CE vs. KL Divergence	0.0420	*
	TCG-CE vs. Categorical Cross-Entropy	0.0039	**
Hamming Loss	TCG-CE vs. Taylor Cross-Entropy	0.0426	*
	TCG-CE vs. KL Divergence	0.0771	†
	TCG-CE vs. Categorical Cross-Entropy	0.0079	**

Table 6. Raw efficiency metrics in terms of inference time and memory usage. Best (lowest) values within each model–dataset block are highlighted in bold.

Dataset	Model	Loss Function	Inference Time (ms)	RAM Usage (MB)
UCI	TinyGRU	TCG-CE	4286.350	979.125
		Taylor Cross-Entropy	4477.815	1074.605
		KL Divergence	4376.693	1127.922
		Categorical Cross-Entropy	10,309.196	1127.047
	TinyLSTM	TCG-CE	4158.750	1076.965
		Taylor Cross-Entropy	5213.223	1100.312
		KL Divergence	5188.342	1182.234
		Categorical Cross-Entropy	10,311.030	1229.008
	TinyRNN	TCG-CE	2758.210	880.508
		Taylor Cross-Entropy	2659.950	913.801
		KL Divergence	3946.572	956.137
		Categorical Cross-Entropy	10,336.351	1004.605
WISDM	TinyGRU	TCG-CE	1771.826	1533.785
		Taylor Cross-Entropy	2255.266	1585.125
		KL Divergence	2618.680	1618.152
		Categorical Cross-Entropy	1808.878	1693.301
	TinyLSTM	TCG-CE	1993.740	1072.523
		Taylor Cross-Entropy	2641.590	1133.531
		KL Divergence	2640.173	1146.129
		Categorical Cross-Entropy	2072.128	1233.129
	TinyRNN	TCG-CE	1224.701	1416.938
		Taylor Cross-Entropy	1153.584	1442.590
		Categorical Cross-Entropy	1204.387	1487.941
		KL Divergence	1242.765	1469.547

Table 7. Mean differences and effect sizes for resource efficiency metrics comparing TCG-CE with baseline loss functions. For each metric, the strongest effect (largest absolute mean difference) is highlighted in bold.

Metric	Comparison	Mean Diff.	Cohen’s d	CI Lower	CI Upper
Inference Time (ms)	TCG-CE vs. Taylor Cross-Entropy	−367.9752	−0.2639	−726.5847	−9.3656
	TCG-CE vs. KL Divergence	−636.6080	−0.4687	−1026.0142	−247.2019
	TCG-CE vs. Categorical Cross-Entropy	−3308.0654	−0.9544	−6213.0469	−403.0840
RAM Usage (MB)	TCG-CE vs. Taylor Cross-Entropy	−48.3535	−0.1897	−70.2888	−26.4183
	TCG-CE vs. KL Divergence	−90.0462	−0.3583	−116.8137	−63.2787
	TCG-CE vs. Categorical Cross-Entropy	−135.8646	−0.5339	−163.4070	−108.3222

Table 8. Statistical significance of resource efficiency metrics comparing TCG-CE with baseline loss functions. Symbols denote strength of evidence (*

p < 0.05

, **

p < 0.01

, ***

p < 0.001

, † marginal). For each metric, the strongest statistical evidence (lowest p-value) is highlighted in bold.

Table 8. Statistical significance of resource efficiency metrics comparing TCG-CE with baseline loss functions. Symbols denote strength of evidence (*

p < 0.05

, **

p < 0.01

, ***

p < 0.001

, † marginal). For each metric, the strongest statistical evidence (lowest p-value) is highlighted in bold.

Metric	Comparison	p-Value	Significance
Inference Time (ms)	TCG-CE vs. Taylor Cross-Entropy	0.0502	†
	TCG-CE vs. KL Divergence	0.0119	*
	TCG-CE vs. Categorical Cross-Entropy	0.0380	*
RAM Usage (MB)	TCG-CE vs. Taylor Cross-Entropy	0.0038	**
	TCG-CE vs. KL Divergence	0.0006	***
	TCG-CE vs. Categorical Cross-Entropy	0.0001	*******

Table 9. Win rates of loss functions across models and datasets for each evaluation metric. For each metric, the highest win rate is highlighted in bold.

Metric	Categorical CE	KL Divergence	Taylor CE	TCG-CE
Accuracy	0	0.17	0	0.83
Precision (Macro)	0	0.17	0	0.83
Recall (Macro)	0.17	0.17	0	0.67
F1 Score (Macro)	0	0.17	0	0.83
AUC Score	0.17	0	0.17	0.67
Jaccard Index	0	0	0	1.00
Hamming Loss	0	0.17	0	0.83
Inference Time (ms)	0	0	0.33	0.67
RAM Usage (MB)	0	0	0	1.00

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alzhrani, K.M. Top-Confidence Gapped Cross-Entropy for Compact Human Activity Recognition. Appl. Sci. 2026, 16, 3394. https://doi.org/10.3390/app16073394

AMA Style

Alzhrani KM. Top-Confidence Gapped Cross-Entropy for Compact Human Activity Recognition. Applied Sciences. 2026; 16(7):3394. https://doi.org/10.3390/app16073394

Chicago/Turabian Style

Alzhrani, Khudran M. 2026. "Top-Confidence Gapped Cross-Entropy for Compact Human Activity Recognition" Applied Sciences 16, no. 7: 3394. https://doi.org/10.3390/app16073394

APA Style

Alzhrani, K. M. (2026). Top-Confidence Gapped Cross-Entropy for Compact Human Activity Recognition. Applied Sciences, 16(7), 3394. https://doi.org/10.3390/app16073394

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Top-Confidence Gapped Cross-Entropy for Compact Human Activity Recognition

Abstract

1. Introduction

2. Related Work

2.1. Loss Functions in Multi-Class HAR

2.2. Modified and Robust Cross-Entropy Variants

2.3. Confidence-Aware and Task-Specific Losses

2.4. Loss Design in Compact HAR Settings

3. Methodology

3.1. Top-Confidence Gapped Cross-Entropy (TCG-CE) Loss Function

3.2. Compact Recurrent Models

4. Datasets

4.1. WISDM Activity Recognition Dataset

4.2. UCI Human Activity Recognition Dataset

5. Experiments and Results

5.1. Evaluation Protocol

5.2. Overall Predictive Performance

5.3. Empirical Runtime and Memory Measurements

5.4. Performance–Efficiency Trade-Off Analysis

6. Discussion

7. Conclusions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI