1. Introduction
Human Activity Recognition (HAR) with wearable and mobile sensors is a core task in intelligent sensing systems. It supports applications such as healthcare monitoring, rehabilitation, smart environments, and personal analytics. The widespread use of inertial sensors in smartphones and wearables has made continuous motion sensing widely available. This has sustained strong interest in sensor-based HAR. Recent studies have shown that deep learning can outperform traditional feature-engineered pipelines by learning task-relevant representations directly from sensor data or structured signal windows [1,2,3].
Many HAR systems must operate under constraints related to model size, memory, and computation [1,4]. For this reason, a large part of the literature has focused on architecture design, lightweight models, feature selection, and deployment-oriented optimization. These directions are important. However, architecture is not the only factor that shapes model behavior. The training objective also affects gradient allocation, convergence, class discrimination, and the treatment of easy and ambiguous samples. In compact HAR settings, this makes loss-function design a relevant but less explored direction.
This question is important because HAR often involves ambiguous class boundaries. Some activities have similar motion patterns. Transition segments can also be difficult to classify. These challenges become more important when model capacity is limited. In such settings, treating all training samples in the same way may not be ideal. A training objective that reduces the influence of clearly separated predictions and preserves greater emphasis on ambiguous cases may improve balanced classification behavior.
Related work in machine learning has shown that modified training objectives can improve robustness, discrimination, or optimization behavior without requiring structural changes to the model itself [5,6]. This motivates a simple question for compact HAR: can a lightweight, prediction-derived modification of cross-entropy improve macro-level classification performance under a fixed compact model family?
This paper studies that question through loss design. We propose Top-Confidence Gapped Cross-Entropy (TCG-CE), a modification of categorical cross-entropy in which the per-sample loss is scaled using the gap between the two most probable predicted classes. This top-1/top-2 gap is used as a simple signal of local prediction ambiguity. The proposed loss introduces no additional trainable parameters and can be used as a drop-in replacement for standard cross-entropy in the evaluated training pipeline.
To evaluate the proposed loss under a controlled and practically relevant HAR setting, this study uses a compact recurrent model family composed of TinyRNN, TinyGRU, and TinyLSTM. This choice is motivated by three considerations: First, recurrent architectures remain a well-established backbone family in sensor-based HAR. Second, compact recurrent variants can be implemented with small parameter budgets, which is suitable for lightweight HAR settings. Third, the progression from simple RNN to GRU and LSTM provides a structured increase in recurrent capacity. This makes the model family suitable for controlled comparison of training objectives. The purpose is not to introduce architectural novelty, but to examine the behavior of the proposed loss under a compact and capacity-structured HAR backbone family.
The proposed loss is evaluated on two HAR benchmarks, WISDM and UCI-HAR. These datasets provide different input representations and therefore broaden the evaluation setting. The analysis focuses on macro-averaged predictive performance and also reports empirical runtime and memory measurements as within-study observations under a fixed execution environment.
The main contributions of this work are as follows: First, we introduce TCG-CE, a lightweight modification of categorical cross-entropy based on the top-1/top-2 confidence gap. Second, we evaluate the proposed loss under a compact recurrent HAR model family that enables controlled comparison across different recurrent capacities. Third, we examine the method on two benchmark HAR datasets with different representations. Fourth, we provide a comparative analysis using macro-averaged predictive metrics, effect sizes, statistical significance tests, and win rates. We also report empirical runtime and memory measurements under a fixed execution environment.
Across the evaluated settings, TCG-CE improves macro-level predictive performance, with the clearest gains appearing on WISDM and in more capacity-limited configurations. These results support the view that loss-function design can serve as a practical complementary lever for improving balanced classification behavior in compact HAR.
The remainder of this paper is organized as follows:
Section 2 reviews prior work on loss functions for Human Activity Recognition and compact resource-conscious HAR settings.
Section 3 presents the proposed Top-Confidence Gapped Cross-Entropy (TCG-CE) loss and the compact recurrent models used for evaluation.
Section 4 describes the benchmark datasets and preprocessing procedures.
Section 5 reports the experimental results, including predictive performance and empirical runtime analysis.
Section 6 discusses the implications and limitations of the findings. Finally, Section 7 concludes the paper.
3. Methodology
3.1. Top-Confidence Gapped Cross-Entropy (TCG-CE) Loss Function
This section defines the proposed Top-Confidence Gapped Cross-Entropy (TCG-CE) loss. TCG-CE is a modification of categorical cross-entropy in which each sample is weighted according to the separation between its two most probable predicted classes. The objective is to attenuate the loss contribution of clearly separated predictions while retaining stronger emphasis on ambiguous cases.
Let $B$ denote the batch size and $C$ denote the number of classes. The one-hot encoded labels are written as $Y \in \{0,1\}^{B \times C}$, and the predicted class probabilities are written as $P \in [0,1]^{B \times C}$. Each row $p_i$ is a class-probability vector, typically produced by a softmax output layer, such that $\sum_{c=1}^{C} p_{i,c} = 1$.
Two fixed constants are used in the formulation. The first is a clipping constant $\epsilon$, used for numerical stability in the logarithm. The second is a scaling constant $\lambda$, used in the confidence modulation term. Both constants are held fixed across all experiments, and the same values are used for all datasets and model configurations. Here, $\epsilon$ serves as a conservative numerical safeguard after probability clipping so that logarithms remain well-defined. The constant $\lambda$ controls the smoothness and strength of the confidence modulation. These values were kept fixed throughout the study so that the proposed loss could be evaluated as a single stable formulation under matched conditions, without introducing additional dataset-specific retuning.
Predicted probabilities are first clipped away from 0 and 1:

$\tilde{p}_{i,c} = \min\!\left(\max\!\left(p_{i,c},\, \epsilon\right),\, 1 - \epsilon\right)$

This guarantees that $\log \tilde{p}_{i,c}$ is finite for all samples and classes.
For each sample $i$, let $c_i^{(1)}$ denote the index of the largest predicted probability, and let $c_i^{(2)}$ denote the index of the second-largest predicted probability. The corresponding top-two probabilities are

$p_i^{(1)} = \tilde{p}_{i,\, c_i^{(1)}}, \qquad p_i^{(2)} = \tilde{p}_{i,\, c_i^{(2)}}$

The central quantity in TCG-CE is the top-confidence gap

$g_i = p_i^{(1)} - p_i^{(2)}$
By construction, $0 \le g_i < 1$. Larger values indicate stronger separation between the most probable class and its nearest competitor. Smaller values indicate greater local ambiguity. The use of the top-1/top-2 confidence gap is motivated by the idea that local ambiguity in multi-class prediction is often most directly reflected in the competition between the two most probable classes. When this gap is small, the prediction is less decisive because the model assigns similar probability mass to its two strongest alternatives. When the gap is large, the model exhibits clearer local separation. Unlike entropy, which summarizes uncertainty across the full predictive distribution, the proposed signal focuses specifically on the most competitive decision boundary. Unlike a true-class margin, it is label-agnostic until combined with the cross-entropy term. In this study, the gap is therefore used as a lightweight ambiguity-oriented weighting signal rather than as a theoretically exhaustive uncertainty measure.
A design choice in TCG-CE is to detach this modulation term from the gradient flow:

$\hat{g}_i = \operatorname{stopgrad}(g_i)$

This means that the confidence-gap term acts only as a weighting factor on the loss contribution of each sample; no gradient is propagated through $\hat{g}_i$ itself.
The detached gap is then mapped to a smooth confidence score:

$s_i = \sigma\!\left(\lambda\, \hat{g}_i\right) = \frac{1}{1 + e^{-\lambda \hat{g}_i}}$

This transformation is bounded and monotonic, so $s_i \in [\tfrac{1}{2}, 1)$ for $\hat{g}_i \ge 0$. Larger confidence gaps yield larger confidence scores, whereas smaller gaps yield lower confidence scores. The complementary factor $(1 - s_i)$ is then used to scale the loss. The sigmoid mapping was chosen because it provides a smooth, bounded, and monotonic transformation of the detached gap, avoiding abrupt changes in the weighting term while preserving the ordering induced by the confidence gap. The detach operation was used so that the gap functions as a sample-weighting signal rather than introducing an additional gradient path through the top-1/top-2 competition itself. In this way, the optimization remains centered on the cross-entropy objective, while the confidence signal modulates the magnitude of each sample's contribution. The present study evaluates this fixed stabilized formulation as a complete loss design under matched experimental settings.
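To make the weighting behavior concrete, the endpoints of the gap can be written out, assuming the modulation takes the form $s_i = \sigma(\lambda \hat{g}_i)$ as above:

```latex
\hat{g}_i = 0:\quad s_i = \sigma(0) = \tfrac{1}{2}, \qquad 1 - s_i = \tfrac{1}{2}
\qquad\qquad
\hat{g}_i \to 1:\quad s_i \to \sigma(\lambda), \qquad 1 - s_i \to 1 - \sigma(\lambda)
```

A maximally ambiguous prediction thus retains half of its cross-entropy contribution, while a clearly separated prediction is attenuated toward $1 - \sigma(\lambda)$.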
The per-sample TCG-CE loss is defined as

$\mathcal{L}_i = (1 - s_i)\left(-\sum_{c=1}^{C} y_{i,c} \log \tilde{p}_{i,c}\right)$

Because $y_i$ is one-hot encoded, only the ground-truth class contributes to the sum. If $t_i$ denotes the true class index for sample $i$, this simplifies to

$\mathcal{L}_i = -(1 - s_i)\, \log \tilde{p}_{i,\, t_i}$

The mini-batch loss returned to the optimizer is the mean over all samples:

$\mathcal{L} = \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_i$
Equations (6)–(11) fully define the proposed loss. The inputs are $Y$ and $P$ together with the fixed constants $\epsilon$ and $\lambda$, and the output is the scalar objective $\mathcal{L}$ used during optimization.
Operationally, the mini-batch computation proceeds in four steps: First, predicted probabilities are clipped to $[\epsilon, 1 - \epsilon]$. Second, the largest and second-largest predicted probabilities are identified for each sample, and their difference is computed. Third, this gap is detached from gradient flow and mapped through a sigmoid function to obtain a confidence score $s_i$. Fourth, the standard cross-entropy term associated with the true class is scaled by $(1 - s_i)$ and averaged across the mini-batch.
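The four steps above can be sketched as a NumPy forward-pass reference. This is illustrative only: the constant values `EPS` and `LAM` are assumed placeholders (the study fixes both but their values are not restated here), the sigmoid is assumed to act on the scaled gap, and the stop-gradient step is implicit because no gradients are computed in NumPy.

```python
import numpy as np

EPS = 1e-7   # clipping constant epsilon (assumed value, for illustration)
LAM = 1.0    # scaling constant lambda (assumed value, for illustration)

def tcg_ce(y_true, y_pred):
    """Forward pass of TCG-CE (reference sketch, no gradient computation).

    y_true: one-hot labels, shape (B, C)
    y_pred: softmax probabilities, shape (B, C)
    """
    # Step 1: clip probabilities away from 0 and 1
    p = np.clip(y_pred, EPS, 1.0 - EPS)
    # Step 2: top-1 / top-2 gap per sample
    top2 = np.sort(p, axis=1)[:, -2:]          # two largest probs (ascending)
    gap = top2[:, 1] - top2[:, 0]
    # Step 3: sigmoid confidence score (detach is implicit here)
    conf = 1.0 / (1.0 + np.exp(-LAM * gap))
    # Step 4: scale the per-sample cross-entropy by (1 - conf) and average
    ce = -np.sum(y_true * np.log(p), axis=1)
    return float(np.mean((1.0 - conf) * ce))
```

In a TensorFlow training pipeline, the detach in step 3 would correspond to wrapping the gap in `tf.stop_gradient` before the sigmoid.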
In summary, TCG-CE modifies standard categorical cross-entropy by introducing a smooth attenuation factor derived from the top-1/top-2 separation in the predicted class distribution. Samples with smaller separation retain a larger loss contribution. Samples with larger separation are down-weighted. The proposed loss is applied during training, while the evaluated model architectures remain otherwise unchanged.
3.2. Compact Recurrent Models
This study deliberately focuses on compact recurrent backbones because the contribution is a loss-design study rather than an architecture paper. The objective is to examine the behavior of the proposed training loss under lightweight and systematically comparable models while keeping the backbone family controlled across datasets. Recurrent models remain a well-established family in sensor-based HAR, align naturally with sequential sensor windows, and provide a simple capacity ladder through TinyRNN, TinyGRU, and TinyLSTM. This makes them a suitable testbed for isolating the effect of the training objective without introducing additional architectural factors.
Accordingly, three recurrent variants are considered: TinyRNN, TinyGRU, and TinyLSTM. These models were selected because they preserve a common recurrent modeling framework while offering progressively richer gating mechanisms and representational capacity. This progression supports controlled comparison of the proposed loss across increasingly expressive compact recurrent cells while remaining consistent with lightweight HAR settings in which small parameter budgets are desirable.
All architectures share the same overall structure, shown in Figure 1. Each model contains a single recurrent layer with 16 hidden units, followed by a dense layer with 32 units, a dropout layer with rate 0.2, and a softmax output layer. The only architectural difference across the three models is the recurrent cell type itself. No convolutional layers, attention mechanisms, residual connections, or additional temporal aggregation modules are introduced. The hidden dimension, dense-layer size, dropout rate, and output configuration are kept fixed throughout, and no architecture-specific hyperparameter tuning is performed. This fixed-design strategy keeps the comparison centered on the training objective rather than on architecture engineering.
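As a rough illustration of the capacity ladder, trainable-parameter counts for the three cells can be derived from the shared configuration (16 hidden units, 32-unit dense layer, $F$ input features per step, $C$ classes). The sketch below uses the classical single-bias parameterization, in which each gate carries $H(F + H) + H$ parameters; framework implementations may differ slightly (for example, TensorFlow's GRU with `reset_after=True` adds an extra bias vector).

```python
def tiny_model_params(cell: str, F: int, C: int, H: int = 16, D: int = 32) -> int:
    """Approximate trainable-parameter count for the Tiny* models.

    cell: 'rnn', 'gru', or 'lstm'; F: input features per step; C: classes.
    Classical parameterization: each gate has H*(F+H)+H parameters.
    """
    gates = {"rnn": 1, "gru": 3, "lstm": 4}[cell]
    recurrent = gates * (H * (F + H) + H)   # recurrent layer
    dense = H * D + D                       # dense layer (16 -> 32)
    output = D * C + C                      # softmax output (32 -> C)
    return recurrent + dense + output
```

For a 3-channel input with 6 classes (as in WISDM), this gives roughly 1.1k, 1.7k, and 2.0k parameters for TinyRNN, TinyGRU, and TinyLSTM respectively, illustrating the structured increase in recurrent capacity.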
The same compact recurrent family is used for both datasets in order to preserve architectural consistency across the evaluation setting. For WISDM, this choice aligns naturally with the segmented sensor-window representation. For UCI-HAR, the data are provided in engineered-feature form rather than as raw temporal sequences. In this case, the recurrent models are used as a controlled common backbone family for loss-function comparison rather than as a claim of architectural optimality for engineered tabular features. This consistent backbone choice allows the effect of TCG-CE to be examined under matched compact models across both datasets.
4. Datasets
This study evaluates the proposed Top-Confidence Gapped Cross-Entropy (TCG-CE) loss on two widely used benchmark datasets for Human Activity Recognition (HAR): the Wireless Sensor Data Mining (WISDM) dataset and the Human Activity Recognition Using Smartphones (UCI-HAR) dataset. The two datasets differ in sensing modality, preprocessing level, and input representation. This provides a heterogeneous evaluation setting for assessing the proposed loss under different HAR data forms.
4.1. WISDM Activity Recognition Dataset
The WISDM Activity Recognition Dataset (v1.1) [40] was introduced for smartphone-based activity recognition using accelerometer data. It contains raw tri-axial accelerometer readings collected from Android smartphones carried in the front pants pocket during daily activities. Data were recorded at a sampling frequency of 20 Hz in a controlled environment.
The dataset includes recordings from 36 users performing six activities: walking, jogging, walking upstairs, walking downstairs, sitting, and standing. Each raw sample contains three acceleration channels corresponding to the x, y, and z axes. In total, the dataset contains 1,098,208 raw sensor samples. The class distribution is moderately imbalanced, with walking and jogging accounting for the largest share of samples, while sitting and standing are less frequent.
To construct learning-ready samples, the raw sensor stream is segmented into fixed-length overlapping windows. Each window contains 100 consecutive samples, corresponding to 5 s of activity data at the native sampling rate, with a 50% overlap between adjacent windows. This produces 21,963 windows, each represented as a time-series matrix. The label of each window is inherited from the corresponding activity segment.
Before segmentation, acceleration signals are standardized using z-score normalization. The same normalization is applied to all three sensor channels. The resulting windowed representation is used as input to the compact recurrent models evaluated in this study.
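The normalization and segmentation described above can be sketched as a simple NumPy routine. This is illustrative: the actual pipeline also handles per-user recording boundaries and window-label inheritance, which are omitted here.

```python
import numpy as np

def zscore(signal: np.ndarray) -> np.ndarray:
    """Standardize each sensor channel (column) to zero mean, unit variance."""
    return (signal - signal.mean(axis=0)) / signal.std(axis=0)

def segment_windows(signal: np.ndarray, window: int = 100, overlap: float = 0.5) -> np.ndarray:
    """Split an (N, 3) accelerometer stream into fixed-length overlapping windows.

    window=100 at 20 Hz corresponds to 5 s; overlap=0.5 gives a 50-sample stride.
    Returns an array of shape (num_windows, window, 3).
    """
    step = int(window * (1.0 - overlap))
    n = (len(signal) - window) // step + 1   # number of full windows
    return np.stack([signal[i * step : i * step + window] for i in range(n)])
```

For example, a 1000-sample stream yields (1000 − 100) / 50 + 1 = 19 windows of shape (100, 3).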
Table 1 summarizes the main characteristics of the WISDM dataset after window-based preprocessing.
4.2. UCI Human Activity Recognition Dataset
The UCI-HAR dataset [41] was collected from 30 volunteers aged between 19 and 48 years, each performing six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying. Data were acquired using a smartphone worn on the waist and equipped with a tri-axial accelerometer and a tri-axial gyroscope. Signals were sampled at 50 Hz and video-recorded to support accurate labeling.
Unlike WISDM, UCI-HAR is provided in a preprocessed and feature-engineered form. The released feature vectors were extracted by the dataset authors from segmented inertial signals using fixed-width sliding windows of 2.56 s with 50% overlap, corresponding to 128 readings per window. Each window is represented by 561 features derived from accelerometer and gyroscope signals in both the time and frequency domains. These features are normalized and bounded in the range $[-1, 1]$.
The dataset includes an official subject-based train–test split. Seventy percent of the participants are used for training and 30% for testing, so samples from the same individual do not appear in both subsets. The training set contains 7352 samples and the test set contains 2947 samples, for a total of 10,299 samples. This predefined split is preserved in all experiments.
In this study, each 561-dimensional UCI-HAR feature vector is standardized and then reshaped to an input of length 561 with 1 feature per step in order to maintain compatibility with the compact recurrent model family used throughout the evaluation. This reshaping is a controlled modeling choice adopted to preserve a common backbone family across datasets and support consistent comparison of the loss functions. The original UCI-HAR representation remains an engineered feature vector rather than a native temporal sequence, and the imposed feature ordering is used only for architectural compatibility.
Table 2 summarizes the main characteristics of the UCI-HAR dataset.
Both datasets are public and widely used in the HAR literature. Their complementary characteristics—windowed sensor sequences in WISDM and engineered feature vectors in UCI-HAR—provide a useful evaluation setting for examining the proposed loss across different HAR data representations.
5. Experiments and Results
5.1. Evaluation Protocol
A unified evaluation protocol was used to compare the considered loss functions under matched preprocessing, optimization, and model settings. For each dataset, the same model family, preprocessing pipeline, training budget, and evaluation metrics were used across all loss-function configurations.
For WISDM, which does not provide an official train–test partition, the windowed dataset was divided into training and test subsets using an 80/20 stratified split in order to preserve class proportions. For UCI-HAR, the predefined subject-based split supplied with the dataset was used without modification.
All experiments were executed on a Google Colab CPU runtime (Google LLC, Mountain View, CA, USA) with GPU acceleration disabled. At the time of execution, the cloud runtime reported an Intel(R) Xeon(R) CPU @ 2.20 GHz with 12.67 GB RAM, running Linux 6.6.113+ (x86_64, glibc 2.35). The software environment used Python 3.12.13, NumPy 2.0.2, and TensorFlow 2.19.0. Random seeds were fixed across Python, NumPy, and TensorFlow, and deterministic TensorFlow operations were enabled where supported. Each model was trained for 200 epochs using the Adam optimizer with learning rate 0.001 and batch size 64. A validation split of 20% was used during training, and the checkpoint with the lowest validation loss was selected for final evaluation.
Predictive performance was evaluated primarily using macro-averaged F1-score, which is appropriate for balanced assessment across activity classes. Additional macro-level metrics, including precision, recall, area under the ROC curve (AUC), Jaccard index, accuracy, and Hamming loss, were also computed to provide a broader characterization of model behavior.
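For reference, macro-averaged F1 is the unweighted mean of per-class F1 scores, so each activity class counts equally regardless of its support. A minimal sketch, equivalent in spirit to scikit-learn's `f1_score(average='macro')` with zero-division treated as 0:

```python
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> float:
    """Unweighted mean of per-class F1 scores over integer class labels."""
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn                      # F1 = 2TP / (2TP + FP + FN)
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1s))
```

Because every class contributes 1/num_classes of the score, errors on rare classes (such as sitting and standing in WISDM) are penalized as heavily as errors on frequent ones.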
For completeness, empirical runtime and memory measurements were also recorded under the adopted execution environment. Inference time was measured as wall-clock latency during model prediction, and memory usage was recorded as process resident set size (RSS) during the inference procedure. These quantities are reported as empirical observations under the adopted execution environment and are included to complement the predictive analysis.
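A minimal sketch of how such measurements can be taken on a Linux runtime, using only the standard library. Note the assumptions: `predict_fn` is a placeholder for the model's prediction call, and `ru_maxrss` reports *peak* RSS in kilobytes on Linux, whereas the study's probe of instantaneous RSS may have used a different mechanism (e.g., psutil).

```python
import time
import resource

def measure_inference(predict_fn, inputs):
    """Return (outputs, wall-clock seconds, peak RSS in kB) for one prediction call."""
    start = time.perf_counter()
    outputs = predict_fn(inputs)
    elapsed = time.perf_counter() - start
    # Peak resident set size of this process; kilobytes on Linux.
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return outputs, elapsed, rss_kb
```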
This protocol supports controlled comparison of the considered loss functions under fixed model settings while keeping the primary focus on predictive evaluation.
5.2. Overall Predictive Performance
Table 3 reports the predictive performance of all model–loss function combinations across the UCI-HAR and WISDM datasets. The results are presented using macro-averaged metrics to ensure balanced evaluation across activity classes. The best-performing configurations within each model–dataset block are highlighted for clarity.
On UCI-HAR, TCG-CE achieves the strongest overall results for the TinyGRU and TinyLSTM architectures. In particular, TCG-CE attains macro F1-scores of 0.804 for TinyGRU and 0.827 for TinyLSTM, together with corresponding gains in accuracy, precision, recall, and Jaccard index. These results are also accompanied by lower Hamming loss. Relative to Categorical Cross-Entropy and Taylor Cross-Entropy, the gains are directionally consistent for these two architectures, whereas KL Divergence yields weaker macro-level performance.
For TinyRNN on UCI-HAR, the differences among loss functions are smaller. KL Divergence attains slightly higher accuracy and marginally lower Hamming loss, while the macro F1-scores of TCG-CE, Taylor Cross-Entropy, and KL Divergence are effectively tied. This suggests that, for the smallest recurrent model on this dataset, the effect of the loss is more limited than for the larger recurrent variants.
On WISDM, TCG-CE again shows strong predictive behavior. For TinyGRU and TinyLSTM, it attains the highest or jointly highest accuracy and macro F1-score, reaching 0.926 and 0.925, respectively. These advantages are also reflected in precision, recall, Jaccard index, and Hamming loss, indicating balanced gains rather than improvements confined to a single metric. Although KL Divergence occasionally matches or slightly exceeds AUC values, this does not translate into stronger macro-level classification performance.
The most pronounced gain appears for TinyRNN on WISDM, where TCG-CE achieves a macro F1-score of 0.809 and outperforms all baseline losses by a visible margin. Across the two datasets, the raw results indicate that TCG-CE is generally competitive, with the clearest advantages appearing in compact and capacity-limited settings.
Table 4 summarizes pairwise mean differences, effect sizes (Cohen's d), and confidence intervals for macro-averaged predictive metrics. These comparisons are computed between TCG-CE and the baseline losses. This analysis complements the raw results by describing both the direction and the magnitude of the observed differences.
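For paired comparisons of this kind, Cohen's d can be computed from the per-configuration metric differences. The sketch below uses a common paired-samples form, dividing the mean difference by the sample standard deviation of the differences; this is one of several variants, and the paper does not state which was used.

```python
import numpy as np

def cohens_d_paired(a, b) -> float:
    """Paired Cohen's d: mean(a - b) / std(a - b), with sample std (ddof=1).

    a, b: metric values for the same configurations under two losses.
    """
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(diff.mean() / diff.std(ddof=1))
```

Positive values indicate that the first loss scores higher on average; magnitudes near 0.2, 0.5, and 0.8 are conventionally read as small, medium, and large effects.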
Across the primary predictive metrics, TCG-CE shows consistently positive mean differences relative to the baselines. The largest improvements are generally observed against KL Divergence, with mean differences of 0.0237 for macro F1-score and 0.0228 for macro recall. The corresponding effect sizes are small-to-moderate, which suggests that the observed gains are systematic rather than isolated fluctuations.
For accuracy and precision, TCG-CE again shows positive mean shifts, with the largest effects occurring against KL Divergence and Categorical Cross-Entropy. Although the effect sizes remain modest, the consistency of their direction across metrics supports the view that the proposed loss tends to improve balanced classification performance across model–dataset settings.
Macro F1-score, which is the primary metric in this study, shows a mean improvement of 0.0237 over KL Divergence and 0.0154 over Taylor Cross-Entropy. These differences are consistent with the raw performance patterns and support the practical value of TCG-CE for balanced classification.
The AUC results show smaller absolute differences, with the strongest relative gain appearing against Taylor Cross-Entropy. This suggests that improvements in ranking quality are present, but less uniform than the improvements observed for macro-level classification metrics. The Jaccard index and Hamming loss follow the same general pattern as accuracy, precision, recall, and F1-score, which reinforces the internal consistency of the observed predictive trends.
Table 5 reports statistical test results for macro-averaged predictive metrics comparing TCG-CE with each baseline loss across the evaluated model–dataset configurations. Symbols indicate the strength of statistical evidence.
Consistent with the raw predictive results, the strongest statistical evidence appears in the comparisons between TCG-CE and Categorical Cross-Entropy, where several macro-level metrics reach the strongest reported significance level. Comparisons against Taylor Cross-Entropy also reach conventional significance thresholds for most of the reported predictive metrics. By contrast, comparisons with KL Divergence tend to be marginal, indicating larger average gains but weaker consistency across configurations.
For AUC, none of the comparisons reaches a strong conventional significance level, although marginal trends in favor of TCG-CE are visible. This is consistent with the smaller and less uniform AUC differences observed in the raw and effect-size analyses.
Figure 2 provides a consolidated view of average predictive performance across loss functions, aggregated over models for each dataset. This visualization complements Table 3 by emphasizing dataset-level tendencies and cross-model stability.
Across the reported macro-averaged performance metrics, TCG-CE consistently achieves the highest or near-highest mean values on both UCI-HAR and WISDM. The advantage is especially visible for macro F1-score, precision, recall, and Jaccard index. Categorical Cross-Entropy and Taylor Cross-Entropy remain competitive but generally lower on average, whereas KL Divergence shows greater variability and weaker mean results across several metrics despite occasional strength in AUC.
Overall, the predictive results indicate that TCG-CE provides reliable macro-level improvements across the evaluated datasets and model settings, which supports its use as a competitive training objective for compact HAR classification.
5.3. Empirical Runtime and Memory Measurements
This subsection reports empirical runtime and memory measurements obtained under the Google Colab CPU runtime used in this study for the evaluated model–loss function combinations. These measurements are included as observations and are not the primary basis of the paper’s contribution.
Table 6 reports the runtime and memory measurements for all model–dataset–loss function combinations. For each architecture and dataset, the lowest inference time and memory usage are highlighted.
Under the adopted execution environment, models trained with TCG-CE often showed lower or competitive observed inference-time and RSS values relative to the baselines. On UCI-HAR, TCG-CE attains the lowest inference time and memory usage for TinyGRU and TinyLSTM, while for TinyRNN it attains the lowest memory value and a runtime close to the best observed value. On WISDM, TCG-CE again yields the lowest time and memory values for TinyGRU and TinyLSTM and remains competitive for TinyRNN.
These measurements are reported as empirical observations under the adopted setup; they are not interpreted as evidence of a direct causal inference-efficiency advantage of the loss itself.
Table 7 summarizes pairwise mean differences, effect sizes, and confidence intervals for the empirical runtime and memory measurements. Negative mean differences indicate lower values for TCG-CE relative to the corresponding baseline.
For runtime, the largest average reduction appears in the comparison with Categorical Cross-Entropy. Comparisons with KL Divergence and Taylor Cross-Entropy also show lower mean runtime values for TCG-CE. For RSS memory, TCG-CE again shows lower mean values across all baseline comparisons, with the strongest reduction appearing relative to Categorical Cross-Entropy.
Table 8 reports the corresponding statistical test results for the empirical runtime and memory measurements.
The statistical results show more consistent differences for RSS than for runtime. Across the reported comparisons, RSS differences are significant for all three baselines, whereas runtime differences are smaller and more variable.
Figure 3 provides a consolidated view of the average runtime and memory measurements across loss functions and datasets. Hamming loss is also shown on an inverted axis to support joint visual comparison with metrics for which lower values are preferable.
Across both datasets, TCG-CE often shows lower or competitive observed runtime and memory values relative to Categorical Cross-Entropy and Taylor Cross-Entropy, while KL Divergence exhibits greater variability. At the same time, TCG-CE maintains lower Hamming loss on average. Within the adopted setup, the predictive improvements reported earlier are therefore not accompanied by an obvious deterioration in the observed runtime measurements.
Table 9 summarizes win rates across all model–dataset configurations for each reported metric. A win is recorded when a loss function attains the best value for a given metric under a given configuration.
Across the predictive metrics, TCG-CE has the highest win rate by a clear margin. It also attains the highest win rate for runtime and a perfect win rate for RSS among the reported configurations. These summaries further illustrate the consistency of the empirical trends visible in the raw results and pairwise comparisons.
5.4. Performance–Efficiency Trade-Off Analysis
This subsection presents predictive performance together with the empirical runtime and memory measurements in order to summarize the observed operating points under a common execution environment.
Figure 4 visualizes this relationship using slope charts that jointly plot macro F1-score against inference time (top row, logarithmic scale) and RSS memory usage (bottom row), with separate columns for UCI-HAR and WISDM. Within each panel, lines connect models ordered by increasing recurrent complexity (TinyRNN → TinyGRU → TinyLSTM), allowing for the inspection of how the predictive–runtime relationship varies with model capacity.
Across both datasets, TCG-CE tends to occupy favorable observed operating points, combining higher macro F1-scores with lower or competitive runtime and RSS values. On UCI-HAR, this contrast is especially visible relative to Categorical Cross-Entropy, which yields substantially larger runtime values without corresponding predictive gains. On WISDM, TCG-CE again combines the strongest predictive performance for TinyGRU and TinyLSTM with comparatively low measured runtime and memory values.
Figure 5 presents the relationship between macro F1-score and inference time after averaging across datasets for each model architecture. Each point represents the mean predictive–runtime operating point of a loss function for a given architecture, with error bars indicating variability across dataset–model combinations.
For TinyGRU and TinyLSTM, TCG-CE attains the highest average macro F1-score while remaining below most baselines in mean runtime. For TinyRNN, the runtime differences between TCG-CE and Taylor Cross-Entropy are relatively small, but TCG-CE still retains a visible advantage in macro F1-score. These aggregated comparisons suggest that the predictive gains of TCG-CE are not accompanied by worse observed runtime under the adopted measurement setup.
Figure 6 complements this view by plotting macro F1-score against average RSS memory usage for each loss function and model architecture.
For TinyGRU and TinyLSTM, TCG-CE combines the highest average macro F1-score with the lowest average RSS. For TinyRNN, TCG-CE again remains favorable, yielding the highest macro F1-score together with the lowest average RSS among the reported losses. These plots therefore reinforce the descriptive observation that TCG-CE occupies favorable predictive–memory operating points under the adopted execution environment.
Taken together, the experimental results show that TCG-CE provides the most consistent predictive improvements across the evaluated datasets and recurrent architectures. These gains are also accompanied by competitive observed runtime and RSS measurements. Since the evaluated losses are training objectives and do not introduce additional architectural modules into the evaluated models, these runtime and memory results should be interpreted as within-study empirical observations rather than as evidence of guaranteed loss-induced deployment savings.
6. Discussion
The results demonstrate that TCG-CE consistently improves macro-averaged predictive performance in compact HAR classification. Across the evaluated settings, the proposed loss most consistently improves macro F1-score, precision, recall, and Jaccard index, and reduces Hamming loss, relative to the baseline objectives, with the most pronounced gains on macro F1-score, precision, and recall. These gains are especially visible on WISDM and in the more capacity-limited recurrent settings.
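The macro-averaged metrics emphasized here weight every class equally regardless of its support, which is why they are sensitive to minority-class behavior. As a minimal pure-Python sketch of this averaging (a hypothetical helper for illustration, not the study's evaluation code), macro F1 can be computed as:

```python
def macro_f1(y_true, y_pred, classes):
    """Macro F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        # Per-class counts treating class c as the positive label.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # Each class contributes equally, independent of its frequency.
    return sum(f1s) / len(f1s)
```

Because each class contributes 1/|classes| to the average, a model that ignores a rare activity class is penalized as heavily as one that ignores a frequent class.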
A central empirical pattern is that the observed gains of TCG-CE are larger on WISDM than on UCI-HAR. This difference is noteworthy because the two datasets use different input representations. WISDM preserves segmented sensor windows, whereas UCI-HAR is used here in its engineered-feature form and reshaped only to maintain compatibility with the common model family. Under these conditions, the stronger gains on WISDM suggest that the behavior of the proposed loss may depend in part on how much local variation is retained in the input representation. Further analysis of dataset-level factors could help clarify this difference more fully.
The results also indicate that the effect of TCG-CE depends on model capacity. For TinyGRU and TinyLSTM, the proposed loss yields clear improvements across several predictive metrics. For TinyRNN on UCI-HAR, the gains are comparatively modest. This pattern suggests that the loss is not a substitute for representational capacity. Instead, it acts as a complementary training mechanism whose benefit becomes more visible when the underlying model has enough flexibility to make use of more selective gradient emphasis.
From a learning perspective, TCG-CE can be interpreted as a mild ambiguity-aware attenuation of categorical cross-entropy. By reducing the contribution of samples whose most probable class is already well separated from the nearest competitor, the loss preserves relatively greater emphasis on locally ambiguous predictions. In the present experiments, this behavior is associated with stronger balanced classification performance without adding trainable parameters or architectural modules. The results therefore support the practical value of the top-1/top-2 confidence gap as a simple prediction-derived modulation signal.
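The gap-based attenuation described above can be sketched in a few lines. The precise weighting used by TCG-CE is defined earlier in the paper; the form below, which scales each sample's cross-entropy by one minus the top-1/top-2 probability gap raised to a hypothetical focusing exponent `gamma`, is an illustrative assumption rather than the exact formulation:

```python
import math

def tcg_ce_sketch(probs, target, gamma=1.0):
    """Illustrative confidence-gap-modulated cross-entropy for one sample.

    probs:  predicted class probabilities (assumed to sum to 1)
    target: index of the true class
    gamma:  hypothetical exponent controlling attenuation strength
    """
    top1, top2 = sorted(probs, reverse=True)[:2]
    gap = top1 - top2                  # top-1/top-2 confidence gap in [0, 1]
    weight = (1.0 - gap) ** gamma      # well-separated prediction -> small weight
    ce = -math.log(probs[target])      # standard cross-entropy term
    return weight * ce

# A confidently separated prediction is attenuated relative to an ambiguous one.
sharp = tcg_ce_sketch([0.90, 0.05, 0.05], target=0)
close = tcg_ce_sketch([0.45, 0.40, 0.15], target=0)
```

Under this sketch, the ambiguous prediction (`close`) retains most of its cross-entropy contribution, while the well-separated one (`sharp`) is down-weighted, which is the qualitative behavior attributed to TCG-CE in the text.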
The compact recurrent model family also provides a useful capacity ladder for interpreting the results. TinyRNN, TinyGRU, and TinyLSTM share the same general architecture while differing in recurrent expressiveness. This makes it possible to examine the proposed loss under matched compact backbones rather than under unrelated architectures. The observed pattern across these models suggests that the proposed loss remains relevant across multiple recurrent capacities, while showing the clearest advantages in settings where compactness and class ambiguity interact more strongly.
A limitation of this study is that UCI-HAR is not inherently sequential in the form used here. Its 561-dimensional feature vectors are engineered descriptors rather than raw temporal windows. Accordingly, the recurrent models on that dataset should be interpreted as a controlled common modeling framework for loss-function comparison rather than as a claim that sequence modeling is the most appropriate architecture for engineered feature vectors. This limitation should be kept in mind when interpreting the smaller gains observed on UCI-HAR.
The empirical runtime and RSS measurements provide an additional descriptive view of model behavior. Models trained with TCG-CE often showed competitive observed runtime and memory values relative to the baselines. These measurements should nonetheless be interpreted carefully: because the compared losses alter only the training objective and leave the model architecture unchanged, the observed runtime and memory differences are within-study empirical observations under the adopted protocol, not evidence of guaranteed loss-induced deployment savings.
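The kind of measurement behind these observations can be approximated with standard operating-system facilities. The snippet below is a sketch under assumptions (a Unix-like system and a stand-in workload), not the study's actual measurement protocol: it times a call with `time.perf_counter` and reads the process's peak resident set size via `resource.getrusage`:

```python
import resource
import time

def measure(fn, *args):
    """Time a call and report the process's peak RSS so far.

    Note: ru_maxrss units are platform-dependent
    (kibibytes on Linux, bytes on macOS).
    """
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss

# Example with a stand-in workload in place of model inference.
out, secs, rss = measure(sum, range(1_000_000))
```

Since `ru_maxrss` is a high-water mark for the whole process, comparisons across runs are only meaningful under a fixed execution environment, which matches the within-study framing adopted above.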
The comparison with the baseline losses is also informative. Categorical Cross-Entropy remains a strong general-purpose objective, but in the present experiments, it is less competitive on the macro-averaged metrics that are most relevant to balanced activity recognition. Taylor Cross-Entropy is often competitive and in some cases close to TCG-CE, which indicates that modified cross-entropy formulations can provide meaningful benefits over the standard objective. KL Divergence shows weaker and more variable predictive behavior, which suggests that its distribution-matching emphasis is less well aligned with the balanced multi-class recognition goals examined here.
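For context on the strongest baseline, Taylor Cross-Entropy replaces the `-log(p)` term of cross-entropy with a truncated Taylor expansion around p = 1. The second-order truncation shown below is illustrative; the expansion order used in the paper's experiments is an assumption:

```python
def taylor_ce(p_true, order=2):
    """Truncated Taylor series of -log(p) around p = 1:
    -log(p) = sum_{k=1..inf} (1 - p)^k / k.
    """
    return sum((1.0 - p_true) ** k / k for k in range(1, order + 1))
```

Truncating the series bounds the loss for low-confidence predictions (unlike `-log(p)`, which diverges as p approaches 0), which is one way modified cross-entropy formulations alter gradient allocation without changing the model.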
Taken together, the findings indicate that TCG-CE is a competitive loss-function alternative for compact HAR classification when the goal is to improve macro-level predictive performance within a controlled lightweight recurrent model family. The strongest evidence in this study comes from the predictive analysis. Under the adopted evaluation setting, that evidence consistently supports the usefulness of confidence-gap modulation as a lightweight design choice for balanced HAR classification while also indicating that the magnitude of the gains is not uniform across all evaluated cases.