4.1. FIRNN Performance
This subsection reports the predictive performance of the FIRNN under the four training objectives: baseline CE, LRP-TV, TRSI, and IG-TV. Each configuration was trained and evaluated 30 times with the same data split and hyperparameters. The test accuracy reported for each run was obtained from the model snapshot selected by early stopping on validation loss.
Thirty independent runs were used to estimate variability due to random initialization while keeping the computational cost of attribution-based training feasible. Since each regularized run requires repeated attribution computation, substantially increasing the number of runs would considerably increase the computational burden.
Table 2 provides an overview of test accuracies across runs. We present mean ± standard deviation (SD) as well as the median, as some attribution-regularized objectives lead to a small number of low-accuracy runs, which substantially influence the mean performance.
Across all experiments, both the baseline and IG-TV performed well, with narrow accuracy ranges and no failed runs (0/30 below 85%). Their mean and median accuracies were close to each other. By contrast, the LRP-TV results were highly skewed: the median remained relatively high (92.93%), but the mean dropped to 86.16% because of a substantial number of poor runs (minimum 53.87%; 8/30 below 85%). TRSI mitigated this problem (only 2/30 below 85%): its median was about 93.04%, with a slightly lower mean (91.08%) caused by the few low-accuracy runs.
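The gap between mean and median described above can be illustrated with a minimal sketch. The accuracies below are hypothetical and are not the values behind Table 2; the function name `summarize_runs` and the use of the sample standard deviation (n − 1) are assumptions, since the text does not state which SD estimator was used.

```python
import statistics

def summarize_runs(accuracies, fail_threshold=85.0):
    """Summarize per-run test accuracies (in percent).

    fail_threshold mirrors the 85% cut-off used in the text to count
    low-accuracy runs.
    """
    return {
        "mean": statistics.mean(accuracies),
        "sd": statistics.stdev(accuracies),  # sample SD (assumed estimator)
        "median": statistics.median(accuracies),
        "min": min(accuracies),
        "n_below": sum(a < fail_threshold for a in accuracies),
    }

# Hypothetical run accuracies: mostly stable, with two collapsed runs.
runs = [93.1, 92.8, 93.4, 92.9, 54.0, 93.0, 61.5, 92.7]
summary = summarize_runs(runs)
print(summary)
```

With a few collapsed runs, the mean is pulled well below the median while the median stays near the typical run, which is why both statistics are reported.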
Figure 1 illustrates the row-normalized confusion matrices for the best run of each configuration and shows where the remaining errors lie. In all cases, the dominant confusion is between the two locomotion classes: walk is most frequently confused with stairs-up (8.54% baseline; 8.07% IG-TV; 8.07% LRP-TV; 8.19% TRSI), while the reverse error (stairs-up → walk) is much less prevalent (1.73% baseline; 0.91% IG-TV; 0.73% LRP-TV; 0.91% TRSI). Confusions involving sit-up are rare for both locomotion classes (usually ≤2–3%), which suggests that the main difficulty lies in the walk vs. stairs-up distinction rather than in the three-way classification as a whole.
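For readers unfamiliar with row normalization, the following sketch shows how raw counts become the per-true-class percentages quoted above. The counts are hypothetical and chosen only to resemble the asymmetry between the two error directions; `row_normalize` is an illustrative helper, not code from the study.

```python
def row_normalize(confusion):
    """Convert raw counts to row percentages.

    Rows are true classes and columns are predicted classes, so each
    row of the result sums to 100 (or stays at 0 for an empty row).
    """
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([100.0 * c / total if total else 0.0 for c in row])
    return normalized

labels = ["walk", "sit-up", "stairs-up"]
# Hypothetical counts, not the data behind Figure 1.
counts = [
    [900, 15, 85],   # true walk
    [20, 950, 30],   # true sit-up
    [10, 15, 975],   # true stairs-up
]
percent = row_normalize(counts)
walk_to_stairs = percent[0][2]   # walk predicted as stairs-up
stairs_to_walk = percent[2][0]   # stairs-up predicted as walk
print(f"walk -> stairs-up: {walk_to_stairs:.2f}%")
print(f"stairs-up -> walk: {stairs_to_walk:.2f}%")
```

Because each row is normalized by its own class count, the two off-diagonal entries for a class pair need not be symmetric, which is the effect discussed in the text.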
Table 3 lists the best overall test accuracy achieved for each method and its corresponding per-class breakdown. Overall test accuracies above the diagonal correspond to the best achievable operating points. In these models, walk is consistently the lowest-accuracy class (89.01–89.94%), whereas sit-up and stairs-up are the highest-accuracy classes (94.59–95.90% and 96.54–97.73%, respectively), forming roughly middle- and high-accuracy groups. The shared class-wise structure of the best models provides a fair baseline for comparing the losses, LRP, and SMLRP methods in terms of global trends in explanation magnitudes, sensitivity to parameter choices, and the distribution of explanation statistics.
4.2. Effect on Relevance Smoothness
To confirm whether the attribution priors affected temporal explanations, we measured explanation roughness in terms of the mean first-order total variation (TV) of the channel-aggregated absolute relevance envelope (Equation (14)), where lower values indicate smoother temporal relevance. We present the single best run (with respect to test accuracy) for each configuration: FIRNN (Run 15), FIRNN+LRP-TV (Run 18), FIRNN+TRSI (Run 7), and FIRNN+IG-TV (Run 1).
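The roughness measure can be sketched as follows. Equation (14) is not reproduced in this section, so the exact aggregation and normalization are assumptions: here the envelope is taken as the sum of absolute relevance over channels at each time step, and the TV is averaged over the T − 1 consecutive differences. The function names and toy relevance maps are illustrative only.

```python
def relevance_envelope(relevance):
    """Channel-aggregated absolute relevance per time step.

    relevance: list of time steps, each a list of per-channel values.
    Aggregation by summing |R| over channels is an assumption.
    """
    return [sum(abs(r) for r in step) for step in relevance]

def mean_total_variation(envelope):
    """Mean absolute first-order difference of the envelope."""
    if len(envelope) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(envelope, envelope[1:])]
    return sum(diffs) / len(diffs)

# Toy relevance maps (4 time steps, 2 channels), purely illustrative.
smooth = [[0.1, -0.1], [0.15, -0.1], [0.2, -0.1], [0.2, -0.15]]
rough = [[0.1, -0.1], [0.5, -0.4], [0.0, 0.05], [0.4, -0.3]]

tv_smooth = mean_total_variation(relevance_envelope(smooth))
tv_rough = mean_total_variation(relevance_envelope(rough))
print(tv_smooth, tv_rough)
```

Under these assumptions, a slowly varying envelope yields a lower TV than an envelope that jumps between time steps, matching the reading that lower values indicate smoother temporal relevance.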
Table 4 provides the per-class and aggregated total variation (TV) values, scaled as TV × for better readability.
When compared to the baseline FIRNN, all three priors decrease the aggregated roughness score (“All classes”). FIRNN + TRSI achieves the highest reduction: vs. , i.e., a decrease in TV. FIRNN + LRP-TV decreases the aggregated TV from to ( decrease), and FIRNN + IG-TV reduces it to ( decrease). In this best-run comparison, TRSI produces relevance trajectories that are the smoothest among all methods when TV is considered at the same level of detail for the same number of iterations.
The smoothing effect varies for each class. TRSI has the highest reduction for Class 0 ( vs. , decrease), followed by Class 2 ( vs. , decrease), and Class 1 ( vs. , decrease). LRP-TV shows the highest reduction for Class 2 ( vs. , decrease) and lower reductions for Class 1 ( ) and Class 0 ( ). IG-TV has reductions similar to LRP-TV for Classes 0 and 2 ( and ) but with the lowest reduction for Class 1 ( ). As such, it is not ideal to rely only on an aggregated score, and we recommend reporting the per-class smoothness, with the corresponding 30-run temporal relevance roughness statistics reported in Table 5.
We performed a Kruskal–Wallis test on the all-class TV values from the 30 runs. The global test indicated a statistically significant difference between training objectives (p < 0.001). Pairwise Mann–Whitney tests with Holm correction showed that TRSI produced significantly lower all-class TV than the baseline (p < 0.001), IG-TV (p < 0.001), and LRP-TV (p < 0.001).
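The Holm step of this procedure can be sketched in a few lines. The raw p-values below are hypothetical placeholders, and the number of pairwise comparisons entering the correction is an assumption (only the three TRSI comparisons are shown); the function `holm_adjust` is an illustrative implementation of the standard Holm step-down adjustment, not the code used in the study.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order.

    The i-th smallest p-value (i = 0, 1, ...) is multiplied by (m - i),
    capped at 1, and forced to be non-decreasing along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw Mann-Whitney p-values for the three TRSI comparisons.
raw = {"TRSI vs CE": 0.0001, "TRSI vs IG-TV": 0.0002, "TRSI vs LRP-TV": 0.0003}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
for pair, p in adjusted.items():
    print(f"{pair}: adjusted p = {p:.4f}")
```

A comparison is declared significant when its adjusted p-value falls below the chosen level; if all six pairwise comparisons among the four objectives were tested, the multipliers would start at 6 rather than 3.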
The 30-run statistics confirm the trend observed in the best-run comparison. For the aggregated all-class score, TRSI reduces the mean TV from to , corresponding to a decrease relative to the baseline. The reductions obtained by IG-TV and LRP-TV are smaller: IG-TV decreases the all-class TV from to ( decrease), while LRP-TV decreases it to ( decrease). Thus, TRSI remains the strongest smoothing objective when the evaluation is averaged across repeated independent runs, not only when the best run is selected.
The class-wise results show the same pattern. TRSI gives the largest reduction for stairs-up ( vs. , decrease), followed by sit-up ( vs. , decrease) and walk ( vs. , decrease). IG-TV and LRP-TV provide more modest improvements, mostly below at the class level. Notably, LRP-TV slightly increases the mean roughness for sit-up compared with the baseline ( vs. ), despite improving the aggregated all-class score. This supports the need to report both aggregated and class-wise smoothness statistics.
4.3. Perturbation-Based Relevance Faithfulness
A perturbation-based relevance masking check was performed to determine whether the attribution maps identify behaviorally important temporal regions. For each trained model, attribution scores were computed on clean test windows, and selected temporal positions were masked before re-evaluation. For CE, LRP-TV, and TRSI, masking was guided by LRP relevance. For IG-TV, masking was guided by IG attribution, matching the attribution signal used by its temporal prior. The 20% masking results are reported in Table 6.
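The masking check can be sketched as below. Several details are assumptions because the text does not specify them: time steps are scored by channel-summed absolute attribution, masked positions are replaced by zeros (a mean or interpolated fill would be an equally plausible choice), and the random baseline masks the same number of positions. All function names and the toy window are illustrative.

```python
import random

def top_relevance_positions(relevance, fraction=0.2):
    """Indices of the most relevant time steps.

    Scores each time step by the channel-summed absolute attribution
    (an assumed scoring rule) and keeps the top `fraction` of steps.
    """
    scores = [sum(abs(r) for r in step) for step in relevance]
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])

def random_positions(length, fraction=0.2, rng=None):
    """Same number of positions as the top-relevance mask, chosen at random."""
    rng = rng or random.Random(0)
    k = max(1, round(fraction * length))
    return sorted(rng.sample(range(length), k))

def mask_window(window, positions, fill=0.0):
    """Return a copy of the window with the given time steps overwritten."""
    masked = [list(step) for step in window]
    for t in positions:
        masked[t] = [fill] * len(masked[t])
    return masked

# Toy window: 10 time steps, 2 channels; relevance peaks at steps 3 and 4.
window = [[float(t), float(-t)] for t in range(10)]
relevance = [[0.01, -0.02] for _ in range(10)]
relevance[3] = [0.9, -0.4]
relevance[4] = [0.5, 0.6]

top = top_relevance_positions(relevance)        # 20% of 10 steps -> 2 steps
rand = random_positions(len(window))
masked_top = mask_window(window, top)
masked_rand = mask_window(window, rand)
```

Both masked versions would then be passed through the trained model; a larger accuracy drop for `masked_top` than for `masked_rand` is the pattern the check looks for.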
Masking the most relevant temporal positions caused a substantially larger accuracy decrease than random masking for all evaluated training objectives. This supports the behavioral relevance of the attribution maps: temporal regions assigned high relevance had a stronger effect on the model decision than randomly selected regions. For TRSI, top-relevance masking reduced accuracy to 79.18%, whereas random masking preserved 89.23% accuracy, giving an additional degradation of 10.05 percentage points.
The effect was also consistent across attribution-regularized models: both IG-TV and TRSI showed clear separation between top-relevance and random masking. This indicates that the relevance signals shaped or used during temporal explanation regularization still correspond to decision-relevant temporal regions. Therefore, the smoothness improvements reported above are supported by a behavioral perturbation check, rather than by smoothness metrics alone.
4.4. Sensitivity to Sensor Noise
To probe test-time robustness, we evaluated the trained FIRNN variants under additive perturbations applied only at inference time (no retraining). For a test window , noise amplitude was scaled per channel using the test-set standard deviation and a relative noise level , i.e., the injected noise had channel-wise scale . We swept in steps of and computed window-level accuracy using the same sliding-window evaluation protocol as in training.
Because the injected noise is random, accuracy at a fixed depends on the specific noise realization. Therefore, for each noise level, we used a Monte Carlo evaluation: the full test procedure was repeated times with independent noise draws, and we report the mean accuracy across repetitions as . The standard deviation across repetitions, , provides an estimate of variability due to noise sampling.
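A minimal sketch of this Monte Carlo protocol for the white Gaussian case is given below. The symbol names are assumptions, since the original notation is not reproduced here: `alpha` denotes the relative noise level, `channel_std` the per-channel test-set standard deviation, and `repeats` the number of independent noise draws (the default of 20 is a placeholder, not the value used in the study). The classifier `toy_predict` and the data are hypothetical stand-ins for a trained FIRNN and the test windows.

```python
import random
import statistics

def add_scaled_noise(window, channel_std, alpha, rng):
    """Add white Gaussian noise with per-channel scale alpha * sigma_c."""
    return [
        [x + rng.gauss(0.0, alpha * s) for x, s in zip(step, channel_std)]
        for step in window
    ]

def monte_carlo_accuracy(predict, windows, labels, channel_std, alpha,
                         repeats=20, seed=0):
    """Mean and SD of accuracy (%) over independent noise realizations."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        correct = sum(
            predict(add_scaled_noise(w, channel_std, alpha, rng)) == y
            for w, y in zip(windows, labels)
        )
        accuracies.append(100.0 * correct / len(windows))
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def toy_predict(window):
    """Hypothetical classifier: sign of the summed first channel."""
    return int(sum(step[0] for step in window) > 0)

# Toy test set: two windows of 8 time steps and 2 channels each.
windows = [[[1.0, 0.0]] * 8, [[-1.0, 0.0]] * 8]
labels = [1, 0]
channel_std = [1.0, 0.5]

clean_mean, clean_sd = monte_carlo_accuracy(toy_predict, windows, labels,
                                            channel_std, alpha=0.0)
noisy_mean, noisy_sd = monte_carlo_accuracy(toy_predict, windows, labels,
                                            channel_std, alpha=3.0)
print(clean_mean, clean_sd, noisy_mean, noisy_sd)
```

Sweeping `alpha` over a grid and plotting the returned mean with its SD as an error band gives accuracy-versus-noise curves of the kind summarized in the following paragraphs; the band-limited and impulsive noise processes would replace `add_scaled_noise` with their own generators.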
We considered three noise processes motivated by wearable sensing conditions: (i) train-like vibration modeled as band-limited Gaussian noise (0.5–20 Hz) obtained via Butterworth band-pass filtering [31]; (ii) loose pocket or watch jitter modeled as impulsive burst noise with enhanced high-frequency content (high-pass filtered); and (iii) plain sensor noise modeled as additive white Gaussian noise.
Figure 2 shows mean accuracy as a function of the relative noise level for all four networks.
For all noise types, accuracy decreases approximately monotonically with increasing noise level. The curves for band-limited vibration noise lie close together over most of the sweep, particularly at larger noise levels, which indicates that this perturbation regime does not clearly separate the models. Conversely, for impulsive pocket noise and white Gaussian noise, the TRSI-regularized model retains higher accuracy at most moderate-to-high noise levels, with the LRP-TV and IG-TV curves showing smaller but largely positive offsets. Overall, the results indicate that enforcing temporal structure in explanations during training can translate into improved robustness under realistic test-time perturbations, with the strongest effect observed for the TRSI objective in this setting.
At representative noise levels , the TRSI-regularized model yields the most consistent robustness gains across perturbation types, particularly at stronger corruption. Under train-like band-limited vibration, TRSI achieves the highest mean accuracy at moderate noise ( at and at ), improving over the baseline by and percentage points, respectively. At , both temporal-prior objectives remain clearly above the baseline (LRP-TV: , TRSI: vs. baseline ), indicating that explanation-based temporal constraints improve resilience even when vibration noise becomes severe; the difference between the temporal prior and TRSI objectives at is within the Monte Carlo variability, suggesting near-tied performance in this regime.
For impulsive pocket or watch jitter and additive white Gaussian noise, TRSI provides the clearest advantage at high noise. At , TRSI reaches (pocket) and (Gaussian), compared with and for the baseline, corresponding to improvements of and percentage points. By contrast, IG regularization is consistently substantially weaker, and its performance often closely matches the unregularized baseline.
Table 7 provides a summary ranking of the four objectives within each noise type over increasing noise ranges. A consistent pattern emerges for white Gaussian and impulsive pocket noise: TRSI ranks first across all intervals, whereas the baseline degrades the most with increasing corruption and ranks third in the highest-noise interval. For these two perturbations, IG-TV tends to be the weakest, often ranking last at low-to-moderate noise levels and improving only under strong corruption, where the performance of the competing methods converges.
For train-like band-limited vibration, the ranking differences between the methods are less pronounced, and the separation between accuracy curves collapses at high noise, as in Figure 2. For moderate levels of vibration ( – ), all three methods share rank 1. This reinforces the interpretation that TRSI yields the most consistently strong robustness profile, although with the proposed training set the differences are more pronounced in the Gaussian and pocket scenarios than under vibration.
The robustness evaluation was intentionally based on controlled perturbation models because this allows the same trained networks to be compared under identical and repeatable corruption levels. The three selected noise processes represent different degradation regimes: band-limited vibration, impulsive jitter, and broadband sensor noise. Moreover, the perturbation amplitude was scaled by the empirical standard deviation of each channel, so the corruption was not imposed with the same absolute magnitude on all sensor axes. Real-device degradation, missing samples, and sensor dropout represent separate deployment scenarios and are therefore natural extensions of the present robustness protocol.