5.1. HAR Performance on Seen Classes
Table 3 summarizes the classification performance on seen classes for all three datasets. We compare the proposed TinyKAN-HAR model against classical baselines (kNN, SVM, Random Forest) and deep learning baselines (1D-CNN, LSTM, CNN-LSTM, Transformer).
As shown in Table 3, the proposed TinyKAN-HAR achieves accuracy above 96% on all three datasets, meeting the TinyML goal of high recognition performance from a compact model. On UCI HAR, TinyKAN-HAR reaches an overall accuracy of 98.3% and a macro-F1 of 98.0%, outperforming the best deep baseline (Transformer; 98.0% accuracy, 97.7% macro-F1) and clearly improving over classical methods such as SVM and Random Forest (96–97% accuracy).
On WISDM, TinyKAN-HAR attains 97.9% accuracy and 97.7% macro-F1, again slightly surpassing the Transformer baseline (97.6% accuracy) and CNN-LSTM (97.4% accuracy). The performance margin is smaller than on UCI HAR, suggesting that for this dataset most deep models saturate near the upper limit, but TinyKAN-HAR remains competitive while preserving interpretability.
On PAMAP2, which includes a richer set of activities and more heterogeneous sensor placements, TinyKAN-HAR still achieves 97.3% accuracy and 97.1% macro-F1, slightly ahead of the Transformer (97.1% accuracy) and CNN-LSTM (97.0% accuracy). These results confirm that the KAN-based feature extractor provides robust representations across diverse HAR settings.
5.2. Zero-Shot and Generalized Zero-Shot Performance
We next evaluate the zero-shot and generalized zero-shot capabilities of TinyKAN-HAR.
Table 4 reports the pure ZSL accuracy on unseen classes, along with the seen-class and unseen-class accuracies and their harmonic mean H in the generalized setting. We compare TinyKAN-HAR with zero-shot variants of the CNN, LSTM and Transformer baselines, in which the last hidden representation is mapped to the same semantic space described in Section 3.4.
On UCI HAR, TinyKAN-HAR achieves a pure zero-shot accuracy of 96.4% on unseen classes, well above the best baseline (Transformer + ZSL head, 93.2%). In the generalized setting, the seen-class and unseen-class accuracies reported in Table 4 yield a harmonic mean H above 96%. This indicates that the calibration strategy in Equation (40) successfully balances performance on seen and unseen classes, avoiding the strong bias towards seen classes observed in the baselines (e.g., CNN + ZSL with 97.0% seen accuracy but only 88.5% unseen accuracy and a correspondingly lower harmonic mean).
On PAMAP2, which includes a larger and more diverse set of activities, TinyKAN-HAR still reaches 96.0% pure ZSL accuracy, outperforming the Transformer-based ZSL baseline (92.0%). In the generalized setting, the seen-class and unseen-class accuracies reported in Table 4 again yield a harmonic mean H above 96%, indicating that the KAN-based semantic compatibility function (Equations (31)–(34)) generalizes well to unseen activities while retaining high performance on seen ones.
Which Unseen Activities Are Easier or Harder?
Qualitatively analyzing per-class unseen accuracies, we observe that locomotion-related unseen classes such as walking upstairs and descending stairs are relatively easy: TinyKAN-HAR correctly recognizes more than 97% of these examples when they are held out as unseen classes. This is consistent with the attribution maps where the model strongly focuses on periodic patterns in vertical acceleration and gyroscope signals.
In contrast, static or quasi-static unseen activities that are semantically similar (e.g., sitting, standing, lying) are more challenging, with unseen accuracies around 93–95%. In these cases, the semantic embeddings of the activities are close, and the sensor patterns differ only subtly. Nevertheless, TinyKAN-HAR remains above 94% accuracy on these harder unseen activities, which is reflected in the high overall unseen-class results reported in Table 4.
5.3. Robustness of the Calibration Factor
In the generalized zero-shot setting we calibrate the scores with the calibration factor in Equation (40). To verify that the reported zero-shot performance is not the result of an over-tuned hyperparameter, we perform a sensitivity analysis in which we fix the trained KAN-HAR model and sweep the calibration factor over a broad range. For each value we evaluate, on UCI HAR, the pure zero-shot accuracy on unseen classes, the seen and unseen accuracies in the generalized setting, and their harmonic mean H. The operating value of the factor is selected on the validation set and then applied unchanged to the test set.
Table 5 shows that KAN-HAR is robust to the choice of calibration factor within a relatively wide interval. Without calibration, the model is biased towards seen classes (98.4% seen accuracy and 92.0% unseen accuracy), resulting in a harmonic mean of 95.1%. Increasing the factor to 0.25 improves the unseen accuracy to 94.1% and raises the harmonic mean, while the seen accuracy remains above 98%. The value selected on the validation set achieves a consistent harmonic mean on both validation and test splits (Table 5), which indicates that the calibration generalizes and is not overfitted to a particular split. For the two larger values in the sweep, the harmonic mean remains within a narrow band between 96.6% and 96.7%, and the unseen accuracy stays above 96%. Overall, both unseen accuracy and H are above 96% over the upper part of the swept range, indicating that the claimed zero-shot performance is robust to the choice of calibration factor (see also Figure 2).
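The sensitivity analysis above can be illustrated with a minimal sketch. Equation (40) is not reproduced in this section, so the code assumes the common calibrated-stacking form, in which the factor (here `gamma`) is subtracted from the scores of seen classes before taking the argmax. The function names, the score arrays and the grid are hypothetical. The harmonic mean is taken as H = 2·s·u/(s + u), which reproduces the 95.1% quoted for 98.4% seen and 92.0% unseen accuracy.

```python
import numpy as np

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of seen and unseen accuracy (0 if both are 0)."""
    total = acc_seen + acc_unseen
    return 0.0 if total == 0 else 2.0 * acc_seen * acc_unseen / total

def calibrated_predict(scores, seen_mask, gamma):
    """Assumed calibrated-stacking rule: subtract gamma from seen-class scores."""
    adjusted = scores - gamma * seen_mask.astype(scores.dtype)
    return adjusted.argmax(axis=1)

def gzsl_metrics(scores, labels, seen_mask, gamma):
    """Seen accuracy, unseen accuracy and H for one value of gamma."""
    pred = calibrated_predict(scores, seen_mask, gamma)
    from_seen = seen_mask[labels]          # True where the true class is seen
    acc_s = float((pred[from_seen] == labels[from_seen]).mean())
    acc_u = float((pred[~from_seen] == labels[~from_seen]).mean())
    return acc_s, acc_u, harmonic_mean(acc_s, acc_u)

def select_gamma(val_scores, val_labels, seen_mask, grid):
    """Pick the gamma with the best validation H; reuse it unchanged on test."""
    h_values = [gzsl_metrics(val_scores, val_labels, seen_mask, g)[2] for g in grid]
    return float(grid[int(np.argmax(h_values))])

# Worked check with the uncalibrated accuracies quoted in the text.
print(round(harmonic_mean(98.4, 92.0), 1))  # 95.1
```

In this sketch the factor is chosen once on validation scores and then passed unchanged to `gzsl_metrics` on the test scores, mirroring the protocol described above.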
5.5. Case Studies and Visualization of Explanations
To make the explanations concrete, we present case studies for representative activities such as walking, sitting and ascending stairs. For each activity and each dataset, we select correctly classified examples and visualize:
the attribution matrix as a heatmap overlaid on the normalized sensor signals, highlighting which time-sensor pairs contributed most to the prediction;
the sensor-level relevance scores and group-level scores as bar plots, indicating which sensors and devices (phone vs. watch, accelerometer vs. gyroscope) dominated the decision;
the temporal relevance curve, showing the time intervals within the window where the model was most sensitive;
selected univariate spline functions and their derivatives for neurons with high class-specific activations.
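As a rough illustration of how the first three visualizations relate to each other, the sketch below aggregates a time-by-sensor attribution matrix into sensor-level, group-level and temporal relevance scores. The exact definitions used for these quantities are not given in this section, so the normalized absolute-attribution sums, the function name and the toy data are all assumptions.

```python
import numpy as np

def relevance_summaries(attr, groups):
    """Aggregate a (T, S) attribution matrix into sensor, group and temporal scores.

    Assumption: relevance is the absolute attribution, normalised to sum to 1.
    """
    mag = np.abs(attr)
    total = mag.sum()
    if total == 0:
        raise ValueError("attribution matrix is all zeros")
    sensor = mag.sum(axis=0) / total        # one score per sensor channel
    temporal = mag.sum(axis=1) / total      # one score per time step
    group = {name: float(sensor[idx].sum()) for name, idx in groups.items()}
    return sensor, group, temporal

# Hypothetical window: 4 time steps, 3 accelerometer + 3 gyroscope channels.
attr = np.array([
    [0.0, 0.0,  0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, -0.4, 0.0, 0.1, 0.0],
    [0.0, 0.0,  0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, -0.2, 0.0, 0.1, 0.0],
])
groups = {"accelerometer": [0, 1, 2], "gyroscope": [3, 4, 5]}
sensor, group, temporal = relevance_summaries(attr, groups)
print(group, int(temporal.argmax()))
```

The sensor and group scores would feed the bar plots, and the temporal vector would be drawn as the relevance curve over the window.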
For example, for walking on the UCI HAR dataset, the attribution heatmaps typically show high relevance on the vertical axis of the accelerometer and gyroscope during mid-window periodic oscillations, while the temporal relevance curve displays a regular sequence of peaks corresponding to gait cycles. In contrast, for sitting, the relevance is concentrated on low-frequency components of the accelerometer (gravity-related posture information) with relatively uniform temporal relevance, reflecting the static nature of the activity. For ascending stairs, we observe strong attributions on specific sensors around transient peaks associated with the lifting of the leg and body, and neurons whose univariate functions exhibit threshold-like behavior around medium-to-high pre-activation values, consistent with detecting more intense and asymmetric movements.
Figure 3 and Figure 4 illustrate these patterns. Figure 3 shows example attribution maps and sensor/temporal relevance plots for different activities, while Figure 4 displays several learned univariate functions and their class-wise activation profiles. Together, these visualizations show that the TinyKAN-HAR model not only achieves competitive performance but also provides multi-level explanations that connect input sensors, temporal dynamics and internal nonlinearities to the predicted activity labels.
5.7. Ablation Studies
To obtain a deeper understanding of the contribution of each architectural and deployment choice, we conduct an extensive ablation study on UCI HAR. For each variant, we measure overall accuracy, macro-F1, pure zero-shot accuracy, generalized zero-shot harmonic mean H, and TinyML deployment metrics (model size, peak RAM, latency and energy per inference) on the target microcontroller. Table 7 reports results for twenty different configurations, all derived from the same training and preprocessing pipeline described in Section 3.6 and Section 4.2. Zero-shot performance on PAMAP2 is reported separately in Table 4.
The first two rows of Table 7 correspond to the full TinyKAN-HAR architecture in its TinyML-ready int8 configuration and in its full-precision FP32 form. The int8 model is the configuration used in the main deployment experiments. On UCI HAR, it reaches an accuracy of 98.3% and macro-F1 of 98.0%, while maintaining a pure zero-shot accuracy of 96.4% and a generalized harmonic mean H of 96.7%. At the same time, it requires only 145 kB of flash and 26 kB of peak RAM, with a latency of 4.1 ms and an estimated energy of 320 µJ per inference. The FP32 counterpart slightly improves accuracy and ZSL metrics to 98.5% accuracy, 98.2% macro-F1, 96.8% pure zero-shot accuracy and 97.0% H, but at the cost of a fourfold increase in model size (580 kB), a more than threefold increase in RAM (92 kB), and more than three times slower inference (13.5 ms and over 1 mJ of energy). Comparing these two rows shows that TinyML-oriented quantization preserves the desired high accuracy and zero-shot performance while dramatically reducing resource usage.
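The fourfold size difference between the FP32 and int8 rows follows directly from storing each weight in one byte instead of four. The sketch below shows this with symmetric per-tensor quantization of a single weight matrix. This particular scheme is an assumption made for illustration; the quantization actually used is the one described in Section 3.6, and the matrix shape and names are hypothetical.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantisation: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(64, 32)).astype(np.float32)   # hypothetical layer
q, scale = quantize_int8(w)

print(w.nbytes / q.nbytes)                                    # storage ratio FP32 : int8
max_err = float(np.max(np.abs(w - dequantize(q, scale))))
print(max_err <= scale / 2 + 1e-6)                            # rounding error bound
```

The reconstruction error per weight is bounded by half a quantization step, which is one way to see why accuracy can change only slightly while storage drops by a factor of four.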
The third and fourth rows isolate the effect of the zero-shot module itself. Removing the ZSL-specific losses but keeping the semantic compatibility function at test time (“w/o ZSL losses”) has virtually no impact on standard HAR metrics, which remain very high (98.2% accuracy and 97.9% macro-F1), but it substantially harms zero-shot generalization: pure ZSL accuracy drops from 96.4% to 92.4% and the harmonic mean H drops from 96.7% to 94.3%. Disabling only the calibration of scores (“w/o calibrated scores”) yields slightly stronger ZSL behavior than “w/o ZSL losses” (94.1% pure ZSL accuracy and 95.8% H), but still clearly below the full model. Together, these two variants indicate that both the explicit ZSL losses and the calibration mechanism in the scoring function are needed to reach the >96% ZSL accuracy and >96% harmonic mean reported for the full TinyKAN-HAR on UCI HAR.
The fifth and sixth rows examine the semantic interface and the explainability regularizer. Removing the learned semantic projection layer and using a simpler mapping from KAN features to semantic vectors (“w/o semantic projection layer”) leads to 98.1% accuracy and 94.7% zero-shot accuracy. The drop in ZSL metrics relative to the full model suggests that the projection matrix adapts the latent space to the semantic manifold. By contrast, disabling the explainability regularizer while keeping the rest of the architecture unchanged (“w/o explainability regularizer”) yields 98.4% accuracy, 98.1% macro-F1 and 96.2% zero-shot accuracy. The differences with respect to the full model are minor, which indicates that the regularizer can improve the smoothness and interpretability of the univariate functions without sacrificing performance; the core predictive power comes primarily from the KAN structure and the ZSL loss rather than from additional regularization.
The next group of rows explores how the depth and width of the KAN feature extractor influence both recognition and resource consumption. The shallow configuration with a single KAN layer (“Shallow KAN”) still achieves strong HAR performance (97.7% accuracy, 97.3% macro-F1) and reasonable zero-shot metrics (94.5% zero-shot accuracy and 95.2% H), while reducing model size to 110 kB, RAM to 20 kB and latency to 3.2 ms. This variant may be preferable on extremely constrained devices, but it does not reach the >96% ZSL accuracy of the full model. At the other extreme, a deeper KAN with four layers and the same latent dimension (“Deep KAN”) slightly improves HAR and ZSL performance (98.4% accuracy, 98.1% macro-F1, 96.6% zero-shot accuracy, 96.9% H), but at the cost of 190 kB flash, 34 kB RAM and 5.6 ms latency. Narrowing the latent dimension while keeping three layers (“Narrow latent”) yields 97.9% accuracy and 95.1% zero-shot accuracy with a smaller model (128 kB, 23 kB RAM, 3.8 ms latency), whereas widening it (“Wide latent”) gives 98.4% accuracy and 96.5% ZSL accuracy but increases flash to 190 kB and latency to 5.0 ms. These four rows jointly indicate that the depth and latent dimension used in the full TinyKAN-HAR achieve a good balance between accuracy and TinyML friendliness.
Rows eleven and twelve examine the effect of spline resolution. The “Coarse spline” variant reduces the number of spline knots, shrinking the LUT memory and the overall model size to 132 kB and 24 kB RAM, with a latency of 3.8 ms. Accuracy remains high (98.1%) but zero-shot accuracy and harmonic mean decrease modestly to 95.6% and 96.0%, respectively. Conversely, the “Fine spline” variant increases the number of control points, leading to slightly better performance (98.4% accuracy, 96.7% zero-shot accuracy, 97.0% H), but requires larger LUTs and slightly more compute (165 kB flash, 28 kB RAM, 4.6 ms latency). These experiments indicate that spline resolution acts as a continuous knob that trades a few tenths of a percent of ZSL performance for tens of kilobytes of memory and a noticeable fraction of a millisecond of latency.
Rows thirteen and fourteen investigate the importance of LUT-based spline evaluation compared to direct computation of spline basis functions. The “w/o LUTs (direct spline evaluation)” variant keeps all other settings identical to the full int8 configuration but evaluates the univariate KAN functions on the fly. This retains high performance (98.4% accuracy, 96.6% zero-shot accuracy, 96.9% H) and slightly reduces flash usage to 140 kB, but more than doubles latency to 8.9 ms and nearly doubles energy to 620 µJ. The “Quantization only (no LUT)” variant uses int8 quantization but still computes splines directly; it sits between the full model and the no-LUT variant with 6.5 ms latency and 450 µJ energy. Comparing these rows with the full TinyKAN-HAR shows that LUTs are crucial for achieving very low latency and energy budgets in TinyML applications, while preserving the target zero-shot performance.
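The idea behind LUT-based evaluation can be sketched as follows: each univariate function is tabulated once on a uniform grid, and at inference time it is evaluated by a table lookup plus linear interpolation between the two neighbouring entries. The sketch uses floating-point tables and a hypothetical stand-in function `phi`; the integer arithmetic, table sizes and input ranges of the actual deployment are not specified here and are not modelled.

```python
import numpy as np

def build_lut(func, x_min, x_max, n_entries):
    """Tabulate a univariate function on a uniform grid."""
    grid = np.linspace(x_min, x_max, n_entries)
    return func(grid).astype(np.float32)

def lut_eval(lut, x, x_min, x_max):
    """Evaluate via table lookup plus linear interpolation between neighbours."""
    n = len(lut)
    pos = (np.clip(x, x_min, x_max) - x_min) / (x_max - x_min) * (n - 1)
    idx = np.minimum(pos.astype(np.int64), n - 2)   # left neighbour index
    frac = pos - idx                                # position inside the cell
    return lut[idx] * (1.0 - frac) + lut[idx + 1] * frac

def phi(x):
    """Hypothetical stand-in for one learned univariate function."""
    return np.tanh(x) + 0.1 * x ** 2

lut = build_lut(phi, -3.0, 3.0, 64)
x = np.linspace(-3.0, 3.0, 1001)
max_err = float(np.max(np.abs(lut_eval(lut, x, -3.0, 3.0) - phi(x))))
print(max_err)
```

With a uniform grid, the cost per evaluation is one index computation, two reads and one interpolation, independent of the spline order. The table length plays the role of the resolution knob discussed for the coarse and fine spline variants.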
Rows fifteen and sixteen focus on structured pruning applied on top of the quantized model. With 50% structured pruning (“Quant. + 50% structured pruning”), the model size is reduced to 110 kB and RAM to 22 kB, and latency drops to 3.0 ms and energy to 230 µJ. HAR accuracy remains at 98.0% and ZSL performance at 95.0% zero-shot accuracy and 95.8% H, indicating that moderate pruning yields a favorable trade-off between efficiency and accuracy. When the pruning rate is increased to 70% (“Quant. + 70% structured pruning”), the model shrinks further to 90 kB and 20 kB RAM with 2.4 ms latency and 180 µJ energy, but ZSL metrics degrade more noticeably (93.2% zero-shot accuracy and 94.5% H), even though overall accuracy remains above 97%. This suggests that aggressive pruning should be used with caution when zero-shot robustness is critical, whereas pruning around 50% offers a better balance.
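Structured pruning removes whole units rather than individual weights, so the remaining layers are simply smaller dense matrices. The sketch below illustrates this for one hidden layer. The pruning criterion is not stated in this section, so ranking units by the L2 norm of their incoming weights is an assumption, as are the layer shapes and the function name.

```python
import numpy as np

def prune_units(w_in, w_out, rate):
    """Remove the hidden units with the smallest L2 norm of incoming weights.

    w_in:  (hidden, inputs)  weights producing the hidden units
    w_out: (outputs, hidden) weights consuming the hidden units
    """
    n_hidden = w_in.shape[0]
    n_keep = max(1, int(round(n_hidden * (1.0 - rate))))
    norms = np.linalg.norm(w_in, axis=1)
    keep = np.sort(np.argsort(norms)[-n_keep:])     # strongest units, original order
    return w_in[keep], w_out[:, keep], keep

rng = np.random.default_rng(0)
w_in = rng.normal(size=(64, 32))     # hypothetical hidden layer
w_out = rng.normal(size=(6, 64))     # hypothetical output layer
p_in, p_out, keep = prune_units(w_in, w_out, rate=0.5)

before = w_in.size + w_out.size
after = p_in.size + p_out.size
print(before, after)
```

Because both the producing and the consuming matrix shrink, the parameter count of this toy pair halves at a 50% rate. The flash savings reported above are smaller in relative terms, since a deployed model also contains components that this sketch does not prune.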
The last four rows investigate data-related and regularization-related factors. The “Short window” configuration uses a reduced temporal window length T, which lowers RAM usage to 22 kB, reduces latency to 3.5 ms, and decreases energy to 270 µJ, while still achieving 97.8% accuracy and 94.9% zero-shot accuracy with 95.6% H. Conversely, the “Long window” configuration increases T, slightly improving the metrics to 98.5% accuracy, 96.8% zero-shot accuracy and 97.1% H, but at the price of 30 kB RAM and 5.2 ms latency. The two dropout variants show that moderate regularization is beneficial: with low dropout, accuracy is 98.2% and zero-shot accuracy 95.9%, whereas high dropout slightly harms both HAR and ZSL performance (97.6% accuracy, 94.0% zero-shot accuracy and 95.1% H) without affecting memory or latency.
5.7.1. Effect of Hybrid Semantic Embeddings
To assess how sensitive KAN-HAR is to the choice of semantic representation, we compare three variants of the zero-shot module: one using only manually defined attribute vectors (Attr), one using only textual embeddings obtained from a pretrained language model (Text), and one using a hybrid representation (Hybrid). In the hybrid case, each activity is represented by the concatenation of its attribute vector and textual embedding, followed by a learned linear projection into the semantic space used in Equations (31)–(34). The rest of the architecture and training pipeline remains unchanged. Table 8 reports pure zero-shot accuracy and generalized harmonic mean H for UCI HAR and PAMAP2.
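The hybrid representation described above amounts to a concatenation followed by a linear map. A minimal sketch is given below; the dimensions, the random stand-in data and the randomly initialized projection (learned in the actual model) are all hypothetical.

```python
import numpy as np

def hybrid_embeddings(attr_vecs, text_vecs, proj):
    """Concatenate attribute and text vectors, then project linearly.

    attr_vecs: (classes, d_attr), text_vecs: (classes, d_text),
    proj: (d_attr + d_text, d_sem) projection matrix (random here, learned in practice).
    """
    concat = np.concatenate([attr_vecs, text_vecs], axis=1)
    return concat @ proj

rng = np.random.default_rng(0)
n_classes, d_attr, d_text, d_sem = 6, 8, 16, 12     # hypothetical sizes
attr_vecs = rng.integers(0, 2, size=(n_classes, d_attr)).astype(np.float32)
text_vecs = rng.normal(size=(n_classes, d_text)).astype(np.float32)
proj = rng.normal(scale=0.1, size=(d_attr + d_text, d_sem)).astype(np.float32)

sem = hybrid_embeddings(attr_vecs, text_vecs, proj)
print(sem.shape)
```

Since the projection is linear, the result equals the sum of an attribute term and a text term, each with its own block of the projection matrix. The Attr and Text variants correspond to keeping only one of the two terms.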
As shown in Table 8, all three variants achieve strong performance with zero-shot accuracies above 95%, confirming that KAN-HAR is reasonably robust to the choice of semantic source. However, relying exclusively on manually defined attributes (Attr) yields the lowest scores of the three on UCI HAR, with slightly lower values still on PAMAP2. Using only textual embeddings (Text) improves performance to around 96.0% zero-shot accuracy and 96.3% harmonic mean on UCI HAR, suggesting that pretrained language models capture useful high-level relationships between activities. The best results are obtained with the hybrid representation (Hybrid), which consistently pushes the zero-shot accuracy above 96.0% on both datasets (96.8% on UCI HAR and 96.3% on PAMAP2) and raises the harmonic mean H to 97.1% and 96.6%, respectively. These improvements support the hypothesis that combining complementary sources of semantic information (structured attributes and data-driven textual embeddings) reduces mismatches between semantic and sensor spaces and leads to more reliable zero-shot generalization.
5.7.2. Isolating the Effect of Semantic Information
The strong performance on unseen classes suggests that semantic embeddings play a crucial role, but the previous experiments do not completely isolate their contribution. To verify that the zero-shot behavior is genuinely driven by meaningful semantics rather than incidental structure, we perform a controlled ablation in which we systematically degrade the semantic space. Starting from the best-performing hybrid representation in Table 8, we construct two additional variants: one where each activity is assigned a random embedding sampled from a standard Gaussian distribution (Random), and one where the semantic vectors are randomly permuted across activity labels (Shuffled), thus preserving the geometry of the space but breaking the alignment between embeddings and true labels. The feature extractor, loss functions and training schedule are kept fixed. Table 9 reports average seen-class accuracy together with pure zero-shot accuracy and generalized harmonic mean H for UCI HAR and PAMAP2.
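The two degraded semantic spaces can be constructed in a few lines. The sketch below assumes the class embeddings are stored as rows of a matrix. The text only says the vectors are randomly permuted, so rejecting permutations that leave any class with its own vector is an extra assumption made here to guarantee that the alignment is fully broken. Function names and the toy matrix are hypothetical.

```python
import numpy as np

def random_embeddings(sem, rng):
    """'Random' control: i.i.d. standard Gaussian vectors of the same shape."""
    return rng.standard_normal(sem.shape)

def shuffled_embeddings(sem, rng):
    """'Shuffled' control: permute rows so that no class keeps its own vector."""
    n = sem.shape[0]
    if n < 2:
        raise ValueError("need at least two classes to shuffle")
    while True:
        perm = rng.permutation(n)
        if not np.any(perm == np.arange(n)):        # reject fixed points
            return sem[perm], perm

rng = np.random.default_rng(0)
sem = np.arange(18, dtype=np.float64).reshape(6, 3)  # toy 6-class embedding matrix
rand = random_embeddings(sem, rng)
shuf, perm = shuffled_embeddings(sem, rng)
print(perm)
```

The shuffled variant keeps exactly the same set of vectors, and therefore the same pairwise distances, while the random variant discards the geometry as well.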
The results in Table 9 show that seen-class accuracy remains essentially unchanged at around 98% for all three configurations, confirming that the supervised component of KAN-HAR is largely insensitive to the particular choice of semantic vectors. In contrast, the zero-shot metrics collapse when the semantic space is randomized or misaligned. With meaningful hybrid embeddings, KAN-HAR achieves 96.8% zero-shot accuracy and 97.1% harmonic mean on UCI HAR, and 96.3%/96.6% on PAMAP2, consistent with Table 8. When the embeddings are replaced by random Gaussian vectors (Random), pure ZSL accuracy on UCI HAR drops to 21.4% and the harmonic mean H to 34.1%, with similar degradation on PAMAP2 (19.8% and 31.7%). Shuffling the semantic vectors across labels (Shuffled) yields slightly higher but still very poor zero-shot performance (25.2%/38.9% on UCI HAR and 23.5%/36.4% on PAMAP2), indicating that the geometry of the semantic space alone is insufficient if the activity-to-embedding correspondence is destroyed. This sharp contrast between the Hybrid configuration and the random or shuffled baselines provides strong evidence that genuine semantic information, rather than arbitrary vectors, is what enables KAN-HAR to generalize reliably to unseen activities.
5.8. TinyML Deployment Results
We now turn to on-device TinyML benchmarks of the proposed TinyKAN-HAR model and the main deep learning baselines. All measurements are obtained on the same microcontroller platform, a Cortex-M4F-class MCU with 256 kB of on-chip flash and 64 kB of SRAM, clocked at 80 MHz. Table 10 summarizes these results for the int8 versions of the TinyKAN-HAR, 1D-CNN, CNN-LSTM and Transformer architectures, together with their average accuracy across UCI HAR and PAMAP2 as reported in Section 5.1.
The first observation from Table 10 is that all TinyML models achieve high recognition performance, with average accuracy comfortably above 96% on both datasets. The 1D-CNN baseline is the most compact, with a flash footprint of 120 kB, peak RAM of 24 kB and an average latency of 3.5 ms, resulting in an estimated energy cost of 280 µJ per inference. The CNN-LSTM baseline provides a modest accuracy gain over the plain 1D-CNN (97.6% vs. 97.4%), but its recurrent component increases both memory and latency: its model size grows to 165 kB, peak RAM to 30 kB and latency to 6.1 ms, which raises the energy per inference by about 60% to 450 µJ. The Transformer Tiny model reaches the strongest performance among the purely conventional baselines (97.9% accuracy), but it is also the heaviest: its self-attention layers require 210 kB of flash, 36 kB of RAM, and around 7.8 ms per inference, leading to an energy cost of approximately 520 µJ.
In this context, the proposed TinyKAN-HAR Tiny configuration achieves the best trade-off between accuracy and resource consumption. With int8 quantization and LUT-based spline evaluation as described in Section 3.6, the model attains an average accuracy of 98.3%, outperforming the Transformer Tiny by about 0.4 percentage points and the 1D-CNN Tiny by almost 1 percentage point, while remaining well within the limits of a mid-range MCU. Its flash usage of 145 kB lies between 1D-CNN and CNN-LSTM, and is smaller than that of the Transformer by roughly 30%. The peak RAM of 26 kB is only slightly higher than the 1D-CNN baseline and significantly lower than the 36 kB required by the Transformer. The latency of 4.1 ms and energy of 320 µJ are marginally higher than those of the 1D-CNN model, but substantially better than the CNN-LSTM and Transformer baselines, which means TinyKAN-HAR remains suitable for real-time inference even with relatively short window strides.
The additional cost of spline evaluation is kept low by the LUT-based approximation, which replaces many floating-point operations with simple integer lookups and linear interpolation. As a result, the TinyKAN-HAR Tiny model is only about 0.6 ms slower than 1D-CNN Tiny, yet it offers higher HAR accuracy and, more importantly, substantially stronger zero-shot performance as shown in Table 4, where it achieves unseen-class accuracy above 96% and harmonic mean above 96% across datasets. This extra 0.6 ms latency is therefore repaid by improved generalization to unseen activities and richer explanatory capabilities.
Comparing TinyKAN-HAR Tiny with Transformer Tiny highlights a different trade-off. Both models reach very high HAR accuracy (98.3% vs. 97.9%), but TinyKAN-HAR does so with a significantly smaller memory footprint and shorter latency. The attention mechanism in the Transformer scales quadratically with the temporal window length and requires multiple projections per head, which inflates both flash and RAM usage; by contrast, TinyKAN-HAR controls capacity primarily through the number of KAN layers and latent dimensions, while the univariate spline functions remain inexpensive to evaluate on-device. In practical terms, this means that TinyKAN-HAR can be deployed on MCUs where the Transformer is either too large to fit in flash or too slow to meet real-time constraints, without sacrificing accuracy or zero-shot generalization.
Finally, when comparing TinyKAN-HAR Tiny to CNN-LSTM Tiny, the results show that TinyKAN-HAR provides similar or better accuracy and superior zero-shot performance while being both smaller and faster. The CNN-LSTM’s recurrent layer increases latency to more than 6 ms and energy to 450 µJ, whereas TinyKAN-HAR keeps latency close to 4 ms and energy near 320 µJ, thanks to its fully feed-forward structure and efficient quantized linear layers. This confirms that KAN-based architectures are not only competitive as feature extractors for HAR in the cloud, but also particularly well-suited for constrained TinyML deployments where achieving accuracy and harmonic mean above 96% must be balanced against strict limits on model size, RAM and energy consumption.
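The pairwise comparisons in this section can be summarized with a simple Pareto-dominance check over the figures quoted in the text (accuracy, flash, peak RAM, latency and energy per inference). The sketch below is only an illustrative way of reading the numbers; the `dominates` and `pareto_front` helpers are hypothetical and not part of the evaluation protocol.

```python
# Values quoted in the text: (accuracy %, flash kB, RAM kB, latency ms, energy µJ).
models = {
    "1D-CNN":      (97.4, 120, 24, 3.5, 280),
    "CNN-LSTM":    (97.6, 165, 30, 6.1, 450),
    "Transformer": (97.9, 210, 36, 7.8, 520),
    "TinyKAN-HAR": (98.3, 145, 26, 4.1, 320),
}

def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    acc_a, *cost_a = a
    acc_b, *cost_b = b
    no_worse = acc_a >= acc_b and all(x <= y for x, y in zip(cost_a, cost_b))
    better = acc_a > acc_b or any(x < y for x, y in zip(cost_a, cost_b))
    return no_worse and better

def pareto_front(table):
    """Names of the models that no other model dominates."""
    return [name for name, m in table.items()
            if not any(dominates(other, m) for other in table.values())]

print(pareto_front(models))  # ['1D-CNN', 'TinyKAN-HAR']
```

On these figures, TinyKAN-HAR dominates both CNN-LSTM and Transformer (higher accuracy at lower cost on every axis), while 1D-CNN remains on the front as the cheaper but less accurate option.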