1. Introduction
Human activity recognition (HAR) using wearable sensors has become a central component of pervasive computing, digital health, rehabilitation, and behavior-aware analytics. At the same time, the field remains constrained by heterogeneous sensors, limited annotation budgets, and benchmark protocols that often fail to reflect deployment conditions [1, 2, 3].
Recent progress has largely moved in two directions. The first is scale: tera-scale self-supervised pretraining and large free-living datasets have substantially improved generalization across users, devices, and environments [4, 5, 6]. The second is robustness: cross-dataset, concept-invariant, user-generalizable, and zero-shot formulations explicitly target domain shifts that are unavoidable outside the laboratory [7, 8, 9, 10]. Yet these advances do not remove the practical bottleneck that many wearable HAR projects still begin with a small labeled seed set and a large unlabeled pool.
In realistic deployments, annotation scarcity interacts with another issue: the class distribution is rarely balanced. Frequent routine activities dominate the labeled subset, while minority or transitional activities are underrepresented, which biases uncertainty estimates and degrades downstream retraining. Comparative studies of imbalance mitigation in HAR confirm that sampling strategy materially affects both accuracy and minority sensitivity [11], and general reviews of wearable HAR repeatedly point to label scarcity, distribution shift, and deployment cost as coupled rather than isolated problems [1, 2].
Earlier sensor-based HAR studies further motivate this positioning. Demrozi et al. provided a broad survey of HAR using inertial, physiological, and environmental sensors, showing that sensor modality, application context, and computational constraints jointly shape model design [12]. Ahmed et al. demonstrated that hybrid feature selection can improve smartphone-sensor HAR by reducing redundant features while preserving discriminative accelerometer and gyroscope information [13]. Jalal et al. analyzed accelerometer and gyroscope measurements for physical life-log activity detection, emphasizing the practical value of inertial-signal characterization in daily activity settings [14]. Yin et al. showed that combining multiple smartphone sensors with machine-learning algorithms can improve activity detection on mobile platforms, supporting the need to consider multi-sensor redundancy and classifier behavior together [15]. These studies collectively reinforce the motivation of the present work: under wearable or smartphone-based sensing, data efficiency, sensor-feature redundancy, and class coverage must be addressed jointly rather than as independent issues.
Active learning offers a principled way to reduce labeling cost by querying the most informative unlabeled samples [16]. Modern batch active learning methods further emphasize diversity, geometry, and loss-aware selection [17, 18, 19]. However, classical active learning alone does not guarantee that the queried batch will improve minority coverage; in fact, under imbalance, it can reinforce existing biases, a problem already recognized in early active learning research [20]. More recent studies outside HAR show that class balancing can be embedded into active learning to mitigate this effect [21, 22], and diversified batch strategies for time-series classification likewise argue that informativeness must be controlled jointly with redundancy [23].
For wearable HAR, the methodological gap is therefore not simply the lack of another query heuristic. The gap lies in how to coordinate three steps inside one closed budgeted loop: correcting the labeled distribution, selecting a query batch that is simultaneously uncertain and representative, and retraining a deployable classifier without moving the computational burden to inference time. Recent wearable-focused multitask or weakly supervised approaches reduce label dependence in complementary ways [24, 25], but they do not directly solve the round-wise interaction between imbalance, redundancy, and batch acquisition that arises in pool-based active learning.
In response, we formulate CCUR-M as a budget-constrained operator framework for imbalanced active learning rather than as a purely empirical HAR pipeline. The mathematical contribution is the joint design of a median target class-size operator, a geometry-preserving balancing operator, and a convex uncertainty-representativeness acquisition functional within the same iterative budget. The contributions are fourfold. First, we define a closed-loop formulation that rebalances the labeled set toward a target class-size operator before each retraining round. Second, we combine cluster-preserving majority undersampling with minority-class conditional synthesis so that distribution correction respects geometric structure rather than raw frequency alone. Third, we construct a hybrid query score that couples minimum-confidence uncertainty with cluster-centered representativeness through a budget-aware weight schedule. Fourth, we show on PAMAP2, OPPORTUNITY, and USC-HAD that the resulting loop consistently improves final performance and yields its strongest gains on the most structurally difficult dataset.
2. Mathematical Preliminaries and Problem Setting
2.1. Notation, Label Budget, and Windowed Sensor Representation
We consider a pool-based active learning setting for windowed wearable sensor classification. The data are partitioned into an initial labeled set L_0, an unlabeled pool U_0, and a held-out test set. The learner operates for R rounds under a fixed annotation budget.
2.2. Class-Imbalance Modeling and Target Class-Size Operator
We denote by n_{r, c} the number of labeled samples from class c after round r. Rather than enforcing full balance, which may oversynthesize minority classes and over-prune majority structure, we define a conservative target class size tau_r = median{n_{r, 1}, …, n_{r, C}}, that is, the median class count.
The choice of a median target is deliberately conservative and has a simple optimality interpretation. For the class-count vector n_r = (n_{r, 1}, …, n_{r, C}), any median minimizes the total absolute displacement J(t) = sum_{c = 1}^C |n_{r, c}-t| among scalar targets t. Thus, compared with a mean-based target, tau_r is less affected by extremely frequent activities; compared with matching the maximum class count, it avoids excessive minority synthesis; and compared with purely cost-sensitive learning, it directly changes the empirical training distribution before the query-retrain step. Classes above tau_r are compressed, classes below tau_r are expanded, and classes already near the median are left unchanged, which maintains class support while reducing synthetic drift.
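The L1-optimality of the median target can be checked numerically. The following is a minimal sketch using only the Python standard library; the class counts are hypothetical and the function names are illustrative, not taken from the reference implementation.

```python
from statistics import median


def total_displacement(counts, t):
    """J(t) = sum_c |n_c - t|: total absolute displacement toward target t."""
    return sum(abs(n - t) for n in counts)


def median_target(counts):
    """Conservative target class size tau_r: the median class count."""
    return median(counts)


# Hypothetical long-tailed class counts for one round (illustrative only).
counts = [400, 120, 90, 35, 10]

tau = median_target(counts)          # 90
mean = sum(counts) / len(counts)     # 131.0

j_median = total_displacement(counts, tau)       # 475
j_mean = total_displacement(counts, mean)        # 538.0
j_max = total_displacement(counts, max(counts))  # 1345
```

In this example the mean target is pulled upward by the single very frequent class, and the maximum-count target would require far more minority synthesis; the median gives the smallest total displacement of the three.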
2.3. Uncertainty, Representativeness, and Query Selection
For any unlabeled sample x, the classifier produces posterior probabilities p(c | x) for c = 1, …, C. We define uncertainty by minimum confidence, u(x) = 1 - max_c p(c | x).
To represent coverage of the feature space, we cluster the current unlabeled embeddings and compute a cluster-centered representativeness score rho(x), defined in terms of the centroid of the cluster containing x and the within-cluster dispersion. The final query score is a convex combination of normalized uncertainty and representativeness, s_r(x) = lambda_r u_bar(x) + (1-lambda_r) rho_bar(x).
The round-dependent weight lambda_r increases as the active-learning process progresses and as the current labeled set becomes more imbalanced. In the implementation, lambda_r = min{lambda_max, lambda_min + (lambda_max-lambda_min) r/R + beta I_r}, where I_r = (max_c n_{r, c}-min_c n_{r, c})/(max_c n_{r, c} + epsilon), lambda_min = 0.20, lambda_max = 0.80, beta = 0.20, and epsilon prevents division by zero. The schedule gives early rounds more representativeness pressure to repair coverage and gives later or more imbalanced rounds more uncertainty pressure to refine the boundary. This connects the framework to uncertainty sampling [16], geometry-aware core-set selection [18], and combined uncertainty-diversity batch acquisition [19], while keeping the score algebraically simple enough for reproducible deployment.
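The weight schedule stated above is simple enough to write out directly. The sketch below transcribes the formula with the stated constants; the function names, the value of epsilon, and the convention that r runs from 0 to R are assumptions made for illustration.

```python
def imbalance_index(counts, eps=1e-8):
    """I_r = (max_c n_c - min_c n_c) / (max_c n_c + eps)."""
    return (max(counts) - min(counts)) / (max(counts) + eps)


def lambda_schedule(r, R, counts, lam_min=0.20, lam_max=0.80, beta=0.20, eps=1e-8):
    """lambda_r = min{lam_max, lam_min + (lam_max - lam_min) * r / R + beta * I_r}.

    r is the current round and R the total number of rounds; whether r starts
    at 0 or 1 is an assumption here (0-based in the examples below).
    """
    raw = lam_min + (lam_max - lam_min) * r / R + beta * imbalance_index(counts, eps)
    return min(lam_max, raw)


# Hypothetical class counts (illustrative only).
lam_start = lambda_schedule(0, 10, [50, 50, 50])   # balanced, first round
lam_mid = lambda_schedule(5, 10, [100, 10])        # imbalanced, mid-way
lam_end = lambda_schedule(10, 10, [50, 50, 50])    # final round, capped
```

With balanced counts the weight starts at lambda_min; the imbalance term can raise it earlier in the trajectory, and the outer minimum keeps it from exceeding lambda_max.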
2.4. Mathematical Rationale and Basic Properties
CCUR-M can be interpreted as an iteration of three operators on the current labeled distribution: a balancing operator B_tau, a scoring operator S_lambda, and a retraining operator A. This view clarifies the mathematical role of each module. B_tau reduces class-count dispersion before model fitting, S_lambda maps each unlabeled window to a bounded acquisition score, and A updates the classifier used to define the next uncertainty surface.
Proposition 1 (median target). For class counts n_r = (n_{r, 1}, …, n_{r, C}), every median of the multiset {n_{r, c}} is a minimizer of J(t) = sum_c |n_{r, c}-t|. Therefore, tau_r minimizes the total absolute resampling displacement required to move all classes toward one common target. This property explains why the median target is robust to long-tailed majority activities and less aggressive than maximum-count oversampling.
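Proof Sketch. J is convex and piecewise linear. At points where J is differentiable, its derivative equals the number of counts below t minus the number of counts above t; at a point t that coincides with one or more counts, the subdifferential is the interval between the left and right derivatives. Zero belongs to the subdifferential exactly when t is a median; hence, any median is an L1-optimal scalar target. □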
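The subgradient argument can be written out explicitly. The following worked derivation is a sketch; the counting functions a, b, and e are notation introduced here for illustration and do not appear elsewhere in the paper.

```latex
% Counting functions (illustrative notation):
%   a(t) = #{c : n_{r,c} < t},  b(t) = #{c : n_{r,c} > t},  e(t) = #{c : n_{r,c} = t}
\[
  J(t) = \sum_{c=1}^{C} \lvert n_{r,c} - t \rvert ,
  \qquad
  \partial J(t) = \bigl[\, a(t) - b(t) - e(t),\; a(t) - b(t) + e(t) \,\bigr].
\]
% Each term |n_{r,c} - t| contributes +1 if n_{r,c} < t, -1 if n_{r,c} > t,
% and the interval [-1, 1] if n_{r,c} = t.
\[
  0 \in \partial J(t)
  \iff \lvert a(t) - b(t) \rvert \le e(t)
  \iff a(t) \le \tfrac{C}{2} \ \text{and}\ b(t) \le \tfrac{C}{2},
\]
% using a(t) + b(t) + e(t) = C. The last condition says that at most half of
% the counts lie strictly below t and at most half lie strictly above t,
% which is the defining property of a median.
```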
Proposition 2 (bounded hybrid query score). If normalized uncertainty and representativeness are both in [0, 1] and lambda_r is in [0, 1], then s_r(x) = lambda_r u_bar(x) + (1-lambda_r) rho_bar(x) is also in [0, 1]. The additional cluster-diversity filter imposes a lower bound on batch coverage whenever the unlabeled pool contains sufficiently many non-empty clusters, thereby reducing the repeated selection of near-duplicate windows.
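The boundedness claim is easy to illustrate on a toy pool. The sketch below assumes that "minimum confidence" means the standard least-confidence measure, 1 minus the largest posterior, and that a constant score vector is normalized to zeros; both are assumptions, as are the function names and the toy values.

```python
def min_max(values):
    """Min-max normalize a list of scores to [0, 1] within one round."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant vector carries no ranking signal, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def least_confidence(probs):
    """u(x) = 1 - max_c p(c | x) (assumed reading of 'minimum confidence')."""
    return 1.0 - max(probs)


def hybrid_scores(posteriors, representativeness, lam):
    """s_r(x) = lam * u_bar(x) + (1 - lam) * rho_bar(x) for every pool sample."""
    u_bar = min_max([least_confidence(p) for p in posteriors])
    rho_bar = min_max(representativeness)
    return [lam * u + (1.0 - lam) * rho for u, rho in zip(u_bar, rho_bar)]


# Toy pool of three windows (hypothetical posteriors and raw representativeness).
posteriors = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
raw_rho = [0.2, 0.4, 1.0]
scores = hybrid_scores(posteriors, raw_rho, lam=0.5)
```

Because both normalized inputs lie in [0, 1] and the weights are non-negative and sum to one, every score in the result also lies in [0, 1], whatever value of lam in [0, 1] is used.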
Proposition 3 (controlled training imbalance). If the class-wise balancing step can meet the requested quota, the rebalanced training counts satisfy n_tilde_{r, c} = tau_r for corrected classes. If adversarial synthesis is replaced by the Gaussian fallback under extreme sparsity, the remaining discrepancy is bounded by the unsatisfied synthesis deficit. Thus, before each retraining step, the learner is fitted to a distribution whose imbalance is explicitly controlled rather than left as a by-product of the previous query batch.
These properties are not presented as universal convergence guarantees for nonconvex active learning. Instead, they provide a verifiable mathematical rationale for the target operator, the bounded acquisition functional, and the distribution-control mechanism used by CCUR-M.
3. A Budget-Aware Class-Balanced Active Learning Framework
3.1. Overall Closed-Loop Architecture
Figure 1 summarizes the proposed workflow. Each round begins from the currently labeled subset L_r, computes the target class size tau_r, rebalances L_r, retrains the classifier, scores the unlabeled pool U_r, queries the top batch under a diversity constraint, and then updates the labeled set. The loop repeats until the budget is exhausted.
3.2. Cluster-Preserving Majority Undersampling
To prevent the majority classes from dominating the decision surface, we apply cluster-preserving undersampling to classes with n_{r, c} > tau_r. For each such class, K-means is run with K_L clusters. Within each cluster, samples are ranked by distance to the centroid, and a quota-proportional subset is retained. This step differs from random undersampling because it preserves representative prototypes from multiple local modes, thereby reducing the risk of deleting structurally important but low-density regions.
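The retention rule can be sketched as follows for a single majority class. The example assumes that K-means has already been run, so cluster labels and centroids are given; it also assumes that the samples nearest each centroid are the ones kept and that each cluster keeps at least one sample. These choices and all names are illustrative, not the reference implementation.

```python
import math


def undersample_class(points, labels, centroids, target):
    """Return sorted indices of the samples retained for one majority class.

    Each cluster receives a quota proportional to its share of the class,
    and the samples closest to the cluster centroid fill that quota.
    Because quotas are rounded, the total kept may differ slightly from target.
    """
    n = len(points)
    if n <= target:
        return list(range(n))  # nothing to compress

    # Group sample indices by cluster label.
    clusters = {}
    for idx, k in enumerate(labels):
        clusters.setdefault(k, []).append(idx)

    keep = []
    for k, members in clusters.items():
        quota = max(1, round(target * len(members) / n))
        # Rank members by Euclidean distance to their own centroid.
        members.sort(key=lambda i: math.dist(points[i], centroids[k]))
        keep.extend(members[:quota])
    return sorted(keep)


# Hypothetical 2-D features for one class with two local modes.
points = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.0), (1.0, 0.0), (10.0, 10.0), (10.2, 10.0)]
labels = [0, 0, 0, 0, 1, 1]
centroids = [(0.4, 0.0), (10.05, 10.0)]
kept = undersample_class(points, labels, centroids, target=3)
```

Here the small second mode keeps one prototype instead of being deleted, which is the behavior that distinguishes this rule from random undersampling.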
3.3. Minority-Class cGAN Augmentation
For minority classes with n_{r, c} < tau_r, we use conditional synthesis to expand the class toward the target size, following the broader principle of synthetic minority over-sampling [26]. The implementation follows the conditional GAN paradigm [27, 28], but in a class-wise setting suitable for compact tabular feature vectors. A generator maps noise and class code to synthetic embeddings, while a discriminator learns to distinguish synthetic from real minority samples. Because adversarial training is unreliable under extreme scarcity, cGAN augmentation is activated only when at least six real samples are available for the class. When this minimum support condition is not met, the framework falls back to a Gaussian bootstrap centered at the observed minority mean with variance regularization. The fallback is therefore a stability safeguard rather than a separate performance claim: it prevents generator collapse and limits the risk that a very small minority class defines the decision boundary through unstable synthetic samples.
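The support gate and the fallback can be sketched without any GAN code. In the sketch below the cGAN is represented by a caller-supplied function, the Gaussian bootstrap uses a diagonal covariance, and "variance regularization" is read as adding a small floor to each per-feature variance; all three are assumptions for illustration, as are the names and the floor value.

```python
import random
from statistics import mean, pvariance

MIN_CGAN_SUPPORT = 6  # minimum number of real samples stated in the text


def gaussian_bootstrap(samples, n_new, var_floor=1e-3, seed=0):
    """Draw n_new synthetic vectors around the observed class mean.

    Assumption: independent features (diagonal covariance) with a variance
    floor acting as the regularizer.
    """
    rng = random.Random(seed)
    dims = list(zip(*samples))  # one tuple of values per feature
    mu = [mean(d) for d in dims]
    sigma = [(pvariance(d) + var_floor) ** 0.5 for d in dims]
    return [[rng.gauss(m, s) for m, s in zip(mu, sigma)] for _ in range(n_new)]


def expand_minority(samples, target, cgan_sampler=None):
    """Return synthetic samples that close the gap between len(samples) and target.

    cgan_sampler is a hypothetical callable (samples, n_new) -> list of vectors;
    it is only used when the class has enough real support.
    """
    deficit = max(0, target - len(samples))
    if deficit == 0:
        return []
    if len(samples) >= MIN_CGAN_SUPPORT and cgan_sampler is not None:
        return cgan_sampler(samples, deficit)
    return gaussian_bootstrap(samples, deficit)


# A class with three real samples falls back to the Gaussian bootstrap.
sparse_class = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
synthetic = expand_minority(sparse_class, target=10)
```

Keeping the gate in one place makes the fallback auditable: a round log can record, per class, whether adversarial synthesis or the bootstrap supplied the synthetic samples.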
3.4. Hybrid Query Construction Under Label Budget
After balancing, the model is retrained using XGBoost [29]. This choice is intentional. In the current setting, the most expensive operations (clustering, synthesis, and query scoring) occur during the active-learning rounds, not at test time. Using a structured gradient-boosted tree model therefore preserves fast inference while still exploiting non-linear interactions among extracted wearable features. In the reference implementation, the classifier uses 120 trees, maximum depth 4, learning rate 0.08, and histogram-based tree construction, which together provide a favorable trade-off between robustness and computational cost.
The query stage then evaluates the unlabeled pool. Predictive uncertainty u(x) is computed from the retrained classifier, representativeness rho(x) is estimated by clustering the unlabeled pool, and the combined score s_r(x) ranks candidate windows. Selection is batch-based rather than greedy. The top-ranked windows are filtered so that at least half of the batch is drawn from distinct clusters whenever possible. This rule operationalizes the intuition that high-uncertainty points are useful only if they also improve coverage of the pool geometry, echoing the motivations behind core-set and BADGE-type batch acquisition [18, 19] while remaining directly applicable to structured sensor windows.
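One way to realize the "at least half from distinct clusters" rule is a two-pass selection over the score ranking. The sketch below is one possible reading of that rule, not the reference implementation; the function name and the toy scores are hypothetical.

```python
import math


def select_batch(scores, clusters, batch_size):
    """Pick batch_size pool indices by score, with a cluster-diversity floor.

    Pass 1 takes the best-scoring window from each not-yet-used cluster until
    ceil(batch_size / 2) distinct clusters are covered (or the pool runs out
    of new clusters). Pass 2 fills the remaining slots purely by score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    need_distinct = math.ceil(batch_size / 2)

    chosen, seen_clusters = [], set()
    for i in order:
        if len(chosen) >= need_distinct:
            break
        if clusters[i] not in seen_clusters:
            chosen.append(i)
            seen_clusters.add(clusters[i])

    picked = set(chosen)
    for i in order:
        if len(chosen) >= batch_size:
            break
        if i not in picked:
            chosen.append(i)
            picked.add(i)
    return chosen


# Toy pool: the three highest scores all sit in cluster 0.
scores = [0.9, 0.85, 0.8, 0.5, 0.4, 0.3]
clusters = [0, 0, 0, 1, 2, 2]
batch = select_batch(scores, clusters, batch_size=3)
```

A plain top-3 selection would return three windows from cluster 0; the filtered batch trades the third-ranked window for the best window of another cluster.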
The budget-aware weight is implemented by the bounded schedule above with lambda_min = 0.20 and lambda_max = 0.80. This avoids the two degenerate cases: lambda_r = 1 reduces the method to pure uncertainty sampling, which may over-query a narrow boundary region, whereas lambda_r = 0 reduces it to pure representativeness sampling, which may select central but already easy windows. The schedule was therefore designed to move from exploration toward exploitation while remaining bounded and interpretable.
3.5. Complexity, Stability, and Deployability Analysis
Let N_L = |L_r|, N_U = |U_r|, d be the feature dimension, C be the number of classes, K_L and K_U be the class-wise and pool-wise cluster counts, I be the number of K-means iterations, T be the number of XGBoost trees, h be the tree depth, and E be the number of generative-training epochs. A single active-learning round has approximate cost O(I K_L N_L d) for class-wise undersampling, O(E N_min d) for minority synthesis over the affected minority samples, O(T N_B h) for retraining on the balanced set of size N_B, O(I K_U N_U d) for pool clustering, and O(N_U C) for posterior scoring. The memory cost is O((N_L + N_U)d + T 2^h) up to implementation constants.
The expensive operations are confined to annotation rounds and are not executed during ordinary inference. At deployment time, prediction for one window requires only the trained tree ensemble, with cost O(T h) and model memory proportional to the number of stored tree nodes. This distinction is important for wearable settings: the active-learning loop can be executed on a workstation or server, whereas the final classifier remains lightweight enough for edge-side or phone-side inference.
We do not report direct power or battery measurements in this study. The present complexity analysis therefore supports algorithmic deployability but does not replace hardware-specific energy profiling, which remains necessary before fully embedded deployment.
4. Experimental Design
4.1. Datasets and Preprocessing
We evaluate the framework on three public HAR benchmarks that reflect different difficulty profiles and annotation regimes (Table 1). PAMAP2 contains 9 subjects, 12 activities, 27-dimensional structured features, and 100 Hz sensing, with 512-sample windows and a step size of 100 [30]. OPPORTUNITY contains 4 subjects, 17 activities, 33-dimensional features, and more complex daily-living interactions, with 30-sample windows and a step size of 15 [31]. USC-HAD contains 14 subjects, 12 activities, 6-dimensional inertial features, and 100 Hz sensing, with 512-sample windows and a step size of 256 [32]. The experimental splits follow the available source benchmark files: PAMAP2 uses subjects 7 and 9 as test, USC-HAD uses subjects 13 and 14 as test, and OPPORTUNITY follows the ADL split used in the source benchmark files. Because the OPPORTUNITY protocol is not the only accepted split for that dataset, OPPORTUNITY results should be interpreted as evidence within this benchmark setting rather than as a universal state-of-the-art claim.
4.2. Baselines, Evaluation Metrics, and Implementation Details
The benchmark set includes representative baselines reported in the source study: SMKM, GDB-LC, SATL, ADESSA, and a RESAMPLE strategy. Because this study uses the available benchmark assets rather than a full reimplementation of every baseline, we preserve the original benchmark protocol and focus on the comparative evidence available in the source Table 1. We also clarify the relation to deep HAR models: CNN, LSTM, Transformer, and self-supervised models are important representation learners, but they address a different layer of the pipeline from the budgeted acquisition and class-distribution control studied here [3, 4, 33, 34, 35]. A full comparison against end-to-end deep baselines would require standardized sequence inputs and repeated retraining under identical active-learning budgets; this is outside the available benchmark assets and is listed as future work. Macro-F1 is used as the primary metric because it is sensitive to minority-class degradation and therefore more informative than accuracy under imbalance [11].
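The sensitivity of Macro-F1 to minority-class failure is visible in a small example. The sketch below computes the unweighted mean of per-class F1 scores in plain Python; the labels are hypothetical and do not come from any of the three benchmarks.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(y_true, y_pred):
    """Fraction of windows whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# A classifier that ignores the minority class entirely (hypothetical labels).
y_true = ["walk"] * 8 + ["transition"] * 2
y_pred = ["walk"] * 10

acc = accuracy(y_true, y_pred)   # 0.8
mf1 = macro_f1(y_true, y_pred)   # (16/18 + 0) / 2 = 4/9
```

Accuracy stays at 0.8 even though the minority class is never recognized, whereas Macro-F1 falls to about 0.44 because the zero F1 of the minority class counts as much as the majority class.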
4.3. Reproducible Pipeline and Software Assets
The budget analysis reports Macro-F1 across label ratios from 10% to 100%. The reference pipeline is deterministic up to the stated random seed and uses CSV-based inputs and exportable outputs. In addition to the real benchmark tables, the companion toolkit includes a synthetic demo dataset and an executable command-line pipeline so that the logic of balancing, query selection, and round-wise retraining can be reproduced without access to the original preprocessing assets.
All derived summary values in this paper come directly from the source workbook or the software-generated tables. In particular, final and average performance values are taken from the benchmark summary sheet, while ablation scores and label-budget curves are taken from the dedicated tables. This design minimizes transcription error and makes every number in the manuscript traceable to a specific result file.
For reproducibility, the reference configuration is specified as follows: XGBoost uses 120 trees, maximum depth 4, learning rate 0.08, histogram-based tree construction, and multiclass posterior probabilities; uncertainty is computed by minimum confidence; all uncertainty and representativeness scores are min-max normalized within each round; the query schedule uses lambda_min = 0.20, lambda_max = 0.80, and beta = 0.20; cGAN synthesis is disabled when a minority class has fewer than six real samples; and the budget curves are reported at label ratios from 10% to 100%.
Hyperparameters that were not varied in the reported benchmark are therefore treated as fixed design choices rather than retrospectively optimized parameters. This prevents the reported gains from being interpreted as the result of hidden tuning.
5. Results and Discussion
5.1. Final Comparison Across PAMAP2, OPPORTUNITY, and USC-HAD
Table 2 and Figure 2 report the final and average benchmark results. CCUR-M achieves the best final Macro-F1 on all three datasets: 0.9574 on PAMAP2, 0.6780 on OPPORTUNITY, and 0.6128 on USC-HAD. The corresponding gains over the strongest final baseline are +0.0272, +0.1205, and +0.0184, respectively. The OPPORTUNITY result is the clearest signal in the benchmark results: the improvement is not marginal, and it exceeds the final score of every baseline by a visibly larger margin than on the other two datasets.
Average performance across all budget stages reveals a more nuanced picture. CCUR-M still leads on PAMAP2 and OPPORTUNITY, with average gains of +0.0042 and +0.0629 over the best baseline. On USC-HAD, however, the framework does not lead in average Macro-F1: it trails the best baseline by 0.0234 on average despite winning at the final budget. This distinction is important. It shows that the proposed loop should not be interpreted as uniformly dominant at every annotation stage; rather, its strength on USC-HAD lies in later-stage consolidation after the labeled set has become sufficiently informative.
The dataset-specific pattern is intuitively plausible. PAMAP2 is comparatively structured and already admits strong performance under simpler resampling baselines, so the gain from a full closed loop is modest but consistent. OPPORTUNITY is more heterogeneous and long-tailed, making it precisely the regime in which balancing and query diversity should matter most. USC-HAD is smaller and cleaner, which reduces the room for early-stage exploration gains but still allows the closed loop to overtake the strongest baseline by the end.
5.2. Budget-Curve Analysis and Label Efficiency
Figure 3 shows the label-budget curves. On PAMAP2, CCUR-M starts from a strong 10% label ratio score of 0.9065 and climbs almost monotonically to 0.9574. The RESAMPLE baseline is competitive in the middle of the curve, even matching CCUR-M around 30%, but then plateaus below the proposed method. This indicates that once the easiest majority-class redundancy has been removed, further gains depend on query quality rather than resampling alone.
On OPPORTUNITY, the budget curves reveal the core contribution of the method. CCUR-M leads from the first budget stage (0.5353 at 10%) and maintains the lead throughout the entire annotation trajectory, reaching 0.6780 at 100%. The gap is not driven by a single late spike; it is sustained over nearly all rounds, which is consistent with the intended effect of coupling distribution correction and representative querying in a difficult long-tail setting. The strongest baseline at the final budget, GDB-LC, reaches 0.5575, and the strongest average baseline reaches 0.5229. Both remain substantially below the full loop.
USC-HAD exhibits a different dynamic. SATL is stronger in the early and middle rounds, and CCUR-M only draws level near 80% before finishing at 0.6128 versus 0.5944 for SATL. This delayed crossover suggests that the framework spends more of the early budget on coverage repair and minority stabilization than on immediate exploitation. In applications where the budget can be expanded across multiple rounds, such behavior may still be desirable; in extremely small-budget settings, however, alternative query schedules could be preferable.
5.3. Ablation Study and Module Coupling
The ablation study in Figure 4 shows that the gain cannot be attributed to a single component. On PAMAP2, the full loop outperforms the base variant by +0.1243 and exceeds every single ablation. The largest drops occur when cGAN augmentation or uncertainty-based selection is removed, implying that minority expansion and boundary awareness are especially important in the relatively high-performing structured regime.
On OPPORTUNITY, the full framework improves on the base variant by +0.1638. Removing representativeness causes the largest performance loss, followed closely by removing undersampling or uncertainty. This is strong evidence that pool coverage matters most when the dataset contains many redundant or locally clustered windows. A pure uncertainty strategy would spend too much budget on boundary refinement around already well-covered modes; a pure representativeness strategy would miss genuinely ambiguous transitions. The result supports the claim that the two signals should be combined rather than optimized in isolation.
USC-HAD shows the strongest relative ablation effect: the full framework improves on the base variant by +0.2143. The representativeness-free ablation is particularly weak, again indicating that cluster-aware batch construction is crucial when the early labeled set is sparse. At the same time, removing oversampling also causes a large drop, which suggests that even relatively clean datasets can suffer from minority under-coverage once the active learner repeatedly exploits easy classes.
5.4. Sensitivity, Comparative Scope, and Statistical Interpretation
The hybrid score is sensitive to the balance between uncertainty and representativeness, but the two limiting cases are both undesirable. A purely uncertainty-driven policy tends to select ambiguous windows concentrated near the current boundary, whereas a purely representativeness-driven policy may select central samples that improve coverage but contribute little boundary information. The bounded schedule used here is intended to avoid both extremes by emphasizing coverage in early rounds and uncertainty in later or more imbalanced rounds.
The ablation results provide indirect evidence for this design. Removing representativeness produces large drops on OPPORTUNITY and USC-HAD, indicating that diversity is important under heterogeneous or sparse labeled sets. Removing uncertainty also degrades performance, indicating that coverage alone is insufficient. We therefore interpret the schedule as a controlled compromise rather than as a globally optimal value of lambda_r.
The comparison with deep HAR architectures should also be interpreted carefully. CNNs, LSTMs, Transformers, and self-supervised encoders may learn stronger temporal representations than the structured features used here. The present contribution is orthogonal to that direction: CCUR-M defines how a labeled subset is balanced and queried under a fixed annotation budget. A natural extension is to apply the same acquisition rule on top of frozen or lightly tuned deep representations.
Finally, the available benchmark files do not contain per-seed outputs for every baseline, so formal paired significance testing or confidence intervals cannot be computed without rerunning all methods under the same random seeds. We therefore report effect sizes, budget curves, and ablation evidence, and we avoid claiming universal statistical dominance. The large OPPORTUNITY gain is the strongest empirical signal, whereas the smaller PAMAP2 and USC-HAD final gains should be treated as more protocol-dependent and requiring independent replication.
5.5. Cross-Dataset Interpretation and Application Implications
Taken together, the results support an interpretation of CCUR-M as a closed-loop correction mechanism rather than a single-shot query heuristic. The method is most valuable when three conditions coexist: the unlabeled pool is large, the labeled seed is class-imbalanced, and redundant windows are abundant. Under these conditions, balancing changes the effective training distribution, representative querying expands coverage, and uncertainty progressively sharpens the boundary.
This reading aligns with current research trends in wearable HAR. Large-scale pretraining and free-living datasets are pushing the field toward broader domain coverage [4, 5, 7], while concept-invariant, user-generalizable, and weakly supervised approaches seek better transfer across users and environments [6, 7, 9, 25]. In parallel, specialized batch active learning and multitask active learning are increasingly viewed as deployment tools rather than purely data-efficiency tools [23, 24]. The present results suggest that a budget-aware closed loop can complement these trends by serving as a practical layer between representation learning and annotation policy.
There are also clear limits. First, the current benchmark files are based on structured features and XGBoost retraining rather than an end-to-end sequence model. This keeps inference lightweight, but it may underuse recent CNN, LSTM, Transformer, and self-supervised representations [4, 33, 35]. Second, the OPPORTUNITY split is inherited from the source benchmark configuration, so the strongest result in this paper should be interpreted within that benchmark protocol rather than as a universal state-of-the-art claim. Third, the method assumes a closed-set label space and does not yet address unknown activities, continual drift, or personalization directly, although current literature shows these are becoming essential deployment requirements [10, 36, 37]. Fourth, the present study reports algorithmic complexity but not device-level energy, latency, or memory measurements.
These limits point to concrete extensions. One promising direction is to couple the current budget-aware query logic with a frozen or lightly tunable self-supervised backbone, so that active learning operates on stronger representations without requiring full end-to-end retraining [4, 6, 35]. Another is to replace the scalar budget schedule with an adaptive controller learned from round-wise performance and imbalance dynamics. Finally, open-set rejection, user-level personalization, and continual adaptation could be incorporated as downstream stages, allowing the framework to remain mathematically interpretable while addressing the realities of deployed wearable HAR [8, 9, 10, 36, 37].
6. Conclusions
6.1. Main Findings
This paper presents a mathematics-oriented algorithmic framework for budget-aware class-balanced active learning in wearable HAR. The central idea is simple but consequential: under label scarcity and class imbalance, balancing and querying should not be treated as separate preprocessing steps. They should be coupled inside the same round-wise loop, so that each queried sample is added to a distribution whose imbalance has been explicitly controlled.
Empirically, the framework achieves the best final Macro-F1 on PAMAP2, OPPORTUNITY, and USC-HAD, with the strongest evidence on OPPORTUNITY. Budget curves show that the advantage is most consistent when the dataset is heterogeneous and redundant, while ablation results demonstrate that cGAN-based minority expansion, cluster-preserving majority compression, uncertainty scoring, and representativeness each contribute to the final outcome.
6.2. Limitations
The study has several limitations. The benchmark is based on structured features and XGBoost rather than end-to-end temporal encoders; OPPORTUNITY follows the ADL split used in the source benchmark files; the available benchmark files do not support full paired statistical tests across all baselines; the method assumes a closed label set; cGAN synthesis remains fragile under extreme minority scarcity despite the Gaussian fallback; and no direct power or memory profiling was conducted on embedded wearable hardware. These limitations restrict the scope of the claims and motivate the future extensions below.
6.3. Future Work
Future work should integrate CCUR-M with self-supervised or Transformer-based representations, rerun all baselines under standardized cross-dataset protocols and repeated random seeds, replace the fixed lambda schedule with an adaptive controller, add open-set rejection for unseen activities, and evaluate energy and memory consumption on realistic mobile or wearable platforms. Even with these limitations, the current evidence supports the core conclusion that, under budget constraints, a closed loop that repairs the labeled distribution before each query-retrain cycle is more reliable than treating balancing, querying, and retraining as independent modules.