A Budget-Aware Class-Balanced Active Learning Framework for Imbalanced Wearable Human Activity Recognition

Liu, Xinrui; Wang, Guodong

doi:10.3390/math14111932

by Xinrui Liu * and Guodong Wang

College of Computer Science & Technology, Fushan Campus, Qingdao University, Qingdao 266071, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(11), 1932; https://doi.org/10.3390/math14111932

Submission received: 3 April 2026 / Revised: 22 May 2026 / Accepted: 29 May 2026 / Published: 2 June 2026

Abstract

Human activity recognition (HAR) from wearable sensors increasingly faces a dual bottleneck: obtaining labels is expensive, and the labeled subset is often class-imbalanced and redundant. We address this problem with a budget-aware class-balanced active learning framework, termed CCUR-M, that closes the loop between adaptive class balancing, hybrid batch querying, and lightweight retraining. At each round, the labeled subset is rebalanced toward a median target class size through cluster-preserving majority undersampling and minority-class conditional synthesis, after which a hybrid query score combines minimum-confidence uncertainty with cluster-centered representativeness under a round-dependent budget weight. An XGBoost classifier is retrained on the rebalanced set, and the procedure is iterated until the annotation budget is exhausted. We evaluate the method on three public wearable HAR benchmarks with different difficulty profiles: PAMAP2, OPPORTUNITY, and USC-HAD. CCUR-M achieves the best final Macro-F1 on all three datasets, reaching 0.9574, 0.6780, and 0.6128, respectively. The largest final and average gains over the strongest baseline occur on OPPORTUNITY (+0.1205 final, +0.0629 average), while USC-HAD reveals a later-stage rather than early-stage advantage. Ablation experiments show that no single module explains the overall gain; instead, balancing, uncertainty, and representativeness act synergistically, with the full loop outperforming the base variant by +0.1243, +0.1638, and +0.2143 on PAMAP2, OPPORTUNITY, and USC-HAD. These results support a mathematically interpretable view of active learning for imbalanced wearable time series: the key benefit arises from coupling distribution correction and query design within the same budgeted training loop.

Keywords:

active learning; class imbalance; wearable sensing; human activity recognition; budget-aware learning; time-series classification

MSC:

68T05; 62H30

1. Introduction

Human activity recognition (HAR) using wearable sensors has become a central component of pervasive computing, digital health, rehabilitation, and behavior-aware analytics. At the same time, the field remains constrained by heterogeneous sensors, limited annotation budgets, and benchmark protocols that often fail to reflect deployment conditions [1,2,3].

Recent progress has largely moved in two directions. The first is scale: tera-scale self-supervised pretraining and large free-living datasets have substantially improved generalization across users, devices, and environments [4,5,6]. The second is robustness: cross-dataset, concept-invariant, user-generalizable, and zero-shot formulations explicitly target domain shifts that are unavoidable outside the laboratory [7,8,9,10]. Yet these advances do not remove the practical bottleneck that many wearable HAR projects still begin with a small labeled seed set and a large unlabeled pool.

In realistic deployments, annotation scarcity interacts with another issue: the class distribution is rarely balanced. Frequent routine activities dominate the labeled subset, while minority or transitional activities are underrepresented, which biases uncertainty estimates and degrades downstream retraining. Comparative studies of imbalance mitigation in HAR confirm that sampling strategy materially affects both accuracy and minority sensitivity [11], and general reviews of wearable HAR repeatedly point to label scarcity, distribution shift, and deployment cost as coupled rather than isolated problems [1,2].

Earlier sensor-based HAR studies further motivate this positioning. Demrozi et al. provided a broad survey of HAR using inertial, physiological, and environmental sensors, showing that sensor modality, application context, and computational constraints jointly shape model design [12]. Ahmed et al. demonstrated that hybrid feature selection can improve smartphone-sensor HAR by reducing redundant features while preserving discriminative accelerometer and gyroscope information [13]. Jalal et al. analyzed accelerometer and gyroscope measurements for physical life-log activity detection, emphasizing the practical value of inertial-signal characterization in daily activity settings [14]. Yin et al. showed that combining multiple smartphone sensors with machine-learning algorithms can improve activity detection on mobile platforms, supporting the need to consider multi-sensor redundancy and classifier behavior together [15]. These studies collectively reinforce the motivation of the present work: under wearable or smartphone-based sensing, data efficiency, sensor-feature redundancy, and class coverage must be addressed jointly rather than as independent issues.

Active learning offers a principled way to reduce labeling cost by querying the most informative unlabeled samples [16]. Modern batch active learning methods further emphasize diversity, geometry, and loss-aware selection [17,18,19]. However, classical active learning alone does not guarantee that the queried batch will improve minority coverage; in fact, under imbalance, it can reinforce existing biases, a problem already recognized in early active learning research [20]. More recent studies outside HAR show that class balancing can be embedded into active learning to mitigate this effect [21,22], and diversified batch strategies for time-series classification likewise argue that informativeness must be controlled jointly with redundancy [23].

For wearable HAR, the methodological gap is therefore not simply the lack of another query heuristic. The gap lies in how to coordinate three steps inside one closed budgeted loop: correcting the labeled distribution, selecting a query batch that is simultaneously uncertain and representative, and retraining a deployable classifier without moving the computational burden to inference time. Recent wearable-focused multitask or weakly supervised approaches reduce label dependence in complementary ways [24,25], but they do not directly solve the round-wise interaction between imbalance, redundancy, and batch acquisition that arises in pool-based active learning.

In response, we formulate CCUR-M as a budget-constrained operator framework for imbalanced active learning rather than as a purely empirical HAR pipeline. The mathematical contribution is the joint design of a median target class-size operator, a geometry-preserving balancing operator, and a convex uncertainty-representativeness acquisition functional within the same iterative budget. The contributions are fourfold. First, we define a closed-loop formulation that rebalances the labeled set toward a target class-size operator before each retraining round. Second, we combine cluster-preserving majority undersampling with minority-class conditional synthesis so that distribution correction respects geometric structure rather than raw frequency alone. Third, we construct a hybrid query score that couples minimum-confidence uncertainty with cluster-centered representativeness through a budget-aware weight schedule. Fourth, we show on PAMAP2, OPPORTUNITY, and USC-HAD that the resulting loop consistently improves final performance and yields its strongest gains on the most structurally difficult dataset.

2. Mathematical Preliminaries and Problem Setting

2.1. Notation, Label Budget, and Windowed Sensor Representation

We consider a pool-based active learning setting for windowed wearable sensor classification.

Let $D = L_0 \cup U_0 \cup T$, where $L_0$ is the initial labeled set, $U_0$ is the unlabeled pool, and $T$ is the held-out test set. The learner operates for $R$ rounds under a fixed annotation budget $B$.

2.2. Class-Imbalance Modeling and Target Class-Size Operator

We denote by $n_c^{(r)}$ (also written $n_{r,c}$) the number of labeled samples from class $c$ after round $r$. Rather than enforcing full balance, which may oversynthesize minority classes and over-prune majority structure, we define a conservative target class size $\tau_r$ as the median class count:

$$\tau_r = \operatorname{median}\{\, n_c^{(r)} : c = 1, \dots, C \,\}.$$

The choice of a median target is deliberately conservative and has a simple optimality interpretation. For the class-count vector $n_r = (n_{r,1}, \dots, n_{r,C})$, any median minimizes the total absolute displacement $J(t) = \sum_{c=1}^{C} |n_{r,c} - t|$ among scalar targets $t$. Thus, compared with a mean-based target, $\tau_r$ is less affected by extremely frequent activities; compared with matching the maximum class count, it avoids excessive minority synthesis; and compared with purely cost-sensitive learning, it directly changes the empirical training distribution before the query-retrain step. Classes above $\tau_r$ are compressed, classes below $\tau_r$ are expanded, and classes already near the median are left unchanged, which maintains class support while reducing synthetic drift.
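The L1-optimality of the median target can be checked numerically. The following is a minimal sketch using only the Python standard library; the class counts are hypothetical and chosen for illustration, and the function names are not taken from the paper's toolkit.

```python
from statistics import median


def median_target(class_counts):
    """Return the median class count, used as the target class size tau_r."""
    return median(class_counts)


def total_displacement(class_counts, t):
    """J(t): total absolute displacement needed to move every class to size t."""
    return sum(abs(n - t) for n in class_counts)


# Hypothetical long-tailed class counts for one round (not from the paper).
counts = [400, 120, 90, 60, 15]

tau = median_target(counts)              # 90
mean_target = sum(counts) / len(counts)  # 137.0
max_target = max(counts)                 # 400

# The median gives the smallest total displacement of the three targets.
j_median = total_displacement(counts, tau)        # 445
j_mean = total_displacement(counts, mean_target)  # 526.0
j_max = total_displacement(counts, max_target)    # 1315
```

In this toy case the dominant class (400 samples) pulls the mean well above the median, which illustrates why a mean-based target would demand more synthesis for the three smallest classes.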

2.3. Uncertainty, Representativeness, and Query Selection

For any unlabeled sample $x$, the classifier produces posterior probabilities $p_\theta(c \mid x)$. We define uncertainty by minimum confidence:

$$u(x) = 1 - \max_{c} p_\theta(c \mid x).$$

To represent coverage of the feature space, we cluster the current unlabeled embeddings and compute a cluster-centered representativeness score

$$r(x) = \exp\left(-\frac{\lVert z(x) - \mu_k(x) \rVert^{2}}{\sigma_k(x) + \varepsilon}\right).$$

Here $z(x)$ is the embedding of $x$, $\mu_k(x)$ is the centroid of the cluster containing $x$, $\sigma_k(x)$ is the within-cluster dispersion, and $\varepsilon$ is a small positive constant. The final query score is a convex combination of normalized uncertainty and representativeness:

$$s_r(x) = \lambda_r \cdot u_{\mathrm{norm}}(x) + (1 - \lambda_r) \cdot r_{\mathrm{norm}}(x).$$
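The three quantities above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the reference implementation: the function names, the toy inputs, and the choice to map a constant score vector to zeros during min-max normalization are assumptions made here. Each sample is passed together with the centroid and dispersion of its own cluster.

```python
import math


def min_confidence_uncertainty(probs):
    """u(x) = 1 - max_c p(c | x) for one posterior vector."""
    return 1.0 - max(probs)


def representativeness(z, centroid, dispersion, eps=1e-8):
    """r(x) = exp(-||z - mu||^2 / (sigma + eps)) for one embedding."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, centroid))
    return math.exp(-sq_dist / (dispersion + eps))


def min_max(values):
    """Rescale to [0, 1]; a constant vector maps to zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(posteriors, embeddings, centroids, dispersions, lam):
    """s(x) = lam * u_norm(x) + (1 - lam) * r_norm(x) for every pool sample."""
    u = min_max([min_confidence_uncertainty(p) for p in posteriors])
    r = min_max([
        representativeness(z, mu, sigma)
        for z, mu, sigma in zip(embeddings, centroids, dispersions)
    ])
    return [lam * ui + (1.0 - lam) * ri for ui, ri in zip(u, r)]


# Toy pool of three windows that all belong to one cluster centered at the origin.
posteriors = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
embeddings = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
centroids = [[0.0, 0.0]] * 3
dispersions = [1.0] * 3
scores = hybrid_scores(posteriors, embeddings, centroids, dispersions, lam=0.5)
```

With an equal weight, the second window ranks first in this toy pool: it is the most uncertain and still reasonably close to its centroid, whereas the first window is central but confidently classified.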

The round-dependent weight $\lambda_r$ increases as the active-learning process progresses and as the current labeled set becomes more imbalanced. In the implementation,

$$\lambda_r = \min\left\{\lambda_{\max},\ \lambda_{\min} + (\lambda_{\max} - \lambda_{\min})\,\frac{r}{R} + \beta I_r\right\}, \qquad I_r = \frac{\max_c n_{r,c} - \min_c n_{r,c}}{\max_c n_{r,c} + \varepsilon},$$

where $\lambda_{\min} = 0.20$, $\lambda_{\max} = 0.80$, $\beta = 0.20$, and $\varepsilon$ prevents division by zero. The schedule gives early rounds more representativeness pressure to repair coverage and gives later or more imbalanced rounds more uncertainty pressure to refine the boundary. This connects the framework to uncertainty sampling [16], geometry-aware core-set selection [18], and combined uncertainty-diversity batch acquisition [19], while keeping the score algebraically simple enough for reproducible deployment.
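The schedule can be written directly from the formula. The sketch below uses the constants stated in the text; the function names are illustrative, and whether rounds are indexed from 0 or from 1 is not specified in the source, so the example simply evaluates a few values of r.

```python
def imbalance_index(class_counts, eps=1e-8):
    """I_r = (max_c n - min_c n) / (max_c n + eps)."""
    return (max(class_counts) - min(class_counts)) / (max(class_counts) + eps)


def lambda_schedule(r, R, class_counts,
                    lam_min=0.20, lam_max=0.80, beta=0.20, eps=1e-8):
    """Bounded weight: linear in r/R, raised by imbalance, capped at lam_max."""
    i_r = imbalance_index(class_counts, eps)
    return min(lam_max, lam_min + (lam_max - lam_min) * r / R + beta * i_r)


# Hypothetical class counts: a balanced labeled set and a strongly imbalanced one.
balanced = [50, 50, 50]
skewed = [100, 10]

lam_start_balanced = lambda_schedule(0, 10, balanced)   # about 0.20
lam_mid_balanced = lambda_schedule(5, 10, balanced)     # about 0.50
lam_start_skewed = lambda_schedule(0, 10, skewed)       # about 0.38
lam_end_skewed = lambda_schedule(10, 10, skewed)        # capped at 0.80
```

The skewed example shows the imbalance term at work: with an imbalance index of about 0.9, the weight starts near 0.38 instead of 0.20, so uncertainty already receives more emphasis in the first round.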

2.4. Mathematical Rationale and Basic Properties

CCUR-M can be interpreted as an iteration of three operators on the current labeled distribution: a balancing operator $B_\tau$, a scoring operator $S_\lambda$, and a retraining operator $A$. This view clarifies the mathematical role of each module. $B_\tau$ reduces class-count dispersion before model fitting, $S_\lambda$ maps each unlabeled window to a bounded acquisition score, and $A$ updates the classifier used to define the next uncertainty surface.

Proposition 1 (median target).

For class counts $n_r = (n_{r,1}, \dots, n_{r,C})$, every median of the multiset $\{n_{r,c}\}$ is a minimizer of $J(t) = \sum_{c} |n_{r,c} - t|$. Therefore, $\tau_r$ minimizes the total absolute resampling displacement required to move all classes toward one common target. This property explains why the median target is robust to long-tailed majority activities and less aggressive than maximum-count oversampling.

Proof Sketch.

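Wherever $J(t)$ is differentiable, its derivative equals the number of counts below $t$ minus the number of counts above $t$; where $t$ coincides with one or more counts, the subdifferential is the interval between the left and right derivatives. The subdifferential contains zero exactly when $t$ is a median. Since $J$ is convex, any median is therefore an $L_1$-optimal scalar target. □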
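The argument can be made explicit with a short computation. The counting symbols $B(t)$, $A(t)$, and $E(t)$ (the numbers of class counts strictly below, strictly above, and equal to $t$) are introduced here only for illustration and do not appear in the source; they satisfy $B(t) + A(t) + E(t) = C$.

```latex
\partial J(t) = \bigl[\, B(t) - A(t) - E(t),\;\; B(t) - A(t) + E(t) \,\bigr],
\qquad
0 \in \partial J(t)
\;\iff\; |B(t) - A(t)| \le E(t)
\;\iff\; B(t) \le \tfrac{C}{2} \ \text{ and } \ A(t) \le \tfrac{C}{2}.
```

The last condition says that at most half of the class counts lie strictly below $t$ and at most half lie strictly above it, which is the defining property of a median.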

Proposition 2 (bounded hybrid query score).

If the normalized uncertainty $\bar{u}(x)$ and representativeness $\bar{\rho}(x)$ (the normalized scores of Section 2.3) both lie in $[0, 1]$ and $\lambda_r \in [0, 1]$, then $s_r(x) = \lambda_r \bar{u}(x) + (1 - \lambda_r)\bar{\rho}(x)$ also lies in $[0, 1]$, because a convex combination of two numbers cannot leave an interval that contains both. The additional cluster-diversity filter imposes a lower bound on batch coverage whenever the unlabeled pool contains sufficiently many non-empty clusters, thereby reducing the repeated selection of near-duplicate windows.

Proposition 3 (controlled training imbalance).

If the class-wise balancing step can meet the requested quota, the rebalanced training counts satisfy $\tilde{n}_{r,c} = \tau_r$ for corrected classes. If adversarial synthesis is replaced by the Gaussian fallback under extreme sparsity, the remaining discrepancy is bounded by the unsatisfied synthesis deficit. Thus, before each retraining step, the learner is fitted to a distribution whose imbalance is explicitly controlled rather than left as a by-product of the previous query batch.

These properties are not presented as universal convergence guarantees for nonconvex active learning. Instead, they provide a verifiable mathematical rationale for the target operator, the bounded acquisition functional, and the distribution-control mechanism used by CCUR-M.

3. A Budget-Aware Class-Balanced Active Learning Framework

3.1. Overall Closed-Loop Architecture

Figure 1 summarizes the proposed workflow. Each round begins from the currently labeled subset $L_r$, computes the target class size $\tau_r$, rebalances $L_r$, retrains the classifier, scores the unlabeled pool $U_r$, queries the top batch under a diversity constraint, and then updates the labeled set. The loop repeats until the budget is exhausted.
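The round structure described above can be sketched as a small driver function. This is a skeleton under stated assumptions, not the paper's implementation: the four callables (`rebalance`, `train`, `select_batch`, `oracle`) are hypothetical interfaces supplied by the caller, and the final refit after the last query is an assumption added here so that the returned model has seen every acquired label.

```python
def run_closed_loop(labeled, pool, oracle, rounds, batch_size,
                    rebalance, train, select_batch):
    """Iterate rebalance -> retrain -> select -> label for a fixed number of rounds.

    labeled: list of (x, y) pairs; pool: list of unlabeled x.
    select_batch returns indices into the current pool.
    """
    labeled, pool = list(labeled), list(pool)
    for r in range(rounds):
        if not pool:
            break
        balanced = rebalance(labeled)   # move class counts toward the target size
        model = train(balanced)         # refit on the rebalanced set
        picked = select_batch(model, pool, labeled, r, batch_size)
        picked_set = set(picked)
        # Query the oracle for the selected windows and move them out of the pool.
        labeled.extend((pool[i], oracle(pool[i])) for i in picked)
        pool = [x for i, x in enumerate(pool) if i not in picked_set]
    # Assumption: one last refit so the returned model uses all acquired labels.
    model = train(rebalance(labeled))
    return model, labeled, pool


# Toy run with stand-in components (all hypothetical).
final_model, final_labeled, final_pool = run_closed_loop(
    labeled=[(0, 0), (1, 1)],
    pool=list(range(2, 10)),
    oracle=lambda x: x % 2,
    rounds=3,
    batch_size=2,
    rebalance=lambda data: data,
    train=lambda data: len(data),
    select_batch=lambda model, pool, labeled, r, b: list(range(min(b, len(pool)))),
)
```

Passing the components as callables keeps the loop itself independent of the balancing, classifier, and scoring choices, which mirrors the operator view of Section 2.4.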

3.2. Cluster-Preserving Majority Undersampling

To prevent the majority classes from dominating the decision surface, we apply cluster-preserving undersampling to classes with $n_c^{(r)} > \tau_r$. For each such class, K-means is run with

$$k_c = \min\left(\left|\sqrt{n_c^{(r)}}\right|,\ \lfloor \tau_r \rfloor\right)$$

clusters. Within each cluster, samples are ranked by distance to the centroid, and a quota-proportional subset is retained. This step differs from random undersampling because it preserves representative prototypes from multiple local modes, thereby reducing the risk of deleting structurally important but low-density regions.
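The retention step can be sketched once cluster assignments are available. The sketch below assumes that K-means has already been run elsewhere and that its assignments and centroids are passed in; it also assumes that the square root in the cluster-count rule is rounded down, that the per-class target is the floor of the median, that quotas are rounded by the largest-remainder rule, and that the samples nearest to each centroid are the ones retained. None of these details is stated explicitly in the text, and all names are illustrative.

```python
import math


def n_clusters(n_c, tau):
    """k_c = min(sqrt(n_c), floor(tau)), rounded down and kept at least 1 (assumption)."""
    return max(1, min(math.isqrt(n_c), math.floor(tau)))


def cluster_quotas(sizes, target):
    """Split `target` across clusters in proportion to size (largest remainder)."""
    total = sum(sizes)
    quotas = [target * s // total for s in sizes]
    remainders = [target * s % total for s in sizes]
    leftover = target - sum(quotas)
    by_remainder = sorted(range(len(sizes)), key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        quotas[i] += 1
    return quotas


def undersample_class(points, assignments, centroids, tau):
    """Keep floor(tau) points of one majority class, nearest-to-centroid first."""
    target = math.floor(tau)
    if len(points) <= target:
        return list(points)
    clusters = {}
    for p, k in zip(points, assignments):
        clusters.setdefault(k, []).append(p)
    keys = sorted(clusters)
    quotas = cluster_quotas([len(clusters[k]) for k in keys], target)
    kept = []
    for k, q in zip(keys, quotas):
        ranked = sorted(clusters[k], key=lambda p: math.dist(p, centroids[k]))
        kept.extend(ranked[:q])
    return kept


# Toy majority class with a dense mode near 0 and a sparse mode near 10.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (10.0,), (10.5,)]
assignments = [0, 0, 0, 0, 1, 1]
centroids = {0: (0.15,), 1: (10.25,)}
kept = undersample_class(points, assignments, centroids, tau=3)
```

In the toy example the sparse mode near 10 keeps one representative even though the class is cut from six samples to three, which is the behavior that distinguishes this step from uniform random undersampling.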

3.3. Minority-Class cGAN Augmentation

For minority classes with n_{r, c} < tau_r, we use conditional synthesis to expand the class toward the target size, following the broader principle of synthetic minority over-sampling [26]. The implementation follows the conditional GAN paradigm [27,28], but in a class-wise setting suitable for compact tabular feature vectors. A generator maps noise and class code to synthetic embeddings, while a discriminator learns to distinguish synthetic from real minority samples. Because adversarial training is unreliable under extreme scarcity, cGAN augmentation is activated only when at least six real samples are available for the class. When this minimum support condition is not met, the framework falls back to a Gaussian bootstrap centered at the observed minority mean with variance regularization. The fallback is therefore a stability safeguard rather than a separate performance claim: it prevents generator collapse and limits the risk that a very small minority class defines the decision boundary through unstable synthetic samples.
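The support-threshold logic and the Gaussian fallback can be sketched as follows. The six-sample threshold comes from the text; everything else is an assumption made for illustration: the diagonal covariance, the variance floor used as "variance regularization", the `cgan_sampler` callable standing in for a trained conditional generator, and the function names.

```python
import math
import random
from statistics import fmean, pvariance

MIN_REAL_SAMPLES = 6  # cGAN synthesis is used only at or above this support


def gaussian_bootstrap(samples, n_new, var_floor=1e-3, rng=None):
    """Draw n_new vectors from a diagonal Gaussian centered at the sample mean."""
    rng = rng or random.Random(0)
    dims = list(zip(*samples))
    means = [fmean(d) for d in dims]
    # Variance floor keeps the spread positive even for one or two samples.
    stds = [math.sqrt(pvariance(d) + var_floor) for d in dims]
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(n_new)]


def synthesize_minority(samples, tau, cgan_sampler=None, **fallback_kwargs):
    """Return synthetic vectors that lift one minority class toward floor(tau)."""
    deficit = max(0, math.floor(tau) - len(samples))
    if deficit == 0:
        return []
    if len(samples) >= MIN_REAL_SAMPLES and cgan_sampler is not None:
        return cgan_sampler(samples, deficit)  # hypothetical trained generator
    return gaussian_bootstrap(samples, deficit, **fallback_kwargs)


# Three real samples: below the threshold, so the Gaussian fallback is used.
scarce = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
synthetic = synthesize_minority(scarce, tau=5)
```

Seeding the fallback generator keeps the sketch deterministic, in line with the statement in Section 4.3 that the reference pipeline is deterministic up to the stated random seed.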

3.4. Hybrid Query Construction Under Label Budget

After balancing, the model is retrained using XGBoost [29]. This choice is intentional. In the current setting, the most expensive operations—clustering, synthesis, and query scoring—occur during the active-learning rounds, not at test time. Using a structured gradient-boosted tree model therefore preserves fast inference while still exploiting non-linear interactions among extracted wearable features. In the reference implementation, the classifier uses 120 trees, maximum depth 4, learning rate 0.08, and histogram-based tree construction, which together provide a favorable trade-off between robustness and computational cost.
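For orientation, the stated configuration can be written as a keyword dictionary. The parameter names below follow the scikit-learn wrapper of the XGBoost library (`XGBClassifier`), and the `multi:softprob` objective is an assumption that matches the multiclass posterior probabilities mentioned in Section 4.3; the random seed is not given in the text and is therefore omitted.

```python
# Keyword arguments mirroring the reference configuration described in the text.
XGB_PARAMS = {
    "n_estimators": 120,            # number of trees (boosting rounds)
    "max_depth": 4,                 # maximum tree depth
    "learning_rate": 0.08,
    "tree_method": "hist",          # histogram-based tree construction
    "objective": "multi:softprob",  # assumption: multiclass posterior output
}

# Intended use, assuming the xgboost package is installed:
#   from xgboost import XGBClassifier
#   clf = XGBClassifier(**XGB_PARAMS).fit(X_balanced, y_balanced)
#   posteriors = clf.predict_proba(X_pool)
```

Here `X_balanced`, `y_balanced`, and `X_pool` are placeholder names for the rebalanced training features, their labels, and the unlabeled pool features.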

The query stage then evaluates the unlabeled pool. Predictive uncertainty u(x) is computed from the retrained classifier, representativeness r(x) is estimated by clustering the unlabeled pool, and the combined score s_r(x) ranks candidate windows. Selection is batch-based rather than greedy. The top-ranked windows are filtered so that at least half of the batch is drawn from distinct clusters whenever possible. This rule operationalizes the intuition that high-uncertainty points are useful only if they also improve coverage of the pool geometry, echoing the motivations behind core-set and BADGE-type batch acquisition [18,19] while remaining directly applicable to structured sensor windows.
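One way to read the diversity rule is sketched below: at least half of the batch (rounded up) must come from pairwise distinct clusters, and the remaining slots are filled purely by score. This two-pass procedure is an interpretation adopted here for illustration; the text does not specify how the filter is implemented, and the function name is hypothetical.

```python
import math


def select_diverse_batch(scores, cluster_ids, batch_size):
    """Return pool indices: ceil(b/2) from distinct clusters, the rest by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    need_distinct = math.ceil(batch_size / 2)

    chosen, seen_clusters = [], set()
    # Pass 1: take the best-scoring window of each cluster not used so far.
    for i in order:
        if len(chosen) >= need_distinct:
            break
        if cluster_ids[i] not in seen_clusters:
            chosen.append(i)
            seen_clusters.add(cluster_ids[i])

    # Pass 2: fill the remaining slots by score alone.
    taken = set(chosen)
    for i in order:
        if len(chosen) >= batch_size:
            break
        if i not in taken:
            chosen.append(i)
            taken.add(i)
    return chosen


# Toy pool: the four highest scores all fall in cluster 0.
scores = [0.90, 0.85, 0.80, 0.75, 0.40, 0.30]
cluster_ids = [0, 0, 0, 0, 1, 2]
batch = select_diverse_batch(scores, cluster_ids, batch_size=4)
```

A plain top-4 selection would take only cluster 0 in this example; the filter replaces the fourth-ranked window with the best window of cluster 1. If the pool has fewer distinct clusters than required, the first pass simply stops early, which corresponds to the "whenever possible" qualifier in the text.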

The budget-aware weight is implemented by the bounded schedule above with lambda_min = 0.20 and lambda_max = 0.80. This avoids the two degenerate cases: lambda_r = 1 reduces the method to pure uncertainty sampling, which may over-query a narrow boundary region, whereas lambda_r = 0 reduces it to pure representativeness sampling, which may select central but already easy windows. The schedule was therefore designed to move from exploration toward exploitation while remaining bounded and interpretable.

3.5. Complexity, Stability, and Deployability Analysis

Let N_L = |L_r|, N_U = |U_r|, d be the feature dimension, C be the number of classes, K_L and K_U be the class-wise and pool-wise cluster counts, I be the number of K-means iterations, T be the number of XGBoost trees, h be the tree depth, and E be the number of generative-training epochs. A single active-learning round has approximate cost O(I K_L N_L d) for class-wise undersampling, O(E N_min d) for minority synthesis over the affected minority samples, O(T N_B h) for retraining on the balanced set of size N_B, O(I K_U N_U d) for pool clustering, and O(N_U C) for posterior scoring. The memory cost is O((N_L + N_U)d + T 2^h) up to implementation constants.

The expensive operations are confined to annotation rounds and are not executed during ordinary inference. At deployment time, prediction for one window requires only the trained tree ensemble, with cost O(T h) and model memory proportional to the number of stored tree nodes. This distinction is important for wearable settings: the active-learning loop can be executed on a workstation or server, whereas the final classifier remains lightweight enough for edge-side or phone-side inference.
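As a rough worked instance, the reference configuration of Section 3.4 can be substituted into these bounds, taking $T = 120$ and $h = 4$ at face value. Multiclass gradient-boosting implementations commonly grow one tree per class per boosting round, in which case the effective $T$ would scale with $C$; that detail is not stated in the source.

```latex
T\,h = 120 \times 4 = 480 \quad \text{(node tests per window, at most)},
\qquad
T\,2^{h} = 120 \times 16 = 1920 \quad \text{(leaves, at most)}.
```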

We do not report direct power or battery measurements in this study. The present complexity analysis therefore supports algorithmic deployability but does not replace hardware-specific energy profiling, which remains necessary before fully embedded deployment.

4. Experimental Design

4.1. Datasets and Preprocessing

We evaluate the framework on three public HAR benchmarks that reflect different difficulty profiles and annotation regimes (Table 1). PAMAP2 contains 9 subjects, 12 activities, 27-dimensional structured features, and 100 Hz sensing, with 512-sample windows and a step size of 100 [30]. OPPORTUNITY contains 4 subjects, 17 activities, 33-dimensional features, and more complex daily-living interactions, with 30-sample windows and a step size of 15 [31]. USC-HAD contains 14 subjects, 12 activities, 6-dimensional inertial features, and 100 Hz sensing, with 512-sample windows and a step size of 256 [32]. The experimental splits follow the available source benchmark files: PAMAP2 uses subjects 7 and 9 as the test set, USC-HAD uses subjects 13 and 14 as the test set, and OPPORTUNITY follows the ADL split used in the source benchmark files. Because the OPPORTUNITY protocol is not the only accepted split for that dataset, OPPORTUNITY results should be interpreted as evidence within this benchmark setting rather than as a universal state-of-the-art claim.

4.2. Baselines, Evaluation Metrics, and Implementation Details

The benchmark set includes representative baselines reported in the source study: SMKM, GDB-LC, SATL, ADESSA, and a RESAMPLE strategy. Because this study uses the available benchmark assets rather than a full reimplementation of every baseline, we preserve the original benchmark protocol and focus on the comparative evidence available in the source Table 1. We also clarify the relation to deep HAR models: CNN, LSTM, Transformer, and self-supervised models are important representation learners, but they address a different layer of the pipeline from the budgeted acquisition and class-distribution control studied here [3,4,33,34,35]. A full comparison against end-to-end deep baselines would require standardized sequence inputs and repeated retraining under identical active-learning budgets; this is outside the available benchmark assets and is listed as future work. Macro-F1 is used as the primary metric because it is sensitive to minority-class degradation and therefore more informative than accuracy under imbalance [11].
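The sensitivity of Macro-F1 to minority-class degradation can be seen in a small example. The sketch below computes the unweighted mean of per-class F1 scores from scratch; the labels are hypothetical, and a class with no true or predicted samples is scored as 0 by convention here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced test set: 8 majority windows, 2 minority windows.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a classifier that ignores the minority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.8
score = macro_f1(y_true, y_pred)  # (16/18 + 0) / 2, about 0.444
```

The degenerate classifier still reaches 80% accuracy, but its Macro-F1 falls to about 0.44 because the missed minority class contributes an F1 of zero with the same weight as the majority class.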

4.3. Reproducible Pipeline and Software Assets

The budget analysis reports Macro-F1 across label ratios from 10% to 100%. The reference pipeline is deterministic up to the stated random seed and uses CSV-based inputs and exportable outputs. In addition to the real benchmark tables, the companion toolkit includes a synthetic demo dataset and an executable command-line pipeline so that the logic of balancing, query selection, and round-wise retraining can be reproduced without access to the original preprocessing assets.

All derived summary values in this paper come directly from the source workbook or the software-generated tables. In particular, final and average performance values are taken from the benchmark summary sheet, while ablation scores and label-budget curves are taken from the dedicated tables. This design minimizes transcription error and makes every number in the manuscript traceable to a specific result file.

For reproducibility, the reference configuration is specified as follows: XGBoost uses 120 trees, maximum depth 4, learning rate 0.08, histogram-based tree construction, and multiclass posterior probabilities; uncertainty is computed by minimum confidence; all uncertainty and representativeness scores are min-max normalized within each round; the query schedule uses lambda_min = 0.20, lambda_max = 0.80, and beta = 0.20; cGAN synthesis is disabled when a minority class has fewer than six real samples; and the budget curves are reported at label ratios from 10% to 100%.

Hyperparameters that were not varied in the reported benchmark are therefore treated as fixed design choices rather than retrospectively optimized parameters. This prevents the reported gains from being interpreted as the result of hidden tuning.

5. Results and Discussion

5.1. Final Comparison Across PAMAP2, OPPORTUNITY, and USC-HAD

Table 2 and Figure 2 report the final and average benchmark results. CCUR-M achieves the best final Macro-F1 on all three datasets: 0.9574 on PAMAP2, 0.6780 on OPPORTUNITY, and 0.6128 on USC-HAD. The corresponding gains over the strongest final baseline are +0.0272, +0.1205, and +0.0184, respectively. The OPPORTUNITY result is the clearest signal in the benchmark results: the improvement is not marginal, and it exceeds the final score of every baseline by a visibly larger margin than on the other two datasets.

Average performance across all budget stages reveals a more nuanced picture. CCUR-M still leads on PAMAP2 and OPPORTUNITY, with average gains of +0.0042 and +0.0629 over the best baseline. On USC-HAD, however, the framework does not lead in average Macro-F1: it trails the best baseline by 0.0234 on average despite winning at the final budget. This distinction is important. It shows that the proposed loop should not be interpreted as uniformly dominant at every annotation stage; rather, its strength on USC-HAD lies in later-stage consolidation after the labeled set has become sufficiently informative.

The dataset-specific pattern is intuitively plausible. PAMAP2 is comparatively structured and already admits strong performance under simpler resampling baselines, so the gain from a full closed loop is modest but consistent. OPPORTUNITY is more heterogeneous and long-tailed, making it precisely the regime in which balancing and query diversity should matter most. USC-HAD is smaller and cleaner, which reduces the room for early-stage exploration gains but still allows the closed loop to overtake the strongest baseline by the end.

5.2. Budget-Curve Analysis and Label Efficiency

Figure 3 shows the label-budget curves. On PAMAP2, CCUR-M starts from a strong 10% label ratio score of 0.9065 and climbs almost monotonically to 0.9574. The RESAMPLE baseline is competitive in the middle of the curve, even matching CCUR-M around 30%, but then plateaus below the proposed method. This indicates that once the easiest majority-class redundancy has been removed, further gains depend on query quality rather than resampling alone.

On OPPORTUNITY, the budget curves reveal the core contribution of the method. CCUR-M leads from the first budget stage (0.5353 at 10%) and maintains the lead throughout the entire annotation trajectory, reaching 0.6780 at 100%. The gap is not driven by a single late spike; it is sustained over nearly all rounds, which is consistent with the intended effect of coupling distribution correction and representative querying in a difficult long-tail setting. The strongest baseline at the final budget, GDB-LC, reaches 0.5575, and the strongest average baseline reaches 0.5229. Both remain substantially below the full loop.

USC-HAD exhibits a different dynamic. SATL is stronger in the early and middle rounds, and CCUR-M only draws level near the 80% label ratio before finishing at 0.6128 versus 0.5944 for SATL. This delayed crossover suggests that the framework spends more of the early budget on coverage repair and minority stabilization than on immediate exploitation. In applications where the budget can be expanded across multiple rounds, such behavior may still be desirable; in extremely small-budget settings, however, alternative query schedules could be preferable.

5.3. Ablation Study and Module Coupling

The ablation study in Figure 4 shows that the gain cannot be attributed to a single component. On PAMAP2, the full loop outperforms the base variant by +0.1243 and exceeds every single ablation. The largest drops occur when cGAN augmentation or uncertainty-based selection is removed, implying that minority expansion and boundary awareness are especially important in the relatively high-performing structured regime.

On OPPORTUNITY, the full framework improves on the base variant by +0.1638. Removing representativeness causes the largest performance loss, followed closely by removing undersampling or uncertainty. This is strong evidence that pool coverage matters most when the dataset contains many redundant or locally clustered windows. A pure uncertainty strategy would spend too much budget on boundary refinement around already well-covered modes; a pure representativeness strategy would miss genuinely ambiguous transitions. The result supports the claim that the two signals should be combined rather than optimized in isolation.

USC-HAD shows the strongest relative ablation effect: the full framework improves on the base variant by +0.2143. The representativeness-free ablation is particularly weak, again indicating that cluster-aware batch construction is crucial when the early labeled set is sparse. At the same time, removing oversampling also causes a large drop, which suggests that even relatively clean datasets can suffer from minority under-coverage once the active learner repeatedly exploits easy classes.

5.4. Sensitivity, Comparative Scope, and Statistical Interpretation

The hybrid score is sensitive to the balance between uncertainty and representativeness, but the two limiting cases are both undesirable. A purely uncertainty-driven policy tends to select ambiguous windows concentrated near the current boundary, whereas a purely representativeness-driven policy may select central samples that improve coverage but contribute little boundary information. The bounded schedule used here is intended to avoid both extremes by emphasizing coverage in early rounds and uncertainty in later or more imbalanced rounds.

The ablation results provide indirect evidence for this design. Removing representativeness produces large drops on OPPORTUNITY and USC-HAD, indicating that diversity is important under heterogeneous or sparse labeled sets. Removing uncertainty also degrades performance, indicating that coverage alone is insufficient. We therefore interpret the schedule as a controlled compromise rather than as a globally optimal value of $\lambda_r$.

The comparison with deep HAR architectures should also be interpreted carefully. CNNs, LSTMs, Transformers, and self-supervised encoders may learn stronger temporal representations than the structured features used here. The present contribution is orthogonal to that direction: CCUR-M defines how a labeled subset is balanced and queried under a fixed annotation budget. A natural extension is to apply the same acquisition rule on top of frozen or lightly tuned deep representations.

Finally, the available benchmark files do not contain per-seed outputs for every baseline, so formal paired significance testing or confidence intervals cannot be computed without rerunning all methods under the same random seeds. We therefore report effect sizes, budget curves, and ablation evidence, and we avoid claiming universal statistical dominance. The large OPPORTUNITY gain is the strongest empirical signal, whereas the smaller PAMAP2 and USC-HAD final gains should be treated as more protocol-dependent and requiring independent replication.

5.5. Cross-Dataset Interpretation and Application Implications

Taken together, the results support an interpretation of CCUR-M as a closed-loop correction mechanism rather than a single-shot query heuristic. The method is most valuable when three conditions coexist: the unlabeled pool is large, the labeled seed is class-imbalanced, and redundant windows are abundant. Under these conditions, balancing changes the effective training distribution, representative querying expands coverage, and uncertainty progressively sharpens the boundary.

This reading aligns with current research trends in wearable HAR. Large-scale pretraining and free-living datasets are pushing the field toward broader domain coverage [4,5,7], while concept-invariant, user-generalizable, and weakly supervised approaches seek better transfer across users and environments [6,7,9,25]. In parallel, specialized batch active learning and multitask active learning are increasingly viewed as deployment tools rather than purely data-efficiency tools [23,24]. The present results suggest that a budget-aware closed loop can complement these trends by serving as a practical layer between representation learning and annotation policy.

There are also clear limits. First, the current benchmark files are based on structured features and XGBoost retraining rather than an end-to-end sequence model. This keeps inference lightweight, but it may underuse recent CNN, LSTM, Transformer, and self-supervised representations [4,33,35]. Second, the OPPORTUNITY split is inherited from the source benchmark configuration, so the strongest result in this paper should be interpreted within that benchmark protocol rather than as a universal state-of-the-art claim. Third, the method assumes a closed-set label space and does not yet address unknown activities, continual drift, or personalization directly, although current literature shows these are becoming essential deployment requirements [10,36,37]. Fourth, the present study reports algorithmic complexity but not device-level energy, latency, or memory measurements.

These limits point to concrete extensions. One promising direction is to couple the current budget-aware query logic with a frozen or lightly tunable self-supervised backbone, so that active learning operates on stronger representations without requiring full end-to-end retraining [4,6,35]. Another is to replace the scalar budget schedule with an adaptive controller learned from round-wise performance and imbalance dynamics. Finally, open-set rejection, user-level personalization, and continual adaptation could be incorporated as downstream stages, allowing the framework to remain mathematically interpretable while addressing the realities of deployed wearable HAR [8,9,10,36,37].

6. Conclusions

6.1. Main Findings

This paper presents a mathematically oriented algorithmic framework for budget-aware class-balanced active learning in wearable HAR. The central idea is simple but consequential: under label scarcity and class imbalance, balancing and querying should not be treated as separate preprocessing steps. They should be coupled inside the same round-wise loop, so that each queried sample is added to a distribution whose imbalance has been explicitly controlled.
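The rebalancing step that precedes each query can be sketched as follows. This is a simplified illustration, not the CCUR-M implementation: random undersampling stands in for cluster-preserving majority undersampling, Gaussian-jittered copies stand in for cGAN synthesis, and the function name and noise scale are hypothetical. Only the median target class size is taken from the paper.

```python
import numpy as np

def rebalance_to_median(X, y, rng):
    """Resample every class toward the median class size of the labeled set.

    Stand-ins (assumptions): random undersampling replaces cluster-preserving
    undersampling, and jittered copies replace cGAN-based synthesis.
    """
    classes, counts = np.unique(y, return_counts=True)
    target = int(np.median(counts))  # truncates if the median is fractional
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts.tolist()):
        Xc = X[y == cls]
        if count > target:
            # Majority class: keep a random subset of size `target`.
            keep = rng.choice(count, size=target, replace=False)
            Xc = Xc[keep]
        elif count < target:
            # Minority class: add slightly perturbed copies of real samples.
            extra = rng.choice(count, size=target - count, replace=True)
            noise = rng.normal(scale=0.01, size=(target - count, X.shape[1]))
            Xc = np.vstack([Xc, Xc[extra] + noise])
        X_parts.append(Xc)
        y_parts.append(np.full(len(Xc), cls))
    return np.vstack(X_parts), np.concatenate(y_parts)
```

In a round-wise loop, a function of this kind would be called on the current labeled set before the classifier is retrained and the next batch is queried, so the classifier never sees the raw imbalanced subset.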

Empirically, the framework achieves the best final Macro-F1 on PAMAP2, OPPORTUNITY, and USC-HAD, with the strongest evidence on OPPORTUNITY. Budget curves show that the advantage is most consistent when the dataset is heterogeneous and redundant, while ablation results demonstrate that cGAN-based minority expansion, cluster-preserving majority compression, uncertainty scoring, and representativeness each contribute to the final outcome.

6.2. Limitations

The study has several limitations. The benchmark is based on structured features and XGBoost rather than end-to-end temporal encoders; OPPORTUNITY follows the ADL split used in the source benchmark files; the available benchmark files do not support full paired statistical tests across all baselines; the method assumes a closed label set; cGAN synthesis remains fragile under extreme minority scarcity despite the Gaussian fallback; and no direct power or memory profiling was conducted on embedded wearable hardware. These limitations restrict the scope of the claims and motivate the future extensions below.

6.3. Future Work

Future work should integrate CCUR-M with self-supervised or Transformer-based representations, rerun all baselines under standardized cross-dataset protocols and repeated random seeds, replace the fixed lambda schedule with an adaptive controller, add open-set rejection for unseen activities, and evaluate energy and memory consumption on realistic mobile or wearable platforms. Even with these limitations, the current evidence supports the core conclusion that, under budget constraints, a closed loop that repairs the labeled distribution before each query-retrain cycle is more reliable than treating balancing, querying, and retraining as independent modules.

Author Contributions

Conceptualization, X.L.; Software, X.L.; Investigation, G.W.; Resources, G.W.; Data curation, G.W.; Writing – original draft, X.L.; Writing – review and editing, X.L.; Supervision, X.L.; Project administration, G.W.; Funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The PAMAP2, OPPORTUNITY, and USC-HAD datasets are publicly available from their original benchmark repositories and institutional web pages. The processed benchmark tables used in this manuscript can be made available with the accompanying reproducibility materials. No new data were created or analyzed in this study. A lightweight reference implementation of the proposed budget-aware class-balanced active learning loop, together with a synthetic demonstration dataset and exportable run artifacts, is available as a reproducibility toolkit accompanying this manuscript.

Conflicts of Interest

The authors declare no competing interests.

References

1. Arshad, M.H.; Bilal, M.; Gani, A. Human Activity Recognition: Review, Taxonomy and Open Challenges. Sensors 2022, 22, 6463.
2. Huang, Y.; Zhou, Y.; Zhao, H.; Riedel, T.; Beigl, M. A Survey on Wearable Human Activity Recognition: Innovative Pipeline Development for Enhanced Research and Practice. In Proceedings of the 2024 International Joint Conference on Neural Networks, Yokohama, Japan, 30 June–5 July 2024.
3. Nguyen, D.A.; Le-Khac, N.-A. SoK: Behind the Accuracy of Complex Human Activity Recognition Using Deep Learning. In Proceedings of the 2024 International Joint Conference on Neural Networks, Yokohama, Japan, 30 June–5 July 2024.
4. Yuan, H.; Chan, S.; Creagh, A.P.; Tong, C.; Acquah, A.; Clifton, D.A.; Doherty, A. Self-supervised learning for human activity recognition using 700,000 person-days of wearable data. npj Digit. Med. 2024, 7, 91.
5. Chan, S.; Yuan, H.; Tong, C.; Acquah, A.; Schonfeldt, A.; Gershuny, J.; Doherty, A. CAPTURE-24: A large dataset of wrist-worn activity tracker data collected in the wild for human activity recognition. Sci. Data 2024, 11, 1135.
6. Burq, M.; Sridhar, N. Human Activity Recognition Using Self-Supervised Representations of Wearable Data. arXiv 2023.
7. Hong, Z.; Li, Z.; Zhong, S.; Lyu, W.; Wang, H.; Ding, Y.; He, T.; Zhang, D. CrossHAR: Generalizing Cross-dataset Human Activity Recognition via Hierarchical Self-Supervised Pretraining. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2024, 8, 64.
8. Xiong, D.; Wang, S.; Zhang, L.; Huang, W.; Han, C. Generalizable Sensor-Based Activity Recognition via Categorical Concept Invariant Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025.
9. Guo, P.; Nakayama, M. Towards User-Generalizable Wearable-Sensor-Based Human Activity Recognition: A Multi-Task Contrastive Learning Approach. Sensors 2025, 25, 6988.
10. Chowdhury, R.R.; Kapila, R.; Panse, A.; Zhang, X.; Teng, D.; Kulkarni, R.; Hong, D.; Gupta, R.K.; Shang, J. ZeroHAR: Sensor Context Augments Zero-Shot Wearable Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025.
11. Alharbi, F.; Ouarbya, L.; Ward, J.A. Comparing Sampling Strategies for Tackling Imbalanced Data in Human Activity Recognition. Sensors 2022, 22, 1373.
12. Demrozi, F.; Pravadelli, G.; Bihorac, A.; Rashidi, P. Human Activity Recognition using Inertial, Physiological and Environmental Sensors: A Comprehensive Survey. IEEE Access 2020, 8, 210816–210836.
13. Ahmed, N.; Rafiq, J.I.; Islam, M.R. Enhanced Human Activity Recognition Based on Smartphone Sensor Data Using Hybrid Feature Selection Model. Sensors 2020, 20, 317.
14. Jalal, A.; Quaid, M.A.K.; Tahir, S.B.u.d.; Kim, K. A Study of Accelerometer and Gyroscope Measurements in Physical Life-Log Activities Detection Systems. Sensors 2020, 20, 6670.
15. Yin, X.; Shen, W.; Samarabandu, J.; Wang, X. Human activity detection based on multiple smart phone sensors and machine learning algorithms. In Proceedings of the 2015 IEEE 19th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Calabria, Italy, 6–8 May 2015; pp. 582–587.
16. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report 1648; University of Wisconsin-Madison: Madison, WI, USA, 2009; Available online: https://burrsettles.com/pub/settles.activelearning.pdf (accessed on 5 December 2025).
17. Yoo, D.; Kweon, I.S. Learning Loss for Active Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 93–102. Available online: https://openaccess.thecvf.com/content_CVPR_2019/html/Yoo_Learning_Loss_for_Active_Learning_CVPR_2019_paper.html (accessed on 5 December 2025).
18. Sener, O.; Savarese, S. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; Available online: https://openreview.net/forum?id=H1aIuk-RW (accessed on 5 December 2025).
19. Ash, J.T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; Agarwal, A. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020; Available online: https://openreview.net/forum?id=ryghZJBKPS (accessed on 5 December 2025).
20. Ertekin, S.; Huang, J.; Bottou, L.; Giles, C.L. Active Learning in Imbalanced Data Classification. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, Lisbon, Portugal, 6–10 November 2007; Available online: https://clgiles.ist.psu.edu/pubs/CIKM-2007-learning-border.pdf (accessed on 5 December 2025).
21. Das, S. An Active Learning Framework with a Class Balancing Strategy for Time Series Classification. arXiv 2024.
22. Yuan, X.; Wang, S.; Qu, T.; Feng, H.; Liu, P.; Zeng, J.; Chen, X. Learning the hard-to-learn: Active learning for imbalanced datasets in data-centric tunnel engineering. Comput. Geotech. 2024, 174, 106629.
23. Lee, S.; Choi, C.; Do, H.; Son, Y. Batch active learning for time-series classification with multi-mode exploration. Inf. Sci. 2025, 711, 122109.
24. Arefeen, A.; Ghasemzadeh, H. Cost-Effective Multitask Active Learning in Wearable Sensor Systems. Sensors 2025, 25, 1522.
25. Sheng, T.; Huber, M. Reducing Label Dependency in Human Activity Recognition with Wearables: From Supervised Learning to Novel Weakly Self-Supervised Approaches. Sensors 2025, 25, 4032.
26. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
27. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Curran Associates Inc.: Red Hook, NY, USA, 2014; Available online: https://papers.nips.cc/paper/5423-generative-adversarial-nets (accessed on 5 December 2025).
28. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014.
29. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
30. Reiss, A.; Stricker, D. Introducing a New Benchmarked Dataset for Activity Monitoring. In Proceedings of the 16th International Symposium on Wearable Computers, Newcastle, UK, 18–22 June 2012; Available online: https://www.researchgate.net/publication/235348485_Introducing_a_New_Benchmarked_Dataset_for_Activity_Monitoring (accessed on 5 December 2025).
31. Chavarriaga, R.; Sagha, H.; Calatroni, A.; Digumarti, S.T.; Tröster, G.; Millán, J.d.R.; Roggen, D. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognit. Lett. 2013, 34, 2033–2042. Available online: https://www.sciencedirect.com/science/article/abs/pii/S0167865512004205 (accessed on 5 December 2025).
32. Zhang, M.; Sawchuk, A.A. USC-HAD: A Daily Activity Dataset for Ubiquitous Activity Recognition Using Wearable Sensors. In Proceedings of the ACM International Conference on Ubiquitous Computing (UbiComp) Workshop on Situation, Activity and Goal Awareness (SAGAware), Pittsburgh, PA, USA, 5–8 September 2012; Association for Computing Machinery (ACM): New York, NY, USA, 2012; Available online: https://sipi.usc.edu/had/ (accessed on 5 December 2025).
33. Chen, H.; Gouin-Vallerand, C.; Bouchard, K.; Gaboury, S.; Couture, M.; Bier, N.; Giroux, S. Contrastive Self-Supervised Learning for Sensor-Based Human Activity Recognition: A Review. IEEE Access 2024, 12, 152511–152531.
34. Navakauskas, D.; Dumpis, M. Wearable Sensor-Based Human Activity Recognition: Performance and Interpretability of Dynamic Neural Networks. Sensors 2025, 25, 4420.
35. Sun, Y.; Xu, X. Efficient human activity recognition: A deep convolutional transformer-based contrastive self-supervised approach using wearable sensors. Eng. Appl. Artif. Intell. 2024, 135, 108705.
36. Rahman, F.; Schiemer, M.; Rosales Sanabria, A.; Ye, J. Continual learning in sensor-based human activity recognition with dynamic mixture of experts. Pervasive Mob. Comput. 2025, 110, 102044.
37. Cortese, A.; Solbiati, S.; Scandelli, A.; Giudici, A.; Antonello, N.; Trojaniello, D.; Boracchi, G.; Caiani, E.G. Open-Set Recognition of Human Activities from Head-Mounted Inertial Sensor. Sensors 2026, 26, 1079.

Figure 1. CCUR-M closed-loop workflow. Distribution correction, query construction, and lightweight retraining are recomputed at each annotation round.

Figure 2. Final and average performance summary. CCUR-M is best at the final budget on all three datasets, while the largest final and average gains appear on OPPORTUNITY.

Figure 3. Label-budget curves across the three datasets.

Figure 4. Final Macro-F1 under ablation. The full loop is stronger than any single-module variant on all three datasets.

Table 1. Dataset configuration used in the benchmark.

Dataset	Subjects	Activities	Sampling Rate	Window (Size/Step)	Split Summary
PAMAP2	9	12	100 Hz	512/100	Subjects 7 and 9 as test; others as train/pool
OPPORTUNITY	4	17	Mixed	30/15	ADL split used in source benchmark files
USC-HAD	14	12	100 Hz	512/256	Subjects 13 and 14 as test; others as train/pool
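The window size/step pairs in Table 1 describe a sliding-window segmentation of each recording. The sketch below shows one common way to implement it. It assumes that size and step are measured in samples and that a recording is stored as a (time, channels) array; neither detail is stated in the table, and the function name is hypothetical.

```python
import numpy as np

def sliding_windows(signal, size, step):
    """Cut a (T, channels) recording into overlapping (size, channels) windows."""
    starts = range(0, signal.shape[0] - size + 1, step)
    if len(starts) == 0:
        # Recording shorter than one window: return an empty batch.
        return np.empty((0, size, signal.shape[1]))
    return np.stack([signal[s:s + size] for s in starts])

# Hypothetical 3-channel recording segmented with a 512/256 configuration.
recording = np.zeros((1024, 3))
windows = sliding_windows(recording, size=512, step=256)
print(windows.shape)
```

Under this reading, a 512/256 setting gives 50% overlap between consecutive windows, while a smaller step relative to the size yields more overlapping, and therefore more redundant, windows.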

Table 2. Final and average Macro-F1 of CCUR-M against the strongest available baselines.

Dataset	CCUR-M Final	Best Baseline Final	Δ Final	CCUR-M avg	Best Baseline avg	Δ avg
PAMAP2	0.9574	RESAMPLE (0.9302)	+0.0272	0.9345	RESAMPLE (0.9303)	+0.0042
OPPORTUNITY	0.6780	GDB-LC (0.5575)	+0.1205	0.5858	RESAMPLE (0.5229)	+0.0629
USC-HAD	0.6128	SATL (0.5944)	+0.0184	0.5409	SATL (0.5643)	−0.0234
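Each Δ in Table 2 is the CCUR-M score minus the score of the strongest baseline for that column. The short check below recomputes the gains from the values copied out of the table; the dictionary layout and function name are illustrative only.

```python
# (CCUR-M final, best baseline final, CCUR-M avg, best baseline avg), from Table 2
table2 = {
    "PAMAP2": (0.9574, 0.9302, 0.9345, 0.9303),
    "OPPORTUNITY": (0.6780, 0.5575, 0.5858, 0.5229),
    "USC-HAD": (0.6128, 0.5944, 0.5409, 0.5643),
}

def gains(row):
    """Return (delta final, delta avg) rounded to four decimals."""
    final, base_final, avg, base_avg = row
    return round(final - base_final, 4), round(avg - base_avg, 4)

for name, row in table2.items():
    d_final, d_avg = gains(row)
    print(f"{name}: final {d_final:+.4f}, avg {d_avg:+.4f}")
```

The recomputed values match the table, including the negative average gain on USC-HAD, where CCUR-M leads only at the final budget.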
