4.1. CGM Datasets
We used two datasets: a synthetic dataset that simulates glucose readings for multiple users with different types of injected drift, and a real-world dataset containing readings from a single patient across distinct life phases.
A synthetic dataset was generated to simulate common patterns observed in continuous glucose monitoring, including gradual, abrupt and recurrent drifts. Gradual drift is represented by a slow, incremental change in mean glucose levels over time. Abrupt drift models sudden shifts in glucose distribution and is used to mimic unexpected physiological changes. Recurrent drift is implemented by including periods of increased short-term fluctuations. These synthetic streams are created to allow precise evaluation of drift detection performance and feedback-loop response. Glucose readings for 20 users were simulated. Synthetic glucose streams were generated with a base glucose range of 70–180 mg/dL, sampled at one-minute intervals. Independent Gaussian noise with σ = 10 mg/dL was added to each sample. Gradual drift was injected as a linear shift in mean glucose of +20 mg/dL over 10,000 samples. Abrupt drift was implemented as a step change of +30 mg/dL at a single time point. Recurrent drift was modeled as periodic variance increases of 50% with a cycle length of 1440 samples. No artificial missingness or outlier injection was applied; the synthetic streams are therefore free of the sampling irregularities present in the real CGM data. Synthetic data were used exclusively for testing during development and stress evaluation, and they do not form part of the operational system logic.
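The drift injection scheme above can be sketched as follows. The parameter values (70–180 mg/dL range, σ = 10 mg/dL noise, +20 mg/dL over 10,000 samples, +30 mg/dL step, +50% variance with a 1440-sample cycle) come from the text; the constant 125 mg/dL baseline, the step location at the stream midpoint, and the half-cycle duty of the recurrent-variance phase are illustrative assumptions.

```python
import numpy as np

def synth_cgm_stream(n_samples: int, drift: str, seed: int = 0) -> np.ndarray:
    """Sketch of the synthetic CGM generator described above.

    Baseline glucose is held at the midpoint of the 70-180 mg/dL range
    (an assumption), with independent Gaussian noise (sigma = 10 mg/dL)
    added to each one-minute sample.
    """
    rng = np.random.default_rng(seed)
    baseline = np.full(n_samples, 125.0)   # assumed midpoint of 70-180 mg/dL
    sigma = np.full(n_samples, 10.0)       # per-sample noise std (mg/dL)

    if drift == "gradual":
        # +20 mg/dL linear shift in mean over 10,000 samples
        baseline += np.clip(np.arange(n_samples) / 10_000, 0.0, 1.0) * 20.0
    elif drift == "abrupt":
        # +30 mg/dL step change at a single time point (midpoint assumed)
        baseline[n_samples // 2:] += 30.0
    elif drift == "recurrent":
        # periodic +50% variance increase, 1440-sample cycle (half-on assumed)
        active = (np.arange(n_samples) % 1440) < 720
        sigma[active] *= np.sqrt(1.5)      # +50% variance = sqrt(1.5) x std
    return baseline + rng.normal(0.0, sigma)
```

Twenty such streams, one per simulated user, would then be replayed through the pipeline in the same way as the real CGM recordings.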
The real-world dataset used in this study was obtained from a single user and spans multiple, distinct physiological and treatment-related phases, providing a natural source of data drifts for system evaluation. The data were collected using a CGM device and later replayed through the containerized edge–cloud pipeline to assess system behavior under realistic conditions. The dataset covers four sequential phases. The first phase corresponds to gestational diabetes during pregnancy, characterized by regulated glucose dynamics under heightened physiological variability. The second phase captures the post-pregnancy period, during which glucose patterns shift as pregnancy-related metabolic influences subside. The third phase corresponds to the post-breastfeeding period, introducing additional changes in glucose dynamics associated with altered energy demands and hormonal regulation. The fourth phase reflects the initiation of medication-based management, resulting in further distributional changes in the CGM signal. These phases introduce naturally occurring distributional shifts in mean glucose levels, variability, and temporal structure, without any artificial manipulation of the data. The original CGM recordings contain occasional gaps due to sensor unavailability. For the purposes of system replay and drift evaluation, the dataset was transformed into a continuous data source, in which all available samples were streamed sequentially without explicitly modeling missing intervals. That is, gaps in wall-clock time were ignored, and the replay mechanism operated on the observed glucose measurements as a temporally ordered but uninterrupted stream. The transitions between phases are known, enabling qualitative assessment of alignment between detected drift events and documented physiological or treatment changes.
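The replay transformation described above (streaming observed samples as a temporally ordered but uninterrupted sequence) can be sketched as a small generator. The record format of (timestamp, glucose) pairs with `None` for missing readings is an assumption; the point is that wall-clock gaps are dropped rather than modeled.

```python
from typing import Iterable, Iterator, Optional, Tuple

def replay_stream(
    records: Iterable[Tuple[float, Optional[float]]]
) -> Iterator[Tuple[int, float]]:
    """Replay recorded CGM samples as a gap-free stream.

    `records` is assumed to hold (wall_clock_timestamp, glucose) pairs in
    temporal order, with None marking sensor unavailability. Gaps in
    wall-clock time are ignored: only observed measurements are emitted,
    re-indexed as one consecutive sequence.
    """
    idx = 0
    for _ts, value in records:
        if value is None:       # sensor gap: skip, do not model the interval
            continue
        yield idx, value        # temporally ordered but uninterrupted
        idx += 1
```

Under this convention, the four documented phases simply appear as consecutive segments of one continuous stream, which is what enables the qualitative alignment check against known phase transitions.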
Table 3 summarizes the key characteristics of both evaluation datasets.
4.2. Experimental Setup and Evaluation Scope
The proposed architecture was evaluated using a containerized edge–cloud testbed emulating a distributed IoT deployment with multiple concurrent users. Each user is represented by an independent edge-role container executing continuous inference, local drift screening, and asynchronous communication with cloud services. The evaluation includes 20 simulated users with synthetically generated physiological time series and one user with real continuous glucose monitoring (CGM) data, used to validate system behavior under realistic non-stationary conditions. Synthetic data allows controlled injection of drift events with known onset, magnitude, and duration, while real CGM data introduces irregular sampling, noise, and user-specific variability that cannot easily be modeled. The results focus on system-level performance, including drift propagation, validation accuracy, adaptation timing, feedback-loop latency, and model lifecycle behavior. Prediction accuracy is reported only to illustrate adaptation effectiveness and stability. The evaluation scope is bounded by the containerized testbed and the user population described above; the results are interpreted as evidence of architectural feasibility under controlled conditions rather than as a demonstration of scalability to large populations or clinical readiness.
The system successfully maintained continuous inference for all users throughout the evaluation period. Edge-role containers operated independently, generating drift indicators and inference outputs without centralized coordination, while cloud services aggregated and processed events asynchronously.
Figure 3 illustrates the end-to-end execution timeline for three representative users, including one real CGM user.
The results demonstrate that inference remains uninterrupted during drift validation, retraining, and model deployment. Updated models are applied only after successful registration and distribution, confirming the effectiveness of decoupled lifecycle management. During the evaluation, the system initiated three user groups and performed two reclassification events. These events are reported to illustrate feasibility and system behavior rather than to provide a quantitative assessment of reclassification optimality.
Edge-level drift screening produced frequent low-confidence alerts in response to short-term variability, particularly for the real CGM user, often coinciding with reported anomalies in the data. However, most of these alerts were filtered out during cloud-side validation. This confirms the intended architectural behavior: high sensitivity at the edge combined with high specificity in the cloud.
Model performance improved following validated adaptation events, particularly for the real CGM user where sustained drift caused baseline degradation.
Figure 4 shows MAE over time with (blue line) and without (red line) the model update for the real CGM user, annotated with drift validation (red squares and dotted line) and adaptation events (green triangles and dash-dotted line).
The real CGM user exhibited higher short-term variability and a greater number of edge-level drift alerts compared to simulated users. However, cloud-side validation filtered a large number of these alerts, resulting in adaptation behavior comparable to simulated users with injected drift. This demonstrates that the architecture generalizes across synthetic and real-world physiological data without requiring domain-specific tuning at the edge.
To quantify system-level performance beyond prediction accuracy, we instrumented the prototype to record timestamps at each stage of the drift handling pipeline: edge alert emission, cloud receipt, validation decision, retraining initiation, retraining completion, model registration, and edge deployment confirmation. Detection delay is defined as the interval between the known or estimated drift onset and the first validated cloud-level drift event. End-to-end adaptation latency is measured from drift onset to confirmed edge deployment of the updated model. Cloud validation pass rate is computed as the ratio of cloud-validated drift events to total edge alerts received. The false positive rate is estimated for synthetic users by comparing validated drift events against the known injection schedule; any validated event not corresponding to a known drift within a tolerance window of 200 samples is counted as a false positive.
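The metric definitions above can be made concrete in a short sketch. The function name and input conventions are illustrative; the 200-sample tolerance window follows the text, and each validated event is matched to at most one known drift onset.

```python
def drift_metrics(injected, validated, alerts_total, tol=200):
    """Compute detection delay, validation pass rate, and false positive
    rate as defined above (sketch; names and interfaces are assumptions).

    injected     : sorted sample indices of known drift onsets (synthetic users)
    validated    : sorted sample indices of cloud-validated drift events
    alerts_total : total number of edge alerts received by the cloud
    A validated event matches a known drift if it occurs within `tol`
    samples after the onset; unmatched events count as false positives.
    """
    delays, false_pos, matched = [], 0, set()
    for v in validated:
        hit = next((d for d in injected
                    if 0 <= v - d <= tol and d not in matched), None)
        if hit is None:
            false_pos += 1
        else:
            matched.add(hit)
            delays.append(v - hit)          # detection delay in samples
    return {
        "mean_detection_delay": sum(delays) / len(delays) if delays else None,
        "validation_pass_rate": len(validated) / alerts_total if alerts_total else 0.0,
        "false_positive_rate": false_pos / len(validated) if validated else 0.0,
    }
```

End-to-end adaptation latency would additionally add the retraining, registration, and deployment stage timestamps recorded by the instrumentation.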
Table 4 reports system-level performance metrics aggregated across all evaluation users. The mean detection delay from drift onset to validated cloud alert was 60.25 ± 35.2 s across synthetic users; cloud-side validation processing accounted for 4 ± 2.2 s of this, with the remaining ~56 s attributable to the time edge-side detectors require to accumulate sufficient statistical evidence. Following validation, the adaptation pipeline required 5 ± 2 s for model retraining and registration (dominated by MLflow logging and metadata persistence rather than model fitting, which completed in under 1 ms) and approximately 1 s for deployment to the edge, yielding a total end-to-end latency of approximately 66 ± 37 s from drift onset to the active updated model. Of 300–1800 total edge-level drift alerts per user, 15.32 ± 5.31% were confirmed by cloud validation, yielding an estimated false positive rate of 27.32 ± 19.20% relative to known synthetic drift events. While this FPR indicates room for improvement in validation precision, the cooldown mechanism limits consecutive instability: the estimated ~23 false-positive retraining events per user correspond to approximately 115 s of cumulative compute. The real CGM user generated approximately 2× more edge-level alerts than the synthetic average, driven by higher short-term postprandial variability. However, the cloud validation pass rate was substantially lower (~0.25% versus 15.32% for synthetic users), reflecting the greater proportion of noise-driven alerts in real physiological data.
For the real CGM user, the small absolute number of validated drift events (~5) aligned with the three known phase boundaries, confirming that the cloud filter correctly identified clinically meaningful distributional shifts while suppressing transient variability, consistent with the intended design of high edge sensitivity combined with selective cloud filtering.
To assess the contribution of individual architectural components, we conducted three ablation experiments on the synthetic dataset (20 users). In the first ablation (edge-only triggering), cloud validation was bypassed and every edge-level consensus alert directly triggered adaptation. In the second ablation (single detector), only ADWIN was retained and the ensemble voting mechanism was disabled; any ADWIN alarm triggered an edge alert. In the third ablation (no cooldown), the per-user cooldown interval was set to zero, allowing consecutive adaptations without a stabilization period. Each ablation was compared against the full pipeline configuration using the same synthetic data streams and drift injection schedule.
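The ensemble voting mechanism contrasted in the first two ablations can be sketched with two simple detectors of the kind named later in the sensitivity analysis (mean-shift and variance-ratio monitors). The thresholds, window sizes, and quorum of 2 are illustrative assumptions, and the full pipeline also includes ADWIN, which is omitted here for brevity.

```python
from collections import deque
import statistics

class MeanShiftMonitor:
    """Flag drift when the mean of the recent half-window departs from the
    older half-window by more than `threshold` mg/dL (threshold assumed)."""
    def __init__(self, window=100, threshold=15.0):
        self.window, self.threshold = window, threshold
        self.buf = deque(maxlen=2 * window)

    def update(self, x) -> bool:
        self.buf.append(x)
        if len(self.buf) < 2 * self.window:
            return False
        vals = list(self.buf)
        old, new = vals[:self.window], vals[self.window:]
        return abs(statistics.fmean(new) - statistics.fmean(old)) > self.threshold

class VarianceRatioMonitor:
    """Flag drift when recent-window variance exceeds the older window's
    variance by more than `ratio` (ratio value assumed)."""
    def __init__(self, window=100, ratio=1.5):
        self.window, self.ratio = window, ratio
        self.buf = deque(maxlen=2 * window)

    def update(self, x) -> bool:
        self.buf.append(x)
        if len(self.buf) < 2 * self.window:
            return False
        vals = list(self.buf)
        v_old = statistics.pvariance(vals[:self.window]) or 1e-9
        return statistics.pvariance(vals[self.window:]) / v_old > self.ratio

def vote(flags, quorum=2) -> bool:
    """Emit an edge alert only when at least `quorum` detectors agree
    within the voting window (quorum of 2 is an assumption)."""
    return sum(flags) >= quorum
```

The single-detector ablation corresponds to bypassing `vote` and alerting on any single detector's flag, which is what inflates both false positives and, for variance-driven drifts, false negatives.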
Table 5 presents the ablation results. Removing cloud validation (edge-only) increased retraining events by approximately 7× (range: 3.5×–21×) and raised the false positive rate from 27.32% to 88.87%, confirming that cloud-level filtering is essential for suppressing transient alerts. Using a single detector (ADWIN only) increased the false positive rate by 46.1 percentage points relative to the ensemble, while also increasing false negatives: the ensemble detected 5.61 true drifts per user on average compared to 4.90 for ADWIN alone, with the difference attributable primarily to variance-driven drifts captured by the variance-ratio monitor but missed by ADWIN. Disabling the cooldown mechanism resulted in 4.2 ± 1.8 oscillatory retraining episodes per user, in which the system retrained up to 4× within 120 s with no net improvement in MAE. These results support the design choices of hierarchical validation, multi-detector voting, and cooldown-based stability constraints.
To contextualize the drift-aware approach against simpler adaptation strategies, we implemented two baselines on the synthetic dataset. Baseline A (periodic retraining) retrains each user’s model at fixed intervals of N = {500, 2000, 10,000} samples regardless of drift status. Baseline B (performance-triggered retraining) monitors a rolling MAE computed over the most recent 100 samples and initiates retraining whenever MAE exceeds 2.0× the post-training baseline MAE. Both baselines use the same training window size and model configuration as the drift-aware pipeline to ensure comparability. The comparison metrics are mean post-adaptation MAE, total retraining events per user, and cumulative retraining compute time.
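Baseline B's triggering rule can be sketched as follows. The retrain hook is reduced to recording the trigger index, and clearing the rolling window after a trigger (to mimic the post-retrain error reset) is an assumption; the 100-sample window and 2.0× factor follow the text.

```python
from collections import deque

def performance_triggered_replay(errors, baseline_mae, window=100, factor=2.0):
    """Sketch of Baseline B: retrain whenever rolling MAE over the last
    `window` samples exceeds `factor` x the post-training baseline MAE.

    `errors` is a sequence of absolute prediction errors; retraining is
    represented by recording the sample index at which it would fire.
    """
    buf = deque(maxlen=window)
    triggers = []
    for i, e in enumerate(errors):
        buf.append(e)
        if len(buf) == window and sum(buf) / window > factor * baseline_mae:
            triggers.append(i)   # retrain here
            buf.clear()          # assumed: errors reset after retraining
    return triggers
```

This sketch also makes the gradual-drift weakness visible: a slowly growing error can sit below `factor * baseline_mae` for a long stretch, which is exactly the delayed-adaptation behavior reported for this baseline.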
Table 6 compares the drift-aware pipeline against both baselines. Periodic retraining at the shortest interval (N = 500) achieved a mean MAE of 5.98 ± 1.32 mg/dL but required 1050 retraining events per user, 12.4× more than the drift-aware pipeline. Periodic retraining at N = 2000 reduced retraining frequency but produced a substantially higher mean MAE of 21.33 ± 5.97 mg/dL, reflecting the high drift frequency in the synthetic dataset where the model frequently operated with outdated parameters between fixed retraining intervals. Performance-triggered retraining achieved a mean MAE of 6.26 ± 1.50 mg/dL with 72 ± 27 retraining events. While this approach responded effectively to abrupt drift scenarios, it exhibited delayed adaptation under gradual drift, where error accumulation remained below the 2.0× triggering threshold for extended periods. The drift-aware pipeline achieved 3.55 ± 0.82 mg/dL with 85 ± 34 retraining events, representing a 40.6% reduction in MAE relative to the best-performing periodic baseline (N = 500) while using 91.9% fewer retraining events. Compared to performance-triggered retraining, the drift-aware approach achieved 43.3% lower MAE at a comparable number of retraining events, with the improvement attributable primarily to timely detection of gradual drift that remained below the performance-based threshold.
Computational overhead was measured on the containerized testbed to characterize the resource cost of drift-aware operation. The testbed hardware consists of an Intel Core i7 processor with 16 GB RAM; each edge container was limited to 1 CPU core and 512 MB RAM, while cloud containers were unrestricted. Edge-side measurements include per-sample inference latency and per-sample drift detection overhead, both measured as wall-clock time averaged over the full evaluation run. Cloud-side measurements include retraining wall-clock time per user and model artifact size. Alert bandwidth was estimated from the average alert message size and emission frequency.
Table 7 summarizes the computational overhead. Edge inference latency averaged 0.054 ms per sample, while drift detection added 0.009 ms per sample, representing a 16.8% overhead relative to inference alone. Peak memory usage for the drift screening subsystem was 0.91 MB per edge container. Edge alert messages averaged 118 bytes each, with a mean emission rate of 0.069 alerts per hour per user, corresponding to an estimated bandwidth of 0.0079 KB/hour per user. Cloud retraining required 0.0006 s per event for model fitting; the 5 s adaptation pipeline latency reported in Table 4 is dominated by MLflow artifact logging, metadata persistence, and message broker processing rather than model computation. Model artifacts averaged 0.44 KB in size.
To assess the sensitivity of drift detection to window parameter choices, we varied the feature extraction window size across {50, 100, 200} samples and the detector voting window across {30, 60, 120} seconds. This yielded nine parameter combinations, evaluated on a subset of five synthetic users selected to include at least one instance of each drift type (gradual, abrupt, recurrent). Detection delay, false positive rate, and number of unnecessary retraining events were recorded for each configuration.
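The nine-configuration grid evaluated here can be enumerated directly; the dictionary keys are illustrative names rather than identifiers from the prototype.

```python
from itertools import product

FEATURE_WINDOWS = (50, 100, 200)   # samples, per the sensitivity analysis
VOTING_WINDOWS = (30, 60, 120)     # seconds

def sensitivity_grid():
    """Enumerate the 3 x 3 parameter combinations evaluated above.
    The evaluation callback (detection delay, FPR, unnecessary retraining
    events per configuration) is application-specific and omitted."""
    return [{"feature_window": fw, "voting_window": vw}
            for fw, vw in product(FEATURE_WINDOWS, VOTING_WINDOWS)]
```

Each configuration would be replayed against the same five-user subset so that differences in detection delay and false positive rate are attributable to the parameters alone.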
Table 8 reports the sensitivity analysis results. Reducing the feature window to 50 samples increased the false positive rate by 12.5 percentage points relative to the default (100 samples) at a matched voting window of 60 s, as the shorter window made the mean-shift and variance-ratio monitors more reactive to transient postprandial variability. Conversely, increasing the feature window to 200 samples raised the mean detection delay by 33.2 s, reflecting the time required for the larger window to accumulate sufficient post-drift samples. For the voting window parameter, a 120 s window increased edge alert frequency by approximately 35% relative to the default 60 s window due to increased coincidental detector agreement, though cloud-side validation absorbed most of the additional alerts without substantially changing the number of unnecessary retraining events (2.6 vs. 2.1). A 30 s voting window reduced edge alert frequency by approximately 24% and lowered the false positive rate to 21.8%, but marginally increased detection delay for gradual drifts requiring sequential detector activation. The default configuration (100-sample feature window, 60 s voting window) represents a balanced trade-off between sensitivity and stability, achieving a moderate detection delay (60.2 s) and false positive rate (27.3%) without favoring either extreme. These parameter choices are not claimed to be universally optimal and may require adjustment for data streams with different temporal characteristics.
4.3. Results and Discussion
The presented results highlight several architectural implications that extend beyond the specific experimental setup. First, the clear separation between edge-level drift screening and cloud-level drift validation proved essential for maintaining system stability. While edge nodes generated a high volume of low-confidence drift alerts, especially for real CGM data characterized by short-term variability, the cloud validation layer consistently filtered transient signals. This confirms that treating concept drift as a hierarchical, system-level event rather than a local trigger is critical for avoiding unnecessary adaptation in distributed IoT environments.
Second, the results demonstrate that asynchronous adaptation is sufficient to preserve continuous inference, even under sustained drift. By decoupling inference, retraining, and deployment, the system avoids inference downtime and reduces operational risk during model updates. The observed feedback-loop behavior shows that adaptation latency is dominated by retraining rather than coordination or communication overhead, suggesting that further optimizations should focus on training efficiency rather than messaging infrastructure.
Third, dynamic user-state modeling emerged as an effective mechanism for limiting retraining frequency. User reclassification enabled the system to respond to moderate drift by reassigning users to existing model ensembles instead of triggering retraining, reducing computational cost while maintaining model–data alignment. Importantly, user states function strictly as internal system abstractions and do not correspond to clinical conditions, preserving a clear separation between system adaptation and medical interpretation. The two reclassification events observed during evaluation illustrate that the mechanism is operationally viable within the prototype; however, quantitative evaluation of reclassification optimality—including comparison of alternative similarity metrics and group configurations—is deferred to future work as a distinct research question.
The inclusion of real CGM data further illustrates the robustness of the architecture. Although real-world data introduced higher variability and irregular sampling compared to synthetic streams, the system exhibited comparable adaptation behavior without requiring domain-specific tuning at the edge. While precise detection delay cannot be computed for the real CGM user due to the absence of exact drift onset timestamps, visual inspection of Figure 4 indicates that all five cloud-validated drift events cluster near the documented phase boundaries, with no validated events occurring during periods of known distributional stability. The pipeline-only latency (cloud validation through deployment) for the real CGM user was approximately 6 s, consistent with synthetic user measurements. This qualitative alignment, combined with the strict cloud filtering that reduced approximately 2000 edge alerts to only 5 validated events, provides supporting evidence that the hierarchical validation correctly discriminates between physiologically meaningful drift and transient variability. This suggests that the proposed design generalizes across heterogeneous physiological signals and can accommodate realistic noise patterns encountered in practice.
Nevertheless, several limitations should be acknowledged. The evaluation was conducted in a containerized environment that emulates, but does not fully replicate, real-world network instability, device failures, or large-scale population heterogeneity. Additionally, while CGM data provides a representative example of non-stationary physiological signals, the results should be interpreted as system-level validation rather than evidence of clinical performance or safety. While user reclassification reduced retraining frequency in the evaluated scenarios, this work does not provide a comparative quantitative analysis of reclassification versus retraining strategies. The primary goal is to demonstrate architectural feasibility and decision integration; systematic evaluation of reclassification policies and similarity metrics is left for future work. Addressing these limitations will require extended deployments, larger user populations, and interdisciplinary collaboration, particularly when defining clinically informed validation and adaptation constraints.
Overall, the integrated results and discussion support the central premise of this work: that concept drift can be elevated from a passive monitoring signal to an active coordination mechanism governing model lifecycle, deployment, and user-state alignment across an edge–cloud continuum. Within the evaluated scope, the proposed architecture demonstrates a viable foundation for adaptive IoT systems operating under non-stationary data conditions. Validation at larger user populations, under real network instability, and with clinically informed evaluation protocols remains necessary before broader deployment claims can be made.