1. Introduction
Edge inference must deliver real-time responses under changing conditions such as sensor noise, lighting shifts, and the appearance of previously unseen classes. Many edge devices execute offline-trained models stored in on-chip flash and keep the deployed structure fixed during operation. This static deployment limits adaptation and causes accuracy degradation once the input distribution drifts [
1,
2]. In practical edge deployments, sensor data often undergoes continuous variation due to environmental dynamics such as illumination change, motion blur, temperature drift, or sensor degradation. These variations introduce a distribution shift between the training data and the live sensor input, causing inference accuracy to degrade over time. To ensure robustness under such conditions, the edge device must adapt its model parameters in accordance with sensor-induced feature changes.
Although continuous retraining could restore performance, full-model retraining on edge devices is infeasible due to limited computational capability, memory capacity, and power constraints [
3]. Furthermore, transmitting an entire model from the cloud during every update causes substantial communication overhead and latency, which becomes impractical for time-sensitive or bandwidth-limited edge deployments. As shown in
Figure 1, conventional cloud–edge pipelines typically rely on batch learning and static redeployment cycles. These approaches interrupt ongoing inference and fail to react promptly to sudden distribution shifts. Moreover, GPU-based acceleration suffers from architectural rigidity: kernels must be frequently reloaded, memory must be re-synchronized, and contexts must be repeatedly switched across learning stages [
4,
5,
6]. Such overhead prevents GPUs from supporting fine-grained, layer-wise updates at runtime, making real-time adaptation impractical.
To overcome these limitations, this paper proposes a runtime-robust edge inference architecture that enables dynamic model updates without interrupting ongoing inference. The edge device employs a memory-separated structure that maintains an active model region and an adaptive region. Only masked weight deltas corresponding to highly influential parameters are transmitted from the server, significantly reducing communication cost compared with full-model updates. After receiving an update, the edge activates new parameters through a fast pointer swap rather than reloading the entire model, thereby eliminating inference downtime. This selective and incremental update mechanism allows the system to adapt rapidly to new classes, noise-induced feature drift, and other dynamic conditions, achieving real-time responsiveness unattainable with prior cloud–edge or GPU-based frameworks.
Overall, the proposed architecture forms a unified cloud–edge adaptive learning framework that combines masking-based updates, FPGA reconfigurability, and efficient communication. It provides runtime adaptability under dynamic environments while maintaining service continuity and reducing overhead in both computation and data transfer [
7,
8].
The main contributions of this paper are as follows. First, a runtime adaptive edge inference architecture is proposed that flexibly replaces model weights during execution without interrupting inference. Second, a server-side retraining pipeline accelerated by a PCIe-connected FPGA performs on-demand incremental learning and selective parameter optimization in response to runtime events. Third, dynamic partial reconfiguration (DPR) is utilized to enable fast layer-wise reconfiguration [
9,
10,
11], providing scenario-specific acceleration without full FPGA reprogramming. Fourth, a masking-based selective weight update method is designed to transmit only the most influential parameters, effectively reducing update payloads. Fifth, an event-driven scheduling mechanism overlaps server retraining and edge inference to maintain continuous operation. Finally, a communication-efficient update path employing sparse and quantized transmission minimizes bandwidth usage while preserving real-time responsiveness.
The remainder of this paper is organized as follows.
Section 2 reviews related work on selective retraining, FPGA-based acceleration, and communication-efficient updates.
Section 3 describes the proposed architecture and the partial reconfiguration workflow.
Section 4 presents experimental results on learning speed, communication efficiency, and runtime robustness.
Section 5 concludes the paper and discusses potential future extensions toward adaptive edge–cloud learning systems.
3. Proposed Architecture
This section presents a runtime adaptive edge inference architecture that combines device-side memory separation with masking-based partial updates and a server-side pipeline accelerated by a PCIe-connected dynamic reconfigurable FPGA. The design targets low-latency adaptation under dynamic inputs without interrupting device inference. The system integrates three pillars that operate concurrently. The server detects scenarios and learns selective changes. The FPGA accelerates importance scoring and layer-wise retraining and supports scenario-specific acceleration through partial reconfiguration. The edge applies masked updates in a shadow region and activates them through an atomic pointer swap.
3.1. System Overview
Figure 3 illustrates the overall runtime-adaptive inference architecture. The system consists of two cooperative domains: the edge device performing incremental inference and the server-side FPGA accelerator responsible for adaptive retraining and mask generation. Both operate concurrently to maintain inference continuity and adaptivity under dynamically changing input conditions.
On the edge side, the system divides memory into two functional regions rather than maintaining a single shadow buffer. The active memory stores the baseline model and parameters frequently used for inference, while the adaptive memory holds task-specific masks and selectively updated parameters provided by the server. Each mask
corresponds to a particular environmental or contextual condition. As seen in
Figure 4, during runtime, the edge retrieves the corresponding mask and reconstructs a new effective weight by combining stored weights and masked updates:
where ⊙ denotes element-wise selection. Although this requires lightweight additional computation, it eliminates the need for full weight replacement and supports incremental inference [
28] that dynamically adapts to the current context. As a result, inference continues seamlessly while parameter adaptation occurs in parallel.
The server receives statistical summaries and inference results from the edge device, analyzes performance degradation, and identifies the cause of change. Based on the detected scenario, the server performs layer-wise retraining on the FPGA and generates a new mask together with its corresponding parameter deltas . These updates are compressed and transmitted to the edge for integration into the adaptive memory. Through this continuous collaboration, the edge maintains up-to-date model behavior without halting inference or requiring full model transmission.
This memory-partitioned structure provides several advantages. First, it enables runtime weight adaptation without model reloading, reducing latency and communication cost. Second, it allows task- or scenario-specific inference using previously stored masks, ensuring rapid recovery when similar situations recur. Finally, it provides a foundation for incremental learning and inference fusion, where partial updates and continuous inference coexist within a unified runtime pipeline.
3.2. Edge Runtime Mechanism with Mask-Based Adaptive Inference
The edge device performs inference continuously while adapting its parameters according to the current task or environment. Instead of performing full weight replacement, the edge selectively applies mask-based parameter fusion using pre-stored update information. Each task-specific mask encodes the indices and magnitudes of parameters that are most sensitive to the corresponding scenario, enabling incremental inference without reloading the entire model.
To support runtime adaptation, the edge memory is divided into two logical regions: the base memory that holds the static model
and the adaptive memory that stores multiple sets of task-specific parameter deltas
and their corresponding masks
. When the input context or task label indicates a change, the edge retrieves the relevant mask and constructs a new effective weight tensor:
where ⊙ denotes element-wise multiplication between the selected mask and its associated parameter delta. This operation adaptively modifies only the parameters relevant to the detected task or environmental condition, generating a lightweight and task-optimized weight configuration. Since the computation involves sparse element-wise updates instead of full model replacement, it requires minimal overhead and remains feasible even for microcontroller-level edge devices.
Each mask is designed by the server-side FPGA through layer-wise importance analysis, and it selectively activates a subset of parameters that significantly influence accuracy for a given condition. By maintaining multiple masks within the adaptive memory, the edge device can dynamically switch between different scenarios or tasks in real time. This mechanism enables task-aware runtime inference, allowing parameter adaptation to occur within a single inference cycle without requiring communication with the server.
The edge runtime engine follows a lightweight policy module that decides which mask to apply. This policy can rely on metadata from the sensor stream or previously observed inference degradation. If the system detects a new or unseen scenario for which no mask exists, the edge flags a retraining request to the server, which then produces a new and pair through FPGA-accelerated incremental learning. The update is transmitted back and stored in the adaptive memory for future use.
This design provides three key benefits. First, it allows continuous inference without interrupting execution, as updates are integrated locally through simple weight fusion. Second, it supports task-level reuse of previously learned parameter deltas, reducing communication frequency with the server. Third, it achieves runtime adaptability where inference, parameter selection, and update coexist seamlessly on the edge device.
3.3. Server Runtime with FPGA DPR
The server operates as an adaptive retraining platform that dynamically reconfigures FPGA hardware modules at runtime under the supervision of a CPU-based DPR controller. The FPGA is divided into a static region and an RP, as seen in
Figure 5. The static region maintains PCIe and DDR interfaces, AXI interconnects, DMA engines, and control registers, ensuring that communication with both the host and the edge device remains uninterrupted during partial reconfiguration. The reconfigurable region hosts interchangeable acceleration modules for importance analysis, weight update, and adaptive mask generation. Each module follows a standardized interface based on AXI4 Lite for control and AXI4 Stream for data exchange, enabling module swapping without resynthesis or interface modification.
The CPU acts as a DPR controller, monitoring edge-side events and scheduling FPGA reconfiguration according to adaptation requirements. During normal operation, the importance analyzer module remains active, continuously evaluating sensitivity scores from recent inference data to identify parameters with a high impact on prediction accuracy. When an update request is triggered by the edge or when accumulated importance scores exceed a threshold, the CPU initiates partial reconfiguration to replace the current module with a weight update engine. After retraining and parameter generation, the CPU reconfigures the FPGA again to load the mask generator module, which extracts sparse and quantized parameter subsets for transmission to the edge. Once the update is completed, the FPGA reverts to the importance analyzer to resume runtime monitoring.
3.3.1. Importance Analyzer
The importance analyzer identifies parameters and layers that most significantly influence the model’s output under the current data distribution. This module implements a hardware-accelerated approximation of the local gradient to estimate feature sensitivity in real time. Instead of executing full backpropagation, which requires global memory for intermediate activations and large-scale tensor operations, the FPGA computes a partial gradient using feature and error signals from the selected layer.
For an input feature vector
and its output error signal
, the local gradient and importance score of parameter
are computed as
where
represents the magnitude of sensitivity for each parameter. This lightweight operation captures how strongly each weight contributes to loss variation without requiring a full backward pass. The computation is fully pipelined and parallelized across multiple DSP arrays, allowing element-wise operations to be performed every clock cycle.
3.3.2. Partial Weight Update
The partial weight update engine performs selective retraining of model parameters based on the importance information obtained from the previous stage. Rather than updating the entire model as in conventional GPU-based training, this module focuses only on parameters whose importance scores exceed a predefined threshold. This approach reduces the computational load and memory bandwidth while maintaining comparable accuracy recovery.
Let
denote the current weight vector at iteration
t, and let
represent the local gradient estimated by the importance analyzer. The updated parameter vector
is computed as
where
is the learning rate,
is the importance score, and
is the threshold defined by the update policy. Only a small fraction of parameters satisfying
are modified, thereby enabling rapid incremental learning and low latency adaptation.
3.3.3. Adaptive Mask Generator
The adaptive mask generator constructs a selective update mask that determines which parameters are to be transmitted and updated during runtime. While the importance analyzer provides element-wise sensitivity values, the mask generator aggregates these scores under multiple policies depending on task objectives, resource availability, and environmental context. This stage thus acts as the bridge between importance estimation and parameter updating.
Given the importance scores
, the generator constructs a binary mask
according to a configurable policy function
:
where
denotes an importance threshold and
k specifies the number of highest score elements to retain. The policy
is dynamically selected by the CPU controller according to the active update mode. Each policy yields a different sparsity ratio, enabling the system to balance adaptation quality against communication and power cost.
4. Experiments
4.1. Experimental Setup
The proposed architecture was evaluated on both GPU- and FPGA-based platforms to compare the runtime adaptability and update efficiency as seen in
Figure 6. For the GPU baseline, experiments were performed on a workstation equipped with an Intel Core i5-12400F CPU and an GTX 1660Ti GPU, Nvidia, Santa Clear, CA, USA, using PyTorch 2.0 for training and inference. The FPGA implementation was deployed on a Alveo U200 accelerator card, Xilinx, San Jose, CA, USA hosted in a dual CPU server system consisting of two Intel Xeon Bronze 3204 processors, Santa Clara, CA, USA (1.9 GHz, 6 cores each). All FPGA kernels were synthesized using the Xilinx Vitis 2023.1 toolchain with Vivado partial reconfiguration support. The host software managing reconfiguration and communication was implemented in C++ using XRT APIs. The edge-side inference processor is implemented on a custom ARM Cortex-M0-based microcontroller, Cambridge, UK operating at 48 MHz. The processor includes 128 KB of on-chip SRAM and 24 KB of DRAM, providing a highly resource-constrained environment representative of low-power edge platforms. During runtime, the processor loads the active weight region into SRAM, performs fixed-point convolution and fully connected operations, and applies masked parameter updates generated by the FPGA server. In addition, the Cortex-M0 periodically reports lightweight statistics—including prediction confidence, anomaly indicators, and feature drift metrics—to trigger server-side incremental retraining.
For performance evaluation, the experiments were conducted using the MNIST and CIFAR-10 datasets for incremental learning scenarios and a subset of Tiny ImageNet for noise adaptation tests. Each experiment consisted of three stages: importance estimation, partial update, and masked parameter transmission. The FPGA-based implementation was compared with the GPU full retraining baseline in terms of total adaptation latency, communication overhead, and accuracy recovery. All accuracy values were measured as the mean of five repeated runs.
4.2. Latency and Throughput Evaluation
Figure 7 presents the latency comparison between the proposed FPGA-based adaptive update system and the GPU full retraining baseline. During the first adaptation cycle, the GPU achieved lower latency (0.78 s) than the FPGA (1.10 s), primarily because of the initial overhead associated with dynamic partial reconfiguration (DPR) and module loading. However, as adaptation events were repeated, the FPGA’s latency rapidly decreased and stabilized around 0.65 s, while the GPU’s retraining time remained nearly constant. After the third update iteration, the average adaptation latency of the FPGA dropped below that of the GPU, achieving up to a 1.3× overall speed advantage in steady-state operation.
The measured DPR latency ranged from 0.11 s to 0.17 s depending on the reconfigurable module size, occupying less than 20% of the total adaptation time after steady-state convergence. The FPGA kernel operated at 300 MHz, processing 32 feature elements per clock with an initiation interval of one, yielding an effective throughput of 9.6 TOPS for FP16 operations approximately 2.1× higher than the GPU baseline. Furthermore, the event-driven scheduling framework overlapped importance analysis, DPR execution, and data transmission, resulting in near-continuous operation with minimal idle time. Overall, the FPGA-DPR system demonstrated increasing efficiency with repeated updates, confirming its suitability for long-running adaptive inference pipelines where cumulative update cost dominates.
4.3. Accuracy Evaluation
In the accuracy evaluation across adaptation steps, the proposed system was tested under a progressively deteriorating noise environment, where the noise severity increased at each step. As the level of corruption increased, a larger portion of noise-augmented training samples was injected into the incremental learning phase. This proportional adjustment of noise–dataset ratio prevented catastrophic forgetting by ensuring that the model continuously retained exposure to clean samples while adapting to gradually intensified noise.
While masking-based partial updates significantly reduce communication bandwidth, they may cause minor accuracy degradation due to the exclusion of low-importance parameters. To evaluate efficiency, the accuracy gain per transmitted megabyte (AccGain/MB) was measured as a combined indicator of adaptation quality and communication cost. A higher value reflects more effective utilization of limited bandwidth and stronger runtime adaptability under edge constraints.
As shown in
Figure 7, the proposed masking-based FPGA update achieved 3.5× higher AccGain/MB compared with the GPU full-model update baseline and 1.8× higher than quantized fixed-ratio updates. Although the absolute accuracy of the full retraining baseline was slightly higher within 1.0% difference, the proposed method attained a comparable recovery level while transmitting only 27% of the total model parameters. This result confirms that transmitting only representative parameters can achieve near-optimal adaptation performance while substantially reducing bandwidth usage.
The sparsity of the transmitted mask was varied between 10% and 50%, and the highest AccGain/MB was observed near 25% sparsity, indicating an optimal trade-off between adaptation accuracy and communication cost. Overall, the results validate that the proposed masking-based selective update strategy maximizes adaptation efficiency and ensures real-time responsiveness in bandwidth-constrained edge-cloud systems.
Figure 7g shows the accuracy measured on both the original clean test set and the newly generated noise test set at each adaptation step. The evaluation demonstrates that both the full retraining baseline and the proposed mask-based update method successfully maintain accuracy on the clean dataset, indicating minimal forgetting of previously learned knowledge. At the same time, performance on the noisy dataset consistently improves with each adaptation iteration, confirming that the model gains robustness against the evolving noise distribution.
The results verify that the proposed selective-update approach enables stable multi-step adaptation without accuracy collapse. By combining partial retraining with importance-driven masking, the system simultaneously preserves prior knowledge and enhances performance on novel, noise-corrupted inputs. This confirms the effectiveness of the proposed architecture in long-term, sensor-driven adaptation scenarios where environmental distortions gradually intensify.
5. Conclusions
This paper presented a runtime robust edge inference system that leverages FPGA-based DPR and masking-based selective updates to achieve low-latency adaptation under changing environments. The proposed architecture integrates three major modules: an importance analyzer for identifying sensitive parameters through local gradient computation, a partial weight update engine that performs selective retraining for high-importance weights, and an adaptive mask generator that minimizes communication bandwidth by transmitting only representative parameter deltas. By combining these modules with event driven scheduling and FPGA reconfigurability, the system supports real time adaptation while maintaining continuous inference operation.
Experimental results demonstrated that the FPGA implementation on the Alveo U200 achieved up to 1.3 times faster end-to-end adaptation latency compared with GPU-based training. The dynamic partial reconfiguration process required only 0.11–0.17 s per module swap, and the masking based communication path reduced transmitted payloads to less than 30% of the full model size, with only 1.2% accuracy degradation. The proposed architecture thus offers a highly communication-efficient and hardware-adaptive solution for cloud-assisted edge AI systems.
The main advantage of the proposed system lies in its ability to dynamically adjust computation and communication according to runtime conditions. Instead of static deployment, the FPGA server flexibly reconfigures its hardware modules based on scenario-specific needs such as noise adaptation, class extension, or precision control. This enables edge devices to maintain inference robustness with minimal bandwidth consumption and reduced energy cost.
Future work will focus on extending the DPR granularity toward finer layer partitions and supporting multi-task learning with concurrent partial updates. Additionally, integrating the adaptive update pipeline with lightweight edge accelerators and high-speed interconnects is expected to further improve scalability and responsiveness. Overall, the proposed FPGA DPR-based framework provides a promising foundation for developing adaptive, communication-efficient, and hardware reconfigurable edge-cloud AI systems.