Runtime-Robust Edge Inference System with Masking-Based Partial Update on Dynamic Reconfigurable FPGA

Kang, Myeongjin; Park, Daejin

doi:10.3390/s25247448

by Myeongjin Kang ¹ and Daejin Park ²,*

¹ School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
² School of Electronics Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
* Author to whom correspondence should be addressed.

Sensors 2025, 25(24), 7448; https://doi.org/10.3390/s25247448

Submission received: 23 October 2025 / Revised: 2 December 2025 / Accepted: 2 December 2025 / Published: 7 December 2025

(This article belongs to the Special Issue Applications of Sensors Based on Embedded Systems)

Abstract

Edge inference systems must sustain real-time performance under dynamic environments such as sensor noise, illumination change, and new object classes. Conventional edge devices deploy static offline-trained models, causing accuracy degradation when the input distribution drifts. This study proposes a runtime-robust edge inference framework that enables continuous adaptation without interrupting execution. The edge device partitions its memory into active and adaptive regions, applying task-specific masked updates generated by a server-side FPGA. The FPGA performs layer-wise importance analysis, partial retraining, and adaptive mask generation using dynamic partial reconfiguration (DPR) to minimize reconfiguration delay. Experiments on MNIST, CIFAR-10, and Tiny ImageNet show that the proposed method reduces adaptation latency by up to 1.3× compared with GPU full retraining while cutting the communication cost to 28% of full model transmission. These results demonstrate that combining masking-based selective updates with FPGA DPR acceleration achieves real-time adaptability, low latency, and communication-efficient learning in cloud–edge collaborative environments.

Keywords: edge-cloud system; FPGA accelerator; learning accelerator; dynamic partial reconfiguration

1. Introduction

Edge inference must deliver real-time responses under changing conditions such as sensor noise, lighting shifts, and the appearance of previously unseen classes. Many edge devices execute offline-trained models stored in on-chip flash and keep the deployed structure fixed during operation. This static deployment limits adaptation and causes accuracy degradation once the input distribution drifts [1,2]. In practical edge deployments, sensor data often undergoes continuous variation due to environmental dynamics such as illumination change, motion blur, temperature drift, or sensor degradation. These variations introduce a distribution shift between the training data and the live sensor input, causing inference accuracy to degrade over time. To ensure robustness under such conditions, the edge device must adapt its model parameters in accordance with sensor-induced feature changes.

Although continuous retraining could restore performance, full-model retraining on edge devices is infeasible due to limited computational capability, memory capacity, and power constraints [3]. Furthermore, transmitting an entire model from the cloud during every update causes substantial communication overhead and latency, which becomes impractical for time-sensitive or bandwidth-limited edge deployments. As shown in Figure 1, conventional cloud–edge pipelines typically rely on batch learning and static redeployment cycles. These approaches interrupt ongoing inference and fail to react promptly to sudden distribution shifts. Moreover, GPU-based acceleration suffers from architectural rigidity: kernels must be frequently reloaded, memory must be re-synchronized, and contexts must be repeatedly switched across learning stages [4,5,6]. Such overhead prevents GPUs from supporting fine-grained, layer-wise updates at runtime, making real-time adaptation impractical.

To overcome these limitations, this paper proposes a runtime-robust edge inference architecture that enables dynamic model updates without interrupting ongoing inference. The edge device employs a memory-separated structure that maintains an active model region and an adaptive region. Only masked weight deltas corresponding to highly influential parameters are transmitted from the server, significantly reducing communication cost compared with full-model updates. After receiving an update, the edge activates new parameters through a fast pointer swap rather than reloading the entire model, thereby eliminating inference downtime. This selective and incremental update mechanism allows the system to adapt rapidly to new classes, noise-induced feature drift, and other dynamic conditions, achieving real-time responsiveness unattainable with prior cloud–edge or GPU-based frameworks.

Overall, the proposed architecture forms a unified cloud–edge adaptive learning framework that combines masking-based updates, FPGA reconfigurability, and efficient communication. It provides runtime adaptability under dynamic environments while maintaining service continuity and reducing overhead in both computation and data transfer [7,8].

The main contributions of this paper are as follows. First, a runtime adaptive edge inference architecture is proposed that flexibly replaces model weights during execution without interrupting inference. Second, a server-side retraining pipeline accelerated by a PCIe-connected FPGA performs on-demand incremental learning and selective parameter optimization in response to runtime events. Third, dynamic partial reconfiguration (DPR) is utilized to enable fast layer-wise reconfiguration [9,10,11], providing scenario-specific acceleration without full FPGA reprogramming. Fourth, a masking-based selective weight update method is designed to transmit only the most influential parameters, effectively reducing update payloads. Fifth, an event-driven scheduling mechanism overlaps server retraining and edge inference to maintain continuous operation. Finally, a communication-efficient update path employing sparse and quantized transmission minimizes bandwidth usage while preserving real-time responsiveness.

The remainder of this paper is organized as follows. Section 2 reviews related work on selective retraining, FPGA-based acceleration, and communication-efficient updates. Section 3 describes the proposed architecture and the partial reconfiguration workflow. Section 4 presents experimental results on learning speed, communication efficiency, and runtime robustness. Section 5 concludes the paper and discusses potential future extensions toward adaptive edge–cloud learning systems.

2. Related Work

2.1. Retraining Layer Selection via Importance Checker

Selective retraining methods identify layers or parameters that most affect loss under distribution shift [12,13]. A first-order criterion evaluates the magnitude of the gradient of the loss $L$ with respect to each parameter $w_i$:

$$s_i^{grad} = \left|\frac{\partial L}{\partial w_i}\right|, \tag{1}$$

where $\frac{\partial L}{\partial w_i}$ denotes the instantaneous gradient and $s_i^{grad}$ represents the local sensitivity of parameter $w_i$.

A Taylor-series-based approximation additionally incorporates the parameter magnitude. Using a first-order expansion,

$$\Delta L \approx \sum_i \frac{\partial L}{\partial w_i}\,\Delta w_i \;\Rightarrow\; s_i^{taylor} = \left|w_i \frac{\partial L}{\partial w_i}\right|, \tag{2}$$

where $\Delta w_i$ denotes a hypothetical perturbation; taking $\Delta w_i = -w_i$, i.e., removing the parameter, gives the per-parameter term on the right. The score $s_i^{taylor}$ captures the combined effect of the gradient and parameter magnitude and is widely used in pruning and selective retraining.

Second-order methods approximate curvature by using the diagonal Fisher information:

$$F_i \approx \mathbb{E}\left[\left(\frac{\partial \log p_\theta(y \mid x)}{\partial w_i}\right)^{2}\right], \tag{3}$$

where $p_\theta(y \mid x)$ is the model likelihood. Using $F_i$, two importance formulations are commonly used:

$$s_i^{fisher} = F_i\,\Delta w_i^{2}, \qquad s_i^{ewc} = F_i\,(w_i - w_i^{\star})^{2}, \tag{4}$$

where $w_i^{\star}$ denotes the previously consolidated parameter value, as in elastic weight consolidation.
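The Fisher-based scores in Equations (3) and (4) can be illustrated with a minimal sketch. It assumes that per-sample gradients of the log-likelihood are already available as plain lists; the function names and the toy values are hypothetical and not taken from the paper.

```python
from typing import List


def diagonal_fisher(per_sample_grads: List[List[float]]) -> List[float]:
    """Eq. (3): empirical mean of squared per-sample log-likelihood gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_scores(fisher: List[float], weights: List[float],
               consolidated: List[float]) -> List[float]:
    """Eq. (4), right-hand form: F_i * (w_i - w_i_star)^2."""
    return [f * (w - w0) ** 2 for f, w, w0 in zip(fisher, weights, consolidated)]


# Toy example with two parameters and two samples (made-up numbers).
fisher = diagonal_fisher([[1.0, 2.0], [3.0, 0.0]])
scores = ewc_scores(fisher, weights=[1.0, 1.0], consolidated=[0.0, 1.5])
print(fisher, scores)
```

A parameter scores high here only when it both carries high Fisher information and has moved away from its consolidated value.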

At the layer level, parameter scores are aggregated to obtain a layer importance value:

$$S_\ell = \frac{1}{|W_\ell|} \sum_{i \in W_\ell} s_i, \tag{5}$$

where $W_\ell$ denotes the set of parameters in layer $\ell$. Layers with the highest $S_\ell$ values are selected for retraining. Practical approximations include Hutchinson-based Hessian trace estimation, gradient-norm heatmaps for block-level selection, and classifier-only updates for new-class scenarios [14,15]. These techniques localize computation to the most influential portions of the network and significantly reduce the communication and retraining cost under runtime adaptation.
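As a rough illustration of Equations (1), (2), and (5), the sketch below scores parameters and ranks layers by their mean score. The layer names, weights, and gradients are invented for the example, and plain Python lists stand in for tensors.

```python
from typing import Dict, List


def grad_scores(grads: List[float]) -> List[float]:
    """Eq. (1): s_i = |dL/dw_i|."""
    return [abs(g) for g in grads]


def taylor_scores(weights: List[float], grads: List[float]) -> List[float]:
    """Eq. (2): s_i = |w_i * dL/dw_i|."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def layer_importance(scores: List[float]) -> float:
    """Eq. (5): mean parameter score within one layer."""
    return sum(scores) / len(scores)


def select_layers(layer_scores: Dict[str, List[float]], n: int) -> List[str]:
    """Return the names of the n layers with the highest layer importance."""
    ranked = sorted(layer_scores,
                    key=lambda name: layer_importance(layer_scores[name]),
                    reverse=True)
    return ranked[:n]


# Hypothetical two-layer network with made-up weights and gradients.
weights = {"conv1": [0.5, -1.0, 0.25], "fc": [2.0, -0.5]}
grads = {"conv1": [0.1, -0.2, 0.4], "fc": [0.5, 1.0]}
scores = {name: taylor_scores(weights[name], grads[name]) for name in weights}
print(select_layers(scores, 1))
```

Averaging rather than summing keeps large layers from dominating the ranking purely because they hold more parameters.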

2.2. FPGA Partial Replacement for Runtime Adaptation

Partial reconfiguration (PR) allows a static shell to remain active while a reconfigurable partition (RP) swaps hardware modules at runtime [16,17]. Common practice fixes the RP boundary with standardized interfaces such as AXI4-Lite for control and AXI4-Stream or memory-mapped AXI for data, with clock and reset isolation to ensure bitstream compatibility across reconfigurable modules. Before reconfiguration, in-flight transactions are quiesced and drained to preserve correctness.

On data center FPGAs such as AMD Xilinx Alveo, as seen in Figure 2, prior systems use PR to switch convolution variants, precision-tuned kernels, or task-specific accelerators without full device reprogramming. HLS-assisted PR and incremental compilation toolflows enable catalogs of pre-built reconfigurable modules that can be selected at runtime [18,19]. Reported module swap latencies scale with the partial bitstream size and the configuration path, typically in the sub-second to low-second range. These hardware hot swaps complement software-only methods by enabling structural changes in the accelerator pipeline under latency and power constraints [20].

2.3. GPU-FPGA Comparison for Runtime Adaptation and Update

As seen in Figure 1, GPU pipelines excel at dense, throughput-oriented computation and can apply updates by either swapping weights in an existing engine or by rebuilding and reloading an engine when the network structure, precision, or sparsity configuration changes.

The former path minimizes downtime but restricts adaptation to fixed architectures [21]. The latter path enables structural changes but introduces rebuild and load costs that can rise with model size and optimization depth [22].

FPGA PR offers a complementary trade-off. A pre-built catalog of reconfigurable modules supports structural adaptation through runtime module swaps, while the static shell continues to serve I/O and memory. Let $T_{apply}^{GPU}$ and $T_{apply}^{FPGA}$ denote the duration of the application phase of an update on each platform:

$$T_{apply}^{GPU} = \begin{cases} T_{weight\_swap}, & \text{weights-only swap} \\ T_{engine\_build} + T_{engine\_load}, & \text{engine rebuild} \end{cases} \qquad T_{apply}^{FPGA} = T_{PR\_load} + T_{warmup}. \tag{6}$$

In practice, weight-only swaps on GPUs achieve very low latency for fixed architecture adaptation. When structural changes are required, PR-based swaps can avoid full device reprogramming and offer sub-second to low-second update times depending on partial bitstream size while keeping the static shell active. Comparative studies therefore evaluate end-to-end update latency, tail latency impact during the update window, and energy per update, as well as accuracy gain per transmitted megabyte in end-to-end cloud–edge scenarios [23].
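The apply-phase cost model can be written out as a small sketch. The parameter names (`t_weight_swap`, `t_engine_build`, `t_engine_load`, `t_pr_load`, `t_warmup`) are illustrative labels for the terms discussed above, and the timings in the example are placeholders rather than measurements.

```python
def gpu_apply_time(structural_change: bool, t_weight_swap: float,
                   t_engine_build: float, t_engine_load: float) -> float:
    """GPU apply phase: cheap weight swap, or rebuild plus reload."""
    if structural_change:
        return t_engine_build + t_engine_load
    return t_weight_swap


def fpga_apply_time(t_pr_load: float, t_warmup: float) -> float:
    """FPGA apply phase: partial bitstream load plus warm-up."""
    return t_pr_load + t_warmup


# Placeholder timings in seconds, chosen only to show the comparison.
gpu_fixed = gpu_apply_time(False, 0.25, 2.0, 0.5)
gpu_rebuild = gpu_apply_time(True, 0.25, 2.0, 0.5)
fpga = fpga_apply_time(0.125, 0.25)
print(gpu_fixed, gpu_rebuild, fpga)
```

With these placeholder values the GPU wins for fixed-architecture updates while the PR path wins once a structural change forces a rebuild, which is the trade-off the text describes.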

2.4. Compressed Partial Update and Masking

Communication-efficient adaptation transmits only the most useful parameters. Masking-based partial updates form an importance mask $M$ and send only the corresponding deltas $\Delta w_M$:

$$M = \{\, i : s_i \geq \tau \ \text{or}\ i \in \text{top-}k \,\}, \qquad \Delta w_M = \{\, \Delta w_i \mid i \in M \,\}, \tag{7}$$

with policies based on $|\Delta w_i|$, gradient magnitude, Fisher-weighted scores, or layer-wise constraints. Payload is further reduced by quantization to eight- or four-bit representations and by sparse encodings such as CSR or coordinate lists. Related streams include gradient sparsification with error feedback, sign-based updates, and parameter-efficient fine-tuning that acts as an implicit mask through low-rank or adapter paths [24,25].
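A masked, quantized update in coordinate-list form could look roughly like the following. This is a sketch under simple assumptions: a top-k policy on a given score list, symmetric 8-bit quantization with a single shared scale, and hypothetical helper names.

```python
from typing import List, Tuple


def top_k_mask(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring parameters, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric 8-bit quantization with one shared scale factor."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def encode_update(deltas: List[float], scores: List[float],
                  k: int) -> Tuple[List[int], List[int], float]:
    """Build the payload: (indices, int8 values, scale)."""
    idx = top_k_mask(scores, k)
    q, scale = quantize_int8([deltas[i] for i in idx])
    return idx, q, scale


def decode_update(idx: List[int], q: List[int], scale: float,
                  size: int) -> List[float]:
    """Expand the payload back into a dense delta vector of the given size."""
    dense = [0.0] * size
    for i, v in zip(idx, q):
        dense[i] = v * scale
    return dense


deltas = [0.0, 2.0, -0.5, 0.01]
idx, q, scale = encode_update(deltas, [abs(d) for d in deltas], k=2)
print(idx, q, decode_update(idx, q, scale, len(deltas)))
```

Only the index list, the 8-bit values, and one scale are transmitted, so the payload grows with k rather than with the model size.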

For cloud–edge systems, masked updates can be applied in a shadow buffer and activated via an atomic pointer swap after integrity checks, thus preserving online inference [26,27]. Compared with full model transfers, compressed partial updates improve accuracy gain per transmitted megabyte and reduce downtime, which aligns with runtime robustness requirements under dynamic inputs.
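The shadow-buffer pattern described above can be sketched as follows. The class and function names are hypothetical, CRC32 stands in for whatever integrity check a real system would use, and a single index assignment plays the role of the atomic pointer swap.

```python
import struct
import zlib
from typing import Dict, List


def pack_delta(delta: Dict[int, float]) -> bytes:
    """Serialize (index, value) pairs in index order for checksumming."""
    return b"".join(struct.pack("<If", i, d) for i, d in sorted(delta.items()))


class ModelStore:
    """Two weight buffers; `active` selects the one used for inference."""

    def __init__(self, weights: List[float]):
        self.buffers = [list(weights), list(weights)]
        self.active = 0

    def weights(self) -> List[float]:
        return self.buffers[self.active]

    def apply_update(self, delta: Dict[int, float], checksum: int) -> bool:
        if zlib.crc32(pack_delta(delta)) != checksum:
            return False  # reject the update and keep serving the old weights
        shadow = 1 - self.active
        self.buffers[shadow] = list(self.buffers[self.active])
        for i, d in delta.items():
            self.buffers[shadow][i] += d
        self.active = shadow  # single assignment acts as the pointer swap
        return True


store = ModelStore([1.0, 2.0, 3.0])
delta = {1: 0.5}
print(store.apply_update(delta, zlib.crc32(pack_delta(delta))), store.weights())
```

Inference code that reads `store.weights()` sees either the old buffer or the fully updated one, never a half-written state.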

3. Proposed Architecture

This section presents a runtime adaptive edge inference architecture that combines device-side memory separation with masking-based partial updates and a server-side pipeline accelerated by a PCIe-connected dynamic reconfigurable FPGA. The design targets low-latency adaptation under dynamic inputs without interrupting device inference. The system integrates three pillars that operate concurrently. The server detects scenarios and learns selective changes. The FPGA accelerates importance scoring and layer-wise retraining and supports scenario-specific acceleration through partial reconfiguration. The edge applies masked updates in a shadow region and activates them through an atomic pointer swap.

3.1. System Overview

Figure 3 illustrates the overall runtime-adaptive inference architecture. The system consists of two cooperative domains: the edge device performing incremental inference and the server-side FPGA accelerator responsible for adaptive retraining and mask generation. Both operate concurrently to maintain inference continuity and adaptivity under dynamically changing input conditions.

On the edge side, the system divides memory into two functional regions rather than maintaining a single shadow buffer. The active memory stores the baseline model and parameters frequently used for inference, while the adaptive memory holds task-specific masks and selectively updated parameters provided by the server. Each mask $M_k$ corresponds to a particular environmental or contextual condition. As seen in Figure 4, during runtime, the edge retrieves the corresponding mask and reconstructs a new effective weight by combining stored weights and masked updates:

$$W_{eff} = W_{base} + \sum_{k=1}^{K} M_k \odot \Delta W_k, \tag{8}$$

where $\odot$ denotes element-wise selection. Although this requires lightweight additional computation, it eliminates the need for full weight replacement and supports incremental inference [28] that dynamically adapts to the current context. As a result, inference continues seamlessly while parameter adaptation occurs in parallel.

The server receives statistical summaries and inference results from the edge device, analyzes performance degradation, and identifies the cause of change. Based on the detected scenario, the server performs layer-wise retraining on the FPGA and generates a new mask $M_{new}$ together with its corresponding parameter deltas $\Delta W_{new}$. These updates are compressed and transmitted to the edge for integration into the adaptive memory. Through this continuous collaboration, the edge maintains up-to-date model behavior without halting inference or requiring full model transmission.

This memory-partitioned structure provides several advantages. First, it enables runtime weight adaptation without model reloading, reducing latency and communication cost. Second, it allows task- or scenario-specific inference using previously stored masks, ensuring rapid recovery when similar situations recur. Finally, it provides a foundation for incremental learning and inference fusion, where partial updates and continuous inference coexist within a unified runtime pipeline.

3.2. Edge Runtime Mechanism with Mask-Based Adaptive Inference

The edge device performs inference continuously while adapting its parameters according to the current task or environment. Instead of performing full weight replacement, the edge selectively applies mask-based parameter fusion using pre-stored update information. Each task-specific mask encodes the indices and magnitudes of parameters that are most sensitive to the corresponding scenario, enabling incremental inference without reloading the entire model.

To support runtime adaptation, the edge memory is divided into two logical regions: the base memory that holds the static model $W_{base}$ and the adaptive memory that stores multiple sets of task-specific parameter deltas $\Delta W_k$ and their corresponding masks $M_k$. When the input context or task label indicates a change, the edge retrieves the relevant mask and constructs a new effective weight tensor:

$$W_{eff}^{(t)} = W_{base} + M_{sel(t)} \odot \Delta W_{sel(t)}, \tag{9}$$

where $\odot$ denotes element-wise multiplication between the selected mask and its associated parameter delta. This operation adaptively modifies only the parameters relevant to the detected task or environmental condition, generating a lightweight and task-optimized weight configuration. Since the computation involves sparse element-wise updates instead of full model replacement, it requires minimal overhead and remains feasible even for microcontroller-level edge devices.
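The weight fusion in Equation (9) might be sketched as below. The sketch assumes that each stored delta is kept in sparse form, so the mask is implicit in which indices are present; the scenario names and values are invented.

```python
from typing import Dict, List

# Adaptive memory: scenario name -> sparse delta {parameter index: delta value}.
# Storing only masked entries makes the mask implicit in the dictionary keys.
AdaptiveMemory = Dict[str, Dict[int, float]]


def effective_weights(base: List[float], adaptive: AdaptiveMemory,
                      selected: str) -> List[float]:
    """Eq. (9): W_eff = W_base + M_sel * dW_sel, touching only masked entries."""
    w_eff = list(base)  # the base weights themselves are never modified
    for i, d in adaptive.get(selected, {}).items():
        w_eff[i] += d
    return w_eff


base = [1.0, 2.0, 3.0]
adaptive = {"low_light": {0: 0.5, 2: -1.0}}
print(effective_weights(base, adaptive, "low_light"))
```

The cost of the fusion scales with the number of masked entries rather than the model size, which is what keeps it plausible on a microcontroller-class device.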

Each mask $M_k$ is designed by the server-side FPGA through layer-wise importance analysis, and it selectively activates a subset of parameters that significantly influence accuracy for a given condition. By maintaining multiple masks within the adaptive memory, the edge device can dynamically switch between different scenarios or tasks in real time. This mechanism enables task-aware runtime inference, allowing parameter adaptation to occur within a single inference cycle without requiring communication with the server.

The edge runtime engine follows a lightweight policy module that decides which mask to apply. This policy can rely on metadata from the sensor stream or previously observed inference degradation. If the system detects a new or unseen scenario for which no mask exists, the edge flags a retraining request to the server, which then produces a new $M_{new}$ and $\Delta W_{new}$ pair through FPGA-accelerated incremental learning. The update is transmitted back and stored in the adaptive memory for future use.
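One possible shape for the mask-selection policy is sketched below. The inputs (a scenario tag derived from sensor metadata and a window of recent prediction confidences) and the `min_confidence` threshold are assumptions made for illustration; the paper does not specify the policy at this level of detail.

```python
from typing import List, Optional, Set, Tuple


def choose_mask(scenario_tag: str, recent_confidence: List[float],
                stored_masks: Set[str],
                min_confidence: float = 0.5) -> Tuple[Optional[str], bool]:
    """Return (mask to apply or None, whether to request server retraining)."""
    degraded = bool(recent_confidence) and (
        sum(recent_confidence) / len(recent_confidence) < min_confidence)
    if scenario_tag not in stored_masks:
        return None, True           # unseen scenario: run base model, ask server
    return scenario_tag, degraded   # known scenario: reuse mask, refresh if degraded


stored = {"fog"}
print(choose_mask("fog", [0.9, 0.8], stored))
print(choose_mask("snow", [0.9], stored))
```

Reusing a stored mask for a recurring scenario avoids a round trip to the server, so retraining requests are only raised for unseen or degraded cases.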

This design provides three key benefits. First, it allows continuous inference without interrupting execution, as updates are integrated locally through simple weight fusion. Second, it supports task-level reuse of previously learned parameter deltas, reducing communication frequency with the server. Third, it achieves runtime adaptability where inference, parameter selection, and update coexist seamlessly on the edge device.

3.3. Server Runtime with FPGA DPR

The server operates as an adaptive retraining platform that dynamically reconfigures FPGA hardware modules at runtime under the supervision of a CPU-based DPR controller. The FPGA is divided into a static region and an RP, as seen in Figure 5. The static region maintains PCIe and DDR interfaces, AXI interconnects, DMA engines, and control registers, ensuring that communication with both the host and the edge device remains uninterrupted during partial reconfiguration. The reconfigurable region hosts interchangeable acceleration modules for importance analysis, weight update, and adaptive mask generation. Each module follows a standardized interface based on AXI4-Lite for control and AXI4-Stream for data exchange, enabling module swapping without resynthesis or interface modification.

The CPU acts as a DPR controller, monitoring edge-side events and scheduling FPGA reconfiguration according to adaptation requirements. During normal operation, the importance analyzer module remains active, continuously evaluating sensitivity scores from recent inference data to identify parameters with a high impact on prediction accuracy. When an update request is triggered by the edge or when accumulated importance scores exceed a threshold, the CPU initiates partial reconfiguration to replace the current module with a weight update engine. After retraining and parameter generation, the CPU reconfigures the FPGA again to load the mask generator module, which extracts sparse and quantized parameter subsets for transmission to the edge. Once the update is completed, the FPGA reverts to the importance analyzer to resume runtime monitoring.
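The module sequence described above can be modelled as a small state machine. This is a host-side sketch only: `_reconfigure` is a placeholder for loading a partial bitstream, and the class and threshold names are hypothetical rather than taken from the authors' XRT-based implementation.

```python
from enum import Enum
from typing import List


class Module(Enum):
    IMPORTANCE_ANALYZER = "importance_analyzer"
    WEIGHT_UPDATE = "weight_update"
    MASK_GENERATOR = "mask_generator"


class DprController:
    """Tracks which module occupies the reconfigurable partition."""

    def __init__(self, score_threshold: float):
        self.loaded = Module.IMPORTANCE_ANALYZER
        self.score_threshold = score_threshold
        self.history: List[Module] = [self.loaded]

    def _reconfigure(self, module: Module) -> None:
        # Placeholder: a real controller would load a partial bitstream here.
        self.loaded = module
        self.history.append(module)

    def step(self, edge_request: bool, accumulated_score: float) -> bool:
        """Run one adaptation cycle if triggered; return True if it ran."""
        if not (edge_request or accumulated_score > self.score_threshold):
            return False
        self._reconfigure(Module.WEIGHT_UPDATE)        # selective retraining
        self._reconfigure(Module.MASK_GENERATOR)       # sparse, quantized payload
        self._reconfigure(Module.IMPORTANCE_ANALYZER)  # resume monitoring
        return True


controller = DprController(score_threshold=1.0)
controller.step(edge_request=True, accumulated_score=0.0)
print([m.value for m in controller.history])
```

Each triggered cycle costs three module swaps, which is why per-swap DPR latency matters for the total adaptation time.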

3.3.1. Importance Analyzer

The importance analyzer identifies parameters and layers that most significantly influence the model’s output under the current data distribution. This module implements a hardware-accelerated approximation of the local gradient to estimate feature sensitivity in real time. Instead of executing full backpropagation, which requires global memory for intermediate activations and large-scale tensor operations, the FPGA computes a partial gradient using feature and error signals from the selected layer.

For an input feature vector $x_L$ and its output error signal $\delta_L = \partial L / \partial y_L$, the local gradient and importance score of parameter $w_i$ are computed as

$$g_i = \delta_i \times x_i, \qquad s_i = |g_i|, \tag{10}$$

where $s_i$ represents the magnitude of sensitivity for each parameter. This lightweight operation captures how strongly each weight contributes to loss variation without requiring a full backward pass. The computation is fully pipelined and parallelized across multiple DSP arrays, allowing element-wise operations to be performed every clock cycle.
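For intuition, the local score can be written out for a single fully connected layer. The sketch assumes that parameter index i stands for a weight connecting input k to output j, so that the product of error and feature becomes delta[j] * x[k]; this reading of the index and the toy values are assumptions.

```python
from typing import List


def local_importance(delta: List[float], x: List[float]) -> List[List[float]]:
    """Eq. (10) for one dense layer: s[j][k] = |delta[j] * x[k]|."""
    return [[abs(d * xi) for xi in x] for d in delta]


# Error signal for two outputs and features for three inputs (made-up values).
scores = local_importance([1.0, -2.0], [0.5, 0.0, 3.0])
print(scores)
```

Only the selected layer's input features and output errors are needed, so no activations from other layers have to be stored.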

3.3.2. Partial Weight Update

The partial weight update engine performs selective retraining of model parameters based on the importance information obtained from the previous stage. Rather than updating the entire model as in conventional GPU-based training, this module focuses only on parameters whose importance scores exceed a predefined threshold. This approach reduces the computational load and memory bandwidth while maintaining comparable accuracy recovery.

Let $w^{(t)}$ denote the current weight vector at iteration $t$, and let $g = \delta_L \odot x_L$ represent the local gradient estimated by the importance analyzer. The updated parameter vector $w^{(t+1)}$ is computed as

$$w_i^{(t+1)} = \begin{cases} w_i^{(t)} - \eta\, g_i, & s_i > \tau \\ w_i^{(t)}, & \text{otherwise} \end{cases} \tag{11}$$

where $\eta$ is the learning rate, $s_i = |g_i|$ is the importance score, and $\tau$ is the threshold defined by the update policy. Only a small fraction of parameters satisfying $s_i > \tau$ are modified, thereby enabling rapid incremental learning and low-latency adaptation.
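The thresholded update rule of Equation (11) reduces to a few lines. The values of the learning rate and threshold below are arbitrary, and lists stand in for the weight vector.

```python
from typing import List


def partial_update(w: List[float], g: List[float],
                   lr: float, tau: float) -> List[float]:
    """Eq. (11): gradient step only where the score |g_i| exceeds tau."""
    return [wi - lr * gi if abs(gi) > tau else wi for wi, gi in zip(w, g)]


w_next = partial_update([1.0, 1.0, 1.0], [0.5, -0.01, -2.0], lr=0.5, tau=0.1)
print(w_next)
```

The second weight is left untouched because its gradient magnitude falls below the threshold, so it would also produce no delta to transmit.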

3.3.3. Adaptive Mask Generator

The adaptive mask generator constructs a selective update mask that determines which parameters are to be transmitted and updated during runtime. While the importance analyzer provides element-wise sensitivity values, the mask generator aggregates these scores under multiple policies depending on task objectives, resource availability, and environmental context. This stage thus acts as the bridge between importance estimation and parameter updating.

Given the importance scores $s = \{s_1, s_2, \dots, s_N\}$, the generator constructs a binary mask $M$ according to a configurable policy function $\Phi(\cdot)$:

$$M_i = \begin{cases} 1, & \text{if } s_i \geq \tau \ \text{(threshold policy)}, \\ 1, & \text{if } i \in \text{Top-}k(s) \ \text{(Top-}k\text{ policy)}, \\ 0, & \text{otherwise}, \end{cases} \tag{12}$$

where $\tau$ denotes an importance threshold and $k$ specifies the number of highest-score elements to retain. The policy $\Phi$ is dynamically selected by the CPU controller according to the active update mode. Each policy yields a different sparsity ratio, enabling the system to balance adaptation quality against communication and power cost.
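The two policies in Equation (12) can be expressed as interchangeable functions, which mirrors the idea of a controller selecting the policy at runtime. The factory-function design and the `kept_fraction` helper are illustrative choices, not the paper's implementation.

```python
from typing import Callable, List

Policy = Callable[[List[float]], List[int]]


def threshold_policy(tau: float) -> Policy:
    """Keep every parameter whose score is at least tau."""
    def policy(scores: List[float]) -> List[int]:
        return [1 if s >= tau else 0 for s in scores]
    return policy


def top_k_policy(k: int) -> Policy:
    """Keep the k parameters with the highest scores."""
    def policy(scores: List[float]) -> List[int]:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        keep = set(order[:k])
        return [1 if i in keep else 0 for i in range(len(scores))]
    return policy


def kept_fraction(mask: List[int]) -> float:
    """Fraction of parameters selected for transmission."""
    return sum(mask) / len(mask)


scores = [0.1, 0.5, 0.9, 0.2]
for phi in (threshold_policy(0.5), top_k_policy(1)):
    mask = phi(scores)
    print(mask, kept_fraction(mask))
```

A threshold policy lets the payload size vary with how much the scores change, whereas a top-k policy fixes the payload size in advance.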

4. Experiments

4.1. Experimental Setup

The proposed architecture was evaluated on both GPU- and FPGA-based platforms to compare runtime adaptability and update efficiency, as seen in Figure 6. For the GPU baseline, experiments were performed on a workstation equipped with an Intel Core i5-12400F CPU and a GTX 1660 Ti GPU (Nvidia, Santa Clara, CA, USA), using PyTorch 2.0 for training and inference. The FPGA implementation was deployed on an Alveo U200 accelerator card (Xilinx, San Jose, CA, USA) hosted in a dual-CPU server with two Intel Xeon Bronze 3204 processors (Intel, Santa Clara, CA, USA; 1.9 GHz, 6 cores each). All FPGA kernels were synthesized using the Xilinx Vitis 2023.1 toolchain with Vivado partial reconfiguration support. The host software managing reconfiguration and communication was implemented in C++ using XRT APIs. The edge-side inference processor was implemented on a custom ARM Cortex-M0-based microcontroller (Arm, Cambridge, UK) operating at 48 MHz. The processor includes 128 KB of on-chip SRAM and 24 KB of DRAM, providing a highly resource-constrained environment representative of low-power edge platforms. During runtime, the processor loads the active weight region into SRAM, performs fixed-point convolution and fully connected operations, and applies masked parameter updates generated by the FPGA server. In addition, the Cortex-M0 periodically reports lightweight statistics (prediction confidence, anomaly indicators, and feature drift metrics) to trigger server-side incremental retraining.

For performance evaluation, the experiments were conducted using the MNIST and CIFAR-10 datasets for incremental learning scenarios and a subset of Tiny ImageNet for noise adaptation tests. Each experiment consisted of three stages: importance estimation, partial update, and masked parameter transmission. The FPGA-based implementation was compared with the GPU full retraining baseline in terms of total adaptation latency, communication overhead, and accuracy recovery. All accuracy values were measured as the mean of five repeated runs.

4.2. Latency and Throughput Evaluation

Figure 7 presents the latency comparison between the proposed FPGA-based adaptive update system and the GPU full retraining baseline. During the first adaptation cycle, the GPU achieved lower latency (0.78 s) than the FPGA (1.10 s), primarily because of the initial overhead associated with dynamic partial reconfiguration (DPR) and module loading. However, as adaptation events were repeated, the FPGA’s latency rapidly decreased and stabilized around 0.65 s, while the GPU’s retraining time remained nearly constant. After the third update iteration, the average adaptation latency of the FPGA dropped below that of the GPU, achieving up to a 1.3× overall speed advantage in steady-state operation.

The measured DPR latency ranged from 0.11 s to 0.17 s depending on the reconfigurable module size, occupying less than 20% of the total adaptation time after steady-state convergence. The FPGA kernel operated at 300 MHz, processing 32 feature elements per clock with an initiation interval of one, yielding an effective throughput of 9.6 TOPS for FP16 operations, approximately 2.1× higher than the GPU baseline. Furthermore, the event-driven scheduling framework overlapped importance analysis, DPR execution, and data transmission, resulting in near-continuous operation with minimal idle time. Overall, the FPGA-DPR system demonstrated increasing efficiency with repeated updates, confirming its suitability for long-running adaptive inference pipelines where cumulative update cost dominates.

4.3. Accuracy Evaluation

In the accuracy evaluation across adaptation steps, the proposed system was tested under a progressively deteriorating noise environment, where the noise severity increased at each step. As the level of corruption increased, a larger portion of noise-augmented training samples was injected into the incremental learning phase. This proportional adjustment of noise–dataset ratio prevented catastrophic forgetting by ensuring that the model continuously retained exposure to clean samples while adapting to gradually intensified noise.

While masking-based partial updates significantly reduce communication bandwidth, they may cause minor accuracy degradation due to the exclusion of low-importance parameters. To evaluate efficiency, the accuracy gain per transmitted megabyte (AccGain/MB) was measured as a combined indicator of adaptation quality and communication cost. A higher value reflects more effective utilization of limited bandwidth and stronger runtime adaptability under edge constraints.
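The AccGain/MB metric can be computed as in the sketch below. It assumes accuracy is given in percentage points and that one megabyte is 10^6 bytes; the accuracy and payload numbers are invented solely to show how a smaller payload can raise the metric even when the accuracy gain is slightly lower.

```python
def acc_gain_per_mb(acc_before: float, acc_after: float,
                    payload_bytes: int) -> float:
    """Accuracy gain (percentage points) per transmitted megabyte (10^6 bytes)."""
    return (acc_after - acc_before) / (payload_bytes / 1e6)


# Hypothetical comparison: full-model transfer vs. a masked partial update.
full = acc_gain_per_mb(70.0, 80.0, payload_bytes=2_000_000)
masked = acc_gain_per_mb(70.0, 79.0, payload_bytes=500_000)
print(full, masked)
```

Because the payload sits in the denominator, the metric rewards updates that recover most of the accuracy with a fraction of the bytes.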

As shown in Figure 7, the proposed masking-based FPGA update achieved 3.5× higher AccGain/MB than the GPU full-model update baseline and 1.8× higher than quantized fixed-ratio updates. Although the absolute accuracy of the full retraining baseline was slightly higher (within a 1.0% difference), the proposed method attained a comparable recovery level while transmitting only 27% of the total model parameters. This result confirms that transmitting only representative parameters can achieve near-optimal adaptation performance while substantially reducing bandwidth usage.

The sparsity of the transmitted mask was varied between 10% and 50%, and the highest AccGain/MB was observed near 25% sparsity, indicating an optimal trade-off between adaptation accuracy and communication cost. Overall, the results validate that the proposed masking-based selective update strategy maximizes adaptation efficiency and ensures real-time responsiveness in bandwidth-constrained edge-cloud systems.

Figure 7g shows the accuracy measured on both the original clean test set and the newly generated noise test set at each adaptation step. The evaluation demonstrates that both the full retraining baseline and the proposed mask-based update method successfully maintain accuracy on the clean dataset, indicating minimal forgetting of previously learned knowledge. At the same time, performance on the noisy dataset consistently improves with each adaptation iteration, confirming that the model gains robustness against the evolving noise distribution.

The results verify that the proposed selective-update approach enables stable multi-step adaptation without accuracy collapse. By combining partial retraining with importance-driven masking, the system simultaneously preserves prior knowledge and enhances performance on novel, noise-corrupted inputs. This confirms the effectiveness of the proposed architecture in long-term, sensor-driven adaptation scenarios where environmental distortions gradually intensify.

5. Conclusions

This paper presented a runtime-robust edge inference system that leverages FPGA-based DPR and masking-based selective updates to achieve low-latency adaptation under changing environments. The proposed architecture integrates three major modules: an importance analyzer for identifying sensitive parameters through local gradient computation, a partial weight update engine that performs selective retraining for high-importance weights, and an adaptive mask generator that minimizes communication bandwidth by transmitting only representative parameter deltas. By combining these modules with event-driven scheduling and FPGA reconfigurability, the system supports real-time adaptation while maintaining continuous inference operation.

Experimental results demonstrated that the FPGA implementation on the Alveo U200 achieved up to 1.3 times faster end-to-end adaptation than GPU-based training. The dynamic partial reconfiguration process required only 0.11–0.17 s per module swap, and the masking-based communication path reduced transmitted payloads to less than 30% of the full model size, with only 1.2% accuracy degradation. The proposed architecture thus offers a highly communication-efficient and hardware-adaptive solution for cloud-assisted edge AI systems.

The main advantage of the proposed system lies in its ability to dynamically adjust computation and communication according to runtime conditions. Instead of static deployment, the FPGA server flexibly reconfigures its hardware modules based on scenario-specific needs such as noise adaptation, class extension, or precision control. This enables edge devices to maintain inference robustness with minimal bandwidth consumption and reduced energy cost.

Future work will focus on extending the DPR granularity toward finer layer partitions and supporting multi-task learning with concurrent partial updates. Additionally, integrating the adaptive update pipeline with lightweight edge accelerators and high-speed interconnects is expected to further improve scalability and responsiveness. Overall, the proposed FPGA DPR-based framework provides a promising foundation for developing adaptive, communication-efficient, and hardware-reconfigurable edge-cloud AI systems.

Author Contributions

M.K. wrote the entire manuscript, performed the numerical analysis, designed the core architecture, and performed the software/hardware implementation; D.P. served as principal investigator and corresponding author. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the BK21 FOUR project (4199990113966) and the Basic Science Research Program (RS-2018-NR031059, 10%; RS-2025-24322979, 10%) through the National Research Foundation of Korea (NRF) funded by the Ministry of Education. This work was partly supported by Institute of Information and communications Technology Planning and Evaluation (IITP) grants funded by the Korean government (MSIT) (No. 2022-0-01170, PIM Semiconductor Design Research Center, 20%; No. RS-2023-00228970, Development of Flexible SW-HW Conjunctive Solution for On-edge Self-supervised Learning, 20%; No. RS-2025-02218227, Digital Columbus Project, 20%; No. RS-2022-00156389, Innovative Human Resource Development for Local Intellectualization support program, 10%). This work was also partly supported by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korean government (MOTIE) (RS-2024-00415938, HRD Program for Industrial Innovation, 10%). The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.


Figure 1. Comparison of GPU-based sequential runtime and FPGA-based DPR pipeline. (a) Conventional GPU runtime flow, which repeatedly performs context initialization, kernel reloads, and data transfers. (b) A DPR FPGA that accelerates various operations through repeated module loading. (c) The difference between GPUs, which have fixed bandwidth and shared memory, and FPGAs, which can customize MAC operators, buffers, etc. (d) GPU MAC operations through the compiler and scheduler based on the algorithm. (e) Hardware-based FPGA MAC operations without a processor.

Figure 2. Alveo dynamic partial reconfiguration platform.

Figure 3. Overall edge-cloud collaborative architecture for runtime adaptive inference. (a) A server-side FPGA learning accelerator leveraging layer-by-layer fine-tuning, importance scoring, and mask generation based on a DPR controller. (b) Edge-side incremental adaptive inference based on masking and real-time learning weight transfer.

Figure 4. Task-adaptive partial update method.

Figure 5. FPGA user-defined partial reconfigurable parts.

Figure 6. Experimental setup. (a) Server-side Xilinx Alveo U200 accelerator. (b) Server-side GPU accelerator. (c) Edge-side Arm core-based inference processor.

Figure 7. Experimental results. (a) GPU vs. FPGA latency. Sum of one-time communication, importance scoring, and partial training PR load. (b) GPU vs. FPGA latency changes according to the number of adaptations. (c) GPU vs. FPGA total operation throughput. (d) Accuracy of full retraining and masking-based update across multiple datasets and models. (e) Communication efficiency of retraining and masking-based update across multiple datasets and models. (f) Accuracy versus communication trade-off in GPU and FPGA. (g) Accuracy comparison of the full retraining model, the mask-based update model, and no update across adaptation steps.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kang, M.; Park, D. Runtime-Robust Edge Inference System with Masking-Based Partial Update on Dynamic Reconfigurable FPGA. Sensors 2025, 25, 7448. https://doi.org/10.3390/s25247448

