DPAO-PFL: Dynamic Parameter-Aware Optimization via Continual Learning for Personalized Federated Learning

Tang, Jialu; Gao, Yali; Li, Xiaoyong; Jia, Jia

doi:10.3390/electronics14152945

Open Access Article

by Jialu Tang ^†, Yali Gao ^*,†, Xiaoyong Li and Jia Jia

The Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Electronics 2025, 14(15), 2945; https://doi.org/10.3390/electronics14152945

Submission received: 9 May 2025 / Revised: 3 July 2025 / Accepted: 21 July 2025 / Published: 23 July 2025


Abstract

Federated learning (FL) enables multiple participants to collaboratively train models while efficiently mitigating the issue of data silos. However, large-scale heterogeneous data distributions result in inconsistent client objectives and catastrophic forgetting, leading to model bias and slow convergence. To address the challenges under non-independent and identically distributed (non-IID) data, we propose DPAO-PFL, a Dynamic Parameter-Aware Optimization framework that leverages continual learning principles to improve Personalized Federated Learning. We decompose the parameters into two components: local personalized parameters tailored to client characteristics, and global shared parameters that capture the accumulated marginal effects of parameter updates over historical rounds. Specifically, we leverage the Fisher information matrix to estimate parameter importance online, integrate the path sensitivity scores within a time-series sliding window to construct a dynamic regularization term, and adaptively adjust the constraint strength to mitigate conflicts across tasks. We evaluate the effectiveness of DPAO-PFL through extensive experiments on several benchmarks under IID and non-IID data distributions. Comprehensive experimental results indicate that DPAO-PFL outperforms baselines with improvements from 5.41% to 30.42% in average classification accuracy. By decoupling model parameters and incorporating an adaptive regularization mechanism, DPAO-PFL effectively balances generalization and personalization. Furthermore, DPAO-PFL exhibits superior performance in convergence and collaborative optimization compared to state-of-the-art FL methods.

Keywords:

federated learning; continual learning; parameter decomposition; non-IID

1. Introduction

With the rapid advancement of technologies such as the Internet of Things (IoT) and artificial intelligence (AI), there has been an exponential increase in both the number of devices connected at the network edge and the volume of data collected. Directly transmitting this private data to the cloud not only raises significant privacy leakage concerns but also faces increasing regulatory restrictions, such as those imposed by the General Data Protection Regulation (GDPR) [1]. Federated learning (FL) [2], as a novel distributed training paradigm, enables resource-constrained edge devices to collaboratively train models without moving raw data. To further enhance communication efficiency and the stability of data processing, we need to explore more efficient and feasible solutions for real-world environments.

1.1. Motivation

By exchanging only the parameter updates of decentralized models, FL effectively mitigates the issue of data silos [3] while protecting privacy. FL has significant application potential in various domains, including recommendation systems [4], financial risk management [5], and medical big data analysis [6]. On the data side, clients in FL often collect data from diverse applications [7] (e.g., mobile text input, sensor readings, medical records, or user behavior logs), leading to datasets that are imbalanced and skewed in both feature and label spaces. When the training data on the local devices is not independent and identically distributed (non-IID) [8], models trained with conventional FL are significantly suboptimal and converge slowly compared to those trained in the standard centralized learning mode.

In practical applications, different clients have different preferences for model performance. Standard federated optimization methods (FedAvg) [9] often present significant challenges in hyperparameter tuning and demonstrate poor convergence characteristics. For instance, in scenarios such as medical diagnosis, intelligent keyboards, or industrial inspection, general FL models often fail to meet the personalized demands of all clients. Personalized federated learning (PFL) optimization [10] is an effective approach to overcome the limitations of heterogeneous data and insufficient individual performance. The model-based PFL method includes two objectives: optimizing global FL for potential downstream personalized demands and enhancing the performance of personalization from the global FL model. The strategy of global training and local fine-tuning can improve client performance where data heterogeneity across clients stems from user-specific behaviors and geographical variations.

However, existing personalization approaches neglect the potential risks of catastrophic forgetting and bias in the global model. Approaches based on model fine-tuning [11] or multi-task learning [12] either incur excessive communication overhead or fail to balance the trade-off between adaptability and stability. The decline in performance is caused by the client drift phenomenon [13], which arises from weight divergence and multiple rounds of local updating. After the model parameters are trained and updated for a new task, the model is no longer well adapted to the old tasks it learned previously. Traditional regularization in FL does not account for time-evolving parameter importance. In summary, we must tackle the challenge of balancing stability and personalization under non-IID distributions; that is, effectively preventing catastrophic forgetting while ensuring high adaptability to personalized update requirements. Consequently, there is an urgent need for a federated optimization strategy that dynamically gauges and retains critical global parameters on heterogeneous data, while still enabling effective personalization on each client.

1.2. Designs and Contributions

Data distribution is highly imbalanced across tens of thousands of decentralized edge devices, which have high latency and low throughput and can only be intermittently used for training. To alleviate the aforementioned issues, we propose a novel approach named Dynamic Parameter-Aware Optimization for Personalized Federated Learning (DPAO-PFL) that harmonizes global knowledge retention with localized adaptation in non-IID environments. Inspired by continual learning (CL), which focuses on preserving previously acquired knowledge when training on new tasks sequentially, DPAO-PFL integrates adaptive regularization into PFL, effectively addressing statistical heterogeneity and inter-round forgetting during model training across disparate geographical locations and time windows. First, we leverage the parameter decomposition strategy [14] to decouple global shared parameters from client-specific parameters and construct a transfer vector to adaptively guide the orientation of the global shared parameters in client-side training. Subsequently, we treat the local optimization process performed by each client as sequential multi-task learning and address the potential catastrophic forgetting issue arising from local updates diverging from the initial global model. Finally, we introduce an elastic regularization term based on learning trajectories to constrain parameter updates and dynamically adjust the learning rate of each parameter in the next round according to its importance weight in the previous global task. This method not only safeguards the parameters critical to the global task but also allows greater flexibility for clients’ local updates. An efficient mechanism for updating importance weights further mitigates communication and computational overhead. We summarize our contributions as follows:

By integrating online importance estimation and path integral sensitivity scoring, we designed a task-agnostic dynamic regularization mechanism. Importance weights were updated with an event-triggered mechanism (loss stabilization period), thereby effectively reducing communication and computational overheads.
We propose DPAO-PFL, a dual-dimension optimized adaptive PFL method, which leverages parameter decomposition and an Elastic Weight Consolidation strategy based on CL. This approach effectively decouples the adaptation to specific client needs from the retention of global shared knowledge.
We performed an extensive evaluation of the proposed method using publicly available federated learning benchmark datasets. The experimental results indicate that DPAO-PFL surpasses existing state-of-the-art methods in both performance and stability while exhibiting strong robustness to hyperparameters.

The rest of this paper is organized as follows. In Section 2, we provide an overview of the existing research and the application of PFL. In Section 3, we elaborate on the basic definitions and preliminaries. Section 4 details the DPAO-PFL framework and the adaptive regularization mechanism utilized. The analysis of performance evaluation results is presented in Section 5. Finally, we summarize our work and discuss future directions in Section 6.

2. Related Work

FL can effectively mitigate the systemic privacy risks and high computational/storage costs associated with traditional centralized machine learning methods. Researchers have conducted numerous studies focused on improving the convergence and accuracy of FL. In this section, we review related works on FL and the application of continual learning.

2.1. Federated Learning

Federated learning is a distributed training paradigm to build machine learning models based on datasets that are distributed across multiple devices while preventing data leakage. Recent research has concentrated on device-side federated learning [15] that involves distributed device interactions. There is an urgent need to design optimization strategies tailored for large-scale and heterogeneous distributed scenarios, as well as to carry out in-depth investigations into methods for addressing unbalanced data distribution and measures for enhancing device reliability. In order to improve the convergence capability of the global model in the presence of data heterogeneity, the federated optimization problem has recently become a central topic in many research studies.

Real-world federated learning deployments operate over heterogeneous and resource-constrained networks such as cellular, Wi-Fi, or edge compute clusters. Devices vary widely in compute capability, battery life, and communication bandwidth, resulting in stragglers and unreliable update rounds [16]. Moreover, wireless links often exhibit high latency and packet loss, making synchronous aggregation inefficient or even infeasible. To mitigate these issues, recent FL algorithms introduce client selection based on resource availability [17], and compression or quantization of model updates to reduce communication overhead [18]. Federated versions of adaptive optimizers [19] highlight the interplay between client heterogeneity and communication efficiency. However, these optimizations typically treat the model as a monolithic entity and do not account for the fact that certain parameters may be more critical to preserve across low-quality links or intermittent participation.

In scenarios such as medical diagnosis, intelligent keyboards, or industrial inspection, general models tend to converge slowly and are unable to fully satisfy the personalized requirements of all clients [10]. To address this issue, some studies have focused on enabling personalized modeling for clients, which deviates from the standard FL framework that primarily aims to learn a single global model. PFL encompasses three primary approaches [20]: clustering-based, data interpolation, and model mixture-based. IFCA [21] performs cluster-based federated learning model aggregation within the cluster partition, while it incurs higher communication overhead. FINDING [22] proposes a fine-grained model blending approach with an interpolation strategy to address the cold-user problem in news recommendation tasks. Hahn et al. [23] proposed SuPerFed, a mixture-based personalization method combined with interesting properties in weight space. Although these methods have mitigated the convergence and bias issues to some extent in non-IID environments, they seldom account for knowledge forgetting across communication rounds.

2.2. Continuous Iterative

Continual learning (or incremental learning, IL) [24] aims to empower neural network models to leverage the knowledge accrued from prior tasks, enabling them to adapt to novel tasks with greater efficiency and robustness. A key strength of this approach is its ability to mitigate the issue of catastrophic forgetting, thereby ensuring the enduring preservation of acquired knowledge. Existing studies [25] related to federated continual learning (FCL) predominantly employ the strategy of local continual training, which aims to preserve global knowledge among clients during local model training. Those approaches mitigate model divergence by alleviating catastrophic forgetting in local training. For example, works like FedCurv [26] and FedCL [27] emphasize incorporating continual learning regularization terms into local models to constrain the update magnitude of parameters strongly associated with the global task, thereby preventing excessive updates that could degrade model performance. FedMD [28] accomplishes dynamic learning tasks via knowledge distillation, necessitating the alignment of client models with public datasets. Nevertheless, in practical scenarios, public datasets might either be inaccessible or exhibit substantial discrepancies from the local data distribution, thereby hindering the fulfillment of requirements.

Traditional PFL primarily focuses on managing multiple well-defined task sequences [29], where each round of federated learning corresponds to an independent task in a new domain. Specifically, during continuous iterative updates, the global model may repeatedly fail to retain critical information acquired in earlier rounds. This issue mirrors catastrophic forgetting in CL, a phenomenon that has been extensively studied in incremental task training. F. et al. [30] examine time-evolving scenarios in FL and argue that CL can address common types of data heterogeneity. Recent research mainly contains task-incremental and class-incremental scenarios. For example, FedWeIT [31] enables each client to receive selective knowledge from other clients by taking a weighted combination of their task-specific parameters. FedViT [32] is a lightweight, client-side federated continual learning method that prevents catastrophic forgetting by integrating the knowledge of signature tasks on the client. Nevertheless, most studies aim to minimize interference between tasks of differing natures while maximizing the transfer of consensus knowledge among similar tasks. The emphasis on task stream evolution often overlooks the potential conflicts arising from consecutive local updates.

In contrast to prior methods such as FedCL and FedCurv, which address either task-incremental or forgetting-resistant global modeling, our DPAO-PFL framework is explicitly designed for continual personalization in federated non-IID settings. It avoids the need for task segmentation or rehearsal buffers, and instead relies on a combination of parameter decomposition, online importance tracking, and adaptive regularization to enable stable and adaptive learning across clients under evolving data distributions. Beyond federated optimization, neural-network-based gain scheduling has been widely used in adaptive control and flight systems, where models must operate over large dynamic envelopes with nonlinear behavior. For instance, Zhang et al. [33] proposed a neural-network-aided gain scheduling method for large-envelope flight control law design, enabling smooth interpolation and robust performance under varying flight regimes. While our setting differs (federated learning versus control synthesis), both share the idea of adaptive parameter modulation guided by learned sensitivity metrics.

In summary, prior Personalized Federated Learning approaches can be categorized into meta-learning-based methods, regularization-based frameworks, and parameter decoupling techniques. Across these categories, existing works either lack adaptability to dynamic non-IID data, ignore parameter-specific importance, or fail to balance global stability and local personalization. While each of these lines of work improves local model adaptation under non-IID settings, few existing methods simultaneously address the challenges of catastrophic forgetting and dynamic personalization.

3. Preliminaries

To tackle the twin challenges of statistical heterogeneity and inter-round forgetting in personalized FL, DPAO-PFL introduces a CL regularization to constrain the update of globally shared parameters. In this section, we elaborate on the fundamental approach to federated parameter decomposition and regularized constraint, while exploring the generalization problem in PFL. A dedicated notation table summarizing all key variables is shown as Table 1.

3.1. Parameter Decomposition

To address the limitations of a single set of global model parameters, we leverage the Additive Parameter Decomposition (APD) [14] approach from CL and introduce a federated parameter decomposition method to identify global shared parameters and extract client-specific personalized parameters. For each layer of the neural network model, the superscript $k$ indexes layer-wise components, and $\theta^{k}$ denotes the parameter group associated with the $k$-th layer. We assume that on the $k$-th layer, the trainable network parameters can be decomposed into globally shared parameters $\theta_{g}^{k}$ and client-specific parameters $\theta_{l}^{k}$. By incorporating a transfer vector $\varphi^{k}$, the globally shared parameters can be adaptively fine-tuned to better emphasize the training task for client $i$, defined as

$$\theta^{k} = \theta_{g}^{k} \odot \varphi^{k} + \theta_{l}^{k}, \tag{1}$$

where $\odot$ denotes element-wise multiplication over the corresponding network units, $\varphi^{k} \in \mathbb{R}^{u_{k} \times 1}$ is bounded by the sigmoid function within $(0, 1)$, and $u_{k}$ indicates the number of units in the $k$-th layer. Specifically, the globally shared parameters and privately held parameters possess identical dimensionalities. During model training, gradients for all parameters are computed and updated synchronously via backpropagation. By dynamically balancing the global and local contributions through APD, the method not only accommodates the heterogeneous needs of different clients but also decomposes the initial global model to offer a stable starting point for training. Moreover, it reduces communication overhead by transmitting only the global parameters while minimizing training oscillations.
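
The composition in Equation (1) can be illustrated with a minimal sketch in plain Python. It assumes that each row of a layer's weight matrix holds the weights of one unit, so the per-unit gate is broadcast along that row, and that the gate is obtained by applying a sigmoid to unconstrained logits. The names `compose_layer` and `phi_logits` are illustrative and do not come from the paper.

```python
import math


def sigmoid(z):
    """Logistic function, keeps every gate strictly inside (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def compose_layer(theta_g, theta_l, phi_logits):
    """Build the effective layer weights theta = theta_g * phi + theta_l.

    theta_g, theta_l: lists of u_k rows with identical shapes
                      (assumption: one row per unit).
    phi_logits:       u_k unconstrained values; sigmoid(phi_logits[j]) is
                      the gate applied to every weight of unit j.
    """
    if not (len(theta_g) == len(theta_l) == len(phi_logits)):
        raise ValueError("theta_g, theta_l and phi_logits must agree on u_k")
    composed = []
    for g_row, l_row, logit in zip(theta_g, theta_l, phi_logits):
        gate = sigmoid(logit)
        composed.append([gate * g + l for g, l in zip(g_row, l_row)])
    return composed


# Two units with two weights each; a logit of 0 gives a gate of 0.5.
layer = compose_layer(
    theta_g=[[1.0, 2.0], [3.0, 4.0]],
    theta_l=[[0.5, 0.5], [0.0, 0.0]],
    phi_logits=[0.0, 0.0],
)
```

In this sketch only `theta_g` would be exchanged with the server, while `phi_logits` and `theta_l` stay on the client, which mirrors the communication argument made in the text.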

3.2. Parameter Importance Score

We maintain a diagonal Fisher information estimate [34] for each shared weight by leveraging an exponential moving average of squared gradients. The parameter regularization techniques in continual learning offer a robust solution for mitigating the catastrophic forgetting issue in neural networks. The primary approach, Elastic Weight Consolidation (EWC) [35], is designed to preserve knowledge from previous tasks during training by leveraging a penalty term that restricts weight updates when learning new tasks. In addition to the forgetting problem faced by incremental learning, the RWalk model [36] also takes into account the problem of intransigence. Both rely on $F_{\theta}$, the empirical Fisher Information Matrix (FIM) at $\theta$, defined as

$$F_{\theta} = \mathbb{E}_{(x, y) \sim D} \left[ \left( \frac{\partial \log p_{\theta}(y \mid x)}{\partial \theta} \right) \left( \frac{\partial \log p_{\theta}(y \mid x)}{\partial \theta} \right)^{\top} \right].$$

Similar to RWalk, our generalization integrates efficient online Elastic Weight Consolidation (EWC++) and path integral (PI) [37] from a theoretically grounded KL-divergence perspective. We automatically adjust the penalty intensity in proportion to the historical importance of each parameter, using EWC++ and parameter importance estimation based on optimization paths. EWC++ improves on EWC by updating the diagonal Fisher matrix in an online manner with a moving average.

EWC++. The Fisher Information Matrix measures the sensitivity of model output to parameters and is used to identify parameters that are important for old tasks. Specifically, for each parameter $\theta_{i}$ with the output distribution $p_{\theta}(y \mid x)$, the diagonal approximation of the FIM gives

$$D_{KL}(p_{\theta} \,\|\, p_{\theta + \Delta\theta}) \approx \frac{1}{2} \sum_{i \in P} F_{\theta_{i}} (\Delta\theta_{i})^{2}, \tag{2}$$

$$F_{\theta_{i}}^{(t)} = \alpha F_{\theta_{i}}^{(t-1)} + (1 - \alpha) \left( \frac{\partial \log p_{\theta}(y \mid x)}{\partial \theta_{i}} \right)^{2}, \tag{3}$$

where $\alpha \in [0, 1]$ and $F_{\theta_{i}}^{(t)}$ is updated online as a moving average using the current batch. This penalty term is calculated based on the FIM of previous tasks, reflects the retention ratio of historical information, and quantifies the sensitivity of the model to parameter changes.
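
As a hedged illustration of Equations (2) and (3), the sketch below keeps one Fisher value per parameter and refreshes it from the gradients of the log-likelihood on the current batch. The function names and the flat-list parameter layout are assumptions made for the example, not part of the paper.

```python
def update_fisher(fisher, grads, alpha=0.9):
    """One EWC++-style step of Equation (3).

    fisher: previous diagonal Fisher estimates F^(t-1), one per parameter.
    grads:  gradients of log p(y | x) w.r.t. each parameter on the batch.
    alpha:  weight given to the previous estimate, in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * f + (1.0 - alpha) * g * g for f, g in zip(fisher, grads)]


def approx_kl(fisher, delta):
    """Second-order KL approximation of Equation (2) for a step `delta`."""
    return 0.5 * sum(f * d * d for f, d in zip(fisher, delta))


# Start from [0, 1], observe gradients [2, 0] with alpha = 0.5.
fisher = update_fisher([0.0, 1.0], [2.0, 0.0], alpha=0.5)
kl = approx_kl(fisher, [1.0, 2.0])
```

A larger `alpha` keeps more of the historical estimate, while a smaller one lets the current batch dominate.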

PI Sensitivity. Given that the Fisher matrix captures the intrinsic properties of the model, which depend solely on the loss $L(\theta_{i})$, it inherently overlooks the influence of parameters on the optimization path of the loss function. Consequently, the FIM can be enhanced by systematically accumulating parameter importance scores throughout the entire training process. PI sensitivity measures the efficiency of each update in improving the loss. This segment score is computed as the ratio of the change in the loss to the change in the approximate KL-divergence, which is formally defined as follows:

$$s_{\Delta t}(\theta_{i}) = \frac{\sum_{t = t_{1}}^{t_{2}} \Delta L_{t}(\theta_{i})}{\frac{1}{2} F_{\theta_{i}} \cdot (\Delta\theta_{i})^{2} + \epsilon}, \tag{4}$$

$$S(\theta_{i}) = \frac{1}{2} \left( s^{(\tau - 2)}(\theta_{i}) + s^{(\tau - 1)}(\theta_{i}) \right), \tag{5}$$

where $\Delta\theta_{i}$ is the change in the parameter during the training of task $\tau$, and $\epsilon$ is a small constant to prevent division by zero. Ultimately, we integrate Fisher information and path scores into a unified per-weight regularizer as $\Omega_{i} = F_{\theta_{i}}^{(t-1)} + S(\theta_{i})$, and normalize it to represent the overall parameter importance. If a parameter changes slightly but has a significant impact on the loss, its importance is higher. The normalization of both scores makes the regularization hyperparameter less sensitive and provides a better trade-off between forgetting and intransigence.
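
A small sketch of Equations (4) and (5) and of the combined importance $\Omega_{i}$ follows. The text does not specify the normalization, so dividing each score by its maximum is an assumption of this example, as are all function names. The per-step loss changes are taken as given inputs.

```python
def segment_score(loss_changes, fisher_i, delta_theta_i, eps=1e-8):
    """Equation (4): summed loss change over the segment divided by the
    approximate KL cost of moving parameter i by delta_theta_i."""
    return sum(loss_changes) / (0.5 * fisher_i * delta_theta_i ** 2 + eps)


def path_score(s_two_back, s_one_back):
    """Equation (5): average of the scores from the two previous tasks."""
    return 0.5 * (s_two_back + s_one_back)


def max_normalize(values):
    """Scale a list so its largest entry is 1 (assumed normalization)."""
    peak = max(values)
    if peak <= 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]


def importance(fisher, path_scores):
    """Omega_i = F_i + S_i after normalizing both score vectors."""
    f_norm = max_normalize(fisher)
    s_norm = max_normalize(path_scores)
    return [f + s for f, s in zip(f_norm, s_norm)]


# A parameter that moved little (small KL cost) while the loss changed a lot
# receives a higher segment score than one that moved far for the same change.
small_move = segment_score([0.1, 0.1], fisher_i=2.0, delta_theta_i=0.1)
large_move = segment_score([0.1, 0.1], fisher_i=2.0, delta_theta_i=1.0)
omega = importance(fisher=[1.0, 2.0], path_scores=[4.0, 2.0])
```

The comparison between `small_move` and `large_move` reproduces the remark in the text that a parameter which changes slightly yet strongly affects the loss is considered more important.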

3.3. Problem Formulation

The proliferation of edge devices and decentralized data generation has driven FL as a pivotal paradigm for privacy-preserving collaborative intelligence. However, real-world scenarios are inherently characterized by heterogeneous data distributions across clients with specific behaviors, geographical variations, or temporal shifts. Such non-IID data challenges conventional FL frameworks, which aim to train a single global model by aggregating local updates from distributed clients. Existing research on PFL focuses on adapting models to local data but is constrained by the challenges of model interference and insufficient knowledge retention. We formally present the problem formulation of federated learning along with the notations used.

We consider a set of $N$ clients that collaboratively train a model using distributed data. Each client $i \in \{1, \dots, N\}$ with $m_{i}$ available training samples maintains a local dataset $D_{i} = \{(x_{j}^{i}, y_{j}^{i})\}_{j = 1}^{m_{i}}$, which is derived from its specific environment and thus exhibits non-IID characteristics, i.e., $D_{i} \neq D_{j}$ for $i \neq j$. The local loss function $L_{i}(w) = \mathbb{E}_{(x, y) \sim D_{i}} [\ell_{i}((x, y); w)]$ measures the prediction loss of model parameters $w$ on examples $(x, y)$ drawn from $D_{i}$. The objective of FL is to learn a shared global model applicable to all clients by minimizing the average of the local empirical risks as follows:

$$\min_{w \in \mathbb{R}^{d}} \left\{ f(w) \triangleq \sum_{i = 1}^{N} \frac{1}{N} L_{i}(w) \right\}. \tag{6}$$

For non-IID data distributions, the key to enhancing both the convergence speed and robustness of the FL system lies in bounding the discrepancies between local optimization objectives and the global objective. We aim to develop a PFL framework with parameter decomposition, which enables dual-stream optimization. After $t$ rounds of communication, each selected client $i$ performs $E$ epochs of stochastic gradient descent (SGD) [38] on its private data. To mitigate these discrepancies, we introduce an adaptive penalty term $R_{i}$ as

$$R(w_{s}^{t}) = \frac{1}{2} \sum_{k \in L} \left( F_{w_{i}}^{(t-1)} + S(w_{i}) \right) (w_{g,k}^{t} - w_{s,k}^{t})^{2}, \tag{7}$$

where $k \in L$ indicates the index of the component $w$ of the global shared parameter. By combining adaptive parameter decomposition and parameter regularization into a new federated optimization objective, we aim to alleviate the convergence difficulties of federated learning models caused by data heterogeneity. Consequently, we reformulate the objective in Equation (6) into local objective functions that approximately minimize the following regularized federated risk:

$$\min_{w_{i}} \; L_{i}(w_{l}^{i}, w_{g}^{i}) + \frac{\lambda}{2} \sum_{k \in L} \left( F_{i}^{(t-1)} + S_{i}^{(t-1)} \right) (w_{g,k}^{t} - w_{s,k}^{t})^{2}, \tag{8}$$

where $\lambda$ is the regularization hyperparameter, determining the intensity of the DPAO penalty applied to the globally shared parameters $w_{g}^{i}$, and $w_{l}^{i}$ is not subject to the regularization term, allowing unconstrained personalization.
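
To make the role of the penalty in Equation (8) concrete, the following sketch evaluates it and performs one gradient step. It assumes a per-component importance vector `omega` standing in for the sum of the Fisher and path scores, flat parameter lists, and plain gradient descent. `dpao_penalty` and `local_step` are hypothetical names for this example only.

```python
def dpao_penalty(omega, w_g, w_s, lam):
    """Quadratic term of Equation (8): (lam / 2) * sum_k omega_k * (w_g,k - w_s,k)^2."""
    return 0.5 * lam * sum(o * (g - s) ** 2 for o, g, s in zip(omega, w_g, w_s))


def local_step(w_g, w_l, grad_g, grad_l, omega, w_s, lam, lr):
    """One descent step on the regularized local objective.

    Shared parameters receive the task gradient plus the penalty gradient
    lam * omega_k * (w_g,k - w_s,k); personalized parameters receive only
    the task gradient, so they are left unconstrained.
    """
    new_g = [
        g - lr * (dg + lam * o * (g - s))
        for g, dg, o, s in zip(w_g, grad_g, omega, w_s)
    ]
    new_l = [l - lr * dl for l, dl in zip(w_l, grad_l)]
    return new_g, new_l


omega = [1.0, 2.0]   # assumed importance per shared component
w_s = [0.0, 1.0]     # shared parameters received from the server
w_g = [1.0, 3.0]     # client's current shared parameters
penalty = dpao_penalty(omega, w_g, w_s, lam=0.5)

# With a zero task gradient, the penalty alone pulls w_g back toward w_s,
# and it pulls harder on the component with the larger importance.
new_g, new_l = local_step(
    w_g, w_l=[0.5], grad_g=[0.0, 0.0], grad_l=[1.0],
    omega=omega, w_s=w_s, lam=0.5, lr=0.1,
)
```

Setting `lam` to zero in this sketch removes the pull toward the server parameters and reduces the step to ordinary local SGD on both parameter groups.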

4. Methods

Our goal is to develop a Personalized Federated Learning model that can effectively adapt to the individual data distributions of each client while preserving the generalization capability of the global model and mitigating catastrophic forgetting. This section introduces the framework of DPAO-PFL and proposes a federated optimization algorithm with dynamic parameter-aware regularization.

4.1. Architecture Overview

The data distribution among different clients varies significantly, leading to inconsistent local update directions. As a result, conflicts occur during aggregation, making it difficult for the global model to converge rapidly. We propose a Dynamic Parameter-Aware Optimization for Personalized Federated Learning (DPAO-PFL) framework, which introduces an online parameter importance perception mechanism into PFL. As shown in Figure 1, the core components include a personalized parameter decomposition strategy and a dynamic regularization module.

We divide the model into global shared parameters $w_{g}^{i}$ and local personalized parameters $w_{l}^{i}$, with the former used to retain the global knowledge across rounds and the latter supporting independent adaptation for each client. Furthermore, a weighted quadratic penalty is imposed on each shared parameter to balance stability and flexibility across rounds.

In a standard federated learning training process for edge devices, clients intermittently interact with the server to acquire the global model. The specific process is demonstrated in Algorithm 1. During each communication round $t$, the server transmits the current global model to a randomly selected subset $S_{t}$ of the active clients. These clients then conduct local optimizations using their respective local data and transmit the resulting local updates to the central server. Subsequently, based on the performance of the updated global model, a decision is made to either terminate the training or initiate a new communication round.
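
The communication round described above can be sketched as follows. Because the body of Algorithm 1 is not reproduced here, the sketch makes its own assumptions: clients are modeled as callables that take the shared parameters and return updated ones, the subset is sampled uniformly at random, and aggregation is an unweighted mean in line with the 1/N weights of Equation (6). All names are illustrative.

```python
import random


def sample_clients(client_ids, fraction, rng):
    """Pick a random subset S_t containing roughly `fraction` of the clients."""
    count = max(1, round(fraction * len(client_ids)))
    return rng.sample(client_ids, count)


def average_shared(updates):
    """Unweighted coordinate-wise mean of the returned shared parameters."""
    n = len(updates)
    return [sum(values) / n for values in zip(*updates)]


def run_round(w_s, clients, fraction, rng):
    """One round: broadcast w_s, collect local results, aggregate them."""
    selected = sample_clients(sorted(clients), fraction, rng)
    updates = [clients[cid](list(w_s)) for cid in selected]
    return average_shared(updates)


# Toy clients: each one shifts the shared parameters by a fixed amount,
# standing in for E epochs of local training.
clients = {
    "a": lambda w: [x + 1.0 for x in w],
    "b": lambda w: [x + 3.0 for x in w],
}
rng = random.Random(0)
w_next = run_round([0.0, 0.0], clients, fraction=1.0, rng=rng)
```

In a fuller implementation each callable would run the regularized local objective and return only its shared parameters, keeping the personalized ones on the device.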

Algorithm 1: PFL with adaptive parameter decomposition (run on the server).

4.2. Adaptive Parameter Decomposition (APD)

To provide a more appropriate starting point in the early stage of training, we perform parameter decomposition on an initialized global model. We assume that each client $i$ employs an $L$-layer deep neural network model for multi-classification tasks. The parameter set of the neural network integrates all network layers $k \in L$ to be optimized, $w_{i} = \{\theta^{i,k}\}_{k = 1}^{L}$ with $\theta^{i,k} \equiv (\theta_{l}^{i,k}, \theta_{g}^{i,k})$, where $\theta^{i,k}$ corresponds to the parameters of the $k$-th layer.

For simplicity, we no longer consider the decomposition of each network layer separately, but simply decompose the model parameters maintained by client $i$. Upon completion of the local update, the global shared parameter $w_{g}^{i}$ is transmitted to the server for averaging cross-client knowledge. Simultaneously, the transfer vector $\Phi_{i}$ and client-specific parameter $w_{l}^{i}$ remain stored locally, awaiting updates during subsequent local model training and available for use during the testing phase, as follows:

$$w_{i} = w_{g}^{i} \odot \Phi_{i} + w_{l}^{i}, \tag{9}$$

where $\Phi_{i} = \{\varphi_{i}^{k} \mid k \in L\}$. The globally shared parameters for each client $i$ primarily aim to abstract the common knowledge across clients, while the personalized parameters for each client are dedicated to capturing the personalized knowledge. Moreover, parameters of the loss sliding window are configured to specify the timing for activating DPAO regularization and initiating parameter decomposition. Since only global shared parameters need to be exchanged with the server, the proposed method does not increase the communication burden compared with the standard FL framework that directly exchanges all parameters. Meanwhile, it can better protect data privacy.

DPAO-PFL effectively addresses the inherent drawbacks of traditional federated learning methods. FedAvg simply averages all client model updates; in non-IID scenarios, substantial differences in client data distributions can cause its global model to be dominated by local noise, leading to a significant decline in final accuracy. Although FedProx introduces a proximal term to regularize local updates, this constraint applies uniform treatment to all parameters and fails to account for variations in parameter importance, thereby limiting the model’s ability to balance the preservation of global common features with adaptation to local characteristics. In contrast, our proposed DPAO-PFL framework enables global parameters to retain cross-client common knowledge while allowing local parameters to adapt to client-specific data, thereby significantly enhancing model performance in non-IID settings.

4.3. Dynamic Parameter-Aware Regularization (DPAO)

The inconsistency of objectives among clients can cause the aggregated global model to be biased towards clients with larger data volumes or greater gradient magnitudes, leading to performance bias. The local update process is modeled as sequential multi-task learning, where the global model $w_{s}$ is aggregated through multiple rounds of training on previous tasks. When adapting to new tasks using local data for updating $w_{i}$, the shared parameters $w_{g}^{i}$ may partially or completely overwrite the knowledge acquired from earlier tasks, which is analogous to the catastrophic forgetting issue in neural networks. We propose a DPAO regularization strategy to protect the shared parameter $w_{g}$ to prevent forgetting across rounds, while $w_{l}$ remains unconstrained for maximum flexibility. The concrete details are shown in Algorithm 2. For the global shared parameter vector $w_{g}^{i}$, we construct a time-series regularization to adapt to evolving client data, maintaining the following two types of importance measures for all its components.

Efficient and Online Importance Estimation. We apply the online and efficient EWC++ method to continually estimate the importance of each shared parameter, eliminating costly recomputation for each task. Specifically, we perform an exponential moving average on the square of the gradients in Equation (3) for client i in the local training process. The fact that

F_{i}

can track the parameter directions with large gradient magnitudes indicates that random modifications to these dimensions may significantly impair the performance of the current task.
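The running estimate described above can be sketched as follows. This is a minimal illustration, assuming the update takes the usual exponential-moving-average form F ← αF + (1 − α)g² on the squared gradient; the function name `update_fisher` and the default α = 0.9 are hypothetical, and Equation (3) remains the authoritative definition.

```python
import numpy as np

def update_fisher(fisher, grad, alpha=0.9):
    """One EMA step on the squared gradient (diagonal empirical Fisher).

    alpha is the decay coefficient: larger values keep more history.
    The exact form of Equation (3) is assumed here, not reproduced.
    """
    return alpha * fisher + (1.0 - alpha) * grad ** 2

# Toy usage: two shared parameters, three mini-batch gradients.
fisher = np.zeros(2)
for grad in [np.array([1.0, 0.1]), np.array([0.8, 0.0]), np.array([1.2, 0.1])]:
    fisher = update_fisher(fisher, grad)
# The first coordinate, which sees consistently large gradients,
# ends up with the larger importance estimate.
```

Because the estimate is updated in place after each mini-batch, no separate pass over the data is needed at the end of a task.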

Algorithm 2: DPAO: Dynamic Parameter-Aware Optimization regularization.

Sensitivity in the Optimization Path. We also estimate the importance score

S_{i} (θ)

for each parameter along the training trajectory, which accumulates the ratio of the contribution of each update in Equation (5). Within the recent training steps, the dynamic path sensitivity scores

S_{i}

measured over sliding time windows indicate the degree to which the loss function is sensitive to parameter changes. The larger the cumulative score, the more it signifies that this dimension achieves greater loss improvement at a lower cost, thereby indicating its high utility for model personalization. Complementary to

F_{i}

,

S_{i}

emphasizes the protection of parameters with high curvature and highlights the prioritization of updating personal parameters.
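One possible reading of the sliding-window path score is sketched below. It assumes a Synaptic-Intelligence-style ratio, in which the estimated loss reduction of each step (−g·Δw) is accumulated over the window and divided by the squared net displacement; the class name `PathSensitivity`, the window length, and the damping term `xi` are hypothetical and are not taken from Equation (5).

```python
from collections import deque
import numpy as np

class PathSensitivity:
    """Sliding-window path sensitivity (hypothetical helper, SI-style)."""

    def __init__(self, dim, window=5, xi=1e-3):
        self.dim = dim
        self.xi = xi                        # damping term, avoids division by zero
        self.gains = deque(maxlen=window)   # per-step loss reduction estimate: -g * dw
        self.moves = deque(maxlen=window)   # per-step parameter displacement dw

    def record(self, grad, delta_w):
        """Store one optimizer step; old steps fall out of the window."""
        self.gains.append(-grad * delta_w)
        self.moves.append(delta_w)

    def score(self):
        """Loss improvement per unit of squared displacement, per parameter."""
        if not self.gains:
            return np.zeros(self.dim)
        gain = np.sum(np.stack(list(self.gains)), axis=0)
        dist = np.sum(np.stack(list(self.moves)), axis=0)
        return np.maximum(gain, 0.0) / (dist ** 2 + self.xi)

# Toy usage with plain SGD steps (learning rate 0.1).
tracker = PathSensitivity(dim=2)
g = np.array([1.0, 0.1])
for _ in range(7):
    tracker.record(g, -0.1 * g)
scores = tracker.score()
```

A bounded `deque` keeps memory constant regardless of how many local steps are taken, which matches the intent of measuring sensitivity only over recent training steps.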

Adaptive Regularization Update. In each round of local updates, we construct the dynamic importance weight

Ω_{i}

based on the online FIM with sliding average and path sensitivity accumulation, and then we add it as a quadratic penalty term to the local objective function of the client. The learning process can prioritize the retention of important parameters for the global task while learning the global shared parameters and simultaneously enhance the plasticity of non-important parameters. We introduce “peak-steady” detection on the server side to detect the stable period of loss by using the sliding window.

During the intermediate stages of training, once the model has stabilized, the continual learning method updates its meta-knowledge to avoid forgetting previously learned tasks. The stable period in which the global loss, after first reaching a peak, decreases and stabilizes is taken as the time to update the importance weights. We calculate the mean value

μ

and standard deviation

σ

of the current window and compare them with those of the previous window to monitor the stable period. If the current window loss satisfies the smoothness condition

μ_{cur} + σ_{cur} < τ

, the new task is considered to have entered the learning plateau after this peak, which triggers a synchronous update of the global importance weights. This event-triggered scheme significantly reduces the communication and computational pressure that would be caused by updating in every round.
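The detection rule can be illustrated with a short sketch. It assumes that a peak is flagged when both the mean and the standard deviation of the current window exceed those of the previous window, and that the trigger fires once when μ_cur + σ_cur < τ after such a peak; the class name `PeakSteadyDetector`, the window length, and the value of τ are hypothetical.

```python
import numpy as np

class PeakSteadyDetector:
    """Sliding-window plateau detector (illustrative, not the authors' code)."""

    def __init__(self, window=3, tau=0.5):
        self.window = window
        self.tau = tau
        self.losses = []
        self.peak_seen = False

    def update(self, loss):
        """Record one round's global loss; return True when a plateau follows a peak."""
        self.losses.append(loss)
        if len(self.losses) < 2 * self.window:
            return False
        cur = np.array(self.losses[-self.window:])
        prev = np.array(self.losses[-2 * self.window:-self.window])
        if cur.mean() > prev.mean() and cur.std() > prev.std():
            self.peak_seen = True      # loss is rising and noisy: peak period
            return False
        if self.peak_seen and cur.mean() + cur.std() < self.tau:
            self.peak_seen = False     # plateau reached: fire exactly once
            return True
        return False

# Toy loss curve: flat, then a spike, then decay to a plateau.
detector = PeakSteadyDetector(window=3, tau=0.5)
curve = [0.4, 0.4, 0.4, 0.4, 1.5, 2.0, 1.0, 0.6, 0.3, 0.2, 0.2, 0.2]
fired = [detector.update(x) for x in curve]
```

In this toy curve the flat start never triggers an update because no peak has been observed yet, so importance weights are refreshed only after the spike has decayed.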

The parameter importance score is dynamic in nature and contingent upon the stability of model parameters throughout the training process. We formalize this rationale using concepts from statistical learning theory: (1) Instability in “Peak Periods”: “Peak periods” are defined as phases where local model loss fluctuates sharply (e.g., >5% over consecutive rounds). During these periods, local data heterogeneity causes large variations in parameter updates, making gradient-based importance scores noisy. Models may overfit to spurious local patterns, leading to inflated importance scores for irrelevant parameters. (2) Reliability in “Stable Periods”: “Stable periods” indicate that parameters have converged to a local optimum for the current local data with low variance, as shown by the convergence of stochastic gradients to zero (e.g., loss fluctuations < 5% over T consecutive rounds). Importance scores better reflect long-term relevance to global performance, as they are no longer dominated by transient noise.

After integrating the importance scores derived from the FIM with the optimization path, we utilize them as a regularization constraint for the globally shared parameters.

L_{i}^{*} (w_{l}^{i}, w_{g}^{i}) = L_{i} (w_{l}^{i}, w_{g}^{i}) + λ \sum_{k \in L} (F_{i}^{(t - 1)} + S_{i}^{(t - 1)}) {(w_{g, k}^{t} - w_{s, k}^{t})}^{2},

(10)

where

λ

is the forgetting coefficient,

F_{i}^{(t - 1)}

is the empirical FIM for the parameter

w_{i}

, and

S_{i}^{(t - 1)}

is the score accumulated from the first training iteration

t_{0}

to the last training iteration. To ensure that evaluation metrics

F_{w_{i}}^{(t - 1)}

and

S (w_{i})

remain on the same scale, we normalize each metric to the interval

[0, 1]

. Given that the scores accumulate over time, the regularization constraint becomes progressively stricter. Therefore, averaging the scores after each task can reduce the sensitivity of the regularization hyperparameter to the number of tasks. Task-based sequential learning methods assume that data is organized into tasks with identifiable boundaries, enabling the training process to be divided into consecutive stages. By combining the above two importance indicators, online EWC++ and PI scores, a weighted quadratic penalty is imposed on each shared parameter to achieve a stability–flexibility balance across rounds. We normalize and automatically adjust constraint strength for each parameter, balancing stability and personalization under non-IID distributions.
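The penalty in Equation (10) can be sketched numerically as follows. The sketch assumes min-max scaling as the normalization to [0, 1] and treats the importance measures as per-parameter vectors; the function names `minmax`, `dpao_penalty`, and `dpao_penalty_grad` are hypothetical.

```python
import numpy as np

def minmax(x, eps=1e-12):
    """Rescale scores to [0, 1]; a constant vector maps to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / (span + eps)

def dpao_penalty(w_g, w_s, fisher, sens, lam=1.0):
    """lam * sum_k (F_k + S_k) * (w_g[k] - w_s[k])**2, as in Equation (10)."""
    omega = minmax(fisher) + minmax(sens)
    return lam * np.sum(omega * (w_g - w_s) ** 2)

def dpao_penalty_grad(w_g, w_s, fisher, sens, lam=1.0):
    """Gradient of the penalty with respect to the local shared parameters w_g."""
    omega = minmax(fisher) + minmax(sens)
    return 2.0 * lam * omega * (w_g - w_s)

# Toy usage: the second parameter is important under both measures,
# so only its deviation from the server value w_s is penalized.
w_g = np.array([1.0, 1.0])
w_s = np.array([0.0, 0.0])
fisher = np.array([0.0, 2.0])
sens = np.array([0.0, 4.0])
penalty = dpao_penalty(w_g, w_s, fisher, sens)
```

The personalized parameters do not appear in the penalty at all, which reflects the statement that they remain unconstrained.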

5. Results

In this section, we evaluate the effectiveness of the DPAO-PFL method against several existing federated learning methods and verify its performance on various benchmark datasets. The detailed experimental setup is described in Section 5.1. Then, we conduct a series of comparison experiments on multiple real-world datasets in Section 5.2, followed by ablation experiments and an analysis of robustness to varying parameter values.

5.1. Experiment Setup

To thoroughly assess the effectiveness and robustness of the proposed DPAO-PFL method, this paper carries out an in-depth and comprehensive evaluation using the latest federated learning benchmarks across four real-world datasets.

5.1.1. Dataset

We employ four datasets from different application domains: the image datasets CIFAR-100 [39] and FEMNIST [40], the text dataset Shakespeare, and the tabular dataset Vehicle. Details of the datasets are presented in Table 2.

SensIT Vehicle. The dataset includes signals from acoustic and seismic sensors to classify the different vehicles, which comprises 23 sensor instances from a distributed network. Initially introduced by Duarte et al. [41], it aims to classify vehicles traveling on a road segment based on measurement data extracted from binary contour images of various vehicles [42]. Each sample is characterized by 100-dimensional feature data [43] and associated with a binary label. In our approach, each sensor is modeled as an individual client, and a linear support vector machine (SVM) model is collaboratively trained via federated learning to accomplish the binary classification task.

FEMNIST. The FEMNIST dataset originates from the LEAF benchmark [40] and is dedicated to the image classification task, which includes handwritten alphanumeric characters from a total of 3500 users. Each sample is comprised of a 28 × 28 grayscale image, an author identifier, a handwriting style indicator, and the corresponding character. In addition to the classification task for handwritten digits, it also encompasses the classification of handwritten letters, utilizing the same image structure and parameter configurations as the EMNIST dataset [44]. During the experiment, each client was assigned three types of character samples, with a total of 200 clients participating. Multiple logistic regression models were collaboratively trained using the FL framework.

Shakespeare. Shakespeare is one of the benchmark datasets in LEAF [40], which is specifically designed for FL. Its primary task is next-character prediction, and it is constructed from the complete works of William Shakespeare. It originally had 1129 users but was reduced to 660 users based on the selected sequence length. In the federated learning context, it can be partitioned into distributed clients according to the official division code. Clients are typically partitioned by speaking role, so that each role represents a distinct client. For example, the character “Hamlet” is considered a node, and all of his lines are treated as local data. Word usage and vocabulary differ significantly among roles, which causes substantial client drift. This dataset therefore lets us examine whether a model can learn the lines of new characters while retaining the language style of old characters, and thus assess catastrophic forgetting.

CIFAR-100. This visual recognition benchmark dataset released by Krizhevsky et al. [39] is widely used for training and evaluating image classification models. The dataset comprises 100 classes. Each class includes 600 color images with a size of

32 \times 32

, where 500 images are allocated to the training set and 100 to the test set. Each image is associated with two labels: fine_labels and coarse_labels, which correspond to the fine-grained and coarse-grained categories of the image, respectively. Additionally, there are 20 superclasses, each encompassing 5 fine-grained subclasses. Compared with the CIFAR-10 dataset, which was collected under the same policy, CIFAR-100 features a larger number of classes and fewer samples per class, yet it retains the diversity characteristic of natural images. We distribute the data across 100 clients, each holding samples from 10 classes. The number of samples on each device follows a power-law distribution to simulate an unbalanced distribution. To illustrate the influence of data heterogeneity on the performance of FL algorithms, we additionally constructed IID versions of the dataset.

5.1.2. Baselines

FedAvg. [9] Federated Averaging was introduced by McMahan et al. as the first widely adopted FL algorithm, combining E local SGD epochs per client with periodic global averaging on the server. It has demonstrated empirical success in non-IID data environments and serves as a standard benchmark in contemporary federated learning research. In FedAvg, the local objective function is defined as in Equation (6), with the empirical loss derived from the client’s training data. All clients adopt a uniform learning rate

η

and epoch counts E, while the server periodically aggregates the model parameters of each client by averaging. Despite its simplicity, FedAvg often converges well even when theoretical guarantees are lacking, an effect recently attributed to the average drift at the optimum in realistic settings.
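The server-side averaging step can be sketched as below, assuming each client returns a flat parameter vector together with its sample count; the function name `fedavg_aggregate` is hypothetical.

```python
import numpy as np

def fedavg_aggregate(client_weights, num_samples):
    """Average client parameter vectors, weighted by local sample counts."""
    total = float(sum(num_samples))
    return sum((n / total) * w for w, n in zip(client_weights, num_samples))

# Toy usage: the client holding three times more data pulls the average toward itself.
global_w = fedavg_aggregate(
    [np.array([0.0, 0.0]), np.array([4.0, 8.0])],
    num_samples=[1, 3],
)
```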

FedProx. [45] FedProx extends FedAvg by adding a proximal regularization term

\frac{μ}{2} {∥w_{i} - w_{global}∥}^{2}

to each client’s local objective, directly controlling how far local updates can drift from the global model. This proximal term accommodates both system heterogeneity (variable amounts of work across clients) and statistical heterogeneity in the data, and it guarantees convergence under broader conditions than FedAvg. FedProx yields more stable and often higher accuracy than FedAvg in highly heterogeneous FL scenarios, improving test accuracy by up to

22 %

on challenging benchmarks.
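As a small illustration of how the proximal term enters a local update, the sketch below takes one gradient step on the client loss plus (μ/2)‖w − w_global‖². The function name `fedprox_local_step` and the step size are hypothetical; the gradient of the proximal term is μ(w − w_global).

```python
import numpy as np

def fedprox_local_step(w, w_global, grad_loss, lr=0.01, mu=1.0):
    """One SGD step on F_i(w) + (mu / 2) * ||w - w_global||^2."""
    return w - lr * (grad_loss + mu * (w - w_global))

# Toy usage: with a zero loss gradient, the step only pulls w toward w_global.
w_next = fedprox_local_step(
    w=np.array([1.0, 1.0]),
    w_global=np.zeros(2),
    grad_loss=np.zeros(2),
    lr=0.1,
    mu=1.0,
)
```

Note that the pull has the same strength for every coordinate, which is the uniform treatment of parameters that DPAO replaces with importance-weighted penalties.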

FedMeta. [46] It reformulates federated learning from the perspective of meta-learning rather than relying on a static global model. This approach introduces a parameterized meta-learner that enables rapid adaptation to each client’s data distribution by learning how to learn effectively. Each client employs model-agnostic meta-learning (MAML) [47] to optimize the initial parameters of the model, ensuring that the pre-trained model can be efficiently fine-tuned with minimal data from the target client. During each communication round, clients receive the current meta-parameters, execute a few inner-loop updates tailored to their local tasks, and transmit adaptation gradients back to the server for meta-updating, thereby significantly reducing communication overhead.

FedBN. [48] The batch normalization (BN) layers are updated locally on the client side without inter-client communication, while the parameters of all other layers are aggregated on the server. FedBN effectively mitigates the issue of feature distribution shifts caused by non-IID data, such as variations in marginal feature distributions arising from different medical scanners or driving scenarios, by maintaining local batch-norm statistics rather than aggregating them globally. During the training process, each client utilizes its own running mean and variance for the BN layer and transmits only the weights of other layers to the server. Notably, FedBN does not involve any tunable hyperparameters and imposes minimal additional computational overhead, facilitating its seamless integration into any neural network architecture incorporating BN layers within the federated learning framework.
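The aggregation rule can be sketched as follows, assuming each client state is a dictionary mapping parameter names to arrays and that BN entries can be recognized by name. The function name `fedbn_aggregate` and the `"bn"` substring test are hypothetical conventions chosen for this illustration.

```python
import numpy as np

def fedbn_aggregate(client_states, num_samples, is_bn=lambda name: "bn" in name):
    """Sample-weighted average of all non-BN entries; BN entries stay local."""
    total = float(sum(num_samples))
    shared = {}
    for name in client_states[0]:
        if is_bn(name):
            continue  # BN parameters and statistics never leave the client
        shared[name] = sum(
            (n / total) * state[name] for state, n in zip(client_states, num_samples)
        )
    return shared

# Toy usage: two clients, one linear layer and one BN layer each.
states = [
    {"fc.weight": np.array([0.0, 2.0]), "bn1.weight": np.array([1.0])},
    {"fc.weight": np.array([2.0, 4.0]), "bn1.weight": np.array([5.0])},
]
merged = fedbn_aggregate(states, num_samples=[10, 10])
```

After aggregation, each client would overwrite only the returned keys and keep its own BN entries.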

5.1.3. Implementation Details

Experimental environment. We simulate a federated learning architecture on a cloud-based IDE equipped with an RTX 2080Ti GPU, an Intel Core i9 CPU, and 64 GB of RAM. All methods with N clients and a central server are implemented using PyTorch 2.3.0 based on the latest PFL framework. In all experiments, the server aggregates the update parameters returned by the clients with weights proportional to each client's number of samples. The number of selected clients in each round is set to 10, and the random selection of devices and the mini-batch order of the data are fixed across all runs. For the FedMeta approach, the training set on each client is further partitioned into two subsets:

80 %

as the support set and

20 %

as the query set. We map classes of images to different device types to simulate sensor type differences and perform non-IID data partitioning for each client using Dirichlet distribution [49] by sampling

p_{k} \sim \mathrm{Dir}(δ)

over labels and allocating a

p_{k, i}

proportion of the instances of class k to party i.
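The label-wise Dirichlet split described above can be sketched as follows. The function name `dirichlet_partition`, the seed handling, and the toy label array are hypothetical; smaller values of δ produce more skewed client label distributions.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, delta, seed=0):
    """Assign sample indices to clients, class by class, with p_k ~ Dir(delta)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        rng.shuffle(idx)
        p_k = rng.dirichlet(delta * np.ones(num_clients))
        # Cut points that give client i roughly a p_k[i] share of class k.
        cuts = (np.cumsum(p_k)[:-1] * len(idx)).astype(int)
        for i, part in enumerate(np.split(idx, cuts)):
            client_indices[i].extend(part.tolist())
    return client_indices

# Toy usage: 4 classes with 50 samples each, split across 5 clients.
labels = np.repeat(np.arange(4), 50)
parts = dirichlet_partition(labels, num_clients=5, delta=0.5)
```

Every sample is assigned to exactly one client, so the client datasets are disjoint and together cover the full dataset.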

Hyperparameter. To ensure a fair comparison, the mini-batch SGD [50] with a fixed learning rate is uniformly used as the local solver, and we set the local epochs as

E = 5

, batch size to

B = 64

as default. The initial learning rate

η

is set to 0.01 across all datasets. We train for 500 rounds on FEMNIST and Vehicle and for 800 rounds on CIFAR-100 and Shakespeare. The regularization parameter

μ

is initialized to 1 by default, with its optimal value identified via grid search. In the DPAO-PFL approach, the additional forgetting factor hyperparameter

λ

is also initialized to 1. Within the importance weight update mechanism, the thresholds

μ_{cur}, σ_{cur}

for the mean and standard deviation of the loss are fine-tuned based on the characteristics of different datasets. The round parameter indicating the initiation of parameter decomposition is set to 2, thereby mitigating potential drastic changes in the model caused by abrupt parameter decomposition. To ensure stability, the importance weights are consistently updated during the two rounds preceding and following the initiation of parameter decomposition.

Evaluation metrics. In accordance with the widely recognized performance evaluation criteria in federated learning, we use the global objective to assess the average empirical loss on the training set and the average accuracy on the test set. Since each communication round corresponds to one server aggregation step, the results are presented in terms of rounds.

5.2. Performance Comparison

We evaluate the performance of DPAO-PFL against each benchmark method on the four independent non-IID datasets. The changes in average empirical loss and test accuracy across communication rounds are presented in detail in Figure 2. On all non-IID datasets, DPAO-PFL (orange) attains the highest final accuracy and converges most smoothly among all methods, closely followed by FedBN (blue) and FedMeta (gold), while FedProx (purple) and FedAvg (green) lag due to slower adaptation. Although introducing moderate sample imbalance causes the convergence rates to diverge, DPAO-PFL maintains robust performance. FedMeta shows moderate resilience, but DPAO-PFL still outperforms the next-best method, whereas FedAvg and FedMeta drop by 2∼5 points, and FedProx/FedAvg degrade further. Under extreme imbalance (CIFAR-100), only DPAO-PFL remains above

53.87 %

test accuracy, whereas all other methods collapse below

50 %

, demonstrating the efficacy of dynamic parameter-aware regularization and continual-learning principles in highly heterogeneous federated environments.

It is evident that the benchmark FedAvg algorithm exhibits a significant drop in accuracy on the FEMNIST and Shakespeare datasets, while the improvements achieved by other comparative algorithms remain relatively modest. In contrast, DPAO-PFL achieves the highest average test accuracy across all experimental datasets, with an improvement of 5.41∼30.42% compared to the baseline FedAvg method. Owing to the inherent data heterogeneity, DPAO-PFL, FedMeta, and FedBN consistently outperform single-global-model FL methods such as FedAvg and FedProx. This demonstrates that the judicious application of personalization strategies in federated learning can yield superior algorithmic performance compared to approaches that rely on a single global model.

A comprehensive analysis of the experimental results reveals that DPAO-PFL can construct a federated model with higher global accuracy and exhibits versatility across various datasets and training models. Concerning data scales, the method was evaluated on datasets with diverse sample sizes and client distributions, including large-scale datasets, as well as smaller-scale datasets like Vehicle. The results demonstrated that DPAO-PFL consistently outperformed baseline methods by

5.41 %

to

30.42 %

in average accuracy, irrespective of data volume. Across all tested models, DPAO-PFL achieved stable convergence and high accuracy, even when applied to deep networks containing millions of parameters. This performance advantage is attributed to normalized importance scores that reduce sensitivity to hyperparameter settings.

Figure 3 visually presents the changes in the average empirical loss of each comparison method across all experimental datasets over communication rounds. It is evident that FedAvg and FedProx converge relatively slowly, fail to reduce the empirical loss to a sufficiently low value for some clients, and exhibit oscillatory behavior in the global model’s test accuracy. FedBN freezes global parameters during local training, processes the batch normalization layer using its running mean and variance, and updates only the batch normalization layer parameters. Furthermore, only the weights of other layers are transmitted to the server, and the aggregation process solely fuses these shared parameters. This approach focuses exclusively on the independent updates of global and local parameters while neglecting the impact of model drift during continuous updates. FedMeta leverages meta-learning techniques, which are advantageous for the initial training phase of federated learning models. However, its performance becomes unstable during the middle and late stages of training, and the computation of second derivatives introduces significant computational overhead.

We conducted simulations and evaluations of the performance of federated learning methods under varying degrees of heterogeneous data distribution using the CIFAR-100 dataset. We constructed an IID version of the dataset by aggregating all the data initially distributed across the clients and then randomly redistributing it to each client. Under non-IID data distribution, the fixed parameter aggregation method makes the global model prone to being dominated by clients with large data volumes or significant features, leading to overfitting or insufficient generalization ability of the model on other clients. FedBN lacks personalized client adjustments, as its parameter updates are confined to the global model. FedMeta’s fixed initialization mechanism encounters difficulties in adapting to dynamic data variations, potentially leading to initialization drift. Furthermore, when global models fail to accommodate diverse data characteristics across clients, such as varying handwriting styles, a phenomenon referred to as “style forgetting” may occur.

Additionally, we performed centralized training with mini-batch SGD across different data distributions for comparative analysis. As illustrated in Figure 4, the performance of the centralized method degrades substantially when the level of heterogeneity increases. Our proposed approach demonstrates robust performance not only under IID conditions but also under diverse data heterogeneity scenarios. While FL methods experience varying degrees of performance decline, their overall impact remains relatively limited, effectively mitigating the instability induced by data heterogeneity.

Notably, due to the presence of data heterogeneity, certain federated learning algorithms even surpass centralized learning approaches under IID data conditions. This suggests that incorporating appropriate personalized strategies in federated learning under heterogeneous data can yield superior algorithmic performance compared to centralized learning paradigms. In contrast, other comparative algorithms, which rely solely on a single global model, encounter mutual interference during client-specific model training under data heterogeneity conditions, resulting in diminished overall performance. Furthermore, PFL methods such as FedMeta, which prioritize local model adaptation over global generalization, exhibit enhanced performance when data heterogeneity is high but underperform relative to other federated learning techniques when heterogeneity is low.

The comprehensive experimental results indicate that DPAO-PFL can significantly reduce empirical loss across various datasets within a reduced number of training rounds, thereby achieving more stable and faster convergence. Moreover, DPAO-PFL achieves enhanced generalization capability by effectively decomposing global knowledge and client-specific knowledge to extract more universal global features. Additionally, incorporating a continual learning regularization term enables the global model of DPAO-PFL to achieve more robust convergence in complex scenarios.

5.3. Ablation Studies

To examine the effectiveness of the proposed model, we conduct a series of ablation experiments that compare its individual components, using accuracy as the evaluation metric. We have proposed two improvements, APD (Section 4.2) and DPAO (Section 4.3), to address the shortcomings of traditional federated learning methods. Consequently, we report the performance of DPAO-PFL and its components on FEMNIST and compare the performance of DPAO-PFL under varying local epoch settings.

We conducted several sets of control experiments on the non-IID FEMNIST dataset to compare the comprehensive performance and verify the synergy effect of the components. Among them,

w / o

APD disables parameter decomposition, so that all clients share a single global model, update only global parameters, and have no client-specific parameters or transfer vectors. The comparison variant

w / o

DPAO removed the dynamic regularization term

(λ = 0)

in the objective function while retaining parameter decomposition. FedAvg is adopted as the benchmark to simplify comparison.

As shown in Table 3, both DPAO and APD surpass FedAvg in terms of convergence speed and final convergence accuracy. DPAO-PFL reaches

60 %

and

80 %

accuracy in only 26 and 62 rounds, faster than either ablated variant, and it converges at

98.12 %

, which is significantly higher than the best accuracy of the compared variants. Further analysis reveals that APD and DPAO regularization are the core components for enhancing personalization and anti-forgetting capabilities, while path integral scoring and event-triggered mechanisms further optimize dynamic adaptability and communication efficiency. By integrating these modules, DPAO-PFL delivers superior communication efficiency and higher asymptotic accuracy in challenging non-IID federated scenarios.

Meanwhile, Table 3 demonstrates that the proposed DPAO-PFL, leveraging peak–stable period detection, outperforms fixed-interval update methods (

w /

interval(1),

w /

interval(20)) and the classical FL method FedAvg on non-IID FEMNIST. Specifically, DPAO-PFL needs only 125 rounds to reach

90 %

accuracy (almost

40 %

fewer than fixed-interval methods), a final accuracy of

98.12 %

(almost

2.5 %

higher than fixed-interval variants), and a

0.58 %

accuracy standard deviation, which indicates stronger cross-client consistency. This performance advantage stems from the peak–stable detection mechanism, which accurately identifies optimal update timings and effectively suppresses gradient noise typically introduced by fixed-interval strategies.
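The two summary statistics quoted here can be computed as in the sketch below. It assumes that "rounds to X% accuracy" means the first round whose test accuracy is at or above the target and that the accuracy standard deviation is taken across per-client test accuracies; both definitions and the helper names are assumptions, since the exact procedure behind Table 3 is not spelled out.

```python
import numpy as np

def rounds_to_reach(acc_per_round, target):
    """First 1-based round whose accuracy is at or above target, else None."""
    for r, acc in enumerate(acc_per_round, start=1):
        if acc >= target:
            return r
    return None

def client_accuracy_sd(client_accs):
    """Population standard deviation of per-client test accuracy."""
    return float(np.std(np.asarray(client_accs, dtype=float)))

# Toy usage with made-up numbers (not results from the paper).
first_hit = rounds_to_reach([0.20, 0.50, 0.61, 0.70], target=0.60)
spread = client_accuracy_sd([97.0, 99.0])
```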

In contrast to fixed-interval strategies, DPAO-PFL adopts an event-triggered update strategy augmented with APD and DPAO regularization techniques, which enables the system to capture high-quality model updates while effectively balancing global coordination with local adaptation, thereby significantly enhancing communication efficiency, model accuracy, and system robustness. The event-triggered mechanism skips updates during peak periods and fires only when peak–stable detection confirms a stable period. To validate the efficiency and stability of our approach, we report the comparative results on (5, 25) non-IID FEMNIST in terms of communication cost, computation burden, and parameter update concentration.

5.4. Communication and Overhead

To reduce redundant communication, we adopt an event-triggered update strategy that allows clients to push updates only when their parameter changes are deemed significant. Fixed-interval methods exhibit evident limitations: interval(1) results in significant noise accumulation, thereby prolonging convergence and increasing variance, whereas interval(20) performs infrequent updates that often overlook critical stable-phase synchronization opportunities, ultimately compromising convergence speed and cross-client consistency.

As demonstrated in Figure 5, DPAO-PFL reduces the average update size and total FLOPs without compromising the convergence stability. Update size consistently decreases as the trigger interval increases, since fewer synchronization events occur. A very small interval (

w /

interval(1)) introduces substantial communication overhead, whereas a large interval (

w /

interval(20)) leads to model divergence and increased computational demands. By minimizing the synchronization of noisy or less relevant parameters, DPAO-PFL enhances model robustness and adaptability, particularly in heterogeneous environments. Parameter importance standard deviation (SD), which reflects the variance of parameter selection, also decreases with the interval. Through “peak-steady” detection and dynamic regular smoothing of importance updates, DPAO-PFL can significantly reduce the variance of importance scores, thereby enhancing the stability of aggregation and convergence performance.

The decline in parameter importance SD with longer intervals indicates that conservative update strategies mitigate unnecessary parameter noise. Total FLOPs of DPAO-PFL are lower than the best fixed-interval method (

w /

interval(5)). Fewer communication rounds reduce iterative computations, and only

30 %

critical parameters participate in backpropagation, avoiding redundant calculations for low-importance weights. The ablation study on different trigger intervals confirms that moderate update sparsity yields the best trade-off between communication efficiency and learning stability.

In FL methods, multiple rounds of local updates are performed on selected clients, and the local models or updates are aggregated on the server to derive a more effective global model. A local epoch is one full pass of a client's training data through the network, including the forward pass and backpropagation, and E denotes the number of local epochs per communication round. Consequently, E determines the local computational load in each round. As shown in Table 4, local epochs affect the final global accuracy, and DPAO-PFL significantly mitigates the negative impact of data heterogeneity.

Smaller E reduces computational costs but may result in underfitting, and larger E improves local fitting capability yet may lead to overfitting or exacerbate client heterogeneity. More local iterations may not only cause the training process to overemphasize the optimization of local objectives, potentially degrading the accuracy and convergence of the global model, but also reduce the overall convergence speed of a communication-constrained network. DPAO-PFL yields the highest accuracy for every E, significantly improving both convergence speed and final model quality across varying local-epoch budgets, especially under challenging non-IID federated settings.

5.5. Hyperparameter Study

We achieve a balance between retaining historical knowledge and adapting to new tasks by dynamically tuning the learning rate

η

, regularization strength parameter

λ

and the Fisher information update decay coefficient

α

. We utilize grid search to fine-tune the weight of the control penalty term, thereby achieving a balance between learning new and retaining previously acquired knowledge from old tasks.

λ

explicitly determines the retention strength of historical knowledge and can be interpreted as an implicit forgetting factor. The update decay coefficient

α

, which represents the moving average attenuation rate during online FIM updates, determines the degree of attenuation for historical gradient information, thereby indirectly affecting the forgetting rate.

As shown in Figure 6a, DPAO-PFL achieves the highest convergence accuracy around

η = 0.001

. With

η = 0.01

, overshooting updates cause oscillations, a lower final accuracy, and delayed convergence. An overly large

η

leads to divergence or oscillation, while an overly small

η

yields sluggish adaptation and suboptimal convergence. The learning rate

η

of the optimizer during local training on the client side governs the learning progress of the model and directly influences the convergence speed and stability of the local model. A larger learning rate accelerates parameter updates but also renders the model more vulnerable to the impact of abnormal data, thereby increasing the risk of divergence in the local model.

Meanwhile, we conducted an experiment to investigate the influence of the update decay coefficient

α

on the model’s performance. The comparison of average test accuracy results is presented in Figure 6b. A smaller

α

, where the current batch is assigned a higher weight, may introduce noise into the importance estimation values, thereby resulting in a slower convergence rate. Conversely, a larger

α

tends to produce smoother estimation values but may introduce lag when adapting to new tasks. Furthermore, experimental results demonstrate that DPAO-PFL achieves consistently favorable outcomes across a broader range of parameter configurations, highlighting its robustness with respect to the choice of

α

. In the context of heterogeneous data, local models are more prone to diverge from the global model during learning, underscoring the necessity of incorporating cumulative memory of global parameter information.

The regularization term constraint λ is introduced to balance the relative contributions of the model parameter updates and the regularization term. The aim is to ensure that the regularization mechanism keeps guiding personalized feature selection during the parameter aggregation update process and is not overly weakened. As shown in Figure 6c, we vary λ over the normalized interval [0, 1] in the FEMNIST (5, 25) non-IID environment. For each λ value, we run the simulation three times using the same random seed for sensitivity scanning and take the average. The Personal Gain (%) metric quantifies the degree of personalization improvement and is computed as the difference between the local accuracy and the global accuracy. We evaluated the global model's average accuracy on the FEMNIST dataset, the communication rounds required for convergence, and its personalized performance.
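The Personal Gain computation and the averaging over repeated runs can be sketched as follows. This is a minimal sketch under the definition given above (local accuracy minus global accuracy); the helper names are hypothetical and the accuracy values are made-up placeholders, not results from the paper.

```python
from statistics import mean


def personal_gain(local_acc, global_acc):
    """Personal Gain in percentage points: local accuracy minus global accuracy."""
    return local_acc - global_acc


def average_over_runs(run_results):
    """Average (local_acc, global_acc) pairs from repeated runs at one lambda value."""
    local_avg = mean(local for local, _ in run_results)
    global_avg = mean(glob for _, glob in run_results)
    return local_avg, global_avg, personal_gain(local_avg, global_avg)


# Hypothetical accuracies (%) from three runs at a single lambda value.
example_runs = [(97.0, 95.0), (98.0, 96.0), (99.0, 97.0)]
local_avg, global_avg, gain = average_over_runs(example_runs)
```

A positive gain indicates that the personalized local models outperform the shared global model on the clients' own data.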

As illustrated in Figure 6c, when λ = 0 (no regularization), DPAO-PFL reduces to standard aggregation without a mechanism for filtering out noisy parameters, so certain client updates adversely affect global convergence. As λ varies, the model's accuracy, level of personalization, and communication efficiency all change accordingly. Empirical results show that when λ = 0.01, the model achieves its best performance in the shortest time. This suggests that moderate regularization not only improves generalization capability but also expedites global convergence. A well-chosen regularization term enables each local update to concentrate more effectively on high-importance parameters, thereby enhancing aggregation quality and accelerating convergence.
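To make the role of λ concrete, the sketch below uses an importance-weighted quadratic penalty of the kind found in EWC-style continual learning, task loss + λ · Σ_j Ω_j (w_j − w_{g,j})². This form is an assumption chosen for illustration and is not claimed to be the exact DPAO regularizer; all names and numbers are hypothetical.

```python
def importance_penalty(local_w, global_w, importance):
    """Sum of importance-weighted squared deviations from the global parameters."""
    return sum(imp * (lw - gw) ** 2
               for lw, gw, imp in zip(local_w, global_w, importance))


def regularized_loss(task_loss, local_w, global_w, importance, lam):
    """Task loss plus the lambda-scaled penalty (assumed EWC-style form)."""
    return task_loss + lam * importance_penalty(local_w, global_w, importance)


def penalty_gradient(local_w, global_w, importance, lam):
    """Gradient of the penalty term with respect to the local parameters."""
    return [2.0 * lam * imp * (lw - gw)
            for lw, gw, imp in zip(local_w, global_w, importance)]


local_w = [1.0, 1.0]
global_w = [0.0, 0.0]
importance = [10.0, 0.1]   # first parameter is assumed to matter far more

no_reg = regularized_loss(0.5, local_w, global_w, importance, lam=0.0)
with_reg = regularized_loss(0.5, local_w, global_w, importance, lam=0.01)
pull = penalty_gradient(local_w, global_w, importance, lam=0.01)
```

With λ = 0 the penalty vanishes and only the task loss remains, which mirrors the unregularized case discussed above. With λ > 0 the pull toward the global value is proportional to each parameter's importance, so high-importance parameters are protected while low-importance ones remain free to personalize.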

6. Conclusions and Future Work

In this work, we proposed DPAO-PFL, a novel Dynamic Parameter-Aware Optimization framework for Personalized Federated Learning under dynamic non-IID conditions. Our method decouples global and personalized parameters, applies online importance estimation via Fisher information, and adapts the regularization strength in a lightweight manner through event-triggered communication. Extensive experiments demonstrate that DPAO-PFL consistently outperforms strong baselines in both personalization accuracy and convergence stability.

In future work, we plan to extend DPAO-PFL to large-scale heterogeneous systems with real-world drift patterns, explore theoretical convergence analysis under dynamic updates, and integrate DPAO-PFL with communication-efficient strategies for resource-constrained edge deployment. In terms of practical value, DPAO-PFL is applicable to real-world scenarios such as healthcare, finance, and IoT, where data privacy, heterogeneity, and resource constraints shape federated learning deployments.

Author Contributions

J.T. and Y.G. provided methodology, experiments, and writing; X.L. and J.J. provided resources, validation, and revision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China under grant No. 2023YFB3107601.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

We use public datasets to evaluate the performance of the proposed method. The CIFAR-100 dataset can be freely downloaded from https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 5 May 2025). FEMNIST is a dataset for recognizing handwritten digits and letters; its official source is the LEAF framework. Shakespeare focuses on next-character prediction and, like FEMNIST, is part of the LEAF benchmark, which is specifically designed for federated learning. Both can be freely downloaded from https://github.com/TalwalkarLab/leaf/tree/master (accessed on 5 May 2025). A PyTorch version can be freely downloaded from https://github.com/SMILELab-FL/FedLab/tree/master/datasets (accessed on 5 May 2025). The Vehicle dataset comprises multi-dimensional vehicle sensor data designed for classification tasks within distributed sensor networks. It can be obtained from the official resource at https://euhubs4data.eu/datasets/know-center-gmbh-sensit-vehicle/ (accessed on 5 May 2025).

Conflicts of Interest

There are no conflicts of interest to declare.

References

1. Panduman, Y.Y.F.; Funabiki, N.; Fajrianti, E.D.; Fang, S.; Sukaridhoto, S. A survey of AI techniques in IoT applications with use case investigations in the smart environmental monitoring and analytics in real-time IoT platform. Information 2024, 15, 153.
2. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
3. Li, Q.; Diao, Y.; Chen, Q.; He, B. Federated learning on non-iid data silos: An experimental study. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 965–978.
4. Javeed, D.; Saeed, M.S.; Kumar, P.; Jolfaei, A.; Islam, S.; Islam, A.N. Federated learning-based personalized recommendation systems: An overview on security and privacy challenges. IEEE Trans. Consum. Electron. 2023, 70, 2618–2627.
5. Li, Y.; Wen, G. Research and Practice of Financial Credit Risk Management Based on Federated Learning. Eng. Lett. 2023, 31, 271.
6. Liu, X.; Zhao, J.; Li, J.; Cao, B.; Lv, Z. Federated neural architecture search for medical data security. IEEE Trans. Ind. Inform. 2022, 18, 5628–5636.
7. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60.
8. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated learning with non-iid data. arXiv 2018, arXiv:1806.00582.
9. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the convergence of fedavg on non-iid data. arXiv 2019, arXiv:1907.02189.
10. Tan, A.Z.; Yu, H.; Cui, L.; Yang, Q. Towards personalized federated learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 9587–9603.
11. Fallah, A.; Mokhtari, A.; Ozdaglar, A. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Adv. Neural Inf. Process. Syst. 2020, 33, 3557–3568.
12. Zhang, J.; Guo, S.; Ma, X.; Wang, H.; Xu, W.; Wu, F. Parameterized knowledge transfer for personalized federated learning. Adv. Neural Inf. Process. Syst. 2021, 34, 10092–10104.
13. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. Scaffold: Stochastic controlled averaging for federated learning. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 5132–5143.
14. Yoon, J.; Kim, S.; Yang, E.; Hwang, S.J. Scalable and Order-robust Continual Learning with Additive Parameter Decomposition. arXiv 2019, arXiv:1902.09432.
15. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–19.
16. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends® Mach. Learn. 2021, 14, 1–210.
17. Wu, W.; He, L.; Lin, W.; Mao, R.; Maple, C.; Jarvis, S. SAFA: A semi-asynchronous protocol for fast federated learning with low overhead. IEEE Trans. Comput. 2020, 70, 655–668.
18. Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv 2017, arXiv:1712.01887.
19. Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive federated optimization. arXiv 2020, arXiv:2003.00295.
20. Mansour, Y.; Mohri, M.; Ro, J.; Suresh, A.T. Three approaches for personalization with applications to federated learning. arXiv 2020, arXiv:2002.10619.
21. Ghosh, A.; Chung, J.; Yin, D.; Ramchandran, K. An efficient framework for clustered federated learning. Adv. Neural Inf. Process. Syst. 2020, 33, 19586–19597.
22. Yu, S.L.; Liu, Q.; Wang, F.; Yu, Y.; Chen, E. Federated News Recommendation with Fine-grained Interpolation and Dynamic Clustering. In Proceedings of the CIKM’23: 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 3073–3082.
23. Hahn, S.J.; Jeong, M.; Lee, J. Connecting Low-Loss Subspace for Personalized Federated Learning. In Proceedings of the KDD’22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 505–515.
24. Wang, L.; Zhang, X.; Su, H.; Zhu, J. A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5362–5383.
25. Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71.
26. Shoham, N.; Avidor, T.; Keren, A.; Israel, N.; Benditkis, D.; Mor-Yosef, L.; Zeitak, I. Overcoming Forgetting in Federated Learning on Non-IID Data. arXiv 2019, arXiv:1910.07796.
27. Yao, X.; Sun, L. Continual Local Training For Better Initialization Of Federated Models. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1736–1740.
28. Li, D.; Wang, J. FedMD: Heterogenous Federated Learning via Model Distillation. arXiv 2019, arXiv:1910.03581.
29. Yang, X.; Yu, H.; Gao, X.; Wang, H.; Zhang, J.; Li, T. Federated Continual Learning via Knowledge Fusion: A Survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 3832–3850.
30. Criado, M.F.; Casado, F.E.; Iglesias, R.; Regueiro, C.V.; Barro, S. Non-IID data and Continual Learning processes in Federated Learning: A long road ahead. Inf. Fusion 2022, 88, 263–280.
31. Yoon, J.; Jeong, W.; Lee, G.; Yang, E.; Hwang, S.J. Federated Continual Learning with Weighted Inter-client Transfer. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Proceedings of Machine Learning Research. PMLR: New York, NY, USA, 2021; Volume 139, pp. 12073–12086.
32. Zuo, X.; Luopan, Y.; Han, R.; Zhang, Q.; Liu, C.H.; Wang, G.; Chen, L.Y. FedViT: Federated continual learning of vision transformer at edge. Future Gener. Comput. Syst. 2024, 154, 1–15.
33. Zhang, P.; Yang, X.; Chen, Z. Neural network gain scheduling design for large envelope curve flight control law. J. Beijing Univ. Aeronaut. Astronaut. 2005, 31, 604–608.
34. Pascanu, R.; Bengio, Y. Revisiting natural gradient for deep networks. arXiv 2013, arXiv:1301.3584.
35. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
36. Chaudhry, A.; Dokania, P.K.; Ajanthan, T.; Torr, P.H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 532–547.
37. Zenke, F.; Poole, B.; Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 3987–3995.
38. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010; Keynote, Invited and Contributed Papers. Springer: Berlin/Heidelberg, Germany, 2010; pp. 177–186.
39. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Toronto, ON, Canada, 8 April 2009; Volume 2. Available online: https://api.semanticscholar.org/CorpusID:18268744 (accessed on 20 July 2025).
40. Caldas, S.; Duddu, S.M.K.; Wu, P.; Li, T.; Konečný, J.; McMahan, H.B.; Smith, V.; Talwalkar, A. Leaf: A benchmark for federated settings. arXiv 2018, arXiv:1812.01097.
41. Smith, V.; Chiang, C.K.; Sanjabi, M.; Talwalkar, A.S. Federated multi-task learning. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
42. Sivaraman, S.; Trivedi, M.M. Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis. IEEE Trans. Intell. Transp. Syst. 2013, 14, 1773–1795.
43. Aslan, Ö.; Zhang, X.; Schuurmans, D. Convex deep learning via normalized kernels. In Proceedings of the Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; Volume 27.
44. Cohen, G.; Afshar, S.; Tapson, J.; Van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2921–2926.
45. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
46. Chen, F.; Luo, M.; Dong, Z.; Li, Z.; He, X. Federated meta-learning with fast convergence and efficient communication. arXiv 2018, arXiv:1802.07876.
47. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
48. Li, X.; Jiang, M.; Zhang, X.; Kamp, M.; Dou, Q. Fedbn: Federated learning on non-iid features via local batch normalization. arXiv 2021, arXiv:2102.07623.
49. Hsu, T.M.H.; Qi, H.; Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv 2019, arXiv:1909.06335.
50. Li, M.; Zhang, T.; Chen, Y.; Smola, A.J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 661–670.

Figure 1. Overall structure of DPAO-PFL. The end-to-end DPAO-PFL process in Personalized Federated Learning comprises the following steps: initialization and transform, parameter decomposition, online importance scoring, local training with DPAO regularization, weighted aggregation, and synchronous updates with peak–stable triggering. Together, these steps illustrate how DPAO-PFL continually balances the protection of highly important global directions against personalization.

Figure 2. Global test accuracy over sufficient communication rounds on FEMNIST, Shakespeare, CIFAR-100, and Vehicle datasets with non-IID settings ($p_{k} \sim \mathrm{Dir}(0.3), \mathrm{Dir}(0.6)$) for baselines FedAvg, FedProx, FedMeta, FedBN, and DPAO-PFL. DPAO-PFL consistently outperforms all baselines and degrades most gracefully as data heterogeneity grows.

Figure 3. Convergence curves of training loss over communication rounds for DPAO-PFL compared with FedAvg, FedProx, FedMeta, and FedBN on the non-IID FEMNIST, Shakespeare, CIFAR-100, and Vehicle datasets. Overall, DPAO-PFL exhibits faster convergence and greater robustness.

Figure 4. Comparison of the average and the highest testing accuracy of DPAO-PFL and five baselines on CIFAR-100 with different distributions. (s, k) represents the degree of heterogeneity, which is adjusted by allocating different numbers of classes to each client. Specifically, s denotes the number of classes assigned to each client, while k represents the sampling size for the clients. As the degree of heterogeneity decreases, the performance of PFL methods improves. DPAO-PFL outperforms all baselines in terms of accuracy and robustness across various distributions.

Figure 5. Communication cost, computational overhead, and parameter stability of the different methods. Compared with the fixed-interval update method, DPAO-PFL significantly reduces the update size and the total number of floating-point operations.

Figure 6. Comparison of the impacts of learning rate η, decay coefficient α, and regularization term constraint λ on the average testing accuracy of DPAO-PFL on the FEMNIST dataset with non-IID (5, 25) setting and E = 5. When λ = 0.01, η = 0.001, and α = 0.85, it achieves the best overall performance in terms of global accuracy, convergence rounds, and personalized improvement. (a) Learning rate η. (b) Decay coefficient α. (c) Regularization term constraint λ.

Table 1. Notation summary.

Symbol	Meaning
$i, N$	Index and total number of clients.
$t$	Communication round index.
$E, B$	Number of local training epochs and local batch size per client.
$w$	Full model parameter vector.
$w_{g}$	Global shared parameter subvector (communicated).
$w_{l}$	Local personalized parameter subvector (retained on client).
$θ^{k}$	Parameters of the $k$-th network layer.
$φ_{k}$	Multiplicative transfer vector for layer $k$.
$Φ$	Collection of all layer transfer vectors $\{φ_{k}\}$.
$L_{i}(w)$	Expected local loss of client $i$ under parameters $w$.
$F_{i}^{(t)}$	Fisher information estimate for client $i$ at round $t$.
$S_{i}^{(t)}$	Path-integral sensitivity score for client $i$ at round $t$.
$λ_{t}$	Regularization strength (forgetting coefficient) at round $t$.
$p, q$	Indicators for “stable” periods and “peak” periods in loss fluctuation.
$Δw_{s}$	Update increment of parameter $w_{s}$ between aggregations.

Table 2. Statistics of datasets.

Dataset	Model	Partition	Devices	Samples	Samples/Device (Mean)	Samples/Device (Standard)
FEMNIST [40]	CNN	Writers	1100	245,337	223	83
Shakespeare [40]	LSTM	Roles	523	1,378,095	2635	2800
CIFAR-100 [39]	ResNet18	Labels	100	60,000	600	300
Vehicle [41]	SVM	Sensors	23	43,698	1899	349

Table 3. Convergence accuracy and number of communication rounds to reach accuracy milestones on FEMNIST (5, 25). Considered separately, APD and DPAO each improve over FedAvg, and the full DPAO-PFL performs best overall.

Methods	Rounds to 60%	Rounds to 80%	Rounds to 90%	Convergence Accuracy
FedAvg	51	139	−	88.73 ± 3.24%
w/o APD	38	81	165	93.37 ± 2.32%
w/o DPAO	35	76	148	95.16 ± 1.39%
w/ interval(1)	45	95	173	96.50 ± 1.71%
w/ interval(5)	53	105	161	95.82 ± 1.89%
w/ interval(20)	62	127	182	95.33 ± 2.16%
DPAO-PFL	26	62	125	98.12 ± 0.58%

Bold fonts indicate better performances.

Table 4. Performance comparison under non-IID CIFAR-100 (100-way, 5 classes/client, 25 samples) and FEMNIST (10-way, 5 classes/client, 25 samples). Increasing E accelerates convergence and improves final accuracy across all methods, but excessive local optimization can reduce incremental benefits and even lead to performance declines.

Methods	$E = 1$	$E = 5$	$E = 10$	$E = 20$
CIFAR-100
FedAvg	24.37%	39.83%	32.51%	27.69%
FedProx	25.19%	41.75%	35.34%	28.13%
FedMeta	33.82%	48.63%	43.12%	36.27%
DPAO-PFL	37.25%	53.24%	47.19%	42.61%
FEMNIST
FedAvg	53.28%	78.73%	83.12%	89.36%
FedProx	54.63%	79.45%	84.05%	90.27%
FedMeta	68.10%	94.39%	97.53%	97.61%
DPAO-PFL	72.50%	98.12%	98.62%	98.59%

Bold fonts indicate better performances.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
