Article

Communication-Efficient Federated Optimization with Gradient Clipping and Attention Aggregation for Data Analytics and Prediction

1 National School of Development, Peking University, Beijing 100871, China
2 China Agricultural University, Beijing 100083, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4778; https://doi.org/10.3390/electronics14234778
Submission received: 3 November 2025 / Revised: 22 November 2025 / Accepted: 3 December 2025 / Published: 4 December 2025
(This article belongs to the Special Issue Machine Learning in Data Analytics and Prediction)

Abstract

To address the challenge of collaborative strategy optimization caused by non-independent and identically distributed data in cross-institutional scenarios, a federated quantitative learning framework integrating Path Attention Aggregation Module (PAAM), Gradient Clipping and Compression (GCC), and a Heterogeneity-Aware Adaptive Optimizer (HAAO) is proposed to achieve efficient return optimization and robust risk control. The framework is validated across multi-market and multi-institutional environments, with experiments covering three key dimensions: return performance, risk management, and communication efficiency. The results demonstrate that the proposed model achieves an annualized return (AR) of 16.57%, representing an approximate 19.7% improvement over the traditional FedAvg model; the Sharpe ratio (SR) increases to 1.25, while the maximum drawdown (MDD) decreases to 15.92% and volatility remains controlled at 8.83%, indicating superior balance between return and risk. In the communication efficiency evaluation, when the number of communication rounds is reduced to 50 and 25, the model maintains accuracy at 94.2% and 92.8%, recall at 93.5% and 91.7%, and precision at 94.8% and 92.3%, respectively. Overall, the proposed framework achieves a dynamic balance between global convergence and risk constraints through path weighting, gradient sparsification, and frequency-domain learning rate adjustment. This research provides a novel and scalable paradigm for distributed financial prediction that ensures both privacy preservation and communication efficiency, demonstrating substantial engineering feasibility and practical applicability in intelligent financial modeling.

1. Introduction

With the rapid advancement of artificial intelligence technologies, deep learning has demonstrated unprecedented potential in quantitative investment modeling within the financial domain [1]. Owing to their powerful nonlinear modeling and feature extraction capabilities, deep neural networks are capable of autonomously learning latent trading signals and price dynamics from complex time-series data, market sentiment, and macroeconomic variables, thereby promoting the intelligent and systematic evolution of quantitative strategies [2]. However, the performance of deep learning models relies heavily on large-scale and diverse training datasets, a dependency that is particularly pronounced in financial applications [3]. Significant heterogeneity exists among financial institutions due to differences in market environments, traded assets, risk-control mechanisms, and regulatory frameworks, leading to disparities in data type, distribution characteristics, and granularity [4]. Furthermore, the high sensitivity of financial data and the increasing emphasis on privacy compliance—such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and various national financial privacy regulations—prevent direct inter-institutional data sharing [5], resulting in the so-called “data island” phenomenon [6,7]. Such isolation not only limits the generalization capacity of models but also poses major obstacles to cross-market strategy optimization and global asset allocation [8].
In conventional quantitative modeling paradigms, researchers and institutions commonly rely on centralized modeling approaches, in which multi-source data are aggregated to a unified platform for feature engineering and model training [9]. Although this centralized paradigm theoretically enables comprehensive utilization of multi-market and multi-asset data, it faces three critical limitations in practice. First, compliance and privacy risks [10]: cross-border financial data transmission may violate legal boundaries, particularly when transaction flows or account-level information are involved, potentially leading to privacy breaches and compliance violations [11]. Second, architectural and standard inconsistencies: variations in database structures, data-cleaning protocols, and temporal synchronization mechanisms across institutions result in costly integration and standardization processes [12]. Third, overfitting and bias propagation: during centralized training, the dominance of data from a single market may bias model learning toward that market’s characteristics, thereby degrading predictive accuracy and strategy robustness in other markets [13]. These issues collectively hinder the large-scale international deployment of centralized quantitative strategies.
To address the data silo problem, federated learning (FL) has emerged as a promising paradigm for collaborative modeling across institutions. Its core principle involves securely aggregating model parameters or gradients without sharing raw data, effectively balancing privacy protection, model generalization, and data utilization efficiency [14,15,16]. FL has shown preliminary success in financial risk control, credit assessment, and fraud detection—for instance, enabling multiple banks to jointly train fraud detection models without exposing sensitive user information [17]. However, most current studies focus on static supervised tasks with fixed labels, which differ fundamentally from high-frequency quantitative trading strategies characterized by dynamic feedback and continuous optimization requirements [18,19]. In trading scenarios, models must balance return-risk trade-offs, account for transaction costs and market impact, and make real-time predictions, posing unresolved challenges for collaborative strategy optimization within an FL framework.
Meanwhile, deep learning models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, gated recurrent units (GRUs), and Transformers have demonstrated strong temporal modeling capabilities in quantitative strategy development, capturing long-term dependencies and global temporal patterns [20,21,22]. However, these models typically assume centralized and identically distributed data, leading to performance degradation in heterogeneous, cross-institutional settings [23]. Variations in market volatility, trading schedules, liquidity, and policy risks across institutions introduce divergent gradient directions and convergence paths, reducing the stability of federated aggregation and overall profitability [24]. Additionally, large model sizes result in excessive communication overhead under limited cross-border bandwidth, creating practical bottlenecks for FL-based quantitative strategies [25]. To address these issues, frameworks such as DPFedBank [26], 2P3FL [27], and privacy-preserving or blockchain-enabled FL architectures [28,29,30] have been proposed, facilitating secure and high-performance collaborative learning in financial and IoT domains.
In summary, traditional centralized quantitative approaches fail to meet the dual requirements of privacy protection and data heterogeneity, while existing FL studies are not directly applicable to dynamic strategy modeling. To overcome these limitations, a privacy-preserving and communication-efficient federated quantitative framework—federated quantitative learning (FQL)—is proposed. The framework enables collaborative strategy modeling across international financial institutions, where encrypted gradient transmission and heterogeneity-aware optimization allow efficient and robust joint training without sharing raw data. Specifically, the main contributions of this study can be summarized as follows:
1. Path attention aggregation module: a path attention weighting mechanism is designed to address gradient stability and profit contribution discrepancies among clients. By analyzing historical convergence trajectories and corresponding strategy performance, aggregation weights are dynamically adjusted to reduce the influence of “noisy clients,” thereby accelerating convergence and stabilizing global profitability.
2. Gradient clipping and compression: a hybrid approach combining top-K sparsification and $\ell_2$ clipping is adopted to constrain and compress gradient uploads, effectively reducing communication overhead under heterogeneous network bandwidth while enhancing stability and robustness during federated optimization.
3. Heterogeneity-aware adaptive optimizer: considering market-specific characteristics such as volatility, trading regulations, and liquidity structures, an adaptive regularization and dynamic learning-rate adjustment mechanism is introduced to enable local market-aware optimization, preventing global models from converging to suboptimal local minima and improving cross-market generalization.

2. Related Work

2.1. Applications of Deep Learning in Quantitative Strategy Modeling

In recent years, deep learning has made significant strides in quantitative finance [31], offering advantages over traditional time-series models such as autoregressive integrated moving average (ARIMA) and generalized auto-regressive conditional heteroscedasticity (GARCH), which, despite effectively capturing linear dependencies and volatility clustering, struggle with the nonlinear and hierarchical dynamics of financial markets [32]. Deep neural networks, including long short-term memory (LSTM) models, have been widely adopted for price trend forecasting and trading signal generation due to their ability to extract complex temporal patterns from price sequences, technical indicators, and macroeconomic variables [33,34]. However, their effectiveness remains limited when trained on single-market data with restricted feature dimensionality [35]. To address this, Transformer-based architectures leveraging self-attention mechanisms have been introduced to model multi-period market dynamics [36,37]. Yet, in global multi-market environments, these models often face challenges related to non-identically distributed (Non-IID) data, feature-space heterogeneity, and overfitting to specific market regimes, limiting their generalization across markets with divergent volatility patterns [38,39]. Moreover, the high parameter count of deep models necessitates large volumes of high-quality data [40], further complicating deployment in data-constrained financial institutions.

2.2. Federated Learning and Financial Data Privacy Protection

To address the inability to centralize cross-institutional data, federated learning (FL) offers a collaborative modeling solution under strict privacy-preserving constraints [41]. Its core principle is that each participating institution (client) independently trains a local model and transmits model parameters or gradients to a central server for secure aggregation, thus updating the global parameters without sharing raw data [42]. In the most common federated averaging (FedAvg) algorithm, the global parameter update rule is defined as:
$\theta_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\,\theta_t^{(k)},$
where $\theta_t^{(k)}$ denotes the model parameters of the k-th client after the t-th round of training, $n_k$ is the local sample size, and $n = \sum_{k} n_k$ represents the total global sample count. Although this weighted averaging enables efficient aggregation, several challenges remain in applying FL to financial domains.
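To make the FedAvg rule concrete, the sample-size-weighted averaging step can be sketched in a few lines of NumPy. The function name and toy values below are illustrative, not taken from the paper.

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg)."""
    n = sum(client_sizes)
    stacked = np.stack(client_params)                  # shape [K, d]
    weights = np.array(client_sizes, dtype=float) / n  # n_k / n
    return weights @ stacked                           # sum_k (n_k / n) * theta_k

# toy example: two clients holding 100 and 300 samples
theta_a = np.array([1.0, 2.0])
theta_b = np.array([3.0, 4.0])
global_theta = fedavg_aggregate([theta_a, theta_b], [100, 300])
```

The client with more data receives proportionally more weight (0.75 vs. 0.25 here), which is exactly the source of the single-market dominance issue discussed in the Introduction.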
Financial data are highly sensitive and regulated, with laws such as the EU’s GDPR and the U.S. GLBA prohibiting cross-border transmission of personal or transactional data [43]. While federated learning (FL) helps prevent direct data exposure, gradients may still reveal local data characteristics through reverse inference, posing privacy risks [44]; thus, secure aggregation and differential privacy are often employed. Moreover, most FL research in finance focuses on static prediction tasks such as credit risk evaluation, fraud detection, or credit scoring [45], which involve fixed targets and stable labels. In contrast, quantitative trading requires dynamic optimization incorporating multi-period returns and risk constraints [46], involving frequent feedback and non-stationary labels, necessitating task-aware parameter updates and attention-based aggregation mechanisms to better adapt to dynamic financial environments.

2.3. Heterogeneous Distribution Modeling and Communication Optimization Mechanisms

In multi-institutional collaborative settings, data distributions often exhibit pronounced heterogeneity [47]. This heterogeneity manifests in feature dimensionality, return structures, and market volatility patterns, resulting in divergent local optima across clients and thus hampering global convergence [48]. To address these challenges, various enhancements have been proposed, including path attention aggregation, local regularization, and gradient compression techniques [49].
First, the path attention mechanism assigns dynamic weights to client gradients based on their contribution to the global optimization objective [50]. If the gradient of client k at round t is represented as $g_t^{(k)}$, the global aggregation can be reformulated as:
$\theta_{t+1} = \sum_{k=1}^{K} \alpha_k^{(t)} g_t^{(k)},$
where $\alpha_k^{(t)}$ denotes the attention weight satisfying $\sum_{k} \alpha_k^{(t)} = 1$. The weight $\alpha_k^{(t)}$ can be dynamically computed based on gradient stability, return correlation, or historical convergence trajectory, often implemented through a soft attention distribution. This mechanism effectively mitigates the influence of low-quality gradients and enhances aggregation efficiency. Second, to reduce cross-border communication overhead, gradient compression and clipping have been widely adopted [51]. Gradient clipping constrains extreme updates that may induce instability, formulated as:
$g_t^{(k)} \leftarrow \frac{g_t^{(k)}}{\max\left(1,\ \|g_t^{(k)}\|_2 / C\right)},$
where C denotes the clipping threshold. Subsequently, top-K sparsification is employed to transmit only the K largest gradient components, significantly reducing communication cost without compromising model fidelity. Finally, heterogeneity-aware optimization focuses on dynamically adjusting local learning rates and regularization terms across markets with distinct characteristics [52]. If the local objective function for client k is defined as $f_k(\theta)$, the optimization process can be formalized as:
$\min_{\theta} \sum_{k=1}^{K} p_k f_k(\theta) + \mu \|\theta - \theta_t^{(k)}\|^2,$
where $p_k$ represents the client weight, and $\mu$ is a heterogeneity regularization coefficient balancing local and global updates. This mechanism alleviates model drift caused by distributional discrepancies and enhances stability and convergence speed in high-frequency trading and cross-market settings.
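The effect of the proximal regularizer can be illustrated with a minimal local update sketch in the FedProx style: the penalty gradient pulls each client back toward the global model, limiting drift. The function name, learning rate, and toy gradient are assumptions for illustration.

```python
import numpy as np

def prox_local_step(theta, theta_global, grad_fk, lr=0.01, mu=0.1):
    """One local gradient step on f_k(theta) plus a proximal penalty.

    The gradient of (mu/2) * ||theta - theta_global||^2 is
    mu * (theta - theta_global); adding it biases local updates
    toward the global model under heterogeneous data.
    """
    grad = grad_fk + mu * (theta - theta_global)
    return theta - lr * grad

theta_g = np.zeros(3)                       # current global model
theta_k = np.array([1.0, -1.0, 0.5])        # drifted local model
g = np.array([0.2, 0.2, 0.2])               # local loss gradient at theta_k
theta_next = prox_local_step(theta_k, theta_g, g, lr=0.1, mu=1.0)
```

With mu = 0 this reduces to plain local SGD; larger mu trades local fit for cross-client consistency.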

3. Materials and Methods

3.1. Data Collection

The dataset was constructed through multi-regional, multi-market, and multi-dimensional financial data collection, encompassing the major capital markets of the United States, Europe, and the Asia-Pacific region, as shown in Table 1. The objective was to establish a heterogeneous financial data system capable of supporting cross-institutional federated quantitative strategy training. Specifically, the U.S. market dataset was derived from the Nasdaq-100 index constituents, covering key technology sector stocks such as AAPL, MSFT, NVDA, and AMZN. The data were obtained from the official Nasdaq database and the Yahoo Finance API, spanning from January 2014 to December 2023 with a daily frequency. The collected attributes included five fundamental indicators—opening price, closing price, highest price, lowest price, and trading volume—along with macroeconomic variables such as the VIX volatility index, the U.S. 10-year Treasury yield, the Consumer Price Index (CPI), and the U.S. Dollar Index (DXY), reflecting overall market risk preferences and macroeconomic cycles.
For the European market, data were extracted from the financial and manufacturing sectors of the Stoxx 600 index, representing major economies including Germany, France, the United Kingdom, and the Netherlands. The data were sourced from Refinitiv and the Euronext exchange database, covering the same temporal span from 2014 to 2023 to ensure synchronization with the U.S. market timeline. In addition to price series, this dataset incorporated the European Central Bank (ECB) policy rate, the Eurozone Consumer Price Index (EU-CPI), and the European sovereign yield curve, providing a detailed depiction of the macro-financial dynamics within the Eurozone. For the Asia-Pacific region, data were collected from major market indices included in the MSCI Asia ETF, such as the Hong Kong Hang Seng Index (HSI), the Taiwan Weighted Index (TWII), and the Singapore Straits Times Index (STI). These data were obtained from the Sina Finance and Yahoo Finance international interfaces, maintaining consistency in sampling frequency and data attributes with the aforementioned markets. To enhance inter-regional comparability, all datasets were aligned using a unified timestamp standard (UTC+0). Missing data were imputed via a time-weighted multiple interpolation method to preserve temporal consistency.
Additionally, macroeconomic factors were supplemented with monthly economic reports from the International Monetary Fund (IMF) and cross-country indicators from the World Bank, including the International Trade Index and Capital Flow metrics, enhancing the model’s sensitivity to macroeconomic shocks. All collected data underwent noise detection and outlier correction to ensure reliability and robustness. The final dataset thus integrated price features, technical indicators, and macro-level variables, forming a comprehensive multi-source financial dataset. It not only exhibits representative coverage in terms of temporal span and regional diversity but also captures the heterogeneity of market structures, trading regimes, and economic cycles, providing a solid empirical foundation for federated modeling and heterogeneity-aware optimization.

3.2. Data Augmentation

In quantitative modeling, logarithmic returns are commonly employed to characterize relative price variations, serving as a substitute for raw price sequences. Compared with simple returns, logarithmic returns possess additive and stationary properties, which better conform to the statistical nature of financial time series. Given an asset closing price $P_t$ at time t, the logarithmic return is defined as:
$r_t = \ln \frac{P_t}{P_{t-1}},$
where $r_t$ represents the unit-period log return at time t. This transformation eliminates scale discrepancies caused by price levels and transforms the return series into a near-zero-mean stationary process, facilitating stable model learning. To address magnitude differences among features, z-score standardization was applied. Given an original feature $x_t$ with mean $\mu$ and standard deviation $\sigma$, the standardized feature $\hat{x}_t$ can be expressed as:
$\hat{x}_t = \frac{x_t - \mu}{\sigma},$
in practice, $\mu$ and $\sigma$ were computed independently for each market-specific subset to preserve local statistical characteristics and prevent cross-market data mixing that could cause model drift. Furthermore, min–max normalization within the $[0, 1]$ interval was performed to enhance inter-market comparability:
$\tilde{x}_t = \frac{x_t - \min(x)}{\max(x) - \min(x)},$
this two-stage normalization preserved regional statistical distinctiveness while improving global numerical stability during training.
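The log-return transform and the two-stage normalization above can be sketched as follows; function names are illustrative, and the statistics are assumed to be computed per market subset, as the text specifies.

```python
import numpy as np

def log_returns(prices):
    """r_t = ln(P_t / P_{t-1})."""
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[1:] / prices[:-1])

def two_stage_normalize(x):
    """Per-market z-score, then min-max scaling into [0, 1]."""
    z = (x - x.mean()) / x.std()
    return (z - z.min()) / (z.max() - z.min())

prices = [100.0, 102.0, 101.0, 105.0]   # toy daily closing prices
r = log_returns(prices)
x = two_stage_normalize(r)
```

Note that min-max scaling after the z-score keeps the relative shape of each market's distribution while bounding all features to a common numerical range.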
Missing values are prevalent in financial data due to trading suspensions, market holidays, or data acquisition delays. Removing such samples can result in information loss, while simple interpolation may introduce bias. To mitigate these issues, multiple imputation was adopted. The approach models the conditional distribution of missing data given the observed subset and samples repeatedly to estimate the expected value. Let $X_{obs}$ and $X_{mis}$ denote observed and missing data, respectively; the posterior distribution $p(X_{mis} \mid X_{obs})$ is modeled, and its expectation is estimated as:
$\hat{X}_{mis} = \mathbb{E}_{p(X_{mis} \mid X_{obs})}[X_{mis}],$
through iterative sampling, this method reconstructs missing segments while maintaining the temporal and statistical continuity of the series.
After preprocessing, two data augmentation mechanisms were incorporated to improve robustness against market noise and regime variations: masked return and Gaussian perturbation. The masked return strategy simulates incomplete information scenarios, enhancing decision-making capability under partial observability. For a given return sequence $\{r_t\}_{t=1}^{T}$, a random segment of length m is selected and replaced with zeros:
$r'_{t+i} = \begin{cases} 0, & i \in [0, m-1], \\ r_{t+i}, & \text{otherwise}, \end{cases}$
this random masking mechanism compels the model to infer trends from surrounding context when certain time steps are missing, akin to the masked prediction strategy in BERT. Gaussian perturbation augmentation was further introduced to simulate market noise and uncertainty in price fluctuations. The augmented return $\tilde{r}_t$ is defined as:
$\tilde{r}_t = r_t + \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, \sigma_n^2),$
where $\sigma_n$ controls the noise variance. Introducing small-scale perturbations during training prevents overfitting to transient volatility patterns and enhances model generalization across markets.
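Both augmentations can be sketched in a few lines of NumPy. The segment length, noise scale, and seed below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_return(r, m, rng):
    """Zero out one random contiguous segment of length m (masked-return)."""
    r = r.copy()
    start = rng.integers(0, len(r) - m + 1)
    r[start:start + m] = 0.0
    return r

def gaussian_perturb(r, sigma_n, rng):
    """Add i.i.d. Gaussian noise eps_t ~ N(0, sigma_n^2) to each return."""
    return r + rng.normal(0.0, sigma_n, size=r.shape)

r = rng.normal(0.0, 0.01, size=20)          # toy return sequence
r_masked = masked_return(r, m=5, rng=rng)
r_noisy = gaussian_perturb(r, sigma_n=0.002, rng=rng)
```

In training, each mini-batch would typically draw fresh masks and noise so the model never sees the same corrupted view twice.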
For temporal sample construction, a sliding-window mechanism was employed to convert the time series into model-ready input structures. Given a window length L and a prediction horizon H, the i-th sample can be formulated as:
$X_i = [r_i, r_{i+1}, \ldots, r_{i+L-1}], \quad y_i = f(r_{i+L:i+L+H-1}),$
where $X_i$ denotes the input window and $y_i$ represents the future H-step prediction target, defined as the average future return or directional signal. A positive $y_i$ indicates an upward prediction, whereas a negative value implies a downward trend. In experiments, $L = 20$ trading days and $H = 5$ days were used, allowing the model to predict five-day trends based on the preceding 20-day window. To avoid information leakage due to overlapping sequences, a partially non-overlapping sampling strategy was adopted, introducing an interval $\delta < L$ between adjacent samples. This approach maintained a sufficient sample size while reducing temporal dependency, ensuring causal consistency between training and forecasting processes in alignment with financial market logic.
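The sliding-window construction with spacing $\delta$ between adjacent samples can be sketched as follows; here the target $f(\cdot)$ is taken as the mean future return, one of the two options mentioned above, and the choice $\delta = 10$ is illustrative.

```python
import numpy as np

def make_windows(r, L=20, H=5, delta=10):
    """Build (X_i, y_i) pairs: an input window of length L and a target
    equal to the mean of the next H returns. Adjacent windows start
    delta < L steps apart (partially non-overlapping sampling)."""
    X, y = [], []
    for i in range(0, len(r) - L - H + 1, delta):
        X.append(r[i:i + L])
        y.append(r[i + L:i + L + H].mean())
    return np.array(X), np.array(y)

r = np.arange(100, dtype=float)    # toy return series for shape-checking
X, y = make_windows(r, L=20, H=5, delta=10)
```

Because each target is computed strictly after its input window ends, the construction preserves the causal ordering required for financial backtesting.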

3.3. Proposed Method

3.3.1. Overall

The overall workflow of the proposed framework starts from the model architecture design and proceeds through a complete execution pipeline that maps dual-head network outputs into tradable multi-asset positions. The preprocessed sequential samples from each participating institution are fed into a local strategy network in a sliding-window manner. A lightweight encoding backbone first performs cross-scale feature extraction through stacked gated recurrent units and self-attention layers. The recurrent units capture short- and mid-term temporal dependencies and localized market patterns, while the self-attention mechanism models long-term inter-asset correlations via the query–key–value interaction process. The extracted representations are then forwarded to a dual-head prediction structure. The direction-and-strength head outputs both directional signals (e.g., $\{-1, 0, +1\}$) and normalized intensity scores to form preliminary trading intentions, while the risk head estimates uncertainty levels and downside-risk metrics that are used to calibrate position sizing and stop-loss constraints. The dual-head outputs are processed through a transparent execution pipeline. The directional label and intensity score jointly determine the raw exposure
$w_t^{\mathrm{raw}} = s_t \cdot \mathrm{sign}(\hat{y}_t),$
where $s_t$ denotes the predicted strength and $\hat{y}_t$ is the predicted direction. The risk estimates then scale this raw position into the final tradable exposure
$w_t = \frac{w_t^{\mathrm{raw}}}{1 + u_t + r_t},$
where $u_t$ and $r_t$ represent the predicted uncertainty and downside risk, respectively. This formulation enforces dynamic leverage adjustment under high volatility or uncertainty. Standard trading constraints—such as maximum leverage, symmetric long–short bounds, and a fixed (e.g., daily) rebalancing frequency—are applied consistently across all experiments. Transaction costs and slippage are incorporated by subtracting a proportional penalty from the executed returns based on turnover, thereby explicitly linking model outputs to realized performance measures such as annualized return, Sharpe ratio, and maximum drawdown.
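The two-step mapping from dual-head outputs to a tradable exposure can be sketched as below. The leverage clip reflects the symmetric long-short bound described above, while the function name and toy inputs are illustrative.

```python
import numpy as np

def position_from_heads(direction, strength, uncertainty, downside_risk,
                        max_leverage=1.0):
    """Map dual-head outputs to a final exposure:

        w_raw = s_t * sign(y_hat_t)
        w     = w_raw / (1 + u_t + r_t), clipped to the leverage bound.

    Higher predicted uncertainty or downside risk shrinks the position.
    """
    w_raw = strength * np.sign(direction)
    w = w_raw / (1.0 + uncertainty + downside_risk)
    return float(np.clip(w, -max_leverage, max_leverage))

# a long signal at 0.8 strength, scaled down by the risk head's estimates
w = position_from_heads(direction=+1, strength=0.8,
                        uncertainty=0.25, downside_risk=0.35)
```

With zero predicted risk the exposure equals the raw intent; as $u_t + r_t$ grows, leverage is reduced smoothly rather than by a hard cutoff.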
During local optimization, a heterogeneity-aware adaptive scheme dynamically adjusts learning rates and regularization strengths according to market volatility and liquidity variations. A global distance penalty is introduced to mitigate model drift and maintain consistency across clients. The gradients obtained through backpropagation are stabilized via norm clipping to suppress abnormal updates, and compressed through top-K sparsification and quantization to reduce communication overhead. The compressed gradient increments are securely transmitted to the central server, which performs path attention aggregation by assigning dynamic weights based on each client’s convergence stability and return correlation. This aggregation emphasizes high-value update paths and suppresses noisy or unbalanced contributions, enabling robust convergence under heterogeneous, partially corrupted, or Non-IID environments.
The federated process iterates across communication rounds until joint validation metrics converge to predefined thresholds. During inference, each institution only needs to download the converged global model, which maps the most recent feature window through the encoding backbone and dual-head structure to produce executable positions in real time. The risk head continuously constrains leverage, exposure, and stop-loss thresholds, supporting cross-market deployment of a privacy-preserving, communication-efficient, and globally optimized quantitative trading strategy.

3.3.2. Path Attention Aggregation Module

The path attention aggregation module (PAAM) aims to achieve optimal global aggregation under cross-market heterogeneity by modeling the dynamic evolution and profitability correlation of client gradient trajectories. Unlike conventional self-attention mechanisms, which focus on dependencies within a single sequence, PAAM operates across multiple clients and gradient paths. While self-attention computes intra-sequence dependencies through the similarity of query (Q), key (K), and value (V) matrices, path attention constructs an inter-path weighted representation based on each client’s gradient evolution features, historical performance, and convergence stability. In essence, self-attention models single-sequence self-dependence, whereas path attention models multi-sequence stability correlation. The key idea is to treat the gradient evolution as a learnable dynamic process, where the attention distribution emphasizes stable and profit-consistent update directions, ensuring that only the most reliable gradient paths dominate the global aggregation process.
As shown in Figure 1, the PAAM module consists of three sub-layers. The input layer has a dimension of $[B, N, 512]$, where B denotes batch size and N the number of clients. The encoder comprises three convolutional layers with $3 \times 3$ kernels and GELU activation, with channel dimensions of $256 \to 128 \to 64$. The output layer is a fully connected (FC) layer followed by a Softmax activation to generate the attention distribution. The convolutional layers capture local statistical patterns within short temporal windows, while the FC layer learns global relationships among paths. The module input is the gradient sequence $\{g_t^{(k)}\}_{t=1}^{T}$ from each client over T rounds, which is encoded into a path embedding $e_k \in \mathbb{R}^{64}$. The attention coefficients are then computed as:
$\alpha_k = \frac{\exp(\beta \cdot \phi(e_k, \bar{e}))}{\sum_j \exp(\beta \cdot \phi(e_j, \bar{e}))},$
where $\phi(\cdot)$ denotes the cosine similarity function, $\bar{e}$ represents the global reference embedding, and $\beta$ is a temperature coefficient controlling the sharpness of the attention distribution. The aggregated global update is expressed as:
$\theta_{t+1} = \sum_{k=1}^{K} \alpha_k g_t^{(k)},$
the mathematical rationale behind this design lies in the monotonic relationship between gradient stability and attention weight. When a client exhibits higher stability and stronger profitability covariance, its $\phi(e_k, \bar{e})$ increases, leading to a larger $\alpha_k$. Assuming the global loss gradient satisfies $\nabla L = \mathbb{E}_k[\alpha_k g^{(k)}]$, and defining a stability constraint function $S(e_k)$ as a monotonically increasing function, it follows that $\partial \alpha_k / \partial S(e_k) > 0$, indicating that greater stability promotes global convergence and robustness under heterogeneous market conditions.
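A simplified sketch of the path attention weighting follows: cosine similarity of each path embedding to a global reference, sharpened by the temperature $\beta$ and normalized with a softmax. Approximating $\bar{e}$ by the mean embedding is an assumption for illustration; the paper's learned convolutional encoder is replaced here by given embedding vectors.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def path_attention_weights(embeddings, beta=2.0):
    """Softmax over beta-scaled cosine similarity to the reference embedding."""
    e_bar = np.mean(embeddings, axis=0)            # global reference (assumed)
    scores = np.array([beta * cosine(e, e_bar) for e in embeddings])
    exp = np.exp(scores - scores.max())            # numerically stable softmax
    return exp / exp.sum()

def paam_aggregate(grads, embeddings, beta=2.0):
    alpha = path_attention_weights(embeddings, beta)
    return alpha, sum(a * g for a, g in zip(alpha, grads))

# three clients: two stable paths and one divergent, large-gradient path
embs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([-1.0, 0.1])]
grads = [np.array([0.5]), np.array([0.4]), np.array([5.0])]
alpha, update = paam_aggregate(grads, embs)
```

The divergent third path receives the smallest weight, so its large gradient contributes little to the aggregate, which is the "noisy client" suppression the module is designed for.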
The PAAM module operates jointly with the heterogeneity-aware adaptive optimizer (HAAO), forming a cooperative “global focusing–local adaptation” mechanism. Specifically, HAAO adaptively tunes learning rates according to local volatility during client-side optimization, while PAAM reinforces global information consistency through gradient path-based weighting. The former stabilizes local updates within each market, whereas the latter aligns the overall convergence direction across regions. When HAAO produces local gradients $g_t^{(k)}$, PAAM automatically adjusts their aggregation weights to ensure $\mathbb{E}[\nabla L_{global}] = \sum_k \alpha_k \nabla L_k$, thereby guaranteeing convergence consistency and profitability-driven optimization. This joint design enables the federated framework to achieve both local adaptability and global stability, yielding superior performance in cross-market quantitative modeling.

3.3.3. Gradient Clipping and Compression

The gradient clipping and compression (GCC) module is designed to address global optimization instability in cross-market federated training, which often arises due to gradient fluctuation, limited communication bandwidth, and asynchronous updates. Unlike conventional gradient transmission methods, the proposed GCC module performs not only numerical constraint and sparsification but also learnable compression and dynamic reconstruction of gradient features through a lightweight neural encoding structure.
As shown in Figure 2, the input to the module is a local model gradient tensor for each client, denoted as G ( k ) R H × W × C , where H = 32 , W = 32 , and  C = 512 represent the dimension of the last-layer feature gradients. The module comprises a three-layer neural transformation network. The first layer is a 1 × 1 convolution ( 512 128 channels) for dimensionality reduction and orthogonalization. The second layer is a 3 × 3 convolution ( 128 64 channels, stride s = 1 , padding = 1 ) followed by a ReLU activation function to capture local gradient correlations. The third layer is a fully connected (FC) layer that outputs an encoded vector z ( k ) R 64 , which is passed to a Softmax normalization layer to produce a weight distribution vector w ( k ) . Prior to compression, the gradients are normalized and centered to ensure that their magnitude lies within [ 1 , 1 ] . The neural compression process is defined as:
$$z^{(k)} = \mathrm{FC}\!\left(\mathrm{ReLU}\!\left(\mathrm{Conv}_{3\times3}\!\left(\mathrm{Conv}_{1\times1}\!\left(G^{(k)}\right)\right)\right)\right),$$
and the Softmax layer generates the gradient importance weights as:
$$w^{(k)} = \frac{\exp\!\left(\gamma z^{(k)}\right)}{\sum_i \exp\!\left(\gamma z_i^{(k)}\right)},$$
where $\gamma$ denotes the temperature coefficient. Only the top $K = 0.05\,HW$ gradient channels with the largest $w^{(k)}$ values are retained for transmission using top-K sparsification, substantially reducing communication overhead.
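As an illustration of this selection rule, the sketch below ranks a flat gradient vector by temperature-softmax importance weights and keeps only the top fraction of entries. The learned convolutional encoder is replaced here by absolute gradient magnitude as the importance score, so the scoring function is an assumption for illustration, not the trained encoder described above.

```python
import math

def topk_sparsify(grad, gamma=1.0, ratio=0.05):
    """Keep only the top-K entries of a flat gradient, ranked by a
    temperature-scaled softmax over importance scores. The score z_j
    is stubbed as |g_j|; the paper uses a learned conv/FC encoder."""
    k = max(1, int(ratio * len(grad)))
    z = [abs(g) for g in grad]
    m = max(z)
    e = [math.exp(gamma * (v - m)) for v in z]  # shifted for numerical stability
    s = sum(e)
    w = [v / s for v in e]                      # softmax importance weights, sum to 1
    # retain the k channels with the largest weights, zero out the rest
    keep = sorted(range(len(grad)), key=lambda j: w[j], reverse=True)[:k]
    out = [0.0] * len(grad)
    for j in keep:
        out[j] = grad[j]
    return out, w

g = [0.9, -0.1, 0.05, -1.2, 0.3, 0.02, 0.7, -0.4, 0.15, 0.01,
     0.6, -0.05, 0.8, 0.2, -0.9, 0.03, 0.4, -0.6, 0.1, 0.5]
sparse, weights = topk_sparsify(g, gamma=2.0, ratio=0.1)
```

With a 10% ratio, 2 of the 20 channels survive and their values are transmitted unchanged, while the remaining channels are dropped entirely.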
Before uploading gradients, each client performs $\ell_2$-norm-based gradient clipping to prevent abnormal fluctuations from biasing global aggregation. Unlike traditional fixed-threshold clipping, the proposed dynamic clipping threshold $C_t$ adapts to the gradient variance across training rounds, expressed as:
$$C_t = \frac{C_0}{1 + \lambda \cdot \dfrac{\mathrm{Var}\!\left(G_t^{(k)}\right)}{\mathbb{E}\!\left[\,|G_t^{(k)}|\,\right] + \epsilon}},$$
where $C_0$ is the baseline threshold, $\lambda$ is the adjustment coefficient, and $\epsilon$ is a small constant for numerical stability. The clipped gradient is defined as:
$$\tilde{G}_t^{(k)} = \frac{G_t^{(k)}}{\sqrt{1 + \left(\|G_t^{(k)}\|_2 / C_t\right)^2}},$$
which introduces nonlinear suppression to preserve major gradient directions while attenuating high-noise components. It can be shown that the combination of compression and clipping preserves unbiased gradient estimation in expectation. Defining the true global gradient as $\nabla L = \mathbb{E}[G^{(k)}]$, the expected value of the modified gradient satisfies:
$$\mathbb{E}\!\left[\tilde{G}_t^{(k)}\right] = \mathbb{E}\!\left[G_t^{(k)}\right] + O\!\left(\sigma^2 / C_t^2\right),$$
where $\sigma^2$ denotes the gradient variance. When $\sigma^2$ is sufficiently small and $C_t$ is sufficiently large, the bias term tends to zero, ensuring alignment with the true global optimization direction. The performance of GCC depends on two critical parameters: the gradient variance $\sigma^2$ and the sparsification ratio $K$. A larger $\sigma^2$ increases the effective clipping strength, thereby stabilizing noisy markets but potentially slowing convergence in calm regimes. Conversely, a smaller $\sigma^2$ leads to weaker clipping and faster optimization but may amplify volatility-induced drift. Similarly, the sparsity level $K$ directly regulates the trade-off between communication efficiency and strategy accuracy: a lower $K$ produces substantial bandwidth savings but may remove informative channels, whereas a higher $K$ preserves more gradient structure at the cost of additional communication. Empirically, we observe that strategy performance remains robust within a moderate range of $K \in [4\%, 8\%]$, while extreme values produce noticeable degradation. These observations highlight the importance of adaptive parameter selection and motivate future research on learning-to-sparsify mechanisms within federated quantitative optimization.
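The variance-adaptive threshold and smooth clipping described above can be sketched in a few lines. The smooth $\ell_2$ form (scaling by $1/\sqrt{1 + (\|G\|_2/C_t)^2}$) and the default coefficients are illustrative assumptions; higher variance tightens the threshold, and the scaling preserves the gradient direction exactly.

```python
import math

def dynamic_clip(grad, c0=1.0, lam=0.5, eps=1e-8):
    """Smooth l2 clipping with a variance-adaptive threshold C_t.
    Assumes the sqrt-smoothed clipping form; c0 and lam mirror
    C_0 and lambda in the text and are hypothetical defaults."""
    n = len(grad)
    mean = sum(grad) / n
    var = sum((g - mean) ** 2 for g in grad) / n
    mean_abs = sum(abs(g) for g in grad) / n
    # higher variance relative to mean magnitude -> tighter threshold
    c_t = c0 / (1.0 + lam * var / (mean_abs + eps))
    norm = math.sqrt(sum(g * g for g in grad))
    # smooth suppression: ~1 for small norms, ~C_t/norm for large norms
    scale = 1.0 / math.sqrt(1.0 + (norm / c_t) ** 2)
    return [g * scale for g in grad], c_t

clipped, c_t = dynamic_clip([3.0, -4.0])  # raw l2 norm is 5
```

After clipping, the gradient's norm falls below the adaptive threshold while the ratio between its components (and hence its direction) is unchanged.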
Within the federated framework, the GCC module operates in conjunction with the path attention aggregation module (PAAM). PAAM dynamically assigns aggregation weights based on gradient path stability and profit correlation at the server side, whereas GCC constructs structured, low-redundancy gradient inputs on the client side, enhancing the stability of attention weight computation and reducing noise sensitivity. When combined, the global update is represented as:
$$\theta_{t+1} = \theta_t - \eta \sum_{k=1}^{K} \alpha_k \tilde{G}_t^{(k)},$$
where $\alpha_k$ denotes the attention coefficient computed by PAAM. Since $\tilde{G}_t^{(k)}$ has already undergone sparsity regularization that emphasizes dominant features, the joint optimization of $\alpha_k$ and $\tilde{G}_t^{(k)}$ minimizes the expected squared deviation from the ideal global gradient $G^*$:
$$\min_{\alpha_k} \; \mathbb{E}\left[\Big\|\sum_k \alpha_k \big(\tilde{G}_t^{(k)} - G^*\big)\Big\|^2\right].$$
This joint mechanism provides two major advantages for the proposed task: first, it mitigates the gradient-explosion-induced oscillations common in high-frequency trading scenarios, improving convergence smoothness; second, it substantially reduces communication cost under asymmetric cross-border bandwidth, enabling efficient and stable federated optimization of quantitative trading strategies.

3.3.4. Heterogeneity-Aware Adaptive Optimizer

The heterogeneity-aware adaptive optimizer (HAAO) addresses instability in federated quantitative strategy training caused by market structural divergence, heterogeneous volatility characteristics, and Non-IID data distributions across clients. Its core principle lies in dynamically adjusting the learning rate and regularization strength during each local optimization stage while incorporating frequency-domain gating and adaptive weighting mechanisms to achieve self-regulated optimization according to market-specific volatility patterns and feature distributions. This design prevents the global model from converging to suboptimal local minima or experiencing directional drift.
Financial time series exhibit well-documented multi-scale stochastic behaviors such as volatility clustering, structural breaks, and regime switching, which induce distinct frequency-dependent characteristics in gradient trajectories during model training. High-frequency components generally correspond to transient shocks and noise-driven fluctuations, whereas low-frequency components reflect persistent market trends and slow-moving structural signals. Consequently, client-side gradient noise is not white but possesses heterogeneous spectral density across markets. Motivated by these properties, HAAO employs a fast Fourier transform (FFT) to decompose raw gradients into their spectral components, enabling the optimizer to distinguish informative structural gradients from noise-dominated components. By gating update magnitudes according to spectral energy distributions, HAAO performs frequency-aware learning rate adaptation that is theoretically aligned with the heterogeneous gradient noise model frequently observed in financial optimization.
As shown in Figure 3, the input is a client gradient sequence $G_t^{(k)} \in \mathbb{R}^{32 \times 32 \times 512}$. An FFT transformation yields the spectral representation
$$F^{(k)} = \mathcal{F}\!\left(G_t^{(k)}\right),$$
which captures both low-frequency structural information and high-frequency volatility-induced perturbations. The spectral tensor is then processed through three convolutional layers with kernel sizes $3 \times 3$, $5 \times 5$, and $1 \times 1$, and channel dimensions $512 \to 256 \to 64$. The SiLU activation function maintains smoothness while preserving low-frequency continuity. The extracted frequency-domain features are mapped by a linear layer to produce an adaptive gain vector $\eta^{(k)} \in \mathbb{R}^{64}$, which is normalized via a Softmax layer to generate the learning rate adjustment factors $\lambda^{(k)}$.
The final optimization update is given by
$$\theta_{t+1}^{(k)} = \theta_t^{(k)} - \lambda^{(k)} \eta^{(k)} \cdot \frac{G_t^{(k)}}{\sqrt{v_t^{(k)}} + \epsilon},$$
where $v_t^{(k)}$ denotes the exponential moving average of squared gradients for numerical stability. Through this mechanism, each client adaptively scales its update magnitude according to the estimated volatility spectrum, suppressing aggressive updates in high-volatility regimes while accelerating convergence in smoother markets.
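The frequency-gated update can be illustrated for a single scalar parameter. In this sketch, a naive DFT stands in for the FFT, and the learned convolutional/linear gain network is replaced by a simple low-frequency energy ratio — both simplifying assumptions: a gradient history dominated by high-frequency (noise-like) oscillation receives a damped effective step, while a trend-like history takes a full step.

```python
import cmath
import math

def haao_update(theta, grad_history, base_lr=0.01, eps=1e-8):
    """One HAAO-style scalar update (illustrative sketch).
    Gate the learning rate by the share of low-frequency spectral
    energy in the recent gradient trajectory."""
    n = len(grad_history)
    # naive DFT of the gradient history (FFT in the actual module)
    spectrum = [sum(grad_history[t] * cmath.exp(-2j * math.pi * f * t / n)
                    for t in range(n)) for f in range(n)]
    energy = [abs(c) ** 2 for c in spectrum]
    low = sum(energy[: n // 4 + 1])           # lowest quarter of band: "structural" signal
    gate = low / (sum(energy) + eps)          # in [0, 1]: noisy history -> smaller step
    v = sum(g * g for g in grad_history) / n  # second-moment estimate, as v_t above
    return theta - base_lr * gate * grad_history[-1] / (math.sqrt(v) + eps)

smooth = haao_update(1.0, [1.0, 1.0, 1.0, 1.0])   # trend-like gradients: full step
noisy = haao_update(1.0, [1.0, -1.0, 1.0, -1.0])  # oscillating gradients: suppressed step
```

For the trend-like history the gate is near 1 and the parameter moves by roughly the base learning rate; for the purely oscillating history all spectral energy sits at the highest frequency, the gate collapses to zero, and the update is suppressed entirely.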
Mathematically, defining the local objective of client $k$ as $f_k(\theta)$, HAAO minimizes the following global expectation:
$$\min_{\theta} \sum_{k=1}^{K} p_k f_k(\theta) + \mu \left\|\theta - \bar{\theta}_t\right\|^2,$$
where $p_k$ is the sample proportion, $\mu$ is the heterogeneity regularization coefficient, and $\bar{\theta}_t$ is the global parameter mean. The inclusion of the $\mu$ term constrains local updates to remain aligned with the global trend, yielding a bounded gradient variance:
$$\mathrm{Var}\!\left[\nabla_{\theta} f_k(\theta)\right] \le \frac{1}{\mu} \left\|\theta - \bar{\theta}_t\right\|^2.$$
By combining spectral analysis with heterogeneity regularization, HAAO effectively controls the gradient variance growth arising from market divergence and enhances the robustness and generalization of the global model. In the overall framework, HAAO works synergistically with the path attention aggregation module (PAAM): the former provides volatility- and frequency-aware client-side adaptation, while the latter ensures profit-consistent and stability-aware global aggregation on the server side.
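A single client-side step on the regularized objective above can be written directly from its gradient: the proximal term $\mu\|\theta - \bar\theta_t\|^2$ contributes $2\mu(\theta - \bar\theta_t)$ and pulls the client back toward the global mean. The scalar form and coefficient values below are illustrative assumptions.

```python
def prox_local_step(theta, theta_bar, grad, lr=0.01, mu=0.1):
    """One gradient step on f_k(theta) + mu * ||theta - theta_bar||^2
    for a scalar parameter; grad is the gradient of f_k at theta.
    The 2*mu*(theta - theta_bar) term anchors the client to the
    global parameter mean, limiting Non-IID drift."""
    return theta - lr * (grad + 2.0 * mu * (theta - theta_bar))

# with zero local gradient, the prox term alone contracts toward theta_bar
pulled = prox_local_step(1.0, 0.0, grad=0.0, lr=0.1, mu=0.5)
stepped = prox_local_step(1.0, 0.0, grad=0.5, lr=0.1, mu=0.5)
```

Even with no local gradient signal, the update moves the parameter toward the global mean, which is the mechanism that bounds inter-client divergence.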

3.4. Privacy Analysis

The proposed framework provides inherent privacy protection through its gradient processing and aggregation mechanisms. In this subsection, we present a theoretical privacy analysis that characterizes how gradient clipping, sparsification, quantization, and path attention aggregation reduce the exposure of client-level information in federated learning.
We consider an honest-but-curious adversary who observes the transmitted local gradient updates and attempts to reconstruct private client data through gradient-based inversion attacks. Let $g_i$ denote the true gradient of client $i$, and let $\tilde{g}_i$ be the transmitted gradient after processing by our GCC module and aggregation by PAAM.
The GCC module applies norm clipping, top-K sparsification, and $b$-bit quantization:
$$\tilde{g}_i = Q_b\!\left(S_K\!\left(\frac{g_i}{\max\!\left(\|g_i\|_2,\, C\right)}\right)\right),$$
where $C$ is the clipping threshold, $S_K(\cdot)$ keeps only the top-$K$ magnitudes, and $Q_b(\cdot)$ denotes uniform $b$-bit quantization. These operations reduce the mutual information between the true gradient and the transmitted update.
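The clip–sparsify–quantize pipeline can be sketched end to end as follows. The symmetric quantization grid with step $2^{-(b-1)}$ is an illustrative choice — the text does not fix a particular grid — and the literal clipping form $g / \max(\|g\|_2, C)$ is used as written above.

```python
import math

def gcc_privacy_transform(g, C=1.0, k=2, bits=4):
    """Clip, top-K sparsify, then uniformly quantize a gradient
    vector, mirroring the pipeline g~ = Q_b(S_K(g / max(||g||_2, C))).
    The quantizer grid is a hypothetical symmetric choice."""
    norm = math.sqrt(sum(x * x for x in g))
    scaled = [x / max(norm, C) for x in g]      # norm clipping: output norm <= 1
    order = sorted(range(len(g)), key=lambda j: abs(scaled[j]), reverse=True)
    keep = set(order[:k])
    sparse = [scaled[j] if j in keep else 0.0 for j in range(len(g))]
    step = 1.0 / (2 ** (bits - 1))              # b-bit symmetric quantization step
    return [round(x / step) * step for x in sparse]

q = gcc_privacy_transform([3.0, 0.1, -4.0, 0.2])
```

Only the two dominant components survive, and each is snapped to a coarse grid — both effects shrink the information an inversion attack can exploit, as quantified by Theorem 1.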
Theorem 1.
Let $I(g_i; \tilde{g}_i)$ denote the mutual information between the true and transmitted gradients. Under GCC processing, we have
$$I(g_i; \tilde{g}_i) \le I\!\left(g_i; S_K(g_i)\right) \le K \log 2^{b},$$
where $K \ll d$ is the sparsity level and $d$ is the gradient dimension. Consequently, the information leakage decreases at least linearly with the sparsification ratio $K/d$ and is further bounded by the quantization bit-width $b$.
Proof. 
Clipping normalizes large gradient magnitudes and eliminates directional information beyond the threshold $C$, which constrains the Lipschitz continuity of the gradient mapping and reduces reconstructability. Top-K sparsification removes $(d - K)$ components entirely, enforcing the upper bound $I(g_i; S_K(g_i)) \le K\,H(g_i)$ by information monotonicity. Quantization restricts each surviving component to one of $2^b$ discrete levels, yielding $I\!\left(S_K(g_i); Q_b(S_K(g_i))\right) \le Kb$. Combining these results yields the stated bound.    □
The server aggregates updates as
$$\bar{g} = \sum_{i=1}^{N} \alpha_i \tilde{g}_i,$$
where the attention weights $\alpha_i$ depend on convergence stability and return correlation. Since $\alpha_i < 1$ for all clients and $\sum_i \alpha_i = 1$, the influence of any single client’s update is strictly diluted.
Theorem 2.
If $\alpha_i \le \alpha_{\max} < 1$, then the adversary’s ability to isolate client $i$’s gradient from the aggregated update is bounded by
$$I(g_i; \bar{g}) \le \alpha_{\max}^2 \, I(g_i; \tilde{g}_i),$$
which yields a strict decrease in information leakage compared to transmitting $\tilde{g}_i$ directly.
Proof. 
The result follows from strong data-processing inequalities for linear mixtures of random variables and the contractive property of convex combinations. The aggregation mapping attenuates the client-specific signal by at least a factor of $\alpha_{\max}$ in norm and by $\alpha_{\max}^2$ in mutual information.    □
The combined effect of gradient clipping, sparsification, quantization, and weighted aggregation yields a privacy-amplification mechanism that significantly limits the useful information an adversary can extract from the transmitted gradients. Although differential privacy and secure aggregation are not activated in our experiments, our framework is fully compatible with these mechanisms and can incorporate them when applied to confidential financial datasets.

4. Results and Discussion

4.1. Experimental Setup

4.1.1. Environment and Baseline Models

In the experimental phase of this study, model training was conducted on a high-performance computing server. Each node was equipped with an NVIDIA A100 GPU with 40 GB of memory, enabling parallel training of large-scale deep learning models and efficient high-dimensional gradient computation. The CPU configuration consisted of dual Intel Xeon Gold processors operating at 2.6 GHz with 512 GB of RAM. On the software side, experiments were performed on the Ubuntu 20.04 operating system using Python 3.9 as the primary programming language. The deep learning framework adopted PyTorch 2.0, which supports dynamic computation graphs and GPU-accelerated training. The federated learning setup was implemented based on PySyft 0.6, providing secure gradient communication and encrypted model aggregation. Data preprocessing and analysis were carried out using the NumPy 1.24.4, Pandas 1.5.3, and Scikit-learn 1.2.2 libraries. Experimental reproducibility was ensured by fixing the random seed to maintain consistent initialization and data partitioning. GPU computations were further accelerated by CUDA 11.8 and cuDNN 8 throughout the training process.
Regarding hyperparameter configurations, the dataset was first partitioned in a time-respecting manner into training, validation, and test sets in a 7:1:2 ratio, and a 5-fold cross-validation scheme was applied on the training subset to evaluate model robustness. To further mitigate look-ahead bias and overfitting, we adopt a walk-forward evaluation protocol in which the model is re-trained only on past data and rolled forward over successive out-of-sample segments, with purging and an embargo period applied around each test window to eliminate temporal leakage and overlapping label influence. The specific training parameters included a learning rate of $\alpha = 0.001$, batch size $B = 64$, local training epochs $E = 5$, global aggregation rounds $R = 100$, and hidden layer dimension $H = 128$ for both LSTM- and Transformer-based architectures, with each hidden layer containing 256 neurons. To prevent overfitting, weight decay was set to $\lambda = 1 \times 10^{-4}$ and a dropout probability of $p = 0.2$ was applied during training. Furthermore, a top-K sparsification ratio of $K = 0.1$ was employed during gradient uploads to reduce communication overhead. All hyperparameters were fine-tuned under cross-market and heterogeneous data environments to achieve optimal performance in terms of return prediction accuracy, risk control stability, and convergence speed, and all experiments were repeated with multiple random seeds to ensure the stability of the reported results.
Four representative baseline methods were selected for comparative evaluation, including FedAvg [53], FedProx [54], FedOpt [55], and Centralized [56].
Implementation details of the model and experiments are provided in Appendix A.

4.1.2. Evaluation Metrics

In this study, model performance was evaluated using five key metrics: annualized return (AR), Sharpe ratio (SR), maximum drawdown (MDD), volatility ($\sigma_r$), and communication cost (CC). The annualized return quantifies the long-term average profitability of an investment strategy, while the Sharpe ratio measures excess return per unit of risk. The maximum drawdown assesses the largest potential capital loss over the trading horizon, and volatility characterizes the degree of fluctuation in the return sequence, serving as a proxy for risk exposure. The communication cost evaluates the data transmission burden during federated training, reflecting the trade-off between optimization efficiency and communication overhead. The mathematical formulations of these metrics are expressed as follows:
$$AR = \left[\prod_{t=1}^{T} (1 + r_t)\right]^{\frac{252}{T}} - 1,$$
$$SR = \frac{\mathbb{E}[r_t - r_f]}{\sigma_r},$$
$$MDD = \max_{t \in [0, T]} \frac{\max_{s \in [0, t]} P_s - P_t}{\max_{s \in [0, t]} P_s},$$
$$\sigma_r = \sqrt{\frac{1}{T - 1} \sum_{t=1}^{T} (r_t - \bar{r})^2},$$
$$CC = \text{Communication Rounds} \times \text{Average Gradient Size}.$$
In the above expressions, $r_t$ denotes the return on day $t$, $r_f$ represents the risk-free rate, $\sigma_r$ is the standard deviation of returns, $P_t$ indicates the asset or strategy net value at time $t$, $\bar{r}$ is the mean return, $T$ represents the total number of trading days, and $CC$ denotes the communication cost, capturing both the number of communication rounds and the average size of uploaded gradients during the federated optimization process.
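The four return-based metrics follow directly from these definitions; a compact implementation (assuming daily returns and 252 trading days per year, as in the AR formula) is:

```python
import math

def performance_metrics(returns, prices, rf=0.0):
    """AR, SR, MDD and volatility per the definitions above.
    `returns` are daily returns r_t; `prices` is the net-value path P_t."""
    T = len(returns)
    growth = 1.0
    for r in returns:
        growth *= (1.0 + r)
    ar = growth ** (252.0 / T) - 1.0                     # annualized return
    r_bar = sum(returns) / T
    vol = math.sqrt(sum((r - r_bar) ** 2 for r in returns) / (T - 1)) if T > 1 else 0.0
    sr = (r_bar - rf) / vol if vol > 0 else 0.0          # Sharpe ratio
    peak, mdd = prices[0], 0.0                           # running-peak drawdown
    for p in prices:
        peak = max(peak, p)
        mdd = max(mdd, (peak - p) / peak)
    return ar, sr, mdd, vol

ar, sr, mdd, vol = performance_metrics(
    [0.01, -0.02, 0.015], [100.0, 101.0, 98.98, 100.4647])
```

For this toy path the worst drawdown is the 2% slide from the peak of 101 to 98.98; AR annualizes the three-day compounded growth, and SR divides the mean daily excess return by the sample volatility.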

4.2. Overall Comparison with Baselines

The objective of this experiment is to evaluate, from a global perspective, the comprehensive advantages of the proposed federated quantitative strategy framework in terms of profitability, risk control, and communication efficiency. Through systematic comparison with the representative baseline models, the experiment aims to assess the effectiveness of the three core modules—path attention aggregation (PAAM), gradient clipping and compression (GCC), and the heterogeneity-aware adaptive optimizer (HAAO)—under multi-market and Non-IID (non-independent and identically distributed) conditions. The evaluation involves five key dimensions: annualized return, Sharpe ratio, maximum drawdown, volatility, and communication cost. These metrics jointly measure both the financial performance of the trading strategy and the computational and communication overhead during federated optimization. This design enables a comprehensive depiction of model stability and adaptability in complex financial environments, thereby validating the proposed framework’s potential for achieving an optimal balance between profitability and robustness.
As presented in Table 2, Figure 4 and Figure 5, FedAvg and FedProx show moderate improvement under heterogeneous data conditions, benefiting from parameter averaging and local regularization; however, they remain prone to aggregation noise and update drift. FedOpt accelerates convergence through adaptive learning rates and momentum terms, yet suffers from gradient amplification in high-dimensional and volatile datasets. In contrast, the proposed framework consistently achieves the best results across all evaluation dimensions. The path attention mechanism effectively emphasizes update directions from clients with high return correlation, the gradient clipping and compression module mathematically reduces gradient variance and communication complexity, while the heterogeneity-aware adaptive optimizer balances volatility disparities across markets via frequency-domain learning rate adjustment. Collectively, these components form a synergistic optimization mechanism that is theoretically equivalent to incorporating a stability constraint term into the global objective function, ensuring rapid convergence while minimizing risk exposure. Consequently, the proposed method surpasses traditional federated optimization algorithms in both profitability and robustness.

4.3. Ablation Study

The purpose of this experiment is to verify the independent and collaborative contributions of the three core modules—PAAM, GCC, and HAAO—within the overall framework. By progressively introducing each module, the study quantifies their individual and combined effects on profitability improvement, risk control, and communication efficiency. This experimental design elucidates the underlying mechanisms driving performance enhancement and provides quantitative evidence of the framework’s structural necessity and coherence.
As shown in Table 3, the baseline FedAvg model demonstrates limited capability, indicating that simple parameter averaging fails to capture cross-market profitability relationships. Introducing PAAM significantly improves both annualized return and Sharpe ratio, reflecting its ability to enhance global aggregation consistency and mitigate interference from volatile clients. When only GCC is applied, the communication cost decreases substantially while maintaining stability, evidencing that gradient sparsification and dynamic clipping effectively suppress redundant noise. The inclusion of HAAO further enhances the profit–risk equilibrium, confirming the optimizer’s effectiveness in adapting to heterogeneous market dynamics; moreover, the HAAO (without FFT) variant yields slightly inferior AR and SR compared with full HAAO, which empirically verifies the marginal contribution of the FFT-based spectral routing mechanism. When two modules are combined, performance further improves, indicating mathematical complementarity among them: PAAM minimizes inter-client gradient variance through weighted aggregation, GCC suppresses abnormal gradients via norm constraints, and HAAO adjusts local update steps adaptively to form smoother optimization trajectories in the gradient space. Ultimately, the full model integrating all three modules achieves optimal performance across profitability, risk, and communication dimensions. Its theoretical advantage lies in simultaneously achieving unbiased gradient expectation and minimized variance constraints, leading to stable convergence and efficient profit optimization under complex Non-IID financial conditions.

4.4. Privacy Evaluation

To complement the theoretical analysis in the privacy section, we conduct a controlled gradient-leakage experiment to empirically assess the privacy benefits introduced by clipping, sparsification, quantization, and attention aggregation. Following standard gradient inversion procedures such as GradInversion [57], an adversarial server attempts to reconstruct client-side input features from the transmitted gradients obtained during a single local update. We evaluate several configurations, including unprotected raw gradients, clipping, top-K sparsification, 8-bit quantization, the combined GCC module, and PAAM attention aggregation. Reconstruction quality is measured using MSE, SSIM, and the correlation between the original and reconstructed features.
Table 4 reports the reconstruction fidelity under different configurations. As shown, raw gradients allow the attacker to partially recover the underlying features, while clipping, sparsification, and quantization all substantially reduce the effectiveness of the inversion. The combined GCC module further degrades reconstruction quality, and PAAM introduces additional attenuation by mixing gradients across clients. Overall, these empirical results validate our theoretical claims: the GCC module and PAAM attention effectively attenuate the mutual information between gradients and raw data. This demonstrates that the proposed framework offers robust privacy protection in realistic federated learning scenarios.

4.5. Communication Efficiency Comparison

This experiment aims to evaluate the efficiency and robustness of the proposed federated optimization framework under communication constraints. Communication cost represents a major bottleneck in the deployment of federated learning within cross-institutional financial environments, where bandwidth heterogeneity, latency fluctuation, and asynchronous updates often necessitate a large number of communication rounds for traditional algorithms to achieve convergence. By comparing different models under identical communication budgets in terms of profitability, risk control, and resource consumption, this experiment investigates the performance variation as the number of communication rounds or gradient transmission volume is reduced. The goal of this design is not only to assess the performance superiority of the proposed framework but also to examine its robustness and convergence stability in low-communication regimes, thereby verifying the contribution of the gradient clipping and compression (GCC) and path attention aggregation (PAAM) mechanisms to communication efficiency.
As shown in Table 5 and Figure 6, traditional methods such as FedAvg, FedProx, and FedOpt require significantly larger gradient transmission volumes to achieve comparable drawdown and volatility reduction, resulting in lower communication efficiency. Although FedOpt improves local convergence through momentum accumulation, its performance gain remains limited under high communication overhead. FedProx demonstrates certain robustness in heterogeneous settings but fails to maintain stability under restricted bandwidth. In contrast, the proposed framework achieves the highest annualized return and Sharpe ratio while maintaining the lowest communication volume, and even when the number of communication rounds is reduced to 50 or 25, it sustains strong profitability and risk control. This indicates superior communication–performance trade-off capability. Theoretically, this advantage arises from the GCC module’s ability to sparsify redundant information, allowing the transmitted gradients to remain statistically unbiased. Meanwhile, the PAAM mechanism applies reward-correlated weighting during aggregation, effectively suppressing the influence of uninformative gradients. The heterogeneity-aware adaptive optimizer (HAAO) further stabilizes inter-market gradient alignment by adaptively adjusting local learning rates and update magnitudes. The joint effect of these three mechanisms ensures that the expected global convergence remains close to the optimal solution under communication constraints, thereby guaranteeing profitability maximization and convergence stability even with low communication complexity.

4.6. Discussion

The proposed federated quantitative strategy framework demonstrates strong applicability and practical value in various real-world financial scenarios. In collaborative strategy development among multinational banks, securities firms, and asset management institutions, data privacy, regulatory compliance, and network isolation often prevent direct sharing of trading behavior or market characteristic data. The proposed approach enables distributed model training through federated aggregation, allowing institutions to jointly learn optimal trading strategies without exposing raw data—achieving a “data remains local, model is shared” paradigm. For example, in cross-border foreign exchange and futures strategy modeling, financial institutions across the United States, Europe, and the Asia-Pacific region can train on local high-frequency trading data and perform dynamic path optimization on the server via PAAM, enhancing cross-market consistency and robustness of trading decisions.
Furthermore, the framework exhibits notable advantages in risk control and asset allocation tasks within quantitative investment institutions. The GCC module mitigates the influence of extreme market fluctuations by suppressing abnormal gradient amplification, allowing the model to maintain stable decision-making during black swan events or high-volatility periods. Meanwhile, the HAAO module adjusts the learning rate according to market volatility and liquidity, enabling rapid adaptation in high-frequency trading, commodity futures, and cryptocurrency markets. In multi-asset hedge fund scenarios, the system can concurrently operate local models on multiple market branches, capture diversified profit signals, and achieve balanced optimization between return and risk at the global level. Moreover, the improved communication efficiency facilitates model deployment on edge nodes, such as regional exchanges or local research terminals, making it suitable for low-latency and bandwidth-constrained environments. Overall, by integrating multi-source financial heterogeneity with federated optimization, the framework establishes a privacy-preserving, stable, and communication-efficient learning paradigm for intelligent strategy collaboration across institutions.

4.7. Limitation and Future Work

Although the proposed framework achieves promising results in federated quantitative strategy modeling, several limitations remain to be addressed. First, the current validation focuses on historical trading data across the U.S., European, and Asia-Pacific markets. While this provides multi-regional coverage, the model’s dynamic stability under extreme events or sudden macroeconomic shocks has not been thoroughly evaluated. For instance, structural changes in inter-market correlations during geopolitical conflicts or financial crises may challenge the consistency assumption of the PAAM module. Second, while the GCC module substantially improves communication efficiency, aggregation delay may accumulate under severely constrained bandwidth or high network latency. In addition, the adaptive learning rate mechanism of HAAO depends on accurate estimation of market volatility spectra, which may be limited when data streams are sparse or noise-distorted. Future work will enhance the spatiotemporal adaptability of the framework by integrating multimodal financial data—such as news sentiment, macroeconomic indicators, and high-frequency order flow—to develop a more interpretable multi-source federated learning system. In addition, we plan to design dedicated stress-testing protocols based on well-known crisis windows and synthetic shock injection, and to conduct regime-specific backtests that isolate high-volatility episodes using tail-risk and extreme-value metrics, so as to systematically assess the behavior of the framework under rare but impactful market events. Further, prototype deployment in real cross-institutional environments will be explored to assess feasibility and economic benefits in risk forecasting, portfolio management, and hedging applications. 
On the theoretical front, convergence bounds and stability constraints for asynchronous federated optimization will be investigated to support robust and transparent intelligent decision-making in high-volatility, low-trust financial ecosystems.

5. Conclusions

This study addresses the problem of collaborative optimization for quantitative strategies under cross-institutional and heterogeneous data environments by proposing a federated quantitative learning framework that integrates path attention aggregation, gradient clipping and compression, and a heterogeneity-aware adaptive optimizer. The research motivation arises from the non-independent and identically distributed characteristics of financial markets and the constraints of data privacy, which make traditional centralized optimization approaches inadequate for simultaneously improving returns and controlling risks. The proposed framework constructs a path attention aggregation module (PAAM) to dynamically weight high-return-related communication paths, designs a gradient clipping and compression module (GCC) to reduce communication costs and stabilize gradient distribution, and incorporates a heterogeneity-aware adaptive optimizer (HAAO) to adaptively adjust learning rates across markets in the frequency domain, forming an optimization system that achieves both communication efficiency and convergence robustness. Experimental results demonstrate that the proposed model achieves an annualized return (AR) of 16.57%, an improvement of approximately 19.7% over the baseline models, increases the Sharpe ratio (SR) to 1.25, and reduces the maximum drawdown (MDD) to 15.92%, achieving optimal performance in balancing return and risk. Moreover, when the communication rounds are reduced to 50 or 25, the model maintains high accuracy and stability, confirming its superior trade-off between communication efficiency and predictive performance. Further ablation studies indicate that PAAM significantly enhances aggregation direction consistency, GCC effectively reduces gradient variance and communication redundancy, and HAAO improves convergence speed and robustness under heterogeneous market conditions.
Overall, this work innovatively integrates attention-based aggregation, gradient sparsification, and frequency-domain adaptive optimization, achieving efficient return enhancement and risk control from both mathematical and engineering perspectives. The proposed framework provides a scalable and practically feasible paradigm for distributed intelligent decision-making in financial systems.

Author Contributions

Conceptualization, S.T., L.Z., S.X. and M.L.; Data curation, X.Z. and X.G.; Formal analysis, P.H.; Funding acquisition, M.L.; Investigation, P.H. and X.G.; Methodology, S.T., L.Z. and S.X.; Project administration, M.L.; Resources, X.Z. and X.G.; Software, S.T., L.Z. and S.X.; Supervision, M.L.; Validation, P.H.; Visualization, X.Z.; Writing—original draft, S.T., L.Z., S.X., X.Z., P.H., X.G. and M.L.; S.T., L.Z. and S.X. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China grant number 61202479.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Reproducibility and Implementation Details

To enhance the reproducibility of our results and enable complete regeneration of Table 2, Table 3 and Table 4 and Figure 4 and Figure 5, we provide detailed descriptions of the full experimental pipeline, including data preprocessing, temporal splits, federated optimization, trading execution rules, and evaluation metrics. The corresponding scripts, configuration files, and anonymized training logs will be released in a supplementary repository upon acceptance.

Appendix A.1. Data Acquisition and Preprocessing

Data sources. We use publicly available daily OHLCV data from Yahoo Finance, Refinitiv, and Sina Finance. All assets are aligned to a unified trading calendar.
Feature construction. For each asset, we compute return-based and technical indicators including normalized log-returns, moving averages, realized volatility, rolling volume features, and cross-asset correlation signals. Missing values are imputed using forward fill and temporal interpolation.
Sliding window generation. A fixed-length window of T = 32 days is used. For each time step t, the model receives the input
X_t = [x_{t−T+1}, …, x_t],
and predicts the next-day trading signal. Overlapping windows are used without information leakage.
Algorithm A1 Data Processing Pipeline
1:  Load raw OHLCV data from all assets.
2:  Align calendars; remove non-trading days.
3:  Compute engineered features (returns, volatility, correlations).
4:  Normalize features using rolling statistics.
5:  for each asset i do
6:      for each day t do
7:          Construct sliding window X_t^{(i)} of length T.
8:      end for
9:  end for
10: Output training-ready dataset.
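The windowing step above (lines 5–9 of Algorithm A1) can be sketched as follows. The helper `build_windows` is an illustrative function, not part of the released code: it pairs each length-T window ending at day t with the day t + 1 target index, so an overlapping window never contains information from the day it predicts.

```python
import numpy as np

def build_windows(features: np.ndarray, T: int = 32):
    """Construct overlapping sliding windows X_t = [x_{t-T+1}, ..., x_t].

    `features` is a (num_days, num_features) array of already-normalized
    engineered features for one asset. Each window ending at day t is paired
    with the next-day target index t + 1, so no future information leaks
    into X_t.
    """
    windows, targets = [], []
    for t in range(T - 1, len(features) - 1):
        windows.append(features[t - T + 1 : t + 1])  # length-T history up to day t
        targets.append(t + 1)                        # predict the next trading day
    return np.stack(windows), np.array(targets)

# Toy example: 40 days, 3 features -> 8 valid overlapping windows
X, y_idx = build_windows(np.random.randn(40, 3), T=32)
print(X.shape)  # (8, 32, 3)
```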

Appendix A.2. Temporal Splits and Random Seeds

To avoid look-ahead bias, all data are split strictly by calendar time:
  • Training set: 70% of earliest observations
  • Validation set: next 10%
  • Test set (OOS): final 20%
A 5-fold walk-forward validation is additionally applied within the training period. All experiments are repeated with five random seeds {0, 1, 2, 3, 4}; tables report the mean ± standard deviation across seeds.
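A minimal sketch of the split logic, assuming simple proportional index splits (the function names are illustrative, not from the released scripts):

```python
import numpy as np

def temporal_split(n_days: int):
    """Split day indices strictly by calendar order: 70% train, 10% val, 20% test."""
    train_end = int(n_days * 0.70)
    val_end = int(n_days * 0.80)
    return (np.arange(0, train_end),
            np.arange(train_end, val_end),
            np.arange(val_end, n_days))

def walk_forward_folds(train_idx: np.ndarray, n_folds: int = 5):
    """Yield (fit, evaluate) index pairs: each fold fits on all earlier data
    and evaluates on the next contiguous chunk, so evaluation never precedes fitting."""
    chunks = np.array_split(train_idx, n_folds + 1)
    for k in range(1, n_folds + 1):
        yield np.concatenate(chunks[:k]), chunks[k]

train, val, test = temporal_split(1000)
assert train[-1] < val[0] < test[0]          # no look-ahead across splits
for fit, ev in walk_forward_folds(train):
    assert fit[-1] < ev[0]                   # each fold trains only on the past
```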

Appendix A.3. Federated Optimization Procedure

The full federated learning workflow is summarized below.
Algorithm A2 Federated Training with PAAM, GCC, and HAAO
1:  Initialize global model parameters θ_0.
2:  for round r = 1 to R do
3:      for each client k in parallel do
4:          Receive global parameters θ_r.
5:          Perform local training for E epochs using HAAO.
6:          Compute gradient g_r^{(k)}.
7:          Apply gradient clipping and top-K sparsification.
8:          Quantize and upload compressed gradient ĝ_r^{(k)}.
9:      end for
10:     Server aggregates updates using PAAM:
            θ_{r+1} = θ_r − Σ_k α_k ĝ_r^{(k)},
        where the weights α_k depend on path attention scores.
11: end for
12: Return final global model θ*.
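The client-side compression (steps 7–8) and the server-side aggregation (step 10) of Algorithm A2 can be sketched as follows. This is a simplified illustration, not the released implementation: the per-client attention scores are taken as given inputs here, whereas in the full framework they are produced by the path attention mechanism.

```python
import numpy as np

def clip_and_sparsify(grad: np.ndarray, clip_norm: float = 1.0,
                      k_frac: float = 0.1, n_bits: int = 8) -> np.ndarray:
    """GCC-style client-side compression (illustrative): L2-clip the gradient,
    keep only the top-K entries by magnitude, and uniformly quantize them."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:                       # gradient clipping
        grad = grad * (clip_norm / norm)
    k = max(1, int(k_frac * grad.size))
    idx = np.argsort(np.abs(grad))[-k:]        # top-K sparsification
    sparse = np.zeros_like(grad)
    scale = float(np.abs(grad[idx]).max()) or 1.0
    levels = 2 ** (n_bits - 1) - 1
    sparse[idx] = np.round(grad[idx] / scale * levels) / levels * scale  # quantize
    return sparse

def paam_aggregate(theta: np.ndarray, client_grads, scores: np.ndarray,
                   lr: float = 0.1) -> np.ndarray:
    """Server-side attention-weighted aggregation: a softmax over per-client
    path scores yields weights alpha_k, then the global update follows
    theta <- theta - lr * sum_k alpha_k * g_hat_k."""
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    update = sum(a * g for a, g in zip(alpha, client_grads))
    return theta - lr * update
```

Note that the learning rate `lr` and the softmax over scores are assumptions for the sketch; Algorithm A2 leaves the exact mapping from attention scores to α_k unspecified.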

Appendix A.4. Trading Execution Rules

All trading results follow a unified execution protocol to ensure reproducibility:
  • Signal timing: Model outputs at the end of day t;
  • Execution price: Orders executed at the next-day opening price ( t + 1 );
  • Transaction cost: 8 bps per transaction (round trip);
  • Slippage: 5 bps per trade;
  • Rebalancing: Daily;
  • Leverage constraint: |w_t| ≤ 1 per asset;
  • Risk head outputs determine stop-loss levels and position sizing.
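Under these rules, net strategy returns can be computed roughly as below. The helper is a hypothetical sketch: the exact cost accounting (charging the 8 bps round-trip cost plus 5 bps slippage against daily turnover) is an assumption, and the risk-head stop-loss logic is omitted.

```python
import numpy as np

def net_strategy_returns(weights: np.ndarray, open_returns: np.ndarray,
                         cost_bps: float = 8.0, slip_bps: float = 5.0) -> np.ndarray:
    """Daily net returns under the execution protocol: signals from day t are
    traded at the day t+1 open, so weights are lagged one day; costs are
    charged on turnover (illustrative accounting, not the paper's exact code)."""
    w = np.clip(weights, -1.0, 1.0)            # per-asset leverage constraint |w_t| <= 1
    w_lag = np.roll(w, 1, axis=0)
    w_lag[0] = 0.0                             # flat before the first signal
    gross = (w_lag * open_returns).sum(axis=1)
    # turnover: absolute position change each day, starting from a flat book
    turnover = np.abs(np.diff(np.vstack([np.zeros(w.shape[1]), w]), axis=0)).sum(axis=1)
    costs = turnover * (cost_bps + slip_bps) / 1e4
    return gross - costs
```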

Appendix A.5. Metrics and Logging

We compute the following metrics for every experiment:
  • Annualized Return (AR)
  • Sharpe Ratio (SR)
  • Maximum Drawdown (MDD)
  • Volatility of returns (σ_r)
  • Communication cost (CC, in MB)
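The first four metrics follow conventional definitions; a minimal sketch, assuming 252 trading days per year and a zero risk-free rate (both assumptions of this illustration):

```python
import numpy as np

def performance_metrics(daily_returns, periods: int = 252) -> dict:
    """Compute AR, SR, MDD, and annualized volatility from daily net returns."""
    r = np.asarray(daily_returns, dtype=float)
    equity = np.cumprod(1.0 + r)                              # compounded equity curve
    ar = equity[-1] ** (periods / len(r)) - 1.0               # annualized return
    vol = r.std(ddof=1) * np.sqrt(periods)                    # annualized volatility
    sr = (r.mean() * periods) / vol if vol > 0 else 0.0       # Sharpe ratio (rf = 0)
    peak = np.maximum.accumulate(equity)
    mdd = ((peak - equity) / peak).max()                      # maximum drawdown
    return {"AR": ar, "SR": sr, "MDD": mdd, "sigma_r": vol}
```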
Training logs stored for reproducibility include:
  • Per-round loss curves and gradient norms
  • Aggregation weights from PAAM
  • HAAO learning-rate adaptation trajectories
  • Communication volume per client
All artefacts (hyperparameter configurations, preprocessed datasets, temporal index files, and logs) will be shared at https://github.com/Aurelius-04/FQL.git upon acceptance (accessed on 2 December 2025).

References

  1. Sahu, S.K.; Mokhade, A.; Bokde, N.D. An overview of machine learning, deep learning, and reinforcement learning-based techniques in quantitative finance: Recent progress and challenges. Appl. Sci. 2023, 13, 1956. [Google Scholar] [CrossRef]
  2. Jabeen, A.; Yasir, M.; Ansari, Y.; Yasmin, S.; Moon, J.; Rho, S. An empirical study of macroeconomic factors and stock returns in the context of economic uncertainty news sentiment using machine learning. Complexity 2022, 2022, 4646733. [Google Scholar] [CrossRef]
  3. Khunger, A. DEEP Learning for financial stress testing: A data-driven approach to risk management. Int. J. Innov. Stud. 2022, 6, 45–58. [Google Scholar] [CrossRef]
  4. Chun, D.; Kim, D. State heterogeneity analysis of financial volatility using high-frequency financial data. J. Time Ser. Anal. 2022, 43, 105–124. [Google Scholar] [CrossRef]
  5. Li, Q.; Ren, J.; Zhang, Y.; Song, C.; Liao, Y.; Zhang, Y. Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; pp. 1–6. [Google Scholar]
  6. Wang, L.; Wang, Y. Supply chain financial service management system based on block chain IoT data sharing and edge computing. Alex. Eng. J. 2022, 61, 147–158. [Google Scholar] [CrossRef]
  7. Li, Q.; Zhang, Y.; Ren, J.; Li, Q.; Zhang, Y. You Can Use But Cannot Recognize: Preserving Visual Privacy in Deep Neural Networks. arXiv 2024, arXiv:2404.04098. [Google Scholar] [CrossRef]
  8. DeLise, T. On the Generalization of Machine Learning Models in Finance: Five Essays on Bridging the Empirical Gap. Ph.D. Thesis, Université de Montréal, Montréal, QC, Canada, 2025. [Google Scholar]
  9. Pour Rostami, L. Centralized Finance and Decentralized Finance. Ph.D. Thesis, University of South Carolina, Columbia, SC, USA, 2024. [Google Scholar]
  10. Li, Q.; Zhang, Y. Confidential Federated Learning for Heterogeneous Platforms against Client-Side Privacy Leakages. In Proceedings of the ACM Turing Award Celebration Conference-China 2024, Changsha, China, 5–7 July 2024; pp. 239–241. [Google Scholar]
  11. Şahin, Y.; Dogru, I. An enterprise data privacy governance model: Security-centric multi-model data anonymization. Int. J. Eng. Res. Dev. 2023, 15, 574–583. [Google Scholar] [CrossRef]
  12. Ionescu, S.A.; Diaconita, V.; Radu, A.O. Engineering Sustainable Data Architectures for Modern Financial Institutions. Electronics 2025, 14, 1650. [Google Scholar] [CrossRef]
  13. Aliferis, C.; Simon, G. Overfitting, underfitting and general model overconfidence and under-performance pitfalls and best practices in machine learning and AI. In Artificial Intelligence and Machine Learning in Health Care and Medical Sciences: Best Practices and Pitfalls; Springer: Cham, Switzerland, 2024; pp. 477–524. [Google Scholar]
  14. Nevrataki, T.; Iliadou, A.; Ntolkeras, G.; Sfakianakis, I.; Lazaridis, L.; Maraslidis, G.; Asimopoulos, N.; Fragulis, G.F. A survey on federated learning applications in healthcare, finance, and data privacy/data security. AIP Conf. Proc. 2023, 2909, 120015. [Google Scholar] [CrossRef]
  15. Tadi, S.R.C.C.T. Context-Aware Federated Learning for Regulatory Risk Assessment in Financial Applications. IJSAT-Int. J. Sci. Technol. 2024, 15, 112–125. [Google Scholar]
  16. Oguntibeju, O.O. Mitigating artificial intelligence bias in financial systems: A comparative analysis of debiasing techniques. Asian J. Res. Comput. Sci. 2024, 17, 165–178. [Google Scholar] [CrossRef]
  17. Li, Y.; Wen, G. Research and Practice of Financial Credit Risk Management Based on Federated Learning. Eng. Lett. 2023, 31, 271. [Google Scholar]
  18. Bagheri, M. Optimizing Quantitative Trading: An Experimental Study of DQN Trading Strategies and Utility Functions. Ph.D. Thesis, Tilburg University, Tilburg, The Netherlands, 2024. [Google Scholar]
  19. Le, Q.; Diao, E.; Wang, X.; Khan, A.F.; Tarokh, V.; Ding, J.; Anwar, A. Dynamicfl: Federated learning with dynamic communication resource allocation. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 998–1008. [Google Scholar]
  20. Mao, J.; Akhtar, J.; Zhang, X.; Sun, L.; Guan, S.; Li, X.; Chen, G.; Liu, J.; Jeon, H.N.; Kim, M.S.; et al. Comprehensive strategies of machine-learning-based quantitative structure-activity relationship models. iScience 2021, 24, 103052. [Google Scholar] [CrossRef]
  21. Chavhan, S.; Raj, P.; Raj, P.; Dutta, A.K.; Rodrigues, J.J. Deep learning approaches for stock price prediction: A comparative study of LSTM, RNN, and GRU models. In Proceedings of the 2024 9th International Conference on Smart and Sustainable Technologies (SpliTech), Bol and Split, Croatia, 25–28 June 2024; pp. 1–6. [Google Scholar]
  22. Zhang, Q.; Qin, C.; Zhang, Y.; Bao, F.; Zhang, C.; Liu, P. Transformer-based attention network for stock movement prediction. Expert Syst. Appl. 2022, 202, 117239. [Google Scholar] [CrossRef]
  23. Zhang, G.; Guo, W.; Xiong, X.; Guan, Z. A hybrid approach combining data envelopment analysis and recurrent neural network for predicting the efficiency of research institutions. Expert Syst. Appl. 2024, 238, 122150. [Google Scholar] [CrossRef]
  24. Zhang, Y.; Goel, D.; Ahmad, H.; Szabo, C. RegimeFolio: A Regime Aware ML System for Sectoral Portfolio Optimization in Dynamic Markets. IEEE Access 2025, 13, 184722–184744. [Google Scholar] [CrossRef]
  25. Ji, Y.; Chen, L. FedQNN: A computation–communication-efficient federated learning framework for IoT with low-bitwidth neural network quantization. IEEE Internet Things J. 2022, 10, 2494–2507. [Google Scholar] [CrossRef]
  26. He, P.; Lin, C.; Montoya, I. DPFedBank: Crafting a Privacy-Preserving Federated Learning Framework for Financial Institutions with Policy Pillars. arXiv 2024, arXiv:2410.13753. [Google Scholar]
  27. Dasari, S.; Kaluri, R. 2p3fl: A novel approach for privacy preserving in financial sectors using flower federated learning. Comput. Model. Eng. Sci. 2024, 140, 2035–2051. [Google Scholar] [CrossRef]
  28. Yin, L.; Feng, J.; Xun, H.; Sun, Z.; Cheng, X. A privacy-preserving federated learning for multiparty data sharing in social IoTs. IEEE Trans. Netw. Sci. Eng. 2021, 8, 2706–2718. [Google Scholar] [CrossRef]
  29. Singh, S.; Rathore, S.; Alfarraj, O.; Tolba, A.; Yoon, B. A framework for privacy-preservation of IoT healthcare data using Federated Learning and blockchain technology. Future Gener. Comput. Syst. 2022, 129, 380–388. [Google Scholar] [CrossRef]
  30. Varma, S.C.G.; Chaudhari, B. Federated Learning in Financial Data Privacy: A Secure AI Framework for Banking Applications. Int. J. Emerg. Trends Comput. Sci. Inf. Technol. 2025, 9, 101–110. [Google Scholar]
  31. Manogna, R.; Anand, A. A bibliometric analysis on the application of deep learning in finance: Status, development and future directions. Kybernetes 2024, 53, 5951–5971. [Google Scholar] [CrossRef]
  32. Han, H.; Liu, Z.; Barrios Barrios, M.; Li, J.; Zeng, Z.; Sarhan, N.; Awwad, E.M. Time series forecasting model for non-stationary series pattern extraction using deep learning and GARCH modeling. J. Cloud Comput. 2024, 13, 2. [Google Scholar] [CrossRef]
  33. Dong, Y.; Hao, Y. A stock prediction method based on multidimensional and multilevel feature dynamic fusion. Electronics 2024, 13, 4111. [Google Scholar] [CrossRef]
  34. Shen, J.; Shafiq, M.O. Short-term stock market price trend prediction using a comprehensive deep learning system. J. Big Data 2020, 7, 66. [Google Scholar] [CrossRef]
  35. Chhajer, P.; Shah, M.; Kshirsagar, A. The applications of artificial neural networks, support vector machines, and long–short term memory for stock market prediction. Decis. Anal. J. 2022, 2, 100015. [Google Scholar] [CrossRef]
  36. Xu, C.; Li, J.; Feng, B.; Lu, B. A financial time-series prediction model based on multiplex attention and linear transformer structure. Appl. Sci. 2023, 13, 5175. [Google Scholar] [CrossRef]
  37. Lezmi, E.; Xu, J. Time series forecasting with transformer models and application to asset management. J. Financ. Data Sci. 2023, 7, 55–72. [Google Scholar] [CrossRef]
  38. Zeng, Z.; Kaur, R.; Siddagangappa, S.; Rahimi, S.; Balch, T.; Veloso, M. Financial time series forecasting using cnn and transformer. arXiv 2023, arXiv:2304.04912. [Google Scholar] [CrossRef]
  39. Kabir, M.R.; Bhadra, D.; Ridoy, M.; Milanova, M. LSTM–transformer-based robust hybrid deep learning model for financial time series forecasting. Sci 2025, 7, 7. [Google Scholar] [CrossRef]
  40. Chen, A.; Wei, Y.; Le, H.; Zhang, Y. Learning by teaching with ChatGPT: The effect of teachable ChatGPT agent on programming education. Br. J. Educ. Technol. 2024, Online Version of Record. [Google Scholar] [CrossRef]
  41. Dash, B.; Sharma, P.; Ali, A. Federated learning for privacy-preserving: A review of PII data analysis in Fintech. Int. J. Softw. Eng. Appl. (IJSEA) 2022, 13, 4. [Google Scholar] [CrossRef]
  42. Abadi, A.; Doyle, B.; Gini, F.; Guinamard, K.; Murakonda, S.K.; Liddell, J.; Mellor, P.; Murdoch, S.J.; Naseri, M.; Page, H.; et al. Starlit: Privacy-preserving federated learning to enhance financial fraud detection. arXiv 2024, arXiv:2401.10765. [Google Scholar] [CrossRef]
  43. Wilson, J.M. Cross-Border Data Transfers: A Balancing Act through Federal Law. Bus. Entrep. Tax Law Rev. 2022, 6, 150–179. [Google Scholar]
  44. Wang, J.; Guo, S.; Xie, X.; Qi, H. Protect privacy from gradient leakage attack in federated learning. In Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications, London, UK, 2–5 May 2022; pp. 580–589. [Google Scholar]
  45. Pingulkar, S.; Pawade, D. Federated Learning Architectures for Credit Risk Assessment: A Comparative Analysis of Vertical, Horizontal, and Transfer Learning Approaches. In Proceedings of the 2024 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS), Pune, India, 17–19 October 2024; pp. 1–7. [Google Scholar]
  46. Wang, J.; Zhuang, Z.; Feng, L. Intelligent optimization based multi-factor deep learning stock selection model and quantitative trading strategy. Mathematics 2022, 10, 566. [Google Scholar] [CrossRef]
  47. Gupta, S.; Kumar, S.; Chang, K.; Lu, C.; Singh, P.; Kalpathy-Cramer, J. Collaborative privacy-preserving approaches for distributed deep learning using multi-institutional data. RadioGraphics 2023, 43, e220107. [Google Scholar] [CrossRef] [PubMed]
  48. Duan, R.; Ning, Y.; Chen, Y. Heterogeneity-aware and communication-efficient distributed statistical inference. Biometrika 2022, 109, 67–83. [Google Scholar] [CrossRef]
  49. Zhang, L.; Zhang, Y.; Ma, X. A new strategy for tuning ReLUs: Self-adaptive linear units (SALUs). In Proceedings of the ICMLCA 2021; 2nd International Conference on Machine Learning and Computer Application, Shenyang, China, 17–19 December 2021; pp. 1–8. [Google Scholar]
  50. Wang, Y.; Liang, X. Application of Reinforcement Learning Methods Combining Graph Neural Networks and Self-Attention Mechanisms in Supply Chain Route Optimization. Sensors 2025, 25, 955. [Google Scholar] [CrossRef]
  51. van der Spek, M.; van Rooijen, A.; Bouma, H. Secure sparse gradient aggregation with various computer-vision techniques for cross-border document authentication and other security applications. In Proceedings of the Artificial Intelligence for Security and Defence Applications II, Edinburgh, UK, 16–20 September 2024; Volume 13206, pp. 121–134. [Google Scholar]
  52. Wu, X.; Wu, Y.; Li, X.; Ye, Z.; Gu, X.; Wu, Z.; Yang, Y. Application of adaptive machine learning systems in heterogeneous data environments. Glob. Acad. Front. 2024, 2, 37–50. [Google Scholar]
  53. Sun, T.; Li, D.; Wang, B. Decentralized federated averaging. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4289–4301. [Google Scholar] [CrossRef] [PubMed]
  54. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
  55. Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive federated optimization. arXiv 2020, arXiv:2003.00295. [Google Scholar]
  56. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  57. Yin, H.; Mallya, A.; Vahdat, A.; Alvarez, J.M.; Kautz, J.; Molchanov, P. See through Gradients: Image Batch Recovery via GradInversion. arXiv 2021, arXiv:2104.07586. [Google Scholar] [CrossRef]
Figure 1. Overview of the path attention aggregation module (PAAM).
Figure 2. Illustration of the gradient clipping and compression (GCC) module.
Figure 3. Architecture of the heterogeneity-aware adaptive optimizer (HAAO).
Figure 4. Overall performance comparison between the proposed method and baseline models.
Figure 5. Trade-off curve between communication rounds and financial performance (AR and MDD).
Figure 6. Communication efficiency comparison under different communication budgets.
Table 1. Composition and sample statistics of datasets across markets.
Data Source | Market Region | Sample Size | Data Frequency
Nasdaq-100 Constituents | United States (US) | 252,000 | Daily
Stoxx 600 Financial and Manufacturing Sectors | Europe (EU) | 198,000 | Daily
MSCI Asia ETF (HSI, TWII, STI) | Asia-Pacific (APAC) | 176,400 | Daily
Macroeconomic Indicators (VIX, CPI, Rate, FX) | Global | 36,500 | Daily/Monthly
IMF and World Bank Cross-country Data | Global Multi-region | 12,000 | Monthly
Total | | 674,900 |
Table 2. Overall performance comparison (mean ± std) between the proposed method and baseline models. “*” and “**” indicate statistical significance of the performance difference relative to the Proposed method, based on Welch’s t-test ( p < 0.05 and p < 0.01 , respectively).
Method | AR (%) | SR | MDD (%) | σ_r (%) | CC (MB)
Centralized [56] | 15.26 ± 0.34 | 1.12 ± 0.05 | 17.58 ± 0.42 | 9.73 ± 0.28 | —
FedAvg [53] | 13.84 ± 0.41 ** | 0.97 ± 0.04 ** | 19.45 ± 0.53 * | 10.26 ± 0.31 ** | 145.2 ± 3.8 **
FedProx [54] | 14.12 ± 0.38 ** | 1.03 ± 0.05 * | 18.36 ± 0.49 * | 9.94 ± 0.26 ** | 146.8 ± 4.1 **
FedOpt [55] | 14.86 ± 0.36 * | 1.08 ± 0.04 * | 17.84 ± 0.47 * | 9.68 ± 0.24 * | 149.3 ± 4.4 **
Proposed (PAAM + GCC + HAAO) | 16.57 ± 0.32 | 1.25 ± 0.05 | 15.92 ± 0.39 | 8.83 ± 0.22 | 121.7 ± 3.1
Table 3. Ablation study (mean ± std). “*” and “**” indicate statistically significant differences relative to the Full Model, determined using Welch’s t-test ( p < 0.05 and p < 0.01 , respectively).
Configuration | AR (%) | SR | MDD (%) | σ_r (%) | CC (MB)
Baseline (FedAvg) | 13.84 ± 0.41 ** | 0.97 ± 0.04 ** | 19.45 ± 0.53 ** | 10.26 ± 0.31 ** | 145.2 ± 3.8 **
PAAM only | 15.22 ± 0.37 ** | 1.12 ± 0.05 ** | 17.66 ± 0.48 ** | 9.84 ± 0.27 ** | 141.7 ± 3.6 **
GCC only | 14.96 ± 0.39 ** | 1.05 ± 0.04 ** | 18.21 ± 0.51 ** | 9.92 ± 0.28 ** | 127.3 ± 3.3 **
HAAO only | 15.08 ± 0.36 ** | 1.09 ± 0.05 ** | 17.88 ± 0.47 ** | 9.68 ± 0.26 ** | 145.6 ± 3.7 **
HAAO (without FFT) | 14.92 ± 0.38 ** | 1.04 ± 0.04 ** | 18.03 ± 0.49 ** | 9.81 ± 0.27 ** | 145.5 ± 3.7 **
PAAM + GCC | 15.93 ± 0.35 * | 1.17 ± 0.05 * | 16.84 ± 0.45 * | 9.22 ± 0.24 * | 125.5 ± 3.2 *
PAAM + HAAO | 16.01 ± 0.34 * | 1.20 ± 0.05 * | 16.41 ± 0.43 * | 9.10 ± 0.23 * | 134.2 ± 3.4 *
GCC + HAAO | 15.87 ± 0.33 * | 1.18 ± 0.05 * | 16.52 ± 0.44 * | 9.27 ± 0.24 * | 126.0 ± 3.3 *
Full Model (PAAM + GCC + HAAO) | 16.57 ± 0.32 | 1.25 ± 0.05 | 15.92 ± 0.39 | 8.83 ± 0.22 | 121.7 ± 3.1
Table 4. Reconstruction fidelity under gradient inversion attacks (lower is better). “*” and “**” represent statistical significance relative to Raw Gradients (Welch’s t-test; p < 0.05 , p < 0.01 ).
Configuration | MSE ↓ | SSIM ↓ | Corr. ↓
Raw Gradients (Unprotected) | 0.124 | 0.71 | 0.68
Clipping only | 0.202 * | 0.53 * | 0.41 *
Top-K | 0.317 ** | 0.38 ** | 0.29 **
Quantization | 0.288 ** | 0.42 ** | 0.32 **
GCC (Full) | 0.426 ** | 0.27 ** | 0.18 **
PAAM Attention | 0.367 ** | 0.33 ** | 0.22 **
Table 5. Communication efficiency comparison (mean ± std) under different communication budgets. “*” and “**” represent statistical significance relative to the Full Model (100 rounds) (Welch’s t-test; p < 0.05 and p < 0.01 ).
Method | Rounds (R) | Avg Grad. Size (MB) | Total CC (MB) | AR (%) | SR | MDD (%)
FedAvg [53] | 100 | 1.45 ± 0.02 | 145.2 ± 3.8 | 13.84 ± 0.41 ** | 0.97 ± 0.04 ** | 19.45 ± 0.53 **
FedProx [54] | 100 | 1.47 ± 0.02 | 146.8 ± 4.1 | 14.12 ± 0.38 ** | 1.03 ± 0.05 ** | 18.36 ± 0.49 **
FedOpt [55] | 100 | 1.49 ± 0.02 | 149.3 ± 4.4 | 14.86 ± 0.36 ** | 1.08 ± 0.04 ** | 17.84 ± 0.47 **
Proposed (PAAM + GCC + HAAO) | 100 | 1.22 ± 0.02 | 121.7 ± 3.1 | 16.57 ± 0.32 | 1.25 ± 0.05 | 15.92 ± 0.39
Proposed (50 Rounds) | 50 | 1.22 ± 0.02 | 60.8 ± 1.7 | 15.88 ± 0.35 * | 1.19 ± 0.05 * | 16.83 ± 0.44 *
Proposed (25 Rounds) | 25 | 1.22 ± 0.02 | 30.4 ± 1.2 | 14.91 ± 0.37 ** | 1.08 ± 0.05 ** | 17.95 ± 0.46 **
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tang, S.; Zhang, L.; Xu, S.; Zeng, X.; Hu, P.; Gong, X.; Li, M. Communication-Efficient Federated Optimization with Gradient Clipping and Attention Aggregation for Data Analytics and Prediction. Electronics 2025, 14, 4778. https://doi.org/10.3390/electronics14234778
