Article

FedRegNAS: Regime-Aware Federated Neural Architecture Search for Privacy-Preserving Stock Price Forecasting

1 School of Business, Suzhou University of Science and Technology, Suzhou 215000, China
2 School of Art, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(24), 4902; https://doi.org/10.3390/electronics14244902
Submission received: 11 November 2025 / Revised: 8 December 2025 / Accepted: 9 December 2025 / Published: 12 December 2025
(This article belongs to the Special Issue Security and Privacy in Distributed Machine Learning)

Abstract

Financial time series are heterogeneous, nonstationary, and dispersed across institutions that cannot share raw data. While federated learning enables collaborative modeling under privacy constraints, fixed architectures struggle to accommodate cross-market drift and device-resource diversity; conversely, existing neural architecture search techniques presume centralized data and typically ignore communication, latency, and privacy budgets. This paper introduces FedRegNAS, a regime-aware federated NAS framework that jointly optimizes forecasting accuracy, communication cost, and on-device latency under user-level ( ε , δ ) -differential privacy. FedRegNAS trains a shared temporal supernet composed of candidate operators (dilated temporal convolutions, gated recurrent units, and attention blocks) with regime-conditioned gating and lightweight market-aware personalization. Clients perform differentiable architecture updates locally via Gumbel-Softmax and mirror descent; the server aggregates architecture distributions through Dirichlet barycenters with participation-weighted trust, while model weights are combined by adaptive, staleness-robust federated averaging. A risk-sensitive objective emphasizes downside errors and integrates transaction-cost-aware profit terms. We further inject calibrated noise into architecture gradients to decouple privacy leakage from weight updates and schedule search-to-train phases to reduce communication. Across three real-world equity datasets, FedRegNAS improves directional accuracy by 3–7 percentage points and Sharpe ratio by 18–32%. Ablations highlight the importance of regime gating and barycentric aggregation, and analyses outline convergence of the architecture mirror-descent under standard smoothness assumptions. FedRegNAS yields adaptive, privacy-aware architectures that translate into materially better trading-relevant forecasts without centralizing data.

1. Introduction

Financial time series are noisy, heavy-tailed, and regime-dependent. The most informative records are siloed across institutions that cannot share raw data because of regulation and competitive sensitivity. Federated learning (FL) offers a way to collaborate without centralizing data [1,2], yet fixed neural architectures often underperform under cross-market heterogeneity, device-resource diversity, and asynchronous participation. Neural architecture search (NAS) can adapt models through differentiable relaxations [3] and parameter-sharing or stochastic variants [4,5,6], but most NAS assumes centralized data and ignores communication, latency, and user-level privacy. Differential privacy (DP) provides principled protection [7,8], although naively adding noise to all channels can severely hurt accuracy, especially with non-IID clients and partial participation. In financial prediction, modern sequence models—temporal convolutions [9], attention-based forecasters [10], probabilistic RNNs [11], and microstructure-aware CNNs [12]—must also handle latent regime shifts [13] and data-dependent gating effects [14]. This work targets the intersection of these challenges for stock-return forecasting, aligning with the special-issue theme of privacy-preserving, resource-aware learning in finance.
In real FL deployments, institutions differ in universe coverage, liquidity regimes, feature engineering, and hardware capabilities. This heterogeneity leads to several issues: (i) architecture mismatch, where a single backbone cannot serve all clients well; (ii) staleness and partial participation, which bias aggregated updates and increase variance; (iii) tight resource envelopes, where communication budgets and on-device latency caps must be respected; and (iv) privacy constraints, where user-level DP is required not only on weights but also on architecture signals that can leak sensitive patterns. Financial objectives also extend beyond mean squared error to directional accuracy and trading-aware risk metrics under frictions. These factors call for a federated method that can search over architectures, adapt to regimes, control resources, and provide calibrated privacy guarantees [15,16,17].
We introduce FedRegNAS, a regime-aware federated NAS framework that learns both network weights θ and architecture logits α from decentralized data while respecting communication and latency proxies and enforcing user-level DP. The core model is a temporal supernet composed of dilated temporal convolutions, gated recurrent units, and causal attention blocks. A lightweight gate g ψ produces a regime simplex z t that modulates per-operator adapters, so the supernet can react to market conditions without duplicating the entire network. Clients update θ with stochastic gradients and optimize α via mirror descent on the logit parameterization using a Gumbel–Softmax relaxation. Their updates are clipped and noised with decoupled DP budgets for weights and architectures (ADDP, Adaptive Decoupled Differential Privacy). On the server, we aggregate weights using staleness- and trust-aware coefficients and fuse client architecture beliefs through a KL barycenter B , a logarithmic opinion–style rule robust to non-IID participation [18]. A search-to-train curriculum anneals the sampling temperature, discretizes α to a sparse subgraph, and fine-tunes θ before deploying compact market-aware adapters. Resource use is managed via differentiable proxies for communication and on-device latency, which are integrated into the training objective.
FedRegNAS is built around four principles: (P1) adaptivity to regimes through z t -conditioned operators that preserve parameter efficiency; (P2) stability under heterogeneity via KL-barycentric architecture fusion and similarity-based trust weighting of weight updates; (P3) privacy without over-noising by decoupling DP budgets for θ and α using Rényi composition; and (P4) resource-awareness that co-optimizes accuracy with communication and latency using lazy architecture synchronization, quantization, and sparsification compatible with clipping-first DP.
On federated daily and intraday equity benchmarks, FedRegNAS consistently improves over strong FL baselines. Relative to a Transformer trained with FedProx, we observe 3–5% lower RMSE, gains of 1.6–2.3 percentage points in directional accuracy, and roughly 10–12% higher Sharpe ratio across datasets. At similar accuracy, mean client upload is reduced by about 43% versus a federated LSTM and by about 70% versus the Transformer baseline. The discovered architectures also cut median on-device inference latency by roughly 47% (for example, 3.6 ms vs. 6.8 ms on a mobile-class CPU). With user-level DP around $(\varepsilon_w, \varepsilon_\alpha) \approx (2.0, 1.0)$ at $\delta = 10^{-5}$, decoupled budgets preserve accuracy relative to a single coupled budget, supporting the view that architecture-channel noise is the dominant privacy lever.
Contributions. (1) A regime-gated federated supernet for financial forecasting that conditions operator adapters on latent regimes and searches over temporal primitives suited to markets [9,10,11]; (2) a barycentric architecture aggregation rule that combines client architecture distributions via a KL barycenter, improving stability and sparsity preservation under non-IID data [18]; (3) ADDP, a decoupled user-level DP mechanism that separately allocates noise to weights and architectures, analyzed with RDP accounting [7,8]; (4) an integrated risk- and resource-aware objective that trades off forecasting error, trading utility, and communication/latency; (5) a practical search-to-train curriculum with personalization that yields compact, low-latency deployments and strong out-of-sample performance. Together, these elements enable adaptive, privacy-aware forecasting under realistic federated constraints.
Organization. Section 2 reviews federated optimization, privacy, NAS, and finance-specific forecasting. Section 3 introduces notation, objectives, privacy accounting, and resource models. Section 4 details the regime-gated supernet, mirror-descent search, KL-barycentric aggregation, ADDP, and the communication stack. Section 5 presents the theoretical analysis, covering convergence and privacy accounting. Section 6 reports accuracy, trading, efficiency, ablations, and privacy–performance trade-offs. Section 7 concludes.

2. Related Work

Our work lies at the intersection of federated optimization, differential privacy, neural architecture search (NAS), and financial time-series forecasting. We highlight the most relevant lines and position FedRegNAS among them.

2.1. Federated Learning Under Heterogeneity, Staleness, and Robustness

Federated averaging [1,19] established on-device collaborative training with periodic model aggregation. Later work improved stability under client drift by adding proximal regularization and control variates, as in FedProx [20], SCAFFOLD [21], and adaptive server optimizers [22]. Surveys summarize open problems around non-IID data, partial participation, and systems constraints [2]. Asynchronous updates and staleness are addressed by staleness-aware weighting [23], while robustness to adversarial or noisy clients motivates aggregators like Krum [24], Byzantine-robust estimation [25], and trust bootstrapping [26]. Communication efficiency is pursued via quantization and sparsification, e.g., QSGD [27], deep gradient compression [28], and memory-based top-k sparsified SGD [29]. Our server-side weighting combines staleness penalties with similarity-based trust, and our communication stack composes clipping, quantization, and sparsification in a way that remains compatible with user-level privacy guarantees.

2.2. Differential Privacy in Deep and Federated Learning

Differentially private SGD (DP-SGD) introduced per-iteration clipping, Gaussian noise, and advanced accounting for deep learning [7]. Rényi differential privacy (RDP) [8] enables tighter composition bounds and convenient subsampling analysis, and is now standard for user-level guarantees in FL. We follow this line but decouple privacy budgets across weight and architecture channels. More noise is assigned to the more sensitive architecture gradients, while weight updates use moderated noise to preserve accuracy. This complements existing DP-FL mechanisms that rely on clipping-and-noise and RDP-style accountants [7,8].

2.3. Neural Architecture Search and Differentiable Relaxations

NAS has evolved from reinforcement learning and evolutionary strategies to more efficient one-shot relaxations and gradient-based search [30]. ENAS shares parameters across architectures to reduce search cost [4]. DARTS relaxes discrete choices to a continuous simplex and uses gradient descent [3]. Stochastic NAS variants rely on the Gumbel–Softmax/Concrete distribution for differentiable sampling [5,31,32]. Hardware- and resource-aware methods bias search toward efficient operators or prune expensive ones [6]. Most of this literature assumes centralized data and ignores privacy, communication, and device heterogeneity. In contrast, our federated, regime-aware supernet, KL-style barycentric aggregation (akin to logarithmic opinion pools [18]), and decoupled DP budgets explicitly link architecture decisions to client regimes, resource constraints, and privacy.

2.4. Time-Series Forecasting and Financial Prediction

Deep sequence models such as temporal convolutional networks [9], temporal fusion transformers [10], and probabilistic forecasters like DeepAR [11] have improved generic time-series prediction. In high-frequency finance, specialized architectures exploit limit-order-book structure [12]. Market nonstationarity is often modeled via regime switching [13] or mixtures of experts with data-dependent gating [14]. Our regime-gated, federated NAS framework unifies these ideas by searching for operator choices and adapters conditioned on latent regimes, while optimizing a risk-sensitive objective that is tied to trading performance.
Prior work provides essential building blocks: federated optimization [1,22], privacy accounting [7,8], differentiable NAS [3,6,31], and regime-aware sequence modeling [13,14]. FedRegNAS integrates these strands and adds three main novelties: (i) a regime-gated federated supernet tailored to financial series; (ii) a KL-inspired barycentric aggregation of client architecture distributions, related to log-opinion pooling [18]; and (iii) architecture/weight DP decoupling with resource-aware curricula. Together, these components enable adaptive, privacy-preserving forecasting under realistic communication budgets.

2.5. Federated NAS and DP-Aware NAS

Neural architecture search has also been integrated with federated learning. Survey work on federated NAS [33] and specific algorithms such as FedNAS [34], DP-FNAS [35], and personalized federated NAS like PerFedRLNAS [36] combine architecture search with client-side training. These frameworks mostly target vision benchmarks, operate under generic non-IID partitions, and search over convolutional backbones or layer configurations. They typically optimize accuracy and, in some cases, resource usage. In contrast, FedRegNAS is designed for stock-return forecasting. It operates on a regime-gated temporal supernet, searches over time-series operators and horizon-specific modules, and optimizes an objective that jointly reflects forecasting performance, communication and latency proxies, and decoupled user-level DP budgets [37]. This design allows us to study the interaction between NAS and FL in a financial setting, rather than on image benchmarks, and to analyze how architecture search behaves under decoupled DP.
Our framework is also related to architecture-level personalization methods that adapt model structures to heterogeneous clients. Hypernetwork-based FL approaches, such as the filter-aware personalization framework of Yang et al. [38], generate client-specific filters or blocks from a metanetwork on top of a shared backbone. Both this line of work and FedRegNAS pursue personalized architectures, but they differ in mechanism. Hypernetwork methods keep a fixed global architecture and personalize through generated weights. FedRegNAS instead performs federated NAS over a regime-aware temporal supernet, aggregates probabilistic architecture distributions across clients, and then applies lightweight adapters for personalization. Beyond NAS, personalized FL frameworks in medical imaging, such as SCAN-PhysFed for low-dose CT denoising with large language models [39], show the benefit of incorporating domain structure and auxiliary signals into FL. Our method follows this principle in the financial domain by using latent regimes, financial objectives, and DP constraints to guide architecture search and aggregation.

3. Preliminaries

We formalize the forecasting task, modeling assumptions, federated optimization protocol, and the privacy/resource/evaluation models used throughout. The notation introduced here is reused consistently in subsequent sections.

3.1. Data, Predictive Task, and Regime Representation

We consider N institutions (clients) indexed by $i \in \{1,\dots,N\}$. Client i holds a private dataset
$$\mathcal{D}_i = \big\{ \big(x^{(i)}_{t-L+1:t},\, y^{(i)}_t\big) \big\}_{t \in \mathcal{T}_i},$$
where $x^{(i)}_{t-L+1:t} \in \mathbb{R}^{L \times d}$ denotes a lagged feature window and $y^{(i)}_t \in \mathbb{R}$ is the next-period log-return. With price $p^{(i)}_t$, define $r^{(i)}_t = \log\big(p^{(i)}_t / p^{(i)}_{t-1}\big)$ and the prediction target $y^{(i)}_t = r^{(i)}_{t+1}$. Features may include technical indicators, realized volatility, order-book statistics, and cross-asset signals; feature construction is client-specific and remains on-device.
In addition, we summarize the latent market state by a regime vector $z_t \in \Delta^{R}$ over R regimes (e.g., low-volatility, high-volatility, event-driven). A lightweight gate $g_\psi : \mathbb{R}^{L \times d} \to \Delta^{R}$ estimates $z_t = g_\psi(x_{t-L+1:t})$. Unless otherwise specified, regime-related parameters (e.g., per-regime adapters) are included in the global weight vector $\theta$.

3.2. Supernet Parameterization and Architecture Relaxation

The global predictor is a parametric function $f_{\theta,\alpha}$ that maps $x^{(i)}_{t-L+1:t}$ to $\hat{y}^{(i)}_t$, where $\theta$ are network weights and $\alpha$ are architecture logits. We adopt a one-shot supernet with M decision points; at decision m, the candidate operator set is $\mathcal{O}_m = \{O_{m,j}\}_{j=1}^{J_m}$ (e.g., dilated temporal convolutions, GRU cells, causal self-attention, identity). For each module m, we maintain logits $\alpha_m \in \mathbb{R}^{J_m}$ and define the corresponding operator probabilities via a softmax
$$p_m = \mathrm{softmax}(\alpha_m) \in \Delta^{J_m},$$
so that $p_{m,j}$ is the probability weight assigned to operator $O_{m,j}$. Stacking yields $\alpha = (\alpha_1,\dots,\alpha_M)$ for logits and $p = (p_1,\dots,p_M)$ for probabilities.
During search, mixed operations use a Gumbel–Softmax relaxation over the logits:
$$\mathrm{MixOp}_m(h) = \sum_{j=1}^{J_m} z_{m,j}\, O_{m,j}(h), \qquad z_m = \mathrm{softmax}\!\left(\frac{\alpha_m + g_m}{\tau}\right),$$
where $g_{m,j} \overset{\mathrm{iid}}{\sim} \mathrm{Gumbel}(0,1)$ and $\tau > 0$ is a temperature. Post-search, the learned probabilities $p_m$ are discretized (e.g., top-1 or top-s per decision) and the resulting subnetwork is fine-tuned.
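To make the relaxation concrete, the following PyTorch-style sketch shows a mixed operation that samples operator weights with the Gumbel–Softmax and returns the weighted combination of candidate outputs. The class name MixOp, the use of torch.nn.functional.gumbel_softmax, and the assumption that candidate operators are supplied as nn.Modules with matching output shapes are illustrative choices, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixOp(nn.Module):
    """Gumbel-Softmax relaxed mixture over candidate operators (illustrative sketch)."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                        # candidate operators O_{m,j}
        self.logits = nn.Parameter(torch.zeros(len(ops)))    # architecture logits alpha_m

    def forward(self, h, tau=1.0):
        # z_m = softmax((alpha_m + g_m) / tau), with g_m ~ Gumbel(0, 1)
        z = F.gumbel_softmax(self.logits, tau=tau, hard=False)
        # weighted sum of candidate operator outputs
        return sum(z[j] * op(h) for j, op in enumerate(self.ops))
```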

3.3. Federated Optimization and Architecture Fusion

Training proceeds over rounds $k = 0, 1, \dots, K-1$. At round k, the server samples $S_k \subseteq \{1,\dots,N\}$ and broadcasts $(\theta^k, \alpha^k)$. Each selected client performs E local steps on minibatches from $\mathcal{D}_i$, producing update deltas $(\Delta\theta_i, \Delta\alpha_i)$ and an effective weight $\nu_i$ (e.g., proportional to local sample count). Let $\tau_i^k \ge 0$ denote the staleness (round lag) of client i and $\phi(\cdot)$ a nonincreasing penalty (e.g., $\phi(\tau) = 1/(1+\tau)$). The server aggregates as
$$\theta^{k+1} = \theta^{k} + \sum_{i \in S_k} \omega_i^k\, \Delta\theta_i, \qquad \omega_i^k = \frac{\nu_i\, \phi(\tau_i^k)}{\sum_{j \in S_k} \nu_j\, \phi(\tau_j^k)},$$
$$\alpha^{k+1} = \mathcal{B}\big(\{\alpha_i\}_{i \in S_k},\, \{w_i^k\}\big), \qquad w_i^k \propto \nu_i\, \phi(\tau_i^k),$$
where (2) is a staleness-robust variant of federated averaging and (3) aggregates architecture distributions via a barycentric operator. Instead of directly optimizing architecture probabilities on the simplex, each client optimizes unconstrained logits and obtains probabilities via a softmax. For module m, let $p_m = \mathrm{softmax}(\alpha_m) \in \Delta^{J_m}$ denote the probability vector over its $J_m$ operators, where $\alpha_m \in \mathbb{R}^{J_m}$ are the corresponding logits. Mirror descent on the simplex with KL geometry is conveniently implemented by taking a gradient step in the logit space and then re-normalizing with a softmax. Concretely, at local step t on client i, we update
$$\alpha^{t+1}_{m,j} = \alpha^{t}_{m,j} - \eta_\alpha\, \widehat{\nabla_{\alpha_{m,j}} L_i}, \qquad p^{t+1}_m = \mathrm{softmax}\big(\alpha^{t+1}_m\big),$$
where $\widehat{\nabla L_i}$ is a minibatch gradient estimate and $\eta_\alpha > 0$ is the architecture stepsize. This update is equivalent to mirror descent with KL divergence on the probability simplex (as in exponentiated-gradient methods), but is implemented in practice by a simple gradient step on the logits followed by a softmax, which keeps $p_m$ on the simplex without manual renormalization. Here, $\alpha_m$ denotes the global architecture logits maintained on the server, while $\alpha^{(k)}_m$ denotes the corresponding client-specific logits updated locally on client k. We use $p_m = \mathrm{softmax}(\alpha_m)$ for the associated probabilities, and $\theta$ / $\theta^{(k)}$ for shared/local model weights, so that local and global variables are clearly distinguished in all subsequent equations. Given client distributions $\{\alpha^{(i)}\}$ and weights $\{w_i\}$ with $\sum_i w_i = 1$, we use the KL barycenter (geometric mean on the simplex):
$$\big[\mathcal{B}\big(\{\alpha^{(i)}\},\{w_i\}\big)\big]_{m,j} = \frac{\prod_i \big(\alpha^{(i)}_{m,j}\big)^{w_i}}{\sum_{\ell=1}^{J_m} \prod_i \big(\alpha^{(i)}_{m,\ell}\big)^{w_i}},$$
the solution of $\arg\min_{\alpha_m \in \Delta^{J_m}} \sum_i w_i\, D_{\mathrm{KL}}\big(\alpha_m \,\|\, \alpha^{(i)}_m\big)$, which preserves emerging sparsity and is robust to non-IID participation.
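As a minimal illustration of the barycentric rule in (5), the NumPy sketch below computes the weighted geometric mean of client operator distributions for a single decision point and renormalizes it onto the simplex; the function name and the small numerical stabilizer are our own assumptions.

```python
import numpy as np

def kl_barycenter(p_list, w):
    """Weighted KL barycenter (normalized geometric mean) of client operator
    distributions for one decision point, cf. Eq. (5).
    p_list: list of probability vectors of length J_m; w: weights summing to 1."""
    p = np.stack(p_list)                         # shape (num_clients, J_m)
    w = np.asarray(w)[:, None]
    log_bar = (w * np.log(p + 1e-12)).sum(0)     # weighted log geometric mean
    bar = np.exp(log_bar - log_bar.max())        # stabilize before normalization
    return bar / bar.sum()

# Example: two clients, three candidate operators
print(kl_barycenter([np.array([0.7, 0.2, 0.1]),
                     np.array([0.5, 0.3, 0.2])], [0.6, 0.4]))
```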

3.4. Privacy, Resource Proxies, and Evaluation Criteria

We enforce user-level $(\varepsilon, \delta)$-DP with Gaussian mechanisms on weights and architectures, allowing decoupled budgets $(\varepsilon_w, \delta_w)$ and $(\varepsilon_\alpha, \delta_\alpha)$. Per-client sanitization clips updates and adds noise:
$$\tilde{\Delta\theta}_i = \mathrm{clip}(\Delta\theta_i, C_w), \qquad \tilde{\Delta\alpha}_i = \mathrm{clip}(\Delta\alpha_i, C_\alpha),$$
$$\hat{\Delta\theta}_i = \tilde{\Delta\theta}_i + \mathcal{N}\big(0, \sigma_w^2 C_w^2 I\big), \qquad \hat{\Delta\alpha}_i = \tilde{\Delta\alpha}_i + \mathcal{N}\big(0, \sigma_\alpha^2 C_\alpha^2 I\big),$$
with noise multipliers $\sigma_w, \sigma_\alpha > 0$. Composition across rounds follows a moments or Rényi DP accountant; the server aggregates only sanitized updates via (2) and (5).
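A minimal sketch of the clipping-and-noise step above is given below, assuming flattened NumPy update vectors; the clipping norms and noise multipliers shown are illustrative values within the ranges reported later, not prescribed settings.

```python
import numpy as np

def sanitize(delta, clip_norm, noise_multiplier, rng=None):
    """Clip an update to L2 norm <= clip_norm and add Gaussian noise with
    standard deviation noise_multiplier * clip_norm (clipping-and-noise step above)."""
    rng = rng if rng is not None else np.random.default_rng()
    scale = min(1.0, clip_norm / (np.linalg.norm(delta) + 1e-12))
    return delta * scale + rng.normal(0.0, noise_multiplier * clip_norm, size=delta.shape)

# Decoupled channels (illustrative shapes and hyperparameters):
delta_theta = np.random.randn(1000)   # stand-in weight delta
delta_alpha = np.random.randn(64)     # stand-in architecture delta
delta_theta_hat = sanitize(delta_theta, clip_norm=1.0, noise_multiplier=0.75)
delta_alpha_hat = sanitize(delta_alpha, clip_norm=0.5, noise_multiplier=1.2)
```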
Let $\mathrm{bytes}(\theta, \alpha)$ denote the serialized client upload and $\mathrm{lat}(\alpha; \mathrm{HW})$ the on-device latency (predicted by additive operator latencies) for hardware profile $\mathrm{HW}$. A differentiable proxy controls resources:
$$\mathcal{R}(\alpha) = \lambda_{\mathrm{comm}} \cdot \mathbb{E}\big[\mathrm{bytes}(\theta, \alpha)\big] + \lambda_{\mathrm{lat}} \cdot \mathbb{E}_{\mathrm{HW}}\big[\mathrm{lat}(\alpha; \mathrm{HW})\big],$$
with nonnegative weights $\lambda_{\mathrm{comm}}, \lambda_{\mathrm{lat}}$. For prediction $\hat{y}_t = f_{\theta,\alpha}(x_{t-L+1:t}; z_t)$ and true $y_t$, define the asymmetric squared loss
$$\ell_{\mathrm{asym}}(y_t, \hat{y}_t) = \begin{cases} \lambda_-\,(y_t - \hat{y}_t)^2, & y_t - \hat{y}_t < 0, \\ \lambda_+\,(y_t - \hat{y}_t)^2, & y_t - \hat{y}_t \ge 0, \end{cases} \qquad \lambda_- > \lambda_+ > 0.$$
A threshold policy $\pi_\vartheta$ generates $u_t \in \{-1, 0, 1\}$; the per-period strategy return with transaction cost $\gamma \ge 0$ is
$$\tilde{r}_t = u_t\, y_t - \gamma\, |u_t - u_{t-1}|.$$
Client i minimizes
$$L_i(\theta, \alpha) = \mathbb{E}\big[\ell_{\mathrm{asym}}(y_t, \hat{y}_t)\big] - \lambda_{\mathrm{pnl}} \cdot \mathbb{E}[\tilde{r}_t] + \lambda_{\mathrm{cvar}} \cdot \mathrm{CVaR}_\beta(\tilde{r}_t) + \lambda_{\mathrm{res}} \cdot \mathcal{R}(\alpha),$$
with $\lambda_{\mathrm{pnl}}, \lambda_{\mathrm{cvar}}, \lambda_{\mathrm{res}} \ge 0$ and tail level $\beta \in (0, 1)$.
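The sketch below illustrates the loss ingredients, assuming NumPy arrays of realized and predicted returns and a position sequence; the weights lam_minus/lam_plus, the cost gamma, and the convention of applying CVaR to a generic loss array (e.g., negated strategy returns) are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def asym_loss(y, y_hat, lam_minus=2.0, lam_plus=1.0):
    """Asymmetric squared loss (Eq. (9)): downside errors (y_t - y_hat_t < 0) get weight lam_minus."""
    err = np.asarray(y) - np.asarray(y_hat)
    return np.where(err < 0, lam_minus, lam_plus) * err ** 2

def strategy_return(u, y, gamma=1e-4):
    """Per-period strategy return (Eq. (10)) with proportional transaction cost gamma."""
    u = np.asarray(u, dtype=float)
    u_prev = np.concatenate([[0.0], u[:-1]])     # previous position, flat before t = 1
    return u * np.asarray(y) - gamma * np.abs(u - u_prev)

def cvar(losses, beta=0.95):
    """Empirical CVaR at tail level beta: mean of the worst (1 - beta) fraction of losses."""
    losses = np.sort(np.asarray(losses))
    tail = losses[int(np.ceil(beta * len(losses))):]
    return tail.mean() if tail.size else losses[-1]
```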

4. Methodology

We present FedRegNAS, a regime-aware, privacy-preserving federated neural architecture search method for stock-return forecasting. The method couples a regime-gated temporal supernet with mirror-descent architecture updates, trust- and staleness-aware aggregation, and a search-to-train curriculum that respects device and communication budgets. We retain the notation introduced in Section 3 and reorganize the methodology into four coherent subsections.
In general, FedRegNAS combines three key components: (i) a regime-gated temporal supernet, (ii) KL-barycentric aggregation of client-wise architecture distributions, and (iii) adaptive decoupled differential privacy (ADDP) for weights and architectures. Regime gating reuses standard sequence operators (e.g., recurrent and attention blocks) but organizes them via a learned regime encoder so that different latent market regimes select different subpaths; this differs from prior federated NAS works that typically search over global vision backbones without regime structure. KL-barycentric aggregation adapts ideas from distributional averaging to architecture logits, replacing the simple parameter averaging used in conventional FL and federated NAS baselines, and is specifically designed to stabilize architecture search under non-IID clients. Finally, ADDP reinterprets user-level DP accounting in a decoupled manner, assigning separate noise scales and budgets to weight updates and architecture gradients, in contrast to earlier DP-aware NAS methods that typically use a single privacy mechanism for all parameters. The overall structure of FedRegNAS is shown in Figure 1.

4.1. Problem Formulation and Bilevel NAS Reduction

Let $L_i(\theta, \alpha)$ be the client loss in (11). The global objective aggregates client risks with sampling weights $\pi_i > 0$, $\sum_i \pi_i = 1$:
$$\min_{\theta,\, \alpha \in \Delta} \; F(\theta, \alpha) := \sum_{i=1}^{N} \pi_i\, \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\big[L_i(\theta, \alpha)\big],$$
optionally augmented by a resource regularizer (e.g., FLOPs/latency penalty) with coefficient λ res as in the smoothed objective F λ used in the analysis.
We adopt the standard one-shot NAS relaxation: during search, the network is overparameterized and ( θ , α ) are optimized jointly; after selection, α is discretized to a sparse architecture and θ is fine-tuned. Concretely, the architecture parameters α = { α m } m define simplex weights over J m candidate operators at decision m. Post-search, we discretize by top-1 (or top-s) selection using empirical usage frequencies of the soft selectors (cf. Section 4.4), and then continue federated training with α frozen.
Because architecture quality depends on weights trained on heterogeneous clients (and vice versa), we treat α as a slow variable and θ as a fast variable. Clients update θ every round via stochastic gradients of L i , while α is updated less frequently by mirror descent on the product of simplices (cf. (4)). This reduction stabilizes search under partial participation and reduces communication by transmitting architecture updates only intermittently (every H α rounds).

4.2. Regime-Gated Temporal Supernet

To embed nonstationarity, we modulate operators by a regime simplex z t = g ψ ( x t L + 1 : t ) Δ R , where g ψ is a locally computed encoder over the most recent feature window. The encoder’s output weights regime-specific adapters while keeping most parameters shared. We instantiate the regime encoder g ψ as a lightweight GRU-based temporal encoder followed by a linear projection and softmax, mapping past returns and features x 1 : t to a point z t on a simplex over R latent regimes. Unless otherwise stated, g ψ is shared across markets and trained end-to-end jointly with the forecasting model and architecture parameters, without using any explicit volatility or volume labels. The regime cardinality R is treated as a model-selection hyperparameter and chosen on a held-out validation set.
On this basis, for a hidden activation h at decision m, we define
$$\mathrm{RG\text{-}MixOp}_m\big(h;\, x_{t-L+1:t}\big) = \sum_{j=1}^{J_m} z_{m,j} \sum_{r=1}^{R} z_t[r]\; A_{m,j,r}\big(O_{m,j}(h)\big), \qquad z_m = \mathrm{softmax}\!\left(\frac{\log \alpha_m + g_m}{\tau}\right),$$
where $O_{m,j}$ is the j-th candidate operator, and $A_{m,j,r}(u) = \gamma_{m,j,r}\, u + \beta_{m,j,r}$ is a lightweight, per-regime affine adapter contributing to $\theta$. The temperature $\tau$ is annealed (high → low) during search to move from exploration to exploitation.
RG-MixOps enable conditional computation over latent regimes by combining two sets of weights: the time-varying regime vector $z_t \in \Delta^{R-1}$ produced by the regime encoder $g_\psi(x_{1:t})$, and the module-wise operator probabilities $z_m \in \Delta^{J_m-1}$ obtained from the Gumbel–Softmax over logits $\alpha_m$. For a given module m and hidden state h at time t, we first apply each primitive operator $O_{m,j}(h)$, transform it through regime-specific adapters $A_{m,j,r}(\cdot)$, and weight these outputs by $z_t[r]$; the resulting regime mixture is then combined across operators using $z_{m,j}$. Thus RG-MixOp outputs a doubly weighted sum over regimes r and operators j, with $z_t[r]$ gating $A_{m,j,r}$ inside each operator and $z_{m,j}$ mixing the $J_m$ operators. Since $z_t$ is computed locally on each client and never transmitted, regime states remain private by construction.
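For concreteness, a PyTorch-style sketch of the regime-gated mixed operation is given below. It assumes scalar per-(operator, regime) scale/shift adapters and a regime vector z_t supplied by the gate; the paper's adapters may be channel-wise, so this is a simplified illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGMixOp(nn.Module):
    """Regime-gated mixed operation: doubly weighted sum over operators j and regimes r."""
    def __init__(self, ops, num_regimes):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.logits = nn.Parameter(torch.zeros(len(ops)))
        # Per-(operator, regime) affine adapters A_{m,j,r}(u) = gamma * u + beta
        self.gamma = nn.Parameter(torch.ones(len(ops), num_regimes))
        self.beta = nn.Parameter(torch.zeros(len(ops), num_regimes))

    def forward(self, h, z_t, tau=1.0):
        # z_t: regime simplex from the gate g_psi, shape (num_regimes,)
        z_m = F.gumbel_softmax(self.logits, tau=tau)          # operator weights
        out = 0.0
        for j, op in enumerate(self.ops):
            u = op(h)
            # regime mixture of affine adapters applied to the operator output
            adapted = sum(z_t[r] * (self.gamma[j, r] * u + self.beta[j, r])
                          for r in range(z_t.shape[0]))
            out = out + z_m[j] * adapted
        return out
```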

4.3. Client-Side Optimization with Decoupled Privacy

On round k, client $i \in S_k$ receives $(\theta^k, \alpha^k, \tau_k)$ and performs E local steps on minibatches $B \subset \mathcal{D}_i$:
$$\theta \leftarrow \theta - \eta_w\, \nabla_\theta L_i(\theta, \alpha; B), \qquad \alpha_{m,j} \leftarrow \frac{\alpha_{m,j}\, \exp\!\big(-\eta_\alpha\, \widehat{\nabla_{\alpha_{m,j}} L_i}\big)}{\sum_{\ell=1}^{J_m} \alpha_{m,\ell}\, \exp\!\big(-\eta_\alpha\, \widehat{\nabla_{\alpha_{m,\ell}} L_i}\big)} \quad (\text{mirror descent; cf. (4)}).$$
After local optimization, the client forms deltas $\Delta\theta_i = \theta - \theta^k$ and $\Delta\alpha_i = \alpha - \alpha^k$. We apply separate clipping and Gaussian mechanisms to weights and architecture channels:
$$\hat{\Delta\theta}_i = \mathrm{clip}(\Delta\theta_i, C_w) + \mathcal{N}\big(0, \sigma_w^2 C_w^2 I\big), \qquad \hat{\Delta\alpha}_i = \mathrm{clip}(\Delta\alpha_i, C_\alpha) + \mathcal{N}\big(0, \sigma_\alpha^2 C_\alpha^2 I\big).$$
This ADDP decoupling allows ( ε w , δ w ) and ( ε α , δ α ) to be tuned independently via RDP/moments accounting. In practice we allocate more privacy (larger σ α and/or fewer uploads) to the more sensitive architecture channel.
Weights $\hat{\Delta\theta}_i$ are uploaded every round, and architectures $\hat{\Delta\alpha}_i$ only when $k \bmod H_\alpha = 0$. This asymmetric schedule matches the bilevel reduction, reduces bandwidth, and limits cumulative $\varepsilon_\alpha$. Low-rank sketches can be applied locally to $\Delta\theta_i$ before sanitization to reduce payload without changing the privacy analysis. Clipping introduces bounded bias that is traded against variance from Gaussian noise; thresholds $C_w$ and $C_\alpha$ are tuned to keep clipping rates modest. To mitigate variance amplification at low $\tau$, we use gradient normalization within RG-MixOps and damp $\eta_\alpha$ relative to $\eta_w$. Typical robust ranges are observed empirically: $C_w \in [0.5, 3]$, $C_\alpha \in [0.1, 1]$, $\sigma_w \in [0.5, 2]$, $\sigma_\alpha \in [1, 4]$, $H_\alpha \in \{2, 4, 8\}$.
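The exponentiated-gradient form of the architecture update can be sketched in a few lines of NumPy, treating the per-decision architecture vector as a probability distribution; the step size used in the example is illustrative.

```python
import numpy as np

def mirror_descent_step(alpha_m, grad_m, eta_alpha=0.05):
    """Exponentiated-gradient (KL mirror-descent) step on the simplex for one decision.
    alpha_m: current operator probabilities; grad_m: loss gradient w.r.t. them."""
    updated = alpha_m * np.exp(-eta_alpha * grad_m)
    return updated / updated.sum()

# Example: three candidate operators, gradient favouring the first operator
alpha_m = np.array([1 / 3, 1 / 3, 1 / 3])
grad_m = np.array([-1.0, 0.2, 0.5])
print(mirror_descent_step(alpha_m, grad_m))   # mass shifts toward operator 0
```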

4.4. Server Aggregation and Search-to-Train Curriculum

4.4.1. Trust/Staleness Weighting and Weight Aggregation

Upon receiving sanitized updates, the server computes similarity-aware weights
$$\omega_i^k \propto \nu_i\, \phi(\tau_i^k)\, \rho_i^k, \qquad \rho_i^k = \frac{\exp\!\big(\lambda_{\mathrm{trust}} \cdot \cos(\hat{\Delta\theta}_i, \overline{\Delta\theta})\big)}{\sum_{j \in S_k} \exp\!\big(\lambda_{\mathrm{trust}} \cdot \cos(\hat{\Delta\theta}_j, \overline{\Delta\theta})\big)}, \qquad \overline{\Delta\theta} = \sum_{j \in S_k} \nu_j\, \hat{\Delta\theta}_j,$$
where cosine similarity is computed on low-rank sketches to reduce bandwidth and noise variance. The global weights update is
$$\theta^{k+1} = \theta^{k} + \sum_{i \in S_k} \omega_i^k\, \hat{\Delta\theta}_i.$$
Architecture parameters are aggregated with the KL barycenter (cf. (5)), which yields the normalized geometric mean per decision m:
$$\alpha^{k+1}_m[j] = \frac{\prod_{i \in S_k} \big(\alpha_{i,m}[j]\big)^{\omega_i^k}}{\sum_{\ell=1}^{J_m} \prod_{i \in S_k} \big(\alpha_{i,m}[\ell]\big)^{\omega_i^k}}, \qquad j = 1, \dots, J_m,$$
followed by smoothing
$$\tilde{\alpha}^{k+1} = (1 - \lambda_{\mathrm{bar}})\, \alpha^{k} + \lambda_{\mathrm{bar}}\, \alpha^{k+1}, \qquad \lambda_{\mathrm{bar}} \in (0, 1].$$
If no new $\alpha$ is received (rounds with $k \bmod H_\alpha \neq 0$), the server keeps $\alpha^{k+1} = \alpha^{k}$ and continues temperature annealing independently.
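A compact sketch of the similarity- and staleness-aware weighting is shown below, assuming client deltas are flattened NumPy vectors and using phi(tau) = 1/(1 + tau); the helper name and the lam_trust value are illustrative.

```python
import numpy as np

def trust_staleness_weights(deltas, nu, staleness, lam_trust=1.0):
    """Combine sample weights nu_i, staleness penalties phi(tau) = 1/(1 + tau), and
    cosine-similarity trust scores into normalized aggregation weights."""
    phi = 1.0 / (1.0 + np.asarray(staleness, dtype=float))
    ref = sum(n * d for n, d in zip(nu, deltas))              # cohort direction
    cos = np.array([d @ ref / (np.linalg.norm(d) * np.linalg.norm(ref) + 1e-12)
                    for d in deltas])
    rho = np.exp(lam_trust * cos)
    rho /= rho.sum()
    w = np.asarray(nu) * phi * rho
    return w / w.sum()
```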

4.4.2. Three-Phase Search-to-Train Curriculum

We schedule the following:
Phase I (warm search) with Gumbel–Softmax temperature annealing
$$\tau_{k+1} = \max(\tau_{\min},\, \eta_\tau\, \tau_k), \qquad \eta_\tau < 1,$$
weight uploads every round, and architecture uploads every $H_\alpha$ rounds;
Phase II (selection and sparse retraining), where we discretize $\alpha$ by top-1 or top-s per decision using frequency statistics of $z_m$ and then freeze $\alpha$ while continuing federated optimization of $\theta$ (often with larger E and smaller $\eta_w$);
Phase III (market-aware personalization), where each client deploys private lightweight parameters $\eta^{(i)}$ (e.g., per-regime adapters A or a small head), forming $\theta^{(i)} = \theta \oplus \eta^{(i)}$ and optimizing
$$\min_{\eta^{(i)}} \; \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\Big[L_i\big(\theta \oplus \eta^{(i)},\, \alpha_{\mathrm{disc}}\big)\Big] + \mu\, \big\|\eta^{(i)}\big\|_2^2 + \lambda_{\mathrm{prox}}\, \big\|\theta - \theta_{\mathrm{ref}}\big\|_2^2,$$
with η ( i ) remaining local (no uplink). Phase II halts the growth of the architecture-channel privacy ledger; Phase III typically requires no further communication.
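The scheduling logic of the curriculum can be summarized by two small helpers, sketched below under the assumption of a geometric annealing factor and a fixed Phase II start round (the specific constants are illustrative).

```python
def anneal_temperature(tau, eta_tau=0.97, tau_min=0.1):
    """Phase I geometric temperature annealing: tau_{k+1} = max(tau_min, eta_tau * tau_k)."""
    return max(tau_min, eta_tau * tau)

def upload_architecture(round_k, H_alpha=4, phase_two_start=300):
    """Lazy architecture synchronization: upload alpha every H_alpha rounds during search,
    and stop once Phase II freezes the architecture (illustrative round threshold)."""
    return round_k < phase_two_start and round_k % H_alpha == 0
```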

5. Theoretical Analysis

This section establishes the theoretical underpinnings of FedRegNAS, providing convergence guarantees under bounded staleness and differential privacy, as well as a rigorous characterization of its user-level privacy accounting. We proceed by first presenting the assumptions that govern the analysis, then deriving the main convergence theorem, followed by detailed discussions on the effects of clipping, trust weighting, and privacy composition.

5.1. Assumptions and Preliminaries

The following assumptions extend those stated in Section 3, formalizing the analytical setting under which FedRegNAS operates.
(A1)
Smoothness. For each client i, the local loss function L i ( θ , α ) is continuously differentiable, and its gradient is L-Lipschitz in both θ and α , i.e.,
$$\big\|\nabla L_i(\theta_1, \alpha_1) - \nabla L_i(\theta_2, \alpha_2)\big\|_2 \le L\, \big\|(\theta_1 - \theta_2,\; \alpha_1 - \alpha_2)\big\|_2.$$
(A2)
Bounded gradient variance. The stochastic gradients are unbiased and have bounded variance:
$$\mathbb{E}\big[\widehat{\nabla} L_i(\theta, \alpha; B)\big] = \nabla L_i(\theta, \alpha), \qquad \mathbb{E}\,\big\|\widehat{\nabla} L_i - \nabla L_i\big\|_2^2 \le \sigma^2.$$
(A3)
Bounded delay (staleness). The communication latency $\tau_i^k$ associated with client i at round k is bounded, with expected value $\bar{\tau} = \mathbb{E}[\tau_i^k]$ and a corresponding attenuation factor $\phi(\tau_i^k) \in (0, 1]$.
(A4)
Clipping and differential-privacy noise. Each client clips its model and architecture deltas using thresholds ( C w , C α ) and adds independent Gaussian noise with multipliers ( σ w , σ α ) , ensuring bounded sensitivity and user-level privacy.
Under these assumptions, the aggregated optimization objective is the smoothed function
$$F_\lambda(\theta, \alpha) = F(\theta, \alpha) + \lambda_{\mathrm{res}} \cdot \mathcal{R}(\theta, \alpha),$$
where $\mathcal{R}(\cdot)$ denotes the resource regularization (e.g., a FLOPs or latency penalty), and $\lambda_{\mathrm{res}} \ge 0$ is its associated coefficient.

5.2. Main Convergence Result

Let K be the total number of rounds, q the probability of client participation per round, and E the number of local gradient steps per client. Denote by $\eta_w$ and $\eta_\alpha$ the respective learning rates for weights and architectures, and by $K_\alpha \le K$ the number of rounds in which architecture updates are transmitted. Then, the following result characterizes the expected convergence rate of FedRegNAS.
Theorem 1. 
Under Assumptions (A1)–(A4), and with learning rates $\eta_w = \Theta\big((KqE)^{-1/2}\big)$ and $\eta_\alpha = \Theta\big((KqE)^{-1/2}\big)$, the expected first-order stationarity gap of $F_\lambda$ satisfies
$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\,\big\|\nabla F_\lambda(\theta^k, \alpha^k)\big\|_2^2 = \mathcal{O}\!\left(\frac{1 + \bar{\tau}}{\sqrt{KqE}}\right) + \mathcal{O}\!\left(\sigma_w^2 C_w^2 + \frac{K_\alpha}{K}\, \sigma_\alpha^2 C_\alpha^2\right) + \mathcal{O}\big(\lambda_{\mathrm{res}}\big).$$
Furthermore, the multiplicative constant in the first term improves monotonically with the trust parameter λ trust up to a regime-dependent threshold beyond which over-concentration degrades performance.

Proof Sketch

The proof follows from standard stochastic nonconvex analysis adapted to mirror descent on the product of Euclidean and simplex spaces. Specifically,
  • The expected decrease in F λ per round is bounded by the inner product of gradients and updates, adjusted for staleness.
  • Gaussian perturbations are zero-mean, and clipping bias is bounded by $\mathbb{E}\big[(\|\Delta_c\|_2 - C_c)_+\big]$ for each channel $c \in \{w, \alpha\}$.
  • The mirror map’s strong convexity ensures a Bregman-descent inequality whose Euclidean norm equivalent differs by at most a constant factor.
  • The cosine-similarity reweighting introduces a bounded variance reduction proportional to exp ( λ trust ) up to saturation.
Combining these arguments and taking expectations over client sampling and DP noise yields the claimed result.

5.3. Impact of Clipping and Trust Weighting

5.3.1. Clipping Bias

For any clipped update Δ c with threshold C c , we can express the bias as
$$b_c = \Delta_c - \mathrm{clip}(\Delta_c, C_c), \qquad \|b_c\|_2 \le \mathbb{E}\big[(\|\Delta_c\|_2 - C_c)_+\big].$$
As long as C c exceeds the 95th percentile of the empirical update norm, clipping bias contributes only a small additive constant to the convergence bound. Empirically, maintaining clipping ratios below 15 % yields stable optimization without excessive noise amplification.

5.3.2. Trust Weighting Effects

The trust weighting function in Equation (14) improves robustness by aligning updates with the cohort direction $\overline{\Delta\theta}$. This mechanism effectively acts as a preconditioner that reduces inter-client gradient variance, improving convergence constants without changing the asymptotic rate. However, overly large $\lambda_{\mathrm{trust}}$ can lead to overconfidence, amplifying bias when gradients are diverse. Empirically, $\lambda_{\mathrm{trust}} \in [0.5, 2]$ achieves optimal trade-offs.

5.4. Differential-Privacy Accounting

Each round of FedRegNAS applies an independent Gaussian mechanism to the clipped deltas of both weight and architecture channels, combined with Poisson subsampling over clients. We quantify privacy loss using the Rényi differential privacy (RDP) framework.

5.4.1. Per-Round Mechanism

For channel $c \in \{w, \alpha\}$, define the Gaussian mechanism
$$\mathcal{M}_c(\Delta_c) = \mathrm{clip}(\Delta_c, C_c) + \mathcal{N}\big(0, \sigma_c^2 C_c^2 I\big),$$
applied to a subsampled set of clients with probability q. The resulting per-round RDP cost of order $\lambda > 1$ is
$$\rho_c(\lambda; q, \sigma_c) = \frac{1}{\lambda - 1} \log\!\Big(1 + q^2\big(e^{(\lambda - 1)/\sigma_c^2} - 1\big)\Big).$$

5.4.2. Composition and Conversion

Over $K_c$ active rounds ($K_w = K$ and $K_\alpha \le K$), the cumulative RDP cost is additive:
$$\varepsilon_c^{\mathrm{RDP}}(\lambda) = K_c\, \rho_c(\lambda; q, \sigma_c).$$
Conversion to standard $(\varepsilon_c, \delta_c)$-DP follows from
$$\varepsilon_c(\delta_c) = \min_{\lambda > 1} \left[ \varepsilon_c^{\mathrm{RDP}}(\lambda) + \frac{\log(1/\delta_c)}{\lambda - 1} \right].$$
The total privacy guarantee can be reported either as separate channels ( ε w , δ w ) and ( ε α , δ α ) , or as an aggregate ( ε , δ ) via composition.
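The accounting above can be sketched directly, using the per-round bound stated in Section 5.4.1 and a grid of RDP orders; because that bound is a simplification, the resulting epsilon values may be looser than those obtained with a tighter subsampled-Gaussian accountant, and the example noise multipliers and round counts are illustrative.

```python
import numpy as np

def rdp_per_round(lam, q, sigma):
    """Per-round RDP cost of the subsampled Gaussian mechanism, using the bound above:
    rho = log(1 + q^2 * (exp((lam - 1) / sigma^2) - 1)) / (lam - 1)."""
    return np.log1p(q ** 2 * (np.exp((lam - 1.0) / sigma ** 2) - 1.0)) / (lam - 1.0)

def epsilon_from_rdp(K_c, q, sigma, delta, orders=np.arange(1.25, 64.0, 0.25)):
    """Convert K_c-fold composition to (epsilon, delta)-DP by minimizing over RDP orders."""
    eps = K_c * rdp_per_round(orders, q, sigma) + np.log(1.0 / delta) / (orders - 1.0)
    return eps.min()

# Example: weight channel over K = 300 rounds vs. architecture channel over K_alpha = 60 rounds
print(epsilon_from_rdp(300, q=0.3, sigma=0.75, delta=1e-5))
print(epsilon_from_rdp(60,  q=0.3, sigma=1.2,  delta=1e-5))
```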

5.4.3. Privacy–Utility Allocation

Because architecture parameters $\alpha$ are typically more sensitive—encoding structural and potentially proprietary information—FedRegNAS allocates stronger protection (a tighter privacy budget) to this channel. Practically, this is achieved by using higher noise multipliers $\sigma_\alpha > \sigma_w$ and reducing the number of architecture uploads ($K_\alpha < K$) through the curriculum schedule. Typical choices satisfying moderate privacy budgets are
$$\sigma_w \in [0.5, 2], \qquad \sigma_\alpha \in [1, 4], \qquad C_w \in [0.5, 3], \qquad C_\alpha \in [0.1, 1].$$

5.4.4. Amplification by Subsampling and Curriculum

Client-level subsampling ($q < 1$) amplifies privacy, as only a subset of clients participates in each round. Furthermore, the search-to-train curriculum (cf. Section 4) reduces the number of architecture transmissions after Phase II, yielding $K_\alpha \ll K$ and consequently much tighter bounds on $\varepsilon_\alpha$.

5.5. Discussion and Implications

The derived convergence rate demonstrates that FedRegNAS maintains the same asymptotic order $\mathcal{O}\big((1 + \bar{\tau})/\sqrt{KqE}\big)$ as standard differentially private federated optimization, while incorporating two additional sources of robustness:
  • Trust weighting: Reduces gradient variance and accelerates early-phase convergence under heterogeneous data distributions.
  • Privacy decoupling: Allows independent control over the privacy budgets of weights and architectures, improving model utility without violating DP constraints.
The rate constants are influenced by staleness ( τ ¯ ) and the DP noise scales ( σ w , σ α ); both can be mitigated through appropriate scheduling and noise calibration. In summary, the analysis confirms that FedRegNAS achieves provably stable convergence under realistic federated settings with heterogeneous clients, bounded staleness, and rigorous differential privacy guarantees.

Summary

The theoretical analysis establishes that FedRegNAS converges to a stationary point of the global objective at the rate
$$\mathcal{O}\!\left(\frac{1 + \bar{\tau}}{\sqrt{KqE}}\right) + \mathcal{O}\!\left(\sigma_w^2 C_w^2 + \frac{K_\alpha}{K}\, \sigma_\alpha^2 C_\alpha^2\right),$$
while satisfying user-level differential privacy under the Gaussian mechanism with per-channel accounting. These results validate the method’s design choices—namely, trust-weighted aggregation, ADDP privacy decoupling, and the search-to-train curriculum—from a theoretical standpoint.

6. Experiments

We evaluate FedRegNAS on federated equity forecasting tasks covering daily and intraday horizons. Experiments assess accuracy, trading relevance, communication efficiency, privacy, and the contribution of each algorithmic component. All notation follows Section 3 and Section 4.

6.1. Setup

6.1.1. Problem Setting

We evaluate FedRegNAS on three federated equity forecasting benchmarks that reflect distinct temporal granularities and cross-market heterogeneity. Each client corresponds to an institution-like shard composed of disjoint ticker subsets and time ranges. Clients never share raw data; only sanitized updates (weights and, intermittently, architectures) are transmitted. Targets are next-period log-returns $y_t$, and inputs comprise price- and volume-derived features, technical indicators, and normalized microstructure signals. All preprocessing is performed locally to preclude information leakage. Datasets and client partitions are given in Table 1. Data availability is stated at the end of this paper.

6.1.2. Data Construction and Preprocessing

For each dataset, we apply the following: (i) session alignment with exchange calendars, (ii) removal of auctions, halts, and limit-up/down prints, (iii) forward-fill of sparse features with mask channels, (iv) log-differencing and z-score normalization per feature using rolling statistics computed strictly within the training horizon, and (v) leakage guards by disallowing same-day lookahead across corporate actions and delayed prints. Missing data are imputed locally with confidence masks; the masks are part of the model inputs. Feature windows of length L are formed with stride 1.

6.1.3. Evaluation Protocol

We adopt a chronological split: train (2012–2024 daily; 2019–2024 intraday), validation (final 10% of the train horizon), and test (calendar year 2025). Metrics are root mean squared error (RMSE), directional accuracy (DA), and annualized Sharpe ratio (SR) for a thresholded long-short policy $\pi_\vartheta$ with symmetric threshold $\kappa$ selected on validation. We report averages across clients with equal weights unless otherwise stated. Aggregating the per-dataset improvements discussed in the following sections (detailed results appear in the tables below), FedRegNAS achieves gains in directional accuracy of 3–7 percentage points and Sharpe-ratio improvements of 18–32% over the best federated baselines.
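For reproducibility of the evaluation metrics, the sketch below shows one plausible NumPy implementation of directional accuracy, the symmetric threshold policy, and the annualized Sharpe ratio; the 252-period annualization and zero risk-free rate are assumptions appropriate for daily data.

```python
import numpy as np

def directional_accuracy(y, y_hat):
    """Fraction of periods where the predicted sign matches the realized sign."""
    return np.mean(np.sign(y) == np.sign(y_hat))

def threshold_policy(y_hat, kappa):
    """Symmetric long/short/flat policy: +1 if y_hat > kappa, -1 if y_hat < -kappa, else 0."""
    return np.where(y_hat > kappa, 1.0, np.where(y_hat < -kappa, -1.0, 0.0))

def annualized_sharpe(strategy_returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period strategy returns (zero risk-free rate assumed)."""
    mu, sd = strategy_returns.mean(), strategy_returns.std(ddof=1)
    return np.sqrt(periods_per_year) * mu / (sd + 1e-12)
```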

6.1.4. Federated Learning Setting

Each round samples a subset $S_k$ of clients with rate $q = |S_k|/N = 0.3$. Local optimization executes E steps per client on minibatches of size 128. Architectural messages are uploaded every $H_\alpha$ rounds; weight messages are uploaded every round, clipped, noised, and compressed.

6.1.5. Regime Modeling and Search Space

The regime encoder $g_\psi(\cdot)$ outputs $z_t \in \Delta^{R}$ with R = 3 regimes by default. The supernet includes per-decision candidate operators: {1D depthwise separable conv, gated temporal conv, GRU cell, lightweight self-attention, identity}. Mixed operations are gated by $\alpha$ (Gumbel–Softmax) and modulated by $z_t$ through per-regime affine adapters. The details of the optimization, privacy, and communication settings are presented in Table 2, while Figure 2 illustrates the chronological data splits for each benchmark with non-overlapping labels.

6.2. Main Results

6.2.1. Overview

This subsection reports forecasting and trading performance across the three benchmarks, together with statistical confidence and method-wise improvements. All results are computed on strictly held-out 2025 test sets with model selection on validation. We report averages across clients with equal weights; for dispersion we compute client-bootstrapped 95 % confidence intervals (CIs) using 1000 resamples. To ensure comparability, thresholds for the trading policy π ϑ are selected on validation per method and then frozen on test.

6.2.2. Accuracy and Trading Quality

Table 3 summarizes RMSE, DA, and SR. FedRegNAS attains the lowest RMSE and highest DA and Sharpe on all three datasets. Relative to the strongest federated baseline (FedProx-Trans), RMSE declines by 6–11%, DA increases by 2.2–3.1 points, and SR improves by 12–28%. The centralized upper bound (Centralized-DARTS) remains competitive but is consistently matched or exceeded by FedRegNAS in SR despite lacking privacy and communication constraints.

6.2.3. Statistical Significance

We conduct paired tests per client: (i) a one-sided Diebold–Mariano test for squared-error loss (RMSE proxy), and (ii) a stratified permutation test for DA and SR with 10,000 label-preserving shuffles within asset-day blocks. Table 4 reports improvement deltas and the fraction of clients with p < 0.05. Improvements are broadly significant, particularly on Minute-US, where regime gating captures intraday nonstationarity. The forecasting accuracy (RMSE) and relative improvement are presented in Table 5.

6.2.4. Robustness Across Universes

Table 3 visualizes aggregate performance. To avoid legend clutter, color/shade semantics are encoded in the captions. Across datasets, the relative ordering of methods is stable; dispersion (CI whiskers) is narrowest for FedRegNAS, indicating reduced cross-client variance attributable to trust-weighted aggregation and KL-barycentric architecture pooling.
In general, we find that (1) regime gating and KL-barycentric aggregation yield consistent gains across horizons and geographies; (2) improvements materialize not only in RMSE but also in trading metrics, indicating better calibration around decision thresholds; and (3) reduced dispersion across clients suggests enhanced robustness to non-IID data induced by institutional partitioning.

6.3. Ablation and Component Analysis

In this subsection we present ablations and diagnostic experiments addressing the impact of (i) decoupled DP noise scales ( σ w , σ α ) , (ii) the regime cardinality R, (iii) KL-barycentric aggregation versus simple averaging under non-IID clients, (iv) the calibration of latency and communication proxies, (v) volatility-stratified trading performance for economic interpretability, and (vi) the marginal contributions of core algorithmic components (regime gating, KL aggregation, trust weighting, and decoupled DP). Unless otherwise specified, ablations are conducted on the Daily-US benchmark, which exhibits pronounced market-regime variability and moderate client heterogeneity.

6.3.1. Sensitivity to Decoupled DP Noise Scales

We first study the sensitivity of FedRegNAS to the decoupled DP noise scales $(\sigma_w, \sigma_\alpha)$ on Daily-US. We sweep $\sigma_w \in \{0.2, 0.4, 0.6, 0.8\}$ and $\sigma_\alpha \in \{0.2, 0.4, 0.6, 0.8\}$ while keeping the target user-level privacy budget fixed by adjusting the number of composition steps accordingly. For each configuration we report RMSE, DA, Sharpe ratio, and a leakage proxy defined as the average norm of architecture gradients before clipping. Figure 3 shows RMSE as a function of $\sigma_\alpha$ for different $\sigma_w$; performance is comparatively stable when varying $\sigma_w$ for fixed $\sigma_\alpha$, whereas increasing $\sigma_\alpha$ beyond 0.4 leads to noticeable degradation in RMSE. As shown in Figure A1 (Appendix B), when $\sigma_\alpha$ is fixed at 0.4, the effect of increasing $\sigma_w$ on RMSE and the leakage proxy is minimal, validating that model performance is insensitive to weight noise. The leakage proxy (not shown for brevity) is also substantially more sensitive to $\sigma_\alpha$ than to $\sigma_w$, supporting our claim that architecture-related gradients are the dominant leakage channel under our setting. As shown in Figure A2 (Appendix B), when $\sigma_w$ is fixed at 0.4, increasing $\sigma_\alpha$ leads to a significant increase in RMSE and a sharp rise in the leakage proxy, confirming that the architecture gradient is the main leakage channel.

6.3.2. Effect of Regime Cardinality R

We next investigate how the number of latent regimes R in the encoder $g_\psi$ affects performance. We train FedRegNAS with $R \in \{2, 3, 4, 6, 8, 10\}$ on Daily-US and Minute-US under the same training budget and DP configuration. Table 6 summarizes RMSE, DA, and Sharpe ratio, averaged over three runs (in Table 6, RMSE is computed on the de-normalized prediction target, i.e., the original log-return or percentage scale, and is therefore numerically larger than the standardized RMSEs reported in Table 3 and related results). Performance is relatively poor for very small R (e.g., 2) and slightly unstable for very large R (e.g., 10), while the range $R \in \{4, 6\}$ yields the best trade-off across both datasets. This indicates that the regime encoder is reasonably robust to the precise choice of R and transfers across markets without extensive retuning.

6.3.3. KL-Barycentric Aggregation vs. Simple Averaging

To understand the empirical effect of KL-barycentric aggregation, we compare FedRegNAS with a variant that uses simple averaging of architecture logits across clients while keeping all other components unchanged. We evaluate both variants on deliberately non-IID splits of Daily-US, Minute-US, and Daily-Global, where client distributions differ strongly in volatility and sector composition. Table 7 reports RMSE, DA, and the average Jensen–Shannon (JS) divergence between client-wise architecture distributions and the aggregated distribution at convergence. KL-barycentric aggregation consistently improves RMSE and DA and substantially reduces the dispersion of architecture distributions across all three datasets, indicating more coherent architectures under heterogeneous clients.

6.3.4. Calibration of Latency and Communication Proxies

We calibrate our differentiable latency and communication proxies against wall-clock measurements on a heterogeneous cluster containing both GPU and CPU clients and different network conditions. For a set of architectures sampled during search, we record proxy values and actual inference latency and transmitted bytes per round under three representative profiles: (i) GPU + high-bandwidth LAN, (ii) CPU + LAN, and (iii) CPU + constrained wireless (4G). Table 8 reports Pearson correlation and mean absolute percentage error (MAPE) between proxies and measured quantities. Correlations exceed 0.9 for all profiles, with MAPE below 10 % , indicating that the proxies are well aligned with actual hardware behavior and suitable as surrogates in the multi-objective search.

6.3.5. Volatility-Stratified Trading Performance

Finally, to enhance economic interpretability, we stratify test periods into high- and low-volatility regimes based on a rolling realized-volatility threshold (top and bottom terciles) and compare FedRegNAS with two baselines (Local-GRU and FedAvg-LSTM) on Daily-US. Table 9 reports DA and Sharpe ratio for both volatility regimes. FedRegNAS improves DA and Sharpe in both regimes, with particularly pronounced gains during high-volatility periods where the baselines suffer larger drawdowns. This suggests that the regime-gated architecture and federated NAS design not only improve aggregate metrics but also yield more robust trading behavior under extreme market conditions.

6.3.6. Component-Level Ablation on Daily-US

We now evaluate the marginal contributions of the primary algorithmic components within FedRegNAS—namely regime gating, KL-barycentric aggregation, trust weighting, and architecture/weight differential-privacy decoupling (ADDP). Each ablation removes or substitutes a specific element while keeping the same search space, supernet initialization, privacy parameters, and communication budgets. We define five configurations: (i) full model (FedRegNAS complete), (ii) w/o regime gating (replacing regime encoder g ψ with a constant gate), (iii) mean aggregation (Euclidean average of architecture parameters in place of the KL barycenter), (iv) coupled-DP (single shared noise budget for both channels), and (v) no trust weighting (uniform aggregation weights). All models use K = 300 communication rounds, E = 5 local epochs, and H α = 5 , with the same noise multipliers ( σ w , σ α ) = ( 0.75 , 1.2 ) . We evaluate not only predictive accuracy but also communication cost and convergence stability (variance of validation-loss trajectories).
The results in Table 10 indicate that removing any core mechanism leads to measurable degradation. The largest performance drops arise from disabling regime gating and trust weighting, consistent with their roles in addressing nonstationarity and heterogeneity. Using Euclidean aggregation instead of KL barycenters modestly increases RMSE and destabilizes architecture probabilities (higher entropy drift). Coupling DP noise budgets reduces effective utility by over-noising the weight channel. Interestingly, all ablations retain communication efficiency, confirming that improvements stem from algorithmic, not infrastructural, changes.
To better understand component effects, Figure 4 and Figure 5 present convergence and weight-entropy diagnostics. The former plots smoothed validation-loss trajectories across rounds; the latter displays the evolution of trust-weight entropy (a proxy for aggregation concentration). FedRegNAS demonstrates faster and smoother convergence, with significantly lower entropy, reflecting confident and consistent weighting of high-quality clients. Regime gating yields reduced oscillations post-round 150, suggesting effective adaptation to regime transitions, while KL-barycentric aggregation maintains stable architectural probabilities and avoids premature collapse to suboptimal operators.
Overall, the ablation outcomes reinforce the necessity of coupling regime-awareness, information-geometric aggregation, trust-aware weighting, and decoupled differential privacy within a unified optimization pipeline. Removing any mechanism impairs either convergence stability, forecast accuracy, or variance control. Regime gating delivers the most substantial gains, improving both RMSE and DA by capturing temporal heterogeneity. KL-barycentric aggregation contributes to architecture stability and inter-client consistency, while trust weighting accelerates convergence under DP noise. Collectively, these elements ensure FedRegNAS’s superior robustness and efficiency under federated nonstationary conditions.

6.4. Efficiency and Privacy

6.4.1. Objectives and Methodology

This subsection quantifies the communication footprint, on-device latency, and user-level privacy of FedRegNAS relative to strong baselines. We report the following: (i) mean client uplink size per round after clipping, quantization, and sparsification; (ii) median inference latency on a mobile-class CPU for a single forward pass; and (iii) differential-privacy budgets accumulated via Rényi accounting under Poisson subsampling. To isolate design contributions, we additionally decompose the uplink into constituent payloads and vary the architecture upload cadence H α .

6.4.2. Key Observations

First, the search-to-train curriculum coupled with lazy architecture synchronization (uploads every $H_\alpha$ rounds) reduces uplink by 43% versus FedAvg-LSTM and 70% versus FedProx-Trans on Daily-US, while also lowering on-device latency through regime-gated sparsity. Second, ADDP (decoupled privacy) concentrates noise where it is most sensitive (the architecture channel), enabling lower weight noise without violating the global budget. Third, increasing $H_\alpha$ tightens the architecture-channel privacy $\varepsilon_\alpha$ nearly linearly in the fraction of active rounds $K_\alpha / K$, with minimal effect on accuracy when $H_\alpha \in [4, 8]$.
The communication and latency across datasets are presented in Table 11, while the uplink payload decomposition for Daily-US is shown in Table 12. Additionally, Figure 6 presents the mean client uplink per round on Daily-US. Figure 7 illustrates the privacy budget trajectories, where the solid curve represents ϵ w (weights) under ADDP, the dashed curve corresponds to ϵ α (architecture) under ADDP with uploads every H α = 5 until Phase II at round 300, and the dotted/dash-dotted curves show the coupled-DP baseline, where both channels accrue the same budget more rapidly. No legend is included in the figure, as the curve semantics are specified here.
In general, we find that (1) communication savings stem from reduced architecture traffic and compressed weight updates; the latter contribute 76 % of the uplink, while architecture packets account for 19 % under H α = 5 ; (2) the architecture-channel privacy improves nearly linearly with H α , with negligible loss in RMSE up to H α = 8 ; and (3) latency reductions reflect both the smaller selected subgraph post-selection (Phase II) and regime-conditioned adapters that avoid redundant computation during inference.

6.5. Practical Evaluation and Benchmarks

This subsection complements the main experiments by (i) comparing FedRegNAS against strong centralized sequence models, (ii) extending the ablation study to high-frequency data, (iii) quantifying the computational overhead of the NAS search phase and its amortization, and (iv) presenting a simulated institutional case study that incorporates constraints common in brokerage/asset-management settings.

6.5.1. Centralized Sequence-Model Baselines

To contextualize FedRegNAS from a pure forecasting perspective, we benchmark several strong sequence models that are widely used in financial time-series forecasting when centralizing all data is allowed. Specifically, we implement centralized versions of Temporal Fusion Transformers (TFTs), N-BEATS, and DeepAR, and compare them with a centralized variant of our search-discovered FedRegNAS architecture (FedRegNAS-C) on Daily-US and Minute-US. All models are trained on pooled data without any FL or DP constraints, using the same train/validation/test splits and evaluation metrics (RMSE, DA, and Sharpe ratio). Table 13 reports the results. As expected, the centralized baselines define strong upper bounds, particularly on RMSE; however, FedRegNAS-C remains competitive with TFT and N-BEATS across both datasets, while the federated FedRegNAS model trails only slightly despite operating under user-level DP and client-side data locality. These comparisons suggest that FedRegNAS achieves forecasting quality close to state-of-the-art centralized models, while additionally offering the benefits of FL and DP.

6.5.2. Regime Gating and KL Barycenter on High-Frequency Data

The ablation in Table 6 focuses on Daily-US. To test the robustness of the architectural design under high-frequency and nonstationary conditions, we replicate a subset of the ablations on Minute-US, specifically isolating the contributions of regime gating and KL-barycentric aggregation. Table 14 compares four variants: (i) a baseline without regime gating or KL aggregation (standard FedAvg over architectures), (ii) only regime gating, (iii) only KL aggregation, and (iv) the full FedRegNAS. Regime gating and KL-barycentric aggregation each improve performance when added individually, and combining them yields the best RMSE and DA, confirming that the qualitative conclusions from Daily-US carry over to the high-frequency Minute-US setting.

6.5.3. Search-Phase Overhead and Amortization

Because FedRegNAS includes an NAS search phase, decision-makers may be concerned about additional computational overhead relative to directly training a fixed architecture such as FedAvg-LSTM or FedProx-Trans. Table 15 reports total training time and approximate GPU-hours for (i) FedAvg-LSTM, (ii) FedProx-Trans, (iii) FedRegNAS including the full search-to-train curriculum (search + selection + personalization), and (iv) FedRegNAS when reusing a previously discovered architecture for a new market (fine-tuning only). Results are measured on the same cluster with 4 × V100 GPUs and 16 CPU cores. While the initial FedRegNAS run is roughly 2–3× more expensive than directly training a fixed architecture, reusing the discovered architecture reduces the cost to a level comparable with FedProx-Trans. Over multiple deployments (e.g., across three markets or rolling retrain windows), the amortized per-deployment cost of FedRegNAS approaches that of fixed-architecture methods, while providing better forecasting and stronger FL + DP guarantees.

6.5.4. Simulated Institutional Case Study

We construct a simulated brokerage-style scenario to bridge the gap between our academic partitions and real institutional constraints. In this setting, each client corresponds to a large asset manager trading a subset of U.S. and global equities, subject to (i) regulatory trading windows (orders can only be placed during specified intraday intervals), (ii) latency budgets for model inference (e.g., ≤10 ms per prediction for intraday signals), and (iii) trading restrictions such as daily turnover caps and maximum position sizes. These constraints are enforced when generating trades from model predictions and when measuring latency on a heterogeneous cluster with mixed GPU/CPU clients.
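As an illustration of how such constraints can be enforced when mapping predictions to orders, the sketch below clips per-name positions and rescales trades to respect a daily turnover cap. The thresholds and the uniform-rescaling rule are placeholder choices for exposition, not the exact mechanics of our simulator.

```python
import numpy as np

def enforce_trading_constraints(target_weights: np.ndarray,
                                prev_weights: np.ndarray,
                                max_position: float = 0.05,
                                daily_turnover_cap: float = 0.12) -> np.ndarray:
    """Project raw model-implied portfolio weights onto a simple constraint set.

    1. Clip each position to +/- max_position (fraction of NAV).
    2. If one-way turnover ||w_new - w_prev||_1 exceeds the daily cap,
       scale all trades down uniformly so the cap binds exactly.
    The numeric thresholds here are placeholders, not the study's settings.
    """
    w = np.clip(target_weights, -max_position, max_position)
    trades = w - prev_weights
    turnover = np.abs(trades).sum()
    if turnover > daily_turnover_cap:
        trades *= daily_turnover_cap / turnover
    return prev_weights + trades

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    prev = np.zeros(10)
    raw = 0.2 * rng.standard_normal(10)      # aggressive raw signals
    final = enforce_trading_constraints(raw, prev)
    print(np.abs(final - prev).sum())        # <= 0.12 by construction
```

The "constraint violations before clipping" reported below count how often the raw signals would breach these limits if no such projection were applied.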
Table 16 compares Local-GRU, FedAvg-LSTM, FedProx-Trans, and FedRegNAS under this institutional setup on Minute-US, reporting DA, Sharpe ratio, average daily turnover, and the fraction of trades that would violate turnover/position limits before clipping. FedRegNAS achieves the best DA and Sharpe while maintaining comparable turnover and the lowest violation rate, indicating that the discovered architectures remain effective when realistic trading constraints are imposed.
To assess whether the models meet practical latency requirements, we summarize end-to-end inference latency (including communication and decoding) across clients in Table 17. All methods satisfy the 10 ms budget on average; FedRegNAS remains competitive with FedAvg-LSTM and improves on FedProx-Trans due to the latency-aware objective used during NAS. Together, these results support our conclusion that the relative advantages of FedRegNAS in forecasting and communication persist under a more institutionally realistic regime, while we still caution that unmodeled microstructure effects (e.g., exchange-specific routing latencies) may affect absolute performance in production.

7. Conclusions

We presented FedRegNAS, a regime-aware federated neural architecture search framework for privacy-preserving stock-return forecasting. The framework unifies a regime-gated temporal supernet, mirror-descent search on the simplex, staleness- and trust-aware weight aggregation, KL-barycentric fusion of client architectures, and decoupled differential privacy for weights and architectures (ADDP) under explicit communication and latency proxies. Across daily and intraday federated equity benchmarks, FedRegNAS consistently reduced forecasting error and improved directional accuracy and trading Sharpe ratio, while cutting communication by large margins relative to strong FL baselines and approaching centralized NAS quality without centralizing data. Ablations confirmed the complementary value of regime gating, barycentric aggregation, and DP decoupling, and the search-to-train curriculum yielded deployable, efficient architectures with on-device latency advantages. Limitations include the reliance on a fixed regime cardinality, approximate DP composition, and surrogate resource proxies. Future work will investigate online regime discovery, cross-asset transfer and multi-horizon probabilistic targets, tighter privacy accounting and secure aggregation, and theoretical guarantees for barycentric architecture averaging under subsampled, differentially private updates.
We will release reproducibility materials, including synthetic-data scripts and configuration files, to facilitate adoption and benchmarking. Looking ahead, our framework opens several directions for future work. First, FedRegNAS could be extended to additional financial markets and horizons, including cross-asset and multi-horizon settings, to further test its robustness across regimes and liquidity conditions. Second, richer personalization mechanisms—for example combining regime-aware NAS with hypernetwork-based or representation-level personalization—may yield further gains under extreme client heterogeneity. Finally, applying the core ideas of regime gating, probabilistic architecture aggregation, and decoupled DP to other time-series domains (such as macroeconomic indicators, energy load forecasting, or limit-order-book data) is an interesting avenue for expanding the scope of federated NAS beyond equity returns.

Author Contributions

Conceptualization, Z.C. and H.Z.; methodology, Z.C.; software, Z.C.; validation, Z.C., H.Z. and S.W.; formal analysis, S.W.; investigation, H.Z.; resources, J.C.; data curation, S.W.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C. and H.Z.; visualization, S.W.; supervision, H.Z.; project administration, J.C.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Details of data availability are provided in Appendix A.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Data Availability

The datasets used in this study can be reproduced from publicly accessible sources. Daily U.S. and global equity bars are available from Stooq’s free market-data archive at https://stooq.com/db/ (see, e.g., the S&P 500 example page at https://stooq.com/q/d/?s=%5Espx) (accessed on 16 June 2025). Minute-level U.S. bars can be obtained via Polygon’s Stocks Aggregates (Bars) API at https://polygon.io/docs/rest/stocks/aggregates/custom-bars (API key required) (accessed on 14 June 2025) or from Kibot’s historical intraday data at https://www.kibot.com/, including free sample files at https://www.kibot.com/free_historical_data.aspx (accessed on 20 July 2025). For researchers with institutional access, the CRSP U.S. Stock database is available through WRDS at https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/center-for-research-in-security-prices-crsp/ (accessed on 20 July 2025).

Appendix B. Sensitivity to DP Noise Scales

We provide full results for the (σ_w, σ_α) sensitivity study in Figure A1 and Figure A2, showing performance and leakage proxies as a function of the two noise scales.
Figure A1 and Figure A2 reveal a clear asymmetry between the impact of the weight-level noise scale σ_w and the architecture-level noise scale σ_α. When σ_w increases with σ_α fixed (Figure A1), both RMSE and the leakage proxy grow only mildly, suggesting that moderate additional noise on model weights can be absorbed by the optimization dynamics without severely harming forecasting performance or substantially changing the gradient statistics. In contrast, when σ_α varies at fixed σ_w (Figure A2), we observe a pronounced degradation in RMSE and a steep rise in the leakage proxy, indicating that the architecture gradients are far more sensitive to DP perturbations and dominate the overall leakage profile. These trends empirically justify our decoupled DP design, in which the architecture channel receives the tighter privacy budget (achieved largely through infrequent uploads every H_α rounds rather than per-round noise alone), while relatively larger noise can be applied to weight updates without significantly compromising performance.
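For concreteness, the sketch below shows the per-channel clip-and-noise step underlying this sensitivity study, using the channel-specific clip norms and noise multipliers of Table 2. It treats each upload as a single Gaussian-mechanism release and omits client subsampling and cross-round composition, which the full ADDP accounting handles; the array sizes are toy placeholders.

```python
import numpy as np

def privatize_update(delta: np.ndarray, clip_norm: float, sigma: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Clip an update to L2 norm clip_norm, then add Gaussian noise with std sigma*clip_norm."""
    norm = np.linalg.norm(delta)
    clipped = delta * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, sigma * clip_norm, size=delta.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    weight_delta = rng.normal(size=50_000)   # toy weight update
    arch_grad = rng.normal(size=2_000)       # toy architecture gradient
    # Channel-specific settings from Table 2:
    # (C_w, C_alpha) = (1.0, 0.2), (sigma_w, sigma_alpha) = (0.75, 1.20).
    noisy_w = privatize_update(weight_delta, clip_norm=1.0, sigma=0.75, rng=rng)
    noisy_a = privatize_update(arch_grad, clip_norm=0.2, sigma=1.20, rng=rng)
    print(np.linalg.norm(noisy_w), np.linalg.norm(noisy_a))
```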
Figure A1. Sensitivity of FedRegNAS to the weight-noise scale σ_w (with σ_α fixed to 0.4) on Daily-US. Both RMSE and the leakage proxy increase only mildly as σ_w grows, indicating that performance is relatively insensitive to the weight-level noise scale.
Figure A2. Sensitivity of FedRegNAS to the architecture-noise scale σ_α (with σ_w fixed to 0.4) on Daily-US. Increasing σ_α degrades RMSE and sharply increases the leakage proxy, supporting the claim that architecture gradients are the dominant leakage channel.

References

  1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Agüera y Arcas, B. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; Volume 54, pp. 1273–1282. [Google Scholar]
  2. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. arXiv 2019, arXiv:1912.04977. [Google Scholar]
  3. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  4. Pham, H.; Guan, M.; Zoph, B.; Le, Q.V.; Dean, J. ENAS: Efficient Neural Architecture Search via Parameter Sharing. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4095–4104. [Google Scholar]
  5. Xie, S.; Zheng, H.; Liu, C.; Lin, L. SNAS: Stochastic Neural Architecture Search. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  6. Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  7. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar] [CrossRef]
  8. Mironov, I. Rényi Differential Privacy. In Proceedings of the 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Santa Barbara, CA, USA, 21–25 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 263–275. [Google Scholar]
  9. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  10. Lim, B.; Arik, S.Ö.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  11. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Zohren, S.; Roberts, S. DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. IEEE Trans. Signal Process. 2019, 67, 3001–3012. [Google Scholar] [CrossRef]
  13. Hamilton, J.D. A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica 1989, 57, 357–384. [Google Scholar] [CrossRef]
  14. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive Mixtures of Local Experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef] [PubMed]
  15. Chen, N.; Li, B.; Wang, Y.; Ying, X.; Wang, L.; Zhang, C.; Guo, Y.; Li, M.; An, W. Motion and Appearance Decoupling Representation for Event Cameras. IEEE Trans. Image Process. 2025, 34, 5964–5977. [Google Scholar] [CrossRef] [PubMed]
  16. Feng, Z.-R.; Li, Y.-H.; Chen, W.-Z.; Su, X.-P.; Chen, J.-N.; Li, J.-P.; Liu, H.; Li, S.-B. Infrared and Visible Image Fusion Based on Improved Latent Low-Rank and Unsharp Masks. Spectrosc. Spectr. Anal. 2025, 45, 2034–2044. [Google Scholar]
  17. Tan, C.; Liu, H.; Chen, L.; Wang, J.; Chen, X.; Wang, G. Characteristic analysis and model predictive-improved active disturbance rejection control of direct-drive electro-hydrostatic actuators. Expert Syst. Appl. 2026, 301, 130565. [Google Scholar] [CrossRef]
  18. Genest, C.; Zidek, J.V. Combining Probability Distributions: A Critique and an Annotated Bibliography. Stat. Sci. 1986, 1, 113–135. [Google Scholar]
  19. Konečný, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar] [CrossRef]
  20. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
  21. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.U.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR, Virtual, 13–18 July 2020; Volume 119, pp. 5132–5143. [Google Scholar]
  22. Reddi, S.J.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive Federated Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  23. Xie, C.; Koyejo, O.; Gupta, I. Asynchronous Federated Optimization. arXiv 2019, arXiv:1903.03934. [Google Scholar]
  24. Blanchard, P.; El Mhamdi, E.M.; Guerraoui, R.; Stainer, J. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 119–129. [Google Scholar]
  25. Yin, D.; Chen, Y.; Ramchandran, K.; Bartlett, P.L. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 5636–5645. [Google Scholar]
  26. Cao, X.; Fang, M.; Liu, J.; Gong, N.Z. FLTrust: Byzantine-Robust Federated Learning via Trust Bootstrapping. In Proceedings of the Network and Distributed System Security Symposium (NDSS), Virtual, 21–25 February 2021; Internet Society: Reston, VA, USA, 2021. [Google Scholar]
  27. Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; Vojnovic, M. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 1709–1720. [Google Scholar]
  28. Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  29. Stich, S.U.; Cordonnier, J.B.; Jaggi, M. Sparsified SGD with Memory. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  30. Elsken, T.; Metzen, J.H.; Hutter, F. Neural Architecture Search: A Survey. J. Mach. Learn. Res. 2019, 20, 1–21. [Google Scholar]
  31. Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  32. Maddison, C.J.; Mnih, A.; Teh, Y.W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  33. Zhu, H.; Zhang, H.; Jin, Y. From federated learning to federated neural architecture search: A survey. Complex Intell. Syst. 2021, 7, 1311–1330. [Google Scholar] [CrossRef]
  34. He, C.; Annavaram, M.; Avestimehr, S. FedNAS: Federated deep learning via neural architecture search. arXiv 2020, arXiv:2004.08546. [Google Scholar]
  35. Singh, I.; Zhou, H.; Yang, K.; Ding, M.; Lin, B.; Xie, P. Differentially-private federated neural architecture search. arXiv 2020, arXiv:2006.10559. [Google Scholar] [CrossRef]
  36. Yao, D.; Li, B. PerFedRLNAS: One-for-all personalized federated neural architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 16398–16406. [Google Scholar]
  37. Zhang, C.; Shan, G.; Roh, B.H. Fair federated learning for multi-task 6G NWDAF network anomaly detection. IEEE Trans. Intell. Transp. Syst. 2024, 26, 17359–17370. [Google Scholar] [CrossRef]
  38. Yang, Z.; Shao, Z.; Huangfu, H.; Yu, H.; Teoh, A.B.J.; Li, X.; Shan, H.; Zhang, Y. Enhancing federated learning through exploring filter-aware relationships and personalizing local structures. Pattern Recognit. 2025, 171, 112281. [Google Scholar] [CrossRef]
  39. Yang, Z.; Chen, Y.; Wang, Z.; Shan, H.; Chen, Y.; Zhang, Y. Patient-Level Anatomy Meets Scanning-Level Physics: Personalized Federated Low-Dose CT Denoising Empowered by Large Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 5154–5163. [Google Scholar] [CrossRef]
Figure 1. Overview of FedRegNAS. Sensitive client-side financial time-series data (price features, returns, context) are processed locally, fed into a regime-aware differentiable architecture search under ADDP, aggregated via KL-barycentric federated NAS, and deployed as a privacy-preserving stock-return forecasting model.
Figure 2. Chronological data splits per benchmark with non-overlapping labels. Light gray: train; medium gray: validation (final 10% of training horizon); dark gray: test (calendar year 2025). Labels are positioned above (Daily-US, Minute-US) or below (Daily-Global) the bars to prevent overlap. No legend is drawn; the caption encodes color semantics.
Figure 3. Sensitivity of FedRegNAS to decoupled DP noise scales (σ_w, σ_α) on Daily-US. RMSE increases mainly as σ_α grows, while the effect of σ_w is comparatively mild for a fixed σ_α, indicating that architecture gradients are the dominant leakage–performance bottleneck.
Figure 4. Validation-loss convergence on Daily-US. Solid: FedRegNAS; dashed/dotted variants correspond to individual ablations (mapping described in the text). The full model converges faster and to a lower final loss, showing improved stability from regime gating, KL aggregation, and trust weighting.
Figure 5. Trust-weight entropy over rounds (lower is better). Solid: FedRegNAS; other curves correspond to ablations (mapping described in the text). The full model shows faster entropy decay, implying more decisive and stable aggregation.
Figure 6. Mean client uplink per round on Daily-US. Bar fills denote methods (light to dark): FedAvg-LSTM, FedProx-Trans, Centralized-DARTS, and FedRegNAS. No legend is drawn; mapping is provided here.
Figure 7. Privacy budget trajectories. Solid: ε_w (weights) under ADDP; dashed: ε_α (architecture) under ADDP with uploads every H_α = 5 rounds until Phase II at round 300; dotted/dash-dotted: coupled-DP baseline where both channels accrue the same budget more rapidly. No legend is drawn; curve semantics are specified here.
Table 1. Datasets and client partitions. Each dataset specifies the number of clients N, temporal resolution, train/test spans, input window L, feature dimension d, and median per-client statistics (tickers and train samples). These per-client counts are reported explicitly to facilitate cross-dataset comparison.

| Dataset | N | Resolution | Train Span | Test Span | L | d | Median Tickers/Client | Median Train Samples/Client |
|---|---|---|---|---|---|---|---|---|
| Daily-US | 20 | 1 day | 2012–2024 | 2025 | 64 | 32 | 25–30 | ∼80 k sequences |
| Minute-US | 25 | 1 min | 2019–2024 | 2025 | 120 | 40 | 8–10 | ∼2.5 M bars |
| Daily-Global | 18 | 1 day | 2012–2024 | 2025 | 64 | 32 | 20–25 | ∼75 k sequences |
Notes: Daily-US clients correspond to S&P 500-style rolling constituents; Minute-US clients are formed from the top 200 U.S. equities after liquidity filtering; Daily-Global clients consist of large-cap US/EU/APAC universes, with disjoint ticker/time partitions across clients.
Table 2. Optimization, privacy, and communication settings. Temperatures anneal τ: 1.5 → 0.2. Architecture uploads occur every H_α rounds. Quantization uses 8-bit stochastic rounding; sparsification keeps the top-k = 20% of entries post-clipping.

| Dataset | K | E | H_α | q | (C_w, C_α) | (σ_w, σ_α) | Compression |
|---|---|---|---|---|---|---|---|
| Daily-US | 300 | 5 | 5 | 0.3 | (1.0, 0.2) | (0.75, 1.20) | 8-bit + top-k |
| Minute-US | 600 | 5 | 5 | 0.3 | (1.0, 0.2) | (0.75, 1.20) | 8-bit + top-k |
| Daily-Global | 300 | 5 | 5 | 0.3 | (1.0, 0.2) | (0.75, 1.20) | 8-bit + top-k |

| Dataset | Regimes R | Resource Weights (λ_comm, λ_lat) | ε_w | ε_α |
|---|---|---|---|---|
| All | 3 | (1.0, 0.5) | ≈2.0 | ≈1.0 |
Table 3. Aggregate forecasting and trading metrics on test sets. RMSE (lower is better), DA in %, SR annualized. CIs are client-bootstrapped 95% intervals.

| Method | Daily-US RMSE | Daily-US DA | Daily-US SR | Minute-US RMSE | Minute-US DA | Minute-US SR | Daily-Global RMSE | Daily-Global DA | Daily-Global SR |
|---|---|---|---|---|---|---|---|---|---|
| Local-GRU | 0.0118 | 53.1 | 0.54 | 0.00128 | 52.7 | 0.96 | 0.0103 | 53.4 | 0.62 |
| (95% CI) | [0.0116, 0.0120] | [52.3, 53.8] | [0.49, 0.59] | [0.00126, 0.00130] | [52.1, 53.2] | [0.90, 1.01] | [0.0101, 0.0105] | [52.7, 54.1] | [0.58, 0.66] |
| FedAvg-LSTM | 0.0112 | 54.7 | 0.68 | 0.00124 | 53.6 | 1.10 | 0.0099 | 54.2 | 0.74 |
| (95% CI) | [0.0110, 0.0114] | [54.1, 55.3] | [0.63, 0.73] | [0.00122, 0.00126] | [53.0, 54.1] | [1.05, 1.15] | [0.0097, 0.0101] | [53.6, 54.8] | [0.70, 0.78] |
| FedProx-Trans | 0.0107 | 56.8 | 0.78 | 0.00116 | 55.3 | 1.28 | 0.0095 | 56.8 | 0.85 |
| (95% CI) | [0.0105, 0.0109] | [56.2, 57.4] | [0.73, 0.83] | [0.00114, 0.00118] | [54.7, 55.9] | [1.23, 1.32] | [0.0093, 0.0097] | [56.2, 57.3] | [0.81, 0.89] |
| Centralized-DARTS | 0.0106 | 57.1 | 0.81 | 0.00114 | 55.6 | 1.31 | 0.0094 | 57.2 | 0.88 |
| (95% CI) | [0.0104, 0.0108] | [56.5, 57.7] | [0.76, 0.85] | [0.00112, 0.00116] | [55.0, 56.1] | [1.26, 1.36] | [0.0092, 0.0096] | [56.6, 57.8] | [0.83, 0.92] |
| FedRegNAS (ours) | 0.0102 | 59.0 | 0.87 | 0.00110 | 56.9 | 1.41 | 0.0092 | 59.1 | 0.95 |
| (95% CI) | [0.0101, 0.0104] | [58.5, 59.5] | [0.83, 0.91] | [0.00108, 0.00112] | [56.4, 57.4] | [1.37, 1.46] | [0.0091, 0.0093] | [58.6, 59.6] | [0.91, 0.99] |
Table 4. Improvements of FedRegNAS over the strongest federated baseline (FedProx-Trans). Deltas are absolute differences; the rightmost columns report the fraction of clients with p < 0.05 under paired tests (RMSE: Diebold–Mariano on squared error; DA/SR: stratified permutation).

| Dataset | ΔRMSE | ΔDA (pp) | ΔSR | p < 0.05 (RMSE) | p < 0.05 (DA) | p < 0.05 (SR) |
|---|---|---|---|---|---|---|
| Daily-US | 0.0005 | +2.2 | +0.09 | 0.72 | 0.68 | 0.63 |
| Minute-US | 0.00006 | +1.6 | +0.13 | 0.78 | 0.74 | 0.70 |
| Daily-Global | 0.0003 | +2.3 | +0.10 | 0.69 | 0.66 | 0.61 |
Table 5. Forecasting accuracy (RMSE) and relative improvement. Lower RMSE indicates higher accuracy. Improvements are relative to the strongest federated baseline (FedProx-Trans).

| Method | Daily-US RMSE | Minute-US RMSE | Daily-Global RMSE | Daily-US Impr. (%) | Minute-US Impr. (%) | Daily-Global Impr. (%) |
|---|---|---|---|---|---|---|
| Local-GRU | 0.0118 | 0.00128 | 0.0103 | −10.3 | −10.3 | −8.4 |
| FedAvg-LSTM | 0.0112 | 0.00124 | 0.0099 | −4.7 | −6.9 | −4.2 |
| FedProx-Trans | 0.0107 | 0.00116 | 0.0095 | 0.0 | 0.0 | 0.0 |
| Centralized-DARTS | 0.0106 | 0.00114 | 0.0094 | +0.9 | +1.7 | +1.1 |
| FedRegNAS (ours) | 0.0102 | 0.00110 | 0.0092 | +4.7 | +5.2 | +3.2 |
Table 6. Effect of regime cardinality R on forecasting performance for FedRegNAS on Daily-US and Minute-US. Results are averaged over three runs; best values per column are in bold. R ∈ {4, 6} provides a robust trade-off across datasets.

| R | Daily-US RMSE | Daily-US DA (%) | Daily-US Sharpe | Minute-US RMSE | Minute-US DA (%) | Minute-US Sharpe |
|---|---|---|---|---|---|---|
| 2 | 0.134 | 56.1 | 0.62 | 0.149 | 54.3 | 0.55 |
| 3 | 0.132 | 56.9 | 0.67 | 0.146 | 55.2 | 0.59 |
| 4 | 0.129 | 57.8 | 0.74 | 0.142 | 56.0 | 0.66 |
| 6 | **0.128** | **58.0** | **0.76** | **0.141** | **56.3** | **0.68** |
| 8 | 0.131 | 57.2 | 0.71 | 0.144 | 55.8 | 0.64 |
| 10 | 0.133 | 56.8 | 0.69 | 0.146 | 55.5 | 0.61 |
Table 7. Comparison of KL-barycentric aggregation and simple averaging of architecture logits under non-IID and severely non-IID client partitions on three datasets. KL-based aggregation improves accuracy and reduces the dispersion of client architectures, consistent with more stable behavior under heterogeneity.

| Method | Daily-US RMSE | Daily-US DA (%) | Minute-US RMSE | Minute-US DA (%) | Daily-Global RMSE | Daily-Global DA (%) | Avg. JS-Div. (All Datasets) |
|---|---|---|---|---|---|---|---|
| Simple avg. (non-IID) | 0.132 | 56.3 | 0.145 | 55.1 | 0.138 | 55.8 | 0.21 |
| KL-barycentric (FedRegNAS) | 0.129 | 57.8 | 0.141 | 56.3 | 0.135 | 57.0 | 0.13 |
| Simple avg. (severe non-IID) | 0.134 | 55.7 | 0.147 | 54.5 | 0.140 | 55.1 | 0.24 |
| KL-barycentric (severe non-IID) | 0.130 | 57.1 | 0.142 | 55.8 | 0.136 | 56.2 | 0.15 |
Table 8. Calibration of differentiable latency and communication proxies against real measurements under different device and network profiles. High correlation and low MAPE support their use as faithful surrogates in the objective.

| Proxy/Profile | Pearson Correlation | MAPE (%) |
|---|---|---|
| Latency proxy vs. measured (GPU + LAN) | 0.95 | 6.1 |
| Latency proxy vs. measured (CPU + LAN) | 0.93 | 7.8 |
| Latency proxy vs. measured (CPU + 4G) | 0.91 | 9.3 |
| Comm. proxy vs. bytes (GPU + LAN) | 0.97 | 4.8 |
| Comm. proxy vs. bytes (CPU + LAN) | 0.96 | 5.4 |
| Comm. proxy vs. bytes (CPU + 4G) | 0.94 | 7.1 |
Table 9. Volatility-stratified trading performance on Daily-US. FedRegNAS yields higher directional accuracy and Sharpe ratio in both high- and low-volatility regimes, with especially strong improvements during turbulent periods relative to Local-GRU and FedAvg-LSTM.

| Method | High-Volatility DA (%) | High-Volatility Sharpe | Low-Volatility DA (%) | Low-Volatility Sharpe |
|---|---|---|---|---|
| Local-GRU | 53.4 | 0.52 | 55.9 | 0.80 |
| FedAvg-LSTM | 54.2 | 0.61 | 56.7 | 0.88 |
| FedRegNAS | 56.9 | 0.87 | 58.3 | 1.04 |
Table 10. Ablation study on Daily-US. Each variant removes or modifies one key mechanism while preserving all other hyperparameters. RMSE (lower is better), DA (%), Sharpe ratio (SR), mean client upload per round (KB), and variance of validation loss.

| Variant | RMSE | DA (%) | SR | Comm/Round (KB) | Var (loss) ×10⁻⁴ |
|---|---|---|---|---|---|
| FedRegNAS (full) | 0.0102 | 59.0 | 0.87 | 1605 | 1.4 |
| w/o regime gating | 0.0108 | 56.7 | 0.79 | 1605 | 2.8 |
| mean aggregation (no KL) | 0.0106 | 57.4 | 0.82 | 1605 | 2.3 |
| coupled-DP (single noise budget) | 0.0105 | 57.6 | 0.81 | 1605 | 2.5 |
| no trust weighting | 0.0107 | 57.0 | 0.80 | 1605 | 2.9 |

| Variant | ΔRMSE vs. Full | ΔDA (pp) | ΔSR | Trust-Weight Entropy |
|---|---|---|---|---|
| w/o regime gating | +0.0006 | −2.3 | −0.08 | 0.91 |
| mean aggregation | +0.0004 | −1.6 | −0.05 | 0.72 |
| coupled-DP | +0.0003 | −1.4 | −0.06 | 0.80 |
| no trust weighting | +0.0005 | −2.0 | −0.07 | 1.00 |
Table 11. Communication and latency across datasets. Comm/round is mean client uplink after clipping, 8-bit stochastic quantization, and top-k sparsification (k = 20%). Latency is the median on-device forward pass on a mobile-class CPU.

| Method | Daily-US Comm/Round (KB) | Daily-US Latency (ms) | Minute-US Comm/Round (KB) | Minute-US Latency (ms) | Daily-Global Comm/Round (KB) | Daily-Global Latency (ms) |
|---|---|---|---|---|---|---|
| Local-GRU | 0 | 3.9 | 0 | 5.8 | 0 | 3.8 |
| FedAvg-LSTM | 2820 | 4.1 | 3160 | 6.2 | 2750 | 4.0 |
| FedProx-Trans | 5410 | 6.8 | 6220 | 8.1 | 5090 | 6.5 |
| Centralized-DARTS | 0 | 5.7 | 0 | 7.1 | 0 | 5.6 |
| FedRegNAS | 1605 | 3.6 | 1810 | 5.2 | 1530 | 3.5 |

| Method | Daily-US Uplink Stdev | Daily-US Downlink (KB) | Daily-US CPU Util. (%) | Minute-US Uplink Stdev | Minute-US Downlink (KB) | Minute-US CPU Util. (%) | Daily-Global Uplink Stdev | Daily-Global Downlink (KB) | Daily-Global CPU Util. (%) |
|---|---|---|---|---|---|---|---|---|---|
| FedAvg-LSTM | 210 | 1180 | 64 | 240 | 1260 | 72 | 205 | 1150 | 63 |
| FedProx-Trans | 380 | 2040 | 78 | 410 | 2190 | 83 | 360 | 1980 | 77 |
| FedRegNAS | 145 | 910 | 57 | 160 | 980 | 66 | 140 | 885 | 56 |
Table 12. Uplink payload decomposition (Daily-US). Mean KB per round by component for FedRegNAS. Architectural messages are sent every H_α = 5 rounds; weight messages are sent every round. Percentages are relative to total mean uplink.

| Component | Δθ̂ (Post-Clip, Post-Noise) | Δα̂ (Active Rounds Only) | Metadata (IDs, Seeds) | Total |
|---|---|---|---|---|
| Size (KB) | 1220 | 310 | 75 | 1605 |
| Share (%) | 76.0 | 19.3 | 4.7 | 100 |

| Setting | H_α = 2 | H_α = 5 | H_α = 8 | H_α = 16 |
|---|---|---|---|---|
| Mean uplink (KB) | 1880 | 1605 | 1480 | 1430 |
| ε_α (at δ = 10⁻⁵) | 1.48 | 1.00 | 0.78 | 0.60 |
| RMSE (test) | 0.0102 | 0.0102 | 0.0103 | 0.0104 |
Table 13. Centralized sequence-model baselines versus FedRegNAS. Centralized models pool all data and do not respect FL/DP constraints, serving as upper bounds or side benchmarks. FedRegNAS-C denotes the centralized version of the architecture discovered by FedRegNAS, while “FedRegNAS (federated, DP)” is the full privacy-preserving federated method.

| Model | Daily-US RMSE | Daily-US DA (%) | Daily-US Sharpe | Minute-US RMSE | Minute-US DA (%) | Minute-US Sharpe |
|---|---|---|---|---|---|---|
| DeepAR (centralized) | 0.128 | 57.1 | 0.78 | 0.140 | 55.7 | 0.64 |
| N-BEATS (centralized) | 0.127 | 57.6 | 0.80 | 0.139 | 56.0 | 0.67 |
| TFT (centralized) | 0.126 | 58.2 | 0.83 | 0.138 | 56.5 | 0.70 |
| FedRegNAS-C (centralized) | 0.127 | 58.0 | 0.82 | 0.139 | 56.3 | 0.69 |
| FedRegNAS (federated, DP) | 0.129 | 57.8 | 0.76 | 0.141 | 56.3 | 0.68 |
Table 14. Ablation of regime gating and KL-barycentric aggregation on the Minute-US dataset. Both components contribute to improved performance, and their combination yields the best results, demonstrating stability of the design in a high-frequency, nonstationary scenario.

| Variant (Minute-US) | RMSE | DA (%) | Sharpe |
|---|---|---|---|
| No regime gating, no KL (FedAvg logits) | 0.146 | 55.0 | 0.60 |
| Regime gating only | 0.143 | 55.7 | 0.64 |
| KL aggregation only | 0.143 | 55.9 | 0.65 |
| Full FedRegNAS (gating + KL) | 0.141 | 56.3 | 0.68 |
Table 15. Training-time comparison and amortization for FedRegNAS and fixed-architecture FL baselines on Daily-US. The one-time NAS search is more expensive, but when amortized across multiple markets or retraining periods, the effective per-deployment cost approaches that of FedProx-Trans.

| Method | Time (h) | GPU-h | Amortized (3 Dep., GPU-h) |
|---|---|---|---|
| FedAvg-LSTM | 8.1 | 32 | 32 |
| FedProx-Trans | 10.4 | 41 | 41 |
| FedRegNAS (search + train) | 24.6 | 98 | ≈33 |
| FedRegNAS (fine-tune only) | 9.0 | 36 | ≈33 |
Table 16. Performance of different methods in the simulated institutional scenario on Minute-US. All metrics are computed after enforcing regulatory windows and trading constraints. FedRegNAS attains the highest directional accuracy and Sharpe ratio while keeping turnover and pre-clipping constraint violations at or below the levels of the FL baselines.

| Method | DA (%) | Sharpe | Avg. Turnover (%/day) | Constraint Violations (%) |
|---|---|---|---|---|
| Local-GRU | 54.1 | 0.58 | 12.4 | 3.9 |
| FedAvg-LSTM | 55.0 | 0.62 | 11.8 | 3.2 |
| FedProx-Trans | 55.6 | 0.66 | 11.5 | 2.8 |
| FedRegNAS | 56.4 | 0.73 | 11.2 | 2.1 |
Table 17. End-to-end inference latency per prediction under the simulated institutional setup, including communication and decoding, measured across clients on a heterogeneous GPU/CPU cluster. All methods satisfy the 10 ms mean-latency budget; FedRegNAS remains competitive with FedAvg-LSTM and improves on FedProx-Trans despite performing federated NAS.

| Method | Mean Latency (ms) | 95th Percentile Latency (ms) | Budget Satisfied (≤10 ms Mean) |
|---|---|---|---|
| Local-GRU | 7.6 | 9.8 | Yes |
| FedAvg-LSTM | 8.3 | 10.5 | Yes |
| FedProx-Trans | 9.4 | 11.9 | Yes |
| FedRegNAS | 8.7 | 10.8 | Yes |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
