FedQuAD: Fast-Converging Curvature-Aware Federated Learning for Credit Default Prediction from Private Accounting Data

Bai, Dingwen; WaEr, MuGa; Wu, Qichun

doi:10.3390/math14061012



by Dingwen Bai ¹, MuGa WaEr ¹,* and Qichun Wu ²,*

¹ Business School, Chengdu University of Technology, Chengdu 610059, China

² School of Economics, Beijing Institute of Technology, Beijing 100081, China

* Authors to whom correspondence should be addressed.

Mathematics 2026, 14(6), 1012; https://doi.org/10.3390/math14061012

Submission received: 30 January 2026 / Revised: 6 March 2026 / Accepted: 13 March 2026 / Published: 17 March 2026

(This article belongs to the Special Issue Applied Mathematics, Computing, and Machine Learning)


Abstract

Credit default prediction from firm-level accounting statements is central to risk management, yet the underlying financial data are highly sensitive and often siloed across banks, auditors, and platforms. Federated learning (FL) offers a practical route to collaborative modeling without centralizing raw records, but standard FL optimization can converge slowly under severe client heterogeneity, heavy-tailed accounting features, and label imbalance typical of default events. This paper proposes FedQuAD, a novel fast-converging FL algorithm that couples (i) quasi-Newton curvature aggregation on the server with a lightweight limited-memory update to accelerate global progress, (ii) a proximal variance-reduced local solver that stabilizes client drift under non-IID accounting distributions, and (iii) federated robust standardization of tabular financial ratios via secure aggregated quantile statistics to mitigate scale instability and outliers. FedQuAD is communication-efficient by design: It transmits compact gradient and curvature sketches and adapts local computation to each client’s stochasticity and drift. We provide convergence guarantees for strongly convex default-risk objectives (logistic and calibrated GLM losses) under bounded heterogeneity, and extend the analysis to nonconvex deep tabular models via expected stationarity bounds. Experiments on public credit-risk benchmarks with simulated cross-silo (institutional) partitions demonstrate that FedQuAD reaches target AUC and calibration error with substantially fewer communication rounds than representative baselines while maintaining privacy constraints compatible with secure aggregation and optional client-level differential privacy accounting.

Keywords:

federated learning; privacy-preserving machine learning; secure aggregation; tabular deep learning

MSC:

62R07

1. Introduction

Credit default prediction is a core capability for lenders, investors, and regulators. In practice, a large fraction of predictive signal comes from firm-level accounting statements and derived ratios, such as leverage, liquidity, profitability, and cash-flow measures. These variables are sensitive and often protected by contractual, legal, or competitive constraints, which makes cross-institution data pooling difficult. Federated learning (FL) enables collaborative model training without moving raw accounting records off-premise by keeping data local and exchanging only model updates [1,2]. Despite this promise, applying FL to default prediction raises three practical challenges.

First, the data distribution is inherently heterogeneous across institutions. Portfolio composition varies by industry, geography, and underwriting policy, and accounting features are typically heavy tailed with outliers. Such non-IID effects can cause client drift and slow global convergence, particularly when each institution performs multiple local steps per round [3,4,5]. Second, defaults are rare, leading to label imbalance and noisy gradients on many clients; this worsens optimization stability and harms probability calibration, which is critical for downstream risk decisions. Third, communication rounds are expensive in cross-silo deployments: Institutions have limited windows for secure computation and auditing, and latency dominates wall-clock time. Thus, for federated default prediction, fast convergence in terms of communication rounds is often more important than minimizing local computation.

Cross-silo financial learning faces a structural tension: Curvature-aware methods promise faster convergence and better calibration, yet second-order information is communication-heavy and potentially privacy-sensitive. At the same time, credit risk data are heterogeneous, heavy tailed, and severely imbalanced, making purely first-order federated optimization unstable. The central question we address is therefore the following: How can one incorporate curvature information in a privacy-compatible, communication-efficient manner that remains robust under rare-event financial heterogeneity?

This paper proposes FedQuAD, a fast-converging FL algorithm tailored to private accounting data. FedQuAD couples two ideas. On the client side, we solve a proximal local objective anchored at the server model and use a variance-reduced gradient estimator to reduce the instability caused by heterogeneous portfolios and imbalance. On the server side, we introduce a curvature-aware correction computed from low-dimensional Hessian sketches aggregated across clients. The correction acts as a quasi-Newton step in a random subspace and is combined with the standard aggregated model delta. Unlike full second-order FL, our design transmits only a compact curvature sketch and a projected gradient vector, making the added communication small while delivering substantial reductions in rounds-to-accuracy.

Beyond optimization, we incorporate a robust preprocessing component designed for accounting data. Specifically, FedQuAD supports federated robust standardization based on global medians and interquartile ranges computed from aggregated quantile summaries, improving conditioning under heavy-tailed ratios and reducing the impact of extreme values. Our training loop is compatible with secure aggregation and can optionally support client-level differential privacy via clipping and noise on aggregated messages.

FedQuAD builds on standard primitives (proximal stabilization and variance reduction), but its novelty is in coupling them with a per-round, shared sketch-space quasi-Newton correction computed from aggregated curvature sketches anchored at the same iteration. Unlike approaches that transmit or maintain high-dimensional second-order state data, FedQuAD aggregates only an m-vector projected gradient and an $m \times m$ curvature sketch, lifts the resulting preconditioned direction back to parameter space, and combines it with the usual aggregated model delta in a single server update. This design keeps the added communication at $O (m^{2})$, remains compatible with secure aggregation, and is tailored to heavy-tailed, non-IID accounting distributions.

Our contributions are as follows:

A novel curvature-aware federated algorithm. We introduce FedQuAD, which augments standard FL with a sketch-space quasi-Newton correction built from aggregated curvature sketches, accelerating convergence under non-IID accounting distributions.
A stabilized local solver for default prediction. FedQuAD combines a proximal objective with variance reduction to mitigate client drift and label-imbalance-induced gradient noise, improving both discrimination and calibration.
Robust federated normalization for heavy-tailed accounting features. We propose a quantile-based federated standardization mechanism that uses aggregated summaries to compute global medians and interquartile ranges without sharing raw data.
Empirical validation on default-prediction benchmarks. Experiments demonstrate that FedQuAD reduces the communication rounds needed to reach target AUC and improves calibration relative to representative baselines.

The remainder of the paper is organized as follows: Section 2 discusses related work. Section 3 defines the federated default prediction problem and key assumptions. Section 4 presents FedQuAD in detail. Section 5 reports experimental results. Section 6 concludes.

2. Related Work

Our work sits at the intersection of (i) federated optimization under statistical heterogeneity, (ii) curvature/second-order and variance-reduced methods for faster federated convergence, and (iii) credit-risk modeling and privacy-preserving collaboration across financial institutions.

2.1. Federated Optimization Under Heterogeneity

Federated learning (FL) enables collaborative training without centralizing raw data, typically via periodic client updates aggregated by a server [1,2,6]. The dominant baseline, FedAvg, is simple and effective but can degrade under non-IID data and partial participation [1]. A substantial line of work improves robustness to heterogeneity and client drift through regularization, control variates, and alternative aggregation or local update schemes. Representative examples include FedProx for proximal stabilization [3], SCAFFOLD for variance reduction via control variates [4], and FedNova for normalized aggregation under variable local computation [5]. Server- and client-side adaptive federated optimization has also been explored (e.g., FedOpt/FedAdam-style server optimizers) [7], alongside dynamic regularization approaches such as FedDyn [8]. More recent work continues to target practical cross-device constraints (communication and memory) while retaining adaptivity, e.g., FedAda² [9], and asynchronous adaptive variants such as FADAS [10]. These methods motivate our focus on stabilizing local steps and mitigating drift while improving convergence speed.

2.2. Curvature, Second-Order, and Variance Reduction in FL

While first-order methods dominate FL practice, slow convergence and sensitivity to heterogeneity have motivated curvature-aware and variance-reduced alternatives. Classical variance-reduction ideas have been adapted to the federated setting (e.g., control variates in SCAFFOLD [4]), and more specialized constructions study single-loop or client-variance-reduced mechanisms to accelerate convergence and improve communication efficiency [11,12,13]. Parallel to this, second-order and quasi-Newton federated methods aim to leverage curvature information without prohibitive communication costs. Early communication-efficient second-order directions include basis-sketching approaches [14], while newer scalable designs target large models and realistic system constraints, such as Fed-Sophia [15]. Very recent work develops quasi-Newton FL frameworks explicitly (e.g., distributed quasi-Newton FL with fairness considerations [16]) and integrates quasi-Newton updates with error-feedback mechanisms to cope with compression and noisy updates [17]. Privacy constraints can further interact with curvature and communication; for example, recent differentially private FL work studies second-order information under communication budgets [18]. Our method is aligned with this trajectory: we aim to use lightweight curvature information and stabilization to reduce the number of rounds needed in heterogeneous settings.

Our proposed FedQuAD differs from prior curvature-aware FL in that it (i) refreshes a shared random sketch subspace each round via a public seed, (ii) computes the quasi-Newton correction solely from aggregated sketch-space quantities anchored at the broadcast iterate, and (iii) explicitly combines this lifted correction with a stabilized proximal variance-reduced local delta. This isolates what is new in FedQuAD: the single-round coupling of stabilized local training with an explicit sketch-space quasi-Newton preconditioner under cross-silo communication and privacy constraints.

2.3. Privacy, Security, and Trust Mechanisms

In finance and credit, privacy is not only a performance concern but a deployment constraint. Secure aggregation protocols allow servers to learn only aggregated updates rather than individual client updates [19]. Differential privacy (DP) provides formal privacy guarantees and is often combined with FL to mitigate leakage from gradients/updates [20,21]. However, gradient/parameter updates can still leak sensitive information via reconstruction or inference attacks, motivating stronger defenses and careful system design [22,23,24]. Recent applied systems in credit scoring combine FL with auxiliary trust infrastructures (e.g., blockchain) and emphasize explainability and traceability [25,26]. These considerations shape our evaluation and design choices, since credit modeling typically requires both strong privacy posture and operational transparency.

2.4. Credit-Risk Modeling and Federated Credit Scoring

Credit default prediction and credit scoring have a long history in statistical modeling and machine learning, with persistent concerns about class imbalance, temporal drift, and dataset shift. Large-scale empirical comparisons and benchmarks have highlighted the competitiveness of tree ensembles and carefully tuned linear models in many credit settings [27,28,29]. Modern gradient-boosted decision trees such as XGBoost and LightGBM have become standard baselines for tabular credit data due to strong accuracy and practical training/inference characteristics [30,31]. Deep tabular models (e.g., attention-based architectures) offer another path when feature interactions are complex [32,33]. When data are siloed across institutions, FL provides a natural mechanism to improve generalization without data pooling [1,2]. Recent work specifically targets privacy-preserving multi-party credit scoring, including knowledge-distillation-based federated approaches [34], explainable FL credit scoring with blockchain support [25,26], and broader surveys discussing FL’s emerging role in large/foundation-model settings that may eventually impact financial modeling workflows [35,36,37,38].

In summary, traditional credit-risk models, including logistic scorecards and structural risk frameworks, emphasize interpretability, calibration, and economic coherence. Our work does not alter the statistical foundation of default modeling; rather, it advances the optimization and privacy layer that enables such models to be trained collaboratively without sharing raw accounting data, a perspective that aligns with classical discussions of credit risk modeling foundations. It contributes to this growing body of work by focusing on optimization and stability improvements tailored to the credit default prediction setting, where heterogeneity across institutions and strict privacy constraints are central.

3. Preliminaries

3.1. Problem Setting and Federated Objective

We consider a cross-silo federated learning (FL) system with K institutions (clients) indexed by $k \in \{1, \dots, K\}$. Client k holds a private dataset

D_{k} = \{(x_{k, i}, y_{k, i})\}_{i = 1}^{n_{k}},

where $x_{k, i} \in R^{d}$ denotes a d-dimensional accounting feature vector and $y_{k, i} \in \{0, 1\}$ indicates whether a default event occurs within a fixed horizon. Let $N = \sum_{k = 1}^{K} n_{k}$ and $p_{k} = n_{k} / N$. A model with parameters $w \in R^{d}$ (or $w \in R^{P}$ for deep tabular models) outputs a default probability ${\hat{p}}_{w} (x) \in (0, 1)$. We use the logistic link as a canonical choice,

{\hat{p}}_{w} (x) = σ (f_{w} (x)), σ (z) = \frac{1}{1 + e^{- z}},

where $f_{w} (x)$ is either linear ($f_{w} (x) = w^{⊤} x$) or a nonlinear network for tabular data. The (optionally class-weighted) logistic loss is

ℓ (w; x, y) = - α y log {\hat{p}}_{w} (x) - (1 - α) (1 - y) log (1 - {\hat{p}}_{w} (x)),

(1)

where $α \in (0, 1)$ controls imbalance. Equation (1) uses standard class weighting (cost-sensitive logistic loss): Positives are weighted by $α$ and negatives by $(1 - α)$. In our experiments, $α$ is used to address class imbalance (e.g., a balanced choice is $α = 1 - π$, where $π$ is the positive-class prevalence on a client or on the union of participating clients) and is tuned on a federated validation split for calibration. We do not claim this is an explicit prior correction unless $α$ is deliberately chosen to reweight a source prior toward a target prior. With $ℓ_{2}$ regularization coefficient $λ \geq 0$, client k defines

F_{k} (w) = \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} ℓ (w; x_{k, i}, y_{k, i}) + \frac{λ}{2} {∥ w ∥}_{2}^{2},

(2)

and the global objective is

F (w) = \sum_{k = 1}^{K} p_{k} F_{k} (w) .

(3)

We denote $\nabla F_{k} (w)$ and $\nabla^{2} F_{k} (w)$ as the gradient and Hessian (when defined), and use $\nabla ℓ (w; ξ)$ for a stochastic gradient on $ξ = (x, y)$.
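As an illustration of (1) and (2), the following minimal NumPy sketch evaluates the class-weighted client objective and its gradient for the linear case $f_{w} (x) = w^{⊤} x$. The function names and default values are hypothetical choices made for this example and are not part of the method specification.

```python
import numpy as np

def client_objective(w, X, y, alpha=0.5, lam=0.0):
    """F_k(w) from Eq. (2) with the weighted loss of Eq. (1), linear model."""
    z = X @ w
    # log p = -log(1 + e^{-z}) and log(1 - p) = -log(1 + e^{z}), computed stably.
    log_p = -np.logaddexp(0.0, -z)
    log_1mp = -np.logaddexp(0.0, z)
    loss = -alpha * y * log_p - (1.0 - alpha) * (1.0 - y) * log_1mp
    return loss.mean() + 0.5 * lam * (w @ w)

def client_gradient(w, X, y, alpha=0.5, lam=0.0):
    """Gradient of F_k(w) for the same linear model."""
    p = 0.5 * (1.0 + np.tanh(0.5 * (X @ w)))  # stable sigmoid
    # Derivative of the per-sample loss with respect to the logit z.
    dz = -alpha * y * (1.0 - p) + (1.0 - alpha) * (1.0 - y) * p
    return X.T @ dz / len(y) + lam * w
```

With $α = 0.5$ the loss is half the usual log-loss, so at $w = 0$ the unregularized objective equals $0.5 \log 2$ regardless of the labels.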

3.2. Federated Protocol, Regularity, and Heterogeneity

FL proceeds in synchronous communication rounds $t = 0, 1, \dots, T - 1$. The server maintains the global model $w^{t}$. In round t, a subset of clients $S_{t} \subseteq \{1, \dots, K\}$ participates and returns update messages $u_{k}^{t}$, which the server aggregates:

w^{t + 1} = w^{t} + A (\{u_{k}^{t}\}_{k \in S_{t}}),

(4)

where $A (\cdot)$ is an aggregation operator. We use ${∥ \cdot ∥}_{2}$ for the Euclidean norm and $〈 \cdot, \cdot 〉$ for the inner product. For convex default-risk models (e.g., logistic regression with $λ > 0$), we assume each $F_{k}$ is L-smooth:

{∥ \nabla F_{k} (u) - \nabla F_{k} (v) ∥}_{2} \leq L {∥ u - v ∥}_{2}, \quad \forall u, v,

(5)

and the global objective F is $μ$-strongly convex for some $μ > 0$:

F (v) \geq F (u) + 〈 \nabla F (u), v - u 〉 + \frac{μ}{2} {∥ v - u ∥}_{2}^{2} .

(6)

For deep tabular models, we assume F is L-smooth and target approximate stationarity via $E {∥ \nabla F (w^{t}) ∥}_{2}^{2}$. Accounting data are typically non-IID across institutions. We quantify heterogeneity by bounded gradient dissimilarity: There exists $ζ \geq 0$ such that

\sum_{k = 1}^{K} p_{k} {∥ \nabla F_{k} (w) - \nabla F (w) ∥}_{2}^{2} \leq ζ^{2}, \quad \forall w .

(7)

For stochastic local solvers, we assume bounded gradient noise: for $ξ \sim D_{k}$,

E [{∥ \nabla ℓ (w; ξ) - \nabla F_{k} (w) ∥}_{2}^{2}] \leq σ_{k}^{2} .

(8)
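To make the heterogeneity measure in (7) concrete, the sketch below evaluates its left-hand side at a single point, given per-client gradients. This is only a pointwise diagnostic under the assumption that client gradients are available in the clear for analysis; (7) itself is a uniform bound over all $w$. The function name is hypothetical.

```python
import numpy as np

def gradient_dissimilarity(client_grads, weights):
    """Left-hand side of Eq. (7) at one point: sum_k p_k ||grad F_k - grad F||^2."""
    G = np.asarray(client_grads, dtype=float)   # shape (K, P)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                             # p_k = n_k / N
    global_grad = p @ G                         # grad F = sum_k p_k grad F_k
    return float(p @ np.sum((G - global_grad) ** 2, axis=1))
```

Identical client gradients give a value of zero; two equally weighted clients with opposite unit gradients give a value of one.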

3.3. Curvature Sketches, Robust Normalization, and Privacy Primitives

Curvature information and sketching. To accelerate convergence, we use curvature-aware aggregation. Let $H^{t}$ denote a positive definite preconditioner used by the server (interpretable as an approximation to ${(\nabla^{2} F (w^{t}))}^{- 1}$). Since transmitting full matrices is impractical, we work with low-dimensional sketches. Let $S^{t} \in R^{P \times m}$ be a sketching matrix with $m ≪ P$. Client k forms a curvature sketch

C_{k}^{t} \approx {(S^{t})}^{⊤} \nabla^{2} F_{k} (w^{t}) S^{t} \in R^{m \times m},

(9)

which the server aggregates to obtain a compact curvature model used during aggregation (specified in Section 4).

Robust feature normalization via federated quantiles. Tabular accounting variables are heavy-tailed and may contain extreme outliers. For a univariate random variable X, define the p-quantile

Q_{X} (p) = \inf \{x \in R : P (X \leq x) \geq p\}, \quad p \in (0, 1) .

(10)

For each feature $j \in \{1, \dots, d\}$, we use the global median $m_{j} = Q_{X_{j}} (0.5)$ and interquartile range $\mathrm{IQR}_{j} = Q_{X_{j}} (0.75) - Q_{X_{j}} (0.25)$ to robustly standardize

{\tilde{x}}_{j} = \frac{x_{j} - m_{j}}{\mathrm{IQR}_{j} + τ},

(11)

where $τ > 0$ ensures numerical stability. These quantiles are computed from privacy-preserving aggregated summaries (Section 4).
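The transform in (11) can be sketched as follows. For illustration only, the quantiles here are computed on a small pooled toy array with `np.quantile`, whose default linear interpolation differs slightly from the definition in (10); in the federated setting they would instead come from the aggregated summaries. The data and function name are hypothetical.

```python
import numpy as np

def robust_standardize(X, median, iqr, tau=1e-6):
    """Column-wise robust standardization, Eq. (11)."""
    return (X - median) / (iqr + tau)

# Toy feature matrix: the first column contains one extreme outlier.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [1000.0, 50.0]])

# Stand-in for the global quantiles (pooled here, federated in practice).
q25, med, q75 = np.quantile(X, [0.25, 0.5, 0.75], axis=0)
X_tilde = robust_standardize(X, med, q75 - q25)
```

Because the median and IQR are insensitive to the single value 1000, the bulk of the first column is mapped to roughly [-1, 0.5] rather than being compressed toward zero, as it would be under mean and standard deviation scaling.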

4. Methodology

This section presents FedQuAD (Federated Quasi-Newton with Adaptive Drift control), a fast-converging FL method for default prediction on private accounting data. FedQuAD augments a stabilized local solver with a server-side curvature-aware correction computed in a low-dimensional sketch space, adding only $O (m^{2})$ extra uplink payload per round (beyond transmitting model deltas).

4.1. FedQuAD Protocol and Global Update Rule

FedQuAD targets fast communication-round convergence for minimizing the federated objective $F (w) = \sum_{k = 1}^{K} p_{k} F_{k} (w)$ in (3). The key idea is to combine (i) the standard model-delta aggregation that captures broad progress from local training, with (ii) a curvature-aware correction computed in a low-dimensional sketch space that approximates a Newton-like step along informative directions. This yields a single-round update that remains compatible with secure aggregation and adds only $O (m^{2})$ uplink payload per client when the sketch dimension $m ≪ P$.

Round structure. At communication round t, the server holds the global model $w^{t} \in R^{P}$ and samples a participating client set $S_{t} \subseteq \{1, \dots, K\}$. The server also specifies a sketching matrix $S^{t} \in R^{P \times m}$ with orthonormal columns, used to define a shared m-dimensional subspace. To reduce server bandwidth, $S^{t}$ is generated from a public random seed $\mathrm{seed}^{t}$ broadcast to clients; each client reconstructs $S^{t}$ locally (the concrete construction is given in Section 4.3).

Each participating client $k \in S_{t}$ computes three quantities at (or anchored to) the broadcast iterate $w^{t}$:

1. A stabilized local model delta $Δ_{k}^{t} \in R^{P}$, obtained by approximately minimizing a proximal, variance-reduced subproblem (detailed in Section 4.2). Intuitively, $Δ_{k}^{t}$ captures useful descent directions under the client distribution while controlling drift.
2. A projected gradient in sketch space, $g_{k, s}^{t} = {(S^{t})}^{⊤} \nabla F_{k} (w^{t}) \in R^{m}$, which summarizes how the local objective changes along the shared subspace directions.
3. A curvature sketch in sketch space, $C_{k}^{t} \approx {(S^{t})}^{⊤} \nabla^{2} F_{k} (w^{t}) S^{t} \in R^{m \times m}$, where the approximation may use an exact Hessian for generalized linear models or a Gauss–Newton/empirical Fisher approximation for deep tabular models.

All three messages can be protected by secure aggregation (and optionally by client-level DP via clipping/noise; see Section 4.4).

Weighted aggregation. Let ${\tilde{p}}_{k}$ denote normalized weights over participating clients,

{\tilde{p}}_{k} = \frac{p_{k}}{\sum_{j \in S_{t}} p_{j}} .

The server forms the aggregated delta, projected gradient, and curvature sketch:

Δ^{t} = \sum_{k \in S_{t}} {\tilde{p}}_{k} Δ_{k}^{t}, \quad g_{s}^{t} = \sum_{k \in S_{t}} {\tilde{p}}_{k} g_{k, s}^{t}, \quad C^{t} = \sum_{k \in S_{t}} {\tilde{p}}_{k} C_{k}^{t} .

(12)

Compared with FedAvg-style aggregation (which uses only $Δ^{t}$), FedQuAD additionally aggregates $(g_{s}^{t}, C^{t})$ to compute a preconditioned correction in the shared subspace.

The server computes a damped inverse in sketch space,

M^{t} = {(C^{t} + ρ I_{m})}^{- 1},

where $ρ > 0$ is a damping parameter ensuring numerical stability even when $C^{t}$ is ill-conditioned or rank-deficient. The quasi-Newton correction direction in parameter space is then

d_{q}^{t} = - S^{t} M^{t} g_{s}^{t} .

Finally, FedQuAD updates the global model by combining the aggregated local delta $Δ^{t}$ with the curvature-aware correction:

w^{t + 1} = w^{t} + Δ^{t} + η_{q} d_{q}^{t}

(13)

where $η_{q} \geq 0$ controls the correction strength. When $η_{q} = 0$, (13) reduces to a stabilized first-order method driven purely by local deltas. When $η_{q} > 0$, the term $d_{q}^{t}$ acts like a Newton step restricted to $\mathrm{span} (S^{t})$, improving the effective conditioning along those directions and empirically reducing the rounds needed to reach target AUC and calibration.
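The server-side computation in (12) and (13) can be sketched in a few lines of NumPy. This is an illustrative sketch under simplifying assumptions: messages arrive in the clear (no secure aggregation or DP), and the function name, argument layout, and default values are hypothetical.

```python
import numpy as np

def fedquad_server_step(w, deltas, proj_grads, sketches, weights, S,
                        rho=1e-2, eta_q=1.0):
    """One global update: weighted aggregation (12), then the combined step (13)."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                                          # normalized weights
    delta = np.tensordot(p, np.asarray(deltas), axes=1)      # Delta^t, shape (P,)
    g_s = np.tensordot(p, np.asarray(proj_grads), axes=1)    # g_s^t, shape (m,)
    C = np.tensordot(p, np.asarray(sketches), axes=1)        # C^t, shape (m, m)
    m = C.shape[0]
    # Solve (C^t + rho I) x = g_s^t instead of forming the inverse M^t explicitly.
    x = np.linalg.solve(C + rho * np.eye(m), g_s)
    d_q = -S @ x                                             # lift back to R^P
    return w + delta + eta_q * d_q
```

As a sanity check, for a quadratic objective with a single client, zero local delta, $S^{t} = I$, $η_{q} = 1$, and negligible damping, the update coincides with an exact Newton step.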

Per participating client, FedQuAD transmits the usual model delta $Δ_{k}^{t}$ (same order as FedAvg) plus a projected gradient vector of size m and a symmetric curvature sketch of size $m \times m$. Transmitting only the upper-triangular part yields $m (m + 1) / 2$ scalars. Thus, the additional uplink beyond $Δ_{k}^{t}$ scales as $O (m^{2})$, which is small for $m ≪ P$. On the server, the additional compute is dominated by inverting the damped $m \times m$ matrix $C^{t} + ρ I_{m}$ built from the aggregated sketch in (12), i.e., $O (m^{3})$ per round, which is negligible relative to client training for typical choices (e.g., $m \in [32, 128]$).
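The payload count above is simple arithmetic, sketched below. The model size used in the example call is a hypothetical value chosen only to illustrate the ratio; it is not taken from the experiments.

```python
def extra_uplink_scalars(m):
    """Scalars sent beyond the model delta: an m-vector plus the
    upper triangle of a symmetric m x m sketch."""
    return m + m * (m + 1) // 2

def overhead_fraction(m, P):
    """Extra uplink relative to a model delta of P scalars."""
    return extra_uplink_scalars(m) / P

for m in (32, 64, 128):
    print(m, extra_uplink_scalars(m))   # 560, 2144, 8384 scalars respectively

# Hypothetical model with P = 100_000 parameters and m = 64.
print(overhead_fraction(64, 100_000))   # about 2.1% extra uplink
```

The quadratic growth is visible directly: doubling m from 64 to 128 roughly quadruples the extra payload.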

4.2. Client-Side Solver with Adaptive Drift Control

FedQuAD employs a client-side solver designed for two dominant difficulties in federated default prediction: (i) client drift caused by heterogeneous portfolios and non-IID accounting distributions, quantified by (7), and (ii) high gradient noise induced by rare defaults and heavy-tailed features, reflected in (8). To address both, each client approximately minimizes a proximal objective anchored at the broadcast model and uses a variance-reduced estimator to stabilize local steps. This subroutine produces the model delta $Δ_{k}^{t}$ used in the aggregated update (12). Given server model $w^{t}$, client k defines the round-t proximal objective

Φ_{k}^{t} (w) = F_{k} (w) + \frac{μ_{p}}{2} {∥ w - w^{t} ∥}_{2}^{2},

(14)

where $μ_{p} \geq 0$ controls the strength of drift control. The proximal term penalizes deviations from $w^{t}$, reducing the tendency of local iterates to overfit to institution-specific patterns that may not generalize globally. This is particularly important when defaults are sparse on some clients: Without anchoring, local updates can be dominated by a small number of rare events and lead to unstable directions.

Snapshot gradient and control variate. At the beginning of the round, client k computes a snapshot gradient at the broadcast point,

g_{k}^{t} = \nabla F_{k} (w^{t}),

either exactly (for moderate $n_{k}$) or using a sufficiently large minibatch to reduce noise. This snapshot gradient serves two roles: (i) It is used directly in the projected gradient message $g_{k, s}^{t}$ in (19), and (ii) it forms a control variate for variance reduction during local optimization.

Prox-SVRG local steps. Initialize $w_{k}^{t, 0} = w^{t}$. For local step $e = 0, 1, \dots, E - 1$, sample a minibatch $B_{k}^{t, e}$ and construct the variance-reduced (and proximal) gradient estimator

v_{k}^{t, e} = \nabla ℓ (w_{k}^{t, e}; B_{k}^{t, e}) - \nabla ℓ (w^{t}; B_{k}^{t, e}) + g_{k}^{t} + μ_{p} (w_{k}^{t, e} - w^{t}),

(15)

then take the update

w_{k}^{t, e + 1} = w_{k}^{t, e} - η_{ℓ} v_{k}^{t, e},

where $η_{ℓ} > 0$ is the local step size. The estimator in (15) has the standard SVRG structure: It subtracts the minibatch gradient at the snapshot point $w^{t}$ and adds back the (near-)full gradient $g_{k}^{t}$, thereby reducing the variance of minibatch gradients when $w_{k}^{t, e}$ stays near $w^{t}$ (which is encouraged by the proximal term). The additional proximal gradient $μ_{p} (w_{k}^{t, e} - w^{t})$ ensures that the local direction is consistent with descent on $Φ_{k}^{t}$ and provides explicit drift control. After E steps, client k returns the model delta

Δ_{k}^{t} = w_{k}^{t, E} - w^{t} .
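The local loop built around (15) can be sketched as follows. The callables `grad_batch` and `full_grad`, the sampling scheme, and all default values are assumptions made for this illustration; any model that exposes minibatch and full gradients could be plugged in.

```python
import numpy as np

def prox_svrg_delta(w_t, grad_batch, full_grad, n, E=10, batch_size=32,
                    eta_l=0.1, mu_p=0.1, seed=0):
    """Run E Prox-SVRG steps from w_t and return the model delta.

    grad_batch(w, idx) -> minibatch gradient of the local loss on rows idx
    full_grad(w)       -> (near-)full local gradient, used as the snapshot
    """
    rng = np.random.default_rng(seed)
    g_snap = full_grad(w_t)                 # snapshot gradient g_k^t at w^t
    w = w_t.copy()
    for _ in range(E):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        # Eq. (15): SVRG correction plus the gradient of the proximal term.
        v = (grad_batch(w, idx) - grad_batch(w_t, idx)
             + g_snap + mu_p * (w - w_t))
        w = w - eta_l * v
    return w - w_t                          # Delta_k^t
```

Note that on the first step the two minibatch terms cancel and the proximal term vanishes, so the step is exactly a snapshot-gradient step regardless of which minibatch is drawn.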

Adaptive drift control in practice. While $μ_{p}$ can be fixed, FedQuAD supports a simple adaptive rule that increases anchoring when local updates appear unstable. Let $Δ_{k}^{t}$ denote the candidate delta produced by a trial run with current $μ_{p}$. Client k computes the normalized step magnitude

r_{k}^{t} = \frac{{∥ Δ_{k}^{t} ∥}_{2}}{{∥ w^{t} ∥}_{2} + ϵ},

(16)

with small $ϵ > 0$. If $r_{k}^{t}$ exceeds a threshold $r_{\max}$ (indicating aggressive drift) or if the local objective $Φ_{k}^{t}$ fails to decrease over the last few steps, the client increases anchoring (e.g., $μ_{p} \leftarrow γ μ_{p}$ with $γ > 1$) and reruns the remaining local steps. This adaptive mechanism is lightweight, requires no extra communication, and empirically reduces catastrophic local jumps on clients with extreme imbalance or outlier-heavy ratios. The indicator $r_{k}^{t}$ in (16) measures the relative parameter change in a round and is intended as a simple, model-scale-invariant safety valve. In cross-silo credit settings where some institutions may have very few default events, we recommend conservative caps (e.g., $r_{\max} \in [0.01, 0.1]$) and a small multiplicative increase factor (e.g., $γ \approx 2$). Unless otherwise noted, we use $r_{\max} = 0.05$ and $γ = 2$ (with at most a few retries per round) to stabilize rare-event clients without adding communication or materially affecting typical clients.
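A minimal sketch of the retry rule based on (16) is given below. It simplifies the description above in two ways that should be read as assumptions: it reruns the whole local solve rather than only the remaining steps, and it omits the objective-decrease check. The callable `run_local` and the retry cap are hypothetical.

```python
import numpy as np

def adaptive_drift_control(w_t, run_local, mu_p=0.1, r_max=0.05,
                           gamma=2.0, max_retries=3, eps=1e-12):
    """Increase mu_p until the relative step r_k^t of Eq. (16) is at most r_max.

    run_local(mu_p) -> candidate delta from a local solve with that mu_p.
    Returns the accepted delta and the mu_p that produced it.
    """
    delta = run_local(mu_p)
    for _ in range(max_retries):
        r = np.linalg.norm(delta) / (np.linalg.norm(w_t) + eps)
        if r <= r_max:
            break
        mu_p *= gamma                 # stronger anchoring to w^t
        delta = run_local(mu_p)       # retry with the larger mu_p
    return delta, mu_p
```

If the cap on retries is reached, the last candidate is returned even when it still exceeds $r_{\max}$; a deployment might instead clip or discard such an update.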

In addition to $Δ_{k}^{t}$, client k already has the snapshot gradient $g_{k}^{t}$ used in (15), from which it computes the projected gradient message $g_{k, s}^{t} = {(S^{t})}^{⊤} g_{k}^{t}$. Curvature sketch computation uses the same broadcast point $w^{t}$ (Section 4.3), ensuring all server-side quantities $(Δ^{t}, g_{s}^{t}, C^{t})$ are aligned to the same iteration and thus can be coherently combined in the update (13).

In general, credit default prediction exhibits three structural properties: (i) rare positive events leading to gradient sparsity and instability, (ii) cross-institution feature-scale and distributional mismatch, and (iii) heavy-tailed accounting ratios. The proximal term mitigates client drift caused by heterogeneous default frequencies; variance reduction stabilizes local stochastic updates when positives are scarce; and the sketch-based quasi-Newton correction compensates for curvature mismatch across institutions without transmitting full Hessians. In combination, these mechanisms address financial heterogeneity at three levels: parameter stability (proximal), stochastic noise (variance reduction), and geometry mismatch (curvature correction).

4.3. Curvature Sketching and Quasi-Newton Correction

This subsection describes how FedQuAD constructs the shared sketch space, how clients compute curvature sketches with low overhead, and how the server uses the aggregated sketch to form a stable quasi-Newton correction $d_{q}^{t}$ in (13). The central principle is to approximate Newton preconditioning only in a low-dimensional subspace where curvature information can be communicated and inverted efficiently.

At round t, the server samples a public random seed $\mathrm{seed}^{t}$ and defines

S^{t} = \mathrm{qr} (G^{t} (\mathrm{seed}^{t})) \in R^{P \times m},

(17)

where $G^{t} (\mathrm{seed}^{t})$ is a pseudo-random matrix with i.i.d. $N (0, 1)$ entries and $\mathrm{qr} (\cdot)$ denotes a thin QR factorization producing orthonormal columns. The server broadcasts $\mathrm{seed}^{t}$ (and m), and each client reconstructs the same $S^{t}$ locally.
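The construction in (17) can be sketched with NumPy as follows. The function name is hypothetical, and bit-identical reconstruction across clients additionally assumes that all parties use the same generator, library version, and floating-point behavior.

```python
import numpy as np

def make_sketch(seed, P, m):
    """Rebuild the shared sketching matrix S^t from a public seed, Eq. (17)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((P, m))      # i.i.d. N(0, 1) entries
    Q, _ = np.linalg.qr(G)               # thin (reduced) QR: Q has shape (P, m)
    return Q

S_server = make_sketch(seed=7, P=50, m=8)
S_client = make_sketch(seed=7, P=50, m=8)   # same seed, same matrix
```

Only the seed and m cross the network; the $P \times m$ matrix itself is never transmitted.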

Curvature model to be sketched. Client k forms a symmetric positive semidefinite curvature approximation ${\hat{H}}_{k}^{t}$ at the broadcast model $w^{t}$:

{\hat{H}}_{k}^{t} \approx \nabla^{2} F_{k} (w^{t}) .

For generalized linear models (e.g., logistic regression with $λ > 0$), ${\hat{H}}_{k}^{t}$ can be the exact empirical Hessian including the $λ I$ term. For deep tabular models, we use a Gauss–Newton or empirical Fisher approximation, which is PSD and can be evaluated via Hessian-vector products (HVPs) without materializing a full $P \times P$ matrix. Given $S^{t} = [s_{1}^{t}, \dots, s_{m}^{t}]$, the curvature sketch is

C_{k}^{t} = {(S^{t})}^{⊤} {\hat{H}}_{k}^{t} S^{t} + ρ_{0} I_{m} \in R^{m \times m},

(18)

where $ρ_{0} \geq 0$ is a client-side ridge for robustness under noisy curvature. The client computes $C_{k}^{t}$ using HVPs:

v_{j}^{t} = {\hat{H}}_{k}^{t} s_{j}^{t} \quad \text{for } j = 1, \dots, m,

and fills the entries by inner products

{[C_{k}^{t}]}_{a b} = 〈 s_{a}^{t}, v_{b}^{t} 〉 + ρ_{0} 1 \{a = b\} .
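For the linear logistic model with the weighted loss (1), the exact Hessian is $X^{⊤} D X / n_{k} + λ I$, where D is diagonal with entries $(α y_{i} + (1 - α) (1 - y_{i})) \hat{p}_{i} (1 - \hat{p}_{i})$, so the m HVPs reduce to matrix products with the local data. The sketch below illustrates (18) under that assumption; the function name and defaults are hypothetical.

```python
import numpy as np

def curvature_sketch(w, X, y, S, alpha=0.5, lam=0.0, rho0=0.0):
    """C_k^t = S^T H S + rho0 I for weighted logistic regression, Eq. (18),
    computed without materializing the P x P Hessian H."""
    p = 0.5 * (1.0 + np.tanh(0.5 * (X @ w)))                  # stable sigmoid
    d = (alpha * y + (1.0 - alpha) * (1.0 - y)) * p * (1.0 - p)
    # Column j of HS is the Hessian-vector product v_j = H s_j.
    HS = X.T @ (d[:, None] * (X @ S)) / len(y) + lam * S
    C = S.T @ HS + rho0 * np.eye(S.shape[1])
    return 0.5 * (C + C.T)                                    # remove round-off asymmetry
```

The cost is two products with the $n_{k} \times P$ data matrix per sketch column, and only the resulting $m \times m$ matrix leaves the client.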

Projected gradient alignment. Curvature must be paired with a gradient in the same sketch space. Client k computes the projected gradient at

w^{t}

:

g_{k, s}^{t} = {(S^{t})}^{⊤} \nabla F_{k} (w^{t}) \in R^{m},

(19)

using the snapshot gradient from Section 4.2. Both

C_{k}^{t}

and

g_{k, s}^{t}

are anchored at

w^{t}

, ensuring coherent aggregation across clients. The server aggregates sketches and projected gradients using normalized weights

{\tilde{p}}_{k}

:

C^{t} = \sum_{k \in S_{t}} {\tilde{p}}_{k} C_{k}^{t}, g_{s}^{t} = \sum_{k \in S_{t}} {\tilde{p}}_{k} g_{k, s}^{t} .

It then computes a stabilized inverse:

M^{t} = {(C^{t} + ρ I_{m})}^{- 1}

(20)

with server-side damping

ρ > 0

, and forms the lifted quasi-Newton correction

d_{q}^{t} = - S^{t} M^{t} g_{s}^{t} .

This correction is combined with the aggregated local delta

Δ^{t}

through (12).
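The server-side computation around (20) can be sketched as follows. This is a minimal example under stated assumptions: the function name `server_correction` is hypothetical, a linear solve replaces the explicit inverse, and the toy inputs are chosen so the result can be checked by hand.

```python
import numpy as np

def server_correction(S, C_list, g_list, weights, rho):
    """Aggregate client sketches and return d_q = -S (C + rho I)^{-1} g_s."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                                   # normalized weights
    C = sum(pk * Ck for pk, Ck in zip(p, C_list))     # aggregated m x m sketch
    g_s = sum(pk * gk for pk, gk in zip(p, g_list))   # aggregated projected gradient
    m = C.shape[0]
    z = np.linalg.solve(C + rho * np.eye(m), g_s)     # damped solve, no explicit inverse
    return -S @ z                                     # lift back to R^P

# Toy example with P = 4, m = 2 and two equally weighted clients.
S = np.eye(4)[:, :2]
C_list = [np.eye(2), 3.0 * np.eye(2)]
g_list = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
d_q = server_correction(S, C_list, g_list, weights=[1.0, 1.0], rho=1.0)
```

Here the aggregated sketch is 2I, the aggregated projected gradient is (1, 1), and the damped system is 3I, so the correction is -1/3 in the two sketched coordinates and zero elsewhere.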

Cost and practical choices. Per client, computing the sketch (18) requires m HVPs; for logistic regression these reduce to matrix-vector products with the local data, and for deep models HVPs are supported by autodiff at a cost comparable to backpropagation. Communication per client includes one m-vector and one symmetric

m \times m

matrix (upper triangle), i.e.,

m (m + 1) / 2

scalars. In practice, moderate m (e.g., 32–128) yields most of the acceleration benefits while keeping overhead small. As the curvature payload scales as

m (m + 1) / 2

, a practical rule is to pick m so that this term is a small fraction of the model-delta payload for the chosen model size P. For tabular models in our study,

m = 64

is a robust default. We recommend starting from

m \in {32, 64}

and increasing only if rounds-to-target improvements justify the quadratic communication increase; keeping m modest is also beneficial when per-round participation is small, because

C^{t}

becomes noisier and the damping in (20) plays a larger role.

4.4. Robust Preprocessing and Optional Privacy Layer

FedQuAD is intended for default prediction on sensitive, heavy-tailed accounting data. Accordingly, we integrate two practical components: (i) a robust federated standardization mechanism that improves numerical conditioning and stability, and (ii) an optional privacy layer that supports secure aggregation and client-level differential privacy (DP). Both components are designed to preserve the main optimization loop in (13) while improving deployability in cross-silo settings.

Robust federated standardization via quantile summaries. Accounting ratios and statement-derived variables often exhibit extreme skew and outliers, which can degrade both first-order optimization (through exploding gradients) and curvature estimation (through ill-conditioned Hessians). FedQuAD therefore standardizes features using global medians and interquartile ranges as in (11). Since raw feature values cannot be shared, clients transmit only mergeable summaries.

Concretely, for each feature

j \in {1, \dots, d}

, client k constructs a compact quantile summary

Q_{k, j}

that supports approximate queries for

Q_{X_{j}} (0.25)

,

Q_{X_{j}} (0.5)

, and

Q_{X_{j}} (0.75)

. Such summaries can be implemented using fixed-bin histograms (after coarse clipping) or digest-style sketches that support merge operations. The server aggregates the summaries by merging:

Q_{agg, j} \leftarrow Merge ({Q_{k, j}}_{k \in S^{stat}}),

where

S^{stat}

is a (possibly larger) set of clients participating in the statistics phase. From

Q_{agg, j}

, the server estimates

m_{j} \approx Q_{X_{j}} (0.5), {IQR}_{j} \approx Q_{X_{j}} (0.75) - Q_{X_{j}} (0.25),

and broadcasts

{m_{j}, {IQR}_{j}}_{j = 1}^{d}

. Clients then transform each feature by

{\tilde{x}}_{j} = \frac{x_{j} - m_{j}}{{IQR}_{j} + τ},

with

τ > 0

to avoid division by near-zero scales. This preprocessing improves the stability of both the local solver (Section 4.2) and curvature sketching (Section 4.3) by controlling heavy tails and reducing cross-client scale mismatch. In our experiments, robust normalization also improves calibration, likely because the model is less sensitive to extreme ratios that correlate spuriously with rare default events on a subset of clients. Robust median/IQR scaling is one natural choice for heavy-tailed accounting ratios. Alternatives in federated settings include global z-score standardization via secure aggregation of per-client sums and squared sums, per-client standardization (which can increase cross-client scale mismatch), and monotone transformations such as log/winsorization and rank/quantile transforms. We focus on median/IQR because it is resilient to outliers while preserving a common scale across institutions.
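One way to realize the mergeable summaries described above is a fixed-bin histogram after coarse clipping. The sketch below is illustrative only: the helper names, the clipping range, the bin count, and the synthetic heavy-tailed data are assumptions rather than the settings used in the experiments.

```python
import numpy as np

def local_histogram(x, edges):
    """Client side: clip to the agreed range, then count values per fixed bin."""
    clipped = np.clip(x, edges[0], edges[-1])
    counts, _ = np.histogram(clipped, bins=edges)
    return counts

def quantile_from_counts(counts, edges, q):
    """Server side: approximate quantile via linear interpolation inside a bin."""
    cdf = np.cumsum(counts) / counts.sum()
    b = int(np.searchsorted(cdf, q))               # first bin whose CDF reaches q
    prev = cdf[b - 1] if b > 0 else 0.0
    frac = (q - prev) / max(cdf[b] - prev, 1e-12)
    return edges[b] + frac * (edges[b + 1] - edges[b])

def robust_scale(x, med, iqr, tau=1e-6):
    """Median/IQR standardization with a small tau guarding the denominator."""
    return (x - med) / (iqr + tau)

rng = np.random.default_rng(0)
edges = np.linspace(-10.0, 10.0, 2001)             # hypothetical public bin grid
clients = [rng.standard_t(df=3, size=2000) + s for s in (-0.5, 0.0, 0.5)]
merged = sum(local_histogram(x, edges) for x in clients)   # merge = add counts
med = quantile_from_counts(merged, edges, 0.5)
iqr = quantile_from_counts(merged, edges, 0.75) - quantile_from_counts(merged, edges, 0.25)
x_tilde = robust_scale(clients[0], med, iqr)
```

Because merging is a plain sum of count vectors, the same summaries could in principle be combined under secure aggregation; the approximation error of each quantile is bounded by roughly one bin width as long as the quantile lies inside the clipping range.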

Secure aggregation compatibility. FedQuAD messages in a round are

Δ_{k}^{t}

,

g_{k, s}^{t}

, and

C_{k}^{t}

. Secure aggregation can be used to ensure that the server only observes aggregated values such as

\sum_{k \in S_{t}} {\tilde{p}}_{k} Δ_{k}^{t}

and similarly for

g_{k, s}^{t}

and

C_{k}^{t}

. We treat secure aggregation as an oracle that reveals only these sums, aligning with the aggregation definitions in (12). This protects individual client updates from direct inspection while preserving exactness of the optimization update.

Secure aggregation ensures that the server observes only aggregated messages, but the aggregates

(Δ^{t}, g_{s}^{t}, C^{t})

are still model-update statistics and can reveal distribution-level information, especially if few clients participate or participation is highly unbalanced. Curvature sketches are second-order summaries in the random subspace

span (S^{t})

and thus may encode structural information beyond first-order gradients. FedQuAD mitigates this risk by (i) relying on secure aggregation so that no single institution’s sketch is revealed, (ii) refreshing the sketch subspace each round via the public seed (17), limiting information accumulation in fixed coordinates, and (iii) supporting client-level DP for all transmitted objects, including the vectorized upper-triangular entries of

C^{t}

, via clipping and Gaussian noise. In deployments we also recommend enforcing a minimum participation threshold before unmasking secure aggregation, as this is standard in secure-aggregation protocols [19].

Client-level differential privacy (optional). When stronger formal privacy is required, FedQuAD supports client-level

(ε, δ)

-DP by clipping and adding noise to aggregated messages. Let

{clip}_{C} (u) = u \cdot min \{1, C / ∥ u ∥_{2}\}

denote

ℓ_{2}

clipping. Prior to secure aggregation, each client clips its transmitted delta and projected gradient:

{\tilde{Δ}}_{k}^{t} = {clip}_{C_{Δ}} (Δ_{k}^{t}), {\tilde{g}}_{k, s}^{t} = {clip}_{C_{g}} (g_{k, s}^{t}) .

(21)

For curvature sketches, we clip with respect to the Frobenius norm:

{\tilde{C}}_{k}^{t} = C_{k}^{t} \cdot min \{1, \frac{C_{C}}{∥ C_{k}^{t} ∥_{F}}\} .

(22)
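The clipping rules (21) and (22) share one form, which the following minimal sketch makes explicit. The helper name `clip_l2` and the zero-norm guard are assumptions added for the example.

```python
import numpy as np

def clip_l2(u, bound):
    """Scale u so that its l2 norm (Frobenius norm for a matrix) is at most bound."""
    u = np.asarray(u, dtype=float)
    norm = np.linalg.norm(u)          # l2 for vectors, Frobenius for 2-D arrays
    if norm == 0.0:
        return u                      # nothing to scale; avoids division by zero
    return u * min(1.0, bound / norm)

delta_clipped = clip_l2([3.0, 4.0], bound=1.0)                 # norm 5 -> norm 1
small_unchanged = clip_l2([0.3, 0.4], bound=1.0)               # norm 0.5 stays as is
C_clipped = clip_l2([[3.0, 0.0], [0.0, 4.0]], bound=2.5)       # Frobenius 5 -> 2.5
```

Clipping a symmetric matrix by a scalar factor keeps it symmetric and positive semidefinite, so the clipped sketch remains a valid curvature summary.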

After secure aggregation, the server adds Gaussian noise to the aggregated messages. For example, letting

{\bar{Δ}}^{t}

denote the weighted aggregate of

{\tilde{Δ}}_{k}^{t}

, the server uses

Δ^{t} \leftarrow {\bar{Δ}}^{t} + N (0, σ_{Δ}^{2} C_{Δ}^{2} I_{P}),

(23)

and similarly

g_{s}^{t} \leftarrow {\bar{g}}_{s}^{t} + N (0, σ_{g}^{2} C_{g}^{2} I_{m}) .

For

C^{t}

, noise can be added to the vectorized upper-triangular entries, yielding a symmetric noisy sketch after reshaping and symmetrization. The noise multipliers

(σ_{Δ}, σ_{g}, σ_{C})

are selected based on the desired

(ε, δ)

and the number of rounds T using standard composition accounting for client subsampling. When DP noise is applied to

C^{t}

, the aggregated sketch can become less stable. FedQuAD mitigates this by (i) client-side ridge

ρ_{0}

in (18), (ii) server-side damping

ρ

in (20), and (iii) optional eigenvalue flooring in sketch space:

C^{t} \leftarrow U max {Λ, λ_{min} I} U^{⊤},

where

C^{t} = U Λ U^{⊤}

and

λ_{min} > 0

is a small floor. These stabilizers ensure that the inverse in (20) remains well-defined and prevent overly large correction steps due to noisy or near-singular curvature.
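The noise-then-floor treatment of the aggregated sketch can be sketched as below. This is an assumption-laden illustration: the helper names are hypothetical, the noise standard deviation is taken as the noise multiplier times the clipping bound by analogy with (23), and the input matrix is a toy example.

```python
import numpy as np

def noisy_symmetric(C_bar, sigma, clip_bound, rng):
    """Add Gaussian noise to the upper triangle only, then mirror it."""
    m = C_bar.shape[0]
    iu = np.triu_indices(m)                            # upper triangle incl. diagonal
    noisy = np.zeros_like(C_bar, dtype=float)
    noisy[iu] = C_bar[iu] + rng.normal(0.0, sigma * clip_bound, size=iu[0].size)
    return noisy + np.triu(noisy, k=1).T               # copy strict upper part below

def floor_eigenvalues(C, lam_min):
    """Replace eigenvalues below lam_min by lam_min: U max(Lambda, lam_min I) U^T."""
    evals, U = np.linalg.eigh(C)                       # C must be symmetric
    return (U * np.maximum(evals, lam_min)) @ U.T

rng = np.random.default_rng(0)
C_bar = np.diag([1.0, 0.5, 0.01])                      # toy aggregated sketch
C_noisy = noisy_symmetric(C_bar, sigma=0.5, clip_bound=1.0, rng=rng)
C_fixed = floor_eigenvalues(C_noisy, lam_min=1e-3)
```

Noise can make the sketch indefinite, which is why the floor is applied after noising rather than before; the damped inverse in (20) then operates on a matrix whose smallest eigenvalue is at least the floor.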

5. Experiments

5.1. Experimental Setup

Datasets and feature construction. We evaluate federated default prediction on two tabular credit-risk benchmarks that reflect key properties of private accounting data: non-IID client distributions, heavy-tailed covariates, missing values, and label imbalance.

(D1) Corporate Bankruptcy (Accounting Ratios). We use the publicly available Polish companies bankruptcy dataset, which contains 64 financial ratios derived from accounting statements and a binary label indicating bankruptcy over a fixed horizon (available at https://archive.ics.uci.edu/dataset/365/polish+companies+bankruptcy+data (accessed on 20 May 2025)). We remove features with more than

15 %

missingness, impute the remaining missing values using per-client medians to avoid cross-silo leakage, and apply the robust federated standardization in (11), where

m_{j}

and

{IQR}_{j}

are estimated from aggregated quantile summaries and broadcast to clients.

(D2) Consumer Default (Tabular Credit). We use the UCI “default of credit card clients” dataset, which contains 23 tabular covariates and a binary default label (available at https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients (accessed on 8 May 2025)). While not purely accounting ratios, it exhibits similar non-IID and imbalance characteristics and is widely used as a standard benchmark for tabular default modeling. We adopt the same missingness handling (when applicable) and the same robust normalization protocol for consistency. Table 1 summarizes basic statistics.

Federated partitioning and heterogeneity controls. We simulate a cross-silo network with

K = 20

institutions. For each dataset, we generate heterogeneous client datasets using a two-stage procedure: (i) label skew via a Dirichlet partition with concentration parameter

0.3

, producing widely varying default prevalences across clients; and (ii) covariate shift by assigning clients disjoint segments approximated by k-means clustering on covariates (to mimic institutions serving different industries or customer segments). Each round uses partial participation with

| S_{t} | = 5

clients (

25 %

participation). Figure 1 illustrates the resulting variability in client-level default prevalence on (D1), a primary driver of drift and noisy gradients in practice. Our evaluation is conducted on public datasets with simulated cross-silo partitions (label skew and covariate shift). Since real institutional accounting ledgers are typically unavailable for public release due to confidentiality and regulation, we limit real-world claims to what can be supported by reproducible benchmarks and treat deployment on proprietary multi-bank data as future work.
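The label-skew stage of the partition can be illustrated with one common Dirichlet construction: for each class, client proportions are drawn from a symmetric Dirichlet distribution and the class indices are split accordingly. The exact procedure used in the experiments is not spelled out here, so the function below, its name, and the synthetic labels are assumptions; the covariate-shift stage via k-means is not shown.

```python
import numpy as np

def dirichlet_label_partition(y, n_clients, alpha, rng):
    """Split sample indices across clients with per-class Dirichlet proportions."""
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))       # shuffled indices of class c
        props = rng.dirichlet(alpha * np.ones(n_clients))   # client shares for class c
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return [np.array(ix, dtype=int) for ix in client_idx]

rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.05).astype(int)                   # synthetic rare-event labels
parts = dirichlet_label_partition(y, n_clients=20, alpha=0.3, rng=rng)
prevalence = [y[ix].mean() if len(ix) else float("nan") for ix in parts]
```

With a small concentration such as 0.3, a few clients receive most of each class, so client-level default prevalence varies widely; some clients may end up with very few or no positive examples.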

Models, training protocol, and hyperparameter tuning. We evaluate two model families: (i) logistic regression with

ℓ_{2}

regularization

λ = 10^{- 4}

in (2), representing a strong, interpretable baseline for accounting ratios, and (ii) a tabular MLP with two hidden layers (256, 128) and ReLU activations, trained with the loss in (1). We train for

T = 200

rounds and report mean ± std over 5 random seeds (different client partitions and sampling trajectories). For each method, hyperparameters are tuned on a validation split that preserves the federated partition (no cross-client mixing). We tune server and client learning rates on a small grid and choose the configuration maximizing validation AUC; ties are broken by lower validation ECE. In regulated environments where a shared validation set is unavailable, we recommend (i) starting with sketch dimension

m = 32

or 64, (ii) damping

ρ \in [10^{- 4}, 10^{- 2}]

, (iii) proximal weight

μ

proportional to observed heterogeneity (e.g., larger

μ

when label skew exceeds 2:1 across institutions), and (iv) fixed conservative learning rates satisfying standard smoothness bounds. Sensitivity analysis indicates that FedQuAD is more sensitive to an excessively large m than to a moderately undersized sketch dimension. In all experiments we set the server-side damping parameter to

ρ = 10^{- 3}

unless otherwise specified.

Table 2 lists the default settings used across experiments. For FedQuAD, we keep the sketch dimension moderate (

m = 64

) and use damping

ρ

to ensure the sketch-space inverse remains stable. All baselines use the same client sampling and the same total local computation budget (E steps/round) to isolate algorithmic differences in aggregation and drift control.

Metrics and targets. We report (i) AUC for discrimination, (ii) Brier score for probability accuracy, and (iii) ECE (15 bins) for calibration. Since communication rounds are the primary cost in cross-silo FL, we also report Rounds@AUC: the smallest round index t at which validation AUC reaches a specified target (e.g., 0.87 for (D1), 0.90 for (D2)). Unless explicitly noted, all reported results assume secure aggregation without additional client-level DP noise. The optional DP mechanism described in Section 4.4 is evaluated separately and is not active in the main comparison tables.
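For concreteness, the two less standard metrics can be sketched as follows. This is one common variant, not necessarily the exact implementation behind the tables: ECE here uses equal-width bins on the predicted default probability, rounds are indexed from 1, and a round counts as reaching the target when its AUC is greater than or equal to it. All function names are hypothetical.

```python
import numpy as np

def expected_calibration_error(y_true, p_pred, n_bins=15):
    """Weighted mean gap between observed default rate and mean prediction per bin."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    bin_id = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)  # p = 1 goes to last bin
    ece = 0.0
    for b in range(n_bins):
        mask = bin_id == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p_pred[mask].mean())
            ece += mask.mean() * gap                                # weight = bin share
    return ece

def rounds_at_auc(val_auc_by_round, target):
    """Smallest 1-based round whose validation AUC reaches the target, else None."""
    for t, auc in enumerate(val_auc_by_round, start=1):
        if auc >= target:
            return t
    return None

ece = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
hit = rounds_at_auc([0.80, 0.86, 0.88, 0.87], target=0.87)
miss = rounds_at_auc([0.50, 0.60], target=0.90)
```

In the toy call, both occupied bins are off by 0.1, so the ECE is 0.1, and the target 0.87 is first reached in round 3.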

Why robust normalization matters in this setup. Heavy-tailed accounting ratios can strongly distort optimization dynamics: A few extreme values may dominate gradients and inflate curvature estimates. Figure 2 illustrates this effect on a representative feature from (D1) by plotting empirical quantiles before and after applying the robust standardization (11). After normalization, the interquartile region is stabilized, and extreme tails are compressed, improving the conditioning of both local updates and curvature sketches.

5.2. Main Results

This subsection evaluates whether FedQuAD improves communication-round efficiency and predictive quality for federated default prediction. We focus on two questions: (i) does FedQuAD reach target AUC and calibration in fewer rounds than representative first-order and adaptive baselines, and (ii) are the gains consistent under severe client heterogeneity (Dirichlet label skew + covariate segmentation) and partial participation (

25 %

per round)? Our main experimental results correspond to secure aggregation without DP noise; DP-enabled runs introduce additional Gaussian noise calibrated to the clipping norm and are reported separately when applicable.

Logistic regression on (D1): discrimination, calibration, and rounds-to-target. Table 3 reports the final test performance at

T = 200

rounds for logistic regression. FedQuAD attains the best mean AUC and the lowest calibration error (ECE), while substantially reducing the number of rounds required to reach the target validation AUC. The improvement in ECE is particularly important for credit risk, where predicted probabilities directly enter pricing, provisioning, and capital models.

Convergence curves in rounds. Figure 3 shows validation AUC versus communication rounds on (D1). The curve for FedQuAD rises sharply early and saturates sooner, consistent with the sketch-space quasi-Newton correction (Section 4.3) improving effective conditioning along

span (S^{t})

. No legend is shown; curve identities are described in the caption. Notably, our convergence analysis assumes smoothness and (locally) strong convexity for clarity. Deep tabular models used in practice are nonconvex, and financial ratios may violate sub-Gaussian assumptions. However, empirical loss landscapes in calibrated credit models are often locally well-behaved near minimizers, especially under regularization. In practice, when strong convexity is not satisfied globally, we recommend moderate damping (

ρ

), conservative step sizes, and monitoring of calibration metrics rather than relying solely on theoretical rates. The theoretical results should therefore be interpreted as stability guarantees under idealized conditions rather than exact performance predictors for real accounting data. Formal convergence guarantees for FedQuAD are provided in Appendix A.

Calibration behavior. To understand the calibration gains, Figure 4 plots validation ECE versus rounds on (D1). FedQuAD reduces ECE more quickly and to a lower level than baselines, indicating that the faster optimization does not come at the cost of unstable probability estimates. We attribute this to three interacting factors: robust normalization (11) that stabilizes feature scales, proximal drift control (14) that limits overfitting to local rare events, and curvature-aware correction (Section 4.3) that improves the conditioning of global steps.

Robustness across heterogeneity severity. Table 4 evaluates logistic regression on (D1) under varying Dirichlet concentration (smaller means more heterogeneous). FedQuAD consistently reduces rounds-to-target, and the gap widens as heterogeneity increases, which is consistent with its explicit drift control and curvature scaling. For each concentration, we keep

K = 20

and

| S_{t} | = 5

and retune only the client stepsize on a small grid for fairness.

Overall, the main results show that FedQuAD improves both discrimination and calibration while dramatically reducing communication rounds required to reach target AUC. The next subsections study nonconvex tabular models, end-to-end communication costs, and ablations of key FedQuAD components.

5.3. Deep Tabular Model and Communication Efficiency

We next evaluate FedQuAD on a nonconvex deep tabular model and quantify end-to-end communication efficiency. The goal is twofold: (i) verify that sketch-space curvature correction remains beneficial when the curvature is approximate (Gauss–Newton/empirical Fisher) and the loss landscape is nonconvex, and (ii) assess whether the additional curvature payload is offset by fewer rounds-to-target, resulting in lower bytes-to-target.

Tabular MLP performance on (D2). We train a two-hidden-layer MLP (256, 128) with ReLU activations using the loss (1). Table 5 reports test AUC, Brier score, ECE, and rounds required to reach validation AUC

\geq 0.90

on (D2). FedQuAD achieves the best AUC and calibration while requiring substantially fewer communication rounds than baselines that use only first-order information.

Figure 5 plots validation AUC versus rounds on (D2). No legend is used; the curve identities are specified in the caption. FedQuAD again exhibits a steeper early rise, indicating that sketch-space preconditioning accelerates progress even when curvature is approximate and clients perform nonconvex local optimization.

Communication accounting and bytes-to-target. FedQuAD transmits

(Δ_{k}^{t}, g_{k, s}^{t}, C_{k}^{t})

per participating client. With sketch size

m = 64

,

g_{k, s}^{t}

contributes m scalars, and

C_{k}^{t}

contributes

m (m + 1) / 2 = 2080

scalars (upper triangle). Table 6 reports per-client uplink per round and bytes-to-target for reaching AUC

\geq 0.87

on (D1) using logistic regression. Although FedQuAD adds a small curvature payload, it substantially reduces rounds-to-target, leading to lower overall transmitted bytes. Furthermore, secure aggregation typically introduces a small constant number of additional communication phases plus auxiliary key material, but the dominant term remains transmitting the masked update vectors/matrices themselves [19]. The optional client-level DP mechanism is implemented through local clipping and server-side noise addition (Section 4.4) and does not change message dimensionality; it affects accuracy rather than bytes. Therefore, while absolute bytes depend on the concrete secure-aggregation/DP instantiation, the relative bytes-to-target trade-offs are still governed by the

O (P)

vs.

O (P + m^{2})

payload scaling and the reduced rounds-to-target.
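The payload arithmetic above can be made explicit with a short sketch. The function names are hypothetical, and the parameter count used in the example is an assumption: a (256, 128) MLP on 23 raw inputs with one output logit and biases has 39,169 parameters, which may differ from the encoding used in the experiments.

```python
def uplink_scalars(P: int, m: int) -> dict:
    """Per-client, per-round uplink in scalars: delta, projected gradient, sketch."""
    curvature = m * (m + 1) // 2            # upper triangle of a symmetric m x m matrix
    return {"delta": P, "proj_grad": m, "curvature": curvature,
            "total": P + m + curvature}

def break_even_round_ratio(P: int, m: int) -> float:
    """Fraction of a first-order baseline's rounds below which total bytes are lower."""
    return P / uplink_scalars(P, m)["total"]

# Assumed MLP size: 23*256 + 256 + 256*128 + 128 + 128 + 1 parameters.
P_mlp = 23 * 256 + 256 + 256 * 128 + 128 + 128 + 1
payload = uplink_scalars(P_mlp, m=64)
ratio = break_even_round_ratio(P_mlp, m=64)
```

Under these assumptions the curvature channel adds 2080 scalars to a 39,169-scalar delta, so the curvature-aware method transmits fewer total bytes whenever it needs less than about 95% of the baseline's rounds. For very small models, where P is comparable to m, the same formula shows the sketch can dominate the payload.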

Where the savings come from. To make the trade-off explicit, Figure 6 plots cumulative uplink bytes (per participating client) versus achieved validation AUC on (D1). No legend is shown; the caption describes curve identities. FedQuAD attains a given AUC at substantially lower cumulative bytes because its AUC rises rapidly in early rounds. This metric is often more actionable than per-round payload in cross-silo deployments, where the dominant cost is total secure-computation and transmission volume to reach a target model quality.

Sensitivity to sketch dimension. Finally, Table 7 studies the impact of sketch dimension m on (D1) using logistic regression. As m increases, rounds-to-target improve until diminishing returns set in, while per-round uplink increases quadratically in m. The default

m = 64

offers a favorable trade-off. In this regime, increasing m beyond 64 yields only marginal gains in rounds-to-target while increasing overhead quadratically; thus

m = 64

is a sensible default for similar tabular credit tasks. If communication is very constrained,

m = 32

still provides most of the benefit; if rounds are extremely expensive and bandwidth allows,

m \in [96, 128]

can extract the last few rounds of improvement.

Overall, the deep-model results confirm that FedQuAD’s curvature sketching remains effective beyond convex objectives, and communication accounting shows that the additional curvature channel is typically offset by large reductions in rounds-to-target and total transmitted bytes.

5.4. Ablation Study

This subsection isolates which components of FedQuAD contribute most to the observed reductions in communication rounds and improvements in calibration. We focus on logistic regression on (D1), where optimization effects are easier to interpret, and evaluate three ablations that remove one component at a time while keeping the remaining settings fixed: (i) NoCurv: remove the sketch-space quasi-Newton correction by setting

η_{q} = 0

in (13); (ii) NoVR: remove variance reduction by replacing (15) with minibatch SGD on the proximal objective (14); (iii) NoRobustNorm: replace robust normalization (11) with standard z-score normalization computed locally (mean/std per client). All variants use the same

K = 20

,

| S_{t} | = 5

,

E = 5

local steps per round, and are averaged over 5 random seeds.

End performance and rounds-to-target. Table 8 reports test AUC, ECE, and rounds-to-target. Removing curvature correction causes the largest increase in Rounds@0.87, confirming that sketch-space preconditioning is the main driver of communication efficiency. Removing variance reduction increases rounds and worsens calibration modestly, consistent with higher stochasticity under rare-event labels. Removing robust normalization produces both slower convergence and noticeably worse ECE, reflecting sensitivity to heavy-tailed accounting features.

Convergence and calibration trajectories. To understand when each component matters, Figure 7 plots validation AUC versus rounds, and Figure 8 plots validation ECE versus rounds. No legends are shown; curve identities are described in the captions. The NoCurv curve rises more slowly throughout training, especially in early rounds, indicating that curvature correction accelerates global progress from the start. The NoRobustNorm curve is unstable early (slow AUC rise and higher ECE), suggesting that heavy-tailed features degrade both local optimization and curvature estimation. NoVR exhibits intermediate behavior: AUC approaches FedQuAD eventually but requires more rounds, and ECE remains higher, consistent with noisier local updates under imbalance.

Step statistics and stability. Ablations also differ in update stability. Table 9 reports (i) the mean

ℓ_{2}

norm of client deltas

∥ Δ_{k}^{t} ∥_{2}

and (ii) the fraction of rounds in which the validation AUC decreases relative to the previous round (a simple instability indicator). Removing robust normalization increases update magnitudes and instability; removing variance reduction modestly increases both; removing curvature does not directly increase instability (it primarily slows progress), which aligns with the curvature term acting as a structured, damped correction in sketch space.
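The instability indicator in (ii) is simple enough to state in code. In this minimal sketch the function name is hypothetical and the fraction is taken over the T - 1 round-to-round transitions, which is an assumption about the denominator.

```python
import numpy as np

def auc_decrease_fraction(val_auc_by_round):
    """Share of round-to-round transitions in which validation AUC goes down."""
    auc = np.asarray(val_auc_by_round, dtype=float)
    return float(np.mean(np.diff(auc) < 0))    # needs at least two rounds

frac = auc_decrease_fraction([0.80, 0.85, 0.84, 0.86, 0.86])
```

In the toy curve only the second transition is a decrease, giving one decrease out of four transitions; ties are not counted as decreases.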

Component interactions. Finally, Figure 9 summarizes the rounds-to-target for each variant and highlights interactions: Curvature correction provides the largest gain alone, but the best performance requires combining it with robust normalization and variance reduction, which improve stability and calibration and enable the curvature channel to operate on better-conditioned features.

Overall, the ablation study confirms that the sketch-space curvature correction is the primary driver of round efficiency, while robust normalization and variance reduction play crucial supporting roles by improving conditioning, reducing stochasticity under imbalance, and stabilizing probability calibration.

5.5. Discussions

Across heterogeneous cross-silo partitions, FedQuAD consistently reduces rounds-to-target relative to first-order baselines while maintaining competitive AUC and improved calibration stability, with the sketch-space quasi-Newton correction being especially beneficial under stronger label skew and covariate shift, where purely first-order aggregation is more vulnerable to drift and curvature mismatch. The additional communication cost from the

m \times m

sketch remains moderate for practical choices such as

m \leq 64

, so bytes-to-target improvements persist in this regime. At the same time, our results should be interpreted with appropriate caution: All experiments are conducted on public credit-risk-style datasets with simulated cross-silo (multi-institution) partitions, where heterogeneity is induced through controlled label skew and covariate shift rather than observed from raw proprietary bank data. Although this setup captures realistic statistical heterogeneity patterns, it does not fully represent operational constraints such as compliance workflows, asynchronous participation, or institution-specific feature engineering, so validation on live inter-institution deployments remains future work. Our primary threat model assumes an honest-but-curious server and honest clients, and unless otherwise stated, all reported experiments assume secure aggregation without additional client-level differential privacy (DP) noise, meaning that the server observes only aggregated masked updates rather than individual client messages. However, aggregated curvature sketches are still second-order summary statistics and may reveal distribution-level structure when participation is small, so practical deployments should combine secure aggregation with minimum participation thresholds and, where required, client-level DP. For reproducibility, we clarify that for deep tabular models the curvature sketch corresponds to a Gauss–Newton/empirical Fisher-type approximation obtained by projecting per-client gradients into the shared sketch subspace and aggregating outer products, yielding an

m \times m

matrix that approximates

{(S^{t})}^{⊤} \nabla^{2} F (w^{t}) S^{t}

up to stochastic and heterogeneity noise. Unless otherwise stated, we use sketch dimension

m = 64

and damping parameter

ρ = 10^{- 3}

in Equation (20); the ridge term

ρ I

stabilizes inversion under limited participation or poor conditioning, and all remaining hyperparameters, including m,

ρ

, proximal weight

μ

, and adaptive drift thresholds, are summarized in Table 2.

6. Conclusions

This paper investigated how curvature-aware optimization can be incorporated into federated learning for credit default prediction while remaining communication-efficient and compatible with privacy constraints. We proposed FedQuAD, which integrates proximal stabilization, variance reduction, and a sketch-based quasi-Newton server correction to address the challenges of heterogeneous, imbalanced financial data across institutions. Empirical results on heterogeneous credit-risk benchmarks with simulated cross-silo partitions show that the proposed approach can significantly reduce the number of communication rounds required to reach target predictive performance while maintaining competitive AUC and stable calibration. These results suggest that compressed second-order information, when incorporated through randomized sketching, can provide practical benefits for collaborative financial modeling without requiring the transmission of full curvature matrices.

Despite these promising results, several limitations should be acknowledged. The experiments rely on public datasets with simulated institutional partitions rather than proprietary multi-bank accounting data, and the theoretical guarantees rely on standard smoothness and convexity assumptions that may only approximately hold for real financial data. In addition, the reported results assume secure aggregation without additional differential privacy noise, and practical deployments may require stronger privacy mechanisms depending on regulatory constraints. Future research should therefore investigate stress-test scenarios with extreme heterogeneity or limited communication budgets, explore adaptive strategies for selecting sketch dimensions and other hyperparameters, and validate curvature-aware federated optimization in real multi-institution financial environments.

Author Contributions

Conceptualization, D.B., M.W. and Q.W.; Investigation, D.B. and M.W.; Supervision, Q.W.; Project administration, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available at https://archive.ics.uci.edu/dataset/365/polish+companies+bankruptcy+data (accessed on 20 May 2025) and https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients (accessed on 8 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Convergence Guarantees for FedQuAD

This appendix provides formal statements and proofs for the convergence properties referenced in Section 4.2. We analyze an idealized version of the FedQuAD update in order to isolate the role of the sketch-space quasi-Newton correction. The analysis follows standard smoothness-based descent arguments commonly used in federated optimization.

Appendix A.1. Notation

Let

F (w)

denote the global federated objective defined in Equation (3). Let

g^{t} = \nabla F (w^{t})

denote the gradient evaluated at round t. FedQuAD applies the server update

w^{t + 1} = w^{t} - η_{g} g^{t} - η_{q} P^{t} g^{t},

(A1)

where

η_{g}

is the gradient step size,

η_{q}

is the curvature correction step size, and

P^{t} = S^{t} {(C^{t} + ρ I)}^{- 1} {(S^{t})}^{⊤}

(A2)

is the lifted quasi-Newton preconditioner constructed from the shared sketch matrix

S^{t}

and aggregated curvature sketch

C^{t}

.

Appendix A.2. Strongly Convex Convergence

Theorem A1

(Strongly convex convergence). Assume that the global objective F is L-smooth and μ-strongly convex.

Assume further that

1. All clients participate in each round.
2. The aggregated local update equals a gradient step: Δ^{t} = - η_{g} \nabla F (w^{t}).
3. The curvature sketch satisfies C^{t} = {(S^{t})}^{⊤} \nabla^{2} F (w^{t}) S^{t}.

If the step sizes satisfy

η_{g} \leq \frac{1}{2 L}, η_{q} \leq \frac{ρ}{2 L},

then the FedQuAD iterates satisfy

F (w^{t + 1}) \leq F (w^{t}) - \frac{η_{g}}{2} {∥ \nabla F (w^{t}) ∥}^{2} - \frac{η_{q}}{2} {(g_{s}^{t})}^{⊤} {(C^{t} + ρ I)}^{- 1} g_{s}^{t}

where

g_{s}^{t} = {(S^{t})}^{⊤} \nabla F (w^{t})

. Consequently,

F (w^{T}) - F^{★} \leq {(1 - μ η_{g})}^{T} (F (w^{0}) - F^{★}) .

Proof.

Let

g^{t} = \nabla F (w^{t})

. By the L-smoothness of F, the descent lemma gives

F (w^{t + 1}) \leq F (w^{t}) + 〈 g^{t}, w^{t + 1} - w^{t} 〉 + \frac{L}{2} {∥ w^{t + 1} - w^{t} ∥}^{2} .

Substituting the FedQuAD update Equation (A1) yields

w^{t + 1} - w^{t} = - η_{g} g^{t} - η_{q} P^{t} g^{t} .

The linear term becomes

〈 g^{t}, w^{t + 1} - w^{t} 〉 = - η_{g} {∥ g^{t} ∥}^{2} - η_{q} {(g^{t})}^{⊤} P^{t} g^{t} .

The quadratic term satisfies

∥ η_{g} g^{t} + η_{q} P^{t} g^{t} ∥^{2} \leq 2 η_{g}^{2} ∥ g^{t} ∥^{2} + 2 η_{q}^{2} {∥ P^{t} g^{t} ∥}^{2} .

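Recall that

P^{t} = S^{t} {(C^{t} + ρ I)}^{- 1} {(S^{t})}^{⊤}

, where the columns of S^{t} are orthonormal and C^{t} ⪰ 0 (which holds here because F is convex). Hence

{(C^{t} + ρ I)}^{- 1} ⪯ ρ^{- 1} I

, and we obtain

∥ P^{t} ∥ \leq \frac{1}{ρ} .

Since P^{t} is symmetric positive semidefinite, {(P^{t})}^{2} ⪯ ∥ P^{t} ∥ P^{t}, and therefore

∥ P^{t} g^{t} ∥^{2} \leq \frac{1}{ρ} {(g^{t})}^{⊤} P^{t} g^{t} .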

Combining these bounds yields

F (w^{t + 1}) \leq F (w^{t}) - (η_{g} - L η_{g}^{2}) {∥ g^{t} ∥}^{2} - (η_{q} - \frac{L η_{q}^{2}}{ρ}) {(g^{t})}^{⊤} P^{t} g^{t} .

Under the step-size conditions

η_{g} \leq 1 / (2 L)

and

η_{q} \leq ρ / (2 L)

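, each coefficient is at least half of the corresponding step size. Since {(g^{t})}^{⊤} P^{t} g^{t} = {(g_{s}^{t})}^{⊤} {(C^{t} + ρ I)}^{- 1} g_{s}^{t}, we obtain the stated descent inequality.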

Finally, strong convexity implies

∥ \nabla F (w^{t}) ∥^{2} \geq 2 μ (F (w^{t}) - F^{★}) .

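Substituting this into the descent inequality and dropping the nonnegative curvature term gives $F (w^{t + 1}) - F^{★} \leq (1 - μ η_{g}) (F (w^{t}) - F^{★})$; applying this recursively over T rounds yields the linear convergence bound. □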
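As an illustrative sanity check of Theorem A1 (not part of the original analysis), the following sketch evaluates the descent inequality and the linear rate on a toy diagonal quadratic. The objective, the Hessian values `h`, the coordinate-selection sketch (orthonormal columns, so $∥ S^{t} ∥ = 1$), and all function names are assumptions made for this example only.

```python
# Assumed toy setup: F(w) = 0.5 * sum_i h[i] * w[i]**2, so F* = 0 at w = 0,
# L = max(h) and mu = min(h). The sketch S selects the first m coordinates
# (orthonormal columns), hence C = S^T H S = diag(h[:m]).
h = [4.0, 2.0, 1.0, 0.5]      # hypothetical Hessian diagonal
L, mu = max(h), min(h)
m, rho = 2, 1e-3              # sketch size and damping
eta_g = 1.0 / (2.0 * L)       # step-size conditions from the theorem
eta_q = rho / (2.0 * L)


def objective(w):
    return 0.5 * sum(hi * wi * wi for hi, wi in zip(h, w))


def fedquad_step(w):
    """Idealized update w - eta_g * g - eta_q * P g, plus the guaranteed decrease."""
    g = [hi * wi for hi, wi in zip(h, w)]
    # P g is nonzero only on the sketched coordinates: g_i / (h_i + rho).
    pg = [g[i] / (h[i] + rho) if i < m else 0.0 for i in range(len(w))]
    w_next = [wi - eta_g * gi - eta_q * pi for wi, gi, pi in zip(w, g, pg)]
    grad_sq = sum(gi * gi for gi in g)
    curv = sum(g[i] * pg[i] for i in range(m))  # g_s^T (C + rho I)^{-1} g_s
    decrease = 0.5 * eta_g * grad_sq + 0.5 * eta_q * curv
    return w_next, decrease


def run(w0, T):
    w, values, descent_ok = list(w0), [objective(w0)], True
    for _ in range(T):
        w_next, decrease = fedquad_step(w)
        # Per-round descent inequality, with a small floating-point tolerance.
        descent_ok = descent_ok and objective(w_next) <= objective(w) - decrease + 1e-12
        w = w_next
        values.append(objective(w))
    return values, descent_ok


values, descent_ok = run([1.0, -2.0, 3.0, -4.0], T=50)
# Linear rate: F(w^t) - F* <= (1 - mu * eta_g)^t (F(w^0) - F*), with F* = 0.
rate_ok = all(values[t] <= (1 - mu * eta_g) ** t * values[0] + 1e-12 for t in range(51))
print(descent_ok, rate_ok)
```

Under these assumptions both flags evaluate to `True`. The check says nothing about stochastic gradients, partial participation, or sketches with $∥ S^{t} ∥ > 1$.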

Appendix A.3. Nonconvex Stationarity Guarantee

Proposition A1 (Nonconvex convergence). Assume that F is L-smooth and bounded below by $F_{inf}$. Under the same idealized assumptions used in Theorem A1 and with $η_{g} \leq 1 / (2 L)$, the FedQuAD iterates satisfy

\frac{1}{T} \sum_{t = 0}^{T - 1} {∥ \nabla F (w^{t}) ∥}^{2} \leq \frac{2 (F (w^{0}) - F_{inf})}{η_{g} T} .

Proof. Dropping the nonnegative curvature term from the descent inequality established in the proof of Theorem A1, we obtain

F (w^{t + 1}) \leq F (w^{t}) - \frac{η_{g}}{2} {∥ \nabla F (w^{t}) ∥}^{2} .

Summing this inequality for $t = 0, \dots, T - 1$ gives

F (w^{T}) \leq F (w^{0}) - \frac{η_{g}}{2} \sum_{t = 0}^{T - 1} {∥ \nabla F (w^{t}) ∥}^{2} .

Using the lower bound $F (w^{T}) \geq F_{inf}$ yields

\frac{η_{g}}{2} \sum_{t = 0}^{T - 1} {∥ \nabla F (w^{t}) ∥}^{2} \leq F (w^{0}) - F_{inf} .

Dividing both sides by $η_{g} T / 2$ gives the stated result. □
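The stationarity bound can be checked numerically on a simple smooth nonconvex function. The sketch below is illustrative only: it assumes the test function $F (w) = \sum_{i} w_{i}^{2} / (1 + w_{i}^{2})$ (for which L = 2 and $F_{inf} = 0$), and it sets the curvature step to zero ($η_{q} = 0$), so it exercises only the gradient part of the update. The starting point and function names are hypothetical.

```python
L = 2.0                  # |f''(x)| <= 2 for f(x) = x^2 / (1 + x^2)
F_INF = 0.0              # the test function is bounded below by 0
eta_g = 1.0 / (2.0 * L)  # step-size condition from Proposition A1


def objective(w):
    return sum(x * x / (1.0 + x * x) for x in w)


def gradient(w):
    # d/dx [x^2 / (1 + x^2)] = 2x / (1 + x^2)^2
    return [2.0 * x / (1.0 + x * x) ** 2 for x in w]


def average_grad_sq(w0, T):
    """Run T gradient steps and return (1/T) * sum_t ||grad F(w^t)||^2."""
    w, total = list(w0), 0.0
    for _ in range(T):
        g = gradient(w)
        total += sum(gi * gi for gi in g)
        w = [wi - eta_g * gi for wi, gi in zip(w, g)]
    return total / T


w0 = [3.0, -1.5, 0.7]    # hypothetical starting point
T = 200
lhs = average_grad_sq(w0, T)
rhs = 2.0 * (objective(w0) - F_INF) / (eta_g * T)
print(lhs <= rhs)
```

The comparison holds for any starting point because the simplified update is plain gradient descent with step size $1 / (2 L)$.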


Figure 1. Client-level default prevalence on (D1) under the Dirichlet label-skew partition (concentration 0.3) with additional covariate segmentation. Bars show the fraction of $y = 1$ per client; the large spread motivates drift control and variance reduction in Section 4.2.


Figure 2. Effect of robust federated standardization (11) on a representative heavy-tailed accounting feature from (D1). The blue line shows empirical quantiles before normalization; the red line shows quantiles after median/IQR scaling (tails compressed, interquartile region stabilized).

Figure 3. Validation AUC vs. rounds on (D1) for four methods (no legend). The blue line is FedQuAD; the red line is FedAvg; the black and brown lines correspond to FedProx and FedAdam (in increasing order). FedQuAD reaches the target regime in substantially fewer rounds.

Figure 4. Validation ECE vs. rounds on (D1) for four methods (no legend). The blue line is FedQuAD; the red line is FedAvg; the remaining curves are FedProx and FedAdam. FedQuAD improves calibration faster and achieves lower final ECE.

Figure 5. Validation AUC vs. rounds on (D2) for four methods (no legend). The blue line is FedQuAD; the red line is FedAvg; the black and brown lines correspond to FedProx and FedAdam (in increasing order). FedQuAD reaches AUC $\geq 0.90$ much earlier.


Figure 6. Validation AUC vs. cumulative uplink bytes (per participating client) on (D1) for three methods (no legend). The blue line is FedQuAD, the red line is FedAdam, and the brown line is FedAvg. Despite a slightly larger per-round payload, FedQuAD reaches high AUC at substantially lower total transmitted bytes.

Figure 7. Ablation: validation AUC vs. rounds on (D1) (no legend). Curves (top to bottom at $t = 200$): FedQuAD, NoVR, NoCurv, NoRobustNorm. Removing the curvature correction slows convergence most, especially in early rounds.


Figure 8. Ablation: validation ECE vs. rounds on (D1) (no legend). Curves (lowest to highest at $t = 200$): FedQuAD, NoVR, NoCurv, NoRobustNorm. Robust normalization and variance reduction both contribute to improved calibration, while curvature correction primarily accelerates convergence.


Figure 9. Rounds-to-target comparison on (D1) (no legend). Bars show Rounds@0.87 for FedQuAD and three ablations. Removing curvature increases rounds most; removing robust normalization increases rounds and harms calibration; removing variance reduction yields intermediate degradation.

Table 1. Dataset statistics after preprocessing. “Pos. rate” is the global fraction of $y = 1$. “Miss.” is the fraction of missing entries prior to imputation.


Dataset	Samples	Features	Pos. Rate	Miss.	Task
(D1) Polish bankruptcy	$10,503$	64	$0.052$	$0.087$	corporate bankruptcy
(D2) Credit card default	$30,000$	23	$0.221$	$0.000$	consumer default

Table 2. Training configuration used by default in all experiments (unless otherwise noted).

Component	Setting	Notes
Clients	$K = 20$, $|S_{t}| = 5$	$25\%$ participation
Rounds	$T = 200$	synchronous rounds
Local steps	$E = 5$	equalized across methods
Minibatch	256	per local step
Loss	(1) with $α$ tuned	handles imbalance
Regularization	$λ = 10^{- 4}$	logistic regression
FedQuAD (local)	$η_{ℓ} = 0.05$, $μ_{p} = 0.1$	Prox-SVRG steps
FedQuAD (sketch)	$m = 64$, $ρ = 10^{- 3}$, $η_{q} = 0.5$	curvature correction
Robust norm	(11) with $τ = 10^{- 3}$	quantile-based
FedQuAD (adaptive)	$r_{max} = 0.05$, $γ = 2$	drift-control thresholds

Table 3. Logistic regression results on (D1) (mean ± std over 5 seeds). “Rounds@0.87” is the number of rounds required to reach validation AUC $\geq 0.87$.


Method	AUC ↑	Brier ↓	ECE ↓	Rounds@0.87 ↓
FedAvg	$0.871 \pm 0.004$	$0.146 \pm 0.003$	$0.041 \pm 0.004$	$120 \pm 11$
FedProx	$0.874 \pm 0.003$	$0.144 \pm 0.003$	$0.039 \pm 0.003$	$102 \pm 10$
SCAFFOLD	$0.878 \pm 0.003$	$0.141 \pm 0.002$	$0.036 \pm 0.003$	$78 \pm 9$
FedNova	$0.876 \pm 0.004$	$0.143 \pm 0.003$	$0.037 \pm 0.003$	$91 \pm 12$
FedAdam	$0.880 \pm 0.003$	$0.140 \pm 0.002$	$0.035 \pm 0.003$	$66 \pm 7$
FedQuAD	$0.883 \pm 0.002$	$0.137 \pm 0.002$	$0.027 \pm 0.002$	$35 \pm 6$
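The “Rounds@0.87” metric can be computed from a per-round validation trace as in the minimal sketch below. The 1-indexed round convention, the first-crossing rule (rather than a sustained crossing), the helper name `rounds_to_target`, and the example trace are assumptions for illustration and are not taken from the experiments.

```python
from typing import Optional, Sequence


def rounds_to_target(auc_by_round: Sequence[float], target: float) -> Optional[int]:
    """Return the first 1-indexed round with validation AUC >= target, or None."""
    for round_index, auc in enumerate(auc_by_round, start=1):
        if auc >= target:
            return round_index
    return None


# Hypothetical validation-AUC trace, one value per communication round.
trace = [0.81, 0.84, 0.868, 0.871, 0.869, 0.875]
print(rounds_to_target(trace, 0.87))   # first crossing happens in round 4
print(rounds_to_target(trace, 0.90))   # target never reached, so None
```

Averaging this quantity over seeds gives a mean ± std entry of the kind reported in the last column.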

Table 4. Heterogeneity sweep on (D1) with logistic regression. Smaller Dirichlet concentration implies stronger label skew. Reported values are mean ± std over 5 seeds.

Dirichlet Concentration	Method	AUC at $T = 200$ ↑	Rounds@0.87 ↓
0.5	FedAdam	$0.881 \pm 0.003$	$58 \pm 7$
0.5	FedQuAD	$0.884 \pm 0.002$	$31 \pm 5$
0.3	FedAdam	$0.880 \pm 0.003$	$66 \pm 7$
0.3	FedQuAD	$0.883 \pm 0.002$	$35 \pm 6$
0.1	FedAdam	$0.874 \pm 0.004$	$92 \pm 10$
0.1	FedQuAD	$0.879 \pm 0.003$	$48 \pm 8$

Table 5. Tabular MLP results on (D2) (mean ± std over 5 seeds). “Rounds@0.90” is the number of rounds required to reach validation AUC $\geq 0.90$.


Method	AUC ↑	Brier ↓	ECE ↓	Rounds@0.90 ↓
FedAvg	$0.901 \pm 0.003$	$0.129 \pm 0.002$	$0.031 \pm 0.003$	$148 \pm 13$
FedProx	$0.903 \pm 0.003$	$0.128 \pm 0.002$	$0.029 \pm 0.003$	$131 \pm 12$
SCAFFOLD	$0.906 \pm 0.003$	$0.126 \pm 0.002$	$0.026 \pm 0.002$	$104 \pm 10$
FedAdam	$0.908 \pm 0.002$	$0.125 \pm 0.002$	$0.024 \pm 0.002$	$86 \pm 9$
FedQuAD	$0.912 \pm 0.002$	$0.122 \pm 0.001$	$0.018 \pm 0.002$	$49 \pm 7$

Table 6. Communication cost on (D1) for reaching AUC $\geq 0.87$. Payload excludes protocol overhead and assumes 32-bit floats; only the upper triangle of $C_{k}^{t}$ is transmitted.


Method	Uplink/Client/Round (KB) ↓	Rounds@0.87 ↓	Bytes-to-Target (MB) ↓
FedAvg	≈256.0	120	≈153.6
FedAdam	≈256.0	66	≈84.5
FedQuAD	≈264.6	35	≈46.4
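The bytes-to-target column appears consistent with multiplying the per-client payload by the number of rounds and by the 5 participating clients per round listed in Table 2. That reading is an inference from the numbers rather than something the caption states, and the helper name and the decimal convention (1 MB = 1000 KB) in the sketch below are likewise assumptions.

```python
def bytes_to_target_mb(uplink_kb_per_client: float, rounds: int, clients_per_round: int) -> float:
    """Total uplink in MB (1 MB = 1000 KB), summed over participating clients."""
    return uplink_kb_per_client * rounds * clients_per_round / 1000.0


# Payloads (KB) and rounds copied from Table 6; 5 clients per round from Table 2.
table6 = {"FedAvg": (256.0, 120), "FedAdam": (256.0, 66), "FedQuAD": (264.6, 35)}
totals = {name: round(bytes_to_target_mb(kb, r, 5), 1) for name, (kb, r) in table6.items()}
print(totals)
```

With the rounded inputs this gives 153.6 MB for FedAvg and 84.5 MB for FedAdam, matching the table, and 46.3 MB for FedQuAD, which is within rounding of the reported ≈46.4.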

Table 7. Sketch-dimension sweep for FedQuAD on (D1) with logistic regression (mean ± std over 5 seeds). “Overhead” is the extra uplink per round beyond transmitting $Δ_{k}^{t}$ (32-bit floats, upper-triangular $C_{k}^{t}$).


Sketch Size m	AUC at $T = 200$ ↑	Rounds@0.87 ↓	Extra Uplink/Round (KB) ↓
32	$0.881 \pm 0.003$	$44 \pm 7$	≈2.3
64	$0.883 \pm 0.002$	$35 \pm 6$	≈8.6
96	$0.884 \pm 0.002$	$32 \pm 6$	≈18.8
128	$0.884 \pm 0.002$	$31 \pm 5$	≈33.8
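The quadratic growth of the overhead with m follows from the size of the upper triangle of an m × m matrix, which has $m (m + 1) / 2$ entries. The sketch below counts only that triangle at 4 bytes per entry with 1 KB = 1000 bytes; both the unit convention and the helper name are assumptions. Because it omits any other sketch quantities that may be transmitted, its values come out slightly below the figures in the table.

```python
def upper_triangle_kb(m: int, bytes_per_float: int = 4) -> float:
    """Size in KB (1 KB = 1000 bytes) of the upper triangle, diagonal included."""
    entries = m * (m + 1) // 2
    return entries * bytes_per_float / 1000.0


# Sketch sizes from Table 7.
sizes = {m: round(upper_triangle_kb(m), 1) for m in (32, 64, 96, 128)}
print(sizes)
```

Doubling m from 64 to 128 roughly quadruples the triangle, which is the same trend as the reported overhead (≈8.6 KB to ≈33.8 KB).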

Table 8. Ablation on (D1) with logistic regression (mean ± std over 5 seeds).

Variant	AUC ↑	ECE ↓	Rounds@0.87 ↓
FedQuAD	$0.883 \pm 0.002$	$0.027 \pm 0.002$	$35 \pm 6$
NoCurv	$0.879 \pm 0.003$	$0.033 \pm 0.003$	$58 \pm 8$
NoVR	$0.881 \pm 0.003$	$0.031 \pm 0.003$	$47 \pm 7$
NoRobustNorm	$0.878 \pm 0.004$	$0.038 \pm 0.004$	$61 \pm 10$

Table 9. Update magnitude and stability indicators on (D1), averaged over rounds $t \in [1, 200]$ and 5 seeds. “AUC drop rate” is the fraction of rounds where validation AUC decreases from the previous round.


Variant	$\mathbb{E} ∥ Δ_{k}^{t} ∥_{2}$ ↓	AUC Drop Rate ↓
FedQuAD	$1.00$	$0.18$
NoCurv	$1.01$	$0.20$
NoVR	$1.12$	$0.24$
NoRobustNorm	$1.28$	$0.31$
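The drop-rate indicator can be computed from a validation trace as sketched below. The strict-decrease rule, the choice of denominator (number of round-to-round transitions), the helper name `auc_drop_rate`, and the example trace are assumptions for illustration.

```python
from typing import Sequence


def auc_drop_rate(auc_by_round: Sequence[float]) -> float:
    """Fraction of round-to-round transitions where validation AUC strictly decreases."""
    if len(auc_by_round) < 2:
        return 0.0
    pairs = zip(auc_by_round, auc_by_round[1:])
    drops = sum(1 for previous, current in pairs if current < previous)
    return drops / (len(auc_by_round) - 1)


# Hypothetical trace: two of the five transitions are decreases.
trace = [0.80, 0.83, 0.82, 0.85, 0.85, 0.84]
print(auc_drop_rate(trace))
```

A lower value indicates a smoother validation curve, which is how the column is read in the comparison above.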


Share and Cite

MDPI and ACS Style

Bai, D.; WaEr, M.; Wu, Q. FedQuAD: Fast-Converging Curvature-Aware Federated Learning for Credit Default Prediction from Private Accounting Data. Mathematics 2026, 14, 1012. https://doi.org/10.3390/math14061012

