This section presents FedQuAD (Federated Quasi-Newton with Adaptive Drift control), a fast-converging FL method for default prediction on private accounting data. FedQuAD augments a stabilized local solver with a server-side curvature-aware correction computed in a low-dimensional sketch space of dimension $m$, adding only $O(m^2)$ extra uplink payload per client per round (beyond transmitting model deltas).
4.1. FedQuAD Protocol and Global Update Rule
FedQuAD targets fast communication-round convergence for minimizing the federated objective in (3). The key idea is to combine (i) the standard model-delta aggregation that captures broad progress from local training with (ii) a curvature-aware correction computed in a low-dimensional sketch space that approximates a Newton-like step along informative directions. This yields a single-round update that remains compatible with secure aggregation and adds only $O(m^2)$ uplink payload per client when the sketch dimension $m$ is much smaller than the number of model parameters $P$, i.e., $m \ll P$.
Round structure. At communication round $t$, the server holds the global model $w^t \in \mathbb{R}^P$ and samples a participating client set $\mathcal{S}_t$. The server also specifies a sketching matrix $Q_t \in \mathbb{R}^{P \times m}$ with orthonormal columns, used to define a shared $m$-dimensional subspace. To reduce server bandwidth, $Q_t$ is generated from a public random seed $s_t$ broadcast to clients; each client reconstructs $Q_t$ locally (the concrete construction is given in Section 4.3).
Each participating client $k \in \mathcal{S}_t$ computes three quantities at (or anchored to) the broadcast iterate $w^t$:
1. A stabilized local model delta $\Delta_k^t \in \mathbb{R}^P$, obtained by approximately minimizing a proximal, variance-reduced subproblem (detailed in Section 4.2). Intuitively, $\Delta_k^t$ captures useful descent directions under the client distribution while controlling drift.
2. A projected gradient $g_k^t = Q_t^\top \nabla f_k(w^t) \in \mathbb{R}^m$ in sketch space, which summarizes how the local objective $f_k$ changes along the shared subspace directions.
3. A curvature sketch $H_k^t \approx Q_t^\top \nabla^2 f_k(w^t)\, Q_t \in \mathbb{R}^{m \times m}$ in sketch space, where the approximation may use an exact Hessian for generalized linear models or a Gauss–Newton/empirical Fisher approximation for deep tabular models.
All three messages can be protected by secure aggregation (and optionally by client-level DP via clipping/noise; see Section 4.4).
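Weighted aggregation. Let $p_k^t \ge 0$ denote normalized weights over participating clients,
$$\sum_{k \in \mathcal{S}_t} p_k^t = 1.$$
The server forms the aggregated delta, projected gradient, and curvature sketch:
$$\bar{\Delta}^t = \sum_{k \in \mathcal{S}_t} p_k^t\, \Delta_k^t, \qquad \bar{g}^t = \sum_{k \in \mathcal{S}_t} p_k^t\, g_k^t, \qquad \bar{H}^t = \sum_{k \in \mathcal{S}_t} p_k^t\, H_k^t.$$
Compared with FedAvg-style aggregation (which uses only $\bar{\Delta}^t$), FedQuAD additionally aggregates $(\bar{g}^t, \bar{H}^t)$ to compute a preconditioned correction in the shared subspace.
The server computes a damped inverse in sketch space,
$$M^t = \left(\bar{H}^t + \lambda I_m\right)^{-1},$$
where $\lambda > 0$ is a damping parameter ensuring numerical stability even when $\bar{H}^t$ is ill-conditioned or rank-deficient. The quasi-Newton correction direction in parameter space is then
$$d^t = -\,Q_t\, M^t\, \bar{g}^t.$$
Finally, FedQuAD updates the global model by combining the aggregated local delta $\bar{\Delta}^t$ with the curvature-aware correction:
$$w^{t+1} = w^t + \bar{\Delta}^t + \eta\, d^t, \tag{13}$$
where $\eta \ge 0$ controls the correction strength. When $\eta = 0$, (13) reduces to a stabilized first-order method driven purely by local deltas. When $\eta > 0$, the term $\eta\, d^t$ acts like a Newton step restricted to $\mathrm{range}(Q_t)$, improving the effective conditioning along those directions and empirically reducing the rounds needed to reach target AUC and calibration.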
Per participating client, FedQuAD transmits the usual model delta $\Delta_k^t$ (same order as FedAvg) plus a projected gradient vector of size $m$ and a symmetric curvature sketch of size $m \times m$. Transmitting only the upper-triangular part yields $m(m+1)/2$ scalars. Thus, the additional uplink beyond $\Delta_k^t$ scales as $O(m^2)$, which is small relative to the $P$-dimensional delta when $m^2 \ll P$. On the server, the additional compute is dominated by inverting the $m \times m$ matrix in (12), i.e., $O(m^3)$ per round, which is negligible relative to client training for typical choices (e.g., the range $m = 32$ to $128$ discussed in Section 4.3).
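The server-side computation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `fedquad_server_step`, its argument names, and the default values of `lam` and `eta` are hypothetical, and a linear solve replaces the explicit damped inverse.

```python
import numpy as np

def fedquad_server_step(w, Q, deltas, grads, sketches, weights, lam=1e-2, eta=1.0):
    """One illustrative server update: aggregated delta plus sketched Newton-like correction.

    w: (P,) global model; Q: (P, m) sketch matrix with orthonormal columns.
    deltas: list of (P,) client deltas; grads: list of (m,) projected gradients;
    sketches: list of (m, m) curvature sketches; weights: unnormalized client weights.
    lam (damping) and eta (correction strength) are placeholder defaults.
    """
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                                        # normalize weights to sum to 1
    delta_bar = sum(pk * d for pk, d in zip(p, deltas))    # aggregated delta
    g_bar = sum(pk * g for pk, g in zip(p, grads))         # aggregated projected gradient
    H_bar = sum(pk * H for pk, H in zip(p, sketches))      # aggregated curvature sketch
    m = H_bar.shape[0]
    # Solve (H_bar + lam I) z = g_bar instead of forming the inverse explicitly.
    z = np.linalg.solve(H_bar + lam * np.eye(m), g_bar)
    d = -Q @ z                                             # lift the correction to parameter space
    return w + delta_bar + eta * d
```

With `eta=0` the function returns the plain weighted-delta update; with `Q` equal to the identity, zero deltas, and `lam=0`, it performs an exact Newton step on a quadratic objective.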
4.2. Client-Side Solver with Adaptive Drift Control
FedQuAD employs a client-side solver designed for two dominant difficulties in federated default prediction: (i) client drift caused by heterogeneous portfolios and non-IID accounting distributions, quantified by (7), and (ii) high gradient noise induced by rare defaults and heavy-tailed features, reflected in (8). To address both, each client approximately minimizes a proximal objective anchored at the broadcast model and uses a variance-reduced estimator to stabilize local steps. This subroutine produces the model delta $\Delta_k^t$ used in the aggregated update (12).
Given server model $w^t$, client $k$ defines the round-$t$ proximal objective
$$F_k^t(w) = f_k(w) + \frac{\mu}{2}\,\lVert w - w^t \rVert_2^2,$$
where $f_k$ is the local objective of client $k$ and $\mu \ge 0$ controls the strength of drift control. The proximal term penalizes deviations from $w^t$, reducing the tendency of local iterates to overfit to institution-specific patterns that may not generalize globally. This is particularly important when defaults are sparse on some clients: without anchoring, local updates can be dominated by a small number of rare events and lead to unstable directions.
Snapshot gradient and control variate. At the beginning of the round, client $k$ computes a snapshot gradient $\widehat{\nabla} f_k(w^t)$ at the broadcast point, either exactly (for moderate local sample size $n_k$) or using a sufficiently large minibatch to reduce noise. This snapshot gradient serves two roles: (i) it is used directly in the projected gradient message (item 2 in Section 4.1), and (ii) it forms a control variate for variance reduction during local optimization.
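Prox-SVRG local steps. Initialize $w_k^{t,0} = w^t$. For local step $e = 0, \dots, E-1$, sample a minibatch $\mathcal{B}_k^{t,e}$ and construct the variance-reduced (and proximal) gradient estimator
$$v_k^{t,e} = \nabla f_k\!\left(w_k^{t,e}; \mathcal{B}_k^{t,e}\right) - \nabla f_k\!\left(w^t; \mathcal{B}_k^{t,e}\right) + \widehat{\nabla} f_k(w^t) + \mu\left(w_k^{t,e} - w^t\right), \tag{15}$$
then take the update
$$w_k^{t,e+1} = w_k^{t,e} - \alpha\, v_k^{t,e},$$
where $\alpha > 0$ is the local step size. The estimator in (15) has the standard SVRG structure: it subtracts the minibatch gradient at the snapshot point $w^t$ and adds back the (near-)full gradient $\widehat{\nabla} f_k(w^t)$, thereby reducing the variance of minibatch gradients when $w_k^{t,e}$ stays near $w^t$ (which is encouraged by the proximal term). The additional proximal gradient $\mu\,(w_k^{t,e} - w^t)$ ensures that the local direction is consistent with descent on $F_k^t$ and provides explicit drift control. After $E$ steps, client $k$ returns the model delta
$$\Delta_k^t = w_k^{t,E} - w^t.$$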
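As an illustration of the Prox-SVRG local steps, the following sketch runs the variance-reduced proximal update for a logistic-regression client. It is a minimal example under assumptions: the helper names (`logistic_grad`, `prox_svrg_delta`) and all default hyperparameters (`mu`, `alpha`, `E`, `batch`) are hypothetical, and the snapshot gradient is computed exactly on the full local data.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Mean gradient of the logistic loss; labels y take values in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def prox_svrg_delta(w_t, X, y, mu=0.1, alpha=0.1, E=50, batch=32, seed=0):
    """Run E Prox-SVRG steps anchored at w_t; return (model delta, snapshot gradient)."""
    rng = np.random.default_rng(seed)
    snapshot = logistic_grad(w_t, X, y)       # full gradient at the broadcast point
    w = w_t.copy()
    for _ in range(E):
        idx = rng.choice(len(y), size=min(batch, len(y)), replace=False)
        Xb, yb = X[idx], y[idx]
        # SVRG control variate plus the gradient of the proximal term
        v = (logistic_grad(w, Xb, yb) - logistic_grad(w_t, Xb, yb)
             + snapshot + mu * (w - w_t))
        w = w - alpha * v
    return w - w_t, snapshot
```

The returned snapshot gradient can be reused for the projected gradient message, so no additional full pass over the data is needed.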
Adaptive drift control in practice. While $\mu$ can be fixed, FedQuAD supports a simple adaptive rule that increases anchoring when local updates appear unstable. Let $\tilde{\Delta}_k^t$ denote the candidate delta produced by a trial run with the current $\mu$. Client $k$ computes the normalized step magnitude
$$r_k^t = \frac{\lVert \tilde{\Delta}_k^t \rVert_2}{\lVert w^t \rVert_2 + \epsilon}, \tag{16}$$
with small $\epsilon > 0$. If $r_k^t$ exceeds a threshold $\tau$ (indicating aggressive drift) or if the local objective $F_k^t$ fails to decrease over the last few steps, the client increases anchoring (e.g., $\mu \leftarrow \gamma \mu$ with $\gamma > 1$) and reruns the remaining local steps. This adaptive mechanism is lightweight, requires no extra communication, and empirically reduces catastrophic local jumps on clients with extreme imbalance or outlier-heavy ratios.
The indicator $r_k^t$ in (16) measures the relative parameter change in a round and is intended as a simple, model-scale-invariant safety valve. In cross-silo credit settings where some institutions may have very few default events, we recommend a conservative cap $\tau$ and a small multiplicative increase factor $\gamma$. Unless otherwise noted, we use a single default setting of $\tau$ and $\gamma$ (with at most a few retries per round) to stabilize rare-event clients without adding communication or materially affecting typical clients.
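The adaptive rule can be sketched as a small retry loop. This is a simplified illustration, not the paper's exact procedure: the function `adaptive_mu` and the callable `run_local` are hypothetical, the sketch reruns the whole local solve rather than only the remaining steps, it checks only the relative-step criterion (not the objective-decrease criterion), and the values of the threshold and growth factor in the usage example are illustrative rather than the paper's defaults.

```python
import numpy as np

def adaptive_mu(run_local, w_t, mu, tau, gamma, max_retries=3, eps=1e-12):
    """Increase the proximal strength mu until the relative step is at most tau.

    run_local(mu) must return a candidate delta computed with anchoring strength mu.
    Returns (delta, mu, r) for the last attempt.
    """
    for attempt in range(max_retries + 1):
        delta = run_local(mu)
        r = np.linalg.norm(delta) / (np.linalg.norm(w_t) + eps)  # relative parameter change
        if r <= tau or attempt == max_retries:
            break
        mu *= gamma                                              # strengthen anchoring and retry
    return delta, mu, r

# Toy usage: for the local objective 0.5 * ||w - a||^2, the proximal minimizer
# gives delta = (a - w_t) / (1 + mu) in closed form.
w_t = np.array([1.0, 0.0])
a = np.array([3.0, 0.0])
delta, mu, r = adaptive_mu(lambda mu: (a - w_t) / (1.0 + mu), w_t, mu=1.0, tau=0.5, gamma=2.0)
```

In this toy run the anchoring strength is doubled twice (from 1 to 4) before the relative step falls below the threshold.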
In addition to $\Delta_k^t$, client $k$ already has the snapshot gradient $\widehat{\nabla} f_k(w^t)$ from (15), which is used to compute the projected gradient message $g_k^t$. Curvature sketch computation uses the same broadcast point $w^t$ (Section 4.3), ensuring all server-side quantities $(\bar{\Delta}^t, \bar{g}^t, \bar{H}^t)$ are aligned to the same iteration and thus can be coherently combined in (12).
In general, credit default prediction exhibits three structural properties: (i) rare positive events leading to gradient sparsity and instability, (ii) cross-institution feature-scale and distributional mismatch, and (iii) heavy-tailed accounting ratios. The proximal term mitigates client drift caused by heterogeneous default frequencies; variance reduction stabilizes local stochastic updates when positives are scarce; and the sketch-based quasi-Newton correction compensates for curvature mismatch across institutions without transmitting full Hessians. In combination, these mechanisms address financial heterogeneity at three levels: parameter stability (proximal), stochastic noise (variance reduction), and geometry mismatch (curvature correction).
4.3. Curvature Sketching and Quasi-Newton Correction
This subsection describes how FedQuAD constructs the shared sketch space, how clients compute curvature sketches with low overhead, and how the server uses the aggregated sketch to form a stable quasi-Newton correction $d^t$ in (13). The central principle is to approximate Newton preconditioning only in a low-dimensional subspace where curvature information can be communicated and inverted efficiently.
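At round $t$, the server samples a public random seed $s_t$ and defines
$$Q_t = \mathrm{qr}(G_t), \qquad G_t = G(s_t) \in \mathbb{R}^{P \times m}, \tag{17}$$
where $G_t$ is a pseudo-random matrix with i.i.d. entries generated from $s_t$ and $\mathrm{qr}(\cdot)$ denotes a thin QR factorization producing orthonormal columns. The server broadcasts $s_t$ (and $m$), and each client reconstructs the same $Q_t$ locally.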
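A minimal sketch of the seed-based construction is shown below. The function name `sketch_matrix` is hypothetical, and two details are assumptions made for illustration: the entries are drawn from a standard Gaussian, and all parties are assumed to use the same NumPy generator and linear-algebra backend so that the reconstruction is bit-for-bit identical.

```python
import numpy as np

def sketch_matrix(seed, P, m):
    """Rebuild the (P, m) sketch matrix with orthonormal columns from a public seed."""
    rng = np.random.default_rng(seed)            # same seed -> same pseudo-random draw
    G = rng.standard_normal((P, m))              # assumed Gaussian entries
    Q, _ = np.linalg.qr(G, mode="reduced")       # thin QR: Q has orthonormal columns
    return Q
```

Because only the seed and $m$ are broadcast, the downlink cost of the sketch space is independent of the model size.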
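Curvature model to be sketched. Client $k$ forms a symmetric positive semidefinite curvature approximation $C_k^t$ at the broadcast model $w^t$:
$$C_k^t \approx \nabla^2 f_k(w^t).$$
For generalized linear models (e.g., logistic regression with $\ell_2$ regularization), $C_k^t$ can be the exact empirical Hessian including the regularization term. For deep tabular models, we use a Gauss–Newton or empirical Fisher approximation, which is PSD and can be evaluated via Hessian-vector products (HVPs) without materializing a full $P \times P$ matrix. Given $C_k^t$, the curvature sketch is
$$H_k^t = Q_t^\top C_k^t\, Q_t + \rho I_m, \tag{18}$$
where $\rho \ge 0$ is a client-side ridge for robustness under noisy curvature. The client computes $Q_t^\top C_k^t Q_t$ using HVPs with the columns $q_1, \dots, q_m$ of $Q_t$:
$$u_j = C_k^t\, q_j, \qquad j = 1, \dots, m,$$
and fills the entries by inner products
$$\left[Q_t^\top C_k^t\, Q_t\right]_{ij} = q_i^\top u_j.$$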
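Projected gradient alignment. Curvature must be paired with a gradient in the same sketch space. Client $k$ computes the projected gradient at $w^t$:
$$g_k^t = Q_t^\top\, \widehat{\nabla} f_k(w^t),$$
using the snapshot gradient from Section 4.2. Both $H_k^t$ and $g_k^t$ are anchored at $w^t$, ensuring coherent aggregation across clients. The server aggregates sketches and projected gradients using normalized weights $p_k^t$:
$$\bar{H}^t = \sum_{k \in \mathcal{S}_t} p_k^t\, H_k^t, \qquad \bar{g}^t = \sum_{k \in \mathcal{S}_t} p_k^t\, g_k^t.$$
It then computes a stabilized inverse:
$$M^t = \left(\bar{H}^t + \lambda I_m\right)^{-1},$$
with server-side damping $\lambda > 0$, and forms the lifted quasi-Newton correction
$$d^t = -\,Q_t\, M^t\, \bar{g}^t.$$
This correction is combined with the aggregated local delta $\bar{\Delta}^t$ through (12).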
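For a logistic-regression client, the curvature sketch and projected gradient can be computed with $m$ Hessian-vector products and no dense parameter-space Hessian. The sketch below is illustrative only: the helper names (`logistic_hvp`, `client_sketch`) and the default values of `l2` and `rho` are hypothetical, and the gradient is computed exactly rather than from a minibatch snapshot.

```python
import numpy as np

def logistic_hvp(w, X, v, l2=0.0):
    """Hessian-vector product for mean logistic loss plus (l2/2)||w||^2, matrix-free."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    s = p * (1.0 - p)                                  # per-sample curvature weights
    return X.T @ (s * (X @ v)) / X.shape[0] + l2 * v

def client_sketch(w, X, y, Q, l2=0.0, rho=1e-3):
    """Return (projected gradient, ridge-stabilized curvature sketch) at w."""
    m = Q.shape[1]
    # One HVP per sketch direction: column j holds C q_j.
    U = np.column_stack([logistic_hvp(w, X, Q[:, j], l2) for j in range(m)])
    H = Q.T @ U                                        # entries q_i^T (C q_j)
    H = 0.5 * (H + H.T) + rho * np.eye(m)              # symmetrize, then add client-side ridge
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (p - y) / X.shape[0] + l2 * w         # exact local gradient at w
    return Q.T @ grad, H
```

Each HVP costs two matrix-vector products with the local data, so the total cost is linear in $m$ and never requires storing a $P \times P$ matrix.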
Cost and practical choices. Per client, computing (20) requires $m$ HVPs; for logistic regression, these can be implemented via matrix-vector products with local data, and for deep models, HVPs are supported by autodiff at a cost comparable to backpropagation. Communication per client includes one $m$-vector and one symmetric $m \times m$ matrix (upper triangle), i.e., $m + m(m+1)/2$ scalars. In practice, moderate $m$ (e.g., 32–128) yields most of the acceleration benefits while keeping overhead small. As the curvature payload scales as $O(m^2)$, a practical rule is to pick $m$ so that this term is a small fraction of the model-delta payload for the chosen model size $P$. For tabular models in our study, a value of $m$ in this moderate range is a robust default. We recommend starting small and increasing $m$ only if rounds-to-target improvements justify the quadratic communication increase; keeping $m$ modest is also beneficial when per-round participation is small, because the aggregated sketch $\bar{H}^t$ becomes noisier and the damping in (20) plays a larger role.
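The payload rule above is simple arithmetic, and the helper below makes it concrete. The function names are hypothetical, and the model size in the usage line is an arbitrary illustrative value, not one taken from the experiments.

```python
def extra_uplink_scalars(m):
    """Scalars sent beyond the model delta: an m-vector plus the upper triangle of an m x m matrix."""
    return m + m * (m + 1) // 2

def extra_uplink_fraction(m, P):
    """Extra payload as a fraction of a P-dimensional model delta."""
    return extra_uplink_scalars(m) / P

# Example: sketch dimensions at both ends of the 32-128 range.
print(extra_uplink_scalars(32), extra_uplink_scalars(128))  # 560 8384
print(extra_uplink_fraction(32, 100_000))                   # 0.0056 for a hypothetical P = 100,000
```

Quadrupling $m$ from 32 to 128 multiplies the extra payload by roughly 15, which is the quadratic growth the text warns about.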
4.4. Robust Preprocessing and Optional Privacy Layer
FedQuAD is intended for default prediction on sensitive, heavy-tailed accounting data. Accordingly, we integrate two practical components: (i) a robust federated standardization mechanism that improves numerical conditioning and stability, and (ii) an optional privacy layer that supports secure aggregation and client-level differential privacy (DP). Both components are designed to preserve the main optimization loop in (13) while improving deployability in cross-silo settings.
Robust federated standardization via quantile summaries. Accounting ratios and statement-derived variables often exhibit extreme skew and outliers, which can degrade both first-order optimization (through exploding gradients) and curvature estimation (through ill-conditioned Hessians). FedQuAD therefore standardizes features using global medians and interquartile ranges as in (11). Since raw feature values cannot be shared, clients transmit only mergeable summaries.
Concretely, for each feature $j$, client $k$ constructs a compact quantile summary $\mathcal{Q}_{k,j}$ that supports approximate queries for the quantiles $q_{0.25}$, $q_{0.5}$, and $q_{0.75}$. Such summaries can be implemented using fixed-bin histograms (after coarse clipping) or digest-style sketches that support merge operations. The server aggregates the summaries by merging:
$$\mathcal{Q}_j = \mathrm{Merge}\left(\{\mathcal{Q}_{k,j}\}_{k \in \mathcal{S}_{\mathrm{stat}}}\right),$$
where $\mathcal{S}_{\mathrm{stat}}$ is a (possibly larger) set of clients participating in the statistics phase. From $\mathcal{Q}_j$, the server estimates
$$\mathrm{med}_j = q_{0.5}(\mathcal{Q}_j), \qquad \mathrm{IQR}_j = q_{0.75}(\mathcal{Q}_j) - q_{0.25}(\mathcal{Q}_j),$$
and broadcasts $(\mathrm{med}_j, \mathrm{IQR}_j)$. Clients then transform each feature by
$$\tilde{x}_j = \frac{x_j - \mathrm{med}_j}{\mathrm{IQR}_j + \epsilon_s},$$
with $\epsilon_s > 0$ to avoid division by near-zero scales. This preprocessing improves the stability of both the local solver (Section 4.2) and curvature sketching (Section 4.3) by controlling heavy tails and reducing cross-client scale mismatch. In our experiments, robust normalization also improves calibration, likely because the model is less sensitive to extreme ratios that correlate spuriously with rare default events on a subset of clients.
We note that robust median/IQR scaling is one natural choice for heavy-tailed accounting ratios. Alternatives in federated settings include global z-score standardization via secure aggregation of per-client sums and squared sums, per-client standardization (which can increase cross-client scale mismatch), or monotone transformations such as log/winsorization and rank/quantile transforms. We focus on median/IQR because it is resilient to outliers while preserving a common scale across institutions.
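One simple mergeable summary is a fixed-bin histogram over a publicly agreed clipping range. The sketch below illustrates that option only; the function names, the bin edges, and the choice to add a small constant to the IQR in `robust_scale` are assumptions made for illustration, and a digest-style sketch would be a drop-in alternative.

```python
import numpy as np

def local_histogram(x, edges):
    """Client side: clip to the public range and count values per bin (the mergeable summary)."""
    counts, _ = np.histogram(np.clip(x, edges[0], edges[-1]), bins=edges)
    return counts

def quantile_from_hist(counts, edges, q):
    """Approximate the q-quantile by linear interpolation inside the bin that crosses q."""
    cdf = np.cumsum(counts) / counts.sum()
    i = min(int(np.searchsorted(cdf, q)), len(counts) - 1)
    prev = cdf[i - 1] if i > 0 else 0.0
    frac = (q - prev) / max(cdf[i] - prev, 1e-12)
    return edges[i] + frac * (edges[i + 1] - edges[i])

def global_median_iqr(histograms, edges):
    """Server side: merging is bin-wise addition; then read off median and IQR."""
    merged = np.sum(histograms, axis=0)
    q25, q50, q75 = (quantile_from_hist(merged, edges, q) for q in (0.25, 0.5, 0.75))
    return q50, q75 - q25

def robust_scale(x, med, iqr, eps=1e-6):
    """Client side: apply the broadcast statistics; eps guards against near-zero scales."""
    return (x - med) / (iqr + eps)
```

The approximation error of each quantile is bounded by the bin width, so the number of bins trades communication against accuracy.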
Secure aggregation compatibility. FedQuAD messages in a round are $\Delta_k^t$, $g_k^t$, and $H_k^t$. Secure aggregation can be used to ensure that the server only observes aggregated values such as $\sum_{k \in \mathcal{S}_t} p_k^t\, \Delta_k^t$ and similarly for $g_k^t$ and $H_k^t$. We treat secure aggregation as an oracle that reveals only these sums, aligning with the aggregation definitions in (12). This protects individual client updates from direct inspection while preserving exactness of the optimization update.
Secure aggregation ensures the server only observes aggregated messages, but the aggregates $(\bar{\Delta}^t, \bar{g}^t, \bar{H}^t)$ are still model-update statistics and can reveal distribution-level information, especially if few clients participate or participation is highly unbalanced. Curvature sketches are second-order summaries in the random subspace $\mathrm{range}(Q_t)$ and thus may encode structural information beyond first-order gradients. FedQuAD mitigates this risk by (i) relying on secure aggregation so that no single institution’s sketch is revealed, (ii) refreshing the sketch subspace each round via the public seed (17), limiting information accumulation in fixed coordinates, and (iii) supporting client-level DP for all transmitted objects, including the vectorized upper-triangular entries of $H_k^t$, via clipping and Gaussian noise. In deployments we also recommend enforcing a minimum participation threshold before unmasking secure aggregation, as this is standard in secure-aggregation protocols [19].
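The sum-only property of secure aggregation can be illustrated with a toy pairwise-masking scheme. This is a didactic sketch, not a secure protocol and not the construction of [19]: the function names are hypothetical, the pairwise masks are derived from public indices instead of secret key agreement, there is no dropout handling, and clients are assumed to pre-scale their messages by their weights.

```python
import numpy as np

def masked_updates(updates, round_seed=0):
    """Add pairwise masks that cancel in the sum: client i adds a mask, client j subtracts it."""
    n = len(updates)
    masked = [np.asarray(u, dtype=float).copy() for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            # Toy stand-in for a shared secret between clients i and j.
            rng = np.random.default_rng([round_seed, i, j])
            mask = rng.standard_normal(masked[i].shape)
            masked[i] += mask
            masked[j] -= mask
    return masked

def aggregate(masked, min_clients=3):
    """Server side: refuse to aggregate below a minimum participation threshold."""
    if len(masked) < min_clients:
        raise ValueError("too few participants to release an aggregate")
    return np.sum(masked, axis=0)
```

The same masking applies unchanged to the projected gradients and to the vectorized upper triangle of the curvature sketches, since all three messages are aggregated by plain summation.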
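Client-level differential privacy (optional). When stronger formal privacy is required, FedQuAD supports client-level $(\varepsilon, \delta)$-DP by clipping and adding noise to aggregated messages. Let
$$\mathrm{clip}_c(v) = v \cdot \min\left\{1, \frac{c}{\lVert v \rVert_2}\right\}$$
denote $\ell_2$ clipping. Prior to secure aggregation, each client clips its transmitted delta and projected gradient:
$$\hat{\Delta}_k^t = \mathrm{clip}_{c_\Delta}\!\left(\Delta_k^t\right), \qquad \hat{g}_k^t = \mathrm{clip}_{c_g}\!\left(g_k^t\right).$$
For curvature sketches, we clip with respect to the Frobenius norm:
$$\hat{H}_k^t = H_k^t \cdot \min\left\{1, \frac{c_H}{\lVert H_k^t \rVert_F}\right\}.$$
After secure aggregation, the server adds Gaussian noise to the aggregated messages. For example, letting $\hat{\Delta}^t$ denote the weighted aggregate of $\hat{\Delta}_k^t$, the server uses
$$\tilde{\Delta}^t = \hat{\Delta}^t + \mathcal{N}\!\left(0, \sigma_\Delta^2 c_\Delta^2 I_P\right),$$
and similarly
$$\tilde{g}^t = \hat{g}^t + \mathcal{N}\!\left(0, \sigma_g^2 c_g^2 I_m\right).$$
For the aggregated clipped sketch $\hat{H}^t$, noise can be added to the vectorized upper-triangular entries, yielding a symmetric noisy sketch after reshaping and symmetrization. The noise multipliers $(\sigma_\Delta, \sigma_g, \sigma_H)$ are selected based on the desired $(\varepsilon, \delta)$ and the number of rounds $T$ using standard composition accounting for client subsampling.
When DP noise is applied to the curvature sketch, the aggregated sketch can become less stable. FedQuAD mitigates this by (i) client-side ridge $\rho$ in (18), (ii) server-side damping $\lambda$ in (20), and (iii) optional eigenvalue flooring in sketch space:
$$\bar{H}^t \leftarrow U \operatorname{diag}\!\left(\max\{\nu_i, \nu_{\min}\}\right) U^\top,$$
where $\bar{H}^t = U \operatorname{diag}(\nu_1, \dots, \nu_m)\, U^\top$ is the eigendecomposition of the (noisy) aggregated sketch and $\nu_{\min} > 0$ is a small floor. These stabilizers ensure that the inverse in (20) remains well-defined and prevent overly large correction steps due to noisy or near-singular curvature.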
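The clipping, noising, and flooring steps can be sketched as three small helpers. This is an illustration under assumptions: the function names are hypothetical, the noise standard deviation is taken to be the noise multiplier times the clipping bound, and no privacy accounting is performed, so the code by itself does not certify any DP guarantee.

```python
import numpy as np

def clip_norm(v, c):
    """Rescale v so its l2 norm (vector) or Frobenius norm (matrix) is at most c."""
    norm = np.linalg.norm(v)                      # Frobenius norm for 2-D input
    return v * min(1.0, c / max(norm, 1e-12))

def noisy_symmetric(H_bar, sigma, c, rng):
    """Add Gaussian noise to the upper triangle only, then mirror it to keep symmetry."""
    m = H_bar.shape[0]
    iu = np.triu_indices(m)
    out = np.zeros((m, m))
    out[iu] = H_bar[iu] + rng.normal(0.0, sigma * c, size=len(iu[0]))
    return out + out.T - np.diag(np.diag(out))

def floor_eigenvalues(H, floor):
    """Replace eigenvalues below `floor` by `floor` so the damped inverse stays well-behaved."""
    vals, vecs = np.linalg.eigh(H)
    return (vecs * np.maximum(vals, floor)) @ vecs.T
```

Flooring is applied after noise is added, so it only post-processes an already privatized quantity and does not consume additional privacy budget.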