4.1. Motivation and Problem Formulation
In modern deep learning training, gradient accumulation has become essential for training large-scale models under GPU memory constraints. The standard gradient accumulation procedure computes gradients over multiple microbatches and aggregates them before performing a parameter update. Formally, given $N$ microbatches within an accumulation cycle, the standard approach computes the total gradient as a uniform average:

$$g_{\text{acc}} = \frac{1}{N} \sum_{i=1}^{N} g_i, \tag{8}$$

where $g_i$ denotes the gradient computed on the $i$-th microbatch. This uniform weighting scheme treats all microbatches equally, implicitly assuming that each gradient provides an equally valid estimate of the true gradient direction at the current parameter state.
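To make the baseline concrete, the following is a minimal sketch of standard gradient accumulation in PyTorch. The model, data, and hyperparameters are stand-ins for illustration only; scaling each loss by $1/N$ before `backward()` implements the uniform average of (8).

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
N = 4  # microbatches per accumulation cycle

opt.zero_grad()
for _ in range(N):
    x, y = torch.randn(8, 10), torch.randn(8, 1)      # stand-in microbatch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / N).backward()   # dividing by N accumulates the uniform average of Eq. (8)
opt.step()
```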
From a numerical analysis perspective, the standard gradient accumulation formula in (8) implements a simple arithmetic mean of stochastic gradient estimates, which is the maximum likelihood estimator under the assumption that all gradients $g_i$ are independent samples from an identical distribution with equal variance $\sigma_i^2 = \sigma^2$ for all $i$. Under this equal-variance assumption, uniform weighting is provably optimal in the sense of minimizing the variance of the aggregate estimator $g_{\text{acc}}$, which achieves variance $\sigma^2 / N$, the standard $O(1/\sqrt{N})$ convergence rate of Monte Carlo averaging. However, this optimality guarantee breaks down when the equal-variance assumption fails to hold, as occurs when individual microbatch gradients exhibit heterogeneous noise levels due to data heterogeneity, local curvature variation, or sampling artifacts. In such scenarios with unequal variances, where $\sigma_i^2$ varies with $i$, classical minimum-variance unbiased estimator theory establishes that weighted averaging with optimal weights $w_i \propto 1/\sigma_i^2$, inversely proportional to the individual variances, achieves lower aggregate variance than uniform averaging. Specifically, the optimal weighted estimator attains variance $\left( \sum_{i=1}^{N} 1/\sigma_i^2 \right)^{-1}$, which satisfies the inequality

$$\left( \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \right)^{-1} < \frac{1}{N^2} \sum_{i=1}^{N} \sigma_i^2$$

whenever the variances differ, demonstrating strict improvement over uniform weighting. While computing the exact variance $\sigma_i^2$ for each microbatch during training is computationally impractical and would require maintaining extensive gradient statistics, FracGrad provides a computationally tractable proxy for variance-aware weighting: it operationalizes, through power-law weights rather than explicit variance estimation, the heuristic assumption that later microbatches in the accumulation sequence tend to provide lower-variance gradient estimates on average. This heuristic approximation enables practical variance-reduction benefits without incurring the computational overhead of maintaining second-moment statistics or performing expensive variance computations.
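The classical result quoted above is easy to verify numerically. The following NumPy check (synthetic scalar "gradients" with known heterogeneous noise levels, purely illustrative and not part of FracGrad itself) compares the empirical variance of the uniform mean against inverse-variance weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
sigmas = np.array([0.5, 1.0, 2.0, 4.0])            # heterogeneous noise levels
samples = true_grad + sigmas * rng.standard_normal((100_000, 4))

uniform = samples.mean(axis=1)
w = (1 / sigmas**2) / (1 / sigmas**2).sum()        # w_i proportional to 1/sigma_i^2
weighted = samples @ w

print(uniform.var())    # ~1.33 = (1/N^2) * sum(sigma_i^2)
print(weighted.var())   # ~0.19 = 1 / sum(1/sigma_i^2), strictly smaller
```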
However, this assumption may be suboptimal in stochastic non-convex optimization. During each accumulation cycle, parameters remain fixed while $N$ microbatches are processed sequentially. Let $\theta$ denote the parameter state at the beginning of the accumulation cycle. All gradients $g_i$ are computed at the same state $\theta$. The key issue is that each gradient is computed on a different random microbatch, leading to high variance in the gradient estimates. In non-convex optimization with stochastic gradients, individual microbatch gradients can be noisy and may not accurately represent the true descent direction. Our observation is that even without parameter updates, later positions in the sequence may be associated with less noisy aggregate gradient information: as more independent samples are drawn within the accumulation cycle, the empirical average direction stabilizes.
This study proposes to address this by introducing a weighting scheme based on fractional calculus. Rather than treating all gradients uniformly, this study assigns higher weights to more recent gradients while maintaining contributions from earlier ones through a power-law decay mechanism. This approach provides a mathematically grounded framework for biasing toward potentially more reliable gradient estimates. This study refers to this as biasing toward more reliable gradient estimates rather than invoking gradient staleness, which usually describes outdated gradients across parameter updates. Within a single accumulation cycle, there is no parameter shift, so the issue is variance in stochastic gradient estimates rather than staleness in the traditional sense.
To formalize the theoretical intuition behind why later microbatches in the accumulation sequence might provide more reliable gradient estimates despite all being computed at the identical parameter state $\theta$, this study presents the following variance-based perspective. Let $g_i = \nabla L(\theta; \mathcal{B}_i)$ denote the gradient computed on the $i$-th random microbatch $\mathcal{B}_i$ at the fixed parameter state. While all gradient estimates satisfy the unbiasedness property $\mathbb{E}[g_i] = \nabla L(\theta)$, they exhibit different realized variance due to the random sampling of training data. In the sequential gradient accumulation process, the act of observing gradients on preceding microbatches provides implicit information about the gradient variance structure in the current training state. Specifically, if the early gradient estimates exhibit high angular disagreement or magnitude variation, this serves as an empirical signal that the gradient variance is substantial in the current region of parameter space. By constructing a weighting scheme that down-weights earlier observed gradients and emphasizes later observations, FracGrad effectively implements an adaptive response to this variance structure without requiring explicit computation of variance estimates or second-moment statistics. This heuristic aligns with principles from online learning and sequential decision-making, where more recent observations often receive higher weight to enable adaptation to changing conditions. While the gradient accumulation setting involves a stationary data distribution within each cycle rather than genuine non-stationarity, the high variance of individual stochastic gradient estimates motivates similar temporal weighting principles. The power-law decay structure provides a principled mathematical framework for implementing this recency bias while maintaining contributions from all microbatches through the long-tail memory property of fractional integrals, avoiding the complete dismissal of early gradient information that would occur with more aggressive weighting schemes.
4.2. Fractional-Order Gradient Weighting
This study formulates Fractional Gradient Accumulation (FracGrad) by replacing the uniform averaging in (8) with fractional-order weighted aggregation. The FracGrad gradient is computed as:

$$g_{\text{frac}} = \sum_{i=1}^{N} w_i \, g_i, \tag{9}$$

where $w_i$ denotes the fractional weight assigned to the $i$-th microbatch gradient, and $\alpha \in (0, 1]$ is a hyperparameter controlling the decay rate. The weights are derived from the fractional integral operator, which naturally produces power-law decay patterns that have proven effective in modeling systems with memory effects. The key distinction from standard accumulation is that FracGrad explicitly models the temporal ordering of microbatches within each accumulation cycle, assigning different importance to gradients based on their position in the sequence.
The fractional weights are defined using the Riemann–Liouville fractional integral formulation, adapted for discrete gradient accumulation. The Riemann–Liouville fractional integral of order $\alpha > 0$ for a continuous function $f$ is defined as:

$$I^{\alpha} f(t) = \frac{1}{\Gamma(\alpha)} \int_{0}^{t} (t - s)^{\alpha - 1} f(s) \, ds, \tag{10}$$

where $\Gamma(\cdot)$ is the gamma function. For discrete gradient accumulation, this study discretizes this integral using the Grünwald–Letnikov approximation, which leads to weights proportional to $(N - i + 1)^{\alpha} - (N - i)^{\alpha}$. Specifically, this study computes the weights as:

$$w_i = \frac{(N - i + 1)^{\alpha} - (N - i)^{\alpha}}{\sum_{j=1}^{N} \left[ (N - j + 1)^{\alpha} - (N - j)^{\alpha} \right]} = \frac{(N - i + 1)^{\alpha} - (N - i)^{\alpha}}{N^{\alpha}}, \tag{11}$$

where the second equality holds because the sum in the denominator telescopes to $N^{\alpha}$. The numerator represents the fractional increment for position $i$, while the denominator ensures proper normalization such that $\sum_{i=1}^{N} w_i = 1$. This normalization guarantees that the magnitude of the accumulated gradient remains comparable to standard accumulation, preventing scale-related training instabilities. The normalization is critical because without it, the sum of weights would equal $N^{\alpha}$ and thus vary with $\alpha$, causing the effective learning rate to change unpredictably.
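The weight vector of (11) can be computed in a few lines. The sketch below (NumPy; the helper name `frac_weights` is ours, and the indexing follows the reconstruction above in which $i = N$ is the most recent microbatch) also exposes the telescoping normalization.

```python
import numpy as np

def frac_weights(N: int, alpha: float) -> np.ndarray:
    """Eq. (11): w_i = ((N - i + 1)^a - (N - i)^a) / N^a for i = 1..N (i = N newest)."""
    i = np.arange(1, N + 1)
    return ((N - i + 1) ** alpha - (N - i) ** alpha) / N ** alpha

w = frac_weights(8, 0.5)
print(np.round(w, 4))   # monotonically increasing in i: recency bias
print(w.sum())          # 1.0, since the denominator telescopes to N^alpha
```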
The FracGrad weighting scheme defined in (11) possesses several mathematically and practically desirable properties that distinguish it from ad-hoc or heuristic weighting approaches and establish its suitability for gradient accumulation in memory-constrained training.
Property 1 (Recency bias with memory): Unlike exponential decay weighting schemes, where weights decrease geometrically and rapidly approach zero for distant past values, the power-law decay structure maintains non-negligible weights for all microbatches throughout the accumulation window. This longer memory tail prevents complete loss of gradient information from early microbatches while still emphasizing recent gradient estimates, providing a mathematically principled balance between utilizing historical information and adapting to recent observations. For instance, at $N = 20$ with $\alpha = 0.5$, the oldest microbatch still receives weight $w_1 = (20^{0.5} - 19^{0.5})/20^{0.5} \approx 0.025$ (approximately 2.5% of total), whereas an exponential scheme with a comparable recent-to-old weight ratio would assign near-zero weight.
Property 2 (Smooth interpolation): The fractional order parameter $\alpha$ provides continuous interpolation between uniform weighting ($\alpha = 1$, yielding $w_i = 1/N$ for all $i$) and increasingly aggressive recency bias as $\alpha$ decreases toward zero. This smooth parameter space enables fine-grained control over the trade-off between temporal emphasis and historical retention, facilitating principled hyperparameter selection through standard tuning procedures rather than requiring discrete architectural choices.
Property 3 (Normalization preservation): The denominator $N^{\alpha}$ in (11) ensures that the weights sum exactly to unity, $\sum_{i=1}^{N} w_i = 1$, for all values of $\alpha$ and $N$. This normalization property maintains gradient magnitude consistency across different $\alpha$ configurations and accumulation step values, preventing unintended learning rate scaling effects that would occur if the aggregate weight sum varied.
Property 4 (Scale adaptivity): The weights automatically adapt their decay rate to the number of accumulation steps $N$, with the power-law structure naturally spanning the full accumulation window regardless of its length. This scale-invariance property means that the relative emphasis on recent versus early gradients remains approximately constant across different memory constraint scenarios, unlike fixed exponential decay rates that would become increasingly severe as $N$ grows.
Property 5 (Computational efficiency): Computing the complete weight vector for $N$ microbatches requires $O(N)$ arithmetic operations, consisting of $N$ power evaluations, $N$ subtractions, one summation, and $N$ divisions, for a total operation count linear in $N$. For typical deep neural networks, where the number of parameters ranges from millions to billions, this weight computation is completely dominated by the cost of computing $N$ gradient vectors through forward and backward passes, rendering the fractional weighting overhead negligible in practice.
The parameter $\alpha$ controls the degree of temporal weighting. When $\alpha = 1$, the weights reduce to uniform values $w_i = 1/N$ for all $i$, recovering standard gradient accumulation. This can be verified by direct computation: $(N - i + 1)^1 - (N - i)^1 = 1$ for all $i$, so the numerator is constant and the denominator equals $N$, yielding $w_i = 1/N$. As $\alpha$ decreases toward zero, the weighting becomes increasingly skewed towards recent gradients. For $\alpha < 1$, the weights exhibit power-law decay, with $w_i$ increasing monotonically with $i$, meaning more recent microbatches receive progressively higher weights. To understand this behavior, note that for $\alpha < 1$ the function $x^{\alpha}$ grows sublinearly, so the increment $(N - i + 1)^{\alpha} - (N - i)^{\alpha}$ is larger when $i$ is large (corresponding to recent microbatches) because the derivative $\alpha x^{\alpha - 1}$ is larger for smaller $x$ values. This power-law structure provides a smooth interpolation between uniform weighting and extreme recency bias, allowing fine-grained control over the temporal emphasis.
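These limiting behaviors can be checked directly. The snippet below (NumPy, redefining the `frac_weights` helper from the earlier sketch so it is self-contained) verifies the $\alpha = 1$ reduction, the monotonicity for $\alpha < 1$, and the tail weight cited under Property 1.

```python
import numpy as np

def frac_weights(N, alpha):  # as in Eq. (11)
    i = np.arange(1, N + 1)
    return ((N - i + 1) ** alpha - (N - i) ** alpha) / N ** alpha

print(np.allclose(frac_weights(16, 1.0), 1 / 16))   # True: alpha = 1 gives w_i = 1/N
print(np.all(np.diff(frac_weights(16, 0.5)) > 0))   # True: alpha < 1 increases with i
print(round(float(frac_weights(20, 0.5)[0]), 4))    # ~0.0253: oldest keeps ~2.5%
```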
The implementation of FracGrad requires minimal modification to existing training pipelines. During each accumulation cycle, gradients are computed and stored as usual. Before calling the optimizer step function, this study computes the fractional weights using (11) and applies them to the accumulated gradients. The weight computation has complexity $O(N)$ and involves only simple arithmetic operations, introducing negligible computational overhead compared to the gradient computation itself. Specifically, computing the weights requires $N$ power operations, $N$ subtractions, one summation, and $N$ divisions, which is dominated by the $O(N d)$ cost of computing $N$ gradients, where $d$ denotes the number of model parameters. For typical deep learning models with millions to billions of parameters, the weight computation overhead is less than one percent of total training time. Importantly, FracGrad is completely optimizer-agnostic and architecture-agnostic, working seamlessly with any optimization algorithm, such as Adam, SGD, or AdamW, and any model architecture, including transformers, convolutional networks, and diffusion models. The method does not require modifications to the optimizer implementation or model architecture, making it immediately deployable in existing training pipelines.
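As a concrete illustration, here is a minimal PyTorch sketch of one FracGrad accumulation cycle. Model, data, and hyperparameters are stand-ins; note that scaling the $i$-th microbatch loss by $w_i$ before `backward()` is mathematically equivalent to weighting its gradient, so the weighted sum of (9) is accumulated directly without rescaling stored gradients.

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
N, alpha = 8, 0.5

# Precompute the Eq. (11) weights once; they can be reused while N stays fixed.
w = [((N - i + 1) ** alpha - (N - i) ** alpha) / N ** alpha for i in range(1, N + 1)]

opt.zero_grad()
for i in range(N):                                    # i = 0 is the oldest microbatch
    x, y = torch.randn(8, 10), torch.randn(8, 1)      # stand-in microbatch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (w[i] * loss).backward()                          # accumulates w_i * g_i (Eq. (9))
opt.step()
```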
4.3. Theoretical Analysis and Properties
Rather than offering a formal convergence guarantee, this study poses the following working hypothesis based on its empirical observations.
Hypothesis (Empirical Effectiveness of FracGrad). In stochastic non-convex training with moderate gradient variance, reweighting microbatch gradients by the power-law schedule in (11) tends to reduce variance and yield improved optimization trajectories, provided $\alpha$ is chosen sufficiently below 1 and the number of accumulation steps $N$ does not exceed a practical threshold (established empirically).
This behavior aligns with known variance-reduction effects from weighted averaging in other stochastic methods. The key properties of FracGrad that support this hypothesis are as follows. First, gradient accumulation is widely used in modern large-scale training, making any improvement to this component immediately applicable across diverse applications. Second, FracGrad requires only a single-line modification to training loops, making it a simple intervention point in the training pipeline. The modification involves computing fractional weights once per accumulation cycle and multiplying them element-wise with the accumulated gradients before the optimizer step. Third, FracGrad is designed to be compatible with arbitrary optimizers and model architectures. The method operates purely on gradient vectors without making assumptions about the optimizer state or model structure. Fourth, it directly addresses the issue of gradient estimation quality during memory-constrained training.
To understand why FracGrad may improve gradient estimation, consider the stochastic gradient estimation problem. Let $L(\theta)$ denote the loss function and $\theta$ denote the parameter state at the beginning of an accumulation cycle. All gradients $g_i = \nabla L(\theta; \mathcal{B}_i)$ are computed at this same state, where $\mathcal{B}_i$ represents the $i$-th random microbatch. In stochastic optimization, individual microbatch gradients are noisy estimates of the true gradient $\nabla L(\theta)$. FracGrad assigns higher weights to later gradients in the sequence. While all gradients are computed at the same parameter state, the sequential observation of microbatches may provide information about gradient variance. By down-weighting earlier gradients and emphasizing later ones, FracGrad may reduce the impact of noisy gradient estimates.
The computational overhead of FracGrad is provably negligible. Computing the fractional weights requires $O(N)$ arithmetic operations per accumulation cycle, which is dominated by the $O(N d)$ cost of gradient computation, where $d$ denotes the number of model parameters. For typical large-scale models with millions to billions of parameters, the weight computation represents less than one percent of the total training time. Specifically, the weight computation involves $N$ power operations to compute the terms $(N - i + 1)^{\alpha}$ and $(N - i)^{\alpha}$, $N$ subtractions to compute the numerators, one summation over $N$ terms to compute the denominator, and $N$ divisions to normalize the weights. Modern hardware can perform these operations efficiently, and the weights can be precomputed once per accumulation cycle and reused for all parameter gradients. Memory overhead is similarly minimal, requiring only $N$ additional scalar values to store the precomputed weights, which can be reused across training steps when $N$ remains constant. This represents negligible memory consumption compared to storing the model parameters and optimizer states.
A critical theoretical question concerns why power-law decay weighting derived from fractional integrals should outperform alternative temporal weighting schemes such as exponential decay, which represents the most natural baseline comparison for time-dependent weighting. Exponential weights of the form $w_i \propto \beta^{N - i}$ for a decay factor $\beta \in (0, 1)$ also provide recency bias, assigning higher weight to recent microbatches through geometric decay. This study argues that power-law decay offers three fundamental advantages over exponential decay in the gradient accumulation context.
First, memory retention characteristics: Exponential decay with decay factor $\beta$ assigns weight $\beta^{k}$ to a gradient $k$ steps in the past, which vanishes extremely rapidly. For example, with $\beta = 0.9$, which already represents relatively slow decay, a gradient 10 steps in the past receives weight $0.9^{10} \approx 0.35$ (approximately one-third of the most recent weight), and a gradient 20 steps in the past receives weight $0.9^{20} \approx 0.12$ (approximately one-eighth of the most recent weight). By contrast, power-law decay with $\alpha = 0.5$ at $N = 20$ assigns the oldest gradient weight $w_1 \propto 20^{0.5} - 19^{0.5} \approx 0.11$ compared to the newest weight $w_N \propto 1$, maintaining a ratio of approximately $0.11$, or roughly one-ninth, rather than exponentially vanishing. More importantly, the collective weight assigned to the first half of microbatches under power-law decay remains substantial (approximately 20–30% for typical $\alpha$ values), whereas exponential decay concentrates the overwhelming majority of weight on the most recent few microbatches, effectively discarding most of the historical gradient information that gradient accumulation was designed to aggregate.
Second, adaptive spread: Power-law weights exhibit a form of scale invariance in which the relative weight distribution adapts automatically to the accumulation window size $N$. As $N$ increases, the power-law decay rate adjusts naturally to span the extended window while maintaining similar relative emphasis patterns. In contrast, exponential decay with a fixed parameter $\beta$ becomes increasingly concentrated on recent microbatches as $N$ grows larger. For instance, with $\beta = 0.9$, doubling $N$ from 16 to 32 causes the first 16 microbatches to collectively receive an exponentially diminished share of weight, roughly $\beta^{16} \approx 0.19$ times their original aggregate, potentially amplifying noise from a small subset of recent gradients. Power-law decay avoids this pathological behavior through its intrinsic scale adaptation.
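The differing mass retention can be checked numerically. The sketch below (NumPy; $\beta = 0.9$ as in the example above, window $N = 32$) compares the total weight retained by the older half of the window under the two schemes.

```python
import numpy as np

N, alpha, beta = 32, 0.5, 0.9
i = np.arange(1, N + 1)
power = ((N - i + 1) ** alpha - (N - i) ** alpha) / N ** alpha   # Eq. (11)
expo = beta ** (N - i)
expo = expo / expo.sum()                                         # normalized exponential

print(round(power[: N // 2].sum(), 3))  # ~0.293: older half keeps ~29% of the mass
print(round(expo[: N // 2].sum(), 3))   # ~0.156: older half shrinks toward zero as N grows
```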
Third, theoretical foundation: Power-law decay emerges naturally from the mathematics of fractional calculus, specifically from discretizing the Riemann–Liouville fractional integral, which has been studied extensively in mathematical analysis for over a century. This provides rigorous mathematical grounding and connects the gradient weighting scheme to a rich body of theoretical results about fractional operators, memory effects, and long-range correlations in dynamical systems. Exponential decay, while widely used in optimization through momentum and moving averages, lacks comparable theoretical justification specifically for the intra-cycle gradient weighting problem. Our experimental results implicitly validate this comparison: standard Adam optimization employs exponential momentum with $\beta_1 = 0.9$ operating across parameter updates, and when combined with the power-law weighting within accumulation cycles, the two complementary temporal weighting mechanisms at different time scales yield superior results compared to either mechanism alone.
An important theoretical consideration concerns the case of fractional order $\alpha > 1$, which this study explicitly excludes from its investigation. When $\alpha$ exceeds unity, the weight formula in (11) reverses its temporal ordering, assigning higher weights to earlier microbatches rather than later ones. To understand this reversal, observe that for $\alpha > 1$, the function $x^{\alpha}$ grows with convex curvature (positive second derivative), which makes the incremental difference $(N - i + 1)^{\alpha} - (N - i)^{\alpha}$ larger when evaluated at larger values of $N - i + 1$ and $N - i$, corresponding to smaller indices $i$ representing older microbatches. Mathematically, the derivative $\alpha x^{\alpha - 1}$ increases with $x$ when $\alpha > 1$, so the difference between consecutive power values grows as the base increases, causing $w_i$ to be larger for large $N - i$ (small $i$). This means that $\alpha > 1$ would emphasize the oldest, first-observed gradient estimates in the accumulation sequence, precisely the opposite of this study's motivation to bias toward potentially more reliable later gradients that benefit from implicit information accumulation. Furthermore, assigning the highest weight to the first microbatch contradicts the variance-reduction intuition underlying FracGrad, as early gradients have no preceding observations to inform variance assessment and represent the noisiest point in the accumulation process from an information-theoretic perspective. For these fundamental reasons, both mathematical (weight ordering reversal) and conceptual (contradiction of the motivating principles), this study restricts attention to the fractional order range $0 < \alpha \le 1$, where $\alpha = 1$ serves as the uniform baseline and values $\alpha < 1$ provide the desired recency bias with power-law decay characteristics.
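The ordering reversal is immediate to confirm with the weight formula. The following check (NumPy, helper redefined for self-containment) locates the largest weight for $\alpha < 1$ versus $\alpha > 1$.

```python
import numpy as np

def frac_weights(N, alpha):  # as in Eq. (11)
    i = np.arange(1, N + 1)
    return ((N - i + 1) ** alpha - (N - i) ** alpha) / N ** alpha

print(int(np.argmax(frac_weights(8, 0.5))) + 1)  # 8: newest microbatch dominates
print(int(np.argmax(frac_weights(8, 1.5))) + 1)  # 1: ordering reverses for alpha > 1
```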
Our formulation assumes uniform temporal spacing between consecutive microbatches, which holds exactly in standard gradient accumulation procedures where microbatches are processed sequentially with identical computational cost per microbatch. This uniformity assumption is valid for fixed-architecture models processing fixed-size inputs, which encompasses the vast majority of computer vision and many natural language processing applications. However, certain training scenarios involve non-uniform temporal spacing. For instance, in sequence modeling with variable-length inputs, different microbatches may require different processing times depending on sequence lengths, causing non-uniform temporal gaps between gradient computations. In such cases, one could generalize the weight formula in (11) to account for actual elapsed time rather than discrete sequence position. Specifically, let $t_i$ denote the cumulative processing time up to and including microbatch $i$, with $t_0 = 0$ and $T = t_N$ representing the total accumulation cycle duration. The generalized fractional weight for microbatch $i$ would be

$$w_i = \frac{1}{T^{\alpha}} \int_{t_{i-1}}^{t_i} \alpha (T - s)^{\alpha - 1} \, ds,$$

which reduces to

$$w_i = \frac{(T - t_{i-1})^{\alpha} - (T - t_i)^{\alpha}}{T^{\alpha}}$$

after integration, recovering (11) when $t_i = i$ and $T = N$. This time-continuous formulation would yield non-uniform weights reflecting actual temporal gaps between gradient observations, with microbatches taking longer to process receiving proportionally higher weight to account for their extended temporal contribution. While theoretically straightforward, implementing this generalization would require tracking actual computation times and performing time-weighted aggregation, introducing measurement overhead and potential sensitivity to system performance variations. Since standard training employs uniform-size microbatches, this study does not explore this extension experimentally, but acknowledges it as a potential avenue for future work in domains with inherent temporal non-uniformity.
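A sketch of this time-weighted variant is given below (NumPy; the helper name and the cumulative times are invented for illustration). The final check confirms that uniform spacing recovers (11) exactly.

```python
import numpy as np

def frac_weights_timed(t: np.ndarray, alpha: float) -> np.ndarray:
    """w_i = ((T - t_{i-1})^a - (T - t_i)^a) / T^a over cumulative times t_0..t_N."""
    T = t[-1]
    return ((T - t[:-1]) ** alpha - (T - t[1:]) ** alpha) / T ** alpha

t = np.array([0.0, 1.0, 2.5, 3.0, 5.0])      # invented non-uniform processing times
w = frac_weights_timed(t, 0.5)
print(np.round(w, 4), w.sum())               # sums to 1 by the telescoping argument

# Uniform spacing t_i = i (N = 8) reduces exactly to the discrete Eq. (11) weights.
u = np.arange(0.0, 9.0)
i = np.arange(1, 9)
print(np.allclose(frac_weights_timed(u, 0.5),
                  ((8 - i + 1) ** 0.5 - (8 - i) ** 0.5) / 8 ** 0.5))  # True
```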
FracGrad exhibits connections to momentum-based optimization methods, yet operates at a different level. While momentum methods like Adam maintain exponential moving averages of gradients between parameter updates with decay factors $\beta_1$ and $\beta_2$, FracGrad applies fractional weighting within each accumulation cycle before the update. This makes FracGrad complementary to momentum methods. The power-law decay of fractional weights differs qualitatively from the exponential decay $\beta^{t}$, providing longer memory tails that preserve more information from earlier microbatches while still emphasizing recent gradients. For exponential decay with factor $\beta < 1$, the weight assigned to a gradient $t$ steps in the past decays as $\beta^{t}$, which approaches zero rapidly. In contrast, power-law decay with $\alpha < 1$ assigns weights that decay more slowly and maintain non-negligible contributions from earlier gradients. This distinction becomes important when accumulation steps are large, as the power-law structure prevents complete loss of early gradient information while still providing recency bias.