The Hausdorff difference does not replace the stochastic gradient but functions as a dynamic weighting mechanism within the momentum accumulation process. Specifically, the fractal scaling term acts as a dual regulator of the effective learning rate and inertia, allowing for nonlinear adaptation of the update trajectory.
Adam Algorithm with Hausdorff Difference
To transform the learning rule with integer-order momentum into its counterpart based on the Hausdorff difference, Equations (3)–(6) are rewritten as follows:
This modification generalizes the original update rule from the integer-order case to the γ-order case. By replacing the first-order difference in Equations (12)–(15) with the γ-order difference, we obtain
Next, we update the parameter formulas. Using the Hausdorff difference, Equations (16)–(19) can be written as follows:
Accordingly, we obtain the following expressions, which are used in the subsequent analysis:
To ensure the algorithm is well-defined at the first iteration (
), we address the interpretation of the term
and the initial states. At
, the scaling factor becomes
, which avoids any numerical singularity in the coefficients of Equations (20)–(23). The variables with index
refer to the initialization states. Specifically,
represents the gradient computed at the initial weights
, while the moment estimates are explicitly initialized as
and
. Substituting these into Equation (20), the first update step simplifies to
, ensuring a deterministic start to the optimization process.
Equation (23) constitutes the central update mechanism of the proposed algorithm, translating the theoretical properties of the Hausdorff difference into practical parameter adjustments. By dynamically incorporating the order γ, this mechanism modulates the balance between the retention of historical information and the sensitivity to current gradients. Consequently, the optimizer acts as a dynamic system with a variable time scale, which enhances convergence speed during the transient phases and maintains stability throughout the steady-state training process.
In the HAdam algorithm, the first-order and second-order moment estimates may suffer from initialization deviations. To correct these deviations, the corresponding moment values must be properly initialized. Therefore, we study the solution of Equation (20) as follows.
Theorem 1. Let ; the solution of Equation (20) can be determined as follows: where and .
Proof. This theorem is proved by mathematical induction.
Based on Equation (24), we also obtain
□
Similarly, the solutions of Equations (21)–(23) are given as
where
and
.
In the initial iterations of the HAdam algorithm, the first-order and second-order moment estimates depend on historical information, so their values are easily influenced by the initial state and may become biased. To reduce this initialization bias, the corresponding moment variables are initialized to zero before training begins. Specifically, if the initial conditions at iteration 0 are set as
,
,
, and
, then the following relations can be derived:
Then, the mathematical expectations from Equations (27)–(30) are determined by
Here,
with
represents the residual term caused by the non-stationarity of the gradients during the update process. Strictly defined,
accounts for the difference between the expectation of historical gradients and the expectation of the current gradient. For instance,
is defined as
Assuming the objective function is smooth and the learning rate is sufficiently small, the change in the gradient expectation between adjacent steps is minimal. Consequently,
approaches zero, validating the approximation in the moment estimation. Similar definitions apply to
,
, and
.
Based on the zero-initialization condition
, a systematic bias exists in the early training stages. Unlike standard Adam, where the bias decay is constant, HAdam’s bias structure is time-varying and depends on
. To strictly eliminate this specific initialization bias, we do not use the standard correction factor. Instead, we derive the exact correction terms based on the coefficient expansion in Theorem 1. The corrected moment estimation formulas are defined as follows:
To prevent the computational complexity from increasing linearly with iterations, we implement the bias correction terms recursively. Let
and
denote the denominators for the first-order and second-order moment corrections in Equations (33) and (34), respectively. Taking
as an example,
Using the property
, this term can be rewritten as a recurrence relation:
with the initial condition
. A similar recursive form applies to
. This recursive implementation ensures that the bias correction step has a constant time complexity of O(1) per iteration, making HAdam computationally efficient and suitable for large-scale training.
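The recursive idea can be illustrated with a minimal Python sketch. It assumes a generic zero-initialized moving average of the form m_k = a_k m_{k-1} + b_k g_k, where a_k and b_k are hypothetical placeholders for the order-dependent HAdam coefficients; the exact coefficients and the initial condition of the recurrence above are not reproduced here. Under that assumption, the total weight s_k placed on past gradients obeys the same one-step recurrence as the moment itself, so the corrected estimate m_k / s_k needs only constant work per iteration.

```python
def corrected_moments(grads, coeffs):
    """Zero-initialized moving average with an exact, recursively tracked correction.

    grads  : sequence of scalar gradients g_1, g_2, ...
    coeffs : sequence of (a_k, b_k) pairs; hypothetical stand-ins for the
             order-dependent coefficients, which are NOT taken from the paper.
    Returns a list of (corrected_moment, weight_sum) pairs, one per iteration.
    Assumes the weight sum is nonzero at every step (e.g. b_1 != 0).
    """
    m = 0.0  # raw moment estimate, initialized to zero
    s = 0.0  # sum of the weights that m places on g_1, ..., g_k
    history = []
    for g, (a, b) in zip(grads, coeffs):
        m = a * m + b * g  # raw moment update
        s = a * s + b      # same recurrence applied to the weights alone
        history.append((m / s, s))  # dividing by s removes the zero-init bias
    return history
```

With constant coefficients a_k = β and b_k = 1 − β, the tracked sum s_k reduces to 1 − β^k, which is the standard Adam correction factor. With time-varying coefficients, no closed form is needed because s_k is carried forward alongside m_k.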
To eliminate the estimation bias generated in the initialization stage, the corrected moment estimates, such as
, are first computed using Equations (33) and (34). These corrected terms are then applied in the HAdam optimization framework to determine the final parameter update equations with order-dependent momentum.
To study the role of the order parameter in the HAdam update, we introduce the following coefficients (Algorithm 1). If we set
,
,
, and
, then we can analyze how the order γ affects the weight update formulas. This influence is examined in the following cases. Letting
and
, we select the order as
and
, respectively, for the cases that are
and
. Letting
and
, we select the order as
and
, respectively, for the cases that are
and
. The curves of
with different orders
and
are plotted in Figure 1.
Algorithm 1 HAdam optimization algorithm with recursive bias correction
Input: training set , learning rate , exponential decay rates , order parameter , objective function J, small constant .
Initialize: , , , .
Initialize bias correction terms: , .
for do
    Step 1: Update raw moment estimates
    Step 2: Update recursive bias correction terms
    Step 3: Compute bias-corrected moments
        Similarly compute and using corresponding and
    Step 4: Update parameters
end for
Output: updated network parameters , .
From the curves in Figure 1, we observe that, as k increases,
also increases for all
. This indicates that the coefficient
grows monotonically with respect to the iteration index k under the considered range of
. For
, a decrease in
likewise leads to an increase in
. However, with an increase in k,
gradually decreases, reaching even negative values for all
. Such conditions must therefore be avoided to prevent instability in parameter tuning. To ensure the boundedness and non-negativity of the key coefficients throughout the training process, a strict sufficient condition is required. Specifically, for a maximum iteration count M, the order γ must satisfy
. A detailed derivation of this stability criterion and the resulting upper bounds for
are provided in Remark 1.
The values of
were selected as
for the range
and as
for the range
to examine the behavior of
with respect to
.
Figure 2 shows the curves of
for the selected values of
. These curves illustrate how the coefficient
varies with the iteration index k under different choices of the order γ.
From Figure 2, we observe that
increases as the order γ increases. For each order γ, a decrease in k results in an increase in
. Equivalently, for fixed γ,
decreases as k increases, which indicates that the contribution of this coefficient becomes smaller at later iterations. For
, an increase in
likewise leads to an increase in
. However, contrary to the previous range, an increase in k results in an increase in
for all
.
From the viewpoint of parameter interpretation, can be regarded as a coefficient that measures the influence of the momentum term, and can be viewed as a coefficient that reflects the effective global learning rate. For , decreases monotonically as increases, whereas increases monotonically. This means that a larger order weakens the contribution of the momentum term and strengthens the effect of the learning rate. Hence, in the initial stage of network training, it is desirable that is relatively small and is relatively large so that parameter updates are accelerated. In the later stage of training, it is preferable that is relatively large and is relatively small in order to improve recognition accuracy.
When the iteration index k is small, which corresponds to the early training phase, a relatively large order should be chosen to speed up convergence. When k becomes large and the training enters a later phase, a relatively small order in the same interval should be selected to obtain a more refined optimization result. For , increases as increases, while decreases as increases. Therefore, in this range, a relatively small order is recommended if a higher weight update speed is required. Overall, the variation of and across different training stages and order intervals provides an intuitive rule for selecting γ so as to balance rapid convergence and high recognition accuracy.
During network training, the choice of the order can be adjusted in coordination with the training stage and the gradient magnitude. In the early stage of training, the gradients of the parameters are relatively large. A larger order γ is then used so that the update step becomes faster and convergence is accelerated. In the later stage of training, the gradients gradually decrease. A smaller order γ is then adopted in order to enhance the smoothness and refinement of the updates and to improve the final recognition accuracy. Accordingly, and play complementary roles at different stages and support the transition from fast approach to fine adjustment.
Based on this idea, the order γ can be set adaptively according to the rules given in Equations (38)–(40). In this way, a mechanism is obtained in which the order is adjusted automatically by the iteration progress and the gradient information. Furthermore, two concrete methods for selecting the order are proposed, which provide simple and practical guidance for choosing γ in different training scenarios.
Adjustment Method One: Nonlinear order adjustment based on a cosine function.
In the initial stage of training, a relatively large order γ is selected to accelerate convergence. In the later stage of training, a smaller order γ is preferred in order to improve optimization accuracy. Based on this idea, a cosine function is used in this method to adjust the order γ in a nonlinear manner. In the lth layer, the orders
for weight updates and
for bias updates vary smoothly from
to
within the interval
as the iteration index k increases. This strategy implements a natural transition from a stage of accelerated convergence to a stage of fine adjustment. It provides a simple and practical nonlinear scheme for selecting the orders associated with the weights and the biases.
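A cosine-based order schedule of this kind can be sketched in Python as follows. This is an illustrative assumption rather than the adjustment rule from the equations above: it uses the familiar half-cosine annealing form, and the names `cosine_order`, `total_iters`, `gamma_min`, and `gamma_max` are hypothetical. A single schedule is shown; the same function could be evaluated separately for the weight and bias orders of each layer.

```python
import math

def cosine_order(k, total_iters, gamma_min, gamma_max):
    """Order at iteration k under a half-cosine schedule (assumed form).

    Starts at gamma_max for k = 0 and decays smoothly to gamma_min at
    k = total_iters; k outside [0, total_iters] is clamped to that range.
    """
    progress = min(max(k / total_iters, 0.0), 1.0)
    return gamma_min + 0.5 * (gamma_max - gamma_min) * (1.0 + math.cos(math.pi * progress))

# Example: sample the schedule at a few iterations (bounds chosen arbitrarily).
schedule = [cosine_order(k, 100, 0.1, 0.9) for k in (0, 25, 50, 75, 100)]
```

The half-cosine changes slowly near both ends of training and fastest in the middle, which matches the described transition from accelerated convergence to fine adjustment.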
Adjustment Method Two: Nonlinear adjustment based on a hyperbolic tangent function.
This strategy is designed to optimize the order configuration dynamically by using gradient feedback. In the early stage of optimization, large gradient norms require a fast response. In the later stage, small gradient norms require fine adjustment. A hyperbolic tangent function is used to construct the dependence of the order γ on the gradient norm. This design allows the order to adapt to the current gradient magnitude through expansion or contraction. In this way, fast descent is promoted in the initial phase, and the quality of convergence is improved in the later phase. A dynamic balance is thus achieved over the whole optimization process.
where
and
denote the minimum and maximum values, respectively, in the range of
or
. To guarantee the robustness of the proposed method and minimize the necessity for layer-specific manual tuning,
and
are established as global hyperparameters shared across all network layers. The upper bound
is strictly constrained by the theoretical stability criterion derived in Remark 1 (specifically,
), whereas the lower bound
is assigned a small positive value to exploit the acceleration characteristics inherent to fractional orders. Empirical evidence indicates that the interval
consistently yields effective performance across diverse tasks.
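The gradient-driven adjustment of Method Two can be sketched in Python under stated assumptions. The mapping below is one plausible form, not the paper's equation: the order rises from a lower bound toward an upper bound as the gradient norm grows, using tanh to keep it bounded. The names `tanh_order`, `grad_norm`, `gamma_min`, `gamma_max`, and the sensitivity parameter `scale` are hypothetical.

```python
import math

def grad_norm(grads):
    """Euclidean norm of a flat list of gradient components."""
    return math.sqrt(sum(g * g for g in grads))

def tanh_order(norm, gamma_min, gamma_max, scale=1.0):
    """Map a non-negative gradient norm to an order in [gamma_min, gamma_max].

    Assumed form: tanh(scale * norm) is 0 at norm = 0 and approaches 1 for
    large norms, so small gradients give an order near gamma_min and large
    gradients give an order near gamma_max.
    """
    return gamma_min + (gamma_max - gamma_min) * math.tanh(scale * norm)

# Example: order selected for a gradient with components (3, 4), bounds arbitrary.
order = tanh_order(grad_norm([3.0, 4.0]), 0.1, 0.9)
```

Because the two bounds are passed in as plain arguments, they can be shared across all layers as global hyperparameters, in line with the description above.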
Remark 1. Because may occur for , we need to take a terminal value of γ. Letting , we get
Hence, we have
Since with , , and , where M is the total number of iterations, we get
Letting
we set the order to satisfy
and then this condition ensures that is not less than zero. We can set the order to satisfy the following equation:
Then, we can use to determine the effective interval of γ. For example, let , , and then we get ; hence, the interval of γ should be set as .
In the same way, let the order satisfy the following equation:
For example, we set , , and then is determined; the effective interval of γ should then be set as .
Then, the effective interval of γ is , where . For this example, the above analysis of the Adam algorithm with Hausdorff difference shows that the effective interval of γ should be set as .