Hausdorff Difference-Based Adam Optimizer for Image Classification

Jian, Jing; Gao, Zhe; Zhang, Haibin

doi:10.3390/math14020329

by Jing Jian ¹,*, Zhe Gao ² and Haibin Zhang

¹ Department of Operations Research and Scientific Computing, Beijing University of Technology, Beijing 100021, China

² Department of Electrical Engineering and Automation, College of Light Industry, Liaoning University, Shenyang 110036, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(2), 329; https://doi.org/10.3390/math14020329

Submission received: 4 December 2025 / Revised: 26 December 2025 / Accepted: 31 December 2025 / Published: 19 January 2026

(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

To address the limitations of fixed-order update mechanisms in convolutional neural network parameter training, an adaptive parameter training method based on the Hausdorff difference is proposed in this paper. By deriving a Hausdorff difference formula that is suitable for discrete training processes and embedding it into the adaptive moment estimation framework, a generalized Hausdorff difference-based Adam algorithm (HAdam) is constructed. This algorithm introduces an order parameter to achieve joint dynamic control of the momentum intensity and the effective learning rate. Through theoretical analysis and numerical simulations, the influence of the order parameter and its value range on algorithm stability, parameter evolution trajectories, and convergence speed is investigated, and two adaptive order adjustment strategies based on iteration cycles and gradient feedback are designed. The experimental results on the Fashion-MNIST and CIFAR-10 benchmark datasets show that, compared with the standard Adam algorithm, the HAdam algorithm exhibits clear advantages in both convergence efficiency and recognition accuracy.

Keywords:

convolutional neural networks; Hausdorff difference; Adam algorithm; parameter optimization; adaptive strategy

MSC:

90C31

1. Introduction

Convolutional neural networks (CNNs) have established a dominant position in tasks such as image classification [1], object detection [2], semantic segmentation [3], and scene understanding [4] owing to the structural advantages of local receptive fields, weight sharing, and hierarchical feature extraction. By utilizing end-to-end training mechanisms, CNNs are capable of learning multiscale representations automatically, ranging from low-level edge textures to high-level semantic concepts directly from two-dimensional image data [5]. Consequently, the reliance on traditional manual feature engineering is fundamentally alleviated. In recent years, driven by the substantial increase in the scale of training data and the enhancement of computational power, network architectures have evolved towards greater depth, width, and topological complexity. This evolution has continuously elevated the accuracy limits of large-scale image recognition. Nevertheless, the expansion of the network scale results in an optimization landscape of the loss function that exhibits complex characteristics, such as high-dimensional non-convexity, heterogeneous curvature, and multiple local extrema. These complexities impose significantly stringent challenges on the convergence efficiency, numerical stability, and generalization ability of training algorithms.

In the training process of deep convolutional networks [6], optimization algorithms govern the update trajectory and step size of parameters within the solution space, and are therefore a critical factor in determining the convergence speed and generalization performance of the model. Although Stochastic Gradient Descent (SGD) [7] and its momentum variants constitute the cornerstone of deep network training, such methods often rely on a globally uniform learning rate. Consequently, they are sensitive to hyperparameters and prone to stagnation at saddle points or in flat regions of high-dimensional loss surfaces. To address these issues, adaptive optimization algorithms have been developed that adjust the learning rate of each parameter dynamically by using first-order and second-order gradient statistics. Among these, adaptive moment estimation (Adam) [8] is widely adopted as the default optimizer for CNN training owing to its simple structure and rapid convergence. However, theoretical analysis and practical experience indicate that the standard Adam algorithm is intrinsically built on an update mechanism based on integer-order differences. When accurate gradient information is sparse or the loss landscape is complex, this fixed integer-order rule often fails to characterize the dynamics of deep networks adequately, which frequently leads to oscillating update trajectories and restricted generalization performance. Several variants of Adam, including AdamW, RAdam, and AdaBelief, have been proposed to address specific limitations through decoupled weight decay, variance rectification, or curvature-aware steps, but these methods generally retain the classical integer-order update structure. In contrast, this paper seeks to generalize the underlying difference operator itself. By incorporating the Hausdorff difference, we introduce a fractal time-scale mechanism that modulates the trade-off between historical memory and current gradient responsiveness. This approach offers a generalized framework in which standard Adam is a special case.

To exploit historical gradient information more fully and capture long-range dependencies and nonlocal effects during parameter evolution, the introduction of fractional calculus theory has emerged as a significant research direction in optimization algorithms [9]. Unlike integer-order models that rely solely on instantaneous rates of change [10], fractional-order models inherently possess properties of memory and nonlocality by introducing a continuously tunable order parameter. This inclusion allows for a more refined description of the dynamic behavior of complex systems. Embedding the order parameter into parameter update rules essentially constructs a tunable trade-off mechanism between the current gradient response and the historical cumulative influence [11]. Existing studies have confirmed that a rational configuration of the order helps to enhance the stability of the algorithm [12], smooth the parameter evolution trajectory, and improve the capability of the model to fit complex data distributions. Therefore, the introduction of fractional-order concepts into CNN optimization possesses a clear theoretical basis and application value [13].

Fractional derivatives find extensive applications in modeling, spanning diverse fields such as physics [14], engineering [15], image processing [16], and economics [17]. Although classical fractional derivatives offer modeling advantages [18], the definitions typically involve convolution operations over long historical intervals. This requirement results in computational complexity for a single iteration that increases significantly with time [19]. In the scenario of large-scale training for deep networks [20], this computational bottleneck severely restricts the practical application of fractional-order methods. To control computational costs while retaining memory characteristics, generalized calculus operators based on fractal geometry concepts have garnered attention [21]. Specifically, the Hausdorff derivative [22] and the discrete form thereof, the Hausdorff difference, provide an efficient solution [23]. The Hausdorff difference degrades to the classical integer-order difference when the order is unity, whereas a tunable parameter is introduced at noninteger orders. This property allows the weight of historical information in the current update to be adjusted flexibly. Compared to traditional fractional derivatives [24], the Hausdorff difference is more compact in form and easier to implement discretely. It enables the embedding of dynamic characteristics similar to fractal memory into the update process without increasing computational overhead significantly. The integer-order methods rely on a fixed time scale that is ill-suited for fractal landscapes, whereas traditional fractional algorithms incur prohibitive computational costs due to full history accumulation. This work addresses this dichotomy by introducing a recursive Hausdorff framework that achieves fractal adaptivity while maintaining constant time complexity.

Based on the aforementioned analysis, a Hausdorff difference-based Adam algorithm (HAdam) is proposed in this paper for CNN training by embedding the Hausdorff difference operator into the adaptive moment estimation framework. The momentum update mechanism based on integer-order differences in traditional Adam is reconstructed, and the order parameter γ is introduced to modulate the momentum term explicitly, as well as the first-order and second-order moment estimates. Theoretical derivation indicates that the order γ exerts a significant nonlinear modulation on the momentum intensity and the effective learning rate. In the early stage of training, a larger order helps to reinforce the driving role of the current gradient, thereby accelerating convergence. In the later stage, a smaller order helps to smooth the update trajectory, suppress oscillations, and improve the precision of the solution. Furthermore, two adaptive strategies for order adjustment are designed: a nonlinear scheduling based on the iterative process and a dynamic adjustment based on the gradient norm. These strategies enable γ to evolve automatically according to the optimization stage and gradient scale, forming a dynamic parameter update mechanism that balances convergence speed and precision.

The primary contributions of this paper are summarized as follows. First, a Hausdorff difference formula that is suitable for discrete optimization processes is derived from Hausdorff derivative theory and integrated with the Adam framework to obtain the HAdam optimization algorithm. Second, to address the limitations of fixed orders, adaptive adjustment strategies based on iteration cycles and gradient feedback are constructed. These strategies achieve joint dynamic control of the momentum intensity and the effective learning rate. Third, through theoretical analysis and numerical simulation, the influence of the order parameter and its value range on algorithm stability, weight evolution trajectories, and convergence speed is revealed, and principles for selecting orders during different training stages are proposed. Fourth, experimental verification on standard image classification datasets and typical convolutional network architectures demonstrates that the HAdam algorithm outperforms standard Adam and other related algorithms in terms of convergence efficiency and recognition accuracy.

The remainder of this paper is organized as follows. Section 2 briefly reviews typical optimization algorithms for convolutional neural networks and presents the mathematical definitions of the Hausdorff derivative and difference, clarifying the connection to traditional operators. Section 3 elaborates on the derivation of the HAdam algorithm, provides the parameter update formulas containing the order parameter, and introduces the two adaptive order adjustment strategies. Section 4 presents experimental results on multiple image datasets, contrasting the convergence characteristics and recognition performance of the algorithm under different order settings and adjustment strategies. Finally, Section 5 summarizes the work of this paper and discusses the potential for applying this method to other deep learning architectures and future research directions.

2. Preliminaries

To demonstrate the effectiveness of fractal differences in improving the optimization performance of CNNs, this section first introduces the optimization algorithms used for network parameter training. It then recalls the definition of the Hausdorff derivative.

2.1. Adam Algorithm

In plain gradient descent, the weights and biases of the network are updated as

w_{i j}^{(l)} (k) = w_{i j}^{(l)} (k - 1) - η g_{i j}^{(l)} (k - 1),

(1)

b_{j}^{(l)} (k) = b_{j}^{(l)} (k - 1) - η h_{j}^{(l)} (k - 1) .

(2)

Here, k is the iteration index, w_{i j}^{(l)} (k) denotes the weight at the i-th spatial position of the j-th kernel in layer l, and b_{j}^{(l)} (k) denotes the associated bias. The step size of the parameter update is regulated by the learning rate η. With J (k) representing the value of the objective function, the gradients g_{i j}^{(l)} (k) = \partial J (k) / \partial w_{i j}^{(l)} (k) and h_{j}^{(l)} (k) = \partial J (k) / \partial b_{j}^{(l)} (k) with respect to the weights and biases, respectively, are obtained via backpropagation.

In terms of the optimization strategy, the Adam algorithm offers several advantages over plain gradient descent. Its main feature is an adaptive learning rate scaling mechanism: the effective learning rate at each iteration is adjusted dynamically according to moment estimates of the gradient and is kept within a bounded range. By preventing extreme step sizes, this mechanism improves the numerical stability of the training process. The moment estimates and the resulting parameter updates are given by

m_{i j}^{(l)} (k) = β_{1} m_{i j}^{(l)} (k - 1) + (1 - β_{1}) g_{i j}^{(l)} (k - 1),

(3)

v_{i j}^{(l)} (k) = β_{2} v_{i j}^{(l)} (k - 1) + (1 - β_{2}) {(g_{i j}^{(l)})}^{2} (k - 1),

(4)

n_{j}^{(l)} (k) = β_{1} n_{j}^{(l)} (k - 1) + (1 - β_{1}) h_{j}^{(l)} (k - 1),

(5)

u_{j}^{(l)} (k) = β_{2} u_{j}^{(l)} (k - 1) + (1 - β_{2}) {(h_{j}^{(l)})}^{2} (k - 1),

(6)

\begin{matrix} w_{i j}^{(l)} (k) = & w_{i j}^{(l)} (k - 1) - η \frac{{\hat{m}}_{i j}^{(l)} (k)}{\sqrt{{\hat{v}}_{i j}^{(l)} (k)} + ε}, \end{matrix}

(7)

\begin{matrix} b_{j}^{(l)} (k) = & b_{j}^{(l)} (k - 1) - η \frac{{\hat{n}}_{j}^{(l)} (k)}{\sqrt{{\hat{u}}_{j}^{(l)} (k)} + ε}, \end{matrix}

(8)

where

{\hat{m}}_{i j}^{(l)} (k) = \frac{m_{i j}^{(l)} (k)}{1 - β_{1}^{k}}, {\hat{n}}_{j}^{(l)} (k) = \frac{n_{j}^{(l)} (k)}{1 - β_{1}^{k}},

{\hat{v}}_{i j}^{(l)} (k) = \frac{v_{i j}^{(l)} (k)}{1 - β_{2}^{k}}, {\hat{u}}_{j}^{(l)} (k) = \frac{u_{j}^{(l)} (k)}{1 - β_{2}^{k}} .

In this update rule, ε is a small positive constant that maintains numerical stability, while the scalars β_{1} and β_{2} are the exponential decay rates of the moment estimates. The variables m_{i j}^{(l)} (k) and v_{i j}^{(l)} (k) are the first-order and second-order moment estimates of the gradient for the weight at spatial position i of the j-th convolution kernel in layer l. Correspondingly, n_{j}^{(l)} (k) and u_{j}^{(l)} (k) are the moment estimates for the bias of the same kernel and layer. Finally, the k-th powers of β_{1} and β_{2} are used to perform bias correction on the moment estimates, which completes the adaptive update mechanism of the Adam optimization framework.
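The update in Equations (3), (4), and (7) can be sketched for a single scalar weight as follows. This is a minimal illustration rather than the authors' implementation: the function name `adam_step`, the default hyperparameter values, and the quadratic test objective are assumptions introduced here for demonstration.

```python
import math

def adam_step(w, m, v, grad_prev, k, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight, following Eqs. (3), (4), and (7).

    grad_prev is the gradient evaluated at the previous iterate w(k-1);
    k is the 1-based iteration counter.
    """
    m = beta1 * m + (1.0 - beta1) * grad_prev        # Eq. (3): first moment
    v = beta2 * v + (1.0 - beta2) * grad_prev ** 2   # Eq. (4): second moment
    m_hat = m / (1.0 - beta1 ** k)                   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** k)                   # bias-corrected second moment
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)   # Eq. (7): weight update
    return w, m, v

# Illustrative objective J(w) = w^2 with gradient 2w (an assumption for the demo).
w, m, v = 1.0, 0.0, 0.0
for k in range(1, 201):
    w, m, v = adam_step(w, m, v, grad_prev=2.0 * w, k=k, eta=0.01)
```

At the first iteration the bias-corrected ratio equals the sign of the gradient up to ε, so the first step has magnitude close to η regardless of the gradient scale.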

2.2. Definition and Discrete Form of the Hausdorff Derivative

For a differentiable function f (t) defined on the positive real line R^{+} and for an order parameter γ \in R^{+}, the Hausdorff derivative {}^{H}D_{t}^{γ} f (t) is defined by

{}^{H}D_{t}^{γ} f (t) = \lim_{t_{0} \to t} \frac{f (t) - f (t_{0})}{t^{γ} - t_{0}^{γ}} .

(9)

From this definition, a more convenient equivalent expression can be obtained. Specifically, assume that f is continuously differentiable on (0, \infty), that is, f \in C^{1} (0, \infty), and let γ > 0. Then, for any t > 0, introducing the time-scale variable τ = t^{γ} and applying the chain rule, d f / d τ = (d f / d t) / (d τ / d t) with d τ / d t = γ t^{γ - 1}, allows Equation (9) to be rewritten in the form

{}^{H}D_{t}^{γ} f (t) = \frac{1}{γ} \frac{d f (t)}{d t} t^{1 - γ} .

(10)

In this sense, the Hausdorff derivative can be viewed as a weighted derivative with respect to the transformed time scale τ = t^{γ}.
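The equivalence of Equations (9) and (10) can be checked numerically. The sketch below approximates the limit in Equation (9) with a small finite offset and compares it with the closed form in Equation (10). The function names, the offset `h`, and the test function f (t) = t^3 are assumptions chosen here for illustration.

```python
def hausdorff_derivative_limit(f, t, gamma, h=1e-6):
    """Approximate Eq. (9) with t0 = t - h as a finite stand-in for the limit."""
    t0 = t - h
    return (f(t) - f(t0)) / (t ** gamma - t0 ** gamma)

def hausdorff_derivative_closed(df, t, gamma):
    """Eq. (10): (1 / gamma) * f'(t) * t^(1 - gamma)."""
    return df(t) * t ** (1.0 - gamma) / gamma

# Illustrative test function f(t) = t^3 with derivative 3 t^2 (an assumption).
f = lambda t: t ** 3
df = lambda t: 3.0 * t ** 2

for gamma in (0.5, 1.0, 1.5):
    approx = hausdorff_derivative_limit(f, 2.0, gamma)
    exact = hausdorff_derivative_closed(df, 2.0, gamma)
    print(f"gamma={gamma}: limit form ~ {approx:.6f}, closed form = {exact:.6f}")
```

For γ = 1 the closed form reduces to the ordinary derivative f'(t).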

The rationale for employing the Hausdorff derivative lies in its inherent suitability for the fractal and rough loss landscapes typical of deep networks. Unlike standard derivatives that assume a smooth optimization surface, the Hausdorff operator utilizes a fractal metric that adapts to non-uniform geometries. This feature allows the optimizer to navigate complex non-convex terrains more effectively, resulting in superior stability and convergence compared to traditional integer-order methods.

To adapt this concept to a discrete optimization setting, the notion of the Hausdorff derivative is extended to the discrete domain. Here, we assume a unit time step Δ t = 1 for the discretization, treating the iteration index k as the discrete counterpart of physical time t. This assumption simplifies the computation while retaining the fractal memory characteristics. According to Equation (10), the Hausdorff difference operator for a sequence f (k) is defined as

{}^{H}▵_{k}^{γ} f (k) = \frac{1}{γ} (f (k) - f (k - 1)) k^{1 - γ} .

(11)

It is worth noting that, since k represents the cumulative iteration count, the scale factor k^{1 - γ} inherently depends on the number of iterations per epoch. Consequently, the memory length of the operator is implicitly influenced by the batch size used during training. The Hausdorff difference provides a generalized extension of the classical difference operator. When the order is set to γ = 1, Equation (11) reduces to the standard integer-order first difference,

{}^{H}▵_{k}^{1} f (k) = f (k) - f (k - 1) .

In this theoretical framework, the order γ is defined on the interval (0, + \infty) and determines the degree of deviation of the generalized operator from the classical integer-order case. However, the practical application of the HAdam algorithm imposes further restrictions on γ to guarantee numerical stability. The specific algorithmic constraints and the derivation of the effective upper bound for γ are detailed in Remark 1. In particular, when the Hausdorff difference is combined with the Adam optimizer with momentum, the parameter γ can be used to adjust the effective order of the parameter update rule and to control the influence of momentum during parameter training. This allows flexible regulation of the optimization process.
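The discrete operator in Equation (11) is simple enough to state directly in code. The following sketch evaluates it for one sequence and three orders. The function name `hausdorff_difference` and the sequence f (k) = k^2 are illustrative assumptions.

```python
def hausdorff_difference(f, k, gamma):
    """Eq. (11): (1 / gamma) * (f(k) - f(k - 1)) * k^(1 - gamma), for k >= 1."""
    return (f(k) - f(k - 1)) * k ** (1.0 - gamma) / gamma

# Illustrative sequence f(k) = k^2 (an assumption for the demo).
seq = lambda k: float(k * k)

first_diff = seq(10) - seq(9)                         # classical first difference
order_one = hausdorff_difference(seq, 10, 1.0)        # coincides with first_diff
order_above_one = hausdorff_difference(seq, 10, 1.5)  # scaled by 10^(-0.5) / 1.5
order_below_one = hausdorff_difference(seq, 10, 0.5)  # scaled by 10^(0.5) / 0.5
```

The only extra work relative to the classical difference is the scalar factor k^{1 - γ} / γ, which depends on the iteration index but not on the history of the sequence.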

3. Construction of Parameter Update Rules via Hausdorff Difference

The Hausdorff difference does not replace the stochastic gradient but functions as a dynamic weighting mechanism within the momentum accumulation process. Specifically, the fractal scaling term k^{1 - γ} acts as a dual regulator of the effective learning rate and inertia, allowing for nonlinear adaptation of the update trajectory.

Adam Algorithm with Hausdorff Difference

To transform the learning rule with integer-order momentum into its counterpart based on the Hausdorff difference, Equations (3)–(6) are rewritten as follows:

\begin{matrix} {}^{H}▵_{k}^{1} m_{i j}^{(l)} (k) & = m_{i j}^{(l)} (k) - m_{i j}^{(l)} (k - 1) \\ & = (β_{1} - 1) m_{i j}^{(l)} (k - 1) + (1 - β_{1}) g_{i j}^{(l)} (k - 1), \end{matrix}

(12)

\begin{matrix} {}^{H}▵_{k}^{1} v_{i j}^{(l)} (k) & = v_{i j}^{(l)} (k) - v_{i j}^{(l)} (k - 1) \\ & = (β_{2} - 1) v_{i j}^{(l)} (k - 1) + (1 - β_{2}) {(g_{i j}^{(l)})}^{2} (k - 1), \end{matrix}

(13)

\begin{matrix} {}^{H}▵_{k}^{1} n_{j}^{(l)} (k) & = n_{j}^{(l)} (k) - n_{j}^{(l)} (k - 1) \\ & = (β_{1} - 1) n_{j}^{(l)} (k - 1) + (1 - β_{1}) h_{j}^{(l)} (k - 1), \end{matrix}

(14)

\begin{matrix} {}^{H}▵_{k}^{1} u_{j}^{(l)} (k) & = u_{j}^{(l)} (k) - u_{j}^{(l)} (k - 1) \\ & = (β_{2} - 1) u_{j}^{(l)} (k - 1) + (1 - β_{2}) {(h_{j}^{(l)})}^{2} (k - 1) . \end{matrix}

(15)

This modification generalizes the original update rule from the integer-order case to the γ-order case. By replacing the first-order difference on the left-hand side of Equations (12)–(15) with the γ-order difference, we obtain

{}^{H}▵_{k}^{γ} m_{i j}^{(l)} (k) = (β_{1} - 1) m_{i j}^{(l)} (k - 1) + (1 - β_{1}) g_{i j}^{(l)} (k - 1),

(16)

{}^{H}▵_{k}^{γ} v_{i j}^{(l)} (k) = (β_{2} - 1) v_{i j}^{(l)} (k - 1) + (1 - β_{2}) {(g_{i j}^{(l)})}^{2} (k - 1),

(17)

{}^{H}▵_{k}^{γ} n_{j}^{(l)} (k) = (β_{1} - 1) n_{j}^{(l)} (k - 1) + (1 - β_{1}) h_{j}^{(l)} (k - 1),

(18)

{}^{H}▵_{k}^{γ} u_{j}^{(l)} (k) = (β_{2} - 1) u_{j}^{(l)} (k - 1) + (1 - β_{2}) {(h_{j}^{(l)})}^{2} (k - 1) .

(19)

Substituting the definition of the Hausdorff difference in Equation (11) into the left-hand sides of Equations (16)–(19) gives

\begin{matrix} \frac{1}{γ} (m_{i j}^{(l)} (k) - m_{i j}^{(l)} (k - 1)) k^{1 - γ} = (β_{1} - 1) m_{i j}^{(l)} (k - 1) + (1 - β_{1}) g_{i j}^{(l)} (k - 1), \end{matrix}

\begin{matrix} \frac{1}{γ} (v_{i j}^{(l)} (k) - v_{i j}^{(l)} (k - 1)) k^{1 - γ} = (β_{2} - 1) v_{i j}^{(l)} (k - 1) + (1 - β_{2}) {(g_{i j}^{(l)})}^{2} (k - 1), \end{matrix}

\begin{matrix} \frac{1}{γ} (n_{j}^{(l)} (k) - n_{j}^{(l)} (k - 1)) k^{1 - γ} = (β_{1} - 1) n_{j}^{(l)} (k - 1) + (1 - β_{1}) h_{j}^{(l)} (k - 1), \end{matrix}

\begin{matrix} \frac{1}{γ} (u_{j}^{(l)} (k) - u_{j}^{(l)} (k - 1)) k^{1 - γ} = (β_{2} - 1) u_{j}^{(l)} (k - 1) + (1 - β_{2}) {(h_{j}^{(l)})}^{2} (k - 1) . \end{matrix}

Solving for the moment estimates at iteration k yields the following recursions, which are used in the subsequent analysis:

\begin{matrix} m_{i j}^{(l)} (k) = [1 + \frac{γ}{k^{1 - γ}} (β_{1} - 1)] m_{i j}^{(l)} (k - 1) + \frac{(1 - β_{1}) γ}{k^{1 - γ}} g_{i j}^{(l)} (k - 1), \end{matrix}

(20)

\begin{matrix} v_{i j}^{(l)} (k) = [1 + \frac{γ}{k^{1 - γ}} (β_{2} - 1)] v_{i j}^{(l)} (k - 1) + \frac{(1 - β_{2}) γ}{k^{1 - γ}} {(g_{i j}^{(l)})}^{2} (k - 1), \end{matrix}

(21)

\begin{matrix} n_{j}^{(l)} (k) = [1 + \frac{γ}{k^{1 - γ}} (β_{1} - 1)] n_{j}^{(l)} (k - 1) + \frac{(1 - β_{1}) γ}{k^{1 - γ}} h_{j}^{(l)} (k - 1), \end{matrix}

(22)

\begin{matrix} u_{j}^{(l)} (k) = [1 + \frac{γ}{k^{1 - γ}} (β_{2} - 1)] u_{j}^{(l)} (k - 1) + \frac{(1 - β_{2}) γ}{k^{1 - γ}} {(h_{j}^{(l)})}^{2} (k - 1) . \end{matrix}

(23)

To ensure the algorithm is well-defined at the first iteration (k = 1), we address the interpretation of the term k^{1 - γ} and the initial states. At k = 1, the scaling factor becomes 1^{1 - γ} = 1, which avoids any numerical singularity in the coefficients of Equations (20)–(23). The variables with index k - 1 = 0 refer to the initialization states. Specifically, g_{i j}^{(l)} (0) represents the gradient computed at the initial weights w_{i j}^{(l)} (0), while the moment estimates are explicitly initialized as m_{i j}^{(l)} (0) = 0 and v_{i j}^{(l)} (0) = 0. Substituting these into Equation (20), the first update step simplifies to m_{i j}^{(l)} (1) = (1 - β_{1}) γ g_{i j}^{(l)} (0), ensuring a deterministic start to the optimization process.
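The recursions in Equations (20)–(23) share one pair of coefficients, which the following sketch computes for a scalar moment. The helper names `hadam_coefficients` and `hadam_moment_step` and the numerical values are hypothetical and serve only to illustrate the first-step simplification and the γ = 1 special case.

```python
def hadam_coefficients(k, gamma, beta):
    """Coefficients of Eqs. (20)-(23) at iteration k >= 1.

    Returns (retain, inject): the factor multiplying the previous moment
    and the factor multiplying the previous gradient statistic.
    """
    scale = gamma / k ** (1.0 - gamma)   # gamma / k^(1 - gamma)
    retain = 1.0 + scale * (beta - 1.0)
    inject = (1.0 - beta) * scale
    return retain, inject

def hadam_moment_step(moment_prev, stat_prev, k, gamma, beta):
    """One step of Eq. (20), or of Eqs. (21)-(23) with the matching statistic."""
    retain, inject = hadam_coefficients(k, gamma, beta)
    return retain * moment_prev + inject * stat_prev

# First step from m(0) = 0 with an illustrative gradient g(0) = 2.0.
m1 = hadam_moment_step(0.0, 2.0, k=1, gamma=0.8, beta=0.9)

# With gamma = 1 the recursion coincides with the Adam moment update, Eq. (3).
m_adam = 0.9 * 0.5 + (1.0 - 0.9) * 2.0
m_gamma_one = hadam_moment_step(0.5, 2.0, k=7, gamma=1.0, beta=0.9)
```

Note that the two coefficients always sum to one, so each step remains a weighted combination of the previous moment and the previous gradient statistic.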

Equations (20)–(23) constitute the central update mechanism of the proposed algorithm, translating the theoretical properties of the Hausdorff difference into practical parameter adjustments. By dynamically incorporating the order γ, this mechanism modulates the balance between the retention of historical information and the sensitivity to current gradients. Consequently, the optimizer acts as a dynamic system with a variable time scale, which enhances convergence speed during the transient phases and maintains stability throughout the steady-state training process.

In the HAdam algorithm, the first-order and second-order moment estimates may suffer from initialization deviations. To correct these deviations, the corresponding moment values must be properly initialized. Therefore, we study the solution of Equation (20) as follows.

Theorem 1.

Let G_{(1, 0)} = 1, and let μ_{(0, β_{1}, k)} = 1 + γ (β_{1} - 1) / k^{1 - γ} and μ_{(1, β_{1}, k)} = (1 - β_{1}) γ / k^{1 - γ} denote the coefficients of Equation (20). Then the solution of Equation (20) is

\begin{matrix} m_{i j}^{(l)} (k) = G_{(1, k)} m_{i j}^{(l)} (0) + \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} g_{i j}^{(l)} (k - q), \end{matrix}

(24)

where G_{(1, k)} = μ_{(0, β_{1}, k)} G_{(1, k - 1)} and H_{(1, k)} = μ_{(1, β_{1}, k)}.

Proof.

The theorem is proved by mathematical induction.

For k = 2, applying Equation (20) twice gives

\begin{matrix} m_{i j}^{(l)} (2) = & μ_{(0, β_{1}, 2)} μ_{(0, β_{1}, 1)} m_{i j}^{(l)} (0) + μ_{(0, β_{1}, 2)} μ_{(1, β_{1}, 1)} g_{i j}^{(l)} (0) + μ_{(1, β_{1}, 2)} g_{i j}^{(l)} (1) \\ = & G_{(1, 2)} m_{i j}^{(l)} (0) + \sum_{q = 1}^{2} \frac{G_{(1, 2)}}{G_{(1, 3 - q)}} H_{(1, 3 - q)} g_{i j}^{(l)} (2 - q) . \end{matrix}

(25)

Assume that Equation (24) holds for some k ≥ 2. Applying Equation (20) at iteration k + 1 then yields Equation (24) with k replaced by k + 1:

\begin{matrix} m_{i j}^{(l)} (k + 1) = & μ_{(0, β_{1}, k + 1)} m_{i j}^{(l)} (k) + μ_{(1, β_{1}, k + 1)} g_{i j}^{(l)} (k) \\ = & μ_{(0, β_{1}, k + 1)} [G_{(1, k)} m_{i j}^{(l)} (0) + \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} g_{i j}^{(l)} (k - q)] + μ_{(1, β_{1}, k + 1)} g_{i j}^{(l)} (k) \\ = & μ_{(0, β_{1}, k + 1)} G_{(1, k)} m_{i j}^{(l)} (0) + μ_{(0, β_{1}, k + 1)} \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} g_{i j}^{(l)} (k - q) + μ_{(1, β_{1}, k + 1)} g_{i j}^{(l)} (k) \\ = & G_{(1, k + 1)} m_{i j}^{(l)} (0) + \sum_{q = 1}^{k + 1} \frac{G_{(1, k + 1)}}{G_{(1, k - q + 2)}} H_{(1, k - q + 2)} g_{i j}^{(l)} (k - q + 1) . \end{matrix}

(26)

□
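Theorem 1 can also be checked numerically by comparing the recursion in Equation (20) with the closed form in Equation (24). The sketch below does this for a short, made-up gradient history. All function names, the gradient values, and the chosen γ and β_{1} are assumptions for illustration.

```python
def mu0(k, gamma, beta):
    """Retention coefficient: 1 + gamma * (beta - 1) / k^(1 - gamma)."""
    return 1.0 + gamma * (beta - 1.0) / k ** (1.0 - gamma)

def mu1(k, gamma, beta):
    """Injection coefficient: (1 - beta) * gamma / k^(1 - gamma)."""
    return (1.0 - beta) * gamma / k ** (1.0 - gamma)

def moment_by_recursion(grads, gamma, beta, m0=0.0):
    """Iterate Eq. (20): m(k) = mu0(k) m(k-1) + mu1(k) g(k-1)."""
    m = m0
    for k in range(1, len(grads) + 1):
        m = mu0(k, gamma, beta) * m + mu1(k, gamma, beta) * grads[k - 1]
    return m

def moment_by_closed_form(grads, gamma, beta, m0=0.0):
    """Evaluate Eq. (24) with G(1,0) = 1 and G(1,k) = mu0(k) G(1,k-1)."""
    k = len(grads)
    G = [1.0]
    for s in range(1, k + 1):
        G.append(mu0(s, gamma, beta) * G[-1])
    total = G[k] * m0
    for q in range(1, k + 1):
        s = k - q + 1                      # index of the coefficient H(1, s)
        total += G[k] / G[s] * mu1(s, gamma, beta) * grads[k - q]
    return total

grads = [0.3, -1.2, 0.7, 2.0, -0.4, 0.9]   # illustrative history g(0), ..., g(5)
rec = moment_by_recursion(grads, gamma=0.8, beta=0.9, m0=0.25)
closed = moment_by_closed_form(grads, gamma=0.8, beta=0.9, m0=0.25)
```

The two values agree up to floating-point rounding, while the recursion needs only the previous moment and the closed form needs the full gradient history.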

Similarly, the solutions of Equations (21)–(23) are given as

\begin{matrix} v_{i j}^{(l)} (k) = & G_{(2, k)} v_{i j}^{(l)} (0) + \sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)} {(g_{i j}^{(l)})}^{2} (k - q), \end{matrix}

\begin{matrix} n_{j}^{(l)} (k) = & G_{(1, k)} n_{j}^{(l)} (0) + \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} h_{j}^{(l)} (k - q), \end{matrix}

\begin{matrix} u_{j}^{(l)} (k) = & G_{(2, k)} u_{j}^{(l)} (0) + \sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)} {(h_{j}^{(l)})}^{2} (k - q), \end{matrix}

where G_{(2, k)} = μ_{(0, β_{2}, k)} G_{(2, k - 1)} and H_{(2, k)} = μ_{(1, β_{2}, k)}.

In the initial iterations of the HAdam algorithm, the first-order and second-order moment estimates depend on historical information, so their values are easily influenced by the initial state and may become biased. To reduce this initialization bias, the corresponding moment variables are initialized to zero before training begins. Specifically, if the initial conditions at iteration 0 are set as m_{i j}^{(l)} (0) = 0, v_{i j}^{(l)} (0) = 0, n_{j}^{(l)} (0) = 0, and u_{j}^{(l)} (0) = 0, then the following relations can be derived:

\begin{matrix} m_{i j}^{(l)} (k) = \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} g_{i j}^{(l)} (k - q), \end{matrix}

(27)

\begin{matrix} v_{i j}^{(l)} (k) = \sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)} {(g_{i j}^{(l)})}^{2} (k - q), \end{matrix}

(28)

\begin{matrix} n_{j}^{(l)} (k) = \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} h_{j}^{(l)} (k - q), \end{matrix}

(29)

\begin{matrix} u_{j}^{(l)} (k) = \sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)} {(h_{j}^{(l)})}^{2} (k - q) . \end{matrix}

(30)

Then, the mathematical expectations from Equations (27)–(30) are determined by

\begin{matrix} E [m_{i j}^{(l)} (k)] = & E [\sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} g_{i j}^{(l)} (k - q)] \\ = & E [g_{i j}^{(l)} (k - 1)] \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} + δ_{1}, \\ E [v_{i j}^{(l)} (k)] = & E [\sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)} {(g_{i j}^{(l)})}^{2} (k - q)] \\ = & E [{(g_{i j}^{(l)})}^{2} (k - 1)] \sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)} + δ_{2}, \end{matrix}

(31)

\begin{matrix} E [n_{j}^{(l)} (k)] = & E [\sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} h_{j}^{(l)} (k - q)] \\ = & E [h_{j}^{(l)} (k - 1)] \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} + δ_{3}, \\ E [u_{j}^{(l)} (k)] = & E [\sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)} {(h_{j}^{(l)})}^{2} (k - q)] \\ = & E [{(h_{j}^{(l)})}^{2} (k - 1)] \sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)} + δ_{4}, \end{matrix}

(32)

Here, δ_{r} with r = 1, 2, 3, 4 represents the residual term caused by the non-stationarity of the gradients during the update process. Specifically, δ_{r} accounts for the difference between the expectations of the historical gradients and the expectation of the most recent gradient. For instance, δ_{1} is defined as

δ_{1} = \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} (E [g_{i j}^{(l)} (k - q)] - E [g_{i j}^{(l)} (k - 1)]) .

Similar definitions apply to δ_{2}, δ_{3}, and δ_{4}. Assuming the objective function is smooth and the learning rate is sufficiently small, the change in the gradient expectation between adjacent steps is small. Consequently, δ_{r} remains close to zero, which justifies the approximation in the moment estimation.

Because of the zero-initialization condition m (0) = 0, a systematic bias exists in the early training stages. Unlike standard Adam, where the correction factor depends only on a constant decay rate, the bias structure of HAdam is time-varying and depends on γ. To eliminate this initialization bias exactly, we do not use the standard correction factor. Instead, we derive the correction terms from the coefficient expansion in Theorem 1. The corrected moment estimation formulas are defined as follows:

\begin{matrix} {\hat{m}}_{i j}^{(l)} (k) = \frac{m_{i j}^{(l)} (k)}{\sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)}}, \\ {\hat{v}}_{i j}^{(l)} (k) = \frac{v_{i j}^{(l)} (k)}{\sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)}}, \end{matrix}

(33)

\begin{matrix} {\hat{n}}_{j}^{(l)} (k) = \frac{n_{j}^{(l)} (k)}{\sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)}}, \\ {\hat{u}}_{j}^{(l)} (k) = \frac{u_{j}^{(l)} (k)}{\sum_{q = 1}^{k} \frac{G_{(2, k)}}{G_{(2, k - q + 1)}} H_{(2, k - q + 1)}} . \end{matrix}

(34)

To prevent the computational complexity from increasing linearly with iterations, we implement the bias correction terms recursively. Let

Ω_{m} (k)

and

Ω_{v} (k)

denote the denominators for the first-order and second-order moment corrections in Equations (33) and (34), respectively. Taking

Ω_{m} (k)

as an example,

Ω_{m} (k) = \sum_{q = 1}^{k} \frac{G_{(1, k)}}{G_{(1, k - q + 1)}} H_{(1, k - q + 1)} .

(35)

Using the property

G_{(1, k)} = μ_{(0, β_{1}, k)} G_{(1, k - 1)}

, this term can be rewritten as a recurrence relation:

Ω_{m} (k) = μ_{(0, β_{1}, k)} Ω_{m} (k - 1) + μ_{(1, β_{1}, k)},

(36)

with the initial condition

Ω_{m} (0) = 0

. A similar recursive form applies to

Ω_{v} (k)

. This recursive implementation ensures that the bias correction step has a constant time complexity of

O (1)

per iteration, making HAdam computationally efficient and suitable for large-scale training.
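
The agreement between the unrolled sum in Equation (35) and the recurrence in Equation (36) can be checked numerically. The following is a minimal Python sketch, not the authors' implementation. It assumes that G_{(1, k)} is the running product of the coefficients μ_{(0, β_{1}, i)} and that H_{(1, j)} equals μ_{(1, β_{1}, j)}, which is the form implied by Equation (36); all function names are illustrative.

```python
import math


def mu0(k, beta, gamma):
    """Momentum coefficient 1 + gamma * (beta - 1) / k**(1 - gamma)."""
    return 1.0 + gamma * (beta - 1.0) / k ** (1.0 - gamma)


def mu1(k, beta, gamma):
    """Gradient coefficient (1 - beta) * gamma / k**(1 - gamma)."""
    return (1.0 - beta) * gamma / k ** (1.0 - gamma)


def omega_recursive(K, beta, gamma):
    """Equation (36): one multiply-add per iteration, starting from Omega(0) = 0."""
    omega = 0.0
    for k in range(1, K + 1):
        omega = mu0(k, beta, gamma) * omega + mu1(k, beta, gamma)
    return omega


def omega_unrolled(K, beta, gamma):
    """Equation (35) rewritten as a sum over j = K - q + 1 (assumed form of G and H)."""
    total = 0.0
    for j in range(1, K + 1):
        weight = 1.0  # product of mu0(i) for i = j + 1, ..., K, i.e. G(1, K) / G(1, j)
        for i in range(j + 1, K + 1):
            weight *= mu0(i, beta, gamma)
        total += weight * mu1(j, beta, gamma)
    return total


if __name__ == "__main__":
    for gamma in (0.4, 0.8, 1.0):
        print(gamma, omega_recursive(50, 0.9, gamma), omega_unrolled(50, 0.9, gamma))
```

For γ = 1 the coefficients reduce to β_{1} and 1 - β_{1}, so the recurrence gives Ω_{m} (k) = 1 - β_{1}^{k}, which is the bias correction factor of standard Adam.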

To eliminate the estimation bias generated in the initialization stage, the corrected moment estimates, such as

{\hat{m}}_{i j}^{(l)} (k)

, are first computed by using Equations (33) and (34). These corrected terms are then applied in the HAdam optimization framework to obtain the final parameter update equations with order-dependent momentum:

\begin{matrix} w_{i j}^{(l)} (k) = w_{i j}^{(l)} (k - 1) - η \frac{{\hat{m}}_{i j}^{(l)} (k)}{\sqrt{{\hat{v}}_{i j}^{(l)} (k)} + ε}, \\ b_{j}^{(l)} (k) = b_{j}^{(l)} (k - 1) - η \frac{{\hat{n}}_{j}^{(l)} (k)}{\sqrt{{\hat{u}}_{j}^{(l)} (k)} + ε} . \end{matrix}

(37)

To study the role of the order parameter in the HAdam update, we introduce the following coefficients (Algorithm 1). If we set

μ_{(0, β_{1}, k)} = [1 + γ (β_{1} - 1) / k^{1 - γ}]

,

μ_{(1, β_{1}, k)} = (1 - β_{1}) γ / k^{1 - γ}

,

μ_{(0, β_{2}, k)} = [1 + γ (β_{2} - 1) / k^{1 - γ}]

, and

μ_{(1, β_{2}, k)} = (1 - β_{2}) γ / k^{1 - γ}

, then we can analyze how the order

γ

affects the weight update formulas. The influence of the order on the weight update formulas is examined in the following cases. Letting

β_{1} = 0.9

and

β_{2} = 0.999

, we select the order as

γ = 0.2, 0.4, \dots, 0.8

and

γ = 1.05, 1.1, 1.2, 1.4

, respectively, for the two cases

γ \in (0, 1]

and

γ \in (1, + \infty)

. The curves of

μ_{(0, β_{1}, k)}

with different orders

γ = 0.2, 0.4, \dots, 0.8

and

γ = 1.05, 1.1, 1.2, 1.4

are drawn in Figure 1.

Algorithm 1 HAdam optimization algorithm with recursive bias correction

Input: training set

D = {(x_{i, j}^{(l)}, y_{i, j}^{(l)})}

, learning rate

η

, exponential decay rates

β_{1}, β_{2} \in [0, 1)

, order parameter

γ > 0

, objective function J, small constant

ε

.
Initialize:

m_{i, j}^{(l)} (0) \leftarrow 0

,

v_{i, j}^{(l)} (0) \leftarrow 0

,

n_{j}^{(l)} (0) \leftarrow 0

,

u_{j}^{(l)} (0) \leftarrow 0

.
Initialize bias correction terms:

Ω_{m} (0) \leftarrow 0

,

Ω_{v} (0) \leftarrow 0

.
for

k = 1, 2, \dots, K

do

Step 1: Update raw moment estimates

\begin{matrix} m_{i j}^{(l)} (k) & = [1 + \frac{γ}{k^{1 - γ}} (β_{1} - 1)] m_{i j}^{(l)} (k - 1) + \frac{(1 - β_{1}) γ}{k^{1 - γ}} g_{i j}^{(l)} (k - 1), \\ v_{i j}^{(l)} (k) & = [1 + \frac{γ}{k^{1 - γ}} (β_{2} - 1)] v_{i j}^{(l)} (k - 1) + \frac{(1 - β_{2}) γ}{k^{1 - γ}} {(g_{i j}^{(l)} (k - 1))}^{2} . \end{matrix}

Step 2: Update recursive bias correction terms

\begin{matrix} Ω_{m} (k) & = [1 + \frac{γ}{k^{1 - γ}} (β_{1} - 1)] Ω_{m} (k - 1) + \frac{(1 - β_{1}) γ}{k^{1 - γ}}, \\ Ω_{v} (k) & = [1 + \frac{γ}{k^{1 - γ}} (β_{2} - 1)] Ω_{v} (k - 1) + \frac{(1 - β_{2}) γ}{k^{1 - γ}} . \end{matrix}

Step 3: Compute bias-corrected moments

{\hat{m}}_{i j}^{(l)} (k) = m_{i j}^{(l)} (k) / Ω_{m} (k), {\hat{v}}_{i j}^{(l)} (k) = v_{i j}^{(l)} (k) / Ω_{v} (k) .

Similarly compute

{\hat{n}}_{j}^{(l)} (k)

and

{\hat{u}}_{j}^{(l)} (k)

using corresponding

n, u

and

Ω

Step 4: Update parameters

\begin{matrix} w_{i j}^{(l)} (k) & = w_{i j}^{(l)} (k - 1) - η \frac{{\hat{m}}_{i j}^{(l)} (k)}{\sqrt{{\hat{v}}_{i j}^{(l)} (k)} + ε}, \\ b_{j}^{(l)} (k) & = b_{j}^{(l)} (k - 1) - η \frac{{\hat{n}}_{j}^{(l)} (k)}{\sqrt{{\hat{u}}_{j}^{(l)} (k)} + ε} . \end{matrix}

end for
Output: updated network parameters

w_{i j}^{(l)} (K)

,

b_{j}^{(l)} (K)

.
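
The four steps of Algorithm 1 can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors' code: parameters are held in a flat list, the order γ is fixed, weights and biases are treated alike, and the names `hadam` and `grad_fn` are hypothetical.

```python
import math


def hadam(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, gamma=0.8,
          eps=1e-8, num_steps=100):
    """Run num_steps HAdam iterations on a flat list of parameters."""
    w = list(w0)
    m = [0.0] * len(w)   # raw first moment, m(0) = 0
    v = [0.0] * len(w)   # raw second moment, v(0) = 0
    omega_m = 0.0        # recursive bias correction terms, Omega(0) = 0
    omega_v = 0.0
    for k in range(1, num_steps + 1):
        g = grad_fn(w)   # gradient at the previous iterate, g(k - 1)
        scale = gamma / k ** (1.0 - gamma)
        a1, b1 = 1.0 + scale * (beta1 - 1.0), (1.0 - beta1) * scale
        a2, b2 = 1.0 + scale * (beta2 - 1.0), (1.0 - beta2) * scale
        # Step 2: update the bias correction terms with the same coefficients.
        omega_m = a1 * omega_m + b1
        omega_v = a2 * omega_v + b2
        for i, gi in enumerate(g):
            # Step 1: update the raw moment estimates.
            m[i] = a1 * m[i] + b1 * gi
            v[i] = a2 * v[i] + b2 * gi * gi
            # Step 3: bias-corrected moments.
            m_hat = m[i] / omega_m
            v_hat = v[i] / omega_v
            # Step 4: parameter update.
            w[i] -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return w


if __name__ == "__main__":
    # Toy objective J(w) = 0.5 * ||w||^2, whose gradient is w itself.
    print(hadam(lambda w: list(w), [1.0, -2.0], eta=0.05, num_steps=300))
```

With γ = 1 the coefficients become β and 1 - β, and the loop reduces to the standard Adam update. A per-parameter or time-varying order, as in the adjustment methods discussed later, would require recomputing `scale` from the current order at each step.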

From the curves in Figure 1, we observe that, as k increases,

μ_{(0, β_{1}, k)}

also increases for all

γ \in (0, 1]

. This indicates that the coefficient

μ_{(0, β_{1}, k)}

grows monotonically with respect to the iteration index k under the considered range of

γ

. For

γ \in (1, + \infty)

, a decrease in

γ

likewise leads to an increase in

μ_{(0, β_{1}, k)}

. However, with an increase in k,

μ_{(0, β_{1}, k)}

gradually decreases, reaching even negative values for all

γ

. Therefore, avoiding such conditions is crucial to prevent instability in parameter tuning. To ensure the boundedness and non-negativity of the key coefficients throughout the training process, a strict sufficient condition is required. Specifically, for a maximum iteration count M, the order

γ

must satisfy

γ M^{γ - 1} (1 - β) \leq 1

. A detailed derivation of this stability criterion and the resulting upper bounds for

γ

are provided in Remark 1.
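
The trends read from Figure 1 and the sufficient condition stated above can be reproduced with a short script. This is a minimal sketch for illustration; the helper names `mu0` and `is_stable` and the sampled iteration indices are assumptions, not part of the original work.

```python
def mu0(k, beta, gamma):
    """Momentum coefficient 1 + gamma * (beta - 1) / k**(1 - gamma)."""
    return 1.0 + gamma * (beta - 1.0) / k ** (1.0 - gamma)


def is_stable(gamma, beta, max_iter):
    """Sufficient condition gamma * M**(gamma - 1) * (1 - beta) <= 1."""
    return gamma * max_iter ** (gamma - 1.0) * (1.0 - beta) <= 1.0


if __name__ == "__main__":
    beta1 = 0.9
    for gamma in (0.2, 0.4, 0.6, 0.8, 1.05, 1.1, 1.2, 1.4):
        row = [round(mu0(k, beta1, gamma), 4) for k in (1, 10, 100, 1000)]
        print(gamma, row, is_stable(gamma, beta1, 1000))
```

Each printed row lists the coefficient at k = 1, 10, 100, and 1000, followed by whether the order satisfies the condition for M = 1000. Rows with γ < 1 increase with k, whereas rows with γ > 1 decrease and may change sign.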

The values of

γ

were selected as

γ = 0.2, 0.4, \dots, 0.8

for the range

γ \in (0, 1]

and as

γ = 1.05, 1.1, 1.2, 1.4

for the range

γ \in (1, + \infty)

to examine the behavior of

μ_{(1, β_{1}, k)}

with respect to

γ

. Figure 2 shows the curves of

μ_{(1, β_{1}, k)}

for the selected values of

γ

. These curves illustrate how the coefficient

μ_{(1, β_{1}, k)}

varies with the iteration index k under different choices of the order

γ

.

From Figure 2, we observe that

μ_{(1, β_{1}, k)}

increases as the order

γ

increases. For each order

γ \in (0, 1]

, a decrease in k results in an increase in

μ_{(1, β_{1}, k)}

. Equivalently, for fixed

γ

,

μ_{(1, β_{1}, k)}

decreases as k increases, which indicates that the contribution of this coefficient becomes smaller at later iterations. For

γ \in (1, + \infty)

, an increase in

γ

likewise leads to an increase in

μ_{(1, β_{1}, k)}

. However, contrary to the previous range, an increase in k results in an increase in

μ_{(1, β_{1}, k)}

for all

γ

.

From the viewpoint of parameter interpretation,

μ_{(0, β_{1}, k)}

can be regarded as a coefficient that measures the influence of the momentum term, and

μ_{(1, β_{1}, k)}

can be viewed as a coefficient that reflects the effective global learning rate. For

γ \in (0, 1]

,

μ_{(0, β_{1}, k)}

decreases monotonically as

γ

increases, whereas

μ_{(1, β_{1}, k)}

increases monotonically. This means that a larger order weakens the contribution of the momentum term and strengthens the effect of the learning rate. Hence, in the initial stage of network training, it is desirable that

μ_{(0, β_{1}, k)}

is relatively small and

μ_{(1, β_{1}, k)}

is relatively large so that parameter updates are accelerated. In the later stage of training, it is preferable that

μ_{(0, β_{1}, k)}

is relatively large and

μ_{(1, β_{1}, k)}

is relatively small in order to improve recognition accuracy. When the iteration index k is small, which corresponds to the early training phase, a relatively large order

γ \in (0, 1]

should be chosen to speed up convergence. When k becomes large and the training enters a later phase, a relatively small order

γ

in the same interval should be selected to obtain a more refined optimization result. For

γ \in (1, + \infty)

,

μ_{(0, β_{1}, k)}

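decreases as

γ

increases, while

μ_{(1, β_{1}, k)}

increases as

γ

increases, in agreement with Figures 1 and 2. A larger order in this range therefore raises the weight update speed, but it also pushes

μ_{(0, β_{1}, k)}

toward negative values at large k, so a relatively small order

γ

is recommended here. Overall, the variation of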

μ_{(0, β_{1}, k)}

and

μ_{(1, β_{1}, k)}

across different training stages and order intervals provides an intuitive rule for selecting

γ

so as to balance rapid convergence and high recognition accuracy.
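
The complementary roles of the two coefficients can be made explicit. The short derivation below is added for illustration and uses only the coefficient definitions given above; the symbol \tilde{β}_{1}(k) is introduced here and is not part of the original notation.

```latex
\begin{aligned}
μ_{(0, β_{1}, k)} + μ_{(1, β_{1}, k)} &= 1 + \frac{γ (β_{1} - 1)}{k^{1 - γ}} + \frac{(1 - β_{1}) γ}{k^{1 - γ}} = 1, \\
\tilde{β}_{1}(k) &:= μ_{(0, β_{1}, k)} = 1 - (1 - β_{1}) γ k^{γ - 1}, \\
m(k) &= \tilde{β}_{1}(k) \, m(k - 1) + (1 - \tilde{β}_{1}(k)) \, g(k - 1).
\end{aligned}
```

Under this reading, the first-moment update is that of Adam with an iteration-dependent decay rate \tilde{β}_{1}(k). For γ = 1 it reduces to β_{1}, and requiring \tilde{β}_{1}(k) \geq 0 is the non-negativity condition treated in Remark 1.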

During network training, the choice of the order

γ

can be adjusted in coordination with the training stage and the gradient magnitude. In the early stage of training, the gradients of the parameters are relatively large. A larger order

γ

is then used so that the updates proceed faster and convergence is accelerated. In the later stage of training, the gradients gradually decrease. A smaller order

γ

is then adopted in order to enhance the smoothness and refinement of the updates and to improve the final recognition accuracy. Accordingly,

μ_{(0, β_{1}, k)}

and

μ_{(1, β_{1}, k)}

play complementary roles at different stages and support the transition from fast approach to fine adjustment.

Based on this idea, the order

γ

can be set adaptively according to the rules given in Equations (38)–(41). In this way, a mechanism is obtained in which the order is adjusted automatically by the iteration progress and the gradient information. Furthermore, two concrete methods for selecting the order are proposed, which provide simple and practical guidance for choosing

γ

in different training scenarios.

Adjustment Method One: Nonlinear order adjustment based on a cosine function.

In the initial stage of training, a relatively large order

γ

is selected to accelerate convergence. In the later stage of training, a smaller order

γ

is preferred in order to improve optimization accuracy. Based on this idea, a cosine function is used in this method to adjust the order

γ

in a nonlinear manner. In the lth layer, the orders

γ_{i j}^{(l)} (k)

for weight updates and

γ_{j}^{(l)} (k)

for bias updates vary smoothly from

γ_{max}

to

γ_{min}

within the interval

[γ_{min}, γ_{max}]

as the iteration index k increases. This strategy implements a natural transition from a stage of accelerated convergence to a stage of fine adjustment. It provides a simple and practical nonlinear scheme for selecting the orders associated with the weights and the biases.

γ_{i j}^{(l)} (k) = \cos (\frac{π}{2} \cdot \frac{k - 1}{M - 1}) (γ_{max} - γ_{min}) + γ_{min},

(38)

γ_{j}^{(l)} (k) = \cos (\frac{π}{2} \cdot \frac{k - 1}{M - 1}) (γ_{max} - γ_{min}) + γ_{min} .

(39)
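
Equations (38) and (39) share the same schedule, which can be sketched as follows. This is a minimal illustration; the function name `cosine_order` and the example values M = 940, γ_{min} = 0.2, and γ_{max} = 1.05 are chosen here for demonstration, and M > 1 is assumed so that the denominator M - 1 is nonzero.

```python
import math


def cosine_order(k, max_iter, gamma_min, gamma_max):
    """Order gamma(k) from Equations (38)-(39), for k = 1, ..., max_iter."""
    phase = (math.pi / 2.0) * (k - 1) / (max_iter - 1)
    return math.cos(phase) * (gamma_max - gamma_min) + gamma_min


if __name__ == "__main__":
    M = 940  # illustrative iteration budget
    for k in (1, M // 4, M // 2, M):
        print(k, round(cosine_order(k, M, 0.2, 1.05), 4))
```

The schedule starts at γ_{max} for k = 1 and ends at γ_{min} for k = M. Because the cosine is flat near zero, the order stays close to γ_{max} in the early iterations and falls fastest toward the end of training.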

Adjustment Method Two: Nonlinear adjustment based on a hyperbolic tangent function.

This strategy is designed to optimize the order configuration dynamically by using gradient feedback. In the early stage of optimization, large gradient norms require a fast response. In the later stage, small gradient norms require fine adjustment. A hyperbolic tangent function is used to construct the dependence of the order

γ

on the gradient norm. This design allows the order to adapt to the current gradient magnitude through expansion or contraction. In this way, fast descent is promoted in the initial phase, and the quality of convergence is improved in the later phase. A dynamic balance is thus achieved over the whole optimization process.

γ_{i j}^{(l)} (k) = \tanh (| g_{i j}^{(l)} (k - 1) |) (γ_{max} - γ_{min}) + γ_{min},

(40)

γ_{j}^{(l)} (k) = \tanh (| h_{j}^{(l)} (k - 1) |) (γ_{max} - γ_{min}) + γ_{min},

(41)

where

γ_{min}

and

γ_{max}

denote the minimum and maximum values, respectively, in the range of

γ_{i j}^{(l)} (k)

or

γ_{j}^{(l)} (k)

. To guarantee the robustness of the proposed method and minimize the necessity for layer-specific manual tuning,

γ_{min}

and

γ_{max}

are established as global hyperparameters shared across all network layers. The upper bound

γ_{max}

is strictly constrained by the theoretical stability criterion derived in Remark 1 (specifically,

γ \leq \bar{γ}

), whereas the lower bound

γ_{min}

is assigned a small positive value to exploit the acceleration characteristics inherent to fractional orders. Empirical evidence indicates that the interval

[0.2, 1.05]

consistently yields effective performance across diverse tasks.
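
The gradient-feedback rule of Equations (40) and (41) maps each gradient entry to an order in the interval from γ_{min} to γ_{max}. The sketch below is illustrative only; the function name `tanh_order` and the sample gradient values are assumptions.

```python
import math


def tanh_order(grad, gamma_min, gamma_max):
    """Order from Equations (40)-(41) for a single gradient entry."""
    return math.tanh(abs(grad)) * (gamma_max - gamma_min) + gamma_min


if __name__ == "__main__":
    for grad in (0.0, 0.01, 0.1, 1.0, 3.0):
        print(grad, round(tanh_order(grad, 0.2, 1.05), 4))
```

A zero gradient yields γ_{min}, and the order approaches γ_{max} as the gradient magnitude grows. Since tanh(3) is about 0.995, entries with magnitude above roughly 3 are effectively saturated, so the scale of the gradients matters when this rule is used.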

Remark 1.

Because

μ_{(0, β_{1}, k)} < 0

may occur for

γ \in (1, + \infty)

, we need to determine an upper bound for γ. Requiring

μ_{(0, β_{1}, k)} = [1 + γ (β_{1} - 1) / k^{1 - γ}] > 0

, we get

\frac{γ}{k^{1 - γ}} (β_{1} - 1) > - 1 .

Hence, we have

γ k^{γ - 1} (β_{1} - 1) > - 1 .

Due to

k^{γ - 1} \leq M^{γ - 1}

with

0 < k \leq M

,

β_{1} < 1

, and

γ > 1

, where M is the total number of iterations, we get

γ k^{γ - 1} (β_{1} - 1) \geq γ M^{γ - 1} (β_{1} - 1) .

To keep the left-hand side above -1 for every k \leq M, it therefore suffices to require

γ M^{γ - 1} (β_{1} - 1) \geq - 1 .

Let the order

γ_{0}

satisfy

γ_{0} M^{γ_{0} - 1} = \frac{1}{1 - β_{1}};

then the condition

γ \in (0, γ_{0}]

ensures that

μ_{(0, β_{1}, k)}

is not less than zero. Denote this threshold for β_{1} by

γ_{1}

, that is,

γ_{1} M^{γ_{1} - 1} = \frac{1}{1 - β_{1}} .

Then, we can use

γ_{1}

to determine the effective interval of γ. For example, let

M = 940

,

β_{1} = 0.9

, and then we get

γ_{1} = 1.2982

; hence, the interval of γ should be set as

γ \in (0, 1.2982]

.

In the same way, let the order

γ_{2}

satisfy

γ_{2} M^{γ_{2} - 1} = \frac{1}{1 - β_{2}} .

For example, with

M = 940

and

β_{2} = 0.999

, we obtain

γ_{2} \approx 1.9142

, so the second-moment coefficient alone would permit

γ \in (0, 1.9142]

.

Then, the effective interval of γ is

γ \in (0, \bar{γ}]

, where

\bar{γ} = \min (γ_{1}, γ_{2})

. For this example, the effective interval of γ for the Adam algorithm with the Hausdorff difference is

γ \in (0, 1.2982]

.
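
The threshold equation in Remark 1 has no closed-form solution, but its left-hand side is increasing in γ for M > 1, so a bisection search is sufficient. The following is a minimal sketch under that observation; the function name `order_upper_bound` and the search bracket [1, 10] are assumptions made for illustration.

```python
def order_upper_bound(beta, max_iter, lo=1.0, hi=10.0, tol=1e-10):
    """Solve gamma * M**(gamma - 1) = 1 / (1 - beta) for gamma by bisection."""
    target = 1.0 / (1.0 - beta)

    def f(gamma):
        return gamma * max_iter ** (gamma - 1.0) - target

    # f is increasing in gamma for max_iter > 1; the bracket is assumed
    # to satisfy f(lo) <= 0 <= f(hi).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    M = 940
    gamma_1 = order_upper_bound(0.9, M)
    gamma_2 = order_upper_bound(0.999, M)
    print(gamma_1, gamma_2, min(gamma_1, gamma_2))
```

For M = 940 and β_{1} = 0.9 the search returns the value γ_{1} = 1.2982 quoted in the remark (to four decimal places). Because β_{1} < β_{2} in the usual settings, the first-moment threshold is the binding one.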

4. Experimental Results Analysis

In this section, the performance of the proposed algorithms is evaluated on the AlexNet architecture. Two representative datasets are used in the experiments. The first one is the Fashion-MNIST dataset, which contains 70,000 grayscale images of size

28 \times 28

from 10 categories of clothing. Among them, 60,000 images are used for training and 10,000 images are used for testing. This dataset is adopted to examine the behavior of the algorithms on grayscale images. The second one is the CIFAR-10 dataset, which consists of 60,000 color images of size

32 \times 32

with three channels. It also has 10 categories, including common objects such as animals and vehicles. In this case, 50,000 images are used as the training set and 10,000 images are used as the test set. This dataset is used to assess the applicability and robustness of the algorithms on color natural images.

To ensure numerical stability and prevent gradient scaling issues in the tanh-based adjustment strategy, standard input normalization is applied to all datasets. Specifically, pixel values are scaled to the range

[0, 1]

prior to training. This preprocessing step ensures that the gradient magnitudes remain within a dynamic range that allows the tanh function to operate effectively without permanent saturation. For the experiments on the Fashion-MNIST dataset, both the HAdam algorithm and the standard Adam optimizer are configured with an identical learning rate of 0.001 and trained for 100 epochs. For the experiments on the CIFAR-10 dataset, the learning rate for both optimizers is set to 0.0001, and the number of training epochs is set to 200. In both cases, a batch size of 64 is adopted, and the default moment decay rates (

β_{1} = 0.9, β_{2} = 0.999

) are used. By enforcing consistent hyperparameter settings, we isolate the impact of the proposed Hausdorff difference operator on the optimization performance.

4.1. Influence of the Order $γ$ on the Performance of the HAdam Algorithm

The candidate intervals used to evaluate the impact of the order parameter are chosen on the basis of the theoretical analysis presented in Section 3. Specifically, the upper bound of the test range is limited by the stability constraint

\bar{γ}

derived in Remark 1 to prevent coefficient divergence, while the lower bound allows for the exploration of fractional memory effects. Accordingly, we investigate ranges that cover the stable fractional regime

(0, 1]

and the valid super-integer regime

(1, \bar{γ}]

.

To determine an appropriate range for the order parameter

γ

in the HAdam algorithm, several candidate intervals are considered, namely

γ \in [0.2, 0.35]

,

[0.75, 0.9]

,

[0.2, 1.05]

,

[0.9, 1.05]

,

[1.0, 1.15]

, and

[1.05, 1.2]

. For each interval of

γ

, the recognition accuracy and the training loss on the Fashion-MNIST dataset are computed. The HAdam algorithm with Adjustment Method One is compared with the standard Adam algorithm. The corresponding results are summarized in Figure 3, where the performance variation under different order intervals can be observed directly.

Figure 3 shows that, for the HAdam algorithm, recognition accuracy improves and training loss decreases as the order parameter

γ

increases. Across the six tested intervals, the recognition accuracy is highest and the training loss is lowest in the interval

[0.2, 1.05]

, whereas the worst results occur in the interval

[0.2, 0.35]

when using the HAdam algorithm. These observations suggest that including a broader range of

γ

values allows the method to better adapt to different training stages. Therefore, a wider interval for the order parameter

γ

is preferable when tuning the parameters of a CNN with the HAdam algorithm. Moreover, for intervals of the same length, a larger order

γ

consistently yields better performance than a smaller order

γ

when the HAdam algorithm is applied. Overall, the performance of the HAdam algorithm with Adjustment Method One is satisfactory compared with the Adam algorithm, except in the interval

[0.2, 0.35]

.

We also evaluate the performance of the second order adjustment method (Adjustment Method Two), which incorporates gradient information for CNN optimization. The same order intervals as in Adjustment Method One are used; that is, several different intervals of

γ

are considered. For each interval, the recognition accuracy and the training loss are computed for the HAdam algorithm with Adjustment Method Two and for the Adam algorithm. The results are shown in Figure 4. These results are used to examine the effect of explicitly introducing gradient information into the order adjustment strategy on the overall performance of the network.

From Figure 4, it can be seen that the results obtained with Adjustment Method Two are similar in overall trend to those obtained with Adjustment Method One. Among the six candidate intervals, when the order interval is

[0.2, 1.05]

, the HAdam algorithm with Adjustment Method Two achieves relatively high recognition accuracy and low training loss. In contrast, in the narrower interval

[0.2, 0.35]

, the recognition performance of HAdam is the worst and the training loss is the largest. Two observations can be summarized. First, a moderate increase in the length of the order interval tends to improve recognition performance. Second, for a fixed interval length, an increase in the order values usually improves the performance of the algorithm. In general, except for the interval

[0.2, 0.35]

, the HAdam algorithm with Adjustment Method Two outperforms the Adam algorithm in CNN optimization for the other suitable order intervals.

On the CIFAR-10 dataset, comparative experiments are carried out for the HAdam algorithm with the two nonlinear order adjustment strategies and for the standard Adam algorithm. The corresponding results are presented in Figure 5. This figure reports the recognition accuracy and the training loss of the different algorithms and allows a direct comparison of their optimization performance on CIFAR-10.

From Figure 5, it can be seen that the recognition accuracy increases when the length of the interval for the order parameter

γ

becomes larger. For a fixed interval length, an increase in the order

γ

also improves the performance of the HAdam algorithm. These observations indicate that both the length of the interval and the magnitude of the order have an important influence on the optimization effect of HAdam. In general, for all the order intervals except

[0.2, 0.35]

, the HAdam algorithm achieves better performance than the Adam algorithm.

The enhanced stability observed during the initial training phase is physically grounded in the metric scaling properties of the Hausdorff operator. Specifically, the term

k^{1 - γ}

in the update rule acts as a structural regularizer against the high-variance gradients typically encountered at the start of optimization. Unlike standard Adam, which may suffer from aggressive steps leading to oscillation, the proposed method utilizes this fractal scaling to effectively dampen initial fluctuations. This allows the optimizer to navigate the steep curvature of the loss landscape with reduced risk of divergence, securing a stable trajectory for subsequent acceleration.

The experimental results are averaged over 20 independent runs to statistically mitigate the impact of random initialization and stochastic gradient noise. This sample size is sufficient to ensure that the observed performance trends are statistically significant and consistent. On the Fashion-MNIST dataset, the recognition accuracy of the HAdam algorithm with Adjustment Method One and with Adjustment Method Two as a function of the order

γ

is summarized in Table 1. This table provides a direct comparison of the combined influence of the order and the two adjustment strategies on recognition accuracy.

By combining the results shown in Figure 4 and Figure 5, several conclusions can be drawn. First, except for the order interval

[0.2, 0.35]

, the HAdam algorithm shows overall better performance than the Adam algorithm in CNN optimization. Second, the proposed adaptive order strategies are suitable for two-dimensional image classification tasks and can consistently improve recognition performance. In addition, the overall influence of the order choice on performance is similar for Adjustment Method One and Adjustment Method Two. Based on these observations, it can be considered that selecting a relatively large order range in the interval

(0, \bar{γ}]

, together with adaptive tuning of the momentum and the learning rate, is beneficial for improving recognition accuracy and accelerating convergence. These conclusions provide practical guidance for choosing the order

γ

and its interval in actual training.

The superior performance of HAdam is attributed to the fractal time-scale mechanism

k^{1 - γ}

, which acts as an adaptive regularization coefficient to effectively dampen early-phase gradient oscillations and accelerate subsequent convergence. Furthermore, the recursive formulation of this operator preserves the theoretical benefits of fractional long-term memory while maintaining constant computational complexity, thereby achieving a synergistic balance between efficiency and optimization robustness.

4.2. Overheads and Limitations

While the recursive calculation of dynamic coefficients introduces a marginal scalar overhead, the algorithm maintains a constant time complexity of

O (1)

per iteration, yielding wall-clock training times comparable to standard Adam. Regarding limitations, the adaptive order parameter

γ

enlarges the hyperparameter search space; reducing this tuning burden is left for future research.

5. Conclusions

This study addressed the persistent challenge of optimizing CNNs within complex non-convex loss landscapes, where traditional integer-order optimizers often suffer from slow convergence and entrapment in local minima. To overcome these limitations, we proposed a novel optimizer, HAdam, which incorporates the Hausdorff difference operator to introduce a fractal time-scale mechanism into the gradient update rule. A key innovation of this work is the derivation of a recursive update formulation that reduces the computational complexity of fractional dynamics from linear growth to a constant

O (1)

, thereby ensuring scalability.

The experimental evaluations on the Fashion-MNIST and CIFAR-10 datasets demonstrated that, for suitable order intervals, the HAdam algorithm outperforms the standard Adam algorithm. Specifically, the proposed method achieved faster initial convergence and higher final classification accuracy than standard Adam, supporting the effectiveness of the adaptive order strategy in balancing exploration and exploitation.

Future research will focus on two specific directions. First, we aim to establish rigorous theoretical convergence bounds for the adaptive order parameter

γ

under non-convex conditions. Second, given the recursive efficiency of the proposed operator, we plan to extend the application of HAdam to the training of Transformer-based architectures and Large Language Models to investigate its potential in processing sequential data with long-term dependencies.

Author Contributions

Methodology, J.J. and Z.G.; Validation, H.Z.; Formal analysis, H.Z.; Writing—original draft, J.J.; Writing—review and editing, J.J.; Supervision, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of China, grant number 12471295.

Data Availability Statement

The original data presented in the study are openly available at https://github.com/zalandoresearch/fashion-mnist and https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 1 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Curves of

μ_{(0, β_{1}, k)}

with different orders.

Figure 2. Curves of

μ_{(1, β_{1}, k)}

with different orders.

Figure 3. Performance evaluation on the Fashion-MNIST dataset using Adjustment Method One across various order intervals.

Figure 4. Performance comparison on CIFAR-10 dataset under various orders

γ

and optimization schemes.

Figure 5. Performance analysis on the CIFAR-10 dataset across various orders

γ

and adjustment strategies.

Table 1. Comparison of recognition accuracy for HAdam using different adaptive adjustment strategies.

Method	γ ∈ [0.2, 0.35]	γ ∈ [0.75, 0.9]	γ ∈ [0.2, 1.05]	γ ∈ [1.05, 1.2]
Adjustment Method One	0.8859	0.8956	0.9005	0.8998
Adjustment Method Two	0.8860	0.8983	0.9023	0.9006
