Article

Self-Organizing Optimization Based on Caputo’s Fractional Order Gradients

1 College of Computer Science, Sichuan University, Chengdu 610065, China
2 Library of Sichuan University, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.
Fractal Fract. 2024, 8(8), 451; https://doi.org/10.3390/fractalfract8080451
Submission received: 30 April 2024 / Revised: 17 July 2024 / Accepted: 22 July 2024 / Published: 30 July 2024
(This article belongs to the Section Engineering)

Abstract

This paper analyses the condition necessary to guarantee non-divergence of Caputo's fractional order gradient descent (C-FOG) algorithm on multivariate functions. C-FOG is self-organizing, computationally efficient, simple, and easy to understand. It converges faster than classical gradient-based optimization algorithms and converges to slightly different points when the order of the fractional derivative differs. The additional degree of freedom provided by the order is very useful in situations where diversity of convergence is required, and it also allows for more precise convergence. Comparative experiments on a typical poorly conditioned function and on adversarial sample generation frameworks demonstrate the convergence performance of C-FOG: it outperforms currently popular algorithms in terms of convergence speed and, more excitingly, the diversity of convergence allows it to exhibit a stronger and more stable attack capability in adversarial sample generation procedures (the code for the experiments is available at: https://github.com/mulertan/self_optimizing/tree/main, accessed on 30 April 2024).

1. Introduction

Minimizing objective functions is common in machine learning and other engineering applications. The most popular algorithm for solving this type of problem is the gradient descent method (GDM), but it is not guaranteed to converge on multivariate functions. The most commonly used gradient-based optimization algorithms are Adam [1], Momentum [2], AdaGrad [3], RMSProp [4], and so on. However, these algorithms sometimes fail to meet computational requirements because of their slow convergence. In addition, it is sometimes necessary to find multiple points near the extreme point, which traditional gradient-based optimization algorithms cannot do. In this paper, a new search algorithm based on the theory of Caputo's fractional derivatives is introduced, which guarantees convergence to points within the neighborhood of the analytical solution. When the order of Caputo's fractional derivative differs, it converges to different points, which is useful in many situations, such as generating diverse adversarial samples: images with subtle changes from the originals that are imperceptible to humans but prone to be misjudged by DNNs.
It is well known that deep neural networks (DNNs) have made significant advances and have been applied to a wide range of areas [5,6,7,8]. However, DNNs are vulnerable to adversarial samples [9,10,11,12,13,14,15,16]. Even if the adversary cannot access the internals of the networks, adversarial samples can be successfully generated [17,18,19,20,21], which raises important concerns in security-critical areas [13,22,23,24]. C&W [25], introduced by Nicholas Carlini and David Wagner and optimized using Adam, is the most important representative algorithm based on traditional optimization for generating adversarial samples; many defensive methods, such as defensive distillation, cannot resist its attack despite being effective against many other attack algorithms [25,26]. In addition, C&W can generate well-known high-confidence adversarial samples, which have strong transfer attack performance [27]. However, its long run time and its limitation of converging to only one point are its Achilles' heel. It is desirable to obtain an algorithm that combines convergence speed with convergence diversity.
Other than integer gradient-based optimization, the study and application of optimization based on fractional gradient has attracted increasing interest and made great strides in recent years. A fractional adaptive algorithm based on the fractional Taylor series has been shown to converge to the mean square error if the step sizes are presented appropriately [28]. A variant of fractional stochastic gradient descent is proposed to enhance the memory effect and improve the speed and accuracy of the recommender system [29]. Standard hierarchical gradient descent is generalized to fractional order and can therefore effectively estimate the parameters of nonlinear control autoregressive systems under different fractional orders [30].
In this paper, we prove the convergence of Caputo’s fractional gradient descent algorithm over multivariate functions and compare the convergence properties under different orders separately. With this new search method, a new framework for generating adversarial samples is proposed and the additional freedom of the order is used to obtain the adversarial diversity. Experiments show that the new algorithm runs 10 times faster than C&W and requires only up to half of the time of other state-of-the-art optimization-based adversarial sample generation algorithms. More importantly, these new adversarial samples have a stronger transfer attack capability.
The rest of this paper is structured as follows. In Section 2, we introduce the related work on the gradient-based optimization methods. Some necessary mathematical background on fractional calculus is briefly introduced in Section 3. In Section 4, we analyze the conditions required for Caputo’s fractional order gradient descent (C-FOG) to guarantee no divergence on multivariable functions and intuitively illustrate its performance compared to currently popular optimization algorithms. In Section 5, C-FOG is applied to design the framework for generating adversarial samples, and several experiments are performed to demonstrate the speed improvement and the attack capability compared to the state-of-the-art optimization-based adversarial algorithms. Finally, conclusions are drawn, and we discuss the outlook.

2. Related Work

There are many gradient-based optimization algorithms, among which RMSProp [4], AdaGrad [3] and Adam [1] are the most directly related to our algorithm. It is well known that Momentum [2], presented below, contributes much to these algorithms:
$m_t = \beta m_{t-1} + \eta \cdot g_t$
$X_{t+1} = X_t - m_t$
where $g_t$ and $X_t$ are both vectors, denoting the gradient and the independent variables at time step $t$, with $g_t = \nabla_X f(X_t)$.
Momentum uses exponential moving averages to accelerate changes in the same direction and smooth out changes in the opposite direction. This method has a positive impact on later optimization algorithms.
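As a quick illustration, the two update equations can be sketched in a few lines of Python (a minimal sketch; the quadratic test objective and the hyperparameter values are our own illustrative choices, not from the paper):

```python
import numpy as np

def momentum_step(x, m, grad, eta=0.1, beta=0.9):
    # m_t = beta * m_{t-1} + eta * g_t ;  X_{t+1} = X_t - m_t
    m = beta * m + eta * grad
    return x - m, m

# Minimize the toy objective f(x) = x^2 (gradient 2x) from x = 5.
x, m = np.array([5.0]), np.zeros(1)
for _ in range(200):
    x, m = momentum_step(x, m, 2 * x)
```

The exponential moving average in `m` keeps accelerating while the gradient direction is stable and damps sign flips, which is exactly the smoothing effect described above.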
RMSProp calculates the exponential moving average of the squared gradients, which is used to continuously reduce the step size during iteration. It is depicted as follows:
$m_t = \beta m_{t-1} + (1 - \beta)\, g_t \odot g_t$
$X_{t+1} = X_t - \eta \cdot \dfrac{g_t}{\sqrt{m_t} + \varepsilon}$
where $\odot$ represents elementwise multiplication.
AdaGrad, on the other hand, replaces the exponential moving average $m_t$ in Equation (4) with the sum of the previous squared gradients, as shown below, to reduce the iteration step size. As a result of the rapidly decreasing step size, the algorithm may converge too slowly or even fail to converge.
$m_t = m_{t-1} + g_t \odot g_t$
Adam combines the two algorithms of AdaGrad and RMSProp, described as follows:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$
$X_{t+1} = X_t - \eta \cdot \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$
where $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$.
Equation (6) can be viewed as a momentum term that accelerates descent. Adam can be used for large gradient matrices. However, Adam uses the squared gradient to continuously reduce the step size, which slows its convergence.
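The full Adam recursion with bias correction can be made concrete in a short sketch (the test objective and hyperparameter values are illustrative, not from the paper):

```python
import numpy as np

def adam_step(x, m, v, grad, t, eta=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Eqs. (6)-(8): momentum m, squared-gradient average v, bias-corrected step
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return x - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize the toy objective f(X) = ||X||^2 from [5, -2].
x, m, v = np.array([5.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 401):
    x, m, v = adam_step(x, m, v, 2 * x, t)
```

Because the update is normalized by the square root of `v_hat`, each coordinate moves at roughly the same rate regardless of gradient scale, which is why Adam handles poorly scaled problems but inches forward once gradients shrink.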
Since then, many improved versions of the above optimization algorithms have appeared. AdaDelta [31] uses the exponential moving average of the squared gradients, instead of the sum of squared gradients in AdaGrad, as the denominator to adjust the step sizes. AdamW [32], Nadam [33], and AMSGrad [34] improve Adam from different perspectives. Nesterov improves Momentum by correcting the current values of the parameters to be optimized [35]. These algorithms are complex and converge slowly. In this paper, a new self-organizing optimization algorithm is proposed that is simple, has low memory requirements, and converges quickly with guarantees.

3. Mathematical Background

In order to better carry out the following analyses, a brief introduction to fractional calculus is provided in this subsection. In 1695, Leibniz discussed the meaning of a half-order derivative in his letter to L'Hôpital [36], which is the first mention of a non-integer order calculus. Later, Riemann, Liouville, Grünwald, Letnikov, Caputo, and others made outstanding contributions to the theory of fractional calculus [36,37,38]. To date, the most widely used definitions of fractional calculus are those of Grünwald–Letnikov, Riemann–Liouville, and Caputo.
The $p$-order Riemann–Liouville fractional derivative is defined as:
${}_a^{RL}D_t^p f(t) = \dfrac{1}{\Gamma(m - p + 1)} \left(\dfrac{d}{dt}\right)^{m+1} \int_a^t (t - \tau)^{m-p} f(\tau)\, d\tau, \quad (m \le p < m + 1)$
where $\Gamma(\cdot)$ is the gamma function, $f(\cdot)$ is a continuous function defined on $[a, b]$, and $m$ is an integer.
The Riemann–Liouville definition only requires that the function f ( t ) is integrable, which is the main reason why it is so popular in practical applications [37]. One of the drawbacks of the definition is that it is not easy to compute numerically. Therefore, in practice, most applications use the Riemann–Liouville definition for modeling and the Grünwald–Letnikov definition for calculation. The Grünwald–Letnikov fractional order derivative of order p is as follows:
${}_a^{GL}D_t^p f(t) = \sum_{k=0}^{m} \dfrac{f^{(k)}(a)\,(t - a)^{-p+k}}{\Gamma(-p + k + 1)} + \dfrac{1}{\Gamma(-p + m + 1)} \int_a^t (t - \tau)^{m-p} f^{(m+1)}(\tau)\, d\tau$
The above definition requires the function to be $m + 1$ times differentiable. This definition was later generalized, yielding the following representation:
${}_a^{GL}D_t^p f(t) = \lim_{N \to \infty} \dfrac{\left[\frac{t-a}{N}\right]^{-p}}{\Gamma(-p)} \sum_{j=0}^{N-1} \dfrac{\Gamma(j - p)}{\Gamma(j + 1)}\, f\!\left(t - j\left[\dfrac{t-a}{N}\right]\right)$
The above equation does not require any integral or differential operations on the function, only the function values at different points, and p can be either positive or negative. If p is positive, it represents differentiation, and if p is negative, it represents integration [38].
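Because this limit form needs only function values, it can be evaluated numerically by truncating the sum at a finite N and building the Γ-ratio weights with a simple recurrence (a sketch with our own choice of test function; for f(t) = t the half-order derivative at t = 1 is known in closed form to be 2/√π):

```python
import math

def gl_derivative(f, p, a, t, N=20000):
    # Truncated Grünwald–Letnikov sum: h^{-p} * sum_j w_j * f(t - j*h),
    # with w_j = Γ(j - p) / (Γ(-p) Γ(j + 1)) built by the recurrence
    # w_0 = 1, w_{j+1} = w_j * (j - p) / (j + 1).
    h = (t - a) / N
    w, total = 1.0, 0.0
    for j in range(N):
        total += w * f(t - j * h)
        w *= (j - p) / (j + 1)
    return total / h ** p

approx = gl_derivative(lambda u: u, 0.5, 0.0, 1.0)
exact = 2 / math.sqrt(math.pi)   # known value of D^{1/2} t at t = 1
```

The recurrence avoids evaluating the gamma function directly, which would overflow for large j.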
Caputo's fractional order derivative is defined as follows:
${}_a^{C}D_t^p f(t) = \dfrac{1}{\Gamma(n - p)} \int_a^t (t - \tau)^{n-p-1} f^{(n)}(\tau)\, d\tau, \quad (n - 1 < p \le n)$
Both the Riemann–Liouville definition and the Grünwald–Letnikov definition have the same property that the derivative of a constant is not zero, which is inconsistent with the conventional calculus. Instead, Caputo’s fractional order derivative of a constant is 0. In practice, particularly in physical engineering, Caputo’s fractional order derivatives are more widely used because the initial conditions required are the same as those for integer order differential equations [39].
The conventional method to search for the extremes, also known as the steepest descent algorithm, is the first-order gradient descent, which is formally expressed as:
$X_{k+1} = X_k - \eta \cdot g_k$
where η denotes the learning rate or the step size, and g k denotes the first order gradient of the objective function with respect to X at X k .
Initially, some algorithms tried to replace g k directly with the fractional order gradients defined above, but this cannot lead to convergence to the analytical solution [40,41,42]. Chen et al. proved that Caputo’s fractional order derivative with the variable initial value can guarantee the convergence to the extreme point [42]. The iterative formula is as follows:
$x_{k+2} = x_{k+1} - \eta \cdot {}_{x_k}^{C}D_{x_{k+1}}^{p} f(x_{k+1})$
where $0 < p < 1$, $\eta > 0$, and $f(x)$ is a smooth convex function with a unique extreme point $x^*$.
Repeatedly using the method of integration by parts, the definition of Caputo’s fractional order derivative in (12) can also be expressed as follows:
${}_a^{C}D_t^p f(t) = \sum_{i=n}^{\infty} \dfrac{f^{(i)}(a)}{\Gamma(i + 1 - p)} (t - a)^{i-p}, \quad (n - 1 < p \le n)$
Taking only the first term in (15), the iteration in (14) can also be expressed as follows when $0 < p \le 1$ [42]:
$x_{k+2} = x_{k+1} - \eta \cdot f^{(1)}(x_{k+1})\, |x_{k+1} - x_k|^{1-p}$
In (16), the order of the fractional derivative can be extended from $0 < p < 1$ to $0 < p < 2$. When $1 < p < 2$, convergence is faster than with the first-order derivative, and when $0 < p < 1$, it is slower [42].
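A minimal sketch of the scalar iteration (16) (the objective, step size, order, and starting points are our own illustrative choices; the tiny guard on the history difference is an implementation detail, not part of the formula):

```python
def caputo_fog_1d(grad, x0, eta=0.1, p=1.5, iters=100):
    # x_{k+2} = x_{k+1} - eta * f'(x_{k+1}) * |x_{k+1} - x_k|^{1-p}
    x_prev, x = x0, x0 + 0.01      # the scheme needs two initial points
    for _ in range(iters):
        delta = max(abs(x - x_prev), 1e-12)   # guard against a zero history difference
        x_prev, x = x, x - eta * grad(x) * delta ** (1 - p)
    return x

# Minimize f(x) = (x - 3)^2 with order p = 1.5 (gradient 2(x - 3)).
x_star = caputo_fog_1d(lambda x: 2 * (x - 3), x0=0.0)
```

With p > 1 the factor |Δ|^{1−p} shrinks large steps and enlarges small ones, which is the self-balancing behaviour analysed in Section 4.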
However, for multivariate functions, the fractional order gradient-based optimization algorithm like (16) is not so simple and must satisfy certain conditions to guarantee convergence.
Fractional order gradient descent is widely used in deep neural networks and other engineering fields. When applied in convolutional neural networks, experiments show high accuracy and the ability to escape local optima [43]. When applied to the backpropagation training of neural networks, fractional gradient descent algorithms based on Caputo's definition show monotonicity and order-dependent strong or weak convergence [44]. Zhu et al. combined the Caputo fractional gradient algorithm with Particle Swarm Optimization (PSO) [45] and found that in 99.9% of cases the global optimum was achieved, while the traditional gradient descent method reached the global optimum with only 27.2% probability [46].

4. Analysis of Caputo’s Fractional Order Gradient Descent Method and Evaluation

In general, gradient descent methods (GDM) converge to an analytic solution as long as the step size is small enough, but the convergence may be too slow, while a large step size may lead to divergence. Therefore, many optimization algorithms use trial and error to determine the appropriate step size. However, are there algorithms that do not diverge even if the step size is large? Caputo's fractional order gradient descent (CFOGD) has this property when the order is in the interval (1, 2). The theory comes from the gradient sign descent method (GSDM).

4.1. Condition for Non-Divergence

Each iterative step of GSDM can be formulated as
$X_{t+1} = X_t - \eta \cdot \mathrm{sign}(\nabla_X f(X_t))$
where $\nabla_X f(X_t)$ denotes the gradient of $f(X_t)$ w.r.t. $X$ at $X_t$, abbreviated as $\nabla_t$.
The convergence of GSDM can be proved using the online convex optimization framework [47].
Lemma 1. 
Assuming $f(X): \mathbb{R}^d \to \mathbb{R}$ is smooth and convex, the search method in (17) with step sizes $\rho_t = \frac{1}{\sqrt{t}}$ guarantees the following for all $T \ge 1$.
$\mathrm{Regret}(T) = \sum_{t=1}^{T} \left[f(X_t) - f(X^*)\right] \le \dfrac{D^2 G \sqrt{T}}{2} + dG\sqrt{T}$
where $X^* = \arg\min_X f(X)$; $\|\nabla_X f(X)\|_\infty \le G$; $\|X_m - X_n\|_2 \le D$ (for any $m, n \in \mathbb{Z}^+$).
Proof. 
By convexity of $f(X)$,
$f(X_t) - f(X^*) \le \nabla_t^{T} (X_t - X^*)$
$\|X_{t+1} - X^*\|^2 = \|X_t - \rho_t\, \mathrm{sign}(\nabla_t) - X^*\|^2 = \|X_t - X^*\|^2 + \rho_t^2 \|\mathrm{sign}(\nabla_t)\|^2 - 2\rho_t\, \mathrm{sign}(\nabla_t)^{T} (X_t - X^*) \le \|X_t - X^*\|^2 + d\rho_t^2 - 2\rho_t\, \mathrm{sign}(\nabla_t)^{T} (X_t - X^*)$
Therefore,
$\nabla_t^{T} (X_t - X^*) \le \dfrac{G}{2\rho_t} \left[\|X_t - X^*\|^2 - \|X_{t+1} - X^*\|^2\right] + \dfrac{dG\rho_t}{2}$
Summing the above from $t = 1$ to $T$,
$\mathrm{Regret}(T) \le \sum_{t=1}^{T} \nabla_t^{T} (X_t - X^*)$
$\le \sum_{t=1}^{T} \left[\dfrac{G}{2\rho_t}\left(\|X_t - X^*\|^2 - \|X_{t+1} - X^*\|^2\right) + \dfrac{dG\rho_t}{2}\right]$
$\le \sum_{t=1}^{T} \left[\dfrac{G}{2} \|X_t - X^*\|^2 \left(\dfrac{1}{\rho_t} - \dfrac{1}{\rho_{t-1}}\right) + \dfrac{dG\rho_t}{2}\right]$
$\le \dfrac{D^2 G}{2\rho_T} + \dfrac{dG}{2} \sum_{t=1}^{T} \rho_t$
$\le \dfrac{D^2 G \sqrt{T}}{2} + dG\sqrt{T}$
This completes the proof. □
When $T \to \infty$, $\frac{1}{T}\mathrm{Regret}(T) \le \frac{D^2 G}{2\sqrt{T}} + \frac{dG}{\sqrt{T}} \to 0$. According to Jensen's inequality, $f(X^*) \le f\left(\frac{1}{T}\sum_{t=1}^{T} X_t\right) \le \frac{1}{T}\sum_{t=1}^{T} f(X_t)$. Therefore, the search method in (17) is convergent.
From another point of view, Lemma 1 is intuitive:
$X_{t+1} = X_t - \eta \cdot \mathrm{sign}(\nabla_X f(X_t)) = X_t - \eta \cdot \dfrac{\nabla_X f(X_t)}{|\nabla_X f(X_t)|} = X_t - \eta' \cdot \nabla_X f(X_t)$
where $\eta' = \eta / |\nabla_X f(X_t)|$, $\eta$ is a non-zero positive constant, and $|\nabla_X f(X_t)|$ is the absolute value of the gradient, calculated elementwise.
In (17), $1 / |\nabla_X f(X_t)|$ is a per-dimension scaling factor on the constant step size $\eta$, through which the actual step size of each variable varies so that the greater the variation of a variable, the smaller its step size, and vice versa. Therefore, the search method in (17) guarantees a consistent approach to the neighborhood of the extreme point. RMSProp can be considered an approximate GSDM.
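The sign-descent step (17) with a decaying step size $\rho_t = \eta/\sqrt{t}$ can be sketched as follows (the quadratic target, starting point, and constants are our own illustrative choices):

```python
import numpy as np

def gsdm(grad, x0, eta=0.5, iters=400):
    # X_{t+1} = X_t - rho_t * sign(grad(X_t)), with rho_t = eta / sqrt(t)
    x = np.asarray(x0, dtype=float)
    for t in range(1, iters + 1):
        x = x - (eta / np.sqrt(t)) * np.sign(grad(x))
    return x

# Minimize f(X) = (x1 - 1)^2 + (x2 + 2)^2 from [-5, 0].
g = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])
x = gsdm(g, [-5.0, 0.0])
```

Every coordinate moves by exactly $\rho_t$ per step regardless of gradient magnitude, so badly scaled dimensions progress at the same rate, and the shrinking $\rho_t$ damps the oscillation once a coordinate crosses its optimum.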
Momentum can be applied in GSDM, which speeds up convergence and achieves higher accuracy. Adam can be viewed as GSDM with momentum added. CFOGD is also a variant of GSDM that converges faster or slower with different orders.
For a multivariate function $f(X)$, CFOGD can be formulated as:
$X_{t+1} = X_t - \eta \cdot \nabla_X f(X_t) \cdot |X_t - X_{t-1}|^{1-\alpha}, \quad (0 < \alpha \le 2)$
where $|X_t - X_{t-1}|^{1-\alpha}$ is calculated elementwise.
Theorem 1. 
Given $f(X)$ as defined in Lemma 1, the search method in (19) with step sizes $\eta_t = \dfrac{|X_t - X_{t-1}|^{\alpha-1}}{\sqrt{t}\, |\nabla_X f(X_t)|}$ guarantees the following for all $T \ge 1$.
$\mathrm{Regret}(T) = \sum_{t=1}^{T} \left[f(X_t) - f(X^*)\right] \le \dfrac{D^2 G \sqrt{T}}{2} + dG\sqrt{T}$
where $|X_t - X_{t-1}|^{\alpha-1}$, $|\nabla_X f(X_t)|$, and $\dfrac{|X_t - X_{t-1}|^{\alpha-1}}{|\nabla_X f(X_t)|}$ are all computed elementwise.
Proof. 
$\|X_{t+1} - X^*\|^2 = \|X_t - \eta_t \nabla_t\, |X_t - X_{t-1}|^{1-\alpha} - X^*\|^2 = \|X_t - \rho_t\, \mathrm{sign}(\nabla_t) - X^*\|^2$
where $\rho_t = \frac{1}{\sqrt{t}}$ is a scalar. The remainder of the procedure is identical to the proof of Lemma 1. Hence $\mathrm{Regret}(T) \le \frac{D^2 G \sqrt{T}}{2} + dG\sqrt{T}$ holds, and the search method in (19) is convergent. □
CFOGD can be intuitively viewed as:
$X_{t+1} = X_t - \eta \cdot \nabla_X f(X_t) \cdot |X_t - X_{t-1}|^{1-\alpha} = X_t - \eta' \cdot \nabla_X f(X_t)$
where $\eta' = \eta \cdot |X_t - X_{t-1}|^{1-\alpha}$ is a set of scaling factors giving the actual step sizes of the variables in each dimension.
It is clear that the factor $|x_t - x_{t-1}|^{1-\alpha}$ decreases the step sizes of variables whose change is less than 1.0 and increases them for changes greater than 1.0 when $0 \le \alpha < 1$. In this case, the search method in (19) increases the inconsistency between variables and may cause divergence. When $1 < \alpha \le 2$, variables with changes greater than 1 have their step sizes decreased, and variables with changes less than 1 have them increased. The search method in this scenario eventually balances the step-size changes across variables, which is in line with GSDM. When $\alpha = 1$, the search method in (19) reduces to the natural gradient descent method, and when $\alpha = 2$, it approximates GSDM.
We refer to the search method in (19) when 1 < α < 2 as C-FOG for short, because it guarantees no divergence. As the extreme point is approached, the variation is very small, and the actual step sizes become very large. Consequently, oscillations associated with the constant step size η and order of the fractional gradient α will occur. When 0 < α < 1 , the search method in (19) may be divergent, but as the extreme point is approached, it converges faster, and the oscillations shrink significantly.

4.2. Performance Evaluation

Consider the function $f(X) = X^T \begin{bmatrix} 0.1 & 0 \\ 0 & 10 \end{bmatrix} X$ with a condition number of 100, which means that the function is 100 times more sensitive to a change of input in one dimension than in the other. In machine learning, the cost is often highly sensitive to some directions in parameter space and insensitive to others [48]. Furthermore, this poorly conditioned function intuitively demonstrates the convergence performance of C-FOG compared to the other optimization algorithms.
Figure 1 intuitively illustrates the imbalance in speed of the two variables using the natural gradient descent method. With initial values [−5, −2] and 20 iteration steps, the trajectories of the two directions are obtained. The plots in the middle and on the right show that at learning rates of 0.4 and 0.6, x 2 has already diverged, while x 1 is still far from the extreme point. The left shows that at learning rate 0.1, x 2 oscillates repeatedly and x 1 progresses slowly.
Figure 2 shows the trajectories of two variables after 40 iterations using GSDM. After a few steps, both variables change consistently and neither diverges, even when the learning rate is set to 1.9. The higher the learning rate set, the faster the convergence, but the greater the induced oscillations.
Figure 3 shows the results after 40 iterations using C-FOG with fractional order $\alpha = 1.75$. As can be seen from the figure, at a learning rate of 0.1 both variables converge to the analytical solution very quickly, and at learning rates of 0.4 and 1.9 both variables oscillate but eventually converge. Figure 4 illustrates the trend in the actual learning rates $\eta' = \eta \cdot |X_t - X_{t-1}|^{1-\alpha}$. The faster-changing variable tends to receive a smaller learning rate and the slower-changing variable a larger one, with both eventually stabilizing, which is in line with the above analysis. However, Figure 3 also makes clear that C-FOG converges far from the analytical solution when the learning rate is high, such as $\eta = 1.9$. In practice, it is not feasible to set such a high learning rate in machine learning; this merely illustrates how stable C-FOG is with respect to the learning rate. By gradually reducing the learning rate, or by reducing the fractional order when approaching the extreme point, more accurate solutions can be obtained. Figure 5 shows the results of iterating 40 steps with C-FOG, where at the 20th step the learning rate is divided by 10. It can be seen that all runs converge close to the analytical solution.
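This qualitative behaviour can be reproduced with a short sketch of update (19) on the same function (the unit-offset seeding of the iterate history, the exact decay schedule, and the zero-difference guard are our own illustrative choices):

```python
import numpy as np

def c_fog(grad, x0, alpha=1.75, eta=0.1, iters=80, decay_at=40, decay=0.1):
    # X_{t+1} = X_t - eta * grad(X_t) * |X_t - X_{t-1}|^{1-alpha}, elementwise,
    # with the learning rate reduced partway through the run.
    x = np.asarray(x0, dtype=float)
    x_prev = x - 1.0            # seed the iterate history (our choice, not the paper's)
    for t in range(iters):
        if t == decay_at:
            eta *= decay
        scale = np.maximum(np.abs(x - x_prev), 1e-12) ** (1 - alpha)
        x_prev, x = x, x - eta * grad(x) * scale
    return x

# The poorly conditioned quadratic from the text: f(X) = 0.1*x1^2 + 10*x2^2
f = lambda x: 0.1 * x[0] ** 2 + 10 * x[1] ** 2
g = lambda x: np.array([0.2 * x[0], 20.0 * x[1]])
x = c_fog(g, [-5.0, -2.0])
```

Despite the condition number of 100, both coordinates end up near the minimum: the elementwise factor gives the slow direction larger effective steps and damps the fast one.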
Figure 6 shows the opposite behaviour using C-FOG with fractional order $\alpha = 0.75$ after 40 iterations. For $\eta = 0.01$ and $\eta = 0.02$, $x_2$ quickly approaches the extreme point, while $x_1$ advances slowly. For $\eta = 0.1$, $x_2$ diverges after a few iterations.
Figure 7 shows how the value of the function varies with the order of the fractional gradients. For each $\alpha$, 20 steps are iterated, and the values of the variables from the last iteration are taken. The learning rate is set to 0.01 because, if it is set too high, the variable in the vertical direction diverges when $0 < \alpha < 1$. The minimum of the function is 0.0035 when $\alpha = 1.8$. When $\alpha > 1$, the values of the function are close to 0 but slightly different.
To better illustrate the performance, comparisons were made with state-of-the-art optimization algorithms such as Momentum, AdaGrad, RMSProp, AdaDelta, and Adam. The typicality of f ( X ) makes it meaningful as a test benchmark function for comparison. There are two ways to evaluate the convergence: (1) by iterating 20 steps to observe the trajectories of each variable and its distance to the extreme point; (2) by examining the number of iterations for each algorithm to reach the vicinity of the extreme point. Figure 8 shows the trajectories of the variables. The hyperparameter settings for each optimization algorithm are shown in Table 1. For all algorithms, [−5, −2] is chosen as the starting point of the iteration.
It is clear from Figure 8 that only with C-FOG do both variables converge rapidly towards the extreme point and fluctuate less. For Momentum, both variables cross the extreme point and fluctuate significantly, whereas for the other algorithms both variables are far away from the extreme point. Of these, AdaDelta has the slowest rate of convergence. For Adam, the rate of convergence is greatly reduced for both variables.
Table 2 shows the minimum number of steps to approach the vicinity of the extreme point for each algorithm, where '--' indicates that the vicinity cannot be approached. The vicinity reflects the gap between the iterative value and the function's analytical minimum. For C-FOG, the initial learning rate is set to 0.1, and every 10 steps the learning rate shrinks to one fifth of its previous value. For RMSProp, the step size is halved every 80 steps. As can be seen from Table 2, C-FOG converges significantly faster than the other algorithms at the same accuracy.
Based on the above analysis and experimental data, the following conclusions can be drawn:
(1)
GSDM and C-FOG are both gradient-based self-organizing optimization algorithms that can guarantee no divergence. For the Caputo fractional order gradient descent method, the fractional order 1 < α < 2 is the necessary condition to guarantee no divergence.
(2)
Compared with currently popular gradient-based optimization algorithms, C-FOG converges faster, and the fractional order gives C-FOG additional freedom to converge to different points near the extreme point.

5. Application of C-FOG

5.1. Framework for Generating Adversarial Samples With C-FOG

In recent years, adversarial attacks against DNNs and defenses have become one of the hot topics, and the technology for generating adversarial samples is constantly innovating. According to the methods used by adversaries, adversarial attacks can be classified into gradient-based attacks, score-based attacks, decision-based attacks, and transfer-based attacks [17]. Gradient-based attacks, which use the gradient information of the model’s loss w.r.t. the input to generate samples, are the most important ways to evaluate the robustness of the model. As pointed out by Szegedy, a box-constrained optimization procedure can be used to find adversarial samples described as follows [12]:
$\min \|r\|_2$
subject to: (1) $C(x + r) = l$; (2) $x + r \in [0, 1]^m$
where $C(\cdot)$ is the classifier mapping the input image to the output label, $r$ is the perturbation, commonly measured by the $\ell_p$ norm with $p \in \{1, 2, \infty\}$, and $x + r$ is the adversarial sample closest to $x$ but misclassified as the label $l$.
This problem is not easy to solve, but it can be transformed into the following non-linear optimization scheme [25]:
$\min \|x' - x\|_2^2 + c \cdot f(x')$
subject to: $x' \in [0, 1]^n$
where $f(\cdot)$ is the loss function of the classifier, $c$ is the contribution factor of the regularization term to the loss function, and $x'$ is the adversarial sample.
Szegedy et al. approximated the solution using the complicated L-BFGS algorithm [12], which is time-consuming and does not scale to large images [15,25]. C&W uses Adam to optimize the problem and uses the following formula as the loss function [25]:
$f(x') = \max\left(\max\{Z(x')_i : i \ne t\} - Z(x')_t,\ -\kappa\right)$
where $\kappa$ indicates the confidence level, $Z(x')_i$ is the $i$-th element of the last logit layer, and $t$ is the target class (such that $t \ne C(x)$ for untargeted attacks or $t = C(x)$ for targeted attacks). Although theoretically appealing, C&W takes a long computation time and can only yield one adversarial sample for the same input, which is its main drawback. Madry proposed projected gradient descent (PGD) to generate multiple adversarial samples within the neighborhood of the input using the iterative Fast Gradient Sign Method (FGSM) [16], which significantly improves resistance to a wide range of adversarial attacks [49]. DeepFool utilizes gradient information to generate adversarial samples with the minimum distance from the decision boundary [15]. However, PGD and DeepFool have limited capability against distillation models and in transfer attacks. Here, applying C-FOG to optimize the two terms in (21) combines the advantages of C&W and PGD. The main procedure for generating adversarial samples using C-FOG is outlined in Algorithm 1.
Algorithm 1. C-FOG:
1. Input: image $x$, classifier $f$, $max\_iter$, the order $\alpha$
2. Output: the adversarial example $x'$
3. Initialize: $w_0 = x$, $w_1 = w_0 + 0.01$, $t = 0$
4. while $t < max\_iter$:
       $w_0 = \tanh(w_0)$
       $w_1 = \tanh(w_1)$
       $loss = \mathrm{MSELoss}(w_1 - x) + c \cdot f(w_1)$
       $w_2 = w_1 - \eta \cdot \nabla_{w_1} J(w_1) \cdot |w_1 - w_0|^{1-\alpha}$
       $w_0 = w_1$, $w_1 = w_2$
       $t \mathrel{+}= 1$
   end while
5. return $x' = \tanh(w_1)$
where $\nabla_{w_1} J(w_1)$ denotes the gradient of the loss w.r.t. $w_1$, and $|w_1 - w_0|^{1-\alpha}$ is calculated elementwise. Here $\tanh(\cdot)$ is used to constrain the output of each iteration to a bounded interval.
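To make the optimization loop concrete without a trained network, the sketch below swaps the classifier for a toy linear model with a hinge-style loss in the spirit of (22); the tanh squashing is omitted, and all names, dimensions, and constants here are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

# Toy stand-in for Algorithm 1: a linear "classifier" with logits z = W @ w and
# hinge loss f(w) = max(z_other - z_target, -kappa); the objective is
# ||w - x||^2 + c * f(w), minimized with the C-FOG update.
W = np.array([[1.0, 0.0], [0.0, 1.0]])    # 2 classes, 2 "pixels" (illustrative)
x = np.array([0.8, 0.2])                   # original input, classified as class 0
c, eta, alpha, kappa = 4.0, 0.02, 1.75, 0.0

def loss_grad(w, target=1, other=0):
    z = W @ w
    # subgradient of c * max(z_other - z_target, -kappa) plus the MSE term
    active = (z[other] - z[target]) > -kappa
    g_f = (W[other] - W[target]) if active else np.zeros_like(w)
    return 2 * (w - x) + c * g_f

w_prev, w = x.copy(), x + 0.01             # two starting points, as in step 3
for _ in range(200):
    scale = np.maximum(np.abs(w - w_prev), 1e-12) ** (1 - alpha)  # guard zeros
    w_prev, w = w, w - eta * loss_grad(w) * scale
z = W @ w                                   # w hovers near the class boundary
```

The iterate stays close to the original input while the hinge term pushes the logit margin toward zero, which is the trade-off the objective (21) encodes.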
Algorithm 2 shows, for comparison, the main procedure that generates adversarial samples using GSDM. At the same time, in the experimental subsection, we compare C-FOG with C&W, DeepFool and PGD, which are the most popular gradient-based algorithms to date. The detailed algorithms and the main codes can be found on the websites presented in the bibliography.
Algorithm 2. GSDM:
1. Input: image $x$, classifier $f$, $max\_iter$
2. Output: the adversarial example $x'$
3. Initialize: $w = x$, $t = 0$
4. while $t < max\_iter$:
       $w = \tanh(w)$
       $loss = \mathrm{MSELoss}(w - x) + c \cdot f(w)$
       $w = w - \eta \cdot \mathrm{sign}(\nabla_w J(w))$
       $t \mathrel{+}= 1$
   end while
5. return $x' = \tanh(w)$
where $\nabla_w J(w)$ denotes the gradient of the loss w.r.t. $w$.

5.2. Experimental Results

5.2.1. Evaluation of Attack Speed

To resist adversarial samples, most models tend to enlarge the input image size, increasing the computational overhead and complexity [50]. We trained ResNet on the Oxford-IIIT-Pet dataset [51], mobilenet_v2 on the Mini-ImageNet dataset [52], and VGG11 on the CUB-200-2011 dataset [53] to evaluate the speed of generating adversarial samples and the attack success probability of the above algorithms. For these datasets, we randomly selected 80% as the training set and the remaining 20% as the test set, and randomly selected 1000 images from the test set as the original images. The value of $c$ in (21) is set to 4, and $\kappa$ is set to 0. We trained these models on the training sets using standard fine-tuning; the test accuracies were 98.7%, 92.7%, and 99.1%, respectively.
The Oxford-IIIT-Pet dataset is for fine-grained image classification and covers 37 breeds of cats and dogs, with about 200 color images for each class. Mini-ImageNet is a popular dataset consisting of 60,000 color images evenly distributed across 100 classes. There are 11,788 color images in the CUB-200-2011 dataset, which consists of 200 classes of birds. The images are of different scales, so they are all rescaled to the same size of 299 × 299 , 224 × 224 and 256 × 256 for each dataset respectively.
Table 3 records the time in seconds to generate 1000 adversarial samples, and Table 4 records the probability of successful attacks for each algorithm. In Table 3, it can be seen that the time required by C-FOG is only one-tenth of that for C&W, one-fourth of that for PGD, and one-half of that for DeepFool. Table 4 shows that all these algorithms have roughly the same probability of a successful attack.
Some of these adversarial samples generated with C-FOG are shown in Figure 9. The first row shows the original images and categories. The adversarial samples and the corresponding misclassified categories are shown in the second row. It is difficult for the human visual system to distinguish between the two series of images, but the classifier made completely different identifications.

5.2.2. Evaluation of Attack Strength

The evaluation of gradient-based attack algorithms via defensive distillation models is widely accepted. Distillation is a knowledge compression technique that transfers knowledge from multiple models or from a large model to a small model [54]. Papernot et al. modified the distillation method by dividing the output by a temperature to prevent the adversary from using gradients to generate adversarial samples, called defensive distillation [55]. Defensive distillation enhances the robustness of the models and is popular for measuring the capability of gradient-based adversarial attacks.
On the handwritten digit dataset MNIST, we trained two structurally identical classifiers, Teacher and Student, consisting of two convolutional layers of 32 filters, a 2 × 2 max-pooling layer, and two convolutional layers of 64 filters, followed by another 2 × 2 max-pooling layer. After the convolutional layers, there were three fully connected layers of size 1024, 200, and 10, respectively.
The output layer of Teacher was modified as follows:
output = softmax(output / temperature)
When training Student, the labels are the outputs of Teacher, called soft labels, instead of one-hot labels. The output layer of Student was transformed in the same way as that of Teacher. On the evaluation set, however, the temperature was reset to 1.
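The temperature-scaled softmax above can be sketched in a few lines. This is a minimal illustration of the mechanism, not the training code used in the paper; the function name and the example logits are ours:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax as used in defensive distillation.
    A high temperature flattens the output distribution, which shrinks
    the gradients an adversary could exploit."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# At T = 1 the distribution is peaked; at T = 20 it is nearly flat.
hard = softmax_with_temperature([2.0, 1.0, 0.1], temperature=1.0)
soft = softmax_with_temperature([2.0, 1.0, 0.1], temperature=20.0)
```

At evaluation time the temperature is reset to 1, which is why a distilled model produces very confident, low-gradient outputs.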
Figure 10 shows the accuracies of adversarial samples generated with C&W (κ = 0), C-FOG (κ = 0), PGD, and DeepFool on Student at various temperatures. The performance of PGD against distilled models is extremely unstable, while DeepFool is stable, but the attack capability of both algorithms is far weaker than that of C&W and C-FOG. For defensively distilled models at low temperatures (<30), C-FOG has significant advantages over C&W.
Figure 11 compares the accuracy of adversarial samples generated with C&W and C-FOG at different κ on Student at different temperatures. For larger κ (>10), both C&W and C-FOG attack defensively distilled models at high temperatures (>10) with almost 100% success. For smaller κ, the advantages of C-FOG are more obvious. In general, C-FOG is more stable in performance than C&W.
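The confidence parameter κ used by both algorithms comes from the C&W-style margin loss, which pushes the target logit above the best competing logit by at least κ. A minimal, unbatched sketch of that margin (our own simplified form, not the paper's implementation) is:

```python
def cw_margin(logits, target, kappa=0.0):
    # C&W-style confidence margin for a targeted attack: the loss is
    # zero (clipped at -kappa) once the target logit exceeds the best
    # other logit by kappa. Larger kappa yields higher-confidence
    # adversarial samples, which transfer better but cost more distortion.
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)

# Target logit already 1.0 above the others: margin satisfied at kappa=0,
# but at kappa=5 the loss still penalizes the insufficient confidence gap.
loose = cw_margin([1.0, 3.0, 2.0], target=1, kappa=0.0)
tight = cw_margin([1.0, 3.0, 2.0], target=1, kappa=5.0)
```

This is why larger κ in Figure 11 gives near-perfect attacks on distilled models: the optimization does not stop until the misclassification is confident.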

5.2.3. Evaluation of Attack Transferability

Gradient-based attack algorithms are also useful when the internal structure of the target model is unknown and its training dataset is inaccessible. Here, we train a model with the same structure as Student on the MNIST training set as the attack target, called Oracle. After training as usual, Oracle achieves an accuracy of 99.2% on the training set and 98.9% on the test set. In addition, two substitute models are constructed with structures similar to, but different from, Oracle, as shown in Table 5. Subs_01 has one less linear layer than Oracle, while subs_02 has 5 times as many units in the first linear layer. To train the substitute models, we randomly select 150 samples from the MNIST training set as the initial training set. In each training iteration, a certain number of adversarial samples are added to the training set, resulting in a total of 7000 training samples. When generating adversarial samples with C-FOG, the fractional order is drawn uniformly from the interval [1.5, 1.99] with a learning rate of 0.01; this setting increases the diversity of the adversarial samples. Figure 12, Figure 13 and Figure 14 compare the test-set accuracy of subs_01 and subs_02 trained on the synthetic datasets generated by C&W, C-FOG, PGD, DeepFool, and GSDM, together with the probability that the generated adversarial samples successfully mislead Oracle.
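The substitute-training procedure just described can be sketched as a simple dataset-growth loop. This is a toy illustration under our own assumptions: `make_adversarial` is a hypothetical stand-in for C-FOG generation (with the fractional order drawn uniformly from [1.5, 1.99]), no real model is trained, and the samples are plain numbers rather than images:

```python
import random

def grow_substitute_set(seed_set, rounds, n_aug, make_adversarial):
    """Toy sketch of substitute training: start from a small seed set
    and, after each round of (re)training the substitute model, append
    adversarial samples crafted from randomly chosen members of the
    current set."""
    data = list(seed_set)
    for _ in range(rounds):
        # ... (re)train the substitute model on `data` here ...
        batch = random.sample(data, min(n_aug, len(data)))
        data.extend(make_adversarial(x) for x in batch)
    return data

def make_adversarial(x):
    # Hypothetical stand-in for C-FOG: perturb the sample using a
    # random fractional order drawn uniformly from [1.5, 1.99].
    order = random.uniform(1.5, 1.99)
    return x + 0.01 * order

# Seed with 150 samples, as in the experiment above.
data = grow_substitute_set(range(150), rounds=5, n_aug=150,
                           make_adversarial=make_adversarial)
```

Randomizing the order per sample is what gives the synthetic set its diversity, which in turn improves the substitute's generalization.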
The first and second sets of data in Figure 12 show the accuracies of subs_01 trained on the synthetic training set using C&W and C-FOG with different values of κ, and the third and fourth sets show the probability of successfully attacking Oracle with the adversarial samples generated on subs_01 by C&W and C-FOG. The data in Figure 13 and Figure 14 are presented in the same way.
As can be seen in Figure 12 and Figure 13, C-FOG clearly has a stronger transfer attack capability than C&W when κ = 0, 5, 10, and 20, while C&W is slightly better when κ = 40. This shows that C-FOG is less dependent on κ.
Figure 14 compares the test-set accuracy of subs_01 and subs_02 trained on the synthetic training sets generated with DeepFool, PGD, and GSDM, together with the rates of successful attack against Oracle. For comparison, we also show C-FOG with κ = 0. Figure 14 makes clear that C-FOG performs significantly better in transfer attacks, while PGD and DeepFool show obvious limitations. Moreover, substitute models trained on adversarial samples generated with C-FOG exhibit better generalization performance.

6. Conclusions and Outlook

In this paper, we analyzed the convergence properties of the gradient sign descent method (GSDM) on multivariate functions and then discussed the conditions required for Caputo's fractional order gradient descent method (C-FOG) to guarantee no divergence. We found that both GSDM and C-FOG are self-organizing optimizers. When the fractional order is in the interval (1, 2), C-FOG guarantees no divergence and converges faster than classical gradient-based optimization algorithms. Moreover, the additional freedom of the fractional order leads to diverse convergence results. We constructed an adversarial sample generation algorithm based on C-FOG; experiments show that it generates adversarial samples about 10 times faster than C&W and in at most half the time of other state-of-the-art gradient-based algorithms. The new algorithm generates diverse adversarial samples and has a stronger transfer attack capability. More importantly, it is more stable in terms of convergence.
There are still drawbacks and many unknown areas about C-FOG that need to be further explored, including
(1)
Compared with classical optimization algorithms, the power function increases computational complexity. During each iteration, the parameters from the last two iterations must be stored, and the (1 − α) power of the absolute value of the elementwise difference between the two parameter vectors must be computed. This increases the computational and storage overhead.
(2)
Adjusting the learning rate at the right time determines both the speed of convergence and the accuracy of the converged result, which requires experience and trial and error. C-FOG guarantees no divergence, but it cannot guarantee convergence into the neighborhood of the extreme point if the learning rate is too high; an appropriate learning rate therefore ensures both convergence speed and the desired accuracy. Adaptive learning-rate algorithms such as Adam are much simpler in this regard.
(3)
Further areas related to the convergence and divergence of C-FOG need to be explored, for example, the performance of C-FOG in high-dimensional parameter spaces, the relationships between the learning rates of these parameters, and the law of oscillations near the convergence points.
As a rapidly converging self-organizing optimization algorithm, C-FOG may be the subject of further research and development.
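To make the storage and power-function cost noted in point (1) concrete, a one-dimensional sketch of an update of this form is given below. This is our own hedged reconstruction, assuming the elementwise rule suggested by the actual learning rate in Figure 4, x_{t+1} = x_t − η·|x_t − x_{t−1}|^{1−α}·f′(x_t); the constants, the test function, and the small eps guard are ours:

```python
def cfog_step(x, x_prev, grad, lr=0.01, alpha=1.75, eps=1e-12):
    # Per-coordinate effective step: lr * |x - x_prev|**(1 - alpha),
    # mirroring the actual learning rate in Figure 4. The previous
    # iterate must be stored and an elementwise power computed, which
    # is the extra overhead discussed in point (1). eps avoids a zero
    # base raised to a negative power.
    return x - lr * (abs(x - x_prev) + eps) ** (1.0 - alpha) * grad

# Toy run on f(x) = x^2 / 2 (so f'(x) = x): with alpha in (1, 2) the
# iterates shrink toward 0 and then oscillate in a small neighborhood
# of the minimum instead of diverging.
x_prev, x = 0.9, 1.0
for _ in range(300):
    x_prev, x = x, cfog_step(x, x_prev, grad=x)
```

Because 1 − α is negative for α in (1, 2), the effective step grows as consecutive iterates get closer, which produces the bounded oscillation near the convergence point mentioned in point (3).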

Author Contributions

Conceptualization, Y.P. and S.T.; methodology, S.T. and N.Z.; software, S.T.; validation, N.Z. and S.T.; formal analysis, N.Z.; investigation, S.T.; resources, S.T.; data curation, S.T.; writing—original draft preparation, S.T.; writing—review and editing, N.Z. and Y.P.; visualization, S.T.; supervision, Y.P.; project administration, Y.P.; funding acquisition, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China (Grant No. 62171303), in part by the China South Industries Group Corporation (Chengdu) Fire Control Technology Center Project (non-secret) (Grant No. HK20-03), and in part by the National Key Research and Development Program of China (Grant No. 2018YFC0830300).

Data Availability Statement

The data that support the findings of this study are openly available at https://github.com/mulertan/self_optimizing/tree/main (accessed on 30 April 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  2. Tas, E. Learning Parameter Optimization of Stochastic Gradient Descent with Momentum for a Stochastic Quadratic. In Proceedings of the 24th European Conference on Operational Research (EURO XXIV), Lisbon, Portugal, 11–14 July 2010. [Google Scholar]
  3. Duchi, J.C.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 7, 2121–2159. [Google Scholar]
  4. Ruder, S. An Overview of Gradient Descent Optimization Algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  6. Shamshirband, S.; Fathi, M.; Dehzangi, A.; Chronopoulos, A.T.; Alinejad-Rokny, H. A review on deep learning approaches in healthcare systems: Taxonomies. J. Biomed. Inform. 2020, 113, 103627. [Google Scholar]
  7. Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-Dependent Pre-Trained Deep Neural Networks for Large vocabulary Speech Recognition. IEEE Trans. Audio Speech Lang. Proc. 2012, 20, 30–42. [Google Scholar] [CrossRef]
  8. You, Y.B.; Qian, Y.M.; He, T.X.; Yu, K. An investigation on DNN-derived bottleneck features for GMM-HMM based robust speech recognition. In Proceedings of the 2015 IEEE China Summit and International Conference on Signal and Information Processing, Beijing, China, 12–15 July 2015; pp. 30–34. [Google Scholar]
  9. Aslan, S. A deep learning-based sentiment analysis approach (MF-CNN-BILSTM) and topic modeling of tweets related to the Ukraine-Russia conflict. Appl. Soft Comput. 2023, 143, 110404. [Google Scholar] [CrossRef]
  10. Alagarsamy, P.; Sridharan, B.; Kalimuthu, V.K. A Deep Learning Based Glioma Tumor Detection Using Efficient Visual Geometry Group Convolutional Neural Networks. Braz. Arch. Biol. Technol. 2024, 67, 267101018. [Google Scholar] [CrossRef]
  11. Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; Roli, F. Evasion Attacks Against Machine Learning at Test Time. arXiv 2017, arXiv:1708.06131. [Google Scholar]
  12. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  13. Alexey, K.; Bengio, S.; Goodfellow, I. Adversarial Examples in the Physical World. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 14–16 August 2016. [Google Scholar]
  14. Machado, G.R.; Silva, E.; Goldschmidt, R.R. Adversarial Machine Learning in Image Classification: A Survey Toward the Defender’s Perspective. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  15. Moosavi-Dezfooli, S.M.; Fawzi, A.; Frossard, P. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2574–2582. [Google Scholar]
  16. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  17. Brendel, W.; Rauber, J.; Bethge, M. Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  18. Maho, T.; Furon, T.; Erwan, L.M. Surfree: A Fast Surrogate-Free Black-Box Attack. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Shenzhen, China, 19–25 June 2021. [Google Scholar]
  19. Rahmati, A.; Moosavi-Dezfooli, S.-M.; Frossard, P.; Dai, H. GeoDA: A Geometric Framework for Black-Box Adversarial Attacks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  20. Chen, J.B.; Jordan, M.I. HopSkipJumpAttack: A Query-Efficient Decision-Based Attack. In Proceedings of the IEEE Symposium on Security and Privacy (S&P), Oakland, CA, USA, 4 April 2019. [Google Scholar]
  21. Shi, Y.C.; Han, Y.H.; Hu, Q.H. Query-Efficient Black-Box Adversarial Attack with Customized Iteration and Sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2226–2245. [Google Scholar] [CrossRef]
  22. Qayyum, A.; Qadir, J.; Bilal, A. Secure and Robust Machine Learning for Healthcare: A Survey. IEEE Rev. Biomed. Eng. 2020, 14, 156–180. [Google Scholar] [CrossRef]
  23. Zhang, Z.; Ma, L.; Liu, M.; Chen, Y.; Zhao, N. Adversarial Attacking and Defensing Modulation Recognition with Deep Learning in Cognitive-Radio-Enabled IoT. IEEE Internet Things J. 2023, 11, 14949–14962. [Google Scholar] [CrossRef]
  24. Bai, Z.X.; Wang, H.J.; Guo, K.X. Summary of Adversarial Examples Techniques Based on Deep Neural Networks. Comput. Eng. Appl. 2021, 57, 61–70. [Google Scholar]
  25. Carlini, N.; Wagner, D. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the IEEE Symposium on Security and Privacy, San Jose, CA, USA, 22–26 May 2017. [Google Scholar]
  26. Carlini, N.; Wagner, D. Defensive Distillation is Not Robust to Adversarial Examples. arXiv 2016, arXiv:1607.04311v1. [Google Scholar]
  27. Papernot, N.; McDaniel, P.; Goodfellow, I. Transferability in Machine Learning: From Phenomena to Black-Box Attacks using Adversarial Samples. arXiv 2016, arXiv:1605.07277v1. [Google Scholar]
  28. Iqbal, F.; Tufail, M.; Ahmed, S.; Akhtar, M.T. A Fractional Taylor Series-based Least Mean Square Algorithm and Its Application to Power Signal Estimation. Signal Process. 2021, 193, 108405. [Google Scholar] [CrossRef]
  29. Khan, Z.A.; Chaudhary, N.I.; Raja, M.A.Z. Generalized Fractional Strategy for Recommender Systems with Chaotic Ratings Behavior. Chaos Solitons Fractals 2022, 160, 112204. [Google Scholar] [CrossRef]
  30. Chaudhary, N.I.; Raja, M.A.Z.; Khan, Z.A.; Mehmood, A.; Shah, S.M. Design of Fractional Hierarchical Gradient Descent Algorithm for Parameter Estimation of Nonlinear Control Autoregressive Systems. Chaos Solitons Fractals 2022, 157, 111913. [Google Scholar] [CrossRef]
  31. Zeiler, M.D. AdaDelta: An Adaptive Learning Rate Method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
  32. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  33. Tian, Y.J.; Zhang, Y.Q.; Zhang, H.B. Recent Advances in Stochastic Gradient Descent in Deep Learning. Mathematics 2023, 11, 682. [Google Scholar] [CrossRef]
  34. Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  35. Sutskever, I.; Martens, J.; Dahl, G.E.; Hinton, G. On the Importance of Initialization and Momentum in Deep Learning. In Proceedings of the 30th International Conference on Machine Learning, Toronto, ON, Canada, 17–19 June 2013. [Google Scholar]
  36. Podlubny, I. Preface. In Fractional Differential Equations; Academic Press: San Diego, CA, USA, 1998; Volume 198, p. XVII. [Google Scholar]
  37. Miller, K.S.; Ross, B. An Introduction to the Fractional Calculus and Fractional Differential Equation; Wiley: New York, NY, USA, 1993. [Google Scholar]
  38. Oldham, K.B.; Spanier, J. The Fractional Calculus—Theory and Applications of Differentiation and Integration to Arbitrary Order; Academic Press: New York, NY, USA, 1974. [Google Scholar]
  39. Gorenflo, R.; Mainardi, F. Fractional Calculus: Integral and Differential Equations of Fractional Order. Mathematics 2008, 49, 277–290. [Google Scholar]
  40. Pu, Y.F.; Zhou, J.L.; Zhang, Y.; Zhang, N.; Huang, G.; Siarry, P. Fractional Extreme Value Adaptive Training Method: Fractional Steepest Descent Approach. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 653–662. [Google Scholar] [CrossRef] [PubMed]
  41. Cheng, S.; Wei, Y.; Chen, Y.; Li, Y.; Wang, Y. An Innovative Fractional Order LMS Based on Variable Initial Value and Gradient Order. Signal Process 2017, 133, 260–269. [Google Scholar] [CrossRef]
  42. Chen, Y.; Gao, Q.; Wei, Y.; Wang, Y. Study on fractional order gradient methods. Appl. Math. Comput. 2017, 314, 310–321. [Google Scholar] [CrossRef]
  43. Sheng, D.; Wei, Y.; Chen, Y.; Wang, Y. Convolutional neural networks with fractional order gradient method. Neurocomputing 2020, 408, 42–50. [Google Scholar] [CrossRef]
  44. Wang, J.; Yan, Q.W.; Gou, Y.D.; Ye, Z.; Chen, H. Fractional-Order Gradient Descent Learning of BP Neural Networks with Caputo Derivative. Neural Netw. 2017, 89, 19–30. [Google Scholar] [CrossRef]
  45. Kennedy, J.; Eberhart, R. Particle Swarm Optimization. In Proceedings of the ICNN’95—International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995. [Google Scholar]
  46. Zhu, Z.G.; Li, A.; Wang, Y. Study on Two-Stage Fractional Order Gradient Descend Method. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021. [Google Scholar]
  47. Hazan, E. Introduction to Online Convex Optimization, 2nd ed.; Now Foundations and Trends: Boston, MA, USA, 2019; pp. 41–48. [Google Scholar]
  48. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2017; p. 306. [Google Scholar]
  49. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  50. Papernot, N.; Mcdaniel, P.; Goodfellow, I.; Jha, C.; Celik, Z.B.; Swami, A. Practical Black-Box Attacks against Machine Learning, In Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security, Dubai, United Arab Emirates, 2–6 April 2017.
  51. Parkhi, O.M.; Vedaldi, A.; Zisserman, A.; Jawahar, C.V. Cats and Dogs. In Proceedings of the 25th IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 18–20 June 2012; pp. 3498–3505. [Google Scholar]
  52. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching Networks for One Shot Learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  53. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset, Computer Science; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
  54. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  55. Papernot, N.; McDaniel, P.; Wu, X.; Jha, S.; Swami, A. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In Proceedings of the IEEE Symposium on Security and Privacy, San Jose, CA, USA, 22–26 May 2016. [Google Scholar]
Figure 1. Iteration results using the natural gradient descent method.
Figure 2. Iteration results using GSDM after 40 steps.
Figure 3. Iteration results using C-FOG with α = 1.75 after 40 steps.
Figure 4. The changing trend in the actual learning rates η′ = η · |X_t − X_{t−1}|^{1−α}.
Figure 5. Results using C-FOG to iterate 40 steps, with the learning rate reduced to η/10 at the 20th step.
Figure 6. Iteration results using C-IFOG with α = 0.75.
Figure 7. f(X) changes with the fractional order α.
Figure 8. Comparison with several currently popular optimization algorithms.
Figure 9. Original images and adversarial samples generated with C-FOG.
Figure 10. Accuracy comparison of adversarial samples generated with C&W, C-FOG, PGD, and DeepFool on Student at different temperatures.
Figure 11. Accuracy comparison of adversarial samples generated with C&W and C-FOG at different κ on Student at different temperatures.
Figure 12. Comparison of probabilities for training subs_01 and successfully attacking Oracle.
Figure 13. Comparison of probabilities for training subs_02 and successfully attacking Oracle.
Figure 14. Comparison of probabilities for training subs_01 and subs_02 and successfully attacking Oracle with PGD, DeepFool, and GSDM.
Table 1. Hyperparameter settings for each optimization algorithm.
Algorithm | Hyperparameters
C-FOG | η = 0.1, α = 1.75
Momentum | η = 0.1, β = 0.9
AdaGrad | η = 0.1
AdaDelta | β = 0.1
RMSProp | η = 0.1, β = 0.9
Adam | η = 0.1, β1 = 0.9, β2 = 0.999
Table 2. Minimum steps required to reach the vicinity of the extreme point.
Precision | C-FOG | Momentum | AdaGrad | RMSProp | AdaDelta | Adam
steps (1 × 10−5) | 32 | 104 | 3788 | 74 | 260 | 931
steps (1 × 10−10) | 61 | 222 | 7586 | 86 | 11,909 | 1547
steps (1 × 10−20) | 112 | 431 | 15,181 | 648 | -- | 2280
steps (1 × 10−30) | 171 | 640 | 22,776 | 1929 | -- | 2750
steps (1 × 10−40) | 221 | 901 | 30,371 | 3211 | -- | 3081
Table 3. Time in seconds to generate 1000 adversarial samples on each model.
Model | C&W | C-FOG | PGD | DeepFool | GSDM
resnet50 | 2545.71 | 251.68 | 1064.32 | 481.13 | 524.43
mobilenet_v2 | 1606.71 | 160.93 | 671.72 | 252.03 | 312.19
vgg11 | 1383.99 | 137.97 | 556.00 | 264.46 | 266.88
Table 4. Probability of different algorithms successfully attacking 1000 samples.
Model | C&W | C-FOG | PGD | DeepFool | GSDM
resnet50 | 98.3% | 98.5% | 98.8% | 98.2% | 98.7%
mobilenet_v2 | 98.7% | 99.5% | 99.5% | 99.1% | 98.8%
vgg11 | 97.5% | 99.4% | 99.3% | 99.4% | 99.1%
Table 5. The structure of substitute models subs_01 and subs_02.
Model | Conv | Conv | Conv | Conv | Linear | Linear | Linear
subs_01 | 32 | 32 | 64 | 64 | 200 | -- | 10
subs_02 | 32 | 32 | 64 | 64 | 1000 | 200 | 10