1. Introduction
Since the seminal work of McCulloch and Pitts on the neural model [1] and of Rosenblatt on perceptrons [2], activation functions have been a topic of interest in the context of artificial neural networks. In particular, given the singularity of the Heaviside function, differentiable functions such as the Sigmoid or tanh were introduced as activation functions [3]. Later, for deep learning, more activation functions were proposed to address new challenges such as the vanishing gradient problem [4,5], where ReLU [6] plays an important role, to the point of being considered the state of the art. Given the success of ReLU, further variants have been proposed, including Leaky ReLU [7], ELU [8], GeLU [9], Swish [10], and Mish [11], among others.
Two relevant concepts that have been considered to improve the results of neural networks are (i) activation functions with adaptive parameters and (ii) activation functions with fractional derivatives, or even a combination of both, as in this paper. Certainly, this work is not a pioneer in introducing these concepts, and the papers [12,13,14,15,16,17] are evidence of this. In addition to the previously cited works, the research of [18,19] serves as justification and motivation to continue applying fractional derivatives to activation functions: there, fractional gradients were applied successfully to the backpropagation algorithm to improve the performance of the gradient descent algorithms available in TensorFlow and PyTorch. In such a case, the formula for updating the synaptic weights requires the derivative of the activation function and involves, as a factor, the ν-order derivative of x with respect to x, which reduces to 1 in the case ν = 1, i.e., the first derivative of x with respect to x is 1; this makes it possible to extend the update rule from integer to fractional orders. Furthermore, the experiments support the idea that fractional optimizers improve on their integer-order versions.
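For reference, under the Caputo definition used later in the paper [34], the fractional derivative of the identity function has the standard closed form below (for x ≥ 0, base point 0; the order is denoted here by ν):
\[
\frac{d^{\nu}}{dx^{\nu}}\,x \;=\; \frac{x^{1-\nu}}{\Gamma(2-\nu)}, \qquad 0 \le \nu \le 1,
\]
which reduces to x for ν = 0 and to 1 for ν = 1.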
Regarding activation functions with adaptive parameters, they can be trained through the learning process and are therefore called trainable or learnable, since the adaptation is obtained from the training data. Indeed, to improve the accuracy of neural networks, adaptive activation functions such as SPLASH or APTx have been studied. On the one hand, SPLASH is a class of learnable piecewise linear activation functions with a parameterization to approximate non-linear functions [20]. On the other hand, APTx is similar to Mish but is intended to be more efficient to compute [21]. Other related works study ReLU variants, such as FELU [22], SELU, MPELU, and DELU [23], with parameters that assign non-zero values to the negative plane to solve the “Dying ReLU” phenomenon that stops learning [24]. Specifically, DELU considers three parameters to be determined, with ELU and ReLU obtained as special cases for particular parameter values. The Shape Autotuning Activation Function (SAAF) was proposed to simultaneously overcome three weaknesses of ReLU related to non-zero mean, negative missing (zero values), and unbounded output [25]. SAAF is a smooth function like Sigmoid and tanh, and piecewise like ReLU, but avoids some of their deficiencies by adjusting a pair of independent trainable parameters to capture negative information and provide a near-zero mean output, which leads to better generalization and learning speed, as well as a bounded output. In [26], three variants of tanh, tanhSoft1, tanhSoft2, and tanhSoft3, are presented as trainable variants depending on parameters that control the slope of the curve on both the positive and negative axes. Also, TanhLU is proposed in [26] as a combination of tanh and a linear unit. All these works are examples of the effort to build and generalize activation functions that, in addition to inheriting good properties from the original, provide flexibility through parameters to acquire properties that are useful for machine learning, such as controlling overfitting or avoiding stalled learning.
The other relevant topic considered in activation functions is the fractional derivative. The theoretical basis to extend the activation function derivative from integer to fractional order is fractional calculus theory [27,28], which generalizes the integer-order operators of integration and differentiation. These two operators are combined into the single concept of a fractional derivative once integration is conceived as antidifferentiation; thus, the orders are positive for derivatives and negative for antiderivatives so that, when added, the result is a real number, positive, negative, or zero. Historically, instead of “real”, the term “fractional” is used, perhaps because, in a letter, L’Hopital asked Leibniz about the order 1/2 [29]. An intuitive approach comes from the interpolation between two functions. Let f(x) = x and its first derivative f′(x) = 1; then, it is possible to build a “fractional” version from both of them. For ν in [0, 1], we obtain the combination (1 − ν) f(x) + ν f′(x). Now, if ν = 0, there is no derivative, and the result is equal to x. However, if ν = 1, the result is equal to 1, meaning the first derivative of x. For “fractional” values, i.e., 0 < ν < 1, it corresponds to a line with gradient (slope) 1 − ν. So, it offers a mechanism to control the gradient, which could be useful in backpropagation algorithms [18,19], considering that f(x) could represent a more sophisticated activation function. This is illustrated in Figure 1, where there is a line with unitary slope for ν = 0 and a horizontal line for ν = 1. The graph represents all lines with slopes in the interval [0, 1].
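As a minimal numerical sketch of this interpolation (assuming the convex-combination form written above; the helper name frac_line is illustrative):

import numpy as np

def frac_line(x, nu):
    # Interpolates between f(x) = x (nu = 0) and its first derivative f'(x) = 1 (nu = 1);
    # the result is a line with slope (1 - nu).
    return (1.0 - nu) * x + nu * 1.0

x = np.linspace(-2.0, 2.0, 5)
print(frac_line(x, 0.0))   # equals x itself
print(frac_line(x, 1.0))   # constant 1, the first derivative of x
print(frac_line(x, 0.5))   # a line with slope 0.5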
Effectively, fractional gradient optimizers [18,19] and fractional activation functions have been proposed based on fractional calculus concepts, as generalizations of the classic integer-order operators. For example, ref. [13] combines fractional derivatives of the activation functions with a fractional gradient descent backpropagation algorithm for perceptrons, where, instead of the first derivative, the fractional derivative of the activation function is applied. The experiments support the idea that models involving fractional derivatives are more accurate than the integer-order models. The fractional approach allows us to capture the memory and hereditary properties of the processes described by the data, since this is a property of the fractional calculus operators [27,28,30]. This is possible because the derivative focuses on a point, whereas the integral operator covers (observes) a neighborhood (integration interval) around a point of interest. A more extensive study of the interpretation of the fractional derivative can be found in [31], including an analogy previously described from a physical point of view, specifically concerning the concept of divergence, which measures how much a vector field behaves like a source (positive divergence) or a sink (negative divergence).
In [17], fractional Sigmoid activation functions and Fractional Rectified Linear Units (FReLUs) are proposed, and the order of the derivative of linear fractional activation functions is optimized in [15] to give rise to Fractional Adaptive Linear Units (FALUs).
Another possibility for applying fractional derivatives beyond the activation functions is to modify standard loss functions such as the Mean Squared Error (MSE), which involves a power of two; this exponent can be generalized through a fractional order a, yielding a fractional FMSE [12]. When the fractional order a recovers the integer exponent, FMSE is equivalent to MSE, but when a is not an integer, FMSE identifies more complex relationships between the input and the output. A similar approach can be used with the Cross-Entropy Loss function to obtain a Fractional Cross-Entropy Loss function. However, the authors of [12] report that the fractional order a must be adjusted by trial and error.
Wavelets are also on the list of activation functions [32] and, due to their oscillatory nature, they allow us to define sophisticated classification regions [33]. In this paper, a morphing activation function is presented based on fractional calculus. Morphing refers to the idea of changing the shape gradually to mimic different activation functions. Morphing is possible by applying the Caputo derivative [34] and varying the fractional ν-order to obtain shapes very similar to other activation functions, including Heaviside, Sigmoid, ReLU, Softplus, GeLU, Swish, and Haar. Related works present relationships between a few similar activation functions, for example, ReLU and Leaky ReLU, ReLU and Heaviside, Sigmoid and Softplus, or the hyperbolic tangent and the squared hyperbolic secant [14]. All the reviewed papers on adaptive activation functions focus on families of closely similar activation functions. Adaptive or fixed ReLU variants such as SAAF, SELU, MPELU, DELU, Leaky ReLU, ELU, GELU, Swish, and Mish focus on solving the drawbacks of the fundamental ReLU, but they follow a similar shape pattern. In contrast, SPLASH aims to approximate any function, including activation functions, but requires a significant number of piecewise linear components to reach a smoothness similar to Sigmoid, tanh, Swish, or Mish, and it lacks differentiability at the hinges. However, at the time of writing this paper, the morphing function is unprecedented in the sense of linking several seemingly unrelated activation functions, essentially by means of the fractional-order derivative. The range of shapes that the proposed morphing function can emulate is broad, since it evolves from wavelet to triangular, ReLU-variant, sigmoidal, and polynomial shapes depending only on the fractional-order derivative. So, given a single parameter, it is able to reproduce an infinite set of activation functions with different behavior and, when necessary, the shapes can be smooth. Indeed, the morphing activation function considers the fractional ν-order as a parameter, and it is possible to obtain its optimal value from the data during the training process, so it is trainable. Additionally, it facilitates obtaining mathematical expressions of piecewise activation functions with a polynomial approach, which could lead to improved computational efficiency. These points are relevant because, compared to other adaptive activation functions, the morphing function has the fewest parameters while exploring different shapes. Thus, rather than focusing on a family or subset of adaptive or fixed activation functions, the morphing function aims to encompass as many activation functions as possible. Therefore, this research has a twofold purpose: the first is to obtain a single equation that mimics a large list of activation functions, and the second is to obtain the optimal activation function shape by calculating the appropriate parameters from the training data using gradient descent algorithms, which is an approach different from that of other related works.
The approach followed by SAAF and MPELU shares some goals with this research, by attempting to combine the advantages of several activation functions. In the case of SAAF, it has parameters to mimic Mish or Swish in the negative plane, whereas MPELU is limited to ReLU and ELU variants. Morph goes further and manages to unify more activation functions by introducing fractional derivatives and wavelet concepts.
Experimental results show a competitive performance of the morphing activation function, which learns the best shape from the data and adaptively mimics other existing and successful activation functions. How is this possible? The fractional-order derivative is declared as a hyperparameter in PyTorch, as part of the learning process. Given an initial value, it changes towards the optimal value guided by the gradient descent algorithm used to optimize the rest of the parameters. But why does the Morph activation function work so well? Because it is able to mimic other successful activation functions.
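As an illustration of this mechanism (a minimal sketch only; the placeholder shape below simply blends the identity and ReLU and is not the actual Morph formula of Section 2):

import torch
import torch.nn as nn

class TrainableOrderActivation(nn.Module):
    # Sketch: a placeholder shape controlled by a trainable order-like
    # parameter nu, registered so that the optimizer updates it together
    # with the network weights.
    def __init__(self, nu_init=0.0):
        super().__init__()
        self.nu = nn.Parameter(torch.tensor(float(nu_init)))

    def forward(self, x):
        return (1.0 - self.nu) * x + self.nu * torch.relu(x)

model = nn.Sequential(nn.Linear(784, 128), TrainableOrderActivation(0.0), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # nu is optimized along with the weights

Because nu is an nn.Parameter, model.parameters() exposes it to SGD, so its value evolves from the given initialization towards an optimum during training.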
Finally, this paper aims to progress towards the construction of a more general formula to unify as many activation functions as possible.
3. Results
This section describes several experiments that support the conclusions. They include:
Experiments 1 to 7. Accuracy comparison between existing activation functions and the polynomial versions obtained from the proposed morphing activation function.
Experiments 8 to 11. Adaptation of the hyperparameters of activation functions (including Morph) during training, using gradient descent algorithms with the MNIST dataset.
Experiment 12. Adaptation of the hyperparameters of Morph, using gradient descent algorithms with the CIFAR10 dataset.
The experiments were developed in PyTorch running on an NVIDIA GeForce RTX 3070 GPU. The optimizer was SGD. The number of epochs was 30, and the metric used for comparison purposes is the test accuracy.
For Experiments 1 to 11, the dataset was MNIST, with 28 × 28 grayscale images split into a training set of 50,000 and a test set of 10,000. The neural network architecture is:
Conv2d(1, 32, 3, 1)
Activation Function1(x, hyperparams)
Conv2d(32, 64, 3, 1)
Activation Function2(x, hyperparams)
max_pool2d(x, 2)
Dropout(0.25)
flatten(x, 1)
Linear(9216, 128)
Activation Function3(x, hyperparams)
Dropout(0.5)
Linear(128, 10)
output = log_softmax()
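For reference, a runnable sketch of this architecture in PyTorch, with the three activation functions passed in as interchangeable modules (details not listed above, such as the dim argument of log_softmax, are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ExperimentNet(nn.Module):
    # Sketch of the architecture listed above; act1, act2, act3 are the
    # interchangeable activation modules compared in Experiments 1 to 11.
    def __init__(self, act1, act2, act3):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        self.drop1 = nn.Dropout(0.25)
        self.drop2 = nn.Dropout(0.5)
        self.act1, self.act2, self.act3 = act1, act2, act3

    def forward(self, x):
        x = self.act1(self.conv1(x))
        x = self.act2(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.drop1(x)
        x = torch.flatten(x, 1)
        x = self.act3(self.fc1(x))
        x = self.drop2(x)
        return F.log_softmax(self.fc2(x), dim=1)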
For Experiment 12, the dataset was CIFAR10. The neural network architecture is minimal:
nn.Linear(input_size, hidden_size)
nn.Linear(hidden_size, num_classes)
activation_function Morph()
Although the accuracy is not high, this experiment aims to show how the parameters change during training for different initializations.
3.1. Experiment 1: PolySoftplus, Softplus, and ReLU
A first experiment compares PolySoftplus, a scaled version of PolySoftplus, Softplus, and ReLU. The results are shown in the plots of Figure 13. Note that:
ReLU is superior to all the other activation functions.
PolySoftplus outperforms Softplus. This is somewhat intuitive after reading Section 2.10, since PolySoftplus looks more similar to ReLU than to Softplus (see Figure 10).
The scaled version is a good approximation of Softplus; therefore, it is not surprising that they produce essentially the same results.
3.2. Experiment 2: PolySwish, Swish, and ReLU
In Experiment 2, PolySwish is compared with Swish and ReLU.
Figure 14 shows that PolySwish is superior to Swish and ReLU. Swish is slightly better than ReLU.
3.3. Experiment 3: PolySigmoid, Sigmoid, and ReLU
In Experiment 3, the performance of the original Sigmoid was too low. For this reason, instead of the original Sigmoid, a scaled version was considered, and the performance improved noticeably.
Figure 15 shows the accuracies for PolySigmoid, the original Sigmoid, the scaled Sigmoid, and ReLU. It is possible to appreciate that PolySigmoid is superior to both Sigmoid versions. However, ReLU is better than PolySigmoid and therefore better than all the activation functions in this experiment.
3.4. Experiment 4: PolyGeLU and GeLU
In this experiment, both activation functions, PolyGeLU (Equation (44)) and GeLU (Equation (14)), are compared. Figure 16 shows the results, where it is evident that PolyGeLU outperforms GeLU.
3.5. Experiment 5: ELU Approximation from PolySigmoid
In order to obtain a piecewise polynomial version of the ELU function, the following approximation is made. Given Equation (42), which approximates Sigmoid via PolySigmoid, and focusing on the interval x < 0, where ELU is nonlinear, the exponential e^x can be solved for in terms of PolySigmoid. With this approximation, it is possible to write a polynomial version of ELU, named PolyELU.
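One way to reconstruct this step (a hedged sketch; Equation (42) itself is not reproduced here, and the standard ELU with α = 1 is assumed) uses the identity between the Sigmoid and the exponential:
\[
\sigma(x) = \frac{1}{1+e^{-x}} \;\Longrightarrow\; e^{x} = \frac{\sigma(x)}{1-\sigma(x)},
\qquad\text{so, for } x<0,\quad
\mathrm{ELU}(x) = e^{x}-1 \approx \frac{\mathrm{PolySigmoid}(x)}{1-\mathrm{PolySigmoid}(x)} - 1 .
\]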
Figure 17 shows the plots of PolyELU and ELU on the left side, whereas the corresponding accuracies are shown on the right side. The curves essentially overlap and are highly correlated, which confirms that PolyELU is a good approximation of ELU.
3.6. Experiment 6: Haar Wavelet, Triangular, Heaviside, and ReLU
Experiment 6 gathers some activation functions that achieve a low accuracy: Haar, Heaviside, and Sigmoid, as shown in Figure 18.
Haar’s accuracy is low, but higher than that of Heaviside and Sigmoid, which both remain at low values. It is striking that a Sigmoid function Sigmoid(αx), which approaches Heaviside as α tends to infinity, can increase its accuracy: it was experimentally found that the accuracy improves as α increases and, in fact, the accuracy of Sigmoid with a high α-value outperformed that of Heaviside.
After obtaining the maximum accuracy for the scaled Sigmoid(αx), it was compared with the Triangular function and ReLU. The results are in Figure 19, where Sigmoid(αx) outperforms Triangular, followed by ReLU. The original Sigmoid was also plotted just as a reference.
This experiment highlights the importance of having a non-zero gradient, a problem for Haar and Heaviside that also extends to Sigmoid with high scaling factors.
3.7. Experiment 7: Mish and PolyMish
Given the definition of PolySoftplus(x) in Equation (45), and some advantages over Softplus described in Section 3.1, it is natural to propose a first Mish approximation, named PolyMish.
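A plausible construction, assuming PolyMish simply replaces Softplus inside the standard definition Mish(x) = x · tanh(Softplus(x)), is:
\[
\mathrm{PolyMish}(x) = x \,\tanh\!\big(\mathrm{PolySoftplus}(x)\big).
\]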
Figure 20 shows the plots of Mish vs. PolyMish. Mish(x) follows a smooth curve for x > 0 and decays asymptotically to zero as x → −∞, whereas PolyMish vanishes faster than Mish and is exactly zero for sufficiently negative x (see Section 3.7). In other words, this first version is not a good approximation. Moreover, the bottom side of Figure 20 illustrates how Mish outperforms PolyMish.
The approximation to Mish can be improved by considering a scaling factor, as in Section 3.1. The top side of Figure 21 shows the resulting approximation of Mish, and the accuracy results are shown on the bottom side of Figure 21, where it is noticeable that this second version of PolyMish outperforms Mish.
3.8. Experiment 8: Adaptive Fractional Derivative Order for PReLU vs. ReLU
In this experiment, the fractional order ν of the PReLU-type activation given by Equation (33) is adapted during learning. The initial value of ν is zero. The network architecture involves three such functions, and each ν-parameter is adjusted with SGD as a hyperparameter in PyTorch. The results are shown in Figure 22. The top side shows that the PReLU version outperforms the network built only with ReLUs. The adaptive ν-values for the three PReLUs are shown at the bottom of Figure 22; the fractional derivative orders take negative values, i.e., the power of x is greater than 1.
3.9. Experiment 9: Fractional Derivative of ReLU as Activation Function
The importance of having a non-zero gradient for x < 0 is a motivation to use a fractional derivative of ReLU as the activation function, based on Equations (33) and (34), which leads to Equation (55). Beyond this, different fractional-order derivatives can be used, one order for x ≥ 0 and another for x < 0, which leads to Equation (56). For some choices of the orders, Equation (56) produces a division by zero at x = 0; to avoid this situation, it is possible to add a small ε [18], and a fixed value of ε was used in this experiment.
Results of four cases are reported in Figure 23, sorted from best to worst:
Case 1. Equation (55).
Case 2. Equation (56), with one choice of fixed orders.
Case 3. Equation (56), where the order ν is optimized with SGD as a hyperparameter.
Case 4. Equation (56), with another choice of fixed orders.
All ν-parameters are initialized to zero. For Case 3, ν is optimized in the three activation functions AF1, AF2, and AF3.
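A hedged sketch of this kind of activation is given below (an assumed form based on the Caputo factor x^(1−ν)/Γ(2−ν); Equations (55) and (56) themselves are not reproduced here, and the module name FracReLU and the ε handling are illustrative):

import torch
import torch.nn as nn

class FracReLU(nn.Module):
    # Sketch of a piecewise fractional-derivative-style activation with one
    # trainable order per semi-axis (nu1 for x >= 0, nu2 for x < 0). A small
    # eps keeps the power well defined when the exponent 1 - nu is negative.
    def __init__(self, nu1=0.0, nu2=0.0, eps=0.1):
        super().__init__()
        self.nu1 = nn.Parameter(torch.tensor(float(nu1)))
        self.nu2 = nn.Parameter(torch.tensor(float(nu2)))
        self.eps = eps

    def forward(self, x):
        g1 = torch.exp(torch.lgamma(2.0 - self.nu1))
        g2 = torch.exp(torch.lgamma(2.0 - self.nu2))
        pos = (torch.relu(x) + self.eps) ** (1.0 - self.nu1) / g1
        neg = (torch.relu(-x) + self.eps) ** (1.0 - self.nu2) / g2
        return torch.where(x >= 0.0, pos, -neg)

Setting requires_grad to False on the two orders reproduces the fixed-order cases, while leaving them trainable corresponds to Case 3.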
3.10. Experiment 10: Trainable Activation Functions with Fractional Order: The Fractional Derivative of Equation (55) and Morph vs. SPLASH
In this experiment, the adaptive ν-order of the fractional activation functions is optimized with SGD in PyTorch. Three activation functions are compared: SPLASH, the fractional derivative of Equation (55), and Morph.
From Equation (30), only seven terms are considered for SPLASH; one coefficient is initialized to one and the rest to zero (see [20]). These initial conditions correspond to initializing with a ReLU shape. SPLASH does not involve fractional derivatives, but it is used in this experiment as a reference for comparison purposes.
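For concreteness, a sketch of a seven-coefficient SPLASH-style unit is given below (the hinge locations and indexing are assumptions; see [20] for the exact parameterization):

import torch
import torch.nn as nn

class Splash7(nn.Module):
    # Sketch of a piecewise linear learnable activation with 7 coefficients:
    # 4 hinges on the positive side (including 0) and 3 on the negative side.
    # Hinge locations are fixed (assumed here as 0, 1, 2, 3); only the
    # coefficients a_pos and a_neg are trained.
    def __init__(self):
        super().__init__()
        self.register_buffer("b_pos", torch.tensor([0.0, 1.0, 2.0, 3.0]))
        self.register_buffer("b_neg", torch.tensor([1.0, 2.0, 3.0]))
        self.a_pos = nn.Parameter(torch.tensor([1.0, 0.0, 0.0, 0.0]))  # ReLU-shaped init
        self.a_neg = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        xe = x.unsqueeze(-1)  # broadcast against the hinge vectors
        pos = torch.relu(xe - self.b_pos) * self.a_pos
        neg = torch.relu(-xe - self.b_neg) * self.a_neg
        return pos.sum(-1) + neg.sum(-1)

At initialization this unit computes max(0, x), i.e., ReLU, matching the initial conditions described above.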
The neural network architecture has three activation functions and, for SPLASH, each unit needs seven a-parameters, so the total number of hyperparameters is 21 (which noticeably increases GPU memory consumption).
For the fractional derivative of Equation (55), the total number of hyperparameters is 3, and the ν-orders were initialized to zero, which means initializing with a piecewise linear shape.
In the case of Morph, a dilation a and a translation b are also used, so each unit has three parameters (ν, a, b). The initialization corresponds to the PolySoftplus of Equation (45) illustrated in Figure 10 (a shape similar to ReLU or Softplus; see Section 2.10). In this case, Morph needs to adapt nine hyperparameters (three per activation unit).
The accuracy results are shown in Figure 24. SPLASH adjusts all of its 21 parameters, but its performance is lower than that of the fractional derivative of Equation (55), which only uses three parameters.
Special attention is given to Morph, which achieves the best accuracy compared to SPLASH and the fractional derivative of Equation (55).
Figure 25 shows the evolution of the fractional orders of the three Morph activation functions used in the network architecture, all initialized to zero. The parameters a and b of these three Morph functions are also optimized and are plotted in the same figure.
Figure 26 shows how the shape of Morph changes as its parameters are updated, with snapshots at epochs 0, 2, and 30. These parameters are optimized with the gradient descent algorithm SGD in PyTorch. Only the values of one of the three Morph functions are presented, for reasons of space and because they are similar to those of the other two units.
3.11. Experiment 11: Comparison of Morph with 20 Other Activation Functions
Experiment 11 compares several activation functions: the piecewise polynomial ones obtained as special cases of Morph, and other existing and well-known activation functions. Based on the accuracy results, they are shown in Figure 27, sorted from left to right, worst to best, and enumerated in Table 1. Highlighted in bold are the cases where a polynomial version is better than its counterpart; for example, PolySoftplus is better than Softplus. Note that the highest accuracies are achieved by Morph with optimized parameters:
Case 21. Morph with one initialization, optimizing its fractional order during training.
Case 22. Morph with a different initialization, optimizing its fractional order during training.
Case 23. Morph, optimizing its fractional order together with the dilation and translation parameters during training.
3.12. Experiment 12: Adapting Morph with SGD and Adam
Finally, in the last experiment, a minimal neural network is used to illustrate how the fractional order of Morph is updated using Adam and a fractional SGD (FSGD) [19], with the fractional gradient order set to a value determined experimentally for MNIST [18,19]. Note that the fractional gradient uses the fractional derivative of the activation function to update the learning parameters, which is a different (and complementary) approach to adjusting the shape of the fractional activation function itself.
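A minimal sketch of a fractional-gradient SGD step is shown below (an assumed form based on the Caputo factor; it is not the exact FSGD of [19], and the order value is left as a user-supplied argument):

import torch

class FracSGD(torch.optim.Optimizer):
    # Sketch: plain SGD whose gradient is scaled by |w|^(1 - nu) / Gamma(2 - nu).
    def __init__(self, params, lr=0.01, nu=1.0, eps=1e-8):
        defaults = dict(lr=lr, nu=nu, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, nu, eps = group["lr"], group["nu"], group["eps"]
            gamma = torch.exp(torch.lgamma(torch.tensor(2.0 - nu)))
            for p in group["params"]:
                if p.grad is None:
                    continue
                factor = (p.abs() + eps) ** (1.0 - nu) / gamma
                p.add_(-lr * p.grad * factor)

With nu = 1, the scaling factor is identically 1 and the update reduces to plain SGD.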
The number of epochs was 100, and the accuracy results are shown in Figure 28. In Figure 29, the shapes of the activation function at several epochs, from initialization up to epoch 100, are plotted for both Adam and FSGD. This demonstrates that a fractional optimizer like FSGD can be competitive with more sophisticated optimizers like Adam; in fact, FSGD outperforms Adam in this experiment.
The source code of all the experiments is available at http://ia.azc.uam.mx (accessed on 11 July 2024).
4. Discussion
One of the branches of artificial intelligence research is the architecture of neural networks, focused on the number of layers, the types of modules, and their connections. However, another branch that has gained traction is the proposal of new activation functions that provide the non-linearity required by neural networks to efficiently map the inputs of complex data to the outputs.
Fractional calculus emerges as a mathematical tool to generalize differentiation to non-integer orders, which is useful for building activation functions; in this paper, a novel adaptive activation function based on fractional derivatives was proposed.
The search for the best activation function has led to the proposal of adaptive functions that learn from data. The adjustment of hyperparameters allows us to control stability, overfitting, and to enhance the performance of neural networks, among other challenges in machine learning.
Although we are certainly not working on function approximation with Morph, but rather using it as an activation function, it is important to emphasize the approximation property of Morph and, consequently, to remark on how it provides the nonlinearity needed in a neural network to efficiently map inputs to outputs with good generalization capacity.
The proposed Morph function approximates the Sigmoid function sufficiently well to satisfy the conditions of the Universal Approximation Theorem [37]. Thus, there is a theoretical justification for Morph to approximate functions on compact subsets.
Also, Morph is able to mimic shapes such as the Haar wavelet, which in turn generates a basis of a function space, and this gives an idea of the approximation capacity of the proposed function [43]. In fact, the inspiring idea for Morph relies on the Haar decomposition as a linear combination of translated Heaviside functions.
Indeed, two fundamental operations on functions are translation and dilation [43]. In this way, for a dilation a ≠ 0 and a translation b, the translated and dilated version of f(x) is f((x − b)/a).
A successful application of this was the approximation of Sigmoid via PolySigmoid and Swish via PolySwish in Section 2.7, and of GeLU via PolyGeLU in Section 2.9.
Given that several activation functions include x as a factor, an adaptation of Morph is obtained by multiplying it by x, with suitable choices of the parameters ν, a, and b.
Considering this, other activation functions such as tanh can be written using Morph. For example, starting with the parameters that reproduce the Sigmoid, it is possible to write tanh in terms of the same construction.
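This step presumably relies on the standard identity relating tanh and the logistic Sigmoid (the exact Morph parameters involved are summarized in Table 2 and are not reproduced here):
\[
\tanh(x) \;=\; 2\,\sigma(2x) - 1 \;\approx\; 2\,\mathrm{PolySigmoid}(2x) - 1 .
\]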
Table 2 summarizes important cases of Morph that allow us to obtain shapes that sufficiently approximate existing activation functions. Morph not only allows us to obtain good approximations, but also provides formulas for piecewise polynomial versions of activation functions, which could improve computational efficiency.
Future work includes proposing variants of Morph that maintain flexibility without increasing the number of parameters. The aim is to progress towards the construction of a more general formula that unifies as many adaptive activation functions as possible, with properties useful in machine learning.