4.1. A Smooth Activation Function Is Better
Research by Hayou et al. [3] indicates that the smoothness of an activation function is a key factor influencing the effective propagation of information in deep neural networks. A sufficient condition for an activation function to be smooth is that its second-order derivative can be piecewise represented as a sum of continuous functions. For smooth activation functions, the correlation between inter-layer neuron outputs converges to 1 at a rate of $O(1/l)$, with the specific formula being $1 - c_l \approx \frac{1}{\beta l}$, where $c_l$ represents the correlation, $l$ is the number of network layers, and $\beta$ is a coefficient determined by the target variance $q$ and the activation function $f$. In contrast, for non-smooth functions like ReLU, this correlation converges at a rate of $O(1/l^2)$, with the specific relationship being $1 - c_l \approx \frac{9\pi^2}{2 l^2}$. This research reveals that in deep networks employing non-smooth activation functions, the correlation of neuron outputs rapidly approaches 1 as the network depth increases. This high degree of correlation impedes the effective propagation of information, leading to unstable gradients and diminished expressive power, ultimately impairing model performance. Intuitively, the non-differentiability of functions like ReLU at the origin acts as a "kink" that introduces abrupt changes in the gradient flow. In deep networks, the accumulation of these sharp transitions can rapidly degrade the input–output relationship, causing information to be lost as it propagates through layers. Smoother activations, by contrast, offer a continuous transition of derivatives. This preserves the local geometry of the data manifold and ensures that gradients vary more predictably, thereby maintaining stronger signal correlations even in very deep architectures. Therefore, selecting activation functions that meet specific smoothness requirements is an important strategy for enhancing both the training efficiency and the final performance of deep learning models.
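The depth dependence of these correlations can be observed empirically. The following is an illustrative simulation of our own (not an experiment from the paper): two partially correlated inputs are pushed through wide random networks, and the cosine similarity of their hidden representations is tracked layer by layer for the non-smooth ReLU versus the smooth tanh. Width, depth, the initial correlation, and the initialization scales (He-style for ReLU, unit variance for tanh) are arbitrary demo choices.

```python
import numpy as np

# Track the correlation between the hidden representations of two inputs
# as depth grows, comparing ReLU with a smooth activation (tanh).
rng = np.random.default_rng(0)
width, depth = 1024, 40

def correlation_by_depth(act, sigma_w2):
    x1 = rng.standard_normal(width)
    x2 = 0.5 * x1 + np.sqrt(0.75) * rng.standard_normal(width)  # corr ~ 0.5
    h1, h2, corrs = x1, x2, []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(sigma_w2 / width)
        h1, h2 = act(W @ h1), act(W @ h2)
        corrs.append(float(np.dot(h1, h2) /
                           (np.linalg.norm(h1) * np.linalg.norm(h2))))
    return corrs

relu_corrs = correlation_by_depth(lambda z: np.maximum(z, 0.0), sigma_w2=2.0)
tanh_corrs = correlation_by_depth(np.tanh, sigma_w2=1.0)
print(relu_corrs[-1], tanh_corrs[-1])
```

In runs of this sketch, the ReLU correlations race toward 1 with depth while the tanh correlations stay close to their initial value, matching the qualitative picture above.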
4.2. Smoothing Kernel Function
To smooth non-smooth activation functions, we can use the mollification method. This approach involves smoothing the target function by convolving it with a specific kernel. The goal is to create a smooth approximation of the original function without sacrificing its valuable attributes.
Definition 1. We define the smoothing kernel $\rho$ as follows:
$$\rho(x) = \begin{cases} A\,\exp\!\left(\dfrac{1}{x^2 - 1}\right), & |x| < 1, \\[4pt] 0, & |x| \ge 1. \end{cases}$$
Here, the constant $A$ is defined as
$$A = \left( \int_{-1}^{1} \exp\!\left(\frac{1}{x^2 - 1}\right) dx \right)^{-1}.$$
Proposition 1. The smoothing kernel is normalized, i.e., $\int_{\mathbb{R}} \rho(x)\,dx = 1$.
Proposition 2. $\rho \in C^\infty(\mathbb{R})$, i.e., $\rho$ is infinitely differentiable on $\mathbb{R}$, and $\operatorname{supp} \rho = [-1, 1]$. Furthermore, for each natural number $k$, the derivative $\rho^{(k)}$ is bounded on $\mathbb{R}$.
Detailed proofs of Propositions 1 and 2 can be found in
Appendix A.
Definition 2. For any $\delta > 0$, we define a family of smoothing kernels generated by scaling the smoothing kernel $\rho$ as follows:
$$\rho_\delta(x) = \frac{1}{\delta}\,\rho\!\left(\frac{x}{\delta}\right).$$
This family of functions has several important properties. Each $\rho_\delta$ is normalized, maintaining $\int_{\mathbb{R}} \rho_\delta(x)\,dx = 1$, and its support is scaled to the interval $[-\delta, \delta]$. As $\delta \to 0^+$, the family converges to the Dirac delta function in the sense of distributions. We can now employ this framework to smooth the target activation function.
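As a numerical sketch of Definitions 1 and 2 (assuming the classical bump kernel written above), the constant $A$ can be obtained by quadrature, and each scaled kernel $\rho_\delta$ keeps unit mass while its support shrinks to $[-\delta, \delta]$:

```python
import numpy as np

# Bump kernel rho (Definition 1) and its scaled family rho_delta
# (Definition 2), checked numerically for normalization and support.
def bump_unnormalized(x):
    out = np.zeros_like(x)
    inside = np.abs(x) < 1.0
    out[inside] = np.exp(1.0 / (x[inside] ** 2 - 1.0))
    return out

xs = np.linspace(-1.0, 1.0, 200_001)
dx = xs[1] - xs[0]
A = 1.0 / (bump_unnormalized(xs).sum() * dx)          # ~ 2.2523

def rho_delta(x, delta):
    return A * bump_unnormalized(x / delta) / delta

masses = {}
for delta in (1.0, 0.5, 0.1):
    grid = np.linspace(-delta, delta, 200_001)
    masses[delta] = float(rho_delta(grid, delta).sum() * (grid[1] - grid[0]))
print(A, masses)
```

Each reported mass stays at 1 as $\delta$ shrinks, while evaluating $\rho_\delta$ outside $[-\delta, \delta]$ returns exactly 0.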
4.3. Smoothness of the Mollified Activation Function
Definition 3. Let $f$ be a locally integrable activation function. The convolution of $f$ with the smoothing kernel $\rho_\delta$ is called a smoothing of $f$, denoted by $f_\delta$ or $f * \rho_\delta$, and defined as:
$$f_\delta(x) = (f * \rho_\delta)(x) = \int_{\mathbb{R}} f(y)\,\rho_\delta(x - y)\,dy.$$
Remark 1. It is important to clarify that while the smoothed activation is defined via a convolution integral, this operation is performed analytically during the derivation phase (as shown later in Section 4.6). Consequently, the actual implementation relies on the resulting closed-form polynomial expression, avoiding the need for expensive numerical integration during training or inference. Thus, the convolution-based definition provides theoretical rigor without imposing a runtime computational burden.

Let $F(x, y) = f(y)\,\rho_\delta(x - y)$. Taking the $k$-th partial derivative with respect to $x$, we get:
$$\frac{\partial^k F}{\partial x^k}(x, y) = f(y)\,\rho_\delta^{(k)}(x - y). \quad (1)$$
This partial derivative exists for any $(x, y) \in \mathbb{R}^2$, because $\rho_\delta$ is a $C^\infty$ function according to Proposition 2. Let $x_0 \in \mathbb{R}$ be a fixed, arbitrary point. We will consider values of $x$ within a neighborhood of $x_0$, say $|x - x_0| \le 1$. The function $\rho_\delta^{(k)}$ is continuous and has compact support, which implies that it is bounded on $\mathbb{R}$. Let
$$M_k = \sup_{t \in \mathbb{R}} \left| \rho_\delta^{(k)}(t) \right|.$$
The support of the integrand with respect to $y$ is the interval $[x - \delta,\, x + \delta]$. For any $x$ in the chosen neighborhood, this interval is contained in the larger compact set $K = [x_0 - \delta - 1,\; x_0 + \delta + 1]$. From Equation (1), we obtain the following bound:
$$\left| \frac{\partial^k F}{\partial x^k}(x, y) \right| \le M_k\,|f(y)|\,\chi_K(y) =: g_k(y),$$
where $\chi_K$ is the characteristic function of the set $K$; its specific form is as follows:
$$\chi_K(y) = \begin{cases} 1, & y \in K, \\ 0, & y \notin K. \end{cases}$$
Since $f$ is locally integrable, it is integrable on the compact set $K$. This implies that the dominating function $g_k$ is integrable. This argument shows that for any order $k$, the control conditions required for the differential operator $\partial^k / \partial x^k$ are satisfied. According to the Lebesgue Dominated Convergence Theorem, we can move any order differential operator into the integral sign. Therefore, we can directly calculate the $k$-th derivative of $f_\delta$:
$$f_\delta^{(k)}(x) = \int_{\mathbb{R}} f(y)\,\rho_\delta^{(k)}(x - y)\,dy. \quad (2)$$
This implies that the $k$-th derivative of $f_\delta$ exists for any positive integer $k$. Let $(x_n)$ be an arbitrary sequence with $x_n \to x$. Since the function $\rho_\delta^{(k)}$ is continuous with compact support, it is uniformly continuous on $\mathbb{R}$. Considering an arbitrary $y \in \mathbb{R}$, we then have
$$\lim_{n \to \infty} \rho_\delta^{(k)}(x_n - y) = \rho_\delta^{(k)}(x - y).$$
Then we establish a dominating function for the integrand. By the triangle inequality,
$$\left| f(y)\,\rho_\delta^{(k)}(x_n - y) \right| \le M_k\,|f(y)|\,\chi_K(y)$$
for all $n$ large enough that $|x_n - x| \le 1$. Due to the uniform continuity of $\rho_\delta^{(k)}$, when $|x_n - x| \to 0$, we have
$$\sup_{y \in \mathbb{R}} \left| \rho_\delta^{(k)}(x_n - y) - \rho_\delta^{(k)}(x - y) \right| \to 0. \quad (3)$$
Now we can apply Equation (3) and the Dominated Convergence Theorem to Equality (2):
$$\lim_{n \to \infty} f_\delta^{(k)}(x_n) = \int_{\mathbb{R}} f(y)\,\lim_{n \to \infty} \rho_\delta^{(k)}(x_n - y)\,dy = \int_{\mathbb{R}} f(y)\,\rho_\delta^{(k)}(x - y)\,dy = f_\delta^{(k)}(x).$$
Thus, we have proven that for any $k$, the derivative $f_\delta^{(k)}$ is continuous, which implies $f_\delta \in C^\infty(\mathbb{R})$. This establishes that the mollification process transforms the original activation function $f$ into a smooth approximation $f_\delta$.
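The differentiation-under-the-integral step can be sanity-checked numerically. The sketch below (our own discretization; the bump kernel of Definition 1 is assumed) mollifies ReLU and compares the derivative obtained by moving $d/dx$ inside the integral (Equation (2) with $k = 1$, i.e., weighting $\mathrm{ReLU}'$ by the kernel) against a finite-difference derivative; it also confirms that the mollification leaves ReLU untouched outside $[-\delta, \delta]$:

```python
import numpy as np

# Discretized mollification of ReLU with the bump kernel rho_delta.
def bump(s):
    out = np.zeros_like(s)
    m = np.abs(s) < 1.0
    out[m] = np.exp(1.0 / (s[m] ** 2 - 1.0))
    return out

delta = 0.5
u = np.linspace(-delta, delta, 20_001)
w = bump(u / delta)
w /= w.sum()                      # discrete weights of rho_delta(u) du

def f_delta(x):                   # (ReLU * rho_delta)(x)
    return float(np.sum(np.maximum(x - u, 0.0) * w))

def f_delta_prime(x):             # d/dx moved inside the integral
    return float(np.sum((x - u > 0.0) * w))

h = 1e-4
fd = (f_delta(0.2 + h) - f_delta(0.2 - h)) / (2 * h)
print(f_delta(1.0), f_delta(-1.0), f_delta_prime(0.2), fd)
```

The two derivative estimates agree, and $f_\delta(x) = x$ for $x \ge \delta$ and $f_\delta(x) = 0$ for $x \le -\delta$, since the symmetric kernel has zero mean.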
4.4. Approximation Properties of the Mollified Activation Function
While achieving smoothness, a crucial question arises: to what degree does the new function $f_\delta$ retain the core properties of the original function $f$? Have we “distorted” the essence of the activation function in pursuit of smoothness? Let us address these questions next.
To analyze the approximation error, we start with the term $f(x)$. Since the smoothing kernel integrates to 1, we can rewrite $f(x)$ in the following form:
$$f(x) = \int_{\mathbb{R}} f(x)\,\rho_\delta(u)\,du. \quad (4)$$
For the term $f_\delta(x)$, recall its definition and perform a substitution by letting $u = x - y$, which implies $y = x - u$ and $dy = -du$. Thus,
$$f_\delta(x) = \int_{\mathbb{R}} f(y)\,\rho_\delta(x - y)\,dy = \int_{\mathbb{R}} f(x - u)\,\rho_\delta(u)\,du. \quad (5)$$
Combining Equations (4) and (5), we get:
$$f_\delta(x) - f(x) = \int_{\mathbb{R}} \big( f(x - u) - f(x) \big)\,\rho_\delta(u)\,du.$$
According to the triangle inequality for integrals and the fact that $\rho_\delta \ge 0$:
$$\left| f_\delta(x) - f(x) \right| \le \int_{-\delta}^{\delta} \left| f(x - u) - f(x) \right| \rho_\delta(u)\,du.$$
We consider the activation function $f$ to be continuous. Let $D \subset \mathbb{R}$ be an arbitrary compact set. Since $|u| \le \delta \le 1$, both $x$ and $x - u$ lie within the larger compact set $D_1 = \{\, t \in \mathbb{R} : \operatorname{dist}(t, D) \le 1 \,\}$. By the Heine–Cantor Theorem, a function that is continuous on a compact set is also uniformly continuous. Therefore, $f$ is uniformly continuous on $D_1$. By the definition of uniform continuity, for any $\varepsilon > 0$ given at the beginning, there must exist a $\delta_0 > 0$ such that whenever $|u| < \delta_0$, $|f(x - u) - f(x)| < \varepsilon$ holds for all $x \in D$. We choose our smoothing parameter $\delta$ to be less than the $\delta_0$ given by uniform continuity, i.e., $\delta < \delta_0$. In this way, for all $u$ in our integral expression, we have $|u| \le \delta < \delta_0$. Therefore, for these $u$, applying the property of uniform continuity yields the following:
$$\left| f(x - u) - f(x) \right| < \varepsilon. \quad (6)$$
Substituting Equation (6) back into our integral:
$$\left| f_\delta(x) - f(x) \right| \le \int_{-\delta}^{\delta} \varepsilon\,\rho_\delta(u)\,du = \varepsilon.$$
In summary, we have proven that for any $\varepsilon > 0$, there exists a $\delta_0 > 0$ such that whenever $0 < \delta < \delta_0$, $|f_\delta(x) - f(x)| < \varepsilon$ holds for all $x \in D$. From this, we can conclude the uniform convergence of the smoothed activation function $f_\delta$ to the original activation function $f$ on the set $D$.
By choosing a sufficiently small smoothing parameter $\delta$, we can make the smoothed activation function $f_\delta$ approximate the original function $f$ arbitrarily accurately. Uniform convergence plays a key role by ensuring that the smoothed activation function maintains the desirable characteristics of the original. It also ensures that the maximum error between the two functions approaches zero over any input interval of interest, thus avoiding unexpected deviations in specific regions.
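This uniform-convergence claim can be sketched numerically. Below (our own discretization, with the bump kernel of Definition 1 and $f = \mathrm{ReLU}$ on the compact set $D = [-2, 2]$), the worst-case gap $\max_{x \in D} |f_\delta(x) - f(x)|$ shrinks as $\delta$ shrinks:

```python
import numpy as np

# Sup-norm gap between ReLU and its mollification for decreasing delta.
def bump(s):
    out = np.zeros_like(s)
    m = np.abs(s) < 1.0
    out[m] = np.exp(1.0 / (s[m] ** 2 - 1.0))
    return out

def sup_error(delta, xs):
    u = np.linspace(-delta, delta, 4001)
    w = bump(u / delta)
    w /= w.sum()                  # discrete weights of rho_delta(u) du
    f_delta = np.array([np.sum(np.maximum(x - u, 0.0) * w) for x in xs])
    return float(np.max(np.abs(f_delta - np.maximum(xs, 0.0))))

xs = np.linspace(-2.0, 2.0, 2001)
errs = [sup_error(d, xs) for d in (0.4, 0.2, 0.1)]
print(errs)
```

For ReLU the worst-case gap sits at the kink $x = 0$ and scales linearly in $\delta$, so halving $\delta$ halves the sup-norm error.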
4.5. Lipschitz Continuity Analysis
In the field of deep learning, an abstract mathematical concept—Lipschitz continuity—is increasingly becoming a key factor in building more reliable, robust, and generalizable neural network models. From theoretical analysis to practical applications, Lipschitz continuity provides a powerful tool for deeply understanding and effectively controlling the behavior of deep networks. Lipschitz continuity offers a stronger condition than standard continuity by constraining a function’s maximum rate of change. To understand this property, we will start with its formal definition.
Definition 4 (Lipschitz Continuous Function [29]). A function $g$ with domain $D \subseteq \mathbb{R}^n$ is said to be $C$-Lipschitz continuous with respect to an $\alpha$-norm for some constant $C \ge 0$ if the following condition holds for all $x_1, x_2 \in D$: $\|g(x_1) - g(x_2)\|_\alpha \le C\,\|x_1 - x_2\|_\alpha$.

In our analysis, we concentrate on the smallest value of $C$ that satisfies the aforementioned condition. This value is formally defined as the Lipschitz constant. Lipschitz continuity of the activation function is crucial for ensuring well-behaved optimization, thereby promoting efficient convergence during training. Ensuring model stability by controlling the Lipschitz property is an effective way to prevent gradient runaway [30,31,32,33]. The smaller the Lipschitz constant, the more stable the training gradients [29,34].
From the previous derivation, we know that the support set of the smoothing kernel function $\rho_\delta$ is $[-\delta, \delta]$. The support set of a function's derivative must be contained within the support set of the original function. Therefore, the support set of $\rho_\delta'$ is also contained in $[-\delta, \delta]$. So we have
$$\left| f_\delta'(x) \right| = \left| \int_{\mathbb{R}} f(y)\,\rho_\delta'(x - y)\,dy \right| \le \int_{x - \delta}^{x + \delta} |f(y)|\,\left| \rho_\delta'(x - y) \right| dy.$$
A continuous function $f$ must be bounded within the closed interval $[x - \delta, x + \delta]$, so we can find a constant $M$ that satisfies the condition $|f(y)| \le M$ for all $y \in [x - \delta, x + \delta]$. Substituting into the above equation and simplifying:
$$\left| f_\delta'(x) \right| \le M \int_{-\delta}^{\delta} \left| \rho_\delta'(u) \right| du. \quad (7)$$
Now let us calculate the integral $\int_{-\delta}^{\delta} |\rho_\delta'(u)|\,du$. We know that $\rho_\delta'(u) = \frac{1}{\delta^2}\rho'\!\left(\frac{u}{\delta}\right)$; by the variable substitution $v = u/\delta$:
$$\int_{-\delta}^{\delta} \left| \rho_\delta'(u) \right| du = \frac{1}{\delta} \int_{-1}^{1} \left| \rho'(v) \right| dv. \quad (8)$$
The value of the integral $\int_{-1}^{1} |\rho'(v)|\,dv$ is completely determined by the kernel function $\rho$ we initially chose. It is a constant that is independent of $f$, $\delta$, and $x$; we denote this constant as $C_\rho$. Combining Equations (7) and (8), we can conclude that
$$\left| f_\delta'(x) \right| \le \frac{M C_\rho}{\delta}.$$
This shows that the Lipschitz constant of $f_\delta$, $\mathrm{Lip}(f_\delta)$, satisfies $\mathrm{Lip}(f_\delta) \le \frac{M C_\rho}{\delta}$. An increase in the value of $\delta$ leads to a broader refinement range for the original activation function, consequently enhancing its overall smoothness. Additionally, increasing the value of $\delta$ lowers the Lipschitz upper bound of the smoothed activation function, thereby promoting a more stable training gradient. Therefore, we can conclude: the higher the smoothness, the higher the training gradient stability, and there is a quantitative relationship between the two, where the specific quantitative relationship is determined by the constants $M$ and $C_\rho$. This conclusion provides a principled guiding framework for the design of activation functions in the future.
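The bound $\mathrm{Lip}(f_\delta) \le M C_\rho / \delta$ can be checked numerically. The sketch below (our own discretization; the bump kernel of Definition 1 is assumed, and the grid sizes are arbitrary demo choices) computes $C_\rho$ by quadrature for $f = \mathrm{ReLU}$ on $[-2, 2]$ and verifies that the observed maximum slope of $f_\delta$ stays below the bound:

```python
import numpy as np

# C_rho = integral of |rho'| over [-1, 1], then the Lipschitz bound check.
def bump(s):
    out = np.zeros_like(s)
    m = np.abs(s) < 1.0
    out[m] = np.exp(1.0 / (s[m] ** 2 - 1.0))
    return out

s = np.linspace(-1.0, 1.0, 200_001)
ds = s[1] - s[0]
A = 1.0 / (bump(s).sum() * ds)                                    # ~ 2.2523
C_rho = float(np.sum(np.abs(np.gradient(A * bump(s), ds))) * ds)  # int |rho'|

delta = 0.5
u = np.linspace(-delta, delta, 4001)
w = bump(u / delta)
w /= w.sum()                                  # discrete rho_delta weights
xs = np.linspace(-2.0, 2.0, 2001)
f_delta = np.array([np.sum(np.maximum(x - u, 0.0) * w) for x in xs])
slope = float(np.max(np.abs(np.diff(f_delta) / np.diff(xs))))

M = 2.0 + delta                  # bound on |ReLU| over [-2-delta, 2+delta]
bound = M * C_rho / delta
print(C_rho, slope, bound)
```

Since the bump kernel rises on $(-1, 0)$ and falls on $(0, 1)$, $C_\rho = 2\rho(0) = 2A e^{-1} \approx 1.66$; the empirical slope of the mollified ReLU (close to 1) sits well inside the bound.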
4.6. Reshaping ReLU (S-ReLU): From ReLU to Better
In
Section 4.1, we provide evidence that smoother activation functions are advantageous for enhancing both the efficiency of the training process and the final performance of the model.
Section 4.3 and
Section 4.4 further establish that activation functions constructed via smoothing theory possess desirable smoothness and approximation properties; specifically, they are infinitely differentiable and can approximate the original activation function arbitrarily closely, thereby preserving its favorable characteristics. Building upon these findings, we next apply the proposed methodology to a concrete instance. In particular, we select the widely used ReLU function as the basis for mollification. Regarding the choice of kernel, we strike a balance between theoretical rigor and engineering practicality by adopting the Epanechnikov kernel, which is theoretically optimal for minimizing the Mean Integrated Squared Error. While the Gaussian kernel is a common choice for smoothing, we select the Epanechnikov kernel for two strategic reasons. First, convolving a ReLU with a Gaussian kernel results in expressions involving the Error Function (erf) or exponentials, which are computationally expensive transcendental operations. In contrast, the Epanechnikov kernel is a polynomial; its convolution with ReLU yields a simple piecewise polynomial (S-ReLU), which can be computed using only basic arithmetic operations. Second, the Epanechnikov kernel has a finite support
$[-\delta, \delta]$. This ensures that the smoothing effect is strictly local—S-ReLU coincides exactly with ReLU (preserving its properties) outside this interval. Gaussian smoothing, with its infinite support, alters the function over the entire domain, theoretically sacrificing the strict linearity and sparsity of the original ReLU.
The standard ReLU activation function computes the maximum of zero and its input, as given by $\mathrm{ReLU}(x) = \max(0, x)$. We select the Epanechnikov kernel function, denoted as $K_\delta$, parameterized by the smoothing radius $\delta > 0$. This kernel acts as a weighting function with compact support on the interval $[-\delta, \delta]$. Its normalized form is:
$$K_\delta(u) = \begin{cases} \dfrac{3}{4\delta}\left(1 - \dfrac{u^2}{\delta^2}\right), & |u| \le \delta, \\[4pt] 0, & |u| > \delta. \end{cases}$$
Applying the preceding theory yields the smoothed activation function S-ReLU:
$$\text{S-ReLU}_\delta(x) = (\mathrm{ReLU} * K_\delta)(x) = \begin{cases} 0, & x \le -\delta, \\[2pt] \dfrac{-x^4 + 6\delta^2 x^2 + 8\delta^3 x + 3\delta^4}{16\delta^3}, & |x| < \delta, \\[2pt] x, & x \ge \delta. \end{cases}$$
Remark 2. While introducing smoothness often incurs a computational premium, the proposed S-ReLU minimizes this overhead through its design. Unlike widely used smooth activations such as GELU [20] or SiLU [18], which rely on computationally expensive transcendental functions, S-ReLU is constructed as a piecewise polynomial function. Within the smoothing interval $[-\delta, \delta]$, it requires only basic arithmetic operations (addition, multiplication) and avoids the latency associated with approximating infinite series. Consequently, S-ReLU maintains a computational efficiency comparable to leaky ReLU variants, making it highly suitable for both large-scale training and latency-sensitive inference scenarios. Differentiating the S-ReLU function twice yields the following (for $|x| < \delta$; both derivatives vanish for $x \le -\delta$ and equal those of the identity for $x \ge \delta$):
$$\text{S-ReLU}_\delta'(x) = \frac{-4x^3 + 12\delta^2 x + 8\delta^3}{16\delta^3}, \qquad \text{S-ReLU}_\delta''(x) = \frac{3\left(\delta^2 - x^2\right)}{4\delta^3} = K_\delta(x).$$
This indicates that S-ReLU is an activation function with continuous second derivatives, which meets the definition of a smooth function in
Section 4.1. We specifically target second-order differentiability ($C^2$) rather than higher-order or infinite smoothness ($C^\infty$) for two pragmatic reasons. First, from an optimization perspective, $C^2$ continuity ensures a well-defined and continuous Hessian matrix, which is sufficient to guarantee stability of curvature-based optimization dynamics and avoid abrupt changes in the loss landscape. Second, achieving $C^\infty$ requires an infinitely differentiable kernel (such as the Gaussian or the bump kernel of Definition 1), whose exponential form leads to computationally expensive transcendental operations. In contrast, restricting smoothness to $C^2$ allows us to utilize the Epanechnikov kernel, yielding a computationally efficient piecewise polynomial form while still satisfying the rigorous requirements for gradient stability. We use the uniform error metric to quantify how well S-ReLU approximates the target function. The calculation result is given as follows:
$$\sup_{x \in \mathbb{R}} \left| \text{S-ReLU}_\delta(x) - \mathrm{ReLU}(x) \right| = \frac{3}{16}\,\delta,$$
with the maximum attained at $x = 0$.
This result clearly demonstrates that, compared to ReLU, the approximation error of S-ReLU is controllable and proportional to the smoothing radius $\delta$. This is a valuable property, as it allows us to precisely control the extent of the approximation error introduced by the smoothing operation by choosing the value of $\delta$. A detailed proof is provided in
Appendix B. Further details regarding S-ReLU are available in the appendices. Specifically,
Appendix C presents the Python-style pseudocode, while
Appendix D contains a more in-depth discussion of its characteristics. Finally, we discuss Lipschitz continuity. Based on the theory in
Section 4.5, we can calculate the Lipschitz constant for each activation function.
Fact 1. The Lipschitz constant of GELU is approximately 1.129; the Lipschitz constant of SiLU is 1.100; the Lipschitz constant of Mish is 1.089; the Lipschitz constant of S-ReLU is 1.000.
Remark 3. The Lipschitz constants presented in Fact 1 reveal a fundamental structural advantage of S-ReLU. Theoretically, an activation function f with a Lipschitz constant $L > 1$ (such as GELU and SiLU) acts as an expansive operator. In deep neural networks, where gradients are computed via the chain rule involving products of layer Jacobians, an expansive activation can cause the gradient magnitude to grow exponentially with depth d (upper bounded by $L^d$), potentially leading to gradient explosion or optimization instability. In contrast, S-ReLU guarantees $L = 1$, making it a strictly non-expansive operator. This ensures that the gradient norm is never mathematically amplified by the activation layer itself ($|\text{S-ReLU}'(x)| \le 1$ for all $x$). This property provides a quantifiable theoretical guarantee for training stability in deep architectures, distinguishing S-ReLU from existing smooth variants.
The detailed proof can be found in
Appendix E. S-ReLU’s Lipschitz constant of 1 ensures that its output never changes more rapidly than its input. This property acts as a vital safeguard for stabilizing gradient flow in the network, which helps prevent the problem of exploding gradients and makes the model more robust to small input perturbations—a clear advantage over other activation functions.