Abstract
This work determines the rate of pointwise and uniform convergence to the unit operator of the “normalized cusp neural network operators”. The cusp is a compactly supported activation function obtained as the composition of two general activation functions whose domain is the whole real line. These convergences are quantified via the modulus of continuity of the engaged function or of its derivative, in the form of Jackson-type inequalities. The composition of activation functions aims at more flexible and powerful neural networks, introducing for the first time the reduction of infinite domains to a single domain of compact support.
Keywords:
neural network approximation; cusp activation function; modulus of continuity; reduction of domain
MSC:
41A17; 41A25; 41A30; 41A36
1. Introduction
From AI and computer science, we have the following: in essence, composing activation functions in neural networks offers the advantage of potentially tailoring the network’s ability to learn and model complex, non-linear relationships in data. Here is a breakdown of the potential benefits:
- Enhanced Capacity for Complex Modeling:
- Diversification of Non-linearity: Different activation functions have different characteristics. For example, ReLU introduces sparsity, while Sigmoid squashes values into a range. By composing them, the network can potentially learn a wider variety of non-linear transformations and capture more intricate patterns in the data.
- Improved Training Dynamics:
- Mitigating Gradient Problems: Activation functions influence gradient flow during training. Using different activation functions can potentially help address issues like vanishing or exploding gradients, which hinder learning in deep networks.
- Faster Convergence: Certain activation functions, like ReLU, can accelerate the convergence of the training process compared to others like Sigmoid or Tanh. Combining different functions can potentially lead to faster training and competitive performance.
- Enhanced Generalization and Robustness:
- Better Generalization: By learning richer representations of the data through diverse activation functions, the network’s ability to generalize well to unseen data improves, reducing the risk of overfitting.
- Increased Robustness: Networks with carefully chosen activation functions can handle variations in input data more effectively, adapting to noise, missing data, or unexpected perturbations.
- Adaptation to Input Characteristics:
- Handling Diverse Data: Different activation functions can be suited to different data characteristics. For instance, tanh can be useful when dealing with data containing both positive and negative values.
- Potential for Architectural Interpretability:
- Insight into Learning: By using distinct activation functions, different parts of the network might become responsible for capturing specific features, which can potentially offer insights into how the model learns.
In summary, composing activation functions potentially allows for a more flexible and powerful neural network capable of
- Learning more complex patterns.
- Faster and more stable training.
- Better generalization to new data.
- Greater adaptability to diverse data.
Attention: While composing activation functions can offer benefits, it’s important to choose them judiciously and with consideration for the specific problem at hand, as some combinations might not be beneficial or could even lead to unwanted behaviors like exploding gradients. Empirical testing and validation are crucial when exploring different activation function compositions.
The author, greatly inspired and motivated by [1], pioneered quantitative neural network approximation, see [2], and has since published numerous papers and books, e.g., see [3].
In this article, we continue this trend.
In mathematical neural network approximation, AMS MathSciNet lists no articles related to the composition of activation functions, so this article is the first of its kind.
By composing activation functions, we pursue the benefits described in the first, extensive part of this introduction; most notably, this composition leads to an activation function of compact support, even though the initial activation functions had an infinite domain, the whole real line.
The resulting activation function is an open cusp of compact support. Our activation functions are very general, and the constructed neural network operators resemble the squashing operators of [2,3], as do the produced quantitative results.
As a result, the derived convergence inequalities are much simpler and more elegant.
Of great inspiration are the articles [4,5,6]. References [7,8,9] are foundational. Finally, references [10,11,12,13] represent recent important works.
2. Basics
Let h₁, h₂ : ℝ → (−1, 1) be general sigmoid activation functions, such that they are strictly increasing, with hᵢ(x) → 1 as x → +∞ and hᵢ(x) → −1 as x → −∞, i = 1, 2. Also, each hᵢ is strictly convex over (−∞, 0] and strictly concave over [0, +∞), with hᵢ(0) = 0.
Clearly, is strictly increasing and , and
that is
Furthermore,
Next, acting over let . Then, by convexity of there we have
and
i.e.,
So that is convex over
Similarly, over , we get: let . Then, by concavity of there we have
and
Therefore is concave over
Also, it is
So is a sigmoid activation function.
Next we consider the function
We observe that
that is
So can serve as a density function in general.
So we have h₂ : ℝ → (−1, 1), h₁|(−1,1) : (−1, 1) → (−1, 1), and the strictly increasing function H := h₁|(−1,1) ∘ h₂ : ℝ → (−1, 1), with the graph of H containing an arc of finite length, such that H(0) = 0, starting at (−1, h₁(h₂(−1))) and terminating at (1, h₁(h₂(1))). We call this arc also H. In particular, H is negative and convex over (−1, 0], and it is positive and concave over [0, 1).
So it has compact support [−1, 1] and it is like a squashing function, see [3], Ch. 1, p. 8.
From now on we will work with |H|, whose graph is a cusp joining the points (−1, |h₁(h₂(−1))|), (0, 0), and (1, h₁(h₂(1))), and which again has compact support [−1, 1]; all three of these points belong to the graph of |H|.
Typically, H has a steeper slope than that of h₂, but it is flatter and closer to the x-axis than h₂ is; e.g., tanh(tanh x) has asymptotes ±0.76, while tanh x has asymptotes ±1 (notice that tanh(1) ≈ 0.76). Clearly, H has applications in spiking neural networks.
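To make the construction concrete, here is a minimal numerical sketch, assuming the specific (illustrative) choice h₁ = h₂ = tanh; the names H and cusp below are ours and are not part of the formal development.

```python
import numpy as np

# Illustrative assumption: h1 = h2 = tanh, a standard sigmoid with range (-1, 1).
h1 = np.tanh
h2 = np.tanh

def H(x):
    """Composition h1 o h2, kept only on the compact support [-1, 1] (zero outside)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, h1(h2(x)), 0.0)

def cusp(x):
    """The cusp activation |H|: non-negative, supported on [-1, 1], with a corner at 0."""
    return np.abs(H(x))

if __name__ == "__main__":
    xs = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
    print(np.round(cusp(xs), 4))          # 0 outside [-1, 1] and 0 at the origin
    print(round(float(h1(h2(1.0))), 4))   # endpoint height ~ 0.6421 < tanh(1) ~ 0.7616
```

Swapping h₁ or h₂ for another sigmoid with range (−1, 1) changes the endpoint heights |h₁(h₂(±1))| but preserves the compact support and the cusp at the origin.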
3. Background
Here we consider functions that are either continuous and bounded, or uniformly continuous.
The first modulus of continuity is given by
ω₁(f, δ) := sup{ |f(x) − f(y)| : x, y ∈ ℝ, |x − y| ≤ δ }, δ > 0.
Here we have that ω₁(f, δ) < ∞ for bounded f, and that ω₁(f, δ) → 0 as δ → 0 iff f is uniformly continuous.
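For readers who want to experiment numerically, the following is a minimal sketch (our own illustration) that estimates ω₁(f, δ) on a uniform grid; the interval [−5, 5], the grid size, and the test function are arbitrary assumptions.

```python
import numpy as np

def omega1(f, delta, a=-5.0, b=5.0, m=2001):
    """Grid estimate of the first modulus of continuity of f on [a, b].

    Approximates sup{|f(x) - f(y)| : |x - y| <= delta} by restricting x, y
    to m equally spaced points; a, b, m are illustrative choices.
    """
    xs = np.linspace(a, b, m)
    fx = f(xs)
    h = xs[1] - xs[0]
    w = int(np.floor(delta / h))          # largest index offset with |x - y| <= delta
    best = 0.0
    for k in range(1, w + 1):
        best = max(best, float(np.max(np.abs(fx[k:] - fx[:-k]))))
    return best

if __name__ == "__main__":
    f = np.sin                              # uniformly continuous test function
    for d in (1.0, 0.5, 0.25, 0.125):
        print(d, round(omega1(f, d), 4))    # values decrease toward 0 as delta -> 0
```

For the uniformly continuous choice f = sin, the printed values decrease toward 0 as δ → 0, in line with the remark above.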
In this article, we study the pointwise and uniform convergence, with rates, over the real line, to the unit operator, of the “normalized cusp neural network operators”,
where and , .
Notice is a positive linear operator with .
The terms in the ratio of sums (1) can be non-negative and make sense, iff , i.e., iff
In order to have the desired order of numbers
it is sufficient to assume that
When , , it is enough to assume , which implies (3), and
But the unique case contributes nothing and can be ignored.
Thus, without loss of generality, we can always take that
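Since the operators are of ratio-of-sums type, a hedged numerical sketch may help; the concrete form below (nodes k/n for k = −n², …, n², scaling n^(1−α), and α = 1/2) is only one common choice from the quantitative approximation literature and is not claimed to reproduce definition (1).

```python
import numpy as np

def cusp(x):
    """Illustrative cusp |tanh(tanh x)|, restricted to the compact support [-1, 1]."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, np.abs(np.tanh(np.tanh(x))), 0.0)

def G(f, x, n, alpha=0.5):
    """Ratio-of-sums (normalized) operator built from the cusp.

    Assumed form: sum_k f(k/n) cusp(n**(1-alpha) * (x - k/n)) / sum_k cusp(...),
    with k = -n**2, ..., n**2 and alpha in (0, 1).  These choices are
    illustrative and need not coincide with definition (1).
    """
    k = np.arange(-n * n, n * n + 1)
    nodes = k / n
    weights = cusp(n ** (1.0 - alpha) * (x - nodes))
    s = weights.sum()
    if s == 0.0:                       # no active node near x; the ratio is undefined
        return float("nan")
    return float(np.dot(f(nodes), weights) / s)

if __name__ == "__main__":
    f = np.sin
    x = 0.3
    for n in (5, 10, 20, 40):
        print(n, round(abs(G(f, x, n) - f(x)), 6))   # the error generally shrinks with n
```

In this sketch only the nodes with |x − k/n| ≤ n^(α−1) contribute, so the weighted average localizes around x as n grows, which is what drives convergence with rates of the kind quantified below.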
Proposition 1
([2]). Let , . Let be the maximum number of integers contained in . Then
Note 1.
We would like to establish a lower bound on over the interval . By Proposition 1, we get that
We obtain , iff , which is always true.
So to have the desired order and over it is enough to consider
Also notice that , as
Denote by ⌊·⌋ the integral part of a number and by ⌈·⌉ its ceiling.
Thus, it is clear that
, and
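As a small self-check of the counting flavor of Proposition 1 and Note 1 (our own illustration, with interval endpoints chosen at random), the number of integers in a closed interval [a, b], a ≤ b, equals ⌊b⌋ − ⌈a⌉ + 1 and always lies between b − a − 1 and b − a + 1:

```python
import math
import random

def count_integers(a, b):
    """Number of integers k with a <= k <= b (assumes a <= b)."""
    return math.floor(b) - math.ceil(a) + 1

if __name__ == "__main__":
    random.seed(0)
    for _ in range(5):
        a = random.uniform(-10.0, 10.0)
        b = a + random.uniform(0.0, 10.0)
        c = count_integers(a, b)
        # Elementary bounds: b - a - 1 <= count <= b - a + 1.
        assert b - a - 1 <= c <= b - a + 1
        print(round(a, 3), round(b, 3), c)
```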
4. Main Results
Next come our first main results.
Theorem 1.
Let , , and let f be either continuous and bounded, or uniformly continuous. Then
where is the first modulus of continuity of f. Hence, , pointwise, given f is uniformly continuous.
When , we obtain
Hence , uniformly over , given that f is uniformly continuous.
Proof.
□
We continue with our second main result.
Theorem 2.
Proof.
With Taylor’s formula, we have
Call
Hence
Thus
where
So that
And hence
Next we estimate
where
and
where
The last part of inequality (18) comes from the following:
(i) Let then
i.e., when we get
Corollary 1
(to Theorem 2). It holds
Corollary 2
(to Theorem 1). Let , , and be such that , . Consider . Then
By (24), we derive the convergence of to f with rates, given f is uniformly continuous.
We finish with
Corollary 3
(to Theorem 2). In the assumptions of Theorem 2 and Corollary 2 we have
By (25) we derive again the convergence of to f with rates.
Funding
This research received no external funding.
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Conflicts of Interest
The author declares no conflicts of interest.
References
- Cardaliaguet, P.; Euvrard, G. Approximation of a function and its derivative with a neural network. Neural Netw. 1992, 5, 207–220. [Google Scholar] [CrossRef]
- Anastassiou, G.A. Rate of convergence of some neural network operators to the unit—univariate case. J. Math. Anal. Appl. 1997, 212, 237–262. [Google Scholar] [CrossRef]
- Anastassiou, G.A. Intelligent Systems II: Complete Approximation by Neural Network Operators; Springer: Heidelberg, Germany; New York, NY, USA, 2016. [Google Scholar]
- Chen, Z.; Cao, F. The approximation operators with sigmoidal functions. Comput. Math. Appl. 2009, 58, 758–765. [Google Scholar] [CrossRef]
- Costarelli, D.; Spigler, R. Approximation results for neural network operators activated by sigmoidal functions. Neural Netw. 2013, 44, 101–106. [Google Scholar] [CrossRef] [PubMed]
- Costarelli, D.; Spigler, R. Multivariate neural network operators with sigmoidal activation functions. Neural Netw. 2013, 48, 72–77. [Google Scholar] [CrossRef] [PubMed]
- Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd ed.; Prentice Hall: New York, NY, USA, 1998. [Google Scholar]
- McCulloch, W.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
- Mitchell, T.M. Machine Learning; WCB-McGraw-Hill: New York, NY, USA, 1997. [Google Scholar]
- Yu, D.S.; Cao, F.L. Construction and approximation rate for feed-forward neural network operators with sigmoidal functions. J. Comput. Appl. Math. 2025, 453, 116150. [Google Scholar] [CrossRef]
- Cen, S.; Jin, B.; Quan, Q.; Zhou, Z. Hybrid neural-network FEM approximation of diffusion coefficient in elliptic and parabolic problems. IMA J. Numer. Anal. 2024, 44, 3059–3093. [Google Scholar] [CrossRef]
- Coroianu, L.; Costarelli, D.; Natale, M.; Pantiş, A. The approximation capabilities of Durrmeyer-type neural network operators. J. Appl. Math. Comput. 2024, 70, 4581–4599. [Google Scholar] [CrossRef]
- Warin, X. The GroupMax neural network approximation of convex functions. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 11608–11612. [Google Scholar] [CrossRef] [PubMed]