Article

ReLU Neural Networks and Their Training

1 Institute of AI for Industries, Nanjing 211100, China
2 Faculty of Engineering, University of Toyama, Toyama-shi 930-8555, Japan
* Authors to whom correspondence should be addressed.
Mathematics 2026, 14(1), 39; https://doi.org/10.3390/math14010039
Submission received: 24 October 2025 / Revised: 8 December 2025 / Accepted: 17 December 2025 / Published: 22 December 2025
(This article belongs to the Special Issue New Advances and Challenges in Neural Networks and Applications)

Abstract

Among various activation functions, the Rectified Linear Unit (ReLU) has become the most widely adopted due to its computational simplicity and effectiveness in mitigating the vanishing-gradient problem. In this work, we investigate the advantages of employing ReLU as the activation function and establish its theoretical significance. Our analysis demonstrates that ReLU-based neural networks possess the universal approximation property. In addition, we provide a theoretical explanation for the phenomenon of neuron death in ReLU-based neural networks. We further validate the effectiveness of this explanation through empirical experiments.

1. Introduction

In recent years, Artificial Neural Networks (ANNs) have emerged as a central paradigm in the fields of artificial intelligence and machine learning. They have found widespread applications in computer vision [1,2,3], natural language processing [4,5,6], and speech recognition [7,8], where they have achieved outstanding results.
Inspired by biological neural systems, ANNs are composed of multiple interconnected layers of artificial neurons. Each neuron performs a weighted summation of its inputs and passes the result through a nonlinear activation function to produce an output. By progressively stacking layers, neural networks can approximate highly complex nonlinear functions; as a result, ReLU neural networks exhibit powerful function approximation capabilities [9].
In 1986, Rumelhart, Hinton, and Williams introduced the Backpropagation (BP) algorithm [10], which provided an effective means of parameter optimization for neural networks. In 1992, Robert Hecht-Nielsen theoretically proved the convergence of BP neural networks under certain conditions [11], laying a solid theoretical foundation for the reliable training and application of ANNs.

2. ReLU Function and Related Works

As the AI field has continued to advance, it has become clear that the choice of activation function plays a critical role in enhancing the performance of neural networks. Activation functions not only provide neural networks with nonlinear modeling capabilities but also directly influence convergence speed, representational capacity, and generalization ability. Different activation functions can therefore lead to vastly different training dynamics and application outcomes. Statistical analyses of open-source platforms such as GitHub show that the Rectified Linear Unit (ReLU) is currently the most widely used activation function (see Table 1). ReLU is favored for its simplicity, computational efficiency, and effectiveness in mitigating the vanishing-gradient problem, making it the default choice in many neural network architectures.
The function was originally introduced by Professor Zheng Tang under the name ULR function in 1993 [12]. He proposed a multi-valued logic algebra system based on the ULR function and rigorously proved its mathematical completeness, i.e., that any multi-valued logic function can be represented by a combination of ULR functions [13]. Tang and his team conducted extensive theoretical and experimental studies on its potential, proposing a learnable fuzzy network based on the ULR function [14]. Their results have had a far-reaching impact on later developments in complex control tasks such as intelligent manufacturing and autonomous driving. Later, Hinton and collaborators reintroduced the function as ReLU (see Figure 1) and promoted its widespread use across artificial intelligence tasks [15,16]. The ReLU function is popular because it is easy to differentiate, has a simple form, and is less likely to cause vanishing or exploding gradients in deep neural networks.
From ResNet to Transformer, ReLU and its variants have demonstrated exceptional versatility and practical value. Nevertheless, most applications of artificial neural networks are still primarily driven by empirical findings, with their deployment guided more by experimental success than by solid theoretical foundations ensuring generality and feasibility. Against this backdrop, we aim to study the mathematical principles underlying the approximation capability of neural networks using ReLU as the activation function (hereafter referred to as ReLU neural networks). Our goal is to establish a stronger theoretical basis for understanding their remarkable expressive power.

3. Main Result

3.1. Theorems and Corollaries

The first question is how to quantify the expressive power of a ReLU neural network with a mathematical model. For simplicity, we consider a ReLU neural network with an r-dimensional input and a one-dimensional output. We first provide some mathematical definitions to frame this problem. In all the following definitions and formulas, $R(\cdot)$ denotes the ReLU function, i.e., $R(x) = \max\{0, x\}$.
Definition 1. 
1. $\mathcal{A}^r$ is the set of all affine functions from $\mathbb{R}^r$ to $\mathbb{R}$, which means that for every $A \in \mathcal{A}^r$ there exist $w_A \in \mathbb{R}^{1\times r}$ and $b_A \in \mathbb{R}$ such that $A(x) = w_A \cdot x + b_A$ for all $x \in \mathbb{R}^r$.
2. $\mathcal{B}^r$ is the Borel $\sigma$-field in $\mathbb{R}^r$.
3. $C^r$ (resp. $M^r$) is the set of all continuous (resp. Borel measurable) functions from $\mathbb{R}^r$ to $\mathbb{R}$.
4. Let $S$, $T$ be subsets of the metric space $(X, \rho)$. We say $S$ is $\rho$-dense in $T$ when for any $\epsilon > 0$ and all $t \in T$, there is an $s \in S$ such that $\rho(s, t) < \epsilon$.
5. A subset $S$ of $C^r$ is said to be uniformly dense on compact sets in $C^r$ if for every compact subset $K \subset \mathbb{R}^r$, $S$ is $\rho_K$-dense in $C^r$, where $\rho_K(f, g) = \sup_{x \in K} |f(x) - g(x)|$.
6. $\Sigma^r(R) = \{ f \mid f(x) = \sum_{j=1}^{q} \beta_j R(A_j(x)),\ x \in \mathbb{R}^r,\ \beta_j \in \mathbb{R},\ A_j \in \mathcal{A}^r,\ q = 1, 2, 3, \dots \}$.
7. $\Sigma\Pi^r(R) = \{ f \mid f(x) = \sum_{j=1}^{q} \beta_j \prod_{k=1}^{l_j} R(A_{jk}(x)),\ x \in \mathbb{R}^r,\ \beta_j \in \mathbb{R},\ l_j \in \mathbb{N},\ A_{jk} \in \mathcal{A}^r,\ q = 1, 2, 3, \dots \}$.
It is readily seen that $\Sigma^r(R)$ is the mathematical model of a ReLU neural network with r-dimensional input and one-dimensional output.
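Concretely, an element of $\Sigma^r(R)$ is a one-hidden-layer ReLU network: the rows of a weight matrix are the $w_{A_j}$, the biases are the $b_{A_j}$, and $\beta$ holds the output weights. A minimal NumPy sketch (all numeric values are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigma_r_element(x, W, b, beta):
    """f(x) = sum_j beta_j * R(A_j(x)) with A_j(x) = W[j] @ x + b[j]."""
    return beta @ relu(W @ x + b)

# r = 2 inputs, q = 3 hidden neurons (illustrative weights).
W = np.array([[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]])  # rows are the w_{A_j}
b = np.array([0.0, -1.0, 0.5])                        # the b_{A_j}
beta = np.array([1.0, -2.0, 0.5])                     # output weights
print(sigma_r_element(np.array([1.0, 1.0]), W, b, beta))  # -3.0
```

Increasing q enlarges the family; the theorems below quantify what this family can approximate.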
With the definitions and formulas above in hand, let us review previous results on the approximation properties of neural networks. Hornik [9] proved the following conclusions in 1989, establishing the universal approximation property of a general neural network.
Theorem 1. 
For any continuous non-constant function $G$ from $\mathbb{R}$ to $\mathbb{R}$, $\Sigma\Pi^r(G)$ is uniformly dense on compact sets in $C^r$. Moreover, for any continuous non-constant $G$, every $r$, and every probability measure $\mu$ on $(\mathbb{R}^r, \mathcal{B}^r)$, $\Sigma\Pi^r(G)$ is $\rho_\mu$-dense in $M^r$.
According to the theorem above, we obtain two corollaries.
Corollary 1. 
$\Sigma\Pi^r(R)$ is uniformly dense on compact sets in $C^r$.
Corollary 2. 
For every $r \in \mathbb{Z}^+$ and every probability measure $\mu$ on $(\mathbb{R}^r, \mathcal{B}^r)$, $\Sigma\Pi^r(R)$ is $\rho_\mu$-dense in $M^r$.
Inspired by the theorem and corollaries above, we began studying neural networks that use ReLU as the activation function. The main result of our research is the theorem below.
Theorem 2. 
For every $r \in \mathbb{Z}^+$ and every probability measure $\mu$ on $(\mathbb{R}^r, \mathcal{B}^r)$, $\Sigma^r(R)$ is uniformly dense on compact sets in $C^r$, and it is $\rho_\mu$-dense in $M^r$.
This theorem shows that ReLU neural networks have the same expressive power as Sigmoid neural networks. From it we can easily prove the following corollaries.
Corollary 3. 
1. For every function $g \in M^r$ and every real number $\epsilon > 0$, there are a compact set $K \subset \mathbb{R}^r$ and a function $f \in \Sigma^r(R)$ such that $\mu(K) > 1 - \epsilon$ and $|g(x) - f(x)| < \epsilon$ for every $x \in K$. Here $r$ is an integer and $\mu$ is a probability measure on $\mathbb{R}^r$.
2. If a compact set $K \subset \mathbb{R}^r$ satisfies $\mu(K) = 1$, then $\Sigma^r(R)$ is $\rho_p$-dense in $L^p(\mathbb{R}^r, \mu)$, for any $p \ge 1$ and any integer $r$.
3. If $\mu$ is a probability measure on $[0, 1]^r$, then $\Sigma^r(R)$ is $\rho_p$-dense in $L^p([0, 1]^r, \mu)$, where $p \ge 1$ and $r$ is an arbitrary integer.
4. If $\mu$ puts mass 1 on a finite set of points, then for every $g \in M^r$ and any $\epsilon > 0$ there is a function $f \in \Sigma^r(R)$ such that $\mu\{x : |f(x) - g(x)| < \epsilon\} = 1$.
5. For any Boolean function $g$ and real number $\epsilon > 0$, there is a function $f \in \Sigma^r(R)$ such that $\max_{x \in \{0,1\}^r} |g(x) - f(x)| < \epsilon$.
In other words, a ReLU neural network can approximate a Boolean function to arbitrary precision as long as it has enough neurons.
Before proving the above, we need the following lemma for assistance.
Lemma 1. 
Let $a, b \in \mathbb{R}$ with $a < b$, and let $f$ be a convex function on $[a, b]$. Assume $f$ is differentiable on $[a, b]$ and $|f'|$ is bounded. Then for every $\epsilon > 0$, there exists $f_\epsilon \in \Sigma^r(R)$ such that
$\sup_{\lambda \in [a, b]} |f(\lambda) - f_\epsilon(\lambda)| < \epsilon.$
A similar conclusion also holds for concave functions.
Corollary 4. 
For every $a, b$ with $a < b$, if $f$ is a concave function on $[a, b]$ and $|f'|$ is bounded, then for any $\epsilon > 0$ there is an $f_\epsilon \in \Sigma^r(R)$ such that
$\sup_{\lambda \in [a, b]} |f(\lambda) - f_\epsilon(\lambda)| < \epsilon.$
That is, if the convexity condition in Lemma 1 is replaced by concavity, with the other conditions unchanged, the conclusion still holds.
Lemma 2. 
For every $\epsilon > 0$ and $M > 0$, there is a function $\cos_{M,\epsilon} \in \Sigma^1(R)$ such that
$\sup_{\lambda \in [-M, M]} |\cos_{M,\epsilon}(\lambda) - \cos(\lambda)| < \epsilon.$
This approximation can be extended to trigonometric series.
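Lemma 2's conclusion can also be checked numerically. The sketch below takes a shortcut: instead of the proof's piece-by-piece convex/concave decomposition, it interpolates $\cos$ linearly on a uniform grid and expresses the interpolant as an element of $\Sigma^1(R)$; the knot count comes from the standard interpolation error bound $h^2 \max|f''|/8$ with $\max|\cos''| = 1$ (a hedged construction of ours, not the paper's):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def cos_approx(M, eps):
    """Build a cos_{M,eps} in Sigma^1(R): a ReLU sum whose sup-error against
    cos on [-M, M] stays below eps (uniform piecewise-linear interpolation)."""
    n = 2 * int(np.ceil(2 * M / np.sqrt(8 * eps)))  # knot count with a factor-2 margin
    t = np.linspace(-M, M, n + 1)
    slopes = np.diff(np.cos(t)) / np.diff(t)
    coeffs = np.diff(slopes, prepend=0.0)           # slope changes become ReLU weights
    def f(x):
        return np.cos(-M) * relu(1.0) + np.sum(coeffs * relu(x[:, None] - t[:-1][None, :]), axis=1)
    return f

M, eps = 10.0, 1e-3
cos_M_eps = cos_approx(M, eps)
lam = np.linspace(-M, M, 5001)
print(np.max(np.abs(cos_M_eps(lam) - np.cos(lam))) < eps)  # True
```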
Lemma 3. 
Let $g(\cdot) = \sum_{j=1}^{Q} \beta_j \cos(A_j(\cdot))$ with $A_j \in \mathcal{A}^r$. Then for any compact set $K \subset \mathbb{R}^r$ and every real number $\epsilon > 0$, there is an $f \in \Sigma^r(R)$ such that
$\sup_{x \in K} |g(x) - f(x)| < \epsilon.$
The following lemma is about the density of $\Sigma^r(R)$ in $C^r$.
Lemma 4. 
$\Sigma^r(R)$ is uniformly dense on compact sets in $C^r$.
The last lemma we need is Lemma A.1 in the paper by Hornik [9]:
Lemma 5 
([9]). For any finite measure $\mu$, $C^r$ is $\rho_\mu$-dense in $M^r$.
Artificial intelligence is a discipline that emphasizes practical applicability. Considering only the theoretical approximation capabilities of neural networks is insufficient for real-world applications. What matters more is the ability of ReLU neural networks to fit data in actual training and the approximation error under certain constraints. This motivates the following discussion: for any given training dataset, a single-hidden-layer ReLU neural network theoretically has the capacity to perfectly fit the training data.
Theorem 3. 
Let $\{x_1, x_2, \dots, x_n\}$ be a set of distinct points in $\mathbb{R}^r$, and let $g: \mathbb{R}^r \to \mathbb{R}$ be any function. Then there is a function $f \in \Sigma^r(R)$ such that $f(x_i) = g(x_i)$, $i = 1, 2, \dots, n$.
If we use $\Sigma^{r,k}(R)$ to denote the set
$\{ f \mid f(x) = \sum_{j=1}^{q} \beta_j R(A_j(x)),\ x \in \mathbb{R}^r,\ \beta_j \in \mathbb{R},\ A_j(\cdot)\ \text{is an affine function from}\ \mathbb{R}^r\ \text{to}\ \mathbb{R}^k \},$
then we can draw the following conclusion.
Corollary 5. 
Let $\{x_1, x_2, \dots, x_n\}$ be a set of distinct points in $\mathbb{R}^r$, and let $g: \mathbb{R}^r \to \mathbb{R}^k$ be any function. Then there is a function $f \in \Sigma^{r,k}(R)$ such that $f(x_i) = g(x_i)$, $i = 1, 2, \dots, n$.

3.2. Discussion

We have proven that for any given training set, there exists an $f \in \Sigma^r(R)$ that fits it perfectly. The next question is how to find this $f$. The BP algorithm tells us that we can find a neural network with a locally minimal error, but according to Hecht-Nielsen's theory [11], we cannot guarantee that a single-hidden-layer neural network converges to the global optimum. Some papers [17,18] also point out a phenomenon called Dying ReLU, in which ReLU neurons become inactive and output 0 for every input. We can use our theory to explain this phenomenon.
If we use $\sum_{j=1}^{q} \beta_j R(A_j(x))$ to denote a trained ReLU neural network, where $\beta_j \in \mathbb{R}$ and $A_j$ is an affine function for all $j = 1, 2, \dots, q$, and let $S \subset \mathbb{R}^r$ be a dataset, then the direct cause of Dying ReLU is that some affine function maps the whole dataset onto the non-positive half-axis of the ReLU. As a formula,
$\exists\, 1 \le j \le q\ \text{s.t.}\ A_j(x) \le 0,\ \forall x \in S.$
Here $A_j(x) \le 0$ means that each component of $A_j(x)$ is non-positive, and we call the set of all $x$ satisfying $A_j(x) \le 0$ the Dying Area of $A_j$, denoted $Dy_{A_j}$. The Dying ReLU problem then becomes how to train a ReLU neural network so that the dataset does not fall into $Dy_{A_j}$. It is easy to see that if $S \subset Dy_{A_j}$ for some $j$, then the BP algorithm no longer adjusts the parameters of this neuron, so training has no effect on it.
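This mechanism can be checked directly. For a single neuron $\beta R(A(x))$, the gradient of the squared loss with respect to $w$ and $b$ carries the factor $\mathbb{1}\{A(x) > 0\}$, so if $A(x) \le 0$ on every sample the gradient vanishes identically. A small NumPy illustration (the weights and data are hypothetical):

```python
import numpy as np

def neuron_grads(w, b, X, y, beta=1.0):
    """Gradients of the squared loss sum_i (beta*R(w@x_i + b) - y_i)^2
    with respect to w and b, for a single ReLU neuron."""
    z = X @ w + b                       # pre-activations A(x_i)
    a = np.maximum(0.0, z)              # ReLU outputs
    act = (z > 0).astype(float)         # ReLU derivative: 0 on the dead side
    err = beta * a - y
    grad_w = 2 * beta * (err * act) @ X
    grad_b = 2 * beta * np.sum(err * act)
    return grad_w, grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# A bias that pushes every sample into the Dying Area: A(x_i) <= 0 for all i.
w_dead = np.array([0.1, 0.1, 0.1])
b_dead = -100.0
gw, gb = neuron_grads(w_dead, b_dead, X, y)
print(np.allclose(gw, 0.0) and gb == 0.0)  # True: BP can no longer update this neuron
```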
Previous researchers have proposed several different approaches to this problem. Some research teams suggest new activation functions, such as Leaky ReLU [19] and GELU [20] (see Figure 2); keeping a nonzero response on the negative half-axis avoids the Dying Area and may prevent neuron death.
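The point of these variants is easy to verify: unlike ReLU, both keep a nonzero derivative at negative pre-activations, so a neuron inside the Dying Area still receives a learning signal. A sketch (GELU implemented here with the common tanh approximation):

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

def gelu(z):
    # tanh approximation of GELU [20]
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def gelu_grad(z, h=1e-6):
    return (gelu(z + h) - gelu(z - h)) / (2 * h)  # central-difference derivative

z = np.array([-3.0, -1.0, -0.1])   # pre-activations inside the Dying Area
print(relu_grad(z))                # [0. 0. 0.]  -> no learning signal at all
print(leaky_relu_grad(z))          # [0.01 0.01 0.01]
print(gelu_grad(z))                # small but nonzero values
```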
Others believe that the initialization of parameters plays an important role in the training of neural networks [2,7]. A poor parameter initialization can slow down training and may even cause some ReLU neurons to die before training begins. In the next subsection, we conduct a set of experiments to help illustrate this point.
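One simple diagnostic is to count, at initialization, the neurons whose pre-activation is non-positive on every training sample, i.e., neurons born inside their Dying Area. A hedged NumPy sketch on synthetic data (one hidden layer; the poor initialization is a deliberately negative bias):

```python
import numpy as np

def dead_fraction(W, b, X):
    """Fraction of hidden neurons with A_j(x) <= 0 for ALL samples x in X,
    i.e., neurons that start inside their Dying Area."""
    Z = X @ W.T + b                  # pre-activations, shape (n_samples, n_neurons)
    return np.mean(np.all(Z <= 0, axis=0))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 64))      # synthetic dataset (hypothetical)
n_hidden = 256

# Kaiming-style normal init: std = sqrt(2 / fan_in), zero bias.
W = rng.normal(0.0, np.sqrt(2.0 / 64), size=(n_hidden, 64))
b_good = np.zeros(n_hidden)
b_bad = np.full(n_hidden, -6.0)      # a deliberately poor bias initialization

print(dead_fraction(W, b_good, X))   # essentially 0: no neuron starts dead
print(dead_fraction(W, b_bad, X))    # most neurons start dead and stay dead
```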
Others approach the problem from the perspective of network depth; it has been shown theoretically that deep neural networks can realize the function of shallow networks with fewer parameters [21]. We also conducted an experiment to compare the performance of different activation functions in deep neural networks.

3.3. Experiments

In this section, we will show the impact of different initialization parameter strategies on neural network training. Additionally, we will demonstrate the training differences between ReLU, Sigmoid, and Tanh in deep neural networks.
1. Datasets. We evaluate our models on a subset of the COVID-19 Radiography Database from Kaggle (https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database, accessed on 20 November 2025), which originally contains 21,165 chest X-ray images across four categories. For our experiments, we sample 1059 images (953 for training and 106 for testing) while preserving class balance. All images are resized to 224 × 224 and normalized before training. We also validated some of the deep networks on the MNIST dataset to check whether the experimental conclusions were specific to a single dataset.
2. Models. We removed the residual connection structure from the ResNet architecture of He et al. [2] and built each layer of the network using only fully connected layers. We implement two deep models (50 layers and 101 layers) for our experiments. Based on the above models, we designed three sets of experiments.
The first is to study the effect of the learning rate on different activation functions in deep neural networks. We used three learning rates (0.01, 0.001, 0.0001) and employed the Kaiming normal initialization strategy for parameters (gain = 1). This experiment was conducted on the MNIST dataset.
The second experiment investigates the impact of different initialization strategies on training 50-layer deep neural networks with various activation functions. This experiment was conducted on two datasets, with the learning rate fixed at 0.001.
The last experiment involved replacing the 50-layer deep neural network in the second experiment with a 101-layer deep neural network, while keeping all other hyperparameters unchanged.
For each experiment, three activation settings are evaluated: ReLU, Tanh, and Sigmoid. All experiments were trained for 50 epochs with a batch size of 32 using the SGD optimizer. All experimental parameters are set as shown in Table 2.
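The settings above can be collected into a single configuration grid; a sketch in Python (the values are those stated in the text and Table 2, while the dictionary field names are our own):

```python
# Hyperparameter grid for the three experiments described in Section 3.3
# (field names are illustrative; values follow the text and Table 2).
CONFIG = {
    "depths": [50, 101],                      # fully connected, residuals removed
    "activations": ["relu", "tanh", "sigmoid"],
    "initializations": ["kaiming_normal", "xavier"],
    "learning_rates": [0.01, 0.001, 0.0001],  # lr study on MNIST; other runs fix 0.001
    "optimizer": "sgd",
    "epochs": 50,
    "batch_size": 32,
    "datasets": ["covid19_radiography_subset", "mnist"],
}
print(len(CONFIG["depths"]) * len(CONFIG["activations"]) * len(CONFIG["initializations"]))  # 12
```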
3. Evaluation Metrics. Model performance is evaluated using Top-1 classification accuracy. We also record training loss curves to analyze convergence differences.
4. Experimental Procedure. Controlled experiments are conducted, each using one activation function (ReLU, Sigmoid, or Tanh) and one parameter initialization method (Kaiming or Xavier). All models are trained from scratch under otherwise identical settings, ensuring that any performance difference is due to the activation function and initialization alone.
5. Results and Discussion. We report the performance of models trained with ReLU, Tanh, and Sigmoid activation functions under two different network depths. The results consistently demonstrate that the effect of the activation function becomes more pronounced as the network grows deeper.
  • Analysis on the 50-layer network.
    In the shallower 50-layer architecture (see Table 3 and Figure 3 and Figure 4), all three activation functions achieve comparable accuracy and convergence behavior. Although the performance gap is relatively small, ReLU still produces slightly higher accuracy and faster convergence. This indicates that the vanishing-gradient issue is less severe at this depth, allowing Sigmoid and Tanh to remain competitive.
  • Analysis on the 101-layer network (see Figure 5, Table 4).
    ReLU achieves the highest accuracy and fastest convergence in the 101-layer network, confirming its advantage in mitigating gradient vanishing. Sigmoid and Tanh show almost no learning progress during the first 20 epochs due to saturation in deep layers. After epoch 20, their curves diverge: Sigmoid recovers gradients more effectively and improves rapidly, while Tanh remains slower because of stronger saturation. Overall, activation choice becomes increasingly critical in deeper networks, with ReLU demonstrating the most stable and efficient training behavior.
By comparing the effects of different initializations on networks with the same depth and activation functions, it can be shown that parameter initialization does indeed affect the efficiency of model training.

4. Concluding Remark

Beyond the depth of the network architecture and the choice of initialization, our findings also shed light on several additional issues related to ReLU-based neural networks.
Increasing the number of parameters or neurons generally enhances the model’s generalization ability [22,23,24]. This is because a larger network reduces the likelihood that training samples fall into the Dying Area of ReLU units, thereby improving the effectiveness of gradient-based optimization. As a result, more neurons can actively participate in representing the target mapping function. Moreover, Theorem 2 provides an idealized upper bound connecting the number of effective neurons and the approximation error: the more effective neurons the network possesses, the smaller the expected approximation error becomes.
The quality of the training dataset is crucial for generalization performance [25,26]. Neural networks aim to approximate the underlying mapping represented by the dataset; thus, if the dataset poorly reflects the true characteristics of the real-world signal, the network cannot faithfully capture the actual data structure. Conversely, if the dataset contains substantial redundancy or meaningless samples, Theorem 3 implies that a significantly larger number of neurons would be required to fit the dataset adequately, leading to unnecessary computational and resource expenditure.
The future development of artificial intelligence should aim to enhance model capabilities while simultaneously minimizing energy consumption. We identify two potential research directions.
First, improving model architectures to maximize neuron utilization is essential. In other words, increasing the practical effectiveness of ReLU-based networks requires preventing as many neurons as possible from entering the Dying ReLU state, thereby enabling the same task to be accomplished with fewer neurons and parameters. However, the challenges in this direction are evident: aside from the transformer architecture, no substantially superior neural network design has been identified, and we still lack a clear theoretical understanding of why transformer-based models consistently exhibit such strong empirical performance.
Second, constructing higher-quality datasets through more precise data acquisition is crucial. If the collected data accurately reflects the latent structure of the target phenomenon, model training and learned representations will exhibit greater robustness and transferability. Yet this line of inquiry raises new and fundamental questions: What constitutes a high-quality dataset? And, for different application domains, what sampling strategies are required to reliably obtain such datasets? These questions remain open and merit careful investigation.

5. Mathematical Appendix

Proof of Lemma 1. 
Notice that a convex function $f$ satisfies $\frac{f(x) + f(y)}{2} \ge f\left(\frac{x+y}{2}\right)$ for every $x, y \in [a, b]$. We construct two linear functions as below:
$l_1(x) = \frac{f(\frac{a+b}{2}) - f(a)}{\frac{a+b}{2} - a}(x - a) + f(a), \qquad l_2(x) = \frac{f(b) - f(\frac{a+b}{2})}{b - \frac{a+b}{2}}\left(x - \frac{a+b}{2}\right) + f\left(\frac{a+b}{2}\right).$
Let
$g_1(x) = \begin{cases} l_1(x), & a \le x < \frac{a+b}{2}, \\ l_2(x), & \frac{a+b}{2} \le x \le b, \end{cases}$
be a piecewise linear function; then $f_1(x) = f(x) - g_1(x)$ is still a convex function on $[a, \frac{a+b}{2}]$ (resp. on $[\frac{a+b}{2}, b]$). We obtain the following inequality for estimating:
$\max_{x \in [a,b]} f(x) - \min_{x \in [a,b]} f(x) \ge \max_{x \in [a,b]} f_1(x) - \min_{x \in [a,b]} f_1(x).$
Equality holds if and only if $f$ is constant. Similarly (Figure 6), we construct a piecewise linear function $g_2$ based on $f_1$ (that is, we construct piecewise linear functions analogous to $g_1$ on $[a, \frac{a+b}{2}]$ and $[\frac{a+b}{2}, b]$, respectively, and then concatenate them). Let $f_2 = f_1 - g_2$; then
$\max_{x \in [a,b]} f_1(x) - \min_{x \in [a,b]} f_1(x) \ge \max_{x \in [a,b]} f_2(x) - \min_{x \in [a,b]} f_2(x).$
Similarly, we can construct $f_3, f_4$, and so on. We define $\delta_i = \max_{x \in [a,b]} f_i(x) - \min_{x \in [a,b]} f_i(x)$, $i = 1, 2, \dots$, and $A_i^j = [a + \frac{(j-1)(b-a)}{2^i},\, a + \frac{j(b-a)}{2^i})$ for $j = 1, 2, \dots, 2^i - 1$, with $A_i^{2^i} = [a + \frac{(2^i - 1)(b-a)}{2^i},\, b]$. Since $|f'|$ is bounded, we assume $|f'| < M$. Noticing that $f_i$ is obtained from $f$ by subtracting several piecewise linear functions on $[a, b]$, we obtain the following estimate:
$\delta_i = \max_{x \in [a,b]} f_i(x) - \min_{x \in [a,b]} f_i(x) = \max_{j = 1, \dots, 2^i} \left\{ \max_{x \in A_i^j} f_i(x) - \min_{x \in A_i^j} f_i(x) \right\} \le \max_{j = 1, \dots, 2^i} \left\{ \max_{x \in A_i^j} f(x) - \min_{x \in A_i^j} f(x) \right\} \le \max_{j = 1, \dots, 2^i} \int_{A_i^j} |f'|\, dx \le \max_{j = 1, \dots, 2^i} \int_{A_i^j} M\, dx = \frac{M(b-a)}{2^i}.$
When $i > i_\epsilon = \lceil \log_2 \frac{M(b-a)}{\epsilon} \rceil$, we have $\delta_i < \epsilon$, which means $f_\epsilon$ satisfies
$\sup_{\lambda \in [a,b]} |f(\lambda) - f_\epsilon(\lambda)| < \epsilon.$
Here $f_\epsilon(x) = \sum_{j=1}^{i} g_j(x)$, for any fixed $i > i_\epsilon$, is a piecewise linear function.
Now it remains to show that $f_\epsilon \in \Sigma^r(R)$; for this we just need to show $g_j \in \Sigma^r(R)$. Notice that
$G_{g_1,[a,b]}(x) = \begin{cases} f(a)\ (= g_1(a)), & x < a, \\ g_1(x), & x \in [a, b], \\ f(b)\ (= g_1(b)), & x > b, \end{cases}$
and we can express $G_{g_1,[a,b]}(x)$ as
$G_{g_1,[a,b]}(x) = \frac{g_1(\frac{a+b}{2}) - g_1(a)}{\frac{b-a}{2}} R(x - a) - \frac{2 g_1(\frac{a+b}{2}) - g_1(a) - g_1(b)}{\frac{b-a}{2}} R\left(x - \frac{a+b}{2}\right) + \frac{g_1(\frac{a+b}{2}) - g_1(b)}{\frac{b-a}{2}} R(x - b) + g_1(a) R(1).$
So $g_1 \in \Sigma^r(R)$. Similarly, for $i \ge 2$,
$g_i(x) = \sum_{j=1}^{2^{i-1}} G_{g_i,\,[a + \frac{(j-1)(b-a)}{2^{i-1}},\; a + \frac{j(b-a)}{2^{i-1}}]}(x).$
This shows that $g_i \in \Sigma^r(R)$ for all $i = 1, 2, \dots$. Since $\Sigma^r(R)$ is closed under finite linear combinations, we have $f_\epsilon \in \Sigma^r(R)$. □
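The construction in this proof can be reproduced numerically: interpolate a convex $f$ at the dyadic knots $a + j(b-a)/2^i$, write the interpolant as a sum of ReLU terms (as in the expression for $G_{g_1,[a,b]}$), and confirm that the sup error stays within the bound $M(b-a)/2^i$. A sketch with $f(x) = e^x$ on $[0, 1]$ (our choice of test function):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dyadic_relu_interpolant(f, a, b, i):
    """Interpolate f at the 2**i + 1 dyadic knots; return (knots, coeffs) so the
    piecewise-linear interpolant equals f(a)*R(1) + sum_k coeffs[k]*R(x - knots[k])."""
    t = np.linspace(a, b, 2**i + 1)
    slopes = np.diff(f(t)) / np.diff(t)
    coeffs = np.diff(slopes, prepend=0.0)   # slope change at each knot
    return t[:-1], coeffs

def evaluate(x, fa, knots, coeffs):
    return fa * relu(1.0) + np.sum(coeffs * relu(x[:, None] - knots[None, :]), axis=1)

a, b, i = 0.0, 1.0, 6                       # e^x is convex with |f'| <= e on [0, 1]
knots, coeffs = dyadic_relu_interpolant(np.exp, a, b, i)
x = np.linspace(a, b, 10001)
err = np.max(np.abs(np.exp(x) - evaluate(x, np.exp(a), knots, coeffs)))
print(err <= np.e * (b - a) / 2**i)         # True: within the Lemma 1 bound M(b-a)/2^i
```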
Proof of Lemma 2. 
According to Lemma 1, there is a sequence of functions $\{f_k, k \in \mathbb{Z}\} \subset \Sigma^r(R)$ such that for every $\epsilon > 0$
$\sup_{\lambda \in [2k\pi + \frac{\pi}{2},\, 2k\pi + \frac{3\pi}{2}]} |f_k(\lambda) - \cos(\lambda)| < \epsilon,$
and $f_k(\lambda) = 0$ for all $\lambda \in (-\infty, 2k\pi + \frac{\pi}{2}) \cup (2k\pi + \frac{3\pi}{2}, \infty)$.
By Corollary 4, it is easy to find a function sequence $\{g_k, k \in \mathbb{Z}\}$ in $\Sigma^r(R)$ such that for every $\epsilon > 0$ we have the estimate
$\sup_{\lambda \in [2k\pi - \frac{\pi}{2},\, 2k\pi + \frac{\pi}{2}]} |g_k(\lambda) - \cos(\lambda)| < \epsilon,$
and $g_k(\lambda) = 0$ for any $\lambda \in (-\infty, 2k\pi - \frac{\pi}{2}) \cup (2k\pi + \frac{\pi}{2}, \infty)$.
Then for every $M > 0$, choose $n \in \mathbb{N}$ satisfying $2n\pi > M$. We have
$\sup_{\lambda \in [-M, M]} \left| \sum_{k=-n}^{n} (f_k(\lambda) + g_k(\lambda)) - \cos(\lambda) \right| \le \sup_{k \in \mathbb{Z},\, |k| \le n} \left\{ \sup_{\lambda \in [2k\pi - \frac{\pi}{2},\, 2k\pi + \frac{3\pi}{2}]} |f_k(\lambda) + g_k(\lambda) - \cos(\lambda)| \right\} < \epsilon.$
So the function $\cos_{M,\epsilon}(\lambda) = \sum_{k=-n}^{n} (f_k(\lambda) + g_k(\lambda))$ is what we need. □
Proof of Lemma 3. 
Notice that $A_j \in \mathcal{A}^r$, $j = 1, 2, \dots, Q$, are continuous mappings and the compact set $K$ is bounded, so there is a real number $M > 0$ such that $A_j(K) \subset [-M, M]$ for all $j = 1, 2, \dots, Q$.
Lemma 2 tells us there is a function $\cos_{M, \frac{\epsilon}{|\beta_j| Q}}(\cdot) \in \Sigma^1(R)$ such that
$\sup_{\lambda \in [-M, M]} \left| \cos(\lambda) - \cos_{M, \frac{\epsilon}{|\beta_j| Q}}(\lambda) \right| < \frac{\epsilon}{|\beta_j| Q}.$
We can construct $f(x)$ as follows:
$f(x) = \sum_{j=1}^{Q} \beta_j \cos_{M, \frac{\epsilon}{|\beta_j| Q}}(A_j(x)),$
so that
$\sup_{x \in K} |g(x) - f(x)| = \sup_{x \in K} \left| \sum_{j=1}^{Q} \beta_j \left[ \cos(A_j(x)) - \cos_{M, \frac{\epsilon}{|\beta_j| Q}}(A_j(x)) \right] \right| \le \sum_{j=1}^{Q} |\beta_j| \sup_{\lambda \in [-M, M]} \left| \cos(\lambda) - \cos_{M, \frac{\epsilon}{|\beta_j| Q}}(\lambda) \right| \le \sum_{j=1}^{Q} \frac{\epsilon}{Q} = \epsilon.$
Up to this point, we have proven Lemma 3. □
Proof of Lemma 4. 
With the trigonometric identity, we have
$2 \cos(\alpha) \cos(\beta) = \cos(\alpha + \beta) + \cos(\alpha - \beta).$
Based on the formula above, we can transform elements of $\Sigma\Pi^r(\cos)$ into elements of $\Sigma^r(\cos)$; in fact, $\Sigma\Pi^r(\cos) = \Sigma^r(\cos)$. According to Theorem 1, $\Sigma\Pi^r(\cos)$ is uniformly dense on compact sets in $C^r$. The proof is then finished by Lemma 3. □
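For instance, a product of two factors collapses as follows; since $A_1 + A_2$ and $A_1 - A_2$ are again affine, the right-hand side lies in $\Sigma^r(\cos)$:

```latex
\cos(A_1(x))\cos(A_2(x)) \;=\; \tfrac{1}{2}\cos\big((A_1 + A_2)(x)\big) + \tfrac{1}{2}\cos\big((A_1 - A_2)(x)\big),
```

and longer products of $l_j$ factors reduce by induction, one factor at a time.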
Proof of Theorem 2. 
On the one hand, Lemma 4 shows that $\Sigma^r(R)$ is uniformly dense on compact sets in $C^r$, so $\Sigma^r(R)$ is $\rho_\mu$-dense in $C^r$ for any probability measure $\mu$ on $(\mathbb{R}^r, \mathcal{B}^r)$. On the other hand, we know $C^r$ is $\rho_\mu$-dense in $M^r$ through Lemma 5. The triangle inequality then implies that $\Sigma^r(R)$ is $\rho_\mu$-dense in $M^r$. □
Proof of Theorem 3. 
First we find an affine function $A \in \mathcal{A}^r$ that makes $A(x_1), A(x_2), \dots, A(x_n)$ pairwise distinct. For each pair $i \neq j$, the set of matrices $A \in \mathbb{R}^{1 \times r}$ satisfying $A x_i = A x_j$ is a proper linear subspace of $\mathbb{R}^{1 \times r}$, and the union of these finitely many subspaces cannot cover $\mathbb{R}^{1 \times r}$; hence there is a matrix $A$ with $A x_i \neq A x_j$ for all $i \neq j$, and $A(x) = A x$ is what we need.
Let $y_i = A(x_i)$; then $y_1, y_2, \dots, y_n$ are pairwise distinct. Without loss of generality, we assume $y_{i_1} < y_{i_2} < \dots < y_{i_n}$. Consider the function below:
$f_k(x) = \frac{g(x_k)}{y_k - y_{k-1}} R(x - y_{k-1}) - \left( \frac{g(x_k)}{y_{k+1} - y_k} + \frac{g(x_k)}{y_k - y_{k-1}} \right) R(x - y_k) + \frac{g(x_k)}{y_{k+1} - y_k} R(x - y_{k+1}).$
In particular, we define $y_{i_0} = \min\{-2 y_{i_1}, 2 y_{i_1}\}$ and $y_{i_{n+1}} = \max\{-2 y_{i_n}, 2 y_{i_n}\}$. Then $f_k(x)$ is a piecewise linear function that takes values between 0 and $g(x_k)$ on the interval $[y_{k-1}, y_{k+1}]$, with $f_k(x) = g(x_k)$ if and only if $x = y_k$, and $f_k$ remains 0 outside the interval $[y_{k-1}, y_{k+1}]$. Let $F = \sum_{k=1}^{n} f_k$; then $F(y_k) = g(x_k)$. Let $f = F \circ A$; then $f(x_i) = F(A(x_i)) = F(y_i) = g(x_i)$ for all $i = 1, 2, \dots, n$. So $f$ is the function we need. □
Remark 1. 
The number of activation functions (neurons in the hidden layer) in the theorem above is approximately three times the number of sample points. In practice, however, about the same number of activation functions as sample points suffices for a perfect fit of the training set. Consider the functions below (with the $y_k$ relabeled so that $y_1 < y_2 < \dots < y_n$):
$f_k(x) = \beta_k R(x - y_k), \quad k = 1, 2, \dots, n - 1.$
Here $\beta_1 = \frac{g(x_2) - g(x_1)}{y_2 - y_1}$ and $\beta_{k+1} = \frac{g(x_{k+2}) - g(x_{k+1})}{y_{k+2} - y_{k+1}} - \sum_{j=1}^{k} \beta_j$, $k = 1, 2, \dots, n - 2$.
One can easily verify that $F(x) = g(x_1) R(1) + \sum_{k=1}^{n-1} \beta_k R(x - y_k)$ satisfies $F(A(x_i)) = g(x_i)$.
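The remark's construction is easy to verify numerically: project the points with a random affine map $A$, sort the projections, and accumulate the slope corrections $\beta_k$. A NumPy sketch (random data; the resorting of target values to match the sorted projections is our bookkeeping):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def exact_fit(X, g_vals, rng):
    """Remark 1 construction: a one-hidden-layer ReLU net with n - 1 hidden
    neurons (plus the constant term g(x_1)R(1)) fitting n points exactly."""
    A = rng.normal(size=X.shape[1])        # random affine map; separates points a.s.
    y = X @ A
    order = np.argsort(y)
    ys, gs = y[order], g_vals[order]       # relabel so that y_1 < y_2 < ... < y_n
    slopes = np.diff(gs) / np.diff(ys)
    betas = np.diff(slopes, prepend=0.0)   # beta_1 = slope_1; beta_{k+1} = slope_{k+1} - sum_{j<=k} beta_j
    def F(Xq):
        t = Xq @ A
        return gs[0] * relu(1.0) + np.sum(betas * relu(t[:, None] - ys[:-1][None, :]), axis=1)
    return F

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))               # n = 20 random points in R^5 (hypothetical data)
g_vals = rng.normal(size=20)               # arbitrary target values
F = exact_fit(X, g_vals, rng)
print(np.allclose(F(X), g_vals))           # True: perfect fit with 19 hidden neurons
```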

Author Contributions

Conceptualization, G.L. and Z.T.; methodology, X.W.; validation, G.L. and Z.T.; investigation, G.L.; resources, S.T. and Z.T.; data curation, W.Z.; writing—original draft preparation, G.L.; writing—review and editing, W.Z., S.T. and Z.T.; supervision, X.W.; project administration, Z.T.; funding acquisition, S.T. and Z.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Special Fund for Provincial Basic Research (grant number BK20253035).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest. The funders were involved in the data collection and the provision of the experimental platform, but had no role in the design of the study; in the analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
ANNs: Artificial Neural Networks
BP: Backpropagation
BPNN: Backpropagation Neural Networks
CNN: Convolutional Neural Network
CVPR: Conference on Computer Vision and Pattern Recognition
DCNN: Deep Convolutional Neural Network
DL: Deep Learning
GELU: Gaussian Error Linear Unit
ML: Machine Learning
NLP: Natural Language Processing
ResNet: Residual Network
ReLU: Rectified Linear Unit
ULR: Unidirectional Linear Response

References

  1. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [Google Scholar] [CrossRef] [PubMed]
  2. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  3. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar] [CrossRef]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998. [Google Scholar] [CrossRef]
  5. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar] [CrossRef]
  6. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  7. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 20–22 June 2016; pp. 173–182. [Google Scholar] [CrossRef]
  8. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar] [CrossRef]
  9. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366. [Google Scholar] [CrossRef]
  10. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  11. Hecht-Nielsen, R. Theory of the backpropagation neural network. In Neural Networks for Perception, 2nd ed.; Wechsler, H., Ed.; Elsevier: Amsterdam, The Netherlands, 1992; pp. 65–93. [Google Scholar] [CrossRef]
  12. Tang, Z.; Ishizuka, O.; Matsumoto, H. A model of neurons with unidirectional linear response. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 1993, 76, 1537–1540. [Google Scholar]
  13. Tang, Z.; Ishizuka, O.; Matsumoto, H. Multiple-valued neuro-algebra. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 1993, 76, 1541–1543. [Google Scholar]
  14. Tang, Z.; Kobayashi, Y.; Ishizuka, O.; Tanno, K. A learning fuzzy network and its applications to inverted pendulum system. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 1995, 78, 701–707. [Google Scholar]
  15. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  16. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef]
  17. Lu, L.; Shin, Y.; Su, Y.; Karniadakis, G.E. Dying relu and initialization: Theory and numerical examples. arXiv 2019, arXiv:1903.06733. [Google Scholar] [CrossRef]
  18. Apicella, A.; Donnarumma, F.; Isgró, F.; Prevete, R. A survey on modern trainable activation functions. Neural Netw. 2021, 138, 14–32. [Google Scholar] [CrossRef]
  19. Xu, J.; Li, Z.; Du, B.; Zhang, M.; Liu, J. Reluplex made more practical: Leaky ReLU. In Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 7–10 July 2020; pp. 1–7. [Google Scholar] [CrossRef]
  20. Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  21. Mhaskar, H.; Liao, Q.; Poggio, T. When and why are deep networks better than shallow ones? In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar] [CrossRef]
  22. Gowal, S.; Rebuffi, S.-A.; Wiles, O.; Stimberg, F.; Calian, D.A.; Mann, T.A. Improving robustness using generated data. Adv. Neural Inf. Process. Syst. 2021, 34, 4218–4233. [Google Scholar] [CrossRef]
  23. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
  24. Hernandez, D.; Kaplan, J.; Henighan, T.; McCandlish, S. Scaling laws for transfer. arXiv 2021, arXiv:2102.01293. [Google Scholar] [CrossRef]
  25. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  26. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv 2016, arXiv:1611.06440. [Google Scholar]
Figure 1. The Rectified Linear Unit (ReLU).
Figure 2. (a) Leaky ReLU. (b) GELU.
Figure 3. Training accuracy for the experiment with the rank-1 parameters in Table 2. (a) Left: learning rate 0.01; right: learning rate 0.001. With both learning rates, the models quickly converge to 100% accuracy on the MNIST dataset; in the left panel, all three models have almost fully converged by the 18th epoch. (b) Learning rate 0.0001. Since convergence was achieved within 10 epochs, only the first 4 epochs are shown. ReLU still maintains a slight advantage on this simple dataset.
Figure 4. Training accuracy for the experiment with the rank-2 parameters in Table 2. (a) Experimental data on the COVID-19 dataset; from left to right: ReLU, Tanh, and Sigmoid. (b) Experimental data on the MNIST dataset.
Figure 5. Training accuracy for the experiment with the rank-3 parameters in Table 2. (a) Experimental data on the COVID-19 dataset; from left to right: ReLU, Tanh, and Sigmoid. (b) Experimental data on the MNIST dataset.
Figure 6. (a) Using piecewise linear functions to approximate a convex function. (b) f1 = f − g1 is a convex function on some compact set. (c) Using another piecewise linear function to approximate the convex function on a compact set.
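The construction in Figure 6a can be sketched numerically: the piecewise-linear interpolant of a convex function on a set of knots is exactly representable by a single hidden layer of ReLU units, with one unit per knot and coefficients given by the slope changes. The sketch below is illustrative (the target x², the uniform knots, and the function names are assumptions for demonstration, not the paper's construction):

```python
def relu(z):
    return max(z, 0.0)

def relu_interpolant(f, knots):
    """One-hidden-layer ReLU form of the piecewise-linear interpolant of f:
    g(x) = f(x0) + sum_i c_i * relu(x - x_i), where c_i are the slope changes."""
    slopes = [(f(b) - f(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    coeffs = [slopes[0]] + [s1 - s0 for s0, s1 in zip(slopes, slopes[1:])]
    x0 = knots[0]
    return lambda x: f(x0) + sum(c * relu(x - k) for c, k in zip(coeffs, knots[:-1]))

f = lambda x: x * x                      # an illustrative convex function on [0, 1]
knots = [i / 10 for i in range(11)]      # 10 uniform segments of length h = 0.1
g = relu_interpolant(f, knots)

grid = [i / 1000 for i in range(1001)]
max_err = max(abs(f(x) - g(x)) for x in grid)
# For f(x) = x^2 the interpolation error peaks at segment midpoints at h^2/4 = 0.0025
```

Because f is convex, the interpolant lies on or above f everywhere, and halving the knot spacing cuts the worst-case error by a factor of four; this is the sense in which the piecewise-linear (ReLU) approximation in Figure 6 can be driven arbitrarily close to f on a compact set.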
Table 1. Usage frequency of various activation functions on GitHub (July 2025).
Rank   Function     Usage Count
1      ReLU         15.3 M
2      SoftMax      3.7 M
3      Tanh         3.1 M
4      Sigmoid      2.4 M
5      GELU         1.8 M
6      Swish        717 k
7      Leaky ReLU   635 k
8      Softplus     253 k
Table 2. Experimental Parameters.
Rank   Layers   Initialization            Learning Rate         Dataset
1      50       Kaiming normal            0.01, 0.001, 0.0001   MNIST
2      50       Kaiming normal & Xavier   0.001                 COVID-19 & MNIST
3      101      Kaiming normal & Xavier   0.001                 COVID-19 & MNIST
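The two initialization schemes in Table 2 scale signal variance through a deep ReLU stack very differently, which is why the choice matters at depths of 50 and 101 layers. The pure-Python sketch below illustrates the effect; the 64-unit width, 50-layer depth, and fixed seed are assumptions for demonstration, not the authors' experimental code:

```python
import math
import random

def init_layer(fan_in, fan_out, scheme):
    # Kaiming normal: std = sqrt(2 / fan_in), derived for ReLU;
    # Xavier normal:  std = sqrt(2 / (fan_in + fan_out)), derived for symmetric activations.
    if scheme == "kaiming":
        std = math.sqrt(2.0 / fan_in)
    else:
        std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[random.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def forward(x, layers):
    for W in layers:
        # Weighted sum followed by ReLU, one row per output neuron.
        x = [max(sum(w * v for w, v in zip(row, x)), 0.0) for row in W]
    return x

width, depth = 64, 50
rms_by_scheme = {}
for scheme in ("kaiming", "xavier"):
    random.seed(0)  # identical draws for both schemes, only the scale differs
    layers = [init_layer(width, width, scheme) for _ in range(depth)]
    x = [random.gauss(0.0, 1.0) for _ in range(width)]
    out = forward(x, layers)
    rms_by_scheme[scheme] = math.sqrt(sum(v * v for v in out) / width)
# Kaiming keeps the activation scale roughly constant across 50 ReLU layers,
# while Xavier shrinks it by roughly a factor of sqrt(1/2) per layer.
```

With ReLU halving the variance of a zero-mean pre-activation, only the Kaiming scaling compensates exactly; under Xavier the signal reaching the deepest layers becomes vanishingly small, consistent with the degradation observed for the deeper configurations in Table 2.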
Table 3. The classification accuracy of different activation functions (50 layers on COVID-19).
Rank   Function   Accuracy (%)
1      ReLU       81.36
2      Sigmoid    80.95
3      Tanh       81.28
Table 4. The classification accuracy of different activation functions (101 layers on COVID-19).
Rank   Function   Accuracy (%)
1      ReLU       73.16
2      Sigmoid    67.18
3      Tanh       62.85

Share and Cite

Luo, G.; Wang, X.; Zhao, W.; Tao, S.; Tang, Z. ReLU Neural Networks and Their Training. Mathematics 2026, 14, 39. https://doi.org/10.3390/math14010039


