ReLU Neural Networks and Their Training
Abstract
1. Introduction
2. ReLU Function and Related Works
3. Main Result
3.1. Theorems and Corollaries
1. $\mathbf{A}^r$ is the set of all affine functions from $\mathbb{R}^r$ to $\mathbb{R}$, which means the functions of the form $A(x) = w \cdot x + b$, where $w, x \in \mathbb{R}^r$ and $b \in \mathbb{R}$.
2. $\mathbf{B}^r$ is the Borel σ-field in $\mathbb{R}^r$.
3. $C^r$ (resp. $M^r$) is the set of all continuous (resp. Borel measurable) functions from $\mathbb{R}^r$ to $\mathbb{R}$.
4. Let S, T be subsets of a metric space with metric ρ. We say S is ρ-dense in T when for any $\varepsilon > 0$ and for all t in T, there is an s in S such that $\rho(s, t) < \varepsilon$.
5. A subset S of $C^r$ is said to be uniformly dense on compact sets in $C^r$ if, for every compact subset $K \subset \mathbb{R}^r$, S is $\rho_K$-dense in $C^r$, where $\rho_K(f, g) = \sup_{x \in K} |f(x) - g(x)|$ (the related metric $\rho_\mu$ used in the results below is written out after this list).
6.
7.
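The results below also refer to a metric $\rho_\mu$ and to a class of single-hidden-layer ReLU networks. In the Hornik–Stinchcombe–White framework that these definitions follow, their standard forms are sketched here; the notation $\rho_\mu$ and $\Sigma^r(\mathrm{ReLU})$ is our assumption, not necessarily the paper's own symbols.

```latex
% Standard forms assumed for the remaining objects (after Hornik et al., 1989):
% the metric rho_mu induced by a probability measure mu on (R^r, B^r), and the
% class of single-hidden-layer ReLU networks built from affine maps A_j in A^r.
\[
  \rho_\mu(f, g) \;=\; \inf\bigl\{ \varepsilon > 0 \;:\;
      \mu\{ x \in \mathbb{R}^r : \lvert f(x) - g(x) \rvert > \varepsilon \} < \varepsilon \bigr\},
\]
\[
  \Sigma^r(\mathrm{ReLU}) \;=\;
  \Bigl\{\, x \mapsto \sum_{j=1}^{q} \beta_j \,\mathrm{ReLU}\bigl(A_j(x)\bigr)
     \;:\; q \in \mathbb{N},\ \beta_j \in \mathbb{R},\ A_j \in \mathbf{A}^r \Bigr\}.
\]
```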
1. For every function g in , there is a compact set and a function , such that for any real number we have . And for every we have . Here r is an integer and μ is a probability measure on $(\mathbb{R}^r, \mathbf{B}^r)$.
2. If a compact set satisfies , then is -dense in , for any and any integer r.
3. If μ is a probability measure on $(\mathbb{R}^r, \mathbf{B}^r)$, then is -dense in , where and r is an arbitrary integer.
4. If μ puts mass 1 on a finite set of points, then for every and any there is a function such that .
5. For any Boolean function g and real number , there is a function such that . (An illustrative one-dimensional approximation sketch follows this list.)
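The statements above concern which function class is dense in which, and under which metric. As a purely illustrative sketch of uniform approximation on a compact set, and not the construction used in the paper's proofs, the snippet below builds a one-hidden-layer ReLU network in closed form, one hidden unit per knot, that matches the piecewise-linear interpolant of a continuous target on a compact interval; the target function, interval, and knot counts are our own placeholder choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_net_interpolant(f, a, b, n_knots):
    """One-hidden-layer ReLU network whose weights are chosen in closed form so
    that it equals the piecewise-linear interpolant of f at n_knots equally
    spaced knots on the compact interval [a, b]."""
    x = np.linspace(a, b, n_knots)
    slopes = np.diff(f(x)) / np.diff(x)        # slope of each linear piece
    coeffs = np.diff(slopes, prepend=0.0)      # output weights: changes in slope
    bias = f(x[0])                             # network output at the left endpoint

    def net(t):
        t = np.asarray(t, dtype=float)
        hidden = relu(t[..., None] - x[:-1])   # one ReLU unit per knot (except the last)
        return bias + hidden @ coeffs
    return net

if __name__ == "__main__":
    target = np.cos                            # any continuous target works here
    a, b = -np.pi, np.pi                       # the compact set K = [a, b]
    grid = np.linspace(a, b, 10_001)           # dense grid to estimate the sup norm
    for n in (5, 20, 80):
        net = relu_net_interpolant(target, a, b, n)
        sup_err = np.max(np.abs(net(grid) - target(grid)))
        print(f"{n:3d} knots -> sup-norm error on [a, b]: {sup_err:.2e}")
```

Because the piecewise-linear interpolant of a continuous function converges uniformly on $[a, b]$ as the knots densify, the printed sup-norm errors shrink; this is the one-dimensional, hand-weighted analogue of the $\rho_K$-density behaviour defined above.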
3.2. Discussion
3.3. Experiments
- Analysis of the 50-layer network. In the shallower 50-layer architecture (see Table 3, Figure 3, and Figure 4), all three activation functions achieve comparable accuracy and convergence behavior. Although the performance gap is relatively small, ReLU still produces slightly higher accuracy and faster convergence. This indicates that the vanishing-gradient issue is less severe at this depth, allowing Sigmoid and Tanh to remain competitive.
- Analysis of the 101-layer network. ReLU achieves the highest accuracy and the fastest convergence in the 101-layer network, confirming its advantage in mitigating gradient vanishing. Sigmoid and Tanh show almost no learning progress during the first 20 epochs because of saturation in the deep layers. After epoch 20, their curves diverge: Sigmoid recovers gradients more effectively and improves rapidly, while Tanh remains slower because of stronger saturation. Overall, the choice of activation becomes increasingly critical in deeper networks, with ReLU demonstrating the most stable and efficient training behavior; a minimal training sketch of such a comparison follows below.
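Table 2 near the end of the article lists the depth, initialization, and learning-rate combinations behind these observations. As a minimal sketch of how such a comparison can be wired up, assuming PyTorch and using a plain fully connected stand-in rather than the ResNet-style 50/101-layer architectures the text refers to, the snippet below builds a deep classifier with a selectable activation and initialization and runs a few training steps on synthetic MNIST-shaped batches; the widths, batch size, and step count are our placeholder choices, not the authors' configuration.

```python
import torch
import torch.nn as nn

def make_deep_mlp(n_layers, width, activation, init):
    """Plain deep fully connected classifier: n_layers hidden blocks of `width`
    units with the chosen activation, followed by a 10-way output layer."""
    acts = {"relu": nn.ReLU, "sigmoid": nn.Sigmoid, "tanh": nn.Tanh}
    layers, in_dim = [], 28 * 28                 # MNIST images, flattened
    for _ in range(n_layers):
        linear = nn.Linear(in_dim, width)
        if init == "kaiming":
            nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        else:                                    # "xavier"
            nn.init.xavier_normal_(linear.weight)
        nn.init.zeros_(linear.bias)
        layers += [linear, acts[activation]()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 10))
    return nn.Sequential(*layers)

def train_one(activation, init, lr, steps=200):
    torch.manual_seed(0)
    model = make_deep_mlp(n_layers=50, width=128, activation=activation, init=init)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        # synthetic MNIST-shaped batch; swap in a real DataLoader for the actual study
        x = torch.rand(64, 28 * 28)
        y = torch.randint(0, 10, (64,))
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

if __name__ == "__main__":
    for act in ("relu", "sigmoid", "tanh"):
        print(act, train_one(act, init="kaiming", lr=0.001))
```

Replacing the synthetic batches with real MNIST or COVID-19 data loaders, sweeping the learning rate and initialization over the values in Table 2, and substituting a ResNet-50/ResNet-101 backbone for `make_deep_mlp` would reproduce the kind of comparison summarized in Table 3 and Table 4.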
4. Concluding Remark
5. Mathematical Appendix
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| AI | Artificial Intelligence |
| ANNs | Artificial Neural Networks |
| BP | Backpropagation |
| BPNN | Backpropagation Neural Networks |
| CNN | Convolutional Neural Network |
| CVPR | Conference on Computer Vision and Pattern Recognition |
| DCNN | Deep Convolutional Neural Network |
| DL | Deep Learning |
| GELU | Gaussian Error Linear Unit |
| ML | Machine Learning |
| NLP | Natural Language Processing |
| ResNet | Residual Network |
| ReLU | Rectified Linear Unit |
| ULR | Unidirectional Linear Response |
Table 1. Usage counts of common activation functions.
| Rank | Function | Usage Count |
|---|---|---|
| 1 | ReLU | 15.3 M |
| 2 | SoftMax | 3.7 M |
| 3 | Tanh | 3.1 M |
| 4 | Sigmoid | 2.4 M |
| 5 | GELU | 1.8 M |
| 6 | Swish | 717 k |
| 7 | Leaky ReLU | 635 k |
| 8 | Softplus | 253 k |
Table 2. Experimental configurations.
| No. | Layers | Initialization | Learning Rate | Dataset |
|---|---|---|---|---|
| 1 | 50 | Kaiming normal | 0.01, 0.001, 0.0001 | MNIST |
| 2 | 50 | Kaiming normal & Xavier | 0.001 | COVID-19 & MNIST |
| 3 | 101 | Kaiming normal & Xavier | 0.001 | COVID-19 & MNIST |
Table 3. Accuracy of the 50-layer network with each activation function.
| No. | Function | Accuracy (%) |
|---|---|---|
| 1 | ReLU | 81.36 |
| 2 | Sigmoid | 80.95 |
| 3 | Tanh | 81.28 |
Table 4. Accuracy of the 101-layer network with each activation function.
| No. | Function | Accuracy (%) |
|---|---|---|
| 1 | ReLU | 73.16 |
| 2 | Sigmoid | 67.18 |
| 3 | Tanh | 62.85 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.