A Novel Adaptive Activation Function for Convolutional Neural Networks: The Parametric Arctangent Unit (PATU)

Liao, Xuanzhi; Sahran, Shahnorbanun; Abdullah, Azizi; Shukor, Syaimak Abdul; Deriche, Mohamed

doi:10.3390/sym18060971

by Xuanzhi Liao ^1,2,3, Shahnorbanun Sahran ^1,*, Azizi Abdullah ^1, Syaimak Abdul Shukor ^1 and Mohamed Deriche ^4

^1 Center for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Selangor, Malaysia

^2 Department of Computer Science, School of Medical Technology and Artificial Intelligence, Youjiang Medical University for Nationalities, Baise 533000, China

^3 Guangxi Key Laboratory of Artificial Intelligence for Genetic Diseases of Long-Dwelling Nationalities, Baise 533000, China

^4 Artificial Intelligence Research Centre, College of Engineering and Information Technology, Ajman University, Ajman P.O. Box 346, United Arab Emirates

^* Author to whom correspondence should be addressed.

Symmetry 2026, 18(6), 971; https://doi.org/10.3390/sym18060971

Submission received: 23 April 2026 / Revised: 26 May 2026 / Accepted: 2 June 2026 / Published: 4 June 2026

(This article belongs to the Special Issue Asymmetry and Symmetry in Computer Vision and Pattern Recognition)

Abstract

The activation function is a critical component of convolutional neural networks, providing the nonlinear mapping between inputs and outputs. The rectified linear unit (ReLU) is currently the baseline activation function in the deep learning community; however, it has several drawbacks that make training inefficient. Its zero-valued negative part treats negative inputs as insignificant for learning, which results in bias shift in network layers, and its multilinear (piecewise-linear) structure restricts the nonlinear approximation power of networks. To address these limitations, this study proposes the parametric arctangent unit (PATU), which integrates advantageous features of ReLU, parametric ReLU (PReLU), exponential linear unit (ELU) and Swish. Unlike PReLU, which employs a learnable linear slope for negative inputs and thus remains piecewise linear, PATU uses a nonlinear arctangent activation in the negative part. Specifically, PATU retains ReLU’s advantageous positive part, while the negative part employs the arctangent function to activate negative inputs, and a trainable parameter controls the saturation of the negative part. The study compared the proposed PATU to four state-of-the-art activation functions, namely ReLU, PReLU, ELU and Swish, on the CIFAR10 and CIFAR100 datasets using the SmallNet, Network in Network (NIN), and ResNet18 convolutional neural network architectures. Experimental results demonstrate that PATU outperformed ReLU in terms of accuracy on the CIFAR10 and CIFAR100 datasets by 0.45% and 1.44%, respectively, for the SmallNet architecture; 1.87% and 3.11%, respectively, for the NIN architecture; and 1.19% and 2.92%, respectively, for the ResNet18 architecture. All experiments were repeated five times, and the improvements were statistically significant (Friedman test, p = 0.015). PATU achieved the best (lowest) mean rank of 1.16 in the Friedman test for classification accuracy, ahead of the other state-of-the-art methods. These results suggest that the proposed PATU is an effective alternative to ReLU.

Keywords:

convolutional neural network; activation function; rectified linear; gradient vanishing; dying neuron; parametric arctangent unit; integrate

1. Introduction

Convolutional neural networks (CNNs) have achieved strong results in various computer vision tasks, e.g., image classification [1,2] and object detection [3,4]. A CNN implements multiple functions to simulate how the human brain works [5]. The activation function, which is a critical component of a CNN, provides nonlinear mappings such that the CNN model can learn more complex representations [6,7]. These nonlinearities are important because they keep the multi-layer network from collapsing into a linear system. This gives the network the expressive ability it needs to build hierarchical features and model complicated, real-world data distributions that are naturally non-linear [8].

Natural images often contain objects that are invariant or equivariant with respect to specific symmetry groups. An odd-symmetric activation function keeps the sign structure of features intact, which is desirable for learning such symmetric patterns. One milestone CNN model, LeNet5, adopts the hyperbolic tangent function (Tanh) as its activation function and achieves notable results in image classification. However, the Tanh function also leads to the vanishing gradient problem. The saturating regions of Tanh tend to push the gradients to zero, making it challenging to update the parameters effectively [9].

The rectified linear unit (ReLU) activation function [10], originally proposed in 2010 by Nair and Hinton, is one of the innovations that led the deep learning revolution [11]. Its fully asymmetric design mitigates the vanishing gradient issue, and it is simple and efficient.

Unlike the hyperbolic tangent function, ReLU is a linear mapping in the positive part, which mitigates gradient vanishing there, and it always outputs zero in the negative part. In addition, the zero-valued negative part yields sparse activations; thus, the model can be trained quickly. Although ReLU has long been the baseline in various CNN models [12,13], it has several limitations that cannot be ignored. For example, ReLU converts all negative inputs to zero, so its outputs are always non-negative, which leads to a bias shift in the network layers [14]. Moreover, negative inputs result in zero gradients, a situation called the “dying neuron” problem. Specifically, when a neuron consistently receives negative inputs, its output becomes zero, and the gradient with respect to its parameters also becomes zero. Consequently, the neuron’s weights are never updated, and the neuron remains inactive forever [15]. This problem is particularly severe when a large fraction of neurons die, as it effectively reduces the network’s representational capacity and can stall the training process [16]. In addition, ReLU is a predefined activation function; thus, it cannot adapt to the inputs [17]. Another shortcoming is that ReLU is a multilinear (piecewise-linear) activation function, i.e., it has limited non-linearity [7,18].

Several activation functions, e.g., LeakyReLU [19], parametric rectified linear unit (PReLU) [20], exponential linear unit (ELU) [21], scaled exponential linear unit (SELU) [22], and Swish [23], have been proposed to address the problems with ReLU. These activation functions remain fundamentally asymmetric, but they incorporate varying degrees of approximate symmetry (e.g., by preserving a non-zero but suppressed response in the negative domain). They thus strike a compromise between the efficiency of asymmetric designs and the good feature-preserving properties of symmetric ones.

LeakyReLU and PReLU share the same shape, where the positive part is linear, and a slight slope is assigned to the negative part so that negative inputs can be activated. The main difference is that PReLU uses a trainable parameter (rather than a fixed parameter) to adjust the slope of the negative part adaptively. This parametric strategy avoids the need to manually set the slope parameter of LeakyReLU and allows PReLU to provide more non-linearity for the CNN model. Previous studies have demonstrated experimentally that a trainable slope parameter yields better performance than a fixed slope parameter [24,25].

ELU was proposed to address the dying neuron problem, and it ensures a noise-robust deactivation state, which enhances model robustness. Previous studies have shown that ELU outperforms ReLU [26,27]. SELU further improved ELU to be self-normalizing. However, SELU does not produce results that are comparable to those of ELU and, in some cases, can fail to converge. To the best of our knowledge, only a few studies have investigated the use of SELU in CNN models [28].

Swish is another recently proposed activation function, defined as the product of the input value and its sigmoid. Swish is smooth, non-monotonic, bounded below, and unbounded above. However, Egar et al. [29] have confirmed the unstable nature of the Swish activation function, which they attribute to the gate function transformation.

Previous studies have focused on addressing specific drawbacks of ReLU individually. For instance, PReLU introduces a learnable linear slope for negative inputs to mitigate the dying neuron problem; ELU employs an exponential negative part to improve robustness and provide a deactivation state; and Swish incorporates non-monotonicity via a self-gating mechanism. Each of these works resolves one or two particular issues of ReLU but does not simultaneously integrate multiple beneficial properties into a single activation function. In other words, limited attention has been given to integrating the beneficial properties of various activation functions to achieve better overall performance. This paper proposes PATU, which aims to preserve the advantages of widely used activation functions such as ReLU, PReLU, ELU and Swish.

The primary contributions are summarized as follows:

1. Motivated by previous studies [19,20,21,23], this paper investigates the effective characteristics of existing activation functions.
2. The parametric arctangent unit (PATU) activation function is proposed. PATU retains the advantages of ReLU in the positive part, employs the arctangent function to activate negative inputs in the negative part, and uses a trainable parameter to control the saturation of the negative part.
3. The proposed PATU is evaluated on the CIFAR10 and CIFAR100 [30] datasets using three CNN architectures, i.e., SmallNet [31], the Network in Network (NIN) [32], and the Residual Network (ResNet) [33], which cover a variety of depths and filter structures. The experimental results demonstrate that the proposed PATU yields improvements over existing activation functions.

The remainder of this paper is organized as follows. In Section 2, common activation functions are reviewed, including predefined and parametric activation functions. Section 3 summarizes the properties of an effective activation function. In Section 4, we introduce the proposed PATU, and in Section 5, the paper discusses a series of experiments conducted to verify the effectiveness of the method. The experimental results are discussed in Section 6. Finally, the paper is concluded in Section 8.

2. Related Works

The activation function is an essential element that provides the non-linearity deep learning models require to learn complex representations, and it affects the performance of CNN models. Over the past few years, many activation functions have been proposed to improve the performance of CNN models in various applications. Generally, activation functions can be classified into two categories, i.e., predefined and parametric activation functions. Here, we present a summary of common state-of-the-art activations, including ReLU, PReLU, ELU, and Swish. More details can be found in Section 2.1 and Section 2.2.

2.1. Predefined Activation Function

A predefined activation function is also known as a fixed activation function, where the shape of the function is retained before and after training. In the following, we describe ReLU, ELU, and Swish, which are common predefined activation functions.

2.1.1. ReLU

To overcome the gradient vanishing problem that occurs with the sigmoid and hyperbolic tangent functions, Nair et al. [10] proposed the ReLU, which is a primary factor in the recent revival of CNNs [34]. ReLU has become the baseline activation function in various CNN models. ReLU is defined in Equation (1), and its shape is shown in Figure 1.

\[ \mathrm{ReLU}(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{if } x \leq 0 \end{cases} \tag{1} \]

ReLU employs a linear mapping in the positive part and discards the negative part. The linear mapping in the positive part alleviates the gradient vanishing problem, makes training feasible, and reduces computational costs. However, ReLU assigns a zero value to all negative inputs, which prevents the propagation of negative information and causes specific neurons to never be activated during training because the derivative in the negative region is always zero.
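To make the zero-gradient behavior concrete, the following is a minimal scalar sketch of Equation (1) and its derivative in plain Python. The function names and sample inputs are illustrative assumptions, not code from the paper, and the derivative at x = 0 is taken as 0 by convention.

```python
def relu(x: float) -> float:
    """ReLU as in Equation (1): identity for x > 0, zero otherwise."""
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    """Derivative of ReLU (value at x = 0 set to 0 by convention)."""
    return 1.0 if x > 0 else 0.0


# Hypothetical pre-activations of a single neuron.
pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in pre_activations]
grads = [relu_grad(x) for x in pre_activations]

for x, y, g in zip(pre_activations, outputs, grads):
    # For every non-positive input both the output and the gradient are
    # zero, so no error signal flows back through the neuron.
    print(f"x={x:+.1f}  relu={y:.1f}  grad={g:.1f}")
```

A neuron whose pre-activations stay in the non-positive range for all training samples therefore receives no weight updates through this path, which is the dying neuron situation described above.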

2.1.2. ELU

Another activation function with a non-zero negative part is the ELU, which employs an exponential function in the negative part to push the mean of activations closer to zero. This property improves generalization and accelerates learning by bringing the normal gradient closer to the unit’s natural gradient because of a reduced bias shift effect [21,35]. In addition, the ELU activation function exhibits a saturation plateau in the negative part, which enables learning more robust and stable representations. However, the ELU activation function uses an exponential unit, which is computationally heavy. Moreover, the negative part still carries a gradient vanishing risk in its saturating region, especially when the inputs are strongly negative. The ELU activation function is defined in Equation (2), and related plots are shown in Figure 2.

\[ \mathrm{ELU}(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha \cdot (e^{x} - 1), & \text{if } x \leq 0 \end{cases} \tag{2} \]
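As a rough illustration of the saturation plateau and the associated gradient vanishing risk, the sketch below evaluates Equation (2) and its derivative at a few negative inputs. The default α = 1.0 and the sample points are assumptions chosen for illustration only.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU as in Equation (2); alpha = 1.0 is an assumed default."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of ELU with respect to x."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-0.5, -2.0, -10.0):
    # The output approaches -alpha while the derivative decays
    # exponentially towards zero as x becomes more negative.
    print(f"x={x:+.1f}  elu={elu(x):+.6f}  grad={elu_grad(x):.6f}")
```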

2.1.3. Swish

Ramachandran et al. [23] proposed Swish, which is a non-piecewise activation function based on the sigmoid function. Swish multiplies the input by its own sigmoid (self-gating); it is smooth, non-monotonic, unbounded in the positive part, and bounded in the negative part. This approach demonstrates the feasibility of a smooth activation function when training a CNN model after the long abandonment of sigmoidal activation functions [36]. However, the performance of Swish is inconsistent [37,38,39,40]. This might be because the positive part is not exactly linear, which disturbs the original distribution of the input data [41]. The Swish activation function is defined in Equation (3), and corresponding plots are shown in Figure 3.

\[ \mathrm{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} \tag{3} \]
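The non-monotonic and bounded-below behavior of Equation (3) can be checked numerically. The following plain-Python sketch uses an arbitrarily chosen grid on [-10, 0]; the grid and helper names are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float) -> float:
    """Swish as in Equation (3): the input gated by its own sigmoid."""
    return x * sigmoid(x)


# Non-monotonic: the output first decreases and then rises back towards
# zero as x moves from -0.5 to -5.
for x in (-0.5, -1.0, -5.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}")

# Bounded below: the minimum over a fine grid of negative inputs.
grid = [i / 100.0 for i in range(-1000, 1)]
lowest = min(swish(x) for x in grid)
print(f"minimum on the grid: {lowest:+.4f}")
```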

2.2. Parametric Activation Function

A parametric activation function is also known as a trainable activation function. Its parameters are learned jointly with the weights and biases of the network. Unlike predefined activation functions, the shape of a parametric activation function changes during training. The typical representative is the PReLU.

2.2.1. PReLU

To address the sensitivity of LeakyReLU to its slope parameter, He et al. [20] proposed a parametric version of the leaky ReLU (PReLU). PReLU is generally the same as LeakyReLU; however, the slope is a trainable parameter. Thus, rather than employing a fixed parameter, PReLU can adjust the slope adaptively, which introduces more non-linearity to the model. However, PReLU adds extra parameters, and its linear negative part is less smooth than that of ELU; moreover, if its learnable slope approaches 1, it degenerates to a purely linear function, losing non-linearity. Here, the initial slope parameter is set to 0.25 [20]. PReLU shares its definition and shape with LeakyReLU; these are shown in Equation (4) and Figure 4.

\[ \mathrm{PReLU}(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha \cdot x, & \text{if } x \leq 0 \end{cases} \tag{4} \]
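A short sketch of Equation (4) illustrates both the initial slope of 0.25 and the degenerate case in which the slope reaches 1. The scalar helper below is an illustrative assumption; in a real network α would be a trainable tensor updated by the optimizer.

```python
def prelu(x: float, alpha: float = 0.25) -> float:
    """PReLU as in Equation (4); 0.25 is the initial slope from [20]."""
    return x if x > 0 else alpha * x


inputs = [-4.0, -2.0, 0.0, 3.0]

# With the initial slope, negative inputs are scaled down but not discarded.
initial = [prelu(x) for x in inputs]

# If the learned slope reaches 1, the function is the identity everywhere
# and no longer contributes any non-linearity.
degenerate = [prelu(x, alpha=1.0) for x in inputs]

print(initial)
print(degenerate)
```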

3. Properties of Enhanced Activation Function

Previous studies limited their focus to addressing the inherent problems with ReLU. As a result, limited attention has been given to integrating the effective properties of activation functions in order to realize better results. Here, we summarize characteristics a beneficial activation function should possess according to the literature. Four criteria are discussed in the following subsections.

3.1. Mean Activation

Neurons that have a non-zero mean activation act as a bias for the next layer. If such neurons do not cancel each other out, learning causes a bias shift for units in the next layer. The more the neurons are correlated, the larger the bias shift. Reducing the bias shift brings the standard gradient closer to the natural gradient and speeds up learning [21]. As mentioned previously, most activation functions proposed after ReLU allow negative outputs. This type of activation function relieves the neuron death problem, and negative inputs are permitted to carry information during propagation [19,20,21,23]. In addition, negative outputs push the average activation closer to zero. This attribute is significant for CNN models because it reduces the bias offset in the network layers, in a manner similar to normalizing the input values.
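As a rough numerical illustration of this point (not an experiment from the paper), the sketch below compares the mean output of ReLU with that of an activation whose negative part is a scaled arctangent, over a zero-mean grid of inputs. The grid on [-3, 3] and the scale 0.3 are assumptions chosen for illustration.

```python
import math


def relu(x: float) -> float:
    """ReLU: zero output for all non-positive inputs."""
    return x if x > 0 else 0.0


def scaled_arctan_unit(x: float, alpha: float = 0.3) -> float:
    """Identity for x > 0, alpha * arctan(x) otherwise (alpha is assumed)."""
    return x if x > 0 else alpha * math.atan(x)


# Symmetric grid of inputs, so the inputs themselves have zero mean.
inputs = [i / 100.0 for i in range(-300, 301)]

mean_relu = sum(relu(x) for x in inputs) / len(inputs)
mean_patu = sum(scaled_arctan_unit(x) for x in inputs) / len(inputs)

# Negative outputs partially offset the positive ones, so the second mean
# is closer to zero than the ReLU mean.
print(f"mean ReLU output:          {mean_relu:.4f}")
print(f"mean scaled-arctan output: {mean_patu:.4f}")
```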

3.2. Adaptability

Adaptability is also referred to as parameterization or trainability. As in the PReLU activation function, a trainable parameter is introduced to control the shape of the negative part in an adaptive manner. This strategy allows different layers to respond to different activation states, which provides greater flexibility and enhances the network’s nonlinear capabilities [20,36].

3.3. Local Non-Linearity

The ReLU, LeakyReLU, and PReLU activation functions are multilinear (piecewise linear), whereas ELU and Swish contain local nonlinear regions. Previous studies have demonstrated that locally nonlinear activation functions perform better than multilinear activation functions in different network architectures for image classification tasks [18,23].

3.4. Noise-Robust Deactivation State

The ELU and Swish activation functions possess a deactivation state in the negative part. Both ensure that small negative values (those close to zero) can be learned properly because the corresponding derivatives are relatively large.

In addition, the negative part soft-saturates gradually because the derivative tends to zero, which reduces the variation of the information propagated to the next layer. This ensures a noise-robust deactivation state; thus, model robustness is enhanced [21,23,42,43].

4. The Proposed PATU Activation Function

In this section, we introduce the proposed PATU activation function, which was inspired by our investigation of the effective properties of existing activation functions. The proposed PATU activation function is defined in Equation (5).

\[ \mathrm{PATU}(x) = \begin{cases} x, & \text{if } x > 0 \\ \arctan(x) \cdot \alpha, & \text{if } x \leq 0 \end{cases} \tag{5} \]

Here, x represents the inputs from the previous layer, PATU(x) denotes the transformation, and α is a trainable parameter that is learned adaptively during training. α is constrained to be non-negative in order to preserve the behavior of the negative region. The positive part is identical to that of ReLU, which has demonstrated its superiority over traditional activations and is the suggested baseline activation function for CNN models [44,45]. In addition, the function value and gradient in the positive part are trivial to compute, so the forward and backpropagation steps can be processed quickly [46]. In the proposed PATU, we employ the arctangent function in the negative part so that negative information can be learned. This paper adopts the arctangent function for the negative part because it has higher gradients for small negative inputs, i.e., negative inputs close to zero. According to the chain rule [47], the derivative of the activation function contributes to updating the weights and bias: a small derivative causes the gradient vanishing problem, whereas a large derivative helps update parameters properly. In addition, the arctangent function realizes noise-robust deactivation because its derivative tends to zero gradually as inputs become more negative, which reduces the variation of the information to be propagated. Inspired by PReLU, a trainable parameter is introduced, which acts as a gate to control the saturation of the arctangent. Thus, different network layers can yield different activation responses, and a suitable gate parameter is learned during training. Unlike PReLU, PATU generates a deactivation state in the negative part, which enhances the model’s robustness. Figure 5 shows the shape of the proposed PATU activation function and its derivative.
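Equation (5) can be sketched as a scalar function in plain Python as follows. This is an illustrative sketch, not the authors' Keras implementation: the function name, the error raised for a negative α, and the default α = 0.3 (the initial value selected later in Section 5.2) are assumptions, and in a network α would be a trainable, non-negative variable.

```python
import math


def patu(x: float, alpha: float = 0.3) -> float:
    """Scalar PATU as in Equation (5).

    alpha is trainable in the paper; here it is a plain argument, and the
    non-negativity constraint is enforced with an explicit check.
    """
    if alpha < 0:
        raise ValueError("alpha is constrained to be non-negative")
    return x if x > 0 else alpha * math.atan(x)


# Positive inputs pass through unchanged, as with ReLU.
print(patu(2.0))

# Negative inputs are mapped through the scaled arctangent and saturate
# towards -alpha * pi / 2 for strongly negative inputs.
for x in (-0.5, -2.0, -50.0):
    print(f"x={x:+.1f}  patu={patu(x):+.4f}")

# With alpha = 0 the negative part is flat, which recovers ReLU.
print(patu(-2.0, alpha=0.0) == 0.0)
```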

CNN parameters, e.g., weights and biases, are updated via backpropagation; the derivative of the activation function contributes to the parameter update process, and the model is optimized as a result. The derivative of the activation function can be obtained easily according to the chain rule [47]. The derivative of the proposed PATU with respect to x is defined in Equation (6), the gradient of the activation function with respect to α is given in Equation (7), and the shape of the derivative is shown in Figure 5.

\[ \mathrm{PATU}'(x) = \begin{cases} 1, & \text{if } x > 0 \\ \frac{\alpha}{x^{2} + 1}, & \text{if } x \leq 0 \end{cases} \tag{6} \]

\[ \frac{\partial\, \mathrm{PATU}(x)}{\partial \alpha} = \begin{cases} 0, & \text{if } x > 0 \\ \arctan(x), & \text{if } x \leq 0 \end{cases} \tag{7} \]
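The two gradients in Equations (6) and (7) can be sanity-checked against central finite differences. The sketch below is an illustrative check under assumed values (α = 0.3, three sample inputs away from the kink at x = 0), not part of the paper's experiments.

```python
import math


def patu(x: float, alpha: float) -> float:
    """Scalar PATU, Equation (5)."""
    return x if x > 0 else alpha * math.atan(x)


def patu_grad_x(x: float, alpha: float) -> float:
    """Derivative with respect to the input, Equation (6)."""
    return 1.0 if x > 0 else alpha / (x * x + 1.0)


def patu_grad_alpha(x: float) -> float:
    """Derivative with respect to alpha, Equation (7)."""
    return 0.0 if x > 0 else math.atan(x)


def central_diff(f, v: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(v)."""
    return (f(v + h) - f(v - h)) / (2.0 * h)


alpha = 0.3  # assumed value for the check
checks = []
for x in (-3.0, -0.5, 1.5):
    num_dx = central_diff(lambda v: patu(v, alpha), x)
    num_da = central_diff(lambda a: patu(x, a), alpha)
    checks.append((x, patu_grad_x(x, alpha), num_dx, patu_grad_alpha(x), num_da))

for x, dx, num_dx, da, num_da in checks:
    print(f"x={x:+.1f}  d/dx: {dx:.6f} vs {num_dx:.6f}  d/dalpha: {da:.6f} vs {num_da:.6f}")
```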

PATU integrates the previously identified effective properties of existing activation functions described in Section 3. The advantages of the proposed PATU are summarized as follows:

1. The proposed PATU exploits the advantages of ReLU in the positive part, and the arctangent function is employed in the negative part; thus, negative activations push the mean activation closer to zero, which reduces the effect of bias shift.
2. The negative part employs the arctangent function to ensure a higher derivative for small negative inputs. A higher derivative relieves gradient vanishing problems; thus, the weights and bias can be updated properly. In addition, the derivative of the arctangent function decreases gradually and approaches zero; this soft-saturation property realizes a noise-robust deactivation state, which improves the model’s robustness.
3. A trainable parameter is implemented to control the saturation shape of the arctangent function in the negative part. As a result, negative values can be activated selectively and adaptively. The parameterization feature enhances the flexibility and non-linearity of the model.
4. The proposed PATU activation function is a locally nonlinear function.
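The saturation and gradient properties listed above can be made explicit with a short derivation from Equations (5) and (6). The bounds below follow from standard properties of the arctangent; the comparison with the ELU derivative is our own illustration and assumes the same scale α > 0 for both functions.

```latex
% Saturation of the negative part (x \le 0, \alpha > 0):
-\frac{\alpha\pi}{2} < \alpha\arctan(x) \le 0,
\qquad
\lim_{x\to-\infty} \alpha\arctan(x) = -\frac{\alpha\pi}{2}.

% Soft saturation of the derivative from Equation (6):
0 < \frac{\alpha}{x^{2}+1} \le \alpha,
\qquad
\lim_{x\to-\infty} \frac{\alpha}{x^{2}+1} = 0.

% Comparison with the ELU derivative \alpha e^{x} at the same \alpha.
% Let g(t) = e^{t} - 1 - t^{2}. Then g(0) = 0 and g'(t) = e^{t} - 2t > 0
% for all t, so e^{t} > 1 + t^{2} for t > 0. Setting t = -x gives
\frac{\alpha}{x^{2}+1} > \alpha e^{x} \quad \text{for all } x < 0 .
```

Under this equal-scale assumption, the arctangent branch therefore passes a larger gradient than the exponential branch at every negative input, while both derivatives still decay to zero.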

Compared to ReLU, PATU preserves the linear mapping in the positive part, which alleviates the gradient vanishing problem and passes positive inputs through unchanged; moreover, PATU can produce negative outputs to relieve the bias shift. Compared to PReLU, the negative part of PATU maintains a deactivation state that reduces the variation of the information to be propagated. Compared to ELU, a trainable parameter is introduced to control the saturation, allowing different layers to respond to different activation states. Compared to Swish, PATU passes positive inputs through unchanged.

5. Experimental Evaluations

5.1. Experimental Settings

An experiment was conducted to validate the performance of the proposed PATU against ReLU and three other state-of-the-art activation functions, namely PReLU, ELU, and Swish, in image classification tasks. The details of these baseline activation functions and the corresponding plots are presented in Section 2. All experiments were implemented using Keras with TensorFlow as the backend. In addition, a GeForce GTX 1060 6 GB GPU was utilized to accelerate computations.

We trained three CNN architectures on the CIFAR10 and CIFAR100 [30] datasets for a total of 100 epochs. The models include SmallNet [31], Network in Network [32], and ResNet18 [33], covering different depths and widths. The batch size was set to 128, and the Adam [48] optimizer was employed to train the models with the default learning rate of 0.001. HeNormal initialization was used for the weights. In the following subsections, we report the average accuracy obtained over five runs for each activation function across the different CNN architectures and datasets.

Data augmentation is a popular regularization technique used to expand the size of a dataset; however, it is difficult to isolate the impact of data augmentation from that of the methodology [49,50]. To prevent data augmentation from confounding the final results, no data augmentation technique was applied in this study; the original images were only normalized by dividing by the maximum pixel value, 255, to reduce the possibility of exploding gradients. Here, we focused on testing the performance of the activation function rather than the effect of the regularization technique.

5.2. Analysis of Parameter Initialization

To investigate the best initial value for α, we conducted a preliminary analysis using SmallNet on the CIFAR10 dataset. Intuitively, when the trainable parameter α is initialized as 0.0, the negative part of the proposed PATU is initially flat (identical to ReLU), and α is then adjusted adaptively to produce more negative activations. When α is initialized as 1, the negative part of the proposed PATU starts as the unscaled arctangent function, and α is then adjusted adaptively during training.

Thus, in this preliminary analysis, α was initialized as 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1, with a single run for each value. The results are shown in Table 1. As can be seen, the highest accuracy was obtained when α was initialized as 0.3. Thus, α was initialized as 0.3 in the subsequent experiments.

5.3. Results of SmallNet on CIFAR10 and CIFAR100

The performance of the activation functions using the SmallNet architecture on the CIFAR10 and CIFAR100 datasets is presented in Table 2, where values in bold indicate the best results. The proposed PATU activation function achieved the best results, i.e., 82.40% and 52.79% accuracy on the CIFAR10 and CIFAR100 datasets, respectively. On CIFAR10, ReLU outperformed the other activation functions (except for the proposed PATU) with an average accuracy of 81.95%, while Swish had the lowest average accuracy (81.43%); however, Swish achieved the second highest accuracy on CIFAR100 (52.00%). On CIFAR10, the PATU, ReLU, PReLU, ELU, and Swish activation functions obtained average accuracy values of 82.40%, 81.95%, 81.62%, 81.57%, and 81.43%, respectively. On CIFAR100, the PATU, Swish, ReLU, PReLU, and ELU activation functions obtained average accuracy values of 52.79%, 52.00%, 51.35%, 51.25%, and 51.20%, respectively.

As the accuracy curves in Figure 6 show, PATU outperforms the other activation functions after around 20 epochs on both CIFAR10 and CIFAR100 under the SmallNet architecture.

5.4. Results of Network in Network on CIFAR10 and CIFAR100

The performance of the activation functions with the NIN architecture on the CIFAR10 and CIFAR100 datasets is shown in Table 3, where bold values indicate the best results. As can be seen, PReLU obtained the highest average accuracy on the CIFAR10 dataset, and the proposed PATU achieved the best result on the CIFAR100 dataset. Note that the ReLU activation function failed to converge in one run on CIFAR10. This phenomenon indicates that neuron death may affect a model’s generalizability, which is consistent with the findings of previous studies [51,52]. ReLU was therefore retrained once more to obtain a total of five valid runs for further comparison. On the CIFAR100 dataset, the ELU activation function approximately converged within 10 epochs and collapsed after that, so no valid result is reported for it. On the CIFAR10 dataset, the PReLU, PATU, ELU, Swish, and ReLU activation functions obtained average accuracy values of 86.47%, 86.42%, 86.27%, 86.10%, and 84.55%, respectively. On the CIFAR100 dataset, PATU, PReLU, Swish, and ReLU obtained average accuracy values of 59.52%, 59.26%, 59.11%, and 56.64%, respectively.

Figure 7 shows the learning curves of the testing accuracy on the CIFAR10 and CIFAR100 datasets with the NIN architecture. As can be seen, ReLU exhibited consistently worse performance than the other activation functions in all epochs. The other activation functions show similar accuracy curves in early epochs, and the PReLU and PATU activation functions obtained the highest accuracy in the final epoch on the CIFAR10 and CIFAR100 datasets, respectively.

5.5. Results of ResNet18 on CIFAR10 and CIFAR100

As shown in Table 4, the proposed PATU activation function obtained the best performance on both the CIFAR10 and CIFAR100 datasets with ResNet18, and the Swish activation function achieved the second-best results. On the CIFAR10 dataset with the ResNet18 architecture, the PATU, Swish, PReLU, ReLU, and ELU activation functions obtained average accuracy values of 83.51%, 82.74%, 82.52%, 82.32%, and 81.60%, respectively. On the CIFAR100 dataset, the PATU, Swish, ELU, ReLU, and PReLU activation functions obtained average accuracy values of 54.46%, 54.27%, 52.88%, 51.54%, and 50.60%, respectively.

Figure 8 shows that the accuracy curve of the proposed PATU on CIFAR10 with the ResNet18 architecture is more variable than those of the other activation functions. However, the highest accuracy was obtained by PATU at approximately 90 epochs. As also shown in Figure 8, the accuracy curve of PReLU on the CIFAR100 dataset with ResNet18 was smoother than that of the other activation functions. Nevertheless, PReLU obtained the lowest accuracy, 50.60%, according to Table 4.

5.6. Mean Rank

Following the study by Sütfeld [53], we also evaluated the activation functions in terms of the mean rank obtained by the Friedman test, which provides additional insight into the performance of these methods with different CNN architectures and datasets. First, each activation function is ranked according to its classification accuracy on a particular task, where the best result receives rank 1; the smallest number is thus the best rank. Then, the mean rank of each activation function over all tasks is calculated. The lower the mean rank value, the better the overall performance.

In these experiments, we compared the performance of the ReLU, PReLU, ELU, Swish, and PATU activation functions in six tasks, i.e., CIFAR10 with SmallNet, CIFAR100 with SmallNet, CIFAR10 with NIN, CIFAR100 with NIN, CIFAR10 with ResNet18, and CIFAR100 with ResNet18. Taking ReLU as an example, the mean rank is determined as follows: ReLU obtained the second-best accuracy of 81.95% in the CIFAR10 with SmallNet task, and the rank of this task was marked as 2. ReLU obtained the third best accuracy of 51.35% on the CIFAR100 with SmallNet task, and the rank on this task was marked as 3. This process continued until the final task was calculated. Following this calculation method, the rank values of ReLU to the remaining four tasks were 5, 4, 4, and 4, respectively. Then, the mean rank was acquired by averaging the six rank values: (2 + 3 + 5 + 4 + 4 + 4)/6 = 3.67.
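The ranking procedure described above can be sketched in a few lines of Python. The helper names are illustrative assumptions; the accuracies are the ResNet18 CIFAR10 values reported in Section 5.5, and the list of ReLU ranks is the one given in the worked example. Ties are not handled, since none occur in these values.

```python
def rank_task(accuracies: dict) -> dict:
    """Rank activation functions on one task; rank 1 is the highest accuracy.

    Assumes there are no ties.
    """
    ordered = sorted(accuracies, key=accuracies.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def mean_rank(ranks: list) -> float:
    """Average of the per-task ranks of one activation function."""
    return sum(ranks) / len(ranks)


# Average accuracies (%) for CIFAR10 with ResNet18, as reported in Section 5.5.
resnet18_cifar10 = {
    "PATU": 83.51,
    "Swish": 82.74,
    "PReLU": 82.52,
    "ReLU": 82.32,
    "ELU": 81.60,
}
ranks = rank_task(resnet18_cifar10)
print(ranks)

# Ranks of ReLU over the six tasks, as listed in the worked example.
relu_ranks = [2, 3, 5, 4, 4, 4]
relu_mean = round(mean_rank(relu_ranks), 2)
print(relu_mean)
```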

The overall rankings for all compared activation functions are given in Table 5, where bold values indicate the best results. As can be seen, the proposed PATU achieved the best (lowest) mean rank of 1.16, and Swish and PReLU both obtained a mean rank of 3. ReLU, which outperformed the ELU activation function, obtained a mean rank of 3.67. These results suggest that the parametric activation functions generally enhanced the model’s performance compared to the predefined activation functions. Overall, the proposed PATU activation function ranked first or second with all CNN architectures on both the CIFAR10 and CIFAR100 datasets.

6. Discussion

The experimental results demonstrate that the proposed PATU activation function outperformed ReLU on all six tasks. We attribute this to the way PATU integrates the beneficial properties of existing activation functions. First, PATU inherits the positive part of the ReLU activation function, which alleviates the vanishing gradient problem. Second, negative inputs are activated through the arctangent function, which pushes mean activations toward zero. Third, the arctangent function has a larger gradient than the exponential unit in the negative part, so the weights continue to be updated for negative inputs. In addition, like that of the exponential unit, the derivative of the arctangent gradually approaches zero for strongly negative inputs, which ensures a noise-robust deactivation state and thus enhances model robustness. The proposed PATU activation function also improves the learning ability and flexibility of the CNN by introducing the trainable parameter α to control the shape of the arctangent function; thus, negative inputs can be activated selectively and adaptively. The preliminary analysis in Section 5.2 determined the most effective initial value for the trainable α, i.e., α = 0.3. Finally, the proposed PATU is a local nonlinear activation function, which gives the network greater nonlinear approximation capability and thereby improves prediction performance.
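The gradient comparison in the third point can be illustrated numerically. The exact PATU formula is defined in Section 4 and is not repeated here, so the sketch below assumes a hypothetical form: the identity for positive inputs and α·arctan(x) for non-positive inputs. The functions `patu`, `patu_grad`, `arctan_grad`, and `elu_grad` are illustrative names, not the authors' implementation.

```python
import math


def patu(x, alpha=0.3):
    """ASSUMED form of PATU: x for x > 0, alpha * arctan(x) otherwise."""
    return x if x > 0 else alpha * math.atan(x)


def patu_grad(x, alpha=0.3):
    """Derivative of the assumed form with respect to x."""
    return 1.0 if x > 0 else alpha / (1.0 + x * x)


def arctan_grad(x):
    """Derivative of the unscaled arctangent: 1 / (1 + x^2)."""
    return 1.0 / (1.0 + x * x)


def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x > 0, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


# Negative inputs from -0.25 down to -10.0.
GRID = [-0.25 * i for i in range(1, 41)]

for x in (-0.5, -1.0, -2.0, -5.0):
    print(f"x = {x:5.1f}  arctan' = {arctan_grad(x):.4f}  ELU' = {elu_grad(x):.4f}")

# The unscaled arctangent derivative stays above the ELU derivative (alpha = 1)
# on the whole negative grid, and both decay toward zero.
ARCTAN_DOMINATES = all(arctan_grad(x) > elu_grad(x) for x in GRID)
print("arctan' > ELU' on grid:", ARCTAN_DOMINATES)
```

The comparison holds for the unscaled arctangent because exp(t) > 1 + t² for every t > 0. Under the assumed α-scaled form, the size of the PATU gradient relative to ELU would also depend on the learned value of α, and the negative branch saturates at −απ/2.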

Our experimental results further suggest that the performance gain of PATU over ReLU tends to become more pronounced as network depth increases. We attribute this observation to two limitations of ReLU in deep architectures. First, deeper networks contain a larger number of parameters. Since ReLU outputs zero for any negative input, many neurons may never be activated during training, leading to the well-known dead neuron problem. These dead neurons have zero gradient and never recover, effectively reducing the model's representational capacity and hindering the fitting of complex functions. Second, ReLU outputs only non-negative values, which leads to a positive mean activation. In deep networks, this bias accumulates across layers, resulting in a progressive bias shift that slows convergence and may hurt generalization. By comparison, PATU addresses both problems through its trainable behavior on the negative half-axis. The arctangent function keeps the negative region active, so gradients continue to flow for negative inputs and dead neurons are avoided. Moreover, the trainable parameter α allows each layer to adaptively adjust its response to negative inputs, which provides layer-wise flexibility. At the same time, the saturating property of the arctangent derivative for strongly negative inputs ensures noise-robust deactivation, balancing adaptability and stability. Consequently, PATU tends to achieve larger improvements over ReLU in deeper networks, where the detrimental effects of dead neurons and bias shift are most severe.
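The two effects discussed above, zero gradient for negative inputs and a positive mean activation, can be illustrated on synthetic pre-activations. The sketch assumes standard normal inputs and a hypothetical PATU form (identity for x > 0, α·arctan(x) otherwise, with α = 0.3); neither assumption comes from the experiments in this paper.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def patu(x, alpha=0.3):
    """ASSUMED form of PATU: x for x > 0, alpha * arctan(x) otherwise."""
    return x if x > 0 else alpha * math.atan(x)


def patu_grad(x, alpha=0.3):
    return 1.0 if x > 0 else alpha / (1.0 + x * x)


random.seed(0)  # fixed seed so the illustration is repeatable
SAMPLES = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Bias shift: mean activation over zero-mean inputs.
MEAN_RELU = sum(relu(x) for x in SAMPLES) / len(SAMPLES)
MEAN_PATU = sum(patu(x) for x in SAMPLES) / len(SAMPLES)

# Dead neurons: fraction of inputs that receive exactly zero gradient.
ZERO_GRAD_RELU = sum(relu_grad(x) == 0.0 for x in SAMPLES) / len(SAMPLES)
ZERO_GRAD_PATU = sum(patu_grad(x) == 0.0 for x in SAMPLES) / len(SAMPLES)

print(f"mean activation: ReLU {MEAN_RELU:.3f}, PATU {MEAN_PATU:.3f}")
print(f"zero-gradient fraction: ReLU {ZERO_GRAD_RELU:.3f}, PATU {ZERO_GRAD_PATU:.3f}")
```

In this toy setting the assumed PATU form yields a mean activation closer to zero than ReLU and leaves no input with an exactly zero gradient, which is the mechanism the paragraph above appeals to. It does not by itself show how large the effect is in a trained network.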

Although the proposed activation function achieves accuracy improvements over ReLU, ELU, PReLU, and Swish on most tasks, it incurs increased training time due to the use of an arctangent function and an additional trainable parameter. PATU takes 8 s, 59 s, and 40 s on the SmallNet-CIFAR, NIN-CIFAR, and ResNet18-CIFAR tasks, respectively, compared with 5 s, 25 s, and 25 s for ReLU (Tables 2–4). Among all compared activations, ReLU is consistently the fastest, tied with ELU on SmallNet.
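To put the trade-off in one place, the accuracy gains and the training-time ratios of PATU relative to ReLU can be derived from Tables 2 to 4. The snippet below is a small worked calculation; the dictionary layout and the name `summarize` are illustrative choices.

```python
# (CIFAR10 accuracy %, CIFAR100 accuracy %, training time in s) from Tables 2-4.
RESULTS = {
    "SmallNet": {"ReLU": (81.95, 51.35, 5),  "PATU": (82.40, 52.79, 8)},
    "NIN":      {"ReLU": (84.55, 56.41, 25), "PATU": (86.42, 59.52, 59)},
    "ResNet18": {"ReLU": (82.32, 51.54, 25), "PATU": (83.51, 54.46, 40)},
}


def summarize(results):
    """Return PATU-minus-ReLU accuracy gains and the PATU/ReLU time ratio."""
    summary = {}
    for architecture, rows in results.items():
        c10_relu, c100_relu, time_relu = rows["ReLU"]
        c10_patu, c100_patu, time_patu = rows["PATU"]
        summary[architecture] = {
            "gain_c10": round(c10_patu - c10_relu, 2),
            "gain_c100": round(c100_patu - c100_relu, 2),
            "time_ratio": round(time_patu / time_relu, 2),
        }
    return summary


SUMMARY = summarize(RESULTS)
for architecture, row in SUMMARY.items():
    print(
        f"{architecture:9s} +{row['gain_c10']:.2f} (C10)  "
        f"+{row['gain_c100']:.2f} (C100)  time x{row['time_ratio']:.2f}"
    )
```

The gains match those stated in the abstract (0.45 and 1.44 points for SmallNet, 1.87 and 3.11 for NIN, 1.19 and 2.92 for ResNet18), while the training time grows by a factor of 1.6 for SmallNet and ResNet18 and about 2.4 for NIN.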

7. Limitations

Although our proposed PATU outperformed ReLU and other baseline activation functions, some limitations should be acknowledged:

The arctangent function and the trainable parameter in the negative part increase the computational cost. This results in a moderate increase in the training time per epoch, which might be an issue for very large-scale or time-critical applications.
Our experiments are currently restricted to small-scale datasets, i.e., CIFAR10 and CIFAR100. The performance of PATU on large-scale datasets such as ImageNet has not yet been evaluated. Moreover, the proposed method has not been validated on more challenging computer vision tasks such as object detection and image dehazing [54,55].
The initial value of the trainable parameter α in the negative part may be sensitive to the dataset and the task.
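The sensitivity noted in the last limitation can be gauged from the initialization sweep in Table 1. The snippet below only re-reads those reported accuracies; the names `ALPHA_ACCURACY`, `BEST_ALPHA`, and `SPREAD` are illustrative, and the sweep covers a single setting, so it says nothing about other datasets or tasks.

```python
# Accuracy for each initial value of alpha, copied from Table 1.
ALPHA_ACCURACY = {
    0.0: 0.8226, 0.1: 0.8231, 0.2: 0.8191, 0.3: 0.8257,
    0.4: 0.8235, 0.5: 0.8247, 0.6: 0.8190, 0.7: 0.8214,
    0.8: 0.8193, 0.9: 0.8182, 1.0: 0.8197,
}

# Initial value with the highest reported accuracy.
BEST_ALPHA = max(ALPHA_ACCURACY, key=ALPHA_ACCURACY.get)

# Gap between the best and worst initialization in the sweep.
SPREAD = max(ALPHA_ACCURACY.values()) - min(ALPHA_ACCURACY.values())

print(f"best initial alpha = {BEST_ALPHA}")
print(f"accuracy spread across the sweep = {100 * SPREAD:.2f} percentage points")
```

Within this sweep the best initial value is α = 0.3, and the gap between the best and worst initialization is 0.75 percentage points, which is comparable in size to some of the accuracy differences between activation functions in Table 2.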

8. Conclusions

In this paper, we proposed the PATU activation function, which integrates the beneficial properties of existing state-of-the-art activation functions by combining the ReLU and arctangent functions through a parametric strategy. As a result, the proposed PATU introduces better flexibility, stability, expressiveness, and robustness to improve model performance.

The proposed PATU activation function was evaluated experimentally and compared with the ReLU, PReLU, ELU and Swish activation functions. The results demonstrated that the proposed PATU achieved stable performance and outperformed ReLU with three different CNN architectures. Specifically, compared to ReLU, the proposed PATU improved accuracy on the CIFAR10 and CIFAR100 datasets by 0.45% and 1.44% with SmallNet, 1.87% and 3.11% with NIN, and 1.19% and 2.92% with ResNet18, respectively. In addition, the proposed PATU obtained the best (lowest) mean rank of 1.16 (the Friedman mean rank averages the performance ranks across all datasets and models; a lower value indicates better overall performance), thereby outperforming existing state-of-the-art activation functions. Overall, we believe that these results position the proposed PATU as an effective alternative to the ReLU activation function. Additionally, our findings suggest that approximate symmetry properties may better align with inherent symmetries of visual data, and can thus outperform purely asymmetric properties in image classification tasks. In future work, we will evaluate PATU on larger-scale datasets and additional computer vision tasks.

Author Contributions

Conceptualization, X.L.; methodology, X.L.; software, X.L.; validation, X.L., S.S., A.A., M.D. and S.A.S.; formal analysis, X.L., S.S., A.A., M.D. and S.A.S.; investigation, X.L., S.S., A.A., M.D. and S.A.S.; resources, X.L., S.S., A.A., M.D. and S.A.S.; data curation, X.L., S.S., A.A., M.D. and S.A.S.; writing-original draft preparation, X.L. and S.S.; writing-review and editing, S.S., A.A., M.D. and S.A.S.; supervision, S.S., A.A. and S.A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Universiti Kebangsaan Malaysia (UKM) under grant PP-FTSM-2024.

Data Availability Statement

The datasets used in this study are available at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 17 February 2026).

Acknowledgments

We would like to thank the team members and our collaborator from Ajman University, UAE, Mohamed Deriche, for his valuable comments on the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Abdullah, A.; Arianti, D.; Sahran, S. Iterative Ensemble Threshold Selection in Branch CNNs for Efficient Image Classification. In 2025 International Conference on Intelligent Systems: Theories and Applications (SITA); IEEE: New York, NY, USA, 2025; pp. 1–8.
2. Abdullah, A.; Wong, W.S.; Albashish, D. EB-CNN: Ensemble of branch convolutional neural network for image classification. Pattern Recognit. Lett. 2025, 189, 1–7.
3. Su, Z.; Adam, A.; Nasrudin, M.F.; Prabuwono, A.S. Proposal-free fully convolutional network: Object detection based on a box map. Sensors 2024, 24, 3529.
4. Oday, A.; Azizi, A.; Sahran, S. YOLO-OSAM: Reassembly Spatial Attention Mechanisms for Facial Expression Recognition. Trait. Signal 2025, 42, 2379.
5. Hassabis, D.; Kumaran, D.; Summerfield, C.; Botvinick, M. Neuroscience-inspired artificial intelligence. Neuron 2017, 95, 245–258.
6. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
7. Dubey, S.R.; Singh, S.K.; Chaudhuri, B.B. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing 2022, 503, 92–108.
8. Duan, B.; Yang, Y.; Dai, X. Activation by Switch Unit of Opposite First Powers. In 2022 IEEE 8th International Conference on Computer and Communications (ICCC); IEEE: New York, NY, USA, 2022; pp. 1431–1439.
9. Rasamoelina, A.D.; Adjailia, F.; Sinčák, P. A review of activation function for artificial neural network. In 2020 IEEE 18th World Symposium on Applied Machine Intelligence and Informatics (SAMI); IEEE: New York, NY, USA, 2020; pp. 281–286.
10. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
11. Nwankpa, C.; Ijomah, W.; Gachagan, A.; Marshall, S. Activation functions: Comparison of trends in practice and research for deep learning. arXiv 2018, arXiv:1811.03378.
12. Wang, S.H.; Sakk, E. The effect of activation function choice on the performance of convolutional neural networks. J. Emerg. Investig. 2023, 6, 1–9.
13. Budiman, N.A.; Adi, K.; Wibowo, A. Impact of Activation Function on the Performance of Convolutional Neural Network in Identifying Oil Palm Fruit Ripeness. Int. J. Math. Comput. Res. 2025, 13, 5107–5113.
14. Liu, Y.; Zhang, J.; Gao, C.; Qu, J.; Ji, L. Natural-Logarithm-Rectified Activation Function in Convolutional Neural Networks. In Proceedings of the 2019 IEEE 5th International Conference on Computer and Communications (ICCC), Chengdu, China, 6–9 December 2019; pp. 2000–2008.
15. Douglas, S.; Yu, J. Why RELU Units Sometimes Die: Analysis of Single-Unit Error Backpropagation in Neural Networks. In Proceedings of the 2018 52nd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 28–31 October 2018; pp. 864–868.
16. Luo, G.; Wang, X.; Zhao, W.; Tao, S.; Tang, Z. ReLU Neural Networks and Their Training. Mathematics 2025, 14, 39.
17. Pusztaházi, L.S.; Eigner, G.; Csiszár, O. Parametric activation functions for neural networks: A tutorial survey. IEEE Access 2024, 12, 168626–168644.
18. Ohn, I.; Kim, Y. Smooth function approximation by deep neural networks with general activation functions. Entropy 2019, 21, 627.
19. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. Proc. ICML 2013, 30, 3.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
21. Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). arXiv 2015, arXiv:1511.07289.
22. Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-normalizing neural networks. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; p. 30.
23. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941.
24. Florek, D.; Miłosz, M. Comparison of an effectiveness of artificial neural networks for various activation functions. J. Comput. Sci. Inst. 2023, 26, 7–12.
25. Jiang, T.; Cheng, J. Target recognition based on CNN with LeakyReLU and PReLU activation functions. In 2019 International Conference on Sensing, Diagnostics, Prognostics, and Control (SDPC); IEEE: New York, NY, USA, 2019; pp. 718–722.
26. Shah, A.; Kadam, E.; Shah, H.; Shinde, S.; Shingade, S. Deep residual networks with exponential linear unit. In Proceedings of the Third International Symposium on Computer Vision and the Internet, Florence, Italy, 21–24 September 2016; pp. 59–65.
27. Wang, T.; Qin, Z.; Zhu, M. An ELU network with total variation for image denoising. In Proceedings of the International Conference on Neural Information Processing; Springer International Publishing: Cham, Switzerland, 2017; pp. 227–237.
28. Ying, Y.; Su, J.; Shan, P.; Miao, L.; Wang, X.; Peng, S. Rectified exponential units for convolutional neural networks. IEEE Access 2019, 7, 101633–101640.
29. Eger, S.; Youssef, P.; Gurevych, I. Is it time to swish? Comparing deep learning activation functions across NLP tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4415–4424.
30. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009.
31. Qiu, S.; Xu, X.; Cai, B. FReLU: Flexible rectified linear units for improving convolutional neural networks. In 2018 24th International Conference on Pattern Recognition (ICPR); IEEE: New York, NY, USA, 2018; pp. 1223–1228.
32. Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv 2013, arXiv:1312.4400.
33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
34. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
35. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
36. Hock, H.C.; Wahid, N.; Ong, P. Parametric flatten-t swish: An adaptive nonlinear activation function for deep learning. J. Inf. Commun. Technol. (JICT) 2021, 20, 21–39.
37. Jinsakul, N.; Tsai, C.F.; Tsai, C.E.; Wu, P. Enhancement of deep learning in image classification performance using xception with the swish activation function for colorectal polyp preliminary screening. Mathematics 2019, 7, 1170.
38. Szandała, T. Review and comparison of commonly used activation functions for deep neural networks. In Bio-Inspired Neurocomputing; Springer: Singapore, 2020; pp. 203–224.
39. Szandała, T. Benchmarking comparison of swish vs. other activation functions on CIFAR-10 imageset. In International Conference on Dependability and Complex Systems; Springer International Publishing: Cham, Switzerland, 2019; pp. 498–505.
40. Tripathi, G.C.; Rawat, M.; Rawat, K. Swish activation based deep neural network predistorter for RF-PA. In TENCON 2019–2019 IEEE Region 10 Conference (TENCON); IEEE: New York, NY, USA, 2019; pp. 1239–1242.
41. Liu, X.; Di, X. TanhExp: A Smooth Activation Function with High Convergence Speed for Lightweight Neural Networks. arXiv 2020, arXiv:2003.09855.
42. Hochreiter, S.; Schmidhuber, J. Feature extraction through LOCOCODE. Neural Comput. 1999, 11, 679–714.
43. Wang, Y.; Li, Y.; Song, Y.; Rong, X. The influence of the activation function in a convolution neural network model of facial expression recognition. Appl. Sci. 2020, 10, 1897.
44. Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Miller, J. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
45. Liao, X.; Sahran, S.; Abdullah, A.; Shukor, S.A. Adacb: An adaptive gradient method with convergence range bound of learning rate. Appl. Sci. 2022, 12, 9389.
46. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
47. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
48. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
49. Hou, S.; Liu, X.; Wang, Z. Dualnet: Learn complementary features for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 502–510.
50. Murthy, V.N.; Singh, V.; Chen, T.; Manmatha, R.; Comaniciu, D. Deep decision network for multi-class image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2240–2248.
51. Xu, J.; Li, Z.; Du, B.; Zhang, M.; Liu, J. Reluplex made more practical: Leaky ReLU. In 2020 IEEE Symposium on Computers and Communications (ISCC); IEEE: New York, NY, USA, 2020; pp. 1–7.
52. Yang, C.; Yang, Z.; Liao, S.; Hong, Z.; Nai, W. Triple-GAN with variable fractional order gradient descent method and mish activation function. In 2020 12th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC); IEEE: New York, NY, USA, 2020; Volume 1, pp. 244–247.
53. Sütfeld, L.R.; Brieger, F.; Finger, H.; Füllhase, S.; Pipa, G. Adaptive blending units: Trainable activation functions for deep neural networks. In Science and Information Conference; Springer International Publishing: Cham, Switzerland, 2020; pp. 37–50.
54. Zhang, S.; Zhang, X.; Shen, L.; Wan, S.; Ren, W. Wavelet-based physically guided normalization network for real-time traffic dehazing. Pattern Recognit. 2025, 172, 112451.
55. Liu, Y.; Li, T.; Tan, C.; Ren, W.; Ancuti, C.; Lin, W. IHDCP: Single image dehazing using inverted haze density correction prior. IEEE Trans. Image Process. 2026, 35, 1448–1461.

Figure 1. The illustration of ReLU and its derivative.

Figure 2. The illustration of ELU and its derivative.

Figure 3. The illustration of Swish and its derivative.

Figure 4. The illustration of PReLU and its derivative.

Figure 5. The illustration of PATU and its derivative.

Figure 6. Testing accuracy on CIFAR10 and CIFAR100 with SmallNet. (a) Testing accuracy curves of the activation functions with SmallNet on the CIFAR10 dataset. (b) Testing accuracy curves of the activation functions with SmallNet on the CIFAR100 dataset.

Figure 7. Testing accuracy on CIFAR10 and CIFAR100 with Network in Network. (a) Testing accuracy curves of the activation functions with Network in Network on the CIFAR10 dataset. (b) Testing accuracy curves of the activation functions with Network in Network on the CIFAR100 dataset.

Figure 8. Testing accuracy on CIFAR10 and CIFAR100 with ResNet18. (a) Testing accuracy curves of the activation functions with ResNet18 on the CIFAR10 dataset. (b) Testing accuracy curves of the activation functions with ResNet18 on the CIFAR100 dataset.

Table 1. Accuracy obtained by the proposed PATU with different initial values of the parameter α. Bold values indicate the best result.

α	0.0	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1.0
Accuracy	0.8226	0.8231	0.8191	0.8257	0.8235	0.8247	0.8190	0.8214	0.8193	0.8182	0.8197

Table 2. Performance of activation functions with SmallNet on CIFAR10 and CIFAR100 datasets. Bold values indicate the best result.

Activation Function	CIFAR10 (%)	CIFAR100 (%)	Training Time
ReLU	81.95 ± 0.37	51.35 ± 0.28	5 s
PReLU	81.62 ± 0.30	51.25 ± 0.41	7 s
ELU	81.57 ± 0.33	51.20 ± 0.18	5 s
Swish	81.43 ± 0.22	52.00 ± 0.31	6 s
PATU	82.40 ± 0.16	52.79 ± 0.33	8 s

Table 3. Performance of activation functions with NIN on CIFAR10 and CIFAR100 datasets. – indicates failure to converge. Bold values indicate the best result.

Activation Function	CIFAR10 (%)	CIFAR100 (%)	Training Time
ReLU	84.55 ± 0.37	56.41 ± 0.47	25 s
PReLU	86.47 ± 0.25	59.26 ± 0.36	41 s
ELU	86.27 ± 0.24	–	27 s
Swish	86.10 ± 0.23	59.11 ± 0.27	37 s
PATU	86.42 ± 0.14	59.52 ± 0.39	59 s

Table 4. Performance of activation functions with ResNet18 on CIFAR10 and CIFAR100 datasets. Bold values indicate the best result.

Activation Function	CIFAR10 (%)	CIFAR100 (%)	Training Time
ReLU	82.32 ± 0.51	51.54 ± 0.16	25 s
PReLU	82.52 ± 0.12	50.60 ± 0.63	38 s
ELU	81.60 ± 0.27	52.88 ± 0.77	26 s
Swish	82.74 ± 0.66	54.27 ± 0.62	30 s
PATU	83.51 ± 0.53	54.46 ± 1.59	40 s

Table 5. Mean rank of activation functions across different CNNs and datasets.

Activation Function	SmallNet (C10)	SmallNet (C100)	NIN (C10)	NIN (C100)	ResNet18 (C10)	ResNet18 (C100)	Mean Rank
ReLU	2	3	5	4	4	4	3.67
PReLU	3	4	1	2	3	5	3.00
ELU	4	5	3	5	5	3	4.16
Swish	5	2	4	3	2	2	3.00
PATU	1	1	2	1	1	1	1.16
