Comparison of different convolutional neural network activation functions and methods for building ensembles

Recently, much attention has been devoted to finding highly efficient and powerful activation functions for CNN layers. Because activation functions inject different nonlinearities between layers that affect performance, varying them is one method for building robust ensembles of CNNs. The objective of this study is to examine the performance of CNN ensembles made with different activation functions, including six new ones presented here: 2D Mexican ReLU, TanELU, MeLU+GaLU, Symmetric MeLU, Symmetric GaLU, and Flexible MeLU. The highest performing ensemble was built with CNNs having different activation layers that randomly replaced the standard ReLU. A comprehensive evaluation of the proposed approach was conducted across fifteen biomedical data sets representing various classification tasks. The proposed method was tested on two basic CNN architectures: Vgg16 and ResNet50. Results demonstrate the superiority in performance of this approach. The MATLAB source code for this study will be available at https://github.com/LorisNanni.


Introduction
First developed in the 1940s, artificial neural networks have had a checkered history, sometimes lauded by researchers for their unique computational powers and other times discounted for being no better than statistical methods. About a decade ago, artificial neural networks called deep learners composed of multiple specialized hidden layers radically changed the direction of machine learning and rapidly made significant inroads into many scientific and engineering areas [1][2][3][4][5]. The strength of deep learners is illustrated by the many successes achieved by one of the most famous and robust deep learning architectures, Convolutional Neural Networks (CNNs). CNNs frequently win image recognition competitions and have consistently outperformed other classifiers on a range of image classification tasks [1,6]: object detection [7], face recognition [8], and machine translation [9], to name but a few. Not only do CNNs continue to eclipse traditional classifiers, but they have also been shown to outperform human beings, including experts, in many image recognition tasks. CNNs outshine human beings, for example, in recognizing faces [10,11], traffic signs [12], handwritten digits [12,13], and the fourteen million objects categorized into one thousand classes in the ImageNet data set [14,15].
Evolutions in CNN design initially centered around building better network topologies. Because activation functions impact training dynamics and performance, many researchers have also focused on developing better activation functions. For many years, the sigmoid and the hyperbolic tangent were the most popular neural network activation functions. The hyperbolic tangent's main advantage over the sigmoid is its steeper derivative. Neither function, however, works well with deep learners since both are subject to the vanishing gradient problem. It was soon realized that non-saturating nonlinearities work better with deep learners. One of the first nonlinearities to demonstrate improved performance with CNNs was the now-classic activation function Rectified Linear Units (ReLU) [16], which is equal to the identity function for positive input and zero for negative input [17]. Although ReLU is non-differentiable at zero, it gave AlexNet the edge to win the 2012 ImageNet competition [1].
The success of ReLU in AlexNet motivated researchers to investigate other nonlinearities and the desirable properties they possess. As a consequence, variations of ReLU have proliferated. For example, Leaky ReLU [18], like ReLU, is equivalent to the identity function for positive values but has a small slope hyperparameter a > 0 applied to the negative inputs to ensure the gradient is never zero. As a result, Leaky ReLU is not as prone to getting caught in local minima and counters ReLU's problem with hard zeros, which makes units more likely to fail to activate. Similar to Leaky ReLU is the Exponential Linear Unit (ELU) [19]. The advantage offered by ELU derives from the fact that it always produces a positive gradient, decreasing exponentially toward a limit point as the input goes to minus infinity. A disadvantage of ELU, however, is that it saturates on the left side. Another activation function designed to handle the vanishing gradient problem is the Scaled Exponential Linear Unit (SELU) [20]. SELU is identical to ELU except that it is multiplied by a constant λ > 1 to maintain the mean and the variance of the input features.
Until 2015, activation functions were hand-engineered and fixed; learning was confined to the weights and biases of the network. Parametric ReLU (PReLU) [14] extended Leaky ReLU by making the negative slope a learnable parameter. The success of PReLU set in motion more research into learnable activation functions [21,22]. The Adaptive Piecewise Linear Unit (APLU) [21], for instance, independently learns during the training phase the piecewise slopes and points of non-differentiability for each neuron using gradient descent. As a consequence, it can imitate any piecewise linear function.
Aside from applying a learnable parameter to part of an activation function, as with PReLU and APLU, the construction of an activation function from a fixed set of basis functions can be learned as well. In [23], for instance, an activation function was produced that automatically learned the best combinations of tanh, ReLU, and the identity function. Another activation function of this type is the S-shaped Rectified Linear activation Unit (SReLU) [31], which learns convex and non-convex shapes, imitating both the Weber-Fechner law and the Stevens law. In [24], reinforcement learning was used to search a space of candidate functions; this search produced an activation called Swish, which the authors view as a smooth function that nonlinearly interpolates between the linear function and ReLU.
Similar to APLU is the Mexican ReLU (MeLU [25]), whose shape resembles the Mexican hat wavelet. MeLU is a piecewise linear activation function that combines PReLU with many Mexican hat functions. Like APLU, MeLU has learnable parameters, can approximate piecewise linear functions, and is equivalent to the identity when the input is sufficiently large. MeLU differs from APLU, first, in the number of parameters, second, in the manner in which the approximations are calculated for each function, and, third, in the gradients.
Combining different activation functions has recently been shown to be a highly effective way to train robust classifier systems. In [26], CNNs with different activation functions were trained and fused; and, in [27], different activation functions were inserted into the layers of a single network. Both methods produced excellent results and surpassed the performance of the single CNNs.
In this paper, we extend [26] by comparing a large set of seventeen activation functions using two CNNs, Vgg16 [28] and ResNet50 [29], across fifteen biomedical data sets representing different biomedical tasks. The set of activation functions includes the state-of-the-art and six new ones (2D Mexican ReLU, TanELU, MeLU+GaLU, Symmetric MeLU, Symmetric GaLU, and Flexible MeLU) proposed here. Also compared here are different methods for generating the CNN ensembles. The best performance is obtained by randomly replacing every ReLU layer of each CNN with a different activation function. This paper is organized as follows. In Section 2, all the tested activation functions, including the new ones presented here, are described. In Section 3, the stochastic approach for constructing CNN ensembles is detailed (the other methods are described in the experimental section). In Section 4, the evaluation of the activation functions on the two CNN architectures is reported and discussed. Finally, in Section 5, the paper is concluded with ideas for future investigations.

Activation functions used with the two CNNs
Some of the best performing activation functions were selected as candidates to be substituted into two of the most highly regarded CNN architectures: Vgg16 [28] and ResNet50 [29], each pre-trained on ImageNet. Vgg16 [28], also known as OxfordNet, was the second-place winner of the ILSVRC 2014 competition and one of the deepest neural networks produced at that time. The input to Vgg16 passes through stacks of convolutional layers with filters having small receptive fields. Stacking these layers is similar in effect to using larger convolutional filters, but the stacks involve fewer parameters and are thus more efficient. ResNet50 [29], winner of the ILSVRC 2015 contest and now a popular network, is a CNN with fifty layers known for its skip connections, which sum the input of a block to its output, a technique that promotes gradient propagation and preserves lower-level semantic information so that higher layers can work on it.

ReLU
ReLU [16], illustrated in Figure 1, is defined as

ReLU(x) = max(0, x).

The gradient of ReLU is

ReLU′(x) = 1 if x > 0, and 0 if x < 0.
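In code, ReLU and its gradient reduce to a comparison with zero. A minimal scalar sketch in Python (the study's own code is MATLAB):

```python
def relu(x):
    # Identity for positive input, zero for negative input.
    return x if x > 0 else 0.0

def relu_grad(x):
    # Derivative: 1 for x > 0, 0 for x < 0 (non-differentiable at 0;
    # the value 0 is used there by convention).
    return 1.0 if x > 0 else 0.0
```

Note the "hard zero" region on the left, which is exactly what Leaky ReLU and the other variants below try to avoid.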

Leaky ReLU
In contrast to ReLU, Leaky ReLU [18] has no point with a null gradient. Leaky ReLU, illustrated in Figure 2, is defined as

LeakyReLU(x) = x if x > 0, and a·x if x ≤ 0,

where a (set to 0.01 here) is a small real number. The gradient of Leaky ReLU is

LeakyReLU′(x) = 1 if x > 0, and a if x < 0.

ELU
The Exponential Linear Unit (ELU) [19] is differentiable; as is the case with Leaky ReLU, its gradient is always positive, and its output is bounded from below by −α. ELU, illustrated in Figure 3, is defined as

ELU(x) = x if x > 0, and α·(eˣ − 1) if x ≤ 0,

where α (set to 1 here) is a real number. The gradient of ELU is

ELU′(x) = 1 if x > 0, and α·eˣ if x ≤ 0.
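The saturation at −α and the always-positive gradient are easy to see in a scalar sketch:

```python
import math

def elu(x, alpha=1.0):
    # Identity for positive input; saturates at -alpha on the left.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    # Always positive: 1 on the right, alpha * e^x on the left.
    return 1.0 if x > 0 else alpha * math.exp(x)
```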

PReLU
Parametric ReLU (PReLU) [14] is identical to Leaky ReLU except that the slope parameter a_c (different for every channel c of the input) is learnable. PReLU is defined as

PReLU(x) = x if x > 0, and a_c·x if x ≤ 0,

where a_c is a real number. The gradients of PReLU are

∂PReLU/∂x = 1 if x > 0, and a_c if x < 0;   ∂PReLU/∂a_c = 0 if x > 0, and x if x ≤ 0.

The left-side slopes a_c are all initialized to 0.
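The second gradient above, with respect to the slope itself, is what lets a_c be updated by backpropagation. A minimal sketch:

```python
def prelu(x, a_c):
    # a_c is the learnable negative-side slope for channel c.
    return x if x > 0 else a_c * x

def prelu_grads(x, a_c):
    # Returns (d/dx, d/da_c); the second entry is the gradient used
    # to train the slope parameter itself.
    if x > 0:
        return 1.0, 0.0
    return a_c, x
```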

SReLU
The S-Shaped ReLU (SReLU) [31] is composed of three piecewise linear functions expressed by four learnable parameters tˡ, aˡ, tʳ, and aʳ (initialized as tˡ = 0, aˡ = 0, aʳ = 1, and tʳ = maxInput, a hyperparameter). This rather large set of parameters gives SReLU its high representational power. SReLU, illustrated in Figure 4, is defined as

SReLU(x) = tˡ + aˡ·(x − tˡ) if x ≤ tˡ;   x if tˡ < x < tʳ;   tʳ + aʳ·(x − tʳ) if x ≥ tʳ.
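A scalar sketch of the three-piece form, assuming the parameterization above (two thresholds with linear slopes outside them, identity in between):

```python
def srelu(x, tl, al, tr, ar):
    # Slope al below threshold tl, identity in the middle,
    # slope ar above threshold tr.
    if x <= tl:
        return tl + al * (x - tl)
    if x >= tr:
        return tr + ar * (x - tr)
    return x
```

With the initialization tl = 0, al = 0, ar = 1, SReLU starts out identical to ReLU.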

APLU
The Adaptive Piecewise Linear Unit (APLU) [21] is a piecewise linear function that can approximate any continuous function on a compact set. APLU is defined as

APLU(x) = ReLU(x) + Σ_{c=1..n} a_c · max(0, −x + b_c),

where the a_c and b_c are real numbers that are different for each channel of the input. The gradient of APLU is the sum of the gradients of ReLU and of the hinge functions contained in the sum. With respect to the parameters a_c and b_c, the gradients of APLU are

∂APLU/∂a_c = max(0, −x + b_c);   ∂APLU/∂b_c = a_c if x < b_c, and 0 otherwise.

The values of a_c are initialized here to zero, with the hinge points b_c randomly initialized. An L2-penalty with weight 0.001 is applied to the norm of the a_c values, which adds the corresponding regularization term to the loss function. Furthermore, the parameters a_c are trained with a reduced relative learning rate, obtained by scaling down the global learning rate λ used for the rest of the network.
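The sum of hinge functions can be sketched as follows (a minimal scalar version; the real APLU keeps separate parameters per channel):

```python
def aplu(x, a, b):
    # ReLU(x) plus hinge functions with learnable heights a_c
    # and hinge positions b_c (one (a_c, b_c) pair per piece).
    out = max(0.0, x)
    for a_c, b_c in zip(a, b):
        out += a_c * max(0.0, -x + b_c)
    return out

def aplu_grad_a(x, b):
    # Gradient with respect to each height a_c: max(0, -x + b_c).
    return [max(0.0, -x + b_c) for b_c in b]
```

With all a_c set to zero, aplu reduces to plain ReLU.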

MeLU
The mathematical basis of the Mexican ReLU (MeLU) [25] activation function can be described as follows. Given real numbers a and λ, let

φ_{a,λ}(x) = max(λ − |x − a|, 0)

be a so-called Mexican hat type of function. When |x − a| > λ, the function φ_{a,λ}(x) is null; it increases with a derivative of 1 between a − λ and a, and decreases with a derivative of −1 between a and a + λ.
Considering the above, MeLU is defined as

MeLU(x) = PReLU_{c₀}(x) + Σ_{j=1..k−1} c_j · φ_{a_j,λ_j}(x),

where k is the number of learnable parameters for each channel, the c_j are the learnable parameters, and c₀ is the vector of parameters in PReLU.
The parameter k (k = 4 or 8 here) allots one value for PReLU and k − 1 values for the coefficients in the sum of the Mexican hat functions. The real numbers a_j and λ_j are fixed (see Table 1) and are chosen recursively. The value of maxInput is set to 256. The first Mexican hat function has its maximum at 2·maxInput and is equal to zero at 0 and at 4·maxInput. The next two functions are chosen to be zero outside the intervals [0, 2·maxInput] and [2·maxInput, 4·maxInput], with the requirement that they have their maxima at maxInput and 3·maxInput. The Mexican hat functions on which MeLU is based are continuous and piecewise differentiable; they also form a Hilbert basis on a compact set with the L² norm. As a result, MeLU can approximate every function in L²([0, 1024]) as k goes to infinity.
When the learnable parameters are set to zero, MeLU is identical to ReLU. Thus, MeLU can easily replace the activations of networks pre-trained with ReLU. This is not to say, of course, that MeLU cannot replace the activation functions of networks trained with Leaky ReLU and PReLU. In this study, all the coefficients are initialized to zero, so the networks start off as ReLU, with all its attendant properties.
MeLU has many desirable properties. The gradient is rarely flat, and saturation takes place in no direction. As the hyperparameter k approaches infinity, MeLU can approximate every continuous function on a compact set. Finally, the modification of any given parameter changes the activation only on a small interval and only when needed, making optimization relatively simple.
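The construction above can be sketched in a few lines; the (a_j, λ_j) pairs below are illustrative placeholders for the fixed values of Table 1:

```python
def mex_hat(x, a, lam):
    # phi_{a,lam}(x) = max(lam - |x - a|, 0): a triangular "Mexican hat".
    return max(lam - abs(x - a), 0.0)

def melu(x, c, hats, c0=0.0):
    # c: learnable coefficients c_1..c_{k-1}; hats: fixed (a_j, lam_j)
    # pairs (Table 1 in the paper); c0: the PReLU slope.
    out = x if x > 0 else c0 * x          # PReLU term
    for c_j, (a_j, lam_j) in zip(c, hats):
        out += c_j * mex_hat(x, a_j, lam_j)
    return out
```

With c and c0 all zero, melu reduces to ReLU, which is why it can be dropped into ReLU-pretrained networks.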

GaLU
Piecewise linear odd functions, composed of many linear pieces, do a better job of approximating nonlinear functions than does ReLU [37]. For this reason, Gaussian ReLU (GaLU) [36], based on Gaussian types of functions, aims to add more linear pieces than does MeLU. Since GaLU extends MeLU, GaLU retains all the favorable properties discussed in section 2.7.
Like MeLU, GaLU has its own set of fixed parameters. A comparison of the fixed parameters of GaLU and MeLU with maxInput = 1 is provided in Table 2. Table 2. Comparison of the fixed parameters of GaLU and MeLU with maxInput = 1.

PDELU
The Parametric Deformable Exponential Linear Unit (PDELU) [34] is designed to push the mean of its output toward zero, which speeds up the training process. It is defined as

PDELU(x) = x if x > 0, and α·([1 + (1 − t)·x]₊^{1/(1−t)} − 1) if x ≤ 0,

where [·]₊ = max(·, 0), α is a learnable parameter, and t controls the deformation.

Swish
Swish [24] was designed using reinforcement learning to search over sums, products, and compositions of simple building-block functions. The best function found is

Swish(x) = x · sigmoid(β·x),

where β (set here to 1) is a parameter that can be made learnable during training.
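A scalar sketch; the comment notes the two limiting behaviors that motivate the "interpolation" view mentioned in the introduction:

```python
import math

def swish(x, beta=1.0):
    # x * sigmoid(beta*x): beta = 0 gives the scaled linear function x/2,
    # while beta -> infinity approaches ReLU.
    return x / (1.0 + math.exp(-beta * x))
```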

Mish
Mish [33] is defined as

Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + eˣ)).
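A direct transcription of the definition (log1p is used for numerical accuracy of ln(1 + eˣ) near zero):

```python
import math

def mish(x):
    # x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x).
    return x * math.tanh(math.log1p(math.exp(x)))
```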

SRS
The Soft Root Sign (SRS) [35] is defined as

SRS(x) = x / (x/α + e^{−x/β}),

where α and β are non-negative learnable parameters. The output has zero mean if the input is a standard normal.
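A scalar sketch; the default values α = 3 and β = 2 are illustrative assumptions, not the paper's settings:

```python
import math

def srs(x, alpha=3.0, beta=2.0):
    # x / (x/alpha + exp(-x/beta)); for large positive x the output
    # saturates at alpha, so alpha bounds the response from above.
    return x / (x / alpha + math.exp(-x / beta))
```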

Soft Learnable
Soft Learnable [35] is a very recent activation function with two positive parameters, α and β. We used two different versions of this activation, depending on whether the second parameter is fixed (labeled here as SoftLearnable) or learnable (labeled here as SoftLearnable2).

Splash
Splash [32] is another modification of APLU that makes the function symmetric. In the definition of APLU, let f_{a,b}(x) denote the learnable piecewise sum with heights a and hinge points b. Then Splash is defined as

Splash_{a⁺,a⁻,b}(x) = f_{a⁺,b}(x) + f_{a⁻,b}(−x).

This equation's hinges are symmetric with respect to the origin. The authors in [32] claim that this makes the network more robust against adversarial attacks.

2D MeLU (New)
2D Mexican ReLU (2D MeLU) is an activation function presented here that is not defined component-wise; instead, every output neuron depends on two input neurons, so that if a layer has N neurons (or channels), the output of each neuron is computed from a pair of inputs using two-dimensional Mexican hat functions. The parameters are built from those of MeLU: the centre a_{i,j} is the two-dimensional vector a_{i,j} = (a_i, a_j), with a_i and a_j as defined in Table 1, and the width is λ_{i,j} = max(λ_i, λ_j), with λ_i and λ_j defined as they are for MeLU in Table 1.

TanELU (New)
TanELU is an activation function presented here that is simply the weighted sum of tanh and ReLU,

TanELU(x) = ReLU(x) + α·tanh(x),

where α is a learnable parameter.

MeLU + GaLU (New)
MeLU + GaLU is an activation function presented here that is, as its name suggests, the weighted sum of MeLU and GaLU, with a learnable weighting parameter α.

Symmetric MeLU (New)
Symmetric MeLU is the equivalent of MeLU but is symmetric like Splash. Symmetric MeLU is defined as

SymMeLU(x) = MeLU(x) + MeLU(−x),

where the coefficients of the two MeLUs are the same; in other words, the k coefficients of MeLU(x) are the same as those of MeLU(−x).

Symmetric GaLU (New)
Symmetric GaLU is the equivalent of Symmetric MeLU but uses GaLU instead of MeLU. Symmetric GaLU is defined as

SymGaLU(x) = GaLU(x) + GaLU(−x),

where the coefficients of the two GaLUs are the same; in other words, the k coefficients of GaLU(x) are the same as those of GaLU(−x).

Flexible MeLU (New)
Flexible MeLU is a modification of MeLU in which the peaks of the Mexican hat functions are also learnable. This feature makes it more similar to APLU, whose points of non-differentiability are likewise learnable. Compared with MeLU, Flexible MeLU therefore has more learnable parameters.

Methods for combining CNNs
One of the objectives of this study is to use several methods for combining the two CNNs with the different activation functions. Two methods are in need of discussion: Sequential Forward Floating Selection (SFFS) [38] and the stochastic method for combining CNNs introduced in [27].

Sequential Forward Floating Selection (SFFS)
A popular method for selecting an optimal set of descriptors, SFFS [38], has been adapted here for selecting the best-performing and most independent classifiers to be added to the ensemble. In applying the SFFS method, each model to be included in the final ensemble is selected by adding, at each step, the model that provides the highest increment in performance compared to the existing subset of models. Then a backtracking step is performed to exclude the worst model from the current ensemble. This method for combining CNNs is labeled Selection in the experimental section. Since SFFS requires a training phase, we perform a leave-one-dataset-out selection to choose the best-suited models.
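The selection loop can be sketched as follows; `score` is a hypothetical callback returning the validation performance of a candidate ensemble (in the paper, accuracy under leave-one-dataset-out):

```python
def sffs(candidates, score, max_size):
    # Greedy forward selection with a backtracking step, adapted from
    # descriptor selection to classifier selection.
    ensemble = []
    while len(ensemble) < max_size:
        remaining = [m for m in candidates if m not in ensemble]
        if not remaining:
            break
        # Forward step: add the model giving the largest gain.
        best = max(remaining, key=lambda m: score(ensemble + [m]))
        ensemble.append(best)
        # Backtracking step: drop the member whose removal helps most.
        if len(ensemble) > 2:
            worst = max(ensemble,
                        key=lambda m: score([x for x in ensemble if x != m]))
            pruned = [x for x in ensemble if x != worst]
            if score(pruned) > score(ensemble):
                ensemble = pruned
                if worst == best:
                    break  # adding then removing the same model: stop
    return ensemble
```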

Stochastic method
The stochastic approach [27] involves randomly substituting all the activations in a CNN architecture with a new one selected from a pool of potential candidates. Random selection is repeated many times to generate a set of networks that will be fused together. The candidate activation functions within a pool differ depending on the CNN architecture. Some activation functions appear to perform poorly and some quite well on a given CNN, with quite a large variance. The activation functions included in the pools for each of the CNNs tested here are provided in Table 3. The CNN ensembles randomly built from these pools varied in size, as will be noted in the experimental section, which investigates the different ensembles. Ensemble decisions are combined by sum rule, where the softmax probabilities of a sample given by all the networks are averaged, and the new score is used for classification.
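The two ingredients of the stochastic method, per-layer random substitution and sum-rule fusion, can be sketched as follows. The pool below is a hypothetical example; the actual per-architecture pools are listed in Table 3:

```python
import random

# Hypothetical pool of candidate activations (see Table 3 for the real ones).
POOL = ["ReLU", "LeakyReLU", "ELU", "MeLU", "GaLU", "Swish", "Mish"]

def random_activation_config(n_layers, rng):
    # One network of the ensemble: every activation layer is replaced
    # by a function drawn at random from the pool.
    return [rng.choice(POOL) for _ in range(n_layers)]

def sum_rule(softmax_outputs):
    # Fuse the ensemble by averaging the softmax vectors of all networks
    # and predicting the class with the highest averaged score.
    n = len(softmax_outputs)
    n_classes = len(softmax_outputs[0])
    avg = [sum(p[c] for p in softmax_outputs) / n for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])
```

Repeating `random_activation_config` many times yields the set of networks that are trained independently and then fused by `sum_rule`.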
The stochastic method of combining CNNs is labeled Stoc in the experimental section. Table 3. Activation functions included (✓) in the pools for each of the two CNN architectures.

Biomedical data sets
Each of the activation functions detailed in Section 2 is tested on the CNNs using the following fifteen publicly available biomedical data sets:

1. CH (CHO data set [39]): 327 fluorescence microscope images of Chinese Hamster Ovary cells divided into five classes: anti-giantin, Hoechst 33258 (DNA), anti-lamp2, anti-nop4, and anti-tubulin.
2. HE (2D HeLa data set [39]): a balanced data set of 862 fluorescence microscopy images of HeLa cells stained with various organelle-specific fluorescent dyes. The images are divided into ten classes of organelles: DNA (Nuclei), ER (Endoplasmic reticulum), Giantin (cis/medial Golgi), GPP130 (cis Golgi), Lamp2 (Lysosomes), Mitochondria, Nucleolin (Nucleoli), Actin, TfR (Endosomes), and Tubulin.
3. RN (RNAi data set [40]): 200 fluorescence microscopy images of fly cells (D. melanogaster) divided into ten classes. Each class contains 1024×1024 TIFF images of phenotypes produced from one of ten knock-down genes, the IDs of which form the class labels.
4. MA (C. elegans Muscle Age data set [40]): a data set for classifying the age of the nematode given twenty-five images of C. elegans muscles collected at four ages representing the classes.
5. TB (Terminal Bulb Aging data set [40]): the companion data set to MA, containing 970 images of C. elegans terminal bulbs collected at seven ages representing the classes.
6. LY (Lymphoma data set [40]): 375 images of malignant lymphoma representative of three types: CLL (chronic lymphocytic leukemia), FL (follicular lymphoma), and MCL (mantle cell lymphoma).
7. LG (Liver Gender Caloric Restriction (CR) data set [40]): 265 images of liver tissue sections from six-month-old male and female mice on a CR diet; the two classes represent the gender of the mice.
8. LA (Liver Aging Ad-libitum data set [40]): 529 images of liver tissue sections from female mice on an ad-libitum diet, divided into four classes representing the age of the mice.
9. CO (Colorectal Cancer [41]): a Zenodo data set (record: 53169#.WaXjW8hJaUm) of 5000 histological images (150 × 150 pixels each) of human colorectal cancer divided into eight classes.
10. BGR (Breast Grading Carcinoma [42]): a Zenodo data set (record: 834910#.Wp1bQ-jOWUl) of 300 annotated histological images of twenty-one patients with invasive ductal carcinoma of the breast, representing three classes/grades 1-3.
11. LAR (Laryngeal data set [43]): a Zenodo data set (record: 1003200#.WdeQcnBx0nQ) of 1320 images of thirty-three healthy and early-stage cancerous laryngeal tissues representative of four tissue classes.
12. HP (set of immunohistochemistry images from the Human Protein Atlas [44]): a Zenodo data set (record: 3875786#.XthkoDozY2w) of 353 images of fourteen proteins in nine normal reproductive tissues belonging to seven subcellular locations. The data set in [44] is partitioned into two folds, one for training (177 images) and one for testing (176 images).
13. RT (2D 3T3 Randomly CD-Tagged Images: Set 3 [45]): a collection of 304 2D 3T3 randomly CD-tagged images created by randomly generating CD-tagged cell clones and imaging them by automated microscopy. The images are divided into ten classes. As in [45], the proteins are put into ten folds so that images in the training and testing sets never come from the same protein.
14. LO (Locate Endogenous data set [46]): a fairly balanced data set of 502 images of endogenous cells divided into ten classes: Actin-cytoskeleton, Endosomes, ER, Golgi, Lysosomes, Microtubule, Mitochondria, Nucleus, Peroxisomes, and PM. This data set is archived at https://integbio.jp/dbcatalog/en/record/nbdc00296.
15. TR (Locate Transfected data set [46]): a companion data set to LO. TR contains 553 images divided into the same ten classes as LO plus an additional Cytoplasm class, for a total of eleven classes.
Data sets 1-8 can be downloaded at https://ome.grc.nia.nih.gov/iicbu2008/. Data sets 9-12 can be found on Zenodo at https://zenodo.org/record/ by concatenating the data set's Zenodo record number, provided in the descriptions above, to this URL. Data set 13 is available at http://murphylab.web.cmu.edu/data/#RandTagAL, and data sets 14 and 15 are available on request. Unless otherwise noted, the five-fold cross-validation protocol is applied, and the Wilcoxon signed-rank test [47] is used to validate experiments. Reported in Tables 4 and 5 is the performance of the different activation functions on the CNN topologies Vgg16 and ResNet50, each trained with a batch size (BS) of 30 and a learning rate (LR) of 0.0001 for 20 epochs (the last fully connected layer has an LR 20 times larger than the rest of the layers, i.e., 0.002), except the stochastic architectures, which are trained for 30 epochs because of slower convergence. These settings were selected to reduce computation time. Images were augmented with random reflections on both axes and two independent random rescales of both axes by factors uniformly sampled in [1, 2] (using MATLAB data augmentation procedures), rescaling both the vertical and horizontal proportions of the image. For each stochastic approach, a set of 15 networks was built and combined by sum rule. We trained the models using MATLAB 2019b; note, however, that the pre-trained architectures of newer MATLAB versions perform better.

Experimental results
The performance of the ensembles (the stand-alone baselines, 15ReLU, eALL, ALL, Selection, and the stochastic ensembles) is reported in Tables 4 and 5. The most relevant results on ResNet50 (Table 4) and on Vgg16 (Table 5) can be summarized as follows:
• Again, the ensemble methods outperform the stand-alone CNNs. As was the case with ResNet50, 15ReLU strongly outperforms (p-value of 0.01) the stand-alone CNN with ReLU;
• Among the stand-alone Vgg16 networks, ReLU is not the best activation function. The two activations that reach the highest performance on Vgg16 are MeLU (k = 4) with maxInput = 255 and GaLU with maxInput = 255. According to the Wilcoxon signed-rank test, however, there is no statistical difference between ReLU, MeLU (k = 4) with maxInput = 255, and GaLU with maxInput = 255;
• Interestingly, ALL with maxInput = 1 outperforms eALL with a p-value of 0.05;
• Stoc_4 outperforms 15ReLU with a p-value of 0.01, but the performance of Stoc_4 is similar to that of eALL, ALL (maxInput = 1), and Selection.

Conclusions
The goal of this work was to evaluate the performance of CNN ensembles built by replacing the ReLU layers with activations drawn from a large set of activation functions, including six new activation functions introduced here: 2D Mexican ReLU, TanELU, MeLU+GaLU, Symmetric MeLU, Symmetric GaLU, and Flexible MeLU. Tests were run on two different networks, Vgg16 and ResNet50, across fifteen challenging image data sets representing various tasks. Different methods of building ensembles of the CNNs were also explored.
Experiments show that an ensemble of multiple CNNs that differ only in their activation functions outperforms the results of single CNNs. Experiments also show that among the single architectures there is no clear winner.
More studies need to investigate the performance gains of generating larger ensembles composed of different CNN architectures using many activation functions across even more data sets. Studies like the one presented here are difficult because investigating ensembles of CNNs requires enormous computational resources.