Comparison of Different Convolutional Neural Network Activation Functions and Methods for Building Ensembles for Small to Midsize Medical Data Sets

CNNs and other deep learners are now state-of-the-art in medical imaging research. However, the small sample size of many medical data sets dampens performance and results in overfitting. In some medical areas, it is simply too labor-intensive and expensive to amass images numbering in the hundreds of thousands. Building ensembles of pre-trained deep CNNs is one powerful method for overcoming this problem. Ensembles combine the outputs of multiple classifiers to improve performance. This method relies on the introduction of diversity, which can be injected at many levels of the classification workflow. A recent ensembling method that has shown promise is to vary the activation functions in a set of CNNs or within different layers of a single CNN. This study examines the performance of both methods using a large set of twenty activation functions, six of which are presented here for the first time: 2D Mexican ReLU, TanELU, MeLU + GaLU, Symmetric MeLU, Symmetric GaLU, and Flexible MeLU. The proposed method was tested on fifteen medical data sets representing various classification tasks. The best performing ensemble combined two well-known CNNs (VGG16 and ResNet50) whose standard ReLU activation layers were randomly replaced with other activation functions. The results demonstrate the superiority of this approach.


Introduction
First developed in the 1940s, artificial neural networks have had a checkered history, sometimes lauded by researchers for their unique computational powers and other times discounted for being no better than statistical methods. About a decade ago, the power of deep artificial neural networks radically changed the direction of machine learning and rapidly made significant inroads into many scientific, medical, and engineering areas [1][2][3][4][5][6][7][8]. The strength of deep learners is demonstrated by the many successes achieved by one of the most famous and robust deep learning architectures, Convolutional Neural Networks (CNNs). CNNs frequently win image recognition competitions and have consistently outperformed other classifiers on a variety of applications, including image classification [9,10], object detection [11,12], face recognition [13,14], and machine translation [15], to name a few. Not only do CNNs continue to perform better than traditional classifiers, but they also outperform human beings, including experts, in many image recognition tasks. In the medical domain, for example, CNNs have been shown to outperform human experts in recognizing skin cancer [16], skin lesions on the face and scalp [17], and the detection of esophageal cancer [18].
It is no wonder, then, that CNNs and other deep learners have grown explosively in medical imaging research [19]. CNNs have been successfully applied to a wide range of medical imaging tasks. The contributions of this work include the following: (1) the performance of twenty individual activation functions is assessed using two CNNs (VGG16 and ResNet50) across fifteen different medical data sets. The remainder of this paper is organized as follows. In Section 2, we review the literature on activation functions used with CNNs. In Section 3, we describe all the activation functions tested in this work. In Section 4, the stochastic approach for constructing CNN ensembles is detailed (some other methods are described in the experimental section). In Section 5, we present a detailed evaluation of each of the activation functions using both CNNs on the fifteen data sets, along with the results of their fusions. Finally, in Section 6, we suggest new ideas for future investigation together with some concluding remarks.
The MATLAB source code for this study will be available at https://github.com/LorisNanni.

Related Work with Activation Functions
Evolutions in CNN design initially focused on building better network topologies. Because activation functions affect training dynamics and performance, many researchers have also focused on developing better activation functions. For many years, the sigmoid and the hyperbolic tangent were the most popular neural network activation functions, the main advantage of the hyperbolic tangent over the sigmoid being its steeper derivative. Neither function, however, works well with deep learners, since both are subject to the vanishing gradient problem. It was soon realized that non-saturating nonlinearities work better with deep learners.
One of the first nonlinearities to demonstrate improved performance with CNNs was the Rectified Linear Unit (ReLU) activation function [50], which is equal to the identity function for positive input and zero for negative input [51]. Although ReLU is nondifferentiable at zero, it gave AlexNet the edge needed to win the 2012 ImageNet competition [52].
The success of ReLU in AlexNet motivated researchers to investigate other nonlinearities and the desirable properties they possess. As a consequence, variations of ReLU have proliferated. For example, Leaky ReLU [53], like ReLU, is equivalent to the identity function for positive values but applies a hyperparameter α > 0 to the negative inputs to ensure the gradient is never zero. As a result, Leaky ReLU is less prone to getting caught in local minima and avoids ReLU's hard zeros, which can cause units to stop activating. The Exponential Linear Unit (ELU) [54] is an activation function similar to Leaky ReLU. The advantage offered by ELU is that its gradient is always positive, while the function saturates exponentially to −α as the input goes to minus infinity. The main disadvantage of ELU is this saturation on the left side. Another activation function designed to handle the vanishing gradient problem is the Scaled Exponential Linear Unit (SELU) [55]. SELU is identical to ELU except that it is multiplied by a constant λ > 1 chosen to preserve the mean and the variance of the input features.
Until 2015, activation functions were fixed: learning was restricted to the weights and biases of the network. Parametric ReLU (PReLU) [56] gave Leaky ReLU a learnable parameter applied to the negative slope. The success of PReLU attracted more research on learnable activation functions [57,58]. A new generation of activation functions was then developed, one notable example being the Adaptive Piecewise Linear Unit (APLU) [57]. APLU learns, during the training phase and independently for each neuron, the piecewise slopes and points of nondifferentiability using gradient descent; it can therefore imitate any piecewise linear function.
Instead of employing a learnable parameter in the definition of an activation function, as with PReLU and APLU, the construction of an activation function from a given set of functions can itself be learned. In [59], for instance, it was proposed to create an activation function that automatically learns the best combination of tanh, ReLU, and the identity function. Another activation function of this type is the S-shaped Rectified Linear Activation Unit (SReLU) [60], which learns convex and nonconvex functions and can imitate both the Weber-Fechner law and the Stevens law. A related line of work used reinforcement learning to search over combinations of simple functions; this search produced an activation called Swish, which the authors view as a smooth function that nonlinearly interpolates between the linear function and ReLU.
Similar to APLU is the Mexican ReLU (MeLU [61]), whose shape resembles the Mexican hat wavelet. MeLU is a piecewise linear activation function that combines PReLU with many Mexican hat functions. Like APLU, MeLU has learnable parameters, approximates piecewise linear functions, and is equivalent to the identity when x is sufficiently large. MeLU differs from APLU in two main respects: first, it has a much larger number of parameters; and second, the way in which the approximations are computed for each function is different.

Activation Functions
As described in the Introduction, this paper explores classifying medical imagery using combinations of some of the best performing activation functions on two widely used high-performance CNN architectures: VGG16 [48] and ResNet50 [49], each pre-trained on ImageNet. VGG16 [48], also known as the OxfordNet, is the second-place winner in the ILSVRC 2014 competition and was one of the deepest neural networks produced at that time. The input into VGG16 passes through stacks of convolutional layers, with filters having small receptive fields. Stacking these layers is similar in effect to CNNs having larger convolutional filters, but the stacks involve fewer parameters and are thus more efficient. ResNet50 [49], winner of the ILSVRC 2015 contest and now a popular network, is a CNN with fifty layers known for its skip connections that sum the input of a block to its output, a technique that promotes gradient propagation and that propagates lower-level information to higher level layers.
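As a point of reference for the experiments that follow, the MATLAB sketch below loads the two pre-trained backbones and lists the ReLU layers that the ensembles described later replace. It assumes the corresponding Deep Learning Toolbox model support packages are installed; the variable names are illustrative.

% Load the ImageNet pre-trained backbones used in this study
netV = vgg16;      % requires the VGG-16 support package
netR = resnet50;   % requires the ResNet-50 support package

% List the ReLU layers that activation replacement would target
lg = layerGraph(netR);
isRelu = arrayfun(@(l) isa(l, 'nnet.cnn.layer.ReLULayer'), lg.Layers);
disp({lg.Layers(isRelu).Name}');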
The main advantage of these more complex activation functions with learnable parameters is that they can better learn the abstract features through nonlinear transformations. This is a generic characteristic of learnable activation functions, well known in shallow networks [69]. The main disadvantage is that activation functions with several learnable parameters need large data sets for training.
A further rationale for our proposed activation functions is to create functions that differ substantially from one another, which increases diversity and therefore performance in ensembles; for this reason, we have developed 2D MeLU, which departs markedly from standard activation functions.

ReLU
ReLU [50], illustrated in Figure 1, is defined as f(x) = max(0, x). The gradient of ReLU is 1 for x > 0 and 0 for x < 0.
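As a minimal sketch (our own, using the convention that the gradient at x = 0 is zero), ReLU and its gradient can be written in MATLAB as:

relu      = @(x) max(x, 0);       % identity for x > 0, zero otherwise
relu_grad = @(x) double(x > 0);   % 1 for x > 0, 0 for x <= 0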

Leaky ReLU
In contrast to ReLU, Leaky ReLU [53] has no point with a null gradient. Leaky ReLU, illustrated in Figure 2, is defined as f(x) = x for x > 0 and f(x) = a·x for x ≤ 0, where a (set to 0.01 here) is a small real number. The gradient of Leaky ReLU is 1 for x > 0 and a for x < 0.


PReLU
Parametric ReLU (PReLU) [56] is identical to Leaky ReLU except that the parameter a_c (different for every channel of the input) is learnable. PReLU is defined as f(x) = x for x > 0 and f(x) = a_c·x for x ≤ 0, where a_c is a real number. The gradient of PReLU with respect to the input is 1 for x > 0 and a_c for x ≤ 0, while the gradient with respect to a_c is 0 for x > 0 and x for x ≤ 0. The slopes a_c on the left-hand side are all initialized to 0.
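A minimal sketch of PReLU and the two gradients described above, for a single channel-wise slope a_c (names are ours):

prelu      = @(x, ac) max(x, 0) + ac .* min(x, 0);  % identity for x > 0, ac*x for x <= 0
dprelu_dx  = @(x, ac) (x > 0) + ac .* (x <= 0);     % gradient w.r.t. the input
dprelu_dac = @(x) min(x, 0);                        % gradient w.r.t. the learnable slope ac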

ELU
Exponential Linear Unit (ELU) [54] is differentiable and, as is the case with Leaky ReLU, its gradient is always positive; the function is bounded from below by −a. ELU, illustrated in Figure 3, is defined as f(x) = x for x > 0 and f(x) = a·(e^x − 1) for x ≤ 0, where a (set to 1 here) is a real number. The gradient of ELU is 1 for x > 0 and a·e^x for x ≤ 0.

PDELU
The Parametric Deformable Exponential Linear Unit (PDELU) [66] is designed to have a zero-mean output, which speeds up the training process. PDELU is defined as f(x_i) = x_i for x_i > 0 and f(x_i) = α_i·([1 + (1 − t)·x_i]_+^(1/(1−t)) − 1) for x_i ≤ 0, where [x]_+ = max(x, 0). The f(x_i) function takes values in the (−α, ∞) range; its slope in the negative part is controlled by the α_i parameters (i runs over the input channels), which are learned jointly with the rest of the network. The parameter t controls the degree of deformation of the exponential function: if 0 < t < 1, then f(x_i) decays to 0 faster than the exponential.
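A sketch of the PDELU forward pass for one channel, following the form reconstructed above (alpha is the learnable slope, t the deformation parameter); treat it as indicative rather than a reference implementation:

pdelu = @(x, alpha, t) (x > 0) .* x + ...
        (x <= 0) .* (alpha .* (max(1 + (1 - t) .* x, 0).^(1 ./ (1 - t)) - 1));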

Swish
Swish [60] was designed using reinforcement learning to search for functions that efficiently sum, multiply, and compose simple building blocks. The best function found is f(x) = x·sigmoid(βx), where β acts either as a constant or as a learnable parameter estimated during training. When β = 1, as in this study, Swish is equivalent to the Sigmoid-weighted Linear Unit (SiLU), originally proposed for reinforcement learning. As β → ∞, Swish approaches the ReLU function. Unlike ReLU, however, Swish is smooth and nonmonotonic, as demonstrated in [60]; this is a peculiar aspect of this activation function. In practice, β = 1 is a good starting point, from which performance can be further improved by training this parameter.
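A one-line sketch of Swish with the β used in this study (β = 1, i.e., SiLU):

sigmoid = @(z) 1 ./ (1 + exp(-z));
swish   = @(x, beta) x .* sigmoid(beta .* x);   % beta = 1 gives SiLU; beta -> inf approaches ReLU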

TanELU (New)
TanELU is an activation function presented here for the first time; it is simply the weighted sum of tanh and ReLU, where the weight a_i is a learnable parameter.
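The exact formula is not reproduced above; purely as an illustration, one plausible reading of "weighted sum of tanh and ReLU" with a learnable weight a_i is the following sketch (the functional form is our assumption, not taken verbatim from the definition):

% Hypothetical TanELU sketch: ReLU plus a learnable multiple of tanh (assumed form)
tanelu = @(x, a) max(x, 0) + a .* tanh(x);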
Learning/Adaptive Activation Functions

SReLU
S-shaped ReLU (SReLU) [63] is composed of three piecewise linear functions expressed by four learnable parameters (t_l, t_r, a_l, and a_r, initialized as a_l = 0, t_l = 0, and t_r = maxInput, a hyperparameter). This rather large set of parameters gives SReLU its high representational power. SReLU, illustrated in Figure 4, is defined as f(x) = t_l + a_l·(x − t_l) for x ≤ t_l, f(x) = x for t_l < x < t_r, and f(x) = t_r + a_r·(x − t_r) for x ≥ t_r.

APLU
Adaptive Piecewise Linear Unit (APLU) [57] is a piecewise linear function that can approximate any continuous function on a compact set. APLU is defined as f(x) = max(0, x) + Σ_s a_c^s·max(0, −x + b_c^s), where the a_c^s and b_c^s are real numbers that are different for each channel of the input. The gradient of APLU is the sum of the gradient of ReLU and the gradients of the hinge functions in the sum: with respect to a_c, the gradient is max(0, −x + b_c); with respect to b_c, it is a_c when x < b_c and 0 otherwise. The values of a_c are initialized here to zero, with the points b_c randomly initialized. An L2 penalty with weight 0.001 is added to the norm of the a_c values, which requires that an additional term L_reg be included in the loss function. Furthermore, a relative learning rate is used for these parameters, related to maxInput and the smallest learning rate used for the rest of the network; if λ is the global learning rate, the learning rate λ* of the parameters a_c is scaled accordingly.
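A sketch of the APLU forward pass for one channel with S hinge units; the vectors a and b hold the learnable slopes and hinge positions (initialization and the L2 penalty discussed above are omitted):

function y = aplu(x, a, b)
% APLU sketch: ReLU plus a sum of learnable hinge functions
%   a : 1-by-S learnable slopes (initialized to zero in this work)
%   b : 1-by-S hinge positions (randomly initialized)
y = max(x, 0);
for s = 1:numel(a)
    y = y + a(s) .* max(0, -x + b(s));
end
end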

MeLU
The mathematical basis of the Mexican ReLU (MeLU) [61] activation function can be described as follows. Given the real numbers a and λ, let φ_{a,λ}(x) = max(λ − |x − a|, 0) be a so-called Mexican hat type of function. The function φ_{a,λ}(x) is null when |x − a| > λ, increases with a derivative of 1 between a − λ and a, and decreases with a derivative of −1 between a and a + λ.
Considering the above, MeLU is defined as MeLU(x) = PReLU_{c_0}(x) + Σ_{j=1}^{k−1} c_j·φ_{a_j,λ_j}(x), where k is the number of learnable parameters for each channel, the c_j are the learnable parameters, and c_0 is the vector of parameters in PReLU. Of the k learnable parameters (k = 4 or 8 here), one belongs to PReLU and the remaining k − 1 are the coefficients of the Mexican hat functions. The real numbers a_j and λ_j are fixed (see Table 1) and are chosen recursively. The value of maxInput is set to 256. The first Mexican hat function has its maximum at 2·maxInput and is equal to zero at 0 and 4·maxInput. The next two functions are chosen to be zero outside the intervals [0, 2·maxInput] and [2·maxInput, 4·maxInput], with the requirement that they have their maxima at maxInput and 3·maxInput. The Mexican hat functions on which MeLU is based are continuous and piecewise differentiable, and they form a Hilbert basis on a compact set. When the c_j learnable parameters are set to zero, MeLU is identical to ReLU. Thus, MeLU can easily replace ReLU in pre-trained networks. This is not to say, of course, that MeLU cannot replace the activation functions of networks trained with Leaky ReLU or PReLU. In this study, all c_j are initialized to zero, so MeLU starts off as ReLU, with all its attendant properties.
MeLU has several desirable properties as its hyperparameter k grows. The gradient is rarely flat, and saturation does not occur in any direction. As k approaches infinity, MeLU can approximate every continuous function on a compact set. Finally, modifying any given parameter changes the activation only on a small interval, and only when needed, which keeps optimization relatively simple.
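A sketch of MeLU for one channel under the definition above: phi is the Mexican hat building block, c holds the k learnable parameters (c(1) is the PReLU slope, initialized to zero), and aj, lj are the fixed centers and widths of Table 1; the vectors passed in are placeholders.

function y = melu(x, c, aj, lj)
% MeLU sketch: PReLU plus a learnable combination of Mexican hat functions
phi = @(x, a, l) max(l - abs(x - a), 0);        % Mexican hat building block
y   = max(x, 0) + c(1) .* min(x, 0);            % PReLU part (c(1) = 0 recovers ReLU)
for j = 2:numel(c)
    y = y + c(j) .* phi(x, aj(j-1), lj(j-1));   % fixed centers/widths, learnable coefficients
end
end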

GaLU
Piecewise linear odd functions composed of many linear pieces do a better job of approximating nonlinear functions than ReLU does [70]. For this reason, Gaussian ReLU (GaLU) [68], based on Gaussian types of functions, aims to add more linear pieces with respect to MeLU. Since GaLU extends MeLU, it retains all the favorable properties discussed above for MeLU. Letting g_{a,λ}(x) be a Gaussian type of function, where a and λ are real numbers, GaLU is defined, similarly to MeLU, as a PReLU term plus a learnable combination of such functions. In this work, k = 2 parameters are used for what will be called Small GaLU in the experimental section and k = 4 for GaLU proper.
Like MeLU, GaLU has the same set of fixed parameters. A comparison of values for the fixed parameters with maxInput = 1 is provided in Table 2.

Soft Learnable
Soft Learnable is based on the Soft Root Sign (SRS) activation, defined as f(x) = x / (x/α + e^(−x/β)), where α and β are nonnegative trainable parameters that enable SRS to adaptively adjust its output and provide a zero-mean property for enhanced generalization and training speed. SRS also has two more advantages over the commonly used ReLU function: (i) it has a nonzero derivative in the negative portion of the function, and (ii) a bounded output, i.e., the function takes values in the range [αβ/(β − αe), α), which is in turn controlled by the α and β parameters. We use two different versions of this activation, depending on whether the parameter β is fixed (labeled here as Soft Learnable) or learnable (labeled here as Soft Learnable2).
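A one-line sketch of the SRS form given above (alpha and beta nonnegative; Soft Learnable keeps beta fixed, Soft Learnable2 also trains it):

srs = @(x, alpha, beta) x ./ (x ./ alpha + exp(-x ./ beta));   % Soft Root Sign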

Splash
Splash [64] is another modification of APLU that makes the function symmetric. In the definition of APLU, let a_i and b_i be the learnable parameters leading to APLU_{a_i,b_i}(x); Splash is then built from APLU so that the hinges are symmetric with respect to the origin. The authors in [65] claim that this makes the network more robust against adversarial attacks.

2D MeLU (New)
The 2D Mexican ReLU (2D MeLU) is a novel activation function presented here that is not defined component-wise; instead, every output neuron depends on two input neurons. If a layer has N neurons (or channels), each output combines a pair of inputs through two-dimensional Mexican hat type functions. The parameter a_{u,v} is a two-dimensional vector whose entries are the same as those used in MeLU; in other words, a_{u,v} = (a_u, a_v) as defined in Table 1. Likewise, λ_{max(u,v)} is defined as it is for MeLU in Table 1.

MeLU + GaLU (New)
MeLU + GaLU is an activation function presented here for the first time that is, as its name suggests, the weighted sum of MeLU and GaLU, where the weight a_i is a learnable parameter.

Symmetric MeLU (New)
Symmetric MeLU is the equivalent of MeLU but is symmetric, like Splash. Symmetric MeLU is obtained by applying MeLU to both x_i and −x_i with shared coefficients; in other words, the k coefficients of MeLU(x_i) are the same as those of MeLU(−x_i).

Symmetric GaLU (New)
Symmetric GaLU is the equivalent of Symmetric MeLU but uses GaLU instead of MeLU: GaLU is applied to both x_i and −x_i with shared coefficients, so the k coefficients of GaLU(x_i) are the same as those of GaLU(−x_i).

Flexible MeLU (New)
Flexible MeLU is a modification of MeLU in which the peaks of the Mexican hat functions are also learnable. This makes it more similar to APLU, whose points of nondifferentiability are likewise learnable. Compared to MeLU, Flexible MeLU therefore has more learnable parameters.

Building CNN Ensembles
One of the objectives of this study is to combine the two CNNs equipped with the different activation functions discussed above using several ensemble-building methods. Two methods deserve discussion here: Sequential Forward Floating Selection (SFFS) [71] and the stochastic method for combining CNNs introduced in [47].

Sequential Forward Floating Selection (SFFS)
A popular method for selecting an optimal set of descriptors, SFFS [71] has been adapted here for selecting the best performing and most independent classifiers to add to the ensemble. In applying the SFFS method, each model to be included in the final ensemble is selected by adding, at each step, the model that provides the largest increase in performance over the existing subset of models. A backtracking step is then performed to exclude the worst model from the current ensemble.
This method for combining CNNs is labeled Selection in the experimental section. Since SFFS requires a training phase, we perform a leave-one-data-set-out procedure to select the best-suited models.
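A simplified sketch of the selection loop is given below. It assumes a function evalEnsemble that returns validation accuracy for a subset of model indices (a placeholder for the leave-one-data-set-out evaluation described above), and it omits the usual stopping criterion on ensemble size.

% Simplified SFFS over trained models (greedy forward step + one backtracking step)
selected  = [];                  % indices of models currently in the ensemble
remaining = 1:numModels;         % numModels: number of candidate networks
while ~isempty(remaining)
    % forward step: add the candidate giving the best ensemble accuracy
    accs = arrayfun(@(m) evalEnsemble([selected m]), remaining);
    [bestAcc, idx] = max(accs);
    selected = [selected remaining(idx)];
    remaining(idx) = [];
    % backward step: drop a member if doing so improves accuracy
    if numel(selected) > 2
        dropAccs = arrayfun(@(i) evalEnsemble(selected([1:i-1 i+1:end])), 1:numel(selected));
        [dropAcc, worst] = max(dropAccs);
        if dropAcc > bestAcc
            selected(worst) = [];
        end
    end
end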

Stochastic Method (Stoc)
The stochastic approach [47] randomly substitutes the activation layers in a CNN architecture with activations selected from a pool of candidates. Random selection is repeated many times to generate a set of networks that are then fused together. The candidate activation functions within a pool differ depending on the CNN architecture: some activation functions perform poorly and some quite well on a given CNN, with quite a large variance. The activation functions included in the pools for each of the CNNs tested here are provided in Table 3. The CNN ensembles randomly built from these pools vary in size, as noted in the experimental section, which investigates the different ensembles. Ensemble decisions are combined by sum rule: the softmax probabilities produced for a sample by all the networks are averaged, and the resulting score is used for classification. The stochastic method of combining CNNs is labeled Stoc in the experimental section. It should be noted that there is no risk of overfitting in the proposed ensemble: the replacement is performed randomly and is not tuned to any particular data set. Overfitting could occur if the activation functions (AFs) were chosen ad hoc for each data set; the aim of this work is to propose an ensemble based on the stochastic selection of AFs precisely to avoid that risk. The disadvantage of our approach is the increased computation time needed to generate the ensembles. As a final note, since 2D MeLU, Splash, and SRS obtain low performance when run with MI = 255 using VGG16, we ran those tests on only a few data sets; AFs that demonstrated poor performance were dropped to reduce computational time.
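The following MATLAB sketch outlines the idea; for simplicity it draws one activation per network copy, whereas the actual method may draw a new activation for every replaced layer. buildNetWithActivation, testImages, nTest, and nClasses are placeholders for the replacement/fine-tuning code and the test data.

pool  = {'ReLU', 'LeakyReLU', 'MeLU', 'GaLU', 'Swish', 'SReLU'};  % example pool (see Table 3)
nNets = 15;                                       % networks per stochastic ensemble
scores = zeros(nTest, nClasses);
for n = 1:nNets
    af  = pool{randi(numel(pool))};               % random activation from the pool
    net = buildNetWithActivation('ResNet50', af); % placeholder: replace layers + fine-tune
    scores = scores + predict(net, testImages);   % accumulate softmax probabilities
end
[~, predictedClass] = max(scores ./ nNets, [], 2);  % sum (average) rule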

Biomedical Data Sets
There is no fixed definition of small or midsize data sets that applies to all fields of data mining; whether a data set is considered large or small is relative to the task and to the publication date of the research. Because many deep learning algorithms require large data sets to avoid overfitting, the expectation today is to produce extremely large data sets. In this work, we consider a data set small if it contains fewer than 1000 images and midsize if it contains between 1000 and 10,000 images.
Each of the activation functions detailed in Section 3 is tested on the CNNs using the following fifteen publicly available biomedical data sets:
1. RN (RNAi data set [73]): a data set of 200 fluorescence microscopy images of fly cells (D. melanogaster) divided into ten classes. Each class contains 1024 × 1024 TIFF images of phenotypes produced from one of ten knock-down genes, the IDs of which form the class labels.
4. MA (C. elegans Muscle Age data set [73]): this data set is for classifying the age of a nematode given twenty-five images of C. elegans muscles collected at four ages representing the classes.
5. TB (Terminal Bulb Aging data set [73]): the companion data set to MA, containing 970 images of C. elegans terminal bulbs collected at seven ages representing the classes.
6. LG (Liver Gender Caloric Restriction (CR) data set [73]): this data set contains 265 images of liver tissue sections from six-month-old male and female mice on a CR diet; the two classes represent the gender of the mice.
8. LA (Liver Aging Ad libitum data set [73]): this data set contains 529 images of liver tissue sections from female mice on an ad libitum diet divided into four classes representing the age of the mice.
9. CO (Colorectal Cancer [74]): a Zenodo data set (record: 53169#.WaXjW8hJaUm) of 5000 histological images (150 × 150 pixels each) of human colorectal cancer divided into eight classes.
10. BGR (Breast Grading Carcinoma [75]): a Zenodo data set (record: 834910#.Wp1bQ-jOWUl) that contains 300 annotated histological images of twenty-one patients with invasive ductal carcinoma of the breast representing three classes/grades 1-3.
11. LAR (Laryngeal data set [76]): a Zenodo data set (record: 1003200#.WdeQc-nBx0nQ) containing 1320 images of thirty-three healthy and early-stage cancerous laryngeal tissues representative of four tissue classes.
12. HP (set of immunohistochemistry images from the Human Protein Atlas [77]): a Zenodo data set (record: 3875786#.XthkoDozY2w) of 353 images of fourteen proteins in nine normal reproductive tissues belonging to seven subcellular locations. The data set in [77] is partitioned into two folds, one for training (177 images) and one for testing (176 images).
13. RT (2D 3T3 Randomly CD-Tagged Images: Set 3 [78]): this collection of 304 2D 3T3 randomly CD-tagged images was created by randomly generating CD-tagged cell clones and imaging them by automated microscopy. The images are divided into ten classes. As in [78], the proteins are put into ten folds so that images in the training and testing sets never come from the same protein.
14. LO (Locate Endogenous data set [79]): this fairly balanced data set contains 502 images of endogenous cells divided into ten classes: Actin-cytoskeleton, Endosomes, ER, Golgi, Lysosomes, Microtubule, Mitochondria, Nucleus, Peroxisomes, and PM. This data set is archived at https://integbio.jp/dbcatalog/en/record/nbdc00296 (accessed on 9 August 2022).
15. TR (Locate Transfected data set [79]): the companion data set to LO. TR contains 553 images divided into the same ten classes as LO plus the additional class of Cytoplasm, for a total of eleven classes.
Data sets 1-8 can be downloaded at https://ome.grc.nia.nih.gov/iicbu2008/ (accessed on 9 August 2022); data sets 9-12 can be found on Zenodo at https://zenodo.org/record/ (accessed on 9 August 2022) by appending the data set's Zenodo record number provided in the descriptions above to this URL. Data set 13 is available at http://murphylab.web.cmu.edu/data/#RandTagAL (accessed on 9 August 2022), and data sets 14 and 15 are available on request. Unless otherwise noted, the five-fold cross-validation protocol is applied (see Table 3 for details), and the Wilcoxon signed-rank test [80] is used to validate the experiments.

Experimental Results
Reported in Tables 4 and 5 is the performance (accuracy) of the different activation functions on the CNN topologies VGG16 and ResNet50, each trained with a batch size (BS) of 30 and a learning rate (LR) of 0.0001 for 20 epochs (the last fully connected layer has an LR 20 times larger than the rest of the layers, i.e., 0.002), except the stochastic architectures, which are trained for 30 epochs because of their slower convergence. These settings were selected to reduce computation time. Images were augmented with random reflections on both axes and two independent random rescales of both axes by factors uniformly sampled in [1, 2] (using MATLAB data augmentation procedures); the objective was to rescale both the vertical and horizontal proportions of each image. For each stochastic approach, a set of 15 networks was built and combined by sum rule. We trained the models using MATLAB 2019b; note that the pre-trained architectures shipped with newer versions perform better. The performance (accuracy) of the ensembles is reported in the same tables. The most relevant results reported in Table 4 on ResNet50 can be summarized as follows:

• Ensemble methods outperform stand-alone networks. This result confirms previous research showing that changing activation functions is a viable method for creating ensembles of networks. Note how strongly 15ReLU outperforms (p-value of 0.01) the stand-alone ReLU.
• Among the stand-alone ResNet50 networks, ReLU is not the best activation function. The two activations that reach the highest performance on ResNet50 are MeLU (k = 8) with maxInput = 255 and Splash with maxInput = 255. According to the Wilcoxon signed rank test, MeLU (k = 8) with maxInput = 255 outperforms ReLU with a p-value of 0.1. There is no statistical difference between MeLU (k = 8) and Splash (with maxInput = 255 for both).
• According to the Wilcoxon signed rank test, Stoc_4 and Stoc_2 are similar in performance, and both outperform the other stochastic approach with a p-value of 0.1.
• Stoc_4 outperforms eALL, 15ReLU, and Selection with a p-value of 0.1. Selection outperforms 15ReLU with a p-value of 0.01, but Selection's performance is similar to eALL.
Examining Figure 5, which illustrates the average rank of the different methods used in Table 4 (ensembles in dark blue, stand-alone networks in light blue), it is clear that: (a) there is not a clear winner among the different AFs; (b) ensembles work better than stand-alone approaches; and (c) the methods named Stoc_x work better than the other ensembles.

The most relevant results reported in Table 5 on VGG16 can be summarized as follows:
• Again, the ensemble methods outperform the stand-alone CNNs. As was the case with ResNet50, 15ReLU strongly outperforms (p-value of 0.01) the stand-alone CNN with ReLU.
• Among the stand-alone VGG16 networks, ReLU is not the best activation function. The two activations that reach the highest performance on VGG16 are MeLU (k = 4) with maxInput = 255 and GaLU with maxInput = 255. According to the Wilcoxon signed rank test, there is no statistical difference between ReLU, MeLU (k = 4) with MI = 255, and GaLU with MI = 255.
• Interestingly, ALL with maxInput = 1 outperforms eALL with a p-value of 0.05.
• Stoc_4 outperforms 15ReLU with a p-value of 0.01, but the performance of Stoc_4 is similar to eALL, ALL (maxInput = 1), and Selection.

Considering both ResNet50 and VGG16, the best AF is MeLU (k = 8) with MI = 255. It outperforms ReLU with a p-value of 0.1 on ResNet50 and a p-value of 0.16 on VGG16. Interestingly, the best average AF is a learnable one that works even on small/midsize data sets. Figure 6 provides a graph reporting the average rank of the different AFs and ensembles for VGG16. As with ResNet50 (see Figure 5), it is clear that ensembles of AFs outperform the baseline 15ReLU and the stand-alone networks. With VGG16, the performance of Stoc_4 is similar to eALL and Selection.

To further substantiate the power of different AFs in ensembles with small to midsize data sets, Table 6 reports a further batch of tests comparing 15ReLU and the ensembles built by varying the activation functions on five different CNN topologies, each trained with a batch size (BS) of 30 and a learning rate (LR) of 0.001 for 20 epochs (the last fully connected layer has an LR 20 times larger than the rest of the layers, i.e., 0.02) using MATLAB 2021b. Note that these parameters are slightly different from those of the previous tests. We did not run tests using VGG16 due to computational issues. The tested CNNs include MobileNetV2, DenseNet201, ResNet50, and DarkNet53 (see Table 6). As in the previous tests, training images were augmented with random reflections on both axes and two independent random rescales of both axes by factors uniformly sampled in [1, 2] (using MATLAB data augmentation procedures); the objective was to rescale both the vertical and horizontal proportions of each image.
The most relevant results reported in Table 6 can be summarized as follows:
• The ensembles strongly outperform (p-value 0.01) the stand-alone CNN in each topology.
• In MobileNetV2, DenseNet201, and ResNet50, Stoc_4 outperforms 15ReLU (p-value 0.05).
• DarkNet53 behaves differently: on this network, 15Leaky and Stoc_4 obtain similar performance.
In Table 7, we report the performance on a few data sets obtained by ResNet50 when the values of BS and LR are chosen to be optimal for ReLU. Even with BS and LR optimized for ReLU, the performance of Stoc_4 is higher than that obtained by ReLU and 15ReLU. In Table 8, we report some computation-time tests. Hardware improvements continue to reduce inference time, and there are many applications where classifying 100 images in a few seconds is not a problem.
In Table 9, we report the four best AFs for each topology with both MI = 1 and MI = 255. If we consider the two larger data sets, CO and LAR, the best AF is always a learnable one:
• CO-ResNet: the best is Swish Learnable;
• LAR-ResNet: the best is 2D MeLU;
• CO-VGG16: the best is MeLU + GaLU;
• LAR-VGG16: the best is MeLU (k = 4).
Notably, several of the best performing AFs are among those proposed here.

Conclusions
The goal of this study was to evaluate some state-of-the-art deep learning techniques on medical images and data. Towards this aim, we evaluated the performance of CNN ensembles created by replacing the ReLU layers with activations from a large set of activation functions, including six new activation functions introduced here for the first time (2D Mexican ReLU, TanELU, MeLU + GaLU, Symmetric MeLU, Symmetric GaLU, and Flexible MeLU). Tests were run on two different networks: VGG16 and ResNet50, across fifteen challenging image data sets representing various tasks. Several methods for making ensembles were also explored.
Experiments demonstrate that an ensemble of multiple CNNs that differ only in their activation functions outperforms the results of single CNNs. Experiments also show that, among the single architectures, there is no clear winner.
More studies need to investigate the performance gains offered by our approach on even more data sets. It would be of value, for instance, to examine whether the boosts in performance our system achieved on the type of data tested in this work would transfer to other types of medical data, such as Computer Tomography (CT) and Magnetic Resonance Imaging (MRI), as well as image/tumor segmentation. Studies such as the one presented here are difficult, however, because investigating CNNs requires enormous computational resources. Nonetheless, such studies are necessary to increase the capacity of deep learners to classify medical images and data accurately.