Stochastic Activation Function Layers for Convolutional Neural Networks

Abstract: In recent years, the field of deep learning has achieved considerable success in pattern recognition, image segmentation, and many other classification fields. There are many studies and practical applications of deep learning for image, video, and text classification. In this study, we suggest a method for changing the architecture of the best performing CNN models with the aim of designing new models to be used as stand-alone networks or as components of an ensemble. We propose to replace each activation layer of a CNN (usually a ReLU layer) with a different activation function stochastically drawn from a set of activation functions: in this way the resulting CNN has a different set of activation function layers. The code developed for this work will be available at https://github.com/LorisNanni


Introduction
Deep neural networks have become extremely popular as they achieve state-of-the-art performance on a variety of important applications including image classification, image segmentation, language processing and computer vision [1]. Deep neural networks typically have a set of linear components, whose parameters are learned to fit the data, and a set of nonlinearities, which are pre-specified, typically in the form of a sigmoid, a tanh function, a rectified linear unit, or a max-pooling function. The presence of non-linear activation functions at each neuron is essential to give the network the ability to approximate arbitrarily complex functions [2], and their choice affects both the speed of training and the accuracy of the network. The design of new activation functions that improve training speed and network accuracy is an active area of research [3] [4]. Recently, the sigmoid and hyperbolic tangent, which were the most widely used activation functions, have been replaced by the Rectified Linear Unit (ReLU) to train deep networks [5]: ReLU is a piecewise linear function equivalent to the identity for positive inputs and zero for negative ones. Motivated by the good performance of ReLU, which is fast, effective, and simple to evaluate, several alternatives to the standard ReLU function have been proposed in the literature. The best known "fixed" activation functions are: Leaky ReLU [6], an activation function equal to ReLU for positive inputs but having a very small slope α > 0 for negative ones; ELU [4], which exponentially decreases to a limit point α in the negative space; and SELU [7], a scaled version of ELU (by a constant λ). Together with "fixed" activation functions, several "learnable" activation functions have been proposed, such as: the Parametric ReLU (PReLU) [8], which is a Leaky ReLU where the slope α is learned during training; the Adaptive Piecewise Linear Unit (APLU) [3], which is a piecewise linear activation with learnable parameters; Swish, one of the best performing functions according to [9], which is the combination of a sigmoid function and a trainable parameter; and the recent Mexican ReLU (MeLU) [10], a piecewise linear activation function given by the sum of PReLU and multiple Mexican hat functions.
In this paper, we perform a large-scale empirical comparison of different activation functions across a variety of image classification tasks and an image segmentation problem. Starting from two of the best performing models, i.e. ResNet50 [11] for the classification tasks and DeepLabv3+ [12] for the segmentation task, we compare different approaches for replacing ReLU layers and different methods for building ensembles of CNNs obtained by varying the activation function layers.
After comparing several activation functions, we propose to design a new model based on the combination of different activation functions at different levels of the network graph: to this aim, we propose a method for the stochastic selection of activation functions to replace each ReLU layer of the starting network. The activation functions are randomly selected from a set of 9 approaches, including the most effective ones. After training the new models on the target problem, they are fused together to build an ensemble of CNNs.
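As a rough illustration of this idea, the following minimal PyTorch sketch replaces every ReLU module of a pretrained ResNet50 with an activation drawn at random from a small pool. The `replace_activations` helper and the reduced pool shown here are our own illustrative assumptions, not the released code, which also draws learnable functions such as MeLU:

```python
import random
import torch.nn as nn
from torchvision.models import resnet50

# Illustrative pool of candidate activations: the paper draws from 9 functions,
# including learnable ones such as MeLU; only built-in modules are shown here.
ACTIVATION_POOL = [nn.ReLU, nn.LeakyReLU, nn.ELU, nn.SELU, nn.PReLU]

def replace_activations(module: nn.Module) -> None:
    """Recursively replace every ReLU layer with a randomly drawn activation."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, random.choice(ACTIVATION_POOL)())
        else:
            replace_activations(child)

# Build an ensemble of stochastically modified ResNet50 models: each model gets
# its own random activation layers and is then fine-tuned on the target problem.
ensemble = [resnet50(weights="IMAGENET1K_V1") for _ in range(10)]
for model in ensemble:
    replace_activations(model)
```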
The proposed framework for ensemble creation is evaluated on two different applications: image classification and image segmentation. In the image classification field, we deal with several medical problems, including 13 image classification datasets in our benchmark. CNNs have already been used on several medical datasets reaching very high performance, including the detection of keratinocyte carcinomas and malignant melanomas [13], the classification of thyroid nodules from ultrasound images [14], and breast cancer recognition [15]. Our testing protocol includes fine-tuning each model on each dataset, followed by evaluation and comparison on the test data: our experiments show that the proposed method works well in all the tested problems, gaining state-of-the-art classification performance [16].
In the image segmentation field, we deal with the skin segmentation problem: the discrimination of skin and non-skin regions in a digital image has a wide range of applications including face detection [17], body tracking [18], gesture recognition [19] and objectionable content filtering [20]. Skin detection also has great relevance in the medical field, where it is employed as a component of face detection or body tracking: for example, in the remote photoplethysmography (rPPG) problem [21] it is a component of a system aimed at estimating the heart rate of a subject given a video stream of his/her face. In our experiments, we carry out a comparison of several approaches performing a single training on a small dataset including only 2000 labeled images, while testing is performed on 11 different datasets including images from very different applications. The reported results show that the proposed method is able to reach state-of-the-art performance [22] in most of the benchmark datasets even without ad-hoc tuning.

Materials and Methods
In this section we describe both the starting models and the stochastic method proposed to design new CNN models. CNNs are deep neural networks designed to work similarly to the human brain in visually perceiving the world. They are made of several types of "layers" of neurons: convolutional layers, activation layers, subsampling layers and fully connected layers [23]. In particular, activation layers are aimed at deciding whether a neuron fires or not according to a nonlinear transformation of the input signal. Since activation functions play a vital role in the training of a CNN, several activation functions have been proposed: in this work we evaluate several variants of the standard ReLU function, which are discussed in the next section.
In the literature several CNN architectures have been proposed both for image classification (e.g. AlexNet [24], GoogleNet [25], InceptionV3 [26], VGGNet [27], ResNet [11], DenseNet [28]) and for segmentation problems (e.g. SegNet [29], U-Net [30], DeepLabv3+ [12]). In our experiments, we select two of the best performing models: ResNet50 [11] for image classification and DeepLabv3+ [12] for segmentation. ResNet50 is a 50-layer network which introduces a "network-in-network" architecture using residual layers. ResNet50, the winner of ILSVRC 2015, is one of the best performing and most popular architectures used for image classification. In our experiments, all the models for image classification have been fine-tuned on the training set of each classification problem according to the following options: batch size 32, learning rate 0.0001, max epoch 30 (data augmentation based on random reflection on both axes and two independent random rescales of both axes by factors uniformly sampled in [1,2] has been used in all the epochs).
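The exact augmentation code is not reproduced here; the following is a minimal sketch of the reflection/rescale augmentation described above, assuming PIL images and torchvision (the function name and the 0.5 flip probabilities are our own illustrative choices):

```python
import random
from PIL import Image
import torchvision.transforms.functional as TF

def augment(img: Image.Image) -> Image.Image:
    """Random reflection on both axes plus two independent random rescales
    of the axes by factors uniformly sampled in [1, 2]."""
    if random.random() < 0.5:              # reflection along the vertical axis
        img = TF.hflip(img)
    if random.random() < 0.5:              # reflection along the horizontal axis
        img = TF.vflip(img)
    w, h = img.size                        # PIL stores size as (width, height)
    scale_h = random.uniform(1.0, 2.0)     # independent scale factor per axis
    scale_w = random.uniform(1.0, 2.0)
    return TF.resize(img, [int(h * scale_h), int(w * scale_w)])
```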
For image segmentation purposes we select DeepLabv3+ [12], a recent architecture based on atrous convolution, in which the filter is not applied to all adjacent pixels of an image but rather to a spaced-out lattice of pixels. DeepLabv3+ uses four parallel atrous convolutions (each with a different atrous rate) followed by a "Pyramid Pooling" method. Since DeepLabv3+ is based on an encoder-decoder structure, it can be built on top of a powerful pretrained CNN architecture: in this work, we again selected ResNet50 for this task, although our internal evaluation showed that ResNet101 and ResNet34 obtained similar performance. All the models for skin segmentation have been trained on a small dataset of 2000 images using the same options: batch size 32, learning rate 0.001, max epoch 50 (data augmentation has been used only in the first 30 epochs), class weighting.
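In standard deep learning frameworks, atrous convolution corresponds to the dilation parameter of an ordinary convolution. The sketch below (PyTorch, with illustrative rates and channel sizes of our own choosing, and omitting the image-level pooling branch) shows four parallel atrous convolutions of the kind used by DeepLabv3+:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)                     # dummy encoder feature map

# Atrous (dilated) 3x3 convolutions: the kernel is applied to a spaced-out
# lattice of pixels; dilation=1 is an ordinary convolution, larger rates
# enlarge the receptive field without adding parameters.
branches = [nn.Conv2d(256, 256, kernel_size=3, padding=rate, dilation=rate)
            for rate in (1, 6, 12, 18)]             # illustrative parallel rates

outputs = [conv(x) for conv in branches]            # all outputs keep the 64x64 size
fused = torch.cat(outputs, dim=1)                   # concatenated before pooling/projection
print(fused.shape)                                  # torch.Size([1, 1024, 64, 64])
```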
This study considers 10 different activation functions (more details, and specific references for each function, are given in [10]), namely the widely used ReLU and several variants. The functions used are summarized in the following, together with their derivatives.

The well-known ReLU activation function is defined as f(x) = x for x ≥ 0 and f(x) = 0 for x < 0, and its derivative is easily evaluated as f'(x) = 1 for x > 0 and f'(x) = 0 for x < 0.

This work also considers several variants of the original ReLU function. The first variant is the Leaky ReLU function, defined as f(x) = x for x > 0 and f(x) = a·x for x ≤ 0, where a is a small real number (0.01 in this study). The main advantage of Leaky ReLU is that the gradient is always positive (no point has a zero gradient): f'(x) = 1 for x > 0 and f'(x) = a for x ≤ 0.

The second variant of the ReLU function considered in this work is the Exponential Linear Unit (ELU), defined as f(x) = x for x > 0 and f(x) = a·(e^x − 1) for x ≤ 0, where a is a real number (1 in this study). ELU has a gradient that is always positive: f'(x) = 1 for x > 0 and f'(x) = a·e^x for x ≤ 0.

The third variant of ReLU is the Scaled Exponential Linear Unit (SELU), defined as f(x) = s·x for x > 0 and f(x) = s·a·(e^x − 1) for x ≤ 0, where a and s are real numbers, in our case a = 1.6733 and s = 1.0507. SELU is very similar to ELU with the additional scaling parameter s, which can be used to correct the gradient when it is exploding or vanishing. The gradient in this case is given by f'(x) = s for x > 0 and f'(x) = s·a·e^x for x ≤ 0.

The Parametric ReLU (PReLU) is the fourth variant considered here. It is defined by f(x) = x for x > 0 and f(x) = a_c·x for x ≤ 0, where the a_c are a set of real numbers, one for each input channel. PReLU is similar to Leaky ReLU, the only difference being that the parameters a_c are learned. The gradient of PReLU with respect to the input is 1 for x > 0 and a_c for x ≤ 0, while the gradient with respect to the learnable parameter a_c is x for x ≤ 0 and 0 otherwise.

The S-Shaped ReLU (SReLU) is the fifth variant. It is defined as a piecewise linear function: f(x) = t^l + a^l·(x − t^l) for x ≤ t^l, f(x) = x for t^l < x < t^r, and f(x) = t^r + a^r·(x − t^r) for x ≥ t^r. In this case four learnable parameters are used, t^l, a^l, t^r and a^r, expressed as real numbers. They are initialized as t^l = 0, a^l = 0 and t^r = maxInput, where maxInput is a hyperparameter. SReLU is highly flexible thanks to the rather large number of tunable parameters. The gradient with respect to the input is a^l for x ≤ t^l, 1 for t^l < x < t^r, and a^r for x ≥ t^r.

The sixth variant is APLU, namely the Adaptive Piecewise Linear Unit. As the name suggests, it is a piecewise linear function, defined as f(x) = max(0, x) + Σ_c a_c·max(0, b_c − x), where a_c and b_c are real numbers, one for each input channel. The gradients with respect to the learnable parameters are evaluated as ∂f/∂a_c = max(0, b_c − x) and ∂f/∂b_c = a_c for x < b_c and 0 otherwise. In our tests the parameters a_c are initialized to 0, and the points b_c are randomly chosen. We also added an L2-penalty of 0.001 to the norm of the parameters a_c.

An interesting variant is MeLU, that is, the Mexican ReLU, derived from the Mexican hat functions. These are defined as φ_{a,λ}(x) = max(λ − |x − a|, 0), where a and λ are real numbers. These functions are used to define the MeLU function: MeLU_k(x) = PReLU_{c_0}(x) + Σ_{j=1}^{k−1} c_j·φ_{a_j,λ_j}(x). The parameter k represents the number of learnable parameters for each input channel, the c_j are the learnable parameters, c_0 is the parameter vector in PReLU, and the a_j and λ_j are fixed parameters chosen recursively. The MeLU activation function has interesting properties, inherited from the Mexican hat functions, which are continuous and piecewise differentiable. ReLU can be seen as a special case of MeLU, when all the parameters are set to 0. This is important because pre-trained networks based on the ReLU function can be enhanced in a simple way using MeLU. Similar substitutions can be made when the source network is based on Leaky ReLU and PReLU. As previously observed, MeLU is based on a set of learnable parameters. The number of parameters is sensibly higher with respect to SReLU and APLU, making MeLU more adaptable and with a higher representation power, but more likely to overfit. The gradient is given by the Mexican hat functions. The MeLU activation function also has a positive impact on the optimization stage.

In our work the learnable parameters c_j are initialized to 0, meaning that MeLU starts as a plain ReLU function; the peculiar properties of the MeLU function come into play at a later stage of the training. The Gaussian ReLU, also called GaLU, is the last activation function considered in our work. Its definition is based on Gaussian-type functions g_{a,λ}(x), which again depend on two real parameters a and λ. The GaLU activation function is then defined with a formulation similar to the one provided for MeLU, i.e. as the sum of PReLU and several Gaussian-type functions with learnable weights c_j. Again, the function is defined in this way to provide a good approximation of nonlinear functions.
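To make the formulation above concrete, here is a minimal PyTorch sketch of MeLU. It uses a single set of parameters, whereas the paper learns one set per input channel and chooses the fixed a_j and λ_j recursively; the evenly spaced centres and widths below are only an illustrative assumption:

```python
import torch
import torch.nn as nn

class MeLU(nn.Module):
    """Sketch of Mexican ReLU: PReLU plus a weighted sum of Mexican hat
    functions max(lambda_j - |x - a_j|, 0) with learnable weights c_j."""
    def __init__(self, k=4, max_input=1.0):
        super().__init__()
        self.c0 = nn.Parameter(torch.zeros(1))        # PReLU slope for negative inputs
        self.c = nn.Parameter(torch.zeros(k - 1))     # hat weights (init 0 -> plain ReLU)
        # Fixed centres a_j and widths lambda_j (illustrative, evenly spaced here).
        self.register_buffer("a", torch.linspace(0.0, max_input, k - 1))
        self.register_buffer("lam", torch.full((k - 1,), max_input / (k - 1)))

    def forward(self, x):
        out = torch.where(x > 0, x, self.c0 * x)      # PReLU component
        for j in range(self.c.numel()):
            hat = torch.clamp(self.lam[j] - (x - self.a[j]).abs(), min=0)
            out = out + self.c[j] * hat               # add weighted Mexican hat
        return out
```

With all the c_j (and c_0) initialized to zero the module behaves exactly like ReLU, so it can directly replace the ReLU layers of a pretrained network, as described above.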

Results
In order to evaluate the different activation functions detailed in Section 2 and to validate the stochastic method for ensemble creation, we performed experiments on 13 well-known medical datasets for image classification and 11 datasets for skin segmentation. Table 1 summarizes the 13 classification datasets, including a short abbreviation, the dataset name, the number of samples and classes, and the testing protocol. We used five-fold cross-validation (5CV) in 12 out of the 13 datasets, while we maintain a three-fold division for the Laryngeal dataset (the same protocol as [31]). Table 2 summarizes the 11 datasets used for skin segmentation. All the models have been trained only on the first 2000 images of the ECU dataset [32]; therefore, all the other skin datasets are used only for testing (for ECU, only the last 2000 images, not included in the training set, have been used for testing).
The evaluation and comparison of the proposed approaches is performed according to two of the most used performance indicators in image classification and skin segmentation: accuracy and the F1-measure, respectively. Accuracy is the ratio between the number of correct predictions and the total number of samples, while the F1-measure is the harmonic mean of precision and recall and is calculated according to the formula F1 = 2·tp / (2·tp + fn + fp), where tn, fn, tp and fp are the number of true negatives, false negatives, true positives and false positives evaluated at pixel level. Following other works on skin detection, F1 is calculated at pixel level instead of image level to be independent of the image size in the different databases. Finally, to validate the experiments, the Wilcoxon signed rank test [33] has been used.
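For reference, a minimal NumPy sketch of the pixel-level F1-measure defined above (the function name and the binary-mask assumption are ours; the aggregation over a whole dataset is omitted):

```python
import numpy as np

def f1_pixel_level(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Pixel-level F1 = 2*tp / (2*tp + fn + fp) for binary skin masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fn + fp
    return 2 * tp / denom if denom > 0 else 0.0
```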

In the first experiment, we evaluate the proposed methods for image classification on the datasets listed above: Table 3 reports the performance, in terms of accuracy, of several variants of ResNet50 obtained by varying the activation function, together with several ensembles. From the results in Table 3 we can draw the following conclusions:
• The method named ReLu is our baseline, since it corresponds to the standard implementation of ResNet50. ReLu performs very well, but it is not the best performing activation function: many activation functions with INPUT parameter 255 work better than ReLu on average.
• It is a very valuable result that methods such as wMeLU(255), MeLU(255) and some other stand-alone approaches strongly outperform ReLu in a large selection of classification problems. Starting from a model pretrained with ReLu and changing its activation layers, we obtained a sensible error reduction. This means that our approach can boost the performance of the original ResNet50 on a large set of problems.
• It is difficult to select a function that wins in all problems; therefore, a good way to improve performance is to create an ensemble of different models: both FusAct10 and FusAct10(255) work better than each of their single components.
• Designing the model by means of stochastic activation functions (i.e. Random or Random(255)) gives valuable results, above all in the creation of ensembles: indeed, both FusRan10 and FusRan10(255) perform very well compared to all the stand-alone models and also to the ensembles above. FusRan10(255) is the first ranked method tested in these experiments.
• The two small ensembles FusAct3(255) and FusRan3(255) perform very well, strongly outperforming the stand-alone approaches and gaining performance comparable to heavier ensembles (composed of 10 or 20 models).
In the second experiment, we evaluate the proposed methods for skin segmentation on the 11 datasets listed above: Table 4 reports, in terms of F1-measure, the performance of several variants of DeepLabv3+ obtained by varying the activation function, together with the following ensembles:
• ReLu is the standard DeepLabv3+ segmentation CNN based on a ResNet50 encoder. This net has shown state-of-the-art performance for skin segmentation [22].
For each dataset, the best result is highlighted, and the last two columns report the average F1-measure and the rank (calculated on the average F1).
Table 4. Performance of the proposed approaches in the skin datasets (F1-measure).
From the results in Table 4 it is clear that:

• In this problem the activation functions with INPUT parameter 1 work better than those initialized with 255; therefore, we fixed the INPUT to 1 for the ensembles with 3 models (FusAct3 and FusRan3).
• Similarly to the image classification experiment, the ensembles work better than the stand-alone approaches: Fus20 is the best ranked method in our experiments, but two "lighter" ensembles such as FusAct3 and FusAct10 also gain very high performance.
• As in the classification problem, our approaches outperform ReLu, i.e. the standard DeepLabv3+ based on ResNet50, a state-of-the-art approach for image segmentation.
• The reported results show that the proposed ensemble methods are able to reach state-of-the-art performance [22] in most of the benchmark datasets.
Finally, we report some comparisons based on the Wilcoxon signed rank test, see Table 5 and Table 6. Notice that different approaches are reported in Table 5 and Table 6: in each table we report the performance of the most interesting approaches for classification and segmentation, respectively (one approach for each ensemble size). The reported p-values confirm the conclusions drawn from Tables 3 and 4.
Table 5. P-values of the comparison among some tested approaches in the medical image classification experiment (< denotes that the method in the row wins, ^ denotes that the method in the column wins, = denotes that there were no statistically significant differences).

Finally, notice that using a GTX1080 the classification time of ResNet50 is 0.024 seconds per image; this means that, even using an ensemble of 20 CNNs, it is possible to classify about two images per second on a single GTX1080.
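The ensembles above combine the class scores of the individual networks. The following minimal sketch (PyTorch, assuming the models from the earlier sketch) shows one way of implementing the sum-rule fusion used for our ensembles; applying the sum to softmax outputs is our own assumption:

```python
import torch

@torch.no_grad()
def ensemble_predict(models, images):
    """Sum-rule fusion: sum the per-model class scores and pick the argmax."""
    total = None
    for model in models:
        model.eval()
        scores = torch.softmax(model(images), dim=1)   # per-model class probabilities
        total = scores if total is None else total + scores
    return total.argmax(dim=1)                          # predicted class for each image
```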

Conclusions
In this study, we proposed a method for CNN model design based on changing the architecture of the best performing CNN models by stochastic layer replacement. We proposed to replace each activation layer of a CNN with a different activation function stochastically drawn from a given set, in such a way that the resulting model has different activation function layers. This generation process introduces diversity among the models, making them suitable for ensemble creation. Interestingly, this design approach has gained very strong performance for ensemble creation: a set of ResNet50-like models designed using stochastic replacement of the ReLU layers and combined by sum rule strongly outperforms both the standard ResNet50 and a single stochastic ResNet50 in our experiments. A large experimental evaluation, carried out on a wide set of benchmark problems for both image classification and image segmentation, showed that our idea can be used to build a high performance ensemble of CNNs.
Even if these first results are limited to a single, although well-performing, model, we plan as future work to evaluate the proposed method on a larger class of models, including lighter architectures suitable for mobile devices. The difficulty of studies involving ensembles of CNNs lies in the enormous computational and memory resources required to conduct such experiments.
Another research direction is the selection of the initial set of activation functions according to their performance.