1. Introduction
The use of Neural Network (NN) models has been steadily increasing in the recent past, following the introduction of Deep Learning methods and the ever-growing computational capabilities of modern machines. Thus, such models are applied to various problems, including image classification [
1] and generation [
2], text classification [
3], speech recognition [
4], emotion recognition [
5], and many more. New and more complex network structures, such as Convolutional Neural Networks [
6], Neural Turing Machines [
7], and NRAM [
8], were developed and applied to the aforementioned tasks; such new problems and structures also required the development of new optimization techniques [
9,
10,
11].
Following these new trends, neuroevolution has also seen renewed interest [
12,
13,
14,
15]. The term neuroevolution identifies the research area where evolutionary algorithms are used to construct and train artificial neural networks. Several approaches have been proposed both to train the networks' weights and topology and to exploit the high generality of neuroevolution, which allows learning with nondifferentiable activation functions, without explicit targets, and with recurrent networks [
16,
17].
The traditional method used by neural networks to learn their weights and biases is the gradient descent algorithm applied to a cost function, and its most famous implementation is the backpropagation procedure. Nowadays, backpropagation is still the workhorse of learning in neural networks, even though its origin dates back to the 1970s; its importance was fully recognized in 1986 [
18].
Backpropagation works under two main assumptions about the form of the cost function: it has to be written as an average over the cost functions of the individual training examples x, and as a function of the outputs of the neural network. Moreover, the activation functions have to be differentiable.
That said, there are tricks for avoiding these kinds of problems, and finding alternatives to gradient descent is an active area of investigation. An interesting analysis of the reasons why backpropagation is the most used gradient-based technique to train neural networks, while evolutionary approaches are not sufficiently studied, is presented in [
15].
Since meta-heuristic algorithms are generally nondeterministic and insensitive to the differentiability and continuity of the objective functions, these methods are used in a wide range of complex optimization problems. In addition, stochastic global optimization can identify the global minimum without being trapped in local minima [
19,
20,
21].
The most used evolutionary approach in neuroevolution is the genetic one, extensively employed in the conventional neuroevolution (CNE) [
17,
22] and also recently proposed in the case of deep neuroevolution [
23]. In those algorithms, the best individuals (those with the highest fitness) are evolved by means of the mutation and crossover operators, and they replace the genotypes with the lowest fitness in the population. The genetic approach is the most used technique because it is easy to implement and practical in many domains. On the other hand, it suffers from an encoding problem, since it uses a discrete optimization method to solve continuous problems.
In order to avoid the encoding problem, other continuous evolutionary meta-heuristics have been proposed, including, in particular, differential evolution (DE). Indeed, DE evolves a population of real-valued vectors, so no encoding and decoding are required.
It is well known that DE performs better than other popular evolutionary algorithms [
24], has a quick convergence, and is robust [
25]; it also performs better for learning applications [
26]. At the same time, DE has simple genetic operations, such as its mutation operator and a survival strategy based on one-to-one competition. Moreover, it can use both global population information and local individual information to search for the optimal solution.
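For readers unfamiliar with DE, a minimal sketch of the classical DE/rand/1/bin scheme on a toy objective may help; all names and parameter values below are illustrative, not those of DENN.

```python
import numpy as np

def de_rand_1_bin(fitness, dim, pop_size=20, F=0.5, CR=0.9,
                  generations=200, seed=0):
    """Minimal DE/rand/1/bin: mutation, binomial crossover, and a
    one-to-one survival competition between target and trial."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, dim))
    fit = np.array([fitness(x) for x in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # three distinct individuals, all different from the target i
            a, b, c = rng.choice([j for j in range(pop_size) if j != i],
                                 size=3, replace=False)
            donor = pop[a] + F * (pop[b] - pop[c])      # mutation
            mask = rng.random(dim) < CR                 # binomial crossover
            mask[rng.integers(dim)] = True              # keep at least one donor gene
            trial = np.where(mask, donor, pop[i])
            t_fit = fitness(trial)
            if t_fit <= fit[i]:                         # one-to-one survival
                pop[i], fit[i] = trial, t_fit
    best = int(np.argmin(fit))
    return pop[best], fit[best]
```

Running it on, e.g., the 5-dimensional sphere function quickly drives the fitness close to zero, illustrating the quick convergence mentioned above.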
When the optimization problem is complex, the performance of the traditional DE algorithm depends on the selected control parameters and mutation strategy [
19,
27,
28,
29]. If the control parameters and the selected mutation strategy are unsuitable, DE is likely to suffer from premature convergence, stagnation phenomena, and excessive consumption of computational resources. In particular, the stagnation problem of DE applied to neural network optimization has been studied in [
30].
In this paper, the system DENN, which optimizes artificial neural networks using DE, is presented. The system uses a direct encoding with a one-to-one mapping between the weights of the neural networks and the values of the individuals in the population. This system is an enhanced version of the one introduced in [
12], where a preliminary implementation was described.
A batching system is introduced to overcome one of the main computational problems of the proposed approach, i.e., the fitness computation. For every generation, the population is evaluated on a limited number of training examples, given by the size of the current batch, rather than on the whole training set. This reduces the computational load, particularly on large training sets. Moreover, a restart method is applied to avoid a premature convergence of the algorithm: the best individual is saved and the rest of the current population is discarded, continuing the search on a newly generated random population.
Finally, a new self-adaptive mutation strategy, MAB-ShaDE, inspired by the multi-armed bandit algorithm UCB1 [31], and a new crossover operator, interm, a randomized version of the arithmetic crossover, have been proposed.
An extensive experimental study has been performed to (i) determine whether this approach is scalable and applicable to large classification problems, like MNIST digit recognition; (ii) study the performance reached by using the MAB-ShaDE and interm components; and (iii) identify the best algorithm configurations, i.e., the configurations reaching the highest accuracy.
The experimental results show that DENN is able to outperform the backpropagation algorithm in training neural networks without hidden layers. Moreover, DENN is a viable solution also from a computational point of view, even if the time spent for learning is higher than that of its competitor BPG.
The paper is organized as follows. Background concepts about neuroevolution, DE algorithm and its self-adaptive strategies are summarized in
Section 2, related works are presented in
Section 3, the system is presented in
Section 4, and experimental results are shown in
Section 5.
Section 6 closes the paper with some final considerations and some ideas for future works.
3. Related Works
The first DE-based optimizers for NNs were presented in the late ’90s and the early 2000s by [
40,
41], who presented and analyzed applications of DE to the problem of feedforward NN training. In recent times, new applications of evolutionary algorithms have been presented in the area of neuroevolution [
32].
The dominating evolutionary approach used is the genetic one [
17,
22]: this approach is used to optimize both the topology and the weights of the network, but in the latter case it is very limited by being a discrete approach. In the literature, several encodings for the real weights have been proposed, with genes represented either as a real-valued string or as a character sequence, which can be interpreted as real values with a specific precision using, for example, Gray-coded numbers.
More adaptive approaches have been suggested, for example in [
42] or more recently in [
43]. In the first paper, the authors presented a dynamic encoding, which depends on the exploitation and exploration phases of the search. In the second one, the authors proposed a self-adaptive encoding, where the string characters are interpreted as a system of particles whose center of mass determines the encoded value. Other adaptive approaches have been developed for network immunization and diffusion in link prediction [
44,
45].
Moreover, they also used a direct encoding that exploits the particular problem structure.
These methods are not general and cannot be easily extended to more general cases [
17]. In [
46], a direct encoding with a floating-point representation of the NN's weights is used. Precisely, the authors use the evolution strategy CMA-ES, a real-valued optimization algorithm, applied to a well-known reinforcement learning problem: pole balancing.
Among DE applications to neuroevolution, the most related works we have to cite are [
13,
14,
15,
30,
47], even if they apply the evolutionary meta-heuristics in a different way.
In [
47], the search exploration is enhanced by a DE algorithm with a modified best mutation operation: the algorithm is used to train the network and the global best value is used as a seed by the backpropagation procedure (BPG).
In [
13], three different methods (GA, DE, and EDA) are compared and used to train a simple network architecture with one hidden layer, the learning factor, and the seed for the weights initialization.
In [
14], the authors use the Adaptive DE (ADE) algorithm to calculate the initial weights and the thresholds of standard neural networks trained by BPG. The authors demonstrated that the system is effective to solve time series forecasting problems.
In [
15], a Limited Evaluation Evolutionary Algorithm (LEEA) is applied to optimize the weights of the network. This paper is related to our paper because we employ a similar batching system, in which minibatches are used in the training phase and are changed after a certain number of generations.
The work in [
30] has a strong connection with ours because the author studied how different mutation operators work to train neural networks. The results showed that the DEGL-trig (a composition of DEGL with Trigonometric mutation) is the best mutation operator to use with small NNs.
DE and the other enhancement methods permit our algorithm to train neural networks much larger than those used in [
15,
30]: whereas the maximum size handled in [
15] has less than 1500 weights and the maximum size handled in [
30] has only 46 weights, we are able to train a feedforward neural network for MNIST with more than 7000 weights.
4. The DENN Algorithm
This section describes the Differential Evolution for Neural Networks (DENN) algorithm. The idea is to apply Differential Evolution to the optimization of the NN's weights, taking into account the structure of the network.
Given a fixed topology and fixed activation functions, a population is defined as a set of N neural networks.
We decided to exploit the DE characteristic of working with continuous values by using a direct encoding based on a one-to-one mapping between the weights of the neural network and the individuals in the DE population.
More precisely, let the feedforward neural network be composed of L levels. Each level l of the network is defined by a real-valued matrix W_l and a real-valued vector b_l, representing, respectively, the connection weights and the bias values. Therefore, each population individual x is defined as the sequence (w_1, b_1, ..., w_L, b_L), where w_l is the real-valued vector obtained by linearizing the matrix W_l, for l = 1, ..., L.
For a population individual x, we indicate by x^(h) its h-th component, for h = 1, ..., 2L. For example, x^(h) = w_{(h+1)/2} if h is odd, whereas x^(h) = b_{h/2} if h is even.
Note that, for each solution, the component x^(h) is a vector whose size depends on the number of neurons in the corresponding level.
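As an illustration, the linearization and its inverse can be sketched as follows; this is a minimal sketch, and the function names are ours, not those of the DENN code.

```python
import numpy as np

def encode(weights, biases):
    """Direct one-to-one encoding: concatenate (W_l, b_l) pairs, level by level."""
    parts = []
    for W, b in zip(weights, biases):
        parts.append(np.ravel(W))   # linearized connection matrix
        parts.append(np.ravel(b))   # bias vector
    return np.concatenate(parts)

def decode(vec, shapes):
    """Rebuild the per-level matrices and bias vectors from a flat individual.
    `shapes` is a list of (W_shape, b_shape) pairs, one per level."""
    weights, biases, pos = [], [], 0
    for w_shape, b_shape in shapes:
        n = int(np.prod(w_shape))
        weights.append(vec[pos:pos + n].reshape(w_shape))
        pos += n
        n = int(np.prod(b_shape))
        biases.append(vec[pos:pos + n].reshape(b_shape))
        pos += n
    return weights, biases
```

Because the mapping is one-to-one, decoding the encoded vector recovers exactly the original matrices and bias vectors.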
The individuals of the population are evolved by applying the mutation and crossover operators in a component-wise way. For instance, the rand/1 mutation for the individual x_i is applied as follows: three indices a, b, c are randomly chosen in the set {1, ..., N} \ {i} without repetition; then, for h = 1, ..., 2L, the h-th component y_i^(h) of the donor individual y_i is calculated as the linear combination

y_i^(h) = x_a^(h) + F (x_b^(h) - x_c^(h)).
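The component-wise donor computation can be sketched as follows; individuals are represented here as lists of per-level vectors, and all names are ours.

```python
import numpy as np

def componentwise_rand1(pop, i, F, rng):
    """Donor for individual i: for every component h,
    y^(h) = x_a^(h) + F * (x_b^(h) - x_c^(h)),
    with a, b, c distinct indices, all different from i."""
    a, b, c = rng.choice([j for j in range(len(pop)) if j != i],
                         size=3, replace=False)
    return [pop[a][h] + F * (pop[b][h] - pop[c][h])
            for h in range(len(pop[i]))]
```

Note that each component of the donor keeps the shape of the corresponding weight or bias vector, so the network structure is preserved.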
The evaluation of a population element is performed by a fitness function f, which is the objective function to be optimized.
As proposed in many other efficient applications, we split the dataset D into three different subsets: a training set TS, a validation set VS, and a test set ES. The TS is used for the training phase, the VS is used at the end of each training phase for a uniform evaluation of the individuals, and ES is used on the best neural network in order to evaluate its performance.
As the evaluation phase is the most time consuming operation, and it can lead to unacceptable computation time if the fitness is computed on the whole dataset, we decided to use a batching method similar to the one proposed in [
15] by partitioning the training set TS into k batches B_1, ..., B_k, each of size b.
Note that the records in each batch should follow the same distribution, in order to avoid the risk of overfitting, i.e., of generating a model that is unable to generalize.
At each generation the population is evaluated against only a small number of training samples, given by the size of the current batch, instead of the whole training set. This reduces the computational load, especially on large training sets.
To reduce the problems arising when the batch is changed, and to obtain a smoother transition from one batch to the next, we defined a window U of size b, which is a set of samples taken from the current batch B_j and from the next one B_{j+1}.
At the beginning of an epoch, the fitness of all individuals in the population is re-evaluated by computing the fitness on the new window U.
The window is changed after s generations, by substituting b/r examples of U coming from B_j with examples taken from B_{j+1} and not already present in U.
Then, given the sub-epoch dimension s, the window passes from a batch to the next one in r sub-epochs, or in other words in r·s generations (we call this period an epoch). In this way, the fitness function changes more smoothly, and the evolution has more time to learn from the batch, because the window is updated only every s generations.
Moreover, the batches are reused in a cyclic way; when the algorithm iterates for more than k epochs and thus runs out of available batches, the batch sequence restarts from the first one.
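A sketch of the window rotation over the batches follows; it uses plain Python lists, assumes a uniform replacement step of b/r samples per sub-epoch, and reuses the batches cyclically, while the function and variable names are ours.

```python
from itertools import islice

def window_sequence(batches, r):
    """Yield one window per sub-epoch. The window keeps size b and slides
    from batch B_j to B_(j+1) in r sub-epochs, replacing b/r samples each
    time; batches are reused in a cyclic way."""
    k = len(batches)
    b = len(batches[0])
    step = b // r
    j = 0
    while True:
        cur, nxt = batches[j % k], batches[(j + 1) % k]
        for t in range(r):
            # head still from the current batch, tail already from the next one
            yield cur[: b - t * step] + nxt[: t * step]
        j += 1
```

For example, with three batches of six samples and r = 3, the first window is exactly the first batch, the next two windows mix the first two batches, and the fourth window coincides with the second batch.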
Since the fitness function also depends on the current batch, a fixed reference is needed to compare the individuals across the epochs; consequently, at the end of every epoch e, the best individual is selected as the network in the population which reaches the highest accuracy on the validation set VS. The global best network found so far is then updated accordingly.
A restart method is used to avoid a premature convergence of the algorithm. The adopted restart strategy discards all the individuals in the current population except the best one, and a new randomly generated population is used for the next iterations. The restart is applied at the end of an epoch e if the fitness of the best individual did not change for a given number M of epochs. The complete algorithm, namely DENN, is depicted in Algorithm 1.
Algorithm 1: The algorithm DENN
In the DENN algorithm, the function generate_offspring executes the mutation and crossover operators in order to produce the trial individual, whereas the function best_score finds the best network among all the individuals in the population and computes its score.
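The restart step described above can be sketched as follows; this is a numpy sketch under our own naming assumptions, where a fitness of np.inf marks individuals awaiting re-evaluation.

```python
import numpy as np

def restart_if_stagnant(pop, fit, best_idx, stagnant_epochs, M, rng,
                        low=-1.0, high=1.0):
    """If the best fitness has not improved for M epochs, keep only the
    best individual and redraw the rest of the population at random."""
    if stagnant_epochs < M:
        return pop, fit
    new_pop = rng.uniform(low, high, size=pop.shape)
    new_pop[best_idx] = pop[best_idx]          # the best individual survives
    new_fit = np.full_like(fit, np.inf)        # others await re-evaluation
    new_fit[best_idx] = fit[best_idx]
    return new_pop, new_fit
```

The elitist copy of the best individual guarantees that a restart never loses the best solution found so far.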
4.1. Fitness Function
In the case of classification problems, the fitness function used to evaluate an individual x is the well-known cross-entropy. In this case, the optimization problem is to find the neural network x minimizing the value f(x), computed as

f(x) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij},

where \hat{y}_{ij} and y_{ij} are, respectively, the value predicted by x and the actual value for the i-th record of U with respect to the j-th class (C is the number of classes).
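A numpy sketch of this fitness follows; the clipping constant is our addition, introduced only for numerical safety.

```python
import numpy as np

def cross_entropy_fitness(predicted, actual, eps=1e-12):
    """f(x) = -(1/|U|) * sum_i sum_j y_ij * log(yhat_ij),
    averaged over the records of the current window U."""
    predicted = np.clip(predicted, eps, 1.0)   # avoid log(0)
    return float(-np.mean(np.sum(actual * np.log(predicted), axis=1)))
```

A perfect one-hot prediction yields a fitness of zero, while a uniform prediction over C classes yields log(C).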
4.2. The Interm Crossover
We have implemented a new crossover operator called interm, which is a randomized version of the arithmetic crossover. If x is the target and y is the donor, then the trial z is obtained in the following way: for each component x^(h) of x and y^(h) of y, let u be a vector of random numbers generated with a uniform distribution in [0, 1]; then

z^(h)_k = u_k x^(h)_k + (1 - u_k) y^(h)_k

for each position k of the component.
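A sketch of interm follows; the orientation of the blend (u on the target, 1 - u on the donor) is our assumption, drawn from the arithmetic-crossover analogy.

```python
import numpy as np

def interm_crossover(target, donor, rng):
    """Randomized arithmetic crossover: each entry of the trial is a random
    convex combination of the corresponding target and donor entries."""
    trial = []
    for x_h, y_h in zip(target, donor):
        u = rng.random(x_h.shape)              # u ~ U(0, 1), element-wise
        trial.append(u * x_h + (1.0 - u) * y_h)
    return trial
```

Since each trial entry is a convex combination, it always lies between the corresponding target and donor values.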
4.3. The MAB-ShaDE Mutation Method
We have also implemented a variant of ShaDE algorithm, called MAB-ShaDE. MAB-ShaDE has a solution archive and a history of the best
and
F parameters, like ShaDE (
Section 2).
The novelty of MAB-ShaDE is in the method used, inspired to the Multi-armed bandit UCB1 [
31], to select one mutation strategy among a list of possible operators.
We consider the mutation strategies as arms of the bandit and the epochs as the rounds where the reward of the selected arm is computed. Therefore, for each mutation operator
, UCB1 stores the average value of the reward
and the number of epoch
in which
has been used. After the end of the epoch
e, the operator
is chosen as mutation strategy for the next epoch.
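The selection rule above can be sketched as a standard UCB1 step; handling untried arms first is the usual convention and is our assumption here.

```python
import math

def ucb1_select(avg_reward, counts, epoch):
    """Return the index of the mutation strategy maximizing
    r_i + sqrt(2 * ln(epoch) / n_i); untried strategies go first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [r + math.sqrt(2.0 * math.log(epoch) / n)
              for r, n in zip(avg_reward, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The exploration term grows for rarely used strategies, so an operator that has not been tried for many epochs is eventually selected again.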
5. Experiments
In this section, we describe the experiments performed to assess the effectiveness of DENN algorithm as an alternative to backpropagation for neural network optimization.
Moreover, we are interested in finding the best algorithm combination and, in particular, the best mutation and crossover operators. To this end, we organized two rounds of experiments. First, we tested all the possible combinations in order to determine the best algorithm for each single dataset and the global best. These experiments are described in
Section 5.3 and allow us to conclude that there is no single winning combination if we consider the results grouped by dataset, whereas the combination of ShaDE with
curr_p_best and
interm globally performs better than any other combination. Then, we verified the effectiveness of DENN, both in terms of computational effort and of accuracy, compared to the classical backpropagation. These results are shown in
Section 5.6.
All the networks used in these experiments are without any hidden layer.
DENN has been implemented as a C++ program (Source code available at
https://github.com/Gabriele91/DENN). The results presented here are obtained with a computer having a CPU AMD Ryzen 1600 and 16 GB RAM.
5.1. Datasets
We tested DENN on various classification datasets from the UCI repositories (
https://archive.ics.uci.edu/ml/datasets) (MAGIC, QSAR, and GASS) and also on the well-known MNIST (
http://yann.lecun.com/exdb/mnist/) dataset for hand-written digit classification. They have been chosen because of their differences in the number of features and records. Moreover, we chose the MNIST dataset because it is a classical challenge with well-known results obtained by various NN classification systems. Note that these datasets are also considered interesting challenges in [
15].
MAGIC Gamma telescope: dataset with 19,020 records, 10 features, and two classes.
QSAR biodegradation: dataset with 1055 records, 41 features, and two classes.
GASS Sensor Array Drift: dataset with 13,910 records, 128 features, and six classes.
MNIST: dataset with 70,000 records, 784 features, and 10 classes.
5.2. System Parameters
The DENN algorithm depends on various parameters: some directly derive from DE (F, CR, the auto-adaptive variant of DE, and the mutation and crossover operators), others depend on the batching system (s, b, and r). For each dataset we analyzed the following parameters,
the auto-adaptive variant of DE (simply called Method),
the Mutation operator,
the Crossover operator,
the number s of generations of a sub-epoch,
the batch/window size b, and
the ratio r between the batch size and the number of records changed in the window at each sub-epoch.
and their values are shown in
Table 1.
We have chosen three levels for the window size
b, called
low,
mid, and
high, which depend on the dataset size, hence they correspond to different values for each dataset (see
Table 2).
We have also chosen three levels for the length s of the sub-epoch, which are proportional to the number of records changed at each sub-epoch. The main motivation for this choice is that DENN should need more generations with larger batches/windows.
Another aspect of our tests is that we have used a double version for each dataset, the original one and the normalized one. In this way, we can see if the normalization process affects the performances of DENN.
As we implemented a complete test for each possible combination on each dataset, and we ran the same configuration five times, we collected accuracy values and computation times for 30,240 runs.
5.3. Algorithm Combination Analysis
The first analysis was made on the convergence plots, where for each dataset the accuracy is plotted across the generations. For each dataset and for each self-adaptive method, the data of the method obtaining the highest accuracy are displayed in
Figure 1,
Figure 2,
Figure 3 and
Figure 4.
From the plots, it is possible to see that, excluding the cases where the differences are not significant, MAB-ShaDE works well on smaller datasets (MAGIC and QSAR), whereas ShaDE is the best method for larger datasets.
5.4. Convergence Analysis
In this subsection, we discuss the convergence across all DE used and analyzed in this paper on the datasets discussed before. On the MAGIC dataset, SHADE and L-SHADE converge in around 1750 generations, whereas the proposed MAB-SHADE requires only 250 generations to achieve a solution with a comparable quality. Other methods were able to discover lower quality solutions only. Regarding the other binary classification problem, QSAR, MAB-SHADE converges faster than all the other methods in less than 200 generations, while simultaneously obtaining a higher quality solution.
On the GASS multi-class problem, MAB-SHADE follows the same convergence path as L-SHADE, whereas SHADE has a slower convergence but reaches a slightly better solution quality; conversely, the other methods do not reach a satisfactory solution.
On the image classification problem MNIST, SHADE and L-SHADE turned out to be the best algorithms in terms of solution quality and convergence time, whereas the other methods did not obtain solutions of comparable quality; noticeably, MAB-SHADE did not get stuck, but it likely requires more generations to converge to a solution.
We also performed the same tests with the normalized versions of the datasets, finding noticeable differences with respect to the previous results.
On MAGIC all the methods converged to the same solution, whereas on QSAR the best solution was reached by MAB-SHADE and jDE. Regarding GASS, no analyzed method reached a solution comparable in quality to those found on the corresponding original dataset. Finally, on MNIST all the methods, except SAMDE, reached good solutions, which are, however, below those found by SHADE on the non-normalized dataset.
Generally speaking, the best DE method is SHADE for the multi-class problems and MAB-SHADE for binary classification, although in the latter kind of problems the convergence curves of SHADE are close to those of MAB-SHADE.
Finally, it is worth noticing that MAB-SHADE performed systematically better than its direct competitor SAMDE, without requiring the choice of a particular mutation strategy.
Quade Weighted Rank
As we have different results for different datasets, we applied the Quade test [
48] in order to obtain a global ranking which takes into account the differences among the datasets.
The Quade test considers that some datasets could be more difficult to deal with (i.e., the differences in accuracy of the various algorithms are larger). In this way, the rankings computed on each dataset are scaled depending on the differences observed in the algorithms’ performances [
48].
With reference to
Table 3, for each algorithm combination, the weighted ranking values are shown in the last column
Quade rank.
These values are computed as follows. Given the 756 parameter configurations obtained by varying the values of each dimension as shown in Table 1, we store in a_{ij} the average accuracy value obtained by the configuration in row i on the dataset in column j. The ranks r_{ij} of these values are computed within each dataset. Ranks are also assigned to the datasets according to the sample range of the accuracy values obtained on them: the sample range within dataset j is the difference between the largest and the smallest accuracy a_{ij} within that dataset. Let Q_j be the rank assigned to the j-th dataset with respect to these values. Then, the Quade weighted rank is obtained by ordering the parameter configurations with respect to \sum_j Q_j r_{ij}.
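The computation can be sketched as follows; tie handling is simplified, and the orientation of the ranks (rank 1 for the highest accuracy, larger weights for datasets with larger sample ranges) is our reading of the procedure.

```python
import numpy as np

def rank_best_first(v):
    """Rank 1 goes to the largest value (highest accuracy)."""
    ranks = np.empty(len(v))
    ranks[np.argsort(-v)] = np.arange(1, len(v) + 1)
    return ranks

def quade_weighted_order(acc):
    """acc[i, j]: mean accuracy of configuration i on dataset j.
    Returns configuration indices ordered from best to worst."""
    n_conf, n_data = acc.shape
    r = np.column_stack([rank_best_first(acc[:, j]) for j in range(n_data)])
    ranges = acc.max(axis=0) - acc.min(axis=0)        # sample range per dataset
    q = np.empty(n_data)
    q[np.argsort(ranges)] = np.arange(1, n_data + 1)  # widest range -> largest weight
    score = (r * q).sum(axis=1)                       # weighted rank sum
    return np.argsort(score)                          # lower score = better
```

In this way, a dataset where the algorithms differ a lot weighs more in the final ordering than one where all configurations perform similarly.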
In Table 3, the top 20 among the 756 configurations tested are shown. We can see that SHADE, curr_p_best, and interm are the best choices.
5.5. Execution Times
The execution time of DENN changes with respect to the number of features and the size of the batch. Therefore, in Table 4, we show the average execution time in seconds of DENN on each dataset and for each level of b. Note that the execution time is not significantly affected by the normalization of the datasets.
As shown in Table 4, even the worst case required only approximately three minutes to compute the solution, also thanks to a strong parallelization of the computation. Note that this kind of parallelization is a plus of the evolutionary approach, which would be impossible for an inherently sequential iterative method like backpropagation. Therefore, we can conclude that the time to reach the solution is reasonable and the approach is feasible, even if it is slower than gradient-based methods.
5.6. Comparison with Backpropagation
In this section, we compare our method to the backpropagation (BPG) algorithm, using two optimizers: Stochastic Gradient Descent (SGD) and the more powerful Adam. The experiments were performed on the same datasets MAGIC, QSAR, GASS, and MNIST, using both the original and the normalized versions.
The results are reported in
Table 5, where for each dataset we compare the classification accuracy obtained by NNs trained with BPG (using both optimizers) to the accuracy obtained by our method (DENN). As can be seen from the results, in such a scenario our method shows performances that are better than, or in some cases comparable to, those of its competitors. More specifically, DENN obtained a higher accuracy than SGD on all classification problems, while Adam performed better only on MNIST.
The difference between MNIST and the other datasets lies in their features: in MNIST all the features are quantitative, whereas in the other datasets some features are quantitative and others are qualitative.
Generally, all the algorithms work better on the normalized datasets, except on MNIST, where the data already have a high degree of homogeneity. On the other hand, on GASS the effect of using normalized datasets is much greater for all the algorithms.
Note that our method can be useful for MLP networks trained on problems where traditional algorithms can hardly achieve satisfying performances, or need larger networks to achieve the same results.