1. Introduction
Artificial Neural Networks (ANNs) are long-standing computational models designed to mimic the human brain [1]. ANNs have been studied intensively for many years in the hope of achieving human-like performance in fields such as speech and image recognition [2]. They are designed to solve a variety of problems in pattern recognition, prediction, optimization, associative memory, and control [3]. ANNs have also been used, together with other computational intelligence methods, to approximate phenol concentration [4].
Feedforward Neural Networks (FNNs) are the most fundamental class of ANNs. They are typically trained with a connectionist learning procedure in a supervised manner, which requires a teacher to specify the desired output vector [5]. An FNN is a multi-layer perceptron (MLP) in which data flow in one direction [6]. In FNNs, the neurons are arranged in layers, namely the input, hidden, and output layers, and connections exist between the neurons of one layer and those of the next layer [7]. During learning, these connections are repeatedly adjusted to minimize a measure of the difference between the actual network output vector and the desired output vector [8]. FNNs are commonly trained with the backpropagation (BP) algorithm, which uses a gradient descent learning method, also called steepest descent [9].
Depending on how the weights are updated, the gradient descent method can be classified into the following three methods: the batch gradient descent method, the mini-batch gradient descent method, and the stochastic gradient descent method. In the batch gradient descent method, the weights are updated after all the training examples have been shown to the learning algorithm. In the stochastic gradient descent method, the weight parameters are updated using each training example. In the mini-batch gradient descent method, the training examples are partitioned into small batches, and the weight parameters are updated after each mini-batch. In this study, we focus only on the batch gradient method.
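For illustration only, the sketch below (in Python with NumPy; the variable names, the linear model, and the squared-error gradient are our own assumptions, not the exact training procedure used in this paper) shows how the three schemes differ only in how many training examples contribute to each weight update.

```python
import numpy as np

def gradient(W, X_batch, Y_batch):
    """Placeholder gradient of a mean squared error for a linear model W;
    in an FNN this quantity would come from backpropagation."""
    return 2.0 * X_batch.T @ (X_batch @ W - Y_batch) / len(X_batch)

def train(X, Y, mode="batch", eta=0.01, batch_size=16, epochs=100):
    """Train weights W with one of the three gradient descent variants."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(epochs):
        if mode == "batch":            # one update per epoch, using all examples
            W -= eta * gradient(W, X, Y)
        elif mode == "stochastic":     # one update per training example
            for i in range(len(X)):
                W -= eta * gradient(W, X[i:i + 1], Y[i:i + 1])
        elif mode == "mini-batch":     # one update per small batch of examples
            for s in range(0, len(X), batch_size):
                W -= eta * gradient(W, X[s:s + batch_size], Y[s:s + batch_size])
    return W
```

In the batch case the whole dataset produces a single gradient per epoch, whereas the stochastic and mini-batch cases perform many noisier updates per epoch.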
It is well known that the numbers of input and output neurons are naturally determined by the dimension of the problem; the main difficulty is determining the optimal number of neurons in the hidden layer for a specific problem [10]. By the optimal number of hidden layer neurons we mean a number that is large enough to learn the training samples and small enough to generalize well to unseen samples. There is no clear standard method for determining the optimal number of hidden layer neurons for a given problem; usually, it is chosen by trial and error, which is computationally expensive. In addition, too many hidden neurons may lead to overfitting of the data and poor generalization, while too few may yield a network that cannot learn the data [11].
Generally speaking, constructive and destructive approaches are the two main approaches used in the literature for optimizing neural network structure [12]. The first approach, also called the growing method, begins with a minimal neural network structure and adds hidden neurons only when they are needed to improve the learning capability of the network. The second approach, also called the pruning method, begins with an oversized neural network structure and then prunes redundant hidden layer neurons [13]. A disadvantage of the growing method is that the initial network, with its small number of hidden layer neurons, can easily be trapped in local minima, and it may need more time to reach the optimal number of hidden layer neurons. Therefore, we aim to find the optimal number of hidden layer neurons using the pruning method.
Furthermore, depending on the techniques used, pruning methods can be further classified as follows [14]: regularization (penalty) methods, cross-validation methods, magnitude-based methods, evolutionary pruning methods, mutual-information methods, significance-based pruning methods, and sensitivity analysis methods. The most popular sensitivity-based pruning algorithms are Optimal Brain Damage [15] and Optimal Brain Surgeon [16]. This paper focuses on a pruning technique called the regularization method, which mainly addresses the overfitting problem. To do this, extra regularization terms are added to the standard error function to sparsify the weight connections, under the assumption that sparse neural network models lead to better performance.
Most of the existing regularization methods for FNNs can be further categorized into different $L_p$ regularization methods. Figure 1 shows a graphical representation of $L_p$ norms with different $p$-values. The $L_p$ regularization method is widely applied as a parameter estimation technique to solve the variable selection problem [17]. The most common $L_p$ regularization terms are

$\|W\|_2^2 = \sum_{w \in W} w^2,$  (1)

$\|W\|_1 = \sum_{w \in W} |w|,$  (2)

where $W$ is the set of all weights of the neural network and $|\cdot|$ represents the absolute value function. The regularization term in Equation (1) is defined as the 2-norm (squared norm) of the network weights. $L_2$ regularization does not have the sparsity property, but it has the property of being smooth. The $L_2$ regularization term leads to a convex optimization problem, which is easy to solve, but it does not give a sufficiently sparse solution [18]. Adding the $L_1$ regularization term in Equation (2) to the standard error function pushes excess weights toward values close to zero, and the $L_1$ regularization term performs better with respect to the sparsity of the weight connections [19,20,21]. Recently, the $L_{1/2}$ regularization term has been proposed to determine the redundant dimensions of the input data of multilayer feedforward networks with a fixed number of hidden neurons [22]. The results of that study confirm that the $L_{1/2}$ regularization method produces better performance than $L_2$ due to its sparsity property.
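As a minimal illustration (not the exact formulation used in our experiments; the weight matrices and the coefficient `lam` below are hypothetical), the penalties in Equations (1) and (2) can be scaled by a regularization coefficient and added to the standard error function as follows.

```python
import numpy as np

def l2_penalty(weights, lam):
    """Equation (1) scaled by lam: sum of squared weights over all layers."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, lam):
    """Equation (2) scaled by lam: sum of absolute weights over all layers."""
    return lam * sum(np.sum(np.abs(w)) for w in weights)

# Hypothetical one-hidden-layer FNN weights and a placeholder data-fit error.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))
standard_error = 0.42                     # placeholder value of the usual error term
total_error_l2 = standard_error + l2_penalty([W1, W2], lam=1e-3)
total_error_l1 = standard_error + l1_penalty([W1, W2], lam=1e-3)
```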
To sum up, a major drawback of the regularization terms described above is that they are mainly designed for removing redundant weights from the neural network, but they cannot automatically remove redundant or unnecessary hidden neurons. This study aims to investigate the pruning of unnecessary hidden layer neurons of FNNs.
The popular Lasso (least absolute shrinkage and selection operator) regularization method, originally proposed for the estimation of linear models, is defined in [23] as

$\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\},$

where $\lambda$ is a regularization parameter, $y$ is a continuous response, $X$ is the design matrix, $\beta$ is a vector of parameters, and $\|\cdot\|_2^2$ and $\|\cdot\|_1$ stand for the 2-norm (squared norm) and the 1-norm, respectively. Lasso tends to produce sparse solutions for network models.
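As an aside, the sparsity-inducing effect of the 1-norm penalty can be seen with a small proximal-gradient (ISTA) sketch for the Lasso problem above; this is an illustrative solver of our own choosing, not one of the methods compared in this paper, and the data and parameters are synthetic.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the 1-norm: shrinks entries toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient descent (ISTA) for min_beta ||y - X beta||^2 + lam*||beta||_1."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)            # gradient of the squared-error term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Synthetic example: only the first two features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)
print(np.round(lasso_ista(X, y, lam=5.0), 3))        # most coefficients come out exactly 0
```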
An extension of Lasso, group Lasso, was originally used to solve linear regression problems, and it is one of the most popular regularization methods for variable selection [24,25]. For a given training set that consists of $M$ input–output pairs $\{(x_m, y_m)\}_{m=1}^{M}$, the following optimization problem with a group Lasso regularization term was used in [26] to sparsify a network with any number $L$ of layers, where layer $l$ consists of $N_l$ neurons, each encoded by parameters $\theta_l^n = (w_l^n, b_l^n)$, in which $w_l^n$ is a linear operator acting on the layer's input and $b_l^n$ is a bias; these parameters form the parameter set $\Theta = \{\theta_l^n : l = 1, \dots, L,\; n = 1, \dots, N_l\}$:

$\min_{\Theta} \; \frac{1}{M} \sum_{m=1}^{M} \ell\big(y_m, f(x_m; \Theta)\big) + \sum_{l=1}^{L} \lambda_l \sum_{n=1}^{N_l} \|\theta_l^n\|_2,$  (5)

where $\ell(\cdot, \cdot)$ is a loss function that compares the network prediction $f(x_m; \Theta)$ with the ground-truth output, such as the logistic loss for classification or the square loss for regression, $P_l$ is the size of the parameter groups in layer $l$, and $\lambda_l$ is the regularization coefficient of layer $l$. The regularization parameters $\lambda_l$ are scaled with the group size, i.e., $\lambda_l \propto \sqrt{P_l}$, to regularize larger groups more strongly in (5). Here, tuning a different regularization parameter $\lambda_l$ for each group in each layer is considered a disadvantage. However, by rescaling the groups, we can simplify the cost function in Equation (5) to

$\min_{\Theta} \; \frac{1}{M} \sum_{m=1}^{M} \ell\big(y_m, f(x_m; \Theta)\big) + \lambda \sum_{l=1}^{L} \sum_{n=1}^{N_l} \|\theta_l^n\|_2.$  (6)

Now, one can use Equation (6), which is simplified from Equation (5), to sparsify the neural network structure by penalizing each group in each layer with the same regularization parameter $\lambda$. In particular, it is important to prune the redundant neurons from the input and hidden layers.
Hence, our primary motivation is to develop an automated hidden layer regularization method, based on the idea of Equation (6), that can find a small, necessary, and sufficient number of hidden layer neurons of FNNs without an additional retraining process. There are two approaches to achieve this: the first considers only the norm of the total incoming weights of each hidden layer neuron, while the second considers only the norm of the total outgoing weights of each hidden layer neuron. In this paper, we propose a group Lasso regularization method based on the second approach. Our goal is to shrink the total outgoing weights of unnecessary or redundant hidden layer neurons to zero without loss of accuracy.
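To make the second approach concrete, the sketch below (a simplified illustration under our own naming assumptions, not the exact penalty and update rule derived in Section 2) treats the outgoing weights of each hidden neuron as one group: the penalty sums the Euclidean norms of these groups, so driving an entire group to zero removes the corresponding hidden neuron.

```python
import numpy as np

def hidden_group_lasso_penalty(W2, lam):
    """Group Lasso penalty on the outgoing weights of each hidden neuron.

    W2 has shape (n_hidden, n_output); row j holds the outgoing weights of
    hidden neuron j, so the penalty is lam * sum_j ||W2[j, :]||_2 and a neuron
    is pruned when its whole row is driven to (near) zero.
    """
    return lam * np.sum(np.linalg.norm(W2, axis=1))

def hidden_group_lasso_subgrad(W2, lam, eps=1e-12):
    """(Sub)gradient of the penalty with respect to W2, to be added to the
    backpropagated gradient of the standard error function."""
    norms = np.linalg.norm(W2, axis=1, keepdims=True)
    return lam * W2 / np.maximum(norms, eps)

# Toy check: a hidden neuron with a zero outgoing row adds nothing to the penalty.
W2 = np.array([[0.5, -0.2, 0.1],
               [0.0,  0.0, 0.0],   # redundant hidden neuron
               [0.3,  0.7, -0.4]])
print(hidden_group_lasso_penalty(W2, lam=0.01))
```

In a batch gradient scheme, such a (sub)gradient would simply be added to the backpropagated gradient of the standard error function at each update.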
Furthermore, we conduct experiments on benchmark datasets to compare our proposed hidden layer regularization method with the standard batch gradient method without any regularization term and with the popular Lasso regularization method. The numerical results demonstrate the effectiveness of our proposed hidden layer regularization method in terms of both sparsity and generalization ability.
The rest of this paper is organized as follows: in Section 2, the materials and methods are described; in Section 3, the results are presented; in Section 4, we discuss the numerical results in detail; finally, we conclude this paper with some remarks in Section 5.
4. Discussion
The main goal of this study is to prune the redundant or unnecessary hidden layer neurons of FNNs. In this respect, regularization terms are often introduced into the error function and have been shown to be efficient in improving the generalization performance and decreasing the magnitude of the network weights [31]. In particular, $L_p$ regularizations are used to penalize the sum of the norms of the weights during training. Lasso [23] is one of the most popular $L_1$ regularization terms used to remove redundant weights. However, Lasso regularization is mainly designed for removing redundant weights, and a neuron can be removed only if all of its outgoing weights are close to zero. As shown in Table 2, Table 3, Table 4, Table 5 and Table 6, the batch gradient method with Lasso regularization (BGLasso) can find more redundant weights, but it cannot find any redundant hidden layer neurons.
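For instance, given a learned hidden-to-output weight matrix and a hypothetical near-zero threshold (the threshold below is illustrative, not necessarily the criterion used to produce the tables), this distinction can be checked programmatically: Lasso may zero out many individual entries while every row keeps at least one surviving weight, whereas a row that is entirely below the threshold corresponds to a prunable hidden neuron.

```python
import numpy as np

def count_redundant(W2, tol=1e-3):
    """Count near-zero outgoing weights and fully prunable hidden neurons.

    W2 has shape (n_hidden, n_output); tol is an illustrative threshold.
    """
    near_zero = np.abs(W2) < tol
    n_redundant_weights = int(np.sum(near_zero))
    n_redundant_neurons = int(np.sum(near_zero.all(axis=1)))   # whole row near zero
    return n_redundant_weights, n_redundant_neurons
```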
Group Lasso [26] was used to impose sparsity at the group level in order to eliminate redundant neurons of the network. As shown in Table 2, Table 3, Table 4, Table 5 and Table 6, the batch gradient method with the group Lasso regularization term (BGGLasso) can identify unnecessary or redundant hidden layer neurons, and the average number of pruned hidden layer neurons (AVGNPHNs) obtained by BGGLasso is higher for every dataset. In these tables, the average norm of the gradient of the error function for our proposed learning method is also smaller than that of BGLasso, which tells us that BGGLasso converges better than BGLasso.
Table 4 and Table 6 report the results for the ionosphere dataset obtained with the same parameters except for different initial numbers of hidden layer neurons. Here, we confirm that the results are not significantly different.
Moreover, Figure 5a, Figure 6a, Figure 7a, Figure 8a and Figure 9a display the comparison of the average cross-validation error obtained by the different hidden layer regularization methods for the iris, zoo, seeds, and ionosphere datasets, respectively. The x-axis represents the maximum number of iterations and the y-axis represents the average cross-validation error at each iteration. As we can see from these figures, the BGGLasso error decreases monotonically and converges more quickly, with a smaller cross-validation error, than that of the popular BGLasso regularization method. Similarly, Figure 5b, Figure 6b, Figure 7b, Figure 8b and Figure 9b depict the average testing accuracy of the hidden layer regularization methods for the iris, zoo, seeds, and ionosphere datasets, respectively. In each of these learning curves, the x-axis represents the maximum number of iterations and the y-axis represents the average testing accuracy. From these learning curves, we can see that BGGLasso always has better classification performance on the validation sets than the BGLasso regularization method.
As seen from the above discussion, our proposed BGGLasso regularization method outperforms the existing BGLasso regularization method in all numerical results. Applying the BGGLasso regularization method not only yields more sparsity among the hidden layer neurons but also achieves much better test accuracy than BGLasso. From the results of BGGLasso in Table 2, Table 3, Table 4, Table 5 and Table 6, the number of redundant weights and the number of redundant hidden layer neurons are proportional. This phenomenon indicates that the batch gradient method with a group Lasso regularization term has limitations in removing weight connections from surviving hidden layer neurons. All of our numerical results are obtained by applying our proposed method to FNNs with one hidden layer; the proposed approach can be extended to the sparsification of FNNs that contain any number of hidden layers.