Improving the Classification Efficiency of an ANN Utilizing a New Training Methodology

Abstract: In this work, a new approach for training artificial neural networks is presented which utilises techniques for solving the constrained optimisation problem. More specifically, this study converts the training of a neural network into a constrained optimisation problem. Furthermore, we propose a new neural network training algorithm based on the L-BFGS-B method. Our numerical experiments illustrate the classification efficiency of the proposed algorithm and of our proposed methodology, leading to more efficient, stable and robust predictive models.


Introduction
Artificial neural networks constitute distributed processing systems, comprised of densely interconnected, adaptive processing units, characterised by an inherent propensity for learning from experience and also discovering new knowledge. The excellent self-learning and self-adapting capabilities of these learning systems have established them as powerful tools for pattern recognition and as vital components of many classification systems. Thus, they have been extensively studied and widely used in many applications of artificial intelligence (see [1][2][3][4][5][6][7] and the references therein). Although many different models have been proposed in the literature, the Multi-Layer Perceptron (MLP) is the most commonly and widely used in a variety of applications. The operation of an MLP is usually based on the following equations:

$$\mathrm{net}_j^l = \sum_{i=1}^{N_{l-1}} w_{ij}^{l-1,l}\, y_i^{l-1} + b_j^l, \qquad y_j^l = f\big(\mathrm{net}_j^l\big),$$

where $\mathrm{net}_j^l$ is the sum of the weighted inputs of the $j$-th node in the $l$-th layer ($j = 1, \ldots, N_l$), $w_{ij}^{l-1,l}$ are the weights from the $i$-th neuron at the $(l-1)$-th layer to the $j$-th neuron at the $l$-th layer, $b_j^l$ is the bias of the $j$-th neuron at the $l$-th layer, $y_i^{l-1}$ is the output of the $i$-th neuron of the $(l-1)$-th layer and $f(\cdot)$ is the neuron activation function.

The problem of training an MLP is an incremental adaptation of the connection weights which propagate the information contained in the examples of the training set between simple processing units called neurons [8]. Mathematically, the problem of training an MLP can be formulated as the minimisation of an error function $E$ defined by the sum of squared differences over all examples of the training set, namely

$$E(w) = \sum_{p=1}^{P} \sum_{j=1}^{N_L} \big( o_{j,p}^L - t_{j,p} \big)^2, \qquad (1)$$

where $o_{j,p}^L$ is the actual output of the $j$-th neuron of the $L$-th (output) layer, $N_L$ is the number of neurons of the output layer, $t_{j,p}$ is the desired response of the $j$-th neuron of the output layer for the input pattern $p$ and $P$ is the total number of patterns in the training set. To simplify the formulation of the above equations, let us use a unified notation for the weights. To this end, for an MLP with $n$ weights, let $w \in \mathbb{R}^n$ be a column weight vector with components $w_1, w_2, \ldots, w_n$ and $w^*$ be an optimal weight vector defined by the solution of

$$\min_{w \in \mathbb{R}^n} E(w). \qquad (2)$$

Gradient-based training algorithms are usually applied to deal with problem (2) by generating a sequence of weights $\{w_k\}$ utilising the iterative formula

$$w_{k+1} = w_k + \eta_k d_k, \qquad (3)$$

where $k$ is the current iteration, usually called an epoch, $w_0 \in \mathbb{R}^n$ is a given starting point, $\eta_k > 0$ is a stepsize (or learning rate) and $d_k$ is a descent search direction. Furthermore, the gradient can be easily obtained by means of back-propagation of errors through the network layers.
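To make the notation concrete, the following minimal Python sketch computes the forward pass and the error (1). It is only an illustration (the paper's own implementation was in Matlab); the logistic activation and the layer representation are assumptions.

```python
import numpy as np

def logistic(x):
    """Logistic activation f(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Propagate one input pattern x through the layers:
    net^l = W^l y^{l-1} + b^l and y^l = f(net^l)."""
    y = x
    for W, b in zip(weights, biases):
        y = logistic(W @ y + b)
    return y

def error(weights, biases, patterns, targets):
    """Sum-of-squared-differences error E of Eq. (1) over all P patterns."""
    return sum(np.sum((forward(x, weights, biases) - t) ** 2)
               for x, t in zip(patterns, targets))
```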
Since the appearance of backpropagation [8], a variety of approaches have been suggested for improving the efficiency of the error minimisation process. It is worth noting that the optimisation problem (2) is significantly challenging, since its dimensionality is usually high and the corresponding nonconvex multimodal error function possesses multitudes of local minima and has broad flat regions adjoined with narrow steep ones. Therefore, several methods based on the well-established unconstrained optimisation theory have been suggested which utilise second-order derivative-related information, such as limited-memory quasi-Newton methods [9][10][11] and conjugate gradient methods [12][13][14]. Another interesting approach for improving the generalisation efficiency of a neural network was based on the adaptation of nonmonotone learning strategies, exploiting the accumulated information with regard to the most recent values of the error function. Along this line, Peng and Magoulas [15][16][17][18] and Livieris and Pintelas [19] proposed nonmonotone training algorithms which possess strong convergence properties and also exhibit good classification performance. Karras and Perantonis [20,21] considered incorporating knowledge in the form of constraints into the neural network learning process and presented a Lagrange multiplier approach for the minimisation of the error function (1) in order to improve convergence. The advantage of their proposed method is that the weight updates in two successive epochs are highly aligned, thereby avoiding zig-zag trajectories in the parameter space and improving the speed of convergence.
After a neural network has been successfully trained, its classification accuracy depends on its architecture, but mostly on the values of its weights. Nevertheless, since there are no restrictions or limitations on the weights during the training process, a small number of weights may significantly affect the output of the network. In other words, in case some of the weights take large values, they dominate and sometimes determine the neural network's output, thereby degrading the classification efficiency of the network, since only some of the inputs of the network will be efficiently explored.
The major novelty of this work is that the problem of efficiently training an ANN is re-formulated as a constrained optimisation problem by defining bounds on the weights. More specifically, in order to avoid the degradation of the classification accuracy, we consider the training problem as follows:

$$\min E(w), \quad \text{subject to} \quad l \leq w \leq u, \qquad (4)$$

where the vectors $l$ and $u$ denote the lower and upper bounds on the weights $w$ of the optimisation problem, respectively. Our basic aim and motivation is to define the weights of the trained network in a more uniform way, by restricting them from taking large values. In this case, all the inputs will be efficiently explored, improving the classification ability of the network. Furthermore, in order to evaluate the efficacy and efficiency of our proposed methodology, we propose a new weight-constrained neural network training algorithm which is based on the L-BFGS-B method. The rationale behind this selection is that limited-memory BFGS methods constitute an elegant choice for efficiently training neural networks, due to their numerical efficiency and very low memory requirements [22]. Our preliminary numerical experiments illustrate the classification efficiency of the proposed algorithm and of our proposed methodology. The remainder of this paper is organized as follows: Section 2 presents the proposed weight-constrained neural network training algorithm and Section 3 presents the numerical experiments, using the performance profiles of Dolan and Moré [23]. Finally, Section 4 presents the discussion, our concluding remarks and our proposals for future research.
Notations. Throughout this paper, the gradient of the error function is indicated by ∇E(w k ) = g k and the vectors s k = w k+1 − w k and y k = g k+1 − g k represent the evolutions of the current point and of the error function gradient between two successive iterations.

Weight Constrained Neural Network Training Algorithm
In this section, we present the proposed neural network training algorithm, which is based on the L-BFGS-B method.
L-BFGS-B [24] constitutes one of the most successful and efficient large-scale bound-constrained optimisation methods. More analytically, L-BFGS-B is a limited-memory algorithm which minimizes a nonlinear function subject to simple bounds on the variables, by efficiently combining L-BFGS updates with a gradient-projection strategy. In contrast to the traditional BFGS method, which stores a dense Hessian approximation, L-BFGS-B stores only a small number, say $m$, of vectors which implicitly describe the approximation. This moderate memory requirement makes L-BFGS-B especially well suited to optimisation problems with a large number of variables.
At each iteration of the L-BFGS-B algorithm, the error function $E$ is approximated at a point $w_k$ by the quadratic model

$$m_k(w) = E_k + g_k^T (w - w_k) + \frac{1}{2} (w - w_k)^T B_k (w - w_k), \qquad (5)$$

where $E_k = E(w_k)$ and the Hessian approximation $B_k$ is defined as follows. Let $\hat{m} = \min\{k, m-1\}$; then, given the set of correction vector pairs $(s_i, y_i)$ for $i = k - \hat{m}, \ldots, k-1$, we define the $n \times \hat{m}$ matrices

$$S_k = [\, s_{k-\hat{m}}, \ldots, s_{k-1} \,], \qquad Y_k = [\, y_{k-\hat{m}}, \ldots, y_{k-1} \,].$$

The Hessian approximation $B_k$ (in compact form) resulting from $\hat{m}$ updates to the basic matrix $B_k^{(0)} = \theta_k I$ is given by

$$B_k = \theta_k I - \begin{bmatrix} Y_k & \theta_k S_k \end{bmatrix} \begin{bmatrix} -D_k & L_k^T \\ L_k & \theta_k S_k^T S_k \end{bmatrix}^{-1} \begin{bmatrix} Y_k^T \\ \theta_k S_k^T \end{bmatrix}, \qquad (6)$$

where $\theta_k$ is a positive scalar and $D_k$ and $L_k$ are the matrices

$$D_k = \mathrm{diag}\big[\, s_{k-\hat{m}}^T y_{k-\hat{m}}, \ldots, s_{k-1}^T y_{k-1} \,\big], \qquad (L_k)_{i,j} = \begin{cases} s_{k-\hat{m}-1+i}^T\, y_{k-\hat{m}-1+j}, & \text{if } i > j, \\ 0, & \text{otherwise.} \end{cases}$$

Subsequently, the algorithm approximately minimizes $m_k(w)$ subject to the feasible domain $D = \{w \in \mathbb{R}^n \mid l \leq w \leq u\}$, utilizing the gradient projection method to find a set of active bounds, followed by a minimisation of $m_k(w)$ treating those bounds as equality constraints. In more detail, this procedure is performed in three stages: (1) the computation of the generalized Cauchy point; (2) the subspace minimisation; and (3) the line search.
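Before detailing these stages, the compact representation (6) can be sketched numerically as follows (an illustrative NumPy reimplementation rather than the reference L-BFGS-B code; the correction pairs are assumed to satisfy the curvature condition $s_i^T y_i > 0$):

```python
import numpy as np

def hessian_approx(S, Y, theta):
    """Dense evaluation of B_k = theta*I - W M W^T from Eq. (6).

    S, Y: n x m_hat matrices whose columns are the correction pairs s_i, y_i
    (assumed to satisfy s_i^T y_i > 0); theta: positive scaling scalar."""
    n = S.shape[0]
    D = np.diag(np.einsum('ij,ij->j', S, Y))   # D_k = diag(s_i^T y_i)
    L = np.tril(S.T @ Y, k=-1)                 # strictly lower part of S^T Y
    W = np.hstack([Y, theta * S])
    M = np.linalg.inv(np.block([[-D, L.T],
                                [L, theta * (S.T @ S)]]))
    return theta * np.eye(n) - W @ M @ W.T
```

In the actual algorithm, $B_k$ is never formed explicitly; products with $B_k$ are computed from $S_k$, $Y_k$ and the small $2\hat{m} \times 2\hat{m}$ middle matrix, which is what makes the method cheap in memory.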
Stage I: Cauchy point computation. The basic objective of this stage is to compute the generalized Cauchy point $w^C$. This is defined as the first local minimizer $w^C = w(t^*)$ of the quadratic approximation of the error function, starting from the current point $w_k$, along the path defined by the projection of the steepest descent direction onto the feasible domain $D$, that is, $w(t) = P(w_k - t g_k; l, u)$.
Notice that the variables whose value at $w^C$ is at its lower or upper bound, comprising the active set $\mathcal{A}(w^C)$, are held fixed.
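For concreteness, the projection operator and the projected steepest-descent path of Stage I can be written as follows (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def project(w, l, u):
    """Componentwise projection P(w; l, u) onto the box {w : l <= w <= u}."""
    return np.minimum(np.maximum(w, l), u)

def cauchy_path(w_k, g_k, l, u, t):
    """Point w(t) = P(w_k - t*g_k; l, u) on the piecewise-linear path whose
    first local minimizer of the quadratic model (5) is the generalized
    Cauchy point w^C."""
    return project(w_k - t * g_k, l, u)
```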
Stage II: Subspace minimisation. After the generalized Cauchy point $w^C$ is obtained, the quadratic model (5) is minimized over the free variables at $w^C$, i.e., the variables whose values are not at their lower or upper bound. To solve this minimisation problem, a direct primal method [24] is utilized to find the minimizer $\bar{w}_{k+1}$, using the following formulation:

$$\min_{w}\; m_k(w) \quad \text{subject to} \quad w_i = w_i^C \;\; \forall i \in \mathcal{A}(w^C), \qquad l_i \leq w_i \leq u_i \;\; \forall i \notin \mathcal{A}(w^C). \qquad (7)$$

Notice that the feasible domain is reduced to a subspace of the original feasible domain by considering as free variables only those that are not fixed at their limits, while the remaining variables stay fixed at the boundary values obtained during the Cauchy point computation stage.
Stage III: Line search. After an approximate solution $\bar{w}_{k+1}$ of problem (7) has been obtained, we compute the new iterate $w_{k+1}$ by a line search along $d_k = \bar{w}_{k+1} - w_k$ which satisfies the strong Wolfe line search conditions, that is,

$$E(w_k + \eta_k d_k) \leq E(w_k) + c_1 \eta_k g_k^T d_k, \qquad |\nabla E(w_k + \eta_k d_k)^T d_k| \leq c_2\, |g_k^T d_k|,$$

with $0 < c_1 < c_2 < 1$. Summarizing the above discussion, we present a high-level description of the proposed Weight Constrained Neural Network (WCNN) training algorithm.

Algorithm 1: Weight Constrained Neural Network Training Algorithm

Step 1. Initiate $w_0$, $E_G$, $c_1$, $c_2$, the bound vectors $l$ and $u$, and $k_{MAX}$; set $k = 0$.

Step 2. repeat

Step 3. Calculate the error function value $E_k$ and its gradient $g_k$.

Step 4. Set the quadratic model $m_k(w)$ at the point $w_k$ by (5), where the Hessian approximation $B_k$ is defined by (6).

Step 5. Calculate the generalized Cauchy point $w^C$ (Stage I).

Step 6. Perform the subspace minimisation (Stage II) to obtain an approximate solution $\bar{w}_{k+1}$ of problem (7).

Step 7. Compute the search direction $d_k = \bar{w}_{k+1} - w_k$.

Step 8. Compute the learning rate $\eta_k$ satisfying the strong Wolfe line search conditions (Stage III).

Step 9. Update the weights $w_{k+1} = w_k + \eta_k d_k$ and set $k = k + 1$.

Step 10. until (stopping criterion).
It is worth noticing that, since Algorithm 1 is implemented with a line search which satisfies the strong Wolfe conditions, every Hessian approximation $B_k$ is positive definite. Therefore, the solution $\bar{w}_{k+1}$ of the quadratic problem (7) defines a descent direction $d_k = \bar{w}_{k+1} - w_k$ for the error function [24]. The significance of the sufficient descent property is highlighted in [12,13,19]: it avoids the usually inefficient restarts which degrade the overall efficiency and robustness of the minimisation process.
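To make the overall methodology concrete, the following self-contained Python sketch trains a small weight-constrained network with SciPy's L-BFGS-B implementation (scipy.optimize.minimize). It is not the paper's Matlab code: the 2-2-1 architecture, the XOR data, the bounds $-2 \leq w_i \leq 2$ and the finite-difference gradient are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy input patterns
T = np.array([0., 1., 1., 0.])                           # XOR targets

sizes = [2, 2, 1]                                        # illustrative 2-2-1 MLP
shapes = [(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
n = sum(n_out * (n_in + 1) for n_out, n_in in shapes)    # total number of weights

def unpack(w):
    """Split the flat weight vector w into per-layer (W, b) pairs."""
    params, i = [], 0
    for n_out, n_in in shapes:
        W = w[i:i + n_out * n_in].reshape(n_out, n_in); i += n_out * n_in
        b = w[i:i + n_out]; i += n_out
        params.append((W, b))
    return params

def E(w):
    """Sum-of-squared-errors (1) for the flat weight vector w."""
    err = 0.0
    for x, t in zip(X, T):
        y = x
        for W, b in unpack(w):
            y = 1.0 / (1.0 + np.exp(-(W @ y + b)))       # logistic units
        err += np.sum((y - t) ** 2)
    return err

w0 = rng.normal(scale=0.5, size=n)
res = minimize(E, w0, method="L-BFGS-B",
               bounds=[(-2.0, 2.0)] * n,                 # l <= w_i <= u, cf. (4)
               options={"maxiter": 2000})
print(res.fun, np.abs(res.x).max())                      # final error, max |w_i|
```

In practice, one would supply the analytic back-propagation gradient through the jac argument instead of relying on SciPy's finite-difference approximation.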

Experimental Analysis
In this section, we present experimental results in order to evaluate the performance of the proposed neural network training algorithm on six well-known classification problems acquired from the UCI Repository of Machine Learning Databases [25]: the breast cancer problem, the Australian credit card problem, the diabetes problem, the Escherichia coli problem, the Coimbra problem and the SPECT heart problem. Table 1 presents a brief description of each dataset's structure, i.e., the number of attributes (#Features) and the number of instances (#Instances), together with the network architecture and the total number of weights for each problem. All MLPs had logistic activation functions and received the same sequence of input patterns. Moreover, the weights were initialized using the Nguyen-Widrow method [26]. The classification accuracy of each algorithm was evaluated using the standard procedure of stratified k-fold cross-validation. The implementation code was written in Matlab 7.6 and the simulations were carried out on a PC (2.66 GHz Quad-Core processor, 4 GB RAM) running the Linux operating system; the results have been averaged over 300 simulations.
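As an aside, the stratified k-fold protocol can be illustrated in a few lines of Python (the paper's experiments were run in Matlab; scikit-learn and its bundled Wisconsin breast cancer data are used here only as stand-ins):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)   # UCI Wisconsin data
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Each fold preserves the class proportions of the full dataset;
# a network is trained on the train split and scored on the test split.
folds = [(train_idx, test_idx) for train_idx, test_idx in skf.split(X, y)]
```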
Our experimental analysis follows a two-phase procedure. In the first phase, the classification performance of the proposed algorithm WCNN was evaluated against the classical training method L-BFGS. The rationale for this selection is that L-BFGS-B is a substantial extension of the classical L-BFGS [22] to constrained optimisation problems; hence, both methods require the same information per iteration. In the second phase, we evaluated the performance of WCNN against the state-of-the-art neural network training algorithms: Resilient backpropagation [27], scaled conjugate gradient [14] and the Levenberg-Marquardt training algorithm [28].
The cumulative total for a performance metric over all simulations is not very informative, since a small number of simulations tends to dominate the results. For this reason, we utilize the performance profiles of Dolan and Moré [23], relative to the performance metrics accuracy and F_1-score, to present perhaps the most complete information in terms of robustness, efficiency and solution quality. The use of performance profiles eliminates the influence of a small number of simulations on the benchmarking process and the sensitivity of the results associated with the ranking of solvers [23]. The performance profile plots the fraction P of simulations for which any given method is within a factor τ of the best training method.
More specifically, assume that there exist $n_s$ solvers and $n_p$ problems. Requiring a baseline for comparisons, Dolan and Moré compared the performance $\alpha_{p,s}$ (based on a metric) of solver $s$ on problem $p$ with the best performance by any solver on this problem, namely, using the performance ratio

$$r_{p,s} = \frac{\alpha_{p,s}}{\min\{\alpha_{p,s} : s \in S\}}.$$

The performance of solver $s$ on any given problem might be of interest, but we would like to obtain an overall assessment of the performance of the solver. To this end, the (cumulative) distribution function $\rho_s$ for the performance ratio is defined by

$$\rho_s(\tau) = \frac{1}{n_p}\, \mathrm{size}\{p \in \mathcal{P} : r_{p,s} \leq \tau\},$$

where $\mathcal{P}$ is the set of all problems. Notice that the performance profile $\rho_s : \mathbb{R} \to [0,1]$ for a solver is a non-decreasing, piecewise constant function, continuous from the right at each breakpoint [23].
In other words, the performance profile plots the fraction of problems for which any given algorithm is within a factor $\tau$ of the best training algorithm. The value $\rho_s(1)$ gives the percentage of simulations for which a training algorithm achieved the best performance (efficiency). Regarding the above rules and discussion, we can conclude that a solver whose performance profile plot lies on the top right outperforms the rest of the solvers.
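A hedged sketch of how such a profile can be computed for an accuracy-like metric (where larger is better, so the cost-based Dolan-Moré ratio is inverted) is given below; the data matrix is invented purely for illustration:

```python
import numpy as np

def performance_profile(A, taus):
    """A[p, s] = accuracy of solver s on simulation p (larger is better).
    Returns rho[s, t] = fraction of simulations with ratio r_{p,s} <= taus[t]."""
    best = A.max(axis=1, keepdims=True)      # best accuracy per simulation
    r = best / A                             # r >= 1, and r = 1 for the best solver
    return np.array([[np.mean(r[:, s] <= t) for t in taus]
                     for s in range(A.shape[1])])

# Illustrative data: 3 solvers on 5 simulations
A = np.array([[.95, .93, .90],
              [.88, .91, .87],
              [.99, .99, .95],
              [.80, .85, .84],
              [.91, .90, .92]])
rho = performance_profile(A, taus=np.linspace(1.0, 1.3, 31))  # curves to plot
```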

Performance Evaluation Against L-BFGS Algorithm
Next, we briefly describe each classification problem and present the performance comparison between the proposed algorithm WCNN and the L-BFGS training algorithm. The curves in the following figures have the following meaning:

• "WCNN_1" stands for Algorithm 1 with bounds on the weights −1 ≤ w_i ≤ 1.
• "WCNN_2" stands for Algorithm 1 with bounds on the weights −2 ≤ w_i ≤ 2.
• "WCNN_3" stands for Algorithm 1 with bounds on the weights −5 ≤ w_i ≤ 5.
• "L-BFGS" stands for the limited-memory BFGS method.
WCNN and L-BFGS were evaluated using m = 3 and m = 7, as in [22], and were implemented with the same line search [24] with c_1 = 10^{-4} and c_2 = 0.9.

Breast Cancer Classification Problem
The first benchmark concerns the diagnosis of breast cancer malignancy. The data have been collected from 683 patients at the University of Wisconsin, each having 9 attributes and a class label (malignant or benign tumor). We used neural networks with 2 hidden layers of 4 and 2 neurons, respectively, as suggested in [12]. The stopping criterion is set to E_G ≤ 0.02 within the limit of 2000 epochs and all networks were tested using 10-fold cross-validation. Figures 1 and 2 present the performance profiles for the breast cancer classification problem, based on accuracy and F_1-score, respectively. Firstly, it is worth noticing that all versions of WCNN exhibited better performance than L-BFGS with regard to both performance metrics. Therefore, the bounds on the weights substantially contributed to the development of trained neural networks with improved classification accuracy. Regarding the performance of the proposed algorithm, WCNN_1 exhibits the best performance in terms of generalisation ability, followed by WCNN_2 and WCNN_3. Moreover, the interpretation of Figures 1 and 2 reveals that the tighter the bounds on the weights, the better the resulting classification performance (in most cases).

Australian Credit Card Classification Problem
The Australian credit approval dataset contains all the details of credit card applications. This dataset is interesting because the data vary and comprise a mixture of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. We used neural networks with two hidden layers of 16 and 8 neurons and an output layer of 2 neurons [12]. The termination criterion is set to E_G ≤ 0.1 within the limit of 1000 epochs and all networks were tested using 10-fold cross-validation. Figures 3 and 4 illustrate the performance profiles for the Australian credit card classification problem, investigating the efficiency and robustness of each training method. Similar observations can be made as with the previous benchmark. WCNN_1 outperforms all other training methods, since its curves lie on the top for each value of the parameter m. More specifically, WCNN_1 for m = 3 and m = 7 reported 65% and 69.3% of simulations with the highest classification accuracy, respectively, while L-BFGS reported only 46% and 50% in the same situations. Summarizing, we conclude that the tighter the bounds get, the higher the chance of good generalisation performance (i.e., the higher the classification ability of the neural network).

Diabetes Classification Problem
The aim of this real-world classification task is to decide whether a Pima Indian female is diabetes positive or not. The data of this benchmark consist of 768 patterns, each having 8 real-valued features and a class label (diabetes positive or not). We used neural networks with 2 hidden layers of 4 neurons each and an output layer of 2 neurons [12]. The termination criterion is set to E_G < 0.14 within the limit of 2000 epochs and all networks were tested using 10-fold cross-validation [29]. Figures 5 and 6 illustrate the performance profiles for the diabetes classification problem, relative to each performance metric. WCNN_2 exhibits the best probability of being the optimal solver in terms of efficiency and robustness, outperforming all other training methods, followed by WCNN_1 and WCNN_3, which exhibited almost similar performance. More specifically, WCNN_2 reported 62.6% and 60% of simulations with the highest classification accuracy for m = 3 and m = 7, respectively, while L-BFGS presented the worst performance among all training methods. In general, the interpretation of Figures 5 and 6 reveals that the bounds on the weights increased the overall classification accuracy in most cases. Nevertheless, in contrast to the previous benchmarks, when the bounds are too tight the classification performance of the networks does not benefit substantially.

Escherichia coli Classification Problem
This problem is based on a drastically imbalanced dataset of 336 patterns and concerns the classification of E. coli protein localisation patterns into eight localisation sites. E. coli, being a prokaryotic gram-negative bacterium, is an important component of the biosphere. Three major and distinctive types of proteins are characterized in E. coli: enzymes, transporters and regulators. The largest number of genes encodes enzymes (34%, including all the cytoplasm proteins), followed by the genes for transport functions and the genes for regulatory processes (11.5%) [30]. The network architecture consists of one hidden layer with 16 neurons and an output layer of 8 neurons [31]. The termination criterion is set to E_G ≤ 0.02 within the limit of 2000 epochs and all neural networks were tested using 4-fold cross-validation. Figures 7 and 8 present the performance profiles for the Escherichia coli classification problem, based on the performance metrics accuracy and F_1-score, respectively. Similar observations can be made as with the previous benchmarks. Firstly, it is worth noticing that in most cases the bounds on the weights lead to the training of neural networks with higher classification accuracy. More specifically, WCNN_1, WCNN_2 and WCNN_3 exhibited better generalisation performance than the classical training method L-BFGS. Furthermore, Figures 7 and 8 show that WCNN_2 exhibits the best performance in terms of classification ability, followed by WCNN_1 and WCNN_3.

Coimbra Classification Problem
This dataset is comprised of ten predictors, all quantitative, and a binary dependent variable indicating the presence or absence of breast cancer [32]. The predictors are anthropometric data and parameters which can be gathered in routine blood analysis. Prediction models based on these predictors, if accurate, could potentially be used as a biomarker of breast cancer. The network architecture consists of two hidden layers of five and two neurons, respectively, and an output layer of two neurons. The termination criterion is set to E_G ≤ 0.05 within the limit of 1000 epochs and all neural networks were tested using 10-fold cross-validation. Figures 9 and 10 illustrate the performance profiles for the Coimbra classification problem, investigating the classification efficiency of each training method. WCNN_2 exhibited the best performance, since its curves lie on the top for each value of the parameter m, followed by WCNN_1 and WCNN_3. It is worth noticing that WCNN_2 for m = 3 and m = 7 reported 62% and 55.3% of simulations with the highest classification accuracy, respectively, while L-BFGS exhibited the worst performance, reporting only 32% and 33.3% in the same situations. Although in most cases WCNN trains neural networks with higher classification accuracy on average, when the bounds are too tight the generalisation ability of the networks does not benefit substantially.

SPECT Heart Classification Problem
This dataset contains data instances derived from cardiac Single Photon Emission Computed Tomography (SPECT) images from the University of Colorado. This is also a binary classification task, in which patients' heart images are classified as normal or abnormal. The class distribution comprises 55 instances of the abnormal class (20.6%) and 212 instances of the normal class (79.4%). Of these, 80 instances were selected for the training process and the remaining 187 for testing the generalisation capability of the neural networks [25]. The network architecture consists of two hidden layers with 16 and 8 neurons, respectively, and an output layer of two neurons [12]. The training goal was set to E_G = 0.1 and the maximum number of epochs was set to 1000, as in [12]. Figures 11 and 12 report the performance profiles for the SPECT heart classification problem, relative to the values of the parameter m. For m = 3, WCNN_1, WCNN_2 and WCNN_3 exhibited similar performance, with WCNN_3 presenting slightly better results. For m = 7, WCNN_2 exhibits the best probability of being the optimal solver, followed by WCNN_1 and WCNN_3, which exhibited similar performance. Furthermore, L-BFGS presented the worst performance compared against all training methods. Therefore, the interpretation of Figures 11 and 12 demonstrates that the application of the bounds on the weights of the neural network increased the overall classification accuracy in most cases. Nevertheless, by comparing the performance of all versions of the proposed algorithm, we conclude that when the bounds are too tight they do not substantially benefit the classification performance.

Performance Evaluation against State-of-the-Art Training Algorithms
In the sequel, we evaluate the performance of the proposed neural network training algorithm WCNN against state-of-the-art training algorithms, i.e., Resilient backpropagation, scaled conjugate gradient and the Levenberg-Marquardt training algorithm, which were utilized with their default parameter settings. The curves in the following figures have the following meaning:

• "WCNN_1" stands for Algorithm 1 with m = 7 and bounds on the weights −1 ≤ w_i ≤ 1.
• "WCNN_2" stands for Algorithm 1 with m = 7 and bounds on the weights −2 ≤ w_i ≤ 2.
• "WCNN_3" stands for Algorithm 1 with m = 7 and bounds on the weights −5 ≤ w_i ≤ 5.
• "RPROP" stands for Resilient backpropagation.
• "SCG" stands for scaled conjugate gradient.
• "LM" stands for the Levenberg-Marquardt training algorithm.

Figure 13 presents the performance profiles based on accuracy of WCNN, RPROP, SCG and LM, relative to all classification problems. It is worth mentioning that WCNN_1 and WCNN_2 exhibit better classification performance than RPROP, SCG and LM in all cases, while WCNN_3 presents similar or slightly worse performance compared to the classical training algorithms. Furthermore, it is worth noticing that WCNN_2 demonstrates the best performance in four out of the six problems, while WCNN_1 reports the best performance in the remaining two classification problems. Figure 14 presents the performance profiles based on F_1-score of each training algorithm, regarding all classification problems. Similar observations can be made as with Figure 13. Clearly, WCNN_1 and WCNN_2 exhibit better classification performance than the classical training algorithms RPROP, SCG and LM, regarding all benchmarks. Moreover, WCNN_2 illustrates the best performance, since its curves lie on the top in five out of the six classification problems, followed by WCNN_1. Regarding WCNN_3, it exhibits the worst performance among all versions of the proposed algorithm; nevertheless, its performance is similar to or slightly better than that of the classical training algorithms with regard to the F_1-score metric.
Based on the above discussion, we conclude that the interpretation of Figures 13 and 14 shows that, in general, the bounds on the weights increased the overall classification accuracy of the ANN; however, when the bounds are too tight, they may not substantially benefit the classification performance of the networks.

Discussion, Conclusions and Future Research
In this work, we proposed a new direction for efficiently training a neural network. More specifically, the problem of training a neural network is formulated as a constrained optimisation problem by defining lower and upper bounds on the weights. Our motivation was to improve the classification accuracy by defining the weights of the trained network in a more uniform way, so that all inputs and neurons of the network are sufficiently explored. Additionally, in order to evaluate our methodology, we proposed a new neural network training algorithm based on the L-BFGS-B method and compared its classification accuracy against the efficient state-of-the-art training algorithms L-BFGS, Resilient backpropagation, scaled conjugate gradient and Levenberg-Marquardt. Our numerical experiments demonstrated the classification efficiency of the proposed algorithm, illustrating that the proposed methodology can improve the accuracy of neural networks, as confirmed statistically by the performance profiles.
Summarizing, it is worth mentioning that the bounds on the weights of a neural network increased the overall classification accuracy in most cases. Placing these constraints on the values of the weights reduces the likelihood that some weights will "blow up" to unrealistic values. Therefore, we conclude that the proposed methodology appears to efficiently train neural networks with improved classification ability. Nevertheless, in some benchmarks the bounds turned out to be too tight, which did not substantially benefit the classification performance of the networks. As a consequence, setting optimal bounds on the weights is difficult and more research is needed. To this end, the questions of what the values of the bounds should be and which additional constraints should be applied are still under consideration. Research addressing these questions is likely to reveal additional crucial information and raise new questions.
Since our experimental results are quite encouraging, in our future work we intend to explore the performance of the proposed algorithm on imbalanced datasets [33], also utilizing even more sophisticated performance metrics [34]. Additionally, we intend to conduct extensive empirical experiments by applying the proposed algorithm to specific scientific fields and to evaluate its performance on large real-world datasets, such as educational and healthcare data. Finally, another interesting aspect for future research is to incorporate into our framework conjugate gradient methods for constrained optimisation [35,36] and genetic algorithms [7,37,38,39].
Funding: This research received no external funding.

Conflicts of Interest:
The author declares no conflict of interest.