Constructing Features Using a Hybrid Genetic Algorithm

Abstract: A hybrid procedure that incorporates grammatical evolution and a weight decaying technique is proposed here for various classification and regression problems. The proposed method has two main phases: the creation of features and the evaluation of these features. During the first phase, using grammatical evolution, new features are created as non-linear combinations of the original features of the datasets. In the second phase, based on the characteristics of the first phase, the original dataset is modified and a neural network trained with a genetic algorithm is applied to this dataset. The proposed method was applied to an extremely wide set of datasets from the relevant literature and the experimental results were compared with four other techniques.


Introduction
Artificial neural networks (ANNs) are programming tools [1,2] based on a series of parameters that are commonly called weights or processing units. They have been used in a variety of problems from different scientific areas such as physics [3][4][5], chemistry [6][7][8], economics [9][10][11] and medicine [12,13]. A common way to express a neural network is as a function $N(\vec{x}, \vec{w})$, with $\vec{x}$ as the input vector (commonly called a pattern) and $\vec{w}$ as the weight vector. A method that trains a neural network should be used to estimate the vector $\vec{w}$ for a certain problem. The training procedure can also be formulated as an optimization problem, wherein the objective is to minimize the so-called error function:

$$E\left(N(\vec{x}, \vec{w})\right) = \sum_{i=1}^{M} \left(N(\vec{x}_i, \vec{w}) - y_i\right)^2 \qquad (1)$$

In Equation (1), the set $(\vec{x}_i, y_i),\ i = 1, \ldots, M$ is the dataset used to train the neural network, with $y_i$ being the actual output for the point $\vec{x}_i$. The neural network form used here was also considered in [14]. Suppose we have a neural network with one processing level that uses the sigmoid function as an output function. The output of every processing unit is defined as

$$o_i(\vec{x}) = \sigma\left(\vec{p}_i^{\,T} \vec{x} + \theta_i\right) \qquad (2)$$

where $\vec{p}_i$ is the weight vector and $\theta_i$ is the bias for the output $i$. For a neural network with $H$ hidden nodes, the final output function can be written as

$$N(\vec{x}) = \sum_{i=1}^{H} v_i\, o_i(\vec{x}) \qquad (3)$$

where $v_i$ is the output weight for the processing unit $i$. Hence, by using one vector for all parameters (weights and biases), the neural network can be written in the following form:

$$N(\vec{x}, \vec{w}) = \sum_{i=1}^{H} w_{(d+2)i-(d+1)}\, \sigma\left(\sum_{j=1}^{d} x_j\, w_{(d+2)i-(d+1)+j} + w_{(d+2)i}\right) \qquad (4)$$

where $H$ is the number of processing units of the neural network and $d$ is the dimension of the vector $\vec{x}$. The function $\sigma(x)$ is the sigmoid function, defined as

$$\sigma(x) = \frac{1}{1 + \exp(-x)} \qquad (5)$$

From Equation (4), one can obtain that the dimension of the weight vector $\vec{w}$ is computed as $w = (d + 2)H$.
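As a minimal sketch of the network form of Equation (4), the following Python fragment evaluates $N(\vec{x}, \vec{w})$ from a single flat weight vector and computes a training error. The function names are illustrative and are not taken from the authors' software, and the training error is assumed here to be the usual sum of squared errors.

```python
import math

def sigmoid(x):
    """Sigmoid of Equation (5), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def network_output(x, w, H):
    """Evaluate N(x, w) of Equation (4) for one pattern x.

    The paper indexes w from 1; Python lists start at 0, hence the -1 shifts.
    """
    d = len(x)
    if len(w) != (d + 2) * H:
        raise ValueError("weight vector must have (d + 2) * H entries")
    total = 0.0
    for i in range(1, H + 1):
        base = (d + 2) * i - (d + 1)      # 1-based index of the output weight
        arg = sum(x[j - 1] * w[base + j - 1] for j in range(1, d + 1))
        arg += w[(d + 2) * i - 1]         # bias of processing unit i
        total += w[base - 1] * sigmoid(arg)
    return total

def train_error(patterns, targets, w, H):
    """Sum of squared errors over the training set (assumed error form)."""
    return sum((network_output(x, w, H) - y) ** 2
               for x, y in zip(patterns, targets))
```

For d = 3 and H = 10 the weight vector must contain (3 + 2) × 10 = 50 entries, which is the dimension of the optimization problem that a training method has to handle.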
The function of Equation (1) has been minimized with a variety of optimization methods over the past years, such as the back propagation method [15,16], the RPROP method [17][18][19], Quasi-Newton methods [20,21] and particle swarm optimization [22,23]. All the previously mentioned methods have to overcome two major problems:

• Excessive computational times, because they require a processing time proportional to both the dimension of the objective problem and the number of processing units. For example, a neural network with H = 10 processing units applied to test data with d = 3 is an optimization problem of dimension w = (d + 2)H = 50. The total number of network parameters therefore grows quickly with both d and H, which results in longer computation times for the corresponding optimization method. An extensive discussion of the problems caused by the dimensionality of neural networks was presented in [24]. A common approach to overcome this problem is to use the PCA technique to reduce the dimensionality of the objective problem [25][26][27], i.e., the parameter d.

• The overfitting problem: it is quite common for these methods to produce poor results when they are applied to data (test data) not previously used in the training procedure. This problem was discussed in detail in the article by Geman et al. [28] as well as in the article by Hawkins [29]. A variety of methods have been proposed to overcome this problem, such as weight sharing [30], pruning [31][32][33], the dropout technique [34], early stopping [35,36], and weight decaying [37,38].
This article proposes a method that tackles both of the above problems in two major steps. During the first step, a new set of features is created from the initial features using a procedure based on the grammatical evolution technique [39]. A feature is a measurement that defines a property of the objective problem, and the series of all measurements forms a pattern. A feature can be an integer value, a double precision value or even a string literal. In our case, we only consider numeric values for the features. The number of features of each pattern is the dimensionality of the problem, defined as d in this work. The procedure of feature construction with grammatical evolution was introduced in the work of Gavrilis et al. [40] and it has been used with success in spam identification [41], fetal heart classification [42], the detection of epileptic oscillations in clinical intracranial electroencephalograms [43], etc. The outcomes of the first step are the training and testing data, modified according to the created features. During the second step, a genetic algorithm that incorporates a weight decaying procedure is used to train a neural network on the modified data of the first step.
Genetic algorithms are methods based on biological observations such as reproduction and mutation [44,45]. Genetic algorithms work by creating and maintaining a population of candidate solutions (chromosomes). This population is iteratively altered through operations such as crossover and mutation until some stopping criteria are met. They have many advantages, such as simplicity of implementation, robustness to noise and ease of parallelization. Furthermore, they have been applied to many problems such as aerodynamic optimization [46], steel structure optimization [47] and brain images [48]. They have been used to train neural networks in various research papers, such as the work of Leung et al. [49], which estimates the structure and weights of a neural network through a genetic algorithm, the evolution of neural networks for daily rainfall-runoff forecasting [50], and the evolution of neural networks to predict the deformation modulus of rock masses [51].
The idea of feature construction has been examined by various researchers in the relevant literature, such as the work of Smith and Bull [52], who used a tree genetic programming approach to construct features from the original ones. Another approach to constructing features using genetic programming was proposed by Neshatian et al. [53], where the genetic programming utilizes an entropy-based fitness function that maximizes the purity of class intervals. Another evolutionary approach was proposed by Li and Yin for feature selection using gene expression data [54]. Finally, a recent work that utilizes a genetic programming approach and the information gain ratio (IGR) was proposed by Ma and Teng [55] to construct features from the original ones.
In problems of classification and regression, as the number of features increases, additional examples are needed in order to train a model well and to maintain good generalization on unseen data. Adding new examples to the training process is, however, almost never possible, and this results in poor performance on the test data. For this reason, the original dataset must be transformed into a new one that gives the learning models better generalization ability. According to Cover's theorem [56], there is at least one non-linear extension of the original feature vector under which the set of patterns becomes linearly separable. Many techniques have been proposed in this direction that try to detect such non-linear extensions. The proposed method uses a hybrid approach, in which new features are first constructed using grammatical evolution and these features are then evaluated by a neural network trained by a genetic algorithm. In the first phase, the creation of new features is performed in such a way as to achieve the best possible learning accuracy. The methods that can be used to transform features are grouped into three categories: feature selection, feature construction, and feature reduction. The second category is the most difficult, as it requires not simply reducing the size of the problem but also creating new features as non-linear combinations of the old ones.
The proposed technique has an advantage over other techniques from the modern literature in that it does not require prior knowledge of the objective problem, and can thus be applied with the exact same procedure to both classification problems and function learning (regression) problems. In addition, it can be used to discover hidden functional dependencies between the original features of the problem and, because it is based on grammatical evolution, the user can add or remove functions in the grammar or even allow the algorithm to construct new functions to better learn the dataset. Moreover, the final features produced by the method can be evaluated by any computational intelligence model without any additional processing. In the present method, these features are evaluated by an artificial neural network, but this could be changed.
The rest of this paper is organized as follows: in Section 2, the proposed method is described in detail; in Section 3, the proposed method is tested on a series of well-known datasets from the relevant literature and the results are compared to those of a simple genetic algorithm; and finally, in Section 4, some conclusions are presented.

Method Description
The proposed method has two major phases. In the first phase, a procedure that exploits the grammatical evolution technique is used in order to create new features from the old ones. The new features are evaluated using a radial basis function (RBF) [57] neural network with H hidden nodes. The RBF network is used during this phase, rather than the neural network of the second phase, because RBF networks are much faster to train. In the second phase, a hybrid genetic algorithm trains a neural network using the constructed features of the first phase.

The Usage of Grammatical Evolution
Grammatical evolution is an evolutionary procedure whereby the chromosomes represent production rules from a BNF (Backus-Naur form) grammar [58], which is very often used to describe the syntax of programming languages, document formats, etc. These grammars are defined as the set G = (N, T, S, P) where:
• N is the set of non-terminal symbols, which produce a series of terminal symbols through production rules.
• T is the set of terminal symbols.
• S is a non-terminal symbol which is also called the start symbol.
• P is a set of production rules in the form A → a or A → aB, with A, B ∈ N and a ∈ T.
The production procedure starts from the start symbol of the BNF grammar and iteratively produces programs by replacing non-terminal symbols with the right-hand side of production rules, which are selected according to the value of each element in the chromosome. In the proposed method, the BNF grammar of Figure 1 was used to create a new feature from the initial features. The symbols enclosed in <> are considered non-terminal symbols. The parameter N denotes the number of original features. Typically, a chromosome x in grammatical evolution is expressed as a series of binary values 0 or 1. In the current work, in order to simplify the mapping procedure and to increase the speed of the algorithm, every element of each chromosome is considered an integer in a predefined range. In our case, the range [0, 255] was used, but this could easily be changed.
Take, for example, the chromosome x = [9, 8, 6, 4, 16, 10, 17, 23, 8, 14] and N = 3. The valid expression $f(x) = x_2 + \cos(x_3)$ is created using the series of production steps shown in Table 1. An expression is considered valid if it only contains terminal symbols. Each number in the parentheses stands for the sequence number of the production rule. Hence, the process to produce $N_f$ features from the original ones is as follows:

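1. Every chromosome Z is split into $N_f$ parts. Each part $g_i$ will be used to construct a feature.

2. For every part $g_i$, construct a feature $t_i$ using the grammar given in Figure 1.

3. Create a mapping function that sends every pattern to the vector of constructed feature values, $\left(t_1(\vec{x}, Z), t_2(\vec{x}, Z), \ldots, t_{N_f}(\vec{x}, Z)\right)$, where $\vec{x}$ is a pattern from the original set and Z is the chromosome.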
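The steps above can be sketched in Python as follows. Since Figure 1 is not reproduced in this text, the grammar in `make_grammar` is an assumption: it is a small arithmetic expression grammar chosen so that the example chromosome above decodes to the stated expression. The rule-selection scheme (gene value modulo the number of rules of the current non-terminal), the wrapping of the chromosome, and all function names are likewise assumptions in the spirit of standard grammatical evolution, not the authors' code.

```python
import math

def make_grammar(n_features):
    """Assumed expression grammar; each symbol maps to its ordered rule list."""
    return {
        "<expr>": [["(", "<expr>", "<op>", "<expr>", ")"],
                   ["<func>", "(", "<expr>", ")"],
                   ["<terminal>"]],
        "<op>": [["+"], ["-"], ["*"], ["/"]],
        "<func>": [["sin"], ["cos"], ["exp"], ["log"]],
        "<terminal>": [["<xlist>"], ["<digit>", ".", "<digit>"]],
        "<xlist>": [["x%d" % k] for k in range(1, n_features + 1)],
        "<digit>": [[str(k)] for k in range(10)],
    }

def decode(genes, n_features, max_wraps=2):
    """Map integer genes to an expression string; None if genes run out."""
    grammar = make_grammar(n_features)
    pending = ["<expr>"]            # symbols still to expand, leftmost first
    out = []
    used = 0
    budget = len(genes) * (max_wraps + 1)
    while pending:
        symbol = pending.pop(0)
        if symbol not in grammar:   # terminal symbol: emit as-is
            out.append(symbol)
            continue
        if used >= budget:          # no valid expression within the budget
            return None
        rules = grammar[symbol]
        choice = genes[used % len(genes)] % len(rules)
        used += 1
        pending = list(rules[choice]) + pending
    return "".join(out)

def construct_features(chromosome, n_features, n_constructed):
    """Split the chromosome into equal parts and decode one feature per part."""
    size = len(chromosome) // n_constructed
    return [decode(chromosome[k * size:(k + 1) * size], n_features)
            for k in range(n_constructed)]

def apply_features(expressions, pattern):
    """Evaluate every constructed expression on one original pattern."""
    env = {"sin": math.sin, "cos": math.cos, "exp": math.exp, "log": math.log}
    env.update({"x%d" % (k + 1): v for k, v in enumerate(pattern)})
    return [eval(e, {"__builtins__": {}}, env) for e in expressions]
```

With this assumed grammar, `decode([9, 8, 6, 4, 16, 10, 17, 23, 8, 14], 3)` yields the string `(x2+cos(x3))`, in agreement with the example in the text. A practical implementation would also need to handle chromosomes that fail to decode and expressions that raise arithmetic errors (division by zero, logarithm of a non-positive number), for instance by assigning them a very poor fitness.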

Feature Construction
The first phase of feature construction presented below has also been used as a feature construction mechanism in the initial work of Gavrilis et al. [40]. In this phase, a genetic algorithm constructs new features from the original ones, and the training error of an RBF network on the dataset created with the new features is used as the fitness value. The steps of the algorithm for the first phase are:

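2. Termination check. If iter ≥ $N_g$, go to step 6.

3. Estimate the fitness $f_i$ of every chromosome $Z_i$ with the following procedure:
(a) Use the procedure described in Section 2.1 to create $N_f$ features.
(b) Create a modified training set TN by applying the resulting mapping function to every pattern of the original training set, keeping the target outputs $y_i$ unchanged.
(c) Train an RBF neural network C with H processing units on the modified training set TN, using as training error the sum of squared differences between the output of C on each transformed pattern and the corresponding target $y_i$.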

Genetic Operators
(a) Selection procedure: initially, the chromosomes are sorted according to their fitness value. The best chromosomes are placed at the beginning of the population and the worst at the end. The best $(1 - p_s) \times N_c$ chromosomes are transferred to the next generation intact. The remaining chromosomes are substituted by offspring created through the crossover and mutation procedures.
(b) Crossover procedure: in this process, for every produced offspring, two mating chromosomes (parents) are selected from the previous population using tournament selection. Tournament selection is a rather simple selection mechanism defined as follows: first, a set of K > 1 randomly selected chromosomes is constructed, and subsequently the chromosome with the best fitness value in this set is selected as the mating chromosome. Having selected the two parents, the offspring is formed using one-point crossover. In one-point crossover, a random point is selected for the two parents and their right-hand side subchromosomes are exchanged.
(c) Mutation procedure: for every element of each chromosome, a random number $r \in [0, 1]$ is taken. If $r \le p_m$, then this element is randomly altered by producing a new integer number.

6. Get the best chromosome in the population, denoted as $Z_l$, with the corresponding fitness value $f_l$, and terminate.
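The genetic operators described above (elitist selection, tournament selection, one-point crossover and mutation) can be sketched as one generation step in Python. This is only an illustration under stated assumptions: fitness is treated as an error to be minimized, mutation is applied to the offspring only, and the function names and the default tournament size `k` are hypothetical rather than taken from the authors' implementation.

```python
import random

def tournament(population, fitness, k, rng):
    """Pick the best (lowest fitness) of k randomly chosen chromosomes."""
    candidates = rng.sample(range(len(population)), k)
    return population[min(candidates, key=lambda i: fitness[i])]

def one_point_crossover(a, b, rng):
    """Exchange the right-hand parts of two parents at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chromosome, p_m, rng, low=0, high=255):
    """Replace each gene by a new random integer with probability p_m."""
    return [rng.randint(low, high) if rng.random() <= p_m else g
            for g in chromosome]

def next_generation(population, fitness, p_s, p_m, k=2, rng=random):
    """One generation: keep the best (1 - p_s) * N_c, refill with offspring."""
    order = sorted(range(len(population)), key=lambda i: fitness[i])
    n_keep = int(round((1.0 - p_s) * len(population)))
    new_pop = [list(population[i]) for i in order[:n_keep]]
    while len(new_pop) < len(population):
        parent_a = tournament(population, fitness, k, rng)
        parent_b = tournament(population, fitness, k, rng)
        for child in one_point_crossover(parent_a, parent_b, rng):
            if len(new_pop) < len(population):
                new_pop.append(mutate(child, p_m, rng))
    return new_pop
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when an experiment is repeated with different seeds.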

Weight Decay Mechanism
The argument x of the sigmoid function in Equation (5) is computed from the input patterns and the weight vector. If the absolute value of this argument is excessively large, the sigmoid function saturates (it tends towards one for large positive arguments and towards zero for large negative ones), and the neural network loses its generalization ability. In order to estimate the effect of this issue, the quantity $B(N(\vec{x}, \vec{w}), F)$ is defined as shown in Algorithm 1.
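Algorithm 1 is not reproduced in this text, so the following Python sketch shows only one plausible reading of the quantity B: the fraction of sigmoid arguments, taken over all processing units and all training patterns, whose absolute value exceeds the limit F. Both this interpretation and the name `bound_fraction` are assumptions; the weight layout follows Equation (4).

```python
def bound_fraction(patterns, w, H, F):
    """Fraction of sigmoid arguments whose magnitude exceeds F.

    Assumed reading of B(N(x, w), F); w uses the layout of Equation (4),
    shifted by one because Python lists are 0-based.
    """
    d = len(patterns[0])
    exceeded = 0
    for i in range(1, H + 1):
        base = (d + 2) * i - (d + 1)      # 1-based index of the output weight
        for x in patterns:
            arg = sum(x[j - 1] * w[base + j - 1] for j in range(1, d + 1))
            arg += w[(d + 2) * i - 1]     # bias of processing unit i
            if abs(arg) > F:
                exceeded += 1
    return exceeded / (H * len(patterns))
```

Under this reading, a value of 0 means that no sigmoid unit is driven into saturation by the training patterns, while a value close to 1 means that almost all of them are, so the quantity can serve as the basis of a penalty on large weights.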

Application of Genetic Algorithm
The following hybrid genetic algorithm is used to train artificial neural networks on the modified dataset. The purpose of this algorithm is to train the artificial neural network in such a way that it does not lose its generalization ability. For this purpose, it uses a fitness function that consists of the neural network training error plus a penalty factor. This penalty factor aims to ensure that the network weights do not attain excessively high values during training. This technique can also be applied directly to a neural network without having previously performed the first phase of feature construction. The main steps of the hybrid genetic algorithm used in the second phase are:

1. Initialization step.
(a) Set iter = 0 as the generation number.
(b) Set TN as the modified training set produced by the first phase.

4. Genetic operations step. Apply the same genetic operations as in the first algorithm of Section 2.2.
Local search step. Set $L^* = L(D^*, \vec{LM}, \vec{RM})$, where L() is a local optimization procedure that searches for a local optimum of $N(\vec{x}, D^*)$ inside the bounds $[\vec{LM}, \vec{RM}]$. The TOLMIN [59] local optimization procedure used in the above algorithm is a modified version of the BFGS local optimization procedure [60].
Apply the optimized neural network $N(\vec{x}, D^*)$ to the test set, which has been modified using the same transformation procedure as the training set, and report the final results.

Experiments
The software for the algorithm was coded in ANSI C++, and the OpenMP library [61] was used to parallelize and accelerate the genetic algorithm. Every experiment was executed 30 times with a different seed for the random number generator each time, and averages were measured and reported. The function used for random numbers was the drand48() function of the C programming language. The classification error on the test set is reported for the classification datasets and the average mean squared error for the regression datasets. Furthermore, for more reliable results, the commonly used method of 10-fold cross-validation was used. The values for the parameters of the algorithms used are reported in Table 2.
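As an illustration of the evaluation protocol, the sketch below generates the index splits for 10-fold cross-validation in Python. It is not the authors' C++ code: the shuffling step, the use of Python's `random` module in place of drand48(), and the function name are assumptions.

```python
import random

def k_fold_indices(n_patterns, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_patterns))
    random.Random(seed).shuffle(idx)          # seeded, hence reproducible
    folds = [idx[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

Each pattern appears in exactly one test fold, so averaging the test error over the k folds, and then over repeated runs with different seeds, gives the kind of averaged figure reported in the experiments.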

Experimental Datasets
The method was tested on a series of regression and classification datasets, which were mostly obtained from two repositories.
The description for these datasets is as follows:

2. Dermatology dataset [63], which is used for differential diagnosis of erythematosquamous diseases.

3. Glass dataset. This dataset contains a glass component analysis for glass pieces that belong to 6 classes.

7. Parkinsons dataset [68], which is created using a range of biomedical voice measurements from 31 people, among which 23 have Parkinson's disease (PD). The dataset has 22 features.

8. PopFailures dataset [69], used in meteorology.

10. Spiral dataset, which is an artificial dataset with two classes. The features in the first class are constructed as $x_1 = 0.5t\cos(0.08t)$, $x_2 = 0.5t\cos\left(0.08t + \frac{\pi}{2}\right)$, and for the second class the equations used are $x_1 = 0.5t\cos(0.08t + \pi)$, $x_2 = 0.5t\cos\left(0.08t + \frac{3\pi}{2}\right)$.

11. Wine dataset, which is related to the chemical analysis of wines and has been used for comparison in various research papers [70,71].

12. Wdbc dataset, which contains data for breast tumors.

13. As a real-world example, consider the EEG dataset described in [72,73], which is used here. This dataset consists of five sets (denoted as Z, O, N, F and S), each containing 100 single-channel EEG segments of 23.6 s duration. With different combinations of these sets, the produced datasets are Z_F_S, ZO_NF_S and ZONF_S.
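The two-class spiral dataset can be generated directly from the equations given for it. The Python sketch below does so; the range and step of the parameter t are not stated in the text, so `n_per_class` and `t_step` are illustrative assumptions, as is the function name.

```python
import math

def spiral_dataset(n_per_class, t_step=1.0):
    """Return (patterns, labels) for the two-class spiral.

    n_per_class and t_step are illustrative; the range of t is an assumption.
    """
    patterns, labels = [], []
    for k in range(1, n_per_class + 1):
        t = k * t_step
        # first class
        patterns.append((0.5 * t * math.cos(0.08 * t),
                         0.5 * t * math.cos(0.08 * t + math.pi / 2)))
        labels.append(0)
        # second class: the same curve with the phase shifted by pi
        patterns.append((0.5 * t * math.cos(0.08 * t + math.pi),
                         0.5 * t * math.cos(0.08 * t + 3 * math.pi / 2)))
        labels.append(1)
    return patterns, labels
```

Because the second class uses the same phase shifted by π, every point of the second class is the reflection through the origin of the corresponding point of the first class, so the two classes are not linearly separable in the original features.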
The regression datasets are available from the StatLib URL http://lib.stat.cmu.edu/datasets/ (accessed on 1 April 2022) and other sources:

1. BK dataset. This dataset comes from smoothing methods in statistics [74] and is used to estimate the points scored per minute in a basketball game. The dataset has 96 patterns of 4 features each.

2. BL dataset. This dataset can be downloaded from StatLib. It contains data from an experiment on the effects of machine adjustments on the time to count bolts. It contains 40 patterns of 7 features each.

4. Laser dataset, which is related to laser experiments.

Quake dataset, used to estimate the strength of an earthquake.

7. FA dataset, which contains a percentage of body fat and ten body circumference measurements. The goal is to fit body fat to the other measurements.
The numbers of features and patterns for every dataset used in the experiments are listed in Table 3. Table 4 presents the comparative results for the classification datasets and Table 5 shows the results for the regression problems. For classification problems, the average classification error is reported, while for regression problems, the average per-point error is reported. The proposed method is denoted as FC MLP and it is compared against four other approaches from the relevant literature:

1. The minimum redundancy maximum relevance feature selection method [78,79] with two selected features. This approach is denoted as MRMR in the experimental tables. The features selected by MRMR are evaluated using an artificial neural network trained by a genetic algorithm with $N_c$ chromosomes.

2. The principal component analysis (PCA) method as implemented in the Mlpack software [80]. The PCA method is used to construct two features from the original dataset. Subsequently, these features are evaluated using an artificial neural network trained by a genetic algorithm with $N_c$ chromosomes.

3. A genetic algorithm with $N_c$ chromosomes and the parameters of Table 2, used to train a neural network with H hidden nodes. This approach is denoted as MLP GEN in the experimental tables.

4. A particle swarm optimization (PSO) method with $N_c$ particles and $N_g$ generations, used to train a neural network with H hidden nodes. This method is denoted as MLP PSO in the experimental tables.

Furthermore, the average classification errors for some of the classification datasets are graphically illustrated in Figure 2. An additional experiment was performed, wherein the number of chromosomes for the genetic algorithm of feature construction (Section 2.2) varied from 50 to 500. This experiment was performed on four datasets and the results are presented in Table 6. This table indicates the reliability and robustness of the proposed method, since it achieves quite good generalization results even with a low number of chromosomes.
As the experimental results show, the proposed method is significantly superior to the other techniques, and in many cases the percentage gain reaches 90%. For each dataset, the proposed technique created two artificial features as non-linear combinations of the original features. This process is based on grammatical evolution. Because this procedure is extremely time consuming, a radial basis function network, which has a fast training time, was chosen to evaluate the constructed features. Then, another genetic algorithm is used to train an artificial neural network on the new features. The overall process is the same regardless of the type of data, which means that the method can be applied to a wide range of datasets. However, because the method requires two phases that both use genetic algorithms, it is very slow compared to other techniques in the literature. Execution times, however, could be drastically reduced by using parallel techniques such as the OpenMP library used during the experiments. Furthermore, as the additional experiments on the number of chromosomes showed, the method is quite robust, even for a small number of chromosomes.

Conclusions
In the present work, a hybrid feature construction technique was presented with two phases: (a) feature construction and (b) feature evaluation. In the first phase, new features were created as non-linear combinations of old features using grammatical evolution and radial basis function networks. In the second phase, the original dataset was transformed based on the new features and an artificial neural network was trained with a genetic algorithm to learn the new dataset. The genetic algorithm trains the artificial neural network in such a way that it does not lose its generalization ability. The proposed technique was applied to a number of datasets from the relevant literature and the results were more than satisfactory. Furthermore, a series of additional experiments showed the stability of the proposed methodology, since it produces satisfactory results even with a small number of chromosomes. The proposed technique is, however, much slower than other methods, as it requires two computational phases, although it can be accelerated with the use of parallel techniques. The method can be made more efficient in a number of ways: by using parallel genetic algorithms; by using smarter evaluators, such as SVMs, in place of radial basis function networks during feature construction; and by using more sophisticated termination techniques for the genetic algorithms so that results are produced faster.