Learning Functions and Classes Using Rules

Abstract: In the current work, a novel method is presented for generating rules for data classification as well as for regression problems. The proposed method generates simple rules in a high-level programming language with the help of grammatical evolution. The method does not depend on any prior knowledge of the dataset; the memory it requires for its execution is constant regardless of the objective problem; and it can also be used to detect hidden dependencies between the features of the input problem. The proposed method was tested on an extensive range of problems from the relevant literature, and comparative results against other machine learning techniques are presented in this manuscript.


Introduction
A variety of common datasets from real-world problems or scientific areas can be regarded as classification or regression problems. Such problems include problems from the area of physics [1][2][3][4], chemistry problems [5][6][7], problems induced by economic models [8,9], pollution [10][11][12], medicine problems [13,14], etc. All of the above problems are in most cases handled by artificial intelligence models such as artificial neural networks [15,16], radial basis function (RBF) networks [17,18], support vector machines (SVM) [19], etc. A systematic review of these methods is the work of Kotsiantis et al. [20]. Additionally, a discussion on how neural networks perform on regression datasets is given in [21]. In most cases, these learning models contain a vector of parameters w used to minimize the quantity:

E(w) = ∑_{i=1}^{M} (N(x_i, w) − t_i)²   (1)

The set (x_i, t_i), i = 1, ..., M is the so-called train set, with t_i being the actual output for the pattern x_i. The model is denoted as a function N(x, w). Equation (1) is usually minimized using a variety of optimization methods, such as the back propagation method [22,23], the resilient backpropagation (RPROP) method [24][25][26], Quasi-Newton methods [27,28], particle swarm optimization [29][30][31], genetic algorithms [32,33], simulated annealing [34,35], etc. However, these techniques suffer from a number of significant problems, such as:
1. The overfitting problem. A major problem with learning techniques is that, when applied to unknown data, also called test data, they produce poor results even if the learning process was successful. This is because the parameters of the model fit accurately to the training data but fail to generalize to unknown data. This problem is presented in detail in the article by Geman et al. [36]. It is tackled by a range of methods, such as weight sharing [37,38], pruning [39][40][41], weight decaying [42,43], etc.

2. Long execution time. In most learning models, the number of parameters is at least proportional to the dimension of the objective problem, and in several cases, as for example in artificial neural networks, the number can be a multiple of that dimension. This means that long execution times of the optimization methods are required, especially for large datasets. This problem can be tackled either by using parallel optimization techniques that exploit modern parallel architectures [44,45] or by reducing the input dimension through feature selection or by constructing features from the existing ones [46,47].

3. Difficulty in explaining the solution. In most cases, the generated machine learning models produce solutions consisting of numerical series and numerical parameters derived from optimization methods. For example, an artificial neural network can consist of a sum of products with several terms, especially in problems of large dimension.
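As a concrete illustration of Equation (1), the training error of a fitted model can be computed with a few lines of code. The sketch below is a minimal example; the callable N stands in for any trained model (its parameters w already fixed) and is not part of the method described in this paper.

```cpp
#include <cstddef>
#include <vector>

// Training error of Equation (1): the sum of squared deviations between
// the model output N(x_i, w) and the desired output t_i over the M
// training pairs. The model is passed as any callable that takes a
// pattern and returns a double.
template <typename Model>
double trainingError(const Model &N,
                     const std::vector<std::vector<double>> &x,
                     const std::vector<double> &t) {
    double error = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double d = N(x[i]) - t[i];
        error += d * d; // (N(x_i, w) - t_i)^2
    }
    return error;
}
```

Minimizing this quantity over the model parameters is precisely the task handed to the optimization methods cited above.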
In this paper, an innovative technique is presented that constructs rules in a high-level programming language to estimate the true output in a regression or classification problem. The construction of the rules is done using the grammatical evolution technique [48]. Grammatical evolution is an evolutionary process that has been applied with success in many areas, such as music composition [49], economics [50], symbolic regression [51], robot control [52], caching algorithms [53], and combinatorial optimization [54].
The proposed method has an advantage over other methods from the relevant literature, as it does not depend on any prior knowledge of the dataset, and it can be applied without any change to both regression and classification problems. Furthermore, since the method uses grammatical evolution, it can discover hidden associations between the features of the objective problem and construct rules in a form that can be understood. Additionally, the proposed method does not require any additional use of an optimization method as in traditional machine learning models, thus avoiding problems of numerical accuracy. The proposed method requires a fixed amount of memory to store the candidate solutions, which does not directly depend on the dimension of the objective problem.
The rest of this article is organized as follows: in Section 2 the proposed method is outlined in detail; in Section 3 the experimental datasets are presented, as well as the comparative results of the proposed method against other methods from the relevant literature; and finally, in Section 4, a list of conclusions from the usage of the proposed method is presented.

Usage of Grammatical Evolution
The grammatical evolution technique is an evolutionary method in which chromosomes express production rules from a grammar expressed in BNF (Backus-Naur form) [55]. This technique can be used to produce programs in any programming language. A BNF grammar is defined as a set G = (N, T, S, P), where:

• N is the set of so-called non-terminal symbols.
• T is the set of terminal symbols of the grammar. For example, terminal symbols could be the digits or functional symbols (exp, log, etc.).
• S is a symbol from the set N, considered the start symbol of the grammar, which means that production initiates from this symbol.
• P is the set of production rules, used to produce new programs in the provided language. The production rules follow the scheme: A → a or A → aB, with A, B ∈ N, a ∈ T.
A basic premise for grammatical evolution to start producing rules is to modify the original grammar and put a sequence number next to each production rule. For example, consider the enhanced grammar of Figure 1. The numbers in parentheses denote the sequence numbers for each non-terminal symbol. The number N defines the original number of features of the provided dataset.
In grammatical evolution, the chromosomes are considered arrays of integers, and every element represents a production rule. The production algorithm initiates from the start symbol and progressively builds the final program by replacing non-terminal symbols with the right-hand part of the corresponding production rule. The selection of the rule is performed in two steps:

• During the first step, the next element is taken from the chromosome. Let us denote this element as V.
• The next production rule is selected according to the scheme

	Rule = V mod R

where R denotes the number of production rules for the current non-terminal symbol.
An example produced by this grammar could be the following:
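To make the two-step selection scheme concrete, the sketch below maps an integer chromosome onto a program string for a toy grammar. It is a minimal illustration of the V mod R mechanism, not the implementation used in the paper; the helper makeDemoGrammar and its two-non-terminal grammar are invented for this example.

```cpp
#include <map>
#include <string>
#include <vector>

// A production is a sequence of symbols; symbols that start with '<'
// are non-terminal. Each non-terminal maps to its list of productions.
typedef std::map<std::string, std::vector<std::vector<std::string> > > Grammar;

// Toy grammar: <expr> ::= (<expr><op><expr>) | x1 | x2,  <op> ::= + | *
Grammar makeDemoGrammar() {
    Grammar g;
    g["<expr>"] = { {"(", "<expr>", "<op>", "<expr>", ")"}, {"x1"}, {"x2"} };
    g["<op>"]   = { {"+"}, {"*"} };
    return g;
}

// Grammatical evolution mapping: starting from the start symbol, the
// leftmost non-terminal is repeatedly replaced by the right-hand side of
// rule number (V mod R), where V is the next chromosome element and R
// the number of rules for that non-terminal.
std::string mapChromosome(const Grammar &g, const std::vector<int> &chromo,
                          const std::string &start) {
    std::vector<std::string> symbols{start};
    std::size_t gene = 0;
    for (int step = 0; step < 100; ++step) { // guard against endless growth
        std::size_t i = 0;                   // locate leftmost non-terminal
        while (i < symbols.size() && symbols[i][0] != '<') ++i;
        if (i == symbols.size()) break;      // all symbols terminal: done
        const std::vector<std::vector<std::string> > &rules = g.at(symbols[i]);
        int choice = chromo[gene % chromo.size()] % (int)rules.size();
        ++gene;                              // wrap around if genes run out
        std::vector<std::string> rhs = rules[choice];
        symbols.erase(symbols.begin() + i);
        symbols.insert(symbols.begin() + i, rhs.begin(), rhs.end());
    }
    std::string out;
    for (std::size_t i = 0; i < symbols.size(); ++i) out += symbols[i];
    return out;
}
```

With the chromosome {0, 1, 1, 2}, for instance, this toy grammar yields the expression (x1*x2): the first gene expands the start symbol into a parenthesized operation, and the remaining genes fill in the operands and the operator.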

The Main Algorithm
The main algorithm consists of a series of steps:
(c)	Execute the selection procedure:
	i.	The chromosomes are sorted according to their fitness values.
	ii.	The p_s × N_C best chromosomes are copied intact to the next generation, while the genetic operators of crossover and mutation are applied to the rest.

(d)	Execute the crossover procedure: the outcome of this procedure is (1 − p_s) × N_C offspring, constructed from the current population. For every pair of offspring, two mating parents are selected using tournament selection. The offspring are constructed using one-point crossover, which is graphically demonstrated in Figure 2.

(e)	Execute the mutation procedure: a random number r ∈ [0, 1] is produced for each integer element of every chromosome, and this element is altered if r ≤ p_m, where p_m is the mutation rate.
If iter ≤ N_G, go to the Genetic Step.
5.	Otherwise, define as g* the chromosome in the population with the lowest fitness value.
6.	Create the corresponding artificial program C* through the procedure of Section 2.1 for the chromosome g*.
7.	Apply C* to the test set of the dataset and report the results.
In the proposed genetic algorithm, elitism is used for the best chromosome in the population, which means that the best solution, if found, will not be lost between the iterations of the genetic algorithm.
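The genetic operators described above can be sketched as follows. This is a generic illustration of tournament selection, one-point crossover, and per-gene mutation under the stated rates, not the authors' actual implementation.

```cpp
#include <random>
#include <utility>
#include <vector>

typedef std::vector<int> Chromosome;

// Tournament selection: draw k individuals at random and return the index
// of the best one (lower fitness is better, since the method minimizes
// the training error).
std::size_t tournament(const std::vector<double> &fitness, std::size_t k,
                       std::mt19937 &rng) {
    std::uniform_int_distribution<std::size_t> pick(0, fitness.size() - 1);
    std::size_t best = pick(rng);
    for (std::size_t i = 1; i < k; ++i) {
        std::size_t cand = pick(rng);
        if (fitness[cand] < fitness[best]) best = cand;
    }
    return best;
}

// One-point crossover: the two children exchange tails after a random
// cut point, as demonstrated in Figure 2.
std::pair<Chromosome, Chromosome>
onePointCrossover(const Chromosome &a, const Chromosome &b, std::mt19937 &rng) {
    std::uniform_int_distribution<std::size_t> cut(1, a.size() - 1);
    std::size_t c = cut(rng);
    Chromosome c1(a.begin(), a.begin() + c), c2(b.begin(), b.begin() + c);
    c1.insert(c1.end(), b.begin() + c, b.end());
    c2.insert(c2.end(), a.begin() + c, a.end());
    return std::make_pair(c1, c2);
}

// Mutation: every gene is redrawn uniformly with probability pm
// (the mutation rate).
void mutate(Chromosome &c, double pm, int maxGene, std::mt19937 &rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::uniform_int_distribution<int> gene(0, maxGene);
    for (std::size_t i = 0; i < c.size(); ++i)
        if (u(rng) <= pm) c[i] = gene(rng);
}
```

Because elitism copies the best chromosomes intact, these operators only ever act on the remaining (1 − p_s) fraction of the population in each generation.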

Experiments
The software used in the experiments was coded in ANSI C++ with the assistance of the QT programming library, available from https://www.qt.io (accessed on 30 August 2022). The software used in the following experiments is available under the GPL license from https://github.com/itsoulos/GrammaticalRuler/ (accessed on 30 August 2022). Every experiment was repeated 30 times, and averages were measured. In every execution, a different seed for the random number generator was used.
In the case of classification datasets, the average percent classification error is shown in the results tables. Likewise, for regression datasets, the mean squared error is reported in the corresponding result tables. Additionally, the well-known method of 10-fold cross-validation was used for more reliable results.
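A minimal sketch of how such a 10-fold error estimate can be assembled is given below. It assumes the per-pattern predictions are already available, and uses a simple round-robin fold assignment for illustration only (real folds are typically shuffled first).

```cpp
#include <cstddef>
#include <vector>

// Assign each of the M patterns to one of k folds in round-robin fashion.
std::vector<int> makeFolds(std::size_t M, int k) {
    std::vector<int> fold(M);
    for (std::size_t i = 0; i < M; ++i) fold[i] = (int)(i % (std::size_t)k);
    return fold;
}

// Average percent classification error over k folds: for each fold, the
// patterns assigned to it play the role of the test set.
double crossValidationError(const std::vector<int> &predicted,
                            const std::vector<int> &actual, int k) {
    std::vector<int> fold = makeFolds(actual.size(), k);
    double sum = 0.0;
    for (int f = 0; f < k; ++f) {
        std::size_t wrong = 0, total = 0;
        for (std::size_t i = 0; i < actual.size(); ++i) {
            if (fold[i] != f) continue;
            ++total;
            if (predicted[i] != actual[i]) ++wrong;
        }
        if (total > 0) sum += 100.0 * (double)wrong / (double)total;
    }
    return sum / k;
}
```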

Dataset Description
The proposed method is compared against other machine learning techniques on a number of datasets, publicly available from established repositories. The main repositories used were:
1.
The used classification datasets were:
19.	Student dataset [73], used to predict a student's knowledge level.
20.	Wine dataset, used in the chemical analysis of wines [74,75].
21.	Wdbc dataset [76], a medical dataset used to detect breast tumors.
22.	Eeg dataset, a medical dataset described in [77,78]. This dataset has four variants in the conducted experiments: Z_F_S, Z_O_N_F_S, ZONF_S, and ZO_NF_S.
23.	Zoo dataset [79], where the goal is to categorize animals into seven classes.
The regression datasets used here are:
1.	Abalone dataset [80], used to predict the age of abalone from different physical measurements.
3.	Baseball dataset, used to estimate the income of baseball players.
4.	BK dataset, used in basketball games.
Dee dataset, used to predict the price of electricity.
FA dataset, related to body fat.
10.	Housing dataset, related to housing prices [83].
11.	MB dataset, available from Smoothing Methods in Statistics [84].
12.	MORTGAGE dataset, an economic dataset.
13.	NT dataset [85], related to body temperature measurements.
14.	PY dataset [86], used to estimate Quantitative Structure-Activity Relationships (QSARs).
15.	Quake dataset, which can be used to predict the magnitude of earthquake tremors.
16.	Treasure dataset, which contains economic data from the USA.
17.	Wankara dataset, a meteorological dataset.

Experimental Results
The parameters of the current method are shown in Table 1. The results for the classification datasets are outlined in Table 2, and those for the regression datasets in Table 3. An extra line titled AVERAGE has been added at the end of each results table; this line illustrates the average of the values over all datasets, so that comparison between the different learning techniques can be made more easily. The column names are defined as follows:
1.	The column RBF represents the results obtained from an RBF network with 10 parameters.
2.	The column MLP represents the results obtained from an artificial neural network with 10 sigmoidal nodes. The neural network is trained by a genetic algorithm with 500 chromosomes. At the end of the genetic algorithm, a BFGS method [87] was utilized to enhance the obtained result.
3.	The column SGD represents the results obtained by the same artificial neural network, trained with the incorporation of the stochastic gradient descent method [88].
4.	The column LEVE represents the results from the usage of the well-known Levenberg-Marquardt local search procedure [89] to train the same artificial neural network.
5.	The column PROPOSED represents the results of the proposed method.
To justify the use of 10 weights in the RBF network and in the artificial neural network, both models were trained with 5, 10, 15, and 20 weights for all classification data, and the results are shown graphically in Figure 3. The neural network was trained using the BFGS local search procedure. The graph shows the average classification error over all datasets. As can be seen, for both networks 10-15 processing nodes are enough to achieve good results.
Judging from the obtained results, it seems that the proposed method on average outperforms the other machine learning methods. In classification problems, there is a gain of 22-30%, and in regression problems the gain increases to 50-70%. Furthermore, the proposed method does not demand any knowledge of the structure of the input dataset, and hence the memory space it uses is fixed and independent of the objective problem.
The results produced by the proposed method are in an understandable form, in which possible dependencies between the characteristics of the objective problem can be detected. The generated result of the method could, for example, be used as a function in some high-level programming language, such as the C programming language. An example program for the Wine dataset is illustrated in Figure 4. A similarly constructed program for the Housing regression dataset is shown in Figure 5. The proposed method constructs simple rules for classifying or learning functions while at the same time selecting features, i.e., keeping, from the initial features of the problem, those that have greater weight in learning.
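To give a flavor of this output format without reproducing the actual program of Figure 4, a rule function of the general shape the method emits for a classification problem might look as follows. The feature indices and thresholds below are hypothetical, invented purely for illustration.

```cpp
#include <vector>

// Hypothetical example only: a generated classification rule expressed as
// an ordinary function over the feature vector x, returning a class label.
// The indices and thresholds are illustrative, not those produced for the
// Wine dataset in the paper.
double ruleFunction(const std::vector<double> &x) {
    if (x[0] > 13.0 && x[9] <= 5.0) return 1.0;
    if (x[6] > 2.0)                 return 2.0;
    return 3.0;
}
```

Such a function can be compiled and called directly from any C/C++ program, which is what makes the produced rules easy to inspect, reuse, and interpret.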
Furthermore, an additional experiment was performed to examine the impact of altering the maximum number of generations (parameter N_G) on the accuracy and the efficiency of the method. The value of this parameter was varied from 500 to 2000, and the results for the classification datasets are shown in Table 4 and for the regression datasets in Table 5. These experiments show the dynamics of the proposed method, since even for a small number of generations it yields remarkable results. They also indicate the need for intelligent termination rules that would stop the proposed technique in time, without having to exhaust all generations of the genetic algorithm.

Conclusions
An innovative rule-generation method was presented in this work. The method can be applied to both classification and regression datasets. The proposed method constructs rules in a high-level programming language, similar to the C programming language. The method has no set of parameters to be estimated, such as the weights of an artificial neural network, and the only memory required is for the chromosomes of the genetic algorithm. This means that the memory required to run the method remains at the same level regardless of the size of the input problem. In addition, the method can be used to indirectly select features from the input features and to find dependencies between existing features. Furthermore, the results obtained are quite encouraging. The method is freely available and requires the C++ language as well as the publicly available QT programming library. Future extensions of the method may include

Figure 1. The BNF grammar for the proposed method.

Figure 3. Comparison of the average classification error of the RBF network and the artificial neural network for different numbers of weights.

Figure 5. Example program for the Housing dataset.
Read the training set. The training set consists of M pairs of points (x_i, t_i), i = 1, ..., M, where t_i is the desired output for the pattern x_i.
Apply each constructed program C_i to the training set and subsequently calculate the associated fitness f_i as

	f_i = ∑_{j=1}^{M} (C_i(x_j) − t_j)²

Table 3. Experiments for the regression datasets.

Table 4. Experiments with the number of generations for the classification datasets.

Table 4. Cont.
Figure 4. Example of a constructed program for the Wine dataset.

Table 5. Experiments with the number of generations using the proposed method for the regression datasets.