Pool-Based Genetic Programming Using Evospace, Local Search and Bloat Control

This work presents a unique genetic programming (GP) approach that integrates a numerical local search method and a bloat-control mechanism within a distributed model for evolutionary algorithms known as EvoSpace. The first two elements provide a directed search operator and a way to control the growth of evolved models, while the third is meant to exploit distributed and cloud-based computing architectures. EvoSpace is a Pool-based Evolutionary Algorithm, and this work is the first time that such a computing model has been used to perform a GP-based search. The proposal was extensively evaluated using real-world problems from diverse domains, and the behavior of the search was analyzed from several different perspectives. The results show that the proposed approach compares favorably with a standard approach, identifying promising aspects and limitations of this initial hybrid system.


Introduction
Within the field of Evolutionary Computation (EC), the Genetic Programming (GP) paradigm includes a variety of algorithms that can be used to evolve computer code or mathematical models, and has been successful in many domains. Even the first version of GP, proposed by Koza in the 1990s and commonly referred to as tree-based GP or standard GP [1], is still being used today. This paper focuses on a recent variant of GP called neat-GP-LS [2] that integrates what we consider to be fundamental elements of any state-of-the-art GP method, namely bloat control and local search (LS) techniques.
However, one discouraging aspect of integrating LS methods into a GP search is the increase in algorithm complexity (execution time might increase if the total number of generations is kept constant; however, since the algorithm converges more quickly, fewer generations are required to reach the same level of performance). One way to minimize this issue is by porting the search process to massively parallel architectures [3]. Another approach is to move towards distributed EC systems (dEC) [4][5][6]. There are several possible benefits to this approach. First, it is much simpler to develop and use a distributed system than to develop low-level code for GPUs or FPGAs [3,7]. The need for strict synchronization policies, for instance, is greatly reduced in a distributed framework compared to a GPU or FPGA implementation. Second, it is possible to leverage cheaper computing power that is already accessible, rather than investing in specialized hardware [8,9]. Finally, the robustness and asynchronous nature of an evolutionary search can easily deal with unexpected errors or dropped connections in a distributed environment. In this work, we use EvoSpace, a distributed platform designed to run on heterogeneous computing resources and a conceptual model for the development of distributed pool-based algorithms [8][9][10]. While it has been applied to standard black-box optimization benchmarks and collaborative-interactive evolutionary algorithms [11], it has not been studied in a GP-based search.
To summarize, the present paper proposes a hybrid distributed GP system that integrates a recent bloat control mechanism and an LS operator for parameter optimization of GP trees. Bloat control is performed by neat-GP, which uses speciation and the well-known method of fitness sharing to control the growth of program trees [12]. For the LS process, the method from [13,14] is used, where the individual trees are enhanced with numerical weights in each node, and these are then optimized using a trust region optimizer [15]; this strategy has proven to be beneficial in several recent learning problems [16,17]. This work shows that the EvoSpace model can easily exploit the speciation process performed by neat-GP, maintaining the same level of performance as the sequential version even though evolution is now performed in an asynchronous manner.
The remainder of this work is organized as follows. Section 2 presents relevant background and related research. Section 3 describes how the proposed system is ported to a distributed framework. A summary and conclusions are outlined in Section 4.

Background
This section describes the neat-GP algorithm and a method to integrate LS into GP. In addition, a brief overview of the EvoSpace model is provided.

neat-GP
The neat-GP algorithm [12] is based on the operator equalization [18] family of bloat control methods, in particular the Flat-OE [19] algorithm, and on the NeuroEvolution of Augmenting Topologies (NEAT) algorithm [20].
The neat-GP algorithm has the following main features. The initial population only contains shallow trees (3 levels), while most GP algorithms initialize the search with small- and medium-sized trees (depths of 3-6 levels).
Individual trees are grouped into species, using a similarity measure that is based on their size and shape. Given a tree T, let n_T represent the size of the tree (number of nodes) and d_T its depth (number of levels). Moreover, let S_{i,j} represent the shared structure between two trees T_i and T_j starting from the root node (the upper region of the trees), which is itself a tree, as seen in Figure 1. Then, the dissimilarity between T_i and T_j is given by

δ_T(T_i, T_j) = β [1 − 2 n_{S_{i,j}} / (n_{T_i} + n_{T_j})] + (1 − β) [1 − 2 d_{S_{i,j}} / (d_{T_i} + d_{T_j})],   (1)

where β ∈ [0, 1] weighs the relative contribution of size and depth. Each time an individual T_i is produced, it is compared to a randomly chosen individual T_j from each species, taking the species in randomly shuffled order. If δ_T(T_i, T_j) < h, with threshold h an algorithm parameter, then the tree T_i is assigned to the species of T_j, and no further comparisons are carried out. If the condition is never satisfied, a new species is created for the tree T_i.
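For concreteness, the speciation rule can be sketched as follows. This is an illustrative sketch, not the authors' implementation: trees are encoded as nested tuples `(label, child1, child2, ...)`, and the shared upper structure is computed by matching node labels from the root down, under the assumption that the dissimilarity combines the shared size and depth as in Equation (1).

```python
import random

def size(t):
    # Number of nodes in a nested-tuple tree (label, child1, child2, ...)
    return 1 + sum(size(c) for c in t[1:])

def depth(t):
    # Number of levels in the tree
    return 1 + max((depth(c) for c in t[1:]), default=0)

def shared(t1, t2):
    # Shared upper structure starting from the root; None if the roots differ
    if t1[0] != t2[0]:
        return None
    kids = [shared(a, b) for a, b in zip(t1[1:], t2[1:])]
    return (t1[0], *[k for k in kids if k is not None])

def dissimilarity(t1, t2, beta=0.5):
    # Size/shape dissimilarity in [0, 1]: 0 for identical trees
    s = shared(t1, t2)
    ns, ds = (size(s), depth(s)) if s else (0, 0)
    term_n = 1 - 2 * ns / (size(t1) + size(t2))
    term_d = 1 - 2 * ds / (depth(t1) + depth(t2))
    return beta * term_n + (1 - beta) * term_d

def assign_species(tree, species, h=0.15, beta=0.5):
    # Compare the new tree against one random member of each species,
    # taking the species in shuffled order; create a new species on failure
    order = list(species)
    random.shuffle(order)
    for sp in order:
        if dissimilarity(tree, random.choice(sp), beta) < h:
            sp.append(tree)
            return sp
    new_sp = [tree]
    species.append(new_sp)
    return new_sp

t1 = ('+', ('x',), ('1',))
t2 = ('+', ('x',), ('y',))
```

For example, `dissimilarity(t1, t1)` is 0 (the trees are identical), while `dissimilarity(t1, t2)` is 1/6, above the threshold h = 0.15 used in the experiments, so the two trees would fall into different species.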
To promote the formation of several species, fitness sharing is used: individuals in large species (with many trees) are penalized more than individuals in smaller species (with fewer trees). Assuming a minimization problem, neat-GP penalizes individuals using

f'(T_i) = f(T_i) · |S_u|,

where f(T_i) is the raw fitness of the tree, f'(T_i) is the penalized or adjusted fitness, S_u is the species to which T_i belongs, and |S_u| is the number of individuals in species S_u. However, the best individual (with the best fitness) of each species is not penalized; this protects the elite individual of each species. Moreover, the penalization matters most during parent selection, which considers the adjusted fitness. Selection is done deterministically, sorting the population based on adjusted fitness. In this way, individuals with very bad adjusted fitness will not produce offspring, but this high selective pressure is offset by protecting the elite individuals of each species, such that the best individual from each species has a good chance of producing offspring.
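The fitness adjustment can be sketched as follows, assuming the multiplicative penalty described above (raw fitness scaled by the species size, with the species elite protected); the function name is illustrative.

```python
def adjust_species(species_fitness):
    """Adjusted (shared) fitness for one species, minimization assumed.

    species_fitness: list of raw fitness values of the species' members.
    The best member keeps its raw fitness (elitism); every other member
    is penalized by multiplying its raw fitness by |S_u|, the species size.
    (Ties at the best value are all protected in this simple sketch.)
    """
    best = min(species_fitness)
    n = len(species_fitness)
    return [f if f == best else f * n for f in species_fitness]
```

For instance, a species of three trees with raw fitness [1.0, 2.0, 3.0] gets adjusted fitness [1.0, 6.0, 9.0]: the elite is untouched, while the others are pushed towards the back of the deterministic selection ranking.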

Local Search in Genetic Programming
In particular, we focus on symbolic regression problems, where the goal is to search for the symbolic expression K_O : R^p → R that best fits a particular training set T = {(x_1, y_1), . . . , (x_n, y_n)} of n input/output pairs, with x_i ∈ R^p and y_i ∈ R, defined as

K_O = arg min_{K ∈ G} f(K(x_i, θ), y_i) for i = 1, . . . , n,

where G is the solution or syntactic space defined by the primitive set P of functions and terminals; f is the fitness function, which is based on the difference between a program's output K(x_i, θ) and the desired output y_i; and θ ∈ R^m is a particular parametrization of the symbolic expression K, assuming m real-valued parameters. The goal of the LS method is to optimize the parameters of each GP solution. Following [13,14], the search includes an additional operator that is not common in GP: an LS process used to optimize the implicit parameters of GP individuals. This allows the search to use subtree mutation and crossover to explore the search space, or syntax space, while the LS process performs fine-tuning of the individuals in parameter space.
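As a concrete instance of the fitness function f, the following sketch assumes f is the root-mean-square error over the training set (the error measure reported in the experiments; the text above only states that f is based on the difference between K(x_i, θ) and y_i):

```python
import math

def rmse_fitness(K, theta, X, y):
    """RMSE fitness of a candidate program K under parameters theta.

    K is a callable K(x, theta); X is the list of input vectors x_i and
    y the list of desired outputs y_i from the training set T.
    """
    errors = (K(x, theta) - yi for x, yi in zip(X, y))
    return math.sqrt(sum(e * e for e in errors) / len(y))
```

A program that reproduces the targets exactly gets fitness 0, the minimum of this minimization problem.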
As suggested in [21], for each individual K in the population, we add a small linear upper tree above the root node, such that K' = θ_2 + θ_1 · K, where K' represents the new program output, while θ_1 and θ_2 are the first two parameters of θ, as shown in Figure 2. Similarly, for all the other nodes n_k in the tree K we add a weight coefficient θ_k ∈ R, such that each node is now defined by n'_k = θ_k · n_k, where n'_k is the new modified node, k ∈ {1, . . . , r}, r = |Q| and Q is the set of nodes in the tree representation. Notice that each node has a unique parameter that can be modified to help meet the overall optimization criterion of the nonlinear expression. When the search starts, the parameters are initialized to θ_k = 1. Then, during the evolutionary process, when subtree mutation or crossover exchange genetic material (syntax) between individuals, they also exchange the corresponding parameter values. In general, each GP individual is considered to be a nonlinear expression that the LS operator must fit to the problem data. This can be done using different methods, but here a trust region optimizer is used [22], following [13,14].
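The node-weighting and trust-region fit can be sketched as follows. The two-input weighted tree is a hypothetical example (not from the paper), and SciPy's Trust Region Reflective solver (`method='trf'`) stands in for the trust region optimizer of [15,22]; all parameters start at 1, so the unweighted program is recovered before fitting.

```python
import numpy as np
from scipy.optimize import least_squares

def weighted_program(theta, x):
    # Hypothetical weighted tree K'(x) = t2 + t1 * ((t3 * x0) * (t4 * x1)):
    # every node carries its own weight theta_k, initialized to 1
    t1, t2, t3, t4 = theta
    return t2 + t1 * (t3 * x[0]) * (t4 * x[1])

def fit_parameters(X, y, theta0):
    # Trust-region fit of the node weights to the training data
    residuals = lambda th: np.array([weighted_program(th, x) for x in X]) - y
    return least_squares(residuals, theta0, method='trf').x

X = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
y = np.array([2.0 * a * b + 1.0 for (a, b) in X])   # target: 2*x0*x1 + 1
theta = fit_parameters(X, y, np.ones(4))            # theta_k = 1 at the start
pred = np.array([weighted_program(theta, x) for x in X])
```

Note that the individual parameters are not uniquely identified (only the product θ_1 θ_3 θ_4 and the offset θ_2 matter in this example), which is typical of this parametrization: the LS operator only needs some weight assignment that fits the data.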
One of the most important things to consider is that the local search optimizer can substantially increase the underlying computational cost of the search, particularly when individual trees are very large. While applying the local search strategy to all trees might produce good results [13], it is preferable to reduce to a minimum the number of trees to which it is applied.

Integrating LS into neat-GP
The neat-GP-LS algorithm was recently proposed to integrate the neat-GP search with an LS process [2], showing the ability to improve performance and generate compact and simple solutions. Figure 3 shows the main modules in this algorithm. Another interesting result reported in [2] was that neat-GP-LS exhibited very little performance variance on all tested problems, suggesting that the meta-heuristic search is robust.
Given the reliance of neat-GP-LS on the speciation process, as defined for neat-GP, the following observations are of note. First, species tend to grow in size when the individuals in the species have good fitness, and they grow more when they include the best solution in the entire population. Second, while species with bigger trees tend to appear as evolution progresses, diversity is maintained throughout the search. Third, while species differ in terms of the size and shape of the individuals they contain, it is common for all species to include at least some highly fit individuals. Finally, since species grow in size when they contain highly fit individuals, this increased exploitation is beneficial: the LS tends to produce the largest improvements in precisely those species.

EvoSpace
The EvoSpace model for evolutionary algorithms (EA) follows a pool-based approach [8,9], where the search process is conducted by a collection of possibly heterogeneous processes that cooperate using a shared memory or population pool. We refer to such algorithms as pool-based EAs (PEAs) and highlight the fact that such systems are intrinsically parallel, distributed and asynchronous.
In EvoSpace, distributed nodes (called EvoWorkers) asynchronously interact with the pool; their job is to take a subset of individuals from the central pool, which is called a sample, and evolve them for a certain number of generations (or until a given termination criterion is met), and return the new population of offspring back to the pool. The general scheme is depicted in Figure 4.
This means that EvoSpace has two main components: a set of EvoWorkers and a single instance of an EvoStore. The EvoStore container manages a set of objects representing the individuals in an EA population. EvoWorkers pull a subset of individuals from the EvoStore, making them unavailable to other workers; individuals are removed from the EvoStore as a random subset or sample of the population. Once an EvoWorker has a sample to work on, it can perform a partial evolutionary process and then return the newly evolved subpopulation to the EvoStore, where the new individuals replace those found in the original sample; at this point, the replaced or reinserted individuals can be taken by other clients. Figure 5 shows the distributed architecture of the EvoSpace model with GP. The figure shows that the EvoSpace manager and the HTTP communication framework run on the server, while different samples of individuals from the population are sent to EvoWorkers, where evolution takes place. EvoSpace was conceived as a model for cloud-based evolutionary algorithms and is general enough to be amenable to any type of population-based algorithm. Several works have shown that this general approach can solve standard black-box optimization problems [9] and even interactive evolution tasks [11]. It has been shown, as expected, that distributing costly fitness function evaluations helps reduce the total run-time of the algorithm [9].
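The pull/evolve/push protocol can be sketched with a minimal, sequential in-memory pool; the class and method names are illustrative stand-ins, not the actual EvoSpace API, and real EvoWorkers would interact with the store asynchronously over HTTP.

```python
import random

class EvoStore:
    """Minimal in-memory sketch of the EvoSpace population pool."""

    def __init__(self, population):
        self.pool = list(population)

    def get_sample(self, size):
        # A random subset of the population leaves the pool, becoming
        # unavailable to other workers until it is reinserted
        sample = random.sample(self.pool, size)
        for ind in sample:
            self.pool.remove(ind)
        return sample

    def put_sample(self, sample):
        # Evolved offspring replace the individuals of the original sample
        self.pool.extend(sample)

def evoworker(store, sample_size, evolve):
    # One worker cycle: pull a sample, evolve it locally, push it back
    store.put_sample(evolve(store.get_sample(sample_size)))

store = EvoStore(range(20))
evoworker(store, 5, lambda s: [i + 100 for i in s])  # toy "evolution" step
```

After the worker cycle the pool size is unchanged (20 individuals), but five of its members have been replaced by "evolved" offspring; in the real system many such cycles run concurrently against the same store.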

Distributing neat-GP-LS into the EvoSpace Model
In this work, we present the first implementation of a GP algorithm on EvoSpace. Since neat-GP-LS already divides the population into species, it seems straightforward to exploit this structure and distribute individuals to EvoWorkers by sending complete species to each.

The Intra-Species Distance and Re-Speciation
One aspect of neat-GP-LS that is not asynchronous is the speciation process. In the sequential and synchronous versions, speciation occurs at specific moments during the search, as shown in Figure 3. However, since EvoSpace is asynchronous, EvoWorkers return samples to the population pool at different moments in time. When an EvoWorker returns a sample, it is not correct to assume that all of the new individuals actually belong in the same species. It is possible that the species diverged during the local evolution carried out on the EvoWorker.
To solve this issue, we track the level of homogeneity within each species, measured before a species leaves the pool and when the new species returns from the EvoWorker. If a significant change is detected, a flag is raised telling EvoSpace that the population should go through a new speciation process, or re-speciation. This is done by computing what is referred to as the intra-species distance. Basically, within each species, we compute the dissimilarity measure of Equation (1) between each tree T_i and its nearest neighbor T_j (the individual within the species for which Equation (1) is minimal), calling this value nn_{T_i}. Then, the intra-species distance D_{S_l} for species S_l is the average of nn_{T_i} over all T_i ∈ S_l.
The D_{S_l} values could be used in different ways to trigger a re-speciation process. In this work, let D_{S_l} be the intra-species distance before S_l is taken as a sample by an EvoWorker, and let D'_{S_l} be the intra-species distance of species S_l computed on the population returned by the EvoWorker. If D'_{S_l} > D_{S_l} for any species in the population, then a re-speciation event is triggered. Basically, this causes a synchronization event, where the EvoStore waits for all species to return and the population goes through the speciation process once more. Figure 6 shows the basic scheme of the proposed implementation. Compared to Figure 5, the new implementation in Figure 6 accounts for specific elements of the neat-GP algorithm. In particular, the speciation process is carried out on the server, such that instead of sending random samples of individuals to the EvoWorkers, complete species are sent and a local evolutionary process is carried out. In this case, the number of EvoWorkers used depends on the number of species in the population.
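The intra-species distance and the re-speciation trigger can be sketched as follows; the callable `dissim` stands in for the dissimilarity of Equation (1), and the function names are illustrative. Species are assumed to have at least two members.

```python
def intra_species_distance(species, dissim):
    """Average nearest-neighbour dissimilarity within one species.

    For each tree, find the smallest dissimilarity to any other member
    (its nearest neighbour), then average these values over the species.
    """
    nn = []
    for i, t in enumerate(species):
        others = species[:i] + species[i + 1:]
        nn.append(min(dissim(t, o) for o in others))
    return sum(nn) / len(nn)

def needs_respeciation(d_before, d_after):
    # Trigger when the species became less homogeneous on the EvoWorker,
    # i.e., D'_{S_l} > D_{S_l}
    return d_after > d_before

# Toy example: "trees" are scalars and dissimilarity is absolute difference
species = [1.0, 1.1, 5.0]
d = intra_species_distance(species, lambda a, b: abs(a - b))
```

In the toy example, the outlier 5.0 inflates the average nearest-neighbour distance; if local evolution on an EvoWorker produced such divergence, the returned species would raise the re-speciation flag.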

Experiments and Results
We analyzed and evaluated the integration of the neat-GP-LS algorithm into a PEA, namely the EvoSpace model. EvoSpace was designed for problems where fitness computation might be expensive; in this work, however, we were only interested in studying the effects of implementing neat-GP-LS as a PEA. In particular, we wanted to determine if there are any significant and substantial effects on the convergence of the algorithm, the quality of the solutions in the population, and the behavior of the bloat phenomenon.
For simplicity, the distributed framework was simulated using multiple CPU threads, such that each EvoWorker was assigned to a specific thread. When the number of EvoWorkers exceeded the number of threads, then several workers could share a single thread.
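This thread-based simulation can be sketched with Python's standard thread pool; the function names are illustrative, and the pool transparently handles the case where there are more EvoWorkers (one per species) than available threads.

```python
from concurrent.futures import ThreadPoolExecutor

def run_workers(species_samples, evolve, n_threads=4):
    """Run one EvoWorker task per species sample on a fixed thread pool.

    When len(species_samples) exceeds n_threads, several workers share
    a thread, as described in the text; results keep the input order.
    """
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(evolve, species_samples))

# Three "species" evolved by two threads, with a toy evolution step
results = run_workers([[1, 2], [3], [4, 5, 6]],
                      lambda s: [i * 2 for i in s],
                      n_threads=2)
```

Note that `ThreadPoolExecutor.map` queues the excess tasks, so no explicit scheduling of workers onto threads is needed.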
All experiments were carried out using real-world symbolic regression problems, where the objective is to minimize the fitness function. All problems are summarized in Table 1.
When a species was sent to an EvoWorker, we performed a short local evolutionary search, basically a standard GP search using the parameters specified in Table 2. The number of EvoWorkers depended on the number of species in the EvoStore, and we assumed that an EvoWorker was always available for any species in the EvoStore. In addition, the local evolution performed in an EvoWorker iterated for 10 generations, applying the LS operator with a probability of 0.50.

Table 1. Real-world symbolic regression problems (instances, features, as reported):
- Concrete: the concrete compressive strength is a highly nonlinear function of age and ingredients.
- Energy Heating [25] (768 instances, 9 features): assessing the heating load requirements of buildings as a function of building parameters.
- Energy Cooling [25] (768 instances, 9 features): assessing the cooling load requirements of buildings as a function of building parameters.
- Tower [26] (5000 instances, 26 features): an industrial data set of a gas chromatography measurement of the composition of a distillation tower.
- Yacht [27] (308 instances, 7 features): Delft data set, used to predict the hydrodynamic performance of sailing yachts from dimensions and velocity.

Table 2. Parameters of the local evolutionary search (selected rows):
- Terminals: input variables and constants, as indicated by each real-world problem.
- Selection for reproduction: eliminate the p_worst = 50% worst individuals of each species.
- Elitism: do not penalize the best individual of each species.
- Species threshold: h = 0.15, with β = 0.5.
- Local optimization probability: P_s = 0.5.

Figure 7 shows a single run of the PEA version of neat-GP-LS on the Housing, Concrete and Energy Cooling problems. The plots show the convergence of the training and testing RMSE, as well as the average size of the population, given in number of tree nodes. The horizontal axis represents the number of samples taken from the EvoStore. Note that the number of samples varied over different problems and over different runs, due to the randomness of the initial population and the speciation process, and due to the asynchronous nature of the EvoSpace model, which makes it unfeasible to aggregate the behavior of multiple runs into a single plot. Therefore, these plots only show a single run, but the behavior of the algorithm in these examples is representative of the convergence behavior of most runs. One notable observation is the almost identical behavior of the training and testing RMSE in all of the runs, showing that the algorithm generalizes in a consistent manner relative to training performance. The size of the population is also quite informative. Notice that, while the average size fluctuates in all cases, the algorithm is in general producing compact solutions. This is particularly clear when the search process terminates and the final sample is returned to the EvoStore.
The results are summarized in Figures 8 and 9, which show box plot comparisons between the sequential neat-GP-LS algorithm and the PEA implementation in EvoSpace, respectively, for test RMSE and the average size of the population. Table 3 presents the p-values of the Friedman test, where bold values indicate that the null hypothesis is rejected at the α = 0.05 confidence level. The null hypothesis states that the medians of the two groups are the same. Notice that, on three (Concrete, Energy Heating and Tower) of the six problems, the EvoSpace version performed worse than the sequential algorithm in terms of RMSE, since the null hypothesis was rejected. Conversely, on the three problems in which the PEA version and the sequential algorithm performed equivalently based on test RMSE (i.e., the null hypothesis is not rejected), namely the Housing, Energy Cooling and Yacht problems, EvoSpace produced smaller trees and thus was more effective at bloat control. Therefore, we can state with some confidence that the modified search dynamics introduced in the distributed version of the algorithm do alter the effectiveness of the search. On the one hand, the quality of the results seemed to depend on the problem. On the other hand, in all cases where the EvoSpace implementation achieved equivalent performance, it was significantly and substantially less affected by bloat, producing more parsimonious and compact solutions. It is reasonable to assume that larger learning problems, in terms of number of instances and features, are in general more difficult to solve. Moreover, difficult problems usually require more complex or larger solutions to effectively model their structure. The three problems where the RMSE performance of the EvoSpace implementation was statistically worse (Concrete, Energy Heating and Tower) are also three of the four largest problems used in our experiments, in terms of total number of instances and number of features (see Table 1).
Since the EvoSpace search dynamics push the search towards smaller program sizes, with statistical significance in five of the six problems (including all problems in which RMSE performance was worse), a plausible explanation of the results can be formulated: the EvoSpace implementation is controlling bloat too aggressively, severely impacting learning in the more difficult test cases. Therefore, future variants of the implementation will need to allow the search to explore larger program sizes to evolve more accurate models. Finally, Figure 10 analyzes the re-speciation process based on the intra-species distance. The plot shows how D_{S_l} changes over time for each of the species in the population, using a single run of the algorithm on the Housing problem, zooming in on the first 225 samples taken by the EvoWorkers. Each vertical line represents the difference between D_{S_l} and D'_{S_l}. When a line is black (shorter lines), it means that a re-speciation event was not triggered; when a line is red (longer lines), it means that a re-speciation event could have been triggered by that sample. We can see that, at the beginning of the run, re-speciation events are more frequent and, as the search progresses and begins to converge, these events become infrequent. Figure 10. Analysis of the re-speciation process using the intra-species distance.

Conclusions and Future Work
This work presents, to the authors' knowledge, the first implementation of a GP system in a Pool-based EA, using the EvoSpace model. The PEA approach is particularly well suited for the speciation-based neat-GP search, allowing for a straightforward strategy to distribute the population over the processing elements of the system (EvoWorkers). It is notable that the performance of the PEA version was not equivalent to the sequential one, in two key respects. On the one hand, it did not reach the same level of performance on some problems. On the other hand, on the problems where it performed equivalently, or better, it was able to reduce solution size significantly.
Future work will center on eliminating the synchronization required by the speciation process in the EvoSpace implementation. Another interesting extension is to consider other elements in the speciation process besides program size and shape, such as program semantics, program behavior or solution novelty. Moreover, we would like to integrate a wider range of parameter local search methods, particularly gradient-free methods, and to combine them with other forms of local optimizers that work at the level of syntax or semantics. Finally, it will be important to deploy the proposed algorithms in high-performance computing platforms, to tackle large-scale big-data problems, where distributing the computational load becomes a requirement.