An Improved Genetic Algorithm with a New Initialization Mechanism Based on Regression Techniques

Abstract: The genetic algorithm (GA) is a well-known evolutionary computation technique that plays a significant role in obtaining meaningful solutions to complex problems with large search spaces. After creating an initial population, GAs apply three fundamental operations: selection, crossover, and mutation. The first task in a GA is therefore to create an appropriate initial population. Traditionally, a randomly generated population is widely used because it is simple and efficient; however, the generated population may contain individuals of poor fitness, which can lengthen the time needed to converge to an optimal (or near-optimal) solution. The fitness of the initial population thus plays a significant role in determining an optimal or near-optimal solution. In this work, we propose a new method for initial population seeding based on linear regression analysis of the problem tackled by the GA; in this paper, the traveling salesman problem (TSP). The proposed regression-based technique divides a given large-scale TSP into smaller sub-problems. This is done using the regression line and its perpendicular line, which cluster the cities into four sub-problems; the location of each city determines the cluster to which it belongs. The algorithm is applied repeatedly until each sub-problem becomes very small (four cities or fewer, for instance). These cities are likely to neighbor each other, so connecting them creates a reasonably good solution to start with; this solution is mutated several times to form the initial population. We analyze the performance of the GA when using traditional population seeding techniques, such as random and nearest neighbor, along with the proposed regression-based technique. The experiments are carried out using some of the well-known TSP instances obtained from TSPLIB, the standard library of TSP problems. Quantitative analysis is carried out using statistical tests: analysis of variance (ANOVA), the Duncan multiple range test (DMRT), and least significant difference (LSD). The experimental results show that the GA that uses the proposed regression-based technique for population seeding outperforms GAs that use traditional population seeding techniques, such as the random and nearest neighbor based techniques, in terms of error rate and average convergence.


Introduction
Genetic algorithms (GAs) are stochastic optimization search techniques inspired by natural evolution strategies. They are based on the concept of 'survival of the fittest'. The basic principles of GAs were first described at the University of Michigan in the 1970s by John Holland [1]. Holland aimed to simulate natural evolution by studying adaptation in natural and artificial systems [2]. He derived GAs from the Darwinian principle of 'survival of the fittest' [3,4]: a new generation of chromosomes is produced through recombination (crossover) and mutation operations, and the fittest or feasible individuals are more likely to remain, mate, and generate the next generation. Ideally, the new individuals have more favorable fitness than the previous ones (i.e., the solution evolves from one generation to another). However, this is not always the case, as new individuals may also have worse fitness than their predecessors; a good selection strategy mitigates this.
GAs are often suitable for reaching an optimal or near-optimal solution in problems with very large search spaces [5,6]. They are not mathematically guided algorithms; rather, they are stochastic algorithms in which generation and selection are implemented randomly [7]. Their two major operators, crossover and mutation, allow GAs to deal with optimization problems efficiently: the crossover operator attempts to create better solutions from the solutions in hand, and the mutation operator allows GAs to escape strong local minima [8]. Therefore, GAs continue to attract the interest of many researchers as optimization tools for finding optimal or near-optimal solutions quickly, reliably, accurately, and effectively [9][10][11]. They have become popular techniques for solving a wide variety of problems such as image processing [10], speech recognition [12,13], software engineering [14], clustering low-dimensional data [15], optimization of transport networks [16], vehicle detection [17], business process simulation [18], sensor network configuration [19], and robotics [20]. GAs have been further developed to increase population diversity and search efficiency [21], and new variants of the standard GA have emerged, such as multi-population GAs and parallel GAs.
The initial population seeding phase is the first phase of any GA application. It generates a population of feasible solutions (individuals), either randomly or by heuristic initialization, as input for the GA. Although the initial population seeding phase is executed only once, whereas the other GA phases are repeated [24,25], it plays an important role in improving GA performance [22,23]. Various initialization techniques have been introduced since the emergence of GAs. All known techniques depend on the availability of information about the problem being studied [24,26]. Random initialization is one of the most commonly used techniques for generating the initial population. However, a random population contains poor-fitness solutions, which decreases the likelihood of finding an optimal or near-optimal solution, and it requires a long search time when knowledge about the problem is lacking. Nevertheless, the random population is preferred when applying GAs to a wide range of problems. On the other hand, in a huge search space, if prior heuristic information about the optimal solution is available, it can be used to generate the initial population and to identify high-quality regions of the search space easily. Generating the initial population with heuristic techniques yields high-quality individuals that allow GAs to find better solutions faster. However, the search may end up confined to a small region of the search space and may never reach the global optimal solution [27].
Usually, the computation time a GA needs to generate the initial population is less than the time consumed by the normal generation process. Further, the population in each generation depends on the previous population and, ultimately, on the initial population [25,29]. Therefore, specifying the initial population is very important for decreasing the computation time needed to find an optimal or near-optimal solution. The initial population seeding phase is as important as the other GA phases; it plays a clear role in increasing efficiency and in obtaining an optimal or near-optimal solution. However, random initialization generates individuals that need additional computation time to reach an optimal or near-optimal solution, because they may be infeasible or of poor quality. Researchers have therefore pointed out the need to improve the quality of the population generated at the initial population stage of the GA. An improved convergence rate is also an important requirement when solving a particular problem. They noted that GAs may obtain the optimal solution for a given problem when the initial individuals are generated with good quality and maximum diversity [28]. Researchers have also shown that initial population seeding based on prior knowledge about the problem can enhance the GA's capability to provide solutions close to an optimal solution [26].
The method used to generate the initial population has a critical impact on the convergence speed and the quality of the final solution [30]. A problem-specific initial population seeding technique improves the efficiency of GAs in finding the optimal solution, yet the techniques proposed for initial population seeding are limited in number [31]. Most of these techniques, such as random, nearest neighbor, and K-means clustering, focus on improving the quality of the initial population; see Figure 1. The limited number of these techniques motivates this study, as there is still room for finding better initial populations to start with.

Figure 1. ... [32], (c) K-means clustering initial population [33], and (d) the best solution of this problem.
Several factors can influence the performance of GAs: parameters, genetic operators, and strategies [34]. One of the main factors affecting GA performance is the initial population. Random initialization is the traditional method for obtaining the initial population; however, it is usually inefficient at producing a favorable initial population. This work proposes a new regression-based technique for seeding the initial population of GAs. We analyze the performance of the GA using traditional population seeding techniques, such as random and nearest neighbor, along with the proposed regression-based technique. Our approach divides the problem into sub-problems using regression line analysis, which captures the relationship between the points in x − y coordinates. The partition arises from the intersection of the regression line and its perpendicular line at the center point. This clusters the cities into four sub-problems, with the coordinates of each city determining the cluster to which it belongs. The algorithm is applied repeatedly until each sub-problem becomes very small (four cities or fewer, for instance). These cities are likely to be adjacent to each other, so connecting them creates a reasonably good solution to start with; this solution is mutated several times to form the initial population.
Our experiments show promising results that highlight the efficiency of the new approach in terms of error rate, average convergence, and convergence diversity. The contribution of this work is a new approach for producing the initial population when solving the TSP, compared against the random and NN techniques.
The rest of this paper is organized as follows. Section 2 reviews the basics of GAs and of solving TSP using GAs, and highlights important population seeding techniques. Section 3 introduces our regression-based technique with an illustrative example. Section 4 provides detailed experimental results on TSP and compares the different techniques using different error metrics. Finally, Section 5 concludes the paper and indicates possible future research directions.

Basic Principles of Genetic Algorithms
GA is one of the most efficient and popular techniques used to find optimal or near-optimal solutions for hard problems with a large search space, particularly combinatorial problems where the search space is of factorial order. The primary function of a GA is to generate and manipulate several individuals using suitable genetic operators to find solutions. Thus, GA is classified as a global search technique that works on a collection of solutions instead of a single solution [35]. Generally, a classical GA needs a large computation time to reach the optimal solution. However, this can be mitigated by using heuristics in a problem-specific way, which may decrease the computation time and improve the solutions evolved by GAs [36].
1. Encoding: Before solving a problem with a GA, an appropriate encoding technique must be applied to represent the individuals of the problem domain as genes of a specific length. The type of problem determines the encoding technique used [37][38][39]. Below, some encoding techniques are introduced:

• Binary encoding: all individuals are represented as a series of bits, 0 or 1; each bit represents a gene in the chromosome. For example, the knapsack problem uses binary encoding.

We next briefly describe crossover and mutation using binary encoding as the example for simplicity; other encodings require other types of crossover and mutation.
5. Crossover (Recombination): uses two individuals as parents to produce new offspring by exchanging parts of the parents' genes. Thus, there is a chance of producing offspring with higher fitness. Various crossover operators can be applied with GAs [40,41], beginning with single-point and two-point crossover and extending to other techniques suited to particular situations; see Figure 3.
6. Mutation: the process of altering or swapping certain genes within one chromosome to obtain other chromosomes as new input solutions for the next generation. Mutation aims to introduce a healthy level of diversity into the population, and it helps the search avoid becoming trapped in a local optimum [42]. There are several mutation methods, starting with bit-flip mutation and extending to other types suited to particular situations; see Figure 4.
7. Termination (stopping) criteria: Many terminating conditions have been applied to the simple GA [43], such as:
1. Reaching the maximum number of generations.
2. The chance of further improvement has become weak, which means that a low convergence diversity rate is expected [3,43].
3. The improvement in fitness remains below a threshold value.
The life cycle of GAs evolves from one phase to another starting with population seeding, selection, crossover, mutation, and finally the stop constraint, see Figure 5.
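The life cycle above can be sketched in a few lines for the binary encoding case. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function names, the tournament selection, the elitism step, and all parameter defaults are hypothetical choices, and the stop criterion is simply a fixed number of generations.

```python
import random

def single_point_crossover(p1, p2, rng):
    """Swap the tails of two equal-length bit lists at a random cut point."""
    cut = rng.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def bit_flip_mutation(chrom, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chrom]

def tournament(pop, fitness, rng, k=2):
    """Pick the fittest of k randomly sampled individuals (assumed selection scheme)."""
    return max(rng.sample(pop, k), key=fitness)

def run_ga(fitness, n_bits=20, pop_size=30, generations=50, mut_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Population seeding: random initialization of bit strings.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):      # stop criterion: maximum number of generations
        nxt = [best[:]]               # elitism (assumption): keep the best individual
        while len(nxt) < pop_size:
            a = tournament(pop, fitness, rng)
            b = tournament(pop, fitness, rng)
            c1, c2 = single_point_crossover(a, b, rng)
            nxt.append(bit_flip_mutation(c1, mut_rate, rng))
            if len(nxt) < pop_size:
                nxt.append(bit_flip_mutation(c2, mut_rate, rng))
        pop = nxt
        best = max(pop, key=fitness)
    return best

# Usage: maximize the number of ones in the chromosome.
best = run_ga(fitness=sum)
```

Because the best individual is copied unchanged into each new generation, the best fitness in this sketch never decreases from one generation to the next.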

Solving Travelling Salesman Problem (TSP) with GA
The term 'traveling salesman' was first introduced in a manual for traveling salesmen in Germany in 1832. The TSP is a classical combinatorial optimization problem that is easy to state but hard to solve [29]. It belongs to the class of NP-hard problems, and no polynomial-time algorithm is known for it. In TSP, the objective is to find the optimal path (tour) among a set of vertices that begins at a given vertex N and ends at the same vertex, such that each vertex is visited exactly once. Since TSP is a minimization problem, the fitness function is the cost of the path. The Euclidean distance (ED) is used to calculate the cost between two cities:

$$ED(i, j) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of city i and city j, respectively. The TSP can be classified into three types:
1. Symmetric traveling salesman problem: The distance or cost between any two city nodes is equal in both directions (undirected graph), i.e., the distance from node1 to node2 is the same as the distance from node2 to node1. Therefore, the number of distinct tours is (n − 1)!/2.
2. Asymmetric traveling salesman problem: The distance or cost between any two city nodes is not equal in both directions (directed graph), i.e., the distance from node1 to node2 is not the same as the distance from node2 to node1. Thus, the number of distinct tours is (n − 1)!.
3. Multi traveling salesman problem: More than one salesperson is involved in the problem of finding the optimal route.
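As a small illustration of the fitness function and of the tour counts above, the following sketch computes the cost of a closed tour under the Euclidean distance. The function names and the 0-based city indexing are assumptions made for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def tour_length(tour, coords):
    """Cost of a closed tour: sum of consecutive edges plus the return to the start."""
    n = len(tour)
    return sum(euclidean(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))

def count_tours(n, symmetric=True):
    """Number of distinct tours over n >= 3 cities with a fixed starting city."""
    total = math.factorial(n - 1)
    return total // 2 if symmetric else total

# Usage: four cities on the corners of a unit square.
coords = [(0, 0), (0, 1), (1, 1), (1, 0)]
perimeter = tour_length([0, 1, 2, 3], coords)   # walks around the square
crossing = tour_length([0, 2, 1, 3], coords)    # uses both diagonals
```

In the symmetric case a tour and its reversal have the same cost, which is why the count is halved.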
Over the years, a huge number of studies have been carried out to solve TSP using GAs [8]. Hence, there are many representations of TSP for a GA, each with different operators [44]. These representations include:

3. Adjacency representation: The tour is stored as a list in which position i holds city j if the tour goes from city i to city j; the destination city thus becomes the source of the next step of the tour.

4. Ordinal representation: The tour is implemented as a list of cities in which the i-th element is a number ranging from 1 to (n − i + 1).

5. Matrix representation: Several models can be applied to the matrix representation [46,47]. In this representation, if city i is linked to city j, then the element $m_{ij}$ of the bit matrix M is set to 1.
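The adjacency and matrix representations can be illustrated by converting an ordinary list of visited cities into each form. This is a sketch under assumptions: cities are indexed from 0, the matrix is read as directed (row i, column j is 1 when the tour goes from i to j), and the function names are hypothetical.

```python
def path_to_adjacency(tour):
    """adj[i] = j when the tour travels from city i to city j."""
    n = len(tour)
    adj = [None] * n
    for k in range(n):
        adj[tour[k]] = tour[(k + 1) % n]
    return adj

def path_to_matrix(tour):
    """Bit matrix M with m[i][j] = 1 when city i is linked to city j."""
    n = len(tour)
    m = [[0] * n for _ in range(n)]
    for k in range(n):
        m[tour[k]][tour[(k + 1) % n]] = 1
    return m

def matrix_to_path(m, start=0):
    """Recover the visiting order by following the single 1 in each row."""
    tour = [start]
    while len(tour) < len(m):
        tour.append(m[tour[-1]].index(1))
    return tour

# Usage: the tour 0 -> 2 -> 1 -> 3 -> 0.
adj = path_to_adjacency([0, 2, 1, 3])
mat = path_to_matrix([0, 2, 1, 3])
```

Each row and each column of the matrix contains exactly one 1 for a valid tour, which gives a quick validity check for offspring produced by matrix-based operators.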

Population Seeding Techniques
We briefly review several initial population seeding techniques that are used with GAs.

• Random Initialization: Random initial population seeding is the simplest and most common technique for generating the initial population of a GA. It is preferred when prior information about the problem or the optimal solution is scarce. The phrase 'generate an initial population' usually refers to generating the initial population with the random initialization technique. In TSP, the random initialization technique selects the cities of the initial solutions randomly. While generating an individual, it draws a random number between 1 and n. If the current individual already contains the generated number, a new number is drawn; otherwise, the generated number is added to the current individual. The operation is repeated until the desired individual size (n) is reached. Many random initialization methods exist for generating the random numbers, such as uniform random sequences, Sobol sequences, and quasi-random sequences [2,48].
• Nearest Neighbor: The nearest neighbor (NN) technique is one of the most common initial population seeding techniques. NN remains a good alternative to random initialization for generating the initial population when solving TSP with GAs [49][50][51][52][53]. In the NN technique, the generation of each individual starts by selecting a random city as the starting city; the nearest unvisited city is then added and becomes the new current city. Iteratively, NN adds the city nearest to the current city that has not yet been added, until the individual includes all the cities in the problem. Individuals generated by NN seeding can improve the evolutionary search in subsequent generations because each city was appended as the nearest to the current one [52].

• Selective Initialization: Yang [54] presented a selective initial population technique based on the K-nearest neighbor (KNN) subgraph. The KNN subgraph is built from the distance matrix and contains every edge ($c_i$, $c_j$) such that $c_i$ is one of the K nearest cities of $c_j$ or $c_j$ is one of the K nearest cities of $c_i$. The selective initial population technique grants higher priority to the edges of the KNN subgraph. Starting from a city c, the next city is randomly selected from c's KNN list; if all cities in c's KNN list have already been selected, the next city is randomly selected from the unvisited cities.

• Gene Bank: Wei et al. [55] proposed a greedy GA that relies on a gene bank (GB) to generate the initial population of the GA. The GB technique aims to generate a high-quality initial population. The GB is created from the distances between cities by gathering permutations of the N cities. The initial solutions generated from the GB have above-average fitness. When solving a TSP with N cities, the GB is constructed from the c closest cities to each city i, where c is the gene size, less than or equal to n − 1. The first city i of each individual is chosen randomly. Then, the closest unvisited city j is selected from the i-th row, and the closest unvisited city k is selected from the j-th row. If all cities in the j-th row have been visited, the next city is randomly chosen from the list of unvisited cities.

• Sorted Population: Yugay et al. [28] proposed a sorted initial population (SP) technique to modify and improve the GA, based on the principle that better offspring are generated from the best parents. The SP technique generates a large number of initial solutions and sorts them in ascending order of their fitness value (for TSP, shortest distance first). Finally, the initial solutions with poor fitness are eliminated. The probability of finding a good solution in the population is very high when the initial population is very large, so the sorted initial population technique is more likely to yield a favorable initial population.
• K-means Initial Population: Deng et al. [33] introduced a new initial population technique that improves GA performance by using the k-means algorithm for solving TSP. The proposed K-means initial population (KIP) strategy uses k-means clustering to split a large-scale TSP into K small groups, where $K = [\sqrt{N} + 0.5]$ and N is the number of cities. Next, KIP applies a GA to find the local optimal path for each group and a global optimal path that connects the local optimal solutions; see Figure 6. The K-means based technique was compared with two initializations, random and gene bank. The results showed that this initialization technique is more efficient at improving the GA.
• Knowledge Based Initialization: Li et al. [56] proposed a knowledge-based initialization (KI) technique to improve GA performance in solving TSP. The main idea of KI is to generate an initial population without path crossovers; see Figure 7. However, when the number of nodes involved is large, it is too difficult to remove one path crossover without creating another. KI uses a heuristic method based on coordinate transformation and the polar angle, along with learned knowledge, to create the initial population. The main idea is to split the plane into disjoint sectors and to choose cities in order of increasing polar angle so that no path crossover is caused. The knowledge-based technique was compared with four other initializations: random, NN, gene bank, and Vari-begin with Vari-diversity. The results showed that knowledge-based initialization improves the GA more than the other techniques.
• ODV techniques: The best adjacent (BA) number plays an important role in the individual diversity of the population. It assumes that any city $c_i$ in the optimal solution is connected to a city $c_j$, where $c_j$ is one of the BA nearest cities to $c_i$. In addition, Indevlen denotes the number of cities in each individual. In the ODV techniques, the ordered distance matrix (ODM) is created from the value of BA and the distance matrix of the given problem. The techniques for generating the initial population using the ODM are as follows:
- ODV-EV. In the ODV-EV technique, each individual in the population begins with the same city. A random number (BAi) is generated within the BA range before each city is inserted into each individual. In the representation of the population $P_{ODV-EV}$ generated by this technique, θ denotes an individual in $P_{ODV-EV}$, o the total number of individuals in the population, and n the problem size; the first city of every individual remains the same.
- ODV-VE. In the ODV-VE technique, each individual is assigned a random number (BAi) generated within the BA range, and the same BAi is used to place every city in that individual. After that, BA individuals in the population begin with the same initial city, and so on.
- ODV-VV. In the ODV-VV technique, a new random number (BAi) between 1 and BA is generated before inserting any city into any individual. The starting city of each individual is randomly selected. The initial population generated by ODV-VV is efficient and has good individual diversity.
• Insertion Population Seeding: The insertion initial population seeding (In) technique starts with a partial path that contains several randomly selected cities. Then, it iteratively selects the city nearest to any city in the partial path and inserts it at the lowest-cost position in the path [58].

• Solomon Heuristic Seeding: The Solomon heuristic is a modification of the heuristic proposed by Solomon for the vehicle routing problem with time windows [59]. It starts with a partial path that contains two randomly selected cities. Next, for each city not yet in the path, it calculates the insertion cost at every possible position on the path. Finally, it inserts the city into the path at the position with the best (lowest) cost.
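The two baseline techniques that this paper compares against, random initialization and nearest neighbor, can be sketched as follows. This is a minimal illustration, assuming 0-based city indices and Euclidean coordinates; the function names and the optional `start` argument are hypothetical and are not taken from the cited works.

```python
import math
import random

def random_tour(n, rng):
    """Draw city indices at random, rejecting repeats, until all n cities are placed."""
    tour, seen = [], set()
    while len(tour) < n:
        c = rng.randrange(n)
        if c not in seen:          # a city already in the individual is redrawn
            seen.add(c)
            tour.append(c)
    return tour

def nearest_neighbor_tour(coords, rng, start=None):
    """Start at a random (or given) city and repeatedly append the nearest unvisited city."""
    n = len(coords)
    current = rng.randrange(n) if start is None else start
    tour, unvisited = [current], set(range(n)) - {current}
    while unvisited:
        cx, cy = coords[current]
        current = min(unvisited,
                      key=lambda c: math.hypot(coords[c][0] - cx, coords[c][1] - cy))
        unvisited.remove(current)
        tour.append(current)
    return tour

def seed_population(coords, size, rng, heuristic=True):
    """Build `size` individuals with NN (random start each time) or random seeding."""
    if heuristic:
        return [nearest_neighbor_tour(coords, rng) for _ in range(size)]
    return [random_tour(len(coords), rng) for _ in range(size)]

# Usage
rng = random.Random(0)
coords = [(0, 0), (1, 0), (3, 0), (7, 0)]
population = seed_population(coords, size=5, rng=rng)
```

Note that NN seeding can produce at most n distinct individuals (one per starting city) when distance ties are broken deterministically, which is one reason heuristic seeding may limit population diversity.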
Among other techniques, we next mention the relevant ones and point to the review works in this direction. Raja and Bhaskaran [60] proposed a new technique called population reduction (PR). PR divides the initial population into groups and applies tournament selection to select the best-fit candidates, which are then passed to a simple GA. The technique was applied to the 0/1 knapsack problem to study the impact of multiple variables. Compared with the simple GA, the experimental results showed that the new methodology required less time. The results also showed that tournament selection is the best-performing selection technique when used with PR.
Chiroma et al. [61] studied appropriate values of the critical parameters that determine the fitness of the solution. They conducted a survey discussing the impact of increasing or decreasing various parameters when solving problems with GAs. The parameters involved in this study are the population size, mutation rate, and crossover rate. The findings are summarized as follows:

• The larger the population size, the higher the efficiency in searching the space.

• Increasing or decreasing the crossover rate leads to the loss of some solutions, where the crossover rate ranges from 0.1 to 1 and the mutation rate ranges from 0.001 to 1.
Shanmugam et al. [62] conducted a survey of different population seeding techniques applied to TSP. They analyzed the performance of the following population seeding techniques: random, NN, gene bank, selective initialization, and sorted population. They ranked these techniques by error rate (%) and convergence rate (%). The results showed that the NN technique is better than the other investigated techniques: NN generates individuals with high fitness, followed by selective initialization and then gene bank.
Paul et al. [63] conducted another survey covering known population seeding techniques and new ones, namely the ODV-based EV, VE, and VV techniques. They studied, analyzed, and applied these techniques to the traveling salesman problem. In addition to the error rate and convergence rate criteria that measure population seeding performance, new criteria were added: computation time, convergence diversity, and average distinct solutions. The results showed that the ODV population seeding techniques outperformed the other population seeding techniques for GAs. ODV seeding techniques generate individuals with high quality, diversity, and randomness. On the other hand, the NN technique outperformed the other techniques in terms of average convergence and computation time.
To study the influence of heuristic initialization functions in GAs, Osaba et al. [64] applied a GA to a combinatorial optimization problem. Their experiments adopted three heuristic initialization functions: NN, Insertion (In), and the Solomon heuristic. Several GA versions were designed for the comparison. Each version was assigned a value α of 100, 50, 10, or 0 and denoted GA α, where α is the percentage of the population created by heuristic initialization functions. Each GA version therefore used a different initialization phase, and the authors then measured the influence of the heuristic functions on each version. The results showed that the NN technique beat the other techniques in 13 cases out of 15, that is, 86.67% of the cases. Also, the GA50 version obtained the best solution in 80% of instances.
As can be seen from the literature discussed above, there are strong motivations for finding better initial populations to start the GA. They can be summarized as follows:
1. Finding the best initial population is critical to finding an optimal or near-optimal solution.
2. Population diversity is needed to avoid the GA's early convergence problem.
3. Avoiding entrapment in a local optimal solution: (a) Decreasing the GA search time consumed in finding an optimal or near-optimal solution. (b) Decreasing the number of generations needed to obtain the optimal or near-optimal solution.
Previous works indicate that there is still no consensus on which initial population method to use. In this work, we propose a new method based on regression analysis to generate better initial populations for GAs.

Proposed Initialization Technique Based on Linear Regression
Seeding the initial population is the first phase of a GA application. Random generation of the initial population is the most widely used method. The initial population is considered one of the most important GA parameters for improving GA performance in finding the optimal solution. Indeed, GA performance is enhanced by increasing the speed of finding a solution and by improving individual diversity and quality. Here, we develop a new population seeding technique for GAs using regression and successive partitioning of the main problem into sub-problems. Our proposed method is tested on the traveling salesman problem (TSP). The proposed technique relies on linear regression and uses perpendicular lines that cross the regression lines at the center points.
Recall that linear regression is a statistical approach for revealing the relationship between a dependent variable and one or more independent variables. Linear regression models the relationship between two variables, given as two-dimensional sample points, by fitting a linear equation [55]. The first variable is called the explanatory or independent variable, denoted by x, and the second variable is called the scalar dependent variable, denoted by y [28,65]. In linear regression, linear predictor functions are used to model the relationships, with parameters estimated from the data [55,65]. The linear regression equation has the form y = a + bx, where x is the explanatory or independent variable, y is the dependent variable, the constant a is the y-intercept, and b is the slope of the line [66].
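As a minimal sketch of how the coefficients a and b of y = a + bx can be estimated, the following Python snippet uses the ordinary least-squares formulas. The function name `fit_regression_line` and the sample points are illustrative assumptions; they are not taken from the original implementation, which is not described at this level of detail.

```python
from statistics import mean

def fit_regression_line(points):
    """Return (a, b) of the least-squares line y = a + b*x for (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical points lying exactly on y = 1 + 2x (not TSPLIB data).
a, b = fit_regression_line([(0, 1), (1, 3), (2, 5), (3, 7)])
```

Because the fitted line always passes through the point (mean of x, mean of y), that point is a natural choice for the center point at which a perpendicular line can be drawn.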
Our proposed approach is based on computing regression lines, together with the perpendicular lines that cross the regression lines at the center points, to divide a large-scale TSP into small sub-problems. The resulting sub-problems are repeatedly classified into four categories to obtain local optimal solutions. Thus, the main procedures of the proposed method are: (1) Divide the large-scale TSP into four small sub-problems using the regression line and the perpendicular line, classifying the points into four categories. Each category is then divided into four new categories recursively using its own regression line and perpendicular line.
The process carries on until each target category contains a small number of (x,y) points. At most four cities, or (x,y) points, are assigned to each category, and these are considered as the initial population for the TSP sub-problem. The process ends when a local optimal solution has been obtained for each category. (2) Rebuild the initial population seed by reconnecting all local optimal solutions together. Finally, mutate the initial population N times to obtain N solutions, where N is the population size.
In what follows, we use two TSP instances, berlin52 and att532, to illustrate the regression-based technique for initializing the GA; see Figure 8, which shows the cities of these instances as x − y scatter plots. The following steps illustrate the new initialization technique designed to improve the GA for solving the well-known TSP:

• Step 1: Find the regression line equation (y = a + bx) that divides the points into two sections.

• Step 3: Find the perpendicular line equation that intersects the regression line at the center point. Note that if the regression line slope is b, then the perpendicular line slope is −1/b. The perpendicular line equation can then be obtained from this slope and the intersection point with the regression line (the center point). See Figure 9c, which shows these perpendicular lines.

• Step 6: Terminate the recursive computation if the number of points (cities) is less than or equal to four.

• Step 7: Select a random city to be the starting city, and then add the nearest city as the new starting city until all cities in the category are connected in a local path. The group in each category is connected with the nearest group in the other categories until all groups are connected in a global path.

In summary, assuming that a TSP provides location information about its cities, it is possible to repeatedly cluster these cities using the regression line, which divides the cities equally based on their y-coordinates. The perpendicular line passing through the center of the regression line further divides the cities based on their x-coordinates. This gives four sub-TSPs, each of which contains several adjacent or neighboring cities. The algorithm recursively repeats the same process on each of the four sub-problems, going deeper until the size of the sub-problem becomes very small; in this work, we choose four cities or fewer as the stopping criterion of the recursive algorithm. Since the algorithm uses the regression line, the small number of cities that it ends with are more likely to be neighbors and closer to each other than to the other cities. Therefore, connecting them with each other is better than connecting any of them with more distant cities, as these local links are minimized, and minimizing the local links contributes to finding a shorter global route. However, there remains the problem of which city from one sub-problem to connect with which city from another sub-problem. This problem remains unsolved in this work, as we expect that the GA will take care of it by evolving new solutions.

The previous algorithm provides only one solution, which is the seed of the initial population. Since the population needs more than one solution to start the GA, we used this seed solution to derive n solutions using the mutation operator, which is applied n − 1 times to mutate the seed solution, where n is the size of the population. We used the mutation operator here, and not the crossover operator, to increase the diversity of the initial population. When the initial population is completed, the GA takes over to optimize the solutions. Figure 11 compares the initial populations obtained by our proposed regression-based technique with those obtained by the random and nearest neighbor based techniques on berlin52 and att532.
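The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the center point is taken to be the mean of the coordinates, groups are linked by appending the local path whose first city is nearest to the current end of the tour (the text leaves the linking rule open), and each of the n − 1 additional individuals is obtained by applying one exchange mutation to the seed.

```python
import math
import random

def split_four(cities, ids):
    """Split city ids into up to four groups using the regression line and the
    perpendicular line through the centre point (mean x, mean y)."""
    n = len(ids)
    cx = sum(cities[i][0] for i in ids) / n
    cy = sum(cities[i][1] for i in ids) / n
    sxx = sum((cities[i][0] - cx) ** 2 for i in ids)
    sxy = sum((cities[i][0] - cx) * (cities[i][1] - cy) for i in ids)
    # Direction vector of the regression line: (1, b), or (0, 1) when all x are equal.
    ux, uy = (1.0, sxy / sxx) if sxx > 0 else (0.0, 1.0)
    groups = {}
    for i in ids:
        dx, dy = cities[i][0] - cx, cities[i][1] - cy
        above = dy * ux - dx * uy > 0  # which side of the regression line
        ahead = dx * ux + dy * uy > 0  # which side of the perpendicular line
        groups.setdefault((above, ahead), []).append(i)
    return list(groups.values())

def partition(cities, ids, max_size=4):
    """Recursively split until every group holds at most max_size cities."""
    if len(ids) <= max_size:
        return [ids]
    parts = split_four(cities, ids)
    if len(parts) == 1:  # degenerate case (e.g. coincident cities): plain chunks
        return [ids[k:k + max_size] for k in range(0, len(ids), max_size)]
    return [g for part in parts for g in partition(cities, part, max_size)]

def nearest_neighbour_path(cities, group, rng):
    """Inside one group: random start, then always the nearest unvisited city."""
    remaining = list(group)
    path = [remaining.pop(rng.randrange(len(remaining)))]
    while remaining:
        nxt = min(remaining, key=lambda j: math.dist(cities[path[-1]], cities[j]))
        remaining.remove(nxt)
        path.append(nxt)
    return path

def seed_tour(cities, rng, max_size=4):
    """Build one seed tour by chaining the local paths of all groups."""
    paths = [nearest_neighbour_path(cities, g, rng)
             for g in partition(cities, list(range(len(cities))), max_size)]
    tour = paths.pop(0)
    while paths:
        # Assumed linking rule: append the path whose first city is nearest to the tour end.
        nxt = min(paths, key=lambda p: math.dist(cities[tour[-1]], cities[p[0]]))
        paths.remove(nxt)
        tour.extend(nxt)
    return tour

def initial_population(cities, n, seed=0):
    """Seed tour plus n - 1 copies, each changed by one exchange (swap) mutation."""
    rng = random.Random(seed)
    base = seed_tour(cities, rng)
    population = [base]
    for _ in range(n - 1):
        child = base[:]
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
        population.append(child)
    return population

# Illustrative 4 x 4 grid of cities (not TSPLIB data).
cities = [(x, y) for x in range(4) for y in range(4)]
population = initial_population(cities, n=10)
```

On the grid example the regression line is horizontal, so the first split already yields four groups of four cities and the recursion stops immediately. For real instances such as berlin52 the recursion would go several levels deep before every group contains four cities or fewer.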

Experimental Results and Discussion
In addition to the proposed seeding technique, we present experimental results and performance analysis of two other population seeding approaches, namely random initialization, which is used by most GAs [2,48], and the nearest neighbor (NN) approach. The NN approach used here is similar, but not identical, to that of [49][50][51][52][53]: we implemented the general NN approach ourselves and compared against our own implementation rather than against each specific NN method. The general NN approach used for our comparison selects a current city randomly, then links the nearest city to the current city, and carries on until all cities are linked to form the final route. All the experiments were carried out in the same test setup to generate individuals for evaluation using specific performance factors. Table 1 displays the GA parameters chosen to conduct our experiments. The roulette wheel selection strategy was used to assign fitness to each individual. This is a traditional process for assigning the fitness function to each individual (chromosome) in the population. The best solution is measured by a fitness level that links a probability of selection with each chromosome [67].
Our experiments used the one-point modified crossover and exchange mutation approaches. The one-point modified crossover strategy randomly identifies a point in the chromosomes and then switches the genes after this point between the individuals to produce new children [68]. The exchange mutation strategy randomly selects two genes and switches their positions [69]. Each technique was applied 20 times, and the average over all executions was used for the experimental analysis.
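The two operators can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, and the way duplicates are avoided after the cut point (keeping the head of one parent and filling the rest with the missing cities in the order they appear in the other parent) is an assumption about how the modified crossover keeps each child a valid tour; the source does not spell this out.

```python
import random

def one_point_modified_crossover(parent1, parent2, rng):
    """Cut both parents at one random point; each child keeps the head of one
    parent and takes the remaining cities in the order they appear in the other."""
    cut = rng.randrange(1, len(parent1))  # cut point strictly inside the tour

    def make_child(head_parent, order_parent):
        head = head_parent[:cut]
        used = set(head)
        return head + [city for city in order_parent if city not in used]

    return make_child(parent1, parent2), make_child(parent2, parent1)

def exchange_mutation(tour, rng):
    """Pick two positions at random and swap the cities stored there."""
    i, j = rng.sample(range(len(tour)), 2)
    mutated = tour[:]
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

rng = random.Random(42)  # fixed seed so the sketch is repeatable
p1 = [0, 1, 2, 3, 4, 5, 6, 7]
p2 = [7, 5, 3, 1, 6, 4, 2, 0]
child1, child2 = one_point_modified_crossover(p1, p2, rng)
mutant = exchange_mutation(child1, rng)
```

Both operators return new lists and leave their inputs unchanged, so a parent can be reused across several matings within one generation.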

Experimental Setup
All the experiments were carried out in a similar environment for all initialization techniques to make a fair assessment. The results help to assess the performance and efficiency of the regression-based technique for GA population initialization in comparison with the other initialization techniques. All initialization techniques under investigation were applied to instances of the traveling salesman problem. The experiments were implemented using Microsoft Visual Studio 2008 with TSP benchmark datasets obtained from TSPLIB (https://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/). The selected TSP instances were classified into four classes based on their problem size; see Table 2, which displays the classes and the TSP instances that belong to each.

Assessment Criteria
The following performance factors were identified as measurements for investigating the various initialization techniques: error rate, average convergence, and final solution error rate.

• Error Rate is the percentage difference between the fitness value of the solution and the known optimal solution for the problem [52,70,71]. It can be represented as:

$$\text{Error Rate}(\%) = \frac{\text{Fitness} - \text{Optimal fitness}}{\text{Optimal fitness}} \times 100$$

The error rate can be classified into two types based on the fitness values in the population: first, individuals with a high error rate, corresponding to the worst fitness value in the initial population; second, individuals with a low error rate, corresponding to the best fitness value in the initial population [32].

• Average Convergence is the convergence rate of solutions in the initial population [26,32]. It can be defined as:

$$\text{Average Convergence}(\%) = \left(1 - \frac{\text{Average fitness} - \text{Optimal fitness}}{\text{Optimal fitness}}\right) \times 100 \tag{7}$$

where the average fitness is the average of the fitness values in the population, and the optimal fitness is the known optimal value of the same instance.

• Final Solution Error Rate refers to the difference between the known optimal solution and the final solution obtained when applying the GA to TSP instances using one of the initial population techniques. This factor measures the quality of the generated population through the effect of the initial population technique on the GA's performance in obtaining a solution near to the optimal one.

Further, we use the following statistical tools to measure the performance of the different initialization techniques:

• ANOVA: A one-way analysis of variance (ANOVA) is a statistical analysis technique that tests whether the means of two or more groups are significantly different from each other. Specifically, ANOVA tests the null hypothesis:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$$

where µ_i is the mean of group i and k is the number of groups. The one-way ANOVA test is performed with a significance level α, the threshold against which the sig value is compared. H0 is accepted if the sig value is greater than α, which equals 0.05 in this work. Otherwise, H0 is rejected and H1 is accepted, meaning that at least two groups differ from each other. The one-way ANOVA cannot determine which specific groups are statistically different from each other. Therefore, to identify the differing groups, post hoc tests such as Duncan's multiple range test and the least significant difference (LSD) test are used.

• Duncan's multiple range test (DMRT): This is considered one of the most important statistical tests for finding group differences after the null hypothesis has been rejected. Such tests are called post hoc or multiple comparison tests [72]. Duncan's multiple range test compares all pairs of group means. It computes numeric boundaries that allow the difference between any two techniques to be classified [73]. If there is a significant difference between the population means, DMRT has a high probability of declaring the difference. For this reason, Duncan's test has been the most popular test among researchers. DMRT was used in this work to classify the study groups (random, nearest neighbor, and regression) into homogeneous groups. The classified groups and the sig value show whether or not there is a significant difference between groups. Pairs of means resulting from a group comparison study with more than two groups are significantly different from each other at the 5% level of significance (α). However, DMRT provides information about the significant differences between groups without differentiating their means.

• Least significant difference (LSD): This is a post hoc test developed by Ronald Fisher in 1935. In general, LSD is a method used to calculate and compare group means after the ANOVA null hypothesis (H0) of equal means has been rejected using the ANOVA F-test [72]. Rejecting H0 means that at least two means differ from each other; if the ANOVA F-test fails to reject H0, there is no need to apply LSD, as it may incorrectly suggest significant differences between group means. LSD computes the minimum significant difference between two means and declares significant any difference larger than the LSD.
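The assessment criteria and the one-way ANOVA F statistic described above can be sketched in Python as follows. This is a minimal sketch: the function names are illustrative, the tour lengths are hypothetical values chosen only to exercise the formulas (they are not results from the experiments), and the average convergence is computed with the factor of 100 applied to the whole bracket, which is an assumed reading of the formula. The value 7542 is the known optimal tour length of berlin52.

```python
from statistics import mean

def error_rate(fitness, optimal):
    """Error rate (%) of one solution relative to the known optimum."""
    return (fitness - optimal) / optimal * 100.0

def average_convergence(fitnesses, optimal):
    """Average convergence (%) of a population relative to the known optimum."""
    return (1.0 - (mean(fitnesses) - optimal) / optimal) * 100.0

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of groups of observations."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    grand = mean(v for g in groups for v in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

optimal = 7542                               # known optimum of berlin52
hypothetical_lengths = [8296.2, 8300.0, 8310.5]   # illustrative tour lengths only
rates = [error_rate(length, optimal) for length in hypothetical_lengths]
convergence = average_convergence(hypothetical_lengths, optimal)

# Three small illustrative groups (not experimental data).
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

To decide whether H0 is rejected, the F statistic still has to be converted to a sig value using the F distribution with k − 1 and n − k degrees of freedom; a statistics package (for example `scipy.stats.f_oneway`, if SciPy is available) performs both steps.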

Experimental Results and Discussion
In this section, the efficiency of the GA is discussed when the Random, NN, and our proposed regression-based techniques are used for GA population initialization in solving TSP instances under similar experimental environments.
The error rate results show that the regression-based technique for GAs maintains a lower error rate than the other seeding techniques, Random and NN. Also, it is clear that the NN error rate of 9.2% is lower than the Random error rate of 18.6%. This indicates that the individuals generated by our proposed regression-based technique fit the quality measures better than the individuals generated by the NN and Random techniques. This difference is attributed to the mechanism of our new technique, which divides the problem into small sub-problems. Table 3 shows the experimental results of the initial population techniques with respect to the error rate for the best and the worst individuals in the initial population of each technique. Figure 12 illustrates the error rate attained by the different initial population techniques for several classes of problem instances.

The analysis of variance (ANOVA) test determines the significance of the difference in means between two or more independent groups. In this study, one-way ANOVA was used to determine whether or not the different techniques differ significantly. ANOVA, Duncan, and LSD were applied to examine the existence of significant differences between the techniques. The findings show different levels of error rate among the Random, NN, and our Regression techniques. Table 4 shows that the Regression technique has the minimum error rate compared to the Random and NN techniques. Also, the NN mean is much lower than that of the Random method. Table 5 shows the ANOVA test result, which indicates a significant difference in mean values with respect to the error rate of the adopted techniques, Random, NN, and our Regression. With a sig value of 0.001, there is a significant difference between the groups. To find the group differences, multiple comparison tests, namely Duncan and LSD, were carried out. The Duncan results in Table 6 show that two homogeneous groups exist when applying the different population initialization techniques with respect to their error rate. Table 7 shows that there is a significant mean difference between the Random and NN techniques (sig = 0.24) as well as a significant difference between the Random and our Regression techniques (sig = 0.000). There is also a significant difference between NN and our Regression (sig = 0.024). As can be observed from Table 7, our Regression technique has the minimum error rate, with significant differences in pairwise comparisons with Random and NN, because the individuals generated by the regression technique are better than those generated by the other techniques owing to its working mechanism.

The average convergence results show that the regression-based technique for GA population initialization achieved an average convergence rate of 98.9%, greater than that of the other seeding techniques. The regression-based technique performs better particularly on the large problems. Also, the results show that the NN average convergence is greater than that of Random. These results mean that the individuals generated by the regression-based technique are the nearest to the optimal value, with an average convergence close to 98.9%. Table 8 shows the experimental results of the initial population techniques with respect to average convergence (%). The average convergence (%) obtained for the GA using the various initial population techniques is shown in Figure 13. The results in Table 9 show that our Regression technique has the maximum average convergence (Mean = 98.5) compared to the Random and NN techniques (Mean = 85.57). Also, the NN mean is slightly greater than that of the Random method (Mean = 78.95).

The ANOVA test results show whether one or more group means are significantly different with respect to average convergence. The one-way ANOVA test helps to determine whether or not the techniques under investigation differ significantly. ANOVA, Duncan, and LSD were applied to explore the significant differences between the techniques. Table 10 shows the ANOVA test results, which indicate a significant difference in mean values with respect to the average convergence of the adopted techniques, Random, NN, and our Regression. The sig value (0.004) means that there is a significant difference between groups. To find the group differences, multiple comparison tests, Duncan and LSD, were applied. The Duncan results in Table 11 show that two homogeneous groups exist when applying the different population initialization techniques with respect to average convergence. Table 12 shows that there is an insignificant mean difference between the Random and NN techniques (sig = 0.23) as well as a significant difference between the Random and Regression techniques (sig = 0.001). There is also a significant difference between NN and Regression (sig = 0.024). As can be observed from Table 12, the Regression technique has the maximum average convergence rate, with significant differences in pairwise comparisons with Random and NN.

The final solution error rate results show that the regression-based technique for GA population initialization maintains a lower error rate than the other seeding techniques, Random and NN. Also, the results show that the NN error rate is lower than that of the Random technique. This demonstrates that the individuals generated by the regression-based technique are of better quality than those generated by the NN and Random techniques. Also, as we can see in Table 13, the problem size has no significant impact on the performance of the regression-based technique compared to the other seeding techniques, Random and NN. This is attributed to the mechanism of our new technique, which divides the problem into small sub-problems. The results in Table 14 show that the Regression technique has the minimum average difference between the optimal and final solutions (Mean = 0.17) compared to the Random and NN techniques (Mean = 7.33). Also, the NN mean is greater than that of the Random method (Mean = 0.98).

The ANOVA test results show whether one or more group means are significantly different with respect to the final solution differences. The one-way ANOVA test was used to determine whether or not the techniques under investigation differ significantly. ANOVA, Duncan, and LSD were applied to explore the significant differences between the techniques. Table 15 shows the ANOVA test results, which indicate a significant difference in mean values with respect to the final solution of the adopted techniques, Random, NN, and Regression. The sig value (0.002) means that there is a significant difference between groups. To find the group differences, multiple comparison tests, Duncan and LSD, were applied. The Duncan results indicate that two homogeneous groups can be formed among the different population initialization techniques with respect to their final solution differences; see Table 16. Table 17 shows that there is a significant mean difference between the Random and NN techniques (sig = 0.003) as well as a significant difference between the Random and Regression techniques (sig = 0.001), and an insignificant difference between NN and Regression (sig = 0.690). As can be observed from Table 17, the Regression technique has the minimum difference between the optimal and final solutions, with significant differences in pairwise comparisons with Random and NN.
Figure 15 shows an overview of the performance of the Random, NN, and Regression techniques on four TSP instances: kroA100, kroA200, att532, and d2103, respectively. The difference between the initial population generated by Regression and the final solution after 3000 generations is very small, with significant differences in pairwise comparisons with Random and NN. These results mean that the individuals generated by the Regression technique do not need a large number of generations to reach the final solution. Further, we can see that the regression-initialized GA works better on larger problems (d2103 compared to kroA100), implying that the Regression approach speeds up the evolution process without improving the quality of the solution.

Conclusions
In this paper, a new regression-based technique for GA population seeding is proposed, mainly to solve the TSP with a GA. The proposed technique divides a given TSP into smaller sub-problems. This is done using the regression line and its perpendicular line, which allow the cities to be clustered into four sub-problems repeatedly; the location of each city determines which category/cluster the city belongs to. The algorithm works repeatedly until the size of the sub-problem becomes very small, for instance four cities or fewer. These cities are more likely to be neighbors, so connecting them together is likely to create a good solution to start with. This solution is mutated several times to form the initial population.
The proposed technique is implemented, analyzed, and compared with the two most well-known initial population techniques, namely the random and nearest neighbor techniques. The study considered a set of performance criteria to measure the performance of the proposed technique and the other seeding techniques, including convergence diversity, error rate, and average convergence. The experimental results on TSP instances of different sizes showed that the regression-based technique for GA population initialization outperforms both the random and the NN initialization approaches. This demonstrates that the regression-based technique generates the fittest individuals, of good quality, which enables the GA to evolve solutions starting from better-fit individuals. The role of an initialization technique is not to enhance the performance of a GA; rather, it speeds up the convergence of the GA to an optimal or near-optimal solution by providing better solutions to start with. However, the experimental results on TSP show that the quality of the final solution using the proposed initialization technique was better than that of the other two approaches compared, given the same number of iterations. Finally, using this regression-based initialization mechanism to generate pre-selected individuals in the first population may lead to premature convergence, and therefore to a local optimal solution, which requires further and deeper study. The future scope of this work on regression-based techniques for GA population initialization is as follows.

• Performance analysis of the regression-based technique with different GA parameters, such as population size, mutation rate, and number of generations, which may improve GA performance by finding optimal parameter values.

• Analysis of new performance evaluation criteria, including computational time and distinct solutions, which need to be compared across old and new initial population techniques.

• Applying the proposed technique to different NP problems (e.g., the Knapsack and job scheduling problems), as this paper evaluated the proposed technique on TSP only.
Figure 1. Three methods for initializing the population of problem a280. (a) Random initial population, (b) Greedy initial population [32], (c) K-means clustering initial population [33], and (d) the best solution of this problem.

Figure 5. Flow chart of a typical GA.

Figure 6. The processes to initialize the population with k-means clustering.

Figure 7. The strategy to delete crossovers in the knowledge based initialization technique [56].

Figure 8. We use two TSP instances, berlin52 and att532, to illustrate the regression-based technique for population initialization in GAs. The x − y scatter plots of (a) berlin52, and (b) att532.

Figure 9. Examples of finding regression lines and corresponding perpendicular lines in our proposed regression-based technique on the (Left) berlin52 and (Right) att532 TSP instances. (a) The regression lines, (b) center points (x,y), (c) perpendicular lines that intersect the regression lines at the center points.

Figure 10. Dividing the domain into four sub-domains based on the computed regression and perpendicular lines for (Left) berlin52 and (Right) att532, and applying the procedure again. (a) The A, B, C, and D categories are generated from the intersection between the regression lines and the perpendicular lines shown in Figure 9c. We then continue the regression and perpendicular lines on these sub-domains. (b) On B(−,+) for berlin52, and on A(+,+) for att532.

Figure 15. Performance of the three techniques, Random, NN, and our Regression, with respect to the number of generations for (a) kroA100, (b) kroA200, (c) att532, and (d) d2103.
• Selection techniques such as Rank-based selection, Tournament selection, and Roulette Wheel selection.
• Crossover: handled single-point crossover, two-point crossover, and uniform crossover.
• Different Population Size.
• Crossover Rate: The rate of the crossover operator used in each experiment.
• Mutation Rate: Aims to find the best operator for their technique to be used in the experiments.

Table 1. GA configuration parameters for experiments.

Table 2. Different sized TSP problems.

Table 3. Experimental results with respect to the Error Rate (%) for different population seeding techniques.

Table 4. Descriptive analysis of Random, NN and our Regression technique with respect to error rate.

Table 5. ANOVA analysis of Random, NN and our Regression with respect to error rate.

Table 7. Multiple Comparisons (Random, NN and our Regression) significant with respect to the Error Rate. * The mean difference is significant at the 0.05 level.

Table 8. Average convergence (%) rate results for Random, NN and our regression-based techniques.

Table 9. Descriptive analysis of Random, NN and our Regression technique with respect to average convergence.

Table 10. ANOVA analysis of Random, NN and our Regression with respect to average convergence.

Table 12. LSD: Multiple Comparisons (Random, NN and our Regression) significant with respect to average convergence. * The mean difference is significant at the 0.05 level.

Table 13. Final solution error rate (%) results for Random, NN and our Regression techniques.

Table 14. Descriptive analysis of Random, NN and Regression technique with respect to final solution.

Table 15. ANOVA analysis of Random, NN and Regression with respect to final solution.

Table 17. LSD: Multiple Comparisons (Random, NN and Regression) significant with respect to final solution. * The mean difference is significant at the 0.05 level.