A Grid-Based Genetic Approach to Solving the Vehicle Routing Problem with Time Windows

This research allows working collaboratively between di ﬀ erent institutions through a grid computing infrastructure, sharing supercomputing equipment to be able to ﬁnd good solutions in a more e ﬃ cient way to problems intractable that are of real application as the Vehicle Routing Problem


Introduction
The vehicle routing problem (VRP) involves the definition of an optimal route set for several delivery vehicles satisfying the requirements of the customers. In particular, the VRP with time windows (VRPTW) is a VRP variant in which a slot of time is assigned for the delivery to the customer. This problem is known as one NP-hard combinatorial optimization problem, and many algorithms have been proposed in the existing literature to try to solve it, highlighting the use of swarm and evolutionary algorithms such as genetic algorithms and particle swarm optimization methods. In a genetic algorithm (GA), each candidate solution to VRPTW is encoded in one chromosome, and a population of them evolves by applying the selection, crossover, and mutation operators. This evolutionary process is guided by a fitness function, helping to reach a near-optimal solution. Since the processes can lead to improved algorithmic designs for the treatment of these problems. This impact is stronger where computing infrastructure is a constraint in the study of these problems, and the use of several computational clusters to build a grid environment is a viable option in the study of these complex problems.
The rest of this document is structured as follows: Section 2 describes the VRPTW and the application of metaheuristics to solve this problem. A description of the MiniGrid components is included in Section 3, and Section 4 describes the proposed approach for the GA parallelization and its implementation in a grid environment. Section 5 presents the experimental study. Finally, the results and discussion of this approach are summarized in Section 6.

Vehicle Routing Problem with Time Windows
VRP involves the definition of an optimal route set for several delivery vehicles sharing a central depot and satisfying the requirements of several geographically dispersed customers while minimizing the total routes distance or some other criterion [38]. This problem was first studied by Dantzig and Ramser [39] and is considered one NP-hard problem [37].
Several VRP variants have been proposed in the existing literature such as the VRPTW, the capacitated VRP (CVRP), and the multiple depots VRP (MDVRP), among others. In the VRPTW, every customer has to be supplied within a specified time window. In the CVRP, every vehicle has a limited capacity, and in the MDVRP, the vendor uses many depots to supply the customers.

Mathematical Formulation of VRPTW
Based on the formal definition introduced by Toth and Vigo [30], the VRPTW is presented in Equations (1)- (11). In this formulation, the problem is represented through a directed graph composed of a set of nodes containing the customers and the central depot. These nodes are joined by arcs symbolizing the possible route of a vehicle between two nodes. The graph is formally defined as G(V, E), where V = {0, . . . , n} is the set of n nodes, and E is the set of edges joining pairs of nodes in V. Furthermore, K is the set of delivery vehicles, and N = V\{0} is the set of customers, where the node 0 in V is the vehicle depot.
On the other hand, c ij is the travel cost from the i-th node to the j-th node, and x ijk is equal to 1 if the arc joining the i-th node and the j-th is used by k-th vehicle, otherwise it is 0. d i is the demand of i-th customer, and C is the vehicle capacity. Furthermore, w ik specifies the start service time of i-th customer by the k-th vehicle, s i is the service time of i-th customer, t ij is the time travel from the i-th node to the j-th node, and a i and b i are the time window limits of i-th node. E and L represent the earliest possible departure time and the latest possible arrival time at the depot, respectively. ∆ + (i) denotes the set of nodes directly reachable from the i-th node, and ∆(i) denotes the set of nodes from which the i-th node is reachable. min k∈K (i,j)∈A c ij x ijk (1) subject to: k∈K j∈∆ + (i) j∈∆ + (0) x 0 jk = 1 ∀k ∈ K i∈∆ − ( j) x ijk − i∈∆ + ( j) x jik = 0 ∀k ∈ K, j ∈ N (4) i∈∆ − (n+1) x i,n+1,k = 1 ∀k ∈ K Appl. Sci. 2019, 9,3656 5 of 23 x ijk w ik + s i + t ij − w jk ≤ 0 ∀k ∈ K, (i, j) ∈ A (6) a i j∈∆ + (i) x ijk ≤ w ik ≤ b i j∈∆ + (i) x ijk ∀k ∈ K, i ∈ N (7) x ijk ≤ C ∀k ∈ K (9) x ijk ≥ 0 ∀k ∈ K, (i, j) ∈ A (10) x ijk ∈ {0, 1} ∀k ∈ K, (i, j) ∈ A.
The objective function in Equation (1) is the minimization of the total travel cost. Furthermore, Equation (2) indicates that a single vehicle must serve only one customer. The path of each vehicle is represented by Equations (3)-(5) while Equations (6), (8), and (9) guarantee both the time feasibility and the capacity of the vehicles. Equation (7) indicates that one service should start in the time window [a, b]. Finally, Equation (10) are non-negativity constraints and Equation (11) indicate that variables must have binary values.

A Sequential Genetic Algorithm for Solving the VRPTW
GAs are search procedures inspired by evolutionary theories synthesizing the Darwinian evolution through natural selection with the Mendelian genetic inheritance [40]. This type of evolutionary algorithm stands out for its robustness and its ability to carry out one intelligent search in large, complicated, and unpredictable search spaces. These algorithms can be applied to solve diverse problems by encoding the candidate solutions as sequences of genes, as well as by defining an appropriate fitness function. In particular, GA can be applied to solve the VRP by encoding the candidate solutions using a sequence of integer-based values, where each value represents one customer or one delivery vehicle [41].
Several studies have been carried out implementing one GA to solve VRP instances, where various encoding schemes and several genetic operators have been proposed. In particular, the GA-VRPTW method [1] is a sequential algorithm using the best-selection operator and two new recombination operators: crossover-k and mutation-s. These genetic operators are described in Table 1. Each gene in a chromosome represents one customer that must be visited by one vehicle. This gene has several associated values: The Euclidean distance between two customers, as well as the demand and the time window associated with one customer (c ij , d i , and [a i , b i ] in the previously mathematical formulation, respectively). Table 1. Genetic operators in GA-VRPTW [1].

Operator Description
best-selection This is a commonly used operator in literature. It selects the best chromosome from a population.

crossover-k
This operator generates two new offspring by making a random crossover of two chromosomes. Crossover-k operator randomly takes two numbers and carries out the crossover exclusively in the customer that corresponds to both chromosomes. This is done in order to avoid repetition and violation of time and capacity restrictions.

mutation-s
This operator is termed intelligent because it does not randomly make changes, rather it attempts to reduce the total travelled distance by making only changes that satisfy time and capacity constraints. This operator searches for a gene with a greater distance with respect to the previous customer, this is the gene-candidate. It searches for another gene with a shorter distance, this is the gene-mutation. Once the mutation-s operator has identified these genes, the gene-mutation makes the change if time and capacity constraints are not violated. The fitness is reduced due to the movements of genes in an offspring.
The general scheme of the GA-VRPTW implementation is depicted in Figure 1. First, the initial population with only feasible solutions is created using the k-means clustering algorithm [42]. Next, the evolutionary process applies genetic operators to generate only feasible new solutions. Finally, the best candidate solution in the final population is selected as the result of the GA-VRPTW algorithm.

Operator Description bestselection
This is a commonly used operator in literature. It selects the best chromosome from a population.

crossoverk
This operator generates two new offspring by making a random crossover of two chromosomes. Crossover-k operator randomly takes two numbers and carries out the crossover exclusively in the customer that corresponds to both chromosomes. This is done in order to avoid repetition and violation of time and capacity restrictions.

mutation-s
This operator is termed intelligent because it does not randomly make changes, rather it attempts to reduce the total travelled distance by making only changes that satisfy time and capacity constraints. This operator searches for a gene with a greater distance with respect to the previous customer, this is the gene-candidate. It searches for another gene with a shorter distance, this is the gene-mutation. Once the mutation-s operator has identified these genes, the gene-mutation makes the change if time and capacity constraints are not violated. The fitness is reduced due to the movements of genes in an offspring.
The general scheme of the GA-VRPTW implementation is depicted in Figure 1. First, the initial population with only feasible solutions is created using the k-means clustering algorithm [42]. Next, the evolutionary process applies genetic operators to generate only feasible new solutions. Finally, the best candidate solution in the final population is selected as the result of the GA-VRPTW algorithm. In this approach, the mutation-s operator implements an iterated local search to reduce the total travel distance. This operator alters the order of the chromosome genes and ensures that this perturbation improves the fitness value of the mutated chromosome, and accomplishes this with the defined constraints (demand, time window, and distance). The use of the mutation-s operator in one chromosome with 10 genes is shown in Figure 2. In this example, the candidate genes to be mutated have a distance of 11.1 and 5.6, respectively. These genes are interchanged, the associated values are updated, and the new total travel distance is computed. If this is not a valid perturbation, other candidate genes must be selected. This local search procedure causes the computation time of the mutation operator to increase considerably as compared to the time used by the other genetic operators. In this approach, the mutation-s operator implements an iterated local search to reduce the total travel distance. This operator alters the order of the chromosome genes and ensures that this perturbation improves the fitness value of the mutated chromosome, and accomplishes this with the defined constraints (demand, time window, and distance). The use of the mutation-s operator in one chromosome with 10 genes is shown in Figure 2. In this example, the candidate genes to be mutated have a distance of 11.1 and 5.6, respectively. These genes are interchanged, the associated values are updated, and the new total travel distance is computed. If this is not a valid perturbation, other candidate genes must be selected. This local search procedure causes the computation time of the mutation operator to increase considerably as compared to the time used by the other genetic operators.

The Parallelization Approach as a Solution
The search to one optimum solution for the VRPTW is unattainable when the problem size grows since the execution time increases exponentially, given the NP-completeness characteristics of the problem [43]. Although metaheuristics are efficient to find reasonable solutions to complex problems, the single computer resources are not sufficient to reach results for these type of problems in a reasonable time. Parallelization is one of the most effective approaches to improve the performance of one algorithm and to overcome the single machine constraints.
Several parallel-based approaches to solve the VRPTW have been proposed, such as one parallel version of the Branch and Bound algorithm [44], and several metaheuristic-based approaches such as SA [14], TS [45], and PSO [46]. Furthermore, various parallel GAs have been described in the existing literature. Cantú-Paz [47] groups these approaches in four categories: Global master-slave, island, cellular, and hierarchical parallel GA (Table 2). Table 2. Categories of parallel genetic algorithms [47].

Category
Description Global master-slave In this approach, a single population is distributed among several nodes. Genetic operations are applied to the whole population.

Island GA
Several subpopulations evolve separately with the occasional migration of individuals between subpopulations.

Cellular GA
This approach consists of one spatially structured population. Selection and crossover are restricted to a small neighborhood. The neighborhoods are allowed to overlap, permitting some interaction among individuals. Hierarchical parallel GA This category combines an island model with either a master-slave or cellular GA.

MiniGrid Infrastructures
Grid computing is an emerging technology where a set of heterogeneously networked resources distributed in geographically dispersed locations are coordinated to provide transparent,

The Parallelization Approach as a Solution
The search to one optimum solution for the VRPTW is unattainable when the problem size grows since the execution time increases exponentially, given the NP-completeness characteristics of the problem [43]. Although metaheuristics are efficient to find reasonable solutions to complex problems, the single computer resources are not sufficient to reach results for these type of problems in a reasonable time. Parallelization is one of the most effective approaches to improve the performance of one algorithm and to overcome the single machine constraints.
Several parallel-based approaches to solve the VRPTW have been proposed, such as one parallel version of the Branch and Bound algorithm [44], and several metaheuristic-based approaches such as SA [14], TS [45], and PSO [46]. Furthermore, various parallel GAs have been described in the existing literature. Cantú-Paz [47] groups these approaches in four categories: Global master-slave, island, cellular, and hierarchical parallel GA (Table 2). Table 2. Categories of parallel genetic algorithms [47].

Category Description
Global master-slave In this approach, a single population is distributed among several nodes. Genetic operations are applied to the whole population.
Island GA Several subpopulations evolve separately with the occasional migration of individuals between subpopulations.

Cellular GA
This approach consists of one spatially structured population. Selection and crossover are restricted to a small neighborhood. The neighborhoods are allowed to overlap, permitting some interaction among individuals.
Hierarchical parallel GA This category combines an island model with either a master-slave or cellular GA.

MiniGrid Infrastructures
Grid computing is an emerging technology where a set of heterogeneously networked resources distributed in geographically dispersed locations are coordinated to provide transparent, dependable, pervasive, and consistent computing support to diverse types of applications [48]. There are different levels of integration that range from the use of computing resources of two or more organizations to the extensive use of all grid resources [49].
There are several approaches to build and configure grid environments, and there are many efforts to define general schemes for grid resources exploitation. First, grid environments using Globus, Condor, or gLite are considered high-throughput grids or service grids. These grids manage the resource set (machines or clusters) and distribute the processes among these resources. The service grid's goal is to provide an abstraction level for the users so that they can use all the grid's resources while maintaining an adequate security level. Service grids can implement either sequential or parallel applications. On the other hand, the use of a VPN to cluster integration in grid environments avoids the need for specialized packages such as Globus or gLite [50]. A VPN-based grid can be considered a high-performance grid since it can run parallel applications using the resources of several clusters connected in the grid. A VPN creates an encrypted communication channel between groups to ensure secure data exchange.
In this work, the MiniGrid is composed of two clusters with homogeneous machines (Table 3), each belonging to a different organization, geographically distant (14.37 km), as is shown in Figure 3. One cluster is in the Autonomous University of Morelos State (UAEM), in Cuernavaca city, and the other in the Polytechnic University of Morelos State (UPEMOR), in Jiutepec city, both in the Mexican state of Morelos. In this case, OpenVPN is used to integrate the clusters in a grid environment.

Communication between clusters is via a wireless WAN link between institutions (UAEM-UPEMOR) implemented via a point-to-point microwave link with ISM frequency bands with a bandwidth of 30
Mbs [51]. Message passing interface (MPI) is used to exchange data, and to reduce the communication rate between MiniGrid clusters. MPI has been used to develop applications with both Globus-based grids [52] and VPN-based environments [29,53,54].

Grid-Based Genetic Algorithm to Solve the VRPTW
This paper proposes a grid version of the GA-VRPTW algorithm, named GGA, where the mutation-s operator is applied only to a segment of each population generated in the processes created by the MiniGrid, and a two-stage communication scheme is defined to the result exchanging into the MiniGrid clusters. First, a procedure to combine the mutated segments between cluster nodes is applied, and then the mutated segments are exchanged between the MiniGrid clusters.

Performance Analysis of the GA-VRPTW Algorithm
Experimental results of the sequential GA-VRPTW algorithm show that it is a very competitive evolutionary algorithm to solve VRPTW instances, but its mutation operator consumes a lot of computing time, as evidenced by his performance analysis ( Figure 4). This performance analysis is based on the running time consumed by the algorithm to solve two Solomon benchmark problems (C101 and C106), using five generations and 100 individuals. Figure 4 shows that 99.96% of the computation time in the sequential algorithm is used for the mutation-s operator. In contrast, the bestselection operator uses only 0.04% of the computation time.

Grid-Based Genetic Algorithm to Solve the VRPTW
This paper proposes a grid version of the GA-VRPTW algorithm, named GGA, where the mutation-s operator is applied only to a segment of each population generated in the processes created by the MiniGrid, and a two-stage communication scheme is defined to the result exchanging into the MiniGrid clusters. First, a procedure to combine the mutated segments between cluster nodes is applied, and then the mutated segments are exchanged between the MiniGrid clusters.

Performance Analysis of the GA-VRPTW Algorithm
Experimental results of the sequential GA-VRPTW algorithm show that it is a very competitive evolutionary algorithm to solve VRPTW instances, but its mutation operator consumes a lot of computing time, as evidenced by his performance analysis ( Figure 4). This performance analysis is based on the running time consumed by the algorithm to solve two Solomon benchmark problems (C101 and C106), using five generations and 100 individuals. Figure 4 shows that 99.96% of the computation time in the sequential algorithm is used for the mutation-s operator. In contrast, the best-selection operator uses only 0.04% of the computation time.
Experimental results of the sequential GA-VRPTW algorithm show that it is a very competitive evolutionary algorithm to solve VRPTW instances, but its mutation operator consumes a lot of computing time, as evidenced by his performance analysis (Figure 4). This performance analysis is based on the running time consumed by the algorithm to solve two Solomon benchmark problems (C101 and C106), using five generations and 100 individuals. Figure 4 shows that 99.96% of the computation time in the sequential algorithm is used for the mutation-s operator. In contrast, the bestselection operator uses only 0.04% of the computation time.  Based on these results, the selection and crossover operators are applied to the entire population in each process, but the mutation-s operator modifies only one segment of the population in each MiniGrid process.

Grid-based Genetic Algorithm
GGA splits the population of candidate solutions into several segments, which are mutated in each process of the MiniGrid nodes. Figure 5 shows the general scheme of the algorithm. The population division bounds are defined in Equations (12) and (13).
Appl. Sci. 2019, 9, x FOR PEER REVIEW 10 of 23 Based on these results, the selection and crossover operators are applied to the entire population in each process, but the mutation-s operator modifies only one segment of the population in each MiniGrid process.

Grid-based Genetic Algorithm
GGA splits the population of candidate solutions into several segments, which are mutated in each process of the MiniGrid nodes. Figure 5 shows the general scheme of the algorithm. The population division bounds are defined in Equations (12) and (13).
In Equations (12) and (13), L is the population size, Ni is the i-th node, NN is the number of nodes in the MiniGrid, and res represents the remainder of the division between L and NN.
First, one initial population with only feasible candidate solutions is created in each process of the MiniGrid. Next, these populations evolve until an optimal solution is found, or until a specified number of generations is reached. The genetics operators are applied as follows: 1. Each process applies both the best-selection operator and the crossover-k operator in the entire population. 2. The resulting population is split into NN segments, using the Loweri and Upperi values, and the i-th node Ni applies the mutation-s operator in the assigned segment. 3. Each process sends the mutated segment to the other processes of the MiniGrid. All collected chromosomes are used to build a new population.
The distributed mutation step and the segments migration are conducted in each process of the MiniGrid, allowing the increase in the population diversity, as compared to that of the original GA-VRPTW. For example, if the GGA is running in a MiniGrid with 20 process, using a mutation rate of 80%, and each population has 100 candidate solutions, each process applies the selection and crossover operators in the entire population and the mutation operator js applied with 80 individuals. These mutated individuals migrate to the other 19 processes in a balanced way, i.e., four mutated individuals per population.  In Equations (12) and (13), L is the population size, N i is the i-th node, NN is the number of nodes in the MiniGrid, and res represents the remainder of the division between L and NN.
First, one initial population with only feasible candidate solutions is created in each process of the MiniGrid. Next, these populations evolve until an optimal solution is found, or until a specified number of generations is reached. The genetics operators are applied as follows: 1.
Each process applies both the best-selection operator and the crossover-k operator in the entire population.

2.
The resulting population is split into NN segments, using the Lower i and Upper i values, and the i-th node N i applies the mutation-s operator in the assigned segment.

3.
Each process sends the mutated segment to the other processes of the MiniGrid. All collected chromosomes are used to build a new population.
The distributed mutation step and the segments migration are conducted in each process of the MiniGrid, allowing the increase in the population diversity, as compared to that of the original GA-VRPTW. For example, if the GGA is running in a MiniGrid with 20 process, using a mutation rate of 80%, and each population has 100 candidate solutions, each process applies the selection and crossover operators in the entire population and the mutation operator js applied with 80 individuals. These mutated individuals migrate to the other 19 processes in a balanced way, i.e., four mutated individuals per population.

Distribution and Migration of Mutated Segments
To ensure that the mutated segments are shared between the processes created by the MiniGrid, broadcast with MPI_Scatter function is used to distribute the mutated individuals from each process to the rest of the processes. Each process receives the same amount of sent information so the size of the population will be the same in each process of the MiniGrid. For example, in Figure 6, the MPI_Scatter function distributed the individuals of the population in an equal and parallel manner. The individuals that were stored in an array of structures were distributed to created processes. Each row in the array represents an individual, which is a solution, and a sequence of six customers of the VRPTW. In this example, MPI_Scatter distributed two processes, P1 and P2, in a parallel manner. Individual one was distributed to process P1, and individual two to process P2, and so on. This is done in each process that contains a population of individuals. One process is generally assigned to each core of the MiniGrid, but the MiniGrid can be overloaded so that there is more than one process per core. Each process can only read the population that it owns and can only be responsible for distributing information to each of the remaining processes.

Distribution and Migration of Mutated Segments
To ensure that the mutated segments are shared between the processes created by the MiniGrid, broadcast with MPI_Scatter function is used to distribute the mutated individuals from each process to the rest of the processes. Each process receives the same amount of sent information so the size of the population will be the same in each process of the MiniGrid. For example, in Figure 6, the MPI_Scatter function distributed the individuals of the population in an equal and parallel manner. The individuals that were stored in an array of structures were distributed to created processes. Each row in the array represents an individual, which is a solution, and a sequence of six customers of the VRPTW. In this example, MPI_Scatter distributed two processes, P1 and P2, in a parallel manner. Individual one was distributed to process P1, and individual two to process P2, and so on. This is done in each process that contains a population of individuals. One process is generally assigned to each core of the MiniGrid, but the MiniGrid can be overloaded so that there is more than one process per core. Each process can only read the population that it owns and can only be responsible for distributing information to each of the remaining processes.  Figure 7 shows the general scheme of this distribution stage, where S1 is the mutated segment of process 1, s2 is the mutated segment of process 2, and so on. Each process only alters the population assigned to it and, at the end of its mutation stage, must send the mutated segment to the other processes. One process is generally associated with each MiniGrid core, but the environment can be overloaded so that one core can run more than one process.  Figure 7 shows the general scheme of this distribution stage, where S1 is the mutated segment of process 1, s2 is the mutated segment of process 2, and so on. Each process only alters the population assigned to it and, at the end of its mutation stage, must send the mutated segment to the other processes. One process is generally associated with each MiniGrid core, but the environment can be overloaded so that one core can run more than one process. Figure 7 shows the general scheme of this distribution stage, where S1 is the mutated segment of process 1, s2 is the mutated segment of process 2, and so on. Each process only alters the population assigned to it and, at the end of its mutation stage, must send the mutated segment to the other processes. One process is generally associated with each MiniGrid core, but the environment can be overloaded so that one core can run more than one process.    In GGA, the exploitation rate is high since each MiniGrid process carries out a specialized mutation in conjunction with one iterative local search to improve the quality of the candidate solutions. Furthermore, since a different population evolves in each process by means of one independent recombination, the exploration of new candidate solutions is conducted in different areas of the solution space.

Experimental Study
In this section, the experimental study carried out to analyze the GGA performance is detailed. A description of the problems used in this study, as well as the definition of the GGA parameters, is given.

Experimental Setup
The data used to measure the efficiency of GGA in MiniGrid are nine C-type Solomon benchmark problems (C101 to C109) [55], and 10 Gehring and Homberger benchmark problems (C1_10_1 to C1_10_10) [15]. The Solomon benchmark problems have 100 customers, each of whom always places an order and has a maximum of 25 vehicles. Gehring and Homberger benchmark problems use 1000 customers. GGA tests use a population of 100 individuals, the evolutionary process is conducted through 120 generations, the crossover rate is 100%, and the mutation factor is defined using Equations (12) and (13). In this experimental study, GGA is run 30 times, and the average fitness value of these runs is used as the performance value. This value is compared with those obtained to other approaches described in the existing literature. In GGA, the exploitation rate is high since each MiniGrid process carries out a specialized mutation in conjunction with one iterative local search to improve the quality of the candidate solutions. Furthermore, since a different population evolves in each process by means of one independent recombination, the exploration of new candidate solutions is conducted in different areas of the solution space.

Experimental Study
In this section, the experimental study carried out to analyze the GGA performance is detailed. A description of the problems used in this study, as well as the definition of the GGA parameters, is given.

Experimental Setup
The data used to measure the efficiency of GGA in MiniGrid are nine C-type Solomon benchmark problems (C101 to C109) [55], and 10 Gehring and Homberger benchmark problems (C1_10_1 to C1_10_10) [15]. The Solomon benchmark problems have 100 customers, each of whom always places an order and has a maximum of 25 vehicles. Gehring and Homberger benchmark problems use 1000 customers. GGA tests use a population of 100 individuals, the evolutionary process is conducted through 120 generations, the crossover rate is 100%, and the mutation factor is defined using Equations (12) and (13). In this experimental study, GGA is run 30 times, and the average fitness value of these runs is used as the performance value. This value is compared with those obtained to other approaches described in the existing literature.
Furthermore, two non-parametric statistical tests are applied to carried out a statistical analysis of the results produced by the GGA method when comparing them with those obtained by other algorithms. Non-parametric statistical tests are used in this work since it is known that the experimental studies involving evolutionary algorithms do not fulfill the necessary conditions (independency, normality, and homoscedasticity) to apply a parametric test such as ANOVA [56]. In accordance with Derrac et al. [57], two post-hoc tests are carried out: The Wilcoxon test [58] is conducted to compare two algorithms, and the Friedman test [59] and the Bergmann-Hommel post-hoc procedure [60] are used to compare several algorithms.
One non-parametric statistical test evaluates the statistical significance of the average rank of the experimental results through computing the p-value without making any assumptions about the distribution of the analyzed data. This p-value is used to accept or to reject the null hypothesis of the experiment, which holds that the performance of the compared algorithms does not present significant differences. If the p-value does not exceed a predefined significance level (0.05 in this work), the null hypothesis is rejected, indicating that the algorithms are statistically different. Wilcoxon test is designed to compare the two experiments, and two tests must be conducted to compare three or more algorithms: First, the Friedman test is used to detect differences between multiple experiments. Next, if statistical differences exist, a post-hoc test must be conducted to detect the differences between all existing pairs of algorithms. In this work, the Bergmann-Hommel test is used as a post-hoc test since it analyzes the differences of each possible pair of algorithms and whether to reject each of them or not. These statistical tests are applied using the scmamp R library [61].

Methology Applied to Analyze the GGA Performance
Three indicators are evaluated in the experimental study: Latency, speedup, and solution quality. Latency measures the time delay experienced in one distributed environment, and it is analyzed to determine the effect of using the two-stage scheme to migrate information between the nodes of the MiniGrid, and its value is based on the transfer rates between the two clusters in the MiniGrid.
Two processes are used in these experiments: First, 120 processes were used in the MiniGrid (one process for each core), and then an overload of up to 500 uniformly distributed processes was used to evaluate the latency generated by GGA in the MiniGrid. On the other hand, speedup measures the gain achieved by using the parallel program over the sequential version, and its value is computed in Equation (14).
where T 1 is the total running time of the sequential version of the algorithm and T n is the running time of the parallel version using n processes. The speedup was analyzed to determine the effect of increasing the number of nodes for the distributed application of mutation operator. Furthermore, to obtain reliable estimates of the results quality, the relative error (RE) is used to compare the results of this experimental study with those produced by other approaches to solve the VRPTW. RE is a factor measuring the difference between the result generated by one experiment and the best-known result value in the existing literature. RE is defined in Equation (15).
where f (x) is the fitness value result of the current experiment and f (x opt ) is the best-known fitness value reported in the existent literature. GGA results are compared with those achieved by the following approaches described in the existing literature: • GA-VRPTW [1]: Sequential GA-based algorithm with specialized operators to solve VRPTW instances. These operators are used in the GGA. Its experimental study is conducted using a computer with one 1.6 GHz Pentium processor, and the algorithm is implemented in Microsoft Visual C++ 6.0. • KDMSS [11]: This method is identified in the existing literature using the initials of its authors (Kohl, Desrosiers, Madsen, Solomon, and Soumis). Its experimental study is carried out in one HP9000/829 computer with a 100 MHz PA7200 processor. The KDMSS algorithm is implemented in ANSI C. • CPLA (cooperative population learning algorithm) [22]: Author uses a cluster HOLK of 256 computers with one Intel Itanium 2 Dual Core. The proposed algorithm has been implemented in Java and uses a software framework to peer-to-peer applications. • BCO-SIH (a bee colony optimization algorithm with a sequential insertion heuristic) [23]: This algorithm is implemented using Java and performed on one computer with an Intel Core I3 processor. • MOGA (multi-objective genetic algorithm) [24]: The algorithm is implemented in C/C++ and run on a Silicon Graphics machine with 128 CPUs. • ACO-TS (ant colony optimization and Tabu search) [25]: The algorithm has been programmed in C++, and the experimental study is carried out in an Intel Core I3 processor machine. Furthermore, the GGA results with the Gehring and Homberger benchmark problems are compared with the best-known values obtained by other implementations such as Q (Quintiq's optimization technology) [62], CAINIAO (Cainiao network technology) [63], and SCR (Emapa Inc.) [64].

Results and Discussion
In this section, the experimental results and the statistical tests applied to evaluate these results are outlined. Finally, a discussion about the performance of the GGA method is provided.

Latency Results
Two tests have been performed to measure the latency effects in the GGA performance: The latency rate in a time interval is observed in the first experiment, and the relation between the latency and the number of processes used to run the algorithm is analyzed in the second test. A balanced processes assignation is applied in these tests: 50% of the processes is assigned in the UAEM cluster, and the remaining are assigned to the UPEMOR cluster.
The test to analyze the latency rate in a time interval has been performed with several pairs of nodes selected in the two clusters sending packages of 64 bits. Figure 9 shows the average latency obtained throughout five days, from 9:00 to 16:00. A latency reduction is recorded as the day progresses: Higher latencies are reported at 10:00 (between 57 and 58 msec), and a latency value of less than 52 msec is observed at 15:00. This indicates that in the afternoon, the data transfer is more efficient. Since latency affects the GGA efficiency, it is recommended running the algorithm with it in low (in the afternoon).
nodes selected in the two clusters sending packages of 64 bits. Figure 9 shows the average latency obtained throughout five days, from 9:00 to 16:00. A latency reduction is recorded as the day progresses: Higher latencies are reported at 10:00 (between 57 and 58 msec), and a latency value of less than 52 msec is observed at 15:00. This indicates that in the afternoon, the data transfer is more efficient. Since latency affects the GGA efficiency, it is recommended running the algorithm with it in low (in the afternoon).  Figure 10 shows the latency behavior using a different number of processors assigned in the MiniGrid. It is observed that the latency decreases with increasing MiniGrid processes overload. This behavior is due to the fact that the bandwidth existing in the cluster communication is near to 30 Mbs, meaning that when there are a higher number of processes executed by GGA, more data are sent among clusters. A maximum of 15 Mb of data per second can be sent from UAEM cluster to UPEMOR  Figure 10 shows the latency behavior using a different number of processors assigned in the MiniGrid. It is observed that the latency decreases with increasing MiniGrid processes overload. This behavior is due to the fact that the bandwidth existing in the cluster communication is near to 30 Mbs, meaning that when there are a higher number of processes executed by GGA, more data are sent among clusters. A maximum of 15 Mb of data per second can be sent from UAEM cluster to UPEMOR cluster, and vice-versa, without saturating the microwave communication channel. It is crucial to conduct one proper process balancing to ensure the maximum efficiency of GGA. An overload tends to produce one inefficient running behavior on each cluster.

Speedup Results
To analyze the speedup behavior in the MiniGrid, the GGA results using one cluster are analyzed, since the two clusters in the MiniGrid are homogeneous. The Intel compiler is used with the minimum and maximum number of processes, so that there is always a maximum of one process (population) per core. Figure 11 shows the speedup observed in the UAEM cluster. The real speedup is very close to ideal and at some points is better than it, for example with 5 and 20 processes. By using the -O3 option for the Intel compiler, the speedup significantly improves. Starting at 20 processes, there is a super-linear speedup in homogeneous parallel machines. It decreases starting at

Speedup Results
To analyze the speedup behavior in the MiniGrid, the GGA results using one cluster are analyzed, since the two clusters in the MiniGrid are homogeneous. The Intel compiler is used with the minimum and maximum number of processes, so that there is always a maximum of one process (population) per core. Figure 11 shows the speedup observed in the UAEM cluster. The real speedup is very close to ideal and at some points is better than it, for example with 5 and 20 processes. By using the -O3 option for the Intel compiler, the speedup significantly improves. Starting at 20 processes, there is a super-linear speedup in homogeneous parallel machines. It decreases starting at 30 processes but remains a super-linear speedup. Obtaining super-linear speedup for parallel evolutionary algorithms has already been studied in [65] when the algorithm needs a lower number of evaluations in the process communication, which is the case in the proposed GGA.
Appl. Sci. 2019, 9, x FOR PEER REVIEW 16 of 23 Figure 11. Speedup of the GGA in the Autonomous University of Morelos State (UAEM) cluster. Table 4 presents the experimental results obtained by both GA-VRPTW and GGA with nine Solomon benchmark problems. In this table, NV indicates the number of delivery vehicles used in the optimum solution, Optimum is the optimal solution of the problem, Time is the time in seconds used to reach the reported result, and RE is the relative error previously described. The numbers in parentheses refer to the ranking reached by each method for each problem. The best results were obtained using 60 processes and a single cluster of the MiniGrid. It can be observed that for all instances, GGA is closer to optimal than GA-VRPTW, and the results are reached with relatively short times.   Table 4 presents the experimental results obtained by both GA-VRPTW and GGA with nine Solomon benchmark problems. In this table, NV indicates the number of delivery vehicles used in the optimum solution, Optimum is the optimal solution of the problem, Time is the time in seconds used to reach the reported result, and RE is the relative error previously described. The numbers in parentheses refer to the ranking reached by each method for each problem. The best results were obtained using 60 processes and a single cluster of the MiniGrid. It can be observed that for all instances, GGA is closer to optimal than GA-VRPTW, and the results are reached with relatively short times. The Wilcoxon signed ranks test is used in this experiment, and the p-value computed is 0.003843, indicating that GGA has one significant improvement over GA-VRPTW. Table 5 shows the results obtained with the GGA using 10 Gehring and Homberger benchmark problems. In this table, UB is the upper bound best known to the problem (both for the number of vehicles and for the distance traveled), Best and Worse are the best and the worst values found by GGA for each problem, RE is the relative error, and Mean, Median, and Mode are the statistical measures of the results obtained in 30 runs. GGA best results were obtained using 120 processes. The two clusters of the MiniGrid were used, and the MiniGrid had a balanced process distribution. It can be observed that for half the instances, GGA obtained the UB, and in one case (the C1_10_8 problem), the UB was improved (41,954.51 versus 42,660.70), because a better solution is found, the RE is represented in Table 5 with *. The execution times for GGA were an average of 15,000 to 16,000 s. Table 5 shows that the failure rate to achieve one known result was no higher than 0.501. This value indicates that GGA is very competitive. On the other hand, none of the significant size problems had a mode value equal to the best UB. The most significant difference between the mode value and the UB is 51, and the smallest difference is 0.4. If the median value is equal to the optimum, this indicates that at least half of the 30 tests reached the optimal solution. It is observed that none of the median values of the problems is equal to the UB. The problems that reach results close to optimal, in their mode and median values, were the C1_10_2 problem and the C1_10_7 problem. C1_10_1, C1_10_3, and C1_10_6 problems are the most challenging regarding reaching good results concerning these statistical values. Table 6 show a comparison of both NV values and the travel costs obtained by the GGA with those of other similar approaches. Table 6 depicted a comparison of the experimental results with the Solomon benchmark problems. The optimum values of the number of vehicles and travel costs listed in this table are those reached by the KDMSS algorithm. The GGA algorithm is better than PHGA and LC03 algorithms since it has a lower relative error. Concerning the KDMSS algorithm, GGA is competitive since it has a minimal relative error. The smallest relative error (0.0002) is for the C108 problem, and the most significant relative error (0.1310) is with the C104 problem.   Table 7 shows a comparison of the results with the Gehring and Homberger benchmark problems. Only some algorithms described in the existing literature are compared since they mostly report results for some problems, and it is difficult to make a comparison involving all benchmark instances. Due to this difficulty, in this work, a comparison of the relative error between the GGA solution and the best-known solution is described. GGA is better than the LC03 algorithm since it has a less relative error. Concerning the HM4 algorithm, GGA shows better results in most cases, and the HM4 algorithm is better than GGA in two problems only: C1_10_6 (UB = 42479.15) and C1_10_7 (UB = 42711.39), in which a better upper bound is obtained than the best results reported in the literature (43830.21 and 43372.03), because a better solution is found in HM4, the RE is represented in Table 7 with *. The number of vehicles does not improve, the existing literature reports 99 (Q) and 97 (SCR) vehicles, and the HM4 algorithm reports 100 and 99, respectively. When using GGA, the first problem uses 100 vehicles, and the second uses 97, which are equal to those reported in the literature. GGA find a new upper bound to the C1_10_8 problem (41954.51), but the number of vehicles does not improve since the existing literature reports 92 (SCR) and the GGA achieves 94 vehicles. The Friedman test is run with this relative error, and its resulting statistic value is 20.2 for five methods and nine problems, which has a p-value of 0.000456. When evaluating this p-value with a significance level of 5%, the null hypothesis is rejected. Next, the Bergmann-Hommel post-hoc test is applied to find all the possible hypotheses that cannot be rejected. In Table 8 is shown both the average rank (AR) of the results yielded by each method and the p-values computed by comparing the average accuracies achieved by the GGA versus those obtained by the other methods. The p-values highlighted with bold numbers indicate that the null hypothesis is rejected for this pair of methods since they show different performance. Unadjusted p-values are calculated with the average ranks of the two methods being compared, as is described by Demšar [66]. These values are used by the Bergmann-Hommel post-hoc test to compute the corresponding adjusted p-values. Table 8 shows that the GGA has a better performance than the other methods since it has the lowest average rank (1), and its results are statistically different than the others. Figure 12 shows a graph where the nodes represent the compared methods and the edges joining two nodes indicates that the performance of these methods does not present significant differences. The values shown in the edges are the p-values computed by the Bergmann-Hommel post-hoc test. This figure is based on that obtained using the scmamp library.  Finally, the Friedman statistics computed by analyzing the results produced by these three methods with 10 problems is 9.8, and the corresponding p-value is 0.007447, therefore the null hypothesis is rejected. The Bermann-Hommel post-hoc test is then applied to find all possible hypotheses that cannot be refused. Table 9 shows the results of these tests, and Figure 13 shows the graph corresponding to these p-values. --The p-values obtained by the post-hoc test point out that the GGA method is only statistically different from the HM4 algorithm, and the comparison with the LC03 algorithm indicates that they have a similar statistical performance. However, the GGA has the better average ranking than the LC03 method. Finally, the Friedman statistics computed by analyzing the results produced by these three methods with 10 problems is 9.8, and the corresponding p-value is 0.007447, therefore the null hypothesis is rejected. The Bermann-Hommel post-hoc test is then applied to find all possible hypotheses that cannot be refused. Table 9 shows the results of these tests, and Figure 13 shows the graph corresponding to these p-values. --The p-values obtained by the post-hoc test point out that the GGA method is only statistically different from the HM4 algorithm, and the comparison with the LC03 algorithm indicates that they have a similar statistical performance. However, the GGA has the better average ranking than the LC03 method.

Conclusions
The results presented in this paper show that the algorithm efficiency improves by increasing the number of processes in the MiniGrid. It is clear that the use of a grid environment to solve complex problems is one useful alternative. The time required to find near-optimal solutions decreases when the cores used in the grid increases, as well as when the communication rate between cores is efficient. The latency in the MiniGrid tends to decline as the day progresses, and it is important to consider the period of the time in which the experimental study is carried out.
By using the collective communication and overload processes for the transmission of mutated segments between the MiniGrid, we can reduce the latency between the clusters geographically distant and connected by microwave. Furthermore, by applying the mutation to a segment of the population in each process executed in parallel, there is a time reduction in creating a new population. Beginning with a random initial generation of individuals in each process, the combination of mutated segments for each process executed in the MiniGrid produces an increase in the exploitation The p-values obtained by the post-hoc test point out that the GGA method is only statistically different from the HM4 algorithm, and the comparison with the LC03 algorithm indicates that they have a similar statistical performance. However, the GGA has the better average ranking than the LC03 method.

Conclusions
The results presented in this paper show that the algorithm efficiency improves by increasing the number of processes in the MiniGrid. It is clear that the use of a grid environment to solve complex problems is one useful alternative. The time required to find near-optimal solutions decreases when the cores used in the grid increases, as well as when the communication rate between cores is efficient. The latency in the MiniGrid tends to decline as the day progresses, and it is important to consider the period of the time in which the experimental study is carried out.
By using the collective communication and overload processes for the transmission of mutated segments between the MiniGrid, we can reduce the latency between the clusters geographically distant and connected by microwave. Furthermore, by applying the mutation to a segment of the population in each process executed in parallel, there is a time reduction in creating a new population. Beginning with a random initial generation of individuals in each process, the combination of mutated segments for each process executed in the MiniGrid produces an increase in the exploitation of the search space. The application of the migration scheme between populations increases their diversity, producing good GGA efficacy. GGA performs a low number of evaluations in the communication between processes and with this gets a super-linear speedup.