On GPU Implementation of the Island Model Genetic Algorithm for Solving the Unequal Area Facility Layout Problem

: Facility layout problem (FLP) is one of the hottest research areas in industrial engineering. A good facility layout can achieve efﬁcient production management, improve production efﬁciency, and create high economic values. Because FLP is an NP-hard problem, meaning it is impossible to ﬁnd the optimal solution when problem becomes sufﬁciently large, various evolutionary algorithms (EAs) have been proposed to ﬁnd a sub-optimal solution within a reasonable time interval. Recently, a genetic algorithm (GA) was proposed for unequal area FLP (UA-FLP), where the areas of facilities are not identical. More precisely, the GA is an island model based, which is called IMGA. Since EAs are still very time consuming, many efforts have been devoted to how to parallelize various EAs including IMGA. In recent work, Steffen and Dietmar proposed how to parallelize island models of EAs. However, their parallelization approaches are preliminary because they focused mainly on comparing the performances between different parallel architectures. In addition, they used one mathematical function to model the problem. To further investigate on how to parallelize the IMGA by GPU, in this paper we propose multiple parallel algorithms, for each individual step in the IMGA when solving the industrial engineering problem, UA-FLP, and conduct experiments to compare their performances. After integrating better algorithms for all steps into the IMGA, our GPU implementation outperforms the CPU counterpart and the best speedup can be as high as 84.


Introduction
In manufacturing industries, facility layout problems (FLP) are one of the most important issues among the various aspects of manufacturing system management.FLPs have a significant impact upon manufacturing costs, lead times, work processes and productivity [1].The suitable placement of facilities contributes to reduction of the total operating costs by 20% to 50% [2].Based on shapes and dimensions, FLPs can be divided into different categories.Among them, unequal area FLP (UA-FLP) has been attracting attention of many researchers because it has been applied to many fields such as industrial facility design, warehouse organization, school facility layout, and VLSI element placement [3,4].
In most studies, the main goal for UA-FLP is to obtain the minimal operating costs [5].However, UA-FLP has been proven to be an NP-hard combinatorial optimization problem.That is, when the number of facilities is increased, the computational time is exponentially increased [6].As a sequence, during the last decades more and more research has proposed various metaheuristic approaches to solve UA-FLP by finding the approximate optimal solution in finite time [7][8][9][10][11][12].Among them, the most popular and widely used research is genetic algorithm (GA).
GA, proposed by Holland [13] and DeJong [14], is based on Darwin's principles of natural selection and divergence [15].GA is a global search algorithm, which has typical applications in different fields to solve optimization problems [16][17][18][19][20].For instance, recently GA has been proven for its effectiveness in enhancing the opportunity to achieve a global optimal solution without falling into the local optimal solution to UA-FLP [11,12].However, approaches based on standard GAs usually have premature convergence, which means a population for an optimization problem converged too early to get optimal solution.Moreover, all the individuals in a population should be selected, evaluated, crossed, and mutated in each generation, which requires a long CPU execution time [21].
Many models of parallel GAs have been proposed to shorten the computation time in the literature.Two categories of parallel GAs are island model and global parallelization [22].In recent work, Steffen Limmer et al. [22] compared four parallel architectures including multi-core CPUs, clusters, grids, and graphic process units (GPUs) for the execution of the two models of the genetic algorithms.In this study, we pay close attention to how to use GPU to parallelize island model of genetic algorithm (IMGA).
Nowadays, modern GPUs have evolved into very powerful and flexible processors.Especially after CUDA (compute unified device architecture) platform were distributed, developing highly parallel GPU applications becomes much easier.GPUs are very well to address general problems that are suitable for data-parallel computations, including GA [23][24][25].The IMGA, a parallel genetic algorithm model, can fully explore the computing power by either coarse or fine-grained parallelisms.It is worthwhile to study how to parallelize IMGA on GPUs to solve real-world optimization problems.Nevertheless, only few works have discussed this issue before.N. Melab and E. G. Talbi [26] proposed three different general schemes for building efficient island models for GA on GPU.Their experiments indicated that GPU computing can not only speed up the search process, but also exploit parallelism to improve the quality of the obtained solutions.The authors strongly believed that the schemes of the fully distributed island model on GPU could be easily extended to large scale optimization problems.Steffen and Dietmar [22] discussed the island model and global parallelization for evolutionary algorithm in detail.The methods of IMGA on GPU are based on the work reported in the work [26], and a common Weirerstrass function is used to compare the performances between GPU and CPU.However, their parallelization approaches are preliminary because they focused mainly on comparing the performances between different parallel architectures.In addition, they used one mathematical function to model the problem.
In this paper, we further investigate how to parallelize the IMGA by GPU.We propose multiple parallel algorithms, for each individual step in the IMGA when solving UA-FLP, and conduct experiments to compare their performances.To the best of our knowledge, no previous work has considered using GPU parallelization to solve UA-FLP.Through comparing the different parallel schemes at each step in the IMGA, the better ones were integrated as an entire program of IMGA.According to our experimental results of IMGA, our GPU implementation outperforms the CPU counterpart and the best speedup can be as high as 84.
The paper is organized as follows.Section 2 introduces the related work, including introduction to CUDA GPUs, IMGAs, and UA-FLPs.Section 3 presents our multiple parallel strategies for the main steps of IMGA on a GPU.Through theoretical analysis and experimental comparison of these parallel methods for each step, the better ones are integrated into our IMGA.Moreover, the overall performance of the algorithm on the GPU and CPU is compared through experiments.Finally, conclusions are given in Section 4.

Related Work
In industry, data processing is one of the most important technologies-especially when the paradigm of "manufacturing as an ecosystem" has emerged.Modern information and communication technologies such as big data analytics and cloud computing [27,28] can provide useful insight to the industry to increase productivity, quality, and agility benefits, which have significance competitive value.For a long time, the industry has been looking for efficient algorithms to solve UA-FLPs, which are typical NP-hard problems.That is, their solution space is characterized by rapid expansion as the scale of the problem increases.The use of mathematical programming methods to solve large-scale and complex problems is difficult to achieve satisfactory results.People are more inclined to seek heuristic algorithms, for example GA, that find acceptable approximate solutions in a limited time.However, the standard GAs also require a long CPU execution time, so many parallel architectures are proposed.The parallel architectures, such as multi-core CPU, grid, clusters, and GPUs are widely used in many fields [27][28][29].Steffen Limmer et al. [22] mainly compare these four architectures for IMGA to solve a mathematic function.In this work, we emphasize using GPU to solve IMGA for UA-FLP.In this section, the related work is surveyed, including CUDA GPUs, IMGAs, and UA-FLPs.

CUDA GPU
During the last few years, the capability of GPU is growing much faster than that of CPU's because of greatly increasing hardware requirement for modern computer games.The GPU is also rapidly and widely used for various scientific computations in addition to graphic display [30], such as fluid dynamics [31], biophysics [32], molecular dynamics [33], and IoT sensing [34].GPUs can provide huge performance improvement than a single CPU core for many applications.For instance, the work in the CFD field [31] has shown speed-up values of a single GPU against a single core CPU larger than 100.'General-purpose computing on graphics processing unit' (GPGPU) was thus coined [35].
On a GPU-chip, there are multiple cores called streaming-multiprocessors (SMs), and each SM has several streaming processors (SPs).Each SP of an SM executes the same instruction on different data during each cycle.Since 2007, NIVDIA has distributed CUDA as a platform to develop on GPU.Along with the introduction of CUDA platform, a rapid increase of scientists paid more attention to GPU because CUDA makes programmers easier work.A CUDA program can be executed on the host (CPU) and on a device (GPU).Sequential codes run on the CPU, while the parallel codes are offloaded to the GPU, called a 'kernel'.If a kernel is invoked, blocks with the grid are distributed to the SM.The kernel can launch up a large number of threads to exploit data parallelism.The warps containing 32 threads in each of the blocks exhibit single instruction multiple data (SIMD) execution.All threads within a warp must execute the same instructions at any given time.Data-dependent branch causes different threads in the same warp to follow different paths, known as branch divergence [36].Branch divergence has a significant impact on the performance of GPU.CUDA GPU owns many memories-including global memory, shared memory, constant memory, local memory, and registers-to achieve high execution speed in their kernel.Different types of memory are differing in sizes, access scopes, access times, and whether it is read-only or cached.In brief, GPUs are now becoming the most suitable processor for implementing algorithms with simple and large amount of computations [30].

IMGA
IMGA is one of the most promising variant among the parallelized models of GA.The model uses the advantage of parallel computing by dividing population of a large size into smaller partitions, distributing them into different islands to perform simultaneous random searches, and exchanging the individuals through a migration operator.The IMGA model is possible to explore different regions of the search space, so as to improve issues of premature convergence or convergence to local optimal solutions in the standard GA using a single large population.In the latest literature, an IMGA for UA-FLP was proposed by J. M. Palomo-Romero et al. [21].This is the first time to propose a parallel GA based on the island model to solve UA-FLP using the flexible bay structure (FBS) formulation.In their proposed approach, each island is assigned with a part of population for local evolution, and several chromosomes of high quality will migrate from one island to another after a few generations.The experiments were conducted on a total of 26 benchmarks of UA-FLP, and the results were compared with the best results found in the literature.The method can improve search quality, obtain better solutions, provide greater population diversity, and reduce the number of evaluations required to find good solutions.Moreover, the execution times of the algorithm were similar to, or lower than, the execution times of the previous approaches in all cases.Nevertheless, the IMGA method still requires long execution time, which is why the authors pointed an interesting future work in their paper, i.e., parallelizing IMGA.Parallelizing IMGA on GPU is helpful to reduce the execution time in searching for good solutions especially for large-scale problems and further solving the issues of local optimal solutions.In our study, we propose multiple parallel algorithms to implement each step of IMGA and compare the performance ratios between them.The better methods for all steps are integrated into our parallel optimized model for UA-FLP on GPU, which are described in the following section.

UA-FLP
The UA-FLP was originally formulated by Armour and Buffa [37].It considered on the assumptions that there is a given rectangular region, and there are a set of facilities needed to be placed into the region.The fixed dimensions of the rectangular region are W(width) × H(height), and each of the rectangular facilities occupies a specified area (B i ).The constraints of the problem include the sum of the facility areas must be less than or equal to the fixed rectangular region (see Equation ( 1)) and the facilities cannot overlap.The objective function, or the fitness function in our study, is on basis of total material handling cost (MFC) between facilities [38], which is presented in Equation (2). where: p = the number of facilities; f (i, j) = the total flow of moving materials between facility i and j, where i, j = 1, 2, . . ., p; d(i, j) = the weighted rectilinear distance between facilities i and j, where i, j = 1, 2, . . ., p; D inf = the number of infeasible departments; V f eas = the best feasible objective function value yet found; V all = the best overall objective function value yet found; k = the parameter for adjusting the "severity", which was set to k = 3 [39].
The calculation of distance between facilities can be Euclidean (Equation (3)) or rectilinear (Equation (4)), where the point defined by x and y is the center of the facility The FBS in our UA-FLP consists of two parts: the facility sequence codes and the bay division codes.The former encodes the order of p facilities represented by integer numbers which will be placed into the fixed rectangular area.The order of the facilities, named 'facility sequence', is organized bay by bay from top to bottom and from left to right.The latter encodes the bay divisions which have (p − 1) binary elements in the area, each facility is associated with one element except the first one.A value of 1 indicates that the corresponding facility is the last facility in the present bay, while 0 means the facility and its previous one are in the same bay.We take an example to detail the encoding method of FSB.
Assume there are seven facilities in a UA-FLP.The facility layout problem is to minimize the overall cost by determining how many bays are required, which facilities are assigned to which bays, what kind of ordering is adopted for placing the facilities in each bay.For instance, one solution is illustrated in Figure 1.There are four bays in this example: Bay 1, Bay 2, Bay 3, and Bay 4. Each bay contains one or more facilities.For instance, Bay 1 contains two facilities, F and C, while Bay 2 contains only one facility, A. To represent the layout shown in Figure 1 by using FSB encoding, two arrays are required, as shown in Figure 2. The first array, facility sequence, describes the ordering of the facilities from top to bottom and from left to right in the layout.The second array, bay divisions, indicates whether the corresponding facility is the last one in one bay or not.Since the top left facility is F, it is the first one placed into the two arrays, as shown the first step in Figure 2.Moreover, because it is not the last one in the first bay, the corresponding element is set to 0. That is, the first element in the array of bay divisions is set to 0. Next, the second facility, C, is inserted into the arrays because it is in Bay 1 and below Facility F, as shown in Step 2 in Figure 2. Since it is the last one in Bay 1, its value in the second element in the array of bay division is set to 1.At Step 3, we move to Facility A in Bay 2. Because Facility A is also the last one in Bay 2, the third element in Bay Divisions is set to 1. Totally, seven steps are needed for processing seven facilities.Note that the seventh element in the bay divisions array is omitted because Facility E is the last one in the facility sequence.

Our Parallel Strategies
In this section, our parallel strategies for the main steps of IMGA are proposed in detail.Section 3.1 illustrates the overall structure of our parallel IMGA on GPU, and introduces our experimental environment.Sections 3.2-3.5 explain the multiple parallel strategies for initialization, selection, mutation, and migration of IMGA on GPU, respectively, and compare the performance ratio between the multiple parallel versions.Section 3.6 is the comparison of the performance executed on GPU with that on CPU counterpart after integrating the better parallel strategies.

Parallel IMGA on GPU
The sequential IMGA consists of the following processes [21].( 1 7) Individuals between the islands are migrated after g generations.
Figure 3 illustrates the structure of our parallel IMGA on GPU.The main implementation structure is shown in Figure 3a.There are N islands organized as ring topology, which have n/N chromosomes on each of them.The entire algorithm is executed on GPU with one block per island.The total population of individuals is stored in global memory which is the largest in terms of size.For each block, a group of threads execute standard GA in parallel.Figure 3b gives an illustration of one particular island on GPU.Threads on each block are responsible for executing the procedures of IMGA including initialization, evaluation, selection, crossover, mutation, replacement, and migration operators.Regarding the migration, communications are carried out through the global memory which stores global population, resulting in each island being able to communicate with each other according to our ring topology.In our CUDA GPU programming, one-dimensional array of blocks and threads is used in the application of UA-FLP.The program starts on the host, i.e., a CPU core.Parallel computations are conducted on the device-i.e., a GPU-where the kernel functions are used to do the parallel executions.The main kernel functions in our program are generation of facility sequence, generation of facility bay divisions, fitness calculation, selection, crossover, mutation, and migration.In the following subsections, we select the main operations to illustrate our GPU solutions to parallelize IMGA for UA-FLP.In the previous work [22], for each step of IMGA, only one method is used to implement the parallelization.In our implementation, multiple parallelized schemes are proposed to improve the main operations.After comparing these schemes for each step, the methods with better performances are determined.In the following subsections, we explain our multiple implementations for the main steps in detail.
We adopt NVIDIA GeForce GTX 980 [40] to evaluate our GPU version of IMGA.The workstation mainly consists of Intel Xeon CPU E5-2609 v3 with 16GB memory and GXT980 with 4GB memory.Detailed configurations are shown in Table 1.Our GPU and CPU versions are written in C, and the operating system installed is Linux and its version is Ubuntu 14.04 64-bit.We use CUDA version 7.0 to implement the GPU algorithms.In the experiments, for comparison with the performance of the different versions, we adopt performance ratio between the multiple methods for measurement.Our proposed algorithms are tested using 26 well-known problem sets taken from the literature [21] as shown in Table 2.For testing larger scale facilities, we have added 8 problem data sets.Table 3 shows the information about our added problem data sets.In Tables 2 and 3, number of facilities, facility size (width × length), shape constraint (α is the maximum aspect radio constraint, and l min is the minimum side length constraint, and distance measure (Euclidean or rectilinear) are illustrated in detail.Note that the number in the suffix of each data set name conventionally indicates how many the facilities the data set contains.For instance, VC10E-a has 10 facilities and F40 has 40 facilities.Some parameters are needed for our IMGA such as total population size (n), number of islands (N), cycle generation (c), migration rate (m), migration frequency (g), crossover probability (p c ), and mutation probability (p m ).We also tested empirically to determine the better fit parameters when developing our programs.In the following experiments, the settings of important parameters are listed in Table 4, and we have executed five times for each data set to calculate its average execution time and solution quality.
In the following subsections, all the tests are based on 15 islands and 500 individuals per island, and we choose the typical examples of different number of facilities in the benchmark to conduct the experiments.The kernel configuration is shown in Table 5, which indicates the number of blocks (B) and the number of threads per block (T) used for each kernel function.The β is equal to 1 when one thread is used to migrate one chromosome; and the β is equal to the number of genes per chromosome when one thread is used to migrate one gene.

Parallel Initialization on GPU
In data initialization, random numbers are used to generate the initial content of each individual.In our case of UA-FLP, because the FBS is used to encode our facility layout, initialization of facilities consists of two parts: generations of facility sequence and bay divisions, which will be detailed below.

Facility Sequence Generation
To generate facility sequences randomly, the conventional method (INI_FS1) is usually adopted, i.e., if there are p facilities, p random numbers are generated and any two numbers cannot be identical.An example to illustrate INI_FS1 is shown in Figure 4. Assume there are seven facilities, and they are represented by letters, A to G, respectively.Actually, each letter is associated with a code, e.g., 1 to 7 for A to G. Initially, we need a facility sequence to generate different facility sequences from it, which is called the seed facility sequence.The seed facility sequence is set according to alphabetical order, as shown on the left side of Figure 4. To generate a new individual of the facility sequence, Step 1 is to produce one random integer number, 5 in this example.Then at Step 2, the second random integer number, 3, is produced.In order to avoid generating two identical facilities, we have to compare the newly generated facility with all previous ones at every step.Therefore, at Step 2, the second number, 3, is compared with the first number, 5, and we conclude that 3 can be kept.Next, at Step 3 the number 5 is produced, however, it is identical to the first number in the sequence.Therefore, the newly generated facility with number 5 should be abandoned and regeneration of the third number is required.At Step 4, the code number 4 is randomly produced, which can be accepted through comparing it with the preceding numbers.Repetition of random number generation and duplication avoidance like above steps, finally a new facility sequence is generated, as shown on the right-hand side of Figure 4.In this method, a large number of comparisons are required, making the complexity of the algorithm equal to O(p 2 ).Therefore, our improved method (INI_FS2) is proposed below to eliminate comparisons and also to avoid regeneration of random numbers for the same array element.The main idea of INI_FS2 method is to get a new facility sequence by performing a swap operation several times on the seed facility sequence to fully randomizing the facility positions.If there are p facilities, p swaps are performed.How the p swaps are performed is based on an array, called 'swap position', which contains p random numbers and the value of each random number is between 0 and (p − 1).It is not required that any two of the p random numbers are not identical, which is crucial to eliminate comparisons of duplicates.The random numbers stored in the array of swap position will be extracted from the head to the end, one by one, to perform a swap on the seed facility sequence.At Step i, we exchange two positions in the seed facility sequence, one is the position (p − 1 − i), another is the position specified by the random number stored in the (i − 1)-th element in the array of swap position.An example is shown in Figure 5   The kernel Algorithm 1 shows the pseudocode for INI_FS2 method to parallelize the generation of facility sequences.We adopt a coarse-grained parallelization method.That is, every thread will produce a new facility sequence based on the seed facility sequence.It is inefficient to adopt fine grained parallelization method, as explained below.If multiple threads generate random number simultaneously to produce a facility sequence, they need to check whether any two random numbers are identical because any facility sequence is actually a permutation of facility IDs in UA-FLP.It incurs high computational cost when multiple parallel threads need to perform comparison, communication, and even synchronization to eliminate duplicates.
In the Kernel Algorithm 1, firstly, each thread generates random numbers and stores them in the array of ridx based on its thread id and block id.Next, each thread reads one random number from array ridx and uses it to exchange two positions on the corresponding facility sequence in array facsarr.

Bay Divisions Generation
To generate bay divisions, the conventional method and the improved method are both proposed in this subsection.The conventional method (INI_BD1) is firstly to generate a random integer randI for each facility on bay divisions, and then use randI%2 to get an integer of value equal to 0 or 1.Consequently, (p − 1) random integers are required for p facilities.In INI_BD1, p facilities require (p − 1) modulo operations for the representation of bay divisions.If there are n individuals, we need to perform the modulo operation (n * (p − 1)) times in total, thus resulting in a high time complexity even in parallel computing.In order to reduce such calculations in parallel, we propose the other method (INI_BD2) to improve the performance of generating bay divisions.The new method uses bitwise operators instead of modulo operators.Assume the value of an integer generated randomly is x (x ≥ 1), we calculate x AND 1 to generate a value of either 0 or 1 for the first facility in the bay divisions.Then a one-bit right arithmetic shift is adopted to change the value of x, and the new x is used in next iteration.Repeat the above procedure for (p − 1) times, the whole bay divisions are generated.INI_BD2 algorithm is of O(1) time complexity, which is much smaller than O(p) time complexity required by INI_BD1.Moreover, INI_BD2 uses arithmetic shift operations to replace modulo operators in INI_BD1.
x AND 1 = Kernel Algorithm 2 is the pseudocode of the improved method INI_BD2 for facility bay divisions.Each thread is responsible for producing one bay divisions.Firstly, we generate a random integer, and then bitwise operators, instead of modulo operators, are used to obtain a value of either 0 or 1 for every facility except the last one (Lines 2-5).

Parallel Selection on GPU
Selection of GA can be performed in many ways, of which the tournament selection and roulette wheel selection are most famous.Limmer and Fey suggested that selection and recombination operators should be grouped together to investigate their relationship [22].They adopted tournament selection because one-point crossover for the recombination was used in their work [22].The one-point crossover produces two offspring from two parent individuals.In the UA-FLP, facility sequences and bay divisions are both required to perform selection and recombination, respectively.According to [21], two different crossover methods are applied: partially mapped crossover (PMX) is used to recombine facility sequences, and n-point crossover method is for recombination of bay divisions.The above feature is quite different from that requiring only one crossover strategy.Therefore, we developed two versions of selection strategies-the tournament selection, and the roulette wheel selection-to investigate which is better for UA-FLP.
The tournament selection involves running several 'tournaments' among a few individuals who are chosen at random from the population, and the winner of each tournament is selected.In our program, each time several pairs of individuals are randomly chosen from the population for a competition.After a series of selections, the individual with the best fitness from these pairs is selected.For parallel tournament selection, our method (SEL_TN) let one thread execute a series of selections to determine one best individual.The time complexity of parallel tournament selection is equal to O(nlogn).
Roulette wheel selection [22,27], also known as fitness proportionate selection, is for selecting potentially useful solutions by firstly calculating the fitness level for each chromosome and then using fitness levels to associate each individual chromosome with a probability for selection.If f i is the fitness value of individual i and n is the number of individuals in the population, the probability of the individual being selected is . The probabilities of all individuals are accumulated one by one, where the initial and resultant accumulated values are 0 and 1, respectively.During the accumulation, each individual is associated with two numbers: starting and ending numbers.An individual is selected only when a random number, ranging from 0 to 1, is generated and its value is between the associated starting and ending numbers.
The way for parallelizing roulette wheel selection (SEL_RL) is that one thread is responsible for picking up one better individual in an island.Furthermore, in order to take full advantage of threads in every block, as shown in Figure 8, important data are stored in shared memory to increase access speed.Shared memory, accessible by the threads in a same block, is high-speed memory.Kernel Algorithm 3 shows the pseudocode to the SEL_RL, i.e., the roulette wheel selection.In the algorithm, shared memory is used to store the total cost of the island, probabilities for individuals, and the sum proportion for each island.Threads in one block carry out selections of individuals for next generation, respectively.With the fast-shared memory to communicate with each other, the system is improved.The time complexity of parallel roulette wheel selection is equal to O(n).    9 illustrates the performance ratio between the roulette wheel selection SEL_RL and the tournament selection SEL_TN.The performance of SEL_RL has great performance improvement when it combines with the PMX crossover.In other words, roulette wheel is more suitable for UA-FLP due to the crossover strategies adopted, which is different from that in the reference [22].

Parallel Mutation on GPU
Gaussian mutation is applied and it consists mainly of the generation of random numbers in the work [22].In our GPU solution, Gaussian mutation is also used in both mutations of facility sequence and bay divisions according to the mutation probability.To parallelize the mutation operator, according to the previous work, each thread will generate a random number to determine whether the mutation operator is performed.For the individuals who are mutated, the corresponding threads need to do the mutation operator for their own facility sequence and flexible bay divisions in parallel.Two positions are generated randomly to swap each other for a facility sequence in mutation.In this process, two identical positions are likely to be produced, causing a failed mutation.To avoid this problem, two methods can be adopted.One method (MUT_M1) is regeneration when inspecting the repetition, and the other (MUT_M2) is producing a position for mutation randomly and exchanging the position with its adjacent location.Though the time complexities of MUT_M2 and MUT_M1 are both constant time, MUT_M2 need not check and avoid duplication.Kernel Algorithm 4 shows the pseudocode of the mutation operator for MUT_M2.end } Figure 10 shows the performance ratio between MUT_M2 and MUT_M1 for mutation.The performance of MUT_M2 is better than MUT_M1 to most of the benchmarks.However, the performance difference is not significant, although theoretically MUT_M2 is more efficient than MUT_M1 because MUT_M2 needs not to check and avoid duplication.The reason is that the mutation probability is not high and it has a small chance to generate two identical locations even when a mutation is required to be performed.

Parallel Migration on GPU
The migration operator, the featuring operator in IMGA, exchanges individuals among islands, which provides greater population diversity.With a ring topology, individuals from an island can migrate to their neighboring island during one migration interval.The previous work [22] proposed that one thread is responsible for migrating an individual (MIG_I).Figure 11 illustrates how their main concept is implemented on GPU, assuming every island needs to migrate two individuals.Each island will be processed by one thread block.Firstly, two individuals, represented with two chromosomes, are selected randomly for each island.Secondly, an island will use two threads to read two chromosomes in its neighboring island from global memory and then write the two chromosomes into its shared memory.Thirdly, synchronization between thread blocks is needed to ensure that the second step is completed by all blocks before any thread block proceeds to next step.Finally, each island uses the two migrated chromosomes in its shared memory to replace its two original chromosomes in global memory, which are executed by two threads with copy operations.However, because one chromosome consists of many genes, to increase the parallelism, we propose an enhanced method that uses one thread to deal with one gene, instead of one chromosome, to improve the migrated performance (MIG_G).In the case of UA-FLP, p (p ≥ 7) facilities-encoded with a gene for each facility-are combined into an individual that represents one facility sequence.Assume that two random individuals are selected for migration, there will be 2p threads responsible for the migration.If the number of facilities is large and the migration rate is high, then our method can use more threads to process migration concurrently, which can save more time than the previous method.Figure 12 shows the procedures of how to apply one thread to process one gene in the migration step.The pseudocode of MIG_G is shown in Kernel Algorithm 5.According to the experimental result between our proposed method MIG_G and the previous method MIG_I, as shown in Figure 13, the performance ratio of MIG_G is up to more than 1.6 for exploring more parallelisms.If the migration rate is higher, the efficiency of MIG_G is also better.The reason why using each thread to process a gene is better than using the thread to process a chromosome is because we not only take full advantage of the fine-grained parallelism, but also utilize memory coalescing to access global memory in our improved migration algorithm.Accessing data in global memory is critical to the performance of a CUDA application since read and write operations are slow in global memory that is comprised of dynamic random-access memories (DRAMs).A parallel process of modern DRAMs is that for each time when a location is accessed, many consecutive locations including the requested location are accessed.Therefore, only in this way of parallel accessing can DRAMs work close to the advertised peak global memory bandwidth if an application uses data from consecutive locations.Figure 14 shows the memory coalescing of our MIG_G.When all threads in a warp execute a same instruction to read the migrated genes, the hardware detects whether the threads access consecutive memory locations in global memory.Because the genes of a chromosome are stored in an array with consecutive locations in global memory, the hardware coalesces all memory accesses into one consolidated access to these consecutive DRAM locations.While in the preliminary algorithm MIG_I, one thread migrates one chromosome.Although each chromosome is adjacent to each other, a stride exists between them as there are many genes in each chromosome.In a CUDA GPU, the threads in a same warp will access genes at the same place in their responsible chromosomes.For instance, if one thread access to the first gene, the other threads in the same warp will also access to their first gene in their responsible chromosomes simultaneously.However, the first genes are not consecutive.When all the threads in a warp execute a read operation to the chromosomes in global memory, usually more than one memory access transaction is required, resulting in a performance worse than that with memory coalescing, like our proposed algorithm.The complexity of MIG_I is O(p) as each thread must move the genes sequentially, while the complexity of MIG_G is O(1) because one thread moves one gene.To investigate how the number of threads per block and the number of blocks influence the performance of our migration algorithm, we have conducted an experiment and the results are shown in Figure 15.In our method, each island is processed by one thread block and one gene is moved by one thread.In this experiment, we assume five chromosomes are migrated per island.Consequently, the number of blocks is equal to the number of islands, and the number of threads per block is equal to the product of the number of chromosomes and the number of genes per chromosome.Moreover, because the number of genes per chromosome is equal to the number of facilities, we choose the benchmarks based on the numbers of genes per chromosome required in the experiment.The migration workloads are kept the same for all cases.The execution times are normalized to the execution time of the case with 15 blocks and 300 threads.When the number of threads is increased, the execution time is decreased except the case with 90 blocks.If there are 90 blocks, the number of threads per block is down to 50, indicating that only two warps of threads can be used to exploit warp-level parallelism.Nevertheless, the execution time differences between any two of them are insignificant.

Parallel UA-FLP Integrated with Better Strategies
According to the comparison of two alternative parallelization strategies for each main step, the better strategies are chosen and integrated into the complete algorithm of IMGA for solving UA-FLP on GPU, which is our improved GPU version.For comparison, another GPU version is also implemented: called the previous version, which consists of conventional and previously proposed parallelization methods.Figure 16 shows the execution times of our GPU version, the previous GPU version, and the CPU version for the larger data sets.GPU versions outperform the CPU version significantly all the time.Our GPU version always requires the shortest execution time for any data set.We compare the performance improvements executed on the CPU and GPU for all 34 data sets, as shown in Figure 17, which indicates the performance ratios of our improved GPU version over the CPU version and the performance ratios of the previous GPU version over the CPU version.Note that the data sets are listed in ascending order based on the number of facilities.The previous GPU version outperforms the CPU one for all the data sets.The best performance ratio is up to 20.On the other hand, our GPU version provides much better performance for all cases, with the performance ratio over the CPU version up to 84.Comparing the performance ratios provided by two versions of GPU, our GPU version is much better, meaning the better parallelization methods we suggest for IMGA steps really contribute to performance improvement.Moreover, when the number of facilities is increased, the performance of our GPU version is also highly improved.

Conclusions
UA-FLPs are widely used in different fields, including industrial facility design, warehouse organization, school facility layout, and VLSI element placement.However, UA-FLP is an NP-hard problem.With the increase in the problem size, the amount of calculation will become larger and larger, and the execution time is getting longer and longer even though metaheuristic approaches such as GA are applied to solve UA-FLP.In the traditional CPU workstations, when the number of the facilities reaches a certain level, it is difficult to calculate the appropriate solution within a reasonable period of time.IMGA, one of the most promising variants among the parallelized models of GA, can explore different regions of the search space.IMGA can improve the solution for most of the problem data set for the UA-FLP [21].However, the IMGA method for UA-FLP still requires long execution time, so that the authors pointed an interesting future work: parallelizing IMGA.
Using GPU to solve large-scale and complex computing has been widely investigated.Modern GPU, as specialized computer processors, can manipulate large amounts of data efficiently by highly parallel many-core system.Parallel IMGA on GPU is focused in the recent work [22], while their parallelization approaches are preliminary because the authors mainly emphasize comparing performances between different parallel architectures.In addition, they used only one mathematical function to model and implement the IMGA.
In this paper, we further investigated on parallelizing IMGA by GPU.We proposed multiple parallel algorithms for the main steps in the IMGA when solving the UA-FLP, including initialization, selection, mutation, and migration operators.In data initialization, there are two parts: generations of facility sequence and bay divisions.To generate the facility sequence, we proposed an improved method to eliminate a large number of comparisons in conventional method.Our method can also avoid regeneration of random numbers for the same array element.For the above reasons, our approach can reduce the time complexity from the previous O(p 2 ) to O(p).To generate bay divisions, our improved method used bitwise operators, instead of a lot of module operators in conventional methods, resulting in the time complexity is greatly reduced from O(p) to O(1).According to the previous work [22], when choosing a selection operator, we should take into account the relationship between it and the recombination operator.Therefore, we implemented two selection strategies-tournament selection and roulette wheel selection-to investigate which is better for UA-FLP.Through experimental results, roulette wheel selection is more suitable for UA-FLP due to the crossover strategies adopted, which is quite different from that in the reference [22] because two different crossover strategies are used for FBS encoding in UA-FLP.To parallelize the mutation operator, our method needs not to check for duplications, which improved the mutation performance.For migration parallelization, we proposed an enhanced method that uses one thread to deal with one gene, instead of one chromosome, to improve the migrated performance.Our improved migration algorithm is better than the conventional one because we take full advantage of the fine-grained parallelism and utilize memory coalescing to access global memory.
Experiments were conducted to compare performances between the different parallel algorithms for each step.Based on theoretical derivation and experimental results, we integrated the better parallel method in each step into the parallel IMGA for UA-FLP on GPU.The experiments were conducted on a total of 26 well-known UA-FLPs data sets and 8 additional benchmarks for large-scale facilities.According to the experimental results, if conventional parallelization methods for main steps of IMGA are applied to GPU implementation, the performance ratio over the CPU version is more than 20 at best.However, if our suggested parallel methods are used to implement IMGA on GPU, the best performance ratio over the CPU version can be as high as 84.In other words, our GPU version provides much higher performance than conventional GPU version.
In our future work, more effective methods can be proposed to optimize parallel IMGA, such as how to manage memories, improve fitness evaluations, etc.

Figure 1 .
Figure 1.An example of unequal sized facility layout.

Figure 2 .
Figure 2. Steps of FBS encoding for the facility layout shown in Figure 1.
) n individuals as the initial population is randomly generated and distributed into N islands randomly.(2) All newly-generated individuals should be evaluated fitness values by the objective function.The processes of next procedures from (3) to (7) are performed sequentially on each island, which repeat c times to produce the best individuals.(3) The best individuals are selected for recombination to the subpopulation with selection operators.(4) Two individuals are recombined randomly using crossover operators.(5) Mutation operators should be conducted for avoiding local solutions for maintaining genetic diversity.(6) The fitness values of modified individuals are updated.The population for the next generation is generated and takes place of the parents.(

Figure 3 .
Figure 3.The structure of our parallel IMGA on GPU.(a) Description of the main implementation structure.(b) Implementation of one particular island on GPU.

Figure 4 .
Figure 4.An example of the INI_FS1 method.
to illustrate the method of INI_FS2.Firstly, we generate seven random numbers without checking duplicate and store them in the array of swap position, as shown on the left-hand side of Figure 5.At Step 1, because the first random number in the array of swap position is 5, we exchange the two elements on Positions 6 and 5 in the array of facilities.At Step 2, because the second random number in swap position is 3, Positions 3 and 5 in the facilities array are exchanged.Similarly, Positions 2 and 5 are swapped at Step 3.After seven steps, we get a new facility sequence, as shown on the right hand side of Figure 5.The complexity of the INI_FS2 method is O(p) since no comparisons are required.

Figure 5 .
Figure 5.An example of the INI_FS2 method.

Figure 6 .
Figure 6.Performance ratio between INI_FS2 and INI_FS1 for facility sequence generation.

Algorithm 2 :Figure 7
Figure 7  is the performance ratio between INI_BD2 and INI_BD1 for generation of bay divisions.The INI_BD2 uses bitwise operators instead of a lot of mod operators of INI_BD1, yet GPU is not good at such mod calculations due to hardware limitations.So the performance ratio of INI_BD2 is much better than INI_BD1, and with the increase in the number of facilities, the more obvious the performance improvement.

Figure 7 .
Figure 7. Performance ratio between INI_BD2 and INI_BD1 for generation of bay divisions.

Figure
Figure9illustrates the performance ratio between the roulette wheel selection SEL_RL and the tournament selection SEL_TN.The performance of SEL_RL has great performance improvement when it combines with the PMX crossover.In other words, roulette wheel is more suitable for UA-FLP due to the crossover strategies adopted, which is different from that in the reference[22].

Figure 9 .
Figure 9. Performance ratio between SEL_RL and SEL_TN for selection operator.

Figure 11 .
Figure 11.Migration between islands with one thread processing one chromosome (MIG_I).

Figure 12 .
Figure 12.Migration between islands with one thread processing one gene (MIG_G).

Figure 14 .
Figure 14.Memory coalescing represents a data chunk.

Figure 15 .
Figure 15.Comparison of normalized execution time for various combinations of number of blocks and number of threads per block.The number of blocks is equal to the number of islands.The number of threads per block is equal to the product of the number of chromosomes and the number of genes per chromosome.In this comparison, five chromosomes are migrated per island.

Figure 16 .
Figure 16.Execution times of the different methods.

Figure 17 .
Figure 17.Comparisons of performance ratio between two GPU versions and CPU version, respectively.The previous GPU version is implemented with previous parallel methods while our GPU version is implemented with our proposed methods.

Table 1 .
Configuration of workstation.

Table 2 .
The 26 well-known problem data sets.

Table 3 .
Our added data of test problem sets.

Table 4 .
Settings of the parameters in our program.

Table 5 .
Configuration of each kernel function.