Efficient Subpopulation Based Parallel TLBO Optimization Algorithms

A large number of optimization algorithms based on heuristic techniques have been proposed in recent years. Most of them are inspired by natural phenomena and require the correct tuning of parameters that are specific to each algorithm. Heuristic algorithms can often solve problems more quickly than deterministic methods. The computational time required to obtain the optimum (or near-optimum) value of a cost function is a critical aspect of scientific applications in many fields. We therefore propose efficient parallel algorithms for the Teaching-Learning-Based Optimization (TLBO) algorithm, which is efficient and free from algorithm-specific parameters to be tuned. The parallel proposals were designed with two levels of parallelization, one for shared memory platforms and the other for distributed memory platforms, and obtain good parallel performance on both types of architecture as well as on heterogeneous memory parallel platforms.


Introduction
The purpose of optimization algorithms is to find the optimal value of a particular cost function. Depending on the application, cost functions can be highly complex, may need to be optimized repeatedly, and may involve different numbers of parameters (or design variables). Moreover, if a cost function has local minima, the search for the optimum value becomes more complicated.
When deterministic methods are applied to obtain the optimal value of a function, a sequence of points tending to the global optimum is generated from the analytical properties of the problem under consideration. In other words, the search for the optimum is treated as a problem of linear algebra, often based on the gradient of the function. The optimal value of a cost function, or a value very close to it, can be obtained using deterministic methods (see [1]). In some cases, however, the effort involved can be considerable, for example in non-convex or large-scale optimization problems. When deterministic methods can be applied, the results obtained are unequivocal and replicable, but the computational cost can make them impractical. Several heuristic methods have been proposed to address these drawbacks, many of them based on phenomena found in nature, leading to acceptable solutions while reducing the required effort. The major heuristic algorithms fall into two main groups: evolutionary algorithms and swarm intelligence.
On the one hand, metaheuristic methods are able to accelerate convergence even when local minima exist; on the other, they can be applied to functions whose characteristics prevent the use of deterministic methods, for example non-differentiable functions. In most cases, metaheuristic methods employ guided search techniques that involve some random processes, although it cannot be formally proven that the value obtained is the optimal solution to the problem. In particular, the Teaching-Learning-Based Optimization (TLBO) algorithm, presented in [2], has proven its effectiveness in a wide range of applications. For example, in [3], it is used for the optimal coordination of directional overcurrent relays in a looped power system; in [4], a multi-objective TLBO is used to solve the optimal location of automatic voltage regulators in distribution systems in the presence of distributed generators; in [5], an improved multi-objective TLBO is applied to optimize an assembly line that produces large-sized high-volume products such as cars, trucks and engineering machinery; in [6], a load shedding algorithm for alleviating line overloads employs a TLBO algorithm; in [7], a TLBO algorithm is used to optimize the feedback gains and the switching vector of an output feedback sliding mode controller for a multi-area multi-source interconnected power system; in [8], the TLBO method is used to train and accelerate the learning rate of a model designed to forecast both wind power generation in Ireland and that of a single wind farm, in order to demonstrate the effectiveness of the proposed method; in [9], cetane number estimation of biodiesel from its fatty acid methyl ester composition was performed using a hybrid optimization method that includes a TLBO algorithm; in [10], a residential demand side management scheme based on electricity cost and peak-to-average ratio alleviation with maximum user satisfaction is proposed using a hybrid technique based on TLBO and enhanced differential evolution (EDE) algorithms; and in [11], a TLBO algorithm is used in Transmission Expansion Planning (TEP), which involves determining if and how transmission lines should be added to the power grid, considering power generation costs, power loss and line construction costs, among others.
Well-known metaheuristic optimization algorithms based on natural phenomena include Particle Swarm Optimization (PSO) and its variants, Artificial Bee Colony (ABC), Shuffled Frog Leaping (SFL), Ant Colony Optimization (ACO), Evolutionary Strategy (ES), Evolutionary Programming (EP), Genetic Programming (GP), the Fire Fly (FF) algorithm, the Gravitational Search Algorithm (GSA), Biogeography-Based Optimization (BBO), the Grenade Explosion Method (GEM), Genetic Algorithms (GA) and their variants, Differential Evolution (DE) and its variants, the Simulated Annealing (SA) algorithm and the Tabu Search (TS) algorithm.
In most of these algorithms, one or more parameters must be adjusted first. For example, GA needs the crossover probability, mutation probability, selection operator, etc. to be set correctly; the SA algorithm needs the initial annealing temperature and cooling schedule to be tuned; PSO's specific parameters are the inertia weight and the social and cognitive parameters; the Harmony Search Algorithm (HSA) needs the harmony memory consideration rate, the number of improvisations, etc. to be set correctly; and the immigration rate, emigration rate, etc. need to be tuned for BBO. The population-based heuristic algorithm used in this work, Teaching-Learning-Based Optimization (TLBO) [12], overcomes the problem of tuning algorithm-specific parameters. Specifically, the TLBO algorithm only needs general parameters to be set, such as the number of iterations, the population size and the stopping criterion.
Some recent works have applied parallelization techniques to the TLBO algorithm. For example, the authors in [13] implemented a TLBO algorithm on a multicore processor within an OpenMP environment. The OpenMP strategy emulated the sequential TLBO algorithm exactly, so the calculation of fitness, calculation of the mean, calculation of the best, and comparison of fitness functions remained the same, while small changes were introduced to achieve better results. A set of 10 test functions was evaluated by running the algorithm on a single-core architecture, and the results were then compared with those on architectures ranging from 2 to 32 cores. Average speed-up values of 4.9x and 6.4x were obtained with 16 and 32 processors respectively, corresponding to efficiencies of about 30% and 20%. In [14], the authors propose a parallel TLBO procedure for automatic heliostat aiming, obtaining good speed-up values for this extremely expensive problem using up to 32 processes; parallel performance, however, worsened for functions that were not so computationally expensive.
Parallel versions of other heuristic optimization algorithms have also been proposed. For example, the authors in [15] implemented the Dual Population Genetic Algorithm (DPGA) on a parallel architecture, obtaining average speed-up values of 1.64x using both 16 and 32 processors. The authors in [16] propose a parallel version of the ACO metaheuristic algorithm, obtaining a maximum speed-up of 5.94x using 8 processors, which drops to 5.45x when using 16 processors. In addition, other proposals use hardware accelerators. For example, in [17], the PSO algorithm is accelerated using FPGAs, and in [18], the Jaya algorithm is accelerated through the use of GPUs.
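The relation between the speed-up and efficiency figures quoted above is simply efficiency = speed-up / number of processors. The following Python sketch recomputes it for the values reported for [13], [15] and [16]; the function name and the labels are illustrative assumptions, and only the numeric pairs come from the text.

```python
def parallel_efficiency(speed_up: float, processors: int) -> float:
    """Parallel efficiency: achieved speed-up divided by the processor count."""
    if processors <= 0:
        raise ValueError("processors must be a positive integer")
    return speed_up / processors


# (label, speed-up, processors) as quoted in the introduction.
reported = [
    ("TLBO with OpenMP [13]", 4.9, 16),
    ("TLBO with OpenMP [13]", 6.4, 32),
    ("DPGA [15]", 1.64, 16),
    ("DPGA [15]", 1.64, 32),
    ("ACO [16]", 5.94, 8),
    ("ACO [16]", 5.45, 16),
]

for label, speed_up, processors in reported:
    eff = parallel_efficiency(speed_up, processors)
    print(f"{label}: {speed_up}x on {processors} processors -> {eff:.1%}")
```

A constant speed-up on twice as many processors, as reported for DPGA, halves the efficiency, which is why the text treats it as a scalability problem.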
In Section 2, we present the TLBO optimization algorithm and describe the parallel algorithms in Section 3. In Section 4, we analyse the latter in terms of parallel performance and optimization behaviour, and some conclusions are drawn in Section 5.

The TLBO Algorithm
The Teaching-Learning-Based Optimization (TLBO) algorithm, like all evolutionary and swarm intelligence-based algorithms, requires common controlling parameters, but does not require algorithm-specific control parameters. Like those algorithms, TLBO is population-based and probabilistic, so only the population size and the number of generations need to be set.
The TLBO algorithm is based on the common teaching and learning processes of a group of students, whose learning is influenced both by the teacher and by interactions within the group. Each source of knowledge advancement (which brings the population closer to the solution of the problem) is associated with a different phase of the TLBO algorithm: the first is the teacher phase and the second is the learner phase.
As mentioned previously, TLBO is a population-based heuristic algorithm, so the first step is the creation of the initial population (line 1 of Algorithm 1). A population is a set of $m$ individuals; each individual is composed of $k$ variables (design variables), and the value of $k$ depends on the cost function ($F_{cost}$) to be optimized. Each individual in the initial population is created as shown in Equation (1), where $r_{i,j}$ are uniformly distributed random numbers, and $minVar_j$ and $maxVar_j$ specify the domain of each variable.
$$X_{i,j} = minVar_j + (maxVar_j - minVar_j) \cdot r_{i,j} \qquad (1)$$
Once the population is created, the teacher phase begins by identifying the individual that will act as teacher (line 6 of Algorithm 1). The teacher is the individual possessing the greatest amount of knowledge, i.e., the individual whose solution is the best among all individuals in the population. In the teacher phase, the teacher tries to improve the students' knowledge. To model this interaction, the mean of each design variable ($M_j$) is calculated over all individuals in the population, and the interaction is performed using these mean values, the teacher ($X_{teacher}$), the teaching factor ($TF$) and a random factor ($r_j$). The teaching factor is an integer equal to 1 or 2 that is randomly chosen for each teacher phase, so it is not a parameter to be tuned. The random factors $r_j$ are $k$ floating-point random numbers uniformly distributed in the range $[0, 1]$.
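The population creation of Equation (1) and one teacher phase can be sketched in a few lines of Python. This is a minimal sequential illustration, not the authors' C implementation: the `sphere` cost function, the function names and the use of `random.Random` are assumptions, and the update rule and greedy replacement follow lines 12 to 16 of Algorithm 1 for a minimization problem.

```python
import random


def sphere(x):
    """Illustrative cost function (an assumption): sum of squares."""
    return sum(v * v for v in x)


def create_population(m, min_var, max_var, rng):
    """Equation (1): X[i][j] = minVar[j] + (maxVar[j] - minVar[j]) * r[i][j]."""
    k = len(min_var)
    return [
        [min_var[j] + (max_var[j] - min_var[j]) * rng.random() for j in range(k)]
        for _ in range(m)
    ]


def teacher_phase(pop, cost, rng):
    """One teacher phase; returns a new population, leaving `pop` unchanged."""
    m, k = len(pop), len(pop[0])
    teacher = min(pop, key=cost)                      # best individual
    mean = [sum(ind[j] for ind in pop) / m for j in range(k)]
    tf = rng.randint(1, 2)                            # teaching factor: 1 or 2
    r = [rng.random() for _ in range(k)]              # one random factor per variable
    new_pop = []
    for ind in pop:
        cand = [ind[j] + r[j] * (teacher[j] - tf * mean[j]) for j in range(k)]
        # Keep the candidate only if it improves the student (minimization).
        new_pop.append(cand if cost(cand) < cost(ind) else ind)
    return new_pop


rng = random.Random(0)
population = create_population(20, [-5.0] * 3, [5.0] * 3, rng)
after_teacher = teacher_phase(population, sphere, rng)
print(min(map(sphere, population)), min(map(sphere, after_teacher)))
```

Because a candidate replaces a student only when its cost is lower, no individual gets worse during the phase. The sketch does not clamp candidates to the variable bounds, since the text does not describe how bounds are handled.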
Each individual is influenced by the teacher (line 12 of Algorithm 1). If the influence is positive, i.e., if it improves the student, the new student replaces the previous student in the population. Note that a minimization problem is considered in line 14 of Algorithm 1. The resulting population at the end of the teacher phase is the initial population used in the learner phase ($Y_{i,j}$ in Algorithm 2).

Algorithm 1 (continued)
12:       $X'_{i,j} = X_{i,j} + r_j (X_{teacher,j} - TF \times M_j)$
13:     end for
14:     if $F_{cost}(X'_i) < F_{cost}(X_i)$ then
15:       Replace $X_i$ by $X'_i$
16:     end if
17:   end for
18: }

Algorithm 2 Learner phase of TLBO algorithm.
1: Initial population in the learner phase: $Y_{i,j}$
2: $i$ identifies the individual, $i = 1 \ldots m$
3: $j$ identifies the design variable, $j = 1 \ldots k$
4: Learner phase:
5: {
6:   for i = 1 to m do
7:     Randomly identify another student with whom to interact ($p$)
       ...
11:      end for
12:    else
13:      for j = 1 to k do
       ...
20:      $Z_i = Y_i$
21:    end if
22:  end for
23: Output population in learner phase: $Z_{i,j}$
24: $i$ identifies the individual, $i = 1 \ldots m$
25: $j$ identifies the design variable, $j = 1 \ldots k$
26: }

In the second stage, the learner phase, the students' knowledge can improve through the influence of the students themselves, i.e., through the interaction between them. In the learner phase, shown in Algorithm 2, each student (or individual) interacts with another student, who is randomly chosen. Note that the initial population ($Y_{i,j}$) is the resulting population at the end of the teacher phase.
Once both students are identified, the interaction between them depends on which is the more learned student, i.e., it depends on the evaluation of the cost function for the two interacting students (lines 8-16 of Algorithm 2). The result of this interaction is an individual who is evaluated and compared with the initial individual, so that the better of the two is transferred to the population resulting from the learner phase ($Z_{i,j}$). Note that, in the learner phase algorithm, a minimization problem is considered in line 17 of Algorithm 2.
The teacher and learner phases are repeated until the stop criterion is met. The number of repetitions (determined by the "Iterations" parameter) specifies the number of generations to be created. Significantly, the resulting population of the learner phase ($Z_{i,j}$) is the initial population for the teacher phase in the next iteration. All random numbers used in Algorithms 1 and 2 ($r_j$ and $r_{i,j}$) are uniformly distributed random numbers in the range $[0, 1]$.
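A minimal Python sketch of the learner phase follows. The exact update formulas of Algorithm 2 are not reproduced in the text, so the sketch assumes the commonly used TLBO learner rule: a student moves away from a worse partner and towards a better one. The function names, the `sphere` cost function and the choice to draw new random factors for each student are likewise assumptions.

```python
import random


def sphere(x):
    """Illustrative cost function (an assumption): sum of squares."""
    return sum(v * v for v in x)


def learner_phase(pop, cost, rng):
    """One learner phase for a minimization problem; needs at least 2 students."""
    m, k = len(pop), len(pop[0])
    out = []
    for i, ind in enumerate(pop):
        # Pick a partner p != i uniformly at random.
        p = rng.randrange(m - 1)
        if p >= i:
            p += 1
        partner = pop[p]
        r = [rng.random() for _ in range(k)]
        if cost(ind) < cost(partner):
            # Student is better: move away from the partner (assumed rule).
            cand = [ind[j] + r[j] * (ind[j] - partner[j]) for j in range(k)]
        else:
            # Partner is better: move towards the partner (assumed rule).
            cand = [ind[j] + r[j] * (partner[j] - ind[j]) for j in range(k)]
        # The better of the candidate and the original goes to the output.
        out.append(cand if cost(cand) < cost(ind) else ind)
    return out


rng = random.Random(0)
Y = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(20)]
Z = learner_phase(Y, sphere, rng)
print(min(map(sphere, Y)), min(map(sphere, Z)))
```

As in the teacher phase, the greedy comparison guarantees that each entry of the output population is at least as good as the corresponding input individual.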

Parallel Approaches
We propose hybrid OpenMP/MPI parallel algorithms to exploit heterogeneous memory platforms. The whole sequential TLBO algorithm is shown in Algorithm 3. The "Runs" parameter corresponds to the number of independent executions performed. Therefore, in line 21 of Algorithm 3, "Runs" different solutions should be evaluated. In each independent execution, both the teacher and learner phases are repeated "Iterations" times. The parallel approach to exploit distributed memory platforms is applied to the independent executions (line 5 of Algorithm 3), while the parallel approaches to exploit shared memory platforms are applied using subpopulations in the teacher and learner phases as well as in the duplicate removal phase. The elimination of duplicates is necessary to avoid premature convergence.

Algorithm 3 (fragment)
9: (Input: Population $Z_q$)
10: (Output: Population $Y_q$)
11: Learner phase:
12: (Input: Population $Y_q$)
13: (...

We developed two parallel proposals to exploit shared memory platforms. Both proposals distribute the workload associated with the teacher and learner phases by considering subpopulations. The size of the whole population is equal to $m$; if the number of parallel threads (or processes) is $nt$, we consider $nt$ subpopulations of size $m_{nt}$, where $\sum m_{nt} = m$. In the first proposal, called SPG_ParTLBO, the whole population is partitioned into subpopulations (SP) that are stored in global (G) memory. In the second proposal, called SPP_ParTLBO, the whole population is also partitioned into subpopulations (SP), but they are stored in private (P) memory.
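The partition of a population of size $m$ into $nt$ subpopulations whose sizes add up to $m$ can be sketched as below. The text does not state how a remainder is distributed when $m$ is not a multiple of $nt$; giving one extra individual to the first subpopulations is an assumption of this sketch, as are the function names.

```python
def subpopulation_sizes(m, nt):
    """Sizes of nt contiguous subpopulations that add up to m."""
    if nt <= 0 or m < nt:
        raise ValueError("need at least one individual per subpopulation")
    base, extra = divmod(m, nt)
    # The first `extra` subpopulations receive one additional individual.
    return [base + (1 if s < extra else 0) for s in range(nt)]


def split_population(pop, nt):
    """Split a list of individuals into nt contiguous subpopulations."""
    subs, start = [], 0
    for size in subpopulation_sizes(len(pop), nt):
        subs.append(pop[start:start + size])
        start += size
    return subs


print(subpopulation_sizes(120, 8))   # population size used in the experiments
print(subpopulation_sizes(10, 4))    # a case with a remainder
```

In the SPG variant these slices would be views of a single shared array, whereas in the SPP variant each thread would own a private copy of its slice.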
Algorithm 4 shows the parallel teacher phase for the SPG_ParTLBO algorithm. In line 5, all threads compute their initial subpopulation and store it in global memory; in line 12, the best individual of each subpopulation is identified, and the teacher (the global best individual) is identified sequentially in line 14. Following a similar strategy, the means of the design variables of each subpopulation are calculated in line 16, and in line 18 the global means are obtained sequentially from these partial values. Finally, the influence of the teacher is applied to each individual in parallel, and those who have improved their knowledge are introduced into the population (line 27). The parallel teacher phase shown in Algorithm 4 does not modify the optimization procedure of the sequential algorithm shown in Algorithm 1.

Algorithm 4 (continued)
24:       for j = 1 to k do
          ...
29:       end if
30:     end for
31:   end for
32: }

Algorithm 5 shows the parallel learner phase for the SPG_ParTLBO algorithm. Each thread, for each student in its subpopulation, randomly chooses another student with whom to interact; this student can be located in any subpopulation, since the whole population is stored in global memory (line 5). The rest of the code (lines 6-20) remains unchanged with respect to the sequential algorithm shown in Algorithm 2.
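The two-level reduction used in the parallel teacher phase of Algorithm 4 (per-subpopulation results, then a sequential combination) can be illustrated with a short sequential Python sketch. The loops over subpopulations stand in for the OpenMP parallel regions; the function names and the `sphere` cost function are assumptions. The sketch combines partial sums rather than partial means so that subpopulations of unequal size are weighted correctly, which is an implementation choice of the sketch and not something the text specifies.

```python
def sphere(x):
    """Illustrative cost function (an assumption): sum of squares."""
    return sum(v * v for v in x)


def partial_best(sub, cost):
    """Best individual of one subpopulation (minimization)."""
    return min(sub, key=cost)


def partial_sums(sub):
    """Per-variable sums over one subpopulation."""
    k = len(sub[0])
    return [sum(ind[j] for ind in sub) for j in range(k)]


def global_teacher_and_mean(subpops, cost):
    # Stage 1: independent per-subpopulation work (parallel in Algorithm 4).
    bests = [partial_best(sub, cost) for sub in subpops]
    sums = [partial_sums(sub) for sub in subpops]
    # Stage 2: sequential combination of the partial results.
    teacher = min(bests, key=cost)
    m = sum(len(sub) for sub in subpops)
    k = len(sums[0])
    mean = [sum(ps[j] for ps in sums) / m for j in range(k)]
    return teacher, mean


subpops = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.5]]]
teacher, mean = global_teacher_and_mean(subpops, sphere)
print(teacher, mean)
```

The sequential stage touches only $nt$ candidates and $nt \times k$ partial values, so its cost is small compared with the per-individual work done in the first stage.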
Algorithm 5 (continued)
9:         end for
10:      else
         ...
13:        end for
14:      end if
         ...
20:    end for
21:  end for
22: }

The duplicate removal phase for the SPG_ParTLBO algorithm, shown in Algorithm 6, performs the same procedure as the sequential algorithm, but in parallel. Note that when a duplicate is found, a random design variable is chosen to be modified.
To increase parallel efficiency, we developed the second proposal, called SPP_ParTLBO, in which subpopulations are stored in the private memory of each thread. However, the subpopulations are not isolated structures. Algorithm 7 shows the parallel teacher phase for the SPP_ParTLBO algorithm. As can be seen, after the best individual (i.e., the teacher) has been identified, the thread that stores it in its subpopulation copies it into global memory, so that all threads use the same teacher (lines 11-17).
In the SPP_ParTLBO algorithm, the duplicate removal phase, shown in Algorithm 9, differs from the sequential procedure in that the search is restricted to the subpopulation, which is stored in private memory.
To use heterogeneous memory platforms (clusters), we need a hybrid memory model algorithm. As explained in Section 2, and as can be seen in Algorithm 3, the TLBO algorithm performs several fully independent executions ("Runs"). Therefore, we developed a higher-level parallel algorithm for distributed memory platforms, in which load balance is a key aspect. This high-level parallel algorithm has to include load balancing mechanisms and be able to incorporate the previously described parallel algorithms developed for shared memory platforms.
The high-level parallel TLBO algorithm exploits the fact that all iterations in line 5 of Algorithm 3 are actually independent executions. Therefore, the total number of executions ("Runs") to be performed is divided among the $np$ available processes, taking into account that it cannot be distributed statically. The high-level parallel algorithm must be designed for distributed memory platforms using MPI. On the one hand, a load balancing procedure must be developed, and on the other, a final data gathering process (data collection from all processes) must be performed.
The hybrid MPI/OpenMP algorithm developed is shown in Algorithm 10. In this algorithm, if the number of desired worker processes is equal to $np$, the total number of distributed memory processes will be $np + 1$. This is because one critical (distributed memory) process is in charge of distributing the computing work among the $np$ available worker processes. We call this process the work dispatcher. Although the work dispatcher is critical, it runs on one of the nodes that host worker processes, because it introduces no significant overhead in the overall parallel performance. The work dispatcher waits to receive a work request signal from an idle worker process. When a particular worker process requests new work (an independent execution), the dispatcher either assigns a new independent execution or sends an end-of-work signal. The computational load of the dispatcher process is negligible, as can be observed in lines 4 to 11 of Algorithm 10. In line 21, one of the two parallel proposals of the TLBO algorithm is used, i.e., SPG_TLBO or SPP_TLBO. The total number of processes is equal to $tp = np \times nt$, where $np$ is the number of distributed memory worker processes (MPI processes) and $nt$ is the number of shared memory processes (OpenMP processes or threads).
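The request-driven assignment performed by the work dispatcher can be illustrated with a small event simulation in Python. This is only a sketch of the load balancing idea and not an MPI program: the function name, the `run_cost` callback that models the duration of each independent execution, and the use of a heap to decide which worker becomes idle next are all assumptions.

```python
import heapq


def dispatch_runs(runs, np_workers, run_cost):
    """Simulate a dispatcher that serves idle workers until all runs are done.

    Returns (assignment, finish_time), where assignment maps each worker id
    to the list of run indices it executed.
    """
    # Each entry is (time at which the worker becomes idle, worker id).
    idle = [(0.0, w) for w in range(np_workers)]
    heapq.heapify(idle)
    assignment = {w: [] for w in range(np_workers)}
    next_run, finish_time = 0, 0.0
    while idle:
        t, w = heapq.heappop(idle)            # receive idle signal
        if next_run < runs:
            assignment[w].append(next_run)    # send work signal
            heapq.heappush(idle, (t + run_cost(next_run), w))
            next_run += 1
        else:
            finish_time = max(finish_time, t)  # send end-of-work signal
    return assignment, finish_time


# 30 independent runs with uneven, made-up durations on 4 workers.
assignment, finish = dispatch_runs(30, 4, lambda r: 1.0 + (r % 3))
print({w: len(rs) for w, rs in assignment.items()}, finish)
```

Because a run is handed out only when a worker reports that it is idle, faster or less loaded workers automatically execute more runs, which is what allows non-dedicated platforms to be used without a static distribution.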

Numerical Results
In this section, we analyse the parallel TLBO algorithms presented in Section 3. To perform the tests, we implemented the reference algorithm presented in [2] in C as the basis for the parallel algorithms, and used the GCC v4.8.5 compiler [19]. We chose MPI v2.2 [20] for the high-level parallel approach and OpenMP API v3.1 [21] for the shared memory parallel algorithms. The parallel platform used was composed of HP Proliant SL390 G7 nodes, each equipped with two Intel Xeon X5660 processors. Each X5660 included six processing cores at 2.8 GHz, and QDR Infiniband was used as the communication network. The performance was analysed using 30 unconstrained functions, listed and described in Tables 1 and 2.
We now analyse the parallel behaviour of the SPG_TLBO algorithm, described in Algorithms 4-6, i.e., the shared memory parallel algorithm that stores the whole population in shared (or global) memory. Table 3 shows the parallel efficiencies for all functions of the benchmark test, using a number of threads (NoT) between 2 and 10. In this table, we can see that good efficiencies are obtained for almost all functions using up to 6 threads. However, for functions with a very low computational cost, efficiency decreases considerably as the number of threads increases. In such cases, the heterogeneous memory parallel TLBO algorithm should be used to increase the number of processes efficiently. Tables 4 and 5 show the parallel efficiencies for the heterogeneous memory parallel TLBO algorithm using SPG_TLBO, setting the total number of processes (NoTP) to 4 and 10; when the number of MPI processes (NoP) is equal to 1, the SPG_TLBO algorithm alone is used. We compare the SPG_TLBO parallel algorithm with the hybrid MPI/OpenMP algorithm using the same total number of processes (NoTP). Since the MPI algorithm is independent of the OpenMP algorithm, the same behaviour is obtained when the SPP_TLBO algorithm is used instead of SPG_TLBO. As can be seen, using the hybrid MPI/OpenMP algorithm can significantly increase the scalability of the parallel algorithm. Table 6 shows the efficiencies for the functions with the highest computational cost, increasing the total number of processes (NoTP) to 30 and the number of iterations to 10,000; the good behaviour of the efficiency can be verified there.
Table 7 shows the parallel efficiencies for the SPP_TLBO algorithm, using the same sequential reference algorithm as in the SPG_TLBO experiments, i.e., the sequential TLBO algorithm. Note that the SPP_TLBO parallel algorithm does not emulate the sequential TLBO algorithm literally. The use of subpopulations in the SPP_TLBO algorithm modifies some procedures, such as the calculation of the mean of the variables; it also reduces the working population in some procedures, such as the detection of duplicates. This means that, on the one hand, efficiency results generally improve with respect to the SPG_TLBO algorithm and, on the other, in some cases the efficiency exceeds the theoretical upper limit that would apply if exactly the same algorithms were compared. In particular, for functions with a very low computational cost, the duplicate removal procedure becomes a very important part of the overall cost of the algorithm.
As shown in Tables 3-5 and 7, the proposed parallel methods obtain good efficiencies. The parallel proposal in [14] also achieves very good efficiencies for the particular problem under study, whose cost function has a high computational cost. In Table 8, we compare the method proposed in [14] with both proposed methods, SPG_TLBO and SPP_TLBO, for the first function of the benchmark test (provided by the reference software in https://gitlab.hpca.ual.es/ncc911/ParallelTLBO), i.e., the Sphere function, using between 2 and 10 threads (NoT), i.e., OpenMP processes. The results presented in Table 8 were obtained by running the reference code on the same parallel platform on which the results for the SPG_TLBO and SPP_TLBO algorithms were obtained. As shown, the efficiencies of both proposed algorithms, SPG_TLBO and SPP_TLBO, improve on those obtained by the reference algorithm, especially as the number of threads increases. Note that the TLBO parallel proposal presented in [13] obtains efficiencies of only about 30% and 20% for 16 and 32 processes respectively, and that other parallel proposals applied to the state-of-the-art algorithms DPGA and ACO obtain worse efficiency results and show serious scalability problems.
Finally, the effectiveness of the optimization should be checked, especially for the SPP_TLBO algorithm, as it modifies the procedure carried out in the sequential TLBO algorithm, while the SPG_TLBO algorithm performs processing that is analogous to the sequential one. Table 9 shows the number of iterations (N.It.) needed to achieve an optimal value with an error of less than $10^{-3}$. This table shows the data for the original sequentially executed TLBO algorithm and for the parallel SPP_TLBO algorithms. The number of iterations shown in Table 9 is the average number of iterations needed to achieve an optimal value with an error of less than $10^{-3}$; this average has been computed over the 30 values obtained from the 30 independent runs performed, both for the sequential and the parallel algorithms. In addition, both the subpopulation size and the population size for the sequential algorithm are equal to 120. Note that the number of iterations when using the SPG_TLBO parallel algorithm is similar to that of the sequential reference algorithm, because the sequential procedure has not been modified. In contrast, the number of iterations shown in Table 9 for the SPP_TLBO parallel algorithm shows that our parallel proposal outperforms the sequential TLBO algorithm, i.e., convergence is accelerated. Therefore, the strategy of using subpopulations connected by the best global individual, used in the SPP_TLBO algorithm, offers improvements both at the computational level and in convergence speed. Table 9 does not include the functions with faster convergence.

Conclusions
The TLBO heuristic optimization algorithm is an effective optimization algorithm that, though recent, has been widely tested and compared. In this work, we presented efficient parallel algorithms for heterogeneous parallel platforms. We proposed a hybrid MPI/OpenMP algorithm that exploits the inherent parallelism at different levels. Moreover, we proposed two different algorithms for shared memory architectures using OpenMP, called SPG_TLBO and SPP_TLBO. The first is an efficient parallel implementation of the sequential TLBO algorithm without any changes to the sequential procedure. In the second, SPP_TLBO, we proposed a different strategy that improves both computational performance and optimization behaviour. Significantly, the parallel proposals achieved good parallel performance regardless of the intrinsic characteristics of the functions to be optimized, in particular their computational cost. In addition, the high-level parallel proposal includes an intrinsic load balancing mechanism that allows the use of non-dedicated computing platforms.

Algorithm 1 Teacher phase of TLBO algorithm.
1: Create Initial Population: $X_{i,j}$
2: $i$ identifies the individual, $i = 1 \ldots m$
3: $j$ identifies the design variable, $j = 1 \ldots k$
4: Teacher phase:
5: {
6:   Identify the best individual or teacher ($X_{teacher}$)
7:   Compute the mean of all design variables $M_j$
8:   Compute the teaching factor ($TF$)
9:   Compute the random factors ($r_j$)
10:  for i = 1 to m do
11:    for j = 1 to k do

Algorithm 4 Teacher phase of SPG_TLBO algorithm.
1: Set population size parameter ($m$)
2: Obtain the number of parallel threads ($nt$)
3: Compute the size of subpopulations ($m_{nt}$)
4: In parallel s = 1 to nt do
5:   Create Initial Subpopulation: $X_i^s$
6: end for
7: {Whole Population is: $X_{i,j}$}
8: Teacher phase:
9: {
10:  Compute the teaching factor ($TF$)
11:  In parallel s = 1 to nt do
12:    Identify the best individual of subpopulation ($X_{best}^s$)
13:  end for
14:  Compute the global teacher: $X_{teacher} = Bestof(X_{best}^s)$
15:  In parallel s = 1 to nt do
16:    Compute the partial mean of all design variables $M_j^s$
17:  end for
18:  Compute the global mean of all design variables $M_j$
19:  In parallel s = 1 to nt do
20:    Compute the random factors ($r_j^s$)
21:  end for
22:  In parallel s = 1 to nt do
23:    for i = 1 to $m_{nt}$ do

Algorithm 5 Learner phase of SPG_TLBO algorithm.
1: Learner phase:
2: {
3:   In parallel s = 1 to nt do
4:     for i = 1 to $m_{nt}$ do
5:       Randomly identify another student with whom to interact

Algorithm 6 Duplicate removal phase of SPG_TLBO algorithm.
1: Duplicate removal phase:
2: {
3:   In parallel s = 1 to nt do
4:     for i ∈ $m_{nt}$ do
5:       for j = i + 1 to m do

Algorithm 9 Duplicate removal phase of SPP_TLBO algorithm.
1: Duplicate removal phase:
2: {
3:   In parallel s = 1 to nt do
4:     for i = 1 to $m_{nt}$ do
5:       for j = i + 1 to $m_{nt}$ do

Algorithm 10 Heterogeneous memory parallel TLBO algorithm.
1: np: number of distributed memory worker processes
2: Dispatcher process:
3: {
4:   for l = 1 to Runs do
5:     Receive idle signal
6:     Send work signal
7:   end for
8:   for l = 1 to np do