Predicting DNA Motifs by Using Multi-Objective Hybrid Adaptive Biogeography-Based Optimization

The computational discovery of DNA motifs is one of the most important problems in molecular biology and computational biology, and it has not yet been resolved in an efficient manner. In previous research, we solved the single-objective motif discovery problem (MDP) with biogeography-based optimization (BBO) and obtained excellent results. In this study, we apply a multi-objective biogeography-based optimization algorithm to the multi-objective motif discovery problem, which refers to the discovery of novel transcription factor binding sites in DNA sequences. To this end, we propose an improved multi-objective hybridization of adaptive biogeography-based optimization with differential evolution (DE), named MHABBO, to predict motifs from DNA sequences. In the MHABBO algorithm, the fitness function is redefined on the basis of the distribution information among the habitat individuals and the Pareto dominance relation. Based on the relationship between the cost of the fitness function and the average cost in each generation, the MHABBO algorithm adaptively changes the migration probability and the mutation probability. Additionally, the mutation procedure is modified by combining it with the DE algorithm, and the migration operator is improved on the basis of the number of iterations to meet motif discovery requirements. Furthermore, the immigration and emigration rates are modified on the basis of a cosine curve. The algorithm can therefore generate promising candidate solutions. Statistical comparisons with the DEPT and MOGAMOD approaches on three commonly used datasets are provided, which demonstrate the validity and effectiveness of the MHABBO algorithm. Compared with some typical existing approaches, the MHABBO algorithm performs better in terms of the quality of the final solutions.


Introduction
The motif discovery problem (MDP) in molecular biology is to find similar regions common to each sequence in a given set of DNA, RNA, or protein sequences [1]. It is an important problem for locating binding sites and finding conserved regions in unaligned sequences. From a computational point of view, finding motifs in many sequences is an NP-hard problem. Many methods, such as statistical and probabilistic methods, have been applied to the MDP and have achieved excellent results. In recent years, as evolutionary algorithms have developed and shown their advantages, they have also been gradually applied to the MDP.
Evolutionary computation (EC) is an optimization method based on the principles of biological evolution and has gained increasing attention in recent years. EC has certain advantages for motif discovery [2]. Evolutionary algorithms (EAs) carry out a global search and have relatively low sensitivity to initial conditions [3]. They are comparatively flexible in terms of how solutions are represented and evaluated, and they do not require knowledge about the problem to which they are being applied. EC methods that have been successfully applied to the motif discovery problem include the genetic algorithm (GA) [4], the bacterial foraging optimization algorithm integrating taboo search (TSBFO) [5], the estimation of distribution algorithm with differential evolution (DE/EDA) [6], evolutionary multi-objective optimization (DEPT) [7], the multi-objective artificial bee colony (MOABC) algorithm [8], the multi-objective genetic algorithm (MOGAMOD) [9], the Non-dominated Sorting Genetic Algorithm III (NSGA-III) [10], and the multi-objective evolutionary algorithm based on decomposition (MOEA/D) [11]. The biogeography-based optimization (BBO) algorithm (Simon, 2008) [12] is a nature-inspired computational technique based on the mathematical models of biogeography. As a population-based stochastic algorithm, the BBO algorithm generates the next generation of the population by simulating the migration of biological species. Because information is shared in the migration process, the BBO algorithm has a good exploitation ability. The BBO algorithm is effective for the single-objective motif discovery problem [13][14][15], and it has also been modified to solve multi-objective optimization problems (MOPs) [16][17][18][19]. However, in these papers, the BBO algorithm has not been applied to the multi-objective motif discovery problem. In [20], we applied a hybridization of the adaptive biogeography-based optimization algorithm and differential evolution (DE) to multi-objective optimization problems (MOPs) and achieved excellent performance in terms of the convergence and the distribution of solutions.
The aim of this paper, which builds on our previous research, is to apply the BBO algorithm to the multi-objective motif discovery problem. As far as we know, this is the first time that multi-objective biogeography-based optimization has been applied to the multi-objective MDP. In this paper, a new algorithm named MHABBO is presented and used to solve the multi-objective MDP, and a comparative study with other algorithms on twelve datasets is presented.
The motivation for proposing MHABBO for the MDP in this research is threefold. First, based on the above literature review, there have been several successful applications of multi-objective biogeography-based optimization (MBBO). Second, we have proposed a new MBBO algorithm that achieved excellent performance on multi-objective benchmark functions [20]. Finally, we aim to apply it to the multi-objective MDP.
The key contributions of this paper are as follows. Firstly, we propose a new approach called MHABBO, based on the BBO algorithm, to predict motifs. In the MHABBO algorithm, the migration in the BBO depends on the number of iterations, to avoid being trapped in a very stable local minimum. Secondly, motivated by the work described in References [21][22][23][24], the mutation is performed by integrating with DE to produce new feasible solutions. Simultaneously, the migration probability and the mutation probability are adaptively changed. Furthermore, the immigration and emigration rates are modified on the basis of the cosine curve. Finally, we apply the MHABBO algorithm to the multi-objective motif discovery problem.
Compared with the DEPT and MOGAMOD approaches on three commonly used datasets, the MHABBO algorithm performs better, or at least comparably, in terms of the quality of the final solutions. Statistical comparisons with some typical existing approaches demonstrate the validity and effectiveness of the MHABBO algorithm. Experimental results show that the obtained Pareto solutions approximate the Pareto optimal front and have good diversity and a uniform distribution.
The paper is organized as follows. Section 2 describes the motif discovery problem. The MHABBO algorithm for the multi-objective motif discovery problem is proposed in Section 3. Section 4 presents the simulation and experimental results. Finally, a brief conclusion is given in Section 5.

Motif Discovery Problem
In this paper, we use the same objectives as those used in Reference [9] to find many long and strong motifs. The multi-objective motif discovery problem is converted into the following three-objective optimization problem: maximize similarity, maximize motif length, and maximize support. These three objectives for the MDP are defined as follows [7]:
(1) Support: Support indicates the level of support of the candidate motifs for the consensus motif. The consensus motif is built from the candidate motifs. The level of support is measured by the similarity rate of a candidate motif to the consensus motif, that is, the fraction of nucleotides that are identical between the candidate motif and the consensus motif. When the similarity rate is larger than 50%, the subsequence corresponding to the candidate motif is counted as a support. For example, assume that the consensus motif is GACCTTTTGCAATCCTGG. The candidate motif of sequence 1, GACCACTTGCAGTCTTAG, has 13 nucleotides identical to the consensus motif, and the consensus motif has 18 nucleotides, so its similarity rate is 13/18 = 72%.
(2) Motif length: The motif length is the number of nucleotides of the consensus motif. In the example, the motif length is 18. According to the real datasets used in this paper, the motif length is limited to between 5 and 60.
(3) Similarity: The similarity objective function of a motif is defined as the average of the dominance values of all position weight matrix columns. The similarity is calculated by Equation (1):
$$\mathrm{Similarity}(\mathrm{Motif}) = \frac{\sum_{i=1}^{l} dv(i)}{l} \tag{1}$$
where dv(i), the dominance value of column i, is the score of the dominant nucleotide in that column. It is calculated by Equation (2):
$$dv(i) = \max_{b}\{f(b, i)\}, \quad i = 1, \ldots, l \tag{2}$$
where f(b, i) is the score of nucleotide b in column i of the position weight matrix, and l is the motif length.
To better understand the similarity objective function, consider an example. Firstly, a position weight matrix (PWM) is generated from the motif patterns found by MHABBO in every sequence. Then, the percentage of occurrence of each nucleotide at each motif position is calculated (see Table 1). The highest value of each matrix column is selected. The similarity is obtained by averaging all these dominance values. In this example, the similarity value is 0.81 (81%) by Equation (1): (1 + 1 + 0.5 + 0.75 + 1 + 1 + 0.75 + 0.5 + 1 + 0.5 + 0.75 + 1 + 0.75 + 1 + 0.5 + 1 + 0.5 + 1)/18 = 0.81.
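As a minimal sketch of how the three objectives could be computed from a set of aligned candidate motifs, the following Python fragment builds a column-frequency matrix, derives the consensus motif, and evaluates similarity and support as defined above. The function names, the tie-breaking order A, C, G, T for the consensus, and the use of raw frequencies as PWM scores are illustrative assumptions, not part of the original method description.

```python
from collections import Counter

def position_frequencies(motifs):
    """Column-wise nucleotide frequencies for equal-length candidate motifs."""
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all candidate motifs must have the same length")
    pwm = []
    for i in range(length):
        counts = Counter(m[i] for m in motifs)
        pwm.append({b: counts.get(b, 0) / len(motifs) for b in "ACGT"})
    return pwm

def consensus(pwm):
    """Dominant nucleotide per column (ties resolved in A, C, G, T order; an assumption)."""
    return "".join(max("ACGT", key=lambda b: col[b]) for col in pwm)

def similarity(pwm):
    """Average of the per-column dominance values, as in Equations (1) and (2)."""
    return sum(max(col.values()) for col in pwm) / len(pwm)

def similarity_rate(candidate, cons):
    """Fraction of positions at which the candidate matches the consensus."""
    return sum(a == b for a, b in zip(candidate, cons)) / len(cons)

def support(motifs, cons, threshold=0.5):
    """Number of candidates whose similarity rate exceeds the threshold."""
    return sum(similarity_rate(m, cons) > threshold for m in motifs)

# Hypothetical toy input: four candidate motifs of length 4.
toy = ["ACGT", "ACGA", "ACTT", "AGGT"]
toy_pwm = position_frequencies(toy)
toy_consensus = consensus(toy_pwm)
```

For the example pair from the text, `similarity_rate("GACCACTTGCAGTCTTAG", "GACCTTTTGCAATCCTGG")` evaluates to 13/18.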

MHABBO Algorithm
In this section, we describe the MHABBO algorithm for the motif discovery problem in detail. In the MHABBO algorithm, the migration in the BBO depends on the number of iterations, and the mutation is performed by integrating with DE to produce new feasible solutions. Simultaneously, the migration probability and the mutation probability are adaptively changed. Meanwhile, the immigration rate and the emigration rate are modified on the basis of a cosine curve.
First, we describe the representation of the individuals in our algorithm. Because each individual contains the information needed to form a possible motif, an individual is represented by the motif length and the starting location s_i of the potential motif on each sequence. The representation of an individual is shown in Table 2. This representation is the same as that used in [9]. The sharing of features between solutions is represented as immigration and emigration between the islands. The immigration rate λ and the emigration rate µ of each solution are used to probabilistically share features between solutions. Motivated by the work in [25], these parameters are modified on the basis of the cosine curve. The immigration rate and the emigration rate of each individual are given by Equations (3) and (4), respectively, in which NP is the population size.
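To illustrate what a cosine-shaped migration model looks like, the sketch below uses one form that is common in the BBO literature, λ_k = (I/2)(cos(kπ/NP) + 1) and µ_k = (E/2)(1 − cos(kπ/NP)). This is an assumption: the modified rates of MHABBO may differ from it. The function name, the rank convention (rank NP is the best habitat), and the maximum rates I and E are hypothetical.

```python
import math

def cosine_migration_rates(rank, pop_size, max_immigration=1.0, max_emigration=1.0):
    """Return (immigration, emigration) rates for one habitat.

    ASSUMPTION: rank runs from 1 (worst habitat) to pop_size (best habitat),
    and the rates follow a generic cosine migration model, not necessarily
    the exact form used by MHABBO.
    """
    if not 1 <= rank <= pop_size:
        raise ValueError("rank must lie in 1..pop_size")
    c = math.cos(rank * math.pi / pop_size)
    immigration = max_immigration / 2.0 * (c + 1.0)   # high for poor habitats
    emigration = max_emigration / 2.0 * (1.0 - c)     # high for good habitats
    return immigration, emigration

# Rates for a hypothetical population of 10 habitats.
rates = [cosine_migration_rates(k, 10) for k in range(1, 11)]
```

Under this model, the best habitat accepts almost no features and shares the most, while habitats of medium quality change rates fastest.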
Motivated by the blended migration operator in [26], the coefficient of solution H_i in our algorithm is related to the number of iterations. The modified migration operator is based on the following considerations. First, blended combination operators have been widely used in other optimization algorithms. Second, good solutions are not degraded by migration. Third, poor solutions can still accept many new features from good solutions. The migration operator is designed to accelerate convergence on the basis of the number of iterations. The modified migration is defined by Equation (5), where H_i is the immigrating island, H_k is the emigrating island, H_i(j) is the jth dimension of the ith solution, t is the current iteration number, and t_max is the maximum number of iterations. Equation (5) means that a new solution after migration comprises two components: features from itself and features migrated from another solution. This accelerates the convergence of the algorithm. The modified migration operator is described in Algorithm 1.
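The idea of an iteration-dependent blend can be sketched as follows. The weight α = t/t_max and the rounding to integer positions are assumptions made purely for illustration; the coefficient in Equation (5) may take a different form. The function name is hypothetical.

```python
def blended_migration(h_i, h_k, t, t_max):
    """Blend the immigrating habitat h_i with the emigrating habitat h_k.

    ASSUMPTION: the weight on the habitat's own features is alpha = t / t_max,
    so early iterations accept more from h_k and late iterations keep more of h_i.
    Values are rounded because individuals hold integer lengths and positions.
    """
    if t_max <= 0 or not 0 <= t <= t_max:
        raise ValueError("require 0 <= t <= t_max and t_max > 0")
    alpha = t / t_max
    return [int(round(alpha * a + (1.0 - alpha) * b)) for a, b in zip(h_i, h_k)]

# Hypothetical habitats with two features each.
early = blended_migration([10, 40], [20, 80], t=0, t_max=100)
middle = blended_migration([10, 40], [20, 80], t=50, t_max=100)
late = blended_migration([10, 40], [20, 80], t=100, t_max=100)
```

In a full BBO loop, such a blend would be applied only to the features selected for immigration according to λ and µ.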

Mutation Operator for the MDP
Although the hybridization of the BBO with DE has achieved many good results [27][28][29], these approaches incorporate DE into the migration procedure for single-objective optimization problems. In the MHABBO algorithm, DE is incorporated into the mutation procedure for multi-objective optimization problems, which helps to find the non-dominated solutions. A mutated individual H_i(j) is generated according to Equation (6), where H_i(j) is the feature selected for mutation; c_1 is the mutation scaling factor, whose value is usually set between 0.1 and 0.15; H_r1(j) and H_r2(j) are two randomly selected solutions; and H_best(j) is the best solution in the current generation. In the MHABBO algorithm, this mutation scheme tends to increase the diversity of the population. The modified mutation operator is described in Algorithm 2.
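A DE-style mutation that uses the symbols named above could look like the sketch below. The update rule H_i(j) + c_1(H_best(j) − H_i(j)) + c_1(H_r1(j) − H_r2(j)) is an assumed current-to-best form chosen for illustration; it is not confirmed to be Equation (6). The bounds, rounding, and function name are likewise hypothetical.

```python
def de_mutation(h_i, h_best, h_r1, h_r2, c1, lower, upper):
    """Mutate every feature of h_i with a DE-style difference term.

    ASSUMPTION: current-to-best update with a single scaling factor c1,
    followed by rounding and clipping to per-feature bounds.
    """
    mutated = []
    for j in range(len(h_i)):
        value = h_i[j] + c1 * (h_best[j] - h_i[j]) + c1 * (h_r1[j] - h_r2[j])
        value = int(round(value))                      # positions are integers
        mutated.append(min(max(value, lower[j]), upper[j]))
    return mutated

# Hypothetical one-feature example.
free = de_mutation([10], [20], [30], [10], c1=0.1, lower=[0], upper=[100])
clipped = de_mutation([10], [20], [30], [10], c1=0.1, lower=[0], upper=[12])
```

Clipping keeps a mutated start position inside the sequence, which is needed for the individual to remain a feasible motif.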

Adaptive BBO for MDP
The modification probability and the mutation probability in the BBO algorithm are denoted by P_modif and P_muta, respectively. Both factors take values between 0 and 1 and are normally set by the user. Their settings therefore depend on the experience of the user and may be unfavorable for the selection of the individuals to migrate and to mutate. In order to choose better individuals for migration and mutation, these parameters are changed dynamically with the fitness function.

The Redefinition of the Fitness Function
In this paper, we propose the MHABBO algorithm for the multi-objective motif discovery problem. Generally speaking, multi-objective optimization problems are solved through Pareto non-dominated sorting and crowding distance sorting of the solutions. In [30], the fitness function is determined on the basis of the Pareto dominance relation. However, considering only the Pareto dominance relation is not enough; if the distribution of solutions is also included, the definition of the fitness function is more reasonable. The SPEA2 algorithm [31] uses both the Pareto dominance relationship and the density relationship between solutions as its fitness function. Following SPEA2, we redefine the fitness function on the basis of the density information and the Pareto dominance relationship among the habitats. That is, we employ a non-dominated sorting approach to determine the non-dominated rank of individuals. Specifically, the Pareto dominance relationship refers to the number of non-dominated solutions that dominate an individual. The density of each individual is calculated by the k-nearest-neighbor method. For any individual H_i = (H_i1, H_i2, ..., H_in) in the habitat population H = {H_1, H_2, ..., H_NP}, its fitness function is defined by Equation (9).
Here, H_i, H_j, H_k ∈ H are habitats, σ_i^e is the distance between the habitats H_i and H_e in the objective space, and the operator |·| is the cardinality of a set. The value e is the integer part of the square root of the sum of the population size NP and the elitism number N. According to Equation (14), the fitness function F(H_i) is the sum of D(i) and the average of the sum of the numbers of dominated habitats in the total population. Here, the number of dominated habitats means, for every individual that dominates H_i, the total number of other habitats in the population that this individual dominates. Therefore, the lower the dominated degree of habitat H_i, the smaller its fitness value. When the fitness function of H_i is 0, H_i is a non-dominated habitat.
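The fitness described above follows the SPEA2 idea of combining dominance counts with a nearest-neighbor density term. The sketch below implements the original SPEA2 assignment (raw fitness plus a density of 1/(σ_k + 2)) for maximized objectives. It is not the exact MHABBO formula, which additionally averages the dominance counts; the function names and the default k equal to the integer part of the square root of the population size are assumptions.

```python
import math

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def spea2_fitness(objs, k=None):
    """SPEA2-style fitness: lower is better, values below 1 are non-dominated."""
    n = len(objs)
    if k is None:
        k = max(1, int(math.sqrt(n)))
    # Strength: how many solutions each individual dominates.
    strength = [sum(dominates(objs[i], objs[j]) for j in range(n)) for i in range(n)]
    fitness = []
    for i in range(n):
        # Raw fitness: summed strength of all individuals that dominate i.
        raw = sum(strength[j] for j in range(n) if dominates(objs[j], objs[i]))
        # Density: based on the distance to the k-th nearest neighbor.
        dists = sorted(math.dist(objs[i], objs[j]) for j in range(n) if j != i)
        sigma_k = dists[min(k, len(dists)) - 1] if dists else 0.0
        fitness.append(raw + 1.0 / (sigma_k + 2.0))
    return fitness

# Hypothetical (length, support, similarity-like) objective vectors.
population = [(3, 3, 3), (2, 2, 2), (1, 1, 1), (3, 1, 4)]
scores = spea2_fitness(population)
```

Because the density term is always at most 0.5, sorting by this fitness ranks individuals first by dominance and then by crowding.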

Main Procedure of MHABBO for Multi-Objective Motif Discovery Problem
Firstly, the fitness function is redefined on the basis of density information and the Pareto dominance relation. Then, the modified migration procedure and the mutation procedure are merged into the BBO. Furthermore, related parameters in the BBO, such as the modification probability, the mutation probability, the emigration rate, and the immigration rate, are adapted. The procedure of MHABBO is described in Algorithm 3.

Algorithm 3: The main pseudo-code of the MHABBO algorithm for the multi-objective MDP
Input: the sequences S
Output: support, motif length, similarity, and the non-dominated consensus motif instances with their corresponding PWMs.
1. Init(number of iterations, elitism parameter keep, migration probability P_modif, mutation probability P_muta, etc.)
2. P ⇐ GenerateInitialRandomPopulation()
3. EvaluateFitness(H_i) for each habitat H_i in P according to Equation (9)
4. While the halting criterion is not satisfied do
      Compute λ_i, µ_i for each habitat H_i according to Equations (3) and (4)
      ...
      [P_modif, P_muta] ⇐ updateProbability() according to Equations (7) and (8)
15. End while

We generate the initial population for the three different objectives. That is, each solution is randomly generated with a different length and support. In addition, we compute the similarity between the candidate motifs and the consensus motif. Each solution thus has three indicators: the length, the support, and the similarity. That is, each solution reflects multiple objectives. The function GenerateInitialRandomPopulation() in Algorithm 3 is described as follows:

GenerateInitialRandomPopulation()

1. The length of the candidate motif is randomly generated within the range of the motif length.
2. The beginning position of the candidate motif in each sequence is randomly chosen according to the length of the sequence.
3.1 Generate a candidate motif for each sequence based on the beginning position and the motif length.
3.2 Generate the PWM based on these candidate motifs.
3.3 Generate the consensus motif according to the PWM.
3.4 Compare the candidate motif generated from each sequence with the consensus motif; if the similarity rate is greater than 0.5, the support count is increased by one.
4. Regenerate the PWM according to the support.
5. Confirm the consensus motif H_i according to the PWM.
6. Repeat the above steps to generate a population of the specified size.

In the evaluation of the fitness function, the degree of Pareto domination and the distribution information between different solutions are reflected by the fitness function. That is, Pareto non-dominated sorting is equivalent to ranking by the value of the fitness function. MHABBO uses the Pareto dominance relationship and the density information between different solutions as the fitness function. We employ a non-dominated sorting approach to determine the non-dominated rank of individuals. The function EvaluateFitness(H_i) in Algorithm 3 is described as follows:

EvaluateFitness(H_i)

1. Confirm the length and the support of H_i.
2. Compute the similarity of H_i by Equation (1).
3. Count the number of solutions dominating H_i based on the three objective functions (length, support, and similarity).
4. For each solution H_j dominating H_i:
5.    Count the number of solutions dominating H_j.
6.    For each solution H_k dominating H_j
7. Evaluate the fitness function for each habitat H_i according to Equation (14).
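The steps of GenerateInitialRandomPopulation() listed earlier can be sketched for a single individual as follows. This is an illustrative reading of those steps, not the authors' MATLAB implementation: the function names, the dictionary layout, the zero-based start positions, and the choice to keep the first consensus when no candidate reaches the support threshold are all assumptions. The sketch also assumes that every sequence is at least `max_len` nucleotides long.

```python
import random
from collections import Counter

def consensus_of(motifs):
    """Most frequent nucleotide in each column of equal-length motifs."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*motifs))

def identity_rate(candidate, cons):
    """Fraction of positions at which candidate and consensus agree."""
    return sum(a == b for a, b in zip(candidate, cons)) / len(cons)

def random_individual(sequences, min_len, max_len, rng):
    """Build one random individual following steps 1 to 5 of the procedure."""
    length = rng.randint(min_len, max_len)                           # step 1
    starts = [rng.randint(0, len(s) - length) for s in sequences]    # step 2
    candidates = [s[p:p + length] for s, p in zip(sequences, starts)]  # step 3.1
    cons = consensus_of(candidates)                                  # steps 3.2, 3.3
    supporters = [c for c in candidates if identity_rate(c, cons) > 0.5]  # step 3.4
    if supporters:
        cons = consensus_of(supporters)                              # steps 4, 5
    return {"length": length, "starts": starts,
            "support": len(supporters), "consensus": cons}

# Hypothetical toy sequences; real datasets have 500 to 2000 bps per sequence.
toy_sequences = ["ACGTACGTACGTACGTACGT",
                 "TTGACGTACCATGACGTTAG",
                 "GGGACGTACGTTTACGTACC"]
individual = random_individual(toy_sequences, 5, 8, random.Random(1))
```

Calling `random_individual` repeatedly with the same random generator yields a population of the desired size (step 6).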

Results Comparisons with Other Methods
In order to demonstrate the feasibility of the MHABBO algorithm for the MDP, the MHABBO algorithm is compared with MOGAMOD and DEPT. Experiments are carried out on a number of real sequence datasets selected from the TRANSFAC database [32]. Motif instances in the sequences of each dataset have already been tagged, so these datasets are used as a benchmark for the discovery of TFBSs [33]. The properties of the datasets are shown in Table 3. Every dataset corresponds to a real organism. More concretely, three of the datasets are from fly (those beginning with dm), three from human (hm), three from mouse (mus), and three from yeast (yst). Datasets with different numbers of sequences and different sizes (nucleotides per sequence) are selected to ensure that our algorithm works with several types of instances. For example, the yst04r dataset contains 7 sequences of 1000 bps each and has 7 motif instances with lengths ranging from 5 to 25. The hm03r dataset contains 10 sequences of 1500 bps each and has 15 motif instances with lengths ranging from 14 to 46. Using datasets from different species allows us to test whether the new algorithm can obtain meaningful motifs in different types of biological data. The running times are also given in Table 3. The algorithm was implemented in MATLAB R2014b. All experiments were performed on Windows 10 with a 64-bit Intel(R) Core(TM) i5-6200U CPU (2.30 GHz) and 12 GB of RAM. The parameters used by the MHABBO algorithm are shown in Table 4.
The MHABBO algorithm was run 5 times on each dataset with different random seeds. The top 20 results obtained by a non-dominated sort over the 5 runs are recorded. Due to space limitations, we only list the parameters used for the yst04r dataset: the DE mutation scheme is DE/rand/1/bin, the population size is 100, the maximum number of generations is 100, the number of variables in each individual is 8, the motif length is between 5 and 25, the habitat modification probability is 0.75, the mutation probability is 0.05, the elitism parameter is 10, the scaling factor is 0.01, and the factors k_1, k_2, k_3, and k_4 are 0.4, 0.95, 0.05, and 0.1, respectively. The other test problems use parameters similar to those of the yst04r dataset; the parameters that differ include the number of variables in each individual, the mutation probability, and the modification probability. We assume that a motif instance is correctly discovered if the predicted binding site is within 3 bps of the true binding site. The results for hm03r are shown in Table 7. The motifs marked with "*" in Table 6 are consistent with the known motif instances. From these tables, MHABBO achieves better results than MOGAMOD and solutions similar to those obtained by DEPT, and several motif instances predicted by MHABBO are very similar to the known motif instances. We therefore conclude that the MHABBO algorithm can predict meaningful motifs and is a promising method for multi-objective motif discovery. As the length of the predicted motifs increases, the similarity does not obviously decrease. From Tables 5 and 6, we observe that some motifs are predicted by only one of the algorithms. The reason is that different search strategies explore different regions of the search space. Hence, MHABBO is chosen for motif discovery. Additionally, the MHABBO algorithm can not only predict some motifs found by other known methods but also find novel motifs. However, the accuracy of the predicted motifs is not high enough. One reason is that the performance of MHABBO is influenced by the random selection of an SIV during migration and mutation between the islands. Another reason is that the definition of the fitness function considers only some factors, which may reduce accuracy.

The Consensus Motifs Obtained by MHABBO Algorithm
Sequence logos are a graphical representation of an amino acid or nucleic acid multiple sequence alignment, developed by Tom Schneider and Mike Stephens [37]. A sequence logo provides a richer and more precise description of, for example, a binding site than a consensus sequence does. WebLogo is a web-based application designed to make the generation of sequence logos as easy as possible, so the consensus motifs predicted by our algorithm on the different datasets are rendered with WebLogo in Table 8.

Representation of the Pareto Fronts Obtained by MHABBO Algorithm
In order to give a visual perspective on the results, we show the graphs corresponding to the solutions obtained by MHABBO for each dataset (see Figure 1). The graphs show the Pareto front points (blue points) obtained by running the algorithm configured with the optimal parameters. The motif length is represented on the X-axis, the similarity on the Y-axis, and the support on the Z-axis. Furthermore, we show the projection of each point onto the XY (purple points), XZ (red points), and ZY (yellow points) planes. The Pareto fronts in Figure 1 show that MHABBO achieves a good distribution of solutions on these datasets. For example, there are 7 motif instances with lengths between 9 and 54 in the hm16r dataset, and most of these motif instances have a length of about 20. It can be seen from Figure 1 that the length of most obtained solutions is also about 20, so the distribution of the obtained solutions is consistent with the distribution of the known solutions. As the length of the predicted motifs increases, the similarity does not obviously decrease, and as the support of the predicted motifs increases, the similarity does not obviously decrease either. The results demonstrate that the proposed MHABBO algorithm is competitive in terms of the quantity and the distribution of the final solutions. The results also show the distribution of the solutions and the convergence to the Pareto-optimal front. This indicates that our approach performs well on the multi-objective MDP.

Metrics to Assess Performance
Performance metrics return a scalar quantity that reflects the quality of the solutions. For each tool T and each dataset D, we have the set of known binding sites and the set of predicted binding sites. The correctness of T on D can be assessed both at the nucleotide level and at the motif level. Many metrics can be used to measure the quality of MDP solutions [30], for example, the nucleotide-level sensitivity (nSn), the nucleotide-level positive predictive value (nPPV), the nucleotide-level correlation coefficient (nCC), the nucleotide-level performance coefficient (nPC), the motif-level correlation coefficient (mCC), and the motif-level F-score [35]. The following metrics are used in this paper: the nPC and the F-score.


The Nucleotide-Level Performance Coefficient (nPC)
To measure the prediction accuracy of methods with respect to motif location, we use the nucleotide-level performance coefficient (nPC). It was also adopted by Tompa et al. to evaluate binding site predictions in their single motif discovery benchmark study. The nPC is defined as follows:
$$nPC = \frac{nTP}{nTP + nFN + nFP}$$
Here, nTP is the number of nucleotide positions in both known sites and predicted sites, nFN is the number of nucleotide positions in known sites but not in predicted sites, and nFP is the number of nucleotide positions not in known sites but in predicted sites.
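The nPC can be computed directly from sets of nucleotide positions. The following sketch assumes that each site is given as a hypothetical (sequence index, start, length) triple; the function names are illustrative.

```python
def site_positions(sites):
    """Expand (sequence_index, start, length) triples into a set of positions."""
    return {(seq, pos)
            for seq, start, length in sites
            for pos in range(start, start + length)}

def npc(known_sites, predicted_sites):
    """Nucleotide-level performance coefficient nTP / (nTP + nFN + nFP)."""
    known = site_positions(known_sites)
    predicted = site_positions(predicted_sites)
    n_tp = len(known & predicted)   # positions in both known and predicted sites
    n_fn = len(known - predicted)   # known positions that were missed
    n_fp = len(predicted - known)   # predicted positions that are not known
    total = n_tp + n_fn + n_fp
    return n_tp / total if total else 0.0

# Hypothetical example: a 10-bp known site and a prediction shifted by 5 bp.
example_npc = npc([(0, 10, 10)], [(0, 15, 10)])
```

In this example, 5 positions overlap, 5 are missed, and 5 are spurious, so the coefficient is 5/15.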
To further measure the effectiveness of the algorithm, the nPC values obtained by the MHABBO algorithm on the different datasets are compared with those of 14 other algorithms. The results obtained by the fifteen algorithms on the different test problems are given in Figure 2.

It can be seen from Figure 2 that the nPCs obtained by the MHABBO algorithm on datasets 1 to 9 are significantly better than those of the other fourteen algorithms, whereas the nPCs obtained by the MHABBO algorithm on the yeast (yst) datasets are worse than those of several algorithms. The algorithm thus shows better performance on higher organisms than on simpler organisms. It can be concluded that the performance of the algorithm does not obviously decrease as the dimension of the problem increases.

F-Score
To assess the performance of our algorithm at the motif level, Precision, Recall, and F-score are adopted on the basis of Equation (10) [38], where the operator |·| is the cardinality of a set. The candidate motif instances obtained by MHABBO need to be verified by biological experiments. We aim for a high Precision and a high Recall; the F-score is a tradeoff between Precision and Recall.
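A motif-level evaluation along these lines can be sketched as follows, assuming the standard definitions Precision = |correct predictions| / |predictions|, Recall = |recovered known sites| / |known sites|, and F = 2PR / (P + R). The 3-bp tolerance follows the correctness criterion stated in the experimental setup; the site encoding as (sequence index, start) pairs and the function name are assumptions.

```python
def motif_level_scores(known, predicted, tolerance=3):
    """Return (precision, recall, f_score) for sets of (sequence_index, start) sites.

    A predicted site counts as correct if a known site in the same sequence
    starts within `tolerance` bps of it (assumed reading of the criterion).
    """
    def has_match(site, others):
        return any(site[0] == o[0] and abs(site[1] - o[1]) <= tolerance
                   for o in others)

    correct_predictions = sum(has_match(p, known) for p in predicted)
    recovered_known = sum(has_match(k, predicted) for k in known)
    precision = correct_predictions / len(predicted) if predicted else 0.0
    recall = recovered_known / len(known) if known else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical example: three known sites, two predictions, one of them correct.
p, r, f = motif_level_scores({(0, 100), (1, 50), (2, 10)}, {(0, 102), (1, 60)})
```

Here one of two predictions is correct and one of three known sites is recovered, so the F-score balances a precision of 0.5 against a recall of 1/3.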

The average results (precision (P), recall (R), and F-score (F)) obtained by MHABBO on the twelve datasets are shown in Table 9. The comparisons of MHABBO with other methods [33] on the three datasets are given in Table 10, which shows the average results of these algorithms over 5 runs. According to the F-score, MHABBO is the best of all twelve algorithms on the hm03r and mus02r datasets, and it is better than ABBO/DE/GEN, which addresses the single-objective motif discovery problem, on these two datasets. However, it is worse than MEME, MEME3, ABBO/DE/GEN, and MOTIFSAMPLE on yst08r. The algorithm shows better performance on higher organisms than on simpler organisms. The experiments demonstrate the validity of the proposed MHABBO algorithm for multi-objective motif discovery problems.
Assessing the performance of the MHABBO algorithm at the nucleotide level and at the motif level yields similar results. That is to say, the performance of MHABBO does not worsen as the dimension of the problem increases. The algorithm can obtain more significant motifs. The results also show that, in terms of convergence, the algorithm performs better on higher organisms than on simpler organisms.

Conclusions and Future Research
Since multi-objective biogeography-based optimization has not previously been applied to the multi-objective motif discovery problem, we propose a hybrid multi-objective optimization algorithm named MHABBO to solve the three-objective motif discovery problem on the basis of our previous research work. Compared with the existing methods, the proposed algorithm has the following advantages. Firstly, the redefinition of fitness evaluation based on MOEA simplifies the multi-objective optimization problem and uses the Pareto dominance relationship to preserve population diversity. Secondly, the modified migration operations speed up the convergence of the algorithm, and the mutation is performed by integrating with DE to produce new feasible solutions, so that population diversity is maintained. Finally, the robustness of the algorithm is enhanced by adaptively changing the parameters of the BBO algorithm.
Statistical comparisons with some typical existing approaches on several commonly used datasets are provided. The main work of this paper is as follows. Firstly, the motif instances obtained by the MHABBO algorithm on three commonly used datasets are compared with those of five other algorithms. Secondly, according to the PWMs corresponding to the motif instances obtained on twelve commonly used datasets, the logos of the motif instances are generated using the online WebLogo software. Thirdly, the Pareto fronts of the motif instances obtained on twelve commonly used datasets are drawn according to the three objectives of the motif discovery problem. Finally, based on the nPC and F-score measures, the new algorithm is compared with other classical algorithms.
The experiments indicate that the MHABBO algorithm outperforms the other algorithms on the hm03r and mus02r datasets. The Pareto fronts obtained by MHABBO demonstrate that the proposed algorithm is competitive in terms of both convergence to the Pareto-optimal front and the distribution of the final solutions. They also show that, in terms of convergence, the algorithm performs better on higher organisms than on simpler organisms. These results demonstrate the validity and effectiveness of the proposed MHABBO algorithm for predicting motifs from DNA sequences.
In this paper, we mainly discuss the multi-objective motif discovery problem. In the future, we will continue to improve the multi-objective BBO algorithm, and we will try to combine NSGA-III or MOEA/D with the BBO algorithm for the motif discovery problem. Additionally, in our earlier work, we discussed the portfolio optimization problem with second-order stochastic dominance constraints based on the BBO algorithm [46], and we will try to apply the multi-objective BBO algorithm to the multi-objective portfolio optimization problem [47,48].
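The Pareto dominance relationship mentioned above can be sketched as follows. This is a generic illustration rather than the redefined fitness evaluation of MHABBO: it assumes that all three objectives are maximized, and the function names and example objective vectors are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one.

    Assumes every objective is to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(points):
    """Return the points that no other point dominates (an approximation of the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical three-objective vectors for four candidate motifs.
candidates = [(10, 5, 0.9), (12, 4, 0.8), (10, 5, 0.8), (8, 3, 0.7)]
front = non_dominated(candidates)
```

Here the third and fourth candidates are each dominated by the first one, while the first two are mutually non-dominated and both remain on the front.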

Figure 1. Representation of the Pareto fronts obtained by MHABBO: (a) the distribution of solutions on the DM01g dataset; (b) the distribution of solutions on DM04g; (c) the distribution of solutions on DM05g; (d) the distribution of solutions on HM03r; (e) the distribution of solutions on HM04r; (f) the distribution of solutions on HM16g; (g) the distribution of solutions on MUS02r; (h) the distribution of solutions on MUS07g; (i) the distribution of solutions on MUS11m; (j) the distribution of solutions on YST03m; (k) the distribution of solutions on YST04r; (l) the distribution of solutions on YST08r.

Table 1. Position Weight Matrix for a motif.

Table 2. The representation of an individual.

Algorithm 1: Migration for the MDP (MigrationDo(H, P_modif))
Input: Initial population H and migration probability P_modif
Output: The population H that has been optimized by migration
For i = 1 to NP // NP is the size of the population
    If rand < P_modif
        Use λ_i to probabilistically decide whether to immigrate to H_i
        If rand(0, 1)

Table 3. The properties of the benchmark datasets.

Table 5. Comparisons of the predicted motif with different methods for yst04r.

Table 6. Comparisons of the predicted motif with different methods for yst08r.

Table 7. Comparisons of the predicted motif with different methods for hm03r.

Table 8. The consensus motif predicted by MHABBO.

Table 10. Comparisons of MHABBO with other methods on the three datasets: average results (precision (P), recall (R) and F-score (F)).