Solving the Longest Common Subsequence Problem Concerning Non-uniform Distributions of Letters in Input Strings

Abstract: The longest common subsequence (LCS) problem is a prominent NP-hard optimization problem in which, given an arbitrary set of input strings, the aim is to find a longest subsequence that is common to all input strings. This problem has a variety of applications in bioinformatics, molecular biology, and file plagiarism checking, among others. All previous approaches from the literature are dedicated to solving LCS instances sampled from uniform or near-to-uniform probability distributions of letters in the input strings. In this paper we introduce an approach that is able to effectively deal with more general cases, where the occurrence of letters in the input strings follows a non-uniform distribution such as, for example, a multinomial distribution. Texts in any spoken language, for example, are well approximated by multinomial distributions. The proposed approach makes use of beam search, guided by a novel heuristic function named GMPSUM. This heuristic combines two complementary scores in the form of a convex combination: the first performs well in the uniform case, while the second works well in the non-uniform case. Furthermore, we introduce a time-restricted beam search algorithm that is able to adapt the beam size during execution in order to achieve a desired target runtime. Apart from benchmark sets from the related literature, in which the distribution of letters is close to uniform, we introduce three new benchmark sets that differ in terms of their statistical properties. One of these benchmark sets concerns a case study in the context of text analysis. We provide a comprehensive empirical evaluation in two distinct settings: (1) short-time executions with fixed beam size, in order to evaluate the guidance abilities of the compared search heuristics, and (2) long-time executions with fixed target duration times, in order to obtain high-quality solutions. In both settings, the newly proposed approach performs comparably to state-of-the-art techniques in the context of close-to-random instances, and outperforms state-of-the-art approaches for non-uniform instances.


Introduction
In the field of bioinformatics, strings are commonly used to model sequences such as DNA, RNA, and protein molecules, or even time series.

In existing benchmark sets, the input strings are generated in such a way that they may exhibit high similarity, but the letters' frequencies are still similar. In practical applications, this assumption of a uniform or close-to-uniform distribution of letters does not need to hold. Some letters may occur substantially more frequently than others. For example, if we are concerned with finding motifs in sentences of a spoken language, each letter has its characteristic frequency [22]. Text in natural languages can be modeled by a multinomial distribution over the letters. The required level of model adaptation can vary depending on distribution assumptions such as the letter dependence of a particular language. Also, letter frequencies in a language can differ depending on the text type (e.g., poetry, fiction, scientific documents, business documents).
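As a quick illustration (our own toy example, not taken from the paper), the multinomial letter model of a text can be estimated from normalized letter counts; the function name is ours:

```python
from collections import Counter

def letter_distribution(text):
    """Estimate multinomial letter probabilities p_a from a text sample by
    normalizing the letter counts (non-letter characters are ignored)."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values())
    return {a: counts[a] / total for a in sorted(counts)}
```

For English text, such estimates reproduce the well-known skew of letter frequencies (e.g., 'e' far more frequent than 'z').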

Motivated by these considerations, we develop in the following a new BS-based algorithm which is able to more effectively tackle instances with different string distributions.

The novel guidance heuristic applied at the core of this BS can be used as a credible and simplified replacement of the so-far leading approximate expected length calculation.

Additional advantages are that the novel heuristic is easier to implement than the approximate expected length calculation (which required a Taylor series expansion and a divide-and-conquer approach for an efficient implementation) and that there are no issues with numerical stability.

The main contributions of this article are as follows.

• We propose a novel search guidance for BS which performs competitively on the standard LCS benchmark sets known from the literature and in some cases even produces new state-of-the-art results.

In the following we introduce some commonly used notation before giving an overview of the remainder of this article.

Theorem 1. Let r be a given string and let s be a random string whose letters are drawn from a multinomial distribution MN(p_1, . . ., p_η). Then P(s ≺ r) = 1 if |s| = 0, P(s ≺ r) = 0 if |s| > |r|, and otherwise

P(s ≺ r) = Σ_{a ∈ Σ : a occurs in r} p_a · P(s' ≺ r[θ_a + 1, |r|]),   (1)

where θ_a denotes the position of the first occurrence of letter a in r and s' is a random string of length |s| − 1 from the same distribution.

Proof. It is clear from the definition of a subsequence that the empty string is a subsequence of every string and that a string cannot be a subsequence of a shorter one. Therefore, the cases |s| = 0 and |s| > |r| are trivial. The remaining case (1 ≤ |s| ≤ |r|) follows from the law of total probability, conditioning on the first letter of s and matching it to its first occurrence in r.
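A direct memoized implementation of this recurrence might look as follows (a sketch based on our reconstruction of (1); the function name `subseq_probability` is our own):

```python
from functools import lru_cache

def subseq_probability(k, r, p):
    """Probability that a random string s of length k, with letters drawn
    i.i.d. from the distribution p (dict: letter -> probability), is a
    subsequence of the given string r.  The first letter s[1] = a is matched
    greedily to the first occurrence of a in the remaining part of r."""
    @lru_cache(maxsize=None)
    def f(k, start):          # P(random string of length k ≺ r[start:])
        if k == 0:
            return 1.0        # the empty string is a subsequence of every string
        if k > len(r) - start:
            return 0.0        # a string cannot be a subsequence of a shorter one
        total = 0.0
        for a, p_a in p.items():
            pos = r.find(a, start)   # first occurrence of a in r[start:]
            if pos != -1:            # letters absent from r[start:] contribute 0
                total += p_a * f(k - 1, pos + 1)
        return total
    return f(k, 0)
```

For example, over the uniform binary distribution, a random string of length 2 is a subsequence of r = "ab" only if it equals "ab" itself, i.e., with probability 0.25.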

The probability P(s ≺ r) in recurrence relation (1) depends not only on the length of string r, but also on the letter distribution of this string. Therefore, it is hard to come up with a closed-form expression for the general case of a multinomial distribution MN(p_1, . . ., p_η). One way to deal with this problem is to consider special cases of the multinomial distribution for which closed-form expressions may be obtained.

Technical Report AC-TR-2021-009

The most frequently considered form of the multinomial distribution in the literature is the uniform distribution. Since in this case every letter has the same occurrence probability, the probability P(s ≺ r) in recurrence relation (1) depends only on the lengths k = |s| and l = |r| and can be written more simply as P(k, l). This case is covered by Mousavi and Tabataba in [12], where recurrence relation (1) reduces to

P(k, l) = (1/η) · P(k − 1, l − 1) + (1 − 1/η) · P(k, l − 1),   (2)

with P(0, l) = 1 and P(k, l) = 0 for k > l. The probabilities P(k, l) can be calculated using dynamic programming.

Let one letter a_j ∈ Σ have occurrence probability p ∈ (0, 1), p ≠ 1/η, and let each other letter a_i, i ∈ {1, . . ., η} \ {j}, have occurrence probability (1 − p)/(η − 1). For this multinomial distribution, recurrence relation (1) reduces to a form (3) that, besides the lengths |s| and |r|, depends only on whether or not each letter of r is equal to a_j.

We now further generalize the previous case. Let {Σ_1, Σ_2} be a partitioning of the alphabet Σ, i.e., let Σ_1, Σ_2 ⊆ Σ be nonempty sets such that Σ = Σ_1 ∪ Σ_2 and Σ_1 ∩ Σ_2 = ∅. Let us assume that every letter in Σ_1 has the same occurrence probability and, likewise, that every letter in Σ_2 has the same occurrence probability, where p ∈ (0, 1) is the total probability mass assigned to the set Σ_1. For this multinomial distribution, recurrence relation (1) reduces to a form (4) that depends only on whether or not each letter of r belongs to the set Σ_1.

Theorem 2. Let r and s be independent random strings chosen from the same multinomial distribution MN(p_1, . .
., p_η). Then P(s ≺ r) = 1 if |s| = 0, P(s ≺ r) = 0 if |s| > |r|, and otherwise P(s ≺ r) depends only on the lengths k = |s| and l = |r|.

Proof. The first two cases are trivial, so it remains to show the last case. Using the law of total probability (conditioning on the first letters of s and r), we obtain the desired recurrence; the arising conditional probabilities can be calculated with another application of the law of total probability, using the assumption that the random strings s and r are mutually independent.

Except for the obvious dependency on the multinomial distribution MN(p_1, . . ., p_η).

The state graph for the LCS problem that is used by all BS variants is already well known in the literature; see for example [16,29]. It is defined as a directed acyclic graph whose nodes v correspond to classes of partial solutions that 1. have the same length l_v; 2. induce the same subproblem, denoted by S[θ_v], w.r.t. the position vector θ_v.
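For the uniform special case above, the table of probabilities P(k, l) can be precomputed by dynamic programming, following the recurrence of Mousavi and Tabataba [12] (a minimal sketch; the function name is ours):

```python
def uniform_subseq_table(eta, k_max, l_max):
    """P[k][l]: probability that a uniformly random string of length k over an
    alphabet of size eta is a subsequence of a string of length l, via
    P(k, l) = (1/eta) * P(k-1, l-1) + (1 - 1/eta) * P(k, l-1)."""
    P = [[0.0] * (l_max + 1) for _ in range(k_max + 1)]
    for l in range(l_max + 1):
        P[0][l] = 1.0                  # the empty string: probability 1
    q = 1.0 / eta
    for k in range(1, k_max + 1):
        for l in range(1, l_max + 1):  # entries with k > l remain 0.0
            P[k][l] = q * P[k - 1][l - 1] + (1.0 - q) * P[k][l - 1]
    return P
```

The table is filled once in O(k_max · l_max) time and then queried in constant time during the search.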

We say that a partial solution s induces the subproblem S[θ], where θ_i − 1, i = 1, . . ., m, is the length of the smallest prefix of s_i among all prefixes that have s as a subsequence. An arc labeled by a letter a connects two nodes v_1 and v_2 if the partial solution that induces v_2 is obtained by appending a to the partial solution that induces v_1.

The root node r = ((1, . . ., 1), 0) of G refers to the original LCS problem on the input string set S and can be said to be induced by the empty partial solution ε.

For deriving the successor nodes of a node v ∈ V, we first determine the subset Σ_v ⊆ Σ of letters that feasibly extend the partial solutions represented by v. The candidates a ∈ Σ_v are all letters a ∈ Σ that appear at least once in each string of the subproblem given by the strings S[θ_v]. This set Σ_v may be reduced by determining and discarding dominated letters. We say that letter a ∈ Σ_v dominates letter b ∈ Σ_v if θ_{i,a} ≤ θ_{i,b} for all i = 1, . . ., m. Dominated letters can be safely omitted since they lead to suboptimal solutions. Let

Σ_v^nd ⊆ Σ_v be the set of feasible and non-dominated letters. For each letter a ∈ Σ_v^nd, the GM score considers the vector (|s_1[θ_{v,1}, |s_1|]|_a, . . ., |s_m[θ_{v,m}, |s_m|]|_a), i.e., the vector indicating, for each remaining string of the respective subproblem, the number of occurrences of letter a ∈ Σ, while μ_g(·) and σ_g(·) denote the geometric mean and geometric standard deviation, respectively, which are calculated for x = (x_1, . . ., x_m) ∈ R^m by

μ_g(x) = (∏_{i=1}^m x_i)^{1/m},  σ_g(x) = exp( ( (1/m) Σ_{i=1}^m (ln x_i − ln μ_g(x))² )^{1/2} ).

Function UB_1(v) in expression (7) is the known upper bound on the length of an LCS for the subproblem represented by node v from [30], calculated as

UB_1(v) = Σ_{a∈Σ} min_{i=1,...,m} |s_i[θ_{v,i}, |s_i|]|_a.

Overall, the GM score is thus a weighted average of the adjusted geometric means

The GM score is meaningful only if the underlying sample geometric mean and standard deviation are based on a sample of sufficient size. In all our experiments, the minimal number of input strings is therefore ten. Working on samples of smaller sizes would likely make the GM score less useful.
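Since expression (7) is only partially legible here, the following sketch reflects our reading of the GM score: per letter, an adjusted geometric mean μ_g/σ_g of the occurrence counts, weighted by the normalized minimal occurrence count (all names are ours; counts are assumed to be ≥ 1 so that the logarithms are defined):

```python
import math

def geo_mean(xs):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geo_std(xs):
    """Geometric standard deviation of positive numbers."""
    mu = math.log(geo_mean(xs))
    return math.exp(math.sqrt(sum((math.log(x) - mu) ** 2 for x in xs) / len(xs)))

def gm_score(occ):
    """occ: dict letter -> list with the occurrence count of that letter in the
    remaining suffix of each input string (one entry per string, all >= 1).
    Each letter contributes mu_g/sigma_g, weighted by its normalized minimal
    occurrence count across the strings (our reconstruction of (7))."""
    min_occ = {a: min(c) for a, c in occ.items()}
    total = sum(min_occ.values())
    return sum((min_occ[a] / total) * geo_mean(c) / geo_std(c)
               for a, c in occ.items())
```

Letters with consistently high counts and low dispersion contribute most, matching the three-fold motivation given in the text.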

In addition to the GM score, we consider the PSUM score, which is calculated by summing, over all candidate lengths k, the probability that a random string of length k is a common subsequence of all remaining input strings. Unlike the GM score, which considers mostly general aspects of an underlying probability distribution, PSUM better captures more specific relations among the input strings.

It represents the sum of probabilities that a string of length k will be a common subse-
The exact length of the resulting subsequence is not known in advance. Note that in the case of the HP heuristic proposed in [12], the authors heuristically determine an appropriate value of k for each level of the BS. Finally, the total GMPSUM score is calculated by the linear combination GMPSUM(v) = λ · GM(v) + (1 − λ) · PSUM(v), where λ ∈ [0, 1]
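Expression (8) is likewise only partially legible, so the following is our reconstruction of the PSUM idea: sum, over all candidate lengths k up to the shortest remaining suffix, the probability that a random string of length k is a subsequence of every remaining suffix (treating the strings as independent), and combine with GM via λ:

```python
def psum_score(suffix_lengths, P):
    """PSUM sketch: P[k][l] is a precomputed subsequence-probability matrix
    (e.g., for the uniform case); independence across the m input strings is
    assumed, so per-string probabilities are multiplied (our reconstruction
    of expression (8))."""
    l_max = min(suffix_lengths)            # shortest remaining suffix
    total = 0.0
    for k in range(1, l_max + 1):
        prob = 1.0
        for l in suffix_lengths:
            prob *= P[k][l]
        total += prob
    return total

def gmpsum(gm, psum, lam):
    """Convex combination GMPSUM = lam * GM + (1 - lam) * PSUM, lam in [0, 1]."""
    return lam * gm + (1.0 - lam) * psum
```

With λ = 1 only the GM component is used, with λ = 0 only PSUM, matching the parameter study reported in the experiments.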

5. Depending on the discrepancy between the actual and expected remaining time, we possibly increase or decrease the beam width for the next level. In this adaptive scheme, the thresholds on this discrepancy for increasing or decreasing the beam width, as well as the factor by which the beam width is modified, were tuned empirically.
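This adaptation step can be sketched as follows; since the concrete thresholds and modification factors are not given in this excerpt, the values below are illustrative placeholders only:

```python
def adapt_beam_width(beta, t_rem, t_exp, tol=0.05, factor=1.1, beta_min=1):
    """Adjust the beam width for the next BS level from the discrepancy between
    the actual remaining time t_rem and the expected remaining time t_exp at
    the current width.  Thresholds and factors here are illustrative, not the
    empirically tuned values of the paper."""
    if t_exp > t_rem * (1.0 + tol):        # running late: shrink the beam
        return max(beta_min, int(beta / factor))
    if t_exp < t_rem * (1.0 - tol):        # ahead of schedule: grow the beam
        return int(beta * factor) + 1
    return beta                            # within tolerance: keep the width
```

Shrinking when behind schedule and growing when ahead drives the total running time toward the target t_max.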

In this section we evaluate our algorithms and compare them with the state-of-the-art algorithms from the literature. The input strings (of benchmark set ES) vary in length from 1000 to 5000, while alphabet sizes range from 2 to 100.

This set consists of 12 groups of instances.

• Benchmark set BB, introduced in [34], differs from the others because the input strings of each instance are generated in a way that ensures high similarity between them. For this purpose, a random base string was first generated.

Second, all input strings were generated from the base string by probabilistically introducing small mutations, such as deletion/update operations on each letter. This set consists of eight groups (each containing 10 single instances).

• Benchmark set BACTERIA, introduced in [35], is a real-world benchmark set used in the context of the constrained longest common subsequence problem. We make use of these instances by simply ignoring all pattern strings (constraints). This set consists of 35 single instances.

• Finally, we introduce two new sets of instances:

-
The input strings of the instances of benchmark set POLY are generated in such a way that the numbers of occurrences of the letters in each input string are determined by a multinomial distribution with known probabilities p_1, . . ., p_η > 0, such that Σ_i p_i = 1; see [36] for how to sample such distributions. More specifically, we used the multinomial distribution with
proposed for this purpose in the literature: EX [16], POW [13], and HP [12]. The four resulting BS variants are labeled BS-GMPSUM, BS-EX, BS-POW, and BS-HP, respectively.

These four BS variants were applied with the same parameter settings (β = 600 and k_filter = 100) in the short-run scenario in order to ensure that all of them use the same amount of resources.

In the long-run scenario, we tested the proposed time-restricted BS (TRBS) guided by the novel GMPSUM heuristic, which is henceforth labeled TRBS-GMPSUM. Our algorithm was compared to the current state-of-the-art approach from the literature:

A*+ACS [29]. These two algorithms were compared in the following way:

• Concerning A*+ACS, the results for benchmark sets RANDOM, VIRUS, RAT, ES and BB were taken from the original paper [29]. They were obtained with a computation time limit of 900 seconds per run. For the new benchmark sets, that is, POLY and BACTERIA, we applied the original implementation of A*+ACS with a time limit of 900 seconds on the above-mentioned machine.

• TRBS-GMPSUM was applied with a computation time limit of 600 seconds per run to all instances of benchmark sets RANDOM, VIRUS, RAT, ES and BB. Note that we reduced the computation time limit used in [29] by 50% because the CPU of our computer is faster than the one used in [29]. In contrast, the time limit for the new

More specifically, the results of the short-run scenarios are summarized in Table 1, while those for the long-run scenarios are given in Table 2. Table 1 displays the results
The tables reporting the new state-of-the-art results are organized as follows. The first column contains the name of the corresponding benchmark set, while the

• We introduce two new LCS benchmark sets based on multinomial distributions, whose main property is that letters occur with different frequencies. The proposed new BS variant excels on these instances in comparison to previous solution approaches.
• A time-restricted BS version is described. It automatically adapts the beam width over the BS levels w.r.t. given time restrictions such that the overall running time of BS approximately fits a desired target time limit. A tuning of the beam width to achieve comparable running times among different algorithms is hereby avoided.

1.1. Preliminaries
By S we always refer to the set of m input strings, i.e., S = {s_1, . . ., s_m}, m ≥ 1. The length of a string s is denoted by |s|, and its i-th letter, i ∈ {1, . . ., |s|}, is referred to by s[i]. Let n refer to the length of a longest string and n_min to the length of a shortest string in S. A contiguous subsequence (substring) of string s that starts with the letter at index i and ends with the letter at index j is denoted by s[i, j]; if i > j, this refers to the empty string ε. The number of occurrences of a letter a ∈ Σ in string s is denoted by |s|_a. For a subset A ⊆ Σ of the alphabet, the number of appearances of the letters from A in s is denoted by |s|_A. For an m-dimensional integer vector θ ∈ N^m and the set of strings S, we define the set of suffix strings S[θ] = {s_1[θ_1, |s_1|], . . ., s_m[θ_m, |s_m|]},
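The notation above translates directly into code (a small illustration with our own function names; positions are 1-based as in the paper):

```python
def suffix_strings(S, theta):
    """S[theta]: the suffix strings s_i[theta_i, |s_i|] inducing a subproblem."""
    return [s[t - 1:] for s, t in zip(S, theta)]

def occ(s, a):
    """|s|_a: the number of occurrences of letter a in string s."""
    return s.count(a)
```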

2.4. The Case of Independent Random Strings
Another approach to calculating the probability that a string s is a subsequence of a string r is based on the assumption that both s and r are random strings chosen from the same multinomial distribution and are independent as random vectors. Using this setup, we establish a recurrence relation for calculating the probability P(s ≺ r).

Figure 1. State graph for the LCS problem instance on strings {s_1 = bcaacbdba, s_2 = cbccadcbbd, s_3 = bbccabcdbba} and alphabet Σ = {a, b, c, d}. Light-gray nodes are non-extensible goal nodes. The longest path in this state graph is shown in blue; it leads from the root to node ((1, 10, 11), 6) and corresponds to the solution s = bcacbb, having length six.

(μ_g(·)/σ_g(·)) of the numbers of letter occurrences, where the weight of each letter is determined by normalizing the minimal number of occurrences of the letter across all strings by the sum of the minimal numbers of occurrences across all letters. The motivation behind this calculation is three-fold:
1. Letters with higher average numbers of occurrences across the strings increase the chance of finding a longer common subsequence (composed of these letters).

2. Higher deviations around the mean naturally reduce this chance.
3. The minimal number of occurrences of a letter across all input strings is an upper bound on the length of the common subsequences that can be formed by this single letter. Therefore, by normalizing it by the sum of all minimal letter occurrences, the impact of each letter on the overall summation is quantified.

quence for all remaining input strings relevant for further extensions. Index k ranges from one to l_max(v), i.e., from the length of the shortest possible non-empty subsequence up to the length of the longest possible one, which corresponds to the length of the shortest input string residual. The motivation behind using a simple (non-weighted) summation across all potential subsequence lengths is three-fold:

2. The summation across all k provides insight into the overall potential of node v, approximating the integral of the respective continuous function. Note that this measure is not required to have an interpretation in absolute terms, since throughout the BS it is used strictly to compare alternative extensions on the same level of the BS tree.
3. An approach that assigns different weights to the different k values would impose the challenge of deciding on these specific weights. This would bring us back to the difficult task of expected length prediction, which would be particularly hard when considering an arbitrary multinomial distribution.

The proposed algorithms are implemented in C# and executed on machines with Intel i9-9900KF CPUs @ 3.6 GHz and 64 GB of RAM under the Microsoft Windows 10 Pro OS. Each experiment was performed in single-threaded mode. We have conducted two types of experiments:
• Short runs: limited-time scenarios, that is, BS configurations with β = 600, executed in order to evaluate the quality of the guidance of each of the heuristics towards promising regions of the search space.
• Long runs: fixed-duration scenarios (900 seconds) in which we compare the time-restricted BS guided by the GMPSUM heuristic with the state-of-the-art results from the literature. The purpose of these experiments is the identification of new state-of-the-art solutions, if any.

4.1. Benchmark sets
All relevant benchmark sets from the literature were considered in our experiments:
• Benchmark sets RAT, VIRUS and RANDOM, each consisting of 20 single instances, are well known from the related literature [32]. The first two sets are biologically motivated, originating from the NCBI database. In the case of the third set, the instances were randomly generated. The input strings in these sets are 600 characters long, and the sets contain instances based on alphabets of sizes 4 and 20.
• Benchmark set ES, introduced in [33], consists of randomly distributed input strings

2 i-
for generating the input strings. The numbers of occurrences of the different letters are heavily unbalanced in the obtained input strings. This set consists of 10 instances for each combination of input string length n ∈ {100, 500, 1000} and number of input strings m ∈ {10, 50}, which makes a total of 60 problem instances.
- Benchmark set ABSTRACT, which will be introduced in Section 5, is a real-world benchmark set whose input strings are characterized by close-to-polynomial distributions of the different letters. The input strings originate from abstracts of scientific papers written in English.
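Instances of this kind can be generated along the following lines; since the concrete letter probabilities of the POLY set are not legible in this excerpt, the distribution below is an arbitrary illustrative choice:

```python
import random

def generate_instance(m, n, alphabet, probs, seed=0):
    """Generate m input strings of length n, drawing each letter i.i.d. from
    the multinomial distribution (alphabet, probs).  The probabilities used
    here are placeholders, not those of the POLY benchmark set."""
    rng = random.Random(seed)
    return ["".join(rng.choices(alphabet, weights=probs, k=n))
            for _ in range(m)]

S = generate_instance(m=10, n=100, alphabet="abcd", probs=[0.5, 0.25, 0.15, 0.1])
```

Skewed weights such as these reproduce the heavily unbalanced letter occurrences that characterize the non-uniform benchmark sets.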

4.2. Considered algorithms
All considered algorithms make use of the state-of-the-art BS component. In order to test the quality of the newly proposed GMPSUM heuristic for the evaluation of the partial solutions at each step of BS, we compare it to the other heuristic functions that were

instances was set to 900 seconds. Regarding restricted filtering, the same setting (k_filter = 100) as for the short-run experiments was used. Regarding the GMPSUM parameter λ, we performed short-run evaluations across a discrete set of possible values: λ ∈ {0, 0.25, 0.5, 0.75, 1}. The conclusion was that the best performing values are λ = 0 for BB, λ = 0.5 for VIRUS and BACTERIA, λ = 0.75 for RANDOM, RAT and POLY, and λ = 1 for ES. The same settings for λ were used in the context of the long-run experiments.

4.3. Summary of the results
Before studying the results for each benchmark set in detail, we present a summary of the results in order to provide the reader with a broad picture of the comparison.

in a way such that each line corresponds to a single benchmark set. The meaning of the columns is as follows: the first column contains the name of the benchmark set, while the second column provides the number of instances (respectively, instance groups) in the set. Then there are four blocks of columns, one for each considered BS variant. The first column of each block shows the obtained average solution quality (|s|) over all instances of the benchmark set. The second column indicates the number of instances (respectively, instance groups) for which the respective BS variant achieves the best result (#b.). Finally, the third column provides the average running time (t) in seconds over all instances of the considered benchmark set.
The following conclusions can be drawn:

A*+ACS and TRBS-GMPSUM in terms of the average solution quality over all instances of the respective benchmark set (|s|), and the number of instances (or instance groups) for which the respective algorithm achieved the best result (#b.). The following can be concluded based on the results obtained for the long-run scenarios:
• Concerning RANDOM and ES, A*+ACS is, as expected, slightly better than TRBS-GMPSUM in terms of the number of best results achieved. However, when comparing the average performance, there is hardly any difference between the two

following three columns identify the respective instance (in the case of RAT and VIRUS), respectively the instance group (in the case of BB and ES). Afterwards, there are two columns that provide the best result known from the literature. The first of these columns provides the result, and the second column indicates the algorithm (together with the reference) that was the first to achieve this result. Next, the tables provide the results of BS-EX, BS-POW, BS-HP and BS-GMPSUM in the case of the short-run scenario,
s_1[θ_1, |s_1|], . . ., s_m[θ_m, |s_m|]}, which induce a respective LCS subproblem. For each letter a ∈ Σ, the position of the first occurrence of a in s_i[θ_i, |s_i|] is denoted by θ_{i,a}, i = 1, . . ., m. Last but not least, if a string s is a subsequence of a given string r, we write s ≺ r.
1.2. Overview
This article is organized as follows. Section 2 provides theoretical aspects concerning the calculation of the probability that a given string is a subsequence of a random string chosen from a multinomial distribution. Section 3 describes the BS framework
otherwise. Note that, besides the lengths |s| and |r|, (3) depends only on whether or not a letter in the string r is equal to a_j.
2.3. Multinomial Distribution - Special Case 3: Two Sets of Letters

3. Beam Search for Multinomially Distributed LCS Instances
This allows us to pre-compute a probability matrix for all relevant values of k = |s| and l = |r| by means of dynamic programming.
All feasible extensions of the nodes in the current beam are collected in the set of extensions V_ext. Note that for some problems, efficient filtering techniques can be applied to discard nodes from V_ext that are dominated by other nodes, i.e., nodes that cannot yield better solutions. This is controlled by an internal parameter k_filter. The (possibly filtered) set of extensions is then sorted according to the nodes' values obtained from the guidance heuristic h, and the top β nodes (or fewer if V_ext is smaller) form the beam B of the next level. The whole process is repeated level by level until B becomes empty. In general, to solve a combinatorial optimization problem, information about the longest (or shortest) path from the root node to a feasible goal node is kept in order to finally return a solution that maximizes or minimizes the problem's objective function. The pseudocode of such a general BS is given in Algorithm 1.
Algorithm 1 Beam Search
1: Input: a problem instance, guidance heuristic h, β > 0, k_filter
2: Output: a heuristic solution
3: B ← {r}
4: while B ≠ ∅ do
...
Per level, O(β · k_filter · m) time is required, which gives O(n_min · β · k_filter · m) total time for
In this section we extend the basic BS from Algorithm 1 to a time-restricted beam search (TRBS). This BS variant is motivated by the desire to compare different algorithms under the same time limit. The core idea is to dynamically adapt the beam width in dependence of the progress over the levels. Similarly to the standard BS from Algorithm 1, TRBS is parameterized with the problem instance to solve, the guidance heuristic h, and the filtering parameter k_filter. If t_max < +∞, i.e., the time limit is actually enabled, the beam width for the next level is
where λ ∈ [0, 1] is a strategy parameter. Based on an empirical study with different benchmark instances and values for parameter λ, we came up with the following rules of thumb for selecting λ:
1. Since GM and PSUM have a complementary focus, i.e., they capture and reward (or implicitly penalize) different aspects of the extension potential, their combined usage is indeed meaningful in most cases, i.e., 0 < λ < 1.
2. GM tends to be the better indicator when instances are more regular, i.e., when each input string fits the overall string distribution well.
3. PSUM tends to perform better when instances are less regular, i.e., when the input strings are more dispersed around the overall string distribution.
Regarding the computational costs of the GMPSUM calculation, the GM score calculation requires O(|Σ| · m) time. This can be concluded from (7), where the most expensive part is the iteration through all letters from Σ and finding the minimal number of occurrences of each letter across all m input strings (μ_g(·) and σ_g(·) have the same time complexity). Note that the number of occurrences of each letter across all possible suffixes of all m input strings is calculated in advance, before starting the beam search, and stored in an appropriate three-dimensional array; see [29]. The worst-case computational complexity of this step is O(|Σ| · m · n_max), because the number of occurrences of a given letter across all positions of a given input string can be determined in a single linear pass. Since this is done only once at the start, and the expected number of GM calls is much higher than n_max, this up-front calculation can be neglected in the overall computational complexity. The PSUM score given by (8) takes O(n_min · m) time to calculate, due to the definition of l_max(·). Similarly to GM, the calculation of the matrix P is performed in pre-processing; its computational complexity corresponds to the number of matrix entries, i.e., O(n_max · n_max); see (5).
Finally, the total computational complexity of GMPSUM amounts to O((|Σ| + n_min) · m). The total computational complexity of the beam search is therefore the product of the number of GMPSUM calls, O(n_min · β · |Σ|), and the time complexity of executing the filtering within the BS. According to this, the BS guided by GMPSUM and
utilizing (restricted) filtering requires O(n_min · β · m · (k_filter + |Σ|² + |Σ| · n_min)) time.

LB(v) = max_{a∈Σ} min_{i=1,...,m} |(S[θ_v])_i|_a.   (10)

Thus, for each node v ∈ V_ext and each letter a, we consider the minimal number of occurrences of the letter across all string suffixes S[θ_v] and select the letter for which this number is maximal. In other words, this LCS lower bound is based on considering all common subsequences in which a single letter is repeated as often as possible. In the literature, this procedure is known under the name Long-run [31] and provides a |Σ|-approximation.
3. Let t_rem be the actual time still remaining in order to finish at time t_max.
4. Let t'_rem = t_iter · LB_max(V_ext) be the expected remaining time if we continued with the current beam width and the time spent on each level stayed the same as measured for the current level.
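The Long-run lower bound used above can be sketched as follows (the function name is ours):

```python
def longrun_lower_bound(suffixes, alphabet):
    """Long-run LCS lower bound [31]: the length of the best common
    subsequence that repeats a single letter, i.e. the maximum over letters a
    of the minimal count of a across all remaining string suffixes."""
    return max(min(s.count(a) for s in suffixes) for a in alphabet)
```

Because the resulting subsequence uses one letter only, the bound is a |Σ|-approximation of the true LCS length.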

Table 2: Long-run results summary.
• In the case of the quasi-random instances of the benchmark sets VIRUS and RAT, BS-GMPSUM starts to show its strength by delivering the best solution qualities in 31 out of 40 cases. The second best variant is BS-EX, which still performs very well and achieves the best solution qualities in 24 out of 40 cases.
• For the special BB benchmark set, in which the input strings were generated so as to be similar to each other, GMPSUM turns out to perform comparably to the best variant, BS-POW.
• Concerning the real-world benchmark set BACTERIA, BS-GMPSUM delivers the best results for 18 out of 35 groups, which is slightly inferior to the BS-HP variant with 22 best performances, and superior to the variants BS-EX (12 cases) and BS-POW (15 cases). Concerning the average solution quality obtained for this benchmark set, BS-GMPSUM delivers the best one among all considered approaches.
• Concerning the non-uniformly distributed multinomial benchmark set POLY, BS-GMPSUM clearly outperforms all other considered BS variants. In fact, BS-GMPSUM finds the best solutions for all 6 instance groups. Moreover, it beats the other approaches in terms of the average solution quality.
• Overall, BS-GMPSUM finds the best solutions for 80 (out of 121) instances or instance groups, respectively. The second best variant is BS-EX, which achieves the best performance in 62 cases. In contrast, BS-HP and BS-POW are clearly inferior to the other two approaches. We conclude that BS-GMPSUM performs well in the context of different letter distributions in the input strings, and that it is worth trying this variant first when nothing is known about the distribution of the considered instance set.
• Overall, the running times of all four BS variants are comparable. The fastest one is BS-HP, while BS-GMPSUM requires somewhat more time than the others since it makes use of a heuristic function that combines two functions.

Table 2 provides a summary concerning the long-run scenarios, i.e., it compares the current state-of-the-art algorithm A*+ACS with TRBS-GMPSUM. As the benchmark instances are the same as in the short-run scenarios, the first two table columns are the same as in Table 1. Then there are two blocks of columns, presenting the results of

Table 3: New best results for the instances from the literature in the short-run scenario.

Table 4: New best results for the instances from the literature in the long-run scenario.