A Multi-Branch-and-Bound Binary Parallel Algorithm to Solve the Knapsack Problem 0–1 in a Multicore Cluster

Featured Application: An uncorrelated instance is equivalent to solving any problem where the beneﬁt is independent of the weight. A weakly correlated instance has a high correlation between the beneﬁt and the weight of each element. Typically, the beneﬁt di ﬀ ers from the weight by a small percentage. Such instances are the most practical in administration, such as with a return on an investment, which is generally proportional to the sum of the amount invested. Abstract: This paper presents a process that is based on sets of parts, where elements are ﬁxed and removed to form di ﬀ erent binary branch-and-bound (BB) trees, which in turn are used to build a parallel algorithm called “multi-BB”. These sequential and parallel algorithms calculate the exact solution for the 0–1 knapsack problem. The sequential algorithm solves the instances published by other researchers (and the proposals by Pisinger) to solve the not-so-complex (uncorrelated) class and some problems of the medium-complex (weakly correlated) class. The parallel algorithm solves the problems that cannot be solved with the sequential algorithm of the weakly correlated class in a cluster of multicore processors. The multi-branch-and-bound algorithms obtained parallel e ﬃ ciencies of approximately 75%, but in some cases, it was possible to obtain a superlinear speedup.


Introduction
The KP 0-1 (knapsack problem 0-1) has a wide variety of applications in everyday life, including production planning, financial modeling, project selection, the allocation of data in distributed systems, and facility capacity planning [1].
Due to the diversity of real problems, it is necessary to consider them through problem models, such as those proposed by Reference [2]. They proposed seven classes of problems, and we focused on solving two of them in this work: the uncorrelated and the weakly correlated. Uncorrelated problems model any problem in which the weights of the elements are not correlated with their benefit. The weakly correlated problem model offers many practical applications, such as capital budgeting, project selection, resource allocation, cutting stock, and investment decision-making [3]. In this way, if an algorithm performs well in solving these instances, it is likely to solve a problem in everyday life.
The characteristic of this problem is that its items cannot be split among themselves and is classified as an NP-hard problem, where NP, indicates non-polynomial behavior problems. For certain sizes of the instances of these problems, it is possible to calculate their optimal solution by means of algorithms and parallel computers. Therefore, a parallel BB (Branch and Bound) algorithm is proposed to calculate the exact solution of KP 0-1 for instances of medium and low complexity (weak and uncorrelated [2], respectively).
The novelty of this work is that using a different approach to a binary tree (from a formulation of a set of parts), we propose generating several different trees to find the optimal solution [4][5][6][7][8][9][10]. Each of these trees represents a decision tree that is generated by fixing or removing elements, so the roots of these trees are located in different search spaces with respect to the initial tree. Each binary tree forms new search spaces, and consequently there are a greater number of feasible solutions available. Because each decision tree is independent of the others, each of them is assigned in a processing unit for execution, and therefore it is possible to calculate the optimal solution more quickly and thus reduce the computation time, sometimes determining superlinear speedup as well as solving the instances that cannot be resolved with a sequential approach.
The factors involved in efficient parallel implementation are diverse: those corresponding to the algorithm and those inherent in the use of parallel computers. It is important to consider algorithms that are efficiently parallelizable, i.e., algorithms whose execution times are polylogarithmic and use a number of polynomial processors (both depending on the size of the input) [11]. However, in the parallel BB algorithms applied to solve the integer knapsack problem and its version, 0-1, additional variables were used to express complexity, such as the capacity of the knapsack c and the weights of the elements (the maximum w max and the minimum w min [12]). This indicates that these variables influence the complexity of the instance and, consequently, its execution time.
On the other hand, the factors that influence the performance of the parallel implementation of the BB are shared knowledge, knowledge use, division of work, and synchronization. These are considered independently of the computer architecture and the BB parallel algorithm to solve KP 0-1 [13].
In different parallel BB algorithms and their implementation in multicore computer architectures and GPUs (Graphics Processing Units), two points of view are considered: developing a parallel algorithm and the factors that influence that algorithm's implementation [14,15]. This paper proposes a way to search for the optimal solution from an algorithmic point of view, generating greater spaces with feasible solutions and their implementation in a multicore cluster.
In BB parallel models, a decision tree travels through the space of feasible solutions until reaching the optimum [14][15][16]. That is to say, collectively, a single tree is crossed until reaching the optimal solution. In our BB parallel model, we propose generating several different trees to find the optimal solution [4][5][6][7][8][9][10]. Each of these trees represents a decision that is generated by fixing or removing elements, so the roots of these trees are located in different search spaces with respect to the initial tree. The implementation of this algorithm was performed in the multicore cluster "Ioevolution", with 136 cores distributed among four servers.
The sequential algorithm solves all the uncorrelated instances, and our algorithm, the "multi-BB" solves all of the weakly correlated instances, with superlinear speedup found in some solutions.
The formulation of the KP 0-1 and related works are discussed in Section 2. In Section 3, we describe the hypotheses of our work and the procedure to generate the decision trees. The implementation and testing are done in Section 4, as is a discussion of the results. In Section 5, we present the conclusions of this paper.

Problem Formulation
The 0-1 knapsack problem has been among the most studied problems in the literature since 1897 [17]. The mathematician Tobias Dantzig named the problem in the early 20th century [18]. Its formulation is simple, but its solution turns out to be complex, since it is a problem that grows exponentially. Therefore, in the theory of complexity, KP 0-1 is classified as an NP problem [19].
The 0-1 knapsack problem arises as follows: where f opt = optimal function; p i = element benefit i, p i ∈ Z + ; w i = element weight i, w i ∈ Z + ; c = knapsack capacity, c ≤ n i=1 w i ; and x i = element x i ∈ {0, 1}. The formula in Equation (1) refers to maximizing the utility of the knapsack based on the sum of the benefits of each of its elements. This restriction indicates that it is necessary to take into account the sum of the weights of the items that will be stored in the knapsack, whose capacity is determined by the variable c.
The variable x i is "0" when the item is not included in the knapsack and "1" when the item is included.
The KP 0-1 consists of determining the elements that should be included in the knapsack, so the total utility of the selected elements is maximum, without exceeding the allowed capacity of the knapsack.

Related Works
The BB parallel algorithms used to solve KP 0-1 have been designed primarily for static computer architectures, such as rings, toroids, and hypercubes [12]. Subsequently, these algorithms have been implemented in dynamic configuration computers, multicore CPUs (Central Processing Units), and GPUs [14,15], where the focus is mainly on the four factors that influence implementation: shared knowledge, knowledge utilization, division of work, and synchronization [13].
The complexity of the BB parallel algorithms that solve the KP 0-1 is a function of the problem instance variables. For example, when the computer architecture is a hypercube, the complexity is given by [12,15] O nc p w max w min for a number of processors of p < c log w max . In Equation (3), one of the terms is w max w min . If this term grows, then the complexity increases, unlike the classification proposed by Martello et al. [20], where the complexity of KP 0-1 is given by the variability of the benefit with respect to the weights of the elements (p i and w i ). Martello et al. [20] classified the complexity of KP 0-1 into seven different instances grouped into three complexities based on the correlation between the p i benefit and the weight w i [2]. The equations for three of the seven representative instances of each of the complexity classes are shown below: weakly correlated : w i ∈ [1, R] and p i ∈ w i − 1 10 R, w i + 1 10 R such that p i ≥ 1; strongly correlated : w i ∈ [1, R] and p i = w i + 1 10 R; where p i = the benefit of element i; w i = element weight i; and R ∈ {1, 000, 10, 000}.
As observed in the previous equations, the complexity of the instance is not only given by the variability of the weights of the elements (w i ), because the elements are calculated in the same way in the three instances. Therefore, the increase in complexity is given by the variability of the benefit of the elements p i with respect to w i .
The other term of Equation (3), nc p , indicates that the capacity of the knapsack is important in determining the execution time of the algorithm. This coincides with the results of various researchers [2,14], who determined that the capacity of the knapsack affects complexity, so the tests are carried out by varying the capacity of the knapsack or half of the sum of the weights of the elements.
In recent implementations in multicore computer architectures and GPUs [14][15][16], the focus has been on the four factors identified by Trienekens and de Bruin [13], as well as on using the two search criteria in combination: the first is depth, and the second is breadth [15]. With depth, the tree levels are advanced, and with breadth, the volume of subproblems is generated to use massive computation. However, the search remains for a single decision tree. Once the volume of children is generated to use these architectures, scholars have focused on developing an algorithm that considers the following: • Shared knowledge: this consists of the list or lists of the subproblems or nodes that are stored for later calculation and divided in two (the type of storage of the subproblems (general or local) and how that list is updated); • Knowledge use: This is divided in two (the access strategy for the list of subproblems and whether that knowledge is shared with the results) and is updated immediately (the optimal solution and limits); • Division of work: This is formed by the active processes. In this way, the unit of work in each process and the load balance are determined; • Synchronization: This refers to the synchronization of each of the concurrent processes for data communication.

Considerations for the Elaboration of the Parallel Algorithm
Our model starts from a sequential algorithm that has been previously described in Reference [21], to which some improvements and modifications were made to elaborate on the parallel algorithm.
The search space for the discrete knapsack problem 0-1 is given by the power set of the elements of the initial problem. If these elements are in the set I = {1, 2, . . . , m}, then the number of subsets of the power set or parts is |P(I)| = 2 m . The power set can be represented using the terms of the empty set φ and the set I, whereby the power set can be denoted by P (m) = [φ, I]. If the fixed elements are in the set ω 1 , and the elements assigned to the node are in the set ω 2 , then P(m) can also be denoted by P(m) = [ω 1 , ω 2 ] [22,23].
An example of the initial search space of a KP 0-1 with five elements is shown in Figure 1. The search space is for set I = {1, 2, 3, 4, 5}.
This space is divided into two when an item is fixed or removed. In Figure 1, the line l 0 l 0 divides the search space in two by setting and removing element 1: the left side is the left child and the right side is the right child. As can be seen in Figure 1, the search space of the left child consists of all the subsets containing element 1. On the other hand, in the search space of the right child, no subset has element 1 as its element. These two spaces correspond to level 1 of the binary tree, as shown in Figure 2. This space is divided into two when an item is fixed or removed. In Figure 1, the line divides the search space in two by setting and removing element 1: the left side is the left child and the right side is the right child. As can be seen in Figure 1, the search space of the left child consists of all the subsets containing element 1. On the other hand, in the search space of the right child, no subset has element 1 as its element. These two spaces correspond to level 1 of the binary tree, as shown in Figure 2.
In turn, these search spaces are divided in two by setting and removing element 2. The line shows the spaces generated with the left child, and line shows the spaces generated with the right child. Therefore, at level 2, there will be four nodes. Each of these children is divided into two sets whose search space is a smaller dimension than the search space of the node from which they come. There are two sets of level 3 for each of the sets of level 2. In level 3, there will be eight sets. In general, at the level of , 0 ≤ ≤ , 2 , subsets can be formed, where the search spaces of every two sets are disjointed sets and the union of all spaces of the same level is the initial power set P(m). Following this procedure, at the last level, a subset of the initial power set is assigned to each leaf [25]. In the last levels of the binary tree, most of the elements will be fixed at ω1, leaving a small number of free elements. Taking into account the above, the KP 0-1 is formulated below: = ∑ , In turn, these search spaces are divided in two by setting and removing element 2. The line l 1 l 1 shows the spaces generated with the left child, and line l 2 l 2 shows the spaces generated with the right child. Therefore, at level 2, there will be four nodes. Each of these children is divided into two sets whose search space is a smaller dimension than the search space of the node from which they come. There are two sets of level 3 for each of the sets of level 2. In level 3, there will be eight sets. In general, at the level of k, 0 ≤ k ≤ m, 2 k , subsets can be formed, where the search spaces of every two sets are disjointed sets and the union of all spaces of the same level is the initial power set P(m). Following this procedure, at the last level, a subset of the initial power set is assigned to each leaf [25].
In the last levels of the binary tree, most of the elements will be fixed at ω 1 , leaving a small number of free elements. Taking into account the above, the KP 0-1 is formulated below: where p i is the benefit of element i, w i is the element weight i, and c is the knapsack capacity. Equation (7) is divided into two terms: fixed (ω 1 ) and free elements (ω 2 /ω 1 ). Therefore, the formulation of the problem is adapted to the part of the problem that contains free elements: this is because it is not known which of them will be part of the solution. The modification consists of calculating the capacity of the available knapsack, once the fixed elements have been stored in the knapsack, according to Equation (9): This algorithm begins with the calculation of a feasible solution and its upper and lower limits: these are obtained as proposed by Dantzig (1957) [18] with Equation (10). The algorithm calculates them only once at the beginning of the calculations: Subsequently, for free items (ω 2 /ω 1 ), the critical variable x s is determined by Equation (11), the upper bound (UB) limit by Equation (12), and the lower bound (LB) limit by Equation (13), where the LB will be the first optimal integer solution (f opt ): where c 0 is the available capacity of the knapsack that will be filled with free items (ω 2 /ω 1 ), l 0 is the minimum number of free elements that cause the sum of the weights w i to be greater than c 0 , p i is the benefit of the critical element, and w i is the weight of that critical element. After determining the critical variable and the root limits (UB and LB), two children are generated using the procedure described previously, and each of the children will have a UB child that is used to determine if that child is pruned or branched. In order for the child to branch out, the following must be fulfilled: If Equation (14) is met, there is a possible LB child that is better than the current one and, consequently, a better optimal solution if LB child > f opt . On the other hand, if UB child < f opt , then that branch of the tree is pruned, since there will be no LB child greater than the best f opt that has so far been calculated.
The optimal solution f opt is the control variable.

Multi-BB Parallel Model
The sequential BB algorithm uses a breadth search to traverse the binary tree and calculate the optimal solution. This algorithm progresses level by level, calculating all the subsets generated. Therefore, to explain the proposal of the parallel model, the following example is presented. The subset {2, 4} is supposed to be the optimal solution to the problem. The subset {2, 4, 5} is at level 3, as is shown in the tree in Figure 2, where elements 2 and 4 are in the first locations. Thus, the algorithm will have to reach level 3 to calculate the solution. In the worst case, the algorithm will require 2 0 + 2 1 +2 2 + 2 3 subsets, so it will also be necessary to perform calculus for the subsets of level 4 (2 4 ) to determine that the solution is optimal. These subsets are stored in a list and are resolved sequentially. In the worst case, the calculated subsets will be 31.
When at the start of the problem, element 3 is fixed and . The solution is in the last subset. The number of subsets needed to reach the solution is 2 0 + 2 1 + 2 2 , and the subsets of level 3 (2 3 ) will be required to determine that the solution is optimal. In the worst case, 15 subsets will be calculated, which is 2 4 operations less than in the previous case.
Because the fixed or removed element that helps determine the optimal solution in fewer iterations is not known in advance, we propose generating different binary trees. In each tree, a different element is fixed or removed. When the element is fixed, that element is forced to be part of the solution, and when it is removed in another tree, that element will not be part of the solution, meaning that other elements must form the solution. In each binary tree, the process described previously is followed. The processing units share their best solution to accelerate the calculation, thereby increasing the number of branches that are pruned in each tree. Consequently, it is possible to reduce the parallel computation time. Communications among the processing units influence the synchronization of the processing units, an aspect that is resolved, as indicated in the following paragraphs. The ideal number of binary trees is equal to twice the number of elements of the problem, because in one-half of the trees, a different element is fixed, and in the other half, each element is removed. Therefore, we propose generating different trees to increase the spaces of the feasible solutions, with the possibility that in some binary trees, the optimal solution is found in its first levels.
An outline of the algorithm is shown in Figure 3, where the trees that are generated do not constitute a forest [24] because they share the same leaves. element is removed. Therefore, we propose generating different trees to increase the spaces of the feasible solutions, with the possibility that in some binary trees, the optimal solution is found in its first levels.
An outline of the algorithm is shown in Figure 3, where the trees that are generated do not constitute a forest [24] because they share the same leaves.

Multi-BB Algorithm for a Multicore Cluster
In the implementation of the multi-BB algorithm in the multicore cluster (Figure 4), four factors that influence its efficiency are considered: shared knowledge, knowledge utilization, work division,

Multi-BB Algorithm for a Multicore Cluster
In the implementation of the multi-BB algorithm in the multicore cluster (Figure 4), four factors that influence its efficiency are considered: shared knowledge, knowledge utilization, work division, and synchronization [13]. These are defined to have a good load balance, a minimum number of synchronization points, and the shortest communications times. elements for and ). Communications C2 and C3 are made each time the maximum number of local iterations is reached. In both cases, only two datapoints are sent. In communication C2, the master sends the updated global solution and the signal to the processing units to continue the calculations. In communication C3, each processing unit sends its solution and the status of its local list.
The multi-BB algorithm was programmed in ANSI C, and the Message Passing Interface (MPI) library is used to pass messages. The number of maximum local iterations is one-tenth the size of the problem [24]. The proposed distribution for the parallel processes number is described in Section 4.2. The remaining variables have been previously described. Shared knowledge, or the subsets that are generated by the branching of trees, is stored in local lists in each processing unit because each of the multi-BB trees is independent of the others. An Exclusive Read, Exclusive Write, Parallel Random Access Machine (EREW PRAM) model is used for access to the subsets and to write the partial solution. Figure 4 shows how the local lists are used in each processing unit, from how they are started and updated to when the subsets are taken.
For division of the work, both in a static form and at the beginning of the calculation, a tree is assigned to each processing unit. Each decision tree is expected to grow, and a good load balance is achieved. This is the first activity that the processes perform, as is seen in Figure 4.
With regard to the use of knowledge and synchronization, the optimal solution is shared knowledge. The processing units are synchronized to exchange partial solutions and determine the optimal solution to the problem. A processing unit receives partial solutions, obtains the optimal solution, and stores the identification of the processing unit within that solution. The above process is presented in a green box in Figure 4. The number of local iterations to carry out the synchronization is based on the proposal by Zavala-Díaz [24].
The algorithm is asynchronous when each processing unit travels through its search space. Synchronization occurs when the processes send and receive information (blue lines in Figure 4). Communications C1, C4, and C5 are executed only once during the entire calculation process. In communication C1, the elements in ω 1 and ω 2 of the problem are sent. In communication C4, the master processor sends data to all others and the number of the processing unit that calculated the optimal solution. In communication C5, the processing unit with the optimal solution sends the data to the master processor of the optimal solution, such as ω 1 , ω 2 (the set of solution elements for f linear and f opt ). Communications C2 and C3 are made each time the maximum number of local iterations is reached. In both cases, only two datapoints are sent. In communication C2, the master sends the updated global solution and the signal to the processing units to continue the calculations. In communication C3, each processing unit sends its f opt solution and the status of its local list.
The multi-BB algorithm was programmed in ANSI C, and the Message Passing Interface (MPI) library is used to pass messages. The number of maximum local iterations is one-tenth the size of the problem [24]. The proposed distribution for the parallel processes number is described in Section 4.2. The remaining variables have been previously described.

Computational Experimentation
Computational experimentation is divided into two parts. In the first part, the sequential algorithm is tested, the published instances are resolved, and the uncorrelated and weakly correlated instances are resolved. In the second part, the most complex instances found in the first section with the parallel algorithm are solved. Table 1 shows the main characteristics of the Ioevolution cluster. For a comparison between the results obtained with the sequential BB and parallel multi-BB algorithms, the following tests were carried out. In the first test, it was verified that the algorithm elaborated using sets of parts calculated the optimal solution. For this purpose, the optimal solution (based on the published instances with solutions) was calculated [26][27][28]. The second part consisted of solving the instances classified as uncorrelated and weakly correlated: their coefficients were given by equations as each author calculated them [14]. Therefore, it was not possible to directly compare these equations to other algorithms.
The sizes of the instances of our tests were similar to those used in other analyses: up to 50 elements [14] and up to 500 elements, with the size of the instances of 50 elements being frequent [29] (up to 2000 elements [16]).

Results of the Sequential Algorithm
Sequential execution was carried out on the Ioevolution server. To verify that the proposed method solved the knapsack problem 0-1, the instances published by References [26][27][28] were resolved. Tables 2-4 show the results obtained. As can be seen in Tables 2-4, the sequential algorithm was capable of solving any simple instance. The importance of the comparison is in showing that the process of building ever-smaller solution spaces works to determine the optimal solution.
The second step was to resolve the instances of KP 0-1 that were classified as uncorrelated or weakly correlated. The tests were done with six sets of 1000 elements, for which the optimal solution for every 100 elements was calculated: n = 100, 200, . . . , 1000. In this way, the influence of the size of the problem could be measured, and it could be determined if the execution time depended on other variables. The value of the constant R was equal to 1000 for the first three series and was equal to 10,000 for the second three series. For each test, the capacity of the knapsack is given by Equation (15). Table 5 and Figure 5 contain the sequential execution times for the uncorrelated instance problems:  The results show that the sequential version calculated all of the series of uncorrelated instances. Table 5 and Figure 5 show that the runtime increment was primarily based on the size of the problem, tending to increase as the problem size increased. This increase was not linear (like the increase in the number of elements was), and in Figure 5, the vertical axis is presented in base logarithm 10. Only one of the results obtained had an execution time of more than 1000 s. This result was from series 2, with n = 1000. The values of this series were calculated with R = 1000, indicating that it is not practical to use R = 10,000 (in series 4, 5, and 6). Table 6 and Figure 6 show the sequential execution times for a weakly correlated instance.  The results show that the sequential version calculated all of the series of uncorrelated instances. Table 5 and Figure 5 show that the runtime increment was primarily based on the size of the problem, tending to increase as the problem size increased. This increase was not linear (like the increase in the number of elements was), and in Figure 5, the vertical axis is presented in base logarithm 10. Only one of the results obtained had an execution time of more than 1000 s. This result was from series 2, with n = 1000. The values of this series were calculated with R = 1000, indicating that it is not practical to use R = 10,000 (in series 4, 5, and 6). Table 6 and Figure 6 show the sequential execution times for a weakly correlated instance.  With the sequential algorithm, it was not possible to obtain the optimal solution for all the problems in this instance. Seven of the sixty problems were not solved by the sequential algorithm, as shown in Table 6 and Figure 6. Some instances were not resolved because a number of generated children used all the memory of the Ioevolution server. Figure 6 shows that an increase in size influenced the execution time but also influenced the variability of pi with respect to , which caused a greater dispersion of execution times. The influence of using a constant R equal to 10,000 increases complexity in this kind of problem. Four of the instances exceeded the execution time of 10,000 s: the instances of series 4, 5, and 6.

Application of the Multi-BB Parallel Algorithm
When the multi-BB algorithm was applied, the generated trees were distributed uniformly to all servers: one tree for each processing unit. Table 7 shows the number of cores used for the parallel computational experimentation. The distribution of the number of cores of the different servers was produced with the intention of having the same workload on each server. The same number of cores was used until the maximum of the smallest servers was covered. When more cores were required, they were assigned to the cores of the servers with the largest quantities to cover the maximums of all servers.  With the sequential algorithm, it was not possible to obtain the optimal solution for all the problems in this instance. Seven of the sixty problems were not solved by the sequential algorithm, as shown in Table 6 and Figure 6. Some instances were not resolved because a number of generated children used all the memory of the Ioevolution server. Figure 6 shows that an increase in size influenced the execution time but also influenced the variability of p i with respect to w i , which caused a greater dispersion of execution times. The influence of using a constant R equal to 10,000 increases complexity in this kind of problem. Four of the instances exceeded the execution time of 10,000 s: the instances of series 4, 5, and 6.

Application of the Multi-BB Parallel Algorithm
When the multi-BB algorithm was applied, the generated trees were distributed uniformly to all servers: one tree for each processing unit. Table 7 shows the number of cores used for the parallel computational experimentation. The distribution of the number of cores of the different servers was produced with the intention of having the same workload on each server. The same number of cores was used until the maximum of the smallest servers was covered. When more cores were required, they were assigned to the cores of the servers with the largest quantities to cover the maximums of all servers.  Cores  20  40  60  80  100  120   Ioevolution  5  10  15  25  30  30  compute-0-0  5  10  15  25  40  60  compute-0-1  5  10  15  15  15  15  compute-0-2  5  10  15 15 15 15 The elements to be removed and fixed were selected from a list of items in decreasing order (depending on the ratio (p i /w i )). In one tree, the element was fixed, and in another, the same element was removed. In the processing unit with odd numbers, the element was fixed, and when it had an even number, that same element was removed, and the element to be fixed or removed was calculated with Equation (16). If proc i was odd, the element k = proc i * ∆w was fixed. If proc i was even, the element element k = (proc i − 1) * ∆w was removed: The first application of the multi-BB algorithm was to solve the weakly correlated instance of series 6 (with n = 800), which required 47,107.17 s. To measure the results of the parallel implementation, speedup and parallel efficiency were used, which are described in Equations (17) and (18):

Machine-
Parallel e f f iciency : Ep = Speedup with p processors number o f processors p . (18) The ideal speedup value is a linear value, that is, S p = p (number of processors). However, for parallel efficiency, its ideal value is equal to 1 or 100%. Table 8 shows the execution time, the speedup S p , and the parallel efficiency E p for different numbers of cores. As is shown in Table 8, the weakly correlated instance of series 6 (with n = 800) could be solved in parallel with different numbers of cores. Because binary trees are generated to explore different solution spaces, by fixing and removing elements, it is possible to reduce parallel computation times. This is due to the calculation of optimal local solutions, which allow for more efficient calculations of the global optimal solution. This method is used in all binary trees to determine which branches are pruned or branched. The multi-BB algorithm accelerates its calculation process by sending the global optimal solution to each interval of the local iterations, allowing for the pruning of a greater number of branches of each binary tree. In this case, the elements selected to be fixed or removed were correct because the parallel execution time was reduced: only on two occasions was a superlinear speedup achieved. This scenario could change if another method for selecting the numbers of elements to be fixed or removed in the decision trees is used. The two times a superlinear speedup was reached were when 20 and 80 processing units were used. Parallelism in these algorithms is speculative because dependencies are ignored and subtrees are explored in a parallel form whose algorithm efficiencies can result in an anomalous performance, as defined in Reference [29]: "It may be that a parallel task finds a strong incumbent more quickly than in the sequential execution, leading to less work being done. In this case we observe superlinear speedups." In the first case of a speedup, 144.87 was reached, which was much higher than the 20 cores used. In the second case, a speedup of 86.14 with 80 cores was achieved. These two cases gave parallel efficiencies of 724.36% and 107.67%, on speedup respectively, which were both higher than the ideal value of 100%. In the other solved cases, a superlinear speedup was not reached, and the efficiencies were lower than 100%. Figure 6 shows the workload when 20 processing units were used.
As is seen in Figure 7, 4 of the 20 processing units concluded their calculations in the first iterations, and a fifth processor concluded its calculations as 20% of the iterations converged. Thus, the Ioevolution cluster required approximately 75% efficiency to obtain an optimal solution. The above results indicate that the execution time was algorithmically reduced, but not by the efficient use of the Ioevolution cluster. This indicates that the hypothesis of reducing the number of operations is possibly true if the operation starts from another point in the search space of feasible solutions.

Solution of Weakly Correlated Instances without a Sequential Solution
The application of the multi-BB algorithm for the problems of weakly correlated instances without sequential solutions was the same as in the previous problem. The results are shown in Table 9 and Figure 8.

Solution of Weakly Correlated Instances without a Sequential Solution
The application of the multi-BB algorithm for the problems of weakly correlated instances without sequential solutions was the same as in the previous problem. The results are shown in Table 9 and Figure 8.  The results in Table 9 and Figure 8 show that as the number of processing units increased, the dispersion of the results was smaller. Except for series 4, all others tended to be less variable for 100 processing units, and the variability increased again with 120 processing units. This indicates that the element that decreased the computations may have been a greater number of processing units. The same held true for series 4 with 1000 elements: it was only possible to determine its solution with a larger number of processing units.

Conclusions
From the results obtained, the following conclusions could be made: It is possible to make binary branch-and-bound trees by fixing and removing subset elements to build a parallel algorithm, which calculates the optimal solution of the 0-1 knapsack problem.
Consideration of the four factors that affect the implementation of parallel algorithms is important, because they guide the process of obtaining efficient parallel algorithms, such as the one presented in this work.
The multi-BB algorithm calculated the optimal solution of the weakly correlated instances that could not be solved with the sequential version, instances that required excessive memory usage. This indicates that despite the fact that a larger number of search spaces were generated, the algorithm calculated the optimal solution without using the full memory of any server in the cluster. This shows that the moment when the global optimal function is updated is the correct one, because it allows for the pruning of a large number of binary tree branches assigned in the cores.
The parallel efficiencies reached were around 75%, but in some cases, it was possible to achieve superlinear speedup.
The improvement that can be made to the algorithm is to calculate, in a more efficient way, the elements that are fixed in the solution process, which perhaps could be done through hybridization of the algorithm. This will be a future task, along with including parallel programming via GPUs, which will allow for shared memory to be used and which will considerably accelerate communication between the procedures generated by the algorithm.