Batch-Wise Permutation Feature Importance Evaluation and Problem-Specific Bigraph for Learn-to-Branch

Abstract: The branch-and-bound algorithm for combinatorial optimization typically relies on a plethora of handcrafted expert heuristics, and a research direction known as learn-to-branch proposes to replace the expert heuristics in branch-and-bound with machine learning models. Current studies in this area typically use an imitation learning (IL) approach; however, in practice, IL often suffers from limited training samples. Thus, it has been emphasized that a small-dataset fast-training scheme for IL in learn-to-branch is worth studying, so that other methods, e.g., reinforcement learning, may be used for subsequent training. Accordingly, this paper focuses on the IL part of a mixed training approach, where a small-dataset fast-training scheme is considered. The contributions are as follows. First, to compute feature importance metrics so that the state-of-the-art bigraph representation can be effectively reduced for each problem type, a batch-wise permutation feature importance evaluation method is proposed, which permutes features within each batch in the forward pass. Second, based on the evaluated importance of the bigraph features, a reduced bigraph representation is proposed for each of the benchmark problems. The experimental results on four MILP benchmark problems show that our method improves branching accuracy by 8% and reduces solution time by 18% on average under the small-dataset fast-training scheme compared to the state-of-the-art bigraph-based learn-to-branch method. The source code is available online at GitHub.


Introduction
Mixed-integer linear programming (MILP) offers a generic way to formulate and solve practical decision-making problems, e.g., routing optimization [1], manipulator control [2], and resource allocation [3]. Due to the wide applicability of MILP, numerous commercial and free MILP solvers exist, with a few well-known examples such as CPLEX [4], SCIP [5], and Gurobi [6]. The basic component of modern MILP solvers is the branch-and-bound (B&B) algorithm for global optimization [7]. Typically, B&B recursively partitions the search space by branching on the optimal solution of the linear relaxation of the MILP problem and cleverly exhausts the search space by pruning unpromising solution space until a solution with the certificate of optimality is found. The B&B algorithm relies heavily on heuristic rules, which are essentially priority guidelines devised by human experts to direct search directions toward more promising regions, such as the variable selection policy or node selection policy. Traditionally, the heuristics are carefully constructed based on expert domain knowledge and the common characteristics of specific types of problems. With its rapid development in recent years, machine learning (ML) [8] offers a way to replace some of the sophisticated hand-crafted expert heuristics in B&B [9].
To learn the variable selection policy in the B&B algorithm, Alvarez et al. adopted ML early for solving MILPs [10]. This kind of learning-based policy is also known as learn-to-branch, where learning is introduced to the optimization process to search for the ...
1. In order to measure the feature importance, a batch-wise permutation feature importance (PFI) evaluation method, abbreviated BPFI, is proposed for learn-to-branch, which permutes features within only one batch in the forward pass. The GCNN model is augmented as BPFI-GCNN by adding one shuffling switch to the GCNN model, thereby allowing the fragmented processing of the branching samples.
2. Based on the results of the BPFI evaluation, a reduced bigraph representation is proposed for each specific benchmark problem to reduce the model complexity. The proposed representation is shown to outperform the original in most cases on both branching accuracy and solution efficiency.
The remainder of this paper is organized as follows. In Section 2, the background and related studies of ML-B&B are discussed. In Section 3, the MILPs of four NP-hard benchmark problems are introduced. In Section 4, BPFI is evaluated for the bigraph representation, according to the results of which an improvement to the bigraph representation is proposed. In Section 5, comparative experiments are carried out to verify the effectiveness of the proposed method. Finally, Section 6 concludes the paper. The source code is available online at GitHub (https://github.com/NiuYajie0/BPFI-learn2branch, accessed on 28 May 2022).

Machine Learning Based Branch-and-Bound
Typically, B&B recursively partitions the search space by branching on the optimal solution of the linear relaxation of the MILP and cleverly exhausts the search space by pruning unpromising solution space until a solution with a certificate of optimality is found.
When branching, a candidate variable is selected as the branching variable according to the variable selection policy (or branching policy), and two child branches are created. The branching variable is rounded down on the left child branch and rounded up on the right child branch. One of the most famous branching strategies is strong branching (SB) [19]. In SB, each candidate variable is tentatively branched, and the one that has the greatest product of the lower bound increase of the left branch and the lower bound increase of the right branch is selected. However, a common drawback of these expert-crafted heuristics is that they are usually time-consuming.
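The product rule of SB described above can be sketched as follows. This is a minimal illustration, not the SCIP implementation: the function name `strong_branching_choice`, the dictionary layout, and the small floor `eps` applied to each gain are assumptions made for this example, and the child bounds are assumed to have been obtained already by tentatively branching on each candidate.

```python
def strong_branching_choice(parent_bound, child_bounds, eps=1e-6):
    """Pick the candidate with the largest product of child bound gains.

    child_bounds maps a candidate variable name to (left_bound, right_bound),
    the LP lower bounds after tentatively branching down and up.
    """
    best_var, best_score = None, float("-inf")
    for var, (left, right) in child_bounds.items():
        # Lower-bound increase on each child. The eps floor (an assumption of
        # this sketch) keeps a single zero gain from erasing the whole product.
        gain_left = max(left - parent_bound, eps)
        gain_right = max(right - parent_bound, eps)
        score = gain_left * gain_right
        if score > best_score:
            best_var, best_score = var, score
    return best_var, best_score


# Hypothetical candidates at a node whose LP lower bound is 10.0.
candidates = {"x1": (10.5, 12.0), "x2": (11.0, 11.5), "x3": (10.0, 15.0)}
choice, score = strong_branching_choice(10.0, candidates)
```

Here `x3` has the largest single gain but no gain on its left child, so the product rule prefers `x2`, whose two children both improve the bound. The cost is visible in the input: every candidate requires two tentative LP solves, which is why SB is expensive.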
With the rapid development in recent years, ML offers a possibility to automatically construct effective heuristics from data by exploiting the shared structure among MILP instances [15]. In addition, using specialized deep learning and parallel computing hardware for ML models, ML-B&B can be much faster than traditional B&B implementations. Generally speaking, the training of ML models for B&B follows one of two methodologies: imitation learning (IL) and reinforcement learning (RL) [20]. In IL, the ML model is trained through the demonstration of an expert solver, such as the default MILP solver of SCIP [5]. For example, the state-of-the-art "learn to branch" method [14] frames variable selection as a classification problem and trains a GCNN using SB expert decisions as the ground truth labels. However, by the nature of IL, the IL-trained model is limited by the performance of the expert policy [21].
On the other hand, the sequential decision-making during B&B can be regarded as a Markov decision process [22], which lays the foundation for RL. By training the policy through exploration experience, RL offers a good alternative to automate the search for heuristics [23]. Therefore, it is good practice to use IL in conjunction with RL, i.e., using IL at the start of the training process, then switching to RL to continue refining the ML model. A well-known example of this practice is the AlphaGo project [24], where the experts are human players. The IL part of an IL-RL mixed training typically suffers from limited training samples; however, a good IL at the early stage can greatly improve the convergence rate of RL. Therefore, our work is focused on the performance of ML-B&B under a small-dataset fast-training scheme, which is typically the case in the early stage of an IL-RL mixed training.

The Bigraph Representation for State Embedding
A key element in ML-B&B is state embedding, which includes embedding the MILP problem and its B&B solution status. In previous research [12], the variable selection policy was trained offline on the collected SB scores of candidate variables. However, correlations between constraints and variables are represented by the hand-crafted features, which rely on extensive feature engineering. To address the above issue, a bigraph representation of MILP was proposed in [14], where corresponding nodes are connected if a constraint is associated with a variable, and a GCNN was used to extract useful information from the bigraph representation. This representation is natural for MILPs and has shown promising performance. In [16], Peng et al. proposed that prioritizing the sampling of certain branching decisions over others and thus providing a better branching data distribution could further improve the performance of the trained model. In [17], the authors pointed out that the GCNN-based approach relies too heavily on high-end GPUs, which may not be accessible to many practitioners. Thus, a new hybrid architecture was proposed for efficient branching on CPU machines, which combined the expressive power of GNNs with computationally inexpensive multi-layer perceptrons for branching and achieved a better balance between solution time and branching accuracy.
The original bigraph representation and its later improvements [14,17] are designed for general MILP problems, i.e., aiming to apply one ML model to as many MILP problems as possible. As a result, the bigraph representation contains a large number of features, which often leads to complicated ML models, as well as extended training and inference times. For example, in the bigraph representation [14], 13 features are used to represent a variable, 5 for a constraint, and 1 for an edge, and there can be approximately 100-1000 variables and 700-5000 constraints for a MILP instance. Therefore, this paper aims to reduce the bigraph representation by using problem-specific fast feature analysis to address the existing problem.

Refined Problem-Specific Branch-and-Bound
Currently, the bigraph representation in learn-to-branch is designed for general MILPs, i.e., aiming to apply one ML model to many different types of MILPs. As a result, the bigraph representation often contains a large number of features and leads to complicated ML models. Therefore, refining the bigraph representation for specific problems is an important step to further improve the ML-B&B algorithm.
In recent years, the B&B algorithm has been refined for various problems, and different methods have been proposed to utilize problem-specific knowledge. For example, in [25], a data-mining based approach was proposed to generate problem-specific knowledge for combinatorial optimization. In [26], Khachay et al. specifically designed a B&B algorithm for the precedence-constrained generalized traveling salesman problem and demonstrated that the performance of such an algorithm is competitive against the state-of-the-art MILP solver Gurobi. Similarly, Kudriavtsev et al. proposed and refined a B&B algorithm specifically for the shortest simple path problem and demonstrated its good performance by numerical evaluations [27].
Therefore, previous studies show that the B&B algorithm has a considerable space for improvement when refined for specific problems. In this paper, the ML-B&B model is specifically refined for each of the benchmark problems using the proposed BPFI method.

Benchmark Problems
An MILP is an optimization problem, which can be formulated as follows:

$$\arg\min_{x} \left\{ c^{\top} x \;\middle|\; A x \le b,\; l \le x \le u,\; x \in \mathbb{Z}^{p} \times \mathbb{R}^{n-p} \right\},$$

where $c \in \mathbb{R}^{n}$ denotes the objective coefficient vector, $A \in \mathbb{R}^{m \times n}$ the matrix of constraint coefficients, and $b \in \mathbb{R}^{m}$ the vector of the right-hand sides of constraints, respectively. In addition, $l, u \in \mathbb{R}^{n}$ are the vectors of lower and upper bounds of variables, and $p \le n$ is the number of integer variables. As popular benchmarks, four classes of MILPs are evaluated in this paper, namely set covering (SC), combinatorial auction (CA), maximum independent set (MIS), and capacitated facility location (CFL). Specifically,

1. The SC problem can be formulated as follows:

$$\min\; c^{\top} x \quad \text{s.t.} \quad A x \ge e, \quad x_j \in \{0, 1\},\; j = 1, \dots, n,$$

where $A = \{a_{ij}\}$ is an $m \times n$ binary matrix with $a_{ij} = 1$ if column $j$ covers row $i$ and $a_{ij} = 0$ otherwise. Define $e = (1, \dots, 1)$, which has $m$ components, and let $c_j$ be the cost of column $j$. If column $j$ is in the solution, $x_j = 1$; otherwise, $x_j = 0$.

2. In the CA formulation, $n$ and $m$ are the numbers of distinct items and bidders, respectively, and $y_{ij}$ is a binary decision variable indicating whether item $i$ is sold to bidder $j$. The highest price that bidder $j$, with the purchasing power $W$, can offer for item $i$ is $w_{ij}$.

3. The MIS problem can be formulated as

$$\max \sum_{v \in V} x_v \quad \text{s.t.} \quad x_u + x_v \le 1 \;\; \forall (u, v) \in E, \quad x_v \in \{0, 1\} \;\; \forall v \in V,$$

where $V$ and $E$ denote the sets of vertices and edges of an undirected graph, respectively, and $x_v$ for each node $v \in V$ is a binary decision variable indicating whether $v$ is selected in an independent set.
4. In the CFL formulation, $x_{ij} \ge 0$ for $i = 1, \dots, n$ and $j = 1, \dots, m$, where $c_{ij}$ is the transportation cost between customer $j$ and facility $i$, $d_j$ is the demand of customer $j$, and $x_{ij}$ is the fraction of the demand of customer $j$ met from facility $i$. If facility $i$ is open, $y_i = 1$; otherwise, $y_i = 0$, and $f_i$ is the fixed cost.

Metrics
In this paper, two groups of metrics are used for testing the branching accuracy of an ML brancher and evaluating the solution efficiency of the solver that adopts the brancher. Specifically,

1. The branching accuracy is described by four metrics, i.e., the percentage of times the decision has the highest strong branching score (acc@1), one of the three highest (acc@3), one of the five highest (acc@5), and one of the ten highest (acc@10) strong branching scores.
2. The solution efficiency is described by two metrics, i.e., the 1-shifted geometric mean of the solving times in seconds (Time) and final node counts of instances (Nodes).
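Both groups of metrics can be sketched in a few lines. The function names and the list-based data layout below are assumptions made for illustration, and ties among strong branching scores are broken by index order here, which may differ from the evaluation code used in the experiments.

```python
import math


def acc_at_k(pred_scores, sb_scores, k):
    """Fraction of samples whose predicted variable is among the k best SB scores.

    pred_scores[s][j] is the model score of candidate j in sample s, and
    sb_scores[s][j] is the strong branching score of the same candidate.
    """
    hits = 0
    for pred, sb in zip(pred_scores, sb_scores):
        chosen = max(range(len(pred)), key=pred.__getitem__)
        top_k = sorted(range(len(sb)), key=sb.__getitem__, reverse=True)[:k]
        hits += chosen in top_k
    return hits / len(pred_scores)


def shifted_geometric_mean(values, shift=1.0):
    """Geometric mean of (value + shift), with the shift subtracted afterwards."""
    log_sum = sum(math.log(v + shift) for v in values)
    return math.exp(log_sum / len(values)) - shift


# Two hypothetical branching samples with three candidates each.
pred = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
sb = [[1.0, 3.0, 2.0], [0.2, 0.9, 0.5]]
acc1 = acc_at_k(pred, sb, 1)   # only the first sample picks the best candidate
acc3 = acc_at_k(pred, sb, 3)   # with three candidates, every pick is in the top 3
time_metric = shifted_geometric_mean([1.0, 3.0])
```

The shift of 1 dampens the influence of very small solving times, which would otherwise dominate a plain geometric mean.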

Methodology
In this paper, to reduce model complexity, the bigraph representation is refined according to the evaluated importance of features. In ML studies, PFI is an effective approach to gain insights into black-box models. In learn-to-branch, however, branching samples are usually large and collected in fragments, which makes traditional PFI evaluation infeasible. To address this issue, the BPFI method is proposed to identify non-contributing features in the full bigraph representation. According to the BPFI results, a bigraph representation is designed for each of the benchmark problems.
As shown in Figure 1, BPFI evaluation is implemented by adding only one shuffling switch in the learning model. According to the BPFI evaluation, the problem-specific bigraph is built, and the non-contributing features are masked out to refine the model.

Batch-Wise Permutation Feature Importance
In PFI, the utility of a feature is measured by the decrease in model performance caused by permuting this feature over the dataset. The general steps for computing PFI are as follows.

1. Train and evaluate the model for a performance score $A$.
2. Evaluate the model on a modified test dataset with feature $i$ shuffled. Compute the performance score $A_{i,s}$, $s = 1, \dots, N$, for $N$ different shuffling seeds.
3. Compute the PFI $F_i$ of feature $i$ as the drop in performance after shuffling:

$$F_i = A - \frac{1}{N} \sum_{s=1}^{N} A_{i,s}.$$

PFI is commonly used as an interpretation method. However, the original PFI evaluation cannot be used for learn-to-branch directly. The reason is that PFI evaluation requires shuffling features over the entire test dataset (see Figure 2a), whereas, in learn-to-branch, the branching samples are generally collected in fragments, large (each around 200 KB), and stored as separate binary files, which makes the original PFI evaluation infeasible.
Therefore, to compute feature importance for the bigraph representation considering the fragmented branching samples dataset, the BPFI evaluation is proposed, which permutes features within only one batch in the forward pass (see Figure 2b). Formally, let $A(M, D)$ denote the performance function that computes the score of model $M$ given dataset $D$, and let $P_i(D, b)$ denote a per-batch permutation function that permutes feature $i$ of dataset $D$ for a batch size $b$. Then, the performance scores $A_{i,s}$ after shuffling in the traditional PFI evaluation and $\tilde{A}_{i,s}$ in the proposed BPFI evaluation are given by

$$A_{i,s} = A\big(M, P_i(D, |D|)\big), \qquad \tilde{A}_{i,s} = A\big(M, P_i(D, b)\big),$$

respectively, where $|D|$ denotes the size of dataset $D$. Since the per-batch permutation can be performed within one forward pass after a batch of samples has been loaded, BPFI evaluation is more lightweight and can approximate the traditional PFI evaluation. ...
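The relation between PFI and BPFI can be illustrated with a small self-contained sketch. It is not the BPFI-GCNN implementation: the toy dataset, the helper names (`permute_feature`, `accuracy`, `importance`, `toy_model`), and the use of plain accuracy as the performance function are assumptions made for this example. The only difference between the two evaluations is the batch size passed to the permutation, with the whole dataset treated as a single batch for traditional PFI.

```python
import random


def permute_feature(rows, i, batch_size, seed):
    """Return a copy of rows with feature i shuffled inside each batch."""
    rng = random.Random(seed)
    out = [list(r) for r in rows]
    for start in range(0, len(out), batch_size):
        batch = out[start:start + batch_size]
        column = [r[i] for r in batch]
        rng.shuffle(column)
        for r, value in zip(batch, column):
            r[i] = value  # rows in the slice are shared with out
    return out


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def importance(model, rows, labels, i, batch_size, n_seeds=20):
    """Baseline score minus the mean score over n_seeds shufflings of feature i."""
    base = accuracy(model, rows, labels)
    shuffled = [
        accuracy(model, permute_feature(rows, i, batch_size, seed), labels)
        for seed in range(n_seeds)
    ]
    return base - sum(shuffled) / n_seeds


def toy_model(row):
    # Hypothetical classifier that looks only at feature 0.
    return int(row[0] > 0.5)


rows = [[float(k % 2), k / 64.0] for k in range(64)]
labels = [k % 2 for k in range(64)]

pfi_0 = importance(toy_model, rows, labels, 0, batch_size=len(rows))  # PFI
bpfi_0 = importance(toy_model, rows, labels, 0, batch_size=16)        # BPFI
bpfi_1 = importance(toy_model, rows, labels, 1, batch_size=16)        # unused feature
```

In this toy setting, shuffling the unused feature 1 leaves the accuracy unchanged, so its importance is exactly zero, while both evaluations assign feature 0 a clearly positive importance. Because `permute_feature` only needs one batch in memory at a time, the batch-wise variant can run on samples that are loaded file by file.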

Problem-Specific Bipartite Graph Representation
As shown in Figure 3, the state of the B&B process at a certain timestep can be encoded as a bigraph with node and edge features. In the bigraph, one type of node corresponds to constraints in the MILP, and the other corresponds to variables. A variable node and a constraint node are connected by an edge if the variable's coefficient is non-zero in the constraint. Given a MILP instance, let $m$ be the number of constraints, each of which has $c$ features, let $n$ be the number of variables, each of which has $d$ features, and let each edge have $e$ features. A constraint feature matrix $C \in \mathbb{R}^{m \times c}$ can be used to represent the constraint nodes, a variable feature matrix $V \in \mathbb{R}^{n \times d}$ the variable nodes, and an edge feature matrix $E \in \mathbb{R}^{m \times n \times e}$ the edges. Therefore, the original bigraph representation can be defined as $G = \{C, E, V\} \in \mathcal{G}$, where $\mathcal{G}$ is the set of all bigraph representations of MILPs. In the proposed problem-specific bigraph representation, the non-contributing features are masked out for each of the benchmark problems. Therefore, the proposed bigraph representation can be formulated as $\tilde{G} = \{\tilde{C}, \tilde{E}, \tilde{V}\}$, where $\tilde{C}$, $\tilde{E}$, and $\tilde{V}$ are the reduced feature matrices.
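A minimal sketch of the bigraph and its reduction is given below. The `Bigraph` class, the sparse edge dictionary (used in place of a dense m × n × e tensor), and the `reduce_bigraph` helper are assumptions made for illustration. Masking is implemented here by dropping feature columns, which is one way to shrink the input dimension of the downstream model.

```python
from dataclasses import dataclass


@dataclass
class Bigraph:
    constraint_features: list  # m rows with c features each (matrix C)
    variable_features: list    # n rows with d features each (matrix V)
    edges: dict                # (constraint index, variable index) -> e edge features


def keep_columns(matrix, keep):
    """Keep only the listed feature columns of a row-major matrix."""
    return [[row[j] for j in keep] for row in matrix]


def reduce_bigraph(graph, keep_c, keep_v, keep_e):
    """Build the reduced bigraph from the indices of the retained features."""
    return Bigraph(
        constraint_features=keep_columns(graph.constraint_features, keep_c),
        variable_features=keep_columns(graph.variable_features, keep_v),
        edges={key: [feats[j] for j in keep_e] for key, feats in graph.edges.items()},
    )


# Hypothetical instance with m = 2 constraints, n = 3 variables, and e = 1.
full = Bigraph(
    constraint_features=[[1.0, 0.5], [2.0, 0.1]],
    variable_features=[[0.0, 1.0, 3.0], [1.0, 0.0, 4.0], [0.5, 0.5, 5.0]],
    edges={(0, 0): [1.0], (0, 1): [2.0], (1, 2): [3.0]},
)
# Keep one constraint feature and two variable features, and drop the edge feature.
reduced = reduce_bigraph(full, keep_c=[0], keep_v=[0, 2], keep_e=[])
```

The connectivity of the graph is untouched by the reduction: the edge keys remain, and only the feature vectors attached to nodes and edges become shorter.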
As a special heterogeneous graph, the bigraph has only two different types of nodes (constraints and variables) and two types of edges (involves-in and belongs-to). With the bipartite structure of the input graph, the graph convolution can be separated into two consecutive passes, i.e., the v-to-c and c-to-v passes, as introduced in [14]. The BPFI-GCNN further simplifies the original Full GCNN model for each problem type, according to the BPFI evaluation results. See Section 5.2 for details of the BPFI-GCNN.
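The two consecutive passes can be illustrated with a deliberately simplified sketch. It keeps only the message-passing pattern: the learned layers of the GCNN in [14] are omitted, messages are plain sums, edge features are ignored, and each node concatenates its own features with the aggregated message. All function names are assumptions of this example.

```python
def half_convolution(edges, source, target_count, to_constraint):
    """Sum the features of the source nodes into each connected target node."""
    width = len(source[0])
    out = [[0.0] * width for _ in range(target_count)]
    for ci, vj in edges:
        src, dst = (vj, ci) if to_constraint else (ci, vj)
        for k in range(width):
            out[dst][k] += source[src][k]
    return out


def bipartite_convolution(edges, constraint_feats, variable_feats):
    # Pass 1 (v-to-c): each constraint aggregates its variables.
    msg_c = half_convolution(edges, variable_feats, len(constraint_feats), True)
    new_c = [c + m for c, m in zip(constraint_feats, msg_c)]  # list concatenation
    # Pass 2 (c-to-v): each variable aggregates the updated constraints.
    msg_v = half_convolution(edges, new_c, len(variable_feats), False)
    new_v = [v + m for v, m in zip(variable_feats, msg_v)]
    return new_c, new_v


# Hypothetical bigraph: constraint 0 involves variables 0 and 1, constraint 1 only variable 1.
edges = [(0, 0), (0, 1), (1, 1)]
new_c, new_v = bipartite_convolution(edges, [[1.0], [2.0]], [[10.0], [20.0]])
```

Because the second pass reads the output of the first, each variable ends up with information about the other variables that share a constraint with it.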

Computer Experiments
In this paper, the experimental framework partially inherits from the state-of-the-art learn-to-branch project [14]. Specifically, the MILP instance generation and branching sample collection algorithms in [14] are reused, meaning that our experimental dataset is consistent with the former studies.

Experimental Framework
As shown in Figure 4, our experiments consist of the following six major steps.

1. Generate instances that include the four benchmark problems, i.e., set covering, combinatorial auction, maximum independent set, and capacitated facility location.
2. Sample the branching decision data during the B&B solution of MILP instances with SCIP 7.0 [5], obtaining branching samples datasets for training, validation, and testing.
3. Train the GCNN model with the full bigraph representation after the shuffling switch is turned off.
4. Perform BPFI evaluation, reduce the bigraph representation for each of the benchmark problems, and train GCNN with each reduced bigraph representation after the shuffling switch is turned on. As features are reduced, the GCNN also requires fewer parameters, thus decreasing in size.
5. Test and compare the branching accuracy of the trained models, including the full GCNN and the BPFI-GCNN.
6. Evaluate and compare the MILP solution efficiency of the ML-B&B models by embedding the trained GCNNs into SCIP's B&B solution process.
For consistency with [14], the SC instances are generated using the procedure of Balas and Ho [28] with 1000 columns and 500 (Easy), 1000 (Medium), or 2000 (Hard) rows. The CA instances are generated using the arbitrary relationships procedure of Leyton-Brown et al. [29] with 100 items for 500 bids (Easy), 200 items for 1000 bids (Medium), and 300 items for 1500 bids (Hard). The MIS instances are generated using the procedure of Bergman et al. [30] with 500 (Easy), 1000 (Medium), and 1500 (Hard) nodes. The CFL instances are generated using the procedure of Cornuejols et al. [31] with 100 facilities for 100 (Easy), 200 (Medium), and 400 (Hard) customers. The training and testing instances have the same size as the Easy instances.
During the BPFI evaluation, 20 independent random shufflings are performed on each feature. In the experiment, 1000 branching samples are extracted from 100 instances for training, 200 branching samples are extracted from 20 instances for validation, and the same amount is used for testing. The training process uses a batch size of 16, epoch size of 20, and max epochs of 300.

BPFI Evaluation and the Resulting BPFI-GCNN
The BPFI evaluation results on the four benchmark problems are shown in Figure 5, where the importance of a feature is computed as the decrease in acc@5 after this feature is shuffled. In this paper, the indicator variables are not considered in the BPFI evaluation due to their similar tensor distributions. Figure 5 shows that the distribution of variable importance differs across the benchmark problems. Therefore, the problem-specific bigraph representation is employed based on the principle of feature reduction, i.e., a reduced bigraph representation is formalized for each of the benchmark problems. In each reduced bigraph representation, most of the non-contributing features with negative variable importance are masked out to maximize the performance of the BPFI-GCNN model. The features of the full bigraph representation are described in [14] and are also detailed in Table A1 in Appendix A for completeness.
For example, since the BPFI evaluation shows that edge features are unimportant for all four benchmark problems, the convolution in the BPFI-GCNN implementation ignores all edge weights. In addition, the BPFI-GCNN is further optimized with the Deep Graph Library (DGL) [32].
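The selection step can be sketched as a simple threshold on the mean importance. This is a simplification: the text states that most, not all, features with negative importance are masked out, so the strict zero threshold below is an assumption, as are the function name and the feature names and scores in the example.

```python
def select_features(importance_by_feature):
    """Split features into kept and masked lists by their mean importance.

    importance_by_feature maps a feature name to its list of per-seed
    importance values (the accuracy drop after shuffling that feature).
    """
    kept, masked = [], []
    for name, drops in importance_by_feature.items():
        mean_drop = sum(drops) / len(drops)
        # A negative mean indicates that shuffling the feature did not hurt accuracy.
        (masked if mean_drop < 0 else kept).append(name)
    return kept, masked


# Hypothetical scores for three features over two shuffling seeds.
scores = {
    "feat_a": [0.04, 0.06],
    "feat_b": [-0.01, -0.03],
    "feat_c": [0.0, 0.02],
}
kept, masked = select_features(scores)
```

In practice the per-seed spread matters as well as the mean, so a feature whose importance is only slightly negative may still be worth keeping.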

Comparison of Branching Accuracy
In this subsection, the branching accuracy of the full GCNN model [14] and the BPFI-GCNN model are compared. Moreover, two other ML branchers are also tested, i.e., the learning-to-score approach of Alvarez et al. [13] (TREES) based on an ExtraTrees model [33] and the learning-to-rank approach of Hansknecht et al. [12] (LMART) based on a LambdaMART model [34]. In Table 1, the branching accuracy of these models is shown under the small-dataset fast-training scheme over five seeds. It can be seen from Table 1 that the BPFI-GCNN model has the highest branching accuracy measured by these four indicators (acc@1, acc@3, acc@5, acc@10) in the four benchmark problems. Specifically, compared to the state-of-the-art bigraph-based method [14], these four branching accuracy indicators, i.e., acc@1, acc@3, acc@5 and acc@10, have increased by 8.4%, 7.5%, 7.8%, and 7.4% on average, respectively.

Comparison of Problem-Solving Efficiency
In this subsection, ML-B&B models are obtained to solve problem instances by embedding the trained models into SCIP's B&B solution process and replacing the default SCIP brancher. Five training seeds are applied to evaluate 20 new instances for each of the problem difficulties (Easy, Medium, Hard), giving a total of 100 solving attempts per problem difficulty.
As in [16], the results in this paper are presented in the form of "mean r ± std%" to avoid the dependence of results on different experimental environments, where "r" is the mean of Nodes or Time used as a reference value. For example, 0.7883r ± 6.68% means that the metric is 0.7883 times the reference value, and the per-instance standard deviation is 0.0668 averaged over all instances. The normalized "mean" and the averaged per-instance "std" values of this expression are used in the t-test.
The complete experimental results are shown in Table 2. The results show that the BPFI-GCNN model achieves significantly better results (in the sense of t-test significance) on most of the performance metrics. Specifically, compared to [14], the solution time has been reduced by an average of 16.8% on the Easy instances, by 22.5% on the Medium instances, and by 15% on the Hard instances. Thus, the BPFI-GCNN model achieves an overall 18% reduction in the solution time.
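As a quick arithmetic check, and assuming that the overall figure is the unweighted mean of the three per-difficulty reductions (the averaging rule is not stated explicitly), the reported numbers are consistent:

$$\frac{16.8\% + 22.5\% + 15\%}{3} = \frac{54.3\%}{3} = 18.1\% \approx 18\%.$$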

Conclusions
In this paper, the BPFI evaluation method has been proposed, which allows the fragmented processing of branching samples. Based on the results of the BPFI evaluation, a refined bigraph representation for each of the benchmark problems has been proposed for the BPFI-GCNN model. The experimental results have shown that the proposed BPFI-GCNN model improves the branching accuracy and shortens the solution time on four MILP benchmark problems.
Our work is limited to ML-B&B under a small-dataset fast-training scheme, which corresponds to the IL part of IL-RL mixed training. However, the effectiveness of the full IL-RL mixed training using this approach for IL remains to be studied. Furthermore, the explainability of GCNN for learn-to-branch is an interesting research direction that is worth exploring.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
The features of the full bigraph representation are given in Table A1.
Table A1. Feature matrix C for the constraints, feature matrix E for the edges and feature matrix V for the variables in the bigraph representation G = {C, E, V} [14].