Learning Efficient Search Approximation in Mixed Integer Branch and Bound

In line with the growing trend of using machine learning to improve solving of combinatorial optimisation problems, one promising idea is to improve node selection within a mixed integer programming branch-and-bound tree by using a learned policy. In contrast to previous work using imitation learning, our policy is focused on learning which of a node’s children to select. We present an offline method to learn such a policy in two settings: one that is approximate by committing to pruning of nodes; one that is exact and backtracks from a leaf to use a different strategy. We apply the policy within the popular open-source solver SCIP. Empirical results on four MIP datasets indicate that our node selection policy leads to solutions more quickly than the state-of-the-art in the literature, but not as quickly as the state-of-practice SCIP node selector. While we do not beat the highly-optimised SCIP baseline in terms of solving time on exact solutions, our approximation-based policies have a consistently better optimality gap than all baselines if the accuracy of the predictive model adds value to prediction. Further, the results also indicate that, when a time limit is applied, our approximation method finds better solutions than all baselines in the majority of problems tested.


Introduction
Hard constrained optimization problems (COPs) exist in many different applications. Examples include airline scheduling [Bayliss et al., 2017] and CPU efficiency maximization [Lombardi et al., 2017]. Perhaps the most common paradigm for modelling and solving COPs is mixed integer linear programming (MILP, or simply MIP). State-of-the-art MIP solvers perform sophisticated pre-solve mechanisms followed by branch-and-bound search with cuts and additional heuristics [Gleixner et al., 2018].
A growing trend is to use machine learning (ML) to improve COP solving. Bengio et al. [2018] survey the potential of ML to assist MIP solvers. One promising idea is to improve node selection within a MIP branch-and-bound tree by using a learned policy [He et al., 2014]. A policy is a function that maps states to actions, where in this work an action is the next node to select. However, research in ML-based node selection is scarce, as the only available literature is the work of He et al. [2014]. This paper contributes a novel approach to improve MIP node selection by using an offline learned policy. We obtain a node selection and pruning policy with imitation learning, a type of supervised learning. In contrast to He et al. [2014], our policy learns only to choose which of a node's children it should select. This encourages finding solutions quickly, as opposed to learning a breadth-first search-like method. Further, we generalize the expert demonstration process by sampling paths that lead to the best k solutions, instead of only the top single solution. The motivation for this is to obtain different state-action pairs that lead to good solutions compared to only using the top solution, in order to aid learning within a deep learning context.
We study two settings: the first is approximate by committing to pruning of nodes. In this way, the solver might find good or optimal solutions more quickly, however with the possibility of overlooking optimal solutions. The second setting is exact: when reaching a leaf the solver backtracks up the branch-and-bound tree to use a different strategy.
We apply the learned policy within the popular open-source solver SCIP [Gleixner et al., 2018]. The results indicate that our node selector finds (optimal) solutions more quickly than He et al. [2014], but not as quickly as the current default SCIP node selector, called BestEstimate. However, the results also indicate that our approximation method finds better initial solutions than BestEstimate, albeit in a higher solving time. Overall, our approximation-based policies have a consistently better optimality gap than all baselines if the accuracy of the predictive model adds value to prediction. Further, when a time limit is applied, our approximate method finds better solutions than all the baselines in three of the five problem classes tested and for one problem class not statistically significantly worse than the baseline.
The outline of this preprint paper is as follows: Section 2 contains preliminaries, Section 3 specifies the imitation learning approach to node selection and pruning, Section 4 reports the results on benchmark MIP instances, Section 5 reviews related work, and Section 6 concludes with future directions.

Imitation learning
Reinforcement learning is designed to learn an approximately optimal policy: a function that maps states to actions, such that the accumulated reward is maximized [Sutton and Barto, 2018]. A problem that can arise in trying to find a policy using reinforcement learning is that the reward function is unknown. In some cases an expert or oracle can provide demonstrations. The demonstrations show what action the expert has taken in a specific state. In this case, a policy can be learned by using inverse reinforcement learning. This is called imitation learning [Abbeel and Ng, 2004] or apprenticeship learning. Here, supervised learning is used with the features being the states of the demonstrations and the labels being the action the expert took.

Mixed integer programming
Mixed integer programming (MIP) is a familiar approach to constraint optimization problems. A mixed integer program requires one or more variables to decide on, a set of linear constraints that need to be met, and a linear objective function, which produces an objective value without loss of generality to be minimized. Thus we have: where y is the objective value and x is the vector of decision variables to decide on. In a MIP, at least one variable has integer domain; if all variables have continuous domains then the problem is a linear program (LP). A is an m x n constraint matrix with m constraints and n variables; c is a n x 1 vector.
Since general MIP problems cannot be solved in polynomial time, a helpful idea is to relax the integer constraints to allow all variables to take real values: an LP relaxation of the problem (1). A series of LP relaxation can be leveraged in the MIP solving process. For minimization problems, the solution of the relaxation provides a lower bound on the original MIP problem.
Equation 1 is also referred to as the primal problem. The primal bound is the objective value of a solution that is feasible, but not necessarily optimal. This is referred to as a 'pessimistic' bound. The dual bound is the objective value of the solution of an LP relaxation, which is not necessarily feasible. This is referred to as an 'optimistic' bound. The integrality gap is defined as: where B P is the primal bound, B D is the dual bound, and sign(·) returns the sign of its argument. The integrality gap is monotonically reduced during the solving process. The solving process combines inference, notably in the form of inferred constraints (cuts), and search, usually in a branch-and-bound framework.

Branch and bound
Branch and bound [Land and Doig, 1960] is the most common constructive search approach to solving MIP problems. In this method, the state space of possible solutions is explored with a growing tree. The root node consists of all solutions. At every node, an unassigned integer variable is chosen to branch on. Every node has two children: candidate solutions for the lower and upper bound respectively of the chosen variable.
The main steps of a standard MIP branch-and-bound algorithm are shown in Algorithm 1.
Defining a comparator for the node priority queue P Q is handled by the node selector (used in P Q.poll() and P Q.add() in Algorithm 1). We explain further below.
Choosing on which variable to branch (variableSelection() in Algorithm 1) is not trivial and affects the time to find solutions and prove optimality. Different approaches to this variable selection are discussed in Section 5. For example, in the SCIP solver 2 , the default variable selection heuristics (the 'brancher') is: reliability branching on pseudo-cost values. The brancher can inform the node selector which child it prefers; it is up to the node selector, however, to choose the child. The child preferred by the brancher, if any, is called the priority child. As described in Section 3, the prioChild property is used as a feature in our work.
In more detail, the left child priority value is calculated by SCIP as: and the right child priority value as: is the average number of inferences at the left child (right child), V r is the value of the relaxation of the branched variable at the root node and V is the value of the relaxation of the branched variable at the current node. An inference is defined as a deduction of another variable after tightening the bound on a branched variable [Achterberg, 2007]. If P L > P R , then the left child is prioritised over the right child, if P L < P R , then the right child is prioritised. If they are equal, then none are prioritised. Note that while this rule for priority does not necessarily hold for all branchers in general, it does hold for the standard SCIP brancher.

Node selection and pruning
In MIP solvers such as SCIP, a node (and its entire sub-tree) is pruned when the solution of the relaxation is worse than the current primal bound (line 7 and 15 in Algorithm 1). However, we can further leverage node pruning to create an approximation algorithm. The goal is then to prune nodes that lead to bad solutions. Correctly pruning sub-trees that do not contain an optimal solution is analogous to taking the shortest path to an optimal solution, which obviously minimizes the solving time. It is generally preferred to find feasible solutions quickly, as this enables the node pruner to prune more sub-trees (due to bounding), with the effect of decreasing the search space. However, we must be aware that there is no guarantee that the optimal solution is not pruned.
Deciding which node is prioritised over another node to explore is defined by the node selector. As is the case for branching, different heuristics exist for node selection. Among these are depth-first search (DFS), breadth-first search (BFS), RestartDFS (restarting DFS at the root after a fixed amount of newly-explored nodes) and BestEstimate. The latter is the default node selector in the SCIP solver from version 6. It uses an estimate of the objective function at a node to select the next node and it prefers going deep into the search tree. 3 Algorithm 1: Algorithm to solve a minimization MIP problem using branch and bound.
input :Root R, which is a node representing the original problem output :Optimal solution if one exists Summarising the node selection heuristics, they can be grouped into two general strategies [Achterberg et al., 2008]. The first strategy is choosing the node with the best lower bound in order to increase the global dual bound. The second strategy is diving into the tree to search for feasible solutions and decrease the primal bound. The second has the advantage to prune more nodes and decrease the search space. In this paper we use the second strategy to develop a novel heuristic using machine learning, leveraging local variable, local node and global tree features, in order to predict as far as possible the best possible child to be selected.

Approach
Recall that our goal is to obtain an approximate MIP node selection policy using machine learning, and to use it in a MIP solver. The policy should lead to promising solutions more quickly in the branch-and-bound tree, while pruning as few good solutions as possible.
Our approach is to obtain a node selector by imitation learning. A policy maps a state s t to an action a t . In our case s t consists of features gathered within the branch-and-bound process. The features consist of branched variable features, node features and global features. The branched variable features are derived from Gasse et al. [2019]. See Table 1 for the list of features. Note that we define a separate left node lower bound and right node lower bound, instead of a general node lower bound, because during experimentation, we obtained two different lower bounds among the child nodes. We constrain the action space for a t to only select one child node or both. This leads to the restricted action space {L, R, B}, where L is the left child, R is the right child and B are both children.
In order to train a policy by imitation learning, we require training data from the expert. Our sampling process of state-action pairs is similar to the prior work of He et al. [2014], with two major differences. The first is that our policy learns only to choose which of a node's children it should select. This encourages finding solutions quickly, as opposed to learning a breadth-first search-like method. The second difference is that we generalize the node selection process by sampling paths that lead to the best k solutions, instead of only the top solution. The reason for this is to obtain different state-action pairs that lead to good solutions compared to only using the top solution, in order to aid learning within a deep learning context. He et al. [2014] check whether the current node is in the path to the best solution. In our work, we check whether the left and right children of the current node are in a path that leads to the best k solutions. If that is the case, then we associate the label of the current node as 'B'; if not, we check the left child or right child and associate the appropriate label ('L' or 'R' respectively). If neither are in such a path, then the node is not sampled.
Pre-processing of the dataset is done by removing features that do not change and standardising every non-categorical feature. We then feed it to the imitation learning component, described next. Our machine learning model is a standard fully-connected multi-layer perceptron (MLP) with H hidden layers, U hidden units per layer, ReLU activation layers and Batch Normalisation layers after each activation layer, following a Dropout layer with dropout rate p d . See Figure 1 for an overview of the operations within a hidden layer. We obtain the model architecture parameters and learning rate ρ using the hyperparameter optimization algorithm hyperopt 4 [Bergstra et al., 2013]. Since during pre-processing features that have constant values are removed, the number of input units can change across different problems. For example, in a fully binary problem, the features left node branch bound and right node branch bound are constants (0 and 1 respectively), while for a general mixedinteger problem this is not the case. The number of output units is three. The cross-entropy loss is optimized during training with the Adam algorithm [Kingma and Ba, 2015].
During policy evaluation, the action B ('both') can result in different operations. See Table 2 for an overview. We define PrioChild, Second and Random as possible operations. PrioChild selects the priority child as indicated by the variable selection heuristic (i.e., the brancher); Second selects the next best scoring action from the ML policy; Random selects a random child. Additionally, when the solver is at a leaf and there is no child to select, then we define three more operations. These are RestartDFS, BestEstimate and Score. The first two are baseline node selectors from SCIP [Gleixner et al., 2018]; Score selects the node which obtained the highest score so far as calculated by our node selection policy.
Obtaining the node pruning policy is similar to obtaining the node selection policy. The difference is that the node pruning policy also prunes the child that is ultimately not selected by the node selection policy. If prune on both = True, then this results in diving only once and then terminating the search. Otherwise, the nodes initially not selected after the action B are still explored. The resulting solving process is thus approximate, since we cannot guarantee that the optimal solution is not pruned.
To summarise, we use our learned policy in two ways: the first is approximate by committing to pruning of nodes, whereas the second is exact: when reaching a leaf, we backtrack up the branch-and-bound tree to use a different strategy.

Results and Discussion
The following standard NP-hard problem instances were tested: set cover, maximum independent set, capacitated facility location and combinatorial auctions. These problems are derived from the generator provided by Gasse et al. [2019]. The instances are different from each other in terms of constraints structure, existence of continuous variables, existence of non-binary integer variables, and direction of optimization.
For the MIP branch-and-bound framework we use SCIP version 6.0.2. 5 As noted earlier, SCIP is an open source MIP solver, allowing us access to its search process. Further, SCIP is regarded as the most sophisticated and fastest such MIP solver. The machine learning model is implemented in PyTorch 6 [Paszke et al., 2019], and interfaces with SCIP's C code via PySCIPOpt 2.1.6.
For every problem, we show the learning results, i.e., how well the policy is learned, and the MIP benchmarking results, i.e., how well the MIP solver does with the learned policy. We compare the policy evaluation results with various node selectors in SCIP, namely BestEstimate (the SCIP default), RestartDFS and DFS. Additionally, we compare our results with the node selector and pruner from He et al. [2014], with both the original SCIP 3.0 implementation by those authors (He) and with the a re-implementation in SCIP 6 developed by us (He6). He et al. [2014] has three policies: selection only (S), pruning only (P) and both (B). For exact solutions, we only use (S). For the first solution found at a leaf, we use (S) and (B). For the experiments with a time limit, we use (P) and (B).
We train on 200 training instances, 35 validation instances and 35 testing instances across all problems. These provide sufficient state-action pairs to power the machine learning model. The number of obtained samples (state-action pairs) differs per problem. For every problem, we use the k = 10 best solutions to gather the state-action pairs. Additionally, we use a batch size of 1024, dynamically lower the learning rate after 30 epochs and terminate training after another 30 epochs if no improvement was found. During training, the validation loss is optimized. The maximum number of epochs is 200.
We evaluate a number of different settings for our node selection and pruning policy, as seen in Table 2. This leads to nine different configurations for the node selection policy and twelve different configurations for the node pruning policy. Note that for the node pruning policy, when prune on both is true, then optimization terminates when a leaf is found; thus the parameter value for on leaf does not matter. We refer to our policies as ML {on both}{on leaf}{prune on both}.
For example, ML PB denotes the node pruning policy that uses PrioChild for on both and BestEstimate for on leaf.
In more detail, we report three different kinds of experiments: 1. We evaluate the policy on every problem by checking the average solving time of each node selector.
2. We check the solution quality in terms of the optimality gap and the solving time of the first solution found at a leaf node. Note that it is possible an infeasible leaf node is found, in that case, a solution is returned that was found prior to the branch-and-bound process, through heuristics inbuilt in SCIP.
3. We select one ML policy, based on the (lowest) harmonic mean between the solving time and optimality gap.
For each instance, we run the solver on each baseline with a time limit equal to the solving time of the selected ML policy and present the obtained optimality gaps. We also show the initial optimality gap obtained by the solver before branch-and-bound is applied, i.e., from the solver's pre-solve prior to search.  For each experiment, we apply the policies on two different difficulty levels: 1. Easy instances, which can be solved within 15 minutes. 2. Hard instances, where we set a solving time limit to one hour. Here, for all three experiments we substitute the optimality gap for the integrality gap, because the optimal solution is not known for every hard instance. Additionally, for the first experiment, instead of checking the solving time, we check the integrality gap. Table 3 provides an overview of the machine learning parameters and results. The baseline accuracy (column 2) is what the accuracy would have been if each sample is classified as the majority class. The test accuracy (column 3) is the classification accuracy on the test dataset. Note that k = 40 is included in the maximum independent set instances, see Section 4.2 for an explanation. The best performing ML model is the model with the settings that achieve the lowest validation loss.
The experiments are run on a machine with an Intel i7 8770K CPU at 3.7-4.7 GHz, NVIDIA RTX 2080 Ti GPU and 32GB RAM. For the hard instances, the default SCIP solver settings are used. For the other instances, pre-solving and primal heuristics are turned off, to better capture the effect of the node selection policy. We use the shifted geometric mean (shift = 1) as the average across all metrics. This is a standard practice for MIP benchmarks. 7

Set cover
These instances consist of 2,000 variables and 1,000 constraints forming a pure binary minimization problem. We sampled 17,254 state-action pairs on the training instances, 2,991 on the validation instances and 3,218 on the test instances. The model achieves a testing accuracy of 76.4%, with a baseline accuracy of 57.5%.
See Table 4 for the average solving time and explored nodes of various node selection strategies. BestEstimate achieves the lowest average solving time at 26.4 seconds; ML RB comes next at 35.7 seconds. We conducted a pair-wise t-test between the mean solving time of BestEstimate and the mean solving time of the other policies. We can only reject the null hypothesis of equal means with p-value below 0.1 for ML RR (p-value: 0.08). For the rest of our ML policies, we can not reject the null hypothesis of equal means. We have also conducted a pair-wise t-test between the mean number of explored nodes of BestEstimate and the mean number of explored nodes of the other policies. Again, we can not reject the null hypothesis of equal means with p-value below 0.05 for any of our ML policies (lowest observed p-value: 0.34). Recall that the values in the tables are the shifted geometric means, while the t-tests by their nature compares (arithmetic) means. The difference between the solving time of BestEstimate and any ML policy here is not statistically significant due to outliers. Figure 2 shows the average solving time against the average optimality gap of the first solution obtained by the baselines and ML policies at a leaf node. Note that the only parameter that is influential for the ML solver is the first parameter, namely on both. The second parameter on leaf and the third parameter prune on both do not influence the solving time or quality of the first solution as the search terminates at the first found leaf that is found. The policy of He et al. [2014] is not included here, due to its outliers. We see here that our ML policy obtains a lower optimality gap at the price of a higher solving time for the first solution.
See Table 5 for the mean optimality gap of the baselines using a time limit for each instance. The time limit for each instance is based on the solving time of the ML policy that achieved the lowest harmonic mean between the average    solving time and average optimality gap across all instances. In this case, ML SRF has the lowest harmonic mean and also achieves the lowest average optimality gap. We conducted a pairwise t-test between the mean optimality gap of our best ML policy and the mean optimality gap of each baseline. We can reject the null hypothesis of equal means with p-value below 0.005 for all baselines.
The initial optimality gap obtained by the solver before branch-and-bound is 2.746. This shows that applying branchand-bound to find a solution has a significant difference.

Maximum independent set
These instances consist of 1,000 variables and around 4,000 constraints forming a pure binary maximization problem. For this particular problem, we noticed that for k = 10, the class imbalance was significant. To combat this, we increased the value k to 40. For k = 10, we sampled 29,801 state-action pairs on the training instances, 5,820 on the validation instances and 4,639 on the testing instances. The class distribution is: (Left: 92%, Right: 4%, Both: 4%). For k = 40, we sampled 82,986 state-action pairs on the training instances, 14,460 on the validation instances and 14,273 on the testing instances. The class distribution is: (Left: 89%, Right: 4%, Both: 7%).
Both the k = 10 and k = 40 models achieve a testing accuracy that is very close to the baseline accuracy, which results in a model that is not able to generalize. See Table 6 for the average solving time and explored nodes of various node selection strategies. We conducted a pair-wise t-test between the mean solving time of BestEstimate and the mean solving time of the other policies. We can reject the null hypothesis of equal means with p-value below 0.1 for all our ML policies. We have also conducted a pair-wise t-test between the mean number of explored nodes of BestEstimate and the mean number of explored nodes of the other policies. We can reject the null hypothesis of equal means with p-value below 0.1 for 12 of 18 of our ML policies.   harmonic mean between the average solving time and average optimality gap of all ML policies. The ML policy has a higher average optimality gap than the baselines for this problem. We conducted a pairwise t-test between the mean optimality gap of our best ML policy and the mean optimality gap of each baseline. We can reject the null hypothesis of equal means with p-value below 0.05 for all baselines, except BestEstimate (p-value: 0.334). The initial optimality gap obtained by the solver before branch-and-bound is 0.999. This shows that He6 policy prunes aggressively at the start, because the average optimality gap obtained by He6 is similar to initial optimality gap. The other policies find significantly better solutions.

Capacitated facility location
These instances consist of 150 binary variables, 22,500 continuous variables and 300 constraints, forming a mixedinteger minimization problem. We sampled 17,266 state-action pairs on the training instances, 3,531 on the validation instances and 3,431 on the testing instances. The model achieves a testing accuracy of 90.1%, with a baseline of 73.1%.
See Table 8 for the average solving time and explored nodes of various node selection strategies. ML RB achieves the lowest average solving time at 111.7 seconds and ML SS the lowest average explored nodes at 1099. We conducted a pair-wise t-test between the mean solving time of BestEstimate and the mean solving time of the other policies. We can not reject the null hypothesis of equal means with p-value below 0.1 for any our ML policies (lowest observed p-value: 0.14). We have also conducted a pair-wise t-test between the mean number of explored nodes of BestEstimate and the mean number of explored nodes of the other policies. We can not reject the null hypothesis of equal means with p-value below 0.1 for any our ML policies (lowest observed p-value: 0.18).  Figure 4 shows the average solving time against the average optimality gap of the first solution obtained by the baselines and ML policies at a leaf node. We see here that the ML policies are clustered and obtain a lower optimality gap than the baselines. Note that RestartDFS and DFS have a very similar optimality gap and solving time, so they are stacked on top of each other. See Table 9 for the optimality gap of the baselines using a time limit for each instance. In this case, ML SST has the lowest harmonic mean between the average solving time and average optimality gap of all ML policies. ML SST also achieves a significantly lower average optimality gap than the baselines. We conducted a pairwise t-test between the mean optimality gap of our best ML policy and the mean optimality gap of each baseline. We can reject the null hypothesis of equal means with p-value below 0.001 for all baselines. The initial optimality gap obtained by the solver before branch-and-bound is 0.325. All policies find a significantly better solution than the first found feasible solution.

Combinatorial auctions
These instances consist of 1,200 variables and around 475 constraints forming a pure binary maximization problem. We sampled 13,554 state-action pairs on the training instances, 2,389 on the validation instances and 2,170 on the testing instances. The model achieves a testing accuracy of 71.7%, with a baseline of 57.0%.
See Table 10 for the average solving time and explored nodes of various node selection strategies. BestEstimate achieves the lowest average solving time at 19.7 seconds. We conducted a pair-wise t-test between the mean solving time of BestEstimate and the mean solving time of the other policies. We can reject the null hypothesis of equal means with p-value below 0.05 for all our ML policies. We have also conducted a pair-wise t-test between the mean number of explored nodes of BestEstimate and the mean number of explored nodes of the other policies. We can reject the null hypothesis of equal means with p-value below 0.1 for ML PS and ML RS. Figure 5 shows the average solving time against the average optimality gap of the first solution obtained by the baselines and ML policies at a leaf node. We see here that the ML policies are not as clustered. The ML P** strategy is the only strategy that delivers Pareto efficient result, having a both a lower optimality gap and a lower solving time. Note that BestEstimate, RestartDFS and DFS have a very similar optimality gap and solving time, so they are stacked on top of each other. See Table 11 for the optimality gap of the baselines using a time limit for each instance. In this case, ML PST has the lowest harmonic mean between the average solving time and average optimality gap of all ML policies. ML PST achieves a very similar optimality gap compared to the baselines. We conducted a pairwise t-test between the mean optimality gap of our best ML policy and the mean optimality gap of each baseline. We can reject the null hypothesis of equal means with p-value below 0.05 for BestEstimate and He6, but not DFS (p-value: 0.94) and RestartDFS (p-value: 0.98). The initial optimality gap obtained by the solver before branch-and-bound is 0.914. All policies find a significantly better solution than the first found feasible solution.

Set cover: Hard instances
To assess how the ML policies perform on hard instances, we use the same trained model of the ML policies that were previously trained on the easier set cover instances. The size of these hard instances are in terms of 4,000 variables and  Table 9: Capacitated facility location: model with on both = Second, on leaf = Score and prune on both = True against baselines, with equal time limits for each problem. The initial optimality gap obtained by the solver before branch-and-bound is 0.325. Pairwise t-tests against the ML policy: '***' p < 0.001, '**' p < 0.01, '*' p < 0.05, '·' p < 0.1.

Strategy Optimality gap
BestEstimate 0.0174 * DFS 0.0126 He6 (both) 0.4678 *** He6 (prune only) 0.2873 *** ML PST 0.0127 RestartDFS 0.0126 2,000 constraints, while the easier set cover instances had 2,000 variables and 1,000 constraints. We evaluated 10 hard instances due to computational limitations, and focused on BestEstimate as a baseline on the node selection policy.
For the node selection policies, we set the time limit to one hour per problem. Figure 6 shows the number of solved instances per policy, and Figure 7 shows boxplots of the integrality gaps for each policy. We use integrality gap here, because we do not know the optimal objective value for all instances. Table 12 shows the average solving time, explored nodes and integrality gap of various node selection strategies. BestEstimate achieves the lowest average solving time at 2256.7 seconds and integrality gap at 0.0481. ML SB has the lowest number of explored nodes at 109298. We conducted a pair-wise t-test between the mean solving time of BestEstimate and the mean solving time of the other policies. We can not reject the null hypothesis of equal means with p-value below 0.1 for all our ML policies (lowest observed p-value: 0.64). We have also conducted a pair-wise t-test between the mean number of explored nodes of BestEstimate and the mean number of explored nodes of the other policies. We can reject the null hypothesis of equal means with p-value below 0.05 for ML PB, ML RB and ML SB. Lastly, we have conducted a pair-wise t-test between the mean integrality gap of BestEstimate and the mean integrality gap of the other policies. We can reject the null hypothesis of equal means with p-value below 0.1 for ML RS and ML SS.
For the pruning policies, Figure 8 shows the solving time plotted against the integrality gap of the first solution obtained by the baselines and ML policies. As before, we see the same trend where the ML policies find a lower gap at the cost of a higher solving time. See Table 13 and Figure 9 for the integrality gap of the baselines using a time limit for each instance. In this case, ML PST has the lowest harmonic mean between the average solving time and average integrality gap of all ML policies. ML PST achieves a significantly lower integrality gap compared to the baselines. We conducted a pairwise t-test between the mean optimality gap of our best ML policy and the mean optimality gap of each baseline. We can reject the null hypothesis of equal means with p-value below 0.001 for all baselines. In all instances, the solver     could only obtain an integrality gap of infinity on the first feasible solution. This means we can not compare the initial integrality gap to the found integrality gaps during branch and bound.

Discussion
We examined three experiments, namely measuring the solving time for an exact solution, measuring the solving time and optimality gap for the first solution found at a leaf node in the branch and bound tree, and setting a low instance-specific time limit to measure what the optimality gap is.
For the first experiment, our method performed better than the baselines for capacitated facility location problem, but worse on the three purely binary problems. The best ML policy is ML RB, although ML SB has an almost equal solving time, while still exploring fewer nodes.
For the second experiment, the prioChild ML policy (ML P**) performed Pareto-equivalent on four of the five problem sets, performing inferior to the baselines only on maximum independent set instances.
For the third experiment, we chose ML SRF on set cover, ML PRF on maximum independent set, ML SST on capacitated facility location, and ML PST on combinatorial auctions and hard set cover instances, in order to measure how well they do against the baselines. These policies were chosen based on the (lowest) harmonic mean between the solving time and optimality gap as was conducted in the second experiment. In four of the five problem sets, our policies had a statistically significantt lower optimality gap than the baselines, while on combinatorial auctions, ML PST did not perform statistically significantly worse than DFS and RestartDFS.
Overall we conclude that the on both = Random configuration of the policy usually performs worse than the other configurations. on both = {PrioChild, Second} both do well. The policies from both the on leaf = {Score, RestartDFS} configurations perform better than those from the the on leaf = BestEstimate configuration. For both prune on both configurations, the policy performed well. Recall that when prune on both is True, then the search is terminated after the first leaf, saving solving time but resulting in a higher optimality gap. That both prune on both configurations lead to effective policies means that we offer the user the choice between a lower optimality gap and higher solving time, or the other way around.
Our method is effective when the ML model is able to meaningfully classify optimal child nodes correctly. By contrast, in the case of the maximum independent set problem, the classification was poor (base acc.: 0.895, test acc.: 0.899, gain: 0.004). Hence, when the predictive model adds value to the prediction, there is potential for effective decision making using the policy; when it does not, inferior performance can be expected.
Lastly, we note that the feature extraction was the biggest contributor to the overall solving time. Applying the predictor had a rather small impact. This means that it is possible to achieve lower solving times by incorporating the entire process in the original C code of SCIP, avoiding the overhead of the Python interface.

Branching
Deciding on what variable to branch on in the branch and bound process is called branching, as was mentioned in Section 2.3. Good branching techniques make it possible to reduce the tree size, resulting in fast solving times. A survey on branching, and the use of learning to improve it, is by Lodi and Zarpellon [2017].
Strong branching [Applegate et al., 1995] is a popular branching strategy, among other strategies such as most infeasible branching, pseudo-cost branching, reliability branching [Achterberg et al., 2005] -used as the default in SCIP -and hybrid branching [Achterberg and Berthold, 2009]. Strong branching creates the smallest trees, as Achterberg et al. [2005] reported that strong branching required around 20 times less nodes to solve a problem than most infeasible branching and around 10 times less nodes than pseudo-cost branching. However, strong branching is the most expensive to calculate, because two LP-relaxations are solved for every variable to assign scores.
Nonetheless, exact scores are not required to find the best variable to branch on. Therefore, it is interesting to approximate the score of strong branching, which can be done using machine learning. Alvarez et al. [2014] was the first to use supervised learning to learn a strong branching model. The features they used to train the ML model consist of static problem features, dynamic problem features and dynamic optimization features. The static problem features derive from c, A and b as stated in Equation 1. The dynamic problem features derive from the solutionx of the current node in the branch and bound tree and the dynamic optimization features derive from statistics of the current variable. They used the Extremely Randomized Trees (ExtraTree) classifier [Geurts et al., 2006]. The results show that supervised learning successfully imitated strong branching, being 9% off relative to gap size, but 85% faster to calculate. Although strong branching was successfully imitated, it was still behind reliability branching in terms of gap size and runtime. Khalil et al. [2016] extended Alvarez et al. [2014] work by adding new features to the machine learning model and by learning a pairwise ranking function instead of a scoring function. The ranking function they used is a ranking variant of Support Vector Machine (SVM) classifier [Joachims, 2006]. Their algorithm solved 70% more hard problems (over 500,000 nodes, cut-off time 5 hours) than strong branching alone. However, the time spent per node (18 ms) is higher than pseudo-cost branching (10 ms) and combining strong branching with pseudo-cost branching (15 ms). This is due to calculating the large number of features on every node.
To overcome complex feature calculation, Gasse et al. [2019] proposes features based on the bipartite graph structure of a general MILP problem. The graph structure is the same for every LP relaxation in the branch-and-bound tree, which reduces the feature calculation cost. They use a graph convolutional neural network (GCNN) to train and output a policy, which decides what variable to branch on. Furthermore, they used cutting planes on the root node to restrict the solution space. Their GCNN model performs better than both Alvarez et al. [2014] and Khalil et al. [2016] for generalizing branching, using few demonstration examples for the set covering, capacitated facility location, and combinatorial auction problems. Moreover, GCNN solved the combinatorial auction problem 75% faster than the method of Alvarez et al. [2014] and 70% faster than the method of Khalil et al. [2016], both for hard problems (1,500 auctions).
Seeing their success, we adopt the same variable features as Gasse et al. [2019].

Node selection and pruning policy
While learning to branch has been studied quite extensively, learning to select and prune nodes has received insufficient attention in the literature. He et al. [2014] used machine learning to imitate good node selection and pruning policies. The method of data collection in that work is by first solving a problem and provide its solution to the solver. Afterwards, the problem is solved again, but now that the solver knows the solution, it will take a shorter path to the solution. The features for learning the node selection policy are derived from the nodes in this path and the features for the node pruning policy are derived from the nodes that were not explored further. This was done for a limited amount of problems as the demonstrations.
He et al. [2014] trained their machine learning algorithm on four datasets, called MIK, Regions, Hybrid and CORLAT. They were able to achieve prune rates of 0.48, 0.55, 0.02 and 0.24 for each dataset respectively. Prune rate shows the amount of nodes that did not have to be explored further relative to the total amount of nodes seen. Their solving time reached a speedup of 4.69, 2.30, 1.15 and 1.63 compared to a baseline SCIP version 3 heuristic respectively for each dataset. Note that the lowest speedup seems to correlate with a low prune rate.
Our work differs from He et al. [2014] by constraining the node selection space to direct children only at non-leaf nodes. Furthermore, we use the top k solutions to sample state-action pairs. By using more than one solution, we can create additional state-action pairs from which the neural network can learn and create a predictive model. Lastly, we include branched variable features, obtained from Gasse et al. [2019]. As seen in Section 4, our approach easily outperforms that of He et al. [2014], in both their original implementation and a re-implementation in SCIP 6.

Conclusion
This paper shows that approximate solving of mixed integer programs can be improved by a node selection policy obtained with offline off-policy imitation learning. In contrast to previous work using imitation learning, our policy is focused on learning to choose which of its children it should select. We apply the policy within the popular open-source solver SCIP, in exact and approximate settings.
Empirical results on four MIP datasets indicate that our node selector leads to solutions more quickly than the stateof-the-art in the literature [He et al., 2014], but not as quickly as the state-of-practice SCIP node selector. While we do not beat the highly-optimised SCIP baseline in terms of solving time on exact solutions, our approximation-based policies have a consistently better optimality gap than all baselines if the accuracy of the predictive model adds value to prediction. Further, the results also indicate that our approximation method finds better solutions within a given time limit than all baselines in four of the five problem classes examined.
This paper shows that learned policies can be Pareto-equivalent or superior to state-of-practice MIP node selection heuristics: heuristics that have been honed by hand over many years. It adds to the body of literature that demonstrates how machine learning can benefit generic constraint optimisation problem solvers.
In MIP terminology, our learned policy constitutes a diving rule, focusing on finding a good integer feasible solution.
The performance on non-binary problem classes like capacitated facility location is particularly noteworthy. This is because, unlike purely binary problems, for non-binary instances, MIP primal heuristics struggle to obtain decent primal bounds [Achterberg et al., 2008]. By contrast, in general for binary instances, the greater challenge is to close the dual bound, and our learned policy also performs well here.
For future work, more study could be undertaken for choosing the meta-parameter k. Values too low add only few state-action pairs, which naturally degrades the predictive power of neural networks. On the other hand, values too high add noise, as paths to bad solutions add state-action pairs that are not useful.
We mentioned during pre-processing that certain features are removed that are constant throughout the entire dataset. This has the consequence of different number of input units in the neural network architecture for every problem. The bigger issue is that this work focused on training a machine learning model for every problem. Future work could include a method to unify a machine learning model that works for all problems. This would make ML-based node selection a more accessible feature for current MIP solvers, like SCIP.
A limitation of this study is that we only performed experiments with the MIP solver SCIP. We chose SCIP, because it is open-source and has a rich amount of documentation. However, another MIP solver Gurobi (proprietary) is generally faster than SCIP and it would be interesting to see how the ML-based node selection policy compares to Gurobi.
Lastly, reinforcement learning, in contrast to imitation learning, is an interesting research direction to create a node selection policy.