4. Experiments and Results
4.1. Datasets
The dataset we used is composed of a multitude of publicly available graph datasets; this subsection lists all datasets used and where they can be found. Some needed to be converted to the PACE treewidth format from their original format—those are marked with an asterisk (*) in the list below. The list also provides a shorthand name for every dataset. Datasets used are:
Named graphs, referred to as named. (These are graphs with special names, originally extracted from the SAGE graphs database.) Available at github.com/freetdi/named-graphs
In total, 30,340 graphs from these datasets were used.
4.2. Experimental Setup
The three exact algorithms tested in this research output a solution as soon as they find one and then terminate. However, a limit of 30 min of run time was imposed to keep tests within an acceptable time frame; this is the same time limit as used in the PACE 2017 competition. If an algorithm was terminated for exceeding the time limit, it was deemed to have failed to find a solution. For each run, we recorded whether the algorithm terminated within the time limit, the running time (if it did) and the treewidth of the solution. Recording the treewidth is not strictly necessary, since we are primarily interested in which algorithm terminates most quickly, but it is useful auxiliary information and we have made it publicly available. All experiments were conducted on Google Cloud Compute virtual machines with 50 vCPUs and 325 GB of memory. Since all algorithm implementations used are single-threaded, fifty experiments were run in parallel. Where necessary (e.g., for Java-based solvers), each experiment was allocated 6.5 GB of heap space.
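To make the protocol concrete, the following is a minimal sketch of a single run; the solver command and graph path are hypothetical placeholders, and the actual harness we used may differ. It records exactly the quantities described above: whether the run terminated, its running time and (via the solver's output) the treewidth.

```python
import subprocess
import time

TIME_LIMIT = 30 * 60  # seconds, the PACE 2017 time-out

def run_solver(solver_cmd, graph_path):
    """Run one solver on one graph; return (terminated, run_time, stdout)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            solver_cmd + [graph_path],  # e.g., ["java", "-Xmx6500m", ...]
            capture_output=True, text=True, timeout=TIME_LIMIT,
        )
        return True, time.monotonic() - start, result.stdout
    except subprocess.TimeoutExpired:
        # Exceeding the limit is recorded as failing to find a solution.
        return False, TIME_LIMIT, None
```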
The extracted features and experimental results were combined into a single dataset, in order to train the machine learning models. There were 162 graphs on which attempting to run a solver or the feature extractor would cause either the process or the machine where the process was running to crash. Most of the time, these graphs came from datasets that were not meant for exact treewidth computation, such as sat_sr15, and were therefore too big. For instance, all graphs of the sat_sr15 dataset were too hard for any solver to terminate on. For an even more extreme example, graphs 195 through 200 of the PACE 2017 Heuristic Competition dataset were all too big for our feature extractor to load into memory, with graph 195 being the smallest of them at 1.3 million vertices and graph 200 the biggest at 15.5 million. These graphs are three orders of magnitude larger than the graphs the implementations were built to work on—the biggest graph from the PACE 2017 Exact Competition dataset is graph number 198, which has 3104 vertices. Other errors included unusual yet trivial cases like the graph collect-neighbor_collect_neighbor_init.gr from the dataset cfg, which contained only one vertex and which broke assumptions of both our feature extractor and the solvers we used. All such entries were discarded from the dataset. An additional 680 entries were discarded because no solver managed to obtain a solution on them within the given time limit—presumably those problem instances were too hard. The resulting dataset contained a total of 29,498 instances.
In the course of preliminary experiments, it was discovered that the dataset we assembled required some further pre-processing. When each graph in the dataset was assigned a label corresponding to the exact algorithm that found a solution the fastest, a very large class imbalance was detected—tdlib was labeled the best algorithm for about 99% (29,234) of instances; Jdrasil for under 0.2% (55); and tamaki for about 0.7% (209). Under these circumstances, an algorithm selection approach could trivially achieve 99% accuracy by simply always selecting tdlib. To create genuine added value above this trivial baseline, we imposed additional rules in order to re-balance the dataset towards graphs that are neither too easy nor too hard; we call these graphs of moderate difficulty. Graphs were considered not of moderate difficulty if either all algorithms found a solution quicker than some lower bound (i.e., the graph is too easy), or all algorithms failed to find a solution within the allotted time (i.e., the graph is too hard). The reasoning behind this approach is that if algorithms’ run times lie outside of the defined moderate area, there are few gains to be made with algorithm selection anyway. If a graph is too easy, a comparatively weak algorithm can still solve it quickly; if a graph is too hard, there simply is no `correct’ algorithm to select. Formally, if $\ell$ is the lower bound, $u$ is the allotted time (upper bound) and $t_A(G)$ is the run time of algorithm $A$ on graph $G$, then $G$ is excluded if $\max_A t_A(G) < \ell$ or $\min_A t_A(G) \geq u$ holds.
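As an illustration, here is a minimal sketch of this filter; the function and data layout (a per-graph dictionary of solver run times, with time-outs recorded as the upper bound) are our own hypothetical choices, not the actual implementation.

```python
UPPER = 30 * 60  # time-out (upper bound), in seconds

def is_moderate(times, lower, upper=UPPER):
    """times: dict mapping solver name -> run time on one graph,
    with time-outs recorded as `upper`. Returns False when the graph
    is too easy (every solver beats the lower bound) or too hard
    (no solver finished within the time-out)."""
    too_easy = all(t < lower for t in times.values())
    too_hard = all(t >= upper for t in times.values())
    return not (too_easy or too_hard)

# Datasets A, B and C use lower bounds of 1, 10 and 30 s, e.g.:
# dataset_b = {g: t for g, t in run_times.items() if is_moderate(t, 10)}
```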
Three different lower bounds were used—1, 10 and 30 s—and graphs that passed the conditions were stored in datasets A, B and C, respectively. (For the upper bound, we used the time-out limit of 30 min.) This necessitates a distinction between source datasets and filtered datasets: the source datasets are the publicly available datasets that we started with; the filtered datasets are the sum of all graphs from all source datasets which passed through the respective filter. Table 1 breaks down how many graphs from each source dataset remained in each filtered dataset.
The same label-assigning procedure described above was repeated. The resulting class distributions are shown in Table 2. Afterwards, Random Forest models were trained using the Python machine-learning package scikit-learn, using all default settings except for allowing up to 1000 trees in the forest. Both Leave-One-Out (LOO) cross-validation and 5-fold cross-validation were used in order to decrease the importance of randomness in the train-test split. 5-fold cross-validation was repeated 100 times with different random seeds for the same purpose; at the end, the results from all 100 runs were averaged. These two cross-validation methods produced virtually identical results, hence we only present the results for LOO cross-validation. The results presented are for the entire dataset. Our algorithm selector is evaluated based on its predictions for each graph when it was in the test set—that is, a model is never evaluated on a graph that it has been trained on. We deemed the usage of a hold-out set unnecessary, as no hyper-parameter optimisation took place. (For readers not familiar with these technical machine learning terms, we refer to survey literature such as Tsamardinos et al. [49].)
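A minimal sketch of this training and evaluation setup, assuming a feature matrix X and fastest-solver labels y (as a NumPy array) have already been assembled; variable names are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict

clf = RandomForestClassifier(n_estimators=1000)  # all other settings default
# cross_val_predict guarantees that each graph is predicted by a model
# that never saw that graph during training.
predictions = cross_val_predict(clf, X, y, cv=LeaveOneOut())
accuracy = (predictions == y).mean()
```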
An additional model was trained on Dataset B, using a single CART decision tree as a classifier and 50% of the data as a training set. The purpose of this model was not to optimise predictive performance but interpretability, as a Decision Tree model is significantly easier to interpret than a Random Forest model. We hoped that this interpretation would provide insight into what makes solvers good on some instances and bad on others. Dataset B was chosen due to having the best class distribution, with no clearly best algorithm.
An experiment using Principal Component Analysis (PCA) for dimensionality reduction was also undertaken on Dataset B. The purpose was to evaluate to what extent the feature set can be shrunk without significant loss of performance and potentially to provide insight into feature importance through examination of the principal components.
Reflections on filtering. Before turning to the results, we wish to reflect further on the moderate difficulty filtering described above. Recall that graphs which are not of moderate difficulty are those where all the solvers terminated extremely quickly (i.e., more quickly than the lower bound) or all the solvers exceeded the upper bound (here: exceeded 30 min of run time, that is, timed out). As mentioned earlier, we originally introduced this filtering after observing heavy skew in the initial dataset. We did not know a priori what the `correct’ lower bound should be, which is why we tried three different values (1, 10 and 30 s). (Note that, once the bound is chosen, the Random Forest classifier can be trained on our data in less than a minute, although cross-validation takes longer.) As our results show, this filtering is sufficient to obtain a hybrid algorithm that clearly outperforms the individual algorithms. Interestingly, the filtering also mirrors our practical experience of solving NP-hard problems exactly. That is, many different algorithms can solve easy instances of NP-hard problems within a few seconds, but beyond this the running times of algorithms tend to increase dramatically and the relative efficiencies of different algorithms become more pronounced: one algorithm might take seconds, while another takes hours or days. Once inputs are eliminated that are far too hard for any existing exact algorithm to solve, we are left with inputs that are harder than a certain triviality threshold but not impossibly hard—and it is particularly useful to be able to distinguish between algorithmic performance in this zone. A second issue concerns the choice of lower bound. If we wish to use the hybrid algorithm in practice, by training it on newly gathered data, what should the lower bound be? If we do not have any information at all concerning the underlying distribution of running times—which will always be a challenge for any machine learning model—we propose training using a lower bound such as 1 or 10 s. Unseen graphs which are moderately difficult (subject to the chosen bound) will utilise the classifier in the region it was trained for. Others will be solved very quickly by all solvers—a running time of at most 10 s is, for many NP-hard problems, very fast—or all solvers will time out, and then it does not, in practice, matter which algorithm the model chose. In the future work section we consider alternatives to moderate difficulty filtering.
4.3. Experimental Results
This section is divided into five subsections: each of the first three covers the results of experimenting on one of the three datasets generated by setting a different lower bound on algorithms’ run time, as described in Section 4.2. The hybrid algorithm—essentially, the mapping from graph to algorithm that is prescribed by the trained classifier—will be compared against the individual solvers and against an `oracle’ algorithm, which is a hypothetical hybrid algorithm that always selects the best solver. Three performance metrics will be used (a small sketch computing them follows the list):
Victories. A `victory’ is defined as being (or selecting, in the case of the hybrid algorithm or the oracle algorithm) the fastest algorithm for a certain graph.
Total run time on the entire dataset.
Terminations. A `termination’ is defined as successfully solving the given problem instance within the given time. No regard is given to the run time; the only thing that matters is whether the algorithm managed to find a solution at all.
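As indicated above, here is a minimal sketch computing these three metrics from a table of recorded run times; the data layout (times[alg][g], with time-outs recorded as the 30-minute limit) is a hypothetical choice of ours.

```python
LIMIT = 30 * 60  # time-out; failed runs are recorded with this value

def victories(times, alg, graphs):
    """Runs where `alg` was (one of) the fastest."""
    return sum(1 for g in graphs
               if times[alg][g] <= min(times[a][g] for a in times))

def total_run_time(times, alg, graphs):
    return sum(times[alg][g] for g in graphs)

def terminations(times, alg, graphs):
    """Runs that finished strictly within the time limit."""
    return sum(1 for g in graphs if times[alg][g] < LIMIT)
```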
Please refer to Figure 1 for detailed results from these three experiments. We note that the running times for the hybrid algorithm do not include the time to execute the Random Forest classifier. This is acceptable because, in the context of the experiments, its contribution to the running time is negligible: approximately a hundredth of a second for a single graph.
The fourth subsection covers the experiment conducted on Dataset B with a decision tree classifier, while the fifth subsection covers the Principal Component Analysis experiment.
4.3.1. Dataset A
(The information in this subsection can also be found in Figure 1; similarly for Datasets B and C.) The hybrid algorithm selected the fastest algorithm on 1025 out of 1162 graphs, whereas the best individual solver (tdlib) was fastest for 898. The hybrid algorithm’s run time was 87,500 s, while the overall fastest solver (tamaki) required 137,447 s and the oracle algorithm 54,810 s. The hybrid algorithm terminated on 1145 out of the 1162 graphs, whereas the best solver (tamaki) terminated on 1115.
4.3.2. Dataset B
The hybrid algorithm selected the fastest algorithm on 326 out of 427 graphs, whereas the best individual solver (tdlib) was fastest for 192. The hybrid algorithm’s run time was 82,085 s, while the overall fastest solver (tamaki) required 135,935 s and the oracle algorithm 54,466 s. The hybrid algorithm terminated on 413 out of the 427 graphs, whereas the best solver (tamaki) terminated on 380.
4.3.3. Dataset C
The hybrid algorithm selected the fastest algorithm on 261 out of 337 graphs, whereas the best individual solver (tamaki) was fastest for 168. The hybrid algorithm’s run time was 80,697 s, while the overall fastest solver (tamaki) required 134,910 s and the oracle algorithm 54,060 s. The hybrid algorithm terminated on 323 out of the 337 graphs, whereas the best solver (tamaki) terminated on 290.
4.3.4. Dataset B—Decision Tree
A decision tree was trained with the default scikit-learn settings, but it was too large to interpret, having more than 40 leaf nodes. We optimised the model’s hyper-parameters until we obtained a model that was small enough to be easily interpreted, without sacrificing too much accuracy. The final model was built with the following restrictions: any leaf node must contain at least 2 samples; the maximum depth of the tree is 3; and a node is not allowed to split if its impurity is lower than 0.25.
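A sketch of such a constrained tree in scikit-learn, assuming training data X_train, y_train and a matching feature_names list. Note that the impurity restriction corresponds to the min_impurity_split parameter of older scikit-learn releases (removed in version 1.0); current versions would have to approximate it, e.g., via min_impurity_decrease or post-pruning.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(
    max_depth=3,         # at most three levels of splits
    min_samples_leaf=2,  # every leaf must hold at least 2 samples
    # min_impurity_split=0.25,  # "no split below 0.25 impurity";
    # only available in scikit-learn < 1.0 (see the note above)
)
tree.fit(X_train, y_train)
print(export_text(tree, feature_names=feature_names))  # human-readable rules
```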
The resulting hybrid algorithm’s performance was deemed satisfactorily close to that of the Random Forest selector trained on the same dataset. The decision tree selected the fastest algorithm on 151 out of 214 graphs, whereas the Random Forest model selected the fastest for 153. The decision tree selector’s run time was 54,017 s, while the Random Forest selector required 55,696 s. The decision tree selector terminated on 202 out of the 214 graphs, whereas the Random Forest selector terminated on 201.
4.3.5. Dataset B—Principal Component Analysis
Principal Component Analysis with three components was applied to the dataset and used to train a Random Forest classifier, which was trained and evaluated in the same way as the other Random Forest models. The three components cumulatively explained about 90% of the variance in the data; for a detailed breakdown, please refer to Table 3. The trained model had an accuracy of about 73%, compared to the baseline model’s accuracy of 76.5%. While the loss of accuracy is small, our drop-column feature analysis in Section 5 shows many sets of three features that can be used to build a model of similar or higher accuracy.
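A minimal sketch of this PCA experiment; standardising the features before projection is our assumption here, and variable names are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pca = PCA(n_components=3)  # project onto three principal components
model = make_pipeline(StandardScaler(), pca,
                      RandomForestClassifier(n_estimators=1000))
model.fit(X_train, y_train)
print(pca.explained_variance_ratio_.sum())  # ~0.90 in our experiment
```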
5. Analysis
In this section, the experimental results and the machine learning models behind them are analysed, with the intention of deriving insights into why certain algorithm-graph pairings are stronger than others and of determining which features of a graph are the most predictive.
We begin by analysing the importance of features for a Random Forest model trained on 50% of Dataset B. We chose to train a separate model, instead of using one from our cross-validation, because those models are all trained on all but one sample. We chose that dataset because of its class distribution—the two solvers that are strong on average (tdlib and tamaki) have nearly equal results and the third solver (Jdrasil) is still best for a significant number of problem instances, unlike in Dataset A. This guarantees that the hybrid algorithm’s task is the hardest, as the trivial `winner-takes-all’ approach would be the least effective.
We used the feature importance functionality that is built into the scikit-learn package. The results are presented in Figure 2; they indicate that almost all features are important for the classification and that their importance varies within relatively tight bounds.
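Reading these importances off a fitted forest is straightforward; a small sketch, assuming a fitted model forest and a matching feature_names list.

```python
import pandas as pd

# `forest` is a fitted RandomForestClassifier; `feature_names` matches
# the columns of the feature matrix it was trained on.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```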
In order to gain further insight, we decided to exclude certain features from the dataset, retrain the model on the reduced dataset and measure its accuracy against the baseline of the original full model, which is 76.5%. In order to obtain more statistically robust results, we made multiple runs of 10-fold cross-validation and kept their scores. Afterwards, the scores were compared to the baseline of using all features with the Wilcoxon Signed-Rank Test at an alpha of 0.05.
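A sketch of one such comparison, under the assumption that X is a pandas DataFrame of the extracted features; the column names, fold shuffling and repeat counts shown are illustrative.

```python
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cv_scores(X, y, repeats):
    """Repeated 10-fold cross-validation; a fresh shuffle per repeat."""
    clf = RandomForestClassifier(n_estimators=1000)
    return [score
            for r in range(repeats)
            for score in cross_val_score(
                clf, X, y,
                cv=StratifiedKFold(10, shuffle=True, random_state=r))]

baseline = cv_scores(X, y, repeats=50)  # all thirteen features
reduced = cv_scores(X.drop(columns=["variation", "minimum degree"]),
                    y, repeats=50)      # one excluded feature set
stat, p = wilcoxon(baseline, reduced)   # paired test over matched folds
print("significantly different" if p < 0.05 else "not significantly different")
```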
At first, we attempted excluding single features. We did 10-fold cross-validation 50 times for every excluded feature. No result was statistically significantly different from the baseline.
Next, we attempted excluding two features at a time. Again, we did 10-fold cross-validation 50 times for every excluded set of features. Two pairs of features obtained statistically significantly worse accuracy scores—variation and minimum degree, as well as variation and entropy.
Finally, we attempted excluding three features at a time. This time, we did 10-fold cross-validation only 5 times, as the computational cost of doing otherwise was prohibitive. While there were 14 sets of features that led to significantly worse scores, the magnitude of the change was rather small—the worst performance was 74.5% and was achieved by removing minimum degree, maximum degree and variation. Notably, 13 of the 14 sets contained at least two of these three features, and the remaining set contained one of them.
Since removing features seemed to provide us with little insight, we attempted another approach—removing all features except for a small number of designated features. Then we repeated the same procedure: we retrained the model on the reduced dataset and measured its accuracy against the original model, again doing 10-fold cross-validation a multitude of times. We began by selecting only one feature to retain, repeating the training 50 times. The feature minimum degree emerged as a clear winner, reaching 72% accuracy. We highlight that by removing three features at a time we only managed to lower the accuracy to about 74.5%, while a model with only this one feature reached 72%.
Next, we selected pairs of features to keep and did 10-fold cross-validation 50 times. Seven sets of features performed at least as well as the original model; all of them contained variation and/or maximum degree.
Finally, we selected sets of 3 features to keep and did 10-fold cross-validation 5 times. More than 50 sets of features performed at least as well as the original model. Notably, despite the previously demonstrated importance of variation, minimum degree and maximum degree, keeping exactly these three produced results that were around the middle of the pack at 74.5%. However, a large majority of the best results contained at least one, and more often two, of those features.
All experiments also showed models that surpassed the performance of the benchmark model, reaching 78.5% by retaining only mean, variation and maximum degree, compared to 76.5% for the original model. Naturally, we view such results with caution. They could well be the result of chance, especially since we use no validation set; however, another possibility is that the classifiers can be trained more efficiently on smaller subsets of features.
The frequency with which the features variation, minimum degree and maximum degree appear in our analysis indicates that they carry some critically important signal that the classifier needs in order to be accurate. However, it appears that one or two of the features are sufficient to reproduce the signal and adding the third one does not help much.
We also attempted to determine feature importance by examining the Decision Tree model that was built for Dataset B (Figure 3). Our analysis ignores nodes that offer under 50% accuracy and nodes that contain very few (fewer than four) samples, as those are deemed not to bring a significant amount of insight to the analysis.
The very first split in the model already provides a dramatic improvement in accuracy. Its left branch is heavily dominated by tdlib—after that split alone, selecting tdlib would be the correct choice about 70% of the time. On the right side of the split, a similar situation is observed, with tamaki being the correct choice about 72% of the time. This shows that the first split, which sends to the left those graphs where the first quartile of the degree distribution is smaller than or equal to 4.5, is very important for solvers’ performance. A low first quartile is an indicator of a small or sparse graph and, according to the model, tdlib performs better on those, while tamaki seems to cope better with larger or denser graphs.
There are two paths that lead to Jdrasil being the correct label, both of which require the variation coefficient feature to exceed a fairly high threshold (0.603 and 0.866, respectively). This leads us to believe that Jdrasil performs well on graphs where there is significant variation in the vertices’ degrees. One of these paths also requires the graph to have more than 2184 edges; otherwise, tdlib is selected instead. This again indicates that tdlib is better at solving smaller or sparser graphs, while Jdrasil can deal with variability in larger graphs too.
One thing worth highlighting is that tdlib also seems to excel on graphs where the degree distribution has a low first quartile and the minimum degree is low. This also confirms our belief that tdlib excels on smaller and sparser graphs.
6. Discussion
In this section, the experimental results and analysis are discussed and some general insights are derived. One such general insight is the relative scarcity of graphs of moderate difficulty. Depending on the definition of `moderate’, only between 1 and 3 percent of the graphs we tested could be considered as such. A likely explanation for this is that most datasets were assembled before the 2016 and 2017 editions of the PACE challenge, which introduced implementations that were multiple orders of magnitude faster than what was previously available and thereby rendered many instances too easy.
Moving to specific points, we start by noting the promising performance of our hybrid algorithm compared to individual solvers. Our experiments clearly demonstrate the strength of the algorithm selection approach, as it outperformed all solvers on all datasets and all performance metrics, even though our underlying machine learning model is (deliberately) simple. However, the comparison with an omniscient algorithm selector also demonstrates that there is room for improvement in our framework. Section 7 lays out some suggestions for how our work can be improved upon.
Next, we reflect on the question: which features of a problem instance are the most predictive? We utilised three different approaches to answering this question—measuring how much each feature reduces impurity in our Random Forest model on Dataset B, measuring the performance of models that were only trained on a subset of all features and analysing the Decision Tree model we trained on Dataset B. Overall, our three approaches provided different and sometimes conflicting insights but some insights were confirmed by multiple approaches. One such insight is that there do not seem to be critically important individual features, as it appears that many different features can carry the same information, partly or in whole, which makes determining their individual importance difficult.
Measuring the impurity reduction of all features indicated that every feature has a positive contribution to prediction. The most important feature, degree variation coefficient, was only about 3.5 times more important than the least important feature, median degree. The overall distribution of feature importances is such that the five most important features together account for about 50% of the importance, while the remaining eight account for the rest. Our interpretation of these results is that all features are significant contributors and that, while some features are more important than others, there are no clearly dominant features that eliminate the need for the others.
The feature removal analysis indicated that three features—variation, minimum degree and maximum degree—all seem to be related in that removing all of them significantly reduces performance and performance increases as more of them are added, until all three are added, which does not provide a significant improvement in performance. Our interpretation is that there is a predictive signal that is present only in those three features but any two of them are sufficient to reproduce it. Besides this insight, feature removal provided little that we could interpret and that was not in direct conflict with other parts of the same analysis.
The analysis of the Decision Tree model indicated that size, density and variability are the most important characteristics of a graph; however, those could be expressed through different numerical features. For instance, the first quartile of the degree distribution (q1), which could be an indicator of graph size or density, was by far the most predictive feature in our Decision Tree model. To make discussing this easier, we will separate the concepts `characteristic’ and `feature’: characteristics are more general properties that can be represented by many different features, while features are specifically the numbers we described in Section 3.1.
Combining the results from all three approaches is difficult, as they often conflict. However, the insight from the Decision Tree analysis that size, density and variability are the most important characteristics of a graph, no matter what specific numerical proxy represents them, is consistent with results from the other analysis approaches. The feature removal analysis indicated that variation, minimum degree and maximum degree together carry an important signal—these features can be considered a proxy for size, density and variability. The impurity reduction results are also consistent with this, as they showed variation, density and minimum degree to be the three most important features. The relative unimportance of which proxy is used for these characteristics of a graph is also demonstrated by q1: while that feature is by far the most important in the Decision Tree analysis, the other two analytical approaches did not show it to be particularly important, as it was only seventh out of thirteen in impurity reduction and did not appear even once in the sets of important features yielded by the feature removal analysis.
Our experiments also yield some insights into the strengths and weaknesses of the solvers. One insight that becomes clear from the class distribution in both our unfiltered dataset and the three filtered datasets (as per Section 4.2) is that the solver tdlib is dominant on `easy’ graphs. In the unfiltered dataset, tdlib was the best algorithm for 99% of graphs, a share which became progressively smaller as a higher lower bound was imposed on difficulty. At the lower bound of 1 s (Dataset A), tdlib was the best algorithm for 77% of graphs; at 10 s (Dataset B) that number went further down to 45%; and at 30 s (Dataset C) it was only 38%. Notably, tdlib kept the `most victories’ title in the unfiltered dataset, Dataset A and Dataset B; however, in Dataset C, tamaki dethroned it with 49% versus 38%. Undeniably, going from a 99% dominance to no longer being the best algorithm as difficulty increases tells us something about the strengths and weaknesses of the solver. This is also confirmed by our analysis of the Decision Tree model, which clearly showed that tdlib has an aptitude for smaller and sparser graphs.
Another insight is tamaki’s robustness. It is the best solver in terms of both terminations and run time on all three datasets, despite not being the best in terms of victories on Datasets A and B. Most interesting is tamaki’s time performance on Dataset A: while tdlib has more than four times as many victories on that dataset as tamaki, tamaki’s run time is still about 30% better than tdlib’s. Our analysis of the Decision Tree model showed tamaki having an affinity for larger or denser graphs, complementing tdlib’s strength on smaller or sparser graphs. The weakness of tamaki on sparse graphs that we discovered is consistent with the findings of the solver’s creator [14].
Finally, Jdrasil seemed to have a tighter niche than the other solvers—specifically, larger graphs with a lot of variability in their vertices’ degrees. However, Jdrasil clearly struggles on most graphs, as evidenced by its always coming in last in our experiments on all datasets and all performance metrics.
Summarizing, our analysis suggests that the most important characteristics of a graph are size, density and variability, and that, when focussing on these characteristics, the three algorithms have the following strengths: tdlib—low density and size; tamaki—high density and size, low variability; Jdrasil—high density and size, high variability.