Using machine learning for quantum annealing accuracy prediction

Quantum annealers, such as the device built by D-Wave Systems, Inc., offer a way to compute solutions of NP-hard problems that can be expressed in Ising or QUBO (quadratic unconstrained binary optimization) form. Although such solutions are typically of very high quality, problem instances are usually not solved to optimality due to imperfections of the current generations quantum annealers. In this contribution, we aim to understand some of the factors contributing to the hardness of a problem instance, and to use machine learning models to predict the accuracy of the D-Wave 2000Q annealer for solving specific problems. We focus on the Maximum Clique problem, a classic NP-hard problem with important applications in network analysis, bioinformatics, and computational chemistry. By training a machine learning classification model on basic problem characteristics such as the number of edges in the graph, or annealing parameters such as D-Wave's chain strength, we are able to rank certain features in the order of their contribution to the solution hardness, and present a simple decision tree which allows to predict whether a problem will be solvable to optimality with the D-Wave 2000Q. We extend these results by training a machine learning regression model that predicts the clique size found by D-Wave.


Introduction
Quantum annealing is an emerging technology with the potential to provide high quality solutions to NP-hard problems. In this work, we focus on the devices built by D-Wave Systems, Although solutions returned by the D-Wave 2000Q annealer for minimization problems of the type of eq. (1) are typically of very high quality, they are not guaranteed to be optimal.
In fact, there is no guarantee that the solutions returned by a D-Wave device keep any level of accuracy.
In this article, we aim to identify those types of QUBOs that are solvable on D-Wave 2000Q with given hardware parameters, without actually using the annealer. This could help decide on what resources (annealing time, number of reads) should be allocated depending on the problem hardness. We focus on the Maximum Clique (MC) problem, and explore two different types of classifications. First, we train a binary machine learning classification model on basic problem characteristics (such as density of the input graph) to discriminate between solvable and unsolvable instances of MC. The resulting model also allows us to rank the features used for training by importance. Classification is performed using the open source machine learning package scikit-learn in Python [16]. Second, we show that the decision problem of whether an MC instance will be solved optimally by D-Wave can be predicted with high accuracy by a simple decision tree on the same basic problem characteristics. Third, we train a machine learning regression model to predict the clique size returned by the annealer.
Let G = (V, E) be a graph (referred to as original graph) consisting of a vertex set V ⊆ {1, . . . , n} and edge set there is an edge between each pair of vertices of C. The MC problem asks us to find a clique of maximum size, called a maximum clique, which is an NP-hard problem with many applications.
To solve an instance of MC on the D-Wave annealer, we proceed as follows: 1. The task of finding a maximum clique has to be mapped to a formulation of the type of eq. (1). As shown in [3,17], the QUBO function x i ∈ {0, 1} for all i ∈ V , achieves a minimum if and only if the vertices i ∈ V for which x i = 1 form a maximum clique, where the two constants can be chosen as A = 1, B = 2 [17]. It is noteworthy that any function of the type of eq. (1) can be represented as a graph P itself, which is used to embed the QUBO problem into the quantum processor. In this representation, each of the n variables x i becomes a vertex with vertex weight h i , and each edge between vertices i and j is assigned an edge weight J ij . We call P the unembedded graph.
2. To minimize eq. (1) on D-Wave, its corresponding unembedded graph has to be mapped onto the physical hardware qubits on the chip of the D-Wave 2000Q. Since the connectivity of the hardware qubits, called the Chimera graph (see Figure 1), is limited, a minor embedding of P onto the Chimera graph has to be computed. In a minor embedding, each vertex of P (referred to as a logical qubit) is mapped onto a connected set of hardware qubits, called a chain. As such, a coupler weight, called a chain strength, has to be specified for each pair of connected chain qubits. If chosen appropriately, the chain strength ensures that hardware qubits representing the same logical qubit take the same value after annealing is complete. The minor embedding of P onto the Chimera graph is a subgraph of the Chimera graph and thus a new graph P . We call P the embedded graph.
3. After annealing, it is not guaranteed that chained qubits on the Chimera graph take the same value, even though they represent the same logical qubit. This phenomenon is called a broken chain. To obtain an interpretable solution of logical qubits, definite values have to be assigned to each logical qubit occurring in eq. (1). Although there is no unique way to accomplish this task, D-Wave offers several default methods to unembed chains.
Applying machine learning to predict features of a quantum device is a timely area of research. Existing work mostly focuses on gate quantum computing. For instance, machine learning is employed to help tune the behavior of quantum systems, such as computing the overlap of two quantum states [4], quantum circuit optimization [21,9], or readout of states with higher assignment fidelities under noise conditions [13]. Other works address parameter tuning of variational quantum algorithms [19] or the QAOA (quantum approximate optimization algorithm) algorithm of [8], see [22]. The closest to our approach is [14], wherein the authors use machine learning techniques to identify graph problems that are easy to solve using QAOA.
However, they use the quantum gate model and their specific objective is a bit different-to decide whether QAOA or the classical Goemans-Williamson algorithm will perform better on instances of the Maximum Cut problem. In contrast, we compare the quality of the solution found with the quantum algorithm against an optimal solution, and our optimization problem is the Maximum Clique problem.
Somewhat related, in [18,1] the authors use a differential evolution optimization to tune three parameters of D-Wave 2000Q, namely spin reversal, anneal offsets, and chain weights.
However, machine learning methods are not used, and the objective is to optimize parameters, rather than predict performance.
This article is structured as follows. Section 2 introduces the machine learning methodology we employ to predict the behavior of D-Wave 2000Q. Experimental results are given in Section 3, where we consider the classification problem of predicting whether D-Wave 2000Q can solve a particular MC instance, the computation of a simple decision tree allowing us to classify problem instances manually, and the prediction of the clique size returned by the annealer using a machine learning regression model. The article concludes with a discussion in Section 4.

Methods
We aim to predict both if an instance of MC is solvable by D-Wave 2000Q to optimality, as well as the size of the clique which will be found by the annealer.
As test and validation datasets, we first generate a set of Erdős-Rényi random graphs with a number of vertices in {20, . . . , 65}, and a density chosen uniformly at random in [0.01, 0.99]. The upper bound of 65 vertices is determined by the largest size of a complete graph embeddable on D-Wave's Chimera architecture. We specifically do not use any graphs with zero degree nodes.
Next, we compute an embedding of a complete 65 node graph onto the Chimera architecture with the help of the minorminer module of the D-Wave Ocean API [7]. In order to be able to compare and interpret results across different graphs, we keep the embedding fixed in all experiments. The chain strength for the embedding is always computed using D-Wave's uniform torque compensation feature [6]. The best D-Wave result (i.e., the largest clique found) is then determined among 1000 anneals. We use the majority vote technique to unembed all broken chains (in particular, using an unbiased coin flip if a chain is balanced). The split ratio between test and validation datasets is always chosen as 90% (validation) and 10% (testing).

Classification
Our task is to relate graph features to a given binary indicator from D-Wave expressing if an instance could be solved by the annealer to optimality. Several avenues exist to accomplish this task, for instance via (penalized) linear regression or logistic regression, or via more sophisticated machine learning models. The following highlights some of the challenges.
First, in order to have a well-defined decision problem for finding maximum cliques, we require a ground truth determined with a classical method. We define an instance of MC to be solvable by the D-Wave 2000Q if the clique determined by D-Wave has size equal to the maximum clique size found with the function graph clique number of the NetworkX package in Python [10,15], which is an exact solver.
Second, a couple of choices have to be made, both regarding the machine learning model for regression, as well as the set of graph features selected for prediction. We decided to use a decision tree classifier for two main reasons: The classifier achieved good performance in the classification task we consider and, most importantly, it allows us to obtain an interpretable output in the form of a decision tree. The decision tree highlights in what order/importance the features contribute to solvability on the D-Wave 2000Q, and by tuning the decision tree for simplicitly, it allows one to (manually) determine with high probability in advance if an instance is likely solvable.
Third, an assortment of graph-related features has to be selected to serve as inputs to the machine learning model. Those features are selected to cover a wide variety of potential metrics impacting solvability, however the list below is neither exhaustive nor rigorously proven. The following features went into the machine learning model threefold (unless noted otherwise), precisely for the original input graph G, its unembedded graph P of the MC problem, and its embedded graph P on the Chimera architecture (see Section 1): 1. the graph density (Graph Density); 3. the annealing time (in microseconds), which was selected uniformly at random in [1,2000] (Annealing Time).
In total, there were 46 features that were included as inputs to the machine learning model.
Using those features of the training MC instances and their maximum clique result computed by the D-Wave 2000Q annealer, we train a machine learning classifier implemented in the sklearn.tree.DecisionTreeClassifier() class provided in scikit-learn [16]. After performing a grid search across the following parameters, we selected max depth=5, random state=0, and min impurity decrease=0.005. All other parameters were kept at their default values.
To weigh solvable MC instances by D-Wave more heavily than unsolvable ones, the option class weight='balanced' was employed. The option of balanced class weightings was chosen as only about p ≈ 0.11 − 0.13 of the test problems are solvable. Therefore, a model that classifies all instances as unsolvable will still be 87 − 89% accurate. By setting class weight='balanced', we assign a weight to the solvable problems that is inversely proportional to how often they appear in the dataset, thus penalizing incorrectly classified solvable problems by a factor of about 1/p.
During training, we noticed that the decision tree machine learning model appears to have trouble with problems that return a clique size of zero. This was due to a suboptimal QUBO formulation we used for MC, and the problem disappeared when using the formulation of [3,17]. However, similar issues might occur with other problems, highlighting that the present approach might be most suitable for problem classes that can be solved sufficiently well on D- Wave. Additionally, it is challenging to tune the decision tree in such a way as to obtain good performance and interpretable results simulatenously. Applying the decision tree classifier using default parameters usually results in very large trees having many redundant branches, which are poorly interpretable. However, this issue can be alleviated by increasing the Gini impurity (parameter min impurity decrease) while simultaneously decreasing the maximal depth of the tree (parameter max depth).

Regression
In contrast to predicting solvability alone, we also attempt to predict the actual clique size found by the D-Wave 2000Q device from graph and annealer features alone. Therefore, in this section, our aim is not the prediction of a binary decision (solvable, not solvable), but predicting an integer response (precisely, an integer within the range of possible clique sizes {0, . . . , |V |}). For this task, no ground truth (via a classical solver) is needed.
As in the previous section, several regression models are suitable for this task. We chose to predict the clique size returned by D-Wave 2000Q with gradient boosting, a popular machine learning regression model. Whereas random forests build deep independent trees, gradient boosting works by constructing an ensemble of dependent shallow trees, each one improving on the previous one. Although shallow trees by themselves are usually only weak predictive models, combining (or "boosting") them in an ensemble allows for powerful machine learning models. We employed the GradientBoostingRegressor() class [20] in scikit-learn [16]. Similarly to the classification setting, we performed a grid search across selected parameters, and set n estimators=200, and random state=0. All other parameters were kept at their default values.

Experimental results
This section presents our results when training the machine learning models described in Sec-  We first consider a fixed annealing time and a fixed UTC prefactor parameter (Section 3.1), and then generalize the results to random annealing times and random UTC prefactors (Section 3.2). For both cases, the classification and regression problems are considered.

Fixed annealing time and fixed UTC prefactor
We generate a dataset of 47000 MC instances as described in Section 2. In this section, the annealing time was held fixed at 100 microseconds, and the UTC prefactor of D-Wave's uniform torque compensation feature for computing the chain strength was empirically fixed at 0.5, which yielded good approximate cliques with D-Wave.

Classification
We start by looking at the classification problem. After fitting a decision tree classifier with scikit-learn, we obtain the tree displayed in Figure 2.
Several observations regarding Figure 2 are noteworthy. First, the tree is sufficiently simple to allow for a manual classification of MC instances into solvable and unsolvable ones by D-Wave 2000Q. The most decisive feature turns out to be the largest eigenvalue of the adjacency matrix of the input graph. If it is larger than the threshold given in Figure 2, the number of edges in the unembedded problem graph seems to be a good predictor of solvability. This is sensible, since less connections in the unembedded problem tend to facilitate embedding and solving on D-Wave. However, further work is needed to understand how to the 5'th largest eigenvalue of the adjacency matrix of the input graph comes into play. Here, a smaller 5'th largest eigenvalue tends to facilitate solvability. Finally, the second largest eigenvalue of the unbedded problem graph (as before, smaller eigenvalues tend to facilitate solvability) and, moreover, the number of edges in the input graph (smaller instances tend to be more likely to be solvable than larger ones) tend to be good predictors of solvability-a finding which seems straightforward.
Although the importance of the 5'th largest eigenvalue (of the adjacency matrix of the input graph) is a surprising result, the predictive power of the largest and second largest eigenvalues Predicted Not Solvable Solvable  Actual  Not Solvable  3458  654   Solvable  97  497   Table 1: Confusion matrix for evaluating the decision tree ( Figure 2) on the test dataset. Setting of fixed annealing time and fixed UTC prefactor. is sensible, since those are well known to predict a variety of structural properties of a graph, see [5,11]: for instance, the largest eigenvalue is related to the largest degree of a graph. In two follow-up experiments, removing only the 5'th largest eigenvalue from the set of predictors turned out to decrease performance marginally (the accuracy decreases by 0.37%, and the recall decreases by 0.51%). Removing all eigenvalues apart from the largest one seems to decrease performance more substantially (the accuracy drops by 2.9%, but the recall goes up by 0.83%).
An evaluation of the decision tree of Figure 2 on the unseen testing data yields the confusion matrix given in Table 1. We observe that our machine learning model is able to accurately classify solvable instances with a probability of around 83.7%, and unsolvable ones with a probability of around 84.1%.

Regression
Next, we consider the prediction of the clique size that the D-Wave 2000Q annealer will find on certain MC instances, as a function of the features outlined in Section 2. All results in this section were obtained with the gradient boosting regressor of scikit-learn.

Random annealing time and random UTC prefactor
To see if the aforementioned results generalize, we repeated the same experiment with a random annealing time (sampled uniformly at random within the interval of [1,2000] microseconds), and a random UTC prefactor (sampled uniformly at random within the interval [0.5, 3]). The dataset used for classification contained 49000 samples and was generated as outlined in Section 2. We again consider both classification and regression. Note that in contrast to Section 3.1, where only graph features were used to train the machine learning models, in this section we have two additional parameters in our model (annealing time and UTC prefactor), which are related to the quantum annealing algorithm. Figure 4 shows the decision tree we obtain on the test dataset after fitting a decision tree classifier with scikit-learn. It is similar to the one of Section 3.1 in that it is suitably simple to allow one to classify MC instances manually. However, the importance of the features has changed: The number of edges, though previously also included in the tree of Figure 2 (Figure 4) on the test dataset. Setting of random annealing time and random UTC prefactor. for the classification, whereby small chain strengths used to embed a problem are favorable for solvability.

Classification
On the validation dataset, the decision tree of Figure 4 shows a similar performance to the one of the previous section. Results are summarized in Table 2 and show that our classifier is able to classify solvable instances correctly with an accuracy of around 86.2%, and unsolvable ones with an accuracy of around 84.7%.

Regression
Finally, we repeat the machine learning regression with the help of the gradient boosting regressor of scikit-learn. Figure 5 shows the results of this experiment.
On the test dataset, we are able to very accurately predict the clique size which D-Wave 2000Q will find, achieving a low root mean square error of 0.903 ( Figure 5, left). The most important features for regression seem to be chain strength, the density of the original input graph, and the density of the unembedded graph. The smallest eigenvalue of the adjacency matrix of the original input graph seems to play a minor role.

Discussion
In this contribution, we aim to understand some of the factors contributing to the hardness of a problem instance sent to the D-Wave 2000Q annealer. We focus on the MC problem, and train several machine learning models on several thousand randomly generated input problems with the aim to learn features to (a) predict if D-Wave 2000Q will be able to solve an instance of MC to optimality, and (b) predict the size of the clique which the D-Wave 2000Q device will find.
We show that indeed, for specific problem of predicting the clique size, a relatively small number of features of the input problem, its QUBO formulation and its embedding onto the D-Wave Chimera architecture seems to suffice to predict solubility with high precision. Those features can be summarized in a simple decision tree, which even allows one to manually classify MC instances. A regression analysis demonstrated that the clique size the D-Wave 2000Q will find can likewise be predicted with a low root mean square error.
This article leaves scope for a variety of avenues for future work. For instance, it is unknown how these results generalize to other NP-hard problems, how well prediction will work on future D-Wave machines (or even other quantum annealers), and it remains to be investigated if the presented results can be improved by fitting more sophisticated machine learning models.