Quantum-Inspired Classification Algorithm from DBSCAN–Deutsch–Jozsa Support Vectors and Ising Prediction Model

Quantum computing has been suggested as a new tool to deal with large data sets in machine learning applications. However, many quantum algorithms are too expensive to fit on the small-scale quantum hardware available today, and loading big classical data into a small quantum memory remains an unsolved obstacle. These difficulties motivate the study of quantum-inspired techniques using classical computation. In this work, we propose a new classification method based on support vectors obtained from a DBSCAN-Deutsch-Jozsa ranking and an Ising prediction model. The proposed algorithm improves on the standard classical SVM in its scaling with the number of training data during the training phase. The method can be executed on a purely classical computer and can be accelerated in a hybrid quantum-classical computing environment. We demonstrate the applicability of the proposed algorithm with simulations and theory.


Introduction
Quantum machine learning [1][2][3] is an emerging field using quantum information processing for machine learning applications. Various quantum machine learning algorithms have been proposed, such as quantum Boltzmann machines [4], supervised quantum autoencoders [5], reinforcement learning using quantum Boltzmann machines [6], quantum circuit learning [7,8] and quantum methods for linear algebra [9][10][11]. Even though the exponentially large Hilbert space is expected to help tackle the "curse of dimensionality" in machine learning, loading large classical data sets into a small quantum memory is a bottleneck [12,13]. Although this problem can be circumvented in some settings [14], the available quantum information processors are restricted by small qubit counts and high operational noise in the current noisy intermediate-scale quantum (NISQ) [15] era. On the other hand, these quantum algorithms have inspired various new, efficient classical methods [16][17][18][19]. The dequantization of quantum algorithms is driven not only by practical concerns but also by theoretical interest in complexity theory [20].
The support-vector machine (SVM) is a widely used classification algorithm [21][22][23][24]. There are newly developed variants of the SVM [25,26] and several quantum versions of the SVM have been proposed [27][28][29]. One proposal uses the HHL algorithm [9] to invert the kernel matrix [27]. Schuld et al. [28] proposed a QSVM algorithm in which the distance for the kernel classifier is evaluated by single-qubit gate circuits. Recently, Havlicek et al. [29] implemented quantum circuit learning classification and a quantum kernel estimator for the SVM. However, due to the requirements of quantum coherence, the application of these algorithms to large-scale data sets has not been demonstrated yet.
In this work, we propose a new quantum-inspired algorithm for classification tasks. The support vectors are selected from the data by a principle inspired by the classical clustering algorithm density-based spatial clustering of applications with noise (DBSCAN) [30] and the quantum Deutsch-Jozsa (DJ) algorithm [31]. DBSCAN is a commonly used classical clustering algorithm that determines clusters by counting the data points within some distance parameter ε. DBSCAN is efficient, noise-resilient and does not require the number of clusters as an input. DJ is a deterministic quantum algorithm that determines a specific property of a given Boolean function f : {0, 1}^n → {0, 1} with a single oracle call to the function f; the classical deterministic algorithm for the DJ problem requires O(2^n) calls. We combine DBSCAN and DJ to identify support vectors that are close to the classification boundary. The support vectors are then used for prediction based on an Ising model classifier: the prediction labels are identified with spins in an Ising model and each prediction label is aligned with nearby support vectors through spin-spin interactions. The training time complexity of the proposed algorithm scales as O(l^2 d), where l is the number of training data and d is the feature dimension. This compares favorably to the O(l^3 d) scaling of standard kernel support-vector machines, such as LIBSVM used in scikit-learn [21][22][23][24]. The proposed algorithm can be executed on classical computers, while quantum devices could be used to accelerate several subroutines. Numerical verification is presented. We also provide theoretical evidence for the learnability of the algorithm: we show that the training algorithm is described by an integer linear programming problem and derive upper bounds for the VC dimension of the Ising classifier in two special lattices.

Hypothesis Set
In this subsection, we briefly illustrate the hypothesis set and the training and testing algorithms. The flow chart is illustrated in Figure 1; the details are discussed in the following sections. Given the training data set τ = {(x_i, y_i) | i = 1, . . ., l}, where y_i ∈ {±1} are the labels for binary classification, the goal of a classification algorithm is to find a hypothesis f_θ : (x_1, . . ., x_{n_t}) → {−1, +1}^{n_t} with small prediction error, where n_t is the number of test data. The classification algorithm consists of two parts, training and prediction. A training algorithm finds a hypothesis by minimizing some loss function L(f_θ(x_i), y_i). Our hypothesis set consists of the ground states of an Ising model,

(s_1, . . ., s_{n_t + n_sv}) = arg min_s H(s),  H(s) = −∑_{i<j} J_ij s_i s_j − ∑_i h_i s_i,

where the spins range over S ∪ T, with S = {support vectors}, T = {test data}, n_sv = |S| and n_t = |T|. The free parameters of the model are, in general, θ = (J_ij). We use a physics-motivated ansatz to efficiently determine the support vectors and θ from three free parameters, θ = (α, β, ε), as explained in the following subsections. Finally, the training is performed by minimizing the loss function with respect to these three parameters with particle swarm optimization (PSO). The loss function is the cross-entropy

L = −∑_i [q(x_i) log p_{+1}(x_i) + (1 − q(x_i)) log p_{−1}(x_i)],

where q(x_i) = (1 + y_i)/2 and p_{±1}(x_i) is the probability that s_i = ±1 in the Ising ground state. The training data set is fed into the support vector selection algorithm; the selected support vectors are then used in the Ising predictor. The support vector selection algorithm is described in Algorithm 1 and the Ising predictor is described in Section 2.5.
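As a concrete illustration, the cross-entropy loss above can be evaluated directly from the ground-state spin probabilities. The following is a minimal sketch; the helper name cross_entropy_loss, the use of a mean rather than a sum, and the clipping constant are our own choices, not part of the original method:

```python
import numpy as np

def cross_entropy_loss(y, p_plus):
    """Cross-entropy between the target q = (1 + y)/2 and the predicted
    probability p_plus that the Ising ground-state spin is +1."""
    q = (1 + np.asarray(y, dtype=float)) / 2
    p = np.clip(np.asarray(p_plus, dtype=float), 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(-np.mean(q * np.log(p) + (1 - q) * np.log(1 - p)))

# A confident, correct prediction gives a small loss:
loss = cross_entropy_loss([1, -1], [0.9, 0.1])
```

Averaging instead of summing over the data points only rescales the loss and does not change the PSO optimum.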

DBSCAN Algorithm for Determining the Support Vector
The training algorithm is designed by applying the DBSCAN clustering algorithm to a Deutsch-Jozsa decision problem on the training data. The idea is depicted in Figure 2. Consider the hypersphere B_ε(x_i) of radius ε centered at the data point x_i. Near the classification boundary, it is likely that the data in the sphere have two different classification labels; in the interior of each cluster, it is likely that all the data points in the sphere have the same label. Hence, the classification boundary can be identified by a DBSCAN scan through all the data points. For each data point, we query a decision problem: are the labels of the data inside the sphere constant or not? If the number of non-constant spheres around a given data point is large, we select this point as a support vector for the prediction algorithm. This decision problem takes the form of a Deutsch-Jozsa problem and the potential use of the quantum DJ algorithm to enhance the training is discussed in Section 2.4.
After selecting the support vectors from the training data, we compute a kernel matrix (J_sv)_{i,j} = J(x_i, x_j) for x_i, x_j ∈ {support vectors}. This kernel matrix will be used in the prediction algorithm. The matrix elements are, in general, functions of the data positions. We choose J to be the reciprocal of a power law of the distance between data points:

J(x_i, x_j) = 1/||x_i − x_j||^β.  (4)

The training algorithm is described in the pseudo-code of Algorithm 1.
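The power-law kernel of Equation (4) is straightforward to vectorize; a sketch, with the diagonal zeroed out since a spin does not couple to itself (the helper name kernel_matrix is our own):

```python
import numpy as np

def kernel_matrix(X, beta=1.0):
    """Power-law kernel of Equation (4): J_ij = 1 / ||x_i - x_j||**beta.
    The diagonal is set to zero, since a spin does not couple to itself."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    with np.errstate(divide="ignore"):
        J = 1.0 / D ** beta
    np.fill_diagonal(J, 0.0)
    return J

# Two points at distance 2 with beta = 1 couple with strength 1/2:
J = kernel_matrix([[0.0, 0.0], [0.0, 2.0]], beta=1.0)
```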

Algorithm 1: The training algorithm to determine support vectors.
Input: training data {(x_i, y_i)}, hyper-parameters ε, α, β
1 For each training point x_j, query the DJ decision problem: are the labels of the data inside B_ε(x_j) constant?
2 For each training point x_i, count c_i, the number of non-constant spheres B_ε(x_j) that contain x_i
3 Rank the data points by c_i in descending order
4 The top 1/α fraction of data points in the ranking are the support vectors
5 Compute the matrix (J_sv)_{i,j} for x_i, x_j ∈ {support vectors} according to Equation (4)
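Under our reading of Algorithm 1, a purely classical sketch of the support-vector selection could look as follows. The function name, the exact counting convention and the replacement of the DJ query by a direct label check are our own assumptions:

```python
import numpy as np

def select_support_vectors(X, y, eps, alpha):
    """Classical sketch of Algorithm 1: for each point, count how many
    non-constant eps-spheres contain it, then keep the top 1/alpha
    fraction of points in the resulting ranking."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    inside = D <= eps                       # inside[i, j]: x_i lies in B_eps(x_j)
    # b[j] = 1 if the sphere around x_j contains both labels (non-constant)
    b = np.array([int(len(set(y[inside[:, j]])) > 1) for j in range(len(y))])
    c = inside.astype(int) @ b              # c_i: non-constant spheres around x_i
    n_sv = max(1, int(len(y) / alpha))      # top 1/alpha fraction
    return np.argsort(-c)[:n_sv]
```

On two well-separated 1D clusters, the points adjacent to the boundary receive the highest counts and are selected first.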

Integer Linear Programming Formulation
We show that the training algorithm is equivalent to solving an integer linear programming problem. We use the following definitions: B_ε(x_i) is the hypersphere of radius ε centered at x_i; Y_ij = (1/2)(1 + y_i y_j) is the indicator y-correlation matrix of the training labels; Θ is the Heaviside step function. The indicator kernel matrix satisfies K_ij = Θ(ε − ||x_i − x_j||) ∈ {0, 1}, so that K_ij = 1 if and only if x_j ∈ B_ε(x_i). The indicator y-correlation matrix satisfies Y_ij ∈ {0, 1}, with Y_ij = 1 if and only if y_i = y_j. From the definitions, the sphere B_ε(x_j) is non-constant if and only if b_j = 1, where b_j = 1 − ∏_{k,m} [1 − K_jk K_jm (1 − Y_km)]. Then, the counting number returned by the DBSCAN-DJ training procedure is c_i = ∑_j K_ij b_j. Since all c_i > 0, the ranking algorithm selects support vectors that solve the constrained integer linear programming problem

max_α ∑_i α_i c_i subject to ∑_i α_i = n_sv, α_i ∈ {0, 1},

where α_i = 1 if and only if x_i is selected as a support vector. Notice that, if the number of training data is l, the sorting only takes O(l log l) time, while general integer linear programming is NP-complete.
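The claim that simple sorting solves this particular integer linear program can be checked by brute force on a small instance; a sketch (the helper name ilp_optimum is our own):

```python
from itertools import combinations
import numpy as np

def ilp_optimum(c, n_sv):
    """Brute-force the integer linear program: choose alpha_i in {0, 1}
    with sum(alpha) = n_sv, maximizing sum_i alpha_i * c_i."""
    best = max(combinations(range(len(c)), n_sv),
               key=lambda S: sum(c[i] for i in S))
    return set(best)

c = np.array([1, 4, 2, 3])
# Sorting and keeping the top n_sv entries solves the same problem in O(l log l):
top = set(np.argsort(-c)[:2].tolist())
assert top == ilp_optimum(c, 2) == {1, 3}
```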

Quantum-Enhanced Algorithm with DJ
Given a Boolean function f : {0, 1}^n → {0, 1} that is guaranteed to be constant or balanced, and its reversible circuit U_f : |x⟩|y⟩ → |x⟩|y ⊕ f(x)⟩, the DJ algorithm determines whether f is constant or balanced with one call to U_f. A balanced function satisfies f(x) = 1 for exactly half of the possible input strings x. The DJ algorithm applies the operator (H^⊗n ⊗ 1) U_f H^⊗(n+1) to the initial state |0⟩^⊗n |1⟩. The probability of measuring the all-zero string on the first n qubits is

P_0 = |2^{−n} ∑_x (−1)^{f(x)}|^2.

Hence, if f is constant, the probability of obtaining the measurement result 00. . .0 is P_0 = 1. If f is balanced, then P_0 = 0. If f is neither constant nor balanced, then 0 < P_0 < 1.
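The quantity P_0 can be computed classically from the truth table of f, which is a convenient way to check the three cases above; a small sketch (the helper name dj_p0 is our own):

```python
import numpy as np

def dj_p0(f_values):
    """Probability of measuring the all-zero string in Deutsch-Jozsa,
    P_0 = |2^{-n} * sum_x (-1)^{f(x)}|^2, from the truth table of f."""
    f = np.asarray(f_values, dtype=float)
    return float(np.mean((-1.0) ** f)) ** 2

assert dj_p0([0, 0, 0, 0]) == 1.0       # constant function
assert dj_p0([0, 1, 1, 0]) == 0.0       # balanced function
assert 0.0 < dj_p0([0, 0, 0, 1]) < 1.0  # neither: intermediate P_0
```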
Given the data {(x_i, y_i)} in a sphere, we construct the reversible circuit for the function f : x_i ↦ (1 + y_i)/2. This serves as the oracle function in the DJ algorithm. A four-data-point example is shown in Table 1 and Figure 3. For n data points, reading the data and constructing the oracle U_f take O(n) time. However, once the oracle is given, the DJ algorithm can determine whether the function is constant or balanced with a single call. In the case where the data are not constant, there is some probability of obtaining measurement results other than 0. . .0. We could repeat the algorithm several times (the number of repetitions is another hyper-parameter that determines the precision) to obtain a reasonable estimation of the probability that f is constant. This probability then provides an estimation for c_i. For an efficient implementation of the U_f circuit, one could utilize the single-step n-Toffoli gate [32] in a quantum computer. While this scheme works in principle, it is difficult to implement and the overhead is large. In our simulation, we take a simplified oracle consisting of one n-Toffoli gate and one n-Toffoli gate with negative controls, as shown in Figure 4. Each qubit encodes a classical indicator state (1 + y_i)/2 ∈ {0, 1}. Then, a single query to the oracle gives the measurement result ∏_i (1 + y_i)/2, which indicates whether the data are constant or not. This approach is simpler to implement but requires more encoding qubits: the original DJ takes O(log n) qubits for n data points, while this simplified approach takes O(n) qubits. However, it retains a potential acceleration similar to that of DJ, where a single call to the oracle gives the desired result.

Annealing Algorithm for Data Prediction
The classical SVM algorithm constructs w and b from the training data and then predicts each test datum by computing w · x + b. We propose an algorithm that can predict multiple test data points with a single run of the prediction algorithm. The prediction is made by finding the ground state of the Ising model

H = −∑_{i<j} J_ij s_i s_j − ∑_i h_i s_i.

The number of spins is N = n_sv + n_t, where n_sv is the number of support vectors and n_t is the number of test data. The coupling constants are functions of the data positions, J_ij(x_i, x_j), defined in Equation (4). In block notation, J_sv is the n_sv × n_sv matrix coupling support vectors, J_t is the n_t × n_t matrix coupling test data and J_int is the n_sv × n_t matrix coupling the two. Notice that the J_sv matrix is given by the training algorithm and does not need to be computed again.
The magnetic field is defined on the support-vector spins and encodes their training labels, while h_i vanishes for the test spins. The prediction result is given by the ground state of the Ising model. The configurations s_i of the support vectors are dummy variables and are not used for prediction. Intuitively, the model favors parallel spins, since J_ij > 0; hence, we expect the prediction to have the same spin direction as nearby training data. In principle, the Ising ground state can be calculated by any Ising solver. However, the problem of finding the ground state of a general Ising model is equivalent to quadratic unconstrained binary optimization (QUBO), which is NP-hard. In this work, we use adiabatic quantum computing implemented on a quantum circuit simulator to find the ground state [33]. The prediction labels are determined by the sign of p_{+1}(x_i) − p_{−1}(x_i), where p_{±1} is the probability that the label of the point is predicted to be ±1.
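For a handful of spins, the Ising ground state can be found by exhaustive search instead of annealing, which is convenient for checking the predictor on toy data. The sketch below is a stand-in for the adiabatic solver used in the text; J is assumed symmetric with zero diagonal, and the field h that pins the support-vector spins is a hypothetical choice of ours:

```python
from itertools import product
import numpy as np

def ising_predict(J, h):
    """Brute-force the ground state of H = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i.
    Only feasible for a handful of spins; J symmetric with zero diagonal."""
    best_s, best_E = None, np.inf
    for cfg in product([-1, 1], repeat=len(h)):
        s = np.array(cfg)
        E = -0.5 * s @ J @ s - h @ s    # 0.5 undoes the double count in s @ J @ s
        if E < best_E:
            best_s, best_E = s, E
    return best_s

# Two support-vector spins pinned by fields to +1 and -1; the test spin
# (no field) aligns with the support vector it couples to most strongly.
J = np.array([[0.0, 0.1, 1.0],
              [0.1, 0.0, 0.2],
              [1.0, 0.2, 0.0]])
h = np.array([2.0, -2.0, 0.0])
```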

The VC Dimension
We further show that the Ising classifier has a finite VC dimension upper bound in the two simplified toy models of Figure 5. These simplifications can be regarded as regularization methods. The intuition is that, for J_ij > 0, anti-parallel spin configurations have higher energy, making it difficult for them to be ground states. This difficulty limits how many points can be shattered and leads to an upper bound for the VC dimension. The detailed derivation is in Appendix A. We use the following definitions: n_sv is the number of support vectors; n_t is the number of prediction data points; N = n_sv + n_t is the total number of spins in the Ising model; d is the feature space dimension. We assume that n_sv ≤ n_t. For the first toy model, we introduce equal-distance constraints on the support vectors and test data, so that J_ij = 1/||x_i − x_j||^β = J for all (i, j). The Ising prediction model in this case is the fully connected, infinite-dimensional mean-field Ising model

H = −J ∑_{i<j} s_i s_j.  (16)

Notice that this is a strong constraint, requiring N ≤ d + 1, but there is still one continuous parameter J, so the hypothesis set is still infinite. This model is exactly solvable: the eigenenergy is E_m = J(−(1/2)N(N − 1) + 2m(N − m)) for eigenstates with m spins up and N − m spins down. If there are two test data, the anti-parallel state ↑↓ is always an excited state; hence, it can never be part of a ground state. We obtain d_VC = 1 in this case. This bound can be relaxed by considering a model with fewer constraints. In the second toy model, all the support vectors form a fully connected equal-distance lattice with coupling J_sv, all the test data form another fully connected equal-distance lattice with coupling J_t and the distance between support vectors and prediction data is parametrized by a third variable, J_int. Hence, there are three continuous variables in the model.
The interaction between support vectors and prediction data points is restricted to a one-to-one coupling: the i-th support vector couples only to the i-th test spin. The coupling for this model is

J_ij = J_sv for pairs of support vectors; J_ij = J_t for pairs of test spins; J_ij = J_int if test spin j is the partner of support vector i, and zero for the remaining cross pairs.  (18)

For this model, we obtain the result d_VC ≤ (2J_int/J_t) n_sv. The intuition is as follows. If there are more support vectors (large n_sv), the model has more degrees of freedom. If the support vectors are closer to the test data (large J_int), the model has a stronger influence on the test data. Hence, the Ising model can shatter more test data for larger J_int and n_sv. On the other hand, if the distance among the test data is small (large J_t), the strong interactions among the test data make them difficult to shatter.
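The exact solvability of the fully connected model of Equation (16) can be checked numerically: the closed-form eigenenergy E_m must match a direct evaluation of the Hamiltonian for every spin sector. A sketch (helper names are our own):

```python
import numpy as np

def mean_field_energy(s, J=1.0):
    """Direct evaluation of H = -J * sum_{i<j} s_i s_j for the fully connected
    model, using sum_{i<j} s_i s_j = (S**2 - N)/2 with S = sum_i s_i."""
    s = np.asarray(s, dtype=float)
    return -J * (s.sum() ** 2 - len(s)) / 2

def eigenenergy(m, N, J=1.0):
    """Closed-form eigenenergy E_m = J(-N(N-1)/2 + 2m(N-m)) for m up-spins."""
    return J * (-0.5 * N * (N - 1) + 2 * m * (N - m))

# The closed form matches the direct evaluation for every m:
N = 5
for m in range(N + 1):
    s = [1] * m + [-1] * (N - m)
    assert abs(mean_field_energy(s) - eigenenergy(m, N)) < 1e-12
```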

Simulation Results
We demonstrate the classification task for two types of teacher data (linear classification boundary and non-linear classification boundary) using the proposed method. The learning of the optimal (α, β, ε) parameters was performed by minimizing the cost function. Since the (α, β, ε) parameters are associated with a discrete ranking, we chose gradient-free optimization instead of gradient-based optimization. We implemented the Nelder-Mead method and particle swarm optimization (PSO) algorithms. We observed that PSO generally gives better results; hence, we focus on the results obtained by PSO. The hyper-parameter N is the number of particles for PSO. The learned support vectors, classification boundary and learning curves for N = 10, 50, 100 are depicted in Figure 6. The learned parameters are shown in Table 2. In all the cases, the total number of support vectors was fixed at n_sv = 10, with five belonging to +1 and five to −1. The α and ε parameters were fixed to α = 3 and ε = 0.4. The optimization was performed by minimizing the loss function of Equation (3) over β. We observed that the classification could be performed in all experiments. We also observed that, for larger values of N, the optimized loss function had a lower value, but the test accuracy was also lower. This suggests possible overfitting for large values of N in the PSO optimization.
Figure 6. Support vector distribution, prediction probability and classification boundary from particle swarm optimization (PSO) results. The experimental parameters and optimal values are listed in Table 2.
We now compare the proposed method with the scikit-learn SVM. The verification results are shown in Figure 7. We used 28 teacher data (small dots) for each verification. The exact numbers of data used are listed in Table 3. The large sphere symbols are the support vectors.
In the classification using the proposed algorithm, we set the radius of the scan circle to ε = 0.5 (linear data) and ε = 0.6 (non-linear data). The power β of the reciprocal distance among data was β = 1 and the criterion for determining the support vectors from the ranking was α = 3. For the traditional SVM, we used sklearn.svm.SVC with default values. From Figure 7, we find that both the linear and the non-linear classification performed by the proposed method give a classification boundary similar to that of scikit-learn. However, when we changed only the positions of the teacher data without changing their number, there were cases where the data could not be classified well. A possible cause is that the number of teacher data was too small; as a result, the number of support vectors was insufficient and the classification accuracy was reduced. Therefore, we expect a better classification result with a larger number of training data. However, the quantum adiabatic calculation requires n_sv + n_t qubits, where n_sv is the number of support vectors and n_t is the number of test data, and our experiment was limited by this restriction.

Computational Complexity
We investigate the efficiency of the proposed algorithm. Table 4 shows the time complexity of the kernel SVM and the proposed algorithm. The training algorithm is based on DBSCAN; hence, its worst-case complexity is essentially the same as that of DBSCAN, which takes O(l^2) time to compute the distances among l training data points. This is better than the standard kernel SVM, which takes a worst-case time of O(l^3) for the kernel matrix inversion (the empirical scaling is O(l^{2+δ}) for some 0 < δ < 1) [22,23]. The training algorithm also has the structure of a Deutsch-Jozsa problem; hence, it can be quantum-enhanced using an n-Toffoli gate. This speed-up does not change the worst-case scaling of the training complexity. The cost of the prediction part is O((n_sv + n_t)n_t d) for computing the J_t and J_int matrices. This scaling seems worse than that of the kernel SVM but, in practice, this is not an issue: we can divide the test data into n_b batches and run the prediction algorithm once for each batch, so that the scaling becomes O((n_sv + n_t/n_b)n_t d). For example, when n_b = n_t (i.e., we predict one datum at a time, as in the kernel SVM), we recover the same scaling as the kernel SVM. The cost of the prediction part also depends on finding the ground state of the Ising model, which is equivalent to solving a QUBO problem. This, in general, could take an exponentially long time, depending on the solver algorithm. However, to make a reasonable prediction, it is not necessary to find the optimal solution. The Ising prediction can be solved by classical simulated annealing or another QUBO solver and can be potentially accelerated by adiabatic quantum computation or quantum annealing [34]. One might ask whether the DBSCAN-DJ training algorithm could be combined with the kernel SVM prediction; however, since our training algorithm does not produce the w and b vectors, it cannot be used for kernel SVM prediction.
Table 4. Worst-case time complexity of the proposed algorithm compared to the standard kernel SVM [22][23][24]. d is the feature space dimension, l is the number of training data, n_sv is the number of support vectors, n_t is the number of test data and n_b is the number of batches for the test data. T_a is the annealing time to find the Ising ground state, which depends on the choice of Ising solver algorithm.

Conclusions
In this work, we propose a quantum-inspired algorithm for classification. The algorithm has an advantage over the traditional kernel SVM in the training stage. The algorithm can be performed on a purely classical computer and can be enhanced in a hybrid quantum-classical computing environment. We verify the algorithm with simulations and provide its theoretical background. The applicability to more general classification problems and possible improvements will be studied in future works. For example, multi-class classifications could be treated by one-versus-rest or one-versus-one approaches [35]. The optimal value for ε could be determined by other methods [36].
Appendix A

In this appendix, we provide the derivation of the VC dimension of the Ising classifier toy models. First, we show a simple result: one test datum can always be shattered by one support vector; hence, there is a general lower bound d_VC ≥ 1. If there is only one support vector s_1 and one test datum s_2, the Hamiltonian is H = −J s_1 s_2 − h_1 s_1 = −s_1(J s_2 + h_1). Given any target label for s_2, we can choose the sign of the field h_1 so that the ground state has s_1 = s_2 with the desired value. Hence, one point can always be shattered by one support vector and d_VC ≥ 1.
Then, we consider the fully connected Ising model H = −J ∑_{i<j} s_i s_j. The eigenenergy is E_m = J(−(1/2)N(N − 1) + 2m(N − m)) for eigenstates with m spins up and (N − m) spins down. As a function of m, the spectrum is a concave-downward parabola. The doubly degenerate ground states are the all-spin-up and all-spin-down states with energy E_0 = −(J/2)N(N − 1). The highest-energy state occurs at m = N/2, with half the spins up and half down, and has energy E_max = JN/2. Now, consider the first prediction model defined in Equation (16). Let us consider two configurations of the prediction data: (A) half spin-up and half spin-down; (B) all spins aligned in the same direction. To shatter the prediction data, the model has to be able to achieve a ground state when the test data are in the highest-energy configuration, E^A_t = J n_t/2. The best possible choice for the support vectors is all spins aligned in the same direction, with some energy E_sv. The interaction between support vectors and test data then gives E^A_int = −(J/2)(n_t mod 2). The resulting total energy is E^A_total = E^A_t + E^A_int + E_sv = J n_t/2 − (J/2)(n_t mod 2) + E_sv. To predict configuration (A), E^A_total must be lower than E^B_total, where all prediction spins are aligned with all support vectors. The total energy in the second case is E^B_total = −(J/2) n_t(n_t − 1) − J n_t n_sv + E_sv. The requirement E^A_total ≤ E^B_total means J n_t/2 − (J/2)(n_t mod 2) + (J/2) n_t(n_t − 1) + J n_t n_sv ≤ 0. Since only the second term is negative, this condition can never be satisfied for any n_t ≥ 2. Hence, the model can never shatter two points, so d_VC ≤ 1. Since we also have d_VC ≥ 1, we conclude that d_VC = 1.
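The key step above, that an anti-parallel pair of test spins can never appear in a ground state of the fully connected model, can be verified exhaustively for small N; a sketch (helper name ground_states is our own):

```python
from itertools import product

def ground_states(N, J=1.0):
    """All ground states of the fully connected H = -J sum_{i<j} s_i s_j,
    using sum_{i<j} s_i s_j = (S**2 - N)/2 with S = sum_i s_i."""
    configs = list(product([-1, 1], repeat=N))
    energies = [-J * (sum(s) ** 2 - N) / 2 for s in configs]
    E0 = min(energies)
    return [s for s, e in zip(configs, energies) if abs(e - E0) < 1e-12]

# Every ground state is fully aligned (all up or all down), so two test
# spins can never be realized anti-parallel: two points cannot be shattered.
for N in range(2, 7):
    assert all(len(set(s)) == 1 for s in ground_states(N))
```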
Then, we consider the second, relaxed model defined in Equation (18). To simplify the derivation, we assume that n_sv and n_t are both even. The support vectors can be put in the configuration with half the spins up and half down, with the spin-up support vectors one-to-one-coupled to the spin-up test data. The interaction energy is E^A_int = −J_int n_sv and the total energy is E^A_total = E^A_t + E^A_int + E_sv = J_t n_t/2 − J_int n_sv + E_sv. We also have E^B_total = E^B_t + E^B_int + E_sv = −(1/2) J_t n_t(n_t − 1) + 0 + E_sv. The requirement E^A_total ≤ E^B_total then implies n_t ≤ (2J_int/J_t) n_sv. Hence, d_VC ≤ (2J_int/J_t) n_sv. We may also consider the case where all support vectors are aligned in the same direction. In this case, E^A_total = E^A_t + E^A_int + E_sv = J_t n_t/2 + 0 + E_sv and E^B_total = E^B_t + E^B_int + E_sv = −(1/2) J_t n_t(n_t − 1) + J_int n_sv + E_sv; therefore, the upper bound remains the same.