Coreset Clustering on Small Quantum Computers

Many quantum algorithms for machine learning require access to classical data in superposition. However, for many natural data sets and algorithms, the overhead required to load the data set in superposition can erase any potential quantum speedup over classical algorithms. Recent work by Harrow introduces a new paradigm in hybrid quantum-classical computing to address this issue, relying on coresets to minimize the data loading overhead of quantum algorithms. We investigate using this paradigm to perform $k$-means clustering on near-term quantum computers, by casting it as a QAOA optimization instance over a small coreset. We compare the performance of this approach to classical $k$-means clustering both numerically and experimentally on IBM Q hardware. We are able to find data sets where coresets work well relative to random sampling and where QAOA could potentially outperform standard $k$-means on a coreset. However, finding data sets where both coresets and QAOA work well--which is necessary for a quantum advantage over $k$-means on the entire data set--appears to be challenging.


Introduction
Machine learning algorithms for analyzing and manipulating large data sets have become an integral part of today's world. Much of the rapid progress made in this area over the past decade can be attributed to the increased availability of large data sets that machine learning algorithms can train on and advances in hardware which accelerate the training process, such as the GPU. The last decade has also witnessed the emergence of prototype quantum computers which have been implemented using various qubit technologies, including superconducting circuits, trapped ions, neutral atoms, and solid state devices [1][2][3][4]. The intersection between the emergent machine learning and quantum computing fields has produced many new algorithms which promise further advances in data processing capabilities.
Quantum algorithms such as HHL for solving linear systems [5] and Grover's algorithm for database search [6] are known to achieve exponential and quadratic speedups over their classical counterparts, respectively. However, many quantum algorithms for machine learning, including HHL and Grover search, also assume the use of an input data model (e.g. quantum RAM [7]) which allows them to easily load classical data onto the quantum processor. This model is currently unrealistic [8]. Without access to a quantum RAM, which presents the state |ψ on demand, we resort to using a quantum circuit generate the desired state |ψ . Unfortunately, as the size of classical data sets grows to millions or billions of data points, the time and space requirements necessary to load the data may erase any potential quantum speedup.
Recent work by Harrow [9] introduces a new paradigm of hybrid quantum-classical computing to address this issue. The main idea is to take a large classical data set X and use a classical computer, potentially aided by a small quantum processor, to construct a coreset: a smaller, weighted data set (X , w) which sufficiently summarizes the original data. If the coreset is small enough (but still a faithful representation of X), we could hope to optimize under the coreset with a small quantum computer. Prior work has focused on finding coreset construction algorithms that allow machine learning models to train on the coreset while remaining competitive with models that are trained on the entire data set [10][11][12]. In [9], Harrow proposes three new hybrid algorithms which cover a range of machine learning tasks including maximum a posteriori estimation, inference, and optimization.
In this paper we evaluate the first of these new algorithms and adapt it to noisy quantum computers. The general version of this algorithm takes a data set X and cost function f as input, uses a classical computer to construct a coreset (X , w), and then uses a quantum optimization algorithm to perform maximum a posteriori estimation ( [9], Algorithm 1). A specific instance of the algorithm is also outlined which solves the k-means clustering problem ( [9], Algorithm 1.1). The specific case of k-means is the focus of this paper. At a high level, this algorithm solves k-means clustering on a data set X by first constructing a coreset (X , w), and then optimally clustering (X , w) with Grover search.
However, Grover search is unlikely to be tenable on noisy devices [13]. As proposed in [9], we reformulate the coreset clustering problem as a Quantum Approximate Optimization Algorithm (QAOA) [14] instance. QAOA is variationally optimized and is able to tolerate some noise when coupled with a robust classical optimizer. For simplicity, we study 2means clustering specifically. Algorithm 1 summarizes our approach.
Our core contributions are as follows: • We implement algorithms for coresets and evaluate their performance on real data sets.
• We cast coreset clustering to a Hamiltonian optimization problem that can be solved with QAOA. We also demonstrate how to break past the assumption of equal cluster weights.
• We benchmark the performance of Algorithm 1 across six different data sets including real and synthetic data. We compare the 2-means clustering solutions found by coresets+QAOA with the solutions given by standard algorithms for solving 2-means on both the full data sets and the coresets. We find that some data sets are better suited to coreset summarization than others which can play a large role in the overall performance of the algorithm.
In our evaluations, the size of the coresets constructed in step 1 of Algorithm 1 are limited by the size of the quantum computer used in step 3. For some data sets, this restriction on coreset size negatively impacts the performance of clustering on the coresets when compared to k-means on the entire data set. Nonetheless, we are able to at least show cases where QAOA-based clustering on the coresets is competitive with the standard 2-means algorithms on those coresets. This suggests that Algorithm 1 will improve when quantum computers can support more qubits and more gates, thereby allowing them to utilize larger coresets. However, our evaluations also suggest that either high m (and thereby many qubits) or a high order QAOA implementation (with many gates) will be needed for a possible quantum advantage on typical data sets.
The rest of the paper is organized as follows. In Section 2 we give an overview of the k-means clustering problem. Section 3 discusses coresets for k-means. Section 4 describes the reduction from k-means to QAOA. We present and discuss our results in Sections 5 and 6.

k-means Clustering
The k-means clustering problem takes an input data set x 1 , ..., x n ∈ R d and aims to identify cluster centers µ 1 , ..., µ k that are near the input data. Typically, k n; for simplicity we focus on k = 2. Foreshadowing quantum notation, we will prefer to denote our cluster centers as µ −1 and µ +1 . Then, the objective of this 2-means problem is to find the partitioning of [n] into two sets S −1 and S +1 that minimizes the squared-distances from the closest cluster centers: While the cluster centers µ −1 and µ +1 appear to be free variables, it can be shown [15] that they are uniquely determined by the S −1 and S +1 partitionings in order to minimize Eq. (1). In particular, these cluster centers are the centroids, Thus, in principle the objective function in Eq. (1) can be minimized by evaluating all 2 n possible partitionings of [n] into S −1 and S +1 . However, this brute force exponential scaling is impractical, even for modest n. Instead, k-means is typically approached with heuristics like Lloyd's Algorithm [16], which does not guarantee optimal solutions in polynomial time, but performs well in practice. Moreover, relatively simple improvements to the initialization step in Lloyd's Algorithm lead to performance guarantees. Notably, the k-means++ initialization procedure guarantees (8 ln k +2)-competitive solutions in the worst case [17].
For many data sets, Lloyd's Algorithm augmented with k-means++ initialization rapidly converges to close-to-optimal (often optimal) solutions. However, in general, finding the optimal cluster centers is a computationally hard problem. Even for k = 2, the problem is NP-hard [18,19].

Coresets for k-means
An -coreset for k-means is a set of m (typically << n) weighted points such that the optimal k-means clustering on the coreset is within (1 + ) of the optimal clustering on the entire data set of n points. A coreset data reduction is appealing because we would prefer to solve a problem over m << n points. The size m needed depends on the target error , k, the data set dimension d, and the probability of success δ. We implemented two coreset procedures. The first, Algorithm 2 of [10], gives a coreset size of m = O( ). The second, Algorithm 2 of [20], gives a coreset size of m = O( −2 k log k min( k , d)).
One might hope to pick a target and then pick m accordingly. However, the exact expressions-including constants-for the scaling of m are not readily available. Regardless, our goal is simply to probe the limits of small current-generation quantum computers, which have at most a few dozen qubits. Therefore, we approach coreset construction in the reverse direction by first choosing m and then evaluating the performance of the resulting coreset. As discussed in the next section, m will equal the number of qubits we need. Therefore, we choose m ∈ {5, 10, 15, 20} for our evaluations.
For implementations of the coreset algorithms in both, [10] and [20], an (α, β) bicriterion approximation is required. We use D 2 sampling, which is the initialization step for k-means++ [17], as our bicriterion approximation. We chose β = 2, which corresponds to picking βk = 4 centroids in the bicriterion approximation. For each data set, we selected the best (lowest cost) approximation from 10 repeated trials, as is also done by Scikitlearn's default implementation of k-means.
Through our evaluations, we did not find significant differences between the performances of the [10] and [20] coreset algorithms. In fact, we did not observe a significant improvement over random sampling either, except for a synthetic data set with a few rare and distant clusters.

QAOA
The Quantum Approximate Optimization Algorithm (QAOA) [14] is a quantum variational algorithm inspired by the quantum adiabatic algorithm [21]. The adiabatic theorem implies that, for large enough T , starting in the |+ ⊗m state and performing time-evolution under the time dependent Hamiltonian: For concreteness we assume that H P is diagonal such that |z sol is a computational basis state. One can approximate this adiabatic evolution with a finite Trotterized evolution In the limit p → ∞, the Trotter decomposition and the adiabatic theorem imply that there exist β and γ such that this approximation is exact; a priori, however, it is not obvious what one should choose for these parameters for finite p to tighten the approximation in Eq. (2). Therefore, QAOA combines the ansatz of Eq.
(2) with a classical optimization loop, performing the maximization of the function By the variational principle, for large enough p the arg max of this optimization will approximate |z sol . In practice, a quantum computer evaluates F (|β, γ ) (or e.g. gradients of F (|β, γ )), whilst a classical computer uses the function evaluations to heuristically optimize the function. In the remainder of this Section we describe how one can interpret the solution of the k-means problem as the highest excited state of a diagonal Hamiltonian, which can be heuristically found using QAOA. We note that prior work in [22] also proposed and experimentally demonstrated clustering via QAOA instance; our work and reduction is more specific to k-means clustering.

Hamiltonian Interpretation of k-means: Equal Cluster Weights
Under the weighted vectors of a coreset of size m, the 2-means objective function is similar to that of Eq. (1), but now each input vector x i has an associated weight w i . The modified objective function is then where the cluster centers are now also weighted such that: Here, W ±1 = i∈S ±1 w i . We also define W ≡ W −1 + W +1 . As shown in Appendix A, minimizing Eq. (3) is equivalent to maximizing the weighted intercluster distance In this section, we consider the case where the optimal clusters have equal weights, W +1 = W −1 . Often, this is a good approximation, because In this expression, only the third term is dependent on our S −1 , S +1 partitioning. Therefore, the 2-means objective for equal cluster weights is equivalent to maximizing the (re-scaled) third term: This expression can be interpreted as a weighted MAX-CUT problem on a complete graph. Each vertex in the graph represents a coreset point x i , and the weight of the (i, j) edge in the graph is −w i w j x i · x j . Our objective is to assign labels ±1 to each vertex in the graph such that the sum of edge weights over cut-crossing edges is maximized. Figure 1 depicts an example for a coreset containing five points. Although all 10 edges have weights, we only sum the weights of edges crossing the cut. For the particular coloring (partitioning) in Figure 1, six of the edges cross the cut.
In order to maximize Eq. (5) using QAOA, we must encode it as a Hamiltonian. For is 1 for Z i = Z j and 0 for Z i = Z j . Therefore, Eq. (5) corresponds to the energy of the problem Hamiltonian: which is maximized at the same assignment of {Z i } as the offset-and-scaled Hamiltonian:

Hamiltonian Interpretation of k-means: Unequal Cluster Weights
Now, we once again begin with Eq. (4): Unlike in Sec. 4.2, however, we now longer assume that W +1 = W −1 = W/2. We can instead write: We now examine the ratios and we consider the Taylor expansion of this expression around x ≡ W −1 /W = W +1 /W = 1/2, i.e. around the equal cluster weight scenario of Sec. 4.2. The motivation for using a Taylor expansion is that the resulting polynomial has operational significance as a Hamiltonian and can be written in terms of Z i 's. 1/x and the three leading orders of its Taylor expansion are shown in Figure 2. The zeroth order Taylor expansion, 1/x ≈ 2, corresponds to the case of equal cluster weights. Keeping terms through first order gives 4 − 4x. Informally, this linear approximation is close to 1/x for 0.4 < x < 0.6. Therefore, we could reasonably hope for the first order Taylor expansion to work well for slightly unbalanced optimal clusters, perhaps with 3:2 imbalances.
With this motivation, we expand Eq. (8) and find that: Performing a similar translation into Pauli Z operators as in Sec. 4.2, these terms can be expressed in Hamiltonian form and simplified to Plugging these into Eq. (7) gives The indicator functions can also be rewritten as Pauli expressions, resulting in the final problem Hamiltonian: data set Description CIFAR-10 10k images (32x32 pixels) from CIFAR-10 data set [27]. 1k images per category. COCO 5k images from Common Objects in Context validation data set [28]. Images translated into feature vectors of dimension 512.
Synthetic 40k 512-dimensional points drawn from 11 random Gaussian clusters. Ten clusters contribute 5 points each, last cluster has majority. Interestingly, the Eq. (9) Hamiltonian only has quadratic terms, which is no more difficult to implement than the zeroth order case that assumes equal cluster weights. However, for higher order Taylor expansions, the degree of the Hamiltonian will increase; a second order Taylor expansion will have a cubic Hamiltonian, a third order Taylor expansion will have a quartic Hamiltonian, and so forth. Table 1 describes the six data sets we used in our evaluations. The Epilepsy, Pulsars, and Yeast data sets are part of the UCI Machine Learning Repository [23]. For Common Objects in Context (COCO), the image pixels were preprocessed with the img2vec [24] library. This library translates the pixels of each image into a 512-dimensional feature vector using a Resnet-18 model [25] pretrained on the ImageNet data set [26].

Evaluation Methodology
We evaluated 2-means on each of these data sets using Scikit-learn [32]. We used Scikitlearn's default implementation of k-means, which initializes clusters with best-of-10 k-means++ [17] and terminates either after 300 iterations or upon stabilizing within 10 −4 relative tolerance. This default implementation is an aspirational, though realistic, target against which to compare QAOA. The cost we report is the "sum of squared distances to nearest cluster" objective function in Eq. (3), also referred to as inertia [33].
On each data set, we computed m= 5, 10, 20, and 40 uniformly random samples, as well as m-coresets using Algorithm 2 of [20]. Then, we ran the 2-means clustering implementation on this coreset, and evaluated the cost of the output cluster centers, evaluated against the entire data set. For each data set, we ran this process 10 times. On five of the six data sets, we report best-of-10 results, since in practice one would indeed choose the best result. For the Synthetic data set, we report average, best, and worst costs, to emphasize that the coreset algorithm is consistently better than random sampling.
In addition to these classical results, we took the best m = 5 and m = 10 coresets and constructed Hamiltonians for them, as described in Sec. 4. For m = 10, we constructed Hamiltonians with zeroth order, first order, second order, and infinite order Taylor expansions. For m = 5, we only constructed the zeroth order Hamiltonian (i.e. assuming equal cluster weights as in Sec. 4.2), because this is a realistic experimental target on current devices, as evaluated in Sec. 5.4. For each Hamiltonian, we found its highest-energy eigenstate by brute force over the 2 m basis states, though in a real test, we would approximately optimize with QAOA. This is the solution one would expect to find with Grover's search algorithm, and it can also be interpreted as a bound on the best possible QAOA performance. The highest eigenstate is the weighted MAX-CUT solution or equivalently, the best cluster assignment on the coreset. For the infinite order Hamiltonian, this highest eigenstate is truly the optimal clustering of the coreset, whereas running 2-means on the coreset does not guarantee an optimal solution. However, note that the optimal clustering on the coreset does not necessarily correspond to the optimal clustering on the whole data set, because the coreset is not completely representative of the underlying data set. Figure 3 shows results, using the methodology above, on our six data sets. The green bars, which are entirely classical, correspond to 2-means on the whole data set, a random sample, or a coreset via Algorithm 2 of [20]. Interestingly, for most of our benchmarked data sets, [20] does not appear to outperform simple random sampling. The exception is on the Synthetic data set, which has 39,950 points in a localized Gaussian cluster, along with 50 points in ten distant clusters. On this data set, random sampling does not perform well, because the random samples are unlikely to capture points outside the big cluster. This suggests we may see gains from using coresets on anomaly-detection oriented data sets.

Coreset and QAOA Bound Results
The blue bars correspond to the energy-maximizing eigenstate of each Hamiltonian, which is constructed from a coreset of m elements and a given Taylor expansion order. QAOA attempts to find this energy-maximizing eigenstate, but it is only an approximate solver. Therefore, the blue bars can be interpreted as a lower bound on the cost of a QAOA-based solution. However, recall that the ultimate cost is with respect to the whole data set, and the coreset is only an approximation of this data set. Therefore, a suboptimal eigenstate of the Hamiltonian could actually outperform the supposedly-optimal eigenstate.
We observe this behavior in a few data sets. Focusing on the blue bars only, we see that the cost is almost always lower (or the same) when we increase m or increase the order of the Hamiltonian. However, for the Epilepsy data set, we see that the dotted (m = 10) blue bars have lower cost for lower Taylor expansion orders (i.e. more approximate Hamiltonians). In fact, the m = 5 diagonal-striped blue bar has the lowest cost among blue bars. A possible explanation for this is that the m = 10 coreset (green dotted bar) is poor, so deviating from its optimal clustering is actually favorable for the overall cost. This does open the broader possibility that QAOA's approximate optimization could outperform brute force optimization, but this may simply be an artifact of the limitations of coresets.
On all six data sets studied, the lowest clustering cost is achieved by running 2-means on the whole data set. The main barrier to a possible 'quantum advantage' seems to be that for small m, the coreset approximation is too coarse. However, we were able to find QAOA bounds that beat corresponding coreset 2-means results. This occurred in both the Epilepsy and Synthetic data sets, where a blue quantum bar beats the same-textured green bar (although again, this may be an artifact of a poor coreset). It is also encouraging that the 0th order QAOA bounds, which have only quadratic gate count cost, are generally similar to the higher order bounds. The exception is the Synthetic data set, which by Common Object Images (COCO) whole m=5 m=10 m=15 m=20 m=5 m=10 m=15 m=20 0th 0th 1st 2nd Cost (lower is better) 1e8 Yeast whole m=5 m=10 m=15 m=20 m=5 m=10 m=15 m=20 0th 0th 1st 2nd  Figure 4: Example of a QAOA circuit for k-means on an m = 5 coreset implemented using the swap networks proposed in [34,35]. Here, P = 1, α and β are the variational parameters and the w ij 's are the edge weights of the constructed graph (see Figure 1).
construction favors unequal cluster sizes and violates the equal cluster weight assumption of 0th order expansion. The broader pattern appears to be that if a data set is amenable to coresets, it does not work well for low-order QAOA approximation, and vice versa.  [34]. Here, P = 1, α and β are the variational parameters and w ij are the edge weights of the constructed graph (see Figure 1). Each execution consists of 8192 individual shots.

Experimental QAOA Results
For each of the six data sets examined above we also tested the performance of QAOA for the m = 5 case on the 5-qubit ibmq_rome processor. Figure 4 shows the QAOA circuit. Solving k-means via QAOA on the zeroth or first order Hamiltonian is equivalent to finding the weighted MAX-CUT on a complete graph, which requires an interaction between every pair of qubits during each step of QAOA. This would seem to imply that the quantum computer must support a high degree of connectivity between qubits or else incur a large number of SWAP operations which would deteriorate performance. However, using the SWAP networks proposed in [34,35], we can achieve the required all-to-all interactions in depth which scales linearly with the number of qubits m, while only requiring nearestneighbor interactions. SWAP networks were also implemented for MAX-CUT QAOA in [36] where the total number of CNOTs required was found to scale as 3 2 m(m − 1)P . We note that this CNOT count can be slightly reduced by forgoing the last layer of the SWAP network and reordering the bits in a classical post-processing step so that the overall number of CNOTs scales as ( 3 2 m(m − 1) − m 2 )P . Figure 5 shows the results of running k-means QAOA for the Epilepsy m = 5 coreset on ibmq_rome with and without the SWAP network. Before running on the quantum hardware, we found optimal values for the variational parameters using circuit simulation in conjunction with a Nelder-Mead optimizer. In Figure 5 the best partitionings found by the noiseless simulation are the symmetric 01100 and 10011 states. We use the cluster centers given by the coreset partitioning to evaluate the cost function on the entire data set; both the 01100 and 10011 partitions have a cost c = 5.581e10 which approaches the bound shown in Figure 3. There is a significant amount of noise present in the hardware executions compared to the noiseless simulation, but when the SWAP network is utilized, the 01100 solution state is still the most likely state to be measured. Without the SWAP network, the highest probability state would result in a suboptimal partition.

Discussion
In this work we investigated the performance of performing k-means clustering using QAOA, using offline coresets to effectively reduce the size of the clustering data sets. Indeed, we found that there do exist data sets where coresets seem to work well in practice, such as the Synthetic data set we analyzed in Sec. 5.3. Furthermore, as our Hamiltonian construction of the problem is all-to-all connected, our QAOA instance circumvents the light cone oriented issues [37,38] associated with running constant-P -depth QAOA on sparse bounded-degree graphs. Furthermore, while naively implementing QAOA on this all-to-all connected graph would require time evolution under O(m 2 ) edge operators for a size m instance, the SWAP network construction allows us to implement QAOA evolution with just O(m) depth, even with linear connectivity.
However, in practice coresets did not seem to work well relative to random sampling for the standard classification data sets we benchmarked our results on. This may be due to the small m we restricted our coresets to, with the motivation of fitting the problem instance on today's quantum computers, or may be due to the fact that these "natural" data sets have near equally sized optimal clusters. Indeed, the Synthetic data set where using coresets performed well had artificially rare clusters that naive random sampling would miss-however, this worked to the detriment of our Hamiltonian construction of the problem, which involves Taylor expanding the optimization problem around near equal optimal cluster sizes. As standard k-means already performs remarkably well, it seems that one would need a high-degree Hamiltonian expansion for a method such as QAOA to compete, or otherwise a more clever Hamiltonian construction of the problem. Methods such as Grover's algorithm, however, would not necessarily have this limitation. We leave for future work refining this intuition and perhaps finding instances where both coresets and QAOA work well in conjunction.

A 2-Means to Weighted MAX-CUT Reduction
In Sec. 4.2, it is claimed that minimizing Eq. (3) (the 2-means objective function on a weighted coreset) over set assignments S ±1 , is equivalent to maximizing Eq. (4) (the weighted intercluster distance): This essence of this reduction is a known [39] application of the Law of Total Variance. Our notation and proof are drawn directly from [40], with some deviations to handle weighted k-means clustering. We begin by considering the coreset's scatter, which measures the sum of squared distances from weighted coreset points to the overall centroid. The scatter is a function of the coreset data, and is therefore invariant with respect to the chosen cluster partitioning. We will write the scatter as a sum over three terms: T 1 is the just the 2-means objective from Eq (3), which we wish to minimize. Meanwhile, T 2 is zero: Finally, we show that T 3 is related to the weighted intercluster distance objective: The last line is just the weighted intercluster distance objective (in Eq. (4)), scaled by a 1 W constant. Since the scatter (T 1 − 2T 2 + T 3 = T 1 + T 3 ) is not a function of the partitioning, minimizing T 1 corresponds to maximizing T 3 . Therefore, minimizing the 2-means objective is equivalent to maximizing the weighted intercluster distance, i.e. maximizing W −1 W +1 µ −1 − µ +1 2 . As shown in Sec. 4.2, this optimization problem corresponds to a weighted MAX-CUT instance.