Assembly of a Coreset of Earth Observation Images on a Small Quantum Computer

: Satellite instruments monitor the Earth’s surface day and night, and, as a result, the size of Earth observation (EO) data is dramatically increasing. Machine Learning (ML) techniques are employed routinely to analyze and process these big EO data, and one well-known ML technique is a Support Vector Machine (SVM). An SVM poses a quadratic programming problem, and quantum computers including quantum annealers (QA) as well as gate-based quantum computers promise to solve an SVM more efﬁciently than a conventional computer; training the SVM by employing a quantum computer/conventional computer represents a quantum SVM (qSVM)/classical SVM (cSVM) application. However, quantum computers cannot tackle many practical EO problems by using a qSVM due to their very low number of input qubits. Hence, we assembled a coreset (“core of a dataset”) of given EO data for training a weighted SVM on a small quantum computer, a D-Wave quantum annealer with around 5000 input quantum bits. The coreset is a small, representative weighted subset of an original dataset, and its performance can be analyzed by using the proposed weighted SVM on a small quantum computer in contrast to the original dataset. As practical data, we use synthetic data, Iris data, a Hyperspectral Image (HSI) of Indian Pine, and a Polarimetric Synthetic Aperture Radar (PolSAR) image of San Francisco. We measured the closeness between an original dataset and its coreset by employing a Kullback–Leibler (KL) divergence test, and, in addition, we trained a weighted SVM on our coreset data by using both a D-Wave quantum annealer (D-Wave QA) and a conventional computer. Our ﬁndings show that the coreset approximates the original dataset with very small KL divergence (smaller is better), and the weighted qSVM even outperforms the weighted cSVM on the coresets for a few instances of our experiments. As a side result (or a by-product result), we also present our KL divergence ﬁndings for demonstrating the closeness between our original data (i.e., our synthetic data, Iris data, hyperspectral image, and PolSAR image) and the assembled coreset.


Introduction
Remotely sensed images are used for EO and acquired by means of aircraft or satellite platforms. The acquired images from satellites are available in digital format and consist of information on the number of spectral bands, radiometric resolution, spatial resolution, etc. A typical EO dataset is big, massive, and hard to classify by using ML techniques when compared with conventional non-satellite images [1,2]. In principle, ML techniques are a set of methods for recognizing and classifying common patterns in a labeled/unlabeled dataset [3,4]. However, they are computationally expensive and intractable to train big massive data. Recently, several studies proposed to use only a coreset ("core of a dataset") of an original dataset for training ML methods and tackling intractable posterior distributions via Bayesian inference [5][6][7], even for a Support Vector Machine (in short, SVM) [8]. The coreset is a small, representative weighted subset of an original dataset, and ML methods trained on the coreset yield results being competitive with the ones trained on the original dataset. The concept of a coreset opens a new paradigm for training ML models by using small quantum computers [9,10] since currently available quantum computers offered by D-Wave Systems (D-Wave QA) and by IBM quantum experience (a gate-based quantum computer) comprise very few quantum bits (qubits) (https://cloud.dwavesys.com/leap, https://quantum-computing.ibm.com/, accessed on 30 August 2021). In particular, quantum computers promise to solve some intractable problems in ML [11][12][13], and to train an SVM even better/faster than a conventional computer when its input data volume is very small ("core of a dataset") [14,15]. Training ML methods by using a quantum computer or by exploiting quantum information is called Quantum Machine Learning (QML) [16][17][18], and finding the solutions of the SVM on a quantum computer is termed a quantum SVM (qSVM), otherwise classical SVM (cSVM).
This work uses a D-Wave QA for training a weighted SVM since the D-Wave QA promises to solve a quadratic programming problem, and our method can be easily adapted and extended for a gate-based quantum computer. The D-Wave QA has a very small number of input qubits (around 5000) and a specific Pegasus topology for the connectivity of its qubits [19], and it is solely designed for solving a Quadratic Unconstrained Binary Optimization (QUBO) problem [12,20]. For practical EO data, there is a benchmark and a demonstration example for training an SVM with binary quantum classifiers when using a D-Wave QA [21,22]. Here, the SVM is a quadratic programming problem considered as a QUBO problem. Furthermore, there is a challenge to embed the variables of a given SVM problem into the Pegasus topology (i.e., the connectivity constraint of qubits), and to overcome this constraint of a D-Wave QA, the authors of [21] employed a k-fold approach to their EO data such that the size of variables in the SVM satisfies the connectivity constraint of qubits of a D-Wave QA.
In this article, we construct the coreset of an original dataset via sparse variational inference [6] and then employ this coreset for training the weighted SVM by using a D-Wave QA. Furthermore, we train the weighted SVM, posed as a QUBO problem, by using a D-Wave QA on the coreset instead of the original massive data, and we benchmark our classification results with respect to the weighted cSVM. As for practical and real-world EO data, we use synthetic data, Iris data, a Hyperspectral Image (HSI) of Indian Pine, and a Polarimetric Synthetic Aperture Radar (PolSAR) image of San Francisco characterized by its Stokes parameters [23].
Our paper is structured as follows: In Section 2, we present our datasets, and we construct the coresets of our datasets in Section 3. We introduce a weighted cSVM, and construct a weighted qSVM for our experiments in Section 4. Then, we train the weighted qSVM on our coresets by using a D-Wave QA and demonstrate our results with respect to the weighted cSVM in Section 5. Finally, we draw some conclusions in Section 6.

Our Datasets
We use four different datasets, namely synthetic data, Iris data, an Indian Pine HSI, and a PolSAR image of San Francisco characterized by its Stokes parameters [23,24]. The first two sets are used to understand the concept of a coreset, and the implementation of a weighted SVM on their coresets by using a D-Wave QA. Namely, we use the coresets of the first two to set the internal parameters of a D-Wave QA since the solutions generated by the D-Wave QA are affected by those internal parameters (called annealing parameters). The last two sets are employed as real-world EO data for constructing their coresets and for training the weighted qSVM on their coresets after the annealing parameters are set in a prior (see Figures 1 and 2). In the next sections, we use a notation "weighted qSVM" meaning that "training a weighted SVM posed as a QUBO problem by using a D-Wave QA".

Synthetic Data
We generated synthetic data with two classes (x n , y n ) according to where r n = 1 if y n = −1, and r n = 0.15 if y n = +1. φ n is linearly spaced in (0, 2π] for each class, and x n , y n are samples drawn from a normal distribution with µ = 0, σ = 1. We are replicating the data already demonstrated for training an SVM by using a D-Wave QA described in [25]. Moreover, we have (x n , y n ), n = 1, . . . , 100 data points shown in Figure 1 (Left) and in Table 1.

Iris Data
Iris data consist of three classes (Iris Setosa, Iris Versicolour, and Iris Virginica), each of which has four features, namely sepal length, sepal width, petal length, and petal width. We consider a two-class dataset {Iris Setosa, Iris Versicolour} with a size of 100 data points, and each class is characterized by two features (sepal length, sepal width) shown in Figure 1 (Right) and Table 1.

Indian Pine HSI
An Indian Pine HSI obtained by the AVIRIS sensor comprises 16 classes; each class is characterized by 200 spectral bands (see Figure 2 (Left)). For simplicity, we use only two classes (see Table 2), and each class is characterized by two features instead of 200 spectral bands by exploiting Principal Component Analysis (PCA) [22].

PolSAR Image of San Francisco
Each pixel of our PolSAR image is characterized by a 2 × 2 scattering matrix as follows: where the first index of s ij , i, j ∈ {H, V} represents the polarization state of the incident polarized beam, and its second index represents the polarization state of the reflected polarized beam on targets. The off-diagonal elements of S are equal s V H = s HV since our PolSAR image of San Francisco is a fully-polarized PolSAR image obtained by a monostatic radar [26,27]. The incident/reflected polarized beam can be represented by its complex amplitude in a polarization basis {Ĥ,V} by The complex amplitude vector can be expressed by a so-called Jones vector where φ i are the phases of the polarized states. Furthermore, the scattering matrix S expressed in (2) is a mapping such that where J i , J r is an incident and a reflected Jones vector, respectively. In matrix form, (5) can be re-expressed as E r The intensity of the reflected Jones vector is defined by where · stands for spatial averaging with a window size 7 × 7 pixels, and * for conjugation. Furthermore, we can re-express this intensity by where q 1 , q 2 , and q 3 are called Stokes vectors. We normalize these Stokes vectors according to and the normalized q 1 , q 2 , and q 3 are called Stokes parameters [23]. Moreover, in this study, we use two classes for our PolSAR image of San Francisco, and the two classes are {urban area, sea water}, and {vegetation, sea water} shown in Figure 2 (Right) and in Table 3. In addition, each class is characterized by its Stokes parameters (q 1 , q 2 , q 3 ) defined in (9).

Coresets of Our Datasets
In Bayesian inference, a posterior density p(θ|x) is written for θ parameters and for where Z is a partition function, f i (θ) is a potential function, and p 0 (θ) is a prior. For big massive data, the estimation of the posterior distribution is intractable, and hence, in practice, a Markov Chain Monte Carlo (MCMC) method is widely used to obtain samples from the posterior expressed by (10) [28].
To reduce the computational time of an MCMC method, the authors of [5][6][7] proposed to run the MCMC method on a small, weighted subset (i.e., coreset) of big massive data. They derived a sparse vector of nonnegative weights w such that only M N are nonzero, where M is the size of a coreset. Furthermore, the authors proposed to approximate the weighted posterior distribution and run the MCMC method on the approximate distribution as follows: We denote the full distribution of an original big massive dataset as p 1 = p 1 (θ|x). More importantly, this posterior becomes tractable.
For assembling the coresets of our datasets presented in Table 1-3, we use an algorithm via sparse variational inference for finding the sparse vector of nonnegative weights w and for approximating the posterior distribution (11) proposed by [6]. Here, the sparse vector of nonnegative weights w is found by optimizing a sparse variational inference problem: whereŵ is an optimal sparse vector weight, and D KL (p w ||p 1 ) is the Kullback-Leibler (KL) divergence which measures the similarity between two distributions (smaller is better): Moreover, the smaller value of the KL divergence implies that we can estimate the parameters θ in (11) by using a coreset yielding similar results with respect to the ones in (10) by using its original massive dataset.
We derived the optimal sparse vector weightsŵ and the coreset of our dataset such that where (x i , t i ) represents an original dataset, while (c i , t i ,ŵ i ) is our newly assembled coreset.
In addition, we computed the similarity between our datasets and the corresponding coresets by using their KL divergences (see Table 4). Our results show that our coresets are very small in size compared with our original datasets, and the KL divergences between the original dataset and our coresets are comparatively small in most instances.

Weighted Classical SVMs
In the previous section, we assembled the coreset of our original datasets shown in Table 4 as To train a weighted SVM for our coresets represented via (15) by using a conventional computer, we express a weighted SVM as where C i =ŵ i C is a regularization parameter, and k(·, ·) is the kernel function of the SVM [28]. This formulation of the SVM is called a weighted cSVM [29]; sometimes, it is called a kernel-based weighted cSVM. The point c i with α i = 0 is called a support vector. After training the weighted cSVM, for a given test point x t ∈ R 2 , the decision function for its class label is defined by: where sign( f (x t )) = 1 if f (x t ) > 0, sign( f (x t )) = −1 if sign( f (x t )) < 0, and sign( f (x t )) = 0 otherwise. The decision boundary is an optimum hyperplane drawn by data points such that f (x t ) = 0 [28]. The bias b is expressed following [25]: The kernel-based weighted cSVM is a powerful technique since the kernel function maps non-separable features to higher dimensional separable features, and the decision boundary is less sensitive to outliers due to the weighted constraints C i [25,29]. Furthermore, the choice of the kernel function has a huge impact on the decision boundary, and the types of the kernel function are linear, polynomial, Matern, and a radial basis function (rfb) [28]. A widely-used kernel is an rbf defined by where γ > 0 is a parameter.

Weighted Quantum SVMs
A weighted quantum SVM (in short, weighted qSVM) is the training result of the weighted cSVM given in (16) on a D-Wave QA. The D-Wave QA is a quantum annealer with a specific Pegasus topology for the interaction of its qubits, and it is specially designed to solve a QUBO problem: where z i , z j are called logical variables, and Q ij includes a bias term h i and the interaction strength of the logical variables g ij [19]. Physical states of the Pegasus topology are called physical variables, two-state qubits residing at the edges of the Pegasus topology; a QUBO problem is also called a problem energy. The D-Wave QA anneals (evolves) from an initial to its final energy (problem energy) according to where H 0 is an initial energy, T is the annealing time in microseconds, and ε(T) is an annealing parameter in the range of [0, 1]. Furthermore, to train the weighted qSVM on our coresets by using a D-Wave QA, we pose the weighted cSVM with an rbf kernel expressed by (16) and (19) as a QUBO problem. Here, we duplicate the formulation for posing the weighted cSVM as a QUBO problem in the article [25].
The variables of the weighted cSVM are decimal integers when the ones of the QUBO problem are binaries. Hence, we use a one-hot encoding form for the variables of the weighted cSVM where K is the number of binary variables (bits), and B is the base. We insert this one-hot encoding form into the weighted cSVM given in (16), and formulate the second constraint of (16) as a squared penalty term By using a Lagrange multiplier λ, we transform our weighted cSVM into the QUBO problem (20) minimize H(z) = 1 where Note that the first constraint of (16) is satisfied automatically since the one-hot encoding form given in (22) is always greater than zero, and hence the maximum value for each α i is given by For training the weighted qSVM, we concentrated on four hyperparameters which are the parameter γ of the RBF expressed by (19), the number of binary bits K, the base B, and the Lagrange multiplier λ given in (24); thus, we used the hyperparameters (γ, K, B, λ). For our applications, we set these hyperparameters to (γ = 16, K = 2, B = 2, λ = 0) as proposed by [25] since these settings of the hyperparameters for the weighted qSVM generate competitive results with the ones generated by the weighted cSVM. For setting the bias defined in (18), we used the weighted cSVM.
In the next section, we train the weighted cSVM given in (16) and the weighted qSVM expressed by (24) on our coresets (see Table 4). In addition, we demonstrate how to program a D-Wave QA for obtaining a good solution of (24) since the solutions yielded by a D-Wave QA are greatly dependent upon the embedding of the logical variables into their corresponding physical variables, and the annealing parameters (annealing time, number of reads, and chain strength) [30].

Our Experiments
In our experiments, we trained our weighted cSVM and our weighted qSVM (models) on the coresets, and we tested our models on the original datasets (see Table 4). In addition, we set the hyperparameters of our weighted qSVM to (γ = 16, K = 2, B = 2, λ = 0), and for training (i.e., for setting of the bias) of the weighted cSVM, we used the Python module scikit-learn [31].
For defining the annealing parameters (annealing time, number of reads, and chain strength) of a D-Wave QA, we first ran quantum experiments on synthetic two-class data, and Iris data. Then, by leveraging these annealing parameters, we used our real-world EO data (the Indian Pine HSI and the PolSAR image of San Francisco) for evaluating our proposed method, "by training the weighted qSVM on the coreset of our EO data instead of a massive original EO data due to the small quantum computer (D-Wave QA) with only few qubits".

Synthetic Two-Class Data and Iris Data
For training the weighted qSVM expressed by (24), we first experimented on our coresets of synthetic two-class data and Iris data shown in Table 4 in order to optimize the annealing parameters (annealing time, number of reads, and chain strength) of a D-Wave QA. In addition, we benchmarked the classification results generated by the weighted qSVM compared with the weighted cSVM. This had the advantage that we could easily tune the annealing parameters and visualize the generated results, both for quantum and classic settings.
In Figure 3 (in Table 5), we show our results for the classification of synthetic two-class data and Iris data. Our results demonstrate that the weighted qSVM performs well in comparison with the weighted cSVM for both coresets (often better for Iris data).  (16) and weighted qSVM expressed by (24). Our visual results demonstrate that our weighted qSVM generalizes the decision boundary of a given dataset better than its counterpart weighted cSVM.
To obtain these good solutions generated by our weighted qSVM, we set the annealing parameters of the D-Wave QA as follows:

Indian Pine HSI and PolSAR Image of San Francisco
As real-world EO data, we used the coresets of an Indian Pine HSI, and a PolSAR image of San Francisco for training the weighted qSVM when setting the annealing parameters of a D-Wave QA set as described above. Initially, we ran a number of quantum experiments on our coresets. In Table 5, we show the classification accuracy of our weighted qSVM results in comparison with the ones yielded by the weighted cSVM.
Our results explicitly demonstrate that the coresets obtained via sparse variational inference are small and representative subsets of our original datasets validated by their KL divergences shown in Table 4. In addition, our weighted qSVM generates its decision results with the same classification accuracy as for the weighted cSVM; in some instances, the weighted qSVM outperforms the weighted cSVM. Furthermore, by exploiting the coresets, we reduced the computational time of training with the weighted qSVM and the MCMC method for inferring the parameters of the posterior distribution as proved theoretically and demonstrated experimentally in [5,6].

Discussion and Conclusions
Quantum algorithms (e.g., Grover's search algorithm) are designed to process data in quantum computers, and they are known to achieve quantum advantages over their conventional counterparts. Motivated by these quantum advantages, quantum computers based on quantum information science are being built for solving some problems (or running some algorithms) more efficiently than a conventional computer. However, currently available quantum computers (a D-Wave quantum annealer, and a gate-based quantum computer) are very small in input quantum bits (qubits). A very specific type of a quantum computer is a D-Wave quantum annealer (QA); it is designed to solve a Quadratic Unconstrained Binary Optimization (QUBO) problem belonging to a family of quadratic programming problems better than conventional methods.
For Earth observation, satellite images obtained from aircraft or satellite platforms are massive and represent hard heterogeneous data to train ML models on a conventional computer. As a practical and real-world EO dataset, we used synthetic data, Iris data, a Hyperspectral Image (HSI) of Indian Pine, and a Polarimetric Synthetic Aperture Radar (PolSAR) image of San Francisco. One of the well-known methods in ML is a Support Vector Machine (SVM): This represents a quadratic programming problem. A global minimum of such a problem can be found by employing a classical method. However, its quadratic form allows us to use a D-Wave QA for finding the solution of an SVM better than a conventional computer. Thus, we can pose an SVM as a QUBO problem, and we named an SVM-to-QUBO transformation as a quantum SVM (qSVM). Then, we can train the qSVM on our real-world EO data by using a D-Wave QA. However, the number of the physical variables of the qSVM is much larger than that of the logical variables of a D-Wave QA due to the massive EO data and the very few qubits. Therefore, in our paper, we employed the coreset ("core of a dataset") concept via sparse variational inference, where the coreset is a very small and representative weighted subset of the original dataset. By assembling and exploiting the coreset of synthetic data and Iris data shown in Table 4, we trained a weighted qSVM posed as a QUBO problem on these coresets in order to set the annealing parameters of a D-Wave QA. We then presented our obtained visual results and the classification accuracy of synthetic and Iris data in Figure 2 and in Table 5, respectively, in contrast to the ones of the weighted cSVM. Our results show that the weighted qSVM is competitive in comparison with the weighted cSVM − and for Iris data even better than the weighted cSVM.
Finally, we assembled the coresets of our real-world EO data (from an HSI of Indian Pine, and a PolSAR image of San Francisco), and demonstrated the similarity between our real-world EO data and its coreset by analyzing their KL divergence. The KL divergence test proved that our coresets are valid, small, and representative weighted subsets of our real-world EO data (see Table 4). Then, we trained the weighted qSVM on our coresets by using a D-Wave QA to prove that our weighted qSVM generates classification results being competitive with the weighted cSVM in Table 5. The annealing parameters of the D-Wave QA were already defined in the prior section. In some instances, one can see that our weighted qSVM outperforms the weighted cSVM.
As ongoing and future work, we intend to develop a novel method for assembling coresets with balanced labels via sparse variational inference since currently available techniques generate unbalanced labels. Furthermore, we plan to design hybrid quantumclassical methods for different real-world EO problems. These hybrid quantum-classical methods will perform a dimensionality-reduction of remotely-sensed images (in the spatialdimension) by using our established methods, and will reduce the size of our training/test data by using a coreset generating balanced labels when we process these small datasets on a small quantum computer.