Quantum-Based Pseudo-Labelling for Hyperspectral Imagery: A Simple and Efficient Semi-Supervised Learning Method for Machine Learning Classifiers

Abstract: A quantum machine is a human-made device whose collective motion follows the laws of quantum mechanics. Quantum machine learning (QML) is machine learning for quantum computers. The availability of quantum processors has led to practical applications of QML algorithms in the remote sensing field. Quantum machines can learn from less data than non-quantum machines, but because of their low processing speed, they cannot be applied to an image that has hundreds of thousands of pixels. Researchers around the world are exploring applications for QML, and in this work it is applied to the pseudo-labelling of samples. Here, a PRISMA (PRecursore IperSpettrale della Missione Applicativa) hyperspectral dataset is prepared by quantum-based pseudo-labelling, and 11 different machine learning algorithms are evaluated with this dataset, viz., support vector machine (SVM), K-nearest neighbour (KNN), random forest (RF), light gradient boosting machine (LGBM), XGBoost, support vector classifier (SVC) + decision tree (DT), RF + SVC, RF + DT, XGBoost + SVC, XGBoost + DT, and XGBoost + RF. An accuracy of 86% was obtained for the classification of pine trees using the hybrid XGBoost + decision tree technique.


Introduction
The predictive power of hybrid/deep learning (DL) classifiers makes them the primary option for remote sensing researchers [1]. Annotated datasets are used to develop and evaluate machine learning (ML) and deep learning models. However, in remote sensing applications, only a small amount of labelled data is available for training. As labelled data are expensive to collect, abundant unlabelled data are used instead [2]. Field observations must be conducted across different locations of the studied region to ensure that the sampling points have a high spatial variability, similar to that of the satellite data acquisition. The accuracy of machine learning techniques depends on the size of the training dataset. For classical ML, large datasets are required to obtain higher accuracy, so annotated datasets are a crucial precondition for modelling and validating ML-based classifications. Unfortunately, due to the heterogeneity of remote sensing measurements and tasks, there is no single go-to dataset that can serve as a standardized pre-training benchmark [2]. Therefore, researchers are developing procedures to create datasets on their own in a semi-supervised approach [3]. Pseudo-labelling makes the abundant unlabelled data in the image available for training ML models. Because QML can learn from smaller datasets, a pseudo-labelling procedure is proposed in which the unlabelled data of a hyperspectral image are annotated using a quantum support vector machine (QSVM) [4].
QML is an interdisciplinary field combining quantum mechanics and ML, which aims to outperform classical computation by outsourcing complex computations to a quantum computer [5]. The performance of QML implemented for big data applications has been compared with the performance of classical computation [6][7][8]. Two major providers of cloud QC environments are IBM (gate-based systems) [5,9] and D-Wave (based on quantum annealing) [10,11]. The main goal of many researchers in this field is to search for potential applications that demonstrate quantum speed-ups [6]. Quantum computers can recognize complex patterns of labelled data that are not recognised by classical computers, which need more dimensions to separate the same patterns. A quantum support vector machine (QSVM) was chosen in this study because it can be trained with comparatively fewer samples, which makes it attractive to many users [12]. Different models look for different trends and patterns, and it is difficult to predict which model will perform best before each model is tested. The classification accuracy depends on the size, quality, and nature of the training data; the training time; and the expected results [13,14]. In the present study, different machine learning methods were evaluated, and the parameters were tuned to reach the desired accuracy. Pseudo-labels were assigned to samples of hyperspectral data and 11 different ML techniques were applied to evaluate the quantum-based dataset.
In this short communication, firstly, the procedure of quantum-based pseudo-labelling of samples is demonstrated. From the literature [3], it was observed that QSVM has the highest accuracy for hyperspectral classification problems but cannot be used for predicting a large-scale (30 km × 30 km) image because of its limited prediction speed. Since data scarcity is one of the challenges faced by most remote sensing scientists, a quantum-based pseudo-labelling procedure is proposed as a solution to this challenge for pine tree classification. Secondly, different machine learning techniques, viz., support vector machine (SVM), K-nearest neighbour (KNN), random forest (RF), light gradient boosting machine (LGBM), XGBoost, support vector classifier (SVC) + decision tree (DT), RF + SVC, RF + DT, XGBoost + SVC, XGBoost + DT, and XGBoost + RF, were evaluated, and the technique giving the best accuracy with the quantum-based pseudo-labelled PRISMA dataset was identified. The remainder of this manuscript is organized as follows. Section 2 describes the study area, the reference data, and the PRISMA satellite data used for the analysis. Section 3 presents the methods used. Section 4 explains the experimental setup. Section 5 presents the results and discusses the classifications and the validation information presented. Section 6 explains the conclusions derived from this study.

Study Area
The hyperspectral PRISMA (PRecursore IperSpettrale della Missione Applicativa) data shown in Figure 1 cover the estate of Castel Porziano in Rome and were acquired on 27 June 2021. This image was downloaded from the PRISMA archive at https://prisma.asi.it/ (accessed on 15 May 2022). Pine trees cover the majority of this estate. Additionally, this estate has different oak trees, cork trees, shrubs, and grasslands. This region has a humid climate and the temperature in this area varies from −5 °C in winter to 31 °C in summer. The region covered in this image has an elevation varying from 20 to 70 feet.

Reference Data
Reference data (Figure 2) were extracted from a ground truth vegetation map (2021) provided by the geoportal of the Lazio regional administration. These data were provided in shapefile format and converted to GTiff in ArcMap, giving a spatial resolution of 30 m. The area shown in green is covered by pine vegetation, which comprised the input for training the ML models.

Pre-Processing of PRISMA Data
PRISMA, a satellite of the Italian space agency (Agenzia Spaziale Italiana, ASI), carries a hyperspectral sensor that enables hyperspectral imaging [13,14]. Using an imaging spectrometer, this sensor captures imagery with a continuum of spectral bands over 400-2500 nm at a spatial resolution of 30 m. There are 173 bands in the shortwave infrared within a 920-2500 nm range, and 66 bands in the visible and near-infrared portion (400-1010 nm) of the spectrum. The widths and spectral sampling intervals are ≤12 nm. A panchromatic camera that provides a single-band (400-700 nm) image at a 5 m spatial resolution is also on board the ASI satellite [15].
The PRISMA archive has Level-1, Level-2B, Level-2C, and Level-2D products. Of these, the cloud cover and land cover images of Level-1 and the hypercube of Level-2C were considered for processing. For more details on the level-wise images, please refer to https://prisma.asi.it/ (accessed on 15 May 2022). For the region of interest, images with minimal cloud cover were downloaded from the archive and the remaining clouded pixels were masked using the cloud cover image. The Level-1 land cover image was used for masking the non-vegetated areas in the hypercube. The Prismaread tool from CNR, Italy (https://irea-cnr-mi.github.io/prismaread/, accessed on 25 May 2022) was used to georeference the image in R. This tool converted the hypercube from he5 format to a GTiff file that was used for further processing.
The hyperspectral imagery from PRISMA was considered for this comparative study. A PRISMA image has 233 bands with a size of 1266 × 1260 × 233. During pre-processing, 37 noisy and water absorption bands were removed by manual selection, leaving 196 bands. The image has 1266 × 1260 = 1,595,160 pixels, where each pixel represents a feature vector $x_n$ of dimension d = 196; to each pixel, a label $y_n \in \{0, 1\}$ was assigned, indicating the absence (0) or presence (1) of a pine tree [14].
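The reshaping described above can be illustrated with a minimal sketch that drops a set of bands from a hypercube and flattens the result into one feature vector per pixel. The spatial size is reduced for illustration, the data are random, and the indices in `bad_bands` are hypothetical placeholders: the 37 bands actually removed were selected manually and are not listed in this paper.

```python
import numpy as np

# Small synthetic hypercube (rows x cols x bands); the real image is 1266 x 1260 x 233.
rows, cols, n_bands = 6, 5, 233
rng = np.random.default_rng(0)
cube = rng.random((rows, cols, n_bands))

# Hypothetical indices of the 37 noisy / water-absorption bands to drop.
bad_bands = np.arange(100, 137)
keep = np.setdiff1d(np.arange(n_bands), bad_bands)

# Remove the bands, then flatten so that each row is one pixel's feature vector.
clean = cube[:, :, keep]
X = clean.reshape(-1, keep.size)

print(X.shape)  # (number of pixels, number of retained bands)
```

With the full image, the same two steps would give a matrix with 1,595,160 rows and 196 columns.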

Implemented Methods
(1) Jeffries Matusita-Spectral Angle Mapper (JM-SAM) is the tangent combination of the popular SAM technique and the Jeffries Matusita distance. SAM provides a spectral angle to detect the intrinsic properties of reflective materials, but it is insensitive to illumination and shade effects, which is a limitation. It is therefore better to use it in combination with stochastic divergence measures. In this study, the Jeffries Matusita distance involving the exponential factor was used in combination with SAM to identify similar spectra [16].
(2) Quantum Support Vector Machine (QSVM) is an SVM with a quantum kernel. The component that allows it to outperform classical classifiers is the feature map, which maps d-dimensional non-linear classical data points to a quantum state and plays a key role in pattern recognition. It is difficult to recognise complex patterns of data in the original space, especially in learning algorithms, but this becomes easier after mapping to a higher-dimensional feature space. For more information on the accuracy and processing speeds of QSVM, please refer to [3]. The QISKIT library package was used with Python. QISKIT has three parts: the provider, the backend, and the job. The provider gives access to different backends, such as Aer and IBMQ. Aer provides the simulators within QISKIT that run on the local machine, e.g., statevector_simulator, qasm_simulator, unitary_simulator, and clifford_simulator, whereas IBMQ provides access to cloud-based backends [5]. The backend is either a real quantum processor or a simulator and is used to run the quantum circuit and generate results. The execution state, i.e., whether the model is running, queued, or failed, can be found in the third part of QISKIT, the job [5]. One of the backends of IBMQ is "ibmq_qasm_simulator", which has the features shown in Table 1 [5].
As IBMQ is widely used and freely available, it was considered for this study. Maximum accuracy was achieved with 16 qubits [3]. QISKIT packages on Anaconda and the scikit-learn library in Python were used for this classification. The Python packages numpy, spectral, matplotlib, time, scipy, math, pandas, pysptools, os, and gdal were also used.
(3) Referring to the literature [1,2,15], some machine learning techniques were selected to find a suitable classifier for the quantum-based pseudo-labelled dataset. The methods implemented in these experiments were support vector machine (SVM), K-nearest neighbour (KNN), random forest (RF), and the boosting methods light gradient boosting machine (LGBM) and extreme gradient boosting (XGBoost). Hybrid ML models were tested by combining one classifier with another; only hybrid models that provided an accuracy higher than 40% are presented in this study. The hybrid models used were SVM + decision tree (DCT), RF + SVM, RF + DCT, XGBoost + DCT, XGBoost + RF, and XGBoost + SVM. Since well-established techniques were used, a detailed explanation is not provided. Correlations between local spatial features were ignored and deep learning methods were not implemented because we were working with a small dataset.
The flowchart shown in Figure 3 represents the process implemented for the pseudo-labelling of samples using QSVM. From the reference data, 20 pixels were selected: 12 pine tree pixels and 8 pixels representing other vegetation and/or non-vegetated surfaces, because a balanced dataset includes 60% positive and 40% negative samples [14]. In this work, a pixel represents the feature vector $x_n$ of dimension d = 16, and to each pixel a label $y_n \in \{0, 1\}$ was assigned, indicating the absence (0) or presence (1) of a pine tree. The dimension of the feature vector was reduced from 196 to 16 using principal component (PC) analysis because Riyaaz et al. [3] demonstrated that the highest accuracy for vegetation classification can be obtained with 16 PCs of PRISMA data. QSVM was trained using these 20 extracted samples.
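To make the idea of an SVM with a quantum kernel concrete, the following sketch reduces synthetic 196-band spectra to 16 principal components and trains a kernel SVM on 20 samples (12 positive, 8 negative). It is not the QISKIT pipeline used in this study. As a stated assumption, the quantum kernel is replaced by the fidelity kernel of a simple angle-encoding product-state feature map (one feature per qubit), which can be computed exactly on a classical machine; the feature map actually used is described in [3]. The data, the rescaling to [0, π], and the function name `fidelity_kernel` are likewise illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC


def fidelity_kernel(A, B):
    """Fidelity kernel of a product-state angle encoding (illustrative assumption).

    Feature x_i is encoded on one qubit as cos(x_i/2)|0> + sin(x_i/2)|1>,
    so |<psi(a)|psi(b)>|^2 = prod_i cos^2((a_i - b_i) / 2).
    """
    diff = A[:, None, :] - B[None, :, :]
    return np.prod(np.cos(diff / 2.0) ** 2, axis=2)


rng = np.random.default_rng(0)

# Synthetic stand-in for image pixels: 500 spectra with 196 bands each.
pixels = rng.normal(size=(500, 196))
pixels[:250] += 0.5  # crude "pine" cluster, for illustration only
labels = np.r_[np.ones(250, dtype=int), np.zeros(250, dtype=int)]

# Reduce 196 bands to 16 principal components (one per simulated qubit).
pcs = PCA(n_components=16).fit_transform(pixels)

# Rescale each component to [0, pi] so that it can act as a rotation angle.
lo, hi = pcs.min(axis=0), pcs.max(axis=0)
angles = np.pi * (pcs - lo) / (hi - lo)

# 20 training samples: 12 positive (pine) and 8 negative.
train_idx = np.r_[rng.choice(250, 12, replace=False),
                  250 + rng.choice(250, 8, replace=False)]
X_train, y_train = angles[train_idx], labels[train_idx]

# SVM on the precomputed Gram matrix of the training samples.
svm = SVC(kernel="precomputed")
svm.fit(fidelity_kernel(X_train, X_train), y_train)

# Prediction needs the kernel between new samples and the training samples.
pred = svm.predict(fidelity_kernel(angles, X_train))
```

On quantum hardware or a circuit simulator, each kernel entry would be estimated from repeated circuit executions, which is one reason why prediction over a full image is slow compared with a classical SVM.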
Although the accuracy of QSVM is better than that of classical SVM for classification problems, it cannot be used for predicting all pixels of an image because of its limited prediction speed. It was observed in the literature that QSVM takes around 7000 s to predict 50 samples, whereas classical SVM takes only 1.2 s [3]. Therefore, QSVM is proposed only for the preparation of a dataset of 600 samples. Using this accurate dataset, machine learning techniques that give a higher accuracy can be trained and used for prediction [1,3]. The JM-SAM method was applied to select 1000 similar samples from the image, and the trained QSVM was then applied to these samples to predict 400 very similar samples, as shown in Figure 3. The remaining 200 samples were selected randomly from the image and included other vegetation types and non-vegetated pixels. Finally, a dataset of 600 pseudo-labelled samples was prepared, in which 400 samples were positively annotated as '1' and the remaining 200 samples were negatively annotated as '0'.
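The selection steps described above can be sketched as follows. This is a minimal illustration under several assumptions, not the code used in this study: the JM-SAM score is taken as the Jeffries Matusita distance (computed on spectra normalised to sum to 1, via the exponential of the Bhattacharyya distance) multiplied by the tangent of the spectral angle; `stub_predict` is a hypothetical stand-in for the trained QSVM; the negatives are drawn from pixels outside the candidate set to avoid overlap; and the spectra are random numbers.

```python
import numpy as np


def jm_sam(ref, spectra):
    """Tangent JM-SAM score between a reference spectrum and many spectra.

    Lower scores mean more similar spectra. Assumes positive reflectance values.
    """
    # Spectral angle (radians) between the reference and each spectrum.
    cos = spectra @ ref / (np.linalg.norm(spectra, axis=1) * np.linalg.norm(ref))
    angle = np.arccos(np.clip(cos, -1.0, 1.0))

    # Jeffries Matusita distance on spectra normalised to sum to 1.
    p = ref / ref.sum()
    q = spectra / spectra.sum(axis=1, keepdims=True)
    bhatt = -np.log(np.clip(np.sqrt(p * q).sum(axis=1), 1e-12, 1.0))
    jm = np.sqrt(2.0 * (1.0 - np.exp(-bhatt)))

    return jm * np.tan(angle)


def build_pseudo_labelled_set(X, ref, predict, rng,
                              n_candidates=1000, n_pos=400, n_neg=200):
    """Return pixel indices and 0/1 labels of the pseudo-labelled samples."""
    # 1. Keep the n_candidates spectra most similar to the reference spectrum.
    candidates = np.argsort(jm_sam(ref, X))[:n_candidates]
    # 2. Label the candidates with the trained classifier; keep up to n_pos positives.
    positives = candidates[predict(X[candidates]) == 1][:n_pos]
    # 3. Draw negatives at random from pixels outside the candidate set (assumption).
    pool = np.setdiff1d(np.arange(len(X)), candidates)
    negatives = rng.choice(pool, size=n_neg, replace=False)
    idx = np.concatenate([positives, negatives])
    y = np.concatenate([np.ones(len(positives), dtype=int),
                        np.zeros(len(negatives), dtype=int)])
    return idx, y


def stub_predict(Z):
    """Hypothetical stand-in for the trained QSVM's prediction step."""
    return (Z[:, 0] > 0.2).astype(int)


rng = np.random.default_rng(0)
X = rng.random((5000, 196)) + 0.1   # synthetic positive "spectra"
idx, y = build_pseudo_labelled_set(X, X[0], stub_predict, rng)
```

Restricting the slow classifier to the 1000 pre-selected candidates is what keeps the procedure tractable: at roughly 7000 s per 50 samples [3], only the candidates are passed through the QSVM, not the whole image.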

Inputs of ML Classifiers
Six ML techniques were considered in this study to identify the optimal technique for pine tree classification. Table 2 shows the input parameters considered for training the six types of ML models, chosen with reference to the literature. Bayesian optimization was used to select the optimum parameters among the provided values. The same parameter values were used for the hybrid ML models as well. Different optimization techniques give different values for the parameters: Zelin Huang et al. [17] showed that the parametric values differed between the genetic algorithm and grid search. For this demonstration, Bayesian optimization was chosen.

Classification of Pine Trees
Figure 4 shows the pine tree classifications obtained using the quantum-based dataset and 11 different ML techniques. Table 2 shows the details of the classifiers implemented in the evaluation and the hyperparameters given as the input. The classifications were validated by comparing them with the reference data. In some of the classifications, vegetation near the river appears similar to the waves in the river flow.

Validation of the Pine Tree Classification
The pine tree classification was validated by randomly selecting 300 points in the classification and comparing each point with the reference data, as shown in Table 3. As shown in Figure 4, different ML classifiers were applied to choose a suitable classifier for the classification of pine trees. Hybridised XGBoost, a tree-based algorithm that uses a gradient boosted framework, showed the best classification accuracy. XGBoost combined with a decision tree (DT) algorithm gave the best result in this study.
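The validation step can be sketched as a random point comparison between the classified map and the reference map. In the following minimal example, accuracy is assumed to be the fraction of sampled points whose class agrees with the reference; the function name `sample_accuracy`, the map size, and the synthetic maps are illustrative only.

```python
import numpy as np


def sample_accuracy(classified, reference, n_points=300, seed=0):
    """Fraction of randomly sampled pixels where the two maps agree."""
    rng = np.random.default_rng(seed)
    # Sample distinct pixel positions, then convert them to row/column indices.
    flat = rng.choice(classified.size, size=n_points, replace=False)
    rows, cols = np.unravel_index(flat, classified.shape)
    return float(np.mean(classified[rows, cols] == reference[rows, cols]))


# Synthetic binary reference map and a corrupted copy standing in for a classification.
rng = np.random.default_rng(1)
reference = rng.integers(0, 2, size=(60, 50))
classified = reference.copy()
flip = rng.random(reference.shape) < 0.14   # flip the class of some pixels
classified[flip] = 1 - classified[flip]

acc = sample_accuracy(classified, reference)
print(f"accuracy over 300 points: {acc:.2f}")
```

Under this definition, an accuracy of 86% over 300 points would correspond to about 258 matching points.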

Discussions
Hyperspectral remote sensing has advanced significantly in the past decades. ML/DL techniques have become the most important tool in modern hyperspectral image analysis, especially for classification problems, because of their unprecedented predictive power. The accuracy of machine learning-based classification also depends strongly on the dataset. Therefore, annotated datasets have become crucial for developing and evaluating machine learning-based classifications. There is no single go-to dataset that serves the purpose of standardized benchmarking and pre-training due to the heterogeneity of remote sensing tasks and measurements. Thus, a dataset was prepared by pseudo-labelling samples using QSVM trained with 20 samples. These 20 samples were extracted with reference to the reference maps. Alternatively, a field survey can easily be carried out to collect 12 data points for each vegetation type, which could further increase the accuracy.
Another challenge of machine learning was also addressed in this study, namely the selection of a suitable machine learning classifier for a specific task. The accuracy of ML algorithms depends on various factors such as the size of the training data, the dataset pattern, and the training parameters, so selection based on accuracy is also difficult. In such cases, an optimal algorithm can be selected considering the number of classes, the size and nature of the training data, and the predictor variables. In this work, considering these reasons, popular machine learning techniques were selected and trained using a quantum-based pseudo-labelled dataset to check the classification accuracy. The hybrid XGBoost + decision tree (DCT) technique trained with 600 pseudo-labelled samples gave a comparatively higher accuracy (86%) in classifying pine trees. According to Huang et al. [17], this result can vary if other types of optimization models are used. Other XGBoost hybrids performed slightly lower, with accuracies of 83%. From the classifications, it can be observed that some classifiers performed better in only one part of the image, either the top or the bottom, which may be due to mixed pixels. The bottom of the image has dense vegetation and the majority of the labelled data was selected from there, whereas the top part of the image has comparatively less vegetation with mixed pixels. Spectral profiles extracted from mixed pixels therefore lead to a decrease in the classification accuracy when classified using ML classifiers with broader thresholds.

Conclusions
This paper demonstrates the classification of pine trees using the hybrid XGBoost + decision tree technique trained on a quantum-based pseudo-labelled dataset. Samples were pseudo-labelled for dataset preparation with QML using PRISMA hyperspectral imagery. It was shown that when there is no single go-to dataset, QML can be used for pseudo-labelling with ≤20 samples. Different ML classifiers were evaluated on a quantum-based dataset to find a suitable technique for the classification of pine trees. A hybrid technique (XGBoost + decision tree) gave promising results, with an accuracy of around 86%. The results suggest that QML pseudo-labelling combined with XGBoost hybrid classification can address feature mapping and classification problems accurately within a modest processing time.

Figure 3. Flowchart of the quantum-based pseudo-labelling of samples.

Table 1. Features of IBM backend.

Table 2. Details of the Machine Learning Classifiers.

Table 3. Validation Details of the Classification.