A Parallel Unmixing-Based Content Retrieval System for Distributed Hyperspectral Imagery Repository on Cloud Computing Platforms

Abstract: As the volume of remotely sensed data grows significantly, content-based image retrieval (CBIR) becomes increasingly important, especially on cloud computing platforms that facilitate processing and storing big data in a parallel and distributed way. This paper proposes a novel parallel CBIR system for a hyperspectral image (HSI) repository on cloud computing platforms, guided by unmixed spectral information, i.e., endmembers and their associated fractional abundances, to retrieve hyperspectral scenes. However, existing unmixing methods suffer from an extremely high computational burden when extracting metadata from large-scale HSI data. To address this limitation, we implement a distributed and parallel unmixing method that operates on cloud computing platforms to accelerate the unmixing processing flow. In addition, we implement a globally standardized distributed HSI repository equipped with a large spectral library in a software-as-a-service (SaaS) mode, providing users with HSI storage, management, and retrieval services through web interfaces. Furthermore, the parallel implementation of unmixing processing is incorporated into the CBIR system to establish the parallel unmixing-based content retrieval system. The performance of our proposed parallel CBIR system was verified in terms of both unmixing efficiency and accuracy.


Introduction
Hyperspectral remote sensing, which is capable of measuring and interpreting earth objects without physical contact, has been employed as an important technique for governments and scientific research institutions to obtain earth object information directly and quickly [1]. Today, hyperspectral remote sensing plays a critically important role in many fields, such as mineral detection [2], ecological environment investigation and monitoring [3] (water environment [4], air pollution [5], soil erosion [6], vegetation detection [7], etc.), dynamic monitoring of marine [8], atmospheric [9], and land [10] environments, and military applications [11].
With the innovation of hyperspectral remote sensing technology, the spatial and spectral resolution of hyperspectral images (HSIs) is continuously improving [12]. At present, hundreds of multi-resolution hyperspectral remote sensing satellites and sensors operate in the air and in space to provide continual hyperspectral remote sensing data, with spatial resolutions ranging from a few kilometers to a few meters, for instance, the National Oceanic and Atmospheric Administration (NOAA) satellites [13] and the Moderate Resolution Imaging Spectroradiometer (MODIS) [14]. [...] Another shared-storage approach is network attached storage (NAS) [22], which achieves more powerful expandability and higher cost-effectiveness. At present, NFS file systems are widely used in computing clusters. However, the high protocol overhead, low bandwidth, and large latency of NAS have limited its application in high-performance clusters.
It is challenging to deal with the storage and processing of big remote sensing data simultaneously [23]. Fortunately, cloud computing has emerged in recent years. Cloud computing is highly promising in the hyperspectral field and offers an effective solution to the storage and processing of HSIs due to its scalable storage capabilities and high-performance computing capacity [24]. Wu et al. [25] developed a parallel and distributed implementation of principal component analysis (PCA), a widely used technique for hyperspectral dimensionality reduction; this implementation is built upon cloud computing and has demonstrated very high performance. In our previous work [23], a parallel CBIR system was proposed, but it lacks more dynamic cloud resource management (e.g., it uses restricted virtual resource management software) and more endmember information extraction algorithms. Nevertheless, to the best of our knowledge, there are very few efforts in the literature on constructing a unified and general-purpose cloud computing architecture for hyperspectral remote imagery.
In this paper, we propose a novel parallel CBIR system for a hyperspectral image (HSI) repository on cloud computing platforms, guided by unmixed spectral information, as an extension of [23]. To accelerate spectral information extraction, we implement a distributed and parallel unmixing method that operates on cloud computing platforms to accelerate the unmixing processing flow. In addition, we implement a globally standardized distributed HSI repository equipped with a large spectral library in a software-as-a-service mode, providing users with HSI storage, management, and retrieval services through web interfaces. Moreover, the unmixing processing is verified to achieve high efficiency, with up to a 31.5× speedup.
The rest of this paper is organized as follows. Section 2 describes the hyperspectral imagery CBIR system based on a cloud computing architecture released in SaaS mode, which is composed of three main layers: (1) the device layer, in charge of providing various underlying compute and storage resources; (2) the service layer, responsible for the storage, management, and processing of HSIs; and (3) the application layer, which provides users with secure remote access to the repository. Section 3 presents a distributed retrieval method for hyperspectral imagery based on parallel unmixing algorithms on Spark. Section 4 presents an experimental validation of the CBIR system. Finally, Section 5 summarizes the paper and discusses future work directions.

Cloud-Computing-Based CBIR Framework in SaaS Mode for Hyperspectral Remote Imagery Repository
This section describes our study focused on the HSI repository and parallel CBIR method on cloud-computing-based architecture in SaaS mode.

Cloud-Computing-Based CBIR Architecture in SaaS Mode
In order to overcome the shortcomings of existing hyperspectral image retrieval systems, Figure 1 illustrates a cloud-computing-based CBIR architecture in SaaS mode for an HSI repository. As a well-developed cloud computing model, SaaS is widely applied in various fields. SaaS applications are hosted in the cloud and are backed by effective storage and computing capacity. In the SaaS mode, software application services for hyperspectral remote imagery, e.g., HSI storage and processing, are provided to users over the Internet through a thin client, e.g., a web browser, and users purchase only the software services they need on demand. In this mode, users can ignore problems such as deployment, operations, and software upgrades.
Our proposed architecture can be divided into two parts: a management module and a service module. The main functionality of the management module is to ensure that the whole cloud computing center operates safely and stably; it plays an important role in service management, security, monitoring, resource maintenance, etc. The service module mainly provides hyperspectral image retrieval applications to customers in a Web-based manner. To make every part clearer and to keep the service module simple yet highly maintainable, we divide the entire framework into three layers: the device layer, the service layer, and the application layer. The cloud computing framework starts at the most fundamental device layer of storage and server infrastructure. The device layer provides the necessary physical resources for the whole cloud platform. It consists of physical assets such as servers, network devices, and storage disks. Through virtualization technology [26], these physical resources form a virtual resource pool. With virtualization, a physical server can provide users with several virtual machines, which have diverse computing capabilities and are independent of each other. In this manner, the framework's efficiency, flexibility, and resource utilization can be enhanced to lower IT costs. In addition, software and hardware are separated after virtualization, so users can deploy their applications without considering the underlying architecture. In brief, the device layer offers backend support for the upper layers and provides a highly available and scalable resource environment for upper-layer applications.
The service layer of the framework is mainly composed of a hyperspectral remote sensing big data storage module, a computing module, and a service module. This layer is the most important part of the architecture and is mainly responsible for executing parallel unmixing algorithms with high computational cost and for storing huge HSIs. For distributed computing, we adopt Apache Spark as the computing engine for parallel processing. Apache Spark [27] is a versatile framework for rapid computation over big data on large clusters and has been widely used in academia and industry. It inherits the advantages of Apache Hadoop, an earlier distributed computing framework, and is far more powerful. Spark's computational model is similar to Hadoop's Map-Reduce. However, considering the low efficiency of iterative Map-Reduce algorithms due to frequent IO operations, Spark builds an integrated and diversified big data processing system that implements a fault-tolerant abstraction for in-memory clusters and proposes a creative data structure, namely resilient distributed datasets (RDDs). To harness the computing power of a Spark cluster, developers need only call uniform, standard application programming interfaces (APIs) to perform RDD actions and transformations in Scala, Python, or Java, which greatly improves convenience and effectiveness. Through action and transformation operations, RDDs form a series of dependency lineages. If data in the lineage is incorrect or missing, the lineage can implement transaction rollback through a checkpoint mechanism with high fault tolerance. Through the distributed computing power provided by the Spark cluster, the overhead of the Web servers is reduced considerably. For storage, we adopt the Hadoop Distributed File System (HDFS) [28] as the underlying support for distributed storage. HDFS is the core sub-project and an important part of Hadoop.
It can be deployed on hundreds of low-cost servers and provides outstanding scalability and capacity with high-throughput IO. An HDFS cluster runs in master/slave mode and primarily consists of a NameNode that manages the file system metadata, a Secondary NameNode that assists the NameNode, and DataNodes that store the actual data. Through common interfaces, Spark imports data from HDFS and transforms it into RDDs for subsequent execution. Besides, this layer provides Web and database services. It plays a significant part as a bond between the service layer and the application layer through program calls. The application service receives user requests from the application layer and invokes the computing and storage services via the shell. The execution results are then delivered to the application layer.
The application layer of the framework is responsible for the deployment and execution of CBIR applications. This layer defines the interaction between end users and the CBIR system over the HTTP protocol and provides users with secure remote access to the system, including image acquisition, image retrieval, image management, image sharing, etc., in a dashboard through a Web browser. The application is deployed in a Tomcat servlet container running on web servers and is implemented with SpringMVC, an original web framework built on the Servlet API. For the front end, Vue, a progressive framework for building user interfaces, is selected to form an interactive and attractive user interface. The frontend exchanges data with the backend asynchronously through unified RESTful APIs in JavaScript Object Notation (JSON) format.

Storage Method for Hyperspectral Remote Images
The main HSI storage structure is presented in Figure 2. In order to build a scalable and efficient HSI repository, it is essential to have the data well organized. The storage structure consists of two parts: the first is the design of the metadata storage of HSIs, and the other is the design of the storage of original HSIs based on HDFS. To facilitate reading the original files, the relationship between HDFS and the file servers is established through hyperlinks stored in MySQL, whereas the HSI thumbnails are stored on the file servers.
Metadata is descriptive information for hyperspectral data and plays a very important role in remote sensing, especially when the data is of large size and hard to analyze in a timely way. Massive HSIs come from a wide range of sources, and the data standards of different HSI providers are not unified. Most providers ship HSIs with specific, independent header files that store this metadata. Therefore, it is particularly important to design a unified metadata structure for HSIs by analyzing the header file formats of HSIs from different organizations during data collection. There are many storage options, such as JSON, which is convenient for Web operation. In order to access this relatively small metadata efficiently, we carefully analyzed the characteristics of metadata access and chose the relational database MySQL to store HSI metadata.
The main database tables for HSI retrieval and the relations between them are shown in Figure 3. The database stores not only general information including image name, number of samples, data type, bands, lines, wave information, thumbnail, etc. (refer to table image in Figure 3), but also the metadata of endmembers and abundances (refer to table abundance in Figure 3). The spectral information and relevant basic information about each scene are obtained by parallel unmixing algorithms and stored in MySQL automatically in advance. The other part is the storage of original HSIs in HDFS. This part provides distributed storage support of massive HSIs for parallel computing in the HSI retrieval process. Considering the diversity of data sources, we carefully analyzed the storage formats of different HSIs. There are mainly three kinds of storage formats: band interleaved by pixel (BIP), band interleaved by line (BIL), and band sequential (BSQ) [29]. Specific algorithms need data in the format that fits them, so data conversion between BIP, BSQ, and BIL would bring additional overhead and greatly reduce the efficiency of unmixing execution. In order to avoid the cost of on-the-fly data conversion, we pre-convert the hyperspectral remote sensing images into all three data formats. Although this introduces some data redundancy, the extra disk cost is negligible.
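The three interleaves differ only in the axis order of the stored cube, so the pre-conversion step can be sketched with plain array transposes. The following is a minimal illustration, not the repository's actual converter; the axis conventions and the function name to_formats are our own assumptions:

```python
import numpy as np

def to_formats(cube):
    """Derive the three common interleaves from an HSI cube assumed to be
    stored as (lines, samples, bands). Each result is a view; writing one
    out sequentially yields the corresponding file layout."""
    bip = cube                      # (lines, samples, bands): band interleaved by pixel
    bil = cube.transpose(0, 2, 1)   # (lines, bands, samples): band interleaved by line
    bsq = cube.transpose(2, 0, 1)   # (bands, lines, samples): band sequential
    return bip, bil, bsq
```

Storing all three layouts trades a little disk space for the sequential read pattern each unmixing algorithm prefers.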

Method of the CBIR System
Early research on HSI repositories, which basically relied on text to retrieve hyperspectral imagery data [24], was limited by computer performance. It mainly archived and indexed information such as the location and time of image acquisition, which was regarded as the metadata of hyperspectral imagery data, to facilitate the research and analysis of remote sensing image retrieval. With the continuous development of content-based image retrieval for traditional images, the technique has also been applied to hyperspectral remote imagery repositories since the mid-1990s [30].
Up to the present, content-based image retrieval is still a research hotspot in the hyperspectral field. Although HSIs are different from traditional three-channel images, they still share the characteristics of traditional images. Therefore, some methods and techniques from traditional image retrieval can still be applied to hyperspectral remote image retrieval. Combined with the unique characteristics of hyperspectral remote images, such as their spectral characteristics, research on hyperspectral remote image retrieval has achieved good results. One popular strategy is proposed in Reference [31], which presents a spectral/spatial CBIR system for HSIs. One main problem the authors solved in the CBIR system is spectral unmixing, which includes two steps: (1) using endmember extraction algorithms to extract a set of image spectral features called endmembers; (2) performing abundance inversion to obtain spatial features. These two kinds of features are then integrated together, guiding queries to the database to retrieve images that meet certain spectral conditions.
In this paper, we follow the strategy of Reference [31] and design an unmixing-based parallel CBIR method built on cloud computing for higher efficiency. First, the HSI information extraction is accomplished in advance while uploading HSIs to the repository. The endmember signatures are extracted using the pixel purity index (PPI) [32] in parallel, and then the spectral angle distance (SAD) algorithm is used to identify the pure endmembers by comparison with the spectral library. A particular issue may arise when the wavelengths of the HSIs differ from the wavelengths of the spectral library used as input. For this purpose, we have implemented a linear interpolation strategy that looks for wavelength values present in both the extracted endmembers and the spectral library. Furthermore, the abundance estimation is conducted using sum-to-one constrained least squares (SCLS) in parallel. The information provided by endmember identification and abundance estimation is stored in tables image and abundance as metadata to catalog each image. Hence, when retrieving hyperspectral remote sensing images from the repository, one simply starts by querying the metadata information stored in the MySQL database (Figure 4). Two retrieval options are available in the proposed system. The first is query by hyperspectral remote sensing image directly. From the user's perspective, a standard retrieval procedure in the system can be summarized as follows.
1. Image input. In this step, the image to be retrieved is uploaded to HDFS.
2. Endmember extraction. The system extracts the image's endmember information automatically.
3. Signature comparison. For all spectral signatures in the spectral library, the system calculates the SAD with all the endmembers extracted in step 2 and identifies the extracted endmembers according to their SAD scores.
4. Abundance filtration. One or more endmembers of interest are chosen, and a minimum abundance filter for each endmember is defined. In this case, an image is retrieved only if the matched endmember has a higher total abundance in the scene than the predefined minimum abundance threshold.
5. Results display and manual selection. The retrieved hyperspectral remote images are shown in tabular form on the web page for users to select.
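The abundance filtration step amounts to a threshold query over the stored abundance metadata. A minimal Python sketch of that filter follows; the record layout and the function name filter_images are illustrative assumptions, not the system's actual schema:

```python
from collections import defaultdict

def filter_images(records, criteria):
    """records: iterable of dicts {"image_id", "endmember", "abundance"},
    one row per (image, endmember) pair, as cataloged in table abundance.
    criteria: dict mapping endmember name -> minimum total abundance.
    An image is retrieved only if every requested endmember meets its threshold."""
    totals = defaultdict(dict)
    for rec in records:
        img = totals[rec["image_id"]]
        img[rec["endmember"]] = img.get(rec["endmember"], 0.0) + rec["abundance"]
    return [image_id for image_id, abund in totals.items()
            if all(abund.get(name, 0.0) >= th for name, th in criteria.items())]
```

With the thresholds of the later querying example (0.1 for each of two signatures), only images whose stored abundances exceed both thresholds would be returned.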
The other option is query by endmembers directly provided by the spectral library. Compared with the previous option, this option skips steps 1 and 2 and avoids the complicated unmixing process, which makes querying more convenient and efficient.
In what follows, we provide a simple step-by-step example to demonstrate how to perform a simple hyperspectral image retrieval in our system. Figure 5a shows the list of HSIs in our system with their general metadata. Clicking the View command button displays the HSI's information in detail, including all metadata, the thumbnail, its endmembers' spectra, and the ratio of every endmember, as shown in Figure 5b. To expand the HSI repository, the system provides an interface for uploading. Once an HSI is uploaded, the user can decide which unmixing algorithms to use to extract the spectral information. As shown in Figure 5c, the Spark-based PPI, SAD, and SCLS algorithms were chosen, which can be executed on an appointed cluster in the cloud framework. In addition, the parameters for task execution, including Driver-Memory, Executor-Memory, and Executor-Cores, can also be specified on the web page. After the unmixing algorithms have been executed, the obtained HSI spectral information will be automatically cataloged in the database described in Figure 4. Figure 5d shows a querying example, in which we specified two spectral signatures (Clinochlore_Fe GDS157 and Eugsterite GDS140 Syn) from the United States Geological Survey (USGS) library as the query criteria. The minimum abundance is set to 0.1, which means the HSIs we are looking for contain at least 10% Clinochlore_Fe GDS157 and 10% Eugsterite GDS140 Syn. The query result is shown as a list, and the corresponding HSI's information can be viewed by clicking the View button. In this case, the HSI with id 41 contains 21.05% Clinochlore_Fe GDS157 and 18.25% Eugsterite GDS140 Syn, which satisfies the query criteria accurately. Readers who are interested in this work and would like to know more details are encouraged to download and run the source code of the project hosted on GitHub: git@github.com:ZpWaitingForSunshine/CBIR.git.

Hyperspectral Imagery Distributed Retrieval
In this section, as a case study, we describe a Spark-based parallel unmixing flow, including the PPI, SAD, and SCLS algorithms, which are the key steps of the hyperspectral remote sensing image distributed retrieval chain. The spectral features are the key information we make use of, so the hyperspectral remote sensing imagery unmixing is accomplished in advance.

Parallel PPI
Let us assume that X_{N×L} is a hyperspectral image with N pixel vectors x_i and L spectral bands. The traditional PPI algorithm can be formulated as follows:
1. Generate a set {skewer_j}_{j=1}^{Y} of Y random skewers, where each skewer_j is a random L-dimensional vector.
2. Iterate Y times: for each skewer_j, project all the pixel vectors onto skewer_j and record the sample vectors at its extreme positions, denoted as max_j = max_{1≤i≤N} {x_i^T · skewer_j} and min_j = min_{1≤i≤N} {x_i^T · skewer_j}. Let PPI(x_i) = {skewer_j, 1 ≤ j ≤ Y | x_i^T · skewer_j = max_j or min_j}.
3. Remove those pixels with low frequency, and use the spectral angle distance (SAD) to calculate the similarity between any two of the remaining pixels so as to remove similar pixels. Finally, all the remaining pixels form the endmember set M.
In the processing flow of the PPI algorithm [33], a large amount of calculation occurs in Step 2, and the computational complexity of this part is closely related to the number of random skewers Y, which directly determines the execution effort of the PPI algorithm. Therefore, we mainly parallelize the calculation in Step 2 based on Spark, as summarized in Algorithm 1. Built upon the cloud-computing-based CBIR framework described in Section 2 and the map/reduce paradigm, the distributed PPI algorithm for endmember extraction of massive hyperspectral images can be implemented by the following steps, which are shown in Figure 6.

1. We divide the original hyperspectral remote sensing image X_{N×L} into m partitions, which are stored on HDFS in a distributed way automatically.
2. We read the hyperspectral dataset from HDFS into ByteRDD as a stream of bytes using the built-in newAPIHadoopFile() method. After that, ByteRDD is converted into DataRDD according to the format of the remote sensing image. It is noteworthy that each element of DataRDD holds a complete spectral vector.
3. We generate Y random skewers in the driver and broadcast them to all executors through the resource manager. Thus, the data X_i, i = 1, 2, ..., m, in each partition shares the same skewers.
4. The following steps are similar to those of the traditional PPI algorithm. We perform the map operation on DataRDD and, in each executor, project the pixels onto each skewer to find the pixels at its extreme positions, forming a set E_i denoted by {tmax_k, tmax_p_k, tmin_k, tmin_p_k}^{i}_{1≤k≤Y}, where tmax_k and tmin_k denote the maximal and minimal projections on every skewer, and tmax_p_k and tmin_p_k denote their corresponding positions in the entire HSI, respectively.
5. The reduce operation submits all E_i from each partition to the driver and recomputes the maximal and minimal projections on every skewer. The result is cached in the driver, and the pixel purity index is counted from {tmax_k, tmax_p_k, tmin_k, tmin_p_k}.
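The per-partition map and the global reduce above can be illustrated with a small single-machine simulation in Python/NumPy. The partitioning mimics the RDD blocks; the function name, tie-breaking, and partition scheme are our own assumptions rather than the actual Spark code:

```python
import numpy as np

def ppi_counts(X, skewers, m_partitions=4):
    """X: (N, L) pixel matrix; skewers: (Y, L) random vectors.
    Simulates the map/reduce flow: per-partition extreme projections (map),
    a global reduce over partitions, then purity counting per pixel."""
    N = X.shape[0]
    parts = np.array_split(np.arange(N), m_partitions)
    # map: local extreme projections and their global pixel positions per partition
    local = []
    for idx in parts:
        proj = X[idx] @ skewers.T                       # (n_i, Y) projections
        local.append((proj.max(0), idx[proj.argmax(0)],  # tmax_k, tmax_p_k
                      proj.min(0), idx[proj.argmin(0)]))  # tmin_k, tmin_p_k
    # reduce: keep the global max/min per skewer and count pixel purity
    counts = np.zeros(N, dtype=int)
    for k in range(skewers.shape[0]):
        best_max = max(local, key=lambda t: t[0][k])
        best_min = min(local, key=lambda t: t[2][k])
        counts[best_max[1][k]] += 1
        counts[best_min[3][k]] += 1
    return counts  # pixels with high counts are endmember candidates
```

Each partition contributes only 4Y scalars to the reduce, which is what keeps the driver-side traffic small regardless of image size.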

Parallel SAD Implementation
After the processing in the previous step is completed, spectral feature matching of the endmember extraction results against the spectral library should be carried out. The spectral feature matching algorithm used in the system is based on the spectral angle distance (SAD) method, which measures the distance between two spectra by calculating the angle between their spectral vectors. Let e_i and e_j be two spectral vectors. The SAD value is defined as d_SAD(e_i, e_j) = cos^{-1}( (e_i · e_j) / (‖e_i‖ ‖e_j‖) ).
Let {e_i}_{i=1}^{N} be the endmember set of the spectral library and {e_j}_{j=1}^{R} be the endmember set of the extraction results, where N is the total number of spectral library endmembers and R is the total number of extracted endmembers. The SAD algorithm can be implemented in a parallel and distributed form by the following steps, which are graphically summarized in Figure 7.
1. First, the spectral library {e_i}_{i=1}^{N} is read from the MySQL server into a LibRDD instance, which is divided into n partitions. Every partition contains a list of N/n elements, which are tuples including name, wavelength, and reflectance. Since R is always much smaller than N, we broadcast {e_j}_{j=1}^{R} to the computing nodes through the Spark broadcast mechanism.
2. Perform the map operation on LibRDD. The map operation is in charge of calculating the SAD value between each e_i in its partition and every e_j in the broadcast. To solve the problem that the wavelengths of the extraction results differ from the wavelengths of the spectral library, we take wavelengths in the same range and align the spectra by linear interpolation.
3. Conduct the reduce operation on LibRDD to find the minimum SAD value for each spectral vector e_j and identify the spectrum type.
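The SAD computation with wavelength alignment by linear interpolation (step 2) can be sketched in NumPy as follows; the 50-point common grid and the function names are illustrative assumptions, not the system's actual parameters:

```python
import numpy as np

def sad(a, b):
    """Spectral angle distance between two reflectance vectors."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def match_endmember(em_wl, em_ref, library):
    """Identify one extracted endmember (em_wl, em_ref) against a library of
    (name, wavelengths, reflectance) tuples. Spectra are aligned by linear
    interpolation over the overlapping wavelength range, as in the text."""
    best = (None, np.inf)
    for name, wl, ref in library:
        lo, hi = max(em_wl[0], wl[0]), min(em_wl[-1], wl[-1])
        grid = np.linspace(lo, hi, 50)   # common wavelength grid (size is an assumption)
        a = np.interp(grid, em_wl, em_ref)
        b = np.interp(grid, wl, ref)
        d = sad(a, b)
        if d < best[1]:
            best = (name, d)
    return best
```

Because SAD is invariant to scaling, an endmember matches a library spectrum even when their overall reflectance magnitudes differ.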

Parallel Sum-to-One Constrained Least Squares (PSCLS) Abundance Inversion
The sum-to-one constrained least squares (SCLS) method, based on the linear mixing model, is widely applied in abundance inversion. The reflectance of a mixed pixel can simply be regarded as a linear combination of multiple endmembers' reflectances, which can be formulated as

r = Mα + ε,

where r is any L-dimensional pixel vector in a hyperspectral image, M = [m_1, m_2, ..., m_n] is the endmember matrix with L spectral bands and n endmembers, α = (α_1, α_2, ..., α_n)^T represents the abundance coefficients of the pixel, and ε is a nonnegative L-dimensional error term. Here, we must guarantee that the abundance coefficients sum to one: Σ_{i=1}^{n} α_i = 1.
In this approach, ε is expected to be as small as possible. Therefore, the optimization problem is formulated as

min_α ‖r − Mα‖², subject to Σ_{i=1}^{n} α_i = 1.

Without the constraint, we get the least squares estimate of the abundance coefficients:

α̂_LS = (M^T M)^{-1} M^T r.

After that, we introduce a matrix N and a vector s to incorporate the constraint:

N = [δM; 1^T], s = [δr; 1],

where 1 = (1, 1, ..., 1)^T and δ is used to normalize M; we set δ = 10^{-5}. It is as if the hyperspectral image embraced an extra band whose values are all 1. Correspondingly, we get the solution

α̂ = (N^T N)^{-1} N^T s.

In order to accelerate the SCLS algorithm, a data-parallel strategy based on Spark is used, as shown in Figure 8 and Algorithm 2. First, we pre-process the endmember matrix M and δ to form N in the driver. Then, we prepare pr_NN = (N^T N)^{-1} N^T, which is a fixed value. In order to reduce network traffic and memory overhead, we broadcast pr_NN to the computing nodes through the Spark broadcast mechanism. Since the original image data is divided into many partitions, we read the pixel data r of every partition on HDFS to form DataRDD instances. The pixel data is transformed into X_{p×L}, where p denotes the number of pixels in the partition. After that, we perform the map operation to get SCLSRDD. The map operation is in charge of augmenting X_{p×L} into X_{p×(L+1)} (which is similar to the operation on the endmember matrix M) and multiplying the broadcast pr_NN with each augmented pixel vector in X_{p×(L+1)} to get the abundance coefficients of each pixel.

Algorithm 2: Parallel Sum-to-One constrained Least Squares abundance inversion on Spark.
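A compact NumPy sketch of the SCLS solution: pr_NN is computed once from the augmented endmember matrix N (as in the driver) and applied to every augmented pixel (as in the map operation). The shapes and the function name scls are our own conventions, not the actual Java/Scala implementation:

```python
import numpy as np

def scls(X, M, delta=1e-5):
    """Sum-to-one constrained least squares abundances.
    X: (p, L) pixel matrix; M: (L, n) endmember matrix (columns are endmembers).
    Implements alpha = (N^T N)^{-1} N^T s with N = [delta*M; 1^T], s = [delta*r; 1]."""
    L, n = M.shape
    N = np.vstack([delta * M, np.ones((1, n))])            # augmented endmember matrix
    pr_NN = np.linalg.inv(N.T @ N) @ N.T                   # fixed (n, L+1) operator, broadcast once
    S = np.hstack([delta * X, np.ones((X.shape[0], 1))])   # pixels with the extra all-ones band
    return S @ pr_NN.T                                     # (p, n) abundance coefficients
```

Since pr_NN depends only on the endmembers, each partition needs just one small matrix multiplication per block of pixels, which is exactly what makes the map side cheap.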

Experiment Setup
We use a synthesized HSI dataset to verify the proposed parallel CBIR system. The images in the dataset are generated in a fully controlled environment so that the accuracy and efficiency of the parallel algorithms can be effectively tested and validated. In this work, we randomly select five spectra from the United States Geological Survey (USGS) digital spectral library [34], including Clinochlore_Fe, Eugsterite GDS140 Syn, Heulandite GDS3, Labradorite HS105.3B, and Montmorillonite+Illi CM42, as shown in Figure 9a. Each spectrum contains a total of 224 bands in the range of 0.4 to 2.5 µm. Then, the Gaussian field method [35] is used to generate simulated abundance maps of the 5 minerals with a size of 128 × 128, as shown in Figure 9b. Finally, the simulated Dataset_S is generated with a linear mixing model, as shown in Figure 9c.
The other hyperspectral image scene used in our experiments is a subset of the well-known Airborne Visible Infrared Imaging Spectrometer (AVIRIS) Cuprite image [36], available online in reflectance units from http://aviris.jpl.nasa.gov/data/freedata.html. This image was collected over the Cuprite mining site, Nevada, in 1995 and has been widely used to validate the performance of endmember extraction algorithms. It consists of 224 spectral channels in the range of 0.4 to 2.5 µm. Bands 1-3, 107-114, 153-169, and 221-224 were removed prior to analysis due to water absorption and low signal-to-noise ratio. The final Dataset_C 1 comprises 350 × 350 pixels, 192 bands, and a total size of about 44.86 MB. In order to evaluate the performance on big datasets of different sizes, five larger datasets, denoted Dataset_C 2-6, were generated using the Mosaicking function of the ENVI software.
To verify the accuracy and efficiency of the unmixing-based content retrieval method for the hyperspectral imagery repository, we deployed an OpenStack cluster which contains a control node equipped with an eight-core i7-9700K CPU at 3.6 GHz with 16 GB memory and a 500 GB solid-state disk (SSD), a network node with the same configuration as the control node, and five computing nodes, each equipped with a sixteen-core i9-9960X CPU at 3.1 GHz with 128 GB memory and a 500 GB SSD. After virtualization, a Hadoop and Spark cluster is built with one master and eight slaves. The master node is allocated 16 cores (logical processors), and each slave node is allocated 64 cores and 128 GB memory. All nodes have Java 1.8.0_201, Apache Spark 2.3.3, Hadoop 2.7.3, Tomcat 9.0, and CentOS 7.0 as the operating system installed. The distributed parallel version is implemented in hybrid Java and Scala programming. The web services version is realized with Spring, SpringMVC, MyBatis (SSM), MySQL, and Vue.
All software and their versions used in experiments are listed in Table 1.

Evaluation of Parallel Unmixing Accuracy
First, we evaluate the accuracy of our parallel and distributed implementation of the unmixing chain (PPI and SCLS as a case study) on the original Dataset_S and Dataset_C 1-6. The serial, pseudo-distributed, and distributed unmixing processes are performed in the same execution environment. In order to quantitatively evaluate the accuracy of image unmixing, we use the SAD scores of the endmember comparison to measure spectral similarity and the root-mean-square error (RMSE) of the estimated abundance fractions to measure abundance similarity. The SAD scores are calculated as in Formula (1), and the RMSE score is calculated as follows:

RMSE = sqrt( (1/(p·L)) Σ_{i=1}^{p} Σ_{j=1}^{L} (X_{ij} − (α̂M)_{ij})² ),

where L and p stand for the number of bands and pixels of the image X of size p × L, respectively, α̂ represents the abundance coefficient matrix of size p × n, and M is an endmember matrix of size n × L. The serial, pseudo-distributed, and distributed unmixing processes give the same SAD and RMSE scores for endmember extraction and abundance inversion, which demonstrates the accuracy of the distributed parallel version of the unmixing algorithms. In addition, we can conclude that our system can effectively retrieve images consisting of closely similar endmembers.
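The RMSE above is computed directly from the reconstruction α̂M; a one-line NumPy sketch with the shapes following the definitions in the text:

```python
import numpy as np

def rmse(X, alpha_hat, M):
    """Root-mean-square reconstruction error.
    X: (p, L) image; alpha_hat: (p, n) abundance matrix; M: (n, L) endmembers."""
    return np.sqrt(np.mean((X - alpha_hat @ M) ** 2))
```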

Computational Performance Evaluation
In this section, we perform the proposed distributed parallel unmixing processing on Dataset_C 2-6 described in Section 4.1 to further evaluate the computational efficiency. In order to describe the acceleration of unmixing on Spark in detail, the parallel PPI and parallel SCLS were separated from the unmixing chain and evaluated individually. The execution times and speedups of the serial and parallel versions are given in Tables 2 and 3, respectively, in which "cores" denotes the total number of executor cores. It is worth mentioning that we upload all datasets into HDFS with the same block size of 11,760,000 bytes (an integral multiple of the pixel-vector size over the bands), so that every pixel vector is stored entirely within one node. Therefore, the parallel PPI and SCLS on Spark can read the original data and form an RDD in which every partition corresponds to one HDFS block.
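The alignment between HDFS blocks and pixel vectors can be checked arithmetically. Assuming 16-bit samples (a common sample width for AVIRIS products; the sample width is our assumption, not stated in the text) and the 192 retained bands, the chosen block size holds a whole number of pixel spectra:

```python
BANDS = 192                # spectral bands retained in Dataset_C
BYTES_PER_SAMPLE = 2       # assumption: 16-bit samples per band
BLOCK_SIZE = 11_760_000    # HDFS block size used in the experiments (bytes)

pixel_bytes = BANDS * BYTES_PER_SAMPLE            # bytes per pixel spectrum
pixels_per_block, remainder = divmod(BLOCK_SIZE, pixel_bytes)

# remainder == 0 means no pixel spectrum straddles two HDFS blocks,
# so each RDD partition maps cleanly onto one block of whole pixels.
print(pixels_per_block, remainder)
```

A zero remainder is what allows each Spark partition to process complete pixel vectors without cross-node reads.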
We further analyze the processing times and speedups of the serial and parallel versions. Benefiting from virtualization technology, we can request enough memory to execute the serial version on big data. When counting the execution time, the initialization time of the SparkContext and the data-reading time are also included, since these are generally higher in a Spark cluster than on a single machine, owing to the communication delays between the nodes of the cluster and between the storage and computing modules. Referring to the execution time and speedup statistics of parallel PPI reported in Table 2 and Figure 10, we observe that when the number of cores is below a certain value, the speedup is nearly linear with respect to the core count. However, if we continue to increase the number of cores, the growth of the speedup becomes slower or even negative. This is due to the fact that, as the number of cores increases, the communication cost between the nodes grows superlinearly and is no longer negligible. For example, on Dataset_C 2 (1.4 GB), employing 16 cores leads to a 13.1× speedup, and the speedup increases by 80.15% when the number of cores is doubled to 32; however, when the number of cores reaches 64, the speedup is only 17.37% higher than with 32 cores. Referring to the execution time and speedup statistics of parallel SCLS reported in Table 3 and Figure 11, similar speedup trends with respect to the number of cores can be observed for all datasets. The only difference is that, for a fixed number of cores, the speedup of parallel SCLS increases with the data size, while the speedup of parallel PPI remains roughly constant.
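For reference, the Dataset_C 2 figures quoted above can be converted into parallel efficiencies (speedup divided by core count), which makes the flattening explicit; the 32- and 64-core speedups below are reconstructed from the stated percentages:

```python
# Speedups for Dataset_C 2 (1.4 GB), reconstructed from the text.
s16 = 13.1
s32 = s16 * (1 + 0.8015)   # "+80.15% through doubling the number of cores"
s64 = s32 * (1 + 0.1737)   # "only 17.37% higher than ... 32 cores"

for cores, s in [(16, s16), (32, s32), (64, s64)]:
    efficiency = s / cores  # fraction of ideal linear speedup achieved
    print(f"{cores} cores: speedup {s:.1f}x, efficiency {efficiency:.0%}")
```

The efficiency drops from roughly 82% at 16 cores to under half at 64 cores, quantifying the superlinear growth of communication cost described above.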
As shown in Table 2, the map operation that computes the maximal and minimal projections accounts for the majority of the overall execution time, which is why the parallel PPI's speedups are relatively constant for a fixed number of cores. The improvement of parallel SCLS with data size arises because the system and communication overhead depends more on the number of computing nodes than on the data size [37,38]. As the data size increases, the proportion of actual computation time goes up, since its ratio to the communication and system overhead scales up. These results also explain why the parallel PPI's computational efficiency is better than the parallel SCLS's.
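This behavior is consistent with a simple cost model in which the per-run system and communication overhead depends on the node count but not on the data size. The following toy model (illustrative only, with made-up rates, not fitted to our measurements) reproduces the trend: for a fixed core count, larger inputs amortize the fixed overhead, so the modeled speedup climbs toward the ideal value.

```python
def model_speedup(data_gb, cores, compute_rate=1.0, overhead=2.0):
    """Toy cost model: serial time scales with data size, while the
    parallel run adds a fixed overhead independent of data size."""
    t_serial = data_gb / compute_rate          # pure computation time
    t_parallel = t_serial / cores + overhead   # perfect split + fixed cost
    return t_serial / t_parallel

# With 32 cores fixed, a 10x larger input yields a much higher speedup,
# because the fixed overhead shrinks relative to the computation.
small = model_speedup(10, 32)
large = model_speedup(100, 32)
print(small, large)
```

In this model the speedup is bounded above by the core count, matching the sub-linear scaling observed in Tables 2 and 3.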

Method Advantage Analysis
The experimental results presented in Section 4 justify the accuracy and computational efficiency of the parallel unmixing-based CBIR system. We believe this depends largely on the cloud-computing-based CBIR architecture in SaaS mode. The storage, management, and retrieval services for HSIs are provided free of charge through a web browser, which brings great convenience of maintenance and use. In the concrete implementation, we first propose a standard HSI repository on cloud computing platforms. In the repository, we keep a standard format for HSI data that arrives in different formats from all over the world. In this way, we can distribute the hyperspectral data to potential users in time and even improve users' work efficiency by reducing the overhead of format conversion. In addition, we design a unified unmixing-based hyperspectral retrieval process that supports two retrieval modes: retrieving images with a query image, and retrieving images with spectra directly. First, a spectral library is built that, ideally, covers every substance. The spectral/spatial information is extracted by the parallel unmixing method and then stored in a MySQL database, and the spectra of the data are indexed by the spectra of the library. Therefore, when users retrieve images, especially in the second mode, band-level comparisons such as SAD can be omitted and results can be obtained quickly compared with traditional methods. As a result, the HSI storage, metadata storage, data management, and retrieval form a closed loop.
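The index-based retrieval step can be sketched as follows (a minimal illustration; the function names and in-memory data layout are hypothetical, whereas the real system keeps the index in MySQL): each extracted endmember is mapped once, at ingestion time, to its nearest library spectrum, so that retrieval-by-spectra later reduces to an ID lookup with no band-level comparison.

```python
import numpy as np

def nearest_library_id(endmember, library):
    """Map an extracted endmember to the ID of the most similar library
    spectrum by spectral angle (done once, when an image is ingested)."""
    def angle(a, b):
        c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return float(np.arccos(np.clip(c, -1.0, 1.0)))
    return min(sorted(library), key=lambda i: angle(endmember, library[i]))

def retrieve(query_ids, index):
    """Retrieval by spectra: return images whose indexed endmember IDs
    cover the query IDs -- a set lookup, no per-band SAD computation."""
    return [img for img, ids in index.items() if set(query_ids) <= ids]
```

Because the expensive spectral comparison happens only once per image at indexing time, each subsequent query costs a set-containment check per image rather than a SAD evaluation per endmember pair.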
Moreover, we make use of the powerful storage and computing capacity of cloud computing to store various HSIs and accelerate the unmixing processing flow. More HSI data can be stored and processed than on traditional platforms, such as single machines and GPU clusters, owing to the scalability of cloud computing. To increase the processing speed and shorten the period of endmember extraction, we implement the distributed parallel unmixing, including the parallel PPI and SCLS algorithms, on Spark. As shown in Figures 10 and 11, we obtain a good acceleration ratio both for synthetic data and for real hyperspectral images of different sizes. In addition, the unmixing algorithms are extensible if they are coded to uniform standards; as a result, unmixing algorithms can be selected for higher computing speed and accuracy.

Limitation Analysis
In fact, the parallel unmixing-based content retrieval system for a distributed hyperspectral imagery repository on cloud computing platforms offers users great convenience, but it requires substantial computing resources, including computing power, storage space, and network traffic. Besides, in the real world, many uncertainties lead to poor HSI quality, which places high demands on the unmixing algorithms. If the extracted spectral information is inaccurate, the accuracy of the retrieval suffers in turn.
Furthermore, the acceleration ratio cannot increase linearly with the number of cores during unmixing acceleration. This is caused by Spark's built-in task scheduling and data migration. Therefore, the parallel strategy for the unmixing algorithms should be carefully designed.

Conclusions
In this paper, we propose a novel parallel CBIR system for a hyperspectral image (HSI) repository on cloud computing platforms, guided by unmixed spectral information. To accelerate spectral information extraction, we implement a distributed and parallel unmixing method that operates on cloud computing platforms. In addition, we implement a global standard distributed HSI repository equipped with a large spectral library in a software-as-a-service mode, providing users with HSI storage, management, and retrieval services through web interfaces. As a case study, we deploy a Spark-based parallel unmixing flow including PPI, SAD, and SCLS to verify the accuracy and efficiency of the proposed architecture. Experimental results demonstrate that the proposed parallel and distributed implementation is effective in terms of both unmixing accuracy and computational efficiency. In future work, we plan to implement more unmixing algorithms for meta-data extraction and to provide more retrieval methods for more types of remote sensing images.

Institutional Review Board Statement:
The study did not involve humans or animals.

Informed Consent Statement:
The study did not involve humans.