A New GPU Implementation of Support Vector Machines for Fast Hyperspectral Image Classiﬁcation

: The storage and processing of remotely sensed hyperspectral images (HSIs) is facing important challenges due to the computational requirements involved in the analysis of these images, characterized by continuous and narrow spectral channels. Although HSIs offer many opportunities for accurately modeling and mapping the surface of the Earth in a wide range of applications, they comprise massive data cubes. These huge amounts of data impose important requirements from the storage and processing points of view. The support vector machine (SVM) has been one of the most powerful machine learning classiﬁers, able to process HSI data without applying previous feature extraction steps, exhibiting a robust behaviour with high dimensional data and obtaining high classiﬁcation accuracies. Nevertheless, the training and prediction stages of this supervised classiﬁer are very time-consuming, especially for large and complex problems that require an intensive use of memory and computational resources. This paper develops a new, highly efﬁcient implementation of SVMs that exploits the high computational power of graphics processing units (GPUs) to reduce the execution time by massively parallelizing the operations of the algorithm while performing efﬁcient memory management during data-reading and writing instructions. Our experiments, conducted over different HSI benchmarks, demonstrate the efﬁciency of our GPU implementation.


Introduction
Recent advances in computer technology allowed for the development of powerful instruments for remote sensed data acquisition, lowering both the cost of their production and the cost of launching new Earth Observation (EO) missions.In particular, imaging spectroscopy (also known as hyperspectral imaging) [1] has attracted the attention of many researchers because of the great potential of hyperspectral images (HSIs) in characterizing the surface of the Earth by covering the visible, near-infrared and shortwave infrared regions of the electromagnetic spectrum.To exploit these data, multiple EO missions are now using imaging spectrometers, such as the Environmental Mapping and Analysis Programme (EnMAP) [2,3], or the Hyperspectral Precursor of the Application Mission (PRISMA) [4].In this context, a high-dimensional stream of remotely sensed HSI data is now being generated, which can be applied to multiple tasks after being properly processed, such as managing of natural and environmental resources, urban planning and monitoring of agricultural fields, risk prevention and natural/human disaster management, among others.However, HSI data processing faces important memory and computational requirements, due the large amount of information provided by EO airborne/satellite instruments.In fact, imaging spectrometers are able to collect hundreds of images over large areas along the Earth's surface, gathering hundreds of narrow and continuous spectral bands along different wavelengths.As a result, each pixel in a HSI cube measures the reflection and absorption of electromagnetic radiation from ground objects into several spectral channels, creating a unique spectral signature for each material optically detected by the spectrometer [5].This allows for a very accurate characterization of land cover surface, which is useful for the modeling and mapping of materials of interest but presents significant computational requirements.In particular, spectral-based classification algorithms need to process large amounts of spectral information in order to correctly assign a label (which corresponds to a fixed land cover class) to each pixel in the HSI scene, facing heavy computation burdens [6].In this sense, the development of parallel implementations of traditional classification algorithms for fast and accurate processing is a requirement in order to efficiently process large HSI data repositories through proper management of computational resources.
With this aim, high performance computing (HPC) has become an efficient tool, not only for managing and analyzing the increasing amount of remotely sensed data that are currently being collected [7][8][9][10], but also for efficiently dealing with the high dimensionality of HSI data cubes [11,12].From a computational point of view, most spectral-based HSI data classification algorithms exhibit inherent parallelism at three different levels [13]: • through HSI pixel vectors (coarse-grained pixel level parallelism), • through spectral-band information (fine-grained spectral level parallelism), and • through tasks (task-level parallelism).
HSI classification algorithms generally map nicely on HPC systems [14], with several HPC approaches (from coarse-grained and fine-grained parallelism techniques to complex distributed environments) already successfully implemented.For instance, distributed approaches based on commodity (homogeneous and heterogeneous) clusters [15,16], grid computing [17,18], and cloud computing techniques [19][20][21][22] provided good results in terms of reducing execution times, enabling an efficient processing of massive data sets on the ground segment.However, dedicated resources such as massively parallel clusters and networks of computers are generally expensive and hard to maintain, being cloud computing a more appropriate and cheaper approach (in addition to a fully distributed solution) due to its "service-oriented" computing and "pay-per-use" model [23].Neither cluster nor cloud computing models allow on-board processing.
On the other hand, parallel solutions based on multicore CPUs [10,24,25], graphics processing units (GPUs) [14,[26][27][28][29], and field programmable gate arrays (FPGAs) [30][31][32][33][34] were also successful, leading to near real-time performance in many HSI applications [35].These platforms generally provide a cheaper solution when compared to large-scale distributed systems.Also, these approaches consume much less energy in comparison with cluster-based or distributed systems, being GPU systems (in particular, those with low-power consumption [36]) and FPGAs the ones currently offering the best performance/consumption ratio.Nevertheless, although both kinds of devices are appealing for on-board processing, FPGAs (despite their reconfigurability) are still more difficult to program than GPUs.In fact, GPUs are massively parallel programmable systems that can satisfy the extremely high computational requirements by HSI-based applications at low cost, with very good programmability.As a result, many HSI data processing algorithms were implemented on GPUs, including target and anomaly detection methods [37,38], and techniques for image compression [39], scene classification [40], and spectral unmixing [41].
In this paper, we develop a new GPU implementation of the traditional support vector machine (SVM) algorithm [42].Traditionally considered to be a powerful classifier in a wide variety of fields, such as medical computing or text and image processing (among others), the SVM has been succefully adopted in HSI classification problems.It separates two classes of samples with the maximal margin by exploiting an optimal decision boundary [43].As a result, SVMs have often been found to provide higher classification accuracies than other widely used pattern recognition techniques, such as the maximum likelihood, the random forest (RF) or the multi-layer perceptron (MLP) classifiers.Furthermore, SVMs appear to be particularly advantageous in heterogeneous environments, for which only a few training samples are available per class.The SVM has demonstrated its success in HSI data classification in a multitude of scientific works within the remote sensing field [44] due to its robust behaviour when dealing with high-dimensional data and its great ability to generalize to unseen data.However, SVMs exhibit a high computational cost, which strongly discourages its use when facing large-scale, real-time HSI classification problems.
To mitigate the aforementioned problem, several techniques were adopted to reduce the execution time of SVMs.For instance, in [45], Osuna et al. presented a decomposition algorithm by addressing sub-problems iteratively to enable tackling larger problems.In [46], Platt proposed the sequential minimal optimization (SMO) algorithm by decomposing the quadratic programming (QP) problem into a series of smaller QP subproblems that can be solved analytically, without the need for a time-consuming QP optimization.Ease of implementation is a typical feature of the SMO algorithm.In [47], Fan et al. developed a series of working set selection (WSS) heuristics, based on second order information, to achieve a faster convergence of the SMO.Recently, some researchers used the portability and excellent computing performance of GPUs to address the computational requirements of SVMs.In particular, Tan et al. [48] proposed a novel two-level parallel computing framework based on NVIDIA's compute device unified architecture (CUDA) and OpenMP to accelerate SVM-based classification, while Li et al. [49] used GPU devices to improve the speed performance of SVM training and prediction stages.
In this work, we present a novel GPU implementation of SVMs that does not only reduces the execution time by massively parallelizing the operations of the algorithm on the GPU, but also performing efficient memory management during data-reading and writing operations [50].We empirically evaluate the efficiency of our method considering different HSI real scenes, which comprise urban and agricultural land-cover information, and observed that it is able to maintain the precision in the results while significantly improving the computational performance.
The remainder of the paper is organized as follows.Section 2 presents the fundamentals of the SVM method adapted to HSI data.Section 3 describes the GPU implementation of the considered classifier.Section 4 validates the GPU implementation by comparing it with other existing implementations.Finally, Section 5 concludes the paper with some remarks and hints at plausible future research lines.

Linear SVM
Any HSI classification method y = f (x) can be described as the mapping function f : N n b → N, where the domain is composed by those spectral pixels being n c the number of different land cover categories.The final goal is to obtain the sample-label pairs for each HSI image element, {x i , y i } n h •n w i=1 , obtaining as a result the final classification map Y ∈ N n h ×n w .
In this context, the original SVM method [51] was proposed for binary classification problems, i.e., n c = 2 so y i ∈ {1, −1}, ∀i ∈ {1, • • • , n h • n w }, where the two considered classes are linearly separable, with at least one hyperplane separating them (see Figure 1).This hyperplane is defined by its normal vector w ∈ R n b and the bias b ∈ R that defines the distance between the samples and the hyperplane.In this regard, the classifier can be defined as the discriminant function: where f (x) = 0 is the hyperplane and the samples lie on the planes f (x) = ±1 depending on the class they belong to.Both parameters w and b are estimated by the SVM during the training stage in order to maximize the distance between the closest sample and the hyperplane, i.e., with the aim of maximizing the margin m = 2/||w|| [42].This maximization problem should be conveniently replaced by the equivalent minimization problem 1 2 ||w|| 2 .In the end, the SVM finds the optimal hyperplane along the space of samples by minimizing the quadratic optimization problem given by Equation ( 2 ||w|| 2 , subject to y i (w where n t is the number of training samples, x i is the i-th sample, and y i is the correct output provided by the SVM, i.e., +1 for positive samples and -1 for negative ones.Considering the Lagrangian formulation [52], Equation ( 2) can be simplified by replacing the inequality ≥through a dual problem.In particular, employing the Lagrange multipliers α i ≥ 0 ∀i ∈ {1, • • • , n t }, we define the optimization problem given by Equation ( 3), which only depends on the set of Lagrange multipliers α i , subject to α i ≥ 0 and where x i and x j are two samples of the training set with labels y i and y j and Lagrange multipliers α i and α j , respectively.Equation (3) can be computed using QP methods, such as the SMO [53], which breaks the problem into a series of smallest and equivalent QP problems, solving each one analytically.With this problem representation, the normal vector w can be derived as the sum of all the samples that belong to the training set and modified by their corresponding Lagrange multiplier together with the associated class of the sample, w = n t ∑ i=1 α i y i x i , while the bias is obtained for some multiplier α j > 0 as b = wx j − y j .Moreover, the discriminant function f (x) provided by Equation (1) can be solely described in terms of Lagrange multipliers and samples as: It is worth noting that each Lagrange multiplier α i measures the relevance of the sample x i when calculating the hyperplane and classifying the data, so that samples with α i = 0 will be ignored, while samples with α i > 0 will be considered to be support vectors.Also, from Equation (4), we can observe that, to perform the classification of any sample x, a significant volume of memory will be required, consuming also a large amount of computational resources to compute the dot products needed.

Linear SVM for Linearly Nonseparable Data Classification
The linear SVM has a limitation: with linearly nonseparable data classification there may not be a hyperplane, so the solution becomes infinite.To overcome this drawback, the linear soft-margin SVM implements a new cost function, which takes into account the original margin maximization and including a loss term that penalizes misclassified data, as it can be observed in Equation ( 5): where δ i ≥ 0 are some slack variables that are introduced into the inequality restriction y i (w relax the condition that all the samples that belongs to the same class lie on the same side of the hyperplane.Thus, δ i can be interpreted as some measure of the amount of misclassifications. Also, a regularization constant C ∈ [0, ∞) is included to control the weight of minimizing 1 2 ||w|| 2 while penalizing solutions with very large δ i .In fact, C can be considered to be a parameter that adjusts the robustness or flexibility of the model, depending its value.
Once more, the Lagrangian formulation can be applied to translate the primal problem defined by Equation ( 5) into the dual problem of Equation (3), by simply introducing the box constraint C ≥ α i ≥ 0. The slack variables δ i do not affect the optimization problem.

Kernel SVM for Non-Linear Data Classification
As an alternative to the soft-margin implementation, the SVM method can be adapted to perform non-linear classification by including a kernel [54,55] within the classifier.The main idea is to map the original feature vectors x i ∈ N n b into a higher dimensional Euclidean space, denoted as H, by using a non-linear vector function Φ : N n b → H, as we can observe in Figure 2. In this context, the optimal margin problem can be posed in the space H by replacing the original inner product between the samples x i • x with the transformed vectors Φ(x i ) • Φ(x), so the discriminant function will be defined as: Equation ( 6) can be easily simplified by assuming there is such a kernel function that K(x i , x) = Φ(x i ) • Φ(x).This assumption allows avoiding the computation of the inner product between transformed vectors, significantly reducing the computational consumption of the algorithm.To ensure the existence of the underlying map Φ, the kernel function is positively defined [51] by satisfying Mercer's conditions [56].
Finally, the solution of the dual problem can be also meaningfully simplified as: In fact, the kernel function usually measures the distance between the input sample x i and the other training samples x j .Table 1 shows some of the most widely used kernels.In particular, the RBF kernel has quite interesting properties.It can be simplified as exp −γ||x i − x j || 2 , where γ controls the model's performance, in the sense that a small value of γ will provide classification solutions with low bias and high variance, while a high value of γ will give solutions with high bias and low variance.
Table 1.Most common kernels in the SVM method, where x i and x j are two samples of dataset X.

Multi-Class SVM
The aforementioned SVM implementations can be generalized to develop multi-class classification by combining several binary classifiers [42].In this regard, there are two main strategies that can be adopted to perform multi-class classification [57]: (i) constructing and combining several binary classifiers, and (ii) considering all data in one optimization formulation.
Regarding the first approach, it is usual to find one-against-all, one-against-one and decision directed acyclic graph (DDAG)-based approaches for multi-class SVM in the available literature [58][59][60].The first approach constructs n c pairwise classifiers, where the k-th model is trained with those samples that belong to the k-th class by assigning them a positive label (while the rest of the samples are paired with negative labels).The classification procedure applies the n c SVM models to each sample x i , selecting the label of the classifier f k (x) with the largest margin: As a result, n c n t -variable QP-problems have to be solved, which implies that training time scales linearly with n c .Instead of developing n c classifiers, the one-against-one approach adopts a pairwise decomposition strategy, implementing n c (n c − 1)/2 binary models on all pairs of training data.Each model f kt is trained on data that belong to two different classes, k and t, ignoring all the other classes.In this way, those samples that belong to class k are labeled as positive, while the samples that belong to class t correspond to the negative label (note that f tk = − f kt ).The final label is selected by majority voting strategy: It is noteworthy that the one-against-one approach exhibits a much more robust behaviour when classifying imbalanced datasets; however, it is also more computationally expensive than the one-against-all for simpler problems, such as those with fewer classes.
Finally, the DDAGSVM also solves n c (n c − 1)/2 binary models during the training stage, implementing a rooted binary directed acyclic graph with also n c (n c − 1)/2 internal nodes (each one is a binary SVM) and n c leaves (i.e., the predicted class) during the testing stage.

Previous Works and Proposal Overview
Despite obtaining great accuracy results, the training and prediction steps of the SVM algorithm are very expensive from a computational point of view, especially for large and complex problems.Furthermore, it is noteworthy that kernel-based techniques can greatly simplify the computations, thus reducing the execution times.However, these methods need to use all the data to properly calculate the kernel, which generally requires large memory consumption [61].In particular, HSI data offers a great challenge to the SVM classifier, mainly due to (both) the large spectral dimensionality and volume of data to be processed.As noted in Section 1, there were some previous works in the literature focused on optimizing and parallelizing the execution of the SVM.However, there are few efforts in the recent literature oriented at the exploitation of GPU-based implementations for HSI data classification.From the current state-of-the-art in the literature, we can highlight two works.Tan et al. [48] combined the NVIDIA Compute Unified Device Architecture (CUDA) [50] and OpenMP [62] to optimize the classification problem, and implemented as a multi-class SVM following one-against-one strategy.In this sense, they were able to run the one-against-one strategy in parallel by implementing an OpenMP-based approach, while each single WSS-based binary SVM classifier was parallelized into GPUs.In addition, the calculation of every kernel element K(x i , x j ) was done by the GPU, being the constant values stored into the shared memory of the GPU device, following column-major order, and sent (when needed) through a broadcasting mechanism.Also, Wu et al. [63] developed a GPU-parallelized SVM method based on composite kernels (SVMCK) [64] and able to integrate spatial and spectral information, with the aim of improving the accuracy in HSI data classification tasks.In this context, they carried the calculations of the composite kernel matrix in the GPU.The resulting K(x i , x j ) were copied to the host (CPU), and then stored in the available memory of the host.The rest of the computations were carried out in the CPU.
In this context, we propose a new parallel version of the SVM algorithm for high dimensionality data.Our proposal can be efficiently generalized to be applied in other fields, being our parallel SVM adapted to the specific case of remote sensing HSI data classification.In particular, we propose a parallel implementation of a multi-class SVM, following one-against-one strategy, with the RBF kernel.It must be noted that each single binary SVM has been solved through the SMO method.Furthermore, the SMO solver (during the training stage) and the full inference stage were paralleled into a general purpose GPU (GPGPU) framework, employing the NVIDIA CUDA platform.Our goal is not only to achieve a good speed up ratio in comparison with other traditional SVM implementations, but also to develop a scalable implementation with regards to both the amount of data and the spectral dimensionality of the data.In this regard, the implemented approach performs the algorithmic parallelization of the SVM operations during the training and inference stages, implementing at the same time a series of optimizations in the memory read and write operations, with the aim of reducing the number of communications between the host and the device, and to reduce the memory latency.Also, it should be noted that a data-oriented parallelization strategy has been adopted, parallelizing the operations related to data management and reusing as much information as possible between different iterations.Regarding to this, a batch-based approach has been implemented within the SMO with the aim of solving multiple sub-problems in parallel.Particular attention has been paid to the calculation of the kernel matrix, as it consumes a massive amount of computational and storage resources.In the following sections we will provide details about our new implementation.

CUDA Platform
To obtain an accelerated version of the SVM for HSI data classification, a GPGPU-based implementation has been employed using NVIDIA CUDA platform [50].The main goal is to exploit the full computing power of GPU devices (using the general-purpose parallel computing platform and the programming model offered by CUDA) to solve many complex computational problems in a more efficient way.Specifically, we solve n c binary SVM instances (with the RBF kernel) that compose our multi-class SVM algorithm for remote sensing HSI data classification.In this context, we consider the GPU device as a set of stream multiprocessors, composed by several cores with a 3-level hierarchical memory system: (i) several fast registers that are accessible by each stream processor, (ii) a multiprocessor-level memory that is shared between all the cores that compose the multiprocessor, and (iii) a global memory shared between all the multiprocessors.
As we can observe in Figure 3, CUDA maintains the hierarchy of memories at the same time that provides a programming model which abstracts the system of cores and multiprocessors into a three-level logical representation of threads, blocks and grids, where the thread is the minimum processing unit (that can be grouped into 1D/2D/3D blocks of warps), and the blocks can be organized into a 1D/2D grid environment.In this sense, the programmer determines the logical representation by organizing the structure of the threads in the grid that best fits the data structure handled by the problem to be parallelized.Underneath, CUDA maps this representation to the actual hardware resources.This distinction between physical and logical representations allows CUDA-based algorithms to be ported to any NVIDIA GPU platform, improving their scalability and computing performance as new and higher-capacity GPUs are released to the market.In addition, it should be highlighted that each thread has access to three different memory spaces: its local memory, the memory shared by all the threads of the same block, and the global memory shared by all the threads of the grid.Therefore the correct handling of the threads and their accesses to memory are fundamental for the correct and optimal parallelization of the problem.Through the organization and synchronization of the threads, we can develop the parallel processing of the operations carried out by the SVM method.In particular, we can differentiate between those operations performed during the training stage of the SVM and those operations performed during the inference step.As pointed out before, we adopted the one-against-one approach for the multi-class SVM to perform remote sensing HSI data classification.Therefore n c (n c − 1)/2 binary models must be properly trained to extract the support vectors and derive the corresponding Lagrange multipliers for each classifier.In particular, the binary models were trained using a decomposition method to solve the convex optimization problem defined by Equation (7).It is noticeable that the main difficulty when solving Equation ( 7) is how to calculate the kernel matrix K ∈ R n t ×n t that stores the kernel values K(x i , x j ), ∀i ∈ [1, n t ] and ∀j ∈ [1, n t ], as: Usually, K is a dense matrix and may be too difficult to be efficiently stored and handled if n t is too large.To deal with this limitation, some decomposition methods were designed [46,47,[65][66][67][68][69][70][71][72][73][74] to break down the problem into several smaller and easier to handle sub-problems, where small subsets of Lagrange variables α are modified by employing some columns of K instead of the entire matrix.In particular, the SMO [46] algorithm has been considered in this work.
The SMO is a simple and iterative algorithm that allows for the quick and effective resolution of very large QP problems (such as those involved in the SVM calculations) by decomposing such overall QP problem into smaller QP sub-problems [45], which are solved analytically without the need for numerical optimization.It solves the optimization problem described by Equations ( 3) and (7), by solving the smallest possible optimization problem at every step until the optimal condition of the SVM classifier is reached.As the linear equality constraint n t ∑ i=1 y i α i = 0 involves the Lagrange multipliers α i , the smallest possible optimization problem will involve two such multipliers, in the sense that, if we change one α t by an amount in either direction, then the same change must be applied to another α l in the opposite direction.That is, α t and α l should be on the same line in order to maintain the constraint (see Figure 4): In this way, the SMO algorithm comprises two main components: (i) a heuristic for selecting the pair of Lagrange multipliers to be optimized, and (ii) an analytic method for solving those multipliers.In this sense, in each iteration the SMO algorithm heuristically chooses two Lagrange multipliers α t and α l at every step to jointly optimize, then it analytically obtains the new optimal values αt and αl , and finally it updates the SVM to reflect the new values.Focusing on the heuristic procedure, the SMO applies two heuristic searches, one for each Lagrange multiplier.The first multiplier α t is chosen by iterating over the entire training set, looking for those samples that violate the Karush-Kuhn-Tucker (KKT) conditions [75] that help to find an optimal separating hyperplane.In particular, the KKT conditions for Equation ( 7) are: where y i is the correct SVM output and (wx i + b) is the current output of the SVM for the i-th sample x i , ∀i ∈ [1, n t ].Any α i that satisfies the KKT conditions will be an optimal solution for the QP optimization problem defined by Equation (7).On the contrary, any α i that violates the KKT conditions will be eligible for optimization, so the SMO's goal is to iterate until all these conditions are satisfied within a tolerance threshold (in our case, this tolerance has been set to 0.001).Once the first α t has been chosen, the second Lagrange multiplier α l is selected in order to maximize the size of the step taken during joint optimization.To do this, the SMO method implements an optimality indicator vector where each E j is the optimality indicator of the j-th training sample, i.e., the classification errors on the j-th sample: Related to this, α t and α l can be selected by looking for those samples x t and x l that have the maximum and minimum optimality indicators that maximize |E t − E l |, so if E t is positive, the SMO will choose a sample with minimum E l , while if E t is negative, the SMO will select a sample with maximum E l .The desired indexes t and l can be directly obtained by computing Equation (13) [67]: where ), E t and E i are the optimality indicator of samples x t and x i , respectively, and X upper = X 1 ∪ X 2 ∪ X 3 and X lower = X 1 ∪ X 4 ∪ X 5 are two data subsets, where each X * is composed by the following training samples: Once both Lagrange multipliers have been obtained, the SMO method will compute their optimal values αt and αl with the aim of obtaining the optimal class-separating hyperplane.In particular, it begins by calculating αl , whose feasible values are framed into the constraint U ≤ αl ≤ V to meet the original C ≥ α l ≥ 0, being U and V two boundaries defined as: If Besides, the optimal αl value within the range [U, V] will be obtained as: where µ = K(x t , x t ) + K(x l , x l ) − 2K(x t , x l ) and E t and E l are the classification errors on the t-th and l-th training samples respectively.Once αl has been obtained, an optimal αt is easily calculated as: Once αt and αl have been obtained, the SMO updates the bias threshold b such that the KKT conditions are satisfied for the t-th and l-th samples.In this sense, three cases can occur, as we can observe in Equation ( 18 where b1 and b2 are defined as follows: Finally, the SMO updates the SVM.For each training sample x i , the SMO updates its E i using the following Equation ( 19): The full procedure is repeated until the optimal condition is reached, i.e., E t ≥ E max , where E max acts as a threshold, E max = max{E i |x i ∈ X lower }.

CUDA Optimization of SMO Algorithm
The SVM starts by dividing the HSI scene into training and inference subsets.We can consider the training set as a collection of instances and their associated labels, i.e., D train = {X, Y}.The training instances are represented by a 2D-matrix X ∈ N n t ×n b composed by n t training vectors, where each The parallel SMO solver begins by creating its working set B. Traditionally, B is of size two; however, our proposal implements a bigger working set to solve in parallel multiple subproblems of SMO in a batch.Moreover, the proposed implementation precomputes all the kernel values for the current working set, storing them as a data buffer on the device's global memory, in order to reduce the high latency memory accesses and also to avoid a large number of small read/write operations in the CPU memory.This implies that in each iteration of the SMO algorithm, the current working batch B will be updated, being n B Lagrange multipliers optimized.This allows us to compute (at once) n B rows of the kernel matrix K, performing a more efficient use of the GPU by reducing the number of accesses to the device's global memory (which is much slower than the shared memory) and enabling the re-use of kernel information (which in turn, reduces repeated kernel value computations).Taking this into account, our parallel SMO solver can be divided into three different steps.
During the step 1, the parallel SMO solver looks for the n B extreme training instances which can potentially improve the SVM the most according to Equation ( 9).This is parallelized by applying consecutive parallel reductions [76] over the training samples.For each working set, the optimality indicators of the training samples are sorted in ascending order, selecting the first n B /2 and the last n B /2 training samples to optimize the corresponding n B Lagrangre multipliers.During the reduction, one thread per sample takes the data from the global device memory to the block shared memory (in a coalesced way) to perform a fast execution.It must be noted that shared memory is around 7 times faster than global memory.Once synchronized, the threads operates over the data, where each one takes and compares the optimality indicators of the samples from the batch, choosing the smaller or the larger one depending on the desired Lagrange multiplier through the application of the corresponding heuristics given by Equation (13).
Once the n B Lagrange multipliers have been selected, the improvements of the Lagrangian multiplier pair α t and α l are sequentially obtained by one GPU thread per pair (step 2).It is worth noting from previous Equations ( 13), ( 16) and ( 19) that, during training stage, the same kernel values may be used in indifferent iterations.This implies that some kernel values can be stored into the GPU memory with the aim of reducing the high latency memory and avoiding a large number of small read/write operations from the CPU memory.In this sense, all the kernel values related to the n B Lagrange multipliers are obtained, computing n B rows of the kernel matrix K through parallel matrix multiplications [77,78]: In particular, the RBF-based K B matrix is computed in parallel by the GPU and stored into the device's global memory.Algorithm 1 provides the pseudo-code of the parallel kernel RBF function, which were implemented following the approximate RBF kernel form: In this context, the input parameters xx and x2x contain the ||x i || 2 , ∀i ∈ [1, . . ., n B ] and x i x j , ∀i, j ∈ [1, . . ., n B ] values respectively, which were previously computed through parallel matrix multiplications [78].Then, n B threads compute the desired n B rows of the kernel matrix K B : During the training stage, the kernel values are extracted from the global memory and gathered into the shared memory as they are needed, avoiding repeated computations.
Finally, the parallel SMO solver updates the optimality indicator vector E of the training instances by launching n t GPU threads that compute Equation ( 14) in parallel (step 3).
These training steps are sequentially repeated until the optimality condition is met or when the SVM classifier is not able to improve.

Parallel Classification during the Inference Stage
The proposed one-against-all SVM for HSI remote sensing data multi-class classification provides the corresponding label y i of a given test sample x i by applying Equation ( 8) during the inference stage.In this sense, Equation ( 8) is computed in parallel, making use of the kernel values in K to reduce the latency memory and avoid the repeated computations.Also, from Equation ( 6) we can observe that K(x i , x j ) is the same that K(x j , x i ), where i and j are related to the number of test samples and support vectors, respectively.In this sense, we can reduce the number or read/copy operations into the GPU memory following two strategies.On the one hand, if the number of test samples is larger than the number of support vectors, we read all the rows of the kernel matrix K that correspond to the support vectors by reading the indexes respected to j.On the other hand, if the number of support vectors is larger than the number of test samples, those rows that correspond to the test samples will be read by reading the indexes respected to i.

Experimental Environment
To evaluate the performance and the benefits of the proposed parallel SVM for HSI remote sensing data classification, several implementations of the proposed classifier were developed and tested over two different hardware platforms: It is equipped with 3584 CUDA cores.These processors were named CPU1 and GPU1.
CPU1 will serve as the baseline for the performance comparisons that were conducted during the experimentation due to its characteristics.Moreover, both environments use Ubuntu 18.04.3x64 as operating system.

Hyperspectral Datasets
Our experiments were conducted over six different and well-known HSI scenes: Indian Pines and Big Indian Pines, Pavia University, Salinas Valley and University of Houston.Figure 5 shows the ground truth and the number of pixels per class for each image.Below, the characteristics of each scene are provided.

1.
The first dataset is known as Indian Pines, which was collected by the Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) [79] over the Indian Pines test site in North-western Indiana, which is characterized by several agricultural crops and irregular forest and pasture areas.It has 145 × 145 pixels, each of which has 224 spectral reflectance bands covering the wavelengths from 400 nm to 2500 nm.We remove the bands 104-108, 150-163 and 220 (water absorption and null bands), and keep 200 bands in our experiments.This scene has 16 different ground-truth classes (see Figure 5).

2.
Big Indian Pines is a larger version of the first dataset, which has 2678 × 614 pixels with wavelengths ranging from 400 nm to 2500 nm.We also remove the water absorption and null bands, retaining 200 spectral bands in our experiments.This scene has 58 ground-truth classes (see Figure 5).

3.
The third dataset is the Pavia University scene, which was collected by the Reflective Optics Spectrographic Imaging System (ROSIS) [80] during a flight campaign over Pavia, nothern Italy.
In this sense, it is characterized by being an urban area, with areas of buildings, roads and parking lots.In particular, the Pavia University scene has 610 × 340 pixels, and its spatial resolution is 1.3 m.The original pavia dataset contains 115 bands in the spectral region of 0.43-0.86µm.
We remove the water absorption bands, and retain 103 bands in our experiments.The number of classes in this scene is 9 (see Figure 5).

4.
The fourth dataset is Pavia Centre and was also gathered by ROSIS sensor.It is composed by 1096 × 1096 pixels and 102 spectral bands.This scene also has 9 ground-truth classes from an urban area (see Figure 5).

5.
The fifth dataset is Houston University [81], which was acquired by the Compact Airborne Spectrographic Imager (CASI) sensor [82] over the Houston University campus in June 2012, collecting spectral information from an urban area.This scene has 114 bands and 349 × 1905 pixels with wavelengths ranging from 380nm to 1050nm.It comprises 15 ground-truth classes (see Figure 5).6.
Finally, the sixth dataset is Salinas Valley, which was also acquired by AVIRIS sensor over an agricultural area.It has 512 ×217 pixels and covers Salinas Valley in California.We remove the water absorption bands 108-112, 154-167 and 224, and keep 204 bands in our experiments.This scene contains 16 classes (see Figure 5).

Performance Evaluation
With the aim of evaluating the computational performance obtained by the proposed GPU implementation of SVMs for HSI data classification, and making a thorough analysis of the implemented classifiers also in terms of accuracy, several experiments were carried out: 1.
The first experiment focuses on the classification accuracy obtained by our GPU implementation as compared to a standard (CPU) implementation in LibSVM [83].In particular, proposed GPU-SVM was compared with its CPU counterpart, the random forest (RF) [40] and the multinomial logistic regression (MLR) [40].

2.
The second experiment focuses on the scalability and speedups achieved by the GPU implementation with regards to the CPU implementation, from a global perspective.As we pointed before, CPU1 will be considered to be the baseline due its characteristics that make it the slowest device.

3.
The third and last experiment focuses on some specific aspects of the GPU implementation, including data-transfer times.
In the following, we describe in detail the results obtained in each of the aforementioned experiments.6h, 7h and 8h].Although all classification maps have the typical "salt and pepper" noise of spectral classifiers, we can see that SVM classifiers achieve a higher quality map by classifying certain regions of the scenes more precisely, for instance the Soybeans-notill and Oats land-covers in Indian Pines scene or the the tree-line areas of Pavia University.Moreover, those classification maps obtained by CPU-based SVM and GPU-based SVM are quite similar.
The individual class accuracies are reported on Table 2. Also the overall (OA), average (AA) accuracies and kappa coefficient are shown for each HSI dataset.We can observe that the CPU and GPU implementations of the SVM classifier reach the best OA and AA percentages in all datasets, exhibiting also the best kappa value.In particular, as shown in all cases, the OA obtained by our GPU implementation of SVM is very similar to that obtained by the corresponding CPU implementation, and superior to that obtained by other well-known classifiers such as RF and MLR.

Experiment 2: Scalability and Speedup
In this experiment, we used randomly selected training/test sets to evaluate the scalability of our GPU implementation as the amount of training becomes larger.In particular, 1%, 3%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 60% and 80% of training data were considered for the University of Pavia, Pavia Center, Salinas Valley and Big Indian Pines datasets.
Figure 9 reports the processing times (and speedups) measured in the considered CPUs (CPU0 in the first hardware platform and CPU1 in the second platform) and GPUs (GPU0 in the platform 1 and GPU1 in the platform 2) as the percentage of training increases.As we can observe in Figure 9, for all the considered images in this experiment (University of Pavia, Pavia Center, Salinas Valley and Big Indian Pines), the processing times increase as the percentage of training samples increases.This is expected, as the complexity of the classification problem becomes larger.However, we can also observe that the speedups obtained become more significant with the training size, particularly for the larger images.For instance, with University of Pavia (about 34,220 pixels when training with 80% of data) the implementation is able to reach a speedup of x6 with the fastest GPU, while with Big Indian Pines (around 1,315,433 samples when considering 80% of training data) is x140 faster than the baseline.This indicates that our newly developed GPU implementation of SVM scales with complexity and problem size, which is a highly desirable feature in parallel implementations.In this experiment, we fix the training percentage to 40% and study the memory transfers required by the proposed GPU implementation for different HSI datasets over GPU0 and GPU1 devices.Here, we specifically focus on the two largest datasets (Pavia Center and Big Indian Pines) and report the memory times consumed by the considered GPU platforms.Table 3 provides a breakdown of the obtained processing times for different images and training percentages, indicating the transfer times from host to device (HtoD), device to host (DtoH), device to device (DtoD) and the time taken by the kernels.The times reported on Table 3 indicate that most of the time spent by our parallel implementation is consumed by kernels and not by data transfers, despite these transfers become more significant as the training percentage becomes larger (this does not affect scalability, as reported in Figure 9.We can graphically observe these results in Figure 10, where the amount of GPU video memory required by our implementation is very small in all cases (below 5.9%), which indicates that our implementation is efficient in terms of memory usage, even with large percentages of training.

Conclusions and Future Lines
This work presented a new GPU implementation of the well-known SVM technique in the context of HSI remote sensing data classification, with the goal of reducing computation times by massively parallelizing operations across GPU threads, and reducing memory latencies by efficiently handling data read/write operations.Three experiments were conducted over six widely used and real HSI datasets, providing very heterogeneous information in terms of content (different land-cover classes extracted from agricultural and urban environments) and amount of data (different spatial sizes with also different spatial resolutions).These experiments show the effectiveness of our accelerated SVM implementation, which not only scales with data volume, but also with training percentage.This reveals that the acceleration is more effective when the classification problem is more complex.Moreover, the study of data transfer times and algorithmic computation times shows an efficient use of memory, where the kernel computations dominate the parallel processing times.In addition, the proposed GPU implementation achieves good performance in comparison with other popular SVM implementations, achieving similar accuracy results but with faster performance.
As with any new approach, there are some unresolved issues that may present challenges over time.In the future, we will improve our SVM implementation in order to reduce the accesses to global device memory and make a more efficient use of CUDA streams.In addition, we will use our accelerated SVM in conjunction with other techniques for HSI processing and classification (e.g., supervised and semi-supervised techniques) with the aim of improving even more the obtained classification results.

Figure 1 .
Figure 1.General SVM scheme.The final goal is to approximate parameters w and b with the aim of defining the optimal hyperplane that maximizes the margin between the two classes.

Figure 2 .
Figure 2. In order to deal with non-linearly separable data, the SVM applies a transformation Φ to translate the data from the original non-linear representation in space N n b , to a linear representation in space H, i.e., Φ : N n b → H.

Figure 3 .
Figure 3. Programming and memory model provided by NVIDIA CUDA for a GPU device.

Figure 4 .
Figure 4.The inequality constraint C ≥ α i ≥ 0 forces the Lagrange multipliers to be into a box while the linear equality constraint 2 , . . ., x i,n b ] comprises n b spectral bands, while the training labels are stored into the matrix Y ∈ N n t ×n c , being y i = [y i,1 , y i,2 , . . ., y i,n c ] the corresponding label of sample x i in one-hot encoding, where n c indicates the number of different land cover classes.The training stage starts by creating the different single binary SVMs in a sequential way, confronting each class i to each different class j, ∀i, j ∈ [1, n c ].Each binary SVM was optimized by employing a parallel SMO solver.

Figure 5 .
Figure 5. Number of available samples in the considered HSI datasets.

4. 3 . 1 .
Experiment 1: Accuracy Performance In this experiment, we use the fixed training and test data for the AVIRIS Indian Pines image [displayed in Figure 6c,d], the ROSIS Pavia University image [displayed in Figure 7c,d] and the University of Houston image [Figure8c,d].These fixed training and test sets are publicly available online from the IEEE Geoscience and Remote Sensing Society (GRSS) data and algorithm evaluation website (DASE) at http://dase.grss-ieee.org.The main rationale for using these fixed training sets is that all algorithms can be compared in a fair way, regardless of random splits of training and test sets.The obtained classification results can be graphically observed in Figures6 and 8.In particular, Figures6a, 7aand 8a respectively show false color compositions of the aforementioned three images, while Figures 6b, 7b and 8b show the entire ground-truth data for each image.The results obtained by two well-known classifiers: RF [see Figures 6e, 7e and 8e] and the multinomial logistic regression (MLR) [see Figures 6f, 7f and 8f] are reported, together with the classification maps obtained by the CPU (LibSVM) implementation of the SVM [see Figures 6g, 7g and 8g] and our GPU implementation of SVM [see Figures

Figure 6 .
Figure 6.Classification maps for the Indian Pines dataset with the fixed training test sets in http://dase.grss-ieee.org.(a) False color composition.(b) Ground-truth (available labeled samples).(c) Fixed training set.(d) Fixed test set.(e) Classification map obtained by the RF classifier.(f) Classification map obtained by the MLR classifier.(g) Classification map obtained by the SVM classifier (LibSVM implementation).(h) Classification map obtained by the proposed GPU implementation.The corresponding overall classification accuracies are shown in brackets.

Figure 7 .Figure 8 .
Figure 7. Classification maps for the Pavia University dataset with the fixed training test sets in http://dase.grss-ieee.org.(a) False color composition.(b) Ground-truth (available labeled samples).(c) Fixed training set.(d) Fixed test set.(e) Classification map obtained by the RF classifier.(f) Classification map obtained by the MLR classifier.(g) Classification map obtained by the SVM classifier (LibSVM implementation).(h) Classification map obtained by the proposed GPU implementation.The corresponding overall classification accuracies are shown in brackets.

Figure 9 .
Figure 9. Processing times (and speedups) measured in the considered CPUs (CPU0 in the platform 1 and CPU1 in the platform 2) and GPUs (GPU0 in the platform 1 and GPU1 in the platform 2) as the percentage of training samples increases.4.3.3.Experiment 3: GPU Transfer-Memory and Kernel Runtimes

Figure 10 .
Figure 10.Memory-transfer times required by the proposed GPU implementation for the Pavia Center and Big Indian Pines datasets.In all cases, the training percentage was fixed to 40% of the available samples.

1
Parallel Kernel RBF for HSI classification Require: xx matrix of ||x i || 2 values, x2x matrix of x i x j values, K B resulting kernel matrix, n B number of rows, n t number of training samples.

1 .
Platform 1: it is composed by an Intel Core Coffee Lake Refresh i7-9750H processor, 32 GB of DDR4 RAM with 2667 MHz, and an NVIDIA GeForce RTX 2070 with 8 GB of RAM, graphic clock at 2100 MHz and 14,000 MHZ of memory transfer rate.It is equipped with 2304 CUDA cores.Intel i9-9940X processor, 128 GB of DDR4 RAM with 2100 MHz, and an NVIDIA GTX 1080Ti with 11 GB of RAM, 2037 MHz of graphic clock and 11,232 MHz of memory transfer rate.

UNIVERSITY OF HOUSTON (UH)
BIG INDIAN PINES SCENE (BIP)

Table 2 .
Classification results obtained by different techniques for the Indian Pines, University of Pavia and University of Houston scenes, using the fixed training and test sets available for these datasets in http://dase.grss-ieee.org.

Table 3 .
Processing times (ms) obtained by our GPU implementation of the SVM algorithm on different platforms, using the datasets: Pavia Center and Big Indian Pines.The times indicate the transfers from host to device (HtoD), device to host (DtoH), device to device (DtoD) and the total processing times consumed by the kernels.