3.3.1. Previous Concepts about the SMO Algorithm
As pointed out before, we adopted the one-against-one approach for the multiclass SVM to perform remote sensing HSI data classification. Therefore,
${n}_{c}({n}_{c}-1)/2$ binary models must be properly trained to extract the support vectors and derive the corresponding Lagrange multipliers for each classifier. In particular, the binary models were trained using a decomposition method to solve the convex optimization problem defined by Equation (
7). The main difficulty when solving Equation (
7) is how to calculate the kernel matrix
$\mathbf{K}\in {\mathbb{R}}^{{n}_{t}\times {n}_{t}}$ that stores the kernel values
$K({\mathbf{x}}_{i},{\mathbf{x}}_{j})$,
$\forall i\in [1,{n}_{t}]$ and
$\forall j\in [1,{n}_{t}]$, as:
Usually,
$\mathbf{K}$ is a dense matrix that may be too costly to store and handle efficiently if
${n}_{t}$ is large. To deal with this limitation, several decomposition methods have been designed [
46,
47,
65,
66,
67,
68,
69,
70,
71,
72,
73,
74] that break the problem down into several smaller, easier-to-handle subproblems, in which small subsets of the Lagrange variables
$\alpha $ are modified by employing some columns of
$\mathbf{K}$ instead of the entire matrix. In particular, the SMO [
46] algorithm has been considered in this work.
The SMO is a simple, iterative algorithm that quickly and effectively solves very large QP problems (such as those involved in the SVM calculations) by decomposing the overall QP problem into smaller QP subproblems [
45], which are solved analytically without the need for numerical optimization. It solves the optimization problem described by Equations (
3) and (
7), by solving the smallest possible optimization problem at every step until the optimality condition of the SVM classifier is reached. As the linear equality constraint
$\sum _{i=1}^{{n}_{t}}{y}_{i}{\alpha}_{i}=0$ involves the Lagrange multipliers
${\alpha}_{i}$, the smallest possible optimization problem will involve two such multipliers, in the sense that, if we change one
${\alpha}_{t}$ by an amount in either direction, then the same change must be applied to another
${\alpha}_{l}$ in the opposite direction. That is,
${\alpha}_{t}$ and
${\alpha}_{l}$ must move along the same line in order to maintain the constraint (see
Figure 4):
In this way, the SMO algorithm comprises two main components: (i) a heuristic for selecting the pair of Lagrange multipliers to be optimized, and (ii) an analytic method for solving those multipliers. In this sense, in each iteration the SMO algorithm heuristically chooses two Lagrange multipliers ${\alpha}_{t}$ and ${\alpha}_{l}$ at every step to jointly optimize, then it analytically obtains the new optimal values ${\widehat{\alpha}}_{t}$ and ${\widehat{\alpha}}_{l}$, and finally it updates the SVM to reflect the new values.
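The opposite-direction coupling of the two multipliers can be sketched in plain Python (a minimal illustration with hypothetical values, not the paper's implementation): changing one multiplier forces a compensating change in the other so that the linear equality constraint is preserved.

```python
def update_pair(alpha, y, t, l, delta):
    """Change alpha[t] by delta and compensate alpha[l] so that
    sum_i y[i] * alpha[i] stays constant (the SMO linear constraint)."""
    alpha = list(alpha)
    alpha[t] += delta
    # y[t] * delta must be cancelled by y[l] * (change in alpha[l])
    alpha[l] -= y[t] * y[l] * delta
    return alpha

# hypothetical multipliers and labels
alpha = [0.5, 0.2, 0.7]
y = [1, -1, 1]
before = sum(yi * ai for yi, ai in zip(y, alpha))
after_alpha = update_pair(alpha, y, 0, 1, 0.1)
after = sum(yi * ai for yi, ai in zip(y, after_alpha))
assert abs(before - after) < 1e-12   # constraint preserved
```

Note that when $y_t = y_l$ the two multipliers move in strictly opposite directions, while for $y_t \neq y_l$ they move by the same signed amount; both cases keep the constraint sum unchanged.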
Focusing on the heuristic procedure, the SMO applies two heuristic searches, one for each Lagrange multiplier. The first multiplier
${\alpha}_{t}$ is chosen by iterating over the entire training set, looking for those samples that violate the Karush–Kuhn–Tucker (KKT) conditions [
75] that help to find an optimal separating hyperplane. In particular, the KKT conditions for Equation (
7) are:
where
${y}_{i}$ is the correct SVM output and
$(\mathbf{w}\cdot {\mathbf{x}}_{i}+b)$ is the current output of the SVM for the
ith sample
${\mathbf{x}}_{i}$,
$\forall i\in [1,{n}_{t}]$. Any
${\alpha}_{i}$ that satisfies the KKT conditions will be an optimal solution for the QP optimization problem defined by Equation (
7). On the contrary, any
${\alpha}_{i}$ that violates the KKT conditions will be eligible for optimization, so the SMO’s goal is to iterate until all these conditions are satisfied within a tolerance threshold (in our case, this tolerance has been set to 0.001). Once the first
${\alpha}_{t}$ has been chosen, the second Lagrange multiplier
${\alpha}_{l}$ is selected in order to maximize the size of the step taken during the joint optimization. To do this, the SMO method maintains an
optimality indicator vector $\mathbf{E}=[{E}_{1},{E}_{2},\cdots ,{E}_{{n}_{t}}]$, where each
${E}_{j}$ is the optimality indicator of the
jth training sample, i.e., the classification error on the
jth sample:
Related to this,
${\alpha}_{t}$ and
${\alpha}_{l}$ can be selected by looking for those samples
${\mathbf{x}}_{t}$ and
${\mathbf{x}}_{l}$ that have the maximum and minimum optimality indicators that maximize
${E}_{t}-{E}_{l}$, so if
${E}_{t}$ is positive, the SMO will choose a sample with minimum
${E}_{l}$, while if
${E}_{t}$ is negative, the SMO will select a sample with maximum
${E}_{l}$. The desired indexes
t and
l can be directly obtained by computing Equation (
13) [
67]:
where
${\mu}_{i}=K({\mathbf{x}}_{t},{\mathbf{x}}_{t})+K({\mathbf{x}}_{i},{\mathbf{x}}_{i})-2K({\mathbf{x}}_{t},{\mathbf{x}}_{i})$,
${E}_{t}$ and
${E}_{i}$ are the optimality indicators of samples
${\mathbf{x}}_{t}$ and
${\mathbf{x}}_{i}$, respectively, and
${\mathcal{X}}_{upper}={\mathcal{X}}_{1}\cup {\mathcal{X}}_{2}\cup {\mathcal{X}}_{3}$ and
${\mathcal{X}}_{lower}={\mathcal{X}}_{1}\cup {\mathcal{X}}_{4}\cup {\mathcal{X}}_{5}$ are two data subsets, where each
${\mathcal{X}}_{\ast}$ is composed of the following training samples:
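The pair selection described above can be sketched in plain Python, assuming the indicator vector $\mathbf{E}$ and the index sets of ${\mathcal{X}}_{upper}$ and ${\mathcal{X}}_{lower}$ are already available (a simplified version of the heuristic, not the exact Equation (13)):

```python
def select_pair(E, upper_idx, lower_idx):
    """Pick the multiplier pair giving the largest step E_t - E_l:
    t from the 'upper' subset with maximum indicator, l from the
    'lower' subset with minimum indicator (sketch of the heuristic,
    not the exact Equation (13))."""
    t = max(upper_idx, key=lambda i: E[i])
    l = min(lower_idx, key=lambda i: E[i])
    return t, l

E = [0.8, -0.3, 0.1, -0.9]          # hypothetical optimality indicators
t, l = select_pair(E, upper_idx=[0, 2], lower_idx=[1, 3])
assert (t, l) == (0, 3)             # maximizes E_t - E_l = 1.7
```

The full selection rule additionally weights the gap by the kernel-dependent term ${\mu}_{i}$; this sketch only captures the max/min structure of the search.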
Once both Lagrange multipliers have been obtained, the SMO method computes their optimal values ${\widehat{\alpha}}_{t}$ and ${\widehat{\alpha}}_{l}$ with the aim of obtaining the optimal class-separating hyperplane. In particular, it begins by calculating ${\widehat{\alpha}}_{l}$, whose feasible values are framed by the constraint $U\le {\widehat{\alpha}}_{l}\le V$ so as to meet the original constraint $0\le {\alpha}_{l}\le C$, where U and V are two bounds defined as:
Then, the optimal
${\widehat{\alpha}}_{l}$ value within the range
$[U,V]$ will be obtained as:
where
$\mu =K({\mathbf{x}}_{t},{\mathbf{x}}_{t})+K({\mathbf{x}}_{l},{\mathbf{x}}_{l})-2K({\mathbf{x}}_{t},{\mathbf{x}}_{l})$ and
${E}_{t}$ and
${E}_{l}$ are the classification errors on the
tth and
lth training samples, respectively. Once
${\widehat{\alpha}}_{l}$ has been obtained, an optimal
${\widehat{\alpha}}_{t}$ is easily calculated as:
Once
${\widehat{\alpha}}_{t}$ and
${\widehat{\alpha}}_{l}$ have been obtained, the SMO updates the bias threshold
b such that the KKT conditions are satisfied for the
tth and
lth samples. In this sense, three cases can occur, as we can observe in Equation (
18):
where
${\widehat{b}}_{1}$ and
${\widehat{b}}_{2}$ are defined as follows:
Finally, the SMO updates the SVM. For each training sample
${\mathbf{x}}_{i}$, the SMO updates its
${E}_{i}$ using the following Equation (
19):
The full procedure is repeated until the optimality condition is reached, i.e., ${E}_{t}\ge {E}_{max}$, where ${E}_{max}$ acts as a threshold, defined as ${E}_{max}=\max \left\{{E}_{i}\mid {\mathbf{x}}_{i}\in {\mathcal{X}}_{lower}\right\}$.
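The analytic update described throughout this subsection (feasible bounds, clipped ${\widehat{\alpha}}_{l}$, compensating ${\widehat{\alpha}}_{t}$, and bias update) can be sketched end-to-end in plain Python. This follows Platt's textbook formulation of the two-variable SMO step; the exact signs and notation of the paper's Equations (14)–(19) may differ slightly.

```python
def smo_step(K, y, alpha, b, C, t, l):
    """One analytic SMO update of the pair (alpha[t], alpha[l]), following
    Platt's textbook formulation (a sketch; the paper's equations may use
    slightly different signs/notation)."""
    n = len(y)
    # optimality indicator: current SVM output minus target
    def f(i):
        return sum(alpha[j] * y[j] * K[j][i] for j in range(n)) + b
    E_t, E_l = f(t) - y[t], f(l) - y[l]
    # feasible range [U, V] for the new alpha[l], so 0 <= alpha <= C holds
    if y[t] != y[l]:
        U, V = max(0.0, alpha[l] - alpha[t]), min(C, C + alpha[l] - alpha[t])
    else:
        U, V = max(0.0, alpha[t] + alpha[l] - C), min(C, alpha[t] + alpha[l])
    mu = K[t][t] + K[l][l] - 2.0 * K[t][l]
    if mu <= 0 or U == V:
        return alpha, b                      # no progress possible on this pair
    new_l = min(V, max(U, alpha[l] + y[l] * (E_t - E_l) / mu))  # clipped
    new_t = alpha[t] + y[t] * y[l] * (alpha[l] - new_l)         # compensate
    # bias update so that the t-th sample satisfies its KKT condition
    b = b - E_t - y[t] * (new_t - alpha[t]) * K[t][t] \
              - y[l] * (new_l - alpha[l]) * K[t][l]
    alpha = list(alpha)
    alpha[t], alpha[l] = new_t, new_l
    return alpha, b

# hypothetical 2-sample toy problem with a linear kernel K[i][j] = x_i * x_j
X = [1.0, -1.0]
y = [1, -1]
K = [[xi * xj for xj in X] for xi in X]
alpha, b = smo_step(K, y, [0.0, 0.0], 0.0, C=1.0, t=0, l=1)
assert abs(alpha[0] - 0.5) < 1e-12 and abs(alpha[1] - 0.5) < 1e-12
```

On this toy problem a single step already yields the optimal separator $f(x)=x$ with zero bias, since the smallest possible subproblem is solved exactly rather than approximately.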
3.3.2. CUDA Optimization of SMO Algorithm
The SVM starts by dividing the HSI scene into training and inference subsets. We can consider the training set as a collection of instances and their associated labels, i.e., ${\mathcal{D}}_{train}=\{\mathbf{X},\mathbf{Y}\}$. The training instances are represented by a 2-D matrix $\mathbf{X}\in {\mathbb{N}}^{{n}_{t}\times {n}_{b}}$ composed of ${n}_{t}$ training vectors, where each ${\mathbf{x}}_{i}\in {\mathbb{N}}^{{n}_{b}}=[{x}_{i,1},{x}_{i,2},\cdots ,{x}_{i,{n}_{b}}]$ comprises ${n}_{b}$ spectral bands, while the training labels are stored in the matrix $\mathbf{Y}\in {\mathbb{N}}^{{n}_{t}\times {n}_{c}}$, where ${\mathbf{y}}_{i}=[{y}_{i,1},{y}_{i,2},\cdots ,{y}_{i,{n}_{c}}]$ is the corresponding label of sample ${\mathbf{x}}_{i}$ in one-hot encoding and ${n}_{c}$ indicates the number of different land cover classes. The training stage starts by creating the different binary SVMs in a sequential way, confronting each class i with each other class j, $\forall i,j\in [1,{n}_{c}]$. Each binary SVM is optimized by employing a parallel SMO solver.
The parallel SMO solver begins by creating its working set B. Traditionally, B is of size two; however, our proposal implements a larger working set in order to solve multiple SMO subproblems in parallel as a batch. Moreover, the proposed implementation precomputes all the kernel values for the current working set, storing them as a data buffer in the device’s global memory, in order to reduce high-latency memory accesses and also to avoid a large number of small read/write operations in the CPU memory. This implies that, in each iteration of the SMO algorithm, the current working batch B is updated, with ${n}_{B}$ Lagrange multipliers being optimized. This allows us to compute (at once) ${n}_{B}$ rows of the kernel matrix $\mathbf{K}$, making more efficient use of the GPU by reducing the number of accesses to the device’s global memory (which is much slower than the shared memory) and enabling the reuse of kernel information (which, in turn, reduces repeated kernel value computations). Taking this into account, our parallel SMO solver can be divided into three different steps.
In
step 1, the parallel SMO solver looks for the
${n}_{B}$ extreme training instances which can potentially improve the SVM the most according to Equation (
9). This is parallelized by applying consecutive parallel reductions [
76] over the training samples. For each working set, the optimality indicators of the training samples are sorted in ascending order, selecting the first
${n}_{B}/2$ and the last
${n}_{B}/2$ training samples to optimize the corresponding
${n}_{B}$ Lagrange multipliers. During the reduction, one thread per sample loads the data from the global device memory into the block shared memory (in a coalesced way) to enable fast execution. It must be noted that shared memory is around 7 times faster than global memory. Once synchronized, the threads operate over the data, where each one takes and compares the optimality indicators of the samples from the batch, choosing the smaller or the larger one depending on the desired Lagrange multiplier, through the application of the corresponding heuristics given by Equation (
13).
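The block-level reduction can be illustrated sequentially in plain Python: at every step, half of the "threads" compare pairs of candidates and keep the better index, halving the problem size until one winner remains (a CPU sketch of the reduction pattern, not the actual CUDA code).

```python
def argmax_reduction(E):
    """Tree-style reduction over the optimality indicators, mirroring the
    GPU block reduction: at each step, the first half of the remaining
    'threads' compare two candidates and keep the index of the larger one."""
    idx = list(range(len(E)))
    stride = len(idx)
    while stride > 1:
        half = (stride + 1) // 2
        for i in range(stride - half):   # each 'thread' compares one pair
            j = i + half
            if E[idx[j]] > E[idx[i]]:
                idx[i] = idx[j]
        stride = half
    return idx[0]

E = [0.2, -0.5, 0.9, 0.1, 0.7]           # hypothetical optimality indicators
assert argmax_reduction(E) == 2
```

On the GPU, the pairwise comparisons within one step run concurrently over the shared-memory copy of the indicators, which is what makes the reduction logarithmic in the number of samples per block.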
Once the
${n}_{B}$ Lagrange multipliers have been selected, the improvements of the Lagrangian multiplier pair
${\alpha}_{t}$ and
${\alpha}_{l}$ are sequentially obtained by one GPU thread per pair (
step 2). It is worth noting from the previous Equations (
13), (
16) and (
19) that, during the training stage, the same kernel values may be used in different iterations. This implies that some kernel values can be stored in the GPU memory with the aim of reducing high-latency memory accesses and avoiding a large number of small read/write operations from the CPU memory. In this sense, all the kernel values related to the
${n}_{B}$ Lagrange multipliers are obtained, computing
${n}_{B}$ rows of the kernel matrix
$\mathbf{K}$ through parallel matrix multiplications [
77,
78]:
In particular, the RBF-based
${\mathbf{K}}_{B}$ matrix is computed in parallel by the GPU and stored in the device’s global memory. Algorithm 1 provides the pseudocode of the parallel RBF kernel function, which was implemented following the expanded RBF kernel form:
In this context, the input parameters
$xx$ and
$x2x$ contain the
$\|{\mathbf{x}}_{i}\|^{2}$,
$\forall i\in [1,\cdots ,{n}_{B}]$ and
${\mathbf{x}}_{i}\cdot {\mathbf{x}}_{j}$,
$\forall i,j\in [1,\cdots ,{n}_{B}]$ values respectively, which were previously computed through parallel matrix multiplications [
78]. Then,
${n}_{B}$ threads compute the desired
${n}_{B}$ rows of the kernel matrix
${\mathbf{K}}_{B}$:
Algorithm 1 Parallel Kernel RBF for HSI classification
Require: $xx$ matrix of $\|{\mathbf{x}}_{i}\|^{2}$ values,
$x2x$ matrix of ${\mathbf{x}}_{i}\cdot {\mathbf{x}}_{j}$ values,
${K}_{B}$ resulting kernel matrix,
${n}_{B}$ number of rows,
${n}_{t}$ number of training samples.

$i=blockIdx.x\ast blockDim.x+threadIdx.x$
if $i<{n}_{B}$ then
$j=0$
while $j<{n}_{t}$ do
${K}_{B}[i,j]=exp(-\gamma (xx\left[i\right]+xx\left[j\right]-2\ast x2x[i,j]))$
$j++$
end while
end if
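As a CPU reference for Algorithm 1, the same ${n}_{B}$ kernel rows can be computed with NumPy (a sketch; note that it assumes squared norms are available for all ${n}_{t}$ training samples, since the inner loop of Algorithm 1 indexes $xx[j]$ up to ${n}_{t}$):

```python
import numpy as np

def rbf_rows(X_B, X, gamma):
    """Reference (CPU/NumPy) version of Algorithm 1: compute the n_B rows
    of the RBF kernel matrix K_B for the working-set samples X_B against
    all n_t training samples X, using
    K(x_i, x_j) = exp(-gamma * (||x_i||^2 + ||x_j||^2 - 2 x_i . x_j))."""
    xx_B = np.sum(X_B ** 2, axis=1)[:, None]   # ||x_i||^2, working-set rows
    xx_t = np.sum(X ** 2, axis=1)[None, :]     # ||x_j||^2, all training samples
    x2x = X_B @ X.T                            # dot products x_i . x_j
    return np.exp(-gamma * (xx_B + xx_t - 2.0 * x2x))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                # hypothetical training set
K_B = rbf_rows(X[:2], X, gamma=0.5)            # two working-set rows
assert K_B.shape == (2, 6)
assert np.allclose(np.diag(K_B[:, :2]), 1.0)   # K(x_i, x_i) = 1 for RBF
```

The `X_B @ X.T` product is the sequential analogue of the parallel matrix multiplications [77,78] used to obtain $x2x$ before launching the kernel of Algorithm 1.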
During the training stage, the kernel values are extracted from the global memory and gathered into the shared memory as they are needed, avoiding repeated computations.
Finally, the parallel SMO solver updates the optimality indicator vector
$\mathbf{E}$ of the training instances by launching
${n}_{t}$ GPU threads that compute Equation (
14) in parallel (
step 3).
These training steps are repeated sequentially until the optimality condition is met or the SVM classifier is no longer able to improve.