Accelerating a Geometrical Approximated PCA Algorithm Using AVX2 and CUDA

Abstract: Remote sensing data has seen explosive growth in the past decade. This has led to the need for efficient dimensionality reduction techniques, mathematical procedures that transform high-dimensional data into a meaningful, reduced representation. Projection Pursuit (PP) based algorithms were shown to be efficient solutions for performing dimensionality reduction on large datasets by searching for low-dimensional projections of the data in which meaningful structures are exposed. However, PP faces computational difficulties in dealing with very large datasets, which are common in hyperspectral imaging, thus raising the challenge of implementing such algorithms using the latest High Performance Computing approaches. In this paper, a PP-based geometrical approximated Principal Component Analysis algorithm (gaPCA) for hyperspectral image analysis is implemented and assessed on multi-core Central Processing Units (CPUs), Graphics Processing Units (GPUs) and multi-core CPUs using Single Instruction, Multiple Data (SIMD) AVX2 (Advanced Vector eXtensions) intrinsics, which provide significant improvements in performance and energy usage over the single-core implementation. Thus, this paper presents a cross-platform and cross-language perspective, comprising several implementations of the gaPCA algorithm in Matlab, Python and C++, together with GPU implementations based on the NVIDIA Compute Unified Device Architecture (CUDA). The proposed solutions are evaluated with respect to execution time and energy consumption. The experimental evaluation has shown not only the advantage of using CUDA programming in implementing the gaPCA algorithm on a GPU in terms of performance and energy consumption, but also significant benefits in implementing it on the multi-core CPU using AVX2 intrinsics.


Introduction
Today, enormous amounts of data are being generated on a daily basis from social networks, sensors and web sites [1]. In particular, one type of multidimensional data which is undergoing explosive growth is remote sensing data.
Remote sensing is generally defined as the field or practice of gathering information from a distance about an object (usually the Earth's surface, by measuring the electromagnetic radiation) [2]. The recent technological advances in both sensors and computer technology have increased exponentially the volume of remote sensing data repositories. For example, the data collection rate of the NASA Jet Propulsion Laboratory's Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) is 4.5 MB/s, which means that it can acquire nearly 16 GB of data in one hour [3].
The availability of new space missions providing large amounts of data on a daily basis has raised important challenges for better processing techniques and the development of computationally efficient techniques for transforming the massive amount of remote sensing data into scientific understanding. However, because much of the data is highly redundant, it can be efficiently brought down to a much smaller number of variables without a significant loss of information. This can be achieved using the so-called Dimensionality Reduction (DR) techniques [4]. These are mathematical procedures that transform the high-dimensional data into a meaningful representation of reduced dimensionality [5]. DR techniques are also referred to as projection methods, and are widely used exploratory tools for applications in remote sensing due to a set of benefits that they bring to remote sensing data analysis [6]:
• Plotting and visualizing data and potential structures in the data in lower dimensions;
• Applying stochastic models;
• Solving the "curse of dimensionality";
• Facilitating the prediction and classification of new data sets (i.e., query data sets with unknown class labels).
Of the several existing DR techniques, Projection Pursuit (PP) [7] and its particular form of Principal Component Analysis (PCA) [8] are credited as two of the most popular dimensionality reduction techniques, with numerous applications in remote sensing. PP is a Projection Index (PI)-based dimensionality reduction technique that uses a PI as a criterion to find directions of interestingness in the data to be processed and then represents the data in the space specified by these new interesting directions [9]. Principal Component Analysis [8] can be considered a special case of PP, in the sense that PCA uses data variance as a PI to produce eigenvectors.
Such DR techniques (like PP or PCA) have the disadvantage of being extremely computationally demanding; consequently, their implementation on high-performance computer architectures becomes critical, especially for applications under strict latency constraints. In this regard, the emergence of high performance computing devices such as multi-core processors, Field Programmable Gate Arrays (FPGAs) or Graphics Processing Units (GPUs) [10] exhibits the potential to bridge the gap towards onboard and real-time analysis of remote sensing data.
PCA is a computationally expensive algorithm, with two particularly intensive steps: computing the covariance matrix, and computing the eigenvalue decomposition of the covariance matrix. Owing to its nature, most PCA algorithms work in multiple synchronous phases, and intermediate data are exchanged at the end of each phase, introducing delays and increasing the total execution time of the algorithm. This is one of the reasons that has motivated the appearance in the recent literature of various implementations of PCA algorithms on single-core systems, multi-core systems, GPUs and even FPGAs.
In this context, this paper describes the parallel implementation of a PP-based geometrical approximated Principal Component Analysis algorithm, namely the gaPCA method. For validation, we applied gaPCA to a satellite hyperspectral image for dimensionality reduction. This method has been implemented in C++, Matlab and Python for our multi-core CPU tests, and in PyCUDA and CUDA targeting an NVIDIA GPU. A comparative time analysis has been performed and the results show that our parallel implementations provide consistent speedups.
The main contributions of this work are as follows:
• We introduce four implementations of the gaPCA algorithm: three targeting multi-core CPUs developed in Matlab, Python and C++, and a GPU-accelerated CUDA implementation;
• A comparative assessment of the execution times of the Matlab, Python and PyCUDA multi-core implementations. Our experiments showed that our multi-core PyCUDA implementation is up to 18.84× faster than its Matlab equivalent;
• A comparative assessment of the execution times of the C++ single-core, C++ multi-core, C++ single-core Advanced Vector eXtensions (AVX2) and C++ multi-core AVX2 implementations. The multi-core C++ AVX2 implementation proved to be up to 27.04× faster than the C++ single-core one;
• Evaluation of the GPU-accelerated CUDA implementation compared to the other implementations. Our experiments show that our CUDA Linux GPU implementation is the fastest, with speedups of up to 29.44× compared to the C++ single-core baseline;
• Energy consumption analysis.
The rest of the paper is organized as follows: Section 2 highlights related work in the field of projection pursuit algorithms; Section 3 describes the experimental setup (hardware, software and the data sets used for testing); Section 4 introduces the gaPCA algorithm with its applications in land classification; Section 5 describes the parallel implementations of the gaPCA algorithm using multi-core CPUs, CUDA programming and AVX2 intrinsics; Section 6 presents and analyzes the results obtained for each experiment. Finally, Section 7 concludes the paper.

Projection Pursuit Algorithms
Projection Pursuit is a class of statistical non-parametric methods for data analysis proposed initially by Friedman and Tukey in 1974 [7], which involves searching for directions in the high-dimensional space onto which the data can be projected to reveal meaningful low-dimensional insight into the data structures. Consequently, in Projection Pursuit, the key aspect is optimizing a certain objective function referred to as the Projection Index. A PI is a function which associates a data projection with a real value measuring how important that projection is (usually, how far it is from normality).
Given the fact that the importance of the projection depends on the goal to be achieved (e.g., exploratory data analysis, density estimation, regression, classification, de-correlating variables), various PIs can be defined for each application (clustering analysis and detection [11,12], classification [13][14][15][16][17], regression analysis [18,19] and density estimation [20]), leading to different indices for different projections; among these indices one can find entropy [21], variance, the robust scale estimator [22], skewness [23] and the L1-norm [24]. One of the most popular PIs is the variance of the projected data, defined by the largest Principal Components (PCs) of the PCA.
Because the execution time of Projection Pursuit methods increases exponentially with the dimensionality of the data, these algorithms tend to be computationally intensive. This has motivated researchers to find alternative approaches for pursuing interesting projections, not only in terms of data structures but also in terms of computational resources [25,26]. For example, in [27] a simple version of the Projection Pursuit-based estimator is proposed which is easy to implement and fast to compute.
The Quality of Projected Clusters (QPC) [28] projection aims at finding linear transformations that create compact clusters of vectors, each with vectors from a single class, separated from the other clusters. The fast approximate version of QPC [29] proposes some modifications of the QPC method that decrease the computational cost and obtain results of similar quality with significantly less effort.
The study in [17] presents an accelerated genetic algorithm for the Projection Pursuit model, which can effectively avoid the heavy computation and slow speed of the traditional genetic algorithm. The authors in [30] proposed a PI based on a white noise matrix, which proved to reveal interesting features in high dimensions with low computational effort on both simulated and real data sets.
In [31], the algorithms target speeding up the covariance matrix computation rather than finding the eigenvalues of the matrix (by showing that the covariance between two feature components can be expressed as a function of the relative displacement between those components in patch space), achieving around 10× speedups over the baseline implementation.

Parallel Implementations of PP Algorithms
In addition to algorithmic optimizations like the ones in the references mentioned above, many efforts made by the scientific community in the past decade focused on developing parallel implementations of a particular case of Projection Pursuit method, which is PCA, in order to achieve increased performance fostered by parallel architectures such as multi-core CPUs or GPUs.
In [32], Funatsu et al. proposed a PCA algorithm based on the L1-norm, shown to be up to 3× faster on the GeForce GTX 280 GPU architecture compared with a classical CPU implementation. The results were verified by applying the algorithm in the field of facial recognition. In [33], Josth et al. presented two implementations of the PCA algorithm, first using the Streaming SIMD Extensions (SSE) instruction set of the CPU, and secondly using the CUDA technique on GPU architectures, for RGB and spectral images. The results show speed-ups of around 10× (compared with a multi-threaded C implementation), which allows using PCA on RGB and spectral images in real time.
Another approach to implementing the iterative PCA algorithm on the GPU, based on the standard Gram-Schmidt orthogonalization, was presented in [34] by Andrecut. The author compared GPU-parallel optimized versions based on CUBLAS (NVIDIA) with a CBLAS (GNU Scientific Library) implementation on the CPU and showed an approximately 12× speedup for the GPU.
In [35], a modified version of the fast PCA algorithm implemented on the GPU architecture was presented, along with the suitability of the algorithm for face recognition tasks. The results were approximately 20× faster compared with two other implementations of the fast PCA algorithm (the CPU implementation and the CPU sequential version). In a similar way, in [36] a fast non-iterative PCA computation for spectral image analysis utilizing a GPU is proposed.
The paper in [37] presents a study of the parallelization possibilities of an iterative PCA algorithm and its adaptation to a Massively Parallel Processor Array (MPPA) manycore architecture (the Kalray MPPA-256-N, a single-chip manycore processor that assembles 256 cores running at 400 MHz, distributed across 16 clusters), achieving an average speedup of 13× compared to the sequential version. In [38] the authors present a study of the adaptation of the previously mentioned PCA algorithm applied to Hyperspectral Imaging and implemented on the same MPPA manycore architecture.
In addition, the work in [39] presents the implementation of the PCA algorithm on two different high-performance devices (an NVIDIA GPU and a Kalray manycore, a single-chip manycore processor) in order to take full advantage of the inherent parallelism of these high-performance computing platforms and, hence, reduce the time required to process hyperspectral images.
Reconfigurable architectures, like FPGAs, are also used for accelerating PP-based and PCA-related algorithms. In [40] an FPGA-based implementation of the PCA algorithm is presented as a fast on-board processing solution for hyperspectral remotely sensed images, while in [41] a parallel Independent Component Analysis (ICA) algorithm was implemented on an FPGA. Furthermore, in [42], the authors propose an FPGA implementation of the PCA algorithm using High Level Synthesis (HLS).
Last but not least, cloud computing is also leveraged for improving performance and decreasing runtime: in [43], the authors proposed a parallel and distributed implementation of the PCA algorithm on cloud computing architectures for hyperspectral data.
This paper addresses the computational difficulties of a Projection Pursuit method named geometrical approximated PCA (gaPCA), which is based on the idea of finding the projection defined by the maximum Euclidean distance between the points. The difficulties arise from the fact that gaPCA involves computing a large number of distances, and we address them by proposing several parallel implementations of the method. Among them, a SIMD-based method and a GPU implementation are compared, and the results show significant speedups.

Hardware
The CPU used in our experiments was the AMD Ryzen 5 3600 [44]. The main characteristics of this processor are listed in Table 1. We used the MSI B450 GAMING PLUS mainboard with the AMD B450 chipset. The GPU used in our experiments was the EVGA NVIDIA GeForce GTX 1650 [45]. The GPU specifications are shown in Table 2. For measuring the energy consumption, we used the PeakTech 1660 Digital Power Clamp Meter [46], which provides a large variety of measuring functions (such as measurements of alternating current, alternating voltage, true power, apparent power, reactive power, energy, power factor, frequency and phase angle), and a PC interface for data acquisition via USB 2.0.

Software
The experiments described in this paper were conducted on the above-mentioned AMD Ryzen 5 3600 and NVIDIA GeForce GTX 1650 system, running Ubuntu 18.04.

Datasets
The experiments have been carried out on two data sets: the Pavia University [47] hyperspectral image data set, gathered by the ROSIS sensor, having 103 spectral bands, 610 × 340 pixels and a geometric resolution of 1.3 m, and the Indian Pines data set [48], gathered by the AVIRIS sensor over the Indian Pines test site in North-western Indiana, consisting of 145 × 145 pixels and 200 bands. An RGB representation of both datasets is illustrated in Figure 1. In our experiments the image dimensions taken as input are: M (the image width), N (the image height) and P (the number of bands), giving rise to an [M × N, P] matrix. The test cases used to benchmark our gaPCA implementations involved taking 4 different crops of each of the two datasets (100 × 100, 200 × 200, 300 × 300 and 610 × 340 for Pavia University, and 40 × 40, 80 × 80, 100 × 100 and 145 × 145 for Indian Pines). For all test cases we maintained the full spectral information, i.e., all the available bands were considered: 103 for Pavia University and 200 for Indian Pines. The number of Principal Components computed was varied between 1, 3 and 5 in order to analyze the scaling effects of computing more than one component.
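As a simple illustration of how the input is arranged, the reshaping of a hyperspectral cube into the [M × N, P] matrix described above can be sketched in Python/NumPy as follows (the dimensions are those of the full Pavia University scene; the data itself is only a placeholder):

```python
import numpy as np

# Placeholder cube with the Pavia University dimensions: M x N pixels, P bands
M, N, P = 610, 340, 103
cube = np.zeros((M, N, P), dtype=np.float32)

# Flatten the spatial dimensions: each row of X is one pixel's spectrum
X = cube.reshape(M * N, P)
```

Each test case then operates on such an [M × N, P] matrix, with M and N given by the crop size.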

Description of the gaPCA Algorithm
gaPCA is a novel method that defines its Principal Components as the set of linear projections described by the maximum range in the data (as opposed to the canonical version of PCA, in which the Principal Components are the linear projections that maximize the variance of the data); these components are computed geometrically, as the vectors connecting the furthest points. Hence, gaPCA is focused on retaining more information by giving credit to the points located at the extremes of the distribution of the data, which are often ignored by the canonical PCA.
In the canonical PCA method, the Principal Components are given by the directions where the data varies the most and are obtained by computing the eigenvectors of the covariance matrix of the data. Because these eigenvectors are defined by the signal's magnitude, they tend to neglect the information provided by smaller objects which do not contribute much to the total signal variance.
Among the specific features of the gaPCA method are an enhanced ability to discriminate smaller signals or objects from the background of the scene and the potential to accelerate the computation time by using the latest High Performance Computing architectures (the most computationally intensive task of gaPCA being distance computation, a task easily implemented on parallel architectures [49]). From the computational perspective, gaPCA belongs to the family of Projection Pursuit methods (because of the nature of its algorithm). These methods are known to be computationally expensive (especially for very large datasets). Moreover, most of them involve statistical computations, discrete functions and sorting operations that are not easily parallelized [50,51]. From this point of view, gaPCA has the computational advantage of being parallelizable, thus yielding decreased execution times (an important advantage in the case of large hyperspectral datasets).
To carry out gaPCA in the manner of PP analysis, we first have to define an appropriate Projection Index and a direction onto which the projection is optimal. A PI is a measure of the desirability of a given projection [52]. In PCA, which can be viewed as a special case of PP, this index is the variance of the projection scores. In gaPCA, the PI is the Euclidean distance among data points. A high PI (given by a large distance between data points) suggests important features, in terms of divergence from normality, which is a desired objective of Projection Pursuit techniques in general. In the field of hyperspectral imagery, a high PI value usually indicates a spectral characteristic different from that of the background, which is very important for object detection and the identification of spectral anomalies, regardless of their relative size.
Given a sample X = (x_1, ..., x_n) of p-dimensional points, x_i ∈ R^p, in order to form a univariate linear projection of X, we define a as the p-dimensional normalized vector determined by the two p-dimensional points separated by the maximum Euclidean distance:

a = (x_k − x_l) / ‖x_k − x_l‖, with {x_k, x_l} = arg max_{x_i, x_j ∈ X} d(x_i, x_j),

where d(·, ·) denotes the Euclidean distance.

The gaPCA projection index function can be defined as:

I(X) = max_{x_i, x_j ∈ X} d(x_i, x_j).

Then, the projected data of X can be described as:

z = aᵀ X,

resulting in a new representation for each of the p-dimensional points in an n-dimensional space.

To find a set of orthogonal projections, we have to reduce the rank of the data matrix by projecting it onto the subspace orthogonal to the previous projections. All the points in X are projected onto the hyperplane H, determined by the normal vector a and containing m (the midpoint of the segment that connects the two furthest points):

H = {x ∈ R^p : aᵀ(x − m) = 0}.

This results in a set of projections of the original points, Y = (y_1, ..., y_n), computed using the following formula:

y_i = x_i − (aᵀ(x_i − m)) a.

To further seek linear projections onto the new dimensions, all the above steps are computed iteratively, giving rise to a direction matrix A, so that the projected data can be rewritten as:

Z = Aᵀ X.

This sequence of operations can be repeated until the rank of the data matrix becomes zero. The end result is an orthogonal matrix, where each column is a projection vector. A graphical illustration of the computed gaPCA components vs. canonical PCA components is shown in Figure 2.
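The iteration described above can be sketched in NumPy. This is a minimal illustrative version (not the paper's optimized implementation), with the O(n²) pairwise distance search written directly; here the sample is stored row-wise, so each row of X is one p-dimensional point:

```python
import numpy as np

def gapca(X, n_components):
    """Sketch of gaPCA: each component is the normalized direction between
    the two furthest points; the data are then projected onto the hyperplane
    orthogonal to that direction and the step repeats."""
    Y = X.astype(float).copy()          # rows are p-dimensional points
    components = []
    for _ in range(n_components):
        # pairwise Euclidean distances (the dominant, O(n^2) cost)
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=2))
        k, l = np.unravel_index(np.argmax(d), d.shape)
        a = (Y[k] - Y[l]) / np.linalg.norm(Y[k] - Y[l])  # direction vector
        m = (Y[k] + Y[l]) / 2                            # midpoint
        components.append(a)
        # project every point onto the hyperplane orthogonal to a through m
        Y = Y - np.outer((Y - m) @ a, a)
    A = np.stack(components, axis=1)    # p x n_components direction matrix
    return A, X @ A                     # directions and projected data
```

Because each new direction connects two points lying in the previous hyperplane, successive columns of A come out mutually orthogonal, as the text requires.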
gaPCA is a geometric construction-based approach for PCA approximation, which was previously validated in applications such as remote sensing image visualization [53] and face recognition [54]. Moreover, gaPCA was shown [55] to yield better land classification accuracy than the standard PCA, with the most remarkable results recorded in the cases of preponderantly spectral classes. In such classes the standard PCA's performance is lower due to its loss of spectral information, which restrains its ability to discriminate between spectral similarities. Consequently, gaPCA is more suitable for hyperspectral images with small structures or objects that need to be detected, or where preponderantly spectral classes or spectrally similar classes are present [55].

gaPCA in Land Classification Applications
To show the performance and efficiency of gaPCA as a Dimensionality Reduction algorithm in land classification applications, we compared it with the standard PCA algorithm, used as a benchmark for the assessment of the gaPCA method.
Both methods were used on the Indian Pines and Pavia University datasets for dimensionality reduction, and for each data set the number of Principal Components computed was selected in order to achieve the best amount of variance explained (98-99%) with a minimum number of components (10 for the Indian Pines data, 4 for Pavia University).
The first Principal Components obtained after applying each of the PCA methods represent the bands of the images on which the classification was performed (the input), using the ENVI [56] software application. The Maximum Likelihood (ML) algorithm and the Support Vector Machine (SVM) algorithm were used for classifying both the standard PCA and the gaPCA images, for both datasets. The classification accuracy was assessed by generating a sub-set of randomly generated pixels, based on which a visual comparison with the ground-truth image of the test site at the acquisition moment was performed.
We assessed the classification accuracy of each method with two metrics: the Overall Accuracy (OA, representing the number of correctly classified samples divided by the number of test samples) and the kappa coefficient of agreement (k, which is the percentage of agreement corrected by the amount of agreement that could be expected due to chance alone).
In order to assess the statistical significance of the classification results provided by the two methods, McNemar's test [57] was performed for each classifier (ML and SVM), based on the equation:

z = (f_12 − f_21) / √(f_12 + f_21),

where f_ij represents the number of samples misclassified by method i, but not by method j.
For |z| > 1.96, the overall accuracy differences are said to be statistically significant.
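The statistic can be computed directly, as in the following sketch (the counts used here are hypothetical, not taken from the experiments):

```python
import math

def mcnemar_z(f12, f21):
    """McNemar's z statistic: f12 is the number of samples misclassified
    by method 1 but not by method 2, and f21 the converse."""
    return (f12 - f21) / math.sqrt(f12 + f21)

# Hypothetical counts: |z| > 1.96 indicates a statistically significant
# difference in overall accuracy at the 5% level.
z = mcnemar_z(60, 30)
```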

Indian Pines Dataset
Performing land classification on the Indian Pines dataset is a challenging task due to the large number of classes in the scene, the moderate spatial resolution of 20 m, and the high spectral similarity between the classes, the main crops of the region (soybeans and corn) being in an early stage of their growth cycle. The classification results of the standard PCA and the gaPCA methods are shown in Figure 3a. In Table 3 we summarize the classification accuracy of the two methods for each of the classes in the scene, along with the Overall Accuracy of both methods with the ML and SVM algorithms. A set of 20,000 randomly generated pixels, uniformly distributed over the entire image, was used for testing. The gaPCA OA is higher than the one scored by the standard PCA, and the classification accuracy results for most classes are better. One reason for these results is that PCA assumes that features with high variance are more likely to provide a good discrimination between classes, while those with low variance are redundant. This can sometimes be erroneous, as in the case of spectrally similar classes. There is an important difference in the case of sets of similar class labels: gaPCA scored higher accuracy results for the similar sets of Corn, Corn notill, Corn mintill and also for Grass-pasture and Grass-pasture mowed than those achieved by the standard PCA, demonstrating the ability of the method to better discriminate between similar spectral signatures.

Pavia University Dataset

The classification results (for 1000 test pixels) using either ML or SVM are displayed in Table 4, showing the classification accuracy for each class and the OA. The results show that gaPCA scored the best OA and classification accuracy for most classes. The classification accuracies indicate better performance for gaPCA in interpreting classes with small structures and complex shapes (e.g., asphalt, bricks). This confirms the attention accorded by gaPCA to smaller objects and spectral classes, leading to fewer false predictions compared to the standard PCA for these classes. This is more obvious for classes such as bricks, where the confusion matrix shows a misinterpretation as gravel, and for asphalt, confused with bitumen.
The reason behind these confusions is the spectral similarity between the two classes, and not their spatial proximity (Table 5), proving that gaPCA does a better job of discriminating between similar spectral classes because, unlike PCA, it is not focused on classes that dominate the signal variance. In light of these results, gaPCA is confirmed to have a consistent ability in classifying smaller objects or spectral classes, supporting the hypothesis that it has a superior ability to retain information associated with smaller signals' variance.

Parallelization of the gaPCA Algorithm
The first step was to profile the code in order to assess which are the most time- and computation-consuming subroutines and which of them are appropriate for parallel implementations. A detailed perspective on the gaPCA algorithm and its sub-routines is shown in Figure 5. As expected, the code profiling analysis showed that by far the computation of the Euclidean distances between the pixels was the most time consuming task, accounting for approximately 95% of the total running time of the algorithm for any given dataset. The second most time consuming stage of the gaPCA algorithm is the projection of the data onto the hyperplane orthogonal to each of the computed vectors, accounting for just 3% of the total computational time of gaPCA. Based on the time profiling results, we focused our efforts on parallel implementations of these two subroutines. Three implementations (CPU multi-threading, CUDA and SIMD AVX2) were elaborated for the Euclidean distance function (computeMaximumDistance), which is by far the most time consuming; its computational complexity (pairwise distance computations on a large dataset) also fostered the use of the above-mentioned technologies. A diagram depicting the computation of one iteration of the Euclidean distances function is shown in Figure 6. For the subroutine responsible for computing the data projections onto the hyperplane (computeProjections), a CPU multi-core parallel implementation was elaborated. Our research focused on two main aspects: first, on the efficient parallel implementation of the gaPCA algorithm using SIMD instruction set extensions, multi-threading and GPU acceleration; second, on providing a comparison of the above mentioned implementations in terms of time and energy performance on two different computing platforms: a PCI Express GPU accelerator from NVIDIA and a 64-bit x86 CPU from AMD.
The main idea behind parallelizing the gaPCA algorithm was to take advantage of both thread level parallelism and data level parallelism in order to gain higher performance. Because the gaPCA algorithm possesses massive data parallelism (with the sequence of operations computing the Euclidean distance between each pair of n-dimensional points executed for a large number of inputs), it is a good candidate for implementing CPU Data Level Parallelism (DLP) with SIMD.

Matlab, Python and PyCUDA Implementations
The initial implementation (Listing 1 in Appendix A) [49] was developed in Matlab using the specific instructions from the Matlab Parallel Computing Toolbox. The test runs were performed in Matlab R2019a, on the Linux Ubuntu 18.04 operating system. The parfor instruction executes for-loop iterations in parallel on a pool of workers on the multi-core computer or cluster. The default size of the pool of workers is given by the number of physical cores available.
The next two are Python implementations, targeting parallel execution on a multi-core CPU and an NVIDIA GPU, respectively. Both are based on the Python Numba and NumPy libraries; the CPU multi-core parallel execution was achieved using the Numba JIT compiler, while the GPU CUDA version has been implemented using the PyCUDA Python library [58]. Python [59] is a high level programming language. It has efficient high-level data structures and a variety of dedicated libraries, which recommends it for both scientific and real-time applications. Numba [60] is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code. Numba can automatically execute array expressions on multiple CPU cores and makes it easy to write parallel loops.
Our second implementation (Listing 2 in Appendix A) was developed in Python using the Numba compiler. The test runs were performed in Python 3, on the Linux Ubuntu 18.04 operating system. The @jit decorator, added before the definition of the Python function, specifies that Numba will be used. Numba's CPU detection also enables LLVM [61] to autovectorize for the appropriate SIMD instruction set: SSE, AVX, AVX2, AVX-512. The parallel option (parallel = True) enables a Numba transformation pass that attempts to automatically parallelize and perform other optimizations on the function. Furthermore, it offers support for explicit parallel loops. Consequently, Numba's prange was used (instead of the standard range) to specify that the loop should be parallelized. A thread pool is automatically started and the tasks are distributed among these threads. Typically the pool size is equal to the number of cores available on the machine (6 in our case); however, this may be controlled through the nthreads variable. numpy.linalg.norm is a function from Python's NumPy scientific package used to compute the norm of a vector or a matrix (in our case, the Frobenius norm of the difference between the two vectors, equivalent to the Euclidean distance between them).
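The core of this implementation follows the pattern below. This is an illustrative pure-Python sketch that runs without Numba; in the actual implementation the function is decorated with @jit(parallel=True) and the outer range becomes prange, which also requires reducing the per-thread partial maxima rather than updating a single shared best value:

```python
import numpy as np

def compute_maximum_distance(X):
    """Return (max distance, i, j) for the two rows of X that are furthest
    apart in Euclidean distance. Sketch of the Numba-compiled function; with
    prange, the outer loop iterations are distributed over worker threads."""
    n = X.shape[0]
    max_d, max_i, max_j = -1.0, 0, 0
    for i in range(n):                        # prange(n) in the Numba version
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])   # Euclidean distance of the pair
            if d > max_d:
                max_d, max_i, max_j = d, i, j
    return max_d, max_i, max_j
```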

CUDA implementation
Our CUDA kernel was designed so that a block of threads computes one element of the distance matrix, with each thread computing the squared distance between the corresponding elements of the two rows in the input matrix. In the example illustrated in Figure 7, the dataset a is the input matrix, whose rows contain the pixels' values in every spectral band. The block of threads computes the distance between rows i and j, with each thread tid_q computing (a_{i,q} − a_{j,q})². After all threads in the block finish their assigned tasks (passing a synchronization phase), the squared values computed by the threads are added in a parallel tree reduction manner to obtain the total distance between rows i and j of the input dataset a.
Listing 3 in Appendix A shows the pseudocode for the CUDA kernel (the Euclidean function). This function takes as input matrix X and returns C, a three-element array containing the maximum distance (C[0]) and the indexes of the corresponding two furthest points from matrix X (C[1] and C[2]).
In the CUDA kernel, we used a shared memory vector named accumResult, where each thread stores its corresponding squared difference. After all the threads in a block complete this step (ensured by calling __syncthreads()), we perform a parallel tree reduction as shown in the code excerpt from Listing 4, obtaining the final distance value. Finally, the thread with threadIdx.y = 0 in each block compares the current computed distance with the previously existing maximum distance, and updates the maximum distance (and its corresponding indexes) if necessary.
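The tree reduction scheme used in the kernel can be sketched in plain Python as follows; this is an illustrative serial model of the in-block reduction, not the actual code of Listing 4, and the function name tree_reduce_sum is hypothetical.

```python
def tree_reduce_sum(values):
    """Sum `values` by pairwise (tree) reduction, mirroring the in-block
    reduction in the CUDA kernel: at each step, the threads with
    tid < stride add the element stride positions away, and the stride
    halves until one partial sum remains."""
    partial = list(values)
    # pad to a power of two so the halving stride always pairs elements
    while len(partial) & (len(partial) - 1):
        partial.append(0)
    stride = len(partial) // 2
    while stride > 0:
        for tid in range(stride):   # in CUDA: each thread with tid < stride
            partial[tid] += partial[tid + stride]
        stride //= 2                # __syncthreads() separates the steps
    return partial[0]
```

The key property of this scheme is that a sum over n elements completes in log2(n) synchronized steps rather than n sequential additions, which is what makes it efficient on a block of parallel threads.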

C++ Implementations
The first C++ implementation is a single-core one.For the second C++ multithreading implementation, we used OpenMP [62], a simple C/C++/Fortran compiler extension, which allows adding multithreading parallelism into existing source code.
In Listing 5 in Appendix A, the section of code that runs in parallel is marked accordingly by the #pragma omp parallel compiler directive, which causes a team of threads to be forked before the section is executed. The #pragma omp for directive does not itself create threads; instead, it divides the distance computations among the existing team of threads, which run simultaneously. Each thread computes a distance between two rows of the matrix.
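The work-sharing pattern above can be illustrated with a small Python analogue using a thread pool; this is only a structural sketch of how the outer loop's iterations are split among workers (the function and variable names are hypothetical), and because of Python's GIL it does not achieve the true CPU parallelism that the OpenMP version does.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def max_distance_parallel(X, workers=4):
    """Split the outer-loop rows of X among `workers` tasks, each finding
    the furthest pair over its share, then combine the per-worker results,
    analogous to the '#pragma omp for' work-sharing over row pairs."""
    n = X.shape[0]

    def worker(rows):
        best = (0.0, 0, 0)
        for i in rows:
            for j in range(i + 1, n):
                d = float(np.linalg.norm(X[i] - X[j]))
                if d > best[0]:
                    best = (d, i, j)
        return best

    # round-robin split of the outer loop, like OpenMP static scheduling
    chunks = [range(w, n, workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(worker, chunks))
    return max(results)  # tuple comparison: the largest distance wins
```

The final max over the per-worker partial results plays the role of the reduction that, in the OpenMP version, merges each thread's local maximum into the global one.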
For the third C++ implementation, besides OpenMP, we also employed SIMD instruction set extensions to exploit the available DLP. SIMD instructions work on wide vector registers. Depending on the specific SIMD instruction set extension employed, the width of the vector registers ranges from 64 bits to 512 bits (e.g., MultiMedia eXtensions (MMX) [63], SSE [64], AVX [65]). In order to achieve 100% efficiency, the input array size should be a multiple of the width of the vector registers (e.g., 4, 8, 16 or 32 16-bit data elements); however, for large input array sizes, this restriction does not significantly affect performance and high speedups can still be achieved.
In our experiments, we employ the AMD Ryzen 5 3600 CPU [44], which supports the AVX2 [66] SIMD instruction set extension. AVX2 uses dedicated 256-bit registers, each able to store, for example, 4 double-precision floating-point numbers, 8 single-precision floating-point numbers, or 16 16-bit integers. After profiling the data, we chose to represent them as 16-bit integers, using __m256i type vectors, each containing 16 integers represented on 16 bits. For storing and manipulating the data, we used the AVX2 intrinsic functions, available in the Intel library header immintrin.h. Intrinsics are small functions that are intended to be replaced with a single assembly instruction by the compiler.
One of the major challenges arising when vectorizing for SIMD architectures is the handling of memory alignment. SIMD instructions are sensitive to memory alignment, and there are dedicated instructions for accessing unaligned and aligned data. Because each register size has a corresponding alignment requirement (i.e., the 256-bit Intel AVX registers have an alignment requirement of 32 bytes), attempting to perform an aligned load, or store, on a memory operand which does not satisfy this alignment requirement will generate errors [67]. There are several different methods for memory alignment. As shown in Listing 6 in Appendix A, we used the alignment macros defined in the header stdalign.h, which provides the alignas() macro and the corresponding _Alignas keyword for aligning a variable.
Listing 7 in Appendix A provides the code for computing the distances between two arrays, p1 and p2, of length size, using AVX2 intrinsic functions. To load the arrays of values into memory, we used the _mm256_load_si256 intrinsic. The _mm256_sub_epi16 intrinsic was used to subtract each pair of 16-bit integers in the 256-bit chunks, and the values obtained were then multiplied using the _mm256_mullo_epi16 intrinsic. The second for loop of the function deals with the leftovers of the input arrays (in case their length is not a multiple of 16, the sum of the squared differences between the remaining values is computed using standard C++ instructions). The sum of the squared differences was computed with another C++ function based on SIMD instructions, shown in Listing 8 in Appendix A.
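The chunked main loop plus scalar remainder structure of Listing 7 can be mirrored in a scalar Python sketch; this illustrates only the loop structure and leftover handling, not the intrinsics themselves, and the function name squared_distance_chunked is hypothetical.

```python
def squared_distance_chunked(p1, p2, width=16):
    """Scalar sketch of the AVX2 loop structure: process the arrays in
    chunks of `width` elements (one 256-bit register holds 16 16-bit
    integers), then handle the leftover tail with a plain scalar loop."""
    n = len(p1)
    main = n - (n % width)  # largest multiple of `width` not exceeding n
    total = 0
    # main loop: one iteration corresponds to one load / subtract /
    # multiply sequence over a 16-element chunk in the AVX2 version
    for base in range(0, main, width):
        diffs = [p1[base + k] - p2[base + k] for k in range(width)]
        total += sum(d * d for d in diffs)
    # remainder loop: leftover elements when n is not a multiple of 16,
    # computed element by element, as with standard C++ instructions
    for k in range(main, n):
        d = p1[k] - p2[k]
        total += d * d
    return total
```

For large arrays the remainder loop touches at most width − 1 elements, which is why the leftovers have a negligible impact on the overall speedup, as noted above.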

Results and Discussion
To evaluate the gaPCA algorithm acceleration, we built a benchmark in which we measured the amount of time the algorithm took to execute on every test case of the two datasets mentioned above, for 1, 3 and 5 Principal Components. For each of the above-mentioned image sizes, we ran the algorithm with the specified number of computed Principal Components several times for each test (in order to minimize the overhead time associated with cache warming [68]) and averaged the result. The speedup was then computed as the ratio between the averaged execution times of two implementations (e.g., single/multi-core CPU, SIMD-augmented CPU, GPU) for the same image size and number of computed Principal Components.
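The averaging and speedup computation described above amounts to the following small helper; the function name speedup is hypothetical and the numbers in the test are illustrative, not measurements from the paper.

```python
def speedup(baseline_times, impl_times):
    """Speedup of an implementation over the baseline for one test case:
    the ratio of the run-averaged execution times. Several consecutive
    runs are averaged to amortize the cache-warming overhead."""
    def avg(ts):
        return sum(ts) / len(ts)
    return avg(baseline_times) / avg(impl_times)
```

A value greater than 1 means the implementation is faster than the baseline by that factor.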

Matlab vs. Python and PyCUDA
The first comparative evaluation was performed between the execution times of the Matlab, Python and PyCUDA implementations of the gaPCA algorithm.The execution times were measured for all the above-mentioned image crops of the Indian Pines and the Pavia University dataset, for a number of 1, 3 and 5 computed Principal Component(s) (PCs), for 2 consecutive runs.
Table 6 shows the timing results for the three implementations, for each test case of the Indian Pines dataset. Similarly, Table 7 shows the timing results for the test cases of the Pavia University dataset. Figures 8 and 9 present the speedup between the three implementations (with the Matlab implementation taken as a baseline for comparison) for all the test cases of the Indian Pines and the Pavia University datasets, respectively.
For the first test case (Indian Pines), the Matlab implementation outperforms its Python equivalent, being up to 4.58× faster for the 40 × 40 image crops, while for the other image sizes (80 × 80, 100 × 100 and 145 × 145) the Python implementation is faster. For the second test case (Pavia University), we can notice that the Matlab implementation is faster than the Python one by approximately 20% on average (ranging from 4% for the 200 × 200 and 5 PCs dataset to 42% for the 100 × 100 and 1 PC dataset).
One of the major factors influencing the differences in speedup between the Matlab implementation and its Python counterpart is the overhead associated with each environment's compiler and the deployed components: a startup delay while the data is loaded, or data marshaling associated with deploying certain libraries. Matlab employs a just-in-time (JIT) compiler to translate code into machine binary executables. This substantially increases speed and is seamless from the user's perspective, since it is performed automatically in the background when a script is run. The Python Numba project has developed a similar JIT compiler, but the speedup is actually very small when using "non-native" Python code (NumPy functions which cannot be compiled in optimal ways).
A second factor causing poorer performance for Python compared to Matlab for the smaller image crops is the fact that, in general, NumPy's speed comes from calling underlying functions written in C/C++/Fortran. NumPy performs well when these functions are applied to whole arrays. In general, poorer performance is achieved when applying NumPy functions to smaller arrays or scalars in a Python loop (hence, for the smaller dataset dimensions, the Matlab speedup is higher than for the larger datasets). It is also notable that Matlab's Parallel Toolbox is limited to 12 workers, whereas in Python there is no limit to the number of workers.
When comparing the two CPU multi-core implementations (Matlab and Python) with the GPU implementation (PyCUDA), the speedups show that the GPU implementation of the gaPCA algorithm is up to 17.82× faster for Indian Pines and up to 18.84× faster for the Pavia University dataset than the Matlab CPU implementation of the same algorithm taken as reference for comparison. One can notice that the greatest speedup is registered for the test cases with 1 computed Principal Component, due to the fact that for the smaller dataset sizes more of the data can fit into the caches and the memory access time reduces dramatically, which causes the extra speedup in addition to that from the actual computation. In addition, the GPU speedup is slightly smaller for 3 and 5 computed Principal Components, due to the fact that, unlike the CPU execution times which grow nonlinearly, the GPU execution time grows linearly.

The second comparative evaluation was performed between the execution times of several C++ implementations: single core (C++ SC), single core with AVX2 intrinsics (C++ SC AVX2), multi-core (C++ MC) and multi-core using AVX2 intrinsics (C++ MC AVX2). The execution times were again measured for all image crops of the Indian Pines and the Pavia University dataset, for 1, 3 and 5 computed Principal Component(s) and averaged over 2 consecutive runs. Table 8 shows the timing results for the four implementations, for each test case of the Indian Pines dataset. Similarly, Table 9 shows the timing results for the test cases of the Pavia University dataset.
The speedups between the four implementations (with the single-core implementation taken as a baseline for comparison) are shown in Figures 10 and 11, for all the test cases of the Indian Pines and the Pavia University datasets, respectively. The results show that the single-core AVX2 implementation is almost 4 times faster on average (3.90×) than the single-core one for Pavia and 4.6 times faster on average for the Indian Pines test cases. The speed-up is slightly higher for Indian Pines due to the fact that this dataset has more spectral information (more bands, 200 vs. 103, as previously mentioned), which leads to more data being processed using AVX2 instructions, thus increasing the efficiency and decreasing the execution time.
With regard to the multi-core implementation, the speed-up is consistent between the two datasets, and is on average 6 times faster than the single-core version (6.04× for Pavia and 5.98× for Indian Pines). This is consistent with the number of cores (6) of the CPU on which the tests were carried out.
As expected, the multi-core AVX2 implementation yielded the best results, with an average speed-up of 21.24× for Pavia and 23.92× for the Indian Pines test cases. The slightly better speed-up in the case of Indian Pines is again caused by the higher spectral dimension of this dataset, as explained above.

The final comparative evaluation was performed between the execution times of the three parallel C++ implementations: multi-core (C++ MC), multi-core using AVX2 intrinsics (C++ MC AVX2) and multi-core with CUDA programming (C++ MC CUDA). The execution times were again measured for all image crops of the Indian Pines and the Pavia University dataset, for 1, 3 and 5 computed Principal Component(s) and averaged over 2 consecutive runs. Table 10 shows the timing results for the three implementations, for each test case of the Indian Pines dataset. Similarly, Table 11 shows the timing results for the test cases of the Pavia University dataset.
The speedups between the three implementations (with the standard multi-core implementation taken as a baseline for comparison) are shown in Figures 12 and 13, for all the test cases of the Indian Pines and the Pavia University datasets, respectively. The results show that the multi-core AVX2 implementation is on average 3.52 times faster than the standard multi-core one for Pavia and 4 times faster on average for the Indian Pines test cases. This difference in speedup is caused by the same spectral dimensionality difference between the two datasets previously addressed.
With regard to the multi-core CUDA implementation, this was shown to be 4.85× faster on average than the standard multi-core version for Pavia and 4.14× faster on average for the Indian Pines test cases. It can be noticed that for the smaller crop sizes the speed-up is significantly lower, and for the smallest crop size the CUDA version is even slower than the MC AVX2 implementation (e.g., for Indian Pines 40 × 40 and 1 PC, 1.49× vs. 3.58× for MC AVX2). This is due to the CUDA computational overhead (i.e., copying the data from CPU memory to GPU memory and back). The impact of the overhead is reduced as the dataset size increases. This also explains the speed-up difference between the two datasets, which is reversed compared to the other comparisons discussed above, with a higher speed-up for Pavia and a lower one for Indian Pines. This is due to the fact that Indian Pines has a much smaller spatial size than Pavia (145 × 145 vs. 610 × 340), and for the smallest crop size (40 × 40) the CUDA computational overhead surpasses the acceleration gain.
Figure 14 shows a comparison of all implementations in terms of speed-up compared to the C++ single-core version taken as baseline. This comparative perspective confirms that the two CUDA-based implementations (C++ MC CUDA and PyCUDA) yield the highest speed-ups, followed closely by the C++ MC AVX2 version. The C++ MC CUDA version is faster than the PyCUDA version by 9.3% on average for Pavia and by approximately 30% for Indian Pines, which is explained by the fact that, aside from the CUDA kernel, the C++ multi-core code proved to be faster than its Python equivalent. Moreover, both show better speed-ups for Pavia than for Indian Pines, which is consistent since both share the same CUDA kernel. The C++ MC AVX2 implementation shows a reversal of this trend (Indian Pines having a slightly higher speed-up than Pavia), results which are consistent with the C++ SC AVX2 version, where we notice the same trend. The probable cause is related to the spatial and spectral characteristics of the two datasets: Indian Pines has a lower spatial resolution but a higher spectral resolution than Pavia, and the spectral resolution impacts the number of AVX2 operations carried out. This leads to Indian Pines benefiting from AVX2 acceleration to a higher degree than Pavia.

Energy Efficiency
To evaluate the energy consumption, we measured the system power consumption during the execution of each algorithm implementation for two test cases: Indian Pines 100 × 100 and Pavia University 200 × 200; in all cases, 5 PCs were computed. The system's power consumption was measured using a PeakTech 1660 Digital Power Clamp Meter [46], which provides a USB connection and dedicated software for data acquisition, thus increasing the accuracy and reliability of the results.
The measurement results were consistent between the two test cases and show, on one hand, that all parallel solutions consume much less total energy than the single-core implementation and, on the other, that among all parallel implementations, the most efficient was the C++ MC AVX2 implementation, followed closely by the C++ MC CUDA implementation and the PyCUDA version.
Table 12 shows the energy consumption for the Matlab, Python and PyCUDA implementations versus the total execution time; the energy results are also displayed in Figure 15. Similarly, the energy consumption for the C++ implementations is illustrated in Table 13 and Figure 16.
The C++ MC CUDA implementation is shown to be slightly less energy efficient than the C++ MC AVX2 implementation (consuming 24.74% more energy in the Indian Pines test case and 4.89% more energy in the Pavia University test case), but it outperforms the C++ MC AVX2 implementation in terms of execution speed (being 24.07% faster for the Indian Pines test case and 39.89% faster for Pavia University). The difference in the increased energy consumption for the two test cases (24.74% for Indian Pines vs. 4.89% for Pavia University) can be explained by the difference in the number of spectral bands (200 for Indian Pines vs. 103 for Pavia University), which leads to a different number of threads per block being used by the CUDA kernel (256 for Indian Pines vs. 128 for Pavia University). These results are also confirmed in the case of the PyCUDA implementation, which uses the same CUDA kernel.

Conclusions
The trends and advances in remote sensing and Earth observation in recent years have led to the continuous acquisition of massive geospatial data of various formats. This raises scientific challenges related to the processing, analysis and visualization of remote sensing data, with a focus on algorithms and computing paradigms able to extract knowledge and meaningful information both offline and in real time. Hence, the latest High Performance Computing devices and techniques, like parallel computing, GPUs and enhanced CPU instruction sets like AVX2, represent solutions able to provide significant gains in terms of performance for various data-intensive applications and algorithms.
In this paper we have presented the implementation of a PP-based geometrical Principal Component Analysis approximation algorithm (gaPCA) for hyperspectral image analysis on a multi-core CPU, a GPU and multi-core CPU using AVX2 intrinsics, together with a comparative evaluation of the implementations in terms of execution time and energy consumption.
The experimental evaluation has shown that all parallel implementations have consistent speed-ups over the single-core version: the C++ CUDA version was on average 29.3× faster on Pavia and 24.8× faster on Indian Pines, while the C++ MC AVX2 version had an average speed-up of 21.2× for Pavia and 23.9× for Indian Pines compared to the baseline C++ SC version. These timing results show not only the advantage of using CUDA programming in implementing the gaPCA algorithm on a GPU in terms of performance and energy consumption, but also significant benefits in implementing it on the multi-core CPU using AVX2 intrinsics. These two implementations were shown to be much faster than the standard multi-core implementation, and also the most efficient with regard to energy consumption. The C++ MC AVX2 version was shown to be the most efficient, requiring on average 8.26× less energy when running the Pavia dataset and 9.71× less energy when running the Indian Pines dataset compared to the baseline C++ SC implementation. The C++ MC CUDA version had only slightly lower results, being 7.88× more efficient on Pavia and 7.78× more efficient on Indian Pines than the single-core implementation.
Consequently, this paper highlights the benefits of using parallel computing, AVX2 intrinsics and CUDA parallel programming paradigms for accelerating dimensionality reduction algorithms like gaPCA in order to obtain significant speed-ups and improved energy efficiency over the traditional CPU single-core or multi-core implementations.

Figure 1 .
Figure 1. RGB visualisation of the two hyperspectral datasets used for experiments, Pavia University (a) and Indian Pines (b).

Figure 3 .
Figure 3. Standard PCA (a) and gaPCA (b) images classified (Maximum Likelihood) vs. the groundtruth image (c) of the Indian Pines dataset.

Figure 4
Figure 4 shows the images achieved with the Standard PCA method (a) and the gaPCA method (b) classified with the ML Algorithm and the groundtruth of the scene (c).

Figure 5 .
Figure 5. Diagram showing gaPCA algorithm design and the two sub-routines with parallel implementations.

Figure 6 .
Figure 6. Diagram showing one iteration of the Euclidean distances function.

Figure 7 .
Figure 7. Diagram showing the parallel implementation in Compute Unified Device Architecture (CUDA) of the Euclidean distance function.

Figure 15 .
Figure 15. Energy consumption for the Indian Pines (a) and Pavia University (b) datasets for the Matlab, Python and PyCUDA implementations.

Figure 16 .
Figure 16. Energy consumption for the Indian Pines (a) and Pavia University (b) datasets for the C++ implementations.

Table 3 .
Classification results for the Indian Pines dataset.

Table 4 .
Classification results for the Pavia University dataset.

Table 5 .
Excerpt from the confusion matrix for the Pavia University dataset.

Table 12 .
Energy measurement results for the Matlab, Python and PyCUDA implementations.

Table 13 .
Energy measurement results for the C++ implementations.