A Deep Pipelined Implementation of Hyperspectral Target Detection Algorithm on FPGA Using HLS

: Real-time target detection for hyperspectral images (HSI) has received considerable interest in recent years. However, owing to enormous data volume provided by HSI, detection algorithms are generally computationally complex, thus developing rapid processing techniques for target detection has encountered several challenging issues. It seems that using a deep pipelined structure can improve the detection speed, and implementing on ﬁeld programmable gate arrays (FPGAs) can also achieve concurrent operations rather than run streams of sequential instruction. This paper presents a deep pipelined background statistics (DPBS) approach to optimizing and implementing a well-known subpixel target detection algorithm, called constrained energy minimization (CEM) on FPGA by using high-level synthesis (HLS). This approach offers signiﬁcant beneﬁts in terms of increasing data throughput and improving design efﬁciency. To overcome a drawback of HLS on implementing a task-level pipelined circuit that includes a feedback data path, a script based circuit design method is further developed to make connections between some of the modules created by HLS. Experimental results show that the proposed method can detect targets on a real-hyperspectral data set (HyMap Data) only in 0.15 s without compromising detection accuracy.


Introduction
Hyperspectral remote sensing imaging acquires three-dimensional (3D) data including two spatial dimensions with space information of pixels and one spectral dimension with high-dimensional reflectance vectors [1].The rich spectral information provided by HSI is very useful and has been widely used in a range of various applications such as ecology [2], agriculture [3], environmental [4] and geology [5], where target detection plays a crucial role [6][7][8][9].There are many algorithms have been developed for target detection in HSI [1], such as matched filter (MF) [10], spectral angle mapper (SAM) [11], constrained energy minimization (CEM) [12], target-constrained interference-minimized filter (TCIMF) [13], adaptive coherence estimator (ACE) [14], matched subspace detector (MSD) [15], orthogonal subspace projection (OSP) [16], and sparsity-based target detector (STD) [17].Among them, CEM along with its variants have been widely used for hyperspectral target detection.The effectiveness of CEM has been shown successfully in many applications such as reconnaissance, rescue, search and on-orbit processing [6].For such applications, the high-speed data processing of HSI is generally required for finding targets on a timely basis.However, as a trade-off the volume of data in HSI has also become unmanageable with increasing spectral and spatial resolutions.In this case, CEM needs a significantly large number of complex matrix computations.The algorithm could be implemented on one of the widely used platforms like CPUs, GPUs [18], and FPGAs [19].Among them, FPGAs have significant advantages in supporting parallel computation and customizable deep pipeline operation with the cost of low power consumption.
Currently, great progress has been made in implementing target detection algorithms on FPGAs.For example, Chang described a new FPGA design by using the Coordinate Rotation Digital Computer (CORDIC) algorithm to solve the matrix inversion problem of the classical CEM [20].This method of computing the inverse of a large matrix is unable to support fast target detection.Yang utilized Streaming Background Statistics (SBS) structure with an idea of continuously updating the inverse of the correlation matrix on FPGA [21].Despite that a pixel-by-pixel processing design is realized using fewer hardware resources, its data processing speed is not high.Recently, Gonzalez C. et al. proposed an FPGA implementation of the automatic target-generation process based on an orthogonal subspace projector (ATGP-OSP) using the pseudoinverse operation [22], where Gauss-Jordan elimination method was selected for computing the inverse of a small square matrix whose size is no more than 32 × 32.Unfortunately, the Gauss-Jordan elimination method consumes too much logic resources for solving the large matrix inversion problem required by the CEM.
Although FPGAs gain much attention, it is still not widely deployed for accelerating many algorithms that require high computational complexity such as CEM.The main reason is that the conventional development methods of FPGAs, which are based on register transfer level (RTL) hardware description, are much more difficult than that of CPUs or GPUs.It commonly requires great efforts in achieving highly efficient results on FPGAs.Furthermore, a design method based on RTL for FPGAs lacks portability and flexibility compared to those based on C/C++ for CPUs or GPUs.To close this gap, FPGA vendors and developers have begun to take advantage of high-level synthesis (HLS) to work on FPGA applications.HLS is able to convert high abstraction languages such as C, C++ and SystemC into VHDL/Verilog hardware description language (HDL) for RTL-level circuit design.According to the user-defined constraints and C code style, the efficiency of the converted RTL designs are quite different.Until now, studies on the acceleration of hyperspectral data processing algorithms with HLS are already available.Santos proposed a novel adaptive and predictive algorithm for lossy hyperspectral image compression algorithm [23] and Lossy Compression for Exomars (LCE) algorithm [24] described in Vivado HLS.Domingo R. et al. proposed a hyperspectral image spatial-spectral classifier accelerator using Intel FPGA SDK for OpenCL [25].What's more, HLS is popularly utilized for hardware acceleration of deep learning algorithms like convolution neural networks (CNN) [26,27].However, no research work has been reported for implementing CEM on FPGA using HLS.
In this work, a DPBS-CEM algorithm is developed to be implemented on FPGA using HLS for real-time hyperspectral target detection.Like SBS-CEM, the inverse matrix is gradually updated according to a Sherman-Morrison formula [28].Different from using sliding windows, DPBS-CEM takes advantage of cumulative windows instead to greatly reduce the number of calculations.As for the issue of removing data dependency in updating inverse matrices, separate memories are proposed to store the results of the successive inverse matrices, which make sure the operations on adjacent pixels can be processed independently.As a consequence, a deep pipelined implementation of DPBS-CEM can be further developed, which has an extraordinary performance improvement in terms of data throughput.Experimental results demonstrate that the proposed algorithm has the capability of operating at a high-speed rate of more than 200 MHz on FPGA.Setting the same clock frequency, the algorithm can also achieve a significant speed-up of near 7.3× than SBS-CEM [21] with no compromise for detection accuracy.
The contributions of this paper can be summarized as follows.
• A novel deep pipelined architecture is proposed to accelerate the proposed DPBS-CEM algorithm on FPGA using HLS.It outperforms the previous work designed with RTL in terms of data throughput performance.• A solution is derived to remove the data dependency existing in SBS-CEM for updating the inverse matrices, by allowing four adjacent pixels to update their own individual inverse matrices that are stored in four different memories.• The proposed structure can be simply rebuilt to support diverse HSI implementations with different spatial resolution and number of spectral bands through several parameters modified under HLS.Most importantly, the framework can support various operation modes including split/non-split data and local/global detection.It is easily adapted to match multiple rates of hyperspectral imagery.• Last but not least, alternative solutions to the problems of feedback and high fanout are also provided.
The remainder of this paper is organized as follows.Section 2 briefly discusses CEM and SBS-CEM used for target detection.Section 3 describes the principle of DPBS-CEM in great detail.The FPGA implementation of DPBS-CEM is presented in Section 4. Section 5 conducts a detailed performance analysis via extensive experiments.Finally, conclusions along with some remarks were drawn in Section 6.

Related Algorithms
In this section, the principles of the classical CEM and SBS-CEM algorithms are described.Besides, the problems of implementing these two algorithms in practical applications are also analysed.

Principle of the CEM Algorithm
Let X ∈ R W×H×L denote a HSI with W × H pixels (row of X) and L spectral bands (column of X).We may interpret X either as a collection of L 2D images (or bands) of size N (N = W × H), or as a collection of W × H spectral vectors of size L. The entire data matrix X = [x 1 , x 2 , x 3 , ..., x N ], where x i is the ith sample pixel vector x i = (x i1 , x i2 , ..., x iL ) T for 1 ≤ i ≤ N and the signature d = (d 1 , d 2 , ..., d L ) T of target is known.The basic purpose of CEM is to design a linear finite impulse response (FIR) filter with L filter coefficients denoted by an L-dimensional vector w = (w 1 , w 2 , ..., w L ) T that minimizes the energy of the the output y i (1 ≤ i ≤ N) with the following constraint.
where R = 1 N ∑ N i=1 x i x T i is the global correlation matrix of X.The weighting vector w solved for Equation (1) and Equation ( 2) is given by which yields the CEM described by

Problem Analysis
The classical CEM algorithm is a global subpixel target detector, which uses all the pixels in HSI to calculate the correlation matrix.After the correlation matrix is obtained, the process of calculating the inverse matrix is executed through many complicated steps via QR decomposition.This is a typical large matrix inversion problem, which may be the main cause of a significant latency up to obtaining the final results.

SBS-CEM Algorithm
To accelerate the task of target detection using CEM, some researchers choose to calculate local detection by using a partial set of pixel vectors instead of all data sample vectors [6,29,30].For instance, Yang [21] proposed an FPGA-based implementation of SBS-CEM by using a new matrix inversion method to perform the correlation operation and the inversion operation simultaneously.

Principle of the SBS-CEM Algorithm
Unlike the classical CEM algorithm, SBS-CEM takes the inverse of the correlation matrix of the K-group pixel vectors to replace the entire pixel vectors.More specifically, SBS-CEM can be described as follows.
n is the inverse of the correlation matrix of the K-group pixel vectors.The Sherman-Morrison formula is used to derive the following two formulas.
Based on this streaming framework, the inverse matrix can be updated by using Equations (7) and (8).When applying the Sherman-Morrison formula, the initial value of S −1 0 should be set.Let S −1 0 = β • I; then S K+1 can be expressed as: Among them, the matrix (1/β) • I does not affect the performance of the detector.On the contrary, it makes the detection results be more stable [31].The detection equation of the SBS-CEM algorithm is then derived as: Since the pixel to be detected is located in the middle of the window, SBS-CEM can also be expressed as:

Problem Analysis
Sliding window problem.Compared to the classical CEM algorithm, SBS-CEM does not need the full image data sample vectors to compute the correlation matrix.Instead, a local region of the image defined by a sliding window is utilized to capture the local statistics.The size of the sliding window is fixed and set to L 2 (square of the number of spectral bands) in the SBS-CEM algorithm.The fixed size of the sliding window requires the compute-intensive task of calculating the Sherman-Morrison formula to be performed twice (as shown in Equations ( 7) and ( 8)) for each update of the inverse matrix.However, the extra calculation of the Equation (8) does not provide appreciable improvements of the target detection accuracy according to our experimental results.
Data dependency problem.The problem of data dependency exists in the process of updating the inverse matrix where the calculation for S −1  n+1 cannot be started until the S −1 n is available.Unless S −1 n is ready, it is not possible to compute S −1 n+1 .SBS-CEM divides the process of updating the inverse matrix into several stages to reduce its complexity.Unfortunately, under such circumstance, several stages' time consumption has to spend waiting for each inverse matrix updating.This computation overhead would be the major bottleneck of the SBS-CEM's data throughput performance.

Principle of Algorithm Optimization
To solve the problems described above for SBS-CEM, an optimized algorithm is proposed in this section.The two main improvements are proposed to deal with the use of sliding windows and to remove data dependency.
Non-sliding window.We choose not to use sliding windows to update calculations of the inverse matrix, which is quite different from the SBS-CEM algorithm.With no requirement for moving out the oldest pixel, the Equation ( 8) can be removed and thus a large number of calculations can be therefore reduced.When a new pixel vector x n is loaded into the window, we can obtain the output value S −1 n+1 by Equation (12).
Data segmentation for deep pipeline.As mentioned above, the SBS-CEM algorithm runs calculations of matrix inversions in serial.Since data dependency exists between S −1 n+1 and S −1 n , there is a great increase in processing time.To solve this problem, we need to complete the computation of Equation ( 12) in four stages and apply pipeline optimization for achieving pipeline acceleration.However, updating the inverse matrix between adjacent pixels is not independent, which prevents the use of the optimization strategy of deep pipeline.If we want to achieve a deep pipelined design, we have to make sure there is no feedback or iterations among the stages.In this case, we solve the data dependency by means of data segmentation.As a result, the current input pixel can be processed directly with no need of waiting for the previous pixel to be completed.By making the inverse calculations between neighbouring pixels independent, we are able to carry out a deep pipelined architecture, which can achieve 8× speed-up compared to SBS-CEM in theory.
Table 1, derived from the evaluation of hardware calculation, shows that the number of computations for each stage is different, but the number of clock cycles consumed by each stage is approximately equal after being parallelized.Where x n P = x T n represents a column of X, T and Q are scalars.S −1  n is denoted by U, which is an L-dimensional matrix.In addition, the detail procedure of DPBS-CEM algorithm is shown as Algorithm 1.
Table 1.Four stages of the inverse matrix update.

Stage Number
Formula Flop (× : ±) Parallelism Clock Cycles The deep pipelined background statistics (DPBS) target detection CEM algorithm Input: Initialize the following parameters.
(1) HSI data size: (2) the value of β; (3) the desired signature d; (4) the number of inverse matrices: M = 4; (5) bn indicates the index of number; (6) K indicates the number of pixel vectors collected before starting target detection; Output: the final target detection results.define an initial inverse matrix S −1 0 : endif calculate the target detection results: bn d endif endfor

Design Challenges
Feedback.There is a feedback problem in updating the inverse matrix.In fact, the inverse matrix updated in the fourth stage has to be transmitted back to the first stage as an input operand for the next updating.All of the stages are described by individual C/C++ functions.To substantially accelerate the process of updating the inverse matrix, we have to apply the data flow optimization directly to these functions so that the HLS tool can be guided to implement a task-level pipelining.Unfortunately, the HLS tool will not take place if it detects a feedback among the functions.As a result, the task-level pipelining cannot be achieved only using HLS directly.
Fanout.Due to the use of a large number of bands, there are some high fanout cases where some registers need to drive lots of loads like multipliers, which result in longer path delay and lower clock frequency.For example, in the fourth stage as described in Table 1, the scalar Q needs to be multiplied by L elements of a column in the matrix F simultaneously after parallel computation applied.It means that the element of the scalar Q has a high fanout to drive as much as L slave modules.It is simple to solve the high fanout problem by means of duplicating registers when designing with RTL, but it is not easy with HLS.

FPGA Implementation
In this section, an overall hardware structure of DPBS-CEM is given in Section 4.1.Section 4.2 describes the internal architecture of the inverse matrix updater in detail along with its workflow of deep pipeline.The difficulties in developing the hardware framework of DPBS-CEM using the HLS tool and their solutions are discussed in Section 4.3.Section 4.4 briefly introduces a few particular features of the proposed FPGA implementation of DPBS-CEM.

Overall Hardware Architecture of DPBS-CEM
As shown in Figure 1, the framework of DPBS-CEM mainly consists of three components including an off-chip memory, a processor core, and a scheduler.The off-chip memory (DDR3 SDRAM) is utilized to cache the hyperspectral image pixels.The processor core is responsible for the data processing of DPBS-CEM, which involves three modules: the first module is an inverse matrix updater, dedicated to update the inverse matrix in five stages; the second module is a spectral pixel filter, applied to filter pixels in four stages; and the last module is a storage component, utilized to cache the inverse matrix.Finally, the third component is scheduler which is designed to schedule the two modules of inverse matrix updater and spectral pixel filter.

Internal Architecture
As described in Figure 2, the inverse matrix updater contains five processing stages for updating the inverse matrices and four memory buffers for independently caching inverse matrices associated with four successive pixels.The four individual memories are allocated for solving the problem of data dependency described in Section 2.2.2.The specific calculations of each stage, the data flow, and the access mode of inverse matrices are clearly displayed in Figure 2. In addition, we arrange five blocks (Block A in Figure 3a, Block B in Figure 4a, Block C in Figure 5a, Block D in Figure 6a, and Block E in Figure 7a) to realize the last four processing stages in Figure 2, and we also provide pieces of C/C++ code written in HLS for these blocks on the right side of the Figures.In the Appendix A, the features of the #pragma used in these pieces of code are explained in Table A1.for(unsigned char l=0;l<L;l++)   In what follows, the complete updating process of the inverse matrices by using these blocks can be summarized in five stages as depicted in Figure 2.

Stage1 All elements of vector x T
n read from the DDR3 SDRAM are loaded sequentially and passed on to the next stage.Stage2 According to the index of the current pixel, we read a corresponding matrix S −1 from the storage module.When dealing with the first four pixels of an image, we need to overwrite the matrix S −1 with initialized matrix β • I.Then, we take matrix U and vector p T as input operands into the Block A to calculate product h.Subsequently, p, h, and U are passed to the next stage.Stage3 We count T by applying the Block B, then transmit U, T, and h to the next stage.Stage4 The Block C is utilized to work out the product F of two vectors.We calculate Q by employing the Block D. Then U, F, and Q are delivered to the next stage.Stage5 We figure out the new matrix S −1 through utilizing the Block E and write it to the corresponding location of the storage module according to the current pixel.
Besides, it is worth noting that the following design optimization strategies play an important role in improving the performance of the FPGA implementation.
(1) In the process of updating the inverse matrices, we allocate a single divider and execute it once for each inverse matrix updating.Thanks to such operation, a lot of logic resources and computation time consumed by the divider can be saved.(2) There are three types of data that need to be cached between two stages, the scalar data, the vector data, and the matrix data.In order to attain the capability of parallel computation, the matrix is cached in L first in first out (FIFO) memories (In HLS, we use the STREAM directive to map these sorts of data into FIFOs).While the elements of a vector are realized as registers.In addition, L simple dual port RAMs (simple DPRAMs) are deployed to implement the storage module.(3) The data type of input data is 16 bits signed fixed-point (15 bits fractional part), while the data type of intermediate data and detection results are not easy to assign.Due to the precision of intermediate data and detection results have a significant impact not only on the detection accuracy but also on the resource consumption, we performed some experiments to explore the relationship between the data precision and the detection accuracy.The experimental results demonstrate that the detection accuracy goes up with the increase of the bit-width of the fractional part.To better balance the trade-off between the detection accuracy and the resource consumption, we use different data types in different stages.As shown in Figure 2, the variable T and Q are defined as 38 bits signed fixed-point type (14 bits integer part, 23 bits fractional part).The elements of the matrix F are 32 bits signed fixed-point type (14 bits integer part, 17 bits fractional part).
All the other intermediate data are represented as 32 bits signed fixed-point type (7 bits integer part, 24 bits fractional part).T and Q have the significant impact on the detection accuracy.Therefore, they are assigned high data precision up to 38 bits.The elements of the matrix F are obtained by the accumulation operations, and more bits should be assigned to the integer part for avoiding data overflow.Though T and Q have larger bit-width up to 38 bits, it almost does not increase the logic resource consumption compared with the data type of 32 bits signed fixed-point.The reason is that only one single accumulation adder is allocated to compute T while one single adder and one single divider are placed to calculate Q.It is worthwhile to highlight that these data types can be defined and modified by HLS ap_fixed type easily.

Deep Pipeline
As shown in Figure 8, a full pipeline for updating inverse matrices is comprised of Task1, Task2, Task3, Task4, and Task5.These five tasks correspond to Stage1, Stage2, Stage3, Stage4, and Stage5 in Figure 2 respectively.(1) For the purpose of reducing logic resources without compromising accuracy, we implement a high-precision division with the price of long latency.It takes near 30 clock cycles to output the division result.If the division operation is assigned to Task3, the running time of Task3 will increase a lot.As a result, Task3 will turn out to be a bottleneck in the pipeline.Therefore, we assign the division operation to Task4.Note that, the division and multiplication operations in Task4 are carried out simultaneously.(2) For each task, it does not start until all input data are ready and all output FIFOs are not full.
It can be simply realized in HLS by writing C/C++ code as shown in Figure 9.To make the pipeline run efficiently, these FIFOs, which are dedicated to bridging two adjacent tasks, are designed a little bit larger.In this work, the depth of FIFO for vector is 2, while the depth of FIFO for the matrix is L × 2. Besides, the depth of simple DPRAM for the storage module is L × 2 × 4. (3) With regard to the execution time of each task, it is consistent with L + 12 times of the system clock period.Among them, the input time of an L-dimensional vector is L clock cycles, the delay time of the multiplier is one clock cycle, and the remaining 11 clock cycles are used to control input/output of the task.Especially, because two extra clock cycles are required for overwriting the matrix S −1 with the initialized matrix β • I, the total execution time of Task2 is L + 14 clock cycles.

Feedback
The function of task-level pipelining is available in HLS by applying DATAFLOW directive.However, one of the major difficulties with HLS is that HLS does not support to generate a task-level pipelined structure if data dependency (feedback) exists.Unfortunately, there is a feedback in the process of updating the inverse matrix as explained in Section 3.2.To solve this problem, we exploit a design method with a hybrid of RTL and HLS.As shown in Figure 10, HLS is applied to create the two complex modules, inverse matrix updater and spectral pixel filter.A small piece of RTL code is written to complete the scheduler whose function is quite simple.Verilog's generate statement is used to circularly instantiate all of the simple DPRAMs allocated in the storage module.Moreover, a TCL script for automatically connecting the above-mentioned modules is employed.When using HLS to realize the inverse matrix updater module, we define separate interface variables representing the input and output inverse matrices respectively.This separation strategy allows HLS to understand there is no data feedback in accessing the inverse matrix.In fact, the input and output inverse matrices are pointed to the same memory location in the storage module.

High Fanout
Register duplication is one of the most common ways to solve high fanout violation.It can be applied to relieve the fanout challenge as described in Section 3.2.However, the difficulty is how to make HLS replicate registers automatically since there is no inherent support of such feature in HLS.To solve this problem, we modify part of C/C++ code in HLS to split the high fanout task into two or more identical subtasks, which allows HLS to generate duplicated circuits for reducing the fanout.With this optimization, our FPGA implementation is able to work at a rate of speed higher than 200 MHz.

Scalability and Portability
Parallel computation and memory units are placed in the stages of the core architecture of DPBS-CEM to accelerate the related operations of matrix multiplication.The number of the parallel units is equal to the value L. By modifying the value L, we can easily scale the core framework of DPBS-CEM with HLS to support different HSIs with different number of bands.Parameter customized design method with HLS greatly improves the scalability of the system.Simultaneously, the framework does not rely on any specific underlying physical devices of FPGA and vendor-provided IP cores.Thus it can be easily ported to other types of FPGAs.

Flexibility
The flexibility feature is referred to as multiple work modes supported by the proposed DPBS-CEM.The default work mode is high-speed, at which the pipeline is fully operating.Besides, DPBS-CEM is also allowed to be configured working at low-speed mode.Then it can produce global detection results with no need to split image data into four parts for applying deep pipeline.Furthermore, through altering the control of the pipeline between the two processes of inverse matrix updater and spectral pixel filter, DPBS-CEM can output detection results while part of the image pixels are obtained.

Experimental Results and Analysis
A Virtex7 FPGA board (Alpha-Data ADM-PCIE-7V3) is chosen as our development platform, which provides more logic resources than the Kintex-7 board used in [21].Besides the FPGA implementation of DPBS-CEM, the simulation versions were also implemented using the MATLAB and C++ languages.The code of MATLAB and C++ are executed on Windows 7 operating system equipped with the Intel Core (TM) quad CPU @3.2 GHz and 4 GB main memory.We compare the performance of DPBS-CEM with SBS-CEM [21] under the same condition of FPGA implementation.The rest of this section is organized as follows.Section 5.1 describes two hyperspectral data sets used in the experiment.Section 5.2 shows the detection accuracy of the DPBS-CEM algorithm evaluated on both of the hyperspectral data sets.Section 5.3 gives a comparison of the processing time of the DPBS-CEM algorithm in MATLAB, C++ and FPGA.Finally, compared to the FPGA implementation of SBS-CEM [21], we analyze the advantages of the FPGA implementation of DPBS-CEM in terms of logic resources utilization and data processing speed.

TE1 Image
As shown in Figure 11a, 25 panels created with five United States Geological Survey (USGS, Reston, VA, USA) reflectance hyperspectral signatures: alunite (A), buddingtonite (B), calcite (C), kaolinite (K), and muscovite (M).Each row of the five panels in Figure 11b is simulated by the same mineral signature and each column of five panels has the same size [32,33].Among 25 panels are: five 4 × 4-pure pixel panels, px i 4×4 for i = 1, . . ., 5 in the first column; five 2 × 2-pure pixel panels, px i 2×2 for i = 1, . . ., 5 in the second column; five 2 × 2-mixed pixel panels, px i for i = 1, . . ., 5 in the third column; five subpixel panels, px i 4,11 for i = 1, . . ., 5 in the fourth column; and five subpixel panels, px i 5,11 for i = 1, . . ., 5 in the fifth column.Table 2 tabulates the mixing details of mineral composition in the 20 panels in the third column, while subpixel panels in the fourth and fifth columns are simulated with their abundance fractions tabulated in Table 3, where the background (BKG) is simulated by the sample mean of the real cuprite image scene in USGS [33].The Synthetic image TE1 is 200 × 200 pixels, 189 bands from 0.4 um to 2.5 um.The hyperspectral data set is provided by the Digital Imaging and Remote Sensing Group, Center for Imaging Science, Rochester Institute of Technology [34]. Figure 12 shows the HyMap reflection map of Cook City, Montana, USA with a resolution of 280 × 800 and a total of 126 bands distributed between 0.4 and 2.4 um.There is a grass area and four real panels of fabric in the data set as shown in Table 4, where the area of interest is highlighted with a red circle.

Analysis of Target Detection Accuracy
In this part, we evaluate the detection accuracy of the FPGA implementation of DPBS-CEM by using the simulation/real HSI data sets described above.CEM and SBS-CEM are evaluated as well for comparison.The detection accuracy can be evaluated via Receiver Operating Characteristics (ROC) [35].However, the ROC curves of different algorithms may be too close to determine which algorithm has better performance.Therefore, in this paper, we choose another way commonly used in medical diagnosis to calculate the area under a ROC curve, referred to as the area under the curve (AUC) [36].The AUC values corresponding to the detection results can further quantify the differences in the accuracy of the algorithms.The higher the AUC, the better the detection accuracy.

Detection Accuracy of HyMap
In order to further measure the performance of DPBS-CEM, we also focus on the detection results of HyMap data set.Figure 15 shows the results of the target F4 obtained by Global-CEM, SBS-CEM, and DPBS-CEM, respectively.For a more accurate representation of the detection results, we have an enlarged target region of interest, as shown in red boxes of target F4 and Figure 16 of target F1, F2, and F3.As we expected, in comparison to the target detection accuracy of SBS-CEM, DPBS-CEM has the same or even better performance.This conclusion is further verified by the AUC values in Table 5.

Cross-Platform Performance Comparison
From the previous section, we can see that the proposed DPBS-CEM is very close to SBS-CEM [18] in detection accuracy, some detection results of DPBS-CEM are even superior to the latter one.Table 6 shows the processing time comparison of the proposed DPBS-CEM on different platforms (such as MATLAB, C++, and FPGA).The version of MATLAB used here is R2014a.The C++ environment directly uses the software simulation environment of Vivado HLS 2017.3.The clock frequency of FPGA is set at 200 MHz.As shown in Table 6, the processing time of DPBS-CEM implemented on FPGA has achieved significant improvements compared to MATLAB and C++ implementations.On the other hand, the processing time of our software versions is also superior to that of SBS-CEM software implementations [21] since the proposed DPBS-CEM algorithm is less computationally expensive than the SBS-CEM algorithm.

Performance Comparison between DPBS-CEM and SBS-CEM
The FPGA design of DPBS-CEM is implemented on a Virtex7 XC7VX690T FPGA.This FPGA contains 108,300 slices, 433,200 six-input LUTs, 1470 BRAMs, and 3600 DSPs.To facilitate the performance comparison between DPBS-CEM and SBS-CEM, we selected HyMap, the same hyperspectral data source used by SBS-CEM, as our input HSI.Next, we compare the FPGA implementations of SBS-CEM and DPBS-CEM from two aspects of logic resources utilization and data processing speed.
Table 7 shows the resource utilization corresponding to SBS-CEM and DPBS-CEM.The right-hand side lists the unit's ratios and average ratios of DPBS-CEM and SBS-CEM.As Table 7 illustrates, the average resource utilization of DPBS-CEM is 5.21 times more than that of the SBS-CEM algorithm, which is caused by the deep pipelined structure.As aforementioned in Section 4.2.1, the intermediate data precision has a dramatic impact on the detection accuracy.Table 8 shows the relationship between the detection accuracy represented by AUC and the intermediate data precision.In Table 8, we set the data precision as fixed-point type with total of 32, 34, 36, 38, 40, and 42 bits, and identical 14 bits integer part.The experimental results demonstrate that the detection accuracy goes up sharply with the increase of data precision from 32 to 38 while keeps the same from 38 to 42.Due to the same bit-width of the integer part, it can be concluded that the bit-width of the fractional part mainly determines the detection accuracy.According to the experimental results, the bit-width of the fractional part should be more than 23.
The performance of DPBS-CEM has been greatly improved compared with SBS-CEM.Table 9 shows the number of clock cycles occupied by SBS-CEM and DPBS-CEM and the ratio between them.At the same clock frequency of 200 MHz, the number of clock cycles consumed by SBS-CEM is nearly 7.3 times more than that of DPBS-CEM.In other words, when processing the same image, the data processing speed of DPBS-CEM is 7.3 times faster than that of SBS-CEM.It is worthwhile to mention that our work is conducted by mainly using HLS.

Discussion
CEM is an effective algorithm for subpixel target detection in hyperspectral imagery.The classical CEM needs to solve a large matrix inversion problem.SBS-CEM takes the Sherman-Morrison formula to update the inverse matrix for each pixel, which can avoid the complex calculation of large matrix inversion.However, SBS-CEM still uses sliding windows and has data dependency problems, which prevents its further performance improvement on target detection in terms of processing speed.
To solve these problems, we proposed an optimized algorithm called DPBS-CEM.It follows the same way that is used to update the inverse matrix gradually according to the Sherman-Morrison formula [28] but uses cumulative windows instead of sliding windows to reduce the number of calculations.Pixel data splitting and separating inverse matrix memories are utilized to remove the data dependency existing in the process of updating the inverse matrix.Moreover, we provide an FPGA implementation of the proposed DPBS-CEM whose deep pipelined architecture can be realized by using HLS.
According to the experimental results presented in this paper, the target detection accuracy of the proposed DPBS-CEM algorithm on two data sets are nearly the same.Compared to SBS-CEM, it has the same or even better detection accuracy.Regarding the processing speed performance, DPBS-CEM gained about 7.3 times speedup than that of SBS-CEM.It is worth noting that the proposed architecture of DPBS-CEM can also gain benefits in terms of scalability, portability, and flexibility with the help of HLS.This is particularly suitable for the real-time hyperspectral target detection applications on satellite.

Conclusions
In this paper, an optimized algorithm , referred to as DPBS-CEM for hyperspectral target detection, is proposed.A deep pipelined architecture of DPBS-CEM on FPGA is developed by using HLS as well.The experimental results show that the proposed FPGA implementation of DPBS-CEM has an extraordinary performance improvement in terms of data throughput without compromising for detection accuracy.Under the same test conditions, the detection speed of our proposed DPBS-CEM is about 7.3 times faster than that of SBS-CEM.

Figure 3 .Figure 4 .
Figure 3. (a) Hardware structure and (b) C/C++ code in HLS of Block A. (v represents the pixel number, l represents the row number of the matrix S −1 , and m represents the column number of the matrix S −1 ).

Figure 7 .
Figure 7. (a) Hardware structure and (b) C/C++ code in HLS of Block E.

2 Figure 8 .
Figure 8. Timing diagram of the process of updating inverse matrix.

Figure 9 .
Figure 9. Sample code used for implementing the data flow control of a task in HLS (The omitted lines of code are the specific computations of each stage described in Section 4.2.1).

Figure 10 .
Figure 10.Diagram of development with multiple tools.

Figure 12 .
Figure 12.(a) HyMap reflectance image of Cook City in Montana, USA, and locations of the real targets; (b) Enlarged figure of red box area; (c) Spectral signatures of four targets.

Figure 13
Figure 13 shows five detection maps produced by DPBS-CEM using the five-panel signatures A, B, C, K, and M in Figure 11c as the desired target signatures.The two-dimensional (2-D) results of real-time detection of target A illustrated in Figure 14.The experimental results show that all the AUC values of five desired targets detected by DPBS-CEM are one, indicating that the detection results are extremely satisfactory.

Figure 13 .Figure 14 .
Figure 13.Detection maps of DPBS-CEM using A, B, C, K and M as desired target signature.

Table 2 .
Simulated 20 mixed panel pixels in the third column.

Table 3 .
Abundance fractions of subpixel panels in the fourth and fifth columns.

Table 4 .
The characteristics of targets in the real scene of HyMap.Fabric type Red cotton Yellow nylon Blue cotton Blue cotton Red nylon Red nylon

Table 5 .
AUC obtained by different algorithms for the targets.

Table 6 .
Processing time measured for DPBS-CEM methods in MATLAB, C++, and FPGA implementations.

Table 7 .
Comparison of resource utilization for the FPGA implementations of SBS-CEM and DPBS-CEM.

Table 8 .
Corresponding AUC values with different intermediate data accuracy of algorithm (we set F1 in HyMap image as the desired target).

Table 9 .
Comparison of data processing speed for the FPGA implementations of SBS-CEM and DPBS-CEM.