An FPGA Accelerator for Real-Time Lossy Compression of Hyperspectral Images

Hyperspectral images offer great possibilities for remote studies, but can be difficult to manage due to their size. Compression helps with storage and transmission, and many efforts have been made towards standardizing compression algorithms, especially in the lossless and near-lossless domains. For long term storage, lossy compression is also of interest, but its complexity has kept it away from real-time performance. In this paper, JYPEC, a lossy hyperspectral compression algorithm that combines PCA and JPEG2000, is accelerated using an FPGA. A tier 1 coder (a key step and the most time-consuming in JPEG2000 compression) was implemented in a heavily pipelined fashion. Results showed a performance comparable to that of existing 0.18 μm CMOS implementations, all while keeping a small footprint on FPGA resources. This enabled the acceleration of the most complex step of JYPEC, bringing the total execution time below the real-time constraint.


Introduction
Remote sensing covers a broad range of techniques used to perform a variety of analyses remotely, without close-up interaction with the subjects of interest. Hyperspectral imaging is one such technique, and it has been growing steadily since its inception.
It extends the concept of remote imaging by capturing information related not only to the visible part of the spectrum, but also in wavelengths that the human eye cannot see. A typical hyperspectral image will have from tens to hundreds of samples per pixel [1,2], each recording information of the light perceived at a specific wavelength.
The data collected at one wavelength are grouped in bands that span the whole image. The combination of multiple bands creates a spectral signature per pixel, providing information on a scale that helps with military applications such as target detection [3,4] and terrain trafficability [5], mineral identification [6,7], ground and water studies [8], vegetation and crop control [9,10], and many more.
This amount of information is also the bottleneck of hyperspectral imaging: storage and transmission are often limited in remote scenarios, and optimizing them is a must for uninterrupted image capture. Some of the most popular sensors, such as EnMAP [11], reach speeds of 700 Mb/s and work uninterrupted for hours or days on a satellite.
Hyperspectral image compression has been explored in many ways, and has provided great results, especially in the lossless and near-lossless domains, with standards such as CCSDS [12]. In JYPEC, a preprocessing step is added prior to the dimensionality reduction, a variable bit depth is used for each band, and the JPEG2000 coding is optimized for the whole image instead of for progressive decoding afterwards. This improves compression ratios and quality. A general diagram is seen in Figure 1.

Dimensionality Reduction
Even though JYPEC supports multiple dimensionality reduction algorithms, only PCA was used in this approach. It has very low complexity (making it a candidate for real-time operation), and also has very good distortion-ratio performance [20], falling short only when targeting very high qualities at low ratios (against vector quantization PCA (VQPCA)).
First, a preprocessing step selects random pixels in the image. These pixels are processed by PCA, generating a set of vectors that indicate the directions of maximum variance within the set. Since the selection is random and the sample is large owing to the size of hyperspectral images, its variance closely matches the total variance even when only 1% of the pixels are used [39]. This process generates a projection matrix to reduce the data size, as well as a recovery matrix to go back from the reduced data to the original size. Each band of the reduced image is compressed by JPEG2000.
The process can be undone by first uncompressing each reduced band, and then using the recovery matrix to project back into the original space.
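The reduction and recovery steps can be sketched in a few lines of Python. This is an illustrative model only, assuming numpy; the sampling fraction and reduced dimension are arbitrary, and JYPEC's actual preprocessing and per-band bit-depth handling are omitted:

```python
import numpy as np

def pca_reduce(image, k, sample_frac=0.01, rng=None):
    """Illustrative PCA reduction: image is (pixels, bands)."""
    rng = np.random.default_rng(rng)
    n = image.shape[0]
    idx = rng.choice(n, max(1, int(n * sample_frac)), replace=False)
    sample = image[idx]
    mean = sample.mean(axis=0)
    # Eigenvectors of the sample covariance give the directions of max variance
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    projection = vt[:k]                      # (k, bands): reduces the data
    recovery = projection.T                  # (bands, k): back to original space
    reduced = (image - mean) @ projection.T
    return reduced, recovery, mean

def pca_recover(reduced, recovery, mean):
    """Undo the reduction with the recovery matrix."""
    return reduced @ recovery.T + mean
```

With `k` equal to the number of bands the projection is a full orthonormal basis and recovery is exact; smaller `k` trades reconstruction error for size.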

The JPEG2000 Algorithm
The JPEG2000 standard is a multi-step algorithm which takes advantage of different image characteristics for compression. As well as being used for its main purpose of image compression, it has also found applications in compressing electroencephalography [40], video [41], and hyperspectral images [20], among others.
It follows a series of simple steps (Figure 2) to compress an image:
Figure 2. Diagram of the full JPEG2000 algorithm. In green, steps that deal with the whole image at once. In blue, steps that deal with small blocks of the image.
• A color transformation is done per pixel, converting the input color space (usually RGB) into a luminance (brightness) and chrominance (color) model, since human vision is more sensitive to brightness than color. The color channels can be down-sampled with no perceivable loss in quality, reducing a large chunk of the data bits.
• Every channel is then subjected to a wavelet transform [42]. A wavelet transform consists of a high-pass and a low-pass [43] filter that are applied horizontally and vertically to all rows and columns, respectively. This can be done in a reversible (lossless) or irreversible (lossy) way; in either case, the result is a partitioned channel in which different zones present patterns that can be compressed to a higher degree than the original data.

• After the wavelet transform, the resulting values are quantized to integer values; some information is lost when the lossy wavelet transform is used.
• Finally, the values are encoded. The image is split into blocks of up to 4096 samples; each one is encoded individually, thereby exploiting local redundancies and the patterns left by the wavelet transform.
The color transform is not needed for hyperspectral compression, since the spectral dimension is already decorrelated by the dimensionality reduction step. Only the wavelet transform and encoding steps are performed.
The wavelet transform is fast compared to encoding, which takes up to 70% of the total execution time of JPEG2000 [24,25]. Within JYPEC, it also dominates execution time [39]. This paper focuses on an FPGA implementation that greatly improves encoding, bringing the full algorithm execution time down as much as possible by accelerating the bottleneck.
In the following subsections, the encoder algorithm is detailed to give a better understanding of the implemented accelerator later.

Encoding
Encoding is done in a lossless way over blocks of up to 4096 samples (usually 64 × 64 squares) with a depth of 16 bits (15 magnitude + 1 sign). The encoding technique used in JPEG2000 is called embedded block coding with optimal truncation (EBCOT) [44]. Two different coders lie within EBCOT: the tier 1 and tier 2 coders. The tier 1 coder compresses the original block. The resulting stream has the prefix property: any prefix of that stream, once decoded, gives an approximation of the block. Longer prefixes provide better reconstruction accuracy.
The tier 2 coder splits the output streams from each block into sections, with each section refining the information decoded by the previous one. Sections from different blocks are interleaved by first storing the ones which better approximate the original image. This technique extends the concept of the prefix property from blocks to the full image. This part of the coder is not used here, since JYPEC targets full image compression and not progressive decoding.

Tier 1 Coder
Two phases lie within. First, the so-called "bit plane coder" (BPC) pairs each bit with a context, creating context-data (CxD) pairs. The distribution of all bits paired with the same context is highly skewed, making coding more efficient.
Second, CxD pairs are processed by the MQ-coder (it belongs to the family of arithmetic coders), generating the compressed output stream. The more skewed the input distributions are, the fewer bits the MQ-coder will output.
The BPC works by scanning the different bit planes of the block. First, the most significant bit of each sample is coded; then the second-most significant; and so on. This is the trick that later allows for progressive decoding.
The sign bit receives a special treatment, and is only coded when needed (i.e., when a sample is known to be nonzero, the sign bit for that sample is coded, but not before). This avoids coding sign bits for samples that are zero.
To exploit redundancies, bits are scanned in a zig-zag pattern in up to three passes per bit plane p (Figure 3). To keep track of what bit is scanned in what pass, a "significance" value σ[j] is kept per sample v[j], where j is the 2D position within the block/bit plane. This value indicates whether a sample is significant (i.e., at least one of its already coded bits is one) or insignificant (if all bits have been zero up until the current pass in previous bit planes). A significant sample is either positive or negative depending on its sign bit.
Three distinct passes exist: First, a significance propagation pass in which bits from samples that are expected to turn significant in the current plane are coded. Second, a refinement pass that codes bits from all samples that are already significant. Finally, a cleanup pass that codes the remaining bits.
The cleanup pass will mostly code zeros and the significance pass will mostly code ones, while the refinement pass is more random. These skewed distributions are the key for compression. To increase efficiency, each bit is paired with a context based on the significance state of neighboring samples. The context is used to predict the value of the bit, and if right, can save even more space in the compressed stream.
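The pass-selection rule can be sketched with a toy Python model. This is a deliberate simplification: a flat list of samples with a 1-D neighborhood stands in for the standard's 2-D zig-zag scan, and context modeling and run-length coding are omitted:

```python
def classify_bits(samples, bits=4):
    """Toy pass assignment: for each bit plane (MSB first), label every
    bit as coded in the significance, refinement, or cleanup pass."""
    n = len(samples)
    sigma = [False] * n                  # significance state per sample
    passes = []
    for p in range(bits - 1, -1, -1):    # most significant plane first
        for j in range(n):
            bit = (abs(samples[j]) >> p) & 1
            # 1-D stand-in for the standard's 2-D neighborhood
            has_sig_neighbor = any(sigma[k] for k in (j - 1, j + 1)
                                   if 0 <= k < n)
            if sigma[j]:
                label = "refinement"     # sample already significant
            elif has_sig_neighbor:
                label = "significance"   # expected to turn significant
            else:
                label = "cleanup"        # everything else
            passes.append((p, j, bit, label))
            if bit and not sigma[j]:
                sigma[j] = True          # sample turns significant
    return passes
```

Running it on a small list shows the expected behavior: the first one-bit of a sample appears in a cleanup or significance pass, and all later bits of that sample fall in refinement passes.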
The MQ-coder starts by mapping each context to two values:
• A bit x, which is the current prediction for the given context.
• A state, which determines the probability estimate p̂ used for that context.
The basic idea of the MQ-coder is that of arithmetic coders: the input data will be compressed as a subinterval I ⊂ [0, 1). Starting with the interval [0, 1), the data are subdivided with each CxD pair.
Given the probability p̂, the interval is split into two subintervals whose lengths are proportional to p̂ and 1 − p̂. If the predictive model is right, and the probability p̂ is high, I_1 (the bigger subinterval) will be kept. The bigger the final interval is, the fewer bits are needed to represent it.
In practice, infinite precision cannot be achieved, so 27 and 16 bits are allocated for c and a, respectively, as per the JPEG2000 standard, calling them registers C and A.
Since p̂ is fixed for a given state, its change is accomplished with two transition functions: most probable symbol (MPS) and least probable symbol (LPS). The MPS or LPS function is used depending on whether the prediction turns out to be right or not, and changes the state to one where p̂ is expected to better approximate the current bit input distribution. The prediction is also inverted when certain states are reached, doubling the possible states.
p̂ is used to update the interval [C, C + A) by either adding p̂ to C or subtracting it from A. When both ends get close together, a left shift is performed to keep the length of the interval longer than the maximum value of p̂.
A is kept under control by periodically resetting it, and C eventually overflows. The overflow is saved and forms the compressed output stream. To finish compression, C is flushed out. The process can then be reversed, and the original inputs recreated for decompression.
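The interval mechanics can be illustrated with a heavily simplified, hypothetical encoder in Python. It uses a fixed probability estimate, no context modeling, and ignores carry propagation into already emitted bytes and 0xFF stuffing; register widths follow the spirit, not the letter, of the standard:

```python
def encode(bits, qe=0x1000):
    """Toy MQ-style arithmetic encoder. A tracks the interval length and C
    its lower end; qe is a fixed probability estimate for the less likely
    symbol (here hardwired as bit 1)."""
    A, C = 0x8000, 0
    shifts = 0
    stream = []
    for b in bits:
        if b == 0:                 # predicted symbol: keep larger subinterval
            A -= qe
        else:                      # unlikely symbol: keep the smaller one
            C += A - qe
            A = qe
        while A < 0x8000:          # renormalize: keep A in [0x8000, 0x10000)
            A <<= 1
            C <<= 1
            shifts += 1
            if shifts % 8 == 0:    # a full byte has left the 16-bit window
                stream.append((C >> 16) & 0xFF)
                C &= 0xFFFF
    return stream, shifts
```

A run of correctly predicted symbols shrinks A slowly and renormalizes rarely, while a run of mispredictions forces several shifts per symbol, which is exactly why skewed CxD distributions compress well.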

Existing Implementations
The objective of this study was to bring lossy compression to real time. For that, JYPEC was chosen as the target algorithm, and a tier 1 coder implementation of JPEG2000 (a step within JYPEC) is presented. It accelerates both the BPC and the MQ-coder by merging them into a single high-speed pipeline. To show that this effort is justified, a timing breakdown of the compression of different images with JYPEC is presented; it is clear that coding is the slowest part of the algorithm, and that any effort to speed up the algorithm should be dedicated to it. Many implementations of the full tier 1 coder have been proposed, but more efforts have been focused on accelerating the MQ-coder alone. In the following paragraphs, the literature is explored in this regard.

Bit Plane Coder
In [45], the authors designed a basic BPC which goes over the full block following the zig-zag pattern. Its controller is a 24-state machine which goes over the different passes bit by bit, producing at most one CxD pair per cycle.
Improving on that, in [25] the authors introduced the concept of skipping. They loaded full columns with four bits at once, and marked them with flags when they were no longer required in a certain pass. This way, the BPC could skip them when not needed, saving clock cycles. They also introduced flags for groups of columns and even full passes, allowing one to skip big chunks of idle cycles in some cases. Finally, since they loaded the full column at once, they also checked which bits needed to be encoded, and skipped the others within the column. In the end, savings of around 60% of clock cycles were achieved.
A different approach was reported in [27]. Instead of skipping, the authors processed whole columns at once, producing up to 10 CxD pairs per cycle. CxD generation is not independent, so a series of cascading dependencies were taken into account. Despite the added complexity, this idea doubles the throughput of the sample-skipping technique without the extra memory. In [46], the authors went even further by allowing multiple planes to be coded simultaneously by using non-default coding options. Despite the small loss in compression efficiency, the throughput grew by a factor of 8 in CMOS 0.35 µm technology when using gray-scale images. This, however, deviates from the standard implementation, since multiple MQ-coders would have had to be used for a single block in order to keep up with the BPC.

MQ-Coder
The MQ-coder has seen more optimizations [47] than the BPC, since the MQ-coder has traditionally been the bottleneck.
The MQ-coder receives and processes CxD pairs from the bit plane coder, generating a compressed bit stream which can be further processed by the tier 2 coder. It is important to note that, by design, the CxD pairs are processed serially, so no parallelization is possible at this stage.
Two main approaches have been used to accelerate its execution:
• Pipelining: As with many other designs, pipelining can be the key to improving performance. Distinct stages have been identified (mainly separating the update of A and C, and the output of coded bytes).
• Dual symbol processing: Some bit plane coders can produce two CxD pairs in one cycle. This has motivated the design of MQ-coders capable of processing two pairs at the same time. Since this cannot be done in parallel, these MQ-coders incorporate two cascaded processing units.
In [31], a three-stage pipelined MQ-coder is proposed. It performs all arithmetic operations in the first stage, A and C register shifts in the second, and byte emission in the third. The drawback of this approach was that the second stage could stall the first if the number of shifts was greater than one, since the authors did not use barrel shifters. In the end, they worked around this limitation with two clock domains, increasing the speed of the shifting stage so that stalling occurred only 0.64% of the time, and achieving a performance of around 145.9 MS/s on a Stratix EP1S10B672C6 board.
In [30], an implementation of the full tier 1 coder with no pipelining nor dual symbol processing is presented for the Virtex II Pro FG 456 board. They note that, at 112 MS/s, the arithmetic coder is the bottleneck, with the (context, bit) generation being five times faster. Speed was later increased by having up to four simultaneous instances of the tier 1 coder working in parallel.
In [48], two techniques were used to create a pipelined design that works at 413 MS/s: They first employed "traced pipelining" which consists of designing a pipeline for the most likely cases, and processing unlikely, more costly cases in a separate unit that stalls the pipeline if necessary. The second technique is based on eliminating cascading shifts (such as the ones from [29]) by looking ahead at the number of necessary shifts and performing them all at once. All of this is made possible by working on 0.18 µm CMOS technology.
Both pipelining and dual processing are employed in the approach from [32], in which improvements are made to the multiple approaches from [26]. Dual processing is solved by having four different units processing all four possible scenarios (taking the lower or upper interval two times in any combination). Pipelining was used to separate the A register update, C register update, and byte output procedures, achieving in the end performances of 96.6 MS/s on FPGA and 440.2 MS/s on 0.18 µm CMOS technology.
A different pipelined approach is presented in [49]. The A register update, C register update, and byte out procedures are kept in three distinct stages, and two more are added at the beginning by using two memory modules. The first one stores contextual information (namely state and predicted symbol) and the second one is a ROM which outputs state change information. If two consecutive contexts are equal, the second memory will be read with the updated state that is sent to the first one. This splits reading into a two-step process that accelerates the pipeline. The other notable technique is that shifting is limited to seven bits per cycle, reducing the critical path at the cost of one cycle stall in the unlikely case that the shift amount is greater than seven. In the end, a speed of 192.77 MS/s was achieved on 0.18 µm technology.

Implementation
The presented design includes both the BPC and MQ-coder, chained together to form the full tier 1 coder. The basic structure of the tier 1 coder is shown in Figure 5. The bit plane coder receives data from memory and generates CxD pairs. These are coded by the MQ-coder and the output stream is generated.

BPC
Internally, the BPC generates CxD pairs for four samples at a time (a full column of the zig-zag pattern), following the techniques of [27]. To generate the CxD pairs for a sample, flags from a 3 × 3 neighborhood around the current sample are taken into account. When dealing with columns, this neighborhood grows to 6 × 3. This is seen in Figure 6.
The main problem comes from the fact that the context generation modules are slow. They are based on lookup tables that require many LUT levels during FPGA implementation. They also need to be cascaded to generate the output significance that is required by the following bit strip. To avoid delays, a special module that predicts the next significance state in a faster way is used. It is shown in Figure 7.
The other modules are straightforward, with context generation being a ROM that outputs the context associated with the neighborhood by simple lookup. The cleanup predictor does a job similar to the significance predictor by looking at the first bit that is nonzero, and setting it and the following ones as significant if needed.
Up to 10 CxD pairs are generated per cycle. A serializer is used to order them sequentially before sending them to the MQ-coder (see Figure 8).
With a big enough vector queue, the problem of running into idle cycles when few CxD pairs are generated is avoided (e.g., in the first refinement pass or the last significance and cleanup passes), since the buffer keeps feeding the MQ-coder. The serializer is designed to output one pair per cycle as long as the vector queue is not empty, so it can feed the MQ-coder without forcing a stall.
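The buffering behavior can be illustrated with a small cycle-level simulation in Python. This is a toy model: vectors are simply dropped if they do not fit in the queue, whereas the real design would stall the BPC-core instead, and the queue depth is an arbitrary choice:

```python
from collections import deque

def simulate(vectors_per_cycle, queue_depth=64):
    """Cycle-level toy: each cycle the BPC may push one vector of CxD
    pairs (possibly empty) and the serializer pops one pair if available.
    Returns the number of idle cycles seen by the MQ-coder."""
    queue = deque()          # flattened CxD pairs waiting to be serialized
    idle = 0
    for vec in vectors_per_cycle:
        if len(queue) + len(vec) <= queue_depth:
            queue.extend(vec)            # vector accepted into the queue
        if queue:
            queue.popleft()              # one pair per cycle to the MQ-coder
        else:
            idle += 1                    # MQ-coder starves this cycle
    return idle
```

A bursty producer that averages at least one pair per cycle keeps the consumer fully busy, which is the point of the vector queue.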

MQ-Coder
The extensive work seen in Section 3.2 can be summed up in two different approaches: pipelining and dual-symbol processing. Since the target platform is an FPGA, pipelining is ideal to avoid a long critical path, which on reconfigurable hardware carries a higher timing penalty than on fixed silicon. It will later be seen that a bottleneck arises in CxD generation, so this area-efficient approach is the right choice, since it can keep up with the previous stages.
The only step common to all pipelining approaches is separating the byte-out procedure into a last stage. Register updates are often split into two stages, treating A and C updates independently. Memory access is also split in some implementations. In short, the algorithm has no obvious stage boundaries, given the strong dependencies between its steps. In fact, most designs implement the most likely execution path, stalling in the event of an unexpected input.
Most implementations offer a theoretical throughput that varies depending on the data being compressed. The goal of this paper was to design an implementation which can consistently process CxD pairs at a certain speed. By pipelining the design and inserting queues between stages, any potential stall in one stage is absorbed by the queues and does not affect the others. The result can be seen in Figure 9. Four main stages can be distinguished.

First and Second Stages
First, the context memory is accessed. This memory outputs "context information", which has the prediction for that context, as well as the MPS and LPS transitions, XOR bit, and probability.
This memory is written with the context from the third stage, so care is taken whenever the same context is encountered twice within a three-context window:
• If the same context is found in cycles n and n + 2, a write-read cycle is skipped and the input data are directly multiplexed to the memory's output.
• To avoid stalling when the same context appears in cycles n and n + 1, a second memory is present in the second stage, which outputs state information. The state is decided from the MPS and LPS transitions and used to read this second memory. In this case, the context memory will be updated in the next cycle; but since those values are required in the current cycle, a mux bypasses them from the second memory, avoiding a stall.
All in all, reading is segmented in two different stages, without the stalling that sometimes could happen in implementations such as [49].
The second stage is the most complex one, where the critical path lies. First, the context information to be used is decided, which comes from the context memory unless the same context appeared twice in a row, in which case it is fetched from the state memory and previous prediction.
State and prediction changes are fed to the state and context memories. The prediction is adjusted, and the A register is updated. This can happen in one of four ways: either the A register is not shifted, it is shifted once or twice, or the contents are assigned from memory. In the last case, the number of shifts is calculated in advance.
The number of shifts s, the value p̂ to add to the C register, and the hit flag h (indicating that C needs to be updated with p̂) are sent to the next stage.
Both stages are seen in Figure 10.

Third Stage
The fourth stage can stall the pipeline. In order to minimize that risk, the inputs from multiple cycles are combined into just one update, so as to send the minimum number of updates ahead. This can be done under two scenarios:

• If h_i = h_{i+1} = 0 and s_i + s_{i+1} ≤ 15, both updates can be merged by setting h = 0, s = s_i + s_{i+1}, p̂ = 0. This merges two consecutive shifts that are under the maximum shift length of 15.
• If s_i = 0 and p̂_i + p̂_{i+1} ≤ 2^16 − 1, both updates can be merged with h = h_i ⊕ h_{i+1}, s = s_{i+1}, p̂ = p̂_i + p̂_{i+1}. This is because the addition of p̂ to C happens before the shifts. Both values can be added at once because their sum stays below 2^16.
Both of these merging techniques can be applied recursively.

Fourth Stage
The C register is updated by adding p̂ and shifting it. Whenever it fills up, a byte is output. When the register overflows or a special byte 0xFF is output, padding needs to be added to avoid the special markers used to indicate the end of stream. Up to three bytes might be output per update, and the control logic for all possibilities would make this stage too slow.
In order to avoid that problem, shifting is done byte by byte. If the shift amount is greater than one byte, the pipeline stalls for one cycle. Studies have shown [48] this problem to be negligible (<1% of the time). This stage is seen in Figure 11.
Figure 11. MQ-coder last stage. It interfaces with two FIFOs, reading updates from the previous stage and making sure space is available at the output to send bytes out.

The Full Tier 1 Coder
By chaining together both the BPC and MQ-coder, the tier 1 coder for JPEG2000 is formed. The basic segmentation is two stages for the BPC coder and four for the MQ-coder. Joining the different stages are multiple FIFOs. These help maintain a constant flow of data:

• When the MQ-coder stage IV stalls (because it has to output more than one byte), the fuser queue can hold updates until a fused one is sent (effectively canceling out the stalling).
• When the BPC-core is producing many CxD vectors, the vector queue avoids a stall of the BPC-core.
• Conversely, when the BPC-core does not produce vectors, the queue serves as a buffer to keep the next stages busy.
The full pipeline, taking into account the different queues, has a total of 15 stages, as seen in Figure 12. Despite the number of stages, this has negligible impact on the final speed, since the coding of a full 64 × 64 × 16 block takes a minimum of 1024 · 3 · 14 + 1024 = 44,032 cycles, so filling the pipeline takes at most 15/44,032 · 100 ≈ 0.03% of the total cycles.
Figure 12. Detailed pipeline of the tier 1 coder. In dotted orange, the separation between stages. Each FIFO introduces two stages (read/write).

Results
The hardware architecture described in Section 4 has been implemented in VHDL for the specification of the tier 1 coder. Moreover, the 2020 Xilinx Vivado Design Suite environment was used to specify the complete system. The full system has been implemented on a VC709 board, a reconfigurable board with a single Virtex-7 XC7VX690T, two DDR3 SDRAM DIMM slots holding up to 4 GB each, an RS232 port, and some additional components not used by this implementation. The HDL model has been verified via simulation and physical prototyping using a memory controller for input/output. Table 1 shows the frequency and FPGA slice occupancy for the full tier 1 coder and its modules and submodules. More details are given in the following list:
• The BPC-core processes a full block of 65,536 bits in 44,032 cycles, working at a speed of 380 Mb/s.
All in all, the full tier 1 coder is able to work at 255 MHz. At that speed, the bottleneck is the number of CxD pairs processed by the MQ-coder at 255 MCxD/s. By studying how many CxD pairs are produced, the input speed is calculated:
• The minimum number of updates for a 64 × 64 × 16 block occurs when it is all zero, with successful run-length coding throughout the block. In this case, a total of 15 × 1024 = 15,360 updates are generated; that is, 0.234 per bit.
• Conversely, an upper limit for the number of updates is given by a cleanup pass with run-length interruptions at every position, followed by 14 refinement passes. In this case, the number of updates is 1024 × 10 + 4096 × 14 = 67,584 updates; exactly 1.03125 per bit.
Thus, the input rate needed to generate 255 MS/s would range from 247.3 Mb/s to 1.01 Gb/s. However, the BPC-core is only capable of processing 380 Mb/s, so in practice this range is limited to 247.3–380 Mb/s. The exact value within this range depends on the redundancy of the data. In [31], five images of size 512 × 512 × 10 were compressed, and the average pairs-per-bit rate was 0.56. This means that, on average, the input rate needed for 255 MS/s would be 455 Mb/s. Thus it is safe to say that the tier 1 coder will consistently perform at its 380 Mb/s limit.
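These bounds are simple arithmetic and can be checked in a few lines of Python (figures as given above; the 0.56 pairs-per-bit average is the one reported in [31]):

```python
BLOCK_BITS = 64 * 64 * 16              # 65,536 bits per code block
MIN_UPDATES = 15 * 1024                # all-zero block: run-length everywhere
MAX_UPDATES = 1024 * 10 + 4096 * 14    # interrupted cleanup + 14 refinements

MQ_SPEED = 255e6                       # CxD pairs per second at 255 MHz

# Input bit rates that would saturate the MQ-coder
low_input = MQ_SPEED / (MAX_UPDATES / BLOCK_BITS)    # ~247.3 Mb/s
high_input = MQ_SPEED / (MIN_UPDATES / BLOCK_BITS)   # upper bound of the range
avg_input = MQ_SPEED / 0.56                          # ~455 Mb/s on average
```

Since even the average-case saturation rate (455 Mb/s) exceeds the 380 Mb/s the BPC-core can deliver, the BPC-core, not the MQ-coder, sets the sustained throughput.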

Comparison
A comparison with other implementations can be seen in Table 2. Only the best implementations found in the literature have been taken into account.
As seen, this implementation of the BPC works more than four times faster than other FPGA implementations, surpassing even ASIC designs in throughput.
With regard to the MQ-coder, this design doubles the performance of previous FPGA designs, only falling short of 0.18 µm CMOS designs. It is expected that porting the design to that technology would make it faster than the competition, since other implementations have experienced a 4× speedup when doing so [32].
* Although not specified, the architecture is similar to the one presented here, so a similar relationship between frequency and speed is expected.
** Requires external memory for data and/or internal variables.

Acceleration of JYPEC
To see its impact on hyperspectral image compression under JYPEC, six images from two libraries have been compressed by JYPEC with and without acceleration: four from the Spectir [55] library and two from the CCSDS 123 test data set [56]. The image characteristics are seen in Table 3 and a preview in Figure 13. The results were acquired on a DELL XPS 13 9360 computer, with an i7-7500U processor with a thermal design power (TDP) of 15 W, 8 GB of RAM running at 1866 MHz, and 256 GB of SSD PCIe storage. For the accelerated version, the time of coding in the processor was replaced with the time of coding in the FPGA itself. Memory transfer times were not taken into account, because the PCIe interface of the VC709 board works at 25 GB/s and the typical image size is 500 MB, so an image is transferred in 20 ms, not impacting results.
The speedup attained is shown in Figure 14. The average speedup obtained was 3.6, ranging from 1.6 for the DHO image to 7.5 for the CRW image. The code for the software JYPEC implementation can be accessed in [57], and the accelerator code in [58].

Conclusions
JYPEC is a complex algorithm that demands high-performing hardware for fast execution in real time. The most costly part is the tier 1 coder within JPEG2000, since code with erratic branching is very hard to optimize for traditional processors.
Its very simple arithmetic and logic operations, however, make this part of the algorithm ideal for execution on an FPGA. A very fast architecture for the full tier 1 coder within JPEG2000 has been developed based on two main ideas:
• First, the bit plane coder concurrently processes bits in groups of four, greatly accelerating execution. A system of FIFOs and buffers ensures that a constant stream of CxD pairs reaches the MQ-coder.
• Second, the coder itself is highly optimized in a pipelined fashion. Stalling of the pipeline, a problem of previous designs, is avoided by fusing together multiple updates when possible.
The presented design doubles the speed of any previous design on FPGA, coming close in performance to 0.18 µm CMOS technology in single-core tests.
In the context of hyperspectral imaging, it brings complex lossy compression to real-time performance under the AVIRIS-ng sensor threshold (30-72 MS/s for a total of 491.52 Mb/s). This allows very high data rates to be reduced for long-term storage on the fly, while keeping high quality for later analyses.