A Parallel FPGA Implementation of the CCSDS-123 Compression Algorithm

Abstract: Satellite onboard processing for hyperspectral imaging applications is characterized by large data sets, limited processing resources and limited bandwidth of communication links. The CCSDS-123 algorithm is a specialized compression standard developed for space-related applications. In this paper, a parallel FPGA implementation of the CCSDS-123 compression algorithm is presented. The proposed design can compress any number of samples in parallel allowed by resource and I/O bandwidth constraints. The CCSDS-123 processing core has been placed on a Zynq-7035 SoC and verified against the existing reference software. The estimated power use scales approximately linearly with the number of samples processed in parallel. Finally, the proposed implementation outperforms state-of-the-art implementations in terms of both throughput and power.


Introduction
In recent years, space development has moved towards small-satellite (SmallSat) missions, which are characterized by capable low-cost platforms and increased budget and schedule flexibility. Space-related applications, such as synthetic aperture radar (SAR), multispectral and hyperspectral imaging (HSI), require critical data processing to be performed onboard in order to preserve transmission bandwidth. In this respect, compression algorithms are commonly used as a final step in onboard processing pipelines to reduce memory access and limit data transfer to Earth. To fulfill real-time data processing requirements, hybrid processing systems with reconfigurable hardware (FPGAs) have become the standard choice in small-satellite missions. The expansion of logic resources in current FPGAs allows execution of complex algorithmic tasks in parallel, and the trend for CubeSats and other SmallSat single-board computers is to use common SoC devices with commercial FPGAs due to their superior performance in terms of power, speed and resources compared to radiation-hardened FPGAs [1].
Hyperspectral and multispectral imaging have both been widely used in remote sensing Earth observation missions in recent decades. Unlike multispectral sensors, such as Landsat, MSG and MODIS [2], with a fairly limited number of discrete spectral bands, hyperspectral sensors record a very large number of narrow spectral bands. Airborne hyperspectral sensors such as the Compact Airborne Spectrographic Imager (CASI), the Airborne Visible/InfraRed Imaging Spectrometer (AVIRIS) [3], the Infrared Atmospheric Sounding Interferometer (IASI) [4] and the Hyperspectral Imager for the Coastal Ocean (HICO) [5,6] have driven an expansion of hyperspectral research into applications such as environmental monitoring, coastal ecosystems, geology and land cover. A hyperspectral imager has recently been deployed on an intelligent nano-satellite [7], where a key feature is intensive onboard data processing, including operations such as comparisons of images in subsequent orbits. However, for a mission with an HSI payload to fulfill its objectives, compression of the acquired data for downlink is required in addition to smart onboard processing.
The Consultative Committee for Space Data Systems (CCSDS) has developed image compression algorithms [8-11] specifically designed for space data systems. In particular, the CCSDS-123 compression standard [10,11] is an efficient prediction-based algorithm characterized by low complexity and is thus suitable for real-time hardware implementation. In fact, in recent years several FPGA implementations of the CCSDS-123 standard have been presented in the literature [12-19]. Keymeulen et al. [12] propose an on-the-fly implementation in BIP sample ordering. In the implementation proposed by Santos et al. [13], the focus is on low complexity and low memory footprint. The chosen BSQ sample ordering requires only one weight vector and one accumulator to be stored. However, the repeated computation of local differences decreases the input bandwidth efficiency. This approach requires either a non-sequential memory access pattern, with a potential reduction of streaming efficiency, or that the data is arranged in memory in the desired streaming order. The serial CCSDS-123 implementation with BIP ordering proposed by Theodorou et al. [14] relies on external memory to buffer samples coming from the image sensor such that the current, N and NE neighboring samples are streamed in parallel, greatly reducing on-chip memory requirements. The downside is, however, the lack of support for on-the-fly compression. Báscones et al. [15] propose an implementation with BIP sample ordering characterized by the ability to perform compression without relying on external memory. This is achieved by queuing incoming samples in internal FIFOs, resulting in a linear dependence of memory usage on the product of the width and depth of the HSI cube. A parallel CCSDS-123 implementation proposed by Báscones et al. [16] consists of several instances of the CCSDS-123 core that share local differences. Other than sharing local differences, the cores operate independently by processing samples from a fixed subset of bands.
In this paper, an efficient parallel FPGA implementation of the CCSDS-123 compression algorithm is proposed. The high throughput is achieved by using several optimization techniques for data routing between parallel processing pipelines and for efficient parallel packing. In the proposed solution, the number of samples processed in parallel is constrained only by the logic resources and I/O bandwidth of the chosen technology.
The paper is structured as follows: Section 2 presents an overview of the CCSDS-123 standard. The proposed parallel hardware implementation is described in Section 3. The influence of the number of pipelines and the chosen architectural solutions on logic use, timing and power is analyzed in Section 4. Finally, the conclusions are given in Section 5.

Background
The CCSDS-123 standard for lossless data compression is applicable to 3D HSI data cubes produced by multispectral and hyperspectral imagers and sounders, where a 3D HSI data cube is a three-dimensional array (N_x, N_y, N_z). A sample in the HSI cube is specified by coordinates (x, y, z), whereas an HSI pixel is characterized by fixed (x, y) coordinates and consists of N_z components in the spectral domain. The standard supports Band Interleaved (BI) and Band Sequential (BSQ) orderings for scanning the HSI coordinates. Special cases of BI ordering are Band Interleaved by Pixel (BIP) and Band Interleaved by Line (BIL). BSQ ordering traverses the components band by band, i.e., in (z, y, x) order. In BIP ordering, each full pixel is accessed sequentially in (y, x, z) order. In BIL ordering, traversal is performed frame by frame in (y, z, x) order.
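The three scan orders can be illustrated with a short behavioural sketch (Python used purely for illustration; the generator names are ours, not from the standard):

```python
# Sketch of the three CCSDS-123 scan orders for an (Nx, Ny, Nz) cube.
# Each generator yields (x, y, z) sample coordinates in streaming order.

def bip_order(Nx, Ny, Nz):
    # Band Interleaved by Pixel: all Nz components of a pixel, pixel by pixel.
    for y in range(Ny):
        for x in range(Nx):
            for z in range(Nz):
                yield (x, y, z)

def bil_order(Nx, Ny, Nz):
    # Band Interleaved by Line: one image line in every band, line by line.
    for y in range(Ny):
        for z in range(Nz):
            for x in range(Nx):
                yield (x, y, z)

def bsq_order(Nx, Ny, Nz):
    # Band Sequential: the whole spatial image of one band, band by band.
    for z in range(Nz):
        for y in range(Ny):
            for x in range(Nx):
                yield (x, y, z)
```

For BIP, the innermost loop runs over z, which is why a BIP hardware core sees all spectral components of one pixel back to back — the property the parallel lanes in Section 3 exploit.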
The integer samples of the HSI cube are labeled s_{z,y,x} or s_z(t), where t = y·N_x + x. The sample s_{z,y,x} is predicted by computing a local sum σ_{z,y,x} of nearby previously processed samples (s_{z,y,x−1}, s_{z,y−1,x−1}, s_{z,y−1,x}, s_{z,y−1,x+1}) at positions (W, NW, N, NE) with respect to sample s_{z,y,x}. The reduced prediction mode computes, for the previously processed bands k = 1, …, P, the central local differences d_{z−k}(t) = 4s_{z−k}(t) − σ_{z−k}(t) from the samples s_{z−k}(t) and local sums σ_{z−k}(t), respectively. The computed differences are then stored in the local difference vector U_z(t). Predictor parameters such as the number of prediction bands P, the local sum type and the prediction mode significantly impact the overall performance of the CCSDS-123 standard, and the suggested non-normative default values of these parameters provide a reasonable trade-off between performance and complexity [10,20].
The computation of the rounded scaled predicted sample includes the dot product of the weight vector W_z(t) and the local difference vector U_z(t), and a shift of the local sum σ_z(t) by the parameter Ω, which is defined as the bit precision of the weight elements. The scaled predicted sample value s̃_z(t) is a version of the rounded scaled predicted sample clipped to the range [−2^D, 2^D] for signed integers, where D is the dynamic range of the HSI samples. The weights are dynamically updated based on the prediction error e_z(t) = 2s_z(t) − s̃_z(t) by the weight update factor ∆W_z(t), which depends on several user-defined parameters controlling the rate at which the predictor adapts to the image statistics. The scaled predicted sample value s̃_z(t) is re-normalized to the range of the input sample (a D-bit quantity), resulting in the predicted sample ŝ_z(t). Finally, the residual mapping converts the signed prediction residual ∆_z(t) = s_z(t) − ŝ_z(t) into a D-bit unsigned integer, the mapped prediction residual δ_z(t).
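The prediction and mapping steps can be sketched in software as follows. This is a simplified model, not the normative algorithm: the exact rounding, modular arithmetic and first-sample rules of the standard are omitted, and the function names are ours.

```python
def predict_sample(W, U, sigma, s, Omega=19, D=16):
    # Dot product of the weight vector W and local difference vector U,
    # combined with the local sum sigma shifted by the weight precision Omega.
    d_hat = sum(w * u for w, u in zip(W, U))
    # Scaled predicted sample (double resolution, roughly twice the prediction);
    # the standard's rounding and modular reduction are simplified here.
    s_tilde = (d_hat + (sigma << Omega)) >> (Omega + 1)
    s_tilde = max(-(2 ** D), min(2 ** D, s_tilde))  # clip to [-2^D, 2^D]
    s_hat = s_tilde >> 1          # predicted sample value
    delta = s - s_hat             # signed prediction residual
    return s_tilde, s_hat, delta

def map_residual(delta, s_hat, s_tilde, D=16):
    # Map the signed residual to an unsigned D-bit integer: theta is the
    # distance from the prediction to the nearer bound of the sample range.
    smin, smax = -(2 ** (D - 1)), 2 ** (D - 1) - 1
    theta = min(s_hat - smin, smax - s_hat)
    if abs(delta) > theta:
        return abs(delta) + theta
    if delta == 0:
        return 0
    # Even/odd folding steered by the parity of the scaled prediction.
    if (delta > 0) == (s_tilde % 2 == 0):
        return 2 * abs(delta)
    return 2 * abs(delta) - 1
```

The folding in `map_residual` interleaves positive and negative residuals so that small-magnitude residuals of either sign map to small unsigned codes, which is what makes the subsequent entropy coding effective.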
In sample-adaptive encoding, code words are generated based on the average value of the input residuals in each band. The encoder updates an accumulator Σ_z(t) with recent sample values and divides the result by the counter Γ(t), which tracks the number of processed samples.
A code word generator computes the quotient and remainder pair (u_z, r_z) from the division δ_z(t)/2^{k_z(t)}, where the parameter k_z(t) is defined as the largest non-negative integer satisfying an inequality relating the accumulator Σ_z(t) and the counter Γ(t).
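A behavioural sketch of the code word generation is given below. The inequality used for k selection and the escape handling follow our reading of the sample-adaptive coder (the constant 49/2^7 is the standard's rescaling offset); treat the details as illustrative rather than normative, and the names as ours.

```python
def select_k(acc, cnt, D=16):
    # k_z(t): the largest k in [0, D-2] with cnt * 2^k <= acc + floor(49*cnt/2^7),
    # the sample-adaptive coder's inequality relating accumulator and counter.
    threshold = acc + (49 * cnt) // 128
    k = 0
    while k < D - 2 and cnt * (1 << (k + 1)) <= threshold:
        k += 1
    return k

def encode_word(delta, k, U_max=20, D=16):
    # GPO2 code word for a mapped residual: the quotient u = delta >> k in
    # unary (u zeros), a '1' terminator, then the k-bit remainder. When the
    # quotient reaches U_max, an escape of U_max zeros followed by delta in
    # D bits is emitted instead.
    # Returns (bits_as_int, bit_length); leading zeros live in the length.
    u, r = delta >> k, delta & ((1 << k) - 1)
    if u < U_max:
        return (1 << k) | r, u + 1 + k
    return delta, U_max + D
```

The returned bit length is what feeds the packer described later: each pipeline emits one (bits, length) pair per clock cycle, with length at most U_max + D.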

Implementation
The proposed parallel implementation contains N_p pipelines for concurrent processing of multiple samples and shared resources for storing intermediate data. The block diagram of the proposed implementation for N_p = 4 is shown in Figure 1. A number of data samples are streamed into the shared sample delay module in each clock cycle. The samples are rearranged and sent to pipelines, where a pipeline contains a chain of modules performing the local sum and difference computation, prediction, residual mapping and sample-adaptive encoding, as illustrated in Figure 2. The central local differences computed during prediction, the updated weight vector elements and the accumulator values of the sample-adaptive encoder are routed to the central difference store, the weight store and the accumulator store modules, respectively, which are shared between the pipelines. The data packages streamed into the CCSDS-123 core contain N_p samples. A lane is defined as a position of a sample in the input package. Figure 3a,b show the sample placement grids for N_p = 4 lanes in the first 10 clock cycles for N_z = 8 and N_z = 9 bands, respectively. The first sample in each pixel is highlighted. When the number of bands is divisible by the number of pipelines, i.e., N_z mod N_p = 0, lane i contains a fixed subset of bands for each pixel, so that the sample from band z is always streamed in lane i = z mod N_p. For N_z not divisible by N_p, samples from the same band are no longer confined to a specific lane. Instead, samples shift between lanes. After streaming the last sample in a pixel, the input stream can be stalled so that the first sample of the next pixel is in lane i = 0. In this manner, a fixed subset of samples is processed by each pipeline, similarly to the case when N_z is divisible by N_p. The downside of the introduced stalling is reduced throughput and additional logic. To avoid stalling, an interleaved pipeline approach is proposed. In this approach, samples from the same band are processed in different pipelines, requiring the pipelines to share additional information besides local differences. For instance, a sample arriving at the sample delay module in lane i = 0 is also sent to pipeline 2 as the neighbor of a sample arriving in lane i = 2. Furthermore, vector W_0(1) is produced by pipeline 0 when processing s_0(0), but it is then also used by pipeline 1 when processing s_0(1). The advantage of the interleaved approach is a generic implementation with maximized throughput, independent of the parameters N_z and N_p.
In the proposed interleaved approach, data shifting is introduced to move data from different lanes to the corresponding processing pipelines. If a current sample is in lane i, then the sample at distance n from the current sample is in lane (i + n) mod N_p. The distance between samples s_z(t) in lane i and s_z(t + ∆t) is N_z∆t, so the lane of the latter sample is given by shift(i, ∆t) = (i + N_z∆t) mod N_p. In Figure 3b, for example, the lane of sample s_0(5) is computed from its distance N_z∆t = 18 from sample s_0(3) in lane i = 3. Due to the data shifting, the number of clock cycles between samples within the same band is not constant. The number of clock cycles between two samples is equivalent to the number of rows between them in the grid, delay(i, ∆t) = ⌊(i + N_z∆t)/N_p⌋. In the edge case, when sample s_z(t) is in the left-most lane, sample s_{z+1}(t) is streamed in the right-most lane of the next row; the time delay between s_{z+1}(t) and s_z(t) in lane i is therefore ⌊(i + 1)/N_p⌋, i.e., one clock cycle only for i = N_p − 1.
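The lane-routing arithmetic can be sketched as two small functions. These are a reconstruction consistent with the distance rule above and with the instance delay(0, 1) = 15 used later for N_z = 61, N_p = 4; they are not the paper's literal VHDL.

```python
def shift(i, dt, Nz, Np):
    # Lane of the sample dt pixels ahead (same band) of a sample in lane i:
    # the distance is n = Nz*dt samples, and moving n positions forward
    # advances the lane index by n modulo Np.
    return (i + Nz * dt) % Np

def delay(i, dt, Nz, Np):
    # Rows (clock cycles) of the streaming grid between the two samples:
    # moving n = Nz*dt positions forward crosses floor((i + n)/Np) rows.
    return (i + Nz * dt) // Np
```

When N_z mod N_p = 0 the shift is always zero and lanes are fixed; otherwise the lane rotates by N_z mod N_p positions per pixel, which is exactly the effect the sample delay and shared stores must compensate for.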

Pipeline
A pipeline contains a chain of modules implemented as described in previous work [19] on a sequential CCSDS-123 implementation. To accommodate parallel processing, adaptations of the sequential modules are required. This includes several modifications, such as setting FIFO depths and RAM sizes to ⌈N_z/N_p⌉ instead of N_z.

Sample Delay
The sample delay module delays incoming samples so that the current sample and the previously predicted neighboring samples are available at its output. The proposed parallel implementation of the sample delay module is shown in Figure 4. For each lane i, there is a set of FIFOs with depths determined by the delay(i, ∆t) function. The outputs of the FIFOs are then shifted according to the shift(i, ∆t) function, so that the delayed samples are used as neighbors in the (W, NW, N, NE) positions with respect to the samples currently processed by each pipeline. The sample delay operation, described using the streaming grid (lanes, clock cycles), is presented in Figure 5. In the example, the W neighbors (s_1(1), s_0(1), s_8(0), s_7(0)) of samples (s_1(2), s_0(2), s_8(1), s_7(1)) are obtained by delay and shift operations.

Local Differences
The computed local differences are stored in the central difference store, since the differences need to be shared between the pipelines. The local difference vector U_z for each pipeline is assembled as a combination of local differences from lower-indexed pipelines and from the central difference store. The pipeline with the lowest index takes all P differences from the central difference store. An example of local difference routing between pipelines and to/from the central difference store for N_p = 4 and P = 5 is illustrated in Figure 6. Pipelines 0–3 produce local differences d_z(t) to d_{z+3}(t) for input samples s_z(t) to s_{z+3}(t), respectively. Since each pipeline requires the P previous local differences, pipeline 3 requires the differences [d_{z+2}(t), d_{z+1}(t), d_z(t), d_{z−1}(t), d_{z−2}(t)], where [d_{z+2}(t), d_{z+1}(t), d_z(t)] are produced by pipelines 2–0 in the current clock cycle and the remaining two elements [d_{z−1}(t), d_{z−2}(t)] are fetched from the central difference store. After P differences from the central difference store have been used to create the U_z vectors, the differences from the bands in the range [(z − 1), (z − (P mod N_p))] are kept in the store to be used in the next clock cycle. When z < P, only the z previous local differences are to be used, and no local differences remaining from the previous pixel. In the serial implementation, the contents of the difference store are set to zero when z = N_z − 1. In the parallel one, since previous local differences are used directly from the pipelines, the differences are masked based on the z coordinate. In this manner, local differences with index i ≤ z are included in the local difference vector and elements with index i > z are set to zero.
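The assembly and masking can be mimicked with a hypothetical helper (a behavioural sketch; the function and argument names are ours, and the store is modeled as a simple list, most recent difference first):

```python
def assemble_U(p, P, pipeline_diffs, store_diffs, z):
    # Local difference vector for pipeline p, which processes band z + p.
    # pipeline_diffs[j] is d_{z+j}(t) produced by pipeline j this cycle;
    # store_diffs holds older differences d_{z-1}(t), d_{z-2}(t), ... from
    # the central difference store, most recent first.
    U = [pipeline_diffs[j] for j in range(p - 1, -1, -1)]  # newest first
    U += store_diffs[: P - len(U)]                         # fill from store
    # Mask differences that would come from before band 0 of the current pixel
    return [d if k < z + p else 0 for k, d in enumerate(U)]
```

For pipeline 3 at the start of a pixel (z = 0), only the three differences produced by pipelines 2–0 survive the mask, matching the z < P rule above.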

Weights and Accumulators
Weights and accumulators are stored in two instances of the same module, the shared store, with different element sizes of the stored vectors. Figure 7 shows the shared store implementation with N_p block RAMs of depth M = ⌈N_z/N_p⌉. A read counter rd_cnt and a write counter wr_cnt are used to compute the read and write addresses in each bank. The counters are initialized as rd_cnt(i) = 0 and wr_cnt(i) = delay(0, 1). The write counter is used directly as the write address w_addr(i), whereas the read address r_addr(i) for bank i equals rd_cnt(i), except that it is offset backwards by one position, to (rd_cnt(i) − 1) mod M, for lanes that wrap into the previous store row, creating an initial distance between the read and write addresses equal to delay(i, 1). The behaviour of the weight shared store for parameters N_z = 61, M = 16 and N_p = 4 is presented in Figure 8. The initial state after reset in Figure 8a shows that rd_cnt is initialized to 0 and wr_cnt is set to delay(0, 1) = 15. For lanes 0–2, the read addresses are equal to the counter value (rd_cnt = 0), whereas for lane 3 the read address is 15, based on the condition (3 + 61 mod 4) ≥ 4. However, the data read from the weight shared store for the first pixel are not used, since the standard defines no prediction for the first pixel. Figure 8b shows the write operation of the first weight samples of pixel 1 at the address of the weight store pointed to by wr_cnt. At this time stamp, counter rd_cnt is N positions from its initial position, where the delay N corresponds to the number of pipeline stages from the weight reading operation to the end of the weight update operation. The delay N is equal to 8 + S, where the parameter S is the number of pipeline stages in the dot product. In Figure 8c, the read counter is at position M − 1, the read addresses are computed as [M − 2, M − 1, M − 1, M − 1] and the first weights [−, W_0(1), W_1(1), W_2(1)] are read simultaneously with samples [s_60(0), s_0(1), s_1(1), s_2(1)] at the input of the compression core. Figure 8d shows the state of the weight shared store after 15 cycles, when samples [s_59(1), s_60(1), s_0(2), s_1(2)] arrive at the input.

Packing of Variable Length Words
The last stage includes packing of the variable-length encoded words W_0, …, W_{N_p−1}, with respective lengths L_0, …, L_{N_p−1}, from the N_p pipelines into fixed-size blocks. The packing operation for N_p = 4 is illustrated in Figure 9. The packing process starts by shifting the word W_0 from the first pipeline by the number of bits remaining from the previous cycle, L_prev. After that, word W_0 is concatenated with the bits remaining from the previous cycle, W_prev. In general, each word W_i is shifted by the accumulated length of all preceding words before concatenation. It is observed that the number of shifts depends heavily on N_p and on the maximum length U_max + D of each word. On the other side, the standard defines fixed-size output blocks of size B, and a block is extracted each time the sum of the words' lengths exceeds B. The block extraction limits the maximum word chain length to B − 1, regardless of N_p or the maximum word length. Therefore, the number of bits left after block extraction and the number of extracted blocks are introduced. With the accumulated length ΣL_i = L_prev + L_0 + … + L_i, the number of bits left after block extraction is computed as s_i = ΣL_i mod B, and the extraction count e_i, indicating the number of blocks to extract, is defined as e_i = ⌊ΣL_i/B⌋. If e_i is non-zero, the number of accumulated bits ΣL_i is at least B.
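The per-word bookkeeping can be mimicked in software. This behavioural sketch (names ours, not the hardware interface) accumulates words left to right and extracts a B-bit block exactly when the running length crosses a multiple of B:

```python
def pack_words(words, lengths, B, prev_bits=0, prev_len=0):
    # words[i] holds the code word bits of word W_i as an integer of
    # lengths[i] bits; (prev_bits, prev_len) model W_prev and L_prev.
    # Returns (full_blocks, leftover_bits, leftover_len).
    acc, acc_len = prev_bits, prev_len
    blocks = []
    for w, L in zip(words, lengths):
        acc = (acc << L) | w          # concatenate the new word on the right
        acc_len += L
        while acc_len >= B:           # extract every completed B-bit block
            acc_len -= B
            blocks.append(acc >> acc_len)
            acc &= (1 << acc_len) - 1
    return blocks, acc, acc_len
```

The leftover length returned here plays the role of s_{N_p−1}, and the number of extracted blocks corresponds to the final extraction count; the hardware computes both in advance from the lengths alone so that the actual shifting can run in parallel.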
The implementation of the packer module is presented in Figure 10. In the first stage, the parameters s_i and e_i are computed for each input word W_i. In the second and third stages, a combiner chain combines the input words using the computed s_i and e_i, as shown in Figure 11. The shifting operation for each word is performed in parallel by using s_i to select among shifted versions of W_i from a multiplexer. The last pipeline stage concatenates the shifted words and extracts full blocks based on the extraction count e_i. The produced full blocks are added to the chain of complete blocks, the count of full blocks is updated and the remaining bits W_prev are stored in a register to be combined in the next cycle. Finally, the last flag is set when the remaining bits are output as a separate block. After the combiner chain, the chain of full blocks, its length and the last flag are pushed into an output FIFO. To output blocks sequentially, the blocks sent from the combiner need to be buffered. The data word width of the FIFO is determined by the maximum number of blocks N_max produced in one clock cycle, given as N_max = ⌈((B − 1) + N_p(U_max + D))/B⌉ + 1, where (B − 1) is the maximum number of leftover bits from the previous cycle, N_p(U_max + D) is the maximum word length produced in one clock cycle and the factor 1 accounts for the last block when the last flag is set. If the average bit rate of the encoded samples is higher than the output bus width, there is a risk of the FIFO becoming full. Thus, it is required to stall the data streaming into the core before the overflow occurs by de-asserting the ready signal at the input. This is done by setting a threshold N_th on the number of data words in the FIFO. In this manner, it is ensured that all encoded samples streamed in from the cycle when the de-assertion of the ready signal happens are stored. The threshold N_th is equal to S + 15, which corresponds to the total number of pipeline stages from the core input to the FIFO. In on-the-fly processing, stalling of the input stream is
not possible, and the choice of FIFO depth is dependent on the image statistics and the speed of the predictor's adaptation. The proposed serial packing of incoming words in combinatorial logic is feasible for N_p < 6. For larger N_p values, the critical path in the initial pipeline stage does not meet timing requirements due to the dependence of the sum of word lengths L_i on N_p. For this reason, a modified version of the packer module is presented in Figure 12. The modified packer distributes the incoming words across several combiner chains operating in parallel. Large critical paths are then avoided by displacing each combiner chain by one clock cycle. The generic parameter N_per_chain is introduced to define the number of words per combiner chain, where 1 ≤ N_per_chain ≤ N_p. The number of combiner chains is then computed as N_c = ⌈N_p/N_per_chain⌉. The parameters s_i and e_i are computed sequentially for each combiner chain across N_c clock cycles, as shown in Figure 12.
Since this operation takes more than one clock cycle, L_prev is not available when the computation starts. Therefore, the partial sums of word lengths are initially computed as Σ′L_i = L_0 + … + L_i, whereas the complete length ΣL_i = Σ′L_i + L_prev is computed in the N_c-th clock cycle. Large critical paths can be created due to the existing data dependence between combiner chains. To avoid this, large delay registers for the left-most chains are used to keep the full blocks from each chain synchronized with the last chain N_c − 1. The proposed solution is that each combiner chain shifts its input words by L_prev without concatenating them with the remaining bits. Instead, the concatenation is done at the output of each combiner chain. The outputs of each combiner chain are a block set and the length of the produced block set, which are sent to a block set FIFO. The output logic controls the streaming of the created blocks and tracks which FIFO contains the packed blocks for a particular set of words. In particular, the control FIFO monitors which block set FIFOs contain valid data. For each block set pushed to the block FIFOs, a new word is pushed to the control FIFO with a block set mask and the last flag, where each bit in the block set mask corresponds to one of the combiner chains. In Figure 13, the block set mask '101.1' for N_c = 3 indicates that valid block sets come from combiner chains 0 and 2 and the last flag is high.
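The chain partitioning and the deferred L_prev addition can be sketched as follows (an illustration of the arithmetic only; the helper names are ours):

```python
import math

def split_into_chains(lengths, N_per_chain):
    # Nc = ceil(Np / N_per_chain) combiner chains, each handling a slice of
    # at most N_per_chain words; chain c starts one cycle after chain c-1.
    Nc = math.ceil(len(lengths) / N_per_chain)
    return [lengths[c * N_per_chain:(c + 1) * N_per_chain] for c in range(Nc)]

def partial_sums(lengths):
    # Running sums Sigma'L_i = L_0 + ... + L_i, computed before L_prev is
    # known; the complete SigmaL_i = Sigma'L_i + L_prev is formed later.
    total, out = 0, []
    for L in lengths:
        total += L
        out.append(total)
    return out
```

Keeping L_prev out of the running sums is what breaks the cycle-to-cycle dependence: each chain can start summing its own slice immediately and fold in L_prev as a single final addition.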

Results
The proposed parallel architecture of the CCSDS-123 compression algorithm is described in VHDL, and the Vivado tool is used for synthesis, implementation, power estimation, testing and verification on a PicoZed board with a Zynq-7035 FPGA. The implementation supports BIP sample ordering and both on-the-fly and offline processing. In addition, the implementation is tested against the reference software Emporda [21], and it is fully compliant with the standard, allowing user-defined parameter selection.
The proposed core implementation is tested as part of a larger system supported by an AXI bus [22]. Since internal stalling of the output stream is not supported by the core, it is necessary to buffer the output data in a FIFO, as shown in Figure 14. Data streaming into the core is stopped when the number of words in the FIFO exceeds the capacity limit N_limit, determined as N_limit = D_FIFO − ⌈N_stages N_p(U_max + D)/B⌉, where D_FIFO is the FIFO depth and N_stages is the total number of pipeline stages, based on the assumption that each pipeline stage holds valid data and each data word has the maximum length of U_max + D. The depth of the FIFO is a trade-off between area usage and the frequency of output stalling; it is, however, required to be large enough for N_limit to be positive.
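The sizing rule can be sketched numerically. Note that the closed form here is our reconstruction of the stated worst-case reasoning (every stage full of maximum-length words), and all parameter names are assumptions:

```python
import math

def fifo_stall_limit(depth, n_stages, Np, U_max, D, B):
    # Worst-case words still in flight after stalling: every one of the
    # n_stages pipeline stages may hold Np words of U_max + D bits each,
    # i.e. at most this many B-bit output words still arrive at the FIFO.
    in_flight = math.ceil(n_stages * Np * (U_max + D) / B)
    assert depth > in_flight, "FIFO depth must exceed the in-flight bound"
    # Stall the input once the FIFO occupancy reaches this limit.
    return depth - in_flight
```

With, say, a 512-deep FIFO, 20 pipeline stages, N_p = 4, U_max = 20, D = 16 and B = 32 (all illustrative values), the core would stall the input once 422 words are buffered, leaving room for the 90 words that may still be in flight.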

Utilization Results
The resource use is affected by several parameters, such as the number of bands used for prediction P and the sample bit resolution D. As reported in [19], both LUT and register use in the dot product, predictor and weight update modules in the pipeline scale linearly with P. However, in the proposed implementation the throughput is not affected by the choice of parameters and remains N_p samples per clock cycle for any chosen parameter configuration.
The proposed implementation of the CCSDS-123 algorithm supports the majority of the standard's parameter settings, including the full ranges of the bit resolution D, the number of previous bands for prediction P and the output word size B. The implementation supports both neighbor- and column-oriented local sums, and full and reduced prediction modes. However, only the sample-adaptive encoder is supported. In the following resource use analysis, the chosen parameter configuration is the one provided in [10], with parameters D = 16 and P = 3. Table 1 shows the resource use of the proposed implementation for a variety of hyperspectral and multispectral image sensors. The main factor affecting the area use is the frame size N_x · N_z, which determines the amount of memory required for storing delayed samples, weights and accumulators.
The resources used in terms of LUTs, registers and block RAM have been elaborated in more detail for a core configuration set for processing the available 16-bit L1b HICO data cubes [5]. Initially, the number of block RAMs varied considerably for different numbers of pipelines. In particular, the weight store and sample delay block RAM use varied depending on N_p. This happens because the synthesis tool extends the depth of an array to the closest power of 2, thereby significantly increasing the block RAM resources used. To avoid this, LUT elements are used as distributed RAMs instead of block RAMs, since one LUT element in 7-series FPGAs [23] can be configured as a 32 × 1 bit RAM. The LUT use in the packer module is analyzed in terms of the number of words per chain N_per_chain and the block size B, and the results are presented in Figure 17. It is observed that, regardless of N_per_chain, the block size is the main factor affecting area use.

Timing
The maximum operating frequency of the proposed implementation for different N_p is shown in Figure 18. The operating frequency depends on N_p with a downward trend and varies in the range 126–157 MHz. The critical path is in the output logic, which produces the last flag signal obtained as a logical sum of the last signals from the control modules in each of the pipelines.

Power Estimation
The power estimation has been performed in Xilinx Vivado on the post-implementation design in combination with data from post-implementation functional simulation. Figure 19 shows that power usage increases linearly with N_p in all modules for 1 ≤ N_p ≤ 8. The static power consumption of 0.125 W is mainly due to leakage in the memories of the stores, whereas the dynamic power grows with N_p, as presented in Figure 20. The estimates for the stores refer to the sum of the power in the weight, accumulator, sample and local difference stores. The linear increase is due to the logic added for each pipeline and to the increasing complexity of the packer module. Fluctuations appear in the power contribution of the stores when N_p is a power of 2, since the inference of block RAMs in the NE FIFO of the sample delay module is most effective when the depth of the FIFO is a power of two. For example, for the HICO data set, the depth of the FIFO, computed as N_x N_z/N_p, is a power of two when N_p is also a power of two.

Comparison with State-of-the-Art Implementations
A comparison of the proposed parallel implementation of the CCSDS-123 algorithm with recent sequential [12-15,17,18] and parallel [16] FPGA implementations with regard to maximum frequency, throughput and power is presented in Table 4. The majority of implementations target the Virtex-5 FX130T FPGA, which is the commercial equivalent of the radiation-hardened Virtex-5QV. However, the detailed power and performance analysis of the parallel implementation [16] is reported for a powerful Virtex-7 FPGA device. The maximum sensor data rates for the AVIRIS NG and HICO imagers, representing real-time sensor throughput requirements, are also given. The implementations with BIP ordering have roughly similar architectures but large performance differences. In the implementation proposed by Santos et al. [13], the chosen sample ordering requires that local differences are recomputed when needed. As a consequence, each sample is read 2(P + 1) times and the input bandwidth efficiency is decreased. This approach requires either a non-sequential memory access pattern, with a potential reduction of streaming efficiency, or that the arrangement of the data samples in memory follows the irregular streaming order, occupying 2(P + 1) times as much storage. The SHyLoC implementation [17] supports all three sample orderings, with a different architecture for each ordering. The implementation by Báscones et al. [15] achieves a throughput lower than 50 Msamples/s on a Virtex-7, i.e., less than one sample compressed per clock cycle.
In the parallel implementation proposed by Báscones et al. [16], a throughput of 3510 Mb/s is reported for C = 7 compression cores employed in parallel. By fixing the subset of bands processed by each CCSDS-123 core, throughput degradation can be introduced when the number of bands is not divisible by the number of cores C, since this requires stalling of several cores when processing the last samples of each pixel. Another limitation is the serial nature of the final packing stage, which creates a significant throughput bottleneck for a large number of parallel cores. The paper suggests, however, that the serial packing circuit can be clocked faster.
The proposed parallel implementation builds on the processing chain implemented in previous work [19], which is characterized by a throughput of 2350 Mb/s. The adaptation of the CCSDS-123 processing chain for parallel processing and the structuring of several CCSDS-123 compression chains in parallel are introduced. The limitations of the data routing between processing chains (CCSDS-123 cores) and of the packing operation in the work proposed by Báscones et al. [16] are successfully overcome in the proposed implementation. In fact, the throughput is maximized by the proposed interleaved data routing between parallel processing chains, which eliminates pipeline stalling. In addition, the proposed parallel packing provides linear scaling of the throughput when the number of pipelines is increased. The ability to achieve high throughput for a number of spectral bands N_z that is not an integer multiple of N_p, and to pack any number of variable-length words into fixed-size words in each clock cycle, are the greatest improvements of the proposed implementation. In comparison with the state of the art, the proposed parallel implementation achieves superior processing speed, with data rates of 9984 Mb/s and 12,000 Mb/s for N_p = 4 and N_p = 5, respectively. Future work will include a hardware implementation of the emerging Issue 2 of the CCSDS-123 standard [25,26], which builds on the current Issue 1 of the CCSDS-123 compression standard [10]. Issue 2 focuses on new features such as a closed-loop scalar quantizer to provide near-lossless compression, a modified hybrid entropy coder for low-entropy data and support for high-dynamic-range instruments with 32-bit signed and unsigned integer samples. The introduced data dependencies can affect the throughput and challenge parallel processing.

Conclusions
In this paper, a parallel FPGA implementation of the CCSDS-123 compression algorithm is proposed. Full utilization of the pipelines is achieved by the proposed advanced routing with shifting and delay operations. In addition, the packing of variable-length words is performed fully in parallel, providing a throughput of a user-defined N_p samples per clock cycle. The implementation significantly outperforms the state-of-the-art implementations in terms of throughput and power. The estimated power use scales approximately linearly with the number of samples processed in parallel. In conclusion, the proposed core can compress any number of samples in parallel, provided that resource and I/O bandwidth constraints are met.

Figure 1 .
Figure 1. Overview of the parallel CCSDS-123 implementation for N_p = 4.

Figure 4 .
Figure 4. Sample delay processing chain described by delay(i, ∆t) and shift(i, ∆t) functions.

Figure 5 .
Figure 5. Sample delay operation for obtaining W neighbor samples for N p = 4 and N z = 9.

Figure 6 .
Figure 6. Routing of central differences between pipelines for N_p = 4 and P = 5.

Figure 7 .
Figure 7. Implementation of the shared store module.

Figure 8 .
Figure 8. States of the weight shared store for N_z = 61 and N_p = 4: (a) after reset; (b) during the writing operation of the first weight samples for pixel 1; (c) during the reading operation of the first stored weight samples for pixel 1; (d) 15 cycles after the first reading operation for pixel 1.

Figure 9 .
Figure 9. Operation of the variable length word packer.

Figure 10 .
Figure 10. Implementation of the variable length word packer.

Figure 12 .
Figure 12. Implementation of the improved variable length word packer.

Figure 13 .
Figure 13. Memory organization for block set FIFOs.

Figure 15. Figure 16.
Figure 15. Resource use by pipeline logic with respect to the available resources.

Figure 17 .
Figure 17. LUT use in the packer module for various B and N_per_chain, N_p = 4.

Figure 18 .
Figure 18. Maximum operating frequency for different numbers of pipelines.

Table 3 .
Table 3. Memory element use in various stages for different N_p.