An FPGA-Based LOCO-ANS Implementation for Lossless and Near-Lossless Image Compression Using High-Level Synthesis

In this work, we present and evaluate a hardware architecture for the LOCO-ANS (Low Complexity Lossless Compression with Asymmetric Numeral Systems) lossless and near-lossless image compressor, which is based on JPEG-LS standard. The design is implemented in two FPGA generations, evaluating its performance for different codec configurations. The tests show that the design is capable of up to 40.5 MPixels/s and 124 MPixels/s per lane for Zynq 7020 and UltraScale+ FPGAs, respectively. Compared to the single thread LOCO-ANS software implementation running in a 1.2 GHz Raspberry Pi 3B, each hardware lane achieves 6.5 times higher throughput, even when implemented in an older and cost-optimized chip like the Zynq 7020. Results are also presented for a lossless only version, which achieves a lower footprint and approximately 50% higher performance than the version that supports both lossless and near-lossless. Interestingly, these great results were obtained applying High-Level Synthesis, describing the coder with C++ code, which tends to establish a trade-off between design time and quality of results. These results show that the algorithm is very suitable for hardware implementation. Moreover, the implemented system is faster and achieves higher compression than the best previously available near-lossless JPEG-LS hardware implementation.


Introduction
Information compressors allow the reduction of bandwidth requirements and, given that data transmission systems tend to demand much more power than computing systems, they are useful as well when energy or dissipation is limited. For the case of images or videos, apart from lossless compression, we may also introduce errors in a controlled manner in order to improve the compressibility of the data. A particularly convenient way to perform this is to use near-lossless compression, which ensures that these errors are bounded by a limit set by the user. When this limit is set to zero, lossless compression is obtained.
These codecs are particularly useful when the data to compress contains very valuable information and/or, given the nature of the application, a minimum quality must be ensured. Satellite image acquisition is a prominent application of these systems, which have pushed the development of many algorithms and hardware implementations [1,2]. Additionally, we can find medical applications such as capsule endoscopy [3][4][5][6][7] or portable image devices [8].
New applications emerge in scenarios where traditionally raw (uncompressed) data was transmitted. Given the rapid increase in the data volume generated, image codecs can reduce costs and development time by leveraging already available transmission infrastructure and standards. An example of this in the video broadcasting industry is the use of intermediate codecs (mezzanine codecs), used between initial acquisition and final Given these objectives, where we aimed to achieve a first architecture, not a fine-tuned one, the encoder was completely implemented using High-Level Synthesis to allow faster development. Thanks to a careful design and advances in the HLS compilers, the resulting system achieves high performance and a reasonably small footprint. The complete set of sources required to reproduce the systems here presented are open to the community through a publicly available repository (https://github.com/hpcn-uam/LOCO-ANS-HWcoder, accessed on 23 October 2021).
The rest of this paper is structured as follows: first, Section 2 revises the ideas in which this paper is grounded. Next, Section 3 describes the architecture of the implemented system. Then, Section 4 provides the obtained implementation results when deploying the system in different FPGA platforms and evaluates them. After this, Section 5 further discusses the achieved results in light of the related work. Finally, Section 6 concludes the paper by summarizing its main contributions.

Background
Before describing the proposed system, this section introduces ANS, the LOCO-ANS algorithm, and HLS, which are the fundamental ideas this work is based on.

ANS
ANS coding system [26], similarly to the arithmetic coder, codes a stream of symbols in a single output bitstream, where whole bits cannot be assigned to a particular input symbol. That is, it codes the alphabet extension of order n = number of symbols. However, instead of storing the information in a range (as the arithmetic coder does), it encodes it in a single natural number, the state. In order to limit the size of this state, a re-normalization is performed when it is out of bounds and new output bits are generated. Furthermore, to be able to decode the resulting bitstream, the last ANS coder state must be sent to the decoder.
ANS logic can be encoded in a ROM storing, for each current state (ROM address), the next state and numbers of bits to take from the current state. This is one of the ways of implementing tANS, one of the ANS variants. Therefore, although the ideas behind ANS are a bit more complex, its operation can be really simple. Each ROM, or table, codes for a specific symbol source distribution, so to perform adaptive coding, several tables need to be available, choosing the one that better adapts to the currently estimated symbol probabilities. When designing a system using tANS, it is important to take into account that the Kullback-Leibler divergence tends to stay in the (0.05/k 2 , 0.5/k 2 ) range, with k = |S|/|A|, where |S| is the size of the state (generally assumed to be 2 state_bits ) and |A| is the cardinality of the symbol source the table codes for. In addition, the output bitstream acts as a Last In First Out (LIFO) memory, a stack. Then, the decoding is performed in the reverse order, starting the process with the last bits generated and recovering the last symbols first.
For more about ANS, see [26,28], and about tANS in hardware, see [30,31]. Figure 1 shows the LOCO-ANS algorithm block diagram, where two main subsystems can be appreciated, the Pixel Decorrelator and the TSG Coder. The former processes the input pixels with the aim to turn them into a stream of statistically independent symbols with their estimated distribution parameters, which the latter will code. These symbols are errors made by the adaptive predictor, which are then quantized according to the error tolerance (NEAR parameter) as shown by Equation (1). This quantization allows ensuring that the absolute difference between the original value of a pixel and the decoded one is less or equal to NEAR. Note that if NEAR = 0, then lossless compression is obtained. Other reversible operations are then applied to q to improve compression.

LOCO-ANS Algorithm
Pixel Decorrelator The adaptive predictor is composed of a fixed predictor plus an adaptive bias correction. The adaptive correction is computed for each context, which is a function of the gradients surrounding the pixel currently processed. The prediction errors are modeled using the Two-Sided Geometric (TSG) distribution, that is, an error q is assumed to have the following probabilities: where θ and s are the distribution parameters and C(θ, s) = (1 − θ)/(θ 1+s + θ −s ) is a normalization factor. However, to simplify the modeling and coding of this error, the next re-parametrization is used: and where p = (θ 1+s )/(θ 1−|s| + θ |s| ) and θ is the same parameter as in Equation (2) [32]. These distribution parameters are estimated by the Context Modeler for each context, generating the estimated quantized versions,θ q andp q . As seen in the block diagram, the TSG coder uses two different coders to handle y and z, both based on tANS. As mentioned, ANS output bitstream acts as a LIFO, but the decoder needs to obtain the errors in the same order the decorrelator processed them, to be able to mimic the model adaptations. For this reason, the Block Buffer groups symbols in blocks and inverts their order. The output bits of a block are packed in the Binary Stack and stored in the inverse order, so the decoder can recover pixels in the same order the encoder processed them without additional metadata.
The Bernoulli coder requires a single access to the tANS ROM to code the input y, whereas the Geometric coder may need several accesses. This is because z is decomposed in min( (z + 1)/C(θ q ) , N I + 1) subsymbols, where C(θ q ) + 1 is the cardinality of the tANS symbol source used for a given θ q and N I is a coder parameter that sets the maximum ROM accesses for each z symbol. However, as shown in [29], for 8-bit gray images and using C(θ q ) ≤ 8 and N I greater than the z range, the coder only requires 1.3 accesses on average.
These coders may or may not use the same ANS state. If they do, at the cost of losing the ability to run in parallel, only one ANS state is sent at the end of the block. If they do not, larger symbol blocks can be used to compensate for the additional bits required to send the second ANS state. Then, this option establishes a memory-speed trade-off.
For a more in-depth explanation of how the codec works and its design, refer to [29]. Additionally, Appendix A provides some examples of images compressed with LOCO-ANS setting NEAR to 0 and 3.

High-Level Synthesis
There are currently several compilers in the market that translate C/C++ code to Register Transfer Level (RTL) such as VHDL or Verilog. Examples of these compilers are Vitis HLS (Xilinx), Intel HLS, or Catapult (Mentor). Apart from the C/C++ code, directives (sometimes included in the code as #pragmas) are used to guide the compiler towards the desired architecture. These directives, for example, can establish the desired number of clock cycles required for a module to be ready to consume a new input or, in other words, to set the Initial Interval (II). Additionally, they can shape memories and select a specific resource for their implementation.
HLS compilers allow faster development of hardware modules [33]. The main reasons are: • The code describes the algorithm, whereas the compiler is in charge of scheduling operations to clock cycles and assigning operators/memory to the target technology resources.
• Code can be validated much faster using a C/C++ program instead of an RTL simulator. • Directives allow a wide design space exploration. Moving from a low footprint to a heavily pipelined, high-performance architecture is possible just by changing a single line of code. • After code verification and RTL generation, the output system can be automatically validated using the C/C++ code to perform an RTL simulation. • The source code is less technology-dependent.
However, even though compilers have been improving, the use of HLS tends to establish a trade-off between design time and quality of results (performance and/or footprint). Furthermore, except for trivial applications, being aware of the underlying architecture and resources used is still necessary to obtain good implementations.

Encoder Architecture
In this section, the LOCO-ANS encoder architecture is presented. The block diagram in Figure 2 shows the main modules composing the system: The Pixel Decorrelator, S t Quantizer, and TSG coder. Each of these modules is implemented in C/C++ with compiler pragmas and transformed to RTL code using Vitis HLS.
The pixel decorrelator takes pixels as input and outputs a stream of y,p q , z, t, and S t . The last two variables are further processed by the S t quantizer to generate theθ q geometric distribution parameter. The TSG coder uses a tANS coder to transform the y and z streams in blocks of bits and, finally, the File Writer sends these streams and header information, issuing the appropriate DMA commands.
The TSG coder may need several cycles to code a symbol, but it is much faster than the Pixel Decorrelator, so in order to increase the encoder throughput, the former module runs at a higher clock frequency. FIFOs are inserted between these modules to move data from one clock domain to the other.
Subsections below explain in more detail each module.  Figure 2. LOCO-ANS hardware high-level block diagram. In blue, modules running at the lower frequency, and in red, modules running at the higher frequency.

Pixel Decorrelation
Given the sequential nature of the pixel decorrelation algorithm, it is mainly implemented by a single pipelined module, including a single line row buffer. It consists of an initialization phase and the pixel loop. In the initialization phase, the first pixel is read (which is not coded but included in the bitstream directly), context memories and tables used in the pixel loop are initialized according to the NEAR parameter setting. The operation takes about 512 clock cycles to complete. This could be optimized in many ways, such as computing and storing several memory entries in a single cycle, or avoiding the re-computation of tables when NEAR does not change. Additionally, ping-pong memories could be used to achieve zero-throughput penalty, initializing these memories in a previous pipeline stage, as done in [19]). However, the HLS compiler did not support some constructions required to create that architecture. Although workarounds exist, the potential benefit for HD and higher resolution images is negligible (less than 0.056% performance improvement in the best case and assuming the same clock frequency is achieved). What is more, particularly in high congestion implementations (i.e., FPGAs with high usage ratio), this could even reduce the actual throughput, given that the extra logic and use of additional memory ports can imply frequency penalties. For these reasons, and given that other works have presented optimized architectures for this part of the algorithm (changes to the JPEG-LS algorithm do not have important architectural implications), these initialization time optimizations were not implemented.
Algorithm 1 describes the pixel loop. This code structure allowed a deep pipeline (shown in Figure 3), which reads the row buffer, computes the quantized gradients g 1 and g 2 , which do not depend on the previous pixel (after quantization), and starts to compute the context id before the previous pixel quantization is finished. To obtain the context id and sign, the value Q(g 1 ) · 81 + Q(g 2 ) · 9 + Q(g 3 ) is computed, where only the g 3 gradient uses the previous pixel. Then, Q(g 1 ) · 81 + Q(g 2 ) · 9 can be computed in an earlier stage, which is what the pipeline does. Observe that the gradients order in the equation was chosen such that the dependency between loop iterations is eased, as the component requiring g 3 (which cannot be computed earlier) is not multiplied by any factor. Algorithm 1 Pixel loop algorithm structure 1: q_pixel ← f irst_px 2: for i ∈ [1, image_size) do 3: #pragma HLS PIPELINE II = 2 The lossless optimized version uses II = 1 Data stored in the row buffer does not establish dependencies 4: #pragma HLS DEPENDENCE variable=row_buffer intra false 5: #pragma HLS DEPENDENCE variable=row_buffer inter false 6: Store q_pixel in row bu f f er 7: Read new pixel 8: Compute f ixed prediction, context id, and sign 9: Get context bias and statistics 10: Correct prediction and compute error 11: Per f orm error quantization and modulo reduction 12: Send symbol with metadata to the output 13: q_pixel ← Reconstruct the pixel 14: U pdate context statistics 15: end for Additionally, to improve the performance (reducing the II), the updated context data is forwarded to previous stages when two consecutive pixels have the same context (something that happens in most cases according to [23], although this depends on the nature of the images). Originally, this optimization was done explicitly in the code and using pragmas (to inform the compiler of the false dependency), but newer versions of the HLS compiler perform this optimization automatically.
Since the HLS compiler handles the scheduling of the operations, the number of pipeline stages may change depending on the target frequency and FPGA. For the tested technologies, aiming at the maximum performance, the pixel loop operations were scheduled in five stages.

Obtaining the Distribution Parameterθ q
The decorrelator keeps for each context a register S t = ∑ t i=0 z i . The register and the context counter t are then processed by the downstream module S t Quantizer ( Figure 2) to obtain the quantized distribution parameterθ q . The implemented quantization procedure is a generalization of the iterative method used in LOCO-I to obtain the k parameter of the Golomb-power-of-2 coder [34] and it is described in detail in [29]. Algorithm 2 shows the coarse-grained configuration of this quantization function.

Algorithm 2 Coarse grained θ quantization function (Q θ )
Although this procedure could have been done within the decorrelator, it was decided to keep it separated, to ease the scheduler job and ensure this operation extended the pipeline without affecting the pixel loop performance. This operation can be computeintensive, but as there are no dependencies among consecutive symbols, the module can be deeply pipelined, achieving high throughput.

Near-Lossless Quantization and Error Reduction
To handle the quantization processes, a set of tables (the term look-up table (LUT) is usually used to refer to these tables, but here it is avoided in order not to confuse it with the FPGA resource also denominated LUT) was designed to increase the system performance, taking into account that even small FPGA have plenty of memory blocks to implement these tables. The Algorithm 3 describes the error quantization (lines 1-5), modulo reduction (lines 6-10), and re-scale (line 11) processes.

Require:
Input error Ensure: q Output symbol Ensure: re Re-scaled error, ysed to update context bias Uniform quantization 1: if > 0 then 2: Reduction modulo α = f(NEAR, pixel depth) 6: if q < MI N_ERROR then 7: As suggested in [34], the error quantization can be easily implemented using a table. However, the result after the modulo reduction logic is stored in the table, as the memory resources are reduced and it helps to speed up the context update, which is one of the logical paths that limits the maximum frequency. In addition, a second table contains the re-scaled error ( re ), to avoid the general integer multiplication logic and also to ease the sequential context dependency.
Additionally, a third table is used, in this case, to speed up the pixel reconstruction process, which is the other important logical path that could limit the maximum frequency. There are several ways to perform this, as is shown in Figure 4. To our knowledge, previous implementations of the LOCO/JPEG-LS encoder reconstruct the pixel starting from the quantized prediction error (as indicated in the ITU recommendation [16]) or from the re-scaled error (e.g., [35]). Instead, we use the value of the exact prediction error (only available on the encoder), to get the reconstructed pixel. Given a NEAR value, each integer will have a quantization error, which can be pre-computed and stored in a table. Then, the exact prediction error (before the sign correction) addresses the table that provides the quantization error, and it is then added to the original value of the pixel. As it can be appreciated in Figure 4, using this method greatly simplifies the computation and eases the path. This is one of the key ideas that enabled our high-throughput implementation.  These tables could be implemented as ROMs, supporting a small set of NEAR values, or implemented by RAMs, which are filled depending on the NEAR value currently needed. In the presented design, the latter option was chosen, giving the system the flexibility to use any practical NEAR value, using 3 tables with 2 pixel depth+1 entries each. The time required to fill these memories can be masked, as stated before. Although the uniform quantization would require general integer division, the tables are filled with simpler logic. It is easy to see that, if sweeping the error range sequentially (either increasing or decreasing by 1) and starting from zero, almost trivial logic is required to keep track of the division and remainder.
If a single clock and one edge of the clock are used, the minimum II for the system will be 2. To compute the prediction, the context memory is read (memory latency ≥ 1), then the prediction error is obtained, which is needed to address the quantization tables (also implemented with memories with a latency ≥ 1). The result of the quantization process is used to address the next pixel context, producing a minimum II = 2.
Within a module, Vitis HLS does not allow the designs with multiple clocks or using different clock edges. However, in this case, a great improvement is not expected from the implementation of these techniques, they will imply a much greater development time and the result will tend to be more technology-dependent (given that the FPGA fabric architecture and relative propagation times vary, affecting the pipeline tuning).

Decorrelator Optimized for Lossless Compression
A decorrelator optimized just for lossless compression operation was also implemented. The removal of the quantization logic, plus the logic simplification that arises from using a fixed NEAR = 0 allows going from an II = 2 to II = 1 with approximately a 25% frequency penalty in the tested technologies. That is about a 50% throughput increase (see Section 4). In this case, this pixel loop is implemented with a 4-stage pipeline and the frequency bottleneck is established by the context update.
An interesting fact about this optimization is that going from the general decorrelator to testing on hardware, a first lossless only version took less than one hour. Such fast development was possible given that just a few lines of C++ code needed to be modified. These simple modifications led to significant changes in the scheduling of the pipeline, resulting in the stated performance, which would have been much more time-consuming using HDL languages. Figure 5 shows the block diagram of the double lane TSG coder, which allows sharing the tANS ROMs without clock cycle penalties, as double port memories are used and each lane requires one port. This module can receive the output of two independent Pixel Decorrelators and process them in parallel. In this way, it allows the compression of images in vertical tiles, which was shown to improve compression for HD and higher resolution images [29].

TSG Coder
The system was designed in a 2-level hierarchy because, as we go downstream, the basic data elements each module processes change. The input buffer works with blocks of symbols, while the subsymbol generator works at the symbol level, the ANS coder at the subsymbol level, and the output stack with blocks of packed bits. This modularization allows easily choosing the coding technique better suited for each module. The modules shown in Figure 5 are instantiated in a dataflow region synchronized only by the input and output interfaces such that each module can run independently. In Vitis HLS, this is accomplished with the following pragma:

Stages of the TSG Coder Input Buffers
The main function of the Input Buffer is to invert the symbol order to make the adaptive coding with ANS practical (complex methods would be required otherwise). However, to avoid the use of large memories, this module creates blocks of symbols, and the order within each block is inverted (see Figure 6). The write and read pipelined functions are instantiated in a dataflow region using a ping-pong buffer, given the required non-sequential memory accesses. However, it is noted that there is an alternative with a memory of one block, which comes at the cost of slightly more complex logic.  For coding efficiency reasons, the cardinality of the symbol source modeled by the z ANS ROM varies for each distribution parameter θ q . Then, for a given θ q tANS will model a distribution of the symbols [0..C(θ q )]. For this reason, z needs to be represented in terms of these symbols, so it is decomposed as follows: ∑ n i=0 z i = z, where the first subsymbol z 0 is equal to mod(z, C(θ q )) and all the rest are set to C(θ q ). In this way, to retrieve z, the decoder just needs to sum subsymbols until it finds one (first encoded, but last decoded) that is different to C(θ q ). As C(θ q ) is always an integer power of 2, this process is simple. Finally, if it is detected that the length of this sequence is going to be greater than a design parameter N I (which determines the maximum number of geometric coder iterations) the subsymbol sequence represents an escape symbol. Following this sequence, the original z is inserted in the bitstream.

Write Block
As described in [29], this process is used to reduce the cardinality tANS needs to handle, which translates into significantly lower memory requirements and higher coding efficiency while keeping simple operation.
As it decomposes z and serializes the result with y (in the coupled coders version), this module establishes the TSG coder bottleneck in terms of symbols per clock cycle (not the frequency bottleneck, i.e., contains the critical path). Due to this, it was fundamental to optimize this module to be able to output a new subsymbol every clock cycle. Pipelining the modules was not sufficient to accomplish this goal. As shown in Figure 7, the z subsymbol generation process was split into two modules, one to get the required metadata and another one to decompose the symbol. Furthermore, the Z Decompose module was not described as a loop, as one normally would specify this procedure, but instead, it was coded as a pipelined state machine, which allowed reaching the desired performance. Finally, all these modules are instantiated in a dataflow region synchronized only by the input and output interfaces.

ANS Coder
As shown in Figure 8, the ANS coder is composed of three modules. For each subsymbol, the first one chooses the tANS table according to the symbol type (z i or y) and the distribution parameter. This table is then used to obtain the variable-length code for the sub-symbol. Thus, the module implements the Bernoulli Coder and the remaining process of the Geometric Coder. However, they can be easily split, resulting in a simpler module and the ROM memories would have weaker placement and routing constraints. The module also accepts bypass symbols, which are used to insert z after the escape symbol. After the last sub-symbol is coded, the second module inserts the last ANS state as a new code. The last module packs these codes into compact bytes. The ANS coder can accept a new input in every clock cycle. This was accomplished by instantiating the modules in a dataflow region synchronized only by the input and output interfaces and pipelining each of them with an II=1. This II was achieved by the modularization of the process and by describing all three modules as state machines.

Output Stack
Finally, the Output Stack is in charge of reversing the order of the byte stream of each block of symbols. For this, it uses a structure similar to the one employed in the Input Buffer.

Increasing Coder Performance Independent Component Coders
As mentioned before, if y and z ANS coders (Bernoulli and Geometric, resp.) are independent, the coder throughput would be increased by a (î + 1)/î factor. As indicated in [29],î tends to be around 1.3 for lossless coding (the worst case). Then, applying this value will result in a 1.77 times faster coder. What is more, given that z and y coders will be decoupled and almost no additional logic is required, it is expected that the maximum frequency would be at least the one achieved for the coupled coders. To implement it, the Subsymbol Generator should not serialize z and y, the tANS coder should be split in two (each with one tANS ROM) and the bit packer should merge the two code streams.

Decreasing the Maximum Iterations Limit
In addition, the worst-case performance, as well as the maximum code extension, can be controlled using the maximum geometric coder parameter N I. This is particularly important for implementations with limited buffering.

Results
This section presents how the designs were tested as well as the achieved frequencies and resource footprints. Finally, throughput and latency analyses are provided.

Test Platform and Encoder Configurations Description
In order to conduct the hardware verification, the system depicted in Figure 9 was implemented in two different Xilinx FPGA technologies, described in Table 1: Zynq 7 (cost-optimized, Artix 7 based FPGA fabric) and Zynq UltraScale+ MPSoC. For all implementations, although not optimal in terms of resources, two input and output DMAs were used to simplify the hardware, as the objective was to verify the encoders building a demonstrator, not a fully optimized system. Images were sent from the Zynq µP running a Linux to the FPGA fabric using the input DMAs, which accessed the main memory and fed the encoder using an AXI4 stream interface. As the encoder generates the compressed binary, the Output DMA stores it in the main memory. The evaluation of the coding system was carried out for the configurations in Table 2    These configurations correspond to the Nt4_Stcg5_ANS4, Nt6_Stcg7_ANS6, and Nt6_Stcg8_ANS7 prototypes tested in [29]. The most relevant information is given here, but for a complete description, refer to that work. 1 Bits per pixel relative to JPEG-LS baseline for NEAR = 0 and NEAR = 1 (see Section 5.

Implementation Results
For the tested implementations and both technologies, the critical path of the lowfrequency clock domain is, in general, in the pixel reconstruction loop for the near-lossless encoders and within the update logic of the adaptive bias correction for the lossless version.
In the case of the high-frequency clock domain, the slowest paths of these implementations tend to be in the TSG coder and the output DMA for the Zynq 7020 implementation. Within the TSG coder, the critical path is, in general, either in the tANS logic (from the tANS ROM new state data output to the tANS ROM address, the new state) or in the Z Decompose module. In the case of the Zynq MPSoC, the slowest paths tend all to be in the tANS logic.

Results Evaluation
Results are analyzed in terms of throughput and latency, which are of paramount importance for real-time image and video applications.

Throughput
The near-lossless decorrelator critical path is in the pixel reconstruction loop, which is the same procedure used in the standard. This fact supports that the changes introduced by LOCO-ANS in the decorrelator do not limit the system performance. In the case of the lossless decorrelator, the bias context update logic limits the frequency. This procedure is the same as in the JPEG-LS standard extension, which requires an additional conditional sign inversion compared to the baseline. This tends to worsen the critical path, but it is a minor operation compared to the complete logical path. Although it achieves a slower clock, the lossless decorrelator throughput is about 50% higher than the near-lossless decorrelator, given that it achieves an II = 1 instead of II = 2.
The presented implementations represent a wide range of trade-offs between performance, compression, and resources (also cost, considering technology dimension). All of them have the Bernoulli and Geometric coders coupled, then their mean throughput will be clk1/2.3 MPixels/s for photographic images, where clk1 refers to the clock shown in Table 3. In this way, for a given configuration and target, the TSG coder will have in the mean between 83% and 98% higher throughput than the near-lossless decorrelators for the Zynq 7020 implementations and between 47% and 76% for the Zynq MPSoC. In the case of the lossless optimized decorrelators, this performance gap is reduced to (15%, 26%) and (−10%, 16%), for Zynq 7020 and Zynq MPSoC, respectively. From the presented implementations, just one of them shows a lower TSG coder throughput. In this case, the increased compression ratio comes at the cost of not only higher memory utilization but also a throughput penalty. The top half features implementations that support near-lossless compression (including lossless), and the bottom half, lossless-only compression (with -LS suffix). All the presented implementations have 2 lanes and support up to 8K wide images per lane. 1 Clk0 is the low-frequency clock used for the pixel decorrelation process, while clk1 is the high-frequency clock used for the coder. See Figure 2.
However, it is observed that many possible optimizations of the TSG coder exist, and particularly of the tANS procedures. The Z ROM memory layout can be enhanced to significantly reduce the memory usage, which could have a positive impact on the maximum frequency as Table 3 suggests. Furthermore, alternative hardware tANS implementations exist [30], which may allow a wider range of performance/resources trade-offs.
The obtained results support the hypothesis that the use of the proposed TSG coder, which has a compression efficiency higher than the methods used in JPEG-LS, will not reduce the encoder throughput. This is observed in the hardware tests, where the encoder pixel rate is determined by the decorrelators when photographic images are compressed, except the lower TSG coder throughput case (LOCO-ANS7-LS in the Zynq MPSoC). As expected, this is not the case for randomly generated images, as the coder requires larger code words for them, and then, it is the TSG coder the one that limits throughput, particularly for small images and lossless compression.

Latency
The implemented decorrelator latency is determined by the initialization time plus the pixel loop pipeline depth, which results in 512 + 6 = 518 cycles. For the lossless optimized version, this is reduced to 365 + 4 = 369 cycles. In the case of the low-end device implementation (Zynq 7020), this results in 6.3 µs and 5.8 µs latency, respectively. As mentioned before, if required, the initialization time could be reduced or even completely masked, but these optimizations were not implemented due to compiler limitations, and the fact that it was considered that the potential benefits were low.
It is a bit more complicated to obtain the TSG coder latency, as it is data-dependent, and the coder works with blocks of symbols. To determine the marginal latency (delay added by the coder), we consider the time starting when the last symbol of the block is provided to the coder until the moment the coded block is completely out of the module. Then, avoiding the smaller pipeline delay terms, the TSG coder latency can be computed as: (1 + subsym(z)) · BS + bpp/out_word_size · BS clock cycles (5) Here, BS is the block size, subsym(z) is the mean subsymbols z is decomposed into, bpp is the mean bits per pixel within the block and out_word_size is the size (in bits) of each element of the output stack. The latency is dominated by two modules: the Subsymbol Generator (first term of the equation) and the Output Stack (second term). This is because, as mentioned before, the former creates a bottleneck given that for each input it consumes it outputs several through a single port and the latter buffers the whole block of output bytes and outputs it in the inverse order.
To obtain a pessimistic mean latency, we assume a low compression rate of 2 (bpp = 4). The block size is set to 2048, the output stack word size to 8, and subsym(z) ≈ i = 1.3 (as determined in [29]). Then, for the Zynq 7020 implementation, the mean TSG coder latency is 31.9 µs.
To estimate a practical upper bound to this latency, the following image compression case was analyzed: • Image pixels equal to BS = 2048. In this way, we maximize the block used while keeping the pixel count low, so the decorrelator's capability to learn the statistics of the image is reduced. • Pixels independently generated using a uniform distribution (worst-case scenario) and the errors model hurts compression (the prior knowledge is wrong). • Image shape: 64 × 32 (cols × rows). This shape allows visiting many different contexts, and then, the adaptation of the distribution parameterθ will be slower, thus increasing the resulting bpp. • NEAR = 0 (lossless compression): which maximizes the error range and bpp.
From a set of 100 images generated in this way, we took the lower compression instance, where bbp = 9.844 and subsym(z) = 6.31. This code expansion is due to the fact that the prior knowledge embedded in the algorithm (coming from the feature analysis of photographic images, such as the correlation between pixels) is wrong in this case and, as the image is small, it does not have enough samples to correct this. Moreover, given that the range of the θ distribution parameter was determined with photographic images, additional θ tables may be needed for these abnormally high entropies. Then, using the presented formulas, we obtain 97.2 µs as a practical upper bound on the encoder latency for the Zynq 7020 implementation running at 180 MHz. Although the presented system establishes a trade-off between latency and compression, the achieved latency is remarkably low and suitable for many real-time systems. Moreover, it is possible to tune this trade-off by modifying the implementation parameters.

Discussion
In this section, we evaluate the results presented in the previous section as well as analyze them taking prior works into consideration.

Related Work
There exists a large set of compression methods that achieve a very wide range of compression-resources-throughput trade-offs, but not all have an amenable hardware implementation. The use of dynamic structures tends to make logic slower and require a higher footprint. For example, JPEG-XL [36] can achieve better lossless compression ratios than JPEG-LS, but for that, it needs very flexible contexts and non-trivial logic is used to optimize their histograms and the rANS tables to code for these functions. Furthermore, the use of large memories, like in the case of inter-frame video compression, tends to require external memories, which also contributes significantly to the system power requirements. Given the fact that this work targets real-time and, in general, highly constrained applications with bounds on the errors generated by the compression system and considering the already mentioned features of the JPEG-LS codec that makes it very suitable for these applications, the discussion is focused on JPEG-LS-like codecs, analyzing the trade-offs within this subregion of the metrics space. Table 4 shows key metrics of the most relevant hardware and, for performance comparison, software codecs implementations. In this section, to provide clearer explanations, we focus on the balanced LOCO-ANS6 configuration.   4 Software implementations running in Raspberry 3B, with a single thread. 5 Standard-compliant JPEG-LS implementation. 6 12-bit image support.

Comparison Considerations
Before diving into the analysis of the presented work in light of other works in the area, we examine what we consider the most relevant aspects of the comparison process itself that condition it.

Compression Trade-Offs
The fact that most of these implementations use different algorithms complicates performance comparisons, particularly because the compression ratios for a given dataset are not available. Then, it is hard to analyze the trade-offs that each design implies.
Although many works claim to be standard-compliant, some present a design that it is not, as they apply several changes to the algorithms, in general, to simplify and/or speed up the implementation. Not supporting the run-mode is a common one.
In [21], for example, we note they introduced the following changes without assessing the implications: • Not using run-mode. • Not clamping the corrected prediction (see A.4.2 ITU-T.87). Due to this, the range of the prediction error is increased and, given that JPEG-LS uses limited-length Golomb codes, the binary code after the escape code needs to be increased by 1 bit. • Error modulo reduction is applied after context bias update (see A.4.5 ITU-T.87). • Not including the error sign correction required by the bias update (see A.4.3 ITU-T.87). Not applying the error sign correction will have a negative impact on compression, as it is needed to perform the context merge. • Not limiting the maximum bias correction (see A.6.2 ITU-T.87).
To quantify the impact on the throughput of these changes, we utilize the Vitis HLS implementation feature, which instantiates the resulting HDL module in the target device, performs RTL synthesis followed by place and route (P&R). In this way, it allows obtaining a good estimation of the performance of a module in a non-congested implementation. With these changes, the tool reports that the lossless only decorrelator achieves 100 MHz in the Zynq 7020 (a 55.5% performance increase).
Of course, provided that the trade-offs are understood, changes to the algorithms that improve performance can be useful. For example, in [37] the bias update mechanism was replaced by a more precise one, which also allowed a much more feed-forward pipeline, resulting in a fasted encoder at the cost of resources. However, in this case, it is not clear whether the presented results are implementation ones or just RTL synthesis.
To better compare the encoders, we run compression experiments where, apart from LOCO-ANS and JPEG-LS, we tested JPEG-LS without run mode and JPEG-LS without run mode with 32 × 64 tilling (max tile size supported by [19]). These non-standard JPEG-LS implementations were obtained through the modification of the reference libjpeg codec [38]. Given the number of changes, and the fact that it probably has issues, we do not attempt to reproduce the algorithm implemented in [21]. In this experiment, we used the photographic (non-artificial) images of the 8-bit gray image dataset maintained by Rawzor [39] for NEAR ∈ [0..3]. The results are presented in Figure 10. As it can be appreciated, even when dealing with photographic images, the run-length coder does have a noticeable impact on compression. While LOCO-ANS6 output file size is 1.1%, 5.4%, 9.2%, and 13.4% smaller than JPEG-LS output (for NEAR ∈ [0..3], respectively), removing the run-length coder increases it by 1.1%, 6.8%, 14.4%, and 22.3%.
Moreover, we can appreciate the effect of different tile sizes. Diving the image in 2 columns (LOCO-ANS6 (2 lanes)), which can be compressed in parallel, improves JPEG-LS by 1.4%, 5.9%, 9.9%, and 14.2% for NEAR ∈ [0..3]. We estimate that this improvement comes from the intuition that, for wide images, image statistics vary slower when scanning an image in columns, so the model is more accurate and then, higher compression is achieved. However, using small tiles, and particularly reducing the height, the encoder model does not have enough samples to learn the image statistics, so it does not make good estimations. As a result, JPEG-LS with no run mode with 64 × 32 tiles worsens compression even further, increasing the output file size by 6.4%, 13.0%, 20.3%, and 28.0%, compared to JPEG-LS.

Implementation Technology
Another problem is how to normalize speed, considering the target technology. In the literature, we find implementations in a wide range of devices, using different technologies. Even within the Xilinx FPGAs, it is hard to make performance comparisons as both programmable logic fabric architecture and manufacture node change. Although FPGAs have increased their maximum clock frequency with time, differences between subsequent releases vary and greater variability can exist within a release, considering different architectures and speed grades. Additionally, the clock frequency of feed-forward compute engines (without data dependencies) was able to increase much more with the introduction of more pipeline stages within FPGA hard blocks, like on-chip memories and DSPs. However, codecs with good compression ratios, and particularly JPEG-LS, have feedback loops that cannot be easily sped up.
For a subset of the Xilinx FPGAs used for the hardware codecs works, Table 5 shows key times involved in the context update logic, which determines the clock frequency of most of these implementations. Observe the relative magnitude of the BRAM clock to output propagation time (without output register) compared to other metrics and that it consumes a significant part of the respective clock periods. Of course, the information in this table is not enough to have an accurate model that would allow fair comparisons between technologies, among other reasons, because routing tends to be a major contributor to the critical paths in FPGA implementations and there is no clear way to compare different fabric architectures. However, this data does seem to explain, at least in part, the frequency jump from Zynq 7020 -1 to Zynq UltraScale+ -2 that we observe in Table 3.
To overcome this, Ref. [20] implemented their architecture, which seems to be standard compliant, in a set of devices used by previous works. As a result, the presented design compared favorably both in terms of speed and resources. For this reason, this work, which achieves 207.8 MPixel/s in a Virtex 7 speed grade 2 with JPEG-LS compression rate, is taken as a reference point to analyze the proposed lossless encoder results. In the near-lossless case, we compare to [19], which is the closest to standard-compliant and faster design in the literature.

Lossless-Only Encoders Comparison
The Vitis HLS implementation feature was used to estimate the clock frequency that LOCO-ANS6 would achieve in a Virtex 7 -2, used by the lossless reference architecture. Although the resulting pipeline of the lossless only decorrelator is very similar, the maximum frequency obtained after P&R is 120 MHz. The performance gap probably comes from the lower level optimizations applied to the context bias update path, as described in [40] and later improved in [20], which is the frequency bottleneck of our and their implementations.
At first glance, for lossless, LOCO-ANS6 achieves a compressed image 1.1% smaller than JPEG-LS (see Section 5.2.1), at the cost of throughput. However, the TSG coder is able to achieve 288 MHz in that device for the 6 ANS configuration. That is, 1.39 times faster than the reference design. Thus, if the Bernoulli and Geometric coder are decoupled (independent ANS states) and an optimized decorrelator is used, the TSG coder would not be the system bottleneck as, on average, it requires running 1.3 times faster.
In practice, we may find symbol sequences that increase the local mean of Geometric coder iterations, particularly with very noisy images, but this can be countered by decreasing the iterations limit (also limiting code expansion) and increasing the cardinality of the tables (decreasing mean iterations). Additionally, increasing the block size (which also improves compression) and using buffering between the decorrelator and the coder can mitigate the eventual performance throttling.
Finally, note that these positive results arise from comparing an HLS coder implementation with the best performing and carefully designed HDL decorrelator.

Near-Lossless Encoders Comparison
The reference near-lossless JPEG-LS encoder implementation does not support the run coder and has a maximum tile size of 32 × 64 [19]. Then, the achieved compression ratio is considerably lower than the JPEG-LS standard. The negative effect of not supporting the run-length coder increases with the NEAR parameter, as lower entropy symbols are generated and the Golomb coder becomes less and less efficient as can be appreciated in Figure 10. LOCO-ANS exhibits the opposite behavior, as the TSG coder is very well suited for near-lossless compression. As a result, LOCO-ANS6 (single lane) achieves 7.0%, 16.2%, 24.5%, and 32.4% smaller output size compared to the near-lossless reference implementation. Using the two lanes in parallel to compress an image widens further this compression gap to 7.4%, 16.7%, 25.1%, and 33.0%.
Regarding performance, the reference implementation decorrelator has two lanes with an II = 2 running at 51.68 MHz (25.84 Mpixels/s/lane) in a Virtex 6-75t. These lanes share a single Golomb encoder with II = 1 running at the same frequency. This performance is surpassed by our implementation, also with two decorrelator lanes with II = 2 running at 81.1 MHz (40.55 Mpixels/s/lane for photographic images of medium and above size) in a Zynq 7020. However, this reference implementation was designed for 12-bit images, which worsens the two feedback paths that can limit the encoder performance. For this reason, to better compare these two designs, we run an implementation with Vitis HLS, configuring our decorrelator to work with 12-bit images. As the newer toolset starting from Vivado (almost 10 years old) does not support devices prior to the 7 series, the low-end Zynq 7020 (with the lowest speed grade) was targeted as opposed to the higher end Virtex 6. Table 5 gives a hint supporting that this decision favors the reference implementation as all Virtex 6 timings are noticeably smaller than the chosen target. The Virtex 6 speed grade used in that work is not reported, but this consideration is still applicable to the slowest Virtex 6 as it can be appreciated in the table. As a result, the 12-bit HLS decorrelator achieved a clock of 67.3 MHz after P&R, still a 30% higher throughput.
We attribute this performance increase to the alternative method used to reconstruct the quantized pixel (Section 3.1.2). The reference implementation uses the multiplication by inverse trick to implement the division and applies a compensation scheme to correct the errors derived from this technique while using 15 bits for the fractional part. For very deep pixels, this might be more efficient, but in the proposed architecture, using a table, we achieve a greater simplification and reduction of the critical path. For deeper pixels, larger tables would indeed be required. However, the needed type of memories are abundant (see Table 1), and for this case, targeting up to 12-bit images, only 8 36 K on-chip memories are required (in the case of Xilinx devices). The performance increase comes at the cost of memory resources, but as it can be observed comparing Tables 1 and 3, this resource is not the limiting factor.
Again, as mentioned before, these positive results were obtained comparing an HLS implementation with carefully designed HDL ones. Additionally, as noted in Section 3, further optimizations are possible. However, for the purpose of this work, the presented module was optimal enough to analyze the LOCO-ANS encoder performance.

Conclusions
In this work a hardware architecture of LOCO-ANS was described, as well as implementation results presented, analyzed, and compared against prior works in the area of near-lossless real-time hardware image compression.
The presented encoder excels in near-lossless compression, achieving the fastest pixel rate so far with up to 40.5 MPixels/s/lane for a low-end Zynq 7020 device and 124.15 MPixels/s/lane for Zynq Ultrascale+ MPSOC. At the same time, a balanced configuration of the presented encoder can achieve 7.4%, 16.7%, 25.1%, and 33.0% better compression than the previous fastest JPEG-LS near-lossless implementation (for an error tolerance in [0..3], respectively).
In this way, the presented encoder is able to cope with higher image resolutions or FPS than previous near-lossless encoders while achieving higher compression and keeping encoding latency below 100 µs. Thus, it is a great tool for real-time video compression and, in general, for highly constrained scenarios like many remote sensing applications. These results are in part possible thanks to a new method to perform the pixel reconstruction in the pixel decorrelator and the high-performance Two-Sided coder, based on tANS, which increases the coding efficiency. Moreover, as mentioned throughout the article, it is noted that further optimizations of the presented system are possible. Finally, experiment results support that if used with the fastest lossless optimized JPEG-LS decorrelators in the state-of-the-art, this coder will improve compression without limiting the encoder throughput. Data Availability Statement: The complete set of sources required to reproduce the systems here presented are publicly available through the following repository: https://github.com/hpcn-uam/ LOCO-ANS-HW-coder, accessed on 23 October 2021.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

ALUT
Adaptive Look-up