2.1. Logarithmic Hop Encoding
Logarithmic Hop Encoding (LHE) is a video encoding and decoding algorithm designed to minimize encoding times, thus reducing latency in remote control applications, where response time takes priority over bandwidth. It works in the spatial domain by dividing a frame into blocks and encoding the differences between pixels within the same block, minimizing the amount of information to store. Its mathematical basis has been described in [18]. Here, a summary of its theory of operation and of the updates made for this particular scenario is provided.
LHE exploits the Weber–Fechner laws, which state that changes in the human perception of light intensity are proportional to the previous intensity, and that the perceived response to an intensity increase is logarithmic [27]. This means that small differences in pixel intensity before and after encoding will not be perceptible to the human eye. LHE takes advantage of this by calculating the predicted intensity of a pixel based on its surrounding pixels and encoding the error—the difference between the prediction and the real intensity—as one of a set of “hops”. Each “hop” is a discrete step in intensity on a logarithmic scale. As the image is being encoded, the actual value of each step is updated dynamically, based on the values of the previous errors. This process is completely deterministic, meaning that, given one starting value, it can be replicated in the decoder to obtain the same sequence of value updates.
Another feature of LHE that optimizes compression is that different areas of the image may have different levels of detail. For instance, one region may be a background, with little variation in colors and intensities, while another region may contain text or complex objects that need a higher level of detail to be distinguished. Areas with low detail can be compressed heavily without losing perceived quality, while other areas may benefit from little to no compression. To analyze these areas, since the Weber–Fechner laws refer to light intensity, the colors of the image are represented as luminance (Y), which captures brightness, and chrominance (U and V), which captures hue. The original frame is divided into blocks of the same width and height, chosen to be exact divisors of some of the most commonly used video resolutions.
For each block, a metric called the perceptual relevance (PR) is computed for the horizontal (PRx) and vertical (PRy) axes. The PR corresponds to the average of the absolute values of the differences in intensity between adjacent pixels along each axis, given in the interval [0, 1] (with 1 representing the greatest rate of change). The first step in obtaining the PR is calculating the differences between neighboring pixels along the desired axis (di) and assigning each a “quantum” qi between 0 and 4, according to the thresholds in Table 1. The PR is then calculated as the average of the non-zero quanta, normalized by the maximum quantum, where M is the number of differences for which qi > 0:

PR = (Σ qi) / (4M)

Finally, the PR is adjusted linearly so that its range fits within the [0, 1] interval and is then discretized into one value of a small predefined set.
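The PR computation above can be sketched in software. Note that the quantization thresholds of Table 1 are not reproduced in the text, so the values used below are illustrative assumptions only; the structure (four thresholds, five quanta, average of non-zero quanta normalized by the maximum quantum) follows the description above.

```python
# Sketch of the PR computation for one axis of an LHE block.
# THRESHOLDS is an illustrative stand-in for the values in Table 1.

THRESHOLDS = (4, 8, 16, 32)  # assumed thresholds for quanta 1..4

def quantize_difference(d):
    """Map an absolute pixel difference to a quantum in 0..4."""
    q = 0
    for t in THRESHOLDS:
        if abs(d) >= t:
            q += 1
    return q

def perceptual_relevance(line):
    """PR for one row (or column) of pixels: the average of the
    non-zero quanta, normalized by the maximum quantum (4)."""
    quanta = [quantize_difference(b - a) for a, b in zip(line, line[1:])]
    nonzero = [q for q in quanta if q > 0]
    if not nonzero:
        return 0.0
    return sum(nonzero) / (4 * len(nonzero))
```

A flat region yields a PR of 0 (no non-zero quanta), while a region of large adjacent differences yields a PR of 1.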
Once the PR is computed, each block is compressed using downsampling. An independent downsampling factor is used in each axis by defining a plane predicted pixel value, or PPP, which corresponds to the number of pixels that are averaged together for the downsampled block. The PPP is derived from the PR and a compression factor (CF): the CF is set before streaming and controls the sensitivity of the mapping from PR to PPP, i.e., the aggressiveness of the downsampling. The higher the CF, the more the image is downsampled, even for high PRs. The maximum PPP is generally 8.
For downsampling, the PPP is quantized to an integer number, so pixels are averaged together in groups of 1, 2, 4, or 8. The higher the PR, the less the block will be compressed. Each LHE block is first downsampled horizontally and then vertically. Note that different downsampling factors can be used for each axis because the PR can be different for each of them. This is meant to preserve information about edges, while still compressing similar colors or backgrounds.
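The per-axis downsampling step can be sketched as follows. The exact PR-to-PPP equation is not reproduced in the text, so the mapping below is an illustrative stand-in that preserves the stated behavior: higher PR means less compression, and the PPP is quantized to one of 1, 2, 4, or 8.

```python
# Sketch of per-axis downsampling. map_pr_to_ppp is an illustrative
# assumption; only the quantized PPP set {1, 2, 4, 8} and the
# "higher PR, less compression" rule come from the description above.

def map_pr_to_ppp(pr):
    """Illustrative mapping: PR in [0, 1] -> PPP in {8, 4, 2, 1}."""
    if pr < 0.25:
        return 8
    if pr < 0.5:
        return 4
    if pr < 0.75:
        return 2
    return 1

def downsample_row(row, ppp):
    """Average groups of `ppp` adjacent pixels along one axis."""
    return [sum(row[i:i + ppp]) // ppp for i in range(0, len(row), ppp)]
```

Applying `downsample_row` first to each row and then to each column of the result reproduces the horizontal-then-vertical order described above, with independent PPP values per axis.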
Lastly, the hops are calculated for each compressed block. The process begins in the upper-left corner of the block. For the remaining pixels, a predicted value is calculated as the average of two neighboring pixels, typically the one to its left and the one in the upper-right corner. For pixels close to the borders of the image, a compromise is made by taking other neighbors.
Figure 1 shows the possible combinations of neighbors used to form a prediction. The chosen hop is the one closest to the difference between the pixel’s actual and predicted values. Then, the reference point for the hops is recalculated by readjusting the difference between the central hop (which means no error) and the first hop (the smallest possible error). The remaining hops all change relative to this difference: the values of consecutive hops follow a geometric progression, with one ratio for positive hops and another for negative hops. The most important variables of the algorithm for calculating the hops are the value of the smallest hop, h1, which establishes the sensitivity of the predictions, and a gradient, a small increment in the prediction based on the tendency of the image that corrects possible deviations. Both are updated after each prediction. When the difference between neighbors is large, the intensities vary considerably within the block, so the values of these parameters are increased. LHE predicts well when the neighboring pixels are similar, since the average of the two then makes a good prediction; when their values differ greatly, the prediction is inaccurate. In those cases, when the difference between neighboring pixels exceeds a threshold, h1 is set directly to its highest value instead of waiting for slower changes in the gradient. This is called “immediate h1”. Code that provides a more detailed description of the algorithm can be found in the GitHub repository linked in the Data Availability Statement.
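The core of the hop calculation can be sketched as below. The actual geometric ratios, the hop index range, and the update rules are not reproduced in the text, so the ratio of 2 and the nine hop indices (−4..4) are illustrative assumptions; only the overall scheme (average-of-neighbors prediction, nearest-hop selection around a smallest hop h1) follows the description above.

```python
# Sketch of hop selection for one pixel. RATIO and the hop index
# range are illustrative assumptions, not values from the paper.

RATIO = 2.0  # assumed geometric ratio between consecutive hop magnitudes

def hop_values(h1):
    """Hop magnitudes grow geometrically from the smallest hop h1;
    hop index 0 means 'no error'."""
    values = {0: 0}
    for i in range(1, 5):
        magnitude = h1 * RATIO ** (i - 1)
        values[i] = magnitude
        values[-i] = -magnitude
    return values

def predict(left, upper_right):
    """Prediction: average of the left and upper-right neighbors."""
    return (left + upper_right) // 2

def choose_hop(actual, predicted, h1):
    """Pick the hop index whose value is closest to the prediction error."""
    error = actual - predicted
    hops = hop_values(h1)
    return min(hops, key=lambda i: abs(hops[i] - error))
```

Because both encoder and decoder run the same deterministic updates of h1 and the gradient, only the chosen hop index needs to be transmitted.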
The LHE algorithm lends itself to parallelization and quantization at several points. First, since the image is divided into blocks, each block, or set of partial results for each block, can be treated separately from the rest, and several blocks can be processed simultaneously. The different processing stages can be pipelined easily: while a series of blocks is undergoing downsampling, the next batch of blocks can already have its perceptual relevance computed, and so on. The decisions on quantization and encoding affect the downsampling and hop calculation processes. When calculating the PR, rather than averaging the value of the differences directly, each difference is classified into one of five quanta with four thresholds, and the PR itself is then discretized into a small predefined set of values. As a result, the possible numbers of pixels to average are limited to 1, 2, 4, or 8. Since the intensity values for the hops can be updated in the receiver following the same algorithm as the transmitter, only the hop index (how far the jump in intensity should be) needs to be transmitted; for the LHE implementation discussed here, this is a small signed integer.
This system of parallelization and discrete values translates well to platforms for accelerated computing, such as GPUs and FPGAs, since operations under a small set of well-known conditions are more predictable. Additionally, some of the operations, such as calculating the threshold values or the hop and update values for a pixel–prediction pair, can be precalculated and stored in look-up tables. This consumes more memory resources but greatly reduces latency. Subsequent sections show how these concepts are translated into a custom processor implemented on an FPGA.
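The memory-for-latency trade described above can be sketched as follows: the hop chosen for every possible prediction error is precomputed once, so the per-pixel work becomes a single table read. The hop values used here are illustrative assumptions, not values from the paper.

```python
# Sketch of precomputing a hop look-up table. The table turns a
# search over hop values into one indexed read per pixel.
# HOPS below is an illustrative set of hop values.

def build_hop_lut(hop_values):
    """For each possible 8-bit prediction error (-255..255),
    precompute the index of the closest hop value."""
    lut = {}
    for error in range(-255, 256):
        lut[error] = min(hop_values, key=lambda i: abs(hop_values[i] - error))
    return lut

HOPS = {0: 0, 1: 4, -1: -4, 2: 8, -2: -8, 3: 16, -3: -16, 4: 32, -4: -32}
LUT = build_hop_lut(HOPS)
```

In hardware, such a table maps directly to block RAM, at the cost of the memory footprint discussed later for the full hop-and-update table.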
2.2. FPGA Platform and Implementation Considerations
When designing for FPGAs, it is possible to target small and efficient hardware blocks. This way, the target algorithm can be broken into steps, and parallelism can be achieved in two dimensions. The first is spatial parallelism, in which the same function can be instantiated more than once, with all instances operating in parallel on different data. The second is temporal parallelism, in which data can be passed to another function after processing, leaving the first function available for new data; this way, several steps operate simultaneously, which is usually known as pipelining. Due to these optimizations, high throughput can be achieved at lower clock frequencies (FPGAs operate in the order of hundreds of megahertz). The main disadvantage of FPGA design is memory constraints: the memory in these devices is usually small compared with other parallelization platforms such as graphics processing units (GPUs). While GPUs typically work in batches, FPGA designs are encouraged to use data streams, that is, to process the data as it arrives and store only the small amounts required during processing. Compromises need to be made in algorithms where data is accessed more than once; in such cases, memory should be reused as often as possible.
To test this version of LHE, a Zybo Z7-20 development board from Digilent, Inc. (Austin, TX, USA) [28] was used. This board includes a Zynq-7000 system-on-a-chip from Xilinx (XC7Z020-1CLG400C, San Jose, CA, USA), a mid-range model that includes reconfigurable blocks and a dual-core ARM Cortex-A9 (ARM Holdings, Cambridge, UK). The reconfigurable logic has direct access to 630 KB of block RAM, which can be accessed in one clock cycle for reading and writing. The board also includes a 125 MHz external clock, whose frequency can be changed by internal phase-locked loop blocks, and a MIPI CSI-2 ribbon connector for an external camera.
The development of the proposed system was carried out in Vivado, version 2020.2, the software suite from Xilinx (San Jose, CA, USA) for designing and simulating on their FPGAs. Vivado’s simulation tools (Vivado Simulator) were used to simulate the designs.
2.3. Proposed LHE Design for Implementation in FPGA
The main distinguishing feature of the design process in FPGAs compared to parallel design in GPUs and CPUs is that memory is often the limiting factor. FPGAs benefit from processing a continuous stream of data, applying fixed processing to each data point and pipelining the design, so that the final results are transmitted as a stream as well. Of course, small memories are available for intermediate results, but the data flow does not usually work well in batches. The proposed pipelined implementation of LHE follows the stages shown in
Figure 2. A block-by-block description of the system is provided in this section.
The first block (MIPI) establishes the connection with the camera and streams the pixels using the MIPI CSI protocol (Mobile Industry Processor Interface and Camera Serial Interface). This block obtains the timing reference signals (beginning of frame, end of line, and pixel valid) and the Bayer values for each pixel. This block is useful when connecting with portable cameras, which in their basic configuration usually provide raw Bayer values. Cameras sense light through a grid of small photosensors, with filters sensitive to green, blue, and red light. These filters are distributed so that 50% of them are green, 25% are blue, and 25% are red, with alternating colors. When raw values are transferred by a camera, they are returned in this special order. Therefore, when a color space used in image processing, such as RGB, is needed, a transform needs to be applied to the data to convert from Bayer values to pixel color values. Although the camera can perform some of this processing, it adds latency to the capture process, the critical first stage. By implementing the transformation process by hand, values can be read faster from the camera, and the heavier computations can be handled by functions that can work with parallelism. The next block (RGB) converts the Bayer raw values into three RGB components per pixel (8 bits each). These blocks were provided by Digilent, Inc. (Pullman, WA, USA) as intellectual property (IP) ready-made blocks for use with their products and set a starting point for processing [29].
The design tested in this work focused on luminance (color intensity) because it carries most of the information in the image and can be used to determine the viability of the system. Following the MIPI and RGB blocks, the luminance block calculates a luminance value for each pixel (8 bits, to match other implementations of LHE). To optimize the latency, the following formula with integer coefficients was used:

Y = ((66R + 129G + 25B + 128) >> 8) + 16

which corresponds to the studio swing luminance formula from recommendation ITU-R BT.601-7 [30]. These calculations are implemented in hardware with three 16-bit pipelined multipliers, adders, and a right shifter, meaning that data can be processed in a stream as new pixel data arrives. From this point, the data needs to be divided into LHE blocks, whose size is fixed for this implementation. To reduce the need to access memory, each pixel is paired with information about the LHE block to which it belongs. This enables the PR to be calculated for an entire row of LHE blocks without interrupting the streaming. For stages that depend on the PR, the data does need to be stored, so block RAM is dedicated to saving the pixel values while the PR is being processed. The main disadvantage of RAM compared to registers is that only one value can be accessed at a time. In this design, RAM was deemed more suitable as a buffer for repeating the stream of luminance values to other stages, while registers were dedicated to places requiring quick calculations with previous values.
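The integer luminance computation can be sketched and checked in a few lines. This assumes the standard integer approximation of the ITU-R BT.601 studio-swing luma formula (coefficients 66, 129, 25 with an 8-bit right shift and a +16 offset), consistent with the three multipliers, adders, and right shifter described above.

```python
# Integer studio-swing luma (ITU-R BT.601 approximation):
# three multiplications, additions, and one right shift,
# mapping 8-bit RGB to luma in the studio range 16..235.

def luma_bt601_studio(r, g, b):
    """8-bit studio-swing luma from 8-bit RGB, integer arithmetic only."""
    return ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16
```

Black maps to 16 and white to 235, the studio-swing extremes, so the result always fits comfortably in 8 bits.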
For the PR, the horizontal (PRx) and vertical (PRy) components are calculated simultaneously for one line of LHE blocks, following the logic in Figure 3. For each horizontal or vertical pair of pixels in a block, the difference is calculated (block DIFF in Figure 3) and then quantized. The previous pixels are saved in a register bank (Reg) to calculate the differences; the register bank has a propagation delay of one pixel for the horizontal PR and of one line of pixels for the vertical PR. The average of the non-zero differences is then the PR. To obtain this average, an accumulator for the differences and the count of non-zero elements (ACC) and pipelined divider hardware (÷) are used. A hardware divider is a complex circuit, so only two dividers are used for all the blocks in one row (one for each direction). To keep the division error within thousandths of a unit without significantly impacting the latency, the operations are performed with fixed-point numbers with 10 integer bits and 10 fraction bits (introducing a latency of 20 cycles on the first division). Since the data arrives in a stream and the dividers are pipelined, the calculation of the differences is completed first for the starting block, then for the second block, and so on, allowing each block to take turns using the divider. It also means that the subsequent computation stages start working in a staggered way, which can be exploited for resource sharing.
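The precision claim for the fixed-point division can be illustrated directly: with 10 fraction bits, the quantization step is 1/1024, so the result of a division is within a thousandth of a unit of the exact quotient. The helper names below are illustrative.

```python
# Sketch of Q10.10 fixed-point division (10 integer bits,
# 10 fraction bits): the quantization step is 1/1024 < 0.001.

FRAC_BITS = 10

def fixed_div(num, den):
    """Divide two integers, producing a Q10.10 fixed-point quotient."""
    return (num << FRAC_BITS) // den

def from_q10_10(x):
    """Convert a Q10.10 value back to a real number."""
    return x / (1 << FRAC_BITS)
```

For example, 1/3 computed this way differs from the exact value by roughly 0.0003, comfortably within the stated thousandth-of-a-unit error bound.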
Once the PRs have been calculated, the downsampling process can begin. The downsampling components (one per LHE block) read and rewrite data from the previously mentioned RAM. Depending on the PR, the average of 2, 4, or 8 pixels is calculated by adding one pixel and its neighbors. To hold the values of the neighbors as they are accumulated, delays of one pixel and of one line are used in series. Since the divisors are powers of two, a shifter is sufficient to calculate the result. The main problem with variable shifters in hardware is that they may require an additional register bank; fixed shifters are more efficient because they only require dropping bits. The solution adopted for this implementation is to instantiate blocks that accumulate 2, 4, and 8 pixels with fixed shifters, and the result is then multiplexed according to the PR. The schematic of the proposed architecture is shown in Figure 4.
The last step in the process is the LHE encoding (“Calculate hops” block in
Figure 2). The main elements needed for this computation are a table of hop and update values, registers to save pixel values as they are read, and hardware to perform the predictions and calculate intermediate values.
The table of hop and update values offers a significant advantage in terms of latency at the cost of resource usage. With a pre-calculated hop table, the most time-consuming part of the algorithm is accelerated; this decision is shared with GPU implementations of the algorithm. However, the size of the table for 8-bit luminance, even exploiting its symmetry so that half of the range of original values can be omitted, is 2688 Kb with 12-bit values, which takes around 20% of the available block RAM in mid-range FPGAs. For this implementation, only one table was used. This table is shared among all the encoding blocks, which otherwise operate independently, and access to it is arbitrated with a round-robin scheme.
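The round-robin sharing of the single hop table can be sketched as a simple arbiter: each cycle, at most one pending request is granted, and the grant pointer advances past the winner so every encoding block gets regular turns. The function name and interface are illustrative.

```python
# Sketch of round-robin arbitration for the shared hop table:
# `requests` flags which encoding blocks want table access this
# cycle; the grant rotates starting just after the last winner.

def round_robin_grant(requests, last_grant):
    """Return the index of the next requester after last_grant,
    or None if no block is requesting access."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_grant + offset) % n
        if requests[candidate]:
            return candidate
    return None
```

Because the grant always resumes just after the previous winner, no requester can be starved even when all blocks request access every cycle.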
A prediction based on previous values is calculated for each pixel, so an entire line of pixels is saved in registers. Depending on the pixel’s position relative to the edges of the block, different positions in the pixel register are used to calculate the combinations in Figure 1. As shown in Figure 5, a one-line buffer is enough to access all the combinations, provided the current pixel is not written to the buffer until the calculations are finished. The system always has access to the value of the current pixel (not yet in the buffer), the previous one in the line (in the buffer), the one above (in the buffer and not yet replaced), and the one to the upper-right (in the buffer). The values of the predictions, calculated as the average of the neighbors, are then used as inputs to the hop table, which outputs the hop values. Next, the base value for the first hop and the intermediate gradients are recalculated, which in the case of LHE involves a series of comparisons that can be implemented using multiplexers. After a hop value is produced, the final result is sent from the FPGA for streaming. Because the hop table access is round-robin, pixels from different LHE blocks can often interleave.
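The single-line buffer scheme can be sketched as below: positions at or after the current column still hold values from the line above, earlier positions already hold the current line, and the current pixel is written back only after its neighbors have been read. Variable names are illustrative; the edge handling here is a simplified stand-in for the compromises described above.

```python
# Sketch of the one-line buffer: for each pixel, read the left,
# above, and upper-right neighbors before overwriting the buffer
# slot, so a single line of storage suffices.

def encode_line(line_buffer, new_line):
    """Yield (current, left, above, upper_right) for each pixel,
    overwriting the buffer in place only after the neighbors
    have been read."""
    width = len(line_buffer)
    left = None
    for x, current in enumerate(new_line):
        above = line_buffer[x]                       # not yet replaced
        upper_right = line_buffer[x + 1] if x + 1 < width else above
        yield current, left, above, upper_right
        line_buffer[x] = current                     # now safe to overwrite
        left = current
```

After processing a line, the buffer holds that line, ready to serve as the "above" row for the next one.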