Digital Image Decoder for Efficient Hardware Implementation

Increasing the resolution of digital images and the frame rate of video sequences increases the amount of logic and memory resources required for digital image and video decompression. The development of new hardware architectures for digital image decoders with a reduced amount of utilized logic and memory resources has therefore become a necessity. In this paper, a digital image decoder for efficient hardware implementation is presented. Each block of the proposed digital image decoder is described. The entropy decoder, decoding probability estimator, dequantizer and inverse subband transformer (the parts of the digital image decoder) have been developed in a way that allows efficient hardware implementation with a reduced amount of utilized logic and memory resources. It is shown that the proposed hardware realization of the inverse subband transformer requires 20% less memory capacity and uses fewer logic resources than the best state-of-the-art realizations. The proposed digital image decoder has been implemented in a low-cost FPGA device, and it is shown that it requires at least 32% less memory than other state-of-the-art decoders that can process high-definition frames. The proposed solution also requires effectively less memory than state-of-the-art architectures that process frame or tile sizes smaller than high definition. The maximum operating frequency of the presented digital image decoder is comparable with the highest maximum operating frequencies among the state-of-the-art solutions.


Introduction
The development of new techniques, and the improvement of existing ones, for the compression and decompression of digital images and videos is very topical today. There is a constant need to improve the quality and resolution of digital images and to increase the frame rate and duration of video sequences. All of this results in an increase in the amount of logical and memory resources required for digital image and video processing and storage. Therefore, the improvement of existing techniques and the development of new techniques and hardware architectures for digital image and video compression and decompression, which decrease the amount of required logical and memory resources, are the only answers to these challenges, and many efforts are directed towards achieving that goal. A hardware implementation of a 3-D DCT based image decoder with two algorithms to reduce the number of computations and the amount of utilized hardware resources has been presented in [1]. A flexible, line-based JPEG 2000 decoder with a customizable level of parallelization, without the need to use external memory, has been described in [2]. An FPGA implementation of a high-performance MPEG-4 simple profile video decoder, capable of parsing multiple bitstreams from different encoder sources, has been proposed in [3]. The architecture design of an H.264/AVC decoder described in [4] allows efficient FPGA implementation. A flexible hardware JPEG 2000 decoder for digital cinema, presented in [5], intended for implementation in a single FPGA device, requires a reduced amount of logic and memory resources. A hardware JPEG 2000 decoder architecture based on the DCI specification, which can decode digital cinema frames without accessing any external memory and supports the decoding process in accordance with the order of output images, with reduced storage resources for middle states and temporary image data, has been proposed in [6].
The design and implementation of an efficient memory video decoder with increased effective memory bandwidth has been presented in [7]. An FPGA implementation of a full HD real-time high efficiency video coding main profile decoder, solving both real-time and power constraints, has been proposed in [8]. The hardware implementation of a full HD capable H.265/HEVC video decoder, presented in [9], targeted constraints related to hardware costs. A video decoder implemented on FPGAs using 3 × 3 and 2 × 2 networks-on-chip, with communication between the decoder modules performed via a network-on-chip, has been described in [10].
The block diagram of the state-of-the-art digital image decoder is shown in Figure 1. It consists of an entropy decoder, a decoding probability estimator, a dequantizer and an inverse subband transformer. The input compressed image is first received and processed by the entropy decoder, which forwards its output data to the decoding probability estimator. The decoding probability estimator reconstructs the symbol probabilities within the specified contexts, sends them to the dequantizer and feeds them back to the entropy decoder. These data samples are processed by the dequantizer, which produces dequantized data samples in the case of lossy compression or only forwards the received data samples to the inverse subband transformer in the case of lossless compression. The inverse subband transformer performs inverse filtering and composition of the data samples received from the dequantizer and generates the pixels of the output decompressed image. As has been shown in [11][12][13], arithmetic coding, which can theoretically remove all redundant information from the digital message, ensures the highest compression ratio. The arithmetic Q-coder has been presented in [14][15][16][17] and the arithmetic Z-coder has been described in [18][19][20][21]. In the well-known JPEG 2000 still image compression standard, the MQ arithmetic coder is used, which is similar to the QM-coder adopted in the original JPEG image compression standard described in [22][23][24]. The inverse process of the range encoding presented in [25] has been adopted as the basis of the decoding process implemented in the hardware realization proposed in this paper. The decoding process proposed in this paper is performed for every level of composition and every subband separately. Due to its high performance, the uniform scalar quantizer with dead-zone [26] is very often used for quantization purposes. For that reason, it is adopted as the basis for the hardware realization of the dequantizer proposed in this paper.
The inverse subband transformer within the proposed digital image decoder is based on the two-dimensional (2-D) discrete wavelet transform (DWT) with Le Gall's 5/3 filters, which is also a part of the JPEG 2000 still image standard, due to its very good performance. The state-of-the-art hardware architectures of the 2-D DWT are mainly convolution-based or lifting-based. Convolution-based hardware architectures [27][28][29][30][31][32] are usually more complex and utilize a larger amount of logic and memory resources. Lifting-based hardware architectures [33][34][35][36][37][38] are usually simpler, have lower computational complexity and utilize a smaller amount of logic and memory resources. The most efficient hardware architectures of the 1-D DWT and 2-D DWT are described in [28] and . The concept of the proposed 2-D DWT with 5/3 filters and its hardware architecture are presented in [61][62][63].
The digital image decoder for efficient hardware implementation presented in this paper has the same block diagram at the highest level of hierarchy as the state-of-the-art decoder shown in Figure 1. However, the internal blocks of the proposed digital image decoder have been developed with the intention to reduce the amount of utilized memory and logic resources and to optimize the hardware architecture of each internal block as well as of the entire digital image decoder. Some initial research results related to this topic have been presented in [64]. This paper has the following structure: Section 2 describes the proposed entropy decoder and decoding probability estimator. The proposed dequantizer is presented in Section 3. A description of the proposed inverse subband transformer, based on the two-dimensional (2-D) DWT, can be found in Section 4. Section 5 contains the synthesis results of the hardware realization of the entire digital image decoder proposed in this paper. A brief conclusion is presented in Section 6.

Entropy Decoder and Decoding Probability Estimator
The hardware realization of the entropy decoder and decoding probability estimator presented in this paper is based on the inverse process of the range encoding described in [25,65]. The decoding process is performed for every level of composition and every subband separately.
During the process of image compression, the samples of the components of the decomposed signal C (generated by the direct subband transformer) had been split into magnitude and sign pairs, M = |C| and S = sign(C). Magnitudes were then classified into magnitude-set indexes MS, each of which contains a group of magnitudes with similar values. A residual R had been defined as the difference between the magnitude M and the lower limit of the sample of the component of the decomposed signal, R = M − M_lower_limit. Magnitude-set indexes MS and the lower limits of the samples of the components of the decomposed signal M_lower_limit are determined based on Table 1.
In the further process of image compression, MS, S and R had been encoded separately. In order to obtain a higher compression ratio, symbols had been defined based on a contextual model which contains neighboring data samples, as shown in Figure 2. These contexts are also used in the process of decoding as a part of image decompression.

Table 1. Magnitude-set indexes MS with the corresponding lower limits, upper limits and numbers of residual values (16-bit magnitudes).

MS    M_lower_limit    M_upper_limit    Residual values
0     0                0                1
1     1                1                1
2     2                2                1
3     3                3                1
4     4                5                2
5     6                7                2
6     8                11               4
7     12               15               4
8     16               23               8
9     24               31               8
10    32               47               16
11    48               63               16
12    64               95               32
13    96               127              32
14    128              191              64
15    192              255              64
16    256              383              128
17    384              511              128
18    512              767              256
19    768              1023             256
20    1024             1535             512
21    1536             2047             512
22    2048             3071             1024
23    3072             4095             1024
24    4096             6143             2048
25    6144             8191             2048
26    8192             12,287           4096
27    12,288           16,383           4096
28    16,384           24,575           8192
29    24,576           32,767           8192
30    32,768           49,151           16,384
31    49,152           65,535           16,384

A flowchart of the entropy decoder and the decoding probability estimator, based on single-pass adaptive histograms with fast adaptation, is shown in Figure 3. The adaptation process starts from a uniform distribution and requires several data samples to complete.
The adaptation time is proportional to the number of histogram bins and the difference between the uniform distribution and the exact distribution of the variable being decoded.
First, the values of the neighborhood magnitude-set indexes MS i (shown in Figure 2) of already encoded samples of the components of the decomposed signal are loaded and their mean value MS is calculated. Based on the calculated value MS, the magnitude context MC, which represents the index of the appropriate adaptive magnitude histogram h[MC], is determined, which is then used for the decoding of the magnitude-set index MS using the range decoder. The magnitude context MC is limited by a constant ML, with preferable value ML = 4, because the local variance can increase significantly near the sharp edges in the image, which would lead to a large number of histograms and their slow adaptation.
The number of magnitude histograms MH, i.e., the number of different magnitude contexts MC, is preferably limited to MH = ML + 1 = 5. After decoding the magnitude-set index MS, the magnitude histogram h[MC] is updated.
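As an illustration of the magnitude-context selection just described, a short sketch follows; the exact rounding of the mean and the clipping convention are assumptions, since the text fixes only the limiting constant ML = 4 and the histogram count MH = ML + 1 = 5.

```python
ML = 4                      # limiting constant from the text (preferable value)
MH = ML + 1                 # number of magnitude histograms, contexts 0..ML

def magnitude_context(neighbor_ms):
    """Derive the magnitude context MC from the magnitude-set indexes of the
    already decoded neighbors (Figure 2).  The mean is rounded to the nearest
    integer (an assumption) and clipped to ML, so that sharp edges in the
    image cannot create many slowly adapting histograms."""
    mean_ms = round(sum(neighbor_ms) / len(neighbor_ms))
    return min(mean_ms, ML)
```

The clipping step is what keeps the number of histograms small: any neighborhood with a mean magnitude-set index above ML shares the single context MC = ML.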
In case of MS = 0, the sign S is not decoded at all. In case of MS ≠ 0, the neighborhood sign values S i (shown in Figure 2) of already encoded samples of the components of the decomposed signal are loaded and then used for the decoding of a ternary context TC.
Based on the ternary context TC, the sign context SC is then determined using the CTX table represented as Table 2. The CTX table translates 81 different values of the ternary context TC into a preferable number of five different values of the sign context SC for each of the subbands (which is also the number of sign histograms SH), because a large number of different sign context SC values would lead to histograms that do not adapt at all. This very small number is justified by the fact that the more probable sign S is decoded, which is assured by appropriate examination of the sign and, if necessary, by inversion of the sign S using the NEG table represented as Table 3. Ternary contexts TC with NS = NEG[TC] = 0 correspond to a higher probability of a positive sign P(0) than of a negative sign P(1). Ternary contexts TC with NS = NEG[TC] = 1 correspond to a higher probability of a negative sign P(1) than of a positive sign P(0). The sign context SC represents the index of the appropriate adaptive sign histogram g[SC], which is then used for decoding the sign S using the range decoder. After decoding the sign S, the sign histogram g[SC] is updated.
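The 81 ternary contexts arise from packing four neighbor sign states (zero, positive, negative) into one base-3 number. The sketch below illustrates that packing and the CTX/NEG lookups; the table contents here are arbitrary placeholders, not the paper's Tables 2 and 3, which differ per subband.

```python
# Ternary digit per neighbor: 0 = sample is zero, 1 = positive, 2 = negative.
def ternary_context(neighbor_signs):
    """Pack four neighbor sign states into one of 3**4 = 81 ternary contexts."""
    assert len(neighbor_signs) == 4 and all(t in (0, 1, 2) for t in neighbor_signs)
    tc = 0
    for t in neighbor_signs:
        tc = 3 * tc + t
    return tc

# Placeholder tables: the real contents are Tables 2 and 3 of the paper.
CTX = [tc % 5 for tc in range(81)]   # 81 ternary contexts -> 5 sign contexts SC
NEG = [0] * 81                       # 1 where a negative sign is more probable

def decode_sign(neighbor_signs, decoded_bit):
    """Map the neighborhood to a sign context SC and undo the NEG inversion,
    so that the range decoder always decodes the more probable sign."""
    tc = ternary_context(neighbor_signs)
    sc = CTX[tc]                     # index of the adaptive sign histogram g[SC]
    sign = decoded_bit ^ NEG[tc]     # invert the decoded sign when NEG[tc] = 1
    return sc, sign
```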
After that, the encoded value of the residual is loaded and decoded using a decoder with a variable length code (INVVLC). Based on the already decoded value of the magnitude-set index MS, using Table 1 (given for 16-bit values of the samples of the components of the decomposed signal), the lower limit of the sample of the component of the decomposed signal M_lower_limit is determined and then summed with the decoded value of the residual R, forming the decoded value of the magnitude, M = M_lower_limit + R, as shown in Equation (4).
Finally, at the very end of the decoding process, the decoded value of the samples of the components of the decomposed signal C is formed based on the already decoded magnitude M and sign S values.
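Table 1 has enough structure to be generated programmatically: MS 0 to 3 cover single values, after which the interval widths double every two indexes. A sketch that rebuilds the lower limits and checks the round trip between a magnitude M and its (MS, R) pair of Equation (4):

```python
import bisect

def build_magnitude_sets():
    """Rebuild the lower limits of Table 1 for 16-bit magnitudes:
    MS 0..3 cover single values 0..3; after that the lower limits
    a and a + a/2 are appended while a doubles (4, 6, 8, 12, 16, ...)."""
    lows = [0, 1, 2, 3]
    a = 4
    while a < 65536:
        lows += [a, a + a // 2]
        a *= 2
    return lows                      # lows[MS] = M_lower_limit, MS = 0..31

LOWS = build_magnitude_sets()

def split_magnitude(m):
    """Magnitude M -> (magnitude-set index MS, residual R)."""
    ms = bisect.bisect_right(LOWS, m) - 1
    return ms, m - LOWS[ms]

def join_magnitude(ms, r):
    """(MS, R) -> magnitude, as in Equation (4): M = M_lower_limit + R."""
    return LOWS[ms] + r
```

The round-trip property is what the decoder relies on: once MS has been decoded from its context and R from the INVVLC, `join_magnitude` reproduces M exactly.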
The initialization flowchart for histograms with fast adaptation is shown in Figure 4. Each histogram bin corresponds to a single symbol x, which can be MS for a magnitude histogram or S for a sign histogram. The state-of-the-art method for estimating the probability p(x) of an occurrence of the symbol x is based on the number u(x) of occurrences of the symbol x and the number of occurrences of all symbols, Total: p(x) = u(x)/Total. Additionally, it is possible to define the cumulative probability P(x) as the sum of the probabilities of all symbols y that precede the symbol x in the alphabet.
The main drawback of this simple method is that Total is an arbitrary integer, which means that a division operation is necessary in order to calculate the probability p(x). However, in the proposed hardware realization of the entropy decoder and decoding probability estimator, the division operation is replaced by a shift right operation for w bits, due to Total = 2^w. Another drawback of this method is the slow adaptation of the probability p(x), due to the averaging process. In the proposed hardware realization, the adaptation of the probability p(x) is provided by low-pass filtering of the binary sequence I(j), which represents the occurrence of the symbol x in a sequence y of symbols: I(j) = 1 if y(j) = x and I(j) = 0 otherwise. The time response of the mentioned low-pass filter is very important, since it is well known that a bigger time constant of the low-pass filter provides more accurate steady-state estimation, while a smaller time constant provides faster estimation. This trade-off is especially pronounced at the beginning of the adaptation process, due to a lack of information. In order to avoid a compromise in a fixed choice of the dominant pole of the low-pass filter, variation of the dominant pole between a minimum and a maximum value is implemented.
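A scalar sketch of the two ideas above: Total is fixed to a power of two so that the division becomes a shift, and the probability is tracked by a first-order low-pass filter of the occurrence indicator I(j). The concrete fixed-point width and filter constant chosen here are assumptions.

```python
W = 12
TOTAL = 1 << W      # Total = 2**w: p(x) = Mp(x)/Total needs only a shift right

def update_probability(mp, occurred, k_log2):
    """One low-pass step of the scaled probability Mp(x) = Total * p(x)
    toward Total * I(j).  Keeping h.k = 2**k_log2 a power of two makes the
    filter division a shift as well."""
    target = TOTAL if occurred else 0
    return mp + ((target - mp) >> k_log2)

# Feed a 25%-ones indicator stream and watch Mp(x) settle near Total/4.
mp = TOTAL // 2
for j in range(4096):
    mp = update_probability(mp, j % 4 == 0, k_log2=5)
```

With `k_log2 = 5` the estimate ripples around Total/4 = 1024, illustrating the trade-off in the text: a larger k gives a smoother estimate, a smaller k tracks changes faster.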
According to the histogram initialization flowchart shown in Figure 4, the values of the variables are first loaded and, based on them, the variables within the histogram structure h are initialized. In that flowchart, the parameter i represents the histogram bin index, which can have values in the range from 1 to imax. The parameter imax represents the maximum value of the index i of the non-zero histogram, i.e., the total number of different symbols in the alphabet, which is preferably less than or equal to 32 for the magnitude histogram or equal to 2 for the sign histogram. The parameter h.P() represents the string of cumulative probabilities, h.P(i) = Total · P(i). The parameter h.k is the reciprocal of the absolute dominant pole value of the low-pass filter. Variation of its value between h.kmin and h.kmax allows fast adaptation of the histogram after the start. The parameter h.kmax represents the reciprocal value of the minimum absolute dominant pole of the low-pass filter and it is a fixed empirical parameter with a preferable value less than Total. The parameter h.kmin represents the reciprocal value of the maximum absolute dominant pole of the low-pass filter and it is a fixed parameter with the preferable value h.kmin = 2. The total number of symbols within the histogram increased by 1 is represented by the parameter h.i. Finally, the parameter h.itmp represents the temporary value of the parameter h.i before the parameter h.k is changed.
After initializing the variables within the histogram structure h, in accordance with the flowchart shown in Figure 4, the step size h.s is calculated, the index i is initialized and the histogram is initialized. This is followed by incrementing the index i and examining its value. The last step is the initialization of the last histogram bin. Figure 5 shows the update flowchart for the histogram with fast adaptation, based on the input symbol x and the already described histogram structure h. Since the range decoder cannot operate with an estimated zero probability p(x) = 0, even for symbols that do not occur at all, there is a need to modify the binary sequence I(j). Another reason for modifying the binary sequence I(j) is the fact that the modified probability Mp(x) = Total · p(x) is estimated using fixed-point arithmetic. Adaptation of the probability p(x) is performed by low-pass filtering of the modified binary sequence MI(j) defined by Equation (12). Figure 5. The update flowchart for histogram with fast adaptation.
The maximum probability maxp(x) and the minimum probability minp(x) can be represented as: The preferable low-pass filter is the first-order IIR filter in which the divide operation is avoided by keeping the parameter h.k a power of two during its variation: Instead of updating the modified probability Mp(x), a modified cumulative probability MP(x) = Total · P(x) is updated, i.e., the string of cumulative probabilities h.P() is updated. The constant K h , which is used for the fast adaptation of histograms, and the histogram bin index i are initialized first. Then, i − 1 is added to the cumulative probability h.P(i) prescaled with the constant K h , which is equivalent to adding one to the number u(x). This is followed by an update of the cumulative probability h.P(i), only for histograms with the index i greater than or equal to x, which is determined by the previous examination of the values of these parameters.
In the rest of the histogram update algorithm, the histogram is updated according to the corresponding mathematical formulas, where the preferable value h.kmin = 2 is important for the first h.k during the process of fast adaptation. The described method for the fast adaptation of histograms has significant advantages in comparison with state-of-the-art methods. Modifications of estimated probabilities are large at the beginning of the estimation process and much smaller later, which makes possible the detection of small local probability variations and increases the compression ratio.

Figure 6 shows a flowchart of the state-of-the-art range decoder, which, together with the state-of-the-art range encoder, is described in [66][67][68]. Decoding is performed using a lookup table LUT (Equation (18)), which is compatible with Equations (19)-(22) for encoding the symbol x (the symbol x had been encoded in the buffer of width s = b^w in the form of a number i). In the flowchart from Figure 6, the operators <<, >>, %, | and & are borrowed from the C/C++ programming language. The floating point range decoder algorithm after the renormalization and without checking the boundary conditions is described by the corresponding equations; after introduction of the prescaled range r, the integer range decoder algorithm after the renormalization and without checking the boundary conditions becomes x ⇐ LUTr(t). Digits of the symbol x in base b are input from the input buffer. First, the extra bits are ignored according to the concept of extra bits; in this particular case, the first byte is a dummy one.
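Before turning to the range decoder in detail, the histogram machinery of Figures 4 and 5 can be sketched together: uniform initialization, a low-pass update of the scaled cumulative probabilities h.P, and doubling of h.k from h.kmin = 2 so that early updates adapt quickly. The doubling schedule, the bound k_max and the omission of the MI(j) modification that keeps p(x) nonzero are simplifications of the paper's algorithm.

```python
W3 = 12
TOTAL = 1 << W3                  # Total = 2**w3: division by Total is a shift

class Histogram:
    """Cumulative histogram h.P(0..imax), with h.P(0) = 0, h.P(imax) = Total."""
    def __init__(self, imax, k_min=2, k_max=256):
        self.imax = imax
        step = TOTAL // imax     # step size h.s: uniform start, P(i) = i * h.s
        self.P = [i * step for i in range(imax)] + [TOTAL]
        self.k = k_min           # h.k: reciprocal of the dominant pole, power of 2
        self.k_max = k_max
        self.updates = 0

    def prob(self, x):
        return self.P[x + 1] - self.P[x]        # Mp(x) = Total * p(x)

    def update(self, x):
        """Low-pass step of every interior bin toward the cumulative
        indicator of symbol x (Total for bins above x, 0 otherwise)."""
        shift = self.k.bit_length() - 1         # divide by k as a shift
        for i in range(1, self.imax):
            target = TOTAL if i > x else 0
            self.P[i] += (target - self.P[i]) >> shift
        self.updates += 1
        if self.updates >= self.k and self.k < self.k_max:
            self.k *= 2                         # fast adaptation: slow down later
            self.updates = 0
```

The update preserves monotonicity of h.P, so the bin widths always form a valid (scaled) probability distribution for the range decoder.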
Before the start of the range decoding process, the corresponding variables need to be initialized. The first part of the range decoding algorithm shown in Figure 6 performs renormalization before decoding, according to the initial examination block. Then, the appropriate bits are written into the variable B and a new symbol d is input in the appropriate input block. After that, the variable B is updated using the appropriate shift operation and the variable R is updated by shifting.
The second part of the range decoding algorithm shown in Figure 6 updates the range. First, the prescaled range r for all symbols is updated using the first division operation. This is followed by deriving the cumulative number of occurrences t of the current symbol using the second division operation, and then by limiting the value of t if the corresponding condition is met. The next step is to find the appropriate symbol x based on the value of the parameter t and then to prescale the value of the parameter t. The value of the parameter B is updated, followed by the update of the value of the parameter R using the second multiplication operation with u(x) for the current symbol x, for all symbols except the last one. In the case of the last symbol, the value of the parameter R is updated using the subtraction operation. After the decoding of all data is completed, the final renormalization is performed.
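A minimal byte-wise range coder in the classic carry-counting (LZMA-style) form, given as an illustrative stand-in rather than the paper's exact Figure 6 decoder, makes the operation counts visible: `decode_target()` contains the two divisions and `consume()` the two multiplications per symbol, and the first output byte is the dummy byte mentioned above.

```python
TOP = 1 << 24                            # renormalization threshold

class RangeEncoder:
    def __init__(self):
        self.low, self.range = 0, 0xFFFFFFFF
        self.cache, self.cache_size = 0, 1
        self.out = bytearray()

    def _shift_low(self):
        # Emit the settled byte, propagating a possible carry out of low.
        if self.low < 0xFF000000 or self.low > 0xFFFFFFFF:
            carry = self.low >> 32
            self.out.append((self.cache + carry) & 0xFF)
            for _ in range(self.cache_size - 1):
                self.out.append((0xFF + carry) & 0xFF)
            self.cache = (self.low >> 24) & 0xFF
            self.cache_size = 0
        self.cache_size += 1
        self.low = (self.low << 8) & 0xFFFFFFFF

    def encode(self, cum, freq, total):
        r = self.range // total          # division by Total
        self.low += r * cum              # first multiplication
        self.range = r * freq            # second multiplication
        while self.range < TOP:
            self.range <<= 8
            self._shift_low()

    def finish(self):
        for _ in range(5):
            self._shift_low()
        return bytes(self.out)

class RangeDecoder:
    def __init__(self, data):
        self.data, self.pos = data, 1    # the first byte is a dummy one
        self.code, self.range = 0, 0xFFFFFFFF
        for _ in range(4):
            self._next_byte()

    def _next_byte(self):
        b = self.data[self.pos] if self.pos < len(self.data) else 0
        self.pos += 1
        self.code = (self.code << 8) | b

    def decode_target(self, total):
        self._r = self.range // total            # first division (a shift if total = 2**w3)
        return min(self.code // self._r, total - 1)   # second division, t limited

    def consume(self, cum, freq):
        self.code -= self._r * cum       # first multiplication
        self.range = self._r * freq      # second multiplication
        while self.range < TOP:          # renormalization
            self.range <<= 8
            self._next_byte()
```

Usage: the caller encodes each symbol with its cumulative count and frequency, and on the decode side maps the returned target t back to a symbol whose cumulative interval contains it.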
In the state-of-the-art range decoder, the first division operation by Total can be implemented with a shift right operation for w3 bits in the case when Total = 2^w3, which is provided by the decoding probability estimator. However, the second division operation cannot be eliminated, which contributes to the increasing complexity of the decoder processor, because a large number of existing digital signal processors do not support the division operation. Additionally, there are two multiplication operations per each symbol of the compressed image in the range decoder, which contributes to reducing the processing speed in general-purpose microprocessors. These drawbacks have been eliminated in the range decoder described in this paper. Figure 7 shows the flowchart of the range decoder proposed in this paper without division operations and, optionally, without multiplication operations. The first division operation by Total = 2^w3 (when calculating the parameter r value) is implemented by the shift right operation for w3 bits, due to the fast adaptation of histograms described in this paper. The parameter r is then represented as r = V · 2^l, and the first multiplication operation is implemented by multiplication with a small number V and a shift left operation for l bits in order to calculate the value of the parameter t.
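The decomposition r = V · 2^l can be obtained by keeping only the leading bits of r. The sketch below keeps at most four bits (so V ≤ 15, matching the small divisors discussed in the text) and shows t computed with a small division plus a shift; the kept width is an assumption.

```python
def decompose(r, v_bits=4):
    """Approximate r as V * 2**l with V = the top v_bits bits of r (truncated),
    so that V * 2**l <= r < (V + 1) * 2**l."""
    l = max(r.bit_length() - v_bits, 0)
    return r >> l, l

def prescaled_divide(b, v, l):
    """t = B // (V * 2**l), computed as a shift followed by a small division.
    For nonnegative integers this equals B // (V << l) exactly."""
    return (b >> l) // v
```

It is the replacement of the exact r by the truncated V · 2^l that introduces the small approximation error discussed below; the division and multiplication by V themselves are exact.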
The second multiplication operation is performed when calculating the parameter R value, by multiplying with the small number V and a shift left operation for l bits. Both multiplication operations with the small number V are significantly simplified due to the small number of bits used to represent V. Furthermore, the multiplication with small odd numbers, V = 3 or V = 5, can be implemented by a combination of shift and add operations, which completely eliminates the multiplication operations. The second division operation by r, when calculating the parameter t value, is implemented by a division operation with the small number V and a shift right operation for l bits. In this case, the division operation by the constant small odd numbers V = 3, V = 5, V = 9, V = 11, V = 13 or V = 15 can be implemented with one multiplication operation and one shift right operation according to Table 4, as disclosed in [69,70]. Specifically, the division operation by V = 7 is the most complex, because it requires the implementation of an addition operation with 049240249h and an addition operation with carry and 0h between the multiplication and shift right operations shown in Table 4.
The approximations used in the implementation of the multiplication and division operations in the proposed range decoder lead to a smaller compression/decompression ratio. For example, by fixing V = 1, it is possible to completely eliminate all multiplication and division operations, but this also causes the largest approximation error and the largest decrease of the compression/decompression ratio, although not more than 5%. On the other hand, if V is allowed to be V = 1 or V = 3, the compression/decompression ratio is decreased by less than 1%. Tables 5 and 6 show the difference in the number of multiplication and division operations per decoded symbol between the state-of-the-art range decoder and the range decoder proposed in this paper. The approximations implemented in the proposed range decoder cause a negligible decrease of the compression/decompression ratio and, in contrast, significantly reduce the hardware complexity of the realization.
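Both simplifications can be checked exhaustively for small operands. The magic constant below is the classic 32-bit reciprocal for V = 3 (multiply by 0AAAAAAABh, shift right by 33), shown as one representative of the multiply-and-shift technique; the paper's own constants are those of Table 4.

```python
def mul3(x):
    return (x << 1) + x              # 3x as one shift and one add

def mul5(x):
    return (x << 2) + x              # 5x as one shift and one add

def div3_u32(x):
    """Division of a 32-bit unsigned x by 3 via one multiply and one shift,
    using the classic reciprocal constant 0AAAAAAABh."""
    assert 0 <= x <= 0xFFFFFFFF
    return (x * 0xAAAAAAAB) >> 33
```

In hardware, the shift-and-add forms cost only an adder, while the multiply-and-shift division maps onto an existing multiplier, which is exactly why the small odd divisors of Table 4 are attractive.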

Dequantizer
Dequantization is performed only in the case of lossy compression; in the case of lossless compression, data samples from the input of the dequantizer are simply routed to its output. The dequantizer proposed in this paper performs dequantization for data samples which had been previously quantized with the uniform scalar quantizer with dead-zone, with quantization step ∆b and dead-zone width 2∆b, as shown in Figure 8. Figure 8. Illustration of the quantization process with the uniform scalar quantizer with dead-zone.
Generally, each subband b (HH, HL, LH or LL) has its own quantization step ∆b, calculated based on the dynamic range of the data samples which represent the components of the decomposed signal from subband b. This approach provides a higher compression/decompression ratio. Equation (37) describes the quantization process with the uniform scalar quantizer with dead-zone, where yb represents the component of the decomposed signal from subband b and qb represents the resulting quantized value of the data sample.
In order to avoid the division operation and to reduce the hardware complexity of the quantizer and the dequantizer, the quantization steps for all four subbands from a particular level of decomposition i are adopted as values represented by a mantissa and a power of two, where M represents the mantissa (an integer from the range 64 ≤ M ≤ 127) and E represents the exponent (an integer from the range −6 ≤ E ≤ 6).
Dequantized absolute values of data samples which represent the components of the decomposed signal from subbands HH, HL, LH or LL, at level i of composition, are calculated according to the following equations: The hardware complexity of the dequantizer proposed in this paper is significantly reduced, since the multiplication operation by power of two is implemented by using permanently shifted hardware connections between input and output bit lines, and due to multiplication with narrow-range integer M, which is implemented by a simple lookup table.
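As an illustration, the quantization and dequantization steps described above can be sketched in software. The exact equations are not reproduced in this excerpt, so the standard dead-zone forms q = sign(y)·⌊|y|/∆⌋ and y′ = sign(q)·(|q| + 0.5)·∆ with ∆ = M · 2^E are assumed; the function names are illustrative.

```python
def quantize(y: int, M: int, E: int) -> int:
    """Dead-zone quantization q = sign(y) * floor(|y| / delta), delta = M * 2**E.
    The division by 2**E is a bit shift; only the division by M remains."""
    assert 64 <= M <= 127 and -6 <= E <= 6
    mag = abs(y)
    q = (mag >> E) // M if E >= 0 else (mag << -E) // M
    return -q if y < 0 else q

def dequantize(q: int, M: int, E: int) -> float:
    """Mid-point reconstruction y' = sign(q) * (|q| + 0.5) * M * 2**E for q != 0.
    In hardware the product |q| * M would come from a lookup table and the
    2**E factor from shifted wiring; plain arithmetic is used here."""
    if q == 0:
        return 0.0
    mag = (abs(q) + 0.5) * M * 2.0 ** E
    return -mag if q < 0 else mag
```

Note that the dead zone arises because all inputs with |y| < ∆ map to q = 0, while the mid-point reconstruction places each dequantized value in the middle of its quantization interval.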

Inverse Subband Transformer
The inverse subband transformer is an important part of the digital image decoder from the aspect of memory resource utilization. An optimal realization of the inverse subband transformer can make an important contribution to reducing the capacity of the used memory, while neglecting its optimization could lead to a significant increase in the amount of utilized memory resources.
The proposed hardware realization of the inverse subband transformer is based on the 2-D DWT with 5/3 filters. Equation (43) describes the one-dimensional (1-D) inverse low-pass Le Gall 5/3 filter, while Equation (44) describes the 1-D inverse high-pass Le Gall 5/3 filter. The basic building block utilized for 2-D DWT filtering is the non-stationary hardware realization of the 1-D inverse 5/3 filter shown in Figure 9.
The control signal c controls four switches, providing two different topologies of the filter: one topology for input data samples y[n] with even indexes n = 2p and another topology for input data samples y[n] with odd indexes n = 2p + 1. The control signal c is at low level (c = 0) for every input data sample y[n] with even index n, when the two upper switches are closed while the two lower switches are opened. The control signal c is at high level (c = 1) for every input data sample y[n] with odd index n, when the two upper switches are opened while the two lower switches are closed. The time diagram of the control signal c in the proposed 1-D inverse 5/3 filter is shown in Figure 10. The proposed 1-D inverse DWT 5/3 filter provides output data samples for even indexes n = 2p and odd indexes n = 2p + 1 in an interleaved fashion, as shown in Figure 11.
The hardware realization of the 1-D inverse 5/3 filter from Figure 9 has been implemented on the EP4CE115F29C7 FPGA device from the Altera Cyclone IV E family [71]. The synthesis results for the proposed non-stationary filter realization and state-of-the-art convolution-based and lifting-based realizations (implemented on the same FPGA device), obtained using Altera Quartus II 10.0 software, are presented in Table 7. It can be seen that the hardware implementation of the proposed non-stationary 1-D inverse 5/3 filter utilizes the lowest number of total logic elements and registers, has the shortest critical path delay, allows the highest maximum operating frequency and has the lowest total power dissipation in comparison with the state-of-the-art realizations.
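The interleaved even/odd output structure described above can be illustrated in software. Equations (43) and (44) are not reproduced in this excerpt, so the sketch below uses the standard reversible lifting form of the Le Gall 5/3 transform (with symmetric boundary extension); it is a behavioral model of the filtering, not the proposed non-stationary hardware structure.

```python
def ifwt53(s, d):
    """1-D inverse reversible Le Gall 5/3 DWT (lifting form).
    s: low-pass coefficients, d: high-pass coefficients (equal length).
    Returns the reconstruction with even/odd samples interleaved."""
    half = len(s)
    # even (update) step, symmetric extension d[-1] -> d[0]
    even = [s[i] - (d[i - 1 if i > 0 else 0] + d[i] + 2) // 4
            for i in range(half)]
    # odd (predict) step, symmetric extension even[half] -> even[half-1]
    odd = [d[i] + (even[i] + even[min(i + 1, half - 1)]) // 2
           for i in range(half)]
    out = [0] * (2 * half)
    out[0::2] = even   # x[2p]
    out[1::2] = odd    # x[2p+1]
    return out

def fwt53(x):
    """Forward reversible 5/3 DWT (used here to check reconstruction)."""
    n = len(x)
    half = n // 2
    d = [x[2 * i + 1] - (x[2 * i] + x[min(2 * i + 2, n - 2)]) // 2
         for i in range(half)]
    s = [x[2 * i] + (d[i - 1 if i > 0 else 0] + d[i] + 2) // 4
         for i in range(half)]
    return s, d
```

Because the inverse lifting steps exactly undo the forward ones, `ifwt53(*fwt53(x))` reproduces any even-length integer input exactly, which is the reversibility property the 5/3 filter pair is chosen for.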
The block diagram of the proposed 2-D inverse DWT 5/3 architecture, with J = 7 levels of composition, is shown in Figure 12. The input data samples are the components of the decomposed signal z_LL[m, n] from level 7 of composition. The subband LL represents the data samples produced as the result of forward low-pass filtering over rows and forward low-pass filtering over columns within the direct subband transformer, which is a part of a digital image encoder. The subband HL represents the data samples produced as the result of forward low-pass filtering over rows and forward high-pass filtering over columns. The subband LH represents the data samples produced as the result of forward high-pass filtering over rows and forward low-pass filtering over columns. Finally, the subband HH represents the data samples produced as the result of forward high-pass filtering over rows and forward high-pass filtering over columns.
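The row/column roles of the four subbands can be sketched as one level of separable 2-D synthesis. The trivial reversible Haar (S-transform) pair is used below as a stand-in for the 5/3 filters to keep the example short; the subband wiring (which quadrant feeds which column/row pass) is the same for either filter bank, and all names are illustrative.

```python
def haar_syn(s, d):
    """1-D S-transform synthesis: inverts s = floor((x0 + x1) / 2), d = x0 - x1."""
    x1 = s - d // 2
    x0 = x1 + d
    return x0, x1

def inverse_level(LL, HL, LH, HH):
    """One level of separable 2-D synthesis from four (h, w) subbands,
    returning a (2h, 2w) image. Columns are synthesized first, then rows."""
    h, w = len(LL), len(LL[0])
    L = [[0] * w for _ in range(2 * h)]   # row-low-pass plane, full height
    H = [[0] * w for _ in range(2 * h)]   # row-high-pass plane, full height
    for i in range(h):
        for j in range(w):
            # column pass: LL (col-low) with HL (col-high), LH with HH
            L[2 * i][j], L[2 * i + 1][j] = haar_syn(LL[i][j], HL[i][j])
            H[2 * i][j], H[2 * i + 1][j] = haar_syn(LH[i][j], HH[i][j])
    img = [[0] * (2 * w) for _ in range(2 * h)]
    for i in range(2 * h):
        for j in range(w):
            # row pass: interleave low/high over the columns of each row
            img[i][2 * j], img[i][2 * j + 1] = haar_syn(L[i][j], H[i][j])
    return img
```

Composing J such levels, with the LL output of each level feeding the next, reconstructs the full image from the level-J LL subband and the detail subbands of every level.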
The time diagram of the 2-D inverse DWT 5/3 filtering at the beginning of even lines (starting from 0) for the first three levels of composition is shown in Figure 13. This pattern continues until the end of the even lines, and the time diagram of the 2-D inverse DWT 5/3 filtering at the end of even lines for the first three levels of composition can be seen in Figure 14. In order to ensure the proper inverse 2-D DWT 5/3 filtering of an N × N image, two lines of intermediate results have to be stored into on-chip memory (shown in Figure 17) at each level of composition. The intermediate results from level 1 of composition are stored into "On-chip memory A", which contains one FIFO buffer with a capacity of 2N data samples, while the intermediate results from all other levels of composition are stored into "On-chip memory B", which contains six FIFO buffers (in the case of J = 7 levels of composition) with the capacity halved at every succeeding level, starting from a capacity of N data samples at level 2. The total on-chip memory capacity needed for N × N image filtering with J levels of composition is therefore 2N + N + N/2 + … + N/2^(J−2) = (4 − 2^(2−J))N data samples. Due to the very low capacity of required memory, the proposed inverse 2-D DWT architecture does not require off-chip memory at all.
The comparison between the proposed architecture and the best state-of-the-art 2-D inverse 5/3 DWT architectures so far published in the literature, in terms of required capacity of on-chip and off-chip memory, is presented in Table 8. It can be concluded that the proposed 2-D inverse 5/3 DWT architecture, for an N × N image and J → ∞ levels of composition, requires a total memory capacity of 4N data samples, which is 20% lower compared with the best state-of-the-art architecture.
Table 8. Comparison of various 2-D inverse 5/3 DWT architectures.
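The line-buffer organization described above can be checked numerically. The sketch below (function names are illustrative) models the FIFO capacities: one 2N-sample buffer at level 1, then buffers halving from N samples at level 2 up to level J.

```python
def fifo_capacities(N: int, J: int) -> list:
    """Per-level FIFO capacities in data samples: 2N at level 1,
    then N, N/2, N/4, ... halving at every succeeding level up to J."""
    return [2 * N] + [N >> (level - 2) for level in range(2, J + 1)]

def total_capacity(N: int, J: int) -> int:
    """Total on-chip capacity; equals (4 - 2**(2 - J)) * N data samples."""
    return sum(fifo_capacities(N, J))
```

For N = 1024 and J = 7 this gives 4064 samples, i.e., 4N − N/32, which approaches the 4N figure quoted above as J grows.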

Synthesis Results of the Hardware Implementation of the Proposed Digital Image Decoder
The digital image decoder for efficient hardware implementation with three color planes (Y, U and V) described in this paper has been verified for functional correctness by implementation within the Altera DE2-115 development board, produced by Terasic Technologies [72], on an EP4CE115F29C7 FPGA device. The synthesis results, which show the amount of utilized resources and the maximum operating frequency of the decoder, are presented in Table 9. The comparison with the other state-of-the-art decoder architectures is presented in Table 10.
Table 10. Comparison with state-of-the-art decoder architectures.
Ref.  Frame/tile size  Memory size  Max. frequency (MHz)
[1]   160 × 120        n/a          24.15
[2]   512 × 512        1424         89.9
[3]   704 × 576        594          105.6
[4]   1920 × 1080      433,357      42.8
[5]   512 × 512        1602         116.9
[6]   2048 × 1080      2710         n/a
[7]   1920 × 1080      6192         n/a
[8]   1920 × 1080      5182         180
[9]   1920 × 1080      3277         110
[10]  …
It can be seen that the proposed hardware architecture for the digital image decoder requires at least 32% less memory resources in comparison to the other state-of-the-art decoders which can process HD frame size or HD tile size. Some state-of-the-art architectures which process a frame size or tile size smaller than HD require a total memory size lower than that of the proposed solution. However, when the frame size or tile size is taken into account as well, it can be concluded that the proposed digital image decoder architecture can process a 7.9 times larger frame/tile size while requiring only a 29% greater memory size in comparison with [2]. Similarly, the proposed digital image decoder can process a 5.1 times larger frame size while requiring only a 3.1 times greater memory size in comparison with [3]. The proposed solution can process a 5.1 times larger frame/tile size than [5] but utilizes only 15% more memory resources. Finally, the proposed digital image decoder architecture can process a 32.4 times larger frame size while requiring only a 64% greater memory size in comparison with the 2 × 2 NoC decoder from [10]. In comparison to all other state-of-the-art solutions, the proposed architecture requires a smaller memory size although it can process a larger frame size.
Additionally, it can be seen that the proposed solution for digital image decoder has lower maximum operating frequency than architectures from [5,8], but can operate at higher frequency than all other state-of-the-art architectures.

Conclusions
The digital image decoder for efficient hardware implementation presented in this paper has many advantages in comparison to state-of-the-art solutions. The proposed entropy decoder and decoding probability estimator for efficient hardware implementation reduce the hardware complexity compared to the other state-of-the-art solutions by reducing or completely eliminating multiplication and division operations. The hardware complexity of the proposed dequantizer is reduced, in comparison to the state-of-the-art solutions, by implementing the multiplication by a power of two using permanently shifted hardware connections between input and output bit lines, and by implementing the multiplication by a narrow-range integer with a simple lookup table. The proposed novel hardware realization of the inverse subband transformer, which performs the 2-D inverse 5/3 DWT, utilizes 20% less memory resources compared to the best realization so far published in the literature. As a basic building block for the 2-D inverse 5/3 DWT, a non-stationary hardware realization of the 1-D inverse 5/3 DWT filter has been used. This realization utilizes the lowest number of logic elements and registers, has the lowest total power dissipation and allows the highest operating frequency in comparison to all other realizations from the literature. The proposed digital image decoder requires at least 32% less memory resources in comparison to the other state-of-the-art decoders from the literature which can process HD frame size, and requires effectively lower memory size than state-of-the-art solutions which process a frame size or tile size smaller than HD. The presented solution for the digital image decoder has a maximum operating frequency comparable with the highest maximum operating frequencies among the state-of-the-art solutions.
Future work on the proposed digital image decoder would include modifications and optimizations which would increase the maximum operating frequency while maintaining the reduced amount of utilized logical and memory resources. Additionally, future work could include an upgrade of the proposed digital image decoder for efficient hardware implementation so that it can support the decompression of ultra-high-definition (UHD) resolution images.

Conflicts of Interest:
The authors declare no conflict of interest.