A JND-Based Pixel-Domain Algorithm and Hardware Architecture for Perceptual Image Coding

This paper presents a hardware-efficient pixel-domain just-noticeable difference (JND) model and its hardware architecture implemented on an FPGA. The JND model architecture is further proposed as part of a low-complexity pixel-domain perceptual image coding architecture based on downsampling and predictive coding. The downsampling is performed adaptively on the input image based on regions-of-interest (ROIs), identified by measuring the downsampling distortions against the visibility thresholds given by the JND model. The coding error at any pixel location can be guaranteed to be within the corresponding JND threshold in order to obtain excellent visual quality. Experimental results show the improved accuracy of the proposed JND model in estimating visual redundancies compared with classic JND models published earlier. Compression experiments demonstrate improved rate-distortion performance and visual quality over JPEG-LS, as well as reduced compressed bit rates compared with other standard codecs such as JPEG 2000 at the same peak signal-to-perceptible-noise ratio (PSPNR). FPGA synthesis results targeting a mid-range device show moderate hardware resource requirements and over 100 Megapixel/s throughput for both the JND model and the perceptual encoder.


Introduction
Advances in sensor and display technologies have led to rapid growth in data bandwidth in high-performance imaging systems. Compression is becoming imperative for such systems to address the bandwidth issue in a cost-efficient way. Moreover, in many real-time applications, there is a growing need for a compression algorithm to meet several competing requirements such as decent coding efficiency, low complexity, low latency and high visual quality [1]. It has been realized that algorithms specifically designed to meet such requirements could be desirable [2][3][4]. Compared with off-line processing systems, the computational power and memory resources in real-time high-bandwidth systems are much more limited due to the relatively tight constraints on latency, power dissipation and cost, especially in embedded systems such as display panels for ultra-high-definition contents and remote monitoring cameras with high temporal and spatial resolutions.
The use of existing transform-domain codecs such as JPEG 2000 and HEVC has been limited in real-time high-bandwidth systems, since such codecs typically require storing multiple image lines or frames. Especially when the spatial resolution of the image is high, the line or frame buffers result in both expensive on-chip memories and non-negligible latency, which are disadvantages for a cost-efficient hardware implementation of the codec, e.g., on FPGAs. While JPEG-LS is considered to strike a reasonable balance between complexity and compression ratio for lossless coding, its use in lossy coding is much less widespread due to its inferior coding efficiency compared with transform-domain codecs and stripe-like artifacts in smooth image regions. It is desirable to investigate the feasibility of a lightweight and hardware-friendly pixel-domain codec with improved compression performance as well as significantly improved visual quality over that of the lossy JPEG-LS.
One possibility is to exploit the visual redundancy associated with properties of the human visual system (HVS) in the pixel domain. Features and effects of the HVS can be modeled either in the pixel domain or in the transform domain. While effects such as the Contrast Sensitivity Function (CSF) are best described in the Fourier, DCT or Wavelet domain and hence can be exploited by compression algorithms operating in these domains [5][6][7], other effects such as visual masking can be well modeled in the pixel domain [8,9]. The term visual masking describes the phenomenon that a stimulus (such as an intensity difference in the pixel domain) is rendered invisible to the HVS by local image activities nearby, hence allowing a coarser quantization for the input image without impacting the visual quality. The masking effects of the HVS can be estimated by a visibility threshold measurement model, which ideally provides a threshold level under which the difference between the original signal and the target signal is invisible. Such a difference threshold is referred to as just-noticeable difference (JND) [10]. Compression algorithms like JPEG-LS operating in the pixel domain can be adapted to exploit pixel-domain JND models, e.g., by setting the quantization step size adaptively based on the JND thresholds. One problem with such a straightforward approach, however, is that the JND thresholds must be made available to the decoder, incurring a relatively large overhead.
A classic pixel-domain JND model was proposed by Chou and Li [9]. This model serves as a basis for various further JND models proposed in research work on perceptual image/video compression, such as Yang et al.'s model [11] and Liu et al.'s model [12], which achieve improved accuracy in estimating visual redundancies at the cost of higher complexity. A good review of JND models as well as approaches to exploit JND models in perceptual image coding was given by Wu et al. [13].
In this work, a new region-adaptive pixel-domain JND model based on efficient local operations is proposed for a more accurate detection of visibility thresholds compared with the classic JND model [9] and for a reduced complexity compared with more recent ones [11,12]. A low-complexity pixel-domain perceptual image coder [14] is then used to exploit the visibility thresholds given by the proposed JND model. The coding algorithm addresses both coding efficiency and visual quality issues in conventional pixel-domain coders in a framework of adaptive downsampling guided by perceptual regions-of-interest (ROIs) based on JND thresholds. In addition, hardware architectures for both the proposed JND model and the perceptual encoder are presented. Experimental results, including hardware resource utilization of FPGA-based implementations, show reasonable performance and moderate hardware complexity for both the proposed JND model and the perceptual encoder. The remainder of the paper is organized as follows. Section 2 reviews background and existing work on pixel-domain JND modeling. The proposed JND model and its FPGA hardware architecture are presented in Sections 3 and 4, respectively. Section 5 discusses the hardware architecture for the JND-based perceptual image coding algorithm [14]. Experimental results based on standard test images as well as FPGA synthesis results are presented in Section 6, which show the effectiveness of both the proposed JND model and the perceptual encoder. Section 7 summarizes this work.

Background in Pixel-Domain JND Modeling
In 1995, Chou and Li proposed a pixel-domain JND model [9] based on experimental results of psychophysical studies. Figure 1 illustrates Chou and Li's model. For each pixel location, two visual masking effects are considered, namely luminance masking and contrast masking, and visibility thresholds due to such effects are estimated based on functions of local pixel intensity levels. The two resulting quantities, luminance masking threshold LM and contrast masking threshold CM, are then combined by an integration function into the final JND threshold. In Chou and Li's model, the integration takes the form of the MAX(•) function, i.e., the JND threshold is modeled as the dominating effect between luminance masking and contrast masking. Basic algorithmic parts of JND modeling described in the rest of this section are mainly based on Chou and Li's model.

Luminance Masking Estimation
The luminance masking effect is modeled in [9] based on the average grey level within a 5 × 5 window centered at the current pixel location, as depicted in Figure 2a. Let BL(i, j) denote the background luminance at pixel location (i, j), with 0 ≤ i < H and 0 ≤ j < W for an image of size W × H. Let B(m, n) be a 5 × 5 matrix of weighting factors (m, n = 0, 1, 2, 3, 4). As shown in Figure 2b, a relatively larger weight (2) is given to the eight inner pixels surrounding the current pixel, since such pixels have stronger influences on the average luminance at the current pixel location. The sum of all weighting factors in matrix B is 32. While other weighting factors can be considered for evaluating the average background luminance, the matrix B used in Chou and Li's JND model [9] results in highly efficient computation and has been used in many subsequent models (see, e.g., [11,12]). Further, let p(i, j) denote the pixel grey level at (i, j). The average background luminance BL is then given by

BL(i, j) = (1/32) • Σ_{m=0}^{4} Σ_{n=0}^{4} p(i − 2 + m, j − 2 + n) • B(m, n). (1)

Obviously, Equation (1) can be implemented in hardware by additions and shifts only. It can be readily verified that 23 additions are required. Chou and Li examined the relationship between the background luminance and distortion visibility due to luminance masking based on results of subjective experiments [9,15], and concluded that the distortion visibility threshold decreases in a nonlinear manner as the background luminance changes from completely dark to middle grey (around 127 on an intensity scale from 0 to 255) and increases approximately linearly as the background luminance changes from grey to completely bright. Specifically, a square root function is used in [9] to approximate the visibility thresholds due to luminance masking for low background luminance (below 127):

LM(i, j) = T_0 • (1 − sqrt(BL(i, j)/127)) + 3, for BL(i, j) ≤ 127, (2)

whereas a linear function is used for high background luminance (above 127):

LM(i, j) = γ • (BL(i, j) − 127) + 3, for BL(i, j) > 127, (3)

where T_0 determines the visibility threshold at zero background luminance in the nonlinear region (BL(i, j) ≤ 127), while γ is the slope of the growth of the visibility threshold in the linear region where the background luminance is greater than 127. The values of parameters T_0 and γ depend on the specific application scenario, such as viewing conditions and properties of the display. Both T_0 and γ increase as the viewing distance increases, leading to higher visibility thresholds. Default values of T_0 = 17 and γ = 3/128 are used in [9], and these are also used for the JND model in this paper.
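For illustration, the luminance masking stage can be sketched in Python as a behavioural reference (not the hardware implementation); matrix B follows the description of Figure 2b, and the default parameters T_0 = 17 and γ = 3/128 are those of [9]:

```python
import numpy as np

# Weighting matrix B: the eight inner neighbours get weight 2, the outer
# ring weight 1, the centre pixel 0; the weights sum to 32.
B = np.array([[1, 1, 1, 1, 1],
              [1, 2, 2, 2, 1],
              [1, 2, 0, 2, 1],
              [1, 2, 2, 2, 1],
              [1, 1, 1, 1, 1]])

def background_luminance(window5x5):
    """Average background luminance BL over a 5x5 window (Equation (1))."""
    return float(np.sum(window5x5 * B)) / 32.0

def luminance_masking(bl, t0=17.0, gamma=3.0 / 128.0):
    """Visibility threshold LM(BL): square-root branch below 127 (Equation (2)),
    linear branch above (Equation (3))."""
    if bl <= 127:
        return t0 * (1.0 - np.sqrt(bl / 127.0)) + 3.0
    return gamma * (bl - 127.0) + 3.0
```

Since the weights in B are powers of two and their sum is 32, the products and the final division reduce to shifts and additions in hardware.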

Contrast Masking Estimation
The contrast masking effect is modeled in [9] based on: (1) the background luminance at the current pixel; and (2) luminance variations across the current pixel in the 5 × 5 JND estimation window. Luminance variations, e.g., due to edges, are measured by four spatial operators, G_1–G_4, as depicted in Figure 3. The result from an operator G_k is the weighted luminance intensity difference across the current pixel in the direction corresponding to k, with k = 1, 2, 3, 4 for vertical, diagonal 135°, diagonal 45° and horizontal difference, respectively. The kth weighted luminance intensity difference ID_k is calculated by 2D correlation of the 5 × 5 neighborhood with the kernel G_k, and the maximum weighted luminance difference MG is obtained as:

MG(i, j) = max_{k=1,2,3,4} {|ID_k(i, j)|}. (4)

In Chou and Li's model, for a fixed average background luminance, the visibility threshold due to contrast masking is a linear function of MG (also called luminance edge height in [9]):

CM(i, j) = α(i, j) • MG(i, j) + β(i, j). (5)

Both the slope α and intercept β of such a linear function depend on the background luminance BL. The relationship between α, β and BL was modeled by Chou and Li as

α(i, j) = BL(i, j) • 0.0001 + 0.115, (6)

β(i, j) = λ − BL(i, j) • 0.01. (7)

Parameter λ in Equation (7) depends on the viewing condition. The value of λ increases as the viewing distance becomes larger, leading to higher visibility thresholds. A default value of λ = 0.5 is used in [9].
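A behavioural sketch of the contrast masking threshold, given MG and BL, follows; the intercept form β = λ − 0.01 • BL is the one commonly reported for Chou and Li's model and is used here as an assumption:

```python
def alpha(bl):
    # Slope of the contrast masking line as a function of background
    # luminance (Equation (6)).
    return bl * 0.0001 + 0.115

def beta(bl, lam=0.5):
    # Intercept (Equation (7)); lambda grows with viewing distance,
    # default 0.5 as in Chou and Li's model.
    return lam - bl * 0.01

def contrast_masking(mg, bl):
    # CM = alpha(BL) * MG + beta(BL) (Equation (5)).
    return alpha(bl) * mg + beta(bl)
```

Note that for small MG the linear model can yield a small or negative CM; in the full model this is masked by the subsequent combination with the luminance masking threshold.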

Formulation of JND Threshold
In Chou and Li's model, the final JND threshold is considered to be the dominating effect between luminance masking and contrast masking:

JND(i, j) = MAX{LM(i, j), CM(i, j)}. (8)

Since in real-world visual signals there often exist multiple masking effects simultaneously, such as luminance masking and contrast masking, the integration of multiple masking effects into a final visibility threshold for the HVS is a fundamental part of a JND model [11]. Contrary to Chou and Li, who considered only the dominating effect among different masking effects, Yang et al. [11,16] proposed that: (1) in terms of the visibility threshold, the combined effect T in the presence of multiple masking effects T_1, T_2, ..., T_N is greater than that of a single masking source T_i (i = 1, 2, ..., N); and (2) the combined effect T can be modeled by a certain form of addition of individual masking effects, whereas T is smaller than a simple linear summation of the individual effects T_i, i = 1, 2, ..., N, i.e.,

MAX{T_1, T_2, ..., T_N} ≤ T < T_1 + T_2 + ... + T_N. (9)

Yang et al. [11] further proposed that the right-hand side of the above inequality is due to the overlapping of individual effects. A pair-wise overlap O_{i,j} is hence modeled for the combination of two individual masking factors T_i, T_j (i < j) by a nonlinear function γ(T_i, T_j), weighted by an empirically determined gain reduction coefficient C_{i,j} (0 < C_{i,j} < 1), i.e.,

O_{i,j} = C_{i,j} • γ(T_i, T_j). (10)

The total overlap is modeled as the sum of overlaps between any pair of masking factors. The combined visibility threshold is given by the difference between the sum of all thresholds due to individual masking effects and the total overlap, called the nonlinear-additivity model for masking (NAMM) [11]:

T = Σ_{i=1}^{N} T_i − Σ_{i<j} C_{i,j} • γ(T_i, T_j). (11)

For simplicity and for compatibility with existing models including Chou and Li's, in Yang et al.'s model [11] the nonlinear function γ is approximated as the minimum function MIN(•), and only luminance masking and contrast masking effects are considered. The result is therefore an approximation of the general model given by Equation (11). In Yang et al.'s model, the final visibility threshold at pixel location (i, j) in component θ (θ = Y, Cb, Cr) of the input image is a nonlinear combination of the luminance masking threshold T_L and an edge-weighted contrast masking threshold T_C^θ:

JND_θ(i, j) = T_L(i, j) + T_C^θ(i, j) − C_{L,C}^θ • MIN{T_L(i, j), T_C^θ(i, j)}, (12)

where C_{L,C}^θ (0 < C_{L,C}^θ < 1) is the gain reduction coefficient accounting for the overlap between the two masking effects. Chou and Li's model in Equation (8) can be regarded as a special case of Equation (12), i.e., considering the luminance image only and assuming maximum overlapping between the luminance and contrast masking effects.
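The two-effect NAMM combination can be sketched as follows; the gain reduction coefficient value c = 0.3 is an illustrative assumption, not a value prescribed by this paper:

```python
def namm(t1, t2, c=0.3):
    """Two-effect NAMM combination in the shape of Equation (12):
    T = T1 + T2 - c * MIN(T1, T2), with 0 < c < 1 the gain-reduction
    coefficient (c = 0.3 here is an assumed illustrative value)."""
    return t1 + t2 - c * min(t1, t2)
```

With c = 1 (maximum overlap) the combination degenerates to MAX(T1, T2), recovering Chou and Li's integration; with c → 0 it approaches the plain sum, so the result always lies between the two bounds of the inequality above.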

Proposed JND Model
In the proposed JND model, each input pixel is assumed to belong to one of three basic types of image regions: edge (e), texture (t) and smoothness (s). The weighting of the contrast masking effect, as well as the combination of the basic luminance masking threshold (LM) and contrast masking threshold (CM) into the final JND threshold, is dependent on the region type of the current pixel. Figure 4 illustrates the proposed JND model, where W_e, W_t and W_s are factors used for weighting the contrast masking effect in edge, texture and smooth regions, respectively. As shown in Figure 4, to combine LM and weighted CM values, the MAX(•) function is used for edge regions and NAMM is used for texture and smooth regions. Depending on the region type of a current pixel, the final output, i.e., the JND threshold for the current pixel, is selected from three candidates JND_e, JND_t and JND_s, corresponding to the visibility thresholds evaluated for the edge, texture and smooth regions, respectively. The individual treatment of edge regions in a JND model was first proposed by Yang et al. [16]. Clear edges such as object boundaries are familiar to the human brain, since they typically have simple structures and draw immediate attention from an observer. Hence, even a non-expert observer can be considered as relatively "experienced" in viewing edge regions of an image. As a result, distortions, e.g., due to lossy compression, are more easily identified at edges than in other regions with luminance non-uniformity [11,17,18]. In Yang et al.'s work [11], visibility thresholds due to contrast masking are reduced for edge regions (detected by the Canny operator) compared with non-edge regions. Weighting factors of 0.1 and 1.0 are used for edge and non-edge pixels, respectively, such that edges are preserved in a subsequent compression encoder exploiting the JND thresholds.
Textures, on the other hand, are intensity-level variations usually occurring on surfaces, e.g., due to the non-smoothness of objects such as wood and bricks. Since textures have a rich variety and generally exhibit a mixture of both regularity (e.g., repeated patterns) and randomness (e.g., noise-like scatterings) [19], the structure of a texture is much more difficult for the human brain to predict than that of an edge. Eckert and Bradley [18] indicated that about three times as much quantization noise can be hidden in a texture image compared with an image of simple edges with similar spectral contents. To adequately estimate the contrast masking effects in texture regions, Liu et al. [12] proposed to decompose the image into a textural component and a structural one. Both components are processed independently for contrast masking in Liu et al.'s model [12], with the masking effects computed for the textural and structural components weighted by factors of 3 and 1, respectively. The masking effects of both components are added up to obtain the final contrast masking in Liu et al.'s JND model.
The main differences between our JND model and the works by Chou and Li [9], Yang et al. [11] and Liu et al. [12] are: (1) marking pixels in an input image as edge, texture or smooth regions, instead of decomposing the image into multiple components processed separately; (2) combination of LM and CM into the final JND threshold using the maximum operator for edge regions and NAMM [11] for non-edge regions; (3) alternative weighting of the contrast masking effect compared with [11,12]; and (4) less complex edge and texture detection schemes, more suitable for FPGA implementation, compared with [11,12]. The following subsections provide details on our JND model.

Edge and Texture Detection
Each input pixel is assigned to one of three possible region types in the input image, i.e., edge, texture and smoothness. The different regions are detected by lightweight local operations such as 2D filtering, which can be implemented efficiently on FPGAs (see Section 4). Figure 5 illustrates the detection scheme, where the input is the original image and the outputs are three binary maps corresponding to edge, texture and smooth regions, respectively. Edges are detected by the Sobel operator [20], which uses two 3 × 3 kernels. It is well known that the Sobel operator requires less computation and memory than the Canny operator [21], which is used in the JND models in [11,12]. To reduce the impact of noise in the input image, Gaussian low-pass filtering is performed prior to edge detection. A two-dimensional 3 × 5 Gaussian kernel with standard deviation σ = 0.83 is used by default in the proposed JND model. The vertical size of the Gaussian kernel is chosen as 3 for a low memory requirement as well as a low latency of an FPGA implementation. For computational efficiency, an integer approximation of the Gaussian kernel discussed in Section 6.1 is used, which can be implemented efficiently by shifts and additions. Figure 6 presents edges detected by different JND models for the BARB test image. Edges obtained by the proposed lightweight scheme (i.e., Gaussian smoothing followed by Sobel) are depicted in Figure 6b. The four panels in the middle and right columns of Figure 6 show outputs of the Canny edge detector in Yang et al.'s model [11] with sensitivity thresholds of 0.5 (default [11], middle panels) and 0.25 (right panels). Morphological operations have been used in Yang et al.'s software implementation [22] of their JND model to expand the edges given by the original Canny operator (see Figure 6d,f). Such operations result in bigger regions around the edges having reduced visibility thresholds to protect edge structures.
Many of the well-known texture analysis techniques (e.g., [23]) focus on distinguishing between different types of textures. While such techniques achieve promising results for image segmentation, they typically require larger blocks and computationally intensive statistical analysis such as multi-dimensional histograms, and their complexity/performance trade-offs are not well suited for JND modeling, especially in resource-constrained scenarios. As discussed earlier, a desirable property of a JND model is to distinguish textures from structural edges and smooth regions, and a reasonable complexity/quality trade-off is an advantage, especially for FPGA applications. Even if some texture regions were not picked up by a lightweight texture detection scheme compared with a sophisticated one, the visibility thresholds computed by the JND model in such regions would still be valid, e.g., for a visually lossless compression of the input image, since weighting factors for contrast masking are generally smaller in non-texture regions than in texture ones. For the reasons above, a low-complexity local operator is used for texture detection in our JND model.
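A minimal Python sketch of the edge-detection pipeline (Gaussian smoothing followed by Sobel) is given below; the uniform smoothing kernel and the gradient-magnitude threshold are stand-ins for the integer Gaussian kernel of Section 6.1 and the detector's actual sensitivity setting:

```python
import numpy as np

def convolve2d_same(img, k):
    """Naive 'same'-size 2D correlation with zero padding (clarity over speed)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img.astype(float), ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k)
    return out

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def edge_map(img, smoothing_kernel, threshold):
    """Smooth, take Sobel gradient magnitude, then threshold into a binary
    edge map E. `smoothing_kernel` and `threshold` are assumed parameters."""
    smoothed = convolve2d_same(img, smoothing_kernel)
    gx = convolve2d_same(smoothed, SOBEL_X)
    gy = convolve2d_same(smoothed, SOBEL_Y)
    return np.hypot(gx, gy) > threshold
```

A flat image yields an empty edge map, while a vertical intensity step is flagged along the transition, which is the qualitative behaviour the region classifier relies on.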

The proposed texture detection scheme works as follows. Firstly, a local contrast value is calculated for every pixel location. Figure 7a shows a 3 × 3 neighborhood for evaluating the local contrast, where p_0 is the intensity value at the current pixel location and p_1–p_8 are intensity values of the eight immediate neighbors of p_0. Let µ be the average of all intensity values in the 3 × 3 neighborhood. Then, the local contrast can be measured for the current pixel location in terms of mean absolute deviation (MAD):

C_MAD = (1/9) • Σ_{k=0}^{8} |p_k − µ|, with µ = (1/9) • Σ_{k=0}^{8} p_k. (13)

Obviously, C_MAD is invariant to image rotation and intensity-level shifts. In an implementation, e.g., based on FPGA, the divisions in Equation (13) can be avoided, since such divisions can be canceled by multiplications on both sides of the equation. A division-free implementation of the local contrast calculation equivalent to that in Equation (13) is used in the proposed hardware architecture for the JND model, as discussed in Section 4.4.2.
Next, the total contrast activity in the neighborhood is estimated based on local contrasts. Figure 7b presents an example of computed local contrasts, the thresholding of such local contrasts into a contrast significance map, the computation of a contrast activity value and finally the derivation of a binary high-contrast-activity decision. Let C_i be the local contrast at pixel location i in the 3 × 3 neighborhood centered about the current pixel. Then, contrast significance s_i is given by

s_i = 1 if C_i > T_C, and s_i = 0 otherwise, (14)

where T_C is a threshold for local contrast. A higher value of T_C corresponds to a smaller number of local contrasts detected as significant. In this paper, T_C = 8 is used. Contrast activity CA at the current pixel location is estimated as the total number of significant local contrasts in the 3 × 3 neighborhood:

CA = Σ_{i=0}^{8} s_i. (15)

The presence of a texture is typically characterized by a high contrast activity (HA):

HA = 1 if CA ≥ T_A, and HA = 0 otherwise, (16)

where T_A is a threshold for contrast activity. A lower value of T_A corresponds to a higher sensitivity to local contrast activities. In this paper, T_A = 5 is used. Figure 8a plots the contrast activities computed for the BARB image (cf. Figure 6a). The HA map after thresholding is shown in Figure 8b. Finally, denoting the binary output of the edge detector by E, a pixel is considered to be in a texture region (T) if it has a high contrast activity and is not an edge, as indicated in Figure 5:

T = HA AND (NOT E), (17)

and a pixel is considered to be in a smooth region (S) if it is neither an edge nor a texture:

S = (NOT E) AND (NOT T). (18)

The final edge, texture and smooth regions obtained for the BARB image are depicted in Figure 8c. While it is possible to achieve a better separation of the image into different regions using more sophisticated texture analysis and segmentation algorithms, such as in Liu et al.'s model [12], the proposed lightweight edge and texture detection scheme achieves quite reasonable results, as shown in Figure 8c, which provides a firm basis for the region-based weighting of contrast masking discussed in the next subsection. Comparisons of different JND models are given in Sections 6.2 and 6.3.
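The thresholding chain just described can be sketched as follows; the strict/non-strict comparison choices (C_i > T_C, CA ≥ T_A) are assumptions consistent with the sensitivity descriptions in the text:

```python
import numpy as np

def local_contrast_mad(window3x3):
    """Local contrast as mean absolute deviation over a 3x3 neighbourhood."""
    mu = window3x3.mean()
    return np.abs(window3x3 - mu).mean()

def region_maps(contrast, edge, t_c=8, t_a=5):
    """Derive texture (T) and smooth (S) maps from per-pixel local contrasts
    and a binary edge map E, following the thresholding scheme above."""
    h, w = contrast.shape
    significant = contrast > t_c                 # contrast significance s_i
    activity = np.zeros((h, w), dtype=int)       # CA: significant neighbours
    for i in range(h):                           # counted in each 3x3 window
        for j in range(w):
            i0, i1 = max(i - 1, 0), min(i + 2, h)
            j0, j1 = max(j - 1, 0), min(j + 2, w)
            activity[i, j] = significant[i0:i1, j0:j1].sum()
    high_activity = activity >= t_a              # HA decision
    texture = high_activity & ~edge              # T: high activity, not edge
    smooth = ~edge & ~texture                    # S: neither edge nor texture
    return texture, smooth
```

Note that at image borders the 3 × 3 window is clipped, so corner pixels can collect at most four significant neighbours; how the hardware handles borders is not modelled here.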

Region-Based Weighting of Visibility Thresholds due to Contrast Masking
In the proposed JND model, each basic contrast masking threshold estimated using Equation (5) is multiplied by a weighting factor based on the region in which the current pixel is located. Let W_e, W_t and W_s be the weighting factors for edge (e), texture (t) and smooth (s) regions, respectively. Then, the adaptively weighted contrast masking effect CM_κ is given by

CM_κ(i, j) = W_κ • CM(i, j), κ ∈ {e, t, s}, (19)

where κ denotes the region type of the current pixel. In Yang et al.'s JND model [11], a weighting factor equivalent to W_e = 0.1 is applied to edge pixels, whereas a factor of 1.0 is used for non-edge pixels, as discussed above.

Final JND Threshold
In the proposed JND model, the luminance masking and weighted contrast masking effects are combined using the NAMM model in texture (t) and smooth (s) regions, whereas, in edge (e) regions, the masking effects are combined using the maximum operator MAX(•):

JND_e(i, j) = MAX{LM(i, j), CM_e(i, j)},
JND_κ(i, j) = LM(i, j) + CM_κ(i, j) − C_{L,C} • MIN{LM(i, j), CM_κ(i, j)}, κ ∈ {t, s}. (20)
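A behavioural sketch of this region-dependent combination (MAX for edge pixels, NAMM otherwise) is given below; the NAMM coefficient c_lc = 0.3 is an illustrative assumption, while the weights W_e = 1, W_t = 1.75 and W_s = 1 are those used in the hardware implementation described in Section 4:

```python
def final_jnd(lm, cm, region, c_lc=0.3, w_e=1.0, w_t=1.75, w_s=1.0):
    """Combine luminance masking LM and contrast masking CM per region type:
    edge regions use MAX (as in Chou and Li); texture and smooth regions use
    the NAMM combination. c_lc is an assumed gain-reduction coefficient."""
    weight = {'e': w_e, 't': w_t, 's': w_s}[region]
    cm_w = weight * cm                       # region-based weighting (CM_kappa)
    if region == 'e':
        return max(lm, cm_w)                 # dominating effect at edges
    return lm + cm_w - c_lc * min(lm, cm_w)  # NAMM for texture/smooth
```

For identical LM and CM inputs, texture pixels thus receive the largest threshold (heavier CM weight plus additive combination), while edge pixels receive the most conservative one.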
The individual treatment of edge regions is based on the similarity between simple edge regions and the scenarios in classical psychophysical experiments used to determine distortion visibility thresholds in the presence of luminance edges, where simple edges are studied under different background luminance conditions [8]. Hence, for well-defined edges, the visibility thresholds modeled by Chou and Li based on such experiments should be considered as suitable. For the same reason, we selected W_e = 1.

Hardware Architecture of the Proposed JND Model

Figure 9 gives an overview of the proposed JND model architecture. The input pixel stream (p(i, j)) is first buffered in row buffers, which are needed for the filtering operations applied in our JND model. From the row buffers, pixels are grouped as a column of 3 pixels ({p(i, j)}_1) or a column of 5 pixels ({p(i, j)}_2). The 3-pixel column is sent to the Edge-texture-smooth Function, while the 5-pixel column is sent to both the Luminance Masking Function and the Contrast Masking Function. From these three functions, the region mask M_ec(i, j), the luminance masking threshold LM(i, j) and the contrast masking threshold CM(i, j) are calculated, respectively. The JND Calculation Function combines these results and generates the final JND value (JND(i, j)) for each pixel in the input image.

Row Buffer
The proposed JND architecture employs a common row buffer design [24], which includes registers for the current row pixels and several FIFOs for previous row pixels. Suppose r is the vertical window radius of a filter kernel; then the number of FIFOs required by this design is 2 • r. Row buffers are needed before every filtering operation. In our implementation, row buffers are deployed in three places: after the input, before the calculation of high contrast activity, and after low-pass filtering. The latter two row buffers are for r = 1, while the first row buffer serves both r = 1 and r = 2.
As shown in Figure 9, the rightmost row buffers contain four FIFOs to support a filter kernel with a maximum size of 5 (r = 2). The output of the row buffer forms a pixel array denoted as {p(i, j)}_2 (see Equation (21)), which is fed to the Background Luminance module and the Max Gradient module, where 5 × 5 filter kernels are applied. A subset of this row buffer output, {p(i, j)}_1, is sent to the Low-Pass Filter module and the Contrast Significance module, which consist of 3 × 5 and 3 × 3 kernel filtering operations, respectively.

{p(i, j)}_r = {p(i − r, j), p(i − r + 1, j), ..., p(i + r − 1, j), p(i + r, j)} (21)
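The row-buffer behaviour can be modelled in software as follows; zero-initialised FIFOs stand in for the actual border handling, which this sketch does not reproduce:

```python
from collections import deque

class RowBuffer:
    """Behavioural model of the row-buffer design: 2*r FIFOs, each one image
    row deep, so that every incoming pixel yields a (2r+1)-pixel vertical
    column {p(i, j)}_r as in Equation (21)."""
    def __init__(self, width, r):
        self.fifos = [deque([0] * width) for _ in range(2 * r)]

    def push(self, pixel):
        """Feed one pixel in raster order; return the vertical column,
        ordered from the oldest (top) row to the newest (bottom) row."""
        column = [pixel]
        x = pixel
        for fifo in self.fifos:
            out = fifo.popleft()   # pixel delayed by one more image row
            fifo.append(x)
            column.append(out)
            x = out
        return list(reversed(column))
```

Each FIFO delays its input by exactly one image row, so after the pipeline fills, the returned column contains vertically aligned pixels from 2r + 1 consecutive rows.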

Pipelined Weighted-Sum Module
For the filtering operations, which are employed in several parts of the proposed JND model, a common design to perform a weighted sum is introduced, as illustrated in Figure 10. The block representation of a Pipelined Weighted-Sum (PWS) module is depicted in Figure 10a. The input to this module is an array of pixel columns denoted as {p(i, j)}_{r_m}, and the output is a weighted-sum value calculated as

WS(i, j) = w_s • Σ_{m=0}^{2r_m} Σ_{n=0}^{2r_n} p(i − r_m + m, j − r_n + n) • K(m, n). (22)

The PWS module is parameterized as a function F(K, w_s, r_m, r_n), where K is a 2D array of coefficients, w_s is an output scaling factor, and r_m, r_n are the vertical and horizontal kernel window radii, respectively. Figure 10b depicts the internal structure of the PWS module. The multiplication operator used there is a Customized Shift-based Multiplier (CSM), which generally consists of sum and shift operators. The actual content of this operator is defined according to the value of a given coefficient. For example, considering the coefficient −3 in kernel G_1 (see Figure 3), the multiplication of this coefficient with a pixel value p can be rewritten as −3 • p = −((p << 1) + p), which consists of one left-shift operator, one adder and one sign-change operator. Since all the coefficients are known, this customized multiplier strategy allows us to optimize for both timing and hardware resources.
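The CSM idea can be illustrated in Python, where a constant multiplier is decomposed into signed powers of two:

```python
def csm_minus3(p):
    """CSM for the constant -3: -3*p = -((p << 1) + p), i.e. one shift,
    one add and one sign change, as described for kernel G_1."""
    return -((p << 1) + p)

def csm(p, terms):
    """Generic sketch: multiply p by a constant given as signed powers of
    two, e.g. 25 = +2**4 +2**3 +2**0 -> terms [(4, 1), (3, 1), (0, 1)]."""
    return sum(sign * (p << shift) for shift, sign in terms)
```

Because every kernel coefficient is a compile-time constant, each CSM instance in hardware contains only the shifts and adders needed for its own coefficient.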

Luminance Masking Function
As discussed in Section 2.1, the calculation of the luminance masking threshold (LM) includes two steps. The first step is finding the background luminance (BL), which can be realized by a PWS module F(B, 1/32, 2, 2). The second step is calculating LM based on the value of BL. Since the value of BL falls in the same range as the input pixel value, which is an 8-bit integer in our implementation, the latter step can simply be realized as a look-up operation (see Figure 11). The LM ROM is implemented in Block RAM and has 256 entries, each with 5 + σ bits, where 5 and σ are the numbers of bits for the integer part and the fractional part of LM, respectively. The output of this function is thus 2^σ times the actual value of LM (L̂M(i, j) = 2^σ • LM(i, j)). The scaling factor 2^σ is discussed further in Section 4.3.
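The LM ROM contents can be generated offline, e.g., by the following sketch, which applies the luminance masking curve of Section 2.1 and the fixed-point scaling 2^σ (the helper name is illustrative):

```python
import math

def build_lm_rom(t0=17.0, gamma=3.0 / 128.0, sigma=5):
    """Build the 256-entry LM look-up table with values scaled by 2**sigma:
    square-root branch below 127, linear branch above, rounded to fixed point."""
    rom = []
    for bl in range(256):
        if bl <= 127:
            lm = t0 * (1.0 - math.sqrt(bl / 127.0)) + 3.0
        else:
            lm = gamma * (bl - 127) + 3.0
        rom.append(round(lm * (1 << sigma)))
    return rom
```

With the default parameters the largest entry is 2^σ • (T_0 + 3) = 640, which indeed fits in the 5 + σ = 10 bits stated above.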

Contrast Masking Function
The contrast masking function consists of two modules: the first module (Max Gradient) calculates MG based on input pixels from the row buffer; the second module (Contrast Mask) computes CM from MG and BL, the latter being the output of the Background Luminance module (see Figure 12). For each of the directional gradient operations (G_i, i = 1, 2, 3, 4), a PWS module is deployed with output scaling factor w_s = 1/16 and both radii set to 2. Absolute values of these modules' outputs are then calculated by Abs functions and compared to each other to find the maximum value (MG). The absolute function can simply be realized by a multiplexer whose select signal is the most significant bit of the input. The contrast masking threshold (CM) is calculated for each pixel location based on the values of MG and BL. This calculation requires multiplications by several real numbers which cannot be accurately converted to shift-based operators. To keep the implementation resource-efficient without using floating-point operations, a fixed-point approximation strategy is proposed, as in Equation (24): a scaling factor 2^σ is applied to the overall approximation of the given real numbers, providing finer accuracy adjustment.

With the above approximations, Equations (5)–(7) are rewritten as Equation (25) and implemented as the Contrast Mask module shown in Figure 12. In this implementation, σ is empirically set to 5, since this provides a reasonable trade-off between accuracy and resource consumption.
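The fixed-point coefficient approximation can be sketched as follows; the rounding rule and helper names are illustrative assumptions about the conversion, not a transcription of Equation (24):

```python
SIGMA = 5  # fractional bits, as chosen empirically in the text

def to_fixed(c, sigma=SIGMA):
    """Approximate a real coefficient c by the integer round(c * 2**sigma);
    the represented value is then c_hat / 2**sigma."""
    return round(c * (1 << sigma))

def approx_error(c, sigma=SIGMA):
    """Representation error of the fixed-point approximation; bounded by
    2**-(sigma + 1) for round-to-nearest."""
    return abs(to_fixed(c, sigma) / (1 << sigma) - c)
```

For example, the slope α(BL) = 0.0001 • BL + 0.115 evaluates to 0.125 at BL = 100, which is exactly representable as 4/32 with σ = 5, while 0.115 itself is approximated with an error below 2^−6.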

Edge-Texture-Smooth Function
This function consists of two separate modules: Edge Detection and High Contrast Activity, which respectively mark pixel locations belonging to the edge region and the high-contrast-activity region. These modules receive the same 3-pixel column as input and output a binary value for each pixel location. The outputs of the Edge Detection module (M_e(i, j)) and the High Contrast Activity module (M_c(i, j)) are combined into a two-bit signal (M_ec(i, j)), which has M_e(i, j) as the most significant bit (MSb) and M_c(i, j) as the least significant bit (LSb). M_ec(i, j) is then used as the select signal for multiplexers in the JND Calculation Function. The following subsections discuss each of these modules in detail.

Edge Detection
The edge detection algorithm applied in the proposed JND model requires three filtering operations: one for Gaussian filtering and the other two for finding the Sobel gradients in the horizontal and vertical directions. These filters are realized by PWS modules, as depicted in Figure 13a,b. The coefficient array G can be found in Section 6.1, and the kernels S_x, S_y are the standard 3 × 3 Sobel kernels:

S_x = [−1 0 1; −2 0 2; −1 0 1], S_y = [−1 −2 −1; 0 0 0; 1 2 1].

High Contrast Activity
To detect high-contrast-activity regions, the contrast significance CS needs to be calculated for each pixel location. The proposed architecture for this task is illustrated in Figure 14. Considering Equation (13), two divisions by 9 are required to find C_MAD. With fixed-point dividers, these divisions would introduce errors into the implementation. Therefore, the following modification is made to find CS: the modified value ĈMAD is 81 times as large as C_MAD, so instead of comparing C_MAD with the threshold T_C as in Equation (14), ĈMAD is compared with the new threshold T_hc = 81 · T_C. This strategy requires extra hardware resources if T_C is not implemented as a constant, but it guarantees the accuracy of CS without using floating-point operations.
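Assuming C_MAD is the mean absolute deviation over a 3 × 3 neighborhood (which accounts for the two divisions by 9; check against Equation (13)), the division-free comparison can be sketched as:

```python
import numpy as np

def high_contrast(block3x3, T_c):
    """Division-free test C_MAD > T_c.

    C_MAD = (1/9) * sum(|p - mu|) with mu = sum(p)/9. Multiplying
    through by 81 gives the integer-only form
    81*C_MAD = sum(|9*p - sum(p)|), compared against T_hc = 81*T_c.
    """
    p = np.asarray(block3x3, dtype=np.int64).ravel()
    s = int(p.sum())                         # 9 * mu, no division needed
    c_mad_81 = int(np.abs(9 * p - s).sum())  # 81 * C_MAD
    return c_mad_81 > 81 * T_c
```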

Considering the implementation of the Contrast Significance module depicted in Figure 14, the input 3-pixel column is registered four times: the first three register columns are used for calculating μ and the last three for calculating ĈMAD. There is one clock cycle of delay between these two calculations, which is resolved by inserting a register, as shown in the bottom-left side of the module.

JND Calculation Function
Figure 15 presents the implementation of Equations (19) and (20), which calculate the final JND value from the contrast masking threshold (CM), the luminance masking threshold (LM) and the region mask (M_ec). The Region-based Weighting (RW) module applies a weighting factor to the incoming contrast mask. The weighting factors, which depend on the region type of the current pixel, are W_e = 1, W_t = 1.75 and W_s = 1 for edge, texture and smooth regions, respectively. The texture weight can be rewritten as W_t = 2^1 − 2^−2, which results in two shift operations and one adder in our customized shift-based multiplier. The other two weights are realized simply as wires connecting the input to the output. The region mask is used as the select signal of a multiplexer to choose the correct weighted value for the next calculation phase, in which the weighted contrast masking threshold (CM_κ) is fed to the MAX module and the NAMM module, which compute the JND values for the edge region and the non-edge regions, respectively. For the CSM module in the NAMM, an approximation of C_{L,C}^Y is made, as shown in Equation (28). The final JND value is then computed by removing the scaling factor 2^σ applied to the input contrast masking and luminance masking thresholds.
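The shift-based texture weighting W_t = 2^1 − 2^−2 = 1.75 can be modeled as follows (with integer inputs, the right shift truncates; that truncation is part of the hardware's approximation):

```python
def weight_texture(cm):
    """Multiply the contrast masking threshold by W_t = 1.75 using
    two shifts and one subtractor: 1.75*x = (x << 1) - (x >> 2)."""
    return (cm << 1) - (cm >> 2)
```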

JND-Based Pixel-Domain Perceptual Image Coding Hardware Architecture
A low complexity pixel-domain perceptual image coding algorithm based on JND modeling was proposed in our earlier work [14]. Its principle is briefly described below, before architectural aspects are addressed. The perceptual coding algorithm predictively codes either the downsampled pixel value or the original pixels, according to the encoder's decision on whether the downsampled pixel suffices to represent the corresponding original pixels at visually lossless (or, in the case of suprathreshold coding, at least visually optimized) quality. Figure 16 illustrates the algorithm of the perceptual encoder. The Visual ROI determination block compares local distortions due to downsampling against the distortion visibility thresholds at the corresponding pixel locations given by the pixel-domain JND model. If any downsampling distortion crosses the JND threshold, the current downsampling proximity (a 2 × 2 block in [14]) is considered a region-of-interest, and all pixels therein are encoded. In non-ROI blocks, only the downsampled mean value is encoded. In both cases, the encoder ensures that the difference between a decoded pixel and the original pixel does not exceed the corresponding JND threshold, fulfilling a necessary condition for visually lossless coding from the perspective of the JND model. The predictive coder exploits existing low complexity algorithmic tools from JPEG-LS [25], such as pixel prediction, context modeling and limited-length Golomb coding, but uses a novel scan order so that coherent context modeling for ROI and non-ROI pixels is possible. The ROI information and the predictive coder's outputs are combined to form the output bitstream. More details on the coding algorithm can be found in [14]. The remainder of this section describes the hardware architecture of this perceptual encoder.
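The ROI test described above reduces to a per-block threshold check; a minimal sketch (the function name is ours):

```python
def is_roi(block, jnd):
    """A 2x2 block is a region-of-interest if coding it by its
    downsampled mean would violate any pixel's JND threshold.

    block: the four original pixels; jnd: their visibility thresholds.
    """
    p_m = (sum(block) + 2) >> 2   # rounded mean, as in the Downsampling module
    return any(abs(p - p_m) > t for p, t in zip(block, jnd))
```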


Top-Level Architecture of the JND-Based Pixel-Domain Perceptual Encoder
The overall proposed architecture for the perceptual encoder is depicted in Figure 17. On the top level, apart from the JND module discussed in Section 4, the proposed encoder architecture can be divided into two main parts: an Encoder front end module and a Predictive coding module. As shown in Figure 17, pixels encoded by the predictive coding path are provided by the Encoder front end, which performs the following tasks:
• Generate the skewed pixel processing order described in [14].
• Downsample the current 2 × 2 input block.
• Determine whether the current 2 × 2 input block is an ROI based on the JND thresholds.
• Select the pixel to be encoded by the predictive coding path based on the ROI status.
For clarity, the JND module, as well as the delay element synchronizing the JND module outputs with the encoder's input pixel stream, is omitted from the discussion of the encoder architecture in the rest of the paper. In addition, since existing works (e.g., [26]) have well covered the architectural aspects of fundamental pixel-domain predictive coding algorithms such as JPEG-LS, the following discussion focuses mainly on the aspects of the proposed encoder architecture that enable the skewed pixel processing, the JND-based adaptive downsampling and the ROI-based pixel selection [14].

Input Scan Order vs. Pixel Processing Order
The raster scan order is a common sequence in which pixels of an image are produced or visited, for example at the output interface of a sensor or at the input interface of an encoder. The encoder architecture in this paper assumes that pixels of an input image are streamed sequentially into the encoder in raster scan order; the source of the input image is arbitrary, such as a camera sensor (e.g., when the encoder is directly connected to the sensor to compress raw pixels) or an external memory (e.g., when the whole image needs to be temporarily buffered for denoising before compression). Inside the encoder, pixels do not have to be processed in the same order in which they were received. Figure 18 shows an example in which the input pixels are received in raster scan order whereas the actual encoding of the pixels follows a skewed scan order [14]. Obviously, internal pixel buffers such as block RAMs on FPGAs are required if an encoder's internal pixel processing order differs from its input pixel scan order. An architecture implementing the skewed pixel processing order is presented in Section 5.4.
Figure 18. Input pixel scan order (raster scan) vs. internal pixel processing order (skewed scan [14]).

Encoder Front End
A high-level architecture of the Encoder front end is presented in Figure 19. Input pixel buffering and skewed pixel output are performed in the Pixel processing order conversion module, which is composed mainly of shift registers and FIFOs acting as row buffers. When enough pixels are buffered for the skewed processing to start, pixels from the same columns in a pair of rows (called the upper row and the lower row in this paper) are output by the row buffers. After a full 2 × 2 pixel block is stored in the Downsampling window, the mean value of the block is computed by the Downsampling module. A Lower row delay block delays the output of pixels on the lower row, as required by the skewed scan order. Figure 19 shows that all four original pixels in the Downsampling window and the output of the Downsampling module are sent to the ROI decision module, together with the JND thresholds. Depending on whether the current 2 × 2 block is an ROI, either an original pixel or the downsampled mean value is adaptively selected by the ROI-based pixel selection module and forwarded to the predictive coding path. The components of the encoder front end are connected by pipeline registers, and their operation is controlled by a state machine. More architectural details of this module are examined in the following subsections.

Pixel Processing Order Conversion
The architecture of the Pixel processing order conversion module is shown in Figure 20. At the input side, pixels of the input image arrive sequentially (i.e., a streaming scenario), as indicated in the waveform in the top-left side of Figure 20. According to the skewed scan order (cf. Figure 18), pixels in a pair of rows shall be interleaved with a delay in the lower row. As depicted in Figure 20, two row buffers (dual-port RAMs) are used to store the input, depending on the current row index. The modulo-2 operation on the row_index signal is implemented by taking the least significant bit (LSb) of row_index. The conversion proceeds as follows. First, all pixels of an upper row (e.g., the first row of the input image) are stored in the Upper row buffer. Next, pixels of a lower row (e.g., the second row of the image) begin to be received and stored in the Lower row buffer. As long as neither row buffer is empty, both buffers are read simultaneously every two clock cycles, as illustrated in the waveform in the top-right side of Figure 20. The outputs of both row buffers are then fed into the Downsampling window, which consists of two two-stage shift registers. Downsampling as well as ROI detection is performed once all four pixels of a 2 × 2 block are in the Downsampling window. Finally, by inserting an offset into the data path of the lower-row pixels using the Lower row delay block, the skewed scan order [14] is obtained at the output of the Pixel processing order conversion module. The two output pixel values from the upper and lower rows are denoted p_U and p_L, respectively. Both p_U and p_L are candidates for the final pixel to be encoded, which is determined later by the ROI-based pixel selection module.
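A behavioral model of this conversion is sketched below. The exact lag of the skewed scan in [14] is not reproduced; a one-pixel lower-row lag is assumed for illustration, and an even image height is assumed.

```python
def skewed_order(image):
    """Interleave each pair of raster rows into a skewed stream:
    the upper row leads and the lower row follows one column behind,
    standing in for the Upper/Lower row buffers plus Lower row delay."""
    out = []
    for r in range(0, len(image), 2):
        upper, lower = image[r], image[r + 1]
        out.append(upper[0])                # lower row not yet started
        for c in range(1, len(upper)):
            out.append(upper[c])
            out.append(lower[c - 1])        # lower row lags by one pixel
        out.append(lower[-1])               # drain the lower row
    return out
```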

Downsampling and ROI Decision
The architecture of the Downsampling and ROI decision modules is presented in Figure 21. Let p_1, p_2, p_3, p_4 be the four pixels of a 2 × 2 block in the downsampling window and p_m be the downsampled mean value. The Downsampling module implements the following operation: as shown in Figure 21, downsampling is performed by first adding up all four pixel values in an adder tree and then shifting right by 2 bits. The extra addition of 2 before the right shift implements the rounding function in Equation (29). Such a downsampling scheme is straightforward and computationally efficient. When a higher compression ratio is desired, the downsampling module and the corresponding register window can be extended to larger block sizes, and low-pass filtering can optionally be employed before the downsampling to reduce aliasing.
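The bias-and-shift rounding of Equation (29) works as follows:

```python
def downsample_mean(p1, p2, p3, p4):
    """p_m = round((p1 + p2 + p3 + p4) / 4): adding 2 (half the
    divisor) before the 2-bit right shift implements round-to-nearest
    (ties rounded up)."""
    return (p1 + p2 + p3 + p4 + 2) >> 2
```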

Predictive Coding and Output Bitstream
Pixels from the Encoder front end are compressed along the predictive coding path, which comprises four main modules: Prediction and context modeling, Symbol mapping, Coding parameter estimation and Golomb-Rice coding, as depicted in the lower part of Figure 17. These blocks are implemented in a high-throughput, resource-efficient architecture for classic context-based pixel-domain predictive coding, which is fully pipelined without stalls. The throughput is 1 pixel/clock cycle. The architectural details are similar to those in existing publications, e.g., on the hardware architecture for the regular mode of JPEG-LS [26]. The variable-length codeword streams from the predictive coding path are combined with the ROI information (in raw binary representation) at the output multiplexing (MUX) module, where a barrel shifter is used to form fixed-length final output bitstreams. The detailed architecture of the predictive coding path and bitstream multiplexing is omitted due to space limitations.

Analysis of Integer Approximation of the Gaussian Kernel
As discussed in Section 3.1, a 3 × 5 Gaussian kernel with standard deviation σ = 0.83 is employed in the proposed JND model. Figure 23a shows the original kernel coefficients with a precision of four digits after the decimal point, whereas an integer approximation of the same kernel is presented in Figure 23b. In total, 15 multiplications and 14 additions are required in a straightforward implementation of the filtering with the original kernel, whereas the integer kernel can be implemented with 25 integer additions plus several shift operations (for instance, multiplying x by 15 can be implemented by a shift-add operation as (x << 4) − x, where << is the left shift operator). The impact of using the integer kernel on the accuracy of the results is analyzed in Table 1. The results using the integer kernel after both Gaussian smoothing and Sobel edge detection (cf. Figure 5) were compared with those using the original kernel for various test images (see Section 6.2). Table 1 indicates that on average 97% of the results based on the integer version of the kernel match those of the floating-point version after the smoothing step, whereas over 99% of the results based on the integer version are the same as those based on the floating-point version after the edge detection step. Since the performance of the integer Gaussian kernel is closely comparable to that of the floating-point one, it is reasonable to use the integer kernel for the improved resource efficiency.

The proposed JND model was implemented in software and tested with widely used standard test images. The performance of the proposed JND model was evaluated in terms of both the distortion visibility of JND-contaminated images and the amount of imperceptible noise that can be shaped into the images, i.e., the visual redundancies in the images. To reveal or compare visual redundancies given by JND models, the well-known PSNR metric is often used with a particular interpretation in the literature on JND models. For example, it is pointed out in [9] that, if the JND profile is accurate, the perceptual quality of the corresponding JND-contaminated image should be "as good as the original" while the PSNR of the JND-contaminated image should be "as low as possible". Chou and Li believed that PSNR can be used to quantify the amount of imperceptible distortion allowed for transparent coding of images [9]. With this interpretation, a lower PSNR value corresponds to a larger potential coding gain. Other examples of work in which the PSNR metric is used in a similar way to analyze the performance of JND models include [11,12,27,28].
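Returning to the integer-kernel implementation above, the shift-add replacement of a multiplication, e.g., by the coefficient 15, is simply:

```python
def mul15(x):
    """15*x with one shift and one subtraction: (x << 4) - x."""
    return (x << 4) - x
```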
Multiple greyscale 8 bit/pixel test images [29,30] of different sizes and contents were used in our experiments. For each test image, four sets of JND profiles were computed using Chou and Li's original model [9], Yang et al.'s model [11,22], Liu et al.'s model [12,31] and the proposed one. A JND-contaminated image was then obtained by injecting the JND profile as a noise signal into the original image. As described in [9], noise injection works by adding to each original pixel the corresponding visibility threshold multiplied by a random sign {−1, 1}. The resulting JND-contaminated image can be used both in objective tests, such as PSNR measurement, to reveal the JND model's capability for estimating the visual redundancy, and in subjective tests to validate the model by comparing the original image with the JND-contaminated one. Since each sign is generated independently, the above random-sign noise injection scheme may occasionally cause most injected noise samples in a small neighborhood to have the same sign, which often shows a correlation with distortion visibility even when the noise injection is guided by a high quality JND profile (see [13] for an example). An alternative is to additionally ensure a zero mean of the randomly generated signs of the noise samples in every M × N block, which is referred to as zero-mean random-sign noise injection in this work. A neighborhood size of 2 × 2 was used for the zero-mean random-sign scheme in our experiments.

The distortion visibility experiment on the proposed JND model was conducted on a 31.1" EIZO CG318-4K monitor with 100 cd/m^2 luminance and with the viewing conditions specified in [32]. The original test image was temporally interleaved with the JND-contaminated image at a frequency of 5 Hz; a noise signal is invisible if no flickering can be seen. In our experiments, hardly any flickering could be noticed at a normal viewing distance corresponding to 60 pixels/degree. Figure 24 presents a test image and various noise-contaminated images. An original section of the BALLOON image is shown in Figure 24a, and a white-Gaussian-noise-contaminated image (PSNR = 31.98) is shown in Figure 24b. A JND-contaminated image (PSNR = 31.97) based on Chou and Li's JND model is shown in Figure 24c, whereas the JND-contaminated image based on the proposed model is shown in Figure 24d. While the noise in Figure 24b is quite obvious, the same amount of noise injected based on Chou and Li's JND model is much less visible (see Figure 24c), and an even higher amount (0.23 dB more) of noise based on the proposed model and the zero-mean random-sign injection scheme is almost completely invisible, as shown in Figure 24d.

The JND models were further compared in Table 4 in terms of the number of additional operations required in their main algorithmic parts. Compared with Chou and Li's JND model, Yang et al.'s model additionally performs edge-based weighting of the contrast masking effect using a Canny edge detector followed by a 7 × 7 Gaussian filter [9]. From the upper part of Table 4, it can be seen that Yang et al.'s model required approximately 162 additions, one multiplication, one division and a look-up table (LUT) in addition to the basic operations required in Chou and Li's model (Table 3). It can be seen from the lower part of Table 4 that, compared with Yang et al.'s model, the proposed model required about half the number of extra additions and required neither additional LUTs nor division operations.

To compare the JND models in terms of hardware resource requirements and speed, we implemented hardware models of three JND models in VHDL: Chou and Li's original model, Yang et al.'s model and the proposed one. The hardware models were simulated and synthesized using Xilinx Vivado Design Suite 2018.2. The target device was a Xilinx Kintex-7 XC7K160T with a speed grade of −2. For the FPGA implementation of the proposed JND model, the input image was assumed to be greyscale with 8 bits/pixel and a horizontal size of up to 1024 pixels. Table 6 presents the FPGA resource utilization of the synthesized models and their maximum clock frequency. The pixel throughput was one pixel per clock cycle.

The proposed JND model was implemented in combination with the perceptual encoder described in Section 5. Parameter values for the JND model are as discussed in Section 3. The compressed image quality of the perceptual codec was compared with that of JPEG-LS for a range of rates corresponding to approximately 2:1 to 6:1 compression. Objective metrics used to evaluate the compressed image quality included PSNR, MS-SSIM [36,37] and the HDR-VDP score [38,39]. The compressed data rates of the perceptual codec based on the proposed JND model were additionally compared with those of JPEG, JPEG 2000 and JPEG XR at the same perceptual quality given by PSPNR [9]. The compression experiments were based on widely used standard test images, as described in Section 6.2.
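The zero-mean random-sign noise injection described above can be sketched as follows. A random permutation of two +1s and two −1s per 2 × 2 block is one way to enforce the zero mean; the paper does not specify the exact sign-generation mechanism, so treat this, and the even-image-dimension assumption, as illustrative choices.

```python
import numpy as np

def inject_jnd_noise(img, jnd, rng=None):
    """Zero-mean random-sign noise injection: each pixel is offset by
    its JND threshold times a sign in {-1, +1}, with the four signs in
    every 2x2 block summing to zero (even image dimensions assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    signs = np.empty((h, w), dtype=np.int32)
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            # Two +1s and two -1s in random positions -> zero mean per block.
            signs[y:y + 2, x:x + 2] = rng.permutation(
                [1, 1, -1, -1]).reshape(2, 2)
    return img.astype(np.int32) + signs * jnd.astype(np.int32)
```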

Complexity Comparison of Proposed JND Model and Existing JND Models
Figure 25 presents comparisons of the rate-distortion performance of the perceptual codec based on the proposed JND model and of JPEG-LS for the test images GOLD, TXTUR2 and WOMAN. It can be seen from the MS-SSIM and HDR-VDP curves that the perceptual codec exhibited a clear gain in perceptual quality over JPEG-LS in the rate range between 1 and 3.5 bits per pixel (bpp). In terms of PSNR, which is not a perceptual quality metric, the perceptual codec delivered an improved coding performance of about 10-15% over JPEG-LS at rates below approximately 1.5-2 bpp. Figure 26 provides visual comparisons of images compressed to approximately the same rate by JPEG-LS and by the perceptual codec combined with the proposed JND model. Selected parts of two different types of images are shown. From this figure, it is evident that the proposed scheme achieved improved visual quality by avoiding the stripe-like artifacts of JPEG-LS.
Towards the goal of visually transparent coding, a codec's performance can be related to its ability to keep coding distortions within the visibility thresholds provided by the JND model. As discussed in [9], the peak signal-to-perceptible-noise ratio (PSPNR) is a metric that takes visual redundancy into account based on the JND model. While transform-domain codecs such as JPEG, JPEG 2000 and JPEG XR have higher complexity and latency than pixel-domain codecs such as the proposed JND-based one or JPEG-LS, it is possible to determine experimentally the bit rates at which every coding distortion in the compressed image is kept below the corresponding visibility threshold given by the proposed JND model. Table 7 shows the minimum compressed bit rates of JPEG, JPEG 2000, JPEG XR and the proposed JND-based perceptual codec at which the PSPNR reaches its upper bound, i.e., none of the coding errors exceeds the JND thresholds, which can be considered a necessary condition given by the JND model for perceptually lossless coding. For this experiment, the proposed JND model, the baseline JPEG, the Kakadu implementation [40] of JPEG 2000 (with visual weights) and the ITU-T reference implementation [41] of JPEG XR were used. Table 7 indicates that, at the same visual quality given by PSPNR, the perceptual codec required on average about 58%, 48% and 41% fewer bits than JPEG, JPEG 2000 and JPEG XR, respectively.

The architecture of the proposed JND model and perceptual encoder was implemented in hardware using the VHDL hardware description language. The hardware model of the perceptual encoder was simulated and synthesized using Xilinx Vivado Design Suite 2016.4. The target device was a Xilinx Kintex-7 XC7K160T, a popular mid-range FPGA, with a speed grade of −2. Since the proposed perceptual encoder is compatible with different JND models (and vice versa for the proposed JND model), the proposed JND model and the perceptual encoder were implemented as separate modules, and their synthesis results are reported separately for clarity. An integration of the two modules is straightforward, as is obvious from Section 5. Synthesis results for the proposed JND model as well as two other JND models are presented in Section 6.3.
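The PSPNR saturation condition used for Table 7 amounts to a simple per-pixel check (the function name is ours):

```python
import numpy as np

def within_jnd(original, decoded, jnd):
    """Necessary condition for perceptually lossless coding: every
    coding error stays at or below the corresponding JND threshold
    (the point at which PSPNR reaches its upper bound)."""
    err = np.abs(original.astype(np.int64) - decoded.astype(np.int64))
    return bool(np.all(err <= jnd))
```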
Table 8 shows the FPGA resource utilization of the proposed perceptual encoder architecture for 8-16 bits/pixel input greyscale images with a horizontal size of up to 2048 pixels. It can be seen that the proposed encoder architecture required 5.85% of the logic resources and 2% of the BRAM resources on the target FPGA, and a pixel throughput of about 140 Megapixel/s (1 pixel/clock cycle) was achieved. For both the proposed JND model and the perceptual encoder architecture, the logic and BRAM resources used were well below 10% of the available resources of each type on the target FPGA, which, on the one hand, leaves abundant hardware resources for other image processing tasks running on the FPGA, such as noise cancellation, and, on the other hand, leaves ample room for multiple parallel encoding instances on a single FPGA when higher pixel throughput is demanded.

Conclusions
A new pixel-domain JND model and a perceptual image coding architecture exploiting the JND model have been presented. In the proposed JND model, lightweight and hardware-efficient operators are used to identify edge, texture and smooth regions in the input image, and different weighting factors for the contrast masking effects are applied to pixels in the different regions. The contrast masking and luminance masking effects are combined into the final JND value in a region-dependent way: the nonlinear additivity model for masking (NAMM) operator is used for texture/smooth regions and the maximum operator for edge regions. The proposed JND model and architecture are suitable for implementation on FPGAs for real-time and low complexity embedded systems. In the proposed architecture for a low complexity pixel-domain perceptual codec, the input image is adaptively downsampled based on the visual ROI map identified by measuring the downsampling distortion against the JND thresholds. The proposed JND model provides a more accurate estimation of visual redundancies than Chou and Li's model and Yang et al.'s model. Since the computational complexity of the proposed model is significantly lower than that of Liu et al.'s model based on image decomposition with total variation, the proposed JND model achieves a new balance between the accuracy of the JND profile and computational complexity. Experimental results further show that the proposed JND-based pixel-domain perceptual coder achieved improved rate-distortion performance as well as visual quality compared with JPEG-LS. At the same perceptual quality in terms of PSPNR, the proposed coder generated fewer bits than JPEG, JPEG 2000 and JPEG XR. Finally, FPGA synthesis results indicate that both the proposed JND model and the perceptual coder require a very moderate amount of hardware in terms of both logic and block memory resources. On a mid-range FPGA, the hardware architecture of the proposed JND model required about 2.6% of the logic and 1.4% of the block memory resources and achieved a throughput of 190 Megapixel/s, while the hardware architecture of the perceptual encoder required about 6% of the logic and 2% of the block memory resources and achieved a throughput of 140 Megapixel/s.

Figure 2. Pixel window for JND estimation and weighting factors for the background luminance: (a) JND estimation window of 5 × 5; and (b) weighting factor matrix B.
Yang et al. selected default values of the gain reduction coefficients as C_{L,C}^Y = 0.3, C_{L,C}^Cb = 0.25 and C_{L,C}^Cr = 0.2 based on subjective tests in [16]. The compatibility with Chou and Li's model can be seen by letting θ = Y and C_{L,C}^Y = 1 in Equation (

Figure 7. Illustration of contrast activity detection: (a) neighborhood for local contrast estimation; and (b) example of local contrasts, contrast significance and derivation of the high-contrast-activity decision.

Figure 8. Texture information of the BARB image in the proposed scheme: (a) visualization of contrast activity (treated as grey values and multiplied by 20 for visibility); (b) high contrast activity (black) regions after thresholding with T_A = 5; and (c) final edge, texture and smooth regions.

Figure 9 depicts the overall hardware architecture of the proposed JND estimation core implemented on an FPGA. The core includes four main parts (names of the functional modules of the architecture are indicated in italics): Luminance Masking Function, Contrast Masking Function, Edge-texture-smooth Function and JND Calculation Function. The streaming input pixel (p(i, j)) is first buffered in row buffers, which are needed for the filtering operations applied in our JND model. From the row buffers, pixels are grouped as a column of 3 pixels ({p(i, j)}_1) or a column of 5 pixels ({p(i, j)}_2). The 3-pixel column is sent to the Edge-texture-smooth Function, while the 5-pixel column is sent to both the Luminance Masking Function and the Contrast Masking Function. From these three functions, the region mask M_ec(i, j), the luminance masking threshold LM(i, j) and the contrast masking threshold CM(i, j) are calculated, respectively. The JND Calculation Function combines these together and generates the final JND value (JND(i, j)) for each pixel of the input image.

Figure 9. Overall architecture of the proposed JND model.
presents a zoom-in sample design for F(K, w_s, 1, 1) with K defined as the 3 × 3 coefficient matrix K = [w_00 w_01 w_02; w_10 w_11 w_12; w_20 w_21 w_22].

Figure 17. Overview of the proposed JND-based perceptual encoder architecture.

Figure 25. Objective rate-distortion plots of the proposed codec and JPEG-LS: top to bottom, MS-SSIM, PSNR and HDR-VDP values; and left to right, results for test images GOLD, TXTUR2 and WOMAN.

Figure 26. Visual quality of images compressed by JPEG-LS and the proposed JND-based perceptual codec at closely comparable bit rates: top and bottom, the WOMAN and GOLD images; and left to right, the original image, a selected section compressed by JPEG-LS, and the same section compressed by the perceptual codec.
Values of the weighting factors W_e, W_t and W_s may vary, for example depending on viewing conditions and applications. Based on our experiments as well as for the reasons discussed in the following subsection, the weighting factors are set to W_e = 1, W_t = 1.75 and W_s = 1 in this work as defaults for the proposed JND model under normal viewing conditions and general purpose test images. More details about the test images and viewing conditions in our experiments are provided in Section 6.2.
(19) used to preserve visual quality in edge regions, while in Liu et al.'s JND model [12] a weighting factor equivalent to W_t = 3 is used to avoid underestimating visibility thresholds in texture regions. From Equation (19), it is obvious that larger values of W_e, W_t and W_s correspond to larger contrast masking effects (and hence larger final JND thresholds) in edge, texture and smooth regions, respectively.

Table 1. Influence of the integer Gaussian kernel on the accuracy of smoothing and edge detection results in comparison with the original kernel in floating-point double precision.

Table 3 lists the number of operations required by Chou and Li's JND model, which is the basis for the other pixel-domain JND models discussed in this paper. The complexity of two JND models extending Chou and Li's model, Yang et al.'s model and the proposed one, is compared in Table 4; compared with Yang et al.'s model, the proposed model required about half the number of extra additions and required neither additional LUTs nor division operations.

Table 3. Basic operations required for computing a visibility threshold by Chou and Li's JND model.

Table 4. Approximate number of additional operations per pixel required for computing a visibility threshold by Yang et al.'s JND model and the proposed model.

A comparison of software complexity in terms of CPU time was made for the different JND models. The comparison was based on the original authors' implementations of Yang et al.'s model [22] and Liu et al.'s model [31], as well as our own implementations of Chou and Li's model and the proposed one. All models were implemented in MATLAB. The software models were run on a desktop computer with an Intel Core i7-4820K (3.70 GHz) CPU and 32 GB of RAM; the operating system was Windows 7 64-bit. The test image was BARB with a resolution of 720 × 576. The time needed by each model to evaluate the JND profile was obtained as the least CPU time measured over 30 runs of each JND model on the test image. The results are presented in Table 5. The CPU time required by the proposed model to evaluate the JND profile was 68 ms, less than twice the 37 ms required by Chou and Li's model. By contrast, the CPU time required by Yang et al.'s model was 88 ms, more than twice that of Chou and Li's model. In the case of Liu et al.'s model, the CPU time was 474 ms, over an order of magnitude more than that of Chou and Li's model.

Table 5. CPU time used by the MATLAB implementations of the different JND models for evaluating the JND profile of the BARB test image.

Table 6 shows that, compared with Chou and Li's JND model, the amount of required FPGA hardware resources increased by over 200% for Yang et al.'s JND model, while for the proposed model the increase was less than 100%. In terms of the maximum clock frequency, the proposed model achieved the same performance as Chou and Li's model, i.e., 190 MHz, which is about 35% faster than the 140 MHz achieved by Yang et al.'s model.

Table 6. FPGA resource utilization and clock frequency comparison of three JND models: Chou and Li's model, Yang et al.'s model and the proposed one.

Table 7. Compressed data rates of JPEG, JPEG 2000, JPEG XR and the proposed JND-based perceptual encoder at the same quality in terms of peak signal-to-perceptible-noise ratio (PSPNR).
6.5. FPGA Resource Utilization and Throughput of the Proposed Perceptual Encoder Architecture

Table 8. FPGA resource utilization of the proposed perceptual encoder architecture.