Perceptual Video Coding Scheme Using Just Noticeable Distortion Model Based on Entropy Filter

Because perceptual video coding (PVC) can reduce bitrates with negligible visual quality loss in video compression, a PVC scheme based on a just noticeable distortion (JND) model is proposed for ultra-high definition (UHD) video. Firstly, the proposed JND model is designed by considering spatial JND characteristics such as contrast sensitivity, luminance adaptation, and a saliency weight factor. Secondly, in order to perform precise JND suppression, a Gauss differential entropy (GDE) filter is designed to divide the image into smooth and complex texture regions. Thirdly, by incorporating the proposed JND model into the encoding process, the transform coefficients are suppressed in harmonization with the transform/quantization process of high efficiency video coding (HEVC). To achieve JND suppression effectively, a distortion compensation factor and a distortion compensation control factor are incorporated to control the extent of distortion in the rate distortion optimization process. The experimental results show that the proposed PVC scheme achieves a remarkable bitrate reduction of 32.98% for the low delay (LD) configuration and 28.61% for the random access (RA) configuration with negligible subjective quality loss. Meanwhile, the proposed method only causes an average encoding time increase of about 12.94% and 22.45% under the LD and RA configurations, respectively, compared with the HEVC reference software.


Introduction
Ultra-high definition (UHD) video provides viewers with an enhanced visual experience via a wider field of view (FOV) and more exquisite frames than high definition (HD) video [1]. UHD video is widely applied in various fields such as education, entertainment, and sports [2,3]. Unfortunately, large bandwidth and storage space are required to realize these UHD video applications [4,5]. These problems limit the development and application of UHD video. Besides spatial and temporal redundancies, visual redundancy widely exists in UHD videos. Hence, a perceptual video coding (PVC) scheme can be utilized to further exploit this visual redundancy and improve compression performance.
The key of PVC is to determine the distortion that users can just notice. Therefore, the study of just noticeable distortion (JND) and PVC has become a hot topic and has attracted much interest. So far, various JND-based PVC schemes have been proposed and utilized for HD image/video compression. JND models can be categorized into two types: pixel-domain and subband-domain (e.g., discrete cosine transform (DCT), wavelet) approaches.

The Proposed JND Model by GDE Filter and the Saliency Factor
An appropriate transform-based JND model for UHD image/video with 10-bit depth is proposed in this section. Firstly, the basic framework of the transform-domain JND model is built based on the CSF and the proposed 10-bit LM effect. Following this, the proposed JND threshold values are suppressed by the GDE filter. Besides, the saliency factor is incorporated into the proposed JND model.

Symbol — Description
H_CSF(w_i,j, ϕ_i,j) — basic contrast sensitivity function (CSF) model
MF_LM(w_i,j, µ_p) — luminance masking (LM) effect
J_LM — the proposed luminance masking effect
H_GDE — the proposed Gauss differential entropy (GDE) filter
S(w_i,j) — modulation factor for saliency
J_GDE-S — the proposed JND model
w_i,j, ϕ_i,j, µ_p — cycles per degree, directional angle, and average pixel intensity of an N × N TU block

The "red dotted box" of Figure 1 shows the proposed JND model. To estimate the JND threshold values for transform coefficient suppression, the original TU blocks of the input image are used to calculate µ_p for the LM effect. Here, the LM effects are calculated as modulation factors. Then, the LM effects are multiplied by a basic contrast sensitivity function (CSF). In order to control the range of the proposed JND threshold values in the smooth and complex regions of the image, the GDE filter is incorporated into the J_LM model. Finally, the filtered J_LM model is multiplied by the saliency factor to account for the FOV of UHD.
To apply the JND-based transform coefficient suppression effectively, the distortion compensation factor (DCF) and the distortion compensation control factor (DCCF) are proposed to compensate the distortion introduced by the J_GDE-S model at different QPs. Following this, they are incorporated into the RDO process of HEVC, as shown in the "blue box" of Figure 1.


The Basic Framework of the JND Model
The proposed basic framework of the JND model is constructed based on H_CSF and MF_LM, which is expressed as (1), where w_i,j is the spatial frequency in cycles per degree (cpd) for the (i, j)-th DCT coefficient, given by (2). Here, θ_x and θ_y indicate the horizontal and vertical visual angles of a pixel, and M is the transform size (M = 4 to 32 for HEVC). The pixel aspect ratio is one for most display screens, so θ is identical in the horizontal and vertical directions [15] and is given by (3), where R_VH is the ratio of the viewing distance to the screen height, and H is the number of pixels in the screen height. From (2) and (3), it can be seen that as the viewing distance or the pixel resolution of the display screen increases, the spatial frequency of each DCT component also increases.
In (1), the directional angle ϕ_i,j between ϕ_i,0 and ϕ_0,j at the (i, j)-th DCT coefficient is taken into account in the H_CSF model. Usually, the HVS is more sensitive to distortions along the horizontal and vertical directions (i or j = 0) than along the diagonal (i = j) direction in spatial frequency; this characteristic is called the oblique effect [27]. The ϕ_i,j is given by (4). The µ_p is the normalized pixel intensity of an image block, defined by (5), where I(x, y) is the pixel intensity at (x, y).
Here, K is the maximum pixel intensity (255 for the 8-bit image format and 1023 for the 10-bit image format). The CSF quantifies the characteristics of human visual perception, which has different sensitivities at different spatial frequencies. In this paper, we adopt the H_CSF threshold in the transform domain [16], where φ_i and φ_j are the normalization factors for the transform coefficients given by (7), and where the model parameter values a = 1.33, b = 0.11, and c = 0.18 are defined as in [15].
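To make the base-threshold computation concrete, the sketch below (Python/NumPy) evaluates the per-coefficient spatial frequency, the oblique-effect angle, and an H_CSF of the common form exp(c·w)/(a + b·w) used in [15,16], which the parameters a = 1.33, b = 0.11, c = 0.18 suggest. Since Equations (1)-(7) are not reproduced above, the exact expressions, the parameters r and s, and the choice R_VH = 1.6 are assumptions for illustration only.

```python
import numpy as np

def dct_norm_factors(n: int) -> np.ndarray:
    """Orthonormal DCT normalization factors: sqrt(1/n) for the DC term, sqrt(2/n) otherwise."""
    phi = np.full(n, np.sqrt(2.0 / n))
    phi[0] = np.sqrt(1.0 / n)
    return phi

def csf_base_threshold(n: int = 8, r_vh: float = 1.6, height_px: int = 2160,
                       a: float = 1.33, b: float = 0.11, c: float = 0.18,
                       r: float = 0.6, s: float = 0.25) -> np.ndarray:
    """Hypothetical H_CSF for an n x n transform block (assumed form, see lead-in)."""
    # visual angle (degrees) of one pixel, derived from the viewing-distance ratio R_VH
    theta = 2.0 * np.degrees(np.arctan(1.0 / (2.0 * r_vh * height_px)))
    idx = np.arange(n, dtype=float)
    w_axis = idx / (2.0 * n * theta)                 # cycles/degree along each axis (assumed)
    w = np.sqrt(w_axis[:, None] ** 2 + w_axis[None, :] ** 2)
    # directional angle of each coefficient (oblique effect); 0 for the DC term
    cross = 2.0 * w_axis[:, None] * w_axis[None, :]
    phi_ang = np.zeros((n, n))
    nz = w > 0
    phi_ang[nz] = np.arcsin(np.clip(cross[nz] / (w[nz] ** 2), -1.0, 1.0))
    norm = dct_norm_factors(n)[:, None] * dct_norm_factors(n)[None, :]
    # magnitude term exp(c*w)/(a + b*w) with an oblique-effect correction
    return (s / norm) * np.exp(c * w) / (a + b * w) / (r + (1.0 - r) * np.cos(phi_ang) ** 2)

def normalized_intensity(block: np.ndarray, k: int = 1023) -> float:
    """mu_p of an N x N block: average pixel intensity normalized by K (1023 for 10-bit)."""
    return float(block.mean()) / k
```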

The Proposed 10-Bit Luminance Masking Effect
The LM effect is modeled as a human visual function by which the JND threshold values change along with the average pixel intensity. Existing JND models only extend the luminance range to adapt to images/videos with 10-bit depth, but do not fully consider the visual difference introduced by the increased luminance range.
The proposed 10-bit LM effect is obtained through psychophysical experiments in which the distortion amplitudes of the DCT coefficients are increased until subjects start to perceive the distortions. Bae et al. designed a subjective experiment to obtain an LM effect in the DCT domain [18]. In this paper, the subjective experiment follows [18]; however, there are two differences from Bae's work. The first is that Bae et al. added distortion to 15 selected DCT coefficients in a lower-triangle frequency zone of the 8 × 8 DCT and perceived the distortion in the DCT domain, whereas we directly add the distortions to all DCT coefficients and perceive the distortions in the luminance domain. Our method can better reflect the distortions of the DCT-domain coefficients. The other difference is that we run the subjective experiments over a wider luminance range (0 to 1023) than Bae's (0 to 255). Therefore, in order to obtain a more perceptually friendly 10-bit LM effect, the subjective experiments are carried out as follows. Initialization: the monitor shows test images residing in parafovea regions (P2 in Figure 2). The experimental environments and conditions are listed in Table 2. Step 1: A subject is informed of where the distortion will be injected by making the region dark (Figure 2a).
Step 2: The subject gradually increases the amplitude of distortion for the DCT coefficient until the distortion is perceived in the luminance domain, such as P1 in Figure 2b. Each measured JND value is the value at which 50% of the subjects start to perceive the corresponding distortion.
Step 3: Steps 1 and 2 are conducted iteratively by going back to Step 1 with a different presentation, until all test presentations are finished.
Figure 3 shows the experimental results of the subjective experiment. The circles represent the JND threshold values obtained from the subjective experiment. The normalized pixel intensity of the N × N TU blocks given by (5) is displayed on the x-axis and the JND threshold values on the y-axis. Zhang et al. exploit several parabolas to approximate the subjective data curve to build an LM effect model [28]. Therefore, we also use this method to fit the subjective data, and the proposed 10-bit MF_LM model is expressed accordingly, where the parameters obtained by the subjective experiments and data fitting are A1 = 4, A2 = 5, B = 1.5, α = 0.5, and β = 0.8.
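The parabola fitting itself can be sketched as follows; the (µ_p, JND) data points below are placeholders rather than the measurements in Figure 3, and the two-segment split at µ_p = 0.5 is an assumption used only to illustrate the piecewise fit of [28].

```python
import numpy as np

# Hypothetical subjective data: normalized block intensity mu_p and measured JND amplitude.
mu_p = np.array([0.00, 0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 1.00])
jnd  = np.array([5.0, 4.2, 3.3, 2.4, 1.8, 1.5, 1.7, 2.3, 3.1, 4.0])

# Fit one parabola to the dark-to-mid segment and one to the mid-to-bright segment,
# mimicking the piecewise-parabolic MF_LM form; the paper's A1, A2, B, alpha, beta
# come from its own measurements, which are not reproduced here.
split = mu_p <= 0.5
coeff_dark   = np.polyfit(mu_p[split], jnd[split], deg=2)
coeff_bright = np.polyfit(mu_p[~split], jnd[~split], deg=2)

def mf_lm(x: float) -> float:
    """Piecewise-parabolic luminance-masking modulation factor (illustrative fit only)."""
    c = coeff_dark if x <= 0.5 else coeff_bright
    return float(np.polyval(c, x))

print(mf_lm(0.1), mf_lm(0.9))
```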


The Proposed 10-Bit JND Model Is Suppressed by the GDE Filter
With the dramatic improvement of UHD resolution, image texture details become finer and closer to the natural scene. To analyze the differences between pixels of UHD images/videos, the GDE is utilized in this paper. The designed GDE filter is not only applied to divide the image into smooth and complex texture regions, but is also utilized to control the extent of imperceptible noise in different regions of the image. Moreover, the texture of the image has not been considered in the proposed J_LM model and fixed quantization matrices are used, which limits the ability to perform precise JND suppression. The GDE is widely utilized in information theory [29] and is generally defined by (10), where x = I(i, j) denotes the pixel at position (i, j) and P(x) is a probability density function for arbitrary pixels, which accords with the Gaussian distribution in (11), where µ is the mean and σ is the standard deviation of the image block. Therefore, (10) and (11) can be combined to deduce (12), where the GDE is only related to σ; it is then expressed as (13).

The σ-H(σ) curves of the ShakeNDry video sequence with 8-bit and 10-bit depth are shown in Figure 5a,b, and they display an approximately logarithmic distribution. A small variance indicates that the texture of the current image block is smooth and allows less imperceptible noise. On the contrary, a large variance indicates that the texture of the current image block is more complex and allows more imperceptible noise. Thus, H(σ) reflects the capacity for imperceptible noise in different regions.

The maximum curvature of the whole curve is taken as the "first-turning" point, which is used as the threshold to distinguish smooth regions from texture-complex regions. Let ρ be the curvature of the curve, expressed as (14), where H′ and H″ are the first and second derivatives with respect to σ, respectively. Finally, the "first-turning" point (σ_th, ρ(σ_th)), which is the point of maximum curvature of the curve, is obtained as (σ_th, ρ(σ_th)) = Maximum(ρ(σ)) (15), where σ_th is the threshold that divides the image into complex texture and smooth regions. In order to preserve more smooth regions, the "second-turning" point is defined as the threshold. The "second-turning" point is the point of maximum curvature of the curve from the "first-turning" point to the end; then (14) and (15) can be used to obtain σ_th ≈ 2.0 and ρ(σ_th) ≈ 3.0, which lie at the intersection of the two red lines shown in Figure 5a,b.

Table 3 lists the statistics of the image segmentation regions. Specifically, if the σ of an image block is smaller than σ_th, the current image block is in the smooth region; otherwise, it is in the complex texture or edge region. As shown in Table 3, the proportion of the smooth region of the video with 8-bit depth is larger than that of the texture region. However, for video with 8-bit depth, the proportion of the texture region of the UHD video is smaller than that of the HD video, because the UHD video contains more pixels in the same content area. As shown in Table 3, the proportion of the texture region is larger than that of the smooth region for UHD images with 10-bit depth. Essentially, UHD images can contain more texture details owing to the 10-bit grey luminance range. Given the above analysis, the GDE filter H_GDE is modeled as in (16).

As shown in Figure 6, one frame of the TrafficFlow video sequence is segmented by H_GDE at 8-bit depth and 10-bit depth, respectively. As shown in Figure 6b,d, the texture or edge regions are white and the smooth regions are black.
Noticeably, there are more white regions in the 10-bit segmentation image than in the 8-bit segmentation image, since the image with 10-bit depth contains more texture details than the image with 8-bit depth. Because the human eye is not sensitive to complex texture details, these details are essentially perceived as noise by the HVS. Thus, the image with 10-bit depth has a capacity for more perceptual distortion than the image with 8-bit depth. Meanwhile, the JND threshold values are mainly exploited in the complex texture regions of the image. The GDE expresses the capacity for noise in different regions of the image. Therefore, the JND threshold values can be suppressed to a certain range by H_GDE without causing perceptual distortion, so J′_LM(µ_p) is obtained by the H_GDE filter and is modeled as in (17). As shown in Figure 7, the proposed J′_LM threshold values are the "blue stars" and the H_GDE threshold values are the "red stars". The proposed JND threshold values increase along with the variance, but they are all suppressed by H_GDE with imperceptible distortions.
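A minimal sketch of the GDE-based segmentation and clamping is given below. It assumes the Gaussian differential entropy H(σ) = 0.5·ln(2πeσ²) implied by (12)-(13), the σ_th ≈ 2.0 threshold reported above, and a simple element-wise minimum for (17); the block size and the use of the natural logarithm are assumptions.

```python
import numpy as np

SIGMA_TH = 2.0  # "second-turning" point reported in the text

def gde(sigma: np.ndarray) -> np.ndarray:
    """Differential entropy of a Gaussian source: H(sigma) = 0.5 * ln(2*pi*e*sigma^2)."""
    sigma = np.maximum(sigma, 1e-6)              # avoid log(0) on perfectly flat blocks
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

def block_sigma(image: np.ndarray, n: int = 8) -> np.ndarray:
    """Per-block standard deviation over non-overlapping n x n blocks."""
    h, w = image.shape
    blocks = image[: h // n * n, : w // n * n].reshape(h // n, n, w // n, n)
    return blocks.std(axis=(1, 3))

def gde_filter(image: np.ndarray, n: int = 8):
    """Return (texture_mask, h_gde): texture/edge blocks have sigma > SIGMA_TH,
    and h_gde bounds the admissible JND per block (assumed realization of H_GDE)."""
    sigma = block_sigma(image.astype(np.float64), n)
    return sigma > SIGMA_TH, gde(sigma)

def clamp_jnd(j_lm: np.ndarray, h_gde: np.ndarray) -> np.ndarray:
    """Suppress the per-block JND so it never exceeds the GDE noise capacity (Figure 7)."""
    return np.minimum(j_lm, h_gde)
```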

Saliency Factor
Human eyes do not view all of the UHD video content at once. Furthermore, the visual focus is not fixed on a specific region. Thus, the saliency region is utilized to simulate the change of visual focus when human eyes view UHD images/videos. Moreover, it better reflects the performance of the proposed JND model over the global image.


Saliency Region Extraction Algorithm Based on DCT Domain
Hou et al. proposed a simple salient region extraction algorithm using the Fourier transform (FT) [30]. In order to maintain consistency with the HEVC coding structure and to reduce the complexity of the algorithm, the DCT is used to replace the FT in Hou's method. Different from Hou's method, the saliency region is modeled as in (18), where g(x, y) is a Gaussian filter utilized to smooth the saliency map for better visual effects and DCT^(-1)(·) is the inverse discrete cosine transform. The residual spectrum of the image R(u, v) is extracted in the spectral domain and is given by (19), where h_n(·) represents a mean filter. The logarithmic spectrum of the image on a log-log scale is defined as in (20). Let I(x, y) represent an image; the amplitude spectrum A(u, v) is obtained by (21) and the phase spectrum θ(u, v) is obtained by (22), where DCT(·) denotes the discrete cosine transform. Figure 8b shows the saliency region extracted by Hou's method and Figure 8c shows the saliency region extracted by the proposed algorithm. As shown in Figure 8b,c, the saliency region of the proposed algorithm is consistent with Hou's.
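A runnable sketch of the DCT-domain spectral-residual extraction described by (18)-(22) is given below. The sign of the DCT coefficients is used as the analogue of the phase spectrum, and the mean-filter window and Gaussian smoothing width are assumed values; SciPy provides the 2-D DCT and the filters.

```python
import numpy as np
from scipy.fftpack import dct, idct
from scipy.ndimage import uniform_filter, gaussian_filter

def dct2(x):  return dct(dct(x, axis=0, norm="ortho"), axis=1, norm="ortho")
def idct2(x): return idct(idct(x, axis=0, norm="ortho"), axis=1, norm="ortho")

def dct_spectral_residual_saliency(image: np.ndarray,
                                   mean_size: int = 3,
                                   smooth_sigma: float = 8.0) -> np.ndarray:
    """Saliency map via the spectral-residual idea of Hou [30], with the FT replaced
    by a 2-D DCT as the paper describes (window sizes are assumptions)."""
    coeff = dct2(image.astype(np.float64))
    amplitude = np.abs(coeff) + 1e-12            # amplitude spectrum A(u, v)
    phase = np.sign(coeff)                       # DCT analogue of the phase spectrum
    log_amp = np.log(amplitude)                  # logarithmic spectrum L(u, v)
    residual = log_amp - uniform_filter(log_amp, size=mean_size)   # residual spectrum R(u, v)
    # back to the pixel domain and smooth with a Gaussian for a cleaner map
    saliency = gaussian_filter(idct2(np.exp(residual) * phase) ** 2, sigma=smooth_sigma)
    return saliency / saliency.max()

# usage: s_map = dct_spectral_residual_saliency(np.random.rand(256, 256) * 1023)
```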
Using the proposed saliency region extraction algorithm, we continue to obtain the JND model based on the saliency weight factor. Firstly, the object map is expressed as in (23), where µ is the normalized intensity of the N × N saliency block. The selection of K_th is a trade-off between false alarms and the neglect of objects. According to [30] and subjective experiments, we set β = 1.5 empirically; the binary object map O(x, y) of the saliency map is then defined accordingly. According to the characteristics of human subjective perception, small JND thresholds are applied to saliency regions, while large JND thresholds are applied to non-saliency regions. Therefore, the saliency regions are inversely proportional to the saliency weight factor γ, where the parameter α is a measure factor and µ_0 ∈ [0, 1] is the normalized intensity of a binary saliency image block with a size of N × N. According to (17), the JND threshold values of the current image block are not larger than H_GDE; from this conditional relation, in which J′_LM is suppressed by H_GDE, the range of values of γ can be deduced. According to (28) and (29), the range of the parameter α is α ∈ [0, 1], and the parameter A can be deduced as A = 1. In order to obtain a large control range, the parameter α is set to 0.9. The JND model based on the saliency weight factor, J_GDE-S, is then obtained.

As Figure 9b shows, one frame of the CatRobot1 UHD video sequence is contaminated by the proposed JND values. Compared with Figure 9a, there is no obvious subjective quality loss in Figure 9b. Three colored (red, yellow, and blue) boxes in Figure 9 show the visual differences clearly. Figure 9c shows the distribution of the J′_LM threshold values, where the white regions represent JND values that are larger than those in the black or grey regions. Because H_GDE divides the image into complex texture regions and smooth regions, the JND threshold values are mainly embedded in the texture or edge regions of the image. Figure 9d shows the distribution of the proposed J_GDE-S threshold values, in which the saliency weight factors are incorporated into the J′_LM model. Figure 9f shows the saliency image of Figure 9a, and, as shown in Figure 9d, the edge region of the saliency object is not embedded or is embedded with only small JND threshold values. Therefore, the edge regions of the saliency object are protected from distortions that would often cause noticeable visual degradation. Figure 9e shows the difference regions, which are consistent with Figure 9b.
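Because the functional forms behind the saliency weight are not reproduced above, the following sketch is purely illustrative: it assumes γ = A − α·µ_0 with A = 1 and α = 0.9, which keeps γ within the deduced range and decreases with saliency, and it assumes J_GDE-S is the clamped JND scaled by γ.

```python
import numpy as np

A, ALPHA = 1.0, 0.9   # parameter values stated in the text

def saliency_weight(mu0: np.ndarray) -> np.ndarray:
    """Hypothetical saliency weight gamma, decreasing with the normalized saliency
    intensity mu0 in [0, 1]; gamma = A - alpha*mu0 stays in [0.1, 1] (assumed form)."""
    return A - ALPHA * np.clip(mu0, 0.0, 1.0)

def jnd_gde_s(j_lm_clamped: np.ndarray, mu0: np.ndarray) -> np.ndarray:
    """Illustrative J_GDE-S: the GDE-clamped JND scaled by the saliency weight,
    so salient blocks receive smaller thresholds than non-salient ones."""
    return saliency_weight(mu0) * j_lm_clamped
```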
Temporal masking is not incorporated into the proposed JND model, as a trade-off between encoding time and efficiency. Because the motion vectors of CU or TU blocks are obtained and reused in the recursive HEVC coding process, they are not necessarily the true motion vectors. Meanwhile, precise temporal JND values are hard to calculate with these motion vectors, which may introduce additional visible distortions. Moreover, the calculation of a temporal masking model is computationally heavy because of the recursive HEVC coding process.

Overall Architecture of the Proposed PVC Scheme
The proposed J_GDE-S model is incorporated into the HEVC test model (HM 16.9) reference software [31], and a JND-based, perceptually HEVC-compliant video coding scheme is proposed. Figure 10 shows the overall flowchart of the proposed PVC encoder. Firstly, a CU block is split into TU blocks according to the rate distortion (RD) cost. If the parent TU block is further split into four sub-TU blocks, each sub-TU block decides whether to continue splitting according to the RD costs of the parent TU and its child TUs, respectively; otherwise, the transform coding process is carried out directly. The transform coding process is mainly divided into the DCT transform and quantization. As shown in the "red dotted box" of Figure 10, the average pixel value and the variance of the current N × N TU block are calculated, then the saliency factor is obtained in the DCT domain, and finally the proposed J_GDE-S threshold is modeled.

Figure 11 shows an example of JND suppression for 1-D transform coefficients. Figure 11a shows the distribution before JND suppression, where the red dotted line represents the JND threshold. As (32) shows, if the amplitude of a transform coefficient |C(n, i, j)| < J_GDE-S, then the transform coefficient after suppression |C′(n, i, j)| is set to zero; otherwise, |C′(n, i, j)| is equal to |C(n, i, j)| − J_GDE-S. Through this decision, the transform coefficients of complex texture regions or regions not perceived by the HVS are set to zero, and the transform coefficients of smooth regions or perceived regions are decreased by JND suppression. Since zero or smaller transform coefficients are discarded in the quantization process, encoding bitrates are saved and the compression efficiency is further improved. As Figure 12 shows, the "red bars" are the absolute values of the 1-D transform coefficients before suppression and the "blue bars" are the absolute values of the 1-D transform coefficients after suppression with the proposed JND threshold. From Figure 12, we can draw two conclusions. First, the amplitudes of the "blue bars" are all lower than those of the "red bars" because of the suppression with the proposed JND threshold values. Second, the distribution of the "blue bars" is sparser than that of the "red bars", because coefficients whose absolute values are lower than the JND threshold values are suppressed to zero. This shows that parts of the transform coefficients are suppressed by the proposed JND threshold values.
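The suppression rule (32) illustrated in Figures 11 and 12 can be sketched as follows; preserving the sign of each coefficient is an assumption, since only the magnitudes are discussed above.

```python
import numpy as np

def suppress_coefficients(c: np.ndarray, j_gde_s: np.ndarray) -> np.ndarray:
    """JND suppression of transform coefficients: coefficients whose magnitude is
    below the JND threshold are zeroed, otherwise the magnitude is reduced by it."""
    mag = np.abs(c)
    suppressed = np.where(mag < j_gde_s, 0.0, mag - j_gde_s)
    return np.sign(c) * suppressed

# example: a 1-D row of transform coefficients against a flat threshold of 4
row = np.array([12.0, -3.0, 7.5, 0.8, -9.0, 2.2])
print(suppress_coefficients(row, np.full_like(row, 4.0)))
# -> [ 8.  -0.   3.5  0.  -5.   0. ]
```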


The Proposed JND-Based HEVC-Compliant PVC Scheme Using Distortion Compensation
Because the residuals of CU blocks are utilized in the recursive HEVC coding process, the reconstructed image quality influences the subsequent coding performance. The JND-based transform coefficient suppression introduces a certain distortion into the reconstructed image, and the coding errors then accumulate recursively in the encoding process. Therefore, how to control the distortion caused by JND suppression is the key to determining the final coding performance. Obviously, the distortions directly affect the RD cost of the encoding process, and the RD cost is the basis for selecting the encoding mode and the CU partition. In (33), the RD cost function J_RDO is composed of the distortion D, the bitrate R, and the Lagrange factor λ, which is related to the quantization step. Kim et al. control the distortion caused by JND suppression by incorporating a distortion compensation factor into the RDO of the encoding process [21]. However, Kim's method needs to calculate the dequantized DCT coefficients without JND suppression, which increases the computational complexity. The proposed distortion compensation factor only needs to calculate the absolute value of the difference between the DCT coefficients before and after suppression. Moreover, the distortion compensation control factor (DCCF) under different QPs is introduced to control the extent of distortion compensation more efficiently.
Equation (34) is the proposed rate-distortion (RD) cost function J_RDO, where ε is the distortion compensation factor (DCF) and ψ_q is the DCCF. The DCF is calculated by (35), where J_GDE-S(n, i, j) represents the (i, j)-th JND threshold value of the n-th block.
The ∆C″(n, i, j) in (36) is the absolute value of the difference between the transform coefficient before suppression and the transform coefficient after suppression. The DCF is large when ∆C″(n, i, j) is small, which means that the distortion caused by JND suppression is small; in this case, a large distortion D and a small DCCF are allowed in the RD cost function. Otherwise, the distortion caused by JND suppression is large, and the distortion D in the RD cost function needs to be reduced.
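The quantity ∆C″ follows directly from its definition, while the exact expression (35) for the DCF is not reproduced above; the form used below (close to one when ∆C″ is small and decreasing as ∆C″ grows) is therefore only an illustrative assumption.

```python
import numpy as np

def delta_c(c_before: np.ndarray, c_after: np.ndarray) -> np.ndarray:
    """|C - C'|: absolute coefficient change introduced by JND suppression."""
    return np.abs(c_before - c_after)

def distortion_compensation_factor(j_gde_s: np.ndarray, dc: np.ndarray) -> np.ndarray:
    """Hypothetical DCF epsilon: near 1 when the suppression-induced change is small
    relative to the JND threshold, and decreasing as the change grows (assumed form)."""
    return j_gde_s / (j_gde_s + dc + 1e-12)
```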
Usually, the coding distortion of the video is smaller when a small QP is used than when a large QP is used. As shown in Figure 13a,b, the subjective and objective (PSNR) quality of the frame encoded with QP = 19 is higher than that of the frame encoded with QP = 40. With a large QP, the encoding process already causes high coding distortion, and there is little room to further suppress perceptual redundancy; therefore, a large DCF is adopted in the RDO process. Otherwise, a small DCF is adopted in the RDO process and more bitrate is saved. Therefore, a DCCF is applied to control the extent of the DCF under different QPs. When the QP value is greater than 40, the small DCCF is utilized because the objective and subjective quality of the encoded video decreases significantly. In order to obtain the relationship between the DCCF and the QPs, the decision is designed as shown in (37), where q is a vector of QPs and q = [22 27 32 37]. Therefore, ψ_q represents the difference of PSNR between each QP in q and q = 37, or the difference of PSNR between q = 37 and q = 40.
In Figure 14, the statistics come from different scenarios of HD and UHD sequences at different QPs under the random access (RA) and low delay (LD) configurations. As shown in Figure 14, the values of ψ_q (on the ordinate) are large at small QPs, and the curves are smooth at high QPs. Essentially, the objective and subjective quality of the encoded video is high at small QPs, and more perceptual distortion is acceptable; thus, the DCF is reduced and the DCCF should be increased appropriately. On the contrary, in order to ensure the subjective and objective quality of the encoded video and allow only small perceptual distortion, the DCF should be increased while the DCCF should be reduced.

Figure 14. ψ_q with different QPs; (a) based on the RA configuration, (b) based on the LD configuration.

From the above statistical results, the average values of ψ_q are set as the DCCF under different QPs. Figure 15 shows the polynomial fitting curves based on the statistics in Figure 14. The ψ_q decreases with increasing QP, and ψ_q is larger in the LD configuration than in the RA configuration, because the encoding usually operates with smaller prediction errors under the RA configuration than under the LD configuration for different QPs. The ψ_q is given by:

ψ_q = 0.013 · QP² − 1.109 · QP + 24.641 for RA, and ψ_q = 0.015 · QP² − 1.294 · QP + 28.310 for LD, (38)

where (38) reflects the fact that a high ψ_q is utilized at small QPs and, on the contrary, a small ψ_q is utilized at high QPs. This is consistent with the results shown in Figure 14.

Figure 15. The polynomial fitting curves based on the RA and LD configurations.
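Equation (38) can be evaluated directly; the short sketch below reproduces the two fitted polynomials for the RA and LD configurations.

```python
def dccf(qp: float, config: str = "RA") -> float:
    """Distortion compensation control factor psi_q from Eq. (38)."""
    if config.upper() == "RA":
        return 0.013 * qp ** 2 - 1.109 * qp + 24.641
    return 0.015 * qp ** 2 - 1.294 * qp + 28.310   # LD configuration

for qp in (22, 27, 32, 37):
    print(qp, round(dccf(qp, "RA"), 3), round(dccf(qp, "LD"), 3))
```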

Verification of the Proposed J GDE-S Model
The HD and UHD videos are mainly provided by [32] and the Joint Video Exploration Team (JVET). For the wide FOV of 4K UHD video/images, it is not appropriate to apply the adjectival categorical judgment (ACJ) method, which shows the reference images (left) and the test images (right) at the same time. To verify the effectiveness of the proposed J_GDE-S model for 4K UHD video/images, subjective viewing tests are conducted based on the double-stimulus continuous quality-scale (DSCQS) method, where A (a reference sequence) and B (a sequence to be compared) are pseudo-randomly ordered for each presentation [33,34]. For the subjective test of still pictures, sequence A consists of one frame from a 4K UHD video sequence. Sequence A is then contaminated by (40) to form sequence B, as shown in Figure 16. A 3-4 s sequence and five repetitions (voting during the last two) may be appropriate for still pictures [33,34]. The display conditions are the same as those in Table 2. The viewing distance is set to 2.1 m, which is appropriate for a 55-inch 4K UHD display. A total of 15 subjects participated in the subjective quality assessment experiments, all of whom have normal vision.

According to [33,34], both A and B are evaluated with subjective voting scores ranging from 0 to 100 for the worst and the best visual qualities, respectively. The differential mean opinion score (DMOS) is defined as in (39), where MOS_PVC and MOS_ORI are the measured mean opinion score (MOS) values for the image contaminated by the JND noise and for the original image of the sequence, respectively. In (40), C(n, i, j) represents the (i, j)-th DCT coefficient of the n-th block, C′(n, i, j) is the (i, j)-th DCT coefficient of the n-th block contaminated by random noise, and S_rand(n, i, j) is random noise taking the value +1 or −1. Table 4 lists the comparison of Kim's CM-JND model and the proposed J_GDE-S model in terms of PSNR and DMOS values.
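The contamination in (40) and the DMOS computation can be sketched as follows; the assumption that the noise amplitude equals the J_GDE-S threshold and that DMOS = MOS_ORI − MOS_PVC follows the description above but is not stated as an explicit formula.

```python
import numpy as np

def contaminate_block(c: np.ndarray, j_gde_s: np.ndarray, rng=np.random) -> np.ndarray:
    """Add the JND threshold with a random sign (+1 or -1) to every DCT coefficient."""
    s_rand = rng.choice([-1.0, 1.0], size=c.shape)
    return c + s_rand * j_gde_s

def dmos(mos_ori: float, mos_pvc: float) -> float:
    """Differential mean opinion score, assumed here as MOS_ORI - MOS_PVC."""
    return mos_ori - mos_pvc
```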

Objective and Subjective Performance Evaluation for the Proposed PVC Scheme
To verify the effectiveness of the proposed HEVC-compliant PVC scheme, the proposed J_GDE-S model is implemented in HM 16.9 and compared with the original HM 16.9 and Kim's PVC scheme [21]. Because Kim's PVC was proposed in harmonization with HM 11.0 and is also applied in the RDO-based encoding process, Kim's PVC is re-implemented in HM 16.9 for a fair comparison in this paper. The test sequences used for the experiments include eight different scenes of 4K UHD video sequences in the 4:2:0 color format. Because 10-bit depth is the mainstream for 4K UHD videos, the LD and RA configurations with 10-bit depth are used for all experiments with a set of fixed quantization parameters, namely QP = 22, 27, 32, and 37.
We compare the original HM 16.9, Kim's PVC, and the proposed PVC in terms of bitrate reduction and encoding time to verify the objective RD performance and the encoder complexity, and we also use DMOS values for the subjective quality assessment. The bitrate reduction between the original HM 16.9 and the tested (Kim's and the proposed) PVC schemes is represented as (41), and the encoding time change is represented as (42), where R_ori and T_ori are the bitrate and encoding time produced by the original HM 16.9, and R_PVC and T_PVC are the bitrate and encoding time produced by the proposed or Kim's PVC scheme.
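Assuming the usual percentage definitions relative to the HM 16.9 anchor, (41) and (42) can be computed as follows.

```python
def bitrate_reduction(r_ori: float, r_pvc: float) -> float:
    """Delta R (%): bitrate saved by the PVC scheme relative to the original HM 16.9."""
    return (r_ori - r_pvc) / r_ori * 100.0

def encoding_time_change(t_ori: float, t_pvc: float) -> float:
    """Delta T (%): encoding-time increase of the PVC scheme relative to HM 16.9."""
    return (t_pvc - t_ori) / t_ori * 100.0

print(bitrate_reduction(1000.0, 670.2), encoding_time_change(100.0, 112.9))
```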
The proposed PVC scheme is also compared via subjective quality assessment experiments. For this, the DSCQS method is employed for subjective evaluation and the experimental setup is the same as in Section 4.1. For moving video sequences, A and B are each presented for 10 s, and 3 s dummy videos are inserted between them, as Figure 16 shows, where one test sequence is a reconstructed sequence encoded by the original HM 16.9 and the other is a reconstructed sequence encoded by the proposed PVC or Kim's PVC. The presentation order of the two test sequences is randomly selected for each presentation. The subjective voting scores are set in the same way as in Section 4.1.

Tables 5-7 show the objective and subjective test results for the proposed PVC scheme, Kim's PVC scheme, and the original HM 16.9 under the LD and RA Main10 profiles, respectively. In terms of bitrate reduction, the proposed PVC scheme outperforms Kim's PVC scheme for all test sequences and all four QP values, yielding an average ∆R = 32.98% and a maximum ∆R = 80.89% under the LD profile, and an average ∆R = 28.61% and a maximum ∆R = 66.04% under the RA profile, with the maxima obtained for the 'DaylightRoad2' sequence at QP = 22. The maximum bitrate reduction obtained for the 'DaylightRoad2' sequence is due to the fact that it contains rich scenes and many complex texture regions throughout all frames. Therefore, it contains more noise that is insensitive to visual perception, so the proposed PVC scheme effectively suppresses the transform coefficients in those regions.

Tables 5 and 6 show that the ∆R values usually decrease as the QP values increase for both PVC schemes. This is because the distortions introduced by quantization errors are already high at large QP values; in order to maintain high objective quality, there is almost no room left for coefficient suppression, so the coefficient suppression contributes relatively less to the total bitrate reduction. Compared with Kim's PVC scheme, the proposed PVC scheme achieves remarkably higher bitrate reductions with higher objective and subjective quality for all test sequences at different QPs. In order to reach a large extent of suppression of the DCT coefficients, the JND threshold values are scaled to large values in Kim's PVC; therefore, the reconstruction distortions of the TU residual blocks increase, causing lower objective and subjective quality than the proposed PVC scheme. Given that the DCCF is utilized in the proposed PVC scheme to control the extent of the distortions, the proposed PVC scheme suppresses the transform coefficients more effectively than Kim's JND suppression; as a result, the objective and subjective quality of video encoded by the proposed PVC scheme is higher than that of Kim's PVC scheme. It is noticed that the average bitrate reduction ratios of the tested PVC schemes under the RA profile are slightly smaller than those obtained under the LD profile. This is because the encoding under the RA profile usually operates with smaller prediction errors than under the LD profile; that is, there are smaller chances for both the proposed J_GDE-S suppression and Kim's suppression to act on the transform coefficients of the residues under the RA profile.
As shown in Tables 5 and 6, the proposed PVC scheme increases the total encoding time by only 12.94% and 22.45% on average under the LD Main10 and RA Main10 profiles, respectively, compared with the original HM 16.9. Because temporal masking is not considered and the proposed simple distortion compensation factor is utilized, the proposed PVC scheme has lower complexity than Kim's. Table 7 shows that the PSNR of the proposed PVC scheme and of Kim's PVC scheme is slightly degraded compared with the original; however, the PSNR of the proposed PVC scheme is higher than Kim's. Usually, objective quality is not very well correlated with perceived visual quality, so subjective quality is also an important measurement for assessing a PVC scheme. For the subjective test results shown in Table 7, Kim's and the proposed PVC schemes have statistically almost zero average DMOS values, yielding rarely visible distortion. The subjective quality of the proposed PVC scheme is higher than that of Kim's PVC scheme; for example, the character details and edges are clearer compared with Kim's PVC scheme in the yellow boxes of Figures 17 and 18, owing to the effect of different QPs on distortion being considered in the proposed PVC scheme. From the objective and subjective test results, the proposed PVC scheme achieves significant bitrate reduction at the same perceptual quality with a small increase in encoder complexity compared with the original HM 16.9, and outperforms Kim's PVC scheme with about twice the average bitrate reduction ratio. The superiority of the proposed PVC scheme comes from the fact that the proposed JND model is more appropriate for UHD videos with 10-bit depth and that the designed DCCF minimizes the bitrate at different QPs while maintaining high PSNR. Thus, the proposed PVC scheme can reduce more bitrate than Kim's PVC scheme at higher PSNR.

Conclusions
A perceptually friendly JND-based HEVC-compliant PVC scheme is proposed in this paper. To better reflect the UHD image/video perception characteristics of the HVS, a simple J_GDE-S model is proposed based on appropriate subjective experiments. Specifically, an LM effect for UHD images/videos with 10-bit depth is modeled, and it is suppressed by a GDE filter which controls the extent of imperceptible noise in the image. In addition, a DCT-based saliency factor is added to the proposed JND model according to the FOV of UHD images/videos. A simple transform coefficient suppression method with the JND model is proposed in harmonization with the transform and quantization process of HM 16.9 in an HEVC-compliant manner. Perceptual distortion is also appropriately compensated in the RDO process by incorporating the proposed DCF. In order to efficiently control the extent of the distortion compensation for different QPs, the DCCF is also incorporated into the RDO process. In the objective and subjective experiments, the proposed HEVC-compliant PVC scheme yields a remarkable bitrate reduction of 32.98% on average for the LD configuration and 28.61% on average for the RA configuration with negligible subjective quality loss. It only causes an average of 12.94% and 22.45% encoding time increase under the LD and RA configurations, respectively, compared with the original HM 16.9. Moreover, the proposed PVC scheme reduces more bitrate with higher objective and subjective quality and is computationally faster than Kim's PVC. There are still some limitations and possible improvements, which we summarize as follows: (1) the proposed JND model does not consider the chroma components; (2) we only considered that the encoding QPs lead to different extents of perceptual distortion in the RDO process. Other encoding characteristics will be explored in our future work.