High Efficiency Video Coding Compliant Perceptual Video Coding Using Entropy Based Visual Saliency Model

In past years, several visual saliency algorithms have been proposed to extract salient regions from multimedia content in view of practical applications. Entropy is one of the important measures for extracting salient regions, as these regions have high randomness and attract more visual attention. In the context of perceptual video coding (PVC), computational visual saliency models that utilize the characteristics of the human visual system to improve the compression ratio are of paramount importance. To date, only a few PVC schemes that use a visual saliency model have been reported. In this paper, we make the first attempt to utilize entropy based visual saliency models within the high efficiency video coding (HEVC) framework. The visual saliency map generated for each input video frame is optimally thresholded to generate a binary saliency mask. The proposed HEVC compliant PVC scheme adjusts the quantization parameter according to visual saliency relevance at the coding tree unit (CTU) level. Efficient CTU level rate control is achieved by allocating bits to salient and non-salient CTUs, adjusting their quantization parameter values according to the perceptual weight map. The attention based on information maximization model shows the best performance on a newly created ground-truth dataset and is therefore incorporated in the HEVC framework. The proposed HEVC compliant PVC scheme achieves an average bitrate reduction of 6.57% at the same perceptual quality, with a nominal increase in coding complexity of 3.34% compared with the HEVC reference software. Moreover, the proposed PVC scheme performs better than other HEVC based PVC schemes when encoding at low data rates.


Introduction
Currently, the majority of information being communicated and shared on the Internet is in the form of multimedia. Images and videos captured from imaging and handheld devices possess an enormous amount of redundant information, which needs to be exploited for efficient transmission and storage. Traditionally, image and video coding techniques are developed with the aim of removing redundant information to reduce size, while preserving visual quality. The International Telecommunication Union (ITU) and the International Organization for Standardization (ISO) have developed a series of video coding standards over the last three decades. In 2010, the ITU video coding experts group (VCEG) and the ISO moving picture experts group (MPEG) created the joint collaborative team on video coding (JCT-VC) for the development of high efficiency video coding (HEVC), with the aim of achieving high compression gain [1]. Since the first version of HEVC was approved in April 2013, the research community has contributed to improving the performance of HEVC and its hardware implementation [2][3][4][5]. In Reference [2], a computationally scalable rate estimation algorithm is proposed that addresses the complexity issue associated with HEVC when encoding higher resolution videos. In Reference [3], an FPGA-based hardware implementation of a video encoder is presented, which addresses the throughput of high-resolution and high-quality videos in the entropy coding stage. A CABAC bit-rate estimation algorithm is implemented in FPGA- and ASIC-based hardware architectures, which exploit parallelism to improve HEVC performance [4]. In Reference [5], an FPGA- and ASIC-based hardware architecture of an HEVC intra encoder is presented that achieves better performance in terms of computation-workload reduction, BD-Rate and BD-PSNR.
Recently, researchers in the field of video coding have focused on reducing bit rate by utilizing the characteristics of the human visual system (HVS) and targeting higher quality for the salient regions of a video. This benefits network usability by reducing the required bandwidth and enhances the user experience. Psychovisual aspects of the HVS have been employed in the video coding framework to remove perceptually redundant information. Video contains perceptually irrelevant information, as humans generally focus on certain regions in a scene called regions of interest (ROI). A perceptual video coding (PVC) scheme employs a visual saliency model to remove perceptually redundant information. Taking advantage of HVS characteristics, the PVC scheme screens out the perceptually irrelevant information present in the video. This improves the performance of video coding systems in terms of bit rate reduction, while maintaining the same perceived video quality. A visual saliency model can be integrated in a video coding framework in a variety of ways, which results in diversified PVC schemes. Generally, PVC schemes are classified into two classes: pre-processing based PVC and embedded PVC [6].
Pre-processing based PVC schemes exploit HVS characteristics to modify the input video signal prior to encoding. In Reference [7], visual saliency based smoothing and enhancement is performed on the video frames before the encoding process. A foveation filter is incorporated at the pre-processing stage, which is modified by a moving-pattern classifier and the Hedge algorithm to suit the HVS mechanism. Spatial blurring is employed to remove the high-frequency content from the image background, which represents the non-salient region [8]. As a result, the background is encoded at a lower bit rate. In Reference [9], multiscale analysis and wavelet decomposition are employed to compute salient regions in video frames. Smoothing filters are applied to non-salient regions to remove high-frequency content, which results in an improvement in compression efficiency. The overall performance of pre-processing based PVC schemes is low because these methods are unable to fully utilize the video encoder characteristics. In embedded PVC schemes, on the other hand, one or more functional blocks of the video coding framework are optimized consistent with the HVS characteristics [10]. A visual saliency algorithm is employed to extract the perceptual features from video frames and adjust the encoder parameters accordingly. In Reference [11], HVS characteristics are utilized to optimize the distortion model of the HEVC encoder in accordance with the perceived image quality. A simplified perceptual rate-distortion optimization (RDO) procedure is adopted for the PVC scheme, which is influenced by the structural similarity index based divisive normalization scheme. In Reference [12], the PVC scheme adapts the scaling factor in the quantization block to the perceptual characteristics at the macroblock level. In Reference [13], the frequency sensitivity of the HVS is employed to improve the subjective quality of the video coding framework. The adaptive frequency weighting algorithm is utilized at the macroblock level to pick the frequency weighting factor for the quantization matrix.
In video coding, the data rate of the encoded bitstream is controlled by varying the quantization parameter (QP) value. As the QP value increases, the bitrate drops, but at the cost of visual quality. In PVC, several rate-control schemes have employed perceptual information for efficient resource allocation. In Reference [14], a PVC architecture is proposed that computes a saliency map for each frame of the input video and incorporates the saliency information in video coding for non-uniform bit allocation. In Reference [15], the perceptual relevance of facial features in conversational videos is incorporated for rate control of HEVC. In Reference [16], the HEVC coding tree unit (CTU) and QPs are adaptively adjusted based on a hierarchical perception model of facial features in conversational videos. In Reference [17], a variable block-sized DCT kernel-based just-noticeable difference (JND) profile is proposed for PVC, where transform coefficients are suppressed according to a perceptual distortion detection model. In Reference [18], a visual perception model is incorporated to extract texture and motion masking properties, which optimizes the rate-distortion optimization process in HEVC. Exploiting the fact that the HVS is not sensitive to distortion in regions with complex texture and intense motion, it adapts the Lagrangian multiplier and QP value of the current CTU to the video content. However, the Lagrangian computation adds complexity while selecting the best QP values. In Reference [19], an RDO scheme is adopted in the HEVC reference implementation HM to select the best QP value in the rate-distortion sense. The RDO scheme calculates a Lagrange multiplier before computing the QP. However, in the RDO scheme, the perceptual relevance of each pixel in a frame is weighted uniformly [15], which results in needless equal bit allocation to ROI and non-ROI regions.
The moving objects in a video are the potential points to catch human attention. The spatial, as well as the temporal, characteristics of video have been utilized to generate a saliency map [20]. The spatiotemporal saliency map is then used for QP selection at coding unit level to guide bit allocation in the HEVC encoding framework. The JND model is employed in transformation and quantization blocks to phase out visually redundant information in HEVC [21]. For the transform skip mode in HEVC, the JND threshold is computed in the pixel domain by taking into account the luminance adaptation and contrast masking effects. For the transform non-skip mode, the transform domain JND threshold is estimated by considering the contrast sensitivity function. In Reference [22], the JND threshold based on perceptual redundancy in both luma and chroma channels is incorporated in HEVC at transformation and quantization stages to achieve bitrate saving and complexity reduction.
To the best of our knowledge, entropy based visual saliency models have not been incorporated in a video coding framework. Since entropy-based techniques have been effectively utilized to capture image features, it is therefore worth investigating the effectiveness of entropy-based visual saliency algorithms in a PVC framework. In this paper, a flexible and versatile HEVC compliant PVC framework is proposed that achieves bitrate reduction without degrading the perceived visual quality. An entropy based visual saliency algorithm is used to generate a saliency map at frame level. A binary saliency mask is created by thresholding the saliency map. A perceptual weight map is generated that identifies salient and non-salient CTUs. Different QP values are assigned to salient and non-salient CTUs in such a manner that the data rate is minimized while preserving the perceptual video quality. The major contributions of this work are:

1. Performance comparison of different entropy based visual saliency algorithms is presented for videos using a newly developed pixel-labeled ground truth.
2. An information maximization based visual saliency algorithm is incorporated in the HEVC framework.
3. An efficient algorithm to allocate quantization parameters to salient and non-salient CTUs is presented that minimizes the data rate while preserving the perceived quality.
4. The proposed entropy based PVC framework is evaluated objectively and subjectively and shows superior coding performance.
The rest of the paper is organized as follows. Section 2 describes the proposed HEVC compliant PVC framework. Section 3 presents the experimental results followed by the conclusion in Section 4.

Proposed Methodology
The block diagram of the proposed HEVC compliant perceptual video coding framework using an entropy based visual saliency model is shown in Figure 1. The saliency map of each frame generated by the entropy based visual saliency model is a grayscale image, which is thresholded to obtain a binary saliency mask. The optimal threshold value is obtained by comparing the generated saliency map with the human-labeled ground truth. The binary saliency mask is divided into CTUs in a similar fashion as in HEVC, and the CTUs are categorized into salient and non-salient based on their perceptual relevance. Optimal QP values for salient and non-salient CTUs are selected in such a way that the data rate is minimized while maintaining the perceived visual quality. The details of each block are presented in the following subsections.

Entropy Based Visual Saliency Model
Visual saliency has been the focus of psychologists, neurobiologists and computer scientists over the last few decades [23]. Computer scientists have developed numerous computational visual saliency algorithms, which aim at detecting the salient regions in an image. Computational visual saliency models find their applications in a broad spectrum of domains including remote sensing [24], watermarking [25], privacy [26], text detection [27], object recognition [28], multi-camera calibration [29], binocular vision [30], and video coding [31]. Generally, saliency detection techniques are categorized into bottom-up and top-down approaches. Bottom-up approaches are data-driven where the perception starts at the stimulus and top-down approaches are goal-driven where the saliency extraction is influenced by the task dependent cues. A great deal of research has focused on how human attention shifts while viewing a scene [32]. Attention theories [33,34] and earlier work on understanding human perception [35,36] suggest that the HVS is attracted to the regions in a scene that carry the maximum information [37].
Entropy has been extensively utilized in extracting and analyzing the salient regions from an image. It has been observed that image regions with high randomness attract more attention. A number of methods have been proposed that compute visual saliency from an entropy and information maximization perspective [38][39][40]. In this work, we selected four entropy based visual saliency models, namely attention based on information maximization [37], saliency and scale measures [41], entropy based object segmentation [42], and fuzzy entropy based multi-level thresholding [43] to generate a saliency map SM(p) from the input video frame. A brief description of each entropy based visual saliency model is as follows:

Attention based on Information Maximization (AIM):
AIM is based on Shannon's information theory and computes the self-information at each location of the frame [37]. It takes advantage of the fact that the HVS directs the attention mechanism to the most informative visual content in a scene.
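As a rough illustration of the self-information idea only (not the actual AIM model, which estimates self-information over learned sparse features rather than raw intensities), the sketch below treats rare pixel intensities as more informative; the histogram-based probability model is an assumption for illustration:

```python
import numpy as np

def self_information_map(frame):
    """Per-pixel self-information -log2 p(intensity): a simplified,
    histogram-based proxy for AIM's feature-space self-information."""
    frame = np.asarray(frame, dtype=np.uint8)
    hist = np.bincount(frame.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    # Rare intensities carry more information and are treated as more salient.
    info = -np.log2(np.maximum(p, 1e-12))
    return info[frame]

frame = np.zeros((8, 8), dtype=np.uint8)
frame[3:5, 3:5] = 200  # a small, rare patch on a uniform background
sal = self_information_map(frame)
```

The rare patch receives a higher saliency value than the common background, mirroring the "most informative content attracts attention" principle.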

Saliency and Scale Measures (SSM):
SSM captures the most salient features over different spatial locations and the feature space [41]. Entropy maximization is used as a measure to identify salient regions in images. For each pixel location, the scale at which the entropy measure reaches its peak value is selected. The degree of self-similarity is then measured using local-descriptor statistics over a window of scales around the peak saliency measure.
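The entropy-peak scale selection can be sketched as follows; this toy version scans a few window sizes per pixel and omits SSM's self-similarity weighting (the function names and scale set are illustrative assumptions):

```python
import numpy as np

def patch_entropy(patch):
    """Shannon entropy of the intensity histogram of a patch."""
    counts = np.bincount(patch.ravel(), minlength=256)
    p = counts[counts > 0] / patch.size
    return float(-(p * np.log2(p)).sum())

def best_scale(frame, y, x, scales=(3, 5, 7, 9)):
    """Return the window scale at which local entropy peaks for pixel (y, x)."""
    frame = np.asarray(frame, dtype=np.uint8)
    vals = []
    for s in scales:
        r = s // 2
        padded = np.pad(frame, r, mode='edge')
        vals.append(patch_entropy(padded[y:y + s, x:x + s]))
    return scales[int(np.argmax(vals))]

frame = np.zeros((15, 15), dtype=np.uint8)
frame[6:9, 6:9] = 255          # a 3x3 bright blob on a dark background
scale = best_scale(frame, 7, 7)  # entropy peaks at the scale mixing blob and background
```

At the blob centre, a 3x3 window sees only blob pixels (entropy 0), while a 5x5 window mixes blob and background and maximizes the entropy, so scale 5 is selected.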

Entropy based Object Segmentation (EOS):
EOS computes a saliency map using local entropy as a feature, which represents the complexity and unpredictability of a local region [42]. Regions with high complexity have a flat intensity distribution and thus higher entropy values, and are therefore considered salient.
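A minimal sketch of a local-entropy saliency feature in the spirit of EOS (the window size and edge padding are assumed implementation details):

```python
import numpy as np

def local_entropy_map(frame, win=3):
    """Saliency map from the Shannon entropy of each win x win neighbourhood:
    flat regions give entropy 0, complex regions give high entropy."""
    frame = np.asarray(frame, dtype=np.uint8)
    h, w = frame.shape
    r = win // 2
    padded = np.pad(frame, r, mode='edge')
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + win, x:x + win]
            counts = np.bincount(patch.ravel(), minlength=256)
            p = counts[counts > 0] / patch.size
            out[y, x] = -(p * np.log2(p)).sum()
    return out

frame = np.zeros((8, 8), dtype=np.uint8)
frame[::2, 4:] = 255           # striped (complex) right half, flat left half
sal = local_entropy_map(frame)
```

The striped half of the toy frame receives higher entropy values than the flat half, matching the complexity-implies-saliency assumption of EOS.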

Fuzzy Entropy based Multi-Level Thresholding (FEMLT):
FEMLT utilizes fuzzy entropy to segment an image's foreground object from the background [43]. Shannon's entropy of the input frame is computed at different thresholds, which are determined from the normalized histogram. An entropy maximization approach is employed to select the optimum threshold, which is then used for segmentation.

Thresholding
The saliency map generated by the entropy based visual saliency model is a grayscale image, where pixel intensity specifies saliency relevance. The saliency map is normalized to the range 0 to 255 such that the value 255 corresponds to the most salient pixels and the value 0 to the least salient pixels. The perceptual weight of a pixel increases with the intensity of the saliency map. A binary saliency mask is generated by thresholding the grayscale saliency map as BSM(p) = 1 if SM(p) ≥ Th_o and BSM(p) = 0 otherwise, where Th_o is the optimal threshold value. BSM(p) is a pixel-level binary mask, where pixel value 1 corresponds to a salient pixel in a frame, while pixel value 0 corresponds to a non-salient pixel. The choice of the optimal threshold value Th_o is critical, as it decides the salient and non-salient regions and hence the encoding cost of the overall framework. A pixel-level accurate human-labeled ground truth was required for comparison to select the optimal threshold value. Pixel-accurate labeling of salient objects within the frame was obtained through a subjective experiment. Each frame was shown to 9 subjects, who were asked to label the salient region. A majority-voting criterion was adopted to generate a single aggregated ground-truth binary mask GTM(p) for each frame, where pixel value 1 corresponds to salient and 0 to non-salient regions. The steps involved in selecting the optimal threshold value were as follows:

1. Initialize the threshold vector Th_i with N values between Min(SM(p)) and Max(SM(p)), the minimum and maximum values of the saliency map generated by the visual saliency algorithm, where N is the number of threshold levels and i = 0, 1, ..., N − 1.
2. Initialize a vector F_m of size 1 × N, representing average F-measure values, with all zeros.
3. Calculate the thresholded saliency map TSM_i(p) of each video frame in the dataset at threshold value Th(1, i), in the same manner as BSM(p) above.
4. Calculate the F-measure between the thresholded saliency mask TSM_i(p) and the human-labeled ground-truth binary mask GTM(p) for all video frames in the dataset.
5. Compute the average F-measure and store it in the vector F_m at the ith position.
6. Repeat steps 3 to 5 for all threshold values.
7. Choose the threshold value corresponding to the index of the maximum average F-measure in F_m as the optimum threshold Th_o.
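The threshold-selection steps above can be sketched as follows, assuming uniformly spaced threshold levels between the minimum and maximum saliency values (the helper names are illustrative):

```python
import numpy as np

def f_measure(pred, gt, eps=1e-12):
    """Harmonic mean of precision and recall for binary masks."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

def optimal_threshold(saliency_maps, gt_masks, n_levels=32):
    """Sweep n_levels thresholds between the global min and max saliency
    values and return the one maximizing the mean F-measure (steps 1-7)."""
    lo = min(sm.min() for sm in saliency_maps)
    hi = max(sm.max() for sm in saliency_maps)
    thresholds = np.linspace(lo, hi, n_levels)     # step 1
    f_m = np.zeros(n_levels)                       # step 2
    for i, th in enumerate(thresholds):            # steps 3-6
        scores = [f_measure(sm >= th, gt)
                  for sm, gt in zip(saliency_maps, gt_masks)]
        f_m[i] = np.mean(scores)
    return float(thresholds[int(np.argmax(f_m))])  # step 7

sm = np.zeros((10, 10)); sm[2:5, 2:5] = 200.0  # toy saliency map
gt = sm > 100.0                                 # toy ground-truth mask
th = optimal_threshold([sm], [gt])
```

On the toy example, any threshold separating the bright block from the background gives a perfect F-measure, so the sweep recovers such a threshold.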

Perceptual Weight Map and Optimized QP Selection
The binary saliency mask generated by thresholding is divided into CTUs in a similar way as done by the HEVC reference software. CTUs are then categorized into two categories, salient and non-salient, based on their perceptual significance. The salient and non-salient pixels are quantified to mark the perceptual significance of each CTU. The percentage of salient pixels in a CTU is determined as P_s = (N_s / N) × 100, where N_s and N represent the number of salient pixels and the total number of pixels in the CTU of the binary saliency mask, respectively. The CTU based perceptual weight mask is then obtained by thresholding P_s, marking each CTU as salient or non-salient. As the proposed PVC scheme depends on the perceptual significance of each CTU, an optimized quantization parameter is required for each CTU based on its perceptual significance. CTUs that fall in the salient region attract more attention than CTUs belonging to a non-salient region. Therefore, to enhance the perceptual quality, an optimal criterion is required to assign QP values to different CTUs.
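A sketch of the CTU-level perceptual weight map; the 50% salient-pixel cut-off and the tiny 4x4 block size in the example are assumed values for illustration only (HEVC CTUs are typically 64x64):

```python
import numpy as np

def ctu_weight_map(bsm, ctu=64, ratio=0.5):
    """Mark each ctu x ctu block of the binary saliency mask as salient (1)
    when its fraction of salient pixels N_s / N exceeds `ratio`."""
    rows, cols = bsm.shape[0] // ctu, bsm.shape[1] // ctu
    wm = np.zeros((rows, cols), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            block = bsm[r * ctu:(r + 1) * ctu, c * ctu:(c + 1) * ctu]
            if block.sum() / block.size > ratio:  # P_s = N_s / N
                wm[r, c] = 1
    return wm

bsm = np.zeros((8, 8), dtype=np.uint8)
bsm[:, :4] = 1                   # left half of the frame is salient
wm = ctu_weight_map(bsm, ctu=4)  # 4x4 "CTUs" for the toy example
```

Each entry of `wm` then drives the per-CTU QP decision described next.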
Let QP_d be the default QP value for all CTUs in the frame. The CTU based perceptual weight map categorizes CTUs into salient and non-salient CTUs. The optimized QP value for non-salient CTUs is then computed as QP_o = QP_d + AF, where AF represents the QP adjustment factor for non-salient CTUs. The value of AF depends on the saliency significance of a CTU and is selected so as to minimize the perceptual distortion at the default quantization parameter QP_d. The procedure for selecting the optimum QP for non-salient CTUs, i.e., QP_o, depends on the tolerated difference in perceptual quality, ∆Q, which is the difference in average perceptual quality between using the default QP and the optimized QP. The computation procedure for selecting the optimized QP for salient and non-salient CTUs is described in Algorithm 1.
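A minimal sketch of the per-CTU QP assignment, assuming the non-salient adjustment takes the form QP_d + AF (the full selection of AF under the ∆Q tolerance, as in Algorithm 1, is omitted):

```python
def assign_qps(weight_map, qp_default, af):
    """Keep QP_d for salient CTUs; use QP_o = QP_d + AF for non-salient
    CTUs, clipped to HEVC's valid QP range of 0-51."""
    return [[qp_default if w == 1 else min(qp_default + af, 51)
             for w in row] for row in weight_map]

qps = assign_qps([[1, 0], [0, 1]], qp_default=32, af=5)
```

Non-salient CTUs receive the coarser quantizer, which is where the bitrate saving comes from.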

Experimental Results
Video content has a high impact on encoder performance; test video sequences for HEVC are therefore defined according to resolution, application domain and genre [1]. In this paper, sixteen test video sequences of classes A, B, C, D, E, F and 4K were selected for the purpose of evaluation [44]. The selected video sequences cover a variety of resolutions (4K, HD 1080p, HD 720p, WVGA and WQVGA), frame rates (24, 30, 50, 60 and 120 frames per second (fps)) and statistical features. The details of the video sequences used in this paper are presented in Table 1. The experimental results are presented in two sections. In the first set of experiments, the performance of different entropy based visual saliency models is compared. In the second set of experiments, the best entropy based visual saliency model is incorporated in the HEVC standard for the proposed PVC scheme, which is compared with the HEVC reference software and other PVC schemes in both an objective and a subjective manner.

Performance Comparison of Entropy Based Visual Saliency Models
In this set of experiments, the performance of four entropy based visual saliency algorithms (AIM [37], SSM [41], EOS [42], and FEMLT [43]) was compared in both quantitative and qualitative manners. The main aim of this comparison is to select the best entropy based visual saliency algorithm and the optimum threshold value to generate a binary saliency mask based on a human-labeled ground-truth binary mask. In this work, pixel-accurate labeling of salient and non-salient regions was adopted, as it offers a more extensive and accurate evaluation than rectangular bounding-box based labeling [45]. For pixel-level ground-truth mask construction, 9 subjects were involved. The video frames were shown to the subjects, who were instructed to precisely mark the salient objects at pixel-level accuracy. The final ground-truth mask was obtained by applying a majority-voting criterion to remove labeling inconsistency.
Precision-recall (PR) curves and the F-measure were employed as metrics for the quantitative performance comparison of the different visual saliency algorithms. Precision is the fraction of pixels marked as salient by a visual saliency model that are truly salient, whereas recall is the fraction of truly salient pixels that the model correctly marks as salient. The F-measure is the harmonic mean of precision and recall. The binary saliency maps are evaluated objectively to determine their correspondence with the human-labeled ground truth. The precision, recall and F-measure scores vary with the threshold value; the appropriate selection of the threshold value is therefore a critical issue when generating a binary saliency mask. The saliency map generated by each visual saliency algorithm was thresholded at 32 threshold values. For each threshold value, a corresponding binary mask was generated and the corresponding precision and recall were computed using the binary ground-truth mask. Figure 2 shows the PR curves of the different entropy based visual saliency algorithms (AIM, SSM, FEMLT and EOS) computed over all 16 videos in the dataset. It is evident that the AIM visual saliency algorithm gives the best PR curve except for the Johnny video. This shows that the AIM visual saliency algorithm gives higher precision and recall values for the majority of the videos. Figure 3 depicts the performance comparison of the different entropy based visual saliency models in terms of average F-measure computed over all video sequences for different threshold values. A higher F-measure value indicates better agreement of the visual saliency model with the human-labeled ground-truth binary mask. It is evident that AIM gives higher average F-measure values than SSM, FEMLT and EOS for all threshold values. Moreover, the maximum average F-measure for AIM is achieved at threshold value 9.
The average precision, recall and F-measure values of the different entropy based visual saliency algorithms, compared with the pixel-level binary ground-truth mask at Th_o = 9, are shown in Table 2. It can easily be observed that AIM achieves higher precision, recall and F-measure values than the SSM, EOS and FEMLT based visual saliency algorithms. Table 2. Performance comparison of entropy based visual saliency algorithms in terms of average precision, recall and F-measure when thresholded at Th_o = 9.

Visual Saliency Model   Precision   Recall   F-Measure
AIM [37]                0.851       0.738    0.790
SSM [41]                0.349       0.828    0.491
EOS [42]                0.357       0.785    0.490
FEMLT [43]              0.326       0.693    0.443

The qualitative comparison of the salient regions detected by the different entropy based visual saliency algorithms against the ground truth, for a representative frame from seven video sequences of classes A, B, C, D, E, F and 4K at the optimum threshold Th_o = 9, is shown in Figure 4. We observed that the AIM visual saliency algorithm gives a better binary saliency mask after thresholding than the SSM, EOS and FEMLT algorithms, when compared with the aggregated pixel-level binary ground-truth mask. The pixel-level binary ground-truth mask highlights salient and non-salient regions in the frame with white and black values, respectively. The salient pixels detected by AIM in Figure 4c coincide well with the ground-truth binary mask. Moreover, very few non-salient pixels are detected as salient. On the other hand, SSM, EOS and FEMLT only partially detect salient pixels as salient, and the majority of non-salient pixels are also detected as salient, which is evident from Figure 4d,e,f. These qualitative results are consistent with the quantitative results, as the average precision, recall and F-measure achieved by AIM are much higher than those of the SSM, EOS and FEMLT models.

Perceptual Video Coding
To verify the effectiveness of the proposed PVC framework, the AIM saliency model was incorporated into the HEVC reference software HM 16.11 [46]. The AIM model was selected because it performs better than the other entropy based visual saliency algorithms. The saliency map of each frame is thresholded using Th_o = 9 to generate a binary saliency mask, which divides the frame into salient and non-salient regions. A perceptual weight map is computed, which indicates the perceptual significance of each coding tree unit (CTU) in the frame. The saliency map of each frame is divided into CTUs as in HEVC. Experiments are performed under common test conditions with the random access (RA) configuration for quantization parameter values QP = 22, 27, 32 and 37 [47]. The performance of the proposed HEVC compliant PVC scheme is evaluated in terms of bitrate saving, computational complexity, and quality assessment using objective and subjective measures.

Bitrate Reduction and Computational Complexity
Bitrate reduction is computed to gauge the compression efficiency. The bitrate reduction between the proposed PVC scheme and the HEVC reference model is computed as ∆BR = (R_Pr − R_HM)/R_HM × 100%, where R_Pr and R_HM represent the bitrates required to encode the video using the proposed PVC scheme and the HEVC reference software, respectively. A negative value of ∆BR indicates the percentage bitrate saving achieved by the proposed scheme in comparison with HEVC. Encoding time is used to measure the computational complexity of the proposed PVC scheme in comparison with HEVC. The computational complexity is computed as ∆T = (T_Pr − T_HM)/T_HM × 100%, where T_Pr and T_HM represent the encoding times of the proposed PVC scheme and the HEVC reference software, respectively. A positive value of ∆T indicates a percentage increase in encoding time of the proposed PVC compared to the HEVC reference software. The encoding time is measured on a computer system with an Intel 3.6 GHz quad-core processor and 16 GB RAM. The proposed PVC scheme is compared with the HEVC reference software (HM 16.11) in terms of bitrate saving and encoding time, and the results are summarized in Table 3. It is evident that the proposed PVC achieves the highest bitrate saving at QP = 22. The average bitrate saving for the sixteen video sequences at QP = 22 is 10.37%, with a maximum of 20.08% for the video sequence RaceHorses; the coding complexity increased by 2.96%. At QP = 27, the average bitrate saving is 6.68%, with a maximum of 11.67% for RaceHorses; the coding complexity increased by 2.97%. At QP = 32, the average bitrate saving is 5.12%, with a maximum of 9.69% for the video sequence Jockey; the coding-complexity increase is 3.46%. At QP = 37, the average bitrate saving is 4.10%, with a maximum of 7.80% for Jockey; the coding-complexity increase is 3.99%.
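Both relative measures, ∆BR and ∆T, can be computed with one helper (the bitrate and timing numbers below are hypothetical, chosen only to illustrate the sign conventions):

```python
def delta_percent(proposed, reference):
    """Relative difference in percent: negative values indicate a saving
    (bitrate), positive values an overhead (encoding time)."""
    return (proposed - reference) / reference * 100.0

dbr = delta_percent(934.3, 1000.0)  # hypothetical bitrates in kbps
dt = delta_percent(103.3, 100.0)    # hypothetical encoding times in seconds
```

With these toy numbers the helper reports a 6.57% bitrate saving and a 3.3% encoding-time overhead.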
The proposed PVC achieves an average bitrate reduction of 6.57% as compared to the HEVC reference software. This shows a superior performance of the proposed PVC scheme when compared with HEVC reference software.

Objective and Subjective Quality Assessment
An objective evaluation of the proposed scheme was performed with two metrics: the multiscale structural similarity index (MS-SSIM) [48] and the perceptual peak signal to noise ratio (PPSNR) [49]. MS-SSIM takes into account the processing mechanism of the early vision system and applies it on multiple scales. The MS-SSIM index between original and distorted videos is computed as MS-SSIM(Orig, Dist) = [l_M(Orig, Dist)]^(α_M) ∏_{j=1}^{M} [c_j(Orig, Dist)]^(β_j) [s_j(Orig, Dist)]^(γ_j), where l_M(Orig, Dist) denotes the luminance comparison, while c_j(Orig, Dist) and s_j(Orig, Dist) represent the contrast and structure comparisons at the j-th scale of the original and distorted videos. As mentioned earlier, removing perceptual redundancy while maintaining visual quality is the primary focus of this work. The proposed PVC framework removes perceptually irrelevant information from non-salient regions, while maintaining the visual quality of salient regions. It is therefore worth measuring the PSNR of only the salient regions, where the perceived visual quality needs to be preserved. The perceptual peak signal to noise ratio has been used as an objective measure of the perceived quality [49]; it is the PSNR computed over only those pixels (x, y) for which δ_t(x, y) = 1, where δ_t(x, y) = 1 for the salient region and δ_t(x, y) = 0 for the non-salient region of the original and decoded frames V and V̂. Subjective evaluation of the proposed PVC scheme was performed through the double stimulus continuous quality scale (DSCQS) method [50]. Test and reference videos were shown to the subjects one after the other. Each subject compared the visual quality of both videos and assigned comparative scores to the test and reference videos. Figure 5a shows the test and reference video presentation structure in the subjective experiment. Video sequences were randomly ordered with respect to test and reference for different QP values. To alleviate grading fatigue from session to session, the test sessions were arranged such that the maximum test time for each subject was 25 min. Sixteen subjects (8 males and 8 females) participated in the subjective experiments.
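A sketch of a salient-region PSNR in the spirit of PPSNR, computing an ordinary PSNR over only the pixels where δ_t = 1 (the exact formulation in Reference [49] may differ; 255 is assumed as the peak value for 8-bit video):

```python
import numpy as np

def salient_psnr(orig, dec, mask, peak=255.0):
    """PSNR over only the pixels where mask == 1 (the salient region)."""
    m = mask.astype(bool)
    diff = orig.astype(np.float64)[m] - dec.astype(np.float64)[m]
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float('inf')  # identical salient regions
    return 10.0 * np.log10(peak ** 2 / mse)

orig = np.full((4, 4), 128.0)
mask = np.zeros((4, 4)); mask[:2] = 1  # top half is "salient"
p = salient_psnr(orig, orig + 5.0, mask)
```

Distortion confined to the non-salient half would leave this measure unaffected, which is exactly why it suits a saliency-driven coder.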
Display conditions and viewing distance were set according to the ITU-R subjective assessment methodology [50]. All subjects were graduate students, aged 24 to 34 years, and were not experts in video coding. For subjective voting, a quality-rating form, as shown in Figure 5b, with continuous scores from 0 to 100 was used. Scores 0 and 100 represent the worst and the best visual quality, respectively. Subjects observed the overall quality of the video sequences and inserted a mark on a grading scale. The mean opinion score (MOS) at each QP for each video sequence was computed by averaging the opinion scores of all subjects. For subjective comparison, a difference mean opinion score is computed as DMOS = MOS_Pr − MOS_HM, where MOS_Pr and MOS_HM are the mean opinion scores of the video sequences encoded by the proposed PVC and HEVC, respectively. A DMOS value close to zero shows that the perceived visual quality of the videos encoded by the proposed PVC is as good as that of the HEVC reference software. Table 4 summarizes the MS-SSIM, PPSNR and DMOS results for ten test video sequences at QP 22, 27, 32 and 37. A negative value indicates a drop in MS-SSIM. It is evident that the average drop in MS-SSIM for the sixteen videos encoded by the proposed PVC scheme is 0.367% in comparison with HEVC. Such a minute difference in MS-SSIM does not produce a noticeable visible difference. The average PPSNR difference between the proposed PVC and HEVC is 0.019, which signifies that the proposed PVC scheme preserves the visual quality in the salient regions. An average DMOS value of −0.107 is observed for the sixteen video sequences, which is not significantly different from zero. This shows that the visual quality of the proposed PVC scheme as perceived by the subjects is the same as that of the HEVC reference software, but at a lower data rate. A comparison of our proposed HEVC based PVC and HEVC in terms of bitrate and PPSNR is also shown in Figure 6.
It is evident that our proposed PVC scheme performs better than the HEVC reference software for all the video sequences used in this work. Figure 7 shows the decoded frames of the ParkScene, FourPeople, BQMall and BlowingBubbles video sequences at QP = 22 using the HEVC reference software and the proposed PVC scheme. It is evident that the proposed HEVC compliant entropy based PVC yields the same visual quality for the visually salient regions in the decoded frame as the reference HEVC encoder, but with data-rate reductions of 17.69% for ParkScene, 8.95% for FourPeople, 8.08% for BQMall and 7.63% for BlowingBubbles.
Comparing perceptual video coding schemes from the literature is challenging, as each scheme uses a different set of video sequences and quality evaluation metrics. For example, Sehwan [51] used six video sequences for evaluation and compared results with HEVC HM 16.17. Similarly, Bae [17] used six video sequences and compared results with HEVC HM 11.0. Table 5 presents a comparison in terms of bitrate reduction and DMOS for the video sequences that are common among the proposed, Sehwan [51] and Bae [17] PVC schemes. It is evident that the proposed PVC scheme achieves more bit rate reduction than the Bae [17] PVC scheme when encoding at QP = 32 and QP = 37, which shows that the proposed scheme performs well at low data rates. Similarly, the proposed PVC scheme achieves more bit rate reduction than the Sehwan [51] PVC scheme when encoding at QP = 22 and QP = 37, which shows better performance of the proposed scheme at both low and high data rates. The DMOS values of the proposed PVC scheme are close to zero for all QP values when compared with both PVC schemes. This shows that the proposed PVC scheme achieves the same perceived quality with greater bit rate saving.

Conclusion
In this paper, a new HEVC compliant PVC scheme was proposed. An information maximization based visual saliency model was utilized to identify the salient and non-salient regions in each video frame. The perceptual significance of each CTU in a frame was determined by considering the number of salient and non-salient pixels it contains. A QP value for each CTU was selected in an optimum way based on its perceptual relevance. As a result, fewer bits were assigned to non-salient CTUs in a frame. The proposed PVC scheme was incorporated in the HEVC reference implementation HM 16.11. Sixteen test video sequences belonging to classes A, B, C, D, E, F and 4K were encoded using the random access configuration. Objective and subjective evaluations were performed to measure the efficacy of the proposed PVC scheme. The proposed HEVC compliant PVC scheme achieves an average bitrate reduction of 10.37% at QP = 22 for all video sequences, while preserving the perceived visual quality. However, this performance improvement comes at the cost of a nominal increase in the computational complexity of the encoder.