Saliency-Guided Local Full-Reference Image Quality Assessment

Abstract: Research and development of image quality assessment (IQA) algorithms has been in the focus of the computer vision and image processing community for decades. The aim of IQA methods is to estimate the perceptual quality of digital images so that the estimates correlate as highly as possible with human judgements. Full-reference image quality assessment algorithms, which have full access to the distortion-free images, usually consist of two phases: local image quality estimation and pooling. Previous works have utilized visual saliency in the final pooling stage, where it serves as a weight in the weighted averaging of local image quality scores, emphasizing image regions that are salient to human observers. In contrast to this common practice, this study applies visual saliency in the computation of local image quality, based on the observation that local image quality is determined simultaneously by local image degradation and visual saliency. Experimental results on KADID-10k, TID2013, TID2008, and CSIQ show that the proposed method improves on the state-of-the-art at low computational costs.


Introduction
Image quality assessment (IQA) is a hot research topic in the image processing community, since it can be applied in a wide range of practical and important applications, such as the optimization of computer vision system parameters [1] and monitoring the quality of image displays [2]. It is also helpful in benchmarking image and video compression and denoising algorithms [3,4]. In practice, the final perceivers and users of digital images are human beings. As a consequence, the most reliable way of evaluating the perceptual quality of digital images is a subjective user study involving a group of vision experts and human observers, either in a laboratory environment [5] or in a crowdsourcing experiment [6]. Further, these subjective evaluation campaigns result in publicly available benchmark IQA databases, which can be used in objective IQA. Namely, objective IQA tries to construct mathematical algorithms that estimate the perceptual quality of digital images as consistently as possible with human judgement. To help the development of IQA methods, publicly available benchmark databases, such as TID2013 [7] and CSIQ [8], contain digital images with their mean opinion score (MOS) values, which are the averages of individual human quality ratings. Traditionally, IQA is divided into three classes [9]: full-reference (FR), no-reference (NR), and reduced-reference (RR). The main difference is the degree of access to the reference, i.e., distortion-free, images. Namely, FR-IQA algorithms have full access to the reference image, NR-IQA algorithms have no access to it, and RR-IQA methods have limited access to the reference images.
Human vision research has increased the understanding of the human visual system. During observation of an image, people tend to focus on the visually significant part of a scene. As a consequence, the content of the image is not treated equally. This is why considering visual saliency in FR-IQA has gained a lot of attention in the literature [10][11][12]. In the literature, the traditional and well-known PSNR [13] was first improved using visual saliency computation by dividing the input images into different regions and assigning weights to them according to their estimated saliency. This work was followed by a line of papers utilizing visual saliency from different features. Specifically, Zhang et al. [14] utilized the phase congruency feature [15] as a visual saliency measure to create weights for defined feature similarity metrics. Wang and Li [16] compiled local distortion maps from the reference and distorted images. To quantify visual quality, these local distortion maps were pooled together by using saliency as a weighting function. The common point of algorithms using visual saliency is that visual saliency is used as a weighting function in the comparison of local image quality maps.

Motivation and Contributions
Over the course of evolution, humans have developed the ability to pay different amounts of attention to different regions of a visual scene as it is viewed. This mechanism obviously affects human perceptual quality judgements. Previously published FR-IQA methods [14,[16][17][18] utilize visual saliency in the final pooling stage of the algorithm. To be more specific, visual saliency is used as a weight in the weighted averaging of local image quality scores, emphasizing image regions that are more salient to the human visual system. The main contribution of this paper is the following. As opposed to previously published algorithms, this work utilizes visual saliency in the computation of local image quality and not in the pooling stage, which is usual in the literature. In fact, the human visual system perceives image degradation with higher probability in visually significant regions and tends to neglect degradation in visually less significant regions. In other words, local quality degradation is simultaneously influenced by both local image degradation and visual saliency. The effect of visual saliency on the perception of local image quality is demonstrated in Figure 1. If humans observe Figure 1a, they tend to focus their attention on the child's face rather than on the vegetation in the background. Specifically, we added the same amount of noise to two different areas of the image: in Figure 1a, it was added to the region containing the vegetation in the left corner, while in Figure 1b it was added to the face. Comparing these two figures, the human visual system is less likely to perceive the degradation if it is not in a visually salient area. Based on the above-mentioned observations, a visual-saliency-guided FR-IQA framework is introduced in which visual saliency adaptively adjusts local image quality. Within this framework, the ESSIM (edge strength similarity-based image quality metric) [19] method is further developed.
Furthermore, it was demonstrated that the proposed method was able to improve state-of-the-art performance at low computational costs on four large and widely accepted IQA benchmark databases, i.e., KADID-10k [20], TID2013 [7], TID2008 [21], and CSIQ [8].
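The two uses of saliency described above can be contrasted with a toy numeric sketch. This is an illustrative example only, not any published metric; the functions `pool_weighted` and `local_saliency_adjusted` are hypothetical stand-ins for the conventional scheme (saliency weights the pooling of an already-computed local quality map) and for the viewpoint adopted here (saliency enters the local quality computation itself, attenuating degradation in non-salient regions).

```python
import numpy as np

# Toy contrast of the two schemes; both function names and formulas are
# illustrative assumptions, not a published metric.
def pool_weighted(local_q, sal):
    """Conventional use: saliency serves as a weight when averaging an
    already-computed local quality map."""
    return np.sum(sal * local_q) / np.sum(sal)

def local_saliency_adjusted(local_deg, sal):
    """Viewpoint of this paper: local quality depends jointly on local
    degradation and saliency, so degradation in non-salient regions is
    attenuated before any pooling takes place."""
    return np.mean(1.0 - sal * local_deg)
```

With a two-pixel image (one salient region, one background region) and the same amount of degradation placed in either region, both schemes assign a lower quality score when the degradation hits the salient region, mirroring the Figure 1 observation.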

Organization of the Paper
After this introduction, related and previous papers are reviewed in Section 2. Next, our proposed method is described in Section 3. Databases, evaluation metrics, and implementation details are given in Section 4. Numerical and experimental results and a comparison to the state-of-the-art are presented in Section 5. Finally, conclusions are drawn in Section 6.

Related Work
As already mentioned, FR-IQA algorithms predict the perceptual quality of distorted images with full access to the reference images. The traditional and probably the simplest methods for FR-IQA are mean squared error (MSE) and peak signal-to-noise ratio (PSNR), which rely on the measurement of the distortion energy of the images. However, their performance lags behind that of other, more sophisticated metrics, since their results are not consistent with human quality perception [22]. Hence, many FR-IQA methods have tried to build on the properties of the human visual system (HVS) to compile effective algorithms. For example, the structural similarity index (SSIM) [23] utilizes the observation that the HVS is sensitive to contrast and structural distortions. More specifically, SSIM [23] applies local sliding windows on the reference and distorted images and measures luminance, contrast, and structure similarity using predefined functions containing average, variance, and covariance computations. Finally, the perceptual quality of the distorted image is obtained by taking the arithmetic mean of the similarity values of the local sliding windows. Brunet et al. [24] investigated the mathematical properties of SSIM and pointed out that SSIM satisfies the identity of indiscernibles and is symmetric, but does not satisfy the triangle inequality. SSIM quickly became popular in the signal processing community and has attracted a significant amount of further research [25]. For example, Wang et al. [26] extended SSIM into MS-SSIM by conducting multi-scale processing. Further, Sampat et al. [27] utilized complex wavelet domains to define a new structural similarity index. In contrast, Chen et al. [28] modified SSIM to compare edge information between the reference and the distorted images. In addition to structural degradation, image gradient is also a popular feature in FR-IQA. Specifically, Liu et al. 
[29] defined an FR-IQA metric using gradient similarity between the reference and distorted images. Similarly, Zhu and Wang [30] utilized gradient similarity, but scale information was also incorporated into their metric.
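The SSIM computation described above can be illustrated on a single window. The following sketch is an illustrative reimplementation, not a reference implementation: the constants c1 and c2 use the customary values for 8-bit images, and the simplified two-term form merges the contrast and structure comparisons into one quotient.

```python
import numpy as np

def ssim_window(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM of a single window: a luminance term built from the means and a
    combined contrast/structure term built from variances and covariance."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    lum = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    cs = (2 * cov + c2) / (vx + vy + c2)
    return lum * cs
```

Averaging `ssim_window` over local sliding windows of the reference and distorted images yields the overall SSIM score, as described above.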
Motivated by the success of deep learning in many image processing tasks [31][32][33], deep learning has also appeared in FR-IQA. For example, a significant number of proposed methods compare the deep activations of convolutional neural networks (CNNs) for a reference-distorted image pair to establish an FR quality metric. For instance, Amirshahi et al. [34] extracted feature maps using an AlexNet [35] CNN. Subsequently, the extracted feature maps were compared using histogram intersection kernels [36], and a similarity value was produced for each feature map. To obtain the perceptual quality, the arithmetic mean of these similarity values was taken. Later, this approach was developed further [37] by replacing the histogram intersection kernels with traditional image similarity metrics. Another line of work has focused on training end-to-end deep architectures on large IQA benchmark databases [38][39][40]. Chubarau et al. [39] investigated vision transformers and self-attention for FR-IQA. Namely, a context-aware sampling procedure was introduced to extract patches from the reference and the distorted images. Next, the patches were encoded by a vision transformer.
Another class of FR-IQA methods uses existing and available FR-IQA metrics to construct a new metric with improved performance. For instance, Okarma [41] studied the characteristics of three different metrics and introduced a combined metric by taking the exponentiated product of the three examined metrics. In contrast, other proposals utilized optimization techniques to find optimal weights for a linear combination of already existing FR-IQA techniques [42][43][44]. Instead of optimization, Lukin et al. [45] trained a neural network from scratch using traditional FR-IQA metrics as image features to obtain improved estimation performance.

Proposed Method
In this section, our proposed method is introduced and described. First, the ESSIM [19] algorithm is briefly reviewed to provide the necessary background for our contribution, the saliency-guided determination of local image quality. Then, the saliency-guided SG-ESSIM is described in detail.

ESSIM Method
To quantify the quality degradation of a distorted image g with respect to a reference image f, the ESSIM is defined as

ESSIM(f, g) = \frac{1}{N} \sum_{i=1}^{N} \frac{2 E(f,i) E(g,i) + C}{E(f,i)^2 + E(g,i)^2 + C},   (1)

where N is the number of pixels and C is a small constant to avoid division by zero. To be more specific, C has two roles in Equation (1): as already mentioned, it is necessary to avoid a denominator equal to zero; secondly, C is also a scaling factor, since if C \to \infty, the result of the similarity measure is 1. Further, E(f, i) is used to characterize the edge strength around pixel i in the reference image f, and is determined as

E(f, i) = \max\left( E_i^{1,3}(f), E_i^{2,4}(f) \right),   (2)

where E_i^{2,4}(f) is the edge strength in the diagonal directions and is determined as

E_i^{2,4}(f) = \left| \partial f_i^2 - \partial f_i^4 \right|^p.   (3)

Similarly, E_i^{1,3}(f) stands for the edge strength in the vertical and horizontal directions and is computed as

E_i^{1,3}(f) = \left| \partial f_i^1 - \partial f_i^3 \right|^p.   (4)

In Equations (3) and (4), \partial f_i^j stands for the directional derivative at pixel i of the reference image f in direction j. The applied directions in the ESSIM are the following: 0°, 45°, 90°, and 135°, indexed by j = 1, 2, 3, 4, respectively. Further, the directional derivatives in the ESSIM are implemented with the help of 5 \times 5 Scharr operators [46]. Moreover, the exponent p is used to adapt the edge strength in the algorithm. The edge strength of the distorted image g around pixel i is defined as

E(g, i) = \begin{cases} E_i^{1,3}(g), & \text{if } E_i^{1,3}(f) \ge E_i^{2,4}(f), \\ E_i^{2,4}(g), & \text{otherwise,} \end{cases}   (5)

in order to guarantee that the edge strengths of f and g are compared in the same direction.
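The computation above can be sketched in a few lines. The sketch below is an illustrative reimplementation under simplifying assumptions, not the authors' MATLAB code: 3 × 3 Scharr kernels stand in for the 5 × 5 operators of [46], the 45°/135° derivatives are approximated by rotated combinations of the kernel pair, and the values p = 1 and C = 1e-4 are arbitrary illustrative choices.

```python
import numpy as np
from scipy.ndimage import convolve

# 3x3 Scharr kernels as a stand-in for the paper's 5x5 operators [46].
SCHARR_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]]) / 32.0
SCHARR_Y = SCHARR_X.T

def edge_strength(img, p=1.0):
    """Edge strength maps for the horizontal/vertical (1,3) and diagonal
    (2,4) direction pairs, following Equations (2)-(4)."""
    d1 = convolve(img, SCHARR_X)                 # 0-degree derivative
    d3 = convolve(img, SCHARR_Y)                 # 90-degree derivative
    k45 = (SCHARR_X + SCHARR_Y) / np.sqrt(2)     # 45-degree approximation
    k135 = (SCHARR_Y - SCHARR_X) / np.sqrt(2)    # 135-degree approximation
    d2 = convolve(img, k45)
    d4 = convolve(img, k135)
    return np.abs(d1 - d3) ** p, np.abs(d2 - d4) ** p

def essim(f, g, C=1e-4, p=1.0):
    """ESSIM score per Equations (1)-(5); the direction of the comparison
    is selected by the reference image, as required by Equation (5)."""
    e13_f, e24_f = edge_strength(f, p)
    e13_g, e24_g = edge_strength(g, p)
    use13 = e13_f >= e24_f
    Ef = np.where(use13, e13_f, e24_f)              # Equation (2)
    Eg = np.where(use13, e13_g, e24_g)              # Equation (5)
    s = (2 * Ef * Eg + C) / (Ef ** 2 + Eg ** 2 + C)  # local similarity map
    return s.mean()                                  # Equation (1)
```

By construction, identical images yield a score of exactly 1, and any degradation of the edge strengths drives the score below 1, since 2ab ≤ a² + b².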

SG-ESSIM Method
As one can observe from the derivation of the ESSIM [19], the impact of visual saliency is not incorporated into this metric. In this section, we describe how we enhanced the ESSIM by using visual saliency in the measurement of local image quality. In this way, local image quality degradation is jointly characterized by the objective degradation of edge strength and by visual significance.
Our algorithm first determines the edge strength maps of the reference and distorted images, as recommended by the authors of the ESSIM [19]. Subsequently, it calculates the saliency-guided local image quality map. Finally, it pools the local image quality map to obtain the overall perceptual quality of the distorted image.
Using Equations (2) and (5), which describe E(f, i) and E(g, i) in the ESSIM, our visual saliency-guided local similarity map is defined as

S_{SG}(i) = \frac{2 E(f,i) E(g,i) + H(V(i))}{E(f,i)^2 + E(g,i)^2 + H(V(i))},   (6)

where V(i) stands for the visual saliency measure at pixel location i. Moreover, H(\cdot) is a decreasing function defined as

H(v) = K \cdot e^{-h \cdot v},   (7)

where e is Euler's number, K is a scaling factor, and h is an attenuation factor. H(\cdot) has to be a decreasing function because it plays the role of the constant C in Equation (1): a large H(V(i)) pulls the local similarity towards 1, so degradation in less salient regions is attenuated, while salient regions remain sensitive to degradation. To be more specific, V(i) takes the edge strength maps of the reference and distorted images and returns their pixel-wise maximum:

V(i) = \max\left( E(f,i), E(g,i) \right).   (8)

Equation (8) implies that regions with strong edges are more salient to the human visual system than those with weak edges, since edge information conveys essential information about the visual scene to humans [47]. A larger value of V(i) = \max(E(f,i), E(g,i)) indicates a higher visual saliency at pixel location i. Moreover, the computation of V(i) does not result in a significant increase in the computational costs, since E(f,i) and E(g,i) are determined anyway in the ESSIM [19]. Substituting Equations (7) and (8) into Equation (6), we obtain the following equation for the visual saliency-guided local similarity map:

S_{SG}(i) = \frac{2 E(f,i) E(g,i) + K \cdot e^{-h \cdot \max(E(f,i), E(g,i))}}{E(f,i)^2 + E(g,i)^2 + K \cdot e^{-h \cdot \max(E(f,i), E(g,i))}}.   (9)

Finally, the overall predicted perceptual quality score of the image is estimated by averaging the visual saliency-guided (SG) local similarity map:

SG\text{-}ESSIM(f, g) = \frac{1}{N} \sum_{i=1}^{N} S_{SG}(i),   (10)

where N is the total number of pixels. The hyperparameters of the SG-ESSIM that have to be set are K and h. These parameters were determined on a subset of the TID2008 [21] database, which contained 5 random reference images and the 340 corresponding distorted images. Specifically, those hyperparameters were chosen for which the highest Spearman's rank order correlation coefficient between the ground truth and the predicted values was obtained. As a result, we chose K = 51,000 and h = 0.5 in our MATLAB implementation.
Since the SG-ESSIM was not a machine-learning-based approach, this random subset was also used in the evaluation process.
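Given the edge strength maps E(f, i) and E(g, i) already computed by the ESSIM, the saliency-guided local similarity map and the final averaging reduce to a few array operations. The sketch below is an illustrative reimplementation, not the published MATLAB code; it operates directly on precomputed edge strength maps and uses the reported K = 51,000 and h = 0.5 as defaults.

```python
import numpy as np

def sg_local_similarity(Ef, Eg, K=51000.0, h=0.5):
    """Saliency-guided local similarity map (Equation (9)): the constant C
    of the ESSIM is replaced by H(V(i)) = K * exp(-h * V(i)), where
    V(i) = max(E(f,i), E(g,i)) per Equation (8)."""
    V = np.maximum(Ef, Eg)        # Equation (8): pixel-wise saliency proxy
    H = K * np.exp(-h * V)        # Equation (7): decreasing in saliency
    return (2 * Ef * Eg + H) / (Ef ** 2 + Eg ** 2 + H)

def sg_essim(Ef, Eg, K=51000.0, h=0.5):
    """Overall score: the average of the local similarity map."""
    return sg_local_similarity(Ef, Eg, K, h).mean()
```

Note the effect of H: for the same relative change in edge strength, a strong (salient) edge yields a lower local similarity than a weak edge, where the large H(V(i)) pulls the value towards 1 and the degradation is largely neglected.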

Materials
In this section, a description of the applied benchmark IQA databases is given first. Second, the applied evaluation metrics and protocol are presented. Finally, the implementation details of the proposed method are given.

Databases
In this study, four large publicly available IQA benchmark databases (KADID-10k [20], TID2013 [7], TID2008 [21], and CSIQ [8]) were used to evaluate the proposed method and compare it to the state-of-the-art. The most important information about the applied databases is summarized in Table 1. Namely, these databases contain a small set of distortion-free reference images from which distorted images were derived using different distortion types at different distortion levels. Moreover, the distorted images are annotated with subjective quality ratings. Therefore, they are suitable and widely accepted for the evaluation and ranking of FR-IQA metrics in the literature [48]. Figure 2 depicts the empirical MOS distributions of the applied databases.

Evaluation Metrics and Protocol
The ranking of FR-IQA methods relies on measuring the correlation between the predicted and ground truth quality scores of benchmark IQA databases, such as KADID-10k [20]. To be more specific, the performance of an FR-IQA metric was evaluated using three different criteria in this study: Pearson's linear correlation coefficient (PLCC), Spearman's rank order correlation coefficient (SROCC), and Kendall's rank order correlation coefficient (KROCC). In addition, a nonlinear logistic regression was applied before the calculation of the PLCC, as recommended in [49], because a nonlinear relationship exists between the ground truth and predicted scores:

Q = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\beta_2 (Q_p - \beta_3)}} \right) + \beta_4 Q_p + \beta_5,   (11)

where Q and Q_p are the fitted and predicted scores, respectively. In addition, the regression described by Equation (11) is determined by the \beta_i (i \in \{1, 2, \ldots, 5\}) parameters.
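The evaluation protocol can be sketched as follows. This is an illustrative implementation only: the starting values for the \beta_i parameters are ad hoc assumptions, and scipy's `curve_fit` stands in for whatever solver the authors used; the SROCC and KROCC are rank based and therefore unaffected by the monotone regression.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr, kendalltau

def logistic5(qp, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping of Equation (11)."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (qp - b3)))) + b4 * qp + b5

def evaluate(mos, pred):
    """PLCC after the nonlinear mapping of Equation (11); SROCC and KROCC
    are computed directly on the raw predictions."""
    # Heuristic starting point for the regression parameters (assumption).
    p0 = [np.max(mos), 1.0, np.mean(pred), 1.0, np.mean(mos)]
    beta, _ = curve_fit(logistic5, pred, mos, p0=p0, maxfev=10000)
    fitted = logistic5(pred, *beta)
    plcc = pearsonr(fitted, mos)[0]
    srocc = spearmanr(pred, mos)[0]
    krocc = kendalltau(pred, mos)[0]
    return plcc, srocc, krocc
```

For a strictly monotone (but nonlinear) prediction-to-MOS relationship, the SROCC and KROCC equal 1, while the PLCC approaches 1 only after the logistic mapping has linearized the relationship.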

Implementation Details
MATLAB R2021a was used to implement the proposed FR-IQA metric using the functions of the Image Processing Toolbox. The computer configuration applied in our experiments is outlined in Table 2.
The comparison, in terms of the PLCC, SROCC, and KROCC, on KADID-10k [20], TID2013 [7], TID2008 [21], and CSIQ [8] is summarized in Tables 3 and 4. From these experimental, numerical results, it can be seen that the proposed SG-ESSIM provided the best results in terms of the SROCC and KROCC on KADID-10k [20] and in terms of the PLCC on TID2013 [7]. Moreover, it gave the best results for all correlation values on TID2008 [21]. On CSIQ [8], the second best results in terms of the SROCC and KROCC were achieved by the proposed method. To illustrate the overall performance of the examined methods on these databases, direct and weighted average (using the numbers of images as weights) performance values are summarized in Table 5. It can be seen that the proposed method delivered the second best results in terms of the SROCC and KROCC, both in direct and weighted averages. Tables 6 and 7 present detailed results on TID2013 [7] and TID2008 [21] with respect to the different distortion types. Similarly, Tables 8 and 9 illustrate the detailed results with respect to the distortion levels. TID2013 [7] contains 24 distortion types, among them SSR (sparse sampling and reconstruction), while TID2008 [21] consists of the first 17 distortion types of TID2013 [7]. From the presented results, it can be observed that the SG-ESSIM produced the best SROCC values on five distortion types of TID2013 [7] and on four distortion types of TID2008 [21]. Furthermore, it provided the second best results on nine distortion types of TID2013 [7] and on seven distortion types of TID2008 [21]. Moreover, the SG-ESSIM produced the best SROCC values on three distortion levels of TID2013 [7], while it gave the second best results on the remaining two distortion levels. On TID2008 [21], it provided the best results on three distortion levels and the second best result on the remaining distortion level.

Table 3. Comparison of the SG-ESSIM to several other state-of-the-art algorithms on KADID-10k [20] and TID2013 [7]. The highest values are typed in bold, while the second highest ones are underlined.

In Figure 3, the execution time (logarithmic scale) versus SROCC scatter plot measured on KADID-10k [20] is depicted. From this figure, it can be seen that the proposed SG-ESSIM method was the fourth fastest algorithm out of the thirteen examined ones. Considering execution time and estimation performance together, the SG-ESSIM provides a competitive result against the state-of-the-art. Similarly, Figure 4 depicts the execution time versus the SROCC on the CSIQ [8] database. Here, we can also observe that, considering execution time and performance together, the proposed method achieves a competitive result.

Conclusions
The goal of FR-IQA is to predict the perceptual quality of digital images with full access to the distortion-free, reference images. FR-IQA methods usually contain two stages: local image quality estimation and pooling. Traditionally, visual saliency is utilized in the pooling stage as a weight for the weighted averaging of local image quality. Unlike this traditional approach, we applied visual saliency in the computation of local image quality, motivated by the fact that local image quality is determined simultaneously by both local image degradation and visual saliency. Experimental results on four publicly available benchmark IQA databases showed that the proposed algorithm was able to outperform the state-of-the-art and was characterized by low computational costs. Future work could involve the incorporation of local and global saliency into a novel FR-IQA metric. The source code of the proposed method is available at: https://github.com/Skythianos/SG-ESSIM (accessed on 12 June 2022).