A Human Visual System Inspired No-Reference Image Quality Assessment Method Based on Local Feature Descriptors

Objective quality assessment of natural images plays a key role in many fields related to imaging and sensor technology. This paper introduces a quality-aware feature extraction method for no-reference image quality assessment (NR-IQA). Specifically, a sequence of human visual system (HVS) inspired filters is applied to the color channels of an input image to enhance those statistical regularities in the image to which the HVS is sensitive. From the obtained feature maps, the statistics of a wide range of local feature descriptors are extracted to compile quality-aware features, since these descriptors treat images from the HVS's point of view. The proposed method was compared to 16 state-of-the-art NR-IQA techniques on five large benchmark databases, i.e., CLIVE, KonIQ-10k, SPAQ, TID2013, and KADID-10k, and it was superior to them in terms of three different performance indices.


Introduction
With the continuous development of imaging systems, the demand for innovative, objective image quality assessment (IQA) methods is growing. Since digital images are subject to a variety of distortions and noise types during image acquisition [1], compression [2], reconstruction [3], and enhancement [4], image quality assessment also has many practical applications [5] in medical imaging [6], remote sensing imaging [7], monitoring the quality of streaming applications [8], or benchmarking image processing algorithms [9] under different distortions. Thus, objective IQA has been a subject of intensive research in the image processing community, with the aim of replacing the subjective quality evaluation of digital images, which is a time-consuming, expensive, and laborious process [10].
In the literature, objective IQA measures are traditionally divided into three branches [11], namely full-reference (FR), reduced-reference (RR), and no-reference (NR) IQA, with respect to the availability of the distortion-free images, which are very often referred to as reference images. As the terminology implies, FR techniques evaluate the perceptual quality of a distorted image with full access to its reference image, while NR algorithms cannot rely on reference images. For RR methods, partial information about the reference images is available.

Contributions
The contributions of this work are as follows. Although the deep learning paradigm dominates the field of objective IQA [12,13], methods which simulate the sensitivity of the HVS to statistical regularities and structures are also a hot research topic in the literature [14,15]. In our previous work [16], it was empirically corroborated that the statistics of local feature descriptors are quality-aware features. Here, we systematically use the statistics of local feature descriptors to compile a powerful feature vector for NR-IQA. Specifically, multiple HVS inspired filters are applied to the color channels of an input image to generate feature maps where the HVS sensitive statistical regularities are emphasized. Next, the statistics of local feature descriptors, such as KAZE [17] or BRISK [18], are extracted from these feature maps to compile quality-aware features.

Related Work
NR-IQA algorithms can be classified into learning-free and learning-based categories. As the name indicates, learning-based methods rely on various machine and/or deep learning techniques to construct a model for perceptual quality estimation. Learning-free methods can be further divided into two groups, i.e., spatial [26] and spectral domain [27] based approaches. A common method [28] for score prediction involves fitting a portion of the training data to the joint distribution of the feature vector and the related opinion scores. Given the test data feature vector, the score prediction in this instance entails maximizing the likelihood of the test data opinion score. Other methods [29] that are both opinion- and distortion-unaware measure the separation in sparse feature space between the reference and distorted images. In contrast, Leonardi et al. [30] elaborated an opinion-unaware method that exploits the activation maps of pretrained convolutional neural networks [31] by considering the correlations between feature maps.
Classical machine learning based algorithms typically rely on natural scene statistics (NSS) [32] and support vector regressors (SVR) [33]. According to our understanding, the evolution of the human visual system (HVS) has been driven by natural selection. Therefore, the HVS has assimilated comprehensive knowledge about the regularities of our natural environment. In addition, researchers have pointed out [34] that certain image structure regularities deteriorate in the presence of noise and that the deviation from them can be exploited for image quality evaluation. Classical methods utilizing NSS include, for example, BLIINDS-II [35], BRISQUE [36], CurveletQA [37], and DIIVINE [38]. Specifically, Saad et al. [35] constructed an NSS model from the discrete cosine transform (DCT) coefficients of an image and fitted generalized Gaussian distributions to the coefficients to obtain their shape parameters, which were used as quality-aware attributes and mapped onto quality scores with a trained SVR. In contrast, Jenadeleh and Moghaddam [39] proposed a Wakeby distribution statistical model to extract quality-aware features. Moorthy et al. [38] utilized steerable pyramid decomposition [40], an overcomplete wavelet transform, across multiple orientations and scales. In contrast, Mittal et al. [36] constructed the NSS model in the spatial domain. To be more specific, quality-aware features were derived from locally normalized luminance coefficients and mapped onto perceptual quality with a trained SVR. Liu et al. [37] proposed a two-stage framework incorporating a distortion classification step and a quality prediction step. Further, quality-aware features were derived from the curvelet representation of the input image. Specifically, the authors empirically showed that the coordinates of the maxima in the log-histograms of the curvelet coefficients, the energy distributions, and the scale are good predictors of perceptual image quality.
Based on the observation that image distortions can significantly modify the shapes of objects present in the image, Bagade et al. [41] introduced shape adaptive wavelet features for NR-IQA. In contrast, Jenadeleh et al. [42] boosted already existing NR-IQA features with features proposed for image aesthetics assessment. Further, they demonstrated that aesthetic aware features are able to increase the performance of perceptual quality estimation.
Recently, deep neural networks, particularly convolutional neural networks (CNN), have gained a significant amount of attention in the literature due to their improved performance in many fields [43][44][45][46] compared to other approaches and paradigms. In NR-IQA, Kang et al. [47] were the first to apply a CNN successfully. Namely, the authors implemented a traditional CNN which accepts image patches of size 32 × 32 and predicts the patches' quality independently from each other. The entire image's perceptual quality was obtained by taking the arithmetic mean of the patches' quality scores. Similar to [47], Kim and Lee [48] trained a CNN on image patches, but the patches' target quality scores were determined by a traditional FR-IQA metric, which restricts this method to the evaluation of artificially distorted images. Bare et al. [49] developed a network that operates on image patches similar to [47], but the patches' target scores are calculated from a traditional FR-IQA metric (feature similarity index [50]) similar to [48]. On the whole, the entire image's perceptual quality is estimated from the predicted feature similarity index [50] scores of the image patches. In contrast, Conde et al. [51] took a CNN backbone network and trained it using a loss function [52] which aims to minimize the mean squared error and maximize the linear correlation coefficient between the predicted and ground-truth quality scores. Further, the authors applied several data augmentation techniques, such as horizontal flips, vertical flips, rotations, and random cropping. To handle images with different aspect ratios, Ke et al. [53] introduced a transformer [54] based NR-IQA model which applied a hash-based 2D absolute position encoding for embedding image patches extracted from multiple scales. In contrast, Zhu et al. [55] embedded the input images' original aspect ratios into the self-attention module of a Swin transformer [56]. Sun et al. [57] introduced the distortion graph representation framework, which contains a distortion type discrimination network aiming to discriminate between distortion types and a fuzzy prediction network for perceptual quality estimation. Liu et al. [58] introduced lifelong learning for NR-IQA to learn new distortion types without access to previous training data. First, the authors utilized a split-and-merge distillation strategy for compiling a single-head regression network. In the split phase, a distortion-specific generator was implemented for generating pseudo-features for unseen distortions. In the merge phase, these pseudo-features were coupled with pseudo-labels to distill knowledge about distortions.
A general, in-depth overview of the field of NR-IQA is out of the scope of this paper. For more details, we refer to the PhD thesis of Jenadeleh [59] and the book of Xu et al. [60]. Besides natural images, there are other modalities whose no-reference quality assessment is also investigated in the literature, such as stereoscopic images [61], light field images [62], or virtual reality [63].

Materials
In this part of the paper, the applied IQA benchmark databases and the evaluation protocol are discussed in detail.

Applied IQA Benchmark Databases
In this paper, we applied five publicly available IQA benchmark databases, namely CLIVE [21], KonIQ-10k [22], SPAQ [23], TID2013 [24], and KADID-10k [25], to evaluate and compare our proposed methods to the state-of-the-art. Specifically, CLIVE [21], KonIQ-10k [22], and SPAQ [23] contain unique quality labeled images with authentic distortions. The quality ratings were collected in a crowdsourcing experiment [64][65][66] for CLIVE [21] and KonIQ-10k [22], while the quality ratings were obtained in a traditional laboratory environment for SPAQ [23]. Further, CLIVE [21] and KonIQ-10k [22] contain images with fixed resolution. On the other hand, there is no fixed resolution in SPAQ [23], but the images have high resolution which varies around 4000 × 4000. In contrast to CLIVE [21], KonIQ-10k [22], and SPAQ [23], TID2013 [24] and KADID-10k [25] are built from 25 and 81 reference images, respectively, whose perceptual quality is considered perfect. The quality labeled distorted images were produced artificially from the reference images by an image processing tool using 24 and 25 different distortion types, respectively (e.g., JPEG compression noise, salt & pepper noise, Gaussian blur), at multiple distortion levels. The main properties of the applied IQA benchmark databases are summarized in Table 1. Further, the empirical distributions of quality scores are depicted in Figure 1.

Evaluation Protocol and Metrics
The assessment of NR-IQA algorithms involves measuring the strength of the correlation between the predicted scores and the ground-truth scores of an IQA benchmark database. As is common in the literature, about 80% of the images were used for training and the remaining 20% were used for testing in our experiments. Further, databases with artificial distortions were divided into training and test sets with respect to the reference images to prevent semantic content overlap between these two sets.
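The reference-wise split for artificially distorted databases can be illustrated with a minimal sketch. It assumes that each distorted image is labeled with the identifier of its reference image; the function name `split_by_reference`, the file names, and the toy data are hypothetical and serve illustration only.

```python
import random


def split_by_reference(image_to_ref, train_fraction=0.8, seed=0):
    """Split distorted images so that no reference image feeds both sets.

    image_to_ref maps a distorted image name to its reference identifier.
    """
    refs = sorted(set(image_to_ref.values()))
    rng = random.Random(seed)
    rng.shuffle(refs)
    n_train = round(train_fraction * len(refs))
    train_refs = set(refs[:n_train])
    train = [img for img, ref in image_to_ref.items() if ref in train_refs]
    test = [img for img, ref in image_to_ref.items() if ref not in train_refs]
    return train, test


# Toy data (hypothetical): 10 references with 3 distorted versions each.
toy = {f"ref{r:02d}_dist{d}.png": f"ref{r:02d}"
       for r in range(10) for d in range(3)}
train_images, test_images = split_by_reference(toy, seed=42)
```

Splitting on reference identifiers rather than on individual images is what keeps the semantic content of the test set unseen during training.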
In this paper, the medians of the Pearson linear correlation coefficient (PLCC), Spearman rank order correlation coefficient (SROCC), and Kendall rank order correlation coefficient (KROCC), which were measured over 100 random train-test splits, are given to characterize the performance of the proposed method and the other examined state-of-the-art methods. However, there is a non-linear relationship between the predicted and the ground-truth scores. For this reason, a non-linear logistic regression was applied before the computation of PLCC as advised by [67]:
$$Q_f = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\beta_2 (Q_p - \beta_3)}} \right) + \beta_4 Q_p + \beta_5,$$
where $Q_f$ and $Q_p$ stand for the fitted and predicted score, respectively. Further, the regression parameters are denoted by $\beta_i$ ($i = 1, \ldots, 5$).
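For illustration, the three correlation indices and a five-parameter logistic mapping can be sketched in plain Python. This is a simplified sketch under stated assumptions: the function names are hypothetical, KROCC is computed as Kendall's tau-a without tie correction, the logistic form is the one commonly used in the IQA literature, and the fitting of its parameters (a non-linear least squares problem) is omitted.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srocc(x, y):
    """Spearman rank order correlation: PLCC computed on the ranks."""
    return plcc(ranks(x), ranks(y))


def krocc(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def logistic5(q, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping applied to a predicted score q."""
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (q - b3)))) + b4 * q + b5


# Toy scores (hypothetical), not taken from any benchmark database.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
ground_truth = [1.2, 1.9, 3.5, 3.4, 5.1]
toy_srocc = srocc(predicted, ground_truth)
toy_krocc = krocc(predicted, ground_truth)
```

In a full evaluation, the logistic parameters would be fitted on the test predictions first, and PLCC would then be computed between the mapped predictions and the ground-truth scores.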

Proposed Method
The high-level overview of the proposed NR-IQA algorithm is depicted in Figure 2. It can be observed that the proposed method is built upon two distinct steps. In the first, training step, local features are extracted from a database of quality labeled, training images. Next, a regression model is trained based on them to obtain a quality model. This model is used to estimate the perceptual quality of a previously unseen image in the testing step.
Image distortions influence the human visual system's (HVS) sensitivity to local image structures, such as edges or texture elements [68]. For that reason, many NR-IQA methods have been proposed [69][70][71] in the literature to compile a quality model from them. However, edge information or local binary patterns are not always able to provide a powerful feature representation for NR-IQA. Therefore, in this study, the application of local feature descriptors and HVS-inspired filters is investigated thoroughly to extract quality-aware local features and compile a powerful feature representation for NR-IQA. Figure 3 depicts the general process of local quality-aware feature extraction. First, the input RGB image is filtered by a set of HVS inspired filters to create feature maps. On these maps, local keypoints are detected by local feature descriptors (such as SURF [72]). Finally, feature extraction is carried out from the neighborhoods of the detected keypoints.
Figure 1. The empirical distributions of quality scores in the applied IQA databases. (a) CLIVE [21], (b) KonIQ-10k [22], (c) SPAQ [23], (d) TID2013 [24], (e) KADID-10k [25].
In the proposed method, an input RGB image is converted into the YCbCr color space, since the luma component is separated from the chroma information in YCbCr. The conversion from RGB to YCbCr was carried out using the following equation [73]: where R, G, and B stand for the red, green, and blue color channels, respectively. Subsequently, a color channel $C_i$ is filtered using different HVS-inspired filters to obtain multiple feature maps. Since local feature descriptors treat images from the HVS's point of view, their statistics are able to provide quality-aware features [16]. Further, the applied HVS inspired filters emphasize those statistical regularities of a natural scene which are highly sensitive to image distortions from the perspective of the HVS. Specifically, 5 statistical features are derived from each filtered color channel using the statistics of different local feature descriptors. In Sections 3.3-3.5, the compilation of the HVS inspired feature maps is described.
Next, the proposed quality-aware feature extraction from the feature maps is described in Section 3.6.
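The color space conversion step can be sketched as follows. The coefficients below are those of the full-range ITU-R BT.601 (JFIF) conversion and are an assumption made for illustration; they are not necessarily the coefficients of the equation from [73], and implementations that use the studio range differ in scaling and offsets. The function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601, assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_channels(rgb_image):
    """Turn rows of (R, G, B) tuples into separate Y, Cb, and Cr planes."""
    y_plane, cb_plane, cr_plane = [], [], []
    for row in rgb_image:
        converted = [rgb_to_ycbcr(*pixel) for pixel in row]
        y_plane.append([p[0] for p in converted])
        cb_plane.append([p[1] for p in converted])
        cr_plane.append([p[2] for p in converted])
    return y_plane, cb_plane, cr_plane


# A 1 x 2 toy image: one white pixel and one mid-gray pixel.
y_ch, cb_ch, cr_ch = split_channels([[(255, 255, 255), (100, 100, 100)]])
```

For achromatic pixels the chroma planes stay at the neutral value of 128, so all luminance information ends up in the Y plane.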

Bilaplacian Feature Maps
First, Bilaplacian feature maps were obtained using Bilaplacian filters. In [74], Gerhard et al. demonstrated that the HVS is highly adapted to the statistical regularities of images. Further, zero-crossings [75] in an image occur where the gradient starts increasing or decreasing and help the HVS in interpreting the image. Using the idea of zero-crossings, Ghosh et al. [76] pointed out that the behaviour of the extended classical receptive field of retinal ganglion cells can be modeled as a combination of three zero-mean Gaussians at three different scales, which is equivalent to the Bilaplacian of the Gaussian filter [77]. The Laplacian $L(x, y)$ of an image $I(x, y)$ can be expressed as:
$$L(x, y) = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}.$$
Since a digital image is represented as a set of discrete pixels, discrete convolution kernels are used to approximate the Laplacian. In this paper, the following kernels are considered: Bilaplacian kernels are obtained by convolving two Laplacian kernels:
$$L^2_{ij} = L_i * L_j,$$
where $*$ stands for the convolution operator. In our study, the $L^2_{11}$, $L^2_{22}$, $L^2_{33}$, $L^2_{44}$, $L^2_{55}$, $L^2_{13}$, and $L^2_{24}$ Bilaplacian kernels were considered. Subsequently, a set of feature maps is derived from the input image by filtering the Y, Cb, and Cr channels with the Bilaplacian kernels. To be more specific, 3 × 7 = 21 feature maps are obtained by filtering 3 color channels (Y, Cb, Cr) with 7 filters ($L^2_{11}$, $L^2_{22}$, $L^2_{33}$, $L^2_{44}$, $L^2_{55}$, $L^2_{13}$, $L^2_{24}$).
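The construction of a Bilaplacian kernel by convolving two Laplacian kernels can be illustrated with a short sketch. The two 3 × 3 kernels used here are the common 4-neighbor and 8-neighbor discrete Laplacians; they are assumptions chosen for illustration and are not necessarily among the five kernels considered in this paper.

```python
def convolve_full(a, b):
    """Full 2D discrete convolution of two small kernels (nested lists)."""
    ha, wa, hb, wb = len(a), len(a[0]), len(b), len(b[0])
    out = [[0] * (wa + wb - 1) for _ in range(ha + hb - 1)]
    for i in range(ha):
        for j in range(wa):
            for k in range(hb):
                for m in range(wb):
                    out[i + k][j + m] += a[i][j] * b[k][m]
    return out


# Two common discrete Laplacian approximations (illustrative assumptions).
lap_a = [[0, 1, 0],
         [1, -4, 1],
         [0, 1, 0]]
lap_b = [[1, 1, 1],
         [1, -8, 1],
         [1, 1, 1]]

bilap_aa = convolve_full(lap_a, lap_a)  # same kernel twice
bilap_ab = convolve_full(lap_a, lap_b)  # mixed pair of kernels
```

Convolving two 3 × 3 kernels yields a 5 × 5 kernel, and because each Laplacian kernel sums to zero, so does every Bilaplacian kernel; flat image regions therefore produce a zero response.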

High-Boost Feature Maps
High-boost filtering is used to enhance high-frequency image regions, to which the HVS is also sensitive [78]. Similarly to the previous subsection, high-boost filtering is applied to the Y, Cb, and Cr color channels to strengthen high-frequency information. In this case, the convolution kernel is the following: where C is a constant value which controls the enhancement difference between a pixel location and its neighborhood. In our study, C = 1 was used. However, image distortions can occur at different scales. Therefore, each color channel was filtered 4 times in succession to obtain four feature maps from one channel. Since we have three channels, 3 × 4 = 12 feature maps were extracted in total by applying high-boost filtering.
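Filtering a channel several times in succession can be sketched as follows. The kernel form used here (center weight 8C + 1, all eight neighbors weighted by -C) is a common textbook high-boost kernel and is an assumption; the border handling by pixel replication and all function names are likewise illustrative choices.

```python
def filter_same(image, kernel):
    """Filter an image with an odd-sized kernel, replicating border pixels.

    This computes a correlation, which equals convolution for the
    symmetric kernel used below.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy = min(max(y + i - ph, 0), h - 1)
                    xx = min(max(x + j - pw, 0), w - 1)
                    acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def high_boost_kernel(c=1.0):
    """Assumed 3 x 3 high-boost kernel; its weights sum to one."""
    return [[-c, -c, -c],
            [-c, 8 * c + 1, -c],
            [-c, -c, -c]]


def high_boost_maps(channel, c=1.0, repeats=4):
    """Filter a channel `repeats` times in succession, keeping every result."""
    kernel = high_boost_kernel(c)
    maps, current = [], channel
    for _ in range(repeats):
        current = filter_same(current, kernel)
        maps.append(current)
    return maps


# Toy channel with a vertical step edge between columns 1 and 2.
edge = [[0.0, 0.0, 10.0, 10.0] for _ in range(4)]
edge_maps = high_boost_maps(edge)
flat_maps = high_boost_maps([[5.0] * 4 for _ in range(4)])
```

Because the assumed kernel sums to one, flat regions pass through unchanged, while each repeated pass further amplifies the response around edges.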

Derivative Feature Maps
In [74], Gerhard et al. concluded that the HVS is biased towards processing natural images and that it has extensive knowledge of the statistical regularities in images. In [79], Li et al. demonstrated that derivatives and higher order derivatives are related to different statistical regularities of a natural scene. Therefore, they are good features for NR-IQA. For instance, higher order derivatives may be able to capture detailed discriminative information, while first order derivative information is typically related to the slope and elasticity of a surface. Second order derivatives are intended to capture the geometric qualities associated with curvature [80]. Motivated by these previous works, we used the following convolution of two derivative kernels to filter the Y, Cb, and Cr color channels of an input image: Similarly, we can define the $D_2$, $D_3$, $D_4$, and $D_5$ masks for the 5 × 5, 7 × 7, 11 × 11, and 13 × 13 sizes. Finally, the derivative feature maps are obtained by filtering the Y, Cb, and Cr color channels with $D_1$, $D_2$, $D_3$, $D_4$, and $D_5$. As a result, 3 × 5 = 15 derivative feature maps were obtained.

Feature Extraction
As one can see from the previous subsections, 21 Bilaplacian feature maps, 12 high-boost feature maps, and 15 derivative feature maps were generated, which means 21 + 12 + 15 = 48 feature maps in total. In the feature extraction step, 7 × N keypoints are detected in each feature map using 7 different local keypoint detectors, i.e., SURF (speeded up robust features) [72], FAST (features from accelerated segment test) [81], BRISK (binary robust invariant scalable keypoints) [18], KAZE (Japanese word that means wind) [17], ORB (oriented FAST and rotated binary robust independent elementary features) [82], Harris [83], and minimum eigenvalue [84]. For each keypoint, its M × M rectangular neighborhood with the keypoint's location as center point is taken. Further, each M × M rectangular block in each Bilaplacian, high-boost, and derivative feature map is characterized by the mean, median, standard deviation, skewness, and kurtosis of the grayscale values found in the involved block. The skewness of a set of n elements is determined as
$$s = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \right)^3},$$
where $\bar{x}$ is the arithmetic mean of all $x_i$ elements. Similarly, the kurtosis can be given as
$$k = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2}.$$
Using the statistics of the rectangular blocks, a given feature map is characterized by the arithmetic mean of all blocks' statistics. As a result, a 3 × 7 × 7 × 5 = 735 dimensional feature vector is obtained using Bilaplacian filters and the statistics of local feature descriptors, since 3 color channels were filtered with 7 Bilaplacian kernels and 7 different feature descriptors with 5 different statistics were applied. Further, a 3 × 4 × 7 × 5 = 420 dimensional feature vector is obtained using the high-boost filters, since 3 color channels were filtered with 4 high-boost kernels and 7 different feature descriptors with 5 different statistics were applied as in the previous case. Similarly, a 3 × 5 × 7 × 5 = 525 dimensional feature vector is obtained from the derivative feature maps.
By concatenating the aforementioned vectors, a 735 + 420 + 525 = 1680 dimensional feature vector can be derived, which can be mapped onto perceptual quality scores with a machine learning technique.
As an illustration, Figure 4 depicts a Y channel and its Bilaplacian feature maps with the detected FAST keypoints. From this illustration it can be seen that keypoints accumulate around those regions which highly influence humans' quality perception. In Figure 5, it is illustrated that the location of keypoints on the feature maps changes with respect to the strength of image distortion. As a consequence, it seems justified that the statistics of local feature descriptors on carefully chosen feature maps are quality-aware features.
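The block statistics and their aggregation over a feature map can be sketched in a few lines. The sketch assumes that keypoints are already available as (row, column) pairs from some detector; the function names, the clamping of blocks at the image border, the use of the sample standard deviation, and the value 0 returned for the skewness and kurtosis of flat blocks are all illustrative assumptions.

```python
import math
import statistics


def block_statistics(values):
    """Mean, median, standard deviation, skewness, kurtosis of a block."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2 * n / (n - 1)) if n > 1 else 0.0  # sample std
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0  # assumed 0 for flat blocks
    kurt = m4 / m2 ** 2 if m2 > 0 else 0.0
    return [mean, statistics.median(values), std, skew, kurt]


def extract_block(feature_map, row, col, m):
    """M x M neighbourhood around (row, col); indices clamped at borders."""
    h, w = len(feature_map), len(feature_map[0])
    half = m // 2
    return [feature_map[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]
            for r in range(row - half, row + half + 1)
            for c in range(col - half, col + half + 1)]


def feature_map_descriptor(feature_map, keypoints, m=5):
    """Average the five block statistics over all keypoint neighbourhoods."""
    stats = [block_statistics(extract_block(feature_map, r, c, m))
             for r, c in keypoints]
    return [sum(column) / len(stats) for column in zip(*stats)]


# Dimension check: channels x filters x detectors x statistics.
total_dim = 3 * 7 * 7 * 5 + 3 * 4 * 7 * 5 + 3 * 5 * 7 * 5

# Toy feature map and hypothetical keypoints.
toy_map = [[r * 10 + c for c in range(5)] for r in range(5)]
toy_descriptor = feature_map_descriptor(toy_map, [(2, 2), (1, 3)], m=3)
```

Each feature map and detector pair contributes five numbers in this scheme, which is how the 48 feature maps and 7 detectors lead to 48 × 7 × 5 = 1680 features.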

Perceptual Quality Estimation
The quality model can be formally written as q = G(F), where q is a vector of quality scores, F is a set of extracted feature vectors, and G denotes the quality model. Specifically, G can be determined by a properly chosen machine learning (regression) technique. In this study, we experimented with two different regression methods, i.e., support vector regressor (SVR) and Gaussian process regressor (GPR). In the following, we denote the proposed methods by LFD-IQA-SVR and LFD-IQA-GPR with respect to the applied regression method. In the chosen codename, LFD is the abbreviation of local feature descriptors, whose statistics were utilized as quality-aware features.

Experimental Results
In this section, our numerical experimental results are presented. First, an ablation study is carried out in Section 4.1 to justify certain design choices of the proposed methods. In the following subsection, a comparison to several other state-of-the-art methods is presented using accepted, publicly available benchmark databases and the evaluation protocol described in Section 3.1. This comparison involves direct and cross-database tests as well as significance tests.

Ablation Study
In the proposed feature extraction methodology, there are two tunable parameters, i.e., the number N of detected keypoints for each local feature descriptor and the M × M block size. In this ablation study, CLIVE [21] was utilized with the evaluation protocol given in Section 3.1 to determine an optimal value for these two parameters. To be more specific, we varied the number of detected keypoints from 1 to 55 and we experimented with 3 different block sizes, i.e., 3 × 3, 5 × 5, and 7 × 7. The results for LFD-IQA-SVR and LFD-IQA-GPR are summarized in Figures 6 and 7, respectively. From these results, it can be seen that the 5 × 5-sized neighborhood is the optimal choice for both LFD-IQA-SVR and LFD-IQA-GPR. On the other hand, LFD-IQA-SVR achieves its best performance at 45 detected keypoints, while LFD-IQA-GPR has its peak performance at 40 detected keypoints. Therefore, we applied 5 × 5 neighborhoods with 45 keypoints for LFD-IQA-SVR and 40 keypoints for LFD-IQA-GPR. The proposed methods were implemented and tested in MATLAB R2022a. To be more specific, the Computer Vision Toolbox's functions were utilized for the detection of keypoints and feature extraction, while the Statistics and Machine Learning Toolbox was used in the regression part of the proposed method.
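The parameter search of such an ablation study can be outlined as a simple grid search over N and M. In this sketch, `evaluate` is a hypothetical callable that trains and tests the method for one train-test split and returns its SROCC; the toy scoring function below is a purely synthetic stand-in and does not reproduce any measured result.

```python
from statistics import median


def select_parameters(evaluate, keypoint_counts, block_sizes, n_splits=100):
    """Return (median score, N, M) of the best grid point.

    evaluate(n, m, split) is assumed to return the SROCC obtained with
    n keypoints and m x m blocks on the given train-test split index.
    """
    best = None
    for m in block_sizes:
        for n in keypoint_counts:
            score = median(evaluate(n, m, split) for split in range(n_splits))
            if best is None or score > best[0]:
                best = (score, n, m)
    return best


def toy_evaluate(n, m, split):
    """Synthetic stand-in for a real training and testing run."""
    return 1.0 - abs(n - 45) / 100 - abs(m - 5) / 10


best_score, best_n, best_m = select_parameters(
    toy_evaluate, keypoint_counts=range(5, 56, 5),
    block_sizes=(3, 5, 7), n_splits=3)
```

Using the median over the splits as the selection criterion mirrors the way the performance indices are reported in the evaluation protocol.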

Comparison to the State-of-the-Art
The proposed methods were compared to the following 16 state-of-the-art methods: BIQI [85], BLIINDS-II [35], BMPRI [86], BRISQUE [36], CurveletQA [37], DIIVINE [38], ENIQA [87], GRAD-LOG-CP [69], GWH-GLBP [70], IL-NIQE [27], NBIQA [88], NIQE [26], OG-IQA [71], PIQE [89], Robust BRISQUE [90], and SSEQ [91]. Excluding the training-free IL-NIQE [27], NIQE [26], and PIQE [89], these methods were evaluated in the same way as the proposed methods. To provide a fair comparison, the same subsets of images were selected in the 100 random train-test splits. Since IL-NIQE [27], NIQE [26], and PIQE [89] are opinion-unaware methods, they were tested on the applied benchmark databases in one iteration, measuring PLCC, SROCC, and KROCC on the entire database without any train-test splits. In Tables 2 and 3, the median values measured over 100 random train-test splits for the considered and proposed NR-IQA methods on authentic distortions (CLIVE [21], KonIQ-10k [22], SPAQ [23]) are reported. Similarly, Table 4 summarizes the results on artificial distortions. From these results, it can be seen that the proposed LFD-IQA-SVR achieves the second best results in almost all cases, while the proposed LFD-IQA-GPR provides the best results for all databases in all performance metrics. In Table 5, the results measured on the individual databases are aggregated into direct and weighted averages of PLCC, SROCC, and KROCC. From these results, it can be concluded that the proposed methods are able to outperform all the other methods by a large margin. Further, the difference between the proposed and the other algorithms is larger in the case of weighted averages. This indicates that the proposed methods tend to give a better performance on larger IQA databases. Figures 8 and 9 depict ground-truth versus predicted quality score scatter plots of the proposed methods determined on the CLIVE [21], KonIQ-10k [22], and KADID-10k [25] test sets, respectively.
To show that the results summarized in Tables 2-4 are significant, the Wilcoxon rank sum test was applied [69,92]. To be specific, the null hypothesis was that two sets of 100 SROCC values produced by two different NR-IQA methods were sampled from continuous distributions with equal median values. In our tests, a 5% significance level was applied. The results are summarized in Table 6 for LFD-IQA-SVR and in Table 7 for LFD-IQA-GPR. Here, symbol '1' is used to denote that the proposed method is significantly better than the method in the row on the database in the column.
From the presented results, it can be clearly seen that the achieved results are significant compared to the state-of-the-art. As a consequence, the proposed HVS-inspired feature extraction method has proved to be more effective than those of the examined state-of-the-art methods.
Table 2. Comparison to the state-of-the-art on CLIVE [21] and KonIQ-10k [22] databases. Median PLCC, SROCC, and KROCC values were measured over 100 random train-test splits. The best results are typed in bold, the second best results are underlined, and the third best results are typed in italic.
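The difference between direct and weighted averages can be made concrete with a small sketch, assuming that the weights are the numbers of images in the databases. The database names, sizes, and scores below are hypothetical placeholders and are not values from the tables of this paper.

```python
def direct_average(scores):
    """Plain arithmetic mean over the databases."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, sizes):
    """Mean weighted by the number of images in each database."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


# Hypothetical example: a small and a large database.
toy_scores = {"small_db": 0.5, "large_db": 0.8}
toy_sizes = {"small_db": 1000, "large_db": 9000}
toy_direct = direct_average(toy_scores)
toy_weighted = weighted_average(toy_scores, toy_sizes)
```

In this toy case the weighted average exceeds the direct average because the better score was obtained on the larger database, which is the pattern the comparison of the two averages is meant to reveal.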
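The significance test can be sketched with a two-sided Wilcoxon rank sum test based on the normal approximation. This is a simplified illustration: it applies neither a tie correction nor a continuity correction, so its p-values may differ slightly from those of statistical toolboxes, and the function names and the toy SROCC samples are hypothetical.

```python
import math
from statistics import median


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def rank_sum_test(a, b):
    """Return (z, two-sided p) of the rank sum test, normal approximation."""
    n1, n2 = len(a), len(b)
    r = average_ranks(list(a) + list(b))
    w = sum(r[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2      # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def significantly_better(a, b, alpha=0.05):
    """Symbol '1' case: a has the higher median and H0 is rejected."""
    _, p = rank_sum_test(a, b)
    return p < alpha and median(a) > median(b)


# Toy SROCC samples over 100 splits (hypothetical values).
method_a = [0.80 + 0.001 * i for i in range(100)]
method_b = [0.70 + 0.0009 * i for i in range(100)]
z_ab, p_ab = rank_sum_test(method_a, method_b)
```

With 100 values per method, the normal approximation is generally adequate; for very small samples an exact test would be preferable.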

In another test, the generalization ability of the methods was examined. Namely, the algorithms were trained on the entire KonIQ-10k [22] database and tested on the entire CLIVE [21] database. This process is called a cross-database test in the literature [93]. The results of the cross-database test are shown in Table 8. In this test, the proposed methods are also the best performing ones, outperforming the state-of-the-art by a large margin.
Figure 9. Ground-truth scores versus predicted scores using the proposed LFD-IQA-GPR method on (a) CLIVE [21], (b) KonIQ-10k [22], and (c) KADID-10k [25] test sets.
Table 7. Results of the two-sided Wilcoxon rank sum test. Symbol '1' is used to denote that the proposed method (LFD-IQA-GPR) is significantly better than the method in the row on the database in the column.
Table 8. Results of the cross-database test. The examined and the proposed methods were trained on KonIQ-10k [22] and tested on CLIVE [21]. The best results are typed in bold, the second best results are underlined, and the third best results are typed in italic.

Conclusions
In this paper, a novel machine learning based NR-IQA method was introduced which applies an innovative quality-aware feature extraction procedure relying on the statistics of local feature descriptors. To be more specific, a sequence of HVS inspired filters was applied to the Y, Cb, and Cr color channels of an input image to enhance those statistical regularities of the image to which the HVS is sensitive. Next, certain statistics of various local feature descriptors were extracted from each feature map to construct a powerful feature vector which is able to characterize possible image distortions from various points of view. Finally, the obtained feature vector was mapped onto perceptual quality scores with a trained regressor. The proposed method was compared to 16 state-of-the-art NR-IQA methods on five large benchmark IQA databases containing either authentic (CLIVE [21], KonIQ-10k [22], SPAQ [23]) or artificial (TID2013 [24], KADID-10k [25]) distortions. Specifically, the comparison involved three performance metrics measured in direct database tests, significance tests, and a cross-database test. As shown, the proposed method is able to significantly outperform the state-of-the-art. Future work involves a real-time GPU (graphics processing unit) implementation of the proposed method. Another direction of future research is to generalize the achieved results to other types of image modalities, such as stereoscopic or computer-generated images.

Acknowledgments:
We thank the academic editor and the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: