Multi-pooled Inception features for no-reference image quality assessment

Image quality assessment (IQA) is an important element of a broad spectrum of applications ranging from automatic video streaming to display technology. Furthermore, the measurement of image quality requires a balanced investigation of image content and features. Our proposed approach extracts visual features by attaching global average pooling (GAP) layers to multiple Inception modules of a convolutional neural network (CNN) pretrained on the ImageNet database. In contrast to previous methods, we do not take patches from the input image. Instead, the input image is treated as a whole and is run through a pretrained CNN body to extract resolution-independent, multi-level deep features. As a consequence, our method can be easily generalized to any input image size and pretrained CNN. We also present a detailed parameter study with respect to the CNN base architectures and the effectiveness of different deep features. We demonstrate that our best proposal, called MultiGAP-NRIQA, is able to provide state-of-the-art results on three benchmark IQA databases. Furthermore, these results were also confirmed in a cross database test using the LIVE In the Wild Image Quality Challenge database.


Introduction
With the increasing popularity of imaging devices as well as the rapid spread of social media and multimedia sharing websites, digital images and videos have become an essential part of daily life, especially in everyday communication. Consequently, there is a growing need for effective systems that are able to monitor the quality of visual signals. Obviously, the most reliable way of assessing image quality is to perform subjective user studies, which involves the gathering of individual quality scores. However, the compilation and evaluation of a subjective user study are very slow and laborious processes. Furthermore, their application in a real-time system is impossible. In contrast, objective image quality assessment (IQA) involves the development of quantitative measures and algorithms for estimating image quality.
Objective IQA is classified based on the availability of the reference image. Full-reference image quality assessment (FR-IQA) methods have full access to the reference image, whereas no-reference image quality assessment (NR-IQA) algorithms possess only the distorted digital image. In contrast, reduced-reference image quality assessment (RR-IQA) methods have partial information about the reference image; for example, as a set of extracted features. Objective IQA algorithms are evaluated on benchmark databases containing the distorted images and their corresponding mean opinion scores (MOSs), which were collected during subjective user studies. The MOS is a real number, typically in the range 1.0-5.0, where 1.0 represents the lowest quality and 5.0 denotes the best quality. Furthermore, the MOS of an image is its arithmetic mean over all collected individual quality ratings. As already mentioned, publicly available IQA databases help researchers to devise and evaluate IQA algorithms and metrics. Existing IQA datasets can be grouped into two categories with respect to the introduced image distortion types. The first category contains images with artificial distortions, while the images of the second category are taken from sources with "natural" degradation without any additional artificial distortions.
The rest of this section is organized as follows. In Subsection 1.1, we review related work in NR-IQA with special attention to deep learning based methods. Subsection 1.2 introduces the contributions made in this study.
arXiv:2011.05139v1 [cs.CV] 10 Nov 2020

Related work
Many traditional NR-IQA algorithms rely on the so-called natural scene statistics (NSS) [1] model. These methods assume that natural images possess a particular regularity that is modified by visual distortion. Further, by quantifying the deviation from the natural statistics, perceptual image quality can be determined. NSS-based feature vectors usually rely on the wavelet transform [2], discrete cosine transform (DCT) [3], curvelet transform [4], shearlet transform [5], or transforms to other spatial domains [6]. DIIVINE [2] (Distortion Identification-based Image Verity and INtegrity Evaluation) exploits NSS using the wavelet transform and consists of two steps: a probabilistic distortion identification stage followed by a distortion-specific quality assessment stage. In contrast, He et al. [7] presented a sparse feature representation of NSS, also using the wavelet transform. Saad et al. [3] built a feature vector from DCT coefficients. Subsequently, a Bayesian inference approach was applied for the prediction of perceptual quality scores. In [8], the authors presented a detailed review of the use of local binary pattern texture descriptors in NR-IQA.
Another line of work focuses on opinion-unaware algorithms that require neither training samples nor human subjective scores. Zhang et al. [9] introduced the integrated local natural image quality evaluator (IL-NIQE), which combines features of NSS with multivariate Gaussian models of image patches. This evaluator uses several quality-aware NSS features, i.e., the statistics of normalized luminance, mean subtracted and contrast-normalized products of pairs of adjacent coefficients, gradient, log-Gabor filter responses, and color (after the transformation into a logarithmic-scale opponent color space).
Kim et al. [10] introduced a no-reference image quality predictor called the blind image evaluator based on a convolutional neural network (BIECON), in which the training process is carried out in two steps. First, local metric score regression and then subjective score regression are conducted. During the local metric score regression, nonoverlapping image patches are trained independently; FR-IQA methods such as SSIM or GMS are used for the target patches. Then, the CNN trained on image patches is refined by targeting the subjective image score of the complete image. Similarly, the training of a multi-task end-to-end optimized deep neural network [11] is carried out in two steps. Namely, this architecture contains two sub-networks: a distortion identification network and a quality prediction network. Furthermore, a biologically inspired generalized divisive normalization [12] is applied as the activation function in the network instead of rectified linear units (ReLUs). Similarly, Fan et al. [13] introduced a two-stage framework. First, a distortion type classifier identifies the distortion type; then, a fusion algorithm is applied to aggregate the results of expert networks and produce a perceptual quality score.
In recent years, many algorithms relying on deep learning have been proposed. Because of the small size of many existing image quality benchmark databases, most deep learning based methods employ CNNs as feature extractors or take patches from the training images to increase the database size. The CNN framework of Kang et al. [14] is trained on non-overlapping image patches extracted from the training images. Furthermore, these patches inherit the MOS of their source images. For preprocessing, local contrast normalization is employed. The applied CNN consists of conventional building blocks, such as convolutional, pooling, and fully connected layers. Bosse et al. [15] introduced a similar method. Namely, they developed a 12-layer CNN that is trained on 32 × 32 image patches. Furthermore, a weighted average patch aggregation method was introduced in which weights representing the relative importance of image patches in quality assessment are learned by a subnetwork. In contrast, Li et al. [16] combined a CNN trained on image patches with the Prewitt magnitudes of segmented images to predict perceptual quality.
Li et al. [17] trained a CNN on 32 × 32 image patches and employed it as a feature extractor. In this method, a feature vector of length 800 represents each image patch of an input image, and the sum of the image patches' feature vectors is associated with the original input image. Finally, a support vector regressor (SVR) is trained to evaluate the image quality using the feature vector representing the input image. In contrast, Bianco et al. [18] utilized an AlexNet [19] fine-tuned on the target database as a feature extractor. Specifically, image quality is predicted by averaging the quality ratings of multiple randomly sampled image patches. Further, the perceptual quality of each patch is predicted by an SVR trained on deep features extracted with the help of the fine-tuned AlexNet [19]. Similarly, Gao et al. [20] employed a pretrained CNN as a feature extractor, but they generate one feature vector for each CNN layer. Furthermore, a quality score is predicted for each feature vector using an SVR. Finally, the overall perceptual quality of the image is determined by averaging these quality scores. In contrast, Zhang et al. [21] first trained a CNN to identify image distortion types and levels. Furthermore, the authors took another CNN, which was trained on ImageNet, to deal with authentic distortions. To predict perceptual image quality, the features of the last convolutional layers were pooled bilinearly and mapped onto perceptual quality scores with a fully connected layer. He et al. [22] proposed a method containing two steps. In the first step, a sequence of image patches is created from the input image. Subsequently, features are extracted with the help of a CNN and a long short-term memory (LSTM) is utilized to evaluate the level of image distortion. In the second step, the model is trained to predict the patches' quality scores. Finally, a saliency-weighted procedure is applied to determine the whole image's quality from the patch-wise scores.
Similarly, Ji et al. [23] utilized a CNN and an LSTM for NR-IQA, but the deep features were extracted from the convolutional layers of a VGG16 [24] network. In contrast to other algorithms, Zhang et al. [25] proposed an opinion-unaware deep method. Namely, high-contrast image patches were selected using deep convolutional maps from pristine images which were used to train a multi-variate Gaussian model.

Contributions
Convolutional neural networks (CNNs) have demonstrated great success in a wide range of computer vision tasks [26], [27], [28], including NR-IQA [14], [15], [16], [29]. Furthermore, pretrained CNNs can also provide a useful feature representation for a variety of tasks [30]. However, employing pretrained CNNs is not straightforward. One major challenge is that CNNs require a fixed input size. To overcome this constraint, previous methods for NR-IQA [14], [15], [16], [18] take patches from the input image. Furthermore, the evaluation of perceptual quality was based on these image patches or on the features extracted from them. In this paper, we make the following contributions. We introduce a unified and content-preserving architecture that relies on the Inception modules of pretrained CNNs, such as GoogLeNet [31] or Inception-V3 [32]. Specifically, this novel architecture applies visual features extracted from multiple Inception modules of pretrained CNNs and pooled by global average pooling (GAP) layers. In this manner, we obtain both intermediate-level and high-level representations from CNNs, and each level of representation is considered to predict image quality. Due to this architecture, we do not take patches from the input image like previous methods [14], [15], [16], [18]. Unlike previous deep architectures [22], [15], [18], we do not utilize only the deep features of the last layer of a pretrained CNN. Instead, we carefully examine the effect of features extracted from different layers on the prediction performance, and we point out that the combination of deep features from mid- and high-level layers results in a significant increase in prediction performance. With experiments on three publicly available benchmark databases, we demonstrate that the proposed method is able to outperform other state-of-the-art methods. Specifically, we utilized the KonIQ-10k [33], KADID-10k [34], and LIVE In the Wild Image Quality Challenge [35] databases.
KonIQ-10k [33] is the largest publicly available database containing 10,073 images with authentic distortions, while KADID-10k [34] consists of 81 reference images and 10,125 distorted ones (81 reference images × 25 types of distortions × 5 levels of distortions). The LIVE In the Wild Image Quality Challenge Database [35] is significantly smaller than KonIQ-10k [33] or KADID-10k [34]. For a cross database test, the LIVE In the Wild Image Quality Challenge Database [35] is also applied; it contains 1,162 images with authentic distortions evaluated by over 8,100 unique human observers.
The remainder of this paper is organized as follows. After this introduction, Section 2 introduces our proposed approach. In Section 3, the experimental results and analysis are presented, and a conclusion is drawn in Section 4.

Methodology
To extract visual features, GoogLeNet [31] or Inception-V3 [32] was applied as the base model. GoogLeNet [31] is a 22-layer deep CNN and was the winner of ILSVRC 2014 with a top-5 error rate of 6.7%. The depth and width of the network were increased, but not by simply following the general method of stacking layers on top of each other. Instead, a new level of organization was introduced, codenamed the Inception module (see Figure 1). In GoogLeNet [31], not everything happens sequentially as in previous CNN models; pieces of the network work in parallel. This design was inspired by a neuroscience model [36] in which a series of Gabor filters was used with a two-layer deep model to handle multiple scales. Contrary to that model, however, all layers are learned and not fixed. In the GoogLeNet [31] architecture, Inception layers are introduced and repeated many times. Subsequent improvements of GoogLeNet [31] have been called Inception-vN, where N refers to the version number put out by Google. Inception-V2 [32] was refined by the introduction of batch normalization [37]. Inception-V3 [32] was improved by factorization ideas. Factorization into smaller convolutions means, for example, replacing a 5 × 5 convolution by a multi-layer network with fewer parameters but with the same input size and output depth.
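The parameter saving behind this factorization can be checked with simple arithmetic. The sketch below assumes that the "multi-layer network" is a stack of two 3 × 3 convolutions, which is the replacement commonly associated with Inception-V3; the channel count of 64 and the helper name `conv_weights` are hypothetical and chosen only for illustration.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a square convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # hypothetical channel count, equal for input and output

# One 5x5 convolution versus two stacked 3x3 convolutions.
# The stacked pair covers the same 5x5 receptive field.
single_5x5 = conv_weights(5, channels, channels)
stacked_3x3 = 2 * conv_weights(3, channels, channels)

# Relative saving in weights obtained by the factorization.
reduction = 1 - stacked_3x3 / single_5x5
print(single_5x5, stacked_3x3, round(reduction, 2))
```

Under these assumptions the stacked variant needs 18 weights per channel pair instead of 25, i.e. roughly 28% fewer.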
We chose the features of Inception modules for the following reasons. The main motivation behind the construction of Inception modules is that the salient parts of images may vary extremely. This means that the region of interest can occupy very different image regions both in terms of size and location. That is why determining the convolutional kernel size in a CNN is very difficult. Namely, a larger kernel size is required for visual information that is distributed rather globally. On the other hand, a smaller kernel size is better for visual information that is distributed more locally. As already mentioned, the creators of Inception modules responded to this challenge by introducing multiple filters with multiple sizes on the same level. Furthermore, visual distortions have a similar nature. Namely, the distortion distribution is strongly influenced by image content [38].
Figure 1: Illustration of Inception module. It was restricted to filter sizes 1 × 1, 3 × 3, and 5 × 5. Subsequently, the outputs were concatenated into a single vector that is the input for the next stage. Adding an alternative parallel pooling path was found to be beneficial. Applying 1 × 1 convolution filters makes it possible to reduce the volume before the expensive 3 × 3 and 5 × 5 convolutions [31].
An input image is run through a CNN body pretrained on the ImageNet database (GoogLeNet and Inception-V3 are considered in this study), which carries out all its defined operations. Furthermore, global average pooling (GAP) layers are attached to each Inception module to extract resolution-independent deep features at different abstraction levels. The feature vectors obtained from the Inception modules are concatenated and an SVR with radial basis function is applied to predict perceptual image quality.

Pipeline of the proposed method
The pipeline of the proposed framework is depicted in Figure 2. A given input image to be evaluated is run through a pretrained CNN body (GoogLeNet [31] and Inception-V3 [32] are considered in this study), which carries out all its defined operations. Specifically, global average pooling (GAP) layers are attached to the output of each Inception module. Similar to max- or min-pooling layers, GAP layers are applied in CNNs to reduce the spatial dimensions of convolutional layers. However, a GAP layer carries out a more extreme type of dimensionality reduction than a max- or min-pooling layer. Namely, an h × w × d block is reduced to 1 × 1 × d. In other words, a GAP layer reduces each feature map to a single value by taking the average of this feature map. By adding GAP layers to each Inception module, we are able to extract resolution-independent features at different levels of abstraction. Namely, the feature maps produced by Inception modules, which were inspired by neuroscience models [36], have been shown to be representative of object categories [31], [32] and to correlate well with human perceptual quality judgments [39]. The motivation behind the application of GAP layers was as follows. By attaching GAP layers to the Inception modules, we gain an architecture that can be easily generalized to any input image resolution and base CNN architecture. Furthermore, in this way the decomposition of the input image into smaller patches can be avoided, which means that parameter settings related to the database properties (patch size, number of patches, sampling strategy, etc.) can be ignored. Moreover, some kinds of image distortions are not uniformly distributed in the image. These kinds of distortions could be better captured in an aspect-ratio and content preserving architecture.
As already mentioned, a feature vector is extracted over each Inception module using a GAP layer. Let $f_k$ denote the feature vector extracted from the kth Inception module. The input image's feature vector is obtained by concatenating the respective feature vectors produced by the Inception modules. Formally, we can write
$$F = f_1 \oplus f_2 \oplus \dots \oplus f_N,$$
where $F$ is the feature vector of the input image, N denotes the number of Inception modules in the base CNN, and ⊕ stands for the concatenation operator. In Section 3.3, we present a detailed analysis of the effectiveness of different Inception modules' deep features as a perceptual metric. Furthermore, we point out the increase in prediction performance due to the concatenation of deep features extracted from different abstraction levels.
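The pooling and concatenation steps can be illustrated with a minimal, dependency-free sketch. It is not the authors' MATLAB implementation: the feature maps are toy nested lists standing in for real Inception module outputs, and the function names (`global_average_pool`, `multigap_features`, `constant_map`) as well as the channel depths 3 and 5 are hypothetical.

```python
def global_average_pool(feature_map):
    """Reduce an h x w x d feature map (nested lists) to a length-d vector."""
    h = len(feature_map)
    w = len(feature_map[0])
    d = len(feature_map[0][0])
    pooled = [0.0] * d
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                pooled[c] += value
    # Average every channel over all h * w spatial positions.
    return [total / (h * w) for total in pooled]


def multigap_features(module_outputs):
    """Concatenate the GAP vectors of all modules in order."""
    features = []
    for feature_map in module_outputs:
        features.extend(global_average_pool(feature_map))
    return features


def constant_map(h, w, d, value):
    """Toy stand-in for a module output; real maps come from the CNN."""
    return [[[value] * d for _ in range(w)] for _ in range(h)]


# Two hypothetical inputs of different resolution: the spatial sizes differ,
# while the channel depths (3 and 5 here) are fixed by the architecture.
small = [constant_map(4, 6, 3, 1.0), constant_map(2, 3, 5, 2.0)]
large = [constant_map(8, 12, 3, 1.0), constant_map(4, 6, 5, 2.0)]

print(len(multigap_features(small)), len(multigap_features(large)))
```

Both toy inputs yield a vector of the same length, because the length depends only on the channel depths of the modules and not on the spatial size of the feature maps.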
Subsequently, an SVR [40] with radial basis function (RBF) kernel is trained to learn the mapping between feature vectors and corresponding perceptual quality scores.
Moreover, we also applied Gaussian process regression (GPR) with rational quadratic kernel function [41] in Section 3.4.
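For orientation, the two kernel functions named above can be written down in a few lines. This is a sketch under assumptions: the source does not state the kernel hyperparameters, so `gamma`, `length_scale`, `alpha`, and `sigma_f` are hypothetical defaults, and the rational quadratic kernel is given in one commonly used parametrization.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * squared_distance(x, y))


def rational_quadratic_kernel(x, y, length_scale=1.0, alpha=1.0, sigma_f=1.0):
    """Rational quadratic kernel in a common parametrization:
    sigma_f^2 * (1 + ||x - y||^2 / (2 * alpha * length_scale^2)) ** (-alpha).
    """
    r2 = squared_distance(x, y)
    return sigma_f ** 2 * (1.0 + r2 / (2.0 * alpha * length_scale ** 2)) ** (-alpha)


# Both kernels equal 1 for identical inputs and decay with distance.
u, v = [0.0, 0.0], [1.0, 1.0]
print(rbf_kernel(u, u), rbf_kernel(u, v))
print(rational_quadratic_kernel(u, u), rational_quadratic_kernel(u, v))
```

The rational quadratic kernel decays polynomially rather than exponentially with distance, so distant feature vectors keep more influence than under the RBF kernel.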

Database compilation and transfer learning
Many image quality assessment databases are available online for research purposes, such as TID2013 [42] or LIVE In the Wild [35]. In this study, we selected the recently published KonIQ-10k [33] database to train and test our system, because it is the largest available database containing digital images with authentic distortions. Furthermore, we present a parameter study on KonIQ-10k [33] to find the best design choices. Our best proposal is compared against the state-of-the-art on KonIQ-10k [33] and also on other publicly available databases. KonIQ-10k [33] consists of 10,073 digital images with the corresponding MOS values. To ensure the fairness of the experimental setup, we randomly selected 6,073 images (∼ 60%) for training, 2,000 images (∼ 20%) for validation, and 2,000 images (∼ 20%) for testing purposes. First, the base CNN was fine-tuned on the target database KonIQ-10k [33] using the above-mentioned training and validation subsets. In previous methods [18], the base CNN's last 1,000-way softmax layer was typically removed and replaced by a 5-way one for this purpose, because the training and validation subsets were reorganized into five classes with respect to the MOS values: class A for excellent image quality (5.0 > MOS ≥ 4.2), class B for good image quality (4.2 > MOS ≥ 3.4), class C for fair image quality (3.4 > MOS ≥ 2.6), class D for poor image quality (2.6 > MOS ≥ 1.8), and class E for very poor image quality (1.8 > MOS ≥ 1.0). Subsequently, the base CNN was further trained to classify the images into quality categories. Since the MOS distribution in KonIQ-10k [33] is strongly imbalanced (see Figure 3), there would be very few images in the class for excellent images. That is why we took a regression-based approach instead of a classification-based approach for fine-tuning. Namely, we removed the base CNN's last 1,000-way softmax layer and replaced it with a regression layer containing only one neuron.
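The five-class binning used by earlier classification-based fine-tuning can be sketched as follows. The thresholds are the ones listed above; everything else is an assumption made for illustration: the function name `quality_class`, the decision to place a MOS of exactly 5.0 in class A, and the toy MOS values, which are not taken from KonIQ-10k.

```python
from collections import Counter


def quality_class(mos):
    """Map a MOS in [1.0, 5.0] to one of the five quality classes A to E."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must lie in [1.0, 5.0]")
    if mos >= 4.2:
        return "A"  # excellent (a MOS of exactly 5.0 is placed here by assumption)
    if mos >= 3.4:
        return "B"  # good
    if mos >= 2.6:
        return "C"  # fair
    if mos >= 1.8:
        return "D"  # poor
    return "E"      # very poor


# Hypothetical MOS values, concentrated in the middle of the scale.
toy_mos = [1.2, 2.0, 2.7, 2.9, 3.1, 3.3, 3.5, 3.6, 3.9, 4.3]
counts = Counter(quality_class(m) for m in toy_mos)
print(sorted(counts.items()))
```

With a skewed MOS distribution, the extreme classes receive only a handful of samples, which is the motivation given above for regressing the MOS directly instead.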
Since GoogLeNet [31] and Inception-V3 [32] accept images with input sizes of 224 × 224 and 299 × 299, respectively, twenty 224 × 224-sized or 299 × 299-sized patches were cropped randomly from each training and validation image. Furthermore, these patches inherit the perceptual quality score of their source images, and the fine-tuning is carried out on these patches. Specifically, we trained the base CNN further for regression to predict the image patches' MOS values, which are inherited from their source images. During fine-tuning, the Adam optimizer [43] was used; the initial learning rate was set to 0.0001 and divided by 10 when the validation error stopped improving. Further, the batch size was set to 28 and the momentum was 0.9 during fine-tuning.
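A minimal sketch of how such patch-level training samples could be generated is given below. It only computes crop coordinates and attaches the inherited MOS; no image data is read. The function names, the image identifier `img_0001`, the MOS value, and the fixed seed are hypothetical.

```python
import random


def random_crop_boxes(width, height, patch_size, count, rng):
    """Return `count` random (left, top, right, bottom) boxes inside the image."""
    if patch_size > width or patch_size > height:
        raise ValueError("patch is larger than the image")
    boxes = []
    for _ in range(count):
        left = rng.randint(0, width - patch_size)   # inclusive bounds
        top = rng.randint(0, height - patch_size)
        boxes.append((left, top, left + patch_size, top + patch_size))
    return boxes


def patch_training_samples(image_id, width, height, mos, patch_size,
                           count=20, seed=0):
    """Pair every random crop with the MOS inherited from its source image."""
    rng = random.Random(seed)
    boxes = random_crop_boxes(width, height, patch_size, count, rng)
    return [(image_id, box, mos) for box in boxes]


# Twenty 299 x 299 crops (the Inception-V3 input size) from a 1024 x 768 image.
samples = patch_training_samples("img_0001", 1024, 768, mos=3.4, patch_size=299)
print(len(samples), samples[0])
```

Note that patches are needed only for fine-tuning the fixed-input-size network; the feature extraction described earlier operates on the whole image.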

Experimental results and analysis
In this section, we demonstrate our experimental results. First, we give the definition of the evaluation metrics in Section 3.1. Second, we describe the experimental setup and the implementation details in Section 3.2. In Section 3.3, we give a detailed parameter study to find the best design choices of the proposed method using KonIQ-10k [33] database. Subsequently, we carry out a comparison to other state-of-the-art methods using KonIQ-10k [33], KADID-10k [34], and LIVE In the Wild [35] publicly available IQA databases. Finally, we present a so-called cross database test using LIVE In the Wild Image Quality Challenge database [35].

Evaluation metrics
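The performance of NR-IQA algorithms is characterized by the correlation calculated between the ground-truth scores of a benchmark database and the predicted scores. To this end, Pearson's linear correlation coefficient (PLCC) and Spearman's rank order correlation coefficient (SROCC) are widely used in the literature [44]. PLCC between datasets A and B is defined as
$$\mathrm{PLCC}(A,B) = \frac{\sum_{i=1}^{m}(A_i-\bar{A})(B_i-\bar{B})}{\sqrt{\sum_{i=1}^{m}(A_i-\bar{A})^2}\,\sqrt{\sum_{i=1}^{m}(B_i-\bar{B})^2}},$$
where $\bar{A}$ and $\bar{B}$ denote the averages of sets A and B, $A_i$ and $B_i$ denote the ith elements of sets A and B, respectively, and m is the number of elements. SROCC can be expressed as
$$\mathrm{SROCC}(A,B) = \frac{\sum_{i=1}^{m}(A_i-\hat{A})(B_i-\hat{B})}{\sqrt{\sum_{i=1}^{m}(A_i-\hat{A})^2}\,\sqrt{\sum_{i=1}^{m}(B_i-\hat{B})^2}},$$
where $\hat{A}$ and $\hat{B}$ stand for the middle ranks and, in this expression, $A_i$ and $B_i$ are understood as the ranks of the ith elements.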
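Both coefficients are straightforward to compute. The following sketch uses only the standard library and treats SROCC as the Pearson correlation of the ranks, with tied values receiving their average rank; the function names and the toy scores are hypothetical and not taken from any of the databases.

```python
import math


def plcc(a, b):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def srocc(a, b):
    """Spearman's rank order correlation: PLCC computed on the ranks."""
    return plcc(average_ranks(a), average_ranks(b))


ground_truth = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical MOS values
predicted = [1.1, 1.9, 3.5, 3.4, 4.8]      # hypothetical predictions
print(round(plcc(ground_truth, predicted), 3), round(srocc(ground_truth, predicted), 3))
```

PLCC measures how linear the relation between predictions and ground truth is, whereas SROCC only depends on their ordering, so the two are usually reported together.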

Experimental setup and implementation details
As already mentioned, a detailed parameter study was carried out on the recently published KonIQ-10k [33], which is the currently largest available IQA database with authentic distortions, to determine the optimal design choices. Subsequently, our best proposal is compared to the state-of-the-art using other publicly available databases as well.
The proposed method was implemented in MATLAB R2019a, relying mainly on the functions of the Deep Learning Toolbox (formerly Neural Network Toolbox), Image Processing Toolbox, and Statistics and Machine Learning Toolbox. Accordingly, the parameter study was also carried out in the MATLAB environment. More specifically, the method was evaluated on 100 random train-validation-test splits of the applied database, and we report the average of the PLCC and SROCC values.
As usual in machine learning, ∼ 60% of the images were used for training, ∼ 20% for validation, and ∼ 20% for testing purposes. Moreover, for IQA databases containing artificial distortions, the database is split with respect to the reference images, so that there is no semantic overlap between the training, validation, and test sets. Further, the models were trained and tested on a personal computer with an 8-core i7-7700K CPU and two NVidia Geforce GTX 1080 GPUs.
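Splitting with respect to the reference images can be sketched as follows: the reference identifiers are shuffled and partitioned first, and every distorted image then follows its reference. The function name, the image naming scheme, and the seed are hypothetical; the counts of 81 references with 125 distorted versions each mirror the KADID-10k description given earlier.

```python
import random


def split_by_reference(reference_ids, seed, train=0.6, val=0.2):
    """Partition the distinct reference ids into train/val/test sets."""
    refs = sorted(set(reference_ids))
    random.Random(seed).shuffle(refs)
    n_train = round(train * len(refs))
    n_val = round(val * len(refs))
    return (set(refs[:n_train]),
            set(refs[n_train:n_train + n_val]),
            set(refs[n_train + n_val:]))


# Toy index: (distorted image name, reference id).
images = [(f"ref{r:02d}_d{k:03d}", r) for r in range(81) for k in range(125)]

train_refs, val_refs, test_refs = split_by_reference([r for _, r in images], seed=0)

# Each distorted image is assigned to the subset of its reference image,
# so no image content is shared between the subsets.
train_set = [name for name, r in images if r in train_refs]
val_set = [name for name, r in images if r in val_refs]
test_set = [name for name, r in images if r in test_refs]
print(len(train_set), len(val_set), len(test_set))
```

Repeating the call with 100 different seeds would give the 100 random splits over which the correlation values are averaged.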

Parameter study
First, we conducted experiments to determine which Inception module in GoogLeNet [31] or in Inception-V3 [32] is the most appropriate for visual feature extraction to predict perceptual image quality. Second, we answer the question whether the concatenation of different Inception modules' feature vectors improves the prediction's performance or not. Third, we demonstrate that fine-tuning of the base CNN architecture results in significant performance increase. In this parameter study, we used KonIQ-10k database to answer the above mentioned questions and to find the most effective design choices. In the next subsection, our best proposal is used to carry out a comparison to the state-of-the-art using other databases as well.
The results of the parameter study are summarized in Tables 1, 2, 3, and 4. Specifically, Tables 1 and 3 contain the results with the GoogLeNet [31] and Inception-V3 [32] base architectures without fine-tuning, respectively. On the other hand, Tables 2 and 4 summarize the results when fine-tuning is applied. In these tables, we report the average, the median, and the standard deviation of the PLCC and SROCC values obtained after 100 random train-validation-test splits using the KonIQ-10k database. Furthermore, we report the effectiveness of deep features extracted from different Inception modules. Moreover, the tables also contain the prediction performance of the concatenated deep feature vector. From these results, it can be concluded that the deep features extracted from the early Inception modules perform slightly worse than those of the intermediate and last Inception modules. Although most state-of-the-art methods [22], [15], [18] utilize the features of the last CNN layers, it is worth examining earlier layers as well, because the tables' data indicate that the middle layers encode the information that is the most powerful for perceptual quality prediction. We can also assert that feature vectors containing both mid-level and high-level deep representations are significantly more effective than those containing only one level's feature representation. Finally, it can be clearly seen that fine-tuning the base CNN architectures also improves the effectiveness of the extracted deep features. On the whole, the deeper Inception-V3 [32] provides more effective features than GoogLeNet [31]. Our best proposal relies on Inception-V3 and concatenates the features of all Inception modules. In the following, we call this architecture MultiGAP-NRIQA and compare it to other state-of-the-art methods in the next subsection.
This parameter study may offer a further contribution. It is worth studying the features of different layers separately, because the features of intermediate layers may provide a better representation for the given task than high-level features. Furthermore, the proposed feature extraction method may also be superior in other problems where the task is to predict a single value from the image data alone, relying on a large enough database.
In our environment (MATLAB R2019a, PC with 8-core i7-7700K CPU and two NVidia Geforce GTX 1080 GPUs), the computational times of the proposed MultiGAP-NRIQA method are as follows. The loading of the base CNN and the 1024 × 768-sized or the 512 × 384-sized input image takes about 1.8s. Furthermore, the feature extraction from multiple Inception modules of Inception-V3 [32] and the concatenation take on average 1.355s or 0.976s on the GPU, respectively. Finally, the SVR regression takes 2.976s on average on the CPU.
To ensure a fair comparison, the other traditional and deep methods were trained, tested, and evaluated in exactly the same way as our proposed method. Specifically, ∼ 60% of the images were used for training, ∼ 20% for validation, and ∼ 20% for testing purposes. If a validation set is not required, the training set contains ∼ 80% of the images. Moreover, for IQA databases containing artificial distortions, the database is split with respect to the reference images, so that there is no semantic overlap between the training, validation, and test sets. To compare our method to the state-of-the-art, we report the average PLCC and SROCC values of 100 random train-validation-test splits for our method and for the other algorithms. The results are summarized in Table 6. More specifically, this table shows the measured average PLCC and SROCC on three large publicly available IQA databases (Table 5 summarizes the major parameters of the IQA databases used in this paper).
From the results, it can be seen that the proposed method significantly outperforms the state-of-the-art on the KonIQ-10k database. Moreover, only the MultiGAP-NRIQA method is able to achieve PLCC and SROCC values over 0.9. It can be observed that GPR with rational quadratic kernel function performs better than SVR with Gaussian kernel function. Similarly, the proposed method outperforms the state-of-the-art on the LIVE In the Wild IQA database [35] by a large margin. On KADID-10k, DeepFL-IQA [54] provides the best results by a large margin. The proposed MultiGAP-GPR gives the third best results.

Cross database test
To prove the generalization capability of our proposed MultiGAP-NRIQA method, we carry out a so-called cross database test in this subsection. This means that our model was trained on the whole KonIQ-10k [33] database and tested on LIVE In the Wild Image Quality Challenge Database [35]. Moreover, the other learning-based NR-IQA methods were also tested this way. The results are summarized in Table 7. From the results, it can be clearly seen that all learning-based methods performed significantly poorer in the cross database test than in the previous tests. It should be emphasized that our MultiGAP-NRIQA method generalized better than the state-of-the-art traditional or deep learning based algorithms even without fine-tuning. The performance drop occurs owing to the fact that images are treated slightly differently in each publicly available IQA database. For example, in LIVE In The Wild [35] database the images were rescaled. In contrast, the images of KonIQ-10k [33] were cropped from their original counterparts.

Conclusion
In this paper, we introduced a deep framework for NR-IQA which constructs a feature space relying on multi-level Inception features extracted from pretrained CNNs via GAP layers. Unlike previous deep methods, the proposed approach does not take patches from the input image, but instead treats the image as a whole and extracts image resolution independent features. As a result, the proposed approach can be easily generalized to any input image size and CNN base architecture. Unlike previous deep methods, we extract multi-level features from the CNN to incorporate both mid-level and high-level deep representations into the feature vector. Furthermore, we pointed out in a detailed parameter study that mid-level features provide significantly more effective descriptors for NR-IQA. Another important observation was that the feature vector containing both mid-level and high-level representations outperforms all feature vectors containing the representation of one level. We also carried out a comparison to other state-of-the-art methods, and our approach outperformed the state-of-the-art on the largest available benchmark IQA databases. Moreover, the results were also confirmed in a cross database test. There are many directions for future research. Specifically, we would like to improve the fine-tuning process in order to transfer quality-aware features more effectively into the base CNN. Another direction of future research could be the generalization of the applied feature extraction method to other CNN architectures, such as residual networks.
Table 7: Cross database test. The learning-based NR-IQA methods were trained on the whole KonIQ-10k database and tested on the LIVE In the Wild database. The measured PLCC and SROCC values are reported. The best results are shown in bold and the second best results are typed in italic. The results of DeepFL-IQA [54] were measured by the authors of [54].