A combined full-reference image quality assessment method based on convolutional activation maps

The goal of full-reference image quality assessment (FR-IQA) is to predict the quality of an image, as perceived by human observers, using its pristine reference counterpart. In this study, we explore a novel combined approach that predicts the perceptual quality of a distorted image by compiling a feature vector from convolutional activation maps. More specifically, a reference-distorted image pair is run through a pretrained convolutional neural network and the activation maps are compared with a traditional image similarity metric. Subsequently, the resulting feature vector is mapped onto perceptual quality scores with the help of a trained support vector regressor. A detailed parameter study is also presented, in which the design choices of the proposed method are justified. Furthermore, we study the relationship between the amount of training images and the prediction performance. Specifically, we demonstrate that the proposed method can be trained with a small amount of data to reach high prediction performance. Our best proposal, ActMapFeat, is compared to the state-of-the-art on six publicly available benchmark IQA databases: KADID-10k, TID2013, TID2008, MDID, CSIQ, and VCL-FER. Our method is able to significantly outperform the state-of-the-art on these benchmark databases. This paper is accompanied by the source code of the proposed method: https://github.com/Skythianos/FRIQA-ActMapFeat.


Introduction
In the last decades, a continuous growth in the number of digital images has been witnessed due to the spread of smartphones and various social media. As a result of the huge number of imaging sensors, a massive amount of visual data is produced each day. However, digital images may suffer from different distortions during acquisition, transmission, or compression. As a result, unsatisfactory perceived visual quality or a certain level of annoyance may occur. Consequently, it is essential to predict the perceptual quality of images in many applications, such as display technology, communication, image compression, image restoration, image retrieval, object detection, or image registration. Broadly speaking, image quality assessment (IQA) algorithms can be classified into three classes based on the availability of the reference, undistorted image. Full-reference (FR) and reduced-reference (RR) IQA algorithms have full and partial information about the reference image, respectively. In contrast, no-reference (NR) IQA methods do not possess any information about the reference image.
Convolutional neural networks (CNNs), introduced by LeCun et al. [1] in 1988, are used in many applications, from image classification [2] to audio synthesis [3]. In 2012, Krizhevsky et al. [4] won the ImageNet [5] challenge by training a deep CNN with the help of GPUs. Due to the huge number of parameters in a CNN, the training set has to contain sufficient data to avoid over-fitting. However, the number of human-annotated images in many databases is too limited to train a CNN from scratch. On the other hand, a CNN trained on the ImageNet database [5] is able to provide powerful features for a wide range of image processing tasks [6], [7], [8] due to the comprehensive set of learned features. In this paper, we propose a combined FR-IQA metric based on the comparison of feature maps extracted from pretrained CNNs. The rest of this section is organized as follows. In Subsection 1.1, previous and related work is summarized and reviewed. Next, Subsection 1.2 outlines the main contributions of this study.

arXiv:2010.09361v1 [cs.CV] 19 Oct 2020

Related work
Over the past few decades, many FR-IQA algorithms have been proposed in the literature. The earliest algorithms, such as mean squared error (MSE) and peak signal-to-noise ratio (PSNR), are based on the energy of image distortions to measure perceptual image quality. Later, methods appeared that utilize the features of the human visual system (HVS). Such FR-IQA algorithms can be classified into two groups: bottom-up and top-down ones. Bottom-up approaches directly build on the properties of the HVS, such as luminance adaptation [9], contrast sensitivity [10], or contrast masking [11], to create a model that enables the prediction of perceptual quality. In contrast, top-down methods try to incorporate the general characteristics of the HVS into a metric to devise effective algorithms. Probably the most famous top-down approach is the structural similarity index (SSIM) proposed by Wang et al. [12]. The main idea behind SSIM [12] is to make a distinction between structural and non-structural image distortions, because the HVS is mainly sensitive to the former. Specifically, SSIM is determined at each coordinate within local windows of the distorted and the reference images, and the distorted image's overall quality is the arithmetic mean of the local windows' values. Later, advanced forms of SSIM were proposed. For example, edge-based structural similarity [13] (ESSIM) compares the edge information between the reference image block and the distorted one, claiming that edge information is the most important image structure information for the HVS. MS-SSIM [14] built multi-scale information into SSIM, while 3-SSIM [15] is a weighted average of different SSIMs for edges, textures, and smooth regions. Furthermore, saliency-weighted [16] and information-content-weighted [17] SSIMs were also introduced in the literature.
The feature similarity index (FSIM) [18] relies on the fact that the HVS utilizes low-level features, such as edges and zero crossings, in the early stages of visual information processing to interpret images. Consequently, FSIM utilizes two features: (1) phase congruency, which is a contrast-invariant, dimensionless measure of the local structure, and (2) an image gradient magnitude feature. The gradient magnitude similarity deviation (GMSD) [19] method exploits the sensitivity of image gradients to image distortions; pixel-wise gradient similarity combined with a pooling strategy is applied to predict perceptual image quality. In contrast, the Haar wavelet-based perceptual similarity index (HaarPSI) [20] applies coefficients obtained from the Haar wavelet decomposition to compile an IQA metric. Specifically, the magnitudes of the high-frequency coefficients are used to define local similarities, while the low-frequency ones are applied to weight the importance of image regions. Quaternion image processing provides a truly vectorial approach to image quality assessment. Wang et al. [21] gave a quaternion description for the structural information of color images. Namely, the local variance of the luminance was taken as the real part of a quaternion, while the three RGB channels were taken as the imaginary parts. Moreover, perceptual quality was characterized by the angle computed between the singular value feature vectors of the quaternion matrices derived from the distorted and the reference image. In contrast, Kolaman and Pecht [22] created a quaternion-based structural similarity index (QSSIM) to assess the quality of RGB images.
Following the success in image classification [4], deep learning has become extremely popular in the field of image processing. Liang et al. [23] first introduced a dual-path convolutional neural network (CNN) with two input channels: one dedicated to the reference image and the other to the distorted image. The network had one output that predicted the image quality score. First, the input distorted and reference images were decomposed into 224 × 224 sized image patches, and the quality of each patch pair was predicted independently. Finally, the overall image quality was determined by averaging the scores of the patch pairs. Kim and Lee [24] introduced a similar dual-path CNN, but their model accepts a distorted image and an error map calculated from the reference and the distorted image as inputs. Furthermore, it generates a visual sensitivity map which is multiplied by an error map to predict perceptual image quality. Similarly to the previous algorithm, the inputs are decomposed into smaller image patches, and the overall image quality is determined by averaging the scores of the distorted patch-error map pairs.
Recently, generic features extracted from different pretrained CNNs, such as AlexNet [4] or GoogLeNet [2], have proved very powerful for a wide range of image processing tasks. Razavian et al. [6] applied feature vectors extracted from the OverFeat [25] network, which was trained for object classification on ImageNet ILSVRC 2013 [5], to carry out image classification, scene recognition, fine-grained recognition, attribute detection, and content-based image retrieval. The authors reported superior results compared to those of traditional algorithms. Later, Zhang et al. [26] pointed out that feature vectors extracted from pretrained CNNs outperform traditional image quality metrics. Motivated by the above-mentioned results, a number of FR-IQA algorithms have been proposed relying on different deep features and pretrained CNNs. Amirshahi et al. [27] compared different activation maps of the reference and the distorted image extracted from a pretrained AlexNet [4] CNN. Specifically, the similarity of the activation maps was measured to produce quality sub-scores. Finally, these sub-scores were aggregated to produce an overall quality value of the distorted image. In contrast, Bosse et al. [28] extracted deep features with the help of a VGG16 [29] network from reference and distorted image patches. Subsequently, the distorted and the reference deep feature vectors were fused together and mapped onto patch-wise quality scores. Finally, the patch-wise scores were pooled, supplemented with a patch weight estimation procedure, to obtain the overall perceptual quality. In our previous work [30], we introduced a composition-preserving deep architecture for FR-IQA relying on a Siamese layout of pretrained CNNs, feature pooling, and a feedforward neural network.
Another line of work focuses on creating combined metrics, where existing FR-IQA algorithms are combined to achieve strong correlation with the subjective ground-truth scores. In [31], Okarma examined the properties of three FR-IQA metrics (MS-SSIM [14], VIF [32], and R-SVD [33]) and proposed a combined quality metric based on products of powers of these metrics. Later, this approach was further developed using optimization techniques [34], [35]. Similarly, Oszust [36] selected 16 FR-IQA metrics and applied their scores as predictor variables in a lasso regression model to obtain a combined metric. Yuan et al. [37] took a similar approach, but kernel ridge regression was utilized to fuse the scores of the IQA metrics. In contrast, Lukin et al. [38] fused the results of six metrics with the help of a neural network. Oszust [39] carried out a decision fusion based on 16 FR-IQA measures by minimizing the root mean square error of the prediction with a genetic algorithm.

Contributions
Motivated by recent convolutional activation map based metrics [27], [40], we make the following contributions in our study. Previous activation-map-based approaches directly compared the similarity between reference and distorted activation maps using histogram-based similarity metrics. Subsequently, the resulting sub-scores were pooled together using different ad hoc solutions, such as the geometric mean. In contrast, we take a machine learning approach. Specifically, we compile a feature vector for each distorted-reference image pair by comparing distorted and reference activation maps with the help of traditional image similarity metrics. Subsequently, these feature vectors are mapped to perceptual quality scores using machine learning techniques. Unlike previous combined methods [34], [35], [38], [39], we do not directly apply optimization or machine learning techniques to the results of traditional metrics; instead, traditional metrics are used to compare convolutional activation maps and to compile a feature vector.
We demonstrate that our approach has several advantages. First, it can easily be generalized to any input image resolution or base CNN architecture, since no image patches need to be cropped from the input images, unlike in several previous CNN-based approaches [28], [23], [24]. Second, the proposed feature extraction method is highly effective: the proposed method is able to reach state-of-the-art results even if only 5% of the KADID-10k [41] database is used for training. In contrast, the performance of state-of-the-art deep learning based approaches is strongly dependent on the training database size [42]. Another advantage of the proposed approach is that it performs significantly better in so-called cross-database tests. Our method is compared against the state-of-the-art on six publicly available IQA benchmark databases: KADID-10k [41], TID2013 [51], VCL-FER [44], MDID [45], CSIQ [46], and TID2008 [43]. Specifically, our method is able to significantly outperform the state-of-the-art on these benchmark databases.

Structure
The remainder of this paper is organized as follows. After this introduction, Section 2 presents our proposed approach. Section 3 shows experimental results and analysis with a parameter study, a comparison to other state-of-the-art methods, and a cross-database test. Finally, a conclusion is drawn in Section 4.

Proposed method
The proposed approach is based on constructing feature vectors from each convolutional layer of a pretrained CNN for a reference-distorted image pair. Subsequently, the convolutional layer-wise feature vectors are concatenated and mapped onto perceptual quality scores with the help of a regression algorithm. In our experiments, we used the pretrained AlexNet [4] CNN, which won the 2012 ILSVRC by reducing the top-5 error rate from 26.2% to 15.3%. This was the first time that a CNN performed so well on the ImageNet database. The techniques introduced in this model, such as data augmentation and dropout, are still widely used today. In total, it contains five convolutional and three fully connected layers. Furthermore, a rectified linear unit (ReLU) is applied after each convolutional and fully connected layer as the activation function.

Architecture
In this subsection, the proposed deep FR-IQA framework, which aims to capture image features at various levels from a pretrained CNN, is introduced in detail. Existing works extract features of one or two layers from a pretrained CNN in FR-IQA [28], [30]. However, many papers have pointed out the advantages of considering the features of multiple layers in NR-IQA [47] and aesthetic quality assessment [48]. We put the applied base CNN architectures into a unified framework by slicing a CNN into L parts at the convolutional layers, independently of the network architecture, e.g., AlexNet or VGG16. Without loss of generality, the slicing of AlexNet [4] is shown in Figure 3. As one can see from Figure 3, at this point the features are in the form of W × H × D tensors, where the depth (D) depends on the applied base CNN architecture, while the tensors' width (W) and height (H) depend on the input image size. In order to make the feature vectors' dimension independent of the input image pairs' resolution, we do the following. First, convolutional feature tensors are extracted with the help of a pretrained CNN (Figure 3) from the reference image and from the corresponding distorted image. Second, the reference and distorted activation maps at a given convolutional layer are compared using traditional image similarity metrics. More specifically, the ith element of a layer-wise feature vector corresponds to the similarity between the ith activation map of the reference feature tensor and the ith activation map of the distorted feature tensor. Formally, we can write

f_i^l = ISM(F_i^{ref,l}, F_i^{dist,l}),

where ISM(·) denotes a traditional image similarity metric (PSNR, SSIM [12], and HaarPSI [20] are considered in this study), F_i^{ref,l} and F_i^{dist,l} are the ith activation maps of the lth reference and distorted feature tensors, f^l is the feature vector extracted from the lth convolutional layer, and f_i^l stands for its ith element. Figure 4 illustrates the compilation of layer-wise feature vectors.
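The layer-wise comparison above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the authors' MATLAB code; `psnr` and `layer_feature_vector` are hypothetical helper names, and PSNR is used as the image similarity metric for simplicity.

```python
import numpy as np

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio between two equally sized activation maps."""
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return np.inf  # identical maps: perfect similarity
    return 10.0 * np.log10(data_range ** 2 / mse)

def layer_feature_vector(f_ref, f_dist, ism=psnr):
    """Compare the i-th reference and distorted activation maps of one
    convolutional layer (W x H x D feature tensors) with an image
    similarity metric ISM; returns the D-dimensional layer-wise vector f^l."""
    assert f_ref.shape == f_dist.shape
    depth = f_ref.shape[2]
    return np.array([ism(f_ref[:, :, i], f_dist[:, :, i]) for i in range(depth)])
```

Each element of the returned vector is one similarity score, so the vector's length equals the number of activation maps in the layer, independently of the input image resolution.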
In contrast to other machine learning techniques, CNNs are often called black-box techniques due to their millions of parameters and highly nonlinear internal representations of the data. Activation maps of an input image help us to understand which features a CNN has learned. If we feed AlexNet reference and distorted image pairs and visualize the activations of the conv1 layer, it can be seen that the activations of the reference image and those of the distorted image differ significantly from each other, mainly in quality-aware details such as edges and textures (see Figures 1 and 2 for illustration). This observation motivated us to compile an effective feature vector by comparing the activation maps.
The whole feature vector, which characterizes a reference-distorted image pair, can be obtained by concatenating the layer-wise feature vectors. Formally, we can write

F = [f^1, f^2, ..., f^L],

where F stands for the whole feature vector, f^j (j = 1, 2, ..., L) is the jth layer-wise feature vector, and L denotes the number of convolutional layers in the applied base CNN.
Finally, a regression algorithm is applied to map the feature vectors onto perceptual quality scores. In this study, we experimented with two different regression techniques: support vector regression (SVR) [49] and Gaussian process regression (GPR) [50]. Specifically, we applied SVR with a linear kernel and with a radial basis function (RBF) kernel, while GPR was applied with a rational quadratic kernel.
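The concatenation and regression steps can be sketched as follows, here with scikit-learn's RBF-kernel SVR standing in for the MATLAB regressor. The feature dimension and training data below are placeholders, not values from the paper:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical layer-wise feature vectors f^1..f^L for one image pair,
# concatenated into the whole feature vector F.
layer_vectors = [rng.random(8), rng.random(16), rng.random(16)]
F = np.concatenate(layer_vectors)
n_features = F.size

# Placeholder training set: feature vectors of 40 image pairs and their
# ground-truth quality scores (e.g., mean opinion scores).
X_train = rng.random((40, n_features))
y_train = rng.random(40)

# Gaussian (RBF-kernel) SVR, the regressor found best in the parameter study.
model = SVR(kernel="rbf")
model.fit(X_train, y_train)
quality = model.predict(F.reshape(1, -1))  # predicted perceptual quality
```

In practice, the training rows would be the concatenated similarity vectors computed from the reference-distorted pairs of an IQA database.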

Experimental results
In this section, we present our experimental results and analysis. First, we introduce the applied evaluation metrics in Subsection 3.1. Second, the implementation details and the experimental setup are given in Subsection 3.2. Subsequently, a detailed parameter study is presented in Subsection 3.3, in which we extensively justify the design choices of our proposed method. In Subsection 3.4, we explore the performance of our proposed method on different distortion types and distortion intensity levels. Subsequently, we examine the relationship between the performance and the amount of training data in Subsection 3.5. In Subsection 3.6, a comparison to other state-of-the-art methods is carried out using six benchmark IQA databases: KADID-10k [41], TID2013 [51], TID2008 [43], VCL-FER [44], CSIQ [46], and MDID [45]. The results of the cross-database tests are presented in Subsection 3.7. Table 1 summarizes the publicly available IQA databases used in this paper, allowing comparisons between the number of reference and test images, image resolutions, the number of distortion levels, and the number of distortion types.

Evaluation metrics
A reliable way to evaluate objective FR-IQA methods is to measure the strength of the correlation between the ground-truth scores of a publicly available IQA database and the predicted scores. In the literature, Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and Kendall's rank-order correlation coefficient (KROCC) are widely applied to characterize the degree of correlation. PLCC between vectors x and y can be expressed as

PLCC(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / sqrt( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² ),

where x stands for the vector containing the ground-truth scores, y is the vector of predicted scores, and x̄ and ȳ denote their means. SROCC between vectors x and y can be defined as

SROCC(x, y) = PLCC(rank(x), rank(y)),

where the rank(·) function returns a vector whose ith element is the rank of the ith element of the input vector; consequently, SROCC can also be expressed by substituting the middle ranks of x and y into the PLCC formula. KROCC between vectors x and y can be determined as

KROCC(x, y) = (n_c − n_d) / ( (1/2) · n(n − 1) ),

where n is the length of the input vectors, n_c stands for the number of concordant pairs between x and y, and n_d denotes the number of discordant pairs.
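The three correlation coefficients above can be implemented directly from their definitions; the following standard-library sketch (hypothetical helper names, middle ranks for ties) mirrors the formulas:

```python
import math

def plcc(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def rank(v):
    """Middle (average) ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def srocc(x, y):
    """Spearman's coefficient: PLCC of the rank vectors."""
    return plcc(rank(x), rank(y))

def krocc(x, y):
    """Kendall's coefficient: (n_c - n_d) / (n(n-1)/2)."""
    n = len(x)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                nc += 1
            elif s < 0:
                nd += 1
    return (nc - nd) / (0.5 * n * (n - 1))
```

For a perfectly monotonic prediction, all three coefficients equal 1; discordant pairs pull KROCC toward −1.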

Experimental setup
KADID-10k [41] was used to carry out a detailed parameter study to determine the best design choices for the proposed method. Subsequently, other publicly available databases, such as TID2013 [51], TID2008 [43], VCL-FER [44], CSIQ [46], and MDID [45], were also applied to carry out a comparison to other state-of-the-art FR-IQA algorithms. Furthermore, our algorithms and other learning-based state-of-the-art methods were evaluated by 5-fold cross-validation with 20 repetitions. Specifically, each IQA database was divided randomly into a training set (approx. 80%) and a test set (approx. 20%) with respect to the reference, pristine images. As a consequence, there was no semantic content overlap between these sets. Moreover, we report the average PLCC, SROCC, and KROCC values. The perceptual quality score distributions of the individual IQA databases are depicted in Figure 5.
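Splitting with respect to the reference images, so that no pristine content appears in both sets, can be sketched as follows (`split_by_reference` is a hypothetical helper, not part of the released code):

```python
import random

def split_by_reference(ref_ids, train_ratio=0.8, seed=0):
    """Split distorted images into train/test indices by their reference
    image, so that no reference content appears in both sets.
    ref_ids[i] is the reference-image identifier of distorted image i."""
    refs = sorted(set(ref_ids))
    rng = random.Random(seed)
    rng.shuffle(refs)
    cut = int(train_ratio * len(refs))
    train_refs = set(refs[:cut])
    train_idx = [i for i, r in enumerate(ref_ids) if r in train_refs]
    test_idx = [i for i, r in enumerate(ref_ids) if r not in train_refs]
    return train_idx, test_idx
```

Repeating such a split with different seeds gives the 20 random train-test splits over which the mean correlation values are computed.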
All models were implemented and tested in MATLAB R2019a relying mainly on the functions of the Deep Learning Toolbox (formerly Neural Network Toolbox), Statistics and Machine Learning Toolbox, and the Image Processing Toolbox.

Parameter study
In this subsection, we present a detailed parameter study using the publicly available KADID-10k [41] database to find the optimal design choices of our proposed method. Specifically, we compared the performance of three traditional metrics (PSNR, SSIM [12], and HaarPSI [20]) for comparing activation maps. Furthermore, we compared the performance of three different regression algorithms: linear SVR, Gaussian SVR, and GPR with a rational quadratic kernel function. As already mentioned, the evaluation is based on 20 random train-test splits, and mean PLCC, SROCC, and KROCC values are reported.
The results of the parameter study are summarized in Figure 6. From these results, it can be seen that the HaarPSI metric with Gaussian SVR provides the best results. This architecture is called ActMapFeat in the following sections.

[Figure 6: Parameter study; panel (c) shows GPR with a rational quadratic kernel function.]

Performance over different distortion types and levels
In this subsection, we examine the performance of the proposed ActMapFeat over the different image distortion types and levels of KADID-10k [41]. Namely, KADID-10k consists of images with 25 distortion types at 5 levels. Furthermore, the distortion types can be classified into seven groups: blurs, color distortions, compression, noise, brightness change, spatial distortions, and sharpness & contrast.
The mean PLCC, SROCC, and KROCC values, measured over 20 random train-test splits, are reported in Table 2. From these results, it can be observed that ActMapFeat performs relatively uniformly over the different image distortion types, with the exception of some color-related (color shift, color saturation 1), brightness-related (mean shift), and patch-related (non-eccentricity patch, color block) distortion types. Furthermore, it performs very well on different blur (Gaussian blur, lens blur, motion blur) and compression (JPEG, JPEG2000) types.
The performance of ActMapFeat over the different distortion levels of KADID-10k [41] is illustrated in Table 3. From these results, it can be observed that the proposed method performs relatively uniformly over the different distortion levels. It tends to achieve better results at higher distortion levels than at the lowest ones, with the best results at moderate distortion levels.

Effect of the training set size
In general, the amount of training data has a strong impact on the performance of machine/deep learning systems [42], [41], [52]. In this subsection, we study the relationship between the number of training images and the performance using the KADID-10k [41] database. In our experiments, the ratio of training images in the database was varied from 5% to 80%, while that of the test images was varied accordingly from 95% to 20%. The results are illustrated in Figure 7. It can be observed that the proposed system is rather robust to the size of the training set. Specifically, the mean PLCC, SROCC, and KROCC are 0.923, 0.929, and 0.765 if the ratio of training images is 5%. These performance metrics increase to 0.959, 0.957, and 0.821, respectively, when the ratio of the training set reaches 80%, which is a common choice in machine learning. On the whole, our system can be trained with a small amount of data to reach relatively high PLCC, SROCC, and KROCC values. This demonstrates the effectiveness of the proposed feature extraction method for distorted-reference image pairs.
Comparison to the state-of-the-art

Mean PLCC, SROCC, and KROCC values, measured over 20 random train-test splits, are reported for the machine learning-based algorithms. In contrast, traditional FR-IQA metrics are tested on the whole database, and we report their PLCC, SROCC, and KROCC values.
The results of the performance comparison to the state-of-the-art on KADID-10k [41], TID2013 [51], VCL-FER [44], TID2008 [43], MDID [45], and CSIQ [46] are summarized in Tables 4, 5, and 6, respectively. It can be observed that the performance of the examined state-of-the-art FR-IQA algorithms is far from perfect on KADID-10k [41]. In contrast, our method was able to produce PLCC and SROCC values over 0.95. Furthermore, our KROCC value is about 0.09 higher than the second best one. On the smaller TID2013 [51], TID2008 [43], MDID [45], and VCL-FER [44] IQA databases, the performance of the examined state-of-the-art approaches improves significantly. In spite of this, our method also gives the best results on these databases in terms of PLCC, SROCC, and KROCC. On CSIQ [46], HaarPSI [20] and SSIM CNN [40] provide the best results. However, the difference between the proposed method and the above-mentioned metrics is rather minor and not significant. Namely, significance tests were also carried out, following the ITU (International Telecommunication Union) guidelines [68] for evaluating quality models. The H0 hypothesis for a given correlation coefficient (PLCC, SROCC, or KROCC) was that a rival state-of-the-art method produces values that are not significantly different, with p < 0.05. Moreover, the variances of the z-transforms were determined as 1.06/(N − 3), where N stands for the number of images in the given IQA database. In Tables 4-6, a green background indicates that the correlation is lower than that of the proposed method and the difference is statistically significant, while a red background means the correlation is higher and the difference is statistically significant. Figure 9 depicts the box plots of the measured PLCC, SROCC, and KROCC values over 20 random train-test splits.
On each box, the central mark denotes the median, and the bottom and top edges of the box represent the 25th and 75th percentiles, respectively. Moreover, the whiskers extend to the most extreme values which are not considered outliers.

Cross database test
A cross-database test refers to the procedure of training on one IQA benchmark database and testing on another, to show the generalization potential of a machine learning based method. The results of the cross-database tests using KADID-10k [41], TID2013 [51], TID2008 [43], VCL-FER [44], MDID [45], and CSIQ [46] are depicted in Figure 10. From these results, it can be concluded that the proposed method loses a significant part of its performance in most cases, but is still able to match the state-of-the-art. Moreover, there are some pairings, such as training on KADID-10k [41] and testing on CSIQ [46], training on TID2013 [51] and testing on TID2008 [43], or training on TID2008 [43] and testing on TID2013 [51], where the performance loss is rather minor.

Conclusions
In this paper, we introduced a framework for FR-IQA relying on feature vectors that are obtained by comparing reference and distorted activation maps with traditional image similarity metrics. Unlike previous CNN-based approaches, our method does not take patches from the distorted-reference image pairs; instead, it obtains convolutional activation maps and creates feature vectors from these maps. This way, our method can easily be generalized to any input image resolution or base CNN architecture. Furthermore, we carried out a detailed parameter study with respect to the applied image similarity metric and regression algorithm. Moreover, we pointed out that the proposed feature extraction method is effective, since our method is able to reach high PLCC, SROCC, and KROCC values when trained on only 5% of the KADID-10k images. Our algorithm was compared to 15 other state-of-the-art FR-IQA methods on six benchmark IQA databases: KADID-10k, TID2013, VCL-FER, MDID, TID2008, and CSIQ. Our method was able to outperform the state-of-the-art in terms of PLCC, SROCC, and KROCC. The generalization ability of the proposed method was confirmed in cross-database tests.