Blind Image Quality Assessment Based on Classification Guidance and Feature Aggregation

In this work, we present a convolutional neural network (CNN) named CGFA-CNN for blind image quality assessment (BIQA). A two-stage strategy is adopted: Sub-network I first identifies the distortion type in an image, and Sub-network II then quantifies this distortion. Unlike most deep neural networks, we extract hierarchical features as descriptors to enhance the image representation, and we design a feature aggregation layer, trainable end to end, that applies Fisher encoding to visual vocabularies modeled by Gaussian mixture models (GMMs). To account for both authentic and synthetic distortions, the hierarchical features combine the characteristics of a CNN trained on our self-built dataset with those of a CNN trained on ImageNet. We evaluate our algorithm on four publicly available databases, and the results demonstrate that CGFA-CNN outperforms other methods on both synthetic and authentic databases.


Introduction
Digital pictures may suffer various distortions during acquisition, transmission, and compression, leading to unsatisfactory perceived visual quality or a certain level of annoyance. Thus, it is crucial to predict the quality of digital pictures in many applications, such as compression, communication, printing, display, analysis, registration, restoration, and enhancement [1][2][3]. Generally, image quality assessment approaches can be classified into three kinds according to the additional information they need. Specifically, full-reference image quality assessment (FR-IQA) [4][5][6][7] and reduced-reference image quality assessment (RR-IQA) [8][9][10] need full and partial information of reference images respectively, while blind image quality assessment (BIQA) [11][12][13][14] measures quality without any information from the reference image. BIQA methods are therefore more attractive in practical applications, where the reference image is usually unavailable or hard to obtain.
Early research mainly focused on one or more specific distortion types, such as Gaussian blur [15], blockiness from JPEG compression [16], or ringing arising from JPEG2000 compression [17]. However, images may be affected by unknown distortions in many practical scenarios. In contrast, general BIQA methods aim to work well for arbitrary distortions, and they can be classified into two categories according to the features extracted, i.e., natural scene statistics (NSS) based methods and training based methods. NSS based methods [18] assume that non-distorted natural images obey certain perceptually relevant statistical laws that are violated by the presence of common image distortions, and they attempt to describe an image using its scene statistics from different domains. For example, BRISQUE [19] derives features from the locally normalized luminance coefficients in the spatial domain. M3 [20] utilizes the joint local contrast features from the gradient magnitude (GM) map and the Laplacian of Gaussian (LOG) response. Later, a perceptually motivated and feature-driven model was deployed in FRIQUEE [21], in which a large collection of features, defined in various complementary, perceptually relevant color and transform-domain spaces, is drawn from among the most successful BIQA models produced to date.
However, in the above methods the knowledge-driven feature extraction and the data-driven quality prediction are separated. It has been demonstrated that training based methods outperform NSS based methods by a large margin because they make a fully data-driven BIQA solution possible. For example, CORNIA [22] constructs a codebook in an unsupervised manner, using raw image patches as local descriptors and soft-assignment for encoding. Considering that the feature set generally adopted in previous methods comes from zero-order statistics and is insufficient for BIQA, HOSA [23] constructs a much smaller codebook using K-means clustering [24] and introduces higher-order statistics. In contrast to these methods, which rely on spatially normalized coefficients and codebook-based features, CNN-based methods learn features automatically end to end. For example, TSCN [25] aims to learn the complicated relationship between visual appearance and perceived quality via a two-stream convolutional neural network. DIQA [26] defines two separate CNN branches to learn objective distortion and human visual sensitivity, respectively.
In this work, we propose an end-to-end BIQA method based on classification guidance and feature aggregation, which is accomplished by two sub-networks with shared features in the early layers. Due to the lack of training data, we construct a large-scale dataset by synthesizing distortions and pre-train Sub-network I to classify an image into a specific distortion type from a set of pre-defined categories. We find that the proposed method would struggle to achieve high accuracy on authentic images if it were exposed only to synthetic distortions during training. We therefore extract hierarchical features both from the shared layers of the two sub-networks and from another CNN (VGG-16 [27]) pre-trained on ImageNet [28], whose pictures contain distortions that occur as a natural consequence of photography, and form a unified feature group.
Sub-network II takes the hierarchical features and the classification information as inputs to predict the perceptual quality. The combination of the two sub-networks gives the learning framework both favorable quality perception and proper parameter initialization in an end-to-end training manner. We design a feature aggregation layer that converts inputs of arbitrary size into a fixed-length representation. A fully connected layer is then exploited as a linear regression model to map the high-dimensional features to quality scores. This allows the proposed CGFA-CNN to accept an image of any size as input, so there is no need to perform any transformation of the images (such as cropping or scaling) that would affect their perceptual quality scores.
The paper is structured as follows. In Sec. 2, previous work on CNN-based BIQA related to ours is briefly reviewed. In Sec. 3, details of the proposed method are described. In Sec. 4, experimental results on the public IQA databases and the corresponding analysis are presented. Sec. 5 concludes the paper.

Related Work
In this section, we provide a brief survey of the major solutions to the lack of training data in BIQA and a review of recent studies related to our work.
Because the number of parameters to be trained in a CNN is usually very large, the training set must contain sufficient data to avoid over-fitting. However, the number of samples and the variety of image contents in the public quality-annotated image databases are rather limited, which cannot meet the needs of end-to-end training of a deep network. Currently, there are two main methods to tackle this challenge.
Methods of the first kind train the model on image patches. For example, deepIQA [29] randomly samples patches from the entire image as inputs and predicts the quality score of local regions by assigning the mean opinion score (MOS) of the whole image to all patches within it. Although taking small patches as inputs for data augmentation is superior to using the whole image on a given dataset, this method still suffers from limitations because local image quality varies with content across spatial locations even when the distortion is homogeneous. To resolve this problem, BIECON [30] makes use of existing FR-IQA algorithms to assign quality labels to sampled image patches, but the performance of such a network depends highly on that of the FR-IQA models. Other methods such as dipIQ [31], which attempts to generate discriminable image pairs by involving FR-IQA models, may suffer from similar problems.
The second method is to pre-train a network on large-scale datasets from other fields. For each pre-trained architecture, two back-end training strategies are available: replacing the last layer of the pre-trained CNN with a regression layer and fine-tuning it on the IQA database to conduct quality prediction, or using SVR to regress the features extracted by the pre-trained network onto subjective scores. For instance, DeepBIQ [32] reports on the use of different features, extracted from CNNs pre-trained for image classification tasks on ImageNet [28] and Places365 [33], as a generic image description. Kim et al. [34] select the well-known deep CNN models AlexNet [35] and ResNet50 [36], pre-trained for image classification on ImageNet [28], as baseline architectures. These methods, which directly inherit the weights of models pre-trained for general image classification, suffer from low relevance to BIQA and unnecessary complexity.
To better address the shortage of training data, MEON [37] proposes a cascaded multi-task framework that first trains a distortion type identification network on large-scale pre-defined samples; a quality prediction network is then trained, taking advantage of the distortion information obtained in the first stage. Furthermore, DB-CNN [38] not only constructs a pre-training set based on the Waterloo Exploration Database [39] and PASCAL VOC [40] for synthetic distortions, but also uses ImageNet [28] to pre-train another CNN for authentic distortions. Motivated by MEON [37] and DB-CNN [38], we likewise construct a pre-training set based on the Waterloo Exploration Database [39] and PASCAL VOC [40] for synthetic distortions. In addition, both the distortion type and the distortion level are considered at the same time, which results in better quality-aware initializations and richer distortion information.
Although previous DNN-based BIQA methods have achieved significant performance, they usually comprise convolutional and pooling layers for feature extraction and fully connected layers for regression, which leads to three limitations. First, techniques such as average or maximum pooling are too simple to be accurate for long sequences. Second, a fully connected layer is destructive to the high-dimensional disorder and spatial invariance of the local features. Third, such CNNs typically require a fixed image size: images have to be resized or cropped before being fed into the network, and either scaling or cropping can cause a perceptual mismatch with the assigned quality labels. To tackle these challenges, we explore more sophisticated pooling techniques based on clustering approaches such as Bag-of-visual-words (BOW) [41], the Vector of Locally Aggregated Descriptors (VLAD) [42], and Fisher Vectors [43]. Studies have shown that integrating VLAD as a differentiable module in a neural network can significantly improve the aggregated representation for place recognition [44] and video classification [45]. Our proposed feature aggregation layer acts as a pooling layer on top of the convolutional layers and converts inputs of arbitrary size into a fixed-length representation. A fully connected layer can then be used for regression without any preprocessing of the input image.

The Proposed Method
The framework of CGFA-CNN is illustrated in Fig. 1. Sub-network I, which is first pre-trained on a self-built dataset, classifies an image into a specific distortion type and initializes the shared layers for the further learning process. Sub-network II predicts the perceptual quality of the same image; it is fine-tuned on the IQA databases and takes advantage of the distortion information obtained from Sub-network I. The feature aggregation layer (FV layer) and the classification-guided gating unit (CGU) are described in Sec. 3.3 and Sec. 3.4.

Distortion Type Identification
Construction of the Pre-training Dataset. Due to the deficiency of available quality-annotated samples, we first construct a large-scale dataset based on the Waterloo Database [39] and the PASCAL VOC Database [40]. The former contains 4,744 images that can be loosely categorized into 7 classes. The latter contains 17,125 images covering 20 categories. We merge the two databases and obtain 21,869 pristine images with various contents. Then 9 types of distortion are introduced: JPEG compression, JPEG2000 compression, Gaussian blur, white Gaussian noise, contrast stretching, pink noise, image quantization with color dithering, over-exposure, and under-exposure. We synthesize each image at 5 distortion levels following [39], except for over-exposure and under-exposure, for which only three levels are generated according to [46]. The constructed dataset consists of 896,629 images, organized into 41 subcategories according to distortion type and degradation level. We label each image by the subcategory it belongs to.

Sub-network I Architecture. Inspired by the VGG-16 network architecture [27], we design a similar structure, subject to some modifications, to identify the distortion type of the input image. Details are given in Table 1. The tailored VGG-16 network comprises a stack of convolutions (Conv) for feature extraction, one maximum pooling layer (MaxPool) for feature fusion, and three fully connected layers (FC) for feature regression. All hidden layers are equipped with the Rectified Linear Unit (ReLU) [35] and Batch Normalization (BN) [47]. We denote the input mini-batch training data by {(X^(n), p^(n))}_{n=1}^{N}, where X^(n) is the n-th input image and p^(n) is a multi-class indicator vector of the ground-truth distortion type. We append a soft-max layer at the end, which maps the logits z^(n) to

  \hat{p}^(n)_c = exp(z^(n)_c) / Σ_{c'=1}^{C} exp(z^(n)_{c'}),  c = 1, ..., C,

so that \hat{p}^(n) = [\hat{p}^(n)_1, ..., \hat{p}^(n)_C]^T is a C-dimensional probability vector for the n-th input in a mini-batch, indicating the probability of each distortion type.
Model parameters of Sub-network I are collectively denoted by W. A cross-entropy loss is used to train this sub-network:

  ℓ(W) = −(1/N) Σ_{n=1}^{N} Σ_{c=1}^{C} p^(n)_c log \hat{p}^(n)_c.

Notably, in the fine-tuning phase, except for the shared layers, the rest of Sub-network I only participates in forward propagation and its parameters are fixed.
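The pre-training objective above is standard soft-max cross-entropy over the 41 subcategories. A minimal numerical sketch (in NumPy rather than the paper's PyTorch implementation; the batch values are invented for illustration) is:

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    # labels: integer subcategory indices, one per image in the mini-batch.
    # Equivalent to the one-hot form of the loss above.
    p = softmax(logits)
    n = logits.shape[0]
    return -np.log(p[np.arange(n), labels]).mean()

# Toy mini-batch: 4 images, C = 41 subcategories (distortion type x level).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 41))
labels = np.array([0, 5, 12, 40])
loss = cross_entropy(logits, labels)
```

In the actual network this loss is applied to the soft-max output of the last FC layer and back-propagated through all of Sub-network I.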

Feature Extraction and Fusion
From Fig. 2 we can see that the representation of different distortion types varies across convolutional stages. Therefore, using only the features extracted from the last convolution is not enough to predict the quality of an image. Inspired by the idea of combining complementary features and the hierarchical feature extraction strategy of our previous work [48], we extract features from low-level, middle-level, and high-level convolutional layers as descriptors, rescaling and concatenating them. Sub-network I, pre-trained on the synthesized dataset, identifies a given image's distortion type; we find that this takes advantage of synthetic images but fails to handle authentically distorted ones (more details can be found in Sec. 4.5). We therefore model synthetic and authentic distortions with two separate CNNs and fuse the two feature sets into a unified representation for the final quality prediction: the tailored VGG-16 pre-trained on ImageNet, which contains many realistic natural images of different perceptual quality, is added to extract relevant features for authentic images. The proposed CGFA-CNN takes a raw image of size H × W × 3 as input and predicts its perceptual quality. The fused feature group is then X ∈ R^{H′ × W′ × D}, where D is the total number of channels of the hierarchical features. Sub-network II takes the fused feature group and the estimated probability vector \hat{p}^(n) as inputs.
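The rescale-and-concatenate step can be sketched as follows (NumPy; the shapes and the average-pooling rescaling are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

def pool_to(feat, h, w):
    # feat: (C, H, W). Average-pool each channel down to (h, w); H and W are
    # assumed to be integer multiples of h and w for this illustration.
    C, H, W = feat.shape
    return feat.reshape(C, h, H // h, w, W // w).mean(axis=(2, 4))

def fuse(features, h, w):
    # Rescale every level to a common spatial size and stack along channels,
    # yielding the unified H' x W' x D descriptor group described above.
    return np.concatenate([pool_to(f, h, w) for f in features], axis=0)

low = np.random.rand(64, 56, 56)    # low-level conv features
mid = np.random.rand(128, 28, 28)   # middle-level conv features
high = np.random.rand(256, 14, 14)  # high-level conv features
X = fuse([low, mid, high], 14, 14)  # (64 + 128 + 256, 14, 14)
```

Each spatial position of X then serves as one D-dimensional local descriptor for the aggregation layer described next.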

Feature Aggregation Layer and Encoding
In this paper, we design a feature aggregation layer that employs Fisher Vectors (FV) [43] to perform the feature aggregation and encoding procedures. Because standard GMM fitting [49] and FV encoding are non-differentiable and thus do not admit theoretically valid backpropagation, we define an FV layer that yields a quality-aware feature vector f. The implementation is shown in Fig. 3.
As illustrated in Fig. 1, the fused feature group is treated as a set of D-dimensional local descriptors X = {x_1, ..., x_N} extracted at N spatial locations. We then utilize a GMM to obtain the cluster centres of K components and the encoding vector f of the image descriptors X.

GMM clustering. A Gaussian mixture model p(x|θ) is a mixture of K multivariate Gaussian distributions [49]:

  p(x|θ) = Σ_{k=1}^{K} π_k N(x; μ_k, Σ_k),

where θ is the vector of parameters of the model. For each Gaussian component, π_k is the prior probability, μ_k the mean, and Σ_k the diagonal covariance matrix. The parameters are learnt from a training set of descriptors x_1, ..., x_N. The GMM defines the assignments q_{ki} (k = 1, ..., K; i = 1, ..., N) of the N descriptors to the K Gaussian components:

  q_{ki} = π_k N(x_i; μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_i; μ_j, Σ_j).

Fisher encoding. Fisher encoding captures both the 1st-order and 2nd-order differences between the image descriptors and the centres of a GMM. The construction of the encoding begins by learning a GMM model θ. For each k = 1, ..., K, define the vectors

  u_k = (1 / (N √π_k)) Σ_{i=1}^{N} q_{ki} (x_i − μ_k) / σ_k,
  v_k = (1 / (N √(2π_k))) Σ_{i=1}^{N} q_{ki} [((x_i − μ_k) / σ_k)^2 − 1].

The Fisher encoding of the set of local descriptors is then given by the concatenation of u_k and v_k for all K components, giving an encoding of size 2KD. In order to integrate the Fisher vector as a differentiable module in a neural network, we replace the hard assignment of descriptor x_i to cluster k with the soft assignment

  a_k(x_i) = exp(−α ||x_i − c_k||^2) / Σ_{j=1}^{K} exp(−α ||x_i − c_j||^2).

We can then write the FV representation as

  FV1(j, k) = Σ_{i=1}^{N} a_k(x_i) (x_i(j) − c_k(j)) / σ_k(j),
  FV2(j, k) = Σ_{i=1}^{N} a_k(x_i) [(x_i(j) − c_k(j))^2 / σ_k(j)^2 − 1],

where FV1 and FV2 capture the 1st-order and 2nd-order statistics respectively, x_i(j) is the j-th dimension of the i-th descriptor, and c_k(j) is the j-th dimension of the k-th cluster centre. Here c_k and σ_k (k ∈ [1, K]) are the learnable cluster centres and their diagonal covariances, and α is a positive parameter.
Let ω_k = 2αc_k and b_k = −α||c_k||^2; the soft assignment can then be written as

  a_k(x_i) = exp(ω_k^T x_i + b_k) / Σ_{j=1}^{K} exp(ω_j^T x_i + b_j),

where {ω_k}, {b_k}, and {c_k} are sets of trainable parameters for each cluster k.

Beyond the FV aggregation. The source of discontinuities in the traditional Bag-of-visual-words (BOW) [41] and Vector of Locally Aggregated Descriptors (VLAD) [42] is the hard assignment q_{ki} of descriptors x_i to cluster centres c_k. To make this operation differentiable, we likewise replace the hard assignment with the soft assignment a_k(x_i), obtaining differentiable BOW and VLAD representations, which we denote as the BOW layer and the VLAD layer respectively:

  BOW(k) = Σ_{i=1}^{N} a_k(x_i),
  VLAD(j, k) = Σ_{i=1}^{N} a_k(x_i) (x_i(j) − c_k(j)),

where a_k(x_i) denotes the membership of descriptor x_i in cluster k. BOW is the histogram of the number of image descriptors assigned to each visual word, so it produces a K-dimensional vector, while VLAD is a simplified non-probabilistic version of the FV and produces a D × K-dimensional vector. The soft assignment a_k(x_i) can be regarded as a two-step process: (i) performing a 1 × 1 convolution with a set of K filters ω_k and biases b_k, producing the output ω_k^T x_i + b_k; (ii) applying a soft-max function to obtain the soft assignment of descriptor x_i to cluster k. Notably, for BOW encoding there is no need to store the sum of residuals for each visual word, i.e., the difference vectors between the descriptors and their corresponding cluster centre.
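As a concreteness check, the differentiable FV aggregation with this soft assignment can be sketched in NumPy (a single forward pass with made-up shapes; in the network itself c_k and σ_k are learnable parameters updated by backpropagation):

```python
import numpy as np

def netfv(X, centers, sigma, alpha=1.0):
    # X: (N, D) local descriptors; centers, sigma: (K, D) cluster centres
    # and their diagonal deviations. The soft assignment replaces the hard
    # GMM posteriors so the whole layer is differentiable.
    w = 2.0 * alpha * centers                   # omega_k = 2*alpha*c_k, (K, D)
    b = -alpha * (centers ** 2).sum(axis=1)     # b_k = -alpha*||c_k||^2, (K,)
    logits = X @ w.T + b                        # (N, K)
    logits -= logits.max(axis=1, keepdims=True) # stabilize the soft-max
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)           # soft assignments a_k(x_i)
    diff = X[:, None, :] - centers[None, :, :]  # (N, K, D) residuals
    fv1 = (a[:, :, None] * diff / sigma).sum(axis=0)                 # 1st order
    fv2 = (a[:, :, None] * (diff ** 2 / sigma ** 2 - 1.0)).sum(axis=0)  # 2nd order
    return np.concatenate([fv1.ravel(), fv2.ravel()])  # length 2*K*D

X = np.random.rand(100, 8)      # 100 descriptors of dimension D = 8
centers = np.random.rand(4, 8)  # K = 4 clusters
sigma = np.ones((4, 8))
f = netfv(X, centers, sigma)    # fixed length 2*4*8 = 64, for any N
```

Note that the output length depends only on K and D, which is what lets the layer accept feature maps of arbitrary spatial size.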
The advantage of the BOW aggregation is that it aggregates the descriptors into a more compact representation, and fewer parameters are trained in a discriminative manner, including only {ω_k} and {b_k}. The drawback is that significantly more clusters are needed to obtain a rich representation. VLAD computes the 1st-order residuals between the descriptors and the cluster centres, making the representation reasonably rich with a moderate number of parameters to learn, namely {ω_k}, {b_k}, and {c_k}. In contrast, the FV aggregation concatenates both the 1st-order and 2nd-order aggregated residuals, but many more parameters need to be learned, including {ω_k}, {b_k}, {c_k}, and {σ_k}.
In Sec. 4.5, we also experiment with average and maximum pooling of the image descriptors X. The results will show that FV is superior to the reference BOW and VLAD approaches, and that simply using average or maximum pooling results in poor performance.

Classification-guided Gating Unit and Quality Prediction
We have pre-trained Sub-network I to identify the distortion type of the input, and Sub-network II takes the estimated probability vector \hat{p} from Sub-network I as partial input. To introduce this prior classification information, a classification-guided gating unit (CGU) is utilized to emphasize informative features and suppress less useful ones. The CGU combines \hat{p} and f to produce a gated vector \hat{f}:

  \hat{f} = f ⊙ σ(W\hat{p} + b),

where σ is the gating activation, (W, b) are the learnable parameters, and ⊙ denotes element-wise multiplication. A mapping is then applied to yield an overall quality score q; to increase nonlinearity, two fully connected layers are used for this mapping. For Sub-network II, the L1 function is used as the empirical loss:

  ℓ = (1/N) Σ_{i=1}^{N} |q_i − \hat{q}_i|,

where q_i is the MOS of the i-th image in a mini-batch and \hat{q}_i is the quality score predicted by CGFA-CNN.
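One plausible reading of the gating step, sketched in NumPy with a sigmoid as the gate activation (an assumption, as are the toy dimensions and random parameters), is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cgu(f, p_hat, W, b):
    # Gate the aggregated feature f with weights derived from the estimated
    # distortion-type probabilities p_hat: informative dimensions keep a
    # gate value near 1, less useful ones are scaled toward 0.
    g = sigmoid(W @ p_hat + b)  # gate vector, same length as f
    return f * g                # element-wise product f (.) g

D, C = 64, 41                   # feature length and number of classes (toy)
f = np.random.rand(D)           # aggregated FV feature
p_hat = np.full(C, 1.0 / C)     # estimated class probabilities
W = np.random.rand(D, C) * 0.1  # learnable gate parameters
b = np.zeros(D)
f_gated = cgu(f, p_hat, W, b)
```

Because the gate lies in (0, 1), each dimension of the gated feature is attenuated rather than amplified, consistent with the "suppress less useful features" role described above.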

Database Description and Experimental Protocol

1) IQA databases: The experiments are conducted on three singly distorted synthetic IQA databases, i.e., LIVE [50], CSIQ [51], and TID2013 [52], and on the authentic LIVE Challenge database [53]. TID2013 contains 24 distortion types: additive Gaussian noise, additive noise in color components, spatially correlated noise, masked noise, high-frequency noise, impulse noise, quantization noise, Gaussian blur, image denoising, JPEG compression, JPEG2000 compression, JPEG transmission errors, JPEG2000 transmission errors, non-eccentricity pattern errors, local block-wise distortions, mean shift, contrast change, change of color saturation, multiplicative Gaussian noise, comfort noise, lossy compression of noisy images, color quantization with dither, chromatic aberrations, and sparse sampling and reconstruction, denoted #01 to #24 respectively.

2) Evaluation criteria: Two evaluation criteria are adopted to benchmark BIQA models:

• Spearman's rank-order correlation coefficient (SRCC) is a nonparametric measure:

  SRCC = 1 − 6 Σ_{i=1}^{I} d_i^2 / (I(I^2 − 1)),

where I is the number of test images and d_i is the rank difference between the MOS and the model prediction of the i-th image.

• Pearson linear correlation coefficient (PLCC) measures the linear correlation:

  PLCC = Σ_{i=1}^{I} (q_i − q̄)(\hat{q}_i − \bar{\hat{q}}) / sqrt(Σ_{i=1}^{I} (q_i − q̄)^2 · Σ_{i=1}^{I} (\hat{q}_i − \bar{\hat{q}})^2),

where q_i and \hat{q}_i stand for the MOS and the model prediction of the i-th image, respectively, and q̄ and \bar{\hat{q}} are their means.

For the synthetic databases LIVE, CSIQ, and TID2013, we divide the distorted images into two splits with non-overlapping content: 80% are used as fine-tuning samples and the remaining 20% are left for testing. For the LIVE Challenge database, the distorted images are likewise divided into two groups, 80% for training and 20% for testing. This random process is repeated ten times, and the average SRCC and PLCC are reported as the final results. Besides, the three synthetic databases are used for cross-database experiments, with one database serving as the training set and another as the test set.
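For reference, both criteria can be computed directly from their definitions (NumPy sketch; the toy MOS and prediction values are invented, and the rank-difference SRCC formula assumes no ties):

```python
import numpy as np

def srcc(mos, pred):
    # Spearman's rank-order correlation via the rank-difference formula.
    rank = lambda v: np.argsort(np.argsort(v))
    d = rank(mos) - rank(pred)
    n = len(mos)
    return 1.0 - 6.0 * float((d ** 2).sum()) / (n * (n ** 2 - 1))

def plcc(mos, pred):
    # Pearson linear correlation coefficient of MOS vs. prediction.
    m, p = mos - mos.mean(), pred - pred.mean()
    return float((m * p).sum() / np.sqrt((m ** 2).sum() * (p ** 2).sum()))

mos = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
pred = np.array([12.0, 18.0, 33.0, 39.0, 52.0])
# pred preserves the ranking of mos exactly, so SRCC = 1 here.
```

In practice, a nonlinear (e.g. logistic) mapping is often fitted before computing PLCC to account for the nonlinearity of subjective ratings; the plain form above matches the definition given in the text.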
We compare the proposed CGFA-CNN against several state-of-the-art BIQA methods, including three based on NSS (BRISQUE [19], M3 [20], FRIQUEE [21]), two based on codebook feature learning (CORNIA [22], HOSA [23]), and eight based on CNNs (BIECON [30], dipIQ [31], deepIQA [29], ResNet50+ft [34], MEON [37], DIQA [26], TSCN [25], and DB-CNN [38]). Because the source code of some methods is not publicly available, we copy their results from the corresponding papers. Parameters of Sub-network I are initialized by He's method [54], and Adam with the default parameters is adopted as the optimizer with a mini-batch size of 64. The learning rate decays logarithmically from 10^−4 to 10^−6 over 30 epochs. The construction of the pre-training dataset has been described in Sec. 3.1; it is randomly divided into two subsets, 80% for training and 20% for testing. All images are first scaled to 256 × 256 × 3 and then cropped to 224 × 224 × 3 as inputs. The top-1 and top-5 errors are 3.842% and 0.026% respectively.

Experimental Settings
In the fine-tuning phase, the shared layers are directly initialized with the parameters of Sub-network I. Adam with the default parameters is used as the optimizer for 20 epochs, and the learning rate is set to 10^−5. Except for the LIVE database, images are input without any pre-processing during training, with a mini-batch size of 8. Since the LIVE database contains images of different sizes, its images are randomly cropped to 320 × 320 during training in a mini-batch, and each crop is assigned the quality annotation of the corresponding image. All images are input without any preprocessing during testing. We implemented all of our models using the PyTorch 0.4.1 deep learning framework, and the numerical calculations in this paper were carried out on the supercomputing system of the Supercomputing Center of Wuhan University.
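The random-crop step used for the variable-sized LIVE images can be sketched as follows (NumPy; the image shape is a made-up example, and a real pipeline would use the framework's transform utilities instead):

```python
import numpy as np

def random_crop(img, size=320):
    # Randomly crop a size x size patch; the crop inherits the whole image's
    # MOS label, as described for LIVE training above.
    H, W, _ = img.shape
    y = np.random.randint(0, H - size + 1)
    x = np.random.randint(0, W - size + 1)
    return img[y:y + size, x:x + size, :]

img = np.zeros((480, 640, 3), dtype=np.uint8)  # toy image
patch = random_crop(img)                       # (320, 320, 3)
```

At test time no crop is taken: the FV layer's fixed-length output lets the network consume the full-size image directly.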

Consistency Experiment
We investigate the effectiveness of CGFA-CNN on the LIVE, TID2013, CSIQ, and LIVE Challenge databases; the results are presented in Table 2. The results for each specific distortion type on the LIVE, CSIQ, and TID2013 databases are reported in Tables 3, 4, and 5. The top three SRCC and PLCC results are highlighted in red, green, and blue, respectively. From Table 2, we can make the following observations. First, DIQA [26] achieves state-of-the-art accuracy, surpassing CGFA-CNN by about 0.004 in SRCC and PLCC, and most methods perform strongly on LIVE; however, their results on CSIQ and TID2013 are rather diverse. Second, CGFA-CNN achieves accuracy on LIVE Challenge comparable to DB-CNN [38] and ResNet50+ft [34], which are pre-trained on the ImageNet [28] database. This suggests that CNNs pre-trained on ImageNet [28] can extract relevant features for authentically distorted images.
Performance on individual distortion types on LIVE, CSIQ, and TID2013 are shown in Tables 3, 4 and 5. On LIVE, we also find that CGFA-CNN is superior to

Cross-database Experiment
To analyze the generalization ability of the proposed method, we train CGFA-CNN on one full database and evaluate it on another. For example, a model is trained on CSIQ and evaluated on either LIVE or TID2013. Results are reported in Table 6. It can be concluded that CGFA-CNN generalizes relatively easily to distortions that have not been seen during training.

Comparison Among Different Experimental Settings
In this section, we first investigate the performance of the different feature aggregation layers considered in this paper and the number of GMM components K. Experiments are conducted on LIVE and the results are shown in Fig. 4. We observe that SRCC gradually increases and eventually stabilizes as K increases. Moreover, CGFA-CNN FV, CGFA-CNN VLAD, and CGFA-CNN BOW attain highly competitive prediction accuracy when K is set to 32, 64, and 1024, respectively; overall, CGFA-CNN FV is superior to CGFA-CNN VLAD and CGFA-CNN BOW. Additionally, we report ablation studies to evaluate the design rationality of CGFA-CNN, with the following comparative experiments: (1) to evaluate the effectiveness of the proposed FV layer, we use maximum pooling (denoted CGFA-CNN (MaxPool)) and average pooling (denoted CGFA-CNN (AvgPool)) instead; (2) to examine the validity of the CGU, we predict the quality score directly by regressing the output feature vector without the CGU (denoted CGFA-CNN (w/o CGU)); (3) to verify the necessity of hierarchical feature extraction, we extract features only from the high-level convolutional layers (Conv 5-2 of the shared layers and Conv 4-3 of VGG-16) as descriptors (denoted CGFA-CNN (single feature)); (4) to discuss the optimal settings of the feature aggregation layer, we set BOW with K = 1024 (denoted CGFA-CNN (BOW layer (K = 1024))), VLAD with K = 64 (denoted CGFA-CNN (VLAD layer (K = 64))), and FV with K = 32 (denoted CGFA-CNN (proposed)); (5) to demonstrate the prediction accuracy on authentic distortions gained by involving VGG-16, we include only Sub-network I pre-trained on the self-built dataset to extract features (denoted CGFA-CNN (w/o VGG-16)). The results are shown in Table 7. We empirically find that the proposed CGFA-CNN achieves state-of-the-art prediction accuracy on both synthetic and authentic distortion image quality databases.
Moreover, CGFA-CNN (w/o VGG-16) delivers promising performance only on the synthetic databases, and its results on LIVE Challenge are inferior to CGFA-CNN (proposed), suggesting that authentic distortions cannot be fully fitted by synthetic distortions.

CONCLUSION
In this work, we propose an end-to-end learning framework for BIQA based on classification guidance and feature aggregation, named CGFA-CNN. In the fine-tuning phase, except for the shared convolutional layers, the rest of Sub-network I only participates in forward propagation and its parameters are fixed. The fused feature group is aggregated and encoded by the FV layer to obtain a Fisher vector, which is then gated by the CGU to obtain a quality-aware feature that the regression model maps to a quality score. In the test phase, only forward propagation is required to obtain the quality score. The results on the four public IQA databases demonstrate that the proposed method indeed benefits image quality assessment. However, CGFA-CNN is not a unified learning framework because it takes two steps, pre-training and fine-tuning. A promising future direction is to optimize CGFA-CNN for distortion identification and quality prediction at the same time.