No-Reference Image Quality Assessment Based on Image Multi-Scale Contour Prediction

Abstract: Accurately assessing image quality is a challenging task, especially without a reference image. Currently, most no-reference image quality assessment methods still require reference images in the training stage, but reference images are usually not available in real scenes. In this paper, we propose a model named MSIQA, inspired by biological vision and built on a convolutional neural network (CNN), which does not require reference images in either the training or the testing phase. The model contains two modules: a multi-scale contour prediction network that simulates the contour response of the human optic nerve to images at different distances, and a central attention peripheral inhibition module inspired by the receptive field mechanism of retinal ganglion cells. Training has two steps. In the first step, the multi-scale contour prediction network learns to predict the contour features of images at different scales. In the second step, the model combines the central attention peripheral inhibition module to learn to predict the quality score of the image. In the experiments, our method achieved excellent performance: the Pearson linear correlation coefficient of the MSIQA model on the LIVE database reached 0.988.


Introduction
Images are an important source of information for human perception and machine recognition [1][2][3]. For machines to have visual perception, the equipment not only needs to be capable of predictive maintenance [4] but also needs to capture high-quality images. Image quality plays a decisive role in the sufficiency and accuracy of the acquired information. However, images are inevitably distorted during acquisition, compression, processing, transmission, and display. Measuring image quality and judging whether an image meets a specific requirement therefore becomes a problem, and solving it requires an effective image quality assessment (IQA) system. At present, IQA methods can be divided into subjective and objective evaluation methods. The former relies on the subjective perception of experimenters to evaluate quality. The latter uses quantitative indicators given by a model to simulate the perception mechanism of the human visual system. According to the class of images, IQA can also be divided into facial image quality [5,6], synthetic image quality, and so on. Below, we introduce IQA from the perspective of model improvement.
Objective image quality assessment is valuable in practice. It can provide feedback for optimizing denoising algorithms, support early evaluation and preprocessing of image data for computer vision tasks, and even indirectly reflect the quality of the shooting equipment. According to whether a reference image is needed, objective image quality assessment is divided into full-reference image quality assessment (FR-IQA), reduced-reference image quality assessment (RR-IQA), and no-reference image quality assessment (NR-IQA). The FR-IQA [7][8][9][10] method requires a distortion-free reference image and compares the information or feature similarity of the two images to obtain the evaluation result of the distorted image. The RR-IQA [11][12][13] method relies on only part of the characteristic information of the reference image. The NR-IQA method evaluates the quality of distorted images directly. Although some NR-IQA methods do not need reference images in the testing phase, they still need them in the training phase [14,15]. According to the type of distortion, NR-IQA methods are divided into distortion-specific methods and general-purpose methods. Classical methods are based on natural scene statistics (NSS) [10,16,17,18,19], the transform domain [9,20], gradient features [17], and unsupervised learning [21,22].
Since 2014, most NR-IQA methods have adopted CNN-based models, and researchers have continually modified and deepened the model structure. A CNN is a simulation of the biological visual system. With research on the physiology and anatomy of the biological optic nerve, an increasing number of scholars have begun to use mathematical models to reveal the processing mechanism of visual information. Inspired by biological vision, we simulated the mechanisms of the biological optic nerve and receptive field and proposed a two-stage training method that does not require reference images. The method was tested on the LIVE [23] and TID2013 [24] data sets. The innovations of this article are as follows: (1) Using multi-scale contour features as the first-stage regression target to address the problem of too few data sets. (2) Designing different learning labels for different layers of the model to simulate how human eyes evaluate images at different distances. (3) Designing a central attention peripheral inhibition module to simulate the mechanism of the receptive field of retinal ganglion cells.
The following sections of the paper are organized as follows. Section 2 introduces the current status of the CNN-based NR-IQA. Section 3 details the framework of the model proposed in this paper. Section 4 presents the test results of the model. Section 5 concludes the paper.

Related Work
Current NR-IQA methods based on convolutional neural networks (CNNs) are divided into image-based and patch-based methods according to the input image [25]. In the early years, most methods were patch-based in order to increase the amount of training data.
In 2014, Kang et al. [26] used CNN for the NR-IQA for the first time. The author first normalized the image, and then divided it into 32 × 32 non-overlapping image patches, used the CNN network to estimate the quality score of each image patch, and the final image quality score was the average score of all image patches. The CNN network used in this method has one convolutional layer with max and min pooling, two fully connected layers and an output node. Although this method has better results than traditional manual feature extraction methods, it has the following shortcomings when the distortion types are complex and diverse: (1) It is unreasonable to use the average of the quality scores of all image patches as a quality score for the entire image. (2) It is unreasonable to use the global subjective score as the image local quality score for training.
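The patch-averaging scheme described above can be sketched as follows. This is a minimal illustration rather than the implementation of Kang et al.: the image is assumed to be a plain 2-D list of gray values, and `patch_scorer` is a hypothetical stand-in for the trained CNN that scores one patch.

```python
def split_into_patches(image, patch=32):
    """Split a 2-D list of pixel values into non-overlapping patch x patch tiles.

    Rows and columns that do not fill a whole tile are dropped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches


def average_patch_score(image, patch_scorer, patch=32):
    """Score every tile with patch_scorer and return the plain average."""
    scores = [patch_scorer(tile) for tile in split_into_patches(image, patch)]
    if not scores:
        raise ValueError("image is smaller than one patch")
    return sum(scores) / len(scores)
```

Because every patch contributes equally to the average, a heavily distorted region and a flat background region carry the same weight, which is the first shortcoming listed above.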
To solve problem 1, Bosse et al. [27] proposed a method that includes a weight estimation module. During the training stage, a sub-network is used to learn the weights of image patches. The method of Bosse et al. [27] used a deeper and more complex neural network structure than the method of Kang et al. [26]. Therefore, the network learned more image features, and its performance improved. However, as the network deepens, the problem of too few training data sets becomes more serious. As in the previous method, the network input was a 32 × 32 image patch in order to increase the training data, and the quality score of each image patch was still the quality score of the entire image.
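The idea of weighting patches instead of averaging them uniformly can be sketched as below. This is an illustration under stated assumptions, not the code of Bosse et al.: the per-patch scores and non-negative weights are assumed to be already produced by the two network branches, and the small constant `eps` is an assumed guard against a zero denominator.

```python
def weighted_image_score(patch_scores, patch_weights, eps=1e-8):
    """Pool per-patch scores into one image score with per-patch weights."""
    if len(patch_scores) != len(patch_weights):
        raise ValueError("scores and weights must have the same length")
    if any(w < 0 for w in patch_weights):
        raise ValueError("weights are assumed to be non-negative")
    total_weight = sum(patch_weights) + eps
    weighted_sum = sum(s * w for s, w in zip(patch_scores, patch_weights))
    return weighted_sum / total_weight
```

With equal weights this reduces to the plain average; a patch with weight zero is ignored entirely.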
For the purpose of solving problem 2, many researchers have proposed a method of first generating the local quality score of the distorted image as the one-stage regression target. In 2017, Kim proposed a two-stage method (BIECON) [14]. The first step is to use the FR-IQA method to obtain the local quality score and use the local quality score as the target label of the CNN model to predict the quality score of image patches. In the second step, the subjective quality score of the distorted image was used as the target label, and all model parameters were optimized at the same time. In spite of the fact that the BIECON method solved the unreasonable problem of using the subjective score of the entire image as the quality score label of the image patch, the local quality score of the distorted image generated by the FR-IQA method has an error in itself, and this method must use the reference image.
The root cause of the reliance on patch-based methods is that there are too few data sets. Therefore, many researchers have proposed pre-training CNNs on data sets from other fields, for example, DeepBIQ [28] proposed by Simone Bianco et al. In addition to pre-training, Liu et al. proposed the RankIQA [29] method based on the idea of learning to rank: although it is difficult to directly estimate the quality score of a distorted image, it is relatively easy to compare the relative quality of images with different degrees of distortion. In 2019, the author of the BIECON method proposed the DIQA [15] method. This is still a two-stage method, but it no longer uses the subjective quality score as the regression target and instead uses the objective error map as the intermediate CNN learning target.
In addition to the aforementioned method of predicting image quality scores, Hossein et al. proposed the NIMA [30] method. This method no longer trains the network to predict the quality score of the image but predicts the distribution of human quality scores of the image.
In short, the lack of IQA data sets seriously constrains the structural design of CNNs, and this paper proposes a two-stage method to address this problem.

Approach
The overall framework of the MSIQA is shown in Figure 1. In the first stage of training, the multi-scale contour prediction network is trained to predict the contour features of images at different scales in scale space. In the second stage of training, the MSIQA model combines the central attention peripheral inhibition module to learn to predict the quality score of the image.

Model Architecture
The MSIQA model consists of two main modules: (1) a multi-scale contour prediction network to simulate the response of human eyes on an image's contours at different distances, and (2) a central attention peripheral inhibition module to simulate the mechanism of the receptive field of retinal ganglion cells. We use four inception [31][32][33] modules with the same structure to build the contour prediction network. Each layer has a batch normalization (BN) [34] and a rectified linear unit (ReLU) [35]. After the inception module, the PixelShuffle [36] method is used to upscale the input feature to the same size as the input image. In the second stage of training, the outputs of different inception modules are first fused and combined with the central attention peripheral inhibition module, then fed into the convolutional layer and two fully connected layers.
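The PixelShuffle rearrangement used for upscaling can be illustrated with a small pure-Python sketch. It assumes the common channel ordering in which `r * r` consecutive input channels form one `r × r` block of an output channel; the upscale factor `r` and the nested-list layout are assumptions made for the example, since the text does not state them.

```python
def pixel_shuffle(features, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in = len(features)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    height, width = len(features[0]), len(features[0][0])
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):          # row offset inside each r x r block
            for j in range(r):      # column offset inside each r x r block
                src = features[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out
```

The operation only moves values from the channel dimension into the spatial dimensions, so it adds no parameters of its own.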

Multi-Scale Contour Features
We believe that the sharpness of the edge contours of an image is an important feature affecting image quality. Distorted images with the same distortion type can have different degrees of distortion, and the contour features of the images in the image scale space can simulate images with different degrees of distortion. Multi-scale features are also used to simulate the contour response of the retina to images at different distances. We therefore train the model in the first stage to predict the contour features of images at different scales.
The scale-space of an image is the convolution of the image with a Gaussian function of variable scale. The two-dimensional Gaussian function is:

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right)$$

The scale-space of an image $I(x, y)$ is:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

where $*$ denotes convolution over $x$ and $y$ and $\sigma$ is the scale factor. Subtracting the scale-space images of adjacent scales yields the multi-scale contour features, and these differences define the contour feature ground truth. Figure 4 is the same as Figure 3.

In the first stage of training, the contour prediction network learns to predict contour features, and the loss function is defined by the mean square error between the predicted value and the ground truth. Here $h_\theta(I_i)$ is the contour feature of the image $I_i$ predicted by the model, $\theta$ denotes the parameters of the contour prediction network, and $m$ is the exponent number. In our experiment, we choose $m = 0.5$.
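The construction of contour targets from a Gaussian scale space can be sketched in pure Python as follows. This is an illustration under assumptions, not the authors' code: the scale values, the kernel radius of three standard deviations, the replicated borders, and the sign convention (coarser scale minus finer scale) are all chosen for the example, and the exponent m from the text is not modeled.

```python
import math


def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _clamp(value, upper):
    """Clamp an index to [0, upper] to replicate the border pixels."""
    return min(max(value, 0), upper)


def gaussian_blur(image, sigma):
    """Separable Gaussian blur of a 2-D list (one scale-space level)."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    offsets = range(-radius, radius + 1)
    # Horizontal pass, then vertical pass over the intermediate result.
    rows = [[sum(kernel[k + radius] * image[y][_clamp(x + k, width - 1)]
                 for k in offsets) for x in range(width)] for y in range(height)]
    return [[sum(kernel[k + radius] * rows[_clamp(y + k, height - 1)][x]
                 for k in offsets) for x in range(width)] for y in range(height)]


def contour_targets(image, sigmas):
    """Differences between adjacent scale-space levels (coarser minus finer)."""
    levels = [gaussian_blur(image, s) for s in sigmas]
    return [[[c - f for f, c in zip(fine_row, coarse_row)]
             for fine_row, coarse_row in zip(fine, coarse)]
            for fine, coarse in zip(levels, levels[1:])]
```

A flat image produces contour maps that are zero everywhere, while a step edge produces a response concentrated around the edge, which is why these differences can serve as contour features.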

Quality Score Prediction
In the second step of training, the central attention peripheral inhibition module combines the brightness information of the image to weight the multi-scale contour features learned in the first stage. The module adopts a double Gaussian difference model composed of two parts: attention is strong at the center of the image and weakened toward the edges, which simulates the different attention and dwell times of humans in different areas of the image. In this distribution, $k_c$ is the central attention enhancement coefficient and $k_p$ is the peripheral inhibition coefficient. Because the optic nerve has different sensitivity to images of different brightness, brightness information must be added when considering the attention paid to different areas of the image. We normalized the overall image brightness and increased the quality score weight of image blocks with high brightness.
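A center-enhanced, periphery-inhibited weight map can be sketched as below. The exact distribution used in the model is not reproduced here; the sketch assumes a conventional difference-of-Gaussians form, a central Gaussian scaled by k_c minus a wider Gaussian scaled by k_p, evaluated on coordinates normalized to [-1, 1]. The widths `sigma_c` and `sigma_p` and all default values are hypothetical.

```python
import math


def center_surround_weights(height, width, k_c=1.0, k_p=0.3,
                            sigma_c=0.25, sigma_p=0.6):
    """Difference-of-Gaussians weight map centered on the image (assumed form)."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy, sx = max(cy, 1.0), max(cx, 1.0)  # scale coordinates to [-1, 1]
    weights = []
    for y in range(height):
        row = []
        for x in range(width):
            d2 = ((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2
            center = k_c * math.exp(-d2 / (2.0 * sigma_c ** 2))
            surround = k_p * math.exp(-d2 / (2.0 * sigma_p ** 2))
            row.append(center - surround)
        weights.append(row)
    return weights
```

With these assumed parameters the weight is largest at the image center and becomes slightly negative toward the corners, which mirrors the strong central attention and peripheral inhibition described above.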
The MSIQA model then learns to predict the image quality score. In the loss function, $h_\theta(I_i)$ is the image quality score of the image $I_i$ predicted by the model, $\theta$ denotes the parameters of the CNN, and $S$ is the ground-truth subjective score of the input image $I_i$.

Training
Because there are fully connected layers in the MSIQA model, the input size of the network must be unified. We have tested the effect of different sizes on the performance of the model, and the results are given in Section 4.
In the first stage of training, 80% of the images in the data set are randomly selected for training. First, the image is cropped into image patches of uniform size, and then the four-scale space images of each image patch are fed to the network for training. In the second stage of training, 80% of the images are randomly selected for training, and the image patches are directly fed to the network for training.
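The random 80% selection can be sketched as follows. The split is made at the image level before any cropping, so that patches from one image never fall on both sides. Using the remaining 20% for testing and fixing a random seed are assumptions of this sketch; `image_ids` is a hypothetical list of image identifiers.

```python
import random


def split_dataset(image_ids, train_fraction=0.8, seed=0):
    """Randomly split image identifiers into a training part and a held-out part."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # local generator keeps the split reproducible
    cut = int(round(train_fraction * len(ids)))
    return ids[:cut], ids[cut:]
```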

Multi-Task Model
Humans have various intuitive perceptions of different types of distortions. The types of distortions affect humans' evaluation of image quality to a certain extent. Therefore, from the point of view of IQA, the detection of the distortion type is also of certain significance. At the same time, additional feature information of the distortion type is added to more strongly constrain the model and reduce the risk of overfitting.
The multi-task learning model adopts hard parameter sharing and the basic structure of the MSIQA model proposed in Section 3.1. Task one is IQA, and task two is the classification of image distortion types. The overall framework of the model is shown in Figure 5.
Figure 5. The overall framework of the model.
The image convolution operation can only obtain the relationship between local channels, whereas the network should learn important feature information from different feature channels. Following the idea proposed by Hu et al. [37], a feature channel weight module is added to the model, and its framework is shown in Figure 6. The input is first compressed from size C × W × H to C × 1 × H, where x is the input and W is the width of the input. After size compression, the channel weights of the input are obtained through two convolutional layers and then multiplied with the input.
The loss weights of the two tasks are set by a dynamic weighting method. The loss function combines the losses of the two tasks using a weight ω, which is computed from $L_i(t)$, the loss of task i in step t, and a constant T.
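The weighting formula itself is not recoverable from the text. One widely used scheme that is consistent with the symbols L_i(t) and T is dynamic weight averaging, in which each task weight is a softmax over the recent rate of change of that task's loss. Whether the multi-task model uses exactly this scheme is an assumption; the sketch below only illustrates the general idea, and the default temperature is hypothetical.

```python
import math


def dynamic_task_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Task weights from loss ratios L_i(t-1) / L_i(t-2), softmax-normalized.

    The weights sum to the number of tasks; a task whose loss is falling
    more slowly receives a larger weight.
    """
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [len(ratios) * e / total for e in exps]


def combined_loss(task_losses, weights):
    """Weighted sum of the per-task losses."""
    return sum(w * loss for w, loss in zip(weights, task_losses))
```

A larger temperature makes the weights more uniform, so the constant controls how strongly the training favors the slower task.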

Database and Evaluation Metrics
TID2013 and LIVE are the current mainstream databases for image quality evaluation. These databases provide the subjective score for each distorted image. The LIVE [23] database images are color images of different sizes, with 29 reference images, including five common types of distortion: additive white Gaussian noise (WN), Gaussian blur (GB), JPEG compression and JPEG2000 compression (JP2K) and fast-fading (FF). The image size of the TID2013 [24] database is 512 × 384, including 24 distortion types, each of which has five different degrees. The summary of each database is tabulated in Table 1 [23,24].
The performance of an IQA algorithm depends on the correlation between the subjective scores and the predicted scores: the higher the correlation, the better the algorithm. We used two standard measures, the Spearman rank-order correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC). The SRCC is defined as:

$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^{2}}{N (N^{2} - 1)}$$

where $d_i$ is the difference between the rank of the ith image in the predicted scores and its rank in the ground-truth scores, and $N$ is the number of images. The PLCC is defined as:

$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (p_i - \bar{p})(s_i - \bar{s})}{\sqrt{\sum_{i=1}^{N} (p_i - \bar{p})^{2}} \sqrt{\sum_{i=1}^{N} (s_i - \bar{s})^{2}}}$$

where $p_i$ is the predicted score of the ith image, $s_i$ is the ground-truth score of the ith image, and $\bar{p}$ and $\bar{s}$ are their respective means.
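Both measures can be computed with a few lines of standard-library Python. The sketch below computes SRCC as the Pearson correlation of the ranks, which coincides with the usual rank-difference formula when there are no tied scores; the function names are chosen for the example, and tied values are given average ranks.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def plcc(predicted, truth):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_s = sum(truth) / n
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, truth))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in truth)
    return cov / math.sqrt(var_p * var_s)


def srcc(predicted, truth):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(predicted), average_ranks(truth))
```

SRCC depends only on the ordering of the scores, so it measures monotonic agreement, whereas PLCC is sensitive to how linearly the predictions follow the subjective scores.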

Convergence Test
To validate the effect of the number of epochs in Step 1 on the performance in Step 2, we compared different numbers of training epochs in Step 1 (5, 10, 15 and 20) on the LIVE database. The results are shown in Table 2. For each test, the best is shown in bold. Based on these results, 10 epochs were selected for Step 1.

Effect of Patch Size
To investigate the effect of patch size on the final prediction accuracy, we used four different patch sizes (64, 112, 224, and 384). As shown in Table 3, patch sizes of 64 and 112 show better performance in SRCC and PLCC. For each test, the best is shown in bold. Because a patch size that is too small is not conducive to the quality assessment of large images, a patch size of 112 was used in the following experiments.

Performances Comparison
We compared MSIQA with three FR-IQA methods (PSNR, SSIM [7], FSIMc [8]) and ten NR-IQA methods (BLINDS-II [9], BRISQUE [10], CORNIA [22], Kang [26], BIECON [14], Bosse [27], DeepBIQ [28], DIQA [15], Hallucinated [38], QualNet [39]). The test results of the MSIQA model on the LIVE and TID2013 data sets are shown in Table 4, and the results for different distortion types in the LIVE data set are shown in Table 5. For each test, the best two models are shown in bold. The results show that our method performs very well on the LIVE data set, whereas its performance on the TID2013 data set is weaker. We analyzed some images with large prediction errors and found two reasons for this: (1) The TID2013 data set contains non-real synthetic images, and the model was not designed for synthetic images that lack semantic information. (2) Some of the distortion types in the TID2013 data set are produced by changing the color of the image. Our model assumes that, when image details are not distorted, such distortion types have little impact on the quality of the image and mainly affect its aesthetic quality. However, the data set is manually annotated, and humans give lower scores to images with unreasonable colors.

Multi-Task Model Test
We tested the IQA and distortion type classification performance of the multi-task model proposed in Section 3.5. The multi-task model was trained on the basis of the MSIQA model. The test results with very few training epochs (1-3) are shown in Table 6. The joint tasks complement each other by sharing information and even improve the performance of IQA. We also computed the classification accuracy for each distortion type, as shown in Table 7.

Conclusions
In this paper, we propose a biological vision-based multi-scale fusion NR-IQA model named MSIQA, which simulates the mechanisms of the biological optic nerve and receptive field and adopts a two-stage training method. The MSIQA model combines image contour features, brightness, and a receptive field attention mechanism, and does not require reference images in the training or testing stages. The SRCC of the MSIQA model on the LIVE database reached 0.983. On this basis, we also propose a multi-task model that can classify distortion types at the same time. In the future, we will compress the model and increase the detection speed.