Quality Assessment on Authentically Distorted Images by Expanding Proxy Labels

In this paper, we propose a no-reference image quality assessment (NR-IQA) approach for authentically distorted images, based on expanding proxy labels. To distinguish them from human labels, we define the quality scores generated by a traditional NR-IQA algorithm as “proxy labels”. “Proxy” means that the objective results are obtained by a computer after extracting and assessing image features, rather than by human judgment. To address the limited size of image quality assessment (IQA) datasets, we adopt a cascading transfer-learning method. First, we obtain a large number of proxy labels, which denote the quality scores of authentically distorted images, by using a traditional NR-IQA method. Then, a deep network is trained on the proxy labels in order to learn IQA-related knowledge from the large number of images and their scores. Finally, we use fine-tuning to inherit the knowledge represented in the trained network. During this procedure, the learned mapping is brought closer to human visual perception. The experimental results demonstrate that the proposed algorithm performs very well compared with existing algorithms. On the LIVE In the Wild Image Quality Challenge database and the KonIQ-10k database (two standard databases for authentically distorted image quality assessment), the algorithm achieves good consistency between human visual perception and the predicted quality scores of authentically distorted images.


Introduction
Images have become a main carrier of information as multimedia technology develops rapidly. Nevertheless, images usually suffer from various degrees of distortion during acquisition, transmission, and storage, and under other conditions such as camera motion, which makes subsequent processing extremely difficult. To maintain, control, and improve image quality, it is important to evaluate and quantify it. In the big data era, image quality has a large impact on daily life, for example through high-definition cameras on mobile phones, aerial equipment on drones, and monitoring equipment for public transportation. If the images obtained are not clear enough, or contain noise, the user experience is bound to suffer. In scientific research, image quality is also closely associated with various kinds of work [1,2]. For example, astronomical observations require high quality images, and for medical imaging, the quality can even determine the final diagnosis. Therefore, the need for an efficient and reliable image quality assessment method is increasing.
Image quality assessment is the process of analyzing the features of an image and then evaluating the degree of its distortion. There are two categories of image quality assessment (IQA) methods, subjective methods and objective methods. Subjective IQA methods rely on the opinions of a great many viewers, making them costly, time consuming, and inefficient in practical applications. The objective methods overcome these deficiencies by using a mathematical model to create a mapping between image features and quality scores that strives to truly reflect human visual perception. Computers are used to build models that imitate the human visual system. Because their computing power can be exploited, the evaluation efficiency is greatly improved.
The objective IQA methods are usually divided into the following three categories, based on the available information of the given reference image (undistorted image): Full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA) and no-reference IQA (NR-IQA).
Among these three types, NR-IQA has the widest range of application and the most practical value, and therefore has received more attention and has become a hotspot in recent years. In early NR-IQA studies, researchers were committed to designing hand-crafted features that could discriminate distorted images and then training a regression model to predict image quality. Some NR-IQA methods have been based on a specific distortion [3,4], and commonly used prior knowledge of the distortion type. To assess image quality with no such information available, natural scene statistics (NSS)-based methods have been widely used to extract reliable features. These methods assume that natural images share certain statistics and that the occurrence of distortions changes these statistics [5][6][7][8][9]. Nevertheless, hand-crafted features have generally been designed for a specific type of distortion and remain low-level features, which leads to insufficient feature extraction and analysis.
The development of convolutional neural networks (CNN) has greatly accelerated computer vision research and application. In 2012, Krizhevsky et al. used a CNN to make significant progress in image recognition [10], which opened up greater prospects for computer vision research. With the popularity of CNNs, the structure of neural networks is getting deeper and deeper. A deeper structure means the network has a stronger ability to extract features and produces more accurate results, however, at the cost of a larger amount of training data.
The success of CNN in image recognition has attracted researchers to look for applications in IQA. When compared with the traditional methods, the application of neural networks has brought great progress to NR-IQA. Researchers have completed the NR-IQA task by training deep neural networks (DNN) [11][12][13]. Nevertheless, the problem to be solved in the above research is the lack of a large amount of sample data for the task. Especially when the network structure is deep and wide, the number of parameters increases rapidly, and the amount of training data should be increased accordingly. The labeling process for an IQA image dataset requires many human reviewers to assess each image, which is extremely labor intensive and costly. Therefore, most IQA datasets are too limited for effectively training neural networks, as shown in Table 1. Consequently, researchers focus on how to increase the amount of data.

Table 1. Number of images in different databases for image classification and image quality assessment (IQA).

Database | Number of Images
ImageNet [14] (for image classification) | More than 14 million
LIVE In the Wild Image Quality Challenge database [15] (the IQA database for authentic distortion) | 1162
TID2013 [16] (the biggest database for IQA) | 3000

The methods for expanding the amount of data can be roughly divided into two types, image segmentation inside the database and extension outside the database. The method of image segmentation inside the database expands the data volume by dividing the entire image into tiles [12][17][18][19][20][21][22]. However, its shortcomings are that the assessment of small image patches can ignore the integrity of the whole image and that it is not efficient. With regard to methods of augmenting outside the database, researchers typically process the available images, for example by flipping, changing contrast, and applying varying degrees of Gaussian blur, to ensure sufficient samples [23][24][25].
The quality of an image is mainly affected by distortion. Distortion is divided into two types according to the way it is generated, synthetic distortion and authentic distortion. Synthetic distortion refers to distortion generated during processing and transmission, such as distortion caused by image compression. Authentic distortion refers to distortion occurring in daily life, for example, camera jitter, overexposure, and underexposure. Current research on NR-IQA is mainly focused on synthetic distortion. When algorithms for synthetic distortion are applied to authentic distortion, they perform poorly. The few algorithms for authentic distortion have not performed well either, because of the limited amount of data and the fact that authentically distorted images cannot be augmented with image processing methods such as Gaussian blur.
In real life, however, authentic distortion is the main factor affecting image quality. Meanwhile, with the rapid development of mobile devices and camera equipment, as well as the eagerness of people to know whether their devices work well, the demand for evaluating the quality of authentic images has been growing rapidly. Unfortunately, the existing IQA algorithms do not perform well in everyday real-life scenes. The difficulty is that authentic distortion types are much more varied and the degree of distortion is much more complex, so methods designed for synthetic distortion underperform when applied to authentic scenes. Therefore, the development of no-reference image quality assessment methods for authentic distortion is of great significance in practical applications. However, methods for authentic distortion face the difficulty that researchers cannot expand the data by processing the available images, for example with Gaussian blur. Authentic distortions are unlike synthetic distortions, since they are generated in people's daily lives and are strongly authentic and complex. As a result, too few images are available, and DNN models are usually trained directly on an IQA database of authentic distortion. In Reference [24], researchers designed an architecture called S-CNN for synthetically distorted images. However, considering that this DNN model was not beneficial for authentic IQA databases, they selected the VGG-16 network pretrained for the classification task on ImageNet as another branch to extract relevant features for authentically distorted images. Finally, in the fine-tuning step, they tailored the pretrained S-CNN and VGG-16 [26] and introduced a bilinear pooling module for optimization of the entire model. Unfortunately, the limited size of the database restricted the performance on authentic distortion.
In this paper, an NR-IQA algorithm based on image proxy labels is proposed, mainly focusing on the assessment of authentically distorted images. To distinguish them from human labels, we define the quality scores generated by a traditional NR-IQA algorithm as "proxy labels". "Proxy" means that the objective results are obtained by a computer after extracting and assessing image features, rather than by human judgment. First, a traditional IQA algorithm for authentic distortion is applied to obtain a large number of image proxy labels. Generating the proxy labels expands the data size without processing the natural images, which preserves the authenticity of the natural image distortion. Then, we use the proxy labels to train the model, followed by fine-tuning.
The innovation lies in applying cascading transfer learning to IQA. Since VGG16 [26] is a classic model for image classification tasks, the features it extracts are related to the classification of images of natural scenes. In our work, the classification task can be regarded as the source task, and the database used for training the original VGG16, i.e., the ImageNet database, can be regarded as the source domain. The target task is IQA, and the target domain is the set of images in the IQA database for authentic distortion. The source domain and the target domain are similar to some extent, that is, ImageNet contains images whose scenes are similar to those of authentically distorted images. Therefore, we use the images in the source domain to generate proxy labels, and then train the VGG16 model and transfer it to a preliminary IQA model. After the first transfer learning, the model has the ability to assess image quality. To adequately meet the IQA task, a second transfer learning is necessary. To be consistent with the image type in the first transfer learning, we use authentically distorted images in the second training. After the second training, the model is completely transferred to the target task.
We apply cascading transfer learning in our algorithm. The first stage expands the size of the authentic image data and performs preliminary learning, and the second stage inherits the foundation from the first and fine-tunes the model. Therefore, the model is transferred to the target task completely and can more accurately predict the quality of authentically distorted images. When compared with algorithms that only fine-tune, our algorithm performs better because of the expanded data size and more adequate training. The experimental results demonstrate that the predictions generated by our algorithm are highly correlated with human perception and achieve state-of-the-art performance on the LIVE In the Wild Image Quality Challenge database [15].
The remainder of this paper is organized as follows: Related work is introduced in Section 2, details of the methodology we proposed are given in Section 3, experimental results and the analysis are presented in Section 4, and conclusions are given in Section 5.

Traditional NR-IQA Methods
Regarding FR-IQA, some methods are commonly used, such as Peak Signal to Noise Ratio (PSNR) and the structural similarity index (SSIM) [27]. PSNR is the most widely used method in the field of image and video processing. It has low computational complexity and fast implementation speed. It has been applied in the video coding standards H.264 and H.265. Although PSNR has the above characteristics, it has obvious limitations. It depends heavily on pixel-wise differences and has low consistency with subjective evaluation. It does not take into account some important physiological, psychological, and physical features of the human visual system (HVS). Based on the HVS, an evaluation method combining error sensitivity analysis and structural similarity analysis (structural similarity index, SSIM) was proposed. Structural similarity assumes that the HVS is highly adapted to extracting structural information from the scene and attempts to model the structural information of the image. However, because they require pristine images, the commonly used PSNR and SSIM cannot be used for NR-IQA.
Most of the traditional NR-IQA methods can be categorized into natural scene statistics (NSS) methods and learning-based methods. NSS methods assume that images of different qualities have different statistical properties in the response of a particular filter. The discrete cosine transform (DCT) domain and other transform domains are typically used to extract features. The purpose of these methods is to assess image quality by estimating the distribution parameters. There are also researchers who extract features directly from the spatial domain, which greatly improves performance [28]. DIIVINE [29] deploys summary statistics under an NSS wavelet coefficient model. Another model, BLIINDSS [30], extracts a small number of NSS features in the DCT domain. BRISQUE [28] trains an SVR on a small set of spatial NSS features. CORNIA [31], which is not an NSS-based model, builds distortion-specific code words to compute image quality. NIQE [32] is an unsupervised NR-IQA technique driven by spatial NSS-based features that requires no exposure to distorted images at all. C-DIIVINE [33] is a complex extension of the NSS-based DIIVINE IQA model which uses a complex steerable pyramid. Features computed from it enable changes in local magnitude and phase statistics induced by distortions to be effectively captured.
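As an illustration of why PSNR is cheap to compute yet purely pixel-based, the following minimal sketch computes it from the mean squared error between a reference and a distorted image. It uses the standard definition PSNR = 10 log10(MAX^2 / MSE); the function name `psnr`, the list-of-rows image representation, and the tiny example images are assumptions made here for illustration only.

```python
import math

def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images.

    Images are given as lists of rows of pixel intensities (illustrative format).
    """
    if len(reference) != len(distorted) or any(
        len(r) != len(d) for r, d in zip(reference, distorted)
    ):
        raise ValueError("images must have the same shape")
    n = sum(len(row) for row in reference)
    # Mean squared error over all pixel positions.
    mse = sum(
        (r - d) ** 2
        for ref_row, dist_row in zip(reference, distorted)
        for r, d in zip(ref_row, dist_row)
    ) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 2 x 2 example: two pixels differ by 10 grey levels.
reference = [[100, 100], [100, 100]]
distorted = [[110, 90], [100, 100]]
psnr_db = psnr(reference, distorted)
```

Note that the function needs the pristine `reference` image as an argument, which is exactly what is unavailable in the no-reference setting.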
To tackle the difficult problem of quality assessment of images in the wild, FRIQUEE [34] produces a large and comprehensive collection of "quality-sensitive" statistical image features drawn from among the most successful traditional NR-IQA models.
Furthermore, FRIQUEE deployed these models in a variety of color spaces representative of chromatic image sensing, bandwidth-efficient, and perceptually motivated color processing. This large collection of features defined in various complementary perceptually relevant color and transform-domain spaces drives the "feature maps" based approach. Therefore, FRIQUEE conducted a discriminant analysis of an initial set of 564 features designed in different color spaces, which, when used to train a regressor, produced an NR-IQA model delivering a high level of quality prediction power.
In the learning-based approach, researchers use support vector machines or neural networks to extract local features of an image and map them to the mean opinion score (MOS), or the differential mean opinion score (DMOS), to build a predictive model. The codebook method [35] combines different features rather than directly using local features. Since the amount of data in the existing datasets is too small, it is worthwhile to construct the codebook through unsupervised learning using a dataset without MOS. Saliency maps can also be used to simulate the human visual system and improve the accuracy of these methods [36].

DNN Methods for NR-IQA
Hand-crafted features may not be sufficient to fully characterize complicated image structures and distortions. Since a deep neural network (DNN) can automatically capture deeper features, a series of DNN methods have been proposed to deal with the NR-IQA task [37][38][39][40] and provide very promising performance.
In Reference [41], the authors first divided the DNN methods into two categories according to the role of the DNN. One category is the support vector regression (SVR)-based NR-IQA methods, which use a DNN to extract deep features and SVR methods to predict image quality. The other category is the DNN-based NR-IQA methods, which take full advantage of the back-propagation capability of the DNN to optimize prediction accuracy.
The SVR-based methods are classified into two major schemes as follows: (1) extracting from low-level features of the image [42][43][44] and (2) extracting from the data of the image or image patch [45][46][47]. Instead of using DNN models to extract deep features, DNN-based methods directly use the DNN model to predict image quality. According to the input of the DNN, they can be divided into patch-input methods and image-input methods. Since the performance of a DNN heavily depends on the amount of training data, the patch-input methods divide images into multiple patches as DNN input to increase the number of training samples. Depending on how the training patches are labeled, two approaches can be adopted. One is to use the image subjective score (SS) as the image patch label [12][17][18][19]. The other is to use the score generated by an FR-IQA method as the image patch label [20][21][22]. Rather than using image patches as the input, the image-input methods train a prediction model using the whole image and its associated ground truth, which can effectively estimate the quality of a whole image. However, there has been limited effort towards end-to-end optimized NR-IQA using DNN, primarily due to the lack of sufficient ground truth labels of images. For expanding the set of distorted images, there are generally two approaches: taking advantage of large databases, such as ImageNet [14], and artificially generating images [23][24][25]. The DNN is then trained by the transfer-learning method. This is a common way to overcome the small-database problem. In Reference [25], Li et al. utilized network in network (NIN), pretrained for the classification task on the large-scale ImageNet database, followed by transfer learning to deal with the NR-IQA problem. RankIQA [23] designed a new strategy to generate large-scale sets of distorted images without laborious human labeling.
According to the rule that image quality decreases as the distortion level increases, they synthetically generated ranked image pairs with five different distortion levels from a big database. Then, using the pairs of ranked images, they pretrained a Siamese network to learn image distortion levels. Ultimately, they fine-tuned one branch of the Siamese network to predict the image score, which aims to transfer knowledge of image distortion levels to quality scores. To improve performance on different IQA databases, Zhang et al. designed an end-to-end DB-CNN solution for NR-IQA that worked for both synthetically and authentically distorted images [24]. They designed the architecture of the S-CNN for synthetically distorted images, which aimed to classify the probability of each distortion type at the specific degradation level.
Considering that this DNN model is not beneficial for authentic IQA databases, they selected the VGG-16 network pretrained for the classification task on ImageNet as another branch to extract relevant features for authentically distorted images. Finally, in the fine-tuning step, they tailored the pretrained S-CNN and VGG-16 and introduced a bilinear pooling module for optimization of the entire model. Nevertheless, all of these methods underperform on the authentic distortion database. To address the mentioned problems, we use a traditional NR-IQA method to generate reasonably dependable labels without human labeling and apply VGG-16, which performs well for NR-IQA, to acquire more credible results for the authentic NR-IQA problem.
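The ranking idea behind RankIQA [23] described above can be illustrated with a generic pairwise margin ranking loss. This is only a sketch under assumptions: the exact loss and network used in [23] are not given in this paper, and the function names, the margin value, and the predicted scores below are hypothetical.

```python
def margin_ranking_loss(score_better, score_worse, margin=1.0):
    """Hinge loss that is zero once the less distorted image outscores the other by `margin`."""
    return max(0.0, margin - (score_better - score_worse))

def ranked_pairs(levels):
    """All (less distorted, more distorted) index pairs from a list of distortion levels."""
    return [
        (i, j)
        for i in range(len(levels))
        for j in range(len(levels))
        if levels[i] < levels[j]
    ]

# Five synthetic distortion levels of one source image (0 = mildest).
levels = [0, 1, 2, 3, 4]
# Hypothetical scores predicted by one network branch for each distorted version.
predicted = [0.9, 0.7, 0.75, 0.3, 0.1]

pairs = ranked_pairs(levels)
losses = [margin_ranking_loss(predicted[i], predicted[j], margin=0.1) for i, j in pairs]
total_loss = sum(losses)
```

Only the pair formed by levels 1 and 2 is ranked incorrectly by the hypothetical scores, so it is the only pair that contributes a non-zero loss. No human label is needed to build the pairs, which is the point of the strategy.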

Proxy Labels Based No-Reference Image Quality Assessment
In this section, an NR-IQA algorithm based on obtaining proxy labels for authentically distorted images is detailed. First, the general framework and process of the algorithm are described, and then the key points and innovations of the algorithm are presented.

Overview of Our Approach
Most IQA research is focused on synthetic distortion. In real-life scenarios, due to the complexity and variety of real scenes, image distortion is not simply analogous to synthetic distortion. Therefore, methods designed for synthetic distortion cannot meet actual needs when applied to real-life scenes. At the same time, we noticed that the application of deep learning to blind IQA has been the focus of much research. IQA methods based on deep learning establish a mapping from the underlying image to high-level perception through a deep neural network, so that the predicted image quality is more closely related to human perception. However, deep networks have an obvious shortcoming that cannot be ignored, namely that a large number of samples are needed for training. Manual labeling of image quality is time-consuming and labor intensive, and it is extremely unrealistic to create an IQA dataset with a variety of distortions on the scale of ImageNet. Therefore, researchers are also looking to expand the number of training samples available.
Notably, there are a lot of authentically distorted images in daily life, but these images are not manually labeled. If we could effectively use this large number of unlabeled distorted images, we could significantly expand the amount of data, thus making network training more accurate. However, manual labeling of these images is time-consuming and labor intensive. If the images are labeled using a traditional IQA algorithm, the computing power of the computer is fully utilized, and an objective quality evaluation of a large number of images can be performed in a short time. In this way, the efficiency is greatly improved as compared with manual labeling. Although machine-based objective assessment does not closely match human visual perception, it produces predictions by extracting and analyzing the features of a large number of images. The network can learn IQA-related features from these predictions during training, and therefore acquires the ability to assess image quality.
The ultimate goal of IQA is to simulate how a person perceives and evaluates an image. If the model is trained using only a large number of machine assessment labels, the problem of insufficient data volume is solved and the model gains the ability to evaluate image quality. However, the mapping established by the model stops at objectively assessing quality, and further adjustment is necessary in order to imitate human perception. To address this problem, we continue training with manually labeled images. By fine-tuning the model with the data in an existing IQA dataset, the network can build on its ability to assess image quality, further imitate human perception, adjust the model mapping toward human perception, and complete the IQA task more accurately.
Our approach is based on the observation that a huge number of authentically distorted images are available. The first step in this work is to acquire a large number of authentically distorted images. We select ImageNet [14] for augmenting the data size. This set contains a large number of real scene images, which meets the requirements of both authentic distortion and a large amount of data. Moreover, the images are mostly taken from real life, and therefore a model trained on these images can be more robust when applied in practice. After obtaining a large number of authentically distorted images, the low-level features of the images are extracted by FRIQUEE [34], a traditional NR-IQA method, to compute objective quality scores. The scores obtained are saved as proxy labels of the images.
This step aims at labeling without humans, which greatly expands the amount of training data as compared with other methods.
In the second step, the authentically distorted images from ImageNet and their corresponding proxy labels are fed into the VGG16 network for training, that is, the first transfer learning. Through training on a large amount of data, the task of VGG16 is gradually transformed from classification into IQA, while the deep-level features related to image quality are learned and other irrelevant features are suppressed. In short, the model acquires the ability to assess image quality.
Although the network is preliminarily competent for the IQA task after adequate training, its predictive ability remains at the stage of objective evaluation and is still some distance from human perception. Therefore, in the third step, the IQA database and its labels are fed in for fine-tuning, which is regarded as the second transfer learning. The LIVE In the Wild Image Quality Challenge database, which contains various types of authentically distorted images, is selected to adjust the network. As a result, the model can match the relationship between human perception and image distortion.
The flow chart is shown in Figure 1.
Figure 1. Flow chart of the proposed approach: (1) generate a large number of proxy labels for the images in ImageNet [14], using the FRIQUEE algorithm; (2) train VGG16 using the proxy labels; (3) fine-tune the model using IQA data, which in our method is the LIVE In the Wild Image Quality Challenge database [15]. Ultimately, the model can assess image quality and score images. In the first training stage, the values next to the images are the proxy labels generated by the computer. In the fine-tuning stage, the values are the quality scores given by observers in the LIVE In the Wild Image Quality Challenge database [15].
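The two training stages of the cascade can be sketched with a toy model. The sketch below is not the VGG16 training used in this work: a one-dimensional linear regressor stands in for the network, and both label functions (the stand-in objective scorer and the stand-in human opinion score) are hypothetical. It only illustrates the order of operations, namely pretraining on many proxy labels and then fine-tuning on a few human labels starting from the pretrained parameters.

```python
def predict(params, x):
    w, b = params
    return w * x + b

def mse(params, data):
    """Mean squared error of the model on (feature, label) pairs."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def train(params, data, lr, steps):
    """Full-batch gradient descent on mean squared error; returns updated (w, b)."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# Stage 1 data: many samples labelled by a stand-in objective scorer (hypothetical).
proxy_data = [(i / 200, 2.0 * (i / 200) + 1.0) for i in range(200)]
# Stage 2 data: few samples labelled by a stand-in for human opinion scores (hypothetical).
human_data = [(0.05 + 0.1 * i, 2.2 * (0.05 + 0.1 * i) + 0.8) for i in range(10)]

# First transfer: learn the proxy mapping from the large proxy-labelled set.
pretrained = train((0.0, 0.0), proxy_data, lr=0.1, steps=2000)
error_before = mse(pretrained, human_data)

# Second transfer: fine-tune from the pretrained parameters on the small human-labelled set.
fine_tuned = train(pretrained, human_data, lr=0.05, steps=200)
error_after = mse(fine_tuned, human_data)
```

In this toy setting the pretrained model already lands close to the human mapping, and the short fine-tuning stage reduces the remaining error on the human-labelled samples, which mirrors the role of the third step in the flow chart.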

Generating Proxy Labels
Researchers have established some standard IQA databases, in which the image quality scores are given by a large number of observers. This kind of labeling is regarded as human labeling. However, the data size is too limited for training a deep neural network and usually leads to unsatisfactory performance. In order to expand the amount of data available for training, this work uses objective computer-based methods instead of manual labeling. To distinguish them from human labels, we define the scores that a computer gives using a traditional NR-IQA algorithm as "proxy labels". "Proxy" means that the objective results are obtained by a machine after extracting and assessing image features, rather than by human judgment. By expanding the labels, the amount of data can be effectively increased, so that the network can fully learn the relevant features of IQA during the training process and avoid the over-fitting problem caused by insufficient data volume.
Since this work performs image quality assessment on authentically distorted images, a conventional IQA method designed for authentic distortion is preferred when selecting an algorithm for generating the proxy labels. When automatically scoring a large number of original images, the algorithm with the best performance should be selected to maximize the training accuracy and success rate. At present, among the many traditional algorithms, FRIQUEE [34] is one of the best for authentically distorted images, as shown in Table 2. Therefore, in this work, the FRIQUEE algorithm is chosen as the tool for generating proxy labels. Although the proxy labels do not completely match human perception, they reflect relevant low-level features of image quality. After training on a large number of proxy labels, the model does not directly achieve the goal of imitating human perception, but it does learn the characteristics of distorted images and lays a good foundation for subsequent model adjustment.
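The labeling step described above amounts to running an objective scorer over every unlabeled image and storing the result. The following minimal sketch shows that structure only. The scorer `toy_sharpness_score` is a deliberately crude stand-in and is not FRIQUEE; the function names, file names, and pixel values are all hypothetical.

```python
def toy_sharpness_score(image):
    """Stand-in scorer: mean absolute horizontal gradient of a grayscale image.

    This is NOT FRIQUEE; it only plays the role of "some objective NR-IQA algorithm".
    """
    diffs = [abs(row[k + 1] - row[k]) for row in image for k in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

def generate_proxy_labels(images, scorer):
    """Map each image name to the score given by the objective scorer."""
    return {name: scorer(pixels) for name, pixels in images.items()}

# Hypothetical unlabeled images, given as lists of rows of grey levels.
images = {
    "sharp.png": [[0, 255, 0, 255], [255, 0, 255, 0]],
    "flat.png": [[120, 121, 120, 121], [121, 120, 121, 120]],
}
proxy_labels = generate_proxy_labels(images, toy_sharpness_score)
```

Because the scorer is passed in as an argument, any objective NR-IQA algorithm could be plugged in without changing the labeling loop, and no human rating is involved at any point.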

Training Network Using Proxy Labels
In this work, the VGG16 [26] network is selected for training and testing. The VGG network is a model proposed by Oxford University in 2014. Because its structure is intuitive and easy to understand, it has been applied in various fields and has become a widely used convolutional neural network model. VGG16 was designed for image classification, and it shows excellent performance in classification and other image processing tasks. Meanwhile, many IQA studies based on VGG16 perform quite well, which indicates that VGG16 is also suitable for IQA. Compared with other DNN models such as ResNet [48] and AlexNet [10], VGG16 has a moderate depth, fairly fast operation speed, and good performance, so it offers competitive performance while keeping training and computation efficient.
In the original architecture, the last layer of the network has 1000 channels, corresponding to the 1000 categories in image recognition. Because IQA requires a single score as output, the number of channels in the last layer is changed to one in this work.
During the initial training process, the model learns IQA-relevant features from a large amount of training data. Since the VGG16 network requires an input image size of 224 × 224, the original images are first filtered by size, and only images larger than 224 × 224 are retained. Random sampling is adopted at the input of VGG16, so that any image larger than 224 × 224 is randomly cropped to 224 × 224. Therefore, despite their different aspect ratios and sizes, all images fed into VGG16 are 224 × 224. The images and their proxy labels are fed into the VGG16 network, and the first stage of training is performed. During this process, because of the many IQA-relevant proxy labels, the original classification features in the network are suppressed, and the features related to IQA are learned and enhanced. After this training, the network can already assess image quality. Nevertheless, since the training data are objective labels computed automatically, they differ from actual human perception, so further training is necessary. On this basis, the IQA database is used to fine-tune the network.
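The size filtering and random cropping described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `keep_image` and `random_crop_box` are hypothetical, and treating an image of exactly 224 × 224 as acceptable is an assumption. The returned box uses the (left, upper, right, lower) convention so that it could be passed to an image library's crop routine.

```python
import random

CROP = 224  # VGG16 input side length, as stated in the text


def keep_image(width, height, crop=CROP):
    """Size filter: keep only images that can hold a crop x crop patch.

    Assumption: an image of exactly crop x crop is also kept.
    """
    return width >= crop and height >= crop


def random_crop_box(width, height, crop=CROP, rng=random):
    """Return a random (left, upper, right, lower) box of size crop x crop."""
    if not keep_image(width, height, crop):
        raise ValueError("image is smaller than the crop size")
    # randint is inclusive at both ends, so the box always fits in the image.
    left = rng.randint(0, width - crop)
    upper = rng.randint(0, height - crop)
    return left, upper, left + crop, upper + crop


if __name__ == "__main__":
    # Hypothetical image sizes (width, height); the 200 x 300 one is filtered out.
    sizes = [(500, 375), (200, 300), (224, 224)]
    rng = random.Random(0)
    for w, h in sizes:
        if keep_image(w, h):
            print((w, h), random_crop_box(w, h, rng=rng))
```

Because the crop position is drawn anew each time an image is sampled, the same image can contribute different 224 × 224 patches across training iterations.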

Fine-Tuning the Network
After the preliminary training, the network is able to evaluate the quality of images, but this ability is based on an objective method and still differs from people's actual perception. To narrow this difference, a second phase of training is carried out, in which the network is fine-tuned using the IQA database.
Most existing IQA databases are oriented to synthetic distortion. The LIVE In the Wild Image Quality Challenge database [15] is an IQA database of authentically distorted images, which fits this work. It contains 1162 distorted images with various types of distortion, captured by a variety of representative modern mobile devices.
In the second phase of training, the model is fine-tuned using the authentically distorted images and the corresponding human labels. The IQA-related features in the network are further enhanced, making the model more accurate. More importantly, after fine-tuning, the mapping between features and scores is closer to human perception of distortion, making the model more practical.

Experimental Results
In this section, the performance of the algorithm is demonstrated through several experiments. Compared with baseline and state-of-the-art algorithms in the NR-IQA field, the proposed algorithm shows good performance on authentically distorted images.

Database
We use two types of databases in the experiments: a database for augmenting proxy labels and IQA databases for model tuning and testing.

Database for Augmenting
The database used to expand the data size is the ImageNet database. ImageNet is a large image dataset set up to facilitate the development of computer image recognition technology [14]. This image set contains a large number of real scene images, which meets the requirements of this work for both authentic distortion and a large amount of data. Most of the images in this set are taken in everyday life. Samples from ImageNet are shown in Figure 2.
It is easy to observe that images of actual scenes do exhibit a certain degree of distortion. This also explains why the VGG model performs well on authentic distortion: its training set contains a large number of real images that are close to the distorted images in the IQA databases.

IQA Database
The subjective IQA databases in this work are used to fine-tune the model and verify its performance; each image is associated with a subjective score (mean opinion score, MOS). The experiments are based on public IQA databases, the first of which is the LIVE In the Wild Image Quality Challenge database (LIVEC) [15].
This dataset is an IQA database of authentically distorted images, which matches this work. It contains 1162 authentically distorted images and no reference images. The MOS values are obtained from an estimated 350,000 opinion scores given by more than 8100 observers and lie in the range [0,100]; the higher the image quality, the higher the score. Samples from LIVEC are shown in Figure 3.
The other is the KonIQ-10k database [49], an IQA dataset consisting of 10,073 quality-scored images. Through crowdsourcing, researchers obtained 1.2 million reliable quality ratings from 1459 crowd workers, with scores in the range [1,5]; the better the quality, the higher the value.

Evaluation Protocols
In this experiment, four commonly used coefficients are used to evaluate performance: Pearson linear correlation coefficient (PLCC), Spearman rank order correlation coefficient (SROCC), Kendall rank order correlation coefficient (KROCC), and root mean squared error (RMSE).
The PLCC measures the linear correlation between the actual scores and the predicted quality scores. Given $N$ distorted images, the actual score of the $i$-th image is denoted by $y_i$, and the score predicted by the network is denoted by $\hat{y}_i$. The PLCC is calculated as follows:
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}$$
where $\bar{y}$ and $\bar{\hat{y}}$ are the averages of the actual scores and the predicted quality scores, respectively. A higher PLCC indicates better performance.
The SROCC measures the monotonic relationship between the actual scores and the predicted scores. With the same notation, the SROCC is calculated as follows:
$$\mathrm{SROCC} = 1 - \frac{6 \sum_{i=1}^{N} (v_i - p_i)^2}{N (N^2 - 1)}$$
where $v_i$ is the rank of the true score $y_i$ and $p_i$ is the rank of the predicted score $\hat{y}_i$ among all images. The higher the SROCC, the better the performance.
The KROCC also measures the monotonic relationship. Given $N$ distorted images, the actual score of the $i$-th image is denoted by $x_i$ and the score predicted by the network is denoted by $y_i$; thus, $N$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ are obtained. Any two data points $(x_i, y_i)$ and $(x_j, y_j)$ with $i \neq j$ form a pair $[(x_i, y_i), (x_j, y_j)]$, which gives $N(N-1)/2$ pairs in total. If $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$, the pair is called concordant. If $x_i > x_j$ and $y_i < y_j$, or $x_i < x_j$ and $y_i > y_j$, the pair is called discordant. Let $N_c$ be the number of concordant pairs and $N_d$ the number of discordant pairs. The KROCC is calculated as follows:
$$\mathrm{KROCC} = \frac{N_c - N_d}{\frac{1}{2} N (N - 1)}$$
Therefore, the KROCC evaluates the rank correlation, and higher values are better.
The RMSE measures the error between the algorithm's predicted values and the ground truth to quantify the accuracy of the algorithm, and it is used in many works [50]. With $y_i$ and $\hat{y}_i$ defined as above, the RMSE is calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
A lower RMSE indicates better performance.
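The four criteria can be illustrated with a short pure-Python sketch. It follows the standard definitions of these coefficients and is not taken from the authors' code; the function names are hypothetical. The SROCC here uses the closed-form rank-difference formula and therefore assumes that there are no tied scores, and the KROCC simply counts concordant and discordant pairs, ignoring ties.

```python
import math


def plcc(y, y_hat):
    """Pearson linear correlation between actual and predicted scores."""
    mean_y = sum(y) / len(y)
    mean_h = sum(y_hat) / len(y_hat)
    num = sum((a - mean_y) * (b - mean_h) for a, b in zip(y, y_hat))
    den = math.sqrt(sum((a - mean_y) ** 2 for a in y)) * math.sqrt(
        sum((b - mean_h) ** 2 for b in y_hat)
    )
    return num / den


def _ranks(values):
    """1-based ranks in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def srocc(y, y_hat):
    """Spearman rank-order correlation via the rank-difference formula."""
    n = len(y)
    v, p = _ranks(y), _ranks(y_hat)
    d2 = sum((a - b) ** 2 for a, b in zip(v, p))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def krocc(x, y):
    """Kendall rank-order correlation from concordant/discordant pair counts."""
    n = len(x)
    n_c = n_d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                n_c += 1  # both scores order the two images the same way
            elif s < 0:
                n_d += 1  # the two scores disagree on the order
    return (n_c - n_d) / (n * (n - 1) / 2)


def rmse(y, y_hat):
    """Root mean squared error between actual and predicted scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


if __name__ == "__main__":
    mos = [10, 20, 30, 40, 50]    # hypothetical actual scores
    pred = [12, 18, 33, 39, 55]   # hypothetical predicted scores
    print(plcc(mos, pred), srocc(mos, pred), krocc(mos, pred), rmse(mos, pred))
```

In the hypothetical example, the predictions preserve the order of the actual scores, so both rank coefficients reach their maximum even though the RMSE is nonzero; this is why the rank-based and error-based criteria are reported together.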

Evaluation Process
After the first stage of transfer learning, we evaluate the performance of the model. This experiment verifies whether the model is heading in the correct direction after the expansion of the proxy labels and the initial training, and whether it can assess image quality reasonably.
After the entire training process is completed, the network performance is verified again and compared with other algorithms to validate the algorithm. The result verifies the validity of the expanded proxy labels and the two training stages.
Since this method is based on a deep network, the LIVEC and KonIQ-10k databases are each randomly divided into two groups, a training set and a test set. In the experiment, 80% of the images in a database are used as the training set and the remaining 20% as the test set. The contents of all images are different; thus, the training and test sets can be selected randomly without content overlap. This random process is repeated ten times to eliminate bias. For each iteration, the training and test sets are randomly selected as described above.
To find the proper amount of initial training data, we run our algorithm with 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, and 10,000 images with their proxy labels. The average SROCC, PLCC, KROCC, and RMSE values are reported for comparison. To ensure the stability of the model, we also compute the variance of the four metrics over the different trained models.
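The evaluation protocol above (a random 80/20 split repeated ten times, with the mean and variance of each metric reported) can be sketched as follows. This is an illustrative outline only: `split_indices` and `repeated_evaluation` are hypothetical names, the `evaluate` callback stands in for training and testing a model on the given indices, and using the population variance is an assumption, since the text does not state which variance estimator is used.

```python
import random
import statistics


def split_indices(n_images, train_fraction=0.8, rng=random):
    """Randomly split image indices into a training list and a test list."""
    indices = list(range(n_images))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_images))
    return indices[:n_train], indices[n_train:]


def repeated_evaluation(n_images, evaluate, n_repeats=10, seed=0):
    """Repeat the random split and return (mean, variance) of one metric.

    `evaluate(train_idx, test_idx)` is a placeholder for fine-tuning on the
    training indices and computing a metric (e.g., SROCC) on the test indices.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train_idx, test_idx = split_indices(n_images, rng=rng)
        scores.append(evaluate(train_idx, test_idx))
    # Assumption: population variance over the repeated runs.
    return statistics.mean(scores), statistics.pvariance(scores)


if __name__ == "__main__":
    # LIVEC contains 1162 images; the dummy metric below only reports the
    # test-set share and stands in for a real model evaluation.
    train_idx, test_idx = split_indices(1162, rng=random.Random(1))
    print(len(train_idx), len(test_idx))
    print(repeated_evaluation(1162, lambda tr, te: len(te) / 1162))
```

Fixing the seed makes the ten splits reproducible, which is convenient when the same splits are reused to compare models trained with different numbers of proxy labels.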

Experiment Results
For the sake of convenience, the algorithm proposed is abbreviated as EPL (expanding the proxy labels).

Proper Amount of Proxy Labels
Sets of 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, and 10,000 images with their proxy labels are used to train 10 VGG16 models. The mean and variance of SROCC, PLCC, KROCC, and RMSE are shown in Table 3. In Table 3, we divide the models into two types: Base and Base + FT (FT means fine-tuning). Base means that the model is trained only with the proxy labels, whereas Base + FT means that the model is trained completely.
Figure 4 presents the same results more intuitively as line charts. The x-axes indicate the number of proxy labels used in the first stage of training, and the y-axes indicate the values of the different coefficients. For SROCC, PLCC, and KROCC, higher values are better, whereas for RMSE lower values are better. From Table 3 and Figure 4, we can observe that once the training data exceed 5000 images with their proxy labels, the performance of the base model levels off, with no significant improvement. However, for the completely trained model, the performance reaches its best point overall at 7000. Therefore, the proper number of proxy labels is 7000, and we choose the model trained with 7000 proxy labels for comparison with other algorithms.
According to the comparison of the Base and Base + FT models shown in Figure 5, the performance is indeed greatly improved after fine-tuning regardless of the number of proxy labels, which demonstrates the effectiveness of the two-stage transfer learning. Since the final performance builds on the expanded proxy labels, it also demonstrates the effectiveness of expanding the proxy labels.
Meanwhile, the variance is low for both the Base and Base + FT models, which demonstrates the stability of the trained models.

Performance Comparison
Generating large numbers of proxy labels allows the network to learn the low-level features applied in traditional algorithms. Therefore, we evaluated the model trained with the proper number of proxy labels, that is, 7000 proxy labels. The data for the ten tests on LIVEC and KonIQ-10k are listed in Table 4. The performance and the comparison with other methods are shown in Table 5. It can be seen from Table 5 that, after the expansion of the proxy labels and the initial training, the proposed method is superior to all conventional NR-IQA methods; that is, EPL can automatically extract not only the hand-crafted features but also other features relevant to image quality that are conducive to the assessment. Compared with using only hand-crafted features, the combined strategy performs better.
[Table 5, fragment recovered from the source; the column headers, the name of the first method, and the FRIQUEE values are incomplete]
[28]: 0.607, 0.645, -, -
CORNIA [31]: 0.618, 0.662, -, -
GMLOG [51]: 0.543, 0.571, -, -
IL-NIQE [6]: 0.594, 0.589, -, -
BWS [5]: 0.482, 0.526, -, -
FRIQUEE [34]: 0
In addition, the proposed method is compared with the end-to-end deep learning methods (CNN [12], RankIQA [23], BIECON [20], DIQaM [17], DIQA [22], CaHFI [52], NRVPD [53], ESD [54], VS-DDON [55], NQS-GAN [56], and ILGNet [57]). Since those algorithms are mostly directed at synthetic distortion, their learning of authentic distortion features is insufficient. Consequently, although it has not yet been adjusted by the IQA database, the proposed method is still superior to all of these methods.
Nevertheless, when compared with the classification networks (AlexNet [10], VGG16 [26], and ResNet50 [48]) that are directly fine-tuned on IQA data, the model after only the initial training is inferior. Because those models are pretrained on a considerable number of ImageNet images and then fine-tuned on IQA data, they learn more features and their mappings are closer to human perception.
In conclusion, the model performance after first-stage training is significantly better than that of most methods on the LIVEC database, which strongly supports the effectiveness of using a large quantity of authentically distorted images for initial training.
After fine-tuning the model using LIVEC, the performance is further improved: PLCC increases by 22.09% as compared with the first stage of training, and SROCC increases by 16.31%. The detailed data of the ten tests are listed in Table 6, and the evaluation results are shown in Table 7.
From Table 7, we can see that the performance of this algorithm has been greatly improved by the fine-tuning of the IQA database and it outperforms the existing algorithms.
After the first-stage training, the model does not perform as well as AlexNet [10], VGG16 [26], and ResNet50 [48], but the complete training leads to a competitive result. Because the first stage of training provides a foundation, the model shows excellent performance after adjustment by the LIVEC database. The above classification models have a series of features relevant to object classification because they had sufficient data for training; however, not all of those features are conducive to IQA. In contrast, EPL overcomes this bottleneck by learning from a set of proxy labels attached to a number of images, which prompts the model to strengthen the features that are strongly relevant to image quality and to suppress the effect of weakly relevant features. After fine-tuning, the mapping is closer to human perception because the learned features are more appropriate. Together, these steps result in the outstanding performance. This shows the importance of acquiring a large number of samples when training the network and, in this work, the effectiveness of expanding proxy labels for deep IQA model training. Moreover, the IQA database plays an indispensable role in the training process: after fine-tuning, the network learns a more human-like mapping, and its evaluation of an image is closer to real human experience.

Supplementary Experiment
Since the EPL method was proposed to meet the need for IQA on authentic distortion, the experiments above were performed on authentically distorted databases such as LIVEC. However, it is also meaningful to apply the idea to synthetic distortion. Moreover, using another method to generate the proxy labels and evaluating the trained model on synthetic IQA databases can further demonstrate the effectiveness of proxy labels.
To demonstrate that the selected label-generating method has an impact on subsequent results, we select IL-NIQE [6], a traditional NR-IQA method for synthetic distortion, and FSIMc [58], a traditional FR-IQA method for synthetic distortion, to generate proxy labels on the Waterloo exploration database [59]. As an FR-IQA method, FSIMc [58] outperforms most NR-IQA methods, including IL-NIQE [6], so a comparison of the results obtained with the two can show that the selected label-generating method truly has an impact. The Waterloo exploration database contains 4744 pristine natural images and 94,880 synthetically distorted images created from them. We randomly choose 20,000 images from the Waterloo exploration database to generate proxy labels using IL-NIQE [6] and FSIMc [58].
After obtaining the proxy labels, we trained a series of models in the same way as above and evaluated their SROCC, PLCC, KROCC, and RMSE on the LIVE [60], TID2013 [16], CSIQ [61], and LIVEMD [62] databases, all of which are designed for synthetic distortions. As before, we evaluated the models after the two stages of transfer learning, found the proper number of proxy labels, and adopted the results produced by the most suitable model.
We compared the results with other algorithms, as shown in Table 8. The results show that the performance is indeed improved by the two stages of transfer learning as compared with the traditional method used to generate the proxy labels, and that it is competitive to some extent. We use IL-NIQE, an NR method, to generate the proxy labels, whereas the compared methods mostly use an FR method (BIECON [20]) or train the model with undistorted images (DIQA [22], DIQaM [17], RankIQA [23]); hence, the proxy labels we obtained are less accurate than theirs. This lower accuracy leads to the comparatively ordinary performance. To improve the performance and demonstrate that the accuracy of the proxy labels truly affects it, we also select FSIMc [58], an FR-IQA method, to generate labels. The experimental results show that, with labels generated by FSIMc, the model performs better than most algorithms, which also shows that the method of expanding proxy labels is applicable to synthetically distorted images.

Conclusions
To address the scarcity of IQA data and the insufficient research on authentic distortion IQA, we have proposed a method based on proxy labels. First, a traditional NR-IQA algorithm is used to generate large numbers of proxy labels to expand the amount of training data. Since the labels can be generated in bulk, we can train deeper and wider networks than previous work. Then, the network is trained with the proxy labels and acquires preliminary IQA knowledge. Finally, building on this foundation, the network is fine-tuned on an IQA database to bring it closer to human perception. Experimental results on the LIVE In the Wild Image Quality Challenge database and the KonIQ-10k database demonstrate that our approach outperforms existing NR-IQA algorithms on authentic distortion.