Image Quality Assessment without Reference by Combining Deep Learning-Based Features and Viewing Distance

An abundance of objective image quality metrics have been introduced in the literature. One important aspect on which perceived image quality depends is the viewing distance from the observer to the image. In this study, we introduce a novel image quality metric able to estimate the quality of a given image without reference for different viewing distances between the image and the observer. We first select relevant patches from the image using saliency information. For each patch, a feature vector is extracted from a convolutional neural network model and concatenated with the viewing distance for which the quality is predicted. The resulting vector is fed to fully connected layers to predict subjective scores for the considered viewing distance. The proposed method was evaluated using the Colourlab Image Database: Image Quality and the Viewing Distance-changed Image Database. Both databases provide subjective scores at two different viewing distances. On the Colourlab Image Database: Image Quality, we obtained a Pearson correlation of 0.87 at both the 50 cm and 100 cm viewing distances, while on the Viewing Distance-changed Image Database we obtained Pearson correlations of 0.93 and 0.94 at viewing distances of four and six times the image height, respectively. The results show the efficiency of our method and its generalization ability.


Introduction
Image quality assessment is central in the acquisition, processing, analysis and reproduction of images. The interest in and need for image quality assessment have increased in the last decades, resulting in increasing research on this topic. Subjective assessment is considered to be the "gold standard", but objective assessment is becoming increasingly popular. A plethora of objective assessment methods, commonly known as Image Quality Metrics (IQMs), have been suggested in the literature over the last decades [1][2][3][4][5][6]. These metrics have also been extensively evaluated [7][8][9][10][11][12]. Despite the large number of existing IQMs and their extensive evaluation, there are still several limitations and unsolved challenges [8,13][14][15][16].
IQMs can, depending on the availability of the reference image, be divided into full-reference, reduced-reference, or no-reference [17]. Full-reference IQMs need the complete reference, reduced-reference IQMs need partial information about the reference image, and no-reference IQMs do not need access to the reference image. Conventional IQMs, such as mean-squared-error and peak-signal-to-noise-ratio (PSNR), only utilize information on the intensity of the distortion. In spite of this, these IQMs have been used with success in different applications, but they have only been moderately correlated with perceived quality for natural images [11]. IQMs based on structural similarity have become very popular in the last decade [18], and they have been shown to correlate better with subjective scores than PSNR [11]. Many other IQMs based on different approaches have also been proposed, such as the spatial CIELAB [19], total variation of difference [20], PSNR-HVS-M [21], Difference of Gaussians [22], machine learning [23], and the spatial hue angle metric [24]. They have also incorporated different aspects related to the human visual system, such as contrast sensitivity [19], visual masking [25,26], and gaze information [27,28]. These IQMs have been applied to a wide range of applications, including color printing [29][30][31], displays [32], compression [33,34], cameras [35], image enhancement [36], gamut mapping [37,38], medical imaging [39,40], and biometrics [41][42][43].
Recently, the use of deep learning has attracted the attention of many researchers in image quality [44][45][46][47][48][49][50][51]. The distance from the image to the observers is an important aspect when observers evaluate quality [20,52,53]. This well-known fact was, however, overlooked in many of the existing IQMs, and very few of the IQMs based on deep learning incorporate viewing distance. In addition, the subjective experiments behind the existing datasets for evaluating the performance of IQMs were carried out only at a single viewing distance, or the distance was not controlled (i.e., fixed). The Colourlab Image Database: Image Quality (CID:IQ) [52] is one of a handful of publicly available datasets where observers evaluated the quality of images at two different viewing distances, namely 50 cm and 100 cm.
The main contributions of this work are:
• The integration of the viewing distance into a modified version of the pre-trained VGG16 model.
• The integration of saliency information to extract patches according to their importance.
• The comparison of our modified model with several configurations.
• The evaluation of the proposed method against other state-of-the-art methods on two datasets.
We utilize a Convolutional Neural Network (CNN) to predict perceived image quality at different viewing distances. To the best of our knowledge, this is the first work where viewing distance is included in a CNN-based IQM. First, we will introduce related background, then we present the proposed method. Furthermore, we present our experimental results, and then the conclusion is given.

Background
There is a large number of IQMs in the literature [1][2][3][4][5], and many different approaches have been taken. In recent years, more and more IQMs based on deep learning have been proposed, and the use of deep learning has been postulated to result in better performing IQMs [15].
Chetouani et al. [54] handled image quality as a classification problem through linear discriminant analysis. The authors first extract characterizing features from both the original and degraded images. Then, the type of degradation in the image is found by using a minimum distance criterion. The most appropriate IQM for a given distortion is finally applied. Evaluation was carried out on the TID2008 dataset [10] and the LIVE Image Quality Assessment Dataset [55]. The results showed that the suggested method improves the correlation coefficients, but that the improvement is distortion dependent.
In [56], Chetouani extended the previous work by Chetouani et al. [54] through a CNN model for identifying the degradation and predicting quality. The degraded image goes through two parallel processes: in one, the distortion type is identified using a CNN model, and in the other, the most salient patches are found. The quality estimation is also done through a CNN model with two convolutional layers, two pooling steps, one fully connected layer and one output layer. Evaluation was carried out on the TID2008 dataset [10], Categorical Subjective Image Quality [57], and the LIVE Image Quality Assessment Dataset [55], using the following distortion types: white noise, JPEG, JPEG2000, blur and fast fading. The results indicated that the proposed method gave higher correlation coefficients than other no-reference IQMs, and results comparable to those of the best full-reference IQMs.
Chetouani [58] used the pre-trained VGG16 model in a no-reference IQM. A patch selection step based on saliency information using a scanpath predictor was incorporated in the IQM. Patches around the fixation points from the scanpath predictor were used as input to the CNN model. The IQM was evaluated on the CU-Nantes dataset [59]. The results showed that the proposed CNN model had the highest quality performance.
Hou et al. [50] suggested a no-reference IQM based on a discriminative deep learning model that was trained to classify natural scene statistics features in five quality levels (excellent, good, fair, poor, and bad). The final predicted quality score was obtained from a quality pooling step. The proposed IQM was evaluated on the LIVE Image Quality Assessment Dataset [55], TID2008 [10], Categorical Subjective Image Quality [57], IVC [60], and MICT [61].
Kang et al. [45] introduced a no-reference metric that predicted the image quality of patches in images using CNNs. They contrast-normalize the grayscale image before selecting non-overlapping patches, where each patch is input to the network. The network consisted of five layers, where the first convolutional layer filtered the input with 50 kernels. Fifty feature maps are created, each of which is pooled into one max and one min value. Further, two fully connected layers of 800 nodes are used. The final layer is a linear regression, giving the final quality score. The no-reference IQM was evaluated on the LIVE Image Quality Assessment Dataset [55] and the TID2008 dataset [10]. The proposed IQM showed an overall correlation higher than other no-reference metrics in the evaluation.
Li et al. [46] extracted simple features from images by using a Shearlet transform, and then further treated image quality as a classification problem using deep neural networks. The feature extraction step was based on the observation that the statistics of the Shearlet coefficients change as an image is distorted. The features are extracted from each of the color channels (RGB) and normalized. These features are then evolved in stacked auto-encoders before the final features are input to a Softmax classifier. The authors used the LIVE Image Quality Assessment Dataset [55], TID2008 [10] and the LIVE multiply distorted dataset [62]. Their results were comparable to those of other no-reference IQMs, but the correlation coefficients were not as high as those of the best full-reference metrics.
Lv et al. [48] applied a multi-scale Difference of Gaussian to generate features, which were processed in a deep neural network in their proposed IQM. It used a combination of a stacked auto-encoder with three hidden layers and a support vector machine regression. The IQM was evaluated on the TID2008 dataset [10] and the LIVE Image Quality Assessment Dataset [55]. The proposed no-reference IQM showed higher correlation coefficients compared to other state-of-the-art no-reference IQMs, and comparable correlation coefficients to state-of-the-art full-reference IQMs.
Bianco et al. [44] introduced a no-reference IQM using CNNs for generic distortions, where quality scores (categories such as bad, poor, fair, good and excellent) are predicted for sub-regions within the image and support vector regression is applied on the CNN features. Their architecture is based on the Caffe network [63], but pre-trained on three image classification tasks. The authors experimented with selecting between 5 and 50 subregions randomly from the images. Evaluation was performed using the LIVE In the Wild dataset [64], and they showed higher correlation coefficients than state-of-the-art IQMs. They also evaluated their method on the LIVE Image Quality Assessment Dataset [55], Categorical Subjective Image Quality [57], TID2008 [10], TID2013 [65]. Their correlation coefficients were similar or higher compared to other metrics.
Li et al. [66] merged CNNs and Prewitt magnitude on a segmented image to estimate the quality of images. The CNN model is based on seven layers, using normalized 32 × 32 pixel image patches as input. The authors computed weights for the image patches, which are based on a graph-based segmentation of the original image, where the weight is the sum after applying the Prewitt operator on the image. The IQM was evaluated on the TID2008 dataset [10] and the LIVE Image Quality Assessment Dataset [55]. The results show that the introduced IQM has higher correlation coefficients compared to no-reference IQMs, and similar to the best full-reference IQMs.
Kim et al. [47] utilized local quality maps as intermediate targets for CNNs. In the proposed IQM, the CNN is trained with respect to each non-overlapping patch in the image, also giving equal weights for every pixel in the image. This results in a local quality score. Further, the pooling stage is incorporated for training. All parameters of the model are optimized simultaneously. The CNN architecture consisted of two convolutional layers and five fully connected layers. The proposed IQM was evaluated on the TID2008 dataset [10] and the LIVE Image Quality Assessment Dataset [55], and the results showed that it is comparable to the best performing IQMs.
Gao et al. [49] introduced a full-reference IQM to measure the local similarities between the features from the distorted and reference images using deep neural networks. The reference image and the degraded image are fed separately to the VGGnet [67]. Further, the output of each layer is taken as a feature map. Then, local similarities between the feature map of the reference and the feature map of the degraded image are found. Finally, the local similarities are pooled into a final quality score. They evaluated their method on the LIVE Image Quality Assessment Dataset [55], Categorical Subjective Image Quality [57], the LIVE multiply distorted dataset [62], and TID2013 [65]. The performance of the full-reference IQM was similar to that of the best state-of-the-art IQMs.
Fan et al. [68] introduced a no-reference IQM. The first step was to identify the distortion of the input image, which was done using a shallow CNN with one convolution layer. Further, for every distortion type they designed a CNN, which is used to calculate a quality score for each patch in the image. At last, a fusion algorithm was used to generate one single quality score for the entire image. Evaluation was carried out on the LIVE Image Quality Assessment Dataset [55] and Categorical Subjective Image Quality [57] dataset. Performance of the introduced no-reference IQM was comparable to state of the art IQMs, but the correlation coefficients were slightly lower than the best full-reference IQMs.
Ravela et al. [69] proposed a no-reference IQM, in which they classify the distortions present in the degraded image. For each distortion class they compute a quality score. These are further combined through a weighted average-pooling algorithm to obtain a single regressor output. The IQM was evaluated on the LIVE Image Quality Assessment Dataset [55], Categorical Subjective Image Quality [57] and the TID2008 dataset [10]. The evaluation showed comparable results to other state-of-the-art IQMs. This approach is similar to that of [54,56].
Varga [70] introduced a no-reference IQM using multi-level inception features from a pretrained CNN. The method uses the entire image to extract resolution-independent features. The IQM was evaluated on the LIVE In the Wild dataset [64], and obtained higher correlation values compared to many state-of-the-art methods.
Ma et al. [71] proposed a no-reference IQM mimicking the human visual system, more precisely by using an active inference module of a generative adversarial network to predict the main content of the image. Then, by using a multi-stream convolutional neural network (CNN), they assess the quality related to scene information, distortion type and content degradation. The proposed IQM was evaluated on the LIVE Image Quality Assessment Dataset [55], Categorical Subjective Image Quality [57], TID2013 [65], the LIVE In the Wild dataset [64] and the LIVE multiply distorted dataset [62]. The method showed comparable or higher correlation values compared to other state-of-the-art IQMs.
Amirshahi et al. [51] introduced a full-reference IQM using self-similarity and a CNN model. It used CNN features across multiple levels to calculate the similarity between the reference image and the degraded image. The IQM was based on the Alexnet [72] architecture. The method extracts feature maps at five convolutional layers, and these are compared using a histogram-based quality metric. A quality value at each layer is computed, and further pooled using a geometrical mean to get a final quality value. The proposed IQM was evaluated on the LIVE Image Quality Assessment Dataset [55], Categorical Subjective Image Quality [57], the Colourlab Image Quality Dataset [52], and TID2013 [65]. The results showed that the proposed IQM gave similar performance to the best state-of-the-art IQMs. The same IQM was also evaluated on a dataset for image contrast enhancement evaluation [73], where it also performed quite well [36].
The approach by Amirshahi et al. [51] was improved in [74], where the feature maps were compared using traditional IQMs such as SSIM [18], PSNR and mean squared error. The proposed IQM was evaluated on the LIVE Image Quality Assessment Dataset [55], Categorical Subjective Image Quality [57], the Colourlab Image Quality Dataset [52], and TID2013 [65]. They showed an improvement in the performance of the IQMs (on average an increase of 23%) using a CNN-based approach.
In this study, we advance the current research on CNN-based IQMs by predicting image quality without a reference for different viewing distances. To achieve this, relevant patches were selected based on saliency information and the viewing distance was appended to features extracted from a modified pre-trained CNN model.

Proposed Method
The pipeline of the proposed no-reference IQM is summarized in Figure 1. For a given degraded image, we first select the most relevant patches based on their saliency weights. For each patch, we extract a feature vector from a CNN model and concatenate it with the viewing distance for which the quality is predicted. The resulting vector is then fed to fully connected layers to predict the subjective quality for the considered viewing distance. Each of these steps is described in this section.

Saliency-Based Patch Selection
Visual attention is one of the selective mechanisms of our human visual system that involves an attraction towards some regions of the image. These attractive regions highly influence our subjective judgement and therefore impact the subjective quality of an image. In this study, we exploited this perceptual mechanism to select the most relevant patches that have high perceptual impact. To do so, we employed the scanpath predictor described in [75], which aims to mimic the behavior of our human visual system when it faces a real image. It predicts fixation points of the scanpath via a given saliency map. For each predicted fixation point, a small patch was extracted. The authors of [45,76] examined the impact of the patch size and found that a size of 32 × 32 × 3 constitutes a decent trade-off between performance and computation time. The same size, 32 × 32 × 3, was used in our approach. As found by Vigier et al. [77], observers at visual angles up to 60° reach the same salient region, indicating that saliency is the same at different distances. The saliency map was computed using the Graph-Based Visual Saliency (GBVS) method [78]. GBVS has been shown to be very good for fixation location and scanpath prediction [79], and is therefore used. The number of fixation points and its impact on the performance will be discussed in Section 4.1. For more details about the saliency-based patch selection, readers are referred to [80].
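The patch extraction step can be illustrated with a minimal sketch. It assumes that the fixation points are already available as (row, column) pairs, for instance from a scanpath predictor, and that the image is stored as nested lists of RGB tuples. The function name `extract_patches` and the clamping of patches at the image borders are illustrative assumptions and are not taken from the paper.

```python
PATCH_SIZE = 32  # patch side length used in the text (32 x 32 x 3)


def extract_patches(image, fixations, size=PATCH_SIZE):
    """Crop one size x size patch centred on each fixation point.

    image     : list of rows, each row a list of (R, G, B) tuples
    fixations : iterable of (row, col) pixel coordinates
    """
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the patch size")

    half = size // 2
    patches = []
    for row, col in fixations:
        # Assumption: shift the window so that it stays inside the image
        # when a fixation point lies close to a border.
        top = min(max(row - half, 0), height - size)
        left = min(max(col - half, 0), width - size)
        patches.append([r[left:left + size] for r in image[top:top + size]])
    return patches
```

With 180 fixation points per image, as retained later in the experiments, this yields 180 patches per image regardless of the image dimensions.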

CNN Model
A wide range of CNN models with various architectures have been proposed in the literature. Some researchers proposed their own models [45] trained from scratch, while others employed pre-trained models like AlexNet [72] and ResNet [81]. In our technique, we used the model introduced by the Oxford Visual Geometry Group (VGG), as this model is widely utilized and has provided decent results in many applications [67,82][83][84][85][86]. More precisely, we fine-tuned the pre-trained VGG16 model without data augmentation, since these treatments change the structure of the data and thus modify the perceived quality [87]. VGG16 is composed of 13 convolutional layers and 3 fully connected layers with an input of size 224 × 224 × 3 (color image) and an output of size 1000 (i.e., 1000 classes). In order to adapt this model to our context, we first replaced the input image layer of VGG16 by another image layer of size 32 × 32 × 3. The 3 initial fully connected layers were also replaced by 2 other fully connected layers of size 128 and 1, where the last fully connected layer is a regression layer to predict "continuous values". In order to predict the quality of a given image for different viewing distances, a feature vector was extracted from the last convolutional layer of our model and concatenated with the viewing distance D, normalized between 0 and 1, where 0 corresponds to 0*H and 1 to 6*H. In order not to give more importance to the viewing distance, the feature vector to which the viewing distance is concatenated was also normalized between 0 and 1. The resulting vector is then fed as input to the first fully connected layer, as described in Figure 3. All these modifications allow us to adjust the model to our task, but they also reduce the number of learnable parameters, since we now have around 14 M learnable parameters compared to the initial 138 M. To train our model, the learning rate and the momentum were set to 0.01 and 0.9, respectively. We utilized stochastic gradient descent as the optimization function, and the mean square error as the loss function. The number of epochs and the batch size were set to 25 and 32, respectively. After each epoch, the training data were shuffled and the model was stored. The model providing the best performance was finally retained. All the experiments were carried out with the configuration listed in Table 1.
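The construction of the input to the first fully connected layer can be sketched as follows. The distance normalization follows the mapping stated in the text (0*H to 0 and 6*H to 1). The text does not specify how the feature vector is normalized to [0, 1]; per-vector min-max scaling is assumed here purely for illustration, and the function names are hypothetical.

```python
MAX_DISTANCE_IN_HEIGHTS = 6.0  # 6*H is mapped to 1, 0*H is mapped to 0


def normalize_distance(distance_in_heights):
    """Map a viewing distance expressed in image heights to [0, 1]."""
    if not 0.0 <= distance_in_heights <= MAX_DISTANCE_IN_HEIGHTS:
        raise ValueError("distance outside the range covered by the normalization")
    return distance_in_heights / MAX_DISTANCE_IN_HEIGHTS


def min_max_normalize(features):
    """Scale a feature vector to [0, 1] (assumed scheme, not stated in the text)."""
    lo, hi = min(features), max(features)
    if hi == lo:
        return [0.0 for _ in features]
    return [(f - lo) / (hi - lo) for f in features]


def build_head_input(features, distance_in_heights):
    """Concatenate the normalized CNN features with the normalized distance."""
    return min_max_normalize(features) + [normalize_distance(distance_in_heights)]
```

For example, the two CID:IQ viewing distances, 2.5*H and 5*H, would be encoded as roughly 0.42 and 0.83, and the same patch features would be paired with each of them to obtain one prediction per distance.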

Datasets
Two different datasets that provide subjective quality scores for two different viewing distances were used to evaluate our method.

Evaluation Criteria
Pearson (PCC) and Spearman (SROCC) correlation coefficients were employed to evaluate the quality prediction of the introduced IQM. The coefficients were calculated between the subjective scores and the predicted image quality values for each viewing distance. A correlation coefficient of 1 indicates a perfect prediction and a correlation coefficient of 0 indicates no correlation.
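For reference, both coefficients can be computed with a few lines of standard-library Python. This is a minimal sketch rather than the evaluation code used in the experiments; SROCC is obtained as the Pearson correlation of the ranks, with tied values receiving their average rank.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)  # undefined if either sequence is constant


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SROCC)."""
    return pearson(average_ranks(x), average_ranks(y))
```

A monotonic but non-linear relation between predicted and subjective scores gives an SROCC of 1 while the PCC stays below 1, which is one reason a non-linear mapping is usually applied before computing the PCC.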
The predicted scores were mapped to the subjective scores through a non-linear logistic function, where Q_p and Q are the predicted and the mapped scores, respectively, and β1 to β5 are the fitting parameters.
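The mapping equation itself is not reproduced in this text. As an illustration only, the sketch below assumes the five-parameter logistic that is commonly used for this purpose in image quality evaluation; whether this exact form was used here is an assumption. The parameter names b1 to b5 stand for β1 to β5.

```python
import math


def logistic_map(q_p, b1, b2, b3, b4, b5):
    """Assumed five-parameter logistic mapping of a predicted score q_p.

    Q = b1 * (1/2 - 1 / (1 + exp(b2 * (q_p - b3)))) + b4 * q_p + b5
    """
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (q_p - b3)))) + b4 * q_p + b5
```

In practice the five parameters would be fitted by non-linear least squares between the mapped and the subjective scores before the PCC is computed; the fitting step is omitted here.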

Experimental Results
In this section, we first study the impact on the performance of the number of extracted patches. Our method is then evaluated on each dataset individually. After comparing the results to the state-of-the-art, we test the generalization capacity of our method through a cross dataset evaluation.

Impact of the Number of Fixation Points
As mentioned above, the number of patches extracted per image is fixed by the number of fixation points. Its impact on the performance was analyzed here by varying the number of fixation points from 10 to 200. For each number of fixation points, PCC and SROCC values were calculated. Figure 6 shows the correlation coefficients obtained on the CID:IQ dataset by splitting the database according to the reference images. The test set was composed of one fold (20% of the reference images and their degraded versions), while the training-validation set included the remaining images. The latter was split randomly without overlap (80% for training and 20% for validation). This protocol ensures that there is no overlap or redundancy (in terms of image content) between sets. This procedure was repeated five times and the correlations were calculated by concatenating the scores.
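A content-independent split of this kind can be sketched as follows. The sketch assigns whole reference images (and therefore all of their degraded versions) to folds. Splitting the training-validation part by reference image as well is an assumption, since the text only states that this split was random and without overlap. The function name and the seed are hypothetical.

```python
import random


def split_by_reference(reference_ids, n_folds=5, val_fraction=0.2, seed=0):
    """Build n_folds train/val/test splits that never share a reference image."""
    rng = random.Random(seed)
    refs = sorted(set(reference_ids))
    rng.shuffle(refs)

    # Each fold holds roughly 1/n_folds of the reference images.
    folds = [refs[i::n_folds] for i in range(n_folds)]

    splits = []
    for k in range(n_folds):
        test = folds[k]
        rest = [r for j, fold in enumerate(folds) if j != k for r in fold]
        rng.shuffle(rest)
        n_val = max(1, round(val_fraction * len(rest)))
        splits.append({"train": rest[n_val:], "val": rest[:n_val], "test": test})
    return splits
```

Every degraded image then inherits the set of its reference, and the test predictions of the five folds can be concatenated before computing the correlations, as described above.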
As expected, the lower the number of fixation points, the lower the correlation. Indeed, the number of fixation points fixes the amount of data in the training set, which directly impacts the capacity of our model to learn the data. Correlations for the two viewing distances were close and increased with the number of fixation points. The best performance was reached with 180 fixation points (i.e., 180 patches extracted per image). In the following, the quality of each image is thus predicted through 180 patches, where the image quality is the average of the qualities predicted for the 180 patches.
The attention of an observer can be influenced by the viewing distance or the distortions [88], and as the incorporated scanpath predictor does not account for this, it may have an influence and should be investigated in future work.
In order to better show the relevance of our pipeline, we compared the proposed saliency-based patch selection with the classical approach (i.e., no selection) and a random selection. These tests were carried out using the modified version of VGG16 and the baseline model, which corresponds to the modified version of VGG16 without integrating the viewing distance (i.e., one input and one output). The latter was trained using subjective scores of one viewing distance (i.e., 2.5*H) and tested on two distances (i.e., 2.5*H and 5*H). This procedure was applied so as not to associate several outputs with a single input. Table 2 shows the correlations obtained on the CID:IQ dataset. As can be seen, a random selection of patches provided poorer results, while the proposed saliency-based patch selection improved the performance. Compared to the random selection, the use of all patches (i.e., no selection) improved the performance, but the global correlation was still lower than that achieved with the proposed saliency-based patch selection. In addition, the integration of the viewing distance as input to the baseline model (i.e., the proposed model) highly improved the performance regardless of the selection type, especially when the proposed saliency-based selection step was applied.
Figure 7 shows the loss values obtained on the training and validation sets across the epochs for one random split of the CID:IQ dataset. The loss values of both sets decreased until stabilizing, indicating no overfitting. Figure 8 shows the patch scores predicted for a given image and the corresponding subjective scores for two viewing distances.
As can be seen, there is a gap between the predicted patch scores for the distance 50 cm (blue curve) and those predicted for the distance 100 cm (green curve). This gap reflects well the gap between the corresponding subjective scores (black and red dotted lines). Therefore, the integration of the viewing distance as input to our model allowed the predicted score to be shifted appropriately according to the viewing distance considered.

Individual Evaluation
In this section, we present the results of our method for both datasets (CID:IQ and VDID2014). For each of them, we computed the correlations according to the viewing distances as well as the correlations per degradation type.

CID:IQ
We evaluated our method on the CID:IQ dataset by applying the protocol described in Section 4.1 (i.e., 5-fold cross-validation). Table 3 shows the correlations for each viewing distance. The results were compared to our previous work (CNN-VD) [58], where only one CNN model with two outputs was used. As can be seen, high performances were obtained for the two viewing distances with close correlation values. Compared to CNN-VD, the correlations increased, with an improvement in terms of PCC of 1.4% for the two distances. In Table 4, the correlation coefficients for each distortion, at five levels, are shown and compared to those of MSSIM and CNN Quality. In general, the performances were high for all distortions and viewing distances. The highest values were obtained for SGCK at 50 cm and at 100 cm, while the lowest ones were obtained for JPEG at 50 cm and JP2K at 100 cm. We also noticed that for SGCK, JPEG, GB, and DeltaE the proposed method has slightly higher coefficients for the 100 cm viewing distance compared to 50 cm, while it was the opposite for JP2K and PN. Compared to the MSSIM and CNN Quality metrics, the proposed metric has good performance, giving a higher correlation value for JPEG, PN, SGCK and DeltaE at 50 cm and for JPEG, SGCK and DeltaE at 100 cm. It is also noticeable that the proposed method is more stable compared to MSSIM.

VDID2014
To evaluate our method on VDID2014, the dataset was split into four folds (i.e., 25% of the reference images and their degraded versions for the test set and the rest for the training-validation set). As can be seen in Table 5, the performances were higher than those obtained on CID:IQ and the best results were obtained for the distance 4*H. Compared to CNN-VD, the improvements in terms of PCC are 5.43% for the distance 4*H and 2.84% for 6*H. Table 6 presents the result for each degradation type, where each degradation has five levels. The correlations were generally high for all distortions. Contrary to the results on CID:IQ, all the correlations of our method were higher for the distance 6*H. The highest values were obtained for JPEG and JP2K for both 4*H and 6*H. Compared to the MSSIM and CNN Quality metrics, the proposed method has good performance, giving a higher correlation value for JP2K, JPEG and GB for 4*H and for JP2K, JPEG and WN for 6*H. MSSIM obtained the best results for WN for 4*H, while CNN Quality achieved the best results for GB for 6*H.

Computation Time
We also compared the computation time of the proposed pipeline to that with no selection. It is worth noting that we compared here only the computation time related to the quality prediction, without including the saliency-based patch selection. As shown in Table 7, the quality of a given image is predicted using 180 patches whatever the dimension of the image, while 625 and 310 patches are used for images of the CID:IQ and VDID2014 datasets, respectively. The results show that the proposed method based on saliency is faster compared to using all patches.
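The number of patches without selection follows from tiling the image with non-overlapping 32 × 32 patches. The short sketch below assumes CID:IQ images of 800 × 800 pixels, a size that is not stated in this text, and a hypothetical helper name; under that assumption it reproduces the 625 patches quoted for CID:IQ.

```python
PATCH_SIZE = 32


def count_patches(height, width, size=PATCH_SIZE):
    """Number of non-overlapping size x size patches that fit in an image."""
    return (height // size) * (width // size)


# Assumed CID:IQ image size (not given in the text): 800 x 800 pixels.
print(count_patches(800, 800))  # 25 * 25 = 625
```

With 180 saliency-selected patches instead of 625, the network processes about 29% of the patches per CID:IQ image, which is consistent with the reported reduction in prediction time.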
For FR approaches, the best performance on CID:IQ was obtained by CSSIM for 2.5*H and MSSIM for 5*H. CSSIM is a metric based on predictability of blocks simulating the visual system, and has also been shown to perform better than SSIM [90]. The main difference between SSIM and MSSIM is the multi scale analysis that allowed an improvement of 6% on CID:IQ. On VDID2014, CNN Quality achieved the best results for the two viewing distances. No-reference IQMs failed to predict quality for both databases, even after being retrained. Our distance-based method performed better than all the compared ones by more than 1.4% on CID:IQ. On VDID2014, our method obtained competitive results, since PSNR2 and SSIM2 performed better than our method for 6*H. However, our method is blind and thus does not need any information from the reference image. Furthermore, our method performed better than most of the FR metrics.
To show the global performance of the suggested method, we calculated the correlation regardless of the viewing distance and the degradation type. Tables 9 and 10 present the results on the CID:IQ and VDID2014 datasets, respectively. Our method performed better than all the compared ones by more than 2.7% on CID:IQ. On VDID2014, the results show that the suggested IQM achieved the second best PCC value. However, our method remains highly competitive, since most of the compared methods obtained a PCC smaller than 0.9 and the best results (i.e., SSIM2 and PSNR2) were achieved by two FR metrics.
Table 9. PCC and SROCC values on CID:IQ, computed regardless of the viewing distance.

Cross Dataset Evaluation
We evaluated the generalization ability of our method by training our model on CID:IQ and testing it on VDID2014 without overlap between both. It is worth noting that cross dataset evaluation in our context is more difficult than that traditionally carried out in image quality assessment. Indeed, in addition to the difference in terms of content between both datasets, the two viewing distances considered by each of the databases are different. In other words, we evaluated here the efficiency of our method to predict the quality of unknown images for unknown viewing distances. Table 11 shows the correlations obtained for both viewing distances as well as the global performance. Compared to the individual evaluation, the performance decreased but remained high. The same PCC value was obtained for the two distances. In addition to the differences in viewing distances (2.5*H and 5*H on CID:IQ against 4*H and 6*H on VDID2014) and in content, this decrease is most likely due to the fact that certain degradation types were not considered during the training step (White Noise on VDID2014 and Poisson Noise on CID:IQ).

Conclusions
We have proposed a novel CNN-based blind image quality method that predicts subjective scores for different viewing distances. The method first selects relevant patches from the image based on a scanpath predictor; these patches are then used to extract features from a CNN based on VGG16. The feature vector, concatenated with the viewing distance, is fed to a fully connected layer to predict the perceived quality. Our method was evaluated on two different databases. Results obtained by our method were compared to the state-of-the-art and showed its consistency with the subjective judgments. A cross-dataset experiment was also carried out and showed the generalization ability of our method to predict the quality of unknown images for unknown viewing distances.
In future work, the combination of several deep learning-based features should be studied. In addition, the use of other techniques for incorporating attention, foveation [107] and multi scale analysis can be seen as potential future work. The integration of more viewing distances will also be investigated.