Full-Reference Quality Metric Based on Neural Network to Assess the Visual Quality of Remote Sensing Images

Remote sensing images are subject to different types of degradations. The visual quality of such images is important because their visual inspection and analysis are still widely used in practice. To characterize the visual quality of remote sensing images, the use of specialized visual quality metrics is desired. Although the attempts to create such metrics are limited, there is a great number of visual quality metrics designed for other applications. Our idea is that some of these metrics can be employed in remote sensing under the condition that they have been designed for the same distortion types. Thus, image databases that contain images with the types of distortions of interest have to be looked for. It has been checked which known visual quality metrics perform well for images with such degradations, and the opportunity to design neural network-based combined metrics with improved performance has been studied. It is shown that, for such combined metrics, the Spearman correlation coefficient with mean opinion score exceeds 0.97 for subsets of images in the Tampere Image Database (TID2013). Different types of elementary metric pre-processing and neural network design have been considered, and it has been demonstrated that it is enough to have two hidden layers and about twenty inputs. Examples of using known and designed visual quality metrics in remote sensing are presented.


Introduction
Currently, there are a great number of applications of remote sensing (RS) [1,2]. There are many reasons behind this [3,4]. Firstly, modern RS sensors are able to provide data (images) from which useful information can be retrieved for large territories with appropriate accuracy (reliability). Secondly, there exist systems capable of carrying out frequent observations (monitoring) of given terrains that, in turn, allow the analysis of changes or development of certain processes [5,6]. Due to the modern tendency to acquire multichannel images (a set of images with different wavelengths and/or polarizations [7][8][9][10]) and their pre-processing (that might include co-registration, geometric and radiometric correction, calibration, etc. [1]), RS data can be well prepared for further analysis and processing.
However, this does not mean that the quality of RS images is perfect. There are numerous factors that influence RS image quality (in a wide sense) and prevent solving various tasks of RS data processing. For example, a part of a sensed terrain can be closed by clouds, and this sufficiently decreases the quality (usefulness) of such optical or infrared images [11]. Concerning visual quality, several typical situations can be distinguished:
• an image seems to be perfect, i.e., no degradations can be visually detected (sharpness is satisfactory, no noise is visible, no other degradations are observed);
• an image is multichannel and there are component images of very high quality as well as component images of quite low quality [9,23]; for RS data with a large number of components, i.e., hyperspectral images, this can be detected by component-wise visualization and analysis of images;
• an acquired image is originally degraded in some way, e.g., due to the principle of imaging system operation; good examples are synthetic aperture radar (SAR) images, for which speckle noise is always present [8,24].
This means that the quality of the original (acquired) images should be characterized quantitatively using some metrics. The efficiency of RS image pre-processing (e.g., denoising or lossy compression) should be characterized as well. In this sense, there are several groups of metrics (criteria) that can be used for this purpose. Firstly, there are practical situations when full-reference metrics can be applied. This happens, for example, in lossy compression, when a metric can be calculated using the original and compressed images [17-19]. Secondly, no-reference metrics can be used when distortion-free data are not available [11,24]. Usually, some parameters of an image are estimated to calculate a no-reference metric in that case. Note that there are quite successful attempts to predict full-reference metrics without having the reference images [25,26]. Finally, there are many metrics that characterize image quality (or the efficiency of image processing) from the viewpoint of the quality of solving the final tasks [27-29]. These can be, e.g., the area under the curve [30] or the probabilities of correct classification [31].
It is obvious that many metrics are correlated. For example, RS image classification criteria depend on the quality of the original data, although the efficiency of classification also depends on the used set of features, the applied classifier, and the used training method. In this paper, we concentrate on metrics characterizing the quality of original images or images after pre-processing, such as denoising or lossy compression, focusing on full-reference metrics and, in particular, visual quality metrics. On the one hand, it is often desired to establish what metric value corresponds to the invisibility of distortions [61]. On the other hand, there is a desire to create a universal metric capable of performing well for numerous and various types of distortions. In this case, researchers try to get maximal SROCC values for universal databases like TID2013.
There are 25 test color images in TID2013 and five levels of each type of distortion. In other words, there are 25 reference images and 3000 distorted images (120 distorted images for each test image). For each test image, the 120 distorted images are partly compared with each other in a tristimulus manner (a better-quality image is chosen among two distorted ones with the corresponding reference image simultaneously presented on the screen). This has been done by many observers (volunteers that have participated in the experiments), and the results have been jointly processed. As a result, each distorted image has a mean opinion score (MOS) that can potentially vary from 0 to 9 but, in fact, varies from about 0.2 to about 7.2 (the larger, the better).
Many visual quality metrics have been studied for TID2013. A traditional approach to analysis or verification is to calculate a metric value for all distorted images and then to calculate the SROCC or Kendall rank order correlation coefficient (KROCC) [63] between metric values and MOS. SROCC values approaching unity (or −1) show that there is a strict (although possibly nonlinear) dependence between a given metric and MOS and that such a metric can be considered a candidate for practical use. A metric can be considered universal if it provides high SROCC for all considered types of distortion. For example, both PSNR and SSIM provide a SROCC of about 0.63 for a full set of distortion types (see data in Table 5 in [50]). These are, certainly, not the best results since some modern elementary metrics produce a SROCC approaching 0.9 [50]. The term "elementary metric" is further used to distinguish most metrics considered in [50] from the combined and NN-based metrics proposed recently.
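The rank-correlation check described above can be sketched in a few lines. The metric values and MOS below are synthetic stand-ins (a monotone function of MOS plus noise), not TID2013 data:

```python
# Sketch of verifying a metric against MOS via SROCC and KROCC.
# Synthetic data: a hypothetical metric that is monotone in MOS plus noise.
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(0)
mos = rng.uniform(0.2, 7.2, size=300)                 # hypothetical subjective scores
metric = 10 * np.log10(1.0 + mos) + rng.normal(0, 0.3, size=300)

srocc, _ = spearmanr(metric, mos)                     # rank correlation (SROCC)
krocc, _ = kendalltau(metric, mos)                    # Kendall rank correlation (KROCC)
print(f"SROCC = {srocc:.3f}, KROCC = {krocc:.3f}")
```

Values of SROCC approaching unity indicate a strict (possibly nonlinear) monotone dependence, exactly as used in the verification methodology above.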
In practice, RS images can be degraded by additive Gaussian noise (the conventional noise model for optical images [11]), by noise with different intensities in the component images [9], and by spatially correlated noise [64]. Due to image interpolation or deblurring, images corrupted by masked or high-frequency noise can also be encountered [14]. Quantization noise may occur due to image calibration (range changing). There are also numerous reasons why blur can be observed in RS images [11,14]. Image denoising is a typical stage of image pre-processing [9,10] after which specific residual noise can be observed, whereas multiplicative noise is typical for SAR images [7,8]. Distortions due to compression take place in many practical cases. Certainly, JPEG and JPEG2000 considered in TID2013 are not the only options [17-19], but they can be treated as representative of the Discrete Cosine Transform (DCT) and wavelet-based compression techniques.
Hence, TID2013 images in general, and the subsets "Noise" and "Actual" in particular, provide a good opportunity for a preliminary (rough) analysis of the applicability of existing metrics to the visual quality assessment of RS images. Among the known metrics, only some are designed specifically for color images. The other ones are defined for grayscale images, and their mean value is usually calculated if such a metric is applied component-wise to color images. Both types are analyzed since they are of interest for our study.

Analysis of Elementary Metrics' Performance for TID2013 Subsets
Currently, there is a great number of visual quality metrics. Some of them were designed 10 or even 20 years ago, while others have been proposed recently. Some can be considered modifications of PSNR (e.g., PSNR-HVS-M [65]) or SSIM (e.g., Multi-Scale Structural SIMilarity, MS-SSIM [66]). Other metrics have been designed based on different principles. Some metrics are expressed in dB and vary in wide limits, whereas others vary in the range from 0 to 1. There are also metrics for which smaller values correspond to better quality (e.g., DCTune [67]). However, a detailed analysis of the principle of operation of all the elementary metrics considered here is outside the scope of this paper, since what matters most is their performance for the subsets of interest. To avoid presenting many mathematical formulas in a long appendix, similarly to [68], a brief description of the metrics is summarized in Table 1.
The SROCC values for fifty elementary metrics are presented in Table 2 for all types of distortions in descending order, as well as for the subsets "Noise", "Actual", and "Noise&Actual" (the latter includes images with all types of distortions present in at least one subset). As can be seen, there are a few quite universal metrics, e.g., the Mean Deviation Similarity Index (MDSI) [69], Perceptual SIMilarity (PSIM) [70], and Visual Saliency-Induced Index (VSI) [71], for which SROCC almost reaches 0.9. As expected, the SROCC values for the subsets are larger than for all types of distortions. The "champion" for the subset "Noise" is MDSI (SROCC = 0.928); the best results for the subset "Actual" (SROCC = 0.939) are provided by several metrics (MDSI, PSNRHA [72], PSNRHMAm [73]); for both subsets together, the largest SROCC (equal to 0.937) is again provided by MDSI. Note that the performance of some metrics applied to color images depends on the method of color-to-grayscale conversion. The results presented in this paper for the NN-based metrics have been obtained assuming the same RGB to YCbCr conversion for all individual metrics, with the first component Y used as the equivalent of the grayscale image. Such results can be considered appropriate for many practical applications. Nevertheless, it is still interesting whether the SROCC values can be further improved.

Design of Combined Image Quality Metrics
The idea of combined (hybrid) metrics for general-purpose IQA assumes that different metrics utilize various kinds of image data, and therefore different features and properties of images may be used in parallel. Hence, one may expect good results for a combination of metrics that come from various "families" of metrics, complementing each other. This assumption has been initially motivated by the construction of the SSIM formula [41], where three factors representing luminance, contrast, and structural distortions are multiplied. Probably the first combined full-reference metric [54] has been proposed as a nonlinear combination of three metrics: MS-SSIM [66], Visual Information Fidelity (VIF) [74], and the Singular Value Decomposition-based R-SVD [75], and verified for TID2008 [76], leading to SROCC = 0.8715 for this dataset. Nevertheless, in this research, the optimization goal has been chosen as the maximization of the Pearson linear correlation coefficient (PCC) without the use of nonlinear fitting. The proposed combined metric is the product of the three above-mentioned metrics raised to different powers with optimized exponents.
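The product-of-powers idea from [54] can be sketched as follows. The three "elementary metrics" here are synthetic (positive, monotone in MOS, noisy), and the exponents are tuned to maximize the PCC against MOS; this is an illustration under stated assumptions, not a reproduction of the published metric:

```python
# Sketch: combined metric = product of elementary metrics raised to optimized
# powers, with exponents chosen to maximize PCC against MOS. Synthetic data.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
mos = rng.uniform(0.2, 7.2, 400)
# Three hypothetical elementary metrics: positive, monotone in MOS, noisy.
m = np.stack([
    0.5 + 0.07 * mos + rng.normal(0, 0.05, 400),   # an SSIM-like metric
    1.0 + 0.5 * mos + rng.normal(0, 0.2, 400),     # a VIF-like metric
    np.exp(0.3 * mos) + rng.normal(0, 0.2, 400),   # another monotone metric
])

def neg_pcc(exponents):
    combined = np.prod(m ** np.asarray(exponents)[:, None], axis=0)
    return -pearsonr(combined, mos)[0]            # negate: minimize => maximize PCC

res = minimize(neg_pcc, x0=np.ones(3), method="Nelder-Mead")
print("optimized exponents:", res.x, "  PCC:", -res.fun)
```

Starting from unit exponents (a plain product), the optimizer can only improve the PCC, which mirrors the exponent-optimization step described above.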
Since the correlation of the R-SVD metric with subjective quality scores (MOS values) is relatively low, similarly to MSVD [77] (see Table 2), better results may be obtained by replacing it with Feature SIMilarity (FSIM) [78], leading to the Combined Image Similarity Metric (CISI) [55] with SROCC = 0.8742 for TID2008, although again optimized towards the highest PCC. Nevertheless, the highest SROCC obtained by optimizing the CISI weights for the whole TID2013 equals 0.8596. Another modification [56], leading to SROCC = 0.9098 for TID2008, has been based on four metrics, where FSIMc has been improved by the use of optimized weights for the gradient magnitude and phase congruency components, with RFSIM [79] added. Another idea [57], utilizing the support vector regression approach to the optimization of PCC for five databases with additional context classification for distortion types, has achieved SROCC = 0.9495 for seven combined metrics and SROCC = 0.9403 using four of them. Nevertheless, all these results have been obtained for the less demanding TID2008 [76], the earlier version of TID2013 [50], containing only 1700 images (in comparison to 3000) with a smaller number of distortion types and levels.
Recently, another approach to combined metrics, based on the use of the median and alpha-trimmed mean of up to five initially linearized metrics, has been proposed [53]. The best results obtained for TID2013 are SROCC = 0.8871 for the alpha-trimmed mean of five metrics: Information Fidelity Criterion (IFC) [82], DCTune [67], FSIMc [78], Sparse Feature Fidelity (SFF) [92] and PSNRHMAm [73]. Slightly worse results (SROCC = 0.8847) have been obtained using the median of nearly the same metrics (only replacing IFC [82] with a pixel-based version of Visual Information Fidelity-VIFP [74]).
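The robust combination from [53] can be illustrated schematically. The five "metric" columns below are synthetic stand-ins for metrics that have already been linearized (mapped) to the MOS scale; the alpha-trimmed mean and the median are then taken across metrics:

```python
# Sketch: median and alpha-trimmed mean of several linearized metrics.
# Synthetic data: five hypothetical metrics = MOS plus noise of varying level.
import numpy as np
from scipy.stats import trim_mean, spearmanr

rng = np.random.default_rng(2)
mos = rng.uniform(0.2, 7.2, 500)
sigmas = np.array([[0.4], [0.5], [0.6], [0.7], [0.8]])       # per-metric noise levels
linearized = mos[None, :] + rng.normal(0.0, sigmas, (5, 500))

combined_trim = trim_mean(linearized, 0.2, axis=0)  # drop top/bottom 20% per image
combined_med = np.median(linearized, axis=0)

print("SROCC, alpha-trimmed mean:", spearmanr(combined_trim, mos)[0])
print("SROCC, median:            ", spearmanr(combined_med, mos)[0])
```

Both robust combiners typically outperform any single noisy column, which is the rationale behind the approach in [53].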

Neural Network Design and Training for the Considered Subsets
In recent years, neural networks (NNs) have demonstrated a very high potential in solving many tasks related to image processing. Their use is often treated as a way to obtain benefits in design and performance. Hence, the peculiarities and possibilities of using NNs in our application are briefly considered, i.e., in the design of new, more powerful full-reference metrics for images with the aforementioned types of distortions. For this purpose, the requirements for such metrics are recalled below.
A good NN-based metric should provide a reasonable advantage in performance compared to the elementary metrics. Since we deal with SROCC as one quantitative criterion of metric performance, it should be considerably improved compared to the already reached values of 0.93…0.94. Since the maximal value of SROCC is unity, its improvement by 0.02…0.03 can be considered sufficient. The other relevant aspects are the input parameters and the NN structure. Since a typical requirement for a full-reference metric is that it performs rather fast, the input parameters should be calculated easily and quickly. Certainly, their calculation can be done in parallel or accelerated somehow, but none of the input parameters should be too complex. The structure of the used NN should be as simple as possible. A smaller number of hidden layers and fewer neurons in them, without a loss of performance, are desired. A smaller number of input parameters can also be advantageous.
A brief analysis of existing solutions shows the following:
1. Neural networks have already been used in the design of full-reference quality metrics (see, e.g., [58,107-110]); the metric [107] employs feature extraction from reference and distorted images and uses deep learning in convolutional metric design, providing SROCC = 0.94 for all types of distortions in TID2013; E. Prashnani et al. [108] have slightly improved the results of [107] by exploiting a new pairwise-learning framework; Seo et al. [109] reached SROCC = 0.961 using deep learning;
2. There can be different structures of NNs (despite the popularity of convolutional networks, standard multilayer ones can still be effective enough) and different sets of input parameters (both certain features and elementary metrics can be used).
Keeping this in mind, our idea is to use a set of elementary quality metrics as inputs and apply NNs with a quite simple structure to solve our task: to get a combined metric (or several combined metrics) with performance sufficiently better than that of the best elementary metric. Then, a set of particular tasks to be solved arises, concerning the choice of input metrics, the NN structure, the pre-processing of inputs, and the limits of the NN output. Regarding the last question, since we plan to exploit TID2013 in our design, it should be recalled that the quality of images in this database is characterized by the mean opinion score (MOS). The main properties of MOS in TID2013 are determined by the methodology of the experiments carried out by observers. Potentially, MOS could be from 0 to 9 but, as the result of the experiments, it varies in the limits from 0.24 to 7.21 [53]. Moreover, the analysis of MOS and image quality [53] has shown that four gradations of image quality can be distinguished with respect to MOS: excellent, good, middle, and bad.
This "classification" is a little subjective, hence some explanations are needed. Images are considered to have excellent quality if distortions in them cannot be visually noticed. For images with good quality, the MOS values have ranks from 201 to 1000, and distortions can be noticed by careful visual inspection. If the MOS values have ranks from 1001 to 2000, the image quality is classified as middle (the distortions are visible but not annoying). The quality of the other images is conditionally classified as bad: the distortions are mostly annoying. Figure 1 illustrates examples of distorted images for the same reference image (#16 in TID2013), which has neutral content and medium complexity, being similar to RS images (as stated earlier, there is an obvious tendency to convergence of these types of images). The values of MOS and three elementary metrics are presented for them as well. The image in Figure 1a corresponds to the first group (excellent quality), and it is really difficult to detect distortions. The image in Figure 1b belongs to the second group, and distortions are visible, especially in homogeneous image regions. The image in Figure 1c is a good representative of the third group of images, for which distortions are obvious but not yet annoying. Finally, the image in Figure 1d is an example of a bad quality image. The presented values of the metrics show how they correspond to quality degradation and can characterize the distortion level. As may be noticed, not all presented metrics reflect the subjective quality perfectly, e.g., there is a higher PSNR value for the bad quality image (Figure 1d) than for the middle quality image. Similarly, the PSNRHA and MDSI metrics do not correspond to their MOS values for almost unnoticeable distortions, i.e., excellent and good quality.
Taking this into account, it has been decided that the NN output should be in the same limits as the MOS. This means that error minimization with respect to MOS can be used as the target function in the NN training. If needed, MOS (the NN output) can be easily recalculated to another scale, e.g., from 0 to 1.
The penultimate question concerns input pre-processing. In the NN theory, it is often recommended to carry out some preliminary normalization of the input data (features) if they have different ranges of variation [61]. This is true for the elementary metrics: for example, PSNR and PSNR-HVS-M, both expressed in dB, can vary in wide limits (even from 10 dB to 60 dB) but, for the distorted images in TID2013, they vary in narrower limits. PSNR, used for setting the five levels of distortions, has approximately five reached values (21, 24, 27, 30, and 33 dB), although, in fact, the values of this metric for images in TID2013 vary from about 13 dB to about 41 dB. Similarly, PSNR-HVS-M varies from 14 dB to 59 dB, which mainly corresponds to the "operation limits" starting from very annoying distortions up to practically perfect quality (invisible distortions). MDSI varies from 0.1 to 0.55 for images in TID2013, where larger values correspond to lower visual quality. Some other metrics like SSIM, MS-SSIM, and FSIM vary in the limits from 0 to 1, where the latter limit corresponds to perfect quality. In fact, most values of these metrics are concentrated in the upper third of this interval (see the scatter plot in Figure 2 for the color version of the FSIM metric, referred to as FSIMc). An obvious general tendency of FSIMc to increase when MOS becomes larger is observed. Meanwhile, two important phenomena may also be noticed. Firstly, there is a certain diversity of the metric's values for the same MOS. Secondly, the dependence of FSIMc on MOS (or MOS on FSIMc) is nonlinear.
In this sense, linearization (fitting) is often used to get not only high values of SROCC but a high conventional (Pearson) correlation coefficient as well [69]. Then, two hypotheses are possible. The first one (H1) is that the NN, being nonlinear and able to adapt to the peculiarities of the input data, will "manage" this nonlinearity of the input-output dependence by itself. The second hypothesis (H2) is that elementary metric pre-processing in the form of fitting can be beneficial for further improvement (optimization) of the combined metric performance.
Concerning the fitting needed for the realization of H2, there are several commonly accepted options. One of them is to apply the Power Fitting Function (PFF) y(x) = a·x^b + c, where a, b, and c are the adjustable parameters. The fitting results can be characterized by the root mean square error (RMSE) of the scatter plot points with respect to the fitted curve (the smaller the RMSE, the better). The results obtained for the elementary metrics considered above (in Table 2) are given in the left part of Table 3. As one can see, the best fitting result (the smallest RMSE) is obtained for the metric MDSI (it equals 0.3945). The results for the metrics that are among the best according to SROCC, e.g., the Contrast and Visual Saliency Similarity Induced index (CVSSI), Multiscale Contrast Similarity Deviation (MCSD), PSNRHA, and Gradient Magnitude Similarity Deviation (GMSD) (see the data in Table 2), are almost equally good.
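A PFF fit of this kind can be sketched with SciPy. The metric values below are synthetic (an MDSI-like range with larger values meaning lower MOS), not TID2013 data:

```python
# Sketch: fitting y(x) = a*x**b + c (PFF) of MOS against a metric, RMSE as quality.
# Synthetic data: a hypothetical MDSI-like metric in the range 0.1..0.55.
import numpy as np
from scipy.optimize import curve_fit

def pff(x, a, b, c):
    return a * np.power(x, b) + c

rng = np.random.default_rng(3)
metric = rng.uniform(0.1, 0.55, 600)
mos = 7.5 - 9.0 * metric**1.3 + rng.normal(0, 0.3, 600)   # larger metric -> lower MOS

params, _ = curve_fit(pff, metric, mos, p0=(-1.0, 1.0, 5.0), maxfev=10000)
residuals = mos - pff(metric, *params)
rmse = np.sqrt(np.mean(residuals**2))
print("a, b, c =", params, "  RMSE =", rmse)
```

The RMSE computed here is exactly the fit-quality measure used to compare the fitting options in Table 3.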
Another fitting model that can be applied is, e.g., Poly2: y(x) = p1·x^2 + p2·x + p3, where p1, p2, and p3 are the parameters adjusted to produce the best fit. The results are very similar. For example, for MDSI, a minimal RMSE equal to 0.3951 is observed. For PSNRHA, RMSE = 0.4013 is achieved, i.e., slightly better than using the PFF. The detailed data obtained for the Poly2 fit are presented in the right part of Table 3. For clarity, the better of the two RMSE results is marked in bold in Table 3. Then, three options are possible in combined NN design: 1) to use the PFF for all elementary metrics; 2) to apply Poly2 for all elementary metrics; 3) to choose the best fitting for each elementary metric and apply it.
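The Poly2 alternative amounts to an ordinary quadratic least-squares fit; with the same synthetic metric as in the PFF sketch, it could look as follows (option 3 would then simply keep whichever of the two fits yields the lower RMSE for each metric):

```python
# Sketch: Poly2 fit y(x) = p1*x**2 + p2*x + p3 of MOS against a metric.
# Synthetic data, same hypothetical metric model as in the PFF sketch.
import numpy as np

rng = np.random.default_rng(4)
metric = rng.uniform(0.1, 0.55, 600)
mos = 7.5 - 9.0 * metric**1.3 + rng.normal(0, 0.3, 600)

p = np.polyfit(metric, mos, deg=2)                      # returns p1, p2, p3
rmse_poly2 = np.sqrt(np.mean((mos - np.polyval(p, metric)) ** 2))
print("Poly2 coefficients:", p, "  RMSE =", rmse_poly2)

# Option 3): if rmse_pff for the same metric were available, the per-metric
# choice would be: best = "PFF" if rmse_pff < rmse_poly2 else "Poly2"
```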
Earlier, in [53], different possible monotonic functions, including linear fitting, have been considered for combining elementary metrics. As practice has shown, they produce worse results (higher RMSE) in comparison to the PFF and Poly2.
The next considered issue is what structures and parameters of the NNs can be chosen and optimized. As noted above, we concentrate on conventional structures. We prefer to apply a multilayer NN instead of a deep learning approach because, in this way, many design aspects are clear; this prevents overtraining problems and allows for good generalization. In particular, the number of neurons in the input layer can be equal to the number of elementary metrics used. In addition, this solution can be easily trained using TID2013, avoiding the need for a large training set of data.
Concerning this number, different options are possible, including the use of all 50 considered elementary metrics. It is also possible to restrict the inputs to the best metrics from Table 2, i.e., to employ elementary metrics that have certain properties, for example, SROCC > 0.9 for the considered types of distortions (there are 22 such metrics), SROCC > 0.92 (14 metrics), or SROCC > 0.93 (7 metrics).
Nevertheless, the theory of NNs states that it is reasonable to apply input parameters that "add" or "complement" information to each other, i.e., that are not highly correlated. One possible approach is to evaluate the correlation between all pairs of metrics and then exclude the worse metric in each highly correlated pair.
Instead of that, in this paper, a very useful approach called Lasso regularization [111] (Lasso stands for Least Absolute Shrinkage and Selection Operator) has been applied to select the most unique metrics for the NNs. In machine learning, Lasso and Ridge regularizations are used to introduce additional constraints into the model and reduce overfitting.
The key feature of Lasso is that this method may introduce zero weights for the "noisy" and least important data. For the task of metric combination, this means that Lasso regularization can determine the elementary metrics that are the least useful for combining and leave the other ("most informative") ones. This study adopts the implementation of Lasso available in MATLAB®, so that, for different thresholds, zero coefficients for the metrics that can be excluded can be estimated. Accordingly, the number of non-zero values (NNZ) for each metric has been determined for several threshold conditions. Since the network output is the combined metric, one neuron is present in the output layer, but the number of hidden layers and the number of neurons in each hidden layer can be different. Two variants have been analyzed: two and four hidden layers. Furthermore, two variants of the numbers of neurons in the hidden layers have been considered, namely an equal number of neurons in each hidden layer and a twice smaller number of neurons in each successive hidden layer.
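The paper uses the MATLAB implementation of Lasso; an analogous metric selection could be sketched with scikit-learn as below. The ten "metric" columns are synthetic, and only the first four actually carry information about MOS, so Lasso should zero out the rest:

```python
# Sketch: Lasso-based selection of informative metrics (non-zero coefficients).
# Synthetic data; the regularization strength alpha=0.1 is an assumption.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n, n_metrics = 500, 10
mos = rng.uniform(0.2, 7.2, n)
X = rng.normal(0, 1, (n, n_metrics))
X[:, :4] += mos[:, None] * [[1.0, 0.8, 0.6, 0.4]]    # only columns 0..3 informative
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1).fit(X, mos)
selected = np.flatnonzero(lasso.coef_)               # metrics with non-zero weights
print("selected metric indices:", selected)
```

Counting such non-zero coefficients over a range of regularization strengths corresponds to the NNZ tallies described above.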
As the activation function in hidden layers, the hyperbolic tangent sigmoid transfer function is used, which also provides the normalization in the range (−1,1), whereas a linear function is used for the output layer.
An example of the NN structure is presented in Figure 3. In this case, there are nine inputs and the number of neurons in the hidden layers decreases gradually.
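A multilayer network of the described kind (metric inputs, two hidden layers with a halving number of neurons, tanh activations, linear output predicting MOS) could be sketched with scikit-learn, whose MLPRegressor uses an identity (linear) output activation. The data here are synthetic stand-ins for elementary metrics:

```python
# Sketch: a simple multilayer NN combining "elementary metrics" into a MOS
# estimate. Synthetic inputs: nine noisy monotone functions of MOS.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
n, n_inputs = 1625, 9                        # nine inputs, as in the Figure 3 example
mos = rng.uniform(0.2, 7.2, n)
X = mos[:, None] * rng.uniform(0.5, 1.5, n_inputs) + rng.normal(0, 0.5, (n, n_inputs))
X = StandardScaler().fit_transform(X)        # the normalization discussed earlier

# Two hidden layers with a twice smaller second layer; tanh hidden activations,
# linear output (MLPRegressor's default output for regression).
nn = MLPRegressor(hidden_layer_sizes=(18, 9), activation="tanh",
                  max_iter=3000, random_state=0).fit(X, mos)
srocc = spearmanr(nn.predict(X), mos)[0]
print(f"training SROCC = {srocc:.3f}")
```

The layer sizes (18, 9) are illustrative; the paper examines several such configurations.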
In addition to the NN structure, some other factors might influence its performance, such as the methodology of NN training and testing, the stability of the training results, the number of epochs, etc. The limited set of 1625 images available for the NN training and testing complicates our task. According to the traditional methodology, the available set should be divided into training and verification subsets in some proportion. In the conducted experiments, 70% of the images have been used for training and the remaining images for verification. Since the division of images into sets is random, the training and verification results can be random as well. To partly get around this uncertainty, the best data (producing the largest SROCC at the training stage) are presented below for each version of the trained NN of a given structure.
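The split-and-retrain methodology can be sketched as follows: several random 70/30 splits, a training run on each, and the run with the best training-stage SROCC retained. The data are synthetic, and the number of repetitions is an assumption for the sketch:

```python
# Sketch: repeated random 70/30 splits, keeping the run with the best
# training-stage SROCC. Synthetic metric inputs, as in the previous sketch.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from scipy.stats import spearmanr

rng = np.random.default_rng(8)
n = 1625
mos = rng.uniform(0.2, 7.2, n)
X = StandardScaler().fit_transform(
    mos[:, None] * rng.uniform(0.5, 1.5, 9) + rng.normal(0, 0.5, (n, 9)))

best = (-1.0, None)                          # (best SROCC_train, its SROCC_test)
for seed in range(3):                        # a few random splits/initializations
    X_tr, X_te, y_tr, y_te = train_test_split(X, mos, train_size=0.7,
                                              random_state=seed)
    nn = MLPRegressor(hidden_layer_sizes=(18, 9), activation="tanh",
                      max_iter=2000, random_state=seed).fit(X_tr, y_tr)
    s_tr = spearmanr(nn.predict(X_tr), y_tr)[0]
    if s_tr > best[0]:
        best = (s_tr, spearmanr(nn.predict(X_te), y_te)[0])
print(f"best SROCC_train = {best[0]:.3f}, corresponding SROCC_test = {best[1]:.3f}")
```

Comparing the retained SROCC_train with its SROCC_test mirrors the stability check reported in the results below.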
Concerning the training stage, each NN has been trained to provide as high a SROCC as possible. Obviously, other training strategies are also possible; in particular, the Pearson correlation coefficient (PCC) could be maximized instead. Nevertheless, the results of additionally using the PCC for training are not considered in this paper.
To give a better understanding of which NN structures have been analyzed and what parameters have been used, the main characteristics are presented in Table 4, where the number of input elementary metrics is provided for each case. As can be seen, there are quite a lot of possible NN structures.

Neural Network Training and Verification Results
The main criteria of the NN training and verification are the SROCC values. Four SROCCs have been analyzed: SROCCtrain(Max), SROCCtrain(Lasso), SROCCtest(Max), and SROCCtest(Lasso), which correspond to the training and test (verification) cases using the maximal and the Lasso-determined numbers of inputs. Starting from the NN with two hidden layers, where preliminary fitting is not used, the obtained results are presented in Figure 4. The analysis of the data shows the following:
• If Lasso is not used, increasing the number of metrics leads to a general tendency of increasing both SROCCtrain and SROCCtest; meanwhile, for more than 40 inputs, no further improvement is observed;
• If Lasso regularization is applied, the results are not as good when the number of inputs (Ninp) is smaller than 20; but once it exceeds 20, the performance practically does not depend on Ninp; this means that the Lasso method allows the simplification of the NN structure, minimizing the number of inputs and neurons in the other layers;
• The best (largest) SROCC values exceed 0.97, demonstrating that a substantial improvement (compared to the best elementary metric) is attained due to the use of the NN and parameter optimization;
• SROCCtrain and SROCCtest are practically the same for each configuration of the analyzed NN, so the training results can be considered stable;
• No essential difference in results has been found between equal and non-equal numbers of neurons in the hidden layers.
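For reference, the SROCC used as the training and verification criterion is simply the Pearson correlation of ranks; a minimal NumPy sketch (with average ranks over ties) is:

```python
import numpy as np

def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    def ranks(v):
        v = np.asarray(v, float)
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        for val in np.unique(v):        # average ranks over tied values
            mask = v == val
            r[mask] = r[mask].mean()
        return r
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Any monotone function of MOS gives SROCC = 1 (rank order is preserved).
mos = np.array([1.2, 3.4, 2.2, 5.0, 4.1])
metric = mos ** 3 + 7.0
rho = srocc(metric, mos)
```

This rank-based criterion is what makes SROCC insensitive to any monotone nonlinearity between a metric and MOS, which is why preliminary fitting matters less for it than for RMSE.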
Concerning another number of hidden layers, namely four, the obtained plots are given in Figure 5. The analysis of the obtained data makes it possible to draw two main conclusions. Firstly, there are no obvious advantages in comparison to NNs with two hidden layers. Secondly, all the other conclusions given above concerning SROCCtrain and SROCCtest and the influence of Ninp and Lasso regularization remain the same, i.e., it is reasonable to apply a limited number of elementary metrics, e.g., the 24 determined by Lasso.
The next question to be answered is: does preliminary fitting help? The answer is presented in Figure 6 with two sets of plots. Both sets are obtained for NNs with two hidden layers and an equal number of neurons in them. The plots in Figure 6a are obtained for the PFF, and those in Figure 6b for the best fit. A comparison of the corresponding plots in Figure 6 shows that the choice of fitting makes no substantial difference. Moreover, comparison to the corresponding plots in Figure 4a indicates that preliminary fitting does not produce a substantial performance improvement compared to the cases where it is not used. This means that the trained NNs provide this pre-processing by themselves.
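The preliminary fitting itself can be illustrated with a simple power-law fit of MOS against an elementary metric. The log-log least-squares version below is a stand-in for the PFF used in the paper and assumes strictly positive metric and MOS values:

```python
import numpy as np

def fit_power(metric, mos):
    """Fit MOS ~= a * metric**b by least squares in log-log coordinates.
    A simple stand-in for a power fitting function (PFF)."""
    m, y = np.asarray(metric, float), np.asarray(mos, float)
    b, log_a = np.polyfit(np.log(m), np.log(y), 1)   # slope, intercept
    return np.exp(log_a), b

# Synthetic check: data generated exactly as 0.5 * m**1.3 is recovered.
m = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 0.5 * m ** 1.3
a, b = fit_power(m, y)
```

Feeding `a * metric**b` instead of the raw metric value to the NN is what "preliminary fitting" amounts to; the experiments above show the NN learns an equivalent mapping on its own.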
The other conclusions that stem from the analysis of the plots in Figure 6 are practically the same as earlier. The Lasso method ensures that performance close to optimal can be provided for Ninp slightly larger than 20. The maximal attained values of SROCC are above 0.97 and smaller than 0.975. The results for four hidden layers and non-equal numbers of neurons in the hidden layers have been analyzed as well, and the best results are practically at the same level.
To prove that the number of hidden layers does not have any essential influence on the performance of the combined NN-based metrics, the metric characteristics for NNs with three and five hidden layers that used PFF for elementary metric pre-processing and equal numbers of neurons in all layers are presented in Figure 7. The analysis shows that the maximal attained values of the SROCC are even smaller than for NNs with two layers. Completing the analysis based on SROCC, it is possible to state the following:
• there are many configurations of NNs that provide approximately the same SROCC;
• keeping in mind the desired simplicity, the use of NNs with two hidden layers, without fitting, and with about 20 inputs is recommended.
Nevertheless, two other aspects are interesting as well: which elementary metrics are "recommended" by Lasso in this case, and are the conclusions drawn from the SROCC analysis in agreement with those that stem from the analysis of other criteria? Answering the first question, two good NN configurations can be found in Figure 4a, namely NNs with 16 and 24 inputs. The elementary metrics used by the NN with 16 inputs, as well as by the one with 24 inputs, are listed in Table 5. As can be seen, all 16 metrics from the first set are also present in the second set.
The first observation is that the metrics MDSI, CVSSI, PSNRHA, GMSD, IGM, HaarPSI, ADM, and IQM2, which are among the top 20 in Table 2, are present among the chosen ones. Some moderately good metrics, such as DSS, are also chosen. There are elementary metrics that are efficient according to the data in Table 2 but are not chosen, for example, MCSD, PSNRHMAm, and PSIM. Although for PSNRHMAm the reason for its exclusion by Lasso may be its high correlation with PSNRHVS and PSNRHA, the situation is not as clear for MCSD and PSIM. Meanwhile, the sets contain metrics such as WASH and MSVD that, according to the data in Table 2, do not perform well. In addition, both sets contain PSNR, and the second set also contains MSE, which is strictly connected to PSNR. This means that, although Lasso allows narrowing the sets of recommended elementary metrics, the result of its operation is not optimal.

Table 5. Elementary metrics selected by Lasso (SROCC with MOS; X marks membership in the 16-input and 24-input sets):

Metric            SROCC    16 inputs   24 inputs
MDSI [69]         0.8897   X           X
HaarPSI [102]     0.8730   X           X
MS-UNIQUE [101]   0.8708               X
PSNRHA [72]       0.8198   X           X
CVSSI [104]       0.8090   X           X
IGM [94]          0.8023   X           X
GMSD [95]         0.8004   X           X
IQM2 [96]         0.7955   X           X
DSS [97]          0.7915               X
ADM [88]          0.7861   X           X
MS-SSIM [66]      0.7872               X
RFSIM [79]        0.7721   X           X
CSSIM4 [105]      0.7394               X
DSI [106]         0.7114   X           X
VIF [74]          0.6816               X
PSNRHVS [84]      0.6536   X           X
PSNR              0.6396   X           X
MSE               0.6396               X
NQM [81]          0.6349               X
QILV [83]         0.5975               X
CWSSIM [86]       0.5551   X           X
IFC [82]          0.5229   X           X
WASH [93]         0.2903   X           X
MSVD [77]         0.1261   X           X

Nevertheless, there are several positive outcomes of the design using Lasso. They become obvious from the analysis of the data presented in Table 6. It may be observed that there are several good NN configurations that provide a SROCC of about 0.97 for about 20 inputs (this is also shown in the plots). Moreover, these NNs ensure RMSE values that are considerably smaller than for the best elementary metric after linearization (see the data in Table 3, where the best values are larger than 0.39).
In addition, the values of the Pearson correlation coefficient (PCC) are also large and exceed 0.97, indicating very good linearity properties of the designed combined metrics.
Having calculated the SROCC, RMSE, and PCC values, it is possible to carry out a more thorough analysis. The first observation is that SROCC, RMSE, and PCC are highly correlated in our case: larger SROCC and PCC correspond to smaller RMSE. The best results according to all three criteria are produced by the NN with configuration #3, although, considering the NN complexity, configuration #2 is good as well. A number of inputs smaller than 16 (e.g., 11 or 12 in configurations #1, #4, and #5) leads to worse values of the considered criteria. The use of NN configurations with preliminary fitting, a decreasing number of neurons in the hidden layers, and a larger number of hidden layers (configurations #4-#9) does not produce improvements in comparison to the corresponding configurations #1 and #2. Thus, the application of NN configuration #2 will be further analyzed.
Table 6 contains three columns marked by the heading "The best network" and three columns marked by "The top 5 results". It has been mentioned earlier that the results of NN learning depend on the random division of distorted images into training and testing sets. Because of this, to analyze the stability of training, we have calculated the mean SROCC, RMSE, and PCC values for the top five results of NN training for each configuration, as well as their standard deviations (provided in brackets). A comparison of the SROCC, RMSE, and PCC for the top five results to the corresponding values for the best network shows that the difference is small. Moreover, the conclusions that can be drawn from the analysis of these "average" results concerning NN performance fully coincide with those drawn from the analysis of the best network.
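The remaining two criteria are straightforward to compute; a sketch of the PCC and RMSE calculations, together with the mean and standard deviation over the top five training runs (the SROCC values below are made up for illustration), is:

```python
import numpy as np

def pcc(pred, mos):
    """Pearson linear correlation between predicted and true MOS."""
    return float(np.corrcoef(pred, mos)[0, 1])

def rmse(pred, mos):
    """Root-mean-square error of MOS prediction."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(mos)) ** 2)))

# Stability over repeated trainings: mean and sample standard deviation
# over, e.g., the top five runs (illustrative numbers, not paper data).
top5_srocc = np.array([0.971, 0.972, 0.970, 0.973, 0.971])
mean_srocc, std_srocc = top5_srocc.mean(), top5_srocc.std(ddof=1)
```

Reporting the mean with the standard deviation in brackets, as in Table 6, makes the run-to-run stability of training visible at a glance.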

Analysis of Computational Efficiency
In addition, the computational efficiency should be briefly discussed. For the NN-based metric with 16 inputs, the calculation of PSNR, MDSI, PSNRHVS, ADM, GMSD, WASH, IQM2, CVSSI, and HaarPSI is very fast or fast; the calculation of IFC and RFSIM takes several times longer, whereas the calculation of MSVD, CWSSIM, PSNRHA, IGM, and DSI takes even more time (about one order of magnitude). Thus, even with parallel calculation of the elementary metrics, computing the NN-based metric needs considerably more time than such good elementary metrics as MDSI, GMSD, or HaarPSI.
Hence, further research should be directed toward combining the fastest possible elementary metrics, providing a good balance between MOS prediction monotonicity (as well as accuracy) and computational efficiency. The average calculation time of the considered elementary metrics, determined for 512 × 384 pixel images from the TID2013 dataset, is provided in Table 7. Timing was measured on a notebook with an Intel i5 4th generation CPU and 8 GB RAM running the Linux Ubuntu 18.04 operating system, using MATLAB® 2019b software.
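A per-metric timing procedure of the kind used for Table 7 can be sketched as follows; the toy image list and the stand-in metric function are illustrative only:

```python
import time

def average_runtime(metric_fn, images, repeats=3):
    """Average wall-clock time of one elementary metric over a set of
    images, analogous to the per-metric timings gathered in Table 7."""
    start = time.perf_counter()
    for _ in range(repeats):
        for img in images:
            metric_fn(img)
    return (time.perf_counter() - start) / (repeats * len(images))

# Toy stand-in for an elementary metric (a real one takes an image pair).
toy_images = [list(range(100)) for _ in range(5)]
t = average_runtime(lambda img: sum(img), toy_images)
```

Averaging over repeats and images smooths out scheduler noise, which matters when ranking metrics whose runtimes differ by an order of magnitude.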

Verification for Three-Channel Remote Sensing Images
The analysis of the metrics' performance and their verification for multichannel RS images is a complex problem. Obviously, the best solution would be to have a database of reference and distorted RS images with a MOS for each distorted image. This line of future research is, in general, possible and expedient, but it requires considerable time and effort. Firstly, a large number of observers have to be recruited for the experiments. Secondly, these observers should have some skills in the analysis of RS data; this is the main problem. Thirdly, the set of images to be viewed and assessed has to be somehow agreed upon in the RS community.
Because of this, we are currently able to perform only some preliminary tests. The aim of the very first test is to show that the designed metric (in fact, the MOS predicted by the trained NN) produces reasonable results for particular images. For each of the images in Figure 1, the true MOS value is available. Furthermore, by processing the value of each metric (PSNR, PSNRHA, MDSI) with the previously fitted linearization parameters (power fitting function), the corresponding MOS estimate can be obtained. Comparing these values with the MOS value predicted by the NN (Table 8), it is clear that the NN provides good results. Generalizing from this specific case to the TID2013 Noise&Actual subset, an interesting scatter plot is obtained (Figure 8). A remarkable aspect is that the MOS values predicted from the selected elementary metrics can even be negative, whereas this drawback is absent when the MOS values are predicted using the NN. It is also seen that the NN-based metric provides a highly linear relation between the true MOS (range 0-7) and the predicted MOS. Problems of MDSI for small MOS values are observed as well.
The further studies relate to the four test images presented in Figure 9. These are three-channel pseudo-color images called Frisco, Diego2, Diego3, and Diego4, all of size 512 × 512 pixels, 24 bits per pixel, from the visible range of the Landsat sensor. The reasons for choosing them are twofold: these images are of different complexity, and they have already been used in some experiments [112]. The images Frisco and Diego4 are quite simple since they contain large homogeneous regions, whereas the image Diego2 has a very complex structure (a lot of small-size details and textures), and the image Diego3 is of medium complexity.
A standard requirement of visual quality metrics is monotonicity, i.e., monotonous increasing or decreasing as the "intensity" of a given type of distortion increases. This property can easily be checked for many different types of distortions. These images have been compressed using the lossy AGU method [113], providing different quality and compression ratios (CR). This has been performed by changing the quantization step (QS), where a larger QS relates to larger introduced distortions and, respectively, worse visual quality. Nevertheless, considering the possible extension of the proposed approach to multispectral RS images, as well as the highly demanded development of an RS image quality assessment database in the future, some more typical approaches to RS data compression should eventually be applied for this purpose, such as the CCSDS 123.0-B-2 standard [114].
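The monotonicity check described here is trivial to automate; the helper below and the PSNR values in the example are hypothetical:

```python
def is_monotone(values, decreasing=True):
    """Check the monotonicity requirement: metric values should change
    monotonically as the quantization step (QS) grows."""
    pairs = list(zip(values, values[1:]))
    if decreasing:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)

# Hypothetical PSNR values for QS = 10, 20, 30, 40 (larger QS, more distortion).
psnr_by_qs = [42.1, 38.7, 35.9, 33.2]
ok = is_monotone(psnr_by_qs, decreasing=True)
```

The same check runs with `decreasing=False` for metrics such as MDSI that grow as quality worsens.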
The quality (excellent, good and so on) is determined according to the results in [61]. The collected data are presented in Table 9. An obvious tendency is that all metrics, including the designed one, become worse as QS (and, respectively, CR) increases. The values of the NN-based metric are larger than the MOS values predicted from elementary metrics for the test image Frisco but smaller for the test image Diego2; this property is probably partly connected with image complexity. However, there is evidence that the designed metric behaves correctly. In Figure 10, the images compressed with QS = 40 are presented, for which distortions are always visible. A visual comparison of these images to the reference images in Figure 9a,b shows that distortions are more visible for the test image Diego2. This is clearly confirmed by the values of PSNRHA (36.32 dB and 32.63 dB, respectively; see Table 9). PSNR shows the same tendency, although MDSI does not. Hence, cases may be encountered in which conclusions drawn from the analysis of different quality metrics differ.
The analysis for the case of additive white Gaussian noise added to the considered four three-channel images has also been carried out. Four values of noise variance have been used, corresponding to the four upper levels of distortions exploited in TID2013. The mean MOS values (averaged over the 25 test images in TID2013) corresponding to these noise variances have also been obtained. For each image, PSNR, PSNRHA, and MDSI have been calculated and the corresponding predicted MOS values determined. The NN-based metric has been calculated as well. The obtained results are presented in Table 10.
The analysis shows that all metrics become worse (PSNR, PSNRHA, and the designed metric decrease, while MDSI increases) as noise variance increases, i.e., the monotonicity property is preserved. The predicted MOS values are quite close to the mean MOS; for good and middle quality, PSNRHA, MDSI, and the NN-based metric provide better MOS prediction than PSNR, whereas for bad quality images the situation is the opposite.
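The noise experiment can be reproduced in miniature: adding zero-mean Gaussian noise of increasing variance to an image and confirming that PSNR decreases. The flat toy image below is illustrative only:

```python
import numpy as np

def add_awgn(image, variance, seed=0):
    """Add white Gaussian noise of a given variance and clip to 8-bit range."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, np.sqrt(variance), image.shape)
    return np.clip(noisy, 0, 255)

def psnr(ref, dist):
    """Peak signal-to-noise ratio for 8-bit images, in dB."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(dist, float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

ref = np.full((64, 64), 128.0)             # flat toy "image"
psnr_low = psnr(ref, add_awgn(ref, 50))    # mild noise
psnr_high = psnr(ref, add_awgn(ref, 500))  # strong noise
```

For noise variances matching the TID2013 distortion levels, the same procedure applied channel-wise to the three-channel test images yields the values summarized in Table 10.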

Conclusions
The task of assessing the visual quality of remote sensing images subject to different types of degradations has been considered. It has been shown that there are no commonly accepted metrics and, therefore, their design is desired. The problems that can be encountered have been discussed and a solution has been proposed. An already existing database of distorted color images, from which it is possible to choose images with the types of distortions often observed in remote sensing, has been employed for this purpose. Its use allows the determination of the existing visual quality metrics that perform well for the types of degradations of interest. The best such metric provides a SROCC with MOS of about 0.93, which is a very good result. Moreover, the TID2013 database allows the design of visual quality metrics that use elementary visual quality metrics as inputs. Several NN configurations and methods of input data pre-processing have been studied. It has been shown that even simple NNs with two hidden layers and without input pre-processing (linearization) are able to provide SROCC values of about 0.97. The PCC values are of the same order, meaning that the relation between the NN output and MOS values is practically linear.
Some elementary metrics and the designed one have then been verified for three-channel remote sensing images with two types of degradations, demonstrating the monotonicity of the proposed metric's behavior. Moreover, it has been shown that the designed metric produces an accurate evaluation of MOS, which allows the classification of remote sensing images according to their quality. Since, as shown in the paper, color images and RS images (after applying the operations necessary to visualize them) have similar characteristics from the visual quality point of view, this manuscript provides the first evidence that a full-reference image quality metric, developed using a dataset (TID2013) that is not a remote sensing dataset, works well when applied to RGB images obtained from remote sensing data.
In the future, we plan to analyze ways of accelerating NN-based metrics by restricting the set of possible inputs, considering the computational efficiency of the input metrics as well. We also hope that perceptual experiments on image quality assessment involving specialists in RS image analysis will be carried out. Although for different types of RS data different types of degradations can be important, we have considered only those that are quite general and are present in TID2013. For example, characteristics of clouds could serve as additional features used as elementary metrics in the quality characterization of particular types of RS data in future research.
One of the limitations of the proposed method is the necessity of calculation of several metrics, which are not always fast. To overcome this issue, some of them may be calculated in parallel, although hardware acceleration possibilities should be provided in such cases.
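A minimal sketch of such parallel calculation, using Python threads and toy stand-ins for the elementary metrics, might look like this (the metric functions below are illustrative, not the actual implementations):

```python
from concurrent.futures import ThreadPoolExecutor

def combined_inputs(ref, dist, metric_fns):
    """Compute all elementary metrics for one image pair in parallel,
    one of the simple acceleration options mentioned above."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, ref, dist) for fn in metric_fns]
        return [f.result() for f in futures]

# Toy stand-ins for elementary metrics (real ones: PSNR, MDSI, HaarPSI, ...).
mse = lambda r, d: sum((a - b) ** 2 for a, b in zip(r, d)) / len(r)
mad = lambda r, d: max(abs(a - b) for a, b in zip(r, d))
vals = combined_inputs([1.0, 2.0, 3.0], [1.0, 2.5, 3.0], [mse, mad])
```

Since the overall latency is set by the slowest elementary metric, parallelism helps most when the slow metrics (MSVD, CWSSIM, IGM, DSI) are either accelerated or excluded.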
Another shortcoming is the fact that many metrics are developed for grayscale images and hence may be applied to a single channel only (or independently to three channels, leading to three independent results). Hence, a relevant direction of our further research will be the optimization of the color-to-grayscale conversion methods used for the individual elementary metrics. In some cases, an appropriate application of elementary metrics to multichannel images may also require changes of data types and dynamic ranges. Although the color-to-grayscale conversion based on the International Telecommunication Union (ITU) Recommendation BT.601-5 (with the use of RGB to YCbCr conversion, in fact limiting the range of the Y component to [16, 235]) has been assumed in this paper, applying the individual elementary metrics in various color spaces and with various conversion methods may lead to further increases in the combined metrics' performance.
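For reference, the limited-range BT.601 luma assumed here can be written as follows (65.481, 128.553, and 24.966 are the standard luma weights 0.299, 0.587, and 0.114 scaled by 219):

```python
import numpy as np

def bt601_luma(rgb):
    """Limited-range BT.601 luma: Y' in [16, 235] for R'G'B' in [0, 255].
    This mirrors the RGB -> YCbCr conversion assumed in the paper."""
    r, g, b = [np.asarray(c, float) / 255.0 for c in np.moveaxis(rgb, -1, 0)]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b

black = bt601_luma(np.zeros((1, 1, 3)))       # -> 16
white = bt601_luma(np.full((1, 1, 3), 255.0))  # -> 235
```

Grayscale-only elementary metrics would then operate on this Y' plane; a full-range variant (Y in [0, 255]) is one of the alternative conversions worth comparing in the planned study.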