Combined Full-Reference Image Quality Metrics for Objective Assessment of Multiply Distorted Images

In recent years, many objective image quality assessment methods have been proposed by different researchers, leading to a significant increase in their correlation with subjective quality evaluations. Although many recently proposed image quality assessment methods, particularly full-reference metrics, are in some cases highly correlated with the perception of individual distortions, there is still a need for their verification and adjustment for the case when images are affected by multiple distortions. Since one of the possible approaches is the application of combined metrics, their analysis and optimization are discussed in this paper. Two approaches to combining metrics have been analyzed, based on the weighted product and on the proposed weighted sum with additional exponential weights. The validation of the proposed approach, carried out using four currently available image datasets containing multiply distorted images together with the gathered subjective quality scores, indicates a meaningful increase in the correlation of the optimized combined metrics with subjective opinions for all datasets.


Introduction
The increasing popularity and availability of relatively cheap cameras, as well as electronic mobile devices equipped with visual sensors, has undoubtedly caused a dynamic growth in the applicability of image and video analysis to many tasks. Some obvious examples are related to video surveillance, traffic monitoring, video inspection and diagnostics, video-based navigation of mobile robots, or even autonomous vehicles. Other applications concern non-destructive testing, data fusion from various sensors, and many others, including modern Industry 4.0 solutions. Another factor influencing the growing popularity of image analysis is the development of open-source libraries, such as OpenCV, which make it possible to perform many tasks in real time, especially with hardware support provided by modern Graphics Processing Units (GPUs).
Nevertheless, machine and computer vision algorithms typically utilize natural images, which may be subject to various distortions, occurring not only during their acquisition but also caused by, e.g., lossy compression or the presence of transmission errors. This situation is typical for modern electronic devices, such as cameras, phones, and some other gadgets, where image data are subject to several nonlinear transformations before recording. In such a case, the ability to detect such distortions and assess the overall image quality is an important challenge for the reliability of the results obtained from the analysis of such images.
In recent years, many objective image quality assessment (IQA) metrics have been proposed, which may be divided into three major groups: full-reference (FR) metrics, which require the knowledge of the original "pristine" image without any distortions; no-reference (NR) methods, also known as "blind" metrics; and the less popular reduced-reference (RR) approaches, which assume a partial knowledge of the original (reference) image. Although NR methods are the most desirable, their universality and correlation with subjective opinions of human observers, provided as Mean Opinion Scores (MOS) or Differential MOS (DMOS) values in IQA databases, are typically significantly lower in comparison to FR methods. A more detailed analysis of many metrics and their comparisons for various widely accepted datasets containing reference and distorted images together with subjective quality scores may be found in some recent survey papers [1][2][3][4].
There are numerous attempts to improve the correlation between FR metrics and MOS (or DMOS). One way to do this is to design so-called combined metrics [5][6][7][8] that jointly employ several metrics (that we call elementary) in one or another way. In practice, one needs easily computable metrics and a simple way of combining them, similarly as for the 3D printed surfaces [9] or remote sensing images [10]. Because of this, the goal of this paper is to put forward a family of combined metrics that can be optimized with application to assessing the quality of images with multiple distortions. To the best of our knowledge, such optimization has not been yet carried out for available databases containing only images with multiple distortions. Previously developed combined metrics [5,6,8,11,12] concern only the singly distorted images.
The most commonly appearing types of distortions that an ideal IQA metric should be sensitive to concern blurring artifacts, various types of noise, and lossy compression artifacts. Although in some IQA datasets containing singly distorted images more than 20 types may be distinguished, e.g., 24 types in the TID2013 dataset [13] including color-related distortions, their combinations provided in the multiply distorted IQA datasets are limited to a few kinds of them. Typically, they are the combinations of blur, noise, JPEG/JPEG 2000 artifacts, and contrast change. These five common types of distortions have been used, e.g., in the MDID database [14] discussed in Section 3.
Considering the interference of individual distortions and their influence on the perceived image quality, it is not obvious whether metrics designed for singly distorted images are useful for the development of combined metrics highly correlated with the subjective quality assessment of multiply distorted images; this should be verified experimentally.
The rest of the paper is organized as follows: Section 2 contains the overview of some elementary metrics, typically applied for the quality assessment of singly-distorted images, whereas four publicly available multiply-distorted image datasets used in experiments are presented in Section 3. Section 4 is related to the description of the idea of combined metrics and the proposed approach with experimental results discussed in Section 5. Section 6 concludes the paper.

Overview of Some Elementary Metrics
The performance of a combined metric depends on the following elements:
• The number of the combined elementary metrics;
• Which metrics are combined;
• How the metrics are combined;
• What images are used in testing.
Hence, we start by recalling modern elementary metrics. Development of modern visual quality metrics, replacing the "classical" pixel-based approaches such as Mean Square Error (MSE) or Peak Signal-to-Noise Ratio (PSNR), started in fact in 2002 with the idea of the Universal Image Quality Index (UQI) [15], followed by its improvement widely known as the Structural SIMilarity (SSIM) [16], implemented also in the multi-scale version (MS-SSIM) [17].
The general formula describing the idea of the SSIM, sensitive to three main types of distortions, i.e., luminance, contrast, and structural distortions, may be expressed as

SSIM(A, B) = [(2 μ_A μ_B + C_1) / (μ_A^2 + μ_B^2 + C_1)] × [(2 σ_A σ_B + C_2) / (σ_A^2 + σ_B^2 + C_2)] × [(σ_AB + C_3) / (σ_A σ_B + C_3)],

where μ_A and μ_B denote the local means, σ_A and σ_B the local standard deviations, and σ_AB the local covariance of the compared images A and B, and the default values of the stabilizing constants (preventing the instability of results for dark and flat image areas) for 8-bit grayscale images are: C_1 = (0.01 × 255)^2, C_2 = (0.03 × 255)^2 and C_3 = C_2/2. The above computations are performed using the sliding window approach and the final metric is the average of the local similarities. This approach was also the basis for some other similarity-based metrics, leading to a further increase of the correlations between the objective quality scores and subjective MOS or DMOS values provided in various IQA datasets (typically containing only singly distorted images). Some such examples, used also in this paper, are: information content weighted SSIM (IW-SSIM) and IW-PSNR [18], Complex Wavelet SSIM (CW-SSIM) [19], Feature SIMilarity (FSIM) [20], Quality Index based on Local Variance (QILV) [21], as well as a color version of SSIM (CSSIM), SSIM4 and its color version CSSIM4 [22], belonging to the group of SSIM-based metrics with additional predictability of image blocks.
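As a minimal illustration, the sliding-window computation above can be sketched in Python (an assumed NumPy/SciPy port of the usual MATLAB implementations), using the simplified form with C_3 = C_2/2 absorbed into the second factor and a uniform window instead of the Gaussian window of the reference implementation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim(a, b, window=8, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Mean SSIM for 8-bit grayscale images; the uniform window is an
    assumption of this sketch (the original uses an 11x11 Gaussian)."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    mu_a, mu_b = uniform_filter(a, window), uniform_filter(b, window)
    var_a = uniform_filter(a * a, window) - mu_a ** 2
    var_b = uniform_filter(b * b, window) - mu_b ** 2
    cov_ab = uniform_filter(a * b, window) - mu_a * mu_b
    # Simplified SSIM: with C3 = C2/2 the contrast and structure terms
    # collapse into (2*cov + C2) / (var_a + var_b + C2).
    num = (2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(np.mean(num / den))
```

For identical images every local similarity equals 1, so the averaged metric is 1 as well, and any distortion lowers the score.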
A good illustration of the exemplary modifications of the SSIM might be the QILV metric [21], expressed as

QILV(A, B) = [(2 μ_VA μ_VB) / (μ_VA^2 + μ_VB^2)] × [(2 σ_VA σ_VB) / (σ_VA^2 + σ_VB^2)] × [σ_VAVB / (σ_VA σ_VB)],

where σ_VAVB denotes the covariance between the variances of two images (V_A and V_B, respectively), σ_VA and σ_VB are the global standard deviations of the local variance, with μ_VA and μ_VB being the mean values of the local variance. Another example may be FSIM [20], based on the local similarity defined as

S_L(x) = [(2 PC_A(x) PC_B(x) + T_1) / (PC_A^2(x) + PC_B^2(x) + T_1)] × [(2 G_A(x) G_B(x) + T_2) / (G_A^2(x) + G_B^2(x) + T_2)],

where T_1 and T_2 are the stability constants preventing the division by zero and x is the sliding window position. The two main components are the phase congruency (PC), being a significance measure of a local structure, and the gradient magnitude (G), a complementary feature extracted using the Scharr edge filter. The final metric is calculated according to the formula

FSIM = [Σ_x S_L(x) · PC_m(x)] / [Σ_x PC_m(x)],

where PC_m(x) = max(PC_A(x), PC_B(x)) and x denotes each position of the local window on the image plane A (or B).
Another approach, originating from information theory, assumes the use of natural scene statistics (NSS) combined with a measurement of the mutual information between the subbands in the wavelet domain, proposed by Sheikh and Bovik as Visual Information Fidelity (VIF) metric [23]. Its simplified multi-scale pixel domain version (VIFp) requires fewer computations, although it does not allow the orientation analysis. Both methods are based on the earlier idea of Information Fidelity Criterion (IFC) [24]. A lower computational complexity metric, known as DCT Subbands Similarity (DSS) [25] utilizes the fact that statistics of DCT coefficients change with the degree and type of image distortion. Another motivation for its authors has been the popularity of the 2D DCT as many image and video coding techniques are based on block-based DCT transforms, particularly originating from JPEG and MPEG standards.
A combination of steerable pyramid wavelet transform and SSIM, known as IQM2, was proposed by Dumic et al. [26], where the kernel with two orientations was applied to achieve the best performance preserving low computational demands.
A different approach to the perceptual IQA was proposed by Wu et al. [27], utilizing the internal generative mechanism (IGM), adopting a Bayesian prediction model and decomposing the image into predicted and disorderly portions. It was assumed that the first part may be assessed using SSIM-like methods, whereas the degradation of the disorderly portion may be predicted using the PSNR. Both parts are then nonlinearly combined to acquire the final quality score.
Chang et al. [28] proposed a method based on the independent feature similarity (IFS), simulating the properties of the Human Visual System (HVS) and particularly useful for the quality prediction of images with color distortions. Due to the possible use of partial information from the reference image (based on the use of Independent Component Analysis, ICA), this method can also be considered as an example of the RR approach. Another metric based on the HVS, known as Perceptual SIMilarity (PSIM), was proposed as a four-step method [29] and partially verified using two multiply distorted databases. It is based on the extraction of gradient magnitude maps for both compared images, followed by calculations of their multi-scale similarities, measurement of chromatic channel degradations, and final pooling.
Alternatively, the authors of the Sparse Feature Fidelity (SFF) metric [30] assumed transformation of images into sparse representations in the primary visual cortex, detecting the sparse features by a feature detector trained with the ICA algorithm using natural image samples. They used feature similarity and luminance correlation components to jointly simulate visual attention and the visual threshold. Another metric based on sparse representations, known as UNIQUE [31], utilized an unsupervised learning approach. Interestingly, in the preprocessing step, a color space selection is performed (conversion into the YCbCr model is suggested, with replacement of the Cb chrominance by the green channel), followed by random patch sampling, forming a vector containing 64 elements for each of the three channels, and further normalization using mean subtraction and a whitening operation. An additional extension analyzing the learned weights was proposed as the MS-UNIQUE metric [32]. Both metrics were trained using randomly selected patches from the ImageNet database. A further extension of such a training-based approach, particularly using deep learning CNN approaches [33,34], is also possible; however, it still requires a relatively large amount of training data, available mainly in the singly distorted IQA datasets.
An interesting metric, utilizing gradient similarity, chromaticity similarity, and deviation pooling, was proposed as the Mean Deviation Similarity Index (MDSI) [35], where the color distortions were measured using a joint similarity map of two chromatic channels. Another metric employing gradient similarity, known as Gradient Magnitude Similarity Deviation (GMSD), has been proposed by Xue et al. [36].
Reisenhofer et al. [37] proposed the use of the Haar wavelet decomposition to develop another HVS-based perceptual similarity metric, known as HaarPSI. This metric is based on the use of six 2D Haar wavelet filters extracting the horizontal and vertical edges on different frequency scales and may be considered as a simplification of FSIM [20]. Another feature-based method, known as RVSIM [38], utilizes Riesz transform (similarly as earlier RFSIM [39]) together with visual contrast sensitivity, whereas the CVSSI metric [40] is based on the similarity of contrast and visual saliency (VS), forming the final score with the use of weighted standard deviations of the local contrast quality map and the global VS quality map.
Considering the topic of this paper, the above overview of elementary metrics is limited to the FR algorithms demonstrating a high prediction accuracy for the four considered multiply distorted IQA datasets, obtained without any nonlinear fitting functions (e.g., logistic or polynomial ones). Although a few metrics oriented for the quality assessment of multiply distorted images have been recently proposed, e.g., using gradient detection [41], in some cases, their codes are not publicly available or they belong to the group of "blind" methods, such as the method based on phase congruency [42]. Therefore, the results presented in this paper are focused on the combination of better-known elementary metrics with available codes, originally developed for singly distorted images.
In addition to the above-mentioned metrics, some of the IQA methods, which have led to an improved performance applied in the combined metrics, include: WSNR [43], PSNRHMA [44], VSNR [45], Visual Saliency-Induced Index (VSI) [46], Multiscale Contrast Similarity Deviation (MCSD) [47], spectral residual similarity (SR-SIM) [48] and Wavelet Based Sharp Features (WASH) [49]. Some other recently proposed metrics used in experiments have been developed originally for the quality estimation of screen content images, such as SIQAD [50] and SCI_GSS [51], as well as for the reduced-reference image quality assessment of contrast change (RIQMC) [52].
Since some of the methods presented above are designed for direct use with color images only, whereas the others require grayscale ones, all the calculations for the latter have been made using MATLAB's rgb2gray conversion, which applies the luminance coefficients of the ITU-R BT.601-7 Recommendation rounded to three decimal places.
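For reproducibility, the conversion can be expressed as follows (a Python sketch assumed to be equivalent to MATLAB's rgb2gray with the BT.601 luma coefficients rounded to three decimal places):

```python
import numpy as np

# ITU-R BT.601 luma coefficients rounded to three decimal places,
# as applied by MATLAB's rgb2gray.
BT601 = np.array([0.299, 0.587, 0.114])

def rgb_to_gray(rgb):
    """Convert an H x W x 3 uint8 RGB image to an H x W uint8
    grayscale image: Y = 0.299 R + 0.587 G + 0.114 B."""
    return np.round(rgb.astype(np.float64) @ BT601).astype(np.uint8)
```

Note that the rounded coefficients sum exactly to 1.000, so a pure white pixel maps to 255 without clipping.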

Multiply Distorted Image Quality Assessment Datasets
The development of new IQA datasets is quite a challenging and time-consuming task, especially assuming the conduction of perceptual experiments involving many observers for a relatively large number of distorted images. Hence, among many IQA datasets, only a few of them, such as, e.g., TID2013 [13], containing numerous images subject to several types of distortions, may be considered as widely accepted by the community. Unfortunately, most of the databases developed several years ago do not contain images with more than a single distortion applied simultaneously, and most of the metrics developed and verified using such datasets predict the quality of multiply distorted images with relatively low accuracy.
As stated by Chandler [2], one of the main challenges in the multiply distorted IQA is the fact that the developed metrics should consider not only the joint effects of distortions on the image but also the effects of distortions on each other. Hence, considering the practical usefulness of metrics that would be able to predict the visual quality of multiply distorted images with the possibly highest accuracy, some other datasets have been developed to fill this research gap.
The first of such datasets, provided by the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin and referred to as LIVEMD [53], contains two groups of doubly distorted images. The first group deals with blur followed by JPEG lossy compression, whereas the second one contains images blurred due to defocusing and further corrupted by white noise to simulate sensor noise. Each group contains 225 images; however, since some of them are in fact singly distorted, only the subset of 270 multiply distorted images has been used in the experiments carried out in this paper.
Another dataset, known as MDID13 [54], contains 12 natural color reference images and 324 images corrupted simultaneously by distortions that may take place during the acquisition, compression, and transmission of images. Six standard definition reference images (768 × 512 pixels) originate from the Kodak database, whereas the other six high definition images (1280 × 720) are the same as in the LIVEMD dataset. The testing images contain the three-fold mixtures of blurring, JPEG compression, and noise, being complementary to the LIVEMD, where only two-fold artifacts are used. Subjective scores have been provided by 25 inexperienced observers using two viewing distances due to different image sizes and the single-stimulus (SS) method according to the ITU-R BT.500-12 Recommendation.
The third database used for the verification of the proposed approach is known simply as MDID [14]. It contains 20 reference images (cropped to 512 × 384 pixels without scaling) and 1600 distorted images. The images are corrupted by combinations of five distortions, namely Gaussian noise (GN), Gaussian blur (GB), contrast change (CC), JPEG, and JPEG2000 lossy compression. Each distorted image has been obtained from the respective reference image by applying random types and random levels of distortions. The MOS values have been provided by 192 subjects who participated in the subjective rating. Sample images from the MDID database affected by various combinations of distortions with different levels are presented in Figure 1 with the reference image marked by the red frame.
The last dataset, developed in the Imaging and Vision Laboratory at the University of Milano-Bicocca, is known as the IVL_MD or MDIVL database [55]. It contains two groups of images: 400 images with noise and JPEG distortions, as well as 350 images with blur plus JPEG distortions, together with corresponding MOS values. The distorted images, subjectively evaluated by 12 observers using the SS method, have been obtained from 10 reference images of size 886 × 591 pixels.
There are also other databases containing images with multiple distortions, e.g., LIVE in the Wild Image Quality Challenge database, containing widely diverse authentic image distortions [56]. However, this database does not offer reference images and, therefore, it does not allow calculating FR metrics that are needed in our case.
Comparing the four publicly available multiply distorted IQA databases, the most relevant one is undoubtedly the MDID database [14], not only because of the largest number of images and distortion types but also considering the numerous human observers involved in perceptual experiments. Therefore, the experimental results obtained for this dataset should be considered as the most important. On the other hand, due to the greater diversity of distortions and higher number of images, the expected correlation values are lower than for the other datasets.
To provide a comparison of the performance of the best elementary (individual) metrics for each of the above databases, the Pearson Linear Correlation Coefficients (PCC) between the raw objective scores (i.e., without any additional nonlinear fitting) and subjective MOS/DMOS values have been calculated, illustrating the prediction accuracy. Additionally, Spearman Rank Order Correlation Coefficients (SROCC) and Kendall Rank Order Correlation Coefficients (KROCC) have been calculated to illustrate the prediction monotonicity of each elementary metric.
The obtained performance for selected elementary metrics, including the best performing ones, is presented in Table 1, where the top three results for each dataset are marked with bold font. As can easily be noticed, various methods demonstrate the best performance for various datasets, also differing in prediction accuracy measured by PCC and prediction monotonicity indicated by the rank order correlations. Although not all results obtained for elementary metrics have been provided in the paper, the values for over 50 of them have been calculated for the four considered datasets. Additionally, the correlation results obtained for all databases, weighted by the number of images in each of the considered datasets, have been presented. The weights (before normalization) are 270 for LIVEMD (excluding the singly distorted part of the database), 324 for MDID13, 1600 for MDID, and 750 for MDIVL, respectively. Hence, the most "universal" elementary metrics seem to be VIF, DSS, and IW-SSIM, providing the highest aggregated correlations and being a good starting point for the development of the combined metrics.
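The three criteria and the image-count-weighted aggregation described above can be sketched as follows (Python with SciPy used here in place of the MATLAB environment of the original experiments):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def correlations(objective, subjective):
    """PCC (prediction accuracy) plus SROCC and KROCC (prediction
    monotonicity) between raw objective scores and MOS/DMOS values."""
    return (pearsonr(objective, subjective)[0],
            spearmanr(objective, subjective)[0],
            kendalltau(objective, subjective)[0])

def aggregated(corrs, n_images=(270, 324, 1600, 750)):
    """Correlation aggregated over the four datasets (LIVEMD, MDID13,
    MDID, MDIVL), weighted by their numbers of images."""
    w = np.asarray(n_images, dtype=float)
    return float(np.dot(np.abs(corrs), w / w.sum()))
```

With the default weights, the MDID result dominates the aggregate, reflecting its share of 1600 out of 2944 images.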

Combined Metrics and the Proposed Approach
Ideally, an FR metric should provide a linear dependence between metric values and MOS. Less strictly, the dependence between MOS and a metric should be monotonic (desirably, a larger metric value corresponds to a larger MOS). However, for many existing elementary metrics, these dependences are far from ideal. As examples, Figure 2 presents scatter plots of MOS vs. some elementary FR metrics for the considered databases (scatter plots in the left column). As one can see, the dependences can be nonlinear (as shown in the scatter plot of IQM2 vs. MOS), different metrics have different ranges of variation (many metrics vary in the limits from 0 to 1, but not all), and some "outliers" (large displacements of some points with respect to most of the others) might occur as well. These properties give rise to problems in aggregating several elementary metrics into a combined one.
The idea of the combined metrics is motivated by the complementary properties of different elementary metrics, which may demonstrate a "sensitivity" to various kinds of distortions to varying degrees. Hence, it has been assumed that their nonlinear combination may eliminate the need for the nonlinear fitting proposed by the Video Quality Experts Group (VQEG) to increase the linear correlation between the subjective and objective scores. Some initial attempts were made to combine the metrics for singly distorted images by the optimization of weighting exponents for the product of three metrics [5] using the TID2008 database, although during further experiments, one of the metrics was replaced by FSIM, forming the Combined Image Similarity Index (CISI) [6], being the weighted product of MS-SSIM [17], VIF [23] and FSIM [20].
A multi-metric fusion based on the regression approach applied for some older elementary metrics was proposed in the paper [7] with the additional context-dependent version utilizing the machine learning approach to determine the context automatically. Nevertheless, the verification of results was made using the TID2008 dataset only.
Another approach to multi-metric fusion is based on the use of genetic algorithms for the combination of metrics [11], although modeled as their weighted sum instead of their product that may limit the possibility of avoiding the additional nonlinear fitting. Hence, a similar approach was also used for the weighted products of elementary metrics [12], leading to further improvements.
Neural networks were applied to combine elementary IQA metrics in the paper [8], where a randomly selected half of the TID2013 dataset was used for training. This approach utilized six elementary metrics, leading to a significant increase of the SROCC chosen as the optimization criterion. Nevertheless, similarly as in the other cases, the combined metrics have been used only for the assessment of singly distorted images. Additionally, a potential application of deep learning methods would require the development of larger training datasets containing also the subjective quality scores for multiply distorted images. Therefore, a combination of existing metrics using a relatively simple model is expected to be a well-performing solution also for multiply distorted images.
To provide a simple form of the combined metric that does not require additional nonlinear regression, e.g., using the logistic function, the strategy based on the weighted product of elementary metrics has been initially chosen in this paper with PCC as the optimization criterion. Although, in some cases, prediction monotonicity may be more important than the prediction accuracy itself, we have verified experimentally that the optimization of weighting exponents using the PCC values as the criterion also provides high SROCC values, whereas the performance obtained in the opposite case (optimizing for SROCC) is not always good enough. Another reason for the use of the PCC for raw scores without prior nonlinear fitting was the flexibility of the proposed approach, making it possible to control all weights simultaneously in a single optimization procedure. Considering the various dynamic ranges of elementary metrics, as well as of the DMOS and MOS values in each dataset, the use of the PCC does not require additional normalization of their values.
Hence, the assumed formula of the combined metric may be expressed as:

CM = ∏_{i=1}^{N} Q_i^{w_i} ,   (5)

where N is the number of elementary metrics denoted as Q_i, and w_i are their exponential weights, obtained as the result of optimization conducted using MATLAB's fminsearch function.
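In code form, the weighted product reduces to a one-liner (a Python sketch; the raw metric values are assumed to be positive, since fractional exponents of negative scores are undefined):

```python
import numpy as np

def combined_product(Q, w):
    """CM: weighted product of elementary metrics.
    Q: (n_images, N) array of raw metric values (assumed positive);
    w: sequence of N exponential weights."""
    return np.prod(np.power(Q, w), axis=1)
```

For example, combined_product(np.array([[2.0, 4.0]]), [1.0, 0.5]) yields 2^1 · 4^0.5 = 4.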
Although the application of the assumed method of metrics' combination provides encouraging results, the selected fusion of metrics based on their weighted product does not always lead to fully satisfactory performance. Hence, a novel fusion model has been investigated, based on the sum of the exponentially weighted metrics where each component of the sum has an additional weight. The proposed formula may be presented as:

CM+ = Σ_{i=1}^{N} a_i · Q_i^{w_i} ,   (6)

where the additional weights a_i have been introduced to make the combined metric even more flexible and increase its correlation with the subjective quality scores provided in state-of-the-art datasets for multiply distorted images.
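A corresponding sketch of the weighted sum, with the multipliers a_i normalized to sum to one as in the experiments described below (Python, under the same positivity assumption as for the weighted product):

```python
import numpy as np

def combined_sum(Q, a, w):
    """CM+: sum of exponentially weighted metrics with additional
    multiplicative weights a_i, normalized so that sum(a) = 1."""
    a = np.asarray(a, dtype=float)
    a = a / a.sum()
    return np.sum(a * np.power(Q, w), axis=1)
```

For example, with Q = [[4, 9]], a = [1, 1] and w = [0.5, 0.5], the score is 0.5 · 2 + 0.5 · 3 = 2.5.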

Results of Optimization
Using the weights a in Equation (6), different ranges of metrics' variation are taken into account (i.e., a specific normalization is performed). Using both the a and w coefficients, the combined metric can be optimized, i.e., better values of PCC and/or SROCC can be obtained in comparison to the elementary metrics used as its inputs.
An initial verification of the usefulness of the proposed approach for the FR quality assessment of multiply distorted images has been made primarily for the metrics listed in Table 1, using the four considered datasets independently. The metrics whose PCC values fell below the assumed lower limits for all datasets have been excluded from further experiments (i.e., at least one of the conditions had to be fulfilled by a metric for it to be included). The values of these limits for PCC are: 0.7 for LIVEMD, 0.8 for MDID13, 0.85 for MDID and 0.8 for MDIVL. The relatively low limit for the LIVEMD dataset is caused by the removal of the singly distorted images from the analysis, leading to a decrease of the correlation values for this dataset. Nevertheless, in some cases, combinations of two or three "worse" metrics might provide better results in comparison to the combination of one of them with the best performing elementary metric. Therefore, in the second stage of experiments, all combinations of two and three metrics have been tested for all datasets. To reasonably limit the number of possible combinations, several "best" combinations have been chosen as the basis for a further increase of the number of metrics.
The optimization of the exponential parameters w_i for the combined metrics CM, as well as of the multipliers a_i and exponents w_i for the proposed CM+ formula, has been conducted using the derivative-free unconstrained Nelder-Mead simplex method implemented in MATLAB's fminsearch function. Finally, all multipliers a_i in the proposed CM+ formula have been normalized so that ∑ a_i = 1.
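The optimization step can be reproduced with SciPy's Nelder-Mead implementation, the Python counterpart of MATLAB's fminsearch (a sketch with hypothetical inputs: Q holds the raw elementary metric scores, mos the subjective scores):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import pearsonr

def optimize_cm_plus(Q, mos):
    """Fit the multipliers a_i and exponents w_i of CM+ by maximizing
    |PCC| between the raw combined scores and MOS/DMOS values."""
    n = Q.shape[1]

    def neg_abs_pcc(params):
        a, w = params[:n], params[n:]
        scores = np.sum(a * np.power(Q, w), axis=1)
        return -abs(pearsonr(scores, mos)[0])

    x0 = np.concatenate([np.full(n, 1.0 / n), np.ones(n)])
    res = minimize(neg_abs_pcc, x0, method='Nelder-Mead')
    a, w = res.x[:n], res.x[n:]
    return a / a.sum(), w  # normalize so that sum(a_i) = 1
```

Since the PCC is invariant to a positive rescaling of the scores, the final normalization of a_i does not change the achieved correlation.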
As the "best" combinations of two, three and more metrics for individual databases differ from each other, they are presented in Table 2 separately for each dataset. Analyzing the obtained results, it can be noticed that a meaningful increase of the prediction accuracy has been achieved for all datasets, even using the "best" combination of two or three elementary metrics based on the weighted product of metrics denoted as CM. Using additional elementary metrics further improves the obtained results in terms of the PCC significantly, although in some cases it may lead to a slight decrease of the prediction monotonicity (lower values of SROCC and KROCC). The results of the application of the proposed CM+ metrics, based on the normalized sum of the exponentially weighted elementary metrics, are presented in Table 3, where higher correlations in comparison to the respective CM metrics are marked by bold font. As may be noticed, the obtained performance of the proposed combined metrics is better for three datasets and slightly worse for the MDID database. An additional comparison of the linearity of the achieved correlation (without the necessity of any additional nonlinear mapping) is presented in the scatter plots shown in Figure 2.
However, it should be kept in mind that many elementary metrics have various properties and various dynamic ranges; hence, the trends shown in the various plots may be opposite to each other. For some of these metrics, smaller values indicate higher quality, whereas the opposite is true for some other metrics. Since the maximum absolute value of the PCC has been considered as the objective function, the presentation of the scatter plots using the raw scores of these metrics may present both "negative" and "positive" trends, depending on the obtained results of the optimization and the elementary metrics used in the final combined metric. As the DMOS values have been provided as the subjective scores in two datasets, whereas the inventors of the other two datasets have used the MOS values, the original values, different for different datasets, have been used and are presented in all scatter plots included in the paper. The scale of all obtained combined metrics depends on the raw scores of individual metrics and the obtained results have not been normalized. It should also be noted that high DMOS values typically represent poor quality, whereas high MOS values indicate a high quality of images. As may be observed, the results of the CM7+ metric obtained for the MDID13 dataset vary noticeably less than for the three other databases. Nevertheless, highly linear relationships between the subjective and objective quality scores are achieved mainly for the proposed CM+ metrics for all considered databases. Some differences in the dynamic ranges of the combined metrics, particularly using the CM formulas, result from the use of various types of metrics and the different weights obtained after the optimization procedure.
An additional comparison of the performance of the proposed approach has been made using other combined metrics, previously developed for singly distorted images, applied to the datasets containing only multiply distorted images. The experimental results obtained for three such datasets (MDID2013, MDID, and MDIVL) are presented in Table 4. Since the four Regression-based Similarity (rSIM) metrics [11] were actually designed as weighted sums of individual metrics, an additional nonlinear regression with the use of the logistic function has been applied, using the coefficients provided in [11]. As one can see, our approach provides substantially better results than the approaches proposed in [11,12].
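The nonlinear regression step mentioned above is commonly performed with a five-parameter logistic function in IQA studies. The sketch below shows this standard mapping; the parameter values are placeholders, not the coefficients provided in [11], and the exact functional form used there may differ.

```python
import numpy as np

def logistic_map(q, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping commonly used in IQA studies to
    regress raw objective scores onto the subjective scale before
    computing correlations. Parameters b1..b5 are placeholders."""
    q = np.asarray(q, dtype=float)
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (q - b3)))) + b4 * q + b5

# With b1=1, b2=1, b3=0 and no linear term, the mapping is a shifted
# sigmoid centered at q = 0.
mapped = logistic_map([-2.0, 0.0, 2.0], 1.0, 1.0, 0.0, 0.0, 0.0)
```

Note that such a monotonic mapping changes the PCC but leaves the rank-order coefficients (SROCC and KROCC) unchanged, which is why the linearity comparison in the scatter plots is made without it.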
Since the metrics used in the "best" combinations differ between the datasets, an additional cross-database validation has been conducted by applying the combined metrics optimized for a single database to the assessment of images from the other three datasets. The validation results are presented in Table 5, where results better than those obtained for the best elementary metrics for each dataset are marked in bold. As may be observed, the application of some of the combined metrics obtained for the MDIVL dataset does not lead to satisfactory results for the other databases.

Table 4. Comparison of results obtained for three major datasets using some combined metrics originally designed for singly distorted images, the "best" elementary metrics, and the proposed methods. The performance of all metrics is expressed as the Pearson, Spearman, and Kendall correlation coefficients between the subjective quality scores and the objective metrics. The better results of the two alternatives are marked in bold.
Nevertheless, from a practical point of view, a final recommendation of a "universal" combined metric suitable for all databases would be desirable. Therefore, some additional experiments have been conducted using an "aggregated" correlation as the goal function, calculated as the weighted sum of the four correlations computed for each dataset, where the number of images in each dataset has been used as the (unnormalized) weight, similarly as for the elementary metrics shown in Table 1. The results obtained for both proposed families of combined metrics are presented in Table 6. It is worth noting that, even when all four databases are considered, the correlations are higher than those achieved by the other combined metrics for single datasets, as shown in Table 4. Analyzing the presented results, the advantages of the novel approach based on the weighted sum of metrics, leading to the CM+ family, may be observed for most metrics (the better results of the two alternatives are marked in bold). Another interesting observation is that the "best" combinations in the CM+ family utilize different elementary metrics than those in the CM family. In some cases, due to the use of more parameters, it is also possible to achieve similar correlations using the CM+ approach with a smaller number of combined elementary metrics than with the CM family.
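The "aggregated" correlation described above can be sketched as a dataset-size-weighted average of per-dataset correlations. This is an illustrative reading of the description, assuming absolute correlation values are averaged (consistent with the maximum absolute PCC being the objective function); the paper's exact normalization may differ.

```python
import numpy as np

def pearson(x, y):
    """Plain Pearson linear correlation coefficient."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def aggregated_correlation(per_dataset):
    """Size-weighted mean of per-dataset correlations (assumed form).
    `per_dataset` is a list of (correlation, n_images) pairs; the image
    counts act as the unnormalized weights."""
    corrs, sizes = zip(*per_dataset)
    return float(np.average(np.abs(corrs), weights=sizes))

# Hypothetical example: two datasets of 100 and 300 images.
agg = aggregated_correlation([(0.9, 100), (0.8, 300)])  # -> 0.825
```

Using image counts as weights simply makes larger databases contribute proportionally more to the goal function during optimization, which matches the observation that the smallest dataset (LIVEMD) is the least significant.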
The graphical illustration of the correlation between the "best universal" combined metric CM7+ and the subjective scores for the individual datasets is provided in Figure 3, where the lowest correlation, obtained for LIVEMD, may be easily observed. Nevertheless, due to its lowest number of images, this dataset may be considered the least significant. The highly linear relationships between the subjective evaluations and the objective metric achieved for the three major datasets (PCC = 0.9387 for MDID, PCC = 0.8911 for MDID2013, and PCC = 0.9122 for MDIVL, respectively, as shown above the plots in Figure 3) confirm the validity of the proposed approach. These results are still better in comparison with the results obtained for the alternative combined metrics presented in Table 4. The weights obtained for the elementary metrics with different properties and various dynamic ranges, used in the CM7+ metric according to Formula (6), are provided in Table 7.

Table 6. Performance of the "best" elementary metrics and the "universal" CM and CM+ metrics for all four databases in view of the aggregated (weighted) correlation with the subjective scores. The better correlations of the two families of combined metrics are marked in bold.

The conducted experiments have confirmed the hypothesis that the specificity of multiply distorted images requires a combination of different metrics, since some of the previously proposed hybrid approaches have led to worse performance even in comparison with the "best" elementary metrics. Additionally, the application of the combination model proposed in the paper meaningfully increases the performance for most of the considered datasets, as well as for all datasets treated as a whole. The proposed approach improves both the quality prediction accuracy measured by the PCC and the prediction monotonicity reflected by both rank-order correlations (SROCC and KROCC).

Conclusions
Image quality assessment of multiply distorted images is still a challenging area of research, as many elementary metrics designed using IQA databases with singly distorted images perform poorly for multiply distorted ones. The application of combined metrics makes it possible to increase the achieved performance; however, the results obtained using one of the available databases are not always directly applicable to the others. Therefore, our future research will concentrate on other fusion strategies, including the use of genetic algorithms and neural networks for this purpose. Different approaches to feature extraction and network training are possible; however, as stated in [34], "the training set has to contain enough data samples to avoid overfitting". Meanwhile, even the application of relatively simple fusion models, as proposed in this paper, makes it possible to achieve much better results than can be obtained with a single metric.
Analyzing the results for the four available databases considered together, a significant increase of the aggregated correlation with subjective scores may be observed, not only in comparison with the elementary metrics but also with some other combined metrics proposed earlier for images with single distortions. These results confirm the practical usefulness and universality of the proposed approach, particularly of the novel CM+ metrics.
Since the proposed fusion model itself is not computationally demanding, the overall efficiency does not decrease significantly, assuming the possibility of parallel calculation of the elementary metrics. The only exception may be related to memory limitations that would hinder the parallel computation of the elementary metrics for large images. The time and memory requirements depend on the hardware used and the image size. With parallel computation of the metrics (e.g., 7 metrics on 8 independent threads), the calculation time of the final combined metric is nearly the same as that of the "slowest" elementary metric used.
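The parallel evaluation scheme described above can be sketched as follows. The metric callables here are hypothetical stand-ins for real elementary metric implementations (which would typically release the GIL or run as native code, so that threads give genuine overlap); the product-style fusion mirrors the assumed CM form.

```python
from concurrent.futures import ThreadPoolExecutor

def combine_parallel(metric_fns, reference, distorted, exponents):
    """Evaluate elementary metrics concurrently, then fuse their scores
    as a weighted product (CM-style, assumed form). `metric_fns` are
    hypothetical callables taking (reference, distorted) images.
    The total runtime is dominated by the slowest metric, since the
    fusion step itself is negligible."""
    with ThreadPoolExecutor(max_workers=len(metric_fns)) as pool:
        futures = [pool.submit(fn, reference, distorted)
                   for fn in metric_fns]
        scores = [f.result() for f in futures]
    combined = 1.0
    for q, w in zip(scores, exponents):
        combined *= q ** w
    return combined

# Hypothetical elementary metrics returning fixed scores for a
# placeholder image pair.
fns = [lambda ref, dist: 0.5, lambda ref, dist: 2.0]
value = combine_parallel(fns, None, None, [1.0, 1.0])
```

For memory-constrained settings with large images, the same function could fall back to sequential evaluation simply by setting `max_workers=1`, trading time for peak memory.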
The next step of this research might be the application of CNN-based metrics trained using images affected by multiple distortions. Despite the different "nature" of multiply distorted images compared with those affected by a single distortion, this direction of future research seems promising and will be considered. Nevertheless, its significant limitation is the need to develop larger datasets containing multiply distorted images that may be used for training purposes.
Finally, considering the presence of multiple distortions in many electronic devices equipped with vision sensors, the proposed approach may be useful in various electronic systems used for image and video analysis.

Conflicts of Interest:
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: