1. Introduction
With social media and streaming applications booming, systems that quickly transmit large numbers of images are required to provide the best available user experience [
1]. However, various distortions are added to digital images during storage, compression, and transmission. Therefore, the continuous evaluation and monitoring of image quality is of great importance to content providers [
2]. As a consequence, objective image quality assessment (IQA), which seeks mathematical models whose perceptual quality estimates are consistent with human judgment, has become an active research topic [3]. The literature usually divides objective IQA into three branches [
4,
5] based on the availability or unavailability of the reference (distortion-free) images in the quality evaluation process. As the terminology suggests, full-reference (FR) IQA evaluates the quality of distorted images with full access to their reference counterparts, while no-reference (NR) IQA has no access and reduced-reference (RR) IQA has partial access to them.
Because the human visual system (HVS) is extremely complex and many of its elements are not fully understood [6], researchers have proposed many FR-IQA algorithms over the years, each taking different aspects of the HVS into consideration. Recently, there have been numerous attempts to improve the performance of FR-IQA by combining several existing FR-IQA metrics into a “super” evaluator. First, Okarma [
7] introduced such a fusion-based metric by applying the product and power of MS-SSIM [
8], VIF [
9], and R-SVD [
10]. Later, this idea was developed further in several directions. One line of work used optimization or regression techniques to determine optimal weights or exponents in sums or products of existing FR-IQA metrics. For instance, Oszust [
11] determined the optimal weights using a genetic algorithm with a root mean square error (RMSE) objective function which was calculated between predicted and ground-truth scores. Bakurov et al. [
12] chose a similar solution, but the authors revisited the SSIM [
13] and MS-SSIM [
8] metrics to find optimal parameters in their formulas, using evolutionary and swarm intelligence methods instead of the originally proposed grid search. On the other hand, Okarma [
14] used the MATLAB fminsearch function to determine the optimal exponents in a weighted product of traditional FR-IQA metrics. Another line of work uses machine learning techniques to create fusion-based methods. An illustrative example is the paper of Lukin et al. [15], where the results of traditional FR-IQA metrics were used as a feature vector to train a shallow neural network. Amirshahi et al. [16] compiled a fusion-based metric by using traditional image quality metrics to compare the activation maps that an AlexNet [17] convolutional neural network produces for the reference and distorted images.
1.1. Contributions
In this paper, we make the following contributions. We propose a novel framework for FR-IQA that combines multiple metrics and tries to leverage the strength of each by formulating FR-IQA as an optimization problem. Following the idea of other fusion-based metrics, the perceptual quality of a test image is defined as the weighted product of several already existing, hand-crafted FR-IQA metrics. Unlike other methods [
7,
18,
19], the weights in the product are determined in a novel optimization-based framework and the objective function is defined to maximize the correlation strength and minimize the root mean square error between the predicted and ground-truth quality scores.
1.2. Structure of the Paper
To provide a clear and organized presentation of our work, this paper is structured as follows. In
Section 2, we discuss the main approaches of FR-IQA and summarize significant methods of the field.
Section 3 outlines our proposed method. In
Section 4, we present the experimental results and analyze the performance of our method by comparing it to other state-of-the-art methods. We conclude this paper in
Section 5 and discuss potential applications and future research directions.
2. Related Work
Taking the mean square error between reference and distorted images is a simple and straightforward FR-IQA metric. However, the provided quality scores do not correlate well with human judgment [
20]. Similarly, PSNR [
21] is also simple and straightforward but fails to give satisfactory results [
22]. Other metrics take the sensitivity of the HVS to structural degradation into consideration, such as the structural similarity index (SSIM) [
13]. On the basis of SSIM [
13], a large number of FR-IQA metrics have been proposed over the years, such as MS-SSIM [
8], CW-SSIM [
23], ESSIM [
24], GSSIM [
25], IW-SSIM [
26], and 3-SSIM [
27]. In SSIM [
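13], a comparison between the distorted and reference (distortion-free) images is performed on the basis of three features, i.e., luminance, contrast, and structure. To be more specific, the SSIM between two images (denoted here by $A$ and $B$) in an image patch around the $(x, y)$ coordinates is given as
$$\mathrm{SSIM}_{A,B}(x,y) = l(x,y) \cdot c(x,y) \cdot s(x,y), \tag{1}$$
where the luminance component is defined as
$$l(x,y) = \frac{2\mu_A \mu_B + C_1}{\mu_A^2 + \mu_B^2 + C_1}, \tag{2}$$
the contrast component is given as
$$c(x,y) = \frac{2\sigma_A \sigma_B + C_2}{\sigma_A^2 + \sigma_B^2 + C_2}, \tag{3}$$
and the structure component is determined as
$$s(x,y) = \frac{\sigma_{AB} + C_3}{\sigma_A \sigma_B + C_3}. \tag{4}$$
In Equations (1)–(4), $\mu_A$ and $\mu_B$ denote the average of the pixel values in the image patch around $(x, y)$ in images $A$ and $B$, respectively. Similarly, $\sigma_A^2$ and $\sigma_B^2$ stand for the variances. Further, $\sigma_{AB}$ is the covariance calculated between the two image patches from $A$ and $B$. The constants are calculated as $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$, and $C_3 = C_2 / 2$. Further, $L$ stands for the dynamic range of the pixel values, and for images with 8-bit depth, $L = 255$ is the recommended value. By default, $K_1 = 0.01$ and $K_2 = 0.03$ are also constants whose exact values were chosen by Wang et al. [13] after an ablation study. To give the perceptual quality of the distorted image in possession of the reference image, the arithmetic average of $\mathrm{SSIM}_{A,B}(x,y)$ over all patches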
is taken. As already mentioned, a large number of FR-IQA metrics have been inspired by the original SSIM. For comprehensive overviews of SSIM-motivated methods, see [12,28,29,30,31]. Several representative methods are described below. The authors of multi-scale SSIM [
8] extended the idea of SSIM into multiple scales. Sampat et al. [
23] replaced the components of SSIM by complex wavelet coefficients [
32]. In contrast, Zhang et al. [
24] defined an edge strength-based image quality metric where the strength of edges was defined in horizontal and diagonal directions using directional derivatives. Chen et al. [
25] took a similar approach, but the edge information was characterized by gradient magnitudes. In [
26], the authors used the information content measure as a weighting factor in the pooling process of SSIM [
13] to obtain improved prediction results. This idea was further improved by Larson et al. [
33] where low-level distortions, which are nearly imperceptible, were modeled by local luminance and contrast masking, while high-level distortions were modeled using spatial-frequency components. Kolaman and Yadid-Pecht [
34] extended the SSIM metric to color images by modeling colors with quaternions. In [35], the authors analyzed different strategies for using visual saliency maps [36] to improve IQA algorithms. One proposal was to weight local quality estimates by local saliency values. In [
37], first- and second-order Riesz-transform [
38] coefficients were used to create feature maps for the reference and the distorted images which were compared to give an estimation of the perceptual image quality. Similarly, Zhang et al. [
39] compared feature maps to quantify image quality, but the authors used phase congruency [
40] and gradient magnitude maps.
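To make the SSIM definition of Equations (1)–(4) concrete, the following minimal Python sketch evaluates the three components on a single pair of patches. It is an illustration under stated assumptions, not the reference implementation of Wang et al. [13]: the function name `ssim_patch` is hypothetical, the patches are flat lists of pixel values, and plain (unweighted) population statistics are used instead of a sliding, weighted window. The defaults `L=255`, `k1=0.01`, and `k2=0.03` are the commonly used values for 8-bit images.

```python
from statistics import fmean


def ssim_patch(a, b, L=255, k1=0.01, k2=0.03):
    """SSIM of two equally sized patches given as flat lists of pixel values.

    Simplification (assumption): unweighted population statistics over the
    whole patch, no sliding or Gaussian-weighted window.
    """
    if len(a) != len(b) or not a:
        raise ValueError("patches must be non-empty and of equal size")

    # Stabilizing constants derived from the dynamic range L.
    c1 = (k1 * L) ** 2
    c2 = (k2 * L) ** 2
    c3 = c2 / 2

    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean([(x - mu_a) ** 2 for x in a])
    var_b = fmean([(y - mu_b) ** 2 for y in b])
    cov_ab = fmean([(x - mu_a) * (y - mu_b) for x, y in zip(a, b)])
    sd_a, sd_b = var_a ** 0.5, var_b ** 0.5

    luminance = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    contrast = (2 * sd_a * sd_b + c2) / (var_a + var_b + c2)
    structure = (cov_ab + c3) / (sd_a * sd_b + c3)
    return luminance * contrast * structure
```

For identical patches all three components equal one, so the product is one; any difference in mean, spread, or structure lowers the score. An image-level score would average such patch scores over all patch positions.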
Recently, the scientific community has paid increasing attention to deploying machine and deep learning models in almost all computer vision tasks [41]. The field of image quality assessment has followed this trend [
3,
42]. For instance, Tang et al. [
43] extracted spatial and frequency domain features from reference–distorted image pairs and trained a random forest regressor for image quality prediction. In contrast, Bosse et al. [
44] devised a convolutional neural network (CNN) architecture which can be trained end-to-end on single images or on image pairs; as a consequence, it can be used for both NR- and FR-IQA. Similarly, Zhang et al. [45] trained an end-to-end CNN in a patch-wise fashion for FR-IQA and compared the effectiveness of deep features extracted from different pretrained CNNs. In [
46], the authors proposed a pairwise-learning framework for FR-IQA. Several works extracted deep features via pretrained CNNs from reference–distorted image pairs and compared them to assess the perceptual image quality. For instance, Amirshahi et al. [
47] compared the histograms of deep features using a histogram intersection kernel (HIK) [
48] at multiple levels. The perceptual quality was obtained by aggregating the similarity scores provided by the HIKs. Later, this approach was further developed in [
16] by replacing the HIK in comparing convolutional feature maps with a traditional image similarity metric. In [
49], the authors used the error map calculated between the reference and distorted images and the subjective saliencies of the distorted images to train a CNN for perceptual image quality estimation.
Recently, several researchers have devised fusion-based FR-IQA methods whose goal is to create a “super-evaluator” from already known FR-IQA metrics to achieve better performance. A large number of fusion-based algorithms determine a weight for each FR-IQA metric in a weighted sum or a weighted product [
1]. An illustrative example is the method proposed by Okarma [
7]. Specifically, the properties of three different FR-IQA metrics were examined thoroughly, and a combined metric was devised as the product of powers of these metrics. Using mathematical optimization techniques, the parameter values of this fusion-based metric were refined in [
14]. Oszust [
50] and Yuan et al. [
51] also developed this approach further by applying lasso regression and kernel ridge regression, respectively. Oszust [
11] determined the weights in a linear combination of traditional FR-IQA metrics by applying a genetic algorithm. In [
52], this approach was further developed by using multi-gene genetic programming [
53]. The effectiveness of this approach was also demonstrated on screen content images [
54]. Simulated annealing was also applied in this framework [
55]. Machine learning techniques have also been used to create fusion-based algorithms. An illustrative example is the work of Lukin et al. [15], who used the outcomes of several FR-IQA metrics as features and trained a neural network on top of them to predict perceptual quality. A similar approach using a neural network was proposed for the quality assessment of remote sensing images [
56].
In summary, this section has highlighted the various approaches that have been proposed in the literature for FR-IQA. Although the reviewed studies have contributed significantly to the field, a detailed overview of IQA or FR-IQA is beyond the scope of this study. For a general overview of the field of IQA, the PhD dissertations of Jenadeleh [
57] and Men [
58] are recommended, while Min-juan et al. [
59], Phadikar et al. [
60], George et al. [
61], and Pedersen et al. [
30] provide state-of-the-art studies on FR-IQA.
3. Proposed Method
In [
7], Okarma took into account the different properties of three different FR-IQA metrics (MS-SSIM [
8], VIF [
62], and R-SVD [
10]) and defined a combined quality metric (CQM):
$$\mathrm{CQM} = (\mathrm{MS\text{-}SSIM})^{a} \cdot (\mathrm{VIF})^{b} \cdot (\mathrm{R\text{-}SVD})^{c}, \tag{5}$$
where the exponents $a$, $b$, and $c$ were set to values that lead to a near-optimal solution on an IQA benchmark database. Following this basic idea of Okarma [7], a fusion-based metric is defined as the weighted product of the results of $n$ different traditional FR-IQA methods:
$$Q = \prod_{i=1}^{n} q_i^{w_i}, \tag{6}$$
where the $q_i$s are the results of the applied FR-IQA metrics and the $w_i$s are the associated weights. Specifically, we chose $n = 17$ and the following metrics were utilized: FSIM [
39], FSIMc [
39], GSM [
63], IFC [
9], IW-SSIM [
26], MAD [
33], MS-SSIM [
8], NQM [
64], PSNR [
21], RFSIM [
37], SFF [
65], SSIM [
13], SR-SIM [
66], UQI [
67], VIF [
62], VSI [
68], and VSNR [
69]. A summary of the acronyms of the used FR-IQA metrics can be found in
Table 1. In the literature, the parameters of an FR-IQA metric are usually tuned on a small subset of images. In the case of a traditional metric, such as SSIM [13] given by Equations (1)–(4), the number of tunable parameters is one or two. As a consequence, appropriate values can easily be found by looping over a search space. In contrast, our fusion-based metric given by Equation (6) contains 17 parameters, one weight per metric, and an optimization task is defined to find their values. To determine the optimal weights (parameters) in Equation (
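6), the following optimization problem is defined:
$$\max_{\mathbf{w}} \; \frac{\mathrm{SROCC}(\mathbf{p}, \mathbf{g}) + \mathrm{KROCC}(\mathbf{p}, \mathbf{g})}{\mathrm{RMSE}(\mathbf{p}, \mathbf{g})}, \tag{7}$$
where $\mathbf{w}$ is the vector of weights in Equation (6), and $\mathbf{p}$ and $\mathbf{g}$ are vectors containing the predicted and ground-truth quality scores, respectively. $\mathrm{SROCC}(\cdot,\cdot)$ and $\mathrm{KROCC}(\cdot,\cdot)$ denote Spearman’s rank-order correlation coefficient and Kendall’s rank-order correlation coefficient calculated between two vectors, respectively. Further, $\mathrm{RMSE}(\cdot,\cdot)$ is the root mean square error determined between two vectors. Prior to the calculation of the RMSE, a non-linear mapping is applied to the predicted scores following the recommendations of [4]. In this paper, the following non-linear function was applied
$$f(p) = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\beta_2 (p - \beta_3)}} \right) + \beta_4 p + \beta_5, \tag{8}$$
with the parameters $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\beta_5$ set to the values which were also used in the MATLAB implementation of the VSI [68] method.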
Because an FR-IQA metric is supposed to provide objective scores which have a high correlation and low RMSE with respect to subjective quality scores collected from human observers, the numerator of the objective function given by Equation (7) consists of the sum of SROCC and KROCC, while the denominator corresponds to the RMSE. Our preliminary investigations revealed that considering only SROCC or KROCC may result in a higher RMSE than that of state-of-the-art methods. That is why we decided to divide the sum of SROCC and KROCC by the RMSE.
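The objective described above, the sum of SROCC and KROCC divided by the RMSE, can be sketched with the Python standard library alone. The sketch rests on several assumptions: all function names are hypothetical, Spearman's coefficient is computed as the Pearson correlation of average ranks, Kendall's coefficient is the simple tau-a variant without a tie correction, and the non-linear mapping applied before the RMSE is omitted, so the predicted scores are assumed to be on the same scale as the ground truth already.

```python
import math


def _ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def srocc(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return _pearson(_ranks(x), _ranks(y))


def krocc(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def objective(predicted, ground_truth):
    """(SROCC + KROCC) / RMSE; larger is better. Undefined when RMSE is 0."""
    return (srocc(predicted, ground_truth)
            + krocc(predicted, ground_truth)) / rmse(predicted, ground_truth)
```

Dividing by the RMSE means that two weight vectors with the same rank correlations are still distinguished by how close their scores are to the subjective ones, which is the motivation given in the paragraph above.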
Two nature-inspired optimization methods, namely the genetic algorithm [
70] (GA) and pattern search [
71] (PS), were applied to the problem defined by Equation (
7) to determine the optimal weights. Further, the simplex method of Lagarias et al. [
72], which is implemented in MATLAB’s fminsearch function, was also used. To improve the efficiency of the fusion, each method was able to perform model selection, that is, to decide which FR-IQA metrics to include in the aggregate. The main motivation behind the choice of optimization methods was to collect algorithms that are able to give at least approximate solutions for NP-hard (non-deterministic polynomial-time hard) problems.
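To illustrate how a derivative-free method can search for the weights, the following Python sketch implements a basic compass (coordinate) pattern search. It is a generic textbook variant and an assumption of this illustration; it is neither MATLAB's pattern search implementation nor the configuration used in the experiments. The function name `compass_search` and its parameters are hypothetical. In the setting of this paper, `f` would map a weight vector to the value of the objective in Equation (7).

```python
def compass_search(f, start, step=0.5, tol=1e-6, max_iter=10_000):
    """Maximize f by polling +/- step along each coordinate axis.

    The step (mesh size) is halved whenever no poll point improves the
    current best value; the search stops once the step falls below tol.
    """
    x = list(start)
    best = f(x)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x.copy()
                trial[i] += delta
                value = f(trial)
                if value > best:
                    # Accept the improving poll point immediately.
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2
    return x, best
```

On a toy concave function such as `lambda w: -((w[0] - 1) ** 2 + (w[1] + 2) ** 2)`, the search started from `[0.0, 0.0]` moves to the maximizer at `(1, -2)`. Like the methods named above, this search only evaluates the objective and needs no gradients, which matters because rank correlations are not differentiable in the weights.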
Figure 1 and
Figure 2 depict the construction of the proposed fusion-based FR-IQA metric. Specifically, in our method the fusion was carried out on 20% of the reference images and their corresponding distorted counterparts. In the literature, 20% is a common choice for setting the parameters of a derived formula [
73,
74], but there are also researchers who used 30% [
39] or 80% [
75]. In total, four fusion strategies were realized with each optimization method, one for each benchmark database. Further, the fusion strategies were also tested across databases. Each optimization method was run 100 times and the best solution was selected. We codenamed our method OFIQA to refer to the fact that the decision fusion was carried out via optimization.
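The data handling described above can be sketched as follows, under stated assumptions. The sketch selects a fraction of the reference images and assigns every distorted image to the side of its reference, so that no image content is shared between the subset used for fusion and the rest. It also shows a best-of-N selection over repeated optimizer runs. The names `split_by_reference` and `best_of_runs`, the seeds, and the representation of the data as a list of reference identifiers are hypothetical choices of this illustration.

```python
import random


def split_by_reference(reference_ids, train_fraction=0.2, seed=0):
    """Split distorted-image indices by the reference image they stem from.

    reference_ids[i] is the identifier of the reference of distorted image i.
    Returns (train_indices, test_indices) with disjoint reference content.
    """
    rng = random.Random(seed)
    refs = sorted(set(reference_ids))
    rng.shuffle(refs)
    n_train = max(1, round(train_fraction * len(refs)))
    train_refs = set(refs[:n_train])
    train_idx = [i for i, r in enumerate(reference_ids) if r in train_refs]
    test_idx = [i for i, r in enumerate(reference_ids) if r not in train_refs]
    return train_idx, test_idx


def best_of_runs(run_once, n_runs=100, seed=0):
    """Call run_once(rng) -> (solution, score) n_runs times; keep the best."""
    rng = random.Random(seed)
    return max((run_once(rng) for _ in range(n_runs)), key=lambda r: r[1])
```

With ten reference images and five distortions each, `split_by_reference` with the default fraction places the fifty distorted images into groups of ten and forty. Repeating a stochastic optimizer and keeping the best run reduces the influence of an unlucky initialization.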
In the GA, the population size and the number of generations were set to 100. The best solutions on the four benchmark databases were provided by the following equations:
In the case of the PS after 100 runs, the following fusion-based metrics can be obtained:
Using the method of Lagarias et al. [
72], the following fusion metrics can be obtained: