Full-Reference Image Quality Assessment Based on an Optimal Linear Combination of Quality Measures Selected by Simulated Annealing

Digital images can be distorted or contaminated by noise in various steps of image acquisition, transmission, and storage. Thus, research into algorithms that can evaluate the perceptual quality of digital images consistently with human quality judgement is a hot topic in the literature. In this study, an image quality assessment (IQA) method is introduced that predicts the perceptual quality of a digital image by optimally combining several IQA metrics. To be more specific, an optimization problem is first defined using the weighted sum of a few IQA metrics. Subsequently, the optimal values of the weights are determined by minimizing the root mean square error between the predicted and ground-truth scores using the simulated annealing algorithm. The resulting optimization-based IQA metrics were assessed and compared to other state-of-the-art methods on four large, widely applied benchmark IQA databases. The numerical results empirically corroborate that the proposed approach is able to surpass other competing IQA methods.


Introduction
Nowadays, people increasingly communicate through media in the form of audio, video, and digital images. Therefore, image quality assessment (IQA) has found many applications and become a hot topic in the research community [1]. IQA methods evaluate the perceptual quality of digital images and support, among others, image enhancement [2], restoration [3], steganography [4], and denoising [5] algorithms. Further, IQA is also necessary for benchmarking many image processing and computer-vision algorithms [6][7][8]. In the literature, IQA is classified into two groups, i.e., subjective and objective IQA. Specifically, subjective IQA deals with the collection of users' quality ratings for a set of digital images, either in a laboratory [1] or in an online crowd-sourcing experiment [9]. An image's perceptual quality is expressed as a mean opinion score (MOS), which is the arithmetic mean of the individual quality scores. As a result, subjective IQA provides objective IQA with quality-labelled images that serve as training or test data [10]. Objective IQA, in turn, deals with algorithms and mathematical models that are able to predict the quality of a given image. Conventionally, objective IQA is divided into three classes [11] with respect to the availability of the reference (distortion-free) images: full-reference (FR) [12], reduced-reference (RR) [13], and no-reference (NR) [14]. As the names indicate, FR-IQA methods have full access to the reference images. In contrast, NR-IQA algorithms evaluate image quality without any information about the reference images [15], while RR-IQA algorithms have partial information about them.

Contribution
The development of objective FR-IQA algorithms can also involve fusion-based strategies that take already existing FR-IQA metrics and try to create a "super evaluator". Recently, many complex fusion-based approaches have been published in the literature [16][17][18][19]. The main contribution of this paper is also a fusion-based approach. Namely, we demonstrate that a linear combination of several already existing FR-IQA metrics, optimized with a simulated annealing (SA) algorithm using a root mean square error (RMSE) objective, is able to produce well-performing fusion-based FR-IQA metrics. To be more specific, a linear combination of 16 FR-IQA metrics is used in an optimization problem to select FR-IQA metrics and find their weights via an SA algorithm that minimizes the RMSE of the prediction. Unlike the approach of Oszust [20], we apply simulated annealing instead of a genetic algorithm to perform the fusion of FR-IQA metrics. Namely, simulated annealing usually achieves better results than basic genetic algorithms in the case of continuous function approximation, since the latter modify only one or two genes at a given location [21]. The proposed fusion-based metrics were evaluated on large, popular, and widely accepted IQA benchmark databases, such as LIVE [22], TID2013 [23], TID2008 [24], and CSIQ [25].

Organization
The rest of this paper is organized as follows. In Section 2, an overview of the current state of FR-IQA is given. Next, the proposed fusion-based metric is introduced in Section 3. Our experimental results, together with a description of the applied benchmark IQA databases, the evaluation environment, and the performance indices, are given in Section 4. Finally, conclusions are drawn in Section 5.

Literature Review
In this paper, we follow the classification of FR-IQA algorithms presented in [26]. To be specific, Ding et al. [26] categorized existing FR-IQA algorithms into five distinct classes, i.e., error visibility, structural similarity, information theoretic, learning-based, and fusion-based methods.
Error visibility methods measure a distance between the pixels of the distorted and the reference images to quantify perceptual quality degradation. The representative method of this class of FR-IQA is the mean squared error (MSE), i.e., the average squared difference between the reference and the distorted images [27]. Another well-known example is the peak signal-to-noise ratio (PSNR), which is commonly applied to assess the reconstruction quality of lossy compression codecs [28]. Although both MSE and PSNR have low computational costs and their physical meaning is clear and well understood, their predictions often mismatch subjective perceptions of visual quality.
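Since MSE and PSNR anchor this class of methods, a minimal sketch of both may be helpful; this is an illustrative Python fragment (not the paper's MATLAB environment), assuming NumPy image arrays and, by default, an 8-bit peak value of 255:

```python
import numpy as np

def mse(ref, dist):
    """Mean squared error between a reference and a distorted image (NumPy arrays)."""
    return np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)

def psnr(ref, dist, peak=255.0):
    """Peak signal-to-noise ratio in dB; 'peak' is the maximum possible pixel value."""
    e = mse(ref, dist)
    return float("inf") if e == 0 else 10.0 * np.log10(peak ** 2 / e)
```

A constant per-pixel error of 10 gray levels, for instance, yields an MSE of 100 regardless of image content, which illustrates why these measures can disagree with perceived quality.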
Structural similarity methods measure the similarity between the corresponding regions of the distorted and reference images using sliding windows and correlation measures. The representative and first-published method of this class is the structural similarity index (SSIM) [29], whose underlying idea has become extremely popular in the research community and has inspired many variants and applications [30]. For example, the wavelet domain structural similarity [31] carries out SSIM in the wavelet domain to quantify perceptual quality. This work was extended by Sampat et al. [32] into the complex wavelet domain. In [33], information content was utilized as weights in the pooling of local image quality scores. In contrast, Wang et al. [34] extended SSIM to multi-scale processing to improve perceptual quality estimation. Li and Bovik [35] elaborated an FR-IQA metric by taking the average of SSIMs computed over three different regions of an image, i.e., edges, textures, and smooth regions. Kolaman and Yadid-Pecht [36] proposed an extension of SSIM to color images by representing the red, green, and blue color channels with quaternions. Later, SSIM was also extended to hyperspectral images [37].
Information theoretic methods approach the FR-IQA task from the point of view of information communication. For example, Sheikh et al. [38,39] compared the information content of the reference and distorted images; namely, perceptual quality was quantified by how much information is shared between the reference and distorted images. In contrast, Larson and Chandler [25] classified image distortions as near-threshold and supra-threshold and elaborated a quality index for each of the two distortion types. Finally, the overall perceptual quality was determined based on the quality scores of the near-threshold and supra-threshold distortions.
As the terminology suggests, learning-based methods rely on a specific machine learning algorithm to create a quality model from training images. Next, the obtained quality model is tested on previously unseen images. For instance, Liang et al. [40] implemented a special convolutional neural network containing two paths, one for the reference image and the other for the distorted image. Further, this network was trained on 224 × 224-sized image patches sampled simultaneously from the reference and distorted images. As a consequence, the perceptual quality of a distorted image was estimated by the average score of the considered patches. Kim and Lee [41] devised a similar network, but it predicts a visual sensitivity map that is multiplied by an error map calculated directly from the reference and the distorted images to estimate perceptual image quality. Ahn et al. [42] further improved the idea of Kim and Lee [41] by implementing an end-to-end trained convolutional neural network with three inputs, i.e., reference image, distorted image, and spatial error map. Similar to [41], a distortion-sensitivity map was predicted from the inputs and was later multiplied by the spatial error map to give an estimation for the perceptual image quality. In contrast to the previously mentioned methods, Ding et al. [43] extracted a set of feature maps from the reference and the distorted images using the Sobel operator, log Gabor filter, and local pattern analysis. Subsequently, the extracted feature maps were compared, and from the resulting similarity scores a feature vector was compiled that was mapped onto perceptual quality scores with a trained support vector regressor. Tang et al. [44] took a similar approach, but the authors employed a different set of features (phase congruency maps [45], gradient magnitude maps, and log Gabor maps). Further, the similarity scores of the feature maps were mapped onto perceptual quality with a trained random forest regressor.
Fusion-based FR-IQA methods utilize existing FR-IQA metrics to create a new FR-IQA algorithm. First, Okarma [46] suggested the idea of combined methods; namely, the author proposed a combined metric using the product and power of MS-SSIM [34], VIF [38], and R-SVD [47]. This approach was developed further in [19], where the optimal exponents in the product were determined using MATLAB's fminsearch command. In [48], Oszust took a similar approach, but applied the scores of traditional FR-IQA metrics as predictor variables in a lasso regression. Instead of lasso regression, Yuan et al. [49] used kernel ridge regression in a similar layout. The work of Lukin et al. [50] exhibits the properties of both learning-based and fusion-based methods. Specifically, the authors created a training and a test set from the images of an IQA benchmark database. Next, the scores of several traditional FR-IQA metrics were used as image features, and a neural network was trained to estimate perceptual image quality. Amirshahi et al. [51] elaborated a special fusion-based FR-IQA metric relying on a pretrained convolutional neural network. Namely, the authors ran a reference-distorted image pair through an AlexNet [52] network and compared the activation maps with the help of a traditional FR-IQA metric. Next, the resulting scores were aggregated to obtain a single score for the perceptual image quality. Bakurov et al. [53] revisited the classical SSIM [29] and MS-SSIM [34] metrics by applying evolutionary and swarm intelligence optimization methods to find optimal hyperparameters for SSIM and MS-SSIM instead of the original settings. Fusion-based metrics have also been proposed for remote sensing images [54], stitched panoramic images [55], and 3D image quality assessment [18].
For more detailed studies on FR-IQA, we refer readers to the book by Xu et al. [56] and to the study by Pedersen and Hardeberg [57]. Further, Zhang et al. [58] provide an evaluation of several state-of-the-art FR-IQA algorithms on various IQA benchmark databases. Zhai and Min provided a comprehensive overview of classical algorithms in [59]. For the quality assessment of screen content images [60], Min et al. gave an overview in [61].

Proposed Method
As already mentioned, an FR-IQA metric should deliver perceptual quality scores consistent with human judgement using both the distorted and reference images. Let us express the aggregated decision of $n$ different FR-IQA metrics by a weighted sum:

$$Q_p = \sum_{i=1}^{n} \alpha_i q_i, \tag{1}$$

where $q_i$ ($i = 1, 2, \ldots, n$) stands for the quality score provided by the $i$-th FR-IQA metric. Further, $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ is a real vector of weights whose values are found via an optimization procedure to ensure an effective fusion of FR-IQA metrics. Namely, the optimization-based fusion was carried out in our study using $n = 16$ open-source FR-IQA metrics: FSIM [62], FSIMc [62], GSM [63], IFC [38], IFS [64], IW-SSIM [33], MAD [25], MS-SSIM [34], NQM [65], PSNR, RFSIM [66], SFF [67], SR-SIM [12], SSIM [29], VIF [39], and VSI [68]. In the literature, Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), Kendall's rank-order correlation coefficient (KROCC), and the root mean square error (RMSE) are often considered to characterize the consistency between the ground-truth quality scores of an IQA benchmark database and the quality scores predicted by an FR-IQA metric [22]. Of these performance indices, RMSE was applied as the objective function in the proposed optimization-based metric.

In the offline optimization stage, the proposed fusion-based metric is obtained using 20% of the reference images with their corresponding distorted counterparts. Next, a simulated annealing (SA) optimization process selects FR-IQA metrics and assigns weights to them. The resulting metric is codenamed LCSA-IQA to reflect the fact that it is a linear combination of selected FR-IQA metrics whose weights were assigned using simulated annealing. Formally, the optimization problem can be written as

$$\boldsymbol{\alpha}^{*} = \arg\min_{\boldsymbol{\alpha}} \, RMSE(\mathbf{Q}_p, \mathbf{S}), \tag{2}$$

where $\mathbf{Q}_p$ is a vector containing the quality scores of a set of images obtained by Equation (1) and $\mathbf{S}$ contains the corresponding ground-truth scores.
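The weighted-sum fusion of Equation (1) and the RMSE objective of Equation (2) can be expressed in a few lines; the following is an illustrative Python sketch (the paper's implementation is in MATLAB), where `Q` is assumed to be an m×n matrix whose columns hold the scores of the n FR-IQA metrics for m images:

```python
import numpy as np

def fused_score(Q, alpha):
    """Equation (1): weighted sum of per-metric quality scores.
    Q: (m, n) array, column i holds the scores of the i-th FR-IQA metric;
    alpha: (n,) weight vector."""
    return np.asarray(Q) @ np.asarray(alpha)

def rmse(pred, gt):
    """Root mean square error between predicted and ground-truth scores."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))
```

An optimizer then only needs `rmse(fused_score(Q, alpha), S)` as the objective function of the weight vector `alpha`.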
Further, prior to the calculation of the RMSE, a non-linear regression is also applied [22], since a non-linear relationship exists between the ground-truth and predicted scores. Formally, it can be written as

$$Q = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\beta_2 (Q_p - \beta_3)}} \right) + \beta_4 Q_p + \beta_5, \tag{3}$$

where $\beta_1, \ldots, \beta_5$ stand for the parameters of the regression model, and $Q$ and $Q_p$ are the fitted and predicted scores, respectively. Since we use four large, widely accepted IQA benchmark databases in this paper, i.e., LIVE [22], TID2013 [23], TID2008 [24], and CSIQ [25], four optimization-based fusion FR-IQA metrics are proposed, one for each database. To this end, approximately 20% of the reference images were randomly selected from a given benchmark IQA database. More precisely, $\mathbf{Q}$ and $\mathbf{S}$ were compiled from those distorted images whose reference counterparts were randomly selected. Although 20% is a common choice for parameter setting in the literature [69,70], some researchers have applied 30% [62] or 80% [71] for parameter tuning. However, we evaluate all the fusion-based metrics on all the databases to demonstrate results independent of the database. Next, the optimization problem described by Equation (2) was solved to determine the $\alpha_i$ weights of Equation (1). Since the number of possible solutions increases exponentially with the number of considered FR-IQA metrics, simulated annealing (SA) [72,73] was used to solve the above-described optimization task. SA is a probabilistic optimization technique for estimating the global optimum of a given function. The stochastic nature of the algorithm enables the use of non-linear objective functions for which many other methods do not operate well. SA was inspired by the physical process of heating a material and then slowly decreasing the temperature to eliminate imperfections from the material; hence, minimizing the system's energy is the main goal. More precisely, SA randomly generates a new point at each iteration.
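The five-parameter logistic mapping of Equation (3) can be sketched directly; this is an illustrative Python version (fitting the β parameters to data, e.g. with a least-squares routine, is omitted here):

```python
import numpy as np

def logistic_map(qp, beta):
    """Five-parameter logistic regression of Equation (3):
    Q = b1*(1/2 - 1/(1 + exp(b2*(Qp - b3)))) + b4*Qp + b5."""
    b1, b2, b3, b4, b5 = beta
    qp = np.asarray(qp, dtype=float)
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (qp - b3)))) + b4 * qp + b5
```

Note that with $\beta_1 = \beta_2 = \beta_3 = \beta_5 = 0$ and $\beta_4 = 1$ the mapping reduces to the identity, so the fitted curve can only improve the agreement between predicted and ground-truth scores.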
The distance of the new point from the current point, i.e., the extent of the search, is determined by a probability distribution with a scale proportional to the temperature. The algorithm accepts all new points that reduce the objective, but points that increase the objective can also be accepted with a certain probability. This property prevents SA from becoming stuck in local minima in early iterations. In our implementation, SA was performed using MATLAB R2020a with the Global Optimization Toolbox, using $\alpha_i = 0$ for $i = 1, 2, \ldots, n$ as the initial point and defining no lower or upper bounds for the method. After 100 runs of SA, the best solution, $\boldsymbol{\alpha}_d^{best}$, was selected, where $d$ denotes the database from which 20% of the reference images were chosen randomly.
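For illustration, a minimal SA loop with Gaussian perturbations, Metropolis acceptance, and geometric cooling might look as follows. This Python sketch uses hypothetical hyperparameters (`n_iter`, `t0`, `step`, and the cooling rate) and is not the MATLAB Global Optimization Toolbox routine used in the paper:

```python
import numpy as np

def simulated_annealing(objective, x0, n_iter=5000, t0=1.0, step=0.1, seed=0):
    """Minimal SA sketch: Gaussian proposal, Metropolis acceptance, geometric cooling."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = objective(x)
    best, fbest = x.copy(), fx
    t = t0
    for _ in range(n_iter):
        # Proposal scale shrinks with the temperature, narrowing the search over time.
        cand = x + rng.normal(scale=step * max(t, 1e-6), size=x.shape)
        fc = objective(cand)
        # Always accept improvements; accept worse points with probability exp(-df/t).
        if fc < fx or rng.random() < np.exp(-(fc - fx) / max(t, 1e-12)):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x.copy(), fx
        t *= 0.999  # geometric cooling schedule
    return best, fbest
```

Calling this with `objective = lambda a: rmse(fused_score(Q_train, a), S_train)` and a zero initial weight vector mirrors the setup described above, up to the toolbox-specific annealing and cooling functions.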
At the end of the SA optimization processes using the LIVE [22], TID2013 [23], TID2008 [24], and CSIQ [25] databases, the following FR-IQA metrics were obtained, which are codenamed LCSA, referring to the fact that they are linear combinations of FR-IQA measures selected by simulated annealing: The corresponding β vectors are as follows:

Results
In this section, our experimental results are presented. First, the applied IQA benchmark databases and evaluation protocol are described in Section 4.1. Next, Section 4.2 presents a comparison to other competing state-of-the-art methods on four large IQA benchmark databases, i.e., LIVE [22], TID2013 [23], TID2008 [24], and CSIQ [25].

Applied IQA Benchmark Databases and Evaluation Protocol
The main properties of the applied IQA benchmark databases are outlined in Table 1. These databases consist of a set of reference images whose visual quality is considered perfect and flawless. Further, distorted images are generated artificially from the reference images using different distortion types (e.g., JPEG compression noise, JPEG2000 compression noise, salt-and-pepper noise, motion blur, Gaussian noise, Poisson noise, etc.) at different distortion levels. Figure 3 depicts the empirical MOS distributions of the applied benchmark databases.

In the literature, PLCC, SROCC, and KROCC are widely used and accepted to characterize the performance of FR-IQA methods. They are measured between the ground-truth scores of an IQA benchmark database and the predicted scores. Moreover, prior to the calculation of the PLCC, a non-linear regression is also applied [22], since a non-linear relationship exists between the ground-truth and predicted scores. This non-linear relationship was also defined by Equation (3), where $Q$ and $Q_p$ are the fitted and predicted scores, respectively. The PLCC between vectors $\mathbf{x}$ and $\mathbf{y}$ of length $m$ is defined as

$$PLCC(\mathbf{x}, \mathbf{y}) = \frac{\bar{\mathbf{x}}^{T} \bar{\mathbf{y}}}{\|\bar{\mathbf{x}}\| \, \|\bar{\mathbf{y}}\|},$$

where $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are the mean-subtracted versions of vectors $\mathbf{x}$ and $\mathbf{y}$, respectively. On the other hand, the SROCC can be defined as

$$SROCC(\mathbf{x}, \mathbf{y}) = 1 - \frac{6 \sum_{i=1}^{m} d_i^2}{m (m^2 - 1)},$$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$, the $i$-th entries of vectors $\mathbf{x}$ and $\mathbf{y}$, respectively. In contrast, the KROCC uses the number of concordant pairs ($m_c$) and the number of discordant pairs ($m_d$) between vectors $\mathbf{x}$ and $\mathbf{y}$ and is defined as

$$KROCC(\mathbf{x}, \mathbf{y}) = \frac{m_c - m_d}{\frac{1}{2} m (m - 1)}.$$

As already mentioned, the proposed fusion-based metrics were implemented using MATLAB R2020a and its Global Optimization Toolbox. The computer configuration applied in our experiments is summarized in Table 2.
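The three performance indices can be sketched directly from their definitions; the following illustrative Python fragment assumes tie-free score vectors (the rank-difference form of the SROCC is exact only without ties):

```python
import numpy as np

def plcc(x, y):
    """Pearson's linear correlation of mean-subtracted vectors."""
    xm, ym = x - x.mean(), y - y.mean()
    return float(xm @ ym / (np.linalg.norm(xm) * np.linalg.norm(ym)))

def srocc(x, y):
    """Spearman's rank-order correlation (rank-difference form, assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    m = len(x)
    d = rx - ry
    return float(1.0 - 6.0 * np.sum(d * d) / (m * (m * m - 1)))

def krocc(x, y):
    """Kendall's rank-order correlation from concordant/discordant pair counts."""
    m = len(x)
    mc = md = 0
    for i in range(m):
        for j in range(i + 1, m):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                mc += 1
            elif s < 0:
                md += 1
    return (mc - md) / (0.5 * m * (m - 1))
```

All three indices lie in [-1, 1], with values near 1 indicating that the predicted scores preserve the ordering (SROCC, KROCC) or linear trend (PLCC) of the ground truth.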

Comparison to the State-of-the-Art
In this subsection, the proposed fusion-based metrics are compared to several state-of-the-art FR-IQA metrics whose original source codes were made publicly available by the authors. Moreover, we reimplemented the fusion-based SSIM-CNN [51] method in MATLAB R2020a (available at: https://github.com/Skythianos/SSIM-CNN (accessed on 12 May 2022)). The PLCC, SROCC, and KROCC performance comparisons of the proposed fusion-based FR-IQA metrics with the state-of-the-art are summarized in Tables 3 and 4. Specifically, Table 3 demonstrates the results on LIVE [22] and TID2013 [23], while Table 4 contains the results obtained on the TID2008 [24] and CSIQ [25] databases. The obtained results clearly show that the proposed LCSA metrics are able to outperform the state-of-the-art. Specifically, the LCSA metric that was parameter-tuned on database d always delivers the highest correlation values, while another LCSA metric not parameter-tuned on database d usually provides the second-best results.

Table 3. PLCC, SROCC, and KROCC performance comparison of the proposed fusion-based FR-IQA metrics on the LIVE and TID2013 databases with the state-of-the-art. The best results are typed in bold, and the second-best results are underlined.

Table 5 illustrates the direct and weighted averages of the correlation values measured on LIVE [22], TID2013 [23], TID2008 [24], and CSIQ [25]. From the results of the direct averages, it can be clearly seen that the proposed LCSA2 and LCSA4 provide the best results in two out of three performance indices, while LCSA3 is able to produce the second-best KROCC value. The results of the weighted averages are biased towards those FR-IQA measures that perform well on TID2013 [23], since it is the largest of the applied benchmark databases. Accordingly, LCSA2 is the best-performing method in this respect because it provides the best results for SROCC and KROCC. Further, LCSA4 delivers the second-best PLCC and KROCC values, while LCSA3's performance is equivalent to that of LCSA4 in terms of SROCC and KROCC.

Table 5. PLCC, SROCC, and KROCC performance comparison of the proposed fusion-based FR-IQA metrics with the state-of-the-art. The best results are typed in bold, and the second-best results are underlined.

In the following, we examine the performance of the proposed and the other state-of-the-art methods on the individual distortion types of the applied IQA benchmark databases. The distortion types and their abbreviations used by the databases are summarized in Table 6. Further, Tables 7-10 contain detailed results on the different distortion types of LIVE [22], TID2013 [23], TID2008 [24], and CSIQ [25], respectively. To be more specific, the SROCC values are given for each individual distortion type.

Table 6. Distortion types used in the applied benchmark IQA databases (LIVE [22], TID2013 [23], TID2008 [24], and CSIQ [25]).

Conclusions
In this study, we presented a novel fusion-based FR-IQA metric using simulated annealing. Specifically, an optimization problem was solved based on the weighted sum of several FR-IQA metrics by minimizing the root mean squared error between the predicted and ground-truth perceptual quality scores. The evaluation of the proposed fusion-based metrics on four large publicly available and widely accepted IQA benchmark databases empirically corroborated that the proposed metrics are able to produce competitive results compared to the state-of-the-art in terms of various performance indices, such as PLCC, SROCC, and KROCC. Future research could involve other optimization techniques and their combination for improved perceptual quality prediction. Another direction is the generalization of the proposed method for other types of media.