1. Introduction
High dynamic range (HDR) imaging, as a popular image enhancement technology, aims to recover detail in the bright and dark regions of a scene by fusing multiple low dynamic range (LDR) images with varying exposure levels [1]. Consequently, HDR images can capture almost the full brightness range of natural scenes, and have attracted attention across multimedia signal processing, such as HDR compression, streaming and display [2]. Moreover, because HDR display devices are not yet widespread, tone-mapping operators (TMOs) have been successively developed to enable the visualization of HDR images on traditional LDR displays; they reduce the brightness dynamic range of an image as much as possible without destroying the original structure of the scene [3]. Unfortunately, no TMO is universally suitable, so visual quality degradations (e.g., detail loss, especially in bright and dark regions, and color unnaturalness) are inevitably introduced into tone-mapped images (TMIs) [4]. Accurately distinguishing the generalization ability of different TMOs thus makes objective image quality assessment (IQA) of TMIs one of the key challenges in optimizing the HDR processing pipeline.
Up to now, a large number of perceptual IQA methods designed for LDR images have been proposed [5,6]; they can usually be divided into three categories: full-reference (FR), reduced-reference (RR) and no-reference/blind (NR). FR methods are guided by a distortion-free reference image; RR methods require only part of the reference image, while NR/blind methods require none. Among the classical FR-IQA methods, the structural similarity method (SSIM) [5] is one of the most influential in the academic community, measuring the difference between the reference and distorted images in terms of brightness, contrast and structure. Evidently, these IQA methods for LDR images are not applicable to TMIs because the reference and distorted images have different dynamic ranges. To solve this problem, Yeganeh et al. [7] proposed the tone-mapped quality index (TMQI), which combines multi-scale structural fidelity and statistical naturalness in the grayscale domain. Although TMQI outperforms the existing IQA methods designed for LDR images at predicting the quality of TMIs, there is still large room for improvement, such as perceptual analysis in the chrominance domain. Afterwards, several improved versions of TMQI were put forward. Ma et al. [8] revised the feature components of TMQI to improve the accuracy of evaluation. Nasrinpour et al. [9] integrated the importance of salient regions into TMQI to further improve performance. Besides, Nafchi et al. [10] extended the feature similarity (FSIM) [11] method to form the feature similarity index for tone-mapped images (FSITM). Song et al. [12] used exposure-condition segmentation and extracted perceptual features to predict the quality of TMIs. Unfortunately, since the reference HDR images are often unavailable in practice, the above FR-IQA methods designed for TMIs are prone to fail despite their strong performance on benchmark TMI databases.
Obviously, developing blind IQA (BIQA) methods is more challenging than developing FR-IQA methods due to the lack of a reference image. Generally, most BIQA methods designed for ordinary 2D images (2D-BIQA) follow a supervised learning framework: several quality-aware features are extracted from images, and quality regression is conducted via a model trained by machine learning or deep learning algorithms [13,14,15,16]. Among these quality-aware features, natural scene statistics (NSS) based features play a significant role in evaluating 2D images degraded by single or multiple distortions. Moorthy et al. [13] presented the distortion identification-based image verity and integrity evaluation (DIIVINE) method by exploring the statistics of sub-band coefficients obtained from steerable pyramid decomposition. Zhang et al. [14] extracted additional complex phase statistics on the basis of DIIVINE. Saad et al. [15] and Mittal et al. [16] developed the BLIINDS-II and BRISQUE methods using the NSS of discrete cosine transform (DCT) coefficients and mean subtracted contrast normalized (MSCN) coefficients, respectively. Moreover, some aesthetic IQA methods exist. For example, Sun et al. [17] proposed an alternative set of features for aesthetic estimation based on a visual complexity principle, extracting visual complexity properties from an input image in terms of composition, shape, and distribution. Mavridaki et al. [18] introduced five feature vectors describing a photo's simplicity, colorfulness, sharpness, pattern and composition to perform aesthetic quality evaluation. Although these 2D-BIQA and aesthetic IQA methods perform well on 2D images and aesthetic-related images affected by common distortion types, e.g., blockiness, blurriness, noise and aesthetic degradation, a large gap remains when predicting the quality of TMIs, which are dominated by detail loss (especially in bright and dark regions) and color unnaturalness. The reasons for this performance gap can be summarized in two aspects. First, NSS based features are extracted from the entire image or sub-band and can usually be regarded as global features, so relevant local features (e.g., local structure and texture information) are ignored; remarkably, the detail loss of TMIs caused by structural degradation is mainly reflected in the bright and dark regions. Second, the extracted NSS features are based on the luminance component of the image, missing the crucial role of color information in the human visual system (HVS). Therefore, it is necessary to explore the special perceptual characteristics of TMIs to improve the performance of IQA methods.
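For concreteness, the MSCN coefficients used by BRISQUE [16] can be computed as in the following minimal numpy sketch; the 7 × 7 Gaussian window and the stabilizing constant C here are illustrative choices, not necessarily those of the original method:

```python
import numpy as np

def gaussian_kernel(size=7, sigma=7/6):
    """Normalized 2-D Gaussian weighting window."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def mscn(image, C=1.0):
    """Mean Subtracted Contrast Normalized coefficients (BRISQUE-style):
    (I - mu) / (sigma + C), with mu, sigma from a local Gaussian window."""
    img = np.asarray(image, dtype=np.float64)
    k = gaussian_kernel()
    pad = k.shape[0] // 2
    padded = np.pad(img, pad, mode='reflect')
    h, w = img.shape
    mu = np.empty_like(img)
    mu2 = np.empty_like(img)
    for i in range(h):                      # direct convolution; fine for a sketch
        for j in range(w):
            win = padded[i:i + 7, j:j + 7]
            mu[i, j] = (k * win).sum()
            mu2[i, j] = (k * win**2).sum()
    sigma = np.sqrt(np.maximum(mu2 - mu**2, 0.0))
    return (img - mu) / (sigma + C)
```

NSS-based BIQA methods then fit a parametric distribution (e.g., a generalized Gaussian) to these coefficients and use its parameters as features.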
Actually, some BIQA methods specialized for TMIs (TM-BIQA) have been presented in the past three years [19,20,21,22,23,24,25,26]. Gu et al. [19] designed a blind tone-mapped quality index (BTMQI) by analyzing information fidelity, naturalness and structure. Considering that the brightest and darkest regions of TMIs are prone to detail loss, Jiang et al. [20] proposed a blind TM-IQA (BTMIQA) method by combining detail features with naturalness and aesthetic features. Kundu et al. [21] utilized NSS features from the spatial domain and the HDR gradient domain to form the HIGRADE method. Yue et al. [22] extracted multiple quality-sensitive features, including colorfulness, naturalness and structure, to construct a TM-BIQA method. Jiang et al. [23] proposed a blind quality evaluator of tone-mapped images (BLIQUE-TMI) by considering the impact of visual information, local structure and naturalness on the HVS, where the first two kinds of features are extracted via sparse representation and the others are derived from color statistics. Zhao et al. [24] proposed a method mainly based on local phase congruency, statistical characteristics of edge maps and the opponent color space to measure image sharpness, halo effect and chromatic distortion, respectively. Chi et al. [25] designed a new blind TM-IQA method with image segmentation and visual perception, in which a feature clustering scheme was proposed to quantify the importance of features. Fang et al. [26] extracted features from a global statistics model to characterize naturalness, and local texture features to capture quality degradation. However, these TM-BIQA methods still have the following limitations: (1) color information is completely ignored in the BTMQI and HIGRADE methods, and the aesthetic quality of TMIs cannot be evaluated by the BLIQUE-TMI method; (2) for the BTMIQA method, the extracted local features are too simple to characterize visual perception for different brightness regions (DB-regions) in TMIs, and the detail loss phenomenon in normally exposed regions is also omitted.
Towards a more accurate evaluation of TMIs, a blind TMI quality assessment method based on regional sparse response and aesthetics, denoted RSRA-BTMI, is proposed in this paper. The basic idea of RSRA-BTMI is to mine quality-aware features from the imaging and viewing properties of TMIs, i.e., we focus on exploring the specific perceptual characteristics of DB-regions in TMIs, extracting both local and global features to portray detail loss and color unnaturalness. In summary, the main contributions of this paper are as follows.
- (1) Inspired by the viewing properties in visual physiology, i.e., the quality of images is perceived by the HVS from global to local regions, multi-dictionaries are specially designed for the DB-regions of TMIs and for entire TMIs via dictionary learning. Moreover, the self-built TMI training dataset for dictionary learning in this study is made available for further research.
- (2) Each region is sparsely represented to obtain the corresponding sparse atom activity describing the regional visual information of TMIs, which is closely related to visual activity in the receptive fields of simple cells. In addition, a regional feature fusion strategy based on entropy weighting is presented to aggregate these local features.
- (3) Motivated by the fact that the HVS prefers images with saturated and natural color, relevant aesthetic features, e.g., contrast, color fidelity, color temperature and darkness, are extracted for global chrominance analysis. Besides, the residual information of entire TMIs is fully utilized to simulate the global perception of the HVS, and the NSS based features extracted from the residual images are combined with the aesthetic features to form the final global features.
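The entropy-weighted regional fusion mentioned in contribution (2) can be sketched as follows. This is a hypothetical illustration of the idea only: the histogram binning, the weight normalization and the function names are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def region_entropy(region, bins=256):
    """Shannon entropy of a region's gray-level histogram (values in [0, 1])."""
    hist, _ = np.histogram(region, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_weighted_fusion(features, regions):
    """Fuse per-region feature vectors with entropy-derived weights.

    features: list of 1-D feature vectors, one per region
    regions:  list of pixel arrays (same order as `features`)
    Regions with richer content (higher entropy) contribute more.
    """
    ent = np.array([region_entropy(r) for r in regions])
    w = ent / max(ent.sum(), 1e-12)          # normalized weights
    return sum(wi * f for wi, f in zip(w, np.asarray(features, dtype=float)))
```

A flat, low-information region thus receives near-zero weight, while detailed regions dominate the fused local feature vector.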
The rest of the paper is organized as follows: the proposed RSRA-BTMI method is described in Section 2; the performance comparison results of RSRA-BTMI and other BIQA methods are presented in Section 3; finally, the conclusion is given in Section 4.
3. Experimental Results and Discussion
To verify the performance of the proposed RSRA-BTMI method, the ESPL-LIVE HDR database [34] was used to compare the proposed method with existing state-of-the-art BIQA methods. The database was generated by three different types of HDR image processing algorithms: TMO, multi-exposure fusion and post-processing. The images processed by TMOs and their corresponding subjective scores were utilized in the experiment. The basic composition of TMIs in the ESPL-LIVE HDR database is shown in Table 1; it contains a total of 747 TMIs degraded by TMOs.
To validate the accuracy of the method, 80% of the image samples in the database were selected as the training set to train a TM-IQA model, which was then used to predict the quality of the remaining 20% of the samples. The scenes of the training and testing sets were independent of each other. Then, to evaluate whether the method is statistically consistent with visual perception, the predicted scores must be compared with subjective ratings. Following the objective IQA standard proposed by the Video Quality Experts Group (VQEG), the Pearson linear correlation coefficient (PLCC), Spearman rank-order correlation coefficient (SROCC) and root mean squared error (RMSE) were employed to validate this consistency. Empirically, a method correlates well with subjective scores if PLCC and SROCC are close to 1 and RMSE is close to 0. In addition, to obtain reliable results for the proposed RSRA-BTMI method, the above procedure was repeated 1000 times with randomly divided training and testing sets, and the median value of each performance index over the 1000 random trials is reported as the final result.
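The three VQEG criteria can be computed directly from their definitions; a minimal numpy sketch (note that ties in the SROCC ranks are not rank-averaged here, unlike a full Spearman implementation):

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

def srocc(x, y):
    """Spearman rank-order correlation: PLCC of the ranks (ties not averaged)."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v))).astype(float)
    return plcc(rank(x), rank(y))

def rmse(x, y):
    """Root mean squared error between predictions and subjective scores."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(((x - y)**2).mean()))
```

To reproduce the protocol above, these criteria would be computed on each of the 1000 random 80/20 splits and the medians reported.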
3.1. Parameter Setting and Feature Analysis of the Proposed RSRA-BTMI Method
As can be seen from the feature extraction in Section 2, several parameters need to be set. The size of the presegmented TMI blocks for dictionary learning affects what each block contains. Specifically, the larger the block, the greater the probability that it contains content of different luminance, which makes the block-based regional subset partition more difficult and harms both multi-dictionary learning and accurate extraction of the sparse feature vector. Conversely, smaller blocks increase the complexity and lower the efficiency of the proposed method. Therefore, the block size is set to a moderate value of 8 × 8, and the dictionary size m is set to 128. Since m also determines the size of the final feature vector, the feature dimension extracted from each region in the sparse domain is 128.
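The block partition described above can be sketched as follows. The dark/bright thresholds and subset names here are illustrative assumptions, since the exact luminance criteria for the DB-regions are defined in Section 2:

```python
import numpy as np

def partition_blocks(img, bs=8, t_dark=0.3, t_bright=0.7):
    """Split a luminance image (values in [0, 1]) into bs x bs blocks and
    group them into dark / normal / bright subsets by mean luminance.
    Each block is flattened to a bs*bs vector for dictionary learning."""
    h, w = img.shape
    subsets = {'dark': [], 'normal': [], 'bright': []}
    for i in range(0, h - h % bs, bs):
        for j in range(0, w - w % bs, bs):
            blk = img[i:i + bs, j:j + bs]
            m = blk.mean()
            key = 'dark' if m < t_dark else 'bright' if m > t_bright else 'normal'
            subsets[key].append(blk.reshape(-1))
    return subsets
```

Each subset would then be used to learn its own dictionary (e.g., of size m = 128 atoms), giving the multi-dictionaries of the proposed method.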
As described in Section 2, several types of features were extracted in this work. The sparse atom activity based on regional entropy weighting F1 and the auxiliary statistics based on the global reconstruction residual F2 represent the regional sparse response features in the sparse domain. Contrast F3, color fidelity F4, color temperature F5 and darkness F6 constitute the aesthetic features. Actually, most components of the sparse feature vector were zero, and a non-zero component indicated that the sample TMI had a corresponding response in the pretrained dictionary prototype. From a biological point of view, the mammalian visual system contains a series of visual neurons that encode stimuli sparsely: when a specific external stimulus is received, the information it carries can be correctly perceived as long as a small number of corresponding neurons respond. Therefore, the sparse representation coefficients based on multi-dictionaries characterize the neuron states under a particular stimulus: non-zero positions indicate that a neuron receives the stimulus, and zero positions indicate that it does not. The sparse decomposition of an image is thus a sparse response of neurons to a specific stimulus. A TMI to be assessed is transformed into sparse coefficients, and the sparsity characteristics of these coefficients contain the essential features of the TMI; feature extraction in the sparse domain is therefore more perceptually meaningful than using the original image pixels. A larger SCcoeff-l indicates that more stimuli are received. To perceive global distortion, the global reconstruction residual statistics feature F2 was extracted to assist F1. The aesthetic features F3, F4, F5 and F6 were also considered because color distortion is not negligible in TMIs.
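As a toy illustration of this sparse response, the following sketch codes blocks against a dictionary with a simple orthogonal matching pursuit and counts how often each atom is activated. It is a stand-in for the pretrained multi-dictionaries and the actual sparse coding procedure defined in Section 2:

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit: approximate x with at most k atoms of D.
    D: (n, m) dictionary with unit-norm columns; returns coefficients (m,)."""
    n, m = D.shape
    residual = np.asarray(x, float).copy()
    support, coef = [], np.zeros(m)
    for _ in range(k):
        idx = int(np.argmax(np.abs(D.T @ residual)))     # best-matching atom
        if idx not in support:
            support.append(idx)
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol               # re-fit on the support
        if np.linalg.norm(residual) < 1e-10:
            break
    coef[support] = sol
    return coef

def atom_activity(D, blocks, k=6):
    """Fraction of blocks that activate each atom (sparse atom activity)."""
    act = np.zeros(D.shape[1])
    for x in blocks:
        act += (np.abs(omp(D, x, k)) > 1e-8)
    return act / max(len(blocks), 1)
```

The resulting activity histogram plays the role of the per-region "neuron response" statistics described above; the sparsity level k is an assumed parameter.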
To analyze the feature contributions, the performance of each type of feature was separately evaluated on the ESPL-LIVE HDR database. In addition, the combined contribution of F1 and F2 in the sparse domain was evaluated to confirm the validity of the proposed features, as was the combination of the aesthetic features F3, F4, F5 and F6. PLCC, SROCC and RMSE were used as the performance criteria. The results are shown in Table 2. Each type of feature shows good performance alone, and better performance is achieved when the features are combined, suggesting that the proposed features are complementary to each other.
From the analyses in Section 2, it is known that SCcoeff-g has less effect on sparse reconstruction, but whether it can distinguish high-quality from poor-quality images remained to be validated. Following the same sparse atom activity feature extraction process of Section 2, the sparse atom activity statistics of different portions, namely SCcoeff-g and the combination of SCcoeff-l and SCcoeff-g, were used to measure quality assessment performance. In Table 3, SCcoeff-lg denotes the combination of SCcoeff-l and SCcoeff-g. Table 3 lists the three types of activity features: SCcoeff-g, SCcoeff-lg and SCcoeff-l. It can be found that SCcoeff-g and SCcoeff-lg also exhibit good quality discrimination performance, even exceeding that of methods such as BTMQI, as shown later. Based on this comparison, SCcoeff-l was selected as the final fusion feature in the sparse domain.
In addition, to verify the advantage of multi-dictionaries in the proposed RSRA-BTMI method, Table 4 lists an experimental comparison of the single dictionary and multi-dictionaries. In Table 4, the performance obtained by combining multi-dictionaries with aesthetic features (denoted 'M + A') was better than that obtained by combining the single dictionary with aesthetic features ('S + A'). This is mainly because the multi-dictionaries take more account of the different characteristics that HDR images exhibit in DB-regions after the TM process; together with the aesthetics, they better perceive the detail loss in the DB-regions and the color unnaturalness.
To clearly show the high correlation of the aesthetic features with subjective scores, a quality prediction model was trained using the aesthetic features alone. With this model, the aesthetic features of differently distorted TMIs were used to predict quality; the results are shown in Figure 8. It can be found that the more natural a TMI is, the higher its predicted quality value (i.e., Q), accompanied by a higher MOS value.
3.2. Influence of Training Set Sizes
To study the influence of different training set sizes on quality prediction, the PLCC and SROCC values obtained with different training sets were analyzed, as shown in Table 5. The training set size was varied from 10% to 90%, and the following conclusions can be drawn from the results in Table 5: (1) as the training set grows, the PLCC and SROCC values increase gradually, which is consistent with the behavior of existing learning-based BIQA methods; and (2) when the training set was less than 20%, the performance dropped significantly, but it still exceeded that of existing methods such as BTMQI, shown in Table 6.
3.3. Feature Selection
Since the full 181-dimensional feature set may cause overfitting, an experiment was conducted to eliminate redundancy from the features. Random forest (RF) can estimate the importance of features, so it is well suited to guide feature selection. Specifically, RF was used to predict the importance of the features extracted on the ESPL-LIVE HDR database, as shown in Figure 9 [23]. Different features have different importance. To determine the best feature dimension, quality prediction models were built with different feature dimensions and their performance was evaluated. As shown in Figure 10, the PLCC and SROCC performance was best when the feature dimension was 56. For brevity, the feature set after importance selection is denoted 'Fc' in the following description.
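The importance-guided selection can be sketched as follows; the importance vector and the evaluation callback are assumed inputs (e.g., RF feature importances and a cross-validated SROCC, respectively), not part of this snippet:

```python
import numpy as np

def select_by_importance(X, importances, d):
    """Keep the d most important feature columns, preserving column order."""
    keep = np.sort(np.argsort(importances)[::-1][:d])
    return X[:, keep], keep

def best_dimension(X_train, evaluate, importances, dims):
    """Sweep candidate feature dimensions and return the best-scoring one.
    `evaluate` is a user-supplied scoring callback on the reduced matrix."""
    scores = {d: evaluate(select_by_importance(X_train, importances, d)[0])
              for d in dims}
    return max(scores, key=scores.get), scores
```

In this paper's setting, sweeping dims over candidate sizes and scoring each reduced set would recover the reported optimum of 56 dimensions.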
3.4. Overall Performance Comparison
To prove the effectiveness of the proposed RSRA-BTMI method, it was compared with existing advanced BIQA methods. Since the ESPL-LIVE HDR database does not provide the original HDR reference images, FR-IQA methods designed for TMIs could not be applied to this database directly, so the proposed RSRA-BTMI method was not compared with them. Table 6 shows the performance comparison between the proposed RSRA-BTMI method and two types of existing IQA methods. The first type comprises 2D-BIQA methods specialized for ordinary LDR images based on natural scene statistical features, including C-DIIVINE [14], DIIVINE [13], BLIINDS-II [15], BRISQUE [16] and OG [39]. The other type comprises methods specifically designed for TM-BIQA, including BTMQI [19], HIGRADE [21], Yue's [22], BTMIQA [20], BLIQUE-TMI [23] and Chi's [25].
From Table 6, it can be found that the TM-BIQA methods far outperform the 2D-BIQA methods at TMI quality assessment, because the distortion types of TMIs differ from those of ordinary LDR images. In general, the distortions of LDR images include common ones such as coding distortion and Gaussian noise, whereas the distortion of a TMI is mainly reflected in color unnaturalness and detail loss, especially in its Breg and Dreg. Therefore, it is unsuitable to directly apply 2D-BIQA methods to evaluate TMI quality. First, since the 2D-BIQA methods C-DIIVINE, DIIVINE, BLIINDS-II, BRISQUE and OG only consider the distortions of ordinary LDR images, such as JPEG and JP2K compression, blur and white noise, their quality prediction performance on TMIs was generally poor, with PLCC and SROCC values of only about 0.530 and 0.523 at best. Second, the PLCC and SROCC values of the existing TM-BIQA methods were much higher than those of the 2D-BIQA methods. Among the TM-BIQA methods, BTMQI mainly considers the preservation of detail and structure in TMIs, but does not carefully consider color distortion, which has a great impact on TMI quality. HIGRADE likewise devotes more effort to structure and naturalness but neglects color distortion. The BTMIQA method mainly uses local entropy to perceive the detail loss of TMIs but omits the information loss in normally exposed regions. The other methods also leave room for improvement. The proposed method applies sparse perception with multi-dictionaries to extract the main features of a TMI's DB-regions, which not only reduces visual redundancy but also captures the human visual response to different regions. Moreover, it is clear that the proposed RSRA-BTMI method performed better than the other methods, mainly because it exploits sparse representation: combining the regional sparse response with aesthetics captures the detail loss, especially in the Breg and Dreg of TMIs, as well as the color distortion. Therefore, the proposed RSRA-BTMI method outperformed the existing methods and was consistent with the subjective perception of human vision. This is also attributable to the fact that the proposed RSRA-BTMI method simulates the distortion process of TM in the sparse domain.
Moreover, the performance after feature importance selection was also calculated for the proposed RSRA-BTMI method. Clearly, feature selection further improved its performance.
3.5. Discussion
Due to the particular imaging and viewing properties of TMIs, two kinds of perceptual factors ought to be considered in a TM-BIQA method, i.e., detail loss and color unnaturalness. In this paper, we proposed the RSRA-BTMI method by considering the impact of the DB-regions and the global region of a TMI on human subjective perception; its performance on the ESPL-LIVE HDR database was better than that of other competing 2D-BIQA and TM-BIQA methods. From the perspective of semantic invariance in the DB-regions of TMIs, multi-dictionaries were specially designed so that each brightness region could be sparsely represented to describe its regional visual information. Moreover, global reconstruction residual statistics were computed to identify high-frequency information loss and utilized as compensation features in the sparse domain. For color unnaturalness, several color-related metrics, such as contrast, color fidelity, color temperature and darkness, were analyzed and discussed carefully. As an efficient metric, the proposed RSRA-BTMI method can not only serve as a quality monitor in the end-to-end TMI processing pipeline, but also promote the development of relevant technologies, such as tone mapping, image enhancement and TMI denoising.
Although the proposed RSRA-BTMI method achieved excellent results in evaluating TMIs degraded by detail loss and color distortion, it still has limitations in some respects. First, several special distortions may appear in the actual imaging process, e.g., abnormal exposure, severe noise and indelible artifacts. Obviously, the introduction of artifacts or noise greatly increases the high-frequency information of the image, but this information does not belong to the positive detail component of the image and usually causes poor visual perception. Therefore, the presented global reconstruction residual statistics will produce the opposite result in this special case. Second, the proposed method applies a blocking operation on the TMI before multi-dictionary learning for the DB-regions. However, fixed-size blocks may contain regions of different brightness within one block, which is not conducive to multi-dictionary learning. Thus, a more reasonable and efficient way to broaden the application scope of the proposed RSRA-BTMI method is worth exploring.